arXivDaily: daily arXiv research digest, updated Monday through Friday
2605.10046 2026-05-12 cs.CV cs.LG cs.MA

PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows

Yufeng Zhu, Chunlei Shi, Yongchao Feng, Dan Niu

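Affiliations: Department of Automation, Southeast University; State Key Laboratory of Virtual Reality Technology and Systems, Beihang University

AI summary: This paper proposes PixelFlowCast, a precipitation nowcasting method that aims for efficient, high-accuracy short-term radar echo prediction without latent-space compression. It uses a two-stage framework: in the first stage a deterministic model produces coarse forecasts that capture the overall evolution trend; in the second stage KANCondNet extracts deep spatiotemporal features for precise conditional guidance, and a predictor based on Pixel Mean Flows generates high-quality forecasts in a few steps. Experiments show that PixelFlowCast outperforms existing mainstream methods in both prediction accuracy and inference efficiency, particularly on long-sequence forecasting.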

Comments 26 pages, 7 figures

Abstract

Precipitation nowcasting aims to forecast short-term radar echo sequences for extreme weather warning, where both prediction fidelity and inference efficiency are critical for real-world deployment. However, diffusion-based models, despite their strong generative capability, suffer from slow inference due to multi-step sampling trajectories, limiting their practical usability. Conditional Flow Matching (CFM) improves efficiency via straightened trajectories, but relies on latent space compression, which inevitably discards high-frequency physical details and degrades fine-grained prediction quality. To address these limitations, we propose PixelFlowCast, a two-stage probabilistic forecasting framework that achieves both high-efficiency and high-fidelity prediction without latent compression. Specifically, in the first stage, a deterministic model first produces coarse forecasts to capture global evolution trends. In the subsequent stage, the proposed KANCondNet extracts deep spatiotemporal evolution features to provide accurate conditional guidance. Based on this, a latent-free, few-step Pixel Mean Flows (PMF) predictor employs an $x$-prediction mechanism to generate high-quality predictions, effectively preserving fine-grained structures while maintaining fast inference. Experiments on the publicly available SEVIR dataset demonstrate that PixelFlowCast outperforms existing mainstream methods in both prediction accuracy and inference efficiency, particularly for long sequence forecasting, highlighting its strong potential for real-world operational deployment.

2605.10045 2026-05-12 cs.CV

ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

Feihong Yan, Shaoyu Liu, Haixuan Wang, Shuai Lu, Linfeng Zhang, Huiqi Li, Xiangyang Ji

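Affiliations: Beijing Institute of Technology; Xidian University; Northeastern University at Qinhuangdao; Shanghai Jiao Tong University; Department of Automation, Tsinghua University

AI summary: Visual autoregressive (VAR) models are a strong alternative to diffusion models for image generation, but their fixed training resolution prevents direct generation at higher resolutions. This paper proposes ExtraVAR, which uses a stage-aware RoPE remapping strategy to address the global repetition, local repetition, and detail degradation that arise when VAR models extrapolate to higher resolutions. It further proposes an entropy-driven adaptive attention calibration that accounts for how the attention distribution changes at high resolution. Experiments show the method outperforms existing methods in both structural coherence and detail fidelity.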

Comments 10 pages, 7 figures

Abstract

Visual Autoregressive (VAR) models have emerged as a strong alternative to diffusion for image synthesis, yet their fixed training resolution prevents direct generation at higher resolutions. Naively transferring training-free extrapolation methods from LLMs or diffusion models to VAR yields three characteristic failure modes: global repetition, local repetition, and detail degradation. We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band, and each failure mode emerges when the dominant band of a particular stage is disrupted. Building on this insight, we propose Stage-Aware RoPE Remapping, a training-free strategy that assigns each frequency band a stage-specific remapping rule, jointly suppressing all three failure modes. We further observe that attention becomes systematically dispersed as the image resolution increases. Existing methods typically depend on predefined attention scaling factors, which are neither adaptive to the target resolution nor capable of faithfully capturing the actual extent of attention dispersion. We therefore propose Entropy-Driven Adaptive Attention Calibration, which quantifies dispersion via a resolution-invariant normalized entropy and yields a closed-form per-head scaling factor that realigns the extrapolated-resolution attention entropy with its training-resolution counterpart. Extensive experiments show that our method consistently outperforms prior resolution-extrapolation methods in both structural coherence and fine-detail fidelity. Our code is available at https://github.com/feihongyan1/ExtraVAR.
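The abstract does not give the closed-form scaling factor, so the sketch below covers only the quantity it calibrates against: an attention entropy normalized so that values are comparable across sequence lengths. Dividing by the log of the number of keys is one natural normalization and is an assumption here, as is the function name; it is not taken from the ExtraVAR code.

```python
import numpy as np

def normalized_attention_entropy(attn: np.ndarray) -> np.ndarray:
    """Per-head attention entropy scaled to [0, 1].

    attn has shape (heads, queries, keys) and each row sums to 1.
    Dividing by log(keys) removes the dependence on sequence length
    (an assumed normalization, not necessarily the paper's).
    """
    eps = 1e-12  # avoids log(0) for exactly-zero attention weights
    n_keys = attn.shape[-1]
    entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, queries)
    return entropy.mean(axis=-1) / np.log(n_keys)        # (heads,)
```

Uniform attention scores 1 and one-hot attention scores 0 regardless of how many keys there are, which is what makes a comparison between training and extrapolated resolutions meaningful.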

2605.10044 2026-05-12 cs.LG cs.AI

Adaptive Action Chunking via Multi-Chunk Q Value Estimation

Yongjae Shin, Jongseong Chae, Seongmin Kim, Jongeui Park, Youngchul Sung

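Affiliations: KAIST

AI summary: This paper proposes Adaptive Action Chunking (ACH), a new method for action chunking in reinforcement learning. Using a Transformer-based architecture, it estimates action values for all candidate chunk lengths in a single forward pass and adjusts the chunk length to the current state, overcoming the limited performance of fixed-length chunking across states and tasks. Experiments show that ACH outperforms fixed-length baselines on 34 complex tasks, with better generalization and learning efficiency.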

Abstract

Action chunking emerged as a pivotal technique in imitation learning, enabling policies to predict cohesive action sequences rather than single actions. Recently, this approach has expanded to reinforcement learning (RL), enhancing behavioral consistency and reducing bootstrapping errors in value function estimation. However, existing methods rely on a fixed chunk length, creating a performance bottleneck as the optimal length varies across states and tasks. In this paper, we propose Adaptive Action CHunking (ACH), a novel offline-to-online RL algorithm that dynamically modulates chunk length during both training and inference. To find the optimal chunk length for a dynamically varying current state, we simultaneously estimate action-values for all candidate chunk lengths in a single forward pass, using a Transformer-based architecture. Our mechanism allows the agent to select the most effective chunk length adaptively based on the current state. Evaluated on 34 challenging tasks, ACH consistently outperforms fixed-length baselines, demonstrating superior generalization and learning efficiency in complex environments.
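The selection step described above can be sketched as follows. It assumes the network has already produced one Q-value per candidate chunk length and that the agent picks greedily; the greedy rule, the function name, and the example numbers are illustrative assumptions, not details from the paper.

```python
import numpy as np

def select_chunk_length(q_values, candidate_lengths):
    """Return the candidate chunk length with the highest estimated Q-value.

    q_values[i] is the (hypothetical) critic estimate for executing a chunk
    of candidate_lengths[i] actions from the current state.
    """
    q = np.asarray(q_values, dtype=float)
    if q.shape != (len(candidate_lengths),):
        raise ValueError("need exactly one Q-value per candidate length")
    return candidate_lengths[int(np.argmax(q))]

# One forward pass yields all three estimates; the agent commits to length 4.
chosen = select_chunk_length([0.1, 0.7, 0.3], [1, 4, 8])
```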

2605.10043 2026-05-12 cs.CL cs.AI

Personalizing LLMs with Binary Feedback: A Preference-Corrected Optimization Framework

Xilai Ma, Liye Zhao, Weijun Yao, Haibing Di, Wenya Wang, Jing Li

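Affiliations: Harbin Institute of Technology, Shenzhen, China; Huawei Technologies Co., Ltd.; Nanyang Technological University

AI summary: This work personalizes large language models (LLMs) with binary feedback to better align them with individual user preferences. It proposes C-BPO, a preference-calibrated optimization framework that treats the target user's data as positive feedback and other users' data as implicit negative feedback, capturing differences between users. To handle preference overlap, it builds its objective on positive-unlabeled (PU) learning theory, removing the positive bias so that personalization becomes more precise while general capability is preserved. Experiments show that C-BPO outperforms existing methods across a range of tasks and models.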

Comments Accepted by ACL 2026 Main

Abstract

Large Language Model (LLM) personalization aims to align model behaviors with individual user preferences. Existing methods often focus on isolated user histories, neglecting the essential role of inter-user differences. We propose C-BPO, a framework that personalizes LLMs via preference-calibrated binary signals. By treating target user data as positive feedback and other users' data as an auxiliary set of implicit negative signals, C-BPO captures distinct inter-user differences. To mitigate the preference overlap issue, where shared task knowledge is erroneously penalized, we derive an objective grounded in Positive-Unlabeled (PU) learning theory. This approach purifies negative signals by subtracting "positive bias", ensuring alignment with unique idiosyncrasies without compromising general helpfulness. Empirical experiments across various personalization tasks and backbone LLMs show C-BPO consistently outperforms baselines, demonstrating the efficacy of preference-calibrated binary signals in modeling inter-user differences.
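For readers unfamiliar with PU learning, the sketch below shows the classical non-negative PU risk estimator, in which the negative-class risk is estimated from unlabeled data after subtracting the contribution of positives hidden inside it. This is the textbook estimator, shown only to illustrate the "subtract the positive bias" idea; it is not the C-BPO objective, and the sigmoid loss and the class prior value are assumptions.

```python
import numpy as np

def sigmoid_loss(scores, label):
    """Smooth surrogate loss: small when sign(score) agrees with label (+1 or -1)."""
    return 1.0 / (1.0 + np.exp(label * np.asarray(scores, dtype=float)))

def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk with class prior `prior` = P(y = +1).

    The unlabeled set is treated as negative, then the share of that loss
    attributable to hidden positives is subtracted and clamped at zero.
    """
    risk_pos = prior * sigmoid_loss(pos_scores, +1).mean()
    risk_neg = (sigmoid_loss(unl_scores, -1).mean()
                - prior * sigmoid_loss(pos_scores, -1).mean())
    return float(risk_pos + max(0.0, risk_neg))
```

In the personalization setting the abstract describes, the target user's data would play the role of the positive set and other users' data the unlabeled set; how C-BPO maps this onto an LLM training objective is not specified in the abstract.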

2605.10038 2026-05-12 cs.AI

TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning

Hangchen Liu, Dongyuan Li, Renhe Jiang, Jiewen Deng, Weiwei Ye, Yoshihide Sekimoto

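Affiliations: The University of Tokyo; Southern University of Science and Technology

AI summary: TimeClaw is an AI agent for time series analysis that addresses the difficulty of reusing exploratory experience gained during task execution. Through a four-stage loop (explore, compare, distill, reinject), it turns exploratory execution into reusable hierarchical experience, and it combines metric-supervised learning, task-aware tool dropout, and inference-time experience injection to improve forecasting and reasoning in domains such as finance and weather. Experiments show that TimeClaw outperforms existing methods on multiple tasks, underscoring how strongly the handling of exploratory experience affects the performance of scientific systems.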

Comments Under review

Abstract

Time series analysis underpins forecasting, monitoring, and decision making in domains such as finance and weather, where solving a task often requires both numerical accuracy and contextual reasoning. Recent progress has moved from specialized neural predictors to approaches built on LLMs and foundation models that can reason over time series inputs and use external tools. However, most such systems remain execution-centric: they focus on solving the current instance but learn little from exploratory execution. This is especially limiting in verifiable numeric settings, where multiple candidate executions and tool-use procedures may all be task-valid yet differ sharply in quantitative quality, and where early success can trigger tool-prior collapse that suppresses further exploration. To address this limitation, we present TimeClaw, an exploratory execution learning framework that turns exploratory execution into reusable hierarchical distilled experience through a four-stage loop: Explore, Compare, Distill, and Reinject. TimeClaw combines metric-supervised exploratory execution learning, task-aware tool dropout, and hierarchical distilled experience for inference-time reinjection, while keeping the base model frozen and avoiding online test-time adaptation. In an MTBench-aligned evaluation with 17 tasks that span finance and weather prediction and reasoning tasks, TimeClaw delivers consistent gains over the baselines. These results suggest that, for scientific systems, the bottleneck is not only execution-time capability, but how exploratory experience is compared, distilled, and reused.

2605.10035 2026-05-12 cs.AI

From Single-Step Edit Response to Multi-Step Molecular Optimization

Haojie Rao, Kun Li, Yida Xiong, Jiameng Chen, Wenbin Hu, Yizhen Zheng, Jiajun Yu, Duanhua Cao

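Affiliations: School of Computer Science, Wuhan University; Department of Data Science and Artificial Intelligence, Monash University; College of Computer Science and Technology, Zhejiang University; School of Life Sciences and Technology, Tongji University

AI summary: This work optimizes specified molecular properties through structural edits, facing two challenges: data on structurally similar molecules is scarce, and decisions must obey chemical rules. It proposes a response-oriented discrete edit optimization approach consisting of a single-step molecular edit response predictor and a multi-step planner that composes local predictions into optimization trajectories via guided tree search, reducing dependence on external evaluation and improving data efficiency.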

Abstract

Conditional molecular optimization aims to edit a molecule to realize a specified property shift. In practice, structurally similar molecule data is scarce, while decisions are inherently action-level: at each step, the system must select one local structural edit from a candidate set that is strictly filtered by chemical feasibility rules. This level mismatch between supervision and decision makes oracle-in-the-loop search unstable in molecular optimization. Regressing on property differences between molecule pairs improves data efficiency, but it entangles transformation effects with global context and provides limited guidance for selecting the next feasible edit, so such methods often resort to oracle-in-the-loop search. For this reason, we propose a response-oriented discrete edit optimization approach comprising two tightly coupled components: a single-step molecular edit response predictor (SMER) and a multi-step planner that composes local predictions into optimization trajectories via guided tree search (SMER-Opt). The approach learns a directional evaluation model over edit actions to support constraint-aware planning. It mines weakly related molecule pairs and decomposes their structural differences into minimal edit units, turning endpoint property annotations into process-level supervision and yielding reusable, transferable action primitives. A directional edit evaluator then scores feasible candidate edits by their likelihood of moving the molecule toward the desired property change, substantially reducing dependence on external evaluator queries at decision time. Code is available at https://anonymous.4open.science/r/SMER.

2605.10034 2026-05-12 cs.RO

Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving

Aron Distelzweig, Faris Janjoš, Andreas Look, Anna Rothenhäusler, Daniel Jost, Oliver Scheel, Raghu Rajan, Daphne Cornelisse, Eugene Vinitsky, Joschka Boedecker

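Affiliations: University of Freiburg; Bosch Center for Artificial Intelligence; Coburg University of Applied Sciences; New York University

AI summary: This paper presents BehaviorBench, a comprehensive benchmark for evaluating how well autonomous driving policies generalize, intended to close the gap between large-scale reinforcement learning policies and standard evaluation. The benchmark is designed along three axes (evaluation, scenario complexity, and behavior diversity). It supports evaluating large-scale RL policies on established planning benchmarks such as nuPlan and introduces diverse interactive traffic agents to test policies under different behavior patterns. The study finds that policies trained purely by self-play overfit to their training opponents and generalize poorly to other traffic agent behaviors, and it proposes a hybrid planner that combines a PPO policy with a rule-based planner.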

Abstract

Recent Autonomous Driving (AD) works such as GigaFlow and PufferDrive have unlocked Reinforcement Learning (RL) at scale as a training strategy for driving policies. Yet such policies remain disconnected from established benchmarks, leaving the performance of large-scale RL for driving on standardized evaluations unknown. We present BehaviorBench -- a comprehensive test suite that closes this gap along three axes: Evaluation, Complexity, and Behavior Diversity. In terms of Evaluation, we provide an interface connecting PufferDrive to nuPlan, which, for the first time, enables policies trained via RL at scale to be evaluated on an established planning benchmark for autonomous driving. Complementarily, we offer an evaluation framework that allows planners to be benchmarked directly inside the PufferDrive simulation, at a fraction of the time. Regarding Complexity, we observe that today's standardized benchmarks are so simple that near-perfect scores are achievable by straight lane following with collision checking. We extract a meaningful, interaction-rich split from the Waymo Open Motion Dataset (WOMD) on which strong performance is impossible without multi-agent reasoning. Lastly, we address Behavior Diversity. Existing benchmarks commonly evaluate planners against a single rule-based traffic model, the Intelligent Driver Model (IDM). We provide a diverse suite of interactive traffic agents to stress-test policies under heterogeneous behaviors, beyond just using IDM. Overall, our benchmarking analysis uncovers the following insight: despite learning interactive behaviors in an emergent manner, policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors. Building on this observation, we propose a hybrid planner that combines a PPO policy with a rule-based planner.
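For context on the baseline traffic model named above, here is a minimal sketch of the standard Intelligent Driver Model acceleration law. The formula is the commonly published one; the default parameter values are illustrative assumptions and are not taken from nuPlan, PufferDrive, or BehaviorBench.

```python
import math

def idm_acceleration(v, gap, dv, v0=30.0, T=1.5, a_max=1.0, b=2.0, s0=2.0, delta=4.0):
    """Intelligent Driver Model acceleration in m/s^2.

    v:   ego speed (m/s)
    gap: bumper-to-bumper distance to the leading vehicle (m), must be > 0
    dv:  approach rate, ego speed minus leader speed (m/s)
    v0, T, a_max, b, s0, delta: desired speed, time headway, max acceleration,
    comfortable deceleration, minimum gap, exponent (illustrative values).
    """
    # Desired dynamic gap; clamping the speed-dependent term at zero is a common variant.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)
```

Because every IDM agent follows this single deterministic rule, a planner can score well against it without handling the heterogeneous behaviors that the benchmark's additional traffic agents are meant to introduce.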

2605.10029 2026-05-12 cs.CV

Slum Detection and Density Mapping with AlphaEarth Foundations: A Representation Learning Evaluation Across 12 Global Cities

Shuyang Hou, Ziqi Liu, Haoyue Jiao, Zhangyan Xu, Xiaopu Zhang, Lutong Xie, Yaxian Qing, Jianyuan Liang, Xuefeng Guan, Huayi Wu

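Affiliations: State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing

AI summary: This study evaluates AlphaEarth Foundations (AEF), a globally consistent high-resolution surface embedding, on slum detection and density estimation across 12 cities worldwide. Across multiple training strategies and auxiliary feature configurations, it finds that same-city cross-year training works best. It also finds that regression performance is driven mainly by zero/non-zero boundary discrimination, with limited capacity to model intra-pixel density gradients. POI features give the largest gain in density estimation, and multi-year AEF mapping preserves slum cluster structure.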

Abstract

Pixel-level slum mapping has long been constrained by limited cross-city generalisation, the absence of continuous density estimation, and weak global comparability. AlphaEarth Foundations (AEF), a globally consistent 64-dimensional annual surface embedding at 10 m, offers a new analysis-ready basis for lightweight slum monitoring, but its applicability to slum detection - an indirectly coupled task shaped by both built form and socio-economic processes - remains untested. We evaluate AEF on slum classification and sub-pixel density estimation across 12 cities and 69 city-year pairs (2017-2024), using GRAM pseudo-masks as supervisory labels. The evaluation spans four training strategies, two protocols (random split and 3x3 spatial block cross-validation), six auxiliary feature configurations, and five baseline models, complemented by representation-level analyses (PCA, SHAP) and full-AOI mapping. Five findings emerge. (1) Same-city cross-year training is optimal under both protocols (median spatial F1 = 0.616, R^2 = 0.466); temporal expansion outperforms cross-city transfer, indicating city-scale representational drift. (2) Regression R^2 is driven primarily by zero/non-zero boundary discrimination: positive-pixel R^2 is consistently negative across all cities, revealing limited capacity to model intra-pixel density gradients at 10 m. (3) PC36 is consistently top-ranked across tasks; classification saturates at k = 32 while regression remains unsaturated at k = 64. (4) POI features yield the largest density gain (Delta R^2 = +0.064). (5) For six cities meeting dual-task usability thresholds, full-AOI inference across 2017-2024 preserves slum cluster structure (mean SSIM = 0.926). The study delineates the capabilities and complementarity needs of foundation-model embeddings for slum monitoring.
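Finding (2) can look paradoxical, so the toy example below shows how an overall R^2 can be high while the R^2 restricted to positive pixels is negative: getting the zero/non-zero split right explains most of the total variance even when the predicted densities among positives are worse than predicting their mean. The numbers are made up for illustration and are not from the paper.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Toy pixels: four empty cells and four cells with some slum density.
y_true = np.array([0.0, 0.0, 0.0, 0.0, 0.4, 0.5, 0.6, 0.5])
y_pred = np.array([0.0, 0.0, 0.0, 0.0, 0.6, 0.4, 0.4, 0.6])

positive = y_true > 0
r2_all = r2_score(y_true, y_pred)                      # dominated by the zero/non-zero split
r2_pos = r2_score(y_true[positive], y_pred[positive])  # density gradients among positives only
```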

2605.10027 2026-05-12 cs.CL cs.AI

Speech-based Psychological Crisis Assessment using LLMs

Terumi Chiba, Yang Luo, Ziyun Cui, Yongsheng Tong, Chao Zhang

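Affiliations: Tsinghua University; Peking University Huilongguan Clinical Medical School; WHO Collaborating Centre for Research and Training in Suicide Prevention

AI summary: This paper proposes a large language model (LLM)-based framework for speech-based psychological crisis assessment that automatically classifies the crisis level of hotline calls, with the goal of improving the quality and efficiency of psychological hotline services. To better capture emotional signals in spoken conversation, it introduces a paralinguistic injection method that inserts identified non-verbal emotional cues into the speech transcript, making the model more sensitive to acoustic nuance. It also proposes a reasoning-enhanced training strategy that generates diagnostic reasoning chains as an auxiliary task to improve classification. Combined with data augmentation, the system reaches a macro F1-score of 0.802 and an accuracy of 0.805 on the three-class task.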

Comments 5 pages, 5 figures

Abstract

Psychological support hotlines provide critical support for individuals experiencing mental health emergencies, yet current assessments largely rely on human operators whose judgments may vary with professional experience and are constrained by limited staffing resources. This paper proposes a large language model (LLM)-based framework for automated crisis level classification, a key indicator that supports many downstream tasks and improves the overall quality of hotline services. To better capture emotional signals in spoken conversations, we introduce a paralinguistic injection method that inserts identified non-verbal emotional cues into speech transcripts, enabling LLM-based reasoning to incorporate critical acoustic nuances. In addition, we propose a reasoning-enhanced training strategy that trains the model to generate diagnostic reasoning chains as an auxiliary task, which serves as a regulariser to improve classification performance. Combined with data augmentation, our final system achieves a macro F1-score of 0.802 and an accuracy of 0.805 on the three-class classification task under 5-fold cross-validation.

2605.10026 2026-05-12 cs.CV

MUSDA: Multi-source Multi-modality Unsupervised Domain Adaptive 3D Object Detection for Autonomous Driving

Xiaohu Lu, Hamed Khatounabadi, Hayder Radha

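Affiliations: Electrical and Computer Engineering, Michigan State University

AI summary: As autonomous driving advances, annotated multi-modality datasets are increasingly abundant, making it possible to adapt 3D object detectors to new environments without manual annotation. Traditional domain adaptation methods, however, usually target a single source or a single modality and struggle in multi-source, multi-modality settings. This paper proposes a multi-source, multi-modality unsupervised domain adaptive 3D object detection framework for autonomous driving. It introduces hierarchical spatially-conditioned domain classifiers and a prototype graph weighted fusion strategy to align features across sources and modalities. Experiments on several widely used datasets show that it outperforms state-of-the-art methods.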

Abstract

With the advancement of autonomous driving, numerous annotated multi-modality datasets have become available. This presents an opportunity to develop domain-adaptive 3D object detectors for new environments without relying on labor-intensive manual annotations. However, traditional domain adaptation methods typically focus on a single source domain or a single modality, limiting their effectiveness in multi-source, multi-modality scenarios. In this paper, we propose a novel framework for multi-source, multi-modality unsupervised domain adaptation in 3D object detection for autonomous driving. Given multiple labeled source domains and one unlabeled target domain, our framework first introduces hierarchical spatially-conditioned (HSC) domain classifiers, which jointly align features from both camera and LiDAR modalities at two distinct levels for each source-target domain pair. To effectively leverage information from multiple source domains, we construct a prototype graph between each pair of domains. Based on this, we develop a prototype graph weighted (PGW) multi-source fusion strategy to aggregate predictions from multiple source detection heads. Experimental results on three widely used 3D object detection datasets - Waymo, nuScenes, and Lyft - demonstrate that our proposed framework effectively integrates information across both modalities and source domains, consistently outperforming state-of-the-art methods.

2605.10025 2026-05-12 cs.CL cs.AI

Medical Incident Causal Factors and Preventive Measures Generation Using Tag-based Example Selection in Few-shot Learning

Yuna Haseyama, Tomoki Ito, Hiroki Sakaji, Itsuki Noda

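Affiliations: Graduate School of Information Science and Technology, Hokkaido University, Hokkaido, Japan; National Institute of Information; Faculty of Information Science

AI summary: In high-stakes domains such as healthcare, the reliability of clinical insights generated by large language models (LLMs) is critical. This paper proposes a tag-based few-shot example selection method for prompting LLMs to generate background/causal factors and preventive measures from medical incident descriptions. Experiments on the Japanese Medical Incident Dataset (JMID) show that tag-based selection gives higher precision and more stable generation than random sampling and similarity-based selection, offering an effective strategy for more reliable clinical LLM applications.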

Abstract

In high-stakes domains such as healthcare, the reliability of Large Language Models (LLMs) is critical, particularly when generating clinical insights from incident reports. This study proposes a tag-based few-shot example selection method for prompting LLMs to generate background/causal factors and preventive measures from details of the medical incidents. For our experiments, we use the Japanese Medical Incident Dataset (JMID), a structured dataset of 3,884 real-world medical accident and near-miss reports. These reports are variably annotated with a wide range of tags--some include descriptive information (e.g., "medications," "blood transfusion therapy"). We compare three few-shot example selection strategies--random sampling, cosine similarity-based selection, and our proposed tag-based method--using GPT-4o and LLaMA 3.3. Results show that the tag-based approach achieves the highest precision and most stable generation behavior, while similarity-based selection often leads to unintended outputs and safety filter activation. These findings suggest that selecting examples based on human-interpretable dataset tags can improve generation precision and stability in clinical LLM applications.
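A tag-based selector of the kind described above might look like the sketch below. The abstract does not say how tags are matched, so ranking by Jaccard overlap is an assumption, as are the function name, the record layout, and the placeholder records (the "injection" tag is invented for the example).

```python
def select_examples_by_tags(query_tags, pool, k=3):
    """Return the k pool entries whose tag sets overlap most with the query's.

    pool is a list of dicts with a 'tags' collection; ties keep pool order
    because Python's sort is stable.
    """
    query = set(query_tags)

    def jaccard(tags):
        tags = set(tags)
        union = query | tags
        return len(query & tags) / len(union) if union else 0.0

    ranked = sorted(pool, key=lambda ex: jaccard(ex["tags"]), reverse=True)
    return ranked[:k]

# Placeholder records, not real JMID entries.
pool = [
    {"id": "a", "tags": {"medications"}},
    {"id": "b", "tags": {"blood transfusion therapy"}},
    {"id": "c", "tags": {"medications", "injection"}},
]
shots = select_examples_by_tags({"medications"}, pool, k=2)
```

The selected records would then be formatted as few-shot demonstrations ahead of the new incident description in the prompt.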

2605.10020 2026-05-12 cs.LG

TrajDLM: Topology-Aware Block Diffusion Language Model for Trajectory Generation

Wilson Wongso, Lihuan Li, Arian Prabowo, Xiachong Lin, Baiyu Chen, Hao Xue, Flora D. Salim

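Affiliations: University of New South Wales; Hong Kong University of Science and Technology (Guangzhou)

AI summary: Generating high-fidelity synthetic GPS trajectories is increasingly important in transportation, urban planning, and scenario simulation, but existing methods trade generation efficiency against faithfulness to road network topology. This paper proposes TrajDLM, a topology-aware trajectory generation framework based on block diffusion language models. It models trajectories as sequences of discrete road segments and combines topology-aware embeddings with constrained sampling, keeping trajectories realistic while speeding up generation. Experiments on three city-scale datasets show strong performance on fine-grained local similarity metrics, generation up to 2.8x faster than prior work, and zero-shot transfer across domains.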

Abstract

Generating high-fidelity synthetic GPS trajectories is increasingly important for applications in transportation, urban planning, and what-if scenario simulation, especially as privacy concerns limit access to real-world mobility data. Existing trajectory generation models face a trade-off between efficiency and faithfulness to road network topology: continuous-space methods enable fast generation but ignore the road network, while topology-aware approaches rely on search-based autoregressive decoding that limits generation speed. We propose TrajDLM, a topology-aware trajectory generation framework based on block diffusion language models that bridges this gap. TrajDLM models trajectories as sequences of discrete road segments, combining a block diffusion backbone for efficient denoising, topology-aware embeddings from a road network encoder, and topology-constrained sampling to ensure coherent and realistic trajectories. Across three city-scale datasets, TrajDLM achieves strong performance on fine-grained local similarity metrics while being up to $2.8\times$ faster than prior work, and demonstrates strong zero-shot transfer across domains, including unseen transportation modes. These results highlight the effectiveness of block-wise discrete diffusion as a scalable approach to accurate and efficient trajectory generation. Our code is available at https://github.com/cruiseresearchgroup/TrajDLM/
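The idea behind topology-constrained sampling can be illustrated with a simple logit mask: segments that are not reachable from the current one get zero probability. The sketch samples one segment at a time for clarity, which is a simplification; TrajDLM denoises blocks of tokens, and its actual constraint mechanism is not described in the abstract. The adjacency map and function name are hypothetical.

```python
import numpy as np

def constrained_next_segment(logits, current, successors, rng):
    """Sample the next road segment, restricted to successors of `current`.

    logits:     float array of model scores, one per road segment id
    successors: dict mapping a segment id to the list of segment ids
                reachable from it in the road network
    """
    allowed = successors[current]
    if not allowed:
        raise ValueError("segment has no successors")
    masked = np.full_like(logits, -np.inf, dtype=float)
    masked[allowed] = logits[allowed]
    # Softmax over the masked scores; exp(-inf) = 0 removes disallowed segments.
    probs = np.exp(masked - masked[allowed].max())
    probs /= probs.sum()
    return int(rng.choice(len(logits), p=probs))
```

Even when the unconstrained model strongly prefers a disconnected segment, the mask guarantees that consecutive segments in the output are adjacent in the road graph.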

2605.10019 2026-05-12 cs.LG cs.AI cs.CC stat.ML

The two clocks and the innovation window: When and how generative models learn rules

Binxu Wang, Emma Lucia Byrnes Finn, Bingbin Liu

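Affiliations: Kempner Institute at Harvard University

AI summary: This paper studies a fundamental tension facing generative models that learn rules from finite data: the training objective pulls the model toward the empirical distribution rather than the target distribution. It introduces two key timescales, the rule time $\tau_{\mathrm{rule}}$ and the memorization time $\tau_{\mathrm{mem}}$, to analyze when a model starts generating rule-valid samples and when it starts reproducing training data. The two timescales depend on rule complexity, model capacity, and dataset size. The paper defines the "innovation window" as the period in which the model genuinely innovates, and it identifies what is shared and what differs across architectures.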

Comments 48 pages, 28 figures. Earlier versions were presented as an oral presentation at the NeurIPS 2025 SPIGM workshop: https://openreview.net/forum?id=LjqX8OhPPi

Abstract

Generative models trained on finite data face a fundamental tension: their score-matching or next-token objective converges to the empirical training distribution rather than the population distribution we seek to learn. Using rule-valid synthetic tasks, we trace this tension across two training timescales: $\tau_{\mathrm{rule}}$, the step at which generations first become rule-valid, and $\tau_{\mathrm{mem}}$, the step at which models begin reproducing training samples. Focusing on parity and extending to other binary rules and combinatorial puzzles, we characterize how these two clocks, $\tau_{\mathrm{rule}}$ and $\tau_{\mathrm{mem}}$, depend on key aspects of the learning setup. Specifically, we show that $\tau_{\mathrm{rule}}$ increases with rule complexity and decreases with model capacity, while $\tau_{\mathrm{mem}}$ is approximately invariant to the rule and scales nearly linearly with dataset size $N$. We define the "innovation window" as the interval $[\tau_{\mathrm{rule}}, \tau_{\mathrm{mem}}]$. This window widens with increasing $N$ and narrows with rule complexity, and may vanish entirely when $\tau_{\mathrm{rule}} \geq \tau_{\mathrm{mem}}$. The same two-clock structure arises in both diffusion (DiT) and autoregressive (GPT) models, with architecture-dependent offsets. Dissecting the learned score of DiT models reveals a corresponding evolution of the optimization landscapes, where rule-valid samples' basins expand substantially around $\tau_{\mathrm{rule}}$, while training samples' basins begin to dominate around $\tau_{\mathrm{mem}}$. Together, these results yield a unified and predictive account of when and how generative models exhibit genuine innovation.
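As a small illustration of the two clocks, the sketch below reads both timescales off logged training curves and forms the innovation window, returning nothing when the rule is learned no earlier than memorization sets in. Operationalizing each clock as the first step at which a logged fraction crosses a threshold is an assumption, and the thresholds and curve values are invented for the example.

```python
from typing import Optional, Sequence, Tuple

def first_step_reaching(steps: Sequence[int], values: Sequence[float],
                        threshold: float) -> Optional[int]:
    """First logged training step at which `values` reaches `threshold`, else None."""
    for step, value in zip(steps, values):
        if value >= threshold:
            return step
    return None

def innovation_window(tau_rule: int, tau_mem: int) -> Optional[Tuple[int, int]]:
    """Interval [tau_rule, tau_mem]; None when it vanishes (tau_rule >= tau_mem)."""
    if tau_rule >= tau_mem:
        return None
    return (tau_rule, tau_mem)

# Hypothetical logs: fraction of rule-valid samples and of copied training samples.
steps = [100, 200, 300, 400]
rule_valid = [0.10, 0.60, 0.95, 0.99]
copied = [0.00, 0.00, 0.00, 0.20]

tau_rule = first_step_reaching(steps, rule_valid, threshold=0.9)
tau_mem = first_step_reaching(steps, copied, threshold=0.1)
window = innovation_window(tau_rule, tau_mem)
```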

2605.10018 2026-05-12 cs.LG

The Value of Mechanistic Priors in Sequential Decision Making

Itai Shufaro, Gal Benor, Shie Mannor

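Affiliations: Technion; NVIDIA Research

AI summary: This paper studies the value of mechanistic priors in sequential decision making. It introduces mechanistic information, a measure of how informative a mechanistic model is, and analyzes theoretical performance in both the asymptotic and the small-sample (burn-in) regimes. It shows that mechanistic priors can reduce sample complexity, with large sample-efficiency gains in the burn-in regime. Simulations of 5-fluorouracil dosing motivated by published pharmacokinetic data support the use of a hybrid mechanistic prior, and a comparison with LLM priors highlights their shortcomings, underlining the importance of physically grounded priors in safety-critical applications.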

Abstract

Hybrid mechanistic models, physical priors with learned residuals, promise to reduce the data required for good decisions, but have no computable criterion to test this. We characterize the value of mechanistic priors in sequential decision-making within both asymptotic and burn-in regimes. To formalize this, we introduce the mechanistic information of a model -- the mutual information between the model's recommended policy $\hat{\pi}$ and the true optimal policy $\pi^*$ -- quantified via an occupancy-weighted bias $B_\mu$. In the asymptotic regime (large $N$), matched bounds reveal that Bayesian regret scales with the residual entropy $H_{\mathrm{mech}}$, delivering a theoretical sample complexity reduction of $H(\mu)/H_{\mathrm{mech}}$ compared to an uninformed baseline. Furthermore, we provide a model certificate to determine empirical sample efficiency. Complementarily, in the clinically relevant burn-in regime (small $N$), we establish a lower bound on the penalty incurred by confidently wrong priors. We demonstrate both the asymptotic and burn-in bounds across 5-fluorouracil (5-FU) dosing simulations motivated by published FOLFOX pharmacokinetic data, where a hybrid prior yields large sample-efficiency gains in the burn-in regime. Finally, we contrast these grounded models with LLM priors, demonstrating that LLMs can suffer severe losses in mechanistic information, thereby motivating the exclusive use of physically-grounded priors for safety-critical applications.

2605.10009 2026-05-12 cs.CV

Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation

Yujia Cai, Boxuan Li, Chenghao Xu, Jiexi Yan

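Affiliations: School of Computer Science and Technology, Xidian University, Xi’an, Shaanxi, China; School of Electronic Engineering, Xidian University, Xi’an, Shaanxi, China

AI summary: This paper proposes Hystar, a lightweight framework that addresses the distribution shift caused by diverse query styles in query-based image retrieval (QBIR). A hypernetwork dynamically generates singular-value perturbations for attention layers, adapting the model to each query's style, while static singular-value offsets keep it stable across styles. Hystar also introduces StyleNCE, an optimal-transport-weighted contrastive loss that strengthens semantic discrimination across styles. Experiments on multi-style retrieval and cross-style classification show that it outperforms existing methods while being parameter-efficient and stable across styles.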

Comments Accepted by ICLR2026

Abstract

Query-based image retrieval (QBIR) requires retrieving relevant images given diverse and often stylistically heterogeneous queries, such as sketches, artworks, or low-resolution previews. While large-scale vision-language representation models (VLRMs) like CLIP offer strong zero-shot retrieval performance, they struggle with distribution shifts caused by unseen query styles. In this paper, we propose the Hypernetwork-driven Style-adaptive Retrieval (Hystar), a lightweight framework that dynamically adapts model weights to each query's style. Hystar employs a hypernetwork to generate singular-value perturbations ($\Delta S$) for attention layers, enabling flexible per-input adaptation, while static singular-value offsets on MLP layers ensure cross-style stability. To better handle semantic confusions across styles, we design StyleNCE as part of Hystar, an optimal-transport-weighted contrastive loss that emphasizes hard cross-style negatives. Extensive experiments on multi-style retrieval and cross-style classification benchmarks demonstrate that Hystar consistently outperforms strong baselines, achieving state-of-the-art performance while being parameter-efficient and stable across styles.
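The singular-value modulation mentioned above can be sketched with plain NumPy: factor a weight matrix once, then rebuild it with shifted singular values while the singular vectors stay fixed. Here `delta_s` is a fixed placeholder standing in for the hypernetwork's per-query output, which is not shown; the function name and shapes are assumptions.

```python
import numpy as np

def modulate_singular_values(weight: np.ndarray, delta_s: np.ndarray) -> np.ndarray:
    """Rebuild `weight` as U diag(S + delta_s) V^T, keeping U and V^T frozen."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u @ np.diag(s + delta_s) @ vt

rng = np.random.default_rng(0)
weight = rng.standard_normal((4, 3))

# Placeholder perturbation; in Hystar this would come from the hypernetwork.
delta_s = np.array([0.3, 0.2, 0.1])
adapted = modulate_singular_values(weight, delta_s)
```

Only as many numbers as there are singular values change per layer, which is consistent with the parameter-efficiency claim in the abstract.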

2605.10002 2026-05-12 cs.CV

Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models

Minh Khoi Nguyen, Dai Lam Le, Amir Reza Jafari, Tuan Dung Nguyen, Mai Hong Son, Mai Huy Thong, Quang Huy Nguyen, Thanh Trung Nguyen, Reza Farahbakhsh, Noel Crespi, Phi Le Nguyen

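Affiliations: AI4LIFE, Hanoi University of Science and Technology, Vietnam; SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, France; Military Central Hospital, Vietnam

AI summary: This work presents Med-StepBench, the first large-scale benchmark for evaluating the step-wise reasoning of medical vision-language models on 3D PET/CT imaging, aimed at detecting hallucinations in which a model produces clinically plausible but incorrect statements. It decomposes clinical reasoning into four diagnostic stages and, with over 12,000 images and more than 1,000,000 image-statement pairs, reveals systematic failures of existing models in multi-step reasoning. The study also shows that current models are highly susceptible to plausible but misleading intermediate explanations, which further amplify hallucination, and it provides a basis for building safer and more reliable medical vision-language models.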

Comments Accepted at IJCAI-ECAI 2026

Abstract

Large vision-language models (VLMs) demonstrate strong performance in medical image understanding, but frequently generate clinically plausible yet incorrect statements, raising significant safety concerns. Existing medical hallucination benchmarks primarily focus on 2D imaging with one-shot diagnostic questions, offering limited insight into whether predictions are grounded in correct localization and abnormality identification, allowing critical reasoning errors to remain hidden behind seemingly correct diagnoses. We introduce Med-StepBench, the first large-scale benchmark for step-wise hallucination detection in 3D oncological PET/CT, comprising over 12,000 images and more than 1,000,000 image-statement pairs across volumetric and multi-view 2D data, which decomposes clinical reasoning into four expert-designed diagnostic stages. Using clinician-verified annotations, we perform the first step-level evaluation of general-purpose and medical VLMs, revealing systematic failure modes obscured by aggregate accuracy metrics. Furthermore, we show that current VLMs are highly susceptible to adversarial yet clinically plausible intermediate explanations, which significantly amplify hallucinations despite contradictory visual evidence. Together, our findings highlight fundamental limitations in grounding multi-step clinical reasoning and establish Med-StepBench as a rigorous benchmark for developing safer and more reliable medical VLMs.

2605.10001 2026-05-12 cs.LG

Anchor-guided Hypergraph Condensation with Dual-level Discrimination

Fan Li, Xiaoyang Wang, Chen Chen, Wenjie Zhang

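Affiliations * School of Computer Science and Engineering, University of New South Wales, Sydney, Australia; School of Artificial Intelligence, Shenzhen University, Shenzhen, China

AI summary As hypergraphs grow in scale, training hypergraph neural networks faces significant computational challenges. To address this, the paper proposes AHGCDD, a hypergraph condensation method that introduces an anchor-guided hyperedge synthesis strategy and a dual-level discrimination objective to jointly optimize structure and features, improving condensation efficiency and downstream task performance. The method couples structure generation and feature condensation more tightly, avoiding the structure-feature misalignment of conventional methods. Experiments show that AHGCDD achieves superior condensation quality and computational efficiency on multiple benchmark datasets.

Comments This paper has been accepted by ICML 2026

English abstract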

The increasing prevalence of large-scale hypergraphs poses significant computational challenges for hypergraph neural network (HNN) training. To address this, hypergraph condensation (HGC) distills large real hypergraphs into compact yet informative synthetic ones, beyond graph condensation (GC) methods limited to pairwise relations. However, existing HGC methods rely on decoupled training architectures, where structure generators are pre-trained on the original hypergraph but not jointly optimized with condensed features during refinement, resulting in misaligned structures that degrade downstream utility. Moreover, trajectory-based optimization incurs substantial computational overhead in refinement, limiting condensation efficiency. To tackle these issues, we propose Anchor-guided HyperGraph Condensation with Dual-level Discrimination (AHGCDD), which consists of three key components: (1) a node initialization module based on Heat Kernel PageRank (HKPR) to encode structural knowledge into feature semantics; (2) an anchor-guided hyperedge synthesis strategy for joint optimization of condensed features and structure; (3) a theoretically grounded dual-level discrimination objective for utility-preserving condensation without redundant HNN training. Extensive experiments demonstrate the superior effectiveness and efficiency of AHGCDD.
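
The abstract names Heat Kernel PageRank (HKPR) for node initialization but does not say how it is adapted to hypergraphs. As a point of reference only, the standard graph definition is the series e^{-t} * sum_k (t^k / k!) * s P^k, where s is a seed distribution and P is the random-walk transition matrix. The sketch below evaluates a truncated version of that series on an ordinary graph; the function name, the diffusion time `t`, and the truncation length `terms` are illustrative assumptions, not the authors' module.

```python
import math
from typing import List


def heat_kernel_pagerank(
    adj: List[List[int]], seed: int, t: float = 3.0, terms: int = 40
) -> List[float]:
    """Truncated series e^{-t} * sum_k (t^k / k!) * s P^k for a single seed node."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    walk = [1.0 if i == seed else 0.0 for i in range(n)]  # s P^0
    weight = math.exp(-t)                                 # e^{-t} t^0 / 0!
    scores = [weight * w for w in walk]
    for k in range(1, terms + 1):
        # One random-walk step: mass at node i spreads evenly over its neighbours.
        nxt = [0.0] * n
        for i in range(n):
            if walk[i] == 0.0:
                continue
            for j in range(n):
                if adj[i][j]:
                    nxt[j] += walk[i] * adj[i][j] / degree[i]
        walk = nxt
        weight *= t / k                                   # e^{-t} t^k / k!
        scores = [s + weight * w for s, w in zip(scores, walk)]
    return scores


# Hypothetical 4-node path graph 0 - 1 - 2 - 3, seeded at node 0.
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = heat_kernel_pagerank(path, seed=0)
```

Because each walk step preserves total mass, the scores sum to nearly 1 once enough terms are kept, and a larger `t` spreads mass further from the seed.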

2605.09999 2026-05-12 cs.RO cs.PF cs.SY eess.SY

Muninn: Your Trajectory Diffusion Model But Faster

Gokul Puthumanaillam, Hao Jiang, Ruben Hernandez, Jose Fuentes, Paulo Padrao, Leonardo Bobadilla, Melkior Ornik

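Affiliations * University of Illinois Urbana-Champaign; Florida International University; Providence College

AI summary The paper proposes Muninn, a training-free caching method that accelerates diffusion-based trajectory planners so they can be used in real-time robotics. Its core idea is to use the change signal of the diffusion model's internal trajectory representation, together with analytic coefficients of the denoising error, to decide dynamically whether to reuse a cached denoiser output, thereby cutting unnecessary computation. Experiments show that Muninn achieves up to 4.6x speedup across several trajectory diffusion models while preserving task performance and safety, and its effectiveness is validated in real hardware deployments.

Comments Accepted to Robotics: Science and Systems 2026

English abstract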

Diffusion-based trajectory planners can synthesize rich, multimodal robot motions, but their iterative denoising makes online planning and control prohibitively slow. Existing accelerations either modify the sampler or compress the network--sacrificing plan quality or requiring retraining without accounting for downstream control risk. We address the problem of making diffusion-based trajectory planners fast enough for real-time robot use without retraining the model or sacrificing trajectory quality, and in a way that works across diverse state-space diffusion architectures. Our key insight is that diffusion trajectory planners expose two signals we can exploit: a cheap probe of how their internal trajectory representation changes across steps, and analytic coefficients that describe how denoiser errors affect the sampler's state update. By calibrating the first signal against the second on offline runs, we obtain a per-step score that upper-bounds how far the final trajectory can deviate when we reuse a cached denoiser output, and we treat this bound as an uncertainty budget that we can spend over the denoising process. Building on this insight, we present Muninn, a training-free caching wrapper that tracks this uncertainty budget during sampling and, at each diffusion step, chooses between reusing a cached denoiser output when the predicted deviation is small and recomputing the denoiser when it is not. Across standard benchmarks Muninn delivers up to 4.6x wall-clock speedups across several trajectory diffusion models by reducing denoiser evaluations, while preserving task performance and safety metrics. Muninn further certifies that cached rollouts remain within a specified distance of their full-compute counterparts, and we validate these gains in real-time closed-loop navigation and manipulation hardware deployments. Project page: https://github.com/gokulp01/Muninn.
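
The reuse-or-recompute decision described in the abstract can be pictured with a toy loop. This is a minimal sketch under stated assumptions, not the Muninn implementation: `step_scores` stands in for the calibrated per-step deviation bound, `budget` for the uncertainty budget, and the scalar state update is a placeholder rather than a real diffusion sampler. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def sample_with_cache(
    denoiser: Callable[[float, int], float],
    step_scores: Sequence[float],
    budget: float,
    x0: float = 1.0,
) -> Tuple[float, List[bool]]:
    """Run a toy denoising loop, reusing the cached output while the budget allows."""
    x = x0
    cached = None   # last denoiser output that was actually computed
    spent = 0.0     # portion of the uncertainty budget used so far
    reused: List[bool] = []
    for step, score in enumerate(step_scores):
        # Reuse only if a cached output exists and this step's predicted
        # deviation still fits inside the remaining budget.
        can_reuse = cached is not None and spent + score <= budget
        if can_reuse:
            eps = cached
            spent += score
        else:
            eps = denoiser(x, step)
            cached = eps
        reused.append(can_reuse)
        x = x - 0.1 * eps   # placeholder update, not a real sampler
    return x, reused


calls: List[int] = []


def toy_denoiser(x: float, step: int) -> float:
    calls.append(step)   # record which steps paid for a full evaluation
    return x


final_x, reused = sample_with_cache(
    toy_denoiser, step_scores=[0.5, 0.1, 0.1, 0.9, 0.1], budget=0.35
)
```

In this run the denoiser is evaluated only at steps 0 and 3; the other three steps reuse the cached output because their scores fit within the budget.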

2605.09998 2026-05-12 cs.LG cs.AI

Continual Harness: Online Adaptation for Self-Improving Foundation Agents

Seth Karten, Joel Zhang, Tersoo Upaa, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli

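Affiliations * Princeton University; ARISE Foundation; Google DeepMind

AI summary This paper studies online self-adaptation for embodied agents in long-horizon, partially observable decision-making tasks and proposes Continual Harness, which lets an agent improve itself continually, without human intervention, by iterating on its own strategy and refining its long-term memory. Starting from a minimal environment interface, the method alternates between acting and refining its own prompt, sub-agents, skills, and memory; it learns strategies efficiently in Pokemon games and substantially reduces operating cost, approaching and in places surpassing a hand-engineered expert system. The study also builds an online process-reward co-learning loop in which the model itself participates, driving sustained progress on in-game milestones.

Comments 28 pages, 19 figures, 5 tables

English abstract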

Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for embodied agents' long-horizon partial-observability decision-making. We first report our Gemini Plays Pokemon (GPP) experiments. With iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside human-in-the-loop refinement. Continual Harness removes the human fully from this loop: a reset-free self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Prompt-optimization methods require episode resets; Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness starting from scratch substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.

2605.09996 2026-05-12 cs.CV

Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

Yeongtak Oh, Dongwook Lee, Sangkwon Park, Heeseung Kim, Sungroh Yoon

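Affiliations * Department of Electrical and Computer Engineering, Seoul National University; Interdisciplinary Program in Artificial Intelligence, Seoul National University; Department of Artificial Intelligence, University of Seoul

AI summary This paper proposes Omni-Persona, the first comprehensive benchmark for omnimodal personalization, for systematically evaluating and improving joint personalization across text, image, and audio. The benchmark formalizes the task with a Persona Modality Graph, covers 4 task groups and 18 fine-grained tasks, and introduces the Calibrated Accuracy (Cal) metric, which jointly measures correct grounding and appropriate abstention. Experiments reveal a gap between audio and visual grounding in open-source models, show that parameter scale and recall are not reliable diagnostics, and expose the distinct limitations and challenges of supervised fine-tuning and reward-based reinforcement learning for personalization.

Comments Project Page: https://github.com/oyt9306/Omni-Persona

English abstract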

While multimodal large language models have advanced across text, image, and audio, personalization research has remained primarily vision-language. Unified omnimodal benchmarking that jointly covers text, image, and audio is still limited and lacks the methodological rigor to account for absent-persona scenarios or systematic grounding studies. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the Persona Modality Graph, encompassing 4 task groups and 18 fine-grained tasks across ~750 items. To rigorously diagnose grounding behavior, we propose Calibrated Accuracy (Cal), which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. In our dedicated experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher Cal, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.

2605.09995 2026-05-12 cs.CL

Annotations Mitigate Post-Training Mode Collapse

Jacob Mitchell Springer, Madhu Advani, Lukas Aichberger, Arwen Bradley, Eran Malach, Omid Saremi, Sinead Williamson, Preetum Nakkiran, Etai Littwin, Aditi Raghunathan

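Affiliations * Carnegie Mellon University; Apple; Johannes Kepler University Linz

AI summary This study examines how supervised fine-tuning (SFT), while improving instruction following, can cause semantic mode collapse, and finds that the problem worsens as model scale grows. The authors propose an annotation-guided training method: pretraining on documents paired with semantic annotations, then preserving the annotation distribution and its diversity during fine-tuning, so that rich semantic expression survives fine-tuning. Experiments show that the method effectively mitigates the loss of semantic diversity, and that the benefit grows with model scale.

Comments 21 pages, 8 figures, 11 tables. Accepted at ICML 2026

English abstract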

Post-training (via supervised fine-tuning) improves instruction-following, but often induces semantic mode collapse by biasing models toward low-entropy fine-tuning data at the expense of the high-entropy pretraining distribution. Crucially, we find this trade-off worsens with scale. To close this semantic diversity gap, we propose annotation-anchored training, a principled method that enables models to adopt the preference-following behaviors of post-training without sacrificing the inherent diversity of pretraining. Our approach is simple: we pretrain on documents paired with semantic annotations, inducing a rich annotation distribution that reflects the full breadth of pretraining data, and we preserve this distribution during post-training. This lets us sample diverse annotations at inference time and use them as anchors to guide generation, effectively transferring pretraining's semantic richness into post-trained models. We find that models trained with annotation-anchored training can attain $6 \times$ less diversity collapse than models trained with SFT, and improve with scale.

2605.09993 2026-05-12 cs.LG

Learning Graph Foundation Models on Riemannian Graph-of-Graphs

Haokun Liu, Zezhong Ding, Xike Xie

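Affiliations * School of Biomedical Engineering, University of Science and Technology of China (USTC), Suzhou, Jiangsu, China; Data Darkness Lab, Suzhou Institute for Advanced Research, USTC, Suzhou, Jiangsu, China; School of Artificial Intelligence and Data Science, USTC, Hefei, Anhui, China

AI summary This paper proposes R-GFM, a graph foundation model based on a Riemannian Graph-of-Graphs (GoG) structure, to address the limited generalization of existing graph foundation models on tasks of differing scale and structural complexity. R-GFM builds a multi-scale GoG over subgraphs at different hop counts and learns geometry-adaptive representations from Riemannian manifolds, capturing the structural characteristics of graph data more flexibly. Experiments show that R-GFM achieves state-of-the-art performance on multiple datasets, with relative improvements of up to 49% on some tasks.

Comments This paper has been accepted by ICML 2026

English abstract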

Graph foundation models (GFMs), pretrained on massive graph data, have transformed graph machine learning by supporting general-purpose reasoning across diverse graph tasks and domains. Existing GFMs pretrained with fixed-hop subgraph sampling impose a fixed receptive field, causing scale mismatch on diverse tasks, which often require heterogeneous and unknown structural contexts beyond a fixed sampling scale. We propose R-GFM, a Riemannian Graph-of-Graphs (GoG) based foundation model that treats structural scale as a first-class citizen in modeling. R-GFM constructs a multi-scale GoG over sampled subgraphs at different hop distances and learns geometry-adaptive representations from Riemannian manifolds. Theoretical analysis shows that R-GFM reduces structural domain generalization error compared to fixed-scale GFMs. Experiments on various datasets demonstrate that R-GFM achieves state-of-the-art performance, with up to a 49% relative improvement on downstream tasks. Our code is available at https://github.com/USTC-DataDarknessLab/R-GFM.

2605.09992 2026-05-12 cs.LG cs.AI

Attention Drift: What Autoregressive Speculative Decoding Models Learn

Doğaç Eldenk, Payal Mohapatra, Yigitcan Comlek, Kaan Oktay, Hongyang Zhang, Stephen Xia

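Affiliations * Northwestern University; GE Aerospace; University of Waterloo

AI summary This paper studies how the attention distribution of autoregressive speculative decoding models changes during generation, a phenomenon called "attention drift": as the model generates successive tokens, attention gradually shifts from the original prompt to its own generated content. The study traces the phenomenon to an un-normalized residual path inside the model, which makes the hidden state grow with generation depth. The authors propose two architectural changes, post-norm on the hidden states and per-state RMS normalization, which effectively improve acceptance length and generalization under template perturbation, on long-context tasks, and on multiple benchmarks.

English abstract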

Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously unreported phenomenon we call attention drift: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto its own recently generated tokens. We observe this across both EAGLE3 drafters and MTP heads, suggesting drift is a property of drafter designs. We trace this to the un-normalized residual path between chain steps: the drafter's hidden state magnitude grows monotonically with chain depth, so the drafter exhibits dynamics consistent with additional pre-norm transformer layers stacked on the target rather than with a standalone autoregressive predictor. To limit the growth, we propose two architectural changes: post-norm on the drafter hidden states and per-hidden-state RMSNorm after capturing target hidden states. Our interventions improve acceptance length over the current leading model, pre-norm EAGLE3, by up to $2\times$ under template perturbation, $1.18\times$ on long-context tasks, and $1.10\times$ on seven standard benchmarks spanning multi-turn chat, math, and coding. Our changes also allow shorter train-time-test depths to generalize over longer drafting sequences.
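
To make the magnitude argument concrete, the sketch below compares a toy residual chain with and without RMS normalization. It is an illustration under stated assumptions, not the paper's drafter: the 4-dimensional hidden state, the fixed `update` vector, and the function names are hypothetical, and the normalization omits the learned gain that RMSNorm layers usually carry.

```python
import math
from typing import List


def rms_norm(hidden: List[float], eps: float = 1e-6) -> List[float]:
    """Scale a vector to unit root-mean-square (no learned gain, for brevity)."""
    rms = math.sqrt(sum(h * h for h in hidden) / len(hidden) + eps)
    return [h / rms for h in hidden]


def l2(vector: List[float]) -> float:
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))


def run_chain(steps: int, normalize: bool) -> List[float]:
    """Toy residual chain: each step adds a fixed update to the hidden state."""
    hidden = [1.0, -2.0, 0.5, 3.0]
    update = [0.5, -1.0, 0.25, 1.5]   # hypothetical output of the residual branch
    norms: List[float] = []
    for _ in range(steps):
        hidden = [h + u for h, u in zip(hidden, update)]
        if normalize:
            hidden = rms_norm(hidden)
        norms.append(l2(hidden))
    return norms


unnormalized = run_chain(5, normalize=False)   # norm grows at every step
normalized = run_chain(5, normalize=True)      # norm stays near sqrt(4) = 2
```

Without normalization the hidden-state norm increases at every chain step; with it, the norm is pinned near the square root of the dimension regardless of depth.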

2605.09991 2026-05-12 cs.AI cs.LG math.OC

Optimizer-Induced Mode Connectivity: From AdamW to Muon

Fangzhao Zhang, Sungyoon Kim, Erica Zhang, Yiqi Jiang, Mert Pilanci

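Affiliations * Stanford University

AI summary This paper studies how the optimizer affects mode connectivity, examining how connectivity of the solution space behaves under the constraints of a given optimizer. Analyzing two-layer ReLU networks, it finds that when the network is sufficiently wide, the solutions produced by a single optimizer (such as AdamW or Muon) form a connected set, a result that goes beyond prior work. The paper further shows that solution regions from different optimizers can be disjoint or overlapping depending on regularization, and that in GPT-2 pretraining same-optimizer paths preserve the model's spectral properties while cross-optimizer paths show a smooth transition, revealing the optimizer's strong influence on the structure of the solution space.

English abstract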

Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer -- AdamW, Muon, or others in the Lion-$\mathcal{K}$ family -- form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe same-optimizer paths preserve each model's spectrum while cross-optimizer paths traverse a smooth transition. Our results reveal optimizer-dependent structure beyond classical mode connectivity literature.

2605.09990 2026-05-12 cs.CL

Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference

Sietse Schelpe

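Affiliations * Corbenic AI, Inc.

AI summary This paper presents Merlin, a deterministic byte-exact deduplication system that targets the efficiency bottleneck caused by redundant text in large language model inference. The system uses an optimized SIMD-friendly hashing scheme for efficient, exact text deduplication and context optimization, and is particularly suited to applications such as Retrieval-Augmented Generation (RAG). Experiments show that Merlin achieves substantial input reduction on datasets with varying degrees of redundancy while preserving data integrity, and supports fast, secure deployment via the Model Context Protocol (MCP).

Comments Preprint. Implementation and open-source community version available at: https://github.com/corbenicai/merlin-community - https://doi.org/10.5281/zenodo.20090991

English abstract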

Data-intensive applications, ranging from large-scale retrieval systems to advanced data pipelines, are increasingly bottlenecked by the processing of highly redundant text corpora. We present Merlin, a local-first, agnostic, high-throughput deduplication and context optimization engine designed to mitigate these inefficiencies. Utilizing a highly optimized, SIMD-friendly open-addressing flat hash set combined with xxHash3-64, Merlin performs rapid, byte-exact deduplication of text passages and data chunks. While broadly applicable to any text-processing workflow, its impact is particularly pronounced in Large Language Model (LLM) ecosystems, such as Retrieval-Augmented Generation (RAG). Our empirical evaluations demonstrate an input reduction ranging from 13.9% in low-redundancy datasets to over 71% in high-redundancy pipelines, maintaining absolute data fidelity. Furthermore, we detail the system's integration architecture via the Model Context Protocol (MCP), enabling secure, zero-network-interception deployment across major IDEs and autonomous agents. This paper outlines the core algorithmic design, performance benchmarks, and the architectural principles required to process data at sustained speeds of up to 8.7 GB/s.
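
The core idea of byte-exact deduplication can be sketched in a few lines. This is an illustrative sketch, not Merlin's engine: it swaps in a 64-bit BLAKE2b digest from Python's standard library in place of xxHash3-64, uses a plain dict in place of the SIMD-friendly flat hash set, and adds a byte-for-byte comparison as an assumed way of ruling out digest collisions, which the abstract does not describe. The function name `dedupe_exact` is hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List


def dedupe_exact(chunks: Iterable[bytes]) -> List[bytes]:
    """Keep the first occurrence of each byte-identical chunk, preserving order."""
    seen: Dict[bytes, List[bytes]] = {}   # 64-bit digest -> chunks with that digest
    kept: List[bytes] = []
    for chunk in chunks:
        digest = hashlib.blake2b(chunk, digest_size=8).digest()
        bucket = seen.setdefault(digest, [])
        # Compare the bytes themselves so a digest collision can never drop
        # a chunk that is actually different.
        if chunk in bucket:
            continue
        bucket.append(chunk)
        kept.append(chunk)
    return kept


passages = [b"alpha", b"beta", b"alpha", b"Alpha", b"beta"]
unique = dedupe_exact(passages)
# Fraction of input bytes removed by deduplication.
reduction = 1 - sum(map(len, unique)) / sum(map(len, passages))
```

Note that `b"Alpha"` survives alongside `b"alpha"`: matching is on exact bytes, so near-duplicates that differ in case or whitespace are deliberately kept.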

2605.09985 2026-05-12 cs.AI cs.LG cs.NE

Prospective Compression in Human Abstraction Learning

Leonardo Hernandez Cano, Ivan Zareski, Luisa El Amouri, Pinzhe Zhao, Max Mascini, Emanuele Sansone, Yewen Pu, Bonan Zhao, Marta Kryven

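Affiliations * Massachusetts Institute of Technology; Dalhousie University; Nanyang Technological University

AI summary This paper studies how humans incrementally learn and build reusable abstractions in non-stationary task environments. The authors propose that, unlike existing algorithms, which compress retrospectively over past tasks, humans tend to compress prospectively, with future tasks in mind. Through visual program synthesis experiments and comparisons with computational models, the study finds that human abstraction behavior is sensitive to latent non-stationary structure in the task-generating process, a property that cannot be explained by conventional retrospective compression algorithms or by the inductive biases of large language models.

Comments Under review at NeurIPS 2026

English abstract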

A core challenge in program synthesis is online library learning: the incremental acquisition of reusable abstractions under uncertainty about future task demands. Existing algorithms treat library learning as retrospective compression over a static task distribution, where the learned library is determined by the corpus of past tasks. However, real-world learning domains are often non-stationary, with tasks arising from a generative process that evolves over time. We propose and test the hypothesis that in non-stationary domains human library learning selects abstractions prospectively: targeting compression of future tasks. We study this question using the Pattern Builder Task, a visual program synthesis paradigm in which participants construct increasingly complex geometric patterns from a small set of primitives, transformations, and custom helpers that carry forward across trials. Using this task, we conduct two experiments with complementary latent curricula, designed to dissociate between behaviors consistent with prospective compression, and alternative library learning accounts. Using six computational models spanning online library learning strategies, we show that human abstraction behavior reflects sensitivity to latent, non-stationary structure in the task-generating process. This behavior is consistent with prospective compression, and cannot be captured by existing retrospective compression-based algorithms, or inductive biases modeled by LLM-based program synthesis.

2605.09984 2026-05-12 cs.CV cs.AI cs.LG

Geometric 4D Stitching for Grounded 4D Generation

Sunwoo Park, Taesung Kwon, Jong Chul Ye

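Affiliations * KAIST AI

AI summary This paper proposes Geometric 4D Stitching, an efficient framework that addresses the geometric inconsistency and high reconstruction cost of existing 4D scene generation methods. The method explicitly identifies missing geometric regions and fills them with geometrically grounded 4D stitches, which markedly improves the efficiency of 4D scene generation while maintaining geometric consistency. It also supports iterative expansion of 4D meshes and scene editing, making it practical and extensible.

English abstract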

Recent 4D generation methods complete scene-level missing information using generative models and reconstruct the scene into radiance-based representations. However, these pipelines often present geometric inconsistencies in the generated content, and the radiance-based reconstruction requires expensive optimization. Furthermore, radiance-based representations often absorb these geometric inconsistencies into their view-dependent nature, failing to enforce grounded geometric consistency. To address these issues, we propose Geometric 4D Stitching, an efficient framework that explicitly identifies missing geometric regions and complements them with geometrically grounded 4D stitches. As a result, our method constructs 4D scene representations in under 10 minutes on a single NVIDIA RTX 5090 GPU per one-step scene expansion, while improving geometric consistency. Moreover, we demonstrate that our explicit 4D stitching supports iterative expansion of the 4D mesh as well as 4D scene editing.

2605.09982 2026-05-12 cs.CV

ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning

Yuna Lee, Kyoungho Min, Yulhwa Kim

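Affiliations * Department of Electrical and Computer Engineering, Sungkyunkwan University, Republic of Korea; Department of Semiconductor Systems Engineering, Sungkyunkwan University, Republic of Korea

AI summary This paper proposes ERASE, a two-stage vision token pruning framework that addresses the computational burden of the large number of vision tokens produced when vision-language models process high-resolution images. Using an adaptive pruning strategy, the method identifies and retains key vision tokens according to the complexity of the input image, substantially reducing the token count while preserving model performance. Experiments show that on Qwen2.5-VL-7B, ERASE retains 89.46% of the original accuracy at an 85% pruning ratio, outperforming the best existing method.

Comments 20 pages, 8 figures

English abstract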

Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experimental results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at https://github.com/Tuna-Luna/ERASE.

2605.09977 2026-05-12 cs.CV

INFANiTE: Implicit Neural representation for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI

Xiaotian Hu, Mingxuan Liu, Hongjia Yang, Juncheng Zhu, Yijin Li, Yifei Chen, Haoxiang Li, Tongxi Song, Zihan Li, Yingqi Hao, Ziyu Li, Yujin Zhang, Gang Ning, Yi Liao, Haibo Qu, Qiyuan Tian

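Affiliations * Beihang University; Tsinghua University; Sichuan University; University of Oxford

AI summary This study proposes INFANiTE, an implicit neural representation framework for efficiently learning high-resolution spatio-temporal fetal brain atlases from clinical thick-slice MRI scans, removing the time-consuming slice-to-volume reconstruction and iterative registration steps of conventional methods. The method markedly accelerates atlas construction, and experiments show that it maintains high accuracy and biological plausibility even with sparse data, offering a feasible solution for large-scale analysis of fetal brain development.

English abstract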

Spatio-temporal fetal brain atlases are important for characterizing normative neurodevelopment and identifying congenital anomalies. However, existing atlas construction pipelines necessitate days for slice-to-volume reconstruction (SVR) to generate high-resolution 3D brain volumes and several additional days for iterative volume registration, thereby rendering atlas construction from large-scale cohorts prohibitively impractical. We address these limitations with INFANiTE, an Implicit Neural Representation (INR) framework for high-resolution Fetal brain spatio-temporal Atlas learNing from clinical Thick-slicE MRI scans, bypassing both the costly SVR and the iterative non-rigid registration steps entirely, thereby substantially accelerating atlas construction. Extensive experiments demonstrate that INFANiTE outperforms existing baselines in subject consistency, reference fidelity, intrinsic quality and biological plausibility, even under challenging sparse-data settings. Additionally, INFANiTE reduces the end-to-end processing time (i.e., from raw scans to the final atlas) from days to hours compared to the traditional 3D volume-based pipeline (e.g., SyGN), facilitating large-scale population-level fetal brain analysis. Our code is publicly available at: https://anonymous.4open.science/r/INFANiTE-5D74

2605.09976 2026-05-12 cs.CV

OZ-TAL: Online Zero-Shot Temporal Action Localization

Chaolei Han, Hongsong Wang, Xin Gong, Jie Gui

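Affiliations * School of Cyber Science and Engineering, Southeast University; Engineering Research Center of Blockchain Application, Supervision and Management (Southeast University), Ministry of Education; Purple Mountain Laboratories, Nanjing; School of Computer Science and Engineering, Southeast University

AI summary This paper introduces a new task, Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen action categories and their timing while a video stream is being processed. To address the limited generalization of existing methods across video domains, the authors design a training-free framework that uses off-the-shelf vision-language models and adds mechanisms to enhance visual representations and reduce their biases. Experiments show that the method significantly outperforms existing state-of-the-art methods on THUMOS14 and ActivityNet-1.3, establishing new benchmarks and baselines.

English abstract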

Online Temporal Action Localization (On-TAL) aims to detect the occurrence time and category of actions in untrimmed streaming videos immediately upon their completion. Recent advancements in this field focus on developing more sophisticated frameworks, shifting from Online Action Detection (OAD)-based aggregation paradigm to instance-level understanding. However, existing approaches are typically trained on specific domains and often exhibit limited generalization capabilities when applied to arbitrary videos, particularly in the presence of previously unseen actions. In this paper, we introduce a new task called Online Zero-shot Temporal Action Localization (OZ-TAL), which aims to detect previously unseen actions in an online fashion. Furthermore, we propose a training-free framework that leverages off-the-shelf Vision-Language Models (VLMs) while introducing additional mechanisms to enhance visual representations and mitigate their inherent biases. We establish new benchmarks and representative baselines for OZ-TAL on THUMOS14 and ActivityNet-1.3, and extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches under both offline and online zero-shot settings.