arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11020 2026-05-13 cs.LG cs.AI cs.RO

Trust Region Inverse Reinforcement Learning: Explicit Dual Ascent using Local Policy Updates

Anish Diwan, Davide Tateo, Christopher E. Mower, Haitham Bou-Ammar, Jan Peters, Oleg Arenz

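AI summary: This paper proposes Trust Region Inverse Reinforcement Learning (TRIRL), an inverse reinforcement learning method that achieves monotonic improvement of the reward function and policy without fully solving a reinforcement learning problem at every iteration. Its core idea is to use trust-region policy optimization, searching locally around the current policy, to explicitly optimize the dual objective. The method preserves monotonic dual improvement while avoiding the training instability of adversarial methods, performs strongly on several challenging tasks, and learns reward functions that are robust to changes in system dynamics.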

Comments Accepted as a conference paper at the International Conference on Machine Learning (ICML) 2026

Abstract

Inverse reinforcement learning (IRL) is typically formulated as maximizing entropy subject to matching the distribution of expert trajectories. Classical (dual-ascent) IRL guarantees monotonic performance improvement but requires fully solving an RL problem each iteration to compute dual gradients. More recent adversarial methods avoid this cost at the expense of stability and monotonic dual improvement, by directly optimizing the primal problem and using a discriminator to provide rewards. In this work, we bridge the gap between these approaches by enabling monotonic improvement of the reward function and policy without having to fully solve an RL problem at every iteration. Our key theoretical insight is that a trust-region-optimal policy for a reward function update can be globally optimal for a smaller update in the same direction. This smaller update allows us to explicitly optimize the dual objective while only relying on a local search around the current policy. In doing so, our approach avoids the training instabilities of adversarial methods, offers monotonic performance improvement, and learns a reward function in the traditional sense of IRL--one that can be globally optimized to match expert demonstrations. Our proposed algorithm, Trust Region Inverse Reinforcement Learning (TRIRL), outperforms state-of-the-art imitation learning methods across multiple challenging tasks by a factor of 2.4x in terms of aggregate inter-quartile mean, while recovering reward functions that generalize to system dynamics shifts.

2605.11019 2026-05-13 cs.LG cs.AI

Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness

Zizhao Chen, Yuying Li, Siting Lin, Lianxi Wang

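AI summary: Large language models rely on chain-of-thought for complex reasoning, but overthinking severely degrades inference efficiency. Inspired by cognitive science, this paper proposes VPG-EA, an efficient-reasoning framework based on variational posterior guidance. It introduces an efficiency-aware evidence lower bound to cast efficient reasoning as a variational inference problem, and uses a parameter-shared dual-stream architecture to transfer the efficient patterns of the posterior distribution to the prior policy through variational distillation. Experiments show that the method significantly improves the comprehensive efficiency metric on models of different scales.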

Abstract

Although large language models rely on chain-of-thought for complex reasoning, the overthinking phenomenon severely degrades inference efficiency. Existing reinforcement learning methods compress reasoning chains by designing elaborate reward functions, which renders high-quality samples extremely sparse in the exploration space and creates a sampling bottleneck for the prior policy. Inspired by cognitive science, we theoretically prove that a posterior distribution guided by reference answers achieves higher expected utility than the prior distribution, and is thus capable of breaking through the sampling bottleneck of high-quality samples. However, the posterior distribution is unavailable during inference. To this end, we formalize efficient reasoning as a variational inference problem and introduce an efficiency-aware evidence lower bound as the theoretical foundation. Based on this, we propose the VPG-EA framework. It adopts a parameter-shared dual-stream architecture to instantiate both the posterior distribution and the prior policy; after filtering out pseudo-efficient paths via cross-view evaluation, it unidirectionally transfers the posterior's efficient patterns to the prior policy through variational distillation. Experiments on DeepSeek-R1-Distill-Qwen-1.5B and 7B scales demonstrate that VPG-EA improves the comprehensive efficiency metric $ε^3$ by 8.73% and 12.37% over the strongest baselines on each model size, respectively.

2605.11017 2026-05-13 cs.LG cs.AI cs.IR

Simpson's Paradox in Behavioral Curves: How Aggregation Distorts Parametric Models of User Dynamics

Chao Zhou

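AI summary: This paper studies the systematic bias that data aggregation introduces into parametric models of user behavioral curves, namely Simpson's paradox in behavioral curves. It finds that the engagement peak of individual users differs markedly from that of the aggregate curve, and that the distortion is driven mainly by survival bias. The paper proposes Synthetic Null Calibration to reduce false positives in per-user classification and notes that the phenomenon has broad implications for recommender systems, advertising, and clinical dosing.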

Comments Submitted to NeurIPS 2026

Abstract

Behavioral curve modeling -- fitting parametric functions to engagement-versus-exposure data -- is standard practice in recommendation, advertising, and clinical dosing. We show that aggregation introduces a systematic distortion: Simpson's paradox in behavioral curves. On Goodreads (3.3M users, 9 genres), individual users peak at n* approximately 11 exposures while the aggregate peaks at n* approximately 34 -- a 3x gap driven by survival bias. Amazon Electronics (18M reviews) shows a 5.3x distortion. MovieLens-25M (D approximately 1) serves as a negative control, confirming that survival bias -- not aggregation per se -- is the operative mechanism. The distortion is robust to category granularity, engagement operationalization, and classifier calibration. We develop Synthetic Null Calibration to address a 32% false positive rate in per-user classification. Our findings apply wherever individual behavioral parameters are estimated from aggregate curves under differential attrition.
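
To see how survival bias alone can move the aggregate peak away from the individual peak, consider a toy simulation. It is only an illustrative sketch: the curve shape, the two user groups, their sizes, and their lifetimes are invented for this example and are not taken from the paper or its datasets.

```python
import math

def engagement(n: int, base: float) -> float:
    """Hypothetical per-user curve: rises, peaks at n = 10, then decays."""
    return base * n * math.exp(1 - n / 10) / 10

# Hypothetical population: (number of users, base engagement, exposures before churn).
GROUPS = [
    (90, 0.2, 20),  # many low-engagement users who drop out early
    (10, 1.0, 60),  # a few high-engagement users who stay much longer
]

def aggregate(n: int) -> float:
    """Mean engagement at exposure n over the users who are still active."""
    alive = [(count, base) for count, base, lifetime in GROUPS if n <= lifetime]
    total = sum(count for count, _ in alive)
    return sum(count * engagement(n, base) for count, base in alive) / total

exposures = range(1, 61)
individual_peak = max(exposures, key=lambda n: engagement(n, 1.0))
aggregate_peak = max(exposures, key=aggregate)
print(individual_peak, aggregate_peak)
```

In this toy setup every simulated user peaks at 10 exposures, yet the aggregate curve peaks at 21, immediately after the low-engagement group has churned. The shift comes purely from who remains in the average, not from any change in individual behavior.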

2605.11014 2026-05-13 cs.LG cs.AI

Backbone-Equated Diffusion OOD via Sparse Internal Snapshots

Yadang Alexis Rouzoumka, Jean Pinsolle, Eugénie Terreaux, Christèle Morisseau, Jean-Philippe Ovarlez, Chengfang Ren

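AI summary: This paper proposes MBE, a fair-comparison protocol that addresses inconsistent evaluation of diffusion models on out-of-distribution (OOD) detection caused by differences in backbone, noise parameterization, and inference budget. It introduces Canonical Feature Snapshots (CFS), a detection method based on sparse internal activations that needs only a small number of internal activations from a frozen diffusion model to perform OOD detection efficiently. Experiments show that CFS performs well on a CIFAR-scale benchmark and that its performance depends mainly on a few sparse states rather than on full denoising trajectories or complex downstream modules. The paper also explains the phenomenon theoretically, relating the internal states of diffusion models at low noise levels to encoder-decoder complementarity.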

Abstract

Fair comparison between diffusion-based OOD detectors is challenging, as conclusions can vary with backbone choice, corruption parameterization, and test-time budget. We address this issue through a Mutualized Backbone-Equated (MBE) protocol that aligns canonical corruption levels and logical test-time cost across diffusion backbones. Within this setting, we introduce Canonical Feature Snapshots (CFS), a family of detectors that probes a frozen diffusion backbone using only a tiny number of native internal activations at canonical low-noise levels. On a controlled CIFAR-scale benchmark, the strongest one-forward CFS variant is CFS(1x2), while an even smaller decoder-only variant remains highly competitive. This shows that much of the relative-OOD signal exposed by frozen diffusion backbones is concentrated in a small number of sparse internal states, rather than requiring full denoising trajectories or high-capacity downstream heads. We further provide a local diagnostic theory explaining these observations through conditional encoder-decoder complementarity, diagonal-score separation, and low-noise corruption stability. The official implementation is available at https://github.com/RouzAY/cfs-diffusion-ood/.

2605.11011 2026-05-13 cs.LG cs.AI

LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

Taekhyun Park, Yongjae Lee, Dohee Kim, Hyerim Bae

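AI summary: LoopUS is a post-training framework that converts a pretrained large language model (LLM) into a looped latent-refinement model in order to improve its reasoning ability. Through its core components (model decomposition, a selective gating mechanism, random deep supervision, and a confidence head), it turns a standard model into a stable looped architecture without damaging existing capabilities. LoopUS mitigates computational bottlenecks and representation collapse and significantly improves reasoning performance.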

Abstract

Looped computation shows promise in improving the reasoning-oriented performance of LLMs by scaling test-time compute. However, existing approaches typically require either training recurrent models from scratch or applying disruptive retrofits, which involve substantial computational costs and may compromise pretrained capabilities. To address these limitations, we introduce Looped Depth Up-Scaling (LoopUS), a post-training framework that converts a standard pretrained LLM into a looped architecture. As a key technical contribution, LoopUS recasts the pretrained LLM into an encoder, a looped reasoning block, and a decoder. It operationalizes this latent-refinement architecture through four core components: (1) block decomposition, guided by staged representation dynamics; (2) an input-dependent selective gate to mitigate hidden-state drift; (3) random deep supervision for memory-efficient learning over long recursive horizons; and (4) a confidence head for adaptive early exiting. Collectively, these mechanisms transform a standard non-looped model into a looped form while stabilizing it against both computational bottlenecks and representation collapse. Through stable latent looping, LoopUS improves reasoning-oriented performance without extending the generated traces or requiring recurrent training from scratch. For more details, see https://thrillcrazyer.github.io/LoopUS

2605.11010 2026-05-13 cs.LG

A Comparative Study of Federated Learning Aggregation Strategies under Homogeneous and Heterogeneous Data Distributions

Antonios Makris, Christos Dousis, Emmanouil Kritharakis, Stavros Bouras, Konstantinos Tserpes

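AI summary: This paper compares the performance of commonly used federated learning aggregation strategies under homogeneous and heterogeneous data distributions. It analyzes experimentally how different aggregation methods affect model accuracy, loss, and system efficiency (aggregation, training, and communication time) as data heterogeneity varies, revealing the strengths and weaknesses of each strategy under different data distributions and task conditions. The study offers a useful reference for choosing an aggregation method suited to a practical setting.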

Abstract

Federated Learning has emerged as a transformative paradigm for collaborative machine learning across distributed environments. However, its performance is strongly influenced by the aggregation strategy used to combine local model updates at the server, which directly affects learning performance, robustness, and system behavior. This work presents a comprehensive experimental comparison of widely used federated aggregation strategies under both homogeneous and heterogeneous data distributions. Using benchmark image classification datasets, we analyze how different aggregation mechanisms respond to varying degrees of data heterogeneity, examining their impact on centralized accuracy and loss, and system-level efficiency metrics, including aggregation, training, and communication time. The results demonstrate that aggregation strategies exhibit distinct trade-offs across datasets and data distributions, with their effectiveness varying according to dataset characteristics and operating conditions.
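
As a concrete reference point for what "aggregation strategy" means, here is a minimal sketch of sample-weighted averaging in the style of FedAvg, one widely used strategy. The abstract does not list which strategies the study compares, so this is an assumption chosen for illustration; the function name and the flat-list parameter format are likewise hypothetical.

```python
from typing import Sequence

def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> list[float]:
    """Average client parameter vectors, weighted by local sample count."""
    if len(client_weights) != len(client_sizes) or not client_weights:
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two clients: the second holds three times as much data as the first.
print(fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3]))  # [2.5, 3.5]
```

Under heterogeneous data, clients with more samples pull the global model toward their own distribution, which is one reason alternative strategies can behave differently.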

2605.11009 2026-05-13 cs.LG cs.RO

ACSAC: Adaptive Chunk Size Actor-Critic with Causal Transformer Q-Network

Qian Chen, Junqiao Zhao, Hongtu Zhou, Hang Yu, Yanping Zhao, Chen Ye, Guang Chen

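AI summary: This paper proposes ACSAC, an actor-critic method with adaptive chunk size, to address the challenges of reinforcement learning in long-horizon, sparse-reward tasks. By using a causal Transformer as the Q-network, it can flexibly choose the best chunk length among action chunks of different sizes, adaptively balancing reactivity and temporal consistency without task-specific tuning. Experiments show that ACSAC achieves state-of-the-art performance on several long-horizon, sparse-reward manipulation tasks.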

Abstract

Long-horizon, sparse-reward tasks pose a fundamental challenge for reinforcement learning, since single-step TD learning suffers from bootstrapping error accumulation across successive Bellman updates. Actor-critic methods with action chunking address this by operating over temporally extended actions, which reduce the effective horizon, enable fast value backups, and support temporally consistent exploration. However, existing methods rely on a fixed chunk size and therefore cannot adaptively balance reactivity against temporal consistency. A large fixed chunk size reduces responsiveness to new observations, while a small one produces incoherent motions, forcing task-specific tuning of the chunk size. To address this limitation, we propose Adaptive Chunk Size Actor-Critic (ACSAC). ACSAC leverages a causal Transformer critic to evaluate expected returns for action chunks of different sizes. At each chunk boundary, it adaptively selects the chunk size that maximizes the expected return, supporting flexible, state-dependent chunk sizes without task-specific tuning. We prove that the ACSAC Bellman operator is a contraction whose unique fixed point is the action-value function of the adaptive policy. Experiments on OGBench demonstrate that ACSAC achieves state-of-the-art performance on long-horizon, sparse-reward manipulation tasks across both offline RL and offline-to-online RL settings.
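
The selection step at a chunk boundary can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes the actor proposes one maximum-length chunk and that the critic can score every prefix of it, and the names `select_chunk` and `toy_critic` are hypothetical.

```python
from typing import Callable, Sequence

def select_chunk(state: object,
                 proposed: Sequence,
                 critic: Callable[[object, Sequence], float]) -> list:
    """Return the prefix of `proposed` whose length the critic scores highest."""
    best_k = max(range(1, len(proposed) + 1),
                 key=lambda k: critic(state, proposed[:k]))
    return list(proposed[:best_k])

# Toy critic (hypothetical): prefers committing to exactly three actions.
toy_critic = lambda state, chunk: -abs(len(chunk) - 3)

print(select_chunk(None, ["a0", "a1", "a2", "a3", "a4"], toy_critic))
# ['a0', 'a1', 'a2']
```

The agent would execute the returned prefix and then repeat the selection from the resulting state, so the chunk size can differ from one boundary to the next.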

2605.11008 2026-05-13 cs.LG cs.AI

When and How to Canonize: A Generalization Perspective

Yonatan Sverdlov, Benjamin Friedman, Snir Hordan, Nadav Dym

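AI summary: This paper analyzes, from a theoretical perspective, the generalization performance of achieving invariance through canonization when processing symmetric data. It introduces a theoretical framework based on covering-number bounds, shows that the error bounds of canonized models lie between those of structurally invariant models and non-invariant baselines, and proves that the quality of a canonization depends on its regularity. For point cloud processing, the authors further prove that the covering number of lexicographical sorting grows exponentially with dimension, whereas Hilbert curve canonization guarantees polynomial growth, which provides a theoretical basis for the method's success in point cloud architectures.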

Abstract

While invariant architectures are standard for processing symmetric data, there is growing interest in achieving invariance by applying group averaging or canonization to non-invariant backbones. However, the theoretical generalization properties of these alternative strategies remain poorly understood. We introduce a theoretical framework to analyze the generalization error of these methods by bounding their covering numbers. We establish a rigorous generalization hierarchy: the error bounds of canonized models are at best equal to the error bounds of structurally invariant and group-averaged models, and at worst equal to the bounds of non-invariant baselines. Furthermore, we show that there exist optimal canonizations which attain the optimal error bounds, and poor canonizations which attain the non-invariant error bounds, and that this depends on the regularity of the canonization. Finally, applying this framework to permutation groups in point cloud processing, we rigorously prove that the covering number of lexicographical sorting grows exponentially with point cloud dimension, whereas Hilbert curve canonization guarantees polynomial growth. This provides the first formal theoretical justification for the empirical success of Hilbert curve serialization in state-of-the-art point cloud architectures. We conclude with experiments that support our theoretical claims. Code is available at https://github.com/yonatansverdlov/Canonization
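
The basic mechanism of canonization can be shown in a few lines: map each input to a fixed representative of its orbit, then apply an ordinary non-invariant model. The sketch below uses lexicographical sorting for permutation invariance of point clouds; the backbone is a made-up order-sensitive function, and the Hilbert curve variant discussed in the abstract is not implemented here.

```python
from typing import Callable, Sequence

Point = tuple[float, ...]

def canonize_lex(points: Sequence[Point]) -> list[Point]:
    """Lexicographic canonization: sort points coordinate by coordinate."""
    return sorted(points)

def make_invariant(backbone: Callable[[list[Point]], float]
                   ) -> Callable[[Sequence[Point]], float]:
    """Compose a non-invariant backbone with the canonization."""
    return lambda points: backbone(canonize_lex(points))

# Hypothetical order-sensitive backbone: weights each point by its position.
backbone = lambda pts: sum((i + 1) * p[0] for i, p in enumerate(pts))
model = make_invariant(backbone)

cloud = [(3.0, 1.0), (1.0, 2.0), (2.0, 0.0)]
shuffled = [cloud[1], cloud[2], cloud[0]]
print(backbone(cloud) == backbone(shuffled))  # False: backbone depends on order
print(model(cloud) == model(shuffled))        # True: canonized model does not
```

Invariance holds for any canonization; the abstract's point is that how well the resulting model generalizes depends on the regularity of the canonization map.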

2605.11007 2026-05-13 cs.LG cs.AI

RT-Transformer: The Transformer Block as a Spherical State Estimator

Peter Racioppo

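AI summary: This paper treats the Transformer block as a state estimator on the sphere, showing that its core components (attention, residual connections, and normalization) arise from a single geometric estimation problem. By modeling the latent state as a direction on the hypersphere and defining noise in the tangent plane at the current estimate, the work constructs a precision-weighted directional inference procedure in which attention aggregates evidence, residual connections perform state updates, and normalization projects the updated state back onto the hypersphere. The components are thus natural consequences of the geometry of the estimation problem rather than independent architectural design choices.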

Abstract

We show that the core components of the Transformer block -- attention, residual connections, and normalization -- arise naturally from a single geometric estimation problem. Modeling the latent state as a direction on the hypersphere, with noise defined in the tangent plane at the current estimate, yields a precision-weighted directional inference procedure in which attention aggregates evidence, residual connections implement incremental state updates, and normalization retracts the updated state back onto the hypersphere. Together, these components follow from the geometry of the estimation problem rather than being introduced as independent architectural choices.
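
One way to picture the three roles is a single update step on a unit vector. The sketch below is an assumption-laden illustration rather than the paper's derivation: the precision weights are given directly instead of being computed from queries and keys, and the function names are hypothetical.

```python
import math
from typing import Sequence

def normalize(v: Sequence[float]) -> list[float]:
    """Scale v to unit length, i.e. retract it onto the hypersphere."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def spherical_update(x: Sequence[float],
                     evidence: Sequence[Sequence[float]],
                     precisions: Sequence[float]) -> list[float]:
    """One estimator step for a unit vector x."""
    total = sum(precisions)
    # Attention-like step: precision-weighted average of the evidence vectors.
    agg = [sum(w * e[i] for w, e in zip(precisions, evidence)) / total
           for i in range(len(x))]
    # Keep only the component lying in the tangent plane at x.
    dot = sum(a * b for a, b in zip(agg, x))
    tangent = [a - dot * b for a, b in zip(agg, x)]
    # Residual step, then normalization as the retraction onto the sphere.
    return normalize([b + t for b, t in zip(x, tangent)])

print(spherical_update([1.0, 0.0], [[0.0, 1.0]], [1.0]))
```

With the state at (1, 0) and a single piece of evidence pointing along (0, 1), the updated state lies halfway between the two directions and again has unit norm.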

2605.11005 2026-05-13 cs.LG cs.AI cs.DC

DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism

Zhichen Zeng, Chi-Chih Chang, Jiayi Wang, Zezhou Wang, Ningxin Zheng, Zheng Zhong, Cesar A. Stuardo, Dongyang Wang, Mohamed S. Abdelfattah, Haibin Lin, Banghua Zhu, Ang Li, Ziheng Jiang

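AI summary: This paper proposes DisagMoE, a mixture-of-experts (MoE) training system that targets the communication bottleneck of expert parallelism in large language model training. It places attention layers and feed-forward network layers in separate GPU groups and introduces a multi-stage pipeline with uni-directional, many-to-many communication, which overlaps computation with communication. Experiments show that DisagMoE significantly improves training efficiency on several MoE models, with up to a 1.8x speedup on 16-node 8xH800 clusters.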

Abstract

Mixture-of-experts (MoE) architectures enable trillion-parameter LLMs with sparsely activated experts. Expert parallelism (EP) is a widely adopted MoE training strategy, but it suffers from severe all-to-all communication bottlenecks, which are exacerbated by limited inter-node network bandwidth as growing model sizes require distributing experts across GPU nodes. Prior work focused on overlapping these all-to-all communications with feed-forward network (FFN) and self-attention computations, which often leaves residual network-bound stalls due to the inherent imbalance in the computation-communication ratios of attention and FFN layers. We present DisagMoE, a disaggregated MoE training system that jointly optimizes model placement and scheduling for maximal efficiency. DisagMoE separates attention and FFN layers into disjoint GPU groups, introduces a multi-stage pipeline with uni-directional, many-to-many communications, and employs a computation-communication roofline model to balance GPU and network bandwidth allocation among the attention and FFN groups. DisagMoE is implemented on Megatron-LM, and evaluation shows that DisagMoE improves training efficiency across multiple MoE models with up to 1.8x speedup on 16-node 8xH800 clusters.

2605.11001 2026-05-13 cs.LG

Finite Volume-Informed Neural Network Framework for 2D Shallow Water Equations: Rugged Loss Landscapes and the Importance of Data Guidance

Xiaofeng Liu

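AI summary: This paper proposes a finite-volume-based physics-informed neural network (FVM-PINN) framework for solving the 2D shallow water equations, addressing the shortcomings of conventional PINNs in handling conservation, discontinuities, and unstructured meshes. It finds that an FVM-PINN trained on the physics equations alone tends to collapse to a trivial low-momentum solution on realistic problems, whereas adding sparse data guidance substantially improves the model and reduces velocity-field error by more than an order of magnitude. Experiments show that the framework can build a high-accuracy surrogate model for a real river reach.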

Abstract

Physics-informed neural networks (PINNs) are a simple surrogate-modelling paradigm for partial differential equations, but their standard strong-form residual formulation is ill suited to the shallow water equations (SWE). It cannot enforce local conservation, handle discontinuities, or leverage the boundary-conforming unstructured meshes used in real-world applications. We introduce "Data-Guided FVM-PINN", a framework that replaces the strong-form residual with a differentiable, well-balanced Roe Riemann-solver finite-volume (FVM) loss evaluated on unstructured meshes. The major finding is that physics-only FVM-PINN training often fails on realistic 2D problems: the network collapses to a trivial low-momentum state that nearly satisfies the FVM-PINN residual but bears no resemblance to the true flow. A loss-landscape diagnostic shows that the FVM-PINN loss at zero momentum is only about $7\times$ larger than at the trained solution, a shallow basin that an ordinary optimizer falls into; adding even sparse data turns this into a $310\times$ separation, breaking the degeneracy. On a 2D block-in-channel benchmark, just $200$ random velocity measurements drop the velocity-field $L_2$ error by $22\times$ versus physics-only; $50$ measurements still deliver a $7\times$ reduction. A controlled ablation isolates the contribution of the FVM-PINN loss: it reduces velocity-field $L_2$ error by $\sim 23\%$ in the sparse-data regime and is essentially neutral when dense reference data is available. On a real-world Savannah River reach ($1306$ cells, $3600$ s simulation, five Manning zones), the framework constructs an accurate surrogate from SRH-2D anchor data, with time-window decomposition reducing error monotonically via progressive initial-condition handoff.

2605.10999 2026-05-13 cs.LG cs.AI cs.MA

SkillGen: Verified Inference-Time Agent Skill Synthesis

Yuchen Ma, Yue Huang, Han Bao, Haomin Zhuang, Swadheen Shukla, Michel Galley, Xiangliang Zhang, Stefan Feuerriegel

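AI summary: SkillGen is a multi-agent framework that synthesizes a single auditable skill from trajectories generated by a base agent, improving agent performance without retraining the model. It contrasts successful and failed trajectories to identify reusable success patterns and causes of failure, and produces a readable skill description that can be verified by humans. Its key novelty is to model a skill as an intervention on agent behavior and to evaluate the skill's net effect by comparing performance with and without it, which improves performance across multiple tasks and datasets.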

Abstract

Skills are a promising way to improve LLM agent capabilities without retraining, while keeping the added procedure reusable and controllable. However, high-quality skills are still largely written by hand. We introduce SkillGen, a multi-agent framework that synthesizes a single auditable skill from trajectories generated by a base agent. The output is a human-readable artifact that can be inspected before use. Rather than merely summarizing trajectories, SkillGen leverages contrastive induction over both successful and failed trajectories to identify reusable success patterns, recurring failure modes, and behaviors that appear in nearby successes but are missing from failures. SkillGen then generates candidate skills and iteratively refines the skill. A key novelty in SkillGen is that we model agent skills as interventions to empirically verify the net effect of skills on the overall performance. Specifically, we compare outcomes on the same instances with and without the skill, so that we account for both repairs (cases where the skill fixes a baseline failure) and regressions (cases where the skill breaks a baseline success). Across a broad range of agents and datasets, SkillGen consistently improves held-out performance, outperforms existing skill-generation baselines, and produces skills that transfer across models.
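
The paired comparison described above (same instances, with and without the skill) reduces to counting repairs and regressions. A minimal sketch, assuming outcomes are recorded as booleans per instance; the function name and return format are hypothetical and not SkillGen's API.

```python
from typing import Sequence

def skill_effect(baseline: Sequence[bool],
                 with_skill: Sequence[bool]) -> dict[str, int]:
    """Paired comparison on the same instances, with and without the skill."""
    if len(baseline) != len(with_skill):
        raise ValueError("outcomes must cover the same instances")
    # Repair: baseline failed, skill succeeded. Regression: the reverse.
    repairs = sum(1 for b, s in zip(baseline, with_skill) if not b and s)
    regressions = sum(1 for b, s in zip(baseline, with_skill) if b and not s)
    return {"repairs": repairs, "regressions": regressions,
            "net": repairs - regressions}

print(skill_effect([True, False, False, True], [True, True, True, False]))
# {'repairs': 2, 'regressions': 1, 'net': 1}
```

Reporting only the change in overall success rate would hide the regression in this example, which is why the two counts are kept separate.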

2605.10993 2026-05-13 cs.RO

ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

Yanbin Hu, Jin Cui, Jiayi Lu, Ruixuan Yang, Jun Ye, Boran Zhao, Xingyu Chen, Xuguang Lan, Pengju Ren

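AI summary: ECHO is a continuous hierarchical memory framework for vision-language-action (VLA) models that aims to improve performance on long-horizon manipulation tasks. Inspired by the hierarchical organization of human experience, it maps VLA hidden states into a continuous hierarchical space with a hyperbolic autoencoder and uses hyperbolic metrics and entailment constraints to build a semantic memory tree that supports efficient top-down experience retrieval. A background consolidation mechanism continually refines the memory tree through geometric interpolation and structural splitting and supports virtual memory synthesis in the continuous space, markedly improving generalization to long task compositions and unseen scenarios.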

Abstract

Memory capacity is a critical factor determining the performance of Vision-Language-Action (VLA) models in long-horizon manipulation tasks. Existing memory-augmented architectures primarily rely on linear or flat storage, lacking structural priors for manipulation categories and hierarchical organization. This deficiency hinders efficient experience retrieval and limits generalization to unseen long-horizon task compositions. Inspired by the hierarchical organization of human experience, we propose ECHO (Experience Consolidation and Hierarchical Organization), a novel memory framework operating within a Continuous Hierarchical Space. By employing a hyperbolic autoencoder, ECHO maps VLA hidden states into this space. Leveraging hyperbolic metrics and entailment constraint mechanisms, experience vectors are organized into a semantic memory tree that supports efficient top-down retrieval. In parallel, a background consolidation mechanism continuously refines the memory tree through geometric interpolation and structural splitting, supporting virtual memory synthesis in the continuous space. We integrate ECHO into the $π_0$ foundation model. Evaluations on LIBERO and preliminary real-world experiments demonstrate the effectiveness of our approach, notably achieving a 12.8% absolute improvement in execution success rate over the $π_0$ baseline on LIBERO-Long, while improving compositional generalization on cross-suite unseen long-horizon tasks.

2605.10991 2026-05-13 cs.LG cs.AI

Test-Time Personalization: A Diagnostic Framework and Probabilistic Fix for Scaling Failures

Linhai Zhang, Yulan He

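AI summary: This paper studies test-time personalization (TTP), an emerging direction that scales inference-time computation by sampling multiple candidates from a personalized policy model and selecting the best with a personalized reward model. It proves that under oracle selection the expected utility grows logarithmically with the number of samples, but that existing reward models fail to realize this potential. The authors derive a unified scaling law that reveals two failure modes and propose a probabilistic personalized reward model that mitigates them. Experiments show that the framework scales consistently across multiple policy models and text generation tasks.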

Abstract

Existing approaches to LLM personalization focus on constructing better personalized models or inputs, while treating inference as a single-shot process. In this work, we study Test-Time Personalization (TTP) along an unexplored axis: scaling inference-time computation by sampling N candidates from a personalized policy model and selecting the best with a personalized reward model. We prove that oracle selection yields expected utility growing logarithmically with the number of sampled candidates, establishing a theoretical ceiling for test-time scaling. However, standard reward models fail to realize this potential. To diagnose why, we derive a unified scaling law that decomposes any reward model's Best-of-N curve into four measurable quantities and reveals two failure modes, user-level collapse (near-constant prediction for some users) and query-level reward hacking (negative correlation with true quality for some queries). Guided by this law, we propose a probabilistic personalized reward model whose learned variance effectively mitigates both failure modes. Experiments confirm both elements of our framework: TTP delivers consistent scaling across multiple policy models and personalized text generation tasks, and our scaling law closely matches observed scaling curves across reward-model variants.
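
Best-of-N selection itself is a one-line operation; what matters is the quality of the scoring function. The sketch below contrasts oracle selection with a reward model that is negatively correlated with true quality on a query, the reward-hacking failure mode named in the abstract. The candidate strings and utility values are invented for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def best_of_n(candidates: Sequence[T], reward: Callable[[T], float]) -> T:
    """Return the candidate that the reward model scores highest."""
    return max(candidates, key=reward)

# Hypothetical true utilities for three sampled responses to one query.
true_utility = {"draft A": 0.2, "draft B": 0.9, "draft C": 0.5}
candidates = list(true_utility)

# Oracle selection: the scorer is the true utility itself.
oracle = best_of_n(candidates, lambda c: true_utility[c])
# A reward model that is negatively correlated with true quality on this query.
hacked = best_of_n(candidates, lambda c: -true_utility[c])
print(oracle, "|", hacked)  # draft B | draft A
```

With a misaligned scorer, sampling more candidates makes the selected response worse rather than better, because the maximum over the proxy lands on the lowest true utility.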

2605.10988 2026-05-13 cs.LG cs.AI

Seeing the Needle in the Haystack: Towards Weakly-Supervised Log Instance Anomaly Localization via Counterfactual Perturbation

Yutszyuk Wong, Wentai Wu, Yuen-Ying Yeung, Weiwei Lin

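AI summary: This paper studies how to localize anomalies at the level of individual log instances in large-scale networked systems when only bag-level labels are available. The authors propose LogMILP, which combines multi-instance learning, prototype guidance, and counterfactual perturbation consistency regularization to achieve anomaly detection and localization under weak supervision. Experiments show strong detection performance and more reliable instance-level localization on several public datasets.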

Comments 6 pages, 2 figures

Abstract

Log anomaly detection is a critical task for system operations and security assurance. However, in large-scale networked systems, log data are generated in massive volumes while instance-level annotations are prohibitively expensive, which makes fine-grained anomaly localization very difficult. To address this challenge, we propose LogMILP (Log anomaly localization based on Multi-Instance Learning enhanced by prototypes and Perturbation), a weakly supervised framework that enables both bag-level anomaly detection and instance-level anomaly localization using only bag-level labels. Our method guides the model to pinpoint the critical log entries using prototype-guided structural modeling with counterfactual perturbation consistency regularization, thereby improving localization reliability and interpretability under coarse-grained supervision. Experimental results on three public datasets demonstrate that LogMILP achieves competitive detection performance while yielding significantly more reliable instance-level localization. Our code is open-sourced at https://github.com/YUK1207/LogMILP.

2605.10987 2026-05-13 cs.LG cs.AI cs.CR

AESOP: Adversarial Execution-path Selection to Overload Deep Learning Pipelines

Tingxi Li, Mingfang Ji, Ravishka Shemal Rathnasuriya, Simin Chen, Yitao Hu, Wei Yang

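AI summary: This paper studies efficiency attacks that arise from dynamic path selection in deep learning inference pipelines and proposes AESOP, an adversarial path-selection framework. By combining vulnerability-guided path ranking with adaptive loss weighting, the method inflates computation and latency, and experiments show strong attack effectiveness in both white-box and gray-box settings. The study shows that existing attacks targeting single models fall well short in dynamic pipelines, and that system-level defenses mitigate the attack but cannot fully prevent its impact.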

Abstract

Modern machine learning deployments increasingly compose specialized models into dynamic inference pipelines, where upstream components produce intermediate predictions that determine the workload and inputs of downstream components. The cost of processing an input is therefore not determined by any single model, but by two coupled factors: the per-inference cost of each invoked component and its workload volume. Because these pipelines run under hard real-time constraints, efficiency is a fundamental requirement for system availability. We show that this structure creates an efficiency-attack surface that existing methods targeting single models cannot exploit: on identical inputs and budgets, path-aware targeting inflates FLOPs by $2,407\times$ while the strongest single-model baseline achieves $117\times$ -- a $20\times$ gap attributable entirely to where the attack is directed. We formalize this as the adversarial path-selection problem and present AESOP, a framework combining vulnerability-guided path ranking with adaptive loss weighting. We evaluate AESOP on five pipelines plus a production-realistic deployment variant with batching, bounded buffering, and confidence-threshold defenses. AESOP achieves up to $2,407\times$ FLOPs and $419\times$ latency inflation in the white-box setting and $58\times$ FLOPs / $17\times$ latency inflation in the gray-box setting. Under system-level defenses, the attack is not neutralized but redirected: pipelines are forced to choose between throughput collapse ($0.578 \to 0.006$ input/s) and $96.7\%$ data loss to sustain throughput.
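
The two coupled cost factors can be made concrete with a small cost model. This is only a sketch of the accounting, not of the AESOP attack: the two-stage pipeline, the FLOP figures, and the workload counts are hypothetical numbers chosen for illustration.

```python
from typing import Sequence

def pipeline_flops(stages: Sequence[tuple[float, int]]) -> float:
    """Total cost: per-inference FLOPs times invocation count, summed over stages."""
    return sum(cost * volume for cost, volume in stages)

# Hypothetical two-stage pipeline: a detector feeding a per-object classifier.
benign = [(100.0, 1), (10.0, 5)]        # 5 detections on a normal input
single_model = [(400.0, 1), (10.0, 5)]  # attack inflates the detector's own cost
path_aware = [(100.0, 1), (10.0, 500)]  # attack inflates the downstream workload

base = pipeline_flops(benign)
print(pipeline_flops(single_model) / base)  # 3.0
print(pipeline_flops(path_aware) / base)    # 34.0
```

Even though the classifier is cheap per call, multiplying its invocation count dominates the total, which is the kind of leverage a single-model attack cannot reach.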

2605.10985 2026-05-13 cs.LG cs.AI q-bio.BM

Structural Interpretations of Protein Language Model Representations via Differentiable Graph Partitioning

Siddhant Dutta, Edward Tan Beng Wai, Soumick Sarker, Pasan Gunawardane, Jagath C. Rajapakse

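AI summary: This study proposes an interpretable approach to protein language model representations: it projects ESM-2 representations onto protein contact graphs via differentiable graph partitioning and uses the SoftBlobGIN network to learn functional substructures, improving both performance and interpretability on prediction tasks. The method requires no retraining of the language model and adds only a small number of parameters, yet performs strongly on enzyme classification and function prediction and automatically identifies biologically meaningful functional regions such as active-site residues and catalytic contact patterns. Experiments show that the framework markedly improves the accuracy and auditability of structural explanations, providing structure-level transparency for protein language models.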

Comments 19 Pages, 8 figures, 11 Tables, Submitted to NeurIPS 2026

Abstract

Protein language models such as ESM-2 learn rich residue representations that achieve strong performance on protein function prediction, but their features remain difficult to interpret as structural and evolutionary signals are encoded in dense latent spaces. We propose a plug-and-play framework that projects ESM-2 representations onto protein contact graphs and applies SoftBlobGIN, a lightweight Graph Isomorphism Network with differentiable Gumbel-softmax substructure pooling, to perform structure-aware message passing and learn coarse functional substructures for downstream prediction tasks. Across enzyme classification, SoftBlobGIN achieves 92.8% accuracy and 0.898 macro-F1. Unlike post hoc analysis of protein language models alone, our method produces directly auditable structural explanations: GNNExplainer recovers biologically meaningful active-site residues, spatially localized functional clusters, and catalytic contact patterns. On binding-site detection, SoftBlobGIN improves residue AUROC from $0.885$ using an ESM-2 linear probe to $0.983$, indicating that these structural explanations are not recoverable from language-model features alone. Learned blob partitions provide an additional layer of interpretability by automatically grouping residues into functional substructures, with blobs containing annotated active-site residues showing $1.85\times$ higher importance than other blobs ($ρ = 0.339$, $p = 0.009$), without any active-site supervision. Our framework requires no retraining of the language model, adds only $\sim$1.1M parameters, and generalises across ProteinShake tasks, achieving $F_{\max}$ of $0.733$ on Gene Ontology prediction and AUROC of $0.969$ on binding-site detection. We position this as an interpretable structural companion to protein language models that makes their predictions more transparent and auditable.

2605.10984 2026-05-13 cs.CV

Principle-Guided Supervision for Interpretable Uncertainty in Medical Image Segmentation

An Sui, Yuzhu Li, Gunter Schumann, Fuping Wu, Xiahai Zhuang

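AI summary: This paper studies interpretable uncertainty quantification in medical image segmentation, aiming to make a model's uncertainty estimates agree with how humans understand uncertainty. The authors propose three perception-aligned principles requiring the spatial distribution of uncertainty to reflect the contrast between image structures, the severity of image corruption, and the geometric complexity of anatomical structures. Based on these principles, they design a principle-guided uncertainty supervision framework (PriUS) that uses evidential learning to constrain the uncertainty distribution explicitly during training, and they introduce quantitative metrics for the consistency between uncertainty and the sources of image ambiguity. Experiments show that PriUS yields more consistent uncertainty estimates on several medical datasets while maintaining good segmentation performance.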

Comments 14 pages, 8 figures

Abstract

Uncertainty quantification complements model predictions by characterizing their reliability, which is essential for high-stakes decision making such as medical image segmentation. However, most existing methods reduce uncertainty to a scalar confidence estimate, leaving its spatial distribution semantically underconstrained. In this work, we focus on uncertainty interpretability, namely, whether estimated uncertainty behaves in a human-understandable manner with respect to sources of ambiguity. We identify three perception-aligned principles requiring the spatial distribution of uncertainty to reflect: (1) image contrast between structures, (2) severity of image corruption, and (3) geometric complexity in anatomical structures. Accordingly, we develop a principle-guided uncertainty supervision framework (PriUS) based on evidential learning, in which the corresponding supervision objectives are explicitly enforced during training. We further introduce quantitative metrics to measure the consistency between predicted uncertainty and image attributes that induce ambiguity. Experiments on ACDC, ISIC, and WHS datasets showed that, compared with state-of-the-art methods, PriUS produced more consistent uncertainty estimates while maintaining competitive segmentation performance.

2605.10981 2026-05-13 cs.LG cs.AI

$ξ$-DPO: Direct Preference Optimization via Ratio Reward Margin

Zhengyuan Fan, Zhonghua Wu, Yuxuan Du, Qun Chen

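AI summary This paper proposes a direct preference optimization method called $ξ$-DPO, aimed at the difficulty of hyperparameter tuning in the existing SimPO method. By recasting the objective as minimizing the distance between reward gaps and optimal margins, and by defining the reward as a ratio between the chosen and rejected responses, $ξ$-DPO effectively removes the dependence on the hyperparameter $β$ and obtains a more interpretable and stable margin $ξ$. The method needs no repeated tuning and gives more intuitive control over the relative separation between chosen and rejected responses, improving the efficiency and interpretability of direct preference optimization.

Details
English abstract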

Reference-free preference optimization has emerged as an efficient alternative to reinforcement learning from human feedback, with Simple Preference Optimization (SimPO) demonstrating strong performance by eliminating the explicit reference model through a simple objective. However, the joint tuning of the hyperparameters $β$ and $γ$ in SimPO remains a central challenge. We argue that this difficulty arises because the margin formulation in SimPO is not easily interpretable across datasets with different reward gap structures. To better understand this issue, we conduct a comprehensive analysis of SimPO and find that $β$ implicitly controls sample filtering, while the effect of $γ$ depends on the reward gap structure of the dataset. Motivated by these observations, we propose $ξ$-DPO: Direct preference optimization via ratio reward margin. We first reformulate the preference objective through an equivalent transformation, changing the optimization target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins. Then, we redefine the reward in a ratio form between the chosen and rejected responses, which effectively cancels the effect of $β$ and yields a bounded and interpretable margin. This margin is called the ratio reward margin and is denoted by $ξ$. Unlike the margin $γ$ in SimPO, $ξ$ explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward gap distribution, avoiding repeated trial-and-error tuning. ....
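
For context, the SimPO objective that the abstract analyzes can be sketched as follows. This is a minimal NumPy illustration of the published SimPO loss (length-normalized log-probability rewards scaled by $β$, with a target margin $γ$), not of $ξ$-DPO itself, whose ratio reward is not spelled out in the abstract. The function name and argument layout are assumptions made for illustration.

```python
import numpy as np


def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected, beta, gamma):
    """Mean SimPO loss over a batch.

    logp_* are summed token log-probabilities of each response under the
    policy; len_* are the response lengths in tokens. (Illustrative names.)
    """
    logp_chosen = np.asarray(logp_chosen, dtype=float)
    logp_rejected = np.asarray(logp_rejected, dtype=float)
    len_chosen = np.asarray(len_chosen, dtype=float)
    len_rejected = np.asarray(len_rejected, dtype=float)

    # Length-normalized implicit rewards, scaled by beta.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected

    # Reward gap minus the target margin gamma.
    margin = reward_chosen - reward_rejected - gamma

    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -margin)))


if __name__ == "__main__":
    # A wider reward gap gives a lower loss for the same beta and gamma.
    print(simpo_loss([-10.0], [-20.0], [10], [10], beta=2.0, gamma=0.5))
    print(simpo_loss([-10.0], [-40.0], [10], [10], beta=2.0, gamma=0.5))
```

Because $β$ multiplies the reward gap while $γ$ is subtracted from it, the two interact: rescaling $β$ changes how large a gap is needed to clear a fixed $γ$, which is the coupling the abstract identifies as the tuning difficulty.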

2605.10980 2026-05-13 cs.LG cs.AI

LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

Haohui Zhang, Zhiye Wang, Xiaoying Gan, Xinbing Wang, Bo Jiang

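AI summary This paper proposes LEAP, a method that improves the parallel decoding capability of diffusion language models (dLLMs) by detecting tokens that converge early. Conventional methods rely on high confidence thresholds to preserve accuracy, but this requirement limits parallelism. Using future-context filtering and multi-sequence superposition, LEAP identifies, without any training, tokens that have already converged to correct predictions, enabling earlier decoding and markedly reducing inference latency and decoding steps. Experiments show that LEAP improves decoding efficiency across multiple domains while preserving model accuracy.

Details
English abstract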

Diffusion Language Models (dLLMs) have garnered significant attention for their potential in highly parallel processing. The parallel capabilities of existing dLLMs stem from the assumption of conditional independence at high confidence levels, which ensures negligible discrepancy between the marginal and joint distributions. However, the stringent confidence thresholds required to preserve accuracy severely constrain the scalability of parallelism. Through systematic token-level statistical analysis, we reveal that a substantial proportion of tokens converge to their correct predictions early in the denoising process yet fail to reach standard confidence thresholds, confirming that current confidence-based criteria are overly conservative. In response, we introduce LEAP (Lookahead Early-Convergence Token Detection for Accelerated Parallel Decoding). LEAP is a training-free, plug-and-play method that leverages future context filtering and multi-sequence superposition to detect early-converging tokens. By validating the alignment between early convergence and correctness, we enable reliable early decoding of these tokens. Benchmarking across diverse domains demonstrates that LEAP significantly lowers inference latency and decoding steps. Compared to confidence-based decoding, the average number of denoising steps is reduced by about 30%. On the GSM8K dataset, combining LEAP with dParallel accelerates decoding to 7.2 tokens per step while preserving model precision. LEAP effectively breaks the reliance on high-confidence priors, offering a novel paradigm for parallel decoding.

2605.10975 2026-05-13 cs.LG cs.AI

Hierarchical Multi-Scale Graph Neural Networks: Scalable Heterophilous Learning with Oversmoothing and Oversquashing Mitigation

Md Sazzad Hossen, Avimanyu Sahoo

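AI summary This paper studies scalable learning for classification on heterophilous graphs, where adjacent nodes carry different labels. To address the aggregation bias, oversmoothing, and oversquashing that existing graph neural networks suffer from on heterophilous data, it proposes HMH, a hierarchical multi-scale graph neural network framework. The method learns feature- and structure-aware signed affinities, builds a soft graph hierarchy, applies spectral filtering at each level with a sparse orthonormal Haar basis, and adds skip-connection unpooling layers, which mitigates hub domination and long-range signal squashing. Experiments show that HMH outperforms existing methods on both node and graph classification while running in near-linear time.

Details
English abstract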

Graphs with heterophily, where adjacent nodes carry different labels, are prevalent in real-world applications, from social networks to molecular interactions. However, existing spectral Graph Neural Network (GNN) approaches tailored for heterophilous graph classification suffer from hub-dominated aggregation (hubs being nodes with large degree) and oversmoothing, as their suboptimal polynomial filters introduce approximation errors and blend distant signals. To address the degree-biased aggregation and suboptimal polynomial filtering, we introduce Hierarchical Multi-view HAAR (HMH), a novel spectral graph-learning framework that scales in near-linear time. HMH first learns feature- and structure-aware signed affinities via a heterophily-aware encoder, then constructs a soft graph hierarchy guided by these embeddings. At each hierarchical level, HMH constructs a sparse, orthonormal, and locality-aware Haar basis to apply learnable spectral filters in the frequency domain. Finally, skip-connection unpooling layers combine outputs from all hierarchical levels back into the original graph, effectively preventing hub domination and long-range signal bottleneck (over-squashing). Experimentation shows that HMH outperforms state-of-the-art spectral baselines, achieving up to a 3% improvement on node classification and 7 percentage points on graph classification datasets, all while maintaining near-linear scalability.

2605.10974 2026-05-13 cs.LG cs.AI

Vertex-Softmax: Tight Transformer Verification via Exact Softmax Optimization

Navid Rezazadeh, Arash Gholami Davoodi

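AI summary This paper proposes Vertex-Softmax, a new method for tightening certified verification of Transformer attention. By exactly optimizing the softmax function under interval constraints on the pre-softmax scores, the authors prove that the optimum is always attained at a vertex of the constraint box, and on that basis build the Vertex-Softmax primitive with log-linear complexity. Experiments show that the method significantly improves certified rates and tightens lower bounds on multiple datasets, at lower computational cost than existing methods.

Details
English abstract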

Certified verification of transformer attention requires bounding the softmax function over interval constraints on the pre-softmax scores. Existing verifiers relax softmax independently of the downstream objective, leaving avoidable slack. We prove that the exact optimum of this score-box problem is attained at a vertex of the constraint box, and establish a threshold structure theorem showing that, after sorting the objective coefficients, the optimum lies among only linearly many candidates, yielding the Vertex-Softmax primitive with log-linear complexity in the sequence length. We further prove a formal optimality result showing that Vertex-Softmax is the tightest sound bound obtainable from score intervals alone, characterizing precisely what additional structure (score correlations, score-value coupling) is needed for further improvement. Integrated into a CROWN (Convex Relaxation based Optimization for Worst-case Neurons)-style verifier with a formal soundness guarantee, Vertex-Softmax significantly improves certified rates and substantially tightens lower bounds across MNIST, Fashion-MNIST, and CIFAR-10 attention models, while consistently matching or outperforming alpha-CROWN and branch-and-bound baselines at a fraction of their cost.
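
The threshold structure described in this abstract can be illustrated with a minimal sketch, assuming the score-box problem is to minimize a linear objective $\sum_i c_i\,\mathrm{softmax}(s)_i$ over scores $s_i \in [l_i, u_i]$. The function names and the prefix-sum layout below are illustrative assumptions, not the authors' implementation. The objective is monotone in each score, with the direction set by whether $c_i$ lies above or below the current objective value, so a minimizer places the scores with the smallest coefficients at their upper bounds and the rest at their lower bounds. After sorting by $c$, that leaves $n+1$ candidate vertices.

```python
import itertools

import numpy as np


def vertex_softmax_min(c, lo, hi):
    """Lower bound of sum_i c_i * softmax(s)_i over the box lo <= s <= hi.

    Sorts by c, then evaluates the n+1 threshold vertices in which the k
    smallest coefficients take their upper score bound and the rest take
    their lower bound. Cost is dominated by the sort.
    """
    c, lo, hi = (np.asarray(a, dtype=float) for a in (c, lo, hi))
    order = np.argsort(c)
    c, lo, hi = c[order], lo[order], hi[order]

    shift = hi.max()  # shift exponents for numerical stability
    e_lo, e_hi = np.exp(lo - shift), np.exp(hi - shift)

    # prefix[k]: contribution of the first k entries placed at the upper bound
    pre_w = np.concatenate(([0.0], np.cumsum(e_hi)))
    pre_cw = np.concatenate(([0.0], np.cumsum(c * e_hi)))
    # suffix[k]: contribution of entries k..n-1 placed at the lower bound
    suf_w = np.concatenate((np.cumsum(e_lo[::-1])[::-1], [0.0]))
    suf_cw = np.concatenate((np.cumsum((c * e_lo)[::-1])[::-1], [0.0]))

    values = (pre_cw + suf_cw) / (pre_w + suf_w)
    return float(values.min())


def brute_force_min(c, lo, hi):
    """Reference answer: enumerate all 2^n vertices of the box."""
    c, lo, hi = (np.asarray(a, dtype=float) for a in (c, lo, hi))
    best = np.inf
    for mask in itertools.product([False, True], repeat=len(c)):
        s = np.where(np.array(mask), hi, lo)
        w = np.exp(s - s.max())
        best = min(best, float(np.dot(c, w) / w.sum()))
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    c = rng.normal(size=8)
    lo = rng.normal(size=8)
    hi = lo + rng.uniform(0.1, 2.0, size=8)
    print(vertex_softmax_min(c, lo, hi), brute_force_min(c, lo, hi))
```

The upper bound follows by symmetry: negate `c`, take the minimum, and negate the result.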

2605.10973 2026-05-13 cs.LG cs.AI

Rotation-Preserving Supervised Fine-Tuning

Hangzhan Jin, Tianwei Ni, Lu Li, Pierre-Luc Bacon, Mohammad Hamdaqa, Doina Precup

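AI summary Supervised fine-tuning (SFT) improves in-domain performance but can harm out-of-domain generalization. This paper proposes Rotation-Preserving Supervised Fine-Tuning (RPSFT), which preserves projected rotations in the singular subspaces of pretrained weight matrices as an efficient approximation of Fisher-sensitive directions, limiting unnecessary weight rotation while retaining task adaptation. Experiments on a range of models trained on math reasoning data show that RPSFT improves the balance between in-domain and out-of-domain performance, better preserves pretrained representations, and provides a stronger initialization for subsequent reinforcement-learning fine-tuning.

Comments 31 pages, 13 figures

Details
English abstract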

Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss-sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher-sensitive directions, which we call Rotation-Preserving Supervised Fine-Tuning (RPSFT). RPSFT penalizes changes in the projected top-$k$ singular-vector block of each pretrained weight matrix, limiting unnecessary rotation while preserving task adaptation. Across model families and sizes trained on math reasoning data, RPSFT improves the in-domain/OOD trade-off over standard SFT and strong SFT baselines, better preserves pretrained representations, and provides stronger initializations for downstream RL fine-tuning. Code is available at https://github.com/jinhangzhan/RPSFT.
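
One way to picture a penalty on the "projected top-$k$ singular-vector block" is sketched below. The abstract does not give the exact regularizer, so everything here is an assumption made for illustration: the function name, the choice to project the fine-tuned matrix onto the pretrained top-$k$ left and right singular vectors, and the squared Frobenius distance. The authors' repository is the reference for the actual method.

```python
import numpy as np


def projected_block_penalty(w, w0, k):
    """Hypothetical penalty on drift inside the pretrained top-k subspace.

    w  : current (fine-tuned) weight matrix
    w0 : pretrained weight matrix
    k  : number of leading singular directions to protect
    """
    # Top-k singular vectors of the pretrained matrix.
    u, s, vt = np.linalg.svd(w0, full_matrices=False)
    u_k, vt_k = u[:, :k], vt[:k, :]

    # k x k block of each matrix expressed in the pretrained singular bases.
    # For w0 itself this block is diag(s[:k]).
    block_pretrained = np.diag(s[:k])
    block_current = u_k.T @ w @ vt_k.T

    # Squared Frobenius distance between the two blocks.
    return float(np.sum((block_current - block_pretrained) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w0 = rng.normal(size=(6, 5))
    u, s, vt = np.linalg.svd(w0, full_matrices=False)

    # An update confined to directions outside the top-2 subspaces is not
    # penalized, while an update mixing two top directions is.
    outside = np.outer(u[:, 3], vt[3, :])
    inside = np.outer(u[:, 0], vt[1, :])
    print(projected_block_penalty(w0 + outside, w0, k=2))
    print(projected_block_penalty(w0 + inside, w0, k=2))
```

Under these assumptions the penalty leaves updates outside the dominant subspaces free, which matches the abstract's stated goal of limiting unnecessary rotation while preserving task adaptation.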

2605.10971 2026-05-13 cs.LG cs.AI cs.CL

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

Hanhan Zhou, Shamik Roy, Rashmi Gangadharaiah

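AI summary Discrete diffusion language models (DLMs) generate text by parallel denoising, offering an alternative to autoregressive generation. This paper shows that controlled-generation methods carried over from autoregressive models, which apply a uniform intervention at every denoising step, degrade generation quality, and that the problem worsens when multiple attributes are controlled jointly. By training sparse autoencoders, the study finds that different attributes commit at different times, strengths, and rates during denoising. It therefore proposes an adaptive scheduler that concentrates interventions on the steps where an attribute is forming, substantially improving control precision while preserving generation quality, especially on joint multi-attribute control.

Comments preprint, 47 pages

Details
English abstract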

Discrete diffusion language models (DLMs) generate text by iteratively denoising all positions in parallel, offering an alternative to autoregressive models. Controlled generation methods for DLMs, imported from autoregressive models, apply uniform intervention at every denoising step. We show this uniform schedule degrades quality, and the damage compounds when multiple attributes are steered jointly. To diagnose the failure, we train sparse autoencoders on four DLMs (124M-8B parameters) and find that different attributes commit on distinct schedules, varying in timing, sharpness, and magnitude. For instance, topic commits within the first 2% of denoising, whereas sentiment emerges gradually over 20% of the process. Consequently, uniform intervention wastes steering capacity on steps where the target attribute has already solidified or has yet to emerge. We propose a novel adaptive scheduler that concentrates interventions on the steps where an attribute is actively forming and leaves the rest of generation untouched. The cost-control trade-off admits a closed-form characterization: the advantage of adaptive over uniform scheduling is governed by a single dispersion statistic of the commitment distribution. Across four DLMs and seven steering tasks, our method achieves precise control without the degradation typical of uniform interventions. Especially on challenging simultaneous three-attribute control, it reaches up to 93% steering strength, beating the strongest baseline by up to 15 percentage points while preserving generation quality.

2605.10959 2026-05-13 cs.LG cs.AI

QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization

Xiantao Jiang

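AI summary There is currently no unified metric for evaluating the efficiency of quantized neural networks. This paper proposes QuIDE, which introduces the Intelligence Index I = (C × P)/log₂(T+1) to collapse the trade-off among compression ratio, accuracy, and latency into a single score. Experiments reveal a task-dependent Pareto knee: 4-bit quantization performs best on MNIST and large language models, while 8-bit quantization suits complex CNN tasks better. QuIDE also provides a reproducible evaluation protocol and a fitness function for mixed-precision search.

Comments 16 pages, 9 figures

Details
English abstract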

There is currently no unified metric for evaluating the efficiency of quantized neural networks. We propose QuIDE, built around the Intelligence Index $I = (C \times P)/\log_2(T+1)$, which collapses the compression-accuracy-latency trade-off into a single score. Experiments across six settings -- SimpleCNN (MNIST, CIFAR), ResNet-18 (ImageNet-1K), and Llama-3-8B -- show a task-dependent Pareto Knee. 4-bit quantization is optimal for MNIST and LLMs, while 8-bit is the sweet spot for complex CNN tasks (ResNet-18 on ImageNet), where 4-bit PTQ collapses accuracy catastrophically. The accuracy-gated variant I' correctly flags these non-viable configurations that the raw I would reward. QuIDE provides a reproducible evaluation protocol and a ready-to-use fitness function for mixed-precision search.
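
The Intelligence Index in this abstract is simple enough to compute directly. The sketch below assumes C is the compression ratio, P the task accuracy, and T a positive latency measurement, following the "compression-accuracy-latency" wording. The abstract does not define the accuracy-gated variant I', so the gate shown here (a score of zero below a chosen accuracy floor) is purely a hypothetical stand-in, and both function names are illustrative.

```python
import math


def intelligence_index(compression, accuracy, latency):
    """I = (C * P) / log2(T + 1), as stated in the abstract.

    The meaning and units of each argument are assumptions; latency must be
    positive so that the denominator is positive.
    """
    if latency <= 0:
        raise ValueError("latency must be positive")
    return compression * accuracy / math.log2(latency + 1)


def gated_index(compression, accuracy, latency, accuracy_floor):
    """Hypothetical accuracy gate, NOT the paper's definition of I'.

    Returns 0.0 for configurations whose accuracy falls below the floor,
    so that a collapsed model cannot score well through compression alone.
    """
    if accuracy < accuracy_floor:
        return 0.0
    return intelligence_index(compression, accuracy, latency)


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    print(intelligence_index(compression=4.0, accuracy=0.9, latency=3.0))
    print(gated_index(compression=8.0, accuracy=0.1, latency=3.0, accuracy_floor=0.5))
```

Used as a fitness function for mixed-precision search, a score of this shape rewards compression and accuracy multiplicatively while discounting slow configurations only logarithmically, which is why some form of gate is needed for configurations where accuracy collapses.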

2605.10865 2026-05-13 cs.AI cs.CV cs.SE

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

Haozhe Zhang, Kaichen Liu, Miaomiao Chen, Lei Li, Shaojie Yang, Cheng Peng, Hanjie Chen

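AI summary BenchCAD is a comprehensive benchmark for industrial CAD programming, designed to evaluate a model's ability to generate executable parametric CAD programs from visual or textual inputs. It contains 17,900 verified CadQuery programs covering 106 industrial part families, and evaluates perception, parametric abstraction, and program synthesis through tasks including visual question answering, code question answering, and image-to-code generation. Experiments show that current mainstream models can recover a part's coarse outer shape but still fall well short of producing accurate parametric CAD programs, for example by missing fine 3D structure and misreading engineering parameters, highlighting where industrial CAD automation most needs improvement.

Comments 9 pages, 7 figures

Details
English abstract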

Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.

2605.10815 2026-05-13 cs.AI eess.AS

Probing Cross-modal Information Hubs in Audio-Visual LLMs

Jihoo Jung, Chaeyoung Jung, Ji-Hoon Kim, Joon Son Chung

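AI summary This paper studies how cross-modal information flows in audio-visual large language models (AVLLMs), focusing on how information is encoded between the audio and visual modalities. Empirical analysis shows that AVLLMs integrate cross-modal information mainly in so-called sink tokens, and that a specific subset of them, termed cross-modal sink tokens, specializes in storing it. Building on this finding, the authors propose a training-free hallucination mitigation method that improves model performance by strengthening reliance on the integrated information in cross-modal sink tokens.

Comments Accepted by ICML 2026

Details
English abstract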

Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal-hub.

2605.10780 2026-05-13 cs.CV cs.AI

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

Xuanyu Zhu, Yan Bai, Yang Shi, Yihang Lou, Yuanxing Zhang, Jing Jin, Yuan Zhou

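AI summary This study proposes DRoRAE, a multi-layer representation fusion method that improves feature extraction from visual encoders. Unlike existing methods that use only last-layer features, DRoRAE fuses the features of all intermediate layers through energy-constrained routing and incremental correction, recovering detail lost to multiple layers of semantic abstraction. Experiments show significant gains on image reconstruction and generation, and reveal a predictable relationship between representation richness and reconstruction quality, offering a new basis for visual tokenizer design.

Details
English abstract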

Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law ($R^2 = 0.86$) between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.

2605.10201 2026-05-13 cs.RO cs.AI

HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

Zhenhao Shen, Zeming Yang, Yue Chen, Yuran Wang, Shengqiang Xu, Mingleyang Li, Hao Dong, Ruihai Wu

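AI summary This study addresses generalizable robot manipulation involving interactions between heterogeneous objects, focusing on two core questions: where to manipulate and how to manipulate. It proposes HeteroGenManip, a two-stage framework that decouples the initial grasp from the complex interaction process and combines structural priors with a multi-foundation-model diffusion policy, markedly improving robustness and generalization. Experiments show significant performance gains across a variety of simulated and real-world tasks.

Details
English abstract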

Generalizable manipulation involving cross-type object interactions is a critical yet challenging capability in robotics. To reliably accomplish such tasks, robots must address two fundamental challenges: "where to manipulate" (contact point localization) and "how to manipulate" (subsequent interaction trajectory planning). Existing foundation-model-based approaches often adopt end-to-end learning that obscures the distinction between these stages, exacerbating error accumulation in long-horizon tasks. Furthermore, they typically rely on a single uniform model, which fails to capture the diverse, category-specific features required for heterogeneous objects. To overcome these limitations, we propose HeteroGenManip, a task-conditioned, two-stage framework designed to decouple the initial grasp from complex interaction execution. First, the Foundation-Correspondence-Guided Grasp module leverages structural priors to align the initial contact state, thereby significantly reducing the pose uncertainty of grasping. Subsequently, the Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models, integrating fine-grained geometric information with highly variable part features via a dual-stream cross-attention mechanism. Experimental evaluations demonstrate that HeteroGenManip achieves robust intra-category shape and pose generalization. The framework achieves an average 31% performance improvement in simulation tasks with broad type setting, alongside a 36.7% gain across four real-world tasks with different interaction types.

2605.10125 2026-05-13 cs.AI cs.HC

Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research

Anthea Dathe, Kiran Hoffmann, Aline Mangold

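AI summary This study evaluates the use of AI tools in academic research, focusing on the usefulness and limitations of question-answering and literature review tools. It proposes an evaluation framework combining human-centered and computer-centered metrics, and finds that question-answering tools provide useful overviews but are not reliable enough for precise information extraction, while literature review tools help with exploratory search but lack reproducibility and transparency. The study stresses the importance of improving the explainability of AI tools and notes that integrating AI into research workflows still depends on human verification.

Details
English abstract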

Artificial intelligence (AI) tools are being incorporated into scientific research workflows with the potential to enhance efficiency in tasks such as document analysis, question answering (Q&A), and literature search. However, system outputs are often difficult to verify, lack transparency in their generation, and remain prone to errors. Suitable benchmarks are needed to document and evaluate arising issues. Nevertheless, existing benchmarking approaches do not adequately capture human-centered criteria such as usability, interpretability, and integration into research workflows. To address this gap, the present work proposes and applies a benchmarking framework combining human-centered and computer-centered metrics to evaluate AI-based Q&A and literature review tools for research use. The findings suggest that Q&A tools can offer valuable overviews and generally accurate summaries; however, they are not always reliable for precise information extraction. Explainable AI (xAI) accuracy was particularly low, meaning highlighted source passages frequently failed to correspond to generated answers. This shifted the burden of validation back onto the researcher. Literature review tools supported exploratory searches but showed low reproducibility, limited transparency regarding chosen sources and databases, and inconsistent source quality, making them unsuitable for systematic reviews. A comparison of these tool groups reveals a similar pattern: while AI tools can enhance efficiency in the early stages of the research workflow and shallow tasks, their outputs still require human verification. The findings underscore the importance of explainability features to enhance transparency, verification efficiency, and careful integration of AI tools into researchers' workflows. Further, human-centered evaluation remains an important concern to ensure practical applicability.