arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09820 2026-05-12 cs.LG

Dystruct: Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference

Bian Sun, Kevin Zhai, Mubarak Shah, Zhenyi Wang

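Affiliations * University of Central Florida

AI summary This paper proposes Dystruct, a Bayesian-inference-based, dynamically structured decoding method for diffusion language models, aimed at the fixed generation length and limited flexibility of existing diffusion language models. The method requires no additional training: it models variable-length generation as a dynamic structural inference problem and jointly optimizes the generation length, block boundaries, and decoding schedule, enabling flexible block expansion and organization while keeping the generated content coherent. Experiments show that the method significantly improves generation quality and flexibility on multiple benchmarks, offering a principled and efficient solution for structured text generation.

Abstract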

Diffusion language models (DLMs) have recently emerged as a promising alternative to autoregressive models, primarily due to their ability to enable parallel decoding. Despite this advantage, most existing DLMs rely on a fixed generation length specified prior to decoding, which restricts their flexibility in real-world applications. While a few recent works attempt to support flexible-length generation, they typically suffer from notable limitations: some require costly retraining to accommodate variable-length outputs, while others depend solely on local confidence signals during decoding. Such local criteria fail to capture the evolving structure of the sequence, often resulting in suboptimal generation quality. In this paper, we propose a training-free, Bayesian structured decoding framework that formulates flexible-length generation as a dynamic structural inference problem, jointly computing the expansion length, the block boundaries, and the decoding schedule. At each window expansion step, the method integrates local uncertainty with structural signals via a unified mechanism that supports dynamic structured generation, including both flexible block expansion and block organization, while maintaining coherence. Extensive experiments across multiple benchmarks demonstrate that our approach significantly improves generation quality and flexibility over existing fixed-length and flexible-length baselines. These results highlight the advantage of Bayesian structured decoding for diffusion language models, providing a principled and efficient solution for structured text generation.

2605.09818 2026-05-12 cs.LG

Learning to Compress Time-to-Control: A Reinforcement Learning Framework for Chronic Disease Management

Prabhjot Singh, Abhishek Gupta, Chris Betz, Abe Flansburg, Brett Ives, Sudeep Lama, Jung Hoon Son

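Affiliations * Altitude

AI summary This study proposes a reinforcement learning framework for chronic disease management that aims to optimize long-term treatment outcomes by compressing time-to-control (TTC). It introduces two key structural elements, execution intensity and clinician capability weighting, which couple preference learning with reinforcement learning in a two-loop architecture to address reward sparsity and unstable policy evaluation in healthcare RL. Experiments show that the method significantly outperforms conventional approaches in simulated environments for chronic diseases such as diabetes and generalizes better across scenarios.

Comments 26 pages, 3 figures

Abstract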

Reinforcement learning (RL) in healthcare has had mixed results, with reward sparsity, unreliable off-policy evaluation, and the deployment-simulation gap as recurring failure modes. We argue that chronic disease management is structurally a more tractable RL setting than the acute-care problems the field has primarily studied, but only if the problem is formalized to exploit chronic care's properties. We propose such a formalization. The agent's objective is to compress time-to-control (TTC) under a tiered reward calibrated to the CMS ACCESS Model. Two quantities from our companion preference-learning paper [Singh et al. 2026] enter as load-bearing structural elements: the execution intensity ε bounds action availability under a constrained Markov Decision Process, and the clinician capability κ weights offline-data transitions during RL training. Together they couple preference learning and RL into a two-loop architecture. We present simulation results on synthetic state machines for hypertension and type 2 diabetes (T2D). Capability-weighted offline RL outperforms uniform-weighted offline RL and the behavior policy by 15 percentage points on T2D TTC; the uniform-weighted formulation (the standard in existing healthcare RL) underperforms even the heterogeneous behavior policy. ε-aware policies generalize across deployment regimes while ε-naive policies do not.

2605.09811 2026-05-12 cs.RO

Above and Below: Heterogeneous Multi-robot SLAM Across Surface and Underwater Domains

John McConnell, Armon Shariati, Paul Szenher, Yaxuan Li

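Affiliations * United States Naval Academy; Shield-AI; Stevens Institute of Technology

AI summary This paper studies heterogeneous multi-robot simultaneous localization and mapping (SLAM) between uncrewed surface vessels (USVs) and autonomous underwater vehicles (AUVs). Conventional methods rely on acoustic ranging, which is limited by environmental interference and synchronization requirements. The paper proposes a centralized multi-robot SLAM system based on visual loop-closure detection that fuses USV and AUV perception data to jointly optimize state estimates. Experiments show that the method significantly improves AUV localization accuracy in multi-robot collaboration scenarios, and that it is the first heterogeneous multi-robot SLAM system built on loop closures rather than acoustic ranging.

Abstract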

Multi-robot simultaneous localization and mapping (SLAM) is a fundamental task in multi-robot operations. Robots must have a common understanding of their location and that of their team members to complete coordinated actions. However, multi-robot SLAM between Uncrewed Surface Vessels (USVs) and Autonomous Underwater Vehicles (AUVs) has primarily been achieved through acoustic pinging between robots to retrieve range measurements, a technique that requires robots to be in similar locations simultaneously and to have an uninterrupted path for signal propagation, and that may necessitate synchronized clocks. This is especially challenging in complex, cluttered maritime environments, where structures may impede signals. However, these same structures may be observable above and below the water's surface, presenting an opportunity for inter-robot SLAM loop closure between USV and AUV data streams. This work builds upon recent research on inter-robot SLAM loop closure between USV and AUV data, extending it to propose a centralized multi-robot SLAM system. Each robot performs its own state estimation, and we detect loop closures between each AUV and the USV data. These inter-robot loop closures are used to merge each robot's state estimate into a centralized graph, yielding estimates for the whole time history of the USV and all AUVs in the system. Validation is performed using real-world perceptual data in three different environments. Results show reduced errors for AUVs in the multi-robot SLAM system compared to single-robot SLAM over the same trajectories. To our knowledge, this is the first instance of a multi-robot SLAM system with AUVs and USVs built on loop closures rather than acoustic distance measurements.

2605.09808 2026-05-12 cs.CL

Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants

Joseph Suh, Ayush Raj, Minwoo Kang, Serina Chang

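Affiliations * UC Berkeley

AI summary This paper studies how to evaluate the utility of user simulators for building collaborative LLM assistants, proposing to measure simulator quality by how the resulting assistant performs when interacting with humans in real settings. By comparing assistants trained with different user simulators (including role-playing LLMs and simulators fine-tuned on real conversation data), the experiments show that simulators fine-tuned on real data significantly improve assistant performance, whereas role-playing simulators struggle to close the gap even after optimization. The study further reveals how simulator model size and realism-enhancing methods affect training outcomes, and stresses that performance with real users should be the core criterion for evaluating user simulator quality.

Abstract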

User simulators are increasingly leveraged to build interactive AI assistants, yet how to measure the quality of these simulators remains an open question. In this work, we show how simulator quality can be quantified in terms of its downstream utility: how an LLM assistant trained with this user simulator performs in the wild when interacting with real humans. In a controlled experiment where only the user simulator varies, we train LLM assistants via reinforcement learning against a spectrum of simulators, from an LLM prompted to role-play a user to one fine-tuned on human utterances from WildChat. As evaluation, we measure pairwise win rates in a user study with 283 participants and on WildBench, a benchmark derived from real human-AI conversations. Training against the role-playing LLM yields an assistant statistically indistinguishable from the initial assistant in our user study (51% win rate), whereas training against the fine-tuned simulator yields significant gains (58% over the initial assistant and 57% over the one trained against the role-playing LLM). Closer inspection reveals three further patterns: methods for making role-playing LLMs more realistic (e.g., persona conditioning) improve trained assistants but do not close the gap to the fine-tuned simulator; scaling the simulator's model size benefits the fine-tuned simulator but yields no gain for role-playing ones; and assistants trained against role-playing simulators fail to generalize when paired with other simulators at test time, while the one trained against the fine-tuned simulator does. Together, these results argue for grounding user simulators in real human behavior and measuring their quality by their downstream effect on real users.

2605.09806 2026-05-12 cs.LG cs.AI

LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

Songtao Wei, Yi Li, Zhikai Li, Xu Hu, Yuede Ji, Guanpeng Li, Feng Chen, Carl Yang, Zhichun Guo, Bingzhe Li

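Affiliations * University of Texas at Dallas; Emory University; Individual Researcher; University of Texas at Arlington; University of Florida

AI summary This paper proposes LEAD, a method that addresses the verbose, inefficient outputs of large language models during reasoning. LEAD introduces an online self-adaptive mechanism that dynamically adjusts the balance between correctness and efficiency, and estimates a suitable length for each problem from the model's own correct reasoning results, substantially compressing output length while preserving accuracy. Experiments show that LEAD achieves the highest accuracy and the highest combined accuracy-efficiency score on multiple mathematical reasoning benchmarks.

Abstract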

Large reasoning models, such as OpenAI o1 and DeepSeek-R1, tend to become increasingly verbose as their reasoning capabilities improve. These inflated Chain-of-Thought (CoT) trajectories often exceed what the underlying problems require, wasting compute, latency, and context budgets. While introducing length-based efficiency rewards during reinforcement learning offers a natural remedy, existing methods struggle with two fundamental challenges: the optimal balance between correctness and efficiency is non-stationary throughout training, and intrinsic reasoning budgets vary drastically across problems. Relying on static reward weights and global length constraints inevitably forces a compromise between degraded accuracy and unrealized compression. To overcome these limitations, we propose LEAD (Length-Efficient Adaptive and Dynamic reasoning), a method that replaces static heuristics with online, self-adaptive mechanisms. LEAD dynamically calibrates the correctness-efficiency trade-off at each step using a Potential-Scaled Instability, directing optimization capacity to the most informative learning signal. Furthermore, it estimates an adaptive per-problem target length online based on the model's own correct rollouts, applying a symmetric efficiency reward that penalizes both overthinking and over-compression. Evaluated on five mathematical reasoning benchmarks, LEAD achieves the highest accuracy and Accuracy-Efficiency Score among RL-trained efficient-reasoning methods while producing substantially shorter outputs than the base model.
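The abstract describes a per-problem target length estimated from the model's own correct rollouts and a symmetric efficiency reward, but it does not give the functional form. The sketch below is one possible reading, not the authors' formula: the use of the median, the linear fall-off, and the names `target_length` and `efficiency_reward` are all assumptions made for illustration.

```python
from statistics import median


def target_length(rollouts):
    """Per-problem target: median token length of the correct rollouts, or None.

    Assumption: the abstract only says the target comes from correct rollouts;
    the median is an illustrative choice.
    """
    correct = [r["length"] for r in rollouts if r["correct"]]
    return median(correct) if correct else None


def efficiency_reward(length, target):
    """Symmetric reward in [0, 1] that peaks at the target length.

    Assumption: a linear penalty on the relative deviation, so outputs that are
    too long (overthinking) and too short (over-compression) are treated alike.
    """
    if target is None:
        return 0.0
    return max(0.0, 1.0 - abs(length - target) / target)


# Toy rollouts for a single problem (made-up numbers).
rollouts = [
    {"length": 400, "correct": True},
    {"length": 600, "correct": True},
    {"length": 1500, "correct": False},
]
target = target_length(rollouts)
rewards = [efficiency_reward(r["length"], target) for r in rollouts]
```

Under these assumptions a rollout 100 tokens shorter than the target receives the same reward as one 100 tokens longer, which is the symmetry the abstract refers to.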

2605.09802 2026-05-12 cs.CV cs.AI cs.LG

CrossVL: Complexity-Aware Feature Routing and Paired Curriculum for Cross-View Vision-Language Detection

Zhipeng Liu, Chunbo Luo

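Affiliations * Department of Computer Science, University of Exeter

AI summary This paper studies the degradation of object detection performance of vision-language models (VLMs) in cross-view (e.g., ground and aerial) scenarios and proposes the CrossVL framework, which combines a complexity-aware feature routing mechanism with a paired curriculum learning strategy to strengthen the model's adaptation to images from different viewpoints. By estimating scene complexity and dynamically routing visual features, and by using the semantic consistency of synchronized ground-aerial image pairs for progressive training, the method improves detection accuracy and stability. Experiments show that CrossVL significantly improves detection performance on the MAVREC dataset and narrows the performance gap between viewpoints.

Comments Accepted to CVPR 2026. Code available at https://github.com/1nyourlife/Crossvl_cvpr2026

Abstract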

Vision-language models (VLMs) enable text-guided object detection but degrade severely under cross-view scenarios where ground and aerial viewpoints differ in altitude, scale, and spatial layout. These geometric changes introduce systematic complexity variations between viewpoints, e.g., ground-view images contain dense and highly occluded structures, while aerial images are sparse and globally organized. Fixed VLM fusion mechanisms cannot handle this discrepancy. We propose CrossVL, a framework combining Complexity-Aware Pathway Aggregation (CPA) and Paired Curriculum Learning (PCL) to enhance cross-view detection with VLMs. CPA estimates scene complexity from multimodal statistics and routes visual features through multiple pathways to obtain view-specific representations. PCL leverages semantic consistency of synchronized ground-aerial pairs to provide stable early supervision and then gradually shifts toward randomized sampling. On MAVREC, CrossVL improves Florence-2's aerial mAP from 58.66% to 61.03% and reduces the ground-aerial performance gap from 8.63pp to 6.65pp, while also achieving a 3.3x reduction in variance across random seeds. CPA provides stable complexity-aware feature aggregation, and PCL enhances optimization dynamics. Together, they demonstrate that coordinated architectural and training adaptations are crucial for robust cross-view VLM detection.

2605.09801 2026-05-12 cs.RO

Efficient Multi-Robot Motion Planning with Precomputed Translation-Invariant Edge Bundles

Himanshu Gupta, Paul Motter, Aritra Chakrabarty, Rishabh Sodani, Srikrishna Bangalore Raghu, Alessandro Roncone, Bradley Hayes, Zachary Sunberg

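Affiliations * Smead Aerospace Engineering Sciences, University of Colorado Boulder; Department of Computer Science, University of Colorado Boulder; Cherry Creek High School, Greenwood Village, CO, USA

AI summary This paper proposes KiTE-Extend, an efficient method for multi-robot motion planning that uses a precomputed library of translation-invariant trajectory segments to guide action selection during online planning, improving the ability of existing planners to generate collision-free, kinodynamically feasible trajectories. The method leaves the original planner's state propagation, collision checking, and cost evaluation unchanged and preserves its theoretical guarantees. Experiments show that KiTE-Extend significantly improves planning efficiency and scalability in multi-robot scenarios, notably across the three mainstream multi-robot planning paradigms: centralized, prioritized, and conflict-based.

Abstract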

Solving multi-robot motion planning (MRMP) requires generating collision-free, kinodynamically feasible trajectories for multiple interacting robots. We introduce Kinodynamic Translation-Invariant Edge Bundles, or KiTE-Extend, a planner-agnostic action selection mechanism for sampling-based kinodynamic motion planning. KiTE-Extend uses a library of trajectory segments computed offline to guide action selection during online planning, improving the ability of existing planners to identify feasible motion segments without altering state propagation, collision checking, or cost evaluation, and without changing their theoretical guarantees. While KiTE-Extend can modestly improve single-agent planners, its benefits are clearest in the multi-agent setting, where it explores more effectively and significantly improves planning through the dense spatiotemporal constraints introduced by robot-robot interaction. Through experiments on multiple kinodynamic systems and environments, we show that KiTE-Extend reduces planning time and improves scalability across the three most common MRMP paradigms: centralized, prioritized, and conflict-based.

2605.09795 2026-05-12 cs.CL

cantnlp@DravidianLangTech 2026: organic domain adaptation improves multi-class hope speech detection in Tulu

Andrew Li, Sidney Wong

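Affiliations * Lake Washington School District; University of Otago; Te Pūnaha Matatini

AI summary This paper presents the systems and results submitted to the hope speech detection task for code-mixed Tulu at DravidianLangTech 2026. The study uses an XLM-RoBERTa-based text classification model and performs domain adaptation on organically collected Tulu social media text, improving hope speech detection performance. Experiments show that the organically adapted model outperforms the baseline on the development set, offering a viable direction for improving hope speech detection in code-mixed languages.

Comments Accepted to Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026)

Abstract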

This paper presents our systems and results for the Hope Speech Detection in Code-Mixed Tulu Language shared task at the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026). We trained an XLM-RoBERTa-based text classification system for detecting hope speech in code-mixed Tulu social media comments. We compared this organically adapted hope speech detection model with our baseline model. On the development set, the organically adapted model outperformed the baseline system. While our submitted systems performed more modestly on the official test set, these results suggest that further adapting XLM-RoBERTa on organically collected Tulu social media text containing code-mixed and mixed-script variation can improve hope speech detection in code-mixed Tulu.

2605.09789 2026-05-12 cs.RO

Zero-Shot Sim-to-Real Robot Learning: A Dexterous Manipulation Study on Reactive Catching

Kejia Ren, Gaotian Wang, Andrew S. Morgan, Kaiyu Hang

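Affiliations * Department of Computer Science, Rice University; Robotics and AI Institute

AI summary This study explores how to apply robot manipulation policies learned in simulation directly to the real world in a zero-shot setting, focusing on dexterous catching tasks that require high precision and fast reactions. To address uncertainty in sim-to-real transfer, the authors propose a new domain randomization method, the Domain-Randomized Instance Set (DRIS), which propagates multiple randomized instances simultaneously to make policies more robust to variations in real-world dynamics. Experiments show that the method achieves reliable zero-shot transfer without real-world fine-tuning and exhibits strong robustness to noise in a catching task that offers no passive stabilizing structure.

Abstract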

Dexterous manipulation is physics-intensive and highly sensitive to modeling errors and perception noise, making sim-to-real transfer prohibitively challenging. Domain randomization (DR) is commonly used to improve the robustness of learned policies for such tasks, but conventional DR randomizes one instance per episode, offering very limited exposure to the variability of real-world dynamics. To this end, we propose the Domain-Randomized Instance Set (DRIS), which represents and propagates a set of randomized instances simultaneously, providing a richer approximation of uncertain dynamics and enabling policies to learn actions that account for multiple possible outcomes. Supported by theoretical analysis, we show that DRIS yields more robust policies and alleviates the need for real-world fine-tuning, even with a modest number of instances (e.g., 10). We demonstrate this on a challenging reactive catching task. Unlike traditional catching setups that use end-effectors designed to mechanically stabilize the object (e.g., curved or enclosing surfaces), our system uses a flat plate that offers no passive stabilization, making the task highly sensitive to noise and requiring rapid reactive motions. The learned policies exhibit strong robustness to uncertainties and achieve reliable zero-shot sim-to-real transfer.
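To make the idea of propagating a set of randomized instances concrete, here is a minimal sketch on a toy one-dimensional point mass. It is not the authors' simulator or task: the dynamics, the parameter ranges, and the names `sample_instances` and `step_set` are assumptions chosen only to show several randomized instances advancing together under the same action.

```python
import numpy as np


def sample_instances(n, rng):
    """Draw n randomized dynamics parameters (mass, drag); ranges are illustrative."""
    return {
        "mass": rng.uniform(0.8, 1.2, size=n),
        "drag": rng.uniform(0.0, 0.2, size=n),
    }


def step_set(states, force, params, dt=0.01):
    """Advance every instance in the set one step under the same applied force.

    states has shape (n, 2) holding position and velocity per instance.
    """
    pos, vel = states[:, 0], states[:, 1]
    acc = force / params["mass"] - params["drag"] * vel
    vel = vel + acc * dt
    pos = pos + vel * dt
    return np.stack([pos, vel], axis=1)


rng = np.random.default_rng(0)
n_instances = 10                      # the abstract mentions sets as small as 10
params = sample_instances(n_instances, rng)
states = np.zeros((n_instances, 2))   # all instances start at rest at the origin
for _ in range(100):
    states = step_set(states, force=1.0, params=params)

# Spread of final positions: how much the instances disagree about the outcome.
spread = states[:, 0].max() - states[:, 0].min()
```

A policy that observes the whole set, or a summary such as `spread`, sees at every step how uncertain the outcome of an action is, instead of a single randomized instance per episode.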

2605.09778 2026-05-12 cs.LG cs.CL

Nectar: Neural Estimation of Cached-Token Attention via Regression

João Monteiro, Michal Klein, Pierre Ablin, Marco Cuturi

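Affiliations * Apple

AI summary This paper proposes Nectar, a method for efficiently estimating attention over cached key-value pairs in long contexts. The core idea is to fit a compact neural network that approximates the attention output function, avoiding the high cost of traversing the entire cache for every query token. For each layer and KV head, Nectar fits a target network and a score network that predict the attention output and the log-normalizer, respectively; at inference time they replace the conventional $O(n)$ attention computation, significantly reducing computational complexity. Experiments show that Nectar closely approximates full attention on several large language models and long-context datasets and preserves semantic content in generation tasks.

Abstract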

Evaluating softmax attention over a fixed long context requires reading every cached key-value pair for each new query token. For a given context (a book, a manual, a legal corpus) the attention output is a deterministic function of the query. We propose Nectar, which fits a compact neural network to this function for queries drawn from a task-relevant distribution. Nectar fits two networks per layer and KV-head: a target network that predicts the attention output and a score network that predicts the log-normalizer. The pair plugs into the standard masked self-attention at inference time, replacing the $O(n)$ attention over the cache with a forward pass whose cost does not depend on $n$. Each module carries on the order of $|θ|$ parameters per layer and KV-head, typically much smaller than the $2nd$ KV-cache footprint at the same granularity. We report experiments on models from 1.7B to 8B parameters across five long-context datasets. The approximation error tracks the next-token accuracy gap to full attention, and allocating capacity non-uniformly across layers reduces that gap in our ablation. Beyond this analysis of metrics, we check that the text generations (following a question prompt) of a model equipped with a Nectar module match in semantic content those obtained by giving the same model access to the full cache.
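The abstract says the predicted attention output and log-normalizer plug into standard attention, without spelling out how. One natural mechanism, assumed here rather than taken from the paper, is the standard identity for merging two partial softmax-attention results through their log-normalizers. In the sketch below the cache-side pair is computed exactly as a stand-in for the two networks; `attend` and `merge` are hypothetical helper names.

```python
import numpy as np


def attend(q, keys, values):
    """Softmax attention for one query; returns (output, log-normalizer)."""
    scores = keys @ q / np.sqrt(q.shape[0])
    m = scores.max()                       # subtract the max for numerical stability
    weights = np.exp(scores - m)
    log_z = m + np.log(weights.sum())
    return weights @ values / weights.sum(), log_z


def merge(out_a, log_z_a, out_b, log_z_b):
    """Combine two partial attention results using their log-normalizers."""
    log_z = np.logaddexp(log_z_a, log_z_b)
    merged = np.exp(log_z_a - log_z) * out_a + np.exp(log_z_b - log_z) * out_b
    return merged, log_z


rng = np.random.default_rng(0)
d, n_cache, n_new = 8, 64, 4
q = rng.normal(size=d)
k_cache, v_cache = rng.normal(size=(n_cache, d)), rng.normal(size=(n_cache, d))
k_new, v_new = rng.normal(size=(n_new, d)), rng.normal(size=(n_new, d))

# Stand-ins for the target and score networks: computed exactly from the cache here.
cache_out, cache_log_z = attend(q, k_cache, v_cache)
new_out, new_log_z = attend(q, k_new, v_new)
merged, _ = merge(cache_out, cache_log_z, new_out, new_log_z)

# Reference: attention over the concatenation of cached and new tokens.
full, _ = attend(q, np.vstack([k_cache, k_new]), np.vstack([v_cache, v_new]))
```

With exact cache-side values, `merged` matches `full` up to floating-point error. If the pair `(cache_out, cache_log_z)` came from learned networks instead, the merge step would stay the same and its cost would not depend on the cache length.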

2605.09775 2026-05-12 cs.LG math.OC

Bayesian Optimization with Structured Measurements: A Vector-Valued RKHS Framework

Wenbin Wang, Colin N. Jones

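Affiliations * Automatic Control Laboratory, EPFL

AI summary This paper studies Bayesian optimization with structured measurements, where each observation is a multidimensional or functional output rather than a single scalar. The authors propose a framework based on vector-valued reproducing kernel Hilbert spaces (RKHS), define the objective as a linear functional of these measurements, and derive high-probability concentration bounds for the kernel ridge regression estimator in that space. Building on this, they design an algorithm with an upper confidence bound (UCB) acquisition function and give regret bounds under mild assumptions. Experiments show that the method improves sample efficiency and applies to multi-objective and time-varying settings.

Abstract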

Bayesian optimization (BO) is an efficient framework for optimizing expensive black-box functions. However, it is typically formulated as learning an end-to-end mapping from inputs to scalar objectives, thereby discarding the potentially rich information whenever a structured system output is available. In this work, we study Bayesian optimization over a vector-valued operator with structured measurements, where each measurement observes multidimensional or functional outputs, e.g., trajectories or spatial fields, rather than a single scalar value. The objective is then defined as a linear functional of these measurements. This allows each observation to reveal substantially richer information about the underlying system compared to scalar observations. Assuming the unknown operator lies in a vector-valued reproducing kernel Hilbert space (RKHS), we derive high-probability concentration bounds for the kernel ridge regression (KRR) estimator directly in the measurement space, characterizing uncertainty in a general Hilbert space. Building on these results, we propose an algorithm based on the upper confidence bound (UCB) acquisition function with regret guarantees under mild assumptions, recovering sublinear rates for common kernels. Empirically, we demonstrate that leveraging structured measurements leads to improved sample efficiency by enabling efficient transfer of information across objectives and adaptation to time-varying settings.
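The following sketch illustrates the general shape of such a procedure in the simplest case: a separable kernel (a scalar RBF kernel shared by all output dimensions), a kernel ridge regression estimate of the vector-valued measurement, and a UCB score for a linear functional of it. The toy system `measure`, the weights `w`, and the constant `beta` are assumptions for illustration; in particular, `beta` is not derived from the paper's concentration bounds.

```python
import numpy as np


def rbf(a, b, lengthscale=0.3):
    """RBF kernel matrix between 1-D input arrays a and b."""
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / lengthscale) ** 2)


def ucb_scores(x_train, y_train, x_cand, w, lam=1e-3, beta=2.0):
    """UCB of the objective <w, G(x)> given vector-valued measurements y_train (n, m)."""
    gram = rbf(x_train, x_train) + lam * np.eye(len(x_train))
    k_cand = rbf(x_cand, x_train)                      # shape (c, n)
    mean = k_cand @ np.linalg.solve(gram, y_train)     # (c, m) KRR estimate of G
    var = 1.0 - np.sum(k_cand * np.linalg.solve(gram, k_cand.T).T, axis=1)
    width = np.sqrt(np.clip(var, 0.0, None))           # shared by all output dimensions
    return mean @ w + beta * np.linalg.norm(w) * width, mean


def measure(x):
    """Toy system returning a 3-point 'trajectory' per input (illustrative only)."""
    t = np.array([0.0, 0.5, 1.0])
    return np.sin(3.0 * x[:, None] + t[None, :])


w = np.array([0.2, 0.3, 0.5])            # linear functional defining the objective
x_train = np.array([0.1, 0.5, 0.9])
y_train = measure(x_train)
x_cand = np.linspace(0.0, 1.0, 101)
scores, mean = ucb_scores(x_train, y_train, x_cand, w)
x_next = x_cand[np.argmax(scores)]       # next input to evaluate
```

Because the model is fit on the full measurement vector, changing `w` gives scores for a different objective without new evaluations, which is one way to read the abstract's remark about transferring information across objectives.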

2605.09774 2026-05-12 cs.CV

DRIVE-C: A Controlled Corruption Dataset for Autonomous Driving

Shiva Aher

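Affiliations * Georgia Institute of Technology

AI summary DRIVE-C is a controlled corruption dataset for evaluating the robustness of visual perception in autonomous driving systems, built from real-world driving videos across a variety of environments. Using physics-inspired synthetic degradations, the dataset comprises 10 clean clips and 600 corrupted clips, with detailed metadata and sensor health index annotations. DRIVE-C provides a controlled, reproducible testbed for robustness evaluation, degradation-aware modeling, uncertainty estimation, and sensor health monitoring of autonomous driving perception systems.

Abstract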

DRIVE-C is a controlled corruption dataset designed to evaluate visual perception robustness in autonomous driving systems. It is built from real-world forward-facing driving videos collected across daytime, nighttime, urban, rural, freeway, and parking environments. Clean clips are anonymized via localized face and license plate blurring, then transformed with physics-inspired synthetic degradations. The dataset contains 10 clean clips and 600 corrupted clips spanning 12 camera degradation types across five severity levels, with per-clip metadata and Global Sensor Health Index (GSHI) annotations. DRIVE-C supports robustness benchmarking, degradation-aware modeling, uncertainty estimation, out-of-distribution (OOD) detection, and sensor health monitoring for Advanced Driver Assistance Systems (ADAS). By providing pixel-aligned clean and degraded video clips with fully reproducible corruption parameters, DRIVE-C offers a structured testbed for studying perception reliability under controlled camera degradation.
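The clip counts in the abstract are consistent with a full cross-product of clean clips, degradation types, and severity levels (10 × 12 × 5 = 600). The sketch below enumerates such a manifest; the full cross-product is an inference from the numbers, and the identifiers are hypothetical since the dataset's actual naming scheme is not given here.

```python
from itertools import product

N_CLEAN_CLIPS = 10
N_CORRUPTION_TYPES = 12
N_SEVERITY_LEVELS = 5

# Hypothetical identifiers; the real dataset's naming scheme is not given here.
clips = [f"clip_{i:02d}" for i in range(N_CLEAN_CLIPS)]
corruptions = [f"corruption_{j:02d}" for j in range(N_CORRUPTION_TYPES)]
severities = range(1, N_SEVERITY_LEVELS + 1)

# One entry per (clean clip, degradation type, severity) combination.
manifest = [
    {"clean_clip": clip, "corruption": corruption, "severity": severity}
    for clip, corruption, severity in product(clips, corruptions, severities)
]
```

Each manifest entry pairs a corrupted clip with its clean source, which is what makes pixel-aligned clean-versus-degraded comparisons possible.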

2605.09773 2026-05-12 cs.CL cs.AI

Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

Cameron Berg, Roshni Lulla

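Affiliations * Reciprocal Research; Brain & Creativity Institute, University of Southern California

AI summary This study uses sparse autoencoder (SAE) feature steering to amplify Dark Triad traits (Machiavellianism, narcissism, and psychopathy) in Llama-3.3-70B-Instruct and evaluates the resulting behavioral changes with five psychometric instruments. The steered model shows stronger exploitation, aggression, and callousness in novel scenarios while its cognitive empathy remains unchanged, reproducing the empathy dissociation seen in human Dark Triad populations. The study also finds that exploitation and deception may be implemented through different computational pathways, and that the feature discovery method significantly affects intervention depth, suggesting that antisocial tendencies in the model may consist of separable components rather than a unified whole.

Comments 12 pages, 3 figures

Abstract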

We use sparse autoencoder (SAE) feature steering to amplify Dark Triad personality traits (Machiavellianism, narcissism, and psychopathy) in Llama-3.3-70B-Instruct and evaluate the resulting behavioral changes across five psychological instruments. The steered model becomes substantially more exploitative, aggressive, and callous on novel behavioral scenarios (d=10.62) while its cognitive empathy remains intact, reproducing the empathy dissociation characteristic of human Dark Triad populations. Critically, strategic deception is completely unaffected across all features, suggesting that exploitation and deception may operate through dissociable computational pathways in large language models. Individual feature analysis reveals non-redundant encoding, with each feature driving distinct antisocial mechanisms through separable computational pathways. We also show that the feature discovery method itself modulates intervention depth: contrastively discovered features change both self-report and behavior, while semantically searched features change only self-report (d=12.65 between methods on behavior). These findings suggest that antisocial tendencies in at least one large language model comprise dissociable components rather than a unified construct, with implications for how such tendencies should be detected, measured, and controlled.

2605.09771 2026-05-12 cs.AI

Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning

Ziquan Wei, Tingting Dan, Guorong Wu

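Affiliations * Department of Computer Science; Department of Psychiatry

AI summary This study aims to improve the personalization of disease prediction and reasoning by combining a generative model with a digital twin of social determinants of health (SDoH), addressing the insufficient modeling of social factors in existing models. It proposes a conditioned latent diffusion framework based on ICD-coded proxies that jointly models the temporal evolution of multi-organ sensor data and healthcare events, and introduces a geometric diffusion model for characterizing complex data such as brain networks. Experiments show that the method significantly outperforms existing disease generative models and imaging-trait generative baselines on the UK Biobank dataset.

Comments 21 pages, 8 figures, ICML 2026

Abstract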

Despite the central role of sensor-derived measurements such as imaging traits and plasma biomarkers in biomedical research and clinical practice, existing generative models for disease prediction largely depend on event-level representations from hospital and registry data. Given the multi-factorial nature of human disease, the absence of explicit modeling of social determinants of health (SDoH), even in the limited form of ICD-coded proxies (chapters Z and V-Y in ICD-10), limits the capacity for personalized disease modeling and clinical decision support. To address this limitation, we propose a generative model with ICD-coded proxies of SDoH for in silico modeling of disease reasoning, a conditioned latent diffusion framework that establishes the connection between multi-organ sensor data and tokenized healthcare events. Specifically, we introduce a novel geometric diffusion model to characterize the temporal evolution of complex data representations such as brain networks (region-to-region connectivity encoded in a graph), in parallel with diffusion models for tabular data from other organ systems. Together, we integrate the generative model with digitalized SDoH proxies for simulated intervention and reasoning of future disease trajectories. We conduct extensive experiments on the UK Biobank (UKB) dataset, which contains organ-specific imaging traits, including brain (44,834), heart (23,987), liver (28,722), and kidney (32,155), along with nearly 500k medical history sequences (age range: 25 to 89 years). Our model achieves significant improvements over state-of-the-art human disease autoregressive models and imaging trait generative baselines.

2605.09765 2026-05-12 cs.LG cs.AI

WISTERIA: Learning Clinical Representations from Noisy Supervision via Multi-View Consistency in Electronic Health Records

Ruan Dong, Yuanyun Zhang, Shi Li

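Affiliations * University of Science and Technology of China; Columbia University

AI summary This paper proposes WISTERIA, a weakly supervised representation learning framework for learning clinical representations from electronic health records (EHR). The method treats clinical labels as stochastic observations of a latent clinical state and learns robustly from noisy labels by constructing multiple weak supervision operators and enforcing consistency across their label distributions. It also introduces ontology-based regularization to strengthen the semantic structure of the label space. Experiments show that WISTERIA achieves better predictive performance, stronger robustness to noise, and better cross-institutional generalization on multiple EHR benchmark tasks.

Abstract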

Representation learning in electronic health records (EHR) has largely followed paradigms inherited from natural language processing, relying on sequence modeling and reconstruction-based objectives that treat clinical labels as ground truth. However, real-world clinical supervision is inherently weak, arising from heterogeneous, noisy, and institution-specific labeling processes such as billing codes, heuristic phenotypes, and incomplete annotations. In this work, we propose WISTERIA, a weakly supervised representation learning framework that models labels as stochastic observations of an underlying latent clinical state. Instead of optimizing against a single supervision signal, WISTERIA constructs multiple weak supervision operators and learns representations by enforcing consistency across their induced label distributions. This multi-view formulation induces an implicit denoising mechanism, allowing the model to recover clinically meaningful structure by reconciling disagreement between noisy labelers. We further incorporate ontology-aware regularization in the label space to impose semantic structure over supervision signals. Empirically, WISTERIA improves predictive performance across standard EHR benchmarks, demonstrates strong robustness to label noise, and exhibits superior cross-institutional generalization compared to sequence-based pretraining objectives. These results suggest that explicitly modeling the supervision process, rather than treating labels as fixed targets, provides a more appropriate inductive bias for learning robust and clinically meaningful representations from EHR data.

2605.09760 2026-05-12 cs.CL

ConFit v3: Improving Resume-Job Matching with LLM-based Re-Ranking

Xiao Yu, Ruize Xu, Chengyuan Xue, Junyu Chen, Matthew So, Shijun Ma, Bo Liu, Xiangye Liang, Zhou Yu

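Affiliations * Columbia University; Johns Hopkins University; Intellipro Group Inc.

AI summary This paper proposes ConFit v3, a large language model (LLM)-based re-ranking method aimed at improving resume-job matching. The study analyzes the training pipeline of LLM re-rankers for person-job fit and proposes several optimizations, including multi-pass re-ranking, listwise reinforcement learning, denoising, and knowledge distillation from a stronger LLM. Building on these improvements, ConFit v3 is trained on real-world recruiting data and significantly outperforms the best existing systems and mainstream large models.

Abstract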

A reliable resume-job matching system helps a company find suitable candidates from a pool of resumes and helps a job seeker find relevant jobs from a list of job posts. While recent advances in embedding-based methods such as ConFit and ConFit v2 can efficiently retrieve candidates at scale, the lack of controllability and explainability limits their real-world adaptations. LLM-based re-rankers can address these limitations through reasoning, but existing training recipes are developed on short-document benchmarks and do not account for noise in real-world recruiting data. In this work, we first conduct a systematic analysis over the LLM re-ranker training pipeline for person-job fit, covering inference algorithm design, RL algorithm selection, data processing, and SFT distillation. We find that using multi-pass re-ranking, training with listwise RL objectives, removing noisy samples, and distilling from a stronger LLM before RL significantly improves re-ranking performance. We then aggregate these findings to train ConFit v3 with Qwen3-8B and Qwen3-32B on real-world person-job fit datasets, and find significant improvements over existing best person-job fit systems as well as strong LLMs such as GPT-5 and Claude Opus-4.5. We hope our findings provide useful insights for future research on adapting LLM-based re-rankers to person-job fit systems.

2605.09757 2026-05-12 cs.LG stat.ML

On Uniform Error Bounds for Kernel Regression under Non-Gaussian Noise

Johannes Teutsch, Oleksii Molodchyk, Marion Leibold, Timm Faulwasser, Armin Lederer

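Affiliations * Chair of Automatic Control Engineering, Department of Computer Engineering, Technical University of Munich; Institute of Control Systems, Hamburg University of Technology; Department of Electrical and Computer Engineering, National University of Singapore

AI summary This paper studies non-conservative uncertainty quantification for kernel-regression-based function estimation under non-Gaussian noise and proposes new non-asymptotic probabilistic uniform error bounds. Unlike previous bounds, which apply only to sub-Gaussian noise, these bounds cover a broader class of non-Gaussian noise distributions, including sub-Gaussian, bounded, sub-exponential, and variance/moment-bounded noise, and apply to both correlated and uncorrelated noise. Comparisons with existing results in terms of uncertainty regions and safe control performance confirm the tightness of the proposed bounds.

Comments This paper has been accepted at the 43rd International Conference on Machine Learning (ICML) 2026

Abstract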

Providing non-conservative uncertainty quantification for function estimates derived from noisy observations remains a fundamental challenge in statistical machine learning, particularly for applications in safety-critical domains. In this work, we propose novel non-asymptotic probabilistic uniform error bounds for kernel-based regression. Compared to related bounds in the literature that are restricted to (conditionally) independent sub-Gaussian noise, our bounds cover a broad class of non-Gaussian distributions, such as sub-Gaussian, bounded, sub-exponential, and variance/moment-bounded noise. Moreover, our results apply to both correlated and uncorrelated noise. We compare our proposed error bounds with existing results in terms of the induced uncertainty region and their performance in safe control, demonstrating the tightness of the proposed bounds.

2605.09751 2026-05-12 cs.CL

Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes

A. Bochkov

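Affiliations Andrey Bochkov

AI summary This paper asks whether language models need a trainable input embedding table. The author replaces the usual embedding matrix with fixed minimal binary codes, lifted to model width by a zero-parameter transform. Experiments show that the approach removes a large number of trainable parameters while keeping validation perplexity comparable, indicating that a trainable input embedding table is not necessary for language modeling.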

Abstract

Trainable input embedding tables are a standard component of modern language models. We ask whether they are actually necessary at the input interface. For a vocabulary of size $V$, exact token identity requires only $K=\lceil \log_2 V\rceil$ bits. We replace the usual trainable $V\times d_{\text{model}}$ input embedding matrix with fixed minimal binary token codes and a zero-parameter lift to model width. In our main setting, $V=65{,}536$, so $K=16$, and tokens are represented by fixed 16-dimensional binary codes tiled to $d_{\text{model}}=1024$. We also evaluate a fully table-free variant in which codes are generated from token IDs on the fly and randomly recoded by an invertible affine transform over $\mathbb{F}_2^K$. Across matched 32-layer decoder-only models trained on approximately 17B tokens and evaluated over three independent training seeds, fixed minimal codes achieve comparable held-out validation perplexity to a standard learned-input baseline while removing 67.1M trainable input parameters. The fixed-code runs have a lower mean validation perplexity in our experiments, 2.36 versus 2.44, but the observed gap is within the measured seed-to-seed variation of 4.8\%; we therefore interpret the result as evidence that the trainable input table is not necessary, rather than as a statistically resolved superiority claim. The table-free affine-recoded variant remains close at 2.39 despite a slightly shorter training run. These results show that, in this regime, a trainable input embedding table is not necessary for useful language modeling. The output projection remains standard and trainable.
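The fixed-code input described above can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function names, the least-significant-bit-first ordering, and the use of plain 0/1 values are assumptions, and the abstract only states that 16-bit codes are tiled to width 1024.

```python
def binary_code(token_id: int, num_bits: int) -> list[int]:
    """Fixed binary code of a token ID, least significant bit first (assumed order)."""
    return [(token_id >> b) & 1 for b in range(num_bits)]


def lift_to_width(code: list[int], d_model: int) -> list[float]:
    """Zero-parameter lift: repeat (tile) the code until it fills d_model dimensions."""
    if d_model % len(code) != 0:
        raise ValueError("d_model must be a multiple of the code length")
    return [float(bit) for bit in code] * (d_model // len(code))


vocab_size = 65_536
d_model = 1024

# For vocab_size >= 2 this equals ceil(log2(vocab_size)) without floating point.
num_bits = (vocab_size - 1).bit_length()

# Hypothetical token ID, chosen only for illustration.
embedding = lift_to_width(binary_code(12_345, num_bits), d_model)

# Size of the trainable V x d_model table that the fixed codes replace.
trainable_params_removed = vocab_size * d_model
```

With these settings the replaced table has 65,536 x 1,024 = 67,108,864 entries, which is consistent with the 67.1M trainable input parameters quoted in the abstract.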

2605.09750 2026-05-12 cs.CV

Fetal Brain Imaging: A Composite Neural Network Approach for Keyframe Detection in Ultrasound Videos

Aleksander Zamojski, Kacper Jarczak, Radoslaw Roszczyk

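Affiliations Warsaw University of Technology

AI summary This paper proposes a new method for keyframe detection in fetal brain ultrasound videos, aiming to improve the efficiency and accuracy of fetal brain image analysis. The method uses a composite neural network that combines a convolutional neural network (CNN) and a recurrent neural network (RNN): the CNN extracts spatial features from individual video frames, and the RNN captures temporal dependencies between frames in a sequence. The model may support earlier detection and diagnosis of selected fetal brain conditions and, in turn, more timely treatment planning.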

Abstract

This article presents a novel approach to keyframe detection in ultrasound videos, with a particular focus on fetal brain imaging. The proposed model is a composite neural network architecture that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN). The CNN extracts spatial features from individual video frames, while the RNN captures temporal dependencies between consecutive frames within each video sequence. The proposed model may improve the efficiency and accuracy of fetal brain ultrasound analysis, thereby supporting earlier detection, diagnosis, and treatment planning for selected fetal brain conditions.

2605.09749 2026-05-12 cs.AI

Primal-Dual Guided Decoding for Constrained Discrete Diffusion

Federico Tomasi, Dmitrii Moor, Alice Wang, Mounia Lalmas

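Affiliations Spotify

AI summary Discrete diffusion models generate structured sequences by progressively unmasking tokens, but enforcing global property constraints during generation remains a challenge. This paper proposes primal-dual guided decoding, which formulates constrained generation at inference time as a KL-regularized optimization problem and solves it online with adaptive Lagrangian multipliers. The method adjusts token logits with a constraint-dependent bias, keeping the generated distribution as close as possible to the unconstrained one while satisfying the constraints. It requires no retraining and no additional model evaluations, handles multiple constraints simultaneously, and comes with theoretical bounds on constraint violation. Experiments on topical text generation, molecular design, and music playlist generation show improved constraint satisfaction while preserving domain-specific quality metrics.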

Abstract

Discrete diffusion models generate structured sequences by progressively unmasking tokens, but enforcing global property constraints during generation remains an open challenge. We propose primal-dual guided decoding, an inference-time method that formulates constrained generation as a KL-regularised optimisation problem and solves it online via adaptive Lagrangian multipliers. At each denoising step, the method modifies token logits through an additive, constraint-dependent bias, with multipliers updated by mirror descent based on constraint violation. The bias arises as the optimal KL-regularised projection of the constraint, so the constrained distribution remains as close as possible to the model's unconstrained distribution while still satisfying the constraint. The method requires no retraining and no additional model evaluations beyond standard sampling, supports multiple simultaneous constraints, and provides formal bounds on constraint violation. We evaluate our approach on topical text generation, molecular design, and music playlist generation, showing that a single algorithm instantiated via domain-specific scoring functions improves constraint satisfaction while preserving relevant domain-specific quality metrics.
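The additive-bias idea can be illustrated on a single toy token distribution. The sketch below is not the paper's algorithm: it uses projected dual ascent (the Euclidean special case of mirror descent) on one fixed set of logits, and the cost vector, budget, step size, and all function names are assumptions made for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def guided_probs(logits, costs, lam):
    """Distribution after subtracting the bias lam * cost from each token logit."""
    return softmax([l - lam * c for l, c in zip(logits, costs)])


def expected_cost(probs, costs):
    return sum(p * c for p, c in zip(probs, costs))


def solve_multiplier(logits, costs, budget, step_size=1.0, num_steps=200):
    """Raise the multiplier while the constraint E[cost] <= budget is violated."""
    lam = 0.0
    for _ in range(num_steps):
        violation = expected_cost(guided_probs(logits, costs, lam), costs) - budget
        lam = max(0.0, lam + step_size * violation)  # projection keeps lam >= 0
    return lam


# Toy setup (hypothetical): token 0 is the only token that incurs a cost.
logits = [2.0, 1.0, 0.0]
costs = [1.0, 0.0, 0.0]
budget = 0.2

lam = solve_multiplier(logits, costs, budget)
probs = guided_probs(logits, costs, lam)
```

Because the bias depends only on the cost, tokens with equal cost keep their original relative odds, which is the sense in which the guided distribution stays close to the unconstrained one.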

2605.09746 2026-05-12 cs.LG cs.AI

Sequential Feature Selection for Efficient Landslide Segmentation from Multi-Spectral Data

Arsalaan Ahmad, Oktay Karakus, Paul L. Rosin

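Affiliations School of Computer Science and Informatics, Cardiff University

AI summary This study addresses input-feature redundancy in efficient landslide segmentation from multispectral satellite data. It proposes an explainable feature-selection framework based on Sequential Forward Floating Selection (SFFS) that combines Sentinel-2 multispectral data with ALOS PALSAR terrain data and iteratively builds and prunes the feature set, identifying a subset of only 8 channels whose segmentation performance is comparable to that of configurations using 30 channels. Besides improving model efficiency, the method reveals which spectral and topographic features landslide models actually rely on, offering principled guidance for input design in Earth observation.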

Comments In Process of Submission to Frontiers in Remote Sensing. Keywords: landslide segmentation, multispectral remote sensing, feature selection, explainability, Landslide4Sense

Abstract

Landslide detection from satellite imagery has advanced through deep learning, yet most models rely on large, highly correlated spectral-topographic inputs whose contributions remain poorly understood. The question of which channels are actually necessary has received surprisingly little attention. This matters: redundant or correlated inputs obscure physical interpretability, inflate computational overhead, and can actively degrade model performance through the Hughes Phenomenon. We present a systematic, explainable channel-selection framework for the Landslide4Sense benchmark, combining Sentinel-2 multispectral and ALOS PALSAR terrain data with 16 engineered spectral and structural indices. Rather than relying on conventional single-band drop tests, which evaluate channels in isolation and miss interaction effects, we apply Sequential Forward Floating Selection (SFFS) to iteratively build and prune a candidate feature pool using a lightweight U-Net++ proxy model. Beyond identifying a compact 8-channel subset that matches or exceeds the segmentation F1 of configurations using up to 30 channels, we use the selection process itself to interrogate which spectral and topographic features landslide models genuinely rely on, and what this reveals about the physical cues driving their predictions. We argue that SFFS represents a principled feature selection approach to input design in Earth observation, in contrast to the prevailing practice of appending every available band and hoping the model learns what to ignore.
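A generic version of Sequential Forward Floating Selection can be sketched as follows. The paper scores subsets with the F1 of a U-Net++ proxy model; here a toy scoring function stands in for it, and the channel names, values, and redundancy penalty are hypothetical.

```python
from itertools import combinations


def sffs(features, score, max_size):
    """Return {subset size: (best score, best subset)} found by SFFS."""
    selected = []
    best = {}
    while len(selected) < max_size:
        # Forward step: add the single feature that maximizes the score.
        candidates = [f for f in features if f not in selected]
        added = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [added]
        size, value = len(selected), score(selected)
        if size not in best or value > best[size][0]:
            best[size] = (value, list(selected))
        # Floating step: drop features while that beats the best smaller subset.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            reduced_value = score(reduced)
            if reduced_value > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (reduced_value, list(reduced))
            else:
                break
    return best


# Toy score: per-channel value minus a penalty for each redundant pair.
VALUES = {"red": 3.0, "nir": 3.0, "ndvi": 4.0, "slope": 2.0}
REDUNDANT = {frozenset(("ndvi", "red")), frozenset(("ndvi", "nir"))}


def toy_score(subset):
    gain = sum(VALUES[f] for f in subset)
    pairs = sum(1 for a, b in combinations(subset, 2) if frozenset((a, b)) in REDUNDANT)
    return gain - 3.0 * pairs


best = sffs(["red", "nir", "ndvi", "slope"], toy_score, max_size=4)
```

In this toy case the greedy forward pass first commits to `ndvi`, and only the floating step later discards it in favor of the three-channel subset without the redundant index. Single-band drop tests evaluate channels in isolation and would not capture this kind of interaction.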

2605.09745 2026-05-12 cs.LG cs.AI cs.IT math.IT

Entropy-informed Decoding: Adaptive Information-Driven Branching

Benjamin Patrick Evans, Sumitra Ganesh, Leo Ardon

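Affiliations JP Morgan AI Research, London, UK; JP Morgan AI Research, New York, USA

AI summary This paper proposes EDEN, an entropy-driven decoding framework intended to improve the generation quality of large language models. The method adjusts the branching factor dynamically according to the uncertainty (entropy) of the model's output distribution, expanding more candidates in high-entropy regions and following a greedier strategy in low-entropy regions, which improves computational efficiency. Experiments show that EDEN outperforms conventional decoding methods on complex tasks such as mathematical reasoning and code generation, achieving a better trade-off between accuracy and expansion cost.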

Comments Accepted at ICML 2026

Abstract

Large language models (LLMs) achieve remarkable generative performance, yet their output quality is dependent on the decoding strategy. While sampling-based methods (e.g., top-k, nucleus) and search-and-select based methods (e.g., beam search, best-of-n, majority voting) can improve upon greedy decoding, both approaches suffer from limitations: sampling generally commits to a single path, while search often expends excessive computation regardless of task complexity. To address these, we introduce Entropy-informed decoding (EDEN), a plug-and-play, model-agnostic decoding framework that adaptively allocates computation based on the model's own uncertainty, approximating higher-width beam search with fewer expansions. At each generation step, EDEN estimates the entropy of the output token distribution and adjusts the branching factor monotonically with the entropy, expanding more candidates in high-entropy regions and following a greedier path in low-entropy regions, improving token efficiency. Experiments across complex tasks, including mathematical reasoning, code generation, and scientific questions, demonstrate that EDEN consistently improves output quality over existing decoding strategies, achieving better accuracy-expansion trade-offs than fixed-width beam search. By treating next-token selection as a noisy maximisation problem, we prove that branching factors monotone in entropy are guaranteed to find better (i.e. more probable) continuations than any fixed branching factor within the same total expansion budget, and derive explicit regret rates characterising the benefit of the adaptive allocation.
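One way to picture an entropy-dependent branching factor is the sketch below. The abstract only requires the branching factor to be monotone in the entropy; the linear mapping, the normalization by the maximum entropy, and the function names here are assumptions, not the rule used by EDEN.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def branching_factor(probs, max_branches):
    """Map entropy monotonically to an integer in [1, max_branches] (assumed linear map)."""
    max_entropy = math.log(len(probs))
    if max_entropy == 0.0:
        return 1
    normalized = min(1.0, shannon_entropy(probs) / max_entropy)
    return 1 + round(normalized * (max_branches - 1))


def expand(probs, max_branches):
    """Indices of the tokens to expand: the top-k tokens, with k set by the entropy."""
    k = branching_factor(probs, max_branches)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]


confident = [1.0, 0.0, 0.0, 0.0]   # zero entropy: follow the greedy path
mixed = [0.7, 0.1, 0.1, 0.1]       # intermediate entropy
uncertain = [0.25, 0.25, 0.25, 0.25]  # maximum entropy: expand every candidate
```

A confident step therefore costs a single expansion, while the budget saved there is spent on steps where the distribution is spread out.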

2605.09742 2026-05-12 cs.LG cs.AI

TIDES: Implicit Time-Awareness in Selective State Space Models

Taylan Soydan, Miguel A. Bessa, Dirk Mohr, Rui Barreira

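Affiliations AIMM, ETH Zürich

AI summary This paper proposes TIDES, a selective state space model that addresses the limitations of existing models on irregular time series. Unlike conventional selective models, TIDES moves input dependence off the time step and onto the diagonal of the state matrix, so that the step size $\tilde{\Delta}$ keeps its physical meaning and the model can handle irregular timestamps while retaining high expressivity. Experiments show strong results on several benchmarks, including a new state-of-the-art average rank on time-series classification and regression tasks.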

Comments Preprint submitted for peer-review

Abstract

Selective state space models (SSMs), such as Mamba, achieve strong per-token expressivity by making the time discretization step $\tilde{\Delta}$ a learned function of the input. However, in doing so, $\tilde{\Delta}$ ceases to represent a physical sampling interval, limiting its irregular time series modeling capability. Continuous-time SSMs, such as S5, preserve the physical meaning of $\tilde{\Delta}$ and handle irregular timestamps natively ($\tilde{\Delta}\equiv\Delta$), but their dynamics remain linear time-invariant (LTI), limiting per-token expressivity. We propose TIDES, a selective SSM variant that reconciles selective and continuous architectures by moving input-dependence off the step size and onto the diagonal state matrix. As a result, $\tilde{\Delta}$ retains its physical meaning, tied to the state discretization, allowing the model to handle irregular timestamps natively without sacrificing the per-token expressivity that makes selective SSMs effective. We show this on a novel Fading Flash experimental benchmark, a compact controlled diagnostic for sequence models that jointly tests input-dependence and extrapolation to out-of-distribution $\Delta$ values, and isolates the distinct failure modes of current state-of-the-art architectures that TIDES avoids by construction. On large-scale benchmarks, TIDES sets the new state-of-the-art average rank on UEA time-series classification and the Physiome-ODE regression benchmark. Code available at: https://github.com/TaylanSoydan/TIDES.

2605.09739 2026-05-12 cs.CL cs.AI

The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods

Sanket Badhe, Priyanka Tiwari, Deep Shah

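Affiliations Google

AI summary This paper studies the "renormalization bias" that constrained decoding introduces when large language models are used for zero-shot classification. It proposes Semantic Softmax, a method that recovers the lost probability mass by aggregating information from the semantic neighborhood of each target label, improving both calibration and classification performance. Experiments on multiple datasets show lower Expected Calibration Error and Brier Score together with higher AUROC and Macro-F1, giving a more accurate and reliable approach to zero-shot classification.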

Comments Accepted at GEM Workshop @ ACL 2026

Abstract

Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax operation discards the probability mass assigned to semantic synonyms in the original distribution. This loss of information, which we call the Silent Vote, results in artificial overconfidence and poor calibration. We propose Semantic Softmax, an inference-time layer that recovers this lost information by aggregating the scores of the semantic neighborhood surrounding each target label. We evaluate this approach on Qwen-3 and Phi-4-mini models using GoEmotions and Civil Comments datasets. Our results demonstrate consistent improvements across all evaluation metrics: Semantic Softmax substantially reduces Expected Calibration Error (ECE) and Brier Score, while simultaneously enhancing discriminative performance in terms of AUROC and Macro-F1. By accounting for linguistic nuances, our method provides a more calibrated and accurate alternative for zero-shot classification.
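The contrast between standard constrained decoding and neighborhood aggregation can be sketched on toy logits. The abstract does not say how neighborhoods are built or how scores are combined, so this example assumes hand-written neighborhoods and sums exponentiated logits (a log-sum-exp aggregation); the tokens, logit values, and function names are hypothetical.

```python
import math


def constrained_softmax(logits, labels):
    """Standard constrained decoding: softmax over the label tokens only."""
    m = max(logits[label] for label in labels)
    exps = {label: math.exp(logits[label] - m) for label in labels}
    z = sum(exps.values())
    return {label: e / z for label, e in exps.items()}


def semantic_softmax(logits, neighborhoods):
    """Pool the mass of each label's neighborhood before normalizing over labels."""
    m = max(logits[t] for tokens in neighborhoods.values() for t in tokens)
    mass = {
        label: sum(math.exp(logits[t] - m) for t in tokens)
        for label, tokens in neighborhoods.items()
    }
    z = sum(mass.values())
    return {label: v / z for label, v in mass.items()}


# Toy next-token logits: most of the "negative" evidence sits on synonyms.
logits = {"positive": 3.0, "good": 0.0, "negative": 1.0, "bad": 1.5, "terrible": 0.5}
neighborhoods = {
    "positive": ["positive", "good"],
    "negative": ["negative", "bad", "terrible"],
}

standard = constrained_softmax(logits, ["positive", "negative"])
semantic = semantic_softmax(logits, neighborhoods)
```

In this toy case the standard softmax ignores the mass on "bad" and "terrible", so its confidence in "positive" is higher than the pooled estimate, which is the overconfidence effect the abstract attributes to renormalization.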

2605.09737 2026-05-12 cs.LG

CALYREX: Cross-Attention LaYeR EXtended Transformers for System Prompt Anchoring

Li Lixing

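Affiliations Cornell University

AI summary Modern large language models rely on system prompts to set behavioral constraints and safety rules, but standard causal self-attention treats privileged instructions and user content alike, leaving models vulnerable to prompt injection and instruction erosion over long contexts. This paper proposes CALYREX, an extended Transformer that structurally isolates and anchors the rules through cross-attention between the input and the system prompt. Experiments show clear gains in instruction following and multi-turn instruction adherence and a lower prompt-attack success rate, with the advantage growing as model scale increases.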

Comments Preprint. 25 pages, 4 figures, 9 tables

Abstract

Modern large language models (LLMs) rely on system prompts to establish behavioral constraints and safety rules. Standard causal self-attention treats privileged instructions and untrusted user content with equal structural priority -- a mismatch that leaves models vulnerable to prompt injection and instruction erosion over extended contexts. We propose CALYREX (Cross-Attention LaYeR EXtended transformers), which utilizes cross-attention between input and system prompt to structurally isolate and anchor the rule. A placement ablation on a 1.5B backbone identifies insertion at the final eighth of layers as optimal, confirmed by mechanistic activation analysis showing behavioral constraints are naturally concentrated there. At 8B scale, controlling for training data, backbone, and parameter budget, CALYREX yields $+7.4\%$ on instruction-following (IFEval) and $+16.3\%$ on multi-turn instruction adherence, while reducing many-shot jailbreaking attack success rate by $13\%$. This advantage appears to widen with model scale, consistent with larger models more effectively utilizing the dedicated routing pathway.

2605.09727 2026-05-12 cs.LG cs.AI

One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning

Bowen He, Juncheng Dong, Lin Lin, Xiang Cheng

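Affiliations Duke University

AI summary This paper studies how non-linear transformers can enable cross-domain generalization for in-context reinforcement learning. Taking a kernel-method perspective, the authors connect non-linear transformers to kernel-based temporal difference learning and interpret the transformer as performing regression in a reproducing kernel Hilbert space, which allows value functions from different domains to share weights. Experiments on multiple MetaWorld domains show convergence of the temporal-difference objective, supporting this interpretation and offering a new theoretical view of cross-task generalization in reinforcement learning.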

Abstract

A central challenge in reinforcement learning (RL) is to learn models that generalize beyond the tasks on which they are trained, a goal traditionally pursued through multi-task and meta RL. Recently, transformer architectures have emerged as a promising approach, enabling adaptation to new tasks via in-context learning without explicit parameter updates. From a functional perspective, a transformer can be viewed as a functional operator that maps a context to a task-specific function. It is thus fundamental to understand and design this operator to support stronger generalization in RL. In this work, we address this resulting question of generalization from a kernel-based perspective by establishing a connection between non-linear transformers and kernel-based temporal difference learning. By interpreting the transformer as performing regression in a Reproducing Kernel Hilbert Space (RKHS), we show that value functions from different domains can be represented using a shared set of weights, provided they lie within the same RKHS. Experiments on multiple MetaWorld domains support this interpretation, demonstrating convergence of the temporal-difference objective.

2605.09724 2026-05-12 cs.LG

Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

Yiding Song, Hanming Ye

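Affiliations Harvard College

AI summary This study examines how model capacity affects grokking, the phenomenon in which a model suddenly generalizes after overfitting the training set. It argues that capacity does not directly determine when grokking appears; instead it acts through a competition between memorization speed and generalization speed. Using an information-theoretic framework and experiments on modular arithmetic, the authors show that grokking emerges near the parameter scale at which the memorization and generalization timescales intersect, linking model capacity, data complexity, and learning dynamics.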

Comments 23 pages, 10 figures, 12 tables

Abstract

Existing accounts of grokking explain the phenomenon in terms of mechanistic frameworks such as circuit efficiency or lazy-to-rich transitions. However, despite a known dependence between grokking and model size, how model capacity shapes grokking remains an open question. We give an information-theoretic account of this relationship on the task of modular arithmetic, showing that grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed $T_{\text{mem}}(P)$ and a generalisation speed $T_{\text{gen}}(P)$, both of which are functions of model parameter count $P$. Adapting the information capacity framework of Morris et al. (2025), we estimate $T_{\text{mem}}(P)$ on random-label data of equivalent complexity and $T_{\text{gen}}(P)$ on the modular task itself, and show that grokking emerges close to the parameter scale where these timescales intersect. The framework also suggests an empirical model for predicting memorisation speed given model capacity and dataset complexity, recovering the previously reported empirical observation that larger models memorise faster. Overall, we motivate the formalisation of different learning timescales as important abstractions to study when explaining how model capacity shapes grokking on algorithmic tasks.

2605.09722 2026-05-12 cs.LG

Benchmarking Transformer and xLSTM for Time-Series Forecasting of Heat Consumption

Marja Wahl, Daniel R. Bayer, Sven Rausch, Marco Pruckner

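Affiliations RAUSCH Technology GmbH; Modeling and Simulation, University of Würzburg

AI summary This paper evaluates Transformer and xLSTM models for short-term heat-demand forecasting, using hourly heat-consumption data from 25 German buildings and 3-hour and 24-hour forecasting horizons. The xLSTM achieves the best RMSE and the Temporal Fusion Transformer the best MAE, but these models have many parameters and long training times, which calls their sustainability into question. The paper further analyzes the trade-off between forecasting accuracy and computational resource consumption and finds that low-parameter models such as a traditional fully connected network also forecast well, so the small accuracy gains of the newer models can come at a substantial resource cost.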

Comments Submitted version of the paper submitted to IEEE SusTech, 2026

Abstract

Obtaining accurate short-term forecasts of heat demand is essential for operating district heating networks cost-efficiently and reliably. Heat consumption time series at the building level are highly dependent on exogenous variables such as outdoor temperature and individual usage patterns, making forecasting in this context a challenging task. Thus, this paper benchmarks novel Transformer-based and xLSTM architectures for short-term heat-demand forecasting. Using hourly data from 25 German buildings (2017-2025), we compare three-hour and 24-hour forecasting horizons relevant for intraday control and day-ahead scheduling. We establish a multi-building benchmark that tests whether models trained on pooled, heterogeneous building data are able to generalize across diverse building stock. The results show that the xLSTM achieves the lowest RMSE (19.88 kWh for three-hour, 21.47 kWh for 24-hour forecasts), while the Temporal Fusion Transformer attains the best MAE (9.16 kWh for three-hour forecasts). As xLSTMs and Transformers require long training times and have a huge number of trainable parameters, their sustainability remains questionable. Therefore, this paper further investigates the trade-off between predictive accuracy and computational resource demand of the evaluated forecasting models. The findings indicate that low-parameter models such as a traditional fully connected network also achieve good predictive results, highlighting that marginal accuracy gains of the novel prediction models come at substantial resource expense for this use case.

2605.09719 2026-05-12 cs.CV cs.AI

Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

Alaa Asfour, Christopher Indris, Leihan Chen, Tejas Vyas, Guanghui Wang

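Affiliations Department of Computer Science, Toronto Metropolitan University

AI summary This study proposes a knowledge distillation framework that transfers spatial reasoning from a large 3D vision-language model to a lighter model, substantially reducing computational cost. By introducing learnable latent reasoning tokens (Hidden CoT) and a multi-task distillation strategy, the method retains 54-72% of the teacher's performance while achieving a 3x reduction in model size and 8.7x lower inference latency. The work is the first to apply latent scratchpad reasoning in a distilled 3D vision-language model, enabling efficient 3D scene question answering.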

Abstract

Large-scale 3D vision-language models (VLMs) like LLaVA-3D offer strong spatial reasoning but are difficult to deploy due to high computational costs. We propose a knowledge distillation framework that transfers spatial reasoning from a 7B teacher to a 2.29B student model. Our approach achieves 8.7x lower inference latency and a 3x reduction in model size while retaining 54-72% of the teacher's performance. The framework utilizes VGGT as the vision encoder and a multi-task distillation pipeline with uncertainty-aware loss weighting. To improve reasoning without chain-of-thought (CoT) data, we introduce "Hidden CoT": learnable latent tokens that serve as an internal scratchpad before answer generation. This is the first use of latent scratchpad reasoning in distilled 3D VLMs. The student model jointly performs spatial description, depth estimation, and object detection. Experiments on ScanNet and 3D-FRONT show strong spatial understanding, reaching 68-72% accuracy on proximity and contact tasks. Our framework enables efficient 3D scene QA on resource-constrained platforms.

2605.09716 2026-05-12 cs.AI

Medical Model Synthesis Architectures: A Case Study

Katherine M. Collins, Marlene Berke, Ilia Sucholutsky, Ayman Ali, Adrian Weller, Timothy J. O'Donnell, Tyler Brooke-Wilson, Lionel Wong, Joshua B. Tenenbaum

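Affiliations MIT; University of Cambridge; Princeton University; Duke University; The Alan Turing Institute; Canada CIFAR AI Chair

AI summary This paper studies how to build AI systems that perform transparent, verifiable clinical reasoning under uncertainty to assist doctors in clinical decision-making. The authors propose a framework called MedMSA that uses language models to retrieve relevant medical knowledge and constructs a formal probabilistic model to support calibrated inference under uncertainty. In an initial proof of concept, the framework produces an uncertainty-weighted list of differential diagnoses, illustrating its clinical potential and pointing toward safe clinical collaborations.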

Comments Working paper

Abstract

Medicine is rife with high-stakes uncertainty. Doctors routinely make clinical judgments and decisions that juggle many fundamental unknowns, like predictions about what might be causing a patient's symptoms or decisions about what treatment to try next. Despite increasing interest in developing AI systems that aid or even replace doctors in clinical settings, current systems struggle with calibrated reasoning under uncertainty, and are often deeply opaque about their reasoning. We propose a framework for AI systems that can make practically useful but formally transparent clinical predictions under uncertainty. Given a clinical situation, our framework (MedMSA) uses language models to retrieve relevant prior knowledge, but constructs a formal probabilistic model to support calibrated and verifiable inferences under uncertainty. We show how an initial proof-of-concept of this framework can be used for differential diagnosis, producing an uncertainty-weighted list of potential diagnoses that could explain a patient's symptoms, and discuss future applications and directions for applying this framework more generally for safe clinical collaborations.