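arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.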

2605.28585 2026-05-28 cs.LG

Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization

Kristi Topollai, Allan Ma, Tolga Dimlioglu, Sui Jiet Tay, Anna Choromanska

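AI summary: Studies periodic restarting of the outer momentum in distributed optimization as a way to control outer memory; theoretical analysis, toy experiments, and language-model pretraining indicate that restarts widen the stable range.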

Abstract

Communication-efficient distributed optimizers such as DiLoCo reduce synchronization costs by letting workers perform many local updates before aggregating their progress with an outer momentum optimizer. Recent theory suggests that the outer optimizer acts on an effective spectrum induced by the inner optimization loop, and that the choice of outer momentum controls how progress from local updates is accumulated across communication rounds. We study periodic restarting of the outer momentum as a simple complementary mechanism for controlling this outer memory. In a linearized squared-loss model where prediction-space residuals evolve under the empirical NTK, we derive a mode-wise restart contraction showing that resets exploit phase cancellation by discarding stale momentum while preserving inner-loop progress. Toy experiments verify the predicted contraction behavior, and language-model pretraining experiments show that periodic restarts widen the stable range of outer learning rates and momentum values across communication periods.
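The restart mechanism can be illustrated with a minimal sketch. It assumes a one-dimensional quadratic loss, plain gradient descent as the inner optimizer, and heavy-ball momentum as the outer optimizer; the function name `run_outer_loop` and every hyperparameter value are illustrative choices, not taken from the paper, and this is not the authors' implementation.

```python
def run_outer_loop(rounds, restart_every=None, curvature=1.0, inner_lr=0.1,
                   inner_steps=5, outer_lr=1.0, outer_mu=0.9, x0=1.0):
    """Toy DiLoCo-style loop on f(x) = 0.5 * curvature * x**2 (assumed setup)."""
    x, momentum = x0, 0.0
    history = []
    for r in range(1, rounds + 1):
        # Inner loop: several local gradient steps starting from the global point.
        local = x
        for _ in range(inner_steps):
            local -= inner_lr * curvature * local
        # Outer "pseudo-gradient": how far the inner loop moved the parameters.
        delta = x - local
        # Outer heavy-ball momentum accumulates progress across rounds.
        momentum = outer_mu * momentum + delta
        x -= outer_lr * momentum
        # Periodic restart: discard stale momentum, keep the current iterate.
        if restart_every and r % restart_every == 0:
            momentum = 0.0
        history.append(x)
    return history


if __name__ == "__main__":
    for period in (None, 4):
        final = run_outer_loop(rounds=20, restart_every=period)[-1]
        print(f"restart_every={period}: |x| after 20 rounds = {abs(final):.3e}")
```

Sweeping `outer_lr` and `outer_mu` with and without `restart_every` is one simple way to probe how restarts change the range of stable settings in this toy model.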

2605.28583 2026-05-28 cs.RO cs.AI cs.LG cs.SY eess.SY

SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving

Kangyu Wu, Peng Cui, Guoxi Chen, Ya Zhang

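AI summary: Proposes SARAD, a framework that combines large language models with deep reinforcement learning and uses retrieval-augmented generation and a collision prediction module to improve the safety and efficiency of autonomous driving.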

Comments: 7 pages, 4 figures, accepted by IJCNN 2026

Abstract

Ensuring both safety and efficiency in decision-making for autonomous driving systems remains a fundamental challenge. Traditional Deep Reinforcement Learning (DRL) suffers from unsafe random exploration and slow convergence, while Large Language Models (LLMs) demonstrate inherent latency in real-time inference operations. To address these limitations, this paper proposes SARAD, a novel safety-aware hybrid framework that synergizes LLMs and DRL for autonomous driving. SARAD substitutes the random exploration of DRL with Retrieval-Augmented Generation (RAG)-enhanced, LLM-guided decisions sourced from a dynamic expert knowledge repository. An attention discriminator is proposed to integrate the prior knowledge of LLMs into DRL policy optimization. A collision predictor module, fine-tuned with historical collision data, is further designed to improve vehicle safety. Extensive experiments show that SARAD achieves significant performance improvements in the Highway-Env simulator, validating the effectiveness of the proposed model in autonomous driving.

2605.28578 2026-05-28 cs.LG

A Generalized Tikhonov Layer for Interpretable-by-design Graph Neural Networks

Nicolas Tremblay, Benjamin Ricaud, Filippo Maria Bianchi

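AI summary: Proposes the Tikhonov layer, which makes graph neural networks interpretable through learnable node-importance scores and a learnable polynomial; its output is the exact solution of a generalized graph Tikhonov problem.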

Abstract

We propose the Tikhonov layer, a graph neural network layer that is interpretable by design: once trained, its learned parameters directly reveal which node features and which aspects of the graph topology were leveraged for prediction. In practice, the layer's propagation matrix takes the closed form $R = (p(L)+Q)^{-1} Q$, where $L$ is the normalized graph Laplacian, $Q = \mathrm{diag}(q_1,\dots,q_n)$ is a learnable diagonal matrix of positive node-importance scores, and $p(\cdot)$ is a learnable polynomial. For any input feature $x$, the layer output $Rx$ is the exact minimizer of a generalized graph Tikhonov problem that trades off node-level data fidelity against a topology-driven regularization penalty. The learned pair $\{\{q_i\},p\}$ constitutes a built-in explanation: large $q_i$ indicates that node $i$'s own features drive the prediction, while small $q_i$ signals reliance on the local graph topology; the shape of $p$ reveals whether homophily, heterophily, or a band-pass response is exploited. Expressivity is preserved by routing complexity through a dedicated, arbitrarily deep Q-network that produces the importance scores, while the Tikhonov layer itself remains transparent. We prove that distinct node-importance matrices yield distinct propagation operators, structurally coupling the explanation to the computation. Additionally, the Tikhonov layer provides, in a single layer, a global receptive field, mitigating both oversmoothing and oversquashing. Experiments on standard graph classification benchmarks confirm that the model matches (and sometimes outperforms) opaque baselines while producing interpretable and faithful explanations.
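One way to read the closed form is sketched below. The abstract does not state the objective, so the quadratic $J$ is an assumption, chosen because it is consistent with $R = (p(L)+Q)^{-1} Q$. The sketch further assumes that $p(L)$ is symmetric (true for any polynomial of a symmetric normalized Laplacian) and that $p(L)+Q$ is positive definite, so that the stationary point is the unique minimizer.

$$
J(y) = (y-x)^\top Q\,(y-x) + y^\top p(L)\,y
$$

$$
\nabla J(y) = 2\,Q\,(y-x) + 2\,p(L)\,y = 0
\;\Longrightarrow\; (p(L)+Q)\,y = Q\,x
\;\Longrightarrow\; y^\star = (p(L)+Q)^{-1} Q\,x = R\,x
$$

Under this reading, the first term is the node-level data fidelity weighted by $q_i$, and the second is the topology-driven penalty. As the $q_i$ grow, $R$ approaches the identity and each node keeps its own feature, which matches the interpretation of large $q_i$ given in the abstract.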

2605.28577 2026-05-28 cs.AI cs.LG

Continual Model Routing in Evolving Model Hubs

Jack Bell, Giacomo Carfì, Gerlando Gramaglia, Vincenzo Lomonaco

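AI summary: To address the model-selection and routing-update challenges created by rapidly growing model hubs, the paper formulates the Continual Model Routing (CMR) problem, builds the large-scale benchmark CMRBench, and designs CARvE, a contrastive-embedding method that routes efficiently via checkpoint-based anchoring and structured replay and significantly outperforms several baselines.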

Comments: 42 pages, 24 tables, 6 figures, to be published at ICML 2026

Abstract

AI model hubs provide access to a rapidly growing collection of powerful pre-trained models, enabling off-the-shelf mixture-of-experts systems with different routing strategies. However, this rapid growth poses two fundamental challenges: scaling model selection across thousands of experts and continually updating routing mechanisms as new models and tasks are introduced. In this paper, we formalise this setting as Continual Model Routing (CMR) and propose CMRBench, a new large-scale benchmark simulating realistic hub expansion and including over 2,000 candidate models. Finally, we introduce CARvE, a contrastive embedding approach for efficient continual model routing via checkpoint-based anchoring and structured replay. Extensive empirical results and ablations show that CARvE significantly outperforms zero-shot retrieval, fine-tuning, and adapter-merging baselines in model, family, and domain-level accuracy.

2605.28575 2026-05-28 cs.AI

A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

Jianheng Dai, Jiazhang Liang, Sijie Mai

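AI summary: To address gradient conflicts caused by text-modality dominance in multimodal sentiment analysis, the paper proposes a conflict-aware penalty and statistical loss framework that balances modalities and stabilizes training, reaching state-of-the-art performance on CMU-MOSI.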

Abstract

Multimodal Sentiment Analysis (MSA) fuses text, acoustic, and visual streams to infer sentiment. Because pre-trained text encoders are far more expressive than their acoustic and visual counterparts, the text modality tends to dominate optimization, suppressing weaker modalities and inducing gradient norm conflicts that destabilize training. To address this, we propose a Conflict-aware Penalty (CP) that detects and penalizes gradient norm conflicts at each training step, and a Statistical Loss (SL) that aligns predicted distribution statistics with empirical input statistics. Crucially, CP prevents dominant modality gradients from interfering with the SL objective, enabling synergistic training within a unified framework incorporating adaptive modality encoding, gated cross-modal fusion, and unimodal auxiliary heads. Experiments on CMU-MOSI demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of each component.

2605.28573 2026-05-28 cs.LG cs.AI

Efficient Pre-Training of LLMs through Truncated SVD Layers

Kaivan Kamali, Kajetan Schweighofer, Hormoz Shahrzad, Olivier Francon, Babak Hodjat, Risto Miikkulainen

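AI summary: Proposes the TSVD framework, which maintains low rank and strict orthonormality through adaptive rank selection based on a spectral-energy heuristic and a caching mechanism, matching or exceeding full-parameter baselines while reducing compute overhead.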

Abstract

The massive scaling of Large Language Models (LLMs) has made pretraining increasingly cost-prohibitive. While low-rank representations and orthonormal weight matrices could in principle reduce parameter counts and computational overhead, most existing methods rely on static rank selection and do not enforce weight orthonormality due to its high computational cost. This paper introduces TSVD, a framework that maintains low rank and strict orthonormality throughout the training process. It uses a spectral energy-based heuristic for adaptive rank selection and a caching mechanism to maintain orthonormality. Theoretical analysis justifies the advantage of the approach in pretraining dynamics, and experiments across various model scales demonstrate that it is effective empirically. TSVD matches or exceeds the performance of full-parameter baselines while significantly reducing compute requirements. The approach thus offers a well-founded, practical, and scalable path toward efficient high-performance LLM pretraining.
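The abstract does not spell out the spectral-energy heuristic, so the following is only a sketch of one common form of such a rule: keep the smallest rank whose squared singular values capture a target fraction of the total. The function name `select_rank` and the `energy` threshold are hypothetical and are not taken from the paper.

```python
def select_rank(singular_values, energy=0.9):
    """Smallest rank r whose top-r squared singular values reach `energy`
    of the total spectral energy (an assumed rule, not the paper's)."""
    if not 0.0 < energy <= 1.0:
        raise ValueError("energy must lie in (0, 1]")
    # Spectral energy of each mode is its squared singular value.
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        return 0  # A zero matrix needs no modes at all.
    running = 0.0
    for rank, value in enumerate(squared, start=1):
        running += value
        if running >= energy * total:
            return rank
    return len(squared)


if __name__ == "__main__":
    spectrum = [3.0, 2.0, 1.0]
    for threshold in (0.6, 0.9, 1.0):
        print(threshold, select_rank(spectrum, threshold))
```

Re-running such a rule during training would let the retained rank follow the spectrum of each layer instead of being fixed in advance, which is the contrast with static rank selection that the abstract draws.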

2605.28567 2026-05-28 cs.LG cs.AI

Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

Tue M. Cao, Nguyen Do, My T. Thai

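AI summary: Proposes an optimal-transport-based distributional framework that uses activation-weighted distributions and the Wasserstein distance to address cross-layer feature matching and circuit compression in a unified way.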

Comments: preprint

Abstract

Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses remain difficult to scale: (1) matching semantically similar features across multiple layers and (2) compressing large feature circuits into interpretable supernodes. Although these have been treated as separate problems, we show that both are instances of a more fundamental challenge, which we frame as the estimation of semantic distances between SAE features that lie on different activation manifolds. We introduce a distributional framework for this problem, in which each feature is represented not by a single decoder vector, as in the literature, but by an activation-weighted distribution over the hidden states that express it. By projecting these distributions into a shared reference space and comparing them with the Wasserstein distance, our method provides a unified semantic metric for cross-layer feature comparison. We prove that our representation is invariant to activation rescaling, stable under perturbations, and recovers true matches under finite-sample margin conditions. Empirically, our method outperforms decoder-vector and LLM-based baselines and captures subtle functional distinctions between related features. Notably, our method compresses large feature circuits into interpretable supernodes automatically.

2605.28566 2026-05-28 cs.AI cs.LG

Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns

Guni Sharon

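AI summary: Unifies Tree-of-Thoughts research under a taxonomy based on classical heuristic search terminology, maps LLM-based reasoning onto search components, and identifies two design patterns: systematic search and lookahead-heavy strategies.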

Comments: Extended version of the SoCS 2026 paper. Includes appendices omitted from the proceedings version.

Journal ref: Proceedings of the Nineteenth International Symposium on Combinatorial Search (SoCS 2026), AAAI Press, 2026

Abstract

Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, yet their standard generation process -- auto-regressive token prediction -- is inherently myopic and prone to cascading errors. To address this, the Tree-of-Thoughts (ToT) framework creates a search space over intermediate reasoning steps, allowing search models to explore, look ahead, and backtrack. However, current ToT research remains fragmented across Natural Language Processing and Automated Planning communities, often using inconsistent terminology and ad-hoc implementations. Consequently, we synthesize the ToT landscape through a unified taxonomy based on classical heuristic search terminology. We map LLM-based reasoning to classical search components: state representation (granularity of thoughts), successor generation (prompting operators), and heuristic evaluation (self-assessment of progress). We analyze existing work within the context of our taxonomy and identify emerging design patterns: systematic search (Best-First Search) for shallow, deterministic tasks and lookahead-heavy strategies (DFS, MCTS) for deep multi-step reasoning. We conclude by identifying open algorithmic challenges at the intersection of heuristic search and LLM reasoning, and call on the heuristic search community to engage with this emerging domain.
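The mapping onto classical search components can be sketched with a generic best-first search. This is a toy illustration under stated assumptions: in a real Tree-of-Thoughts system the `successors` and `heuristic` callables would wrap LLM prompts (thought generation and self-assessment), whereas here they are simple arithmetic stand-ins, and the names `best_first_search` and `solve_toy` are hypothetical.

```python
import heapq
from itertools import count


def best_first_search(start, successors, heuristic, is_goal, max_expansions=1000):
    """Expand the most promising state first; return the path to a goal or None."""
    tie = count()  # Tie-breaker so the heap never has to compare states.
    frontier = [(heuristic(start), next(tie), start, [start])]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, state, path = heapq.heappop(frontier)
        if is_goal(state):
            return path
        # Successor generation: the role played by prompting operators.
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                # Heuristic evaluation: the role played by self-assessment.
                heapq.heappush(frontier, (heuristic(nxt), next(tie), nxt, path + [nxt]))
    return None


def solve_toy(target):
    """Toy task: reach `target` from 1 using the 'thoughts' +1 and *2."""
    return best_first_search(
        start=1,
        successors=lambda s: [n for n in (s + 1, s * 2) if n <= target],
        heuristic=lambda s: target - s,
        is_goal=lambda s: s == target,
    )


if __name__ == "__main__":
    print(solve_toy(10))
```

Swapping the priority queue for a stack, or adding rollouts before scoring, would move this sketch toward the lookahead-heavy strategies (DFS, MCTS) that the abstract contrasts with best-first search.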

2605.28563 2026-05-28 cs.LG cs.AI

A Multi-dimensional Framework for Evaluating Generalization in EEG Foundation Models

Aditya Kommineni, Emily Zhou, Kleanthis Avramidis, Tiantian Feng, Shrikanth Narayanan

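AI summary: Proposes a multi-dimensional evaluation framework that systematically assesses the generalization of EEG foundation models (such as LaBraM, CSBrain, and CBraMod) under low-resource conditions; the models perform well on long-context tasks, are only comparable to supervised models on short-window BCI tasks, and offer limited robustness to channel constraints.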

Comments: 24 pages, 5 figures

Abstract

Evaluating foundation models under appropriate adaptation settings is essential for understanding the quality and transferability of the learned representations. Recent EEG foundation models have demonstrated promising transfer capabilities across tasks and datasets, motivating their growing use in neurotechnology and clinical applications. However, these models are typically evaluated under full fine-tuning on well-curated downstream datasets, a setting that does not reflect biomedical domain constraints such as limited labeled data, reduced sensor coverage, or parameter-efficient adaptation. In this work, we propose a multi-dimensional evaluation framework for assessing EEG models under realistic low-resource conditions. Under this framework, we empirically analyze both supervised EEG models and recent EEG foundation models, including LaBraM, CSBrain, and CBraMod, across 6 different datasets. We find that EEG foundation models consistently provide performance gains on long-context tasks such as sleep stage prediction and mental health state classification. In contrast, for short-window, brain-computer-interface-style tasks, supervised models achieve comparable performance despite having substantially fewer parameters. Additional analyses demonstrate that current foundation models provide limited robustness in short-window tasks and channel-constrained settings. Together, these findings motivate the use of multi-dimensional evaluation protocols that characterize model behavior under realistic use constraints.

2605.28561 2026-05-28 cs.CL cs.LG

Soft-SVeRL: Self-Verified Reinforcement Learning with Soft Rewards

Saurabh Dash, Pierre Clavier, John Dang, Matthias Galle, Marzieh Fadaee, Ahmet Üstün, Beyza Ermis

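AI summary: For partially verifiable tasks, the paper proposes Soft-RLVR, a soft-reward framework based on checklist decomposition, and its self-verifying variant Soft-SVeRL; dense partial-credit signals improve reinforcement-learning training, and the work addresses reward inflation in self-verification.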

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has improved language models in domains such as mathematics and code, where correctness can be checked automatically. However, many important tasks are only partially verifiable: prompts contain multiple requirements, responses may satisfy some but not all of them, or no single reference answer might exist. We introduce Soft-RLVR, a framework for reinforcement learning from decomposed, learned verification signals. Soft-RLVR converts each prompt into a checklist of atomic requirements, scores candidate responses item by item with an LLM verifier, and trains on the resulting soft reward. Checklist-based rewards turn sparse pass/fail supervision into a denser partial-credit signal, but they also introduce a tradeoff: averaging item-level judgments can reduce verifier noise, while partial credit can reward incomplete responses. We formalize this tradeoff and identify conditions under which checklist-based verification gives a more reliable RL training signal than holistic verification. We further introduce Soft-SVeRL, a self-verifying variant of Soft-RLVR in which the policy also acts as the verifier. We show that self-verification is prone to reward inflation from overly permissive self-judgments, and that explicit stabilization is needed to prevent this collapse. In a controlled instruction-following setting with rule-based ground-truth evaluation, checklist-based Soft-RLVR improves IFEval by up to 11.1 points using only learned verifier rewards. Our experiments further show that verifier quality and checklist quality both affect downstream RL outcomes, and that explicit stabilization is essential for effective self-verification.
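The difference between a checklist-based soft reward and sparse pass/fail supervision can be sketched as follows. The keyword-matching `keyword_verifier` is a deliberately crude stand-in for the LLM verifier described in the abstract, and all function names are hypothetical; this is an illustration of the reward shapes, not the paper's training code.

```python
from typing import Callable, Sequence

# A verifier scores how well a response meets one requirement, in [0, 1].
Verifier = Callable[[str, str], float]


def keyword_verifier(response: str, requirement: str) -> float:
    """Stand-in verifier: a requirement is 'met' if its keyword appears."""
    return 1.0 if requirement.lower() in response.lower() else 0.0


def soft_reward(response: str, checklist: Sequence[str], verifier: Verifier) -> float:
    """Average of item-level scores: a dense, partial-credit signal."""
    if not checklist:
        raise ValueError("checklist must contain at least one requirement")
    scores = [min(1.0, max(0.0, verifier(response, item))) for item in checklist]
    return sum(scores) / len(scores)


def pass_fail_reward(response: str, checklist: Sequence[str], verifier: Verifier,
                     pass_mark: float = 0.5) -> float:
    """Sparse signal: 1.0 only when every requirement clears the pass mark."""
    return 1.0 if all(verifier(response, item) >= pass_mark for item in checklist) else 0.0


if __name__ == "__main__":
    answer = "Paris is the capital of France"
    items = ["paris", "france", "population"]
    print(soft_reward(answer, items, keyword_verifier))
    print(pass_fail_reward(answer, items, keyword_verifier))
```

The example response meets two of three requirements, so the soft reward gives partial credit while the pass/fail reward gives none, which is the tradeoff the abstract describes: denser signal, at the risk of rewarding incomplete responses.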

2605.28554 2026-05-28 cs.LG

High Performance, Low Reliability: Uncertainty Benchmarking for Tabular Foundation Models

José Lucas De Melo Costa, Fabrice Popineau, Arpad Rimmel, Bich-Liên Doan

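AI summary: Using the TALENT benchmark, the study finds that tabular foundation models outperform gradient-boosted decision trees in predictive performance but are worse calibrated in their uncertainty, revealing a performance-uncertainty trade-off.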

Comments: 6 pages, 2 figures, 2 tables. Accepted at ESANN 2026 (European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning), 22-24 April 2026, Bruges (Belgium)

DOI: 10.14428/esann/2026.ES2026-261
Journal ref: ESANN 2026 proceedings, European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, Bruges (Belgium) and online event, 22-24 April 2026, pp. 115-120, i6doc.com publ., ISBN 9782875870964

Abstract

Recent Tabular Foundation Models (TFMs) have demonstrated state-of-the-art predictive performance, often surpassing Gradient-Boosted Decision Trees (GBDTs). However, the trustworthiness of these models, particularly their uncertainty quantification, has been largely overlooked. We investigate this gap through an extensive study comparing TFMs, GBDTs, and classical baselines on the 112 datasets of the TALENT benchmark. Our results reveal a performance-uncertainty trade-off: although TFMs achieve the highest predictive performance, measured by AUC, they exhibit lower conditional coverage under conformal prediction, measured by SSCS, compared to GBDTs. Complementary experiments on synthetic datasets further characterize the regimes in which this effect intensifies. We conclude that while TFMs advance predictive frontiers, achieving well-calibrated uncertainty remains a major open challenge for their reliable adoption. Code is available at: https://github.com/jose-melo/high-performance-low-reliability
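For readers unfamiliar with conformal prediction, here is a minimal sketch of the standard split-conformal recipe for classification, assuming the nonconformity score is one minus the predicted probability of the true class. It is not the paper's evaluation code and does not compute the SSCS metric; the function names are hypothetical.

```python
import math


def conformal_threshold(calibration_scores, alpha=0.1):
    """Finite-sample-corrected quantile of calibration nonconformity scores."""
    n = len(calibration_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every class.
        return float("inf")
    return sorted(calibration_scores)[k - 1]


def prediction_set(class_probs, threshold):
    """All classes whose score (1 - probability) stays within the threshold."""
    return [label for label, p in enumerate(class_probs) if 1.0 - p <= threshold]


if __name__ == "__main__":
    # Scores on held-out calibration data: 1 - p(true class) for each example.
    calibration = [0.4, 0.1, 0.3, 0.2]
    q_hat = conformal_threshold(calibration, alpha=0.5)
    print(q_hat, prediction_set([0.75, 0.2, 0.05], q_hat))
```

Marginal coverage of such sets holds by construction under exchangeability; conditional coverage, which the abstract reports as weaker for TFMs, asks whether coverage also holds within subgroups such as sets of a given size.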

2605.28553 2026-05-28 cs.AI cs.CR

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Matteo Gioele Collu, Riccardo Conte, Alberto Giaretta, Denis Kleyko, Mauro Conti, Matteo Zavatteri, Roberto Confalonieri

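AI summary: Uses linear probes on residual-stream activations at each transformer block to detect refusal behavior, and proposes Mechanistic AutoDAN, which uses probe-guided genetic search to attack efficiently, substantially reducing search time while maintaining attack success rates.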

Abstract

In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate activations before output generation. To test whether this signal is actionable, we introduce Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. Across the evaluated models, our method achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%, and probe-guided prompts match or exceed AutoDAN's cross-model transfer in several configurations. We further find that the usefulness of probe guidance increases with model scale. Our results show that refusal is not only observable at the output level, but is encoded as a structured and actionable signal in intermediate LLM activations.

2605.28552 2026-05-28 cs.AI

Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning

Qingwen Pu, Kun Xie, Hong Yang, Di Yang, Junqing Wang

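AI summary: Extracts safety-critical interactions from the Argoverse 2 dataset and uses a Smooth-Mamba Deep Deterministic Policy Gradient framework (SMamba-DDPG) to model pedestrian crash avoidance behavior toward automated vehicles (AVs) and human-driven vehicles (HDVs); pedestrians react faster to AVs and cross at lower speeds, and AV scenarios show lower conflict rates.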

Comments: 37 pages, 15 figures, 9 tables

Abstract

As automated vehicles (AVs) increasingly share roadways with human-driven vehicles (HDVs), understanding how pedestrians respond to different vehicle types in safety-critical interactions is essential for the safe deployment of automated driving technologies. This study extracts safety-critical pedestrian-vehicle interactions from the Argoverse 2 dataset to capture real-world crash avoidance behaviors in encounters involving AVs and HDVs. To model vehicle-type-specific pedestrian crash avoidance behavior, we develop a Smooth-Mamba Deep Deterministic Policy Gradient framework, termed SMamba-DDPG, which integrates smooth action constraints with efficient temporal representation learning. To quantify pedestrian behavioral differences, the framework trains separate crash avoidance policies for pedestrian interactions with AVs and HDVs. Results show that SMamba-DDPG outperforms baseline reinforcement learning and supervised learning models in reproducing pedestrian crash avoidance behaviors. Reconstructed trajectories demonstrate strong behavioral realism, accurately reproducing crash avoidance kinematics in both AV and HDV scenarios. Reaction time analysis shows that the model captures human-like response delays and reveals that pedestrians respond more quickly to AVs than to HDVs. Counterfactual analysis further indicates that pedestrians adopt lower crossing speeds when interacting with AVs. Large-scale safety analysis of model-generated data revealed that pedestrian-AV interactions consistently yielded lower conflict rates and higher pedestrian yielding rates compared to pedestrian-HDV interactions. The findings highlight the importance of incorporating vehicle-type-specific pedestrian behavioral models for safer automated driving system design and more realistic traffic simulations in mixed-traffic environments.

2605.28549 2026-05-28 cs.RO cs.LG

SPRINT: Efficient Spectral Priors for Humanoid Athletic Sprints

Yantong Wei, Kaihong Huang, Hainan Pan, Jiawei Luo, Jiawei Zhou, Ziyan Mai, Zhiwen Zeng, Yaonan Wang, Huimin Lu

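AI summary: Proposes SPRINT, a framework that uses frequency-adaptive spectral priors to generate kinematically feasible joint trajectories and achieves zero-shot sim-to-real transfer, reaching a peak speed of 6 m/s on the Unitree G1 platform.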

Abstract

The pursuit of humanoid athletic sprints is hindered by a scarcity of humanoid-viable kinematic reference data and the inability of existing frameworks to maintain stability during sprints. To overcome these limitations, we introduce SPRINT, a novel framework driven by efficient, frequency-adaptive spectral priors. By characterizing the fundamental periodicity of human locomotion in the frequency domain using a reference library of five discrete motion sequences, these priors generate kinematically feasible joint trajectories across a broad velocity spectrum, successfully extrapolating to speeds that exceed the reference distribution. Guided by these pretrained priors, the SPRINT policy achieves zero-shot sim-to-real transfer in field experiments on the Unitree G1 platform, reaching a peak sprinting velocity of 6 m/s and demonstrating seamless gait transitions while preserving biomimetic naturalness. Ultimately, this work establishes frequency-adaptive spectral priors as a highly data-efficient foundation for humanoid athletic sprints. The project page is available at https://anonymous.4open.science/w/SPRINT-138A/.

2605.28548 2026-05-28 cs.CV

GEM: Generative Supervision Helps Embodied Intelligence

Ruowen Zhao, Bangguo Li, Zuyan Liu, Yinan Liang, Junliang Ye, Fangfu Liu, Diankun Wu, Zhengyi Wang, Xumin Yu, Yongming Rao, Han Hu, Jun Zhu

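AI summary: Proposes GEM, which adds a depth map generation task to vision-language model pre-training and trains it jointly to improve semantic understanding and physical operation in embodied intelligence; the authors release the large-scale dataset GEM-4M and report state-of-the-art results on multiple benchmarks.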

Comments: Project page: https://zhaorw02.github.io/GEM/

Abstract

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus of standard text-guided pre-training paradigms and the low-level spatial and physical knowledge critical for execution in embodied environments. In this paper, we introduce GEM, a Generative-supervised Embodied vision-language Model designed to bridge this divide. We propose integrating a depth map generation task directly into the VLM pre-training phase. By training this generative objective jointly with the main model, we observe substantial improvements in embodied intelligence, significantly enhancing both semantic understanding and physical operation capabilities. To support this paradigm, we curate and release GEM-4M, a comprehensive large-scale dataset featuring a mixture of grounding, reasoning, and planning data paired with high-quality depth supervision. Extensive experiments demonstrate that GEM achieves state-of-the-art results across diverse embodied benchmarks. Furthermore, our deployed action model, GEM-VLA, exhibits vastly superior task execution abilities in both simulation environments and real-world evaluations. Code, models, and datasets are available at https://zhaorw02.github.io/GEM/

2605.28544 2026-05-28 cs.CV

DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving

Chen Shi, Jinrui Xu, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang

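AI summary: Proposes DriveWAM, which adapts a pretrained video diffusion transformer into an autoregressive video-action policy and introduces scene-evolving driving guidance and selective KV memory for scalable world-action modeling, achieving strong planning performance on the NAVSIM and PhysicalAI-Autonomous-Vehicles benchmarks.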

Abstract

Pretrained foundation models have become an important basis for end-to-end autonomous driving. In contrast to vision-language models pretrained primarily on static image-text pairs, video generative models capture temporal dynamics and motion priors that are naturally suited for driving. We present DriveWAM, a driving world-action model that adapts a pretrained video diffusion transformer into an autoregressive video-action policy. DriveWAM organizes video and action streams into a unified temporal token sequence and trains them under a joint flow-matching objective, preserving the pretrained video-generation architecture while adapting its large-scale video priors to action generation. To incorporate high-level scene understanding, we introduce scene-evolving driving guidance, where a frozen VLM produces chunk-specific semantic intent to guide video-action generation. To keep long-horizon rollout bounded, we further introduce selective KV memory, which maintains bounded modality-aware video and action memory pools through relevance-redundancy cache selection at inference time. Experiments on NAVSIM and the PhysicalAI-Autonomous-Vehicles benchmark show that DriveWAM achieves strong planning performance, and a data-scaling study from 4k to 100k driving clips further confirms the scaling potential of world-action modeling for end-to-end autonomous driving.

2605.28543 2026-05-28 cs.AI cs.CL cs.LG

Cultural Binding Heads in Language Models

Avrile Floro, Luca Benedetto

AI summary: Using mechanistic interpretability and a factorial design, identifies 2-3 mid-layer attention heads per model, across eight language models, that contribute causally to cultural binding; the binding forms mainly during pre-training, and knowledge probing shows that models know far more than their behavior reflects.

Abstract

LLMs often default to equal treatment across cultural groups, even though context warrants differentiation: this is a lack of difference awareness. Using mechanistic interpretability and a factorial design on the N4 cultural appropriation benchmark from Wang et al. (2025), we identify 2-3 mid-layer attention heads per model that contribute causally to cultural binding across eight models (four architectures, base and instruct). Cultural binding is the process of associating cultural items with the appropriate identity. Knockout of the identity-to-item edges on these heads lowers the binding strength by 9-23%. The identified heads transfer from instruct to base models, suggesting that cultural binding is created at pre-training. $\alpha$-scaling shows a graded dose-response, and moderate amplification steering at generation ($\alpha = 2$-$3$) increases cultural differentiation accuracy by 1-3 pp while leaving neutral reasoning mostly intact. A knowledge probing task shows that models know 3-5 times more than they act upon, indicating that the bottleneck lies in routing and not knowledge.

2605.28534 2026-05-28 cs.CL

GUI-CIDER: Mid-training GUI Agents via Causal Internalization and Density-aware Exemplar Reselection

Zheng Wu, Chengcheng Han, Zhengxi Lu, Tianjie Ju, Yanyu Chen, Qi Gu, Xunliang Cai, Zhuosheng Zhang

AI summary: Proposes GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through causal internalization and density-aware exemplar reselection, improving agents' understanding of GUI operations and their task success rates.

Abstract

Despite the rapid progress of multimodal large language models in building Graphical User Interface (GUI) agents, their real-world task completion is fundamentally bottlenecked by a lack of world knowledge about GUI operations. Existing solutions typically rely on expensive multi-agent scaffolding or conventional post-training paradigms, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). However, post-training only allows agents to implicitly absorb world knowledge through action annotations or reward signals, leading to inefficient trajectory memorization rather than genuine comprehension. Therefore, an approach that enables explicit learning of this knowledge is imperative. To this end, we propose GUI-CIDER, a mid-training method that explicitly internalizes GUI world knowledge through Causal Internalization and Density-aware Exemplar Reselection. GUI-CIDER operates in three stages: (1) data synthesis, which distills static planning and dynamic causal knowledge from GUI trajectories into text; (2) exemplar reselection, which filters the corpus by rewarding causal structures and penalizing semantic redundancy; and (3) mid-training, where the refined data is used to embed the acquired knowledge. Extensive experiments on two GUI knowledge benchmarks and three task completion benchmarks demonstrate that GUI-CIDER consistently improves both the agent's understanding of GUI operations and its task success rates. The code is available at https://github.com/Wuzheng02/GUI-CIDER.

2605.28533 2026-05-28 cs.LG

Semi-Supervised Hypothesis Testing by Betting on Predictions

Yaniv Tenzer, Elad Tolochinsky, Yaniv Romano

AI summary: Proposes a testing-by-betting framework that uses unlabeled data to increase the power of sequential hypothesis testing, introducing an e-statistic that yields an anytime-valid test whose validity holds under label shift or concept shift.

Abstract

We introduce a testing-by-betting framework that leverages predictions on unlabeled data to enhance the power of sequential hypothesis testing. Given limited samples from the joint distribution of $(X,Y)$, and additional unlabeled samples from the marginal of $X$, we ask how unlabeled data can be used to hypothesize about the distribution of $Y$, and the conditional distribution of $Y\mid X$. We introduce an e-statistic and use it to construct a sequential test. Under standard distributional assumptions (label shift or concept shift), we establish that the test is anytime valid. Furthermore, we show that for binary data, the e-statistic has non-trivial power. Crucially, our approach retains these properties even when the underlying predictions are inaccurate. Through simulations and applications to large language model evaluation, we demonstrate power gains over baseline approaches, including prediction-powered inference. These gains persist even with relatively limited unlabeled data and when predictions have low accuracy due to weak correlation between $X$ and $Y$.
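
Illustrative sketch (not from the paper): the abstract does not give the form of its e-statistic, so the code below shows only the generic testing-by-betting idea for a simple null hypothesis on the mean of observations in [0, 1]. It does not use unlabeled data or predictions. The function name `betting_test`, the fixed bet size, and the rejection threshold 1/alpha (the standard Ville's-inequality threshold) are assumptions chosen for illustration.

```python
def betting_test(observations, null_mean, bet=0.5, alpha=0.05):
    """Sequential test of H0: E[Z] = null_mean for Z in [0, 1].

    Returns the 1-based index at which H0 is rejected, or None.
    Hypothetical helper for illustration; not the paper's e-statistic.
    """
    if not 0.0 < null_mean < 1.0:
        raise ValueError("null_mean must lie strictly between 0 and 1")
    # The bet must keep every wealth factor non-negative for z in [0, 1].
    if not -1.0 / (1.0 - null_mean) <= bet <= 1.0 / null_mean:
        raise ValueError("bet is too large for this null_mean")

    wealth = 1.0  # start with one unit of betting capital
    for t, z in enumerate(observations, start=1):
        # Under H0 each factor has expectation 1, so wealth is a
        # non-negative martingale and rarely grows large.
        wealth *= 1.0 + bet * (z - null_mean)
        if wealth >= 1.0 / alpha:
            return t  # stopping here keeps the type-I error at most alpha
    return None


if __name__ == "__main__":
    print(betting_test([1.0] * 20, null_mean=0.5))
    print(betting_test([0.5] * 100, null_mean=0.5))
```

Because the rejection rule can be checked after every observation, the test may be stopped at any time, which is the sense of "anytime valid" used in the abstract.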

2605.28532 2026-05-28 cs.AI

Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

Liang Cheng, Mingsheng Cai, Jiuming Jiang, Luo Mai

AI summary: Proposes FeasiGen, a pipeline that automatically constructs infeasible tasks by masking critical tools to turn solvable tasks into unsolvable ones; evaluation shows that most models lack feasibility detection ability, with false continue rates of up to 73.9%.

Comments: 14 pages

Abstract

Tool-using agents often incur substantial computational cost due to long reasoning chains and iterative tool usage. In practical scenarios, many tasks become infeasible under constrained tool environments, where the capabilities required for successful task completion are unavailable. Detecting infeasible tasks and stopping execution early can significantly reduce unnecessary execution cost. In this work, we propose FeasiGen, an automatic pipeline for constructing infeasible agent tasks by identifying the critical tools required for successful task completion. Our approach extracts tool-calling traces from successful executions across multiple agent systems, identifies critical tools consistently shared across diverse execution strategies, and masks these tools to automatically transform solvable tasks into infeasible ones. Human verification confirms that the infeasibility annotations for our constructed tasks achieve over 94% accuracy. We further introduce feasibility-aware evaluation metrics for measuring whether agents can recognize infeasible tasks and stop execution appropriately. Extensive evaluations across nine models reveal substantially weak infeasibility detection ability, with false continue rate reaching up to 73.9%. We further observe that multi-agent architectures significantly reduce erroneous execution under infeasible conditions.
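
Illustrative sketch (not from the paper): the abstract describes extracting tool-calling traces, finding tools shared across execution strategies, and masking them. One simple reading is a set intersection over successful traces, which is an assumption here; the paper's actual criterion is not stated in the abstract. All function and tool names below are hypothetical.

```python
def critical_tools(traces):
    """Tools that appear in every successful trace (assumed criterion)."""
    tool_sets = [set(trace) for trace in traces]
    return set.intersection(*tool_sets) if tool_sets else set()


def mask_tools(available, critical):
    """Remove critical tools so the task can no longer be completed."""
    return [tool for tool in available if tool not in critical]


def false_continue_rate(continued_flags):
    """Fraction of infeasible tasks on which the agent kept executing."""
    return sum(continued_flags) / len(continued_flags)


if __name__ == "__main__":
    traces = [
        ["search", "read_file", "calc"],
        ["read_file", "calc"],
        ["calc", "browser", "read_file"],
    ]
    critical = critical_tools(traces)
    print(sorted(critical))
    print(mask_tools(["search", "read_file", "calc", "browser"], critical))
    print(false_continue_rate([True, True, False, True]))
```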

2605.28531 2026-05-28 cs.LG

Stabilizing distribution-free probabilistic forecasts

Jente Van Belle, Honglin Wen, Wouter Verbeke, Pierre Pinson

AI summary: Proposes a method based on regression splines parameterized by a neural network that jointly optimizes the quality and stability of distribution-free probabilistic time-series forecasts, controlling the variability caused by forecast updates, and validates it on two datasets.

Abstract

Multi-step-ahead forecasts are often updated as new observations become available, since shorter forecast horizons typically improve forecast quality. However, such improvements come at the cost of forecast instability, i.e., variability in forecasts for the same target period. This instability can trigger costly changes to plans formulated based on the forecasts and may erode trust in the forecasting system. In this work, we integrate forecast stability alongside forecast quality into the training of distribution-free probabilistic time-series forecasting models, allowing us to control this trade-off. We propose a method for generating stabilized forecasted conditional quantile functions using regression splines parameterized by a neural network. This approach enables joint optimization of quality and stability, as it allows us to directly penalize dissimilarities arising from forecast updates. Furthermore, it allows assigning varying importance to stabilizing different parts of the forecast distributions (e.g., central parts vs. tails) to focus on the parts most relevant for the intended downstream use (e.g., the upper tail for inventory management). We empirically evaluate the proposed method on two datasets with different statistical properties and show that it can effectively reduce forecast instability without a substantial loss in forecast quality, and that it can target stabilization effort toward specific parts of the forecast distributions.
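
Illustrative sketch (not from the paper): one way to picture the joint objective is a quality term plus a weighted penalty on how much the forecast for the same target period moved between two updates. The sketch assumes a discrete grid of quantile levels, a pinball loss for quality, and a weighted squared difference for instability; the paper instead uses spline-based quantile functions, and its exact loss is not given in the abstract. The names `stabilized_loss`, `weights`, and `lam` are hypothetical.

```python
def pinball_loss(y, q_pred, tau):
    """Quantile (pinball) loss of prediction q_pred at level tau."""
    diff = y - q_pred
    return max(tau * diff, (tau - 1.0) * diff)


def stabilized_loss(y, new_quantiles, old_quantiles, taus, weights, lam):
    """Forecast quality plus lam times a weighted instability penalty.

    new_quantiles / old_quantiles: forecasts for the same target period
    issued at the current and the previous forecast origin.
    weights: importance of stabilizing each quantile level, for example
    larger values on the upper tail for inventory management.
    """
    quality = sum(
        pinball_loss(y, q, tau) for q, tau in zip(new_quantiles, taus)
    ) / len(taus)
    total_weight = sum(weights)
    instability = 0.0
    if total_weight > 0:
        instability = sum(
            w * (q_new - q_old) ** 2
            for w, q_new, q_old in zip(weights, new_quantiles, old_quantiles)
        ) / total_weight
    return quality + lam * instability


if __name__ == "__main__":
    taus = [0.1, 0.5, 0.9]
    new = [8.0, 10.0, 12.0]
    old = [9.0, 10.0, 12.0]
    # Stabilize all quantiles equally versus only the upper tail.
    print(stabilized_loss(10.0, new, old, taus, [1.0, 1.0, 1.0], lam=1.0))
    print(stabilized_loss(10.0, new, old, taus, [0.0, 0.0, 1.0], lam=1.0))
```

Setting `lam` to zero recovers a pure quality objective, while larger values trade quality for stability.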

2605.28527 2026-05-28 cs.RO

What Frozen VLAs Already Know About Success: A Probing Study of Value-Like Structure in Foundation Robot Policies

Jiachen Zhang, Junnan Nie, Junyi Lao, Wei Cheng, Chenghao Liu, Jiaxin Jiang, Songfang Huang

AI summary: Uses linear probes on frozen VLA features to predict Monte-Carlo outcome targets, finds that the features encode information about success, and shows that the probes can be used for test-time action selection to raise success rates.

Comments: 14 pages, 1 figure, 11 tables. Equal contribution: Jiachen Zhang, Junnan Nie, and Junyi Lao. Corresponding author: Songfang Huang. Preprint

Abstract

Vision-language-action (VLA) policies are trained to imitate actions; their loss never asks them to estimate reward, progress, or future success. Their frozen representations nevertheless carry such information, and it can be read out and used to guide action choice without retraining the policy. From mixed successful and failed manipulation trajectories on LIBERO-Goal, we recover Monte-Carlo outcome targets using lightweight linear probes on frozen features. The targets are consistently predictable from OpenVLA, Pi0.5, DINOv2, and CLIP features, and substantially less so from baselines built on progress, time-to-go, task identity, or proprioception. To rule out task and temporal shortcuts, we evaluate the probes under same-task, same-timestep matched comparisons: Pi0.5 probes still reach roughly 92% pairwise ordering accuracy, while label-shuffled controls stay at chance. Used as a test-time selector over sampled Pi0.5 action prefixes, the same probe turns this offline finding into behavior: on push-plate, success rises from 26.7% under greedy decoding to 44.3%, with a second positive case on wine-rack. The gains are not universal and require additional inference compute, but the underlying finding is clean: frozen VLAs already encode information about success that their imitation objective never explicitly demands.

2605.28526 2026-05-28 cs.AI cs.CL

Entropy-aware Masking for Masked Language Modeling

Gokul Srinivasagan, Kai Hartung, Munir Georges

AI summary: Proposes an entropy-based masking strategy that uses the model's predictive entropy to pick informative tokens for masking, and introduces a self-masking approach to improve training efficiency, with an average GLUE improvement of 5%.

Comments: accepted at starsem 2026 Conference

Abstract

Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, the tokens selected for masking are chosen at random, which may not always yield the most effective learning signals. In this work, we examine a token masking strategy based on entropy distribution. We use the model's entropy over token predictions to identify which tokens should be masked. This method aims to target tokens that are more informative and uncertain to improve the training efficacy. We also propose a novel self-masking approach that enhances training efficiency without relying on an external reference model. Experimental results demonstrate that our method achieves an average performance improvement of 5% in GLUE scores compared to the baseline. Further, we experiment with combining knowledge distillation with entropy masking, resulting in the best overall results.
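
Illustrative sketch (not from the paper): the abstract says tokens are chosen for masking according to the model's entropy over token predictions, but does not say how. The code below assumes the simplest variant, masking the positions with the highest predictive entropy; sampling in proportion to entropy would be another reading. The function names and the toy probability vectors are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_mask_positions(token_probs, mask_ratio=0.15):
    """Pick the highest-entropy positions to mask (assumed top-k rule).

    token_probs: one predicted distribution over the vocabulary per
    position, for example from a reference model or the model itself.
    """
    n_mask = max(1, round(mask_ratio * len(token_probs)))
    ranked = sorted(
        range(len(token_probs)),
        key=lambda i: entropy(token_probs[i]),
        reverse=True,
    )
    return sorted(ranked[:n_mask])


if __name__ == "__main__":
    token_probs = [
        [1.0, 0.0, 0.0],        # fully certain prediction
        [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain prediction
        [0.9, 0.1, 0.0],
        [0.5, 0.5, 0.0],
    ]
    print(select_mask_positions(token_probs, mask_ratio=0.5))
```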

2605.28524 2026-05-28 cs.AI

Let Relations Speak: An End-to-End LLM-GNN Soft Prompt Framework for Fraud Detection

Zhixing Zuo, Huilin He, Jiasheng Wu, Dawei Cheng

AI summary: Proposes the LGSPF framework, which bridges graph structure and semantic space with soft prompts and introduces a parallel GNN encoder that translates multi-relational topologies into graph tokens, optimized end to end, reaching state-of-the-art fraud detection performance.

Comments: 14 pages, 3 figures

Abstract

In recent years, Large Language Models (LLMs) have shown great capability in processing graph tasks such as fraud detection. However, most existing methods rely heavily on rich text attributes, which poses difficulties for this domain due to the lack of textual data. Although some pioneering methods attempt to overcome it, their textualization of graph structures via hard prompts easily leads to feature distortion. Additionally, fraud detection often exhibits multi-relational complexity, where current methods struggle to capture this deep semantic information. To address these challenges, we propose LLM-GNN Soft Prompt Framework (LGSPF). Specifically, LGSPF bridges the graph structure and semantic space using soft prompt to eliminate reliance on text. We further introduce a parallel Graph Neural Network (GNN) encoder to translate multi-relational topologies into graph tokens for fine-grained LLM fraud comprehension. Through end-to-end optimization, LGSPF enhances deep semantic alignment between LLM and GNN. Experiments across diverse fraud detection benchmarks demonstrate our method achieves state-of-the-art performance. Moreover, we further validate the contribution of LGSPF on enhancing the semantic interpretability of fraud behaviors.

2605.28521 2026-05-28 cs.CL

ClinicalEncoder26AM: A Multlilingual Diagnosable ColBERT Model; Evidences from the MultiClinNER Shared Task

François Remy

AI summary: Presents ClinicalEncoder26AM, a multilingual Diagnosable ColBERT model built on BGE-M3 and post-trained on clinical data with multi-adapter distillation and a ColBERT-style retrieval objective; finetuned as a BIO tagger for the MultiClinNER task, it achieves state-of-the-art multilingual entity recall and a Top 5 character-weighted F1 score.

Abstract

ClinicalEncoder26AM is a multilingual Diagnosable ColBERT for clinical and biomedical texts, which aligns its token-level semantics at multiple levels with ClinicalMap25, a clinical latent space inspired by BioLORD-2023 and enriched with synthetic and annotated supervision. The post-training recipe builds upon BGE-M3, and combines synthetic clinical notes, patient-doctor conversations, and annotated resources such as MedMentions, while considering both named-entity-level and sentence-level representations in a multi-adapter distillation, along with a ColBERT-style retrieval objective. In this system demonstration paper, we evaluate the model in the MultiClinNER shared task by finetuning it as a BIO tagger for patient symptoms, disorders, and procedure spans, using a lightweight two-layer CNN head to improve local boundary detection. The resulting system remains simple, processes most documents in a single 8192-token window, and achieves state-of-the-art multilingual entity recall, while achieving Top 5 overall across all entity types and languages in Character-weighted F1 scores. Training curves further show that ClinicalEncoder26AM is markedly more data-efficient than the base M3 model, supporting the usefulness of its clinical post-training for downstream information extraction. The model can be downloaded from https://huggingface.co/Parallia/ClinicalEncoder26AM-Diagnosable-Colbert-L2-for-multilingual-medical-texts

2605.28520 2026-05-28 cs.AI

GS-FUSE: Granger-Supervised Gated Fusion and Multi-Granularity Alignment for Event-Driven Financial Forecasting

Yang Zhang, En Chun, Ziyun Mao, Yulu Wu, Jun Wang

AI summary: Proposes the GS-Fuse framework, which uses a Granger-causality-supervised gated fusion module and a multi-granularity alignment mechanism to selectively exploit event text and price signals, improving forecasts of the market impact of financial events.

DOI: 10.1145/3770855.3817927

Abstract

Accurately forecasting the impact of salient financial events on markets is critical for investors and policymakers. However, existing multimodal time-series models typically fuse text and prices symmetrically, without an explicit way to decide when event text is truly predictive, and thus struggle to exploit the directional event-to-price structure and the heterogeneous roles of textual and price signals. In this work, we propose GS-Fuse, a multimodal event-based forecasting framework that employs (i) a Granger-supervised, causal-aware gated fusion module, which learns to open toward event text only when it provides incremental predictive value beyond historical prices, and (ii) a multi-granularity alignment mechanism that jointly aligns high-level event representations and fine-grained textual cues with future market trajectories. Built as a flexible, plug-and-play adapter on top of off-the-shelf large language models and time-series foundation models, GS-Fuse can be instantiated across diverse backbones and market settings. Extensive experiments on real-world financial datasets show that GS-Fuse consistently outperforms state-of-the-art time-series and multimodal baselines across multiple assets and forecasting horizons.

2605.28517 2026-05-28 cs.LG cs.AI

Stochastic Gradient Descent with Momentum is Algorithmically Stable

Yunwen Lei, Zimeng Wang, Xiaoming Yuan

AI summary: Through an algorithmic stability analysis, shows that stochastic gradient descent with momentum (SGDM) has generalization guarantees on smooth convex problems and establishes optimal excess population risk bounds.

Abstract

Stochastic gradient descent with momentum (SGDM) is one of the most widely used optimization algorithms in machine learning. While optimization properties of SGDM have been extensively studied in the literature, it remains insufficiently understood whether and when SGDM can generalize well to unseen data. In particular, it has been conjectured that while momentum accelerates training, it may degrade generalization. In this paper, we close this gap by developing a comprehensive generalization analysis of SGDM through the lens of algorithmic stability. More specifically, we introduce a generalized SGDM framework that encompasses both Polyak's and Nesterov's momentum schemes, and establish tight on-average model stability bounds for smooth and convex problems. Notably, the obtained bounds exploit small optimization error bounds along the trajectory, apply to any momentum parameter in the interval $[0, 1)$, and do not require the commonly assumed Lipschitzness of loss functions. We further derive optimization error bounds for the generalized SGDM, and combine them with our generalization analyses to obtain optimal excess population risk bounds for SGDM with both Polyak's and Nesterov's momentum.
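
Illustrative sketch (not from the paper): the abstract mentions a generalized SGDM framework covering Polyak's and Nesterov's momentum, but does not give its form. The code below uses the common textbook parameterization in which the two schemes differ only in where the stochastic gradient is evaluated; this is an assumption and may differ from the paper's framework. The function `sgdm` and the toy loss are hypothetical.

```python
import random


def sgdm(grad, w0, data, lr=0.1, beta=0.9, nesterov=False, steps=300, seed=0):
    """Scalar SGD with momentum.

    Polyak (heavy ball): gradient at the current iterate w.
    Nesterov: gradient at the look-ahead point w + beta * v.
    """
    rng = random.Random(seed)
    w, v = w0, 0.0
    for _ in range(steps):
        z = rng.choice(data)  # sample one training example
        point = w + beta * v if nesterov else w
        v = beta * v - lr * grad(point, z)  # momentum buffer update
        w = w + v
    return w


def squared_loss_grad(w, z):
    """Gradient of the smooth convex loss 0.5 * (w - z) ** 2."""
    return w - z


if __name__ == "__main__":
    data = [3.0]  # a single example keeps the run deterministic
    print(sgdm(squared_loss_grad, 0.0, data, nesterov=False))
    print(sgdm(squared_loss_grad, 0.0, data, nesterov=True))
```

The momentum parameter `beta` must lie in the interval [0, 1), the range the abstract says the stability bounds cover.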

2605.28513 2026-05-28 cs.LG cs.AI

Learning Theory of the SVRG: Generalization and Convergence Analysis

Yunwen Lei, Zimeng Wang, Xiaoming Yuan

AI summary: Through an algorithmic stability analysis, establishes the first non-vacuous generalization bounds for SVRG in the convex and strongly convex settings, clarifying the interplay between optimization and generalization and yielding optimal excess population risk bounds.

Abstract

Variance reduction (VR) methods employ stochastic gradients with decreasing variance, and they have been widely applied to solve large-scale optimization problems in machine learning because of their efficiency. Existing theoretical studies of VR methods are mainly focused on the convergence analysis, leaving the generalization behavior largely unexplored. In this paper, we bridge this gap by developing the first non-vacuous generalization analysis of the representative VR method: Stochastic Variance Reduced Gradient (SVRG), through the lens of algorithmic stability. In particular, we establish sharp stability bounds of the SVRG in both convex and strongly convex settings by exploiting its algorithmic structure. The obtained bounds are data-dependent, because the training errors are incorporated along the trajectory. Our analysis clarifies the interplay between optimization and generalization, leading to optimal excess population risk bounds in both settings. Our approach differs substantially from existing analyses of stochastic algorithms in the sense that we decompose the SVRG update as an SGD-like step plus a zero-mean correction term and then introduce novel Lyapunov functions to absorb the additional gradient terms induced by the reference points. Our analytical framework can be generalized to other VR methods, and we demonstrate the generalization by the well-known Stochastic Average Gradient Accelerated (SAGA) method.
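
Illustrative sketch (not from the paper): the decomposition mentioned in the abstract, an SGD-like step plus a zero-mean correction term, can be seen directly in the standard SVRG update. The code below runs SVRG on a toy one-dimensional problem with losses 0.5 * a_i * (w - z_i)^2. The data, step size, and loop lengths are assumptions chosen for illustration, and the snapshot is taken as the last inner iterate, which is one common variant.

```python
import random

A = [1.0, 2.0, 3.0]  # per-example curvatures (toy data)
Z = [1.0, 2.0, 4.0]  # per-example targets (toy data)


def grad_i(w, i):
    """Gradient of the i-th loss 0.5 * A[i] * (w - Z[i]) ** 2."""
    return A[i] * (w - Z[i])


def full_grad(w):
    """Gradient of the average loss over all examples."""
    return sum(grad_i(w, i) for i in range(len(A))) / len(A)


def svrg(w0, lr=0.1, epochs=30, inner_steps=10, seed=0):
    rng = random.Random(seed)
    w = w0
    for _ in range(epochs):
        snapshot = w               # reference point for this epoch
        mu = full_grad(snapshot)   # full gradient at the reference point
        for _ in range(inner_steps):
            i = rng.randrange(len(A))
            sgd_like = grad_i(w, i)                # plain SGD direction
            correction = mu - grad_i(snapshot, i)  # zero mean over i
            w -= lr * (sgd_like + correction)
    return w


if __name__ == "__main__":
    minimizer = sum(a * z for a, z in zip(A, Z)) / sum(A)
    print(svrg(0.0), minimizer)
```

Averaging `correction` over all indices i gives exactly zero, so the SVRG direction stays an unbiased estimate of the full gradient while its variance shrinks as the iterate approaches the reference point.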

2605.28512 2026-05-28 cs.CL

On Compositional Learning Behaviours in Formal Mathematics

Kevin Yandoka Denamganaï

AI summary: Proposes the S2B-LM benchmark, which removes numerical processing as a confound and adds chain-of-thought scaffolding to evaluate Compositional Learning Behaviours (CLBs), and finds that CLB competency is necessary but not sufficient for the hard tail of formal mathematical verification.

Comments: work in progress, under review

Abstract

Self-evolving scientific agents capable of conquering the hard tail of formal mathematics require Compositional Learning Behaviours (CLBs): the capacity to ground and recombine novel symbolic structures in context, beyond mere recombination of prelearned atoms. We propose S2B-LM, an adaptation of the Symbolic Behaviour Benchmark that removes numerical processing as a confound and adds chain-of-thought scaffolding to elicit rather than merely probe latent CLB competency. Cross-evaluating ten Lean 4 theorem provers on CLB competency (adj-ZSCT) and miniF2F whole-proof performance, exact permutation tests establish a hierarchical necessity structure: search-heavy models cover the tractable bulk without detectable CLBs, yet every model breaking into the Olympiad-level tier (miniF2F $>75\%$) is among the five highest CLB scorers ($p=0.004$). After ruling out model scale as a confound, our results show that CLB competency is necessary but not sufficient for the hard tail of formal mathematical verification.

2605.28501 2026-05-28 cs.LG

Fitting Unknown Number of Hyperplanes with Manifold Optimization

Zhiqin Cheng, Yu Zhan, Mingjin Zhang, Lingbo Liu, Liang Lin

AI summary: For fitting an unknown number of hyperplanes, a non-convex, non-differentiable problem with unknown model order, proposes a two-stage manifold optimization algorithm that combines Riemannian Expectation-Maximization with projected density estimation for accurate and robust fitting.

Abstract

Fitting an unknown number of hyperplanes to data is a fundamental yet challenging problem in machine learning, characterized by its non-convexity, non-differentiability, and unknown model order. Existing approaches often struggle with local optima or lack geometric consistency. To address these limitations, we propose a novel framework based on Manifold Optimization. We reformulate the problem as an unsupervised learning task on the unit sphere manifold $\mathcal{S}^{\textbf{dim}-1}$. This formulation effectively handles the non-convex constraints and linearizes the distance measurement, rendering the gradient descent tractable. We propose a Two-Stage Manifold Optimization algorithm. In Phase I, we employ a Riemannian Expectation-Maximization process with a heavy-tailed kernel to robustly estimate posterior probabilities, effectively resolving the ambiguities of point distribution between intersecting hyperplanes. In Phase II, upon convergence of the soft estimates, the probabilistic weights degenerate into hard matching, generating a precise local optimum that strictly satisfies the geometric definition. Furthermore, we introduce a projected density estimation strategy for initialization to facilitate global convergence by significantly reducing the feature description space and search complexity. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines in both geometric accuracy and robustness.
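
Illustrative sketch (not from the paper): the abstract places the unknowns on a unit sphere, so each update has to stay on that manifold. The code below shows only that basic ingredient for a single hyperplane through the origin: a Riemannian gradient step (project the Euclidean gradient onto the tangent space, then renormalize) that minimizes the sum of squared point-to-plane distances. The EM weighting, heavy-tailed kernel, multiple hyperplanes, and initialization strategy from the abstract are not reproduced, and all names and data are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (retraction onto the sphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def fit_normal(points, n0, lr=0.05, steps=300):
    """Estimate the unit normal n of a hyperplane n . x = 0."""
    n = normalize(n0)
    for _ in range(steps):
        # Euclidean gradient of sum_i (n . x_i) ** 2.
        grad = [0.0] * len(n)
        for x in points:
            residual = sum(a * b for a, b in zip(n, x))
            for k in range(len(n)):
                grad[k] += 2.0 * residual * x[k]
        # Remove the radial component to get the tangent-space gradient.
        radial = sum(g * a for g, a in zip(grad, n))
        tangent = [g - radial * a for g, a in zip(grad, n)]
        # Take the step, then map back onto the unit sphere.
        n = normalize([a - lr * t for a, t in zip(n, tangent)])
    return n


if __name__ == "__main__":
    # Points lying in the plane z = 0, so the normal is (0, 0, 1) up to sign.
    points = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [-1.0, 2.0, 0.0]]
    print(fit_normal(points, [1.0, 1.0, 1.0]))
```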
