arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
1. Deep Learning Architectures and Training Methods (34 papers)

2606.12478 2026-06-12 cs.LG cond-mat.stat-mech quant-ph New submission

Boltzmann Attention: Learnable Ising Couplings for Cooperative Attention

Gilhan Kim, Daniel K. Park

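Affiliations: Yonsei University

AI summary: Proposes Boltzmann attention, which strengthens inter-position interactions in the attention mechanism through learnable Ising couplings. It outperforms standard softmax attention on character-level language modeling and bracket matching, and the authors show that training with quantum annealing is effective.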

Comments 19 pages, 5 figures

Abstract

Attention mechanisms are central to modern sequence models, yet standard attention computes relevance primarily through individual query-key similarities. Although softmax normalization introduces competition among positions, a standard attention layer does not explicitly parameterize learnable interactions between attention decisions. This limits its ability to directly model cooperative or antagonistic co-attention structure within the attention mechanism itself. We propose Boltzmann attention, an energy-based generalization in which attention patterns are governed by an interacting Ising model. The method augments the usual data-dependent local fields with learnable pairwise couplings, allowing the model to represent inter-position correlations beyond those captured by softmax or sigmoid attention. Experiments on character-level language modeling and synthetic bracket matching show that Boltzmann attention consistently improves over standard softmax attention within a standard Transformer architecture, with the advantage becoming more pronounced as sequence length increases. A four-way ablation confirms that the improvement arises from the learnable pairwise couplings. These results suggest that explicit inter-position interactions provide a principled enhancement for attention-based sequence modeling. Moreover, the Ising formulation opens a natural path toward quantum-computing-based sampling strategies: we demonstrate that diabatic quantum annealing provides a practical training method while maintaining competitive performance with exact Boltzmann computation.
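
As a rough illustration of the idea in this abstract, the sketch below computes exact Boltzmann marginals for a tiny Ising-style attention pattern by brute-force enumeration. The abstract does not give the energy function, so the form E(s) = -h·s - 0.5 sᵀJs over binary gates s in {0,1}^n, the function names, and the toy values are all assumptions made here for illustration, not the authors' formulation.

```python
import itertools
import numpy as np

def boltzmann_attention_marginals(h, J):
    """Exact marginals P(s_i = 1) for binary gates s in {0,1}^n under the
    assumed energy E(s) = -h.s - 0.5 * s^T J s (J symmetric, zero diagonal)."""
    n = len(h)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    # Negative energy of every configuration (2^n rows, so small n only).
    neg_energy = states @ h + 0.5 * np.einsum("si,ij,sj->s", states, J, states)
    weights = np.exp(neg_energy - neg_energy.max())  # stabilised Boltzmann weights
    probs = weights / weights.sum()
    return probs @ states  # expected value of each gate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

h = np.array([0.5, -1.0, 2.0])            # hypothetical data-dependent local fields
no_coupling = boltzmann_attention_marginals(h, np.zeros((3, 3)))

J = np.zeros((3, 3))
J[0, 1] = J[1, 0] = 1.5                   # hypothetical cooperative coupling
coupled = boltzmann_attention_marginals(h, J)
```

Under these assumptions, setting J to zero makes the gates independent and recovers sigmoid attention, while a positive coupling between positions 0 and 1 raises both of their marginals and leaves position 2 unchanged.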

2606.12497 2026-06-12 cs.LG cs.RO New submission

μVLA: On Recurrent Memory for Partially Observable Manipulation in VLA Models

Egor Cherepanov, Nikita Kachaev, Daniil Zelezetsky, Aydar Bulatov, Artem Pshenitsyn, Yuri Kuratov, Alexey Skrynnik, Aleksandr I. Panov, Alexey K. Kovalev

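Affiliations: CogAI Lab, Moscow, Russia; MIRAI, Moscow, Russia

AI summary: To address the lack of memory in VLA models under partial observability, proposes a minimal recurrent-memory augmentation that uses only learnable memory tokens and truncated backpropagation through time. On MIKASA-Robo it raises the success rate on training tasks from 0.42 to 0.84, and on LIBERO it preserves full-observability performance.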

Comments 34 pages, 20 figures, 9 tables

Abstract

Vision-language-action (VLA) models predict chunks of future actions from the current observation, an assumption that fails under partial observability, where decisions depend on information no longer visible. Existing memory-augmented VLAs simultaneously introduce recurrence, retrieval, compression modules, auxiliary objectives, hierarchical memory, or task-specific architectural changes, so the contribution of recurrence itself remains entangled with surrounding machinery. We present a controlled isolation study of recurrence in a strong pretrained VLA backbone. Our formulation augments the transformer with a small set of learnable memory tokens carried across timesteps and updated through self-attention, trained end to end with truncated backpropagation through time, with no auxiliary losses and no architectural changes. We instantiate this as μVLA, a family of OpenVLA-OFT variants parameterized by memory width m, TBPTT length K, and the memory update rule (cross-step gradients or a detached EMA), so that recurrence is the only varying factor. On MIKASA-Robo, μVLA improves average success rate on five training tasks from 0.42 to 0.84 at the strongest setting and reaches 0.23 on held-out tasks with the same memory structure versus 0.07 for the memoryless baseline. On tasks requiring different memory structure, performance remains near baseline. On LIBERO, the strongest recurrent variant achieves 96.2% average success, indicating no regression under full observability. We interpret these results as a calibration of the capability envelope of minimal in-backbone recurrence, identifying the regime in which it is sufficient and the regime where additional memory structure is required. Demos and videos can be found at https://avanturist322.github.io/mu-vla/.

2606.12507 2026-06-12 cs.LG New submission

Rubric-Guided Self-Distillation: Post-Training Without Rubric Verifiers

MohammadHossein Rezaei, Anas Mahmoud, Zihao Wang, Utkarsh Tyagi, Advait Gosai, Razvan-Gabriel Dumitru, Aakash Sabharwal, Bing Liu, Yunzhong He

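Affiliations: Scale AI

AI summary: Proposes RGSD, which distills a rubric-conditioned teacher into the student model and so enables dense per-token learning without a verifier. In medical and science domains it reaches rubric satisfaction comparable to judge-based GRPO.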

Abstract

Rubrics have emerged as an alternative to RLVR in open-ended domains where a single ground-truth final answer is not available. Existing rubric-based training methods rely on an LLM verifier that scores each rollout against rubrics. This introduces substantial training-time overhead, exposes optimization to verifier-specific biases, and reduces rubric feedback to a sparse end-of-trajectory signal. We propose Rubric-Guided Self-Distillation (RGSD), a verifier-free training method in which the base policy, conditioned on the rubric, serves as the teacher for the unconditioned student. RGSD distills the rubric-conditioned teacher distribution into the student token-by-token, replacing sparse trajectory-level rewards with dense per-token learning signals and removing the LLM judge from the training loop entirely. Across Qwen-2.5 (3B, 7B) and Qwen3-Thinking (4B, 8B) models on medical and science domains, RGSD achieves rubric satisfaction comparable to judge-based GRPO while using one on-policy rollout per prompt and no training-time verifier calls. Ablations show that raw rubrics provide a stronger teacher enrichment signal than self-generated reference responses, while a stronger GRPO judge can outperform RGSD in some settings, positioning RGSD as a complementary verifier-free alternative when verifier cost or reliability is the bottleneck.
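
The dense per-token signal described in this abstract can be pictured as a divergence between two next-token distributions at every position. The sketch below is a minimal NumPy illustration under assumptions made here: the abstract does not say which divergence or direction RGSD uses, so the forward KL(teacher || student), the function names, and the random stand-in logits are all hypothetical.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def per_token_kl(teacher_logits, student_logits):
    """KL(teacher || student) at each token position; inputs are [seq_len, vocab]."""
    t = log_softmax(teacher_logits)
    s = log_softmax(student_logits)
    return (np.exp(t) * (t - s)).sum(axis=-1)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 10))  # stand-in for logits given prompt + rubric
student = rng.normal(size=(4, 10))  # stand-in for logits given prompt only
loss_per_token = per_token_kl(teacher, student)  # one value per position
```

Each position contributes its own loss term, in contrast to a single end-of-trajectory reward from a judge.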

2606.12740 2026-06-12 cs.LG New submission

Deep Unfolded Latent Optimally Partitioned-l2/l1 Networks for Data-driven Block-Sparse Recovery

Takanobu Furuhashi, Hidekata Hontani, Qibin Zhao, Tatsuya Yokota

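Affiliations: Nagoya Institute of Technology; RIKEN Center for Advanced Intelligence Project

AI summary: The convex LOP-l2/l1 method relies on manual tuning, and its proximal operator is numerically unstable to differentiate. To address this, the paper proposes two deep-unfolded architectures, one based on implicit differentiation and one on Deep Weight Factorization, that learn parameters automatically. They perform competitively in block-sparse recovery and are robust to impulsive noise.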

Comments 11 pages, 6 figures

Abstract

The convex Latent Optimal Partition (LOP)-l2/l1 approach enables block-sparse signal recovery with unknown partitions but relies on manual hyperparameter tuning. Additionally, numerical instability in differentiating its proximal operator prevents its automatic parameter tuning via Deep Unfolding (DU). To address these limitations, we propose two architectures: a stable framework utilizing implicit differentiation and a flexible variant leveraging Deep Weight Factorization (DWF). The DWF-based approach also supports nonconvex smooth data fidelity terms. Numerical experiments demonstrate that DU-LOP-l2/l1 yields competitive performance and high resilience against impulsive noise.
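
For readers unfamiliar with l2/l1 penalties, the sketch below shows the classical block soft-thresholding operator, the proximal operator of the group penalty when the partition is known in advance. This is not the LOP-l2/l1 operator from the paper, which handles unknown partitions; the function name and the toy partition are assumptions for illustration only.

```python
import numpy as np

def block_soft_threshold(x, blocks, lam):
    """Proximal operator of lam * sum_g ||x_g||_2 for a known,
    non-overlapping partition given as a list of index lists."""
    out = np.zeros_like(x, dtype=float)
    for idx in blocks:
        norm = np.linalg.norm(x[idx])
        if norm > lam:
            # Shrink the whole block toward zero; small blocks are zeroed entirely.
            out[idx] = (1.0 - lam / norm) * x[idx]
    return out

x = np.array([3.0, 4.0, 0.1, -0.2])
result = block_soft_threshold(x, blocks=[[0, 1], [2, 3]], lam=1.0)
# First block has norm 5 and is scaled by 0.8; second block falls below lam.
```

A deep-unfolded network would stack iterations containing an operator of this kind and learn parameters such as lam from data, which is why differentiability of the operator matters.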

2606.12840 2026-06-12 cs.LG New submission

CLARITree: Cholesky and Lookahead Accelerations for Regression with Interpretable Piecewise Linear Trees

Yixiao Wang, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin

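AI summary: Proposes an algorithm that combines lookahead search with rank-one Cholesky updates to build near-optimal, sparse, piecewise linear regression trees, achieving a favorable trade-off among computational efficiency, predictive accuracy, and sparsity.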

Comments Accepted at ICML 2026

Abstract

Regression trees are among the most interpretable yet expressive model classes in machine learning. Historically, greedy induction has been the dominant approach for constructing well-performing regression trees. While optimal methods based on dynamic programming and branch-and-bound exist, they are computationally prohibitive for general linear regression trees, despite often achieving substantially better performance than greedy approaches. Recent work has shown that specialized lookahead strategies can dramatically improve runtime while maintaining near-optimal performance, primarily in classification settings. In this work, we develop a novel algorithm for near-optimal, sparse, piecewise linear regression trees that combines a lookahead-style search strategy with efficient rank-one Cholesky updates of the Gram matrix. We demonstrate, both theoretically and empirically, that our method achieves a favorable trade-off between computational efficiency, predictive accuracy, and sparsity, and scales significantly better than the current state of the art.
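
The rank-one Cholesky update mentioned in this abstract is a standard numerical linear algebra routine. The sketch below implements the textbook version in NumPy; how CLARITree uses it during split search is not described in the abstract, so the usage shown (adding one sample to a Gram matrix) is an assumption about a plausible use, and the function name is hypothetical.

```python
import numpy as np

def cholesky_rank_one_update(L, x):
    """Return lower-triangular L' with L' L'^T = L L^T + x x^T in O(n^2),
    instead of refactorising from scratch in O(n^3)."""
    L = L.copy()
    x = x.astype(float).copy()
    n = x.size
    for k in range(n):
        r = np.hypot(L[k, k], x[k])
        c, s = r / L[k, k], x[k] / L[k, k]
        L[k, k] = r
        if k + 1 < n:
            L[k + 1:, k] = (L[k + 1:, k] + s * x[k + 1:]) / c
            x[k + 1:] = c * x[k + 1:] - s * L[k + 1:, k]
    return L

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
gram = X.T @ X                       # Gram matrix of the current samples
L = np.linalg.cholesky(gram)
new_row = rng.normal(size=4)         # one extra sample moved into the leaf
L_updated = cholesky_rank_one_update(L, new_row)
```

Moving a candidate split threshold past one sample changes the Gram matrix of a leaf by exactly one outer product, which is the situation this update is designed for.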

2606.12841 2026-06-12 cs.LG cs.AI New submission

TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

Zhengtao Yao, Liuyang Song, Hongbo Zhang, Chenhao Wei, Haoyan Xu, Guang Yang, Siheng Wang

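Affiliations: Shanghai Jiao Tong University; Nanyang Technological University; National University of Singapore; University of Science and Technology of China

AI summary: Proposes TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for masked diffusion language models. It locates the key coordinate through temporal causal tracing and applies a low-rank residual edit, removing facts efficiently while preserving model performance.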

Abstract

Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.
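
A closed-form, ridge-regularised, low-rank edit of the kind this abstract describes can be illustrated with ordinary ridge regression from keys to target deltas. The sketch below is a generic construction, not the paper's method: the objective, the matrix shapes, and the names `ridge_edit`, `K`, and `D` are assumptions made here, and the sparsification step is omitted.

```python
import numpy as np

def ridge_edit(K, D, lam):
    """Minimiser of ||delta_W @ K - D||_F^2 + lam * ||delta_W||_F^2.
    K: [d_key, n_facts] subject keys; D: [d_val, n_facts] target deltas.
    Uses the n_facts x n_facts form, so the result has rank <= n_facts."""
    n_facts = K.shape[1]
    return D @ np.linalg.solve(K.T @ K + lam * np.eye(n_facts), K.T)

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 3))   # hypothetical keys for three facts
D = rng.normal(size=(8, 3))    # hypothetical target residual deltas
delta_W = ridge_edit(K, D, lam=1e-3)
```

With a small lam the edit maps each key almost exactly to its target delta, and no gradient step through the backbone is needed to compute it.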

2606.12895 2026-06-12 cs.LG New submission

LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

Xinrui He, Qiyu Kang, Xuhao Li, Zheng-Jun Zha

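Affiliations: Wuhan University; University of Science and Technology of China; Anhui University

AI summary: Proposes LongSpike, a framework that brings fractional-order state-space models (f-SSM) into spiking neural networks, achieving efficient long-sequence learning through long-memory kernels and outperforming existing SNNs on several benchmarks.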

Abstract

Spiking Neural Networks (SNNs) are well-regarded for their biological plausibility and energy efficiency in processing sequential data. However, dominant SNN architectures typically rely on first-order Ordinary Differential Equations (ODEs) to govern neuronal state transitions. This first-order assumption imposes a "memoryless" bottleneck, limiting the model's capacity to capture the complex, long-range dependencies inherent in long-sequence tasks. In this work, we propose LongSpike, a novel SNN framework that integrates fractional-order State-Space Modeling, or f-SSM, from control theory into the spiking domain. By extending traditional integer-order SSMs to the fractional-calculus regime, LongSpike enables the hierarchical integration of neuronal dynamics with long-memory kernels. To mitigate the computational overhead and parallelization challenges typically associated with fractional operators, we leverage a state-space formulation that supports efficient, parallel training. Empirical evaluations on challenging benchmarks, including Long Range Arena (LRA), large-scale WikiText-103, and Speech Commands, demonstrate that LongSpike outperforms state-of-the-art SNNs in accuracy while preserving sparse synaptic computation. The code is available at https://github.com/xinruihe389-commits/LongSpike.
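
To see why a fractional order yields a long-memory kernel while a first-order model is memoryless, it helps to look at the Grünwald-Letnikov difference weights, one standard discretisation of a fractional derivative. The abstract does not state which discretisation LongSpike uses, so this choice and the function name are assumptions for illustration.

```python
import numpy as np

def grunwald_letnikov_weights(alpha, n_terms):
    """Coefficients w_k = (-1)^k * binom(alpha, k) of the Grünwald-Letnikov
    fractional difference, via w_k = w_{k-1} * (1 - (alpha + 1) / k)."""
    w = np.empty(n_terms)
    w[0] = 1.0
    for k in range(1, n_terms):
        w[k] = w[k - 1] * (1.0 - (alpha + 1.0) / k)
    return w

first_order = grunwald_letnikov_weights(1.0, 5)   # [1, -1, 0, 0, 0]
half_order = grunwald_letnikov_weights(0.5, 4)    # [1, -0.5, -0.125, -0.0625]
```

For order 1 only the current and previous step carry weight, whereas for order 0.5 every past step keeps a slowly decaying nonzero weight, so the state depends on the whole history.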

2606.12917 2026-06-12 cs.LG New submission

Where Computation Lives Inside TabPFN: Causal Localisation of Attention Head Function

Atharva Gupta, Dhruv Kumar, Murari Mandal, Saurabh Deshpande

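AI summary: Using activation patching, ablation, and attention-entropy analysis, finds that one attention head in TabPFN 2.5 has 2 to 5 times the causal necessity of the other heads at its peak layer, that its dominant layer shifts with task complexity, and that the remaining heads show symmetric late-layer profiles.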

Comments Accepted to Workshop FMSD @ ICML 2026

Abstract

We present the first causal mechanistic analysis of a tabular foundation model, investigating how TabPFN 2.5's feature-wise attention heads distribute computation across layers. Using activation patching, ablation, and attention entropy across two synthetic regression datasets, we find clear temporal specialisation: one head's causal necessity dominates that of the others by 2 to 5 times at its peak layer, with its dominant layer shifting across tasks of different complexity, while the remaining heads exhibit symmetric late-layer profiles. Attention entropy and patching provide convergent evidence for the computationally active layers of the dominant head. We additionally investigate inference-time steerability via contrastive activation steering, which fails to transfer across samples. We attribute this result to TabPFN's in-context learning mechanism, which encodes task structure through context-dependent attention rather than the stable parametric directions that make steering tractable in language models.
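
Attention entropy, one of the three diagnostics named in this abstract, is simply the Shannon entropy of each attention row. The sketch below is a minimal NumPy version; the function name and the small epsilon guard are choices made here, and how the paper aggregates entropy across heads and layers is not stated in the abstract.

```python
import numpy as np

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy in nats of each attention row; rows of attn sum to 1."""
    return -(attn * np.log(attn + eps)).sum(axis=-1)

uniform = np.full((1, 4), 0.25)               # head attends everywhere equally
peaked = np.array([[1.0, 0.0, 0.0, 0.0]])     # head attends to one position
entropy_uniform = attention_entropy(uniform)  # equals log(4)
entropy_peaked = attention_entropy(peaked)    # approximately 0
```

Low entropy indicates a head that focuses sharply, which is why it can serve as a cheap, correlational companion to the causal evidence from patching.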

2606.12966 2026-06-12 cs.LG cs.NE New submission

Circuit Synchronization Precedes Generalization: Causal Evidence from Fourier Structure in Grokking Transformers

Achyuthan Sivasankar

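Affiliations: New York University

AI summary: Introduces the Frequency Synchronization Degree (FSD) metric and finds that on modular arithmetic tasks it synchronizes 500 to 3,000 steps before grokking. Forking experiments that vary weight decay provide causal evidence that the gap between the two is a regularization phenomenon.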

Comments 16 pages, 6 figures, 10 tables

Abstract

Grokking, where a transformer on modular arithmetic suddenly transitions from near-chance to near-perfect validation accuracy, is attributed to a Fourier circuit, but its timing, causal structure, and controllability remain poorly understood. We introduce the Frequency Synchronization Degree (FSD), a normalised, permutation-tested metric for Fourier circuit synchronisation requiring no prior circuit knowledge. Across nine modular addition configurations (primes p in {53, 71, 97, 113, 131}, three seeds), FSD synchronises 500-3,000 steps before grokking (mean lead +1,722 steps; all nine positive, sign-test p ≈ 0.004), and precedes a restricted-logit loss baseline (Nanda et al.'s excluded loss) in all nine cases, making it the earliest available predictor. We provide direct causal evidence that the inter-phase gap is a regularisation phenomenon: forking training at the FSD-ceiling step and varying weight decay lambda produces strictly monotone earlier grokking, with Delta_t proportional to 1/lambda. This law replicates across three primes (p in {53, 97, 131}; R^2 = 1.00 and R^2 = 0.99 for two clean cases), captured as Delta_t ~ C/lambda, consistent with (1/lambda)*log(||W_mem||/tau). Architecture ablations show an attention-only model groks with a strong FSD precursor; an MLP-only model never groks; a single-layer model's FSD lags, confirming the precursor is a multi-block circuit property.
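
The reported sign-test figure can be checked by hand. The abstract does not say whether the test is one-sided or two-sided; the sketch below assumes an exact two-sided sign test, which for nine positive leads out of nine gives 2 × (1/2)^9 ≈ 0.0039 and so matches the quoted p of about 0.004. The function name is hypothetical.

```python
from math import comb

def two_sided_sign_test(n_positive, n_total):
    """Exact two-sided sign-test p-value under H0: P(positive) = 0.5."""
    k = max(n_positive, n_total - n_positive)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)

p_value = two_sided_sign_test(9, 9)   # 2 / 512 = 0.00390625
```

A one-sided reading would give half this value, so the two-sided interpretation is the one consistent with the number in the abstract.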

2606.12979 2026-06-12 cs.LG New submission

EPM-JEPA: Operator-Side Experience Modulation in JEPA-Family World Models

Vedant Pandya

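Affiliations: School of Artificial Intelligence and Data Engineering (SAIDE), Indian Institute of Technology Jodhpur

AI summary: Proposes EPM-JEPA, which modulates the predictor at the weight level via LoRA to cope with test-time dynamics shift. The pre-registered comparison against EI-JEPA is a null result; EPM-JEPA does improve on a no-memory baseline, and a mechanism analysis reveals three independent dynamical processes.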

Comments 16 pages, 5 figures, 9 tables, 5 code listings. Pre-registered experimental study with mechanism analysis

Abstract

JEPA-family world models use a static predictor whose weights do not adapt when test-time dynamics diverge from training. We compare two mechanisms for incorporating accumulated experience into a JEPA predictor under distribution shift: operand-side injection, where a compressed experience representation is added as a residual to the predictor's hidden state (EI-JEPA), and operator-side modulation, where the same representation generates low-rank weight deltas via LoRA applied to the predictor's weights (EPM-JEPA). On a pre-registered comparison (Moving MNIST, gravity shift), EPM-JEPA (D_shift^{n=50} = 0.7848 +/- 0.0078, three seeds) differs from EI-JEPA (0.8238) by delta = 4.74%, which falls under Outcome C: a null result and, by our stated criterion, a valid outcome. As a secondary, non-pre-registered observation, EPM-JEPA improves 1.90% over a no-memory baseline (0.8000), consistently across seeds, while EI-JEPA underperforms the baseline, indicating the benefit is specific to weight-level modulation. Our primary contribution is a mechanism analysis: the D_shift^{n=50} trajectory reflects three independent dynamical processes (buffer cycling, EMA target drift, and an intrinsic LoRA settling transient of +0.021) rather than convergence to equilibrium. These findings motivate PEM-JEPA, a physics-grounded successor addressing this dynamical-peak limitation.
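
The distinction between operand-side injection and operator-side modulation can be sketched on a single linear layer. The code below is a schematic illustration only: the function names, shapes, and zero-initialised `B` are assumptions made here, and in the paper the residual and the low-rank factors are generated from a compressed experience representation, which this sketch does not model.

```python
import numpy as np

def operand_side(W, h, residual):
    """EI-style: add an experience-derived residual to the hidden state,
    then apply the fixed predictor weights."""
    return W @ (h + residual)

def operator_side(W, h, A, B):
    """EPM-style: apply predictor weights modulated by a low-rank
    (LoRA-style) delta B @ A; the hidden state itself is untouched."""
    return (W + B @ A) @ h

rng = np.random.default_rng(0)
d, r = 8, 2
W = rng.normal(size=(d, d))
h = rng.normal(size=d)
A = rng.normal(size=(r, d))
B = np.zeros((d, r))        # common LoRA initialisation: delta starts at zero
baseline = W @ h
```

With a zero residual or a zero `B` both variants reduce to the static predictor, so any difference between them comes from where the experience signal enters: the input of the map or the map itself.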

2606.13081 2026-06-12 cs.LG cs.AI New submission

Emotional regulation improves deep learning-based image classification

Riccardo Emanuele Landi, João M. F. Rodrigues, Marta Chinnici

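Affiliations: Mare Group; NOVA LINCS; Institute of Engineering (ISE), University of Algarve; Department of Energy Technologies and Renewable Sources, ENEA Casaccia Research Center

AI summary: Proposes the Emotional Regulation framework, which models emotion in deep learning through artificial subjective experience. ResNet and ViT are pre-trained on affective data for image classification, and the authors report results on CIFAR-10/100 that exceed prior emotion-augmented methods.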

Abstract

Emotion significantly influences cognition, enhancing memory and learning under certain conditions. Drawing on this principle, emotion-augmented deep learning investigates how affective states can improve neural network architectures and learning paradigms, achieving better generalization than non-emotional models. However, existing methods often rely solely on objective neurophysiological factors, neglecting the role of subjectivity in emotion. To bridge this gap, the present study introduces Emotional Regulation, a novel framework for modeling emotion in deep learning through artificial subjective experience. The method employs pre-training based on affective stimuli, balancing non-emotional and emotionally influenced responses in downstream task optimization. Extensive experimentation was conducted in image classification, pre-training ResNet and ViT architectures on four emotional datasets, using CIFAR-10 and CIFAR-100 as target benchmarks. Results reveal improvements over the aforementioned backbones, providing evidence of Emotional Regulation as a promising method for defining emotion-augmented deep learning through artificial subjective experience. Furthermore, the proposed approach outperforms related work in image classification based on CIFAR, positioning Emotional Regulation as the new state of the art in emotion-augmented deep learning for large-scale vision datasets. The study also reinforces evidence of the impact of affective states in improving the optimization of machine learning tasks, encouraging further investigation into emotion-inspired architectures.

2606.13106 2026-06-12 cs.LG cs.CL New submission

Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Jiayu Yang, Chao Chen, Shengen Wu, Yinhong Liu, Yuxuan Fan, Lujundong Li, Songning Lai, Chengwei Qin, Zhijiang Guo

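Affiliations: HKUST(GZ); University of Cambridge; NTU; JoinQuant; HKUST

AI summary: Proposes SWITCH, a framework in which discrete boundary tokens make hidden-state-recurrence reasoning compatible with on-policy reinforcement learning and support causal mechanistic analysis. Experiments show it outperforms prior methods.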

Abstract

Latent chain-of-thought compresses reasoning by replacing visible reasoning traces with continuous hidden-state recurrence, but existing formulations are difficult to optimize with standard on-policy reinforcement learning (RL) and hard to interpret causally. Our key insight is that a single pair of explicit boundary tokens can address both issues at once: discrete entry and exit anchors make the latent block compatible with standard on-policy RL, and the same anchors offer a natural foothold for mechanistic analysis. Motivated by this, we propose SWITCH, a switchable latent reasoning framework. The model emits <swi> to enter latent mode and </swi> to exit. Because the boundaries are ordinary discrete tokens, the GRPO policy ratio is well-defined at every decision point. The same anchors also expose the latent steps to direct probing and causal intervention. We train the model with a visible-to-latent curriculum and a Switch-GRPO objective that propagates gradients through recurrent latent computation. SWITCH consistently outperforms prior hidden-state-recurrence latent reasoning approaches at similar scale. Mechanistic analysis through the boundary tokens further reveals three findings: (i) <swi> is a sharply localised, learned switching policy rather than a stylistic artefact; (ii) the latent step it opens performs problem-specific, causally important computation rather than acting as an inert placeholder; and (iii) that computation is concentrated at a single hidden-state transition on entry. Together, these results show that hidden-state-recurrence latent reasoning is both RL-trainable and open to direct mechanistic analysis, including of how on-policy RL itself improves the model from the inside.

2606.13125 2026-06-12 cs.LG cs.AI New submission

Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Akshay Krishnamurthy, Audrey Huang, Nived Rajaraman

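Affiliations: Microsoft Research NYC; UIUC

AI summary: Controlled experiments reveal that reinforcement learning post-training improves reasoning through two mechanisms, strategy selection and strategy improvement, and clarify the distinct roles of SFT data and RL data.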

Abstract

Reinforcement learning has rapidly emerged as a key component in the training of reasoning and coding models, yet it remains poorly understood from a mechanistic perspective. We study how and through what underlying processes capabilities are acquired or enhanced via reinforcement learning post-training. Our analysis, based on controlled math reasoning experiments with Qwen-2.5-1.5B, reveals two core mechanisms: strategy selection and strategy improvement. Our results highlight the role of SFT data and reinforcement learning data in activating these mechanisms, in particular showing how supervising the model on diverse reasoning strategies can enable strategy selection and how increasing difficulty in reinforcement learning data can enable strategy improvement. Taken together, our results provide mechanistic insight into RL training and suggest practical interventions to continue scaling reasoning capabilities.

2606.13168 2026-06-12 cs.LG New submission

When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

Aydin Javadov

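Affiliations: ETH Zurich

AI summary: Studies the interpretability of routing in Block Attention Residuals, finding that structured depth routing emerges only when routing is part of training and that routing weights dissociate from causal importance, so causal interventions are needed to verify any interpretation.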

Abstract

Block Attention Residuals (Block AttnRes) replace fixed additive residuals with a learned softmax over earlier depth-source representations, surfacing cross-layer routing as an inspectable tensor in the forward pass. This is a tempting interpretability target: information flow normally inferred indirectly is now directly observable. We ask whether such exposure suffices for mechanistic interpretation. We probe two same-scale (0.6B) Block AttnRes checkpoints under identical routing-ablation interventions: a vanilla Qwen3 inference-wrapped through a deterministic recency-bias schedule that the codebase admits as a routing-equivalent loading path, and a Block AttnRes Qwen3 trained from scratch with routing as part of optimisation. The wrapped baseline's routing weights are content-independent and reproduce the schedule's analytic prediction. The trained AttnRes checkpoint instead exhibits three localised routing motifs: an embedding-source pathway through early-layer MLP, a current-state pathway through early-layer attention and MLP, and an older-history pathway through late-layer attention. Beyond this stratification, we find a sharp dissociation between average routing mass and causal importance: in both sublayers, the largest mass slice is not the largest causal contribution, and one source family carries appreciable mass with no detectable causal role under intervention. Architectural exposure of routing is therefore necessary but not sufficient for mechanistic interpretation: structured depth routing emerges only when routing has been part of training, and even then, descriptive routing summaries should be treated as candidate hypotheses to be tested by causal interventions, not as evidence of mechanism in their own right.
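
The gap between routing mass and causal importance described in this abstract can be reproduced in a toy setting. The sketch below mixes two earlier representations with softmax routing weights and then zero-ablates each source. The function names, the toy values, and the use of zero-ablation as the intervention are all assumptions made here; the abstract does not specify the paper's routing parameterisation or ablation procedure.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def routed_residual(sources, routing_logits):
    """Mix earlier depth-source representations with softmax routing weights.
    sources: [n_sources, d_model]; routing_logits: [n_sources]."""
    weights = softmax(routing_logits)
    return weights @ sources, weights

def ablation_effect(sources, routing_logits, index):
    """Norm of the output change when one source is zeroed (a crude causal probe)."""
    full, _ = routed_residual(sources, routing_logits)
    ablated = sources.copy()
    ablated[index] = 0.0
    changed, _ = routed_residual(ablated, routing_logits)
    return np.linalg.norm(full - changed)

sources = np.array([[0.01, 0.0],    # source 0: tiny representation
                    [1.0, 1.0]])    # source 1: large representation
logits = np.array([2.0, 0.0])       # routing favours source 0
_, weights = routed_residual(sources, logits)
effect_0 = ablation_effect(sources, logits, 0)
effect_1 = ablation_effect(sources, logits, 1)
```

Here source 0 receives most of the routing mass yet removing it barely changes the output, while source 1 has less mass and a much larger effect, a miniature version of why descriptive routing weights need to be checked by intervention.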

2606.13223 2026-06-12 cs.LG cs.CV New submission

Distributional Loss for Robust Classification

Kathleen Anderson, Thomas Martinetz

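Affiliations: Institute for Neuro- and Bioinformatics

AI summary: Proposes a distributional loss based on a bimodal Gaussian distribution. The softened targets implicitly capture class ambiguity, mitigate overfitting, and make decision boundaries more robust, with the largest gains in low-data regimes.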

Comments ICANN 2026

Abstract

This paper proposes a novel loss concept for supervised classification tasks. Rather than enforcing a direct mapping from each input sample to a single assigned label, we define an optimization objective over all classifier outputs as a bimodal Gaussian distribution. This softer target formulation implicitly captures class ambiguity, mitigates overfitting, and encourages the learning of more robust decision boundaries, all without requiring additional label information. Experimental results demonstrate consistent improvements in robustness, with particularly pronounced gains in low-data regimes, while requiring only minimal modifications to standard training pipelines.

2606.13276 2026-06-12 cs.LG cs.AI 新提交

Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

不同层,不同流形:Transformer优化中的模块级权重空间几何

Kirato Yoshihara

发表机构 * School of Engineering Science, The University of Osaka(大阪大学工程科学学院)

AI总结 研究Transformer不同模块偏好不同流形几何,提出为注意力层和MLP层分别分配Stiefel和DGram约束,在GPT-2预训练中取得最佳性能。

Comments Accepted at WSS @ ICML 2026, code is available at https://github.com/kiratoyoshihara/module-wise-manifold-muon

详情
AI中文摘要

权重空间几何在神经网络优化中扮演核心角色,但流形约束通常统一应用于所有权重矩阵。在这项工作中,我们探究不同Transformer模块是否偏好不同的流形几何。我们研究GPT-2预训练的Manifold Muon,并比较跨注意力块和MLP块的Stiefel和DGram约束的逐层分配。我们的结果显示出明显的不对称性:在测试配置中,将注意力层约束为Stiefel几何,同时将MLP层分配为DGram几何,获得了最佳性能;而反向分配和全DGram配置在共享超参数设置下变得不稳定。我们将这种失败归因于DGram约束的注意力权重中奇异值的增长,这会放大注意力logits并导致softmax饱和。这些发现表明,Transformer的对称感知和几何感知优化应该是模块特定的,而不是统一的。

英文摘要

Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.
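As a hedged illustration of the Stiefel constraint mentioned above (not of Manifold Muon itself, whose update rule the abstract does not give, and not of the DGram constraint, which the abstract does not define), the sketch below projects a weight matrix onto the set of matrices with orthonormal columns using the polar factor from an SVD. The function name `project_to_stiefel` is hypothetical.

```python
import numpy as np

def project_to_stiefel(w):
    """Return the closest matrix (Frobenius norm) with orthonormal columns.

    For a full-rank w = U S V^T this is the polar factor U V^T.
    """
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 4)) * 5.0            # deliberately large-scale weights
w_st = project_to_stiefel(w)
singular_values = np.linalg.svd(w_st, compute_uv=False)
```

Every singular value of `w_st` equals one, so a matrix held on this manifold cannot exhibit the singular value growth that the abstract links to amplified attention logits and softmax saturation.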

2606.13443 2026-06-12 cs.LG 新提交

How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators

我们需要多少记忆?神经算子的自适应记忆门

Jihyeon Hur, Yongseok Kwon, Min-Gi Jo, Jeongwhan Choi, Noseong Park

发表机构 * University of Seoul(首尔市立大学)

AI总结 针对现有神经算子固定记忆权重适应性不足的问题,提出AMGFNO,通过可学习门动态调节记忆权重,在低分辨率下nRMSE降低55-79%。

详情
AI中文摘要

神经算子已成为求解时间相关PDE的强大数据驱动方法。在最近的进展中,记忆增强神经算子显式地纳入过去状态,并在低分辨率观测设置下取得了显著性能。然而,现有方法无论观测条件(如分辨率或物理参数)如何,都应用固定的记忆权重,限制了其适应性。我们的初步实验表明,最优记忆权重随分辨率和粘度变化,这意味着固定记忆权重无法同时优化不同设置下的性能。我们提出AMGFNO,通过可学习门动态调节记忆权重。在Kuramoto-Sivashinsky和Burgers方程上,AMGFNO在低分辨率下实现了55-79%的nRMSE降低,且学习到的门值随分辨率增加自动从$\bar{g} \approx 0.7$降至接近零。

英文摘要

Neural operators have emerged as a powerful data-driven approach for solving time-dependent PDEs. Among recent advances, memory-augmented neural operators explicitly incorporate past states and have achieved remarkable performance under low-resolution observation settings. However, existing approaches apply a fixed memory weight regardless of observation conditions, such as resolution or physical parameters, limiting their adaptability. Our preliminary experiments reveal that optimal memory weight varies with resolution and viscosity, implying that a fixed memory weight cannot simultaneously optimize performance across diverse settings. We propose AMGFNO, which dynamically modulates memory weight through a learnable gate. On the Kuramoto-Sivashinsky and Burgers' equations, AMGFNO achieves a 55-79% nRMSE reduction at low resolution, with the learned gate value automatically decreasing from $\bar{g} \approx 0.7$ to near-zero as resolution increases.

2606.13568 2026-06-12 cs.LG math-ph math.MP 新提交

Adjusted Cup-Product Neural Layer

调整杯积神经层

Snigdha Chandan Khilar

AI总结 提出调整杯积神经层,通过硬连线杯积与高规范理论调整项,实现规范不变读出,并证明调整系数是唯一信号源。

详情
AI中文摘要

物理和几何中的许多重要可观测量是上链的杯积。本文引入了调整杯积神经层。这是一种神经原语,硬连线了杯积与来自高规范理论的调整项。这创建了一个设计上规范不变的读出。他们的主要理论结果表明,在闭链上,输出完全依赖于调整系数。将该系数设为零,无论其他参数如何,输出完全消失。因此,调整是规范不变信号的唯一来源。他们证明该可观测量是一个非零二次型,并且在一个和两个规范变换下精确不变。

英文摘要

Many important observables in physics and geometry are cup products of cochains. The adjusted cup product neural layer has been introduced in this paper. It is a neural primitive that hard wires the cup product with an adjustment term from higher gauge theory. This creates a readout that is gauge invariant by design. Their main theoretical result shows that on a closed cycle the output relies entirely on the adjustment coefficient. Setting this coefficient to zero removes the output completely regardless of other parameters. Thus the adjustment is the only source of gauge invariant signal. They prove this observable is a nonzero quadratic form and is exactly invariant under one and two gauge transformations.

2606.13571 2026-06-12 cs.LG cs.AI 新提交

Existence Precedes Value: Joint Modeling of Observational Existence and Evolving States in Time Series Forecasting

存在先于价值:时间序列预测中观测存在性与状态演变的联合建模

Yifan Hu, Hongzhou Chen, Peiyuan Liu, Yiding Liu, Zewei Dong, Jiang-Ming Yang

发表机构 * Ant International(蚂蚁国际)

AI总结 提出Timeflies框架,联合建模未来观测是否发生(存在性)与数值估计,通过观测流和数值流耦合模块提升缺失值时间序列预测性能。

详情
AI中文摘要

现实世界的时间序列常因传感器休眠、传输延迟和事件驱动采样而高度不完整和不规则,使得可靠预测面临根本性挑战。现有方法已从插值后预测的流水线发展到连续时间模型,如神经常微分方程和连续时间图网络。尽管这些方法改进了历史不规则性的建模,但它们仍然在推理时依赖一个隐式的先知假设:未来有效观测的时间戳被假定为预先已知。这一假设限制了实际相关性,因为在许多现实系统中,更根本的问题不仅是未来值是多少,还包括是否会有有效观测发生。在本文中,我们提出Timeflies,一个统一的框架,将预测重新表述为未来可观测性推断和数值估计的联合问题。为了显式建模观测动态与状态演变之间的交互,Timeflies采用观测流和数值流,通过三个专用模块(可靠性感知嵌入、观测引导的依赖建模和联合预测)进行耦合。我们进一步构建了Shadow基准,该基准结合了来自公共数据集和真实工业数据的自然缺失,并引入观测-值联合熵(OVJE)指标来全面评估这种耦合的可预测性。大量实验表明,Timeflies始终优于现有方法,突显了在缺失值时间序列预测中显式建模未来可观测性的重要性。代码和数据集见https://github.com/ant-intl/Timeflies。

英文摘要

Real-world time series are often highly incomplete and irregular due to sensor dormancy, transmission delays, and event-driven sampling, making reliable forecasting fundamentally challenging. Existing methods have evolved from impute-then-forecast pipelines to continuous-time models such as Neural ODEs and continuous-time graph networks. While these approaches improve the modeling of historical irregularity, they still rely on an implicit oracle assumption at inference time: the timestamps of future valid observations are presumed to be known in advance. This assumption limits practical relevance, since in many real systems the more fundamental question is not only what the future value will be, but also whether a valid observation will occur at all. In this paper, we propose Timeflies, a unified framework that reformulates forecasting as a joint problem of future observability inference and value estimation. To explicitly model the interaction between observation dynamics and state evolution, Timeflies adopts an observation stream and a value stream, coupled through three dedicated modules for reliability-aware embedding, observation-guided dependency modeling, and joint prediction. We further construct Shadow, a benchmark that combines natural missingness from public datasets with real-world industrial data, and introduce the Observation-Value Joint Entropy (OVJE) metric to comprehensively evaluate this coupled predictability. Extensive experiments show that Timeflies consistently outperforms existing methods, highlighting the importance of explicitly modeling future observability in time series forecasting with missing values. Code and dataset are available at https://github.com/ant-intl/Timeflies.

2606.13603 2026-06-12 cs.LG cs.AI cs.CL 新提交

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

超越承诺边界:探究大型推理模型中的附带思维链

Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini, Malvina Nissim, Gabriele Sarti

发表机构 * CLCG, University of Groningen(格罗宁根大学CLCG) University of Milano-Bicocca(米兰比可卡大学) University of Trieste(的里雅斯特大学) Khoury College of Computer Sciences, Northeastern University(东北大学Khoury计算机科学学院)

AI总结 通过早期退出估计思维链步骤的因果重要性,发现推理中存在从瞬态猜测到稳定答案的“承诺边界”,后续步骤为附带现象,可提前退出以缩短推理长度达55%而不影响性能。

详情
AI中文摘要

思维链推理是语言模型推理时扩展的主导范式,但每个步骤对最终答案的因果影响尚不明确。我们通过早期退出估计每个步骤的因果重要性,并利用这一度量研究多个模型家族的推理轨迹中答案如何形成。在多种任务中,我们发现推理通常会跨越一个“承诺边界”——从瞬态中间猜测到稳定、高置信度答案的急剧转变。这种转变通常发生在单个步骤中,远在模型推理块结束之前,随后是“附带”的思维链步骤,这些步骤不改变最终答案概率。利用注意力探针,我们表明答案形成阶段可以从中间推理步骤中以高精度线性解码,并稳健地泛化到未见过的推理任务。我们利用这一信号在承诺边界处提前退出推理块,平均将思维链长度减少高达55%,而对模型性能影响微乎其微。

英文摘要

Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer remains poorly understood. We estimate each step's causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a "commitment boundary": a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model's reasoning block ends, and is followed by "epiphenomenal" CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing CoT length by up to 55% on average with negligible impact on model performance.
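To make the notion of a commitment boundary concrete, here is a small sketch that locates it in a sequence of per-step early-exit answer probabilities. The detection rule (the first step from which the probability stays at or above a threshold), the threshold value, and the example probabilities are all assumptions for illustration; the paper itself detects the boundary with attention probes, which this sketch does not reproduce.

```python
def commitment_step(answer_probs, threshold=0.9):
    """Return the index of the first step from which the early-exit
    probability of the final answer stays at or above `threshold`,
    or None if the trace never commits."""
    boundary = None
    for i, p in enumerate(answer_probs):
        if p >= threshold:
            if boundary is None:
                boundary = i          # candidate boundary
        else:
            boundary = None           # confidence dropped again: not committed yet
    return boundary

# Illustrative values only: transient guesses, then a stable answer.
probs = [0.10, 0.35, 0.20, 0.95, 0.97, 0.96, 0.98]
step = commitment_step(probs)
# Fraction of reasoning steps skipped by exiting right after the boundary.
saved_fraction = 1 - (step + 1) / len(probs)
```

In this toy trace the steps after the boundary no longer move the answer probability, which is the sense in which the abstract calls them epiphenomenal.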

2605.18898 2026-06-12 cs.LG stat.ML 新提交

A Two-Parameter Weibull Framework for Diagnosing Transformer Weight Distributions

一种用于诊断Transformer权重分布的双参数Weibull框架

Tiexin Ding

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出了一种基于Weibull分布的双参数框架,用于分析Transformer中元素权重幅度分布,通过实验发现不同模块的k值分布特征,并揭示了训练过程中lambda参数的变化规律。

Comments 27 pages, 14 figures. Companion library npm-weibull-py and benchmark database available at https://github.com/tiexinding/NPM-Weibull-public

详情
AI中文摘要

我们应用Weibull分布——极值理论中的一个双参数家族——作为诊断框架,用于分析Transformer中元素权重幅度分布。在初始化时,i.i.d.高斯权重给出|w| ~ HalfNormal,产生k ~ 1.20通过中间80%概率-图拟合(此工作中的协议)。这个锚点使k成为一种原则性的、架构无关的训练动态测量工具;在每个层的每个检查点独立拟合每个权重矩阵,使能够进行每组件、每层和每步的诊断,这些聚合统计无法解决。将此框架应用于12个模型,涵盖7个架构家族(Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3)揭示了三个发现。首先,FFN模块和注意力输出投影W_o——传输类——落在狭窄的k带中:在12个条目中,中位数终端k在[1.186, 1.204]之间(跨家族CV=0.51%),在SwiGLU/GeLU激活、Pre-LN/QK-Norm放置和70M-14B大小之间共享。其次,注意力输入投影W_q, W_k——选择类——脱离Weibull家族,其严重程度由存储形状决定:分别存储Q/K(OLMo-1, OLMo-2)产生k在[0.76, 0.99](深层);GQA模型产生k在[1.10, 1.16](轻微);Pythia的合并W_qkv占据过渡区,跟踪训练预算T/tau单调递增。第三,lambda在训练过程中显著增长,并在Pythia家族中与sqrt(eta/lambda_wd)成比例(Pearson r=0.94,三种传输类型),方向上与Fan等人(2025)一致。这两个参数携带独立信息:k标记功能类别,lambda标记训练进度。我们发布了npm-weibull-py v0.4(Python库)和DATABASE_v9_1在https://github.com/tiexinding/NPM-Weibull-public。

英文摘要

We apply the Weibull distribution -- a two-parameter family from extreme-value theory -- as a diagnostic framework for element-wise weight magnitude distributions in transformers. At initialization, i.i.d. Gaussian weights give |w| ~ HalfNormal, yielding k ~ 1.20 via middle-80% probability-plot fit (the protocol used throughout this work). This anchor makes k a principled, architecture-independent measuring stick for training dynamics; fitting each weight matrix independently at every layer at every checkpoint enables per-component, per-layer, and per-step diagnostics that aggregate statistics cannot resolve. Applying this framework to 12 model entries spanning 7 architectural families (Pythia, OLMo-1/2, LLaMA-3, Mistral, Qwen2.5/3) reveals three findings. First, FFN modules and the attention output projection W_o -- the Transmission Class -- fall in a narrow k band: median terminal k in [1.186, 1.204] across 12 entries (cross-family CV = 0.51%), shared across SwiGLU/GeLU activations, Pre-LN/QK-Norm placements, and 70M-14B sizes. Second, the attention input projections W_q, W_k -- the Selection Class -- depart from the Weibull family, with severity shaped by storage: separately-stored Q/K (OLMo-1, OLMo-2) yields k in [0.76, 0.99] (deep); GQA models yield k in [1.10, 1.16] (mild); Pythia's merged W_qkv occupies a transitional zone tracking training budget T/tau monotonically. Third, lambda grows substantially during training and scales with sqrt(eta/lambda_wd) within the Pythia family (Pearson r = 0.94, three Transmission kinds), directionally consistent with Fan et al. (2025). The two parameters carry independent information: k labels the functional class, lambda labels training progress. We release npm-weibull-py v0.4 (Python library) and DATABASE_v9_1 at https://github.com/tiexinding/NPM-Weibull-public .
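The initialization anchor quoted above can be checked with a short NumPy sketch: draw i.i.d. Gaussian weights, take magnitudes, and fit a Weibull shape k by least squares on the Weibull probability plot, restricted to the middle 80% of the quantiles. The plotting positions `(i - 0.5) / n` and the function name `weibull_plot_fit` are assumptions; the released npm-weibull-py library may implement the protocol differently.

```python
import numpy as np

def weibull_plot_fit(weights, lo=0.10, hi=0.90):
    """Fit Weibull shape k and scale lam to |weights| by least squares on
    the Weibull probability plot, using only quantiles in [lo, hi]."""
    x = np.sort(np.abs(np.ravel(weights)))
    n = x.size
    p = (np.arange(1, n + 1) - 0.5) / n       # plotting positions (assumed)
    keep = (p >= lo) & (p <= hi) & (x > 0)
    # Weibull CDF: F = 1 - exp(-(x/lam)^k)  =>  ln(-ln(1-F)) = k ln x - k ln lam
    k, intercept = np.polyfit(np.log(x[keep]),
                              np.log(-np.log1p(-p[keep])), 1)
    lam = np.exp(-intercept / k)
    return k, lam

rng = np.random.default_rng(0)
k, lam = weibull_plot_fit(rng.normal(size=200_000))
```

Under these assumptions the fitted k for half-normal magnitudes should land close to the k ~ 1.20 anchor stated in the abstract; deviations of a trained matrix from that value are what the framework reads as a diagnostic signal.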

2606.12662 2026-06-12 cs.SD cs.AI cs.LG 交叉投稿

BASENet: Band-Adapted Speech Enhancement Network with Cross-Band Attention

BASENet: 基于频带自适应的跨频带注意力语音增强网络

Damien Martins Gomes, François Capman

发表机构 * Thales SIX GTS, FRANCE(泰雷兹SIX GTS公司,法国)

AI总结 提出BASENet,通过Bark尺度划分频带并分配自适应容量编码器,结合跨频带注意力模块,以最少参数实现高PESQ和STOI,适用于资源受限设备。

详情
AI中文摘要

语音增强模型通常对所有频率采用统一容量,忽略了人类听觉的非均匀频谱分辨率。我们提出BASENet,一种频率自适应架构,将频谱划分为Bark尺度频带,并为每个频带分配基于临界频带密度的缩放容量编码器,自动为感知密集的低频分配更深的分支,为高频分配更轻的分支。跨频带注意力模块通过紧凑的频率池化表示以线性复杂度捕获跨频带的谐波依赖性。基于具有密集连接的倒残差块和卷积循环网络,BASENet在VoiceBank+DEMAND上以仅0.83M参数和7.3 GMACs达到3.55 PESQ和96% STOI,是所有PESQ > 3.50方法中参数最少的。因果变体(3.44 PESQ)超过了几种非因果基线,证实了其在资源受限设备上实时流传输的适用性。

英文摘要

Speech enhancement models typically apply uniform capacity across all frequencies, disregarding the non-uniform spectral resolution of human hearing. We propose BASENet, a frequency-adapted architecture that partitions the spectrum into Bark-scale bands and assigns each a scaled-capacity encoder derived from critical-band density, automatically granting deeper branches to perceptually dense low frequencies and lighter ones to high frequencies. A cross-band attention module captures harmonic dependencies across bands through compact frequency-pooled representations at linear complexity. Built on inverted residual blocks with dense connectivity and a convolutional recurrent network, BASENet achieves 3.55 PESQ and a STOI of 96% on VoiceBank+DEMAND with only 0.83M parameters and 7.3 GMACs, the fewest parameters among all methods with PESQ > 3.50. A causal variant (3.44 PESQ) surpasses several non-causal baselines, confirming suitability for real-time streaming on resource-constrained devices.

2606.12940 2026-06-12 cs.SD cs.LG 交叉投稿

Self-Guidance: Enhancing Neural Codecs via Decoder Manifold Alignment

自引导:通过解码器流形对齐增强神经编解码器

Xiang Li, Yixuan Zhou, Jingran Xie, Zhiyong Wu, Hui Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出自引导方法,通过轻量特征映射损失对齐解码器内部流形,在不改变推理过程下提升VQ-VAE神经语音编解码器重建质量,实现低比特率SOTA性能并支持4倍码本缩减。

Comments 20 pages, 9 figures, accepted to ICML 2026, demo website available at https://sgvqvae.github.io/sgvqvae-demo

详情
AI中文摘要

基于向量量化VAE(VQ-VAE)的神经语音编解码器是语音大语言模型的核心音频分词器,但其重建保真度受限于量化误差。常见的修复方法是修改量化器或增加模型容量,但这会复杂化下游语言建模。我们的核心思想是,在处理量化标记及其原始连续嵌入时,使用轻量级特征映射损失对齐解码器的内部特征流形。这需要最小的训练开销,且无需改变推理过程。应用于XCodec2时,自引导改善了所有重建指标,实现了低比特率下的最先进性能。值得注意的是,它实现了4倍码本缩减而无保真度损失,下游TTS实验表明,通过简化标记建模空间,这显著改善了基于LLM的合成。多项统计观察和可视化证实了解码器中内部流形对齐的增强。大量实验证实了其在各种归纳偏置下的通用性。因此,自引导建立了一种高效、广泛适用的高保真神经音频编码方法。

英文摘要

Neural speech codecs based on Vector-Quantized VAEs (VQ-VAEs) are core audio tokenizers for speech LLMs, yet their reconstruction fidelity is bottlenecked by quantization error. Modifying the quantizer or increasing model capacity are common fixes, but they complicate downstream language modeling. Our core idea is to align the decoder's internal feature manifolds when processing both the quantized tokens and their original continuous embeddings, using a lightweight feature-mapping loss. This requires minimal training overhead and no inference-time changes. Applied to XCodec2, self-guidance improves all reconstruction metrics, achieving state-of-the-art low-bitrate performance. Notably, it enables a 4x codebook reduction without fidelity loss, which downstream TTS experiments show significantly improves LLM-based synthesis by simplifying the token modeling space. Multiple statistical observations and visualizations corroborate the enhanced internal manifold alignment in the decoder. Extensive experiments confirm its generality across various inductive biases. Self-guidance thus establishes an efficient, broadly applicable method for high-fidelity neural audio coding.

2408.17221 2026-06-12 cs.LG math.AG 版本更新

Geometry of Lightning Self-Attention: Identifiability and Dimension

闪电自注意力的几何:可识别性与维度

Nathan W. Henry, Giovanni Luca Marchetti, Kathlén Kohn

发表机构 * University of Toronto(多伦多大学) Royal Institute of Technology (KTH)(皇家理工学院(KTH))

AI总结 本文利用代数几何工具,分析了无归一化自注意力网络的函数空间几何,给出了深层注意力的可识别性描述并计算了函数空间维度,同时刻画了单层模型的奇异点和边界点,并推测了归一化情形的结果。

Comments Accepted at ICLR 2025

详情
AI中文摘要

我们考虑由无归一化的自注意力网络定义的函数空间,并理论上分析其几何结构。由于这些网络是多项式,我们依赖代数几何的工具。特别地,我们通过描述任意层数参数化的通用纤维来研究深层注意力的可识别性,并据此计算函数空间的维度。此外,对于单层模型,我们刻画了奇异点和边界点。最后,我们提出一个关于归一化自注意力网络结果的推测性扩展,在单层情况下证明该推测,并在深层情况下进行数值验证。

英文摘要

We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.
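The claim that unnormalized self-attention is polynomial can be seen in a few lines. The sketch below assumes the common parametrization with query, key, and value matrices and simply drops the softmax; the function name and shapes are illustrative and may differ from the paper's conventions.

```python
import numpy as np

def lightning_attention(x, w_q, w_k, w_v):
    """Self-attention with the softmax removed: (X Wq)(X Wk)^T (X Wv)."""
    scores = (x @ w_q) @ (x @ w_k).T          # (tokens, tokens), quadratic in x
    return scores @ (x @ w_v)                 # cubic in x overall

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                   # 5 tokens, embedding dimension 3
w_q, w_k, w_v = (rng.normal(size=(3, 2)) for _ in range(3))
out = lightning_attention(x, w_q, w_k, w_v)
# Doubling the input multiplies the output by 2^3: the map is a cubic polynomial.
scaled = lightning_attention(2.0 * x, w_q, w_k, w_v)
```

In this form only the product of the query and key matrices enters the output, so replacing `w_q` by `w_q @ a` and `w_k` by `w_k @ inv(a).T` for an invertible `a` leaves the function unchanged. That is one visible source of non-identifiability in the sketch; the paper's description of the generic fibers is the authoritative statement.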

2502.18959 2026-06-12 cs.LG stat.ML 版本更新

Fourier Multi-Component and Multi-Layer Neural Networks: Unlocking High-Frequency Potential

傅里叶多分量与多层神经网络:解锁高频潜力

Shijun Zhang, Hongkai Zhao, Yimin Zhong, Haomin Zhou

发表机构 * Department of Applied Mathematics(应用数学系) Hong Kong Polytechnic University(香港理工大学) Department of Mathematics(数学系) Duke University(杜克大学) Department of Mathematics and Statistics(数学与统计学系) Auburn University(奥本大学) School of Mathematics(数学学院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出傅里叶多分量与多层神经网络(FMMNN),结合正弦型激活函数与多分量多层结构,通过低秩架构实现指数级函数逼近能力,优化景观优于标准全连接网络,并设计缩放随机初始化方法加速训练,在高频函数逼近任务中取得高精度与良好收敛性。

Comments Our code and implementation details are available at https://github.com/ShijunZhangMath/FMMNN

详情
AI中文摘要

神经网络的结构及其激活函数的选择对其性能至关重要。同样重要的是确保这两个元素良好匹配,因为它们的对齐是有效表示和学习的关键。在本文中,我们引入了傅里叶多分量与多层神经网络(FMMNN),该模型将正弦型激活函数与MMNN的多分量多层结构相结合。在FMMNN中,每个分量表示为固定随机正弦型基函数的可训练线性组合,而多层组合则生成更复杂且自适应的频率特征。我们证明,即使在低秩架构下,FMMNN仍能保持函数逼近的指数级表达能力。我们还分析了FMMNN的优化景观,发现其比标准全连接神经网络更有利,尤其是对于高频目标。此外,我们提出了一种针对FMMNN第一层权重的缩放随机初始化方法,当样本充足时,该方法能加速训练并提高最终性能。大量数值实验支持我们的理论见解,表明FMMNN在振荡函数逼近基准上实现了高精度和良好的收敛行为。

英文摘要

The architecture of a neural network and the choice of its activation function are both fundamental to its performance. Equally important is ensuring that these two elements are well matched, as their alignment is key to effective representation and learning. In this paper, we introduce the Fourier Multi-Component and Multi-Layer Neural Network (FMMNN), a model that combines sine-type activations with the multi-component and multi-layer structure of MMNNs. In an FMMNN, each component is represented as a trainable linear combination of fixed random sine-type basis functions, while multi-layer composition generates more complex and adaptive high-frequency features. We establish that FMMNNs retain exponential expressive power for function approximation even under a low-rank architectural structure. We also analyze the optimization landscape of FMMNNs and find it to be substantially more favorable than that of standard fully connected neural networks, especially for high-frequency targets. In addition, we propose a scaled random initialization method for the first-layer weights in FMMNNs, which accelerates training and improves final performance when sufficient samples are available. Extensive numerical experiments support our theoretical insights, showing that FMMNNs achieve strong accuracy and favorable convergence behavior on oscillatory function-approximation benchmarks.

2505.11846 2026-06-12 cs.LG math.AG 版本更新

Learning on a Razor's Edge: Identifiability and Singularity of Polynomial Neural Networks

刀刃上的学习:多项式神经网络的可辨识性与奇异性

Vahid Shahverdi, Giovanni Luca Marchetti, Kathlén Kohn

发表机构 * Department of Mathematics, KTH Royal Institute of Technology(数学系,皇家理工学院)

AI总结 研究以多项式为激活函数的MLP和CNN的函数空间(神经流形),证明MLP参数化几乎处处有限对一,CNN参数化一一对应,并刻画奇异性源于稀疏子网络,解释MLP的稀疏偏好。

Comments Published at ICLR 2026

详情
AI中文摘要

我们研究由神经网络参数化的函数空间,称为神经流形。具体地,我们关注具有充分一般多项式激活函数的深度多层感知机(MLP)和卷积神经网络(CNN)。首先,我们解决可辨识性问题,表明对于MLP神经流形中的几乎所有函数,只有有限多个参数选择产生该函数。对于CNN,参数化通常是一一对应的。作为推论,我们计算了神经流形的维数。其次,我们描述神经流形的奇异点。我们完全刻画了CNN的奇异性,部分刻画了MLP的奇异性。在这两种情况下,奇异性都源于稀疏子网络。对于MLP,我们证明这些奇异性通常对应于均方误差损失的临界点,而这对CNN不成立。这为MLP的稀疏偏好提供了几何解释。我们的所有结果都利用了代数几何的工具。

英文摘要

We study function spaces parametrized by neural networks, referred to as neuromanifolds. Specifically, we focus on deep Multi-Layer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs) with an activation function that is a sufficiently generic polynomial. First, we address the identifiability problem, showing that, for almost all functions in the neuromanifold of an MLP, there exist only finitely many parameter choices yielding that function. For CNNs, the parametrization is generically one-to-one. As a consequence, we compute the dimension of the neuromanifold. Second, we describe singular points of neuromanifolds. We characterize singularities completely for CNNs, and partially for MLPs. In both cases, they arise from sparse subnetworks. For MLPs, we prove that these singularities often correspond to critical points of the mean-squared error loss, which does not hold for CNNs. This provides a geometric explanation of the sparsity bias of MLPs. All of our results leverage tools from algebraic geometry.

2505.13102 2026-06-12 cs.LG cs.AI eess.SP 版本更新

Lightweight and Interpretable Transformer via Mixed Graph Algorithm Unrolling for Traffic Forecast

轻量级可解释Transformer:基于混合图算法展开的交通预测

Ji Qi, Tam Thuc Do, Mingxiao Liu, Zhuoshi Pan, Yuzhe Li, Gene Cheung, H. Vicky Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种通过展开混合图优化算法构建的轻量级可解释类Transformer网络,用于时空交通预测,在保持竞争性能的同时大幅减少参数。

Comments 24 pages, 7 figures, 11 tables

详情
AI中文摘要

与采用经典自注意力机制的传统“黑箱”Transformer不同,我们通过展开基于混合图的优化算法构建了一个轻量级且可解释的类Transformer神经网络,用于具有空间和时间维度的交通预测。我们构建了两个图:一个无向图$\mathcal{G}^u$捕捉跨地理的空间相关性,以及一个有向图$\mathcal{G}^d$捕捉时间上的序列关系。我们预测信号$\mathbf{x}$的未来样本,假设其相对于$\mathcal{G}^u$和$\mathcal{G}^d$都是“平滑的”,为此我们设计了新的$\ell_2$和$\ell_1$范数变分项来量化并促进有向图上的信号平滑性(低频重构)。我们基于交替方向乘子法(ADMM)设计了一个迭代算法,并将其展开为一个前馈网络以进行数据驱动的参数学习。我们周期性地插入用于$\mathcal{G}^u$和$\mathcal{G}^d$的图学习模块,这些模块扮演自注意力的角色。实验表明,我们的展开网络在交通预测性能上与最先进的预测方案相当,同时大幅减少了参数数量。

英文摘要

Unlike conventional "black-box" transformers with a classical self-attention mechanism, we build a lightweight and interpretable transformer-like neural net by unrolling a mixed-graph-based optimization algorithm to forecast traffic with spatial and temporal dimensions. We construct two graphs: an undirected graph $\mathcal{G}^u$ capturing spatial correlations across geography, and a directed graph $\mathcal{G}^d$ capturing sequential relationships over time. We predict future samples of signal $\mathbf{x}$, assuming it is "smooth" with respect to both $\mathcal{G}^u$ and $\mathcal{G}^d$, where we design new $\ell_2$ and $\ell_1$-norm variational terms to quantify and promote signal smoothness (low-frequency reconstruction) on a directed graph. We design an iterative algorithm based on the alternating direction method of multipliers (ADMM), and unroll it into a feed-forward network for data-driven parameter learning. We periodically insert graph learning modules for $\mathcal{G}^u$ and $\mathcal{G}^d$ that play the role of self-attention. Experiments show that our unrolled networks achieve traffic forecast performance competitive with state-of-the-art prediction schemes, while reducing parameter counts drastically.
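For readers unfamiliar with graph signal smoothness, the sketch below shows the standard Laplacian regularizer on an undirected graph, which equals a weighted sum of squared differences across edges. This is the textbook undirected case only; the paper's new $\ell_2$ and $\ell_1$ variational terms for the directed graph are not specified in the abstract and are not reproduced here. The edge weights and signal values are made up for illustration.

```python
import numpy as np

# Symmetric edge weights of a small undirected graph (illustrative values).
adj = np.array([[0.0, 1.0, 0.5],
                [1.0, 0.0, 2.0],
                [0.5, 2.0, 0.0]])
laplacian = np.diag(adj.sum(axis=1)) - adj    # combinatorial Laplacian L = D - W

def smoothness(x):
    """Graph Laplacian regularizer x^T L x; small when neighbors agree."""
    return x @ laplacian @ x

x = np.array([1.0, 1.2, 3.0])
# Same quantity written edge by edge: sum of w_ij * (x_i - x_j)^2.
edge_sum = sum(adj[i, j] * (x[i] - x[j]) ** 2
               for i in range(3) for j in range(i + 1, 3))
```

A constant signal has zero penalty, which is why minimizing this term pushes the reconstruction toward low graph frequencies.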

2507.03660 2026-06-12 cs.LG 版本更新

Single vs. Multiple Branches in DeepONet and S-DeepONet: Network Architecture Follows Coupling in Multiphysics Systems

DeepONet和S-DeepONet中的单分支与多分支:网络架构遵循多物理系统中的耦合

Jaewan Park, Kazuma Kobayashi, Qibang Liu, Seid Koric, Diab Abueidda, Syed Bahauddin Alam

发表机构 * National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign(国家超级计算应用中心,伊利诺伊大学厄巴纳-香槟分校) The Grainger College of Engineering, Mechanical Science and Engineering, University of Illinois at Urbana-Champaign(格兰杰工程学院,机械科学与工程系,伊利诺伊大学厄巴纳-香槟分校) The Grainger College of Engineering, Nuclear, Plasma & Radiological Engineering, University of Illinois at Urbana-Champaign(格兰杰工程学院,核、等离子体与放射工程系,伊利诺伊大学厄巴纳-香槟分校) Department of Industrial and Manufacturing Systems Engineering, Kansas State University(工业与制造系统工程系,堪萨斯州立大学) Civil and Urban Engineering Department, New York University Abu Dhabi, UAE(土木与城市工程系,纽约大学阿布扎比分校,阿联酋)

AI总结 研究比较单分支与多分支神经算子架构在强耦合多物理系统中的表现,发现单分支网络在紧耦合场景下通过共享潜在表示优于多分支,而多分支适用于解耦或单物理任务,代理模型加速高达1.8×10^4倍。

详情
AI中文摘要

复杂物理系统的实时预测需要从数据中学习并代表强多物理耦合的代理模型。深度算子网络在单物理问题中已显示出成功,但其在捕捉耦合系统(如热-机械或电-热耦合)中非线性相互作用方面的有效性仍未充分探索。这里我们提出一个实际问题:神经算子的架构是否应反映其旨在建模的物理耦合强度?我们比较了单分支和多分支设计,包括前馈和顺序循环形式,跨越三个代表性系统:具有异质源的反应-扩散问题、具有温度依赖电导率和焦耳热的非线性热电问题,以及钢凝固的粘塑性热-机械模型。单分支网络在紧耦合场景中通过鼓励共享潜在表示持续优于多分支变体,而多分支设计对于解耦或单物理任务仍然有利。一旦训练完成,这些代理模型提供全场预测的速度比基于物理的求解器快高达1.8×10^4倍。

英文摘要

Real-time prediction of complex physical systems requires surrogate models that learn from data while representing strong multiphysics coupling. Deep Operator Networks have shown success in single-physics problems, yet their effectiveness in capturing nonlinear interactions in coupled systems (such as thermo-mechanical or electro-thermal coupling) remains underexplored. Here we pose a practical question: should the architecture of a neural operator reflect the strength of physical coupling it aims to model? We compare single-branch and multi-branch designs, in both feedforward and sequential recurrent forms, across three representative systems: a reaction--diffusion problem with heterogeneous sources, a nonlinear thermo-electrical problem with temperature-dependent conductivity and Joule heating, and a viscoplastic thermo-mechanical model of steel solidification. Single-branch networks consistently outperform multi-branch variants in tightly coupled regimes by encouraging shared latent representations, whereas multi-branch designs remain favorable for decoupled or single-physics tasks. Once trained, these surrogates deliver full-field predictions up to $1.8 \times 10^4$ times faster than physics-based solvers.

2509.18085 2026-06-12 cs.LG cs.AI cs.CL 版本更新

Structuring The Future: Diffusion LLM Speculative Decoding via Calibrated Draft Graphs

构建未来:通过校准草稿图实现扩散LLM推测解码

Sudhanshu Agrawal, Risheek Garrepalli, Raghavv Goel, Christopher Lott, Fatih Porikli, Mingu Lee

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出Spiffy算法,利用校准的草稿图结构实现扩散LLM的推测解码,在保持输出分布的同时加速推理,最高减少8.6倍模型推理次数并加速6.3倍令牌生成速率。

Comments Original version uploaded on Sep 22, 2025. (v2): Extended Table 2 with additional analysis and referenced it in Sec 5.2. (v3): Added note to Sec 4.2 and Appendix A.2 specifying conditions for losslessness. (v4): Updated with the version accepted to ICML 2026 workshops

详情
AI中文摘要

扩散LLM(dLLM)最近作为自回归LLM(AR-LLM)的强大替代方案出现,具有以显著更高的令牌生成速率运行的潜力。为了释放这一潜力,我们提出了Spiffy,一种推测解码算法,用于加速dLLM推理,同时可证明地保持模型的输出分布。这项工作解决了将AR-LLM的推测解码思想应用于dLLM所涉及的独特挑战。Spiffy执行自动推测以消除独立草稿模型的开销,以新颖的有向草稿图形式构建草稿状态,以利用dLLM生成的双向、块状特性。这些草稿图离线校准以最大化接受率,并在推理过程中动态剪枝以提高计算效率。我们给出了Spiffy的详细公式,并展示了其与KV缓存和基于阈值的动态掩码相结合,加速LLaDA、Dream和SDAR模型的能力,导致模型推理次数减少高达8.6倍,令牌速率加速高达6.3倍。

英文摘要

Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token-generation rates. To unlock this potential, we present Spiffy, a speculative decoding algorithm to accelerate dLLM inference while provably preserving the model's output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to dLLMs. Spiffy performs auto-speculation to eliminate the overheads of an independent draft model, structuring draft states in the form of a novel directed draft graph to take advantage of the bidirectional, blockwise nature of dLLM generation. These draft graphs are calibrated offline to maximize acceptance rates and are dynamically pruned during inference for improved computational efficiency. We present a detailed formulation of Spiffy and demonstrate its ability to accelerate LLaDA, Dream, and SDAR models in combination with KV caching and threshold-based dynamic unmasking leading to up to $8.6\times$ reduction in model inferences and $6.3\times$ acceleration in token rate.

2605.18817 2026-06-12 cs.LG 版本更新

Multi-Token Residual Prediction

多令牌残差预测

Yufeng Xu, Zishuo Bao, Qian Wang, Zeshen Zhang, Haoqi Zhang, Bowen Peng, Ang Li, Rahul Chalamala, Yucheng Lu

发表机构 * New York University(纽约大学) New York University Shanghai(纽约大学上海) Nos Research(Nos研究) Modal

AI总结 本文提出了一种轻量级模块Multi-token Residual Prediction,通过利用去噪过程中相邻步骤的logit分布相似性,在单次骨干网络前向传播中实现依赖感知的多令牌去噪,从而在成本较低的情况下提高去噪效率。

详情
AI中文摘要

扩散语言模型(DLMs)通过迭代去噪掩码令牌序列生成文本,相较于自回归模型在并行性和质量之间提供了一种权衡。在当前实践中,每步解码的令牌数量由置信度阈值控制,随着每步去噪的令牌数量增加,质量单调下降。我们引入了多令牌残差预测(MRP),这是一种轻量级模块,能够在单个骨干网络前向传播中实现依赖感知的多令牌去噪。MRP利用了去噪过程的一个关键性质:相邻去噪步骤的logit分布具有显著相似性。MRP不再重新运行骨干网络来获得下一步的logits,而是通过骨干网络的隐藏状态预测步骤间的残差,从而以很小的成本在每次骨干网络前向传播中去噪更多的令牌。我们将MRP应用于DLM解码的两种运行模式。在高质量、低吞吐的静态去噪模式下,MRP作为推测解码的草稿器:其提案由骨干网络验证,在SGLang中实现高达1.4倍的无损加速。在低质量、高吞吐的动态去噪模式下,MRP转而驱动一种重掩码方案,撤销过于激进的揭示,恢复因激进的低阈值解码而损失的大部分准确率,在代码生成任务HumanEval上准确率最多提升22.6个百分点,在推理任务GSM8K上最多提升17.7个百分点。

英文摘要

Diffusion Language Models (DLMs) generate text by iteratively denoising masked token sequences, offering a tradeoff between parallelism and quality compared to autoregressive models. In current practice, the number of tokens decoded per step is controlled by a confidence threshold, and quality degrades monotonically as more tokens are denoised per step. We introduce Multi-token Residual Prediction (MRP), a lightweight module that enables dependency-aware multi-token denoising within a single backbone forward pass. MRP exploits a key property of the denoising process: the logit distributions at adjacent denoising steps are remarkably similar. Rather than running the backbone a second time to obtain the next-step logits, MRP predicts the residual between steps from the backbone's hidden states, effectively denoising more tokens per backbone forward at a fraction of the cost. We apply MRP across the two operating regimes of DLM decoding. In the high-quality-low-throughput static denoising regime, MRP serves as a drafter for speculative decoding: its proposals are verified against the backbone, yielding lossless acceleration of up to 1.4x in SGLang. In the low-quality-high-throughput dynamic denoising regime, MRP instead drives a remasking scheme that revokes over-eager reveals, recovering most of the accuracy lost to aggressive low-threshold decoding and improving accuracy by up to 22.6 points on code generation task HumanEval and 17.7 points on reasoning task GSM8K.

2605.25225 2026-06-12 cs.LG cs.AI 版本更新

Transformer Field Theory: A Response-Theoretic Approach to Mechanistic Interpretability

Transformer场论:机制可解释性的响应理论方法

David N. Olivieri, Antonio F. Pérez Rodríguez

发表机构 * Universidade de Vigo(维戈大学) Independent Researcher(独立研究员)

AI总结 本文提出场论框架,将残差流视为深度-标记场,通过局部源插入、灵敏度场预测、经验格林函数响应和伴随变分问题来组织和预测Transformer激活修补干预,并在GPT-2风格自回归Transformer中验证了前向响应理论。

详情
AI中文摘要

机制可解释性通常使用激活修补、因果追踪、路径修补和引导方向来揭示Transformer激活空间中行为有意义的子空间。本文发展了一个场论框架来组织和预测此类干预。将残差流视为深度-标记场,我们将修补公式化为局部源插入,修补效应作为灵敏度场预测,下游传播作为经验格林函数响应,修补选择作为伴随变分问题。实验上,我们通过在GPT-2风格自回归Transformer中应用局部残差场干预并观察诱导的残差场差异和logit差异响应来测试前向响应理论。我们识别出有界的局部线性区域;从跨残差站点的一阶灵敏度预测修补效应;测量跨深度和标记位置的结构化各向异性传播;从高灵敏度站点和切片格林算子构建响应描述;并表明提示诱导的残差位移可以传递答案行为。这些结果将响应对象(即灵敏度、传播场和格林算子切片)确立为组织修补实验的实用语言,以及制定修补站点推断和跨尺度迁移的前向数学基础。

英文摘要

Mechanistic interpretability often studies Transformer behavior by intervening on internal activations through activation patching, causal tracing, path patching, and steering directions. This paper develops Transformer Field Theory: a response-theoretic framework in which the residual stream of a fixed forward pass is treated as a Transformer field over layer depth and token position. In this formulation, patching becomes a localized source insertion into the Transformer field, first-order sensitivity fields predict patch effects, Green functions describe downstream propagation, and patch selection is posed as an adjoint inverse problem. Empirically, we test the theory's forward response objects in GPT-2-style autoregressive Transformers. Localized Transformer-field interventions exhibit a bounded local linear regime; first-order sensitivities predict patch effects across layer-token sites; localized sources generate structured anisotropic Transformer-field propagation; high-sensitivity sites and sliced Green operators provide reduced response descriptions; and prompt-induced Transformer-field displacements partially transfer answer behavior. These results establish sensitivities, Transformer-field responses, and sliced Green operators as practical objects for organizing patching experiments, while providing the forward mathematical basis for patch-site inference and cross-scale response transfer.
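The statement that first-order sensitivities predict patch effects within a bounded local linear regime can be illustrated on a toy model. The sketch below uses a single tanh layer with a linear readout as a stand-in for the layers downstream of a patched site; this toy network, its random weights, and the names `downstream` and `sensitivity` are assumptions, not the GPT-2-style setup used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=(6, 6)) / np.sqrt(6)
readout = rng.normal(size=6)

def downstream(h):
    """Toy stand-in for the layers after the patched site: scalar output."""
    return readout @ np.tanh(w1 @ h)

def sensitivity(h):
    """Gradient of the scalar output with respect to the state h."""
    return w1.T @ (readout * (1.0 - np.tanh(w1 @ h) ** 2))

h = rng.normal(size=6)
delta = 1e-3 * rng.normal(size=6)             # small localized "patch"
predicted = sensitivity(h) @ delta            # first-order prediction
actual = downstream(h + delta) - downstream(h)
```

For a perturbation this small the prediction and the measured change agree closely; enlarging `delta` makes the gap grow, which is the toy analogue of leaving the local linear regime.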

2606.05860 2026-06-12 cs.LG 版本更新

GenAutoML: An Agentic Framework for Dynamic Architecture Generation and Optimization in Time-Series Analysis

GenAutoML: 面向时间序列分析的动态架构生成与优化的智能体框架

Oleeviya Babu Poikarayil, Cédric Schockaert, Abdulrahman Nahhas, Christian Daase, Mursal Dawodi, Jawid Ahmad Baktash

发表机构 * Paul Wurth S.A.(保罗·沃思公司) Otto-von-Guericke University(奥托·冯·格里克大学) Technical University of Munich(慕尼黑工业大学)

AI总结 提出GenAutoML框架,利用大语言模型作为神经架构师,通过沙盒反射循环和签名感知运行时自动生成并优化时间序列预测与异常检测的神经网络架构,引入动态可逆实例归一化提升非平稳条件下的鲁棒性。

Comments 26 pages, 17 figures, 12 tables. Under review

详情
AI中文摘要

为时间序列预测和异常检测设计神经架构仍然是一项资源密集型任务,通常需要大量领域专业知识。传统的自动机器学习系统通常依赖于静态、预定义的搜索空间,限制了其适应多样数据特征的能力。我们提出GenAutoML,一个智能体框架,利用大语言模型作为神经架构师,将自然语言需求与可执行的PyTorch实现连接起来。该框架包含一个沙盒反射循环用于自主代码优化,以及一个签名感知运行时用于确保架构一致性和执行安全性。为了提升非平稳条件下的鲁棒性,我们进一步引入了动态可逆实例归一化包装器。在ETTh1、ETTm1和Weather基准上的实验表明,GenAutoML能够动态生成针对数据集特征定制的任务特定神经架构。在生成的模型中,WaveInterferenceNet实现了每个样本低于0.01毫秒的推理延迟,同时保持有竞争力的预测性能。通过强调计算效率、架构适应性和稳定的优化行为,GenAutoML使得创建适用于资源受限和延迟敏感的Edge AI部署的超轻量级神经网络成为可能。

英文摘要

Designing neural architectures for time-series forecasting and anomaly detection remains a resource-intensive task that often requires substantial domain expertise. Traditional Automated Machine Learning (AutoML) systems typically rely on static, predefined search spaces, limiting their ability to adapt to diverse data characteristics. We present GenAutoML, an agentic framework that leverages Large Language Models (LLMs) as neural architects to bridge natural-language requirements and executable PyTorch implementations. The framework incorporates a Sandboxed Reflection Loop for autonomous code refinement and a Signature-Aware Runtime that enforces architectural consistency and execution safety. To improve robustness under non-stationary conditions, we further introduce a Dynamic Reversible Instance Normalization (Dyn-RevIN) wrapper. Experiments on the ETTh1, ETTm1, and Weather benchmarks demonstrate that GenAutoML can dynamically generate task-specific neural architectures tailored to dataset characteristics. Among the generated models, WaveInterferenceNet achieves inference latency below 0.01 ms per sample while maintaining competitive predictive performance. By emphasizing computational efficiency, architectural adaptability, and stable optimization behavior, GenAutoML enables the creation of ultra-lightweight neural networks suitable for resource-constrained and latency-sensitive Edge AI deployments.

2606.11255 2026-06-12 cs.LG 版本更新

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Bernstein-Schur核:通过草图调制和径向随机化的随机特征

Taha Bouhsine

发表机构 * Azetta AI

AI总结 提出一种随机特征构造方法,用于Bernstein-Schur核类,通过草图化有限调制和随机化完全单调径向因子,实现无偏估计和算子范数界,应用于yat核族。

详情
AI中文摘要

Bernstein-Schur核是有限特征核(具有显式有限维特征映射的核)与完全单调平移不变核的乘积:非平稳核介于平移不变和点积模板之间,随机特征通常利用后者,因此一般Bochner采样或多项式草图都不能直接应用于完整核。我们为整个类给出一种随机特征构造,它随机化两个因子:草图化有限调制并随机化完全单调径向因子,对后者的单变量Bernstein-Widder尺度进行采样,然后应用高斯随机傅里叶特征(其频率仍是d维的)。特征维度为Dm,由草图大小m和径向抽取次数D设定,与精确调制特征的O(d^2)大小无关。保持调制精确是可分析极限(m→∞):在那里我们证明无偏性、推荐平坦估计量的精确方差、期望矩阵-Bernstein算子范数界(具有匹配的高概率尾部),该界由核和调制Gram矩阵的最大特征值以及固有维度控制,而非粗糙的N max_{ij}逐元素路径,以及确定性相对谱核岭稳定性结果。通过条件化于草图,双随机化估计量继承了相同的固有维度算子范数保证,加上一个可调加性草图项,该草图项由m独立于D调节。激励实例是有偏yat核k_{yat,b}(w,x)=(w^⊤x+b)^2/(‖w-x‖^2+ε),b≥0,其族通过b的有限差分包含逆多二次核;对于它,径向混合是IMQ谱采样器,每个尺度一个频率在固定径向特征预算下是方差最优的。

英文摘要

Bernstein--Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels falling between the shift-invariant and dot-product templates random features exploit, so neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein--Widder scale before applying Gaussian random Fourier features, giving feature dimension $Dm$, free of the $O(d^2)$ size of the exact modulation feature. With the modulation kept exact (the $m\to\infty$ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and an intrinsic dimension rather than the crude $N\max_{ij}$ route. Whitening this argument at the ridge makes the effective dimension $d_{\mathrm{eff}}(λ)$ the \emph{exact} intrinsic dimension of the matrix variance, so $O((1+\|P\|_{\mathrm{op}}/λ)\log(d_{\mathrm{eff}}/δ))$ radial draws preserve the kernel-ridge solution; tilting the draw by a closed-form whitened leverage improves this to the effective-dimension count $O((1+d_{\mathrm{eff}})\log(d_{\mathrm{eff}}/δ))$. Conditioning on the sketch carries every guarantee to the deployed doubly-randomized estimator up to one additive sketch term, and all hold for the whole class with the modulation Gram in place of the polynomial one. The flagship instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$.
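The closing claim of the abstract, that finite differences in $b$ reach an inverse-quadratic radial kernel, can be checked numerically. The sketch below is only an illustration based on the formula quoted in the abstract. The function names `yat_kernel` and `second_difference_in_b` are hypothetical and do not come from the paper, and how the result maps onto the paper's exact inverse-multiquadric convention is an assumption. Because $(s+b+h)^2 - 2(s+b)^2 + (s+b-h)^2 = 2h^2$ for any $s$, the central second difference in $b$, divided by $2h^2$, leaves only the radial factor $1/(\|w-x\|^2+\varepsilon)$.

```python
import numpy as np


def yat_kernel(w, x, b=0.0, eps=1e-3):
    """Biased yat-kernel from the abstract: (w.x + b)^2 / (||w - x||^2 + eps)."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    return (w @ x + b) ** 2 / (np.sum((w - x) ** 2) + eps)


def second_difference_in_b(w, x, b, h, eps=1e-3):
    """Central second finite difference of the kernel in b, divided by 2*h^2.

    The numerator (w.x + b)^2 is quadratic in b, so this quantity equals
    1 / (||w - x||^2 + eps) regardless of the values of b and h.
    """
    k_plus = yat_kernel(w, x, b + h, eps)
    k_mid = yat_kernel(w, x, b, eps)
    k_minus = yat_kernel(w, x, b - h, eps)
    return (k_plus - 2.0 * k_mid + k_minus) / (2.0 * h ** 2)
```

Choosing `b=1.0` and `h=0.5` keeps all three evaluation points at non-negative bias, which matches the $b \ge 0$ range under which the kernel family is stated.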

2606.04364 2026-06-12 cs.CV cs.LG 版本更新

Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

基于部件分解注意力的空间定位概念瓶颈模型

Dhanesh Ramachandram

发表机构 * Vector Institute(向量研究所)

AI总结 提出一种部分分解的概念瓶颈模型,通过空间先验约束注意力,在细粒度识别中实现可解释性并提升定位精度。

Comments Updated results with GlobalAttention Tokens

详情
AI中文摘要

概念瓶颈模型(CBM)在预测类别之前预测一层人类命名的属性,从而使其决策可审计。在细粒度识别任务中,概念头通常可以自由关注图像中的任何位置,因此以某个身体区域命名的头可能被其他区域的证据满足。本研究通过构造一个部分分解的CBM来消除这种自由度。该方法基于冻结的DINOv3视觉变换器,包含三个组件。一个学习到的前景门控,基于DINOv3块特征训练,抑制部分注意力内的背景块。一组部分查询交叉关注块特征,并且312个CUB属性中的每一个通过固定的概念到部分映射被路由,仅从其名称所暗示的部分令牌读取。一个可学习的二维高斯先验,以对数空间加性注入注意力logits,打破部分查询之间的排列对称性;其均值从每个部分的数据集平均关键点位置初始化,在训练或测试时不需要每张图像的关键点监督。在CUB-200-2011上,空间先验模型匹配完全监督基线(top-1准确率88.85%对88.95%),同时将指向精度提高16个百分点(52.6%对36.4%)。用PCA前景目标替换边界框监督,并与高斯先验结合,消除了所有每张图像监督,达到88.6%的top-1准确率和约70%的指向精度。关键点分数扫描显示,训练集的0.5%(约27张图像)足以初始化先验,且无显著损失。完全移除部分身份是更困难的情况:没有任何空间先验,指向精度降至2.9%。

英文摘要

Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
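The log-space Gaussian prior described in the abstract can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the function names `gaussian_log_prior` and `part_attention`, the normalized `[0, 1]` patch coordinates, and the axis-aligned (diagonal) covariance are hypothetical choices, not the paper's implementation. The sketch only shows the general idea that adding a per-part log-Gaussian to the attention logits before the softmax gives otherwise identical part queries different spatial preferences.

```python
import numpy as np


def gaussian_log_prior(grid_h, grid_w, means, sigmas):
    """Unnormalized log-density of one 2-D Gaussian per part over a patch grid.

    means, sigmas: arrays of shape (num_parts, 2) in normalized (y, x) coordinates.
    Returns an array of shape (num_parts, grid_h * grid_w).
    """
    ys, xs = np.meshgrid(
        np.linspace(0.0, 1.0, grid_h), np.linspace(0.0, 1.0, grid_w), indexing="ij"
    )
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1)  # (num_patches, 2)
    diff = coords[None, :, :] - means[:, None, :]  # (num_parts, num_patches, 2)
    # The per-part normalizing constant is dropped: softmax over patches ignores it.
    return -0.5 * np.sum((diff / sigmas[:, None, :]) ** 2, axis=-1)


def part_attention(logits, log_prior):
    """Softmax over patches of (content logits + log-space spatial prior)."""
    z = logits + log_prior
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    weights = np.exp(z)
    return weights / weights.sum(axis=-1, keepdims=True)
```

With all content logits equal, two part queries would produce identical attention maps. Adding priors centered at different locations separates them, which is the symmetry-breaking role the abstract attributes to the prior.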

2. 表示学习、自监督与对比学习 15 篇

2606.12481 2026-06-12 cs.LG cs.AI 新提交

Representing Time Series as Structured Programs for LLM Reasoning

将时间序列表示为结构化程序以进行LLM推理

Jaeho Kim, Changhun Oh, Seokhyun Lee, Irina Rish, Changhee Lee

发表机构 * Korea University(高丽大学) Mila, University of Montreal(蒙特利尔大学米拉研究所)

AI总结 提出T2SP方法,将时间序列分解为趋势、周期和显著事件并表示为结构化符号程序,使LLM无需微调即可高效推理,在编辑、描述和问答任务上优于原始序列表示。

Comments Preprint

详情
AI中文摘要

大型语言模型(LLM)展示了强大的推理和指令遵循能力,使其成为时间序列分析的潜在强大工具。然而,时间序列超出了其原生文本模态,引发了一个基本问题:应该如何表示时间序列,以便LLM能够有效地推理它们?现有工作通常序列化原始数值序列或在时间序列数据上微调预训练的LLM。这些方法将提取时间结构的负担直接放在LLM上,造成了模态不匹配,常常降低长序列的性能并引入大量计算开销。在这项工作中,我们引入了时间序列到结构化程序表示(T2SP),一种确定性的、无需训练的方法,将时间序列表示为结构化的符号程序。T2SP将时间序列分解为趋势、周期和显著事件,并以与LLM原生训练的文本和代码类模态对齐的程序友好格式表达它们。通过将时间结构提取从模型转移到表示本身,T2SP使现成的LLM能够利用其现有的推理能力进行时间序列理解。我们在三个推理任务上评估T2SP——编辑、描述和问答——与原始字符串表示相比,它持续提高了性能,减少了推理时间,并降低了失败率。我们的结果表明,T2SP提供了时间序列和LLM之间的有效接口。

英文摘要

Large language models (LLMs) have demonstrated strong reasoning and instruction-following capabilities, making them potentially powerful tools for time-series analysis. However, time series lie outside their native textual modality, raising a fundamental question: how should time series be represented so that LLMs can reason about them effectively? Existing work typically serializes raw numerical sequences or fine-tunes pre-trained LLMs on time-series data. These approaches place the burden of extracting temporal structure directly on the LLM, creating a modality mismatch that often degrades performance on long sequences and introduces substantial computational overhead. In this work, we introduce Time-Series-to-Structured-Program representation (T2SP), a deterministic, training-free method that represents a time series as a structured symbolic program. T2SP decomposes time series into trends, periods, and salient events, expressing them in a program-friendly format aligned with the textual and code-like modalities on which LLMs are natively trained. By shifting temporal-structure extraction from the model to the representation itself, T2SP enables off-the-shelf LLMs to leverage their existing reasoning capabilities for time-series understanding. We evaluate T2SP on three reasoning tasks -- editing, captioning, and question answering -- where it consistently improves performance, reduces reasoning time, and lowers failure rates compared with raw-string representations. Our results demonstrate that T2SP provides an effective interface between time series and LLMs.

2606.12488 2026-06-12 cs.LG 新提交

A Stationary (and Therefore Compatible) Representation is All You Need

静态(因此兼容)表示即所需

Niccolò Biondi, Federico Pernici, Simone Ricci, Alberto Del Bimbo

发表机构 * Media Integration and Communication Center (MICC), Dipartimento di Ingegneria dell’Informazione, Università degli Studi di Firenze(佛罗伦萨大学信息工程系媒体集成与通信中心(MICC))

AI总结 本文证明d-Simplex固定分类器学习的静态表示满足兼容性定义,并通过交叉熵与对比损失的凸组合捕获高阶依赖,实现模型更新时无需重处理的检索服务。

Comments Accepted to TPAMI2026. Extension of the CVPR2024 version (arXiv:2405.02581)

详情
AI中文摘要

学习兼容表示旨在当模型更新时,特征表示可以互换使用。本文证明,由d-Simplex固定分类器学习的静态表示隐含了其正式定义中的兼容性。这一结果为未来工作奠定了基础,并可直接应用于实际学习场景。我们解决了在模型顺序微调时使用d-Simplex固定分类器学习兼容性的挑战。使用交叉熵损失的d-Simplex固定分类器学习对齐一阶统计量的特征分布,因此可能无法完全捕捉模型更新之间表示的高阶依赖。为解决此问题,我们证明通过交叉熵损失和对比损失的凸组合使用d-Simplex固定分类器训练模型,不仅能捕捉高阶依赖,而且等价于在兼容性约束下使用交叉熵学习。我们通过大量实验证实了我们的发现,并考虑了一个新场景:预训练模型被顺序微调,偶尔被改进模型替换。我们表明,静态表示能够实现不间断的检索服务(无需重新处理图库图像),同时在模型更新和替换期间提升性能,达到最先进水平。代码见此 https URL。

英文摘要

Learning compatible representations aims to learn feature representations that can be used interchangeably over time whenever a model undergoes updates. In this paper, we demonstrate that stationary representations learned by d-Simplex fixed classifiers imply compatibility as in its formal definition. This result establishes a foundation for future works and can be directly exploited in practical learning scenarios. We address the challenge of learning compatibility using $d$-Simplex fixed classifiers when the model is sequentially fine-tuned. Learning according to a d-Simplex fixed classifier with the cross-entropy loss aligns feature distributions at the first-order statistics. Consequently, it may not fully capture higher-order dependencies in the representation between model updates. To address this issue, we demonstrate that training the model using a $d$-Simplex fixed classifier through a convex combination of the cross-entropy loss and a contrastive loss not only captures higher-order dependencies, but is also equivalent to learning with the cross-entropy under the compatibility constraints. We confirm our findings with extensive experiments also considering a new scenario where a pre-trained model is sequentially fine-tuned and occasionally replaced with an improved model. We show that stationary representations enable uninterrupted retrieval services (without reprocessing gallery images) while improving performance during model updates and replacements, achieving state-of-the-art. Code at https://github.com/miccunifi/iamcl2r.
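As a rough illustration of what a fixed simplex classifier looks like, the sketch below builds one standard regular-simplex weight matrix and evaluates cross-entropy against it without ever updating the weights. This is an assumption-laden sketch: the function names are hypothetical, the construction embeds the simplex in K dimensions (the vertices span a (K-1)-dimensional subspace) and may differ from the construction used in the paper, and the contrastive term of the proposed convex combination is not shown.

```python
import numpy as np


def simplex_classifier_weights(num_classes):
    """Rows are unit-norm vertices of a regular simplex centered at the origin.

    Every pair of distinct rows has cosine similarity -1 / (num_classes - 1).
    """
    k = num_classes
    centered = np.eye(k) - np.ones((k, k)) / k
    return np.sqrt(k / (k - 1)) * centered


def fixed_classifier_cross_entropy(features, labels, weights):
    """Mean cross-entropy of features scored against fixed (non-trainable) weights."""
    logits = features @ weights.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Because the class prototypes never move, features learned before and after a model update are pulled toward the same fixed directions, which is the intuition behind the stationarity the abstract refers to.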

2606.12503 2026-06-12 cs.LG cs.SD 新提交

Dolph2Vec: Self-Supervised Representations of Dolphin Vocalizations

Dolph2Vec: 海豚发声的自监督表示

Chiara Semenzin, Faadil Mustun, Roberto Dessi, Pierre Orhan, Alexis Emanuelli, Yair Lakretz, Gonzalo de Polavieja, German Sumbre

发表机构 * École Normale Supérieure, Paris, France(巴黎高等师范学院) Not Diamond, San Francisco, USA(Not Diamond公司) Institut du Cerveau, Paris, France(巴黎脑研究所) Champalimaud Foundation, Lisbon, Portugal(尚帕利莫基金会)

AI总结 提出Dolph2Vec,首个基于五年纵向海豚录音数据训练的自监督模型,在签名哨声分类和检测任务上显著优于通用基线,并发现可解释的声学单元。

详情
AI中文摘要

自监督学习(SSL)通过无需昂贵人工标注即可对动物发声进行可扩展建模,为生物声学开辟了新机遇。然而,当前该领域的SSL模型优先考虑跨物种的广泛泛化,并未针对揭示个体通信系统的细粒度结构进行优化。在这项工作中,我们收集并发布了一个新颖的数据集,包含来自半自然海洋环境中五只已知海豚的超过五年的纵向录音,这是研究海豚通信的前所未有的资源。我们将Wav2Vec2.0架构(Baevski等人,2020)适配到此领域,并引入Dolph2Vec,这是第一个仅在此数据上训练的大规模、物种特异性SSL模型。我们在两个生物学相关任务上对模型进行基准测试:签名哨声分类和哨声检测。Dolph2Vec在这两个任务上均显著优于通用基线。除了性能,我们还展示了学习到的嵌入和码本结构捕获了与海豚哨声类别以及可能的子哨声结构对齐的可解释声学单元,从而能够对通信模式进行细粒度分析。我们的发现证明了SSL如何作为模型和科学工具来探索动物通信研究中的假设。

英文摘要

Self-supervised learning (SSL) has opened new opportunities in bioacoustics by enabling scalable modeling of animal vocalizations without the need for expensive manual annotation. However, current SSL models in this domain prioritize broad generalization across species and are not optimized for uncovering the fine-grained structure of individual communication systems. In this work, we collect and release a novel dataset of over five years of longitudinal recordings, from five known dolphins in a semi-naturalistic marine environment, an unprecedented resource for studying dolphin communication. We adapt the Wav2Vec2.0 architecture (Baevski et al., 2020) to this domain and introduce Dolph2Vec, the first large-scale, species-specific SSL model trained exclusively on this data. We benchmark our model on two biologically relevant tasks: signature whistle classification and whistle detection. Dolph2Vec significantly outperforms general-purpose baselines in both tasks. Beyond performance, we show that learned embeddings and codebook structure capture interpretable acoustic units aligned with dolphin whistle categories and possibly sub-whistle structure, enabling fine-grained analysis of communication patterns. Our findings demonstrate how SSL can serve as both a model and a scientific tool to explore hypotheses in animal communication research.

2606.12609 2026-06-12 cs.LG q-bio.QM 新提交

Viral Proteins Reveal Geometry of Protein Language Models

病毒蛋白质揭示蛋白质语言模型的几何结构

Arthur Bigot, Harmon Bhasin, Core Francisco Park, Eugene Shakhnovich, Dianzhuo Wang

发表机构 * University of Washington(华盛顿大学) DeepMind(深度思维)

AI总结 研究蛋白质语言模型在不平衡数据下对病毒蛋白的表示,发现嵌入空间中存在主导的“天然性”轴,该轴按模型困惑度排序序列,且缩放效果因病毒家族而异,但嵌入仍保留病毒特异性信号。

Comments Accepted at ICML 2026 GenBio Workshop and FM4LS Workshop. Code available at https://github.com/MisteFr/viral-proteins-plms

详情
AI中文摘要

蛋白质语言模型在高度不平衡的数据集上训练,引发了一个问题:它们如何表示代表性不足的生物序列?以病毒蛋白作为跨ESM模型家族的案例研究,我们在嵌入空间中识别出一个主导的天然性轴,该轴与掩码重建困惑度对齐,将序列从建模良好的细胞蛋白通过病毒蛋白排序到打乱和随机序列。缩放效果在不同病毒家族间不均匀地压缩该轴。尽管如此,蛋白质语言模型嵌入保留了病毒特异性信号:病毒蛋白在零样本困惑度和浅层序列特征之上仍然是线性可分的。这些结果共同表明,pLM表示由天然性的一般概念结构化,同时保留了特定于不同生物群体的信息。

英文摘要

Protein language models are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this, protein language model embeddings retain viral-specific signal: viral proteins remain linearly separable beyond zero-shot perplexity and shallow sequence features. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.

2606.13260 2026-06-12 cs.LG q-bio.NC 新提交

Extracting Governing Equations from Latent Dynamics via Multi-View Contrastive Learning

通过多视图对比学习从潜在动力学中提取控制方程

Paolo Muratore, Mackenzie Weygandt Mathis

发表机构 * EPFL(洛桑联邦理工学院)

AI总结 提出DYSCO算法,利用多视图时间对比学习从噪声高维观测中联合恢复潜在轨迹和动力学方程,并通过结构化基函数实现符号恢复,理论保证强可识别性。

详情
AI中文摘要

从噪声高维测量中识别潜在动力系统是表示学习、系统辨识和科学发现交叉领域的一个核心问题。我们提出了DYSCO,一种多视图时间对比学习算法,通过利用同一底层过程的多个独立噪声视图来区分信号与噪声,从而从这些观测中联合恢复潜在轨迹和控制动力学。通过在结构化函数基上参数化动力学,我们的框架进一步能够在仿射规范内符号恢复控制方程。我们提供了强可识别性的理论保证,直到仿射不确定性,将先前的可识别性结果扩展到噪声非线性观测的现实设置。实验上,我们在高斯和泊松观测噪声下(后者尤其与神经记录相关),在多种动力学状态(如混沌、振荡和亚稳态)中展示了潜在轨迹和流场的准确恢复。

英文摘要

Identifying latent dynamical systems from noisy, high-dimensional measurements is a central problem at the intersection of representation learning, system identification, and scientific discovery. We present DYSCO, a multi-view temporal contrastive learning algorithm that jointly recovers latent trajectories and the governing dynamics from such observations, by leveraging multiple independent noisy views of the same underlying process to disentangle signal from noise. By parameterizing the dynamics in a structured functional basis, our framework further enables symbolic recovery of the governing equations within an affine gauge. We offer theoretical guarantees for strong identification up to an affine indeterminacy, extending prior identifiability results to the realistic setting of noisy nonlinear observations. Empirically, we demonstrate accurate recovery of both latent trajectories and flow fields across a diverse set of dynamical regimes (e.g., chaotic, oscillatory, and metastable) under both Gaussian and Poisson observation noise, the latter being particularly relevant for neural recordings.

2606.12471 2026-06-12 stat.ML cs.CL cs.ET cs.LG 交叉投稿

Identifiability Without Gaussianity: Symbolic World Models and Near-Infinite Temporal Consistency

无高斯假设的可识别性:符号世界模型与近无限时间一致性

Seth Dobrin, Łukasz Chmiel

AI总结 本文提出物理基础符号架构(PGSA),证明其在非高斯动态系统中实现精确线性可识别性和近无限时间一致性,克服了统计世界模型的高斯边界限制。

Comments Pre-print

详情
AI中文摘要

Klindt、LeCun 和 Balestriero (arXiv:2605.26379) 证明了联合嵌入预测架构(JEPA)实现线性可识别性(即线性恢复世界的真实潜在变量)当且仅当世界的潜在动态遵循高斯平稳过程。这一高斯边界意味着时间一致性的基本限制:对于任何非高斯物理系统,统计世界模型的表示误差随时间单调增长。我们证明这一限制是统计对齐机制的产物,而非世界模型的一般性质。我们引入物理基础符号架构(PGSA),并证明三个结果:(1) PGSA 对所有物理机制实现精确线性可识别性,无论潜在分布如何;(2) PGSA 的每步误差仅受数值精度限制;(3) 直接推论是,PGSA 在无界数量的转换中保持时间一致性,我们称之为近无限时间一致性。我们进一步证明,对于任何非高斯系统,统计世界模型无法实现这一性质,无论模型容量或训练数据量如何。其中四个定理的代数核心已在 Lean 4 中使用 Mathlib4 v4.31.0 形式化(零个 sorry 占位符);Klindt 等人的逆命题作为外部前提。对比表明,在世界动态的因果生成器中进行符号基础化是充分条件,并且在非高斯体制下,是实现近无限时间一致性的唯一条件。

英文摘要

Klindt, LeCun, and Balestriero (arXiv:2605.26379) proved that Joint-Embedding Predictive Architectures (JEPAs) achieve linear identifiability, the linear recovery of the world's true latent variables, if and only if the world's latent dynamics follow a Gaussian, stationary process. This Gaussian boundary implies a fundamental limit on temporal consistency: for any non-Gaussian physical system, the representation error of a statistical World Model grows monotonically with time. We prove that this limit is an artifact of the statistical alignment mechanism, not a property of World Models in general. We introduce the Physics-Grounded Symbolic Architecture (PGSA) and prove three results: (1) a PGSA achieves exact linear identifiability for all physical regimes, regardless of the latent distribution; (2) the per-step error of a PGSA is bounded by numerical precision alone; and (3) as a direct consequence, a PGSA maintains temporal consistency for an unbounded number of transitions, a property we term near-infinite temporal consistency. We further prove that statistical World Models cannot achieve this property for any non-Gaussian system, regardless of model capacity or the volume of training data. The algebraic cores of four of the theorems are formalized in Lean 4 with Mathlib4 v4.31.0 (zero sorry placeholders); the Klindt et al. converse is taken as an external premise. The contrast establishes that symbolic grounding in the causal generator of the world's dynamics is the sufficient condition and, in non-Gaussian regimes, the only condition for near-infinite temporal consistency.

2507.02921 2026-06-12 cs.LG cs.AI 版本更新

PlaceRep: Geospatial Place Representation Learning from Large-Scale Point-of-Interest Data

PlaceRep: 基于大规模兴趣点数据的地理空间场所表示学习

Mohammad Hashemi, Hossein Amiri, Andreas Zufle

发表机构 * Emory University(埃默里大学)

AI总结 提出PlaceRep方法,通过聚类空间和语义相关的兴趣点构建场所级表示,无需预训练即可高效生成多尺度城市区域嵌入,在人口密度估计和房价预测任务中优于现有方法并实现百倍加速。

详情
AI中文摘要

学习城市环境的有效表示需要捕捉超越固定行政边界的空间结构。现有的地理空间表示学习方法通常将兴趣点(POI)聚合到预定义的行政区域(如普查单元或邮政编码区域),为每个区域分配单个嵌入。然而,POI 通常形成跨越、包含或超出这些边界的语义上有意义的组,定义了更能反映人类活动和城市功能的场所。为解决这一局限性,我们提出 PlaceRep,一种通过聚类空间和语义相关的 POI 来构建场所级表示的地理空间表示学习方法。PlaceRep 从美国 Foursquare 数据中总结大规模 POI 图,生成通用城市区域嵌入,同时自动识别跨多个空间尺度的场所。通过消除模型预训练,PlaceRep 为多粒度地理空间分析提供了可扩展且高效的解决方案。使用人口密度估计和房价预测作为下游任务的实验表明,PlaceRep 优于大多数最先进的基于图的地理空间表示学习方法,并在大规模 POI 图上生成区域级表示时实现了高达 100 倍的加速。PlaceRep 的实现可在该 https URL 获取。

英文摘要

Learning effective representations of urban environments requires capturing spatial structure beyond fixed administrative boundaries. Existing geospatial representation learning approaches typically aggregate Points of Interest (POIs) into pre-defined administrative regions such as census units or ZIP code areas, assigning a single embedding to each region. However, POIs often form semantically meaningful groups that extend across, within, or beyond these boundaries, defining places that better reflect human activity and urban function. To address this limitation, we propose PlaceRep, a geospatial representation learning method that constructs place-level representations by clustering spatially and semantically related POIs. PlaceRep summarizes large-scale POI graphs from U.S. Foursquare data to produce general-purpose urban region embeddings while automatically identifying places across multiple spatial scales. By eliminating model pre-training, PlaceRep provides a scalable and efficient solution for multi-granular geospatial analysis. Experiments on two downstream tasks, population density estimation and housing price prediction, show that PlaceRep outperforms most state-of-the-art graph-based geospatial representation learning methods and achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation of PlaceRep is available at https://github.com/mohammadhashemii/PlaceRep.

2509.22050 2026-06-12 cs.LG 版本更新

BrainPro: Towards Large-scale Brain State-aware EEG Representation Learning

BrainPro:迈向大规模脑状态感知的脑电图表征学习

Yi Ding, Muyun Jiang, Weibang Jiang, Shuailei Zhang, Xinliang Zhou, Chenyu Liu, Shanglin Li, Yong Li, Cuntai Guan

发表机构 * Nanyang Technological University(南洋理工大学) Shanghai Jiao Tong University(上海交通大学) Advanced Telecommunications Research Institute International(先进电信研究院) Southeast University(东南大学)

AI总结 提出BrainPro模型,通过检索式空间对齐和脑状态解耦模块,学习共享与特定状态表征,在9个公共BCI数据集上取得最优性能。

Comments 31 pages, 11 figures

详情
AI中文摘要

脑电图(EEG)反映了潜在的脑状态,其活动分布在大脑区域并表现为头皮上的空间模式。学习这些空间结构化的、与状态相关的模式需要跨数据集的一致空间表征。然而,现有的EEG基础模型通常基于自注意力机制,该机制不保留位置特定信息,并且难以对齐不同通道配置记录的信号。此外,脑状态包含共享和状态特定的区域活动,这表明学习神经生理学上合理的、状态感知的表征可以补充当前模型所针对的共享表征,并改善下游解码。为了解决这些局限性,我们提出了BrainPro,一个大型EEG模型,它结合了基于检索的空间学习机制用于跨布局空间对齐,以及一个脑状态解耦模块,通过并行编码器和区域感知重建学习共享和状态特定表征。在大型EEG语料库上预训练后,BrainPro在跨越情感、运动、语音、压力、精神疾病和注意力任务的九个公共BCI数据集上实现了最先进的性能。对空间滤波器、通道丢失鲁棒性和编码器贡献的分析进一步验证了其空间对齐和状态感知路径的有效性。这些结果表明,BrainPro实现了学习空间模式的更好可解释性,并产生了有益于多种EEG解码任务的表征。

英文摘要

Electroencephalography (EEG) reflects underlying brain states, whose activities are distributed across brain regions and manifest as spatial patterns on the scalp. Learning these spatially structured, state-related patterns requires consistent spatial representations across datasets. However, existing EEG foundation models are typically based on self-attention, which does not preserve location-specific information and struggles to align signals recorded with different channel configurations. Moreover, brain states contain both shared and state-specific regional activity, suggesting that learning neurophysiologically plausible, state-aware representations can complement the shared representations targeted by current models and improve downstream decoding. To address these limitations, we propose BrainPro, a large EEG model that combines a retrieval-based spatial learning mechanism for cross-layout spatial alignment with a brain state-decoupling module that learns both shared and state-specific representations through parallel encoders and region-aware reconstruction. Pre-trained on a large EEG corpus, BrainPro achieves state-of-the-art performance across nine public BCI datasets spanning emotion, motor, speech, stress, mental disease, and attention tasks. Analyses of spatial filters, channel-drop robustness, and encoder contributions further validate the effectiveness of its spatial alignment and state-aware pathways. These results show that BrainPro achieves improved interpretability of learned spatial patterns and produces representations that benefit diverse EEG decoding tasks.

2603.08505 2026-06-12 cs.LG cs.AI 版本更新

Echo2ECG: Enhancing ECG Representations with Cardiac Morphology from Multi-View Echos

Echo2ECG:利用多视角超声心动图的心脏形态增强心电图表示

Michelle Espranita Liman, Özgün Turgut, Alexander Müller, Eimo Martens, Daniel Rueckert, Philip Müller

发表机构 * Chair for AI in Healthcare and Medicine, Technical University of Munich (TUM) and TUM University Hospital(医疗健康与医学人工智能教席,慕尼黑工业大学(TUM)及慕尼黑工业大学附属医院) Department of Cardiology, TUM University Hospital(心脏病科,慕尼黑工业大学附属医院) Department of Computing, Imperial College London(计算系,伦敦帝国理工学院) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML))

AI总结 提出Echo2ECG多模态自监督学习框架,通过多视角超声心动图丰富心电图表示,在结构表型分类和超声检索任务上优于现有方法,模型大小仅为最大基线的1/18。

Comments Accepted at MICCAI 2026

详情
AI中文摘要

心电图(ECG)是一种低成本、广泛使用的模态,通过捕捉心脏电活动来诊断电异常(如房颤)。然而,它无法直接测量心脏形态表型,如左心室射血分数(LVEF),这通常需要超声心动图(Echo)。从ECG预测这些表型将实现早期、可及的健康筛查。现有的自监督方法通过将ECG与单视角Echo对齐而遭受表示不匹配,单视角Echo仅捕捉局部、空间受限的解剖快照。为解决此问题,我们提出Echo2ECG,一种多模态自监督学习框架,利用多视角Echo中捕捉的心脏形态结构丰富ECG表示。我们在两个根本上需要形态信息的临床相关任务上评估Echo2ECG作为ECG特征提取器:(1)跨三个数据集的结构性心脏表型分类,以及(2)使用ECG查询检索具有相似形态特征的Echo研究。我们的提取的ECG表示在两个任务上始终优于最先进的单模态和多模态基线,尽管模型大小仅为最大基线的1/18。这些结果表明Echo2ECG是一个鲁棒、强大的ECG特征提取器。我们的代码可从此https URL获取。

英文摘要

Electrocardiography (ECG) is a low-cost, widely used modality for diagnosing electrical abnormalities like atrial fibrillation by capturing the heart's electrical activity. However, it cannot directly measure cardiac morphological phenotypes, such as left ventricular ejection fraction (LVEF), which typically require echocardiography (Echo). Predicting these phenotypes from ECG would enable early, accessible health screening. Existing self-supervised methods suffer from a representational mismatch by aligning ECGs to single-view Echos, which only capture local, spatially restricted anatomical snapshots. To address this, we propose Echo2ECG, a multimodal self-supervised learning framework that enriches ECG representations with the heart's morphological structure captured in multi-view Echos. We evaluate Echo2ECG as an ECG feature extractor on two clinically relevant tasks that fundamentally require morphological information: (1) classification of structural cardiac phenotypes across three datasets, and (2) retrieval of Echo studies with similar morphological characteristics using ECG queries. Our extracted ECG representations consistently outperform those of state-of-the-art unimodal and multimodal baselines across both tasks, despite being 18x smaller than the largest baseline. These results demonstrate that Echo2ECG is a robust, powerful ECG feature extractor. Our code is accessible at https://github.com/michelleespranita/Echo2ECG.

2603.14483 2026-06-12 cs.LG 版本更新

Disentangling Dynamical Systems: Causal Representation Learning Meets Local Sparse Attention

解耦动力系统:因果表示学习遇见局部稀疏注意力

Markus W. Baumgartner, Anson Lei, Joe Watson, Ingmar Posner

发表机构 * Applied Artificial Intelligence Lab, Oxford Robotics Institute, Oxford, UK(应用人工智能实验室,牛津机器人研究所,英国牛津)

AI总结 提出一种结合因果表示学习和局部稀疏注意力的方法,从原始轨迹数据中无结构假设地解耦系统参数,并通过图论准则保证可辨识性。

Comments Presented as an Oral at the 5th Conference on Causal Learning and Reasoning

详情
Journal ref
Proceedings of Machine Learning Research 323, 2026
AI中文摘要

参数化系统辨识方法从数据中估计显式定义的物理系统的参数。然而,它们仍然受限于需要提供显式函数空间,通常通过基于可用领域知识预定义的候选函数库。相比之下,深度学习能够以高保真度对广泛复杂性的系统进行建模,但黑箱函数逼近通常无法产生揭示系统结构的显式描述性或解耦表示。我们开发了一种新的可辨识性定理,利用因果表示学习,在没有结构假设的情况下发现系统参数的解耦表示。我们推导了一个图论准则,指定何时系统参数可以从原始轨迹数据中唯一解耦,直至置换和微分同胚。关键的是,我们的分析表明,全局因果结构为考虑局部状态依赖因果结构时可实现的解耦保证提供了下界。我们将系统参数识别实例化为变分推断问题,利用稀疏正则化变换器来发现状态依赖的因果结构。我们在四个合成领域上实证验证了我们的方法,证明了其恢复基线方法无法恢复的高度解耦表示的能力。与我们的理论分析一致,我们的结果证实了强制局部因果结构通常对于完全可辨识性是必要的。

英文摘要

Parametric system identification methods estimate the parameters of explicitly defined physical systems from data. Yet, they remain constrained by the need to provide an explicit function space, typically through a predefined library of candidate functions chosen via available domain knowledge. In contrast, deep learning can demonstrably model systems of broad complexity with high fidelity, but black-box function approximation typically fails to yield explicit descriptive or disentangled representations revealing the structure of a system. We develop a novel identifiability theorem, leveraging causal representation learning, to uncover disentangled representations of system parameters without structural assumptions. We derive a graphical criterion specifying when system parameters can be uniquely disentangled from raw trajectory data, up to permutation and diffeomorphism. Crucially, our analysis demonstrates that global causal structures provide a lower bound on the disentanglement guarantees achievable when considering local state-dependent causal structures. We instantiate system parameter identification as a variational inference problem, leveraging a sparsity-regularised transformer to uncover state-dependent causal structures. We empirically validate our approach across four synthetic domains, demonstrating its ability to recover highly disentangled representations that baselines fail to recover. Corroborating our theoretical analysis, our results confirm that enforcing local causal structure is often necessary for full identifiability.

2604.27277 2026-06-12 cs.LG cs.AI cs.CV 版本更新

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

BrainDINO:一种用于通用临床表征学习的脑MRI基础模型

Yizhou Wu, Shansong Wang, Yuheng Li, Mojtaba Safari, Mingzhe Hu, Chih-Wei Chang, Harini Veeraraghavan, Xiaofeng Yang

发表机构 * Department of Radiation Oncology and Winship Cancer Institute, Emory University(放射肿瘤科和Winship癌症研究所,埃默里大学) Department of Radiation and Cellular Oncology, The University of Chicago(放射肿瘤学与细胞肿瘤学部,芝加哥大学) Department of Electrical and Computer Engineering, Georgia Institute of Technology(电气与计算机工程系,佐治亚理工学院) Department of Biomedical Engineering, Georgia Institute of Technology(生物医学工程系,佐治亚理工学院) Department of Biomedical Informatics, Emory University(生物医学信息学系,埃默里大学) Department of Medical Physics, Memorial Sloan Kettering Cancer Center(医学物理系,纪念斯隆凯特琳癌症中心)

AI总结 提出BrainDINO,一种基于自蒸馏的基础模型,在约660万张未标记轴向切片上训练,通过冻结编码器加轻量任务头,在多种脑MRI任务上达到或超越基线,尤其在小样本场景下优势显著。

Comments 25 pages, 5 figures

详情
AI中文摘要

脑MRI支撑着广泛的神经科学和临床应用,然而大多数基于学习的方法仍针对特定任务且需要大量标注数据。本文表明,单一的自监督表征可以泛化到异质的脑MRI终点。我们训练了BrainDINO,一个自蒸馏的基础模型,使用了来自20个数据集的约660万张未标记轴向切片,这些数据集涵盖了人群、疾病和采集设置的广泛变异。通过使用冻结编码器加轻量任务头,BrainDINO支持肿瘤分割、神经退行性和神经发育性疾病分类、脑年龄估计、卒中后时间预测、分子状态预测、MRI序列分类和生存建模等任务的迁移。在各种任务和监督机制下,BrainDINO始终等于或超过自然图像和MRI特定自监督基线,在标签稀缺时尤其具有优势。表征分析进一步显示,在缺乏任务特定监督的情况下,特征结构具有解剖学组织和病理敏感性。我们的发现表明,大规模切片级自监督学习可以产生统一的脑MRI表征,支持多样化的神经影像任务,无需体积预训练或全网络微调,为稳健且数据高效的脑影像分析建立了可扩展的基础。代码可在 https://github.com/mclwu22/BrainDINO 获取。

英文摘要

Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and require substantial labeled data. Here we show that a single self-supervised representation can generalize across heterogeneous brain MRI endpoints. We trained BrainDINO, a self-distilled foundation model, on approximately 6.6 million unlabeled axial slices from 20 datasets encompassing broad variation in population, disease, and acquisition setting. Using a frozen encoder with lightweight task heads, BrainDINO supported transfer across tumor segmentation, neurodegenerative and neurodevelopmental conditions classification, brain age estimation, post-stroke temporal prediction, molecular status prediction, MRI sequence classification, and survival modeling. Across tasks and supervision regimes, BrainDINO consistently equaled or exceeded natural-image and MRI-specific self-supervised baselines, with particularly strong advantages under label scarcity. Representation analyses further showed anatomically organized and pathology-sensitive feature structure in the absence of task-specific supervision. Our findings indicate that large-scale slice-wise self-supervised learning can yield a unified brain MRI representation that supports diverse neuroimaging tasks without volumetric pretraining or full-network fine-tuning, establishing a scalable foundation for robust and data-efficient brain imaging analysis. Code is available at https://github.com/mclwu22/BrainDINO

2606.10678 2026-06-12 cs.LG 版本更新

One Step Closer to Ground Truth: A Multi-Scale Residual-Aware Representation Learning Pipeline for Predicting Time Series Data

更接近真实:一种多尺度残差感知表示学习管道用于时间序列预测

Amrijit Biswas, Mustafa Kamal, Robin Krambroeckers, M. M. Lutfe Elahi, Sifat Momen, Nabeel Mohammed, Shafin Rahman

发表机构 * RobotBulls Labs(RobotBulls实验室) North South University(南北大学)

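AI summary: Proposes a two-stage, model-agnostic framework that explicitly decouples forecasting from residual learning and uses a meta-corrector to dynamically model structured error patterns, improving Transformer forecasting accuracy.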

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26)

AI中文摘要

近年来,基于Transformer的模型已成为时间序列预测的主要范式,利用自注意力机制捕获长程依赖关系。尽管取得了成功,但这些单阶段预测架构由于结构差异、未建模的随机成分或多尺度时间表示不足,表现出持续的系统性残差偏差。当残差被视为不可约噪声时,这一局限性依然存在,阻碍了对结构化误差模式的自适应校正。为解决这一问题,我们引入了一个两阶段、模型无关的框架,将预测和残差学习显式解耦为不同的表示学习阶段。基础Transformer首先生成初始预测。随后,专用的元校正器动态建模跨多元通道的结构化误差模式,保留跨变量依赖关系,并迭代修正基础Transformer的残差偏差。通过将该管道形式化为假设空间扩展,我们的框架解决了单阶段架构固有的近似局限性,消除了对限制性假设的依赖,并实现了复杂误差动态的端到端学习。在八个流行的基准数据集上使用既定协议进行评估,我们的方法达到了最先进的性能,在标准指标(MSE、MAE)上有显著改进。结果表明,该框架能够减轻系统性偏差,增强对复杂时间动态的鲁棒性,推进了基于Transformer的预测模型的实际应用。

英文摘要

Transformer-based models have emerged as leading paradigms in time-series forecasting in recent years, employing self-attention mechanisms to capture long-range dependencies. Despite their success, these single-stage forecasting architectures exhibit persistent systematic residual biases arising from structural discrepancies, unmodeled stochastic components, or inadequate multi-scale temporal representations. This limitation persists when residuals are treated as irreducible noise, precluding adaptive correction of structured error patterns. To address this limitation, we introduce a two-stage, model-agnostic framework that explicitly decouples forecasting and residual learning into distinct stages of representation learning. A base transformer first generates the initial predictions. Subsequently, a dedicated meta-corrector dynamically models structured error patterns across multivariate channels, preserves cross-variable dependencies, and iteratively refines the residual bias of the base transformer. By formalizing this pipeline as a hypothesis space expansion, our framework addresses approximation limitations inherent in single-stage architectures, removes reliance on restrictive assumptions, and enables end-to-end learning of complex error dynamics. Evaluated on eight popular benchmark datasets using established protocols, our approach achieves state-of-the-art performance, with significant improvements in standard metrics (MSE, MAE). The results demonstrate the framework's ability to mitigate systematic biases and enhance robustness to complex temporal dynamics, advancing the practical applicability of transformer-based forecasting models.
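The two-stage idea in this abstract (a base forecaster followed by a corrector that learns the base model's residuals) can be illustrated with a deliberately tiny sketch. Both stages here are trivial stand-ins chosen for illustration: a persistence forecast replaces the base transformer, and a mean-residual offset replaces the meta-corrector. The function names are hypothetical and this is not the authors' pipeline.

```python
def base_forecast(history):
    """Stage 1 stand-in: persistence forecast (repeat the last observed value)."""
    return history[-1]


def fit_residual_corrector(series, window=3):
    """Stage 2 stand-in: learn the mean residual of the base forecaster on past data."""
    residuals = []
    for t in range(window, len(series)):
        pred = base_forecast(series[t - window:t])
        residuals.append(series[t] - pred)
    mean_residual = sum(residuals) / len(residuals)
    # The corrector adds the learned systematic error back onto a base prediction.
    return lambda base_pred: base_pred + mean_residual


# A linear trend: the persistence forecast is systematically one step behind.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
corrector = fit_residual_corrector(series)
base_pred = base_forecast(series[-3:])
final_pred = corrector(base_pred)
print(base_pred, final_pred)
```

The point of the sketch is only that a residual treated as structured signal, rather than as noise, can be learned and added back; a real corrector would model residuals per channel and per horizon.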

2606.11190 2026-06-12 cs.LG 版本更新

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

何时对齐,何时预测:多模态学习的相图

Ilay Kamai, Hugues Van Assel, Aviv Regev, Hagai B. Perets, Randall Balestriero

发表机构 * Technion(以色列理工学院) Genentech(基因泰克公司) Brown University(布朗大学) Meta AI, FAIR

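AI summary: Proposes a unified linear framework that uses a signal-plus-noise model to expose the complementary failure modes of cross-modal alignment and cross-modal prediction, builds a four-regime phase diagram to guide the choice of multimodal learning objective, and validates it in nonlinear experiments.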

AI中文摘要

跨模态对齐(CA)和跨模态预测(CP)是多模态表示学习的主要范式,但目前缺乏对每种方法何时成功、何时失败以及跨模态训练何时有帮助的系统性理解——这一空白使得从业者,特别是在生物医学或天体物理学等科学领域,面对异构仪器以及多个层次的组织和测量时,无法诊断为什么标准方法不如最佳单模态。我们开发了一个统一的线性框架来解决这两个问题。在具有结构化跨模态干扰相关性的尖峰信号加噪声模型下,我们推导出两个目标的分离比,揭示了互补的失效模式:对齐使每个模态白化,当干扰在视图间强相关时失败;预测通过单侧白化编码任何可跨模态预测的内容,恢复由源模态质量决定。由此产生的相图将多模态问题划分为四个区域:两者、仅CA、仅CP和两者都不。我们提出了一种数据驱动的方法,使用少量标记子样本将真实世界数据集定位在该图中,在任何跨模态训练之前确定首选目标和预测方向。在合成数据、立体视觉基准、图像-文本对和真实天体物理数据上的实验验证了非线性情况下的预测,包括跨模态训练有害的“两者都不”区域。我们的框架使从业者能够诊断其多模态问题,并在投入训练之前选择正确的目标。重现结果的代码可在此https URL获取。

英文摘要

Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all -- a gap that leaves practitioners, especially in scientific domains like biomedicine or astrophysics, with heterogeneous instruments and multiple levels of organization and measurement, unable to diagnose why standard methods underperform the best single modality. We develop a unified linear framework that addresses both questions. Under a spiked signal-plus-noise model with structured cross-modal nuisance correlation, we derive separation ratios for both objectives that expose complementary failure modes: alignment whitens each modality and fails when nuisance is strongly correlated across views; prediction encodes whatever is cross-predictable through a one-sided whitening, with recovery governed by source-modality quality. The resulting phase diagram partitions multimodal problems into four regimes: Both, CA only, CP only, and Neither. We present a data-driven procedure to locate real-world datasets in this diagram using a small labeled subsample, identifying the preferred objective and prediction direction before any cross-modal training. Experiments on synthetic data, stereo-vision benchmarks, image-caption pairs, and real astrophysical data validate the predictions in the nonlinear regime, including the Neither regime where cross-modal training is actively harmful. Our framework lets practitioners diagnose their multimodal problem and choose the right objective before committing to training. Code to reproduce the results is available at https://github.com/IlayMalinyak/mm_align_vs_pred.

2511.18322 2026-06-12 cs.RO cs.CV cs.LG 版本更新

Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video

从视频中学习软体连续体机器人的视觉可解释振荡器网络

Henrik Krauss, Johann Licher, Naoya Takeishi, Annika Raatz, Takehisa Yairi

Affiliations: Department of Advanced Interdisciplinary Studies, The University of Tokyo; Institute of Assembly Technology and Robotics, Leibniz University Hannover; Research Center for Advanced Science and Technology, The University of Tokyo

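AI summary: Proposes the Attention Broadcast Decoder (ABCD) and Visual Oscillator Networks (VONs) for visually and mechanically interpretable learning of soft continuum robot dynamics from video, reducing multi-step prediction error by up to 5.8x.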

Comments Code available at: https://github.com/UThenrik/visual_oscillators_for_SCR Dataset available at: https://zenodo.org/records/17812071 Video available at: https://youtu.be/i80H8erVISM

AI中文摘要

从视频中学习软体连续体机器人(SCR)动力学提供了灵活性,但现有方法缺乏可解释性或依赖先验假设。基于模型的方法需要先验知识和手动设计。我们通过引入以下内容来弥补这一差距:(1)注意力广播解码器(ABCD),一种用于基于自编码器的潜在动力学学习的即插即用模块,生成像素级注意力图,定位每个潜在维度的贡献,同时过滤静态背景,通过空间接地潜在变量和图像叠加实现视觉可解释性。(2)视觉振荡器网络(VONs),一种二维潜在振荡器网络,与ABCD注意力图耦合,用于学习到的质量、耦合刚度和力的图像可视化,从而实现机械可解释性。我们在单段和双段SCR上验证了我们的方法,表明基于ABCD的模型显著提高了多步预测精度,在双段机器人上,Koopman算子的误差降低了5.8倍,振荡器网络的误差降低了3.5倍。VONs自主发现了振荡器的链式结构。这种完全数据驱动的方法产生了紧凑、机械可解释的模型,对未来的控制应用具有潜在意义。

英文摘要

Learning soft continuum robot (SCR) dynamics from video offers flexibility but existing methods lack interpretability or rely on prior assumptions. Model-based approaches require prior knowledge and manual design. We bridge this gap by introducing: (1) The Attention Broadcast Decoder (ABCD), a plug-and-play module for autoencoder-based latent dynamics learning that generates pixel-accurate attention maps localizing each latent dimension's contribution while filtering static backgrounds, enabling visual interpretability via spatially grounded latents and on-image overlays. (2) Visual Oscillator Networks (VONs), a 2D latent oscillator network coupled to ABCD attention maps for on-image visualization of learned masses, coupling stiffness, and forces, thereby enabling mechanical interpretability. We validate our approach on single- and double-segment SCRs, demonstrating that ABCD-based models significantly improve multi-step prediction accuracy with 5.8x error reduction for Koopman operators and 3.5x for oscillator networks on a two-segment robot. VONs autonomously discover a chain structure of oscillators. This fully data-driven approach yields compact, mechanically interpretable models with potential relevance for future control applications.

2606.04009 2026-06-12 stat.ML cs.AI cs.LG 版本更新

Counterfactual Explanations for Deep Two-Sample Testing

深度双样本检验的反事实解释

Wei-Cheng Lai, Marco Simnacher, Christoph Lippert

Affiliations: Hasso-Plattner-Institute, University of Potsdam; Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai

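AI summary: Proposes a counterfactual explanation framework for deep two-sample testing, based on a diffusion autoencoder and MMD optimization, that generates sample-level edits revealing the features that drive rejection of the null hypothesis.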

Comments 17 pages

AI中文摘要

双样本检验是检测科学领域中分布差异的基本工具,但经典检验(包括基于核的检验)在高维结构化数据(如图像)上可能效果不佳。最近的深度双样本检验通过学习信息表示提高了这些场景下的灵敏度,但它们对哪些数据特征驱动拒绝原假设 $H_0$ 提供的洞察有限。为解决此问题,我们提出了一种用于深度双样本检验的反事实解释框架,该框架生成样本级编辑,将观测值从源组移向目标组,同时明确减少检验所测量的差异。我们的方法将扩散自编码器与预训练的深度双样本检验模型相结合,并在检验模型的表示空间中优化最大均值差异(MMD)目标,以生成合理的反事实。我们通过检验统计量和由此产生的双样本p值的变化来量化分布级效应。我们在合成2D形状数据集和两个MRI队列上评估了该方法。在这两种设置下,反事实变换相对于原始样本持续增加p值,表明编辑后的源集在检验下在统计上更接近目标分布。我们使用LPIPS测量最小性,以确保反事实保持接近原始样本。由此产生的编辑提供了与检测到的组差异相关的特征的可解释证据。在MRI上,局部变化与队列之间已知的解剖学差异一致。

英文摘要

Two-sample testing is a fundamental tool for detecting distributional differences across scientific domains, but classical tests (including kernel-based tests) can be ineffective on high-dimensional structured data such as images. Recent deep two-sample tests improve sensitivity in these settings by learning informative representations, yet they provide limited insight into which data features drive rejection of the null hypothesis $H_0$. To address this issue, we propose a counterfactual explanation framework for deep two-sample testing that generates sample-level edits moving observations from a source group toward a target group while explicitly reducing the discrepancy measured by the test. Our method combines a diffusion autoencoder with a pretrained deep two-sample test model and optimizes a maximum mean discrepancy (MMD) objective in the test model's representation space to produce plausible counterfactuals. We quantify distribution-level effects through changes in the test statistic and the resulting two-sample p-values. We evaluate the method on synthetic 2D shape datasets and two MRI cohorts. Across both settings, the counterfactual transformations consistently increase p-values relative to the original samples, indicating that the edited source set becomes statistically closer to the target distribution under the test. We measure minimality using LPIPS to ensure the counterfactuals remain close to the original samples. The resulting edits provide interpretable evidence of the features associated with the detected group differences. On MRI, the localized changes are consistent with known anatomical differences between cohorts.
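The maximum mean discrepancy (MMD) that this abstract optimizes can be illustrated with a small sketch. The code below computes the standard biased estimate of squared MMD with a Gaussian kernel between two sets of feature vectors. The kernel choice, the bandwidth, the toy data, and the function names are assumptions for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two equal-length vectors (bandwidth is an assumed hyperparameter)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


source = [[0.0, 0.0], [0.1, -0.1], [-0.2, 0.1]]
target = [[2.0, 2.0], [2.1, 1.9], [1.8, 2.2]]
edited = [[1.9, 2.0], [2.0, 1.8], [1.9, 2.1]]  # hypothetical counterfactual edits of the source set

print(mmd_squared(source, target))  # discrepancy before editing
print(mmd_squared(edited, target))  # discrepancy after editing (smaller here)
```

In the setting described above, the vectors would be representations from the pretrained test model rather than raw coordinates, and a lower discrepancy after editing corresponds to the higher p-values the abstract reports.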

3. Reinforcement Learning and Sequential Decision-Making (18 papers)

2606.12479 2026-06-12 cs.LG cs.AI 新提交

ReCal: Reward Calibration for RL-based LLM Routing

ReCal: 基于强化学习的LLM路由的奖励校准

Qihang Yu, Hanwen Tong, Zhengqi Zhang, Bo Zheng, Feng Wei, Shengyu Zhang, Zemin Liu, Fei Wu

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) Shanghai AI Laboratory(上海人工智能实验室)

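AI summary: Proposes ReCal, a framework that calibrates reward signals through hierarchical reward decomposition and distribution-aware optimization, addressing conflicting objectives and optimization bias across heterogeneous tasks to improve LLM routing performance and training stability.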

详情
AI中文摘要

大型语言模型(LLM)路由已成为一种有效范式,通过动态模型和推理策略选择来利用多个LLM的互补优势。最近的基于强化学习(RL)的路由方法通过从交互反馈中优化路由策略,进一步提高了路由质量。然而,在难度不同的异质性任务下,它们仍然难以提供信息丰富且可比较的学习信号。在实践中,多个目标(如正确性、格式行为)被聚合为单个标量奖励,导致模糊的信用分配和冲突的优化信号。此外,奖励信号在不同实例间表现出显著变异性,其中一些实例产生更高或更可变的奖励,引入了偏向于平凡样本而非信息性样本的优化偏差。为了解决这些问题,我们提出了\textbf{ReCal},一个用于基于RL的LLM路由的\textbf{\underline{Re}}ward \textbf{\underline{Cal}}ibration(奖励校准)框架。我们首先引入了一种具有分量式优势估计的分层奖励分解机制。我们进一步提出了一种分布感知的优化策略,通过方差感知重加权和每数据集归一化来校准优化变异性。在七个数据集上的实验表明,ReCal在路由性能和训练稳定性上持续优于基线方法。代码可在该网址获取。

英文摘要

Large language model (LLM) routing has emerged as an effective paradigm for leveraging the complementary strengths of multiple LLMs through dynamic model and reasoning-strategy selection. Recent reinforcement learning (RL)-based routing methods further improve routing quality by optimizing routing policies from interaction feedback. However, they still struggle to provide informative and comparable learning signals under heterogeneous tasks with varying difficulty. In practice, multiple objectives (e.g., correctness, format behavior) are aggregated into a single scalar reward, leading to ambiguous credit assignment and conflicting optimization signals. Moreover, reward signals exhibit significant variability across instances, where some instances produce higher or more variable rewards, introducing optimization bias that favors trivial samples over informative ones. To address these issues, we propose ReCal, a Reward Calibration framework for RL-based LLM routing. We first introduce a hierarchical reward decomposition mechanism with component-wise advantage estimation. We further propose a distribution-aware optimization strategy that calibrates optimization variability through variance-aware reweighting and per-dataset normalization. Experiments on seven datasets demonstrate that ReCal consistently improves routing performance and training stability over baselines. Code is available at https://anonymous.4open.science/r/ReCal.
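The per-dataset normalization mentioned in this abstract can be sketched as a z-score computed within each dataset, so that rewards from easy and hard datasets land on a comparable scale. The abstract does not give the exact formula, so the z-score form, the epsilon term, and the function name below are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_dataset_normalize(samples, eps=1e-8):
    """samples: list of (dataset_name, reward). Returns z-scored rewards in input order."""
    by_dataset = defaultdict(list)
    for name, reward in samples:
        by_dataset[name].append(reward)
    # Mean and population standard deviation of the rewards within each dataset.
    stats = {name: (mean(r), pstdev(r)) for name, r in by_dataset.items()}
    return [(reward - stats[name][0]) / (stats[name][1] + eps)
            for name, reward in samples]


samples = [("easy", 0.9), ("easy", 1.0), ("hard", 0.1), ("hard", 0.3)]
normalized = per_dataset_normalize(samples)
print(normalized)
```

After normalization, the better sample in the hard dataset carries the same signal strength as the better sample in the easy dataset, even though its raw reward is much lower.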

2606.12485 2026-06-12 cs.LG cs.AI 新提交

Speculative Rollback Correction for Quality-Diverse Web Agent Imitation

面向质量多样性的Web智能体模仿的推测性回滚修正

Longkun Hao, Hongyu Lin, Hao Li, Zhichao Yang, Haojie Hao, Dongshuo Huang, Haitao Yang, Hongyu Ge, Ming jie Xie, Yanjun Wu, Zi Hao Yin, Yan Bai, Yihang Lou

发表机构 * Beihang University(北京航空航天大学) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) The Hong Kong University of Science and Technology(香港科技大学) Northwestern Polytechnical University(西北工业大学) Tsinghua University(清华大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Peking University(北京大学)

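AI summary: Proposes Speculative Rollback Correction (SRC), which uses fixed-horizon branch review and rollback to reduce teacher queries while preserving trajectory diversity; on WebArena-Infinity it collects 977 verifier-passing trajectories and 9,183 next-action examples.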

详情
AI中文摘要

通过从专家轨迹进行模仿学习来训练交互式Web智能体已成为一种高效的方法。然而,在此背景下,确定专家干预的最佳时机是一个关键挑战。延迟干预往往导致早期错误的累积,将页面状态推入不可恢复的区域。相反,过早或过度干预会使智能体过度依赖专家策略,将模型困在以单一刚性轨迹为特征的局部最优中。我们提出推测性回滚修正(SRC),一种针对可重置智能体环境的分支级模仿框架。SRC不是在每个访问状态请求教师标签,也不是仅在完成轨迹后修正,而是采用固定视野分支审查:学生先执行一个短的推测性片段,然后由教师审查,仅当局部进展中断时,教师才定位第一个有害偏差。回滚保留有用的前缀,而成功的展开由硬验证器过滤并保留在轻量级质量多样性档案中。所得数据支持对局部修正和通过验证器的轨迹进行下一步动作监督微调。在WebArena-Infinity上,SRC收集了977条通过验证器的轨迹和9183个下一步动作示例;固定视野审查在保留通过验证器的解决方案变体的同时,改善了恢复与查询的权衡。代码可在该https URL获取。

英文摘要

Training interactive web agents through imitation learning from expert trajectories has emerged as a highly effective approach. However, determining the optimal timing for expert intervention presents a critical challenge in this context. Delayed intervention often leads to the accumulation of early-stage errors, pushing the page state into an irrecoverable regime. Conversely, premature or excessive intervention causes the agent to become overly reliant on expert policies, trapping the model in local optima characterized by a single, rigid trajectory. We propose Speculative Rollback Correction (SRC), a branch-level imitation framework for resettable agent environments. Instead of requesting teacher labels at every visited state or correcting only after a completed trajectory, SRC uses fixed-horizon branch review: the student executes a short speculative segment before teacher review, and the teacher localizes the first harmful deviation only when local progress breaks. Rollback preserves useful prefixes, while successful rollouts are filtered by a hard verifier and retained in a lightweight quality-diversity archive. The resulting data supports next-action supervised fine-tuning on both localized corrections and verifier-passing trajectories. On WebArena-Infinity, SRC collects 977 verifier-passing trajectories and 9,183 next-action examples; fixed-horizon review improves the recovery-versus-query tradeoff over step-level review while retaining verifier-passing solution variants. Code is available at https://github.com/LongkunHao/SRC_gui_agent.

2606.12505 2026-06-12 cs.LG cs.AI 新提交

Boosting Direct Preference Optimization with Penalization

通过惩罚增强直接偏好优化

Pengwei Sun

发表机构 * Pengwei Sun(Sun Pengwei)

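AI summary: Proposes DPOP, which adds to the DPO loss a gated penalty on the reference model's greedy responses, activated only when the current policy assigns the preferred response a lower likelihood than the rejected one; it improves length-controlled win rate on AlpacaEval 2.0.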

Comments Accepted at ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning

AI中文摘要

离线偏好优化已成为从人类反馈中进行强化学习的实用替代方案,但诸如直接偏好优化(DPO)及其变体等成对目标仅使用存储在静态数据集中的选择和拒绝响应。这留下了一个有用的信号未被利用:参考模型本身为同一提示生成的响应。我们提出了带惩罚的直接偏好优化(DPOP),这是DPO的一个简单扩展,它在基础偏好损失上增加了一个对参考贪婪响应的门控惩罚。DPOP仅在当前策略对偏好响应的似然仍低于对拒绝响应的似然时激活此惩罚。在AlpacaEval 2.0上,DPOP在Llama-3-8b-it和Gemma-2-9b-it上均提高了长度控制的胜率,相对于DPO、SimPO和AlphaDPO,在两个模型上分别实现了5.3%和4.4%的相对增益。消融实验进一步表明,在此设置下,SimNPO风格的长度归一化惩罚比NPO和token级非似然惩罚更强。

英文摘要

Offline preference optimization has become a practical substitute for reinforcement learning from human feedback, but pairwise objectives such as Direct Preference Optimization (DPO) and its variants use only the chosen and rejected responses stored in a static dataset. This leaves a useful signal unused: the response that the reference model itself would generate for the same prompt. We propose Direct Preference Optimization with Penalization (DPOP), a simple extension of DPO that augments the base preference loss with a gated penalty on reference-greedy responses. DPOP activates this penalty only when the current policy still assigns a lower likelihood to the preferred response than to the rejected response. On AlpacaEval 2.0, DPOP improves length-controlled win rate over DPO, SimPO, and AlphaDPO on both Llama-3-8b-it and Gemma-2-9b-it, achieving relative gains of 5.3% and 4.4% over baselines on the two models, respectively. Ablations further show that a SimNPO-style length-normalized penalty is stronger than NPO and token-level unlikelihood in this setting.
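The gating described in this abstract can be sketched on top of the standard DPO loss. The DPO term below follows the usual published form; the penalty term is only a placeholder (a length-normalized log-sigmoid penalty on the reference-greedy response), because the abstract does not state the exact formula. The function names, `lam`, and the penalty shape are assumptions, not the paper's definition.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss from sequence log-probs of the chosen (w) and rejected (l) responses."""
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    return -log_sigmoid(margin)


def dpo_with_gated_penalty(lp_w, lp_l, ref_w, ref_l, lp_greedy, greedy_len,
                           beta=0.1, lam=1.0):
    """DPO plus a penalty on the reference-greedy response, gated as the abstract describes."""
    base = dpo_loss(lp_w, lp_l, ref_w, ref_l, beta)
    # Gate: active only while the policy still ranks the rejected response above the chosen one.
    gate = 1.0 if lp_w < lp_l else 0.0
    # Assumed penalty: decreases as the policy's per-token log-prob of the greedy response falls.
    penalty = -log_sigmoid(-beta * lp_greedy / greedy_len)
    return base + lam * gate * penalty


print(dpo_with_gated_penalty(-12.0, -10.0, -11.0, -11.0, lp_greedy=-8.0, greedy_len=4))
print(dpo_with_gated_penalty(-10.0, -12.0, -11.0, -11.0, lp_greedy=-8.0, greedy_len=4))
```

The first call has the gate open (the chosen response is less likely than the rejected one), so the penalty is added; in the second call the gate is closed and the loss reduces to plain DPO.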

2606.12634 2026-06-12 cs.LG cs.AI cs.CL 新提交

Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents

保持策略梯度主导:面向长程工具使用智能体的兄弟引导信用蒸馏

Tianyu Ding, Jianhong Xin, Juan Pablo De la Cruz Weinstein

发表机构 * Amazon Web Services(亚马逊云服务)

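AI summary: To address sparse trajectory-level advantage signals in long-horizon tool-use reinforcement learning, proposes Sibling-Guided Credit Distillation (SGCD), which dynamically samples successful and failed rollouts and has an external LLM contrast them into a stepwise credit reference, yielding dense credit assignment and improved results on AppWorld and τ³-airline.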

Comments 13 pages, 4 figures, 7 tables. Submitted to EMNLP 2026 Industry Track

AI中文摘要

长程工具使用强化学习可以从结果验证中学习,但其轨迹级优势被广播到许多推理、API和答案令牌上。自蒸馏通过重用策略自身的轨迹或特权教师承诺提供更密集的信号。然而,我们表明直接的令牌级自蒸馏会悄然破坏工具使用:它复述教师行为而不知道验证器奖励哪些动作,因此有用技能和有害捷径被一起放大。我们引入兄弟引导信用蒸馏(SGCD),它使用蒸馏进行信用分配而非作为竞争性的演员损失。动态采样产生混合的成功和失败的兄弟轨迹;外部LLM将其对比总结为训练时逐步信用参考;密集的教师/学生散度驱动信用重新分配;有界分离的信用权重重塑GRPO令牌优势。部署的学生看不到外部LLM、兄弟证据或预言机。在AppWorld和τ³-airline上,SGCD优于匹配的GRPO比较器:AppWorld上test_normal的TGC从42.9提升到45.6,test_challenge从24.7提升到27.0;τ³-airline的pass@1从0.583提升到0.602。

英文摘要

Long-horizon tool-use reinforcement learning can learn from outcome verification, but its trajectory-level advantage is broadcast across many reasoning, API, and answer tokens. Self-distillation promises a denser signal by reusing a policy's own rollouts or a privileged teacher. We show, however, that direct token-level self-distillation can silently destroy tool use: it rehearses teacher behavior without knowing which actions the verifier rewards, so useful skills and harmful shortcuts are amplified together. We introduce Sibling-Guided Credit Distillation (SGCD), which uses distillation for credit assignment rather than as a competing actor loss. Dynamic sampling produces mixed successful and failed sibling rollouts; an external LLM summarizes their contrast into a training-only stepwise credit reference; dense teacher/student divergence drives credit reassignment; and bounded detached credit weights reshape GRPO token advantages. The deployed student sees no external LLM, sibling evidence, or oracle. Across AppWorld and $τ^3$-airline, SGCD improves over matched GRPO comparators: AppWorld TGC $42.9 \to 45.6$ on test_normal and $24.7 \to 27.0$ on test_challenge, and $τ^3$-airline pass@1 $0.583 \to 0.602$.

2606.12640 2026-06-12 cs.LG cs.RO cs.SY eess.SY 新提交

Individual Control Barrier Functions-Guided Diffusion Model for Safe Offline Multi-Agent Reinforcement Learning

个体控制障碍函数引导的扩散模型用于安全离线多智能体强化学习

Qingyun Guo, Junyi Shi, Jianuo Huang, Tianyu Shi

发表机构 * Department of Electrical Engineering and Automation, Aalto University(阿尔托大学电气工程与自动化系) School of Computing and Data Science, Xiamen University Malaysia(厦门大学马来西亚分校计算与数据科学学院) Department of Computer Science, University of Toronto(多伦多大学计算机科学系)

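AI summary: Proposes an offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into a diffusion model and recovers control policies through inverse dynamics, substantially improving the safety of generated trajectories while maintaining competitive rewards.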

Comments Accepted to the 23rd IFAC World Congress, 2026

AI中文摘要

离线强化学习允许直接从数据中学习控制策略而无需在线交互,使其适用于安全关键任务。最近的研究将扩散模型应用于离线强化学习,以利用其建模复杂数据分布的强大能力。然而,现有方法主要关注单智能体设置,多智能体环境中的安全挑战在很大程度上未被探索。在这项工作中,我们提出了一种安全的离线多智能体强化学习算法,该算法将神经个体控制障碍函数嵌入扩散模型中,以增强轨迹生成过程中的安全性,并通过逆动力学恢复控制策略。我们在多种基准上评估了我们的算法,证明了在保持竞争性奖励的同时实现了显著的安全改进。

英文摘要

Offline reinforcement learning allows control policies to be learned directly from data without online interaction, making it suitable for safety-critical tasks. Recent studies have applied diffusion models to offline reinforcement learning to leverage their strong capacity for modeling complex data distributions. However, existing approaches primarily focus on single-agent settings, leaving the safety challenges in multi-agent environments largely unexplored. In this work, we propose a safe offline multi-agent reinforcement learning algorithm that embeds neural individual control barrier functions into the diffusion model to enhance safety during trajectory generation, with control policies recovered through inverse dynamics. We evaluate our algorithm across diverse benchmarks, demonstrating substantial safety improvements while maintaining competitive rewards.

2606.12780 2026-06-12 cs.LG cs.CL 新提交

ProPlay: Procedural World Models for Self-Evolving LLM Agents

ProPlay: 用于自我进化LLM智能体的程序化世界模型

Yijun Ma, Zehong Wang, Yiyang Li, Ziming Li, Xiaoguang Guo, Weixiang Sun, Chuxu Zhang, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

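AI summary: Proposes ProPlay, a procedural world model that uses procedure-level preplay and a causal procedure graph to let LLM agents self-evolve in partially observable environments without external supervision.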

AI中文摘要

自我进化智能体应能在无外部监督下通过交互改进,但在部分可观测环境中仍困难,智能体必须主动探索、从有限反馈中学习,并决定何时信任先前经验。现有的LLM智能体方法通常依赖记忆或规划模块,但很少在它们之间闭环以持续完善对环境动态的内部理解。我们提出ProPlay,一种程序化世界模型,支持程序级预演,智能体可利用学到的世界知识排练未来的程序路径。ProPlay不将经验表示为孤立的规则或低层动作约束,而是将成功轨迹抽象为程序,并在捕获任务阶段间因果转换的程序图中组织它们。每个转换与一个可靠性记录嵌入相关联,以从过去结果中估计其任务特定贡献。在每个回合前,ProPlay在已知图结构上模拟未来程序轨迹作为结构化软指导;执行后,它利用环境反馈精炼图。在公开基准上的实验表明,ProPlay在环境理解和自我进化能力上持续优于强基线。我们的代码已在此https URL发布。

英文摘要

Self-evolving agents are expected to improve through interaction without external supervision, but this remains difficult in partially observable environments where agents must explore actively, learn from limited feedback, and decide when to trust prior experience. Existing LLM-agent methods often rely on memory or planning modules, yet they rarely close the loop between them to continually refine an internal understanding of environment dynamics. We introduce ProPlay, a procedural world model that supports procedure-level preplay, where agents can rehearse future procedural paths using the learned world knowledge. Rather than representing experience as isolated rules or low-level action constraints, ProPlay abstracts successful trajectories into procedures and organizes them in a procedure graph that captures causal transitions among task stages. Each transition is associated with a reliability record embedding to estimate its task-specific contribution from past outcomes. Before each episode, ProPlay simulates future procedural trajectories over known graph structures as structured soft guidance; after execution, it refines the graph using environment feedback. Experiments on public benchmarks show that ProPlay consistently improves environment understanding and self-evolution capability over strong baselines. Our code has been released at https://github.com/antman9914/proplay.

2606.13461 2026-06-12 cs.LG cs.CV 新提交

Reinforcement Learning for Neural Model Editing

神经模型编辑的强化学习

Shaivi Malik

发表机构 * Shaivi Malik

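AI summary: Formulates neural model editing as a reinforcement learning problem in which editing policies are learned from reward feedback, with good results on bias mitigation and machine unlearning.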

AI中文摘要

编辑预训练神经网络需要针对特定目标定制的专用算法。设计此类算法通常耗时且需要大量精力。我们提出了一个探索性框架,将神经模型编辑形式化为强化学习问题,其中智能体使用奖励反馈修改模型。我们引入了两个环境:MaskWorld,其中智能体以乘法方式缩放权重;以及ShiftWorld,其中智能体应用加法权重更新。奖励函数结合了效用保持目标和任务特定编辑目标,使智能体能够在保持整体模型性能的同时学习有针对性的修改。我们在文本分类中的偏见缓解和图像分类中的机器遗忘上评估了该框架,这两者传统上都依赖于专用算法。我们的结果表明,在遗忘任务中,学习到的策略将遗忘集准确率降至接近0%,同时保留集准确率保持在90%以上。在偏见缓解设置中,学习到的策略将偏见相关性能提高了5%以上,同时保持了一般分类效用。我们的发现表明,神经模型编辑可以转化为强化学习问题,从而可以从奖励反馈中学习编辑策略,而不是为每个任务手动设计。

英文摘要

Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where agents modify models using reward feedback. We introduce two environments: MaskWorld, where agents scale weights multiplicatively, and ShiftWorld, where agents apply additive weight updates. The reward function combines a utility-preservation objective with a task-specific editing objective, enabling agents to learn targeted modifications while maintaining overall model performance. We evaluate the framework on bias mitigation in text classification and machine unlearning in image classification, both of which traditionally rely on specialized algorithms. Our results show that the learned policies reduce forget set accuracy to nearly 0% while preserving over 90% retain set accuracy on the unlearning task. In the bias mitigation setting, the learned policies improve bias-related performance by more than 5% while maintaining general classification utility. Our findings show that neural model editing can be cast as a reinforcement learning problem, allowing editing policies to be learned from reward feedback rather than manually engineered for each task.

2606.13473 2026-06-12 cs.LG cs.AI cs.CL 新提交

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

MaxProof: 通过生成-验证器强化学习与群体级测试时扩展实现数学证明规模化

Jiacheng Chen, Xinyu Zhang, Shunkai Zhang, Yanmohan Wang, Lin Li, Tiancheng Qin, Qin Wang, Zhengmao Zhu, Tianle Li, Jingyang Li, Zehan Li, Binyang Jiang, Jin Zhu, Han Ding, Fei Yu, Chenyu Du, Zijian Song, Jiayuan Song, Zhi Zhang, Yunan Huang, Weiyu Cheng, Pengyu Zhao, Yu Cheng

发表机构 * MiniMax The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) Peking University(北京大学) Tsinghua University(清华大学)

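AI summary: Proposes MaxProof, a framework combining generative-verifier reinforcement learning with population-level test-time scaling for competition-level mathematical proof in the MiniMax-M3 series, exceeding the human gold-medal threshold on IMO 2025 and USAMO 2026.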

详情
AI中文摘要

我们提出了MaxProof,一个用于MiniMax-M3系列中竞赛级数学证明的群体级测试时扩展框架。M3首先使用为低误报率设计的深度防御生成验证器,训练三种面向证明的能力——证明生成、证明验证和基于批评的证明修复。这些能力被合并到单个发布的M3模型中。在测试时,MaxProof将模型视为生成器、验证器、精炼器和排序器,在候选证明群体中进行搜索,并通过锦标赛选择返回一个最终证明。通过MaxProof测试时扩展,M3模型在IMO 2025上达到35/42,在USAMO 2026上达到36/42,两者均超过了人类金牌阈值。

英文摘要

We present MaxProof, a population-level test-time scaling framework for competition-level mathematical proof in the MiniMax-M3 series. M3 first trains three proof-oriented capabilities -- proof generation, proof verification, and critique-conditioned proof repair -- using a defense-in-depth generative verifier engineered for low false-positive rate. These capabilities are merged into a single released M3 model. At test time, MaxProof treats the model as a generator, verifier, refiner, and ranker, searches over a population of candidate proofs, and returns one final proof through tournament selection. With MaxProof test-time scaling, the M3 model reaches 35/42 on IMO 2025 and 36/42 on USAMO 2026, exceeding the human gold-medal threshold on both.
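The final step in this abstract, returning one proof from a population through tournament selection, can be sketched as a single-elimination bracket driven by a pairwise comparator. The abstract does not say which tournament format MaxProof uses, so the bracket structure, the `prefer` callback, and the toy scores below are assumptions for illustration; in the described system the comparison would come from the model acting as ranker.

```python
def tournament_select(candidates, prefer):
    """Single-elimination tournament; prefer(a, b) returns the winner of one pairwise comparison."""
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        # Pair up neighbours and keep the winner of each pair.
        for i in range(0, len(pool) - 1, 2):
            next_round.append(prefer(pool[i], pool[i + 1]))
        # With an odd pool, the last candidate advances without a match (a bye).
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


# Hypothetical candidates: (label, quality score standing in for a ranker's judgment).
proofs = [("a", 1), ("b", 5), ("c", 3), ("d", 4), ("e", 2)]
winner = tournament_select(proofs, lambda x, y: x if x[1] >= y[1] else y)
print(winner)
```

With a consistent comparator, a bracket of n candidates needs only n - 1 pairwise comparisons, which is one reason a tournament is attractive when each comparison is a model call.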

2606.13076 2026-06-12 cs.MA cs.GT cs.LG 交叉投稿

$α$-fair heterogeneous agent reinforcement learning

$\alpha$-公平异质智能体强化学习

Yao-hua Franck Xu, Tayeb Lemlouma, Jean-Marie Bonnin, Arnaud Braud

发表机构 * Orange Innov(Orange创新)

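AI summary: Proposes a framework combining $\alpha$-fairness with Heterogeneous-Agent Trust Region Learning (HATRL), in which a fair advantage function dynamically weights agent utilities, ensuring monotonic improvement and convergence toward Nash equilibria and outperforming the HATRL algorithms in sequential social dilemmas.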

详情
AI中文摘要

多智能体系统中的合作通常通过功利主义目标进行优化,这些目标最大化整体效率但未能考虑奖励分配,常常导致不公平的“领导者-跟随者”动态。虽然基于公平的方法鼓励每个智能体从合作中受益的亲社会行为,但许多当前算法——包括那些利用奖励塑造的算法——破坏了马尔可夫博弈的平稳性或缺乏严格的理论保证。这在公平目标方法和理论上安全的学习框架之间造成了关键差距。我们提出了一种新颖的框架,将$\alpha$-公平性与异质智能体信任区域学习(HATRL)相结合,确保单调改进并收敛至纳什均衡。我们的方法利用一种公平优势函数,该函数根据智能体的期望回报动态加权其效用,使得全局目标能够根据参数$\alpha$从纯粹的功利主义效率过渡到$\alpha$-公平福利。我们引入了两种实用算法,$\alpha$-公平HATRPO和$\alpha$-公平HAPPO,并通过在CleanUp和CommonHarvest等顺序社会困境中的实验证明,从功利主义角度看,它们比HATRL算法表现更好,同时实现了更高的社会结果。

英文摘要

Cooperation in multi-agent systems is typically optimized through utilitarian objectives that maximize overall efficiency but fail to account for reward distribution, often resulting in inequitable "leader-follower" dynamics. While fairness-based approaches encourage pro-social behaviors where every agent benefits from cooperation, many current algorithms, including those utilizing reward shaping, break the stationarity of Markov Games or lack rigorous theoretical guarantees. This creates a critical gap between fair objective methods and theoretically safe learning frameworks. We propose a novel framework that bridges $α$-fairness with Heterogeneous-Agent Trust Region Learning (HATRL), ensuring monotonic improvement and convergence toward Nash Equilibria. Our approach leverages a fair advantage function that dynamically weights agent utilities based on their expected returns, allowing the global objective to transition from purely utilitarian efficiency to $α$-fairness welfare based on the parameter $α$. We introduce two practical algorithms, $α$-fair HATRPO and $α$-fair HAPPO, and demonstrate through experiments in sequential social dilemmas like CleanUp and CommonHarvest that they outperform the HATRL algorithms from a utilitarian point of view while also achieving better social outcomes.
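The $α$-fair welfare that this abstract refers to has a standard textbook form, sketched below for positive per-agent returns. Using the derivative of that welfare as a per-agent weight is one plausible reading of "dynamically weights agent utilities based on their expected returns"; it is an assumption here, and the paper's fair advantage function may be defined differently. Function names are hypothetical.

```python
import math


def alpha_fair_welfare(returns, alpha):
    """Standard alpha-fair welfare of positive returns.

    alpha = 0 is the utilitarian sum; alpha = 1 is proportional fairness (sum of logs).
    """
    if any(r <= 0 for r in returns):
        raise ValueError("alpha-fair welfare requires strictly positive returns")
    if abs(alpha - 1.0) < 1e-12:
        return sum(math.log(r) for r in returns)
    return sum(r ** (1.0 - alpha) / (1.0 - alpha) for r in returns)


def fairness_weights(returns, alpha):
    """Derivative of the welfare with respect to each return: r ** (-alpha).

    Agents with lower returns receive larger weights as alpha grows (assumed weighting scheme).
    """
    return [r ** (-alpha) for r in returns]


print(alpha_fair_welfare([1.0, 9.0], 1.0), alpha_fair_welfare([5.0, 5.0], 1.0))
print(fairness_weights([1.0, 4.0], 2.0))
```

Both return profiles in the first line have the same utilitarian total, but the equal split scores higher under proportional fairness, which is the behavior the parameter $α$ is meant to control.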

2606.13598 2026-06-12 cs.AI cs.CL cs.LG cs.MA 交叉投稿

Reward Modeling for Multi-Agent Orchestration

多智能体编排的奖励建模

King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Semih Yavuz, Shafiq Joty, Hao Wang

发表机构 * Rutgers University(罗杰斯大学) Salesforce AI Research(Salesforce人工智能研究)

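AI summary: Proposes OrchRM, a self-supervised framework that builds a reward model from intermediate artifacts of multi-agent executions without human annotation, enabling efficient orchestrator training and test-time scaling with improved performance and lower compute cost across multiple domains.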

Comments Preprint; work in progress

AI中文摘要

基于大型语言模型(LLM)的多智能体系统(MAS)需要有效的编排来协调专门化的智能体,然而训练这样的编排器受到有限监督和高计算成本的阻碍。我们提出了编排奖励建模(OrchRM),一种无需人工标注即可评估编排质量的自监督框架。OrchRM利用多智能体执行过程中的中间产物来构建Bradley-Terry奖励模型训练的胜负对。与现有的依赖昂贵子智能体展开的MAS测试时扩展和编排器训练框架不同,OrchRM直接在编排层面操作,实现了高效且高性能的奖励引导编排器训练和MAS测试时扩展。OrchRM在token使用上提高了高达10倍的训练效率,同时将MAS测试时扩展的准确率提升了高达8%。这些增益在多个领域(包括数学推理、基于网络的问答和多跳推理)中一致迁移,证明了编排级奖励建模作为鲁棒多智能体编排的可扩展方向。代码将在此https URL提供。

英文摘要

Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.
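The Bradley-Terry training objective named in this abstract can be written down in a few lines. The sketch below takes scalar reward-model scores for win-lose pairs and returns the mean negative log-likelihood; how OrchRM builds the pairs and what the reward model looks like are not covered, and the function names are hypothetical.

```python
import math


def bt_probability(score_win, score_lose):
    """Bradley-Terry probability that the 'win' item is preferred, given scalar scores."""
    return 1.0 / (1.0 + math.exp(-(score_win - score_lose)))


def bt_loss(pairs):
    """Mean negative log-likelihood over (score_win, score_lose) pairs."""
    return -sum(math.log(bt_probability(w, l)) for w, l in pairs) / len(pairs)


# Hypothetical reward-model scores for two win-lose orchestration pairs.
pairs = [(1.5, 0.2), (0.3, -0.4)]
print(bt_loss(pairs))
```

Minimizing this loss pushes the score of each winning orchestration above the score of its losing counterpart, which is all a pairwise reward model needs from the self-supervised pairs.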

2606.13604 2026-06-12 cs.AI cs.LG cs.MA 交叉投稿

Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

基于延迟市场反馈的多智能体强化学习在三方调度中的目标权重自适应

Haochen Wu, Yi Hou, Shiguang Xie

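Affiliations: DoorDash

AI summary: Presents a reinforcement learning system deployed at DoorDash that uses delayed signals to adapt dispatch objective weights, applying offline policy learning under noisy and coupled feedback to tune the tradeoff between delivery quality and batching efficiency.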

Comments Accepted at ICML 2026 Workshop on Reinforcement Learning from World Feedback (RLxF)

详情
AI中文摘要

三方市场中的调度为从世界反馈中进行强化学习提供了自然场景:决策通过延迟的操作结果(如配送速度、骑手利用率和商家拥堵)进行评估。我们介绍了DoorDash部署的一个强化学习系统,该系统利用延迟信号在大规模食品配送市场中自适应调整调度目标权重。该系统并非取代组合分配优化器,而是通过从记录的市场数据中学习的店铺级策略选择一个离散乘数,该乘数改变调度优化器在配送质量与批处理效率之间的权衡。这种接口使得在噪声、延迟和耦合反馈下进行离线策略学习成为可能,同时保留生产可行性约束和操作保障。我们使用集中式离线数据和分散式店铺级执行训练共享价值函数,采用双Q学习目标和保守正则化器以减少分布外价值高估。在生产切换实验中,离线训练的策略增加了批处理并减少了骑手侧时间成本,而不会降低面向客户的配送质量。结果展示了如何利用来自实时经济和物流系统的世界反馈安全地在线调整决策策略。

英文摘要

Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.
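The two training ingredients named in the abstract, Double Q-learning targets and a conservative regularizer, can be sketched as follows. The abstract does not say which regularizer is used; the CQL-style log-sum-exp gap below is an assumption, as are the example Q-values over three hypothetical multiplier choices.

```python
import math


def double_q_target(reward, gamma, q_online_next, q_target_next, done=False):
    """Double Q-learning: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


def conservative_penalty(q_values, logged_action):
    """CQL-style gap (assumed form): logsumexp over actions minus Q of the logged action."""
    m = max(q_values)
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[logged_action]


# Hypothetical Q-values for three discrete multiplier choices at the next state.
q_online_next = [1.0, 3.0, 2.0]
q_target_next = [0.5, 1.5, 4.0]

target = double_q_target(1.0, 0.9, q_online_next, q_target_next)
naive_target = 1.0 + 0.9 * max(q_target_next)  # single-network max, for comparison
penalty = conservative_penalty([0.0, 0.0, 0.0], logged_action=1)
```

Here the Double Q target (2.35) is lower than the single-network maximum (4.6), which is the overestimation it is meant to damp; the penalty is non-negative and shrinks as the logged action's value dominates the others.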

2606.13605 2026-06-12 math.OC cs.LG cs.SY eess.SY 交叉投稿

Distribution-Agnostic Robust Trajectory Optimization via Chance-Constrained Reinforcement Learning

基于机会约束强化学习的分布无关鲁棒轨迹优化

Yashdeep Chaudhary, Roberto Armellin, Harry Holt, Marco Sagliano

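Affiliations: Auckland University

AI summary: Proposes a distribution-agnostic robust trajectory-optimization framework that handles uncertainty in initial conditions and process noise through chance-constrained reinforcement learning, combining an offline nominal trajectory with a learned affine closed-loop correction; probabilistic feasibility and fuel efficiency are validated on two different trajectory design problems.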

Comments Preprint. 39 pages, 16 figures

详情
AI中文摘要

本文提出了一种基于机会约束强化学习的分布无关鲁棒轨迹优化框架。不确定性通过初始条件和过程噪声表示,唯一要求是能够对其进行采样。首先离线计算确定性标称轨迹,然后仅使用强化学习通过结构化仿射闭环校正律(包括前馈控制调整和时变反馈增益)来鲁棒化该基线。通过基于rollout的上尾分位数经验性地强制执行概率可行性,同时通过协方差可行性惩罚来调节终端分散性。该框架在两个性质不同的轨迹设计问题上进行了评估。主要案例研究是一个三维多脉冲地球-火星转移任务,其中学习策略在高斯不确定性下与最近的鲁棒轨迹优化参考进行基准比较,然后在有界均匀不确定性和训练期间未见的过程扰动下进行评估。第二个案例研究是一个随机大气精确火箭着陆问题,用于评估在具有阻力、质量消耗和下滑角约束的短时连续推力设置中的可移植性。结果表明,所提出的框架在保持概率可行性的同时,能够在上尾燃料成本方面保持竞争力,并且相同的鲁棒化框架可以跨异构航天器轨迹规划问题移植,而无需重新设计其核心随机控制结构。

英文摘要

This paper presents a distribution-agnostic robust trajectory-optimization framework based on chance-constrained reinforcement learning. The uncertainty is represented here through initial conditions and process noise, with the only requirement being that it can be sampled. A deterministic nominal trajectory is first computed offline, and reinforcement learning is then used only to robustify that baseline through a structured affine closed-loop correction law comprising a feedforward control adjustment and time-varying feedback gains. Probabilistic feasibility is enforced empirically through rollout-based upper-tail quantiles, while terminal dispersion is regulated through covariance-feasibility penalties. The framework is assessed on two materially different trajectory design problems. The flagship case study is a three-dimensional multi-impulse Earth-Mars transfer, where the learned policy is benchmarked against a recent robust trajectory-optimization reference under Gaussian uncertainty and then evaluated under bounded uniform uncertainty and under process disturbances not seen during training. The second case study is a stochastic atmospheric pinpoint rocket landing problem, used to assess portability to a short-horizon continuous-thrust setting with drag, mass depletion, and glide-slope constraints. The results show that the proposed framework can remain competitive in upper-tail fuel cost while preserving probabilistic feasibility, and that the same robustification scaffold can be carried across heterogeneous spacecraft trajectory planning problems without redesign of its core stochastic-control structure.
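As a rough illustration of the two mechanisms in the abstract, the sketch below applies an affine correction around a nominal trajectory and checks feasibility with an empirical upper-tail quantile over sampled rollouts. The scalar dynamics, noise levels, and gain values are hypothetical and far simpler than the paper's case studies; the only requirement mirrored here is that the uncertainty can be sampled.

```python
import math
import random


def affine_control(u_nom, du_ff, gain, x, x_nom):
    """Nominal control plus feedforward adjustment plus feedback on the deviation."""
    return u_nom + du_ff + gain * (x - x_nom)


def upper_quantile(samples, level):
    """Empirical quantile: smallest sample with at least `level` of the mass below it."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, max(0, math.ceil(level * len(ordered)) - 1))
    return ordered[idx]


def terminal_error(gain, steps, rng):
    """Toy scalar dynamics x <- x + u + w; the nominal state and control are zero."""
    x_nom = 0.0
    x = rng.gauss(0.0, 1.0)  # sampled initial-condition error
    for _ in range(steps):
        u = affine_control(0.0, 0.0, gain, x, x_nom)
        x = x + u + rng.gauss(0.0, 0.1)  # sampled process noise
    return abs(x)


rng = random.Random(0)
closed_loop = [terminal_error(-0.5, 10, rng) for _ in range(2000)]
open_loop = [terminal_error(0.0, 10, rng) for _ in range(2000)]

q_closed = upper_quantile(closed_loop, 0.95)
q_open = upper_quantile(open_loop, 0.95)
```

A chance constraint of the form "terminal error below a tolerance with 95% probability" would then be checked by comparing `q_closed` against that tolerance.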

2509.01630 2026-06-12 cs.LG cs.MA cs.RO cs.SY eess.SY 版本更新

DiffCoord: Differentiable Coordination for Distributed Multi-Agent Trajectory Optimization

DiffCoord: 分布式多智能体轨迹优化的可微协调

Bingheng Wang, Yichao Gao, Tianchen Sun, Shanker Ajay, Lin Zhao

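Affiliations: Department of Electrical and Computer Engineering, National University of Singapore

AI summary: Proposes DiffCoord, a framework that jointly optimizes the coupled parameters of a truncated ADMM-DDP pipeline through end-to-end meta-learning, generates them with agent-wise neural networks for task adaptation, and scales to varying agent counts. Validated on a cooperative aerial transport system, it cuts per-agent gradient computation time by up to 70% compared with existing methods.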

详情
AI中文摘要

将交替方向乘子法(ADMM)与微分动态规划(DDP)相结合,为分布式多智能体轨迹优化提供了一个可扩展的框架。在实践中,ADMM通常被截断以提高计算效率,这紧密耦合了原本分别控制协调质量和任务性能的参数。在本文中,我们提出了可微协调(DiffCoord),一个统一框架,联合元学习截断ADMM-DDP管道的这些耦合参数。这些参数由智能体神经网络生成以实现任务自适应,并且同构智能体之间共享相同的网络,从而能够扩展到不同数量的智能体。我们通过端到端微分ADMM-DDP管道实现了高效的元学习。值得注意的是,这产生了一个辅助的ADMM-LQR分布式梯度求解器,用于计算和协调关于这些参数的元梯度。该求解器继承了管道的计算结构,使得关键计算结果可以重用,并能够在智能体和轨迹时间线上高效并行化。我们通过协作空中运输系统的数值和物理实验验证了DiffCoord,该系统在狭窄空间中重新配置四旋翼编队以实现安全的六自由度负载操作。它能够鲁棒地适应变化的团队规模和负载动力学,同时与最先进的轨迹梯度方法相比,将每智能体梯度计算时间减少高达70%。

英文摘要

Integrating the Alternating Direction Method of Multipliers (ADMM) with Differential Dynamic Programming (DDP) provides a scalable framework for distributed multi-agent trajectory optimization. In practice, ADMM is typically truncated for computational efficiency, tightly coupling parameters that would otherwise separately govern coordination quality and task performance. In this paper, we propose Differentiable Coordination (DiffCoord), a unified framework that jointly meta-learns these coupled parameters for the truncated ADMM-DDP pipeline. These parameters are generated by agent-wise neural networks for task adaptation, and the same networks are shared among isomorphic agents to enable scalability to varying agent counts. We achieve efficient meta-learning by differentiating the ADMM-DDP pipeline end-to-end. Notably, this yields an auxiliary ADMM-LQR distributed gradient solver that computes and coordinates meta-gradients with respect to these parameters. This solver inherits the computational structure of the pipeline, enabling reuse of key computation results and efficient parallelization over agents and along trajectory horizons. We validate DiffCoord through numerical and physical experiments on a cooperative aerial transport system, where it reconfigures quadrotor formations for safe 6-DoF load manipulation in tight spaces. It adapts robustly to varying team sizes and load dynamics, while reducing per-agent gradient computation time by up to 70% compared with state-of-the-art trajectory-gradient methods.
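To see why truncating ADMM couples its parameters to the result, consider a toy scalar consensus problem (a stand-in, not the paper's ADMM-DDP pipeline): each agent wants its local variable near its own target, and all agents must agree. The penalty parameter `rho` and the problem data below are hypothetical.

```python
def truncated_consensus_admm(targets, rho, iterations):
    """Scaled-form consensus ADMM for min sum_i 0.5*(x_i - a_i)^2 subject to x_i = z."""
    n = len(targets)
    z = 0.0
    duals = [0.0] * n
    for _ in range(iterations):
        # Local step: each agent solves its own small quadratic problem.
        local = [(a + rho * (z - u)) / (1.0 + rho) for a, u in zip(targets, duals)]
        # Coordination step: average the local proposals plus scaled duals.
        z = sum(x + u for x, u in zip(local, duals)) / n
        # Dual step: accumulate each agent's disagreement with the consensus.
        duals = [u + x - z for u, x in zip(duals, local)]
    return z


targets = [0.0, 4.0]
z_converged = truncated_consensus_admm(targets, rho=1.0, iterations=60)
z_truncated_rho1 = truncated_consensus_admm(targets, rho=1.0, iterations=3)
z_truncated_rho4 = truncated_consensus_admm(targets, rho=4.0, iterations=3)
```

Run to convergence, the answer is the mean of the targets regardless of `rho`; stopped after three iterations, the output depends on `rho`, which is the kind of coupling that motivates learning such parameters jointly.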

2603.11395 2026-06-12 cs.LG cs.AI 版本更新

ARROW: Augmented Replay for RObust World models

ARROW:增强重放用于鲁棒世界模型

Abdulaziz Alyahya, Abdallah Al Siyabi, Markus R. Ernst, Luke Yang, Levin Kuhlmann, Gideon Kowadlo

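Affiliations: Imam Mohammad Ibn Saud Islamic University (IMSIU); Monash University; University of New South Wales, Sydney; Cerenaut

AI summary: Proposes ARROW, a model-based continual reinforcement learning algorithm whose memory-efficient replay buffer reduces catastrophic forgetting on tasks without shared structure while keeping forward transfer comparable on tasks with shared structure.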

Comments 36 pages and 11 figures (includes Appendix)

详情
Journal ref
Transactions on Machine Learning Research, 2026
AI中文摘要

持续强化学习挑战智能体在获取新技能的同时保留已学习技能,以提高过去和未来任务的性能。大多数现有方法依赖于无模型方法和重放缓冲区来缓解灾难性遗忘;然而,这些解决方案往往面临显著的可扩展性挑战,因为内存需求大。受神经科学启发,其中大脑将经验重放给预测世界模型而不是直接重放到策略中,我们提出了ARROW(增强重放用于鲁棒世界模型),一种扩展DreamerV3的基于模型的持续RL算法,具有内存高效、分布匹配的重放缓冲区。与标准固定大小的FIFO缓冲区不同,ARROW维护两个互补的缓冲区:一个短期缓冲区用于近期经验,一个长期缓冲区通过智能采样保留任务多样性。我们在两个具有挑战性的持续RL设置中评估了ARROW:无共享结构任务(Atari)和有共享结构任务(Procgen CoinRun变体)。与相同大小的无模型和基于模型的基线方法相比,ARROW在无共享结构任务中表现出显著减少的遗忘,同时保持可比的前向转移。我们的发现突显了基于模型的RL和生物启发方法在持续强化学习中的潜力,值得进一步研究。

英文摘要

Continual reinforcement learning challenges agents to acquire new skills while retaining previously learned ones, with the goal of improving performance in both past and future tasks. Most existing approaches rely on model-free methods with replay buffers to mitigate catastrophic forgetting; however, these solutions often face significant scalability challenges due to large memory demands. Drawing inspiration from neuroscience, where the brain replays experiences to a predictive World Model rather than directly to the policy, we present ARROW (Augmented Replay for RObust World models), a model-based continual RL algorithm that extends DreamerV3 with a memory-efficient, distribution-matching replay buffer. Unlike standard fixed-size FIFO buffers, ARROW maintains two complementary buffers: a short-term buffer for recent experiences and a long-term buffer that preserves task diversity through intelligent sampling. We evaluate ARROW on two challenging continual RL settings: tasks without shared structure (Atari), and tasks with shared structure, where knowledge transfer is possible (Procgen CoinRun variants). Compared to model-free and model-based baselines with replay buffers of the same size, ARROW demonstrates substantially less forgetting on tasks without shared structure, while maintaining comparable forward transfer. Our findings highlight the potential of model-based RL and bio-inspired approaches for continual reinforcement learning, warranting further research.
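The two-buffer idea can be sketched with a FIFO short-term buffer and a long-term buffer filled by reservoir sampling. Reservoir sampling is an assumption here, used as one simple way to keep a sample that matches the distribution of everything seen so far; the abstract only says "intelligent sampling". The class name and capacities are hypothetical.

```python
import random
from collections import deque


class DualReplay:
    """Short-term FIFO buffer plus a long-term reservoir over the whole stream."""

    def __init__(self, short_capacity, long_capacity, seed=0):
        self.short = deque(maxlen=short_capacity)  # oldest items fall out first
        self.long = []
        self.long_capacity = long_capacity
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.short.append(item)
        self.seen += 1
        if len(self.long) < self.long_capacity:
            self.long.append(item)
        else:
            # Reservoir sampling: every item seen so far is kept with equal probability.
            slot = self.rng.randrange(self.seen)
            if slot < self.long_capacity:
                self.long[slot] = item

    def sample(self, n):
        pool = list(self.short) + self.long
        return [self.rng.choice(pool) for _ in range(n)]


buffer = DualReplay(short_capacity=100, long_capacity=100)
for step in range(1000):
    buffer.add(("task_A", step))
for step in range(1000):
    buffer.add(("task_B", step))

short_tasks = {task for task, _ in buffer.short}
long_tasks = {task for task, _ in buffer.long}
```

After the switch to the second task, the short-term buffer holds only recent `task_B` data while the long-term buffer still holds a mix of both tasks, which is what lets replay counteract forgetting at a fixed memory budget.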

2603.12530 2026-06-12 cs.LG 版本更新

Mixing Makes Markovian Contexts Cheap for Linear Bandits

混合使得马尔可夫上下文在线性赌博机中变得廉价

Kaan Buyukkalayci, Osama Hanna, Christina Fragouli

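Affiliations: University of California, Los Angeles; Meta, Superintelligence Lab

AI summary: For linear bandits with Markovian contexts, proposes a reduction based on uniform geometric ergodicity that builds a stationary surrogate action set and uses a delayed-update scheme, achieving worst-case regret bounds that match those of standard linear bandits in sufficiently fast mixing regimes.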

详情
AI中文摘要

最近的研究表明,当上下文是独立同分布时,线性上下文赌博机可以简化为单上下文线性赌博机。这种“上下文廉价”的视角非常有利,因为它允许更精确的有限时间分析,并利用线性赌博机文献中的成熟技术,例如针对错误规范和对抗性腐败的技术。然而,这种约简关键依赖于上下文的独立性,并不适用于时间相关(例如马尔可夫)的上下文设置,而这种设置在现实中经常出现。受时间相关可用性应用的启发,我们将这一视角扩展到具有马尔可夫上下文过程的线性赌博机,其中动作集通过外生马尔可夫链演化。我们的主要贡献是在均匀几何遍历性条件下的一种约简。我们构建了一个平稳替代动作集,使用标准线性赌博机预言机来解决问题,并采用延迟更新方案来控制由非平稳条件上下文分布引起的偏差。我们进一步为未知平稳分布提供了一种分阶段算法,该算法在线学习替代映射。在两种设置中,我们在足够快的混合区域获得了与底层线性赌博机预言机相匹配的高概率最坏情况遗憾界。然后,我们在一个真实世界实例上验证了我们的结果,展示了相对于LinUCB基线的实际改进。

英文摘要

Recent work shows that when contexts are drawn i.i.d., linear contextual bandits can be reduced to single-context linear bandits. This "contexts are cheap" perspective is highly advantageous, as it allows for sharper finite-time analyses and leverages mature techniques from the linear bandit literature, such as those for misspecification and adversarial corruption. However, this reduction crucially relies on the independence of contexts and does not extend to settings with temporally correlated (e.g., Markovian) contexts, which arise frequently in practice. Motivated by applications with temporally correlated availability, we extend this perspective to linear bandits with Markovian context processes, where the action set evolves via an exogenous Markov chain. Our main contribution is a reduction that applies under uniform geometric ergodicity. We construct a stationary surrogate action set to solve the problem using a standard linear bandit oracle, employing a delayed-update scheme to control the bias induced by the nonstationary conditional context distributions. We further provide a phased algorithm for unknown stationary distributions that learns the surrogate mapping online. In both settings, we obtain a high-probability worst-case regret bound matching that of the underlying linear bandit oracle in sufficiently fast mixing regimes. We then validate our results on a real-world instance, where we show practical gains over a LinUCB baseline.
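Geometric ergodicity, the condition the reduction relies on, is easy to see on a two-state chain: the distance to the stationary distribution shrinks by a constant factor per step, so waiting a short delay makes the conditional context distribution nearly stationary. The sketch below checks this numerically; the transition probabilities and tolerance are hypothetical, and the delay computed here only illustrates the idea behind a delayed update, not the paper's scheme.

```python
def step(dist, transition):
    """One step of the chain: row vector times transition matrix."""
    n = len(transition)
    return [sum(dist[i] * transition[i][j] for i in range(n)) for j in range(n)]


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


a, b = 0.3, 0.2  # switching probabilities of a two-state context chain
transition = [[1 - a, a], [b, 1 - b]]
stationary = [b / (a + b), a / (a + b)]

dist = [1.0, 0.0]  # start deterministically in state 0
gaps = []
for t in range(1, 11):
    dist = step(dist, transition)
    gaps.append(tv_distance(dist, stationary))

# Closed form for this chain: the gap decays like |1 - a - b|^t.
predicted = [a / (a + b) * abs(1 - a - b) ** t for t in range(1, 11)]

# Smallest delay after which the context is within 1e-3 of stationarity.
delay = next(t for t, gap in enumerate(gaps, start=1) if gap <= 1e-3)
```

With these numbers the gap halves every step, so a delay of 10 steps already brings the chain within 0.001 of its stationary distribution.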

2604.08958 2026-06-12 cs.LG cs.AI cs.RO 版本更新

WOMBET: World Model-Based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

WOMBET:基于世界模型的经验迁移实现鲁棒且样本高效的强化学习

Mintae Kim, Koushil Sreenath

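Affiliations: Hybrid Robotics, UC Berkeley

AI summary: Proposes WOMBET, a framework that learns a world model in the source task, generates uncertainty-penalized offline data, and then fine-tunes online with adaptive sampling, for robust and sample-efficient transfer in reinforcement learning.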

Comments 13 pages, 6 figures, 8th Annual Learning for Dynamics & Control Conference (L4DC)

详情
AI中文摘要

机器人领域的强化学习通常受限于数据收集的成本和风险,因此需要从源任务向目标任务进行经验迁移。离线到在线强化学习利用先验数据,但通常假设给定固定数据集,并未解决如何生成可靠数据进行迁移的问题。我们提出基于世界模型的经验迁移(WOMBET)框架,该框架联合生成和利用先验数据。WOMBET在源任务中学习世界模型,并通过不确定性惩罚规划生成离线数据,随后筛选出高回报和低认知不确定性的轨迹。然后,它通过在离线数据和在线数据之间进行自适应采样,在目标任务中进行在线微调,实现了从先验驱动的初始化到任务特定适应的稳定过渡。我们证明了不确定性惩罚目标提供了真实回报的下界,并推导了有限样本误差分解,捕捉了分布不匹配和近似误差。实验上,WOMBET在连续控制基准测试中相比强基线提高了样本效率和最终性能,展示了联合优化数据生成和迁移的益处。

英文摘要

Reinforcement learning (RL) in robotics is often limited by the cost and risk of data collection, motivating experience transfer from a source task to a target task. Offline-to-online RL leverages prior data but typically assumes a given fixed dataset and does not address how to generate reliable data for transfer. We propose World Model-Based Experience Transfer (WOMBET), a framework that jointly generates and utilizes prior data. WOMBET learns a world model in the source task and generates offline data via uncertainty-penalized planning, followed by filtering trajectories with high return and low epistemic uncertainty. It then performs online fine-tuning in the target task using adaptive sampling between offline and online data, enabling a stable transition from prior-driven initialization to task-specific adaptation. We show that the uncertainty-penalized objective provides a lower bound on the true return and derive a finite-sample error decomposition capturing distribution mismatch and approximation error. Empirically, WOMBET improves sample efficiency and final performance over strong baselines on continuous control benchmarks, demonstrating the benefit of jointly optimizing data generation and transfer.
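The three steps in the abstract (uncertainty-penalized planning, trajectory filtering, and adaptive sampling between offline and online data) can be sketched with simple stand-ins. Everything concrete below is an assumption: ensemble disagreement as the uncertainty estimate, the thresholds, and the linear schedule are illustrative choices, not WOMBET's actual design.

```python
import statistics


def penalized_reward(ensemble_predictions, penalty_weight):
    """Mean predicted reward minus a penalty on ensemble disagreement (assumed proxy)."""
    mean = statistics.fmean(ensemble_predictions)
    spread = statistics.pstdev(ensemble_predictions)
    return mean - penalty_weight * spread


def keep_trajectory(total_return, max_uncertainty, min_return, uncertainty_cap):
    """Filter: keep only high-return, low-uncertainty trajectories."""
    return total_return >= min_return and max_uncertainty <= uncertainty_cap


def offline_fraction(step, total_steps, start=0.9, end=0.1):
    """Hypothetical linear schedule for the share of offline data in each batch."""
    progress = min(1.0, max(0.0, step / total_steps))
    return start + (end - start) * progress


confident = penalized_reward([1.0, 1.0, 1.0], penalty_weight=2.0)
uncertain = penalized_reward([0.0, 2.0], penalty_weight=1.0)
kept = keep_trajectory(total_return=120.0, max_uncertainty=0.2,
                       min_return=100.0, uncertainty_cap=0.5)
schedule = [offline_fraction(s, 100) for s in (0, 50, 100)]
```

Both example ensembles have the same mean reward, but the one that disagrees is scored lower, which is how a penalized objective stays pessimistic where the world model is unsure.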

2401.08301 2026-06-12 eess.SP cs.LG cs.SY eess.SY 版本更新

QoS Improvement in Multi User Cellular-Symbiotic Radio Network Assisted by Active-STAR-RIS

基于有源同步透射反射智能超表面的多用户蜂窝共生无线电网络中的QoS改进

Rahman Saadat Yeganeh, Mohammad Javad Omidi, Farshad Zeinali, Mohammad Robat Mili, Mohammad Ghavami

Affiliations: Department of Electrical and Computer Engineering, Isfahan University of Technology; Department of Electronics and Communication Engineering, Kuwait College of Science and Technology; The Pasargad Institute for Advanced Innovative Solutions (PIAIS); Electrical and Electronic Engineering Department, London South Bank University

AI summary: Uses an active simultaneously transmitting and reflecting reconfigurable intelligent surface (ASRIS) to improve quality of service in 6G cellular networks, optimizing beamforming, phase adjustments, and scheduling parameters with deep reinforcement learning to maximize throughput between symbiotic backscatter devices and users.

Comments This article will be submitted to the Transactions journal

详情
AI中文摘要

在本文中,我们采用有源同步透射反射可重构智能表面(ASRIS)来增强6G蜂窝网络服务的质量。该网络集成了共生无线电(CSR)子系统,以促进无源物联网(IoT)用户与有源用户之间的通信,分别称为共生反向散射设备(SBD)和共生用户设备(SUE)。由于SBD是无源的,向SUE传输信息面临重大挑战。为克服这一挑战,我们利用基站(BS)内大规模多输入多输出(MIMO)天线的能力,以更大的功率中继SBD传输的信息。该方案采用非正交多址(NOMA)技术实现所有用户的多址接入,并使用连续干扰消除(SIC)消除潜在干扰。主要目标是最大化SBD与SUE之间的吞吐量。为此,我们构建了一个优化问题,涉及BS和ASRIS处的有源波束成形系数、ASRIS的相位调整以及CSR与蜂窝网络之间的调度参数。为解决该优化问题,我们使用了三种深度强化学习(DRL)方法:近端策略优化(PPO)、双延迟深度确定性策略梯度(TD3)和异步优势演员-评论家(A3C)。对这些方法进行了仿真,结果表明A3C、TD3和PPO分别具有最快的收敛速度并实现了最高的网络吞吐量增长。最后,使用无源同步透射反射RIS(STAR-RIS)对所提方案进行了评估,其性能劣于ASRIS。

英文摘要

In this article, we employ active simultaneously transmitting and reflecting reconfigurable intelligent surfaces (ASRIS) to enhance the quality of 6G cellular network services. The network integrates commensal symbiotic radio (CSR) subsystems to facilitate communication between passive Internet of Things (IoT) users and active users, referred to as symbiotic backscatter devices (SBDs) and symbiotic user equipments (SUEs), respectively. Since the SBDs are passive, transmitting information to the SUEs poses significant challenges. To overcome this challenge, we harness the capabilities of massive multiple input multiple output (MIMO) antennas within the base station (BS) to relay the information transmitted by SBDs with greater power. This scheme uses the non-orthogonal multiple access (NOMA) technique for multiple access among all users, and potential interferences are eliminated using successive interference cancellation (SIC). The primary objective is to maximize the throughput between SBDs and SUEs. To achieve this, we formulate an optimization problem involving variables such as active beamforming coefficients at the BS and ASRIS, phase adjustments of ASRIS, and scheduling parameters between CSR and cellular networks. To solve this optimization problem, we used three deep reinforcement learning (DRL) methods: proximal policy optimization (PPO), twin delayed deep deterministic policy gradient (TD3), and asynchronous advantage actor critic (A3C). These methods were simulated, and the results demonstrate that A3C, TD3, and PPO have the best convergence speeds and achieve the highest increases in network throughput, respectively. Finally, the proposed scheme was evaluated using passive simultaneously transmitting and reflecting RIS (STAR-RIS), which demonstrated poorer performance compared to ASRIS.

2602.04208 2026-06-12 cs.RO cs.AI cs.LG 版本更新

SCALE: Self-uncertainty Conditioned Adaptive Looking and Execution for Vision-Language-Action Models

SCALE: 基于自不确定性条件自适应观察与执行的视觉-语言-动作模型

Hyeonbeom Choi, Daechul Ahn, Youhan Lee, Taewook Kang, Seongwon Cho, Jonghyun Choi

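Affiliations: Seoul National University

AI summary: Proposes SCALE, an inference strategy that uses self-uncertainty to jointly modulate visual perception and action, requiring no additional training or verifier and only a single forward pass, and improving the robustness of VLA models in simulated and real-world settings.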

Comments ICML 2026 Spotlight. Project page: https://dcahn12.github.io/projects/scale/

详情
AI中文摘要

视觉-语言-动作(VLA)模型已成为通用机器人控制的一种有前景的范式,测试时缩放(TTS)在增强训练外鲁棒性方面受到关注。然而,现有的VLA TTS方法需要额外训练、验证器和多次前向传播,使其部署不切实际。此外,它们仅干预动作解码,而保持视觉表示固定——在感知模糊的情况下不足,此时重新考虑如何感知与决定做什么同样重要。为解决这些限制,我们提出SCALE,一种简单的推理策略,基于“自不确定性”联合调节视觉感知和动作,受主动推理理论中不确定性驱动探索的启发——无需额外训练、无需验证器,且仅需单次前向传播。SCALE在高不确定性下拓宽感知和动作的探索,而在自信时聚焦于利用——实现在不同条件下的自适应执行。在模拟和真实世界基准上的实验表明,SCALE改进了最先进的VLA模型,并优于现有TTS方法,同时保持单次前向传播的效率。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for general-purpose robotic control, with test-time scaling (TTS) gaining attention to enhance robustness beyond training. However, existing TTS methods for VLAs require additional training, verifiers, and multiple forward passes, making them impractical for deployment. Moreover, they intervene only at action decoding while keeping visual representations fixed, which is insufficient under perceptual ambiguity, where reconsidering how to perceive is as important as deciding what to do. To address these limitations, we propose SCALE, a simple inference strategy that jointly modulates visual perception and action based on 'self-uncertainty', inspired by uncertainty-driven exploration in Active Inference theory. It requires no additional training, no verifier, and only a single forward pass. SCALE broadens exploration in both perception and action under high uncertainty, while focusing on exploitation when confident, enabling adaptive execution across varying conditions. Experiments on simulated and real-world benchmarks demonstrate that SCALE improves state-of-the-art VLAs and outperforms existing TTS methods while maintaining single-pass efficiency.

4. Generative Models and Probabilistic Modeling (27 papers)

2606.12494 2026-06-12 cs.LG 新提交

Net-Ev$^2$: A Generative Simulator for Network Event Evolution

Net-Ev$^2$:网络事件演化的生成式模拟器

Guangyu Wang, Zhaonan Wang

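Affiliations: NYU Shanghai

AI summary: Proposes Net-Ev$^2$, a generative simulator that combines event cues with network topology, simulating network event evolution through structure-guided masked pre-training and a topology-aware diffusion process, with state-of-the-art performance on multiple road-network datasets.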

Comments Accepted by KDD 2026 Research Track

详情
Journal ref
In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)
AI中文摘要

减少现实世界的试错一直是决策的核心目标,生成式模拟器通过建模未来状态的演化推进了这一目标。一个更具挑战性且更有意义的任务是模拟扰动事件(如事故)如何通过网络传播其影响。现有方法在模拟网络事件演化时,未能同时建模事件的结构化属性和非结构化语义,也未能捕捉拓扑结构。因此,我们提出Net-Ev$^2$($\underline{\textbf{Net}}$work $\underline{\textbf{Ev}}$ent $\underline{\textbf{Ev}}$olution),一种新颖的生成式模拟器,在模拟中联合利用事件线索并保留网络拓扑。具体而言,该框架包含两个阶段:结构引导的掩码预训练和拓扑感知扩散过程,后者通过类似U-Net的图下采样和上采样实现去噪。在推理时,Net-Ev$^2$仅需自然语言事件输入即可生成模拟,具有更大的实际使用灵活性。此外,我们引入了Net-Ev$^2$-6.5M,一个跨四个大规模道路网络的对齐事件和网络流量数据的多模态基准,以及一个新的拓扑感知指标JL-MMD,用于评估生成网络动态的拓扑保真度。大量实验证明了Net-Ev$^2$的最优性能和强泛化能力。代码已开源。

英文摘要

Reducing real-world trial and error has long been a central goal of decision making, and generative simulators advance this goal by modeling the evolution of future states. An even more challenging yet meaningful task is simulating how disturbance events (e.g., accidents) propagate their impacts across real-world networks. The existing approaches fall short of modeling both structured attributes and unstructured semantics of events, and capturing topological structures in simulating network event evolution. Therefore, we are motivated to propose Net-Ev$^2$ ($\underline{\textbf{Net}}$work $\underline{\textbf{Ev}}$ent $\underline{\textbf{Ev}}$olution), a novel generative simulator that jointly leverages event cues while preserving network topology in simulations. Specifically, the framework consists of two stages, namely structure-guided masked pre-training and topology-aware diffusion process, which is achieved by U-Net-like graph downsampling and upsampling during denoising. At inference time, Net-Ev$^2$ can generate simulations using natural-language event input only, with greater flexibility for practical usage. Furthermore, we introduce Net-Ev$^2$-6.5M, a multimodal benchmark of aligned event and network traffic data across four large-scale road networks, as well as a new topology-aware metric, namely JL-MMD, to evaluate topological fidelity in generated network dynamics. Extensive experiments demonstrate the state-of-the-art performance and strong generalization ability of Net-Ev$^2$. Code is made available at https://github.com/Guangyu4/Net-Ev-2.

2606.12710 2026-06-12 cs.LG math.OC 新提交

A Stabilized Path-Space Approach to Diffusion-Based Posterior Sampling

一种稳定的路径空间方法用于基于扩散的后验采样

Evan Scope Crafts, Umberto Villa, Saviz Mowlavi, Yanting Ma, Hassan Mansour, Wael H. Ali

Affiliations: Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin; Mitsubishi Electric Research Laboratories (MERL); Department of Biomedical Engineering, The University of Texas at Austin

AI summary: Proposes a stabilized path-space framework that uses stochastic optimal control and trust-region optimization to achieve accurate and robust posterior sampling in nonlinear inverse problems.

详情
AI中文摘要

扩散模型为贝叶斯逆问题提供了表达性数据驱动先验,但许多扩散后验采样器依赖启发式引导近似,可能对非线性算子和多模态后验失效。本文开发了一种稳定的路径空间框架用于基于扩散的后验采样。从终端边际代表先验的基础扩散过程出发,我们定义了轨迹上的似然加权目标测度,并将后验采样转化为学习一个路径测度匹配该目标的受控随机过程。该公式将扩散后验采样与随机最优控制联系起来,同时保留了不确定性量化所需的贝叶斯结构。我们引入了一种时间重参数化,通过消除未知初始值函数引起的偏差,使路径空间控制问题适定,无需辅助训练。然后通过具有对数方差目标的信任域路径空间优化方法学习控制。路径空间视角还统一了我们的学习控制方法与现有的基于引导的采样器,量化了近似控制引起的采样误差,并产生了用于渐近精确后验期望的重要性采样校正。我们在具有解析表征或高质量参考后验的基准逆问题套件上评估了所提出的框架,从而实现了对采样精度和不确定性量化的原则性评估。这些实验深入揭示了基于扩散的后验采样器的行为,并证明了相比领先方法更高的准确性和鲁棒性。

英文摘要

Diffusion models provide expressive data-driven priors for Bayesian inverse problems, but many diffusion posterior samplers rely on heuristic guidance approximations that can fail for nonlinear operators and multimodal posteriors. In this work, we develop a stabilized path-space framework for diffusion-based posterior sampling. Starting from a base diffusion process whose terminal marginal represents the prior, we define a likelihood-weighted target measure on trajectories and cast posterior sampling as learning a controlled stochastic process whose path measure matches this target. This formulation connects diffusion posterior sampling to stochastic optimal control while preserving the Bayesian structure needed for uncertainty quantification. We introduce a time reparameterization that makes the path-space control problem well posed by removing the bias induced by the unknown initial value function, without auxiliary training. We then learn the control via a trust-region path-space optimization method with log-variance objectives. The path-space perspective also unifies our learned control approach with existing guidance-based samplers, quantifies the sampling error induced by approximate controls, and yields importance sampling corrections for asymptotically exact posterior expectations. We evaluate the proposed framework on a suite of benchmark inverse problems with analytically characterized or high-quality reference posteriors, enabling principled assessment of sampling accuracy and uncertainty quantification. These experiments provide insight into the behavior of diffusion-based posterior samplers and demonstrate improved accuracy and robustness over leading approaches.
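Two quantities from the abstract have compact generic forms: a log-variance objective over sampled trajectories, and an importance-sampling correction for posterior expectations. The sketch below shows the textbook versions (variance of log density ratios, and self-normalized importance sampling); how the paper computes the path-space log-ratios is not reproduced here, so the inputs are hypothetical numbers.

```python
import math


def log_variance_loss(log_ratios):
    """Variance of log(target/proposal) over samples; zero when the ratio is constant."""
    mean = sum(log_ratios) / len(log_ratios)
    return sum((x - mean) ** 2 for x in log_ratios) / len(log_ratios)


def snis_expectation(values, log_weights):
    """Self-normalized importance sampling estimate of an expectation."""
    m = max(log_weights)  # subtract the max for numerical stability
    weights = [math.exp(lw - m) for lw in log_weights]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


# A perfectly matched sampler gives constant log-ratios, hence zero loss.
matched_loss = log_variance_loss([3.0, 3.0, 3.0])
mismatched_loss = log_variance_loss([0.0, 2.0])

# Two samples; the second is three times as likely under the target as proposed.
corrected_mean = snis_expectation([1.0, 3.0], [0.0, math.log(3.0)])
uncorrected_mean = snis_expectation([1.0, 3.0], [0.0, 0.0])
```

Because the loss only measures variation in the log-ratio, it does not need the unknown normalizing constant, and the same log-ratios can be reused as importance weights to correct an imperfect learned control.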

2606.13191 2026-06-12 cs.LG 新提交

The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics

生成动力学中相变的几何:投影焦散视角

Ryosuke Sakamoto, Kotaro Sakamoto

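Affiliations: Institute for the Advanced Study of Human Biology, Institute for Advanced Study, Kyoto University; Graduate School of Engineering, The University of Tokyo

AI summary: Explains phase-transition-like behaviour in generative dynamics through the geometry of projection caustics, and proposes the Critical Boundary Detector (CBD) to diagnose score-direction instability, localise mode commitment, and support control in sensitive regions.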

详情
AI中文摘要

连续状态生成采样器(包括扩散和流匹配模型)通过连续逆时间动力学演化,但其样本经常经历突然的定性变化:轨迹承诺于模式,语义替代坍缩,窄时间窗口内的小扰动可产生大的下游效应。本文对这种相变般行为进行了几何解释。我们将去噪视为自由能景观上的梯度下降,并表明尖锐转变出现在投影焦散附近,此时数据支撑上的最近点投影不再唯一。受此视角启发,我们引入临界边界检测器(CBD)作为分数方向不稳定性的实用诊断工具。在玩具模型、标准扩散模型和潜在文本到图像扩散模型中,CBD定位了模式承诺,预测了干预敏感窗口,并支持几何敏感区域中的目标控制。我们的结果连接了数据的几何与扩散生成的动力学。

英文摘要

Continuous-state generative samplers, including diffusion and flow-matching models, evolve through continuous reverse-time dynamics, yet their samples often undergo abrupt qualitative changes: trajectories commit to modes, semantic alternatives collapse, and small perturbations in narrow time windows can produce large downstream effects. This paper develops a geometric account of such phase-transition-like behaviour. We view denoising as gradient descent on a free energy landscape and show that sharp transitions arise near projection caustics, where the nearest-point projection onto the data support ceases to be unique. Motivated by this perspective, we introduce the Critical Boundary Detector (CBD) as a practical diagnostic for score-direction instability. Across toy models, standard diffusion models, and latent text-to-image diffusion models, CBD localises mode commitment, predicts intervention-sensitive windows, and supports targeted control in geometrically sensitive regions. Our results connect the geometry of data with the dynamics of diffusion generation.
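A two-point toy model makes the non-unique projection concrete. Assume data supported on {-1, +1} with equal weight and a noisy observation x = x0 + sigma * noise (a variance-exploding setup chosen for illustration, not taken from the paper). The ideal denoiser is then tanh(x / sigma^2), and the midpoint x = 0, where the nearest data point is ambiguous, is exactly where its sensitivity blows up as the noise level drops.

```python
import math


def posterior_mean(x, sigma):
    """E[x0 | x] for x0 uniform on {-1, +1} and Gaussian noise of scale sigma."""
    return math.tanh(x / sigma**2)


def sensitivity(x, sigma):
    """Derivative of the posterior mean with respect to x."""
    return (1.0 - math.tanh(x / sigma**2) ** 2) / sigma**2


# At high noise the denoiser barely reacts to a small nudge around the midpoint.
high_noise = [posterior_mean(x, 2.0) for x in (-0.05, 0.05)]

# At low noise the same nudge commits the sample to one mode or the other.
low_noise = [posterior_mean(x, 0.1) for x in (-0.05, 0.05)]

midpoint_gain_high = sensitivity(0.0, 2.0)  # 1 / sigma^2 = 0.25
midpoint_gain_low = sensitivity(0.0, 0.1)   # 1 / sigma^2 = 100
```

The sensitivity at the midpoint is 1 / sigma^2, so the window in which a perturbation can change the outcome narrows sharply as denoising proceeds, consistent with the abrupt mode commitment the abstract describes.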

2606.13240 2026-06-12 cs.LG cs.AI cs.CV stat.ME stat.ML 新提交

Towards More General Control of Diffusion Models Using Jeffrey Guidance

使用 Jeffrey 引导实现扩散模型的更通用控制

Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Jes Frellsen, Pierre-Alexandre Mattei

Affiliations: Inria, CNRS, I3S, Maasai Université Côte d’Azur; Technical University of Denmark; Inria, CNRS, LJAD, Maasai Université Côte d’Azur

AI summary: Proposes Jeffrey guidance, a framework that updates marginal distributions via Jeffrey's rule of conditioning and extends diffusion-model control to applications that standard guidance cannot express; it substantially reduces FID on CIFAR-10 and FFHQ and is applied to fairness control on CelebA-HQ.

详情
AI中文摘要

扩散模型的一个关键优势在于其灵活性,因为其输出可以在采样时通过引导进行控制。然而,除了条件采样等简单情况外,目标分布通常隐含地定义,仅通过采样规则或启发式能量函数给出。为了解决这个问题,我们提出了 Jeffrey 引导,这是一个原则性框架,将扩散模型控制扩展到标准引导无法表达的应用。它利用 Jeffrey 条件规则将边际分布更新到指定的目标,保持条件结构并最小化对联合分布的扰动。我们首先通过针对指定的嵌入分布来演示 Jeffrey 引导。以 Inception 嵌入为目标,这导致在 CIFAR-10 和 FFHQ 上 FID 显著降低。我们进一步将 Jeffrey 引导应用于 CelebA-HQ 上的公平性,更新无条件扩散模型以强制属性之间的独立性。

英文摘要

A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. However, beyond simple cases such as conditional sampling, the target distribution is often left implicit, defined only through a sampling rule or a heuristic energy function. To address this, we propose Jeffrey guidance, a principled framework that extends diffusion-model control to applications beyond what standard guidance can express. It leverages Jeffrey's rule of conditioning to update marginal distributions towards a prescribed target, preserving the conditional structure and minimally perturbing the joint distribution. We first demonstrate Jeffrey guidance by targeting a prescribed embedding distribution. With Inception embeddings as the target, this leads to substantial reductions in FID on both CIFAR-10 and FFHQ. We further apply Jeffrey guidance to fairness on CelebA-HQ, updating an unconditional diffusion model to enforce independence between attributes.
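Jeffrey's rule itself is a small computation on a discrete joint distribution: keep every conditional P(x | y) and swap the marginal of y for a prescribed target Q(y). The sketch below shows only this rule on a hypothetical 2x2 table; how the paper turns it into a guidance term for a diffusion sampler is not reproduced.

```python
def jeffrey_update(joint, target_marginal):
    """Return P'(x, y) = P(x | y) * Q(y) for a joint given as {(x, y): probability}."""
    marginal = {}
    for (_, y), p in joint.items():
        marginal[y] = marginal.get(y, 0.0) + p
    return {
        (x, y): p / marginal[y] * target_marginal[y]
        for (x, y), p in joint.items()
    }


# Hypothetical joint over a sample feature x and an attribute y.
joint = {
    ("a", 0): 0.3, ("b", 0): 0.1,   # P(y=0) = 0.4
    ("a", 1): 0.2, ("b", 1): 0.4,   # P(y=1) = 0.6
}
target = {0: 0.5, 1: 0.5}  # prescribed marginal for y

updated = jeffrey_update(joint, target)
new_marginal_y0 = updated[("a", 0)] + updated[("b", 0)]
old_conditional = joint[("a", 0)] / 0.4
new_conditional = updated[("a", 0)] / new_marginal_y0
```

The marginal of y moves to the target while P(x | y) is unchanged (0.75 before and after for x = "a" given y = 0), which is the "preserve the conditional structure" property the abstract refers to.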

2606.13347 2026-06-12 cs.LG 新提交

Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling

改进反向扩散采样在分类器引导扩散模型中的低密度区域探索

Jagriti Singh, Shekhar Verma, Muneendra Ojha

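Affiliations: University of Allahabad

AI summary: Proposes a training-free, sampling-time, density-aware method that improves exploration of low-density regions in diffusion models by modifying the classifier gradient to steer trajectories toward low-confidence regions while guiding sampling toward the predicted real image.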

详情
AI中文摘要

扩散模型已成为高保真图像合成的最先进生成模型,特别是在无分类器引导和分类器引导形式中。然而,标准分类器引导将概率质量集中在高密度类均值周围,导致对类条件分布尾部罕见样本的覆盖不足。最近关于基于扩散的尾部采样的工作通过训练一个额外的低密度寻求分类器(使用合成与真实判别器)来缓解这一问题,但代价是额外的网络和训练。与此同时,许多采样器和蒸馏技术加速或改进扩散采样,但并未明确解决长尾覆盖问题。我们提出一种纯采样时间、密度感知的分类器引导条件扩散模型扩展,针对低密度区域且无需任何额外训练。我们像大多数扩散模型一样,对噪声图像应用引导而非预测噪声。从预训练的ImageNet条件扩散模型和分类器开始,我们通过修改分类器梯度将轨迹引导向低置信区域,并在每个时间步引导采样过程朝向预测的真实图像,从而修改引导反向动力学。第一个引导有助于探索低概率样本,第二个引导有助于生成接近真实数据流形的样本。所提出的采样器在64x64分辨率下一致提高了ADM模型的召回率,同时保持可比的FID,并且使用256x256 ADM模型,我们展示了两种引导不同组合的视觉结果。我们还表明,标准ADM分类器引导结合预测真实图像引导,有助于在ImageNet上使用256x256 ADM模型生成高感知质量的样本。

英文摘要

Diffusion models have emerged as state-of-the-art generative models for high-fidelity image synthesis, particularly in their classifier-free guided and classifier-guided forms. However, standard classifier guidance concentrates probability mass around high-density class means, leading to poor coverage of rare samples in the tails of the class-conditional distributions. Recent work on diffusion-based tail sampling mitigates this by training an additional low-density-seeking classifier with a synthetic-vs-real discriminator, at the cost of additional networks and training. In parallel, a number of samplers and distillation techniques accelerate or refine diffusion sampling, but do not explicitly address long-tail coverage. We propose a purely sampling-time, density-aware extension of classifier-guided conditional diffusion models that targets low-density regions without any additional training. We apply guidance to the noisy images rather than to the predicted noise, unlike most diffusion models. Starting from a pretrained conditional diffusion model and classifier on ImageNet, we modify the guided reverse dynamics by steering trajectories toward low-confidence regions via the modified classifier gradient, and at each time step, we also guide the sampling process toward the predicted real image. The first guidance term helps explore low-probability samples, and the second keeps generated samples close to the real data manifold. The proposed sampler consistently improves ADM model recall at 64x64 resolution while maintaining a comparable FID, and with a 256x256 ADM model, we show visual results for different combinations of the two guidance terms. We also show that standard ADM classifier guidance, combined with predicted real image guidance, helps generate high perceptual quality samples with a 256x256 ADM model on ImageNet.

2606.13364 2026-06-12 cs.LG cs.CV 新提交

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

VideoMDM: 从2D监督走向3D人体运动生成

Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany

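Affiliations: Technion; NVIDIA

AI summary: Proposes VideoMDM, a framework that learns a 3D motion prior with a diffusion model from 2D poses in monocular video, using a depth-weighted 2D reprojection loss to approximate 3D supervision and approaching fully 3D-supervised performance on HumanML3D.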

Comments https://videomdm.github.io/

详情
AI中文摘要

我们提出VideoMDM,一个基于扩散的框架,直接从单目视频中提取的精确2D姿态训练3D人体运动先验,无需任何3D真实数据。预训练的2D到3D提升器提供近似的3D姿态序列,作为有噪声的教师:这些序列被扩散,模型在3D空间去噪,并通过重投影预测并与精确关键点比较在2D空间进行监督。我们证明,在温和假设下,深度加权的2D重投影损失在期望上等价于直接3D监督,并将标准3D运动正则化器——速度一致性和过参数化表示对齐——适应到这一2D设置。与仅在推理时将2D提升到3D的方法不同,VideoMDM在训练期间学习一个连贯的3D运动流形。在HumanML3D上,它几乎缩小了与完全3D监督的MDM的差距(FID 0.88 vs 0.54);在真实视频数据集Fit3D和NBA上,该方法学习生成一致被人类偏好的运动,并取得了强定量结果。

英文摘要

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.
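Why depth weighting helps can be seen with a pinhole camera: a pixel error shrinks with distance, so scaling it by depth over focal length restores metric units. The sketch below uses that simple weighting as an assumption; the exact weights and the "mild assumptions" under which the paper proves equivalence in expectation are not reproduced, and the camera and joint values are hypothetical.

```python
def project(point, focal):
    """Pinhole projection of a 3D point (x, y, z) with z > 0 onto the image plane."""
    x, y, z = point
    return focal * x / z, focal * y / z


def depth_weighted_loss(pred_3d, keypoint_2d, focal):
    """Squared 2D reprojection error, rescaled by (depth / focal) to metric units."""
    u, v = project(pred_3d, focal)
    scale = pred_3d[2] / focal
    return (scale * (u - keypoint_2d[0])) ** 2 + (scale * (v - keypoint_2d[1])) ** 2


focal = 1000.0
true_joint = (0.5, 0.2, 4.0)
keypoint = project(true_joint, focal)  # the "accurate 2D keypoint"

# A prediction that is off by 0.1 laterally at the same depth.
pred_joint = (0.6, 0.2, 4.0)
weighted = depth_weighted_loss(pred_joint, keypoint, focal)
lateral_3d_error = (pred_joint[0] - true_joint[0]) ** 2

# The raw pixel loss for the same error depends on how far the person is.
raw_pixel = (project(pred_joint, focal)[0] - keypoint[0]) ** 2
```

For a lateral error at fixed depth, the weighted 2D loss equals the squared 3D error (0.01 here), whereas the raw pixel loss (625) would change with the subject's distance from the camera.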

2606.13381 2026-06-12 cs.LG 新提交

Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

Hölder++:改进多模态VAE中的质量-一致性权衡

Huyen Vo, María Martínez-García, Isabel Valera

AI总结 针对多模态VAE生成质量与语义一致性之间的权衡问题,提出Hölder++,通过精确Hölder池化、扩展架构和层次推理,在提升一致性的同时保持生成质量。

Comments Accepted at ICML 2026. Camera-ready version

详情
AI中文摘要

现有的多模态变分自编码器(VAE)方法面临生成质量与一致性之间的权衡——即它们难以生成既真实多样又在各模态间语义一致的样本。最近的一项工作表明,使用Hölder池化的简单近似作为聚合方法,尽管假设所有模态共享单一表示,但能提高一致性超过SOTA MMVAE+。然而,它略微牺牲了样本多样性。受此启发,我们提出Hölder++,一种新颖的多模态VAE,通过以下方式改进生成质量-一致性权衡:(i) 首次实现无近似的Hölder池化用于多模态VAE;(ii) 扩展架构,建模不同的共享和私有(即模态特定)表示(Hölder+);(iii) 层次推理,进一步增强共享和私有表示之间的解耦(Hölder++)。我们的实验证实,Hölder++持续改进生成质量-一致性权衡,产生更结构化的潜在空间,并学习对下游任务信息丰富的共享表示。

英文摘要

Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence-i.e., they struggle to generate realistic and diverse samples that, at the same time, are semantically consistent across modalities. A recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.

2606.13400 2026-06-12 cs.LG cs.AI cs.RO 新提交

PolyFlow: Safe and Efficient Polytope-Constrained Flow Matching with Constraint Embedding and Projection-free Update

PolyFlow: 安全高效的多面体约束流匹配,具有约束嵌入和无投影更新

Jianming Ma, Qiyue Yang, Yang Zhang, Liyun Yan, Zhanxiang Cao, Yazhou Zhang, Yue Gao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出PolyFlow,一种将约束直接嵌入模型和流动力学的多面体约束流匹配框架,通过离散时间流公式和无投影架构消除离散化误差并严格满足任意多面体约束,在规划与控制任务中实现零约束违反并降低推理延迟。

Comments 30 pages, 12 figures, Accepted to ICML 2026

详情
AI中文摘要

尽管基于流的生成模型在广泛领域展现了强大的性能,但由于严格的约束要求,在安全关键的物理系统中部署它们仍然具有挑战性。现有方法通常通过事后修正来强制执行安全性,这会产生大量的计算开销,并可能扭曲学习到的分布。我们提出了PolyFlow,一种多面体约束流匹配框架,将约束直接嵌入到模型和流动力学中。PolyFlow引入了离散时间流公式和无投影架构,消除了离散化误差,并保证严格满足任意多面体约束,无需昂贵的迭代求解器。实验结果表明,PolyFlow在规划和控制任务中实现了零约束违反,同时保持了较高的分布保真度。与最先进的约束生成基线相比,PolyFlow显著降低了推理延迟,并在安全性、效率和生成质量之间展示了有利的权衡。代码可在 https://github.com/MJianM/PolyFlow 获取。

英文摘要

While flow-based generative models have demonstrated strong performance across a wide range of domains, deploying them in safety-critical physical systems remains challenging due to strict constraint requirements. Existing approaches typically enforce safety through post-hoc corrections, which incur substantial computational overhead and may distort the learned distribution. We propose PolyFlow, a polytope-constrained flow matching framework that embeds constraints directly into the model and flow dynamics. PolyFlow introduces a discrete-time flow formulation and a projection-free architecture, which eliminate the discretization error and guarantee strict satisfaction of arbitrary polyhedral constraints, without the need for expensive iterative solvers. Experimental results show that PolyFlow achieves zero constraint violation while maintaining high distributional fidelity across a range of planning and control tasks. Compared to state-of-the-art constrained generation baselines, PolyFlow significantly reduces inference latency and demonstrates a favorable trade-off between safety, efficiency, and generative quality. Code is available on https://github.com/MJianM/PolyFlow.
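To make the idea of satisfying polyhedral constraints without a projection step concrete, here is a minimal sketch of one generic way to do it: scale each update so it stops at the first face of the polytope {x : A x <= b} it would cross. This is a textbook step-scaling rule, not the PolyFlow architecture, whose details the abstract does not give. The function names are hypothetical, and the current point is assumed to be feasible already.

```python
def max_feasible_step(A, b, x, d):
    """Largest alpha in [0, 1] such that A (x + alpha * d) <= b, assuming A x <= b."""
    alpha = 1.0
    for row, bi in zip(A, b):
        ad = sum(r * di for r, di in zip(row, d))
        if ad > 0.0:  # only faces we are moving toward can become violated
            slack = bi - sum(r * xi for r, xi in zip(row, x))
            alpha = min(alpha, slack / ad)
    return max(alpha, 0.0)


def constrained_update(A, b, x, d):
    """Move from x along d, shortened just enough to remain inside the polytope."""
    alpha = max_feasible_step(A, b, x, d)
    return [xi + alpha * di for xi, di in zip(x, d)]
```

Each update costs one pass over the constraint rows and needs no iterative solver, which is the property the abstract contrasts with post-hoc corrections.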

2606.13426 2026-06-12 cs.LG stat.ML 新提交

Accelerating Speculative Diffusions via Block Verification

通过块验证加速推测性扩散

Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet

发表机构 * KTH(皇家理工学院) Google Research(谷歌研究) Google DeepMind(谷歌DeepMind)

AI总结 提出一种针对扩散模型的推测性采样方案,通过块验证提高草稿接受率,无需训练的Free Drafter实现高达6.3%的加速。

详情
AI中文摘要

推测性解码通过使用草稿模型生成令牌,并采用接受-拒绝方案确保输出与目标分布匹配,从而加速LLM推理。将其适应于连续扩散是困难的,因为推测性采样需要从残差分布中采样。虽然在离散空间中直接,但在连续空间中高效采样残差并非易事。因此,现有的扩散适应要么使用计算效率低下的采样技术,要么依赖替代方案。在这项工作中,我们引入了一种新颖的方案,高效地实现了扩散模型的原始推测性采样机制。我们的方法相比现有方法具有关键优势:它使我们能够将LLM的块验证适应到扩散——这被证明可以提高草稿的接受率。此外,我们形式化并分析了Free Drafter,一种无需训练的扩散启发式自推测草稿生成器。通过启用块验证,我们的Free Drafter在无需额外训练且开销可忽略的情况下,相比现有推测性方法实现了高达6.3%的加速。

英文摘要

Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions -- which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
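For readers unfamiliar with the acceptance-rejection scheme and the residual distribution mentioned above, the following sketch shows the standard discrete-space version used for LLM tokens, where sampling the residual is straightforward. It does not implement the paper's continuous-space scheme or block verification. The names `speculative_step` and `output_law` are illustrative, and `p` and `q` are assumed to be probability vectors over the same finite vocabulary.

```python
import random


def residual(p, q):
    """Normalized positive part of p - q: the distribution sampled after a rejection."""
    r = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    z = sum(r)
    # If p == q the residual is empty, but then a rejection never happens.
    return [ri / z for ri in r] if z > 0 else list(p)


def speculative_step(p, q, rng=random):
    """Draw one token distributed exactly as the target p, using a draft from q."""
    x = rng.choices(range(len(q)), weights=q)[0]  # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):      # accept with prob min(1, p/q)
        return x, True
    return rng.choices(range(len(p)), weights=residual(p, q))[0], False


def output_law(p, q):
    """Exact distribution of the token returned by speculative_step, in closed form."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x) / q(x))
    reject_mass = 1.0 - sum(accept)
    return [a + reject_mass * r for a, r in zip(accept, residual(p, q))]
```

`output_law(p, q)` reproduces `p` for any draft `q` that covers the support of `p`, which is the guarantee that the output matches the target distribution.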

2606.13565 2026-06-12 cs.LG 新提交

A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding

A2D2: 任意长度离散扩散模型的自适应解码微调

Sophia Tang, Yuchen Zhu, Molei Tao, Pranam Chatterjee

发表机构 * Department of Computer and Information Science, University of Pennsylvania(宾夕法尼亚大学计算机与信息科学系) Department of Bioengineering, University of Pennsylvania(宾夕法尼亚大学生物工程系) School of Mathematics, Georgia Institute of Technology(佐治亚理工学院数学学院)

AI总结 提出A2D2框架,通过联合优化插入和去掩蔽策略及基于质量的推理调度,实现任意长度离散扩散模型的奖励引导微调,理论上保证收敛到奖励倾斜分布,实验提升奖励优化与生成灵活性和准确性。

详情
AI中文摘要

离散扩散模型为序列生成提供了一个简单且稳定的基于似然的框架,最近通过令牌插入扩展到任意长度设置。然而,针对任意长度离散扩散的基于奖励的微调原则性方法仍基本未被探索。我们引入了任意长度离散扩散模型的自适应解码微调(A2D2),这是一个统一的框架,通过联合优化插入和去掩蔽策略以及基于质量的推理调度,实现任意长度离散扩散模型的奖励引导微调。我们推导了联合插入-去掩蔽路径测度的Radon-Nikodym导数,从而在不需要目标样本的情况下,理论上保证收敛到难以处理的奖励倾斜序列分布。在此基础上,我们将去掩蔽和插入质量确立为最小化解码误差的可行方法,并引入自适应联合解码(AJD)损失,该损失可证明地生成产生奖励倾斜分布的最优路径测度。实验上,A2D2在提高奖励优化的同时,相比先前的固定长度微调和推理时引导方法,增强了生成的灵活性和准确性。

英文摘要

Discrete diffusion models offer a simple and stable likelihood-based framework for sequence generation, recently extended to any-length settings via token insertion. Principled reward-guided fine-tuning for any-length discrete diffusion, however, remains largely unexplored. We introduce Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding (A2D2), a unified framework for reward-guided fine-tuning of any-length discrete diffusion models via joint optimization of the insertion and unmasking policies together with a quality-based inference schedule. We derive the Radon-Nikodym derivative for the joint insertion-unmasking path measures, enabling theoretically guaranteed convergence to the intractable reward-tilted sequence distribution without requiring target samples. Building on this, we establish unmasking and insertion quality as tractable approaches for minimizing decoding error and introduce the Adaptive Joint Decoding (AJD) loss, which provably yields the optimal path measure that generates the reward-tilted distribution. Empirically, A2D2 improves reward optimization while enhancing generation flexibility and accuracy over prior fixed-length fine-tuning and inference-time guidance methods.

2606.12987 2026-06-12 cs.CV cs.AI cs.LG cs.RO 交叉投稿

Diffusion Transformer World-Action Model for AV Scene Prediction

扩散Transformer世界-动作模型用于自动驾驶场景预测

Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

发表机构 * Stanford University(斯坦福大学)

AI总结 提出紧凑潜世界模型,结合扩散Transformer(DiT)预测未来场景,在nuScenes上实现4.8倍更好的KID,并实现动作可控性(转向ρ=0.81)。

Comments 10 pages, 9 figures, 2 tables

详情
AI中文摘要

动作条件世界模型使自动驾驶车辆能够根据自身规划的控制预测未来摄像头场景,从而无需真实世界部署即可进行规划和仿真,但在紧凑、可训练的规模下,未来具有模糊性,且该领域的标准失真度量具有误导性:它们奖励模糊的回归均值而非逼真的预测。我们通过一个紧凑的潜世界模型应对这一问题,该模型给定当前前摄像头潜变量和一系列自我动作,预测未来场景潜变量,由冻结解码器渲染为$256 \times 256$帧,最多提前8秒,在150个保留的nuScenes场景上评估。我们首先基准测试预测位置:在跨越四个表示族的六个冻结编码器中,具有时间上下文的V-JEPA2将转向RMSE比最佳单帧编码器降低40%。然后我们训练一个潜扩散Transformer(DiT),并通过受控诊断识别其所需的四个要素:空间token、$x_0$目标、残差锚定以及与目标不确定性匹配的采样。在Stable-Diffusion-VAE编码-预测-解码流水线中,我们揭示了核心矛盾:失真度量(余弦相似度、SSIM)倾向于模糊均值,掩盖了扩散模型更接近真实帧分布的事实。基于Inception的FID和KID揭示了清晰的感知-失真边界:扩散模型达到KID 0.078,而回归为0.375(好$4.8\times$),且可部署的训练校准使其无需测试时真实值即可实用。该模型真正具有动作可控性(转向驱动场景位移,Spearman $\rho = 0.81$,而回归为$-0.18$)。我们将有限的单次运动归因于共享当前锚点,并设计了一个紧凑的170万参数“跳跃”模型,恢复完整的真实运动幅度($1.02\times$ GT),而单次模型捕获不到一半。

英文摘要

Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $ρ= 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.

2509.03340 2026-06-12 cs.LG cs.AI cs.CE physics.comp-ph 版本更新

Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

等变流匹配用于对称破缺分岔问题

Fleur Hendriks, Ondřej Rokoš, Martin Doškář, Marc G. D. Geers, Vlado Menkovski

发表机构 * Department of Mechanical Engineering, Eindhoven University of Technology(埃因霍温理工大学机械工程系) DIFFER – Dutch Institute for Fundamental Energy Research(荷兰基础能源研究所) Faculty of Civil Engineering, Department of Mechanics, Czech Technical University in Prague(布拉格捷克技术大学土木工程学院力学系) Department of Mathematics and Computer Science, Eindhoven University of Technology(埃因霍温理工大学数学与计算机科学系)

AI总结 针对非线性动力系统中对称破缺导致的多稳态共存问题,提出等变流匹配方法,结合等变架构与最优传输耦合机制,准确捕捉多模态分布和对称破缺分岔,优于非概率和变分方法。

Comments 9 pages, 7 figures including appendices. Accepted to Machine Learning and the Physical Sciences Workshop, NeurIPS 2025 (https://ml4physicalsciences.github.io/2025/). Repository with corresponding code: https://github.com/FHendriks11/bifurcationML/. Video explanation: https://www.youtube.com/watch?v=wsL3h17KtjY

详情
AI中文摘要

非线性动力系统中的分岔现象通常导致多个共存的稳定解,特别是在对称破缺的情况下。确定性机器学习模型无法捕捉这种多重性,会平均化解并无法表示低对称性结果。在这项工作中,我们正式将生成式AI(特别是流匹配)作为建模分岔结果全概率分布的原则性方法。我们的方法建立在现有技术基础上,将流匹配与等变架构和基于最优传输的耦合机制相结合。我们将等变流匹配推广到一种对称耦合策略,该策略在群作用下对齐预测和目标输出,从而在等变设置中实现准确学习。我们在从简单概念系统到物理问题(如屈曲梁和Allen-Cahn方程)的一系列系统上验证了我们的方法。结果表明,该方法准确捕捉了多模态分布和对称破缺分岔。此外,我们的结果表明,流匹配显著优于非概率和变分方法。这为高维系统中的多稳态建模提供了一种原则性且可扩展的解决方案。

英文摘要

Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of symmetry breaking. Deterministic machine learning models are unable to capture this multiplicity, averaging over solutions and failing to represent lower-symmetry outcomes. In this work, we formalize the use of generative AI, specifically flow matching, as a principled way to model the full probability distribution over bifurcation outcomes. Our approach builds on existing techniques by combining flow matching with equivariant architectures and an optimal-transport-based coupling mechanism. We generalize equivariant flow matching to a symmetric coupling strategy that aligns predicted and target outputs under group actions, allowing accurate learning in equivariant settings. We validate our approach on a range of systems, from simple conceptual systems to physical problems such as buckling beams and the Allen--Cahn equation. The results demonstrate that the approach accurately captures multimodal distributions and symmetry-breaking bifurcations. Moreover, our results demonstrate that flow matching significantly outperforms non-probabilistic and variational methods. This offers a principled and scalable solution for modeling multistability in high-dimensional systems.
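The coupling step described above, aligning targets with sources under group actions before regressing a velocity, can be sketched for the simplest case of a two-element sign-flip group, as in a beam that may buckle left or right. This is a toy illustration under assumptions of my own: linear interpolation paths, squared Euclidean distance, and the hypothetical names `align_target`, `flow_matching_pair`, and `Z2`. It is not the authors' implementation.

```python
def align_target(x0, x1, group):
    """Pick the symmetric copy g(x1) of the target that lies closest to the source x0."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min((g(x1) for g in group), key=lambda y: sqdist(x0, y))


def flow_matching_pair(x0, x1, t, group):
    """Interpolated point x_t and regression target velocity after symmetric alignment."""
    y = align_target(x0, x1, group)
    xt = [(1 - t) * a + t * b for a, b in zip(x0, y)]
    velocity = [b - a for a, b in zip(x0, y)]
    return xt, velocity


# Identity and sign flip: the two-element symmetry group assumed for this toy example.
Z2 = [lambda v: list(v), lambda v: [-c for c in v]]
```

With `x0 = [0.1, -0.2]` and `x1 = [-1.0, 1.0]`, the flipped copy `[1.0, -1.0]` is closer to the source, so the regression target becomes `[0.9, -0.8]` rather than a longer path across the symmetry axis.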

2509.19526 2026-06-12 cs.LG cs.SY eess.SY 版本更新

Metriplectic Conditional Flow Matching for Dissipative Dynamics

度量辛条件流匹配用于耗散动力学

Ali Baheri, Lars Lindemann

发表机构 * Rochester Institute of Technology, Rochester, NY, USA(罗切斯特理工学院) Automatic Control Laboratory, ETH Zürich, Switzerland(苏黎世联邦理工学院自动控制实验室)

AI总结 提出度量辛条件流匹配(MCFM)方法,通过将保守-耗散分解融入向量场和结构保持采样器,学习耗散动力学,保证能量单调递减和长期稳定性。

详情
AI中文摘要

度量辛条件流匹配(MCFM)在不违反第一原理的情况下学习耗散动力学。神经替代模型常常注入能量并破坏长期推演的稳定性;MCFM 则将保守-耗散分解同时融入向量场和结构保持采样器。MCFM 通过短时间过渡上的条件流匹配进行训练,避免了长时间推演伴随的梯度计算。在推理时,Strang-prox 方案交替进行辛更新和近端度量步骤,确保离散能量衰减;当有可信能量可用时,可选投影强制严格衰减。我们提供了连续和离散时间保证,将该参数化和采样器与守恒、单调耗散和稳定推演联系起来。在一个受控机械基准上,MCFM 产生的相图更接近真实情况,并且与同等表达能力的无约束神经流相比,能量增加和正能量率事件显著减少,同时匹配终端分布拟合。

英文摘要

Metriplectic conditional flow matching (MCFM) learns dissipative dynamics without violating first principles. Neural surrogates often inject energy and destabilize long-horizon rollouts; MCFM instead builds the conservative-dissipative split into both the vector field and a structure preserving sampler. MCFM trains via conditional flow matching on short transitions, avoiding long rollout adjoints. In inference, a Strang-prox scheme alternates a symplectic update with a proximal metric step, ensuring discrete energy decay; an optional projection enforces strict decay when a trusted energy is available. We provide continuous and discrete time guarantees linking this parameterization and sampler to conservation, monotonic dissipation, and stable rollouts. On a controlled mechanical benchmark, MCFM yields phase portraits closer to ground truth and markedly fewer energy-increase and positive energy rate events than an equally expressive unconstrained neural flow, while matching terminal distributional fit.
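The alternation of a symplectic update with a proximal dissipative step can be illustrated on a damped harmonic oscillator with energy H(q, p) = (q^2 + p^2) / 2. The sketch below is a hand-written toy, not the learned MCFM sampler. It assumes the conservative part is integrated by its exact rotation flow (which conserves H) and that friction acts only on the momentum through an implicit-Euler step, which is the proximal map of a quadratic. All function names and parameter values are illustrative.

```python
import math


def energy(q, p):
    return 0.5 * (q * q + p * p)


def symplectic_step(q, p, h):
    """Exact flow of dq/dt = p, dp/dt = -q: a rotation, which conserves the energy."""
    c, s = math.cos(h), math.sin(h)
    return c * q + s * p, -s * q + c * p


def prox_friction(p, h, gamma):
    """Implicit-Euler (proximal) step for dp/dt = -gamma * p; it can only shrink |p|."""
    return p / (1.0 + h * gamma)


def strang_prox_step(q, p, h, gamma):
    p = prox_friction(p, 0.5 * h, gamma)  # half dissipative step
    q, p = symplectic_step(q, p, h)       # full conservative step
    p = prox_friction(p, 0.5 * h, gamma)  # half dissipative step
    return q, p


def rollout(q, p, h=0.1, gamma=0.5, steps=200):
    """Return the energy after every step of the split integrator."""
    energies = [energy(q, p)]
    for _ in range(steps):
        q, p = strang_prox_step(q, p, h, gamma)
        energies.append(energy(q, p))
    return energies
```

Because the rotation leaves H unchanged and each proximal step can only reduce |p|, the energies returned by `rollout` never increase, up to floating-point rounding. This mirrors the discrete energy decay the abstract describes.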

2512.22287 2026-06-12 cs.LG cs.AI 版本更新

Cluster Aggregated GAN (CAG): A Cluster-Based Hybrid Model for Appliance Pattern Generation

聚类聚合生成对抗网络 (CAG):一种基于聚类的混合模型用于电器模式生成

Zikun Guo, Adeyinka. P. Adedigba, Rammohan Mallipeddi

发表机构 * Department of Artificial Intelligence, School of Electronics Engineering, Kyungpook National University(庆北国立大学电子工程学院人工智能系)

AI总结 针对现有生成方法忽略间歇性与连续电器行为差异导致训练不稳定和保真度有限的问题,提出CAG框架,通过聚类模块为间歇电器分配专用生成器,连续电器使用LSTM生成器,在UVIC数据集上优于基线方法。

Comments 18 pages, 5 figures

详情
AI中文摘要

合成电器数据对于开发非侵入式负荷监测算法和实现隐私保护的能源研究至关重要,然而标记数据集的稀缺性仍然是一个重大障碍。最近基于GAN的方法已经证明了合成负荷模式的可行性,但大多数现有方法在单个模型内统一处理所有设备,忽略了间歇性和连续性电器之间的行为差异,导致训练不稳定和输出保真度有限。为了解决这些局限性,我们提出了聚类聚合生成对抗网络框架,这是一种混合生成方法,根据每个电器的行为特征将其路由到专门的分支。对于间歇性电器,聚类模块将相似的激活模式分组,并为每个聚类分配专用生成器,确保常见和罕见操作模式都获得足够的建模能力。连续性电器遵循单独的分支,采用基于LSTM的生成器来捕捉逐渐的时间演变,同时通过序列压缩保持训练稳定性。在UVIC智能插头数据集上的大量实验表明,所提出的框架在衡量真实性、多样性和训练稳定性的指标上始终优于基线方法,并且将聚类作为主动生成组件显著提高了可解释性和可扩展性。这些发现确立了所提出的框架作为非侵入式负荷监测研究中合成负荷生成的有效方法。

英文摘要

Synthetic appliance data are essential for developing non-intrusive load monitoring algorithms and enabling privacy preserving energy research, yet the scarcity of labeled datasets remains a significant barrier. Recent GAN-based methods have demonstrated the feasibility of synthesizing load patterns, but most existing approaches treat all devices uniformly within a single model, neglecting the behavioral differences between intermittent and continuous appliances and resulting in unstable training and limited output fidelity. To address these limitations, we propose the Cluster Aggregated GAN framework, a hybrid generative approach that routes each appliance to a specialized branch based on its behavioral characteristics. For intermittent appliances, a clustering module groups similar activation patterns and allocates dedicated generators for each cluster, ensuring that both common and rare operational modes receive adequate modeling capacity. Continuous appliances follow a separate branch that employs an LSTM-based generator to capture gradual temporal evolution while maintaining training stability through sequence compression. Extensive experiments on the UVIC smart plug dataset demonstrate that the proposed framework consistently outperforms baseline methods across metrics measuring realism, diversity, and training stability, and that integrating clustering as an active generative component substantially improves both interpretability and scalability. These findings establish the proposed framework as an effective approach for synthetic load generation in non-intrusive load monitoring research.

2601.03184 2026-06-12 cs.LG cs.AI 版本更新

Decentralized Autoregressive Generation

分散自回归生成

Stepan Maschan, Haoxuan Qu, Jun Liu

发表机构 * Lancaster University(兰卡斯特大学)

AI总结 本文通过离散流匹配框架证明分散训练与集中训练在理论上等价,实验验证其在多模态基准上保持竞争力。

详情
AI中文摘要

近年来,自回归生成的分散化作为解决扩展瓶颈的方案引起了广泛关注。然而,尽管有令人鼓舞的实验结果,这一范式目前缺乏严格的理论证明。在这项工作中,我们正式建立了分散训练与集中训练之间的理论等价性。为此,我们调整了离散流匹配框架用于自回归生成,利用其固有性质证明全局模型自然分解为独立专家。最后,我们在多种多模态基准上进行了大量实验,实验验证了分散训练在标准集中架构上保持竞争性。

英文摘要

The decentralization of autoregressive generation has attracted considerable attention in recent years as a solution to scaling bottlenecks. However, despite promising empirical results, this paradigm currently lacks rigorous theoretical justification. In this work, we formally establish the theoretical equivalence between decentralized and centralized training. To achieve this, we adapt the Discrete Flow Matching framework for autoregressive generation, leveraging its inherent properties to demonstrate that global models naturally decompose into independent experts. Finally, we conduct extensive experiments across diverse multimodal benchmarks, empirically validating that decentralized training maintains competitive parity with standard centralized architectures.

2601.06572 2026-06-12 cs.LG cs.AI 版本更新

Hellinger Multimodal Variational Autoencoders

Hellinger多模态变分自编码器

Huyen Vo, Isabel Valera

发表机构 * Department of Computer Science, Saarland University(萨尔兰大学计算机科学系) MPI-SWS, Saarland Informatics Campus(马克斯·普朗克软件系统研究所,萨尔兰信息学园区)

AI总结 提出基于Hellinger距离的矩匹配近似方法HELVAE,避免子采样,在多模态变分自编码器中实现更优的生成一致性与质量权衡。

Comments Accepted at AISTATS 2026. Camera-ready version

详情
AI中文摘要

多模态变分自编码器(VAEs)广泛用于弱监督生成学习,涉及多种模态。主流方法通过专家乘积(PoE)、专家混合(MoE)或其组合来聚合单模态推理分布,以近似联合后验。本文从概率意见池化的优化视角重新审视多模态推理。我们从$\alpha=0.5$的Hölder池化出发,这是$\alpha\text{-散度}$族中唯一的对称成员,并推导出一种矩匹配近似,称为Hellinger。我们利用这种近似提出HELVAE,一种避免子采样的多模态VAE,从而得到一个高效且有效的模型,该模型:(i)随着观察到的模态增加,学习更具表达力的潜在表示;(ii)在生成一致性和质量之间实现更好的权衡,优于最先进的多模态VAE模型。

英文摘要

Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $α=0.5$, which corresponds to the unique symmetric member of the $α\text{-divergence}$ family, and derive a moment-matching approximation, termed Hellinger. We then leverage such an approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
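The two baseline aggregation rules named above, product of experts and mixture of experts, have simple forms for one-dimensional Gaussian experts, sketched below. This shows only the baselines the abstract contrasts against; the Hellinger moment-matching rule itself is not reproduced because the abstract does not state it. The function names and the uniform mixture weights are assumptions for this example.

```python
import math
import random


def product_of_experts(mus, variances):
    """Gaussian PoE: precisions add, and the mean is the precision-weighted average."""
    precisions = [1.0 / v for v in variances]
    total = sum(precisions)
    mean = sum(m * t for m, t in zip(mus, precisions)) / total
    return mean, 1.0 / total


def mixture_of_experts_sample(mus, variances, rng=random):
    """Uniform MoE: pick one expert at random (sub-sampling), then sample from it."""
    k = rng.randrange(len(mus))
    return rng.gauss(mus[k], math.sqrt(variances[k]))
```

Two unit-variance experts centered at 0 and 2 give a PoE posterior with mean 1 and variance 0.5, so the product sharpens as modalities are added. Each MoE sample, by contrast, uses a single expert.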

2602.04675 2026-06-12 cs.LG 版本更新

Generalized Schrödinger Bridge on Graphs

图上的广义薛定谔桥

Panagiotis Theodoropoulos, Juno Nam, Evangelos Theodorou, Jaemoo Choi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出GSBoG框架,通过似然优化学习图上可控连续时间马尔可夫链策略,满足端点边际分布并优化中间状态成本,实现可扩展的拓扑感知运输。

详情
AI中文摘要

图上的运输是许多领域中的一个基本挑战,决策必须尊重拓扑和操作约束。尽管需要可执行的策略,现有的图运输方法缺乏这种表达能力。它们依赖于限制性假设,无法在稀疏拓扑上泛化,并且随着图大小和时间范围的增加而扩展性差。为了解决这些问题,我们引入了图上的广义薛定谔桥(GSBoG),这是一种新颖的可扩展数据驱动框架,用于在状态成本增强动力学下学习任意图上的可执行受控连续时间马尔可夫链(CTMC)策略。值得注意的是,GSBoG学习轨迹级策略,避免了密集的全局求解器,从而增强了可扩展性。这是通过似然优化方法实现的,满足端点边际分布,同时优化状态依赖运行成本下的中间行为。在具有挑战性的真实世界图拓扑上的大量实验表明,GSBoG能够可靠地学习准确、尊重拓扑的策略,同时优化特定应用的中间状态成本,突出了其广泛的适用性,并为一般图上的成本感知动态运输开辟了新途径。

英文摘要

Transportation on graphs is a fundamental challenge across many domains, where decisions must respect topological and operational constraints. Despite the need for actionable policies, existing graph-transport methods lack this expressivity. They rely on restrictive assumptions, fail to generalize across sparse topologies, and scale poorly with graph size and time horizon. To address these issues, we introduce Generalized Schrödinger Bridge on Graphs (GSBoG), a novel scalable data-driven framework for learning executable controlled continuous-time Markov chain (CTMC) policies on arbitrary graphs under state cost augmented dynamics. Notably, GSBoG learns trajectory-level policies, avoiding dense global solvers and thereby enhancing scalability. This is achieved via a likelihood optimization approach, satisfying the endpoint marginals, while simultaneously optimizing intermediate behavior under state-dependent running costs. Extensive experimentation on challenging real-world graph topologies shows that GSBoG reliably learns accurate, topology-respecting policies while optimizing application-specific intermediate state costs, highlighting its broad applicability and paving new avenues for cost-aware dynamical transport on general graphs.

2604.13924 2026-06-12 cs.LG cs.AI cs.CV 版本更新

ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection

ASTER: 用于无监督时间序列异常检测的潜在伪异常生成

Romain Hermary, Samet Hicsonmez, Dan Pineau, Abd El Rahman Shabayek, Djamila Aouada

发表机构 * Université de Montréal(蒙特利尔大学)

AI总结 提出ASTER框架,在潜在空间生成伪异常训练Transformer分类器,结合预训练LLM增强表示,在三个基准数据集上达到最优性能。

Comments Published in ICPR 2026

详情
AI中文摘要

时间序列异常检测(TSAD)在工业监控、医疗保健和网络安全等领域至关重要,但由于罕见且异质的异常以及标记数据的稀缺性,它仍然具有挑战性。这种稀缺性使得无监督方法占主导地位,但现有方法通常依赖于重建或预测(难以处理复杂数据),或依赖于需要领域特定异常合成和固定距离度量的基于嵌入的方法。我们提出ASTER,一个直接在潜在空间中生成伪异常的框架,避免了手工制作的异常注入和对领域专业知识的需求。潜在空间解码器生成定制的伪异常,用于训练基于Transformer的异常分类器,而预训练的LLM丰富了该空间的时间和上下文表示。在三个基准数据集上的实验表明,ASTER达到了最先进的性能,并为基于LLM的TSAD设立了新标准。

英文摘要

Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains challenging due to rare and heterogeneous anomalies and the scarcity of labelled data. This scarcity makes unsupervised approaches predominant, yet existing methods often rely on reconstruction or forecasting, which struggle with complex data, or on embedding-based approaches that require domain-specific anomaly synthesis and fixed distance metrics. We propose ASTER, a framework that generates pseudo-anomalies directly in the latent space, avoiding handcrafted anomaly injections and the need for domain expertise. A latent-space decoder produces tailored pseudo-anomalies to train a Transformer-based anomaly classifier, while a pre-trained LLM enriches the temporal and contextual representations of this space. Experiments on three benchmark datasets show that ASTER achieves state-of-the-art performance and sets a new standard for LLM-based TSAD.

2605.08116 2026-06-12 cs.LG cs.AI 版本更新

The Safety-Aware Denoiser for Text Diffusion Models

文本扩散模型的安全感知去噪器

Amman Yusuf, Zhejun Jiang, Mijung Park

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出安全感知去噪器(SAD),在文本扩散模型的迭代去噪过程中引导生成文本进入安全区域,无需重训练即可实现灵活的安全约束,有效降低不安全生成同时保持生成质量。

Comments 28 pages, 12 figures. Code available at: https://github.com/ParkLabML/SAD

详情
AI中文摘要

最近关于文本扩散模型的工作为自回归生成提供了一种有前景的替代方案,但控制其安全性仍未被充分探索。现有的安全方法面向自回归模型,通常依赖于事后过滤或推理时干预。这些方法不足以有效解决文本扩散模型中的安全风险。我们提出了安全感知去噪器(SAD),一种文本扩散模型中的安全引导框架。SAD修改了迭代去噪过程,使得最终去噪步骤中的文本样本被引导至文本空间中可证明的安全区域。这种推理时方法可以将安全约束集成到去噪器中,避免了底层扩散模型的计算昂贵重训练,并实现了灵活、轻量级的安全引导。我们使用SAD评估生成文本的安全性,涉及危害分类、记忆和越狱。实验结果表明,SAD在保持生成质量、多样性和流畅性的同时,显著减少了不安全生成,优于现有方法。这些结果表明,我们在去噪过程中的安全引导为在文本扩散模型中实施安全提供了一种有效且可扩展的机制。

英文摘要

Recent work on text diffusion models offers a promising alternative to autoregressive generation, but controlling their safety remains underexplored. Existing safety approaches are geared toward autoregressive models and typically rely on post-hoc filtering or inference-time interventions. These are inadequate for effectively addressing safety risks in text diffusion models. We propose the Safety-Aware Denoiser (SAD), a safety-guidance framework in text diffusion models. The SAD modifies the iterative denoising process such that the text sample at the final denoising step is steered toward provably safe regions of the text space. This inference-time method can integrate safety constraints into the denoiser, avoiding computationally expensive retraining of the underlying diffusion model and enabling flexible, lightweight safety guidance. We evaluate the safety of the generated text using the SAD, with respect to hazard taxonomy, memorization, and jailbreak. Experimental results show that SAD substantially reduces unsafe generations while preserving generation quality, diversity, and fluency, outperforming existing methods. These results demonstrate that our safety guidance during denoising provides an effective and scalable mechanism for enforcing safety in text diffusion models.

2605.28507 2026-06-12 cs.LG 版本更新

Universal Time Series Generation with Neural Controlled Differential Equations

基于神经受控微分方程的通用时间序列生成

Torben Berndt, Elyes Farjallah, Leif Seute, Raeid Saqur, Benjamin Walker, Jan Stühmer

发表机构 * Heidelberg Institute for Theoretical Studies(海德堡理论研究所) IAR, Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院IAR) Max Planck Institute for Polymer Research(马克斯·普朗克聚合物研究所) IWR, Heidelberg University(海德堡大学IWR) Dept. of Computer Science, University of Toronto(多伦多大学计算机科学系) Mathematical Institute, University of Oxford(牛津大学数学研究所) Vector Institute, Toronto, Canada(多伦多向量研究所)

AI总结 本文证明结构化线性受控微分方程(SLiCEs)是通用时间序列生成器,并提出生成式SLiCEs(G-SLiCEs)用于路径空间上的流匹配,在概率预测和下游任务中表现优异,尤其适用于不规则网格。

详情
AI中文摘要

最近关于状态空间模型(SSMs)序列通用性的工作引入了高效、最大表达性的连续时间方法用于时间序列建模。虽然这些工作侧重于判别设置,我们将这一视角扩展到生成式时间序列建模,通过证明最大表达性的结构化线性受控微分方程(SLiCEs)是通用时间序列生成器,即它们可以在$W_\infty$中逼近紧致潜在集上连续因果推前映射的诱导路径律。基于这些理论结果,我们提出了生成式SLiCEs(G-SLiCEs),一种用于路径空间上流匹配的最大表达性连续时间模型。实验上,我们表明表达性提高了概率预测和下游任务的性能,同时保留了连续时间模型的优势,例如泛化到任意观测网格。这对于不规则网格尤其有利,而固定网格模型通常难以处理此类网格。

英文摘要

Recent work on the sequence universality of State Space Models (SSMs) has introduced efficient, maximally expressive continuous-time approaches for time-series modelling. While these works focus on discriminative settings, we extend this perspective to generative time-series modelling by proving that maximally expressive Structured Linear Controlled Differential Equations (SLiCEs) are universal time-series generators, in the sense that they can approximate the induced path laws of continuous causal pushforwards on compact latent sets in $W_\infty$. Building on these theoretical results, we propose Generative SLiCEs (G-SLiCEs), a maximally expressive continuous-time model for flow matching on path-space. Empirically, we show that expressivity improves performance in probabilistic forecasting and downstream tasks, while retaining the advantages of continuous-time models such as generalising to arbitrary observation grids. This is particularly beneficial for irregular grids, where fixed-grid models often struggle.

2605.29906 2026-06-12 cs.LG 版本更新

Plan, Don't Pose: Long Composite Motion Generation with Text-Aligned BFM

计划,而非摆姿势:基于文本对齐的BFM的长复合运动生成

Nikolay Shvetsov, Maksim Bobrin, Nazar Buzun, Anton Bozhedarov, Dmitry V. Dylov

发表机构 * AvaCapo Potsdam University(波茨坦大学) Applied AI Institute(应用人工智能研究所) Computational Imaging Lab(计算成像实验室) AXXX Innopolis University(因诺波利斯大学)

AI总结 提出Text2BFM框架,通过将自然语言与预训练行为基础模型对齐,在潜在策略空间中实现长复合运动生成,无需端到端运动生成器。

详情
AI中文摘要

文本到运动(T2M)生成在角色动画、虚拟化身和人机交互中具有广泛应用。现有方法通常直接从语言生成姿态轨迹或运动令牌,迫使单个模型处理语义解释、长程结构和低级物理实现。这种耦合使得它们在处理长、复合或语义密集的提示时成本高昂且往往不可靠。我们提出Text2BFM,这是第一个将自然语言与预训练行为基础模型(BFM)对齐用于T2M生成的框架,无需依赖重型端到端运动生成器。Text2BFM在冻结的BFM的潜在策略空间中操作,将其用作可执行的运动先验。一个文本对齐的变分行为瓶颈将BFM策略潜在序列压缩成与语言兼容且保留长程行为结构的紧凑运动表示。生成在这个紧凑的行为流形上通过轻量级条件生成器进行,得到的潜在编码行为被解码为驱动预训练冻结BFM的策略潜在。通过将语义规划与运动执行解耦,Text2BFM实现了高效、鲁棒的T2M生成,并在长复合文本描述上表现出色。

英文摘要

Text-to-motion (T2M) generation has broad applications in character animation, virtual avatars, and human-robot interaction. Existing methods typically generate pose trajectories or motion tokens directly from language, forcing a single model to handle semantic interpretation, long-horizon structure, and low-level physical realization. This coupling makes them costly and often unreliable for long, compositional, or semantically dense prompts. We propose Text2BFM, the first framework that aligns natural language with pretrained Behavioral Foundation Models (BFMs) for T2M generation without relying on heavy end-to-end motion generators. Text2BFM operates in the latent policy space of a frozen BFM, using it as an executable motion prior. A text-aligned variational behavioral bottleneck compresses BFM policy-latent sequences into compact motion representations that are compatible with language and preserve long-horizon behavioral structure. Generation is performed in this compact behavioral manifold with a lightweight conditional generator, and the resulting latent encoded behaviors are decoded into policy latents that drive the pretrained frozen BFM. By decoupling semantic planning from motion execution, Text2BFM achieves efficient, robust T2M generation and strong performance on long, compositional textual descriptions.

2606.01172 2026-06-12 cs.LG stat.ME stat.ML 版本更新

Revisiting Neural Processes via Fourier Transform and Volterra Series

通过傅里叶变换和Volterra级数重新审视神经过程

Peiman Mohseni, Nick Duffield, Raymond K. W. Wong

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文利用Volterra展开和集合傅里叶卷积,提出了两种新的条件神经过程模型,解决了现有平移等变神经过程在可解释性和计算效率上的局限性。

详情
AI中文摘要

从有限的、不规则采样的测量中建模未知的潜在函数是科学和工程中的一个反复出现的挑战。神经过程(NPs)是一类概率函数模型,是有前景的解决方案——尤其是当赋予领域特定的对称性(如平移等变性)时,这提高了样本效率和泛化能力。然而,现有的平移等变NPs面临两个局限性:(i)它们堆叠带有非线性的通用组件,模糊了诱导的函数类并限制了可解释性;(ii)卷积设计依赖于具有局部感受野的核,并需要密集的均匀输入网格,而基于注意力的方法避免了这些问题,但随观测数量呈二次方缩放。我们通过两个贡献解决了这两个问题。首先,利用Volterra展开,我们将连续平移等变算子表征为高阶卷积的和,实现了分析透明性,同时允许通过一阶卷积进行高效近似。其次,我们引入了集合傅里叶卷积(SFConvs),这是一种频域参数化方法,直接在不规则采样点上操作,实现近似全局感受野,并在观测数量上线性缩放。基于这些思想,我们提出了两种条件神经过程(CNPs):SFConvCNPs,它堆叠带有非线性的SFConv块,以及SFVConvCNPs,它整合了Volterra公式。在合成和真实世界数据集上的实验证明了我们的方法相对于最先进基线的有效性。

英文摘要

Modeling unknown latent functions from finite, irregularly sampled measurements is a recurring challenge across science and engineering. Neural processes (NPs), a family of probabilistic functional models, are promising solutions -- especially when endowed with domain-specific symmetries like translation equivariance, which improve sample efficiency and generalization. Yet existing translation-equivariant NPs face two limitations: (i) they stack generic components with non-linearities, obscuring the induced function class and limiting interpretability; and (ii) convolutional designs rely on kernels with local receptive fields and require dense uniform input grids, while attention-based methods avoid these issues but scale quadratically with the number of observations. We address both with two contributions. First, using the Volterra expansion, we characterize continuous translation-equivariant operators as sums of higher-order convolutions, yielding analytical transparency while admitting efficient approximation by first-order convolutions. Second, we introduce set Fourier convolutions (SFConvs), a frequency-domain parameterization that operates directly on irregularly sampled points, achieves approximately global receptive fields, and scales linearly in the number of observations. Building on these ideas, we propose two conditional NPs (CNPs): SFConvCNPs, which stack SFConv blocks with non-linearities, and SFVConvCNPs, which integrate the Volterra formulation. Experiments on synthetic and real-world datasets demonstrate our methods' efficacy against state-of-the-art baselines.

2606.02133 2026-06-12 cs.LG cs.AI 版本更新

Variational Learning for Insertion-based Generation

基于插入生成的变分学习

Yangtian Zhang, Zhe Wang, Arthur Gretton, Rex Ying, David van Dijk, Michalis K. Titsias, Jiaxin Shi

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出插入过程(IP)模型,通过排列变分推断联合学习插入位置、内容和终止条件,支持变长生成并提升非自回归序列建模质量。

AI中文摘要

非单调序列生成方法,如掩码扩散模型,通过允许以非固定和预设的顺序生成token,为从左到右的自回归建模提供了一种灵活的替代方案。尽管具有实际优势,但大多数现有的非单调模型是顺序无关的,并依赖于固定长度的网格,限制了它们支持变长生成和自适应插入顺序的能力。在这项工作中,我们引入了一个概率框架,用于在变长插入模型中学习插入顺序。我们形式化了插入轨迹与排列之间的双射对应关系,这使得数据似然能够精确重参数化为排列上的和。基于这一结果,我们提出了插入过程(IP),这是一种随机生成模型,它联合学习在哪里插入、插入什么以及何时终止,并通过基于排列的变分推断进行训练。与先前的固定画布方法不同,IP原生支持变长生成,并学习数据驱动的插入顺序偏好。在目标条件规划和分子字符串生成上的实验表明,在缺乏规范从左到右结构的领域中,学习插入顺序提高了建模质量和泛化能力。

英文摘要

Non-monotonic sequence generation methods, such as masked diffusion models, provide a flexible alternative to left-to-right autoregressive modeling by allowing tokens to be generated in non-fixed and prescribed orders. Despite their practical advantages, most existing non-monotonic models are order-agnostic and rely on a fixed-length grid, limiting their ability to support variable-length generation and adaptive insertion order. In this work, we introduce a probabilistic framework for learning insertion order in variable-length insertion models. We formalize a bijective correspondence between insertion trajectories and permutations, which enables an exact reparameterization of the data likelihood as a sum over permutations. Building on this result, we propose the Insertion Process (IP), a stochastic generative model that jointly learns where to insert, what to insert, and when to terminate, trained via permutation-based variational inference. Unlike prior fixed-canvas approaches, IP natively supports variable-length generation and learns data-driven preferences over insertion orders. Experiments on goal-conditioned planning and molecular string generation demonstrate that learning insertion order improves both modeling quality and generalization in domains without a canonical left-to-right structure.
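The correspondence between insertion trajectories and permutations mentioned in the abstract can be illustrated with a small sketch. This is an illustrative construction under stated assumptions, not the paper's formalization: the function names `trajectory_from_permutation` and `replay` are hypothetical, and a trajectory is assumed to be a list of `(slot, token)` steps applied to an initially empty canvas.

```python
from itertools import permutations


def trajectory_from_permutation(seq, order):
    """Turn a generation order over final positions into insertion steps.

    order[t] is the final position of the token inserted at step t.
    Each step is (slot, token): insert `token` at index `slot` of the
    current partial sequence.
    """
    placed = []  # final positions that have already been inserted
    steps = []
    for pos in order:
        # The slot is the number of already-placed tokens that end up to the left.
        slot = sum(1 for p in placed if p < pos)
        steps.append((slot, seq[pos]))
        placed.append(pos)
    return steps


def replay(steps):
    """Apply insertion steps to an empty canvas and return the result."""
    canvas = []
    for slot, token in steps:
        canvas.insert(slot, token)
    return canvas


seq = list("abcd")
# One trajectory per permutation of the final positions.
trajectories = {
    tuple(trajectory_from_permutation(seq, order))
    for order in permutations(range(len(seq)))
}
```

For a sequence with distinct tokens, every permutation yields a different trajectory and every trajectory replays to the same final sequence, which is why a likelihood over sequences can be written as a sum over orders.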

2402.01779 2026-06-12 eess.IV cs.CV cs.LG stat.ML 版本更新

Plug-and-Play image restoration with Stochastic deNOising REgularization

即插即用图像恢复:随机去噪正则化

Marien Renaud, Jean Prost, Arthur Leclaire, Nicolas Papadakis


AI总结 提出SNORE框架,仅在适当噪声水平图像上应用去噪器,结合随机正则化与梯度下降求解逆问题,在去模糊和修复任务上达到SOTA。

AI中文摘要

即插即用(PnP)算法是一类迭代算法,通过结合物理模型和深度神经网络进行正则化来解决图像逆问题。尽管它们能产生令人印象深刻的图像恢复结果,但这些算法依赖于在迭代过程中噪声逐渐减小的图像上非标准地使用去噪器,这与最近基于扩散模型(DM)的算法形成对比,后者仅在重新加噪的图像上应用去噪器。我们提出了一种新的PnP框架,称为随机去噪正则化(SNORE),该框架仅在具有适当噪声水平的图像上应用去噪器。它基于显式的随机正则化,从而产生一种随机梯度下降算法来解决不适定逆问题。提供了该算法及其退火扩展的收敛性分析。实验上,我们证明SNORE在去模糊和修复任务上与最先进方法相比具有竞争力,无论是在定量还是定性方面。

英文摘要

Plug-and-Play (PnP) algorithms are a class of iterative algorithms that address image inverse problems by combining a physical model and a deep neural network for regularization. Although they produce impressive image restoration results, these algorithms rely on a non-standard use of a denoiser on images that become less and less noisy over the iterations, which contrasts with recent algorithms based on Diffusion Models (DM), where the denoiser is applied only to re-noised images. We propose a new PnP framework, called Stochastic deNOising REgularization (SNORE), which applies the denoiser only to images with an adequate noise level. It is based on an explicit stochastic regularization, which leads to a stochastic gradient descent algorithm for solving ill-posed inverse problems. A convergence analysis of this algorithm and its annealing extension is provided. Experimentally, we show that SNORE is competitive with state-of-the-art methods on deblurring and inpainting tasks, both quantitatively and qualitatively.
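The idea of applying the denoiser only to re-noised iterates can be sketched on a toy problem. This is a minimal sketch under strong assumptions, not the authors' implementation: the data term is a plain denoising fidelity, the clean signal is assumed Gaussian so that the MMSE denoiser has a closed form, and the names `gaussian_mmse_denoiser` and `snore_like_sgd` as well as all step sizes are hypothetical.

```python
import numpy as np


def gaussian_mmse_denoiser(x_noisy, sigma, prior_std=1.0):
    # Exact MMSE denoiser when the clean signal is N(0, prior_std^2 I)
    # and the noise is N(0, sigma^2 I). Stands in for a learned denoiser.
    return (prior_std**2 / (prior_std**2 + sigma**2)) * x_noisy


def snore_like_sgd(y, tau=1.0, sigma=0.5, alpha=1.0, lr=0.05, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    x = y.copy()
    for _ in range(steps):
        # Gradient of the data term ||x - y||^2 / (2 tau^2).
        grad_data = (x - y) / tau**2
        # Re-noise the iterate so the denoiser sees noise of level sigma.
        x_noisy = x + sigma * rng.standard_normal(x.shape)
        # Stochastic regularization gradient: denoising residual / sigma^2.
        grad_reg = (x_noisy - gaussian_mmse_denoiser(x_noisy, sigma)) / sigma**2
        x = x - lr * (grad_data + alpha * grad_reg)
    return x


y = 2.0 * np.ones(64)
x_hat = snore_like_sgd(y)
# Closed-form minimizer of the expected objective for this Gaussian toy case.
x_star = y / (1.0 + 1.0 / (1.0 + 0.5**2))
```

In this toy setting the iterates fluctuate around the closed-form minimizer `x_star`; with a learned denoiser and a blur or masking operator in the data term no such closed form exists, which is where a convergence analysis becomes relevant.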

2508.21531 2026-06-12 stat.ML cs.LG stat.CO 版本更新

Adaptive generative moment matching networks for improved learning of dependence structures

自适应生成矩匹配网络用于改进依赖结构学习

Marius Hofert, Gan Yao

发表机构 * Department of Statistics and Actuarial Science, The University of Hong Kong(香港大学统计与精算科学系)

AI总结 提出自适应带宽选择的最大均值差异混合核用于生成矩匹配网络,通过增加核数量和早停策略提升训练性能,在copula随机数生成、高维收敛率及金融数据依赖建模中优于传统方法。

AI中文摘要

引入了一种用于最大均值差异(MMD)中混合核的自适应带宽选择程序,以拟合生成矩匹配网络(GMMNs),并展示了copula随机数生成器的改进学习。基于训练损失的相对误差,在训练过程中增加核的数量;此外,验证损失的相对误差被用作早停标准。虽然训练时间保持相似,但自适应训练GMMNs(AGMMNs)显著提高了训练性能,这通过验证MMD轨迹、样本和验证MMD值得以展示。在三个应用中,AGMMNs相比GMMNs和参数copula模型也表现出优越性。首先,首次在高达100维的维度中研究了基于copula的准随机与伪随机样本的估计量收敛速度。其次,重复的验证MMD以及蒙特卡洛和准蒙特卡洛应用证明了AGMMNs在去GARCH化后的标普500指数50个成分所隐含的copula模型上的改进训练。最后,后一个数据集和富时100指数的50个成分被用于证明AGMMNs的改进训练确实转化为改进的模型预测。

英文摘要

An adaptive bandwidth selection procedure for the mixture kernel in the maximum mean discrepancy (MMD) for fitting generative moment matching networks (GMMNs) is introduced, and improved learning of copula random number generators is demonstrated. Based on the relative error of the training loss, the number of kernels is increased during training; additionally, the relative error of the validation loss is used as an early stopping criterion. While training time remains similar, adaptively training GMMNs (AGMMNs) significantly increases training performance, which is shown based on validation MMD trajectories, samples and validation MMD values. Superiority of AGMMNs over GMMNs and parametric copula models is also demonstrated in terms of three applications. First, convergence rates of estimators based on quasi-random versus pseudo-random samples from copulas are investigated in dimensions as large as 100 for the first time. Second, replicated validation MMDs, as well as Monte Carlo and quasi-Monte Carlo applications demonstrate the improved training of AGMMNs for a copula model implied by the 50 constituents of the S&P 500 index after deGARCHing. Last, both the latter dataset and 50 constituents of the FTSE 100 are used to demonstrate that the improved training of AGMMNs indeed translates to an improved model prediction.
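The training loss referred to in the abstract is a maximum mean discrepancy with a mixture of kernels. The following sketch computes a biased estimate of the squared MMD with an equally weighted mixture of Gaussian kernels. The bandwidth values are hypothetical and fixed; the adaptive procedure that adds kernels during training is not reproduced here.

```python
import numpy as np


def mixture_kernel(a, b, bandwidths):
    # Equal-weight mixture of Gaussian kernels exp(-||a_i - b_j||^2 / (2 h^2)).
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from rounding
    return np.mean([np.exp(-sq / (2.0 * h**2)) for h in bandwidths], axis=0)


def mmd2(x, y, bandwidths):
    # Biased (V-statistic) estimate of the squared MMD between two samples.
    return (
        mixture_kernel(x, x, bandwidths).mean()
        + mixture_kernel(y, y, bandwidths).mean()
        - 2.0 * mixture_kernel(x, y, bandwidths).mean()
    )


rng = np.random.default_rng(0)
x = rng.uniform(size=(200, 2))
y_same = rng.uniform(size=(200, 2))          # same distribution as x
y_shifted = rng.uniform(size=(200, 2)) ** 3  # mass pushed toward the origin
bandwidths = [0.05, 0.1, 0.5]                # hypothetical values

mmd_same = mmd2(x, y_same, bandwidths)
mmd_shifted = mmd2(x, y_shifted, bandwidths)
```

A mixture over several bandwidths makes the statistic sensitive to discrepancies at several scales, which is the motivation for choosing the number and size of bandwidths carefully.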

2602.09730 2026-06-12 cs.CV cs.LG cs.NA math.NA 版本更新

Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings

龟裂的魅力:一种变分-生成式绘画裂纹检测方法

Laura Paul, Holger Rauhut, Martin Burger, Samira Kabri, Tim Roith

发表机构 * Dept. of Mathematics, LMU Munich(慕尼黑大学数学系) Munich Center for Machine Learning(慕尼黑机器学习中心) Helmholtz Imaging, Deutsches Elektronen-Synchrotron DESY(亥姆霍兹成像,德国电子同步加速器研究所) Fachbereich Mathematik, University of Hamburg(汉堡大学数学系) CIT School, Technical University of Munich(慕尼黑工业大学计算、信息与技术学院)

AI总结 提出混合方法,将裂纹检测建模为逆问题,用深度生成模型作为画作先验,结合Mumford-Shah变分泛函和裂纹先验,通过联合优化获得像素级裂纹定位图。

AI中文摘要

近期成像技术、深度学习与数值性能的进步使得对艺术品的非侵入性详细分析成为可能,支持其记录与保护。特别是,数字化绘画中龟裂的自动检测对于评估退化和指导修复至关重要,但由于可能复杂的场景以及裂纹与类似裂纹的艺术特征(如笔触或毛发)之间的视觉相似性,这仍然具有挑战性。我们提出一种混合方法,将裂纹检测建模为一个逆问题,将观测图像分解为无裂纹绘画和裂纹分量。采用深度生成模型作为底层艺术品的有力先验,同时使用Mumford-Shah型变分泛函结合裂纹先验来捕捉裂纹结构。联合优化得到绘画中裂纹定位的像素级图。

英文摘要

Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as a powerful prior for the underlying artwork, while crack structures are captured using a Mumford--Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack locations in the painting.

2603.11242 2026-06-12 stat.ML cs.LG 版本更新

A Unified Latent Space Disentanglement VAE Framework with Robust Disentanglement Effectiveness Evaluation

统一潜在空间解缠的VAE框架及鲁棒的解缠效果评估

Xiaoan Lang, Md Mostafizer Rahman, Fang Liu

发表机构 * Department of Applied and Computational Mathematics and Statistics(应用与计算数学与统计系) Lucy Family Institute for Data & Society(数据与社会学院)

AI总结 提出统一框架bfVAE整合多种解缠VAE方法,并开发FVH-LT和DBSR-LS评估解缠效果,引入LSSI指标量化潜在结构分离,无需真实生成因子。

AI中文摘要

评估和解释潜在表示(如变分自编码器VAE)对于多样数据类型仍然是一个重大挑战,尤其是当真实生成因子未知时。为此,我们将几种最先进的用于潜在空间解缠的VAE方法统一到一个框架——bfVAE中。为了评估解缠VAE模型的有效性并增强潜在空间可解释性,我们提出了通过潜在遍历的特征方差异质性(FVH-LT)和潜在空间中的脏块稀疏回归(DBSR-LS)。为了确保学习到的潜在空间的鲁棒可解释性,我们开发了一种贪婪对齐策略(GAS),该策略减轻了标签切换问题,并对齐不同运行中的潜在维度,为结果聚合奠定基础。我们还引入了一个方便的标量潜在空间分离指数(LSSI),该指数基于FVH-LT和DBSR-LS的GAS对齐输出,在不知道真实生成因子的情况下总结整体潜在结构分离。我们将bfVAE与五个VAE模型进行比较,并在七个表格和图像数据集上验证了FVH-LT、DBSR-LS和LSSI的有效性。在我们检查的实验设置下,bfVAE提供了一个更灵活的解缠框架,在解缠和重构之间实现了比基准VAE模型更有利的整体权衡;FVH-LT和DBSR-LS可靠地揭示了语义上有意义且与领域相关的潜在结构,并且通常产生一致的结果;LSSI对潜在结构分离做出了有效的定量总结。

英文摘要

Evaluating and interpreting latent representations, such as those learned by variational autoencoders (VAEs), remains a significant challenge for diverse data types, especially when ground-truth generative factors are unknown. To address this, we unify several state-of-the-art disentangled VAE approaches for latent space disentanglement into one framework -- bfVAE. To assess the effectiveness of a disentangled VAE model and enhance latent space interpretability, we propose Feature Variance Heterogeneity via Latent Traversal (FVH-LT) and Dirty Block Sparse Regression in Latent Space (DBSR-LS). To ensure robust interpretability of the learned latent space, we develop a greedy alignment strategy (GAS) that mitigates label switching and aligns latent dimensions across runs, laying the foundation for result aggregation. We also introduce a convenient scalar latent space separation index (LSSI) based on the GAS-aligned outputs of FVH-LT and DBSR-LS to summarize the overall latent structural separation without knowledge of the ground-truth generative factors. We compare bfVAE to five VAE models and validate the effectiveness of FVH-LT, DBSR-LS, and LSSI on seven tabular and image datasets. Under our examined experimental settings, bfVAE provides a more flexible disentanglement framework and achieves a more favorable overall trade-off between disentanglement and reconstruction than the benchmark VAE models; FVH-LT and DBSR-LS reliably uncover semantically meaningful and domain-relevant latent structures and generally yield consistent results; and LSSI provides an effective quantitative summary of latent structural separation.

5. 优化、泛化与理论分析 25 篇

2606.12691 2026-06-12 cs.LG cs.AI cs.SY eess.SY math.OC stat.ML 新提交

Two-Layer Linear Auto-Regressive Models Estimate Latent States

两层线性自回归模型估计潜在状态

Yahya Sattar, Sunmook Choi, Leo Maynard-Zhang, Yassir Jedra, Maryam Fazel, Sarah Dean

AI总结 本文证明两层线性自回归模型通过经验风险最小化训练时,能近似卡尔曼滤波,恢复潜在状态估计,并提供有限样本保证。

Comments ICML 2026

AI中文摘要

自回归模型已成为处理序列数据(从语言到视频)的强大工具。理解这些模型如何以及为何学习潜在表示仍然是一个开放的理论问题。在这项工作中,我们证明,当在部分观测的线性动力系统的数据上通过经验风险最小化训练时,两层线性自回归模型自然学会近似卡尔曼滤波。特别地,我们表明,学习到的隐藏表示与最优(卡尔曼)滤波器产生的状态估计一致,仅相差一个相似变换,尽管模型没有关于底层动力学或状态的显式知识。该结果基于三个主要见解。首先,我们建立卡尔曼滤波器可以被具有有界截断误差的自回归模型很好地近似。其次,我们表明,尽管非凸性,两层优化景观是良性的,即所有驻点要么是严格鞍点,要么是全局最小值。最后,作为我们的主要贡献,我们提供了关于预测误差、参数估计误差和潜在状态恢复的有限样本保证。数值模拟支持理论结果,并表明自回归模型的潜在表示恢复了状态估计。

英文摘要

Auto-regressive models have emerged as powerful tools for sequential data, from language to video. Understanding how and why these models learn latent representations remains an open theoretical question. In this work, we demonstrate that when trained by empirical risk minimization on data from partially observed linear dynamical systems, two-layer linear auto-regressive models naturally learn to approximate Kalman filtering. In particular, we show that the learned hidden representation coincides, up to a similarity transformation, with the state estimates produced by the optimal (Kalman) filter, even though the model has no explicit knowledge of the underlying dynamics or state. The result follows from three main insights. First, we establish that the Kalman filter is well approximated by an auto-regressive model with bounded truncation error. Second, we show that despite non-convexity, the two-layer optimization landscape is benign, i.e., all stationary points are either strict saddles or global minima. Finally, as our main contributions, we provide finite-sample guarantees on prediction error, parameter estimation error, and latent state recovery. Numerical simulations support the theoretical results and demonstrate that the latent representations of auto-regressive models recover state estimates.
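The first insight in the abstract, that the Kalman filter is well approximated by an auto-regressive model with bounded truncation error, can be checked numerically on a small system. The sketch below is an illustration under assumptions, not the paper's construction: the system matrices are hypothetical, the steady-state gain is obtained by simply iterating the Riccati recursion, and the unrolled predictor uses coefficients `G_k = C (A - L C)^k L`.

```python
import numpy as np


def steady_state_gain(A, C, Q, R, iters=500):
    # Iterate the Riccati recursion to reach the steady-state predictor gain L.
    P = np.eye(A.shape[0])
    for _ in range(iters):
        S = C @ P @ C.T + R
        L = A @ P @ C.T @ np.linalg.inv(S)
        P = A @ P @ A.T - L @ S @ L.T + Q
    S = C @ P @ C.T + R
    return A @ P @ C.T @ np.linalg.inv(S)


def ar_coefficients(A, C, L, horizon):
    # Unrolling x_hat[t+1] = (A - L C) x_hat[t] + L y[t] gives
    # y_hat[t+1] = sum_k G_k y[t-k] with G_k = C (A - L C)^k L.
    F = A - L @ C
    coeffs, M = [], np.eye(A.shape[0])
    for _ in range(horizon):
        coeffs.append(C @ M @ L)
        M = M @ F
    return coeffs


# Hypothetical partially observed system: only the first state is measured.
A = np.array([[0.9, 0.2], [0.0, 0.7]])
C = np.array([[1.0, 0.0]])
Q, R = 0.1 * np.eye(2), np.array([[0.5]])
L = steady_state_gain(A, C, Q, R)
rho = max(abs(np.linalg.eigvals(A - L @ C)))  # decay rate of the AR coefficients
H = 40
G = ar_coefficients(A, C, L, H)

# Simulate observations.
rng = np.random.default_rng(0)
T, x, ys = 300, np.zeros(2), []
for _ in range(T):
    ys.append(C @ x + np.sqrt(R[0, 0]) * rng.standard_normal(1))
    x = A @ x + np.sqrt(Q[0, 0]) * rng.standard_normal(2)
ys = np.array(ys)

# Recursive steady-state Kalman predictions of the next observation.
x_hat, kalman_pred = np.zeros(2), []
for t in range(T):
    x_hat = (A - L @ C) @ x_hat + L @ ys[t]
    kalman_pred.append((C @ x_hat)[0])

# Truncated AR predictions using only the last H observations.
ar_pred = [sum((G[k] @ ys[t - k])[0] for k in range(H)) for t in range(H - 1, T)]
max_gap = np.max(np.abs(np.array(kalman_pred[H - 1:]) - np.array(ar_pred)))
```

Because the spectral radius `rho` of `A - L C` is below one, the coefficients `G_k` decay geometrically, so the gap between the recursive filter and the truncated predictor shrinks geometrically in `H`.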

2606.12763 2026-06-12 cs.LG cs.DS 新提交

Adaptive Weighted Averaging

自适应加权平均

Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

发表机构 * University of Utah(犹他大学) Boston University(波士顿大学) Google(谷歌)

AI总结 提出一种从单次无偏估计中选取最大未知值的方法,具有可容许性且不劣于基线,应用于随机优化获得在线到批次的转换界限。

AI中文摘要

我们研究在仅对每个 $x_i$ 有一个无偏估计 $y_i$ 的情况下,从 $n$ 个未知值 $x_1,\dots,x_n$ 中选择最大值的问题。我们设计的策略同时具有可容许性(不被任何其他策略一致支配)且不劣于给定的基线(如均匀随机选择)。我们将其应用于随机优化,获得了具有理想“无妥协”保证的在线到批次转换界限:它们从不比标准随机迭代选择差,同时在良性设置中可以显著更好。

英文摘要

We study the problem of selecting the largest among $n$ unknown values $x_1,\dots,x_n$ given only a single unbiased estimate $y_i$ for each $x_i$. We design strategies that are simultaneously admissible (not uniformly dominated by any other strategy) and also never worse than a given baseline such as uniform random selection. We provide an application to stochastic optimization, where we obtain online-to-batch conversion bounds with a desirable "no-compromise" guarantee: they are never worse than standard random iterate selection, and yet can be significantly better in benign settings.

2606.12921 2026-06-12 cs.LG cs.AI 新提交

LoRA-Muon: Spectral Steepest Descent on the Low-Rank Manifold

LoRA-Muon:低秩流形上的谱最速下降

Franz Louis Cesista, Katherine Crowson, Cédric Simal, Stella Biderman

发表机构 * Ateneo de Manila University(雅典耀马尼拉大学) EleutherAI NaXys, UNamur(纳慕尔大学NaXys研究所)

AI总结 提出LoRA-Muon优化器,将Muon的谱最速下降规则应用于低秩微调,解决LoRA对初始化敏感、最优学习率跨秩迁移差等问题,在TinyShakespeare上以秩32达到比稠密基线更低的验证损失。

Comments 20 pages, 4 figures

AI中文摘要

低秩适应(LoRA)显著降低了微调深度学习模型的计算和内存成本,但通常比稠密训练更难调优:当使用因子级优化器(如AdamW)时,它对初始化选择敏感,其最优学习率在秩之间迁移性差,且常常无法超越稠密基线。我们通过将Muon优化器的谱最速下降规则应用于低秩设置,推导出LoRA-Muon。结合我们的分裂权重衰减规则,我们的主要主张是LoRA-Muon是全秩Muon和Shampoo族优化器的一个良好的低秩代理。其最优学习率在秩、宽度、深度和因子重缩放之间均可迁移。在我们计算匹配的TinyShakespeare研究中,秩2代理恢复了稠密最佳测试学习率,秩32的LoRA-Muon运行在种子平均扫描中达到了比稠密基线更低的平均验证损失。我们进一步表明,Spectron优化器依赖于任意的因子缩放,因此在从严重不平衡的因子开始微调时可能不太适用,并且LoRA-RITE的简化QR坐标核心实现了相同的谱更新。LoRA-Muon无需QR分解即可计算该更新,并避免存储二阶矩,使其更易于加速器使用且内存效率更高。

英文摘要

Low-Rank Adaptation (LoRA) significantly reduces compute and memory costs for finetuning Deep Learning models but is often harder to tune than dense training: when using factor-wise optimizers such as AdamW, it is sensitive to initialization choices, its optimal learning rates transfer poorly across ranks, and it often fails to beat dense baselines. We derive LoRA-Muon by applying the Muon optimizer's spectral steepest-descent rule to the low-rank setting. Along with our split weight-decay rule, our main claim is that LoRA-Muon is a good low-rank proxy for full-rank Muon and Shampoo-family optimizers. Its optimal learning rates transfer across rank, width, depth, and factor-rescaling. In our compute-matched TinyShakespeare study, a rank-$2$ proxy recovers the dense best tested learning rate, and a rank-$32$ LoRA-Muon run attains lower mean validation loss than the dense baseline in the seed-averaged sweep. We further show that the Spectron optimizer depends on arbitrary factor scaling, so it would likely be a poor fit when finetuning starts from badly imbalanced factors, and that LoRA-RITE's simplified QR-coordinate core implements the same spectral update. LoRA-Muon computes that update without QR-decomposition and avoids storing second moments, making it more accelerator-friendly and memory-efficient.
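The spectral steepest-descent rule that the abstract attributes to Muon can be sketched for a single dense gradient matrix. This is a minimal sketch with an exact SVD swapped in for clarity (Muon implementations typically approximate the same orthogonalization with a Newton-Schulz iteration); the low-rank rule and the split weight decay from the paper are not reproduced, and the function name is hypothetical.

```python
import numpy as np


def spectral_steepest_direction(grad):
    # With grad = U diag(s) V^T, the direction D of unit spectral norm that
    # maximizes the inner product <grad, D> is D = U V^T.
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return U @ Vt


rng = np.random.default_rng(0)
grad = rng.standard_normal((6, 4))
direction = spectral_steepest_direction(grad)

# The achieved inner product equals the nuclear norm of the gradient.
alignment = np.sum(grad * direction)
nuclear_norm = np.linalg.svd(grad, compute_uv=False).sum()
direction_singular_values = np.linalg.svd(direction, compute_uv=False)
```

All singular values of the update equal one, so the step size alone sets the spectral norm of the update regardless of how the gradient's singular values are distributed.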

2606.12930 2026-06-12 cs.LG 新提交

Is Spurious Correlation Removal Always Learnable?

虚假相关性去除是否总是可学习的?

Yibo Zhou, Bo Li, Hai-Miao Hu, Hanzi Wang, Xiaokang Zhang, Ruifan Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 研究不变学习在统计可识别时的计算障碍,证明存在一维不变子空间的可采样多环境实例,多项式时间算法无法达到常数精度,并量化环境多样性对可识别性和风险的影响。

Comments poster paper in ICML-2026

AI中文摘要

即使不变结构在统计上是可识别的,不变学习也可能失败。我们展示了一个条件计算障碍:在由平均情况稀疏恢复归约驱动的黑盒可采样监督稀疏恢复原语下,存在具有一维预测不变子空间($k=1$)的可采样多环境实例,这些实例可以通过穷举搜索用多项式样本学习,而任何多项式时间常数精度恢复算法都会与该原语矛盾。我们进一步通过分离参数$\gamma$量化环境多样性,该参数控制可识别性和不变性目标的曲率。在充分多样性和局部高斯正则性下,极小极大风险为$\mathbb{E}[\operatorname{dist}(\hat{V},V_{\mathrm{inv}})^2]=\Theta(k(d-k)/(n|\mathcal{E}|))$,在标签诱导的偏移下,在$n^*\propto k(d-k)/(|\mathcal{E}|\gamma^2)$处发生相变,估计误差缩放比例与$1/\gamma^2$成正比。合成和真实数据集说明了预测的差距和转变,并激发了简单的多样性诊断。

英文摘要

Invariant learning can fail even when the invariant structure is statistically identifiable. We show a conditional computational barrier: under a black-box samplable supervised sparse recovery primitive motivated by average-case sparse-recovery reductions, there exist samplable multi-environment instances with a one-dimensional predictive invariant subspace ($k=1$) that are learnable with polynomial samples by exhaustive search, while any polynomial-time constant-accuracy recovery algorithm would contradict the primitive. We further quantify environment diversity by a separation parameter $\gamma$, which controls identifiability and the curvature of invariance objectives. Under sufficient diversity and local Gaussian regularity, the minimax risk is $\mathbb{E}[\operatorname{dist}(\hat{V},V_{\mathrm{inv}})^2]=\Theta(k(d-k)/(n|\mathcal{E}|))$, and under label-induced shifts a phase transition occurs at $n^*\propto k(d-k)/(|\mathcal{E}|\gamma^2)$ with refined estimation error scaling proportional to $1/\gamma^2$. Synthetic and real datasets illustrate the predicted gaps and transitions and motivate simple diversity diagnostics.

2606.12990 2026-06-12 cs.LG 新提交

Exposure Bias as Epistemic Underidentification in Recursive Forecasting

递归预测中的曝光偏差作为认知欠识别问题

Riku Green, Zahraa S. Abdallah, Telmo M Silva Filho

发表机构 * University of Bristol(布里斯托大学)

AI总结 本文证明递归多步预测中的曝光偏差不仅是分布偏移,更是部分可观测性下的认知欠识别问题,并提出基于来源变量的误差分解与校正方法。

Comments Accepted for ICML 2026 EIML workshop

AI中文摘要

递归多步预测通常被表述为分布偏移:模型在观测历史数据上训练,但部署于自身预测结果上。我们通过证明在部分可观测性或状态截断下,递归展开也是一个认知欠识别问题,表明这种表述是不完整的。即使具有确定性潜在动力学,一步贝叶斯监督仅在观测上下文中识别行为,一旦展开查询自生成诱导状态(其正确的局部目标不能仅由数值状态确定),则无需识别部署的递归预测器。我们通过诱导状态 $Z$ 和来源变量 $P$ 形式化这一点,并推导出诱导状态误差分解为教师强制/展开不匹配、表示-类别逼近和来源信息差距。实验表明,展开进入一个不同的诱导状态区域,固定诱导状态定义了一个不同的局部校正任务,闭环增益不仅来自局部适应,还来自改变展开期间访问的诱导状态。使用简单的二进制来源编码,来源感知校正可以进一步提高性能,尽管增益是有条件的而非均匀的。这些结果将曝光偏差重新定义为自诱导认知不确定性下的推理。

英文摘要

Recursive multi-step forecasting is usually framed as distribution shift: models are trained on observed histories but deployed on their own predictions. We show this framing is incomplete by proving that, under partial observability or state truncation, recursive rollout is also an epistemic underidentification problem. Even with deterministic latent dynamics, one-step Bayes supervision identifies behavior only on observed contexts and need not identify the deployed recursive predictor once rollout queries self-generated induced states whose correct local targets are not determined by numeric state alone. We formalize this with induced states $Z$ and provenance variables $P$, and derive a decomposition of induced-state error into teacher-forcing/rollout mismatch, representation--class approximation, and provenance information gaps. Empirically, we show that rollout enters a distinct induced-state regime, that fixed induced states define a distinct local corrective task, and that closed-loop gains arise not only from local adaptation but also from changing the induced states visited during rollout. Using a simple binary provenance encoding, provenance-aware correction can further improve performance, though gains are conditional rather than uniform. These results recast exposure bias as reasoning under self-induced epistemic uncertainty.

2606.13067 2026-06-12 cs.LG 新提交

Limits of spectral learning under noise

噪声下谱学习的极限

Sabin Roman, Ljupco Todorovski, Saso Dzeroski, Marta Sales-Pardo, Roger Guimera

发表机构 * Jožef Stefan Institute(约瑟夫·斯特凡研究所) Faculty of Mathematics and Physics, University of Ljubljana(卢布尔雅那大学数学与物理学院) Department of Chemical Engineering, Universitat Rovira i Virgili(罗维拉-威尔吉利大学化学工程系) Center for Computational Science and Applied Mathematics (ComSCIAM), Universitat Rovira i Virgili(罗维拉-威尔吉利大学计算科学与应用数学中心) ICREA(加泰罗尼亚研究与高等研究院)

AI总结 研究监督回归中加性标签噪声对谱方法的影响,推导出噪声导致系数漂移的闭合表达式,揭示了由单一内在噪声尺度控制的通用退化曲线。

AI中文摘要

从含噪数据中学习函数关系是科学推理的核心问题。谱方法通过将未知函数在基函数上展开并从数据中估计相应系数来逼近函数,但这些系数在噪声下的稳定性仍知之甚少。本文研究使用稀疏谱表示在多个基和维度下进行带加性标签噪声的监督回归。我们表明,噪声会导致学习到的系数向量发生可预测的漂移,其大小取决于有效活跃谱模式的数量。在对经验特征几何进行白化后,我们推导出含噪与无噪系数向量之间重叠的闭合表达式,揭示了一条由单一内在噪声尺度控制的通用退化曲线。在傅里叶、勒让德、贝塞尔和哈尔基上的数值实验证实了理论预测。结果表明,谱学习存在一个基本噪声阈值,超过该阈值系数估计变得不稳定,从而对从含噪数据中恢复函数结构施加了内在限制。

英文摘要

Learning functional relationships from noisy data is a central problem in scientific inference. Spectral methods approximate unknown functions by expanding them in a basis and estimating the corresponding coefficients from data, but the stability of these coefficients under noise remains poorly understood. Here we study supervised regression with additive label noise using sparse spectral representations across multiple bases and dimensions. We show that noise induces a predictable drift in the learned coefficient vector whose magnitude depends on the effective number of active spectral modes. After whitening the empirical feature geometry, we derive a closed-form expression for the overlap between noisy and noiseless coefficient vectors, revealing a universal degradation curve governed by a single intrinsic noise scale. Numerical experiments across Fourier, Legendre, Bessel, and Haar bases confirm the theoretical prediction. The results demonstrate that spectral learning exhibits a fundamental noise threshold beyond which coefficient estimates become unstable, placing intrinsic limits on recovering functional structure from noisy data.
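The quantity studied in the abstract, the overlap between coefficient vectors fitted on noisy and noiseless labels, can be measured directly in a small experiment. The sketch below is an assumption-laden illustration: it uses a Legendre basis with ordinary least squares, defines overlap as cosine similarity, and picks hypothetical target coefficients and noise levels. The whitening step and the closed-form degradation curve from the paper are not reproduced.

```python
import numpy as np
from numpy.polynomial import legendre


def fit_coefficients(x, y, degree):
    # Least-squares estimate of the expansion coefficients in the Legendre basis.
    features = legendre.legvander(x, degree)
    coef, *_ = np.linalg.lstsq(features, y, rcond=None)
    return coef


def overlap(a, b):
    # Cosine similarity between two coefficient vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
true_coef = np.array([0.0, 1.0, 0.0, -0.5, 0.0, 0.25])  # hypothetical sparse target
x = rng.uniform(-1.0, 1.0, size=200)
clean = legendre.legval(x, true_coef)
clean_coef = fit_coefficients(x, clean, degree=5)

overlaps = {}
for noise_std in (0.01, 1.0, 50.0):
    noisy = clean + noise_std * rng.standard_normal(x.shape)
    overlaps[noise_std] = overlap(fit_coefficients(x, noisy, degree=5), clean_coef)
```

Sweeping `noise_std` on a finer grid and plotting `overlaps` gives an empirical degradation curve that can be compared against a theoretical prediction.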

2606.13092 2026-06-12 cs.LG cs.RO math.DS 新提交

Scale Buys Interpolation, Structure Buys a Horizon: Certified Predictability for Equivariant World Models

规模买插值,结构买地平线:等变世界模型的认证可预测性

Hongbo Wang

AI总结 针对等变潜在世界模型,提出可计算的多步可预测地平线认证,证明T步滚动误差在对称轨道上恒定,并由李雅普诺夫谱分层界定,且该认证为等变模型独有。

Comments 23 pages (9 main + appendices). Code: https://github.com/TimothyWang418/se3-ejepa

AI中文摘要

规模买插值;结构买认证的地平线。世界模型的平均误差无法说明特定预测是否可信,或可信多久。对于等变潜在世界模型,我们给出可计算的多步可预测地平线认证:$T$步滚动误差在每个对称轨道上恒定(定理A),并由预测器的李雅普诺夫谱逐通道分层,$T_j(\epsilon)\sim\log(1/\epsilon)/\lambda_j$。地平线是双向的——匹配的下界使近似等变被证明受地平线限制——且该认证为结构独有:轨道恒定误差刻画等变性,因此任何非等变模型无论规模多大都不具备。实验上,在40维Lorenz-96上,只有$\mathbb{Z}_N$等变网络恢复完整李雅普诺夫谱($R^2=0.98$);密集和循环基线失败。由于谱是忠实的,认证先验地起作用:在固定感知预算下,$c$倍膨胀的认证需要$c$倍预算,且等变认证满足其膨胀密集对应物无法满足的预算——无需校准数据。相同的读出,未经修改,可无训练审计公开预训练世界模型:TD-MPC2检查点落在认证自身的范围分类上——在强膨胀处校准(比率0.94-1.02),在弱膨胀处乐观,在收缩处正确弃权——部署的监控器逐单元复制该映射,样本外。在官方1M-317M多任务阶梯上,校准不随参数增加。在V-JEPA 2-AC(1B,真实机器人数据)上,测量的交叉检查正确覆盖了过度承诺的切空间谱——交叉验证审计,而非原始数值,是可部署的对象。规模买插值,而非校准的地平线。

英文摘要

Scale buys interpolation; structure buys a certified horizon. A world model's average error says nothing about whether a particular prediction can be trusted, or for how long. For equivariant latent world models we give a computable, multi-step certificate of the predictable horizon: $T$-step rollout error is provably constant over each symmetry orbit (Theorem A) and stratified channel-by-channel by the predictor's Lyapunov spectrum, $T_j(\epsilon)\sim\log(1/\epsilon)/\lambda_j$. The horizon is two-sided -- a matching lower bound makes approximate equivariance provably horizon-limited -- and the certificate is exclusive to structure: orbit-constant error characterizes equivariance, so no non-equivariant model has it at any scale. Empirically, on 40-D Lorenz-96 only a $\mathbb{Z}_N$-equivariant network recovers the full Lyapunov spectrum ($R^2{=}0.98$); dense and recurrent baselines fail. Because the spectrum is faithful, the certificate acts, a priori: under a fixed sensing budget a $c\times$-inflated certificate provably needs $c\times$ the budget, and the equivariant certificate meets a budget its inflated dense counterpart cannot -- with zero calibration data. The same read-out, unchanged, audits public pretrained world models training-free: TD-MPC2 checkpoints land on the certificate's own scope taxonomy -- calibrated where strongly expansive (ratio 0.94-1.02), optimistic where weakly expansive, correctly abstaining where contracting -- a map a deployed monitor replicates cell-by-cell, out-of-sample. Across the official 1M-317M multitask ladder, calibration does not improve with parameters. On V-JEPA 2-AC (1B, real robot data) the measured cross-check correctly overrides an over-promising tangent spectrum -- the cross-validated audit, not the raw number, is the deployable object. Scale buys interpolation, not a calibrated horizon.
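The horizon scaling quoted in the abstract can be turned into a quick back-of-the-envelope calculation. The sketch below treats the scaling relation as an equality with unit constant, which is an assumption: the abstract states only the asymptotic form, and the spectrum values here are hypothetical. Only expanding channels (positive exponents) are handled.

```python
import math


def predictable_horizon(tolerance, lyapunov_exponent):
    # Leading-order horizon log(1/eps) / lambda_j for an expanding channel.
    if lyapunov_exponent <= 0:
        raise ValueError("this sketch only covers positive Lyapunov exponents")
    return math.log(1.0 / tolerance) / lyapunov_exponent


spectrum = [1.5, 0.5, 0.1]  # hypothetical positive Lyapunov exponents
horizons = [predictable_horizon(1e-3, lam) for lam in spectrum]
```

Channels with smaller exponents get longer horizons, and tightening the tolerance by a factor of ten adds only `log(10) / lambda_j` steps to each horizon.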

2606.13178 2026-06-12 cs.LG 新提交

Loss-Shift Transfer via Bayes Quotients

通过贝叶斯商进行损失转移迁移学习

Vasileios Sevetlidis

发表机构 * Athena Research Center(雅典娜研究中心) Democritus University of Thrace(德谟克利特大学) International Hellenic University(国际希腊大学)

AI总结 本文研究数据分布固定但损失函数变化时的损失转移问题,利用贝叶斯商形式化损失的精炼顺序,证明粗损失的最小表示对严格更细的损失不足,并在有限输出对数损失下给出精确量化关系。

AI中文摘要

迁移学习通常被研究为分布偏移的结果。本文识别了一种正交的失败模式,其中数据分布固定而损失函数变化。这种设置称为“损失转移”。损失决定了$X$中哪些信息是贝叶斯相关的,因此即使在同一联合分布$P(X,Y)$下,两个损失也可能需要不同的表示。该思想使用贝叶斯商形式化,允许按精炼程度对损失排序。在贝叶斯商公式中,严格精炼立即给出定性的障碍。对于较粗损失,源最小表示对于严格更细的目标损失是不充分的。对于有限输出的对数损失,这个障碍变成了精确的定量恒等式。超额风险是表示丢弃的关于$Y$的条件信息。在受控、学习、合成图像和真实图像设置中的实验显示了预测的效果,即分类等价的表示在固定数据分布下可能具有不同的最优对数损失性能。

英文摘要

Transfer learning is usually studied as a consequence of distribution shift. This paper identifies an orthogonal failure mode in which the data distribution is fixed and the loss changes. This setting is called "loss shift". A loss determines which information in $X$ is Bayes-relevant, and two losses may therefore require different representations even under the same joint law $P(X,Y)$. The idea is formalized using Bayes quotients, which allow losses to be ordered by refinement. In the Bayes-quotient formulation, strict refinement gives an immediate qualitative obstruction. A source-minimal representation for a coarser loss is insufficient for a strictly finer target loss. For finite-output log loss, this obstruction becomes an exact quantitative identity. The excess risk is the conditional information about $Y$ discarded by the representation. Experiments in controlled, learned, synthetic-image, and real-image settings show the predicted effect, i.e., classification-equivalent representations can have different optimal log-loss performance under a fixed data distribution.
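The log-loss identity described in the abstract can be verified on a four-point toy distribution. This is a sketch under assumptions: the distribution and the representation `z_of_x` (which keeps only the Bayes-optimal label) are hypothetical, and the excess risk is compared against the conditional mutual information between $X$ and $Y$ given the representation, computed as an average KL divergence.

```python
import numpy as np


def binary_entropy(p):
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def kl_bernoulli(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))


p_x = np.full(4, 0.25)                        # uniform marginal over four inputs
p_y1_given_x = np.array([0.1, 0.4, 0.6, 0.9])  # P(Y = 1 | X = x)
z_of_x = (p_y1_given_x > 0.5).astype(int)     # keeps only the Bayes label

p_z = np.array([np.sum(p_x[z_of_x == z]) for z in (0, 1)])
p_y1_given_z = np.array(
    [np.sum((p_x * p_y1_given_x)[z_of_x == z]) / p_z[z] for z in (0, 1)]
)

# Bayes 0-1 risk is the same with full access to X and with Z only.
risk01_x = np.sum(p_x * np.minimum(p_y1_given_x, 1 - p_y1_given_x))
risk01_z = np.sum(p_z * np.minimum(p_y1_given_z, 1 - p_y1_given_z))

# Optimal log loss is a conditional entropy; Z loses some of it.
logloss_x = np.sum(p_x * binary_entropy(p_y1_given_x))
logloss_z = np.sum(p_z * binary_entropy(p_y1_given_z))
excess = logloss_z - logloss_x

# Conditional information about Y discarded by Z, as an average KL divergence.
cond_info = np.sum(p_x * kl_bernoulli(p_y1_given_x, p_y1_given_z[z_of_x]))
```

The representation is sufficient for classification but not for calibrated probabilities, and the log-loss gap matches the discarded conditional information to numerical precision.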

2606.13287 2026-06-12 cs.LG cs.DC math.OC 新提交

Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

裁剪使分布式和联邦异步SGD对掉队者具有鲁棒性

Samuel Erickson, Mikael Johansson

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院)

AI总结 本文理论证明梯度裁剪能消除异步SGD中最大延迟对复杂度的影响,基于次Weibull梯度噪声模型,首次实现异步优化的高概率收敛。

AI中文摘要

在现代机器学习中,训练的并行化是扩大规模的重要策略。异步随机梯度下降(ASGD)通过避免等待慢速工作节点来最大化可用硬件的利用率。然而,在恒定步长下,由于更新中的大延迟,慢速工作节点仍然会对ASGD的收敛产生负面影响。同时,在深度学习模型的异步训练中,经验观察到梯度裁剪能“稳定”训练。在这项工作中,我们为这一行为提供了理论依据,证明裁剪消除了最大延迟对预言复杂度的依赖。我们采用次Weibull梯度噪声模型,该模型将次高斯和次指数分布推广到更重尾的分布,受深度学习中的经验观察启发。我们证明了期望收敛,并且首次在异步优化中证明了高概率收敛。

英文摘要

In modern machine learning, parallelization of training is an important strategy for increasing scale. Asynchronous stochastic gradient descent (ASGD) maximizes the utilization of available hardware by avoiding waiting for slow workers. However, with constant step sizes, the convergence of ASGD is still negatively affected by slow workers because of large delays in updates. At the same time, it has been empirically observed in asynchronous training of deep learning models that gradient clipping "stabilizes" training. In this work, we provide a theoretical justification for this behavior by showing that clipping removes the dependence of the oracle complexity on the maximum delay. We employ a sub-Weibull model of gradient noise, which generalizes sub-Gaussian and sub-exponential distributions to heavier-tailed distributions and is motivated by empirical observations in deep learning. We show convergence in expectation and, for the first time in asynchronous optimization, convergence with high probability.
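The setting in the abstract, SGD updates computed from stale iterates and clipped before being applied, can be simulated in a few lines. The sketch below is a toy under assumptions and not the analyzed algorithm: the objective is a simple quadratic, delays are drawn uniformly at random, the noise is Gaussian rather than sub-Weibull, and the function names and all hyperparameters are hypothetical.

```python
import numpy as np


def clip(g, threshold):
    # Rescale g so that its Euclidean norm is at most `threshold`.
    norm = np.linalg.norm(g)
    return g if norm <= threshold else g * (threshold / norm)


def delayed_clipped_sgd(x0, lr=0.02, threshold=1.0, max_delay=10,
                        noise_std=0.1, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    history = [x0.copy()]  # history[t] is the iterate at time t
    x = x0.copy()
    for t in range(steps):
        # A worker reports a gradient computed at an iterate up to max_delay old.
        delay = int(rng.integers(0, max_delay + 1))
        stale = history[max(0, t - delay)]
        # Noisy gradient of f(x) = 0.5 * ||x||^2 evaluated at the stale iterate.
        grad = stale + noise_std * rng.standard_normal(x.shape)
        x = x - lr * clip(grad, threshold)
        history.append(x.copy())
    return x


x0 = 5.0 * np.ones(10)
x_final = delayed_clipped_sgd(x0)
```

Clipping bounds how far any single stale gradient can move the iterate, which gives an intuition for why the worst-case delay matters less once updates are clipped.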

2606.13576 2026-06-12 cs.LG cs.CC cs.DS stat.ML 新提交

Learning with Simulators: No Regret in a Computationally Bounded World

与模拟器学习:计算受限世界中的无悔学习

Sasha Voitovych, Abhishek Shetty, Noah Golowich, Alexander Rakhlin

发表机构 * MIT(麻省理工学院) Microsoft Research(微软研究院)

AI总结 提出可模拟过程框架,利用模拟器近似任意复杂依赖的数据分布,恢复VC维误差界,并展示条件采样的统计与计算优势。

Comments To appear at COLT 2026

AI中文摘要

理解泛化所需的最小假设是学习理论的基本问题。不幸的是,大多数结果严重依赖于数据生成过程的独立性(或其某种代理),而强依赖数据的结果则非常有限。为填补这一空白,我们引入了可模拟过程的框架,其中学习器可以访问一个近似数据生成分布(可能是任意复杂且依赖的过程)的模拟器。令人惊讶的是,我们表明,在访问这样的模拟器的情况下,我们可以恢复与经典独立数据设置相同的学习保证,即依赖于VC维的误差界。此外,我们利用这一框架研究条件采样的能力,并展示了在这种设置下严格的统计和计算优势。作为我们框架的一个亮点,我们展示了一个单一算法,该算法同时学习所有在有限多项式时间内可采样的过程下的任意给定VC类,其遗憾由过程的时间有界Kolmogorov复杂度控制。这为经典PAC模型提供了重要的概念扩展。

英文摘要

Understanding the minimal assumptions necessary for generalization is the fundamental question in learning theory. Unfortunately, most results rely heavily on independence (or some proxy thereof) of the data-generating process, while results for strongly dependent data are far more limited. Towards addressing this gap, we introduce the framework of simulatable processes, where the learner has access to a simulator that approximates the distribution generating the data (which may be an arbitrarily complex and dependent process). Surprisingly, given access to such a simulator, we show that we can recover the same learning guarantees as in the classical setting with independent data, namely, error bounds that depend on the VC dimension. Further, we use this framework to study the power of conditional sampling and show strict statistical and computational advantages in this setting. As a highlight of our framework, we exhibit a single algorithm that simultaneously learns any given VC class under all processes samplable in bounded polynomial time, with regret controlled by the time-bounded Kolmogorov complexity of the process. This provides a significant conceptual broadening of the classical PAC model.

2606.11104 2026-06-12 cs.LG math.CA stat.ML 新提交

Limitations of Learning Tanh Neural Networks with Finite Precision

有限精度下学习Tanh神经网络的局限性

Philipp Grohs, Matěj Trödler

AI总结 基于有限精度计算和L^p精度保证,通过构造尖锐局部化bump函数,证明自适应随机算法在L^p范数下收敛速度不超过蒙特卡洛率O(m^{-1/p}),除非采样预算随网络参数和架构指数增长。

详情
AI中文摘要

我们研究了在有限精度计算和$L^p$精度保证下,从点评估中学习$\tanh$神经网络的局限性,建立在Berner、Grohs和Voigtländer(2023)的工作基础上。我们的方法基于通过迭代$\tanh$激活函数新颖构造的尖锐局部化bump函数。利用这一机制,我们证明,在有限精度设置下,基于$m$个样本的自适应随机算法在$L^p$范数下无法达到比蒙特卡洛率$O(m^{-1/p})$更高的收敛速度,除非采样预算随网络参数和架构的大小指数增长。结果揭示了有限精度对包含局部化bump函数的类别可学习性施加的基本限制,将先前针对ReLU网络的结果推广到了$\tanh$设置。

英文摘要

We investigate limitations of learning $\tanh$ neural networks from point evaluations under finite-precision computations and $L^p$ accuracy guarantees, building on Berner, Grohs, and Voigtländer (2023). Our approach is based on a novel construction of sharply localized bump functions via iterated $\tanh$ activations. Using this mechanism, we show that, in a finite-precision setting, no adaptive randomized algorithm based on $m$ samples can achieve a convergence rate higher than the Monte Carlo rate $O(m^{-1/p})$ in the $L^p$ norm, unless the sampling budget grows exponentially with the size of the network parameters and architecture. The results reveal fundamental limitations imposed by finite precision on the learnability of classes containing localized bump functions, extending previous results for ReLU networks to the $\tanh$ setting.

2606.12646 2026-06-12 stat.ML cs.IT cs.LG math.IT 交叉投稿

Epistemic Uncertainty Is Not the Reducible Kind

认知不确定性并非可约简的那种

Robin Young

发表机构 * University of Cambridge(剑桥大学)

AI总结 证明标准定义中认知不确定性为可被更多数据移除的部分,与互信息度量在扩展上不一致,并提出三部分分解:偶然、样本可约简认知和机制可约简认知不确定性。

详情
AI中文摘要

预测不确定性的标准分类将认知不确定性定义为可通过收集更多数据移除的部分,而标准度量将其与互信息项等同。我们证明该定义与度量在扩展上不一致。在一个显式构造中,度量将所有不确定性归为认知类,但任何数量的训练数据都无法减少它。可约简性反而是(不确定性,获取类)这一对的性质,二分法分解为三部分:偶然不确定性、样本可约简认知不确定性和机制可约简认知不确定性。一个观测值的精确恒等式表明,分布内数据永远不会减少机制不可约简的不确定性,并且通常会增加它。集成分歧,即部署的认知估计,追踪的是训练过程而非认知项。在一致训练下,它降至正真值以下的零,并在插值下等于超参数缩放的初始化噪声。有限样本的证伪测试和种子扫描实验证实了该理论。

英文摘要

The standard taxonomy of predictive uncertainty defines epistemic uncertainty as the part removable by collecting more data, while the standard measure identifies it with a mutual-information term. We prove the definition and the measure are extensionally inconsistent. On an explicit construction, the measure assigns all uncertainty to the epistemic class, yet no quantity of training data reduces it. Reducibility is instead a property of the pair (uncertainty, acquisition class), and the dichotomy resolves into three parts: aleatoric, sample-reducible epistemic, and mechanism-reducible epistemic uncertainty. An exact identity for the value of an observation shows that in-distribution data never reduces mechanism-irreducible uncertainty and generically increases it. Ensemble disagreement, the deployed epistemic estimate, tracks the training procedure rather than the epistemic term. It collapses to zero beneath a positive truth under consistent training, and equals hyperparameter-scaled initialization noise under interpolation. A finite-sample falsification test and seed-swept experiments confirm the theory.
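The "standard measure" the abstract refers to splits the entropy of the averaged prediction into an expected-entropy term and a mutual-information term. A minimal sketch for an ensemble of categorical predictions follows; the ensemble values are hypothetical, and the example only shows the standard decomposition, not the paper's three-part refinement.

```python
import numpy as np


def entropy(p, axis=-1):
    p = np.clip(p, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=axis)


def decompose(member_probs):
    # member_probs has shape (n_members, n_classes).
    total = entropy(member_probs.mean(axis=0))   # entropy of the averaged prediction
    aleatoric = entropy(member_probs).mean()     # average entropy of the members
    epistemic = total - aleatoric                # mutual-information term
    return total, aleatoric, epistemic


disagreeing = np.array([[0.9, 0.1], [0.1, 0.9]])  # members contradict each other
agreeing = np.array([[0.5, 0.5], [0.5, 0.5]])     # members agree on a coin flip

total_d, aleatoric_d, epistemic_d = decompose(disagreeing)
total_a, aleatoric_a, epistemic_a = decompose(agreeing)
```

Both ensembles have the same total uncertainty, but the measure assigns a positive epistemic term only where members disagree, which is why it tracks properties of the training procedure that produced the members.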

2606.12647 2026-06-12 cs.CC cs.AI cs.LG 交叉投稿

Token Complexity Theory for AI-Augmented Computing

AI增强计算的Token复杂度理论

Jie Wang

AI总结 提出Token复杂度作为AI增强计算中查询与响应成本的形式化度量,建立AI-Oracle图灵机框架,证明单调性、凸性、价格敏感性和任务排序的价格相对性等基本定理。

Comments 25 pages, 1 figure

详情
AI中文摘要

AI增强计算将自然语言查询、代码生成请求及其他开放式任务委托给一组AI模型,这些模型处理查询并生成响应。这一范式引入了一个经典时间或空间复杂度无法捕捉的资源维度:向该集群发送查询和接收响应的成本。我们引入Token复杂度,将其定义为在任务上达到指定输出质量水平所需的最小期望Token成本,并建立了一个根据概率性质强度对AI系统进行分类的体系。我们在AI-Oracle图灵机框架内发展Token复杂度,其中概率图灵机通过专用查询和响应磁带与随机Oracle交互。我们证明了基本定理,表明Token复杂度符合预期:单调性(更高质量需要更多Token)、凸性(质量改进逐渐变得更昂贵)、价格敏感性(小价格变化导致有界成本变化)以及任务排序的价格相对性(任务的Token复杂度排序可能根据查询与响应成本比率而反转)。我们证明了复杂度前沿(定义为Token、时间和空间中所有可行资源约束的集合)是非空的、向上封闭且凸的。

英文摘要

AI-augmented computing delegates natural language queries, code generation requests, and other open-ended tasks to a cluster of AI models that processes queries and generates responses. This paradigm introduces a resource dimension that neither classical time nor space complexity captures: the cost of sending queries to and receiving responses from such a cluster. We introduce token complexity, a formal resource measure defined as the minimum expected token cost to achieve a specified level of output quality on a task, and develop a taxonomy classifying AI systems by the strength of their probabilistic properties. We develop token complexity within the framework of AI-Oracle Turing machines, in which a probabilistic Turing machine interacts with a stochastic oracle via dedicated query and response tapes. We prove basic theorems establishing that token complexity behaves as expected: monotonicity (higher quality costs more tokens), convexity (quality improvements become progressively more expensive), price sensitivity (small price changes produce bounded cost changes), and price-relativity of task ordering (the token complexity ordering of tasks can reverse depending on the query-to-response cost ratio). We prove that the complexity frontier, defined as the set of all feasible resource bounds in tokens, time, and space, is non-empty, upward-closed, and convex.
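The price-relativity property can be made concrete with a small arithmetic sketch. It assumes a linear cost model (query tokens times query price plus response tokens times response price), which is an illustrative simplification rather than the paper's formal definition; the task names, token counts and prices are hypothetical.

```python
def token_cost(query_tokens, response_tokens, query_price, response_price):
    """Linear token cost under the assumed pricing model."""
    return query_tokens * query_price + response_tokens * response_price

# Hypothetical tasks: (query tokens, response tokens).
TASKS = {"A": (100, 10), "B": (10, 60)}

def ranking(query_price, response_price):
    """Task names ordered from cheapest to most expensive."""
    return sorted(
        TASKS,
        key=lambda name: token_cost(*TASKS[name], query_price, response_price),
    )

print(ranking(1, 1))    # equal prices for query and response tokens
print(ranking(1, 10))   # response tokens ten times as expensive
```

Task A is query-heavy and task B is response-heavy, so raising the response price relative to the query price flips which task is cheaper.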

2606.12694 2026-06-12 cs.DS cs.LG math.PR stat.ML 交叉投稿

A unified complexity bound for logconcave sampling

对数凹采样的统一复杂度界

Yunbum Kook, Santosh S. Vempala

发表机构 * University of Texas at Austin(得克萨斯大学奥斯汀分校)

AI总结 本文通过In-and-Out算法与指数提升,给出了从热启动采样任意对数凹分布的简单、统一且近乎紧的界,主要创新是改进了提升后分布的Poincaré常数的界。

Comments 5 pages

详情
AI中文摘要

我们给出了一个简单、统一且近乎紧的界,用于从热启动出发、使用In-and-Out算法结合指数提升采样任意对数凹分布。分析中的主要新成分是对提升后分布的Poincaré常数给出了改进的界。因此,得到的收敛率对于约束设置(例如,限制在凸体上的高斯分布)和良条件设置(例如,强对数凹且光滑的密度)都是近乎紧的。

英文摘要

We give a simple, unified, and nearly tight bound for sampling arbitrary logconcave distributions from a warm start using the In-and-Out algorithm along with exponential lifting. The main new ingredient in the analysis is an improved bound on the Poincaré constant of a lifted distribution. As a consequence, the resulting convergence rate is nearly tight for both constrained settings (e.g., Gaussian restricted to a convex body) and well-conditioned settings (e.g., strongly logconcave and smooth densities).
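A minimal sketch of one In-and-Out iteration, assuming the proximal-sampler form of the algorithm: a Gaussian step out of the current point, then a Gaussian step back conditioned on the body, implemented by rejection. The abstract does not spell out these steps, so treat the details as assumptions. The sketch covers only the uniform distribution on a convex body (here the unit square), omits the exponential lifting, and its fallback when rejection fails is a simplification introduced here.

```python
import random

def in_box(x):
    """Membership oracle for the unit box [0, 1]^d (illustrative convex body)."""
    return all(0.0 <= xi <= 1.0 for xi in x)

def in_and_out_step(x, h, in_body, rng, max_tries=10_000):
    """One iteration targeting the uniform distribution on the body.

    h is the step-size (variance) parameter; max_tries caps the rejection loop.
    """
    std = h ** 0.5
    # Step out: y ~ N(x, h I), which may leave the body.
    y = [xi + std * rng.gauss(0.0, 1.0) for xi in x]
    # Step in: sample from N(y, h I) restricted to the body via rejection.
    for _ in range(max_tries):
        z = [yi + std * rng.gauss(0.0, 1.0) for yi in y]
        if in_body(z):
            return z
    return x  # simplification: stay put if every proposal was rejected

rng = random.Random(0)
x = [0.5, 0.5]
chain = []
for _ in range(500):
    x = in_and_out_step(x, h=0.01, in_body=in_box, rng=rng)
    chain.append(x)
print(chain[-1])
```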

2606.12892 2026-06-12 stat.ML cs.LG econ.EM math.ST stat.ME stat.TH 交叉投稿

Prediction-Powered Causal Inference by Automatic Debiased Machine Learning and Semi-Supervised Riesz Regression

预测驱动的因果推断:自动去偏机器学习与半监督Riesz回归

Masahiro Kato

发表机构 * University of Tokyo(东京大学)

AI总结 研究半监督设置下因果参数的半参数有效估计,通过结合去偏机器学习和半监督Riesz回归,提出DML-PPCI和TMLE-PPCI方法,实现比仅用标注数据更小的渐近方差。

详情
AI中文摘要

本研究探讨了在半监督设置下因果和结构参数的半参数有效估计。在我们的设置中,除了由结果和回归变量组成的标注观测数据外,还有未标记的辅助回归变量可用。我们的目标是构建因果和结构参数的估计量,其渐近方差小于仅使用标注数据构建的估计量。我们将此框架称为预测驱动的因果推断(PPCI)。我们首先推导了有效影响函数和效率界,这表明使用辅助回归变量可以获得比仅从标注观测数据可达到的效率界更小的渐近方差。然后,通过将有效影响函数与去偏机器学习(DML)框架相结合,我们提出了称为DML-PPCI的方法。如果我们构建一个估计方程估计量,我们称之为EE-DML-PPCI;如果我们构建一个目标学习估计量,我们称之为TMLE-DML-PPCI。两种估计量的渐近方差都与我们推导的效率界相匹配。在构建估计量时,有效影响函数的估计起着重要作用。在我们的研究中,有效影响函数也是一个Neyman正交分数,它依赖于Riesz表示子和回归函数。对于Riesz表示子估计,我们开发了具有收敛速度保证的半监督广义Riesz回归。

英文摘要

This study investigates semiparametric efficient estimation of causal and structural parameters in a semi-supervised setting. In our setting, unlabeled auxiliary regressors are available in addition to labeled observations consisting of outcomes and regressors. Our goal is to construct estimators of causal and structural parameters whose asymptotic variances are smaller than those of estimators constructed using only labeled data. We refer to this framework as prediction-powered causal inference (PPCI). We first derive the efficient influence function and the efficiency bound, which imply that the use of auxiliary regressors can attain a smaller asymptotic variance than the efficiency bound attainable from labeled observations alone. Then, by combining the efficient influence function with the debiased machine learning (DML) framework, we propose methods that we call DML-PPCI. If we construct an estimating-equation estimator, we refer to the method as EE-DML-PPCI; if we construct a targeted-learning estimator, we refer to the method as TMLE-DML-PPCI. The asymptotic variances of both estimators match our derived efficiency bound. In the construction of the estimators, estimation of the efficient influence function plays an important role. In our study, the efficient influence function is also a Neyman orthogonal score, which depends on the Riesz representer and the regression function. For Riesz representer estimation, we develop semi-supervised generalized Riesz regression with convergence rate guarantees.

2606.13614 2026-06-12 stat.ML cs.LG math.ST stat.TH 交叉投稿

Majority-of-Three is Optimal

三中多数是最优的

Divit Rawal, Nikita Zhivotovskiy

发表机构 * Department of Statistics, University of California, Berkeley(加州大学伯克利分校统计学系)

AI总结 本文给出简短证明,表明在可实现PAC学习框架下,三个独立一致分类器的多数投票是最优学习器,并简化了此前投票学习器的算法结构和概率分析。

Comments 9 pages

详情
AI中文摘要

我们给出一个简短证明,表明在可实现PAC学习框架下,三个独立一致分类器的多数投票是最优学习器。这证明了最简单投票方案的最优性,同时简化了先前投票学习器的算法结构和概率分析,包括S. Hanneke的算法和K. Green Larsen对装袋的分析。

英文摘要

We give a short proof that the majority vote of three independent consistent classifiers is an optimal learner in the realizable PAC setting. This proves optimality for the simplest voting scheme, while simplifying both the algorithmic structure and the probabilistic analysis of previous voting learners, including the algorithm of S. Hanneke and the analysis of bagging by K. Green Larsen.
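The voting scheme in the abstract can be sketched on a toy realizable problem. The example assumes one-dimensional threshold classifiers, a hypothetical target threshold of 0.3, and three classifiers each trained on its own independent sample; it illustrates the mechanics of the vote and does not demonstrate the optimality result.

```python
import random

def predict(threshold, x):
    """Threshold classifier: label 1 iff x >= threshold."""
    return 1 if x >= threshold else 0

def consistent_threshold(sample):
    """Return a threshold that labels every training point correctly.

    The smallest positive example works in the realizable setting;
    with no positives, predicting 0 everywhere is consistent.
    """
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else float("inf")

def majority_of_three(thresholds, x):
    """Majority vote over three threshold classifiers."""
    votes = sum(predict(t, x) for t in thresholds)
    return 1 if votes >= 2 else 0

def draw_sample(target, n, rng):
    """n points uniform on [0, 1), labeled by the target threshold."""
    xs = [rng.random() for _ in range(n)]
    return [(x, predict(target, x)) for x in xs]

rng = random.Random(0)
target = 0.3  # hypothetical ground-truth threshold
samples = [draw_sample(target, 50, rng) for _ in range(3)]
thresholds = [consistent_threshold(s) for s in samples]
print(thresholds, majority_of_three(thresholds, 0.35))
```

For threshold classifiers the majority vote coincides with the classifier at the median of the three learned thresholds, which makes the behavior easy to check by hand.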

2501.08425 2026-06-12 cs.LG math.AP math.PR 版本更新

Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes

随机梯度下降有效吗?机器学习过程的PDE视角

Davide Barbieri, Matteo Bonforte, Peio Ibarrondo

发表机构 * Departamento de Matemáticas, Universidad Autónoma de Madrid, ICMAT - Instituto de Ciencias Matemáticas, CSIC-UAM-UC3M-UCM(数学系,马德里自治大学,ICMAT数学科学研究所,CSIC-UAM-UC3M-UCM)

AI总结 通过Fokker-Planck型抛物PDE分析SGD行为,区分漂移和扩散两个阶段,量化浓度现象并证明平均退出时间界限,为非凸损失和退化扩散矩阵下的渐近收敛提供新结果。

详情
AI中文摘要

本文分析了随机梯度下降(SGD)的行为,这是一种在监督学习中广泛使用的方法,通过最小化非凸损失函数来优化神经网络权重。自E、Li和Tai(2017)的开创性工作以来,此类过程的基本结构可以通过Fokker-Planck型抛物PDE来理解,这是我们分析的核心。尽管Fokker-Planck方程历史悠久且文献丰富,但当势函数非凸或扩散矩阵退化时,几乎一无所知,这是我们分析中面临的主要困难。我们识别出两种不同的阶段:在SGD的初始阶段,损失函数驱动权重集中在最近的局部最小值附近。我们将此阶段称为漂移阶段,并提供了关于这种集中现象的定量估计。接下来,我们引入扩散阶段,其中随机波动帮助学习过程逃离次优局部最小值。我们分析了平均退出时间(MET),并证明了MET的上下界。最后,我们针对非凸代价函数和退化扩散矩阵(不允许使用标准方法并需要新技术)研究了SGD的渐近收敛性。为此,我们利用了两种不同的方法:对偶方法和熵方法。我们提供了关于SGD动力学和有效性的新结果,建立了随机优化与PDE理论之间的深层联系,并为机器学习过程中的基本问题提供了一些答案和见解:SGD需要多长时间才能逃离一个坏的最小值?使用SGD时神经网络参数是否收敛?在SGD训练的第一阶段,参数如何演化?

英文摘要

In this paper we analyze the behaviour of stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights via the minimization of non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Although Fokker-Planck equations have a long history and an extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes. In the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum; we refer to this phase as the drift regime and provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process to escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds on it. Finally, we address the asymptotic convergence of SGD for a non-convex cost function and a degenerate diffusion matrix, a setting in which the standard approaches do not apply and new techniques are required. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions about machine learning processes: How long does SGD take to escape from a bad minimum? Do neural network parameters converge under SGD? How do parameters evolve in the first stage of training with SGD?

2602.08913 2026-06-12 cs.LG stat.ML 版本更新

GEMSS: A Variational Bayesian Method for Discovering Multiple Sparse Solutions in Classification and Regression Problems

GEMSS: 一种用于在分类和回归问题中发现多个稀疏解的变分贝叶斯方法

Kateřina Henclová, Václav Šmídl

发表机构 * Faculty of Electrical Engineering, Czech Technical University(捷克技术大学电子工程系)

AI总结 提出GEMSS算法,利用结构化spike-and-slab先验、高斯混合近似后验和Jaccard惩罚,通过变分推断同时发现多个多样化的稀疏特征组合,在128个实验和3个真实数据集上优于对比方法。

详情
AI中文摘要

高维、欠定且高度相关的系统在数据科学实践中很常见,尤其是在分析物理测量时。在这种情况下,特征选择面临根本性挑战,因为多个不同的稀疏子集可能同样好地解释响应。识别这些子集不仅对预测建模至关重要,而且对生成关于潜在机制的领域特定见解也至关重要。然而,传统方法通常只隔离单个解,掩盖了全部合理的解释。本文介绍了GEMSS(高斯集成多稀疏解),一种变分算法,旨在同时发现多个多样化的稀疏特征组合。该方法采用结构化spike-and-slab先验实现稀疏性,使用高斯混合近似难以处理的多模态后验,并引入基于Jaccard的惩罚进一步控制解的多样性。通过随机梯度下降优化单个目标函数。该方法通过一个新的基准测试框架在128个综合实验上进行测试,该框架旨在生成具有相同预测属性的多个稀疏解的人工问题。这使我们能够测量真实特征的检索,而不仅仅是评估预测性能,这一评估方式更符合我们的实际需求。比较分析表明,GEMSS始终优于通过ALFESE框架适配的五种著名特征选择方法。最后,我们通过来自代谢组学和物理化学的3个具有挑战性的真实世界数据集展示了其实用性:GEMSS成功分离出多个不同且高质量的解。GEMSS作为PyPI包'gemss'提供。相应的存储库 github.com/kat-er-ina/gemss/ 包含完整的代码库和免费的无代码应用程序GEMSS Explorer。

英文摘要

High-dimensional, underdetermined and highly correlated systems are common in data science practice, especially when analyzing physical measurements. In such settings, feature selection poses a fundamental challenge because multiple distinct sparse subsets may explain the response equally well. Their identification is crucial not only for predictive modeling but also for generating domain-specific insights into the underlying mechanisms. Yet, conventional methods typically isolate a single solution, obscuring the full spectrum of plausible explanations. This work introduces GEMSS (Gaussian Ensemble for Multiple Sparse Solutions), a variational algorithm designed to simultaneously discover multiple, diverse sparse feature combinations. The method employs a structured spike-and-slab prior for sparsity, a mixture of Gaussians to approximate the intractable multimodal posterior, and a Jaccard-based penalty to further control solution diversity. A single objective function is optimized via stochastic gradient descent. The method is tested in 128 comprehensive experiments using a novel benchmarking framework designed to generate artificial problems with multiple sparse solutions of equal predictive properties. This allows us to measure the retrieval of ground-truth features rather than only evaluating predictive performance, which better matches our practical needs. A comparative analysis shows that GEMSS consistently outperforms five prominent feature selection methods adapted through the ALFESE framework. Finally, we demonstrate practical usability on 3 challenging real-world datasets from metabolomics and physical chemistry: GEMSS successfully isolates multiple distinct, high-quality solutions. GEMSS is available as a PyPI package 'gemss'. The corresponding repository github.com/kat-er-ina/gemss/ includes the full codebase and a free, no-code application GEMSS Explorer.

2603.02234 2026-06-12 cs.LG cs.AI 版本更新

Structured vs. Unstructured Pruning: An Exponential Gap

结构化剪枝与非结构化剪枝:指数级差距

Davide Ferre', Frédéric Giroire, Frederik Mallmann-Trenn, Emanuele Natale

发表机构 * Department of Informatics, King’s College London(伦敦国王学院信息学院)

AI总结 研究随机初始化网络中剪枝的局限性,证明神经元剪枝需要指数级更大的网络规模才能达到与非结构化剪枝相同的近似精度。

详情
AI中文摘要

强彩票假说(SLTH)指出,大型随机初始化神经网络包含稀疏子网络,无需训练即可在初始化时逼近目标函数,这表明仅剪枝就足够了。剪枝方法通常分为非结构化(可移除单个权重)和结构化(根据特定模式移除参数,如神经元剪枝)。现有支持SLTH的理论结果几乎完全依赖于非结构化剪枝,表明对数级的过参数化足以逼近简单的目标网络。相比之下,神经元剪枝尽管因其直接加速硬件的实用性而备受关注,但理论关注有限。本文考虑通过剪枝随机初始化两层ReLU网络的隐藏单元来逼近单个无偏置ReLU神经元的问题,从而隔离神经元剪枝的内在局限性。我们证明,实现ε-逼近需要神经元剪枝的起始网络规模为Ω(1/ε),而权重剪枝仅需O(log(1/ε))个隐藏单元,揭示了两种方法之间的指数级差距。

英文摘要

The Strong Lottery Ticket Hypothesis (SLTH) states that large, randomly initialized neural networks contain sparse subnetworks capable of approximating a target function at initialization without training, suggesting that pruning alone is sufficient. Pruning methods are typically classified as unstructured, where individual weights can be removed from the network, and structured, where parameters are removed according to specific patterns, as in neuron pruning. Existing theoretical results supporting the SLTH rely almost exclusively on unstructured pruning, showing that logarithmic overparameterization suffices to approximate simple target networks. In contrast, neuron pruning has received limited theoretical attention, despite its practical appeal for direct hardware speedups. In this work, we consider the problem of approximating a single bias-free ReLU neuron by pruning hidden units of a randomly initialized two-layer ReLU network, effectively isolating the intrinsic limitations of neuron pruning. We show that achieving an $\varepsilon$-approximation requires a starting network size of $\Omega(1/\varepsilon)$ for neuron pruning, whereas weight pruning succeeds with only $O(\log(1/\varepsilon))$ hidden units, revealing an exponential separation between the two approaches.
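The two pruning regimes compared in the abstract differ only in the granularity of the mask. The sketch below shows that difference on a bias-free two-layer ReLU network; the weights, input and masks are hypothetical, the weight mask is applied to the first layer only, and the example says nothing about the approximation bounds themselves.

```python
def relu(z):
    return z if z > 0.0 else 0.0

def two_layer_relu(x, W, a, weight_mask=None, neuron_mask=None):
    """Evaluate sum_i a_i * relu(<w_i, x>) under optional pruning masks.

    neuron_mask[i] == False drops hidden unit i entirely (structured pruning).
    weight_mask[i][j] == False drops the single weight W[i][j] (unstructured pruning).
    """
    out = 0.0
    for i, (w_i, a_i) in enumerate(zip(W, a)):
        if neuron_mask is not None and not neuron_mask[i]:
            continue
        pre = 0.0
        for j, (w_ij, x_j) in enumerate(zip(w_i, x)):
            if weight_mask is None or weight_mask[i][j]:
                pre += w_ij * x_j
        out += a_i * relu(pre)
    return out

# Hypothetical network with three hidden units and two inputs.
W = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
a = [1.0, 2.0, -1.0]
x = [2.0, 1.0]

print(two_layer_relu(x, W, a))
print(two_layer_relu(x, W, a, neuron_mask=[False, True, False]))
print(two_layer_relu(x, W, a, weight_mask=[[True, False], [True, True], [True, False]]))
```

Neuron pruning can only choose among the given hidden units, while weight pruning can also reshape each unit's incoming vector, which is the extra freedom the separation result is about.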

2605.13426 2026-06-12 cs.LG math.AG 版本更新

Strategic PAC Learnability via Geometric Definability

通过几何可定义性实现策略PAC可学习性

Yuval Filmus, Shay Moran, Elizaveta Nesterova, Nir Rosenfeld, Alexander Shlimovich

发表机构 * Weizmann Institute of Science(魏茨曼研究院) University of Waterloo(滑铁卢大学) ETH Zurich(苏黎世联邦理工学院) University of Washington(华盛顿大学)

AI总结 研究个体通过成本修改特征影响分类器决策的策略学习问题,证明在简单情况下策略行为可使易学问题变为不可学,并引入几何可定义性假设以控制样本复杂度。

详情
AI中文摘要

策略分类研究个体可以付出一定成本修改自身特征、以影响分类器决策的学习场景。核心问题是诱导的(策略性)假设类的样本复杂度如何依赖于基础假设类的复杂度以及约束可行操纵的成本结构。先前工作表明,在若干自然设置下(例如采用范数成本的线性分类器),诱导复杂度是可控的。我们首先证明此类保证在一般情况下并不成立,即使在简单情形中也是如此:存在实数轴上VC维为$1$的假设类,即使在最简单的区间邻域下,其诱导类的VC维也是无限的。因此,策略行为可将一个容易的学习问题变为不可学习的问题。为克服这一点,我们通过几何可定义性假设引入结构:假设类和由成本诱导的邻域关系都可以由$\mathbb{R}_{\mathtt{exp}}$上的一阶公式定义。直观地说,这意味着假设和成本可以用算术运算、指数、对数和比较来描述。该假设涵盖了广泛的自然假设类和成本函数,包括$\ell_p$距离、Wasserstein距离和信息论散度。在此假设下,我们证明可学习性得以保持,样本复杂度由定义公式的复杂度控制。

英文摘要

Strategic classification studies learning settings in which individuals can modify their features, at a cost, in order to influence the classifier's decision. A central question is how the sample complexity of the induced (strategic) hypothesis class depends on the complexities of the underlying hypothesis class and the cost structure governing feasible manipulations. Prior work has shown that in several natural settings, such as linear classifiers with norm costs, the induced complexity can be controlled. We begin by showing that such guarantees fail in general - even in simple cases: there exist hypothesis classes of VC dimension $1$ on the real line such that, even under the simplest interval neighborhoods, the induced class has infinite VC dimension. Thus, strategic behavior can turn an easy learning problem into a non-learnable one. To overcome this, we introduce structure via a geometric definability assumption: both the hypothesis class and the cost-induced neighborhood relation can be defined by first-order formulas over $\mathbb{R}_{\mathtt{exp}}$. Intuitively, this means that hypotheses and costs can be described using arithmetic operations, exponentiation, logarithms, and comparisons. This captures a broad range of natural classes and cost functions, including $\ell_p$ distances, Wasserstein distance, and information-theoretic divergences. Under this assumption, we prove that learnability is preserved, with sample complexity controlled by the complexity of the defining formulas.

2503.02178 2026-06-12 stat.ML cs.LG 版本更新

Central Limit Theorems for Stochastic Gradient Descent Quantile Estimators

随机梯度下降分位数估计量的中心极限定理

Ziyang Wei, Jiaqi Li, Likai Chen, Wei Biao Wu

发表机构 * Department of Statistics, University of Chicago(芝加哥大学统计系) Department of Statistics and Data Science, Washington University in St. Louis(圣路易斯华盛顿大学统计与数据科学系)

AI总结 本文针对常学习率SGD分位数估计,利用马尔可夫链理论证明其平稳分布随学习率趋于零时收敛到高斯分布,首次给出CLT型理论保证,并提出置信区间递归算法。

详情
AI中文摘要

本文发展了通过恒定学习率的随机梯度下降(SGD)进行分位数估计的渐近理论。分位数损失函数既不光滑也不强凸。超越传统视角和技术,我们将分位数SGD迭代视为一个不可约、周期且正常返的马尔可夫链,该链循环收敛到其唯一的平稳分布,无论初始值如何任意固定。为了推导平稳分布的精确形式,我们通过利用平稳方程分析其特征函数的结构。我们还推导了其矩生成函数(MGF)和尾部概率的紧界。综合上述方法,我们证明了当学习率$\eta\rightarrow0$时,中心化和标准化的平稳分布收敛到高斯分布。这一发现为恒定学习率的分位数SGD估计量提供了首个中心极限定理(CLT)类型的理论保证。我们进一步提出了一种递归算法来构建具有统计保证的估计量的置信区间。数值研究展示了在线估计器和推断过程的有效有限样本性能。本研究所发展的理论工具对于研究一般形式化为马尔可夫链的SGD算法具有独立意义,特别是在非强凸和非光滑设置中。

英文摘要

This paper develops asymptotic theory for quantile estimation via stochastic gradient descent (SGD) with a constant learning rate. The quantile loss function is neither smooth nor strongly convex. Beyond conventional perspectives and techniques, we view quantile SGD iteration as an irreducible, periodic, and positive recurrent Markov chain, which cyclically converges to its unique stationary distribution regardless of the arbitrarily fixed initialization. To derive the exact form of the stationary distribution, we analyze the structure of its characteristic function by exploiting the stationary equation. We also derive tight bounds for its moment generating function (MGF) and tail probabilities. Synthesizing the aforementioned approaches, we prove that the centered and standardized stationary distribution converges to a Gaussian distribution as the learning rate $\eta \rightarrow 0$. This finding provides the first central limit theorem (CLT)-type theoretical guarantees for the quantile SGD estimator with constant learning rates. We further propose a recursive algorithm to construct confidence intervals of the estimators with statistical guarantees. Numerical studies demonstrate the effective finite-sample performance of the online estimator and inference procedure. The theoretical tools developed in this study are of independent interest for investigating general SGD algorithms formulated as Markov chains, particularly in non-strongly convex and non-smooth settings.
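The iteration analyzed in the abstract can be sketched as constant-step SGD on the pinball (quantile) loss. The update below is the standard subgradient step, and the data stream, step size and seed are hypothetical; the confidence-interval recursion proposed in the paper is not reproduced.

```python
import random

def quantile_sgd(stream, tau, lr, q0=0.0):
    """Constant-learning-rate SGD for the tau-quantile.

    Each step uses the subgradient of the pinball loss with respect to q,
    namely 1{x < q} - tau, so the iterate moves up by lr * tau when x >= q
    and down by lr * (1 - tau) when x < q.
    """
    q = q0
    for x in stream:
        q -= lr * ((1.0 if x < q else 0.0) - tau)
    return q

rng = random.Random(0)
data = (rng.random() for _ in range(20_000))  # Uniform(0, 1) stream; true median is 0.5
q_hat = quantile_sgd(data, tau=0.5, lr=0.001)
print(q_hat)
```

Because the step size never decays, the iterate keeps fluctuating around the target quantile instead of converging to it; the limiting Gaussian shape of that fluctuation as the learning rate shrinks is what the abstract's CLT describes.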

2511.19716 2026-06-12 math.NA cs.LG cs.NA 版本更新

Design Criteria for SGD Preconditioners: Local Conditioning, Noise Floors, and Basin Stability

SGD预条件子的设计准则:局部条件数、噪声基底与盆地稳定性

Mitchell Scott, Tianshi Xu, Ziyuan Tang, Alexandra Pichette-Emmons, Qiang Ye, Yousef Saad, Yuanzhe Xi

发表机构 * Department of Mathematics, Emory University(埃默里大学数学系) Department of Mathematics, University of Minnesota Twin Cities(明尼苏达大学双城分校数学系) Department of Computer Science, University of Minnesota Twin Cities(明尼苏达大学双城分校计算机科学系) Department of Mathematics, University of Kentucky(肯塔基大学数学系)

AI总结 针对SGD在训练后期因各向异性曲率和梯度噪声导致的收敛缓慢问题,提出基于对称正定矩阵M的预条件SGD分析框架,推导收敛速率和噪声基底受M相关量控制的界,并给出非凸目标下的盆地稳定性保证,为科学机器学习提供设计准则。

Comments 31 pages, 11 Figures

详情
Journal ref
Trans. of Mach. Learning Research, 06/2026
AI中文摘要

随机梯度下降(SGD)在训练后期常因各向异性曲率和梯度噪声而变慢。我们在对称正定矩阵$\mathbf{M}$诱导的几何中分析预条件SGD,推导出收敛速率和随机噪声基底均受$\mathbf{M}$相关量控制的界:速率通过$\mathbf{M}$度量下的有效条件数,基底通过该条件数与预条件噪声水平的乘积。对于非凸目标,我们建立了依赖于预条件子的盆地稳定性保证:当光滑性和盆地大小以$\mathbf{M}$范数度量时,迭代停留在良好局部区域的概率有显式下界。这一视角在科学机器学习(SciML)中尤为重要,其中在随机更新下实现小训练损失与物理保真度、数值稳定性和约束满足密切相关。该框架适用于对角/自适应和曲率感知预条件子,并给出一个简单的设计原则:选择$\mathbf{M}$以改善局部条件同时衰减噪声。在二次诊断问题和三个SciML基准上的实验验证了预测的速率-基底行为。

英文摘要

Stochastic Gradient Descent (SGD) often slows in the late stage of training due to anisotropic curvature and gradient noise. We analyze preconditioned SGD in the geometry induced by a symmetric positive definite matrix $\mathbf{M}$, deriving bounds in which both the convergence rate and the stochastic noise floor are governed by $\mathbf{M}$-dependent quantities: the rate through an effective condition number in the $\mathbf{M}$-metric, and the floor through the product of that condition number and the preconditioned noise level. For nonconvex objectives, we establish a preconditioner-dependent basin-stability guarantee: when smoothness and basin size are measured in the $\mathbf{M}$-norm, the probability that the iterates remain in a well-behaved local region admits an explicit lower bound. This perspective is particularly relevant in Scientific Machine Learning (SciML), where achieving small training loss under stochastic updates is closely tied to physical fidelity, numerical stability, and constraint satisfaction. The framework applies to both diagonal/adaptive and curvature-aware preconditioners and yields a simple design principle: choose $\mathbf{M}$ to improve local conditioning while attenuating noise. Experiments on a quadratic diagnostic and three SciML benchmarks validate the predicted rate-floor behavior.
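The design principle in the abstract can be illustrated on a diagonal quadratic. The sketch assumes the update `x <- x - lr * M^{-1} g` with a diagonal matrix `M`, which is one common convention and may differ from the paper's; the curvatures, step sizes and noise model are hypothetical.

```python
import random

def preconditioned_sgd(h, m, x0, lr, steps, noise_std=0.0, seed=0):
    """SGD on f(x) = 0.5 * sum_i h[i] * x[i]**2 with a diagonal preconditioner m.

    Each stochastic gradient is h[i] * x[i] plus Gaussian noise of scale noise_std.
    """
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = [h_i * x_i + noise_std * rng.gauss(0.0, 1.0) for h_i, x_i in zip(h, x)]
        # Preconditioned step: divide each gradient coordinate by m[i].
        x = [x_i - lr * g_i / m_i for x_i, g_i, m_i in zip(x, grad, m)]
    return x

h = [100.0, 1.0]  # anisotropic curvature, condition number 100
plain = preconditioned_sgd(h, m=[1.0, 1.0], x0=[1.0, 1.0], lr=0.01, steps=10)
matched = preconditioned_sgd(h, m=h, x0=[1.0, 1.0], lr=1.0, steps=1)
noisy = preconditioned_sgd(h, m=h, x0=[1.0, 1.0], lr=0.1, steps=200, noise_std=0.1)
print(plain, matched, noisy)
```

Without noise, matching `m` to the curvature makes the problem perfectly conditioned and one step reaches the minimizer, while plain SGD is limited by the stiff direction. With noise, the preconditioner also rescales the gradient noise, which is the rate versus noise-floor trade-off the abstract refers to.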

2512.23566 2026-06-12 math.DS cond-mat.stat-mech cs.LG math.OC stat.ML 版本更新

From geometry to dynamics: Learning overdamped Langevin dynamics from sparse observations with geometric constraints

从几何到动力学:基于几何约束从稀疏观测学习过阻尼朗之万动力学

Dimitra Maoutsa

发表机构 * Dimitra Maoutsa(迪米特拉·马乌茨)

AI总结 提出一种随机控制框架,利用系统不变密度的几何结构进行路径增强,从稀疏时间采样数据中恢复过阻尼朗之万动力学,无需参数模型假设。

Comments 10+54 pages, 14 figures; accepted at ICML 2026. An earlier account of this work previously appeared in arXiv:2301.08102 and arXiv:2304.00423; the main methodology remains the same, and this version includes additional numerical experiments and theory.

详情
AI中文摘要

当随机系统的轨迹在时间上稀疏采样时,我们如何学习其动力学背后的规律?现有方法要么需要时间分辨的高频观测,要么依赖于仅适用于保守系统的几何论证,限制了它们能恢复的动力学范围。在这里,我们提出一个新的框架,通过将推断重新表述为随机控制问题来调和这两种观点。我们的方法使用几何驱动的路径增强,以系统不变密度的几何结构为指导,重构可能的轨迹并推断底层动力学,而不假设特定的参数模型。应用于过阻尼朗之万系统,我们的方法即使在极度欠采样数据下也能准确恢复随机动力学,在合成基准测试中优于现有方法。这项工作证明了将几何归纳偏差纳入随机系统识别方法的有效性。

英文摘要

How can we learn the laws underlying the dynamics of stochastic systems when their trajectories are sampled sparsely in time? Existing methods either require temporally resolved high-frequency observations, or rely on geometric arguments that apply only to conservative systems, limiting the range of dynamics they can recover. Here, we present a new framework that reconciles these two perspectives by reformulating inference as a stochastic control problem. Our method uses geometry-driven path augmentation, guided by the geometric structure of the system's invariant density, to reconstruct likely trajectories and infer the underlying dynamics without assuming specific parametric models. Applied to overdamped Langevin systems, our approach accurately recovers stochastic dynamics even from extremely undersampled data, outperforming existing methods in synthetic benchmarks. This work demonstrates the effectiveness of incorporating geometric inductive biases into stochastic system identification methods.

2601.22003 2026-06-12 stat.ML cs.LG stat.CO 版本更新

Efficient Stochastic Optimisation via Sequential Monte Carlo

通过序贯蒙特卡洛实现高效随机优化

James Cuin, Davide Carbone, Yanbo Tang, O. Deniz Akyildiz

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对梯度难以计算的优化问题,提出用序贯蒙特卡洛(SMC)采样器替代昂贵的内采样循环,实现高效随机优化,并在能量模型奖励调优中验证有效性。

Comments Accepted to ICML 2026

详情
AI中文摘要

在机器学习和统计学中,从最大边际似然估计过程到生成模型的微调,经常出现优化具有难处理梯度函数的问题。针对这类问题的随机近似方法通常需要内部采样循环来获得(有偏的)随机梯度估计,这很快会变得计算昂贵。在这项工作中,我们开发了用于优化具有难处理梯度函数的序贯蒙特卡洛(SMC)采样器。我们的方法用高效的SMC近似替代昂贵的内部采样方法,这可以带来显著的计算收益。我们为我们的方法所定义的基本递归建立了收敛结果,这些递归由SMC采样器近似。我们在各种设置下对能量模型的奖励调优展示了我们方法的有效性。

英文摘要

The problem of optimising functions with intractable gradients frequently arises in machine learning and statistics, ranging from maximum marginal likelihood estimation procedures to fine-tuning of generative models. Stochastic approximation methods for this class of problems typically require inner sampling loops to obtain (biased) stochastic gradient estimates, which rapidly becomes computationally expensive. In this work, we develop sequential Monte Carlo (SMC) samplers for optimisation of functions with intractable gradients. Our approach replaces expensive inner sampling methods with efficient SMC approximations, which can result in significant computational gains. We establish convergence results for the basic recursions defined by our methodology which SMC samplers approximate. We demonstrate the effectiveness of our approach on the reward-tuning of energy-based models within various settings.

2603.17527 2026-06-12 stat.ML cs.LG math.OC 版本更新

Mirror Descent on Riemannian Manifolds

黎曼流形上的镜像下降

Jiaxin Jiang, Lei Shi, Jiyuan Tan

发表机构 * School of Mathematical Sciences, Fudan University, Shanghai 200433, China(复旦大学数学学院,上海200433,中国) Shanghai Key Laboratory for Contemporary Applied Mathematics, Fudan University, Shanghai 200433, China(上海当代应用数学重点实验室,复旦大学,上海200433,中国)

AI总结 将镜像下降推广到黎曼流形,通过重参数化提出黎曼镜像下降(RMD)及其随机变体,并建立非渐近收敛保证,在Stiefel流形上退化为曲线梯度下降(CGD)。

详情
AI中文摘要

镜像下降(MD)是一种可扩展的一阶方法,广泛应用于大规模优化,包括图像处理、策略优化和神经网络训练。本文将MD推广到黎曼流形上的优化。具体地,我们通过重参数化开发了一个黎曼镜像下降(RMD)框架,并进一步提出了RMD的随机变体。我们还为RMD和随机RMD建立了非渐近收敛保证。作为在Stiefel流形上的应用,我们的RMD框架退化为[26]中提出的曲线梯度下降(CGD)方法。此外,当将随机RMD框架特化到Stiefel设置时,我们得到了CGD的随机扩展,这有效地解决了大规模流形优化问题。

英文摘要

Mirror Descent (MD) is a scalable first-order method widely used in large-scale optimization, with applications in image processing, policy optimization, and neural network training. This paper generalizes MD to optimization on Riemannian manifolds. In particular, we develop a Riemannian Mirror Descent (RMD) framework via reparameterization and further propose a stochastic variant of RMD. We also establish non-asymptotic convergence guarantees for both RMD and stochastic RMD. As an application to the Stiefel manifold, our RMD framework reduces to the Curvilinear Gradient Descent (CGD) method proposed in [26]. Moreover, when specializing the stochastic RMD framework to the Stiefel setting, we obtain a stochastic extension of CGD, which effectively addresses large-scale manifold optimization problems.
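For orientation, here is the classical Euclidean-space instance of mirror descent that the paper generalizes: the entropic mirror map on the probability simplex, which yields a multiplicative-weights update. This is a textbook special case given as a sketch, not the Riemannian method of the paper; the linear objective and step size are hypothetical.

```python
import math

def entropic_mirror_descent(cost, x0, lr, steps):
    """Minimize the linear objective <cost, x> over the probability simplex.

    With the negative-entropy mirror map, each step multiplies x[i] by
    exp(-lr * grad[i]) and renormalizes.
    """
    x = list(x0)
    for _ in range(steps):
        grad = cost  # gradient of <cost, x> is constant
        w = [x_i * math.exp(-lr * g_i) for x_i, g_i in zip(x, grad)]
        total = sum(w)
        x = [w_i / total for w_i in w]
    return x

x = entropic_mirror_descent(cost=[3.0, 1.0, 2.0], x0=[1 / 3, 1 / 3, 1 / 3], lr=0.5, steps=50)
print(x)
```

The iterates stay on the simplex without any explicit projection, which is the feature of mirror descent that makes a manifold-aware generalization natural.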

6. 高效学习、压缩与部署 10 篇

2606.12487 2026-06-12 cs.LG 新提交

DynamicPTQ: Mitigating Activation Quantization Collapse via Residual-Stream Dynamics

DynamicPTQ: 通过残差流动态缓解激活量化崩溃

Zimo Zhao, Maolin Wang, Bowen Yu, Bowen Liu, Xiao Han, Xiangyu Zhao

发表机构 * City University of Hong Kong(香港城市大学) Zhejiang University of Technology(浙江工业大学)

AI总结 提出DynamicPTQ,通过分析残差流中激活的相位式动态变化,识别量化敏感层并分配8位精度,在W4A4KV4量化下提升LLaMA-2/3的困惑度和零样本QA性能,吞吐量提升1.05-1.07倍。

详情
AI中文摘要

训练后量化(PTQ)对于高效的大语言模型推理至关重要,但当权重、激活和KV缓存全部量化到4位精度时,可靠地量化激活仍然具有挑战性。一个关键困难在于大规模激活,其极端值主导激活范围并放大量化误差。最先进的方法主要通过基于变换的平滑(如正交旋转和仿射缩放)来缓解大规模激活,但忽略了残差流的跨层动态。在本文中,我们展示了大规模激活在网络深度上以相位模式出现和消失,触发大的残差变化。这些变化导致新注入的逐层更新主导4位量化尺度,并削弱历史残差信息。为了表征这种行为,我们引入了跳跃比和历史特征信噪比。这表明基于静态变换的平滑无法完全解决由跨层残差变化引起的动态量化不稳定性。基于这一分析,我们提出了DynamicPTQ,一种用于相位感知混合精度激活量化的动态训练后量化策略。DynamicPTQ从残差流动态中识别量化敏感层,并仅对这些层分配8位激活精度,同时保持权重、KV缓存和其他激活为4位精度。它可以直接集成到强大的PTQ基线中,如QuaRot、SpinQuant和FlatQuant。在LLaMA-2和LLaMA-3上的实验表明,DynamicPTQ在W4A4KV4量化下一致地提高了困惑度和零样本QA性能,同时实现了1.05到1.07倍的吞吐量提升,且内存开销适中。这些结果展示了实现鲁棒低位LLM推理的实用路径。

英文摘要

Post-training quantization (PTQ) is essential for efficient large language model inference, but reliably quantizing activations remains challenging when weights, activations, and KV caches are all quantized to 4-bit precision. A key difficulty lies in massive activations, whose extreme values dominate the activation range and amplify quantization errors. State-of-the-art methods mainly mitigate massive activations through transformation-based smoothing, such as orthogonal rotations and affine scaling, but overlook the cross-layer dynamics of the residual stream. In this paper, we show that massive activations emerge and disappear in a phase-wise pattern across network depth, triggering large residual changes. These changes cause newly injected layer-wise updates to dominate the 4-bit quantization scale and weaken historical residual information. To characterize this behavior, we introduce Jump Ratio and Historical Feature SNR. This suggests that static transformation-based smoothing cannot fully resolve dynamic quantization instability caused by cross-layer residual changes. Based on this analysis, we propose DynamicPTQ, a Dynamic Post-Training Quantization policy for phase-aware mixed-precision activation quantization. DynamicPTQ identifies quantization-sensitive layers from residual-stream dynamics and assigns 8-bit activation precision only to these layers, while keeping weights, KV caches, and other activations in 4-bit precision. It can be directly integrated with strong PTQ baselines such as QuaRot, SpinQuant, and FlatQuant. Experiments on LLaMA-2 and LLaMA-3 show that DynamicPTQ consistently improves perplexity and zero-shot QA performance under W4A4KV4 quantization, while achieving 1.05 to 1.07 times throughput improvement with modest memory overhead. These results demonstrate a practical path toward robust low-bit LLM inference.
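The difficulty with massive activations can be seen in a toy fake-quantization example. It assumes symmetric per-tensor integer quantization with a max-abs scale, a common baseline that is not necessarily the scheme used in the paper, and the activation values are hypothetical. The Jump Ratio and Historical Feature SNR metrics are not defined in the abstract and are not reproduced.

```python
def fake_quantize(values, bits):
    """Symmetric per-tensor quantize-dequantize with a max-abs scale.

    Assumes at least one nonzero value so that the scale is positive.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    out = []
    for v in values:
        q = max(-qmax - 1, min(qmax, round(v / scale)))  # round, then clip to the integer range
        out.append(q * scale)
    return out

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical activation vector: three ordinary values and one massive outlier.
acts = [0.1, -0.2, 0.3, 50.0]
q4 = fake_quantize(acts, bits=4)
q8 = fake_quantize(acts, bits=8)
print(q4, squared_error(acts, q4))
print(q8, squared_error(acts, q8))
```

At 4 bits the outlier sets the scale and every ordinary value rounds to zero, while 8 bits keeps part of that information. This is the kind of gap that motivates giving higher precision only to the sensitive layers.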

2606.12876 2026-06-12 cs.LG cs.CL cs.IT math.IT 新提交

Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

使用加性码本的大语言模型多比特宽度量化

Liza Babaoglu, Shuangyi Chen, Ashish Khisti

发表机构 * University of Toronto(多伦多大学)

AI总结 提出Drop-by-Drop框架,基于信息论和逐次细化理论,利用加性码本和Matryoshka监督实现单个模型在推理时支持多精度权重控制,降低存储开销并保持性能。

Comments 37 pages, 12 figures

详情
AI中文摘要

随着大语言模型(LLM)在具有不同资源约束的异构硬件上部署越来越广泛,无需重新训练即可自适应管理性能与效率之间权衡的能力变得至关重要。我们提出Drop-by-Drop,一种新颖的多比特宽度训练后量化框架,能够从单个训练模型实现对LLM权重的推理时精度控制。我们的方法在理论上基于信息论和逐次细化。我们证明,通常服从高斯分布的LLM权重,在由LLM损失函数驱动的加权均方误差失真下,随着额外比特的加入可以以递增的保真度最优重建。为了在实践中实现这一点,Drop-by-Drop将Matryoshka风格的监督纳入损失函数,利用了加性码本的结构。Drop-by-Drop生成单个模型,其中有序的码本子集在每个精度级别产生精确的部分重建。这种方法通过允许单个检查点服务于多个比特宽度,显著减少了存储和内存开销,同时在主要架构(如Qwen、LLaMA、Gemma和Mistral)上保持了有竞争力的困惑度和准确度。

英文摘要

As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.
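The idea that ordered subsets of additive codebooks give progressively finer reconstructions can be sketched with a scalar toy. The example assumes greedy residual coding with uniform 2-bit codebooks whose step shrinks by a factor of 4 per stage; the names `make_codebooks`, `encode` and `decode` are hypothetical, and the Matryoshka-style training and weighted distortion from the paper are not modeled.

```python
def make_codebooks(num_stages):
    """One 4-entry (2-bit) codebook per stage, each 4x finer than the last."""
    books = []
    step = 0.5
    for _ in range(num_stages):
        books.append([-1.5 * step, -0.5 * step, 0.5 * step, 1.5 * step])
        step /= 4.0
    return books

def encode(w, books):
    """Greedy residual coding: each stage picks the entry nearest the current residual."""
    indices, residual = [], w
    for book in books:
        idx = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        indices.append(idx)
        residual -= book[idx]
    return indices

def decode(indices, books, num_stages):
    """Reconstruct from the first num_stages codebooks only (additive sum)."""
    return sum(books[m][indices[m]] for m in range(num_stages))

books = make_codebooks(3)
w = 0.3
idx = encode(w, books)
for k in (1, 2, 3):
    print(k, decode(idx, books, k), abs(w - decode(idx, books, k)))
```

A single stored index list serves every precision level: dropping trailing codebooks gives a coarser weight. For inputs in [-1, 1] the error after k stages is at most 0.25 / 4**(k - 1) in this toy.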

2606.13126 2026-06-12 cs.LG cs.AI cs.CL 新提交

MiniPIC: Flexible Position-Independent Caching in <100LOC

MiniPIC: 少于100行代码的灵活位置无关缓存

Nathan Ordonez, Thomas Parnell

发表机构 * IBM Research(IBM研究院)

AI总结 提出MiniPIC,通过无位置编码KV缓存和用户控制缓存重用原语,在vLLM中实现多种位置无关缓存方法,显著提升预填充吞吐量并降低首个令牌延迟。

Comments 13 pages, 5 figures

详情
AI中文摘要

检索增强和代理工作负载重复预填充可预测的结构化输入(我们称之为“跨度”),例如文档和代码文件。然而,vLLM等引擎中的前缀缓存无法重用KV条目,除非它们与另一个请求共享相同的前缀,而生产级推理服务器中的位置无关缓存(PIC)实现通常需要大量服务器代码更改或将KV状态保留在服务器外部,从而产生主机到设备的传输开销。我们提出了极简PIC(MiniPIC):一种最小化、灵活且快速的vLLM设计,由两个组件构建:无位置编码的KV缓存和用户控制的缓存重用原语。MiniPIC在KV缓存中存储未旋转的K向量,在注意力内部使用每请求逻辑位置对K块应用RoPE,并公开三个面向用户和令牌级别的原语:块对齐填充、跨度分隔符(SSep)和提示依赖(PDep),这些原语修改哈希行为和有效的块级因果注意力结构。通过少于100行的核心引擎更改加上自定义注意力后端,这些原语足以在同一个运行的vLLM实例中实现多种PIC方法,包括Block-Attention、EPIC和Prompt Cache,同时原生集成KV缓存CPU卸载实现。在2WikiMultihopQA上,使用交错调度的MiniPIC相比基线vLLM将预填充吞吐量提高了49%,将缓存跨度的首个令牌时间减少了最多两个数量级,保持了未缓存跨度的线性预填充扩展,并且仅产生5.7%的最坏情况开销。

英文摘要

Retrieval-augmented and agentic workloads repeatedly prefill recurring predictable structured inputs (which we call "spans") such as documents and code files. Yet, prefix caching in engines such as vLLM cannot reuse their KV entries unless they share identical prefixes with another request, while Position-Independent Caching (PIC) implementations within production-grade inference servers typically either require substantial server code changes or keep KV state outside the server, incurring host-to-device transfer overhead. We present Minimalistic PIC (MiniPIC): a minimal, flexible and fast vLLM design built from two ingredients: positional-encoding-free KV cache and user-controlled cache-reuse primitives. MiniPIC stores unrotated K vectors in the KV cache, applies RoPE to K tiles inside attention using per-request logical positions, and exposes three user-facing and token-level primitives: block-aligned padding, span separator (SSep), and prompt depend (PDep), that modify hashing behavior and effective block-level causal attention structure. With fewer than 100 lines of core-engine changes plus a custom attention backend, these primitives are sufficient to realize multiple PIC methods, including Block-Attention, EPIC, and Prompt Cache, within the same running vLLM instance, while natively integrating with KV cache CPU offload implementations. On 2WikiMultihopQA, MiniPIC with interleaved scheduling improves prefill throughput by 49% over baseline vLLM, reduces cached-span time-to-first-token by up to two orders of magnitude, preserves the linear prefill scaling of uncached spans, and incurs only 5.7% worst-case overhead.
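Storing unrotated K vectors and applying RoPE at attention time works because rotary attention scores depend only on relative position. The sketch below checks that property with a plain-Python RoPE, assuming the standard pairwise-rotation formulation with base 10000; it is not MiniPIC's kernel or vLLM code, and the vectors and positions are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply rotary position embedding to an even-length vector at position pos."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)  # frequency for the pair (i, i + 1)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and cached (unrotated) key.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# The same cached key is rotated with whatever logical position the request assigns.
score_original = dot(rope(q, 5), rope(k, 2))
score_shifted = dot(rope(q, 105), rope(k, 102))
print(score_original, score_shifted)
```

Shifting both positions by the same offset leaves the score unchanged, so a span cached without rotation can be reused at a different place in another prompt by rotating it with new logical positions.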

2606.13233 2026-06-12 cs.LG cs.AI 新提交

ReSET: Accurate Latency-Critical NVFP4 Reasoning via Step-Aware Temperature Scaling

ReSET: 通过步骤感知温度缩放实现精确的延迟关键型NVFP4推理

Sihwa Lee, Janghwan Lee, Donghoon Yoo, Jae Gon Kim, Hanyul Ryu, Soojung Ryu, Jungwook Choi

发表机构 * Hanyang University(汉阳大学) Xenoscube Korean Inc.(Xenoscube韩国公司)

AI总结 针对大型推理模型在NVFP4低精度推理中精度下降和延迟问题,提出基于推理步骤熵的温度缩放方法ReSET,并设计CUDA小M核,在多个基准上提升精度约2点,解码速度提升2倍。

详情
AI中文摘要

大型推理模型(LRMs)通过生成长中间推理轨迹来改进复杂问题求解,但这大幅增加了推理成本。NVFP4推理通过硬件支持的低精度执行提供了一种减少计算和内存成本的有前景方法。然而,直接将NVFP4应用于LRMs引入了两个实际限制:量化下推理精度下降,且现有NVFP4内核在小批量自回归解码中未能充分实现延迟优势。在这项工作中,我们分析了NVFP4量化对推理过程中token级不确定性的影响。我们表明,量化增加了低熵符号token的错误采样,同时导致在高不确定性推理步骤中过度集中于少量token。基于这一观察,我们提出了ReSET,一种基于推理步骤熵的温度缩放方法,它在线估计步骤级不确定性,并使用token级和步骤级熵信号自适应调整解码温度。为解决延迟差距,我们进一步设计了一个基于CUDA核心的小M NVFP4内核,用于延迟关键的自回归解码。在多个推理基准和模型规模上,ReSET将NVFP4推理精度相比NVFP4基线提升最多约2个点。我们的CUDA核心小M内核进一步改善了延迟关键解码,相比NVFP4 vLLM提供高达2.5倍的内核级加速,相比BF16提供约2倍的端到端解码加速。代码见 https://github.com/aiha-lab/ReSET。

英文摘要

Large reasoning models (LRMs) improve complex problem-solving by generating long intermediate reasoning traces, but this substantially increases inference costs. NVFP4 inference offers a promising approach to reduce both computational and memory costs through hardware-supported low-precision execution. However, directly applying NVFP4 to LRMs introduces two practical limitations: reasoning accuracy degrades under quantization, and existing NVFP4 kernels do not fully realize latency benefits in small-batch autoregressive decoding. In this work, we analyze the effect of NVFP4 quantization on token-level uncertainty during reasoning. We show that quantization increases incorrect sampling at low-entropy symbolic tokens, while causing over-concentration on a small set of tokens in high-uncertainty reasoning steps. Based on this observation, we propose ReSET, a reasoning-step entropy-based temperature-scaling method that estimates step-level uncertainty online and adapts the decoding temperature using both token-level and step-level entropy signals. To address the latency gap, we further design a CUDA-core small-M NVFP4 kernel for latency-critical autoregressive decoding. Across reasoning benchmarks and model scales, ReSET improves NVFP4 reasoning accuracy by up to about 2 points over the NVFP4 baseline. Our CUDA-core small-M kernel further improves latency-critical decoding, delivering up to 2.5× kernel-level speedup over NVFP4 vLLM and approximately 2× end-to-end decoding speedup over BF16. Code is available at https://github.com/aiha-lab/ReSET.
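
A rough sketch of entropy-driven temperature selection follows. It is not the ReSET rule from the paper: the thresholds, the temperature values, and the function `adapt_temperature` are all assumptions made for illustration. It only encodes the two observations in the abstract: sample more sharply at low-entropy tokens, and flatten the distribution when the current reasoning step looks highly uncertain.

```python
import math

def softmax(logits, temperature=1.0):
    m = max(logits)
    exps = [math.exp((x - m) / temperature) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def step_entropy(token_entropies):
    # Online step-level estimate: mean entropy of the tokens seen so far
    # in the current reasoning step (an assumed, simplistic estimator).
    return sum(token_entropies) / len(token_entropies)

def adapt_temperature(logits, step_h, base_t=1.0,
                      low=0.5, high=2.0, t_min=0.7, t_max=1.3):
    # All thresholds and temperatures here are hypothetical.
    token_h = entropy(softmax(logits))
    if token_h < low:      # near-deterministic token: sample more sharply
        return t_min
    if step_h > high:      # uncertain step: counter over-concentration
        return t_max
    return base_t
```

For instance, `adapt_temperature([10, 0, 0, 0], step_h=1.0)` selects the lower temperature because the token distribution is almost deterministic.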

2606.13379 2026-06-12 cs.LG cs.AR cs.ET 新提交

Positional Encoding in the Context of Memristor-Based Analog Computation for Automatic Speech Recognition

基于忆阻器的模拟计算在自动语音识别中的位置编码

Benedikt Hilmes, Nick Rossenbach, Ralf Schlüter

发表机构 * Machine Learning and Human Language Technology Group, Faculty of Computer Science, RWTH Aachen University(亚琛工业大学计算机科学学院机器学习和人类语言技术组) Apptek GmbH(Apptek 有限公司)

AI总结 针对忆阻器模拟计算中位置编码导致模数转换精度下降的问题,通过调整ADC权重和精度位比例或移除编码相关线性变换,分别降低约50%和30%的性能损失。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

忆阻器通过实现向量-矩阵乘法的模拟执行,为自然语言处理神经模型的资源高效计算提供了新机遇。然而,目前这些器件在权重编程和执行过程中都容易产生较大的失真。在这项工作中,我们发现转换后的位置编码的大输出值会导致忆阻器计算中模数转换(ADC)的严重退化。通过调整特定忆阻器层的ADC权重和精度位的比例,我们将执行退化相对降低了约50%,同时保持估计能耗稳定。此外,我们研究了ADC无法修改的情况。在这种情况下,移除编码相关的线性变换后,退化可相对降低约30%。

英文摘要

Memristors offer a new opportunity for resource-efficient computation of neural models for natural language processing by enabling analog execution of vector-matrix multiplication. Yet, computations on these devices are currently subject to substantial distortion, both in weight programming and in execution. In this work, we identify large output values of transformed positional encodings as a cause of major degradation in analog-to-digital conversion (ADC) during memristor-based computation. By adjusting the proportion of weight and precision bits of the ADC of specific memristor layers, we reduce the degradation of the execution by ~50% relative, while keeping the estimated energy consumption stable. Additionally, we investigate scenarios where the ADC cannot be modified. In that case, the degradation can be reduced by ~30% relative after removing encoding-related linear transformations.
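
To see why a few large outputs can hurt an ADC with a fixed bit budget, consider a generic uniform quantizer. This is an illustrative model only, not the paper's ADC or its bit-allocation scheme; `adc`, `mean_abs_error`, and the sample values are assumptions. Widening the full-scale range to accommodate large values coarsens the step size, so ordinary small activations are resolved less precisely.

```python
def adc(x, full_scale, bits):
    """Symmetric uniform quantizer with clipping at +/- full_scale."""
    levels = 2 ** (bits - 1) - 1
    step = full_scale / levels
    clipped = max(-full_scale, min(full_scale, x))
    return round(clipped / step) * step

def mean_abs_error(values, full_scale, bits):
    return sum(abs(v - adc(v, full_scale, bits)) for v in values) / len(values)

typical = [0.11, -0.23, 0.37]   # toy activations, assumption

# Range sized for typical outputs versus a range stretched 16x
# to cover rare large outputs, both with the same 8-bit budget.
err_narrow = mean_abs_error(typical, full_scale=1.0, bits=8)
err_wide = mean_abs_error(typical, full_scale=16.0, bits=8)
```

With the same number of bits, `err_wide` is larger than `err_narrow`, which is the kind of trade-off that motivates either changing the ADC configuration or removing the source of the large values.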

2606.13177 2026-06-12 cs.CL cs.AI cs.LG 交叉投稿

MemRefine: LLM-Guided Compression for Long-Term Agent Memory

MemRefine: 基于LLM引导的压缩用于长期智能体记忆

Minjae Kim, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang

发表机构 * Korea University(高丽大学) KAIST(韩国科学技术院)

AI总结 提出MemRefine框架,利用LLM判断事实内容,通过删除、合并和保留操作将记忆库压缩到固定预算内,在多个基准上保持下游性能并优于基于规则的基线。

详情
AI中文摘要

大型语言模型(LLM)智能体越来越需要在长期交互中运行,其中过去对话中的信息必须被保留和回忆以支持未来任务。然而,随着交互的积累,记忆存储无限制增长,并充满冗余条目,这些条目增加了存储成本,并通过排挤最有用的证据而降低了检索质量。此外,在具有硬性内存预算的资源受限平台上,这尤其受限,促使我们制定了有存储预算的记忆管理任务,即在固定预算内保持已构建的记忆库,同时保留对未来交互有用的信息。为此,我们提出了MemRefine,一个基于LLM引导的框架,由于表面相似性不能很好地反映事实价值,该框架仅使用相似性来提出候选对,并将删除、合并和保留决策推迟给基于事实内容的LLM判断,迭代直到满足预算。在多个记忆框架和长期对话基准上,MemRefine始终满足目标预算,同时保持下游性能,并在紧预算下优于基于规则的基线。

英文摘要

Large language model (LLM) agents are increasingly expected to operate over long-term interactions, where information from past dialogues must be preserved and recalled to support future tasks. However, as interactions accumulate, the memory store grows without bound and fills with redundant entries that inflate storage cost and degrade retrieval by crowding out the most useful evidence. This is especially limiting on resource-constrained platforms with hard memory budgets, which motivates us to formulate storage-budgeted memory management: the task of keeping an already constructed memory store within a fixed budget while preserving information useful for future interactions. To this end, we propose MemRefine, an LLM-guided framework. Because surface similarity poorly reflects factual value, MemRefine uses similarity only to propose candidate pairs and defers delete, merge, and preserve decisions to an LLM judge based on factual content, iterating until the budget is met. Across multiple memory frameworks and long-term conversation benchmarks, MemRefine consistently meets target budgets while preserving downstream performance and outperforming rule-based baselines under tight budgets.
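
The propose-then-judge loop described above can be sketched as follows. Everything here is a stand-in: `jaccard` replaces whatever similarity the framework uses, and `stub_judge` is a hypothetical rule-based placeholder for the LLM judge, which in the paper decides on factual content. The sketch only shows the control flow: similarity proposes a pair, the judge decides, and the loop stops once the budget is met or no candidates remain.

```python
from itertools import combinations

def jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def stub_judge(a, b):
    """Hypothetical placeholder for the LLM judge.
    Returns (action, text), where text is the entry kept or the merged entry."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if ta <= tb:
        return "delete", b              # a adds nothing beyond b
    if tb <= ta:
        return "delete", a
    if jaccard(a, b) >= 0.5:
        return "merge", a + " / " + b
    return "preserve", None

def compress(store, budget, judge=stub_judge):
    store = list(store)
    preserved = set()                   # pairs the judge chose to keep apart
    while len(store) > budget:
        pairs = [(jaccard(a, b), a, b)
                 for a, b in combinations(store, 2) if (a, b) not in preserved]
        if not pairs:
            break                       # nothing left to propose
        _, a, b = max(pairs, key=lambda p: p[0])
        action, text = judge(a, b)
        if action == "preserve":
            preserved.add((a, b))
            continue
        store = [m for m in store if m not in (a, b)] + [text]
    return store

memories = ["Alice lives in Paris", "Alice lives in Paris now",
            "Bob owns a red bike", "Carol teaches math"]
result = compress(memories, budget=3)
```

In this toy run the two Alice entries form the most similar pair, and the stub judge keeps the more informative one.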

2606.13501 2026-06-12 cs.DC cs.LG cs.PF 交叉投稿

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

GF-DiT:扩散Transformer服务的并行调度

Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang, Han Zhao, Chen Chen, Yu Feng, Jingwen Leng, Minyi Guo

AI总结 提出GF-DiT,一种策略可编程运行时,通过动态调整请求并行度来优化扩散Transformer服务,利用无组集合通信实现低开销在线重配置,显著提升吞吐量和降低延迟。

详情
AI中文摘要

扩散Transformer(DiT)已成为图像和视频生成的主流架构,对高效DiT服务的需求日益增长。现有系统为每个请求在其整个生命周期内分配固定的并行配置。然而,DiT工作负载在请求、执行阶段和系统条件之间表现出显著的异构性,使得静态并行性效率低下,通常导致GPU利用率低和服务质量下降。本文认为,DiT服务应将GPU并行性视为一种可调度的资源。我们提出GF-DiT,一种策略可编程的弹性DiT服务运行时,能够根据工作负载需求和服务目标动态调整运行中请求的并行度。GF-DiT引入了一种异步执行抽象,将请求分解为独立可调度的轨迹任务,并支持在线GPU重新分配。为了使弹性并行性实用化,GF-DiT进一步提出了无组集合(group-free collectives),一种轻量级通信抽象,支持低开销的任意执行组在线形成和重新配置。我们在vLLM-Omni中实现了GF-DiT,并在代表性的图像和视频扩散工作负载上进行了评估。与具有静态并行性的固定流水线执行相比,GF-DiT将吞吐量提高了高达6.01倍,平均延迟降低了高达95%,SLO违规率降低了高达90%,并将通信组设置开销从778毫秒降低到约60微秒。

英文摘要

Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups. We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01×, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 μs.

2601.06227 2026-06-12 cs.LG cs.AI 版本更新

When Smaller Wins: Dual-Stage Distillation and Pareto-Guided Compression of Liquid Neural Networks for Edge Battery Prognostics

当更小胜出:面向边缘电池健康预测的液态神经网络双阶段蒸馏与帕累托引导压缩

Dhivya Dharshini Kannan, Wei Li, Wei Zhang, Jianbiao Wang, Zhi Wei Seh, Man-Fai Ng

发表机构 * Singapore Institute of Technology(新加坡科技学院) Institute of Materials Research and Engineering(材料研究与工程研究所) Agency for Science, Technology and Research(科技研究局) Institute of High Performance Computing(高性能计算研究所)

AI总结 提出DLNet框架,通过欧拉离散化、双阶段知识蒸馏和帕累托引导压缩,将高容量液态神经网络压缩为边缘可部署模型,在电池健康预测中实现小模型超越大模型。

Comments Accepted at International Conference on Pattern Recognition, ICPR 2026. Code available at: https://github.com/Dhivya-DD17/DLNet

详情
AI中文摘要

电池管理系统日益需要在严格的设备端约束下进行准确的电池健康预测。本文提出DLNet,一个实用的双阶段液态神经网络蒸馏框架,将高容量模型转化为紧凑且可边缘部署的电池健康预测模型。DLNet首先应用欧拉离散化重新表述液态动力学以实现嵌入式兼容性。然后进行双阶段知识蒸馏,以传递教师模型的时间行为,并在进一步压缩后恢复该行为。在联合误差-成本目标下的帕累托引导选择保留了平衡准确性和效率的学生模型。我们在广泛使用的数据集上评估DLNet,并在Arduino Nano 33 BLE Sense上使用int8部署验证实际设备可行性。最终部署的学生模型在预测未来100个周期的电池健康时实现了0.0066的低误差,比教师模型低15.4%。模型大小从616 kB减少到94 kB,减少了84.7%,在设备上每次推理耗时21毫秒。这些结果支持了一个实用的“更小胜出”观察:通过适当的监督和选择,小模型可以在边缘预测中匹配或超越大模型。除了电池,DLNet框架可以扩展到其他具有严格硬件约束的工业分析任务。

英文摘要

Battery management systems increasingly require accurate battery health prognostics under strict on-device constraints. This paper presents DLNet, a practical framework with dual-stage distillation of liquid neural networks that turns a high-capacity model into compact and edge-deployable models for battery health prediction. DLNet first applies Euler discretization to reformulate liquid dynamics for embedded compatibility. It then performs dual-stage knowledge distillation to transfer the teacher model's temporal behavior and recover it after further compression. Pareto-guided selection under joint error-cost objectives retains student models that balance accuracy and efficiency. We evaluate DLNet on a widely used dataset and validate real-device feasibility on an Arduino Nano 33 BLE Sense using int8 deployment. The final deployed student achieves a low error of 0.0066 when predicting battery health over the next 100 cycles, which is 15.4% lower than the teacher model. It reduces the model size from 616 kB to 94 kB (an 84.7% reduction) and takes 21 ms per inference on the device. These results support a practical "smaller wins" observation: a small model can match or exceed a large teacher for edge-based prognostics with proper supervision and selection. Beyond batteries, the DLNet framework can extend to other industrial analytics tasks with strict hardware constraints.
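
As a sketch of what Euler discretization of liquid dynamics can look like, the snippet below steps a single-neuron liquid time-constant style ODE, dx/dt = -(1/tau + f) x + f A, with a fixed step size. The specific cell, the gate `f`, and every parameter value are assumptions for illustration and are not taken from DLNet. The last lines simply recheck the size-reduction figure quoted in the abstract.

```python
import math

def ltc_derivative(x, u, tau=1.0, a=1.0, w=0.5, b=0.0):
    # Input-dependent gate f in (0, 1); weights are hypothetical.
    f = 1.0 / (1.0 + math.exp(-(w * u + b)))
    return -(1.0 / tau + f) * x + f * a

def euler_rollout(x0, inputs, dt=0.1):
    """Explicit Euler: x_{t+1} = x_t + dt * dx/dt, one step per input sample."""
    x = x0
    for u in inputs:
        x = x + dt * ltc_derivative(x, u)
    return x

# With a constant zero input, f = 0.5 and the state settles at
# f * a / (1 / tau + f) = 0.5 / 1.5 = 1/3.
steady = euler_rollout(0.0, [0.0] * 500)

# Size reduction reported in the abstract: 616 kB -> 94 kB.
reduction_percent = (616 - 94) / 616 * 100
```

The update needs only multiplies, adds, and one sigmoid per step, which is the property that makes a fixed-step form attractive on microcontrollers.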

2601.17654 2026-06-12 cs.LG cs.DC 版本更新

Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training

Kareus:大型模型训练中动态与静态能量的联合降低

Ruofan Wu, Jae-Won Chung, Mosharaf Chowdhury

发表机构 * University of Michigan(密歇根大学)

AI总结 针对AI训练能耗高昂问题,提出Kareus系统,通过联合优化细粒度内核调度与频率缩放,协同降低动态和静态能耗,在相同训练时间下节能28.3%,或相同能耗下提速27.5%。

Comments OSDI '26 | Open-source at https://github.com/ml-energy/kareus

详情
AI中文摘要

AI的计算需求正以前所未有的速度增长,但能源供应并未跟上步伐。因此,能源已成为一种昂贵且受争抢的资源,需要明确的管理和优化。尽管近期工作在大型模型训练优化方面取得了显著进展,但它们侧重于优化动态或静态能耗中的一种。我们发现,细粒度的内核调度和频率缩放共同且相互依赖地影响动态和静态能耗。基于这一发现,我们设计了Kareus,一个通过优化两方面来推动时间-能耗权衡前沿的训练系统。Kareus将棘手的联合优化问题分解为基于分区的局部子问题,然后使用多遍多目标优化算法来找到推动时间-能耗权衡前沿的执行调度。与现有技术相比,Kareus在相同训练时间下最多可减少28.3%的训练能耗,或在相同能耗下最多减少27.5%的训练时间。

英文摘要

The computing demand of AI is growing at an unprecedented rate, but energy supply is not keeping pace. As a result, energy has become an expensive and contended resource that requires explicit management and optimization. Although recent works have made significant progress in large model training optimization, they focus on optimizing either dynamic or static energy consumption. We find that fine-grained kernel scheduling and frequency scaling jointly and interdependently impact both dynamic and static energy consumption. Based on this finding, we design Kareus, a training system that pushes the time-energy tradeoff frontier by optimizing both aspects. Kareus decomposes the intractable joint optimization problem into local, partition-based subproblems. It then uses a multi-pass multi-objective optimization algorithm to find execution schedules that push the time-energy tradeoff frontier. Compared to the state of the art, Kareus reduces training energy by up to 28.3% at the same training time, or reduces training time by up to 27.5% at the same energy consumption.
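
The "time-energy tradeoff frontier" mentioned above is a Pareto frontier over candidate execution schedules. The sketch below filters a list of (time, energy) measurements down to the non-dominated ones. It is a generic Pareto filter, not Kareus's multi-pass optimizer, and the candidate numbers are made up for illustration.

```python
def pareto_frontier(points):
    """Keep (time, energy) points that no other point beats on both axes."""
    frontier = []
    # After sorting by time, a point is on the frontier only if it
    # uses strictly less energy than every faster point kept so far.
    for t, e in sorted(set(points)):
        if not frontier or e < frontier[-1][1]:
            frontier.append((t, e))
    return frontier

# Hypothetical (iteration time in s, energy in J) for candidate schedules.
candidates = [(10, 100), (12, 80), (11, 120), (15, 80), (9, 150)]
frontier = pareto_frontier(candidates)
```

Here `(11, 120)` is dominated by `(10, 100)` and `(15, 80)` by `(12, 80)`, so three schedules remain. "Pushing the frontier" means finding schedules that dominate points on the existing one.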

2505.04021 2026-06-12 cs.DC cs.AI cs.LG cs.PF 版本更新

Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

Prism: 通过GPU内存气球实现经济高效的多LLM服务

Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, Ying Sheng

发表机构 * UCLA(加州大学洛杉矶分校) UC Berkeley(加州大学伯克利分校) Harvard University(哈佛大学) CMU(卡内基梅隆大学) University of Edinburgh(爱丁堡大学) Intel(英特尔) Stanford University(斯坦福大学) LMSYS ByteDance(字节跳动) Alibaba Cloud(阿里云) Tsinghua University(清华大学) Novita AI Rice University(莱斯大学)

AI总结 针对多LLM服务中资源效率低下的问题,提出基于内存气球的内存中心化LLM协同服务框架Prism,统一空间与时间共享,已在10K+ GPU生产环境部署。

Comments OSDI'26

详情
AI中文摘要

推理提供商必须为许多LLM保持可用性,包括低流量但关键的模型,随着token价格下降,资源效率变得越来越重要。对生产轨迹的分析揭示了一种动态突发组模式,其中一组模型同时活跃并随时间变化;现有的空间和时间共享方法缺乏适应这种变化的原理性机制,迫使在SLO遵守和效率之间进行权衡。我们观察到弹性内存分配可以统一空间和时间共享。基于这一洞察,我们开发了Prism,一个以内存为中心的LLM协同服务框架,它应用内存气球来跨模型回收内存,并在单一方案下支持两种形式的共享。Prism的气球驱动程序,称为kvcached,已在https://github.com/... 开源,并在超过10K GPU的生产环境中部署。

英文摘要

Inference providers must maintain availability for many LLMs, including low-volume but essential models, making resource efficiency increasingly important as token prices fall. Analysis of production traces reveals a dynamic bursty-group pattern in which sets of models become active together and shift over time; existing space- and time-sharing approaches lack principled mechanisms to adapt to this variability, forcing trade-offs between SLO adherence and efficiency. We observe that elastic memory allocation can unify spatial and temporal sharing. Based on this insight, we have developed Prism, a memory-centric LLM co-serving framework that applies memory ballooning to reclaim memory across models and support both forms of sharing under a single scheme. Prism's balloon driver, referred to as kvcached, has been open-sourced at https://github.com/ovg-project/kvcached, and deployed in production environments across 10K+ GPUs.

7. 联邦学习、隐私与安全 8 篇

2606.12679 2026-06-12 cs.LG cs.CR eess.IV 新提交

Fed-FBD: Federated Functional Block Diversification for Isolation, Privacy, and Surgical Unlearning

Fed-FBD:用于隔离、隐私和精准遗忘的联邦功能块多样化

Weijie Chen, Alan B. McMillan

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 提出Fed-FBD模块化联邦架构,将ResNet分解为六个功能块并维护颜色变体仓库,实现块级隔离、隐私设计和亚秒级精准遗忘,在多个数据集上以微小精度代价换取安全保障。

Comments 12 pages, 3 figures, 8 tables. Code: https://github.com/wchen-ai/functional-block-diversification

详情
AI中文摘要

联邦学习(FL)能够在无需共享原始患者数据的情况下进行协作模型训练,但标准方法(如FedAvg)将每个客户端视为黑盒,无法隔离对抗性贡献者、审计每个客户端的影响或尊重已退出参与者的被遗忘权。我们提出Fed-FBD(联邦功能块多样化),一种模块化联邦架构,将ResNet骨干网络分解为六个功能块(主干、四个残差组和分类头),并维护一个包含N种颜色变体的仓库,每种变体由独立跟踪和贡献者标记的块组装而成。Fed-FBD提供了FedAvg所不具备的三种能力:(i) 架构保证的块级隔离,使对抗性或错误标注的客户端无法污染干净颜色;(ii) 隐私设计,在应用任何隐私机制之前,成员推断优势已与随机猜测无异;(iii) 在亚秒级成本下无需重新训练即可精准遗忘已退出参与者的贡献。在六个MedMNIST-2D数据集、224x224的PathMNIST和CIFAR-10上的实验表明,Fed-FBD在规模足够的数据集上以0.3%-3.1%的IID精度差距换取这些保证,在四个数据集中的三个上,Dirichlet alpha=1.0时与FedAvg的差距在0.8%-4.0%以内,并将我们研究的所有六种对抗性攻击限制在中毒客户端自己的块内,干净颜色上的AUC漂移最多为+/-0.01。

英文摘要

Federated learning (FL) enables collaborative model training without sharing raw patient data, but standard approaches such as FedAvg treat each client as a black box and provide no mechanism for isolating an adversarial contributor, auditing per-client influence, or honoring a departed participant's right to be forgotten. We present Fed-FBD (Federated Functional Block Diversification), a modular federated architecture that decomposes a ResNet backbone into six functional blocks (the stem, four residual groups, and the classification head) and maintains a warehouse of N color variants, each assembled from independently tracked and contributor-stamped blocks. Fed-FBD provides three capabilities absent in FedAvg: (i) architecturally guaranteed block-level isolation, so that an adversarial or mislabelled client cannot contaminate the clean colors; (ii) privacy-by-design, where membership inference advantage is already indistinguishable from chance before any privacy mechanism is applied; and (iii) surgical machine unlearning of a departed participant's contribution at sub-second cost and without retraining. Experiments on six MedMNIST-2D datasets, PathMNIST at 224x224, and CIFAR-10 show that Fed-FBD trades a modest 0.3%-3.1% IID accuracy gap on the adequately sized datasets for these guarantees, remains within 0.8%-4.0% of FedAvg at Dirichlet alpha=1.0 on three of four datasets, and confines all six adversarial attacks we study to the poisoned client's own blocks with at most +/-0.01 AUC drift on the clean colors.

2606.12654 2026-06-12 stat.ME cs.LG stat.ML 交叉投稿

Computationally tractable robust differentially private mean estimation

计算可处理的鲁棒差分隐私均值估计

Kelly Ramsay

AI总结 提出一种名为“气球均值”的新差分隐私均值估计器,通过扩展马氏距离球上的迭代裁剪实现计算可处理性、鲁棒性及零集中差分隐私,理论保证在重尾和污染椭圆模型下的统计性能与鲁棒性。

Comments 40 pages, 17 figures

详情
AI中文摘要

我们开发了一种新的差分隐私均值估计器,称为气球均值。气球均值的主要特点是计算可处理且对异常观测具有鲁棒性。它基于在扩展的马氏距离球(即“气球”)上的迭代裁剪过程。该方法满足零集中差分隐私,并依赖于少量可解释的调优参数。我们在重尾和污染椭圆模型下提供了理论保证,刻画了其统计性能和对异常值的鲁棒性。大量模拟表明,气球均值对重尾和污染数据具有鲁棒性,并且在污染环境下优于现有的差分隐私均值估计器。

英文摘要

We develop a new, differentially private mean estimator called the balloon mean. The main features of the balloon mean are that it is computationally tractable and enjoys robustness to outlying observations. It is based on an iterative clipping procedure over expanding Mahalanobis balls, or "balloons." The method satisfies zero-concentrated differential privacy and depends on a small number of interpretable tuning parameters. We provide theoretical guarantees under heavy-tailed and contaminated elliptical models, characterizing its statistical performance and robustness to outliers. Extensive simulations demonstrate that the balloon mean is robust to heavy-tailed and contaminated data, and outperforms existing differentially private mean estimators in contaminated settings.
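
The snippet below is not the balloon mean. It is a minimal sketch of one building block that such estimators rest on: clip each point to a ball, average, and add Gaussian noise calibrated to zero-concentrated differential privacy. It assumes an identity covariance (so the Mahalanobis ball is a Euclidean ball), a single fixed radius rather than expanding balloons, and replace-one neighbouring datasets. The names `clip_to_ball` and `private_clipped_mean` are hypothetical. The calibration uses the standard fact that a Gaussian mechanism with L2 sensitivity Δ and noise scale σ satisfies ρ-zCDP with ρ = Δ²/(2σ²).

```python
import math
import random

def clip_to_ball(x, center, radius):
    """Project x onto the Euclidean ball of the given radius around center."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(x)
    scale = radius / norm
    return [ci + d * scale for ci, d in zip(center, diff)]

def private_clipped_mean(data, center, radius, rho, rng):
    n, dim = len(data), len(center)
    clipped = [clip_to_ball(x, center, radius) for x in data]
    mean = [sum(c[j] for c in clipped) / n for j in range(dim)]
    # Replacing one record moves the clipped mean by at most 2 * radius / n.
    sensitivity = 2.0 * radius / n
    sigma = sensitivity / math.sqrt(2.0 * rho)   # rho = sensitivity^2 / (2 sigma^2)
    return [m + rng.gauss(0.0, sigma) for m in mean]

data = [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1], [1000.0, 1000.0]]
estimate = private_clipped_mean(data, [0.0, 0.0], 1.0, 0.5, random.Random(0))
```

Clipping bounds the influence of the outlier at `[1000, 1000]`, which is the source of both the finite sensitivity and the robustness. The iterative, expanding-radius procedure in the paper addresses how to choose the centre and radius privately, which this sketch takes as given.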

2606.12703 2026-06-12 cs.CR cs.AI cs.LG 交叉投稿

SMSR: Certified Defence Against Runtime Memory Poisoning in Persistent LLM Agent Systems

SMSR:针对持久化LLM代理系统中运行时内存投毒的认证防御

Tarun Sharma

AI总结 提出SMSR防御框架,通过写入时HMAC签名和查询时随机化内存消融与基于判决的多数投票,首次为多会话内存投毒攻击提供认证鲁棒性保证。

详情
AI中文摘要

检索增强生成(RAG)代理越来越多地使用跨用户会话累积的持久化内存。这创造了一个新的攻击面:仅通过正常渠道交互的对手可以注入精心构造的内存,一旦被检索,就会影响未来用户的代理响应,而无需触及模型权重或代码。我们将此称为多会话内存投毒(MSMP),并表明现有防御无法对此进行认证;静态语料库防御(RobustRAG、ReliabilityRAG)假设固定的知识库,而启发式过滤器则被流畅的企业风格文本绕过。我们提出了带平滑检索的签名内存(SMSR),这是首个针对此场景提供认证鲁棒性边界的防御。组件1在写入时添加HMAC-SHA256来源证明,阻止未签名注入。组件2在查询时应用随机化内存消融与基于判决的多数投票,限制认证对手的影响。我们证明了无来源证明的检索时过滤器无法认证自适应注入,推导了组件2的超几何证书,并形式化了一致少数效应,即一致对抗答案在基于字符串的投票中作为数值少数胜出,而基于判决的投票则将其移除。在15个企业场景(3150次重复试验)中,组件1将未签名变体的攻击成功率从93-100%降至0%。对于单次注入的认证对手,组件2将成功率控制在8.0%(95% CI [5.8, 10.9], n=450),低于认证最坏情况。在端到端仅查询攻击中(代理自身写入投毒而非预植入),SMSR在实时代理栈上将成功率从65.3%降至5.3%(n=150,非重叠置信区间)。干净查询效用为90%(组件1)和85%(组合)。

英文摘要

Retrieval-augmented generation (RAG) agents increasingly run with persistent memory that accumulates across user sessions. This creates a new attack surface: an adversary interacting only through normal channels can inject crafted memories that, once retrieved, steer the agent's responses for future users, without touching model weights or code. We call this Multi-Session Memory Poisoning (MSMP) and show that no existing defence certifies against it; static-corpus defences (RobustRAG, ReliabilityRAG) assume a fixed knowledge base, and heuristic filters are bypassed by fluent enterprise-style text. We present Signed Memory with Smoothed Retrieval (SMSR), the first defence with a certified robustness bound for this setting. Component 1 adds HMAC-SHA256 provenance at write time, blocking unsigned injection. Component 2 applies randomised memory ablation with verdict-based majority voting at query time, bounding the influence of authenticated adversaries. We prove that no provenance-free retrieval-time filter can certify against adaptive injection, derive a hypergeometric certificate for Component 2, and formalise the Consistent Minority Effect, whereby a consistent adversarial answer wins string-based voting as a numerical minority while verdict-based voting removes it. Across 15 enterprise scenarios (3,150 repeated trials), Component 1 cuts attack success from 93-100% to 0% for all unsigned variants. For an authenticated adversary with a single injection, Component 2 holds success to 8.0% (95% CI [5.8, 10.9], n=450), below the certified worst case. In an end-to-end query-only attack where the agent itself writes the poison rather than it being pre-seeded, SMSR reduces success from 65.3% to 5.3% (n=150, non-overlapping CIs) on a live agent stack. Clean-query utility is 90% (Component 1) and 85% (combined).
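
The two components can be sketched with Python's standard `hmac` module. This is an illustrative toy, not the SMSR implementation: the key, the memory format, `answer_fn`, and the subset size are assumptions, and the paper's verdict-based voting and certificate are more involved than the plain majority vote shown here. `hit_probability` is the elementary hypergeometric count of how often a random subset contains at least one poisoned entry.

```python
import hashlib
import hmac
import math
import random
from collections import Counter

SECRET_KEY = b"demo-key-not-for-production"   # hypothetical key

def sign(text):
    return hmac.new(SECRET_KEY, text.encode("utf-8"), hashlib.sha256).hexdigest()

def write_memory(store, text):
    # Component 1: provenance tag attached at write time.
    store.append({"text": text, "tag": sign(text)})

def verified(store):
    # Entries whose tag does not match are treated as unsigned injections.
    return [m for m in store if hmac.compare_digest(m["tag"], sign(m["text"]))]

def smoothed_answer(store, answer_fn, keep=2, rounds=201, seed=0):
    # Component 2 (simplified): answer from random memory subsets, then vote.
    rng = random.Random(seed)
    memories = verified(store)
    votes = Counter()
    for _ in range(rounds):
        subset = rng.sample(memories, min(keep, len(memories)))
        votes[answer_fn(subset)] += 1
    return votes.most_common(1)[0][0]

def hit_probability(n, poisoned, keep):
    """P(a random subset of size keep contains at least one poisoned entry)."""
    return 1.0 - math.comb(n - poisoned, keep) / math.comb(n, keep)
```

With six verified memories of which one is poisoned and `keep=2`, `hit_probability(6, 1, 2)` is 1/3, so a majority of the sampled subsets never see the poisoned entry.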

2606.12845 2026-06-12 cs.CR cs.LG 交叉投稿

A Privacy-Preserving Framework Using Remote Data Science for Inter-Institutional Student Retention Prediction

一种使用远程数据科学的隐私保护框架用于机构间学生保留率预测

John Fields, K M Sajjadul Islam, Ruchitha Thota, Victor Chen, Praveen Madiraju

AI总结 提出基于PySyft和半气隙架构的远程数据科学框架,实现三所大学在不直接访问敏感数据的情况下协作预测学生保留率,验证了隐私保护机器学习在教育场景的可行性。

Comments 7 pages, 2 figures. Accepted at the 2026 IEEE International Conference on Information Reuse and Integration (IEEE IRI 2026)

详情
AI中文摘要

本研究探索了使用PySyft平台的隐私保护机器学习(PPML)技术,以实现机构间学生保留率的协作预测。我们开发了一个远程数据科学(RDS)框架,采用半气隙架构,包含高侧(high-side)和低侧(low-side)服务器,使来自三所大学的研究人员能够在无需直接访问数据的情况下,基于敏感学生数据构建预测模型。利用一所小型私立大学的历史数据(N=720),我们评估了三种合成数据生成方法,并通过机构间协作验证了该框架。结果显示,各机构的分类性能一致(Macro F1: 0.690--0.695),同时严格遵守《家庭教育权利和隐私法案》(FERPA)。我们还提出了数据类型感知模板,这是一种新颖的合成数据方法,优先考虑隐私而非分布保真度。我们的发现证实,基于RDS的PPML在教育环境中技术上可行,并为小规模机构间协作提供了一种联邦学习的实用替代方案。代码可在以下网址获取:https://github.com/jtfields/NAIRR240195-Privacy-Preserving-Machine-Learning。

英文摘要

This study explores privacy-preserving machine learning (PPML) techniques using the PySyft platform to enable collaborative prediction of student retention between institutions. We developed a remote data science (RDS) framework with a semi-air-gapped architecture consisting of high-side and low-side servers, allowing researchers from three universities to build predictive models on sensitive student data without direct data access. Using historical data from a small private university (N=720), we evaluated three synthetic data generation approaches and validated the framework through inter-institutional collaboration. The results demonstrate consistent classification performance across institutions (Macro F1: 0.690--0.695) while maintaining strict Family Educational Rights and Privacy Act (FERPA) compliance. We also propose Data-Type-Aware Templates, a novel synthetic data method that prioritizes privacy over distributional fidelity. Our findings confirm that RDS-based PPML is technically feasible for educational settings and offers a practical alternative to federated learning for small-scale inter-institutional collaborations. The code is available at https://github.com/jtfields/NAIRR240195-Privacy-Preserving-Machine-Learning.

2606.13045 2026-06-12 cond-mat.dis-nn cs.LG 交叉投稿

A solvable model for unsupervised federated learning

无监督联邦学习的一个可解模型

Giovanni Catania, Aurélien Decelle, Gianluca Manzan, Beatriz Seoane, Daniele Tantari

发表机构 * Institute for Cross-disciplinary Physics and Complex Systems IFISC (CSIC-UIB)(跨学科物理与复杂系统研究所(IFISC,CSIC-UIB)) Departamento de Física Teórica, Universidad Complutense de Madrid(马德里康普顿斯大学理论物理系) Escuela Técnica Superior de Ingenieros Industriales, Universidad Politécnica de Madrid(马德里理工大学工业工程师高等技术学院) GISC - Grupo Interdisciplinar de Sistemas Complejos(跨学科复杂系统小组) Inria Saclay - Tau team(Inria萨克雷中心Tau团队) Department of Mathematics, University of Bologna(博洛尼亚大学数学系)

AI总结 提出一个理论框架,通过教师-多学生交互场景分析联邦学习,证明学生间交互能系统提升学习性能,并推导最优贝叶斯条件,映射到受限玻尔兹曼机。

详情
AI中文摘要

我们引入了一个理论框架,用于在生成式设置中分析联邦学习,通过教师-多学生交互场景,其中每个学生接收不同的数据实现,要么通过不同的噪声破坏,要么通过访问不同的子集,可能大小不同。使用平衡无序系统的理论工具,我们解析地表明学生间的交互系统地提升了学习性能:高噪声学生需要更少的样本来恢复潜在模式,而低噪声学生与真实信号的重叠更大。我们推导了教师恢复的最优贝叶斯条件,作为样本复杂度、噪声水平和交互强度的函数,并通过数值模拟验证了这些预测。得到的动力学可以映射到具有结构化隐藏层的受限玻尔兹曼机中的平衡采样,从而为交互如何改进分布式生成建模提供了原则性的理论理解。

英文摘要

We introduce a theoretical framework for analyzing federated learning in a generative setting through a scenario with one teacher and multiple interacting students, in which each student receives a distinct realization of the data, either through a different noise corruption or by accessing a different subset, possibly of varying size. Using theoretical tools from equilibrium disordered systems, we analytically show that interactions among students systematically enhance learning performance: highly noisy students require fewer samples to recover the underlying pattern, while low-noise students achieve a larger overlap with the ground-truth signal. We derive the optimal Bayesian conditions for teacher recovery as functions of the sample complexity, noise level, and interaction strength, and validate these predictions through numerical simulations. The resulting dynamics can be mapped onto equilibrium sampling in a Restricted Boltzmann Machine with a structured hidden layer, providing a principled theoretical understanding of how interactions improve distributed generative modeling.

2601.01901 2026-06-12 cs.LG 版本更新

FedBiCross: Personalized One-Shot Federated Learning on Medical Images

FedBiCross: 医学图像上的个性化一次性联邦学习

Yuexuan Xia, Yinghao Zhang, Yalin Liu, Hong-Ning Dai, Yong Xia

发表机构 * School of Computer Science and Engineering, Northwestern Polytechnical University, China(西北工业大学计算机科学与工程学院) School of Science and Technology, Hong Kong Metropolitan University, Hong Kong(香港都会大学科技学院) Department of Computer Science, Hong Kong Baptist University, Hong Kong(香港浸会大学计算机科学系)

AI总结 提出FedBiCross框架,通过聚类、双层跨簇优化和个性化蒸馏解决非独立同分布数据下一次性联邦学习中知识蒸馏效果差的问题,在四个医学图像数据集上优于现有方法。

Comments Accepted by BlockSys 2026. This version of the contribution has been accepted for publication, after peer review (when applicable) but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections

详情
AI中文摘要

基于无数据知识蒸馏的一次性联邦学习(OSFL)在单轮通信中训练模型,无需共享原始数据,这使得OSFL对隐私敏感的医疗应用具有吸引力。然而,现有方法聚合所有客户端的预测以形成全局教师。在非独立同分布数据下,冲突的预测在平均过程中相互稀释,产生信息量较少的软标签,从而削弱蒸馏效果。我们提出FedBiCross,一个个性化OSFL框架,包含三个阶段:(1)根据模型输出相似性对客户端进行聚类,形成连贯的子集成;(2)双层跨簇优化,学习自适应权重以选择性利用有益的跨簇知识,同时抑制负迁移;(3)针对客户端特定适应的个性化蒸馏。在四个医学图像数据集上的实验表明,FedBiCross在不同非独立同分布程度下始终优于最先进的基线方法。

英文摘要

Data-free knowledge distillation-based one-shot federated learning (OSFL) trains a model in a single communication round without sharing raw data, making OSFL attractive for privacy-sensitive medical applications. However, existing methods aggregate predictions from all clients to form a global teacher. Under non-IID data, conflicting predictions dilute each other during averaging, yielding less informative soft labels that weaken distillation. We propose FedBiCross, a personalized OSFL framework with three stages: (1) clustering clients by model output similarity to form coherent sub-ensembles, (2) bi-level cross-cluster optimization that learns adaptive weights to selectively leverage beneficial cross-cluster knowledge while suppressing negative transfer, and (3) personalized distillation for client-specific adaptation. Experiments on four medical image datasets demonstrate that FedBiCross consistently outperforms state-of-the-art baselines across different non-IID degrees.
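
The dilution effect described above is easy to reproduce with two toy clients. The numbers are invented for illustration: each client is confident, but they disagree, so the averaged "global teacher" soft label is uniform and carries no class information.

```python
import math

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def average(preds):
    n = len(preds)
    return [sum(p[i] for p in preds) / n for i in range(len(preds[0]))]

# Two clients trained on different (non-IID) data, predicting the same sample.
client_a = [0.9, 0.1]   # hypothetical soft label
client_b = [0.1, 0.9]   # hypothetical soft label

global_teacher = average([client_a, client_b])
```

The averaged label has entropy ln 2, the maximum for two classes, while each client's own prediction has much lower entropy. Clustering clients with similar outputs before averaging, as in stage (1) of the framework, avoids mixing such conflicting predictions.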

2605.11165 2026-06-12 cs.LG 版本更新

COSMOS: Model-Agnostic Personalized Federated Learning with Clustered Server Models and Pseudo-Label-Only Communication

COSMOS:基于聚类服务器模型和伪标签通信的模型无关个性化联邦学习

Ben Rachmut, Luise Ge, William Yeoh, Ning Zhang, Yevgeniy Vorobeychik

发表机构 * Washington University in St. Louis(华盛顿大学圣路易斯分校)

AI总结 COSMOS通过伪标签通信实现服务器端个性化,利用客户端本地模型预测公共数据并聚类,训练集群特定模型并回传知识蒸馏,理论分析显示其能有效降低个性化风险,实验验证其在异构环境中优于现有基线方法。

详情
AI中文摘要

联邦学习在异构环境中面临挑战,因为客户端模型在架构和数据分布上差异显著。尽管近期方法通过客户端聚类和知识蒸馏应对,但同时处理架构和统计异质性仍困难。我们引入COSMOS,一种模型无关框架,通过仅使用伪标签通信实现服务器端个性化。客户端训练本地模型并在公共数据上进行预测;服务器根据预测相似性聚类客户端,利用自身计算为每个群组训练特定模型,并将所得模型蒸馏回客户端。我们提供了首个理论分析,证明从学习的集群模型蒸馏可产生指数级个性化风险收缩,超越模型无关联邦学习通常提供的收敛到平稳状态保证。在基准测试中,COSMOS在异构环境中一致优于所有模型无关联邦学习基线方法,同时与最先进的个性化联邦学习方法竞争。更广泛地说,我们的结果强调了使用伪标签实现个性化服务器端学习作为可扩展且模型无关联邦学习的有前景范式。

英文摘要

Federated learning (FL) in heterogeneous environments remains challenging because client models often differ in both architecture and data distribution. While recent approaches attempt to address this challenge through client clustering and knowledge distillation, simultaneously handling architectural and statistical heterogeneity remains difficult. We introduce COSMOS, a model-agnostic framework that enables server-side personalization using only pseudo-label communication. Clients train local models and predict on the public data; the server clusters clients by prediction similarity, trains a cluster-specific model for each group using its own compute, and distills the resulting models back to clients. We provide the first theoretical analysis showing that distillation from the learned cluster models can yield exponential personalization risk contraction, going beyond the convergence-to-stationarity guarantees typically provided in model-agnostic FL. Experiments across benchmarks demonstrate that COSMOS consistently outperforms all model-agnostic FL baselines while remaining competitive with state-of-the-art personalized FL methods. More broadly, our results highlight personalized server-side learning with pseudo-labels as a promising paradigm for scalable and model-agnostic federated learning in highly heterogeneous environments.
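
A minimal sketch of pseudo-label-only communication follows. It assumes clients send hard labels on a shared public set, uses a greedy agreement threshold for clustering, and forms per-cluster pseudo-labels by majority vote. These choices, and the names `cluster_clients` and `cluster_pseudo_labels`, are illustrative assumptions rather than the clustering or training procedure used by COSMOS.

```python
from collections import Counter

def agreement(p, q):
    """Fraction of public samples on which two clients predict the same label."""
    return sum(a == b for a, b in zip(p, q)) / len(p)

def cluster_clients(preds, threshold=0.8):
    """Greedy clustering: join the first cluster whose first member agrees enough."""
    clusters = []
    for cid, p in preds.items():
        for cluster in clusters:
            if agreement(p, preds[cluster[0]]) >= threshold:
                cluster.append(cid)
                break
        else:
            clusters.append([cid])
    return clusters

def cluster_pseudo_labels(preds, cluster):
    """Majority vote per public sample; the server could train on these labels."""
    n = len(preds[cluster[0]])
    return [Counter(preds[c][i] for c in cluster).most_common(1)[0][0]
            for i in range(n)]

# Hypothetical client predictions on five public samples.
preds = {"c1": [0, 0, 1, 1, 0], "c2": [0, 0, 1, 1, 1],
         "c3": [1, 1, 0, 0, 1], "c4": [1, 1, 0, 0, 1]}
clusters = cluster_clients(preds)
```

Because only predictions on public data cross the network, clients may use entirely different architectures, which is what makes the scheme model-agnostic.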

2305.08175 2026-06-12 cs.DB cs.CR cs.LG 版本更新

ResidualPlanner+: a scalable matrix mechanism for marginals and beyond

ResidualPlanner+:一种用于边际查询及更广泛查询的可扩展矩阵机制

Guanlin He, Yingtai Xiao, Levent Toksoz, Zeyu Ding, Danfeng Zhang, Daniel Kifer

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) Binghamton University(宾厄姆顿大学) Duke University(杜克大学) TikTok Inc.(抖音公司)

AI总结 提出两种可扩展的矩阵机制ResidualPlanner和ResidualPlanner+,分别优化边际查询的精度和支持更复杂的工作负载(如范围查询),在速度和内存上显著超越现有方法。

详情
AI中文摘要

带噪声的边际查询是保护机密性的常见数据发布形式,对于列联表分析、贝叶斯网络构建甚至合成数据生成等下游任务非常有用。为线性查询(如边际查询)提供无偏噪声答案的隐私机制称为矩阵机制。我们提出了ResidualPlanner和ResidualPlanner+,两种高度可扩展的矩阵机制。ResidualPlanner在使用高斯噪声回答边际查询时既最优又可扩展,而ResidualPlanner+支持更通用的工作负载,例如边际查询与范围查询或前缀和查询的组合。ResidualPlanner可以优化许多损失函数,这些损失函数可以写成边际方差的凸函数(先前的工作仅限于一个预定义的目标函数)。ResidualPlanner可以在几秒钟内优化大规模设置中边际查询的精度,即使之前的最先进方法(HDMM)内存耗尽。它甚至可以在几分钟内处理具有100个属性的数据集。此外,ResidualPlanner可以高效计算每个边际的方差/协方差值(先前的方法即使对于相对较小的数据集也会很快耗尽内存)。ResidualPlanner+支持更复杂的工作负载,这些工作负载结合了边际查询和范围/前缀和查询(例如,关于种族的边际查询、关于年龄的范围查询以及回答每个种族的年龄范围查询的组合种族/年龄表格)。它甚至支持用户在不同属性上自定义工作负载。凭借这种增加的灵活性,ResidualPlanner+不一定是最优的,但它仍然极具可扩展性,并且在精度和速度上均优于先前的最先进方法(HDMM)处理前缀和查询。

英文摘要

Noisy marginals are a common form of confidentiality-protecting data release and are useful for many downstream tasks such as contingency table analysis, construction of Bayesian networks, and even synthetic data generation. Privacy mechanisms that provide unbiased noisy answers to linear queries (such as marginals) are known as matrix mechanisms. We propose ResidualPlanner and ResidualPlanner+, two highly scalable matrix mechanisms. ResidualPlanner is both optimal and scalable for answering marginal queries with Gaussian noise, while ResidualPlanner+ provides support for more general workloads, such as combinations of marginals and range queries or prefix-sum queries. ResidualPlanner can optimize for many loss functions that can be written as a convex function of marginal variances (prior work was restricted to just one predefined objective function). ResidualPlanner can optimize the accuracy of marginals in large-scale settings in seconds, even when the previous state of the art (HDMM) runs out of memory. It even runs on datasets with 100 attributes in a couple of minutes. Furthermore, ResidualPlanner can efficiently compute variance/covariance values for each marginal (prior methods quickly run out of memory, even for relatively small datasets). ResidualPlanner+ provides support for more complex workloads that combine marginal and range/prefix-sum queries (e.g., a marginal on race, a range query on age, and a combined race/age tabulation that answers age range queries for each race). It even supports custom user-defined workloads on different attributes. With this added flexibility, ResidualPlanner+ is not necessarily optimal, but it is still extremely scalable and outperforms the prior state of the art (HDMM) on prefix-sum queries in terms of both accuracy and speed.
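
For readers unfamiliar with the terminology, a marginal is a table of counts over a subset of attributes, and a noisy marginal adds independent noise to each cell. The sketch below shows only this basic primitive with Gaussian noise. It is not ResidualPlanner: the mechanism in the paper chooses what to measure and how to reconstruct all requested marginals, which this toy does not attempt. The records, `domain`, and `sigma` are illustrative assumptions.

```python
import random
from collections import Counter

def marginal(records, attrs):
    """Exact counts of each combination of values for the chosen attributes."""
    return Counter(tuple(r[a] for a in attrs) for r in records)

def noisy_marginal(records, attrs, domain, sigma, rng):
    """Add independent Gaussian noise to every cell, including empty ones."""
    counts = marginal(records, attrs)
    return {cell: counts.get(cell, 0) + rng.gauss(0.0, sigma) for cell in domain}

# Hypothetical records and domain.
records = [{"race": "A", "age": 30}, {"race": "A", "age": 40},
           {"race": "B", "age": 30}]
race_domain = [("A",), ("B",), ("C",)]
release = noisy_marginal(records, ["race"], race_domain, 1.0, random.Random(0))
```

Noise must be added to empty cells such as `("C",)` as well; otherwise the set of released cells would itself reveal which combinations occur in the data.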

8. 鲁棒性、不确定性与可信学习 25 篇

2606.12490 2026-06-12 cs.LG 新提交

Robustness Verification of Recurrent Neural Networks with Abstraction Refinement

基于抽象精化的循环神经网络鲁棒性验证

Li-Jen Lin, Chih-Duo Hong

发表机构 * National Science and Technology Council (NSTC), Taiwan(台湾国家科学与技术委员会)

AI总结 提出抽象精化框架,通过分割预激活区间消除非线性松弛误差,并利用SHAP引导的时间步选择策略降低组合成本,显著提升RNN鲁棒性验证成功率。

详情
AI中文摘要

循环神经网络(RNN)的认证局部鲁棒性验证具有挑战性,因为非线性松弛引入的近似误差会通过循环连接传播并随时间累积。因此,可扩展的线性边界传播方法往往过于保守,无法认证实际上鲁棒的输入,尤其是当许多预激活区间跨越零点时。我们提出了一种用于RNN验证的抽象精化框架,该框架划分此类区间以消除主要的松弛误差:在每个精化分支上,ReLU变得精确,而tanh和sigmoid等平滑激活函数则允许更紧的线性包络。为了控制在长序列中分裂的组合成本,我们引入了一种SHAP引导的时间步选择策略,该策略根据隐藏状态对验证目标的贡献进行排序,并按时间顺序仅精化最关键的时间步。在CIFAR10和MNIST笔画基准上的实验表明,与仅使用抽象的基线相比,验证成功率和鲁棒性边界紧度持续提升,同时揭示了ReLU和tanh模型之间清晰的运行时权衡。

英文摘要

Certified local robustness verification for recurrent neural networks (RNNs) is challenging because approximation errors introduced by nonlinear relaxations can propagate through recurrent connections and accumulate over time. As a result, scalable linear bound propagation methods often become overly conservative and fail to certify inputs that are in fact robust, especially when many pre-activation intervals cross zero. We propose an abstraction-refinement framework for RNN verification that partitions such intervals to remove the dominant relaxation error: on each refined branch, ReLU becomes exact, and smooth activations such as tanh and sigmoid admit substantially tighter linear envelopes. To control the combinatorial cost of splitting in long sequences, we introduce a SHAP-guided timestep selection strategy that ranks hidden states by their contribution to the verification objective and refines only the most critical timesteps in temporal order. Experiments on CIFAR10 and MNIST stroke benchmarks demonstrate consistent improvements in verification success and robustness-margin tightness over abstraction-only baselines, while exposing clear runtime trade-offs between ReLU and tanh models.
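The effect of splitting a zero-crossing pre-activation interval can be illustrated with a minimal sketch. It assumes the standard triangle (chord) relaxation of ReLU on an interval [l, u]; the function names `relu_triangle_gap` and `split_at_zero` are hypothetical and are not taken from the paper, whose actual relaxation and refinement procedure may differ.

```python
def relu_triangle_gap(l: float, u: float) -> float:
    """Largest vertical gap between the chord upper bound and ReLU on [l, u].

    Assumes the usual triangle relaxation: when l < 0 < u the upper bound is
    the chord through (l, 0) and (u, u), and the gap peaks at x = 0.
    """
    if l >= 0 or u <= 0:
        return 0.0  # ReLU is linear on the whole interval, so no relaxation error
    return -u * l / (u - l)


def split_at_zero(l: float, u: float):
    """Partition a zero-crossing interval into two branches where ReLU is linear."""
    if l >= 0 or u <= 0:
        return [(l, u)]
    return [(l, 0.0), (0.0, u)]


if __name__ == "__main__":
    l, u = -1.0, 1.0
    print("gap before split:", relu_triangle_gap(l, u))
    for branch in split_at_zero(l, u):
        print("branch", branch, "gap:", relu_triangle_gap(*branch))
```

Each split doubles the number of branches for that neuron, which is the combinatorial cost the abstract's timestep selection strategy is meant to limit.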

2606.12501 2026-06-12 cs.LG 新提交

Policy-driven Conformal Prediction for Trustworthy QoT Estimation

策略驱动的可信QoT估计的保形预测

Kiarash Rezaei, Omran Ayoub, Paolo Monti, Carlos Natalino

发表机构 * Chalmers University of Technology(查尔姆斯理工大学) University of Applied Sciences and Arts of Southern Switzerland(瑞士南方应用科学与艺术大学)

AI总结 提出Conformal QoT框架,结合统计保证的QoT估计与操作决策策略,实现域偏移下可靠的光路可行性预测,在开放数据集上将准确率从92%提升至99.6%。

详情
Journal ref
Proc. Optical Fiber Communication Conference (OFC) 2026
AI中文摘要

我们提出Conformal QoT,一个策略驱动的框架,将具有统计保证的QoT估计与操作决策策略相结合,能够在域偏移下实现可靠的光路可行性预测,并在开放数据集上将准确率从92%提升至99.6%。

英文摘要

We propose Conformal QoT, a policy-driven framework that combines statistically guaranteed QoT estimation with operational decision policies, enabling reliable lightpath-feasibility predictions under domain shift and improving accuracy from 92% to 99.6% on open datasets.
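As a rough illustration of how a conformal interval could feed a decision policy, here is a minimal sketch of generic split conformal regression. The abstract does not describe the actual method, so everything here is an assumption: the three-way accept/reject/defer policy, the threshold semantics, and the names `conformal_quantile` and `decide` are all hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """Split-conformal quantile of absolute calibration residuals.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest residual, or infinity
    when the calibration set is too small for the requested coverage.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]


def decide(prediction, q, threshold):
    """Hypothetical policy: act only when the whole interval is on one side."""
    lower, upper = prediction - q, prediction + q
    if lower >= threshold:
        return "feasible"
    if upper < threshold:
        return "infeasible"
    return "defer"  # interval straddles the threshold


if __name__ == "__main__":
    calibration_residuals = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values
    q = conformal_quantile(calibration_residuals, alpha=0.5)
    for threshold in (4, 8, 16):
        print(threshold, decide(prediction=10, q=q, threshold=threshold))
```

The coverage guarantee of split conformal prediction relies on exchangeability between calibration and test data, so a setting with domain shift would need an adapted calibration step that this sketch does not attempt.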

2606.12615 2026-06-12 cs.LG 新提交

Towards Provably Fair Machine Learning: Bayesian Approaches For Consistent and Transparent Predictions

迈向可证明公平的机器学习:用于一致和透明预测的贝叶斯方法

Owen O'Neill, Fintan Costello

发表机构 * University College Dublin(都柏林大学学院)

AI总结 提出公平贝叶斯分类器,通过强制确定性和统计一致性,在多个数据集上实现零一致性错误,同时保持准确性和多校准,解决少数群体因正则化导致的预测不一致问题。

详情
AI中文摘要

部署在高风险领域的机器学习分类器产生的预测质量在不同子组之间存在系统性差异。对于由多个特征交叉定义的细粒度子组,预测通常与观测数据不一致:模型输出与该子组可用的证据相矛盾。正则化通过将小子组合并到较大组中来改善整体性能,从而加剧了这一问题,对人口统计少数群体产生不成比例的影响。我们定义了一致性预测的两个要求:确定性(相同的个体获得相同的预测)和统计一致性(在显著性水平alpha下,我们不能拒绝子组预测来自为该子组推断的贝叶斯最优目标分布的假设)。从这些要求出发,我们推导出公平贝叶斯分类器,该分类器同时强制每个组和子组满足这两个要求,并在无法进行一致确定性预测时弃权。在三个基准数据集(Adult、COMPAS和Bank Marketing)上,标准分类器对相当一部分子组产生统计上不一致的预测。我们的分类器通过构造实现零一致性错误,同时在每个测试数据集上超过基线准确性和多校准。统计一致性为预测质量提供了原则性基础,对算法公平性有直接影响。少数群体人口不成比例地集中在小子组中,而正是在这些子组中频率论推断最不可靠;因此,解决这一推断问题是迈向公平ML的必要步骤。通过在数据支持的最细粒度上强制贝叶斯一致性,我们的分类器证明了在实践中可以实现具有原则性弃权的详尽子组公平性。

英文摘要

ML classifiers deployed in high-stakes domains produce predictions whose quality varies systematically across subgroups. For granular subgroups defined by intersections of multiple features, predictions are often inconsistent with the observed data: the model's outputs contradict the evidence available for that subgroup. This problem is exacerbated by regularisation, which improves aggregate performance by collapsing small subgroups into larger groups, disproportionately affecting demographic minorities. We define two requirements for consistent prediction: determinism (identical individuals receive identical predictions) and statistical consistency (we cannot reject, at significance level alpha, the hypothesis that the predictions for a subgroup were drawn from the Bayesian optimal target distribution inferred for that subgroup). From these requirements we derive the Fair Bayesian classifier, which enforces both across every group and subgroup simultaneously and abstains whenever no consistent deterministic prediction is possible. On three benchmark datasets (Adult, COMPAS, and Bank Marketing), standard classifiers produce statistically inconsistent predictions for a substantial proportion of subgroups. Our classifier achieves zero consistency error by construction while exceeding baseline accuracy and multicalibration on every dataset tested. Statistical consistency provides a principled foundation for prediction quality with direct implications for algorithmic fairness. Minority demographics are disproportionately concentrated in small subgroups, precisely where frequentist inference is least reliable; addressing this inference problem is therefore a necessary step toward fair ML. By enforcing Bayesian consistency at the finest resolution the data supports, our classifier demonstrates that exhaustive subgroup fairness with principled abstention is achievable in practice.

2606.12731 2026-06-12 cs.LG cs.CY 新提交

Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs

规范性鲁棒性作为LLM中不可验证推理的前沿

Elizaveta Tennant, Benjamin Henke, Anita Keshmirian, Murray Shanahan, Verena Rieser, Kristian Lum, Sydney Levine, Julia Haas

发表机构 * DeepMind Institute of Philosophy, School of Advanced Study, University of London(伦敦大学高等研究院哲学研究所) Technische Universität Berlin(柏林工业大学)

AI总结 提出道德推理作为不可验证推理的典型子域,定义道德鲁棒性并引入可扩展的多轮对抗评估框架,发现模型会向用户偏好偏移推理(平均6.5%),且受顺序和轮次影响。

详情
AI中文摘要

随着LLM越来越多地承担咨询和审议角色,用户在缺乏客观真实性的领域中依赖它们进行不可验证推理。然而,传统LLM推理评估几乎只关注基于事实的领域(如数学和科学),导致不确定模型能否以及能在多大程度上处理随时间变化的模糊、主观或价值负载问题。为解决这一问题,我们提出道德推理作为不可验证推理的一个典型子域。我们将道德鲁棒性定义为模型在不同时间和情境下展现合理道德推理的能力,并引入一个可扩展的、对抗性的多轮评估框架来实证测量这一能力。我们在四个前沿LLM上模拟了48,000次用户-智能体道德讨论,变化前提相关性、前提顺序、对话时长和用户声明的道德观点。我们发现模型成功忽略了道德无关的干扰项,但平均向用户声明的偏好道德观点偏移了6.5%的推理,并且推理因顺序(在13-22%的案例中改变道德判断)和时长(在10-24%的案例中在单轮和多轮之间改变道德判断)等因素而变化。我们的分析表明,模型不仅调整最终裁决,还调整其背后的理由以适应用户的道德观点——我们将这种失败模式称为道德审议谄媚。

英文摘要

As LLMs increasingly serve in advisory and deliberative roles, users rely on them for non-verifiable reasoning in domains lacking objective ground truths. However, traditional evaluations of LLM reasoning focus almost exclusively on fact-based domains, such as mathematics and science, leaving uncertainty over whether and to what degree models can handle ambiguous, subjective, or value-laden problems over time. To address this concern, we propose moral reasoning as a paradigmatic subdomain of non-verifiable reasoning. We define moral robustness as a model's capacity to exhibit sound moral reasoning across time and contexts, and we introduce a scalable, adversarial, multi-turn evaluation framework to empirically measure this capability. We simulate 48,000 user-agent moral deliberations across four frontier LLMs, varying premise relevance, premise order, conversation duration, and the user's stated moral view. We find that models successfully ignore morally irrelevant distractors, but shift their reasoning by up to 6.5%, on average, towards the user's stated preferred moral view, and vary their reasoning depending on factors such as order (altering moral judgments by order in 13-22% of the cases) and duration (altering moral judgments between single-turn and multi-turn in 10-24% of the cases). Our analysis indicates that models tailor not just their final verdicts but their underlying justifications to align with a user's moral viewpoint, a failure mode we characterize as moral deliberative sycophancy.

2606.12896 2026-06-12 cs.LG cs.AI cs.CR 新提交

PolicyGuard: Towards Test-time and Step-level Adversary Defense for Reinforcement Learning Agent

PolicyGuard:面向强化学习智能体的测试时和步级对抗防御

Junfeng Guo, Heng Huang

AI总结 提出PolicyGuard,一种基于高斯过程后验方差的测试时步级后门防御方法,通过自适应伪轨迹计算单步不确定性,在七种RL游戏中达到平均AUROC 0.856和0.859。

详情
AI中文摘要

尽管强化学习(RL)的实际应用日益普及,但RL系统的安全性值得更多关注和探索。特别是,最近的研究揭示了RL智能体容易受到后门攻击,即受害智能体在标准条件下表现正常,但在特定触发器被激活时执行恶意动作。现有的RL后门防御要么需要访问智能体的内部参数,要么仅在模型或轨迹级别操作,或者仅限于特定攻击类型。为了确保RL智能体的安全性,我们提出了PolicyGuard,一种测试时步级后门防御方法,它利用高斯过程(GP)后验方差并自适应伪轨迹以实现单个时间步的不确定性计算。此外,我们还提供了理论基础来解释GP后验方差的有效性。在七个RL游戏上的大量实验表明,PolicyGuard在大多数情况下实现了最先进的检测性能,对于基于扰动的攻击平均AUROC为0.856,对于对抗智能体攻击平均AUROC为0.859。

英文摘要

While real-world applications of reinforcement learning (RL) are becoming increasingly popular, the security of RL systems deserves more attention and exploration. In particular, recent work has revealed that RL agents are vulnerable to backdoor attacks, where a victim agent behaves normally under standard conditions but executes malicious actions when a specific trigger is activated. Existing backdoor defenses for RL either require access to the agent's internal parameters, operate only at the model or trajectory level, or are limited to specific attack types. To ensure the security of RL agents, we propose PolicyGuard, a test-time, step-level backdoor defense that leverages Gaussian Process (GP) posterior variance and adapts pseudo trajectories to enable uncertainty computation for individual time steps. We also provide theoretical foundations to explain the efficacy of GP posterior variance. Extensive experiments across seven RL games demonstrate that PolicyGuard achieves state-of-the-art detection performance in most cases, with average AUROC of 0.856 for perturbation-based attacks and 0.859 for adversary-agent attacks.
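To make the role of GP posterior variance concrete, the following is a minimal pure-Python sketch of the textbook formula var(x*) = k(x*, x*) - k*ᵀ (K + σ²I)⁻¹ k* with a squared-exponential kernel. It is not PolicyGuard itself: the idea of scoring a step by its variance relative to clean reference states, the kernel choice, and all function names are assumptions made for illustration, and the paper's pseudo-trajectory construction is not reproduced.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq / length_scale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def posterior_variance(reference, query, noise=1e-6):
    """GP posterior variance at `query` given reference inputs (zero-mean prior)."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(reference)]
         for i, a in enumerate(reference)]
    k_star = [rbf(a, query) for a in reference]
    w = solve(K, k_star)
    return rbf(query, query) - sum(k * wi for k, wi in zip(k_star, w))


if __name__ == "__main__":
    clean_states = [(0.0,), (1.0,), (2.0,)]  # made-up reference features
    print("seen-like state:", posterior_variance(clean_states, (1.0,)))
    print("unfamiliar state:", posterior_variance(clean_states, (10.0,)))
```

A state far from every reference input keeps a variance close to the prior value of 1, while a state near the reference set has variance close to the noise level, which is the kind of per-step signal a detector could threshold.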

2606.13172 2026-06-12 cs.LG 新提交

Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance

检测学习表示中的解释不充分性:表示警觉性框架

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

发表机构 * Laboratory of Bioengineering and Nanosciences (LBN), University of Montpellier(蒙彼利埃大学生物工程与纳米科学实验室) EuroMov Digital Health in Motion, University of Montpellier, IMT Mines Alès(蒙彼利埃大学EuroMov数字健康运动实验室,IMT阿莱斯矿业学院) Certified Sophrologist, Sensorimotor Practice(认证心理放松治疗师,感觉运动实践) Emeritus Professor, University of Montpellier(蒙彼利埃大学名誉教授)

AI总结 提出VER框架,通过识别持久残差结构来监测学习表示的充分性,补充传统评估方法。

Comments 22 pages, 1 figure. Conceptual framework for representation diagnostics in machine learning

详情
AI中文摘要

学习表示是现代机器学习的核心,通常通过预测性能、鲁棒性、不确定性估计或泛化能力来评估。然而,一个学习表示可能在操作上仍然成功,同时逐渐无法组织未被传统评估指标完全捕获的持久残差结构。本文介绍了VER(表示警觉评估器),一个用于监测学习表示充分性的概念框架。VER不提出新的学习算法、损失函数或模型架构。相反,它形式化了一个诊断过程,通过该过程可以识别、分析持久残差结构,并将其解释为解释不充分性的潜在指标。该框架将表示不充分性与普通预测误差、不确定性、噪声和分布偏移区分开来。它引入了一个基于表示识别、解释域界定、残差结构检测、解释阻力评估和警觉信号发出的监测序列。VER旨在作为机器学习中表示诊断的贡献。其目标不是取代现有的评估方法,而是通过将表示充分性视为明确的探究对象来补充它们。还概述了通过表示警觉性基准进行实证评估的路径。

英文摘要

Learned representations are central to modern machine learning and are commonly evaluated through predictive performance, robustness, uncertainty estimation, or generalization. However, a learned representation may remain operationally successful while progressively failing to organize persistent residual structures that are not fully captured by conventional evaluation metrics. This article introduces VER, the Vigilant Evaluator of Representations, a conceptual framework for monitoring representational adequacy in learned representations. VER does not propose a new learning algorithm, loss function, or model architecture. Instead, it formalizes a diagnostic process through which persistent residual structures may be identified, analyzed, and interpreted as potential indicators of explanatory insufficiency. The framework distinguishes representational inadequacy from ordinary prediction error, uncertainty, noise, and distribution shift. It introduces a monitoring sequence based on representation identification, explanatory-domain delimitation, residual-structure detection, explanatory-resistance evaluation, and vigilance signaling. VER is intended as a contribution to representation diagnostics in machine learning. Its objective is not to replace existing evaluation methods but to complement them by treating representational adequacy as an explicit object of inquiry. A path toward empirical evaluation through representational-vigilance benchmarks is also outlined.

2606.13209 2026-06-12 cs.LG cs.CL 新提交

Understanding helpfulness and harmless tension in reward models

理解奖励模型中的有用性与无害性张力

Eshaan Tanwar, Pepa Atanasova

发表机构 * University of Copenhagen(哥本哈根大学)

AI总结 通过激活分析和消融实验,发现奖励模型中有用性和无害性目标存在干扰,共享神经元对模型行为影响不成比例,导致对齐张力。

Comments The source code used in this study is publicly available at: https://github.com/EshaanT/RM-alignment_tension

详情
AI中文摘要

奖励模型是从人类反馈中进行强化学习(RLHF)的关键组成部分,使语言模型在有用性和无害性行为上对齐。然而,这些目标背后的内部机制及其冲突仍知之甚少。我们研究了在仅有用性、仅无害性和混合目标设置下训练的奖励模型中的对齐张力。我们发现混合目标模型通常表现不如单目标模型,表明目标之间存在干扰。使用基于激活的方法,我们识别了与每个目标相关的神经元,并通过定向消融研究其功能角色。我们发现这些神经元因果地支持其对应目标,同时往往对对立目标产生负面影响。我们发现相当比例的神经元在有用性和无害性之间共享,并且这些共享神经元对模型行为产生不成比例的影响,导致对齐张力。此外,我们的结果提供了关于对齐目标如何在奖励模型中表示以及为什么多目标对齐仍然具有挑战性的见解和机制解释,为未来关于解耦和可控对齐方法的研究提供了动力。

英文摘要

Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. We find that mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. We find that these neurons causally support their corresponding objectives while often negatively affecting the opposing one. We find that a substantial proportion of neurons are shared between helpfulness and harmlessness, and that these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Additionally, our results provide insights and mechanistic interpretation into how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.

2606.13451 2026-06-12 cs.LG 新提交

Uncertainty Estimation for Molecular Diffusion Models

分子扩散模型的不确定性估计

Paul Seij, Christian A. Naesseth, Stephan Mandt, Metod Jazbec

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 提出一种事后方法,利用去噪网络的拉普拉斯近似估计预训练分子扩散模型中每个样本的不确定性,该分数与样本质量负相关,可用于过滤生成样本。

详情
AI中文摘要

扩散模型已被广泛用于三维分子生成,但它们没有提供关于生成分子何时可能质量低下的原则性信号。我们提出了一种事后方法,用于估计预训练分子扩散模型中每个样本的不确定性。基于去噪网络的拉普拉斯近似,我们测量了生成轨迹中噪声预测的变异性。实验表明,所得的不确定性分数能够反映样本质量,与已建立的样本级质量指标呈负相关。我们进一步研究了如何使用所提出的不确定性分数来过滤生成的样本,通过测试时缩放提高模型性能。

英文摘要

Diffusion models have seen wide adoption for 3D molecular generation, yet they offer no principled signal of when a generated molecule is likely to be of low quality. We propose a post-hoc method for estimating per-sample uncertainty in pretrained molecular diffusion models. Building on a Laplace approximation of the denoising network, we measure the variability of the noise prediction across the generation trajectory. Empirically, we show that the resulting uncertainty score is informative of sample quality, exhibiting a negative correlation with established sample-level quality metrics. We further study how the proposed uncertainty score can be used to filter generated samples, improving model performance via test-time scaling.

2606.12498 2026-06-12 cs.CR cs.LG 交叉投稿

From Parameters to Feature Space: Task Arithmetic for Backdoor Mitigation in Model Merging

从参数到特征空间:模型合并中后门缓解的任务算术

Zhenqian Zhu, Yamin Hu, Yiya Diao, Weixiang Li, Haodong Li, Wenjian Luo

AI总结 提出线性特征路径最小化(LFPM)框架,通过跨任务线性性在特征空间优化反后门任务向量,在模型合并中有效抑制后门且保持干净任务性能。

详情
AI中文摘要

模型合并(MM)作为一种将多个任务特定模型整合为统一模型的成本效益方法,已获得显著关注。然而,近期工作揭示MM极易受到后门攻击。现有基于任务算术的防御通常因依赖直接参数空间编辑,在未显著降低干净任务性能的情况下难以消除后门。为解决这一差距,我们提出线性特征路径最小化(LFPM),一种用于模型合并的后门缓解框架,该框架将反后门任务向量引入被后门污染的合并模型。与先前方法不同,LFPM在跨任务线性性(CTL)框架下从统一的特征空间视角制定合并模型的后门鲁棒性,该框架利用跨任务特征的近似线性性。这一视角指导反后门任务的优化,以在抑制后门的同时保持干净任务性能。此外,我们引入一种基于梯度累积和损失路径积分的有效优化机制,确保沿插值路径的鲁棒后门抑制。大量实验表明,LFPM在完全微调和参数高效微调(PEFT)设置中均对后门攻击表现出强鲁棒性。

英文摘要

Model merging (MM) has gained significant attention as a cost-effective approach to integrate multiple task-specific models into a unified model. However, recent work reveals that MM is highly susceptible to backdoor attacks. Existing defenses based on task arithmetic often fail to eliminate backdoors without substantially degrading clean-task performance, owing to their reliance on direct parameter-space editing. To address this gap, we propose Linear Feature Path Minimization (LFPM), a backdoor mitigation framework for model merging, which introduces an anti-backdoor task vector into the backdoored merged model. Unlike prior approaches, LFPM formulates the backdoor robustness of the merged model from a unified feature-space perspective under the Cross-Task Linearity (CTL) framework, which leverages the approximate linearity of features across tasks. This perspective guides the optimization of the anti-backdoor task to suppress backdoors while preserving clean-task performance. Furthermore, we introduce an effective optimization mechanism based on gradient accumulation and loss path-integral, ensuring robust backdoor suppression along the interpolation path. Extensive experiments demonstrate that LFPM consistently exhibits strong robustness against backdoor attacks in both full fine-tuning and Parameter-Efficient Fine-Tuning (PEFT) settings.
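For readers unfamiliar with task arithmetic, the parameter-space operations the abstract builds on can be sketched as follows. This is a minimal illustration assuming the common formulation (task vector = fine-tuned weights minus pretrained weights, merged model = pretrained weights plus scaled task vectors). The anti-backdoor vector here is a fixed made-up value; in LFPM it is obtained by a feature-space optimization that this sketch does not implement, and the function names are hypothetical.

```python
def task_vector(finetuned, pretrained):
    """Per-parameter difference between fine-tuned and pretrained weights."""
    return {k: [f - p for f, p in zip(finetuned[k], pretrained[k])]
            for k in pretrained}


def merge(base, vectors, scale=1.0):
    """Add scaled task vectors to a base set of weights."""
    merged = {k: list(v) for k, v in base.items()}
    for tau in vectors:
        for k in merged:
            merged[k] = [m + scale * t for m, t in zip(merged[k], tau[k])]
    return merged


if __name__ == "__main__":
    pretrained = {"w": [1.0, 2.0]}
    model_a = {"w": [2.0, 2.0]}   # fine-tuned on task A (toy values)
    model_b = {"w": [1.0, 4.0]}   # fine-tuned on task B (toy values)
    vectors = [task_vector(model_a, pretrained), task_vector(model_b, pretrained)]
    merged = merge(pretrained, vectors, scale=0.5)
    anti_backdoor = {"w": [-0.5, 0.0]}  # placeholder, not an optimized vector
    print(merge(merged, [anti_backdoor]))
```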

2606.12900 2026-06-12 cs.AI cs.CL cs.LG 交叉投稿

Zero-source LLM Hallucination Detection with Human-like Criteria Probing

零源大语言模型幻觉检测:类人类标准探测

Jiahao Yang, Shuhai Zhang, Hailong Kang, Feng Liu, Qi Chen, Mingkui Tan

AI总结 提出HCPD范式,通过类人类标准探测机制模拟人类评估者的多面推理,结合奖励对齐和多样本聚合,实现零源条件下的有效可解释幻觉检测。

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLM)常因生成事实错误或不忠实的内容而产生幻觉,对其安全使用构成重大风险。在零源约束下,即无法获取模型内部信息或外部参考,检测必须仅依赖于文本查询-答案对,检测此类幻觉尤为困难。本文提出用于幻觉检测的类人类标准探测(HCPD)范式,该范式模拟人类评估者的多面推理。其核心是类人类标准探测(HCP)机制,其中LLM代理自适应地将其判断分解为一组可解释的加权标准,并将特定标准得分聚合为最终的真实性度量。为实现这种自适应能力,我们引入了一种基于奖励的对齐方案,仅使用来自语义一致性的弱监督。在推理时,我们采用多样本聚合策略,确保决策稳健的同时保持完全可解释性。我们进一步提供了支持我们方法可靠性的理论分析。大量实验表明,HCPD始终优于最先进的基线,为零源幻觉检测提供了一种有效且可解释的解决方案。代码可从 https://github.com/TRISKEL10N/HCPD 获取。

英文摘要

Large language models (LLMs) often hallucinate by generating factually incorrect or unfaithful content, posing significant risks to their safe use. Detecting such hallucinations is particularly challenging under the zero-source constraint, where no model internals or external references are available, and detection must rely solely on the textual query-answer pair. In this paper, we propose Human-like Criteria Probing for Hallucination Detection (HCPD), a paradigm that emulates the multi-faceted reasoning of human evaluators. Its core is a Human-like Criteria Probing (HCP) mechanism, in which an LLM agent adaptively decomposes its judgment into a weighted set of interpretable criteria and aggregates criterion-specific scores into a final truthfulness measure. To achieve this adaptive capability, we introduce a reward-based alignment scheme using only weak supervision from semantic consistency. At inference, we employ a multi-sampling aggregation strategy to ensure robust decisions while preserving full interpretability. We further provide theoretical analysis supporting the reliability of our approach. Extensive experiments show that HCPD consistently outperforms state-of-the-art baselines, offering an effective and explainable solution for zero-source hallucination detection. Code is available at https://github.com/TRISKEL10N/HCPD.

2606.12977 2026-06-12 cs.CV cs.AI cs.CR cs.LG 交叉投稿

Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

图像扩散模型的高效、鲁棒且抗共谋指纹识别

Jianwei Fei, Yunshu Dai, Zhihua Xia, Xiaochun Cao, Jiantao Zhou, Alessandro Piva, Benedetta Tondi

发表机构 * University of Florence(佛罗伦萨大学) Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区) College of Cyber Security, Jinan University(暨南大学网络空间安全学院) State Key Laboratory of Internet of Things for Smart City, University of Macau(澳门大学智慧城市物联网国家重点实验室) Department of Computer and Information Science, Faculty of Science and Technology, University of Macau(澳门大学科技学院计算机与信息科学系) University of Siena(锡耶纳大学)

AI总结 针对生成式文本到图像模型指纹识别缺乏抗共谋攻击鲁棒性的问题,提出基于个性化归一化模块的编码方法,并引入无损函数不变参数变换的抗共谋机制,实现高保真、高鲁棒且首次主动抵御共谋攻击的指纹识别。

详情
AI中文摘要

模型指纹识别,即将用户特定标识(指纹)嵌入生成输出中,最近已成为保护生成式文本到图像(T2I)模型知识产权并防止未经授权重新分发的流行解决方案。在这项工作中,我们揭示了现有生成模型指纹识别方法中一个先前未被探索的系统性漏洞:它们缺乏对共谋攻击的鲁棒性,其中多个攻击者结合他们的模型以移除或掩盖指纹。为了解决这个问题,我们迈出了为T2I模型开发具有抗共谋能力的鲁棒指纹识别方法的第一步。所提出的方法将比特串(即指纹)编码到集成到T2I模型中的个性化归一化模块(PNM)的系数中,从而可以从任何生成的图像中可靠地恢复指纹。为了防御共谋攻击并防止未经授权的模型重新分发,我们引入了一种基于无损函数不变参数变换的抗共谋机制。该机制显著降低了共谋模型的图像生成质量,使其实际上无法使用。此外,我们的方法允许开发者通过重新参数化PNM高效地创建多个带指纹的T2I模型副本,而无需重新训练。我们还引入了一种最坏情况优化策略,以提高对模型级攻击的鲁棒性。实验表明,所提出的方法在多个T2I图像生成和编辑任务中实现了高保真度和鲁棒性,指纹提取准确率超过99.5%。与现有方法相比,我们的方法首次通过显著增加共谋模型的FID,展示了对共谋攻击的显著主动鲁棒性。

英文摘要

Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.

2606.13022 2026-06-12 cs.CV cs.LG 交叉投稿

Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

基于骨架的人体动作识别中保质量不可察觉对抗攻击

Ziyi Chang, Kanglei Zhou, Xiaohui Liang, Hubert P. H. Shum

发表机构 * Durham University(杜伦大学) Tsinghua University(清华大学) Beihang University(北京航空航天大学) Zhongguancun Laboratory(中关村实验室)

AI总结 针对骨架动作识别的对抗攻击常引入噪声扰动降低动作质量,本文提出一种基于分布的对抗攻击方法,通过最小化经验风险与真实风险的差距来保持动作质量,并设计新指标评估自然性,实验表明该方法在攻击成功率和动作质量上均优于现有方法。

详情
AI中文摘要

针对骨架人体动作识别的对抗攻击已受到广泛关注。然而,现有方法通常引入类似噪声的扰动,导致攻击后动作质量下降,从而在S-HAR系统的最新进展中本质上是可察觉的。我们发现这种退化源于先前对抗攻击优化过程中经验风险与真实风险之间的差距。为解决此问题,我们提出一种在不损害动作质量的情况下获得对抗动作的攻击方法。为最小化风险差距并保持动作质量,我们提出一种基于分布的对抗攻击方法,不引入类似噪声的扰动。为忠实评估动作质量,我们提出一种新指标,该指标与人类对真实世界自然性的感知一致。在两个数据集上对最先进的S-HAR方法进行了实验,通过定性和定量分析证明了我们的方法在攻击成功率和攻击后动作质量方面的优越性。我们的保质量攻击应用和基于分布的方法的成功引发了关于动作识别器鲁棒性的严重担忧,强调了在该领域进一步改进的必要性。

英文摘要

Adversarial attacks on skeletal human action recognition (S-HAR) have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality post-attack and are therefore inherently perceptible given recent advancements in S-HAR systems. We discover that this degradation stems from the gap between empirical and true risks during the optimization process of previous adversarial attacks. To address this issue, we propose an attack where adversarial motions are obtained without compromising their motion quality. To minimize the risk gap and preserve motion quality, we propose a distribution-based adversarial attack method that does not introduce noise-like perturbations. To faithfully evaluate the motion quality, we propose a new metric that aligns with human perception of real-world naturalness. Experiments on state-of-the-art S-HAR methods across two datasets demonstrate the superiority of our method in both attack success rate and post-attack motion quality through qualitative and quantitative analyses. The success of our quality-preserving attack application and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further enhancements in this domain.

2606.13146 2026-06-12 stat.ML cs.LG stat.ME 交叉投稿

Robust State-Conditional Feature-Weighted Jump Models for Temporal Clustering

鲁棒的状态条件特征加权跳跃模型用于时间聚类

Federico P. Cortese, Alessio Farcomeni

AI总结 提出一种鲁棒的特征加权跳跃模型,通过Tukey双权损失函数实现鲁棒性,并引入状态特定特征权重,在模拟和实证中优于竞争方法。

详情
AI中文摘要

我们提出了一种用于时间依赖聚类的鲁棒特征加权跳跃模型。使用惩罚项来鼓励随时间平滑过渡,同时通过Tukey双权损失函数实现鲁棒性。一个额外的参数控制特征权重在不同状态间的变异性,允许模型为每个特征分配状态特定的相关性。我们在模拟中展示了该方法如何准确恢复真实聚类序列并可靠识别相关特征,特别是在存在异常值的情况下优于竞争方法。最后,我们进行了两个实证应用,一个涉及1998-2000年科索沃冲突相关杀人事件的数量,另一个涉及1949-2024年十二个欧洲国家的宏观经济表现。

英文摘要

We propose a robust feature-weighted jump model for time-dependent clustering. A penalty is used to encourage smoothness of transitions over time, while robustness is achieved through Tukey's biweight loss function. An additional parameter controls the variability of feature weights across states, allowing the model to assign state-specific relevance to each feature. We illustrate in simulation how the method accurately recovers the true cluster sequence and reliably identifies relevant features, outperforming competing approaches, particularly in the presence of outliers. We conclude with two empirical applications, one on the number of conflict-related homicides in Kosovo in the period 1998-2000, and another on the macroeconomic performance of twelve European countries in the period 1949-2024.
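The robustness mechanism named in the abstract, Tukey's biweight loss, can be sketched in a few lines. This shows only the standard loss function, with the conventional tuning constant c = 4.685 as an assumed default. How the paper embeds it in the jump-model objective is not described in the abstract, and the function name is hypothetical.

```python
def tukey_biweight(r: float, c: float = 4.685) -> float:
    """Tukey's biweight (bisquare) loss for a scaled residual r.

    Grows roughly quadratically near zero and saturates at c**2 / 6 once
    |r| > c, so gross outliers contribute a bounded amount to the objective.
    """
    cap = c * c / 6.0
    if abs(r) > c:
        return cap
    return cap * (1.0 - (1.0 - (r / c) ** 2) ** 3)


if __name__ == "__main__":
    for r in (0.0, 1.0, 2.0, 10.0, 100.0):
        print(r, tukey_biweight(r))
```

Because the loss is flat beyond c, residuals of 10 and 100 are penalized identically, unlike a squared-error loss where the larger outlier would dominate.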

2606.13277 2026-06-12 stat.ML cs.LG 交叉投稿

ProtoX-AD: Self-Explainable Time Series Anomaly Detection and Characterization

ProtoX-AD:自解释的时间序列异常检测与特征描述

Aitor Sánchez-Ferrera, Elisabeth Wetzer, Kristoffer Wickstrøm, Michael Kampffmeyer, Robert Jenssen

AI总结 提出ProtoX-AD框架,通过原型学习实现自监督时间序列异常检测的可解释性,在保持检测性能的同时提供语义一致的异常特征解释。

Comments 26 pages, 8 figures

详情
AI中文摘要

时间序列异常检测(TSAD)的最新进展突显了自监督分类方法的有效性。这些方法对正常训练样本应用变换,训练分类器识别变换特定模式,从而通过增加分类误差来帮助识别异常。尽管性能强大,但一个重大挑战是缺乏可解释性,因为它们对标记异常的特征提供的洞察有限。为了解决这一局限,我们提出了ProtoX-AD,一种基于原型的自解释框架,用于自监督TSAD。ProtoX-AD学习变换感知的潜在表示以及可解释的原型,从而实现准确的异常检测和通过基于原型的解释识别不同的异常轮廓。此外,它允许系统分析变换设计如何影响检测性能和可解释性。在合成和真实世界数据集上的实验结果表明,ProtoX-AD实现了与其黑盒对应物相当的检测性能,同时比现有的可解释基线提供更一致和语义上有意义的解释。我们的代码公开于 https://github.com/Aitorzan3/ProtoX-AD。

英文摘要

Recent advances in time series anomaly detection (TSAD) have highlighted the effectiveness of self-supervised classification-based approaches. These methods apply transformations to normal training samples, training a classifier to recognize transformation-specific patterns that help identify anomalies through increased classification errors. Despite their strong performance, a significant challenge is their lack of explainability, as they provide limited insight into the characteristics of flagged anomalies. To address this limitation, we propose ProtoX-AD, a prototype-based self-explainable framework for self-supervised TSAD. ProtoX-AD learns transformation-aware latent representations alongside interpretable prototypes, enabling both accurate anomaly detection and the identification of distinct anomalous profiles through prototype-based explanations. Additionally, it allows for systematic analysis of how transformation design impacts detection performance and explainability. Experimental results on synthetic and real-world datasets demonstrate that ProtoX-AD achieves detection performance comparable to its black-box counterparts while offering more consistent and semantically meaningful explanations than existing explainable baselines. Our code is publicly available at https://github.com/Aitorzan3/ProtoX-AD.

2606.13439 2026-06-12 cs.CL cs.LG 交叉投稿

S-GBT: Smooth Growth Bound Tensor for Certified Robustness Against Word Substitution Attacks in NLP

S-GBT:针对NLP中词替换攻击的认证鲁棒性的平滑增长界张量

Mohammed Bouri, Mohammed Erradi, Adnane Saoud

发表机构 * College of Computing, Mohammed VI Polytechnic University(穆罕默德六世理工大学计算机学院) ENSIAS, University Mohamed V of Rabat(拉巴特穆罕默德五世大学ENSIAS) CID Development

AI总结 提出二阶方法S-GBT,通过逐元素约束Hessian矩阵并加入正则化项,结合一阶和二阶正则化提升对词替换攻击的认证鲁棒性,在LSTM和CNN上验证,认证鲁棒准确率提升高达23.4%。

Comments The paper has been accepted at NETYS 2026 - 14th edition of the International Conference on Networked Systems

详情
AI中文摘要

尽管自然语言处理(NLP)近期取得了进展,模型仍然容易受到词替换攻击。大多数现有防御方法关注一阶敏感性,并衡量输入轻微扰动时输出的变化程度。然而,它们忽略了这种敏感性的演变,而这由曲率描述。当梯度急剧变化时,模型仍可能失败。本文引入了平滑增长界张量(S-GBT),一种逐元素约束Hessian矩阵的二阶方法,我们为其产生的鲁棒性界提供了形式化理论证明。在训练过程中添加正则化项以最小化这些界。这产生了针对词替换攻击的更紧的认证鲁棒性。词替换下输出的变化由线性项和二次项共同界定。S-GBT针对两种架构推导:长短期记忆网络(LSTM)和卷积神经网络(CNN)。该方法直接集成到训练目标中。在多个基准数据集上评估其有效性。结果表明,与先前方法相比,结合一阶和二阶正则化可将认证鲁棒准确率提升高达23.4%,同时干净准确率保持竞争力。这些发现表明,同时控制梯度及其变化是构建更鲁棒模型的一个有前景的方向。

英文摘要

Despite recent progress in Natural Language Processing (NLP), models remain vulnerable to word substitution attacks. Most existing defenses focus on first-order sensitivity and measure how much the output changes when the input is slightly perturbed. However, they ignore how this sensitivity evolves, which is described by curvature. When gradients vary sharply, models can still fail. This paper introduces the Smooth Growth Bound Tensor (S-GBT), a second-order method that bounds the Hessian element-wise, for which we provide formal theoretical proofs on the resulting robustness bounds. A regularization term is added during training to minimize these bounds. This yields tighter certified robustness against word substitution attacks. The change in the output under word substitution is bounded by both a linear term and a quadratic term. S-GBT is derived for two architectures: Long Short-Term Memory (LSTM) and Convolutional Neural Networks (CNN). The method is integrated directly into the training objective. Its effectiveness is evaluated on multiple benchmark datasets. The results show that combining first- and second-order regularization improves certified robust accuracy by up to 23.4% compared to prior methods, while clean accuracy remains competitive. These findings indicate that controlling both the gradient and its variation is a promising direction for building more robust models.
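The statement that the output change is bounded by a linear term plus a quadratic term matches the shape of a generic second-order Taylor bound, sketched below. This is not the paper's S-GBT bound: the scalar output $f$, the perturbation $\delta$, the gradient $g = \nabla f(x)$, and the element-wise Hessian bound $M$ are notation assumed here for illustration only.

```latex
% Assumption: f is twice continuously differentiable, and |H_{ij}(\xi)| \le M_{ij}
% for every point \xi on the segment between x and x + \delta.
\begin{align}
f(x+\delta) - f(x)
  &= g^{\top}\delta + \tfrac{1}{2}\,\delta^{\top} H(\xi)\,\delta
  && \text{for some } \xi \text{ on the segment,} \\
\bigl| f(x+\delta) - f(x) \bigr|
  &\le \underbrace{\sum_{i} |g_i|\,|\delta_i|}_{\text{linear (first-order) term}}
   + \underbrace{\tfrac{1}{2}\sum_{i,j} M_{ij}\,|\delta_i|\,|\delta_j|}_{\text{quadratic (second-order) term}} .
\end{align}
```

Shrinking the entries of $M$ during training tightens the quadratic term, which is consistent with the abstract's description of regularizing an element-wise Hessian bound.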

2606.13621 2026-06-12 cs.AI cs.CR cs.GT cs.LG cs.MA 交叉投稿

Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

超越运行时强制:作为对抗网络可防御性分析的盾牌合成

Achraf Hsain, Sultan Almuhammadi

发表机构 * Information and Computer Science Department, King Fahd University of Petroleum and Minerals(信息与计算机科学系,法赫德国王石油矿产大学)

AI总结 提出将盾牌合成重新解释为设计时分析工具,通过约束双人安全博弈生成可防御性判定,并融合拓扑度量和强化学习行为形成可防御性指纹,揭示系统安全的结构性见解。

Comments 26 pages, 7 figures, 7 tables. Under review at JAIR. Code: https://github.com/AchrafHsain7/Bastion

详情
AI中文摘要

盾牌强化学习通常被呈现为一种运行时安全机制,它将时序逻辑规范编译成限制智能体行为的自动机。我们认为这是错误的产品。同样的自动机理论机制——规范编译、乘积博弈构建、吸引子计算和获胜区域提取——更适合被解读为一种设计时分析工具,其输出是关于系统的结构性见解,而非对已部署智能体的运行时约束。我们通过一个用于网络防御的约束双人安全博弈来实例化这一点。两个规范被不对称地执行:防御者规范定义了博弈的不安全区域,而攻击者规范在吸引子计算期间限制了对手的合法行为。求解该博弈产生一个可防御性判定——一个形式化证书,表明拓扑-规范对是否可防御——以及相关的获胜区域和盾牌。除了二元判定,我们还从吸引子结构中推导出拓扑级度量,并将其与盾牌约束的对抗性多智能体强化学习的后收敛行为相结合。这些共同构成了一个可防御性指纹,捕捉了网络的形式安全属性及其在自适应博弈下的操作行为。假设分析表明,形式可防御性和操作有效性捕捉了安全的不同方面:小的架构变化可能导致操作结果的巨大变化,而形式安全裕度几乎不变。因此,盾牌合成最有价值的不是作为安全智能体的部署机制,而是作为回答关于系统是否、在哪里以及如何可以被防御的架构问题的框架。可防御性判定是输出,而非安全策略。

英文摘要

Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata-theoretic machinery -- specification compilation, product game construction, attractor computation, and winning-region extraction -- is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict -- a formal certificate that a topology-specification pair is or is not defensible -- with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
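
The attractor computation and winning-region extraction named in this abstract can be sketched on a small turn-based game graph. This is a minimal textbook-style sketch, not the authors' Bastion code: the state names, the `owner` and `succ` dictionaries, and both function names are hypothetical, every state is assumed to have at least one successor, and the attacker specification is assumed to be already applied by listing only legal attacker moves in `succ`.

```python
def attacker_attractor(states, owner, succ, unsafe):
    """Return the states from which the attacker can force a visit to `unsafe`."""
    attr = set(unsafe)
    changed = True
    while changed:
        changed = False
        for s in states:
            if s in attr:
                continue
            if owner[s] == "attacker":
                # The attacker needs only one move into the attractor.
                pulled = any(t in attr for t in succ[s])
            else:
                # The defender is trapped only if every move leads into it.
                pulled = all(t in attr for t in succ[s])
            if pulled:
                attr.add(s)
                changed = True
    return attr


def defensibility_verdict(states, owner, succ, unsafe, initial):
    """Return (verdict, defender winning region, shield of allowed defender moves)."""
    attr = attacker_attractor(states, owner, succ, unsafe)
    winning = set(states) - attr
    shield = {
        s: [t for t in succ[s] if t in winning]
        for s in winning
        if owner[s] == "defender"
    }
    return initial in winning, winning, shield


# Hypothetical four-state network: the defender moves at d0, the attacker elsewhere.
states = ["d0", "a1", "a2", "bad"]
owner = {"d0": "defender", "a1": "attacker", "a2": "attacker", "bad": "attacker"}
succ = {"d0": ["a1", "a2"], "a1": ["d0", "bad"], "a2": ["d0"], "bad": ["bad"]}

verdict, winning, shield = defensibility_verdict(states, owner, succ, {"bad"}, "d0")
print(verdict, sorted(winning), shield)
```

In this toy instance the verdict is positive because the defender can always route through `a2`. The shield is the by-product, and the verdict and winning region are the structural outputs the abstract emphasizes.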

2506.23033 2026-06-12 cs.LG stat.ML 版本更新

How Reliable are Fairness Audits with Unreliable Data?

不可靠数据下的公平性审计有多可靠?

Yash Vardhan Tomar

发表机构 * Purdue University(普渡大学)

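AI Summary Studies how missing protected labels affect fairness mitigation audits. Proposes a seed-calibrated stress test that separates missingness effects from seed-to-seed variation, and finds that settings retaining some protected labels usually do not change the selected mitigation method, that the no-label endpoint behaves differently, and that threshold optimization can turn single-axis fairness gains into intersectional harm.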

详情
AI中文摘要

公平性审计是负责任机器学习部署的关键组成部分。然而,在不完全受保护标签访问下审计建议的可靠性仍然知之甚少。在这项工作中,我们关注公平性缓解审计中的受保护标签缺失。我们引入了一种种子校准压力测试,以将缺失效应与完全标签下已经存在的种子间波动分离开来。在ACS/Folktables任务中,我们发现正可用性缺失通常不会将选定的缓解方法移出完全标签的种子基线。无标签端点表现不同,暴露了ERM等效候选和确定性断点,而不是广泛的缺失效应。我们还发现,阈值优化可以将单轴公平性增益转化为高于零点的交叉危害,这是一种更尖锐的失败模式,在随机森林验证下似乎仍然可见。总体而言,我们的结果强调,在将受保护标签缺失视为审计脆弱性的证据之前,应报告种子零校准、候选集背景和交叉后果。

英文摘要

Fairness audits are a key component of responsible machine-learning deployment. Yet the reliability of audit recommendations under incomplete protected-label access is still poorly understood. In this work, we focus on protected-label missingness in fairness mitigation audits. We introduce a seed-calibrated stress test to separate missingness effects from seed-to-seed movement already present under complete labels. Across ACS/Folktables tasks, missingness settings that retain some protected labels usually do not move selected mitigation methods beyond a complete-label seed-to-seed baseline. At $0\%$ protected-label access, candidates collapse to an empirical-risk-minimization baseline and deterministic tie-breaking rather than revealing a broad missingness effect. We also find that threshold optimization can turn fairness gains on a single protected axis into intersectional harm above a seed baseline, and this threshold-optimizer finding persists under random-forest validation. Overall, our results highlight that protected-label missingness should be reported with seed-null calibration, candidate-set context, and intersectional consequences before it is treated as evidence of audit fragility.
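
The idea of calibrating against a seed baseline can be sketched as follows. This is an illustrative reading of the abstract, not the paper's procedure: the statistic (largest pairwise gap across seeds), the pairing of runs by seed, the function names, and all metric values are assumptions made for the example.

```python
from itertools import combinations


def seed_null_range(complete_label_metric):
    """Largest gap between any two seeds when all protected labels are available."""
    return max(abs(a - b) for a, b in combinations(complete_label_metric, 2))


def exceeds_seed_baseline(complete_label_metric, missing_label_metric):
    """Flag a missingness setting only if it moves the metric beyond seed noise.

    Both lists hold one fairness-metric value per seed, in the same seed order.
    """
    null_range = seed_null_range(complete_label_metric)
    shift = max(
        abs(m - c) for m, c in zip(missing_label_metric, complete_label_metric)
    )
    return shift > null_range, shift, null_range


# Made-up fairness-gap values for four seeds.
complete = [0.10, 0.12, 0.09, 0.11]
mild_missingness = [0.11, 0.12, 0.10, 0.10]
severe_missingness = [0.20, 0.21, 0.18, 0.22]

print(exceeds_seed_baseline(complete, mild_missingness)[0])
print(exceeds_seed_baseline(complete, severe_missingness)[0])
```

The design point is that a shift is reported as a missingness effect only after it is compared with the movement that reseeding alone already produces.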

2507.07947 2026-06-12 cs.LG cs.AI 版本更新

Reconstructing Template-Memorized Images from Natural Prompts

从自然提示中重建模板记忆的图像

Sol Yarkoni, Mahmood Sharif, Roi Livni

发表机构 * School of Electrical & Computer Engineering(电气与计算机工程学院) School of Computer Science & AI(计算机科学与人工智能学院) Tel Aviv University(特拉维夫大学)

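AI Summary Proposes a low-resource attack that exploits patterns in templated e-commerce data to reconstruct memorized training images from natural prompts, exposing privacy risks.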

详情
AI中文摘要

生成模型(如扩散模型)的最新进展引发了与隐私、版权侵犯和数据管理相关的担忧。为了更好地理解和控制这些风险,先前的工作引入了从训练数据中重建图像或部分图像的技术和攻击。虽然这些结果表明训练数据可以被恢复,但现有方法通常依赖于高计算资源、对训练集的部分访问或精心设计的提示。在这项工作中,我们提出了一种新的攻击,该攻击需要低资源,假设对训练数据几乎没有或完全没有访问权限,并识别出看似良性的提示,这些提示可能导致潜在有风险的图像重建。我们进一步表明,即使对于没有专业知识的用户,这种重建也可能无意中发生。例如,我们观察到,对于某个现有模型,提示“blue Unisex T-Shirt”(蓝色男女通用T恤)会生成一个真实个体的面部。此外,通过将已识别的漏洞与真实世界的提示数据相结合,我们发现了能够重现记忆视觉元素的提示。我们的方法建立在先前工作的见解之上,并利用领域知识来揭示由于使用抓取的电商数据而产生的基本漏洞,其中模板化布局和图像与模式化的文本提示紧密相关。我们的攻击代码已在 https://github.com/TheSolY/lr-tmi 公开。

英文摘要

Recent advances in generative models, such as diffusion models, have raised concerns related to privacy, copyright infringement, and data stewardship. To better understand and control these risks, prior work has introduced techniques and attacks that reconstruct images, or parts of images, from training data. While these results demonstrate that training data can be recovered, existing methods often rely on high computational resources, partial access to the training set, or carefully engineered prompts. In this work, we present a new attack that requires low resources, assumes little to no access to the training data, and identifies seemingly benign prompts that can lead to potentially risky image reconstruction. We further show that such reconstructions may occur unintentionally, even for users without specialized knowledge. For example, we observe that for one existing model, the prompt "blue Unisex T-Shirt" generates the face of a real individual. Moreover, by combining the identified vulnerabilities with real-world prompt data, we discover prompts that reproduce memorized visual elements. Our approach builds on insights from prior work and leverages domain knowledge to expose a fundamental vulnerability arising from the use of scraped e-commerce data, where templated layouts and images are closely tied to pattern-like textual prompts. The code for our attack is publicly available at https://github.com/TheSolY/lr-tmi.

2507.08794 2026-06-12 cs.LG cs.CL 版本更新

One Token to Fool LLM-as-a-Judge

一个令牌就能欺骗LLM裁判

Yulai Zhao, Haolin Liu, Dian Yu, Sunyuan Kung, Meijia Chen, Haitao Mi, Dong Yu

发表机构 * Princeton University(普林斯顿大学) University of Virginia(弗吉尼亚大学) Tencent AI Lab(腾讯人工智能实验室) Rutgers University(罗格斯大学)

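AI Summary Finds that reference-based generative reward models are vulnerable to reward hacking: superficial inputs such as non-word symbols or generic reasoning openers consistently elicit false-positive rewards. Proposes data augmentation with truncated model outputs as adversarial negatives to build the robust Master reward models.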

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作自动裁判,协助评估并为训练其他模型提供奖励信号,特别是在基于参考的设置中,如带可验证奖励的强化学习(RLVR)。然而,我们揭示了即使在这种基于参考的范式中也存在一个关键漏洞:生成式奖励模型系统性地容易受到奖励黑客攻击。我们发现,被我们称为“万能钥匙”的表面输入,例如非词符号(如“:”或“.”)或通用推理开头(如“思考过程:”或“让我们逐步解决这个问题。”),可以在没有任何实质性推理的情况下持续引发假阳性奖励。我们的系统评估表明,这是一个广泛存在的失败,影响多种模型,包括领先的专有系统如GPT-o1和Claude-4。这些结果挑战了LLM裁判假定的鲁棒性,并对其可靠性构成重大威胁。为了解决这个问题,我们提出了一种简单而有效的数据增强策略,使用截断的模型输出作为对抗性负例。由此产生的Master奖励模型(Master-RMs)在抵御这些“万能钥匙”攻击方面表现出最先进的鲁棒性,同时在标准评估设置中保持高性能。我们通过跨模型规模、提示变化和常见推理时策略的漏洞全面分析来补充这些发现,为未来关于鲁棒LLM评估的研究提供见解。我们在 https://huggingface.co/sarosavo/Master-RM 和 https://huggingface.co/datasets/sarosavo/Master-RM 发布我们的鲁棒通用领域奖励模型和合成训练数据。

英文摘要

Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings like Reinforcement Learning with Verifiable Rewards (RLVR). However, we uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking. We find that superficial inputs, which we term "master keys", such as non-word symbols (e.g., ":" or ".") or generic reasoning openers (e.g., "Thought process:" or "Let's solve this problem step by step."), can consistently elicit false positive rewards without any substantive reasoning. Our systematic evaluation demonstrates this is a widespread failure affecting a diverse range of models, including leading proprietary systems such as GPT-o1 and Claude-4. These results challenge the assumed robustness of LLM judges and pose a significant threat to their reliability. To address this, we propose a simple yet effective data augmentation strategy using truncated model outputs as adversarial negative examples. The resulting Master Reward Models (Master-RMs) demonstrate state-of-the-art robustness against these "master key" attacks while maintaining high performance in standard evaluation settings. We supplement these findings with a comprehensive analysis of the vulnerability across model scales, prompt variations, and common inference-time strategies, offering insights to guide future research on robust LLM evaluation. We release our robust, general-domain reward models and the synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
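
The augmentation strategy described in this abstract (truncated model outputs used as adversarial negatives) might look roughly like the sketch below. The abstract does not state the truncation rule, so cutting each response to its first sentence is an assumption, as are the record fields (`question`, `reference`, `response`, `label`) and both function names.

```python
import re


def first_sentence(text):
    """Return the text up to the first sentence-ending mark or colon (assumed rule)."""
    parts = re.split(r"(?<=[.!?:])\s+", text.strip(), maxsplit=1)
    return parts[0]


def augment_with_truncated_negatives(examples):
    """Append one negative per example whose response has more than one sentence.

    The truncated opener carries no substantive reasoning, so it is labeled 0.
    Caveat: a real pipeline would need to check that the opener does not
    already contain the correct answer.
    """
    augmented = list(examples)
    for ex in examples:
        opener = first_sentence(ex["response"])
        if opener and opener != ex["response"].strip():
            augmented.append({**ex, "response": opener, "label": 0})
    return augmented


data = [
    {
        "question": "What is 2+2?",
        "reference": "4",
        "response": "Let's solve this problem step by step. 2+2=4. The answer is 4.",
        "label": 1,
    },
    {"question": "What is 3+3?", "reference": "6", "response": "6", "label": 1},
]

for row in augment_with_truncated_negatives(data):
    print(row["label"], repr(row["response"]))
```

Training a judge on such pairs exposes it to generic openers that look like the start of a solution but must receive no reward.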

2603.29515 2026-06-12 cs.LG 版本更新

Variational Graph Neural Networks for Uncertainty Quantification in Inverse Problems

变分图神经网络用于反问题中的不确定性量化

David Gonzalez, Alba Muixi, Beatriz Moya, Elias Cueto

发表机构 * Keysight-UZ Chair of the Spanish National Strategy on AI(西班牙人工智能国家战略主席席位) Aragon Institute of Engineering Research (I3A)(阿拉贡工程研究所(I3A)) Universidad de Zaragoza(萨拉戈萨大学) Laboratori de Càlcul Numèric (LaCàN)(数值计算实验室(LaCàN)) Universitat Politècnica de Catalunya - BarcelonaTech (UPC)(加泰罗尼亚理工大学 - 巴塞罗那科技大学(UPC)) Centre Internacional de Mètodes Numèrics en Enginyeria (CIMNE)(国际数值工程方法中心(CIMNE)) PIMM Lab. Arts et Métiers Institute of Technology(巴黎艺术与技术理工学院PIMM实验室)

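AI Summary Proposes a variational graph neural network (VGNN) that adds variational layers to the decoder to quantify cognitive and statistical uncertainty at low cost, validated on solid-mechanics inverse problems with accurate parameter recovery and confidence-interval estimates.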

详情
AI中文摘要

深度学习技术在计算力学中的日益广泛应用显著加速了那些几年前还被认为是难以处理的问题的模拟。然而,在诸如工程或医学数字孪生等关键应用中,快速响应是不够的;还必须提供可靠的结果。在某些情况下,传统的确定性方法可能不是最优的,因为它们无法提供对其预测或结果的置信度度量,尤其是在反问题中,解可能不唯一或初始数据由于噪声等原因不完全可靠。经典的深度神经网络也缺乏明确的度量来量化其预测的不确定性。在这项工作中,我们提出了一种变分图神经网络(VGNN)架构,该架构将变分层集成到其架构中以建模权重的概率分布。与计算昂贵的全贝叶斯网络不同,我们的方法仅在解码器中策略性地引入变分层,从而能够以相对较低的成本估计认知不确定性和统计不确定性。在这项工作中,我们在两个固体力学案例中验证了所提出的方法:在二维弹性问题中识别具有非线性分布的弹性模量值,以及在三维超弹性梁中定位和量化施加的载荷,在这两种情况下仅使用每个测试的位移场作为输入数据。结果表明,该模型不仅以高精度恢复了物理参数,还提供了与问题物理特性一致的置信区间,并且能够定位施加载荷的位置并估计其值,为该实验提供了置信区间。

英文摘要

The increasingly wide use of deep machine learning techniques in computational mechanics has significantly accelerated simulations of problems that were considered unapproachable just a few years ago. However, in critical applications such as Digital Twins for engineering or medicine, fast responses are not enough; reliable results must also be provided. In certain cases, traditional deterministic methods may not be optimal as they do not provide a measure of confidence in their predictions or results, especially in inverse problems where the solution may not be unique or the initial data may not be entirely reliable due to the presence of noise, for instance. Classic deep neural networks also lack a clear measure to quantify the uncertainty of their predictions. In this work, we present a variational graph neural network (VGNN) architecture that integrates variational layers into its architecture to model the probability distribution of weights. Unlike computationally expensive full Bayesian networks, our approach strategically introduces variational layers exclusively in the decoder, allowing us to estimate cognitive uncertainty and statistical uncertainty at a relatively lower cost. In this work, we validate the proposed methodology in two cases of solid mechanics: the identification of the value of the elastic modulus with nonlinear distribution in a 2D elastic problem and the location and quantification of the loads applied to a 3D hyperelastic beam, in both cases using only the displacement field of each test as input data. The results show that the model not only recovers the physical parameters with high precision, but also provides confidence intervals consistent with the physics of the problem, as well as being able to locate the position of the applied load and estimate its value, giving a confidence interval for that experiment.

2605.00432 2026-06-12 cs.LG stat.ML 版本更新

Optimal Spatio-Temporal Decoupling for Bayesian Conformal Prediction

贝叶斯共形预测的最优时空解耦

Yu-Hsueh Fang, Chia-Yen Lee

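AI Summary Proposes State-Adaptive Bayesian Conformal Prediction (SA-BCP), which balances long-term temporal inertia against local spatial evidence through a gated convex combination, giving fast adaptation and stable coverage under distribution shift, with a closed-form MSE-optimal threshold and regret bounds for the online selection procedure.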

详情
AI中文摘要

在线共形预测必须在快速适应分布漂移与稳定覆盖之间取得平衡:基于反馈的方法反应迅速但变得不稳定,而强折扣贝叶斯方法滞后并在紧密覆盖下膨胀区间。我们引入了状态自适应贝叶斯共形预测(SA-BCP),它将预测分位数构造为长期时间惯性与来自核密度估计的局部空间证据的门控凸组合,由单个可解释的证据阈值$K$控制。我们建立了三个结果:(i) 所得区间的渐近边际有效性;(ii) MSE最优阈值的闭式表达式$K^*_{\mathrm{MSE}}=\alpha(1-\alpha)/M^{\mathcal{T}}$,权衡了覆盖指标(伯努利)方差与时间结构偏差$M^{\mathcal{T}}$;(iii) 在线选择$K$的滚动起点过程:在平稳性下一致,对最佳固定$K$具有$O(\sqrt{T\log N})$遗憾,其分段变体在有界漂移下具有次线性动态遗憾界。在四个金融波动率和天气数据集、三个目标覆盖水平以及八个基线(包括最强的最近条件分位数方法SPCI和KOWCPI)上,SA-BCP在大多数设置中达到或超过名义覆盖,同时产生显著更窄的区间(在最紧密覆盖下,Winkler得分比折扣贝叶斯CP最多低约$3\times$),覆盖匹配审计确认这些效率提升并非欠覆盖的假象。我们披露了一个主要限制:一个专门针对波动率的共形GARCH竞争对手在其擅长的波动率序列上仍然更高效,尽管它不能跨领域迁移。

英文摘要

Online conformal prediction must balance fast adaptation to distribution shift against stable coverage: feedback-driven methods react quickly but become volatile, while strongly discounted Bayesian methods lag and inflate intervals at tight coverage. We introduce State-Adaptive Bayesian Conformal Prediction (SA-BCP), which forms the predictive quantile as a gated convex combination of long-term temporal inertia and local spatial evidence from a kernel density estimate, controlled by a single interpretable evidence threshold $K$. We establish three results: (i) asymptotic marginal validity of the resulting intervals; (ii) a closed-form expression for the MSE-optimal threshold, $K^*_{\mathrm{MSE}}=\alpha(1-\alpha)/M^{\mathcal{T}}$, trading the coverage-indicator (Bernoulli) variance against the temporal structural bias $M^{\mathcal{T}}$; and (iii) a rolling-origin procedure for selecting $K$ online -- consistent under stationarity, with $O(\sqrt{T\log N})$ regret against the best fixed $K$ and, for a segmented variant, a sublinear dynamic-regret bound under bounded drift. Across four financial-volatility and weather datasets, three target coverage levels, and eight baselines (including the strongest recent conditional-quantile methods, SPCI and KOWCPI), SA-BCP attains at-or-above-nominal coverage in most settings while producing substantially sharper intervals -- up to roughly $3\times$ lower Winkler score than discounted Bayesian CP at the tightest coverage -- and a coverage-matched audit confirms these efficiency gains are not an artifact of under-coverage. We disclose one principal limitation: a volatility-specialized conformal-GARCH competitor remains more efficient on its home volatility-base series, though it does not transfer across domains.
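
The closed-form threshold and the gated combination can be made concrete with a small numeric sketch. Only the formula for the MSE-optimal threshold comes from the abstract. The gate shape `n / (n + K)`, the function names, and all numbers are assumptions chosen for illustration, since the abstract does not specify how the evidence threshold enters the gate.

```python
def mse_optimal_threshold(alpha, temporal_bias):
    """K* = alpha * (1 - alpha) / M^T, as stated in the abstract.

    alpha is the miscoverage level and temporal_bias is the temporal
    structural bias M^T (must be positive).
    """
    if temporal_bias <= 0:
        raise ValueError("temporal_bias must be positive")
    return alpha * (1.0 - alpha) / temporal_bias


def gated_quantile(q_temporal, q_local, local_evidence, threshold_k):
    """Convex combination of a long-run quantile and a local quantile.

    ASSUMED gate: weight on the local estimate is n / (n + K), so it rises
    from 0 toward 1 as the local evidence n grows past the threshold K.
    """
    gate = local_evidence / (local_evidence + threshold_k)
    return (1.0 - gate) * q_temporal + gate * q_local


k_star = mse_optimal_threshold(alpha=0.1, temporal_bias=0.009)
print(k_star)
print(gated_quantile(2.0, 4.0, local_evidence=10.0, threshold_k=k_star))
```

The Bernoulli variance `alpha * (1 - alpha)` in the numerator means tighter coverage targets (smaller `alpha`) call for a smaller threshold at a fixed bias level.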

2605.00600 2026-06-12 cs.LG cs.AI cs.CV 版本更新

Possibilistic Predictive Uncertainty for Deep Learning

深度学习的可能性预测不确定性

Yao Ni, Jeremie Houssineau, Yew-Soon Ong, Piotr Koniusz

发表机构 * University of Cambridge(剑桥大学) National University of Singapore(新加坡国立大学) University of Warsaw(华沙大学)

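AI Summary Proposes Dirichlet-approximated possibilistic posterior predictions (DAPPr), a framework grounded in possibility theory whose projection-and-approximation strategy yields efficient, principled epistemic uncertainty quantification with competitive performance on multiple benchmarks.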

Comments Accepted by ICML 2026, 20 pages

详情
AI中文摘要

深度神经网络在多种应用中取得了令人印象深刻的结果,然而它们对未见输入的过度自信需要可靠的认知不确定性建模。现有的不确定性建模方法面临一个基本困境:贝叶斯方法提供原则性的估计,但计算成本高昂,而高效的二阶预测器在其特定目标与认知不确定性量化之间缺乏严格联系。为解决这一困境,我们引入了Dirichlet近似可能性后验预测(DAPPr),一个基于可能性理论的原则性框架。我们定义了参数上的可能性后验,通过上确界算子将其投影到预测空间,并使用可学习的Dirichlet可能性函数近似投影后的后验。这种投影-近似策略产生了一个具有闭式解的简单训练目标。尽管简单,跨多个不同基准的大量实验表明,DAPPr在保持原则性推导和计算效率的同时,实现了与最先进的二阶预测器相当或更优的不确定性量化性能。代码可在 https://github.com/MaxwellYaoNi/DAPPr 获取。

英文摘要

Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modeling. Existing methods for uncertainty modeling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second-order predictors lack rigorous connections between their specific objectives and epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet-approximated possibilistic posterior predictions (DAPPr), a principled framework grounded in possibility theory. We define a possibilistic posterior over parameters, project it to the prediction space via supremum operators, and approximate the projected posterior using learnable Dirichlet possibility functions. This projection-and-approximation strategy yields a simple training objective with closed-form solutions. Despite its simplicity, extensive experiments across diverse benchmarks show that DAPPr achieves competitive or superior uncertainty quantification performance over state-of-the-art second-order predictors while maintaining both principled derivation and computational efficiency. Code is available at https://github.com/MaxwellYaoNi/DAPPr.

2605.18231 2026-06-12 cs.LG 版本更新

Attacking the First-Principle: A Black-Box, Query-Free Targeted Mimicry Attack on Binary Function Classifiers

攻击第一原理:一种针对二进制函数分类器的黑盒、无查询定向模仿攻击

Gabriel Sauger, Jean-Yves Marion, Sazzadur Rahaman, Victor Matrat, Vincent Tourneur, Muaz Ali

发表机构 * LORIA(洛林信息与自动化研究院) University of Arizona(亚利桑那大学)

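AI Summary Proposes Kelpie, a framework that carries out mimicry attacks on binary function classifiers in a black-box, query-free setting for the first time, shows its effectiveness across model architectures, validates feasibility with practical demonstrations, and raises questions about the reliability and security of existing machine-learning-based binary function classifiers.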

详情
AI中文摘要

二进制函数分类器通过检测恶意代码和未经授权的修改,在维护软件系统的安全性和完整性方面起着关键作用。然而,基于机器学习的分类器容易受到能够逃避检测的对抗攻击。在本研究中,我们提出Kelpie,一种新型框架,用于在黑盒、零查询设置下对二进制函数分类器执行模仿攻击,这是一种更强的定向逃避攻击。与以往依赖查询目标分类器来改进非定向逃避攻击的方法不同,Kelpie利用代码变换,在保持恶意负载功能的同时,使其被误分类为攻击者指定的类别。通过大量实验,我们证明Kelpie无需与分类器直接交互,即可成功对六种代表不同模型架构的最先进二进制函数分类器执行模仿攻击。我们进一步通过实际演示验证了该方法,演示中键盘记录器和擦除器被隐藏在嵌入应用程序的看似无害的函数中。据我们所知,这项工作首次在黑盒、零查询设置下展示了此类模仿攻击,对现有基于机器学习的二进制函数分类器的可靠性和安全性提出了重要问题。

英文摘要

Binary function classifiers play a crucial role in maintaining the security and integrity of software systems by detecting malicious code and unauthorized modifications. However, machine-learning-based classifiers are vulnerable to adversarial attacks that evade detection. In this study, we present Kelpie, a novel framework for executing mimicry attacks, a stronger type of targeted evasion attack, on binary function classifiers in a black-box, zero-query setting. Unlike previous approaches that rely on querying the target classifier to refine untargeted evasion attacks, Kelpie leverages code transformations that preserve the functionality of malicious payloads while causing them to be misclassified as an attacker-chosen class. Through extensive experimentation, we demonstrate that Kelpie can successfully execute mimicry attacks against six state-of-the-art binary function classifiers representing different model architectures, without requiring direct interaction with them. We further validate our approach with a practical demonstration involving a keylogger and a wiper concealed within benign-looking functions embedded in an application. To the best of our knowledge, this work is the first to demonstrate such a mimicry attack in a black-box, zero-query context, raising important questions about the reliability and security of existing machine-learning-based binary function classifiers.

2606.09073 2026-06-12 cs.LG cs.AI cs.CL 版本更新

A Unifying Lens on Reward Uncertainty in RLHF

RLHF中奖励不确定性的统一视角

Ely Hahami, Yoel Zimmermann, Ray Zhou, Jack Benarroch Jedlicki

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind

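AI Summary Uses distributional reward models to unify pessimism methods in RLHF, connecting existing heuristics through a closed-form effective reward and exposing their implicit assumptions.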

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)受限于奖励破解,即策略利用代理奖励模型(RM)中的错误,在质量没有真正提升的情况下获得很高的RM分数。一种自然的缓解方法是悲观主义:在RM不确定的区域降低奖励。然而,标准标量RM没有提供原则性的不确定性概念。我们认为正确的对象是分布奖励模型$p(r\mid x,y)$。在贝叶斯推断或KL分布鲁棒优化(KL-DRO)视角下,KL正则化的RLHF目标具有闭式有效奖励$\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$。悲观分支统一了RM集成聚合的先前启发式方法:均值聚合、最坏情况优化(WCO)和不确定性加权优化(UWO)都作为该单一表达式的极限或截断出现。这也澄清了每个现有规则的隐含假设。

英文摘要

Reinforcement learning from human feedback (RLHF) is bottlenecked by reward hacking, where the policy exploits errors in a proxy reward model (RM) and produces high RM scores without genuine quality gains. A natural mitigation is pessimism: lowering rewards in regions where the RM is uncertain. However, standard scalar RMs provide no principled notion of uncertainty. We argue that the right object is a distributional reward model $p(r\mid x,y)$. Under either a Bayesian inference or a KL-distributionally robust optimization (KL-DRO) lens, the KL-regularized RLHF objective admits a closed-form effective reward $\tilde r(x,y) = \pm\beta\log\mathbb{E}_p[e^{\pm r/\beta}]$. The pessimistic branch unifies the prior heuristics for RM ensemble aggregation: mean aggregation, worst-case optimization (WCO), and uncertainty-weighted optimization (UWO) all emerge as limits or truncations of this single expression. This also clarifies the implicit assumptions of each existing rule.
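
The pessimistic branch of the effective reward, $-\beta\log\mathbb{E}_p[e^{-r/\beta}]$, can be checked numerically on a small ensemble. The sketch below treats the ensemble members as equally weighted samples from $p(r\mid x,y)$, which is an assumption, and the reward values and function names are made up. The mapping of the limits to mean aggregation, WCO, and a mean-minus-variance rule is the standard reading of these heuristics rather than something the abstract spells out.

```python
import math


def pessimistic_reward(rewards, beta):
    """-beta * log(mean(exp(-r / beta))), computed with a stable log-sum-exp."""
    z = [-r / beta for r in rewards]
    m = max(z)
    log_mean_exp = m + math.log(sum(math.exp(v - m) for v in z) / len(z))
    return -beta * log_mean_exp


def second_order_truncation(rewards, beta):
    """Mean minus variance / (2 * beta): the expansion cut after two cumulants."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return mean - var / (2.0 * beta)


ensemble = [1.0, 2.0, 4.0]  # hypothetical scores from three reward models

print(pessimistic_reward(ensemble, beta=1e6))   # close to the ensemble mean
print(pessimistic_reward(ensemble, beta=1e-3))  # close to the ensemble minimum
print(pessimistic_reward(ensemble, beta=10.0), second_order_truncation(ensemble, 10.0))
```

Large `beta` recovers mean aggregation, small `beta` recovers the worst-case member, and intermediate values are approximated by a variance penalty whose weight is `1 / (2 * beta)`.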

2601.21324 2026-06-12 stat.ML cs.LG 版本更新

Bulk-Calibrated Credal Ambiguity Sets: Fast, Tractable Decision Making under Out-of-Sample Contamination

批量校准的置信模糊集:样本外污染下的快速、可处理决策

Mengqi Chen, Thomas B. Berrett, Theodoros Damoulas, Michele Caprio

发表机构 * University of Bristol(布里斯托大学) University of Cambridge(剑桥大学) University of California, Berkeley(加州大学伯克利分校) University of Oxford(牛津大学)

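AI Summary Proposes bulk-calibrated credal ambiguity sets, which separate contamination inside the bulk from the tail contribution to obtain a closed-form, finite risk objective that reduces to linear or second-order cone programs for efficient robust optimization.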

Comments Accepted for publication (spotlight) at ICML 2026

详情
AI中文摘要

分布鲁棒优化(DRO)在模糊集上最小化最坏情况期望损失,该模糊集可捕捉样本外环境中的分布偏移。虽然Huber(线性-空)污染是$\varepsilon$分数任意扰动的经典最小假设模型,但将其纳入模糊集可能导致最坏情况风险无穷大,且DRO目标变得无意义,除非施加强有界性或支撑假设。我们通过引入批量校准的置信模糊集来解决这些挑战:我们从数据中学习一个高质量批量集,同时考虑批量内的污染,并分别约束剩余尾部贡献。这导致一个闭式、有限的$\mathrm{mean}+\sup$鲁棒目标,以及针对常见损失和批量几何结构的可处理线性或二阶锥规划。通过该框架,我们强调并利用上期望(不精确概率概念)与最坏情况风险之间的等价性,展示IP置信集如何转化为具有可解释容忍水平的DRO目标。在重尾库存控制、地理偏移房价回归和人口偏移文本分类上的实验显示了竞争性的鲁棒性-准确性权衡和高效的优化时间,使用了贝叶斯、频率学派或经验参考分布。

英文摘要

Distributionally robust optimisation (DRO) minimises the worst-case expected loss over an ambiguity set that can capture distributional shifts in out-of-sample environments. While Huber (linear-vacuous) contamination is a classical minimal-assumption model for an $\varepsilon$-fraction of arbitrary perturbations, including it in an ambiguity set can make the worst-case risk infinite and the DRO objective vacuous unless one imposes strong boundedness or support assumptions. We address these challenges by introducing bulk-calibrated credal ambiguity sets: we learn a high-mass bulk set from data while considering contamination inside the bulk and bounding the remaining tail contribution separately. This leads to a closed-form, finite $\mathrm{mean}+\sup$ robust objective and tractable linear or second-order cone programs for common losses and bulk geometries. Through this framework, we highlight and exploit the equivalence between the imprecise probability (IP) notion of upper expectation and the worst-case risk, demonstrating how IP credal sets translate into DRO objectives with interpretable tolerance levels. Experiments on heavy-tailed inventory control, geographically shifted house-price regression, and demographically shifted text classification show competitive robustness-accuracy trade-offs and efficient optimisation times, using Bayesian, frequentist, or empirical reference distributions.
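
The $\mathrm{mean}+\sup$ structure follows from a standard fact: over the Huber set $\{(1-\varepsilon)P + \varepsilon Q\}$ with $Q$ restricted to a bounded set, the worst-case expected loss is $(1-\varepsilon)\,\mathbb{E}_P[\ell] + \varepsilon \sup \ell$. The sketch below illustrates only that core term for a squared loss on a one-dimensional interval. The interval bulk set, the loss, the sample values, and the function name are assumptions, and the separately bounded tail contribution described in the abstract is omitted.

```python
def huber_robust_risk(theta, samples, eps, bulk):
    """(1 - eps) * empirical mean loss + eps * sup of the loss over the bulk.

    Loss is (theta - z)^2. The bulk is an interval (lo, hi); a convex loss
    attains its supremum over an interval at one of the two endpoints.
    """
    lo, hi = bulk
    mean_loss = sum((theta - z) ** 2 for z in samples) / len(samples)
    sup_loss = max((theta - lo) ** 2, (theta - hi) ** 2)
    return (1.0 - eps) * mean_loss + eps * sup_loss


samples = [1.0, 2.0, 3.0]   # made-up reference sample
bulk = (0.0, 4.0)           # made-up bulk interval containing the sample

for eps in (0.0, 0.1, 0.3):
    print(eps, huber_robust_risk(2.0, samples, eps, bulk))
```

Because the supremum is taken over a bounded bulk rather than the whole space, the objective stays finite for any contamination level below one, which is the property the abstract highlights.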

9. Graph Learning and Structured Data (4 papers)

2606.12673 2026-06-12 cs.LG cs.AI 新提交

A Zero-shot Generalized Graph Anomaly Detection Framework via Node Reconstruction

基于节点重构的零样本广义图异常检测框架

Phan Nguyen, Dat Cao, Hien Chu, Khue Hoang

发表机构 * School of Computing, KAIST(韩国科学技术院计算机学院)

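AI Summary Proposes AlignGAD, a zero-shot cross-domain graph anomaly detection framework with a Global Unification Module that aligns heterogeneous features, a Clustering Module that captures group-level abnormal patterns, and a Node Discrepancy Scoring Module that aggregates anomaly evidence across graph views.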

详情
AI中文摘要

跨域图异常检测旨在识别未见过的目标图中的异常节点,在异构图数据的实际应用中展现出巨大潜力。然而,现有方法通常依赖于数据集特定的特征语义和结构模式,限制了其跨域泛化能力。为解决这一挑战,我们提出AlignGAD,一个零样本广义图异常检测框架。我们的框架基于三个关键组件:全局统一模块,用于对齐异构节点特征并在谱域中归一化图信号;聚类模块,用于构建聚类感知的图视图以捕获组级异常模式;以及节点差异评分模块,用于测量重构差异并聚合来自不同图视图的异常证据。在多个真实数据集上的实验证明了AlignGAD在零样本图异常检测设置下的有效性。

英文摘要

Cross-domain graph anomaly detection (GAD) aims to identify abnormal nodes in unseen target graphs, showing strong potential in real-world applications with heterogeneous graph data. However, existing methods often depend on dataset-specific feature semantics and structural patterns, which limits their ability to generalize across different domains. To address this challenge, we propose AlignGAD, a zero-shot generalized graph anomaly detection framework. Our framework is built upon three key components: a Global Unification Module that aligns heterogeneous node features and normalizes graph signals in the spectral domain; a Clustering Module that constructs cluster-aware graph views to capture group-level abnormal patterns; and a Node Discrepancy Scoring Module that measures reconstruction discrepancy and aggregates anomaly evidence from different graph views. Experiments on multiple real-world datasets demonstrate the effectiveness of AlignGAD under the zero-shot GAD setting.

2606.13444 2026-06-12 cs.LG 新提交

Clustering Node Attributed Networks with Graph Neural Networks and Self Learning

使用图神经网络和自学习的节点属性网络聚类

Rodrigo de Sapienza Luna, Daniel Ratton Figueiredo

发表机构 * Systems Engineering and Computer Science (PESC), Federal University of Rio de Janeiro (UFRJ)(里约热内卢联邦大学系统工程与计算机科学系)

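AI Summary Proposes an unsupervised graph clustering framework based on graph neural networks and self learning, which alternates between refining node representations and clustering over multiple rounds and uses a context graph to improve performance, with strong results on synthetic and real data.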

详情
AI中文摘要

图聚类——将图的节点集划分为反映潜在信息的互不相交的子集——是一个基本问题,因为它应用于多种不同的场景。虽然这个经典问题已经被不同社区处理了几十年,但由真实数据驱动的一个最新变体考虑了节点具有信息性属性的场景。这引发了同时利用网络信息(边)和节点信息(属性)设计新型聚类算法的新方法。本文提出了一种新颖的框架,该框架建立在先前将图神经网络(GNN)应用于图聚类的工作之上。所提出的框架在完全无监督的设置下以自学习轮次运行。在每一轮中,GNN生成用于聚类节点的节点表示。这种聚类影响用于生成下一轮节点表示的图。此外,每一轮中使用原始图构建的上下文图用于生成节点表示。实验结果表明,所提出的方法从合成数据中的网络边和节点属性中提取信息,当两者都不太具有信息性时,其性能优于仅关注网络或属性的算法。多轮学习也提高了性能,并且总是优于长时间的单轮训练(即经典的GNN图聚类)。在考虑真实数据集时,实验结果表明,当聚类大小平衡时,所提出的方法与最先进的方法具有竞争力。

英文摘要

Graph clustering, the task of partitioning the node set of a graph into disjoint subsets that reflect some latent information, is a fundamental problem with applications in a myriad of scenarios. While this classic problem has been tackled for decades by different communities, a recent variation driven by real data considers the scenario where nodes have attributes that are also informative. This has triggered novel clustering algorithms that simultaneously leverage network information (edges) and node information (attributes). This work proposes a novel framework that builds on prior work applying graph neural networks (GNNs) to graph clustering. The proposed framework operates in rounds of self learning in a fully unsupervised setting. In each round, a GNN generates node representations that are used to cluster the nodes. This clustering influences the graph used to generate the node representations in the next round. Moreover, a context graph built in each round from the original graph is used to generate the node representations. Empirical results show that the proposed methodology extracts information from both network edges and node attributes in synthetic data, outperforming algorithms focused solely on the network or the attributes when neither is very informative. Multiple rounds of learning also improve performance and always outperform a single long round of training (i.e., classic GNN graph clustering). On real datasets, empirical results indicate that the proposed methodology is competitive with state-of-the-art methods when cluster sizes are balanced.

2606.13671 2026-06-12 cs.LG 新提交

Understanding Truncated Positional Encodings for Graph Neural Networks

理解图神经网络的截断位置编码

James Flora, Mitchell Black, Weng-Keen Wong, Amir Nayyeri

发表机构 * University of California, Berkeley(加州大学伯克利分校)

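AI Summary Studies how truncating positional encodings (for example, to the first k eigenspaces or powers of the adjacency matrix) affects the expressive power of graph neural networks. Proves that truncated PE families differ fundamentally in expressive power and that truncated spectral PEs are no longer stronger than the 1-WL test, and shows experimentally that a mix of truncated PEs beats any single family.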

Comments 28 pages, 4 figures, ICML 2026

详情
AI中文摘要

位置编码(PEs)在理论和经验上增强了图神经网络(GNNs)的能力。两个最流行的PE家族——谱(例如,拉普拉斯特征空间、有效电阻)和基于游走的(邻接矩阵的多项式)——在表达能力上理论等价,其表达性介于1-WL和3-WL测试之间。然而,这种等价性假设GNN使用这些PE的“完整”版本,这需要$O(n^3)$的时间和空间复杂度。相反,从业者通常使用这些编码的截断变体,例如前$k$个特征空间或邻接矩阵的幂。然而,这些截断PE的理论性质尚不清楚。在这项工作中,我们启动了对这些截断PE的研究。理论上,我们表明,在截断下,几个PE家族在表达能力上存在根本差异。作为推论,我们证明截断谱PE不再强于1-WL测试。我们还研究了一个谱PE家族——$k$-调和距离——以突出即使密切相关的截断PE在表达能力上的差异。最后,我们通过实验表明,在真实世界数据集上,混合截断PE优于任何单一家族。

英文摘要

Positional encodings (PEs) enhance the power of graph neural networks (GNNs), both theoretically and empirically. Two of the most popular families of PEs, spectral (e.g., Laplacian eigenspaces, effective resistance) and walk-based (polynomials of the adjacency matrix), are theoretically equivalent in expressive power, with expressivity between the 1-WL and 3-WL tests. However, this equivalence assumes the GNN uses the "complete" version of these PEs, which requires $O(n^3)$ time and space complexity. In practice, truncated variants of these encodings are commonly used instead, such as the first $k$ eigenspaces or powers of the adjacency matrix, yet the theoretical properties of these truncated PEs are unknown. In this work, we initiate the study of these truncated PEs. Theoretically, we show that, under truncation, several families of PEs are fundamentally different in expressive power. As a corollary, we show that truncated spectral PEs are no longer stronger than the 1-WL test. We also study a family of spectral PEs, the $k$-harmonic distances, to highlight the differences in expressive power of even closely related truncated PEs. Finally, we experimentally show that a mix of truncated PEs is preferable to any single family on real-world datasets.
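
A truncated walk-based encoding can be sketched in a few lines. The variant below (closed-walk counts, that is, the diagonals of the first `k` adjacency powers) is one common choice and is an assumption here, as are the helper names. The two toy graphs are a 6-cycle and two disjoint triangles, a classic pair that the 1-WL test cannot separate, and the example is only meant to show that the truncation depth changes what the encoding can distinguish. It does not reproduce any theorem from the paper.

```python
def from_edges(n, edges):
    """Dense adjacency matrix of an undirected graph on n nodes."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = 1
        adj[v][u] = 1
    return adj


def matmul(a, b):
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def truncated_walk_pe(adj, k):
    """Per-node list [(A^1)_vv, ..., (A^k)_vv] of closed-walk counts."""
    n = len(adj)
    power = [row[:] for row in adj]
    pe = [[] for _ in range(n)]
    for _ in range(k):
        for v in range(n):
            pe[v].append(power[v][v])
        power = matmul(power, adj)
    return pe


cycle6 = from_edges(6, [(i, (i + 1) % 6) for i in range(6)])
triangles = from_edges(6, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])

print(truncated_walk_pe(cycle6, 2) == truncated_walk_pe(triangles, 2))
print(truncated_walk_pe(cycle6, 3) == truncated_walk_pe(triangles, 3))
```

With `k = 2` every node in both graphs gets the same encoding, and the third power is the first one that sees the triangles. Each extra power costs another dense matrix product in this naive version, which is the trade-off that motivates truncation.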

2510.16311 2026-06-12 cs.LG 版本更新

Toward General Digraph Contrastive Learning: A Dual Spatial Perspective

面向一般有向图对比学习:双空间视角

Zhengyu Wu, Daohan Su, Yang Zhang, Xunkai Li, Rong-Hua Li, Guoren Wang

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

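AI Summary Proposes S2-DiGCL, a digraph contrastive learning framework with dual complex-domain and real-domain spatial views, combining adaptive modulation of the magnetic Laplacian with path-based subgraph augmentation, and reports improvements of 4.41% in node classification and 4.34% in link prediction.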

详情
AI中文摘要

图对比学习(GCL)已成为一种从图中提取一致表示而无需标签信息的强大工具。然而,现有方法主要关注无向图,忽略了在实际网络(如社交网络和推荐系统)中基础且不可或缺的关键方向信息。本文提出了S2-DiGCL,一种新颖的框架,强调从复杂域和实数域视角对有向图进行对比学习的空间洞察。从复数域视角,S2-DiGCL在磁拉普拉斯中引入个性化扰动,以自适应地调制边相位和方向语义。从实数域视角,它采用基于路径的子图增强策略,捕捉细粒度的局部不对称性和拓扑依赖性。通过联合利用这两个互补的空间视图,S2-DiGCL构建了高质量的正负样本,从而实现更通用和鲁棒的有向图对比学习。在7个真实有向图数据集上的大量实验证明了我们方法的优越性,在监督和无监督设置下,节点分类和链接预测分别实现了4.41%和4.34%的性能提升,达到了最先进水平。

英文摘要

Graph Contrastive Learning (GCL) has emerged as a powerful tool for extracting consistent representations from graphs without relying on label information. However, existing methods predominantly focus on undirected graphs, disregarding the directional information that is fundamental in real-world networks (e.g., social networks and recommendation systems). In this paper, we introduce S2-DiGCL, a novel framework that emphasizes spatial insights from complex-domain and real-domain perspectives for directed graph (digraph) contrastive learning. From the complex-domain perspective, S2-DiGCL introduces personalized perturbations into the magnetic Laplacian to adaptively modulate edge phases and directional semantics. From the real-domain perspective, it employs a path-based subgraph augmentation strategy to capture fine-grained local asymmetries and topological dependencies. By jointly leveraging these two complementary spatial views, S2-DiGCL constructs high-quality positive and negative samples, leading to more general and robust digraph contrastive learning. Extensive experiments on 7 real-world digraph datasets demonstrate the superiority of our approach, achieving SOTA performance with 4.41% improvement in node classification and 4.34% in link prediction under both supervised and unsupervised settings.
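
The magnetic Laplacian that the complex-domain view builds on can be written down directly. The sketch below uses the commonly used unnormalized definition $L = D_s - A_s \odot \exp(i\,2\pi q\,(A - A^\top))$ with $A_s = (A + A^\top)/2$. The function name and the default `q = 0.25` are assumptions, and the paper's personalized perturbations are not modeled.

```python
import cmath
import math


def magnetic_laplacian(adj, q=0.25):
    """Unnormalized magnetic Laplacian of a directed graph (dense, complex).

    Edge direction is stored in the phase of each off-diagonal entry, while
    the magnitude comes from the symmetrized adjacency matrix.
    """
    n = len(adj)
    lap = [[0j] * n for _ in range(n)]
    for u in range(n):
        degree = 0.0
        for v in range(n):
            sym = 0.5 * (adj[u][v] + adj[v][u])
            degree += sym
            phase = cmath.exp(1j * 2.0 * math.pi * q * (adj[u][v] - adj[v][u]))
            lap[u][v] -= sym * phase
        lap[u][u] += degree
    return lap


# Directed 3-cycle: 0 -> 1 -> 2 -> 0.
directed_cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
L = magnetic_laplacian(directed_cycle)
print(L[0][1], L[1][0])
```

The matrix is Hermitian, so its eigenvalues are real even though the graph is directed, and reversing an edge flips the sign of the phase rather than changing the magnitude. Perturbing `q` or the phases is therefore a natural place to inject direction-aware augmentations.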

10. Transfer, Meta-Learning, and Continual Learning (7 papers)

2606.12680 2026-06-12 cs.LG stat.ML 新提交

How Useful is Causal Invariance for Domain Adaptation in Finite-Sample Settings?

因果不变性在有限样本设置中对领域适应有多大用处?

Julia Kostin, Kasra Jalaldoust, Elias Bareinboim, Samory Kpotufe, Fanny Yang

发表机构 * Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系) Causal Artificial Intelligence Lab, Columbia University(哥伦比亚大学因果人工智能实验室) Department of Statistics, Columbia University(哥伦比亚大学统计系)

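AI Summary Studies how causal invariance can improve supervised domain adaptation in linear regression, deriving matching upper and lower bounds in terms of the target-risk margins between candidate predictors and the finite-sample estimation error, and showing that adaptive aggregation avoids negative transfer when the margins are large enough.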

详情
AI中文摘要

机器学习模型在部署到与训练源分布不同的目标分布时,性能往往会下降。最近基于因果的领域泛化工作表明,领域间的共享因果结构可以诱导不变预测器,例如在结构化领域偏移下具有稳定风险的某些特征子集上的模型。然而,这种总体水平的因果不变性在有限样本设置中能带来多大收益仍未充分探索。特别是,在实践中我们通常只能获得少量带标签的目标样本,这种设置称为监督领域适应(sDA)。本文探讨何时(完全或部分)因果知识能够可证明地改进监督领域适应。作为第一步,我们研究线性回归,其中完全或部分因果知识指定了一组不变或可能不变的特征子集,每个子集产生一个源训练候选预测器。我们推导了匹配的上界和下界,表明有限样本收益由候选预测器之间的目标风险边界以及有限源估计误差共同决定。当这些边界相对于$n_Q$足够大时,自适应聚合过程可以匹配最佳候选预测器,同时避免相对于仅使用目标样本学习的负迁移。另一方面,当边界过小时,没有算法能够可靠地利用候选集合获得更快的有限样本速率。我们进一步将这些边界与线性SCM中的结构偏移幅度联系起来,并在真实世界的因果基准上验证了理论。

英文摘要

Machine learning models often degrade when they are deployed on a target distribution that differs from the source distributions they were trained on. Recent work in causality-based domain generalization has shown how shared causal structure between domains can induce invariant predictors, e.g., models on a subset of features which have stable risk across structured domain shifts. However, the extent to which such population-level causal invariances can lead to gains in finite-sample settings remains underexplored. In particular, in practice we often have access to a few labeled target samples, a setting called supervised domain adaptation (sDA). In this paper, we explore when (full or partial) causal knowledge can provably improve supervised domain adaptation. As a first step, we study linear regression, where full or partial causal knowledge specifies a collection of invariant or possibly invariant feature subsets, each yielding a source-trained candidate predictor. We derive matching upper and lower bounds showing that finite-sample gains are governed by the target-risk margins separating the candidates, together with the finite-source estimation error. When these margins are sufficiently large relative to $n_Q$, an adaptive aggregation procedure can match the best candidate predictor while avoiding negative transfer relative to target-only learning. On the other hand, when the margins are too small, no algorithm can reliably exploit the candidate collection to obtain faster finite-sample rates. We further connect these margins to structural shift magnitude in linear SCMs and validate the theory on real-world causal benchmarks.

2606.13637 2026-06-12 cs.LG 新提交

The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning

稳定恢复流形:持续学习中可恢复性的几何原理

Ayushman Trivedi, Bhavika Melwani

AI总结 通过分析Split CIFAR-100上ResNet-18的顺序学习,发现遗忘知识在表示重组后仍可紧凑解码,提出稳定恢复流形假说,表明灾难性遗忘主要是可访问性和流形对齐问题。

Comments 9 pages, 8 figures, 8 tables

详情
AI中文摘要

灾难性遗忘通常被视为顺序学习过程中先前学习知识的破坏。基于可访问性崩溃框架,我们研究了持续学习中可恢复性的几何结构。使用Split CIFAR-100和顺序训练的ResNet-18,我们分析了十个任务上的可恢复性、表示漂移和恢复复杂度。我们引入了恢复子空间维度(k_t),即保持完整探针性能90%所需的最小奇异方向数量。与我们的可恢复性扩散假说相反,尽管存在显著的表示漂移,恢复维度在整个训练过程中保持稳定(平均k_t = 8.0)。主角度漂移强烈预测可恢复性(r = -0.862),一个简单的几何模型解释了82.2%的可恢复性方差。这些发现支持稳定恢复流形假说,表明遗忘的知识在表示重组后仍可紧凑解码。结果表明,灾难性遗忘主要是一个可访问性和流形对齐问题,而非信息破坏。

英文摘要

Catastrophic forgetting is often viewed as the destruction of previously learned knowledge during sequential learning. Building on the Accessibility Collapse framework, we investigate the geometric structure of recoverability in continual learning. Using Split CIFAR-100 and a sequentially trained ResNet-18, we analyze recoverability, representational drift, and recovery complexity across ten tasks. We introduce Recovery Subspace Dimensionality (k_t), a measure of the minimum number of singular directions required to preserve 90 percent of full probe performance. Contrary to our Recoverability Diffusion hypothesis, recovery dimensionality remains stable throughout training (mean k_t = 8.0) despite substantial representational drift. Principal-angle drift strongly predicts recoverability (r = -0.862), and a simple geometric model explains 82.2 percent of recoverability variance. These findings support the Stable Recovery Manifold hypothesis, suggesting that forgotten knowledge remains compactly decodable despite representational reorganization. The results indicate that catastrophic forgetting is primarily an accessibility and manifold-alignment problem rather than information destruction.

2606.12633 2026-06-12 cs.CV cs.LG 交叉投稿

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

ECA:面向开放图像到文本生成的高效持续对齐

Jiangtao Kong, Peijun Zhao, Chun-Fu Chen, Youngwook Do, Shaohan Hu, Tianyi Zhou, Huajie Shao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ECA方法,通过混合查询模块、Fisher动态扩展和字典重放,实现无需旧数据的持续对齐,缓解灾难性遗忘,提升开放图像到文本生成的增量学习性能。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

开放图像到文本生成(OpenITG)的增量学习(IL)使模型能够持续为新的图像生成准确、上下文相关的文本,同时保留先前获得的知识。与先前研究不同,本文处理了一个更实际的场景,其中视觉数据的主要类别随时间推移而演变。在此背景下,我们引入了持续对齐的新概念,它逐步调整预训练VLM中的对齐模块,以保持高质量的跨模态表示。基于这一思想,我们提出了高效持续对齐(ECA),一种用于OpenITG的无样本IL方法。关键挑战是使模型能够获取新的任务特定特征,同时最小化对已建立对齐的干扰,且无需访问先前任务的原始数据。为此,ECA采用了三种核心机制:混合查询(MoQ)模块,用于适应任务特定的查询令牌;Fisher动态扩展(FeDEx),基于Fisher信息矩阵(FIM)度量动态扩展模型结构;以及带有字典重放(DR)的嵌入字典,以保留过去的知识。为了评估ECA的性能,我们构建了四个新的IL OpenITG基准,更好地反映了现实场景。实验结果表明,与基线方法相比,ECA显著缓解了灾难性遗忘并提高了IL性能。代码和基准可在该https URL获取。

英文摘要

Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at https://github.com/Snowball0823/ECA.

2606.12925 2026-06-12 cs.CV cs.LG 交叉投稿

Multi-Label Test-Time Adaptation with Bayesian Conditional Priors

基于贝叶斯条件先验的多标签测试时自适应

Qiru Li, Ao Zhou, Zhiwei Jiang, Zifeng Cheng, Cong Wang, Yafeng Yin, Qing Gu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出贝叶斯条件先验估计(BCP),一种无梯度的测试时自适应方法,通过在线估计锚定条件先验注入标签依赖性,提升冻结视觉语言模型在多标签识别中的分布偏移鲁棒性。

Comments accepted by ICML2026

详情
AI中文摘要

多标签识别中,冻结的视觉语言模型(VLM)在分布偏移下表现脆弱:标准零样本推理独立评分每个标签,忽略共现结构,产生不连贯的标签集,其中主导概念抑制较弱但兼容的标签。我们引入贝叶斯条件先验(BCP)估计,一种无梯度的测试时自适应方法,在不调整主干网络的情况下注入标签依赖性。BCP将零样本logits视为在固定图像-文本似然下的边缘后验代理,并将偏移引起的误差主要归因于不匹配的标签先验。对于每个测试图像,它选择一个高置信度的锚定标签,并应用锚定条件的贝叶斯精炼。该更新在logit空间中是闭式的,并具有点互信息(PMI)解释,明确促进兼容标签并抑制不兼容标签。BCP通过从无标签测试流中在线估计锚定条件先验(使用轻量级二阶共现统计)来运行,无需目标标注,且仅增加单个前向传递之外的微不足道的开销。在标准多标签基准和多个CLIP主干网络上,BCP持续优于强TTA基线,例如将RN50的平均mAP从57.31提升至69.22,ViT-B/16从62.61提升至71.79。

英文摘要

Multi-label recognition with frozen Vision-Language Models (VLMs) is brittle under distribution shift: standard zero-shot inference scores labels independently, ignoring co-occurrence structure and producing incoherent label sets where dominant concepts suppress weaker but compatible labels. We introduce Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that injects label dependency without tuning the backbone. BCP views zero-shot logits as a proxy for marginal posteriors under a fixed image-text likelihood and attributes shift-induced errors mainly to a mismatched label prior. For each test image, it selects a high-confidence anchor label and applies an anchor-conditioned Bayesian refinement. This update is closed-form in logit space and admits a pointwise mutual information (PMI) interpretation, explicitly promoting compatible labels and suppressing incompatible ones. BCP operates without target annotations by estimating anchor-conditioned priors online from the unlabeled test stream via lightweight second-order co-occurrence statistics, adding negligible overhead beyond a single forward pass. Across standard multi-label benchmarks and multiple CLIP backbones, BCP consistently outperforms strong TTA baselines, e.g., improving RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79.
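
The anchor-conditioned refinement with a PMI interpretation can be illustrated with a toy sketch. This is not the authors' implementation: the function name `pmi_refine`, the smoothing constant `eps`, the co-occurrence counts, and the rule of adding log P(j | anchor) - log P(j) to each non-anchor logit are all assumptions made for illustration.

```python
import math

def pmi_refine(logits, cooc, counts, n_images, eps=1.0):
    """Add a PMI(label, anchor) bonus to each non-anchor logit (illustrative assumption)."""
    # Anchor = the highest-scoring label for this image.
    anchor = max(range(len(logits)), key=lambda j: logits[j])
    refined = []
    for j, z in enumerate(logits):
        if j == anchor:
            refined.append(z)
            continue
        # Smoothed marginal P(j) and conditional P(j | anchor) from running counts.
        p_j = (counts[j] + eps) / (n_images + 2 * eps)
        p_j_given_a = (cooc[anchor][j] + eps) / (counts[anchor] + 2 * eps)
        refined.append(z + math.log(p_j_given_a / p_j))
    return anchor, refined

# Hypothetical statistics over 100 images for labels [dog, leash, airplane].
counts = [50, 30, 20]
cooc = [[50, 28, 0],
        [28, 30, 0],
        [0, 0, 20]]
anchor, refined = pmi_refine([4.0, 1.0, 1.0], cooc, counts, n_images=100)
print(anchor, refined)
```

In this made-up example the label that frequently co-occurs with the anchor is promoted and the label that never co-occurs with it is suppressed, which is the qualitative behaviour the abstract describes.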

2507.05019 2026-06-12 cs.LG cs.AI 版本更新

Meta-Learning Transformers to Improve In-Context Generalization

元学习变换器以改进上下文泛化

Lorenzo Braccaioli, Anna Vettoruzzo, Prabhant Singh, Joaquin Vanschoren, Mohamed-Rafik Bouguelia, Nicola Conci

发表机构 * University of Trento, Italy(特伦托大学,意大利) Eindhoven University, Netherlands(埃因霍温大学,荷兰) University of Doha for Science and Technology, Qatar(多哈科学与技术大学,卡塔尔)

AI总结 提出利用多个小规模领域特定数据集训练上下文学习器,通过元学习提升跨领域泛化能力,并在持续学习和无监督场景下验证其鲁棒性。

详情
AI中文摘要

上下文学习使变换器模型能够仅基于输入提示泛化到新任务,无需任何权重更新。然而,现有的训练范式通常依赖于大型非结构化数据集,这些数据集存储成本高,难以评估质量和平衡性,并且由于包含敏感信息而引发隐私和伦理问题。受这些局限性和风险的启发,我们提出了一种替代训练策略,利用多个小规模、领域特定的数据集集合。我们经验性地证明,此类数据质量的提高和多样性的增加提升了上下文学习器在其训练领域之外的泛化能力,同时与在单个大规模数据集上训练的模型相比,性能相当。我们通过利用元学习在Meta-Album集合上训练上下文学习器来研究这一范式,在多种设置下进行实验。首先,我们在受控环境中展示性能,其中测试领域完全排除在训练知识之外。其次,我们探索这些模型在信息可访问时间有限的持续场景中对遗忘的鲁棒性。最后,我们探索更具挑战性的无监督场景。我们的发现表明,当在精心策划的数据集集合上训练时,变换器仍然能够泛化用于上下文预测,同时在模块化和可替换性方面提供了优势。

英文摘要

In-context learning enables transformer models to generalize to new tasks based solely on input prompts, without any need for weight updates. However, existing training paradigms typically rely on large, unstructured datasets that are costly to store, difficult to evaluate for quality and balance, and pose privacy and ethical concerns due to the inclusion of sensitive information. Motivated by these limitations and risks, we propose an alternative training strategy where we leverage a collection of multiple, small-scale, and domain-specific datasets. We empirically demonstrate that the increased quality and diversity of such data improve the generalization abilities of in-context learners beyond their training domain, while achieving comparable performance with models trained on a single large-scale dataset. We investigate this paradigm by leveraging meta-learning to train an in-context learner on the Meta-Album collection under several settings. Firstly, we show the performance in a controlled environment, where the test domain is completely excluded from the training knowledge. Secondly, we explore the robustness of these models to forgetting in a continual scenario where the information is accessible for a limited time. Finally, we explore the more challenging unsupervised scenario. Our findings demonstrate that transformers still generalize for in-context prediction when trained on a curated dataset collection while offering advantages in modularity and replaceability.

2602.12753 2026-06-12 cs.LG 版本更新

Hierarchical Successor Representation for Robust Transfer

层次化后继表示用于鲁棒迁移

Changmin Yu, Máté Lengyel

发表机构 * University of Cambridge(剑桥大学) DeepMind(深度思维)

AI总结 提出层次化后继表示(HSR),通过时间抽象构建鲁棒的状态特征,结合非负矩阵分解实现稀疏低秩表示,支持多隔间环境下的高效任务迁移与探索。

详情
AI中文摘要

后继表示(SR)为将预测动态与奖励解耦提供了强大框架,能够实现跨奖励配置的快速泛化。然而,经典SR受其固有的策略依赖性限制:由于持续学习、环境非平稳性和任务需求变化,策略会发生变化,使得已建立的预测表示过时。此外,在拓扑复杂的环境中,SR遭受谱扩散,导致特征密集重叠且扩展性差。本文提出层次化后继表示(HSR)以克服这些限制。通过将时间抽象纳入预测表示的构建,HSR学习到对任务引起的策略变化鲁棒的稳定状态特征。将非负矩阵分解(NMF)应用于HSR,得到稀疏低秩的状态表示,有助于在多隔间环境中实现向新任务的高样本效率迁移。进一步分析表明,HSR-NMF发现了可解释的拓扑结构,提供了策略无关的层次化地图,有效桥接了无模型最优性和基于模型的灵活性。除了为任务迁移提供有用基础外,我们还展示了HSR的时间扩展预测结构也可用于驱动高效探索,有效扩展到大规模程序生成的环境。

英文摘要

The successor representation (SR) provides a powerful framework for decoupling predictive dynamics from rewards, enabling rapid generalisation across reward configurations. However, the classical SR is limited by its inherent policy dependence: policies change due to ongoing learning, environmental non-stationarities, and changes in task demands, making established predictive representations obsolete. Furthermore, in topologically complex environments, SRs suffer from spectral diffusion, leading to dense and overlapping features that scale poorly. Here we propose the Hierarchical Successor Representation (HSR) for overcoming these limitations. By incorporating temporal abstractions into the construction of predictive representations, HSR learns stable state features which are robust to task-induced policy changes. Applying non-negative matrix factorisation (NMF) to the HSR yields a sparse, low-rank state representation that facilitates highly sample-efficient transfer to novel tasks in multi-compartmental environments. Further analysis reveals that HSR-NMF discovers interpretable topological structures, providing a policy-agnostic hierarchical map that effectively bridges model-free optimality and model-based flexibility. Beyond providing a useful basis for task-transfer, we show that HSR's temporally extended predictive structure can also be leveraged to drive efficient exploration, effectively scaling to large, procedurally generated environments.
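
As background for the decoupling of predictive dynamics from rewards, here is a minimal sketch of the classical (non-hierarchical) SR only, not of HSR. It assumes a tabular setting with a known policy-induced transition matrix `P`; the chain, the discount factor, and the function names are hypothetical.

```python
def successor_representation(P, gamma=0.9, iters=500):
    """Iterate M <- I + gamma * P @ M, which converges to (I - gamma * P)^-1."""
    n = len(P)
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        M = [[(1.0 if i == j else 0.0)
              + gamma * sum(P[i][k] * M[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return M

def values(M, r):
    """State values for reward vector r: V = M @ r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

# Made-up 3-state chain under a fixed policy; state 2 is absorbing.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]
M = successor_representation(P)

# The same M is reused for two different reward configurations.
v_a = values(M, [0.0, 0.0, 1.0])
v_b = values(M, [1.0, 0.0, 0.0])
print(v_a, v_b)
```

Changing the reward vector only requires a new matrix-vector product, whereas changing the policy changes `P` and therefore `M`, which is the policy dependence the abstract identifies as a limitation.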

2603.15158 2026-06-12 cs.LG 版本更新

Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies

在不完美代理下潜在偏移中鲁棒预测器的点识别

Zahra Rahiminasab, Reza Soumi, Arto Klami, Samuel Kaski

发表机构 * Department of Computer Science, Aalto University(阿尔托大学计算机科学系) Department of Computer Science, University of Helsinki(赫尔辛基大学计算机科学系) ELLIS Institute Finland(芬兰埃利斯研究所) Department of Computer Science, Manchester University(曼彻斯特大学计算机科学系)

AI总结 针对潜在混淆变量导致的域适应问题,提出基于潜在等价类的点识别方法,通过跨域秩条件替代强完备性假设,并设计主动学习框架PQAL实现鲁棒预测。

详情
AI中文摘要

当跨域的分布偏移源于同时影响协变量和结果的潜在混淆变量时,域适应问题变得更加具有挑战性。现有的基于代理的方法通过强完备性假设来唯一确定(点识别)鲁棒预测器。完备性要求代理具有关于潜在混淆变量变化的足够信息。对于不完美代理,从混淆变量到代理分布空间的映射是非单射的,多个潜在混淆变量值可能生成相同的代理分布。这破坏了完备性假设,观测数据与多个潜在预测器(集识别)一致。为了解决这个问题,我们引入了潜在等价类(LECs)。LECs定义为诱导相同条件代理分布的潜在混淆变量组。我们证明,只要多个域在如何混合代理诱导的LECs以形成鲁棒预测器方面有足够差异,鲁棒预测器的点识别仍然可以实现。这种域多样性条件被形式化为混合权重的跨域秩条件,该条件比完备性假设弱得多。我们提出了近端准贝叶斯主动学习(PQAL)框架,该框架主动查询满足该秩条件的小型、有针对性的多样化域集合。PQAL可以恢复点识别的预测器,展示了对不同程度偏移的鲁棒性,并在合成数据、半合成dSprites、IHDP、ACS Folktables数据集上优于先前方法。

英文摘要

Addressing the domain adaptation problem becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches that address latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that proxies have sufficient information about variations in latent confounders. For imperfect proxies, the mapping from confounders to the space of proxy distributions is non-injective, and multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption, and the observed data are then consistent with multiple potential predictors (the predictor is only set-identified). To address this, we introduce latent equivalent classes (LECs). LECs are defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point-identification for the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active learning (PQAL) framework, which actively queries a small, targeted set of diverse domains that satisfy this rank condition. PQAL can recover the point-identified predictor, demonstrates robustness to varying degrees of shift, and outperforms previous methods on synthetic data and the semi-synthetic dSprites, IHDP, and ACS Folktables datasets.

11. 数据集、基准与评测 39 篇

2606.12483 2026-06-12 cs.LG 新提交

Scalable anomaly detection via a univariate Christoffel function

通过单变量Christoffel函数实现可扩展的异常检测

Florian Grivet, Didier Henrion, Jean-Bernard Lasserre, Louise Travé-Massuyès

AI总结 针对Christoffel函数方法因矩阵大小随维度指数增长而难以应用于高维数据的问题,提出基于查询点与支撑点间平方距离的单变量Christoffel函数(UCF),在ADBench基准上平均精度优于14种基线方法。

详情
AI中文摘要

异常检测在欺诈检测、网络入侵和系统故障诊断等领域识别异常模式中发挥关键作用。近年来,基于Christoffel函数的方法(根植于多项式优化)因其坚实的数学基础和计算节俭性,成为深度学习的有前景替代方案。然而,其实用性受限于需要求逆一个大小随数据维度指数增长的矩阵,即使对于中等维度数据集也难以处理。本文解决了Christoffel函数异常检测的维度限制,同时保留了其关键理论性质,即开关支撑二分法行为和准确的支撑形状捕获。我们引入了UCF,一种基于查询点与支撑点间平方距离的单变量Christoffel函数。在ADBench基准上的大量实验表明,UCF在平均精度上持续优于14个最先进的基线方法。通过解决Christoffel函数的可扩展性瓶颈,本文扩展了异常检测方法的工具箱,提供了一种稳健、有理论依据且普遍适用的方法。

英文摘要

Anomaly detection plays a critical role in identifying unusual patterns across domains such as fraud detection, network intrusion, and system fault diagnosis. Recently, Christoffel function-based methods, rooted in polynomial optimization, have emerged as promising alternatives to deep learning due to their strong mathematical foundations and computational frugality. However, their practical applicability is hindered by the need to invert a matrix whose size grows exponentially with the data dimension, rendering the method intractable even for moderate-dimensional datasets. This paper addresses the dimensionality limitations of Christoffel function-based anomaly detection while preserving its key theoretical properties, i.e., the on-off support dichotomy behavior and the accurate support shape capture. We introduce UCF, a univariate Christoffel function that is based on the squared distance between the query point and the support points. Extensive experiments on the ADBench benchmark demonstrate that UCF consistently outperforms 14 state-of-the-art baselines in terms of Average Precision. By resolving the scalability bottleneck of the Christoffel function, this work expands the toolkit of anomaly detection methods with a robust, theoretically grounded, and universally applicable approach.

2606.12552 2026-06-12 cs.LG 新提交

Crossing the Validation Crisis: Cross-Validation Reduces Benchmarking Variance Surprisingly Well

跨越验证危机:交叉验证出人意料地有效降低基准测试方差

Célestin Eve, Gaël Varoquaux, Thomas Moreau

发表机构 * MIND Team, Université Paris-Saclay, Inria, CEA, Palaiseau, France(MIND团队,巴黎-萨克雷大学,法国国家信息与自动化研究所,法国原子能委员会,帕莱索,法国) SODA Team, Inria, Palaiseau, France(SODA团队,法国国家信息与自动化研究所,帕莱索,法国) Probabl

AI总结 本文提出交叉验证通过样本增益概念量化虚拟数据增强,显著提升算法性能评估的置信度与稳定性,并引入动态早停机制减少计算开销。

Comments 34 pages, 11 figures

详情
AI中文摘要

现代机器学习通过实证工作推进,对新方法进行基准测试以评估相对性能。然而,评估固有的统计变异性——由于许多算法的随机性而加剧——常常因有限的测试样本而使性能估计不可靠,导致验证危机,其中真正的进步难以辨别。在这项工作中,我们展示了交叉验证在评估和比较学习算法性能时显著提高了置信度。我们引入了样本增益的概念,它量化了通过使用多个交叉验证分割来减少基准测试方差所实现的虚拟数据增强。在合成和真实世界数据集(组织病理学扫描和NLP微调)上的实验表明,多个分割可以显著提高性能估计的可靠性和稳定性,且收益递减往往比预期来得更晚。我们还引入了一种动态早停交叉验证的程序,通过从最初几个折叠估计后续折叠是否会带来大的样本增益。我们的发现强调了在可用样本上推行交叉验证以实现稳健可靠基准测试的价值。

英文摘要

Modern machine learning progresses through empirical work, benchmarking new methods to evaluate relative performance. However, the statistical variability inherent to evaluation - exacerbated by the stochastic nature of many algorithms - often makes performance estimation unreliable due to the limited test samples available, leading to a validation crisis in which genuine advances are difficult to discern. In this work, we show that cross-validation markedly improves confidence when evaluating and comparing the performance of learning algorithms. We introduce the concept of sample gain, which quantifies the virtual data augmentation achieved by using multiple cross-validation splits to reduce benchmarking variance. Experiments on both synthetic and real-world datasets (histopathologic scans and NLP fine-tuning) demonstrate that multiple splits can substantially improve the reliability and stability of performance estimates, with diminishing returns often setting in later than expected. We also introduce a procedure to dynamically early-stop cross-validation by estimating from the first few folds whether subsequent folds will bring large sample gains. Our findings highlight the value of pushing cross-validation on available samples to achieve robust and reliable benchmarking.

2606.12595 2026-06-12 cs.LG cs.AI cs.CV 新提交

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

地理空间多模态基础模型的新兴灵活设计

Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu, Xiao Wang, Dalton Lunga

发表机构 * Oak Ridge National Laboratory(橡树岭国家实验室)

AI总结 本文系统比较了不同架构的地理空间基础模型,在统一设置下评估其灵活性与性能,为多模态推理提供设计指导。

详情
AI中文摘要

基础模型通过跨多样未标记地理空间模态的可扩展预训练,正在迅速改变地球观测。然而,其架构多样性——从编码器-only到编码器-解码器以及掩码自编码范式——使得以一致方式评估性能权衡变得具有挑战性。在这项工作中,我们对领先的、专为地理空间多模态推理设计的基础模型架构进行了同类比较,特别关注不同光谱波段配置下的灵活性。我们使用相同的自监督学习目标和训练数据集标准化预训练,并在GEOBench基准测试上,在一致参数化下评估所有模型的分类和分割任务。我们的结果为模型灵活性、模态对齐和下游任务性能之间的设计权衡提供了新见解。通过强调受控条件下的架构优势和局限性,本研究为构建能够进行鲁棒多模态推理的下一代地理空间基础模型提供了实用指导。

英文摘要

Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading foundation model (FM) architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

2606.12611 2026-06-12 cs.LG cs.IT math.IT 新提交

Evaluation of AutoML Frameworks for IDS under Imbalanced Data Conditions of the NSL-KDD Dataset

NSL-KDD数据集不平衡数据条件下IDS的AutoML框架评估

Wiliane Carolina Silva, Evandro César Vilas Boas, Felipe A. P. de Figueiredo

发表机构 * Cybersecurity and Artificial Intelligence Laboratory (CS&I Lab), National Institute of Telecommunications (Inatel)(网络安全与人工智能实验室(CS&I Lab),国家电信研究所(Inatel)) Wireless and Artificial Intelligence Laboratory (WAI Lab), National Institute of Telecommunications (Inatel)(无线与人工智能实验室(WAI Lab),国家电信研究所(Inatel))

AI总结 研究NSL-KDD数据集上严重类别不平衡对多分类入侵检测中AutoML框架性能的影响,发现集成学习和不平衡感知优化可提升少数类检测能力,PyCaret表现最佳(macro-F1 66%)。

详情
AI中文摘要

本研究探讨了严重类别不平衡对使用NSL-KDD数据集进行多分类网络入侵检测的自动化机器学习(AutoML)框架性能的影响。与以往通过二分类或移除少数类来简化问题的研究不同,我们保留了原始的五类分布,包括高度欠表示的R2L和U2R攻击,从而能够对不平衡敏感的学习行为进行现实评估。在统一且可重复的实验协议下,分析了九个开源AutoML框架,考虑了架构设计、集成策略、验证程序、超参数优化和不平衡处理机制的差异。结果表明,采用集成学习和不平衡感知优化的框架在少数类判别上表现更好。PyCaret获得了最佳整体性能,macro-F1达到66%,其次是AutoGluon(55%),而缺乏原生平衡支持的框架在少数类检测能力上显著下降。进一步分析表明,仅以准确率为导向的优化不足以应对高度不平衡的入侵检测场景,因为高加权指标可能与对罕见攻击类别的泛化能力差共存。作为贡献,本研究为严重多类不平衡下的AutoML入侵检测建立了标准化基准,指出了当前架构的局限性,以及将不平衡感知优化、重采样和分层评估策略原生集成到自动化学习流水线中的必要性。源代码已公开。

英文摘要

This work investigates the impact of severe class imbalance on the performance of automated machine learning (AutoML) frameworks for multiclass network intrusion detection using the NSL-KDD dataset. Unlike previous studies that simplify the problem through binary classification or minority-class removal, we preserve the original five-class distribution, including highly underrepresented attacks such as R2L and U2R, enabling a realistic evaluation of imbalance-sensitive learning behavior. Nine open-source AutoML frameworks were analyzed under a unified and reproducible experimental protocol, considering differences in architectural design, ensemble strategies, validation procedures, hyperparameter optimization, and imbalance-handling mechanisms. The results demonstrate that frameworks incorporating ensemble learning and imbalance-aware optimization achieve better minority-class discrimination. PyCaret obtained the best overall performance, reaching 66% macro-F1, followed by AutoGluon with 55%, whereas frameworks lacking native balancing support exhibited significant degradation in minority-class detection capability. The analysis further shows that accuracy-oriented optimization alone is insufficient for highly imbalanced IDS scenarios, since high weighted metrics may coexist with poor generalization on rare attack categories. As a contribution, this work establishes a standardized benchmark for AutoML-based intrusion detection under severe multiclass imbalance, highlighting current architectural limitations and the need for native integration of imbalance-aware optimization, resampling, and stratified evaluation strategies into automated learning pipelines. The source code is publicly available.
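
The observation that high accuracy and high weighted metrics can coexist with poor rare-class detection can be reproduced on a toy example. The label counts and predictions below are made up and are not taken from NSL-KDD or from the paper's results; only the five class names follow the dataset.

```python
def f1_per_class(y_true, y_pred, labels):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores

labels = ["normal", "DoS", "Probe", "R2L", "U2R"]
# Hypothetical imbalanced test set: 60 / 30 / 7 / 2 / 1 samples per class.
y_true = ["normal"] * 60 + ["DoS"] * 30 + ["Probe"] * 7 + ["R2L"] * 2 + ["U2R"] * 1
# Hypothetical classifier: perfect on frequent classes, labels every rare attack "normal".
y_pred = ["normal"] * 60 + ["DoS"] * 30 + ["Probe"] * 7 + ["normal"] * 3

f1 = f1_per_class(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1.values()) / len(labels)                       # every class counts equally
weighted_f1 = sum(f1[c] * y_true.count(c) for c in labels) / len(y_true)  # weighted by support
print(accuracy, weighted_f1, macro_f1)
```

Because macro-F1 averages over classes rather than samples, the two completely missed attack classes pull it far below accuracy and weighted F1, even though they account for only 3 of the 100 samples.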

2606.12639 2026-06-12 cs.LG q-bio.QM 新提交

The Metric Picks the Winner: Evaluation Choice Flips Model Rankings for Drug-Response Prediction in Unseen Chemistry

度量选择胜者:评估选择翻转未见化学空间中药物反应预测的模型排名

Dhruv Agarwal, Riya Bisht

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本研究通过VCPI竞赛数据,发现药物反应预测模型排名随评估指标反转:简单基线在代理指标下胜出,但真实指标下深度模型显著优于线性指纹基线,首次在真实药物化学数据上验证了度量校准效应。

详情
AI中文摘要

预测细胞转录组对其从未见过的药物的反应是计算细胞生物学中的一个核心难题:最近的基准测试表明,一旦测试化合物按化学结构留出,复杂模型往往无法击败简单基线。我们研究了一个细胞系和检测方法,即通过DRUG-seq分析的THP-1细胞,由VCPI预测竞赛的活性化合物加权MSE(wMSE)评分。我们提出了一种分阶段方法:该领域一直无法击败的简单基线(未处理对照和平均训练化合物响应);非参数检索(对留出化合物的最近训练化合物进行Tanimoto加权平均);以及一个融合阶段,将冻结的化学嵌入与检索支持特征相结合,以预测相对于均值的残差,并包含不确定性头和基因程序。在发布的VCPI THP-1 drug-seq数据(14,026个训练化合物)上,采用Bemis-Murcko骨架划分,模型排名根据度量标准反转。在逆方差每基因代理度量下,基于Morgan指纹的正则化线性回归似乎胜过了深度模型、检索和ChemBERTa——这是教科书式的“简单基线获胜”结果。但在竞赛的真实活性集度量(每(基因,化合物)的Mejia权重,经官方评分器验证;均值基线0.535 vs 组织者的0.507参考)下,情况反转:深度模型获胜,我们的融合解码器显著优于线性指纹基线(-0.012 wMSE,配对bootstrap p < 10^-4),而代理度量的胜者成为最差的化学感知预测器。选择度量即选择胜者——据我们所知,这是首次在真实留出药物化学数据上证明度量校准效应,该效应此前主要在遗传扰动中建立。我们发布了一个可复现的流水线,连接到官方评分器,可在真实的1064 x 12,995网格上生成有效提交。

英文摘要

Predicting how a cell's transcriptome responds to a drug it has never seen is a core, hard problem in computational cell biology: recent benchmarks show complex models often fail to beat trivial baselines once test compounds are held out by chemistry. We study one cell line and assay, THP-1 cells profiled by DRUG-seq, scored by the active-compound weighted MSE (wMSE) of the VCPI prediction contest. We propose a staged approach: dumb baselines (untreated control and mean training-compound response) that the field keeps failing to beat; non-parametric retrieval (a Tanimoto-weighted average of a held-out compound's nearest training compounds); and a fusion stage combining a frozen chemistry embedding with retrieval-support features to predict the residual over the mean, with an uncertainty head and gene programs. On the released VCPI THP-1 DRUG-seq data (14,026 training compounds), under a Bemis-Murcko scaffold split, the model ranking inverts depending on the metric. Under an inverse-variance per-gene proxy, a regularized linear regression on Morgan fingerprints appears to win over the deep models, retrieval, and ChemBERTa -- the textbook "simple baselines win" result. But under the contest's true active-set metric (per-(gene, compound) Mejia weights, validated against the official scorer; mean baseline 0.535 vs the organizers' 0.507 reference), that reverses: the deep models win, our fusion decoder significantly beats the linear fingerprint baseline (-0.012 wMSE, paired bootstrap p < 10^-4), and the proxy's winner becomes the worst chemistry-aware predictor. Picking the metric picks the winner -- to our knowledge the first demonstration on real held-out drug chemistry of the metric-calibration effect established largely on genetic perturbation. We release a reproducible pipeline wired to the official scorer that emits a valid submission over the real 1064 x 12,995 grid.
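
The non-parametric retrieval stage (a Tanimoto-weighted average over a held-out compound's nearest training compounds) can be sketched as follows. The fingerprints, responses, neighbourhood size `k`, and the fallback to the mean training response when no neighbour overlaps are illustrative assumptions, not details confirmed by the abstract.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def retrieve(query_fp, train_fps, train_responses, k=2):
    """Tanimoto-weighted average of the k most similar training compounds' responses."""
    sims = sorted(((tanimoto(query_fp, fp), i) for i, fp in enumerate(train_fps)),
                  reverse=True)[:k]
    total = sum(s for s, _ in sims)
    n_genes = len(train_responses[0])
    if total == 0.0:
        # Assumed fallback: mean training-compound response when nothing is similar.
        return [sum(r[g] for r in train_responses) / len(train_responses)
                for g in range(n_genes)]
    return [sum(s * train_responses[i][g] for s, i in sims) / total
            for g in range(n_genes)]

# Hypothetical fingerprints (sets of "on" bits) and two-gene responses.
train_fps = [{1, 2, 3, 4}, {1, 2, 3, 9}, {7, 8}]
train_responses = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
pred = retrieve({1, 2, 3}, train_fps, train_responses, k=2)
print(pred)
```

The query shares three of four bits with each of the first two compounds and none with the third, so the prediction is the equal-weight average of the first two responses.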

2606.12643 2026-06-12 cs.LG 新提交

TEDD: Robust Detection of Unstable Temporal Features

TEDD:不稳定时间特征的鲁棒检测

Ricardo Ribeiro Pereira, Bruno Casal Laraña, Nádia Soares, Miguel Araújo

发表机构 * Feedzai

AI总结 提出TEDD方法,利用回归模型检测导致时间分布变化的特征,无需参数调优,可扩展,能检测数值和类别特征的单变量及多变量漂移。

Comments 8 pages, 9 figures

详情
AI中文摘要

在处理真实世界的时间序列数据时,经常会遇到特征分布随时间变化的情况。在这种不稳定的数据上直接使用机器学习模型可能导致性能迅速下降,尤其是当新分布与训练时所见差异较大时。为了解决这个问题,自动识别随时间变化的特征至关重要。检测到这些特征后,数据科学家和其他从业者能够通过应用数据变换等方式缓解问题,部署更鲁棒的模型,使其在更长时间内保持高性能。本文描述了特征不应遭受的时间变化类型,并提出了TEDD技术,用于a) 识别数据集何时可能导致不稳定的机器学习模型,以及b) 自动检测哪些特征导致了这种不鲁棒性。为此,我们利用回归模型来突出哪些特征有助于良好预测实例的时间戳。我们将我们的方法与其他方法在真实和合成数据上进行比较,测试它们在所有简单变化模式上的检测能力。我们表明,我们的方法:检测所有类型的基本变化,包括数值和类别特征;能够检测多变量漂移;返回一个可比较的值来衡量每个特征的变化量;无需参数调优;并且在数据集的特征数量和实例数量上都具有可扩展性。

英文摘要

When working with real-world temporal data, it is common to encounter features whose distribution is changing over time. The naive employment of Machine Learning models on this unstable data might lead to rapidly degrading performance, especially if the new distribution is much different from what was previously seen during training. In order to cope with this problem, it is critical to automatically identify features that are changing over time. With these features detected, data scientists and other practitioners will be able to mitigate the issue (for instance, by applying data transformations), deploying more robust models that retain high performance for longer periods of time. In this paper, we describe which temporal changes a feature should not suffer from, and propose TEDD, a technique to a) identify when a dataset might lead to an unstable Machine Learning model and b) automatically detect which features cause such lack of robustness. To achieve this, we leverage a regression model to highlight which features contribute to a good prediction of an instance's timestamp. We compare our approach to other methods on real and synthetic data, testing their detection capability on all simple change patterns. We show that our method: detects all types of basic changes, for both numerical and categorical features; can detect multivariate drifts; returns a comparable value measuring the amount of change of each feature; requires no parameter tuning; and scales with both the number of features and the number of instances in the dataset.
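
The core idea, that a feature is suspicious if it helps predict an instance's timestamp, can be illustrated with a deliberately simplified stand-in. The sketch below fits a separate one-variable least-squares regression per feature and uses its R^2 as the score; this is an assumption for illustration, not the TEDD procedure, and unlike the method described above it cannot detect multivariate or non-linear drift. The data and names are hypothetical.

```python
def r_squared(x, t):
    """R^2 of a one-variable least-squares fit predicting timestamp t from feature x."""
    n = len(x)
    mx, mt = sum(x) / n, sum(t) / n
    sxx = sum((a - mx) ** 2 for a in x)
    stt = sum((b - mt) ** 2 for b in t)
    if sxx == 0.0 or stt == 0.0:
        return 0.0  # a constant column carries no information about time
    sxt = sum((a - mx) * (b - mt) for a, b in zip(x, t))
    return (sxt * sxt) / (sxx * stt)

timestamps = list(range(20))
features = {
    # Linear trend over time plus an alternating +/-1 component.
    "drifting": [0.5 * t + (1 if t % 2 else -1) for t in timestamps],
    # The alternating component alone: no trend over time.
    "stable": [(1 if t % 2 else -1) for t in timestamps],
}
scores = {name: r_squared(col, timestamps) for name, col in features.items()}
print(scores)
```

A score near 1 means the feature alone nearly determines when an instance was recorded, while a score near 0 means it carries almost no linear information about time.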

2606.12718 2026-06-12 cs.LG eess.SP 新提交

Out-of-Distribution (OOD) Detectors for Open-Set RF Fingerprinting

面向开放集射频指纹识别的分布外检测器

Sudeepta Mondal, Ganesh Sundaramoorthi

发表机构 * University of Michigan(密歇根大学)

AI总结 针对开放集射频指纹识别中未知发射机与时间漂移引起的分布偏移问题,引入基于信息论的OOD检测统一框架,并采用无需OOD调优数据的方法,在POWDER数据集上验证其性能接近有真实OOD数据的基线。

详情
AI中文摘要

射频指纹识别系统必须在开放世界环境中运行,其中来自未知发射机的信号和时间漂移会在测试时引入分布偏移。分布外检测为该问题提供了自然框架,但其在射频指纹识别中的应用仍然有限。其采用的一个关键障碍是大多数OOD检测器需要辅助OOD数据进行参数调优,而在射频环境中收集代表性OOD数据不切实际,这一假设难以满足。在这项工作中,我们将机器学习文献中一组有前景的OOD检测方法引入开放集RFF领域。我们基于信息论(通信系统的自然框架)在一个统一的数学框架中呈现这些方法。我们的框架允许对方法进行系统分析并开发新方法。我们进一步展示了最近关于无需给定OOD调优数据即可调优OOD检测器的工作在开放集RFF中的适用性。我们在POWDER射频指纹数据集上进行评估,表明无需任何给定OOD数据调优的检测器性能与能够访问真实OOD调优数据的基线相当,并且大大优于无法访问真实OOD调优数据的基线方法,展示了RFF问题的实际可行性。

英文摘要

Radio-frequency (RF) fingerprinting systems must operate in open-world environments where signals from unknown transmitters and temporal drift introduce distribution shift at test time. Out-of-distribution (OOD) detection provides a natural framework for this problem, yet its application to RF fingerprinting (RFF) remains limited. A key barrier to adoption is that most OOD detectors require auxiliary OOD data for parameter tuning, an assumption that is difficult to satisfy in RF environments where representative OOD data is impractical to collect. In this work, we introduce a promising set of OOD detection methods from the machine learning literature to the open-set RFF domain. We present these methods within a unified mathematical framework based on information theory, which is a natural framework for communication systems. Our framework allows for the systematic analysis of methods and development of new methods. We further demonstrate the applicability of recent work on tuning OOD detectors without given OOD tuning data for open-set RFF. We evaluate on the POWDER RF fingerprinting dataset, showing that detectors tuned without any given OOD data achieve performance comparable to baselines with access to true OOD tuning data and greatly outperform baseline approaches without access to true OOD tuning data, showcasing their practical viability for the RFF problem.

2606.12764 2026-06-12 cs.LG cs.CL cs.CR 新提交

Detecting Functional Memorization in Code Language Models

检测代码语言模型中的功能记忆

Matthieu Meeus, Anil Ramakrishna, Matthew Grange, Zheng Xu, Luca Melis

发表机构 * Meta Imperial College London(伦敦帝国学院)

AI总结 研究代码语言模型的功能记忆现象,通过反事实设置对比暴露目标代码的模型与未暴露的参考模型,使用文本和功能相似性度量,发现功能记忆超出文本重叠的检测范围。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于大规模生成代码。同时,先前的工作通过审计训练示例与模型生成之间的文本重叠,研究了训练数据是否可以从模型输出中恢复。然而,代码可能在功能上等价而在文本上不相似。在这项工作中,我们研究了功能记忆:提取超出逐字指标检测的功能逻辑。我们为Olmo-3-32B构建了一个反事实设置,将中期训练模型(暴露于目标代码)与预训练参考模型(未暴露)进行比较。我们使用Python函数签名提示两个模型,并测量文本和功能相似性(即LLM作为评判者、基于执行)。我们的结果显示了功能记忆的明确证据,突出了需要超越文本重叠的审计指标。

英文摘要

Large language models (LLMs) are increasingly used to generate code at scale. Meanwhile, prior work has investigated whether training data may be recoverable from model outputs, by auditing the textual overlap between training examples and model generations. Code, however, can be functionally equivalent while textually dissimilar. In this work, we study functional memorization: extraction of functional logic beyond what verbatim metrics detect. We construct a counterfactual setup for Olmo-3-32B, comparing a midtrained model (exposed to target code) against a pretrained reference (not exposed). We prompt both models with Python function signatures and measure both textual and functional similarity (i.e., LLM-as-a-judge, execution-based). Our results show clear evidence of functional memorization, highlighting the need for auditing metrics that go beyond textual overlap.

2606.12913 2026-06-12 cs.LG cs.CV 新提交

Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

图上的样本选择:用于无损训练加速的统一数据集剪枝框架

Dongyue Wu, Zilin Guo, Xiaoyu Li, Jiajia Liu, Jingdong Chen, Nong Sang, Changxin Gao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出基于图的统一数据集剪枝框架,将数据集建模为加权图,通过最大权重团问题选择样本,并设计贪心算法,在多种剪枝比例下优于现有方法,实现ImageNet-1k上40%以上训练加速且不损失精度。

Comments ICML 2026

详情
AI中文摘要

现代训练数据集的快速增长显著增加了计算成本,促使数据集剪枝(DP)方法仅保留信息量丰富的样本子集以减少训练成本。现有的剪枝标准通常依赖于独立评估每个样本的内在信号,或通过成对关系促进多样性的外在信号。虽然在其特定领域有效,但每种方法仅捕捉样本效用的一个方面,且在不同剪枝比例或数据分布下缺乏鲁棒性。在这项工作中,我们提出了一个统一的基于图的DP框架。通过将数据集建模为加权图,其中节点权重编码内在价值,边权重编码外在价值,DP可以转化为最大权重团问题(MWCP)。尽管MWCP是NP难的,但其结构允许基于样本边际增益的原则性贪心解法。在几个温和条件下,我们进一步证明该统一目标具有形式化的近似保证,适用于广泛的重要性度量族,并提供了实用设计指南。大量实验表明,我们的方法优于现有DP方法,同时显著降低训练成本,在ImageNet-1k上使用ResNet-50时,训练时间减少超过40%且不损失精度。

英文摘要

The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, reducing training time by over 40% without sacrificing accuracy on ImageNet-1k with ResNet-50.
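The greedy, marginal-gain selection described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the objective (sum of node weights plus sum of pairwise edge weights inside the selected set), the function names `greedy_prune` and `objective`, and the toy weights are all hypothetical.

```python
def greedy_prune(node_w, edge_w, k):
    """Pick k samples greedily by marginal gain (assumed objective, see lead-in)."""
    n = len(node_w)
    gain = list(node_w)          # gain[i]: value added if sample i joins the selection
    available = set(range(n))
    selected = []
    for _ in range(min(k, n)):
        # highest marginal gain wins; ties go to the smaller index
        best = max(available, key=lambda i: (gain[i], -i))
        selected.append(best)
        available.remove(best)
        # a newly selected sample adds its edge weight to every remaining candidate
        for i in available:
            gain[i] += edge_w[best][i]
    return selected


def objective(node_w, edge_w, subset):
    """Node weights of the subset plus each pairwise edge weight counted once."""
    total = sum(node_w[i] for i in subset)
    for pos, a in enumerate(subset):
        for b in subset[pos + 1:]:
            total += edge_w[a][b]
    return total


# Toy data: samples 0 and 1 are near-duplicates (edge weight 0 means no diversity gain).
node_w = [1.0, 0.9, 0.8, 0.1]
edge_w = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 1, 1, 0],
]
picked = greedy_prune(node_w, edge_w, k=2)
score = objective(node_w, edge_w, picked)
```

In this toy case the second pick skips sample 1, despite its high node weight, because it adds no diversity relative to sample 0.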

2606.12997 2026-06-12 cs.LG stat.ML 新提交

Reliability of Probabilistic Emulation of Physical Systems

物理系统概率仿真的可靠性

Sam F. Greenbury, Radka Jersakova, Paolo Conti, Marjan Famili, Christopher Iliffe Sprague, Edwin Brown, Jason D. McEwen

发表机构 * The Alan Turing Institute(艾伦·图灵研究所) Autodesk Research(欧特克研究院) PhysicsX Orbital University of Sheffield(谢菲尔德大学) University College London(伦敦大学学院)

AI总结 比较生成模型与CRPS训练集成在物理系统概率仿真中的可靠性,发现CRPS集成在覆盖率和推理速度上更优。

详情
AI中文摘要

目前,生成物理系统概率预测的两种主要方法已经出现:生成模型(如扩散或流匹配)以及注入随机性的确定性模型集成(使用连续排序概率评分(CRPS)损失训练)。虽然这两种方法都表现出强大的预测准确性,但其不确定性的可靠性尚未得到系统评估。我们通过开发一个框架来填补这一空白,该框架在匹配模型大小和计算预算的情况下,评估这两种方法在多种二维时空物理系统中的表现。我们通过检查预测区间的经验覆盖率来评估概率仿真的可靠性,同时考虑准确性和计算效率指标。CRPS训练的集成在单步预测和自回归展开中通常能实现更可靠的不确定性,显示出比在潜在空间中训练生成模型的标准替代方案更好的覆盖率。此外,CRPS方法提供了显著更快的推理速度。当生成模型在环境空间而非压缩潜在空间中训练时(这在高维问题中通常不可行),它们表现出与CRPS训练集成相当的覆盖率,但推理延迟显著更大。相比之下,当CRPS训练的集成在潜在空间中训练时,其覆盖率相对于环境空间没有明显下降。生成模型和CRPS训练的集成都表现出良好的预测准确性。为促进未来的研究和应用,我们发布了AutoCast,一个实现生成模型和CRPS训练集成的模块化框架,以及AutoSim,一个用于快速原型的灵活数据集生成包。

英文摘要

Two dominant approaches have emerged for generating probabilistic forecasts of physical systems: generative models, such as diffusion or flow matching; and ensembles of deterministic models with stochasticity injected, trained using the continuous ranked probability score (CRPS) loss. While both approaches have demonstrated strong predictive accuracy, the reliability of their uncertainties has not been systematically assessed. We address this gap by developing a framework to evaluate both approaches across diverse 2D spatiotemporal physical systems, under matched model size and computational budget. We assess the reliability of probabilistic emulation by inspecting the empirical coverage of predictive intervals, while also considering accuracy and computational efficiency metrics. CRPS-trained ensembles typically achieve more reliable uncertainties on both single-step prediction and autoregressive rollouts, demonstrating better coverage than the standard alternative of training generative models in a latent space. Moreover, the CRPS approach offers significantly faster inference. When generative models are trained in ambient space rather than in a compressed latent space, which is often infeasible for high-dimensional problems, they exhibit comparable coverage to CRPS-trained ensembles, though with substantially larger inference latency. In contrast, when CRPS-trained ensembles are trained in latent space, they do not show a marked degradation in coverage with respect to ambient space. Both generative models and CRPS-trained ensembles demonstrate good predictive accuracy. To facilitate future research and application, we release AutoCast, a modular framework implementing both generative models and CRPS-trained ensembles, alongside AutoSim, a flexible dataset generation package for rapid prototyping.
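The two quantities this abstract relies on, the CRPS of an ensemble and the empirical coverage of predictive intervals, can be illustrated with a small sketch. It assumes scalar targets and the standard sample-based CRPS estimator E|X - y| - 0.5 E|X - X'|; the helper names are hypothetical and nothing here is taken from the AutoCast code.

```python
def ensemble_crps(members, obs):
    """Sample-based CRPS for one scalar target: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    term_obs = sum(abs(x - obs) for x in members) / m
    term_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return term_obs - 0.5 * term_spread


def quantile(values, q):
    """Linear-interpolation quantile of a list, q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(s) - 1)
    frac = pos - lower
    return s[lower] * (1 - frac) + s[upper] * frac


def empirical_coverage(ensembles, observations, level=0.9):
    """Fraction of observations falling inside each ensemble's central interval."""
    tail = (1 - level) / 2
    hits = 0
    for members, obs in zip(ensembles, observations):
        if quantile(members, tail) <= obs <= quantile(members, 1 - tail):
            hits += 1
    return hits / len(observations)


# A collapsed ensemble reduces CRPS to the absolute error.
crps_point = ensemble_crps([1.0, 1.0, 1.0], 3.0)
# One observation inside the 50% interval, one far outside.
cov = empirical_coverage([[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]], [2, 10], level=0.5)
```

A well-calibrated forecaster would show empirical coverage close to the nominal `level`; values well below it indicate over-confident intervals.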

2606.13104 2026-06-12 cs.LG 新提交

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

权威、真实性与引文偏差:研究大语言模型认知易感性的大规模多领域基准

Aryan Khurana, Aravind Ramana RN, Dhruv Kumar

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出AuthorityBench基准,通过2x2因子设计隔离引文权威信号对LLM认知行为的影响,发现引文存在(无论真假)均提高幻觉率,真声明搭配假引文时幻觉率上升3-22个百分点。

Comments 10 pages, 5 figures. Accepted to AI4GOOD and EIML at ICML 2026

详情
AI中文摘要

大型语言模型越来越多地部署在引文增强的环境中,但引文存在对模型行为的影响(独立于事实内容)仍知之甚少。我们引入了AuthorityBench,一个包含220,564个提示的多领域基准,用于隔离基于引文的权威信号如何影响LLM的认知行为。该基准采用完全平衡的2x2因子设计,交叉声明真实性(claim veracity)与引文真实性(citation veracity),这是首个这样做的基准,涵盖四个领域(常识、科学、法律和医学),并在40个提示模板、四个发表场所声望等级和一个国家编码的作者姓名数据集上进行受控变化。评估七个模型在12个结构化研究问题上的表现,我们发现引文的存在(无论是真实的还是捏造的)相对于无引文基线一致地提高了幻觉率。当捏造的引文伴随真实声明时,这种效应最强,使幻觉率提高3到22个百分点,在常识领域达到35%到77%,而法律声明相对稳健,发表场所声望和作者人口统计学特征的影响可忽略不计。所有数据集和评估代码均可在以下网址获取:https://github.com/floating-reeds/AuthorityBench

英文摘要

Large language models are increasingly deployed in citation-augmented settings, yet the effect of citation presence on model behavior independent of factual content remains poorly understood. We introduce AuthorityBench, a 220,564-prompt multi-domain benchmark that isolates how citation-based authority signals influence epistemic behavior in LLMs. The benchmark uses a fully balanced 2x2 factorial design crossing claim veracity with citation veracity, the first to do so, across four domains (general knowledge, science, law, and medicine), with controlled variation over 40 prompt templates, four venue prestige tiers, and a country-coded author name dataset. Evaluating seven models on 12 structured research questions, we find that citation presence, whether real or fabricated, consistently increases hallucination rates relative to a no-citation baseline. The effect is strongest when fabricated citations accompany true claims, raising hallucination rates by 3 to 22 percentage points and reaching 35 to 77% in the general knowledge domain, while legal claims are comparatively robust and venue prestige and author demographics show negligible impact. All datasets and evaluation code are available at: https://github.com/floating-reeds/AuthorityBench
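The fully balanced 2x2 factorial design mentioned above crosses two binary factors in every domain. A minimal sketch of enumerating such cells is shown below; the label strings and the `build_cells` helper are hypothetical and not taken from the AuthorityBench repository, and the no-citation baseline is left out.

```python
from collections import Counter
from itertools import product

CLAIM = ("true_claim", "false_claim")               # claim veracity
CITATION = ("real_citation", "fabricated_citation")  # citation veracity
DOMAINS = ("general knowledge", "science", "law", "medicine")


def build_cells():
    """Enumerate every claim x citation x domain combination exactly once."""
    return [
        {"claim": c, "citation": k, "domain": d}
        for c, k, d in product(CLAIM, CITATION, DOMAINS)
    ]


cells = build_cells()
# In a balanced design each (claim, citation) pair appears equally often.
per_condition = Counter((cell["claim"], cell["citation"]) for cell in cells)
```

Balance matters because it lets the effect of citation veracity be compared at each level of claim veracity without one factor being confounded with the other.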

2606.13105 2026-06-12 cs.LG 新提交

Disparate Impact in Synthetic Data Generation

合成数据生成中的差异性影响

Paul Andrey, Michaël Perrot, Batiste Le Bars, Marc Tommasi

发表机构 * Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 - CRIStAL(里尔大学、法国国家信息与自动化研究所、法国国家科学研究中心、中央里尔高等电力工程学院、计算机科学、信号与自动化研究实验室)

AI总结 本文重新审视合成数据生成中的差异性影响公平性概念,指出非差异性影响要求合成分布与真实分布一致,并分析SDG失败的原因(表达能力、抽样误差、差分隐私估计误差),提出分组学习策略以提升整体效用和公平性。

详情
AI中文摘要

我们重新审视合成数据生成(SDG)中差异性影响的公平性概念,该概念评估生成记录的效用是否在不同敏感群体间相同。我们的方法不同于现有的公平SDG工作,后者旨在纠正观测分布中的不当偏差,从而将SDG重新定义为学习一个并非真实数据分布的分布。相比之下,当合成分布与真实分布相同时,非差异性影响得以显著实现。我们揭示了SDG可能无法达到该解决方案的原因,并讨论了近似误差和估计误差为何会发生以及可能在不同群体间存在差异。我们特别关注了SDG方法相对于分布复杂性的表达能力、群体比例导致的抽样误差以及差分隐私机制引起的估计误差。我们在人工和真实数据上展示了差异性影响的案例,重点关注依赖概率图模型的SDG方法。我们还引入了一种学习分组SDG模型的策略,并说明了它在许多情况下如何提升整体效用及其公平性。

英文摘要

We revisit the fairness notion of disparate impact for synthetic data generation (SDG), which assesses whether the utility of generated records is the same across sensitive groups. Our approach departs from existing work on fair SDG, which addresses the problem of correcting for undue biases in the observed distribution, hence redefining SDG as learning a distribution that is not that of the real data. By contrast, non-disparate impact is notably achieved when the synthetic and real distributions are the same. We expose reasons why SDG may fail to reach that solution and discuss why approximation and estimation errors occur and can be disparate across groups. We notably look into the expressive power of SDG methods relative to distribution complexity, sampling errors due to group proportions, and estimation errors induced by differential privacy mechanisms. We illustrate cases of disparate impact on both artificial and real-world data, focusing on SDG methods that rely on probabilistic graphical models. We also introduce a strategy of learning group-wise SDG models and illustrate how it can improve both the overall utility and its parity in many settings.

2606.13194 2026-06-12 cs.LG 新提交

WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

WHAR Arena: 基准测试高效可穿戴人体活动识别的最新进展

Maximilian Burzer, Tobias King, Till Riedel, Michael Beigl, Tobias Röddiger

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) IPAI Foundation gGmbH(IPAI基金会有限责任公司)

AI总结 为解决可穿戴人体活动识别中的可比性危机,构建了包含30个数据集的大规模基准,评估17种架构,发现预测性能趋于饱和,而紧凑模型和随机森林在部署效率上构成帕累托前沿。

Comments 20 pages, 9 Figures, 3 Tables

详情
AI中文摘要

深度学习已成为可穿戴人体活动识别(WHAR)的主导范式,但进展因可比性危机而变得模糊。结果通常使用不一致的数据集、自定义数据处理和不同的评估协议报告,使得关于最先进水平的声明十分脆弱。我们通过一个大规模、开源基准来解决这个问题,该基准在标准化处理、统一模型接口和共享的跨主体评估协议下整合了30个不同的数据集。在4760次训练运行中评估了17种代表性架构,我们共同测量了预测性能以及Android参考设备上的设备端延迟、峰值内存和模型大小。我们的结果表明,WHAR的最先进水平分散在多种架构之间,而非由单一架构主导。虽然CNN-HAR实现了最高的平均宏F1,但表现最佳的模型紧密聚集,表明当代架构已接近预测性能上限。当考虑部署效率时,紧凑神经模型(如TinierHAR)和经典随机森林定义了实际相关的帕累托前沿,而较大的循环和混合模型则产生高硬件成本而无相应的性能增益。因此,尽管预测性能已趋于平稳,但在优化部署效率和改进对领域变化的适应方面,未来仍有巨大潜力。我们发布完整框架以支持透明的重用和扩展。

英文摘要

Deep learning has become the dominant paradigm in Wearable Human Activity Recognition (WHAR), yet progress is obscured by a comparability crisis. Results are often reported using inconsistent datasets, custom data processing, and varying evaluation protocols, making state-of-the-art claims fragile. We address this with a large-scale, open-source benchmark that integrates 30 diverse datasets under standardized processing, unified model interfaces, and a shared cross-subject evaluation protocol. Evaluating 17 representative architectures across 4760 training runs, we jointly measure predictive performance alongside on-device latency, peak memory, and model size on an Android reference device. Our results reveal that the WHAR state of the art is distributed rather than dominated by a single architecture. While CNN-HAR achieves the highest mean macro-F1, top-performing models cluster tightly, indicating contemporary architectures have converged near a predictive performance ceiling. When accounting for deployment efficiency, compact neural models, such as TinierHAR, and classical Random Forests define the practically relevant Pareto frontier, whereas larger recurrent and hybrid models incur high hardware costs without corresponding performance gains. Consequently, while predictive performance has plateaued, substantial potential for future progress remains in optimizing deployment efficiency and improving adaptation to domain shifts. We release our full framework to support transparent reuse and extension.

2606.13338 2026-06-12 cs.LG 新提交

Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

导航安全-保真度权衡:通过概率场景进行电力系统的大规模多变量时间序列预测

Kaijie Xu, Anqi Wang, Xilin Dai

发表机构 * ZJU-UIUC Institute, Zhejiang University(浙江大学伊利诺伊大学厄巴纳香槟校区联合学院)

AI总结 针对现有基准无法评估大规模多变量概率预测的安全性与保真度权衡问题,提出包含多达36,964个通道的电力系统基准PowerPhase和场景式分位数预测器PowerForge,在多个网格上取得最佳平均排名。

详情
AI中文摘要

概率预测模型越来越多地部署在具有不同通道物理特性和运行约束的多变量系统上,但现有基准无法大规模评估这两个属性。公开的规范多变量基准最多包含2,000个通道,而电力系统基准要么缺乏时间结构,要么缺乏概率评估。我们提出PowerPhase,这是一个基于六个输电网络构建的概率预测基准,联合预测通道数从2,000到36,964,比流行的规范多变量基准高出一个数量级以上。每个目标轨迹是交流潮流求解的输出,PowerPhase配备了约束感知指标,包括Safety_mBrier、NECV和CVaR-alpha,作为CRPS和Distortion的补充。在八个基线和三个随机种子上,分布准确性和约束满足对模型进行不同排序,我们将这种权衡称为安全-保真度。我们进一步提出PowerForge,一种基于场景的分位数预测器,具有类型特定的解码头和变量组之间的因果桥,在每个网格上实现了最佳平均排名。

英文摘要

Probabilistic forecasting models are increasingly deployed on multivariate systems with distinct channel physics and operational constraints, but existing benchmarks evaluate neither property at scale. Public canonical multivariate benchmarks cap out at 2,000 channels, while power-system benchmarks either lack temporal structure or probabilistic evaluation. We introduce PowerPhase, a probabilistic forecasting benchmark built on six transmission grids ranging from 2,000 to 36,964 jointly forecasted channels, more than an order of magnitude beyond popular canonical multivariate benchmarks. Each target trajectory is the output of an AC power-flow solve, and PowerPhase ships with constraint-aware metrics, including Safety_mBrier, NECV, and CVaR-alpha, that complement CRPS and Distortion. Across eight baselines and three seeds, distributional accuracy and constraint satisfaction rank models differently, a trade-off we term safety-fidelity. We further propose PowerForge, a scenario-based quantile forecaster with type-specific decoding heads and a causal bridge between variable groups, which achieves the best average rank on every grid.

2606.13477 2026-06-12 cs.LG cs.AI cs.CL 新提交

SupraBench: A Benchmark for Supramolecular Chemistry

SupraBench: 超分子化学基准

Tianyi Ma, Yijun Ma, Zehong Wang, Weixiang Sun, Ziming Li, Connor R. Schmidt, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

AI总结 为评估大语言模型在超分子化学推理中的能力,与领域专家合作发布了首个超分子基准SupraBench,包含四个基本任务和一个辅助视觉任务,并提供了16M令牌的语料库SupraPMC。

详情
AI中文摘要

超分子化学,包括非共价主客体组装的研究,推动了各种应用的发展。然而,设计主客体系统仍然耗时,每个候选对需要数天的干实验室验证。尽管LLMs已成为一种快速的替代方案,在分子结合任务上表现出色,但目前尚无基准系统性地评估LLMs在超分子化学基本任务(如结合亲和力预测)中的主客体推理能力。为此,我们与领域专家合作发布了首个超分子基准,称为SupraBench,用于评估LLMs在化学推理中的表现。具体来说,我们设计了四个基本任务,即结合亲和力预测、最佳结合物选择、溶剂识别和主客体描述,以及一个辅助的基于视觉的分子识别任务。我们还发布了SupraPMC,一个从Europe PMC中提取的经过整理的1600万令牌的超分子化学文章语料库,以支持对超分子领域的适应。我们对一系列开源和专有LLMs进行了基准测试,发现LLMs在所有任务上都有很大的提升空间。在SupraPMC上的领域自适应预训练可以干净地迁移到分布内回归,但会与严格的字母格式输出进行权衡。此外,不同任务家族的难度分布差异很大,揭示了不同的失败模式,表明当前超分子化学推理中存在特定的差距。我们的源代码和基准数据集可在以下网址获取:https://github.com/Tianyi-Billy-Ma/SupraBench。

英文摘要

Supramolecular chemistry, which includes the study of non-covalent host-guest assemblies, has advanced various applications. However, designing host-guest systems remains time-consuming, requiring days of dry-lab verification per candidate pair. Although LLMs have emerged as a fast alternative with strong performance on molecular binding tasks, no benchmark currently systematically evaluates LLMs for host-guest reasoning across fundamental supramolecular chemistry tasks, e.g., binding affinity prediction. To this end, we collaborate with domain experts to release the first Supramolecular Benchmark, called SupraBench, to evaluate LLMs in chemistry reasoning. Specifically, we design four fundamental tasks, i.e., binding affinity prediction, top-binder selection, solvent identification, and host-guest description, plus an auxiliary vision-based task for molecular identification. We also release SupraPMC, a curated 16M-token corpus of Supramolecular chemistry articles distilled from Europe PMC, to support the adaptation to the supramolecular domain. We benchmark a broad range of open and proprietary LLMs and find that LLMs leave substantial headroom across all tasks. Domain adaptation pretraining over SupraPMC transfers cleanly to in-distribution regression but trades off against strict letter-format output. Moreover, the difficulty profile differs sharply across task families, revealing distinct failure modes that indicate specific gaps in current supramolecular chemistry reasoning. Our source codes and benchmark datasets are available at https://github.com/Tianyi-Billy-Ma/SupraBench.

2606.13486 2026-06-12 cs.LG cs.AI 新提交

CRAFTIIF: Cross-Resolution Analytic Four-Type Interpretable Isolation Forest for Multivariate Time Series Anomaly Detection

CRAFTIIF:用于多元时间序列异常检测的跨分辨率分析四类型可解释孤立森林

William Smits

发表机构 * Avathon

AI总结 提出CRAFTIIF无监督框架,通过四种小波特征和五个孤立森林同时检测点、分布、时间和集体四类异常,在mTSBench基准上达到平均F1=0.228,VUS-PR比先前最佳提升40.7%。

Comments 14 pages, 4 figures, 2 appendices. Submitted to IEEE Transactions on Knowledge and Data Engineering (TKDE). Code: https://github.com/smitswil/craftiif

详情
AI中文摘要

多元时间序列中的异常检测面临四种结构不同的异常类型,即点异常(孤立尖峰)、分布异常(水平偏移)、时间异常(节奏变化)和集体异常(传感器间相关性崩溃),每种都需要不同的特征表示。大多数无监督方法只针对其中一两种类型,且可解释性有限。我们提出CRAFTIIF(跨分辨率分析四类型可解释孤立森林),这是一个完全无监督的框架,针对所有四种类型,无需针对数据集调整。CRAFTIIF生成K=500个随机分析小波特征,跨越四个小波族(Morlet、DOG、Haar、Coiflet),每个针对特定异常类型,并输入五个结构化的孤立森林(每种类型一个,外加一个用于复合异常的元IF)。自适应Otsu/MAD阈值在0.1%到69.2%的异常率范围内自动校准检测。由于每个IF仅针对特定类型的特征进行训练,分支触发直接提供异常类型归因,无需事后解释。在mTSBench基准(Zhou等人,TMLR 2026)的所有19个数据集上评估,CRAFTIIF在全部19个数据集上达到平均F1=0.228,在13个可检测数据集上F1=0.322,并在全部25种被评估方法中于VUS-PR上排名第一(0.463对比之前最佳0.329,提升40.7%)。一个诊断框架(oracle F1、可检测性限制和分支分离比)识别出19个数据集中有6个从根本上无法被任何无监督方法检测。在11种消融条件下,自适应阈值(+38% F1)、四分支结构(+20%)和元IF(+23%)均被证明是必不可少的。代码:https://github.com/smitswil/craftiif

英文摘要

Anomaly detection in multivariate time series is challenged by four structurally distinct anomaly types -- point (isolated spikes), distributional (level shifts), temporal (rhythm changes), and collective (inter-sensor correlation breakdowns) -- each requiring different feature representations. Most unsupervised methods target only one or two types and provide limited interpretability. We present CRAFTIIF (Cross-Resolution Analytic Four-Type Interpretable Isolation Forest), a fully unsupervised framework targeting all four types without dataset-specific tuning. CRAFTIIF generates K=500 random analytic wavelet feature draws across four families (Morlet, DOG, Haar, Coiflet), each targeting a specific anomaly type, feeding five structured Isolation Forests -- one per type plus a meta-IF for compound anomalies. An adaptive Otsu/MAD threshold calibrates detection automatically across anomaly rates from 0.1% to 69.2%. Because each IF is trained exclusively on type-specific features, branch firing provides direct anomaly-type attribution by construction, without post-hoc explanation. Evaluated on all 19 datasets of the mTSBench benchmark (Zhou et al., TMLR 2026), CRAFTIIF achieves mean F1=0.228 (all 19 datasets) and F1=0.322 (13 detectable datasets), ranking first among all 25 evaluated methods on VUS-PR (0.463 vs. previous best 0.329, +40.7%). A diagnostic framework -- oracle F1, detectability limits, and branch separation ratios -- identifies 6 of 19 datasets as fundamentally undetectable by any unsupervised method. Ablation over 11 conditions confirms adaptive thresholding (+38% F1), four-branch structure (+20%), and meta-IF (+23%) are each essential. Code: https://github.com/smitswil/craftiif
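The abstract names an adaptive Otsu/MAD threshold but does not spell out its form. As a rough illustration of the MAD half only, the sketch below flags anomaly scores that sit more than `k` robust standard deviations above the median. The function name, the choice `k=3.0`, and the toy scores are assumptions, not the CRAFTIIF implementation.

```python
from statistics import median


def mad_threshold(scores, k=3.0):
    """Median plus k scaled MADs; 1.4826 makes the MAD comparable to a std. dev."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    return med + k * 1.4826 * mad


# Hypothetical anomaly scores (higher means more anomalous).
scores = [0.10, 0.20, 0.15, 0.12, 0.18, 0.90]
threshold = mad_threshold(scores)
flagged = [i for i, s in enumerate(scores) if s > threshold]
```

Because the median and MAD are insensitive to a minority of extreme scores, a threshold of this kind needs no preset contamination rate, which is one plausible reason to prefer it when anomaly rates vary widely across datasets.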

2606.12426 2026-06-12 cs.CY cs.CL cs.LG 交叉投稿

Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science

两个错误,没有正确:审计计算社会科学中LLM标注者的社会期望偏差

Varun Kotte

发表机构 * Varun Kotte

AI总结 研究审计了三个开源指令微调模型在TweetEval任务中的社会期望偏差,发现模型存在宽大、过度纠正和中性偏差,且提示干预无法纠正,聚合指标可能掩盖实质结论错误。

详情
AI中文摘要

LLM标注者越来越多地用于计算社会科学(CSS),但尚不清楚其受对齐影响的错误是否仍能保持研究者将报告的实证结论。我们在四个提示条件下(72个单元格)审计了三个开源7B指令微调模型(Zephyr、Mistral-Instruct、Qwen2.5-Instruct)在六个TweetEval任务中的表现,发现社会期望失败并非单一方向。Zephyr表现出宽大偏差,系统性地少应用有害标签(冒犯性语言:假良性率0.729,虚警率0.031)。Mistral和Qwen表现出过度纠正,过度应用相同标签(Mistral仇恨言论FAR = 0.604)。所有三个模型在堕胎立场上表现出中性偏差,低估反对立场的流行率24至40个百分点,并夸大中性标签。我们测试的四种提示干预(中性、安全框架、去个性化、思维链)均未能在所有模型上纠正这些失败;安全框架可能加剧立场扭曲。引人注目的是,Zephyr的仇恨言论流行率估计与黄金率完全一致,而其类别条件误差在两个方向上都很大,这是一种偶然的抵消,会误导聚合层面的验证。我们将这些模式转化为一个三部分分类法,具有诊断性FBR/FAR特征和轻量级黄金样本验证协议。对可信CSS而言,核心结论是:在聚合指标上看起来校准的模型仍然可能翻转研究者报告的实质性实证结论。

英文摘要

LLM annotators are increasingly used in computational social science (CSS), but it is unclear whether their alignment-shaped errors preserve the empirical conclusions a researcher would report. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) across six TweetEval tasks under four prompt conditions (72 cells) and find that social-desirability failures do not run in a single direction. Zephyr exhibits leniency bias, systematically under-applying harmful labels (offensive language: false benign rate 0.729, false alarm rate 0.031). Mistral and Qwen exhibit overcorrection, over-applying the same labels (Mistral hate-speech FAR = 0.604). All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points and inflating the neutral label. None of the four prompting interventions we test (neutral, safety framing, depersonalized, chain-of-thought) corrects these failures across models; safety framing can worsen stance distortion. Strikingly, Zephyr's hate-speech prevalence estimate matches the gold rate exactly while its class-conditional errors are large in both directions, an accidental cancellation that misleads aggregate validation. We translate these patterns into a three-part taxonomy with diagnostic FBR/FAR signatures and a lightweight gold-sample validation protocol. The headline for trustworthy CSS: a model that looks calibrated on aggregate metrics can still flip the substantive empirical conclusion a researcher would report.
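The accidental cancellation described above can be reproduced on toy data. The sketch assumes that FBR (false benign rate) is the share of truly harmful items labeled benign and FAR (false alarm rate) is the share of truly benign items labeled harmful; the abstract does not define them, so these definitions, the `audit` helper, and the labels are hypothetical.

```python
def audit(gold, pred):
    """gold and pred are lists of 0 (benign) or 1 (harmful)."""
    harmful = [p for g, p in zip(gold, pred) if g == 1]
    benign = [p for g, p in zip(gold, pred) if g == 0]
    return {
        "FBR": harmful.count(0) / len(harmful),   # harmful items missed
        "FAR": benign.count(1) / len(benign),     # benign items flagged
        "gold_prevalence": sum(gold) / len(gold),
        "pred_prevalence": sum(pred) / len(pred),
    }


# Four harmful and six benign items; the annotator misses two harmful items
# and wrongly flags two benign ones.
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 1, 1, 0, 0, 0, 0]
report = audit(gold, pred)
```

Here the predicted prevalence equals the gold prevalence even though half of the harmful items are mislabeled, so a check on aggregate prevalence alone would pass while the class-conditional rates reveal the problem.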

2606.12451 2026-06-12 cs.AI cs.IR cs.LG 交叉投稿

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

ToolSense: 审计LLM中参数化工具知识的诊断框架

Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal

发表机构 * SAP Labs(SAP实验室)

AI总结 提出ToolSense诊断框架,自动生成三类基准测试,揭示参数化工具检索中知识-检索分离现象,发现模型在模糊查询下性能显著下降。

详情
AI中文摘要

作为大型工具目录上的代理部署的大型语言模型面临关键的工具检索瓶颈。由于基于嵌入的检索方法依赖于可能无法充分捕获专用工具语义的紧凑编码器,参数化工具检索通过将每个工具编码为附加到LLM词汇表的虚拟令牌来解决这一问题,经过两个阶段(记忆然后检索SFT)的微调,将LLM用作检索器,在标准ToolBench检索基准上取得了强劲性能。然而,这些基准使用冗长、完全指定的查询,并且其评估应用了将输出限制为有效令牌路径的约束解码,这两者都不能揭示模型是否真正理解其工具。我们引入了ToolSense,一个开源LLM驱动的诊断框架,它将任何工具目录作为输入,并自动生成三个基准:具有三个模糊级别查询的现实检索基准(RRB)、MCQ探测基准和QA探测基准。将ToolSense应用于ToolBench(约47k个工具)并评估五个参数化模型训练配置,揭示了知识-检索分离:在RRB查询上,与完全指定的ToolBench基准相比,几个配置下降了约50-64个百分点,低于嵌入模型基线。此外,尽管检索性能强劲,一些模型在事实探测上得分接近随机,表明存在知识-检索分离。我们在https://github.com/SAP/toolsense上开源了ToolSense框架和ToolBench诊断基准。

英文摘要

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary, fine-tuned in two stages (memorization then retrieval SFT) to use the LLM as a retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths; neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, suggesting a knowledge-retrieval dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.

2606.12608 2026-06-12 cs.CL cs.LG 交叉投稿

Shopping Reasoning Bench: An Expert-Authored Benchmark for Multi-Turn Conversational Shopping Assistants

购物推理基准:面向多轮对话购物助手的专家编写基准

Shuxian Fan, Seonwoo Min, Youna Hu, Botao Xia, Jayakrishnan Unnikrishnan, Rowan Musselmann, Yifan Gao, Qingyu Yin, Priyanka Nigam, Bing Yin

发表机构 * Amazon(亚马逊)

AI总结 提出一个由零售专家编写的525个任务的多轮对话购物推理基准,包含10863个加权评分标准,评估9个模型显示通过率仅57-77%,多轮任务性能下降4-18分。

详情
AI中文摘要

对话式购物助手现已服务数亿客户,但现有基准均未联合评估真实购物对话所需的开放式多轮推理、领域专业知识和标准级质量。购物推理在语言模型应用中独具特色。与事实性问答或可验证代码生成不同,它需要在多轮对话中平衡主观偏好、预算约束和跨产品权衡,这些能力在以往的电商和通用基准中缺失。我们引入了购物推理基准(Shopping Reasoning Bench),这是一个由零售领域专家编写的基准,包含525个任务(232个单轮,293个多轮)和10863个重要性加权的二元评分标准。这些标准组织在包含五个推理类别和十五个子类别的分类体系下,涵盖偏好细化、权衡分析和兼容性评估等多样化需求。对三个模型系列(GPT、Claude、Gemini)中九个模型的评估显示,整体通过率仅为57-77%。在多轮任务中,所有模型在可选的超越标准上的得分比必需标准低13-29分,并且随着对话进行,性能下降4-18分。这些差距表明,当前模型能处理基本购物辅助,但达不到专家级建议,使购物推理基准成为未来购物助手开发的挑战性测试平台。

英文摘要

Conversational shopping assistants now serve hundreds of millions of customers, yet no existing benchmark jointly evaluates the open-ended multi-turn reasoning, domain expertise, and criterion-level quality that real shopping conversations demand. Shopping reasoning is unique among language model applications. Unlike factual question answering or verifiable code generation, it requires balancing subjective preferences, budget constraints, and cross-product trade-offs across multi-turn dialogue, capabilities absent from previous e-commerce and general-purpose benchmarks. We introduce the Shopping Reasoning Bench, an expert-authored benchmark of 525 missions (232 single-turn, 293 multi-turn) with 10,863 importance-weighted binary rubrics authored by retail domain experts. These criteria are organized under a taxonomy of five reasoning categories and fifteen subcategories covering diverse demands such as preference refinement, trade-off analysis, and compatibility assessment. An evaluation of nine models across three families (GPT, Claude, Gemini) shows that pass rates reach only 57-77% overall. On multi-turn missions, all models score 13-29 points lower on optional above-and-beyond criteria than on required ones, and performance degrades 4-18 points as conversations progress. These gaps show that current models handle basic shopping assistance but fall short of expert-level advice, making Shopping Reasoning Bench a challenging testbed for future shopping assistant development.

2606.12730 2026-06-12 cs.AI cs.CL cs.CY cs.LG 交叉投稿

Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior

重新思考LLMs的心理测量评估:自我报告何时以及为何能预测行为

Rafal Kocielnik, Pengrui Han, Peiyang Song, Myrl G. Marmarelis, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez

发表机构 * Caltech(加州理工学院) UIUC(伊利诺伊大学厄巴纳-香槟分校) University of Cambridge(剑桥大学)

AI总结 研究对比大五人格与计划行为理论,发现LLMs的自我报告-行为一致性存在选择性:在共享对话中TPB达到人类水平,跨对话仅对锚定于训练的行为保持一致性,且角色提示不能使行为对齐。

Comments Accepted as an Oral (Contributed Talk) at the ICML 2026 Workshop on Combining Theory and Benchmarks (CTB)

详情
AI中文摘要

从低成本心理测量探针预测LLM行为倾向对于安全部署至关重要,但前提是自我报告(SR)能可靠地预测行为。近期研究记录了LLMs中显著的SR-行为分离,但依赖于广泛的人格特质(大五),这些特质即使在人类中也只能弱预测特定行为。此外,对话会话的隔离加上弱上下文匹配使得以下问题悬而未决:LLMs是否真正缺乏一致性,或者检测这种一致性所需的条件是否未满足。我们将大五与计划行为理论(TPB)进行对比,后者测量针对特定行为的意图,并且比广泛特质能更好地预测人类行为。我们在四个行为任务和11个前沿LLM上进行实验,同时改变会话上下文和身份诱导。我们发现SR-行为一致性存在但具有选择性。1) 在共享对话中,计划行为理论达到人类水平的一致性;大五则没有。2) 在跨对话中,一致性仅对锚定于即时提示之外的行为(如由训练塑造的内隐偏见)幸存,而当行为被上下文强烈启动(如谄媚)时则崩溃。3) 角色提示使自我报告在对话间更一致,但并未使行为对齐。这些发现表明,粗糙的人格框架(如大五)可能不是测试部署行为的最佳工具。需要更多任务和特定行为的工具,并且即使这些工具也必须在任务和上下文中进行评估。

英文摘要

Anticipating LLM behavioral tendencies from low-cost psychometric probes is critical for safe deployment, but only if self-reports (SR) reliably predict behavior. Recent work documented substantial SR-behavior dissociation in LLMs, but relied on broad personality traits (Big 5) that predict specific behaviors weakly, even in humans. Furthermore, the isolation of conversational sessions combined with weak context matching left open whether LLMs truly lack coherence or whether the conditions needed to detect such coherence were not met. We contrast Big 5 with the Theory of Planned Behavior (TPB), which measures intention targeted to a specific behavior and predicts human behavior substantially better than broad traits. We run experiments across four behavioral tasks and 11 frontier LLMs, while also varying session context and identity induction. We find that SR-behavior coherence exists but is selective. 1) Within a shared conversation, the Theory of Planned Behavior reaches human-level coherence; Big 5 does not. 2) Across separate conversations, coherence survives only for behaviors anchored outside the immediate prompt, such as implicit bias shaped by training, and collapses when behavior is strongly primed by context, as with sycophancy. 3) Persona prompting makes self-reports more consistent across conversations, but does not bring behavior into alignment. These findings suggest that coarse personality frameworks, such as Big 5, may not be the best tools for testing deployment behavior. More task- and behavior-specific instruments are needed, and even these must be evaluated across tasks and contexts.

2606.12736 2026-06-12 cs.AI cs.LG 交叉投稿

Benchmarking AI Agents for Addressing Scientific Challenges Across Scales

跨尺度科学挑战的AI智能体基准测试

Tianyu Liu, Allen Xin Wang, Antonia Panescu, Lisa Xinyi Chen, Wenxin Long, Xinyu Wei, Yueqian Jing, Ziyao Zeng, Jihang Chen, Sihan Jiang, Ziqing Wang, Siyi Gu, Siyu Chen, Xinyang Hu, Haoran Shao, Leqi Xu, Wangjie Zheng, Zhiyuan Cao, Ada Fang, Botao Yu, Kunyang Sun, Rex Ying, Arman Cohan, Qingyu Chen, Lingzhou Xue, Kaize Ding, Yuanqi Du, Wengong Jin, Zhuoran Yang, Marinka Zitnik, James Zou, Hua Xu, Hongyu Zhao

发表机构 * Yale University(耶鲁大学) Broad Institute of MIT and Harvard(布罗德研究所) The Pennsylvania State University(宾夕法尼亚州立大学) Northeastern University(东北大学) Northwestern University(西北大学)

AI总结 提出SciAgentArena基准,含约200个交互式任务,评估AI智能体在真实科研场景中的能力,发现其在数据分析中有效,但在创新探索和开放问题上表现不均。

Comments 6 figures

详情
AI中文摘要

AI智能体正被越来越多地开发用于加速科学发现,但它们在真实研究环境中的实际能力仍知之甚少。现有的AI智能体基准很少捕捉科学工作所需的复杂性、异质性和扩展推理,而科学任务的基准通常将研究简化为静态、直接的问题,并对交互式评估支持有限。在此,我们引入SciAgentArena,这是一个系统性的基准,用于评估AI智能体在来自多个领域新兴需求的真实科学研究场景中的表现。SciAgentArena包含约200个具有逐步验证的任务,以及一个交互式、与智能体无关的环境,用于评估不同的AI智能体。使用该基准,我们发现当前智能体能够有效贡献于明确指定的数据分析工作流,特别是当任务结构和评估标准清晰时。然而,它们在科学情境中的表现仍然不均衡:智能体难以产生真正新颖的见解,维持自主探索,并为开放的研究问题制定稳健的解决方案。我们进一步描述了智能体常见的失败模式,并识别了提高其可靠性、自主性和科学推理能力的机会。总之,SciAgentArena提供了一个实用的框架,用于衡量AI智能体在科学领域的进展,并指导未来能够应对复杂科学挑战的智能体设计。完整代码、任务和数据集可通过此链接访问:https://sciagentarena.github.io/。

英文摘要

AI agents are increasingly being developed to accelerate scientific discovery, yet their practical capabilities in real research settings remain poorly understood. Existing benchmarks for AI agents rarely capture the complexity, heterogeneity, and extended reasoning required by scientific work, whereas benchmarks for scientific tasks often reduce research to static, direct problems and provide limited support for interactive evaluation. Here, we introduce SciAgentArena, a systematic benchmark for evaluating AI agents in real-world scientific research scenarios drawn from emerging needs across multiple domains. SciAgentArena comprises approximately 200 tasks with stepwise verification and an interactive, agent-agnostic environment for assessing diverse AI agents. Using this benchmark, we find that current agents can contribute effectively to well-specified data-analysis workflows, particularly when the task structure and evaluation criteria are clear. However, their performance remains uneven across scientific contexts: agents struggle to generate genuinely novel insights, sustain self-directed exploration, and formulate robust solutions for open-ended research questions. We further characterize common failure modes across agents and identify opportunities for improving their reliability, autonomy, and scientific reasoning. Together, these results show that SciAgentArena provides a practical framework for measuring progress in AI agents for science and for guiding the design of future agents capable of addressing complex scientific challenges. The full code, tasks, and datasets can be accessed at https://sciagentarena.github.io/.

2606.12809 2026-06-12 cs.AI cs.LG 交叉投稿

MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs

MLUBench: 多模态大语言模型终身遗忘评估基准

He Li, Haoang Chi, Qizhou Wang, Yunxin Mao, Zhiheng Zhang, Jie Tan, Tongliang Liu, Wenjing Yang, Bo Han

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MLUBench基准,评估多模态大模型在连续遗忘请求下的性能,发现现有方法存在累积退化,并揭示多模态对齐保持的挑战,提出LUMoE方法缓解退化。

Comments 36 pages, accepted to ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在海量多模态数据上训练,使得数据遗忘变得越来越重要,因为数据所有者可能要求移除特定内容。实际上,这些请求通常随时间顺序到达,引发了MLLM终身遗忘这一具有挑战性的问题。然而,现有大多数基准在规模和范围上有限,未能捕捉MLLM终身遗忘的复杂性。为填补这一空白,我们引入了MLUBench,一个大规模、全面的基准,包含9个类别下的127个实体,用于终身遗忘请求。我们使用MLUBench进行了大量实验,揭示出现有遗忘方法遭受严重且累积的退化。更重要的是,我们进一步识别出该问题的独特挑战:与单模态模型不同,MLLM终身遗忘受到保持多模态对齐需求的约束。持续从一种模态遗忘可能会退化整个模型。为缓解这一挑战,我们提出了LUMoE,一种有效方法。实验表明,LUMoE显著缓解了基线方法面临的退化问题。源代码和MLUBench数据集已在 https://github.com/lihe-maxsize/Lifelong_Unlearning_main 开源。

英文摘要

Multimodal large language models (MLLMs) are trained on massive multimodal data, making data unlearning increasingly important as data owners may request the removal of specific content. In practice, these requests often arrive sequentially over time, giving rise to the challenging problem of MLLM lifelong unlearning. However, most existing benchmarks are limited in scale and scope, failing to capture the complexities of MLLM lifelong unlearning. To fill this gap, we introduce MLUBench, a large-scale and comprehensive benchmark featuring 127 entities across 9 classes under lifelong unlearning requests. We perform extensive experiments using MLUBench and reveal that existing unlearning methods suffer from severe, cumulative degradation. More critically, we further identify the unique challenge of this problem: unlike in unimodal models, MLLM lifelong unlearning is constrained by the need to preserve multimodal alignment. Continually unlearning from one modality could degrade the entire model. To alleviate this challenge, we propose LUMoE, an effective method. Experiments demonstrate that LUMoE significantly mitigates the degradation problem faced by baselines. The source code and the MLUBench dataset are open-sourced at https://github.com/lihe-maxsize/Lifelong_Unlearning_main.

2606.12953 2026-06-12 cs.AI cs.CV cs.LG eess.IV 交叉投稿

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

OpenMedQ:面向医学视觉语言模型的广泛开放预训练

Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

发表机构 * Stanford University(斯坦福大学) Stanford University School of Medicine(斯坦福大学医学院) Ghent University(根特大学)

AI总结 提出OpenMedQ,在14个数据集(约335万样本)上预训练医学视觉语言模型,在PathVQA上BLEU-1达75.9,超越562B参数的Med-PaLM M,并在8个未见医学分类任务上取得最高平均macro-F1(0.757)。

Comments Medical Imaging with Deep Learning (MIDL) 2026, Short Paper Track

详情
AI中文摘要

我们提出OpenMedQ,一个在迄今为止最广泛的完全开放医学混合数据集上预训练的医学视觉语言模型:包含14个数据集,总计约335万预训练样本,涵盖病理学、放射学、显微镜和纯文本临床问答。OpenMedQ在PathVQA上达到最先进的BLEU-1(75.9),击败了参数多达562B(约大80倍)的Med-PaLM M变体,并在VQA-MED上匹配了最佳报告的BLEU-1(64.5)。其视觉编码器在相同的下游配方下迁移到8个未见过的医学分类基准,获得了最高的平均macro-F1(0.757),优于BiomedCLIP(0.745)、PMC-CLIP(0.745)、PubMedCLIP(0.746)和从头训练的基线(0.616)。我们公开了代码,并提供了一个交互式演示,作为社区的可复现基线。

英文摘要

We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757), ahead of BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code and a publicly available interactive demo as a reproducible baseline for the community.

2606.13629 2026-06-12 stat.ME cs.AI cs.LG stat.ML 交叉投稿

Valid Inference with Synthetic Data via Task Exchangeability

通过任务可交换性实现基于合成数据的有效推断

Lezhi Tan, Tijana Zrnic

AI总结 提出任务可交换性条件,确保在科学研究中使用合成数据进行统计推断的有效性,并给出在民意调查和AI评估中的应用。

详情
AI中文摘要

越来越多的工作主张在科学研究中使用合成数据。例如,社会科学家主张在试点研究中使用LLM生成的“硅样本”;AI评估越来越依赖“LLM作为裁判”的输出;蛋白质组学研究通过生成合成蛋白质结构的生成模型加速。这些发展引发了一个有趣的可能性:合成数据可以帮助研究人员提出更多问题、进行更多研究并加速发现。但它们也引发了一个根本性的担忧:合成数据可能有偏、有噪声且设定错误。在这项工作中,我们提出了在科学研究中使用合成数据的统计原则,并具有可证明的有效性保证。关键见解是一个我们称为任务可交换性的新技术条件。非正式地说,这是一个要求,即研究人员可以识别出有真实数据可用的历史任务,使得他们当前感兴趣的任务与历史任务在适当的数学意义上可交换。我们开发了在任务可交换性下进行有效推断的方法,以及即使在可交换性之外也能提供保证的扩展。我们通过硅样本的民意调查和自动评分器的AI评估来展示该框架。

英文摘要

There is a proliferation of work arguing for the use of synthetic data in scientific research. For example, social scientists are arguing for the use of LLM-generated "silicon samples" in pilot studies; AI evaluations increasingly rely on "LLM-as-a-judge" outputs; and proteomics research is accelerated by generative models that produce synthetic protein structures. These developments raise an intriguing possibility: synthetic data may help researchers ask more questions, run more studies, and accelerate discovery. But they also raise a fundamental concern: synthetic data can be biased, noisy, and misspecified. In this work, we propose statistical principles for using synthetic data in scientific research with provable validity guarantees. The key insight is a new technical condition that we call task exchangeability. Informally, this is a requirement that the researcher can identify historical tasks, for which real data is available, such that their current task of interest is exchangeable with the historical tasks in an appropriate mathematical sense. We develop methods for valid inference under task exchangeability, together with extensions that provide guarantees even beyond exchangeability. We demonstrate the framework on public opinion surveys with silicon samples and AI evaluation with autoraters.

2606.13647 2026-06-12 cs.CL cs.AI cs.LG 交叉投稿

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

SkMTEB:斯洛伐克大规模文本嵌入基准与模型适配

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová

发表机构 * Comenius University in Bratislava(布拉迪斯拉发夸美纽斯大学) Cisco Systems(思科系统) Technical University of Košice(科希策技术大学) Kempelen Institute of Intelligent Technologies(肯佩伦智能技术研究所)

AI总结 针对低资源西斯拉夫语斯洛伐克语,构建首个MTEB风格文本嵌入基准SkMTEB(含31个数据集、7类任务),并开发高效本地部署模型e5-sk-small/large,通过词汇裁剪与微调在参数减少62%下达到与商业API相当的竞争力。

Comments ACL 2026

详情
AI中文摘要

我们介绍了SkMTEB,这是首个针对斯洛伐克语(一种低资源西斯拉夫语)的全面MTEB风格文本嵌入基准,包含31个数据集,覆盖7种任务类型,几乎是现有斯洛伐克语多语言基准覆盖深度的4倍。我们对31个嵌入模型的评估表明,大型指令调优多语言模型表现最强,而现有的针对NLU任务训练的斯洛伐克语特定模型在嵌入任务上迁移效果不佳。为了满足高效、可本地部署的斯洛伐克语嵌入需求,我们通过对多语言E5模型进行词汇裁剪和微调,开发了\texttt{e5-sk-small}(45M参数)和\texttt{e5-sk-large}(365M)模型。尽管模型尺寸缩小了高达62%,我们的开源模型在性能上与专有API相当,同时仍可本地部署用于语义搜索和检索增强生成(RAG)。我们公开了基准、模型、数据集和代码,希望我们的方法能为其他资源匮乏的语言提供可复现的路径。

英文摘要

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

2606.13649 2026-06-12 cs.CL cs.LG 交叉投稿

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

Operadic一致性:LLM中组合推理失败的无标签信号

Nathaniel Bottman, Yinhong Liu, Kyle Richardson

发表机构 * Incubilate University of Cambridge(剑桥大学) Allen Institute for Artificial Intelligence(艾伦人工智能研究所)

AI总结 提出Operadic一致性(OC)作为检测大语言模型组合推理失败的无标签信号,在四个多跳QA数据集上与准确率强相关(Pearson r≥0.86),优于自一致性等方法。

详情
AI中文摘要

在推理时检测LLM推理失败而无需真实标签,催生了广泛的置信度基线,包括自一致性、语义熵和P(True),这些方法基于问题内采样和自我评估。Operad理论,即通过迭代替换构建系统的形式化方法,提出了一种补充性诊断:模型对组合查询的直接回答应与通过组合同一查询的分解陈述所产生的回答一致。我们将这一思想实例化为Operadic一致性(OC),一个每问题信号。在四个多跳QA数据集上的十二个指令微调LLM(4B到671B参数,开源和闭源)上,OC与每个数据集上的准确率强相关(Pearson r ∈ [0.86, 0.94],所有p ≤ 0.0004),并且是我们评估的所有信号中唯一在所有四个数据集上均匀达到r ≥ 0.85的信号。思维链自一致性(CoT-SC;Wang等人,2023)在HotpotQA和DROP上与OC匹配(r = 0.93, 0.87),但在MuSiQue和StrategyQA上降至r ≈ 0.45。在每问题层面,OC在每个数据集上提供了超出CoT-SC和语义熵的信息(OC系数的聚类稳健p ≤ 10^{-16}),并且该结论在额外控制构造的分解感知基线时依然稳健(p ≤ 10^{-13})。相同的信号在等成本K = 3预算下,相对于调优的CoT-SC基线产生了选择性预测改进(固定覆盖率下的准确率提升)(AUARC提升+0.086至+0.096,AUROC提升+0.092至+0.164;95%置信区间在每个单元上排除零)。在五个前沿思维模型上,其中分解从模型自身的思维链中提取,相同的等成本比较在所有测试的16个(数据集、预算、指标)单元上给出了正的选择性预测点估计提升,其中12个单元的95%置信区间排除零。

英文摘要

Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
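The agreement check described in the abstract above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `{prev}` template convention, the string normalization, and scoring OC as an agreement fraction over several composed answers are all assumptions introduced here, and `answer_fn` stands in for any model call.

```python
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(answer.lower().split())


def compose(answer_fn: Callable[[str], str], sub_questions: Sequence[str]) -> str:
    """Answer a decomposition step by step (hypothetical convention).

    Each sub-question may contain the placeholder {prev}, which is filled
    with the answer to the previous step.
    """
    prev = ""
    for template in sub_questions:
        prev = answer_fn(template.format(prev=prev))
    return prev


def operadic_consistency(direct_answer: str, composed_answers: Sequence[str]) -> float:
    """Fraction of composed answers that agree with the direct answer (assumed scoring rule)."""
    if not composed_answers:
        raise ValueError("need at least one composed answer")
    target = normalize(direct_answer)
    agree = sum(normalize(a) == target for a in composed_answers)
    return agree / len(composed_answers)
```

In use, `answer_fn` would be called once on the full query and once per decomposition step; a low score flags a question on which the direct and composed routes disagree, without needing a ground-truth label.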

2304.13836 2026-06-12 cs.LG cs.AI cs.CV stat.ME 版本更新

On Pitfalls of $\textit{RemOve-And-Retrain}$: Data Processing Inequality Perspective

论 $\textit{RemOve-And-Retrain}$ 的陷阱:数据处理不等式视角

Junhwa Song, Keumgang Cha, Junghoon Seo

发表机构 * KAIST(韩国科学技术院)

AI总结 从信息论角度揭示ROAR基准的缺陷:数据无关的后处理可提升ROAR分数,导致对归因图信息量的误判,并发现模糊性偏差。

Comments Accepted at the 2026 ICML Workshop on Mechanistic Interpretability

详情
AI中文摘要

RemOve-And-Retrain (ROAR) 基准被广泛用于评估特征归因方法,但其有效性尚未从信息论角度得到充分探索。我们证明,对归因图进行模型和数据无关的后处理(通过数据处理不等式,这些变换\emph{不能}增加关于决策函数的信息)通常可以改善ROAR分数。这意味着ROAR排名的提升本身并不能证明归因图携带更多关于模型的信息。我们将这种失败模式归因于对空间模糊掩膜的偏好。在CIFAR-10、SVHN和CUB-200上的实验显示,模糊度与ROAR性能之间存在一致的关联,这种模式也出现在ROAD变体中。我们为更谨慎的基于移除的基准测试提供了指导方针,这对验证神经网络内部机制的机械理解具有重要意义。

英文摘要

The RemOve-And-Retrain (ROAR) benchmark is widely used to evaluate feature attribution methods, yet its validity remains underexplored from an information-theoretic perspective. We show that model- and data-agnostic post-processing of attribution maps (transformations that, by the data processing inequality, \emph{cannot} add information about the decision function) can often improve ROAR scores. This means that an improved ROAR ranking is not, by itself, evidence that an attribution map carries more information about the model. We trace this failure mode to a bias toward spatially blurry masks. Experiments on CIFAR-10, SVHN, and CUB-200 show a consistent association between blurriness and ROAR performance, a pattern that also appears in the ROAD variant. We provide guidelines for more cautious removal-based benchmarking, with implications for validating mechanistic understanding of neural network internals.

2603.14407 2026-06-12 cs.LG 版本更新

Towards One-for-All Anomaly Detection for Tabular Data

面向表格数据的通用异常检测

Shiyuan Li, Yixin Liu, Yu Zheng, Xiaofeng Cao, Shirui Pan, Heng Tao Shen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出OFA-TAD框架,通过多视图邻居距离表示和混合专家评分网络,实现跨领域表格异常检测的通用化,一次训练即可泛化到未见数据集。

Comments Accepted by ICML 2026

详情
AI中文摘要

表格异常检测(TAD)旨在识别表格数据中偏离大多数样本的样本,在许多实际应用中至关重要。然而,现有方法遵循“一个数据集一个模型(OFO)”范式,依赖于数据集特定的训练,导致计算成本高且对未见领域的泛化能力有限。为解决这些局限性,我们提出OFA-TAD,一个通用的“一劳永逸(OFA)”TAD框架,只需在多个源数据集上进行一次训练,即可即时泛化到来自不同领域的未见数据集。为实现通用表格异常检测,OFA-TAD提取邻居距离模式作为可迁移线索,并引入来自多个变换诱导度量空间的多视图邻居距离表示,以减轻距离分布对变换的敏感性。为自适应组合多视图距离证据,采用混合专家(MoE)评分网络进行视图特定异常评分和熵正则化门控融合,并采用多策略异常合成机制以支持单类约束下的训练。在来自14个领域的34个数据集上的大量实验表明,OFA-TAD在严格的OFA设置下实现了优越的异常检测性能和强大的跨领域泛化能力。源代码见: https://github.com/Shiy-Li/OFA-TAD 。

英文摘要

Tabular anomaly detection (TAD) aims to identify samples that deviate from the majority in tabular data and is critical in many real-world applications. However, existing methods follow a "one model for one dataset (OFO)" paradigm, which relies on dataset-specific training and thus incurs high computational cost and yields limited generalization to unseen domains. To address these limitations, we propose OFA-TAD, a generalist one-for-all (OFA) TAD framework that only requires one-time training on multiple source datasets and can generalize to unseen datasets from diverse domains on the fly. To realize one-for-all tabular anomaly detection, OFA-TAD extracts neighbor-distance patterns as transferable cues, and introduces multi-view neighbor-distance representations from multiple transformation-induced metric spaces to mitigate the transformation sensitivity of distance profiles. To adaptively combine multi-view distance evidence, a Mixture-of-Experts (MoE) scoring network is employed for view-specific anomaly scoring and entropy-regularized gated fusion, with a multi-strategy anomaly synthesis mechanism to support training under the one-class constraint. Extensive experiments on 34 datasets from 14 domains demonstrate that OFA-TAD achieves superior anomaly detection performance and strong cross-domain generalizability under the strict OFA setting. The source code is available at https://github.com/Shiy-Li/OFA-TAD.
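To make the idea of a neighbor-distance profile concrete, here is a minimal single-view sketch. It is not taken from the OFA-TAD code base: the function name, the Euclidean metric, and the brute-force search are assumptions chosen for brevity, and the multi-view transformations and MoE scoring from the abstract are left out.

```python
import math
from typing import List, Sequence


def neighbor_distance_profile(points: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """For each sample, return the sorted distances to its k nearest neighbors.

    The profile depends only on distances, not on the meaning of individual
    columns, which is what makes it a candidate dataset-agnostic feature.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    profiles = []
    for i, p in enumerate(points):
        # Brute-force distances to every other sample (fine for a sketch).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        profiles.append(dists[:k])
    return profiles
```

An isolated sample shows up as a profile with uniformly large entries, while samples inside a dense cluster have small ones.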

2605.20763 2026-06-12 cs.LG 版本更新

ShapeBench: A Scalable Benchmark and Diagnostic Suite for Standardized Evaluation in Aerodynamic Shape Optimization

ShapeBench: 一种可扩展的基准和诊断套件,用于气动形状优化的标准化评估

Shaghayegh Fazliani, Krissh Chawla, Jack Guo, Yiren Shen, Matthias Ihme, Madeleine Udell

发表机构 * Stanford University(斯坦福大学) Spinoza Labs(斯皮诺扎实验室)

AI总结 本文提出ShapeBench,一个开源的气动形状优化基准,提供统一的API,涵盖103个任务和八个形状类别,通过验证的代理模型和高保真CFD流程进行系统分析,展示了不同形状类别和问题形式中优化器排名的显著差异,强调了需要更通用方法的必要性。

详情
AI中文摘要

气动形状优化(ASO)的快速进展已超过了目前可用的标准化评估框架。公平比较需要一个覆盖多样形状类别、目标公式和匹配预算的统一基准。我们引入ShapeBench,一个开源的ASO基准,涵盖103个任务,跨越八个形状类别和多种优化模式。每个ShapeBench任务包括经过验证的代理模型以实现快速搜索;当可行时,提供高保真计算流体动力学(CFD)流程用于最终验证,从而实现系统化的保真度差距分析。ShapeBench提供可重复的协议和配置良好的基线,以使用一致的预算度量进行公平比较,允许在经典方法和LLM驱动方法之间进行比较,包括通用优化器和一个新的领域专用进化LLM基线,ShapeEvolve。在ShapeBench上的结果展示了不同形状类别和问题形式中优化器排名的显著差异,平均成对斯皮尔曼ρ=0.013,因此单任务结论无法可靠地推广到问题类别中。该基准还远未饱和;经典方法很少能适用于所有形状类别和任务,进一步强调了需要更通用方法的必要性。

英文摘要

Rapid progress in aerodynamic shape optimization (ASO) has outpaced currently available standardized evaluation frameworks. Fair comparison requires a unified benchmark spanning diverse shape classes, objective formulations, and matched-budget state-of-the-art baselines. We introduce ShapeBench, an open-source ASO benchmark with a unified API spanning 103 tasks across eight shape categories and multiple optimization regimes. Each ShapeBench task includes a validated surrogate for fast search; when feasible, a high-fidelity Computational Fluid Dynamics (CFD) pipeline for final verification is available, enabling systematic fidelity-gap analysis. ShapeBench provides a reproducible protocol with well-configured baselines to compare fairly using a consistent budget metric, allowing for comparison among both classical and LLM-driven methods, including general-purpose optimizers and a new domain-specialized evolutionary LLM baseline, ShapeEvolve. Results on ShapeBench demonstrate substantial variance in optimizer rankings across shape categories and problem formulations, with mean pairwise Spearman $\rho = 0.013$, so single-task conclusions do not reliably generalize across problem classes. The benchmark is also far from saturation; classical methods are rarely applicable across all shape categories and tasks, further highlighting the need for more general-purpose approaches.
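The rank-agreement statistic quoted above can be illustrated with a short sketch. The abstract does not say how the pairs are formed; the code below assumes one ranking of the same optimizers per task (or category) and averages Spearman's rho over all pairs of rankings, using the tie-free closed form. The function names are illustrative only.

```python
from itertools import combinations
from typing import Sequence


def spearman_rho(rank_a: Sequence[int], rank_b: Sequence[int]) -> float:
    """Spearman's rho for two tie-free rankings of the same n items."""
    n = len(rank_a)
    if n != len(rank_b) or n < 2:
        raise ValueError("rankings must have equal length >= 2")
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


def mean_pairwise_spearman(rankings: Sequence[Sequence[int]]) -> float:
    """Average rho over every unordered pair of rankings."""
    pairs = list(combinations(rankings, 2))
    if not pairs:
        raise ValueError("need at least two rankings")
    return sum(spearman_rho(a, b) for a, b in pairs) / len(pairs)
```

A value near 1 would mean optimizers are ordered the same way everywhere; a value near 0, as reported, means a ranking on one task says little about the ranking on another.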

2606.10642 2026-06-12 cs.LG physics.ao-ph 版本更新

PhysMetrics.Weather: An Evaluation Framework for Physical Consistency in ML Weather Models

PhysMetrics.Weather: 机器学习天气模型中物理一致性的评估框架

Emma Kasteleyn, Timo Maier, Axel Lauer, Veronika Eyring, Pierre Gentine, Ana Lucic

AI总结 提出PhysMetrics.Weather评估框架,通过守恒、谱和动力学三类指标量化MLWP模型的物理真实性,指导物理信息架构开发并评估其运行可靠性。

Comments Preprint

详情
AI中文摘要

机器学习天气预测(MLWP)模型以传统基于物理方法所需计算成本的一小部分实现了令人印象深刻的预测性能。然而,它们主要是(1)数据驱动的,并且(2)使用逐像素误差指标(例如RMSE)进行评估,因此无法保证其预测与已知物理定律一致。我们介绍了PhysMetrics.Weather,这是一个评估框架,通过三类指标(守恒、谱和动力学)评估MLWP模型的物理真实性。通过量化物理真实性,该工具指导物理信息架构的开发,并帮助评估MLWP模型是否可用于运行。我们的框架可在GitHub上获取,网址为 https://github.com/Emmakast/PhysMetrics.Weather 。

英文摘要

Machine learning weather prediction (MLWP) models have achieved impressive forecasting performance at a small fraction of the computational costs required for traditional physics-based methods. However, they are primarily (1) data-driven and (2) evaluated using pixel-wise error metrics (e.g., RMSE), so there are no guarantees that their forecasts are consistent with known physical laws. We introduce PhysMetrics.Weather, an evaluation framework that assesses the physical realism of MLWP models across three types of metrics: conservation, spectral, and dynamical. By quantifying physical realism, this tool guides the development of physics-informed architectures and helps evaluate whether MLWP models are reliable for operational use. Our framework is available on GitHub at https://github.com/Emmakast/PhysMetrics.Weather.

2510.16380 2026-06-12 cs.CL cs.AI cs.CY cs.HC cs.LG 版本更新

MoReBench: Evaluating Procedural and Pluralistic Moral Reasoning in Language Models, More than Outcomes

MoReBench:评估语言模型中的程序性和多元道德推理,超越结果

Yu Ying Chiu, Michael S. Lee, Rachel Calcott, Brandon Handoko, Paul de Font-Reaulx, Raphaël Millière, Paula Rodriguez, Chen Bo Calvin Zhang, Ziwen Han, Udari Madhushani Sehwag, Yash Maurya, Christina Q Knight, Harry R. Lloyd, Florence Bacus, Conor Downey, Mantas Mazeika, Bing Liu, Yejin Choi, Mitchell L Gordon, Sydney Levine

发表机构 * University of Washington(华盛顿大学) New York University(纽约大学) Scale AI Harvard University(哈佛大学) University of Michigan(密歇根大学) UNC Chapel Hill(北卡罗来纳大学教堂山分校) Center for AI Safety(人工智能安全中心) Stanford University(斯坦福大学) MIT(麻省理工学院) University of Oxford(牛津大学)

AI总结 提出MoReBench基准,包含1000个道德场景和超过2.3万条标准,用于评估语言模型在道德推理中的程序性推理能力,发现现有基准无法预测模型表现,且模型对特定道德框架存在偏好。

Comments 46 pages, 8 figures, 10 tables. Published in ICLR 2026. Accepted at CHAI workshop and SPP 2026 (non-archival)

详情
AI中文摘要

随着人工智能系统的进步,我们越来越依赖它们与我们共同或代替我们做出决策。为了确保这些决策符合人类价值观,我们不仅需要理解它们做出了什么决策,还需要理解它们如何得出这些决策。推理语言模型能够提供最终响应和(部分透明的)中间思考轨迹,这为研究AI的程序性推理提供了及时的机会。与通常有客观正确答案的数学和代码问题不同,道德困境是过程导向评估的绝佳测试平台,因为它们允许多种可辩护的结论。为此,我们提出了MoReBench:包含1000个道德场景,每个场景配有一组专家认为在推理该场景时必须包含(或避免)的评分标准。MoReBench包含超过2.3万条标准,包括识别道德考量、权衡利弊以及给出可操作的建议,覆盖了AI为人类道德决策提供建议以及自主做出道德决策的情况。此外,我们整理了MoReBench-Theory:150个示例,用于测试AI是否能在规范伦理学的五个主要框架下进行推理。我们的结果表明,规模定律以及现有的数学、代码和科学推理任务基准无法预测模型进行道德推理的能力。模型还显示出对特定道德框架(例如边沁式的行为功利主义和康德义务论)的偏好,这可能是流行训练范式的副作用。这些基准共同推动了面向过程推理的评估,以实现更安全、更透明的AI。

英文摘要

As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative for us to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To this end, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23 thousand criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering cases of AI advising humans on moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.

2602.13379 2026-06-12 cs.CR cs.AI cs.CL cs.LG cs.SE 版本更新

Unsafer in Many Turns: Benchmarking and Defending Multi-Turn Safety Risks in Tool-Using Agents

多轮交互中的安全隐患:工具使用智能体的多轮安全风险基准与防御

Xu Li, Simon Yu, Minzhou Pan, Yiyou Sun, Bo Li, Dawn Song, Xue Lin, Weiyan Shi

发表机构 * Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校)

AI总结 提出多轮工具使用安全基准MT-AgentRisk,发现多轮设置下攻击成功率平均增加16%,并设计无训练、与工具无关的自探索防御方法ToolShield,平均降低30%攻击成功率。

详情
AI中文摘要

基于LLM的智能体能力日益增强,但其安全性滞后。这造成了智能体能够做什么和应该做什么之间的差距。随着智能体进行多轮交互并使用多样化的工具,这一差距扩大,引入了现有基准忽视的新风险。为了系统地将安全测试扩展到多轮、工具真实的设置,我们提出一个原则性的分类法,将单轮有害任务转化为多轮攻击序列。利用该分类法,我们构建了MT-AgentRisk(多轮智能体风险基准),这是首个评估多轮工具使用智能体安全性的基准。我们的实验揭示了显著的安全退化:在开放和封闭模型的多轮设置中,攻击成功率(ASR)平均增加16%。为了缩小这一差距,我们提出了ToolShield,一种无需训练、与工具无关的自我探索防御方法:当遇到新工具时,智能体自主生成测试用例,执行它们以观察下游效果,并提炼安全经验用于部署。实验表明,ToolShield在多轮交互中平均有效降低ASR 30%。我们的代码可在 https://github.com/CHATS-lab/ToolShield 获取。

英文摘要

LLM-based agents are becoming increasingly capable, yet their safety lags behind. This creates a gap between what agents can do and should do. This gap widens as agents engage in multi-turn interactions and employ diverse tools, introducing new risks overlooked by existing benchmarks. To systematically scale safety testing into multi-turn, tool-realistic settings, we propose a principled taxonomy that transforms single-turn harmful tasks into multi-turn attack sequences. Using this taxonomy, we construct MT-AgentRisk (Multi-Turn Agent Risk Benchmark), the first benchmark to evaluate multi-turn tool-using agent safety. Our experiments reveal substantial safety degradation: the Attack Success Rate (ASR) increases by 16% on average across open and closed models in multi-turn settings. To close this gap, we propose ToolShield, a training-free, tool-agnostic, self-exploration defense: when encountering a new tool, the agent autonomously generates test cases, executes them to observe downstream effects, and distills safety experiences for deployment. Experiments show that ToolShield effectively reduces ASR by 30% on average in multi-turn interactions. Our code is available at https://github.com/CHATS-lab/ToolShield.

2602.14367 2026-06-12 cs.CL cs.AI cs.IR cs.LG 版本更新

InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

InnoEval:将研究思路评估视为基于知识的多视角推理问题

Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, Emine Yilmaz

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出InnoEval框架,通过异构深度知识检索和多视角评审委员会,实现基于知识的多维度解耦评估,在点对点、成对和分组评估任务中优于基线方法。

Comments ICML 2026

详情
AI中文摘要

大型语言模型的快速发展催生了科学思路的激增,但这一飞跃并未伴随思路评估的相应进步。科学评估的基本性质需要知识基础、集体审议和多标准决策。然而,现有的思路评估方法往往存在知识视野狭窄、评估维度扁平化以及LLM作为评判者的固有偏见。为解决这些问题,我们将思路评估视为一个基于知识的多视角推理问题,并引入InnoEval,一个深度创新评估框架,旨在模拟人类水平的思路评估。我们应用了一个异构深度知识搜索引擎,从多样化的在线来源中检索和获取动态证据。我们进一步通过一个包含不同学术背景的评审员的创新评审委员会实现评审共识,从而在多个指标上进行多维解耦评估。我们构建了来自权威同行评审提交的全面数据集,以基准测试InnoEval。实验表明,InnoEval在点对点、成对和分组评估任务中始终优于基线方法,展现出与人类专家高度一致的判断模式和共识。

英文摘要

The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. Scientific evaluation fundamentally requires knowledge grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these issues, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

2603.00610 2026-06-12 cs.SD cs.AI cs.LG cs.MM eess.AS 版本更新

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

CMI-RewardBench: 基于组合多模态指令评估音乐奖励模型

Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshuo Ding, Yizhi Li, Ruibin Yuan, Simon Dixon, Emmanouil Benetos

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 针对音乐生成模型缺乏有效评估机制的问题,提出CMI-RewardBench基准,包含大规模偏好数据集和参数高效奖励模型,实现多模态指令下的音乐质量评估。

Comments Accepted by ICML 2026

详情
AI中文摘要

虽然音乐生成模型已经发展到能够处理混合文本、歌词和参考音频的复杂多模态输入,但评估机制却滞后了。在本文中,我们通过为组合多模态指令(CMI)下的音乐奖励建模建立了一个全面的生态系统来弥补这一关键差距,其中生成的音乐可能以文本描述、歌词和音频提示为条件。我们首先引入了CMI-Pref-Pseudo,一个包含11万个伪标签样本的大规模偏好数据集,以及CMI-Pref,一个针对细粒度对齐任务量身定制的高质量人工标注语料库。为了统一评估格局,我们提出了CMI-RewardBench,一个统一的基准,用于评估音乐奖励模型在音乐性、文本-音乐对齐和组合指令对齐方面的异质样本。利用这些资源,我们开发了CMI奖励模型(CMI-RMs),一个能够处理异质输入的参数高效奖励模型家族。我们评估了它们与人类判断分数在音乐性和对齐方面的相关性,使用了CMI-Pref以及之前的数据集。进一步的实验表明,CMI-RM不仅与人类判断高度相关,而且通过top-k过滤实现了有效的推理时扩展。代码可在GitHub( https://github.com/Haiwen-Xia/CMI-RewardBench )获取。模型权重:CMI-RM( https://huggingface.co/HaiwenXia/CMI-RM )。数据集:CMI-Pref-Pseudo( https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo )和CMI-Pref( https://huggingface.co/datasets/HaiwenXia/cmi-pref )。

英文摘要

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. Code is available at GitHub (https://github.com/Haiwen-Xia/CMI-RewardBench). Model weights: CMI-RM (https://huggingface.co/HaiwenXia/CMI-RM). Datasets: CMI-Pref-Pseudo (https://huggingface.co/datasets/HaiwenXia/cmi-pref-pseudo) and CMI-Pref (https://huggingface.co/datasets/HaiwenXia/cmi-pref)

2605.17062 2026-06-12 cs.CR cs.LG cs.SE 版本更新

The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort

范围缩小,威胁依旧:重新评估2026前沿模型队列上的LLM包幻觉

Aleksandr Churilov

AI总结 本文在2026年前沿模型队列上重新评估了LLM的包幻觉现象,发现幻觉率虽有所降低但威胁依然存在:所有被评估模型一致生成的包名共127个(109个在PyPI,18个在npm),经协调披露后其中53个(41个在PyPI,12个在npm)仍可被攻击者注册,构成跨模型的供应链攻击面;同时发现Python与JavaScript幻觉的不对称性,以及DeepSeek V3.2与GPT-5.4-mini之间的高相似性。

Comments 13 pages, 3 figures, 4 tables. v2: incorporates coordinated-disclosure feedback from PyPI Security and Socket.dev; registrable attack surface refined to 53 names (41 PyPI, 12 npm). Headline rates unchanged. Replication of Spracklen et al. (USENIX Security 2025). Data and code: https://github.com/churik5/slopsquatting-replication-2026 and https://doi.org/10.5281/zenodo.19859120

详情
AI中文摘要

Spracklen等人(USENIX Security '25)表明,生成代码的大型语言模型会幻觉出PyPI或npm上并不存在的包名,幻觉率从商业模型的5.2%到开源模型的21.7%不等,从而为slopsquatting(以幻觉包名注册恶意包)提供了攻击面。我们在2025年10月至2026年3月期间发布的五款具备代码能力的前沿LLM上复现了他们的方法:Claude Sonnet 4.6、Claude Haiku 4.5、GPT-5.4-mini、Gemini 2.5 Pro和DeepSeek V3.2。在199,845个对照PyPI和npm主列表验证的成对Python和JavaScript提示上,我们测得总体幻觉率在4.62%(Claude Haiku 4.5)到6.10%(GPT-5.4-mini)之间:与Spracklen观察到的模型间差距相比压缩了一个数量级,但威胁并未消失。除复现之外,我们识别出一组127个包名(109个在PyPI,18个在npm),五个被评估模型都会一致地虚构这些包名;在与PyPI Security和Socket.dev协调披露之后,其中53个(41个在PyPI,12个在npm)在各注册表现有防御措施下仍可被攻击者注册,构成了单一模型研究无法揭示的、与模型无关的供应链攻击面。我们进一步记录了Python高于JavaScript的幻觉不对称性(与Spracklen 2024年的发现相反),识别出Anthropic家族内部Haiku低于Sonnet的倒置现象,并观察到DeepSeek V3.2与GPT-5.4-mini之间的Jaccard相似度峰值(J = 0.343),这提示二者的训练数据可能同源。

英文摘要

Spracklen et al. (USENIX Security '25) showed that code-generating large language models hallucinate package names that do not exist on PyPI or npm at rates ranging from 5.2% on commercial models to 21.7% on open-source models, creating an attack surface for slopsquatting -- the registration of malicious packages under hallucinated names. We replicate their methodology on five frontier code-capable LLMs released between October 2025 and March 2026: Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2. Across 199,845 paired Python and JavaScript prompts validated against PyPI and npm master lists, we measure overall hallucination rates between 4.62% (Claude Haiku 4.5) and 6.10% (GPT-5.4-mini) -- an order-of-magnitude compression of the inter-model spread observed by Spracklen, but not a retirement of the threat. Beyond replication, we identify a set of 127 package names (109 on PyPI, 18 on npm) that all five evaluated models invent identically; following coordinated disclosure with PyPI Security and Socket.dev, 53 of these (41 on PyPI, 12 on npm) remain registrable by an attacker after each registry's existing defenses, constituting a model-agnostic supply-chain attack surface that no single-model study can reveal. We further document a Python-over-JavaScript hallucination asymmetry that inverts Spracklen's 2024 finding, identify a Haiku-below-Sonnet inversion within the Anthropic family, and observe a Jaccard-similarity peak between DeepSeek V3.2 and GPT-5.4-mini (J = 0.343) suggestive of shared training-data origins.
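The two set computations mentioned in the abstract above, the names invented identically by every model and the Jaccard similarity between two models' hallucination sets, can be sketched as below. This is an illustrative sketch only, not the replication's code; the function names are assumptions, and any package names passed in would come from the study's released data rather than from this example.

```python
from typing import Dict, Iterable, Set


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two sets of package names."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(union)


def shared_by_all(per_model: Dict[str, Iterable[str]]) -> Set[str]:
    """Names hallucinated by every model: the model-agnostic candidate set."""
    sets = [set(names) for names in per_model.values()]
    if not sets:
        return set()
    return set.intersection(*sets)
```

Filtering the shared set further, for example against names a registry already blocks, is what reduces the 127 shared names to the 53 that remain registrable; that step depends on registry data and is not modeled here.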

2606.01538 2026-06-12 cs.GR cs.CV cs.LG 版本更新

MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics

MPMWorlds: 用于推断和外推物理动力学的物质点法模拟

Žiga Kovačič, Kevin Ellis

发表机构 * Cornell University(康奈尔大学)

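AI Summary Builds a dataset of 2D Material Point Method (MPM) simulations to study how well physical dynamics can be inferred from video and extrapolated forward in time, comparing the strengths and weaknesses of code generation and video diffusion approaches.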

Comments 16 pages, 13 figures. Project page: https://zzigak.github.io/mpmworlds/

AI中文摘要

为了研究从视频推断物理动力学并将其向前外推的能力,我们组装了一个包含丰富物理现象(如可变形物体、流体、运动物体和发射器)的2D物质点法(MPM)物理模拟数据集。我们在此数据集上研究了代码生成和视频扩散方法,通过改变物理相关辅助信息的数量来识别它们的优缺点。代码生成模型除了提供自动合成MPM模拟的工作演示外,还揭示了这种方法在从视觉输入推断物理参数方面存在困难,但相对于视频扩散,它能产生物理和时间上稳定的向前外推结果,而视频扩散模型能更强烈地从视觉输入中识别几何属性,但会产生物理上不可信的外推结果。

英文摘要

To study the ability to infer physical dynamics from videos and extrapolate them forward in time, we assemble a dataset of 2D Material Point Method (MPM) physical simulations covering rich physical phenomena such as deformable objects, fluids, kinetic objects, and emitters. We study code generation and video diffusion approaches on this dataset, identifying their strengths and weaknesses by varying the amount of physically relevant side information. The code generation model, beyond giving a working demonstration of automatic synthesis of MPM simulations, reveals that such an approach struggles with inferring physical parameters from visual input, but relative to video diffusion, produces physically and temporally stable extrapolations forward in time, while the video diffusion model more strongly identifies geometric properties from visual input but produces physically implausible extrapolations.

2606.04525 2026-06-12 cs.CL cs.LG q-bio.GN 版本更新

GENEB: Why Genomic Models Are Hard to Compare

GENEB:为什么基因组模型难以比较

Daria Ledneva, Mikhail Nuridinov, Denis Kuznetsov

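AI Summary To address the fragmented evaluation of genomic foundation models, proposes the GENEB benchmark, which compares 40 models on 100 tasks under a unified probing protocol and finds that model rankings are unstable and that scale brings only limited gains.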

Comments change first page figure, fix model sizes, add more consistency

AI中文摘要

由于基准碎片化、评估协议不兼容以及任务特定报告,基因组基础模型的进展难以评估。因此,关于模型优越性或通用性的声明往往无法直接比较。我们引入GENEB,这是一个大规模诊断基准,在统一的基于探测的协议下(包括少样本场景),评估来自40个基因组基础模型的冻结表示,涵盖100个任务,跨越13个功能类别。GENEB能够在明确暴露任务级权衡的同时,对模型规模、架构、分词和预训练数据进行受控比较。我们的分析表明,整体排行榜不稳定:模型排名在不同任务类别间变化剧烈,规模仅带来适度且不一致的收益,而架构和预训练对齐常常超过参数数量的影响。这些结果凸显了当前评估实践的局限性,并将GENEB定位为基因组机器学习中原则性比较和类别感知模型选择的参考框架。

英文摘要

Progress in genomic foundation models is difficult to assess due to fragmented benchmarks, incompatible evaluation protocols, and task-specific reporting. As a result, claims of superiority or generality across models are often not directly comparable. We introduce GENEB, a large-scale diagnostic benchmark that evaluates frozen representations from 40 genomic foundation models across 100 tasks spanning 13 functional categories under a unified probing-based protocol, including few-shot regimes. GENEB enables controlled comparison across model scale, architecture, tokenization, and pretraining data while explicitly exposing task-level trade-offs. Our analysis shows that aggregate leaderboards are unstable: model rankings vary sharply across task categories, scale provides only modest and inconsistent gains, and architectural and pretraining alignment frequently outweigh parameter count. These results highlight limitations of current evaluation practices and position GENEB as a reference framework for principled comparison and category-aware model selection in genomic machine learning.

2606.05405 2026-06-12 cs.AI cs.CL cs.LG 版本更新

Agents' Last Exam

Agents' Last Exam

Yiyou Sun, Xinyang Han, Weichen Zhang, Yuanbo Pang, Tianyu Wang, Yuhan Cao, Yixiao Huang, Chris Duroiu, Haoyun Zhang, Jeffrey Lin, Weishu Zhang, Tyler Zeng, Ying Yan, Bo Liu, Hanson Wen, Mingyang Xu, Xiaoyuan Liu, Zimeng Chen, Weiyan Shi, Amanda Dsouza, Vincent Sunn Chen, Patrick Bryant, Carl Boettiger, Yamini Rangan, Bradley Rothenberg, Kyle Steinfeld, Arvind Rao, Tapio Schneider, Georgios Yannakakis, Laure Zanna, Kaan Ozbay, Ida Sim, Tarek Zohdi, George Em Karniadakis, Jack Gallant, Teresa Head-Gordon, Yushan Li, Wenxi Deng, Tao Sun, Huiqi Wang, Zhun Wang, Justin Xu, Chris Yuhao Liu, Yafei Cheng, Rongwang Hu, Aras Bacho, Shengcao Cao, Zengyi Qin, Yixiong Chen, Hengduan Fan, Hao Liu, Lin Zeng, Shashank Muralidhar Bharadwaj, Litian Gong, Yingxuan Yang, Maojia Song, Ruheng Wang, Zongzheng Zhang, Honglin Bao, Shuo Lu, Jianhong Tu, Zhonghua Wang, Zheng Zhang, Zijiao Chen, Yanqiong Jiang, Zhendong Li, Bohan Lyu, Chang Ma, Peiran Xu, Benran Zhang, Shangding Gu, Haoyue Hua, Haoyang Li, Wanzhe Liao, Chengzhi Liu, Junbo Peng, Haoran Sun, Zechen Xu, Bo Chen, Jiayi Cheng, Yi Jiang, Keying Kuang, Yuan Li, Youbang Pan, Ziyan Rao, Alexander Schubert, Yifan Shen, Vincent Siu, Xiatao Sun, Kangqi Zhang, Xiaopan Zhang, Yuchen Zhu, Ishaan Singh Chandok, Lei Ding, Jingxuan Fan, Andrew Glover, Jiaming Hu, Yiran Hu, Wenbo Huang, Zixin Jiang, Haoran Jin, Lukas Kim, Ming Liu, Yang Liu, Alireza Rafiei, Xuhuan Shen, Kunyang Sun, Sophia Sun, Ting Sun, Eric Wang, Yixin Wang, Hanwen Xing, Sihan Xu, Yuzheng Xu, Zhongxing Xu, Zhiling Yan, Boqin Yuan, Ruiqi Zhang, Yifan Zhang, Zibo Zhao, Liana, Santanu Bosu Antu, Haoyue Bai, Carlo Bosio, Joseph Cavanagh, Patricia Cavazos-Rehg, Tianxing Chen, Xuewen Chen, Yipu Chen, Chenyu Zhu, Chen Dai, Stefano De Castro, Yunfu Deng, Kaustubh Dhole, Jiayuan Ding, Chenchen Du, Zhehang Du, Hao Fan, Run-Ze Fan, Hengyu Fu, Shi Gu, Yifan Gu, Charlie Guo, Baihe Huang, Baixiang Huang, Rimika Jaiswal, Zhihan Jiang, Ran Jin, Erin Kasson, Xin Lan, Joseph Lee, Deren Lei, 
Chenyu Li, Daofeng Li, Haitao Li, Hongwei Li, Jingyan Li, Xiao Li, Yi Li, Yinsheng Li, Yuangang Li, Zhixu Li, Wenyu Liang, Longtai Liao, Kevin Qinghong Lin, Andy Zeyi Liu, Che Liu, Jiaming Liu, Kaiyuan Liu, Xuan Liu, Pan Lu, Wenbo Lv, Yicheng Lyu, Qiuyang Mang, Kyle Montgomery, Yuzhou Nie, Ruoxi Ning, Jorin Overwiening, Xu Pan, Layna Paraboschi, Core Francisco Park, Justin Purnomo, Swati Rajwal, Scott Rankin, Bixuan Ren, Yiren Rong, HaoYang Shang, Ventus Shaw, Fiona Shen, Jiawei Shen, Minqi Shi, Shi Qiu, Huaxiu Yao, Tianneng Shi, Jonah So, Vladislav Susoy, Hannah Szlyk, Haocheng Wang, Jialu Wang, Wei Wang, Xinyu Wang, Zehao Wang, Dowling Wong, Angela Wu, Dehao Wu, Fangyu Wu, Mengyuan "Millie" Wu, Yu Wu, Yuchen Wu, Yuhao Wu, Qingpo Wuwu, Weihang Xiao, Yongyi Xiong, Fan Xu, Ruiling Xu, Mingxuan Yan, Benjamin Yang, Jirong Yang, Sen Yang, Xiaoli Yang, Yushi Yang, Haoran Ye, Xiaohu Yu, Zhengming Yu, Chenlong Zhang, Chi Zhang, Hanning Zhang, Hanwen Zhang, Junge Zhang, Kunpeng Zhang, Song Zhang, Wenjin Zhang, Wenshuo Zhang, Ying Zhang, Yizhi Zhang, Brian Zhao, Qijian Zhao, Yimin Zhao, Yuhaohua Zheng, Liwei Zhou, Tianyue Zhou, Sichen Zhu, Siqi Zhu, Yan Zhu, Yishu Zhu, Jierui Zuo, Chonghao Cai, Helena Casademunt, Wenjia Chen, Cheng Cheng, Nawen Deng, Rao Fu, Tianfu Fu, Yifan Han, He Ren, Zhenyu He, Qiao Jin, Langlang Li, Yuetai Li, Sylvia Liu, Lu Lu, Luqing Zhou, Subhabrata Mukherjee, Yunqi Ouyang, Yin Ren, Dawei Shi, Haoran Wu, Zhiyue Wu, Hannah Yao, Zhuoran Yi, Jenny Yu, Rhea Zhan, Hang Zhou, Blake Zhu, Junfan Zhu, Alan Yuille, Yang Liu, Russell Alan Poldrack, Jiachen Li, Zhenglu Li, Molei Tao, Jing Huang, Wenqi Shi, Costas Spanos, Lichao Sun, Chenguang Wang, Orson Xu, Zhen Dong, Hector Gomez, Aylin Caliskan, Ali Emami, Haimin Hu, Zhi Li, Lihui Liu, Murphy Niu, Yi Shao, Jianxin Sun, Mikko Tolonen, Ting Wang, Sanjiv Das, Yanjun Gao, Wenbo Guo, Erika J Schneider, Zhiyong Lu, Yian Ma, Mark Mueller, Radha Poovendran, Somayeh Sojoudi, Yinglun Zhu, Dawn Song

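AI Summary To address the lack of economically meaningful deployment of AI systems in professional domains, proposes the Agents' Last Exam (ALE) benchmark, built with 250+ experts and comprising 1000+ long-horizon, real-world, economically valuable tasks across 55 subfields in 13 industry clusters; the average pass rate on the current hardest tier is only 2.6%.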

Comments Project website: https://agents-last-exam.org Code: https://github.com/rdi-berkeley/agents-last-exam

AI中文摘要

最近的AI系统在广泛基准测试中取得了强劲结果,但这些成果并未转化为许多专业领域的经济上有意义的部署。我们认为这一差距主要是评估问题:广泛使用的基准缺乏对真实且经济上有价值的工作流程的持续性能测量。本文介绍了Agents' Last Exam (ALE),这是一个旨在评估AI代理在长期、经济上有价值、结果可验证的真实世界任务上的基准。与250多名行业专家合作开发,ALE涵盖了参考O*NET/SOC 2018(美国联邦职业分类)定义的非实体行业。它围绕一个任务分类法组织,包含55个子领域,分为13个行业集群,涵盖1000多个任务。当前结果显示,最难层级远未饱和:在主流框架和骨干配置下,平均完全通过率为2.6%。ALE被设计为一个活的基准:其任务池随着新工作流程和行业的加入而持续增长。更广泛地说,ALE不仅旨在作为另一个排行榜,而是作为缩小基准成功与GDP相关影响之间差距的工具。

英文摘要

Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long horizon, economically valuable, real world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy with 55 sub fields grouped into 13 industry clusters covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is below 1%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP relevant impact.

2606.08098 2026-06-12 cs.AI cs.LG 版本更新

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

何时委托优于多数?一种基于委托的多样本LLM推理聚合器

Yasushi Sakai, Allen Song, Kent Larson

发表机构 * MIT Media Lab(麻省理工学院媒体实验室)

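AI Summary Proposes PPV, a delegation-based aggregator that uses the letter entropy and reasoning-geometry signals of the samples to beat majority voting on MMLU-Pro by 1.5 percentage points, with no labels or training required.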

Comments Preprint. 16 pages, 5 figures, 4 tables

AI中文摘要

多数投票是对多样本LLM推理进行无监督聚合的主流方法。我们证明,将每个样本携带的信号输入基于委托的聚合器(传播代理投票,PPV)可产生一种无监督共识规则,在MMLU-Pro上整体比多数投票高1.5个百分点,在非平凡子集上高2.24个百分点(配对McNemar p ~ 1.0e-14,n = 8,099)。多数投票丢弃了每个样本携带的两个自由信号:组内字母熵和组间推理几何。PPV暴露了两个每个投票者使用的杠杆,它们恰好消耗这些信号:WHEN(投票者保留自己选择的权重)和WHOM(如何将剩余权重分配给同行)。我们使用字母熵驱动WHEN,使用以问题为中心的嵌入余弦驱动WHOM。该方法不需要真实标签和辅助训练:对于每个问题,我们将128个采样生成划分为16组,计算每组的字母级语义熵和推理嵌入质心,并将两者输入随机委托矩阵,其平稳分布选择共识答案。我们通过一个例子说明PPV如何推翻一个明显的10-6多数(错误答案):10票的多数簇几何上不连贯(平均簇内余弦-0.02),而6票的少数簇紧凑(+0.26),因此传播的委托质量集中在少数派的答案上,尽管仅凭熵会使多数保持领先。我们还报告了具有负面结果的委托策略,这些策略限制了无监督LLM聚合的设计空间:没有问题内的置信度模式集成能够缩小与oracle的差距。

英文摘要

Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. In this paper, we show a delegation-based aggregator (Propagational Proxy Voting, PPV; Sakai et al., 2025) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099). Majority discards two signals that every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes per-voter levers that consume exactly these two signals: When (how much weight a voter keeps on its own pick) and Whom (how it splits the remainder across peers). We drive When with letter entropy and Whom with per-question-centered embedding cosine. Our method needs no gold labels and no auxiliary training: per-question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10-6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine -0.02) while the 6-voter minority is tight (+0.26), so propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We further report delegation strategies with negative results that constrain the design space for unsupervised LLM aggregation. No within-question ensemble of confidence modes closes the oracle gap.
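The delegation mechanism described above (each voter keeps some weight on its own pick, splits the rest across peers, and the stationary distribution of the resulting stochastic matrix picks the answer) can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the `keep` values stand in for the entropy-driven When lever, the `sim` matrix stands in for the cosine-driven Whom lever, negative similarities are clipped to zero, and all function names and numbers are hypothetical.

```python
def delegation_matrix(keep, sim):
    """Row-stochastic matrix: row i keeps keep[i] on itself and spreads the
    remainder over peers in proportion to non-negative similarity (assumed rule)."""
    n = len(keep)
    rows = []
    for i in range(n):
        weights = [max(sim[i][j], 0.0) if j != i else 0.0 for j in range(n)]
        total = sum(weights)
        if total == 0.0:
            # No usable peer: the voter keeps all of its weight.
            row = [1.0 if j == i else 0.0 for j in range(n)]
        else:
            row = [(1.0 - keep[i]) * w / total for w in weights]
            row[i] = keep[i]
        rows.append(row)
    return rows


def stationary(matrix, iters=1000, tol=1e-12):
    """Stationary distribution pi = pi @ matrix, found by power iteration."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    return pi


def consensus(answers, pi):
    """Sum stationary mass per answer letter and return the heaviest one."""
    mass = {}
    for answer, p in zip(answers, pi):
        mass[answer] = mass.get(answer, 0.0) + p
    return max(mass, key=mass.get), mass


# Hypothetical example: two uncertain voters pick "A", one confident voter picks "B".
answers = ["A", "A", "B"]
keep = [0.2, 0.2, 0.9]
sim = [
    [1.0, 0.1, 0.8],
    [0.1, 1.0, 0.8],
    [0.1, 0.1, 1.0],
]

D = delegation_matrix(keep, sim)
pi = stationary(D)
winner, mass = consensus(answers, pi)
print(winner, mass)
```

In this toy case the two low-confidence voters delegate most of their weight to the confident third voter, so the propagated mass overturns the 2-to-1 plurality, which mirrors the qualitative behaviour of the 10-6 example in the abstract.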

12. Machine Learning Applications (62 papers)

2606.12500 2026-06-12 cs.LG cs.AI 新提交

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

基于机器学习的微观仿真从模拟交通冲突改进碰撞频率预测

Xian Liu, Carlo G. Prato, Gustav Markkula

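AI Summary Replaces traditional rule-based behaviour models with a machine-learning behaviour model in traffic microsimulation, predicts crash frequency by applying Extreme Value Theory to simulated conflicts, and shows at five signalised intersections in Leeds, UK, that the ML model improves prediction accuracy without location-specific calibration.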

AI中文摘要

交通微观仿真结合替代安全措施越来越多地被用作历史碰撞数据的主动替代方案,用于预测当前或计划道路基础设施设计的碰撞频率。然而,现有的基于微观仿真的安全研究采用了简化的基于规则的行为模型,这些模型能较好地再现交通流,但往往无法生成真实的冲突动态,限制了碰撞预测的准确性。机器学习(ML)行为模型的最新进展提供了一个有希望的机会,通过直接从大规模轨迹数据集中学习人类驾驶行为,可能提高微观仿真的真实性和碰撞频率预测。为了研究这种可能性,我们对英国利兹的五个真实信号交叉口进行了交通微观仿真,使用了标准的基于规则模型和最先进的ML模型。使用二维碰撞时间指标分析模拟车辆轨迹以识别模拟冲突,然后使用极端值理论建模以预测碰撞频率。结果表明,ML模型的冲突产生的碰撞预测与实际碰撞数据一致,而基于规则的模型由于缺乏对特定模拟交叉口的模型校准,无法产生有意义的预测。直接使用ML生成的模拟碰撞来预测实际碰撞频率也产生了较差的结果,这表明尽管当前的ML模型可以真实地再现冲突,但尚不能生成真实的碰撞。总体而言,研究结果表明,基于ML的行为模型在无需特定地点模型校准的情况下,有望从模拟冲突中改进碰撞预测,并为基于ML的交通微观仿真指明了明确的未来方向。

英文摘要

Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.
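To make the conflict-identification step concrete, the sketch below computes a two-dimensional time-to-collision for two road users moving at constant velocity. The abstract does not define its 2D metric, so this is one common simplification stated as an assumption: each vehicle is treated as a disc, `radius` is the sum of the two disc radii, and the function name and example values are hypothetical.

```python
import math


def ttc_2d(rel_pos, rel_vel, radius):
    """Time until two discs first touch under constant relative velocity.

    rel_pos and rel_vel are (x, y) tuples for vehicle B relative to vehicle A;
    radius is the combined disc radius. Returns math.inf if they never touch.
    """
    px, py = rel_pos
    vx, vy = rel_vel
    a = vx * vx + vy * vy                 # squared closing speed
    b = px * vx + py * vy                 # negative when the gap is shrinking
    c = px * px + py * py - radius * radius
    if c <= 0:
        return 0.0                        # already overlapping
    if a == 0 or b >= 0:
        return math.inf                   # stationary relative to each other, or separating
    disc = b * b - a * c
    if disc < 0:
        return math.inf                   # paths pass without touching
    return (-b - math.sqrt(disc)) / a


# Head-on approach: 10 m apart, closing at 2 m/s, combined radius 1 m.
head_on = ttc_2d((10, 0), (-2, 0), 1)
# Same closing speed but 5 m of lateral offset: the discs never touch.
lateral_miss = ttc_2d((10, 5), (-2, 0), 1)
print(head_on, lateral_miss)
```

A conflict would then be flagged when the value falls below a chosen threshold; the threshold, and the extreme-value model fitted to the flagged events, are choices the abstract does not specify.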

2606.12658 2026-06-12 cs.LG q-bio.QM stat.ML 新提交

Physics-Informed Neural Networks for Chemotherapy Pharmacokinetics: Benchmarking the Clinical Estimator and Exposing Parameter Identifiability

基于物理信息的神经网络用于化疗药代动力学:基准测试临床估计器并揭示参数可辨识性

Riya Bisht, Dhruv Agarwal

发表机构 * University of California, Berkeley(加州大学伯克利分校)

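AI Summary Applies physics-informed neural networks (PINNs) to chemotherapy pharmacokinetics: the PINN matches the standard clinical method on the linear two-compartment model, exposes parameter non-identifiability in the Michaelis-Menten extension, and partially restores identifiability with sparse tissue observations.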

AI中文摘要

物理信息神经网络(PINN)是生物学中部分观测问题的一个有吸引力的工具,其中控制动力学已知但某些隔室无法测量。化疗药代动力学(PK)是一个清晰的实例:血浆中的药物浓度常规测量,但组织中的浓度——决定肿瘤杀伤和脱靶毒性——无法测量。我们在两个PK问题上将PINN与标准临床基线(非线性最小二乘解析双指数血浆解,以下简称NLS)和物理无关的神经基线(仅数据的MLP)进行基准测试。在线性双室问题上,NLS接近最优;PINN在匹配其性能(小常数因子内)的同时,在单次训练过程中产生组织曲线,而仅数据的MLP在组织上失败约10倍。在Michaelis-Menten扩展(可饱和消除)上,双指数闭式不再存在,因此NLS被错误指定并静默返回无意义的速率常数。PINN反而揭示了一个更深层的事实:Michaelis-Menten双室模型仅从血浆数据不可辨识,PINN通过收敛到k12 -> 0的盆地诚实地报告这一点。添加两个稀疏组织观测在很大程度上解决了可辨识性:在五个随机种子上,PINN恢复k21在真实值的1%以内,Vmax和Km在一个标准差范围内,而k12向正确方向移动(0.02 -> 0.82)但仍低于真实值约2个标准差——这是闭式NLS估计器根本无法尝试的恢复,因为其双指数假设仅描述血浆。我们的主张不是PINN击败NLS。而是PINN提供了一种统一的方案,该方案在教科书问题上与教科书估计器匹配,揭示了教科书估计器隐藏的结构可辨识性,并在单一损失中吸收异构测量。

英文摘要

Physics-Informed Neural Networks (PINNs) are an attractive tool for partial-observation problems in biology, where the governing dynamics are known but some compartments cannot be measured. Chemotherapy pharmacokinetics (PK) is a clean instance: drug concentration in plasma is routinely measured, but concentration in tissue -- which determines tumour kill and off-target toxicity -- is not. We benchmark a PINN against the standard clinical baseline (nonlinear least-squares on the analytical biexponential plasma solution, hereafter NLS) and a physics-agnostic neural baseline (a data-only MLP) on two PK problems. On the linear two-compartment problem, NLS is near-optimal; the PINN matches it to within a small constant factor while also producing the tissue curve in a single training pass, whereas the data-only MLP fails on tissue by roughly 10x. On a Michaelis-Menten extension (saturable elimination), the biexponential closed form no longer exists, so NLS is mis-specified and silently returns meaningless rate constants. The PINN instead exposes a deeper fact: the Michaelis-Menten two-compartment model is non-identifiable from plasma alone, and the PINN reports this honestly by converging to a basin with k12 -> 0. Adding two sparse tissue observations largely resolves identifiability: across five seeds the PINN recovers k21 to within 1% of truth and Vmax, Km to within one standard-deviation bar, while k12 moves in the correct direction (0.02 -> 0.82) but remains ~2 sigma below truth -- a recovery the closed-form NLS estimator cannot attempt at all, because its biexponential ansatz describes only plasma. Our claim is not that PINNs beat NLS. It is that PINNs offer a uniform recipe that ties the textbook estimator on the textbook problem, exposes structural identifiability that the textbook estimator hides, and absorbs heterogeneous measurements within a single loss.
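The two governing models compared in this abstract can be written down and integrated directly, which helps show what the PINN's physics residual would be built from. The sketch below is an assumption-laden illustration rather than the authors' setup: it tracks drug amounts in a plasma and a tissue compartment in arbitrary consistent units, uses the rate names k12, k21, Vmax and Km from the abstract plus an assumed linear elimination rate `K10`, and all numeric values are hypothetical.

```python
# Hypothetical parameter values in arbitrary consistent units.
K12, K21 = 1.0, 0.5      # plasma -> tissue and tissue -> plasma transfer rates
K10 = 0.3                # linear elimination rate (assumed name)
V_MAX, K_M = 2.0, 1.0    # Michaelis-Menten (saturable) elimination


def linear_elimination(plasma):
    return K10 * plasma


def michaelis_menten(plasma):
    return V_MAX * plasma / (K_M + plasma)


def derivatives(state, elimination):
    """Two-compartment ODE right-hand side; elimination acts on plasma only."""
    plasma, tissue = state
    d_plasma = -elimination(plasma) - K12 * plasma + K21 * tissue
    d_tissue = K12 * plasma - K21 * tissue
    return (d_plasma, d_tissue)


def rk4_step(state, elimination, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return (base[0] + h * slope[0], base[1] + h * slope[1])

    k1 = derivatives(state, elimination)
    k2 = derivatives(shifted(state, k1, dt / 2), elimination)
    k3 = derivatives(shifted(state, k2, dt / 2), elimination)
    k4 = derivatives(shifted(state, k3, dt), elimination)
    return (
        state[0] + dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]),
        state[1] + dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]),
    )


def simulate(dose, elimination, dt=0.01, steps=1000):
    """Bolus dose into plasma at t = 0; returns the list of (plasma, tissue) states."""
    state = (dose, 0.0)
    trajectory = [state]
    for _ in range(steps):
        state = rk4_step(state, elimination, dt)
        trajectory.append(state)
    return trajectory


linear = simulate(10.0, linear_elimination)
saturable = simulate(10.0, michaelis_menten)
print(linear[-1], saturable[-1])
```

With linear elimination the plasma curve has the biexponential closed form that NLS fits; swapping in `michaelis_menten` removes that closed form, which is the regime where the abstract reports NLS to be mis-specified.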

2606.12687 2026-06-12 cs.LG 新提交

Forecasting Is Not Attribution: Localizing Decoder Bypass in Graph-Based Neural Marketing Mix Models

预测不等于归因:在基于图的神经营销组合模型中定位解码器旁路

Yunbo Wang, Bolbi Liu

发表机构 * University of California, Irvine(加州大学尔湾分校) AdsGency AI

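AI Summary Targets graph-based neural marketing mix models that forecast accurately yet fail at attribution; proposes the DICE-MMM framework, which diagnoses and localizes attribution bypass by restricting the decoder's communication paths, and shows experimentally that low forecasting error does not guarantee correct attribution.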

AI中文摘要

营销组合模型用于预测业务结果并将这些结果归因于营销渠道,但这些目标并不等价。我们研究了基于图的神经MMM中的一种失败模式,称为归因旁路:高容量解码器可以通过目标自回归、密集通信、共同运动、上下文或潜在记忆获得低预测误差,但未能将反事实敏感性通过用作归因对象的图进行路由。我们引入DICE-MMM作为一个有界诊断和训练框架。我们不声称观测性神经MMM能够识别因果效应。相反,DICE将基于图的MMM中经常混淆的三个问题分开:图恢复、预测准确性,以及训练后的解码器的扰动诱导影响是否与图对齐。阶段1训练一个带有受限图介导解码器的图编码器。阶段2冻结选定的编码器,并训练一个图安全的潜在解码器,其跨节点通信必须通过提供的图。解码器的使用通过CIG、AR-CIG和图交换测试进行评估。在受控的R/d/T交换和外部多图原始日志压力测试中,DICE比CausalMMM提高了稳定图恢复。实验表明,预测准确性不是归因证书:在稀疏目标基准中,无图解码器和全图解码器实现了约0.004的MSE@7,而AR-CIG nAUPRC仍接近或低于零,而oracle图在可比的MSE下达到0.807 +/- 0.129。冻结图交换定位了瓶颈:相同的DICE-hard训练解码器在学习图输入下从nAUPRC -0.044 +/- 0.006移动到oracle图下的0.894 +/- 0.027。贡献在于一个压力测试和故障定位框架,表明低MSE可能隐藏归因旁路,且未解决的瓶颈是图支撑选择,而不是预测或解码器容量。

英文摘要

Marketing mix models are used to forecast business outcomes and to attribute those outcomes to marketing channels, but these goals are not equivalent. We study a failure mode in graph-based neural MMM called attribution bypass: a high-capacity decoder can obtain low forecasting error through target autoregression, dense communication, co-movement, context, or latent memory while failing to route counterfactual sensitivity through the graph used as the attribution object. We introduce DICE-MMM as a bounded diagnostic and training framework. We do not claim that observational neural MMM identifies causal effects. Instead, DICE separates three questions often conflated in graph-based MMM: graph recovery, forecasting accuracy, and whether the trained decoder's perturbation-induced influence is graph aligned. Stage 1 trains a graph encoder with a restricted graph-mediated decoder. Stage 2 freezes the selected encoder and trains a graph-safe latent decoder whose cross-node communication must pass through the supplied graph. Decoder use is evaluated with CIG, AR-CIG, and graph-swap tests. Across controlled R/d/T swaps and an external multi-graph rawlog stress test, DICE improves stable graph recovery over CausalMMM. The experiments show that forecasting accuracy is not an attribution certificate: in a sparse-target benchmark, no-graph and full-graph decoders achieve MSE@7 around 0.004 while AR-CIG nAUPRC remains near or below zero, whereas an oracle graph reaches 0.807 +/- 0.129 at comparable MSE. Frozen graph-swap localizes the bottleneck: the same DICE-hard-trained decoder moves from nAUPRC -0.044 +/- 0.006 under learned graph inputs to 0.894 +/- 0.027 with the oracle graph. The contribution is a stress test and failure-localization framework showing that low MSE can hide attribution bypass and that the unresolved bottleneck is graph-support selection, not forecasting or decoder capacity.

2606.12699 2026-06-12 cs.LG cs.AI 新提交

LLM-Powered Personalized Glycemic Assessment in Type 2 Diabetes with Wearable Sensor Data

基于可穿戴传感器数据的2型糖尿病个性化血糖评估:LLM驱动方法

Yifan Gao, Yanmin Gong, Yun Shi, Yuanxiong Guo

发表机构 * Department of Information Systems and Cybersecurity, The University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校信息系统与网络安全系) School of Engineering Medicine, Texas A&M University(德克萨斯农工大学工程医学院) Department of Family and Community Medicine, The University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校家庭与社区医学系)

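AI Summary Proposes GlyLLM, a framework that uses a large language model to integrate wearable sensor data with structured metadata for personalized modeling of glycemic dynamics, improving on traditional ML methods by an average of 13.66% (RMSE) on glucose forecasting and 13.08% (AUROC) on diabetes categorization.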

Comments The 14th IEEE International Conference on Healthcare Informatics, 2026

AI中文摘要

2型糖尿病(T2D)对全球健康构成日益严重的威胁,需要有效的血糖评估来支持个性化和改进的糖尿病护理。可穿戴传感器如连续血糖监测仪(CGM)和健身追踪器为血糖评估提供了许多有价值的见解。然而,有效分析这些数据需要与重要的个体层面背景信息整合。现有方法通常基于传统机器学习(ML),主要依赖历史血糖测量值,忽略了个性化信息,这限制了它们在多样化糖尿病群体中的性能。大语言模型(LLMs)的最新进展展示了它们整合多种数据模态同时建模序列依赖性的能力,激发了探索其在个性化血糖评估中潜力的兴趣。在本文中,我们提出了GlyLLM,一个基于LLM的框架,通过整合可穿戴传感器数据和结构化元数据来建模基于CGM的血糖动态。GlyLLM可以利用预训练LLM的广泛先验知识,并在决策时实现传感器-文本语义抽象。在AI-READI数据集上的两个相关任务实验表明,我们的模型在血糖预测的均方根误差(RMSE)上平均优于传统ML方法13.66%,在糖尿病分类的受试者工作特征曲线下面积(AUROC)上平均优于13.08%。此外,我们的消融研究表明,糖尿病调查和生物特征测试比其他健康信息对血糖评估更为关键。我们的工作为利用LLM推进T2D护理中的个性化血糖评估迈出了有希望的一步。

英文摘要

Type 2 Diabetes (T2D) poses an increasing global health threat, demanding effective glycemic assessment to support personalized and improved diabetes care. Wearable sensors such as continuous glucose monitors (CGM) and fitness trackers offer many valuable insights for glycemic assessment. However, effectively analyzing these data requires integration with essential individual-level context. Existing methods are often based on traditional machine learning (ML) and rely primarily on historical blood glucose measurements and overlook personalized information, which limits their performance across diverse diabetes populations. Recent advances in large language models (LLMs) have demonstrated their ability to integrate diverse data modalities while modeling sequential dependencies, motivating the exploration of their potential for personalized glycemic assessment. In this paper, we propose GlyLLM, an LLM-powered framework for modeling CGM-based glycemic dynamics through the integration of wearable sensor data and structured metadata. GlyLLM can leverage the extensive prior knowledge of pre-trained LLMs and achieve sensor-text semantic abstraction at decision time. Experiments on two related tasks on the AI-READI dataset demonstrate that our model outperforms traditional ML methods by an average of 13.66% in Root Mean Squared Error (RMSE) for glucose forecasting and 13.08% in Area Under the Receiver Operating Characteristic (AUROC) for diabetes categorization. Additionally, our ablation study shows that diabetes surveys and biometric tests are more critical than other health information for glycemic assessment. Our work presents a promising step toward harnessing the power of LLMs to advance personalized glycemic assessment in T2D care.

2606.12735 2026-06-12 cs.LG 新提交

Physics-Informed Neural Networks and Radial Basis Functions for PDEs with Dirac Delta Sources

物理信息神经网络与径向基函数求解含狄拉克δ源的偏微分方程

Manuel Reyna, Alexandre Tartakovsky

发表机构 * Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校土木与环境工程系)

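AI Summary For PDEs with Dirac delta terms, interprets physics-informed neural networks as residual least-squares methods, handles the delta terms directly through the weak form, and compares against radial basis function expansions, finding that RBF residual least squares is more reliable on transport problems.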

Comments 33 pages, 4 figures

AI中文摘要

物理信息神经网络(PINNs)是一种用于求解正向和逆向偏微分方程(PDEs)的机器学习方法。当应用于强迫项、边界条件或初始条件中包含狄拉克δ函数的PDEs时,PINNs需要用光滑的代理函数来近似它们,这种做法可能会引入显著的建模误差。在这项工作中,我们利用PINNs作为残差最小二乘法(RLS)的解释,并表明这种视角能够通过积分弱形式方程直接处理狄拉克δ项。在除PINN之外的RLS公式中,我们重点关注径向基函数(RBF)展开(也称为单层RBF网络)。我们证明,虽然在PINNs中积分掉狄拉克δ会导致残差无法收敛到零,但RBF-RLS始终能为输运问题提供良好的正向和逆向解。我们使用神经正切核(NTK)理论解释这一发现。我们在代表多孔介质和河流中地下水流和输运的线性PDEs上测试了这两种方法。我们求解逆问题以拟合合成数据、含噪声的合成数据以及真实世界测量值。

英文摘要

Physics-Informed Neural Networks (PINNs) are a machine learning method for solving forward and inverse Partial Differential Equations (PDEs). When applied to PDEs with Dirac delta functions in the forcing terms, boundary conditions, or initial conditions, PINNs require approximating them with smooth surrogate functions, a practice that can introduce significant modeling errors. In this work, we exploit the interpretation of PINNs as Residual Least Squares (RLS) methods and show that this perspective enables direct treatment of Dirac delta terms by integrating the weak-form equation. Among RLS formulations other than PINN, we focus on the Radial Basis Function (RBF) expansion (also known as a single-layer RBF Network). We show that while integrating out the Dirac delta in PINNs causes residuals to fail to converge to zero, RBF-RLS consistently provides good forward and inverse solutions to transport problems. We explain this finding using the Neural Tangent Kernel (NTK) theory. We test both approaches on linear PDEs that represent groundwater flow and transport in porous media and rivers. We solve inverse problems to fit synthetic data, noisy synthetic data, and real-world measurements.
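The reason a weak-form treatment avoids smoothing the delta can be seen in one line. The block below is a generic illustration, not the paper's formulation: the linear operator $\mathcal{L}$, the source strength $q$, the source location $x_0$, the domain $\Omega$ and the test function $\varphi$ are all symbols assumed here.

```latex
% Strong form with a point source of strength q at x_0 (assumed generic setting):
\mathcal{L}u(x) = q\,\delta(x - x_0), \qquad x \in \Omega .

% Multiply by a test function \varphi and integrate over \Omega.
% The sifting property of the Dirac delta evaluates \varphi at x_0 (with x_0 inside \Omega):
\int_{\Omega} \varphi(x)\,\mathcal{L}u(x)\,\mathrm{d}x
  = q \int_{\Omega} \varphi(x)\,\delta(x - x_0)\,\mathrm{d}x
  = q\,\varphi(x_0) .
```

The right-hand side is now an ordinary number, so a residual built from the integrated equation never has to evaluate or approximate the delta itself.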

2606.12843 2026-06-12 cs.LG cs.CE 新提交

Interpretable Factor Decomposition for Decision Intelligence in Large-Scale Financial Markets: Evidence from China's A-Share Market

可解释因子分解用于大规模金融市场决策智能:来自中国A股市场的证据

Xiao Han, Yao Xiao, Zhen Zhang, Moxuan Zheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

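AI Summary Proposes an interpretable machine learning pipeline that decomposes cross-sectional equity return predictability into auditable factor contributions, evaluates it with XGBoost and TreeSHAP on China's A-share market, and finds that behavioral signals account for 58.2% of predictive attribution.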

AI中文摘要

我们提出一个可解释的机器学习流程,将截面股票收益预测分解为可审计的因子贡献。我们应用带有TreeSHAP归因的XGBoost模型,对2009年至2019年的3632只中国A股进行压力测试。使用60个月滚动窗口,在55个月的样本外数据上,XGBoost获得平均AUC为0.547,且前五分之一与后五分之一的多空价差为+2.38%/月(Newey-West t = 5.94;年化夏普比率2.23)。在调整Carhart四因子模型后,该alpha持续存在(+2.31%/月;t = 7.48)。SHAP分解表明,在55个行业组中,行为信号(换手率和动量)平均占预测归因的58.2%,而估值比率仅占10.7%。消融分析用于交叉验证这一排名,并提供证据表明SHAP和消融以突出特征可替代性结构的方式产生分歧,而这种结构在单独使用任一方法时几乎不可见。

英文摘要

We present an interpretable machine learning pipeline that decomposes cross-sectional equity return predictability into auditable factor contributions. We apply an XGBoost model with TreeSHAP attribution and conduct stress testing on 3,632 Chinese A-share stocks from 2009 to 2019. Using 60-month rolling windows over 55 months of out-of-sample data, XGBoost obtains a mean AUC of 0.547 and a long-short spread of +2.38%/month between the top and bottom quintiles (Newey-West t = 5.94; annualized Sharpe ratio 2.23). This alpha persists after adjusting for the Carhart four-factor model (+2.31%/month; t = 7.48). SHAP decomposition indicates that, on average across 55 industry groups, behavioral signals (turnover and momentum) account for 58.2% of predictive attribution, compared to 10.7% for valuation ratios. Ablation analysis cross-validates this ranking and provides evidence that SHAP and ablation diverge in a manner that highlights a feature-substitutability structure that is largely invisible to either method used in isolation.
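The headline portfolio numbers (a top-minus-bottom quintile spread and an annualized Sharpe ratio) follow from simple arithmetic on model scores and realized returns. The sketch below shows that arithmetic on toy data; it is not the authors' code. Equal weighting within quintiles, a zero risk-free rate and annualization by the square root of 12 are assumptions, and the function names and numbers are hypothetical.

```python
import math
import statistics


def long_short_return(scores, returns):
    """Equal-weighted mean return of the top score quintile minus the bottom quintile."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(order) // 5
    if size == 0:
        raise ValueError("need at least five stocks to form quintiles")
    bottom, top = order[:size], order[-size:]

    def mean_return(indices):
        return sum(returns[i] for i in indices) / len(indices)

    return mean_return(top) - mean_return(bottom)


def annualized_sharpe(monthly_spreads):
    """Mean over sample standard deviation of monthly spreads, scaled by sqrt(12)."""
    return (
        statistics.mean(monthly_spreads)
        / statistics.stdev(monthly_spreads)
        * math.sqrt(12)
    )


# Toy cross-section for one month: ten stocks with model scores and realized returns.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
returns = [-0.03, -0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.01, 0.03]

spread = long_short_return(scores, returns)
sharpe = annualized_sharpe([0.02, 0.03, 0.01])  # three hypothetical monthly spreads
print(spread, sharpe)
```

In the paper's setting this calculation would be repeated for each of the 55 out-of-sample months, and the resulting monthly series feeds both the Sharpe ratio and the Newey-West t-statistic.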

2606.12971 2026-06-12 cs.LG 新提交

Predicting Cognitive Load from Speech and Interaction Dynamics in Dyadic Conversations

从二元对话中的语音和交互动态预测认知负荷

Tahiya Chowdhury

发表机构 * Department of Computer Science, Colby College(科尔比学院计算机科学系)

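AI Summary Studies the prediction of perceived cognitive load from speech and interaction-dynamics features in natural collaborative conversations, finding that conversational interaction such as turn-taking is a useful predictor of load dimensions including time pressure and mental work.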

Comments Accepted to Interspeech 2026

AI中文摘要

从语音估计认知负荷主要在受控实验室环境中研究,对其在自然协作对话中的可靠性了解有限。我们研究语音和交互动态是否能预测二元对话中的感知认知负荷。我们分析了53对执行九项协作任务的对话音频,提取静态声学、动态和交互特征,训练双头门控循环单元编码器预测认知负荷分数。结果表明,对话交互为预测与时间压力、脑力工作、努力和任务表现相关的认知负荷提供了有用信号。时间需求与话轮转换动态(如重叠和说话者切换)相关,而脑力需求与说话者之间的不平衡参与相关。这些发现强调了任务结构和对话交互在自然协作环境中建模认知负荷的重要性。

英文摘要

Estimating cognitive load from speech has largely been studied in controlled laboratory settings, with limited understanding of its reliability in natural collaborative conversations. We investigate whether speech and interaction dynamics predict perceived cognitive load during dyadic conversations. We analyze audio from 53 dyads performing nine collaborative tasks and extract static acoustic, dynamic, and interaction features to train a two-head Gated Recurrent Unit encoder to predict cognitive load scores. Results show conversational interaction provides useful signals for predicting cognitive load related to time pressure, mental work, effort, and task performance. Temporal demand is associated with turn-taking dynamics such as overlap and speaker switch, while mental demand is linked to imbalanced participation between speakers. These findings highlight the importance of task structure and conversational interaction for modeling cognitive load in natural collaborative settings.

2606.13007 2026-06-12 cs.LG cs.AI 新提交

scLLM-DSC: LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering for Single-Cell RNA Sequencing

scLLM-DSC:基于LLM知识增强的跨模态深度结构聚类用于单细胞RNA测序

Ping Xu, Pengjiang Li, Tian Du, Zaitian Wang, Jiawei Gu, Ziyue Qiao, Pengfei Wang, Yuanchun Zhou

发表机构 * Computer Network Information Center, Chinese Academy of Sciences(中国科学院计算机网络信息中心) University of Chinese Academy of Sciences(中国科学院大学) Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州高等研究院) School of Computing and Information Technology, Great Bay University(大湾区大学计算机科学与技术学院) School of Engineering, Westlake University(西湖大学工学院)

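AI Summary Proposes scLLM-DSC, a framework that uses LLM knowledge to improve clustering of single-cell RNA sequencing data through cross-modal contrastive alignment of a knowledge-driven semantic view and a structure-aware topological view, significantly outperforming existing methods.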

AI中文摘要

聚类是scRNA-seq分析的基础,是识别细胞群体和解析组织异质性的基石。然而,现有方法专注于挖掘数值统计模式,由于忽略了基因编码的内在生物学功能,存在语义不可知的问题。虽然大语言模型(LLM)提供了有前景的语义能力,但生成式预训练目标与判别式下游任务之间的结构不匹配阻碍了它们直接适应细胞聚类。为弥合这一差距,我们提出了scLLM-DSC,一种新颖的LLM知识增强跨模态深度结构聚类框架。与数据驱动范式不同,scLLM-DSC通过协同两个视图建立语义基础表示:从NCBI基因先验和上下文化的Cell2Sentence嵌入中提取的知识驱动语义视图,以及通过图引导编码器提取的结构感知拓扑视图。关键的是,我们引入了一种跨模态对比对齐机制,以在统一潜在空间中强制生物学语义与转录组特征之间的一致性。广泛的基准测试表明,scLLM-DSC在聚类准确性上显著优于十一个最先进的基线方法。

英文摘要

Clustering is fundamental to scRNA-seq analysis, serving as a cornerstone for identifying cell populations and resolving tissue heterogeneity. However, existing methods focus on mining numerical statistical patterns, suffering from semantic agnosticism by neglecting the intrinsic biological functions encoded by genes. While Large Language Models (LLMs) offer promising semantic capabilities, their direct adaptation to cell clustering is hindered by the structural mismatch between generative pre-training objectives and discriminative downstream tasks. To bridge this gap, we propose scLLM-DSC, a novel LLM-Knowledge Enhanced Cross-Modal Deep Structural Clustering framework. Diverging from data-driven paradigms, scLLM-DSC establishes a semantically-grounded representation by synergizing two views: a Knowledge-Driven Semantic View derived from NCBI gene priors and contextualized Cell2Sentence embeddings, and a Structure-Aware Topological View extracted via a graph-guided encoder. Crucially, we introduce a cross-modal contrastive alignment mechanism to enforce consistency between biological semantics and transcriptomic features within a unified latent space. Extensive benchmarks demonstrate that scLLM-DSC significantly outperforms eleven state-of-the-art baselines in clustering accuracy.

2606.13024 2026-06-12 cs.LG cs.AI 新提交

CausalMoE: A Billion-Scale Multimodal Foundation Model for Granger Causal Discovery with Pattern-Routed Heterogeneous Experts

CausalMoE:基于模式路由异构专家的十亿规模多模态基础模型用于格兰杰因果发现

Bo Liu, Di Dai, Jingwei Liu, Jiarui Jin, Xiaocheng Fang, Guangkun Nie, Hongyan Li, Shenda Hong

发表机构 * State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院通用人工智能国家重点实验室) National Institute of Health Data Science, and Institute for Artificial Intelligence, Peking University(北京大学健康医疗大数据国家研究院、人工智能研究院)

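AI Summary Proposes CausalMoE, a billion-scale multimodal Granger causal foundation model that decouples regime-specific dynamics with a pattern-routed mixture of heterogeneous experts and combines causality-aware self-attention with LLM/VLM priors to recover sparse causal graphs, reaching state-of-the-art results in supervised settings and generalizing to few-shot settings.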

AI中文摘要

格兰杰因果发现(GCD)是分析复杂系统中时间依赖性的基础。然而,现有的神经GCD方法主要依赖“一刀切”范式,难以捕捉真实世界时间序列中固有的分布偏移和动态机制变化,常导致表示纠缠和虚假因果图。本文提出CausalMoE,一种十亿规模多模态格兰杰因果基础模型,显式建模补丁级异质性。CausalMoE引入模式路由混合异构专家,动态识别潜在时间模式并将补丁路由到专门领域专家,有效解耦机制特定动态与共享动态。为确保可解释的图恢复,我们设计了一种跨变量运行的因果感知自注意力机制,通过近端优化生成稀疏格兰杰因果图。此外,CausalMoE是首个集成LLM和VLM以对齐数值信号与文本和视觉先验的模型,在复杂场景中正则化因果估计。大量实验表明,CausalMoE在全监督基准上达到新最优,同时在传统方法失败的少样本设置中有效泛化。

英文摘要

Granger Causal Discovery (GCD) is fundamental for analyzing temporal dependencies in complex systems. However, existing neural GCD methods predominantly rely on a "one-size-fits-all" paradigm, struggling to capture distribution shifts and dynamic regime changes inherent in real-world time series. This often leads to entangled representations and spurious causal graphs. In this paper, we propose CausalMoE, a billion-scale multimodal Granger causal foundation model that explicitly models patch-level heterogeneity. CausalMoE introduces a Pattern-Routed Mixture of Heterogeneous Experts, which dynamically identifies latent temporal patterns and routes patches to specialized domain experts, effectively decoupling regime-specific mechanisms from shared dynamics. To ensure interpretable graph recovery, we design a Causality-Aware Self-Attention mechanism operating across variables, yielding sparse Granger causal graphs via proximal optimization. Furthermore, CausalMoE is the first to integrate LLMs and VLMs to align numerical signals with textual and visual priors, regularizing causal estimation in complex scenarios. Extensive experiments demonstrate that CausalMoE establishes a new state-of-the-art on fully supervised benchmarks, while effectively generalizing to few-shot settings where traditional methods fail.
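
The abstract does not say which sparsity penalty the proximal step uses. As a minimal sketch, assuming a plain L1 penalty on a matrix of candidate causal weights, the proximal operator is element-wise soft-thresholding: entries whose magnitude falls below the threshold become exactly zero, which is what produces a sparse graph. The names `soft_threshold`, `prox_l1`, and `granger_graph`, the score matrix, and the threshold value are illustrative assumptions and are not taken from the paper.

```python
from typing import List

Matrix = List[List[float]]


def soft_threshold(w: float, lam: float) -> float:
    """Proximal operator of lam * |w|: shrink towards zero, clip at zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


def prox_l1(weights: Matrix, lam: float) -> Matrix:
    """Apply soft-thresholding to every candidate causal weight."""
    return [[soft_threshold(w, lam) for w in row] for row in weights]


def granger_graph(weights: Matrix) -> List[List[int]]:
    """Binary adjacency: entry [i][j] is 1 when series j is kept as a cause of series i."""
    return [[1 if w != 0.0 else 0 for w in row] for row in weights]


# Hypothetical variable-to-variable scores (rows: effects, columns: causes).
scores = [[0.90, 0.05, -0.40],
          [0.02, 0.70, 0.08],
          [-0.03, 0.60, 0.80]]
sparse = prox_l1(scores, lam=0.1)
adjacency = granger_graph(sparse)
```

In an iterative solver this step would alternate with a gradient step on the forecasting loss; only the thresholding part is shown here.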

2606.13060 2026-06-12 cs.LG 新提交

A green solvent screening tool for emerging materials via uncertainty aware, transformer enhanced transfer learning

一种面向新兴材料的绿色溶剂筛选工具:基于不确定性感知、Transformer增强的迁移学习

Ioannis Kouroudis, Simon Ternes, Zhaosu Gu, Gohar Ali Siddiqui, Marina Ustinova, Angelo Lembo, Alessio Gagliardi, Aldo Di Carlo

发表机构 * Technical University of Munich(慕尼黑工业大学) Institute of Structure of Matter – National Research Council Rome (ISM-CNR)(罗马国家研究委员会物质结构研究所) University of Rome "Tor Vergata"(罗马第二大学)

AI总结 提出一种结合预训练Transformer模型和不确定性量化的迁移学习方法,在极少数据下高精度预测溶解度参数,并开发了可定制的绿色溶剂筛选工具。

详情
AI中文摘要

溶解度的准确预测仍然是材料科学和可持续化学中的一个核心挑战。特别是由于有机和混合光伏、电池、催化等新兴技术,溶剂使用量预计在未来几年将显著增加。因此,用更绿色的替代品取代溶剂至关重要。这正是机器学习可以产生重大影响的地方。然而,溶解度关键参数的数据有限,严重制约了机器学习的效能。在这项工作中,我们将预训练的QM9基础模型迁移到我们的应用中,所需数据极少。此外,该流程集成了不确定性量化,允许用户评估预测的置信度。作为基线,我们成功预测了存在大量数据库的汉森溶解度参数和介电常数。重要的是,我们在其他目标(如Gutmann供体和受体数)上实现了高模型性能,而这些目标的可获得数据极为有限。总体而言,我们通过高质量预测将溶解度描述符的数据量提高了数个数量级。为了有效传播,我们部署了一个易于使用、易于与高通量实验室集成、可定制的工具,用于排序和筛选可能的溶剂替代品。最后,我们重新发现了已知的绿色溶剂替代品,并提出了新的候选者,证明了其在寻找环保溶剂方面的相关性。

英文摘要

Accurate prediction of solubility remains a central challenge across materials science and sustainable chemistry. In particular, due to emerging technologies such as organic and hybrid photovoltaics, batteries, and catalysis, solvent usage is expected to increase significantly in the coming years. Therefore, substituting solvents with greener alternatives is vital. This is where machine learning can have substantial impact. However, the limited data on critical solubility parameters significantly constrains machine learning efficacy. In this work, we transfer a foundational model pre-trained on QM9 targets to our application with minimal data requirements. Additionally, the pipeline integrates uncertainty quantification, allowing the user to gauge the confidence of the predictions. As a baseline, we succeed in predicting the Hansen solubility parameters and the dielectric constant, for which extensive databases exist. Importantly, we achieve high model performance on additional targets, such as Gutmann donor and acceptor numbers, where the available data is extremely limited. Overall, we augment data on solubility descriptors by orders of magnitude with high-quality predictions. For effective dissemination, we deploy an easy-to-use, customizable tool for ranking and screening possible solvent substitutes that integrates easily with high-throughput labs. Finally, we rediscover known green solvent alternatives and propose new candidates, demonstrating the tool's relevance for finding eco-friendly solvents.

2606.13174 2026-06-12 cs.LG cs.CL 新提交

Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

与你合作得更好:将用户修正编译为编码代理的运行时强制

Yujun Zhou, Kehan Guo, Haomin Zhuang, Xiangqi Wang, Yue Huang, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Nuno Moniz, Nitesh V. Chawla, Xiangliang Zhang

发表机构 * University of Notre Dame(圣母大学) IBM Research(IBM研究院) Tencent AI Lab(腾讯AI实验室)

AI总结 提出TRACE方法,通过将用户修正编译为原子规则并在运行时强制执行,显著减少编码代理在后续任务中的偏好违反,优于纯记忆方法。

详情
AI中文摘要

交互式LLM代理正成为日常工作的组成部分，但它们并不会随着时间的推移而变得更易于合作：在一个会话中记住的修正可能在下一个会话中仍被违反。我们研究了偏好访问与偏好遵从之间的差距。在源自匿名真实用户摩擦案例的任务中，Mem0记忆仍然导致57.5%的适用偏好检查被违反。我们引入了测试时规则获取与编译强制（TRACE），这是一个用于编码代理运行时的即插即用技能层管道，它挖掘用户修正，将其重写为原子规则，并编译为运行时检查，这些检查必须在代理完成未来任务之前通过。与开发者提前编写的运行时检查不同，TRACE技能来自用户自己的聊天修正。我们通过在ClawArena编码代理任务和MemoryArena衍生的记忆密集型任务上进行模拟用户参与实验来评估TRACE。在ClawArena上，TRACE将分布内任务的保留偏好违反从100.0%降低到37.6%，将分布外任务从100.0%降低到2.0%。在MemoryArena衍生的任务上，TRACE将分布内违反从100.0%降低到60.5%，同时在任务通过率上匹配或超过最强的记忆基线。这些结果表明，将修正编译为运行时强制可以解决记忆单独无法可靠解决的重复摩擦失败模式，减少用户在未来会话中重复相同修正的需求。实验代码可在 https://github.com/YujunZhou/TRACE_exp 获取，可部署的技能可在 https://github.com/YujunZhou/tellonce 获取。

英文摘要

Interactive LLM agents are becoming part of daily work, but they do not reliably become easier to work with over time: a correction remembered in one session may still be violated in the next. We study this gap between preference access and preference compliance. In tasks derived from anonymized real-user friction cases, Mem0 memory still leaves 57.5% of applicable preference checks violated. We introduce Test-time Rule Acquisition and Compiled Enforcement (TRACE), a drop-in skill-layer pipeline for coding-agent runtimes that mines user corrections, rewrites them as atomic rules, and compiles them into runtime checks that must pass before an agent completes future tasks. Unlike runtime checks written ahead of time by developers, TRACE skills come from the user's own chat corrections. We evaluate TRACE with simulated user-in-the-loop experiments on ClawArena coding-agent tasks and MemoryArena-derived memory-intensive tasks. On ClawArena, TRACE reduces held-out preference violation from 100.0% to 37.6% on in-distribution tasks and from 100.0% to 2.0% on out-of-distribution tasks. On MemoryArena-derived tasks, TRACE reduces in-distribution violation from 100.0% to 60.5% while matching or exceeding the strongest memory baseline on task pass. These results suggest that compiling corrections into runtime enforcement can address a repeated-friction failure mode that memory alone does not reliably solve, reducing the need for users to restate the same correction across future sessions. Experiment code is available at https://github.com/YujunZhou/TRACE_exp, and the deployable skill is available at https://github.com/YujunZhou/tellonce.
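
The pipeline described above (mine a correction, rewrite it as an atomic rule, compile it into a check that gates task completion) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the TRACE implementation: the `Rule` structure, the two example rules, and the function names are all hypothetical, and a real system would derive the checks from chat history rather than hard-code them.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One atomic rule distilled from a user correction (hypothetical structure)."""
    name: str
    check: Callable[[str], bool]  # returns True when the draft complies


def has_no_debug_prints(code: str) -> bool:
    # Correction: "stop leaving print() calls in my code".
    return re.search(r"^\s*print\(", code, re.M) is None


def has_return_annotations(code: str) -> bool:
    # Correction: "always annotate return types".
    defs = [ln for ln in code.splitlines() if ln.lstrip().startswith("def ")]
    return all("->" in ln for ln in defs)


def compile_rules() -> List[Rule]:
    """Stand-in for the compile step: corrections become executable checks."""
    return [Rule("no-debug-prints", has_no_debug_prints),
            Rule("return-annotations", has_return_annotations)]


def violations(code: str, rules: List[Rule]) -> List[str]:
    return [rule.name for rule in rules if not rule.check(code)]


def can_complete(code: str, rules: List[Rule]) -> bool:
    """The agent may finish the task only when every compiled check passes."""
    return not violations(code, rules)


rules = compile_rules()
draft = "def add(a, b):\n    print(a)\n    return a + b\n"
fixed = "def add(a: int, b: int) -> int:\n    return a + b\n"
```

The design point this illustrates is that a remembered preference is only advisory, whereas a compiled check blocks completion until the output complies.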

2606.13236 2026-06-12 cs.LG cs.AI cs.SD stat.AP 新提交

Decoding Insect Song: A Multitask Semisupervised Orthoptera Bioacoustic Classifier

解码昆虫之歌:一种多任务半监督直翅目生物声学分类器

Olga Isupova, Danil Kuzin, Ella Browning, Tom Mills, Steven Reece

发表机构 * University of Oxford(牛津大学)

AI总结 提出PULSE半监督多任务框架,结合弱监督分类、自监督学习和知识蒸馏,在直翅目生物声学分类中优于通用模型,并通过主动学习进一步提升性能。

Comments ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

被动声学监测在生态推断方面具有巨大潜力,但现有的自动化工具通常训练范围狭窄且不可迁移。我们通过PULSE(一种用于直翅目生物声学的半监督多任务框架)解决了这些局限性,该框架结合了弱监督物种分类、未标记野外音频的自监督学习以及来自通用生物声学模型的知识蒸馏。我们的领域自适应专家模型在所有指标上均优于最先进的通用模型(宏F1:0.21 vs. 0.07;AUC:0.74 vs. 0.45;AP:0.32 vs. 0.19),主动学习进一步将F1提升至0.34,AUC提升至0.84。除了分类之外,学习到的嵌入编码了生态上有意义的结构,并通过交互式可视化工具暴露出来,用于生态发现。

英文摘要

Passive acoustic monitoring holds great promise for ecological inference, yet existing automated tools are typically narrowly trained and non-transferable. We address these limitations with PULSE, a semi-supervised, multi-task framework for Orthoptera bioacoustics, combining weakly-supervised species classification, self-supervised learning on unlabelled field audio, and knowledge distillation from a general-purpose bioacoustic model. Our domain-adapted specialist model outperforms a state-of-the-art general model across all metrics (macro F1: 0.21 vs. 0.07; AUC: 0.74 vs. 0.45; AP: 0.32 vs. 0.19), with active learning further raising F1 to 0.34 and AUC to 0.84. Beyond classification, the learned embeddings encode ecologically meaningful structure, exposed through an interactive visualisation tool for ecological discovery.

2606.13252 2026-06-12 cs.LG 新提交

To GAN or Not To GAN: Segmentation Analysis on Mars DEM

生成对抗还是非生成对抗:火星DEM上的分割分析

Douglas Dziedzorm Agbeve, Aditya V. Handrale, Salim Fares, Seif E. Idani

发表机构 * University of Passau(帕绍大学)

AI总结 使用监督语义分割和生成对抗方法自动检测火星上的土丘,并比较两种方法,发现添加人工生成数据并未改善结果。

详情
AI中文摘要

为了更好地理解火星表面,使火星车能够轻松导航火星,有必要能够确定土丘的位置。检测和研究这些形态也有助于我们找到地外生命的证据,在这种情况下,更具体地说,是水或生命适宜环境的迹象。土丘的检测是通过将形态参数手动映射到数字高程模型上完成的。本文通过使用基于神经网络的语义分割方法自动检测和/或预测火星上的土丘来解决这个问题。这是通过使用监督语义分割模型和生成对抗方法实现的。两种方法的比较表明,添加额外的人工生成数据并未改善结果。

英文摘要

Determining the location of mounds is necessary to better understand the Martian surface and to enable rovers to navigate Mars with ease. Detecting and studying these morphologies can also help us find evidence of extraterrestrial life, more specifically water or signs of life-conducive environments. Detection of mounds has previously been done by manually mapping morphological parameters onto Digital Elevation Models (DEMs). This paper addresses the problem by automatically detecting and/or predicting mounds on Mars using neural-network-based semantic segmentation. We use both a supervised semantic segmentation model and a generative adversarial approach. A comparison of the two approaches shows that adding extra artificially generated data did not improve the results.

2606.13285 2026-06-12 cs.LG cs.AI 新提交

Once-for-All: Scalable Simultaneous Forecasting via Equilibrium State Estimation

Once-for-All: 基于均衡状态估计的可扩展同步预测

Beinan Xu, Andy Song, Jiti Gao, Feng Liu

发表机构 * RMIT University(皇家墨尔本理工大学) Monash University(莫纳什大学) University of Adelaide(阿德莱德大学)

AI总结 提出均衡状态估计(ESE)范式,通过一次前向传播估计多系统均衡状态并基于状态差异生成预测,在保持精度的同时实现10-70倍加速,且具有线性时间复杂度和鲁棒性。

Comments Accepted by ICML 2026

详情
AI中文摘要

我们引入均衡状态估计(ESE),一种用于同步预测的新范式,其中多个相互作用的系统需要独立但协调的预测。这种场景在现实世界中经常出现,例如经济学和医疗建模。与一次预测一个系统的现有方法不同,ESE在一次前向传播中预测所有系统。它首先估计跨系统的均衡状态,然后基于当前状态与估计均衡之间的差异生成整体预测。在合成和真实世界数据集(包括货币汇率和COVID-19传播建模)上的大量实验表明,ESE至少与最先进(SOTA)方法一样准确,同时速度显著更快。此外,ESE与传统预测器无缝集成,结合了它们的准确性和其卓越的效率,实现了10-70倍的加速。凭借线性时间复杂度,随着系统数量的增加,ESE的扩展性远优于SOTA方法。此外,它在各种扰动下仍保持准确,使ESE成为一种快速、可泛化、鲁棒且可扩展的多预测方法。

英文摘要

We introduce Equilibrium State Estimation (ESE), a novel paradigm for simultaneous prediction, where multiple interacting systems require separate yet coordinated forecasts. Such scenarios often arise in real-world settings such as economics and healthcare modeling. Unlike existing approaches that predict one system at a time, ESE forecasts all systems in a single pass. It first estimates the equilibrium state across systems, then generates holistic forecasts based on the difference between the current state and the estimated equilibrium. Extensive experiments on synthetic and real-world datasets, including currency exchange and COVID-19 spread modeling, demonstrate that ESE is at least as accurate as state-of-the-art (SOTA) methods while being significantly faster. In addition, ESE integrates seamlessly with conventional predictors, combining their accuracy with its exceptional efficiency and delivering a 10-70x speedup. With linear-time complexity, ESE scales far better than SOTA methods as the number of systems increases. Moreover, it remains accurate under diverse perturbations, establishing ESE as a fast, generalizable, robust, and scalable multi-prediction method.

2606.13311 2026-06-12 cs.LG cs.AI 新提交

Rarity-Gated Context Conditioning for Offline Imitation Learning-Based Maritime Anomaly Detection

基于离线模仿学习的海事异常检测中的稀有门控上下文调节

Yongmin Kim, ByeongHoon Jeon, Sungil Kim

发表机构 * Department of Industrial Engineering, Ulsan National Institute of Science and Technology (UNIST)(蔚山科学技术院工业工程系)

AI总结 提出RGFiLM模块,通过稀有度门控调节上下文调制强度,解决上下文异常检测中稀有上下文导致的高误报问题,在海事轨迹异常检测中取得最佳F1-FPR权衡。

详情
AI中文摘要

上下文异常检测旨在根据上下文变量识别异常行为,但实际部署常面临高度不平衡的上下文分布,其中稀有情境可能包含关键信息。在这种频率偏差下,上下文条件模型可能在稀有上下文中产生不稳定的决策和过多的误报。我们提出稀有门控特征线性调制(RGFiLM),一种稀有感知调节模块,结合特征调制(即上下文条件化的隐藏特征缩放和平移)与由数据驱动稀有度分数控制的门控。稀有度分数根据上下文变量的经验分布估计,并调节上下文对中间表示的调制强度:在稀有上下文中门控更果断,而在常见上下文中保持保守。我们在使用AIS运动序列和ERA5环境上下文的环境敏感绕行场景中评估RGFiLM在海事轨迹异常检测中的表现。当实例化到顺序异常评分流程中时,RGFiLM在比较的上下文无关和上下文条件方法中实现了最佳的平均F1-假阳性率(FPR)权衡。这些结果表明,显式考虑上下文稀有性是减少上下文敏感异常检测中误报的有效方法。

英文摘要

Contextual anomaly detection aims to identify abnormal behavior conditional on context variables, but practical deployments often face highly imbalanced context distributions where rare regimes can carry critical information. Under such frequency bias, context-conditioned models can produce unstable decisions and excessive false alarms in rare contexts. We propose Rarity-Gated Feature-wise Linear Modulation (RGFiLM), a rarity-aware conditioning module that combines feature-wise modulation (i.e., context-conditioned scaling and shifting of hidden features) with a gate controlled by a data-driven rarity score. The rarity score is estimated from the empirical distribution of context variables and regulates how strongly context modulates intermediate representations: the gate becomes more decisive under rare contexts while remaining conservative under frequent contexts. We evaluate RGFiLM on maritime trajectory anomaly detection using AIS motion sequences with ERA5 environmental context in an environment-sensitive detour scenario. When instantiated in a sequential anomaly scoring pipeline, RGFiLM achieves the best mean F1--False Positive Rate (FPR) trade-off among the compared context-agnostic and context-conditioned methods. These results suggest that explicitly accounting for context rarity is an effective approach for reducing false alarms in context-sensitive anomaly detection.
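
One way to read the mechanism is sketched below. It assumes a rarity score equal to the negative log of a context's empirical frequency and a sigmoid gate that blends the unmodulated features with their FiLM-modulated version; both choices, along with the names `rarity_scores`, `gate`, and `gated_film` and the `midpoint` and `sharpness` parameters, are assumptions made for illustration and may differ from the paper's definitions.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def rarity_scores(contexts: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Rarity as -log(empirical frequency) of each discretised context value."""
    counts = Counter(contexts)
    total = len(contexts)
    return {ctx: -math.log(n / total) for ctx, n in counts.items()}


def gate(rarity: float, midpoint: float, sharpness: float = 1.0) -> float:
    """Sigmoid gate in (0, 1): close to 1 for rare contexts, close to 0 for common ones."""
    return 1.0 / (1.0 + math.exp(-sharpness * (rarity - midpoint)))


def gated_film(h: List[float], gamma: List[float], beta: List[float], g: float) -> List[float]:
    """Blend hidden features h with their FiLM-modulated version gamma * h + beta."""
    return [(1.0 - g) * x + g * (ga * x + be) for x, ga, be in zip(h, gamma, beta)]


# Toy context log: calm weather is common, storms are rare.
contexts = ["calm"] * 9 + ["storm"]
scores = rarity_scores(contexts)
g_common = gate(scores["calm"], midpoint=1.0)
g_rare = gate(scores["storm"], midpoint=1.0)

hidden = [1.0, 2.0]
gamma, beta = [2.0, 0.5], [0.1, -0.1]  # would come from a context encoder
modulated_rare = gated_film(hidden, gamma, beta, g_rare)
modulated_common = gated_film(hidden, gamma, beta, g_common)
```

With this blend, a gate of 0 leaves the hidden features untouched and a gate of 1 applies the full scale-and-shift, so the rarity score directly controls how much the context is allowed to change the representation.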

2606.11000 2026-06-12 quant-ph cs.LG cs.NE 交叉投稿

Analog Quantum Asynchronous Event-Based Graph Neural Network

模拟量子异步事件驱动图神经网络

Kristian Sotirov, Shaheen Acheche, Antonio A. Gentile, Osvaldo Simeone

发表机构 * King’s Communications, Learning and Information Processing (KCLIP) lab(国王通讯、学习与信息处理(KCLIP)实验室) Centre for Intelligent Information Processing Systems (CIIPS)(智能信息处理系统中心) Department of Engineering(工程系) Pasqal SAS(Pasqal SAS公司) Institute for Intelligent Networked Systems (INSI)(智能网络化系统研究所) Northeastern University London(伦敦东北大学)

AI总结 提出模拟量子异步事件驱动图神经网络(QA-AEGNN),利用中性原子量子处理器映射事件数据为原子阵列,通过Rydberg哈密顿量模拟消息传递,实现高效事件图计算。

Comments 31 pages, 8 figures, initial version

详情
AI中文摘要

异步、事件驱动的图神经网络(AEGNN)最近成为一种处理事件相机稀疏高时间分辨率数据的有效范式。本文提出量子模拟AEGNN(QA-AEGNN),一种在中性原子量子计算机上实现AEGNN的新框架。中性原子量子处理器基于可控的Rydberg原子相互作用,提供可编程的模拟量子计算平台。为此,我们将流式事件数据映射到被困中性原子阵列,每个原子代表一个图节点(事件),其位置使得几何邻近性反映事件的时空邻域。量子处理器的原生Rydberg哈密顿量被编程以镜像AEGNN的消息传递计算,原子量子比特状态作为节点特征嵌入,原子间相互作用实现图边。此外,我们提出一种混合量子-经典训练方案,其中模拟哈密顿量参数(如激光脉冲幅度和失谐)通过经典反馈优化,以从数据中学习量子AEGNN模型。我们的方法利用中性原子量子系统的连续哈密顿量动力学和大规模并行性,以潜在精度改进原生执行事件图计算。

英文摘要

Asynchronous, event-based graph neural networks (AEGNNs) have recently emerged as an efficient paradigm for processing the sparse and high-temporal-resolution data from event cameras. In this paper, we propose quantum analog AEGNNs (QA-AEGNNs), a novel framework to implement an AEGNN on a neutral-atom quantum computer. Neutral-atom quantum processors offer a programmable analog quantum computing platform based on controllable Rydberg-atom interactions. To this end, we map the streaming event data to an array of trapped neutral atoms, where each atom represents a graph node (event) and is positioned such that geometric proximity reflects the spatio-temporal neighborhood of events. The native Rydberg Hamiltonian of the quantum processor is programmed to mirror the message-passing computations of the AEGNN, with atomic qubit states serving as node feature embeddings and inter-atom interactions realizing graph edges. Furthermore, we propose a hybrid quantum-classical training scheme in which the analog Hamiltonian parameters (e.g., laser pulse amplitudes and detunings) are optimized using classical feedback to learn the quantum AEGNN model from data. Our approach leverages the continuous Hamiltonian dynamics and massive parallelism of neutral-atom quantum systems to natively execute event-based graph computations with potential accuracy improvements.
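
For orientation, the Rydberg Hamiltonian usually quoted for analog neutral-atom devices has the form below. It is the standard form and an assumption here, since the abstract does not give the exact Hamiltonian used. $\Omega(t)$ is the laser amplitude (Rabi frequency), $\delta(t)$ the detuning, $r_{ij}$ the distance between atoms $i$ and $j$, and $C_6$ the van der Waals coefficient.

```latex
H(t) = \frac{\hbar\,\Omega(t)}{2} \sum_i \sigma_i^x
       \;-\; \hbar\,\delta(t) \sum_i n_i
       \;+\; \sum_{i<j} \frac{C_6}{r_{ij}^{6}}\, n_i n_j ,
\qquad n_i = \lvert r_i \rangle\langle r_i \rvert
```

In this picture the atom positions set the interaction strengths, which is how geometric proximity can stand in for graph edges, while $\Omega$ and $\delta$ are the pulse parameters that the hybrid training scheme would tune.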

2606.12425 2026-06-12 cs.CY cs.AI cs.ET cs.HC cs.LG 交叉投稿

An Explainable AI Assistant for Introductory Programming Education: Improving Feedback Reliability with Instructor-AI Collaboration

面向入门编程教育的可解释AI助手:通过教师-AI协作提高反馈可靠性

Muntasir Hoq, Griffin Pitts, Bradford Mott, Seung Lee, Jessica Vandenberg, Shuyin Jiao, Narges Norouzi, James Lester, Bita Akram

发表机构 * North Carolina State University(北卡罗来纳州立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种可解释AI驱动的课堂助手,通过分析学生代码、映射逻辑错误到教师识别的误解并提供教师撰写的反馈,提高入门编程课程中反馈的可靠性和可解释性。

Comments Full paper accepted to the 27th International Conference on AI in Education (AIED 2026)

详情
AI中文摘要

主动学习被广泛认为是提高入门编程课程学习效果的有效方法。然而,不足的教学支持往往限制了学生获得及时、个性化反馈的机会,而这对于掌握基础编程概念至关重要。尽管最近AI的进展,特别是大型语言模型,为反馈提供了可扩展的机会,但可解释性和可靠性问题仍然存在。在本文中,我们提出了一种AI驱动的课堂助手,它利用可解释的AI模型分析学生代码,将逻辑错误映射到教师识别的误解,并提供教师撰写的反馈,从而将可靠性建立在教师定义的教学知识基础上。为了评估我们框架的有效性,我们进行了专家评估以检查其与教师验证反馈的一致性,并在课堂环境中部署了该系统以评估学生对其可用性的看法。结果表明,该助手能够为学生提供准确的、经过教师验证的反馈,同时培养积极的体验。

英文摘要

Active learning is widely recognized as an effective approach for improving learning outcomes in introductory programming courses. However, insufficient instructional support often limits students' access to timely, personalized feedback, which is crucial for mastering foundational programming concepts. Although recent advances in AI, particularly large language models, offer scalable opportunities for feedback, concerns about explainability and reliability remain. In this paper, we present an AI-driven classroom assistant that leverages an explainable AI model to analyze student code, map logical errors to instructor-identified misconceptions, and deliver instructor-authored feedback, thereby grounding reliability in instructor-defined pedagogical knowledge. To evaluate the effectiveness of our framework, we conducted an expert evaluation to examine its alignment with instructor-verified feedback and deployed the system in a classroom setting to assess students' perceptions of its usability. Results indicate that the assistant can provide accurate, instructor-verified feedback to students while fostering a positive experience.

2606.12435 2026-06-12 cs.CY cs.DB cs.LG 交叉投稿

Auditing Discriminatory Patterns in Mortgage Lending Through Association Rules and Fair Binning

通过关联规则和公平分箱审计抵押贷款中的歧视性模式

Archit Rathod, Dhwani Chande, Het Nagda

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 研究标准分箱预处理是否放大抵押贷款中的种族/性别差异,使用HMDA数据构建三阶段流水线,发现公平分箱以公平代价29.4%实现,K-Means聚类揭示黑人申请者拒绝率显著更高。

Comments 10 pages, 4 figures, fairness-aware mortgage lending analysis using HMDA 2023 data. Project repository available at GitHub

详情
AI中文摘要

美国的抵押贷款表现出持续的种族和性别差异。我们研究标准数据预处理步骤,特别是属性分箱,是否在下游模式挖掘中放大这些差异。使用来自HMDA 2023数据集(芝加哥大都市区)的103,481份清理后的抵押贷款申请,我们构建了一个三阶段流水线:(1)PySpark数据清理和分箱流水线,实现标准等频分箱和Asudeh等人[1]的ε偏置公平分箱算法;(2)FP-Growth关联规则挖掘,比较两种分箱制度下的拒绝模式;(3)K-Means聚类及每簇差异影响审计。我们的标准分箱在收入离散化中显示9.63%的种族偏差,与先前工作中报告的8-10%一致。使用七个种族组的公平分箱在ε=0.03时不可行,仅在ε=0.08时成功,公平代价为29.4%。FP-Growth揭示高债务收入比是主要的拒绝预测因子(置信度67.2%,提升度2.81),而种族偏差未表现为显式的高支持度规则。然而,K-Means聚类后进行差异影响审计标记了45个簇-组对中的10个,表明即使在财务相似的群体中,黑人申请者的拒绝率也显著高于白人申请者。

英文摘要

Mortgage lending in the United States exhibits persistent racial and gender disparities. We investigate whether standard data preprocessing steps, specifically attribute binning, amplify these disparities in downstream pattern mining. Using 103,481 cleaned mortgage applications from the HMDA 2023 dataset (Chicago metropolitan area), we build a three-stage pipeline: (1) a PySpark data cleaning and binning pipeline that implements both standard equal-frequency binning and the epsilon-biased fair binning algorithm from Asudeh et al. [1], (2) FP-Growth association rule mining that compares denial patterns under both binning regimes, and (3) K-Means clustering with a per-cluster disparate impact audit. Our standard binning shows 9.63% racial bias in income discretization, consistent with the 8-10% reported in prior work. Fair binning with seven race groups is infeasible at epsilon=0.03 and only succeeds at epsilon=0.08 with a Price of Fairness of 29.4%. FP-Growth reveals that high debt-to-income ratio is the dominant denial predictor (67.2% confidence, 2.81 lift), while racial bias does not appear as explicit high-support rules. However, K-Means clustering followed by a disparate impact audit flags 10 out of 45 cluster-group pairs, showing that Black applicants face significantly higher denial rates than White applicants even among financially similar groups.
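
The rule metrics and the audit statistic quoted above can be computed as in the following sketch, which uses the standard definitions of confidence and lift and, as an assumption, the four-fifths (0.8) threshold for the disparate impact ratio; the abstract does not state which threshold the authors applied. All counts are invented toy numbers and the function names are illustrative. Under the standard definition, the reported 67.2% confidence and 2.81 lift would correspond to an overall denial rate of about 0.672 / 2.81 ≈ 23.9%; that figure is derived here and is not reported in the abstract.

```python
def confidence(n_both: int, n_antecedent: int) -> float:
    """P(consequent | antecedent), e.g. P(denied | high debt-to-income)."""
    return n_both / n_antecedent


def lift(conf: float, n_consequent: int, n_total: int) -> float:
    """Confidence divided by the base rate of the consequent."""
    return conf / (n_consequent / n_total)


def disparate_impact(approved_group: int, total_group: int,
                     approved_reference: int, total_reference: int) -> float:
    """Ratio of approval rates: audited group over reference group."""
    return (approved_group / total_group) / (approved_reference / total_reference)


def is_flagged(ratio: float, threshold: float = 0.8) -> bool:
    """Four-fifths rule (assumed here): flag ratios below the threshold."""
    return ratio < threshold


# Toy counts: 1000 applications, 250 denied, 200 with high DTI, 140 of those denied.
rule_conf = confidence(n_both=140, n_antecedent=200)
rule_lift = lift(rule_conf, n_consequent=250, n_total=1000)

# Toy cluster: 60 of 100 applicants approved in one group, 80 of 100 in the reference group.
ratio = disparate_impact(60, 100, 80, 100)
flagged = is_flagged(ratio)
```

Running the same ratio once per cluster and per group is what yields a table of cluster-group pairs like the 45 pairs mentioned in the abstract.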

2606.12489 2026-06-12 cs.IT cs.LG math.IT 交叉投稿

Masked Neural Detection for Constrained Channel Coding in Molecular Communication

分子通信中约束信道编码的掩码神经检测

Melih Şahin, Ozgur B. Akan

发表机构 * Centre for neXt Communications (CXC), Department of Engineering, University of Cambridge(下一代通讯中心(CXC)、工程系、剑桥大学) Centre for neXt Communications (CXC), Department of Electrical and Electronics Engineering, Koç University(下一代通讯中心(CXC)、电子与电气工程系、科克大学)

AI总结 针对分子通信中的扩散记忆问题,提出掩码神经检测器,结合RLIM约束码与SBRNN,在多数情况下优于未编码检测,平均增益达10.36倍,并设计RLIM定制训练掩码进一步提升性能。

Comments 5 pages, 2 figures, 4 tables

详情
AI中文摘要

分子通信(MC)遭受严重的扩散记忆,因为一个符号释放的分子可能在后续符号期间到达。神经序列检测器,特别是滑动双向循环神经网络(SBRNN),在此类信道中能显著优于阈值检测器。这引出了MC信道编码的一个核心问题:当编码和未编码传输均采用神经检测评估时,先前在阈值检测下建立优势的码是否仍能保持其优势?本文针对游程限制的ISI缓解(RLIM)码(一类先前在MC中显示出巨大BER增益的约束码)回答了这一问题。在测试的工作点中,最佳RLIM-SBRNN接收机在59个案例中的46个中击败了最佳未编码接收机(在阈值和SBRNN检测之间选择),平均增益为10.36倍。我们还为紧凑型SBRNN检测器提出了一个RLIM定制的训练掩码,在236次比较中的227次中改进了未掩码的RLIM-SBRNN,当掩码有益时平均增益为3.267倍。最后,紧凑型掩码RLIM-SBRNN尽管不使用任何信道知识,但与信道状态感知的MLSE具有竞争力。

英文摘要

Molecular communication (MC) suffers from severe diffusion memory because molecules released for one symbol may arrive during later symbols. Neural sequence detectors, especially sliding bidirectional recurrent neural networks (SBRNNs), can substantially outperform threshold detectors in such channels. This raises a central question for MC channel coding: does a code whose advantage was established under threshold detection retain it when both coded and uncoded transmission are evaluated with neural detection? This letter answers this question for run-length-limited ISI-mitigation (RLIM) codes, a class of constrained codes previously shown to provide large BER gains in MC. Across the tested operating points, the best RLIM-SBRNN receiver beats the best uncoded receiver, chosen between threshold and SBRNN detection, in $46$ of $59$ cases, with a mean gain of $10.36\times$ over those wins. We also propose an RLIM-tailored training mask for compact SBRNN detectors, improving the unmasked RLIM-SBRNN in $227$ of $236$ comparisons with $3.267\times$ mean gain when masking is beneficial. Finally, the compact masked RLIM-SBRNN is competitive with channel-state-aware MLSE despite using no channel knowledge.

2606.12559 2026-06-12 physics.comp-ph cs.LG cs.NA math.NA physics.flu-dyn 交叉投稿

Feature-preserving Latent-EnKF for Data Assimilation of Flows with Shocks

保持特征的潜在EnKF用于含激波流动的数据同化

Hemanth Chandravamsi, Hangchuan Hu, Ponkrshnan Thiagarajan, Tamer A. Zaki

发表机构 * Department of Mechanical Engineering, Johns Hopkins University(约翰霍普金斯大学机械工程系)

AI总结 针对含激波流动中EnKF因多模态统计产生伪振荡的问题,提出在学习的低维潜在空间进行集合更新以保持激波特征,并通过共享解码器恢复物理状态,数值实验验证了无伪振荡的准确特征恢复。

详情
AI中文摘要

集合卡尔曼滤波(EnKF)被广泛用于顺序数据同化,但对于具有间断的解(如可压缩流中的激波)会失效。激波位置的不确定性导致多模态集合统计,违反了EnKF的高斯假设,在分析状态中产生大尺度伪振荡。我们引入了一种保持特征的潜在EnKF,在学习的低维潜在空间中进行集合更新,其中激波和流动特征具有光滑流形表示,从而在EnKF分析期间保持尖锐特征。更新后的潜在状态通过所有集合成员共享的解码器映射回物理状态。该算法消除了先前方法中使用的成员特定有序训练和正性下限。在Sod激波管和马赫2激波与二维圆柱相互作用的数值实验中,使用稀疏和噪声观测,结果显示能够准确恢复激波和接触间断的特征,且无伪振荡。

英文摘要

The ensemble Kalman filter (EnKF) is widely adopted for sequential data assimilation, but fails for solutions with discontinuities, such as shocks in compressible flows. Uncertainty in shock location induces multimodal ensemble statistics that violate the Gaussian assumptions underlying the EnKF, producing large-scale spurious oscillations in the analysis state. We introduce a feature-preserving latent-EnKF that performs the ensemble update in a learned low-dimensional latent space, where shock and flow features admit a smooth manifold representation, thereby preserving sharp features during EnKF analysis. The updated latent state is mapped back to the physical state through a shared decoder for all ensemble members. The algorithm eliminates the member-specific ordered training and positivity flooring used in prior approaches. Numerical experiments on a Sod shock tube and Mach 2 shock interaction with a 2D cylinder, using sparse and noisy observations, show accurate feature recovery of shocks and contact discontinuities without spurious oscillations.
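
The motivation can be seen in a toy example. An EnKF analysis forms linear combinations of ensemble members, so the sketch below simply compares averaging two members in physical space with averaging them in a latent space. It is not the filter itself: the "decoder" is a hand-written step function whose single latent variable is the shock position, standing in (as an assumption) for the learned decoder described in the abstract.

```python
from typing import List


def decode(shock_position: float, grid: List[float]) -> List[float]:
    """Toy decoder: state is 1.0 left of the shock and 0.0 from the shock onwards."""
    return [1.0 if x < shock_position else 0.0 for x in grid]


grid = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
latents = [0.3, 0.7]                        # two members disagree on shock location
members = [decode(z, grid) for z in latents]

# Linear combination in physical space: an intermediate plateau appears
# between the two shock locations, a state no ensemble member contains.
physical_mean = [sum(values) / len(values) for values in zip(*members)]

# Linear combination in latent space, then decode: one sharp shock survives.
latent_mean = decode(sum(latents) / len(latents), grid)
```

Here `physical_mean` takes the value 0.5 between the two shock locations, while `latent_mean` only ever takes the values 0.0 and 1.0, which is the sharp-feature behaviour the latent update is meant to preserve.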

2606.12806 2026-06-12 quant-ph cs.LG 交叉投稿

Quantum Reservoir Computing for Short-Term Power Load Forecasting in Resource-Constrained Energy Systems

量子储层计算在资源受限能源系统中的短期电力负荷预测

Mansi Od, Param Pathak, Nouhaila Innan, Muhammad Shafique

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出一种硬件高效的量子储层计算框架,通过固定量子储层和压缩经典读出层,在有限内存和硬件噪声下实现短期负荷预测,6位量化保留全精度性能并减少81.2%内存。

Comments 11 pages, 9 figures

详情
AI中文摘要

短期负荷预测对于可靠的能源管理至关重要，但在边缘设备上的实际部署需要模型在有限内存、有限测量预算和硬件噪声下保持准确性。本文提出一种硬件高效的量子储层计算（QRC）框架用于能源负荷预测，其中固定量子储层将时间输入窗口转换为高维特征，仅训练经典弹性网络读出层。为降低部署成本，训练后的读出层通过训练后定点量化压缩，位宽从8位到2位。该框架在Tetouan和Spain能源负荷数据集上评估，采用精确态矢量模拟、512次有限采样以及来自IBM FakeTorino和IBM FakeMarrakesh的真实硬件噪声模型。结果表明，6位读出精度保持全精度预测性能，同时将读出内存减少81.2%。低于此阈值时，性能退化依赖于数据集，Tetouan表现出更强的敏感性，而Spain退化更缓慢。硬件噪声验证进一步表明，训练后的读出层可转移到噪声储层状态而无需重新训练。这些发现支持量化QRC作为近期量子时间序列应用的资源感知预测方法。

英文摘要

Short-term load forecasting is essential for reliable energy management, but practical deployment on edge devices requires models that remain accurate under limited memory, finite measurement budgets, and hardware noise. This work proposes a hardware-efficient Quantum Reservoir Computing (QRC) framework for energy load forecasting, where a fixed quantum reservoir transforms temporal input windows into high-dimensional features and only a classical Elastic Net readout is trained. To reduce deployment cost, the trained readout is compressed using post-training fixed-point quantization at bit widths from 8 to 2 bits. The framework is evaluated on the Tetouan and Spain energy load datasets under exact statevector simulation, 512-shot finite sampling, and realistic hardware-noise models from IBM FakeTorino and IBM FakeMarrakesh. Results show that 6-bit readout precision preserves full-precision forecasting performance while reducing readout memory by 81.2%. Below this point, degradation becomes dataset dependent, with Tetouan showing stronger sensitivity and Spain degrading more gradually. Hardware-noise validation further shows that the trained readout transfers to noisy reservoir states without retraining. These findings support quantized QRC as a resource-aware forecasting approach for near-term quantum time-series applications.
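
As a rough illustration of the compression step, the sketch below applies symmetric uniform quantization to a list of readout weights and computes the memory saved relative to a baseline width. The scheme, the 32-bit baseline, and the names `quantize`, `dequantize`, and `memory_saving` are assumptions; the abstract does not specify the exact fixed-point format. Under the 32-bit assumption, 6-bit weights save 1 - 6/32 = 81.25% of weight storage, which is consistent with the 81.2% reported.

```python
from typing import List, Tuple


def quantize(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric uniform quantization to signed integer codes of the given width."""
    levels = 2 ** (bits - 1) - 1              # e.g. 31 for 6 bits
    peak = max(abs(w) for w in weights)
    scale = peak / levels if peak > 0 else 1.0
    codes = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


def memory_saving(bits: int, baseline_bits: int = 32) -> float:
    """Fraction of weight storage saved relative to the baseline width."""
    return 1.0 - bits / baseline_bits


readout = [0.5, -1.0, 0.25, 0.03]             # hypothetical Elastic Net coefficients
codes, scale = quantize(readout, bits=6)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(readout, restored))
saving = memory_saving(6)
```

The rounding error per weight is bounded by half a quantization step, so halving the bit budget roughly doubles the step size and the worst-case error.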

2606.12838 2026-06-12 q-bio.QM cs.AI cs.LG q-bio.GN 交叉投稿

OCOO-T: A Simple and Scalable Virtual Cell Model for Transcriptional Perturbation Response Prediction

OCOO-T: 一种用于转录扰动响应预测的简单可扩展虚拟细胞模型

Danning Jiang, Zheming An, Yalong Zhao, Lipeng Lai

AI总结 提出OCOO-T,一种基于流匹配的简约虚拟细胞模型,通过连续时间去噪和自适应层归一化,在多个基准上实现转录扰动预测的最优性能。

Comments 22 pages, 6 figures

详情
AI中文摘要

预测单细胞对遗传、化学和细胞因子扰动的转录响应是计算生物学和AI虚拟细胞(AIVC)建模中的一个基本挑战,对药物发现和基因调控网络的阐明具有直接影响。现有方法通常依赖辅助细胞状态编码器、分层变分自编码器、专用Transformer编码器-解码器模块或基因相互作用先验,将高维表达谱压缩为潜在表示。虽然有效,但这些设计增加了架构复杂性,可能限制可扩展性和泛化性。本文介绍了OCOO-T,一种基于流匹配的简约AIVC模型,用于转录扰动响应预测。OCOO-T利用一个直接操作连续基因表达谱的普通Transformer堆栈,并将扰动响应预测表述为连续时间去噪过程。通过自适应层归一化和上下文令牌整合扰动嵌入、剂量信息以及细胞系/细胞类型特异性。在Tahoe100M、Replogle和PBMC基准上的全面评估表明,OCOO-T在多种扰动和细胞类型上实现了最先进的性能,同时通过细胞上下文的修补和拆补有效扩展到长转录谱。通过利用基于Transformer去噪的单细胞组学简单性,OCOO-T为计算机细胞模拟提供了一个有效且可扩展的框架。

英文摘要

Predicting single-cell transcriptional responses to genetic, chemical and cytokine perturbations is a fundamental challenge in computational biology and AI Virtual Cell (AIVC) modeling, with direct implications for drug discovery and the elucidation of gene regulatory networks. Existing approaches often rely on auxiliary cell-state encoders, hierarchical variational autoencoders, dedicated Transformer encoder-decoder modules, or gene-interaction priors to compress high-dimensional expression profiles into latent representations. While effective, these designs increase architectural complexity and may limit scalability and generalizability. This paper introduces OCOO-T, a minimalist flow-matching-based AIVC model for transcriptional perturbation response prediction. OCOO-T utilizes a vanilla Transformer stack that operates directly on continuous gene expression profiles and formulates perturbation response prediction as a continuous-time denoising process. Perturbation embeddings, dosage information, and cell-line/cell-type specificity are integrated through adaptive layer normalization and in-context tokens. Comprehensive evaluations on Tahoe100M, Replogle, and PBMC benchmarks demonstrate that OCOO-T achieves state-of-the-art performance across diverse perturbations and cell types while effectively scaling to long transcriptional profiles through patching and depatching of cellular contexts. By leveraging the simplicity of Transformer-based denoising for single-cell omics, OCOO-T provides an effective and scalable framework for in-silico cellular simulation.
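
To make the flow-matching formulation concrete, here is a minimal sketch assuming the common linear interpolation path between a source sample and a target expression profile. The trained Transformer is replaced by an oracle velocity function, and the two-gene vectors, the function names, and the choice of source sample are all illustrative assumptions rather than details from the paper.

```python
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target for the velocity network along the linear path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0: Vector, velocity_fn: Callable[[Vector, float], Vector], steps: int) -> Vector:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


source = [0.0, 2.0]        # hypothetical starting profile (two genes)
perturbed = [1.0, -1.0]    # hypothetical perturbed profile


def oracle(x: Vector, t: float) -> Vector:
    # Stand-in for the trained network: returns the exact path velocity.
    return target_velocity(source, perturbed)


midpoint = interpolate(source, perturbed, 0.5)
predicted = euler_sample(source, oracle, steps=4)
```

During training, the network would be regressed onto `target_velocity` at random times `t`; at inference, the same Euler loop runs with the network in place of `oracle`, conditioned on the perturbation, dosage, and cell-type inputs.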

2606.12916 2026-06-12 cs.AI cs.CL cs.LG 交叉投稿

MDForge: Agentic Molecular Dynamics Pipeline Design under Sparse Simulator Feedback

MDForge:稀疏模拟器反馈下的智能分子动力学流水线设计

Zehong Wang, Yijun Ma, Connor R. Schmidt, Tianyi Ma, Weixiang Sun, Ziming Li, Xiaoguang Guo, Chuxu Zhang, Matthew J. Webber, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学) University of Connecticut(康涅狄格大学)

AI总结 提出MDForge,利用LLM智能体通过多智能体辩论将稀疏奖励稠密化,自动设计分子动力学流水线,在SAMPL基准上达到专家水平,并发现新型高亲和力CB[7]结合剂。

详情
AI中文摘要

分子动力学（MD）是原子分子科学中经典的计算机模拟方法，从第一性原理物理模拟分子行为。为新系统设计MD流水线需要大量专业知识：即使在一个分子上运行也代价高昂，排除了试错法。我们使用LLM智能体自动化这一专家流水线设计过程。与现有编排预定义工具集的MD智能体不同，我们将流水线设计视为开放式代码生成，其中智能体的行为通过语言奖励在线重塑。具体而言，我们构建了MDForge，一个LLM智能体，其上下文更新规则通过物理专家间的多智能体辩论将稀疏奖励稠密化。在三个SAMPL主客体结合自由能基准上，MDForge自动设计的MD流水线与人类专家竞争。部署在未见过的候选客体库上，其CB[7]流水线发现了一种新型结合剂，湿实验竞争NMR证实其为高亲和力、皮摩尔级的CB[7]结合剂。我们的数据和代码可在 https://github.com/Zehong-Wang/MDForge 获取。

英文摘要

Molecular dynamics (MD) is the canonical in-silico method for atomistic molecular science, simulating molecular behavior from first-principles physics. Designing an MD pipeline for a new system requires substantial expert knowledge: running it on even one molecule is expensive, ruling out trial-and-error. We automate this expert pipeline-design process with an LLM agent. Unlike existing MD agents that orchestrate a predefined tool set, we treat pipeline design as open-ended code generation in which the agent's behavior is reshaped online by verbal reward. Specifically, we build MDForge, an LLM agent whose in-context update rule densifies the sparse reward via a multi-agent debate among physics experts. On three SAMPL host-guest binding free-energy benchmarks, MDForge automatically designs MD pipelines competitive with human experts. Deployed on a library of unseen candidate guests, its CB[7] pipeline discovers a novel binder that wet-lab competition NMR confirms is a high-affinity, picomolar CB[7] binder. Our data and code are available at https://github.com/Zehong-Wang/MDForge.

2606.13017 2026-06-12 q-bio.NC cs.LG 交叉投稿

Deep Sleep Classification via EEG Signal Criticality: A Passive BCI Approach for Sleep-Improvement Neurofeedback

基于EEG信号临界性的深度睡眠分类:一种用于改善睡眠神经反馈的被动BCI方法

Stanisław Narębski, Tomasz Komendziński, Tomasz M. Rutkowski

AI总结 本研究利用去趋势波动分析(DFA)提取的临界性特征,通过朴素贝叶斯分类器实现了对深度睡眠(N3)的高精度识别(平衡准确率87.17%),为被动脑机接口中的状态依赖神经反馈提供了高效感知机制。

Comments 7 pages, 3 figures, accepted for publication in the Proceedings of the 10th Graz Brain-Computer Interface Conference 2026, Graz, Austria, September 14-17, 2026

详情
AI中文摘要

自动睡眠分期是被动脑-机接口（pBCI）的一项基础应用，它解码自发神经状态以实现独立于用户意图的闭环干预。本研究评估了从去趋势波动分析（DFA）中提取的临界性特征，用于特定识别深度睡眠（N3）。我们分析了来自290名老年女性的347,232个EEG时段，使用UMAP流形学习可视化状态转换。随后，通过10折交叉验证对六个分类器进行基准测试，使用平衡准确率确定神经反馈的最佳“状态感知”引擎。朴素贝叶斯达到了最高的平均平衡准确率（87.17% ± 0.24%），显著优于全连接深度神经网络（FNN：81.58%）和随机森林（80.97%）。线性模型（LDA：57.21%；SVM：51.01%）表现不佳，表明DFA衍生的临界性特征位于一个独特的非线性流形上。EEG临界性的概率解码为pBCI提供了一种高精度的感知机制。这种稳健的分类流程支持开发状态依赖的神经反馈，例如靶向听觉刺激，以增强认知恢复。

英文摘要

Automated sleep staging is a fundamental application of passive Brain-Computer Interfaces (pBCI), decoding spontaneous neural states to enable closed-loop interventions independent of user intent. This study evaluates criticality features derived from Detrended Fluctuation Analysis (DFA) for the specific identification of deep sleep (N3). We analyzed $347,232$ EEG epochs from $290$ older women using UMAP manifold learning to visualize state transitions. Subsequently, six classifiers were benchmarked via 10-fold cross-validation, using balanced accuracy to determine the optimal "state-sensing" engine for neurofeedback. Naive Bayes achieved the highest mean balanced accuracy ($87.17\% \pm 0.24\%$), significantly outperforming a fully connected deep neural network (FNN: $81.58\%$) and Random Forest ($80.97\%$). Linear models (LDA: $57.21\%$; SVM: $51.01\%$) performed poorly, indicating that DFA-derived criticality features reside on a distinct, non-linear manifold. Probabilistic decoding of EEG criticality provides a high-accuracy sensing mechanism for pBCIs. This robust classification pipeline supports the development of state-dependent neurofeedback, such as targeted auditory stimulation, to enhance cognitive recovery.
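
As a rough illustration of the kind of criticality feature the abstract refers to, the sketch below estimates a DFA scaling exponent for a 1-D signal in plain Python. It is a minimal sketch and not the authors' pipeline: the function names (`linear_fit`, `dfa_exponent`), the window sizes, the first-order (linear) detrending, and the synthetic test signals are all assumptions made for this example.

```python
import math
import random


def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for paired samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def dfa_exponent(signal, window_sizes):
    """Estimate the DFA scaling exponent (hypothetical helper, linear detrending)."""
    mean = sum(signal) / len(signal)
    # Step 1: build the profile, i.e. the cumulative sum of the mean-centred signal.
    profile, running = [], 0.0
    for value in signal:
        running += value - mean
        profile.append(running)

    log_sizes, log_fluctuations = [], []
    for n in window_sizes:
        t = list(range(n))
        squared_residuals = []
        # Step 2: split the profile into non-overlapping windows of length n
        # and remove a least-squares line from each window.
        for start in range(0, len(profile) - n + 1, n):
            window = profile[start:start + n]
            slope, intercept = linear_fit(t, window)
            squared_residuals.extend(
                (y - (slope * i + intercept)) ** 2 for i, y in zip(t, window)
            )
        # Step 3: root-mean-square fluctuation at this window size (in log space).
        mean_square = sum(squared_residuals) / len(squared_residuals)
        log_sizes.append(math.log(n))
        log_fluctuations.append(0.5 * math.log(mean_square))

    # Step 4: the exponent is the slope of log F(n) against log n.
    alpha, _ = linear_fit(log_sizes, log_fluctuations)
    return alpha


# Synthetic demo signals (assumptions, not EEG data).
rng = random.Random(0)
white_noise = [rng.gauss(0.0, 1.0) for _ in range(8192)]
random_walk, position = [], 0.0
for step in white_noise:
    position += step
    random_walk.append(position)

sizes = [16, 32, 64, 128, 256]
print("white noise exponent:", dfa_exponent(white_noise, sizes))
print("random walk exponent:", dfa_exponent(random_walk, sizes))
```

Uncorrelated noise is expected to give an exponent of roughly 0.5 and a random walk roughly 1.5. In a classification setting, exponents like these would be computed per EEG epoch and passed to a classifier as features.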

2606.13133 2026-06-12 cs.DS cs.LG 交叉投稿

Learning-Augmented Approximation for Unrelated-Machines Makespan Scheduling

学习增强的无关联机器调度近似算法

Kaito Baba, Evripidis Bampis, Giorgos Mitropoulos

AI总结 针对无关联机器调度问题,提出学习增强算法,利用重作业分配预测实现精确预测时(1+ε)-近似,误差增大时退化为2-近似。

Comments 22 pages, 3 figures

AI中文摘要

最近,Antoniadis等人(ICLR 2025)提出了一个框架,通过引入预测来近似NP-hard选择问题。尽管该方法简单,但它紧密匹配理论下界,因此其推广极具吸引力。我们解决了Antoniadis等人工作中提出的一个开放问题,即如何将该方法扩展到选择问题类之外的其他重要问题,例如调度问题。我们为无关联机器上的最小化完工时间问题(记为$R\|C_{\max}$)开发了一种学习增强算法。通过使用重作业分配的预测,我们在预测准确时实现了多项式时间的$(1+\varepsilon)$-近似,并且随着误差增加,该近似平滑地退化为最坏情况下的2-近似。我们通过实证分析总结了我们的工作。

英文摘要

Recently, Antoniadis et al. (ICLR 2025) proposed a framework for incorporating predictions to approximate NP-hard selection problems. Despite its simplicity, this approach tightly matches theoretical lower bounds, making its generalization highly compelling. We address an open question raised in the work of Antoniadis et al., concerning the extension of this approach to other important problems outside the class of selection problems, such as scheduling. We develop a learning-augmented algorithm for the makespan minimization problem on unrelated machines, denoted by $R\|C_{\max}$. By using predictions of heavy job assignments, we achieve a polynomial-time $(1+\varepsilon)$-approximation for accurate predictions that smoothly degrades to a worst-case 2-approximation as the error increases. We conclude our work with an empirical analysis of our method.
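
To make the $R\|C_{\max}$ objective concrete, the toy sketch below computes the makespan of an assignment on unrelated machines and shows how following predictions for heavy jobs can help or hurt. It is only an illustration under stated assumptions: the helper names, the `heavy_threshold` rule, and the greedy placement of light jobs are invented for this example. It is not the algorithm from the paper and carries none of its approximation guarantees.

```python
def makespan(times, assignment):
    """Makespan of an assignment; times[m][j] is the time of job j on machine m."""
    loads = [0] * len(times)
    for job, machine in enumerate(assignment):
        loads[machine] += times[machine][job]
    return max(loads)


def assign_with_heavy_predictions(times, predicted, heavy_threshold):
    """Toy rule: heavy jobs follow the prediction, light jobs are placed greedily."""
    n_machines, n_jobs = len(times), len(times[0])
    loads = [0] * n_machines
    assignment = [None] * n_jobs

    # Heavy jobs (fastest processing time at or above the threshold) trust the prediction.
    for job in range(n_jobs):
        fastest = min(times[m][job] for m in range(n_machines))
        if fastest >= heavy_threshold and job in predicted:
            machine = predicted[job]
            assignment[job] = machine
            loads[machine] += times[machine][job]

    # Remaining jobs go to the machine that would finish them earliest.
    for job in range(n_jobs):
        if assignment[job] is None:
            machine = min(range(n_machines), key=lambda m: loads[m] + times[m][job])
            assignment[job] = machine
            loads[machine] += times[machine][job]
    return assignment


# Two machines, four jobs (made-up processing times).
times = [
    [6, 2, 1, 5],  # machine 0
    [3, 6, 4, 5],  # machine 1
]
good = assign_with_heavy_predictions(times, {0: 1, 3: 0}, heavy_threshold=3)
bad = assign_with_heavy_predictions(times, {0: 0, 3: 0}, heavy_threshold=3)
print(good, makespan(times, good))  # [1, 0, 1, 0] 7
print(bad, makespan(times, bad))    # [0, 1, 1, 0] 11
```

In this instance the misleading prediction for job 0 raises the makespan from 7 to 11, which is the kind of degradation with prediction error that a learning-augmented guarantee is meant to bound.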

2606.13136 2026-06-12 cs.CV cs.LG eess.IV 交叉投稿

An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors

一种可扩展且轻量级的统一架构用于像素合并图像传感器的去马赛克

Saurabh Kumar, Nutan Sairam Yenneti

发表机构 * Samsung Research Institute Bangalore(三星研究院班加罗尔分院)

AI总结 提出模块化统一架构,通过无学习CFA识别模块和轻量级设计,实现多种像素合并传感器的去马赛克,提升图像质量并降低资源消耗。

AI中文摘要

像素合并图像传感器因其分辨率与聚光能力的权衡,正成为智能手机相机的默认选择。然而,与拜耳彩色滤光片阵列(CFA)相比,它们更大的颜色间分离使得去马赛克更具挑战性。此外,现有的基于深度学习的去马赛克方法是CFA特定的,需要多个独立模型,占用宝贵的板载资源,并需要更大的开发和维护工作。在这项工作中,我们提出了一种模块化的统一架构,用于对各种像素合并传感器进行去马赛克,该架构在可扩展且轻量级的同时提供更高的图像质量。此外,为了实现即插即用操作,我们引入了一个无学习的CFA识别模块,以准确检测原始数据的CFA类型。

英文摘要

Pixel-bin image sensors are becoming the default choice for smartphone cameras due to their resolution vs light-gathering trade-off. However, their larger inter-color separation compared to the Bayer color filter array (CFA) makes them challenging to demosaic. Furthermore, existing deep learning-based demosaicing methods are CFA-specific, requiring multiple individual models that take up precious onboard resources and demand larger development and maintenance efforts. In this work, we propose a modular unified architecture for demosaicing various pixel-bin sensors that provides higher image quality while being extensible and lightweight. Additionally, to enable plug-and-play operation, we introduce a learning-free CFA-identification module to detect the CFA type of raw data accurately.

2606.13216 2026-06-12 cs.CL cs.LG 交叉投稿

Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization

分层最优传输用于神经机器翻译和抽象摘要中的幻觉检测

Mariia Onyshchuk, Maksym-Vasyl Tarnavskyi, Marta Sumyk

AI总结 通过最优传输分析跨注意力分布,发现幻觉检测集中于解码器前四层,且该方法在源脱离时有效,但无法检测注意力下游的不忠实摘要。

Comments Accepted to ICML Mechanistic Interpretability Workshop 2026

AI中文摘要

最优传输(OT)已被证明可以通过测量跨注意力分布与参考分布之间的几何距离来检测神经机器翻译(NMT)中的幻觉,无需任何监督。我们将此分析扩展到Fairseq DE-EN模型的所有六个解码器层($N=3{,}414$),表明Wass-to-Unif和Wass-to-Data是互补的检测器,专门针对不同类型的幻觉;检测集中在L1--L4层,而L5层对较微妙的类型具有反预测性;并且幻觉翻译缺乏正确翻译从第一步解码开始就存在的探索性注意力阶段。我们进一步评估了几何信号是否可迁移到抽象摘要忠实性检测:在AggreFact($N=1{,}116$)上,我们的无监督OT检测器在CNN/XSum上达到$57.2\%$/$57.6\%$的平衡准确率——高于随机水平,但远低于有监督的MiniCheck-Flan-T5-L($69.9\%$/$74.3\%$)。这种差距是原则性的:与NMT幻觉不同,不忠实的摘要可以正确关注源标记,同时歪曲其内容,这种失败模式在基于集中度的OT指标中由于构造原因而不可见。在T5-base上的结构实验证实了解码器在深度上的一致组织,其中第3层显示峰值集中度,第12层对生成质量最为关键。总之,结果确立了当失败模式是源脱离时,跨注意力的OT是一种可靠的检测器;无论任务如何,它都是一种原则性的可解释性工具;而当忠实性失败发生在注意力下游时,它则具有根本局限性。

英文摘要

Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision. We extend this analysis to all six decoder layers of the Fairseq DE-EN model ($N=3{,}414$), showing that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection is concentrated in layers L1--L4 with L5 anti-predictive for subtler types, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step. We further evaluate whether the geometric signal transfers to abstractive summarization faithfulness detection: our unsupervised OT detector on AggreFact ($N=1{,}116$) achieves $57.2\%$/$57.6\%$ balanced accuracy on CNN/XSum -- above chance but substantially below supervised MiniCheck-Flan-T5-L ($69.9\%$/$74.3\%$). This gap is principled: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode invisible to concentration-based OT metrics by construction. Structural experiments on T5-base confirm consistent decoder organisation across depth, with Layer 3 showing peak concentration and Layer 12 being most critical for generation quality. Together, the results establish OT on cross-attention as a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention.
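
The idea of scoring an attention distribution by its transport distance to a uniform reference can be sketched in a few lines. The example below is a minimal sketch under an explicit assumption: the ground cost is the distance between source-token positions, so the 1-D Wasserstein-1 distance reduces to a sum of absolute CDF gaps. The function name `wasserstein_to_uniform` is hypothetical, and the cost function and aggregation used in the paper may differ.

```python
def wasserstein_to_uniform(attention):
    """Wasserstein-1 distance between an attention vector and the uniform
    distribution over the same positions, with unit spacing between positions."""
    n = len(attention)
    total = sum(attention)  # normalise in case the weights do not sum to 1
    cdf_gap = 0.0
    distance = 0.0
    for weight in attention:
        # Running difference between the two cumulative distributions.
        cdf_gap += weight / total - 1.0 / n
        distance += abs(cdf_gap)
    return distance


print(wasserstein_to_uniform([0.25, 0.25, 0.25, 0.25]))  # 0.0
print(wasserstein_to_uniform([1.0, 0.0, 0.0, 0.0]))      # 1.5
```

A value near zero means the attention mass is spread evenly over the source, while larger values mean it is concentrated away from uniform. Which direction signals a hallucination depends on the detector and the reference distribution being used.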

2606.13220 2026-06-12 cs.AI cs.CE cs.ET cs.LG cs.MA 交叉投稿

LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis

LLM作为调查员:基于证据优先的鲁棒交互式问题诊断

Fabrizio Marozzo, Pietro Liò

发表机构 * University of Calabria(卡拉布里亚大学) University of Cambridge(剑桥大学)

AI总结 提出证据优先的AI方法LLM-as-an-Investigator,通过估计问题歧义、生成假设、提问澄清并更新概率,避免过早接受用户假设,提升诊断准确性。

AI中文摘要

大型语言模型(LLM)越来越多地被用作技术问题解决的交互式助手。然而,当用户提供不完整的描述或看似合理但未经证实的解释时,LLM可能会过早地认同这些假设,并在收集足够证据之前提出解决方案。我们将这种行为称为用户驱动的谄媚:LLM倾向于强化用户提供的假设,而不是测试其他解释。本文介绍了LLM-as-an-Investigator,一种基于证据优先的智能体AI方法,用于鲁棒的问题诊断。该方法通过一个解决方案调查智能体实现,该智能体估计初始问题描述的模糊性,生成候选假设,提出有针对性的澄清问题,并在每次回答后更新假设概率。该智能体不是立即给出响应,而是继续调查,直到证据使一个候选解释比其他解释更强。为了评估该方法,我们从机械、电气和液压领域已解决的技术论坛帖子中构建了一个基准测试。我们使用一个三智能体评估流程:问题-解决方案提取智能体将已解决的帖子转换为结构化案例,真实答案评估智能体在隐藏已知解决方案的同时模拟用户,被测试的助手通过对话尝试恢复解决方案。实验比较了标准助手、面向推理的LLM和基于调查员的模型,使用不同的LLM骨干网络。除了诊断准确性,我们还分析了标准助手在诊断案例中如何遵循误导性的用户假设。结果表明,所提出的方法比直接提示和仅推理基线更准确地识别问题,而其证据优先协议有助于减少用户引发的对话偏差。

英文摘要

Large language models (LLMs) are increasingly used as interactive assistants for technical problem solving. However, when users provide incomplete descriptions or plausible but unverified explanations, LLMs may prematurely align with these assumptions and propose solutions before collecting sufficient evidence. We refer to this behavior as user-driven sycophancy: the tendency of an LLM to reinforce a user-provided hypothesis instead of testing alternative explanations. This paper introduces LLM-as-an-Investigator, an evidence-first agentic AI methodology for robust problem diagnosis. The approach is implemented through a Solution Investigator Agent, which estimates the ambiguity of an initial problem description, generates candidate hypotheses, asks targeted clarification questions, and updates hypothesis probabilities after each answer. Rather than producing an immediate response, the agent continues the investigation until the evidence makes one candidate explanation stronger than the alternatives. To evaluate the approach, we build a benchmark from solved technical forum threads in mechanical, electrical, and hydraulic domains. We use a three-agent evaluation pipeline in which a Problem-Solution Extractor Agent converts solved threads into structured cases, a Ground-Truth Evaluator Agent simulates the user while hiding the known solution, and the tested assistant attempts to recover the solution through dialogue. The experiments compare standard assistants, reasoning-oriented LLMs, and the proposed investigator-based model across LLM backbones. In addition to diagnostic accuracy, we analyze how standard assistants follow misleading user hypotheses in diagnostic cases. The results show that the proposed approach identifies the problem more accurately than direct prompting and reasoning-only baselines, while its evidence-first protocol helps reduce user-induced conversational bias.

2606.13302 2026-06-12 cs.AI cs.LG 交叉投稿

Physics-Guided Spatiotemporal Learning for Coastal Wave Peak Period Estimation from Video

物理引导的时空学习用于从视频估计海岸波浪峰值周期

Abubakar Hamisu Kamagata, Dharm Singh Jat, Attlee Munyaradzi Gamundani, Abhishek Srivastava, Paramasivam Saravanakumar

发表机构 * Namibia University of Science and Technology(纳米比亚科技大学) Indian Institute of Technology Indore(印度理工学院印多尔分校) Namdeb Diamond Corporation(纳米比亚钻石公司)

AI总结 提出物理引导的深度时空学习框架,结合自动区域检测、模拟到真实迁移学习和物理信息正则化,从海岸视频直接估计近岸波浪峰值周期,验证了基于Transformer和轻量级循环卷积架构的有效性。

AI中文摘要

近岸波浪参数对于海岸工程、海岸线保护、海洋灾害评估和气候适应性的海岸管理至关重要。传统的监测系统如浮标和雷达平台提供精确监测,但安装和维护成本高,空间覆盖有限。通过深度学习实现了使用视频的被动海洋监测,然而许多方法在海洋学上缺乏物理可解释性、可行性和验证。本文提出了一种物理引导的深度时空学习框架,用于从被动海岸视频流直接估计近岸波浪峰值周期。该框架结合了基于自动时间方差感兴趣区域检测、多阶段模拟到真实迁移学习和物理信息正则化,以提高预测精度和物理一致性。评估了多种时空架构,如基于Transformer和循环卷积的架构,以及合成预训练、银标签自适应和专家微调。结果表明,基于Transformer的架构在瞬时预测精度方面表现更好,而轻量级循环卷积架构实现了更高的时间稳定性和操作海洋学技能。消融研究也证明了物理引导正则化在趋势跟随一致性和减少物理上不可信预测方面的益处。可解释性审计有助于将注意力集中在水动力活跃的碎波带区域,并与物理推导的波浪传播行为良好吻合。总体而言,所提出的框架展示了基于物理引导视频的深度学习系统在长期海岸波浪监测中的潜力,具有成本效益和操作可行性。

英文摘要

Wave parameters in the nearshore are crucial for coastal engineering, shoreline protection, marine hazard assessment, and coastal management for climate resilience. Traditional monitoring systems such as buoys and radar platforms offer accurate monitoring but can have high installation and maintenance expenses and limited spatial coverage. Passive ocean monitoring using video has been achieved by leveraging deep learning; however, many methods are not physically interpretable, feasible, or validated for oceanography. In this work, a Physics-Guided Deep Spatiotemporal Learning Framework is proposed for direct estimation of nearshore wave peak periods from passive coastal video streams. The framework combines automated temporal-variance-based region-of-interest detection, multi-stage Sim-to-Real transfer learning, and physics-informed regularization to enhance predictive accuracy and physical consistency. A variety of spatiotemporal architectures were assessed, such as transformer-based and recurrent-convolutional ones, alongside synthetic pretraining, silver-label adaptation, and expert fine-tuning. The results show that transformer-based architectures performed better in terms of instantaneous prediction accuracy, while lightweight recurrent-convolutional architectures achieved higher temporal stability and operational oceanographic skill. Ablation studies also demonstrated the benefits of physics-guided regularization for trend-following consistency and for reducing physically implausible predictions. Explainability auditing also helped to focus attention in hydrodynamically active surf-zone regions and showed good agreement with the physically derived wave propagation behavior. In general, the proposed framework shows the promise of physics-guided video-based deep learning systems for long-term coastal wave monitoring that are cost-efficient and operationally feasible.

2606.13515 2026-06-12 cs.CV cs.LG cs.RO 交叉投稿

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

MaskWAM:统一掩码提示与预测的世界-动作模型

Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu, Heng Li, Ping Tan

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tencent Robotics X(腾讯机器人X实验室) Tsinghua University(清华大学)

AI总结 提出MaskWAM,通过统一掩码输入与预测的混合Transformer架构,解决世界-动作模型的空间瓶颈,提升策略泛化能力,在LIBERO等任务上显著优于基线。

AI中文摘要

世界-动作模型(WAMs)通过视频预测为机器人控制提供了一种有前景的范式。然而,当前的WAMs存在根本性的空间瓶颈:标准文本输入在杂乱场景中引入指代歧义,而非结构化的RGB预测缺乏语义基础,并受任务无关背景的偏差影响。为克服这些限制,我们引入了MaskWAM,一种以对象为中心的世界-动作模型。通过统一的混合Transformer(MoT)将掩码同时作为显式输入和预测进行联合集成,MaskWAM实现了鲁棒的策略泛化。该设计提供两个关键优势:(1)预测未来掩码产生以对象为中心的语义监督,抑制视觉噪声,显著增强甚至标准文本条件的WAMs;(2)将此预测监督与第一帧视觉提示(如目标对象掩码)耦合,建立精确的空间锚点,大幅减少语言歧义。关键在于,由于WAMs本质上是视觉驱动的架构,直接掩码条件化比单独文本提供更强的引导,为操作未见对象建立了精确且鲁棒的范式。在LIBERO、RoboTwin和真实世界任务上的评估表明,MaskWAM在语言清晰和语言模糊任务中均显著优于基线。

英文摘要

World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

2606.13529 2026-06-12 cs.HC cs.LG 交叉投稿

Ride, Track, and Recover: Pilot Randomized Trial of a Wearable Digital Self-Management Intervention During a Veteran Endurance-Cycling Program

骑行、追踪与恢复:一项关于可穿戴数字自我管理干预在退伍军人耐力骑行项目中的初步随机试验

Alan Ta, Nilsu Salgin, Caleb Armstrong, Kala Phillips Reindel, Farzan Sasangohar

发表机构 * Department of Industrial and Systems Engineering, Texas A&M University(工业与系统工程系,德克萨斯A&M大学) Texas A&M Health Telehealth Institute(德克萨斯A&M健康远程医疗研究所)

AI总结 本研究通过随机试验,评估可穿戴数字自我管理干预对退伍军人创伤后应激障碍(PTSD)高唤醒症状的稳定效果,发现干预组症状改善更持久,且机器学习检测精度与症状严重程度正相关。

AI中文摘要

退伍军人的创伤后应激障碍(PTSD)以持续高唤醒及共病焦虑和抑郁症状为特征,这些症状在临床环境外难以监测和管理。在德克萨斯州参加“英雄计划”骑行活动的13名退伍军人,通过计算机生成序列在自然环境中随机分为两组:(1)数字干预加体力活动,或(2)仅体力活动,外加一个由从更广泛的“英雄计划”退伍军人社区中选出的7名退伍军人组成的第三组家庭监测对照组。连续智能手表传感结合心率和加速度计特征来检测高唤醒事件,并由参与者实时确认。每周收集焦虑、抑郁和PTSD严重程度的自我报告测量。广义加性混合模型描述了随时间变化的非线性轨迹。基线归一化的高唤醒轨迹在不同条件下存在显著差异,数字干预组(n=7)显示出结构化的稳定,而仅体力活动组(n=3)在研究后期出现恶化。两个骑行组在耐力活动期间均表现出急性症状改善;然而,数字干预组表现出更高的整体收益维持。家庭对照组(n=4)显示出症状逐渐下降。机器学习检测的感知精度在个体间差异很大,并与症状严重程度正相关,较高严重程度的参与者确认了更大比例的检测事件。这些结果表明,将可穿戴检测与数字自我管理工具相结合可能支持高唤醒的稳定和症状改善,同时强调了在可穿戴心理健康系统中个性化和以人为中心的设计的重要性。

英文摘要

Post-traumatic stress disorder (PTSD) in veterans is characterized by persistent hyperarousal and comorbid anxiety and depressive symptoms that are difficult to monitor and manage outside clinical settings. Thirteen veterans participating in a Project Hero cycling event in Texas were randomized by computer-generated sequence in a naturalistic setting to two arms: (1) digital intervention plus physical activity, or (2) physical activity only, plus a third at-home monitoring control cohort consisting of 7 veterans selected from the broader Project Hero veteran community. Continuous smartwatch sensing combined heart rate and accelerometer features to detect hyperarousal events, which were confirmed in real time by participants. Weekly self-report measures of anxiety, depression, and PTSD severity were collected. Generalized additive mixed models characterized nonlinear trajectories over time. Baseline-normalized hyperarousal trajectories differed significantly across conditions, with the digital intervention group (n=7) showing structured stabilization compared to late-study escalation in the physical-only group (n=3). Both cycling groups exhibited acute symptom improvements during the endurance event; however, the digital intervention group demonstrated a higher overall maintenance of gains. The at-home control group (n=4) showed gradual symptom declines. Perceived precision of ML detections varied substantially across individuals and was positively associated with symptom severity, with higher-severity participants confirming a greater proportion of detected events. These results suggest that coupling wearable detection with digital self-management tools may support stabilization of hyperarousal and symptom improvement while emphasizing the importance of personalization and human-centered design in wearable mental health systems.

2606.13532 2026-06-12 cs.NI cs.LG 交叉投稿

Graphical Causal Reasoning for Root Cause Analysis in Cloud Networks

云网络中根本原因分析的图因果推理

Fabien Chraim, Dominik Janzing, John Evans

AI总结 提出基于图因果发现的云网络事故根本原因分析方法,通过时空分组和自动化本体降维,利用双变量Granger因果性和条件独立性检验构建因果图,并引入概率方法进行时间感知的根因评分。在35个生产事故中召回率85.7%,精确匹配率74.3%。

Comments 6 pages, 4 figures

AI中文摘要

云计算依赖于大规模网络,这些网络本质上是复杂系统。在本文中,我们提出了一种新颖的云网络事故根本原因分析(RCA)方法,利用基于图的因果发现技术。我们的方法通过引入时空分组策略和自动化本体来降低问题维度,从而解决了基于规则的自动化的局限性。我们使用双变量Granger因果性和条件独立性检验从二元时间序列数据构建因果图。对于推理,我们引入了一种概率方法,该方法根据时间延迟分配边特定的条件概率,从而通过因果图遍历实现可解释的、时间感知的根因评分。我们使用来自一家主要云提供商的35个生产事故的标记数据集评估了该系统。该模型成功召回正确根因的事故占85.7%,精确匹配的事故占74.3%。在生产中,该系统已用于800多个真实世界事故,并获得了网络工程师的积极定性反馈。这些结果突显了在动态和大规模运营环境中采用数据驱动的因果方法进行RCA的实用性。

英文摘要

Cloud computing relies on large-scale networks, which are inherently complex systems. In this paper, we present a novel approach to root cause analysis (RCA) of cloud network incidents, leveraging graph-based causal discovery techniques. Our method addresses the limitations of rule-based automation by introducing a spatiotemporal grouping strategy and an automation ontology to reduce the dimensionality of the problem. We construct a causal graph from binary time series data using bivariate Granger causality and conditional independence tests. For inference, we introduce a probabilistic method that assigns edge-specific conditional probabilities as a function of time lag, allowing for interpretable, time-aware root cause scoring via causal graph traversal. We evaluated the system using a labeled dataset of 35 production incidents from a major cloud provider. The model successfully recalled the correct root cause in 85.7% of incidents and produced an exact match in 74.3%. In production, the deployed system has been used in over 800 real-world incidents, with positive qualitative feedback from network engineers. These results highlight the practicality of a data-driven, causal approach to RCA in dynamic and large-scale operational environments.

2606.13543 2026-06-12 cs.NI cs.LG 交叉投稿

NetCause: Counterfactual Learning for Root Cause Analysis in Large-Scale Networks

NetCause:大规模网络中根因分析的反事实学习

Fabien Chraim, Jian Zhang, Dominik Janzing, Xiang Song, Christos Faloutsos, John Evans

AI总结 提出NetCause框架,将网络事件建模为图时间过程,通过反事实模拟排序候选根因,在31个专家标注事件上准确率提升16.1%。

Comments 9 pages, 6 figures

AI中文摘要

一个学习模型能否捕捉故障在大规模网络中的传播方式,并利用这些知识将客户影响因果归因于其根本原因?现有的根因分析技术通常依赖于静态规则、相关启发式或拓扑局部推理,难以在动态环境中泛化,因为故障在复杂的物理和逻辑依赖关系中传播。我们提出了NetCause,一个基于自监督学习的框架,将网络事件建模为图时间过程,并使用反事实模拟对候选根因进行排序。该方法生成可解释的根因假设排序,并自然地与操作员定义的缓解和修复措施集成。我们在来自领先云提供商生产网络的六个月内收集的1500多个事件上训练模型,并在31个专家标注的事件上评估。NetCause在与运营决策最相关的场景中持续改善根因排序质量,相比基于规则的启发式基线,准确率提升16.1%。虽然训练计算密集,但推理轻量,每个事件仅需数秒GPU运行时间(远低于典型的遥测收集延迟)。

英文摘要

Can a learned model capture how faults propagate through a large-scale network and use this knowledge to causally attribute customer impact to its underlying root cause? Existing root cause analysis techniques often rely on static rules, correlation heuristics, or topology-local reasoning, which struggle to generalize in dynamic environments where faults propagate across complex physical and logical dependencies. We present NetCause, a self-supervised learning-based framework that models network incidents as graph-temporal processes and uses counterfactual simulation to rank candidate root causes. This approach produces an interpretable ranking of root cause hypotheses and integrates naturally with operator-defined mitigation and remediation actions. We train the model on over 1,500 incidents collected over six months from a leading cloud provider's production network and evaluate it on 31 expert-labeled incidents. NetCause consistently improves root cause ranking quality in the regime most relevant to operational decision-making, achieving a 16.1% accuracy improvement over a rule-based heuristic baseline. While training is computationally intensive, inference is lightweight, requiring only seconds of GPU runtime per incident (well below typical telemetry collection latencies).

2606.13591 2026-06-12 cs.AI cs.LG cs.MA 交叉投稿

Multiagent Protocols with Aggregated Confidence Signals

带有聚合置信信号的多智能体协议

Ali Elahi, Barbara Di Eugenio

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出三种协议,通过转换原始置信信号并采用软投票或贝叶斯融合,为多智能体系统输出聚合置信度,在保持正确性的同时显著提升判别能力。

Comments 22 pages and 5 figures, 9 pages and 2 figures before the appendix

AI中文摘要

置信度在自然语言处理中用于可靠性、监督和一系列下游决策任务,但目前没有方法能够为多智能体系统的输出产生或评估置信度。先前的工作在多智能体辩论中使用置信度来加权消息、触发辩论或校准单个智能体,但从未将这些置信度聚合成系统本身的单一置信度。我们引入了三种协议,通过首先转换原始置信信号使其在不同模型间可比,然后通过软投票或称为贝叶斯融合的概率融合方法将它们组合,从而产生最终答案和单一的聚合置信度。这种聚合置信度在判别性(AUARC)上显著优于最佳单个智能体或标准辩论基线,同时正确性(F1分数)保持稳定,并恢复了多智能体辩论在更模糊任务上的损失。通过分析两种估计器(序列概率和自我报告)以及参数和非参数校准器,我们发现校准提高了两种估计器的F1分数,而AUARC对其依赖较小。我们在五个基准测试和四种任务类型上评估了每基准六对同质和异质辩论对,涵盖了多种模型能力和大小。

英文摘要

Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.
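
The aggregation step described above can be illustrated with a small sketch. It assumes each agent has already produced an answer and a confidence in [0, 1] that has been made comparable across models. The function names `soft_vote` and `fuse_agreeing` are hypothetical, and the fusion rule shown is a textbook independence-based combination with a uniform prior, offered as an assumption. The paper's three protocols and its Bayesian fusion may be defined differently.

```python
from collections import defaultdict


def soft_vote(votes):
    """Pick the answer with the largest summed confidence.

    votes: list of (answer, confidence) pairs, with at least one positive confidence.
    Returns the winning answer and its share of the total confidence mass.
    """
    scores = defaultdict(float)
    for answer, confidence in votes:
        scores[answer] += confidence
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / sum(scores.values())


def fuse_agreeing(confidences):
    """Fuse confidences of agents that gave the same answer, treating them as
    independent estimates of 'this answer is correct' under a uniform prior."""
    p_correct, p_wrong = 1.0, 1.0
    for p in confidences:
        p_correct *= p
        p_wrong *= 1.0 - p
    return p_correct / (p_correct + p_wrong)


answer, confidence = soft_vote([("A", 0.9), ("B", 0.6)])
print(answer, round(confidence, 3))          # A 0.6
print(round(fuse_agreeing([0.8, 0.7]), 3))   # 0.903
```

Soft voting keeps disagreement visible in the final confidence, whereas the independence-based fusion sharpens confidence when agents agree. The independence assumption is strong for agents that share training data or see each other's messages, so the fused value should be read with that caveat in mind.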

2606.13633 2026-06-12 eess.SY cs.LG cs.SY 交叉投稿

Aerial Wildfire Suppression Planning with a Hybrid CNN-Cellular Automata Fire Model

基于混合CNN-元胞自动机火灾模型的空中野火抑制规划

Ion Matei, Maksym Zhenirovskyy, Takuya Kurihana, Rohit Vupala, Anthony Wong

AI总结 提出结合混合神经-元胞自动机野火模型与梯度优化空中投放的框架,通过蒙特卡洛采样和空间相关扰动量化不确定性,案例验证可生成有效抑制方案。

AI中文摘要

空中野火抑制不仅需要预测火势蔓延,还需要在操作和环境不确定性下设计有效的干预策略。我们提出了一个空中野火抑制的建模与优化框架,该框架将混合神经-元胞自动机野火模型与基于梯度的目标空中投放设计相结合。野火模型根据地形、燃料和风数据预测空间变化的蔓延行为,而干预模块确定二元投放动作,其连续值位置和方向参数映射到模拟网格。水和阻燃剂具有不同的抑制效果,分别对应于立即减少活跃燃烧和持续减少未来蔓延。为了评估所得抑制方案的鲁棒性,我们通过每日火势状态的蒙特卡洛采样量化偶然不确定性,并通过空间相关的预测误差扰动量化认知不确定性。基于2020年Bear Fire的案例研究表明,该框架可以生成连贯的空中抑制调度,以减少总火灾影响面积,并支持对野火干预策略的不确定性感知分析。

英文摘要

Aerial wildfire suppression requires not only predicting fire spread, but also designing effective intervention strategies under operational and environmental uncertainty. We present a modeling and optimization framework for aerial wildfire suppression that combines a hybrid neural-cellular automaton wildfire model with gradient-based design of targeted aerial drops. The wildfire model predicts spatially varying spread behavior from terrain, fuel, and wind data, while the intervention module determines binary drop actions with continuous-valued location and orientation parameters mapped to the simulation grid. Water and retardant are represented with distinct suppression effects, corresponding to immediate reduction of active burning and persistent reduction of future spread. To evaluate the robustness of the resulting suppression plans, we quantify both aleatoric uncertainty through Monte Carlo sampling of daily fire-state realizations and epistemic uncertainty through spatially correlated prediction-error perturbations. A case study based on the 2020 Bear Fire shows that the framework can generate coherent aerial suppression schedules for reducing total fire-affected area and can support uncertainty-aware analysis of wildfire intervention strategies.

2606.13677 2026-06-12 cs.RO cs.AI cs.CV cs.LG 交叉投稿

Mana: Dexterous Manipulation of Articulated Tools

Mana: 铰接工具的灵巧操作

Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu

发表机构 * UC Berkeley(加州大学伯克利分校) CMU(卡内基梅隆大学) Stanford University(斯坦福大学) Amazon FAR(亚马逊FAR)

AI总结 提出Mana框架,将灵巧操作重解释为动画问题,通过粗到细的流水线自动生成操作轨迹,实现铰接工具的零样本仿真到现实迁移。

Comments Project Page: https://zhaohengyin.github.io/mana

AI中文摘要

铰接工具的操作由于需要协调内部自由度与接触丰富的交互,仍然是灵巧机器人学中的一个主要挑战。虽然先前的工作主要集中在刚性物体上,但铰接工具的使用由于其物理复杂性以及学习功能性抓取和操作策略的困难,仍未得到充分探索。我们提出了Mana(操作动画器),一个通用的仿真到现实框架,将灵巧操作重新解释为动画问题。受计算机动画启发,Mana采用粗到细的流水线,通过运动规划和强化学习将程序生成的抓取关键帧转化为操作轨迹。数据生成过程基本自动化,仅需几次鼠标点击即可指定功能可供性(每个工具不到1分钟)。在跨越不同尺度和关节类型的四个铰接工具上,Mana实现了抓取和手内操作的零样本仿真到现实迁移,展示了灵巧铰接工具操作的可扩展方法。

英文摘要

Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

2301.12538 2026-06-12 cs.LG cs.AI math.DS 版本更新

On Approximating the Dynamic Response of Synchronous Generators via Operator Learning: A Step Towards Building Deep Operator-based Power Grid Simulators

关于通过算子学习逼近同步发电机动态响应:迈向构建基于深度算子的电网模拟器的一步

Christian Moya, Amirhossein Mollaali, Guang Lin, Meng Yue

发表机构 * Purdue University(普渡大学)

AI总结 提出基于算子学习的框架,利用DeepONet逼近同步发电机的动态响应,并设计递归模拟方案及残差DeepONet方案,结合数据聚合策略实现与电网交互的模拟。

AI中文摘要

本文开发了一个算子学习框架,用于逼近同步发电机的动态响应。该框架可用于(i)构建一个基于神经网络的发电机模型,与电网模拟器交互,或(ii)跟踪真实发电机的暂态响应。首先,我们开发了一个数据驱动的深度算子网络(DeepONet)来逼近发电机的无限维解算子。然后,我们设计了一个基于DeepONet的数值方案,在给定的时间范围内模拟发电机的响应。所提出的方案递归地使用训练好的DeepONet来模拟给定多维输入下的响应,该输入描述了发电机与电网之间的相互作用。此外,我们设计了一个残差DeepONet数值方案,可以整合现有数学模型的信息。我们为这个残差DeepONet方案提供了预测累积误差的估计。最后,我们构建了一个数据聚合(DAgger)策略,允许使用DeepONet在与其他电网组件交互模拟中可能遇到的聚合训练数据对DeepONet进行微调。作为概念验证,我们证明了所提出的框架能够有效逼近同步发电机的暂态模型。

英文摘要

This paper develops an Operator Learning framework for approximating the dynamic response of synchronous generators. The framework can be used to (i) build a neural network-based generator model that interacts with a power grid simulator or (ii) shadow the true generator's transient response. First, we develop a data-driven Deep Operator Network (DeepONet) to approximate the infinite-dimensional solution operator of the generators. Then, we design a numerical scheme based on DeepONet that simulates the generator's response over a given time horizon. The proposed scheme recursively employs the trained DeepONet to simulate the response for a given multi-dimensional input that describes the interaction between the generator and the power grid. In addition, we design a residual DeepONet numerical scheme that can incorporate information from existing mathematical models. We accompany this residual DeepONet scheme with an estimate for the prediction's cumulative error. Finally, we build a data aggregation (DAgger) strategy that allows fine-tuning of DeepONets using aggregated training data that the DeepONets will likely encounter during interactive simulations with other grid components. As a proof of concept, we demonstrate that the proposed frameworks can effectively approximate the transient model of a synchronous generator.
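
The recursive simulation scheme and its residual variant can be sketched without any neural network by substituting a known one-step map for the trained operator. Everything below is an assumption made for illustration: the scalar system dx/dt = -x + u stands in for the generator dynamics, `exact_step` stands in for a trained DeepONet, and a forward-Euler step plays the role of the existing mathematical model that the residual scheme corrects.

```python
import math


def exact_step(state, control, step):
    """Stand-in for a trained operator: exact solution of dx/dt = -x + u
    over one step with the input u held constant."""
    return control + (state - control) * math.exp(-step)


def euler_step(state, control, step):
    """Stand-in for an existing (imperfect) mathematical model: forward Euler."""
    return state + step * (-state + control)


def residual_step(prior_model, correction):
    """Residual scheme: prediction = prior model output + learned correction."""
    def step_fn(state, control, step):
        return prior_model(state, control, step) + correction(state, control, step)
    return step_fn


def rollout(step_operator, initial_state, controls, step):
    """Apply a one-step operator recursively over the horizon, feeding each
    prediction back in as the next state together with the next input."""
    states = [initial_state]
    for control in controls:
        states.append(step_operator(states[-1], control, step))
    return states


def ideal_correction(state, control, step):
    """A perfectly learned residual would equal the gap between truth and prior."""
    return exact_step(state, control, step) - euler_step(state, control, step)


controls = [1.0] * 10  # input sampled at each step (grid interaction stand-in)
direct = rollout(exact_step, 0.0, controls, 0.1)
prior_only = rollout(euler_step, 0.0, controls, 0.1)
corrected = rollout(residual_step(euler_step, ideal_correction), 0.0, controls, 0.1)
print(direct[-1], prior_only[-1], corrected[-1])
```

Because each step consumes the previous prediction, one-step errors accumulate along the horizon. This is why a cumulative error estimate and training on states the model actually visits, as in the DAgger strategy mentioned in the abstract, are relevant to this kind of scheme.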

2505.22695 2026-06-12 cs.LG 版本更新

LLM-ODDR: A Large Language Model Framework for Joint Order Dispatching and Driver Repositioning

LLM-ODDR:一种用于联合订单调度和司机重新定位的大语言模型框架

Tengfei Lyu, Siyuan Feng, Hao Liu, Hai Yang

发表机构 * Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou)(人工智能学域,香港科技大学(广州)) Department of Aeronautical and Aviation Engineering, The Hong Kong Polytechnic University(航空及民航工程学系,香港理工大学) Research Center for Low Altitude Economy, The Hong Kong Polytechnic University(低空经济研究中心,香港理工大学) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology(计算机科学及工程学系,香港科技大学) Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology(土木及环境工程学系,香港科技大学)

AI总结 提出LLM-ODDR框架,利用大语言模型联合优化网约车订单调度与司机重新定位,通过多目标价值细化、公平感知调度和时空需求感知重定位提升效果、适应性和可解释性。

Comments Published in IEEE Transactions on Intelligent Transportation Systems (TITS)

AI中文摘要

网约车平台在动态城市环境中优化订单调度和司机重新定位操作面临重大挑战。基于组合优化、规则启发式和强化学习的传统方法往往忽视司机收入公平性、可解释性以及对现实动态的适应性。为弥补这些不足,我们提出LLM-ODDR,一种利用大语言模型(LLM)进行网约车服务中联合订单调度和司机重新定位(ODDR)的新型框架。LLM-ODDR框架包含三个关键组件:(1)多目标引导的订单价值细化,通过考虑多个目标评估订单以确定其整体价值;(2)公平感知的订单调度,平衡平台收入与司机收入公平性;(3)时空需求感知的司机重新定位,基于历史模式和预测供应优化空闲车辆放置。我们还开发了JointDR-GPT,一个针对ODDR任务进行领域知识微调的模型。在曼哈顿出租车运营的真实数据集上进行的大量实验表明,我们的框架在有效性、对异常条件的适应性以及决策可解释性方面显著优于传统方法。据我们所知,这是首次将LLM作为决策智能体应用于网约车ODDR任务,为将先进语言模型集成到智能交通系统中奠定了基础性见解。虽然当前框架的计算成本高于传统方法,但我们表明并行分解和模型蒸馏可以将延迟降低到可部署的生产水平。

英文摘要

Ride-hailing platforms face significant challenges in optimizing order dispatching and driver repositioning operations in dynamic urban environments. Traditional approaches based on combinatorial optimization, rule-based heuristics, and reinforcement learning often overlook driver income fairness, interpretability, and adaptability to real-world dynamics. To address these gaps, we propose LLM-ODDR, a novel framework leveraging Large Language Models (LLMs) for joint Order Dispatching and Driver Repositioning (ODDR) in ride-hailing services. The LLM-ODDR framework comprises three key components: (1) Multi-objective-guided Order Value Refinement, which evaluates orders by considering multiple objectives to determine their overall value; (2) Fairness-aware Order Dispatching, which balances platform revenue with driver income fairness; and (3) Spatiotemporal Demand-Aware Driver Repositioning, which optimizes idle vehicle placement based on historical patterns and projected supply. We also develop JointDR-GPT, a fine-tuned model optimized for ODDR tasks with domain knowledge. Extensive experiments on real-world datasets from Manhattan taxi operations demonstrate that our framework significantly outperforms traditional methods in terms of effectiveness, adaptability to anomalous conditions, and decision interpretability. To our knowledge, this is the first exploration of LLMs as decision-making agents in ride-hailing ODDR tasks, establishing foundational insights for integrating advanced language models within intelligent transportation systems. While the current framework incurs higher computational costs than traditional methods, we show that parallel decomposition and model distillation can reduce latency to production-viable levels.

2508.04888 2026-06-12 cs.LG 版本更新

Retrieval-Augmented Foundation Models for Water Level Prediction in the Everglades

用于大沼泽地水位预测的检索增强基础模型

Rahuul Rangaraj, Jimeng Shi, Rajendra Paudel, Giri Narasimhan, Yanzhao Wu

发表机构 * Florida International University(佛罗里达国际大学) Everglades National Park(大沼泽地国家公园)

AI总结 针对大沼泽地水位预测,提出检索增强机制,利用统计相似性或互信息检索历史水文事件,提升预训练时序基础模型的长期预测性能,尤其在极端事件中效果显著。

AI中文摘要

大沼泽地的准确水位预测对于防洪、干旱管理、水资源规划和生物多样性保护至关重要。尽管最近的时序基础模型在通用任务(体现在其预训练中)上表现出色,但它们在特定领域应用中的有效性仍未被充分理解。在这项工作中,我们整理了一个用于大沼泽地水位预测的领域特定数据集,并观察到当前最先进模型的性能仍然有限。为了解决这一差距,我们利用检索增强机制,从历史观测的外部档案中检索类似的多变量水文事件,以丰富这些预训练模型的输入上下文。我们研究了两种检索策略:基于统计相似性的检索和基于互信息的检索,并分析了纳入检索到的历史上下文如何影响预测性能。大量实验表明,检索增强一致地改善了长期水位预测,并在极端事件期间产生了不成比例的更大收益,这对环境决策尤为关键。我们的研究提供了经验证据,表明基于类比检索可以有益于环境科学中的预训练时序基础模型,为它们在大沼泽地水文预测中的应用提供了关于其优势、局限性和失败模式的实用见解。尽管在大沼泽地进行了评估,但所提出的框架是通用的,并且可以应用于给定时间序列数据的其他水文系统。代码和数据已公开于 https://github.com/rahuul2992000/WaterRAF。

英文摘要

Accurate water level forecasting in the Everglades is essential for flood mitigation, drought management, water resource planning, and biodiversity conservation. While recent time-series foundation models have shown strong performance on generic tasks (represented in their pre-training), their effectiveness in domain-specific applications remains insufficiently understood. In this work, we curate a domain-specific dataset for water-level forecasting in the Everglades and observe that the performance of current state-of-the-art models remains limited. To address this gap, we leverage a retrieval-augmented mechanism that retrieves analogous multivariate hydrological episodes from an external archive of historical observations to enrich the input context of those pre-trained models. We study two retrieval strategies, statistical similarity-based retrieval and mutual information-based retrieval, and analyze how incorporating retrieved historical contexts affects predictive performance. Extensive experiments show that retrieval augmentation consistently improves long-horizon water level forecasts and yields disproportionately larger gains during extreme events, which is particularly critical for environmental decision-making. Our study provides empirical evidence that analog-based retrieval can benefit pretrained time-series foundation models in environmental science, offering practical insights into their strengths, limitations, and failure modes when applied to hydrological forecasting in the Everglades. Although evaluated in the Everglades, the proposed framework is general and can be applied to other hydrological systems given time series data. The code and data have been made publicly available at https://github.com/rahuul2992000/WaterRAF.
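
The similarity-based retrieval step described above can be illustrated with a minimal sketch. It assumes a univariate series, z-normalized windows, and Euclidean distance; the abstract does not specify the similarity measure, and the paper works with multivariate episodes, so the function names (`znorm`, `retrieve_analogs`) and every choice here are hypothetical.

```python
import math

def znorm(window):
    """Z-normalize a window so retrieval compares shape, not absolute level."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    std = math.sqrt(var) or 1.0  # guard against constant windows
    return [(x - mean) / std for x in window]

def retrieve_analogs(query, archive, k=2):
    """Return start indices of the k archive windows closest to the query.

    Assumption: Euclidean distance on z-normalized windows stands in for
    the paper's unspecified statistical similarity measure.
    """
    q = znorm(query)
    scored = []
    for start in range(len(archive) - len(query) + 1):
        cand = znorm(archive[start:start + len(query)])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, cand)))
        scored.append((dist, start))
    scored.sort()
    return [start for _, start in scored[:k]]

# Toy archive of water levels (made-up numbers) and a rising query pattern.
archive = [0, 1, 2, 3, 2, 1, 0, 1, 2, 3]
print(retrieve_analogs([10, 20, 30], archive, k=2))
```

In a retrieval-augmented setup, the retrieved windows (and what followed them) would be appended to the forecasting model's input context; that step depends on the specific foundation model and is not shown.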

2509.07150 2026-06-12 cs.LG cond-mat.mtrl-sci 版本更新

PLaID++: A Preference Aligned Language Model for Targeted Inorganic Materials Design

PLaID++: 一种用于定向无机材料设计的偏好对齐语言模型

Andy Xu, Rohan Desai, Larry Wang, Ethan Ritz, Gabriel Hope

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出PLaID++,通过对称性感知的Wyckoff文本表示和温度缩放熵正则化,结合可验证奖励的强化学习,实现稳定、新颖且满足空间群属性的晶体生成,稳定、独特且新颖结构的生成率比先前方法高约50%。

Comments Code available at https://github.com/andaero/PLaID, model weights at https://huggingface.co/HOPE-Lab-HMC/PLaID

AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为提高LLM正确性的有前景方法,然而在许多科学问题中,目标并非产生正确答案,而是产生满足一组约束的多样化候选方案。我们在材料生成背景下研究这一挑战。为此,我们引入了PLaID++,一个经过后训练的LLM,用于稳定且属性引导的晶体生成。我们发现性能取决于我们的晶体学表示和奖励公式。首先,我们引入了一种紧凑的、对称性感知的Wyckoff文本表示,提高了计算效率并鼓励从物理先验中泛化。其次,我们证明了温度缩放作为熵正则化器,可以抵消模式坍塌并鼓励探索。通过将对称性约束直接编码到文本中,并将模型输出引导至理想的化学空间,PLaID++生成热力学稳定、独特且新颖的结构,其速率比先前方法高约50%,并能条件性地生成具有所需空间群属性的结构。我们的工作展示了将自然语言处理中的后训练技术适应于材料设计的潜力,为定向和高效发现新材料铺平了道路。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising approach to improve correctness in LLMs. However, in many scientific problems, the objective is not necessarily to produce the correct answer but to produce a diverse array of candidates that satisfy a set of constraints. We study this challenge in the context of materials generation. To this end, we introduce PLaID++, an LLM post-trained for stable and property-guided crystal generation. We find that performance hinges on our crystallographic representation and reward formulation. First, we introduce a compact, symmetry-informed Wyckoff text representation which improves computational efficiency and encourages generalization from physical priors. Second, we demonstrate that temperature scaling acts as an entropy regularizer which counteracts mode collapse and encourages exploration. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a $\sim$50\% greater rate than prior methods and conditionally generates structures with desired space group properties. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.
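
To see why temperature can act as an entropy knob, the sketch below computes the entropy of a softmax distribution at several temperatures. It only illustrates the generic effect on a toy logit vector (made-up values); how PLaID++ applies temperature scaling during post-training is not described in the abstract and is not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a sampling temperature."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]  # hypothetical next-token logits
for t in (0.5, 1.0, 2.0):
    print(t, round(entropy(softmax(logits, t)), 3))
```

Higher temperatures flatten the distribution and raise its entropy, which is the sense in which temperature can counteract collapse onto a few modes.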

2601.00921 2026-06-12 cs.LG cs.AI quant-ph 版本更新

Geometric and Quantum Kernel Methods for Predicting Skeletal Muscle Outcomes in chronic obstructive pulmonary disease

用于预测慢性阻塞性肺疾病骨骼肌结果的几何与量子核方法

Azadeh Alavi, Hamidreza Khalili, Stanley H. Chan, Fatemeh Kouchmeshki, Muhammad Usman, Ross Vlahos

发表机构 * School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院) School of Health & Biomedical Sciences, STEM College, RMIT University(皇家墨尔本理工大学STEM学院健康与生物医学科学学院) Pattern Recognition Pty Ltd, Melbourne(模式识别有限公司,墨尔本) Data61, CSIRO(Data61,澳大利亚联邦科学与工业研究组织)

AI总结 提出一种核几何量子混合方法,通过再生核希尔伯特空间映射合成SPD参考、随机投影压缩和低维量子回归电路,在COPD动物队列中预测肌肉重量、质量和力量,肌肉重量RMSE比最佳经典方法低约1.8%。

Comments 24 pages, 2 figures

AI中文摘要

慢性阻塞性肺疾病(COPD)影响全球数亿人,骨骼肌功能障碍具有临床重要性。量子机器学习在生物医学预测中日益受到探索,但在小型生物标志物队列中的价值需要与强经典基线进行基准测试。我们分析了一个由213只动物组成的香烟烟雾COPD队列,利用血液和支气管肺泡灌洗生物标志物预测胫骨前肌重量、肌肉质量和力量。我们开发了一种核几何量子混合方法,其中合成对称正定(SPD)参考通过再生核希尔伯特空间映射,使用仅训练随机投影压缩,归一化,并输入低维量子回归电路。我们将该方法与经典岭/核模型、SPD关系表示和量子核回归(QKR)进行了基准测试。所有方法均使用条件分层重复交叉验证进行评估。最大的数值改进出现在肌肉重量上,所提出方法的平均均方根误差(RMSE)数值最低,比最佳经典比较器低约1.8%;配对折叠水平测试在Holm调整后未建立统计显著性优势,但该终点具有生物学意义。该方法在肌肉质量上也具有数值最低的平均RMSE。对于力量,仅使用生物标志物的岭回归表现最佳,表明更线性的终点结构。

英文摘要

Chronic obstructive pulmonary disease (COPD) affects hundreds of millions of people worldwide, and skeletal-muscle dysfunction is clinically important. Quantum machine learning is increasingly explored for biomedical prediction, but its value in small biomarker cohorts requires benchmarking against strong classical baselines. We analysed a cigarette-smoke COPD cohort of 213 animals with blood and bronchoalveolar-lavage biomarkers to predict tibialis anterior muscle weight, muscle quality, and force. We developed a kernel-geometric quantum hybrid method in which synthetic symmetric positive definite (SPD) references are mapped through a reproducing kernel Hilbert space, compressed using train-only random projection, normalised, and supplied to low-dimensional quantum regression circuits. We benchmarked this approach against classical ridge/kernel models, SPD relational representations, and quantum-kernel regression (QKR). All methods were evaluated using condition-stratified repeated cross-validation. The largest numerical improvement was observed for muscle weight, where the proposed method had the numerically lowest mean root mean squared error (RMSE), approximately 1.8% below the best classical comparator; paired fold-level testing did not establish statistically significant superiority after Holm adjustment, but the endpoint is biologically meaningful. The method also had the numerically lowest mean RMSE for muscle quality. For force, biomarker-only Ridge performed best, suggesting a more linear endpoint structure.
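
The abstract reports that paired fold-level tests were not significant after Holm adjustment. As a reminder of what that adjustment does, here is a minimal sketch of the standard Holm-Bonferroni step-down procedure. The p-values are made up for illustration and are not taken from the paper.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Three hypothetical endpoint comparisons.
print(holm_adjust([0.01, 0.04, 0.03]))
```

A raw p-value below 0.05 can therefore stop being significant once several comparisons are tested together, which is consistent with reporting a numerically lower RMSE without claiming statistical superiority.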

2601.09693 2026-06-12 cs.LG stat.ML 版本更新

Contrastive Geometric Learning Unlocks Unified Structure- and Ligand-Based Drug Design

对比几何学习实现统一的结构与配体药物设计

Lisa Schneckenreiter, Sohvi Luukkonen, Lukas Friedrich, Daniel Kuhn, Günter Klambauer

发表机构 * DeepMind Ltd(DeepMind有限公司)

AI总结 提出对比几何模型ConGLUDe,统一结构与配体训练,实现虚拟筛选、靶标钓鱼和配体条件口袋预测,在多项基准测试中表现优异。

Comments Forty-Third International Conference on Machine Learning

AI中文摘要

基于结构和基于配体的计算药物设计传统上依赖于不相关的数据源和建模假设,限制了它们在大规模上的联合使用。在这项工作中,我们引入了用于统一计算药物设计的对比几何学习(ConGLUDe),这是一个单一的对比几何模型,统一了基于结构和基于配体的训练。ConGLUDe将产生全蛋白质表示和预测结合位点的隐式嵌入的几何蛋白质编码器与快速配体编码器耦合,消除了对预定义口袋的需求。通过对比学习将配体与全局蛋白质表示和多个候选结合位点对齐,ConGLUDe除了支持虚拟筛选和靶标钓鱼外,还支持配体条件口袋预测,同时在蛋白质-配体复合物和大规模生物活性数据上联合训练。在多种基准测试中,ConGLUDe实现了具有竞争力的零样本虚拟筛选性能,在具有挑战性的靶标钓鱼任务上显著优于现有方法,并展示了最先进的配体条件口袋选择。这些结果突显了统一结构-配体训练的优势,并将ConGLUDe定位为迈向药物发现通用基础模型的一步。

英文摘要

Structure-based and ligand-based computational drug design have traditionally relied on disjoint data sources and modeling assumptions, limiting their joint use at scale. In this work, we introduce Contrastive Geometric Learning for Unified Computational Drug Design (ConGLUDe), a single contrastive geometric model that unifies structure- and ligand-based training. ConGLUDe couples a geometric protein encoder that produces whole-protein representations and implicit embeddings of predicted binding sites with a fast ligand encoder, removing the need for predefined pockets. By aligning ligands with both global protein representations and multiple candidate binding sites through contrastive learning, ConGLUDe supports ligand-conditioned pocket prediction in addition to virtual screening and target fishing, while being trained jointly on protein-ligand complexes and large-scale bioactivity data. Across diverse benchmarks, ConGLUDe achieves competitive zero-shot virtual screening performance, substantially outperforms existing methods on a challenging target fishing task, and demonstrates state-of-the-art ligand-conditioned pocket selection. These results highlight the advantages of unified structure-ligand training and position ConGLUDe as a step toward general-purpose foundation models for drug discovery.

2601.15503 2026-06-12 cs.LG 版本更新

Data-driven Lake Water Quality Forecasting for Time Series with Missing Data using Machine Learning

基于机器学习的数据驱动湖泊水质时间序列缺失数据预测

Rishit Chatterjee, Tahiya Chowdhury

发表机构 * Department of Computer Science, Colby College(科尔比学院计算机科学系)

AI总结 针对志愿者监测导致的湖泊数据缺失问题,采用多重插补和岭回归,在30个湖泊数据集上实现透明度预测,并量化了最小样本量和特征集,提出联合可行性函数以优化监测策略。

Comments 8 pages, 4 figures, 3 tables

Journal ref
Published in: 2026 IEEE Conference on Technologies for Sustainability (SusTech)
AI中文摘要

志愿者主导的湖泊监测产生不规则、季节性的时间序列,由于冰盖、天气相关的通行限制以及偶尔的人为错误,存在大量缺失数据,这给有害藻华预测和早期预警带来了困难。我们研究了基于来自缅因州湖泊三十年间原位记录的数据丰富子集(30个湖泊)的塞氏盘深度(SDD)预测。通过链式方程多重插补(MICE)处理缺失数据,并使用归一化平均绝对误差(nMAE)指标进行跨湖泊性能比较。在六种候选模型中,岭回归提供了最佳的平均测试性能。利用岭回归,我们量化了最小样本量,表明在向后近期历史协议下,模型平均每个湖泊约176个训练样本即可达到全历史准确率的5%以内。我们还确定了最小特征集,其中紧凑的四特征子集在相同5%容差内匹配了十三特征基线。综合这些结果,我们引入了一个联合可行性函数,该函数识别出达到完整历史、全特征基线5%以内目标所需的最小训练历史和最少预测变量。在我们的研究中,达到5%准确率目标需要每个湖泊约64个近期样本和仅一个预测变量,凸显了针对性监测的实用性。因此,我们的联合可行性策略在固定准确率目标下统一了近期历史长度和特征选择,为湖泊研究人员制定采样工作和测量优先级提供了简单高效的规则。

英文摘要

Volunteer-led lake monitoring yields irregular, seasonal time series with many gaps arising from ice cover, weather-related access constraints, and occasional human errors, complicating forecasting and early warning of harmful algal blooms. We study Secchi Disk Depth (SDD) forecasting on a 30-lake, data-rich subset drawn from three decades of in-situ records collected across Maine lakes. Missingness is handled via Multiple Imputation by Chained Equations (MICE), and we evaluate performance with a normalized Mean Absolute Error (nMAE) metric for cross-lake comparability. Among six candidates, ridge regression provides the best mean test performance. Using ridge regression, we then quantify the minimal sample size, showing that under a backward, recent-history protocol, the model reaches within 5% of full-history accuracy with approximately 176 training samples per lake on average. We also identify a minimal feature set, where a compact four-feature subset matches the thirteen-feature baseline within the same 5% tolerance. Bringing these results together, we introduce a joint feasibility function that identifies the minimal training history and fewest predictors sufficient to achieve the target of staying within 5% of the complete-history, full-feature baseline. In our study, meeting the 5% accuracy target required about 64 recent samples and just one predictor per lake, highlighting the practicality of targeted monitoring. Hence, our joint feasibility strategy unifies recent-history length and feature choice under a fixed accuracy target, yielding a simple, efficient rule for setting sampling effort and measurement priorities for lake researchers.
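
The evaluation logic above (a normalized error plus a 5% tolerance against a baseline) can be sketched as follows. Two assumptions are made because the abstract does not give the definitions: nMAE is taken to be the MAE divided by the mean observed value, and "within 5%" is read as an error at most 5% above the baseline error. Both function names are hypothetical.

```python
def nmae(y_true, y_pred):
    """Mean absolute error divided by the mean observed value (assumed normalization)."""
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    scale = sum(y_true) / len(y_true)
    return mae / scale

def within_tolerance(candidate_error, baseline_error, tolerance=0.05):
    """True if the candidate error is at most (1 + tolerance) times the baseline error."""
    return candidate_error <= baseline_error * (1.0 + tolerance)

# Made-up Secchi depths in metres for one lake.
observed = [2.0, 4.0]
predicted = [1.0, 5.0]
print(round(nmae(observed, predicted), 3))
print(within_tolerance(0.104, 0.100))
```

A joint feasibility search would then scan combinations of history length and feature count and keep the smallest configuration for which `within_tolerance` holds.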

2603.11249 2026-06-12 cs.LG 版本更新

Differentiable Thermodynamic Phase-Equilibria for Machine Learning

可微热力学相平衡用于机器学习

Karim K. Ben Hicham, Moreno Ascani, Jan G. Rittig, Alexander Mitsos

发表机构 * RWTH Aachen University(亚琛工业大学) Process Systems Engineering (AVT.SVT)(过程系统工程) Forschungszentrum Jülich GmbH(于利希研究中心) Institute of Climate and Energy Systems ICE-1(气候与能源系统研究所) Energy Systems Engineering(能源系统工程) JARA-ENERGY

AI总结 提出DISCOMAX算法,通过可微相平衡计算结合离散枚举与掩码softmax,实现热力学一致性端到端学习,在二元液液平衡数据上优于现有方法。

Comments 45 pages, 27 figures, 5 tables

AI中文摘要

相平衡的准确预测仍是化学工程中的核心挑战。将热力学结构融入神经网络的物理一致性机器学习方法最近在活度系数建模中表现出色。然而,将此类方法扩展到源于极值原理的平衡数据(如液液平衡)仍然困难。本文提出DISCOMAX,一种用于相平衡计算的可微算法,在训练和推理时均保证热力学一致性,仅受用户指定的离散化影响。该方法将可行相态的离散枚举与反向传播中的掩码softmax聚合相结合,在前向传播中传播真实平衡态,使用直通梯度估计器实现神经gE模型的物理一致性端到端学习。我们展示了该方法与统计热力学的类比,并在二元液液平衡数据上评估,其优于现有基于代理的方法,同时为从不同种类的平衡数据中学习提供了通用框架。

英文摘要

Accurate prediction of phase equilibria remains a central challenge in chemical engineering. Physics-consistent machine learning methods that incorporate thermodynamic structure into neural networks have recently shown strong performance for activity-coefficient modeling. However, extending such approaches to equilibrium data arising from an extremum principle, such as liquid-liquid equilibria, remains difficult. Here we present DISCOMAX, a differentiable algorithm for phase-equilibrium calculation that guarantees thermodynamic consistency at both training and inference, only subject to a user-specified discretization. The method combines discrete enumeration of feasible phase states with masked softmax aggregation in the backward pass, with the propagation of the true equilibrium state in the forward pass, using a straight-through gradient estimator to enable physics-consistent end-to-end learning of neural $g^E$-models. We show that this approach bears analogy to statistical thermodynamics, and we evaluate it on binary liquid-liquid equilibrium data where it outperforms existing surrogate-based methods, while offering a general framework for learning from different kinds of equilibrium data.
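
The contrast between the hard forward selection and the soft masked aggregation can be sketched on a toy list of enumerated states. This is not the DISCOMAX implementation: the energies, the feasibility mask, the temperature `tau`, and the function names are hypothetical, and the straight-through gradient wiring itself needs an autodiff framework and is not shown.

```python
import math

def masked_softmax(scores, mask):
    """Softmax restricted to entries where mask is True; masked entries get weight 0."""
    feasible = [s for s, m in zip(scores, mask) if m]
    if not feasible:
        raise ValueError("at least one state must be feasible")
    top = max(feasible)  # stabilize the exponentials
    exps = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def select_state(energies, mask, tau=0.1):
    """Return (hard index, soft weights) over enumerated candidate states.

    Lower energy is preferred, so scores are -energy / tau. The hard index is
    what a forward pass would propagate; the soft weights are what a backward
    pass could differentiate through.
    """
    scores = [-g / tau for g in energies]
    soft = masked_softmax(scores, mask)
    hard = max(range(len(soft)), key=lambda i: soft[i])
    return hard, soft

# State 1 has the lowest energy but is marked infeasible, so state 2 wins.
hard, soft = select_state([-1.0, -3.0, -2.0], [True, False, True])
print(hard, [round(w, 4) for w in soft])
```

The weights have a Boltzmann form over feasible states, which is one way to read the abstract's remark about an analogy to statistical thermodynamics.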

2603.11479 2026-06-12 cs.LG cs.AI cs.MA 版本更新

Grammar of the Wave: Towards Explainable Multivariate Time Series Event Detection via Neuro-Symbolic VLM Agents

波的语法:通过神经符号VLM智能体实现可解释的多变量时间序列事件检测

Sky Chenwei Wan, Yifei Y. Wang, Tianjun Hou, Xiqing Chang, Aymeric Jan

发表机构 * AI Lab, SLB(SLB人工智能实验室) Télécom Paris, Institut Polytechnique de Paris, France(巴黎电信学院,巴黎高等理工学院,法国)

AI总结 提出语言引导的时间序列事件检测(TSED)任务,通过事件逻辑树(ELT)将文本描述转化为结构化时序逻辑,并构建神经符号VLM智能体SELA,实现零/少样本事件检测与可解释推理。

Comments 8 pages (main text), 28 pages total including appendix. 9 figures, 7 tables

AI中文摘要

时间序列事件检测(TSED)旨在定位时间序列数据中具有语义意义的事件,在高风险领域具有关键应用。与统计异常不同,事件通常由自然语言描述定义,且跨多个物理通道具有内部时序逻辑结构。然而,在现实场景中,密集的事件标注成本高昂,使得纯监督学习困难。我们引入了语言引导的TSED,该设置中模型被赋予文本事件描述,并必须在几乎没有标注数据的情况下将其映射到多变量信号中的区间。为了解决这个问题,我们提出了事件逻辑树(ELT),一种知识表示框架,将语言描述转化为信号基元上的结构化时序逻辑。基于ELT,我们提出了SELA,一种神经符号VLM智能体框架,它从信号可视化中迭代地接地基元,并在ELT约束下组合它们,产生事件区间和忠实的树状结构解释。我们进一步发布了跨能源和气候领域的真实世界基准,包含专家知识和标注。实验表明,SELA优于监督微调和现有的零/少样本时间序列推理基线。

英文摘要

Time Series Event Detection (TSED) aims to localize semantically meaningful events in time series data, with critical applications in high-stakes domains. Unlike statistical anomalies, events are often defined by natural-language descriptions with internal temporal-logic structures across multiple physical channels. However, in real-world settings, dense event annotations are expensive to obtain, making purely supervised learning difficult. We introduce Language-guided TSED, a setting where a model is given textual event descriptions and must ground them to intervals in multivariate signals with little or no labeled data. To address this problem, we propose Event Logic Tree (ELT), a knowledge representation framework that converts linguistic descriptions into structured temporal logic over signal primitives. Building on ELT, we present SELA, a neuro-symbolic VLM agent framework that iteratively grounds primitives from signal visualizations and composes them under ELT constraints, producing both event intervals and faithful tree-structured explanations. We further release a real-world benchmark across energy and climate domains with expert knowledge and annotations. Experiments show that SELA improves over supervised fine-tuning and existing zero/few-shot time series reasoning baselines.

2604.12497 2026-06-12 cs.LG stat.ML 版本更新

Allocating Human Oversight in AI-Enabled Analytics

AI赋能分析中的人类监督分配

Zikun Ye, Jiameng Lyu, Rui Tao

发表机构 * Michael G. Foster School of Business, University of Washington(华盛顿大学迈克尔·G·福斯特商学院) Department of Management Science, School of Management, Fudan University(复旦大学管理学院管理科学系) Guanghua School of Management, Peking University(北京大学光华管理学院)

AI总结 针对AI预测可靠性异质且未知的问题,提出基于上置信界的在线学习策略,动态分配有限的人类验证预算,使终端效率损失随预算增长趋于零。

AI中文摘要

组织越来越多地部署AI作为面向客户的决策过程中的低成本预测层,包括需求感知、服务质量监控、产品测试和市场研究,但AI生成的信号在不同任务、产品和客户细分中的可靠性并不均匀。因此,企业仍然需要稀缺的人类验证(标签、审计、调查回复或后续测量)来将AI输出锚定到真实情况。由于人类真实情况本身存在噪声,在不同标注者之间甚至重复判断中都有所变化,企业必须为每个任务收集并平均多个人类标签,这使得人类验证成本高昂。我们研究如何在可靠性异质且在部署前未知的情况下,将有限的人类验证预算分配到多个AI辅助任务中。我们将其置于调优的预测驱动推断框架内。每个人类标签既提高了AI辅助估计的精度,也揭示了任务的修正难度,即在使用AI预测作为控制变量后剩余的方差。如果难度已知,最优分配将遵循Neyman平方根规则;由于未知,我们提出一种基于上置信界的策略,该策略在线学习难度并将验证导向AI最不可靠的任务。我们证明,随着预算增长,该策略相对于最优分配的终端效率损失趋于零。在合成实验和一个包含68个任务和超过2000名受访者的真实数字孪生调查中,当可靠性异质时,该策略缩小了与最优分配的大部分差距,优于均匀分配和epsilon-贪婪分配;在调查数据上,它还优于先探索后提交的试点设计,并将均匀分配的10-12%差距缩小到2-6%。AI的价值不仅取决于模型准确性,还取决于将人类监督定向到AI错误影响最大的操作策略。

英文摘要

Organizations increasingly deploy AI as a low-cost prediction layer in customer-facing decision processes, including demand sensing, service-quality monitoring, product testing, and market research, but AI-generated signals are unevenly reliable across tasks, products, and customer segments. Firms therefore still need scarce human validation (labels, audits, survey responses, or follow-up measurements) to anchor AI outputs to ground truth. Because human ground truth is itself noisy, varying across labelers and even across repeated judgments, the firm must collect and average several human labels per task, which makes human validation costly. We study how to allocate a limited human-validation budget across many AI-assisted tasks when reliability is heterogeneous and unknown before deployment. We cast this within tuned prediction-powered inference. Each human label both sharpens the AI-assisted estimate and reveals the task's rectification difficulty, the variance that remains after the AI prediction is optimally used as a control variate. If difficulties were known, the optimal allocation would follow a Neyman square-root rule; because they are unknown, we propose a policy based on upper confidence bounds that learns them online and steers validation toward tasks where AI is least reliable. We prove that the policy's terminal efficiency loss relative to the oracle allocation vanishes as the budget grows. In synthetic experiments and a real digital-twin survey with 68 tasks and over 2000 respondents, it closes most of the gap to the oracle when reliability is heterogeneous, outperforming uniform and epsilon-greedy allocation; on the survey data it also outperforms explore-then-commit pilot designs and cuts uniform's 10--12% gap to 2--6%. The value of AI depends not only on model accuracy but also on the operational policy that targets human oversight where AI errors matter most.
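
The oracle benchmark mentioned above, a Neyman square-root rule, can be sketched under a simple assumed objective: minimize the sum over tasks of difficulty divided by the number of labels, subject to a fixed total budget. The abstract does not state the exact objective, so this is an illustration with made-up difficulties, not the paper's formulation, and it leaves out the online upper-confidence-bound learning.

```python
import math

def neyman_allocation(variances, budget):
    """Allocate the budget in proportion to the square root of each task's variance."""
    roots = [math.sqrt(v) for v in variances]
    total = sum(roots)
    return [budget * r / total for r in roots]

def total_variance(variances, allocation):
    """Assumed objective: sum of variance_k / n_k across tasks."""
    return sum(v / n for v, n in zip(variances, allocation))

difficulties = [1.0, 4.0, 9.0]  # hypothetical residual variances per task
budget = 60
oracle = neyman_allocation(difficulties, budget)
uniform = [budget / len(difficulties)] * len(difficulties)
print(oracle, total_variance(difficulties, oracle))
print(uniform, total_variance(difficulties, uniform))
```

Under this objective the square-root rule sends more labels to harder tasks and yields a lower total variance than a uniform split, which is the gap an online policy tries to close without knowing the difficulties in advance.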

2604.20236 2026-06-12 cs.LG 版本更新

Machine Learning-based Two-Stage Graph Sparsification for the Travelling Salesman Problem

基于机器学习的两阶段图稀疏化方法用于旅行商问题

Bo-Cheng Lin, Yi Mei, Mengjie Zhang

发表机构 * Centre for Data Science and Artificial Intelligence(数据科学与人工智能中心) School of Engineering and Computer Science(工程与计算机科学学院) Victoria University of Wellington(惠灵顿维多利亚大学)

AI总结 提出两阶段方法,先结合α-Nearest和POPMUSIC得到近完美召回率的候选图,再用轻量级分类器修剪单源边,在保持≥99.69%最优边的同时降低37%-47%密度。

AI中文摘要

高性能TSP求解器(如Lin-Kernighan-Helsgaun (LKH))在\emph{候选图}(为求解器预先选定的边的小子集)中搜索,而不是在完整图上搜索。两种主要的稀疏化启发式方法,$\alpha$-Nearest和POPMUSIC,各自在密度-覆盖率平衡上存在不足:$\alpha$-Nearest密集且召回率稳定,而POPMUSIC更稀疏但其召回率随规模增大而下降。它们的并集在密度上远低于完整图的同时弥补了召回率差距,为进一步缩减留下了空间。现有的基于学习的稀疏化方法在完整图上对边评分,这种方法代价高昂且主要限于欧几里得实例。我们提出了一种两阶段方法,反转了这一逻辑。第一阶段取$\alpha$-Nearest和POPMUSIC的并集,在${\sim}6N$条边上实现近乎完美的召回率。关键在于,并集为每条边标注了其\emph{来源出处}——即它是由$\alpha$-Nearest、POPMUSIC还是两者共同支持的。第二阶段在这些标注边上训练一个轻量级分类器,并修剪得分最低的边。由于双源边几乎总是最优的,学习问题简化为过滤单源子集——这比从头开始对所有$O(N^2)$条边进行分类要容易得多。在四种距离类型、五种空间分布以及50到500的问题规模上,该流程将候选图密度降低了37%-47%,同时保留了${\geq}99.69\%$的最优旅行边,并且在TSP500上以更低的密度达到或超过了近期仅限欧几里得的神经稀疏化方法的覆盖率。

英文摘要

High-performance TSP solvers such as Lin-Kernighan-Helsgaun (LKH) search within a \emph{candidate graph} -- a small subset of edges pre-selected for the solver -- rather than over the complete graph. The two leading sparsification heuristics, $\alpha$-Nearest and POPMUSIC, each fall short of the density-coverage balance: $\alpha$-Nearest is dense with stable recall, while POPMUSIC is sparser but its recall degrades with scale. Their union closes the recall gap while remaining far below the complete graph in density, leaving room for further reduction. Existing learning-based sparsifiers score edges on the complete graph, an approach that is expensive and largely limited to Euclidean instances. We propose a two-stage method that inverts this logic. Stage~1 takes the union of $\alpha$-Nearest and POPMUSIC, achieving near-perfect recall at ${\sim}6N$ edges. Crucially, the union annotates each edge with its \emph{source provenance} -- whether it was endorsed by $\alpha$-Nearest, POPMUSIC, or both. Stage~2 trains a lightweight classifier on these annotated edges and prunes the lowest-scoring ones. Because dual-source edges are almost always optimal, the learning problem reduces to filtering the single-source subset -- a substantially easier task than classifying all $O(N^2)$ edges from scratch. Across four distance types, five spatial distributions, and problem sizes from 50 to 500, the pipeline reduces candidate-graph density by $37$-$47\%$ while retaining ${\geq}99.69\%$ of optimal-tour edges, and matches or exceeds the coverage of recent Euclidean-only neural sparsifiers at lower density at TSP500.
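
Stage 1 as described above, a union of two candidate edge sets annotated with provenance, can be sketched in a few lines. The edge lists are toy data and the function names are hypothetical; computing the actual $\alpha$-Nearest and POPMUSIC candidates (for example with LKH) and training the Stage 2 classifier are outside this sketch.

```python
def norm_edge(u, v):
    """Store undirected edges with the smaller endpoint first."""
    return (u, v) if u < v else (v, u)

def union_with_provenance(alpha_edges, popmusic_edges):
    """Map each edge in the union to 'alpha', 'popmusic', or 'both'."""
    alpha = {norm_edge(u, v) for u, v in alpha_edges}
    pop = {norm_edge(u, v) for u, v in popmusic_edges}
    provenance = {}
    for edge in alpha | pop:
        if edge in alpha and edge in pop:
            provenance[edge] = "both"
        elif edge in alpha:
            provenance[edge] = "alpha"
        else:
            provenance[edge] = "popmusic"
    return provenance

def single_source_edges(provenance):
    """Edges endorsed by only one heuristic: the subset a classifier would filter."""
    return sorted(e for e, src in provenance.items() if src != "both")

prov = union_with_provenance([(0, 1), (2, 1)], [(1, 2), (3, 2)])
print(prov)
print(single_source_edges(prov))
```

Only the single-source edges would be passed to the Stage 2 classifier; dual-source edges are kept as they are, following the abstract's observation that they are almost always optimal.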

2606.02044 2026-06-12 cs.LG physics.med-ph 版本更新

Realistic noise synthesis reduces bias and improves tissue microstructure estimation with supervised machine learning

真实噪声合成减少偏差并改善有监督机器学习的组织微结构估计

Bradley G. Karat, Maëliss Jallais, Ali R. Khan, Santiago Aja-Fernández, Jelle Veraart, Marco Palombo

AI总结 针对扩散MRI中模拟与实测信号噪声不匹配导致的协变量偏移问题,提出真实噪声合成框架,通过引入Rician期望和有效后处理噪声方差,显著降低参数估计偏差并提高精度。

Comments * Shared first author

AI中文摘要

扩散MRI能够无创探测组织微结构,但准确的参数估计受到噪声相关效应的挑战。在基于模拟数据训练的有监督机器学习框架中,模拟信号与采集信号的噪声特性差异引入了一种协变量偏移,导致训练和推理时的输入信号分布不同。我们研究了这种不匹配对微结构参数估计的影响,并提出了一种真实噪声合成(RNS)框架来缓解该问题。RNS将Rician期望和有效后处理噪声方差同时纳入模拟训练信号。Rician期望使用MPPCA估计的噪声标准差建模,而有效标准差则从预处理数据的球谐残差中导出。该方法使用cylinder-zeppelin和SANDI模型在多个SNR水平的模拟数据集以及具有重复采集的体内扩散数据上进行了评估。还评估了对噪声误估计的敏感性。训练过程中忽略幅度诱导的噪声效应会产生系统性的、依赖于SNR的参数偏差,尤其是在低SNR下。引入Rician期望显著降低了偏差,使其达到噪声感知的非线性最小二乘拟合的水平。对有效标准差进行建模进一步提高了精度。性能在很大程度上独立于回归架构,但对准确的噪声估计敏感。这些发现表明,在模拟训练数据中进行真实噪声建模可以减轻信号域的协变量偏移,并且对于无偏的监督微结构估计至关重要,特别是在与高b值或高空间分辨率相关的低SNR区域。

英文摘要

Diffusion MRI enables non-invasive probing of tissue microstructure, but accurate parameter estimation is challenged by noise-related effects. In supervised machine learning frameworks trained on simulated data, discrepancies between the noise characteristics of simulated and acquired signals introduce a form of covariate shift, whereby the input signal distribution differs between training and inference. We investigated the impact of this mismatch on microstructure parameter estimation and propose a realistic noise synthesis (RNS) framework to mitigate it. RNS incorporates both the Rician expectation and the effective post-processing noise variance into simulated training signals. The Rician expectation was modelled using a noise standard deviation estimated with MPPCA, while the effective standard deviation was derived from spherical harmonic residuals of preprocessed data. The method was evaluated using the cylinder-zeppelin and the SANDI models on simulated datasets across multiple SNR levels and on in vivo diffusion data with repeated acquisitions. Sensitivity to noise misestimation was also assessed. Ignoring magnitude-induced noise effects during training produced systematic, SNR-dependent parameter bias, particularly at low SNR. Incorporating the Rician expectation substantially reduced bias to the level of noise-aware nonlinear least-squares fitting. Modelling the effective standard deviation further improved precision. Performance was largely independent of regression architecture but sensitive to accurate noise estimation. These findings demonstrate that realistic noise modelling in simulated training data mitigates signal-domain covariate shift and is essential for unbiased supervised microstructure estimation, particularly in low-SNR regimes associated with high b-values or high spatial resolution.
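
For reference, the textbook mean of a Rician-distributed magnitude signal is given below, with $A$ the noise-free signal and $\sigma$ the noise standard deviation. Whether the RNS framework uses exactly this closed form is an assumption; the abstract only states that a Rician expectation is incorporated.

```latex
\mathbb{E}[M] = \sigma \sqrt{\frac{\pi}{2}}\; L_{1/2}\!\left(-\frac{A^{2}}{2\sigma^{2}}\right),
\qquad
L_{1/2}(x) = e^{x/2}\left[(1-x)\, I_{0}\!\left(-\frac{x}{2}\right) - x\, I_{1}\!\left(-\frac{x}{2}\right)\right]
```

Here $I_0$ and $I_1$ are modified Bessel functions of the first kind. At $A = 0$ the mean equals $\sigma\sqrt{\pi/2}$ rather than zero, which is the noise floor that biases low-SNR signals upward when training data ignore magnitude effects.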

2606.10069 2026-06-12 cs.LG physics.geo-ph 版本更新

Using Seismic Statistical Features and VQ-VAE to Improve Spatiotemporal Seismicity Predictability

利用地震统计特征和VQ-VAE提升时空地震活动可预测性

Wei Quan, Denise Gorse

AI总结 本文在先前基于XGBoost和地震统计特征的研究基础上,将预测从全区域扩展到局部区域,并引入基于VQ-VAE模型从二维地震图提取的新特征,提升了局部地震预测性能。

Comments Title updated from "Spatiotemporal Seismic Hazard Assessment Using VQ-VAE and Seismic Statistical Features" to "Using Seismic Statistical Features and VQ-VAE to Improve Spatiotemporal Seismicity Predictability" in v2 to better reflect the focus of the paper. The content is unchanged apart from the title and minor copyediting

AI中文摘要

在本文中,我们基于先前的一项研究,该研究使用XGBoost以及日本和智利的地震目录数据证明,一组60个地震统计特征(SSFs)比tsfresh包中的428个通用时间序列特征具有更大的预测价值。我们在此以两种关键方式扩展了先前的工作,重点使用日本的数据,因为需要大数据集来训练深度学习(自编码器)模型。首先,我们从全区域预测(针对每个候选事件,考虑未来15天内区域内任何地方发生M≥5.0事件的可能性)转向局部预测,其中特征计算区域和预测区域都限制在候选事件周围半径24公里的圆内,并且我们表明性能仍然优秀,与先前同一区域的全局研究相似。其次,我们将基于一维(目录)数据的这套经过验证的SSFs与基于二维地震图的新特征相结合,该特征通过训练VQ-VAE模型以输出此类地图,并识别其误差度量与局部地壳应力积累的关系。我们表明,尽管仅基于SSFs的局部预测可以单独有效,测试AUC值与先前日本全局研究中的值一样高,但包含新的原生空间VQ-VAE衍生特征(通过SHAP分析排名最高)可以提升性能,并且似乎几乎完全取代了传统计算的b值在特征使用中的位置。

英文摘要

In this paper we build upon a previous study in which we demonstrated, using XGBoost and earthquake catalogue data from Japan and Chile, that a set of 60 seismic statistical features (SSFs) had much greater predictive value than a set of 428 generic time series features from the tsfresh package. We here extend this previous work in two key ways, focusing on data from Japan as a large dataset is necessary in order to allow for the training of a deep learning (autoencoder) model. First, we move from whole-region prediction (considering, for each candidate event, the likelihood of an event M $\geq$ 5.0 anywhere in the region in the next 15 days) to localised predictions in which both the region of feature computation and the region of prediction are restricted to a circle of radius 24 km around the candidate event, and we show that performance remains excellent, similar to our previous whole-region study for the same area. Second, we here couple this proven set of SSFs, based on one-dimensional (catalogue) data, with a novel feature based on two-dimensional seismic maps, obtained by training a VQ-VAE model to reproduce such maps as output and identifying a measure of its error in doing so with a localised build-up of crustal stress. We show that while localised prediction based on SSFs can be effective alone, with test AUC values as high as those obtained in the case of Japan in our previous whole-region study, the inclusion of the new natively-spatial VQ-VAE-derived feature, top-ranked by SHAP analysis, can enhance performance and additionally appears to near-wholly replace the traditionally-computed $b$-value in terms of feature usage.

2606.11793 2026-06-12 cs.LG cs.AI physics.ao-ph 版本更新

Scalable Deep Learning Framework for Global High-Resolution Land Use Reconstruction

面向全球高分辨率土地利用重建的可扩展深度学习框架

Amirpasha Mozaffari, Marina Castaño, Stefano Materia, Etienne Tourigny, Oscar Molina-Sedano, Jordi Varela-Agrelo, Dario Garcia-Gasulla, Miguel Castrillo Melguizo, Mario Acosta, Amanda Duarte

发表机构 * Barcelona Supercomputing Center(巴塞罗那超级计算中心)

AI总结 提出AI4Land框架,采用U-Net两阶段方法,结合粗分辨率情景数据与静态地理特征,重建高分辨率年度土地利用与覆盖,减少陆地碳循环不确定性,支持气候模拟。

AI中文摘要

陆地碳循环的不确定性仍是气候预测的主要制约因素,部分源于地球系统模型中陆面表征和变率的不确定性。为解决此问题,我们提出了数据驱动框架AI4Land,用于生成关键陆面变量的高分辨率历史重建和未来预测。该框架采用U-Net架构的两阶段方法。在第一阶段(本文重点),它通过整合粗分辨率情景数据与静态地理特征,重建年度土地利用与土地覆盖。在计划的第二阶段,生成的高分辨率地图将用于在更细时间尺度上预测动态生物物理变量,特别是叶面积指数。模型基于地球观测数据训练,学习再现空间明确且物理一致的陆面模式,并将时间覆盖扩展到缺乏直接观测的时期。AI4Land在MareNostrum5上开发和训练,展示了GPU加速的高性能计算基础设施如何支持全球尺度的气候AI流水线。最终产品是一套开源模拟器,旨在与数字孪生平台(如Destination Earth计划下开发的平台)实时耦合。通过按需提供逼真且演变的陆面条件,本工作旨在减少关键不确定性,提高下一代气候模拟的预测能力。

英文摘要

Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting the land surface representation and variability in Earth system models. To address this limitation, we present AI4Land, a data-driven framework for generating high-resolution historical reconstructions and future projections of key land surface variables. The framework follows a two-phase approach using a U-Net architecture. In the first phase, which is the focus of this work, it reconstructs annual land use and land cover by integrating coarse-resolution scenario data with static geophysical features. In a planned second phase, the resulting high-resolution maps will be used to predict dynamic biophysical variables, particularly leaf area index, at finer temporal scales. Trained on Earth observation data, the models learn to reproduce spatially explicit and physically consistent land surface patterns, extending temporal coverage to periods lacking direct observations. AI4Land was developed and trained on MareNostrum5, demonstrating how GPU-accelerated HPC infrastructure enables global-scale climate AI pipelines. The final product is a suite of open-source emulators designed for real-time coupling with digital twin platforms, such as those developed under the Destination Earth initiative. By delivering realistic and evolving land surface conditions on demand, this work aims to reduce critical uncertainties and improve the predictive power of next-generation climate simulations.

2410.00903 2026-06-12 stat.AP cs.CL cs.LG 版本更新

Causal Inference with Generative Artificial Intelligence: Application to Texts as Treatments

基于生成式人工智能的因果推断:以文本作为处理变量

Kosuke Imai, Kentaro Nakamura

发表机构 * Harvard University(哈佛大学) John F. Kennedy School of Government(约翰·F·肯尼迪政府学院)

AI总结 提出利用生成式AI(如大语言模型)生成处理变量并利用其内部表示进行因果效应估计,避免从数据中学习因果表示,提高估计准确性和效率。

AI中文摘要

在本文中,我们展示了如何利用生成式人工智能(GenAI)的力量,增强以文本等高维非结构化数据作为处理变量时的因果推断有效性。具体而言,我们提出使用深度生成模型(如大语言模型,LLMs)高效地生成处理变量,并利用其内部表示进行后续的因果效应估计。我们表明,了解这种真实内部表示有助于将感兴趣的处理特征(如特定情感和某些主题)与其他可能未知的混淆特征分离开来。与现有方法不同,所提出的GenAI驱动推断(GPI)方法无需从数据中学习因果表示,因此能产生更准确和高效的估计。我们正式建立了非参数识别平均处理效应所需的条件,提出了一种避免重叠假设违反的估计策略,并通过应用双重机器学习推导了所提出估计量的渐近性质。最后,利用工具变量方法,我们将所提出的GPI方法扩展到处理特征基于人类感知的场景。GPI也适用于文本复用,即使用LLM重新生成现有文本。我们进行了模拟和实证研究,使用开源LLM Llama 3生成的文本数据,展示了我们的估计器相对于最先进的因果表示学习算法的优势。

英文摘要

In this paper, we demonstrate how to enhance the validity of causal inference with unstructured high-dimensional treatments like texts, by leveraging the power of generative Artificial Intelligence (GenAI). Specifically, we propose to use deep generative models such as large language models (LLMs) to efficiently generate treatments and use their internal representations for subsequent causal effect estimation. We show that the knowledge of this true internal representation helps disentangle the treatment features of interest, such as specific sentiments and certain topics, from other possibly unknown confounding features. Unlike existing methods, the proposed GenAI-Powered Inference (GPI) methodology eliminates the need to learn causal representation from the data, and hence produces more accurate and efficient estimates. We formally establish the conditions required for the nonparametric identification of the average treatment effect, propose an estimation strategy that avoids the violation of the overlap assumption, and derive the asymptotic properties of the proposed estimator through the application of double machine learning. Finally, using an instrumental variables approach, we extend the proposed GPI methodology to settings in which the treatment feature is based on human perception. The GPI is also applicable to text reuse, where an LLM is used to regenerate existing texts. We conduct simulation and empirical studies, using the generated text data from an open-source LLM, Llama 3, to illustrate the advantages of our estimator over state-of-the-art causal representation learning algorithms.

2508.12681 2026-06-12 cs.RO cs.LG cs.SY eess.SY 版本更新

Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory

基于Cosserat杆理论物理信息神经网络的软体连续机器人自适应模型预测控制

Johann Licher, Max Bartholdt, Henrik Krauss, Tim-Lukas Habich, Thomas Seel, Moritz Schappler

发表机构 * Institute of Mechatronic Systems, Leibniz University Hannover(汉诺威莱布尼茨大学机电一体化系统研究所) Department of Advanced Interdisciplinary Studies, The University of Tokyo(东京大学先进跨学科研究系) Institute of Assembly Technology and Robotics, Leibniz University Hannover(汉诺威莱布尼茨大学装配技术与机器人研究所)

AI总结 提出一种基于域解耦物理信息神经网络(DD-PINN)的实时非线性模型预测控制框架,实现软体连续机器人的高精度动态控制,位置误差低于3 mm。

Comments Submitted to IEEE Transactions on Robotics, 20 pages, 14 figures

详情
AI中文摘要

软体连续机器人(SCR)的动态控制对其应用扩展具有巨大潜力,但由于精确动态模型的高计算需求,仍然是一个具有挑战性的问题。虽然已经提出了如Koopman算子方法等数据驱动方法,但它们通常缺乏自适应性,且无法重建完整的机器人形状,限制了其适用性。本文介绍了一种基于具有自适应弯曲刚度的域解耦物理信息神经网络(DD-PINN)的实时非线性模型预测控制(MPC)框架。DD-PINN作为动态Cosserat杆模型的替代模型,加速比高达44,000倍。它还被用于无迹卡尔曼滤波器中,从末端执行器位置测量中估计模型状态和弯曲柔度。我们在GPU上实现了一个以70 Hz运行的非线性进化MPC。在仿真中,它展示了动态轨迹的精确跟踪和设定点控制,末端执行器位置误差低于3 mm(执行器长度的2.3%)。在实际实验中,控制器实现了类似的精度和高达3.55 m/s²的加速度。

英文摘要

Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot reconstruct the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of up to 44,000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3% of the actuator's length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s².

2509.04682 2026-06-12 cs.SD cs.AI cs.CV cs.IR cs.LG eess.AS 版本更新

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

GetNetUPAM:生态信息嵌套交叉验证与噪声鲁棒注意力用于海洋生物声学监测

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出GetNetUPAM框架,通过分层嵌套交叉验证保持生态异质性,并集成CBAM空间注意力的ARPA-N网络,在高噪声低信噪比条件下实现鲁棒泛化,在零训练区域将误报率降低约10倍。

Comments Resubmitted and under review as an anonymous submission to IEEETAI - We are allowed an archive submission. Final formatting is yet to be determined

详情
AI中文摘要

部署可靠的生物声学监测系统需要能够在高噪声、低信噪比条件下泛化的模型,以及能够暴露部署相关故障模式的评估协议,这些在当前UPAM实践中基本未得到解决。内在噪声、可变传播以及混合的生物和人为源会导致分布偏移,而传统模型和单次划分评估会掩盖这些偏移,夸大性能并掩盖不稳定性。我们提出GetNetUPAM,一种分层嵌套交叉验证框架,它利用嵌套阶段来量化模型稳定性,而不是调整以获取夸大的保留分数。通过将数据划分为站点-年份块,GetNetUPAM保留了生态异质性,并迫使每个外层折代表不同的环境条件,防止过拟合局部噪声或传感器伪影。内层分层折衡量整个UPAM信号分布上的泛化能力,强制模型开发与外层保留部署条件严格分离。使用GetNetUPAM,我们评估了自适应分辨率池化和注意力网络(ARPA-N),一种用于不规则频谱图维度的CNN架构。ARPA-N将CBAM空间注意力集成为学习型噪声抑制器,生成注意力图以定位真实叫声结构,并避免标准CNN在长窗口数据上利用的全局非生物线索。在GetNetUPAM下,ARPA-N在不同环境条件下鲁棒泛化。在零训练的Balleny Islands区域,它在固定90%召回率下将每小时误报率降低超过一个数量级(约10倍),并在各折上持续改进指标。这些进展提供了可重复的基准,推动UPAM向可扩展、部署可靠的生态监测发展。

英文摘要

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.
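
The nested protocol described above can be illustrated with a minimal sketch. It assumes a leave-one-block-out outer loop over site-year blocks and a round-robin stratified inner split; the function names `site_year_outer_folds` and `stratified_inner_folds`, the toy data, and the leave-one-block-out choice are all hypothetical and are not taken from the GetNetUPAM release.

```python
from collections import defaultdict


def site_year_outer_folds(blocks):
    """Yield (held_out_block, train_idx, test_idx), one outer fold per site-year block.

    Assumption: each outer fold holds out exactly one (site, year) block.
    """
    by_block = defaultdict(list)
    for i, block in enumerate(blocks):
        by_block[block].append(i)
    for block in sorted(by_block):
        test_idx = by_block[block]
        train_idx = [i for i, other in enumerate(blocks) if other != block]
        yield block, train_idx, test_idx


def stratified_inner_folds(indices, labels, k):
    """Split `indices` into k folds, dealing each class round-robin to keep class balance."""
    folds = [[] for _ in range(k)]
    by_label = defaultdict(list)
    for i in indices:
        by_label[labels[i]].append(i)
    for label in sorted(by_label):
        for j, i in enumerate(by_label[label]):
            folds[j % k].append(i)
    return folds


# Toy data (hypothetical): three site-year blocks, four clips each, binary call labels.
blocks = [("siteA", 2014)] * 4 + [("siteA", 2015)] * 4 + [("siteB", 2014)] * 4
labels = [0, 1, 0, 1] * 3

for held_out, train_idx, test_idx in site_year_outer_folds(blocks):
    # Inner folds see only the training blocks; the held-out block is never touched here.
    inner = stratified_inner_folds(train_idx, labels, k=2)
    print(held_out, [len(fold) for fold in inner], len(test_idx))
```

Splitting by block rather than by clip is what keeps recordings from the held-out site and year out of model development, which is the separation the abstract emphasises.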

2602.04075 2026-06-12 cond-mat.mtrl-sci cs.LG 版本更新

Thermodynamic assessment of machine learning models for solid-state synthesis prediction

固态合成预测机器学习模型的热力学评估

Jane Schlesinger, Simon Hjaltason, Nathan J. Szymanski, Christopher J. Bartel

发表机构 * University of Minnesota(明尼苏达大学)

AI总结 评估了机器学习模型预测固态材料合成可行性的热力学一致性,发现模型普遍高估合成可能性,但部分分数与热力学启发式趋势一致。

详情
AI中文摘要

机器学习模型最近被用于预测假设的固态材料是否可合成。这些模型旨在绕过固态相变的直接第一性原理建模,而是从成功合成材料的大型数据库中学习。在这里,我们评估了几个最近引入的合成预测模型与材料和反应热力学的对齐程度,通过相对于凸包的能量和考虑枚举合成反应的热力学选择性的度量来量化。使用成功合成配方的数据集确定了这两个量的可能界限,超出该界限的材料被认为不太可能被合成。以这些界限为背景,使用CHGNet基础势计算了通过Chemeleon生成模型生成的数千种新假设材料的热力学量。将四个最近发表的用于可合成性预测的机器学习模型应用于同一数据集,并将所得预测与计算的热力学进行比较。我们发现这些模型普遍高估了合成的可能性,但一些模型分数确实与热力学启发式趋势一致,对稳定性较差或没有计算为热力学选择性的可用合成配方的材料分配较低的分数。总的来说,这项工作识别了机器学习模型在材料合成中存在的差距,并引入了一种在缺乏大量负例(失败合成)的情况下评估其质量的新方法。

英文摘要

Machine learning models have recently emerged to predict whether hypothetical solid-state materials can be synthesized. These models aim to circumvent direct first-principles modeling of solid-state phase transformations, instead learning from large databases of successfully synthesized materials. Here, we assess the alignment of several recently introduced synthesis prediction models with material and reaction thermodynamics, quantified by the energy with respect to the convex hull and a metric accounting for thermodynamic selectivity of enumerated synthesis reactions. A dataset of successful synthesis recipes was used to determine the likely bounds on both quantities beyond which materials can be deemed unlikely to be synthesized. With these bounds as context, thermodynamic quantities were computed using the CHGNet foundation potential for thousands of new hypothetical materials generated using the Chemeleon generative model. Four recently published machine learning models for synthesizability prediction were applied to this same dataset, and the resultant predictions were considered against computed thermodynamics. We find these models generally overpredict the likelihood of synthesis, but some model scores do trend with thermodynamic heuristics, assigning lower scores to materials that are less stable or do not have an available synthesis recipe that is calculated to be thermodynamically selective. In total, this work identifies existing gaps in machine learning models for materials synthesis and introduces a new approach to assess their quality in the absence of extensive negative examples (failed syntheses).

2605.12542 2026-06-12 astro-ph.IM astro-ph.EP cs.LG 版本更新

Earth Science Foundation Models: From Perception to Reasoning and Discovery

地球科学基础模型:从感知到推理与发现

Xiangyu Zhao, Bo Liu, Yuehan Zhang, Zelin Song, Wanghan Xu, Feng Liu, Fengxiang Wang, Ben Fei, Fenghua Ling, Wangxu Wei, Wenlong Zhang, Xiao-Ming Wu

发表机构 * Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University(数据科学与人工智能系,香港理工大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 本文综述了地球科学基础模型,探讨了其从感知到多模态推理及科学发现的能力演进,并总结了其在大气、水圈、岩石圈等领域的广泛应用。

详情
AI中文摘要

大规模基础模型(FMs)正在通过整合异构多模态数据,如多平台影像、格网再分析数据、多样的地球物理和地球化学观测以及领域特定文本,来推动地球科学的发展。本文通过两个互补维度对地球科学基础模型(地球FMs)进行统一综述:深度,即追踪模型能力从感知到多模态推理和代理科学工作流的演变;广度,即总结其在大气、水圈、岩石圈、生物圈、人类圈和冰圈以及耦合地球系统过程中的扩展应用。利用这一框架,我们回顾了代表性多模态地球基础模型,并编译了超过200个数据集和基准,涵盖多样化的地球科学任务和模态。我们进一步讨论了多模态数据异构性、科学可靠性和持续更新、可扩展性和可持续性以及从基础模型到代理和具身地球智能的转变,并展望了更集成、可信和可操作的AI地球科学家的未来方向。总体而言,本文为理解地球基础模型的发展提供了结构化的路线图,从能力和应用广度两个方面进行综述。

英文摘要

Large foundation models (FMs) are transforming Earth science by integrating heterogeneous multimodal data, such as multi-platform imagery, gridded reanalysis data, diverse geophysical and geochemical observations, and domain-specific text, to support tasks ranging from basic perception to advanced scientific discovery. This paper provides a unified review of Earth science foundation models (Earth FMs) through two complementary dimensions: depth, which traces the evolution of model capabilities from perception to multimodal reasoning and agentic scientific workflows, and breadth, which summarizes their expanding applications across the atmosphere, hydrosphere, lithosphere, biosphere, anthroposphere, and cryosphere, as well as coupled Earth system processes. Using this framework, we review representative multimodal Earth foundation models and compile more than 200 datasets and benchmarks spanning diverse Earth science tasks and modalities. We further discuss key challenges in multimodal data heterogeneity, scientific reliability and continual updating, scalability and sustainability, and the transition from foundation models to agentic and embodied Earth intelligence, and outline future directions toward more integrated, trustworthy, and actionable AI Earth scientists. Overall, this paper offers a structured roadmap for understanding the development of Earth foundation models from both capability depth and application breadth.

2605.14568 2026-06-12 cs.SE cs.CL cs.LG 版本更新

Given, When, Then, Again: Mining Subscenario Refactoring Candidates in Behaviour-Driven Test Suites with ML Classifiers and LLM-Judge Baselines

Given, When, Then, Again:利用ML分类器和LLM评判基线挖掘行为驱动测试套件中的子场景重构候选

Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal

发表机构 * Independent Researcher; Applied MBA (Data Analytics), Texas Wesleyan University(独立研究者;应用MBA(数据分析),德克萨斯韦斯利安大学) Independent Researcher; B.E. Computer Engineering, National University of Sciences and Technology (NUST)(独立研究者;计算机工程学士,国立科学与技术大学(NUST)) Independent Researcher; M.Sc. Management, Technical University of Munich(独立研究者;管理学硕士,慕尼黑工业大学)

AI总结 本文通过ML分类器和LLM基线,识别行为驱动开发测试套件中可提取的子场景,量化其在公共BDD生态系统中的普及率。

Comments 31 pages, 10 figures, 6 tables, 56 references. v2: retitled; reference list fully corrected and verified; decision-threshold sensitivity analysis and imbalance-robust baseline metrics added; figures restyled. Reproduction package at https://github.com/amughalbscs16/cukereuse_subscenarios_release (Apache-2.0). Upstream cukereuse corpus at https://doi.org/10.5281/zenodo.19754359

详情
AI中文摘要

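背景。行为驱动开发(BDD)软件测试套件积累重复的步骤子序列。有三种已发布的重构模式(在同一文件中的背景、在同一仓库中可重用的场景调用、跨组织共享的更高层次步骤),但没有先前工作自动化确定哪些重复的子序列值得提取或哪种机制适用。目标。按重构适宜性(是否值得提取)对重复出现的步骤子序列("切片")进行排序,将每个切片预先映射到三种模式之一,并量化其在公共BDD生态系统中的普及程度。方法。在一个包含339个仓库、276个上游所有者的Gherkin语料库中,每个连续的L步窗口(L取值范围为[2, 18])以对释义鲁棒的聚类标识符作为键,并在三种范围下计数。SBERT / UMAP / HDBSCAN聚类用于恢复释义等价的切片。三位作者依据书面评分标准对分层抽样的200个切片进行标注。在5折交叉验证下训练的XGBoost"值得提取"分类器与经过调优的规则基线以及两个开放权重的大语言模型(LLM)评判器进行比较。结果。挖掘器产生5,382,249个切片,归并为692,020个重复模式。三位作者的Fleiss' kappa为0.56(值得提取)和0.79(机制)。分类器的折外F1达到0.891(95%置信区间[0.852, 0.927]),优于规则基线(F1 = 0.836,p = 0.017)和较好的LLM评判器(F1 = 0.728,p = 1.5e-4)。分别有75.0%、59.5%和11.7%的场景带有文件内Background、仓库内可重用场景和跨组织共享步骤的候选;这些数字在分类器决策阈值的扫描下保持稳定。结论。对释义鲁棒的子场景发现给出了BDD重构候选的全语料库普查;流水线、分类器预测、已标注样本池和评分标准均以Apache-2.0许可发布。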

英文摘要

Context. Behaviour-Driven Development (BDD) test suites accumulate duplicated step subsequences. Three published refactoring patterns are available (within-file Background, within-repo reusable-scenario invocation, cross-organisational shared higher-level step), but no prior work automates which recurring subsequences are worth extracting or which mechanism applies. Objective. Rank recurring step subsequences ("slices") by refactoring suitability (extraction-worthy), pre-map each to one of the three patterns, and quantify prevalence across the public BDD ecosystem. Method. Every contiguous L-step window (L in [2, 18]) in a 339-repository / 276-upstream-owner Gherkin corpus is keyed by paraphrase-robust cluster identifiers and counted under three scopes. SBERT / UMAP / HDBSCAN clustering recovers paraphrase-equivalent slices. Three authors label a stratified 200-slice pool against a written rubric. An XGBoost extraction-worthy classifier trained under 5-fold cross-validation is compared with a tuned rule baseline and two open-weight Large Language Model (LLM) judges. Results. The miner produces 5,382,249 slices collapsing to 692,020 recurring patterns. Three-author Fleiss' kappa = 0.56 (extraction-worthy) and 0.79 (mechanism). The classifier reaches out-of-fold F1 = 0.891 (95% CI [0.852, 0.927]), outperforming both the rule baseline (F1 = 0.836, p = 0.017) and the better LLM judge (F1 = 0.728, p = 1.5e-4). 75.0%, 59.5%, and 11.7% of scenarios carry a within-file Background, within-repo reusable-scenario, and cross-organisational shared-step candidate, respectively; the figures are stable under a sweep of the classifier decision threshold. Conclusion. Paraphrase-robust subscenario discovery yields a corpus-wide census of BDD refactoring candidates; pipeline, classifier predictions, labelled pool, and rubric are released under Apache-2.0.
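
The window-mining step in the Method paragraph can be sketched as follows. This is a simplified illustration that assumes each step has already been replaced by its cluster identifier and that counting happens under a single corpus-wide scope; the function name `mine_slices`, the toy identifiers, and the "seen at least twice" recurrence threshold are hypothetical, and the three scopes of the paper are not modelled.

```python
from collections import Counter


def mine_slices(scenarios, min_len=2, max_len=18):
    """Count every contiguous window of min_len..max_len steps across all scenarios.

    Each scenario is a list of step cluster identifiers, so paraphrased steps
    that share a cluster map to the same key.
    """
    counts = Counter()
    for steps in scenarios:
        longest = min(max_len, len(steps))
        for length in range(min_len, longest + 1):
            for start in range(len(steps) - length + 1):
                counts[tuple(steps[start:start + length])] += 1
    return counts


# Toy corpus (hypothetical cluster identifiers).
scenarios = [
    ["c1", "c2", "c3"],
    ["c1", "c2", "c4"],
    ["c5", "c1", "c2"],
]

counts = mine_slices(scenarios)
# Assumption: a slice is "recurring" if it appears at least twice.
recurring = {s: n for s, n in counts.items() if n >= 2}
print(sum(counts.values()), recurring)
```

Keying windows by cluster identifier rather than raw step text is what lets differently worded but equivalent steps collapse into the same recurring pattern.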

2605.22641 2026-06-12 cs.CL cs.AI cs.LG 版本更新

More Context, Larger Models, or Moral Knowledge? A Systematic Study of Schwartz Value Detection in Political Texts

更多上下文、更大模型还是道德知识?政治文本中施瓦茨价值观检测的系统研究

Víctor Yeste, Paolo Rosso

发表机构 * PRHLT Research Center, Universitat Politècnica de València, Spain(西班牙瓦伦西亚理工大学PRHLT研究中心) School of Science, Engineering and Design, Universidad Europea de Valencia, Spain(西班牙瓦伦西亚欧洲大学科学、工程与设计学院) Valencian Graduate School and Research Network of Artificial Intelligence (ValgrAI)(瓦伦西亚人工智能研究生学院与研究网络(ValgrAI))

AI总结 本研究系统比较了上下文范围、检索增强道德知识和模型规模对政治文本中施瓦茨价值观检测的影响,发现全文档上下文和检索知识对监督编码器有效,但对零样本大语言模型帮助有限,且模型扩展不保证性能提升。

Comments Code: https://github.com/VictorMYeste/human-value-detection-context-rag, best model: https://huggingface.co/VictorYeste/value-context-rag-deberta-v3-base-doc-rag, 18 pages, 3 figures

详情
AI中文摘要

检测政治文本中的施瓦茨价值观具有挑战性,因为隐含线索通常依赖于周围的论证和相邻价值观之间的细微差别。我们研究了上下文和显式道德知识何时有助于句子级别的价值观检测。使用ValuesML/Touché ValueEval格式,我们比较了句子、窗口和全文档输入;无检索增强和基于检索增强的设置(使用精心策划的道德知识库);监督的DeBERTa-v3-base/large编码器;以及参数规模从12B到123B的零样本大语言模型。结果表明,更多上下文并非总是更好:全文档上下文使监督的DeBERTa编码器相比仅句子输入提高了3.8-4.8个宏F1点,但对零样本大语言模型没有一致帮助。在匹配比较中,检索到的道德知识更一致地有用,在早期融合下改善了每个测试的模型系列和上下文条件。然而,从DeBERTa-v3-base扩展到large以及从12B扩展到更大的大语言模型并不保证收益,并且简单的早期融合优于测试的后期融合和交叉注意力检索增强生成变体。按价值观分析表明,上下文和检索对社交情境化或概念上易混淆的价值观帮助最大。这些发现表明,价值观敏感的NLP应联合评估上下文、知识和模型系列,而不是将更长的输入或更大的模型视为通用改进。

英文摘要

Detecting Schwartz values in political text is difficult because implicit cues often depend on surrounding arguments and fine-grained distinctions between neighboring values. We study when context and explicit moral knowledge help sentence-level value detection. Using the ValuesML/Touché ValueEval format, we compare sentence, window, and full-document inputs; no-RAG and retrieval-augmented settings with a curated moral knowledge base; supervised DeBERTa-v3-base/large encoders; and zero-shot LLMs from 12B to 123B parameters. The results show that more context is not uniformly better: full-document context improves supervised DeBERTa encoders by 3.8-4.8 macro-F1 points over sentence-only input, but does not consistently help zero-shot LLMs. Retrieved moral knowledge is more consistently useful in matched comparisons, improving each tested model family and context condition under early fusion. However, scaling from DeBERTa-v3-base to large and from 12B to larger LLMs does not guarantee gains, and simple early fusion outperforms the tested late-fusion and cross-attention RAG variants for encoders. Per-value analyses show that context and retrieval help most for socially situated or conceptually confusable values. These findings suggest that value-sensitive NLP should evaluate context, knowledge, and model family jointly rather than treating longer inputs or larger models as universal improvements.

2605.26358 2026-06-12 physics.flu-dyn cs.LG 版本更新

Deep Learning-based Algebraic Reynolds Stress Closures for RANS Simulations of Turbulent Flows

基于深度学习的代数雷诺应力闭合模型用于湍流RANS模拟

Daniel Dehtyriov, Jonathan F. MacArt, Justin Sirignano

发表机构 * Mathematical Institute, University of Oxford(牛津大学数学研究所) Aerospace and Mechanical Engineering, University of Notre Dame(圣母大学航空航天与机械工程系)

AI总结 提出一种物理驱动的深度学习闭合模型DARSM,通过神经网络映射流动不变量到隐式代数雷诺应力方程中的经验参数,并结合伴随方程实现端到端优化,在方形管道和周期性山丘基准测试中平均速度误差降低2-4倍。

详情
AI中文摘要

湍流在工程和科学中普遍存在,但直接模拟成本过高。雷诺平均纳维-斯托克斯(RANS)方程可节省超过十个数量级的计算量,但引入了未封闭项(封闭问题)。离线训练的机器学习(ML)闭合模型在预测模拟中会出现分布偏移,而绕过控制方程的ML方法难以从稀缺的高保真数据中泛化。我们开发了一种基于物理的深度学习RANS闭合模型——深度代数雷诺应力模型(DARSM),该模型可在小数据集上训练,并准确泛化到不同雷诺数、未见几何形状和不同流动状态。神经网络将流动不变量映射到隐式代数雷诺应力方程中的经验参数,该方程基于弱平衡假设从雷诺应力输运方程推导而来,为ML闭合施加了基于物理的结构。通过控制偏微分方程和耦合隐式闭合的端到端优化消除了分布偏移,但展开和隐式自动微分在刚性耦合求解器上均失败。我们推导了利用求解器隐式-显式结构的伴随方程,以实现高效优化。在标准方形管道和周期性山丘基准测试中,DARSM将基线RANS的平均测试速度误差降低了2-4倍(跨雷诺数、几何形状和流动状态),峰值案例级降低达12倍。在附着、各向异性主导的流动(方形管道)上训练的模型无需重新训练即可准确泛化到分离流动(周期性山丘),这是底层物理状态的改变。DARSM还优于五种已建立的ML方法:离线训练、张量基神经网络、场反演机器学习、DeepONet和物理信息神经网络。

英文摘要

Turbulence is ubiquitous in engineering and science, yet direct simulation is prohibitively expensive. The Reynolds-averaged Navier-Stokes (RANS) equations provide savings exceeding ten orders of magnitude but introduce unclosed terms (the closure problem). Offline-trained machine-learning (ML) closures suffer distribution shift in predictive simulations, while ML methods that bypass the governing equations struggle to generalise from scarce high-fidelity data. We develop a physics-derived deep learning closure model for RANS, the Deep Algebraic Reynolds Stress Model (DARSM), which can be trained on small datasets and accurately generalise across Reynolds numbers, to unseen geometries, and to different flow regimes. A neural network maps flow invariants to empirical parameters in an implicit algebraic Reynolds stress equation, derived from the Reynolds stress transport equations under the weak-equilibrium assumption, imposing physics-based structure on the ML closure. End-to-end optimisation through the governing PDEs and the coupled implicit closure eliminates distribution shift, but both unrolled and implicit automatic differentiation fail on the stiff coupled solver. We derive adjoint equations that exploit the solver's implicit-explicit structure for efficient optimisation. On canonical square-duct and periodic-hill benchmarks, DARSM reduces average test velocity error over baseline RANS by 2-4× across Reynolds number, geometries, and flow regimes, with peak case-level reductions of 12×. The model trained on attached, anisotropy-dominated flows (square duct) accurately generalises without retraining to separated flows (periodic hills), a regime change in the underlying physics. DARSM also outperforms five established ML methods: offline training, tensor-basis neural networks, field-inversion machine learning, DeepONets, and physics-informed neural networks.

2606.02778 2026-06-12 astro-ph.EP astro-ph.IM cs.LG 版本更新

One Transit Is All You Need: Detecting Exoplanets Through Learned Stellar Behaviour with EXOVEIL

一次凌星足矣:通过EXOVEIL学习恒星行为检测系外行星

Pratik Priyanshu

发表机构 * SRH Hochschule(SRH 高校)

AI总结 提出EXOVEIL系统,利用Transformer世界模型和自监督学习从原始光变曲线中检测单次凌星事件,在Kepler数据上实现高召回率,并零样本迁移至TESS和PLATO任务。

Comments v3: appendix gallery of confirmed-planet recoveries added; Section 6 candidate catalogue reframed as transit-like anomalies for follow-up; TLS comparison table expanded

详情
AI中文摘要

我提出EXOVEIL,一个凌星检测系统,它学习恒星亮度应有的样子,并在现实不符时发出标记。与需要相位折叠输入的现有系统不同,EXOVEIL在原始通量时间序列上运行,可以检测仅凌星一次的行星。一个Transformer世界模型,在16,499条Kepler光变曲线上通过凌星掩蔽自监督学习训练,预测预期的恒星通量。一个带有方差加权的匹配滤波检测器从预测残差中提取凌星信号。一个学习分类器(XGBoost)将行星与假阳性区分开,在Kepler DR25上达到AUC 0.938。应用于单次凌星注入-恢复,EXOVEIL在1000 ppm深度下恢复了32%的凌星——而所有基于分类的系统由于设计原因得分为0%。对3,737颗Kepler恒星进行盲搜索,发现了179个新的凌星类信号,这些信号不在DR25 TCE目录中,包括46个单次凌星候选者。无需重新训练,应用于PLATO LOPS2场中的47颗已确认TESS行星,EXOVEIL实现了100%的恢复,展示了零样本跨任务迁移。在PLATO的25秒曝光下,检测达到100 ppm——接近地球类似物范围。我提供了共形预测在凌星检测中的首次应用(95.9%经验覆盖率),并发布了该系统,可通过pip install exoveil安装,包含预训练权重和候选目录。

英文摘要

I present EXOVEIL, a transit detection system that learns what a star's brightness should look like and flags when reality disagrees. Unlike existing systems that require phase-folded input, EXOVEIL operates on raw flux time series and can detect planets that transit only once. A Transformer world model, trained on 16,499 Kepler light curves with transit-masked self-supervised learning, predicts expected stellar flux. A matched-filter detector with variance weighting extracts transit signals from the prediction residuals. A learned classifier (XGBoost) separates planets from false positives, achieving AUC 0.938 on Kepler DR25. Applied to single-transit injection-recovery, EXOVEIL recovers 32% of transits at 1000 ppm depth, a task where all classification-based systems score 0% by construction. A blind search of 3,737 Kepler stars yields 179 new transit-like signals not present in the DR25 TCE catalogue, including 46 monotransit candidates. Applied without retraining to 47 confirmed TESS planets in the PLATO LOPS2 field, EXOVEIL achieves 100% recovery, demonstrating zero-shot cross-mission transfer. At PLATO's 25-second cadence, detection reaches 100 ppm -- approaching the Earth-analog regime. I provide the first application of conformal prediction to transit detection (95.9% empirical coverage) and release the system as pip install exoveil with pretrained weights and a candidate catalogue.
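
To make the idea of a variance-weighted matched filter on prediction residuals concrete, here is a minimal sketch. It assumes a box-shaped transit template of fixed width and independent Gaussian noise with known per-point variance; the function name `box_matched_filter` and the toy numbers are hypothetical and do not reproduce the interface of the `exoveil` package.

```python
import math


def box_matched_filter(residuals, variances, width):
    """Slide a box template over the residuals and return (start, depth, snr) of the best window.

    For each window, the inverse-variance weighted depth estimate is
    depth = -sum(r_i / var_i) / sum(1 / var_i), with signal-to-noise
    snr = depth * sqrt(sum(1 / var_i)). Dips (negative residuals) give positive depth.
    """
    best = (0, 0.0, -math.inf)
    for start in range(len(residuals) - width + 1):
        window_r = residuals[start:start + width]
        weights = [1.0 / v for v in variances[start:start + width]]
        weight_sum = sum(weights)
        depth = -sum(r * w for r, w in zip(window_r, weights)) / weight_sum
        snr = depth * math.sqrt(weight_sum)
        if snr > best[2]:
            best = (start, depth, snr)
    return best


# Toy residuals (observed flux minus predicted flux): flat, with a 1000 ppm dip at samples 10..14.
residuals = [0.0] * 30
for i in range(10, 15):
    residuals[i] = -0.001
variances = [1e-8] * 30  # assumed per-point noise variance (sigma = 100 ppm)

start, depth, snr = box_matched_filter(residuals, variances, width=5)
print(start, depth, snr)
```

Weighting by inverse variance means that noisy stretches of the light curve contribute less to the detection statistic than quiet ones.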

2606.09855 2026-06-12 cs.MM cs.CV cs.LG 版本更新

MinhwaNet: Faithful but Insufficient Object Grounding in Korean Folk Painting

MinhwaNet: 韩国民俗画中忠实但不足的对象定位

Joonhyung Bae

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

AI总结 提出MinhwaNet,通过部分级检测器生成对象证据图,发现韩国民俗画中符号列表不足以预测画作类型,而符号布局更重要,揭示了忠实但不足的解离现象。

详情
AI中文摘要

韩国民俗画(minhwa)由少量吉祥符号构成——老虎代表保护、一对鸟代表婚姻和谐、牡丹代表财富——这些符号在其许多绘画类型中反复出现。这暗示了一种直观的计算方法:识别画作中出现的符号,并从符号清单中读取画作类型。我们使用一个公开语料库,包含整幅画作、八字段双语策展说明以及一组独立的专家对象裁剪图,发现这种方法并不奏效。仅给定画作包含的符号列表的模型,其预测画作类型的效果远不如将图像与策展文本融合的模型,而强制类型表示基于对象定位反而会损害准确性。然而,类型预测所依赖的视觉证据仍然是局部化的且可检查的。从部分级检测器投影出的无泄漏对象证据图,在空间上忠实于策展人隔离符号对象的位置以及基于补丁的替代模型的梯度显著性。我们将这种配置称为忠实但不足的解离。部分级解释诚实地反映了部分级模型所见,但类型目标取决于符号的排列方式而非出现的符号。相同的视角区分了内容标签(在转移到保留的源机构时仍然有效,即类型)和风格标签(无效,即时代),我们通过语料库中的另外两个标签验证了这一预测。我们发布了多模态系统、一幅画作的证据图与其目录的工作示例解读,以及在长尾遗产收藏中反复出现的一系列评估注意事项。

英文摘要

Korean folk painting (minhwa) is built from a small vocabulary of auspicious symbols, a tiger for protection, a pair of birds for marital harmony, a peony for wealth, that recur across many of its painted genres. This suggests an obvious computational approach, identify which symbols appear in a painting and read the genre from the inventory. Working with a public corpus that pairs whole paintings, eight-field bilingual curatorial captions, and a separate set of expert object crops, we find that this approach does not work. A model given only a list of which symbols a painting contains predicts the genre far worse than a model that fuses the image with the curatorial text, and forcing the genre representation to be object-grounded actively hurts accuracy. The visual evidence on which the genre prediction rests is nonetheless localized and inspectable. A leakage-safe object evidence map projected from a part-level detector is spatially faithful to where curators isolated symbolic objects and to a patch-based surrogate's own gradient saliency. We name this configuration a faithful-but-insufficient dissociation. The part-level explanation is honest about what the part-level model sees, yet the genre target turns on how symbols are arranged rather than on which ones appear. The same lens separates a content label that survives transfer to held-out source institutions, genre, from a style label that does not, era, a prediction we confirm on two further labels in the corpus. We release the multimodal system, a worked-example reading of one painting's evidence map against its catalogue, and a set of evaluation cautions that recur in long-tailed heritage collections.

2606.10200 2026-06-12 cs.CV cs.AI cs.LG 版本更新

An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration

一种改进的生成对抗网络用于微电阻率成像测井恢复

Ahmed Faizul Haque, S. M. Riaz Rahman Antu, Saif Ahmed, Asadullah Hil Galib, Souvik Pramanik, Mohammad Ashrafuzzaman Khan, Mohammad Abdul Qayum, Mohsin Sajjad

AI总结 提出基于改进GAN的成像测井图像恢复方法,通过FCN生成网络、深度可分离卷积残差块、Inception模块及多尺度特征提取与空间注意力机制,结合全局与局部判别网络,有效恢复缺失区域,结构相似性达0.903。

Comments Mistakes in citations and references. Further we want to submit in conference with improved experiments and results

详情
AI中文摘要

本文提出了一种改进的基于GAN的成像测井图像恢复方法,用于解决微电阻率成像测井图像部分缺失的问题。该方法采用FCN作为生成网络基础设施,并添加深度可分离卷积残差块以学习和保留更有效的像素与语义信息;添加Inception模块以增加网络的多尺度感知场并减少参数数量;添加多尺度特征提取模块和空间注意力残差块,结合通道注意力机制与残差块实现多尺度特征提取。设计了全局判别网络和局部判别网络,通过相互对抗与生成网络逐步提高恢复部分与整体图像之间的内容和语义结构一致性。实验结果表明,测试集中五组不同大小缺失区域的成像测井图像的平均结构相似性度量为0.903,相比其他类似方法提高了约0.3。研究表明,该方法可用于微电阻率成像测井图像的恢复,在语义结构一致性和纹理细节方面有良好改善,从而为保障微电阻率成像测井图像后续解释的顺利进行提供了一种新的深度学习方法。

英文摘要

An improved GAN-based restoration method is presented in this paper for solving the problem of partially missing micro-resistivity imaging logging images. The method uses an FCN as the generative network backbone and adds a depthwise separable convolutional residual block to learn and retain more effective pixel and semantic information. An Inception module is added to enlarge the multi-scale receptive field of the network and reduce its number of parameters. A multi-scale feature extraction module and a spatial attention residual block, which combine the channel attention mechanism with the residual block, are added to achieve multi-scale feature extraction. A global discriminative network and a local discriminative network are designed to gradually improve the content and semantic structure coherence between the restored parts and the whole image through adversarial training against each other and the generative network. According to the experimental results, the average structural similarity measure of the five sets of imaging logging images with different sizes of missing regions in the test set is 0.903, which is an improvement of about 0.3 compared with other similar methods. The results show that the method can be used for the restoration of micro-resistivity imaging logging images with good improvement in semantic structural coherence and texture details, thus providing a new deep learning method to support the subsequent interpretation of micro-resistivity imaging logging images.

2606.11240 2026-06-12 physics.comp-ph cond-mat.str-el cs.LG quant-ph 版本更新

Physically Constrained Ensemble Gaussian Process Modelling for Expensive Quantum Systems with Heteroskedastic Noise

物理约束集成高斯过程建模用于具有异方差噪声的昂贵量子系统

Arpan Biswas, Sutirtha Paul, Joseph Agada, Matthias Thamm, Adrian Del Maestro

AI总结 提出物理约束集成高斯过程框架,通过加权惩罚和数值积分集成多个GP代理,高效建模含异方差噪声的量子系统,在Bose-Hubbard模型和纳米孔硅酸盐量子液体模拟中实现更准确且物理合理的预测。

Comments 14 pages, 6 figures in main text, 2 figures in Supp materials

详情
AI中文摘要

精确建模量子多体系统通常需要计算昂贵的模拟,如密度矩阵重正化群(DMRG)或量子蒙特卡洛(QMC)计算。这些方法虽然精确,但会带来显著的时间和资源限制,限制了它们在详尽参数探索中的应用。此外,这些昂贵模拟在大的未知参数空间内可能包含可变误差,需要量化和传播。因此,需要预测建模来准确估计稀疏采样数据(具有异方差噪声)的函数空间,同时保持估计的物理相关性。为此,我们提出了物理约束集成高斯过程(pc-EGP)框架,旨在物理一致性约束下高效建模复杂且含噪声的量子系统。该方法首先将物理约束作为用户控制的加权惩罚项,施加到高斯过程(GP)代理的数据驱动损失函数中。然后,通过数值求积方法训练一组这样的GP模型,其中多个不同节点上的GP通过求积加权平均进行集成。我们首先在合成生成数据上演示该框架,然后应用于量子系统。在第一个案例研究中,我们利用Bose-Hubbard模型的DMRG模拟来预测控制超流-莫特绝缘体转变的临界相互作用参数Uc。在第二个案例研究中,我们展示了该方法在QMC模拟上的应用,模拟限制在纳米孔硅酸盐内的量子液体,目标是优化化学环境以实现一维超流。与传统GP相比,pc-EGP在准确性和物理有意义的预测之间实现了更好的平衡。

英文摘要

Accurate modeling of quantum many-body systems often requires computationally expensive simulations such as Density Matrix Renormalization Group (DMRG) or Quantum Monte Carlo (QMC) calculations. These methods, while precise, impose significant time and resource constraints, limiting their use in exhaustive parameter exploration. Moreover, these expensive simulations can contain variable errors over the large unknown parameter space, which need to be quantified and propagated. Predictive modelling is therefore required to estimate the functional space accurately from sparsely sampled data with heteroskedastic noise, while preserving the physical relevance of the estimate. We present a Physically Constrained Ensemble Gaussian Process (pc-EGP) framework designed to efficiently model complex and noisy quantum systems under physical consistency constraints. The proposed method first enforces physical constraints as a user-controlled weighted penalty added to the data-driven loss function of the Gaussian Process (GP) surrogates. An ensemble of such GP models is then trained on the variably noisy simulations via a numerical quadrature method, in which the GPs at the different nodes are combined as a quadrature-weighted average. We first demonstrate the framework on synthetically generated data before applying it to quantum systems. In the first case study, we leverage DMRG simulations of the Bose-Hubbard Model to predict the critical interaction parameter Uc governing the superfluid-to-Mott-insulator transition. In the second case study, we demonstrate our method on QMC simulations of a quantum liquid confined inside a nanoporous silicate, with the goal of optimizing a chemical environment to realize a one-dimensional superfluid. Compared to a conventional GP, pc-EGP achieves a better balance of accuracy and physically meaningful predictions.
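
The two ingredients named in the abstract, a weighted constraint penalty and a quadrature-weighted ensemble average, can be written schematically as below. The notation is an assumption made for illustration only and is not taken from the paper: the symbols for the hyperparameters, the constraint-violation term, the penalty weight, and the per-node surrogates are all hypothetical.

```latex
% Assumed notation: \theta are GP hyperparameters, C(\theta) \ge 0 measures
% violation of the physical constraint, \lambda \ge 0 is the user-set weight.
\mathcal{L}(\theta) = \mathcal{L}_{\text{data}}(\theta) + \lambda \, C(\theta)

% Ensemble prediction: GP surrogates \mu_k trained at quadrature nodes k = 1, \dots, K
% are combined with quadrature weights w_k (normalised here so they sum to one).
\bar{\mu}(x) = \sum_{k=1}^{K} w_k \, \mu_k(x), \qquad \sum_{k=1}^{K} w_k = 1
```

Setting the penalty weight to zero recovers an unconstrained GP fit, so in this reading the weight acts as the knob that trades data fidelity against physical consistency.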

13. 其他/综合机器学习 24 篇

2606.12610 2026-06-12 cs.LG 新提交

The Mathematics of AI Winters: The mathematical Taxonomy of Paradigm Fragility in AI Winter

AI寒冬的数学:AI中范式脆弱性的数学分类

Miquel Noguer i Alonso, David Pacheco Aznar

发表机构 * AIFI Staq.io

AI总结 本文提出AI寒冬的数学解释,通过感知机不可能性、神经网络训练复杂度、高维非参数估计率、梯度消失和统计学习理论等数学瓶颈,分析早期AI范式失败的原因,并关联后续突破。

Comments 33 pages, 1 figure

详情
AI中文摘要

人工智能研究中两个主要的资金减少和信心下降时期,通常被称为第一次和第二次AI寒冬,通常被解释为工程失败、商业失望和预期膨胀。本文提出一个补充论点:这些时期的主导范式也遇到了真正的形式障碍,包括表示、优化、计算复杂性、统计可学习性和高维近似的限制。贡献是综合性的而非档案性的。我们并不声称特定定理机械地导致了寒冬;相反,我们表明早期AI的几个核心失望与数学上精确的瓶颈相一致。我们通过Minsky和Papert的感知机不可能结果、Blum和Rivest建立的精确神经网络训练的计算复杂性困难、Stone的高维非参数估计的极小化极大率、Hochreiter以及Bengio及其合作者的梯度消失分析,以及Vapnik和Chervonenkis、Valiant、Blumer及其合作者传统的经典统计学习理论来分析这些瓶颈。然后我们将这些障碍与后来缓解(而非消除)它们的突破联系起来。

英文摘要

Two major periods of reduced funding and confidence in artificial intelligence research, commonly called the first and second AI winters, are usually explained through engineering failure, commercial disappointment, and inflated expectations. This article develops a complementary thesis: that the dominant paradigms of those periods also met genuine formal barriers, including limitations of representation, optimisation, computational complexity, statistical learnability, and high-dimensional approximation. The contribution is synthetic rather than archival. We do not claim that particular theorems mechanically caused the winters; rather, we show that several central disappointments of early AI were aligned with mathematically precise bottlenecks. We analyse these bottlenecks through the perceptron impossibility results of Minsky and Papert, the complexity-theoretic hardness of exact neural-network training established by Blum and Rivest, minimax rates for nonparametric estimation in high dimension due to Stone, vanishing-gradient analyses by Hochreiter and by Bengio and collaborators, and classical statistical learning theory in the tradition of Vapnik and Chervonenkis, Valiant, and Blumer and collaborators. We then relate these barriers to the later breakthroughs that mitigated, rather than eliminated, them.
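
As a reminder of what a mathematically precise representational bottleneck looks like, the simplest instance of the perceptron limitation is that a single threshold unit cannot compute XOR. The short derivation below is the standard textbook argument, given here as an illustration and not as the article's own presentation.

```latex
% Suppose a single threshold unit f(x_1, x_2) = 1[w_1 x_1 + w_2 x_2 + b > 0] computes XOR.
\begin{aligned}
f(0,0) = 0 &\;\Rightarrow\; b \le 0, \\
f(1,0) = 1 &\;\Rightarrow\; w_1 + b > 0, \\
f(0,1) = 1 &\;\Rightarrow\; w_2 + b > 0, \\
f(1,1) = 0 &\;\Rightarrow\; w_1 + w_2 + b \le 0.
\end{aligned}
% Adding the two middle lines gives w_1 + w_2 + 2b > 0, so w_1 + w_2 + b > -b \ge 0,
% which contradicts the last line. Hence no such (w_1, w_2, b) exists.
```

The contradiction disappears once a hidden layer is allowed, which is consistent with the abstract's point that later breakthroughs mitigated such barriers without eliminating them.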

2606.12683 2026-06-12 cs.AI cs.CY cs.LG 交叉投稿

From AGI to ASI

从AGI到ASI

Tim Genewein, Matija Franklin, Alexander Lerchner, Laurent Orseau, Samuel Albanie, Adam Bales, Cole Wyeth, Stephanie Chan, Iason Gabriel, Joel Z. Leibo, Allan Dafoe, Marcus Hutter, Thore Graepel, Shane Legg

发表机构 * Google DeepMind(谷歌深度思维) University of Waterloo(滑铁卢大学) Australian National University(澳大利亚国立大学) University College London(伦敦大学学院)

AI总结 探讨从人类级通用人工智能到超级智能的转变路径,包括扩展、范式转变、递归改进和多智能体涌现,并分析摩擦与瓶颈。

详情
AI中文摘要

在过去十年中,构建人类级通用人工智能已从遥不可及的猜测转变为许多大型AI组织未来十年的具体目标。实现这一目标将对人类社会产生深远影响,并引发未来十年的诸多复杂问题。本报告研究在机器智能连续体中,AI如何在后AGI世界中继续发展。该连续体的终点——通用AI——在理论上已被充分理解,这为本报告的主要焦点提供了形式基础:从人类级AGI向人工通用超级智能的转变,直观上可理解为比大型人类组织更智能、认知能力更强的系统。在描述ASI后,报告讨论了从AGI到ASI的四条潜在路径:扩展AGI、AI范式转变、递归改进以及从大规模多智能体集体中涌现ASI。随后,报告讨论了这些路径上可能的摩擦和瓶颈。确定这些摩擦的影响是微不足道还是重大,提出了若干具体的开放研究问题。由于预测ASI进展存在巨大不确定性,不能排除AI进展在未来几年继续加速的可能性。这可能意味着由人类级AGI引入社会所导致的单一变革性步骤的形象可能不准确。更恰当的前景可能是由AI在科学和技术的多个领域引发的进步和突破所导致的一系列变革性社会变化。为这一前景做准备需要全球范围内的大规模跨学科努力。

英文摘要

Over the last decade, building human-level artificial general intelligence has moved from far-fetched speculation to being a concrete next-decade target for many of the largest AI organisations. Achieving this goal would have profound and far-reaching impacts on human society, which raises many complex questions for the decade ahead. This report investigates how AI itself might continue to develop in a post-AGI world along the continuum of machine intelligence. The endpoint of this continuum, Universal AI, is theoretically well understood, which provides some formal grounding for the main focus of this report: the transition from human-level AGI to artificial general superintelligence, which, intuitively, can be understood as a system that is more intelligent and cognitively capable than large organisations of humans. After characterizing ASI, the report discusses four potential pathways from AGI to ASI: scaling AGI, AI paradigm shifts, recursive improvement, and ASI emerging from large-scale multi-agent collectives. The report then discusses possible frictions and bottlenecks along these pathways. Determining whether the impact of these frictions will be negligible or substantial raises a number of concrete open research questions. Due to large uncertainties for predicting ASI progress, it cannot be ruled out that AI progress might continue to accelerate over the next years. This could imply that the image of a single transformative step change, caused by the introduction of human-level AGI into our society, could be inaccurate. More apt might be the prospect of a series of transformative societal changes caused by AI-enabled progress and breakthroughs across many areas of science and technology. Preparing for this prospect requires a massively interdisciplinary endeavour of global scope and interest.

2606.12709 2026-06-12 cs.MA cs.CR cs.LG 交叉投稿

Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows

更聪明的破坏者,更好的修复者:线性多智能体工作流中的规模与安全性

Timothy McAllister, Sina Abdidizaji, Ivan Garibay, Ozlem Ozmen Garibay

AI总结 研究模型规模对线性多智能体工作流安全性的影响,发现大模型更易执行恶意指令,但轻量级修复阶段可恢复性能,表明线性结构在适当校正下具有鲁棒性。

Comments 16 pages (4 are main text), 2 figures, 6 tables. Accepted to the AIWILD Workshop at ICML 2026

详情
AI中文摘要

随着基于LLM的多智能体系统(MAS)在现实环境中部署,其协作结构抵御对抗性攻击的韧性成为一个关键的安全问题。攻击者可能利用提示注入或越狱来破坏MAS工作流中的单个智能体,但模型缩放与系统级韧性之间的相互作用仍知之甚少。本文研究了模型规模如何影响线性多智能体工作流的安全性。我们在HumanEval基准上对两个开放权重模型系列在不同规模下的实验揭示了一种合规-校正对称性:较大的模型远更可能忠实地执行恶意指令,在未校正的流水线中,27B参数模型从控制条件到恶意条件的性能下降达到53.7个百分点。然而,附加一个轻量级的终端修复(Fixer)阶段可将此下降缩小到0.6个百分点,并恢复与控制级性能的统计对等性,表明严格线性协作结构在此规模下可以是可行的并能抵御攻击者,并暗示先前归因于线性拓扑的脆弱性可能源于缺乏校正。

英文摘要

As LLM-based multi-agent systems (MAS) are deployed in the wild, the resilience of their collaboration structures against adversarial compromise becomes a critical safety concern. Attackers may leverage prompt-injection or jailbreaking to sabotage individual agents within MAS workflows, but the interaction between model scaling and system-level resilience remains poorly understood. This paper investigates how model scale affects the security of linear multi-agent workflows. Our experiments across scales of two open-weight model families on the HumanEval benchmark reveal a compliance-correction symmetry: larger models are far more likely to faithfully execute malicious instructions, with the control-to-malicious performance drop reaching 53.7pp at 27B in uncorrected pipelines. However, appending a lightweight terminal Fixer stage collapses this to 0.6pp and restores statistical parity with control-level performance, demonstrating that strictly linear collaboration structures can be viable and resilient to adversaries at this scale, and suggesting that the brittleness previously attributed to linear topology may stem from a lack of correction.

2606.13422 2026-06-12 quant-ph cs.LG physics.flu-dyn 交叉投稿

Foundations of Practical Quantum Advantage in Quantum-Informed Machine Learning for Predicting Chaos

量子信息机器学习预测混沌的实用量子优势基础

Maida Wang, Xiao Xue, Minh Chung, Peter V. Coveney

发表机构 * Centre for Computational Science, University College London(伦敦大学学院计算科学中心) Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities(巴伐利亚科学与人文学院莱布尼茨超级计算中心) Centre for Advanced Research Computing, University College London(伦敦大学学院先进研究计算中心)

AI总结 提出基于高阶量子统计先验的量子优势机制,通过两阶段优势(表示与提取)证明副本测量复杂度上的量子-经典分离,并在湍流和天气预报中验证。

详情
AI中文摘要

我们为混沌动力系统的量子信息机器学习中的实用量子优势机制建立了理论基础。一族由k索引的高阶量子统计先验(Q-Priors)在n_q = kq个量子比特上承载不变测度的k点边际,扩展了先前工作的单站点构造。我们证明了一个两阶段优势。在表示阶段,叠加和纠缠紧凑地存储了n_q个量子比特上不变测度的不可分解空间相关性。在提取阶段,对两个副本进行联合贝尔测量,以独立于n_q的副本对数量估计任何事后泡利泛函,而相应的全泡利读出的任何自适应单副本协议需要Ω(2^(n_q))个副本;这是副本测量复杂度上可证明的量子-经典分离。双副本读出在模拟和IQM超导处理器上实现。两个案例研究将这一机制实例化到具有独立科学价值的工作流程中:一个湍流通道流研究,其中双副本读出产生了不变测度的一个命名的非对角关联子(速度方向相干性),以及一个基于欧洲中期天气预报中心ERA5再分析的中期天气预报工作流程,其中对角k ≤ 2 Q-Prior引导Koopman滚动预报,在48-240小时预报时效内将距平相关系数技巧提高10-39%,并减少了滚动预报到静态平均场的长期崩溃。我们的实用优势定义的两个条件在互补层面上得到满足,为在容错硬件之前实现实用量子优势确定了一条候选路径。

英文摘要

We develop theoretical foundations for a practical quantum-advantage mechanism in quantum-informed machine learning for chaotic dynamical systems. A family of k-indexed higher-order quantum statistical priors (Q-Priors) hosts the k-point marginal of the invariant measure on n_q = kq qubits, extending the single-site construction of prior work. We prove a two-stage advantage. In the representation stage, superposition and entanglement compactly store non-factorisable spatial correlations of the invariant measure on n_q qubits. In the extraction stage, joint Bell measurements on two copies estimate any post hoc Pauli functional with a copy-pair count independent of n_q, whereas any adaptive single-copy protocol for the corresponding full-Pauli read-out requires Omega(2^(n_q)) copies; this is a provable quantum-classical separation in copy-measurement complexity. The two-copy read-out is realised in simulation and on IQM superconducting processors. Two case studies instantiate the mechanism in workflows of independent scientific value: a turbulent channel-flow study in which the two-copy read-out yields a named non-diagonal correlator of the invariant measure (the velocity-direction coherence), and a medium-range weather forecasting workflow on the European Centre for Medium-Range Weather Forecasts ERA5 reanalysis in which the diagonal k <= 2 Q-Prior steers a Koopman rollout, improves anomaly-correlation skill by 10-39% across 48-240 h lead times, and reduces the long-horizon collapse of rollouts onto a static mean field. The two conditions of our practical-advantage definition are met at complementary levels, identifying a candidate route to practical quantum advantage before fault-tolerant hardware.
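The forecast-skill figure quoted above is an anomaly correlation. As a minimal sketch, assuming one common uncentered form of the anomaly correlation coefficient (the abstract does not state which variant or weighting the authors use), the metric can be computed as follows; the function name and the sample values are hypothetical.

```python
import math

def anomaly_correlation(forecast, observed, climatology):
    """Uncentered anomaly correlation between forecast and observed fields.

    Anomalies are taken relative to a climatological reference; this is an
    assumed textbook form, not necessarily the variant used in the paper.
    """
    fa = [f - c for f, c in zip(forecast, climatology)]  # forecast anomalies
    oa = [o - c for o, c in zip(observed, climatology)]  # observed anomalies
    num = sum(x * y for x, y in zip(fa, oa))
    den = math.sqrt(sum(x * x for x in fa) * sum(y * y for y in oa))
    return num / den

# Hypothetical values for illustration only, not data from the paper.
clim = [10.0, 10.0, 10.0]
obs = [12.0, 9.0, 11.0]
good_forecast = [11.0, 9.5, 10.5]  # same anomaly pattern at half amplitude
score = anomaly_correlation(good_forecast, obs, clim)
```

Because the score depends only on the pattern of anomalies, a forecast with the right pattern but damped amplitude still scores close to 1, which is why the metric is usually reported alongside an error measure.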

2606.13454 2026-06-12 physics.optics cond-mat.dis-nn cs.ET cs.LG 交叉投稿

Optical Implementation of Equilibrium Propagation Using Spatial Photonic Ising Machines

利用空间光子伊辛机实现平衡传播的光学实现

Dimitri Vanden Abeele, Daniele Veraldi, Davide Pierangeli, Claudio Conti, Serge Massar

发表机构 * Laboratoire d’Information Quantique, Université Libre de Bruxelles (ULB)(量子信息实验室,布鲁塞尔自由大学) Dipartimento di Fisica, Sapienza Università di Roma(物理学系,萨皮恩扎罗马大学)

AI总结 提出利用空间光子伊辛机光学实现平衡传播,通过规范变换方法编码神经元状态和可训练模式,在Wine和MNIST数据集上验证了能效物理实现的可行性。

详情
AI中文摘要

平衡传播为训练基于能量的网络提供了一种传统机器学习的引人注目的替代方案。在这里,我们展示了使用空间光子伊辛机(SPIM)的平衡传播(EP)的混合光学-数字实现。SPIM利用规范变换方法,通过空间光调制器将连续神经元状态和秩1二进制可训练模式光学编码为相位调制,并使用有限差分方案实现推理。实验系统在Wine分类数据集上进行了评估。该方法的潜力,包括使用连续耦合和结构化耦合矩阵,在更复杂的MNIST数据集上通过数值评估。我们的工作为平衡传播的节能物理实现提供了一条具体路径。

英文摘要

Equilibrium Propagation offers a compelling alternative to traditional machine learning for training energy-based networks. Here we demonstrate a hybrid optical-digital implementation of EP using a Spatial Photonic Ising Machine (SPIM). The SPIM exploits the gauge transformation method to optically encode both continuous neuron states and rank-1 binary trainable patterns as phase modulations via a spatial light modulator, with inference realized using a finite difference scheme. The experimental system is evaluated on the Wine classification dataset. The potential of this approach, including the use of continuous couplings and structured coupling matrices, is evaluated numerically on the more complex MNIST dataset. Our work provides a concrete pathway toward energy-efficient physical implementations of Equilibrium Propagation.

2505.20076 2026-06-12 cs.LG 版本更新

ExPLAIND: Unifying Model, Data, and Training Attribution to Study Model Behavior

ExPLAIND:统一模型、数据和训练归因以研究模型行为

Florian Eichin, Yupei Du, Philipp Mondorf, Maria Matveev, Barbara Plank, Michael A. Hedderich

发表机构 * University of Michigan(密歇根大学)

AI总结 提出ExPLAIND框架,统一归因于模型组件、数据和训练轨迹,支持跨粒度解释,通过梯度路径核和AdamW核机器推导参数级和步骤级影响分数,验证了Transformer的Grokking和EuroLLM预训练中的两阶段动态。

Comments published at ICML 2026, code at https://github.com/mainlp/explaind

详情
AI中文摘要

事后可解释性方法通常将模型行为归因于其组件、数据或训练轨迹中的某一个,并且往往局限于局部到全局谱中的特定粒度。这导致解释缺乏统一视角,可能遗漏关键交互。我们提出了ExPLAIND,一个理论扎实的统一框架,它整合了模型组件、数据和训练轨迹,同时支持跨粒度的解释。我们推广了最近关于梯度路径核的工作,将AdamW训练的模型重新表述为核机器。从得到的核特征图中,我们推导出新的参数级和步骤级影响分数。我们在多种设置下实证验证了模型行为的分解结果,并将ExPLAIND应用于两个案例研究。我们对一个表现出Grokking现象的Transformer的发现支持了先前提出的学习阶段,同时将最后阶段细化为外层在记忆后围绕一个表示管道对齐的阶段。对于EuroLLM预训练,ExPLAIND揭示了一个两阶段动态:第一阶段以外层MLP学习为特征,第二阶段以中间注意力层的相对影响增加为特征。这些结果确立了ExPLAIND作为解释模型行为和训练动态的统一框架。

英文摘要

Post-hoc interpretability methods typically attribute a model's behavior to its components, data, or training trajectory in isolation, and are often tied to a particular level of granularity along the local-to-global spectrum. This leads to explanations that lack a unified view and may miss key interactions. We present ExPLAIND, a theoretically grounded, unified framework that integrates model components, data, and training trajectory while supporting explanations across granularities. We generalize recent work on gradient path kernels, reformulating models trained by AdamW as kernel machines. From the resulting kernel feature maps, we derive novel parameter-wise and step-wise influence scores. We empirically validate the resulting decomposition of model behavior in several settings and apply ExPLAIND to two case studies. Our findings on a Transformer exhibiting Grokking support previously proposed learning phases, while refining the final phase as one in which outer layers align around a representation pipeline learned after memorization. For EuroLLM pretraining, ExPLAIND reveals a two-phase dynamic, with the first characterized by outer-layer MLP learning and the second by increased relative influence of intermediate attention layers. These results establish ExPLAIND as a unified framework for interpreting model behavior and training dynamics.

2508.04427 2026-06-12 cs.LG cs.AI 版本更新

Decoding the Multimodal Maze: A Systematic Review on the Adoption of Explainability in Multimodal Attention-based Models

解码多模态迷宫:多模态注意力模型中可解释性采纳的系统综述

Md Raisul Kibria, Sébastien Lafond, Janan Arslan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文系统综述了2020年至2024年初多模态模型可解释性研究,发现多数工作集中于视觉-语言和纯语言模型,注意力机制是主要解释方法,但评估缺乏系统性和鲁棒性,并提出了改进建议。

详情
AI中文摘要

近年来,多模态学习取得了显著进展,特别是随着注意力模型的整合,在各种任务中带来了显著的性能提升。与此同时,对可解释人工智能(XAI)的需求推动了越来越多的研究,旨在解释这些模型的复杂决策过程。本系统文献综述分析了2020年1月至2024年初期间发表的、关注多模态模型可解释性的研究。在XAI更广泛目标的框架内,我们从多个维度审视文献,包括模型架构、涉及模态、解释算法和评估方法。我们的分析显示,大多数研究集中在视觉-语言和纯语言模型上,注意力机制是最常用的解释方法。然而,这些方法往往无法捕捉模态间交互的全谱系,这一问题因领域间的架构异质性而进一步加剧。重要的是,我们发现多模态环境中XAI的评估方法大多是非系统性的,缺乏一致性、鲁棒性,并且未考虑模态特定的认知和上下文因素。为解决这些不足,我们不仅综合了所调查研究的发现,还纳入了补充分析,整合了推动多模态可解释性的近期和新兴进展。基于这些见解,我们提出了一套全面的建议,旨在促进多模态XAI研究中严谨、透明和标准化的评估与报告实践。我们的目标是支持未来构建更可解释、可问责和负责任的多模态AI系统,并以可解释性为核心。

英文摘要

Multimodal learning has witnessed remarkable advancements in recent years, particularly with the integration of attention-based models, leading to significant performance gains across a variety of tasks. Parallel to this progress, the demand for explainable artificial intelligence (XAI) has spurred a growing body of research aimed at interpreting the complex decision-making processes of these models. This systematic literature review analyzes research published between January 2020 and early 2024 that focuses on the explainability of multimodal models. Framed within the broader goals of XAI, we examine the literature across multiple dimensions, including model architecture, modalities involved, explanation algorithms and evaluation methodologies. Our analysis reveals that most studies are concentrated on vision-language and language-only models, with attention-based techniques being the most commonly employed for explanation. However, these methods often fall short in capturing the full spectrum of interactions between modalities, a challenge further compounded by the architectural heterogeneity across domains. Importantly, we find that evaluation methods for XAI in multimodal settings are largely non-systematic, lacking consistency, robustness, and consideration for modality-specific cognitive and contextual factors. To address these gaps, we not only synthesize findings from the surveyed works but also incorporate a complementary analysis that integrates recent and emerging advances driving multimodal explainability. Based on these insights, we provide a comprehensive set of recommendations aimed at promoting rigorous, transparent, and standardized evaluation and reporting practices in multimodal XAI research. Our goal is to support future research in more interpretable, accountable, and responsible multimodal AI systems, with explainability at their core.

2508.14143 2026-06-12 cs.LG q-bio.NC 版本更新

The Urysohn Machine: A Metric-Topological Model of Computation

Urysohn机器:一种度量-拓扑计算模型

Xin Li

发表机构 * University at Albany, State University of New York(纽约州立大学阿尔巴尼分校)

AI总结 提出Urysohn机器,一种基于度量分离、前沿结构和收缩的分类计算模型,通过Urysohn三元组和分层构造实现分类复杂度度量与可重用推理。

详情
AI中文摘要

我们引入Urysohn机器,一种面向分类计算的有效模型,其中度量分离、前沿结构和收缩是计算状态的显式部分。其基本对象是Urysohn三元组:一个支撑区域、一个目标划分以及一个存储在可重用度量库中的分离分类器。拓扑基础是有限单纯形设置下的构造性Urysohn实现定理。它通过嵌套多面体区域的二进阶梯构建分离器,并为其前沿配备链级微积分:前沿是循环,层级之间的壳层边界由前沿之差给出。该构造产生两种相关的复杂度度量:决策边界宽度(单个分类器边界的几何度量)和Urysohn宽度(库或实现所表示的总前沿质量)。我们证明了摊销分离定理,该定理表明在显式边界足迹假设下,逼近宽度为的边界达到精度所需的简单基三元组数量与边界宽度成正比,与分辨率成反比。我们还引入了一种对比分离算子,其图割泛函能从采样度量数据中一致地估计决策边界宽度,而其拉普拉斯谱则能证明类组件结构和电导率。最后,我们分析了动态Urysohn阶梯,并证明了四个保证:商塌缩下的可分离性、已提交前沿的稳定性、收缩下的有界容量以及商距离下的可扩展性。这些结果共同给出了分类复杂度、摊销推理和组合重用的度量-拓扑解释,在保留经典可计算性的同时,揭示了纯符号描述所隐藏的几何结构。

英文摘要

We introduce the Urysohn Machine, an effective model of classification-oriented computation in which metric separation, frontier structure, and contraction are explicit parts of the computational state. Its basic object is a Urysohn Triple: a support region, a target partition, and a separating classifier stored in a reusable Metric Library. The topological foundation is a constructive Urysohn Realization theorem for finite simplicial settings. It builds separators from dyadic ladders of nested polyhedral regions and equips their frontiers with a chain-level calculus: frontiers are cycles, and shells between levels have boundaries given by differences of frontiers. This construction yields two related complexity measures: decision-boundary width, the geometric measure of a single classifier's boundary, and Urysohn width, the total frontier mass represented by a library or realization. We prove an Amortized Separation Theorem showing that approximating a boundary of width to accuracy requires a number of simple basis triples proportional to boundary width and inversely proportional to resolution, under explicit boundary-footprint assumptions. We also introduce a contrastive separation operator whose graph-cut functional consistently estimates decision-boundary width from sampled metric data, while its Laplacian spectrum certifies class-component structure and conductance. Finally, we analyze the dynamic Urysohn ladder and prove four guarantees: separability under quotient collapse, stability of committed frontiers, bounded capacity under contraction, and scalability with quotient distance. Together, these results give a metric-topological account of classification complexity, amortized inference, and compositional reuse that preserves classical computability while exposing geometric structure hidden by purely symbolic descriptions.
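For readers unfamiliar with metric separation, the sketch below shows the textbook Urysohn function for two disjoint finite point sets in a metric space, f(x) = d(x, A) / (d(x, A) + d(x, B)). This is the classical lemma, swapped in for illustration; it is not the paper's dyadic-ladder construction, and the function names and sample points are hypothetical.

```python
import math

def dist_to_set(x, points):
    """Distance from x to the nearest point of a finite set (Euclidean metric)."""
    return min(math.dist(x, p) for p in points)

def urysohn(x, a_points, b_points):
    """Classical separating function: 0 on A, 1 on B, strictly between elsewhere.

    Assumes A and B are disjoint, so the denominator is never zero.
    """
    da = dist_to_set(x, a_points)
    db = dist_to_set(x, b_points)
    return da / (da + db)

def classify(x, a_points, b_points):
    """Threshold the separating function at 1/2 (ties go to B)."""
    return "A" if urysohn(x, a_points, b_points) < 0.5 else "B"

# Hypothetical class samples in the plane.
A = [(0.0, 0.0), (0.0, 1.0)]
B = [(4.0, 0.0), (4.0, 1.0)]
label = classify((1.0, 0.5), A, B)
```

The level set where the function equals 1/2 plays the role of a decision boundary, which gives some intuition for why boundary geometry can serve as a complexity measure.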

2512.15134 2026-06-12 cs.LG cs.AI cs.CL 版本更新

From Isolation to Entanglement: When Do Interpretability Methods Identify and Disentangle Known Concepts?

从孤立到纠缠:可解释性方法何时识别和解缠已知概念?

Aaron Mueller, Andrew Lee, Shruti Joshi, Ekdeep Singh Lubana, Dhanya Sridhar, Patrik Reizinger

发表机构 * Boston University(波士顿大学) Harvard University(哈佛大学) Mila – Quebec AI Institute(魁北克AI研究所) Goodfire(Goodfire公司)

AI总结 本文提出多概念评估框架,研究稀疏自编码器和探针等方法是否真正解缠概念,发现特征通常只对单一概念敏感,但概念分布在多个特征上,且干预特征常影响多个概念,表明相关性指标不足以证明干预选择性。

Comments ACL 2026

详情
AI中文摘要

可解释性的一个目标是从神经网络的激活中恢复潜在概念(特征)的解缠表示。特征的质量通常孤立地评估,并在可能不成立的隐式独立性假设下进行。因此,尚不清楚常见的特征化方法(如稀疏自编码器(SAE)和探针)在多大程度上将一个概念与另一个概念解缠。我们提出了一个多概念评估设置,使用包括情感、领域、语态和时态在内的概念。我们评估特征化器产生每个概念的解缠表示的效果,观察到特征通常只对单一概念敏感,但概念分布在许多特征上。然后,我们干预这些特征,测量每个概念是否可独立操控,以及特征是否相互作用。即使在理想化设置中,干预一个特征通常会影响多个概念,尽管几乎没有交互效应。这些结果表明,相关性指标不足以建立干预选择性,并且证明两个特征在分离空间中运行不足以声称它们将对一个概念具有选择性。这些结果强调了可解释性研究中多概念评估的重要性。

英文摘要

A goal of interpretability is to recover disentangled representations of latent concepts (features) from the activations of neural networks. The quality of features is typically evaluated in isolation, and under implicit independence assumptions that may not hold in practice. Thus, it is unclear to what extent common featurization methods such as sparse autoencoders (SAEs) and probes disentangle one concept from another. We propose a multi-concept evaluation setting using concepts including sentiment, domain, voice, and tense. We evaluate how well featurizers produce disentangled representations of each concept, observing that features are typically sensitive to only one concept, but also that concepts are distributed across many features. Then, we steer these features, measuring whether each concept is independently manipulable, and whether features interact. Even in idealized settings, steering a feature often affects many concepts, despite a near absence of interaction effects. These results suggest that correlational metrics are insufficient to establish steering selectivity, and that demonstrating that two features operate in separate spaces is insufficient to claim that they will be selective for one concept. These results underscore the importance of multi-concept evaluations in interpretability research.

2605.16430 2026-06-12 cs.LG cs.AI 版本更新

A Theory of Training Profit-Optimal LLMs

训练利润最优大语言模型的理论

Sophie Hao, William Merrill

发表机构 * Boston University(波士顿大学) Allen Institute for AI(艾伦人工智能研究所)

AI总结 本文提出一个经济模型,结合扩展定律与微观经济学理论,分析大语言模型训练的利润最大化问题,探讨模型规模与训练成本的关系及对利润的影响。

Comments Minor edits for preprint

详情
AI中文摘要

扩展大语言模型(LLM)需要巨大的计算资源,近年来人工智能的进步与大量资本支出相伴而生。尽管扩大LLM规模确实能提高模型质量(以损失或下游评估量化),但其质量提升如何转化为潜在收入,以及收入是否能抵消更大规模训练和推理的成本仍不清楚。本文发展了一个经济模型,结合扩展定律与微观经济学理论,以描述LLM训练公司的理性行为。在我们的模型中,增加参数和训练令牌可提高LLM质量,从而吸引更多消费者,每个消费者都有一个质量阈值。另一方面,额外的参数和训练令牌都会带来额外成本。我们分析了该模型在计算受限和数据受限环境下的利润最大化问题。在计算受限环境下,最优模型规模和令牌预算与硬件效率$E$(FLOPs/\$)近似线性增长;总训练成本则随$E$以亚二次方式增长。数据效率的提升激励更大规模的模型和训练支出。当数据受限于$D$时,利润最优的训练支出按$D^2/E$缩放,即随数据增加而增加,随硬件效率(以及数据效率)提高而减少。最后,我们分析了训练支出的实际趋势:当前趋势与计算受限环境下的最宽松模型变体一致,但在数据受限环境或假设硬件进步停滞时并非利润最优。总体而言,我们的结果提供了利润最优LLM训练的理论,为批判性地看待行业声明和支持长期经济决策提供了基础。

英文摘要

Scaling LLMs requires tremendous computational resources, and recent advances in AI have gone hand in hand with massive amounts of capital expenditure. While it is established that scaling up LLMs reliably increases model quality (quantified in terms of loss or downstream evaluations), it is unclear how these quality improvements translate to potential revenue, and whether revenue increases would offset costs of larger-scale training and inference. In this work, we develop an economic model for characterizing the rational behavior of an LLM training firm by combining scaling laws with microeconomic theory. Under our model of firm behavior, LLM quality can be increased with more parameters and training tokens, leading to more potential adoption by consumers, who each have a quality threshold for using the LLM. On the other hand, additional parameters and training tokens both incur additional costs. We analyze the profit maximization problem for this model under compute-bound and data-bound regimes. In the compute-bound regime, optimal model size and token budget track hardware efficiency $E$ (FLOPs/\$) at a near-linear rate; total training cost then scales sub-quadratically in $E$. Data efficiency improvements incentivize larger models and training expenditure. When we are limited to $D$ data, profit-optimal training expenditure scales as $D^2/E$, i.e., it increases with data and decreases with hardware efficiency (as well as data efficiency). Finally, we analyze practical trends in training expenditure: current trends are consistent with our most permissive model variants in the compute-bound regime, but are not profit-optimal in the data-bound regime or assuming hardware advances will stall. Overall, our results provide a theory of profit-optimal LLM training, providing a foundation for engaging critically with industry statements and supporting long-term economic decision making.
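To make the data-bound scaling concrete, here is a short worked example. It assumes the stated relation holds as an exact proportionality with a constant $c$; both the symbol $C^{*}$ for profit-optimal training expenditure and the constant $c$ are notation introduced here for illustration, not taken from the paper.

```latex
\[
C^{*}(D, E) = c\,\frac{D^{2}}{E}
\quad\Longrightarrow\quad
\frac{C^{*}(2D, E)}{C^{*}(D, E)} = 4,
\qquad
\frac{C^{*}(D, 2E)}{C^{*}(D, E)} = \frac{1}{2}.
\]
```

Under that assumption, doubling the available data quadruples the profit-optimal spend, while doubling hardware efficiency halves it.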

2507.10599 2026-06-12 cs.CL cs.AI cs.LG 版本更新

Emergence of Hierarchical Emotion Organization in Large Language Models

大型语言模型中层级情感组织的涌现

Maya Okawa, Bo Zhao, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, Hidenori Tanaka

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Washington(华盛顿大学) University of Tokyo(东京大学)

AI总结 受情感轮理论启发,分析大型语言模型输出中情感状态间的概率依赖关系,发现模型自然形成与人类心理模型一致的层级情感树,且更大模型发展出更复杂的层级结构,同时揭示社会经济角色在情感识别中的系统性偏差。

Comments ICML 2026

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地驱动对话代理,理解它们如何建模用户的情绪状态对于伦理部署至关重要。受情感轮(即一种认为情感层级组织的心理学框架)的启发,我们分析了模型输出中情感状态之间的概率依赖关系。我们发现LLMs自然形成与人类心理模型一致的层级情感树,且更大的模型发展出更复杂的层级结构。我们还揭示了跨社会经济角色的情感识别中存在系统性偏差,对于交叉、代表性不足的群体,错误分类会叠加。人类研究显示出惊人的相似性,表明LLMs内化了社会感知的某些方面。除了突出LLMs中的涌现情感推理能力,我们的结果还暗示了利用认知基础理论开发更好模型评估的潜力。

英文摘要

As large language models (LLMs) increasingly power conversational agents, understanding how they model users' emotional states is critical for ethical deployment. Inspired by emotion wheels, i.e., a psychological framework that argues emotions organize hierarchically, we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.

2510.02524 2026-06-12 cs.CL cs.FL cs.LG 版本更新

Unraveling Syntax: Language Modeling and the Substructure of Grammars

解析句法:语言建模与语法的子结构

Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文研究语言模型在上下文无关语法子结构上的学习行为,证明损失函数在顶层子语法上线性递归,并发现参数化模型并行学习子语法,子语法预训练能提升小模型性能并改善内部表征。

Comments Equal contribution by LYS and DM. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

尽管语言模型取得了令人印象深刻的结果,但其学习动态远未被理解。许多感兴趣的领域——如自然语言句法、编程语言、算术——都由上下文无关语法(CFG)捕获。在这项工作中,我们将先前关于CFG神经语言建模的工作扩展到一个新的方向:语言建模如何相对于CFG子结构(即子语法)表现。我们定义了子语法,并证明了一组连接语言建模和子语法的基本定理。我们表明,语言建模损失在其顶层子语法上线性递归;递归应用,损失分解为“不可约”子语法的损失。在额外假设下,并且经验上,参数化模型并行学习子语法,不同于首先掌握简单子结构的儿童。我们发现,子语法预训练可以提高最终性能,但仅对于相对于语法而言微小的模型,而对齐分析表明,预训练一致地导致内部表征更好地反映语法的子结构。

英文摘要

While language models achieve impressive results, their learning dynamics are far from understood. Many domains of interest -- such as natural language syntax, coding languages, arithmetic -- are captured by context-free grammars (CFGs). In this work, we extend prior work on neural language modeling of CFGs in a novel direction: how language modeling behaves with respect to CFG substructure, namely subgrammars. We define subgrammars, and prove a set of fundamental theorems connecting language modeling and subgrammars. We show that language modeling loss recurses linearly over its top-level subgrammars; applied recursively, the loss decomposes into losses for "irreducible" subgrammars. Under additional assumptions, and empirically, parametrized models learn subgrammars in parallel, unlike children who first master simple substructures. We find that subgrammar pretraining can improve final performance, but only for tiny models relative to the grammar, while alignment analyses show that pretraining consistently leads to internal representations that better reflect the grammar's substructure.

2604.24449 2026-06-12 cs.RO cs.AI cs.LG 版本更新

SPLIT: Separating Physical-Contact via Latent Arithmetic in Image-Based Tactile Sensors

SPLIT:通过潜在算术分离物理接触以实现基于图像的触觉传感器

Wadhah Zai El Amri, Nicolás Navarro-Guerrero

发表机构 * Leibniz Universität Hannover, L3S Research Center(莱布尼茨汉诺威大学,L3S研究所)

AI总结 本文提出SPLIT方法,通过潜在空间算术分离接触几何与传感器光学特性,实现触觉传感器的高效模拟,支持多传感器迁移和双向模拟,提升机器人触觉感知研究效率。

Comments Accepted to Elsevier Robotics and Autonomous Systems Journal

详情
AI中文摘要

训练机器人触觉感知的机器学习模型需要大量数据,但获取真实交互数据因物理复杂性和变异性而具有挑战性。模拟触觉传感器是加速进展的关键步骤。本文提出了SPLIT,一种新的基于图像的触觉传感器模拟方法,重点在于DIGIT传感器。我们的方法核心是一种潜在空间算术策略,明确分离接触几何与传感器特定的光学属性。与需要重新校准的现有方法不同,这种分离使SPLIT能够适应多样化的DIGIT背景,甚至在不完全重训练的情况下将数据转移到不同的传感器如GelSight R1.5。此外,我们的方法在推理速度上优于现有替代方案。我们还提供了一种校准的有限元方法(FEM)软体网格模拟,具有可变分辨率,提供速度与保真度之间的可调权衡。此外,我们的算法支持双向模拟,允许从变形网格生成逼真图像以及从触觉图像重建网格。这种多功能性使SPLIT成为加速机器人触觉感知研究进展的重要工具。

英文摘要

Training machine learning models for robotic tactile sensing requires vast amounts of data, yet obtaining realistic interaction data remains a challenge due to physical complexity and variability. Simulating tactile sensors is thus a crucial step in accelerating progress. This paper presents SPLIT, a novel method for simulating image-based tactile sensors, with a primary focus on the DIGIT sensor. Central to our approach is a latent space arithmetic strategy that explicitly disentangles contact geometry from sensor-specific optical properties. Unlike methods that require recalibration for every new unit, this disentanglement allows SPLIT to adapt to diverse DIGIT backgrounds and even transfer data to distinct sensors like the GelSight R1.5 without full model retraining. Beyond this adaptability, our approach achieves faster inference speeds than existing alternatives. Furthermore, we provide a calibrated finite element method (FEM) soft-body mesh simulation with variable resolution, offering a tunable trade-off between speed and fidelity. Additionally, our algorithm supports bidirectional simulation, allowing for both the generation of realistic images from deformation meshes and the reconstruction of meshes from tactile images. This versatility makes SPLIT a valuable tool for accelerating progress in robotic tactile sensing research.

2604.08581 2026-06-12 cs.LG 版本更新

Fully Autonomous Z-Score-Based TinyML Anomaly Detection on Resource-Constrained MCUs Using Power Side-Channel Data

在资源受限MCU上基于功率侧信道数据的全自主Z分数TinyML异常检测

Abdulrahman Albaiz, Fathi Amsaad

AI总结 本文提出一种在低功耗微控制器上实现的全自主TinyML Z分数异常检测系统,利用功率侧信道数据实时监控设备行为,无需外部计算或连接,实现高效嵌入式部署。

Comments SaTC 2026 Conference

详情
Journal ref
Proc. IEEE 2nd International Conference on Secure IoT, Assured and Trusted Computing (SATC), Houston, TX, USA, 2026, pp. 1-6
AI中文摘要

本文提出了一种部署在低功耗微控制器上的全自主TinyML Z分数异常检测系统,用于通过功率侧信道数据实时监控电器行为。与依赖离线训练或云端辅助分析的现有物联网(IoT)异常检测方法不同,该系统在资源受限的微控制器上直接进行模型训练和推断,无需外部计算或连接。系统持续采样电流消耗,在设备上计算均方根(RMS)值,并在初始训练阶段推导统计参数。利用轻量级Z分数阈值检测异常,实现可解释且计算高效的推断,适用于嵌入式部署。该架构在基于STM32的平台上实现,并使用从家庭小型冰箱在正常运行和受控异常条件下收集的14天数据集进行评估。结果表明,检测性能完美,精度和召回率均为1.00,推断延迟在数十微秒量级,总内存占用约为3.3 KB SRAM和63 KB Flash。这些结果证实,可以在低成本微控制器上实现稳健且完全自主的TinyML异常检测。未来的工作包括扩展框架以纳入额外轻量级模型和多设备学习场景。

英文摘要

This paper presents a fully autonomous Tiny Machine Learning (TinyML) Z-Score-based anomaly detection system deployed on a low-power microcontroller for real-time monitoring of appliance behavior using power side-channel data. Unlike existing Internet of Things (IoT) anomaly detection approaches that rely on offline training or cloud-assisted analytics, the proposed system performs both model training and inference directly on a resource-constrained microcontroller without external computation or connectivity. The system continuously samples current consumption, computes Root Mean Square (RMS) values on-device, and derives statistical parameters during an initial training phase. Anomalies are detected using lightweight Z-Score thresholds, enabling interpretable and computationally efficient inference suitable for embedded deployment. The architecture was implemented on an STM32-based platform and evaluated using a 14-day dataset collected from a household mini-fridge under normal operation and controlled anomaly conditions. Results demonstrate perfect detection performance, with Precision and Recall of 1.00, inference latencies on the order of tens of microseconds, and a total memory footprint of approximately 3.3 KB SRAM and 63 KB Flash. These results confirm that robust and fully autonomous TinyML anomaly detection can be achieved on low-cost microcontrollers. Future work includes extending the framework to incorporate additional lightweight models and multi-device learning scenarios.
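The pipeline described above (RMS of sampled current, statistics from an initial training phase, Z-score thresholding) can be sketched in a few lines. This is a host-side illustration in Python rather than the authors' STM32 firmware; the threshold of 3.0, the window contents, and all function names are assumptions.

```python
import math
import statistics

def rms(window):
    """Root mean square of one window of current samples."""
    return math.sqrt(sum(x * x for x in window) / len(window))

def fit(training_rms):
    """Training phase: mean and population standard deviation of normal RMS values."""
    mu = statistics.fmean(training_rms)
    sigma = statistics.pstdev(training_rms)
    return mu, sigma

def is_anomaly(value, mu, sigma, threshold=3.0):
    """Flag a reading whose absolute Z-score exceeds the (assumed) threshold."""
    if sigma == 0.0:
        return value != mu  # degenerate training data: any deviation is anomalous
    return abs(value - mu) / sigma > threshold

# Hypothetical current windows in amperes, not measurements from the paper.
normal_windows = [
    [0.50, -0.50, 0.50, -0.50],
    [0.52, -0.52, 0.52, -0.52],
    [0.48, -0.48, 0.48, -0.48],
]
mu, sigma = fit([rms(w) for w in normal_windows])
flagged = is_anomaly(rms([1.5, -1.5, 1.5, -1.5]), mu, sigma)
```

Only two floats (mean and standard deviation) need to persist after training, which is consistent with the small memory footprint the abstract reports for this class of detector.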

2603.27393 2026-06-12 cs.LG 版本更新

K-Means Based TinyML Anomaly Detection and Distributed Model Reuse via the Distributed Internet of Learning (DIoL)

基于K均值的TinyML异常检测与通过分布式学习互联网(DIoL)实现的分布式模型重用

Abdulrahman Albaiz, Fathi Amsaad

AI总结 本文提出了一种轻量级K均值异常检测模型和适用于资源受限微控制器的分布式模型共享流程。通过实际电源测量数据,在设备上进行特征提取、聚类和阈值估计以识别异常行为。DIoL框架允许在一台MCU上训练的模型导出为可移植的文本表示并在其他设备上直接重用,实验验证了该方法的可行性。

Comments SaTC 2026 Conference

详情
Journal ref
Proc. IEEE 2nd International Conference on Secure IoT, Assured and Trusted Computing (SATC), Houston, TX, USA, 2026, pp. 1-5
AI中文摘要

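本文提出了一种轻量级K均值异常检测模型,以及一种面向资源受限微控制器(MCU)的分布式模型共享流程。系统利用来自一台小型冰箱的真实功率测量数据,在设备上完成特征提取、聚类和阈值估计,以识别异常的电器行为。为避免在每台设备上重新训练模型,我们引入了分布式学习互联网(DIoL),使在一台MCU上训练的模型能够导出为可移植的文本表示,并在其他设备上直接重用。一个双设备原型通过真实电器案例研究展示了“一次训练,处处共享”(TOSE)方法的可行性:设备A训练模型,设备B无需重新训练即可执行推断。实验结果显示,异常检测行为一致,解析开销可忽略,且独立运行与基于DIoL的运行具有相同的推断运行时间。所提出的框架支持在嵌入式设备群中进行可扩展、低成本的TinyML部署。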

英文摘要

This paper presents a lightweight K-Means anomaly detection model and a distributed model-sharing workflow designed for resource-constrained microcontrollers (MCUs). Using real power measurements from a mini-fridge appliance, the system performs on-device feature extraction, clustering, and threshold estimation to identify abnormal appliance behavior. To avoid retraining models on every device, we introduce the Distributed Internet of Learning (DIoL), which enables a model trained on one MCU to be exported as a portable, text-based representation and reused directly on other devices. A two-device prototype demonstrates the feasibility of the "Train Once, Share Everywhere" (TOSE) approach using a real-world appliance case study, where Device A trains the model and Device B performs inference without retraining. Experimental results show consistent anomaly detection behavior, negligible parsing overhead, and identical inference runtimes between standalone and DIoL-based operation. The proposed framework enables scalable, low-cost TinyML deployment across fleets of embedded devices.
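The train-on-one-device, reuse-on-another workflow can be sketched as follows. The abstract does not specify the text format, feature set, or threshold rule, so the `key=value` layout, the scalar feature, the 1.5 margin, and every function name below are hypothetical choices made only to illustrate the idea.

```python
def kmeans_1d(values, k=2, iters=20):
    """Plain Lloyd's algorithm on scalar features, seeded from sorted extremes."""
    ordered = sorted(values)
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centroids[j]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return centroids

def distance(v, centroids):
    """Distance from a feature value to its nearest centroid."""
    return min(abs(v - c) for c in centroids)

def train(values, k=2, margin=1.5):
    """Device A: cluster, then set the threshold from the worst training distance."""
    centroids = kmeans_1d(values, k)
    threshold = margin * max(distance(v, centroids) for v in values)
    return centroids, threshold

def export_model(centroids, threshold):
    """Serialize the model as portable text (assumed format)."""
    lines = ["k=%d" % len(centroids)]
    lines += ["c=%r" % c for c in centroids]
    lines.append("t=%r" % threshold)
    return "\n".join(lines)

def import_model(text):
    """Device B: parse the text back into centroids and threshold, no retraining."""
    centroids, threshold = [], None
    for line in text.splitlines():
        key, _, value = line.partition("=")
        if key == "c":
            centroids.append(float(value))
        elif key == "t":
            threshold = float(value)
    return centroids, threshold

def is_anomaly(v, centroids, threshold):
    return distance(v, centroids) > threshold

# Hypothetical power features (compressor off / on), not data from the paper.
features = [0.10, 0.12, 0.11, 0.90, 0.92, 0.91]
model_a = train(features)
model_b = import_model(export_model(*model_a))
```

Writing floats with `%r` keeps the round trip lossless, so Device B reproduces Device A's decisions exactly; a fixed-precision format would trade that for a smaller payload.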

2603.26705 2026-06-12 q-bio.BM cs.AI cs.LG 版本更新

PI-Mamba: Linear-Time Protein Backbone Generation via Spectrally Initialized Flow Matching

PI-Mamba:通过谱初始化流匹配实现线性时间的蛋白质主链生成

Tianyu Wu, Lin Zhu

发表机构 * Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign(生物物理与定量生物学中心,伊利诺伊大学厄巴纳-香槟分校) School of Information Science, University of Illinois Urbana-Champaign(信息科学学院,伊利诺伊大学厄巴纳-香槟分校)

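AI summary PI-Mamba combines spectral initialization with a flow-matching framework to achieve linear-time inference while guaranteeing exact local covalent geometry, giving efficient, high-fidelity backbone generation.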

详情
Journal ref
Bioinformatics (2026)
AI中文摘要

动机:蛋白质主链设计的生成模型必须同时确保几何有效性、采样效率和长序列的可扩展性。然而,大多数现有方法依赖于迭代细化、二次注意力机制或事后几何修正,导致计算效率与结构保真度之间存在持续的权衡。结果:我们提出物理指导的Mamba(PI-Mamba),一种生成模型,通过构造确保精确的局部共价几何,同时实现线性时间推断。PI-Mamba将可微约束执行操作符整合到流匹配框架中,并与基于Mamba的状态空间架构耦合。为了提高优化稳定性和主链真实性,我们引入了源自Rouse聚合物模型的谱初始化和辅助的顺式脯氨酸意识头。在基准任务中,PI-Mamba实现了0.0%的局部几何违规率和高设计性(scTM = $0.91\pm 0.03$,n = 100),并且在单个A5000 GPU(24 GB)上可扩展到超过2,000个残基的蛋白质。

英文摘要

Motivation: Generative models for protein backbone design have to simultaneously ensure geometric validity, sampling efficiency, and scalability to long sequences. However, most existing approaches rely on iterative refinement, quadratic attention mechanisms, or post-hoc geometry correction, leading to a persistent trade-off between computational efficiency and structural fidelity. Results: We present Physics-Informed Mamba (PI-Mamba), a generative model that enforces exact local covalent geometry by construction while enabling linear-time inference. PI-Mamba integrates a differentiable constraint-enforcement operator into a flow-matching framework and couples it with a Mamba-based state-space architecture. To improve optimisation stability and backbone realism, we introduce a spectral initialization derived from the Rouse polymer model and an auxiliary cis-proline awareness head. Across benchmark tasks, PI-Mamba achieves 0.0% local geometry violations and high designability (scTM = $0.91\pm 0.03$, n = 100), while scaling to proteins exceeding 2,000 residues on a single A5000 GPU (24 GB).

2411.02933 2026-06-12 cs.DB cs.LG cs.PF 版本更新

P-MOSS: Scheduling Main-Memory Indexes Over NUMA Servers Using Next Token Prediction

P-MOSS:利用下一个令牌预测在NUMA服务器上调度主内存索引

Yeasir Rayhan, Walid G. Aref

发表机构 * Purdue University West Lafayette, IN, USA(普渡大学西拉法叶分校)

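AI summary P-MOSS is a learned spatial scheduling framework that schedules query execution onto specific logical cores of NUMA servers and co-locates data on the corresponding NUMA node, borrowing core principles from large language models; experiments show query-throughput gains of up to 6×.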

Comments Accepted to SIGMOD'26

详情
AI中文摘要

自从2000年代初Dennard缩放定律失效、CPU频率停滞以来,厂商开始以引入异构性为代价增加每个CPU芯片上的核心数量,从而开启了NUMA和Chiplet处理器时代。此后,硬件设计空间的异构性不断增加,以至于现代服务器中DBMS性能的差异可高达一个数量级。影响性能的重要因素包括DBMS查询执行所在逻辑核心的位置和数据存储的位置。本文介绍了P-MOSS,一种学习型空间调度框架,将查询执行调度到特定逻辑核心,并将数据就近放置在对应的NUMA节点上。为了实现跨硬件和工作负载的适应性,P-MOSS利用大语言模型的核心原理,如下一个令牌预测、生成式预训练和微调。本着硬件-软件协同的精神,P-MOSS仅基于硬件性能监控单元收集的低层硬件统计信息,借助Decision Transformer进行调度决策。实验评估在B$^+$-Tree索引的背景下进行。性能结果表明,P-MOSS在查询吞吐量方面比传统调度提高了多达6倍。

英文摘要

Ever since Dennard scaling broke down in the early 2000s and CPU frequencies stalled, vendors have increased the core count of each CPU chip at the expense of introducing heterogeneity, thus ushering in the era of NUMA and Chiplet processors. Since then, the heterogeneity in the hardware design space has only increased, to the point that DBMS performance may vary by up to an order of magnitude in modern servers. Important factors that affect performance include the location of the logical cores where the DBMS queries execute and the location where the data resides. This paper introduces P-MOSS, a learned spatial scheduling framework that schedules query execution to specific logical cores and co-locates data on the corresponding NUMA node. For cross-hardware and workload adaptability, P-MOSS leverages core principles from Large Language Models, such as Next Token prediction, Generative Pre-training, and Fine-tuning. In the spirit of hardware-software synergy, P-MOSS bases its scheduling decisions solely on the low-level hardware statistics collected from the hardware Performance Monitoring Unit, with the aid of a Decision Transformer. Experimental evaluation is performed in the context of the B$^+$-Tree index. Performance results demonstrate that P-MOSS offers an improvement of up to $6\times$ over traditional schedules in terms of query throughput.

2601.10885 2026-06-12 physics.plasm-ph cs.LG physics.comp-ph 版本更新

Learning collision operators from plasma phase space data using differentiable simulators

利用可微分模拟器从等离子体相空间数据学习碰撞算子

Diogo D. Carvalho, Pablo J. Bilbao, Warren B. Mori, Luis O. Silva, E. Paulo Alves

发表机构 * GoLP/Instituto de Plasmas e Fusão Nuclear, Instituto Superior Técnico, Universidade de Lisboa(GoLP/等离子体与核融合研究所,理工学院,里斯本大学) Mani L. Bhaumik Institute for Theoretical Physics, University of California, Los Angeles(马尼·L·巴乌米克理论物理研究所,加州大学洛杉矶分校) The Rudolf Peierls Centre for Theoretical Physics, University of Oxford(鲁道夫·皮埃尔尔斯理论物理中心,牛津大学) Department of Physics and Astronomy University of California, Los Angeles(物理与天文学系,加州大学洛杉矶分校)

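AI summary Proposes a method that combines a differentiable Fokker-Planck solver with gradient-based optimisation to infer collision operators from plasma phase space data, and validates its accuracy and computational efficiency on two-dimensional PIC simulation data.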

Comments accepted for publication in Journal of Plasma Physics, code available at https://github.com/diogodcarvalho/ml-pic-collision-operators

详情
Journal ref
J. Plasma Phys. (2026), vol. 92, E76
AI中文摘要

我们提出了一种从等离子体动力学相空间数据推断碰撞算子的方法。该方法结合了一个可微分动力学模拟器(其核心组件是一个可微分的Fokker-Planck求解器)与基于梯度的优化方法,以学习最能描述相空间动力学的碰撞算子。我们使用空间均匀热等离子体的二维Particle-in-Cell模拟数据测试了该方法,学习了能够捕获有限大小带电粒子之间自洽电磁相互作用的碰撞算子,该算子适用于多种模拟参数。我们证明,学习到的算子比基于粒子轨迹的替代估计更准确,同时无需对过程的相关时间尺度做出先验假设,并显著降低了内存需求。我们发现,在非相对论条件下获得的算子与静电场景的理论预测高度一致。我们的结果表明,可微分模拟器为推断新算子提供了一种强大且计算高效的方法,适用于广泛的问题,如电磁主导的碰撞动力学和随机波粒相互作用。

英文摘要

We propose a methodology to infer collision operators from phase space data of plasma dynamics. Our approach combines a differentiable kinetic simulator, whose core component in this work is a differentiable Fokker-Planck solver, with a gradient-based optimisation method to learn the collision operators that best describe the phase space dynamics. We test our method using data from two-dimensional Particle-in-Cell simulations of spatially uniform thermal plasmas, and learn the collision operator that captures the self-consistent electromagnetic interaction between finite-size charged particles over a wide variety of simulation parameters. We demonstrate that the learned operators are more accurate than alternative estimates based on particle tracks, while making no prior assumptions about the relevant time scales of the processes and significantly reducing memory requirements. We find that the retrieved operators, obtained in the non-relativistic regime, are in excellent agreement with theoretical predictions derived for electrostatic scenarios. Our results show that differentiable simulators offer a powerful and computationally efficient approach to infer novel operators for a wide range of problems, such as electromagnetically dominated collisional dynamics and stochastic wave-particle interactions.

2510.03699 2026-06-12 q-bio.NC cs.AI cs.LG cs.NE cs.SY eess.SY 版本更新

Dissecting Larval Zebrafish Hunting using Deep Reinforcement Learning Trained RNN Agents

利用深度强化学习训练的RNN智能体解析斑马鱼幼体的捕食行为

Raaghav Malik, Satpreet H. Singh, Sonja Johnson-Yu, Nathan Wu, Roy Harpaz, Florian Engert, Kanaka Rajan

发表机构 * California Institute of Technology(加州理工学院) Harvard University(哈佛大学)

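AI summary Trains RNN agents with deep reinforcement learning to study larval zebrafish hunting and how ecological and energetic constraints shape adaptive behavior. A simple model reproduces real hunting behaviors, and virtual experiments probe how constraints and environment affect hunting dynamics.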

详情
Journal ref
Proceedings of the 9th Conference on Cognitive Computational Neuroscience (2026)
AI中文摘要

斑马鱼幼体捕食行为为研究生态和能量约束如何塑造生物大脑和人工智能体的适应性行为提供了易于处理的研究场景。本文开发了一个最小化的基于智能体的模型,通过深度强化学习在基于游动回合(bout)的斑马鱼模拟器中训练循环策略。尽管模型简单,它能复现标志性的捕食行为,包括与眼球聚散相关联的追捕、速度调节和刻板化的接近轨迹,这些行为与真实斑马鱼幼体高度吻合。定量轨迹分析显示,追捕回合在出击前系统性地将猎物角度减小约一半,与实测结果一致。虚拟实验和参数扫描改变了生态和能量约束、回合运动学(转弯与前进运动耦合或不耦合)以及食物密度、食物速度和聚散限制等环境因素。这些操作揭示了约束和环境如何塑造追捕动态、出击成功率和中止率,为神经科学实验提供可证伪的预测。这些扫描识别出一组紧凑的约束,即双目感知、回合运动学中前进速度与转弯的耦合,以及运动和聚散的适度能量成本,它们足以使类斑马鱼的捕食行为涌现。引人注目的是,这些行为出现在最小化的智能体中,无需详细的生物力学、流体动力学、神经回路真实性,也无需从真实斑马鱼数据中进行模仿学习。总体而言,这项工作为斑马鱼捕食行为提供了规范性解释,即能量成本与感官收益之间的最优平衡,并突显了决定聚散和轨迹动态的权衡。我们建立了一个虚拟实验室,缩小了实验搜索空间,并生成了关于行为和神经编码的可证伪预测。

英文摘要

Larval zebrafish hunting provides a tractable setting to study how ecological and energetic constraints shape adaptive behavior in both biological brains and artificial agents. Here we develop a minimal agent-based model, training recurrent policies with deep reinforcement learning in a bout-based zebrafish simulator. Despite its simplicity, the model reproduces hallmark hunting behaviors -- including eye vergence-linked pursuit, speed modulation, and stereotyped approach trajectories -- that closely match real larval zebrafish. Quantitative trajectory analyses show that pursuit bouts systematically reduce prey angle by roughly half before strike, consistent with measurements. Virtual experiments and parameter sweeps vary ecological and energetic constraints, bout kinematics (coupled vs. uncoupled turns and forward motion), and environmental factors such as food density, food speed, and vergence limits. These manipulations reveal how constraints and environments shape pursuit dynamics, strike success, and abort rates, yielding falsifiable predictions for neuroscience experiments. These sweeps identify a compact set of constraints -- binocular sensing, the coupling of forward speed and turning in bout kinematics, and modest energetic costs on locomotion and vergence -- that are sufficient for zebrafish-like hunting to emerge. Strikingly, these behaviors arise in minimal agents without detailed biomechanics, fluid dynamics, circuit realism, or imitation learning from real zebrafish data. Taken together, this work provides a normative account of zebrafish hunting as the optimal balance between energetic cost and sensory benefit, highlighting the trade-offs that structure vergence and trajectory dynamics. We establish a virtual lab that narrows the experimental search space and generates falsifiable predictions about behavior and neural coding.

2307.05520 2026-06-12 cs.LG cs.CY cs.SE 版本更新

Estimating Deep Learning energy consumption based on model architecture and training environment

基于模型架构和训练环境的深度学习能耗估算

Santiago del Rey, Luís Cruz, Xavier Franch, Silverio Martínez-Fernández

发表机构 * Universitat Politècnica de Catalunya(加泰罗尼亚理工大学) Tecnológico de Delft(代尔夫特理工大学)

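AI summary Analyses how model architecture and training environment affect energy consumption, showing that a well-chosen combination of model and training environment can cut training energy by up to 80.68%, and proposes the STEP and PRE methods, which estimate energy more accurately than existing tools.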

Comments 48 pages, 10 figures, under review in Computer Standards & Interfaces journal. This work is an extension of arXiv:2307.05520v3 [cs.LG]

详情
AI中文摘要

为提高对深度学习环境影响的认识,许多研究估算DL系统的能耗。然而,训练期间的能耗估计常依赖未经验证的假设。本文通过研究模型架构和训练环境对能耗的影响来弥补这一不足:训练多种计算机视觉模型并收集能耗和准确性指标,分析不同配置间的权衡。结果表明,选择合适的模型-训练环境组合最多可将训练能耗降低80.68%,而$F_1$分数损失低于2%。我们发现模型与训练环境之间存在显著的交互效应:当GPU计算能力随模型复杂度相应提升时,能效提高。此外,我们证明FLOPs或GPU TDP等常用估算方法无法捕捉这些动态,可能导致重大误差。为此,我们提出Stable Training Epoch Projection (STEP)和Pre-training Regression-based Estimation (PRE)方法。在各项评估中,这些方法的估算准确性比现有工具高出两倍或更多。

英文摘要

To raise awareness of the environmental impact of deep learning (DL), many studies estimate the energy use of DL systems. However, energy estimates during DL training often rely on unverified assumptions. This work addresses that gap by investigating how model architecture and training environment affect energy consumption. We train a variety of computer vision models and collect energy consumption and accuracy metrics to analyze their trade-offs across configurations. Our results show that selecting the right model-training environment combination can reduce training energy consumption by up to 80.68% with less than 2% loss in $F_1$ score. We find a significant interaction effect between model and training environment: energy efficiency improves when GPU computational power scales with model complexity. Moreover, we demonstrate that common estimation practices, such as using FLOPs or GPU TDP, fail to capture these dynamics and can lead to substantial errors. To address these shortcomings, we propose the Stable Training Epoch Projection (STEP) and the Pre-training Regression-based Estimation (PRE) methods. Across evaluations, our methods outperform existing tools by a factor of two or more in estimation accuracy.

2507.11936 2026-06-12 cs.CL cs.AI cs.CV cs.LG 版本更新

A Survey of Deep Learning for Geometry Problem Solving

深度学习在几何问题求解中的应用综述

Jianzhe Ma, Wenxuan Wang, Qin Jin

发表机构 * Renmin University of China(中国人民大学)

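AI summary Surveys deep learning for geometry problem solving, covering the relevant tasks, methods, evaluation metrics, and future directions, with the aim of providing a practical reference that advances the field.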

Comments ACL 2026 Main Conference

详情
AI中文摘要

几何问题求解作为数学推理的重要组成部分,在教育、评估AI数学能力及多模态能力评估中具有关键作用。近期深度学习技术,尤其是多模态大语言模型的出现,显著加速了该领域的研究。本文综述了深度学习在几何问题求解中的应用,包括(i)几何问题求解相关任务的全面总结;(ii)相关深度学习方法的深入回顾;(iii)评估指标和方法的详细分析;以及(iv)最先进性能、现有挑战和有前景的未来方向的批判性讨论。我们的目标是提供一个全面且实用的深度学习在几何问题求解中的参考,从而推动该领域进一步发展。我们维护了一个相关论文列表:https://github.com/majianz/dl4gps。

英文摘要

Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper presents a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of state-of-the-art performance, existing challenges, and promising future directions. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We maintain a list of relevant papers: https://github.com/majianz/dl4gps.

2104.11105 2026-06-12 cs.CR cs.LG cs.NE 版本更新

Synchronization of Tree Parity Machines using non-binary input vectors

使用非二进制输入向量同步树奇偶机

Miłosz Stypiński, Marcin Niemiec

发表机构 * AGH University of Science and Technology(AGH科技大学)

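AI summary Proposes improving the synchronization of tree parity machines by using non-binary input vectors with a wider range of values, which shortens synchronization time and improves the security of neural cryptography.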

Comments This work has been submitted to the IEEE for possible publication

详情
Journal ref
IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 1, pp. 1423-1429, Jan. 2024
AI中文摘要

神经密码学是将人工神经网络应用于密码学领域的解决方案。其功能基于树奇偶机,利用人工神经网络在网络实体间执行安全密钥交换。本文提出改进两个树奇偶机的同步方法,该方法基于使用范围更广的非二进制输入向量学习人工神经网络。结果表明,同步过程的时间缩短,因此树奇偶机在更短的时间内达成共同权重,从而提升了神经密码学的安全性。

英文摘要

Neural cryptography is the application of artificial neural networks to cryptography. Its functionality is based on the tree parity machine, which uses artificial neural networks to perform secure key exchange between network entities. This article proposes improvements to the synchronization of two tree parity machines. The improvement is based on training the artificial neural networks with input vectors that have a wider range of values than binary ones. As a result, the duration of the synchronization process is reduced: tree parity machines reach common weights in a shorter time because fewer bit exchanges are necessary. This approach improves the security of neural cryptography.
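Mutual learning between two tree parity machines can be sketched as follows. This is a minimal sketch using the standard Hebbian update with inputs drawn from {-M, ..., M} without zero. The sizes `K`, `N`, `L`, `M`, the exclusion of zero, and the tie-breaking rule in `sign` are assumptions for illustration, not the article's exact configuration.

```python
import random

# Assumed sizes: K hidden units, N inputs per unit,
# weights bounded by L, inputs bounded by M (M = 1 is the binary case).
K, N, L, M = 3, 8, 3, 2


def sign(v):
    # Ties are mapped to -1; both parties must share this convention.
    return 1 if v > 0 else -1


class TPM:
    def __init__(self, rng):
        self.w = [[rng.randint(-L, L) for _ in range(N)] for _ in range(K)]
        self.sigma = [1] * K

    def output(self, x):
        """Hidden outputs are signs of local fields; tau is their product."""
        self.sigma = [
            sign(sum(wi * xi for wi, xi in zip(w, row)))
            for w, row in zip(self.w, x)
        ]
        tau = 1
        for s in self.sigma:
            tau *= s
        return tau

    def update(self, x, tau):
        """Hebbian rule: only hidden units that agree with tau learn."""
        for k in range(K):
            if self.sigma[k] == tau:
                for j in range(N):
                    v = self.w[k][j] + x[k][j] * tau
                    self.w[k][j] = max(-L, min(L, v))  # clip to [-L, L]


def random_input(rng):
    values = [v for v in range(-M, M + 1) if v != 0]
    return [[rng.choice(values) for _ in range(N)] for _ in range(K)]


def synchronize(seed=0, max_steps=200_000):
    """Run mutual learning on shared public inputs until weights match."""
    rng = random.Random(seed)
    a, b = TPM(rng), TPM(rng)
    for step in range(1, max_steps + 1):
        x = random_input(rng)
        tau_a, tau_b = a.output(x), b.output(x)
        if tau_a == tau_b:  # only the output bits are exchanged
            a.update(x, tau_a)
            b.update(x, tau_b)
        if a.w == b.w:
            return step, a, b
    return None, a, b


steps, tpm_a, tpm_b = synchronize()
```

Once the two weight matrices coincide they stay identical, since both machines then see the same inputs and apply the same updates; the shared weights can serve as key material. Setting `M = 1` recovers binary inputs, which makes it easy to compare step counts across seeds.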

2306.01690 2026-06-12 cs.LG cs.AI 版本更新

Context selectivity with dynamic availability enables lifelong continual learning

基于动态可用性的上下文选择性促进终身持续学习

Martin Barry, Wulfram Gerstner, Guillaume Bellec

发表机构 * Department of Life Sciences, Department of Computer Sciences(生命科学系、计算机科学系)

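AI summary Proposes a meta-plasticity rule based on context selectivity and dynamic availability; in simulations the model achieves better transfer learning than contemporary continual learning algorithms on image recognition and natural language processing benchmarks.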

详情
AI中文摘要

"你永远忘不了如何骑自行车"——但这是如何可能的?大脑能够学习复杂技能,停顿多年不练习,中间学习其他技能,仍能随时召回原始知识。这种能力的机制,称为终身学习(或持续学习,CL),尚不清楚。我们建议一种生物合理的元可塑性规则,基于经典持续学习工作,总结为两个原则:(i) 神经元具有上下文选择性,(ii) 一个局部可用性变量在神经元先前任务相关时部分冻结可塑性。在新的神经中心形式化中,我们建议神经元选择性和神经元级巩固是简单且可行的元可塑性假设,以在大脑中实现CL。在模拟中,该简单模型平衡了遗忘和巩固,导致在图像识别和自然语言处理CL基准上优于当前CL算法。

英文摘要

"You never forget how to ride a bike" -- but how is that possible? The brain is able to learn complex skills, stop the practice for years, learn other skills in between, and still retrieve the original knowledge when necessary. The mechanisms of this capability, referred to as lifelong learning (or continual learning, CL), are unknown. We suggest a bio-plausible meta-plasticity rule building on classical work in CL, which we summarize in two principles: (i) neurons are context selective, and (ii) a local availability variable partially freezes the plasticity if the neuron was relevant for previous tasks. In a new neuro-centric formalization of these principles, we suggest that neuron selectivity and neuron-wide consolidation are a simple and viable meta-plasticity hypothesis to enable CL in the brain. In simulation, this simple model balances forgetting and consolidation, leading to better transfer learning than contemporary CL algorithms on image recognition and natural language processing CL benchmarks.

1710.03070 2026-06-12 cs.NE cs.LG q-bio.NC stat.ML 版本更新

full-FORCE: A Target-Based Method for Training Recurrent Networks

full-FORCE:一种基于目标的训练循环网络方法

Brian DePasquale, Christopher J. Cueva, Kanaka Rajan, G. Sean Escola, L. F. Abbott

发表机构 * Department of Neuroscience(神经科学系) Zuckerman Institute(Zuckerman研究所) Columbia University(哥伦比亚大学) Department of Physiology and Cellular Biophysics(生理学与细胞生物物理学系) Columbia University College of Physicians and Surgeons(哥伦比亚大学医学与外科学院) Princeton Neuroscience Institute(普林斯顿神经科学研究所) Lewis-Sigler Institute for Integrative Genomics(整合基因组学研究所)

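AI summary Proposes a target-based method for training recurrent networks in which a second network supplies target dynamics during training; the resulting networks perform tasks with fewer neurons and greater noise robustness.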

Comments 20 pages, 8 figures

详情
Journal ref
PLoS ONE (2018)
AI中文摘要

训练好的循环网络是建模动态神经计算的强大工具。我们提出了一种基于目标的方法,用于修改循环网络的全连接矩阵,以训练其执行涉及时间复杂输入/输出转换的任务。该方法在训练过程中引入第二个网络,提供合适的“目标”动态,有助于完成任务。由于利用了全循环连接,该方法产生的网络在执行任务时比传统的最小二乘(FORCE)方法使用更少的神经元,并具有更高的噪声鲁棒性。此外,我们展示了如何通过向目标生成网络引入额外的输入信号,这些信号作为任务提示,大大扩展了可学习的任务范围,并提供了对训练任务执行网络动态复杂性和性质的控制。

英文摘要

Trained recurrent networks are powerful tools for modeling dynamic neural computations. We present a target-based method for modifying the full connectivity matrix of a recurrent network to train it to perform tasks involving temporally complex input/output transformations. The method introduces a second network during training to provide suitable "target" dynamics useful for performing the task. Because it exploits the full recurrent connectivity, the method produces networks that perform tasks with fewer neurons and greater noise robustness than traditional least-squares (FORCE) approaches. In addition, we show how introducing additional input signals into the target-generating network, which act as task hints, greatly extends the range of tasks that can be learned and provides control over the complexity and nature of the dynamics of the trained, task-performing network.