arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.06189 2026-05-08 eess.AS cs.LG

Predictive-Generative Drift Decomposition for Speech Enhancement and Separation

Julius Richter, Yoshiki Masuyama, Christoph Boeddeker, Takahiro Edo, Gordon Wichern, Jonathan Le Roux

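Affiliations MERL, Cambridge, MA, USA

AI summary This paper proposes a framework that combines predictive and generative models, improving speech enhancement and separation by decomposing the drift dynamics, with a unified approach that applies across multiple tasks and degradation scenarios.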

Comments Submitted to NeurIPS 2026

Abstract

We propose a plug-and-play framework for speech enhancement and separation that augments predictive methods with a generative speech prior. Our approach, termed Stochastic Interpolant Prior for Speech (SIPS), builds on stochastic interpolants and leverages their flexibility to bridge predictive and generative modeling. Specifically, we decompose the interpolation dynamics into a task-specific drift and a stochastic denoising component, allowing a predictive estimate to be integrated directly into the generative sampling process. This results in a mathematically grounded framework for combining strong pretrained predictors with the expressive power of generative models. To this end, we train a score model using only clean speech, yielding a degradation-agnostic prior that can be reused across tasks. During inference, the predictor provides a deterministic drift that steers the sampling process toward a task-consistent estimate, while the score model preserves perceptual naturalness. Unlike prior hybrid approaches, which typically rely on architecture-specific conditioning and are tied to particular predictors or degradation settings, SIPS provides a unified framework that generalizes across predictors and additive degradation tasks. We demonstrate its effectiveness for both speech enhancement and speech separation using recent predictors such as SEMamba and FlexIO. The proposed method consistently improves perceptual quality, achieving gains of up to +1.0 NISQA for speech separation.

2605.06172 2026-05-08 stat.ML cs.LG cs.NA math.NA math.PR

Expressivity of Bi-Lipschitz Normalizing Flows: A Score-Based Diffusion Perspective

Meira Iske, Carola-Bibiane Schönlieb

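Affiliations Center for Industrial Mathematics, University of Bremen; Department of Theoretical Physics and Applied Mathematics, University of Cambridge

AI summary This paper studies the expressivity of bi-Lipschitz normalizing flows from the perspective of score-based diffusion models, analyzing their distributional approximation power through the probability flow ODE and deriving deterministic convergence guarantees for diffusion-based transport.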

Abstract

Many normalizing flow architectures impose regularity constraints, yet their distributional approximation properties are not fully characterized. We study the expressivity of bi-Lipschitz normalizing flows through the lens of score-based diffusion models. For the probability flow ODE of a variance-preserving diffusion, Lipschitz regularity of the score induces a flow of bi-Lipschitz diffeomorphic transport maps. This ODE bridge allows us to analyze the distributional approximation power of bi-Lipschitz normalizing flows and, conversely, derive deterministic convergence guarantees for diffusion-based transport. Our key idea is to use the probability flow ODE to link regularity of the score to regularity of the induced transport maps. We verify score regularity for broad target densities, including compactly supported densities, Gaussian convolutions of compactly supported measures and finite Gaussian mixtures. We obtain a universal distributional approximation result: Gaussian pullbacks induced by bi-Lipschitz variance-preserving transport maps are $L^1$-dense among all probability densities. For Gaussian convolution targets, we further obtain convergence in Kullback-Leibler divergence without early stopping.
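
For readers unfamiliar with the ODE bridge mentioned in the abstract, the standard variance-preserving setup can be written out as follows. This is a sketch in commonly used notation: the noise schedule $\beta(t)$, the marginals $p_t$, the velocity field $v$, the flow map $\Phi_t$, and the Lipschitz constants $\ell_t$, $L_t$ are assumed symbols and may differ from the paper's own.

```latex
% Variance-preserving forward SDE with an assumed noise schedule \beta(t) > 0
\mathrm{d}X_t = -\tfrac{1}{2}\beta(t)\,X_t\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}W_t,
\qquad X_t \sim p_t .

% Probability flow ODE with the same marginals p_t
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v(t, x_t),
\qquad
v(t, x) = -\tfrac{1}{2}\beta(t)\,\bigl[\,x + \nabla_x \log p_t(x)\,\bigr].

% If the score is \ell_t-Lipschitz in x, then v(t,\cdot) is L_t-Lipschitz with
% L_t \le \tfrac{1}{2}\beta(t)\,(1 + \ell_t), and Gronwall's inequality gives,
% for the flow map \Phi_t of the ODE,
e^{-\int_0^t L_s\,\mathrm{d}s}\,\lvert x - y\rvert
\;\le\; \lvert \Phi_t(x) - \Phi_t(y)\rvert
\;\le\; e^{\int_0^t L_s\,\mathrm{d}s}\,\lvert x - y\rvert .
```

The two-sided bound is the generic mechanism by which a Lipschitz score yields bi-Lipschitz transport maps; the paper's precise assumptions and constants are not given in the abstract.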

2605.06153 2026-05-08 cs.CR cs.CV

Secure Seed-Based Multi-bit Watermarking for Diffusion Models from First Principles

Enoal Gesny, Eva Giboulot

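Affiliations Inria Rennes

AI summary This paper proposes a first-principles watermarking method, establishing through theoretical analysis an evaluation framework based on security, robustness, and fidelity, which enables watermarking systems to be designed independently of any generative model.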

Abstract

The rapid emergence of generative image models has led to the development of specialized watermarking techniques, particularly in-generation methods such as seed-based embedding. However, current evaluations in this area remain largely empirical, making them heavily reliant on the specific model architectures used for generation and inversion. This prevents any clear conclusion on the performance of any method, especially regarding security, for which a rigorous definition is lacking. Against this approach, we argue that the effectiveness of a watermarking scheme should be established purely through a thorough theoretical analysis. This is enabled by decoupling the model-dependent part from the actual decision mechanism of the watermarking system. Using this decoupling, we introduce a formal evaluation framework based on security, robustness, and fidelity. This allows precise comparisons between watermarking systems through a characteristic surface representing the trade-off between these three quantities, independent of any generative model. Based on this framework, we propose SSB, a novel watermarking method that generalizes previous seed-based methods by making it possible to reach any security-robustness-fidelity regime on its characteristic surface. This work opens the door to the design of modern watermarking systems with theoretical guarantees that do not necessitate any costly empirical evaluations.

2605.06136 2026-05-08 cs.SE cs.AI

BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases

Jhen-Ke Lin

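Affiliations National Yang Ming Chiao Tung University

AI summary This study proposes the BUILD-AND-FIND protocol for evaluating whether downstream agents can recover intended choices from generated codebases and how much inspection effort that requires; it separates behavioral correctness from the recovery process and evaluates accuracy, stability, repeatability, coverage, and effort.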

Comments 25 pages, 8 figures, 17 tables

Abstract

Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working context. In that setting, a generated repository is not only an answer to a task but also a communication artifact for future work. Even when strong agents nearly satisfy the visible behavioral objective, repositories can differ in how clearly they expose the intended behavior and design choices behind that behavior. We introduce BUILD-AND-FIND, a protocol for evaluating whether downstream agents can recover those intended choices from generated repositories, and how much inspection that recovery requires. For each task, a builder sees a hidden repository specification and creates a codebase; a finder sees only the codebase and a specification-traced multiple-choice question bank. The protocol separates behavioral correctness from artifact-side recovery and reports recovery accuracy, repeatability, implementation coverage, and inspection effort. Accuracy and stability act as gates: effort is interpreted only when recovery succeeds reliably. Among artifacts from which the same intent can be recovered, lower effort by the same finder suggests that the artifact makes that intent easier to locate. Question-only and spec-only controls quantify generic priors and specification access, while audits separate omitted claims from finder failures and check whether correct answers cite artifact evidence. In the released high-prior task pack, recovery accuracy is near saturation, so inspection effort and finder-specific effects provide the main panel-local comparison.

2605.06134 2026-05-08 hep-lat cs.LG

Diffusion model for SU(N) gauge theories

Javad Komijani, Marina K. Marinkovic, Lara Turgut

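Affiliations Institute for Theoretical Physics, ETH Zurich, 8093 Zurich, Switzerland

AI summary This paper develops a score-matching framework for SU(N) lattice gauge theories for training diffusion models and generating high-quality samples, validates it against HMC simulations, and discusses strategies for improving sampling efficiency.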

Comments 23 pages, 6 figures

Abstract

Implicit score matching provides a computationally efficient approach for training diffusion models and generating high-quality samples from complex distributions. In this work, we develop a score-matching framework for SU(N) lattice gauge theories, which can be extended to other Lie groups. We apply the method to SU(3) gauge configurations with the Wilson gauge action in two and four dimensions and assess the quality of the generated samples by comparison with Hybrid Monte Carlo (HMC) simulations. We show that the diffusion models can be successfully trained and applied for sampling the Wilson gauge action. For large values of inverse coupling, accurate reverse-time integration requires predictor-corrector schemes, for which we introduce a corrector based on Hamiltonian molecular dynamics. While the corrector significantly improves sampling quality, it also increases the computational cost. We outline several strategies for improving sampling efficiency.

2605.06111 2026-05-08 cs.SE cs.AI

Schedule-and-Calibrate: Utility-Guided Multi-Task Reinforcement Learning for Code LLMs

Yujia Chen, Yang Ye, Xiao Chu, Yuchi Ma, Cuiyun Gao

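Affiliations Harbin Institute of Technology, Shenzhen; Huawei Cloud Computing Technologies Co., Ltd.

AI summary This paper proposes the ASTOR framework, which uses a task-utility-driven coordination mechanism to improve multi-task reinforcement learning for code LLMs; experiments show it outperforms task-specific specialists and baseline methods across several coding tasks.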

Abstract

Reinforcement learning (RL) with verifiable rewards has proven effective at post-training LLMs for coding, yet deploying separate task-specific specialists incurs costs that scale with the number of tasks, motivating a unified multi-task RL (MTRL) approach. However, existing MTRL methods treat all coding tasks uniformly, relying on fixed data curricula under a shared optimization strategy, ultimately limiting the effectiveness of multi-task training. To address these limitations, we propose ASTOR, a multi-tASk code reinforcement learning framework via uTility-driven coORdination. Centered on task utility, a signal capturing each task's learning potential and cross-task synergy, ASTOR comprises two coupled modules: 1) a Hierarchical Utility-Routed Data Scheduling module, which hierarchically allocates training budget and prioritizes informative prompts, steering training toward the most valuable data; and 2) an Adaptive Utility-Calibrated Policy Optimization module, which dynamically scales per-task KL regularization, matching update constraints to each task's current training state. Experiments on two widely-used LLMs across four representative coding tasks demonstrate that ASTOR consistently improves a single model across all tasks, outperforming the best task-specific specialist by 9.0%-9.5% and surpassing the strongest MTRL baseline by 7.5%-12.8%.

2605.06091 2026-05-08 math.ST cs.LG math.PR stat.CO stat.TH

Time-Inhomogeneous Preconditioned Langevin Dynamics

Alexander Falk, Laurenz Nagler, Andreas Habring, Thomas Pock

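Affiliations Institute of Visual Computing; Graz University of Technology

AI summary This paper proposes TIPreL, which uses a time- and position-dependent preconditioner to address global mode coverage and local mode exploration for Langevin dynamics on multi-modal distributions, proves convergence in the Wasserstein-2 distance, and validates the method experimentally.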

Abstract

Langevin sampling from distributions of the form $p(x) \propto \exp(-Ψ(x))$ faces two major challenges: (global) mode coverage and (local) mode exploration. The first challenge is particularly relevant for multi-modal distributions with disjoint modes, whereas the second arises when the potential $Ψ$ exhibits diverse and ill-conditioned local mode geometry. To address these challenges, a common approach is to precondition Langevin dynamics with problem-specific information, such as the sample covariance or the local curvature of $Ψ$. However, existing preconditioner choices inherently involve a trade-off between global mode coverage and local mode exploration, and no prior method resolves both simultaneously. To overcome this limitation, we propose TIPreL, which introduces a time- and position-dependent preconditioner. This design effectively addresses both challenges mentioned above within a single framework. We establish convergence of the resulting dynamics in the Wasserstein-2 distance both in continuous time and for a tamed Euler discretization. In particular, our analysis extends the existing state of the art by proving convergence under time- and space-dependent diffusion coefficients, and only locally Lipschitz drifts, which has not been covered by prior work. Finally, we experimentally compare TIPreL with competing preconditioning schemes on a two-dimensional, severely ill-posed example and on a Bayesian logistic regression task in higher dimensions, confirming the efficiency of the proposed method.
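
To make the baseline idea of preconditioning concrete, the sketch below runs plain Euler-Maruyama Langevin dynamics with a constant symmetric positive-definite preconditioner. It is not TIPreL: the time- and position-dependent preconditioner and the tamed Euler scheme from the abstract are not reproduced, and the function name `preconditioned_langevin` and its parameters are hypothetical.

```python
import numpy as np

def preconditioned_langevin(grad_psi, x0, precond, step, n_steps, seed=0):
    """Euler-Maruyama for dX = -M grad Psi(X) dt + sqrt(2) M^{1/2} dW.

    Assumes a CONSTANT SPD preconditioner M (`precond`); a position-dependent
    M would need an extra divergence term that is omitted here.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    chol = np.linalg.cholesky(precond)  # precond = chol @ chol.T
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        noise = chol @ rng.standard_normal(x.size)  # noise with covariance M
        x = x - step * (precond @ grad_psi(x)) + np.sqrt(2.0 * step) * noise
        samples[k] = x
    return samples

# Ill-conditioned Gaussian target: Psi(x) = 0.5 * x^T A x with A = diag(1, 100).
A = np.diag([1.0, 100.0])
grad = lambda x: A @ x
# Choosing M = A^{-1} rescales both coordinates to the same time scale.
chain = preconditioned_langevin(grad, x0=[0.0, 0.0], precond=np.linalg.inv(A),
                                step=0.1, n_steps=20000)
```

With `M = A^{-1}` both coordinates mix at the same rate, which is the local-geometry benefit the abstract attributes to curvature-based preconditioners. The discretization leaves a small step-size-dependent bias in the sample covariance.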

2605.06082 2026-05-08 cs.AR cs.LG cs.PF

PoTAcc: A Pipeline for End-to-End Acceleration of Power-of-Two Quantized DNNs

Rappy Saha, Jude Haris, Nicolas Bohm Agostini, David Kaeli, José Cano

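Affiliations UK Engineering and Physical Sciences Research Council

AI summary This paper proposes the PoTAcc pipeline for accelerating and evaluating power-of-two quantized DNNs on resource-constrained edge devices, with model preparation and deployment through TensorFlow Lite, and demonstrates acceleration on CPU-only and hybrid CPU-FPGA systems.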

Comments Accepted to IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI), 2026

Abstract

Power-of-two (PoT) quantization significantly reduces the size of deep neural networks (DNNs) and replaces multiplications with bit-shift operations for inference. Prior work has shown that PoT-quantized DNNs can preserve accuracy for tasks such as image classification; however, their performance on resource-constrained edge devices remains insufficiently understood. While general-purpose edge CPUs and GPUs do not provide optimized backends for bit-shift operations, custom hardware accelerators can better exploit PoT quantization by implementing dedicated shift-based processing elements. However, deploying PoT-quantized models on such accelerators is challenging due to limited support in existing inference frameworks. In addition, the impact of different PoT quantization strategies on hardware design, performance, and energy efficiency during full inference has not been systematically explored. To address these challenges, we propose PoTAcc, an open-source end-to-end pipeline for accelerating and evaluating PoT-quantized DNNs on resource-constrained edge devices. PoTAcc enables seamless preparation and deployment of PoT-quantized models via TensorFlow Lite (TFLite) across heterogeneous platforms, including CPU-only systems and hybrid CPU-FPGA systems with custom accelerators. We design shift-based processing element (shift-PE) accelerators for three PoT quantization methods and implement them on two FPGA platforms. We evaluate accuracy, performance, energy efficiency, and resource utilization across a range of models, including CNNs and Transformer-based architectures. Results show that our CPU-accelerator design achieves up to 3.6x speedup and 78% energy reduction compared to CPU-only execution for PoT-quantized DNNs on PYNQ-Z2 and Kria boards. The code will be publicly released at https://github.com/gicLAB/PoTAcc
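
The claim that PoT quantization replaces multiplications with bit shifts can be illustrated with a minimal NumPy sketch. It assumes a simple log-domain rounding scheme with exponents clipped to a fixed range; the three PoT methods evaluated by PoTAcc are not specified in the abstract, and the names `pot_quantize` and `shift_multiply` are hypothetical.

```python
import numpy as np

def pot_quantize(w, min_exp=-7, max_exp=0):
    """Approximate each weight by sign * 2**e, with e clipped to [min_exp, max_exp].

    Assumed scheme: round log2(|w|) to the nearest integer. Zero weights keep sign 0.
    """
    w = np.asarray(w, dtype=np.float64)
    sign = np.sign(w).astype(np.int64)
    mag = np.abs(w)
    safe = np.where(mag > 0, mag, 1.0)  # avoid log2(0); sign already encodes zero
    exp = np.clip(np.round(np.log2(safe)), min_exp, max_exp).astype(np.int64)
    return sign, exp

def shift_multiply(x_int, sign, exp):
    """Compute x * sign * 2**exp on integer activations using only shifts."""
    x_int = np.asarray(x_int, dtype=np.int64)
    left = np.left_shift(x_int, np.maximum(exp, 0))     # used where exp >= 0
    right = np.right_shift(x_int, np.maximum(-exp, 0))  # used where exp < 0
    return sign * np.where(exp >= 0, left, right)

sign, exp = pot_quantize([0.5, -0.25, 0.0, 1.0])
out = shift_multiply([64, 64, 64, 64], sign, exp)  # -> [32, -16, 0, 64]
```

Right shifts on integers discard low-order bits, so results for negative exponents are floored; a hardware shift-PE would typically pair this with a fixed-point scale, which is left out of this sketch.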

2605.06059 2026-05-08 stat.AP cs.LG

Correcting heterogeneous diagnostic bias when developing clinical prediction models using causal hidden Markov models

Jose Benitez-Aurioles, Ricardo Silva, Brian McMillan, Matthew Sperrin

Affiliations Division of Informatics, Imaging & Data Sciences, University of Manchester; Department of Statistical Science, UCL; Division of Population Health, Health Services Research & Primary Care, University of Manchester

AI summary This paper proposes a method that uses a causal inference framework to correct prediction bias caused by differences in diagnostic delay, modeling the longitudinal process with a hidden Markov model and improving the calibration of a chronic kidney disease prediction model.

Comments 4 figures, 2 tables, 4 supplementaries

Abstract

In routine care, individuals identified a priori as high-risk are usually tested for conditions more frequently. Protected attributes, such as sex or ethnicity, may also determine testing frequency. Such heterogeneous detection rates across a population induce label error. This causes systematic model error for specific groups and biases performance metrics during validation. This paper proposes a method to correct for such bias in prediction models due to differential diagnostic delay. We use a causal inference framework to define our target estimand: an individual's diagnosis probability in a counterfactual scenario where their diagnosis rate matches that of a reference group. We model the longitudinal process as a hidden Markov model, in which confirmatory test results are emissions from a latent progressive disease stage. We validate our approach in simulated data and apply it to a case study of chronic kidney disease prediction using electronic health records. In simulations, our method reduces prediction bias and improves calibration-in-the-large, correcting the Observed:Expected ratio in the underdiagnosed group from 1.34 (standard deviation: 0.09) in a model developed without any correction for underdiagnosis bias to 1.02 (0.09). Violations of assumptions in the simulation affected the estimation of model parameters, but the proposed approach nonetheless remained better calibrated than the standard model. In the clinical case study, we identify diabetes as the main driver of observability, with an odds ratio of 10.36 (95% confidence interval, 9.80 - 11.02) in 6-month urine albumin-creatinine ratio testing rate. Using our approach to predict the counterfactual diagnostic rate in patients without diabetes, we improved the Observed:Expected ratio of a developed clinical prediction model from 1.55 (1.51 - 1.59) to 1.01 (0.98 - 1.04).
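
The Observed:Expected ratio quoted above is the usual calibration-in-the-large summary: the number of observed events divided by the sum of predicted probabilities. A minimal sketch follows; the function name `observed_expected_ratio` is hypothetical, and the paper's counterfactual construction of the "observed" outcomes is not reproduced.

```python
import numpy as np

def observed_expected_ratio(y_true, p_pred):
    """Observed events divided by expected events (sum of predicted risks).

    A value above 1 means more events occurred than the model predicted
    (under-prediction); a value near 1 indicates good calibration-in-the-large.
    """
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    expected = p_pred.sum()
    if expected <= 0:
        raise ValueError("sum of predicted probabilities must be positive")
    return y_true.sum() / expected

# Three events observed where the model expected two in total.
ratio = observed_expected_ratio([1, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])  # -> 1.5
```

Read this way, the reported move from 1.55 to 1.01 means the corrected model no longer under-predicts diagnoses in the target group.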

2605.05996 2026-05-08 stat.ML cs.LG

Gaussian mixture models in Hilbert spaces via kernel methods

Daniel López-Montero, Antonio Álvarez-López, Marcos Matabuena

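Affiliations Friedrich-Alexander-Universität Erlangen-Nürnberg; Universidad Autónoma de Madrid; Mohamed bin Zayed University of Artificial Intelligence

AI summary This paper proposes a Gaussian mixture model for Hilbert-space-valued data based on kernel mean embeddings, estimated with efficient optimization algorithms, establishes theoretical guarantees in infinite-dimensional spaces, and validates the framework on dynamic functional data and modern medical applications.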

Comments 38 pages, 13 figures

Abstract

Modern datasets across many disciplines increasingly consist of time-evolving, potentially infinite-dimensional random objects, such as dynamic functional data, which are naturally modeled in Hilbert spaces. In these settings, characterizing probability measures, for example, through densities, can be ill-defined or technically challenging. Motivated by clustering applications, we propose a Gaussian mixture framework for Hilbert-space-valued data based on kernel mean embeddings and develop efficient optimization algorithms for estimation. We establish theoretical guarantees showing that the proposed algorithm is well defined and that the model yields a dense class of approximations in infinite-dimensional spaces. We evaluate the framework through extensive experiments on diverse structures and data geometries, including $L^2$-functional data and random graphs in Laplacian spaces arising in modern medical applications.

2605.05993 2026-05-08 stat.ML cs.LG stat.ME stat.OT

TabCF: Distributional Control Function Estimation with Tabular Foundation Models

Geping Chen, Chunlin Li, Tianzhong Yang, Zhengyuan Zhu, Jing Zhou

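Affiliations Iowa State University; University of Virginia; University of Minnesota; University of Manchester

AI summary TabCF uses tabular foundation models for control function regression, enabling fast and transparent distributional causal estimation of quantities such as interventional means and quantiles, and proposes a copula-based approximation for multivariate outcomes.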

Abstract

Instrumental variable (IV) and control function (CF) methods are powerful tools for causal effect estimation in the presence of unmeasured confounding, yet most existing approaches target only mean effects and/or demand substantial fitting and tuning effort. In this paper, we introduce a simple method, TabCF, for control function regression using tabular foundation models, which enables accurate, fast, identification-transparent, and tuning-light causal estimation of distributional quantities, such as interventional means and quantiles; we also propose a copula-based approximation for multivariate outcomes. TabCF performs favorably against representative methods across a broad range of small- to medium-sized synthetic and real data scenarios. The central message is two-fold: for practitioners, it highlights that TabCF is an effective tool for distributional causal inference; for researchers, it suggests that the proposed approach could be considered a strong baseline for future method development. Code is available at https://github.com/GepingChen/TabCF.

2605.05973 2026-05-08 stat.ML cs.AI cs.LG stat.AP

Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking

Yang Xu, Jiefu Zhang, Haixiang Sun, Zihan Zhou, Tianyu Cao, Vaneet Aggarwal

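Affiliations Purdue University; Johns Hopkins University; Purdue University, USA

AI summary This paper studies the winner's curse in adaptive benchmarking and proposes SIREN to make evaluation more reliable: it freezes the shortlist, separates selection from evaluation, and quantifies uncertainty with a Gaussian multiplier bootstrap; experiments show that it yields valid confidence intervals and comparisons under finite budgets.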

Abstract

Adaptive prompt and program search makes LLM evaluation selection-sensitive. Once benchmark items are reused inside tuning, the observed winner's score need not estimate the fresh-data performance of the full tune-then-deploy procedure. We study inference for this procedure-level target under explicit tuning budgets. We propose SIREN, a selection-aware repeated-split reporting protocol that freezes the post-search shortlist, separates splitwise selection from held-out evaluation, and uses an item-level Gaussian multiplier bootstrap for uncertainty quantification. In a fixed-shortlist regime with smooth stabilized selection, the estimator admits a first-order item-level representation, and the bootstrap yields valid simultaneous inference on a finite budget grid. This supports confidence intervals for procedure-performance curves and pre-specified equal-budget and cross-budget comparisons. Controlled simulations and MMLU-Pro tuning experiments show that winner-based reporting can be optimistic and can change deployment conclusions, while SIREN remains close to the finite-sample reporting target.
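
The item-level Gaussian multiplier bootstrap mentioned in the abstract can be sketched for the simplest case: simultaneous confidence bands for mean scores at several budgets, computed from a matrix of per-item scores. This is a generic textbook-style construction under assumed inputs, not the SIREN protocol itself (no shortlist freezing or repeated splits), and the name `multiplier_bootstrap_bands` is hypothetical.

```python
import numpy as np

def multiplier_bootstrap_bands(scores, n_boot=2000, alpha=0.05, seed=0):
    """Simultaneous (1 - alpha) bands for column means of `scores`.

    scores: array of shape (n_items, n_budgets), one row per benchmark item.
    Each bootstrap draw perturbs the centered item scores with i.i.d. N(0, 1)
    multipliers; the critical value is the (1 - alpha) quantile of the
    maximum studentized statistic across budgets.
    """
    scores = np.asarray(scores, dtype=float)
    n, _ = scores.shape
    mean = scores.mean(axis=0)
    centered = scores - mean
    sd = centered.std(axis=0, ddof=1)
    sd_safe = np.where(sd > 0, sd, 1.0)  # constant columns contribute a zero statistic
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_boot, n))  # Gaussian multipliers, one per item
    stats = np.abs(g @ centered) / (np.sqrt(n) * sd_safe)
    crit = np.quantile(stats.max(axis=1), 1.0 - alpha)
    half = crit * sd / np.sqrt(n)
    return mean - half, mean + half

# Hypothetical data: 200 items scored at 3 tuning budgets.
demo_scores = np.random.default_rng(1).random((200, 3))
lower, upper = multiplier_bootstrap_bands(demo_scores)
```

Taking the maximum over budgets is what makes the bands simultaneous: all budget-wise intervals hold jointly at the stated level, which is the kind of guarantee needed for comparing points on a performance curve.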

2605.04918 2026-05-08 math.AP cs.LG cs.NA math.NA

Neural Discovery of Strichartz Extremizers

Nicolás Valenzuela, Ricardo Freire, Claudio Muñoz

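Affiliations Departamento de Ingeniería Matemática, Universidad de Chile; Universidad de Chile; DIM & CMM (UMI 2807 CNRS)

AI summary This paper proposes a neural-network-based pipeline for finding extremizers of Strichartz inequalities, validates it in three settings, finds that the extremizers appear to be Gaussians, and proposes a new conjecture.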

Comments 38 pages, 26 figures; v.2: corrected typos

Abstract

Strichartz inequalities are a cornerstone of the modern theory of dispersive PDEs, but their extremizers are known explicitly only in a handful of sharp cases. The non-convexity of the underlying functional makes the problem hard, and to our knowledge no systematic numerical attack has been attempted. We propose a simple neural-network-based pipeline that searches for extremizers as critical points of the Strichartz ratio, and apply it in three settings. First, on the Schrödinger group we recover the Gaussian extremizers of Foschi and Hundertmark--Zharnitsky in dimensions $d=1,2$ to within $10^{-3}$ relative error, with no analytical prior. Second, on $59$ further admissible pairs in $d=1$ where the answer is conjectural, the method consistently finds Gaussians, supporting the conjecture that Gaussians are the universal extremizers in the admissible range. Third, on the critical Airy--Strichartz inequality at $γ=1/q$, where existence is open, the optimization does not converge to any $L^2$ profile: instead, the iterates organize themselves as mKdV breathers $B(0,\cdot;α,1,0,0)$ with growing internal frequency $α$, and the discovered ratio approaches the Frank--Sabin universal lower bound $\widetilde A_{q,r}$ from below with a power-law gap $\sim α^{-0.9}$. We confirm the same picture with an independent Hermite-basis ansatz. We propose a precise conjecture: the supremum equals $\widetilde A_{q,r}$ and is approached, but not attained, along the breather family. The pipeline thus serves both as a validator on known cases and as a discovery tool when no extremizer exists.

2605.04510 2026-05-08 math.OC cs.AI cs.LG

Predictive and Prescriptive AI toward Optimizing Wildfire Suppression

Leonard Boussioux, Alexandre Jacquillat, Ryne Reger, Jacob Wachspress

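Affiliations Michael G. Foster School of Business and Paul G. Allen School of Computer Science & Engineering, University of Washington

AI summary This paper proposes a predictive and prescriptive AI approach that jointly optimizes crew assignments and wildfire suppression, using an integer optimization model and a two-sided column generation algorithm to improve suppression effectiveness and reduce area burned.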

Abstract

Intense wildfire seasons require critical prioritization decisions to allocate scarce suppression resources over a dispersed geographical area. This paper develops a predictive and prescriptive approach to jointly optimize crew assignments and wildfire suppression. The problem features a discrete resource-allocation structure with endogenous wildfire demand and non-linear wildfire dynamics. We formulate an integer optimization model with crew assignments on a time-space-rest network, wildfire dynamics on a time-state network, and linking constraints between them. We develop a two-sided branch-and-price-and-cut algorithm based on: (i) a two-sided column generation scheme that generates fire suppression plans and crew routes iteratively; (ii) a new family of cuts exploiting the knapsack structure of the linking constraints; and (iii) novel branching rules to accommodate non-linear wildfire dynamics. We also propose a data-driven double machine learning approach to estimate wildfire spread as a function of covariate information and suppression efforts, mitigating observed confounding between historical crew assignments and wildfire growth. Extensive computational experiments show that the optimization algorithm scales to otherwise intractable real-world instances; and that the methodology can enhance suppression effectiveness in practice, resulting in significant reductions in area burned over a wildfire season and guiding resource sharing across wildfire jurisdictions.

2605.03482 2026-05-08 cs.CR cs.AI cs.LG

MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents

Ishrith Gowda

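Affiliations Department of Electrical Engineering and Computer Sciences; University of California, Berkeley; Berkeley AI Research

AI summary This paper proposes MEMSAD, which detects memory poisoning attacks based on a gradient coupling theorem, proves a correct-classification guarantee regardless of adversary strategy, and shows experimentally that composite defenses achieve high detection rates and low false-positive rates across all attacks.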

Comments 28 pages, 9 figures, 6 theorems. Submitted to NeurIPS 2026

英文摘要

Persistent external memory enables LLM agents to maintain context across sessions, yet its security properties remain formally uncharacterized. We formalize memory poisoning attacks on retrieval-augmented agents as a Stackelberg game with a unified evaluation framework spanning three attack classes with escalating access assumptions. Correcting an evaluation protocol inconsistency in the triggered-query specification of Chen et al. (2024), we show faithful evaluation increases measured attack success by $4\times$ (ASR-R: $0.25 \to 1.00$). Our primary contribution is MEMSAD (Semantic Anomaly Detection), a calibration-based defense grounded in a gradient coupling theorem: under encoder regularity, the anomaly score gradient and the retrieval objective gradient are provably identical, so any continuous perturbation that reduces detection risk necessarily degrades retrieval rank. This coupling yields a certified detection radius guaranteeing correct classification regardless of adversary strategy. We prove minimax optimality via Le Cam's method, showing any threshold detector requires $Ω(1/ρ^2)$ calibration samples and MEMSAD achieves this up to $\log(1/δ)$ factors. We further derive online regret bounds for rolling calibration at rate $O(σ^{2/3}Δ^{1/3})$, and formally characterize a discrete synonym-invariance loophole that marks the boundary of what continuous-space defenses can guarantee. Experiments on a $3 \times 5$ attack-defense matrix with bootstrap confidence intervals, Bonferroni-corrected hypothesis tests, and Clopper-Pearson validation ($n=1{,}000$) confirm: composite defenses achieve TPR $= 1.00$, FPR $= 0.00$ across all attacks, while synonym substitution evades detection at $Δ$ ASR-R $\approx 0$, exposing a gap existing embedding-based defenses cannot close.
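To put a reported FPR of 0.00 at n = 1,000 in context, the sketch below computes the Clopper-Pearson limits for the two edge cases (zero events, or all events, in n trials), where the exact interval has a closed form. The function names and the two-sided 95% level are assumptions made for illustration; the abstract does not state which confidence level or per-cell sample size was used.

```python
def clopper_pearson_upper_zero_events(n: int, alpha: float = 0.05) -> float:
    """Two-sided Clopper-Pearson upper limit when 0 events occur in n trials.

    Solves (1 - p)^n = alpha / 2 for p.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (alpha / 2.0) ** (1.0 / n)


def clopper_pearson_lower_all_events(n: int, alpha: float = 0.05) -> float:
    """Two-sided Clopper-Pearson lower limit when all n trials are events.

    Solves p^n = alpha / 2 for p.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return (alpha / 2.0) ** (1.0 / n)


# Assumed setting: 1,000 trials, no false positives, no missed detections.
fpr_upper = clopper_pearson_upper_zero_events(1000)
tpr_lower = clopper_pearson_lower_all_events(1000)
print(f"FPR upper limit: {fpr_upper:.5f}")
print(f"TPR lower limit: {tpr_lower:.5f}")
```

Under these assumptions, an observed FPR of 0.00 is still compatible with a true false-positive rate of roughly 0.37%, which is why an exact interval is reported alongside the point estimate.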

2605.03213 2026-05-08 cs.CR cs.AI

When Agents Handle Secrets: A Survey of Confidential Computing for Agentic AI

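Javad Forough, Marios Kogias, Hamed Haddadi

Affiliations: Department of Computing, Imperial College London, London, United Kingdom

AI summary: This paper surveys confidential computing for agentic AI, analyzes the design of six TEE platforms, proposes an agent-centric threat model, and identifies six open challenges, with the aim of providing a security foundation for production agentic AI.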

Abstract

Agentic AI systems, specifically LLM-driven agents that plan, invoke tools, maintain persistent memory, and delegate tasks to peer agents via protocols such as MCP and A2A, introduce a threat surface that differs materially from standalone model inference. Agents accumulate sensitive context, hold credentials, and operate across pipelines no single party fully controls, enabling prompt injection, context exfiltration, credential theft, and inter-agent message poisoning. Current defenses operate entirely within the software stack and can be silently bypassed by a sufficiently privileged adversary such as a compromised cloud operator. Confidential computing (CC) offers a hardware-rooted alternative: Trusted Execution Environments (TEEs) isolate agent code and data from privileged system software, while remote attestation enables verifiable trust across distributed deployments. This survey synthesizes the design space in four parts: (i) a unified taxonomy of six TEE platforms (Intel SGX, Intel TDX, AMD SEV-SNP, ARM TrustZone, ARM CCA, and NVIDIA H100 CC) covering deployment roles and performance tradeoffs; (ii) an agent-centric threat model spanning perception, planning, memory, action, and coordination layers mapped to nine security goals; (iii) a comparative survey of CC-based defenses distinguishing findings that transfer from single-call inference versus what requires new agentic designs; and (iv) six open challenges including compound attestation for multi-hop agent chains and GPU-TEE performance at LLM scale. While several hardware trust primitives appear mature enough for targeted deployments, no broadly established end-to-end framework yet binds them into a coherent security substrate for production agentic AI.

2604.20050 2026-05-08 econ.GN cs.AI cs.GT q-fin.EC

Information Aggregation with AI Agents

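Spyros Galanis

Affiliations: Department of Economics, University of Durham

AI summary: This study examines whether large language models can aggregate dispersed private information by trading and observing price movements. It finds that aggregation degrades significantly when the information structure is complex, and that smarter AI agents aggregate better and are more profitable.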

Comments 64 pages

Abstract

Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. We find that although the median market is effective at aggregating information in the easy information structures, increasing the complexity has a significant and negative impact, suggesting that AI agents may suffer from similar limitations as humans when reasoning about others. Consistent with our theoretical predictions, information aggregation remains unaffected by allowing cheap talk communication, changing the duration of the market or initial price, and strategic prompting, thus demonstrating that prediction markets are robust. We establish that "smarter" AI agents perform better at aggregation and they are more profitable. Surprisingly, giving them feedback about past performance has no impact on aggregation.

2604.06269 2026-05-08 q-bio.QM cs.AI

MAT-Cell: A Multi-Agent Tree-Structured Reasoning Framework for Batch-Level Single-Cell Annotation

Yehui Yang, Zelin Zang, Xienan Zheng, Yuzhe Jia, Changxi Chi, Jingbo Zhou, Chang Yu, Jinlin Wu, Fuji Yang, Jiebo Luo, Zhen Lei, Stan Z. Li

Affiliations: Westlake University; Shenzhen University of Advanced Technology; Center for Artificial Intelligence and Robotics; Hong Kong Institute of Science and Innovation; Chinese Academy of Sciences; University Key Laboratory of Information and Communication Security Backup and Recovery

AI summary: MAT-Cell separates evidence grounding from label decision and combines Reverse Verification Query with multi-round debate to improve the accuracy and traceability of batch-level single-cell annotation.

Abstract

Automated single-cell annotation is difficult when the most abundant genes are not the most discriminative ones, or when a target state is poorly covered by a fixed reference atlas. GPTCelltype-style one-shot prompting allows large language models (LLMs) to produce plausible labels from generic expression signals, while reference-based annotators can force unfamiliar states into the nearest known category. We propose MAT-Cell, a prompt-driven framework for batch-level single-cell annotation that separates evidence grounding from label decision. MAT-Cell first uses Reverse Verification Query (RVQ) to combine tissue context, observed differentially expressed genes, and LLM-elicited biological priors into structured candidate-specific premises. Verifier agents then convert these premises into explicit premise-to-claim reasoning trees, and bounded multi-round debate compares, challenges, and revises the resulting claims before consensus or final adjudication. The returned Syllogistic Derivation Tree (SDT) provides an auditable debate trace rather than a formal proof of the annotation. In open-candidate benchmarks across five datasets, a locally deployed Qwen3-30B model with MAT-Cell achieves 75.5% average accuracy, compared with 64.2% for the strongest evaluated CoT baseline and 51.9% for the strongest evaluated scPilot variant. In oracle-candidate benchmarks across three species, MAT-Cell remains competitive across backbones, and local inference substantially reduces monetary cost for batch annotation. Code is available at: https://anonymous.4open.science/r/MATCell-4067

2603.23055 2026-05-08 stat.ML cs.IT cs.LG math.IT

Post-Selection Distributional Model Evaluation

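Amirmohammad Farzaneh, Osvaldo Simeone

Affiliations: Institute for Intelligent Networked Systems (INSI)

AI summary: This paper proposes PS-DME, a framework for statistically valid distributional model evaluation after data-dependent model pre-selection. It controls the post-selection false coverage rate, is more sample efficient than sample splitting under stated conditions, and is shown experimentally to support reliable exploration of performance-reliability trade-offs.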

Abstract

Formal model evaluation methods typically certify that a model satisfies a prescribed target key performance indicator (KPI) level. However, in many applications, the relevant target KPI level may not be known a priori, and the user may instead wish to compare candidate models by analyzing the full trade-offs between performance and reliability achievable at test time by the models. This task, which requires reliable estimation of the test-time KPI distributions, is made more complicated by the fact that the same data must often be used both to pre-select a subset of candidate models and to estimate their KPI distributions, causing a potential post-selection bias. In this work, we introduce post-selection distributional model evaluation (PS-DME), a general framework for statistically valid distributional model assessment after arbitrary data-dependent model pre-selection. Building on e-values, PS-DME controls the post-selection false coverage rate (FCR) for the distributional KPI estimates, and we establish explicit conditions under which it is provably more sample efficient than a baseline method based on sample splitting. Experiments on synthetic data, text-to-SQL decoding with large language models, and telecom network performance evaluation demonstrate that PS-DME enables reliable comparison of candidate configurations across a range of reliability levels, supporting the statistically reliable exploration of performance-reliability trade-offs.

2603.12278 2026-05-08 q-bio.OT cs.AI cs.LG

Unsupervised Anomaly Detection in Wearable Foot Sensor Data: A Baseline Feasibility Study Towards Diabetic Foot Ulcer Prevention

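Md Tanvir Hasan Turja

Affiliations: Department of Computer Science, Middlesex University London

AI summary: This paper studies unsupervised algorithms for detecting anomalies in wearable foot sensor data, analyzing temperature and pressure signals to establish a baseline model and assess feasibility for diabetic foot ulcer prevention.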

Comments 36 pages, 19 figures. Published in Biomedical Signal Processing and Control, Vol. 123, Part A, 110416, September 2026. https://doi.org/10.1016/j.bspc.2026.110416

Journal ref Biomedical Signal Processing and Control, Vol. 123, Part A, 110416 (2026)

Abstract

Diabetic foot ulcers (DFUs) are a severe complication of diabetes associated with significant morbidity, amputation risk, and healthcare burden. Developing effective continuous monitoring frameworks requires first establishing reliable baseline models of normal foot biomechanics. This paper presents a feasibility study of an anomaly detection framework applied to time-series data from wearable foot sensors, specifically NTC thin-film thermocouples for temperature and FlexiForce A401 pressure sensors for plantar load monitoring. Data were collected from healthy adult subjects across 312 capture sessions on an instrumented pathway, generating 93,790 valid multi-sensor readings spanning September 2023 to June 2024. Two unsupervised algorithms, Isolation Forest and K-Nearest Neighbors using Local Outlier Factor (KNN/LOF), were applied to detect statistical deviations in foot temperature and pressure signals. Results show that Isolation Forest is more sensitive to subtle, distributed anomalies, while KNN/LOF identifies concentrated extreme deviations but flags a higher proportion of sessions not corroborated by Isolation Forest. Since no clinical ground truth is available, this difference is interpreted as lower specificity under the shared 5 percent contamination assumption rather than a confirmed false-positive rate. A mild positive correlation (0.41-0.48) between pressure and temperature features supports the case for combined multi-modal monitoring. These findings establish a validated baseline analytical pipeline and provide a methodological foundation for future clinical validation studies involving diabetic patients, where the relationship between detected anomalies and DFU-related pathophysiology can be directly assessed.

2603.12031 2026-05-08 cs.DC cs.LG cs.MA

AGMARL-DKS: An Adaptive Graph-Enhanced Multi-Agent Reinforcement Learning for Dynamic Kubernetes Scheduling

Hamed Hamzeh

Affiliations: Computer Science, University of Westminster

AI summary: This paper proposes AGMARL-DKS, which addresses dynamic Kubernetes scheduling with multi-agent reinforcement learning, introducing a graph neural network and a stress-aware policy to improve fault tolerance and resource utilisation.

Abstract

State-of-the-art cloud-native applications require intelligent schedulers that can effectively balance system stability, resource utilisation, and associated costs. While Kubernetes provides feasibility-based placement by default, recent research efforts have explored the use of reinforcement learning (RL) for more intelligent scheduling decisions. However, current RL-based schedulers have three major limitations. First, most of these schedulers use monolithic centralised agents, which are non-scalable for large heterogeneous clusters. Second, the ones that use multi-objective reward functions assume simple, static, linear combinations of the objectives. Third, no previous work has produced a stress-aware scheduler that can react adaptively to dynamic conditions. To address these gaps in current research, we propose the Adaptive Graph-enhanced Multi-Agent Reinforcement Learning Dynamic Kubernetes Scheduler (AGMARL-DKS). AGMARL-DKS addresses these gaps by introducing three major innovations. First, we construct a scalable solution by treating the scheduling challenge as a cooperative multi-agent problem, where every cluster node operates as an agent, employing centralised training methods before decentralised execution. Second, to be context-aware and yet decentralised, we use a Graph Neural Network (GNN) to build a state representation of the global cluster context at each agent. This represents an improvement over methods that rely solely on local observations. Finally, to make trade-offs between these objectives, we use a stress-aware lexicographical ordering policy instead of a simple, static linear weighting of these objectives. The evaluations in Google Kubernetes Engine (GKE) reveal that AGMARL-DKS significantly outperforms the default scheduler in terms of fault tolerance, utilisation, and cost, especially in scheduling batch and mission-critical workloads.
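The contrast between a static linear weighting and a stress-aware lexicographic ordering can be illustrated with a small sketch. Everything here is hypothetical: the `NodeScore` fields, the two priority orders, and the example numbers are assumptions for illustration, not the policy defined in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeScore:
    """Hypothetical per-node objective estimates; lower is better for each."""
    name: str
    failure_risk: float
    unused_capacity: float
    cost: float


def ordering_key(score: NodeScore, under_stress: bool) -> tuple:
    # Python compares tuples lexicographically: a later entry only breaks
    # ties among earlier ones, so no objective weights have to be tuned.
    if under_stress:
        # Assumed priority under stress: stability, then utilisation, then cost.
        return (score.failure_risk, score.unused_capacity, score.cost)
    # Assumed priority when calm: cost, then utilisation, then stability.
    return (score.cost, score.unused_capacity, score.failure_risk)


def pick_node(candidates, under_stress: bool) -> NodeScore:
    """Return the candidate that is first in the active lexicographic order."""
    return min(candidates, key=lambda s: ordering_key(s, under_stress))


candidates = [
    NodeScore("node-a", failure_risk=0.10, unused_capacity=0.30, cost=1.0),
    NodeScore("node-b", failure_risk=0.02, unused_capacity=0.50, cost=3.0),
    NodeScore("node-c", failure_risk=0.02, unused_capacity=0.20, cost=4.0),
]
calm_choice = pick_node(candidates, under_stress=False).name
stress_choice = pick_node(candidates, under_stress=True).name
print(calm_choice, stress_choice)
```

In this toy setting the cheapest node wins when the cluster is calm, while under stress the two lowest-risk nodes tie on the first objective and the tie is broken by utilisation, which a fixed weighted sum cannot express without retuning its weights.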

2603.02950 2026-05-08 cs.CY cs.AI cs.GT

Path Dependence under Adaptive AI Delegation

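Lingxiao Huang, Nisheeth K. Vishnoi

Affiliations: Nanjing University; Yale University

AI summary: This study examines how adaptive AI delegation affects long-run skill and independent performance, revealing path dependence and the coupling between skill change and reliance.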

Abstract

Repeated AI assistance can improve immediate task performance while reducing the skill available for future independent work. We develop a mathematical framework for this long-run tradeoff. The model tracks two state variables: a latent human skill level governing expected independent performance, and a delegation level representing the learner's evolving tendency to rely on AI. Skill changes through error-driven learning under practice and decay under delegation; delegation responds to observed performance, increasing when AI-assisted work appears to outperform independent work. We analyze the resulting dynamics and contrast them with fixed delegation. With fixed delegation, skill follows a one-dimensional learning-decay process with a single stable equilibrium. With adaptive delegation, the coupled system has two attracting equilibria separated by the stable manifold of an interior saddle. The existence and geometry of this separatrix require a global phase-plane analysis of the coupled dynamics. The system is path-dependent: small differences in initial skill or reliance can lead to different long-run outcomes. We use this characterization to show that AI assistance can improve short-run performance while producing worse long-run performance than a no-AI baseline. Increasing AI capability can enlarge the basin of attraction of the low-skill equilibrium, making delegation appear beneficial for longer while increasing the risk of eventual skill loss. The qualitative picture is observed to persist across alternative specifications. Together, these results show that the risk is not AI assistance itself, but the coupling between performance-driven reliance and use-dependent skill change.
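The bistability described above can be reproduced in a toy system. The sketch below is not the paper's model: the functional forms, the parameter values, and the names `simulate`, `ai_level`, `learn`, `decay`, and `adapt` are all assumptions chosen so that skill grows with practice, decays under delegation, and delegation rises whenever the AI level exceeds current skill.

```python
def simulate(skill, delegation, ai_level=0.6, learn=1.0, decay=1.0,
             adapt=5.0, dt=0.01, steps=10_000):
    """Forward-Euler integration of a toy skill/delegation system.

    Hypothetical dynamics (not the paper's equations), with s = skill and
    d = delegation, both in [0, 1]:
        s' = (1 - d) * learn * (1 - s) - d * decay * s
        d' = adapt * d * (1 - d) * (ai_level - s)
    """
    s, d = skill, delegation
    for _ in range(steps):
        ds = (1.0 - d) * learn * (1.0 - s) - d * decay * s
        dd = adapt * d * (1.0 - d) * (ai_level - s)
        s, d = s + dt * ds, d + dt * dd
    return s, d


# Two starting points on opposite sides of the interior saddle at (0.6, 0.4).
high = simulate(skill=0.7, delegation=0.2)
low = simulate(skill=0.5, delegation=0.6)
print("high-skill start ->", high)
print("low-skill start  ->", low)
```

With these assumed dynamics the first trajectory settles near full skill and no delegation, and the second near no skill and full delegation, so a modest difference in the starting point decides the long-run outcome.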

2603.01192 2026-05-08 stat.ML cs.LG

A Basin-Selection Perspective on Grokking via Singular Learning Theory

Ben Cullen, Sergio Estan-Ruiz, Riya Danait, Jiayi Li

Affiliations: Department of Computer Science; Department of Mathematics; Mathematical Institute; University of Pisa; Imperial College London; University of Oxford; Section of Mathematics and Artificial Intelligence; Max Planck Institute of Molecular Cell Biology and Genetics; Center for Systems Biology; Faculty of Mathematics; TU Dresden

AI summary: This paper studies grokking through the lens of singular learning theory, analyzing the geometry of the loss landscape to explain the transition from memorization to generalisation, and derives analytic formulas for the local learning coefficient in shallow quadratic networks.

Abstract

Grokking, the abrupt transition from memorization to generalisation after extended training, suggests the presence of competing solution basins with distinct statistical properties. We study this phenomenon through the lens of Singular Learning Theory (SLT), a Bayesian framework that characterizes the geometry of the loss landscape. The key measure is the local learning coefficient (LLC) which quantifies the local degeneracy of the loss surface. SLT links lower-LLC basins to higher posterior mass concentration and lower expected generalisation error. Leveraging SLT, we develop a basin-selection perspective on grokking in quadratic networks: LLC ranks competing near-zero-loss basins by statistical preference, while the training-time transition between them is governed by optimisation dynamics. In this view, grokking corresponds to a transition from a higher-LLC (memorising) basin to a lower-LLC (generalising) basin that dominates the posterior. To support this, we derive analytic formulas for the LLC in shallow quadratic networks under both lazy and feature learning regimes. Empirically, we demonstrate that LLC trajectories estimated from training data track the onset of generalisation and provide an informative probe of the optimisation path.

2603.00113 2026-05-08 cs.MA cs.AI cs.CE cs.CY cs.SI

AI Agents Alone Are Not (Yet) Sufficient for Social Simulation

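Yiming Li, Dacheng Tao

Affiliations: College of Computing and Data Science, Nanyang Technological University, Singapore

AI summary: This paper argues that LLM agents alone are not sufficient to produce realistic social dynamics and proposes a unified formulation that accounts for environment interaction and scheduling mechanisms.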

Comments 16 pages

Abstract

Recent advances in large language models (LLMs) have spurred growing interest in using LLM-integrated agents for social simulation, often under the implicit assumption that realistic population dynamics will emerge once role-specified agents are placed in a networked multi-agent setting. This position paper argues that LLM-based agents alone are not (yet) sufficient for social simulation. We attribute this over-optimism to a systematic mismatch between what current agent pipelines are typically optimized and validated to produce and what simulation-as-science requires. Concretely, role-playing plausibility does not imply faithful human behavioral validity; collective outcomes are frequently mediated by agent-environment co-dynamics rather than agent-agent messaging alone; and results can be dominated by interaction protocols, scheduling, and initial information priors. To make these underlying mechanisms explicit and auditable, we propose a unified formulation of AI agent-based social simulation as an environment-involved Markov game with explicit exposure and scheduling mechanisms, from which we derive concrete actions for design, evaluation, and interpretation.

2602.12805 2026-05-08 physics.med-ph cs.SD eess.IV

A Wavefield Correlation Approach to Improve Sound Speed Estimation in Ultrasound Autofocusing

Louise Zhuang, Samuel Beuret, Ben Frey, Saachi Munot, Walter Simson, Dongwoon Hyun, Jeremy J. Dahl

Affiliations: Department of Electrical Engineering, Stanford University; Department of Radiology, School of Medicine, Stanford University; Department of Applied Physics, Stanford University; Department of Biomedical Engineering, Columbia University

AI summary: This paper proposes using wavefield correlation beamforming to improve sound speed estimation in ultrasound autofocusing, reducing clutter and improving the resolution and contrast of the corrected image.

Abstract

In pulse-echo ultrasound, aberration often degrades image quality when beamforming does not account for wavefront distortions. To address this issue, local sound speed estimators have been developed in the past decade for distributed aberration correction. Recently, methods based on iterative optimization have improved sound speed accuracy with respect to earlier approaches. However, the accuracy of these newer methods is limited by media with reverberation clutter and by the straight-ray model of wave propagation. To address these challenges, we propose using wavefield correlation (WFC) beamforming when performing sound speed optimization. WFC, an ultrasound adaptation of reverse time migration, correlates simulated forward-propagated transmit wavefields and backward-propagated receive wavefields in order to reconstruct images. This process more accurately models wave propagation in heterogeneous media and can decrease diffuse clutter due to its spatiotemporal matched filtering effect. We implement herein a WFC beamformer using automatic differentiation software and estimate the sound speed map by optimizing a regularized common-midpoint phase focusing criterion using gradient descent. This approach is compared to a previous method relying on delay-and-sum (DAS) with straight-ray time delay calculations on a variety of simulated, phantom, and in vivo data with large sound speed variations and clutter. Results show that using WFC decreases sound speed estimation error, leading to improvements in resolution and contrast in the corrected image. In particular, these promising results have potential to improve pulse-echo imaging for challenging clinical scenarios.
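For readers unfamiliar with the baseline, the sketch below shows delay-and-sum with straight-ray delays on synthetic data, and how an incorrect sound speed defocuses the sum. It is a minimal illustration under stated assumptions (a plane-wave transmit, a 9-element array, a 40 MHz sampling rate, one ideal point scatterer, nearest-sample rounding); none of these values or function names come from the paper, and it does not implement WFC.

```python
import math

SOUND_SPEED = 1540.0   # m/s, assumed constant (straight-ray model)
SAMPLE_RATE = 40e6     # Hz, hypothetical sampling rate
ELEMENT_X = [-0.01 + 0.0025 * i for i in range(9)]  # 9 elements at depth 0


def delay_samples(px, pz, element_x, c=SOUND_SPEED, fs=SAMPLE_RATE):
    """Two-way delay in samples: plane-wave transmit down to depth pz,
    then a straight ray from the point (px, pz) back to one element."""
    transmit = pz / c
    receive = math.hypot(px - element_x, pz) / c
    return round((transmit + receive) * fs)


def synthesize(scatterer, n_samples=4096):
    """Channel data holding one unit impulse per element for a point scatterer."""
    data = [[0.0] * n_samples for _ in ELEMENT_X]
    for channel, ex in zip(data, ELEMENT_X):
        channel[delay_samples(*scatterer, ex)] = 1.0
    return data


def das(data, px, pz, c=SOUND_SPEED):
    """Delay-and-sum: read each channel at its predicted delay and add."""
    total = 0.0
    for channel, ex in zip(data, ELEMENT_X):
        idx = delay_samples(px, pz, ex, c=c)
        if 0 <= idx < len(channel):
            total += channel[idx]
    return total


scatterer = (0.0, 0.03)                         # 3 cm deep, on the array axis
data = synthesize(scatterer)
focused = das(data, *scatterer)                 # beamformed with the true speed
mismatched = das(data, *scatterer, c=1480.0)    # beamformed with a wrong speed
print(focused, mismatched)
```

With the correct sound speed every channel contributes coherently at the scatterer location, whereas the mismatched speed reads the channels at the wrong samples and the sum drops, which is the focusing loss that sound speed estimation aims to remove.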

2602.08318 2026-05-08 stat.ML cs.LG nlin.CD

Is Flow Matching Just Trajectory Replay for Sequential Data?

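Soon Hoe Lim, Shizheng Lin, Michael W. Mahoney, N. Benjamin Erichson

Affiliations: Department of Mathematics; Nordita, KTH Royal Institute of Technology and Stockholm University; Department of Statistics; International Computer Science Institute; Lawrence Berkeley National Laboratory

AI summary: This paper asks whether flow matching learns transferable dynamical structure or merely performs effective trajectory replay. By deriving the velocity field in the limit of perfect function approximation, it shows that neural flow matching models are parametric surrogates of a nonparametric solution and suggests approximation schemes for robust ODE-based generation.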

Comments 56 pages

Abstract

Flow matching (FM) is increasingly used in scientific domains for time series generation and forecasting, where data often arise from underlying dynamical systems. However, it is not well understood whether it learns transferable dynamical structure or simply performs an effective "trajectory replay". We study this question by deriving the velocity field targeted by the empirical FM objective on sequential data in the limit of perfect function approximation. For the Gaussian conditional paths commonly used in practice, we show that the implied sampler is an ODE whose dynamics constitute a nonparametric, memory-augmented continuous-time dynamical system. The optimal field admits a closed-form expression as a similarity-weighted mixture of instantaneous velocities induced by observed transitions, making the dataset dependence explicit and interpretable. This characterization positions neural FM models as parametric surrogates of an ideal nonparametric solution and suggests practical approximation schemes for robust ODE-based generation. As a byproduct of our analysis, the resulting closed-form sampler, FreeFM, provides strong probabilistic forecasts on nonlinear dynamical system benchmarks directly from historical transitions, without training.
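The idea of a closed-form, similarity-weighted velocity field can be illustrated in the simplest setting. The sketch below assumes i.i.d. 1-D data, a standard normal source, and the linear Gaussian path x_t = t·x1 + (1 − t)·x0; the paper's formulation on sequential transitions (and FreeFM itself) may differ, and the function names are hypothetical.

```python
import math


def optimal_velocity(x, t, data):
    """Marginal flow-matching velocity for an empirical 1-D dataset.

    Assumes x_t = t * x1 + (1 - t) * x0 with x0 ~ N(0, 1), so that
    p_t(x | x1) = N(t * x1, (1 - t)^2) and u_t(x | x1) = (x1 - x) / (1 - t).
    The marginal field is the posterior-weighted mix of conditional velocities.
    """
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1)")
    sigma = 1.0 - t
    logits = [-((x - t * x1) ** 2) / (2.0 * sigma ** 2) for x1 in data]
    top = max(logits)                      # subtract the max for stability
    unnorm = [math.exp(v - top) for v in logits]
    z = sum(unnorm)
    weights = [u / z for u in unnorm]      # similarity weights, summing to 1
    return sum(w * (x1 - x) / sigma for w, x1 in zip(weights, data))


def sample(x0, data, steps=1000, t_end=0.99):
    """Euler-integrate dx/dt = optimal_velocity(x, t) from t = 0 to t_end."""
    x, dt = x0, t_end / steps
    for k in range(steps):
        x += dt * optimal_velocity(x, k * dt, data)
    return x


data = [-1.0, 1.0]
endpoints = [sample(x0, data) for x0 in (-0.8, 0.8)]
print(endpoints)
```

In this simplified setting the training-free sampler carries each starting point toward one of the stored data points, which is the "replay" behaviour the abstract asks about: the field depends on the dataset only through the similarity weights.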

2602.07633 2026-05-08 stat.ML cs.LG stat.ME

Flow-Based Conformal Predictive Distributions

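Trevor Harris

Affiliations: Department of Statistics, University of Connecticut, Storrs, CT 06269

AI summary: This paper proposes a flow-based method for efficiently sampling conformal prediction boundaries in arbitrary dimensions, obtains conformal predictive distributions by mixing across confidence levels, and evaluates the approach in several domains.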

Comments 9 pages, 15 figures, 20 appendix pages

Abstract

Conformal prediction provides a distribution-free framework for uncertainty quantification via prediction sets with exact finite-sample coverage. In low dimensions these sets are easy to interpret, but in high-dimensional or structured output spaces they are difficult to represent and use, which can limit their ability to integrate with downstream tasks such as sampling and probabilistic forecasting. We show that any sufficiently regular differentiable nonconformity score induces a deterministic flow on the output space whose trajectories converge to the boundary of the corresponding conformal prediction set. This leads to a computationally efficient, training-free method for sampling conformal boundaries in arbitrary dimensions. Mixing across confidence levels yields conformal predictive distributions whose quantile regions coincide with the empirical conformal prediction sets. We provide an approximation bound decomposing CPD predictive error into score-induced distortion, base-measure quality, and gradient flow-induced distortion. We evaluate the approach on PDE inverse problems, precipitation downscaling, climate model debiasing, and hurricane trajectory forecasting.
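One way to picture a score-induced flow that converges to a conformal boundary is sketched below. The split-conformal threshold is standard, but the particular flow dy/dt = −(s(y) − q)∇s(y), the Euclidean score, and all names are assumptions for illustration; the abstract does not specify the paper's construction.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal threshold: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return math.inf                    # too few calibration points
    return sorted(scores)[rank - 1]


def score(y, center):
    """Hypothetical nonconformity score: distance to a point prediction."""
    return math.dist(y, center)


def flow_to_boundary(y, center, q, step=0.1, iters=200):
    """Euler steps of dy/dt = -(s(y) - q) * grad s(y), driving s(y) toward q."""
    y = list(y)
    for _ in range(iters):
        s = score(y, center)
        if s == 0.0:
            raise ValueError("gradient is undefined at the prediction itself")
        grad = [(yi - ci) / s for yi, ci in zip(y, center)]
        y = [yi - step * (s - q) * gi for yi, gi in zip(y, grad)]
    return y


# Made-up calibration scores and a point prediction at the origin.
calibration_scores = [0.1 * k for k in range(1, 20)]
q = conformal_quantile(calibration_scores, alpha=0.1)
boundary_point = flow_to_boundary((3.0, 4.0), (0.0, 0.0), q)
print(q, boundary_point, score(boundary_point, (0.0, 0.0)))
```

Starting points inside or outside the prediction set both move along the score gradient until the score equals the threshold, so running the flow from many starting points yields samples on the boundary without any training.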

2602.03258 2026-05-08 stat.ML cs.LG

Principled Federated Random Forests for Heterogeneous Data

Rémi Khellaf, Erwan Scornet, Aurélien Bellet, Julie Josse

Affiliations: Inria; PreMeDICaL; Inserm; University of Montpellier; Sorbonne Université; Université Paris Cité; CNRS; LPSM

AI summary: This paper proposes FedForest, a federated random forest algorithm for horizontally partitioned data that naturally handles client data heterogeneity, approximates the centralized algorithm's splits by aggregating carefully chosen client statistics, and enables non-parametric personalization.

Abstract

Random Forests (RF) are among the most powerful and widely used predictive models for centralized tabular data, yet few methods exist to adapt them to the federated learning setting. Unlike most federated learning approaches, the piecewise-constant nature of RF prevents exact gradient-based optimization. As a result, existing federated RF implementations rely on unprincipled heuristics: for instance, aggregating decision trees trained independently on clients fails to optimize the global impurity criterion, even under simple distribution shifts. We propose FedForest, a new federated RF algorithm for horizontally partitioned data that naturally accommodates diverse forms of client data heterogeneity, from covariate shift to more complex outcome shift mechanisms. We prove that our splitting procedure, based on aggregating carefully chosen client statistics, closely approximates the split selected by a centralized algorithm. Moreover, FedForest allows splits on client indicators, enabling a non-parametric form of personalization that is absent from prior federated random forest methods. Empirically, we demonstrate that the resulting federated forests closely match centralized performance across heterogeneous benchmarks while remaining communication-efficient.
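The idea of choosing a split from aggregated client statistics can be sketched for a squared-error regression tree, where the impurity of a candidate split depends only on counts, sums, and sums of squares. The abstract does not say which statistics FedForest aggregates, so this choice, the fixed threshold grid, and the function names are assumptions for illustration.

```python
def side_stats(xs, ys, threshold):
    """Per-client sufficient statistics (count, sum, sum of squares) per side."""
    def summarize(values):
        return (len(values), sum(values), sum(v * v for v in values))
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    return summarize(left), summarize(right)


def sse(n, s, ss):
    """Sum of squared errors around the mean, from (n, sum, sum of squares)."""
    return ss - s * s / n if n else 0.0


def federated_best_split(clients, thresholds):
    """Pick the threshold with the lowest global impurity from client stats."""
    best = None
    for t in thresholds:
        agg_left, agg_right = (0, 0.0, 0.0), (0, 0.0, 0.0)
        for xs, ys in clients:             # only statistics leave each client
            left, right = side_stats(xs, ys, t)
            agg_left = tuple(a + b for a, b in zip(agg_left, left))
            agg_right = tuple(a + b for a, b in zip(agg_right, right))
        impurity = sse(*agg_left) + sse(*agg_right)
        if best is None or impurity < best[1]:
            best = (t, impurity)
    return best


# Two clients with covariate shift: client A sees small x, client B mostly large x.
client_a = ([1.0, 2.0, 3.0], [1.0, 1.2, 0.9])
client_b = ([6.0, 7.0, 8.0, 2.5], [5.0, 5.2, 4.9, 1.1])
thresholds = [1.5, 2.75, 4.5, 6.5, 7.5]

federated = federated_best_split([client_a, client_b], thresholds)
pooled = (client_a[0] + client_b[0], client_a[1] + client_b[1])
centralized = federated_best_split([pooled], thresholds)
print(federated, centralized)
```

Client A alone never observes data on both sides of the best global threshold, so a tree trained locally on it could not find that split, whereas the aggregated statistics recover the same choice as pooling the data.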

2601.19886 2026-05-08 econ.GN cs.AI cs.CY cs.GT q-fin.EC

AI Cap-and-Trade: Efficiency Incentives for Accessibility and Sustainability

AI配额交易:为可及性和可持续性提升效率激励

Marco Bornstein, Amrit Singh Bedi

Affiliations * Independent Researcher; University of Central Florida

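AI summary This paper argues for market-based mechanisms that incentivize AI efficiency, reducing emissions and creating opportunities for academics and smaller companies, and it advocates a cap-and-trade system for AI.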

Comments 22 pages, 2 figures. Accepted as a position paper at ICML 2026

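AI-generated abstract (translated from Chinese)

The race for artificial intelligence (AI) dominance often values scale over efficiency. Hyper-scaling is the common industry practice: larger models, more data, and as much compute as possible. Using more resources is the simpler path to better AI performance, so efficiency has been de-emphasized. Consequently, the demand for costly compute has marginalized academics and smaller companies. At the same time, growing AI use has increased energy expenditure and raised environmental costs. To address accessibility and sustainability, we argue for researching and implementing market-based mechanisms that incentivize AI efficiency. We believe that incentivizing efficient operations and methods will reduce emissions while creating new opportunities for academics and smaller companies. As a call to action, we propose a cap-and-trade system for AI. Our system provably reduces the computation used in AI deployment, lowering emissions, and monetizes efficiency to the benefit of academics and smaller companies.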

English abstract

The race for artificial intelligence (AI) dominance often prioritizes scale over efficiency. Hyper-scaling is the common industry approach: larger models, more data, and as many computational resources as possible. Using more resources is a simpler path to improved AI performance. Thus, efficiency has been de-emphasized. Consequently, the need for costly computational resources has marginalized academics and smaller companies. Simultaneously, increased energy expenditure, due to growing AI use, has led to mounting environmental costs. In response to accessibility and sustainability concerns, we argue for research into, and implementation of, market-based methods that incentivize AI efficiency. We believe that incentivizing efficient operations and approaches will reduce emissions while opening new opportunities for academics and smaller companies. As a call to action, we propose a cap-and-trade system for AI. Our system provably reduces computations for AI deployment, thereby lowering emissions and monetizing efficiency to the benefit of academics and smaller companies.

2512.09538 2026-05-08 stat.ML cs.CL cs.LG

Don't Throw Away Your Beams: Improving Consistency-based Uncertainties in LLMs via Beam Search

不要抛弃你的光束:通过光束搜索改进大语言模型中的基于一致性不确定性的方法

Ekaterina Fadeeva, Maiya Goloburda, Aleksandr Rubashevskii, Roman Vashurin, Artem Shelmanov, Preslav Nakov, Mrinmaya Sachan, Maxim Panov

Affiliations * ETH Zurich; MBZUAI

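AI summary This paper improves consistency-based uncertainty quantification for large language models by generating candidates with beam search instead of multinomial sampling, which reduces variance and improves performance. Experiments on six QA datasets show state-of-the-art results.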

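AI-generated abstract (translated from Chinese)

Consistency-based methods have become an effective approach to uncertainty quantification (UQ) in large language models. They typically draw several generations by multinomial sampling and measure how well the generations agree. In short-form question answering, however, peaked distributions make multinomial sampling prone to duplicates, and its randomness causes considerable variance in the uncertainty estimates across runs. We introduce a new family of methods that use beam search to generate the candidates for consistency-based UQ, achieving better performance and lower variance than multinomial sampling. We also give a theoretical lower bound on the probability mass of the beam set; when the mass exceeds this bound, beam search has a smaller error than multinomial sampling. We evaluate our approach empirically on six QA datasets and find that its consistent improvements over multinomial sampling yield state-of-the-art UQ performance.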

English abstract

Consistency-based methods have emerged as an effective approach to uncertainty quantification (UQ) in large language models. These methods typically rely on several generations obtained via multinomial sampling, measuring their agreement level. However, in short-form QA, multinomial sampling is prone to producing duplicates due to peaked distributions, and its stochasticity introduces considerable variance in uncertainty estimates across runs. We introduce a new family of methods that employ beam search to generate candidates for consistency-based UQ, yielding improved performance and reduced variance compared to multinomial sampling. We also provide a theoretical lower bound on the beam set probability mass under which beam search achieves a smaller error than multinomial sampling. We empirically evaluate our approach on six QA datasets and find that its consistent improvements over multinomial sampling lead to state-of-the-art UQ performance.
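A minimal sketch of how a consistency score might be computed over beam candidates is given below. The abstract does not state the paper's estimators, so everything here is an assumption for illustration: candidates arrive as `(text, log_prob)` pairs, weights are the candidate probabilities renormalized over the beam set, and agreement is a case-insensitive exact match. The names `exact_match` and `weighted_inconsistency` are hypothetical. Because the beam set is deterministic and contains no duplicates by construction, the score does not vary across runs.

```python
import math
from typing import Callable, List, Tuple


def exact_match(a: str, b: str) -> float:
    """Toy agreement function: 1.0 if the normalized answers are identical."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def weighted_inconsistency(
    candidates: List[Tuple[str, float]],
    sim: Callable[[str, str], float] = exact_match,
) -> float:
    """Return 1 minus the probability-weighted pairwise agreement.

    candidates: (text, log_prob) pairs, e.g. the sequences kept by beam search.
    """
    # Renormalize probabilities over the candidate set (assumed weighting).
    top = max(lp for _, lp in candidates)
    weights = [math.exp(lp - top) for _, lp in candidates]
    z = sum(weights)
    weights = [w / z for w in weights]

    agreement = sum(
        wi * wj * sim(a, b)
        for (a, _), wi in zip(candidates, weights)
        for (b, _), wj in zip(candidates, weights)
    )
    return 1.0 - agreement


# Hypothetical beam output for a short-form question.
beams = [("Paris", math.log(0.6)), ("paris", math.log(0.2)), ("Lyon", math.log(0.2))]
score = weighted_inconsistency(beams)
```

A score of 0 means all probability mass sits on mutually agreeing answers; higher values indicate that the mass is spread over conflicting answers. Swapping `exact_match` for a semantic similarity function would follow the same pattern.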