arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30662 2026-06-01 cs.LG q-bio.PE

Spatio-temporal stochastic graph-based learning for infectious disease forecasting

Luz Stefani Sotomayor Valenzuela, Susanna Cramb, Darren Wraith

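Affiliations School of Public Health and Social Work, Queensland University of Technology; QUT Centre for Data Science, Queensland University of Technology

AI summary Proposes a spatio-temporal graph architecture that integrates a stochastic formulation and an uncertainty approximation process to forecast new infectious disease cases, showing competitive performance on COVID-19 and chickenpox data sets.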

Comments Preprint under review

Abstract

Spatio-temporal graph-based models are commonly used to forecast new cases of infectious diseases such as COVID-19 and chickenpox. However, integrating stochastic modelling into their learning process has been surprisingly under-investigated, and such models have rarely been applied to entire data sets of large countries. As a result, it is unknown whether these models would provide accurate forecasts in real-world disease spread scenarios. In this work, we propose a spatio-temporal stochastic graph-based architecture that integrates a stochastic formulation and an uncertainty approximation process to forecast new infectious disease cases. We find that our approach can adapt to encode large and small population geographical networks within a single model architecture. Using two real-world data sets, COVID-19 in the US and chickenpox in Hungary, we report an enhanced effect of the proposed architecture across predictions of the 2022 first wave for COVID-19 in the US and comparative results of chickenpox waves during 2012-2014 in Hungary. By benchmarking against four spatio-temporal graph-based models, quantitative results show competitive overall weekly performance of the proposed approach in forecasting new cases for all 3,218 US counties and all 20 Hungarian counties. The proposed approach can represent overall epidemic progression relative to baselines, though with a one-step delay, while exhibiting reduced sensitivity to high-frequency, low-amplitude variability.

2605.30656 2026-06-01 cs.LG

Learning to Perceive the World Through Control: Empowerment-Based Representation Learning

Mahsa Bastankhah, Sophie Broderick, Benjamin Eysenbach

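Affiliations Princeton University, USA

AI summary Studies, through maximization of the empowerment objective, how to learn representations that capture only control-relevant features of the environment, and shows that the forward and backward representations induced by empowerment agents are invariant to control-irrelevant features.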

Abstract

In many practical reinforcement learning environments, observations are far higher-dimensional than the variables that matter for control. In this work, we ask: can we learn representations that capture only control-relevant features of the environment? We study this question through the empowerment objective, which maximizes an agent's influence over the environment and is widely used for unsupervised skill learning. We show that empowerment agents induce two distinct representations -- forward and backward -- that capture complementary aspects of the state, and both of which are invariant to control-irrelevant features. Thus, empowerment maximization leads agents to learn an implicit, control-centric model of the world. Our analysis highlights the importance of learning representations through interaction rather than from passive datasets: interaction aimed at maximizing control is essential for learning useful invariance properties, a perspective that aligns closely with the causal learning literature.

2605.30654 2026-06-01 cs.CL cs.AI cs.HC

EUDAIMONIA: Evaluating Undesirable Dynamics in AI

Jun Rui Huang, Wang Bill Zhu, Ziyi Liu, Nathanael Fast, Ravi Iyer, Robin Jia

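Affiliations University of Southern California

AI summary Proposes the Social AI Design Code framework and uses the EUDAIMONIA benchmark to evaluate how well 22 recent LLMs align with user welfare in social interactions, finding that even the strongest models violate roughly 27% to 31% of the design-requirement checks.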

Abstract

Large language models (LLMs) are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal advice, but the social dynamics of these interactions can create harms that are not captured by capability-oriented or traditional safety evaluations. We introduce the Social AI Design Code, a framework for evaluating whether LLMs align with user welfare in social interactions, including whether they encourage harmful intimacy, dependence, or prolonged engagement. To evaluate these risks in natural and diverse user-LLM interactions, we operationalize the code with EUDAIMONIA, a benchmark of 969 user inputs and 3,147 design-requirement violation checks built from WildChat through weak-to-strong filtration, multi-model relabeling, and controlled rewriting. Evaluating 22 recent LLMs, we find that even the strongest models, Claude-Opus-4.7 and GPT-5.5, violate 30.7% and 27.2% of checks, respectively. Extended thinking does not reduce violation rates, suggesting that these failures are persistent social-alignment problems rather than deficits solvable through test-time reasoning alone.

2605.30653 2026-06-01 cs.CL

Counterfactual Graph for Multi-Agent LLM Calibration

Jiatan Huang, Mingchen Li, Ziming Li, Sunjae Kwon, Hong Yu, Chuxu Zhang

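Affiliations University of Connecticut; University of Massachusetts, Amherst

AI summary Addresses correlated failures and false consensus caused by communication in multi-agent LLM systems by proposing CAGE-CAL, a framework that calibrates confidence by comparing the observed post-communication graph with a matched counterfactual no-communication graph, improving reliability discrimination and topology selection.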

Abstract

Multi-agent LLM systems often treat agreement as evidence: when many agents in a panel give the same answer, that answer is assumed to be more reliable. We show that this assumption can fail after agents communicate. Communication can induce correlated failures and false consensus, so the same vote share may reflect reliable agreement in one topology but over-confidence in another. We propose CAGE-CAL, a counterfactual agent-graph calibration framework for multi-agent LLMs. For each query, CAGE-CAL compares an observed post-communication agent graph with a matched counterfactual no-communication graph, capturing both pairwise failure correlations and group-level dependencies. Rather than simply counting how many agents agree, CAGE-CAL estimates the counterfactual shift between observed and no-communication dependence, and calibrates confidence accordingly. Across five benchmarks, CAGE-CAL improves reliability discrimination with competitive ECE, and its calibrated confidence further improves topology selection over the best fixed-topology strategy.
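
The abstract reports calibration quality as ECE, which in this context stands for expected calibration error. As a rough illustration of what that number measures, the sketch below implements the standard equal-width binned estimator. It is not CAGE-CAL itself; the function name, the bin count, and the toy inputs are assumptions made for this example.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |mean confidence - accuracy| per bin.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    Hypothetical helper for illustration, not code from the paper.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy example: four answers at 0.9 confidence, three of them correct.
toy_ece = expected_calibration_error([0.9] * 4, [True, True, True, False])
print(f"toy ECE = {toy_ece:.3f}")
```

A lower value means stated confidence tracks empirical accuracy more closely. The estimator only looks at confidence and correctness, so two communication topologies with the same vote share can still yield different ECE.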

2605.30651 2026-06-01 cs.LG cs.AI

LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

Tianrun Yu, Kaixiang Zhao, Chih-Chun Chen, Amanda Hughes, Taylor W. Killian, Fenglong Ma, Weitong Zhang, Porter Jenkins

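Affiliations Brigham Young University; The Pennsylvania State University; University of North Carolina at Chapel Hill

AI summary Proposes LARK, which uses a learnability factor $ρ$ and a $χ^2$-regularized selection policy to efficiently select trajectories that the student model can learn in reasoning distillation while preserving distributional coverage, consistently outperforming data selection baselines across multiple base models and reasoning tasks.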

Comments 43 pages, 9 figures, 2 tables

Abstract

We study trajectory selection for reasoning distillation, where teacher-generated reasoning trajectories are selectively used as supervision for a student model. Existing methods rely on heuristics such as trajectory quality or model confidence, but they often overlook whether a trajectory is learnable by the student. In this paper, we present LARK, a learnability-grounded method for reasoning trajectory selection. LARK selects trajectories that the student can learn efficiently while preserving the generalization of the full training distribution. At the core of LARK is a learnability factor $ρ$, which characterizes the rate at which the student's training loss decreases. To estimate this rate efficiently and maintain generalization, we introduce a learnability proxy and a $χ^2$-regularized selection policy that balances learnability and distributional coverage, both with strong theoretical guarantees on their estimation error. Empirically, LARK consistently outperforms data selection baselines across multiple base models and reasoning tasks. Diagnostic analyses show that the LARK score predicts downstream training utility and that LARK-selected trajectories induce faster supervised fine-tuning loss reduction. Our code is available at https://github.com/Tianrun-Yu/LARK.

2605.30648 2026-06-01 cs.LG math.OC

Convergence of Steepest Descent and Adam under Non-Uniform Smoothness

Sharan Vaswani, Yifan Sun, Reza Babanezhad

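Affiliations Simon Fraser University; Stony Brook University

AI summary Under a non-uniform smoothness assumption, studies convergence rates of steepest descent and of deterministic, diagonal variants of RMSProp and Adam, proving that sign gradient descent converges linearly and faster than gradient descent on objectives such as logistic regression and softmax policy gradient, and that RMSProp and Adam can converge linearly on a class of two-layer neural networks.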

Comments ICML 2026

Abstract

Recent work has analyzed the convergence of first-order methods under non-uniform smoothness assumptions that better model the loss landscape in machine learning tasks. We generalize this assumption to objectives whose curvature is an affine function of the objective value. This property is satisfied by a broad class of problems, including logistic regression, generalized linear models with a logistic link function, softmax policy gradient in reinforcement learning, and a class of neural networks. Under this assumption and gradient domination conditions, we establish a general convergence rate for the steepest descent method, and deterministic, diagonal variants of RMSProp and Adam. Our results imply that for logistic regression on separable data and the softmax policy gradient objective, sign GD converges linearly and is provably faster than GD. Furthermore, we show that for a class of two-layer neural networks on separable data, RMSProp and Adam can converge at a linear rate with a constant step-size and momentum parameter. Finally, we present a lower bound demonstrating that, under our assumption, RMSProp and Adam are provably faster than AdaGrad, AMSGrad, gradient descent, and heavy-ball momentum.
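
To make the sign GD versus GD claim concrete, here is a small numerical sketch on logistic regression with separable data. Sign GD is steepest descent with respect to the ℓ∞ norm. The four data points, the step size of 0.1, and the 100 iterations are assumptions chosen for this toy example and do not come from the paper, and a single run like this illustrates the claim without proving anything.

```python
import math

# Hypothetical toy data: four linearly separable points, labels in {-1, +1}.
X = [(1.0, 2.0), (2.0, 1.0), (-1.0, -2.0), (-2.0, -1.0)]
Y = [1.0, 1.0, -1.0, -1.0]


def loss_and_grad(w):
    """Mean logistic loss log(1 + exp(-y * <w, x>)) and its gradient."""
    loss, grad = 0.0, [0.0, 0.0]
    for x, y in zip(X, Y):
        margin = y * (w[0] * x[0] + w[1] * x[1])
        # Numerically stable log(1 + exp(-margin)).
        loss += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        # sigmoid(-margin), written to avoid overflow for large |margin|.
        if margin >= 0:
            s = math.exp(-margin) / (1.0 + math.exp(-margin))
        else:
            s = 1.0 / (1.0 + math.exp(margin))
        for j in range(2):
            grad[j] -= y * x[j] * s
    n = len(X)
    return loss / n, [g / n for g in grad]


def sign(v):
    return (v > 0) - (v < 0)


def run(direction, steps=100, lr=0.1):
    """Iterate w <- w - lr * direction(g) coordinate-wise; return final loss."""
    w = [0.0, 0.0]
    for _ in range(steps):
        _, g = loss_and_grad(w)
        w = [wi - lr * direction(gi) for wi, gi in zip(w, g)]
    return loss_and_grad(w)[0]


gd_loss = run(lambda g: g)   # plain gradient descent
sign_gd_loss = run(sign)     # sign gradient descent
print(f"GD loss: {gd_loss:.3e}, sign GD loss: {sign_gd_loss:.3e}")
```

On separable data the gradient shrinks as the margin grows, so plain GD with a constant step slows down, whereas sign GD keeps moving by a fixed amount per coordinate and the loss keeps falling geometrically.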

2605.30646 2026-06-01 cs.CL cs.AI

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Mahdi Alkaeed, Adnan Qayyum, Nabeel Abo Kashreef, Muhammad Bilal, Junaid Qadir

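Affiliations Department of Computer Science and Engineering, College of Engineering, Qatar University; College of Science and Engineering, Hamad Bin Khalifa University (HBKU); Primary Health Care Corporation (PHCC); Birmingham City University

AI summary Addresses the sensitivity of clinical LLMs to semantically equivalent but differently worded prompts by proposing an NLI-based semantic verification framework and three quantitative metrics; evaluating 16 open-source general-purpose and medical models, it finds that domain specialization does not consistently improve or reduce robustness.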

Comments 14 pages, 5 figures

Abstract

Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle linguistic variations, such as rephrasing or syntactic variation. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions. A key challenge is to ensure that prompt variations truly preserve clinical meaning, as embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert. In addition, we introduce three metrics to quantify model sensitivity: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (ΔC), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) models are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust (when compared with GP counterparts), and strong GP baselines remain competitive as well.

2605.30642 2026-06-01 cs.LG

Diffusion Models Preferentially Memorize Prototypical Examples or: Why Does My Diffusion Model Love Slop?

Marta Aparicio Rodriguez, Anastasia Borovykh, Grigorios A. Pavliotis, Daniel J. Korchinski

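Affiliations Department of Mathematics, Imperial College London, UK; ML Lab, Capital Fund Management, France; Department of Physics, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

AI summary Trains diffusion models on strings generated by the Random Hierarchy Model and finds that samples composed of common substrings are preferentially memorized even when the data is fully deduplicated, indicating that point-level deduplication does not guarantee privacy, that dataset diversity (especially at higher levels of abstraction) delays memorization, and that an intermediate regime of partial memorization produces the reversion-to-the-mean "slop" seen in generations.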

Abstract

Generative models have a persistent limitation: their tendency to memorize training data can create legal liabilities and erode creative diversity. Understanding which samples are memorized in whole or in part, and under what conditions, therefore remains an important open problem. Here we answer the question "Are atypical or rare samples memorized first?" in the negative. We train diffusion models on strings generated according to the production rules of the Random Hierarchy Model (RHM), and find that samples composed of common substrings are preferentially memorized. This holds true even if the training data consists of entirely unique samples, indicating that deduplication at the data point level does not provide a meaningful privacy guarantee. Correspondingly, we predict, then observe, delayed memorization for fat-tailed datasets (i.e., those with more atypical samples). This effect is amplified when fat tails are introduced into high-level production rules. Together, these results suggest that dataset diversity, particularly at higher levels of abstraction, plays an important role in staving off memorization. Finally, we identify an intermediate regime of partial memorization in which common substrings are learned first and subsequently overproduced during generation. If training is stopped in this regime, models will exhibit the reversion-to-the-mean blandness often derided as "slop".

2605.30641 2026-06-01 cs.CL cs.AI

COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models

Arya Fayyazi, Mehdi Kamal, Massoud Pedram

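Affiliations Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, California, USA

AI summary Proposes COFT, a training-free decoding method that applies token-level fairness control at decode time through counterfactual prompts and conformal calibration, reducing societal bias in chain-of-thought generation while preserving task utility and language quality.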

Comments Proceedings of ICML 2026

Abstract

Large language models (LLMs) can reveal and amplify societal biases during chain-of-thought (CoT) generation. We present COFT (Chain of Fair Thought), a training-free decoding method that applies token-level fairness control at decode time, with distribution-free marginal validity guarantees (under exchangeability) for any frozen causal language model. COFT operates in three stages. First, it creates a masked counterfactual prompt by replacing sensitive spans with neutral tokens. Second, it compares the factual and masked logit distributions through lightweight logit fusion to attenuate attribute-driven biases. Third, it uses dual-branch split-conformal calibration to certify per-step candidate token sets at a user-chosen risk level. We evaluate COFT across six models and multiple bias benchmarks. Our method reduces standard bias metrics by 30-55% (median 38%) while preserving task utility and language quality. Reasoning accuracies remain unchanged within run-to-run noise margins. The computational overhead is modest, equivalent to one additional cached forward pass (<=11%). COFT offers a clear, auditable path to safer CoT generation with significant bias reduction, negligible utility loss, and no requirement for retraining, auxiliary classifiers, or weight access.
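
The third stage relies on split-conformal calibration to build candidate token sets at a chosen risk level. The sketch below shows the generic single-branch version of that idea, assuming a nonconformity score of one minus the probability of the reference token. COFT's dual-branch procedure is not specified in the abstract, so the function names, the score, and the toy numbers here are all assumptions for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of held-out nonconformity scores.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest score, or infinity
    when the calibration set is too small for the requested risk level.
    """
    n = len(cal_scores)
    # Small epsilon guards against floating-point error just above an integer.
    k = math.ceil((n + 1) * (1 - alpha) - 1e-12)
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]


def candidate_set(token_probs, threshold):
    """Keep every token whose score 1 - p does not exceed the threshold."""
    return {tok for tok, p in token_probs.items() if 1.0 - p <= threshold}


# Hypothetical calibration scores (1 - probability of the reference token).
cal_scores = [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]
threshold = conformal_threshold(cal_scores, alpha=0.5)
allowed = candidate_set({"a": 0.7, "b": 0.2, "c": 0.1}, threshold)
print(threshold, allowed)
```

Under exchangeability of calibration and test scores, a set built this way contains the reference token with probability at least 1 - alpha. This is the marginal, distribution-free style of guarantee the abstract mentions.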

2605.30640 2026-06-01 cs.LG cs.CL

CSULoRA: Closest Safe Update Low-Rank Adaptation

Oleksandr Marchenko Breneur, Adelaide Danilov, Aria Nourbakhsh, Salima Lamsiyah

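Affiliations Department of Computer Science, University of Luxembourg

AI summary Proposes CSULoRA, a post-hoc method that corrects LoRA adapters by suppressing unsafe update directions while preserving task-relevant information, lowering attack success rate.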

Comments 10 pages, 3 figures

Abstract

Low-rank adaptation has become a standard method for parameter-efficient fine-tuning of large language models, but even small amounts of unsafe or adversarial fine-tuning data can substantially weaken the safety behavior of aligned models. Existing safety-preserving LoRA methods often rely on hard interventions such as projection, pruning, thresholding, or additional training objectives. While these methods can suppress unsafe update directions, they may also remove task-relevant information or require extra tuning. We introduce CSULoRA, a post-hoc method for correcting trained LoRA adapters through closest safe update estimation. CSULoRA estimates a safety-aligned subspace from the weight displacement between a safety-aligned model and its corresponding base checkpoint. It then decomposes each LoRA update into fully aligned, partially aligned, and off-subspace components. Instead of discarding components outside the estimated safety subspace, CSULoRA solves a closed-form penalized minimum-change problem that preserves the fully aligned component while smoothly attenuating potentially unsafe directions according to their relative energy. In adversarial fine-tuning experiments, CSULoRA substantially reduces attack success rate while preserving most of the utility gains obtained from standard LoRA fine-tuning.
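
The abstract does not give CSULoRA's objective, so the following is only a simplified stand-in for the "penalized minimum-change" idea. It assumes an orthonormal basis for the safety subspace with projector $P$ and a flattened update vector $d$, and solves $\min_x \|x - d\|^2 + \lambda \|(I - P)x\|^2$, whose closed form is $x = Pd + (I - P)d / (1 + \lambda)$. The three-way decomposition and the energy-dependent attenuation described in the abstract are not modeled, and every name below is hypothetical.

```python
def attenuate_off_subspace(update, basis, lam):
    """Closed-form minimizer of ||x - d||^2 + lam * ||(I - P) x||^2.

    update: flattened update vector d (list of floats).
    basis:  orthonormal vectors spanning the assumed safety subspace.
    lam:    penalty weight; 0 leaves d unchanged, large values approach
            a hard projection onto the subspace.
    """
    inside = [0.0] * len(update)
    for b in basis:
        coeff = sum(u * bi for u, bi in zip(update, b))
        inside = [x + coeff * bi for x, bi in zip(inside, b)]
    outside = [u - i for u, i in zip(update, inside)]
    # The in-subspace part is kept; the remainder shrinks by 1 / (1 + lam).
    return [i + o / (1.0 + lam) for i, o in zip(inside, outside)]


# Toy example: the safety subspace is the first coordinate axis.
corrected = attenuate_off_subspace([3.0, 4.0], [[1.0, 0.0]], lam=1.0)
print(corrected)
```

Compared with a hard projection, which would zero the second coordinate, the penalized form shrinks it smoothly. That is the trade-off the abstract points to when it contrasts CSULoRA with projection or pruning.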

2605.30639 2026-06-01 cs.CV cs.AI cs.RO

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

Yuhang Jiang

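Affiliations University of Trento

AI summary Introduces the Active Instance Verification task and builds PInVerify, an offline embodied benchmark that evaluates embodied agents through multi-view navigation and fine-grained attribute matching, with baselines built on multimodal large language models.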

Comments Accepted as a poster at the Foundation Models Meet Embodied Agents (FMEA) Workshop, CVPR 2026. 44 pages including appendix. Code: https://github.com/Avalon-S/PInVerify

Abstract

Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale ($\leq$8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.

2605.30638 2026-06-01 cs.LG cs.AI

Score Broadcast and Decorrelation: A General Framework for Broadcast-Based Credit Assignment

Mustafa Uzun, Mete Erdogan, Cengiz Pehlevan, Alper T. Erdogan

Affiliations KUIS AI Center, Koc University, Turkey; Electrical and Electronics Engineering, Koc University, Turkey; Department of Electrical Engineering, Stanford University, USA; John A. Paulson School of Engineering & Applied Sciences, Harvard University, USA; Kempner Institute, Harvard University, USA; Center for Brain Science, Harvard University, USA

AI summary Proposes the Score Broadcast and Decorrelation (SBD) framework, which uses an orthogonality principle between the output score and hidden-layer activations to unify broadcast-based credit assignment across several differentiable loss families and to give theoretical grounding for the three-factor learning rule.

Abstract

We introduce Score Broadcast and Decorrelation (SBD), a principled framework for broadcast-based credit assignment for general families of differentiable losses. Error broadcast is a biologically plausible alternative to backpropagation that sends output information to hidden layers without weight transport. The Error Broadcast and Decorrelation (EBD) framework, recently introduced for the mean-squared-error (MSE) setting, grounded this mechanism in the stochastic orthogonality of optimal estimators, under which the optimal residual is orthogonal to functions of the input. We generalize that foundation by introducing an orthogonality principle between the output score (the gradient of loss with respect to the final-layer output) and hidden-layer activations, which holds whenever the optimal score has conditional mean zero. This single principle unifies broadcast-based credit assignment across the standard differentiable-loss families, including cross-entropy, Bregman divergences, proper scoring rules, and exponential-family negative log-likelihoods. The framework supplies a theoretical grounding for the three-factor learning rule under general losses, with the neuromodulatory factor derived as the broadcast loss score. We derive the cross-entropy case explicitly, characterize the admissible loss class, and introduce a score vector expansion technique that enriches the broadcast signal while preserving the orthogonality framework. Experiments on CIFAR-10 and Tiny ImageNet show that SBD substantially improves over existing broadcast approaches, with score vector expansion delivering further gains. Overall, this work identifies the loss score as the signal to broadcast, supplies the orthogonality theory and theoretical grounding for the three-factor learning rule from neuroscience, and shows how score vector expansion enriches the decorrelation directions of the resulting objective.

2605.30637 2026-06-01 cs.AI

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui, Guanchen Wu, Ziyang Zhang, Kai Shu, Jiaying Lu, Xiao Hu, Carl Yang

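Affiliations Emory University; Stanford University

AI summary Proposes EHRBench, which automatically constructs nearly one million QA items through an EHR-LLM-KB interaction pipeline, covering three clinical decision tasks (diagnosis, treatment, and prognosis), and systematically evaluates the performance and robustness of more than 30 LLMs.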

Comments Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026), Datasets and Benchmarks Track, Oral

Abstract

Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge, and efficiency, yet the reliability of LLMs on real-world clinical decision tasks remains insufficiently understood. To evaluate CDM models, especially LLM-based models, an ideal and practical medical decision benchmark should be constructed via an automated yet reliable pipeline to ensure both scale and quality. Moreover, grounding a CDM benchmark in real patient EHRs can better support evaluation on practical CDM tasks that require substantive biomedical knowledge and clinical inference. To fill these gaps, we introduce EHRBench, an automated and reliable EHR-grounded benchmark for evaluating LLM-based clinical decision-making at scale. To ensure scalability and reliability, EHRBench is constructed through an EHR-LLM-KB (knowledge base) interaction pipeline. For efficiency, we use a specialized LLM to automatically convert encounter-level EHR trajectories into structured templates and deterministically instantiate the templates into QA items. In parallel, we apply systematic KB-based verification and enrichment to filter hallucinated or ambiguous relations and to improve reliability. Using this pipeline, we construct nearly 1M (960,067) QA items spanning three core inference-required clinical decision tasks: diagnosis, treatment, and prognosis. We benchmark more than 30 representative LLMs on EHRBench and provide detailed analyses of performance and robustness. The results show consistent capability trends across settings, further validating the reliability of EHRBench and highlighting actionable gaps toward clinically reliable LLM systems.

2605.30635 2026-06-01 cs.LG q-bio.GN

CellBRIDGE: Learning Cellular Trajectories via Interaction-Aware Alignment

Silas Ruhrberg Estévez, Nicolas Huynh, Tennison Liu, Roderik M. Kortlever, Gerard I. Evan, David L. Bentley, Mihaela van der Schaar

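Affiliations DAMTP, University of Cambridge; Francis Crick Institute; University of Colorado Anschutz Medical Campus

AI summary Proposes CellBRIDGE, which incorporates a ligand-receptor-mediated cell-cell communication cost into the optimal transport framework to improve trajectory inference and cross-snapshot coupling in single-cell RNA sequencing data.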

Journal ref ICML 2026

Abstract

Inferring dynamics from population snapshots is a fundamental challenge in machine learning and biology. In scRNA-sequencing (scRNA-seq), destructive measurements preclude direct tracking of individual cells across time, making trajectory inference underdetermined. Optimal Transport (OT) provides a principled framework for snapshot alignment, but a long-standing modeling question is which cost functions yield biologically meaningful couplings. Standard OT approaches rely on gene-expression distances, implicitly treating cells as independent points and neglecting structured cell-cell communication mediated by ligand-receptor signaling. We introduce CellBRIDGE (Cell-Based Regularized Interaction-Driven Gene Expression), which augments feature-based OT with a directed, typed interaction cost derived from ligand-receptor activity. By explicitly modeling cell-cell communication, CellBRIDGE improves cross-snapshot couplings and downstream trajectory estimates across synthetic and real scRNA-seq datasets relative to feature-only baselines. Notably, CellBRIDGE enables mechanistically interpretable in silico perturbations: on lung cancer data, silencing specific ligand-receptor pairs induces trajectory shifts that recapitulate expected effects of targeted pathway inhibition.
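
As a rough picture of how an interaction term can change an OT coupling, the sketch below adds a second cost matrix to an expression-distance cost and solves the entropic OT problem with plain Sinkhorn iterations. The additive combination with weight `beta`, the entropic regularization, and the 2x2 toy costs are assumptions for illustration. The abstract does not state how CellBRIDGE builds or combines its directed, typed interaction cost.

```python
import math


def combined_cost(expr_cost, inter_cost, beta):
    """Hypothetical additive combination: expression cost + beta * interaction cost."""
    return [[e + beta * c for e, c in zip(er, cr)]
            for er, cr in zip(expr_cost, inter_cost)]


def sinkhorn(cost, a, b, eps=0.5, iters=200):
    """Entropic OT coupling between marginals a and b via Sinkhorn scaling."""
    n, m = len(a), len(b)
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy snapshots: expression distances are uninformative, while the interaction
# cost favors matching cell 0 to cell 0 and cell 1 to cell 1.
expr = [[1.0, 1.0], [1.0, 1.0]]
inter = [[0.0, 1.0], [1.0, 0.0]]
marginal = [0.5, 0.5]

plan_expr_only = sinkhorn(expr, marginal, marginal)
plan_combined = sinkhorn(combined_cost(expr, inter, beta=1.0), marginal, marginal)
print(plan_expr_only, plan_combined)
```

With expression cost alone the coupling is uniform, because nothing distinguishes the candidate matches. Adding the interaction term concentrates mass on the diagonal, which is the kind of disambiguation the abstract attributes to modeling cell-cell communication.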

2605.30631 2026-06-01 cs.CV cs.AI cs.LG

Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models

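Arunkumar Kannan, Yanbo Zhang, Han Liu, Michael Baumgartner, Jianing Wang, Alexander Hertel, Bogdan Georgescu, Sasa Grbic

Affiliations * Johns Hopkins University; Department of Radiology and Nuclear Medicine, University Medical Center Mannheim, Heidelberg University

AI summary Proposes a histogram-regularized latent diffusion model that synthesizes lung nodules in 3D CT volumes by combining subtype, spatial-mask, and HU-histogram conditioning with a differentiable feature-space histogram regularization term, modeling nodule-specific intensity distributions accurately and improving visual realism and subtype consistency.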

Abstract

While automated diagnosis systems have achieved remarkable success in computed tomography (CT)-based lung cancer screening, their development remains limited by the scarcity of diverse, annotated pulmonary nodule datasets. Diffusion-based generative models offer a promising strategy for data synthesis; however, many existing conditional approaches primarily optimize spatial reconstruction losses, which encourage voxel-wise similarity but may inadequately constrain lesion-level intensity distributions. As a result, these methods may produce over-smoothed texture profiles and underrepresent the distinct attenuation characteristics of different nodule subtypes, including solid, part-solid, and ground-glass nodules. To address this challenge, we propose a controllable latent diffusion model that synthesizes pulmonary nodules within full 3D CT volumes while accurately modeling nodule-specific intensity distributions. Specifically, rather than relying solely on spatial losses, we introduce a histogram-based regularization term that constrains voxel intensity distributions during the generative process. The model combines subtype, spatial mask, and Hounsfield unit (HU) histogram conditioning with the differentiable feature-space histogram regularization term to better align lesion-level intensity distributions, improving the visual plausibility and subtype consistency of synthesized nodules. Extensive experiments on lung CT data demonstrate that our framework achieves strong visual realism, validated through both quantitative metrics and a visual Turing test. Furthermore, when used for data augmentation, the generated nodules improve performance in downstream clinical tasks, particularly for underrepresented nodule subtypes, and show a potential benefit for subtype-informed malignancy classification.
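A histogram-based penalty needs a histogram that varies smoothly with the voxel values. A minimal sketch of one common way to get that is Gaussian soft binning, shown below. The bin centers, bandwidth, L1 distance, and HU values are illustrative assumptions; the paper's regularizer operates in feature space and its exact form is not given in the abstract.

```python
import math

def soft_histogram(values, centers, bandwidth):
    """Smooth histogram: each value spreads normalized Gaussian weight over the bins."""
    hist = [0.0] * len(centers)
    for v in values:
        weights = [math.exp(-0.5 * ((v - c) / bandwidth) ** 2) for c in centers]
        total = sum(weights)
        for k, w in enumerate(weights):
            hist[k] += w / total
    return [h / len(values) for h in hist]

def histogram_l1(values_a, values_b, centers, bandwidth):
    """L1 distance between two soft histograms (0 when they coincide, at most 2)."""
    ha = soft_histogram(values_a, centers, bandwidth)
    hb = soft_histogram(values_b, centers, bandwidth)
    return sum(abs(x - y) for x, y in zip(ha, hb))

# Made-up Hounsfield-unit samples, for illustration only.
centers = [-800.0, -600.0, -400.0, -200.0, 0.0, 200.0]
ground_glass_like = [-650.0, -620.0, -700.0, -580.0]
solid_like = [20.0, 50.0, -10.0, 35.0]

penalty = histogram_l1(ground_glass_like, solid_like, centers, bandwidth=100.0)
```

Because every bin weight is a smooth function of the input value, the same computation written in an autodiff framework would pass gradients back to the generated voxels, which hard binning would not.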

2605.30628 2026-06-01 cs.CL cs.AI cs.LG

The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability

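Mikhail L. Arbuzov, Lee Mosbacker, Sisong Bei, Ziwei Dong, Dmitri Kalaev, Alexey Shvets

Affiliations * Independent Researcher; Palo Alto Networks

AI summary Using two propositions and one corollary, formally argues that universal LLM reliability is unattainable over unbounded domains, but that reliability is achievable within operationally bounded local patches through catalogue discovery and intervention coverage.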

Comments 25 pages, no figures

Abstract

Universal LLM reliability is not a finite-library problem: across all possible tasks, tools, schemas, knowledge sources, and evaluator expectations, new intervention-distinguishable failure modes can appear without bound, so no finite intervention dictionary can guarantee bounded residual error for every such mode. But deployed systems do not operate over the whole universe. They operate inside operationally bounded patches (legal review, medical RAG, code repair, customer-support agents, contract extraction) with recurring tasks, schemas, tools, and evaluator expectations. Within such patches, empirical evidence suggests failures are sparse, repetitive, and concentrated in a small recurring catalogue, so reliability becomes a local catalogue-discovery and intervention-coverage problem rather than an exponential token-length problem. We formalize this transition with two propositions and one corollary. Proposition 1 is the worst-case-mode-wise negative result: no finite intervention dictionary covers every distinguishable failure mode of an unbounded domain. Corollary 1 is the inverse-discovery implication: the logarithmic upper bound on mode discovery cannot accommodate linearly more distinct tail modes without exponentially more observed hard-failure events. Proposition 2 is the positive patch-local result: under log active-mode exposure and head-heavy coverage, a sufficient per-hard-decision intervention budget grows polylogarithmically in sequence length and becomes domain-constant once the patch catalogue saturates. The framework relocates rather than dissolves long-context difficulty: where the number of hard decisions itself grows with task length, reliability remains hard; the contribution is to identify the on-axis intervention rather than to make those regimes easy.

2605.30625 2026-06-01 cs.LG cs.AI stat.ML

Active Timepoint Selection for Learning Measure-Valued Trajectories

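Nicolas Huynh, Mihaela van der Schaar

Affiliations * DAMTP, University of Cambridge

AI summary For settings where data acquisition is costly and destructive, proposes an active learning framework based on linearized optimal transport that models the probability path with a Gaussian process and iteratively selects measurement times to minimize uncertainty.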

Comments ICML 2026

Abstract

Inferring continuous probability paths from sparse snapshots is a fundamental challenge in domains like single-cell biology, where high-fidelity data acquisition is often destructive and constrained by prohibitive sequencing costs. This motivates the need for active learning strategies to strategically select optimal measurement times. However, designing active learning policies for this setting remains an open problem: the target objects reside on the infinite dimensional Wasserstein space where standard Euclidean metrics are ill-defined, and current interpolation methods lack epistemic uncertainty quantification. We introduce a framework which extends active experimentation to the space of measures. By leveraging Linearized Optimal Transport (LOT), we map distributional snapshots into a tangent space amenable to Gaussian Process modeling, allowing us to construct a tractable probabilistic surrogate for the underlying probability path. This yields an acquisition policy that iteratively selects measurement times to minimize uncertainty. Empirical results demonstrate that our strategy outperforms uncertainty-agnostic baselines on both synthetic and real-world datasets.
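The step of mapping snapshots into a flat space where ordinary regression tools apply can be illustrated in one dimension, where the embedding is exact. In this minimal sketch, which assumes equally weighted 1D samples of equal size, the sorted sample acts as the embedding and plain Euclidean (root-mean-square) distance between embeddings equals the 2-Wasserstein distance. The function names are hypothetical, and the Gaussian process and acquisition steps of the paper are not shown.

```python
import math

def lot_embedding_1d(samples):
    """Sorted samples, i.e. the empirical quantile function on a uniform grid."""
    return sorted(samples)

def embedding_distance(e1, e2):
    """Root-mean-square gap between embeddings; equals W2 in this 1D setting."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(e1, e2)) / len(e1))

snapshot_t0 = [0.3, -1.2, 0.8, 2.1, -0.4]
# Same population shifted by 1.5, listed in a different order, as happens when
# cells cannot be tracked between measurements.
snapshot_t1 = [x + 1.5 for x in snapshot_t0][::-1]

d = embedding_distance(lot_embedding_1d(snapshot_t0), lot_embedding_1d(snapshot_t1))
```

In higher dimensions the embedding is built from transport maps to a reference measure and the distance match is only approximate, so this example should be read as intuition rather than as the paper's construction.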

2605.30621 2026-06-01 cs.AI

Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Minhua Lin, Juncheng Wu, Zijun Wang, Zhan Shi, Yisi Sang, Bing He, Zewen Liu, Tianxin Wei, Zongyu Wu, Zhiwei Zhang, Dakuo Wang, Xiang Zhang, Benoit Dumoulin, Cihang Xie, Yuyin Zhou, Suhang Wang, Hanqing Lu

Affiliations * The Pennsylvania State University; UC Santa Cruz; Amazon; Emory University; UIUC; Northeastern University

AI summary Analyzes the self-evolution of LLM agents over their external harnesses (prompts, skills, memories, and tools), finding that harness-updating capability is largely independent of base capability, while harness-benefit capability is non-monotonic in base capability, with mid-tier models benefiting most.

Comments 24 pages, 9 figures, 12 tables

Abstract

LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. We trace low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.

2605.30617 2026-06-01 cs.RO math.OC

Exploiting Chordal Sparsity for Globally Optimal Estimation with Factor Graphs

Avinash Subramanian, Connor Holmes, Timothy D. Barfoot, Frank Dellaert, Frederike Dümbgen

Affiliations * College of Computing, Georgia Institute of Technology; Robotics Institute, University of Toronto; Department of Mechanical Engineering, Carnegie Mellon University

AI summary Proposes automatically constructing convex semidefinite programming relaxations within the GTSAM framework and using the Bayes tree decomposition to speed up solving, enabling globally optimal estimation with factor graphs.

Journal ref ICRA 2026 Workshop on Frontiers of Optimization for Robotics

Abstract

Robust and efficient state estimation is crucial for perception, navigation, and control in robotics. State estimation problems are conveniently modeled using the factor-graph framework as enabled by modern software packages such as GTSAM or g2o. However, the standard solvers included in such frameworks are local and may converge to poor local minima, posing significant safety concerns. Conversely, techniques based on convex relaxations have been shown to provide a means of globally solving or certifying many state estimation problems. However, these relaxations 1) often require substantial effort to formulate, and 2) may incur significantly higher cost compared to efficient local solvers, as they require solving a large semidefinite program (SDP). In this work, we address both shortcomings by 1) creating a new procedure within the GTSAM framework for automatically constructing convex SDP relaxations for any factor graph with common factor and variable types, and by 2) exploiting the Bayes tree constructions native to GTSAM to decompose the SDP problem, leading to significant speedup in solver time for chordally sparse problems. We demonstrate the favorable scaling of this structure-exploiting global estimator compared to standard local solvers for two case studies: a 3D pose-graph SLAM problem with a ring factor graph and a 2D localization problem with a chain factor graph. The software framework is available at https://github.com/borglab/gtsam.

2605.30615 2026-06-01 cs.LG

Improving Selective Classification with Pairwise Queries for Binary Classification

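Harsh Vardhan, Sunav Choudhary, Natwar Modani, Arya Mazumdar

Affiliations * Adobe Research

AI summary To address high error rates in selective classification caused by confidence estimates that are inconsistent with the model's predictions, proposes pairwise queries that detect high-error samples and reduce the error on non-rejected samples, with the approach validated by theory and experiments.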

Abstract

In selective classification, a model predicts the labels of data samples where it is confident, and abstains from predicting labels for samples on which it is not confident. The rejected samples are often labeled by an expert, which is expensive. The budget for the expert is best utilized when the model has low error on non-rejected samples. However, the estimate of a model's confidence might be inconsistent with the model's predictions, which can lead to high error on non-rejected points. Such situations can readily occur in in-context binary classification by LLMs. To remedy this, we propose making additional pairwise queries to the same model. These pairwise queries can detect high-error samples and be incorporated into selective classification techniques to reduce the error on non-rejected samples. Theoretically, we establish the conditions under which a simple algorithm using pairwise queries outperforms an inconsistent confidence estimate. We support this insight through extensive experiments for 1 synthetic and 4 in-context learning-based real binary classification datasets. In all these cases, we show that our algorithms, using pairwise queries, obtain a better accuracy-cost tradeoff than using only the raw confidence estimates, for instance, the LLM's next-token logits.
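The baseline that the abstract argues against, abstaining purely on a confidence threshold, can be sketched as follows. This does not implement the paper's pairwise queries, whose details are not given here; the function name and the toy data are assumptions chosen to show what happens when confidence is inconsistent with correctness.

```python
def selective_metrics(confidences, predictions, labels, threshold):
    """Return (coverage, selective_error) when abstaining below the threshold."""
    kept = [(p, y) for c, p, y in zip(confidences, predictions, labels) if c >= threshold]
    coverage = len(kept) / len(labels)
    if not kept:
        return coverage, 0.0
    errors = sum(1 for p, y in kept if p != y)
    return coverage, errors / len(kept)

# Toy data: the third sample is confidently wrong.
confidences = [0.95, 0.90, 0.85, 0.60, 0.55, 0.52]
predictions = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, 1, 1, 1]

full_coverage = selective_metrics(confidences, predictions, labels, threshold=0.0)
half_coverage = selective_metrics(confidences, predictions, labels, threshold=0.8)
```

Here sending half of the samples to the expert leaves the error on the retained samples unchanged at one third, which is the kind of wasted expert budget the abstract describes.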

2605.30612 2026-06-01 cs.RO cs.LG cs.SY eess.SY

ZAPS-DA: Zero-Phase Action Policy Smoothing with Decoupled Actor for Continuous Control in Reinforcement Learning

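Faiq Shamass

Affiliations * Independent Researcher

AI summary Proposes the ZAPS-DA framework, in which a decoupled actor network imitates zero-phase filtered targets, reducing action jitter in continuous control policies without introducing phase lag or post-processing; its effectiveness is validated in driving simulation.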

Comments 7 pages, 5 figures, 5 tables. Submitted to IEEE RA-L

Abstract

Continuous control policies trained with off-policy reinforcement learning frequently exhibit high-frequency action jitter, rendering direct deployment on physical actuators impractical. Post-hoc filtering attenuates jitter but introduces phase lag; embedding smoothness penalties in the actor's loss couples them with the RL gradient and conflates reward regression with over-aggressive smoothing. We present ZAPS-DA, a framework that reduces action jitter at deployment with negligible phase lag and no post-processing. ZAPS-DA pairs an unmodified main actor (trained by the base RL loss) with a separate decoupled actor trained via supervised imitation of zero-phase filtered targets stored in the replay buffer. The deployed policy is the decoupled actor: a feed-forward map from the current observation to a smooth action, with no inference-time filter and no action-history input, a mechanism we term causal distillation of a non-causal filter. A magnitude-matched MSE loss provides zero-hyperparameter portability across optimizer classes. We validate ZAPS-DA with Soft Actor-Critic and a Savitzky-Golay filter in two driving simulators using paired n=150 evaluation protocols: on MetaDrive, ZAPS-DA reduces steering jitter by 14-21x and throttle jitter by 3-5x (all $p < 10^{-4}$, Bonferroni-corrected) while matching task-completion (p=0.28 success, p=0.31 crash) at a 6.3% reward cost; on a custom Webots adaptive cruise control environment, the same SG configuration produces a Pareto improvement: reward parity (p=0.121), 8-45x steering jitter reduction, and total task-failure rate reduced from 2.0% to 0.7%.
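A zero-phase filter uses samples on both sides of the current step, which is why it can only be applied offline to stored trajectories. The sketch below applies a centered Savitzky-Golay filter with the standard 5-point quadratic coefficients to an action sequence. The window length, the edge handling, and the jitter metric (mean absolute step-to-step change) are assumptions for illustration, not the paper's configuration.

```python
# Standard Savitzky-Golay smoothing weights for window 5, quadratic fit (sum = 35).
SG_COEFFS = (-3.0, 12.0, 17.0, 12.0, -3.0)

def savgol5(actions):
    """Centered (zero-phase) smoothing; the two samples at each end are left as-is."""
    out = list(actions)
    for t in range(2, len(actions) - 2):
        out[t] = sum(c * actions[t + k - 2] for k, c in enumerate(SG_COEFFS)) / 35.0
    return out

def jitter(actions):
    """Mean absolute change between consecutive actions (hypothetical metric)."""
    return sum(abs(b - a) for a, b in zip(actions, actions[1:])) / (len(actions) - 1)

# A slow steering ramp with alternating high-frequency chatter on top.
raw = [0.1 * t + (0.2 if t % 2 else -0.2) for t in range(20)]
targets = savgol5(raw)
```

In a setup like the one described, sequences such as `targets` would serve as supervised labels for a separate network that sees only the current observation. Library implementations such as `scipy.signal.savgol_filter` offer general window lengths and edge modes.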

2605.30611 2026-06-01 cs.CV cs.AI cs.CL

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen, Xiaotong Li, Zhixiang Liang, Maosong Sun, Minjia Zhang

Affiliations * University of Illinois at Urbana-Champaign; Tsinghua University; Peking University

AI summary Proposes Crafter, a multi-agent harness that generates editable scientific figures across figure types and input conditions by structurally composing discrete semantic components, and introduces CraftEditor to convert raster outputs into editable SVGs; substantially outperforms existing methods on the CraftBench benchmark.

Comments 24 pages, 11 figures

Abstract

Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.

2605.30610 2026-06-01 cs.LG

Constrained Flow Optimization via Sequential Fine Tuning for Molecular Design

Sven Gutjahr, Riccardo De Santi, Luca Schaufelberger, Kjell Jorner, Andreas Krause

Affiliations * Department of Computer Science, ETH Zurich; Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich; ETH AI Center

AI summary Proposes the Constrained Flow Optimization (CFO) algorithm, which reduces constrained generative optimization to sequential fine-tuning in order to balance reward maximization and constraint satisfaction in molecular design.

Comments ICML 2026

Abstract

Adapting generative foundation models, in particular diffusion and flow models, to optimize given reward functions (e.g., binding affinity) while satisfying constraints (e.g., molecular synthesizability) is fundamental for their adoption in real-world scientific discovery applications such as molecular design or protein engineering. While recent works have introduced scalable methods for reward-guided fine-tuning of such models via reinforcement learning and control schemes, it remains an open problem how to algorithmically trade off reward maximization and constraint satisfaction in a reliable and predictable manner. Motivated by this challenge, we first present a rigorous framework for Constrained Generative Optimization, which brings an optimization viewpoint to the introduced adaptation problem and retrieves the relevant task of constrained generation as a sub-case. Then, we introduce Constrained Flow Optimization (CFO), an algorithm that automatically and provably balances reward maximization and constraint satisfaction by reducing the original problem to sequential fine-tuning via established, scalable methods. We provide convergence guarantees for constrained generative optimization and constrained generation via CFO. Finally, we present an experimental evaluation of CFO on both synthetic yet illustrative settings and a molecular design task. Across these evaluations, CFO achieves consistent increases in reward while ensuring high constraint satisfaction, showcasing its practical utility for constrained generative optimization.

2605.30601 2026-06-01 cs.LG

TASER: Task-Aware Stein Regularisation for Geometry-Driven Robustness

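Michał Kozyra, Gesine Reinert

Affiliations * Department of Statistics, University of Oxford, United Kingdom

AI summary Proposes TASER (Task-Aware Stein Regularisation), a training-time regularisation framework based on Langevin Stein operators that penalises pointwise Stein residuals under the training distribution, inducing anisotropic, data-aware smoothness and improving robustness to distribution shift and adversarial perturbations.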

Abstract

Modern deep networks remain fragile under distribution shift and adversarial perturbations, often due to excessive or poorly structured input sensitivity. We introduce TASER (Task-Aware Stein Regularisation), a training-time regularisation framework derived from Langevin Stein operators. By penalising pointwise Stein residuals under the training distribution, TASER encourages geometric compatibility between predictors and data density, inducing anisotropic, data-aware smoothness. We provide theoretical links between Stein regularisation and reduced first-order shift sensitivity, develop scalable implementation variants compatible with modern architectures, and demonstrate improved robustness and stability across regression and vision benchmarks. Across CIFAR-10 experiments, TASER consistently improves the adversarial robustness of established training methods without incurring statistically significant clean-accuracy degradation.
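The building block behind Stein-based penalties is that a Langevin Stein operator has zero mean under the density whose score it uses, and a nonzero mean otherwise. A minimal one-dimensional check of that identity is sketched below, assuming a standard normal density and the test function sin. How TASER turns pointwise residuals into a training penalty is not specified in the abstract, so this shows only the underlying identity.

```python
import math
import random

def score_std_normal(x):
    """Score function d/dx log p(x) of the standard normal density."""
    return -x

def langevin_stein(f, df, x):
    """1D Langevin Stein operator applied to f: f'(x) + f(x) * score(x)."""
    return df(x) + f(x) * score_std_normal(x)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(50000)]

# Samples that match the density behind the score: the mean residual is near zero.
matched = sum(langevin_stein(math.sin, math.cos, x) for x in samples) / len(samples)
# Samples shifted by 1.0 no longer match, and the mean residual moves away from zero.
shifted = sum(langevin_stein(math.sin, math.cos, x + 1.0) for x in samples) / len(samples)
```

The gap between `matched` and `shifted` is what makes Stein residuals a signal of mismatch between a function and the data density.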

2605.30599 2026-06-01 cs.LG cs.CL

AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

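Saeedeh Davoudi, Reihaneh Iranmanesh, Ophir Frieder, Nazli Goharian

Affiliations * IR Lab, Computer Science Department, Georgetown University, Washington D.C.

AI summary Proposes AMNESIA, the first large-scale open-source medical unlearning benchmark, comprising 70,560 question-answer pairs; evaluates four unlearning methods and finds that unlearning individual patients erodes knowledge of other patients with the same disease.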

Abstract

Medical knowledge is continuously evolving. This creates a need to update or selectively forget information encoded in already-trained medical LLMs. Machine unlearning aims to remove the influence of specific training data from a model without full retraining. Yet, existing unlearning benchmarks rely on synthetic or small-scale general data, leaving clinical unlearning understudied. We introduce AMNESIA, the first large-scale, open-source benchmark for medical unlearning, with 70,560 question-answer pairs from 8,820 patient notes across 11 disease categories. AMNESIA includes both factual questions testing direct recall and reasoning questions testing clinical inference. We use it to evaluate four widely used unlearning methods at both the random-patient and disease levels, and introduce a new metric for detecting leakage of medical terminology. We show that unlearning individual patients erodes knowledge of others with the same condition, calling for methods that can better separate patients from shared clinical knowledge.

2605.30597 2026-06-01 cs.LG

ScaleMAP: Preserving Local Density and Neighborhood Structure in Low-Dimensional Embeddings

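Rajas Poorna, Marcus T. Cicerone

Affiliations * School of Chemical and Biomolecular Engineering; Georgia Institute of Technology; School of Chemistry and Biochemistry

AI summary Proposes ScaleMAP, which divides each pairwise embedding displacement by the geometric mean of the original-space local radii, restoring density information while maintaining UMAP-level neighborhood fidelity and addressing the loss of neighborhood scale in nonlinear dimensionality-reduction methods such as UMAP.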

Comments 23 pages, 16 figures

Abstract

Nonlinear dimensionality-reduction methods such as UMAP and PaCMAP adaptively normalize local distances during graph construction, erasing neighborhood scale from the data. This distorts more than relative cluster sizes: sparse structures like bridges between transitioning cell types and narrow spectral spikes in hyperspectral images can be suppressed or lost entirely. DensMAP adds a density penalty to correct this, but this penalty competes with UMAP's attraction-repulsion forces, scattering points far from their neighborhoods. ScaleMAP takes a different approach: each pairwise embedding displacement is divided by the geometric mean of the two endpoints' original-space local radii, re-injecting scale information as a change of variables rather than as a competing objective. Across standard benchmarks and scientific datasets from transcriptomics, hyperspectral imaging, and flow cytometry, ScaleMAP matches DensMAP on density preservation while maintaining UMAP-level neighborhood preservation. In transcriptomic data, it recovers sparse bridges between cell populations that UMAP collapses; in flow cytometry, it faithfully represents density structure across 17 orders of magnitude. The same principle applied to PaCMAP yields consistently improved density preservation, suggesting the approach generalizes beyond UMAP.
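The rescaling step described above is simple enough to write down directly. In this sketch the local radius is taken to be the distance to the k-th nearest neighbor in the original space, which is an assumption; the abstract does not define it. The function names are hypothetical as well.

```python
import math

def local_radius(points, i, k):
    """Distance from point i to its k-th nearest neighbor in the original space."""
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]

def scaled_displacement(y_i, y_j, r_i, r_j):
    """Embedding displacement divided by the geometric mean of the two local radii."""
    scale = math.sqrt(r_i * r_j)
    return [(a - b) / scale for a, b in zip(y_i, y_j)]

# A pair in a sparse region (radii 1 and 4) must sit 2 units apart in the
# embedding to look like a unit displacement to the layout forces.
d = scaled_displacement([2.0, 0.0], [0.0, 0.0], r_i=1.0, r_j=4.0)
```

Feeding the scaled displacement, rather than the raw one, into the usual attraction and repulsion terms means sparse neighborhoods settle at proportionally larger embedding distances, which is the change-of-variables reading given in the abstract.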

2605.30596 2026-06-01 cs.LG

Improving Relative Representations with Learned Anchors and Whitened Inner Products

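Oscar Thorsted Svendsen, Nikolaj Holst Jakobsen, Fabian Mager, Hiba Nassar

Affiliations * Technical University of Denmark

AI summary Improves relative representations by learning anchors as robust semantic prototypes and adopting a geometry-aware similarity measure (whitened inner products), enabling high-fidelity information transfer and zero-shot communication across models.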

Comments 14 pages, 5 figures

Abstract

Independently trained neural models typically converge to incompatible latent representations, creating a fundamental barrier to highly modular AI systems. While Relative Representations (RR) address this by mapping absolute coordinates to a shared space defined by similarities to common anchor points, traditional implementations rely on randomly sampled anchors and cosine similarity, which frequently fail to capture the anisotropic geometries of modern architectures like Transformers. In this work, we propose a robust framework for cross-model communication based on two improvements. We learn anchors as robust semantic prototypes and utilize a geometry-aware similarity metric which preserves discriminative magnitude information and is invariant to affine shifts. Our approach demonstrates significant gains in performance and consistency across vision and language tasks. Notably, it enables nearly lossless information transfer and stable zero-shot communication even between highly heterogeneous architectures, such as small language models of varying scales.
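One plausible reading of a whitened inner product is the inner product taken after centering by the data mean and whitening by the data covariance. That definition is an assumption here, not taken from the paper, and anchors are fixed by hand rather than learned. Under it, a relative representation is unchanged when the whole latent space undergoes an invertible affine map, which the 2D sketch below checks.

```python
def mean_and_cov(points):
    """Sample mean and (population) covariance entries of 2D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)

def whitened_inner(x, a, mean, cov):
    """(x - mean)^T Sigma^{-1} (a - mean) for a 2x2 covariance Sigma."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx = (x[0] - mean[0], x[1] - mean[1])
    da = (a[0] - mean[0], a[1] - mean[1])
    # Sigma^{-1} = (1 / det) * [[syy, -sxy], [-sxy, sxx]]
    return (dx[0] * (syy * da[0] - sxy * da[1]) + dx[1] * (-sxy * da[0] + sxx * da[1])) / det

def relative_representation(x, anchors, points):
    """Similarities of x to each anchor, with statistics estimated from points."""
    mean, cov = mean_and_cov(points)
    return [whitened_inner(x, a, mean, cov) for a in anchors]

def affine(p):
    """Arbitrary invertible affine map standing in for a second model's latent space."""
    return (2.0 * p[0] + 1.0 * p[1] + 3.0, -1.0 * p[0] + 0.5 * p[1] - 2.0)

points = [(0.0, 0.0), (1.0, 2.0), (3.0, 1.0), (2.0, 4.0), (-1.0, 1.0)]
anchors = [points[1], points[3]]
x = (1.5, 0.5)

rep_a = relative_representation(x, anchors, points)
rep_b = relative_representation(affine(x), [affine(a) for a in anchors],
                                [affine(p) for p in points])
```

Cosine similarity would not survive the same map, because the map changes both angles and norms.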

2605.30593 2026-06-01 cs.LG cs.AI cs.CE

Scientific Machine Learning for Engine Health Management and Remaining Useful Life Prediction

Jostein Barry-Straume, Changmin Son, Adrian Sandu, Gavan Burke, Rekha Sundararajan, Andrew Rimell, James G. Steinrock

Affiliations * Computational Science Laboratory; Department of Computer Science; Virginia Tech

AI summary Proposes a multi-task scientific machine learning framework that jointly predicts turbine gas temperature, its delta, and remaining useful life with quantified uncertainty intervals, in order to support risk-based maintenance decisions.

Abstract

Engine Health Management (EHM) depends on reliable forecasting of Remaining Useful Life (RUL) and on tracking thermal indicators such as turbine gas temperature (TGT). In practice, real-world fleet data are heterogeneous and non-stationary, and point predictions alone are insufficient for risk-aware maintenance decisions. This paper presents a multi-task scientific machine learning framework for turbine prognostics that jointly predicts turbine gas temperature untrimmed (TGTU), Delta Turbine Gas Temperature (DTGT), and RUL, with quantified uncertainty in the form of prediction intervals whose empirical coverage is evaluated. A shared sequence encoder (convolutional front-end with residual bidirectional LSTM layers and attention pooling) feeds task-specific heads, including mean-variance estimation for probabilistic regression and, optionally, a survival head for threshold-based event modeling. The framework is designed to be tunable via a small set of practitioner-facing parameters (e.g., DTGT thresholding rules and RUL target construction) so that deployment can align with in-house policies and proprietary criteria. The predictive performance of the proposed framework is evaluated using both point and interval metrics, including mean absolute error (MAE), prediction interval coverage probability (PICP), mean prediction interval width (MPIW), and the coverage-width criterion (CWC). Results are reported both in aggregate and stratified by flight phase and maintenance segment to highlight operational-context effects and to support uncertainty-aware monitoring.
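The point and interval metrics named above have short standard definitions, sketched here on made-up remaining-useful-life numbers. The coverage-width criterion is left out because several variants exist and the abstract does not say which one is used.

```python
def mae(y_true, y_pred):
    """Mean absolute error of point predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def picp(y_true, lower, upper):
    """Fraction of targets that fall inside their prediction interval."""
    inside = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi)
    return inside / len(y_true)

def mpiw(lower, upper):
    """Mean width of the prediction intervals."""
    return sum(hi - lo for lo, hi in zip(lower, upper)) / len(lower)

# Illustrative values in cycles; not taken from the paper.
rul_true = [120.0, 95.0, 60.0, 30.0]
rul_pred = [110.0, 100.0, 70.0, 28.0]
rul_lower = [100.0, 90.0, 62.0, 20.0]
rul_upper = [125.0, 110.0, 80.0, 35.0]

point_error = mae(rul_true, rul_pred)          # 6.75
coverage = picp(rul_true, rul_lower, rul_upper)  # 0.75
width = mpiw(rul_lower, rul_upper)             # 19.5
```

Coverage and width pull in opposite directions: wider intervals raise PICP at the cost of MPIW, which is why the two are usually reported together.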

2605.30592 2026-06-01 cs.LG

Learning Transferable Predictability Representations

Diyali Goswami, Auroop R. Ganguly

Affiliations * Sustainability and Data Sciences Laboratory (SDS Lab); AI4CaS: AI for Climate and Sustainability; Institute for Experiential AI; Pacific Northwest National Laboratory (PNNL)

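AI summary Proposes the Gauge-Fixed Ordinal Network (GON), which uses an anchor-and-variance objective to learn ordinal scores that are consistent across systems, resolving the scale ambiguity in predictability assessment.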

Comments 27 pages, 3 figures

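Details
AI abstract (translated from Chinese)

We study the problem of assigning a scalar score to a short trajectory window that reflects its position on an ordered continuum of predictability regimes, ranging from structured deterministic dynamics to unstructured stochastic noise. Existing methods discriminate deterministic from stochastic behavior within a single system and do not yield scores with a consistent numerical interpretation across systems. We formalize this as ordinal estimation over a five-level predictability ladder and identify a structural source of cross-system ambiguity: ranking supervision alone leaves the score coordinate unfixed under monotone reparameterization, which we call the gauge freedom of ordinal scoring. We propose the Gauge-Fixed Ordinal Network (GON), a temporal convolutional model trained with an anchor-and-variance objective that pins level-wise score means to shared target coordinates. GON operates on 2-jet features, which expose local trajectory geometry that is preserved by smooth flows and disrupted by stochastic surrogate procedures. On five held-out dynamical systems, initializing from a pretrained GON checkpoint consistently outperforms training from scratch across all window budgets, and adaptation depth reflects geometric proximity to the training family. Zero-shot scores retain ordinal structure at the stochastic boundary, where surrogate procedures most strongly disrupt nonlinear geometry. Pairwise discrimination and globally coherent ordinal scoring are distinct properties, and cross-system transfer requires a stable score coordinate. This has direct implications for predictability assessment, model selection, and early-warning diagnostics in natural and engineered dynamical systems.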

English abstract

We study the problem of assigning a scalar score to a short trajectory window that reflects its position on an ordered continuum of predictability regimes, spanning structured deterministic dynamics to unstructured stochastic noise. Existing methods address deterministic-versus-stochastic discrimination within a single system and do not produce scores with a consistent numerical interpretation across systems. We formalize this as ordinal estimation over a five-level predictability ladder and identify a structural source of cross-system ambiguity: ranking supervision alone leaves the score coordinate unfixed up to a monotone reparameterization, which we term the gauge freedom of ordinal scoring. We propose the Gauge-Fixed Ordinal Network (GON), a temporal convolutional model trained with an anchor-and-variance objective that pins level-wise score means to shared target coordinates. GON operates on 2-jet features that expose local trajectory geometry, preserved by smooth flows and disrupted by stochastic surrogate procedures. On five held-out dynamical systems, initializing from a pretrained GON checkpoint consistently outperforms training from scratch across all window budgets, with adaptation depth reflecting geometric proximity to the training family. Zero-shot scores retain ordinal structure at the stochastic boundary, where surrogate procedures most strongly disrupt nonlinear geometry, and pretrained initialization consistently beats scratch across all window budgets. Pairwise discrimination and globally coherent ordinal scoring are distinct properties requiring a stable score coordinate for cross-system transfer, with direct implications for predictability assessment, model selection, and early-warning diagnostics across natural and engineered dynamical systems.
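
The gauge freedom described in the abstract can be made concrete with a small sketch: a monotone reparameterization of the scores leaves every pairwise ranking intact but moves the numeric values, so a ranking-only loss cannot tell the two apart, whereas a penalty that anchors level-wise means can. The penalty below (squared distance of each level mean to a target, plus a weighted within-level variance) is only an assumed stand-in for the paper's anchor-and-variance objective, whose exact form the abstract does not give. The three-level toy ladder, targets, and function names are likewise hypothetical.

```python
import math
from itertools import combinations

def pairwise_order_accuracy(scores, levels):
    """Share of cross-level pairs whose score order matches their level order."""
    pairs = [(i, j) for i, j in combinations(range(len(scores)), 2)
             if levels[i] != levels[j]]
    correct = sum(1 for i, j in pairs
                  if (scores[i] < scores[j]) == (levels[i] < levels[j]))
    return correct / len(pairs)

def anchor_variance_penalty(scores, levels, targets, var_weight=1.0):
    """Assumed form: sum over levels of (mean - target)^2 + var_weight * variance."""
    total = 0.0
    for level, target in targets.items():
        group = [s for s, l in zip(scores, levels) if l == level]
        mean = sum(group) / len(group)
        variance = sum((s - mean) ** 2 for s in group) / len(group)
        total += (mean - target) ** 2 + var_weight * variance
    return total

# Toy three-level ladder (the paper uses five levels); values are illustrative.
levels = [0, 0, 1, 1, 2, 2]
scores = [0.0, 0.2, 0.9, 1.1, 2.0, 2.2]
targets = {0: 0.0, 1: 1.0, 2: 2.0}

# exp() is strictly increasing, so it is a monotone reparameterization.
warped = [math.exp(s) for s in scores]

acc_original = pairwise_order_accuracy(scores, levels)
acc_warped = pairwise_order_accuracy(warped, levels)
pen_original = anchor_variance_penalty(scores, levels, targets)
pen_warped = anchor_variance_penalty(warped, levels, targets)
print(acc_original, acc_warped, pen_original, pen_warped)
```

Both score sets rank every cross-level pair correctly, yet only the original set lies near the shared target coordinates. A penalty of this kind therefore distinguishes them where ranking supervision alone does not.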

2605.30590 2026-06-01 cs.LG cs.AI cs.CL

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

Matt Turk

Affiliations * Protege Data Lab

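AI summary Proposes the Causal Sensitivity Score (CSS), which mutates oncology cases along five clinical dimensions to test whether a model updates its recommendations in the expected direction. It finds rankings nearly opposite to those of a coverage-based metric and reveals a safety blind spot on surgery-status interventions shared by all frontier models.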

Comments Accepted to RLEval @ ACM CAIS 2026 (Workshop on Methods and RL Environments for Evaluating AI Agents) and selected for an invited talk based on reviewer ratings. 4-page short paper + appendix

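Details
AI abstract (translated from Chinese)

Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave very differently when the patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless of the input. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions (biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations) and scores, on a {0, 0.5, 1.0} scale, whether each model updates its recommendations in the pre-registered correct direction. Compared with the Consensus Match Score (CMS), a coverage-based weighted recall metric, six frontier models from three labs, evaluated in single-shot inference on 224 cases, rank in nearly opposite orders: all six models change rank, the CMS-worst model becomes the CSS-best model, and one upper-mid CMS model ranks last on CSS. We further reveal a universal safety blind spot: every frontier model fails on surgery-status interventions (at most 17.2% CSS on Family D), a finding that CMS does not expose. The metric also applies to tool-using agents: in a ReAct-style experiment, tool use improves CSS for five of the six models (+2.5 to +20.3 percentage points), yet the lowest-CSS model retrieves the same chart sections and still fails to update its recommendations, revealing a structural responsiveness deficit that is visible only under counterfactual evaluation. Cross-judge replication and validation by three medical-professional raters confirm the aggregate findings. Interventional pre-registered metrics such as CSS complement coverage-based evaluation of clinical AI agents: they capture responsiveness that coverage metrics miss and offer a candidate dense reward signal for future agentic reinforcement learning systems.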

English abstract

Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions - biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations - and scores whether each model updates its recommendations in the pre-registered correct direction using a {0, 0.5, 1.0} scale. Benchmarked against the Consensus Match Score (CMS), a coverage-based weighted recall metric, six frontier models from three labs evaluated in single-shot inference across 224 cases rank in nearly opposite orders: all six models change rank, the CMS-worst model becomes CSS-best, and one upper-mid CMS model ranks last on CSS. We further surface a universal safety blind spot: every frontier model fails on surgery-status interventions (at most 17.2% CSS on Family D), a finding CMS does not expose. The metric also transfers to tool-using agents: in a ReAct-style experiment, tool use improves CSS for five of six models (+2.5 to +20.3 percentage points), yet the lowest-CSS model retrieves the same chart sections and still fails to update its recommendations - revealing a structural responsiveness deficit visible only under counterfactual evaluation. Cross-judge replication and three-rater medical-professional validation confirm the aggregate findings. Interventional pre-registered metrics like CSS complement coverage-based evaluation for clinical AI agents: they capture responsiveness that coverage metrics miss and offer a candidate dense reward signal for future agentic RL systems.
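
To illustrate how per-case ratings on the {0, 0.5, 1.0} scale could roll up into the percentage figures quoted in the abstract, here is a minimal sketch that averages ratings per intervention family. The averaging rule, the family letters other than D, and all ratings are assumptions for this example. The abstract does not specify the aggregation, and none of these numbers come from the paper.

```python
from collections import defaultdict

ALLOWED_SCORES = {0.0, 0.5, 1.0}  # rating scale named in the abstract

def css_by_family(records):
    """Mean rating per intervention family, as a percentage (assumed aggregation)."""
    buckets = defaultdict(list)
    for family, score in records:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"score {score!r} is not on the 0 / 0.5 / 1.0 scale")
        buckets[family].append(score)
    return {family: 100.0 * sum(scores) / len(scores)
            for family, scores in sorted(buckets.items())}

# Hypothetical (family, rating) pairs; "D" stands for surgery-status changes,
# the other letters are placeholder labels.
records = [
    ("A", 1.0), ("A", 0.5),
    ("D", 0.0), ("D", 0.0), ("D", 0.5),
    ("E", 1.0),
]

css = css_by_family(records)
for family, value in css.items():
    print(f"Family {family}: {value:.1f}% CSS")
```

Reporting per family rather than as a single mean matches the abstract's framing, where a weak family (surgery status) would otherwise be hidden by stronger ones.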