arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20538 2026-05-21 cs.CV

Continual Segmentation under Joint Nonstationarity

Prashant Pandey, Himanshu Kumar, Devineni Sri Venkatraya Chowdary, Brejesh Lall

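Affiliations: Bharti School of Telecommunications Technology & Management, Indian Institute of Technology, Delhi, India; Yardi School of Artificial Intelligence, Indian Institute of Technology, Delhi, India; Department of Computer Science Engineering, Indian Institute of Technology (Indian School of Mines), Dhanbad, India

AI summary: This paper studies continual semantic segmentation under joint nonstationarity. It proposes a method built on gradient-adaptive stabilization and semi-supervised learning to counter the instability and overfitting caused by distribution drift, and validates it across several scenarios.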

English abstract

Evolving data streams induce joint nonstationarity in continual semantic segmentation, where semantic classes, input distributions, and supervision availability change simultaneously over time. This setting reflects practical structured prediction systems, yet remains largely unexplored in prior continual learning work, which typically studies these factors in isolation. We formalize continual segmentation under coupled class, domain, and label shifts and investigate learning in heterogeneous dense prediction environments with limited annotations and abundant unlabeled data. To address instability and overfitting arising from few-shot supervision under distribution drift, we introduce gradient-adaptive stabilization, a parameter-wise regularization mechanism implemented via gradient-scaled stochastic perturbations that promotes a principled stability-plasticity tradeoff. We further leverage unlabeled data through semi-supervised learning and introduce prototype anchored supervision that validates pseudo-labels via joint confidence and prototype consistency. Together, these mechanisms enable learning under joint nonstationarity in continual segmentation. Extensive empirical evaluation across class-incremental, domain-incremental, and few-shot regimes demonstrates consistent improvements over prior methods in heterogeneous structured prediction settings. Our results expose fundamental failure modes of existing continual segmentation approaches and provide insight into learning robust dense predictors in dynamically evolving environments.
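
The pseudo-label check described above can be pictured as a two-part gate: a pixel's pseudo-label is kept only if the classifier is confident and the pixel's feature is closest to the prototype of that same class. The sketch below is a hypothetical illustration, not the authors' implementation; the function names, the cosine-similarity choice, and the 0.9 confidence threshold are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def validate_pseudo_label(probs, feature, prototypes, conf_threshold=0.9):
    """Return the accepted class index, or None if the pseudo-label is rejected.

    probs: softmax probabilities per class for one pixel
    feature: the pixel's embedding
    prototypes: one vector per class (hypothetical, e.g. class-mean features)
    """
    predicted = max(range(len(probs)), key=lambda c: probs[c])
    if probs[predicted] < conf_threshold:
        return None  # fails the confidence test
    nearest = max(range(len(prototypes)), key=lambda c: cosine(feature, prototypes[c]))
    return predicted if nearest == predicted else None  # prototype consistency test

prototypes = [[1.0, 0.0], [0.0, 1.0]]
accepted = validate_pseudo_label([0.95, 0.05], [0.9, 0.1], prototypes)
rejected_low_conf = validate_pseudo_label([0.6, 0.4], [0.9, 0.1], prototypes)
rejected_mismatch = validate_pseudo_label([0.95, 0.05], [0.1, 0.9], prototypes)
```

Requiring both tests to pass trades recall for precision: fewer unlabeled pixels contribute, but those that do are less likely to carry a wrong label.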

2605.20537 2026-05-21 cs.CL

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

Robert Leaman, Rezarta Islamaj, Zhiyong Lu

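Affiliations: National Library of Medicine

AI summary: This paper proposes a corpus-centric diagnostic framework for analyzing benchmark-relevant properties of biomedical NER and entity linking corpora, showing how corpus characteristics affect the evaluation signal, the generalization demands, and the scope of benchmark conclusions.

Comments: Accepted to the ACL 25th Workshop on Biomedical Language Processing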

English abstract

Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.
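
A train-test overlap statistic of the kind listed in family (3) could be computed along the following lines. This is a minimal sketch; the function name and the lower-casing normalization are assumptions, and the released framework may define overlap differently.

```python
def mention_overlap(train_mentions, test_mentions):
    """Fraction of test mentions whose surface form also appears in training.

    Surface forms are compared case-insensitively (an assumed normalization).
    """
    seen = {m.lower() for m in train_mentions}
    if not test_mentions:
        return 0.0
    hits = sum(1 for m in test_mentions if m.lower() in seen)
    return hits / len(test_mentions)

train = ["breast cancer", "Aspirin", "T cell"]
test = ["aspirin", "glioma", "T cell", "ibuprofen"]
overlap = mention_overlap(train, test)  # 2 of the 4 test mentions were seen in training
```

A high value suggests that a benchmark score may partly reflect memorized surface forms rather than generalization to unseen mentions.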

2605.20536 2026-05-21 cs.CV

HADS-Net: A Hybrid Attention-Augmented Dual-Stream Network with Physics-Informed Augmentation for Breast Ultrasound Image Classification

Chinedu Emmanuel Mbonu, Blessing Nwamaka Iduh, Joseph Ikechukwu Odo, Doris Chinedu Asogwa

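Affiliations: NedumCares

AI summary: This paper proposes HADS-Net, a hybrid attention-augmented dual-stream network that exploits global texture and local boundary cues through two parallel pathways, combined with physics-informed augmentation, to improve the accuracy of breast ultrasound image classification.

Comments: 7 pages, 4 figures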

English abstract

Accurate classification of breast ultrasound images into benign, malignant, and normal categories is a critical clinical task complicated by speckle noise, acoustic shadowing, and inter-class visual ambiguity. Existing deep learning methods rely on single-stream architectures with generic augmentation that ignores ultrasound acquisition physics, and no prior method dedicates a stream to the lesion boundary features identified as the most diagnostically significant visual cue. We propose HADS-Net, a Hybrid Attention-Augmented Dual-Stream Network exploiting global texture and local boundary cues through two parallel pathways. Stream 1 applies physics-informed augmentation simulating speckle noise, acoustic shadowing, and gain variation before extracting features via pretrained EfficientNet-B3 projected to 512 dimensions. Stream 2 extracts Sobel edge maps processed by a lightweight CNN projected to the same 512-dimensional space. A cross-attention fusion module allows the texture stream to selectively query boundary features, producing a jointly optimised representation classified by an MLP trained with adaptive class-weighted focal loss. Five-fold stratified cross-validation with cosine annealing over 50 epochs is used, with the globally best checkpoint selected by lowest validation loss evaluated on a held-out test set. On the BUSI dataset, HADS-Net achieves 96.58% accuracy, macro ROC-AUC of 0.9978, macro F1 of 0.9654, and per-class F1-scores of 0.970, 0.951, and 0.976 for benign, malignant, and normal. No malignant lesion is misclassified as normal. These results confirm that modality-specific augmentation with cross-modal attention fusion is an effective strategy for ultrasound-based breast cancer diagnosis.
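
The two stream inputs can be illustrated with plain Python: a speckle-style augmentation for the texture stream and a Sobel edge map for the boundary stream. This is only a sketch under stated assumptions; the multiplicative Gaussian noise model, the `sigma` value, and the function names are not taken from the paper, which may simulate speckle differently.

```python
import math
import random

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def add_speckle(image, sigma=0.1, seed=0):
    """Multiplicative noise: each pixel is scaled by (1 + n), n ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [[p * (1.0 + rng.gauss(0.0, sigma)) for p in row] for row in image]

def sobel_magnitude(image):
    """Gradient magnitude on interior pixels (border pixels are left at 0)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out

# A 4x4 image with a vertical edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = sobel_magnitude(image)   # interior pixels next to the edge respond strongly
noisy = add_speckle(image)       # zero pixels stay zero because the noise is multiplicative
```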

2605.20534 2026-05-21 cs.LG cs.AI stat.ML

Axiomatizing Neural Networks via Pursuit of Subspaces

Mehmet Yamac, Mert Duman, Ugur Akpinar, Felix Rojas Casadiego, Serkan Kiranyaz, Marcel van Gerven, Moncef Gabbouj

Affiliations: Tampere University, Faculty of ITC, Finland; Department of Electrical Engineering, Qatar University, Qatar; Donders Institute, Radboud University, The Netherlands

AI summary: This paper proposes a framework based on geometric axioms for explaining neural network behavior. Through the Pursuit of Subspaces hypothesis, it unifies perspectives on representation, computation, and generalization in shallow and deep architectures.

Comments: 43 pages, 25 figures. Code and additional materials will be released

English abstract

While deep neural networks have achieved remarkable success across a wide range of domains, their underlying mechanisms remain poorly understood, and they are often regarded as black boxes. This gap between empirical performance and theoretical understanding poses a challenge analogous to the pre-axiomatic stage of classical geometry. In this work, we introduce the Pursuit of Subspaces (PoS) hypothesis, an axiomatic framework that formulates neural network behavior through a set of geometric postulates. These axioms, together with their derived consequences, provide a unified perspective on representation, computation, and generalization in both shallow and deep architectures. We show that this framework yields geometric explanations for fundamental questions in deep learning, including representation structure, architectural mechanisms, and generalization behavior, offering a principled step toward a coherent theoretical foundation.

2605.20533 2026-05-21 cs.LG

Ada2MS: A Hybrid Optimization Algorithm Based on Exponential Mixing of Elementwise and Global Second-Moment Estimates

Meng Zhu, Quan Xiao, Weidong Min

Affiliations: School of Information Management and Mathematics, Jiangxi University of Finance and Economics; School of Mathematics and Computer Science, Nanchang University; Institute of Metaverse, Nanchang University; Jiangxi Provincial Key Laboratory of Virtual Reality

AI summary: This paper proposes the Ada2MS algorithm, which balances the strengths and weaknesses of AdamW and momentum SGD through continuous exponential interpolation between elementwise and global second-moment estimates, and obtains competitive results on vision tasks.

English abstract

Optimization algorithms are core methods by which machine learning models iteratively minimize loss functions, update parameters, learn from data, and improve performance. Momentum SGD and AdamW represent two important optimization paradigms. AdamW produces stable updates and usually has strong robustness across training scenarios, but its generalization performance is sometimes weaker than that of momentum methods. Momentum SGD can often obtain better generalization after careful tuning, but it is more sensitive to gradient-scale variation and hyperparameter settings. To balance the strengths and weaknesses of the two paradigms, this paper proposes Ada2MS, an optimization algorithm that achieves a smooth transition between AdamW-like behavior and momentum-SGD-like behavior through continuous exponential interpolation between elementwise second-moment estimates and global second-moment estimates. On the visual tasks evaluated in this study, Ada2MS obtains competitive results under a unified optimizer-comparison protocol. The code will be released at https://github.com/mengzhu0308/Ada2MS
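
One plausible reading of "exponential interpolation" is a geometric mix of the two second-moment estimates, with the mixing exponent selecting the regime. The sketch below follows that reading only; the exact Ada2MS update rule is not given in the abstract, so the formula, the use of the mean as the global estimate, and the name `mixed_denominator` are assumptions.

```python
import math

def mixed_denominator(v_elem, lam, eps=1e-8):
    """Per-coordinate step denominators from a geometric mix of second moments.

    v_elem: elementwise second-moment estimates (as kept by Adam-style optimizers)
    lam: mixing exponent in [0, 1]; 0 keeps the elementwise estimate,
         1 replaces it with the global mean (assumed form, not from the paper)
    """
    v_global = sum(v_elem) / len(v_elem)
    return [math.sqrt(v ** (1.0 - lam) * v_global ** lam) + eps for v in v_elem]

v = [1.0, 4.0, 16.0]
adam_like = mixed_denominator(v, lam=0.0)  # differs per coordinate
sgd_like = mixed_denominator(v, lam=1.0)   # identical for every coordinate
```

With a shared denominator, the update reduces to a momentum step with a rescaled learning rate, which is the sense in which the `lam=1.0` end resembles momentum SGD.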

2605.20529 2026-05-21 cs.CL cs.AI

Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks

Claire Hobbs, R. Thomas McCoy

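Affiliations: Yale University; Cognitive Science Program; Dept. of Linguistics; Wu Tsai Institute

AI summary: This paper examines how statistical signals in linguistic input can aid the acquisition of syntax. It proposes the collocational bootstrapping hypothesis, in which regularities in word co-occurrence provide cues to syntactic dependencies, and tests whether this mechanism can support the acquisition of English subject-verb agreement.

Comments: Accepted to CoNLL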

English abstract

In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.

2605.20525 2026-05-21 cs.CV cs.AI cs.CL cs.LG eess.IV

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

Mohammad H. Abbasi, Favour Nerrise, Shaurnav Ghosh, Ridvan Yesiloglu, Yuncong Mao, Bailey Trang, Mohammad Asadi, Merryn Daniel, Gustavo Chau Loo Kung, Ken Chang, Pavan Pinkesh Shah, Adam Turnbull, Kyan Younes, Seena Dehkharghani, Ehsan Adeli

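Affiliations: Stanford University

AI summary: This paper presents NeuroQA, a large-scale visual question answering benchmark for 3D brain MRI with 56,953 QA pairs from 12,977 subjects, spanning ages 5-104 and five clinical domains. It evaluates 11 clinical reasoning skills on full 3D volumes and provides reproducible generation scripts and an online leaderboard.

Comments: 30 pages, dataset and benchmark release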

English abstract

We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from >80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.
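
A text-only majority-template floor of the kind mentioned above can be read as: for each question template, always answer with the most frequent training answer and never look at the image. The sketch below implements that reading; the `(template_id, answer)` data layout and the function name are assumptions, and the benchmark may compute its floor differently.

```python
from collections import Counter, defaultdict

def majority_template_accuracy(train_items, test_items):
    """Accuracy of answering each test item with its template's most common training answer.

    Items are (template_id, answer) pairs; templates unseen in training count as wrong.
    """
    by_template = defaultdict(Counter)
    for template_id, answer in train_items:
        by_template[template_id][answer] += 1
    correct = 0
    for template_id, answer in test_items:
        counts = by_template.get(template_id)
        guess = counts.most_common(1)[0][0] if counts else None
        correct += guess == answer
    return correct / len(test_items)

train = [("t1", "yes"), ("t1", "yes"), ("t1", "no"), ("t2", "B"), ("t2", "B")]
test = [("t1", "yes"), ("t1", "no"), ("t2", "B"), ("t2", "C")]
floor = majority_template_accuracy(train, test)
```

A model that scores below such a floor has not demonstrably used the image, which is why the comparison matters for the reported 47.5% and 43.7% results.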

2605.20523 2026-05-21 cs.LG cs.AI q-bio.QM

Machine-Learning-Enhanced Non-Invasive Testing for MASLD Fibrosis: Shallow-Deep Neural Networks Versus FIB-4, Tabular Foundation Models, and Large Language Models

Athanasios Angelakis, Gabriele De Vito, Eleni-Myrto Trifylli, Filomena Ferrucci

Affiliations: BioML Lab, RI CODE, UniBw, Munich, Germany; Department of Epidemiology and Data Science, Amsterdam UMC, Amsterdam, Netherlands; Alpha Indicium, Rijswijk, Netherlands; Department of Computer Science, University of Salerno, Salerno, Italy; GI-Liver Unit, 2nd Department of Internal Medicine, National and Kapodistrian University of Athens, General Hospital of Athens “Hippocratio”, Athens, Greece

AI summary: This paper studies machine-learning-enhanced non-invasive testing for detecting fibrosis in MASLD, comparing a shallow-deep neural network, FIB-4, a tabular foundation model, and a large language model across cohorts. It finds that the shallow-deep neural network gives a more balanced external operating profile while preserving the FIB-4 variable space.

Comments: 26 pages, 4 figures, 3 tables. Preprint

English abstract

Advanced fibrosis is a major determinant of liver-related morbidity in metabolic dysfunction-associated steatotic liver disease (MASLD). FIB-4 is widely used as a first-line non-invasive test, but its fixed formula may underuse diagnostic information contained in age, aspartate aminotransferase, alanine aminotransferase, and platelet count. We evaluated whether machine-learning-enhanced non-invasive testing (MLE-NIT) can improve advanced fibrosis detection while preserving this FIB-4 variable space. We used three biopsy-confirmed MASLD cohorts from China, Malaysia, and India (n=784). The Chinese cohort was split into 486 training and 54 internal validation/tuning patients; final performance was reported only on the Malaysian and Indian external cohorts. Models used five variables: age, FIB-4, aspartate aminotransferase, platelet count, and alanine aminotransferase. We compared FIB-4 with a shallow-deep neural network (s-DNN), TabPFN, and gpt-4o-2024-08-06. FIB-4 achieved external ROC-AUCs of 0.75 and 0.60 in Malaysia and India, respectively. TabPFN achieved 0.69 and 0.66, fine-tuned GPT-4o achieved 0.75 and 0.63, and the s-DNN achieved 0.77 and 0.67, respectively. The s-DNN contained only 354 trainable parameters, compared with 7,244,554 for TabPFN, yet provided a more balanced external operating profile. Calibration showed s-DNN Brier scores of 0.18 and 0.22, and permutation importance identified AST and FIB-4 as dominant variables. Compact non-linear MLE-NITs may enhance FIB-4-based fibrosis assessment without increasing clinical data requirements.
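
For reference, the fixed formula in question is the standard FIB-4 index: age times AST, divided by platelet count times the square root of ALT. The sketch below computes it and assembles the five-variable input listed in the abstract; the function names, the patient values, and the feature order are illustrative assumptions, not the authors' preprocessing.

```python
import math

def fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_10e9_per_l):
    """Standard FIB-4 index: (age * AST) / (platelets * sqrt(ALT))."""
    return (age_years * ast_u_per_l) / (platelets_10e9_per_l * math.sqrt(alt_u_per_l))

def model_features(age_years, ast_u_per_l, alt_u_per_l, platelets_10e9_per_l):
    """The five variables named in the abstract, in an assumed order."""
    score = fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_10e9_per_l)
    return [age_years, score, ast_u_per_l, platelets_10e9_per_l, alt_u_per_l]

# Hypothetical patient: 50 years, AST 40 U/L, ALT 25 U/L, platelets 200 x 10^9/L.
score = fib4(50, 40, 25, 200)
features = model_features(50, 40, 25, 200)
```

Because FIB-4 is a deterministic function of the other four variables, a learned model over this input adds no new clinical measurements; it can only reweight the same information non-linearly.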

2605.20521 2026-05-21 cs.LG cs.CR

An exponential mechanism based on quadratic approximations for fine-tuning machine learning models with privacy guarantees

Hoang Tran, Jorge Ramirez, Jiayi Wang, Alberto Bocchinfuso, Christopher Stanley, M. Paul Laiu

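Affiliations: Computer Science and Mathematics Division, Oak Ridge National Laboratory; Computational Science and Engineering Division, Oak Ridge National Laboratory; HPC Department, Cineca

AI summary: This paper proposes a randomized algorithm based on the exponential mechanism for fine-tuning pretrained models under differential privacy guarantees. It builds a utility function that combines a local quadratic approximation with information from the new dataset, and introduces a random-projection strategy to scale to high-dimensional models.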

English abstract

Fine-tuning adapts a pretrained machine learning model to a small, sensitive dataset, but this process risks memorizing individual new data points, making the model vulnerable to adversaries who seek to extract sensitive information. In this work, we develop a randomized algorithm based on the exponential mechanism for fine-tuning while ensuring differential privacy. Our key idea is to construct a simple utility function that combines a local quadratic approximation of the pretrained model with information from the new dataset. The resulting exponential mechanism admits exact sampling from a multivariate normal distribution in closed form. We establish theoretical privacy guarantees, sensitivity bounds, and accuracy estimations for our method. We further introduce a random-projection strategy that makes the approach scalable to high-dimensional models. Numerical experiments on the MNIST benchmark and the MIMIC clinical dataset demonstrate competitive performance against existing differentially private fine-tuning techniques.
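
The closed-form sampling claim can be checked in the scalar case. The exponential mechanism samples with density proportional to exp(epsilon * u / (2 * sensitivity)); if the utility is quadratic, u(theta) = -0.5 * h * (theta - mu)^2, that density is a normal distribution with mean mu and variance 2 * sensitivity / (epsilon * h). The sketch below encodes only this identity. How the paper builds mu and h from the pretrained model and the new data, and how it bounds the sensitivity, is not reproduced here; all names and values are hypothetical.

```python
import math
import random

def gaussian_from_quadratic_utility(mu, curvature, epsilon, sensitivity):
    """Mean and std of the density proportional to exp(epsilon * u / (2 * sensitivity)),
    where u(theta) = -0.5 * curvature * (theta - mu) ** 2 (scalar case)."""
    variance = 2.0 * sensitivity / (epsilon * curvature)
    return mu, math.sqrt(variance)

def sample_parameter(mu, curvature, epsilon, sensitivity, rng):
    """Draw one parameter value exactly; a Gaussian needs no MCMC."""
    mean, std = gaussian_from_quadratic_utility(mu, curvature, epsilon, sensitivity)
    return rng.gauss(mean, std)

mean, std = gaussian_from_quadratic_utility(mu=0.5, curvature=4.0, epsilon=1.0, sensitivity=0.5)
draw = sample_parameter(0.5, 4.0, 1.0, 0.5, random.Random(0))
```

A smaller epsilon (stronger privacy) or a larger sensitivity widens the distribution, so the sampled parameter strays further from the utility maximizer. The privacy guarantee holds only if `sensitivity` is a valid bound, which is what the paper's analysis is for.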

2605.20520 2026-05-21 cs.AI

Open-World Evaluations for Measuring Frontier AI Capabilities

Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois, Gillian K Hadfield, Andrew B. Hall, Sara Hooker, Seth Lazar, Steve Newman, Dimitris Papailiopoulos, Shoshannah Tekofsky, Helen Toner, Cozmin Ududec, Arvind Narayanan

Affiliations: Princeton University; Cornflower Labs; Meridian Labs; Stanford University; UK AI Security Institute; Johns Hopkins University; Adaption Labs; Australian National University; Golden Gate Institute for AI; UW Madison; Microsoft Research; AI Digest; Georgetown University (CSET)

AI summary: This paper proposes open-world evaluations as a complementary approach that assesses long-horizon, messy, real-world tasks through small-sample qualitative analysis to measure AI capabilities more accurately, and introduces the CRUX project as an effort to run such evaluations regularly.

English abstract

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

2605.20515 2026-05-21 cs.LG eess.SP

Online Conformal Prediction with Corrupted Feedback

Bowen Wang, Matteo Zecchin, Osvaldo Simeone

Affiliations: Department of Engineering, King’s College London; Communication Systems Department, EURECOM; Institute for Intelligent Networked Systems, Northeastern University London

AI summary: This paper studies the robustness of online conformal prediction when feedback is corrupted, proposes two robust schemes, and shows experimentally that they improve performance under corrupted feedback.

English abstract

Modern artificial intelligence systems require calibrated uncertainty estimates that remain reliable in sequential and non-stationary environments. Online conformal prediction (OCP) addresses this challenge through adaptively updated prediction sets that provide deterministic long-run miscoverage guarantees. These guarantees, however, hinge on the assumption of perfect feedback about the coverage of past prediction sets. In practice, the observed miscoverage indicator may be corrupted by noise, communication failures, or adversarial manipulation, which can severely degrade OCP's calibration guarantees. In this paper, we study OCP under corrupted feedback. We first model feedback corruption as an arbitrary binary flip sequence, and analyze how feedback corruption affects and degrades the miscoverage performance of standard OCP. We then propose two robust schemes: robust OCP via filtering, which leverages the structural properties of the predicted threshold to filter corrupted feedback, and robust OCP via active compensation, which incorporates an active compensation mechanism to mitigate the effect of corrupted feedback. For both methods, we establish explicit miscoverage guarantees, which are further specialized for an independent stochastic flip model and for an arbitrary error model with memory bounds. Experiments on real-world datasets validate the proposed approach, showing markedly improved calibration and significantly smaller prediction sets compared with baseline OCP methods under corrupted feedback.
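
The baseline being made robust can be sketched with the usual online threshold update, q <- q + eta * (err - alpha), where err is 1 when the last prediction set missed the label. The simulation below runs that update with clean and with randomly flipped feedback. It is a toy illustration, not the paper's robust schemes; the uniform scores, the i.i.d. flip model, and all parameter values are assumptions.

```python
import random

def run_ocp(scores, alpha, eta, flip_prob=0.0, seed=0):
    """Run the update q <- q + eta * (observed_err - alpha) and return the
    true long-run miscoverage rate.

    A point is miscovered when its nonconformity score exceeds the threshold q.
    With flip_prob > 0 the observed indicator is flipped at random.
    """
    rng = random.Random(seed)
    q, true_errors = 0.5, 0
    for s in scores:
        err = 1 if s > q else 0
        true_errors += err
        observed = 1 - err if rng.random() < flip_prob else err
        q += eta * (observed - alpha)
    return true_errors / len(scores)

data_rng = random.Random(1)
scores = [data_rng.random() for _ in range(5000)]
clean_rate = run_ocp(scores, alpha=0.1, eta=0.05)
corrupted_rate = run_ocp(scores, alpha=0.1, eta=0.05, flip_prob=0.05)
```

With clean feedback the threshold stays bounded, so the true miscoverage rate is pinned close to `alpha` whatever the score distribution. With flipped feedback the update steers the observed rate instead, and the true rate can drift away from the target.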

2605.20510 2026-05-21 cs.CV cs.AI cs.CY

ShadeBench: A Benchmark Dataset for Building Shade Simulation in Sustainable Society

Longchao Da, Mithun Shivakoti, Xiangrui Liu, T Pranav Kutralingam, Yezhou Yang, Hua Wei

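Affiliations: School of Computing and Augmented Intelligence, Arizona State University; Global Futures Laboratory, Arizona State University

AI summary: This paper presents ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. Its multimodal data supports shade generation, shade segmentation, and 3D building reconstruction, and it provides standardized evaluation protocols and baseline methods as a foundation for data-driven urban climate research and heat-resilient urban planning.

Comments: 12 pages, 13 figures, 2 tables. Accepted by KDD 2026 AI for Sciences Track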

English abstract

Urban heat exposure is becoming an increasingly critical challenge due to the intensifying urban heat island effect. Fine-grained shade patterns, especially those induced by urban buildings, strongly influence pedestrians' thermal exposure and outdoor activity planning. However, accurately modeling and analyzing urban shade at scale remains difficult because of the lack of large-scale datasets and systematic evaluation frameworks. To address this challenge, we present ShadeBench, a comprehensive dataset and benchmark for urban shade understanding. ShadeBench contains geographically diverse urban scenes with temporally varying simulated shade maps and textual descriptions, together with aligned satellite imagery, building skeleton representations, and 3D building meshes. Built upon this multimodal dataset, ShadeBench supports a range of downstream tasks, including shade generation, shade segmentation, and 3D building reconstruction. We further establish standardized evaluation protocols and baseline methods for these tasks. By enabling scalable and fine-grained shade analysis, ShadeBench provides a foundation for data-driven urban climate research and supports future studies in heat-resilient urban planning and decision-making. The code and dataset are publicly available at https://darl-genai.github.io/shadebench/.
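
The geometry behind time-varying building shade is simple on flat ground: a building of height h casts a shadow of length h / tan(elevation), where elevation is the sun's angle above the horizon. The sketch below shows only this textbook relation; it is not ShadeBench's simulator, and the function name and example values are assumptions.

```python
import math

def shadow_length(building_height_m, sun_elevation_deg):
    """Shadow length on flat ground; infinite when the sun is at or below the horizon."""
    if sun_elevation_deg <= 0:
        return math.inf
    return building_height_m / math.tan(math.radians(sun_elevation_deg))

noon_like = shadow_length(30.0, 60.0)   # high sun, short shadow
evening = shadow_length(30.0, 20.0)     # low sun, long shadow
at_45 = shadow_length(30.0, 45.0)       # shadow length equals building height
```

As the elevation changes through the day, the shaded footprint stretches and shrinks, which is why shade maps have to be indexed by time as well as location.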

2605.20506 2026-05-21 cs.LG cs.CL

Reinforcing Human Behavior Simulation via Verbal Feedback

Weiwei Sun, Xuhui Zhou, Jiarui Liu, Weihua Du, Haojia Sun, Yiqing Xie, Qianou Ma, Sihao Chen, Mengting Wan, Longqi Yang, Pei Zhou, Sherry Wu, Sean Welleck, Graham Neubig, Yiming Yang, Maarten Sap

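Affiliations: Carnegie Mellon University; Microsoft

AI summary: This paper presents DITTO, a model that improves the ability of LLMs to simulate human behavior by treating verbal feedback as a first-class signal in reinforcement learning. It also introduces the SOUL benchmark and reports substantial gains across multiple tasks.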

English abstract

Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.
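
GRPO scores each rollout relative to the other rollouts sampled for the same prompt rather than against a learned value function. The sketch below shows that group normalization step only. The reward values are hypothetical, the abstract does not say how DITTO scores rollouts, and the use of the population standard deviation is an assumption.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group: three original rollouts and one feedback-conditioned rollout.
rewards = [0.2, 0.4, 0.9, 0.5]
advantages = group_relative_advantages(rewards)
best = advantages.index(max(advantages))
```

Rollouts that beat their group's average get a positive advantage and are reinforced; if a feedback-conditioned rollout scores higher, the policy is pushed toward what the feedback elicited.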

2605.20502 2026-05-21 cs.LG cs.AI cs.CV stat.AP stat.ML

Tippett-minimum Fusion of Representation-space Diffusion Models for Multi-Encoder Out-of-Distribution Detection

Neelkamal Bhuyan

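Affiliations: Georgia Institute of Technology

AI summary: This paper proposes a multi-encoder fusion of representation-space diffusion models. It statistically analyzes each encoder's sensitivity to specific distribution-shift types and introduces the EncMin2L gating mechanism, which improves out-of-distribution detection at lower parameter cost without OOD labels and reaches an AUROC of at least 0.94 on all four shift types.

Comments: 14 pages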

英文摘要

We address out-of-distribution (OOD) detection across the full spectrum of distribution shifts (global domain changes, semantic divergence, texture differences, and covariate corruptions) through a multi-encoder fusion of per-encoder representation-space diffusion models (RDMs). We statistically identify each encoder's sensitivity to specific shift types from in-distribution (ID) data alone and introduce EncMin2L, an encoder-agnostic two-level $\min(\cdot)$-gate that combines and calibrates per-encoder diffusion-based likelihood detectors without OOD labels, outperforming monolithic multi-encoder baselines at $2.3\times$ lower parameter cost. Two ID-data diagnostics, $\eta^2$ (class-conditional F-test) and $\Delta\mu$ (log-likelihood shift under synthetic corruptions), quantify encoder specialization, while a Tippett minimum $p$-value combination aggregates per-encoder scores into a single, calibration-stable OOD signal. EncMin2L achieves $\geq 0.94$ AUROC across all four shift types simultaneously, outperforming the state-of-the-art representation-space diffusion OOD detectors across overlapping benchmarks.
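The Tippett minimum p-value combination mentioned in the abstract can be illustrated with a minimal sketch. It assumes the per-encoder p-values are independent and uniform under the null; how the paper turns diffusion scores into p-values, and how it calibrates them, is not stated here, so the function name and the example values below are hypothetical.

```python
def tippett_combine(p_values):
    """Combine k p-values with Tippett's minimum-p rule.

    Assumption: the p-values are independent and uniform under the null.
    Then min(p) has CDF 1 - (1 - t)**k, so the combined p-value is
    1 - (1 - min(p))**k.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p < 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    k = len(p_values)
    return 1.0 - (1.0 - min(p_values)) ** k


# Hypothetical p-values from three encoders; only one encoder finds the
# input unusual, and the minimum rule lets that single encoder drive the result.
per_encoder_p = [0.40, 0.03, 0.75]
combined = tippett_combine(per_encoder_p)
```

The design point is that a sample only needs to look unusual to one encoder to receive a small combined p-value, which fits a setting where each encoder is sensitive to a different shift type.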

2605.20495 2026-05-21 cs.CV

A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models

Abhiram Kandiyana, Ankur Mali, Lawrence O. Hall, Peter R. Mouton, Dmitry Goldgof

Affiliations University of South Florida; SRC Biosciences

AI summary This paper proposes a human-in-the-loop framework that casts prompt-set construction for microscopy vision-language models as a target-driven active learning problem, reducing the number of expert-verified images needed to reach a target classification performance.

Comments Accepted to CVPR workshops, 2026

Abstract

Deep-learning pipelines for microscopy image classification often require expensive, labor- and time-intensive expert annotation to produce high-quality ground truth for training. Recent work has shown that prompt tuning of vision-language models (VLMs) can reduce manual annotation by constructing a small prompt set of expert-verified image-caption exemplars that is reused as few-shot context to classify all remaining images at inference time. To further reduce effort, the VLM can draft captions for candidate exemplars, which experts then verify and lightly edit instead of writing text de novo. However, two practical questions remain unaddressed: (1) which unlabeled images should be prioritized for verification, and (2) how many verified exemplars are needed to reach a performance target. In this work, we address these questions by formulating prompt-set construction as a target-driven active learning problem that prioritizes which images to annotate. We study three complementary selection criteria under strict low-resource constraints with small unlabeled pools. Experiments show that our methods reach the target performance with substantially fewer expert-verified images than random selection, achieving 100% test accuracy with as few as 20 annotated images on average. More broadly, our human-in-the-loop framework demonstrates a human-centered use of generative AI in biomedical image analysis, where experts remain actively involved in verifying and refining model output while significantly reducing annotation cost. Code and data will be publicly available.

2605.20494 2026-05-21 cs.LG physics.ao-ph stat.AP

A 10,000-Year Global Stochastic Tropical Cyclone Catalog with Wind-Dependent Track Transitions (WHITS)

Jennifer Nakamura, Upmanu Lall

Affiliations Lamont-Doherty Earth Observatory, Columbia University; School of Complex Adaptive Systems, Arizona State University; Earth and Environmental Engineering, Columbia University

AI summary This paper presents WHITS, a non-parametric semi-Markov track generator used to produce a 10,000-year global synthetic tropical cyclone catalog, with the aim of improving the reliability of insured-loss assessment.

Abstract

Reliable assessment of tropical cyclone (TC) risk is limited by the brevity and spatial sparsity of the historical record, particularly for the rare, high-intensity landfalls that dominate insured loss. We present WHITS (Wind-focused Hurricane Interactive Track Simulator), a non-parametric semi-Markov track generator that extends the HITS framework of Nakamura et al. (2015) in three ways: transitions between historical track segments are conditioned on local wind speed in addition to position, age, and forward vector; the kernel selection on the comparative-vector term is sharpened to suppress dynamically inconsistent jumps; and a short smoothing window is applied across each transition to remove the position and wind discontinuities reported by downstream surge users. WHITS is fit to the full available best-track record in each of six basins in IBTrACS, extending in the North Atlantic to 1851 and in other basins to the earliest year of reliable best-track data. The resulting 10,000-yr global synthetic catalog reproduces observed track density and the annual hurricane/typhoon-force wind-hit probability across all basins. The catalog is intended for catastrophe-risk applications where a large, low-bias sample of physically plausible tracks is more useful than a small, statistically corrected one.

2605.20485 2026-05-21 cs.LG

ZEBRA: Zero-shot Budgeted Resource Allocation for LLM Orchestration

May Hamri, Inbal Talgam-Cohen

Affiliations Tel Aviv University

AI summary This study proposes ZEBRA, a framework that reduces multi-phase budget allocation to a continuous nonlinear knapsack problem in order to split a budget across the phases of a multi-agent pipeline. Experiments show that it outperforms the LLM-direct baseline on several tasks.

Abstract

As autonomous agents increasingly execute end-to-end tasks under fixed monetary budgets, the pressing open question shifts from whether the budget is respected to how to spend it effectively. Existing budget-aware methods typically control reasoning step-by-step within a single agent, or learn resource allocation policies via RL. None address how to split a budget across the constituent phases of a multi-agent pipeline at inference time. We propose ZEBRA, a zero-shot framework that reduces multi-phase budget allocation to a continuous nonlinear knapsack problem: an LLM controller estimates per-phase utility curves, and a water-filling search on the Lagrange multiplier returns the per-phase split. Additive and multiplicative aggregations are unified under the same solver. On a 150-task APPS coding benchmark, both ZEBRA variants outperform LLM-direct (budget allocation directly by an LLM) on every aggregate metric. At a budget of $\alpha = 0.5$ of the unconstrained spend, ZEBRA recovers $94.4\%$ of unconstrained quality, versus $88.1\%$ for LLM-direct. The advantage is statistically significant and transfers beyond coding: on a 3-phase HotpotQA pipeline, ZEBRA beats LLM-direct by $14.3$ pp, with allocations empirically robust to curve-estimation noise. On HotpotQA, ZEBRA arrives at a different budget split (near-balanced) compared to the APPS one (skewed towards a refinement phase), showing adaptation to the pipeline structure. More broadly, we show that lightweight algorithmic guidance at inference time can improve the economic behavior of autonomous multi-agent systems.
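A water-filling search on the Lagrange multiplier, as described above, can be sketched for the additive case. This is an illustration under stated assumptions, not the authors' solver: the utility curves are assumed to be concave and of the hypothetical form u(x) = a * log(1 + x / s), whereas in the paper the curves are estimated by an LLM controller. The names `allocate` and `curves` are invented for this example.

```python
def allocate(curves, budget, iters=200):
    """Split `budget` across phases by bisection on the multiplier lam.

    curves: list of (a, s) pairs for the assumed utility
            u(x) = a * log(1 + x / s), with marginal utility a / (s + x).
    At the optimum every funded phase has marginal utility equal to lam,
    which gives x = a / lam - s, clipped at zero for unfunded phases.
    """
    def spend(lam):
        return [max(0.0, a / lam - s) for a, s in curves]

    lo = 1e-12                          # tiny price: spends far too much
    hi = max(a / s for a, s in curves)  # price at which nothing is funded
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(spend(mid)) > budget:
            lo = mid                    # overspending, so raise the price
        else:
            hi = mid                    # within budget, so lower the price
    return spend(hi)


# Three hypothetical phases; the third has a low marginal return and
# ends up with no budget at all.
split = allocate([(1.0, 1.0), (2.0, 1.0), (0.5, 4.0)], budget=6.0)
```

Because total spend decreases monotonically as the multiplier rises, a one-dimensional bisection is enough, which is what makes this kind of allocation cheap at inference time.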

2605.20484 2026-05-21 cs.RO

Enhancing Graph-Based SLAM in GNSS-Denied environments by leveraging leg odometry

Léon Perruchot-Triboulet, Luc Jaulin, Kai Xiao

Affiliations LinxAI Tech

AI summary This paper proposes a factor-graph architecture that combines proprioceptive leg odometry with LiDAR-inertial odometry, reducing elevation drift in GNSS-denied environments and making SLAM more robust.

Comments 4 pages, 3 figures, 2 tables, for ICRA workshop on Robot Meets GNSS and Ranging for Seamless Autonomy

Abstract

Autonomous navigation in GNSS-denied environments remains a core challenge for legged robots, where exteroceptive sensors such as LiDAR are prone to elevation drift in geometrically sparse or repetitive scenes. We present a factor graph architecture that augments the LIO-SAM framework with a parallel kinematic lane driven by proprioceptive leg odometry, coupled to the main LiDAR-inertial lane via an identity relative pose constraint with a selective noise model. Applied to a Linxai D50 quadruped platform across two outdoor loops totaling over one kilometer, our approach reduces elevation drift from over 30 m to under 30 cm and enables convergence in a scene where the baseline pipeline fails entirely. These results suggest that proprioceptive data, already computed onboard for gait control, constitutes a lightweight and effective vertical anchor for SLAM in GNSS-denied settings.

2605.20482 2026-05-21 cs.LG cs.SY eess.SY

Quadratic Characterizations for Reachability Analysis of Neural Networks

Elias Khalife, Mazen Farhood, Pierre-Loic Garoche

Affiliations Kevin T. Crofton Department of Aerospace and Ocean Engineering, Virginia Tech; Federation ENAC ISAE-SUPAERO ONERA, Universite de Toulouse

AI summary This paper proposes a framework for constructing verified quadratic characterizations of scalar relations in the two-dimensional real plane. Candidate quadratic inequalities are generated locally and then verified globally, with the aim of reducing conservatism in neural network reachability analysis.

Abstract

Quadratic constraints (QCs) are widely used to characterize nonlinearities and uncertainties, but generic analytical characterizations can be conservative on bounded domains. This paper develops a framework for constructing verified quadratic characterizations of scalar relations in the two-dimensional real plane. Candidate quadratic inequalities are locally generated by solving convex quadratic programs using samples from the relation and exterior sample points. They are then verified globally using sum-of-squares certificates over an exact semialgebraic description or, in the case of nonpolynomial relations, over relaxed polynomial descriptions. The resulting verified constraints define a sound overapproximation of the scalar relations over the considered domains. These constraints are directly compatible with existing analysis frameworks based on QCs and pointwise integral quadratic constraints (IQCs) for static nonlinearities and uncertainties, and they can also be embedded in QC-based semidefinite programs for reachability and safety analysis of feedforward neural networks. For smooth activations such as $\tanh$, the method yields domain-dependent quadratic characterizations that constitute an alternative to generic sector- or slope-based descriptions. For ReLU networks, we give methods to reduce conservatism in QC-based reachability analysis of feedforward networks by exploiting dependencies between neurons and tighter local bounds. Numerical examples demonstrate improved reachability results for smooth activations, reduced conservatism for ReLU networks, and applicability beyond neural networks through an example involving saturation.
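For orientation, the generic quadratic descriptions that the abstract calls potentially conservative can be checked numerically. The sketch below uses two standard facts rather than the paper's verified, domain-dependent constraints: ReLU satisfies y >= 0, y >= x and y(y - x) = 0, and tanh lies in the sector [0, 1], meaning y(x - y) >= 0. A grid check like this is only a sanity test, not a certificate; the paper's global verification relies on sum-of-squares certificates.

```python
import math


def relu_qc_holds(x, tol=1e-12):
    """Check the standard ReLU description at a single input x."""
    y = max(0.0, x)
    return y >= 0.0 and y >= x and abs(y * (y - x)) <= tol


def tanh_sector_holds(x, tol=1e-12):
    """Check the sector-[0, 1] quadratic inequality y * (x - y) >= 0."""
    y = math.tanh(x)
    return y * (x - y) >= -tol


# Sample the interval [-5, 5]; both generic constraints hold everywhere,
# which is why they are sound, but they ignore the bounded domain entirely.
grid = [i / 10.0 for i in range(-50, 51)]
all_ok = all(relu_qc_holds(x) and tanh_sector_holds(x) for x in grid)
```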

2605.20479 2026-05-21 cs.CV cs.LG

Oracle Supervision Transfers for Hyperparameter Prediction in Model-Based Image Denoising

Jianmin Liao, Lixin Shen, Yuesheng Xu

Affiliations Department of Mathematics, Syracuse University; Department of Mathematics & Statistics, Old Dominion University

AI summary This study proposes HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations to predict heterogeneous hyperparameters for new denoiser-noise configurations. In a cross-paradigm experiment that transfers from relatively cheap TV/TGV variational sources to the more expensive diffusion-based DiffPIR, it reaches near-oracle performance with few or no target oracle labels.

Abstract

Hyperparameter prediction is a critical practical bottleneck for model-based image denoisers, ranging from classical TV/TGV variational solvers to modern diffusion-based models such as DiffPIR. While existing learned predictors can achieve near-oracle performance, this approach scales poorly: each new configuration conventionally requires its own oracle-labeled training set, and each label requires a hierarchical grid search evaluated against clean ground truth. We therefore ask whether oracle supervision collected on source configurations can transfer to target configurations with few or no target oracle labels. We propose HyperDn, a single configuration-conditioned predictor that pools oracle supervision across source configurations and predicts heterogeneous hyperparameters for new denoiser--noise configurations. In a cross-paradigm experiment, HyperDn transfers from relatively cheap TV/TGV variational sources to more expensive diffusion-based DiffPIR. With only $2$ target oracle labels, it reaches $30.23$\,dB, within $0.90$\,dB of the oracle, and outperforms the $64$-label per-configuration predictor trained from scratch, using $1/32$ as many target labels as that baseline point. Without any target oracle labels, HyperDn also reaches near-oracle PSNR on two unseen mixtures of seen noise types and on transfer from relatively cheap $96\times 96$ source images to $512\times 768$ targets. Together, these results show that expensive oracle supervision for hyperparameter prediction can be transferred from source to new target configurations, reducing the need to rebuild oracle labels for each new denoising configuration.

2605.20478 2026-05-21 cs.CL

Stage-Audit: Auditable Source-Frontier Discovery for Cross-Wiki Tables

Chen Shen

Affiliations Megagon Labs

AI summary This paper studies the risk that LLM-curated tables contain rows not supported by their cited sources. It proposes Stage-Audit, which uses disjoint curator and auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy to improve source-frontier precision and F1 while keeping per-row source traceability.

Comments 9 pages, 2 figures, 3 tables. Accepted at the ACM CAIS 2026 Workshop on AI Agents for Discovery in the Wild

Abstract

LLM-curated tables can appear source-grounded while containing unsupported rows: the curator may recall entries from parametric memory and retroactively attach page-level citations that are not the actual source. We study this hazard in Seed2Frontier discovery: the task of finding complement Wikipedia pages from a seed page to assemble a structured table. Stage-Audit addresses it with disjoint curator-auditor write rights, a row-level source-citation gate, and a 12-check audit taxonomy over keys, schema, source roles, cardinality, and scope. On a curated 51-instance Seed2Frontier evaluation set spanning 15 top-level domains, Stage-Audit improves source-frontier precision over a vanilla LLM curator from 0.356 to 0.505 (+42% relative) and F1 from 0.334 to 0.451 (+35%), while maintaining explicit per-row source traceability. The vanilla-LLM-vs-Stage-Audit comparison isolates the policy contribution rather than LLM-based discovery in general.
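The relative improvements quoted above follow directly from the absolute scores, as this short check shows. Only the four scores come from the abstract; the helper name is hypothetical.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


precision_gain = relative_gain(0.356, 0.505)  # source-frontier precision
f1_gain = relative_gain(0.334, 0.451)         # F1
```

Rounded to whole percentages, these give the +42% and +35% reported in the abstract.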

2605.20477 2026-05-21 cs.LG cs.AI cs.CL

Training Language Agents to Learn from Experience

Yuval Shalev, Zifeng Ding, Mateja Jamnik

Affiliations University of Cambridge

AI summary This paper introduces the In-context Training (ICT) task, a framework for evaluating cross-task self-improvement in language agents, together with an RL-based training pipeline that learns reflections directly from experience. The trained reflectors outperform an untrained baseline on most held-out task families, showing that the ability to learn from experience can itself be learned.

Abstract

Language agents can adapt from experience in interactive environments, but current reflection-based methods can only self-correct within a single task instance. Whether such experience can be distilled into reusable lessons that improve performance on future unseen tasks remains unclear. We address this problem by introducing the In-context Training (ICT) task, a framework for evaluating cross-task self-improvement in language agents. In ICT, a reflector model observes trajectories collected by an actor model and generates system prompts intended to improve the actor's performance on future unseen tasks. We then propose an RL-based training pipeline for learning such reflections directly from experience, without human-provided examples. Across ALFWorld and MiniHack, our trained reflectors outperform an untrained baseline on most held-out task families, showing that the ability to learn from experience can itself be learned. In some cases, we observe generalisation beyond the benchmark on which the reflector was trained, to substantially different environments. Finally, we introduce MetaGym, a generic Python library for constructing meta-environments, enabling future research on self-improving language agents.

2605.20476 2026-05-21 cs.CV

Goodbye Drift: Anchored Tree Sampling for Long-Horizon Video-to-Video Generation

Matthew Bendel, Stephen W. Bailey, Mithilesh Vaidya, Sumukh Badam, Xingzhe He

Affiliations Descript, Inc.

AI summary This paper proposes Anchored Tree Sampling, which addresses drift in long-horizon video generation while shortening the critical path, and achieves stable, high-quality video generation in the static-camera regime.

Comments 30 pages, 23 figures

Abstract

Long-horizon video generation suffers from two intertwined issues. First, there is drift, where video quality degrades over time. Second, there are continuity issues which manifest as object permanence issues, or improperly rendering transient content (e.g., an object that appears in non-consecutive frames changing color/style). Recent work has focused on autoregressive distillation techniques that attack both problems simultaneously. We instead choose to focus on drift directly and introduce Anchored Tree Sampling (ATS): a training-free inference-time scheduler that replaces left-to-right rollout with sparse-to-dense, anchor-bounded imputation organized as a tree. A root call produces sparse anchors over the full horizon, recursive refinement generates intermediate anchors, and final leaf spans are synthesized between neighboring anchors. This reduces the critical path from $K$ sequential rollout steps to $L+1$ tree-hierarchical steps and converts horizon-compounding drift into anchor-bounded drift. We focus on V2V generation in the static-camera regime, where sparse anchors over the horizon are well approximated by the dense conditioning signal, and the base model can produce them without retraining. We evaluate ATS against two contemporary autoregressive baselines on Wan 2.1 + VACE, across five conditioning modalities (inpainting, outpainting, edge, pose, depth). We show that ATS outperforms both competitors in overall quality, as well as in drift prevention. We additionally demonstrate stable $\geq 40$-minute generation on LTX-2.3 across the same five modalities. We conclude by proposing a path forward to extend ATS to arbitrarily long T2V generation, as well as the dynamic-camera and multi-shot regimes.
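The sparse-to-dense schedule described above can be pictured with a toy planner that only records which spans would be generated at which tree depth. It is a rough sketch under assumptions: the fan-out, the leaf length, and the function name `plan` are hypothetical, no video model is involved, and the paper's exact definitions of $K$ and $L$ may differ from the counts used here.

```python
def plan(start, end, fanout, leaf_len, depth=0, calls=None):
    """Recursively split the frame range [start, end) into anchored spans.

    Each recorded call is (depth, kind, start, end). In this sketch, calls
    at the same depth do not depend on each other, so the sequential
    critical path is the number of depths rather than the number of spans.
    """
    if calls is None:
        calls = []
    if end - start <= leaf_len:
        calls.append((depth, "leaf", start, end))      # dense synthesis
        return calls
    calls.append((depth, "anchors", start, end))       # sparse anchors
    step = -(-(end - start) // fanout)                 # ceiling division
    for s in range(start, end, step):
        plan(s, min(s + step, end), fanout, leaf_len, depth + 1, calls)
    return calls


# Hypothetical setting: 64 frames, 4 children per call, 4-frame leaves.
calls = plan(0, 64, fanout=4, leaf_len=4)
leaves = [c for c in calls if c[1] == "leaf"]
critical_path = max(c[0] for c in calls) + 1   # depths 0, 1, 2
sequential_steps = len(leaves)                 # left-to-right rollout
```

In this toy setting the tree needs 3 sequential stages where a left-to-right rollout over the same leaf spans would need 16, and any error in a leaf is bounded by the anchors on either side of it.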

2605.20470 2026-05-21 cs.CV cs.AI physics.med-ph

EPC-3D-Diff: Equivariant Physics Consistent Conditional 3D Latent Diffusion for CBCT to CT Synthesis

Alzahra Altalib, Chunhui Li, Haytham Al Ewaidat, Khaled Alawneh, Ahmad Qendel, Alessandro Perelli

Affiliations School of Science and Engineering, University of Dundee, UK; Faculty of Applied Sciences, Jordan University of Science and Technology; Experia Healthcare, Jordan; School of Cardiovascular and Metabolic Health, University of Glasgow, UK

AI summary This paper proposes EPC-3D-Diff, a conditional 3D latent diffusion framework for volumetric CBCT-to-CT synthesis that improves physical consistency through a projection-domain equivariance loss derived from acquisition physics. During training, rotated synthesized CT volumes are forward projected and matched to the correspondingly angle-shifted projections, which integrates a physics-consistent equivariance constraint into the diffusion objective.

Comments 10 pages, 4 figures

Abstract

Cone-beam CT (CBCT) is routinely acquired during radiotherapy for patient setup, but its quantitative reliability is degraded by scatter, noise, and reconstruction artifacts, limiting Hounsfield Unit (HU) accuracy. We propose EPC-3D-Diff, a novel conditional 3D latent diffusion framework for volumetric CBCT-to-CT synthesis that introduces a projection-domain equivariance loss derived from acquisition physics. Unlike common image-domain equivariance, we exploit the fact that an in-plane rotation of the volume corresponds to an angular shift in its projections. During training, we enforce this relationship by forward projecting rotated synthesized CT volumes and matching them to appropriately angle-shifted projections of the paired target CT, yielding a physics-consistent equivariance constraint integrated into the diffusion objective. To capture full 3D context efficiently, conditional diffusion is performed in a compact latent space learnt by a lightweight 3D autoencoder, preserving axial depth while downsampling in-plane resolution for stable training. We validate on a paired head CBCT/CT phantom dataset, including repeat scans, and on paired clinical data using patient-wise splits, and perform single- and mixed-domain training, ablations, and comparisons with diffusion and CycleGAN. EPC-3D-Diff generalizes well and achieves substantial PSNR improvements over state-of-the-art methods, +7.4 dB (phantom) and +1.8 dB (clinical data), alongside improved SSIM and HU accuracy within tissue boundaries. Overall, EPC-3D-Diff improves robustness and physics consistency, supporting HU-aware synthesis for downstream radiotherapy workflows.
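The relationship the loss exploits, that an in-plane rotation of the volume corresponds to an angular shift in its projections, can be verified on a toy case. The sketch below is deliberately simplified and is not the paper's geometry: it uses a single 2D slice, idealized parallel-beam projections (plain row and column sums), and a 90-degree rotation, whereas the paper works with cone-beam acquisition. All function names are hypothetical.

```python
def rot90_ccw(img):
    """Rotate a square 2D list by 90 degrees counter-clockwise."""
    n = len(img)
    return [[img[j][n - 1 - i] for j in range(n)] for i in range(n)]


def project_0deg(img):
    """Parallel-beam view at 0 degrees: integrate down each column."""
    return [sum(row[j] for row in img) for j in range(len(img[0]))]


def project_90deg(img):
    """Parallel-beam view at 90 degrees: integrate along each row."""
    return [sum(row) for row in img]


slice_2d = [[1, 2, 3],
            [4, 5, 6],
            [7, 8, 10]]
rotated = rot90_ccw(slice_2d)

# The 0-degree view of the rotated slice equals the 90-degree view of
# the original slice: rotating the object shifts the projection angle.
shifted_view_matches = project_0deg(rotated) == project_90deg(slice_2d)
```

A training loss built on this idea would penalize the mismatch between the two sides instead of asserting equality, and would use the scanner's actual projection operator.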

2605.20469 2026-05-21 cs.CV

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

Haoyu Wang, Zitong Li

Affiliations Department of Biostatistics & Health Informatics, Institute of Psychiatry, Psychology & Neuroscience, King’s College London

AI summary This paper introduces the HalluCXR benchmark, which evaluates six architecturally diverse vision-language models on 856 stratified MIMIC-CXR chest radiographs. It finds that 61.9% to 82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%, and it contributes a hallucination taxonomy, a detection pipeline, and a model-ensemble approach as mitigation strategies.

Abstract

Vision-language models (VLMs) are increasingly used for medical image interpretation, yet they frequently hallucinate, generating clinically plausible but factually incorrect findings that pose direct patient safety risks. We introduce HalluCXR, a benchmark evaluating six architecturally diverse VLMs across 856 stratified MIMIC-CXR chest radiographs and three query types, yielding 15,408 model evaluations. An eight-category hallucination taxonomy with clinical severity ratings and a two-layer detection pipeline are validated against 250 human annotations (auto-detection F1=0.959; LLM judge F1=0.907). We find that 61.9--82.3% of outputs contain hallucinations, with clinically dangerous errors in up to 80.2%. Three key patterns emerge: normal radiographs paradoxically attract the most severe hallucinations, common findings are systematically over-fabricated while rare findings go under-detected, and response length alone predicts hallucination risk (AUC up to 0.908). A six-model ensemble reduces fabrication by up to 84.8% at the cost of increased omission; a three-model subset retains comparable performance at half the cost. These results establish that hallucination auditing, verbosity-based risk monitoring, and ensemble-based safety layers are prerequisites for clinical deployment.

2605.20467 2026-05-21 cs.AI

High Quality Embeddings for Horn Logic Reasoning

Yifan Zhang, Yasir White, Dean Clark, Joseph Sanchez, Jevon Lipsey, Ashely Hirst, Jeff Heflin

Affiliations Lehigh University, Computer Science and Engineering; Los Angeles Pierce College, Computer Science; Colorado College, Computer Science

AI summary This paper proposes a method for generating high-quality embeddings of logical statements. The embeddings are trained with triplet loss, and downstream performance is improved by generating anchors with repeated terms, balancing easy and hard examples, and emphasizing the hardest examples.

Journal ref Proceedings of Machine Learning Research 284:1-14, 2025

Abstract

Neural networks can be trained to rank the choices made by logical reasoners, resulting in more efficient searches for answers. A key step in this process is creating useful embeddings, i.e., numeric representations of logical statements. This paper introduces and evaluates several approaches to creating embeddings that result in better downstream results. We train embeddings using triplet loss, which requires examples consisting of an anchor, a positive example, and a negative example. We introduce three ideas: generating anchors that are more likely to have repeated terms, generating positive and negative examples in a way that ensures a good balance between easy, medium, and hard examples, and periodically emphasizing the hardest examples during training. We conduct several experiments to evaluate this approach, including a comparison of different embeddings across different knowledge bases, in an attempt to identify what characteristics make an embedding well-suited to a particular reasoning task.
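The triplet loss used to train the embeddings can be written in a few lines. This is a generic sketch: the Euclidean distance, the margin of 1.0, and the toy vectors are assumptions for illustration, since the abstract does not state which distance or margin the authors use.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on Euclidean distances.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor, positive = (0.0, 0.0), (0.0, 1.0)

easy = triplet_loss(anchor, positive, (3.0, 4.0))    # negative far away
medium = triplet_loss(anchor, positive, (1.5, 0.0))  # inside the margin
hard = triplet_loss(anchor, positive, (1.0, 0.0))    # as close as the positive
```

Easy triplets contribute no gradient at all, which is one reason the balance between easy, medium, and hard examples matters during training.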

2605.20461 2026-05-21 cs.CV

Understanding Model Behavior in Monocular Polyp Sizing

Xinqi Xiong, Andrea Dunn Beltran, Junmyeong Choi, Sarah K. McGill, Marc Niethammer, Roni Sengupta

Affiliations University of North Carolina at Chapel Hill; University of California, San Diego

AI summary This paper presents a diagnostic audit of binary polyp size classification (≤5 mm vs. >5 mm) across multi-center datasets and multiple models. Performance is fairly consistent across architectures and input modalities, suggesting reliance on cues correlated with examination behavior rather than true metric scale. The audit also quantifies the potential gain from perfect scale information and the limited gains from current depth estimation and global calibration.

Abstract

Accurate polyp size stratification guides surveillance decisions, with lesions larger than 5 mm typically requiring closer follow-up. However, monocular colonoscopy lacks a reliable metric reference. We present a diagnostic audit of binary polyp size classification (<=5 mm vs. >5 mm) across multiple public multi-center datasets, model families, and patient-stratified cross-validation. Across architectures and input modalities, including RGB appearance, relative depth, and photometry, model performance is moderately consistent, suggesting reliance on cues correlated with examination behavior rather than true metric scales. By providing ground-truth scale at varying granularities, we quantify the potential improvement from perfect scale information and show that current depth estimation and global calibration offer limited gains. We further demonstrate that segmentation errors under distribution shift eliminate most of this potential, with oracle scale under predicted masks recovering only baseline performance. These results highlight metric scale and mask robustness as two independent bottlenecks and provide reusable evaluation tools such as oracle scale ladders, shortcut partitions, and mask substitution for auditing future polyp sizing pipelines. Our code is publicly accessible at https://github.com/anaxqx/polyp-sizing-audit.

2605.20459 2026-05-21 cs.CV cs.AI

Pixel Wised Lesion Prediction on COVID-19 CT Imagery: A Comparative Analysis of Automated Image Segmentation Architectures

Sarmad Khan, Arslan Shaukat, Umer Asgher, Basim Azam

Affiliations Department of Computer & Software Engineering, National University of Sciences & Technology, Islamabad, Pakistan; School of Computing & Information Systems, University of Melbourne, Melbourne, Australia

AI summary This paper compares four deep learning architectures combined with six pre-trained encoders for predicting lesions in COVID-19 CT images. Binary segmentation reaches an F1-score of 98%, and multi-class segmentation reaches F1-scores of 75% and 77% on two separate datasets.

Comments 7 pages, 6 figures, 4 tables

Abstract

In recent years, there has been a notable increase in the level of attention that is given to algorithms based on deep learning in the context of medical image segmentation. Nevertheless, the reliability of the field has been hindered due to the absence of a standardized methodology for performance analysis and the utilization of different datasets in previous research. The primary objective of the research is to comprehensively evaluate contemporary segmentation frameworks combined with state-of-the-art pre-trained backbones in order to accurately predict COVID-19 lesions in CT images. Moreover, this evaluation can serve as a point of reference for the segmentation of images in various other imaging scenarios. In order to accomplish this, we integrate four distinct deep learning architectures, namely Unet, PSPNet, Linknet, and FPN, with six pre-trained encoders, including VGG 19, DenseNet 121, Inception ResNet V2, MobileNet V2, SeresNet 101, and EfficientNet B0. This approach enables the development of diverse testing architectures. In the context of image segmentation, our research encompassed both binary and multi-class experimentation. The findings derived from our analysis of three distinct COVID-19 CT segmentation datasets indicate that deep learning architectures yield precise and efficient segmentation outcomes. Significantly, a maximum F1-Score of 98% was attained for binary class segmentation, while multi-class segmentation yielded F1-Scores of 75% and 77% across two separate datasets. The utilization of artificial intelligence and deep learning enhances the diagnostic process for pandemic diseases across multiple dimensions.

2605.20450 2026-05-21 cs.LG cs.CR

SMA-DP: Spectral Memory-Aware Differential Privacy for Deep Learning

SMA-DP:基于频谱记忆的差分隐私用于深度学习

Mohammad Partohaghighi, Roummel Marcia

Affiliations: Department of Electrical Engineering and Computer Science; University of California, Merced; Department of Applied Mathematics

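AI summary: This paper proposes SMA-DP-SGD, a differentially private stochastic gradient descent method that augments DP-SGD with a spectral memory-aware branch built only from previously privatized releases. The aim is to improve utility, and the method achieves competitive or better accuracy than several DP optimization baselines on CIFAR-100, CIFAR-10, and MNIST.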

AI-generated Chinese abstract

差分隐私随机梯度下降(DP-SGD)通过每个示例裁剪和校准的高斯噪声实现私人的深度学习,但其高方差更新会降低在具有挑战性的数据集上的效用。我们提出了SMA-DP-SGD,一种基于频谱记忆的差分隐私随机梯度下降方法,该方法通过在之前隐私化噪声发布中构建的分数记忆分支来增强DP-SGD。受WeightWatcher启发的幂律频谱指数提供了组级可靠性信号,在实验中以层级方式实现,以适应衰减和有效记忆深度。隐私历史对齐、范数匹配和激活预热稳定了记忆贡献。隐私保持透明:在给定隐私发布历史的条件下,记忆分支是固定的,而唯一新的数据依赖项是当前裁剪总和乘以固定系数β。因此,SMA-DP-SGD保持了干净的条件敏感度结构,并且当β=1时,精确恢复组级DP-SGD。在CIFAR-100、CIFAR-10和MNIST上的实验显示,SMA-DP-SGD在多个DP优化基线中表现竞争或更优,尤其在CIFAR-100和CIFAR-10上获得最大收益。CIFAR-10的消融实验显示,β控制隐私-效用轨迹,而频谱和记忆诊断确认了受控的短至中等有效记忆深度和小的记忆分支比。运行时分析显示,该机制带来了额外的开销,大约是DP-SGD的2.94倍,在我们的CIFAR-10实现中,揭示了适应性隐私记忆与计算成本之间的实际权衡。

English abstract

Differentially private stochastic gradient descent (DP-SGD) enables private deep learning through per-example clipping and calibrated Gaussian noise, but its high-variance updates can reduce utility on challenging datasets. We propose SMA-DP-SGD, a Spectral Memory-Aware Differentially Private Stochastic Gradient Descent method that augments DP-SGD with a fractional memory branch built only from previously privatized noisy releases. WeightWatcher-inspired power-law spectral exponents provide group-wise reliability signals, instantiated layer-wise in our experiments, to adapt the decay and effective memory depth. Private-history alignment, norm matching, and warm-up activation stabilize the memory contribution. Privacy remains transparent: conditioned on the private release history, the memory branch is fixed, and the only newly data-dependent term is the current clipped sum scaled by a fixed coefficient \(β\). Hence, SMA-DP-SGD preserves a clean conditional sensitivity structure and exactly recovers group-wise DP-SGD when \(β=1\). Experiments on CIFAR-100, CIFAR-10, and MNIST show competitive or superior accuracy over several DP optimization baselines, with the largest gains on CIFAR-100 and CIFAR-10. CIFAR-10 ablations show that \(β\) controls the privacy-utility trajectory, while spectral and memory diagnostics confirm a controlled short-to-moderate effective memory depth and a small memory-branch ratio. Runtime analysis shows that the mechanism incurs additional overhead, about \(2.94\times\) DP-SGD in our CIFAR-10 implementation, revealing a practical trade-off between adaptive private memory and computational cost.
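A rough sketch may help show why a memory branch built only from past privatized releases adds no new data dependence. It is not the authors' algorithm. The exponentially decayed average stands in for the paper's fractional memory weights, the convex mix `beta * release + (1 - beta) * memory` is an assumed form, and the spectral adaptation, alignment, norm matching, and warm-up steps are omitted. All function names are hypothetical.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_release(per_example_grads, max_norm, noise_multiplier, rng):
    """Standard DP-SGD release: clip each gradient, sum, add Gaussian noise, average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            total[i] += g
    sigma = noise_multiplier * max_norm
    n = len(per_example_grads)
    return [(total[i] + rng.gauss(0.0, sigma)) / n for i in range(dim)]


def memory_from_history(history, decay):
    """Normalized, exponentially decayed average of past releases (newest last).

    Assumption: this replaces the paper's fractional memory weights.
    """
    if not history:
        return None
    dim = len(history[0])
    weights = [decay ** (len(history) - 1 - k) for k in range(len(history))]
    z = sum(weights)
    return [sum(w * h[i] for w, h in zip(weights, history)) / z for i in range(dim)]


def update_direction(release, history, beta, decay):
    """Mix the current private release with memory; beta = 1 gives plain DP-SGD."""
    memory = memory_from_history(history, decay)
    if memory is None:
        return list(release)
    return [beta * r + (1.0 - beta) * m for r, m in zip(release, memory)]


if __name__ == "__main__":
    rng = random.Random(0)
    history = []
    for step in range(3):
        grads = [[rng.uniform(-2, 2), rng.uniform(-2, 2)] for _ in range(4)]
        release = noisy_release(grads, max_norm=1.0, noise_multiplier=1.0, rng=rng)
        direction = update_direction(release, history, beta=0.8, decay=0.5)
        history.append(release)  # only privatized quantities enter the memory
        print(step, direction)
```

In this sketch the memory is a function of earlier noisy releases only, so mixing it in is post-processing. The private data enters each step solely through the current clipped sum, which matches the conditional structure the abstract describes.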

2605.20449 2026-05-21 cs.LG cs.AI

LLM Pretraining Shapes a Generalizable Manifold: Insights into Cross-Modal Transfer to Time Series

LLM预训练塑造了可泛化的流形:跨模态迁移至时间序列的洞察

Alexis Roger, Prateek Humane, Zhenghan Tai, Gwen Legate, Andrei Mircea, Vasilii Feofanov, Irina Rish

Affiliations: McGill University; Mila - Quebec AI Institute; Université de Montréal; University of Toronto; Concordia University; 42.com

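AI summary: The paper examines whether language-pretrained Transformers can serve as effective time-series forecasters and explains the mechanism behind cross-modal transfer: pretraining builds a reusable manifold, and finetuning projects numerical dynamics onto task-relevant directions.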

AI-generated Chinese abstract

语言预训练的Transformer能否成为有效的时序预测器,以及原因是什么?本文表明,跨模态迁移出现是因为语言预训练为时序训练预设了一个可重用的流形。在冻结的LLM状态上进行线性探测可以解码出真实的时序轨迹而无需配对监督,该投影空间中的检索能产生具有竞争力的预测,表明在微调之前就已经存在结构和动态。预训练初始化还提升了优化效果,产生连贯的梯度和高度各向异性的损失景观,不同于随机初始化。微调则起到低维对齐的作用,重用已有的方向而非从头学习时间原始特性,这通过低秩更新、子空间对齐和共享的周期性、趋势和重复特征得到证实。这些结果支持了LLM到时序迁移的几何解释:语言预训练构建了流形,微调将数值动态投影到任务相关方向上。

English abstract

Can language-pretrained transformers become effective time-series forecasters, and why? In this paper, we show that cross-modal transfer arises because language pretraining preconditions time series training with a reusable manifold. A linear probe on frozen LLM states decodes realistic time-series trajectories without paired supervision, and retrieval in this projected space yields competitive forecasts, showing that structure and dynamics exist before finetuning. Pretrained initialization also improves optimization, producing coherent gradients and a highly anisotropic loss landscape unlike random initialization. Finetuning then acts as low-dimensional alignment, reusing existing directions rather than learning temporal primitives from scratch, as evidenced by low-rank updates, subspace alignment, and shared features for periodicity, trend, and repetition. Together, these results support a geometric account of LLM-to-time-series transfer: language pretraining builds the manifold, and finetuning projects numerical dynamics onto task-relevant directions.