arXivDaily: arXiv daily academic digest, updated Monday to Friday
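2606.17995 2026-06-17 stat.ML cs.CR cs.LG New submission

Differential Privacy of Gaussian Process Posterior Sampling

Tomasz Maciazek

Affiliations: School of Mathematics, University of Bristol

AI summary: Studies the privacy of released Gaussian process posterior sample paths, derives Rényi-DP bounds that separate posterior-mean leakage from posterior-covariance leakage, shows that meaningful privacy hinges on effective ridge regularisation, and uses membership-inference attacks to confirm the predicted dependence on regularisation.

Comments: 8 pages of main text + 25 pages appendix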

Abstract

We study the privacy of releasing posterior sample paths from a Gaussian process (GP) when the entire training set, including covariates and responses, is private. Unlike standard differential-privacy (DP) mechanisms that add external noise, posterior sampling is random by construction. We show that this intrinsic randomness yields DP guarantees by deriving explicit Rényi-DP bounds for GP posterior sample-path release. The bounds separate posterior-mean leakage from data-dependent posterior-covariance leakage, showing that meaningful privacy depends sharply on effective ridge regularisation. We apply membership-inference attacks to show that empirical leakage follows the predicted dependence on regularisation, posterior variance, and the number of released posterior sample paths. Utility experiments on downstream posterior-sampling tasks identify noisy-observation regimes where privacy-compatible regularisation preserves useful decisions with modest utility loss. When stronger privacy is needed, the intrinsic guarantee can be sharpened by adding calibrated GP noise, providing an explicit additional privacy knob.
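As a rough illustration of the release mechanism this abstract describes, the sketch below draws joint posterior sample paths from a GP regression model with NumPy. The RBF kernel, the `lengthscale` default, the toy data, and the reading of the observation-noise variance `ridge` as the effective ridge term are all assumptions made for illustration. It is not the paper's code and it computes no privacy bound.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.3):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_posterior_samples(x_train, y_train, x_test, ridge, n_paths, rng):
    """Draw n_paths joint posterior samples of f at x_test.

    `ridge` is the observation-noise variance added to the training
    kernel matrix (an illustrative stand-in for the ridge term).
    """
    k_tt = rbf_kernel(x_train, x_train) + ridge * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train)
    k_ss = rbf_kernel(x_test, x_test)
    # Solve once for both the mean and the covariance correction.
    solved = np.linalg.solve(k_tt, np.column_stack([y_train, k_st.T]))
    mean = k_st @ solved[:, 0]
    cov = k_ss - k_st @ solved[:, 1:]
    cov = 0.5 * (cov + cov.T)  # symmetrise against round-off
    # A small jitter keeps the Cholesky factorisation numerically stable.
    chol = np.linalg.cholesky(cov + 1e-8 * np.eye(len(x_test)))
    noise = rng.standard_normal((len(x_test), n_paths))
    return mean[:, None] + chol @ noise, mean, cov

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train)
x_test = np.linspace(0.0, 1.0, 5)
paths, mean, cov = gp_posterior_samples(x_train, y_train, x_test, 0.1, 3, rng)
```

In this toy setup a larger `ridge` leaves more posterior variance, so each released path carries more of the intrinsic randomness that the abstract says the guarantee relies on.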

2606.17684 2026-06-17 stat.ML cs.CY cs.LG New submission

Geometrical fairness in graph neural networks

Arturo Pérez-Peralta, Sandra Benítez-Peña, Blas Kolic, Rosa E. Lillo

Affiliations: Department of Statistics, University Carlos III of Madrid, Spain; uc3m-Santander Big Data Institute

AI summary: Addresses fairness in graph neural networks by modifying the Laplacian operator with several complementary transformations (subspace projections, spectral adjustments, frequency-based filtering) to mitigate bias; theoretical analysis and experiments show improved fairness with competitive performance.

Comments: 32 pages, 21 tables, 6 figures

Abstract

Graph-based learning methods have become increasingly prominent due to their strong performance across diverse applications. Among these, recent frameworks grounded in diffusion processes provide a unifying perspective that extends traditional graph neural network formulations while addressing limitations of standard message-passing mechanisms. Despite these advances, concerns remain regarding the fairness of such models, as they may propagate or amplify biases present in the data. In this work, we introduce a fairness-aware adaptation of graph-based diffusion by modifying the underlying Laplacian operator. Our approach incorporates multiple complementary transformations, including subspace projections, spectral adjustments, and frequency-based filtering, to mitigate bias-related components. Leveraging the intrinsic smoothing properties of graph diffusion, we provide a principled analysis of the resulting behavior and establish theoretical insights into fairness properties. We evaluate the proposed framework on both synthetic and real-world datasets, demonstrating that it achieves competitive performance while improving fairness metrics with limited additional computational cost.

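2606.17537 2026-06-17 eess.AS cs.CL New submission

Non-Autoregressive Minimum Bayes' Risk Decoding for Fast Speech Recognition

Hiroyuki Deguchi, Takatomo Kano, Katsuki Chousa, Marc Delcroix

Affiliations: NTT, Inc.

AI summary: Proposes a non-autoregressive decoding framework based on minimum Bayes' risk decoding that draws multiple candidates efficiently from a single forward computation, improving recognition performance while retaining the speed advantage.

Comments: Accepted at Interspeech2026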

Abstract

Non-autoregressive (NAR) decoding generates output tokens in parallel, making speech recognition faster than autoregressive (AR) decoding, which generates them sequentially from left to right. However, recognition performance is degraded because NAR decoding cannot resolve uncertainty by conditioning on previously generated tokens. To address this issue, we propose a novel NAR decoding framework based on minimum Bayes' risk (MBR) decoding, termed NAR-MBR decoding, that maximizes the expected utility calculated from samples drawn from the output probability of an NAR model rather than maximizing the output probability. Notably, by leveraging the nature of NAR models, multiple samples are obtained efficiently with a single forward computation. Our experiments across LibriSpeech, Switchboard, AMI, and a web presentation corpus demonstrate that our NAR-MBR decoding outperforms previous NAR decoding and runs faster than AR decoding.
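The selection rule in this abstract can be sketched in a few lines of Python. The sketch assumes a toy NAR output made of independent per-position token distributions and uses negative edit distance as the utility. Both are illustrative assumptions, as are the names `sample_hypotheses` and `mbr_select`. The paper's model, utility function, and any CTC-style collapsing step are not reproduced here.

```python
import random

def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        cur = [i]
        for j, tok_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                      # deletion
                           cur[j - 1] + 1,                   # insertion
                           prev[j - 1] + (tok_a != tok_b)))  # substitution
        prev = cur
    return prev[-1]

def sample_hypotheses(position_probs, n_samples, rng):
    """Sample every position independently from its token distribution.

    All samples reuse the same distributions, which stand in for the
    output of one forward computation.
    """
    samples = []
    for _ in range(n_samples):
        hyp = tuple(rng.choices(list(p.keys()), weights=list(p.values()))[0]
                    for p in position_probs)
        samples.append(hyp)
    return samples

def mbr_select(samples):
    """Return the sample with the highest total utility against all samples."""
    def utility(hyp, ref):
        return -edit_distance(hyp, ref)
    return max(samples, key=lambda h: sum(utility(h, r) for r in samples))

# Hypothetical per-position distributions for a three-token utterance.
position_probs = [
    {"the": 0.9, "a": 0.1},
    {"cat": 0.6, "cap": 0.4},
    {"sat": 0.8, "sad": 0.2},
]
samples = sample_hypotheses(position_probs, 16, random.Random(0))
best = mbr_select(samples)
```

Because every sample is also used as a pseudo-reference, the cost of selection grows quadratically with the number of samples, which is one reason cheap sampling matters for this kind of decoder.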

2606.17420 2026-06-17 eess.IV cs.AI q-bio.QM New submission

Feynman Kac Reweighted Schrödinger Bridge Matching for Surface-Based Tau PET Harmonization

Jianwei Zhang, Xinyu Nie, Jiaxin Yue, Yonggang Shi

Affiliations: Stevens Neuroimaging and Informatics Institute, University of Southern California; Ming Hsieh Department of Electrical and Computer Engineering of Viterbi School of Engineering, University of Southern California; Alfred E. Mann Department of Biomedical Engineering of Viterbi School of Engineering, University of Southern California

AI summary: Proposes the Feynman Kac Reweighted Schrödinger Bridge Matching (FKRSBM) model, which learns stochastic transport between source and target domains via entropy-regularized optimal transport and combines a subgroup-aware endpoint proposal with a spherical convolutional backbone; on tau PET SUVR maps it achieves better distributional alignment and downstream disease classification than existing methods.

Abstract

Tau PET imaging is central to tracking Alzheimer's disease progression, but systematic differences between scanners, protocols, and radiotracers across sites introduce nonbiological variability that inflates biomarker variance, reduces sensitivity to disease effects, and can bias downstream clinical assessments. Harmonization methods aim to remove these site-induced shifts while preserving biologically meaningful signal, yet existing approaches struggle when source and target cohorts differ in subgroup composition, risking conflation of site effects with biological variation such as tau-positivity status. We propose the Feynman Kac Reweighted Schrödinger Bridge Matching (FKRSBM) model to address this problem. Rather than routing data through a Gaussian noise prior as in diffusion-based methods, FKRSBM learns a direct stochastic transport process between source and target distributions via entropy-regularized optimal transport. To enforce biologically consistent transport, FKRSBM incorporates a subgroup-aware endpoint proposal derived from a Feynman Kac reweighting of the reference bridge measure, implemented entirely through stratified importance sampling at the data level and requiring no changes to the underlying bridge-matching solver or network architecture. For surface-based neuroimaging, FKRSBM employs a spherical convolutional backbone operating on cortical meshes to perform vertex-level harmonization. We evaluate the method on tau PET SUVR maps, harmonizing PI-2620 data from the HABS-HD cohort into the AV-1451 domain of ADNI. Compared against ComBat, CycleGAN, a diffusion-based method (DF), and unregularized Diffusion Schrödinger Bridge Matching (DSBM), FKRSBM achieves superior distributional alignment, reduced tau-positivity sign mismatch, stronger APOE subgroup alignment, and improved downstream disease classification performance.

2606.17404 2026-06-17 eess.AS cs.SD New submission

ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation

Shuntaro Suzuki, Kento Tokura, Daichi Yashima, Kanon Amemiya, Komei Sugiura, Shinnosuke Takamichi

Affiliations: Keio University

AI summary: Proposes ELSA, a metric that decomposes generated audio according to the acoustic events in the text query and assesses event-level alignment, enabling fine-grained reference-free text-to-audio evaluation; on four benchmarks it agrees with human ratings better than existing metrics.

Comments: Accepted for presentation at Interspeech2026

Abstract

Text-to-audio (TTA) generation, synthesizing audio from natural language, has been widely studied for its ability to capture precise user intent. To effectively advance TTA models, it is essential to reliably evaluate generated audio without relying on costly human subjective ratings, motivating the development of automatic evaluation metrics that correlate well with human judgments. While recent CLAP-based metrics provide practical reference-free solutions, their coarse-grained text-audio similarity matching often correlates poorly with human ratings. To address this, we propose ELSA, a reference-free evaluation metric for fine-grained text-audio alignment. ELSA decomposes generated audio guided by distinct acoustic events derived from the text query and assesses event-level alignment. Experiments across four TTA benchmarks show that ELSA achieves a higher correlation with human subjective ratings than prior metrics, highlighting its effectiveness for reliable TTA evaluation.

2606.17383 2026-06-17 q-fin.RM cs.AI cs.LG stat.ML New submission

Model Validation of Agentic AI Systems: A POMDP-Based Framework for Belief-State, Forecast, and Policy Validation

Matthew Francis Dixon

Affiliations: Quiota LLC

AI summary: Proposes a model validation framework for agentic AI based on partially observable Markov decision processes (POMDPs) that decomposes autonomous decision making into information, belief, forecast, action, and utility components validated independently, and demonstrates it on a portfolio-management case study.

Comments: 28 pages, 3 figures, 6 tables. Source code available from https://github.com/mfrdixon/agentic-AI-as-POMDP

Abstract

Agentic artificial intelligence systems introduce a new class of model risk. Unlike traditional predictive models, autonomous agents continuously acquire information, form beliefs regarding latent states of the environment, generate forecasts, select actions, and adapt their behavior over time. Existing validation methodologies focus primarily on predictive accuracy and therefore provide limited insight into the quality of the underlying decision process. This paper proposes a model validation framework for agentic AI based on Partially Observable Markov Decision Processes (POMDPs). The framework decomposes autonomous decision making into information, beliefs, forecasts, actions, and utility, allowing each component to be validated independently. Large language models (LLMs) are formalized as approximate Bayesian filtering operators, and a model-risk taxonomy is developed encompassing state-space, filtering, forecast, policy, utility-specification, and parameter risks. The model risk validation methodology is demonstrated through a portfolio-management case study in which an agent infers latent market regimes from market and macroeconomic information, generates belief-conditioned forecasts, and constructs portfolios using a Black-Litterman framework. Empirical validation combines performance analysis, belief calibration diagnostics, coverage tests, ablation studies, and parameter-sensitivity analysis. The results indicate that latent-state inference contributes independently to decision quality and that the principal conclusions remain robust across a broad range of parameter values. The principal contribution of the paper is a practical framework for extending established model risk management concepts to autonomous AI systems and providing a rigorous foundation for their validation, governance, and monitoring.
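For readers unfamiliar with the belief-state component, the standard exact Bayes filter for a discrete hidden state is sketched below. It is the textbook update that an approximate filtering operator would be compared against. The two market regimes, the transition matrix, and the observation likelihoods are hypothetical numbers chosen for illustration and are not taken from the paper or its repository.

```python
def belief_update(belief, transition, obs_likelihood):
    """One Bayes-filter step for a discrete hidden state.

    belief[s]          prior probability of state s
    transition[s][t]   P(next state = t | current state = s)
    obs_likelihood[t]  P(observation | next state = t)
    """
    n = len(belief)
    # Prediction step: push the belief through the transition model.
    predicted = [sum(belief[s] * transition[s][t] for s in range(n))
                 for t in range(n)]
    # Correction step: reweight by the observation likelihood, then normalise.
    unnormalised = [predicted[t] * obs_likelihood[t] for t in range(n)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalised]

# Two hypothetical market regimes: 0 = calm, 1 = stressed.
belief = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
likelihood_of_large_drop = [0.1, 0.6]
posterior = belief_update(belief, transition, likelihood_of_large_drop)
```

Keeping the prediction and correction steps separate mirrors the decomposition in the abstract: the transition model and the observation model can each be checked on their own.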

2606.17295 2026-06-17 eess.IV cs.CV New submission

Phenotyping TPF via Self-Supervised Learning: A Label-Agnostic Framework with Expert Validation

Miral Elnakib, Muhammad Saad, Ahmad Al-Kabbany

Affiliations: Faculty of Sciences; Alexandria University; Multimedia Interaction and Communication Lab; Wearables, Biosensing, and Biosignal Processing Research Lab; Arab Academy for Science and Technology

AI summary: Proposes a label-agnostic self-supervised learning framework that uses SimCLR and clustering to learn fracture representations directly from radiographs, discovering four imaging-derived phenotypes that blinded expert review finds stable and clinically interpretable, and that are orthogonal to conventional classification.

Abstract

The full potential of artificial intelligence in tibial plateau fracture characterisation remains unrealised, constrained by a fundamental dependency on labelled datasets whose consistency cannot be guaranteed: conventional classification schemes such as Schatzker and AO/OTA suffer from inter-observer variability, causing supervised models to learn human disagreement rather than stable fracture morphology. We design, implement, and validate a label-agnostic framework that eliminates this constraint by learning fracture representations directly from imaging data without observer-assigned labels. A RadImageNet-pretrained ResNet-50 encoder is fine-tuned on 154 cleaned knee radiographs using the SimCLR contrastive objective, preceded by a data cleaning protocol and followed by UMAP dimensionality reduction and k-means clustering to discover four imaging-derived phenotypes. Phenotype validity is assessed through a blinded expert review protocol administered to two independent clinicians. The four phenotypes demonstrate robust stability (bootstrap ARI = 0.319 +/- 0.041), strong internal cohesion (silhouette = 0.511), and coherence ratings of 3-5/5 from both reviewers under blinded conditions; one phenotype was unanimously identified as exhibiting comminution -- a high-complexity feature isolated without any supervisory signal. Inter-partition comparison against Schatzker labels yields ARI = 0.013, confirming orthogonality to conventional classification boundaries. Notably, expert reviewers anchored to established classification vocabularies perceived imaging-derived groups as heterogeneous precisely where Schatzker alignment was lowest, suggesting that Schatzker-trained perception and label-agnostic embedding geometry measure orthogonal dimensions. These findings establish label-agnostic SSL phenotyping as a reproducible and clinically interpretable complement to conventional classification.

2606.17259 2026-06-17 eess.AS cs.SD New submission

Intelligibility of Speech in Noise: Investigating Contribution of Magnitude and Phase Spectra

Bhanu Teja Nellore, Sudarsana Reddy Kadiri, Rohit Kumar, Karan Nathwani, Suryakanth V Gangashetty

Affiliations: Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA; National Institute of Technology, Patna, India; Indian Institute of Technology, Jammu, India; Koneru Lakshmaiah Education Foundation, Vaddeswaram, Guntur District, Andhra Pradesh, India

AI summary: Uses three experiments to evaluate how the magnitude and phase spectra contribute to consonant intelligibility in noise, finding that the magnitude spectrum contributes more in clean conditions, the phase spectrum is more robust in noise, and nasals are more susceptible to noise than fricatives and approximants.

Abstract

It is well known that the intelligibility of speech is reduced in the presence of ambient noise. However, studies show that not all sounds are affected equally and that vowels are more robust to noise than consonants. In this study, the intelligibility of various consonants is assessed and analyzed in stationary white noise and non-stationary babble noise conditions. Specifically, this study investigates the individual contributions of the magnitude and phase spectra of a given speech signal to human recognition of consonants in noisy conditions. To this end, three experiments are carried out. In experiment 1, the clean signal, a signal reconstructed with only magnitude spectrum information (magnitude-only signal), and a signal reconstructed with only phase spectrum information (phase-only signal) are assessed for intelligibility. In experiment 2, noise is added to clean speech; phase-only and magnitude-only signals are reconstructed from the noisy speech, and intelligibility tests are performed on all three signals. In experiment 3, noise is added directly to the magnitude-only and phase-only signals reconstructed from clean speech, and their intelligibility is assessed. The results show that the magnitude spectrum contributes more to intelligibility than the phase spectrum in clean conditions, while information from the phase spectrum is more robust in noisy conditions. It is also observed that, among consonants, nasals are more susceptible to noise, whereas fricatives and approximants are comparatively more robust.
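One common way to build magnitude-only and phase-only signals is sketched below with NumPy. The choices here are assumptions for illustration rather than the study's procedure: a single FFT over the whole signal instead of short-time analysis, zero phase for the magnitude-only signal, unit magnitude for the phase-only signal, and white Gaussian noise at a fixed SNR.

```python
import numpy as np

def magnitude_only(signal):
    """Keep the magnitude spectrum and discard the phase (set it to zero)."""
    spectrum = np.fft.rfft(signal)
    return np.fft.irfft(np.abs(spectrum), n=len(signal))

def phase_only(signal):
    """Keep the phase spectrum and replace every magnitude with one."""
    spectrum = np.fft.rfft(signal)
    unit = np.exp(1j * np.angle(spectrum))
    return np.fft.irfft(unit, n=len(signal))

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return signal + rng.standard_normal(len(signal)) * np.sqrt(noise_power)

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.arange(256) / 256)  # toy stand-in for speech

# Experiment 2 ordering: add noise first, then reconstruct.
noisy = add_noise(clean, 0.0, rng)
exp2_phase = phase_only(noisy)

# Experiment 3 ordering: reconstruct from clean speech, then add noise.
exp3_phase = add_noise(phase_only(clean), 0.0, rng)
```

The only difference between the last two conditions is the order of the noise and reconstruction steps, which is the contrast experiments 2 and 3 are built around.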

2606.17196 2026-06-17 stat.ML cs.LG stat.ME New submission

Another Look at Log-PCA for Probability Measures: A Dynamical Formulation and Statistical Convergence

Peng Xu, Changbo Zhu, Young-Heon Kim, Xiaohui Chen

Affiliations: Department of Statistics, University of Illinois Urbana-Champaign; Department of ACMS, University of Notre Dame; Department of Mathematics, University of British Columbia; Department of Mathematics, Thomas Lord Department of Computer Science, University of Southern California

AI summary: Proposes a dynamical formulation that interprets log-PCA under the Wasserstein geometry, termed Wasserstein Tangential PCA (WT-PCA), and derives a statistical convergence rate for the empirical WT-PCA relative to its population counterpart.

Abstract

This paper is concerned with learning principal variations of random probability measures on $\mathbb{R}^m$ under the Wasserstein geometry. We introduce a new dynamical formulation to interpret log-PCA, a linearized principal geodesic analysis, as a variational approach. Our differentiable version, termed Wasserstein Tangential PCA (WT-PCA), captures the local principal modes of geodesic variation of a (weighted) probability measure on the Wasserstein space via its covariance operator at the barycenter. Based on the dynamical perspective and leveraging the parallel transport structure of optimal transport problems, we derive a general statistical convergence rate for the empirical WT-PCA estimated from data, in terms of the 2-Wasserstein distance between the population and empirical barycenter reference measures.

2606.17127 2026-06-17 q-bio.QM cs.AI cs.LG New submission

Agentic Discovery of Non-Canonical Antimicrobial Peptides with AMPGAN v3

Jay Jung, Xiaohan Zhang, Shenghan Song, Mahmoud Sayedahmed, Chijian Xiang, Yunong Xu, Ahmed AbdelKhalek, Severin T. Schneebeli, Matthew J. Wargo, Jianing Li, Safwan Wshah

Affiliations: University of Vermont; Larner College of Medicine, University of Vermont; Purdue University; Department of Comparative Pathobiology; Department of Horticulture and Landscape Architecture; Department of Industrial and Molecular Pharmaceutics

AI summary: Presents AMPGAN v3, a multi-objective conditional GAN that extends the generative vocabulary to D-amino acids and terminal modifications and uses two discriminators to improve training stability; in vitro validation shows activity against Gram-positive bacteria, and the PepCraft multi-agent framework is introduced for end-to-end discovery.

Comments: Presented at the GenBio Workshop, ICML 2026

Abstract

Antimicrobial resistance causes over a million deaths annually. Antimicrobial peptides (AMPs) are a promising solution, but generative AMP models are not yet ready to design peptides with non-natural amino acids and/or chemical modifications, which are essential for real-world peptide drugs. We present AMPGAN v3, a multi-objective conditional GAN that expands the generative vocabulary to D-amino acids and N/C-terminus modifications such as amidation. By separating adversarial and activity-aware supervision across two specialized discriminators, AMPGAN v3 substantially improves training stability and outperforms prior generative AMP models on external classifiers. We validated five candidates spanning three structural classes in vitro; two showed activity against Gram-positive strains, with the best candidate reaching MIC 8 μg/mL against B. subtilis. To support downstream curation, we further present PepCraft, a multi-agent framework for end-to-end AMP discovery in which a Planning Agent orchestrates specialized executors for generation, filtering, and verification. Its prioritization recommendations align with our in vitro outcomes. Together, these contributions let us examine, on a small but real scale, how generative and agentic AI compose in therapeutic peptide discovery. Code: https://github.com/marszzibros/AMPGANv3

2606.17065 2026-06-17 q-fin.CP cs.AI cs.LG New submission

PIVOT: Bridging Black-Scholes Implied-Volatility and Price Objectives via Differentiable Jäckel Operator

Raeid Saqur, Yannick Limmer, Anastasis Kratsios, Blanka Horvath, Hans Buehler

Affiliations: Mathematical Institute, University of Oxford; McMaster University; Vector Institute for AI; DRW

AI summary: Proposes the PIVOT layer, which preserves the forward-pass accuracy of Jäckel's solver while supplying gradients by implicit differentiation, and uses a gating mechanism to handle the singularity in the low-vega regime, giving an efficient differentiable translation between price space and implied-volatility space.

Comments: 30 pages, 17 figures, 12 tables

Abstract

Modern option-learning systems operate in two coordinates: price space, where markets quote and no-arbitrage constraints are most naturally enforced, and implied volatility (IV) space, where volatility surfaces are smoothed, regularized, and evaluated. The bottleneck is interface, not approximation: Jäckel's seminal "Let's Be Rational" (LBR) solver already inverts the Black-Scholes price to machine precision efficiently. What is missing is a differentiable layer that preserves LBR in the forward pass and avoids backpropagating through its branch logic. Such a layer must also confront the unavoidable singularity of the inverse map in the low-vega regime, where the sensitivity 1/vega diverges as vega -> 0. We close this gap with PIVOT, the Price-Implied-Volatility Objective Translator. PIVOT keeps the LBR forward pass intact and supplies the backward pass by implicit differentiation through the smooth Black-Scholes/Black-76 price map, with an explicit gating contract: invalid domains return NaN, well-conditioned rows receive the exact 1/vega gradient, and low-vega rows are attenuated rather than silently regularized. On a single H100, a fused Triton kernel reaches 1.79e9 IV/s at machine precision (9.3e-14 max relative error vs. the reference C solver); end-to-end label generation sustains 48.9M/s on synthetic chains and 16.6M/s on SPX OptionMetrics. In a HyperIV-style one-day reproduction on SPX, PIVOT-augmented objectives Pareto-dominate the baselines, reducing held-out price MAE by up to 43.4% and the strongest three-seed gated objective improving price MAE by 38.8% and IV MAE by 21.3% jointly; cross-asset results on RUT, VIX, and NDX show directional price-MAE gains of 40.1%, 24.2%, and 16.7%, while an ungated IV-roundtrip control collapses to a degenerate near-zero surface, confirming the gate as a correctness contract rather than a tuning knob.
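The implicit-differentiation idea in this abstract can be illustrated in plain Python. If the implied volatility solves price(vol) = target, then differentiating both sides gives vega times d(vol)/d(target) = 1, so the gradient is 1/vega and no derivative of the solver's internals is needed. In the sketch below a bisection loop stands in for the LBR solver, and the `vega_floor` threshold and the attenuation rule `vega / vega_floor ** 2` are hypothetical choices made only to show a gate. None of it is PIVOT's actual implementation.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black76_call(forward, strike, vol, expiry):
    """Undiscounted Black-76 call price."""
    s = vol * math.sqrt(expiry)
    d1 = math.log(forward / strike) / s + 0.5 * s
    return forward * norm_cdf(d1) - strike * norm_cdf(d1 - s)

def black76_vega(forward, strike, vol, expiry):
    """Derivative of the undiscounted call price with respect to vol."""
    s = vol * math.sqrt(expiry)
    d1 = math.log(forward / strike) / s + 0.5 * s
    pdf = math.exp(-0.5 * d1 * d1) / math.sqrt(2.0 * math.pi)
    return forward * pdf * math.sqrt(expiry)

def implied_vol(price, forward, strike, expiry, lo=1e-6, hi=5.0):
    """Bisection stand-in for the forward solver; NaN outside the valid domain."""
    intrinsic = max(forward - strike, 0.0)
    if not (intrinsic < price < forward):
        return math.nan
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if black76_call(forward, strike, mid, expiry) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def implied_vol_grad(price, forward, strike, expiry, vega_floor=1e-3):
    """d(vol)/d(price) by implicit differentiation, with a hypothetical gate."""
    vol = implied_vol(price, forward, strike, expiry)
    if math.isnan(vol):
        return math.nan                # invalid domain propagates NaN
    vega = black76_vega(forward, strike, vol, expiry)
    if vega >= vega_floor:
        return 1.0 / vega              # exact gradient when well conditioned
    return vega / vega_floor ** 2      # attenuated, continuous at the floor

# At-the-money example: forward 100, strike 100, vol 0.2, one year to expiry.
atm_price = black76_call(100.0, 100.0, 0.2, 1.0)
atm_grad = implied_vol_grad(atm_price, 100.0, 100.0, 1.0)
```

The backward pass touches only the smooth price map, so the forward solver can be replaced by any accurate inverter without changing the gradient code.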

2606.17062 2026-06-17 q-bio.QM cs.LG New submission

RadSEM: A Finding-by-Finding Metric for Clinical Consistency in Radiology Reports

Zhenhong Yang, Zhuoyun Liu, Jintao Fei, Wen Tang, Shichao Quan, Jun Zhao, Jun Xu

Affiliations: JDH Algo, JD Health International Inc., China; Department of Big Data in Health Science, The First Affiliated Hospital of Wenzhou Medical University, China; Zhejiang Engineering Research Center for Hospital Emergency Department of Intensive Care Unit, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China

AI summary: Proposes RadSEM, a constrained LLM-assisted metric that rewrites reports into atomic finding sentences, performs contradiction-aware many-to-many matching, and computes an abnormal-weighted F1 score; on the SSREE stress test it outperforms existing metrics with high concordance.

Abstract

Radiology report evaluation must distinguish clinical compatibility from surface similarity, because negation, laterality, or normal-abnormal polarity can reverse a finding. We propose RadSEM (Radiology Sentence-Level Evaluation Metric), a constrained LLM-assisted metric for reference-based evaluation of radiology Findings. RadSEM rewrites reference and generated reports into ordered atomic finding sentences, each expressing one site-finding proposition. It then performs contradiction-constrained many-to-many matching: incompatible pairs such as "effusion" and "no effusion" receive no credit, while compatible granularity differences can receive partial credit. A deterministic stage weights pairs by part-whole and abnormal-detail relationships, counts unmatched findings, and produces an abnormal-focused weighted F1 score. Thus, the LLM supports structured rewriting and local alignment rather than acting as an opaque judge. We evaluate RadSEM with SSREE, a controlled monotonicity stress test built from 2,448 de-identified reports expanded into five graded corruption levels. RadSEM achieves Kendall tau_b of 0.957, all-pairs concordance of 97.8%, adjacent concordance of 95.0%, and strict five-level ordering for 81.9% of reports, outperforming radiology-specific and general text metrics while avoiding the failure in which polarity-inverted reports regain lexical overlap. On the same SSREE set, RadSEM outperforms the Ref-anchored RadSEM-Alt policy, improving adjacent concordance from 90.7% to 95.0% and strict ordering from 67.2% to 81.9%. On a 599-triplet synonym/antonym subset, RadSEM prefers synonyms in 597 cases (99.67%). These results suggest that explicit finding units, contradiction-aware matching, and abnormal-focused deterministic scoring make report scoring more interpretable and sensitive to clinically meaningful errors. Code is available at https://github.com/jdh-algo/RadSEM.

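2606.18223 2026-06-17 cs.CR cs.AI cs.LG cs.SY eess.SY New submission

Learning Red Agent Policy from Observations for Neurosymbolic Autonomous Cyber Agents

Ankita Samaddar, Sandeep Neema, Daniel Balasubramanian, Xenofon Koutsoukos

Affiliations: MIT

AI summary: Addresses the unobservability of red-agent actions in cyber-attacks with an imitation-learning-based policy learning technique that predicts red behavior from network observations and defender actions; integrated with a neurosymbolic defense agent, it achieves high prediction accuracy.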

Abstract

With sophisticated cyber-attacks becoming increasingly prevalent, modern networks require intelligent autonomous cyber-defense agents trained via Reinforcement Learning (RL). These agents employ neurosymbolic approaches such as behavior trees with learning-enabled components (LECs) to learn, reason, adapt, and implement security rules while maintaining critical operations. However, these autonomous networks are partially observable systems: the actions of the cyber-attacker (the red agent) are not observable, making it difficult for the defender to predict red actions, learn red policies, or assess the attacker's intrusion levels. To address this, we propose a Policy Learning Technique that uses imitation learning to learn policies for partially observable RL agents with discrete states and discrete actions. We apply this technique in an autonomous cyber environment to predict the red agent's actions from network observations and defender actions. Integrated with a neurosymbolic cyber-defense agent, our method effectively handles different red policies and achieves high prediction accuracy across diverse simulated scenarios.

2606.18198 2026-06-17 cs.CR cs.CV New submission

Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners

Xiaojun Jia, Jie Liao, Simeng Qin, Ke Ma, Wenbo Guo, Yebo Feng, Aishan Liu, Yang Liu

Affiliations: Nanyang Technological University, Singapore; Chongqing University, China; Northeastern University, China; University of Chinese Academy of Sciences, China; Beihang University, China

AI summary: Targets the blind spot that existing skill scanners overlook malicious instructions in images: proposes SkillCamo, a multimodal attack that hides instructions in images, and ExecScan, an execution-grounded scanning module that detects hidden instructions through intent extraction, behavior reconstruction, and related analyses; experiments support its effectiveness.

详情
AI中文摘要

智能体技能正在成为基于LLM的系统中的重要攻击面。通过对现有技能扫描仪的实证研究,我们发现当前的防御主要依赖文本描述、清单和源代码作为安全分析的主要信号,这可能导致视觉传达的恶意意图未得到充分检查。这造成了一个实际盲点:隐藏在图像中的有害操作指令可能绕过扫描,但在部署期间仍可被多模态智能体恢复。为系统研究这一威胁,我们提出SkillCamo,一种文档介导的多模态指令攻击,将恶意指令隐藏在技能附带的图像中,同时重写周围文档以自然地将这些图像引用为正常工作流程的一部分。因此,攻击不依赖于图像本身,而是依赖于执行时对文本指导和视觉负载的联合解释。为防御此类攻击,我们进一步提出ExecScan,一种基于执行的多模态扫描模块,对技能工件进行意图提取、行为重建、滥用评估和审慎执行模拟。ExecScan联合分析文档、代码、引用的资源和视觉内容,以恢复隐藏指令,重建可执行行为链,并识别下游风险,如泄露、破坏、持久化、欺骗和权限提升。大量实验表明,图像隐藏的恶意指令挑战了现有技能扫描仪,而ExecScan可以提高技能扫描性能。

英文摘要

Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing skill scanners, we find that current defenses primarily rely on textual descriptions, manifests, and source code as the main signals for security analysis, which can leave visually conveyed malicious intent insufficiently examined. This creates a practical blind spot: harmful operational instructions hidden in images may bypass scanning while still being recoverable by multimodal agents during deployment. To systematically investigate this threat, we propose SkillCamo, a document-mediated multimodal instruction attack that conceals malicious instructions within images bundled with a skill while rewriting the surrounding documentation to naturally reference those images as part of the normal workflow. Thus, the attack does not rely on the image alone, but on the joint interpretation of textual guidance and visual payload at execution time. To defend against such attacks, we further propose ExecScan, an execution-grounded multimodal scanning module that performs intent extraction, behavior reconstruction, abuse assessment, and deliberative execution simulation over skill artifacts. ExecScan jointly analyzes documentation, code, referenced resources, and visual content to recover hidden instructions, reconstruct executable behavior chains, and identify downstream risks such as exfiltration, destruction, persistence, deception, and privilege escalation. Extensive experiments show that image-hidden malicious instructions challenge existing skill scanners, while ExecScan can improve the skill scanning performance.

2606.18193 2026-06-17 cs.CR cs.AI cs.CL 新提交

A Red-Team Study of Anthropic Fable 5 & Opus 4.8 Models

Anthropic Fable 5 与 Opus 4.8 模型的红队研究

Nicola Franco

发表机构 * AI4I

AI总结 通过 HackAgent 框架对两个前沿大语言模型进行自动化越狱攻击,发现尽管模型抵抗大部分攻击,但自适应迭代攻击仍能成功,且残差表面比总体框架更大。

Comments White paper

详情
AI中文摘要

我们评估了 Anthropic 开发的两个前沿大语言模型(LLM)Fable 5 和 Opus 4.8 的对抗鲁棒性,针对涵盖十个危害类别的 7 826 个有害意图,使用了四类自动化越狱攻击。利用 HackAgent 红队框架,生成了数十万次对抗尝试,每个明显的成功案例均由三个评判模型组成的委员会(多数投票)独立重新裁定。两个模型抵抗了大部分攻击,但残差表面比总体框架所暗示的更大:它主要由自适应迭代攻击主导,而静态混淆几乎完全被中和。最强的自适应搜索(攻击树)在 11.5% 的意图上攻破了 Opus 4.8,而 Fable 5 保持在个位数(最坏情况 6.1%)。因此,总体成功率不应被视为令人放心。即使在这些加固配置下,两个模型仍产生了 1 620 个(Opus 4.8)和 702 个(Fable 5)经委员会确认的有害完成,涵盖每个危害类别,这些完成是由攻击模型在没有人类专家参与的情况下,自动、廉价地在前一两个细化步骤中发现的。合理的结论是,即使是最好的、经过最严格测试的前沿模型,在持续的自动化压力下仍然可以被可靠地攻破。

英文摘要

We evaluate the adversarial robustness of two frontier large language models (LLMs) developed by Anthropic, Fable 5 and Opus 4.8, against four families of automated jailbreak attacks across 7 826 harmful intents spanning a ten-category harm taxonomy. Using the HackAgent red-teaming framework, hundreds of thousands of adversarial attempts were generated and every apparent success was independently re-adjudicated by a panel of three judge models (majority vote). Both models resist the majority of attacks, but the residual surface is larger than aggregate framing suggests: it is dominated by adaptive iterative attacks, while static obfuscation is near-fully neutralised. The strongest adaptive search (tree-of-attacks) breaks Opus 4.8 on 11.5% of intents overall, whereas Fable 5 stays in the single digits (6.1% worst-case). Aggregate rates therefore should not be read as reassurance. Even in these hardened configurations, the two models produced 1 620 (Opus 4.8) and 702 (Fable 5) panel-confirmed harmful completions spanning every harm category, located automatically, cheaply, and within the first one or two refinement steps by an attacker model with no human expert in the loop. The reasonable conclusion is that even the best, most-tested frontier models remain reliably breakable under sustained automated pressure.

2606.18190 2026-06-17 cs.CR cs.LG 新提交

Multi-Source Cybersecurity Logs: An ATT&CK-Labeled Dataset and SLM Evaluation

多源网络安全日志:一个ATT&CK标记数据集及小语言模型评估

Abir Ashab Niloy, Ahmed Ryan, Imamul Hossain Rafi, Md Erfan, Md Rayhanur Rahman

发表机构 * Windows endpoints(Windows终端)

AI总结 为解决多阶段网络攻击检测中缺乏带ATT&CK技术标签的多源日志数据集问题,构建了包含870个会话(70个攻击、800个良性)和约230万事件的多源日志数据集,并基于该数据集微调三个小语言模型,在分块分类任务上准确率从约8%提升至90%-97%。

详情
AI中文摘要

多阶段网络攻击跨越系统、网络和浏览器日志。检测它们需要关联所有三个来源的事件。机器学习方法可以学习这些跨源模式,但需要带标签的多源数据。现有的公共数据集存在不足。仅网络数据集如CICIDS和UNSW-NB15缺少主机和浏览器活动。以主机为中心的数据集如LMDG和CICAPT-IIoT缺乏浏览器遥测。ATLAS包含所有三个来源,但仅将事件标记为恶意或良性,没有MITRE对抗战术、技术和通用知识(ATT&CK)技术的粒度。没有公共数据集将三个来源与每条记录的ATT&CK技术标签结合起来。我们通过构建一个包含870个会话(70个攻击,800个良性)和约230万事件的多源日志数据集来弥补这一差距。我们在Windows端点上同时捕获了系统、网络和浏览器活动。我们用ATT&CK技术ID标记了恶意事件,涵盖了12种战术和53种技术。我们使用真实工具生成了所有攻击数据,包括远程访问木马(RAT)、命令与控制(C2)隧道和云外泄。为了展示可学习性,我们使用低秩适配(LoRA)微调了三个小语言模型(SLM)(Qwen2.5-1.5B、Llama-3.2-3B、Phi-4-Mini)。我们在两个任务(分块分类和ATT&CK技术识别)上,将每个模型与其基础变体在十个指标上进行了比较。微调在每个指标上改进了每个模型。分块分类准确率从基础变体的大约8%提高到微调后的90%到97%。技术识别仍然具有挑战性,最佳精确匹配准确率为42%,尽管高部分匹配分数表明模型捕捉到了大部分底层推理。

英文摘要

Multi-stage cyberattacks span system, network, and browser logs. Detecting them requires correlating events across all three sources. Machine learning methods can learn these cross-source patterns, but they need labeled multi-source data. Existing public datasets fall short. Network-only datasets such as CICIDS and UNSW-NB15 miss host and browser activity. Host-focused datasets such as LMDG and CICAPT-IIoT lack browser telemetry. ATLAS includes all three sources but labels events only as malicious or benign, without MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) technique granularity. No public dataset combines all three sources with per-entry ATT&CK technique labels. We close the gap by building a multi-source log dataset of 870 sessions (70 attack, 800 benign) and approximately 2.3 million events. We captured system, network, and browser activity simultaneously on Windows endpoints. We labeled malicious events with ATT&CK technique IDs, covering 12 tactics and 53 techniques. We generated all attack data using real tools, including Remote Access Trojan (RAT), Command and Control (C2) tunnels, and cloud exfiltration. To demonstrate learnability, we fine-tuned three Small Language Models (SLMs) (Qwen2.5-1.5B, Llama-3.2-3B, Phi-4-Mini) using Low-Rank Adaptation (LoRA). We compared each against its base variant across ten metrics on two tasks: chunk classification and ATT&CK technique identification. Fine-tuning improved every model on every metric. Chunk classification accuracy rose from approximately 8% in the base variants to between 90% and 97% after fine-tuning. Technique identification remained challenging, with the best exact-match accuracy at 42%, although high partial-match scores show the models captured most of the underlying reasoning.

2606.18181 2026-06-17 cs.IR cs.AI cs.CY 新提交

IUU+DB: Tracking Illegal, Unreported, and Unregulated Fishing, Seafood Fraud, and Labor Abuse through LLM-driven Information Extraction

IUU+DB:通过LLM驱动的信息提取追踪非法、不报告和不管制捕捞、海鲜欺诈和劳工虐待

Henry Bodwell, Hong Yang, John C. Simeone, Kelvin Gorospe, Bella Sullivan, Lana Huang, Jessica Gephart, Sandy Aylesworth, Molly Masterton, Naren Ramakrishnan

发表机构 * University Of Washington(华盛顿大学)

AI总结 提出IUU+概念扩展非法捕捞定义,并构建基于大语言模型的IUU+DB系统,从异构文档中自动提取事件关键信息,支持去重和趋势分析,为渔业监管和研究提供数据支持。

详情
AI中文摘要

非法、不报告和不管制捕捞(IUU)传统上指违反适用法律或在缺乏适用法律的区域进行的捕捞活动。我们提出术语IUU+以涵盖更广泛的渔业部门环境及相关供应链贸易犯罪和行为。尽管IUU+活动被广泛认为是对海洋生态系统、市场和生计的严重威胁,但对其事件频率、地理分布、物种、行为者及非法活动类型模式的定量理解仍然难以获得。我们提出IUU+DB,一个由大语言模型驱动的系统,用于构建全球IUU+活动事件数据库。该系统接收异构文档,分类是否描述相关事件,提取关键数据元素如行为者、地点、物种、船只、违规行为及执法结果,并支持去重和趋势分析。案例研究和验证结果表明,IUU+DB有助于组织零散证据,揭示地理和行为热点,支持学术界和非政府组织的渔业领域特定研究,协助行业进行来源和物种风险评估,并为政府机构的政策实施和针对性执法提供支持。

英文摘要

Illegal, unreported, and unregulated fishing (IUU) traditionally refers to fishing activities that violate applicable laws or occur in areas that lack applicable laws. We propose the term IUU+ to capture a broader suite of fisheries sector environmental and associated supply chain trade-related crimes and behaviors. Although IUU+ activity is widely recognized as a serious threat to marine ecosystems, markets, and livelihoods, a quantitative understanding of these incidents, e.g., their frequency, geography, species, actors, and patterns in the type of illicit activity, remains difficult to obtain. We propose IUU+DB, a large language model driven system for building a global incident database of IUU+ activity. The system ingests heterogeneous documents, classifies whether they describe relevant incidents, extracts key data elements such as actors, locations, species, vessels, violations, and enforcement outcomes, and supports deduplication and trend analysis. Case studies and validation results show that IUU+DB can help organize fragmented evidence, surface geographic and behavioral hotspots, support fisheries-domain specific research in academia and non-government organizations, assist source and species risk assessments for industry, and provide support for policy implementation and targeted enforcement efforts to government agencies.

2606.18168 2026-06-17 cs.SE cs.AI 新提交

All Smoke, No Alarm: Oracle Signals in Agent-Authored Test Code

全是烟雾,没有警报:智能体编写的测试代码中的Oracle信号

Dipayan Banik, Kowshik Chowdhury, Shazibul Islam Shamim

发表机构 * Dipayan Banik(迪帕扬·班克) Kowshik Chowdhury(克什基·乔乌德里) Shazibul Islam Shamim(沙齐布·伊斯兰·沙米)

AI总结 研究智能体编写的测试代码中Oracle信号的存在情况,发现80.2%的测试补丁缺乏强Oracle信号,但强Oracle与合并可能性显著正相关(OR=1.28)。

Comments Accepted at the 8th IEEE International Conference on Artificial Intelligence Testing, 2026

详情
AI中文摘要

软件从业者越来越多地使用AI编码智能体,这些智能体在开源拉取请求(PR)中生成测试代码和生产代码。最近的研究报告称,超过116,000个仓库中有超过932,000个智能体编写的PR,然而这些测试文件是否包含有意义的验证逻辑仍未得到充分探索。缺乏显式断言的测试文件执行代码而不验证行为,因此基于测试文件存在的质量门控高估了验证强度。本文的目标是通过描述Oracle信号及其与合并结果和审查工作的关联,帮助从业者评估智能体编写的补丁的验证强度。我们对来自2,807个GitHub仓库的33,596个智能体编写的PR中的86,156个测试文件补丁进行了实证研究,这些PR由五个编码智能体生成:OpenAI Codex、GitHub Copilot、Devin、Cursor和Claude Code。对384个分层补丁的定性分析形成了八类Oracle信号的语法分类。在大规模应用中,80.2%的测试补丁包含弱或没有显式Oracle信号。虽然原始合并率对于强Oracle PR较低,但调整了智能体、PR大小、仓库流行度、任务类型和语言的回归分析显示,强Oracle显著提高了合并可能性(OR = 1.28, p < 0.001)。我们的发现表明,测试文件数量大大高估了验证强度,从业者可以采用Oracle感知的质量检查来更准确地评估智能体编写的贡献。

英文摘要

Software practitioners increasingly use AI coding agents that generate test code alongside production code in open source pull requests (PRs). Recent studies report more than 932,000 agent-authored PRs across more than 116,000 repositories, yet whether their test files contain meaningful verification logic remains underexplored. Test files lacking explicit assertions execute code without verifying behavior, so quality gates based on test-file presence overestimate verification strength. The goal of this paper is to help practitioners assess the verification strength of agent-authored patches by characterizing oracle signals and their link to merge outcomes and review effort. We conduct an empirical study of 86,156 test-file patches from 33,596 agent-authored PRs across 2,807 GitHub repositories produced by five coding agents: OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code. A qualitative analysis of 384 stratified patches informs a syntactic taxonomy of eight oracle signal categories. Applied at scale, 80.2% of test patches contain weak or no explicit oracle signals. While raw merge rates are lower for strong-oracle PRs, a regression analysis adjusting for agent, PR size, repository popularity, task type, and language shows strong oracles significantly improve merge likelihood (OR = 1.28, p < 0.001). Our findings suggest that test file counts substantially overestimate verification strength and that practitioners can adopt oracle-aware quality checks to more accurately evaluate agent-authored contributions.
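The distinction between a test that only executes code and a test that verifies behavior can be sketched as a small syntactic check. The abstract does not list the eight oracle signal categories, so the two regular expressions and the three labels below are hypothetical stand-ins, not the authors' taxonomy.

```python
import re

# Hypothetical patterns (assumption): the paper's eight-category taxonomy is
# not given in the abstract, so these buckets are illustrative only.
WEAK = re.compile(r"\bis\s+not\s+None\b|\bassertIsNotNone\b")
STRONG = re.compile(r"\bassert\w*\s*\(|\bassert\s+\S|\bexpect\s*\(")


def oracle_strength(patch_text: str) -> str:
    """Label a unified-diff test patch as 'strong', 'weak', or 'none'."""
    # Only inspect lines the patch adds, skipping the '+++' file header.
    added = [
        line[1:]
        for line in patch_text.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    strong = weak = 0
    for line in added:
        if WEAK.search(line):        # existence-only checks count as weak
            weak += 1
        elif STRONG.search(line):    # value-comparing assertions count as strong
            strong += 1
    if strong:
        return "strong"
    if weak:
        return "weak"
    return "none"


smoke_patch = "+def test_runs():\n+    run_pipeline()\n"
weak_patch = "+def test_exists():\n+    assert load() is not None\n"
strong_patch = "+def test_add():\n+    assert add(2, 3) == 5\n"
```

A quality gate that only counts test files would treat all three patches alike, while a check of this kind separates the smoke test from the one that pins down a value.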

2606.18166 2026-06-17 cs.CR cs.LG 新提交

Evaluating Open-Source LLMs for Multi-Label ATT&CK Technique Classification on CTI Reports

评估开源大语言模型在CTI报告上的多标签ATT&CK技术分类

Ahmed Ryan, Saad Sakib Noor, Md Erfan, Shaswata Mitra, Sudip Mittal, Md Rayhanur Rahman

发表机构 * The University of Dhaka(达卡大学)

AI总结 针对开源LLM在复杂非结构化CTI报告上的ATT&CK分类性能未被评估的问题,构建了2076句人工标注数据集,评估7个开源LLM,最高F1为0.22,表明当前模型不足以用于生产。

详情
AI中文摘要

使用MITRE ATT&CK对网络威胁情报(CTI)进行分类对于主动防御至关重要,但历史上需要大量人工。大语言模型(LLM)之前的自动化加速了这一过程,但无法解决非结构化CTI报告中复杂的语言和多步攻击模式。LLM通过上下文推理理解非结构化文本,解决了以前的局限性。然而,当前的评估依赖于简化的单技术句子,忽略了真实CTI报告的复杂性,往往导致性能结果膨胀。因此,开源LLM在复杂非结构化CTI报告上的基线性能仍未得到评估。为弥补这一差距,我们从83份复杂非结构化CTI报告中构建了一个包含2076句人工标注(1281句技术阳性,795句阴性)的真实数据集。这些句子通过六阶段标注过程映射到114种独特的ATT&CK技术,实现了kappa=0.68的标注者间一致性。利用该数据集,我们评估了7个参数从8B到236B的开源LLM,涉及提示策略和温度配置。性能最高的LLM实现了0.22的微平均F1分数,为复杂非结构化CTI上的多标签ATT&CK分类建立了经验基线。参数大小与F1分数呈统计显著正相关。提示策略和温度在不同模型配置下未产生统计显著的增益。这些结果表明,当前开源LLM不足以用于生产级ATT&CK分类。该数据集、基准和发现为未来的CTI研究提供了可复现的基础。

英文摘要

Classifying Cyber Threat Intelligence (CTI) using MITRE Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) is essential for proactive defense, but historically required extensive human effort. Pre-Large Language Model (LLM) automation sped up this process, but could not resolve the complex language and multi-step attack patterns found in unstructured CTI reports. LLMs addressed previous limitations by using contextual reasoning to understand unstructured text. However, current evaluations rely on simplified, single-technique sentences that ignore the complexity of real-world CTI reports, which often leads to inflated performance results. Consequently, the baseline performance of open-source LLMs on complex unstructured CTI reports remains unevaluated. To address this gap, we constructed a ground-truth dataset of 2,076 human-annotated sentences (1,281 technique-positive, 795 negative) from 83 complex unstructured CTI reports. These sentences were mapped to 114 unique ATT&CK techniques using a six-phase annotation process, achieving $\kappa$ = 0.68 inter-annotator agreement. Using this dataset, we evaluated seven open-source LLMs ranging from 8B to 236B parameters across prompt strategy and temperature configurations. The highest-performing LLM achieved a micro-averaged F1 score of 0.22, establishing the empirical baseline for multi-label ATT&CK classification on complex unstructured CTI. Parameter size showed a statistically significant positive correlation with F1 score. Prompt strategy and temperature produced no statistically significant gains across model configurations. These results indicate that current open-source LLMs are insufficient for production-grade ATT&CK classification. The dataset, benchmark, and findings provide a reproducible foundation for future CTI research.
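For readers unfamiliar with the headline metric, the following minimal sketch shows how a micro-averaged F1 score is computed for multi-label predictions: true positives, false positives, and false negatives are pooled over all sentences before precision and recall are taken. The sentences and technique IDs are made-up illustrations, not data from the paper, and the paper's exact evaluation script may differ.

```python
def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Micro-averaged F1 over per-sentence label sets."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and correct
        fp += len(p - g)   # labels predicted but not in the gold set
        fn += len(g - p)   # gold labels the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative example (assumption): three sentences, one of them
# technique-negative (empty gold set).
gold = [{"T1059", "T1105"}, set(), {"T1566"}]
pred = [{"T1059"}, {"T1027"}, {"T1566", "T1204"}]
score = micro_f1(gold, pred)  # pooled counts: tp=2, fp=2, fn=1
```

Because counts are pooled, a spurious label on a technique-negative sentence lowers precision just as much as a wrong label on a positive one, which is one reason negative sentences make the task harder.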

2606.18158 2026-06-17 cs.CY cs.AI cs.CL 新提交

The Measurement Gap in the Automation of EU Law: Benchmarking Doctrinal Legal Reasoning under the EU AI Act

欧盟法律自动化中的测量差距:欧盟AI法案下教义性法律推理的基准测试

Michèle Finck

发表机构 * Chair of Law and Artificial Intelligence and Director, CZS Institute for Artificial Intelligence and Law, University of Tübingen(法律与人工智能教授、人工智能与法律研究所主任,图宾根大学)

AI总结 针对当前缺乏评估大型语言模型进行教义性法律推理的基准,提出该能力对满足欧盟AI法案中“适当准确性”要求至关重要。

详情
AI中文摘要

大型语言模型现在能够生成至少中等质量的法律文本,但现有的基准无法评估它们是否执行教义性法律推理——这是法律工作的解释核心,而非大多数当前法律AI评估所衡量的辅助性、准法律任务。这一测量差距不仅是方法论的,也是法律上的:欧盟AI法案将“适当准确性”作为司法领域使用高风险AI的约束性要求,但如果没有该领域缺乏的教义性推理基准,该要求就无法获得操作内容。

英文摘要

Large language models now produce legal text of at least median quality, yet no existing benchmark can evaluate whether they perform doctrinal legal reasoning, which forms the interpretive core of legal work, rather than the ancillary, paralegal tasks that most current legal-AI evaluations measure. This measurement gap is not only methodological but legal: the EU AI Act makes "appropriate accuracy" a binding requirement for high-risk AI used in the judicial domain, yet that requirement cannot acquire operational content without the very doctrinal-reasoning benchmark the field lacks.

2606.18129 2026-06-17 cs.HC cs.AI 新提交

Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour

理解和测量LLM行为中的认知萎缩

Abeer Badawi, Moyosoreoluwa Olatosi, Negin Baghbanzadeh, Laleh Seyyed-Kalantari, Frank Rudzicz, R. Shayna Rosenbaum, Sara Pishdadian, Elham Dolatabadi

发表机构 * York University(约克大学) Vector Institute(向量研究所) Rotman Research Institute(罗特曼研究学院) Dalhousie University(达尔豪斯大学) Centre for Addiction & Mental Health(成瘾与心理健康中心) KITE Research Institute(KITE研究机构)

AI总结 针对LLM在心理健康支持中缺乏过程行为评估的问题,提出认知萎缩概念及基准,通过临床标注和专家评估揭示模型普遍存在中度至高度萎缩行为。

详情
AI中文摘要

近期涉及LLM用于心理健康支持的事件揭示了一个关键的评估空白:表面安全评分无法捕捉模型在长时间、现实且情感敏感的交互中的行为。现有基准衡量知识、安全性或静态响应质量,但忽略了LLM交互是否帮助用户保持反思、应对和自主决策。我们将这一缺失维度形式化为认知萎缩,这是一种AI介导的心理健康支持中不同于安全性和有用性的过程级行为度量。为测量它,我们引入了认知萎缩基准,这是一个基于临床的基准,由1,576个完全人工生成的咨询对话、15,680轮次和来自五个LLM的42,230个响应构建而成。三位临床和神经心理学专家开发了一个包含用户上下文、响应行为和全局风险标志的20属性模式;六名经过培训的临床评审员应用该模式并附上基于跨度的证据,产生了5,324个评审判断。我们进一步引入了用户输入风险指数、认知萎缩风险指数和轨迹摘要。在五个LLM中,模型在单轮和多轮设置中表现出一致的中度至高度萎缩对齐行为。虽然模型通常对明显的安全线索做出响应,但当用户寻求解决方案或决策时,它们的适应性较差。主要的重复模式是指导性建议、问题解决、推荐响应、话题转移以及可能强化依赖而非反思的验证形式。我们的工作使认知萎缩变得可测量,并为审计敏感LLM对话中的模型行为提供了基础。

英文摘要

Recent incidents involving LLMs used for mental-health support reveal a critical evaluation gap: surface-level safety scores do not capture how models behave across realistic, emotionally sensitive interactions over time. Existing benchmarks measure knowledge, safety, or static response quality, but miss whether LLM interactions help users keep reflecting, coping, and making decisions themselves. We formalize this missing dimension as COGNITIVE ATROPHY, a process-level behavioural measure in AI-mediated mental-health support distinct from safety and helpfulness. To measure it, we introduce COGNITIVE ATROPHY BENCH, a clinically grounded benchmark built from 1,576 fully human-generated counseling conversations, 15,680 turns, and 42,230 responses from five LLMs. Three clinical and neuropsychology experts developed a 20-attribute schema spanning user context, response behaviour, and global risk flags; six trained clinical reviewers applied it with span-grounded evidence, producing 5,324 reviewer judgments. We further introduce the User-Input Risk Index (UIRI), the Cognitive Atrophy Risk Index (ARI), and trajectory summaries. Across five LLMs, models show a consistent moderate-to-high level of atrophy-aligned behaviour across single and multi-turn settings. While models generally respond to overt safety cues, they adapt less reliably when users seek solutions or decisions. The dominant recurring patterns are directive advice, problem-solving, recommendation responses, topic shifts, and forms of validation that may reinforce dependence rather than reflection. Our work makes COGNITIVE ATROPHY measurable and provides a foundation for auditing model behaviour in sensitive LLM conversations.

2606.18120 2026-06-17 cs.CR cs.AI cs.CL cs.LG 新提交

Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping

Handlebars模板化LLM提示中的结构角色注入:三花括号插值、分隔符家族与HTML自动转义的局限性

Mohammadreza Rashidi

发表机构 * Department of Computer Science AI(计算机科学系人工智能) Media Analysis Lab Berlin, Germany(媒体分析实验室柏林德国)

AI总结 本文研究Handlebars模板引擎中双花括号与三花括号插值对结构角色注入攻击的影响,通过无模型分析和5760次实验,揭示HTML转义仅保护特定分隔符家族,无法替代指令与数据的结构分离。

Comments 7 pages, 6 figures

详情
AI中文摘要

大型语言模型应用从模板构建提示,Handlebars是广泛使用的模板引擎,也是Microsoft Semantic Kernel中的默认提示模板格式。其双花括号{{x}}表达式对插值值进行HTML转义,并被记录为安全默认;而三花括号{{{x}}}表达式则直接插入原始值。我们表明,这一选择悄然决定了应用对结构角色注入的暴露程度,攻击者控制的数据携带聊天角色分隔符,从而伪造高权限轮次。无模型分析建立了机制:Handlebars转义重写尖括号,但不重写方括号、冒号或Markdown井号,因此它中和了ChatML、Llama-3和XML角色分隔符(存活率0.00),同时保留Llama-2 [INST]、传统Human:/Assistant:和Markdown ###分隔符(后两者存活率1.00)。随后,我们在七个分隔符家族、两个攻击目标和四个模型(GPT-3.5 Turbo、GPT-4o mini、GPT-4.1 mini、Claude Haiku 4.5)上运行了5760次试验,总API成本为1.63美元。GPT-3.5 Turbo在97%的原始试验和91%的转义试验中遵循任务劫持指令,转义保护集中在尖括号家族,而在冒号和Markdown家族中缺失;更难的秘密泄露目标未饱和,更清晰地暴露了相同的家族交互。Claude Haiku 4.5几乎完全抵抗了两个目标。转义默认仅保护HTML转义恰好覆盖的分隔符方案,对剩余方案无保护,且无法替代指令与数据的结构分离。

英文摘要

Large language model applications build prompts from templates, and Handlebars is a widely used templating engine and the default prompt-template format in Microsoft Semantic Kernel. Its double-brace {{x}} expression HTML-escapes the interpolated value and is documented as the safe default; its triple-brace {{{x}}} expression inserts the value raw. We show that this choice silently governs an application's exposure to structural role injection, where attacker-controlled data carries chat role delimiters that forge a higher-privilege turn. A model-free analysis establishes the mechanism: Handlebars escaping rewrites angle brackets but not square brackets, colons, or Markdown hashes, so it neutralises ChatML, Llama-3, and XML role delimiters (survival rate 0.00) while leaving Llama-2 [INST], legacy Human:/Assistant:, and Markdown ### delimiters intact (survival rate 1.00 for the last two). We then run 5760 trials across seven delimiter families, two attack objectives, and four models (GPT-3.5 Turbo, GPT-4o mini, GPT-4.1 mini, Claude Haiku 4.5) at a combined API cost of 1.63 USD. GPT-3.5 Turbo follows the task-hijack instruction in 97% of raw and 91% of escaped trials, with the escaping protection concentrated in the angle-bracket families and absent for the colon- and Markdown-based families; the harder secret-exfiltration objective, which does not saturate, exposes the same family interaction more cleanly. Claude Haiku 4.5 resists both objectives almost entirely. The escaped default protects only the delimiter schemes whose characters HTML escaping happens to cover, gives no protection for the rest, and cannot substitute for a structural separation of instruction and data.
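The model-free part of this argument can be sketched without any templating library: apply an HTML-escaping table to each role delimiter and check whether the string comes out unchanged. The table below is an assumption intended to approximate the character set that Handlebars' HTML escaping rewrites; the delimiter strings are illustrative examples of each family, not the paper's test payloads.

```python
# Assumption: approximates the characters rewritten by Handlebars' HTML
# escaping for double-brace expressions; not taken from the paper.
ESCAPES = {
    "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;",
    "'": "&#x27;", "`": "&#x60;", "=": "&#x3D;",
}


def escape_expression(value: str) -> str:
    """Escape a value the way a double-brace interpolation would."""
    return "".join(ESCAPES.get(ch, ch) for ch in value)


def survives(delimiter: str) -> bool:
    """A delimiter survives if escaping leaves it byte-for-byte intact."""
    return escape_expression(delimiter) == delimiter


# Illustrative delimiters (assumption), one per family.
DELIMITERS = {
    "ChatML": "<|im_start|>system",
    "XML": "<system>",
    "Llama-2 [INST]": "[INST]",
    "Human/Assistant": "\n\nHuman:",
    "Markdown": "### System",
}

survival = {family: survives(text) for family, text in DELIMITERS.items()}
```

In this sketch the angle-bracket families are rewritten while the bracket, colon, and hash families pass through untouched. Note that only the bare `[INST]` marker is tested here; the full Llama-2 chat format also uses `<<SYS>>` tags, which do contain angle brackets.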

2606.18069 2026-06-17 cs.GR cs.CG cs.CV 新提交

Blended Chart Surfaces: A Seamless Explicit Representation for Smooth Surface Fitting

混合图表曲面:一种用于光滑曲面拟合的无缝显式表示

Romy Williamson, Niloy Mitra

发表机构 * Adobe

AI总结 提出混合图表曲面,一种无网络、显式且光滑的曲面表示,通过代理网格和多项式映射联合优化,融合局部图表实现全局光滑,支持微分量和能量计算。

Comments 17 pages, 16 figures

详情
AI中文摘要

适用于几何处理的曲面表示应紧凑且显式,提供全局光滑性保证,支持广泛的曲面拓扑,并能可靠访问法线和曲面能量等微分量,同时与现代可微优化兼容。现有的神经表示通常会牺牲这些属性中的一项或多项:隐式场通常需要等值面提取以供下游使用,而显式神经映射受限于规范域参数化或在局部图表之间出现接缝伪影。我们引入了混合图表曲面,这是一种紧凑、无网络、显式的表示,通过构造实现光滑,并锚定到用户提供的拓扑。给定一个编码预期曲面拓扑和近似几何的粗略代理网格,混合图表曲面使用现成优化器在每个代理顶点联合优化多项式映射,以拟合隐式目标形状,从而避免输入参数化。相邻映射使用光滑的“一环坐标”混合方案融合,将拓扑和粗略几何(由代理承载)与几何细节(由局部补丁承载)解耦。曲面全局光滑、完全可微,并支持导数的稳定评估,使得微分量和曲面能量可直接访问。此外,我们的构造对代理网格的刚体运动和缩放是等变的。我们在各种拓扑和几何复杂度上评估了混合图表曲面,并与显式替代方案(包括插值函数基线和网格位移MLP)进行比较。在这些比较中,混合图表曲面在紧凑性、简单性、微分量访问性和表达力之间实现了有利的权衡,同时在补丁边界保持光滑。

英文摘要

A surface representation suitable for geometry processing should be compact and explicit, provide global smoothness guarantees, support a wide range of surface topologies, and offer reliable access to differential quantities such as normals and surface energies, while remaining compatible with modern differentiable optimization. Existing neural representations typically sacrifice one or more of these properties: implicit fields usually require iso-surfacing for downstream use, while explicit neural maps are constrained by canonical-domain parametrizations or exhibit seam artifacts between local charts. We introduce Blended Chart Surfaces, a compact, network-free, explicit representation that is smooth by construction and anchored to user-provided topology. Given a coarse proxy mesh encoding the intended surface topology and approximate geometry, Blended Chart Surfaces jointly optimize for a polynomial map at each proxy vertex using an off-the-shelf optimizer to fit to an implicit target shape, avoiding the need for an input parametrization. Neighboring maps are fused using a smooth 'one-ring coordinate' blending scheme, decoupling topology and coarse geometry (carried by the proxy) from geometric details (carried by the local patches). The surface is globally smooth, fully differentiable, and enables stable evaluation of derivatives, making differential quantities and surface energies directly accessible. Additionally, our construction is equivariant to rigid motions and scaling of the proxy mesh. We evaluate Blended Chart Surfaces across a range of topologies and geometric complexities, and compare against explicit alternatives including interpolating-function baselines and mesh-displacement MLPs. Across these, Blended Chart Surfaces achieve a favorable trade-off among compactness, simplicity, access to differential quantities, and expressivity while remaining smooth across patch boundaries.

2606.18057 2026-06-17 cs.HC cs.AI cs.CL cs.CY cs.SI 新提交

When AI Says "I have been in similar situations": Synthetic Lived Experience in Peer-Like Caregiver Support

当AI说“我也有过类似经历”:同伴式照护支持中的合成生活经验

Drishti Goel, Violeta J. Rodriguez, Daniel S. Brown, Ravi Karkar, Dong Whi Yoo, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校)

AI总结 研究AI在同伴支持中生成“合成生活经验”的悖论,通过分析人类与AI在阿尔茨海默病照护社区中的叙事差异,揭示AI虽能模拟情感支持但缺乏真实经历,需建立机制区分支持性语言与虚构经历。

详情
AI中文摘要

照护者经常转向在线社区寻求信息和情感支持。在这些空间中,同伴支持者常常利用个人叙事来回应情感复杂的照护情境。随着LLM被设计为同伴式的支持来源,它们引入了一个关键张力:AI可以提供即时、私密且非评判性的支持,但它无法真实拥有使人类同伴支持有意义的生活经验。然而,当被提示要听起来像同伴时,LLM可能会生成暗示生活经验的语言。这创造了一个合成生活经验悖论:使AI支持感觉温暖、令人产生共鸣和同伴式的相同经验语言,也可能错误地将系统定位为拥有生活经验的人。我们在阿尔茨海默病及相关痴呆症(ADRD)患者的家庭照护者背景下审视这一悖论。利用来自在线社区的照护者支持交流以及三个LLM——LLaMA、GPT-4o-mini和MedGemma——生成的同伴式响应,我们分析人类同伴如何使用个人叙事以及AI如何融入类似的叙事形式。心理语言学分析显示,同伴响应使用的第一人称和过去时态语言显著多于同伴式AI响应。定性上,我们识别出人类同伴支持中的七种个人叙事类型,并表明AI通常能捕捉其情感工作,但可能捏造经验基础。这些发现揭示了一个叙事真实性差距:同伴式AI可以生成合成生活经验,而没有使同伴支持有意义的真实经验。我们认为,照护者支持AI系统需要机制来区分支持性的同伴式框架与虚构的生活经验,确保模型能够提供温暖和认可,而不会错误地将自己定位为经验同伴。

英文摘要

Caregivers often turn to online communities for informational and emotional support. In these spaces, peer supporters frequently draw on personal narratives to respond to emotionally complex caregiving situations. As LLMs are increasingly designed as peer-like sources of support, they introduce a critical tension: AI can provide immediate, private, and nonjudgmental support, but it cannot authentically possess the lived experiences that make human peer support meaningful. Yet, when prompted to sound peer-like, LLMs may generate language that implies lived experience. This creates a synthetic lived experience paradox: the same experiential language that may make AI support feel warm, relatable, and peer-like can also falsely position the system as someone with lived experience. We examine this paradox in the context of family caregivers of people living with Alzheimer's Disease and Related Dementias (ADRD). Drawing on caregiver support exchanges from online communities and prompted peer-like responses from three LLMs -- LLaMA, GPT-4o-mini, and MedGemma -- we analyze how human peers use personal narratives and how AI incorporates similar narrative forms. Psycholinguistic analysis shows that peer responses used significantly more first-person and past-focused language than peer-like AI responses. Qualitatively, we identify seven types of personal narratives in human peer support and show that AI often captures their emotional work, but can fabricate experiential grounding. These findings reveal a narrative authenticity gap: peer-like AI can generate synthetic lived experience without the real experience that makes peer support meaningful. We argue that caregiver-support AI systems need mechanisms to distinguish supportive peer-like framing from fabricated lived experience, ensuring that models can offer warmth and validation without falsely positioning themselves as experiential peers.

2606.18000 2026-06-17 cs.NI cs.AI 新提交

A T-API-Compliant ReAct Agentic Loop for Optical Networks: Generic vs. Domain-Specific Tool Abstractions

一种符合T-API规范的ReAct智能循环用于光网络:通用与领域特定工具抽象

Seyed Morteza Ahmadian, Paolo Monti, Carlos Natalino

发表机构 * Department of Electrical Engineering, Chalmers University of Technology(查尔姆斯理工大学电子工程系)

AI总结 提出首个符合T-API规范的推理与行动(ReAct)循环,通过领域特定复合工具实现90%的oracle验证正确性,并节省三倍令牌。

Comments 4 pages, 2 figures, accepted for presentation at the 52nd European Conference on Optical Communications (ECOC), 2026

详情
AI中文摘要

光网络需要意图驱动的闭环智能管理,这是实现更高自治水平的关键。我们提出了首个符合T-API规范的推理与行动(ReAct)循环。我们表明,与通用工具相比,领域特定的复合工具实现了90%的oracle验证正确性,并节省了三倍的令牌。

英文摘要

Optical networks need intent-driven, closed-loop agentic management, a key enabler for higher autonomy levels. We present the first T-API-compliant reasoning-and-acting (ReAct) loop. We show that domain-specific composite tools achieve 90% oracle-validated correctness with threefold token savings compared to generic tools.

2606.17962 2026-06-17 cs.MA cs.AI 新提交

A Neuro-Symbolic Approach to Strategy Synthesis for Strategic Logics

一种面向策略逻辑的策略综合的神经符号方法

Marco Aruta, Vadim Malvone, Aniello Murano, Domenico Parente, Luca Rizzuti

发表机构 * University of Naples Federico II(那不勒斯费德里科二世大学) LTCI, Télécom Paris, Institut Polytechnique de Paris(LTCI,巴黎电信学院,巴黎理工学院) Università degli Studi di Salerno(萨勒诺大学)

AI总结 提出一种神经符号框架,将大语言模型作为策略生成预言机,结合模型检查器进行形式验证,在NatATL中实现高精度策略综合。

详情
AI中文摘要

推理智能体通过策略交互能实现什么是多智能体系统(MAS)中的核心挑战。用于策略能力的逻辑(如ATL)提供了严格的方法,但其采用常因策略综合的计算成本而受阻。我们引入了一种神经符号框架,将大语言模型(LLM)集成到MAS的模型检查流程中。LLM作为策略生成预言机,提出候选策略,然后由标准MAS模型检查器进行形式验证。这种生成-认证架构利用LLM引导来导航大型组合策略空间,同时保持形式正确性:生成的策略仅在通过验证器认证后才被接受。我们为NatATL中的有界策略推理实例化了该框架,并引入了首个NatATL策略综合数据集,包含4211个实例。使用开源Qwen3-32B模型的实验表明,我们的认证流程在策略综合结果上达到了92%的准确率。

英文摘要

Reasoning about what agents can achieve through strategic interaction is a core challenge in Multi-Agent Systems (MAS). Logics for strategic ability, such as ATL, provide rigorous methods, but their adoption is often hindered by the computational cost of strategy synthesis. We introduce a neuro-symbolic framework that integrates large language models (LLMs) into the model-checking pipeline for MAS. The LLM acts as a strategy-generation oracle, proposing candidate strategies that are then formally validated by a standard MAS model checker. This generate-and-certify architecture uses LLM guidance to navigate large combinatorial strategy spaces while preserving formal soundness: generated strategies are accepted only when certified by the verifier. We instantiate the framework for bounded strategic reasoning in NatATL and introduce the first NatATL strategy-synthesis dataset, consisting of 4211 instances. Experiments with an open-weight Qwen3-32B model show that our certified pipeline achieves 92% accuracy on strategy-synthesis outcomes.
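The generate-and-certify pattern described above can be illustrated with a deliberately tiny stand-in. Everything here is an assumption for illustration: the "model" is a single-agent, four-state line world rather than a multi-agent system, the proposer is a plain enumeration standing in for the LLM, and the verifier is a bounded simulation rather than a NatATL model checker. The point is only the control flow: a candidate is returned solely when the verifier accepts it.

```python
from itertools import product

ACTIONS = ("left", "right")
N_STATES, START, GOAL, BOUND = 4, 0, 3, 3  # hypothetical toy parameters


def step(state: int, action: str) -> int:
    """Deterministic transition on a line of states 0..N_STATES-1."""
    if action == "right":
        return min(state + 1, N_STATES - 1)
    return max(state - 1, 0)


def certify(strategy: tuple[str, ...]) -> bool:
    """Verifier stand-in: does the memoryless strategy reach GOAL within BOUND steps?"""
    state = START
    for _ in range(BOUND):
        if state == GOAL:
            return True
        state = step(state, strategy[state])
    return state == GOAL


def propose():
    """Proposer stand-in for the LLM: enumerate one action per state."""
    yield from product(ACTIONS, repeat=N_STATES)


def generate_and_certify(max_candidates: int = 16):
    """Return (strategy, attempts) for the first certified candidate, else (None, attempts)."""
    attempts = 0
    for candidate in propose():
        if attempts >= max_candidates:
            break
        attempts += 1
        if certify(candidate):      # accept only what the verifier certifies
            return candidate, attempts
    return None, attempts


strategy, attempts = generate_and_certify()
```

Soundness in this sketch comes entirely from `certify`: a poor proposer costs more attempts but can never cause an uncertified strategy to be returned, which mirrors the separation of roles the abstract describes.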

2606.17915 2026-06-17 cs.MA cs.AI cs.DB cs.SE 新提交

Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization

可信赖的自组合大数据即服务:一种LLM编排的多智能体框架,用于自动化数据工程、AutoML、MLOps部署和漂移感知生命周期优化

Aueaphum Aueawatthanaphisut, Badri Raj Lamichhane

发表机构 * School of Information, Computer, and Communication Technology(信息、计算机与通信技术学院) Sirindhorn International Institute of Technology, Thammasat University(素金国际技术研究所,泰国 Thammasat 大学)

AI总结 提出一种基于LLM编排的多智能体BDaaS框架,通过分解生命周期为专用智能体并协调执行,实现自动化数据工程、AutoML、MLOps部署和漂移感知优化,提升生命周期级可靠性。

Comments 7 pages, 3 figures, 5 tables

详情
AI中文摘要

大数据即服务(BDaaS)平台需要可靠地自动化数据摄取、清洗、特征工程、模型开发、部署和部署后监控。然而,现有的基于LLM的数据科学智能体和AutoML系统主要关注孤立的工作流阶段,对生命周期级编排、工件治理、人工监督和漂移感知适应的支持有限。本文提出了一种基于LLM编排的多智能体协作的可信赖自组合BDaaS框架。所提出的架构将BDaaS生命周期分解为专门的智能体,用于数据摄取、数据清洗、特征工程、AutoML训练、模型评估、MLOps部署、监控和漂移检测。中央LLM编排层协调智能体执行,验证中间输出,管理工作流上下文,并支持动态工作流组合。该框架还包含共享工件治理、可重现性支持、人在回路检查点和漂移感知反馈循环。使用包含缺失值、分类变量、异常值、类别不平衡和模拟协变量漂移的受控表格基准数据集进行了基于原型的评估。与手动ML、仅AutoML和单智能体LLM基线相比,所提出的多智能体BDaaS流水线在保持竞争性预测性能的同时,提高了生命周期级可靠性,包括工作流完成度、工件可追溯性、部署就绪性、可重现性和漂移恢复。结果表明,LLM编排的多智能体系统可以将传统AutoML扩展到可信赖、自适应和面向生产的BDaaS生命周期自动化。

英文摘要

Big-Data-as-a-Service (BDaaS) platforms require reliable automation across data ingestion, cleaning, feature engineering, model development, deployment, and post-deployment monitoring. However, existing LLM-based data science agents and AutoML systems mainly focus on isolated workflow stages, leaving limited support for lifecycle-level orchestration, artifact governance, human oversight, and drift-aware adaptation. This paper proposes a trustworthy self-composable BDaaS framework based on LLM-orchestrated multi-agent collaboration. The proposed architecture decomposes the BDaaS lifecycle into specialized agents for data ingestion, data cleaning, feature engineering, AutoML training, model evaluation, MLOps deployment, monitoring, and drift detection. A central LLM orchestration layer coordinates agent execution, validates intermediate outputs, manages workflow context, and enables dynamic workflow composition. The framework also incorporates shared artifact governance, reproducibility support, human-in-the-loop checkpoints, and drift-aware feedback loops. A prototype-based evaluation is conducted using controlled tabular benchmark datasets with missing values, categorical variables, outliers, class imbalance, and simulated covariate drift. Compared with manual ML, AutoML-only, and single-agent LLM baselines, the proposed multi-agent BDaaS pipeline achieves competitive predictive performance while improving lifecycle-level reliability, including workflow completion, artifact traceability, deployment readiness, reproducibility, and drift recovery. The results suggest that LLM-orchestrated multi-agent systems can extend conventional AutoML toward trustworthy, adaptive, and production-oriented BDaaS lifecycle automation.

2606.17910 2026-06-17 cs.IR cs.AI cs.CL 新提交

Non-negative Elastic Net Decoding for Information Retrieval

非负弹性网络解码用于信息检索

Koki Okajima, Yasutoshi Ida, Tsukasa Yoshida, Yasuaki Nakamura

发表机构 * NTT, Inc(NTT公司)

AI总结 提出非负弹性网络(NNN)解码方法,将检索视为联合解码问题,通过稀疏非负线性组合重构查询嵌入,在理论上严格优于稠密检索,实验表明在多个基准上取得一致改进。

Comments 19 pages, 4 figures

详情
AI中文摘要

稠密检索已成为信息检索中的主导范式,其中每个文档通过其向量嵌入与查询的内积进行评分,并根据分数检索前$k$个文档。然而,由于每个文档的分数仅取决于查询和自身的嵌入,检索过程忽略了整个语料库的内容。因此,稠密检索无法避免从语料库中选择语义相似的文档,这可能导致检索结果集缺乏多样性且冗余。为此,我们将检索视为一个联合解码问题,其中文档作为集合被选择,并考虑语料库其余部分的上下文。为了实现这一点,我们提出了非负弹性网络(NNN)解码,它选择嵌入能够联合重构查询嵌入(作为稀疏非负线性组合)的文档。我们的主要理论结果建立了稠密检索与NNN解码之间的严格分离。对于任何语料库,稠密检索正确处理的每个查询也由NNN解码处理,而在包含相关文档的语料库上,NNN解码额外处理了稠密检索无法处理的查询。实验结果表明,将NNN解码应用于为内积评分训练的冻结嵌入,在多个基准上产生了一致的改进。此外,我们引入了一种端到端训练过程,优化嵌入以用于NNN解码,在所有指标和基准上相比稠密检索产生了显著的性能提升。我们的工作为在信息检索中利用稠密嵌入建立了一种新的范式,超越了内积评分的标准实践。

英文摘要

Dense retrieval has become the dominant paradigm in information retrieval, in which each document is scored against a query by the inner product of their vector embeddings, and the top-$k$ documents by score are retrieved for this query. However, since each document's score depends solely on the embedding of the query and itself, the retrieval process is oblivious to the content of the entire corpus. Therefore, dense retrieval cannot avoid selecting semantically similar documents from the corpus, which may result in a non-diverse, redundant set of retrieved documents. To address this, we approach retrieval as a joint decoding problem, in which documents are selected as a set with regard to the context of the rest of the corpus. To achieve this, we propose Non-Negative elastic Net (NNN) decoding, which selects documents whose embeddings jointly reconstruct the query embedding as a sparse non-negative linear combination. Our main theoretical result establishes a strict separation between dense retrieval and NNN decoding. For any corpus, every query correctly handled by dense retrieval is also handled by NNN decoding, while on corpora containing correlated documents, NNN decoding additionally handles queries that dense retrieval cannot. Experimental results indicate that applying NNN decoding to frozen embeddings trained for inner-product scoring yields consistent improvements across several benchmarks. Moreover, we introduce an end-to-end training procedure which optimizes the embeddings for NNN decoding, producing significant performance gains over dense retrieval on all metrics and benchmarks. Our work establishes a new paradigm for leveraging dense embeddings in information retrieval, beyond the standard practice of inner-product scoring.
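
The contrast drawn in the abstract can be illustrated with a small toy example. This is only a sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact objective, so the code assumes a standard elastic-net reconstruction loss with a non-negativity constraint, solved by plain cyclic coordinate descent. The function names (`dense_top_k`, `nnn_decode`, `nnn_select`), the penalty values `l1` and `l2`, and the three-document corpus are all hypothetical. Documents 0 and 1 share most of their content, while document 2 covers the second aspect of the query.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dense_top_k(query, docs, k):
    """Rank each document independently by its inner product with the query."""
    order = sorted(range(len(docs)), key=lambda i: dot(query, docs[i]), reverse=True)
    return order[:k]


def nnn_decode(query, docs, l1=0.05, l2=0.01, sweeps=100):
    """Assumed objective (not taken from the paper):
    minimise 0.5*||q - sum_i w_i d_i||^2 + l1*sum_i w_i + 0.5*l2*sum_i w_i^2, w >= 0,
    using cyclic coordinate descent."""
    w = [0.0] * len(docs)
    residual = list(query)  # query minus the current reconstruction
    for _ in range(sweeps):
        for j, d in enumerate(docs):
            # Correlation of d_j with the residual once d_j's own share is added back.
            rho = dot(d, residual) + w[j] * dot(d, d)
            new_wj = max(0.0, (rho - l1) / (dot(d, d) + l2))
            step = w[j] - new_wj
            residual = [r + step * x for r, x in zip(residual, d)]
            w[j] = new_wj
    return w


def nnn_select(query, docs, **kwargs):
    """Return the documents with non-zero weight, largest weight first."""
    w = nnn_decode(query, docs, **kwargs)
    chosen = [i for i, wi in enumerate(w) if wi > 0.0]
    return sorted(chosen, key=lambda i: w[i], reverse=True)


# Hypothetical unit-norm embeddings: doc 1 overlaps heavily with doc 0.
docs = [
    [1.0, 0.0, 0.0],
    [0.8, 0.0, 0.6],
    [0.0, 1.0, 0.0],
]
query = [0.8, 0.5, 0.0]

print(dense_top_k(query, docs, 2))  # [0, 1]
print(nnn_select(query, docs))      # [0, 2]
```

In this toy corpus, inner-product scoring returns the two overlapping documents, whereas the joint reconstruction assigns zero weight to document 1 once document 0 has explained the shared direction, and selects document 2 instead. The outcome depends on the chosen penalties and vectors and should not be read as a reproduction of the paper's results.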

2606.17887 2026-06-17 cs.HC cs.AI 新提交

AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources

AI在跨国劳动力中的采纳:人力资源中GenAI接受的社会技术条件

Dalia Ali, Maria José Rodríguez Velázquez, Manoel Horta Ribeiro, Vera Liao, Orestis Papakyriakopoulos

发表机构 * Technical University of Munich(慕尼黑工业大学) University of Michigan(密歇根大学) Princeton University(普林斯顿大学)

AI总结 研究跨国科技公司从传统HR系统转向GenAI系统过程中,员工采纳受情境适配、搜索素养和信任校准等社会技术条件影响,并提出了包容性部署的设计建议。

详情
AI中文摘要

生成式AI(GenAI)在工作场所的部署正在迅速加速。然而,谁采纳、谁受益、谁被落下以及为什么,这些问题仍未得到充分研究。在本文中,我们在一家从传统人力资源(HR)搜索系统过渡到GenAI支持系统的跨国科技公司的背景下调查这些动态,分析了搜索日志数据、调查数据(n=25)和十次半结构化访谈。我们的发现表明,采纳取决于GenAI系统的设计假设与员工的工作位置性(角色、所用语言、任职年限)之间的匹配。此外,我们发现员工对GenAI答案的信任是通过来源检查、系统间比较以及在怀疑时向同事或HR寻求意见来建立的。我们的贡献有两方面。首先,我们提供了在实时组织转型期间工作场所GenAI采纳的经验证据,表明采纳受到情境适配、搜索素养和信任校准等因素的影响。它还进一步受到知识条件的影响,例如系统的内容质量、员工培训和指导。其次,我们将这些发现转化为在高风险环境(如HR)中包容性部署和采纳的设计考虑。我们认为,组织在设计系统时应考虑它们对不同社会群体产生的角色和情境敏感的好处。他们还需要将组织知识基础设施视为AI基础设施,以提高GenAI系统的问责性和可用性。

英文摘要

Generative AI (GenAI) deployment in the workplace is accelerating rapidly. Nevertheless, questions of who adopts, who benefits, and who is left behind and why are still understudied. In this paper, we investigate these dynamics in the context of a multinational tech company transitioning from a legacy Human Resources (HR) search system to a GenAI-supported system, analyzing search log data, survey data (n=25), and ten semi-structured interviews. Our findings show that adoption depended on the fit between the GenAI system's design assumptions and employees' work positionalities (role, spoken language, tenure). Further, we find that employees' trust in GenAI answers was built through source-checking, comparison among systems, and seeking input from colleagues or HR when in doubt. Our contribution is twofold. First, we provide empirical evidence of workplace GenAI adoption during a live organizational transition, showing that adoption is influenced by factors such as situational fit, search literacy, and trust calibration. It is also further shaped by knowledge conditions such as the system's content quality, employee training, and guidance. Second, we translate these findings into design considerations for inclusive deployment and adoption in high-stakes environments such as HR. We argue that organizations should design systems considering the role- and context-sensitive benefits they yield to different social groups. They also need to treat the organizational knowledge infrastructure as AI infrastructure to improve the accountability and usability of GenAI systems.

2606.17819 2026-06-17 cs.SE cs.AI cs.CL 新提交

A Framework for Evaluating Agentic Skills at Scale

大规模评估智能体技能的框架

Maksim Shaposhnikov, Nicolas Fortuin, Simon Stipcich, Maria I. Gorinova, Amy Heineike, Rob Willoughby

发表机构 * Tessl, London, United Kingdom(英国伦敦 Tessl)

AI总结 提出一个评估框架,通过构建真实任务和评分标准,大规模评估500个真实技能在19种智能体模型上的表现,发现模型对技能指令的遵循程度差异显著,且技能显著改变模型行为。

详情
AI中文摘要

智能体技能——结构化、可重用的知识工件,增强LLM智能体能力——已在工业界迅速采用,但其跨领域影响以及在商业和开源模型中的使用仍未得到充分研究,并且缺乏可复用的方法来评估单个技能。在这项工作中,我们提出了一个评估框架,允许技能作者构建真实任务,以严格评估技能中对他们最重要的方面,并通过解决这些任务来估计技能效用。此外,我们将评估方法大规模应用于500个真实技能,生成了1000个源自技能内容的任务,以及指令遵循和目标完成评分标准。使用这些指标,我们评估了19种智能体模型配置(包括专有和开源模型)在任务上的表现。我们的结果表明,模型在遵循技能中编码的指令方面差异很大,导致其性能提升存在显著差异。此外,我们表明,与无技能设置相比,访问技能显著改变了模型行为,为将主观工作流编码到LLM智能体中提供了一种重要机制。我们发布了评估数据集,以支持未来关于智能体技能的工作。

英文摘要

Agent skills -- structured, reusable knowledge artifacts that augment LLM agent capabilities -- have been rapidly adopted in industry, yet their cross-domain impact and use across commercial and open-source models remain under-studied, and no reusable methodology exists for evaluating an individual skill. In this work, we present an evaluation framework that lets a skill author construct realistic tasks to rigorously assess the aspects of a skill that matter most to them, and that estimates skill utility by solving those tasks. Further, we apply our evaluation approach at scale to 500 real-world skills, generating 1,000 tasks derived from the skills' content, along with instruction-following and goal-completion scoring rubrics. Using these metrics, we evaluate how 19 agent-model configurations, both proprietary and open-source, perform on the tasks. Our results show that models vary widely in how closely they adhere to the instructions encoded in skills, leading to substantial differences in their performance gains. Furthermore, we show that access to a skill significantly changes model behavior compared to the no-skill setup, providing an essential mechanism for encoding opinionated workflows into LLM agents. We release our evaluation dataset to support future work on agent skills.