arXivDaily arXiv每日学术速递 周一至周五更新
2507.23027 2026-06-19 cs.CV cs.AI

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

恢复诊断价值:资源受限成像中超分辨率辅助的超声心动图分类

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

发表机构 * Indian Institute of Technology Madras(印度理工学院马德拉斯分校) All India Institute of Medical Sciences(全印度医学科学研究所) Indian Institute of Technology Hyderabad(印度理工学院海得拉巴分校)

AI总结 本文研究了基于深度学习的超分辨率技术在低质量2D超声心动图分类中的应用,通过CAMUS数据集验证了SRGAN和SRResNet在提升分类准确率和计算效率方面的有效性。

Comments Accepted at the MICCAI Workshop on "Medical Image Computing in Resource Constrained Settings & Knowledge Interchange (MIRASOL)" 2025

AI中文摘要

在资源受限环境下,自动心脏解读常受限于低质量超声心动图图像,限制了后续诊断模型的效果。尽管超分辨率(SR)技术在增强磁共振成像(MRI)和计算机断层扫描(CT)图像方面表现出潜力,但其在超声心动图(一种应用广泛但易受噪声影响的模态)中的应用仍待探索。本文研究了基于深度学习的SR技术在提升低质量2D超声心动图分类准确率方面的潜力。使用公开的CAMUS数据集,我们按图像质量对样本分层,并评估了两个复杂度不同的临床相关任务:相对简单的两腔 vs. 四腔(2CH vs. 4CH)视图分类和更复杂的舒张末期 vs. 收缩末期(ED vs. ES)时相分类。我们应用了两种广泛使用的SR模型,即Super-Resolution Generative Adversarial Network(SRGAN)和Super-Resolution Residual Network(SRResNet),以增强低质量图像,并观察到性能指标的显著提升,其中SRResNet的提升尤为明显,且具有计算效率优势。我们的发现表明,SR可以有效恢复降质超声扫描的诊断价值,使其成为资源受限环境(RCS)中AI辅助护理的可行工具,实现以少胜多。

英文摘要

Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models, Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metrics, particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.

2406.15465 2026-06-19 cs.CL cs.AI

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

RadEx:基于大型语言模型的放射科报告结构化信息提取框架

Daniel Reichenpfader, Jonas Knupp, André Sander, Kerstin Denecke

发表机构 * Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland(以患者为中心的数字健康研究所,伯尔尼应用科学大学,比尔,瑞士) ID Suisse AG, St. Gallen, Switzerland(ID瑞士股份有限公司,圣加仑,瑞士)

AI总结 RadEx框架包含15个软件组件和10个制品,用于开发从放射科报告中自动提取结构化信息的系统,支持生成式模型和仅编码器模型,便于组件替换并保证组件间的互操作性。

AI中文摘要

每年全球有超过30亿次X射线摄影检查和计算机断层扫描,产生大量以自由文本为主的非结构化放射科报告。尽管结构化报告有潜在优势,但其采用受限于既有流程、资源限制和信息丢失风险。然而,结构化信息对于自动分析、临床试验匹配和健康结果预测等多种用例是必要的。本研究介绍RadEx,一个端到端框架,包含15个软件组件和10个制品,用于开发自动提取放射科报告信息的系统。该框架涵盖从标注训练数据到信息提取的完整过程,提供一致的通用信息模型并设定模型开发边界。具体而言,RadEx允许临床医生定义相关临床领域(如乳腺摄影)的信息,并创建报告模板。该框架支持生成式模型和仅编码器模型,并通过将信息提取与模板填充解耦,实现独立的模型改进。根据RadEx框架开发信息提取系统,可简化实现和维护,因为组件易于替换,而标准化的制品确保组件间的互操作性。

英文摘要

Annually and globally, over three billion radiography examinations and computed tomography scans result in mostly unstructured radiology reports containing free text. Despite the potential benefits of structured reporting, its adoption is limited by factors such as established processes, resource constraints, and potential loss of information. However, structured information is necessary for various use cases, including automatic analysis, clinical trial matching, and prediction of health outcomes. This study introduces RadEx, an end-to-end framework comprising 15 software components and ten artifacts to develop systems that perform automated information extraction from radiology reports. It covers the complete process from annotating training data to extracting information by offering a consistent generic information model and setting boundaries for model development. Specifically, RadEx allows clinicians to define relevant information for clinical domains (e.g., mammography) and to create report templates. The framework supports both generative and encoder-only models, and the decoupling of information extraction from template filling enables independent model improvements. Developing information extraction systems according to the RadEx framework facilitates implementation and maintenance, as components are easily exchangeable, while standardized artifacts ensure interoperability between components.
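The decoupling of information extraction from template filling can be illustrated with a minimal sketch. Everything below is hypothetical and not taken from RadEx: the function names, the regular-expression stand-in for a model, the BI-RADS field, and the template text are assumptions chosen only to show that the two stages communicate through a plain dictionary of facts, so either stage can be replaced independently.

```python
import re
from string import Template

def extract_information(report_text):
    """Stand-in extractor (hypothetical); a real system would call a
    generative or encoder-only model here instead of a regular expression."""
    match = re.search(r"BI-RADS\s+(\d)", report_text)
    return {"birads_category": match.group(1) if match else "not stated"}

# Hypothetical report template, defined separately from the extractor.
REPORT_TEMPLATE = Template("Mammography summary: BI-RADS category $birads_category.")

def fill_template(template, facts):
    """Template filling only sees the extracted facts, never the raw report."""
    return template.substitute(facts)

facts = extract_information("Bilateral mammography. Assessment: BI-RADS 2, benign findings.")
summary = fill_template(REPORT_TEMPLATE, facts)
```

Swapping the regular expression for a model call would leave `fill_template` untouched, which is the kind of independent improvement the abstract describes.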

2306.12679 2026-06-19 cs.CL

Constructing Colloquial Dataset for Persian Sentiment Analysis of Social Microblogs

构建波斯语社交媒体微博客情感分析的口语数据集

Mojtaba Mazoochi, Leila Rabiei, Farzaneh Rahmani, Zeinab Rajabi

发表机构 * Faculty member in ICT Research Institute(ICT研究所教员) Iran Telecommunication Research Center (ITRC)(伊朗电信研究中心) Faculty member in Computer Department(计算机系教员) Mehralborz University(梅赫拉布尔兹大学) Hazrat-e Masoumeh University(玛苏姆大学)

AI总结 本文构建了波斯语口语数据集并提出基于CNN的模型,提升社交媒体微博客口语文本的情感分析性能,实验结果显示72%的准确率。

Journal ref Multimedia Tools and Applications, 2025

AI中文摘要

介绍:微博网站为情感分析和观点挖掘提供了丰富的数据源。然而,由于微博帖子通常缺乏语法一致的术语和代表性词汇,且用户不愿撰写长文,情感分类效率较低。此外,低资源语言也存在局限性。波斯语具有独特特征,需要独特的标注数据和模型进行情感分析,这与英语文本特征不同。方法:本文首先在协作环境中构建了一个名为ITRC-Opinion的用户意见数据集,包含60,000条来自Twitter和Instagram等社交媒体的非正式波斯语文本。其次,本文提出了一种基于卷积神经网络(CNN)的新型架构,以更有效地进行社交媒体微博客口语文本的情感分析。构建的数据集用于评估所提出的架构。此外,一些模型,如LSTM、CNN-RNN、BiLSTM和BiGRU,结合不同的词嵌入,包括Fasttext、Glove和Word2vec,也研究了我们的数据集并评估了结果。结果:结果表明我们的数据集和所提模型(72%准确率)的优势,展示了情感分类性能的显著提升。

英文摘要

Introduction: Microblogging websites have amassed rich data sources for sentiment analysis and opinion mining. However, sentiment classification has frequently proven inefficient because microblog posts typically lack syntactically consistent terms and representative vocabulary, since users on these social networks do not like to write lengthy statements. Low-resource languages pose additional limitations. The Persian language has exceptional characteristics and demands unique annotated data and models for the sentiment analysis task, which differ from the text features of English. Method: This paper first constructs a user opinion dataset called ITRC-Opinion in a collaborative environment and in an insourced manner. Our dataset contains 60,000 informal and colloquial Persian texts from social microblogs such as Twitter and Instagram. Second, this study proposes a new architecture based on the convolutional neural network (CNN) model for more effective sentiment analysis of colloquial text in social microblog posts. The constructed dataset is used to evaluate the presented architecture. Furthermore, several models, such as LSTM, CNN-RNN, BiLSTM, and BiGRU with different word embeddings, including Fasttext, Glove, and Word2vec, were applied to our dataset and their results evaluated. Results: The results demonstrate the benefit of our dataset and the proposed model (72% accuracy), displaying meaningful improvement in sentiment classification performance.

1902.06202 2026-06-19 cs.CV cs.CG

Using Persistent Homology to Quantify a Diurnal Cycle in Hurricane Felix

利用持续同调量化飓风菲利克斯的日变化

Sarah Tymochko, Elizabeth Munch, Jason Dunion, Kristen Corbosiero, Ryan Torn

发表机构 * Michigan State University, Dept. of Computational Mathematics, Science and Engineering(密歇根州立大学,计算数学、科学与工程系) Michigan State University, Dept. of Mathematics(密歇根州立大学,数学系) Cooperative Institute for Marine and Atmospheric Studies, University of Miami(海洋与大气研究合作研究所,迈阿密大学) Hurricane Research Division, NOAA/Atlantic Oceanographic and Meteorological Laboratory(飓风研究部,国家海洋和大气管理局/大西洋海洋学和气象实验室) University at Albany - SUNY Albany, Dept. of Atmospheric and Environmental Sciences(奥尔巴尼大学 - 纽约州立大学奥尔巴尼分校,大气与环境科学系)

AI总结 本文提出利用持续同调量化热带气旋日变化的方法,通过追踪最大持续性并利用离散傅里叶变换量化日变化特征。

AI中文摘要

热带气旋的日变化是卫星图像中出现的每日云层循环,可能对气旋结构和强度有影响。这些日变化脉冲在红外卫星图像中表现为周期性脉冲,从几乎所有大西洋气旋中心向外径向传播。我们提出利用持续同调追踪最大持续性并利用离散傅里叶变换量化日变化,通过Geostationary Operational Environmental Satellite IR影像数据检测飓风菲利克斯的日变化。

英文摘要

The diurnal cycle of tropical cyclones (TCs) is a daily cycle in clouds that appears in satellite images and may have implications for TC structure and intensity. The diurnal pattern can be seen in infrared (IR) satellite imagery as cyclical pulses in the cloud field that propagate radially outward from the center of nearly all Atlantic-basin TCs. These diurnal pulses, a distinguishing characteristic of the TC diurnal cycle, begin forming in the storm's inner core near sunset each day and appear as a region of cooling cloud-top temperatures. The area of cooling takes on a ring-like appearance as cloud-top warming occurs on its inside edge and the cooling moves away from the storm overnight, reaching several hundred kilometers from the circulation center by the following afternoon. The state-of-the-art TC diurnal cycle measurement has a limited ability to analyze the behavior beyond qualitative observations. We present a method for quantifying the TC diurnal cycle using one-dimensional persistent homology, a tool from Topological Data Analysis, by tracking maximum persistence and quantifying the cycle using the discrete Fourier transform. Using Geostationary Operational Environmental Satellite IR imagery data from Hurricane Felix (2007), our method is able to detect an approximate daily cycle.
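The last step of the method, quantifying a cycle with the discrete Fourier transform, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the input series is synthetic (in the paper it would be the maximum persistence computed from each satellite image), and the 3-hour sampling interval and the function name `dominant_period` are hypothetical.

```python
import cmath
import math

def dominant_period(values, sample_interval_hours):
    """Return the period (hours) of the strongest non-constant DFT component."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]  # remove the constant (k = 0) term
    best_k, best_power = None, -1.0
    # For real-valued input only frequencies up to n // 2 are distinguishable.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n * sample_interval_hours / best_k

# Synthetic stand-in for a maximum-persistence series: one value every
# 3 hours over 4 days, oscillating with a 24-hour cycle.
series = [10 + 2 * math.sin(2 * math.pi * (3 * t) / 24) for t in range(32)]
period_hours = dominant_period(series, sample_interval_hours=3)
```

On this synthetic series the strongest component sits at four cycles over the 96-hour window, which corresponds to a 24-hour period.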

2606.20457 2026-06-19 eess.AS cs.AI cs.LG 新提交

Repurposing a Speech Classifier for Guided Diffusion-Based Speech Generation

重新利用语音分类器进行基于引导扩散的语音生成

Rostislav Makarov, Timo Gerkmann

发表机构 * University of Hamburg(汉堡大学)

AI总结 提出将预训练的语音分类器作为扩散生成的主干,通过附加轻量子网络并仅训练该子网络,实现单主干模型的高质量条件语音生成,降低内存和计算成本。

Comments Accepted for publication in the Proceedings of Interspeech 2026

AI中文摘要

分类器引导是一种通过使用噪声条件分类器将采样过程导向目标类别来控制扩散生成的方法。分类器引导的一个缺点是需要两个单独训练的模型:一个分类器和一个扩散模型。因此,我们研究了一种更紧凑的替代方案,其中将传统训练的语音分类器重新用作扩散生成的主干。从log-Mel空间中的冻结噪声条件分类器开始,我们附加一个轻量子网络,该子网络重用中间分类器表示,并在去噪分数匹配目标下仅训练该子网络。我们的工作表明,预训练的分类器可以重新用于条件生成,为判别建模和条件语音合成之间提供了有吸引力的桥梁,从而在单主干模型中实现高语音质量,同时减少内存占用和计算成本。

英文摘要

Classifier guidance is a way to control diffusion generation by using a noise-conditioned classifier to steer the sampling process toward a target class. One drawback of classifier guidance is that it requires two separately trained models: a classifier and a diffusion model. We therefore study a more compact alternative in which a conventionally trained speech classifier is repurposed as the backbone for diffusion generation. Starting from a frozen noise-conditioned classifier in log-Mel space, we attach a lightweight subnetwork that reuses intermediate classifier representations and train only this subnetwork under a Denoising Score Matching objective. Our work shows that a pretrained classifier can be repurposed for conditional generation, providing an appealing bridge between discriminative modeling and conditional speech synthesis and yielding high speech quality within a single-backbone model, with reduced memory footprint and computational cost.
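For reference, the standard form of classifier guidance that the abstract builds on can be written out as below. The notation is an assumption, since the abstract gives no formulas: $x_t$ is the noisy log-Mel sample at diffusion time $t$, $y$ the target class, $s_\theta$ the learned score network, $p_\phi$ the noise-conditioned classifier, and $w$ a guidance scale.

```latex
% Bayes' rule applied to the noisy sample x_t and the target class y:
\[
  \nabla_{x_t} \log p_t(x_t \mid y)
    = \nabla_{x_t} \log p_t(x_t) + \nabla_{x_t} \log p_t(y \mid x_t)
\]
% Guided score used during sampling (w = 1 recovers the exact conditional score):
\[
  \tilde{s}(x_t, t, y)
    = s_\theta(x_t, t) + w \, \nabla_{x_t} \log p_\phi(y \mid x_t)
\]
```

In the conventional setup $s_\theta$ and $p_\phi$ are two separate networks; the approach described above instead derives the score from a subnetwork attached to the frozen classifier.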

2606.20451 2026-06-19 stat.ML cs.LG stat.AP stat.CO 新提交

SSH-Net: A Deep Neural Network for Predicting Failure Time Distribution Functions under Competing Risks with Application to GPU Data

SSH-Net: 一种用于竞争风险下预测失效时间分布函数的深度神经网络及其在GPU数据上的应用

Jie Min, Yueyao Wang, Mengkun Chen

发表机构 * Department of Mathematics & Statistics, University of South Florida(南佛罗里达大学数学与统计学系) School of Statistics and Data Science, Zhejiang Gongshang University(浙江工商大学统计与数据科学学院) Department of Statistics, Virginia Tech(弗吉尼亚理工大学统计学系)

AI总结 提出结构化分段风险深度神经网络(SSH-Net),通过将网络结构与数据结构关联,允许不同协变量组通过子网络影响预测,在竞争风险框架下预测失效时间分布函数,仿真和GPU数据验证了准确性。

AI中文摘要

竞争风险在工程领域常见,当应用场景复杂时会给时间事件数据建模带来挑战。近年来,深度神经网络因其灵活性和高学习能力在竞争风险预测中受到广泛关注。然而,神经网络结构的复杂性使得基于不同数据输入的超参数调优更加困难。此外,当工程系统具有多层级的复杂物理结构时,将所有结构层级视为单一输入组可能无法捕捉关键信息。为解决这些问题,我们提出了一种结构化分段风险深度神经网络(SSH-Net),用于在特定原因竞争风险框架下预测失效时间。我们的方法将神经网络结构与数据结构相关联,并允许不同的协变量组通过分离的子网络影响失效预测。神经网络基于特定原因竞争风险模型构建。SSH-Net输出特定原因风险函数,并采用惩罚对数似然作为损失函数。通过评估Brier分数、接收者操作特征曲线下面积(AUC)和预测的特定原因累积发生函数的均方根误差(RMSE),仿真研究验证了SSH-Net的预测准确性。我们进一步使用Titan GPU失效时间数据展示了模型预测失效时间分布函数的能力。

英文摘要

Competing risks are commonly observed in engineering fields and can bring challenges to time-to-event data modeling when the application scenarios are complicated. Recently, deep neural networks have received great attention for prediction with competing risks, due to their flexibility and high learning capability. However, the complexity of neural network structure brings extra difficulty in hyperparameter tuning based on different data inputs. Additionally, when an engineered system has complex physical structures with multiple hierarchical levels, treating all structural levels as a single group of inputs may fail to capture critical information. To address these issues, we propose a Structured Segmented Hazard Deep Neural Network (SSH-Net) for failure time prediction under a cause-specific competing risks framework. Our approach associates neural network structure with data structures, and allows different covariate groups to impact the failure prediction through separate sub-networks. The neural network is constructed based on a cause-specific competing risks model. The SSH-Net outputs cause-specific hazard functions, and utilizes the penalized log-likelihood as the loss function. The prediction accuracy of SSH-Net is validated through simulation studies by evaluating the Brier score, the area under the receiver operating characteristic curve (AUC), and the root mean square error (RMSE) of the predicted cause-specific cumulative incidence function. We further demonstrate the model's ability to predict failure time distribution functions using the Titan GPU failure time data.
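To show how cause-specific hazards translate into the cumulative incidence functions mentioned above, here is a minimal discrete-time sketch. It is an assumption-laden illustration, not SSH-Net itself: time is split into intervals, the hazard values are made up, and the function name is hypothetical. The relationship used is the textbook one, where the incidence of cause k grows by the probability of surviving all causes so far times the hazard of cause k in the current interval.

```python
def cumulative_incidence(hazards):
    """hazards[t][k] is the discrete cause-specific hazard of cause k in
    interval t. Returns cif[t][k], the probability of having failed from
    cause k by the end of interval t."""
    n_causes = len(hazards[0])
    survival = 1.0              # probability of no failure before this interval
    running = [0.0] * n_causes
    cif = []
    for h_t in hazards:
        # Failing from cause k now requires having survived every cause so far.
        running = [c + survival * h for c, h in zip(running, h_t)]
        cif.append(list(running))
        survival *= 1.0 - sum(h_t)
    return cif

# Made-up hazards for two failure causes over three intervals.
hazards = [[0.10, 0.05], [0.20, 0.10], [0.30, 0.10]]
cif = cumulative_incidence(hazards)
```

At every interval the cause-specific incidences and the overall survival probability sum to one, which is a quick sanity check for any model that outputs cause-specific hazards.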

2606.20206 2026-06-19 stat.ML cs.LG 新提交

Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

马尔可夫决策过程中奖励非随机缺失的缺失感知策略的离线评估

Ziheng Wei, Annie Qu, Rui Miao

发表机构 * Department of Statistics, University of Michigan at Ann Arbor(密歇根大学安娜堡分校统计学系) Department of Statistics and Applied Probability, University of California at Santa Barbara(加州大学圣巴巴拉分校统计与应用概率系) Department of Mathematical Sciences, University of Texas at Dallas(德克萨斯大学达拉斯分校数学科学系)

AI总结 针对奖励非随机缺失的离线强化学习问题,提出基于未来状态作为影子变量的识别方法,并利用桥函数和min-max估计器恢复条件均值奖励,实现缺失感知策略的离线评估。

Comments Accepted at ICML 2026. 31 pages, 6 figures

AI中文摘要

在离线强化学习中,由于记录稀疏或不规则,或超出特定奖励值的审查,记录批次数据中的即时奖励通常未被观测到。这个问题出现在实际场景中,包括医疗和营销。我们研究了有限时域马尔可夫决策过程中奖励非随机缺失时的离线策略评估,这破坏了可忽略性,并即使在以状态和行动为条件后也会引起选择偏差。为了解决这个问题,我们形式化了一个依赖于奖励的倾向模型,并使用未来状态作为影子变量来识别完整数据的条件均值奖励。我们进一步引入了一个桥函数,无需显式建模MNAR机制即可恢复条件均值奖励,并通过min-max过程进行估计以避免双重采样。基于这些识别结果,我们提出了一个类似Fitted-Q-Evaluation的估计器,该估计器传播恢复的奖励,同时允许目标策略依赖于过去的缺失指示符。最后,我们为我们的OPE估计器建立了一致性和有限样本误差界,并通过实验在模拟数据和MIMIC-III脓毒症数据上展示了我们方法相比现有方法的强性能。

英文摘要

In offline Reinforcement Learning, immediate rewards in logged batch data are often unobserved due to sparse or irregular record-keeping, or censored beyond certain reward values. This issue arises in practical settings, including health care and marketing. We investigate off-policy evaluation (OPE) in finite-horizon Markov decision processes when rewards are missing not at random (MNAR), which breaks ignorability and induces selection bias even after conditioning on states and actions. To address this, we formalize a reward-dependent propensity model and use future states as shadow variables to identify the full-data conditional mean reward. We further introduce a bridge function that recovers the conditional mean reward without explicitly modeling the MNAR mechanism, and estimate it via a min-max procedure to avoid double sampling. Building upon these identification results, we propose a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards while allowing target policies to depend on past missingness indicators. Finally, we establish consistency and finite-sample error bounds for our OPE estimator, and show through experiments the strong performance of our method compared to existing methods on simulated and MIMIC-III Sepsis data.

2606.20137 2026-06-19 eess.AS cs.CL cs.LG cs.SD 新提交

PASQA: Pitch-Accent-Focused Speech Quality Assessment Model Trained on Synthetic Speech with Accent Errors

PASQA:针对重音错误的合成语音训练的以音高重音为中心的语音质量评估模型

Masaya Kawamura, Yuma Shirahata, Kentaro Mitsui, Reo Shimizu

发表机构 * LY Corporation(LY公司)

AI总结 提出PASQA模型,通过可控重音合成数据集和伪重音质量分数,结合自监督表示、摩拉条件融合等训练策略,有效评估音高重音正确性,优于传统MOS模型。

Comments Accepted to INTERSPEECH 2026

AI中文摘要

现有的平均意见得分(MOS)预测模型通常预测话语级别的自然度MOS,并且可能对局部音高重音错误不敏感。我们提出了以音高重音为中心的语音质量评估(PASQA),明确针对音高重音正确性。为了训练我们的模型,我们使用重音可控的文本转语音系统通过改变重音模式构建了一个受控的日语重音错误数据集,并根据重音错误率计算伪重音质量得分。PASQA建立在自监督表示的基础上,并采用摩拉条件融合、排序损失、辅助重音错误定位任务和说话者不变训练。实验表明,传统模型无法保持按重音错误严重程度的排序,而PASQA在已见和未见说话者上都实现了高排序准确性。此外,PASQA与人类重音正确性判断的一致性更强。代码可在以下网址获取:https://github.com/lycorp-jp/PASQA。

英文摘要

Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.
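The evaluation idea, checking whether predicted scores preserve the ordering by accent-error severity, can be sketched as below. The abstract does not give the exact definitions, so both functions are assumptions: the pseudo score is taken to be one minus the accent-error rate, and ordering accuracy is taken to be the fraction of sample pairs with different targets whose predicted order agrees with the target order. The prediction values are invented.

```python
from itertools import combinations

def pseudo_quality(num_accent_errors, num_accent_phrases):
    """Assumed pseudo score: 1 minus the accent-error rate."""
    return 1.0 - num_accent_errors / num_accent_phrases

def ordering_accuracy(targets, predictions):
    """Fraction of pairs (with different targets) ranked in the same order
    by the predictions as by the targets."""
    pairs = [
        (i, j)
        for i, j in combinations(range(len(targets)), 2)
        if targets[i] != targets[j]
    ]
    correct = sum(
        (targets[i] - targets[j]) * (predictions[i] - predictions[j]) > 0
        for i, j in pairs
    )
    return correct / len(pairs)

# Four versions of one utterance with 0, 1, 2, and 4 errors out of 4 accent phrases.
targets = [pseudo_quality(errors, 4) for errors in (0, 1, 2, 4)]
# Invented model outputs; the second sample is wrongly scored above the first.
predictions = [4.2, 4.4, 3.1, 1.0]
accuracy = ordering_accuracy(targets, predictions)
```

Here five of the six pairs are ordered correctly, so a single swap between the clean and the one-error sample already costs one sixth of the score.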

2606.20106 2026-06-19 eess.AS cs.SD 新提交

Personalized Keyword Spotting for User-Defined Keywords Leveraging Text-Independent Speaker Verification

利用文本无关说话人验证的用户自定义关键词个性化唤醒

Ming-Hsiang Hu, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Berlin Chen

发表机构 * Dept. Computer Science and Information Engineering, National Taiwan Normal University, Taiwan(计算机科学与信息工程系,台湾国立台湾师范大学) United Link Co., Ltd., Taiwan(台湾联链公司)

AI总结 提出ZP-KWS轻量框架,结合音素监督音频编码器和紧凑说话人编码器,通过乘法后融合实现零样本关键词检测与说话人验证,在多个数据集上将目标误拒率降低高达60%。

Comments Accepted to Interspeech 2026

AI中文摘要

用户自定义关键词唤醒(UD-KWS)能够从文本实现零样本唤醒词检测,但现有系统学习的是说话人不变表示,无法拒绝说出正确关键词的冒名顶替者。我们针对这种双重零样本设置——未见关键词和未见说话人——提出了ZP-KWS,一个轻量级框架,将音素监督的音频编码器与GE2E预训练的紧凑说话人编码器(约0.9M参数)相结合。推理时的乘法后融合赋予每个分支独立的否决权,支持从传统检测到严格说话人门控激活的模式,无需重新训练。在LibriPhrase、Google Speech Commands和Qualcomm数据集上,ZP-KWS在1%虚警率下将目标仅误拒率相对于最强基线降低了高达60%,同时保持有竞争力的关键词检测,且总参数量在1.55M以内,适合边缘部署。

英文摘要

User-defined keyword spotting (UD-KWS) enables zero-shot wake-word detection from text, but existing systems learn speaker-invariant representations that cannot reject impostors uttering the correct keyword. We address this dual zero-shot setting -- unseen keywords and unseen speakers -- with ZP-KWS, a lightweight framework combining a phoneme-supervised audio encoder with a GE2E-pretrained compact speaker encoder (about 0.9M parameters). Multiplicative late fusion at inference grants each branch independent veto power, supporting modes from conventional detection to strict speaker-gated activation without retraining. On LibriPhrase, Google Speech Commands, and Qualcomm datasets, ZP-KWS reduces target-only FRR at 1% FAR by up to 60% relative to the strongest baseline while maintaining competitive keyword detection, all within a 1.55M parameter budget for edge deployment.
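The veto behaviour of multiplicative late fusion can be illustrated with a small sketch. It is not the ZP-KWS implementation: the function names, the 0.5 threshold, and especially the `speaker_weight` exponent used to move between modes are assumptions introduced here, since the abstract only states that the two branch scores are multiplied at inference and that several operating modes are available without retraining.

```python
def fused_score(keyword_prob, speaker_prob, speaker_weight=1.0):
    """Multiply the branch probabilities. With speaker_weight = 0 the speaker
    term becomes 1 and the score reduces to conventional keyword detection."""
    return keyword_prob * (speaker_prob ** speaker_weight)

def should_activate(keyword_prob, speaker_prob, threshold=0.5, speaker_weight=1.0):
    """Activate only if the fused score clears the (hypothetical) threshold."""
    return fused_score(keyword_prob, speaker_prob, speaker_weight) >= threshold

# Enrolled speaker says the keyword: both branches agree.
target_accepted = should_activate(keyword_prob=0.9, speaker_prob=0.8)
# Impostor says the keyword: the speaker branch vetoes the activation.
impostor_accepted = should_activate(keyword_prob=0.9, speaker_prob=0.1)
# Same impostor in conventional mode: the speaker branch is ignored.
impostor_conventional = should_activate(keyword_prob=0.9, speaker_prob=0.1, speaker_weight=0.0)
```

Because the scores are multiplied rather than averaged, a low value from either branch is enough to pull the fused score below the threshold, which is what gives each branch its independent veto.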

2606.20074 2026-06-19 eess.SP cs.AI cs.LG 新提交

Evaluation of EEG Foundation Models for Event-Based Burst-Suppression Detection in ICU

用于ICU中基于事件的爆发-抑制检测的EEG基础模型评估

Elisa Vasta, Thorir Mar Ingolfsson, Andrea Cossettini, Luca Benini, Tilman Beck, Emanuela Keller, Una Pale

发表机构 * DEI, University of Bologna, Bologna, Italy(DEI,博洛尼亚大学,博洛尼亚,意大利)

AI总结 本研究首次评估EEG基础模型在ICU中无需患者校准的爆发检测性能,REVE-base模型在事件级F1分数上达到0.868,并将每分钟爆发错误率分别降低52.1%和36.2%。

Comments 4 pages, 1 figure. Code available upon publication

AI中文摘要

爆发抑制(BS)是一种临床相关的脑电图(EEG)模式,用于监测危重患者的镇静深度和脑活动,特别是在重症监护病房(ICU)的诱导昏迷期间。自动爆发检测仍然具有挑战性,因为BS模式在不同患者之间差异很大,且标注数据集稀缺。最近,EEG基础模型(FMs)在多个下游EEG应用中显示出前景,但它们在BS检测中的实用性尚未被探索。我们提出了第一项研究,评估EEG FMs在减少导联的ICU EEG中无需患者校准的爆发检测性能。我们将REVE-base、LUNA-large和LuMamba-Tiny与自适应阈值基线以及任务特定的EEGNet基线进行比较。此外,我们补充了基于事件的爆发检测评估,以替代传统的EEG窗口分类。这有助于临床评估爆发事件是否被正确检测,减少预期标注变异性的影响。最佳模型REVE-base取得了最高的事件级F1分数($0.868 \pm 0.167$),并且与EEGNet和自适应阈值相比,分别将每分钟爆发错误减少了52.1%和36.2%,支持了FMs在ICU中可扩展的EEG监测。消融实验表明,与冻结骨干训练、两步微调和基于LoRA的适应相比,全微调是最有效的适应策略,对于LUNA-large,事件级F1分数比冻结骨干训练提高了最多$+0.102$。在减少标注数据集的情况下,预训练的REVE-base在25%的队列中比随机初始化高出$+0.723$事件级F1点,证明了在有限标注数据下适应爆发检测时预训练FM表示的优势。

英文摘要

Burst suppression (BS) is a clinically relevant electroencephalographic (EEG) pattern used to monitor sedation depth and brain activity in critically ill patients, particularly during induced coma in Intensive Care Units (ICUs). Automatic burst detection remains challenging because BS patterns vary substantially between patients and annotated datasets are scarce. Recently, EEG Foundation Models (FMs) have shown promise across several downstream EEG applications, but their usefulness for BS detection remains unexplored. We present the first study to evaluate EEG FMs for burst detection in reduced-montage ICU EEG without patient-specific calibration. We compare REVE-base, LUNA-large, and LuMamba-Tiny with an adaptive thresholding baseline and a task-specific EEGNet baseline. Additionally, we complement conventional EEG window-based classification with event-based burst detection evaluation. This helps assess clinically whether burst episodes are correctly detected, reducing the impact of expected annotation variability. The best model, REVE-base, achieved the highest event-based F1-score ($0.868 \pm 0.167$) and reduced burst-per-minute error by 52.1% and 36.2% compared to EEGNet and adaptive thresholding, respectively, supporting FMs for scalable EEG monitoring in the ICU. Ablation experiments showed that full fine-tuning was the most effective adaptation strategy compared with frozen-backbone training, two-step fine-tuning, and LoRA-based adaptation, improving event-based F1-score over frozen-backbone training by up to $+0.102$ for LUNA-large. With reduced labeled datasets, pretrained REVE-base outperformed random initialization by $+0.723$ event-based F1 points at 25% of the cohort, demonstrating the benefit of pretrained FM representations when adapted to burst detection with limited labeled data.
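Event-based evaluation, as opposed to window-based classification, can be sketched as follows. The matching rule is an assumption, since the abstract does not specify one: here a predicted burst counts as a true positive if it overlaps an as yet unmatched reference burst, with greedy one-to-one matching. The interval values are invented, in seconds.

```python
def overlaps(a, b):
    """True if the half-open intervals a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def event_f1(reference, predicted):
    """Event-level F1 with greedy one-to-one matching by temporal overlap."""
    matched_pred = set()
    tp = 0
    for ref in reference:
        for i, pred in enumerate(predicted):
            if i not in matched_pred and overlaps(ref, pred):
                matched_pred.add(i)
                tp += 1
                break
    if tp == 0:
        return 0.0
    fp = len(predicted) - tp   # predicted bursts with no reference match
    fn = len(reference) - tp   # reference bursts that were missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented annotations: two bursts are found, one is missed, one is spurious.
reference = [(0.0, 1.5), (4.0, 6.0), (9.0, 10.0)]
predicted = [(0.2, 1.4), (4.5, 5.0), (7.0, 7.5)]
score = event_f1(reference, predicted)
```

Note that the second predicted burst covers only part of its reference burst yet still counts as detected, which is how an event-level metric reduces sensitivity to disagreements about exact burst boundaries.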

2606.20041 2026-06-19 econ.GN cs.AI cs.LG q-fin.EC q-fin.GN 新提交

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

AI经济学家代理:一种基于模型的经济分析代理框架,结合RAG、知识图谱和大语言模型

Masahiro Kato

发表机构 * Mizuho-DL Financial Technology, Co., Ltd.(Mizuho-DL金融科技有限公司)

AI总结 提出一种基于RAG的AI经济学家代理框架,利用知识图谱和大语言模型进行经济情景分析,通过代理规划、检索证据、选择模型并生成报告,提高经济叙事的连贯性和可追溯性。

AI中文摘要

我们提出了一种以模型为依据、基于RAG的AI经济学家,它采用代理框架,使用大语言模型(LLMs)和知识图谱进行经济情景分析。虽然LLMs可以生成流畅的经济叙事,但经济学家通常需要做出以经济理论和现实数据为依据的经济主张。基于这一动机,本研究提出了一种基于RAG的AI经济学家,它利用包含经济数据和理论的知识图谱以及基于LLM的代理来规划分析、检索相关证据、选择合适的模型并生成报告。在我们的框架中,我们不直接仅使用语言模型产生定量主张;相反,我们生成以显式模型计算为依据的叙事,并通过AI代理将其与检索到的证据相关联。我们将我们的框架称为AI经济学家代理。我们在两个应用中评估了AI经济学家代理:为美国通胀持续性和美联储政策生成经济学家报告,以及为美国商业房地产再融资压力生成银行压力测试叙事。结果说明了为生成的报告提供依据如何提高其经济连贯性和可追溯性。

英文摘要

We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded in economic theory and real-world data. Based on this motivation, this study proposes a RAG-based AI economist, which utilizes knowledge graphs containing economic data and theory, together with LLM-based agents, to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

2606.19951 2026-06-19 eess.AS cs.CL cs.LG cs.SD 新提交

Investigating Human-Model Discrepancies in Speech Quality Assessment via Acoustic and Prosodic Perturbations

通过声学和韵律扰动研究语音质量评估中的人机差异

Masato Takagi, Masaya Kawamura, Reo Shimizu, Yuma Shirahata

发表机构 * Nagoya Institute of Technology, Japan(名古屋工业大学,日本) LY Corporation, Japan(LY公司,日本)

AI总结 通过声学退化、韵律错误和说话人特征扰动,发现MOS预测模型对声学退化敏感,但对韵律错误不敏感,且对基频有偏见,而对语速和基频变化不敏感。

Comments Accepted to INTERSPEECH 2026

AI中文摘要

平均意见得分(MOS)预测模型在文本到语音(TTS)研究中被广泛用作代理指标,但它们捕捉超出声学保真度的质量差异的能力仍不清楚。我们通过控制性扰动来研究这一点:声学退化、韵律错误以及说话人特定特征(如音高和语速)的操纵。我们从人类听众和模型那里获得了这些语音样本的MOS预测,并分析了它们感知特征的差异。结果表明,大多数模型能很好地跟踪声学退化,而所有模型对韵律错误不敏感,尽管主观评分大幅下降。对于说话人特征,模型表现出双重分离:在人类评分中不存在的强平均基频(F0)偏见,但对人类注意到的语速和F0变化不敏感。这些发现突出了标量MOS预测在声学保真度之外的局限性。

英文摘要

Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.

2606.19823 2026-06-19 eess.AS cs.LG 新提交

Low-Burden Data Augmentation for Dysarthric ASR via Zero-Shot Voice Cloning

低负担数据增强:通过零样本语音克隆改善构音障碍语音识别

Satwinder Singh, Qianli Wang, Zihan Zhong, Clarion Mendes, Hasegawa-Johnson, Waleed Abdulla, Seyed Reza Shahamiri

发表机构 * DeepNet Discovery Network, University of Auckland, New Zealand(奥克兰大学深网发现网络, 新西兰) University of Illinois Urbana-Champaign, USA(伊利诺伊大学厄巴纳-香槟分校, 美国)

AI总结 针对构音障碍语音数据稀缺和变异性大的问题,提出使用零样本语音克隆(Higgs Audio V2)生成合成数据,微调Whisper-medium模型,在TORGO数据集上达到与真实数据微调相近的词错误率,并显著降低数据收集成本。

Comments Accepted to Interspeech 2026, Sydney, Australia

AI中文摘要

由于数据稀缺和说话人之间高度变异,自动语音识别对于构音障碍语音仍然不可靠。虽然合成数据可以弥补这些不足,但传统方法通常需要大量的说话人特定数据,重新引入了数据收集瓶颈。我们研究零样本语音克隆作为一种低负担的增强策略,使用Higgs Audio V2克隆TORGO数据集中的说话人。我们在克隆数据、真实数据和混合数据上微调Whisper-medium,并在保留的真实语音上进行评估。与零样本基线(31.62%)相比,克隆数据微调实现了具有竞争力的26.00%词错误率,几乎与真实数据微调(24.44%)和混合数据微调(25.12%)相当。值得注意的是,对于中重度构音障碍说话人,克隆和混合微调优于真实数据微调。在SAP-1102上的跨语料库评估中,克隆微调取得了最佳结果(相对提升11.45%)。这些结果表明,零样本克隆提供了可扩展的训练数据,绕过了昂贵的数据收集瓶颈。

英文摘要

Automatic speech recognition remains unreliable for dysarthric speech due to data scarcity and high inter-speaker variability. While synthetic data can address these gaps, traditional methods often require extensive speaker-specific data, reintroducing the collection bottleneck. We investigate zero-shot voice cloning as a low-burden augmentation strategy, using Higgs Audio V2 to clone speakers in the TORGO dataset. We fine-tune (FT) Whisper-medium on cloned, real, and hybrid data and evaluate on held-out real speech. Compared to the zero-shot baseline (31.62% WER), Clone FT achieved a competitive 26.00% WER, nearly matching the 24.44% and 25.12% seen with Real and Hybrid FT, respectively. Notably, Clone and Hybrid FT outperform Real FT for moderate-severe speakers. Clone FT achieves the best results (11.45% relative improvement) in cross-corpus evaluation on SAP-1102. These results suggest that zero-shot cloning provides scalable training data that circumvents the costly data collection bottleneck.
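The word error rate (WER) figures quoted above follow the usual definition: word-level edit distance divided by the number of reference words. A minimal sketch is given below, together with the relative reduction implied by the two TORGO numbers in the abstract. The example sentences and function names are invented; the relative figure is plain arithmetic on the reported 31.62% and 26.00%, not a number stated by the authors.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def relative_reduction(baseline, new):
    """Relative improvement of `new` over `baseline` (both error rates)."""
    return (baseline - new) / baseline

# One substitution and one insertion against a four-word reference.
wer = word_error_rate("the quick brown fox", "the quick brown dog jumps")
# Zero-shot baseline versus Clone FT, using the WERs reported in the abstract.
clone_gain = relative_reduction(31.62, 26.00)
```

The toy example yields a WER of 0.5, and the reported TORGO numbers correspond to a relative WER reduction of roughly 17.8% for Clone FT over the zero-shot baseline.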

2606.19797 2026-06-19 eess.AS cs.AI cs.SD eess.SP 新提交

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

通过域内数据增强改进构音障碍语音的端到端语音识别

Paban Sapkota, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan

发表机构 * Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, India(电子与通信工程系,印度锡金国立技术学院) Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, USA(信号分析与解释实验室(SAIL),美国洛杉矶南加州大学)

AI总结 针对构音障碍语音识别中数据稀缺和严重程度差异的问题,本文探索了四种数据增强方法(SRM、PM、FM、VTLP)对预训练Wav2Vec2模型进行微调,在不同严重程度上实现了显著的词错误率降低。

AI中文摘要

构音障碍语音识别对于促进构音障碍患者之间的有效沟通至关重要。然而,由于严重程度不同和数据可用性有限,准确识别构音障碍语音面临重大挑战。在本文中,我们通过微调端到端预训练Wav2Vec2模型,探索了针对构音障碍自动语音识别(ASR)系统的数据增强技术,特别关注严重程度级别。为了解决数据稀缺以及微调预训练ASR系统用于构音障碍语音时需要大量数据的问题,我们研究了四种主要的数据增强方法:语速修改(SRM)、音高修改(PM)、共振峰修改(FM)和声道长度扰动(VTLP),这些方法针对构音障碍的不同方面进行了调整。本研究使用为每个严重程度类别单独微调的Wav2Vec2模型作为基线系统。此外,我们使用增强数据对ASR模型进行了特定严重程度的微调。结果表明,每种增强技术在不同严重程度级别上表现出不同的有效性模式。对于低(9.02%)和中(38.11%)严重程度,使用SRM($s = 0.8$)获得了最佳WER;对于高严重程度(55.15%),使用PM($\tau = 0.8$)获得了最佳WER,分别相对改进了30.02%、16.64%和15.47%。这些结果证实了增强方法在提高构音障碍ASR性能方面的有效性。

英文摘要

Dysarthric speech recognition is crucial for facilitating effective communication among individuals with dysarthria. However, accurately recognizing dysarthric speech poses significant challenges due to varying severity levels and limited data availability. In this paper, we explore data augmentation techniques for dysarthric automatic speech recognition (ASR) systems by fine-tuning the end-to-end pre-trained Wav2Vec2 model, with a specific focus on severity levels. To address the challenges of data scarcity and the need for extensive data in fine-tuning pre-trained ASR systems for dysarthric speech, we investigate four prominent data augmentation methods: Speaking-Rate Modification (SRM), Pitch Modification (PM), Formant Modification (FM), and Vocal Tract Length Perturbation (VTLP), tailored to different aspects of dysarthria. The study uses individually fine-tuned Wav2Vec2 models for each severity class as baseline systems. Additionally, we conducted severity-specific fine-tuning of the ASR model using augmented data. Results demonstrate distinct efficacy patterns for each augmentation technique across severity levels. The best WERs were achieved with SRM ($s = 0.8$) for low (9.02%) and medium (38.11%) severities, and with PM ($\tau = 0.8$) for high severity (55.15%), reflecting relative improvements of 30.02%, 16.64%, and 15.47%, respectively. These results confirm the effectiveness of the augmentation methods in improving dysarthric ASR performance.
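The reported relative improvements can be related back to baseline WERs with a short worked derivation. This assumes the usual definition of relative improvement against the per-severity baseline; the baseline values below are derived from the abstract's figures, not reported by the authors, and are rounded to two decimals.

```latex
% Relative improvement r of the augmented system over its baseline:
\[
  r = \frac{\mathrm{WER}_{\text{base}} - \mathrm{WER}_{\text{aug}}}{\mathrm{WER}_{\text{base}}}
  \quad\Longrightarrow\quad
  \mathrm{WER}_{\text{base}} = \frac{\mathrm{WER}_{\text{aug}}}{1 - r}
\]
% Implied baselines under this assumption:
\[
  \text{low: } \frac{9.02}{1 - 0.3002} \approx 12.89\%, \qquad
  \text{medium: } \frac{38.11}{1 - 0.1664} \approx 45.72\%, \qquad
  \text{high: } \frac{55.15}{1 - 0.1547} \approx 65.24\%
\]
```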

2606.19793 2026-06-19 eess.AS cs.AI cs.LG cs.SD eess.SP 新提交

Systematic Study of Dysarthric Speech Recognition: Spectral Features and Acoustic Models

构音障碍语音识别的系统研究:频谱特征与声学模型

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

发表机构 * Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, India(电子与通信工程系,印度锡金国立技术学院) Department of Information and Communications Engineering, Aalto University, Finland(信息与通信工程系,阿尔托大学,芬兰) Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, USA(信号分析与解释实验室(SAIL),美国洛杉矶南加州大学)

AI总结 本文系统研究不同频谱特征与声学模型的组合,通过引入音高特征和优化训练帧重叠数,在F-TDNN模型上实现孤立词和句子识别相对提升4.65%和4.63%。

AI中文摘要

识别构音障碍语音的挑战主要源于发音精度受损导致的显著声学变异性。过去的研究表明,通过使用混合DNN/HMM序列区分性训练可以改善识别性能。本文对不同声学模型定制的各种声学特征组合进行了全面研究,为每种模型提供了合适的特征选择。音高特征的引入显著提高了识别性能,特别是对于涉及构音障碍语音的句子识别任务。通过对TORGO数据库的系统检查,我们证明了增强最先进的因子化时延神经网络(F-TDNN)模型识别构音障碍语音性能的潜力。使用F-TDNN模型实现的方法,与先前研究相比,在构音障碍语音的孤立词识别中获得了4.65%的相对改进,在句子识别中获得了4.63%的相对改进。这种改进有效补偿了语音变异性,这归因于我们精心选择了连续训练样本块之间的重叠帧数。

英文摘要

The challenge associated with recognizing dysarthric speech primarily arises from pronounced acoustic variability attributed to impaired articulatory precision. Past research has demonstrated improved recognition through the use of hybrid DNN/HMM sequence discriminative training. This paper presents a comprehensive investigation of various combinations of acoustic features tailored to different Acoustic Models, offering suitable feature selections for each. The incorporation of Pitch features notably improved recognition performance, especially for sentence recognition tasks involving dysarthric speech. Through a systematic examination of the TORGO database, we have demonstrated the potential to enhance the performance of the state-of-the-art Factorized Time Delay Neural Network (F-TDNN) model for recognizing dysarthric speech. Our methods, implemented with the F-TDNN model, resulted in a 4.65\% relative improvement in isolated word recognition and a 4.63\% relative improvement in sentence recognition for dysarthric speech, compared to previous research. This improvement effectively compensates for speech variability, attributable to our deliberate selection of the number of overlapping frames between consecutive training example chunks.
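The abstract attributes part of the gain to the number of overlapping frames between consecutive training example chunks. A minimal sketch of what such chunking looks like is given below; the function name, the chunk length, and the overlap value are illustrative assumptions, not the settings used in the paper.

```python
def make_chunks(frames, chunk_len, overlap):
    """Split a list of frames into fixed-length chunks that share `overlap` frames."""
    if not 0 <= overlap < chunk_len:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_len")
    step = chunk_len - overlap  # hop between consecutive chunk starts
    last_start = len(frames) - chunk_len
    return [frames[s:s + chunk_len] for s in range(0, last_start + 1, step)]


frames = list(range(10))  # stand-in for 10 feature frames of one utterance
chunks = make_chunks(frames, chunk_len=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

A larger overlap yields more training chunks from the same utterance, at the cost of more redundancy between neighbouring chunks.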

2606.19791 2026-06-19 eess.AS cs.AI cs.SD 新提交

Cross-Dataset, Age, and Gender Generalization: A Comprehensive Analysis of Fine-Tuning Strategies for Low-Resource Children's ASR

跨数据集、年龄和性别泛化:低资源儿童语音识别的微调策略综合分析

Paban Sapkota, Hemant Kumar Kathania, Mikko Kurimo, Sudarsana Reddy Kadiri, Shrikanth Narayanan

发表机构 * Department of Electronics and Communication Engineering, National Institute of Technology Sikkim, India(印度锡金国立理工学院电子与通信工程系) Signal Analysis and Interpretation Laboratory, University of Southern California, Los Angeles, USA(美国洛杉矶南加州大学信号分析与解释实验室)

AI总结 针对低资源儿童语音识别,系统分析了不同微调策略在跨数据集、年龄和性别泛化上的表现,发现特定策略能显著提升泛化能力。

2606.19714 2026-06-19 stat.ML cs.AI cs.LG stat.CO stat.ME 新提交

AURA: Adaptive Uncertainty-aware Refinement for LLM-as-a-Judge Auditing

AURA: 用于LLM作为评判审计的自适应不确定性感知精炼

Zilong Zhang, Yi-Ting Hung, Weiyi He, Junxi Zhang, Lei Ding, Chi-Kuang Yeh

发表机构 * Department of Mathematics and Statistics(数学与统计学系) Georgia State University(佐治亚州立大学) Department of Probability and Statistics(概率与统计学系) Department of Computer Science and Engineering(计算机科学与工程系) Michigan State University(密歇根州立大学) Concordia University(康考迪亚大学) Department of Statistics(统计学系) University of Manitoba(曼尼托巴大学)

AI总结 提出AURA框架,通过自适应不确定性感知精炼,在少量人工验证下迭代学习人类一致性信号,优先审核不确定比较,提升LLM评判的可靠性。

AI中文摘要

大型语言模型(LLM)越来越多地被用作开放式生成的评判者,因为大规模人工评估通常昂贵且难以扩展,但它们的偏好仍然是人类判断的不完美代理。现有的审计流程通常假设事先存在可靠的示例子集或干净的监督信号,例如来自人工注释、启发式过滤或强评判者的输出。在LLM评估中,这一假设是脆弱的:初始分割可能继承评判者偏差,而人工验证通常过于稀缺,无法在规模上定义稳定组。我们提出AURA,一种自适应不确定性感知精炼框架,用于在选定的人工验证下审计成对LLM作为评判的决策。AURA迭代学习人类一致性信号,传播可靠证据,并优先将不确定的比较提交人工审核。关键思想是将对评判者的信任视为一个潜在量,随着证据积累逐步精炼。我们提供了紧凑的公式、稳定的精炼过程,以及在合成和真实成对LLM答案数据上的全面评估。

英文摘要

Large language models (LLMs) are increasingly used as judges for open-ended generation, as large-scale human evaluation is often expensive and difficult to scale, yet their preferences remain imperfect proxies for human judgment. Existing auditing pipelines often assume that a reliable subset of examples or clean supervision signals are available beforehand, for example from human annotation, heuristic filtering, or the outputs of strong judges. In LLM evaluation, this assumption is fragile: the initial split may inherit judge bias, while human verification is typically too scarce to define stable groups at scale. We propose AURA, an adaptive uncertainty-aware refinement framework for auditing pairwise LLM-as-a-judge decisions under selected human verification. AURA iteratively learns a human-consistency signal, propagates reliable evidence, and prioritizes uncertain comparisons for human review. The key idea is to treat trust in a judge as a latent quantity that is progressively refined as evidence accumulates. We provide a compact formulation, a stable refinement procedure, and a comprehensive evaluation on both synthetic and real pairwise LLM-answer data.
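The two ideas named in the abstract, prioritizing uncertain comparisons and refining trust as evidence accumulates, can be illustrated with a generic sketch. This is not AURA's formulation: it assumes a hypothetical per-comparison estimate of the probability that the judge agrees with a human, ranks comparisons by closeness to 0.5, and tracks trust with a simple Beta-Bernoulli update.

```python
def select_for_review(agreement_probs, budget):
    """Return indices of the `budget` comparisons whose estimated
    judge-human agreement probability is closest to 0.5 (most uncertain)."""
    ranked = sorted(range(len(agreement_probs)),
                    key=lambda i: abs(agreement_probs[i] - 0.5))
    return ranked[:budget]


def update_trust(alpha, beta, human_agrees):
    """Beta-Bernoulli update of trust in the judge after one human check."""
    return (alpha + 1, beta) if human_agrees else (alpha, beta + 1)


def trust_mean(alpha, beta):
    """Posterior mean of the latent trust parameter."""
    return alpha / (alpha + beta)


# Hypothetical agreement estimates for four pairwise comparisons
probs = [0.95, 0.52, 0.10, 0.45]
to_review = select_for_review(probs, budget=2)

alpha, beta = 1, 1  # uniform prior on trust (an assumption)
alpha, beta = update_trust(alpha, beta, human_agrees=True)
print(to_review, trust_mean(alpha, beta))
```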

2606.19643 2026-06-19 stat.ML cs.LG 新提交

Variational Consensus Monte Carlo for Bayesian Mixture

变分共识蒙特卡洛用于贝叶斯混合模型

Julie Fendler, Francesca L. Crowe, Tom Marshall, Sylvia Richardson, Paul D. W. Kirk

发表机构 * MRC Biostatistics Unit, University of Cambridge(剑桥大学生物统计学单位) Institute of Applied Health Research, University of Birmingham(伯明翰大学应用健康研究学院)

AI总结 提出变分共识蒙特卡洛方法扩展至过拟合贝叶斯混合模型,通过新颖的聚类匹配算法和聚合策略,在联邦学习设置下推断聚类数和所有参数,并在模拟和真实电子健康记录数据上验证了有效性。

AI中文摘要

受健康数据的隐私、敏感性和共享限制的驱动,我们提出了一个在联邦学习设置下(即数据无法在计算节点之间完全共享或汇集)对贝叶斯混合模型进行推断的全面流程。我们采用共识蒙特卡洛(CMC)方法,在每个数据孤岛内独立运行MCMC算法以估计局部后验分布,然后聚合这些分布以近似完整数据的后验。Rabinovich, Angelino 和 Jordan (2015) [1] 的变分CMC方法将聚合步骤视为变分推断问题,但他们应用于混合模型时假设聚类数和关键混合参数已知。我们的主要方法贡献是:(i) 将变分CMC扩展到过拟合贝叶斯混合模型,该模型推断聚类数和所有模型参数,无需共轭性;(ii) 适用于跨孤岛设置的新颖聚类匹配算法,其中并非每个聚类都出现在每个局部数据集中;(iii) 针对聚合步骤的多种推断策略,匹配不同的联邦学习约束;以及 (iv) 在实践中选择这些策略的指南。一项全面的模拟研究验证了该框架,并允许我们与最先进的联邦学习替代方法进行比较。值得注意的是,我们表明当局部数据集的组成反映了数据中的底层聚类结构时,我们的方法可以比应用于汇集数据的标准MCMC更准确地恢复小聚类。我们在大规模电子健康记录数据上展示了该框架,识别了英国老年人群中的多发病模式。

英文摘要

Motivated by the privacy, sensitivity and sharing limitations of health data, we present a comprehensive pipeline for inference of Bayesian mixture models within a federated learning setting, i.e. when data cannot be fully shared or pooled across compute nodes. We adopt a Consensus Monte Carlo (CMC) approach, in which an MCMC algorithm is run independently within each data silo to estimate local posterior distributions, which are then aggregated to approximate the posterior over the full data. The variational CMC approach of Rabinovich, Angelino and Jordan (2015) [1] frames the aggregation step as a variational inference problem, but their application to mixtures assumes the number of clusters and key mixture parameters to be known. Our main methodological contributions are: (i) an extension of variational CMC to over-fitted Bayesian mixture models that infer the number of clusters and all model parameters, without requiring conjugacy; (ii) novel cluster-matching algorithms suitable for cross-silo settings in which not every cluster appears in each local dataset; (iii) a number of inference strategies for the aggregation step, matched to different federated learning constraints; and (iv) guidelines for choosing among these in practice. A comprehensive simulation study validates the framework and allows us to compare to state-of-the-art federated learning alternatives. Notably, we show that when the composition of local datasets reflects the underlying clustering structure in the data, our approach can recover small clusters with greater accuracy than standard MCMC applied to the pooled data. We illustrate the framework on large-scale electronic health record data, identifying multi-morbidity patterns in a British geriatric population.
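For orientation, the aggregation step in plain Consensus Monte Carlo can be sketched as a precision-weighted average of aligned draws from each silo. The sketch below shows that fixed averaging rule for a single scalar parameter; it is the baseline that variational CMC generalizes by learning the aggregation, and it does not include the cluster matching or over-fitted mixture machinery described in the abstract. The function name and the toy draws are assumptions.

```python
from statistics import variance


def consensus_draws(shard_draws):
    """Combine per-shard MCMC draws of one scalar parameter.

    Each shard is weighted by the inverse of its sample variance, and the
    g-th consensus draw is the weighted average of the g-th draw per shard.
    """
    weights = [1.0 / variance(draws) for draws in shard_draws]
    total = sum(weights)
    n_draws = min(len(draws) for draws in shard_draws)
    return [
        sum(w * draws[g] for w, draws in zip(weights, shard_draws)) / total
        for g in range(n_draws)
    ]


# Toy draws from two silos (hypothetical values)
silo_a = [0.0, 2.0, 4.0]   # sample variance 4, weight 0.25
silo_b = [1.0, 2.0, 3.0]   # sample variance 1, weight 1.0
print(consensus_draws([silo_a, silo_b]))
```

The weighted average is exact only when the local posteriors are Gaussian, which is one reason mixtures need the more careful treatment the abstract describes.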

2606.19587 2026-06-19 stat.ML cs.LG 新提交

A Solver-Free Training Method for Predict-then-Optimize

一种无求解器的预测后优化训练方法

Beichen Wan, Mo Liu

发表机构 * Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, NC, USA(统计与运筹学系,北卡罗来纳大学教堂山分校)

AI总结 提出一种基于测度变换的决策聚焦学习管道,通过无求解器代理损失实现预测后优化中预测模型的高效训练,理论保证Fisher一致性,训练时间降低数个数量级。

Comments Accepted by ICML 2026

AI中文摘要

我们提出了一种可扩展的方法,用于在预测后优化范式中训练预测(机器学习)模型,其中模型输出作为后续线性优化任务的系数。直接最小化经验决策遗憾对于线性规划和组合优化是不可行的,因为决策映射是分段常数,且梯度几乎处处为零。虽然现有方法通过平滑微分过程来解决这一问题,但它们存在可扩展性问题,因为每次梯度评估都需要调用计算昂贵的求解器。为了解决这个问题,我们提出了一种基于测度变换原理的决策聚焦学习管道,该管道在训练期间产生一个完全无优化求解器的新代理损失。我们建立了理论保证,包括Fisher一致性和超额风险界。实验上,我们的方法在实现与最先进方法相当的决策质量的同时,将训练时间减少了数个数量级。

英文摘要

We propose a scalable method for training prediction (machine learning) models in the predict-then-optimize paradigm, where model outputs serve as coefficients for a subsequent linear optimization task. Directly minimizing the empirical decision regret is intractable for linear programming and combinatorial optimization since the decision mapping is piecewise constant, and the gradients are zero almost everywhere. While existing methods address this by smoothing the differentiation process, they suffer from scalability issues, since a computationally expensive solver call is required for every gradient evaluation. To address this, we propose a decision-focused learning pipeline based on a measure transformation principle, which yields a new surrogate loss that is completely optimization-solver-free during training. We establish theoretical guarantees, including Fisher consistency and excess risk bounds. Empirically, our method achieves decision quality competitive with state-of-the-art methods while reducing training time by orders of magnitude.
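The claim that decision regret is piecewise constant in the predicted coefficients can be checked on a toy problem. The sketch below assumes a tiny feasible set (pick exactly one of three items) and solves it by enumeration; the cost vectors are made up, and the code illustrates the difficulty rather than the solver-free surrogate loss proposed in the paper.

```python
# Feasible decisions: choose exactly one of three items (vertices of a simplex)
VERTICES = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(cost):
    """Return the feasible decision with minimal linear cost."""
    return min(VERTICES, key=lambda w: dot(cost, w))


def regret(predicted_cost, true_cost):
    """Extra true cost incurred by optimizing against the prediction."""
    return dot(true_cost, solve(predicted_cost)) - dot(true_cost, solve(true_cost))


true_cost = (3.0, 1.0, 2.0)
pred = (2.0, 2.5, 1.0)
nudged = (2.0, 2.4, 1.1)   # small change in the prediction

# Both predictions pick the same decision, so the regret does not move
print(regret(pred, true_cost), regret(nudged, true_cost))
```

Because a small change in the prediction leaves the chosen vertex unchanged, the regret has zero gradient almost everywhere, which is the obstacle the abstract refers to.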

2606.19410 2026-06-19 stat.ML cs.LG 新提交

The Representational Limit of Scalar Interactions: An Interventional Decomposition

标量交互的表征限制:一种干预分解

Potito Aghilar, Sabino Roccotelli, Stanislao Fidanza, Vito Walter Anelli, Sebastiano Stramaglia, Tommaso Di Noia

发表机构 * Polytechnic University of Bari(巴里理工大学) University of Bari Aldo Moro(巴里阿尔多·莫罗大学)

AI总结 本文证明标量交互指标混淆了唯一性、冗余性和协同性,并提出Stochastic Hi-Fi方法,通过干预掩码推理分解每个特征的U/R/S轮廓,在表格和图像任务中恢复被标量基线遗漏的结构。

AI中文摘要

有符号的成对交互指标从根本上混淆了唯一性(U)、冗余性(R)和协同性(S)。我们在一个最小的3路XOR结构因果模型上证明了这一点:忠实的指标如Shapley-Taylor对每对返回零,而投影指标如Shapley Interaction将三阶效应扩散到混淆三种机制的成对标量中。我们引入了Stochastic Hi-Fi,一种事后、无需重新训练的可预测性分解方法,通过干预掩码推理估计每个特征的U/R/S轮廓。该估计器提供精确的干预语义、有限样本蒙特卡洛界限、耦合菱形采样带来的严格方差减少以及均匀的有限词汇收敛。在表格SCM上,Stochastic Hi-Fi恢复了被标量基线遗漏的结构(交互幅度恢复比高达411倍)。它还在GPT-2 IOI电路中分离了冗余和协同头。在NIH ChestX-ray14上,Stochastic Hi-Fi在Pointing Game中匹配GradCAM,并在Deletion AUC上显著改进。

英文摘要

Signed pairwise interaction scores fundamentally conflate uniqueness (U), redundancy (R), and synergy (S). We prove this on a minimal 3-way XOR structural causal model: faithful indices such as Shapley-Taylor return zero per pair, whereas projective indices such as Shapley Interaction spread the third-order effect into pair scalars that conflate the three mechanisms. We introduce Stochastic Hi-Fi, a post-hoc, retraining-free predictability decomposition that estimates per-feature U/R/S profiles by interventional masked inference. The estimator provides exact interventional semantics, finite-sample Monte Carlo bounds, strict variance reduction from coupled diamond sampling, and uniform finite-vocabulary convergence. Across tabular SCMs, Stochastic Hi-Fi recovers structure missed by scalar baselines (up to 411x larger interaction-magnitude recovery ratios). It also separates redundant and synergistic heads in the GPT-2 IOI circuit. On NIH ChestX-ray14, Stochastic Hi-Fi matches GradCAM on Pointing Game and improves substantially on Deletion AUC.
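The 3-way XOR example can be made concrete in information-theoretic terms. The sketch below enumerates the eight equally likely inputs and computes the mutual information between a subset of features and the output: any single feature or pair carries zero bits, while all three together carry one bit. This illustrates pure synergy only; it is not the Shapley-Taylor index or the Stochastic Hi-Fi estimator, and the function name is an assumption.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(subset):
    """I(X_subset; Y) in bits for Y = X0 xor X1 xor X2 with uniform binary inputs."""
    rows = [(x, x[0] ^ x[1] ^ x[2]) for x in product((0, 1), repeat=3)]
    n = len(rows)
    joint = Counter((tuple(x[i] for i in subset), y) for x, y in rows)
    p_x = Counter(tuple(x[i] for i in subset) for x, _ in rows)
    p_y = Counter(y for _, y in rows)
    return sum(
        (c / n) * log2((c / n) / ((p_x[s] / n) * (p_y[y] / n)))
        for (s, y), c in joint.items()
    )


for subset in [(0,), (0, 1), (1, 2), (0, 1, 2)]:
    print(subset, mutual_information(subset))
```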

2606.19372 2026-06-19 eess.IV cs.CV cs.LG 新提交

Full-Self Diagnostics (FSD): Physics-Grounded Visual Biomarker Inference from Smartphone Video via Inverse Problems and Operator Learning

全自诊断(FSD): 通过逆问题和算子学习从智能手机视频进行基于物理的可视生物标志物推断

Jonathan Thomas, Harsh Thaker

发表机构 * Algomash® (Algorithmic Mashup Inc.)(算法混搭公司)

AI总结 提出全自诊断(FSD)框架,结合物理前向模型、信息论可观测性、正则化逆问题、算子学习和随机变分推断,从9秒面部视频恢复生理状态,在59名受试者的38812次配对扫描上验证;其中第一作者自采数据的血糖MARD为29.86%。

Comments 38,812 paired scans, preliminary longitudinal validation of multichannel visual glucose inference (MARD 17 to 46 percent across cohorts); physics plus information theory plus operator learning framework

AI中文摘要

我们提出全自诊断(FSD),一个统一的数学框架,用于从消费级智能手机拍摄的无约束9秒面部视频中恢复潜在生理状态。该方法整合了五个相互增强的组件:(1)基于辐射传输方程和发色团吸收的物理前向模型,将相机观测映射到生物标志物浓度;(2)信息论可观测性理论,证明多通道视觉信号(光谱、脉搏、呼吸、微表情和眼动)与生理状态包含严格递增的互信息;(3)具有域均匀可辨识性保证的稳定Tikhonov正则化逆问题;(4)算子学习公式,实现跨设备、分辨率和人群的泛化;(5)可解释为随机变分推断的监督学习过程,从配对生物传感器真实值持续优化模型,性能随配对观测数量的平方根倒数比例提升。在59名受试者的38812次真实世界配对扫描上的实证验证展示了实际性能。第一作者自采数据(血糖范围35-550 mg/dL)的MARD为29.86%,97.57%的预测落在Clarke误差网格A+B区,仅0.27%在危险E区。一位管理良好的糖尿病参与者在较窄的70-180 mg/dL范围内达到MARD 17%。这些结果证实,消费级面部视频编码了足够的结构化信息,可在完全无约束条件下进行临床相关的非侵入性生物标志物推断,且性能随更多配对数据的可用性可预测地提升。

英文摘要

We present Full-Self Diagnostics (FSD), a unified mathematical framework for recovering latent physiological states from unconstrained 9-second facial videos captured by consumer smartphones. The approach integrates five mutually reinforcing components: (1) a physics-based forward model derived from the radiative transfer equation and chromophore absorption that maps camera observables to biomarker concentrations; (2) an information-theoretic observability theory proving that multi-channel visual signals (spectral, pulse, respiratory, micro-expression, and oculomotor) contain strictly increasing mutual information with physiological state; (3) a stable, Tikhonov-regularized inverse problem with domain-uniform identifiability guarantees; (4) an operator-learning formulation that enables generalization across devices, resolutions, and populations; and (5) a supervised learning procedure, interpretable as stochastic variational inference, that continuously refines the model from paired biosensor ground truth with performance improving proportionally to one over the square root of the number of paired observations. Empirical validation on 38812 real-world paired scans across 59 subjects demonstrates practical performance. Self-collected data from the lead author (glucose range 35-550 mg/dL) yields MARD of 29.86 percent with 97.57 percent of predictions in Clarke Error Grid Zones A+B and only 0.27 percent in the dangerous Zone E. A well-managed diabetic participant achieves MARD of 17 percent in the narrower 70-180 mg/dL band. These results confirm that consumer-grade facial video encodes sufficient structured information for clinically relevant, non-invasive biomarker inference under fully unconstrained conditions, with performance scaling predictably as more paired data becomes available.
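MARD (mean absolute relative difference) is the headline accuracy metric in this abstract. A minimal sketch of the standard definition follows; the glucose readings are made-up values for illustration and do not come from the paper.

```python
def mard(predicted, reference):
    """Mean absolute relative difference, in percent, against reference values."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) / r for p, r in zip(predicted, reference))
    return 100.0 * total / len(reference)


# Hypothetical paired readings in mg/dL (prediction vs. biosensor reference)
predicted = [110.0, 90.0, 150.0]
reference = [100.0, 100.0, 120.0]
print(f"MARD = {mard(predicted, reference):.1f}%")
```

Because each error is divided by its reference value, the same absolute error weighs more heavily at low glucose levels, which is one reason MARD depends on the glucose range studied.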

2606.20554 2026-06-19 cs.IR cs.AI 新提交

Structuring and Tokenizing Distributed User Interest Context for Generative Recommendation

结构化与分词化分布式用户兴趣上下文以支持生成式推荐

Ruizhong Qiu, Yinglong Xia, Dongqi Fu, Hanqing Zeng, Ren Chen, Xiangjun Fan, Hong Li, Hong Yan, Hanghang Tong

发表机构 * University of Illinois Urbana–Champaign(伊利诺伊大学厄巴纳-香槟分校) Meta MRS

AI总结 提出G2Rec框架,通过统一图建模与语义分词,实现工业级生成式推荐中用户兴趣上下文的全面准确建模。

AI中文摘要

生成式推荐是一种新兴范式,在工业推荐系统中展现出前景,旨在从用户历史行为中预测其下一次交互。生成式推荐的核心是物品分词,它连接了物品语义与推荐模型。然而,现有方法往往难以同时有效地组织和注入复杂的用户行为与物品语义上下文。一方面,现有的基于图的集成方法,如图序列化和图神经网络,要么存在可扩展性问题,要么仅利用局部图信息。另一方面,现有的语义分词方法通常依赖启发式规则且缺乏明确的监督信号,可能导致不准确或次优的语义表示。为解决用户兴趣上下文建模中的这些局限性,我们提出G2Rec,一个可扩展的框架,将基于图的整体用户共同参与建模与语义分词统一起来,用于工业级生成式推荐。总体而言,G2Rec使推荐模型能够捕捉整体且基于语义的用户兴趣原型,而无需真实用户兴趣,从而在工业序列推荐中提供更全面、更准确的用户行为上下文建模。跨产品表面的在线部署和在公开数据集上的大量实验证明了G2Rec相对于现有方法的优越性。

英文摘要

Generative recommendation is an emerging paradigm that has shown promise in industrial recommendation systems, aiming to predict users' next interactions from their historical behaviors. At the core of generative recommendation lies item tokenization, which bridges item semantics and recommendation models. However, existing methods often struggle to effectively organize and inject complex user-behavioral and item-semantic contexts into recommendation models simultaneously. On the one hand, existing graph-based integration methods, such as graph serialization and graph neural networks, either suffer from scalability issues or exploit only local graph information. On the other hand, existing semantic tokenization methods typically rely on heuristics and lack explicit supervision signals, which may lead to inaccurate or suboptimal semantic representations. To address these limitations in user interest context modeling, we propose G2Rec, a scalable framework that unifies holistic graph-based user co-engagement modeling with semantic tokenization for industrial-scale generative recommendation. Overall, G2Rec enables recommendation models to capture holistic and semantically grounded user interest prototypes without requiring ground-truth user interests, thereby providing more comprehensive and accurate modeling of user behavior contexts in industrial sequential recommendation. Online deployment across product surfaces and extensive experiments on public datasets demonstrate the superiority of G2Rec over existing methods.

2606.20512 2026-06-19 cs.SE cs.LG 新提交

Probe-and-Refine Tuning of Repository Guidance for Coding Agents

代码代理的仓库指导的探测与精炼调优

Asa Shepard, Jeannie Albrecht

发表机构 * Williams College(威廉姆斯学院)

AI总结 提出探测与精炼调优方法,通过合成bug修复探测迭代诊断和修补仓库指导文件,在SWE-bench Verified上以Qwen3.5-35B-A3B模型达到33.0%解决率,优于静态知识库的28.3%和无指导基线的25.5%。

AI中文摘要

基于LLM的代码代理需要关于仓库的更高级操作知识(哪些文件包含哪些子系统、如何运行测试套件、哪些工作流程历来会导致错误的修复),这些知识并不存在于代码本身。工程师通常维护\texttt{AGENTS.md}文件来提供这些上下文作为代码代理的指令,但它们是否有帮助存在争议:最近的研究对LLM生成的指导是否改善或损害代理性能存在分歧。在本文中,我们展示了指导的产生方式才是决定性变量,并引入了\emph{探测与精炼调优}:一种通过合成bug修复探测来迭代诊断和修补仓库指导文件的过程,使用单次LLM调用,在调优期间没有代理循环或工具使用。在SWE-bench Verified上,使用Qwen3.5-35B-A3B进行200步的四个独立试验中,探测与精炼实现了33.0%的平均解决率,而用于初始化的静态知识库为28.3%,无指导基线为25.5%(探测与精炼与这两者的对比均为p < 0.001)。改进来自覆盖率而非精确度:精炼后的指导使能够生成可评估补丁的实例多出14.5个百分点(pp),而每个补丁的精确度在统计上保持不变(约59%,p = 0.119),表明改进的指导帮助代理到达正确的文件,而不是提高它们所做更改的质量。此外,一个步骤预算实验表明,指导让代理能够更有效地利用更大的步骤预算,而一个跨模型实验(使用NVIDIA-Nemotron-3-Nano-30B-A3B)发现,当模型无法生成足够诊断性的输出时,调优循环会退化,尽管即使在这种情况下每个补丁的精确度仍然保持不变。

英文摘要

LLM-based coding agents need higher-level operational knowledge about a repository (which files house which subsystems, how to run the test suite, which workflows have historically led to wrong fixes) that does not exist in the code itself. Engineers typically maintain \texttt{AGENTS.md} files to supply this context as instructions for coding agents, but whether they help is contested: recent studies disagree on whether LLM-generated guidance improves or harms agent performance. In this paper we show that how the guidance is produced is the decisive variable, and introduce \emph{probe-and-refine tuning}: a procedure that uses synthetic bug-fix probes to iteratively diagnose and patch a repository's guidance file through single-shot LLM calls, with no agent loop or tool use during tuning. On SWE-bench Verified across four independent trials with Qwen3.5-35B-A3B at 200 steps, probe-and-refine achieves 33.0\,\% mean resolve rate vs.\ 28.3\,\% for the static knowledge base used to initialize it and 25.5\,\% for an unguided baseline ($p < 0.001$ for both probe-and-refine contrasts). The improvement comes from coverage rather than precision: refined guidance produces evaluable patches for 14.5 percentage points (pp) more instances while per-patch precision remains statistically constant ($\sim$59\,\%, $p = 0.119$), showing that improved guidance helps agents reach the correct file rather than improving the quality of the changes they make. Further, a step-budget experiment shows that guidance is what lets the agent use a larger step budget productively, and a cross-model experiment with NVIDIA-Nemotron-3-Nano-30B-A3B finds that the tuning loop degrades when the model cannot generate sufficiently diagnostic output, though per-patch precision remains constant even then.

2606.20510 2026-06-19 cs.CR cs.AI 新提交

Efficient and Sound Probabilistic Verification for AI Agents

高效且可靠的AI智能体概率验证

Alaia Solko-Breslin, Pramod Kaushik Mudrakarta, Mihai Christodorescu, Somesh Jha, Krishnamurthy Dj Dvijotham

发表机构 * Google DeepMind(谷歌DeepMind) Google(谷歌) University of Pennsylvania(宾夕法尼亚大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 提出基于分布鲁棒优化的框架,为AI智能体在复杂数字环境中的概率策略违规提供可靠上界,无需独立性假设,在终端和工具调用智能体基准上优于现有方法。

AI中文摘要

保护在复杂数字环境中运行的AI智能体已成为关键需求,而运行时监控方法通过制定并执行以Datalog等正式语言表达的策略提供了一种有前景的解决方案。然而,现有方法仅限于确定性策略。在AI智能体的许多实际应用中,需要在面对模糊性时强制执行安全策略,导致概率谓词或状态转换(例如,每次调用时具有一定失败概率的解密器或个人身份信息(PII)检测器)。此外,在许多此类应用中,无法轻易做出调用先前Datalog概率推理工作所需的独立性假设。我们通过引入一种基于分布鲁棒优化的可靠且高效的验证框架来解决这一问题,该框架计算策略违规概率的可靠上界,而不考虑谓词之间可能的相关性。在终端和工具调用智能体的标准基准上,我们证明了我们的方法优于现有技术,并在确保策略违规概率的严格上界的同时,改善了安全-效用权衡。

英文摘要

Securing AI agents that operate in complex digital environments has become a critical need, and runtime monitoring approaches that formulate and enforce policies expressed in a formal language like Datalog offer a promising solution. However, existing approaches are restricted to deterministic policies. In many practical applications of AI agents, there is a need to enforce security policies in the face of ambiguity, leading to probabilistic predicates or state transitions (for example, a declassifier or Personally Identifiable Information (PII) detector that has some failure probability on each invocation). Furthermore, in many such applications, one cannot easily make the independence assumptions necessary to invoke prior work on probabilistic inference in Datalog. We address this by introducing a sound and efficient framework for such verification based on distributionally robust optimization, computing sound upper bounds on the probability of policy violation regardless of possible correlations between predicates. On standard benchmarks for terminal and tool calling agents, we demonstrate that our approach outperforms prior art and improves the security-utility trade-off while ensuring rigorous bounds on the probability of policy violation.
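To see why dropping the independence assumption matters, consider the simplest correlation-agnostic bounds on two probabilistic predicates, the classical Fréchet inequalities. The sketch below is only an illustration of sound bounds that hold under any correlation; it is not the distributionally robust optimization framework of the paper, and the failure probabilities are hypothetical.

```python
def and_upper_bound(p_a, p_b):
    """Upper bound on P(A and B) that holds for any dependence between A and B."""
    return min(p_a, p_b)


def or_upper_bound(p_a, p_b):
    """Upper bound on P(A or B) that holds for any dependence (union bound)."""
    return min(1.0, p_a + p_b)


def and_if_independent(p_a, p_b):
    """P(A and B) under the independence assumption, for comparison."""
    return p_a * p_b


# Hypothetical: two PII detectors that each miss with probability 0.1.
# A policy violation requires both to miss on the same input.
miss_a, miss_b = 0.1, 0.1
print("independent estimate:", and_if_independent(miss_a, miss_b))
print("sound upper bound:   ", and_upper_bound(miss_a, miss_b))
```

If the two detectors fail on the same inputs, the independent estimate understates the violation probability by a factor of ten here, whereas the correlation-agnostic bound remains valid.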

2606.20502 2026-06-19 cs.CR cs.AI cs.SE 新提交

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

无理解的校准:诊断微调大语言模型在系统软件漏洞检测中的局限性

Arastoo Zibaeirad, Marco Vieira

发表机构 * University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校)

AI总结 提出CWE-Trace框架,通过834个Linux内核样本和两个诊断指标(DFI和HDD)评估LLM漏洞检测能力,发现数据污染无实质帮助,微调仅改变输出阈值而非决策策略,模型缺乏真正的安全推理能力。

AI中文摘要

大语言模型在漏洞基准测试中得分高,但究竟是真正推理安全还是仅对污染数据进行模式匹配,这一问题仍未解决。我们提出CWE-Trace,一个基于834个手动整理的Linux内核样本(涵盖74个CWE)构建的LLM漏洞检测框架。该框架强制执行严格的时间分割(2025年前的历史集/截止后的无泄漏集),保留上下文感知的易受攻击-修补对,并引入两个诊断指标:方向性失败指数(DFI)和层次距离与方向(HDD)。我们评估了8个原始LLM和15个LoRA微调变体,涵盖非目标检测、目标检测和CWE分类。分析得出两个关键结果。首先,数据污染未提供可衡量的优势。函数级分析显示,84%的名义污染样本不携带可用的记忆信号:易受攻击的函数缺失或跨数据集交叉映射,约31%的污染样本存在CWE误分类。其次,骨干方向性先验主导微调。模型表现出稳定、系统性的失败模式(DFI范围从-85.5到+94.8个百分点),这些模式从历史数据持续到截止后数据,且难以纠正。微调改变了输出阈值,但未改变决策策略。这是无理解的校准:输出分布适应训练数据,而底层安全推理仍然缺失。在二元检测中最弱的骨干(DeepSeek-R1)在粗粒度CWE分类中提升最大,表明检测和理解是解耦的能力。最佳检测得分仅达到52.1%(比随机高2.1个百分点);精确CWE排名Top-1准确率仍低于1.3%,证实当前LLM无论采用何种微调策略,都缺乏对系统软件的可靠安全推理能力。

英文摘要

Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable--patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). We evaluate eight vanilla LLMs and 15 LoRA fine-tuned variants across non-targeted detection, targeted detection, and CWE classification. Our analysis yields two key results. First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification. Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.

2606.20470 2026-06-19 cs.CR cs.AI 新提交

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

分析针对智能体AI系统的模型引导自动化攻击的防御性误导

Reza Soosahabi, Vivek Namsani

发表机构 * Application & Threat Intelligence Research Center(应用与威胁情报研究中心)

AI总结 本文通过概率模型分析智能体AI系统的攻击-防御场景,提出“检测-误导”策略(如CMPE)以替代传统“检测-拦截”方法,通过产生误导性响应降低攻击者成功率,并在基准测试中将攻击成功率上限降低两个数量级。

AI中文摘要

智能体AI系统越来越依赖语言模型组件来解释指令、处理外部数据、调用工具以及与其他智能体协调。这些能力使得提示注入和越狱攻击的后果更加严重,尤其是当攻击者采用模型引导的自动化来扩展探测、提示优化和响应评估时。本文通过目标系统、其防御机制以及攻击者的自动评判器的概率模型来分析由此产生的攻击-防御场景。我们的分析表明,传统的“检测-拦截”防御可能使攻击者成功率(ASR)随着查询预算的增长而趋近于1,因为可预测的拒绝为自动化搜索提供了有用的反馈。然后,我们研究了“检测-误导”策略,其中检测到的恶意交互会收到受控的、非操作性的响应,旨在诱导攻击者评判器产生假阳性错误。这种策略降低了攻击者选择候选的正预测值,并产生有界的渐近ASR。我们通过渐进式参与的上下文误导(CMPE)评估了该策略的概念验证实现,这是一种轻量级的对话误导方法,旨在在自动化越狱设置中用安全但具有战略误导性的响应替换可预测的拒绝文本。在越狱基准测试中,CMPE将估计的ASR上限降低了两个数量级,并在端到端PAIR和GPTFuzz攻击运行中几乎消除了验证的攻击成功。

英文摘要

Agentic AI systems increasingly rely on language-model components to interpret instructions, process external data, invoke tools, and coordinate with other agents. These capabilities make prompt-injection and jailbreak attacks more consequential, especially as attackers adopt model-guided automation to scale probing, prompt refinement, and response evaluation. This work analyzes the resulting attack-defense setting through a probabilistic model of a target system, its defense mechanism, and the attacker's automated judge. Our analysis shows that conventional detect-and-block defenses can allow attacker success rate (ASR) to approach one as the query budget grows, since predictable refusals provide useful feedback to automated search. We then examine detect-and-misdirect, where detected malicious interactions receive controlled, non-operational responses designed to induce false-positive errors in the attacker's judge. This strategy reduces the positive predictive value of attacker-selected candidates and yields a bounded asymptotic ASR. We evaluate a proof-of-concept realization of this strategy through Contextual Misdirection via Progressive Engagement (CMPE), a lightweight conversational misdirection method designed to replace predictable refusal text with safe but strategically misleading responses in automated jailbreak settings. On jailbreak benchmarks, CMPE reduces estimated ASR upper bounds by up to two orders of magnitude and nearly eliminates verified attack success in end-to-end PAIR and GPTFuzz attack runs.
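The contrast between detect-and-block and detect-and-misdirect can be illustrated with a toy model. It is a simplification written for this digest, not the probabilistic model of the paper: each query genuinely succeeds with probability `p`, the attacker's judge flags a genuine success with probability `tpr` and a non-success with probability `fpr`, and the attacker stops at the first flagged candidate. All parameter values are assumptions.

```python
def attack_success_rate(p, tpr, fpr, n_queries):
    """Probability that the first judge-flagged candidate within the
    query budget is a genuine success (toy model, independent queries)."""
    flagged = p * tpr + (1.0 - p) * fpr   # chance a given query is flagged
    if flagged == 0.0:
        return 0.0
    ppv = p * tpr / flagged               # positive predictive value of a flag
    return ppv * (1.0 - (1.0 - flagged) ** n_queries)


budget = 10_000
# Detect-and-block: predictable refusals rarely fool the judge (fpr = 0)
print(attack_success_rate(p=0.01, tpr=1.0, fpr=0.0, n_queries=budget))
# Detect-and-misdirect: misleading responses trigger judge false positives
print(attack_success_rate(p=0.01, tpr=1.0, fpr=0.5, n_queries=budget))
```

In this toy model a false-positive rate of zero lets the success rate approach one as the budget grows, while any positive false-positive rate caps it at the positive predictive value, mirroring the bounded asymptotic ASR described in the abstract.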

2606.20436 2026-06-19 cs.CR cs.AI 新提交

Multi-View Decompilation for LLM-Based Malware Classification

基于LLM的恶意软件分类的多视角反编译

Bercan Turkmen, Vyas Raina

发表机构 * Independent Researcher(独立研究员) SPARK

AI总结 提出多反编译器视角提升LLM恶意软件分类性能,通过Ghidra和RetDec的互补伪C代码提高召回率和F1分数。

AI中文摘要

恶意软件分析师通常在源代码不可用时,通过反编译的伪C代码检查编译后的二进制文件。最近的研究表明,大型语言模型(LLMs)可以通过将反编译代码分类为良性或恶意来辅助这一过程,但现有的流程通常依赖于单一的反编译器视角。我们认为这一假设是脆弱的:反编译器是有损的启发式工具,不同的反编译器可能暴露同一二进制文件的不同特征。我们整理了一个包含良性工具和恶意程序的基准测试,涵盖一系列威胁行为。每个样本都使用Ghidra和RetDec进行编译和反编译,生成匹配的伪C视图。在来自主要模型系列的一系列LLMs中,我们发现提供两种反编译器视图可以提高恶意类别的F1分数,主要是通过提高恶意样本的召回率。一致性分析进一步表明,Ghidra和RetDec会犯部分不同的错误,支持反编译器输出提供互补证据的观点。我们的结果表明,多反编译器提示是一种简单、无需训练的方法,可以在实际环境中改进基于LLM的恶意软件分类。

英文摘要

Malware analysts often inspect compiled binaries through decompiled pseudo-C, when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malicious, but existing pipelines typically rely on a single decompiler view. We argue that this assumption is fragile: decompilers are lossy heuristic tools, and different decompilers can expose different artefacts of the same binary. We curate a benchmark of benign utilities and malicious programs spanning a range of threat behaviors. Each sample is compiled and decompiled with both Ghidra and RetDec, yielding matched pseudo-C views. Across a range of LLMs from major model families, we find that providing both decompiler views improves malicious-class F1, mainly by increasing recall on malicious samples. Agreement analyses further show that Ghidra and RetDec make partially different errors, supporting the view that decompiler outputs provide complementary evidence. Our results suggest that multi-decompiler prompting is a simple, training-free way to improve LLM-based malware triage in practical settings.

2606.20408 2026-06-19 cs.CR cs.AI 新提交

LLM agent safety, multi-turn red-teaming, jailbreak benchmarks, adversarial robustness, safety-critical systems

LLM智能体安全性、多轮红队测试、越狱基准、对抗鲁棒性、安全关键系统

Hanwool Lee, Dasol Choi, Bokyeong Kim, Seung Geun Kim, Haon Park

发表机构 * AIM Intelligence(AIM智能公司) KAERI(韩国原子能研究所)

AI总结 提出NRT-Bench基准,通过模拟核电站控制室的多轮红队测试,评估LLM智能体在安全关键系统中的对抗鲁棒性,发现不同模型的漏洞几乎不重叠,且防御效果高度依赖模型。

AI中文摘要

大型语言模型(LLM)智能体越来越多地被提议作为安全关键系统的监督组件,但它们在持续、自适应对抗压力下的鲁棒性仍鲜有表征。我们提出了NRT-Bench,一个用于对作为安全关键系统操作员的LLM智能体进行多轮红队测试的基准,实例化为一个模拟核电站控制室。一个由五个角色组成的操作员团队,每个角色由可配置的LLM支持,运行一个由六项关键安全功能(CSF)管理的工厂,而对手在有限的多轮会话中通过四个通道注入消息,每轮有反馈。危害是一个客观信号,而非LLM评判的文本:一旦任何CSF丢失,运行即终止,并归因于导致该消息。在固定攻击配对重放协议下评估四个前沿操作员模型,我们发现自适应多轮攻击可靠地将操作员团队推过安全极限:在这四个模型中,8.7%至12.1%的攻击会话以工厂失去关键安全功能告终。尽管这四个模型在此聚合率下看起来几乎同样鲁棒,但它们的失败几乎没有重叠:在149个会话中,没有一个会话击败所有四个模型,而三分之一的会话至少击败一个模型,因此漏洞在模型之间几乎是不相交的,而非嵌套的。添加防御的效果强烈依赖于模型:同一套护栏或安全顾问智能体对一个模型降低攻击成功率,却可能对另一个模型提高成功率。我们发布了模拟场地、攻击数据集和重放工具,用于LLM智能体的可重复安全评估。

英文摘要

Large language model (LLM) agents are increasingly proposed as supervisory components for safety-critical systems, yet their robustness under sustained, adaptive adversarial pressure remains poorly characterized. We present NRT-Bench, a benchmark for multi-turn red-teaming of LLM agents acting as operators of a safety-critical system, instantiated in a simulated nuclear power plant control room. A five-role operator team, each backed by a configurable LLM, runs a plant governed by six critical safety functions (CSFs), while adversaries inject messages over four channels in bounded multi-turn sessions with per-turn feedback. Harm is an objective signal rather than LLM-judged text: a run terminates the moment any CSF is lost, and the loss is attributed to the message that caused it. Evaluating four frontier operator models under a fixed-attack paired-replay protocol, we find that adaptive multi-turn attacks reliably push the operator team past a safety limit: across the four models, between 8.7% and 12.1% of attack sessions end with the plant losing a critical safety function. Although the four models look almost equally robust by this aggregate rate, their failures barely overlap: of 149 sessions, none defeat all four models while a third defeat at least one, so vulnerabilities are nearly disjoint across models rather than nested. The effect of added defences is strongly model-dependent: the same guardrail stack or safety-advisor agent that lowers attack success for one model can raise it for another. We release the simulation venue, attack dataset, and replay tooling for reproducible safety evaluation of LLM agents.
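The distinction between disjoint and nested vulnerabilities comes down to set overlap across models. The sketch below shows how such a summary could be computed from per-model sets of defeating sessions; the model names and session IDs are invented for illustration and are not taken from the benchmark.

```python
def overlap_summary(defeated_by_model, total_sessions):
    """Count sessions that defeat every model versus at least one model."""
    sets = list(defeated_by_model.values())
    defeats_all = set.intersection(*sets)
    defeats_any = set.union(*sets)
    return {
        "defeats_all": len(defeats_all),
        "defeats_any": len(defeats_any),
        "any_fraction": len(defeats_any) / total_sessions,
    }


# Hypothetical replay results: session IDs that ended in a lost safety function
defeated = {
    "model_a": {1, 2},
    "model_b": {3},
    "model_c": {4, 5},
    "model_d": {2, 6},
}
print(overlap_summary(defeated, total_sessions=18))
```

Similar per-model rates combined with an empty intersection is the pattern the abstract describes as nearly disjoint vulnerabilities.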

2606.20373 2026-06-19 cs.SE cs.AI New submission

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

AutoPass: Evidence-Based LLM Agents for Compiler Performance Tuning

Zepeng Li, Jie Ren, Zhanyong Tang, Jie Zheng, Zheng Wang

Affiliations * Shaanxi Normal University (陕西师范大学); Northwest University (西北大学); University of Leeds (利兹大学)

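AI summary Proposes AutoPass, a multi-agent framework that queries compiler-internal state and the intermediate representation and uses runtime feedback to iteratively refine compiler options. It requires no training and achieves geometric-mean speedups over LLVM -O3 of 1.043x on x86-64 and 1.117x on ARM64.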

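Details
AI Chinese abstract (translated into English)

Large language models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult because of complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi-agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM-generated optimization decisions. Unlike prior auto-tuning schemes that treat the compiler as a black box, AutoPass opens the compiler up to the LLM, letting it query compiler-internal optimization states and analyze the intermediate representation in order to orchestrate compiler options. The search process uses measured runtime feedback to refine configurations iteratively, diagnosing performance regressions and guiding edits that improve latency. AutoPass runs in an inference-only, training-free setting with no offline training or task-specific fine-tuning, so it can be applied readily to new benchmarks and platforms. We implement AutoPass on the LLVM compiler and evaluate it on server-grade x86-64 and embedded ARM64 systems. AutoPass outperforms expert-tuned heuristics and classical auto-tuning methods, achieving geometric-mean speedups over LLVM -O3 of 1.043x on x86-64 and 1.117x on ARM64.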

English abstract

Large Language Models (LLMs) show promise for code compilation tasks, but applying them to runtime performance tuning is difficult due to complex microarchitectural effects and noisy runtime measurements. We present AutoPass, a multi-agent framework for compiler performance tuning that uses compiler and runtime evidence to guide LLM-generated optimization decisions. Rather than treating the compiler as a black box like prior auto-tuning schemes, AutoPass opens up the compiler to the LLM, enabling it to query compiler-internal optimization states and analyze the intermediate representation to orchestrate compiler options. The search process iteratively refines optimization configurations using measured runtime feedback to diagnose regressions and guide latency-improving edits. AutoPass operates in an inference-only, training-free setting and requires no offline training or task-specific fine-tuning, making it readily applicable to new benchmarks and platforms. We implement AutoPass on the LLVM compiler and evaluate it on server-grade x86-64 and embedded ARM64 systems. AutoPass outperforms expert-tuned heuristics and classical autotuning methods, achieving geometric-mean speedups of 1.043x and 1.117x over LLVM -O3 on x86-64 and ARM64, respectively.
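The headline figures are geometric-mean speedups over a baseline. As a rough illustration of how such a number is formed from per-benchmark timings, here is a minimal sketch; the function name `geomean_speedup` and the timings are hypothetical and are not taken from the AutoPass evaluation.

```python
import math
from typing import Sequence


def geomean_speedup(baseline_s: Sequence[float], tuned_s: Sequence[float]) -> float:
    """Geometric mean of per-benchmark speedups (baseline time / tuned time)."""
    if len(baseline_s) != len(tuned_s) or len(baseline_s) == 0:
        raise ValueError("need two non-empty timing lists of equal length")
    if any(t <= 0 for t in list(baseline_s) + list(tuned_s)):
        raise ValueError("timings must be positive")
    # Average the logs of the ratios, then exponentiate; this avoids
    # overflow or underflow from multiplying many ratios together.
    log_sum = sum(math.log(b / t) for b, t in zip(baseline_s, tuned_s))
    return math.exp(log_sum / len(baseline_s))


# Invented timings in seconds: one benchmark doubles in speed, one is
# unchanged, and one halves in speed.
baseline = [2.0, 4.0, 1.0]
tuned = [1.0, 4.0, 2.0]
result = geomean_speedup(baseline, tuned)
print(round(result, 6))
```

The geometric mean is the usual choice for ratios because a 2x gain and a 2x loss cancel out, as in the toy timings here, whereas an arithmetic mean of the same ratios would report a net gain.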

2606.20295 2026-06-19 cs.SE cs.CL New submission

Token-Operations-Oriented Inference Optimization Techniques for Large Models

Token-Operation-Oriented Inference Optimization Techniques for Large Models

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

Affiliations * China’s National Data Administration (中国国家数据管理局)

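AI summary This paper proposes a four-layer technical architecture comprising multi-model fusion, model optimization, compute-model fusion, and compute-network-model fusion, and systematically reviews the key technologies and industry status of each layer. Its aim is to reduce token costs, improve service efficiency, ensure supply stability, and move large model services from being callable to being operable.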

Comments 62 pages, 36 figures

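Details
AI Chinese abstract (translated into English)

Large model inference optimization is a key foundation for running large model services scalably, at low cost, and with high stability. Centered on token-oriented inference optimization technology, this paper is the first to propose a four-layer technical architecture consisting of multi-model fusion, model optimization, compute-model fusion, and compute-network-model fusion. It systematically surveys the key technologies and industry status of these four layers and analyzes the application value of the related technologies in real business scenarios. The paper offers a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from callable to operable.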

English abstract

Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.