arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

语言大模型 / LLM

大语言模型、预训练、指令微调、后训练和语言模型应用。

今日/当前日期收录 133 信号源:cs.CL, cs.AI, cs.LG

1. 领域大模型 4 篇

2606.19387 2026-06-19 cs.SE cs.AI 新提交 75%

Interpretable and Verifiable Hardware Generation with LLM-Driven Stepwise Refinement

可解释且可验证的硬件生成:基于LLM驱动的逐步细化

You Li, Samuel Mandell, David Z. Pan

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Fudan University(复旦大学) USA(美国)

专题命中 领域大模型 :LLM用于硬件设计,但非通用语言模型。

AI总结 提出结合LLM创造力与形式化方法可解释性的硬件生成框架,通过迭代应用变换规则将设计规范转换为正确性有保证的RTL程序。

详情
AI中文摘要

大型语言模型(LLM)在软件开发中取得了显著成功。然而,它们容易产生幻觉,即可能引入微妙的语义和逻辑错误。由于芯片设计和制造的高风险,硬件工程师仍不愿依赖LLM进行寄存器传输级(RTL)生成。本文提出一种硬件生成框架,结合了LLM的创造力和广泛知识与形式化方法的可解释性和数学严谨性。具体而言,我们设计了一组覆盖各种设计决策和硬件特征的变换规则。通过迭代应用这些规则,LLM代理可以将设计规范转换为正确性有保证的RTL程序。实验结果证明了该框架的有效性和效率。

英文摘要

Large language models (LLMs) have achieved remarkable success in software development. However, they are susceptible to hallucinations, meaning that they can introduce subtle semantic and logical errors. Due to the high stakes in chip design and manufacturing, hardware engineers are still reluctant to rely on LLMs for register-transfer level (RTL) generation. In this paper, we propose a hardware generation framework that combines the creativity and broad knowledge of LLMs with the explainability and mathematical rigor of formal methods. Specifically, we devise a set of transformation rules that cover various design decisions and hardware features. By iteratively applying these rules, an LLM agent can convert a design specification into an RTL program with guaranteed correctness. Experimental results demonstrate the effectiveness and efficiency of the framework.

2606.19364 2026-06-19 cs.LG 新提交 75%

Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in Cloud LLM Inference

缩小社会-语义差距:SPSD用于云LLM推理中的边缘端提示压缩

Abhinit Sen, Ajeet Kumar, Manaranjan Pradhan

发表机构 * Indian School of Business(印度管理学院)

专题命中 领域大模型 :边缘端提示压缩用于云LLM推理。

AI总结 针对云LLM推理中提示词预填充阶段能耗高的问题,提出SPSD边缘端管道,利用4比特量化小语言模型压缩用户提示,在保持响应质量非劣效的前提下,平均节省99.9个输入token,每调用净节能70-270 uWh。

Comments 19 pages, 7 tables, 1 figure, includes appendix

详情
AI中文摘要

大语言模型(LLM)推理的预填充阶段正成为云规模能耗的日益增长的贡献者。许多面向消费者的支持和对话提示包含社会性支架:礼貌标记、道歉性开场白、重复以及建立融洽关系的语言,这些对人类交流很重要,但对机器推理而言边际信息量较低。我们将这种差异称为社会-语义差距。我们提出SPSD(情感保留语义蒸馏),一种边缘端管道,在传输到云端部署的LLM之前,使用4比特量化的小语言模型压缩用户提示。在248个提示的语料库上,使用Gemma-2-2B-Instruct(Q4_K_M)作为SLM、Llama-3.1-8B-Instruct作为云端评估模型进行评估,每次蒸馏调用平均输入token节省99.9个,所有146次蒸馏调用均产生正向节省。通过盲法LLM-as-judge评分对121对进行评估,响应质量在15分制中预先指定的1分非劣效范围内不劣于原始路径;评审员给出43%平局、28%蒸馏胜出和29%原始胜出。余弦相似度结果不一:均值0.682,中位数0.712,54.1%的配对高于0.70参考阈值。安全关键领域通过基于规则的网关保守地路由至直通模式。在所述假设下,每次调用净节能估计为70-270 uWh。SPSD表明,设备端提示蒸馏可以在保持响应质量在实际非劣效范围内的同时,降低云LLM的输入token成本。

英文摘要

The prefill stage of Large Language Model (LLM) inference is a growing contributor to cloud-scale energy cost. Many consumer-support and conversational prompts contain social scaffolding: politeness markers, apologetic preamble, repetition, and rapport-building language that is important for human communication but carries low marginal information for machine reasoning. We call this discrepancy the Social-Semantic Gap. We present SPSD (Sentiment Preserving Semantic Distillation), an edge-based pipeline that compresses user prompts using a 4-bit quantised Small Language Model before transmission to a cloud-deployed LLM. Evaluation on a 248-prompt corpus using Gemma-2-2B-Instruct (Q4_K_M) as the SLM and Llama-3.1-8B-Instruct as the cloud evaluation model yields a mean input token saving of 99.9 tokens per distilled call, with all 146 distilled calls yielding positive savings. Response quality, assessed by blind LLM-as-judge scoring across 121 pairs, is non-inferior to the raw path within a pre-specified 1-point margin on a 15-point rubric; the judge awarded 43 percent ties, 28 percent distilled wins, and 29 percent raw wins. Cosine similarity is mixed: mean 0.682, median 0.712, with 54.1 percent of pairs above the 0.70 reference threshold. Safety-critical domains are conservatively routed to passthrough via rule-based gates. Per-call net energy saving is estimated at 70-270 uWh under stated assumptions. SPSD shows that on-device prompt distillation can reduce cloud LLM input-token cost while preserving response quality within a practical non-inferiority margin.

2512.18859 2026-06-19 cs.CL 版本更新 75%

Toward Human-Centered AI-Assisted Terminology Work

迈向以人为中心的AI辅助术语工作

Antonio San Martin

发表机构 * Universite du Quebec à Trois-Rivieres(魁北克大学三河分校)

专题命中 领域大模型 :讨论生成式AI在术语工作中的应用,属于领域大模型

AI总结 本文提出以人为中心的人工智能框架,在利用生成式AI自动化术语工作的同时,通过增强术语学家能力、保持人类控制权来确保术语数据的准确性和可靠性。

Comments Accepted for publication in the journal Terminology

详情
AI中文摘要

生成式AI可能通过创造自动化新机会来改变术语工作。同时,它引发了对术语学家和术语资源未来的担忧,因为效率压力可能鼓励过度自动化,认为人类专业知识可被AI取代。然而,由于错误、幻觉和各种形式的偏见,大型语言模型在术语目的上仍然不可靠,使得术语学家在确保术语数据的准确性和可靠性方面不可或缺。本文认为,以人为中心的AI(强调AI的主要目标应是促进人类福祉的方法)提供了一个框架,可以在最大化生成式AI收益的同时减轻其风险。它主张高水平的自动化和有意义的人类控制是兼容且可取的,AI应增强术语学家的能力,同时保留他们的自主权和决策权。通过三个相互关联的维度——增强的术语学家、伦理AI和以人为中心的设计——审视了AI辅助术语工作的影响。特别是,本文探讨了AI整合如何重塑术语学家的角色,影响专业价值观和工作条件,要求管理AI产生的偏见,并呼吁围绕术语学家的需求设计AI工具。本文得出结论,以人为中心的方向是必要的,以确保AI加强而非削弱术语工作在支持专业交流以及跨语言和跨文化准确传播知识中的关键作用。

英文摘要

Generative AI is likely to transform terminology work by creating new opportunities for automation. At the same time, it raises concerns about the future of terminologists and terminological resources, as efficiency pressures may encourage excessive automation based on the perception that human expertise can be replaced by AI. However, large language models remain unreliable for terminological purposes due to errors, hallucinations, and various forms of bias, making terminologists indispensable for ensuring the accuracy and reliability of terminological data. This paper argues that human-centered AI, an approach that emphasizes that AI's primary goal should be to contribute to human well-being, provides a framework for maximizing the benefits of generative AI while mitigating its risks. It contends that high levels of automation and meaningful human control are compatible and desirable, and that AI should enhance terminologists' capabilities while preserving their agency and decision-making authority. The implications of AI-assisted terminology work are examined through three interrelated dimensions: the augmented terminologist, ethical AI, and human-centered design. In particular, the paper examines how AI integration reshapes the role of the terminologist, affects professional values and working conditions, requires the management of AI-generated bias, and calls for the design of AI tools around the terminologist's needs. The paper concludes that a human-centered orientation is necessary to ensure that AI strengthens, rather than undermines, the essential role of terminology work in supporting specialized communication and the accurate transmission of knowledge across languages and cultures.

2606.19640 2026-06-19 cs.CL cs.AI cs.HC 新提交 70%

Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language

创建多语言心理健康对话数据集:基于国籍和语言的人物角色本地化方法的局限性

Yunkai Xu, Saeed Abdullah

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学)

专题命中 领域大模型 :研究LLM生成多语言心理健康对话数据集及评估。

AI总结 研究通过修改人物角色中的国籍和语言参数生成中文、孟加拉语和印地语临床对话,发现仅添加这些参数会导致跨语言临床不一致,且LLM评估非英语文本的抑郁严重度时存在不准确性。

Comments 15 pages, 4 figures. Accepted to the 2026 Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026), co-located with ACL 2026

详情
AI中文摘要

人工智能和大语言模型(LLMs)已成为应对全球心理健康挑战的有前景的工具。尽管这些挑战具有全球性,但用于训练和评估此类系统的高质量数据集仍然严重短缺。为弥补这一差距,研究人员越来越多地生成合成临床人物角色来模拟用户数据并测试数字心理健康支持系统。然而,大多数经过验证的人物角色依赖于以英语为中心的语境。本文研究了是否可以使用类似的人物角色方法生成多语言心理健康数据集。我们修改了人物角色中的国籍和语言参数,以生成普通话、孟加拉语和印地语的临床对话。然后,我们考察了不同LLM在评估这些生成的多语言数据集的抑郁严重程度(与英语基线相比)时的表现。我们的研究结果表明,仅在人物角色中添加国籍和语言参数可能不够,因为它可能引入跨语言的临床不一致性。LLM评判模型在评估非英语文本中的抑郁严重程度时常常表现出不准确性,且不同模型的性能存在差异。这暴露了将以英语为中心的人物角色应用于多语言语境的系统性局限性。最终,我们的工作强调了迫切需要文化响应式数据生成,以确保全球心理健康系统的公平性。

英文摘要

AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.

2. 其他LLM 23 篇

2606.18933 2026-06-19 cs.LG cs.IR stat.ME 新提交 75%

Zero-Shot Active Feature Acquisition via LLM-Elicitation

基于LLM启发式的零样本主动特征获取

Binyamin Perets, Natalie Mendelson, Shiran Vainberg, Yehuda Chowers, Shai Shen-Orr, Shie Mannor

发表机构 * Faculty of EE, Technion(技术学院电子工程系) Faculty of Medicine, Technion(技术学院医学院) CytoReason NVIDIA

专题命中 其他LLM :LLM启发式用于主动特征获取

AI总结 提出通过LLM启发式获取马尔可夫随机场充分统计量的零样本主动特征获取框架,解决数据标注不足问题,在IBD患者诊断中优于现有方法。

详情
AI中文摘要

主动特征获取(AFA)顺序选择要观察的特征以达成分类或排序决策。其主要局限性在于依赖大量标注数据来拟合指导获取的概率模型。大型语言模型(LLM)提供无监督的领域知识,但作为序列规划者表现不佳。要求其同时知晓和决策会混淆最好分开的能力。这里,我们通过严格的启发式方法开发了一个零样本AFA框架:仅要求LLM返回其可被信任返回的内容,即马尔可夫随机场(MRF)的充分统计量——一元偏差和成对协变。我们将该框架应用于两个场景:二分类和top-$k$识别。实践中,LLM可靠地仅返回判别性统计量,即区分类别而非孤立每个类别的统计量,这阻碍了经典AFA。我们应用最大熵闭包来解决这种规范模糊性。我们在炎症性肠病(IBD)患者队列上进行评估,这是一个活跃的临床环境,其中诊断模糊性和患者异质性阻碍了稳定的治疗策略。我们的框架在真实标签和其自身提取的信念上均优于LLM。在最关键的地方,即最困难的患者上,我们的top-$k$获取策略显著优于所有现有方法。

英文摘要

Active feature acquisition (AFA) sequentially selects which features to observe to reach a classification or ranking decision. Its central limitation is reliance on large amount of labeled data to fit probabilistic models guiding acquisition. Large language models (LLMs) supply unsupervised domain knowledge, but are poor sequential planners. Asking one to both know and decide conflates capabilities best kept separate. Here, we develop a framework for zero-shot AFA through disciplined elicitation: asking the LLM only for what it can be trusted to return, the unary deviations and pairwise co-variations that are the sufficient statistics of a Markov random field (MRF). We apply our framework to two settings: binary classification and top-$k$ identification. In practice, the LLM reliably returns only discriminative statistics, what distinguishes the classes rather than each class in isolation, which precludes classical AFA. We apply a maximum-entropy closure that resolves this gauge ambiguity. We evaluate on a cohort of Inflammatory Bowel Disease (IBD) patients, an active clinical setting where diagnostic ambiguity and patient heterogeneity obstruct stable treatment strategies. Our framework outperforms the LLM both on real labels and on its own extracted beliefs. Where it matters most, on the hardest patients, our top-$k$ acquisition policy markedly outperforms all existing methods.

2606.17832 2026-06-19 cs.LG 新提交 75%

From Drift to Coherence: Stabilizing Beliefs in LLMs

从漂移到一致:稳定LLM中的信念

SongEun Kim, Seungyoo Lee, Edwin Fong, Hyungi Lee, Juho Lee

发表机构 * Department of Statistics, Seoul National University Korea Advanced Institute of Science \& Technology Department of AI, Kookmin University University of Hong Kong

专题命中 其他LLM :研究LLM信念稳定性,提出预测重采样方法

AI总结 研究LLM在多项选择问答中的信念漂移问题,提出提示式预测重采样(PPR)方法,发现信念过程会自稳定并收敛,进而提出种子答案提示策略和自一致性损失以加速稳定并提高预测一致性。

详情
AI中文摘要

大型语言模型(LLM)常被假设执行隐式贝叶斯推理,然而一个关键的一致性条件——预测信念的鞅性质——已被证明在受控的合成上下文学习设置中失效。我们在更典型的使用场景中重新审视这个问题:通用多项选择问答。利用离散答案空间,我们计算精确的预测分布,并研究由自回归答案重采样引起的信念动态。我们引入了提示式预测重采样(PPR),其中LLM对同一问题生成一系列答案。实验表明,PPR揭示了早期阶段的信念漂移,表明鞅性质被违反。然而,在足够的重采样步骤后,信念过程自稳定并收敛到一个一致的预测分布。基于这一观察,我们进一步提出了(i)种子答案提示策略以加速稳定,以及(ii)自一致性损失,通过微调将早期漂移摊销到模型中。在多项选择问答基准上的实验表明,我们的方法在不牺牲准确性的情况下显著减少了信念漂移并提高了预测一致性。

英文摘要

Large language models (LLMs) are often hypothesized to perform implicit Bayesian inference, yet a key coherence condition, the martingale property of predictive beliefs, has been shown to fail in controlled synthetic in-context learning settings. We revisit this question in a more typical usage regime: generic multiple-choice question answering. Exploiting the discrete answer space, we compute exact predictive distributions and study belief dynamics induced by autoregressive answer resampling. We introduce prompted predictive resampling (PPR), where an LLM generates a sequence of answers to the same question. Empirically, PPR reveals early-stage belief drift, indicating martingale violations. However, after sufficient resampling steps, the belief process self-stabilizes and converges to a coherent predictive distribution. Based on this observation, we further propose (i) a seed-answer prompting strategy to accelerate stabilization, and (ii) a self-consistency loss that amortizes early-stage drift into the model via fine-tuning. Experiments on multiple-choice QA benchmarks show that our methods substantially reduce belief drift and improve predictive coherence without sacrificing accuracy.

2606.20544 2026-06-19 cs.AI cs.LG 新提交 70%

Toward Calibrated Mixture-of-Experts Under Distribution Shift

面向分布偏移下的校准混合专家模型

Gina Wong, Drew Prinster, Suchi Saria, Rama Chellappa, Anqi Liu

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

专题命中 其他LLM :研究混合专家模型的校准问题,属于大语言模型应用

AI总结 研究混合专家模型在分布偏移下的校准问题,提出对抗性重加权方法以改善路由聚合的校准误差,提升准确率-校准权衡。

Journal ref ICML 2026

详情
AI中文摘要

校准将模型的预测不确定性与其经验结果的频率对齐,对于理解和信任报告的概率很重要。最近的研究表明,在单个预测器级别强制执行校准可以提高集成准确性和校准,特别是混合专家(MoE)模型显示出强烈的经验改进;然而,校准有助于MoE的条件尚不清楚。在这项工作中,我们研究了MoE模型在分布偏移下的行为,重点关注路由机制如何与专家级校准相互作用。我们表明,在硬路由模型中,专家校准足以确保整体模型在一大类分布偏移下的校准,但不足以校准软路由模型。为了解决这个问题,我们提出了一种对抗性重加权方法,惩罚分布偏移下路由聚合的校准误差,并证明它在平均情况下以及在数据的困难子集上,跨模型类别、预测任务和分布偏移,改善了准确率-校准权衡。

英文摘要

Calibration aligns a model's predictive uncertainty with the frequencies of its empirical outcomes and is important for understanding and trusting reported probabilities. Recent work shows that enforcing calibration at the level of individual predictors can improve ensemble accuracy and calibration, with mixture-of-experts (MoE) models showing strong empirical improvements in particular; however, the conditions under which calibration helps MoE are not well understood. In this work, we study how MoE models behave under distribution shift, focusing on how routing mechanisms interact with expert-level calibration. We show that expert calibration is sufficient to ensure calibration of the overall model under a broad class of distribution shifts in hard-routed models, but is insufficient for calibrating soft-routed models. To address this, we propose an adversarial reweighting that penalizes calibration errors of the routed aggregate under distribution shift, and we demonstrate that it improves the accuracy-calibration tradeoff both on average and on difficult subsets of the data, across model classes, prediction tasks, and distribution shifts.

2606.20538 2026-06-19 cs.LG 新提交 70%

Multi-Task Bayesian In-Context Learning

多任务贝叶斯上下文学习

Qingyang Zhu, Eric Karl Oermann, Kyunghyun Cho

发表机构 * New York University(纽约大学)

专题命中 其他LLM :提出多任务上下文学习框架,属于LLM方法

AI总结 提出多任务上下文学习框架,通过将先验信息表示为上下文数据集前缀,训练Transformer实现分层贝叶斯预测推理,在多种分布偏移下匹配最优贝叶斯性能且速度提升数个数量级。

Comments ICML 2026

详情
AI中文摘要

贝叶斯预测推断为不确定性量化、数据效率和鲁棒泛化提供了原则性框架。然而,精确推断通常难以处理,可扩展近似可能仍计算昂贵或需要限制性建模假设,从而降低预测性能。先验数据拟合和上下文模型最近作为一种摊销替代方案出现,通过学习直接将数据集映射到预测分布,但现有方法与训练先验的支持紧密耦合,缺乏在测试时适应新先验的显式机制,导致在分布偏移下鲁棒性有限。我们引入了一个多任务上下文学习框架,用于摊销分层贝叶斯预测推断,该框架将先验信息显式表示为上下文数据集的前缀。一个在先验和目标任务序列上训练的Transformer学习跨先验族调整其预测。在一系列难度递增的评估中,包括元分布外先验和具有高维潜在结构的先验,我们的方法匹配了最优贝叶斯预测器,同时速度快了几个数量级。我们进一步在真实世界的时空温度预测基准上展示了其实用性。代码可在https://this URL获取。

英文摘要

Bayesian predictive inference provides a principled framework for uncertainty quantification, data efficiency, and robust generalization. However, exact inference is often intractable, and scalable approximations may remain computationally expensive or require restrictive modeling assumptions that degrade predictive performance. Prior-Data Fitted and in-context models have recently emerged as an amortized alternative by learning to map datasets directly to predictive distributions, but existing approaches are tightly coupled to the support of the training prior and lack explicit mechanisms for adapting to new priors at test time, resulting in limited robustness under distribution shift. We introduce a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference that explicitly represents prior information as a prefix of in-context datasets. A transformer trained on sequences of prior and target tasks learns to adapt its predictions across families of priors. On a suite of evaluations with increasing difficulty, including out-of-meta-distribution priors and priors with high-dimensional latent structures, our method matches oracle Bayesian predictors while being orders of magnitude faster. We further demonstrate its practical relevance on a real-world spatiotemporal temperature prediction benchmark. Code is available at https://github.com/martianmartina/multi-task-bayesian-icl/.

2606.20502 2026-06-19 cs.CR cs.AI cs.SE 新提交 70%

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

无理解的校准:诊断微调大语言模型在系统软件漏洞检测中的局限性

Arastoo Zibaeirad, Marco Vieira

发表机构 * University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校)

专题命中 其他LLM :诊断微调LLM在漏洞检测中的局限性

AI总结 提出CWE-Trace框架,通过834个Linux内核样本和两个诊断指标(DFI和HDD)评估LLM漏洞检测能力,发现数据污染无实质帮助,微调仅改变输出阈值而非决策策略,模型缺乏真正的安全推理能力。

详情
AI中文摘要

大语言模型在漏洞基准测试中得分高,但究竟是真正推理安全还是仅对污染数据进行模式匹配,这一问题仍未解决。我们提出CWE-Trace,一个基于834个手动整理的Linux内核样本(涵盖74个CWE)构建的LLM漏洞检测框架。该框架强制执行严格的时间分割(2025年前的历史集/截止后的无泄漏集),保留上下文感知的易受攻击-修补对,并引入两个诊断指标:方向性失败指数(DFI)和层次距离与方向(HDD)。我们评估了8个原始LLM和15个LoRA微调变体,涵盖非目标检测、目标检测和CWE分类。分析得出两个关键结果。首先,数据污染未提供可衡量的优势。函数级分析显示,84%的名义污染样本不携带可用的记忆信号:易受攻击的函数缺失或跨数据集交叉映射,约31%的污染样本存在CWE误分类。其次,骨干方向性先验主导微调。模型表现出稳定、系统性的失败模式(DFI范围从-85.5到+94.8个百分点),这些模式从历史数据持续到截止后数据,且难以纠正。微调改变了输出阈值,但未改变决策策略。这是无理解的校准:输出分布适应训练数据,而底层安全推理仍然缺失。在二元检测中最弱的骨干(DeepSeek-R1)在粗粒度CWE分类中提升最大,表明检测和理解是解耦的能力。最佳检测得分仅达到52.1%(比随机高2.1个百分点);精确CWE排名Top-1准确率仍低于1.3%,证实当前LLM无论采用何种微调策略,都缺乏对系统软件的可靠安全推理能力。

英文摘要

Whether LLMs scoring well on vulnerability benchmarks genuinely reason about security or merely pattern-match on contaminated data remains unresolved. We present CWE-Trace, a framework for LLM vulnerability detection built from 834 manually curated Linux kernel samples spanning 74 CWEs. The framework enforces a strict temporal split (pre-2025 historical set / post-cutoff leakage-free set), preserves context-aware vulnerable--patched pairs, and introduces two diagnostic metrics: the Directional Failure Index (DFI) and Hierarchical Distance and Direction (HDD). We evaluate eight vanilla LLMs and 15 LoRA fine-tuned variants across non-targeted detection, targeted detection, and CWE classification. Our analysis yields two key results. First, data contamination provides no measurable advantage. Function-level analysis shows that 84% of nominally contaminated samples carry no usable memorization signal: vulnerable functions are absent or cross-mapped across datasets, and ~31% of contaminated samples carry CWE misclassification. Second, backbone directional priors dominate fine-tuning. Models exhibit stable, systematic failure modes (DFI ranging from -85.5 to +94.8 pp) that persist from historical to post-cutoff data and resist correction. Fine-tuning shifts the output threshold without changing the decision policy. This is calibration without comprehension: output distributions adapt to training data while the underlying security reasoning remains absent. The weakest backbone at binary detection (DeepSeek-R1) gains the most in coarse CWE classification, revealing that detection and understanding are decoupled capabilities. The best detection score reaches only 52.1% (+2.1 pp above chance); exact CWE ranking remains below 1.3% Top-1 accuracy, confirming that current LLMs lack reliable security reasoning for systems software, regardless of fine-tuning strategy.

2606.20436 2026-06-19 cs.CR cs.AI 新提交 70%

Multi-View Decompilation for LLM-Based Malware Classification

基于LLM的恶意软件分类的多视角反编译

Bercan Turkmen, Vyas Raina

发表机构 * Independent Researcher(独立研究员) SPARK

专题命中 其他LLM :利用LLM分类反编译代码,涉及LLM应用

AI总结 提出多反编译器视角提升LLM恶意软件分类性能,通过Ghidra和RetDec的互补伪C代码提高召回率和F1分数。

详情
AI中文摘要

恶意软件分析师通常在源代码不可用时,通过反编译的伪C代码检查编译后的二进制文件。最近的研究表明,大型语言模型(LLMs)可以通过将反编译代码分类为良性或恶意来辅助这一过程,但现有的流程通常依赖于单一的反编译器视角。我们认为这一假设是脆弱的:反编译器是有损的启发式工具,不同的反编译器可能暴露同一二进制文件的不同特征。我们整理了一个包含良性工具和恶意程序的基准测试,涵盖一系列威胁行为。每个样本都使用Ghidra和RetDec进行编译和反编译,生成匹配的伪C视图。在来自主要模型系列的一系列LLMs中,我们发现提供两种反编译器视图可以提高恶意类别的F1分数,主要是通过提高恶意样本的召回率。一致性分析进一步表明,Ghidra和RetDec会犯部分不同的错误,支持反编译器输出提供互补证据的观点。我们的结果表明,多反编译器提示是一种简单、无需训练的方法,可以在实际环境中改进基于LLM的恶意软件分类。

英文摘要

Malware analysts often inspect compiled binaries through decompiled pseudo-C, when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malicious, but existing pipelines typically rely on a single decompiler view. We argue that this assumption is fragile: decompilers are lossy heuristic tools, and different decompilers can expose different artefacts of the same binary. We curate a benchmark of benign utilities and malicious programs spanning a range of threat behaviors. Each sample is compiled and decompiled with both Ghidra and RetDec, yielding matched pseudo-C views. Across a range of LLMs from major model families, we find that providing both decompiler views improves malicious-class F1, mainly by increasing recall on malicious samples. Agreement analyses further show that Ghidra and RetDec make partially different errors, supporting the view that decompiler outputs provide complementary evidence. Our results suggest that multi-decompiler prompting is a simple, training-free way to improve LLM-based malware triage in practical settings.

2606.20295 2026-06-19 cs.SE cs.CL 新提交 70%

Token-Operations-Oriented Inference Optimization Techniques for Large Models

面向令牌操作的大模型推理优化技术

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

发表机构 * China’s National Data Administration(中国国家数据管理局)

专题命中 其他LLM :综述大模型推理优化技术

AI总结 本文提出多模型融合、模型优化、计算-模型融合、计算-网络-模型融合四层技术架构,系统综述各层关键技术及产业现状,旨在降低令牌成本、提升服务效率、保障供应稳定性,推动大模型服务从可调用到可运营的转变。

Comments 62 pages, 36 figures

详情
AI中文摘要

大模型推理优化是支撑大模型服务可扩展、低成本、高稳定运行的关键基础。本文以面向令牌的推理优化技术为核心,首次提出由多模型融合、模型优化、计算-模型融合、计算-网络-模型融合组成的四层技术架构,系统梳理了这四层的关键技术和产业现状,并分析了相关技术在实际业务场景中的应用价值。本文为降低令牌生产成本、提高令牌服务效率、保障令牌供应稳定性、推动大模型服务从可调用到可运营的转变提供了实用的技术路径。

英文摘要

Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.

2606.20287 2026-06-19 cs.CL 新提交 70%

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

PsyScore: 一种心理测量感知的特质自适应作文评分与最近发展区支架反馈框架

Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

发表机构 * Department of Educational Psychology, East China Normal University(华东师范大学教育心理学系) Shanghai Institute of Artificial Intelligence for Education, East China Normal University(华东师范大学上海智能教育研究院) School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院)

专题命中 其他LLM :使用LLM生成自适应反馈

AI总结 提出PsyScore框架,通过共享潜在能力表示整合诊断评估与教学支架,包括特质自适应神经IRT评分器、ZPD支架反馈生成器和多视角反馈评估策略,在ASAP++数据集上实现竞争性评分性能并提供更符合教学法的反馈。

详情
AI中文摘要

有效的自动作文评分(AES)应支持可靠评估和可操作的教学反馈。然而,现有方法通常将评分和反馈视为独立组件:神经评分模型可解释性有限,而基于大语言模型(LLM)的反馈通常对学习者熟练度不敏感。为解决这一碎片化问题,本工作提出PsyScore,一个心理测量感知的框架,通过共享潜在能力表示整合诊断评估与教学支架。PsyScore包含三个关键模块:特质自适应神经IRT评分器,将分级部分信用模型(GPCM)融入神经架构,能够在保持心理测量可解释性的同时精确估计学生能力;ZPD支架反馈生成器,根据诊断出的能力参数调节多智能体反馈策略,以适应不同熟练水平的教学重点;以及多视角反馈评估策略,通过成对偏好判断和学生修订模拟评估反馈质量。在ASAP++数据集上的实验表明,PsyScore在提供更具教学一致性的反馈的同时,实现了有竞争力的评分性能。

英文摘要

Effective Automated Essay Scoring (AES) are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability, a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels, and a Multi-Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.

2606.19941 2026-06-19 cs.LG 新提交 70%

Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds

组合性在窄深度-连接性区域中涌现:架构约束与解流形

Dat H. Do, Rushi Shah, Duc V. Le, Dianbo Liu

发表机构 * National University of Singapore(新加坡国立大学) University of Twente(特温特大学)

专题命中 其他LLM :研究组合性在稀疏网络中的涌现机制

AI总结 研究发现组合性仅在特定稀疏网络和特定深度区间涌现,提出基于相似性的剪枝和深度预测方法,并用理论框架解释原因。

详情
AI中文摘要

组合性被认为是泛化的基础,使模型能够在新颖组合中重用有意义的原语。然而,使用标准梯度优化训练的模型很少且通常仅微弱地表现出组合内部结构,并且尚不清楚这种组合性如何或为何形成。在这项工作中,我们表明组合性在一个狭窄的连接性-深度最佳点涌现。沿着连接性轴,组合性仅出现在某些特定稀疏网络中,严重依赖于保留哪些连接而非仅权重的稀疏性。沿着深度轴,组合性在一个狭窄的、目标依赖的区域内涌现,在特定深度达到峰值,而更浅和更深的网络都失败。当深度或连接性条件被违反时,梯度下降会静默地收敛到破碎解而非组合解。为了发现并利用这种涌现,我们引入了(i)基于相似性的剪枝(SP)以恢复组合连接性,以及(ii)一个启发式深度预测器以估计组合性最可能出现的深度。最后,我们通过基于组合稀疏性、体积比论证和特征干扰界限的理论框架支持这些实证发现,解释了为什么组合解仅在狭窄的深度-连接性区域内可达。

英文摘要

Compositionality is believed to be the foundation for generalization, enabling models to reuse meaningful primitives in novel combinations. Yet, models trained with standard gradient-based optimization rarely, and often only weakly, exhibit compositional internal structure, and it remains unclear how or why such compositionality forms. In this work, we show that compositionality emerges in a narrow connectivity-depth sweet spot. Along the connectivity axis, compositionality only appears in some specifically sparse networks, heavily depends on which connections remain rather than on weights' sparsity alone. Along the depth axis, compositionality emerges within a narrow, target-dependent regime, peaking at specific depths, while both shallower and deeper networks fail. When either the depth or connectivity condition is violated, gradient descent silently converges to fractured solutions rather than compositional ones. To discover and exploit this emergence, we introduce (i) similarity-based pruning (SP) to recover compositional connectivity and (ii) a heuristic depth predictor to estimate where compositionality is most likely to appear. Finally, we support these empirical findings with a theoretical framework based on compositional sparsity, volume-ratio arguments, and feature-interference bounds, explaining why compositional solutions are reachable only in a narrow depth-connectivity regime.

2606.19864 2026-06-19 cs.CL 新提交 70%

The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI

近乎智能的革命:扩大审议规模并利用AI赋能人类的选项

Serge Sharoff

发表机构 * Centre on Participatory and Deliberative Democracy(参与性和协商性民主研究中心)

专题命中 其他LLM :LLM用于民主审议规模扩大。

AI总结 探讨大型语言模型如何通过系统功能语言学视角扩大民主审议规模,增强包容性并赋权边缘群体,同时警惕过度承诺与低估风险。

Comments Published in /Handbook of Democracy in the Era of Artificial Intelligence/ edited by Evangelos Pournaras, Srijoni Majumdar, Carina Ines Hausladen, and Dirk Helbing. 2026

详情
AI中文摘要

大型语言模型在公共话语中的日益突出为民主审议带来了机遇和挑战。虽然红队策略有助于缓解特定风险,但关于语言限制、偏见和LLM的谄媚倾向等更广泛的担忧仍然存在。本章探讨如何利用LLM显著扩大和民主化审议,特别是在促进包容性和赋权传统边缘群体方面。借鉴系统功能语言学的概念,本章考察了语言使用者之间的差异(例如,关于社会人口群体)和语言使用中的差异(例如,关于交际功能)如何影响AI支持的审议参与。本章介绍了AI驱动的审议研究,并评估了它们在支撑论证、增强可及性以及减少嵌入在声望语域中的排斥性语言规范和偏见的影响方面的潜力。同时,本章警告不要过度承诺(导致不切实际的期望)和低估承诺(冒着错失AI辅助参与机会的风险)。最后,本章确定了未来的研究方向,以最大化AI辅助参与的民主潜力,同时嵌入伦理保障以抵消语言不平等的再生产。

英文摘要

The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns persist regarding linguistic constraints, biases, and the sycophantic tendencies of LLMs. This chapter explores how LLMs can be used to significantly scale up and democratise deliberation, particularly in fostering inclusivity and empowering traditionally marginalised groups. Drawing on concepts from Systemic-Functional Linguistics, the chapter examines how variations across language users (for example, with respect to socio-demographic groups) and across language use (for example, with respect to communicative functions) shape participation in AI-supported deliberation. The chapter presents AI-driven deliberation studies and assesses their potential to scaffold argumentation, enhance access, and reduce the influence of exclusionary linguistic norms and biases which are embedded in prestigious registers. At the same time, the chapter cautions against both overclaiming, which leads to unrealistic expectations, and underclaiming, which risks missed opportunities for AI-assisted engagement. The chapter concludes by identifying future research directions to maximise the democratic potential of AI-assisted participation while embedding ethical safeguards to counteract the reproduction of linguistic inequalities.

2606.19857 2026-06-19 cs.CL cs.AI 新提交 70%

Large Language Models Do Not Always Need Readable Language

大型语言模型并不总是需要可读语言

Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) The University of Sydney(悉尼大学) Hefei University of Technology(合肥工业大学) Xi’an Jiaotong University(西安交通大学) Nanjing University(南京大学)

专题命中 其他LLM :探索非标准文本表示,降低上下文开销。

AI总结 研究提出BabelTele表示法,将语义编码为紧凑、非标准文本,牺牲人类可读性但保持LLM可恢复性,实验表明可压缩至27.9%长度并保持99.5%语义保真度,降低上下文开销。

Comments 23 pages, 10 figures. Preprint

详情
AI中文摘要

大型语言模型(LLM)通常使用人类可读的自然语言进行提示和交互,即使目标读者是另一个模型。本文研究语义信息是否可以编码为紧凑、非标准的文本形式,这种形式牺牲了人类可读性,但能被LLM恢复。我们将这类以模型为中心的文本表示称为BabelTele,这里不是作为固定协议,而是作为探索LLM生成和解释此类表示能力的经验探针。通过可读性诊断、模型似然度量、人类问卷和下游任务评估,我们发现BabelTele可以显著偏离普通自然语言,同时为指令调优的LLM保留核心语义。作为一种任务无关的表示范式,BabelTele展示了高信息密度,即使文本体积压缩到原始长度的27.9%,也能保持99.5%的语义保真度。我们进一步评估了其在跨模型迁移、智能体记忆和多智能体通信中的语义鲁棒性。结果表明,BabelTele可以降低上下文开销,同时通常保持可靠的下游性能,但其有效性取决于压缩器-读取器对和任务设置。这些发现表明,人类可读性、自然语言典型性和模型端语义可恢复性可以部分解耦,为未来探索LLM系统中的模型原生表示开辟了道路。

英文摘要

Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model. This paper investigates whether semantic information can be encoded in compact, non-standard textual forms that sacrifice human readability while remaining recoverable by LLMs. We refer to this class of model-centric textual representations as BabelTele, approached here not as a fixed protocol but as an empirical probe into LLMs' capacity to generate and interpret such representations. Through readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, we find that BabelTele can substantially depart from ordinary natural language while preserving core semantics for instruction-tuned LLMs. As a task-agnostic representational paradigm, BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even when the text volume is condensed to 27.9% of its original length. We further evaluate its semantic robustness in cross-model transfer, agent memory, and multi-agent communication. Results suggest that BabelTele can reduce context overhead while generally maintaining reliable downstream performance, although its effectiveness depends on the compressor-reader pair and task setting. These findings indicate that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, opening a path toward model-native representations in future exploration of LLM systems.

2606.19826 2026-06-19 cs.CR cs.MA 新提交 70%

Heterogeneous LLM Debate Under Adversarial Peers: Honest Gains, Replacement Costs, and Resilience

对抗性同伴下的异构LLM辩论:诚实增益、替代成本与韧性

Prashanti Nilayam, Kiran Kumar Ramanna, Prashil Tumbade, Sankalp Nayak

专题命中 其他LLM :研究异构LLM辩论中诚实与对抗性同伴的影响。

AI总结 研究异构LLM辩论中诚实与对抗性同伴对修正行为的影响,发现诚实同伴降低有害修正率,对抗性同伴则逆转,且异构性在已有对手时也能作为防御。

详情
AI中文摘要

异构LLM辩论的动机在于,多样化的同伴可以相互纠正,但同样的交流既携带纠正也携带对抗性影响。我们通过跟踪异构同伴如何改变诚实智能体的修正行为来衡量哪种影响占主导:他们改变答案的频率,以及这种改变是纠正性的还是有害的。我们比较了匹配面板(同质基线、诚实混合和对抗混合)以及受污染面板(其中已存在一个恶意的同族同伴),涵盖四个模型家族和三个推理基准。一个诚实的异构同伴显著降低了有害修正,而对抗性同伴则逆转了这一效果。对于Llama-3.1-70B防御者在MATH-hard上,诚实插槽的有害修正率从同质面板的89%下降到有诚实同伴时的35%,而对抗性同伴使其回到90%。条件率对弱防御者隐藏了这种损害,但辩论结束时的翻转率暴露了它。该模式在家族和基准上保持符号一致,而其幅度随防御者-基准机制变化。我们还测量了当已存在一个对抗性同族同伴时的效果:一个诚实的异构同伴降低了有害修正率以及最初正确答案丢失的比率。在相同的Llama-3.1-70B设置下,添加的诚实同伴将最初正确项上的翻转率从同族对手下的31%降至6%。因此,异构性不仅是一个攻击面,而且当对手已经存在时,也是一种防御。

英文摘要

Heterogeneous LLM debate is motivated by the promise that diverse peers correct one another, but the same exchange that carries correction also carries adversarial influence. We measure which dominates by tracking how a heterogeneous peer changes the honest agents' revision behavior: how often they change their answer, and whether the change is corrective or harmful. We compare matched panels (homogeneous baseline, honest-mixed, and adversarial-mixed) and contaminated panels in which a malicious same-family peer is already present, spanning four model families and three reasoning benchmarks. An honest heterogeneous peer sharply lowers harmful revision, and an adversarial one reverses it. For Llama-3.1-70B defenders on MATH-hard, the honest-slot harmful-revision rate falls from 89% in the homogeneous panel to 35% with an honest peer, and an adversarial peer returns it to 90%. The conditional rate hides this damage on weak defenders, but the end-of-debate flip rate exposes it. The pattern keeps its sign across families and benchmarks while its magnitude varies with the defender-benchmark regime. We also measure the effects when an adversarial same-family peer is already present: an honest heterogeneous peer lowers both harmful revision and the rate at which initially-correct answers are lost. On the same Llama-3.1-70B setting, the added honest peer cuts the flip rate on initially-correct items from 31% under a same-family adversary to 6%. Heterogeneity is therefore not only an attack surface but, when an adversary is already present, also a defense.

2606.19815 2026-06-19 cs.CL 新提交 70%

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

聚类即一切:利用语言模型中的语义聚类预训练Tsetlin Machine以实现可解释性

Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech

发表机构 * Independent Researcher(独立研究员) University of California, Irvine(加州大学尔湾分校) University of the Chinese Academy of Sciences(中国科学院大学)

专题命中 其他LLM :利用语言模型语义聚类预训练Tsetlin Machine,提升可解释性。

AI总结 提出一种语义预训练框架,通过K-means或Top2Vec将文本聚类,用聚类-样本对预训练Tsetlin Machine,使其学习可解释的语义关键词,在五个数据集上性能优于传统方法且与BERT竞争。

详情
AI中文摘要

预训练语言模型如BERT在文本分类任务中表现强劲,但缺乏透明度,限制了在高风险场景中的应用。Tsetlin Machine (TM) 提供完全可解释的基于子句的推理,但捕获的语义信息有限,先前桥接两者的尝试依赖于静态词嵌入,忽略了上下文含义。我们提出一种语义预训练框架,无需使用嵌入即可将知识从预训练语言模型转移到TM中。文本样本通过K-means或Top2Vec被分组为语义一致的聚类,得到的聚类-样本对通过增强的Type I反馈预训练一个非否定TM。因此,TM学习到可解释的语义关键词,并在下游任务上进行微调。在五个数据集上,我们的方法显著优于传统和基于嵌入的TM,性能与BERT竞争,同时保持可解释性。

英文摘要

Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.

2606.19753 2026-06-19 cs.AI cs.SE 新提交 70%

Grounded Inference: Principles for Deterministically Encapsulated Generative Models

基于推理:确定性封装生成模型的原则

Marty O'Neill

发表机构 * Odenton, MD, USA(美国马里兰州奥登顿)

专题命中 其他LLM :讨论AI混合架构中生成模型的确定性封装原则。

AI总结 提出四种AI混合架构原语,实现概率模型的确定性封装,并指出两个行业反模式,为AI与传统系统集成提供基础框架。

Comments 12 pages, 3 figures

详情
AI中文摘要

将生成模型整合到传统计算系统中既带来了巨大的机遇,也带来了巨大的风险。尽管许多早期采用者已付出巨大代价认识到这些风险,但该领域仍需基础框架来降低AI融入传统系统的风险。本文通过定义四种AI混合架构的具体原语,旨在实现概率模型的确定性封装,从而奠定这一基础。此外,本文还确立了行业中广泛存在的两个总体反模式,作为该领域工程师的警示。该框架旨在实现AI与传统系统的成功集成,同时为生成模型提供商构建下一代生成模型接口提供基础。

英文摘要

The incorporation of generative models into traditional computational systems presents both enormous opportunity and tremendous peril. Although many early adopters have realized these perils at great expense, the field still requires foundational frameworks to de-risk incorporation of AI into traditional systems. This manuscript establishes this foundation through the definition of four specific primitives of AI blended architecture, designed to enable deterministic encapsulation of probabilistic models. It further establishes two overarching anti-patterns broadly represented across industry to serve as warnings for engineers in this field. This framework was designed to enable successful integration of AI into traditional systems while providing a foundation upon which generative model providers could build the next generation of generative model interfaces.

2606.19741 2026-06-19 cs.AI cs.LG 新提交 70%

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

通过演化程序瓶颈解释神经组合优化

Haocheng Duan, Yuxin Guo, Jieyi Bi, Anqi Xie, Sirui Li, Yining Ma, Cathy Wu

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Nanyang Technological University(南洋理工大学) Microsoft Research(微软研究院) Massachusetts Institute of Technology(麻省理工学院)

专题命中 其他LLM :利用LLM演化程序解释神经组合优化模型。

AI总结 提出演化程序瓶颈(EPB)框架,通过将黑盒神经组合优化模型蒸馏为可读程序组合,利用LLM和混合梯度下降实现可解释性,揭示模型行为与经典启发式变体的关系。

Comments Under Review

详情
AI中文摘要

神经组合优化(NCO)取得了强劲性能,但其黑盒性质仍然是部署和科学诊断的关键障碍。标准可解释性工具(如概念瓶颈模型)不适用于NCO,因为其决策是动态的、状态依赖的,且缺乏适当的概念词汇定义。为弥合这一差距,我们引入了演化程序瓶颈(EPB),据我们所知,这是首个通过将黑盒NCO模型蒸馏为人类可读程序组合来解释NCO策略的框架。EPB利用LLM自主演化一组程序,其中每个程序的每步动作分布作为瓶颈。EPB通过迭代框架工作:模块I固定程序库容量,并引入混合文本-数值梯度下降方案,该方案将学生路由器更新的数值梯度和基于LLM程序修订的文本梯度相结合;模块II通过故障目标扩展和冗余剪枝动态调整库容量。大量实验证明了EPB的有效性和广泛适用性,蒸馏后的程序组合在很大程度上保持了原始性能。EPB还揭示了NCO行为在优化阶段的变化,并且可以近似为经典启发式变体的组合。我们的工作推进了可解释NCO,并将EPB建立为解释序列决策模型的有前途工具。

英文摘要

Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic, state-dependent, and lack proper concept vocabulary definition. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, where the distilled program portfolios largely match original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

2606.19719 2026-06-19 cs.IR cs.CL cs.LG 新提交 70%

Closing the Calibration Gap in Semantic Caching

缩小语义缓存中的校准差距

Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal

发表机构 * New York University(纽约大学) Redis(Redis公司)

专题命中 其他LLM :语义缓存用于降低LLM推理成本。

AI总结 针对语义缓存系统中离线指标与部署性能的差距,提出P-CHR AUC和CRR指标,发现校准差距由训练目标主导,模型选择本质是校准问题。

Comments 23 pages, 2 figures. Source code: https://github.com/aditeyabaral/calibration-gap-semantic-caching ; Models and Datasets: https://huggingface.co/redis

详情
AI中文摘要

语义缓存通过为语义相似的查询提供缓存响应来降低LLM推理成本。标准实践使用PR-AUC评估这些系统,该指标仅衡量分数排序的好坏,而忽略它们在固定阈值下是否可用。我们表明这种不匹配会导致系统性的部署选择不佳,因为具有最高PR-AUC的模型通常在操作中最差。我们引入精度-缓存命中率(P-CHR)AUC,一种衡量缓存利用率水平上精度的缓存感知指标,以及校准保留率(CRR),它捕捉离线排序质量在部署中保留多少。我们将离线质量与部署质量之间的操作差距分解为可恢复的校准组件和由数据集正例率固定的不可约结构组件。我们的实验表明,校准差距由训练目标而非数据规模主导,事后校准只能部分缩小它。最终,语义缓存的模型选择是一个校准问题,而非排序问题,而测量它是缩小差距的第一步。

英文摘要

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.

2606.19704 2026-06-19 cs.AI 新提交 70%

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

超越静态排行榜:LLM智能体评估的预测有效性

Dhaval C. Patel, Kaoutar El Maghraoui, Shuxin Lin, Yusheng Li, Tianjun Feng, Chun-Yi Tsai, Yihan Sun, Wei Alexander Xin, Akshat Bhandari, Tanisha Rathod, Aaron Fan, Sanskruti Vijay Shejwal, Tomas Pasiecznik, Sagar Chethan Kumar, Tanmay Agarwal, Rohith Kanathur, Sam Colman, Amaan Sheikh, Dev Bahl, Ann Li, Krish Veera, Alimurtaza Mustafa Merchant, Shambhawi Baswaraj Bhure, Sajal Kumar Goyla, Chengrui Li, Kirthana Natarajan, Rui Li, Thomas Ajai, Rujing Li, Vivek G. Iyer, Sanjaii Vijayakumar, Yitong Bai, Ayal Yakobe, Darief Maes, Yassine Jebbouri, Tianyang Xu, Thai Quoc On, Vera Mazeeva, Winston Li, Yuval Shemla, Yeshitha Bhuvanesh, Rushin Bhatt, Siddharth Chethan Gowda, Alisha Vinod, Caroline Cahill, Shriya Aishani Rachakonda, Yunfeng Chen, Aryaman Agrawal, Aman Upganlawar, Mao Le Jonathan Ang, Yubin Sally Go, Madhav Rajkondawar, Yang-Jung Chen, Trisha Maturi, Ananya Kapoor, Andrew Li, Shrey Arora, Mana Abbaszadeh, Shen Li, Charles Xu, Byeolah Kwon

发表机构 * IBM

专题命中 其他LLM :研究LLM智能体评估的泛化问题。

AI总结 本文通过14项并行研究,论证聚合分数排行榜无法泛化到分布外场景,提出基于预测有效性的排名配置方法,并设计可证伪的分布外评估标准。

Comments 17 pages, 2 tables, 5 figures

详情
AI中文摘要

智能体基准测试发展迅速,但单一基准测试无法涵盖部署所涉及的多个维度。本文汇总了迄今为止最大规模的基于MCP的工业智能体基准测试的协调深度分析:14项并行实现研究,涵盖新的资产类别(包括多模态视觉扩展)、替代编排、检索策略、推理模式、基础设施优化和评估方法探索。结合这些研究与七个先前的智能体基准测试,我们认为聚合分数排行榜系统性地低估了部署智能体的评估。基于聚合分数的排名无法泛化到分布外设置;最近的公开到私有竞赛回顾提供了这种排名不稳定性的直接经验证据。我们提出通过预测有效性(样本内与样本外排名之间的相关性)而非样本内均值来配置排名,并报告了一个十二层测量装置,该装置揭示了HELM及其智能体时代后继者所忽略的部署相关维度。该立场通过三个具有明确阈值的可证伪分布外标准得以操作化;现有证据部分支持但过于薄弱无法确认。最后,我们提出了一个预注册的试点设计和下一代智能体基准测试应报告的内容的领域级愿景。

英文摘要

Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.

2606.19591 2026-06-19 cs.CL cs.AI 新提交 70%

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

基于BART的分层策略用于越南语抽象式多文档摘要

Vu Nguyen Nguyen Xuan, Huy Ngo Quang

发表机构 * Aimesoft JSC(Aimesoft股份公司)

专题命中 其他LLM :基于BART的越南语多文档摘要,属于语言模型应用

AI总结 提出一种新颖简单的基于黄金摘要缩短文档的分层策略,结合BART模型实现越南语多文档抽象式摘要,在VLSP 2022测试集上达到ROUGE2-F1 0.2468,并利用外部数据增强训练。

Comments originally written in 2022

详情
AI中文摘要

在本技术报告中,我们专注于解决越南语多文档抽象式摘要的挑战,该任务在2022年越南语言与语音处理国际研讨会(VLSP)上提出。我们选择遵循流行的分层方法,即先浓缩每个文档,然后进行聚合和摘要。我们提出了一种新颖而简单的策略来缩短文档,该策略由黄金摘要驱动,从而确保分层方法各阶段之间的高度相关性。我们的方法在VLSP的公开测试集上达到了0.2468的ROUGE2-F1分数,并且能够生成流畅简洁的摘要。此外,我们利用外部来源获取额外数据,这极大地增加了越南语多文档摘要的数据量。额外数据已向社区公开。

英文摘要

In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e. condensing each document followed by aggregation and summarization. We propose a novel yet simple strategy to shorten documents that is driven by the golden summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP's public test set, and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly enhances the quantity of data for Vietnamese multi-document summarization. The additional data is made available for the community.

2606.19588 2026-06-19 cs.AI cs.CR cs.LO 新提交 70%

Analyzing the Narration Gap in LLM-Solver Loops

分析大语言模型-求解器循环中的叙述差距

Zunchen Huang, Songgaojun Deng

发表机构 * Eindhoven University of Technology(埃因霍温理工大学)

专题命中 其他LLM :分析LLM-求解器循环中的叙述差距和安全漏洞

AI总结 研究LLM与SAT/SMT求解器混合推理中,将求解器输出转化为用户答案的叙述步骤存在的安全漏洞,通过形式化建模和实验评估发现证书门控可保证求解结果正确,但对抗攻击可反转结论。

详情
AI中文摘要

诸如SAT和SMT求解器之类的形式化工具,当安全或安保关键问题可以用逻辑表述时,越来越多地被嵌入到语言模型推理流程中。与思维链不同(其步骤从模型分布中采样,没有形式化保证),求解器产生可靠且可独立验证的答案。然而,这种可靠性保证可能在求解器与模型之间的交互中丢失。混合流程包含三个组成部分:形式化问题、求解问题以及叙述结果。先前的工作研究了形式化和求解,但未涉及叙述——即将形式化工具的输出转化为用户答案的步骤。为了填补叙述差距,我们首先将LLM-求解器循环建模为经过验证的决策过程。我们进一步在提示注入下评估了五个开源模型,发现证书门控使求解器判定可靠,而攻击者可以通过不同措辞和渠道反转已验证的结论。我们研究了通过强化提示进行缓解的方法,该方法显著减少了注入但无法完全消除,并且在自适应攻击下仍然存在问题。结合形式化分析和实证研究,我们表明在LLM-求解器循环中,鲁棒性无法延伸到用户最终读取的答案。

英文摘要

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety or security critical question can be formulated in logic. Unlike chain of thought whose steps are sampled from the model distribution without formal guarantee, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied the formalization and decision, but not narration, which is the step that turns a formal tool's output into the user answer. To fill the narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-sourced models under prompt injection, and we find certificate gating makes the solver verdict sound, while an adversary can invert a verified conclusion across phrasings and channels. We study the mitigation through hardened prompt that reduces injection significantly but cannot eliminate it and still suffers under adaptive attack. Combining the formal analysis and empirical studies, we show in the LLM-solver loop, robustness does not reach to the answer that the user finally reads.

2606.19559 2026-06-19 cs.AI cs.CL 新提交 70%

Uncertainty Decomposition for Clarification Seeking in LLM Agents

LLM代理中寻求澄清的不确定性分解

Gregory Matsnev

发表机构 * AI Talent Hub, ITMO University(AI Talent Hub, ITMO大学)

专题命中 其他LLM :基于提示的不确定性分解方法用于LLM代理

AI总结 提出一种基于提示的不确定性分解方法,将行动置信度与请求不确定性分离,使代理能在任务规范模糊时主动寻求澄清,在五个LLM骨干上平均澄清F1提升36%-73%。

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

详情
AI中文摘要

最近的立场论文认为,经典的偶然/认知不确定性框架对于交互式大型语言模型(LLM)代理是不够的,并呼吁需要一种对欠规范感知、可分解且可通信的不确定性表示,以解锁新的代理能力,如主动寻求澄清和共享心理模型构建。实际部署约束——黑盒API、交互延迟预算以及缺乏标注轨迹——排除了基于logprob、多采样和基于训练的方法,使得基于提示的估计成为在部署时浮现此类信号的最可行方案。我们通过一种简单的基于提示的分解来响应这一呼吁,该分解将行动置信度与请求不确定性(u)分离,使代理能在任务规范模糊时请求澄清。为了评估它,我们引入了两个增强澄清的基准(WebShop-Clarification和ALFWorld-Clarification),其中50%的任务被故意欠规范,并在这些变体以及用于故障检测的标准WebShop、ALFWorld和REAL基准上,系统地将所提出的分解与ReAct+UE和不确定性感知记忆(UAM)在五个LLM骨干(GPT-5.1、DeepSeek-v3.2-exp、GLM-4.7、Qwen3.5-35B、GPT-OSS-120B)上进行比较。在五个骨干上平均,所提出的分解在ALFWorld-Clarification上比ReAct+UE提高了73%的澄清F1,比UAM提高了36%,并且在WebShop-Clarification的每个骨干以及ALFWorld-Clarification的五个骨干中的四个上领先澄清F1,表明增益超越了单个LLM。

英文摘要

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

2606.19558 2026-06-19 cs.LG cs.CL 新提交 70%

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

位移不是方向:评估量化LLM部署的保真度指标

Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

发表机构 * ByteShape University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(向量人工智能研究所)

专题命中 其他LLM :评估量化LLM部署的保真度指标,属于LLM应用

AI总结 本文研究KL散度等保真度指标在量化语言模型部署中与下游基准分数的相关性,发现整体强相关但在近基线区域失效,归因于KL散度主要衡量分歧量而非方向。

详情
AI中文摘要

保真度指标,如每个token的KL散度(KLD)与高精度参考模型的比较,常被用作基准质量的低成本代理。我们在Qwen3.6-35B-A3B的28个量化模型和Devstral-Small-2-24B的41个量化模型上,通过一系列下游基准测试验证了这一做法。我们发现,在整个量化队列中,KLD与基准分数强相关(Qwen上ρ=-0.72,Devstral上ρ=-0.86,p<0.001)。然而,在接近基线的静默区,这种关系变得不显著(Qwen上ρ=+0.00,Devstral上ρ=-0.24,p=0.36)。这种失效在14种测量变体中持续存在,包括不同的KLD聚合方式、困惑度公式、top-1一致性、校准语料库和上下文长度。在逐提示层面,KLD在代码任务上仅有较弱的失败预测能力,在LiveCodeBench上五个模型的失败与通过几何平均比在[1.08,1.22]之间,并且作为跨模型路由器失败,在分歧提示上仅达到42.3%-49.4%的准确率。我们将这种失效归因于结构分解:KLD主要衡量与参考模型的分歧量,在静默区复合ρ在Qwen上为+0.94(p<0.001),在Devstral上为+0.55(p=0.03),而其与分歧方向的关系较弱且依赖于任务。

英文摘要

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($ρ=-0.72$ on Qwen and $ρ=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($ρ=+0.00$ on Qwen and $ρ=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $ρ=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.

2606.19476 2026-06-19 cs.LG cs.AI 新提交 70%

Can In-Context Learning Support Intrinsic Curiosity?

上下文学习能否支持内在好奇心?

Eric Elmoznino, Sangnie Bhardwaj, Johannes von Oswald, Rajai Nasser, Blaise Agüera y Arcas, João Sacramento, Rif A. Saurous, Guillaume Lajoie

发表机构 * Google – Paradigms of Intelligence Team(Google – 智能范式团队) Google DeepMind

专题命中 其他LLM :利用序列模型上下文学习作为世界模型,探索好奇心。

AI总结 研究利用序列模型的上下文学习能力作为即时无更新世界模型,以消除传统内在好奇心方法中梯度下降的计算瓶颈,理论证明在非时间设置下可渐近收敛到真实学习进度。

详情
AI中文摘要

有效的机器学习不仅取决于我们如何对数据建模,还取决于我们选择收集哪些数据。虽然大型序列模型已经彻底改变了数据建模,但自动数据选择或“内在好奇心”的问题仍然是一个重大挑战。经典方法通过基于智能体的“学习进度”奖励来激励探索,该奖励衡量新获得的观测在多大程度上改进了世界模型的预测能力。然而,传统上评估这些奖励需要在每个轨迹内进行昂贵的梯度下降内循环更新,这使得它们在规模上计算上不可行。在这项工作中,我们研究序列模型涌现的上下文学习(ICL)能力是否可以通过作为即时的、无需更新的世界模型来消除这一瓶颈。具体来说,我们评估是否可以训练一个探索策略来最大化学习进度,仅使用上下文学习者的预测误差和反事实上下文操作。我们首先证明,在一般马尔可夫决策过程中,这实际上不可能以无偏的方式实现:由此产生的内在奖励要么包含干扰项,使其对真实学习进度的估计产生偏差,要么无法使用上下文学习者的预测误差来实现。相反,我们对于非时间设置的一个广泛子类(包括主动学习和贝叶斯实验设计)证明了积极结果:在这里,ICL派生的奖励成功界定了真实学习进度并渐近收敛到它。我们通过连续和符号环境中的受控实验证实了我们的理论,表明我们的ICL驱动框架成功训练了以最优方式进行探索的好奇数据收集策略。

英文摘要

Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.

2606.19404 2026-06-19 cs.LG cs.CL 新提交 70%

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

推理的热力学特征:用于大型语言模型幻觉检测的自由能和谱形因子诊断

Salim Khazem

发表机构 * Talan Research & Innovation Center(Talan研究与创新中心)

专题命中 其他LLM :热力学特征用于LLM幻觉检测。

AI总结 提出自由能签名(Fes)作为谱描述符,将注意力拉普拉斯视为哈密顿量并提取热力学势和随机矩阵理论谱形因子,用于检测LLM幻觉,无需训练即可实现高AUROC。

详情
AI中文摘要

大型语言模型(LLM)中的幻觉检测对部署至关重要,近期研究表明注意力导出的图拉普拉斯谱携带关于推理质量的强信号。然而,先前的谱诊断仅通过少数特征值或手工选取的标量来总结拉普拉斯谱,忽略了其大部分结构。我们提出自由能签名(Fes),一种谱描述符,将每层的注意力拉普拉斯视为哈密顿量,并提取其热力学势(配分函数、自由能、谱熵、热容)以及随机矩阵理论(RMT)谱形因子。我们证明了三个结果:(i)Fes在注意力扰动下的Lipschitz稳定性;(ii)一个表达性结果,表明Fes丰富了有限谱摘要,并在明确的规则性和网格分辨率假设下逼近矩导出的谱泛函;(iii)基于Fes构建的无训练检测器AUROC的有限样本PAC界。实验上,在六个开源LLM和六个基准测试中,基于Fes描述符的轻量级探测在注意力谱基线中实现了最强的平均AUROC,相比LapEig平均提高+6.5 AUROC点,相比GoR-4平均提高+2.4点,且无需更新底层LLM。在完全无监督设置下,RMT偏差得分达到平均AUROC 0.71,提供了一个无标签但较弱的检测器。互补的RMT分析表明,正确生成表现出更接近Wigner-Dyson的谱统计,而幻觉表现出更接近Poisson的统计。匿名代码和配置在补充材料中提供。

英文摘要

Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials partition function, free energy, spectral entropy, heat capacity together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i)~Lipschitz stability of Fes under attention perturbation; (ii)~an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii)~a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by $+6.5$ AUROC points and over GoR-4 by $+2.4$ points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC $0.71$, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.

3. 后训练 2 篇

2606.20508 2026-06-19 cs.AI cs.LG 新提交 70%

What Do Safety-Aligned LLMs Learn From Mixed Compliance Demonstrations?

安全对齐的LLM从混合顺从演示中学到了什么?

Sihui Dai, Mann Patel

专题命中 后训练 :涉及偏好优化训练阶段对安全的影响

AI总结 研究通过混合良性顺从演示和有害顺从演示,探究演示组成如何驱动有害顺从,发现演示内容、顺序和训练方法影响模型提取的信息。

详情
AI中文摘要

先前工作表明,上下文演示可以越狱语言模型,但模型如何解释不同类型的顺从演示仍不清楚。我们通过混合良性顺从演示(无害请求,有帮助响应)与有害顺从演示(有害请求,有帮助响应)并测试关于演示组成如何驱动有害顺从的三个假设来研究这一点。在四个模型中,我们发现良性和有害演示不可互换:良性演示根据模型不同可以减少或增加有害顺从。我们进一步表明,偏好优化是防止良性演示增加有害顺从的关键训练阶段,演示顺序表现出强烈的近因偏差,并且模型在拒绝与上下文学习的交互方式上有所不同:一些模型在拒绝时也采用演示的格式,而其他模型在拒绝时覆盖所有上下文信号。综合来看,这项工作超越了展示基于演示的越狱有效,而是描述了其工作原理:模型从顺从演示中提取的内容取决于演示内容、顺序和训练方法。

英文摘要

Prior work has shown that in-context demonstrations can jailbreak language models, but it remains unclear how models interpret different types of compliance demonstrations. We study this by mixing benign compliance demonstrations (non-harmful request, helpful response) with harmful compliance demonstrations (harmful request, helpful response) and testing three hypotheses about how demonstration composition drives harmful compliance. Across four models, we find that benign and harmful demonstrations are not interchangeable: benign demonstrations can either reduce or increase harmful compliance depending on the model. We further show that preference optimization is the critical training stage that prevents benign demonstrations from increasing harmful compliance, that demonstration ordering exhibits strong recency bias, and that models differ in how refusal interacts with in-context learning: some adopt demonstrated formatting even when refusing, while others override all in-context signals upon refusal. Taken together, this work moves beyond showing that demonstration-based jailbreaking works to characterizing how it works: what models extract from compliance demonstrations depends on demonstration content, ordering, and training methodology.

2606.20482 2026-06-19 cs.CL cs.HC cs.LG 新提交 70%

Your Mouse and Eyes Secretly Leak Your Preference: LLM Alignment using Implicit Feedback from Users

你的鼠标和眼睛悄悄泄露你的偏好:利用用户隐式反馈进行LLM对齐

Haw-Shiuan Chang, Jeffrey Gomez, Mehul Patwari, Aryan Sajith, Hamed Zamani

发表机构 * University of Massachusetts, Amherst(马萨诸塞大学阿默斯特分校) York University(约克大学)

专题命中 后训练 :训练奖励模型用于DPO对齐

AI总结 针对显式反馈稀缺的问题,提出利用鼠标轨迹和眼动数据等隐式反馈训练奖励模型,将文本奖励模型准确率从55%提升至64%,并显著提高DPO对齐后响应质量。

详情
AI中文摘要

为了对齐大型语言模型(LLM),大多数现有方法收集显式的人类反馈,并基于响应文本训练奖励模型来预测人类偏好。这些现有方法有两个关键局限性。首先,用户很少为LLM响应提供显式反馈,这使得高质量偏好标注的收集成本高昂。其次,这些方法没有利用隐式人类反馈,而隐式反馈已被证明对互联网巨头的经济护城河至关重要。为了量化隐式反馈的价值,我们构建了一个名为IFLLM的新数据集,收集了来自59名Mechanical Turk工作者的1336个多轮问题、他们的鼠标轨迹以及通过网络摄像头对LLM响应的眼动注视点。IFLLM显示用户具有非常多样化的注视行为和鼠标轨迹。基于隐式用户反馈的奖励模型将基于文本的奖励模型准确率从55%提升至64%,并在将DPO应用于八个LLM后,相对响应质量改进几乎翻了三倍,证明了隐式反馈在现实场景中的价值。我们的数据收集网站、数据集和代码可在以下网址找到:此https URL。

英文摘要

To align a Large Language Model (LLM), most existing methods collect explicit human feedback and train a reward model to predict the human preference based on the response text. These existing methods have two key limitations. First, the users rarely provide explicit feedback for LLM responses, which makes the high-quality preference annotation expensive to collect. Second, the methods do not leverage implicit human feedback, which has proven vital to the economic moats of Internet giants. To quantify the value of implicit feedback, we build a new dataset called IFLLM, which collects 1336 multi-turn questions from the 59 Mechanical Turk workers, their mouse trajectories, and eye gazing points to the LLMs' responses from their webcams. IFLLM shows that the users have very diverse types of gazing behavior and mouse trajectories. Our reward model based on the implicit user feedback boosts the accuracy of the text-based reward model from 55% to 64% and nearly triples the relative response quality improvements after applying the DPO to eight LLMs, demonstrating the value of implicit feedback in the wild. Our data collection website, dataset, and codes can be found at https://github.com/themehulpatwari/llm-implicit-feedback/.

4. 长上下文 1 篇

2606.20474 2026-06-19 cs.LG cs.AI cs.PF 新提交 70%

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant: 面向上下文密集型智能体的4位KV缓存

Inesh Chakrabarti, David Limpus, Aditi Ghai Rana, Bowen Bao, Spandan Tiwari, Thiago Crepaldi, Ashish Sirasao

发表机构 * Advanced Micro Devices(超威半导体) University of California, Los Angeles(加州大学洛杉矶分校) Purdue University(普渡大学)

专题命中 长上下文 :针对长上下文场景优化KV缓存,降低延迟。

AI总结 针对上下文密集型智能体场景,提出UltraQuant方法,通过4位KV缓存压缩、旋转量化和代码本量化,结合AMD GPU优化,在长上下文多轮任务中延迟降低3.47倍,吞吐量提升1.63倍。

Comments 11 pages, 9 figures

详情
AI中文摘要

上下文密集型智能体给键值(KV)缓存带来了异常压力:长前缀在多个短轮次中重复使用,而并发性决定了服务系统能否保持GPU利用率。我们针对此场景研究4位KV缓存压缩,采用TurboQuant风格的旋转和代码本量化作为质量锚点,vLLM FP8 KV缓存作为部署锚点。我们报告三项贡献。首先,我们将4位KV缓存框架用于多轮智能体工作负载,其中任务质量、缓存驻留和服务吞吐量必须联合衡量。其次,我们描述了使4位路径鲁棒所需的实际设计选择,包括非对称K/V处理、Walsh-Hadamard旋转、QJL移除和块尺度变体。第三,我们展示了AMD GPU上的服务优化,包括优化的解码注意力内核和UltraQuant,一种使用FP8查询、FP4 KV张量、UE8M0组尺度和CDNA4上原生缩放MFMA支持的FP4近似路径。在长上下文、多轮智能体工作负载上,UltraQuant在缓存压力大的后期轮次中将P50首令牌延迟降低了3.47倍(所有轮次平均2.3倍),并将输出吞吐量比FP8 KV基线提高了1.63倍。

英文摘要

Context-heavy agents place unusual pressure on the key-value (KV) cache: long prefixes are reused across many short turns, while concurrency determines whether the serving system can keep GPUs utilized. We study 4-bit KV-cache compression for this setting, using TurboQuant-style rotation and codebook quantization as a quality anchor and vLLM FP8 KV caching as the deployment anchor. We report three contributions. First, we frame 4-bit KV caching around multi-round agent workloads where task quality, cache residency, and serving throughput must be measured jointly. Second, we describe the practical design choices needed to make the 4-bit path robust, including asymmetric K/V treatment, Walsh-Hadamard rotation, QJL removal, and block-scale variants. Third, we present serving optimizations on AMD GPUs, including optimized decode-attention kernels and UltraQuant, an FP4 approximation path that uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support on CDNA4. On a long-context, multi-turn agentic workload, UltraQuant cuts P50 time-to-first-token by 3.47x in the cache-pressured late rounds (2.3x across all rounds) and raises output throughput by 1.63x over the FP8 KV baseline.