arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

语言大模型 / LLM

大语言模型、预训练、指令微调、后训练和语言模型应用。

今日/当前日期收录 83 信号源:cs.CL, cs.AI, cs.LG
2606.20436 2026-06-19 cs.CR cs.AI 新提交 70%

Multi-View Decompilation for LLM-Based Malware Classification

基于LLM的恶意软件分类的多视角反编译

Bercan Turkmen, Vyas Raina

发表机构 * Independent Researcher(独立研究员) SPARK

专题命中 其他LLM :利用LLM分类反编译代码,涉及LLM应用

AI总结 提出多反编译器视角提升LLM恶意软件分类性能,通过Ghidra和RetDec的互补伪C代码提高召回率和F1分数。

详情
AI中文摘要

恶意软件分析师通常在源代码不可用时,通过反编译的伪C代码检查编译后的二进制文件。最近的研究表明,大型语言模型(LLMs)可以通过将反编译代码分类为良性或恶意来辅助这一过程,但现有的流程通常依赖于单一的反编译器视角。我们认为这一假设是脆弱的:反编译器是有损的启发式工具,不同的反编译器可能暴露同一二进制文件的不同特征。我们整理了一个包含良性工具和恶意程序的基准测试,涵盖一系列威胁行为。每个样本都使用Ghidra和RetDec进行编译和反编译,生成匹配的伪C视图。在来自主要模型系列的一系列LLMs中,我们发现提供两种反编译器视图可以提高恶意类别的F1分数,主要是通过提高恶意样本的召回率。一致性分析进一步表明,Ghidra和RetDec会犯部分不同的错误,支持反编译器输出提供互补证据的观点。我们的结果表明,多反编译器提示是一种简单、无需训练的方法,可以在实际环境中改进基于LLM的恶意软件分类。

英文摘要

Malware analysts often inspect compiled binaries through decompiled pseudo-C, when source code is unavailable. Recent work suggests that large language models (LLMs) can assist this process by classifying decompiled code as benign or malicious, but existing pipelines typically rely on a single decompiler view. We argue that this assumption is fragile: decompilers are lossy heuristic tools, and different decompilers can expose different artefacts of the same binary. We curate a benchmark of benign utilities and malicious programs spanning a range of threat behaviors. Each sample is compiled and decompiled with both Ghidra and RetDec, yielding matched pseudo-C views. Across a range of LLMs from major model families, we find that providing both decompiler views improves malicious-class F1, mainly by increasing recall on malicious samples. Agreement analyses further show that Ghidra and RetDec make partially different errors, supporting the view that decompiler outputs provide complementary evidence. Our results suggest that multi-decompiler prompting is a simple, training-free way to improve LLM-based malware triage in practical settings.

2606.20295 2026-06-19 cs.SE cs.CL 新提交 70%

Token-Operations-Oriented Inference Optimization Techniques for Large Models

面向令牌操作的大模型推理优化技术

Shiguo Lian, Kai Wang, Zhaoxiang Liu, Wen Liu, Minjie Hua, Yutong Liu, Jiangze Yan, Xin Wang, Cong Wang, Yilin Zhang, Yi Shen, Jieyun Huang, Fang Zhao, Huanlin Gao, Ping Chen, Xinyu Yang, Kaikai Zhao, Yao Zhao, Xinggang Wang, Huishuai Zhang, Dongyan Zhao, Junping Du, Tao Chen, Xiang Gao, Qinghuai Ma

发表机构 * China’s National Data Administration(中国国家数据管理局)

专题命中 其他LLM :综述大模型推理优化技术

AI总结 本文提出多模型融合、模型优化、计算-模型融合、计算-网络-模型融合四层技术架构,系统综述各层关键技术及产业现状,旨在降低令牌成本、提升服务效率、保障供应稳定性,推动大模型服务从可调用到可运营的转变。

Comments 62 pages, 36 figures

详情
AI中文摘要

大模型推理优化是支撑大模型服务可扩展、低成本、高稳定运行的关键基础。本文以面向令牌的推理优化技术为核心,首次提出由多模型融合、模型优化、计算-模型融合、计算-网络-模型融合组成的四层技术架构,系统梳理了这四层的关键技术和产业现状,并分析了相关技术在实际业务场景中的应用价值。本文为降低令牌生产成本、提高令牌服务效率、保障令牌供应稳定性、推动大模型服务从可调用到可运营的转变提供了实用的技术路径。

英文摘要

Large model inference optimization serves as a key foundation for supporting the scalable, low-cost, and highly stable operation of large model services. Centered on token-oriented inference optimization technology, this paper proposes for the first time a four-layer technical architecture consisting of Multi-model Fusion, Model Optimization, Compute-Model Fusion, and Compute-Network-Model Fusion. It systematically reviews the key technologies and current industry status across these four levels and analyzes the application value of related technologies in real-world business scenarios. This paper provides a practical technical path for reducing token production costs, improving token service efficiency, ensuring the stability of token supply, and driving the transition of large model services from being merely callable to being operable.

2606.20287 2026-06-19 cs.CL 新提交 70%

PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

PsyScore: 一种心理测量感知的特质自适应作文评分与最近发展区支架反馈框架

Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng

发表机构 * Department of Educational Psychology, East China Normal University(华东师范大学教育心理学系) Shanghai Institute of Artificial Intelligence for Education, East China Normal University(华东师范大学上海智能教育研究院) School of Computer Science and Technology, East China Normal University(华东师范大学计算机科学与技术学院)

专题命中 其他LLM :使用LLM生成自适应反馈

AI总结 提出PsyScore框架,通过共享潜在能力表示整合诊断评估与教学支架,包括特质自适应神经IRT评分器、ZPD支架反馈生成器和多视角反馈评估策略,在ASAP++数据集上实现竞争性评分性能并提供更符合教学法的反馈。

详情
AI中文摘要

有效的自动作文评分(AES)应支持可靠评估和可操作的教学反馈。然而,现有方法通常将评分和反馈视为独立组件:神经评分模型可解释性有限,而基于大语言模型(LLM)的反馈通常对学习者熟练度不敏感。为解决这一碎片化问题,本工作提出PsyScore,一个心理测量感知的框架,通过共享潜在能力表示整合诊断评估与教学支架。PsyScore包含三个关键模块:特质自适应神经IRT评分器,将分级部分信用模型(GPCM)融入神经架构,能够在保持心理测量可解释性的同时精确估计学生能力;ZPD支架反馈生成器,根据诊断出的能力参数调节多智能体反馈策略,以适应不同熟练水平的教学重点;以及多视角反馈评估策略,通过成对偏好判断和学生修订模拟评估反馈质量。在ASAP++数据集上的实验表明,PsyScore在提供更具教学一致性的反馈的同时,实现了有竞争力的评分性能。

英文摘要

Effective Automated Essay Scoring (AES) are expected to support both reliable assessment and actionable instructional feedback. However, existing approaches often treat scoring and feedback as separate components: neural scoring models provide limited interpretability, while Large Language Model (LLM)-based feedback is typically insensitive to learners proficiency levels. To address this fragmentation, this work proposes PsyScore, a psychometrically-aware framework that integrates diagnostic assessment with instructional scaffolding through a shared latent ability representation. PsyScore comprises three key modules: a Trait-Adaptive Neural IRT Scorer that incorporates the Graded Partial Credit Model (GPCM) into a neural architecture, enabling the precise estimation of student ability while maintaining psychometric interpretability, a ZPD-Scaffolded Feedback Generator, which conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels, and a Multi-Perspective Feedback Evaluation Strategy that assesses feedback quality via pairwise preference judgements and student revision simulations. Experiments on the ASAP++ dataset demonstrate that PsyScore achieves competitive scoring performance while providing more pedagogically aligned feedback.

2606.19941 2026-06-19 cs.LG 新提交 70%

Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds

组合性在窄深度-连接性区域中涌现:架构约束与解流形

Dat H. Do, Rushi Shah, Duc V. Le, Dianbo Liu

发表机构 * National University of Singapore(新加坡国立大学) University of Twente(特温特大学)

专题命中 其他LLM :研究组合性在稀疏网络中的涌现机制

AI总结 研究发现组合性仅在特定稀疏网络和特定深度区间涌现,提出基于相似性的剪枝和深度预测方法,并用理论框架解释原因。

详情
AI中文摘要

组合性被认为是泛化的基础,使模型能够在新颖组合中重用有意义的原语。然而,使用标准梯度优化训练的模型很少且通常仅微弱地表现出组合内部结构,并且尚不清楚这种组合性如何或为何形成。在这项工作中,我们表明组合性在一个狭窄的连接性-深度最佳点涌现。沿着连接性轴,组合性仅出现在某些特定稀疏网络中,严重依赖于保留哪些连接而非仅权重的稀疏性。沿着深度轴,组合性在一个狭窄的、目标依赖的区域内涌现,在特定深度达到峰值,而更浅和更深的网络都失败。当深度或连接性条件被违反时,梯度下降会静默地收敛到破碎解而非组合解。为了发现并利用这种涌现,我们引入了(i)基于相似性的剪枝(SP)以恢复组合连接性,以及(ii)一个启发式深度预测器以估计组合性最可能出现的深度。最后,我们通过基于组合稀疏性、体积比论证和特征干扰界限的理论框架支持这些实证发现,解释了为什么组合解仅在狭窄的深度-连接性区域内可达。

英文摘要

Compositionality is believed to be the foundation for generalization, enabling models to reuse meaningful primitives in novel combinations. Yet, models trained with standard gradient-based optimization rarely, and often only weakly, exhibit compositional internal structure, and it remains unclear how or why such compositionality forms. In this work, we show that compositionality emerges in a narrow connectivity-depth sweet spot. Along the connectivity axis, compositionality only appears in some specifically sparse networks, heavily depends on which connections remain rather than on weights' sparsity alone. Along the depth axis, compositionality emerges within a narrow, target-dependent regime, peaking at specific depths, while both shallower and deeper networks fail. When either the depth or connectivity condition is violated, gradient descent silently converges to fractured solutions rather than compositional ones. To discover and exploit this emergence, we introduce (i) similarity-based pruning (SP) to recover compositional connectivity and (ii) a heuristic depth predictor to estimate where compositionality is most likely to appear. Finally, we support these empirical findings with a theoretical framework based on compositional sparsity, volume-ratio arguments, and feature-interference bounds, explaining why compositional solutions are reachable only in a narrow depth-connectivity regime.

2606.19864 2026-06-19 cs.CL 新提交 70%

The Almost Intelligent Revolution: Options for Scaling Up Deliberation and Empowering People with AI

近乎智能的革命:扩大审议规模并利用AI赋能人类的选项

Serge Sharoff

发表机构 * Centre on Participatory and Deliberative Democracy(参与性和协商性民主研究中心)

专题命中 其他LLM :LLM用于民主审议规模扩大。

AI总结 探讨大型语言模型如何通过系统功能语言学视角扩大民主审议规模,增强包容性并赋权边缘群体,同时警惕过度承诺与低估风险。

Comments Published in /Handbook of Democracy in the Era of Artificial Intelligence/ edited by Evangelos Pournaras, Srijoni Majumdar, Carina Ines Hausladen, and Dirk Helbing. 2026

详情
AI中文摘要

大型语言模型在公共话语中的日益突出为民主审议带来了机遇和挑战。虽然红队策略有助于缓解特定风险,但关于语言限制、偏见和LLM的谄媚倾向等更广泛的担忧仍然存在。本章探讨如何利用LLM显著扩大和民主化审议,特别是在促进包容性和赋权传统边缘群体方面。借鉴系统功能语言学的概念,本章考察了语言使用者之间的差异(例如,关于社会人口群体)和语言使用中的差异(例如,关于交际功能)如何影响AI支持的审议参与。本章介绍了AI驱动的审议研究,并评估了它们在支撑论证、增强可及性以及减少嵌入在声望语域中的排斥性语言规范和偏见的影响方面的潜力。同时,本章警告不要过度承诺(导致不切实际的期望)和低估承诺(冒着错失AI辅助参与机会的风险)。最后,本章确定了未来的研究方向,以最大化AI辅助参与的民主潜力,同时嵌入伦理保障以抵消语言不平等的再生产。

英文摘要

The increasing prominence of Large Language Models (LLMs) in public discourse presents both opportunities and challenges for democratic deliberation. While red teaming strategies help mitigate specific risks, broader concerns persist regarding linguistic constraints, biases, and the sycophantic tendencies of LLMs. This chapter explores how LLMs can be used to significantly scale up and democratise deliberation, particularly in fostering inclusivity and empowering traditionally marginalised groups. Drawing on concepts from Systemic-Functional Linguistics, the chapter examines how variations across language users (for example, with respect to socio-demographic groups) and across language use (for example, with respect to communicative functions) shape participation in AI-supported deliberation. The chapter presents AI-driven deliberation studies and assesses their potential to scaffold argumentation, enhance access, and reduce the influence of exclusionary linguistic norms and biases which are embedded in prestigious registers. At the same time, the chapter cautions against both overclaiming, which leads to unrealistic expectations, and underclaiming, which risks missed opportunities for AI-assisted engagement. The chapter concludes by identifying future research directions to maximise the democratic potential of AI-assisted participation while embedding ethical safeguards to counteract the reproduction of linguistic inequalities.

2606.19857 2026-06-19 cs.CL cs.AI 新提交 70%

Large Language Models Do Not Always Need Readable Language

大型语言模型并不总是需要可读语言

Jiayi Zhu, Haoxuan Peng, Junxi Wang, Liang Ke, Chen Zhang, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) The University of Sydney(悉尼大学) Hefei University of Technology(合肥工业大学) Xi’an Jiaotong University(西安交通大学) Nanjing University(南京大学)

专题命中 其他LLM :探索非标准文本表示,降低上下文开销。

AI总结 研究提出BabelTele表示法,将语义编码为紧凑、非标准文本,牺牲人类可读性但保持LLM可恢复性,实验表明可压缩至27.9%长度并保持99.5%语义保真度,降低上下文开销。

Comments 23 pages, 10 figures. Preprint

详情
AI中文摘要

大型语言模型(LLM)通常使用人类可读的自然语言进行提示和交互,即使目标读者是另一个模型。本文研究语义信息是否可以编码为紧凑、非标准的文本形式,这种形式牺牲了人类可读性,但能被LLM恢复。我们将这类以模型为中心的文本表示称为BabelTele,这里不是作为固定协议,而是作为探索LLM生成和解释此类表示能力的经验探针。通过可读性诊断、模型似然度量、人类问卷和下游任务评估,我们发现BabelTele可以显著偏离普通自然语言,同时为指令调优的LLM保留核心语义。作为一种任务无关的表示范式,BabelTele展示了高信息密度,即使文本体积压缩到原始长度的27.9%,也能保持99.5%的语义保真度。我们进一步评估了其在跨模型迁移、智能体记忆和多智能体通信中的语义鲁棒性。结果表明,BabelTele可以降低上下文开销,同时通常保持可靠的下游性能,但其有效性取决于压缩器-读取器对和任务设置。这些发现表明,人类可读性、自然语言典型性和模型端语义可恢复性可以部分解耦,为未来探索LLM系统中的模型原生表示开辟了道路。

英文摘要

Large language models (LLMs) are commonly prompted and interfaced with human-readable natural language, even when the intended reader is another model. This paper investigates whether semantic information can be encoded in compact, non-standard textual forms that sacrifice human readability while remaining recoverable by LLMs. We refer to this class of model-centric textual representations as BabelTele, approached here not as a fixed protocol but as an empirical probe into LLMs' capacity to generate and interpret such representations. Through readability diagnostics, model likelihood measures, human questionnaires, and downstream task evaluations, we find that BabelTele can substantially depart from ordinary natural language while preserving core semantics for instruction-tuned LLMs. As a task-agnostic representational paradigm, BabelTele demonstrates high information density, maintaining 99.5% semantic fidelity even when the text volume is condensed to 27.9% of its original length. We further evaluate its semantic robustness in cross-model transfer, agent memory, and multi-agent communication. Results suggest that BabelTele can reduce context overhead while generally maintaining reliable downstream performance, although its effectiveness depends on the compressor-reader pair and task setting. These findings indicate that human readability, natural-language typicality, and model-side semantic recoverability can be partially decoupled, opening a path toward model-native representations in future exploration of LLM systems.

2606.19826 2026-06-19 cs.CR cs.MA 新提交 70%

Heterogeneous LLM Debate Under Adversarial Peers: Honest Gains, Replacement Costs, and Resilience

对抗性同伴下的异构LLM辩论:诚实增益、替代成本与韧性

Prashanti Nilayam, Kiran Kumar Ramanna, Prashil Tumbade, Sankalp Nayak

专题命中 其他LLM :研究异构LLM辩论中诚实与对抗性同伴的影响。

AI总结 研究异构LLM辩论中诚实与对抗性同伴对修正行为的影响,发现诚实同伴降低有害修正率,对抗性同伴则逆转,且异构性在已有对手时也能作为防御。

详情
AI中文摘要

异构LLM辩论的动机在于,多样化的同伴可以相互纠正,但同样的交流既携带纠正也携带对抗性影响。我们通过跟踪异构同伴如何改变诚实智能体的修正行为来衡量哪种影响占主导:他们改变答案的频率,以及这种改变是纠正性的还是有害的。我们比较了匹配面板(同质基线、诚实混合和对抗混合)以及受污染面板(其中已存在一个恶意的同族同伴),涵盖四个模型家族和三个推理基准。一个诚实的异构同伴显著降低了有害修正,而对抗性同伴则逆转了这一效果。对于Llama-3.1-70B防御者在MATH-hard上,诚实插槽的有害修正率从同质面板的89%下降到有诚实同伴时的35%,而对抗性同伴使其回到90%。条件率对弱防御者隐藏了这种损害,但辩论结束时的翻转率暴露了它。该模式在家族和基准上保持符号一致,而其幅度随防御者-基准机制变化。我们还测量了当已存在一个对抗性同族同伴时的效果:一个诚实的异构同伴降低了有害修正率以及最初正确答案丢失的比率。在相同的Llama-3.1-70B设置下,添加的诚实同伴将最初正确项上的翻转率从同族对手下的31%降至6%。因此,异构性不仅是一个攻击面,而且当对手已经存在时,也是一种防御。

英文摘要

Heterogeneous LLM debate is motivated by the promise that diverse peers correct one another, but the same exchange that carries correction also carries adversarial influence. We measure which dominates by tracking how a heterogeneous peer changes the honest agents' revision behavior: how often they change their answer, and whether the change is corrective or harmful. We compare matched panels (homogeneous baseline, honest-mixed, and adversarial-mixed) and contaminated panels in which a malicious same-family peer is already present, spanning four model families and three reasoning benchmarks. An honest heterogeneous peer sharply lowers harmful revision, and an adversarial one reverses it. For Llama-3.1-70B defenders on MATH-hard, the honest-slot harmful-revision rate falls from 89% in the homogeneous panel to 35% with an honest peer, and an adversarial peer returns it to 90%. The conditional rate hides this damage on weak defenders, but the end-of-debate flip rate exposes it. The pattern keeps its sign across families and benchmarks while its magnitude varies with the defender-benchmark regime. We also measure the effects when an adversarial same-family peer is already present: an honest heterogeneous peer lowers both harmful revision and the rate at which initially-correct answers are lost. On the same Llama-3.1-70B setting, the added honest peer cuts the flip rate on initially-correct items from 31% under a same-family adversary to 6%. Heterogeneity is therefore not only an attack surface but, when an adversary is already present, also a defense.

2606.19815 2026-06-19 cs.CL 新提交 70%

Clusters are All You Need: Pre-Training the Tsetlin Machine with Semantic Clusters from Language Models for Interpretability

聚类即一切:利用语言模型中的语义聚类预训练Tsetlin Machine以实现可解释性

Jiechao Gao, Rohan Kumar Yadav, Yuangang Li, Yuandong Pan, Jie Wang, Ying Liu, Michael Lepech

发表机构 * Independent Researcher(独立研究员) University of California, Irvine(加州大学尔湾分校) University of the Chinese Academy of Sciences(中国科学院大学)

专题命中 其他LLM :利用语言模型语义聚类预训练Tsetlin Machine,提升可解释性。

AI总结 提出一种语义预训练框架,通过K-means或Top2Vec将文本聚类,用聚类-样本对预训练Tsetlin Machine,使其学习可解释的语义关键词,在五个数据集上性能优于传统方法且与BERT竞争。

详情
AI中文摘要

预训练语言模型如BERT在文本分类任务中表现强劲,但缺乏透明度,限制了在高风险场景中的应用。Tsetlin Machine (TM) 提供完全可解释的基于子句的推理,但捕获的语义信息有限,先前桥接两者的尝试依赖于静态词嵌入,忽略了上下文含义。我们提出一种语义预训练框架,无需使用嵌入即可将知识从预训练语言模型转移到TM中。文本样本通过K-means或Top2Vec被分组为语义一致的聚类,得到的聚类-样本对通过增强的Type I反馈预训练一个非否定TM。因此,TM学习到可解释的语义关键词,并在下游任务上进行微调。在五个数据集上,我们的方法显著优于传统和基于嵌入的TM,性能与BERT竞争,同时保持可解释性。

英文摘要

Pre-trained language models such as BERT achieve strong text classification performance but lack transparency, limiting their use in high-stakes settings. The Tsetlin Machine (TM) offers fully interpretable, clause-based reasoning but captures little semantic information, and prior attempts to bridge the two rely on static word embeddings that miss contextual meaning. We propose a semantic pre-training framework that transfers knowledge from a pre-trained language model into a TM without using embeddings. Text samples are grouped into semantically coherent clusters with K-means or Top2Vec, and the resulting cluster-sample pairs pre-train a non-negated TM with enhanced Type I feedback. The TM thereby learns interpretable semantic keywords that are fine-tuned on downstream tasks. Across five datasets, our method substantially outperforms vanilla and embedding-based TMs and reaches performance competitive with BERT while remaining interpretable.

2606.19753 2026-06-19 cs.AI cs.SE 新提交 70%

Grounded Inference: Principles for Deterministically Encapsulated Generative Models

基于推理:确定性封装生成模型的原则

Marty O'Neill

发表机构 * Odenton, MD, USA(美国马里兰州奥登顿)

专题命中 其他LLM :讨论AI混合架构中生成模型的确定性封装原则。

AI总结 提出四种AI混合架构原语,实现概率模型的确定性封装,并指出两个行业反模式,为AI与传统系统集成提供基础框架。

Comments 12 pages, 3 figures

详情
AI中文摘要

将生成模型整合到传统计算系统中既带来了巨大的机遇,也带来了巨大的风险。尽管许多早期采用者已付出巨大代价认识到这些风险,但该领域仍需基础框架来降低AI融入传统系统的风险。本文通过定义四种AI混合架构的具体原语,旨在实现概率模型的确定性封装,从而奠定这一基础。此外,本文还确立了行业中广泛存在的两个总体反模式,作为该领域工程师的警示。该框架旨在实现AI与传统系统的成功集成,同时为生成模型提供商构建下一代生成模型接口提供基础。

英文摘要

The incorporation of generative models into traditional computational systems presents both enormous opportunity and tremendous peril. Although many early adopters have realized these perils at great expense, the field still requires foundational frameworks to de-risk incorporation of AI into traditional systems. This manuscript establishes this foundation through the definition of four specific primitives of AI blended architecture, designed to enable deterministic encapsulation of probabilistic models. It further establishes two overarching anti-patterns broadly represented across industry to serve as warnings for engineers in this field. This framework was designed to enable successful integration of AI into traditional systems while providing a foundation upon which generative model providers could build the next generation of generative model interfaces.

2606.19741 2026-06-19 cs.AI cs.LG 新提交 70%

Interpreting Neural Combinatorial Optimization via Evolving Programmatic Bottlenecks

通过演化程序瓶颈解释神经组合优化

Haocheng Duan, Yuxin Guo, Jieyi Bi, Anqi Xie, Sirui Li, Yining Ma, Cathy Wu

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Nanyang Technological University(南洋理工大学) Microsoft Research(微软研究院) Massachusetts Institute of Technology(麻省理工学院)

专题命中 其他LLM :利用LLM演化程序解释神经组合优化模型。

AI总结 提出演化程序瓶颈(EPB)框架,通过将黑盒神经组合优化模型蒸馏为可读程序组合,利用LLM和混合梯度下降实现可解释性,揭示模型行为与经典启发式变体的关系。

Comments Under Review

详情
AI中文摘要

神经组合优化(NCO)取得了强劲性能,但其黑盒性质仍然是部署和科学诊断的关键障碍。标准可解释性工具(如概念瓶颈模型)不适用于NCO,因为其决策是动态的、状态依赖的,且缺乏适当的概念词汇定义。为弥合这一差距,我们引入了演化程序瓶颈(EPB),据我们所知,这是首个通过将黑盒NCO模型蒸馏为人类可读程序组合来解释NCO策略的框架。EPB利用LLM自主演化一组程序,其中每个程序的每步动作分布作为瓶颈。EPB通过迭代框架工作:模块I固定程序库容量,并引入混合文本-数值梯度下降方案,该方案将学生路由器更新的数值梯度和基于LLM程序修订的文本梯度相结合;模块II通过故障目标扩展和冗余剪枝动态调整库容量。大量实验证明了EPB的有效性和广泛适用性,蒸馏后的程序组合在很大程度上保持了原始性能。EPB还揭示了NCO行为在优化阶段的变化,并且可以近似为经典启发式变体的组合。我们的工作推进了可解释NCO,并将EPB建立为解释序列决策模型的有前途工具。

英文摘要

Neural Combinatorial Optimization (NCO) achieves strong performance, yet its black-box nature remains a key roadblock to deployment and scientific diagnosis. Standard interpretability tools, such as Concept Bottleneck Models (CBMs), are ill-equipped for NCO, whose decisions are dynamic, state-dependent, and lack proper concept vocabulary definition. To close this gap, we introduce Evolving Programmatic Bottlenecks (EPB), to our knowledge, the first framework for interpreting NCO policies by distilling black-box NCO models into human-readable program portfolios. EPB employs an LLM to autonomously evolve a bank of programs, where each program's per-step action distribution serves as the bottleneck. EPB works through an iterative framework: Block I fixes program bank capacity and introduces a hybrid textual-numerical gradient descent scheme that couples numerical gradients for student router updates and textual gradients for LLM-based program revision; Block II dynamically adapts bank capacity via fault-targeted expansion and redundancy pruning. Extensive experiments demonstrate EPB's effectiveness and broad applicability, where the distilled program portfolios largely match original performance. EPB also reveals that NCO behavior shifts across optimization stages and can be approximated as a composition of classic heuristic variants. Our work advances interpretable NCO and establishes EPB as a promising tool for interpreting sequential decision-making models.

2606.19719 2026-06-19 cs.IR cs.CL cs.LG 新提交 70%

Closing the Calibration Gap in Semantic Caching

缩小语义缓存中的校准差距

Aditeya Baral, Radoslav Ralev, Iliya Sotirov Zhechev, Srijith Rajamohan, Jen Agarwal

发表机构 * New York University(纽约大学) Redis(Redis公司)

专题命中 其他LLM :语义缓存用于降低LLM推理成本。

AI总结 针对语义缓存系统中离线指标与部署性能的差距,提出P-CHR AUC和CRR指标,发现校准差距由训练目标主导,模型选择本质是校准问题。

Comments 23 pages, 2 figures. Source code: https://github.com/aditeyabaral/calibration-gap-semantic-caching ; Models and Datasets: https://huggingface.co/redis

详情
AI中文摘要

语义缓存通过为语义相似的查询提供缓存响应来降低LLM推理成本。标准实践使用PR-AUC评估这些系统,该指标仅衡量分数排序的好坏,而忽略它们在固定阈值下是否可用。我们表明这种不匹配会导致系统性的部署选择不佳,因为具有最高PR-AUC的模型通常在操作中最差。我们引入精度-缓存命中率(P-CHR)AUC,一种衡量缓存利用率水平上精度的缓存感知指标,以及校准保留率(CRR),它捕捉离线排序质量在部署中保留多少。我们将离线质量与部署质量之间的操作差距分解为可恢复的校准组件和由数据集正例率固定的不可约结构组件。我们的实验表明,校准差距由训练目标而非数据规模主导,事后校准只能部分缩小它。最终,语义缓存的模型选择是一个校准问题,而非排序问题,而测量它是缩小差距的第一步。

英文摘要

Semantic caching cuts LLM inference costs by serving a cached response to semantically similar queries. Standard practice evaluates these systems using PR-AUC, a metric that only measures how well scores rank and ignores whether they are usable at a fixed threshold. We show this mismatch leads to systematically poor deployment choices, as models with the highest PR-AUC are often the worst in operation. We introduce Precision-Cache Hit Ratio (P-CHR) AUC, a cache-aware metric that measures precision across cache utilization levels, and Calibration Retention Rate (CRR), which captures how much offline ranking quality survives at deployment. We decompose the operational gap between offline and deployed quality into a recoverable calibration component and an irreducible structural component fixed by the dataset's positive rate. Our experiments show that the calibration gap is governed by the training objective rather than data scale, and post-hoc calibration only partially closes it. Ultimately, model selection for semantic caching is a calibration problem, not a ranking one, and measuring it is the first step to closing the gap.

2606.19704 2026-06-19 cs.AI 新提交 70%

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

超越静态排行榜:LLM智能体评估的预测有效性

Dhaval C. Patel, Kaoutar El Maghraoui, Shuxin Lin, Yusheng Li, Tianjun Feng, Chun-Yi Tsai, Yihan Sun, Wei Alexander Xin, Akshat Bhandari, Tanisha Rathod, Aaron Fan, Sanskruti Vijay Shejwal, Tomas Pasiecznik, Sagar Chethan Kumar, Tanmay Agarwal, Rohith Kanathur, Sam Colman, Amaan Sheikh, Dev Bahl, Ann Li, Krish Veera, Alimurtaza Mustafa Merchant, Shambhawi Baswaraj Bhure, Sajal Kumar Goyla, Chengrui Li, Kirthana Natarajan, Rui Li, Thomas Ajai, Rujing Li, Vivek G. Iyer, Sanjaii Vijayakumar, Yitong Bai, Ayal Yakobe, Darief Maes, Yassine Jebbouri, Tianyang Xu, Thai Quoc On, Vera Mazeeva, Winston Li, Yuval Shemla, Yeshitha Bhuvanesh, Rushin Bhatt, Siddharth Chethan Gowda, Alisha Vinod, Caroline Cahill, Shriya Aishani Rachakonda, Yunfeng Chen, Aryaman Agrawal, Aman Upganlawar, Mao Le Jonathan Ang, Yubin Sally Go, Madhav Rajkondawar, Yang-Jung Chen, Trisha Maturi, Ananya Kapoor, Andrew Li, Shrey Arora, Mana Abbaszadeh, Shen Li, Charles Xu, Byeolah Kwon

发表机构 * IBM

专题命中 其他LLM :研究LLM智能体评估的泛化问题。

AI总结 本文通过14项并行研究,论证聚合分数排行榜无法泛化到分布外场景,提出基于预测有效性的排名配置方法,并设计可证伪的分布外评估标准。

Comments 17 pages, 2 tables, 5 figures

详情
AI中文摘要

智能体基准测试发展迅速,但单一基准测试无法涵盖部署所涉及的多个维度。本文汇总了迄今为止最大规模的基于MCP的工业智能体基准测试的协调深度分析:14项并行实现研究,涵盖新的资产类别(包括多模态视觉扩展)、替代编排、检索策略、推理模式、基础设施优化和评估方法探索。结合这些研究与七个先前的智能体基准测试,我们认为聚合分数排行榜系统性地低估了部署智能体的评估。基于聚合分数的排名无法泛化到分布外设置;最近的公开到私有竞赛回顾提供了这种排名不稳定性的直接经验证据。我们提出通过预测有效性(样本内与样本外排名之间的相关性)而非样本内均值来配置排名,并报告了一个十二层测量装置,该装置揭示了HELM及其智能体时代后继者所忽略的部署相关维度。该立场通过三个具有明确阈值的可证伪分布外标准得以操作化;现有证据部分支持但过于薄弱无法确认。最后,我们提出了一个预注册的试点设计和下一代智能体基准测试应报告的内容的领域级愿景。

英文摘要

Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.

2606.19591 2026-06-19 cs.CL cs.AI 新提交 70%

A BART-based approach with hierarchical strategy for Vietnamese abstractive multi-document summarization

基于BART的分层策略用于越南语抽象式多文档摘要

Vu Nguyen Nguyen Xuan, Huy Ngo Quang

发表机构 * Aimesoft JSC(Aimesoft股份公司)

专题命中 其他LLM :基于BART的越南语多文档摘要,属于语言模型应用

AI总结 提出一种新颖简单的基于黄金摘要缩短文档的分层策略,结合BART模型实现越南语多文档抽象式摘要,在VLSP 2022测试集上达到ROUGE2-F1 0.2468,并利用外部数据增强训练。

Comments originally written in 2022

详情
AI中文摘要

在本技术报告中,我们专注于解决越南语多文档抽象式摘要的挑战,该任务在2022年越南语言与语音处理国际研讨会(VLSP)上提出。我们选择遵循流行的分层方法,即先浓缩每个文档,然后进行聚合和摘要。我们提出了一种新颖而简单的策略来缩短文档,该策略由黄金摘要驱动,从而确保分层方法各阶段之间的高度相关性。我们的方法在VLSP的公开测试集上达到了0.2468的ROUGE2-F1分数,并且能够生成流畅简洁的摘要。此外,我们利用外部来源获取额外数据,这极大地增加了越南语多文档摘要的数据量。额外数据已向社区公开。

英文摘要

In this technical report, we focus on solving the challenge of Vietnamese multi-document abstractive summarization, introduced in the International Workshop on Vietnamese Language and Speech Processing (VLSP) 2022. We choose to follow the popular hierarchical approach, i.e. condensing each document followed by aggregation and summarization. We propose a novel yet simple strategy to shorten documents that is driven by the golden summary, thus ensuring high correlation between stages of the hierarchical approach. Our method achieves a ROUGE2-F1 score of 0.2468 on the VLSP's public test set, and can produce fluent and concise summaries. Additionally, we utilize external sources for extra data, which greatly enhances the quantity of data for Vietnamese multi-document summarization. The additional data is made available for the community.

2606.19588 2026-06-19 cs.AI cs.CR cs.LO 新提交 70%

Analyzing the Narration Gap in LLM-Solver Loops

分析大语言模型-求解器循环中的叙述差距

Zunchen Huang, Songgaojun Deng

发表机构 * Eindhoven University of Technology(埃因霍温理工大学)

专题命中 其他LLM :分析LLM-求解器循环中的叙述差距和安全漏洞

AI总结 研究LLM与SAT/SMT求解器混合推理中,将求解器输出转化为用户答案的叙述步骤存在的安全漏洞,通过形式化建模和实验评估发现证书门控可保证求解结果正确,但对抗攻击可反转结论。

详情
AI中文摘要

诸如SAT和SMT求解器之类的形式化工具,当安全或安保关键问题可以用逻辑表述时,越来越多地被嵌入到语言模型推理流程中。与思维链不同(其步骤从模型分布中采样,没有形式化保证),求解器产生可靠且可独立验证的答案。然而,这种可靠性保证可能在求解器与模型之间的交互中丢失。混合流程包含三个组成部分:形式化问题、求解问题以及叙述结果。先前的工作研究了形式化和求解,但未涉及叙述——即将形式化工具的输出转化为用户答案的步骤。为了填补叙述差距,我们首先将LLM-求解器循环建模为经过验证的决策过程。我们进一步在提示注入下评估了五个开源模型,发现证书门控使求解器判定可靠,而攻击者可以通过不同措辞和渠道反转已验证的结论。我们研究了通过强化提示进行缓解的方法,该方法显著减少了注入但无法完全消除,并且在自适应攻击下仍然存在问题。结合形式化分析和实证研究,我们表明在LLM-求解器循环中,鲁棒性无法延伸到用户最终读取的答案。

英文摘要

Formal tools such as SAT and SMT solvers are increasingly embedded in language model reasoning pipelines when a safety or security critical question can be formulated in logic. Unlike chain of thought whose steps are sampled from the model distribution without formal guarantee, a solver produces a sound and independently verifiable answer. However, the soundness guarantee can be lost in the interaction between the solver and the model. The hybrid pipeline has three components: formalizing the question, deciding it, and narrating the result. Prior work has studied the formalization and decision, but not narration, which is the step that turns a formal tool's output into the user answer. To fill the narration gap, we first model the LLM-solver loop as a verified decision procedure. We further evaluate five open-sourced models under prompt injection, and we find certificate gating makes the solver verdict sound, while an adversary can invert a verified conclusion across phrasings and channels. We study the mitigation through hardened prompt that reduces injection significantly but cannot eliminate it and still suffers under adaptive attack. Combining the formal analysis and empirical studies, we show in the LLM-solver loop, robustness does not reach to the answer that the user finally reads.

2606.19559 2026-06-19 cs.AI cs.CL 新提交 70%

Uncertainty Decomposition for Clarification Seeking in LLM Agents

LLM代理中寻求澄清的不确定性分解

Gregory Matsnev

发表机构 * AI Talent Hub, ITMO University(AI Talent Hub, ITMO大学)

专题命中 其他LLM :基于提示的不确定性分解方法用于LLM代理

AI总结 提出一种基于提示的不确定性分解方法,将行动置信度与请求不确定性分离,使代理能在任务规范模糊时主动寻求澄清,在五个LLM骨干上平均澄清F1提升36%-73%。

Comments 26 pages, 8 figures. Source code: https://github.com/PE51K/udcs-in-llm-agents

详情
AI中文摘要

最近的立场论文认为,经典的偶然/认知不确定性框架对于交互式大型语言模型(LLM)代理是不够的,并呼吁需要一种对欠规范感知、可分解且可通信的不确定性表示,以解锁新的代理能力,如主动寻求澄清和共享心理模型构建。实际部署约束——黑盒API、交互延迟预算以及缺乏标注轨迹——排除了基于logprob、多采样和基于训练的方法,使得基于提示的估计成为在部署时浮现此类信号的最可行方案。我们通过一种简单的基于提示的分解来响应这一呼吁,该分解将行动置信度与请求不确定性(u)分离,使代理能在任务规范模糊时请求澄清。为了评估它,我们引入了两个增强澄清的基准(WebShop-Clarification和ALFWorld-Clarification),其中50%的任务被故意欠规范,并在这些变体以及用于故障检测的标准WebShop、ALFWorld和REAL基准上,系统地将所提出的分解与ReAct+UE和不确定性感知记忆(UAM)在五个LLM骨干(GPT-5.1、DeepSeek-v3.2-exp、GLM-4.7、Qwen3.5-35B、GPT-OSS-120B)上进行比较。在五个骨干上平均,所提出的分解在ALFWorld-Clarification上比ReAct+UE提高了73%的澄清F1,比UAM提高了36%,并且在WebShop-Clarification的每个骨干以及ALFWorld-Clarification的五个骨干中的四个上领先澄清F1,表明增益超越了单个LLM。

英文摘要

Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model (LLM) agents and call for underspecification-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental-model building. Practical deployment constraints -- black-box APIs, interactive latency budgets, and the absence of labeled trajectories -- rule out logprob-based, multi-sampling, and training-based methods, leaving prompt-based estimation as the most viable family for surfacing such signals at deployment time. We answer this call with a simple prompt-based decomposition that separates action confidence from request uncertainty (u), enabling the agent to ask for clarification when the task specification is ambiguous. To evaluate it, we introduce two clarification-augmented benchmarks (WebShop-Clarification and ALFWorld-Clarification) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct+UE and Uncertainty-Aware Memory (UAM) across five LLM backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld-Clarification by 73% over ReAct+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop-Clarification and on four of five backbones on ALFWorld-Clarification, indicating that the gains generalize beyond a single LLM.

2606.19558 2026-06-19 cs.LG cs.CL 新提交 70%

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

位移不是方向:评估量化LLM部署的保真度指标

Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

发表机构 * ByteShape University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(向量人工智能研究所)

专题命中 其他LLM :评估量化LLM部署的保真度指标,属于LLM应用

AI总结 本文研究KL散度等保真度指标在量化语言模型部署中与下游基准分数的相关性,发现整体强相关但在近基线区域失效,归因于KL散度主要衡量分歧量而非方向。

详情
AI中文摘要

保真度指标,如每个token的KL散度(KLD)与高精度参考模型的比较,常被用作基准质量的低成本代理。我们在Qwen3.6-35B-A3B的28个量化模型和Devstral-Small-2-24B的41个量化模型上,通过一系列下游基准测试验证了这一做法。我们发现,在整个量化队列中,KLD与基准分数强相关(Qwen上ρ=-0.72,Devstral上ρ=-0.86,p<0.001)。然而,在接近基线的静默区,这种关系变得不显著(Qwen上ρ=+0.00,Devstral上ρ=-0.24,p=0.36)。这种失效在14种测量变体中持续存在,包括不同的KLD聚合方式、困惑度公式、top-1一致性、校准语料库和上下文长度。在逐提示层面,KLD在代码任务上仅有较弱的失败预测能力,在LiveCodeBench上五个模型的失败与通过几何平均比在[1.08,1.22]之间,并且作为跨模型路由器失败,在分歧提示上仅达到42.3%-49.4%的准确率。我们将这种失效归因于结构分解:KLD主要衡量与参考模型的分歧量,在静默区复合ρ在Qwen上为+0.94(p<0.001),在Devstral上为+0.55(p=0.03),而其与分歧方向的关系较弱且依赖于任务。

英文摘要

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($ρ=-0.72$ on Qwen and $ρ=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($ρ=+0.00$ on Qwen and $ρ=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $ρ=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.

2606.19476 2026-06-19 cs.LG cs.AI 新提交 70%

Can In-Context Learning Support Intrinsic Curiosity?

上下文学习能否支持内在好奇心?

Eric Elmoznino, Sangnie Bhardwaj, Johannes von Oswald, Rajai Nasser, Blaise Agüera y Arcas, João Sacramento, Rif A. Saurous, Guillaume Lajoie

发表机构 * Google – Paradigms of Intelligence Team(Google – 智能范式团队) Google DeepMind

专题命中 其他LLM :利用序列模型上下文学习作为世界模型,探索好奇心。

AI总结 研究利用序列模型的上下文学习能力作为即时无更新世界模型,以消除传统内在好奇心方法中梯度下降的计算瓶颈,理论证明在非时间设置下可渐近收敛到真实学习进度。

详情
AI中文摘要

有效的机器学习不仅取决于我们如何对数据建模,还取决于我们选择收集哪些数据。虽然大型序列模型已经彻底改变了数据建模,但自动数据选择或“内在好奇心”的问题仍然是一个重大挑战。经典方法通过基于智能体的“学习进度”奖励来激励探索,该奖励衡量新获得的观测在多大程度上改进了世界模型的预测能力。然而,传统上评估这些奖励需要在每个轨迹内进行昂贵的梯度下降内循环更新,这使得它们在规模上计算上不可行。在这项工作中,我们研究序列模型涌现的上下文学习(ICL)能力是否可以通过作为即时的、无需更新的世界模型来消除这一瓶颈。具体来说,我们评估是否可以训练一个探索策略来最大化学习进度,仅使用上下文学习者的预测误差和反事实上下文操作。我们首先证明,在一般马尔可夫决策过程中,这实际上不可能以无偏的方式实现:由此产生的内在奖励要么包含干扰项,使其对真实学习进度的估计产生偏差,要么无法使用上下文学习者的预测误差来实现。相反,我们对于非时间设置的一个广泛子类(包括主动学习和贝叶斯实验设计)证明了积极结果:在这里,ICL派生的奖励成功界定了真实学习进度并渐近收敛到它。我们通过连续和符号环境中的受控实验证实了我们的理论,表明我们的ICL驱动框架成功训练了以最优方式进行探索的好奇数据收集策略。

英文摘要

Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.

2606.19404 2026-06-19 cs.LG cs.CL 新提交 70%

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

推理的热力学特征:用于大型语言模型幻觉检测的自由能和谱形因子诊断

Salim Khazem

发表机构 * Talan Research & Innovation Center(Talan研究与创新中心)

专题命中 其他LLM :热力学特征用于LLM幻觉检测。

AI总结 提出自由能签名(Fes)作为谱描述符,将注意力拉普拉斯视为哈密顿量并提取热力学势和随机矩阵理论谱形因子,用于检测LLM幻觉,无需训练即可实现高AUROC。

详情
AI中文摘要

大型语言模型(LLM)中的幻觉检测对部署至关重要,近期研究表明注意力导出的图拉普拉斯谱携带关于推理质量的强信号。然而,先前的谱诊断仅通过少数特征值或手工选取的标量来总结拉普拉斯谱,忽略了其大部分结构。我们提出自由能签名(Fes),一种谱描述符,将每层的注意力拉普拉斯视为哈密顿量,并提取其热力学势(配分函数、自由能、谱熵、热容)以及随机矩阵理论(RMT)谱形因子。我们证明了三个结果:(i)Fes在注意力扰动下的Lipschitz稳定性;(ii)一个表达性结果,表明Fes丰富了有限谱摘要,并在明确的规则性和网格分辨率假设下逼近矩导出的谱泛函;(iii)基于Fes构建的无训练检测器AUROC的有限样本PAC界。实验上,在六个开源LLM和六个基准测试中,基于Fes描述符的轻量级探测在注意力谱基线中实现了最强的平均AUROC,相比LapEig平均提高+6.5 AUROC点,相比GoR-4平均提高+2.4点,且无需更新底层LLM。在完全无监督设置下,RMT偏差得分达到平均AUROC 0.71,提供了一个无标签但较弱的检测器。互补的RMT分析表明,正确生成表现出更接近Wigner-Dyson的谱统计,而幻觉表现出更接近Poisson的统计。匿名代码和配置在补充材料中提供。

英文摘要

Hallucination detection in large language models (LLMs) is deployment-critical, and recent work shows that the spectrum of attention-derived graph Laplacians carries strong signal about reasoning quality. Prior spectral diagnostics, however, summarize the Laplacian spectrum by a handful of eigenvalues or hand-picked scalars, leaving most of its structure unused. We propose Free-Energy Signatures (Fes), a spectral descriptor that treats each layer's attention Laplacian as a Hamiltonian and extracts its thermodynamic potentials partition function, free energy, spectral entropy, heat capacity together with the random-matrix-theory (RMT) spectral form factor. We prove three results: (i)~Lipschitz stability of Fes under attention perturbation; (ii)~an expressiveness result showing that Fes enriches finite spectral summaries and approximates moment-derived spectral functionals under explicit regularity and grid-resolution assumptions; and (iii)~a finite-sample PAC bound on the AUROC of a training-free detector built from Fes. Empirically, across six open-weight LLMs and six benchmarks, a lightweight probe on Fes descriptors achieves the strongest aggregate AUROC among attention-spectral baselines, improving over LapEig by $+6.5$ AUROC points and over GoR-4 by $+2.4$ points on average, while requiring no update to the underlying LLM. In the fully unsupervised setting, an RMT-deviation score achieves mean AUROC $0.71$, providing a label-free but weaker detector. A complementary RMT analysis shows that correct generations exhibit more Wigner-Dyson like spectral statistics, whereas hallucinations exhibit more Poisson-like statistics. The anonymized code and config are provided in the supplementary material.

2606.18812 2026-06-19 cs.LG cs.AI 新提交 70%

Reinforcement Learning Foundation Models Should Already Be A Thing

强化学习基础模型本应已经存在

Abdelrahman Zighem, Jill-Jênn Vie

发表机构 * École normale supérieure de Paris, PSL University, Paris, France(巴黎高等师范学院,PSL大学,法国巴黎) Soda team, Inria Saclay, Palaiseau, France(Soda团队,法国国家信息与自动化研究所萨克雷中心,法国帕莱索)

专题命中 其他LLM :提出强化学习基础模型概念

AI总结 提出通过合成MDP构建强化学习基础模型,利用固定大小的充分统计量使注意力架构适用,在线和离线实验均优于传统算法。

详情
AI中文摘要

语言和视觉的基础模型由互联网规模的数据驱动,而结构化领域(表格预测、时间序列预测、图学习、强化学习)则不然。替代方案是合成数据,它将负担从收集转移到先验设计。这种先验已经存在于许多结构化任务中:TabPFN及其后续工作通过一个在合成贝叶斯先验上预训练的Transformer解决表格分类问题。我们提出两点。\textbf{首先},强化学习是明显的空白:采样一个合成MDP与采样一个合成表格数据集一样可行,然而没有上下文强化学习工作将先验设计作为主要目标。\textbf{其次},MDP允许一个固定大小的充分统计量,独立于观察到的回合且形状为表格形式,这使得它们直接适用于用于表格基础模型的基于注意力的架构,只需将策略头替换监督目标。这些共同定义了强化学习基础模型的议程。作为概念验证,我们完全在合成MDP上训练一个模型,并表明,无需任务特定的调优,它就能在上下文中解决留出的表格基准,包括在线和离线:在线时,使用比UCB-VI和表格Q-learning少得多的回合;离线时,与VI-LCB竞争。

英文摘要

Foundation models for language and vision are powered by internet-scale data, while structured domains such as tabular prediction are powered by synthetic data. This substitute shifts the challenge from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. \textbf{First}, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. \textbf{Second}, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train a Graph Attention Network entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

2606.14784 2026-06-19 cs.SD cs.LG eess.AS 新提交 70%

LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

基于上下文学习的音频情感分类的LLM合成真实标签生成

Qing Huang, Pooja Pol, Jianing Zhang

发表机构 * School of Business, Technical University of Applied Sciences Augsburg(应用技术大学阿沙芬堡商学院) Data Science und Autonome Systeme Technologietransferzentrum (TTZ)(数据科学与自主系统技术转移中心(TTZ))

专题命中 其他LLM :LLM生成音频情感标签

AI总结 提出利用大语言模型(LLM)和上下文学习(ICL)从多用户VR环境的流式语音数据中自动生成情感相关合成真实标签,解决团队协作状态标注难题。

Comments https://icaiit.org/paper.php?paper=14th_ICAIIT_2/3_9

详情
AI中文摘要

理解人类状态和交互动态是人机交互(HCI)的核心目标。随着交互范式变得更加沉浸,虚拟现实(VR)已成为研究协作工作的强大平台。在此类环境中,评估团队协作状态(包括团队表现和团队韧性)需要从多模态传感器数据(如语音信号)中连续可靠地推断潜在的团队级认知和情感状态。然而,由于传感器噪声、上下文变异性和稀疏的专家标注,为这些潜在状态生成真实标签仍然具有挑战性。传统的自我报告方法仅提供静态和延迟的测量,因此不足以捕捉连续语音数据中反映的动态团队过程。在这项工作中,我们提出了一种由大语言模型(LLM)驱动的、基于代理的推理工作流,用于从多用户VR环境中的流式语音数据自动生成情感相关的合成真实标签。利用LLM的泛化能力,我们使用上下文学习(ICL)和少量配对的音频样本及其对应转录的演示。ICL倾向于实现与模型微调相当的任务适应,同时避免了参数更新的计算开销。为了构建信息丰富且鲁棒的上下文提示,我们采用基于检索的选择策略,根据声学特征空间中的相似性动态识别相关的音频演示。

英文摘要

Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space.

2606.10616 2026-06-19 cs.AI 新提交 70%

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

学习记住什么:通过约束优化实现长时域语言代理的观测安全记忆保留

Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan

发表机构 * Huawei Noah's Ark Lab(华为诺亚方舟实验室) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系)

专题命中 其他LLM :针对语言代理的上下文窗口限制提出方法

AI总结 针对长时域语言代理的有限上下文窗口,提出OSL-MR框架,将记忆保留建模为约束随机优化问题,通过在线可观测特征与离线监督的严格分离学习查询条件化的证据价值,实验表明在严格预算下优于现有方法。

详情
AI中文摘要

长时域语言代理积累的观测、推理轨迹和检索事实会超出其有限的上下文窗口,使得记忆保留成为一个基本的资源分配问题。现有记忆系统通过启发式评分、检索优化或学习压缩来改进管理,但大多将保留视为局部决策问题,并未在现实观测约束下显式建模其长期后果。为填补这一空白,我们将记忆保留建模为一个约束随机优化问题,具有明确的预算可行性、证据效用以及延迟成本(包括遗漏惩罚、重新获取延迟和过时信息风险)。随后,我们提出OSL-MR(观测安全记忆保留学习),这是一个新颖的框架,强制执行在线可观测特征与离线可用监督(OAS)之间的严格分离。OSL-MR结合了一个从实现的证据监督中训练的证据学习器和一个混合评分启发式,该启发式既作为可部署的在线安全基线,又作为结构化的归纳先验用于学习。由此产生的策略直接从交互数据中学习查询条件化的证据价值,同时在同一观测约束下保持可部署性。在LOCOMO和LongMemEval上的实验表明,OSL-MR在严格记忆预算下持续优于基于最近性的方法、生成式代理风格评分和其他启发式基线。混合评分先验在保持召回率的同时进一步提高了精确度,敏感性分析表明其在广泛的成本配置下具有鲁棒性。

英文摘要

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts exceeding context windows, making memory retention a fundamental resource-allocation problem. Existing systems treat retention as local and do not model long-term consequences under observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization with budget feasibility, evidence utility, and delayed costs including miss, reacquisition, and stale penalties. We show this multi-step problem is NP-hard, making exact solution intractable. Moreover, deployment decisions must be made under partial observability. To address these challenges, we propose OSL-MR (Observability-Safe Learning for Memory Retention), a learning-augmented framework that enforces a strict separation between online-observable features and offline-available supervision. OSL-MR combines an evidence learner trained from realized evidence with a Mixed-Score heuristic that serves as a deployable online-safe baseline and an inductive prior. The policy learns query-conditioned evidence from interaction data and remains deployable under the same constraints. Experiments on LoCoMo and LongMemEval show OSL-MR outperforms recency-based, Generative Agents-style, and other heuristic baselines, especially under tight budgets. The Mixed-Score prior improves precision and recall, and sensitivity analysis shows robustness across cost settings. On small solvable instances, single-step optimization is insufficient to anticipate future demand shifts, while OSL-MR stays significantly closer to the dynamic-programming optimum, confirming the necessity of the sequential formulation and reinforcing our learning-guided approximation. These results establish constrained stochastic optimization and optimization-guided learning as a principled foundation for memory management in long-horizon agents.

2606.06971 2026-06-19 cs.MA cs.SI 版本更新 70%

Modeling U.S. Attitudes Toward China via an Event-Steered Multi-Agent Simulator

通过事件驱动的多智能体模拟器建模美国对华态度

Chenxu Zhu, Hantao Yao, Wu Liu, Junbo Guo, Yongdong Zhang

专题命中 其他LLM :基于LLM的多智能体模拟,驱动舆论演化

AI总结 提出事件驱动多智能体模拟器(ES-MAS),利用CURE数据集和双流数据集成引擎(DSDIE)及新闻驱动动态交互模块(NDDI),模拟美国对华舆论的动态演化,实验表明优于现有模型。

详情
AI中文摘要

理解舆论的动态演化,如美国公众对中国的态度,对于评估地缘政治风险至关重要。然而,现有的基于LLM的多智能体模拟器主要依赖静态规则和固定数据集,限制了其捕捉现实世界中宏观层面舆论转变的动态、事件驱动特性的能力。为解决这一限制,我们提出了一种事件驱动的多智能体模拟器(ES-MAS),其中重大事件和日常新闻通过智能体之间的动态交互持续驱动舆论演化。我们首先构建了中美关系演化(CURE)数据集,涵盖2021年至2025年的20个季度,包括258个重大事件和超过14,000篇日常新闻文章,为建模舆论动态提供了全面的时间基础。基于CURE数据集,我们提出了双流数据集成引擎(DSDIE),该引擎通过宏观层面事件将模拟与历史时间线对齐,同时基于个体智能体画像和上下文信号实现个性化信息暴露。此外,我们设计了新闻驱动的动态交互(NDDI)模块,该模块自适应地将具有共同新闻兴趣的智能体分组到局部交互上下文中,促进自下而上的共识形成,同时降低孤立信息茧房的风险。在CURE数据集上的实验结果表明,ES-MAS在复现真实世界历史趋势方面显著优于现有模拟器,为建模动态舆论演化提供了一个可扩展且有效的框架。

英文摘要

Understanding the dynamic evolution of opinions, such as U.S. public attitudes toward China, is essential for assessing geopolitical risks. However, existing LLM-based multiagent simulators predominantly rely on static rules and fixed datasets, limiting their ability to capture the dynamic, event-driven nature of macro-level opinion shifts in real-world settings. To address this limitation, we propose an Event-Steered Multi-Agent Simulator (ES-MAS), in which significant events and daily news continuously drive opinion evolution through dynamic interactions among agents. We first construct the China-U.S. Relation Evolution (CURE) dataset, covering 20 quarters from 2021 to 2025, including 258 major events and over 14,000 daily news articles, and providing a comprehensive temporal foundation for modeling opinion dynamics. Building upon the CURE dataset, we propose a Dual-Stream Data Integration Engine (DSDIE) that aligns simulations with historical timelines via macro-level events while enabling personalized information exposure based on individual agent profiles and contextual signals. Furthermore, we design a News-Driven Dynamic Interaction (NDDI) module, which adaptively groups agents with shared news interests into localized interaction contexts, facilitating bottom-up consensus formation while mitigating the risk of isolated information cocoons. Experimental results on the CURE dataset demonstrate that ES-MAS substantially outperforms existing simulators in reproducing real-world historical trends, offering a scalable and effective framework for modeling dynamic opinion evolution.

2604.07593 2026-06-19 cs.AI 版本更新 70%

Too long; didn't solve

太长;没解决

Lucía M. Cabrera, Isaac Saxton-Knight, Jocelyn D'Arcy

发表机构 * Instituto Balseiro(巴塞罗那研究所) Poindexter Labs(波因迪克斯实验室)

专题命中 其他LLM :提示长度与数学推理性能关系研究

AI总结 研究提示长度和解答长度与大型语言模型在数学问题上的性能关系,发现两者与模型失败率正相关。

详情
AI中文摘要

由一系列数学问题组成的数学基准被广泛用于评估大型语言模型的推理能力,但关于其结构特性如何影响模型行为的研究很少。在这项工作中,我们研究了两个结构长度变量——提示长度和解答长度,并分析了它们如何与模型在新构建的、由专家编写的对抗性数学问题数据集上的性能相关。我们发现,提示长度和解答长度均与模型失败率的增加呈正相关。我们还进行了跨模型分歧的探索性辅助分析。在难度调整的归一化分析下,两个变量与实现模型分离仍保持弱负相关,提示长度的关联稍强。总体而言,我们的主要稳健发现是,结构长度与该数据集中的经验难度相关。

英文摘要

Mathematical benchmarks consisting of a range of mathematics problems are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. We find that both prompt and solution lengths correlate positively with increased model failure across models. We also include a secondary, exploratory analysis of cross-model disagreement. Under a difficulty-adjusted normalised analysis, both variables retain weak negative associations with realised model separation, slightly stronger for prompt length. Overall, our main robust finding is that structural length is linked to empirical difficulty in this dataset.

2604.01955 2026-06-19 cs.CY 版本更新 70%

Teaching Students to Question the Machine: An AI Literacy Intervention Improves Students' Regulation of LLM Use in a Science Task

教导学生质疑机器:一项AI素养干预措施提升学生在科学任务中调节LLM使用的能力

O. Clerc, R. Abdelghani, C. Desvaux, E. Poisson, P. Y. Oudeyer, H. Sauzéon

专题命中 其他LLM :AI素养干预提升学生LLM使用能力

AI总结 本研究通过两小时的AI素养工作坊,训练中学生(8-9年级)在科学问题解决中更有效地使用大语言模型,减少盲目依赖并提高答案质量。

Comments Workshop paper accepted at ALIT4ALL 2026: 2nd International Workshop on AI Literacy Education For All, co-located with AIED 2026

详情
AI中文摘要

生成式人工智能(GenAI)在学校中的快速普及引发了人们对学生不加批判地依赖其输出的担忧。有效使用大语言模型(LLM)不仅需要技术知识,还需要监控、评估和调节与系统交互的能力,这些过程与元认知调节密切相关。这些技能在中学阶段仍在发展中,使得学生特别容易过度信任和过早接受AI输出。由于课堂时间和教师培训资源有限,迫切需要开发和评估可在现实学校条件下实施的AI素养干预措施。我们报告了一项受控的课堂研究,考察两小时的AI素养工作坊是否能改善学生在LLM支持的科学问题解决中的交互策略和最终答案质量。共有116名学生(8-9年级;13-15岁)使用生成式AI系统完成了六项科学调查任务。两天前,干预组参加了工作坊,该工作坊结合了关于LLM如何工作及失败的信息,以及关于提示和响应评估的实用指导;对照组未接受培训。受过训练的学生表现出更少的盲目依赖:他们更频繁地重新表述查询、提出后续问题,并更准确地判断响应正确性,从而获得更好的表现。相比之下,GenAI和元认知自我报告分数不能预测表现,这表明有效使用生成式AI较少依赖于自我报告测量,而更多依赖于交互调节的明确训练。总体而言,结果表明,简短、可扩展的AI素养教学可以显著改善中学生在校本学习活动中使用生成式AI的方式。

英文摘要

The rapid adoption of generative artificial intelligence (GenAI) in schools raises concerns about students' uncritical reliance on its outputs. Effective use of large language models (LLMs) requires not only technical knowledge but also the ability to monitor, evaluate, and regulate one's interaction with the system, processes closely tied to metacognitive regulation. These skills are still developing in middle school, making students particularly vulnerable to over-trust and premature acceptance of AI outputs. Because classroom time and teacher training resources are constrained, there is a pressing need to develop and evaluate AI literacy interventions that can be implemented under realistic school conditions. We report a controlled classroom study examining whether a two-hour AI literacy workshop improves students' interaction strategies and quality of final answers in LLM-supported science problem solving. A total of 116 students (grades 8-9; ages 13-15) completed six science investigation tasks using a generative AI system. Two days prior, the intervention group attended the workshop, which combined information about how LLMs work and fail with practical guidance on prompting and response evaluation; the control group received no training. Trained students showed less uncritical reliance on the system: they more often reformulated queries, asked follow-up questions, and more accurately judged response correctness, leading to better performance. In contrast, GenAI and metacognitive self-report scores did not predict performance, suggesting that effective use of generative AI depends less on self-reported measures and more on explicit training in interaction regulation. Overall, the results show that brief, scalable AI literacy instruction can meaningfully improve how middle-school students use generative AI in school-like learning activities.

2603.16941 2026-06-19 eess.AS cs.CL cs.SD 版本更新 70%

The Voice Behind the Words: Quantifying Intersectional Bias in SpeechLLMs

言语背后的声音:量化语音大语言模型中的交叉偏见

Shree Harsha Bokkahalli Satish, Christoph Minixhofer, Maria Teleki, James Caverlee, Ondřej Klejch, Peter Bell, Gustav Eje Henter, Éva Székely

发表机构 * 1 Department of Speech, Music Hearing, KTH Royal Institute of Technology, Sweden 2 Centre for Speech Technology Research, University of Edinburgh, UK 3 Texas A\&M University, USA

专题命中 其他LLM :语音大语言模型中的交叉偏见量化

AI总结 本研究通过2880次受控交互,评估三种语音大语言模型在六种英语口音和两种性别呈现中的口音与性别交叉偏见,发现东欧口音(尤其女性)获得更低有用性评分,且人类评估者比LLM评判更敏感。

Comments 5 pages, 3 figures, 1 table, Accepted to Interspeech 2026

详情
AI中文摘要

语音大语言模型直接处理语音输入,保留了之前级联管道中去除的口音和感知性别等线索,这导致了依赖于说话者身份的反应差异。我们使用2880次受控交互(涵盖六种英语口音和两种性别呈现,通过语音克隆保持语言内容不变),对三种语音大语言模型中的口音和性别偏见进行了大规模交叉评估。通过逐点LLM评判评分、成对比较以及经过人工验证的最佳-最差缩放,我们检测到反复出现的定向差异。东欧口音的语音获得较低的有用性评分,尤其是女性呈现的语音。反应保持礼貌但在有用性上存在差异。虽然LLM评判捕捉到了这些偏见的定向趋势,但人类评估者表现出显著更高的敏感性,显示出更强的口音级别对比。

英文摘要

Speech Large Language Models (SpeechLLMs) process spoken input directly, retaining cues such as accent and perceived gender that were previously removed in cascaded pipelines. This introduces speaker identity dependent variation in responses. We present a large-scale intersectional evaluation of accent and gender bias in three SpeechLLMs using 2,880 controlled interactions across six English accents and two gender presentations, keeping linguistic content constant through voice cloning. Using pointwise LLM-judge ratings, pairwise comparisons, and Best-Worst Scaling with human validation, we detect recurring directional disparities. Eastern European-accented speech receives lower helpfulness scores, particularly for female-presenting voices. Responses remain polite but differ in helpfulness. While LLM judges capture the directional trend of these biases, human evaluators exhibit significantly higher sensitivity, showing stronger accent-level contrasts.

2603.16357 2026-06-19 cs.CY cs.SE 版本更新 70%

Beyond Grading Accuracy: Exploring Alignment of TAs and LLMs

超越评分准确性:探索助教与LLMs的一致性

Matthijs Jansen op de Haar, Nacir Bouali, Faizan Ahmed

专题命中 其他LLM :开源LLM用于UML类图评分评估

AI总结 本文提出一个评估管道,通过定量研究92个UML类图,比较助教与六个开源LLMs在单个评分标准上的表现,发现开源LLMs在评分准确性上接近助教,为混合主动评分系统提供了可能。

Comments 7 pages, 3 figures

详情
AI中文摘要

在本文中,我们研究了开源大型语言模型(LLMs)在评分统一建模语言(UML)类图方面的潜力。与现有主要评估专有LLMs的工作不同,我们专注于非专有模型,使得我们的方法适用于对透明度和成本敏感的大学。此外,现有研究评估的是完整图表而非单个标准的性能,对自动评分与人类评估的一致性提供的见解有限。为解决这些差距,我们提出一个评分管道,其中学生生成的UML类图由助教(TAs)和LLMs独立评估,然后在单个标准级别比较评分。我们通过一项对软件设计课程中92个UML类图的定量研究来评估该管道,将助教评分与六个开源LLMs产生的评估进行比较。性能在单个标准上测量,突出LLMs与人类评分者存在差异的领域。我们的结果显示,每个标准的准确率高达88.56%,皮尔逊相关系数高达0.78,仅使用开源模型就比先前工作有显著改进。这些模型的性能接近助教,表明了一条通往混合主动评分系统的可能路径,其中助教在评分中得到辅助。我们的发现表明,开源LLMs可以通过明确识别与评分标准的一致性来有效支持UML类图评分。所提出的管道提供了一种实用方法,以应对随着学生人数增长而增加的工作量。

英文摘要

In this paper, we investigate the potential of open-source Large Language Models (LLMs) for grading Unified Modeling Language (UML) class diagrams. In contrast to existing work, which primarily evaluates proprietary LLMs, we focus on non-proprietary models, making our approach suitable for universities where transparency and cost are critical. Additionally, existing studies assess performance over complete diagrams rather than individual criteria, offering limited insight into how automated grading aligns with human evaluation. To address these gaps, we propose a grading pipeline in which student-generated UML class diagrams are independently evaluated by both teaching assistants (TAs) and LLMs. Grades are then compared at the level of individual criteria. We evaluate this pipeline through a quantitative study of 92 UML class diagrams from a software design course, comparing TA grades against assessments produced by six open-source LLMs. Performance is measured across individual criteria, highlighting areas where LLMs diverge from human graders. Our results show per-criterion accuracy of up to 88.56\% and a Pearson correlation coefficient of up to 0.78, representing a substantial improvement over previous work while using only open-source models. The models achieve performance close to that of a TA, suggesting a possible path toward a mixed-initiative grading system, where TAs are aided in their grading. Our findings demonstrate that open-source LLMs can effectively support UML class diagram grading by explicitly identifying alignment with grading criteria. The proposed pipeline provides a practical approach to managing increasing workloads with growing student counts.

2502.19193 2026-06-19 cs.SI cs.AI cs.NE 版本更新 70%

Simulation of Language Evolution under Regulated Social Media Platforms: A Synergistic Approach of Large Language Models and Genetic Algorithms

受监管社交媒体平台下的语言演化模拟:大语言模型与遗传算法的协同方法

Jinyu Cai, Yusei Ishimizu, Mingyue Zhang, Munan Li, Jialong Li, Kenji Tei

专题命中 其他LLM :用LLM模拟语言演化,结合遗传算法

AI总结 提出基于大语言模型的多智能体框架,结合遗传算法模拟用户语言策略在监管下的迭代演化,实验表明对话轮次增加可提升信息传递准确性和对话持续性。

Comments The manuscript has been accepted to IEEE Transactions on Computational Social Systems

详情
AI中文摘要

社交媒体平台经常实施限制性政策来调节用户内容,从而催生出创造性的规避语言策略。本文提出了一个基于大语言模型(LLMs)的多智能体框架,用于模拟在监管约束下语言策略的迭代演化。在该框架中,参与者智能体作为社交媒体用户,不断演化其语言表达,而监管智能体通过评估政策违规来模拟平台级别的监管。为了实现更逼真的模拟,我们采用了语言策略的双重设计(约束和表达)来区分冲突目标,并利用LLM驱动的遗传算法(GA)进行语言策略的选择、变异和交叉。该框架使用两种不同的场景进行评估:一个抽象的密码游戏和一个逼真的模拟非法宠物交易场景。实验结果表明,随着对话轮次的增加,不间断对话轮次的数量和信息传输的准确性都显著提高。此外,一项包含40名参与者的用户研究验证了生成对话和策略的现实相关性。消融研究也验证了GA的重要性,强调了其对长期适应性和整体结果改善的贡献。

英文摘要

Social media platforms frequently impose restrictive policies to moderate user content, prompting the emergence of creative evasion language strategies. This paper presents a multi-agent framework based on Large Language Models (LLMs) to simulate the iterative evolution of language strategies under regulatory constraints. In this framework, participant agents, as social media users, continuously evolve their language expression, while supervisory agents emulate platform-level regulation by assessing policy violations. To achieve a more faithful simulation, we employ a dual design of language strategies (constraint and expression) to differentiate conflicting goals and utilize an LLM-driven GA (Genetic Algorithm) for the selection, mutation, and crossover of language strategies. The framework is evaluated using two distinct scenarios: an abstract password game and a realistic simulated illegal pet trade scenario. Experimental results demonstrate that as the number of dialogue rounds increases, both the number of uninterrupted dialogue turns and the accuracy of information transmission improve significantly. Furthermore, a user study with 40 participants validates the real-world relevance of the generated dialogues and strategies. Moreover, ablation studies validate the importance of the GA, emphasizing its contribution to long-term adaptability and improved overall results.

2606.20537 2026-06-19 cs.LG cs.DC 新提交 65%

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

执行状态胶囊:面向低延迟、小批量、设备端物理AI服务的图绑定执行状态检查点与恢复

Liang Su

发表机构 * GitHub

专题命中 其他LLM :面向LLM服务的执行状态检查点与恢复机制

AI总结 针对低延迟、小批量、设备端物理AI服务场景,提出执行状态胶囊机制,通过图绑定检查点与恢复完整可恢复状态,在RTX 5090上实现亚毫秒级恢复,TTFT加速比达3.9倍至27倍。

Comments 27 pages, 9 figures

详情
AI中文摘要

主流LLM服务系统主要通过分页或基数键值(KV)缓存重用前缀工作。这对于高吞吐量、高并发服务非常有效,但它只管理执行状态的一个位置片段:KV缓存。我们研究相反的场景:低延迟、小批量、设备端物理AI服务,其中交互式LLM代理、语音系统和机器人策略在严格的响应预算下频繁分支、重置、中断和重新进入。我们引入执行状态胶囊,一种图绑定的检查点和恢复机制,用于在提交边界处保存完整的可恢复状态。FlashRT是一个白盒、后端内核运行时,其评估的NVIDIA CUDA后端在连续的静态缓冲区上运行捕获的图计划,无需块表间接寻址。由于活动状态是一组命名的封闭缓冲区,胶囊可以快照、恢复、分叉或回滚整个执行边界,包括KV、循环状态、卷积状态、MTP状态和元数据。这将重用从令牌寻址的KV片段转移到图绑定的执行状态边界。在RTX 5090上,胶囊恢复在存储状态级别是字节精确的,在贪婪解码下是令牌一致的。仅KV的消融实验出现分歧,表明循环状态是承载负载的。GPU驻留的快照和恢复是亚毫秒级的,TTFT相对于冷预填充的加速比从2k令牌时的3.9倍增长到16k令牌时的27倍。在Jetson AGX Thor和DGX Spark上,相同的正确性和结构属性成立。胶囊不是高吞吐量KV缓存服务的替代品;它们定义了一个互补的以延迟为先的服务点,用于显式执行状态重用。

英文摘要

Mainstream LLM serving systems reuse prefix work mainly through paged or radix key-value (KV) caches. This is highly effective for high-throughput, high-concurrency serving, but it manages only one positional fragment of execution state: the KV cache. We study the opposite regime: low-latency, small-batch, on-device physical-AI serving, where interactive LLM agents, speech systems, and robot policies repeatedly branch, reset, interrupt, and re-enter under tight responsiveness budgets. We introduce execution-state capsules, a graph-bound checkpoint and restore mechanism for the complete restorable state at a committed boundary. FlashRT is a white-box, backend-facing kernel runtime whose evaluated NVIDIA CUDA backend runs captured graph plans over contiguous static buffers with no block-table indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the whole execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. This moves reuse from token-addressed KV fragments to graph-bound execution-state boundaries. On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. A KV-only ablation diverges, showing that recurrent state is load-bearing. GPU-resident snapshot and restore are sub-millisecond, and TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. On Jetson AGX Thor and DGX Spark, the same correctness and structural properties hold. Capsules are not a replacement for high-throughput KV-cache serving; they define a complementary latency-first serving point for explicit execution-state reuse.

2606.19850 2026-06-19 cs.LG cs.AI 新提交 65%

Neural Additive and Basis Models with Feature Selection and Interactions

具有特征选择和交互的神经加性模型与神经基础模型

Yasutoshi Kishimoto, Kota Yamanishi, Takuya Matsuda, Shinichi Shirakawa

发表机构 * Yokohama National University(横滨国立大学)

专题命中 其他LLM :提出在神经加性模型中引入特征选择机制,属于可解释机器学习方法,与LLM无直接关系。

AI总结 提出在神经加性模型和神经基础模型中引入特征选择机制,通过特征选择层减少计算开销,并支持高维数据中的特征交互学习,性能优于或持平于现有GAM方法。

Comments Accepted at PAKDD 2024. Code is available at https://github.com/shiralab/NAM-FS

详情
AI中文摘要

深度神经网络(DNN)在各个领域表现出色,但通常可解释性较低。神经加性模型(NAM)及其变体神经基础模型(NBM)在广义加性模型(GAM)中使用神经网络(NN)作为非线性形状函数。这两种模型具有高度可解释性,并且在NN训练中表现出良好的性能和灵活性。NAM和NBM基于GAM架构,可以提供并可视化每个特征对预测的贡献。然而,当使用双输入NN来考虑特征交互或将其应用于高维数据集时,由于所需计算资源的增加,训练NAM和NBM变得棘手。本文提出将特征选择机制融入NAM和NBM以解决计算瓶颈。我们在两种模型中引入特征选择层,并在训练过程中更新选择权重。我们的方法简单,与原始NAM和NBM相比,可以降低计算成本和模型大小。此外,它使我们即使在数据维度很高的情况下也能使用双输入NN并捕获特征交互。我们证明,所提出的模型与原始NAM和NBM相比计算效率更高,并且与最先进的GAM相比表现出更好或相当的性能。

英文摘要

Deep neural networks (DNNs) exhibit attractive performance in various fields but often suffer from low interpretability. The neural additive model (NAM) and its variant called the neural basis model (NBM) use neural networks (NNs) as nonlinear shape functions in generalized additive models (GAMs). Both models are highly interpretable and exhibit good performance and flexibility for NN training. NAM and NBM can provide and visualize the contribution of each feature to the prediction owing to GAM-based architectures. However, when using two-input NNs to consider feature interactions or when applying them to high-dimensional datasets, training NAM and NBM becomes intractable due to the increase in the computational resources required. This paper proposes incorporating the feature selection mechanism into NAM and NBM to resolve computational bottlenecks. We introduce the feature selection layer in both models and update the selection weights during training. Our method is simple and can reduce computational costs and model sizes compared to vanilla NAM and NBM. In addition, it enables us to use two-input NNs even in high-dimensional datasets and capture feature interactions. We demonstrate that the proposed models are computationally efficient compared to vanilla NAM and NBM, and they exhibit better or comparable performance with state-of-the-art GAMs.

2606.19819 2026-06-19 cs.CL cs.AI 新提交 65%

CREDENCE: Claim Reduction for Decomposition & Enhanced Credibility -- Semantic Metrics and Convergence Analysis

CREDENCE: 面向分解与增强可信度的声明缩减——语义度量与收敛性分析

Phuong Huu Vu Tran, Thuan Duc Mai, Bach Xuan Le

发表机构 * Vietnamese-German University(越南德国大学) Ho Chi Minh University of Technology(胡志明市理工大学)

专题命中 其他LLM :声明分解和修复框架,用于事实核查,与LLM应用相关。

AI总结 提出CREDENCE框架,通过语义F1度量解决Jaccard度量对释义声明的低估问题,并形式化分析修复管道的收敛性,实验表明语义F1比Jaccard F1提升15-32个百分点,规则修复将原子性违反率降低47-100%。

Comments 40 pages, 6 figures, 19 tables. Submitted to Language Resources and Evaluation

详情
AI中文摘要

将复合句分解为原子化的、可验证的声明是可靠自动化事实核查的前提。先前工作依赖基于词重叠(Jaccard)的度量,系统性地低估了释义声明的分解质量,并且缺乏对修复循环的形式化终止分析。我们提出CREDENCE,一个改进的声明分解与评估框架,解决了这两个缺陷。我们的贡献包括:(1) 语义F1:我们使用BGE-large余弦相似度保真度度量,解决了Jaccard的惩罚问题,并提高了下游事实核查的准确性;(2) 收敛定理:我们形式化地表征了修复管道的四个性质,确立了在预言解析器假设下基于规则的修复是单调且有限终止的;基于LLM的自修复被证明是非单调的,需要早期退出保护;(3) 三个评估基准,涵盖社交媒体、百科全书和新闻领域,用于跨领域泛化度量;(4) 跨四个分解器模型(3.8B-12B)和一个封闭API模型的多模型基准测试。在SocialClaimSplit、WikiSplitBench和ClaimDecompBench上的实验表明,语义F1比Jaccard F1提升15-32个百分点。在SocialClaimSplit和WikiSplitBench上,EPR范围为0.94至1.00,而ClaimDecompBench由于更难的新闻领域构造,包含较低的基线EPR情况(低至0.824),规则修复相对于基线模型将原子性违反率(AVR)降低了47-100%,且不降低保真度。

英文摘要

Decomposing compound sentences into atomic, verifiable claims is a prerequisite for reliable automated fact-checking. Prior work has relied on token-overlap (Jaccard) metrics that systematically underestimate decomposition quality for paraphrastic claims, and has lacked formal termination analysis for the repair loop. We present Credence, a revised claim decomposition and evaluation framework addressing both shortcomings. Our contributions are: (1) Semantic-F1: we use BGE-large cosine similarity fidelity metric that resolves Jaccard's penalisation and improves downstream fact-checking accuracy; (2) Convergence theorems: we formally characterise four properties of the repair pipeline, establishing that rule-based repair is monotone and finitely terminating under an oracle parser assumption; LLM-based self-repair is provably non-monotone and requires an early-exit guard; (3) Three evaluation benchmarks spanning social-media, encyclopaedic, and news domains for cross-domain generalisation measurement; (4) Multi-model benchmarking across four decomposer models (3.8B-12B) and a closed API model. Experiments on SocialClaimSplit, WikiSplitBench, and ClaimDecompBench show that Semantic-F1 outperforms Jaccard-F1 by +15-32pp. EPR ranges from 0.94 to 1.00 on SocialClaimSplit and WikiSplitBench, while ClaimDecompBench includes lower base EPR cases (down to 0.824) due to harder news-domain constructions, and rule-repair reduces the Atomicity Violation Rate (AVR) by 47-100% relative to the base model without degrading fidelity.