arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06586 2026-06-08 cs.CL New submission

Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

Jonathan von Rad, Louis Arts, George Burgess, Eleftheria Kolokytha, Harry O'Donnell, Ektor Oikonomidis Doumpas, Eduardo Sanchez, Yao Lu, Pontus Stenetorp

Affiliations: University College London; Centre for Artificial Intelligence

AI summary: Introduces the PolyFact dataset and uses GRPO-based reinforcement learning to improve the consistency of cross-lingual factual recall in large language models, outperforming supervised fine-tuning; mechanistic analysis shows that GRPO reduces language specialization, promoting shared cross-lingual representations.

Comments: Under Review at EMNLP 2026

Abstract

Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.
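
For readers unfamiliar with GRPO, its core is a group-relative advantage: several answers are sampled for the same prompt, each is scored, and the rewards are standardized within that group. The sketch below illustrates only that step. The exact-match reward and the function names are hypothetical, since the abstract does not describe the paper's reward, and the use of the population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

def exact_match_reward(answer, gold):
    # Hypothetical reward: 1.0 when the normalized answer equals the gold entity.
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0

# Four sampled answers to the same factual question.
samples = ["Paris", "Lyon", "paris ", "Marseille"]
rewards = [exact_match_reward(a, "Paris") for a in samples]
advantages = group_relative_advantages(rewards)
```

Answers scoring above the group mean receive a positive advantage and the rest a negative one; when every sample gets the same reward, all advantages are zero and the group contributes no learning signal.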

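2606.06614 2026-06-08 cs.CL cs.AI cs.HC New submission

Re-Centering Humans in LLM Personalization

Lechen Zhang, Jiarui Liu, Tal August

Affiliations: University of Illinois Urbana-Champaign; Carnegie Mellon University

AI summary: Studies the performance gap in LLM personalization between synthetic and human data. Collected human conversations and judgments reveal system limitations in attribute extraction, relevant-attribute pairing, and personalized response generation; lightweight training-based interventions narrow the gap in the first two stages.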

Abstract

Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect human conversations (550 conversations) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.

2606.06635 2026-06-08 cs.CL cs.AI New submission

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, Mykel J. Kochenderfer

Affiliations: Department of Computer Science, Stanford University; Department of Aeronautics and Astronautics, Stanford University

AI summary: Uses token-level uncertainty signals to divide language model reasoning failures into committed failures (locking onto an incorrect path early) and persistent uncertainty (uncertainty accumulating throughout the trace); the framework's predictions hold in 20 of 23 model-dataset configurations and offer guidance for self-consistency strategies.

Abstract

Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework's falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate our failure mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.

2606.06667 2026-06-08 cs.CL New submission

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

Jiachen Zhao, Zhengxuan Wu, Aryaman Arora, Yiyou Sun, David Bau, Weiyan Shi

Affiliations: Northeastern University; Stanford University; University of California, Berkeley

AI summary: Proposes the Piggyback Hypothesis, under which chat-template tokens carry fine-tuned behavior over to unrelated domains, and introduces TReFT, which regularizes token representations during training to mitigate emergent misalignment; the method is effective across multiple datasets.

Abstract

The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated test domains. In this work, we propose the Piggyback Hypothesis: the chat-template tokens can piggyback the finetuned behaviour onto out-of-domain queries. We validate this hypothesis by showing that subtle perturbations to the prefix (tokens preceding all user queries), or patching the prefix representations with those from the unfinetuned model, can restore alignment without changing the user query. Building on this finding, we propose Token-Regularized Finetuning (TReFT), which regularizes specific token representations during training to mitigate EM. Across different models and multiple EM-inducing datasets, TReFT reduces EM while preserving in-domain learning. On Llama-3.1-8B finetuned on the legal domain, TReFT achieves 33.5% more EM reduction than data interleaving with a retain set of aligned examples. We further show that TReFT extends to other narrow-finetuning settings, including abstention, tool use, and refusal (off-topic generalization is reduced by 54.3% on average), supporting the Piggyback Hypothesis. Broadly, our work highlights that LLMs may learn and generalize in unintended ways and suggests a path toward more constrained finetuning. It also calls for further study of how shared input features can piggyback model behavior across domains.

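2606.06674 2026-06-08 cs.CL cs.CY New submission

What Do People Actually Want From AI? Mapping Preference Plurality

Julia Sepúlveda Coelho, Scott A. Hale

Affiliations: Oxford Internet Institute, University of Oxford; Meedan

AI summary: Analyzing 1,500 open-ended responses from 75 countries, finds that people want different things from AI: most values are requested by only a minority, the same word (such as "truthfulness") carries divergent meanings, and some capabilities are contested, exposing fundamental flaws in how current RLHF methods aggregate preferences.

Comments: Accepted at the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)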

Abstract

Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.

2606.06679 2026-06-08 cs.CL cs.AI cs.CY New submission

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

Xi Xuan, Wenxin Zhang, Yufei Zhou, King-kui Sin, Chunyu Kit

Affiliations: City University of Hong Kong; University of Chinese Academy of Sciences

AI summary: Introduces HKJudge, the first sentence-level, expert-annotated legal discourse dataset, covering criminal judgments from all levels of Hong Kong's courts; designs a two-tier discourse schema (26 rhetorical roles and 3 sentencing elements) and benchmarks BERT-based models and LLMs.

Abstract

Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising ~290k sentences and ~6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of κ = 0.8. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at https://github.com/xuanxixi/HKJudge.

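2606.06708 2026-06-08 cs.CL New submission

Signal-Driven Observation for Long-Horizon Web Agents

Shubham Gaur, Ian Lane

Affiliations: University of Cambridge

AI summary: Proposes Signal-Driven Observation (SDO): a dedicated sub-call reads the full DOM but returns only task-relevant elements, and is re-invoked only when a lightweight signal detector fires, addressing context degradation in long-horizon web agents.

Comments: 10 pages, 1 figure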

Abstract

Web agents operating over long horizons ingest raw DOM and accessibility trees -- routinely tens of thousands of tokens -- at every action step, causing progressive context degradation that erodes reasoning well before tasks complete. We argue that this coupling of observation frequency to action frequency is an architectural mistake. Drawing on the insight from Recursive Language Models that querying a document outperforms reading it wholesale, we propose Signal-Driven Observation (SDO): a dedicated sub-call reads the full DOM but returns only task-relevant elements and their selectors, and is re-invoked only when a lightweight signal detector fires -- triggered by URL transitions, newly visible interactive elements, action failures, or exogenous browser events. We outline the open problems SDO introduces and call on the community to treat observation compression as a core architectural decision in web agent design.
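
The four triggers named in the abstract can be pictured as a cheap comparison between consecutive browser snapshots. The following is a minimal sketch under stated assumptions, not the authors' implementation: `BrowserSnapshot`, its fields, and `fired_signals` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BrowserSnapshot:
    """Hypothetical lightweight state captured after each action."""
    url: str
    interactive_ids: frozenset = frozenset()  # ids of visible interactive elements
    last_action_ok: bool = True
    browser_events: tuple = ()                # e.g. dialogs or downloads

def fired_signals(prev, curr):
    """Return the names of the signals that would trigger a new observation."""
    signals = []
    if curr.url != prev.url:
        signals.append("url_transition")
    if curr.interactive_ids - prev.interactive_ids:
        signals.append("new_interactive_elements")
    if not curr.last_action_ok:
        signals.append("action_failure")
    if curr.browser_events:
        signals.append("browser_event")
    return signals

before = BrowserSnapshot("https://example.com/cart", frozenset({"btn-checkout"}))
same = BrowserSnapshot("https://example.com/cart", frozenset({"btn-checkout"}))
after = BrowserSnapshot("https://example.com/pay", frozenset({"btn-checkout", "input-card"}))
```

In this sketch the agent would re-invoke the DOM-reading sub-call only when `fired_signals` returns a non-empty list, and otherwise keep acting on the previously returned elements and selectors.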

2606.06712 2026-06-08 cs.CL cs.AI New submission

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Xingyu Su, Jacob Helwig, Shubham Parashar, Atharv Chagi, Lakshmi Jotsna, Degui Zhi, James Caverlee, Dileep Kalathil, Shuiwang Ji

Affiliations: Department of Computer Science and Engineering, Texas A&M University; Department of Bioinformatics and Systems Medicine, University of Texas Health Science Center at Houston; Department of Electrical and Computer Engineering, Texas A&M University

AI summary: Proposes the On-Policy Diffusion Language Model (OPDLM), which converts autoregressive models into diffusion language models via on-policy distillation, addressing distribution shift and train-inference mismatch and achieving strong performance with 15x to 7,000x fewer training tokens.

Abstract

We study the transformation of autoregressive models (ARLMs) into diffusion language models (DLMs). Rather than pretraining from scratch, prior work replaces the causal attention in ARLMs with bidirectional attention and then trains the resulting model using a DLM objective. However, these approaches incur two distribution shifts. First, transitioning from a next-token prediction objective to a DLM objective can discard knowledge acquired by the ARLM during training. Second, standard DLMs suffer from a train-inference mismatch, as the training loss is defined on randomly masked sequences rather than the trajectories encountered at inference produced by confidence-based decoding. To address both challenges, we introduce an On-Policy Diffusion Language Model (OPDLM) in which On-Policy Distillation (OPD) is employed for ARLM-to-DLM transformation. Specifically, OPDLM is trained via self-OPD, where the student, an ARLM with bidirectional attention, generates its own trajectories, and the teacher, the original frozen ARLM, distills its knowledge by providing target logits on these trajectories. By training directly in an on-policy manner, OPDLM eliminates the train-inference mismatch in DLMs, while distillation from the original model enhances knowledge retention from the ARLM. Empirical results demonstrate that OPDLM requires 15x to 7,000x fewer training tokens with strong performance across a wide variety of tasks. OPDLM avoids the prohibitive cost of DLM pretraining and positions DLM transformation as a form of ARLM post-training.

2606.06715 2026-06-08 cs.CL cs.AI cs.LG New submission

Does Topic Sentiment Cause Perceived Ideology? Comparing Human and LLM Annotations in Political News Articles

Upasana Chatterjee

Affiliations: Columbia University

AI summary: Studies the causal effect of topic sentiment on perceived political ideology by comparing human and LLM annotations; only fine-tuned GPT-4o-mini yields significant causal effects, which the author attributes to shortcut learning.

Comments: Accepted to ACL SRW 2026

Abstract

We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b-versatile, we compare ideology labels from expert human annotators, GPT-4o-mini (baseline and finetuned), and Llama-3.3-70B. We apply Double Machine Learning (DML) and community-level mediation analysis across all four annotation paradigms. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only annotator paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs) in mediation. We interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalise a spurious sentiment--ideology coupling not operative in human judgment for this task. This coupling is structurally invisible to F1-based evaluation, with implications for the use of LLM annotations as silver labels and as proxies for human judgment in downstream causal analyses.

2606.06738 2026-06-08 cs.CL New submission

Modular Monolingual Adaptation using Pretrained Language Models

Nalin Kumar, Ondřej Dušek

Affiliations: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

AI summary: Proposes a modular approach that replaces the tokens, freezes the corresponding embeddings, and tunes the rest of the model, improving NLU task performance on low-resource languages and reportedly outperforming full-model fine-tuning.

Comments: Accepted to ACL 2026 Industry Track

Abstract

Building monolingual language models (LMs) for low-resource languages typically relies on adapting pretrained language models (PLMs) by finetuning the whole model on the target language. This approach is widely favored over training from scratch, as it enables effective knowledge transfer. Additionally, prior work has shown that using a language-specific tokenizer can enhance the adaptability. In this work, we hypothesize that full model tuning is often unnecessary and propose a more modular approach. Specifically, we replace the tokens, freeze the corresponding embeddings, and tune the rest of the model. We use Scottish Gaelic, Irish, and Quechua for our experiments, with Quechua being a very low-resource language (8.5k training instances). Evaluation on natural language understanding (NLU) tasks -- mask filling, NER, and POS -- shows that our proposed approach improves performance when adapting models to low-resource languages. Additionally, we provide a comprehensive analysis of the effectiveness of training strategies, the choice of pretrained embeddings, and models.

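2606.06745 2026-06-08 cs.CL New submission

When to Think Deeply: Inhibitory Deliberation for LLM Reasoning

Zhixuan He, Yue Feng

Affiliations: University of Birmingham, United Kingdom

AI summary: Proposes IDPR, a framework in which an inhibition controller decides, based on the fast answer, whether to invoke slow reasoning; on a mathematical reasoning test set it invokes slow reasoning on only 8.20% of examples and raises accuracy from 47.90% to 48.92%.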

Abstract

Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework for response-conditioned inhibitory deliberation. IDPR first generates a concise intuitive answer and then uses an inhibition controller to decide whether that specific response should be released or suppressed in favor of slow reasoning. Unlike input-only routers, the inhibition controller conditions on the fast answer and fast-side evidence, including confidence, logit margin, parseability, and generation cost. We train the controller from paired fast-slow outcomes and select the inhibition threshold on a held-out validation set under an accuracy-first slow-call budget. On a held-out 5,000-example mathematical reasoning test set, IDPR invokes slow reasoning on only 8.20% of examples and improves accuracy from 47.90% to 48.92%. Under the same slow-call budget, random routing decreases accuracy to 46.76%, while the strongest confidence-based baseline reaches 48.22%. IDPR also achieves the highest corrective precision, showing that response-conditioned inhibition better identifies fast answers that benefit from slow reasoning.
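
The threshold-selection step can be illustrated with a small sketch. It assumes each validation record is a tuple of (inhibition score, whether the fast answer was correct, whether the slow answer was correct), and it reads "accuracy-first slow-call budget" as: among thresholds whose slow-call rate stays within the budget, pick the most accurate, breaking ties toward fewer slow calls. That reading, the record layout, and the function names are assumptions rather than details given in the abstract.

```python
def evaluate(threshold, records):
    """Accuracy and slow-call rate when scores >= threshold are sent to slow reasoning."""
    slow_calls = 0
    correct = 0
    for score, fast_ok, slow_ok in records:
        if score >= threshold:
            slow_calls += 1
            correct += slow_ok   # fast answer suppressed, slow answer released
        else:
            correct += fast_ok   # fast answer released as-is
    n = len(records)
    return correct / n, slow_calls / n

def select_threshold(records, budget):
    """Pick the most accurate threshold whose slow-call rate is within the budget."""
    # float("inf") means "never inhibit", so at least one candidate is always feasible.
    candidates = sorted({score for score, _, _ in records}) + [float("inf")]
    best_key, best_threshold = None, None
    for t in candidates:
        accuracy, rate = evaluate(t, records)
        if rate > budget:
            continue
        key = (accuracy, -rate)  # accuracy first, then fewer slow calls
        if best_key is None or key > best_key:
            best_key, best_threshold = key, t
    return best_threshold

# Toy validation set: (inhibition score, fast correct?, slow correct?)
validation = [(0.9, False, True), (0.8, False, False), (0.2, True, True), (0.1, True, False)]
threshold = select_threshold(validation, budget=0.25)
```

On this toy data only the highest-scoring example fits in a 25% budget, and routing it to slow reasoning fixes one fast error, which mirrors the "corrective precision" idea in the abstract.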

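2606.06748 2026-06-08 cs.CL cs.AI cs.LG New submission

Evidence Graph Consistency in Retrieval-Augmented Generation: A Model-Dependent Analysis of Hallucination Detection

Jianru Shen

AI summary: Proposes the Evidence Graph Consistency (EGC) framework, which builds a local evidence graph per response and computes five structural consistency measures to detect hallucination; finds that consistency features point in opposite directions across model families, so embedding-based graph consistency cannot serve as a model-independent detection signal.

Comments: Accepted at the International Conference on Advanced Machine Learning and Data Science; to appear in the IEEE Xplore proceedings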

Abstract

Retrieval-Augmented Generation (RAG) reduces but does not eliminate hallucination in large language models. Existing detection methods rely on flat similarity between generated answers and retrieved passages, ignoring structural relationships among evidence pieces and answer claims. We propose Evidence Graph Consistency (EGC), a framework that constructs a local evidence graph per response and computes five structural consistency measures as hallucination indicators. Evaluated on the full question answering split of RAGTruth across six LLMs (5,767 responses), EGC reveals a consistent model-family split: graph consistency features show the expected diagnostic direction for hallucinations in Llama-2 models but exhibit systematic reversal in GPT-4, GPT-3.5, and Mistral-7B. This reversal suggests qualitatively different hallucination patterns across model families and indicates that embedding-based graph consistency cannot serve as a model-independent hallucination detection signal.

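2606.06755 2026-06-08 cs.CL cs.ET New submission

PromptPrint: Behavioral Biometrics Through Natural Language Prompting in LLMs

Shaiv Patel, Kartik Narayan, Vishal Patel

Affiliations: Johns Hopkins University

AI summary: Introduces PromptPrint, a study of whether the short prompts users send to LLMs carry identifiable behavioral biometrics. Analysis of vocabulary, syntax, and discourse patterns supports the lexical stability hypothesis, but also reveals a uniqueness-consistency paradox and identity signals that are fragile under semantic paraphrasing.

Comments: 10 pages, 6 figures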

Abstract

Authorship attribution research has traditionally focused on long-form, expressive texts; however, interactions with large language models (LLMs) are typically brief and task-driven prompts. This raises a fundamental question: do such prompts contain a stable, author-identifiable, and distinctive signal? We introduce PromptPrint, a systematic study of prompt-based identity, the hypothesis that a user's habitual vocabulary, syntax, and discourse patterns form a learnable behavioral biometric. Using 20,680 real prompts from 1,034 users, we establish three key findings. First, lexical representations significantly outperform semantic encoders, supporting the "lexical stability hypothesis": identity is primarily encoded in surface-level word choice rather than abstract intent. Second, stylometric features exhibit a "uniqueness-consistency paradox": users are highly distinctive across the population, yet behaviorally inconsistent across contexts. Third, adversarial analysis reveals a clear vulnerability spectrum: identity signals are robust to minor lexical perturbations but degrade substantially under semantic paraphrasing. Overall, our results demonstrate strong identification performance at scale, establishing prompt-based identity as a viable behavioral biometric. This work introduces a new perspective on user modeling in LLM interactions, with important implications for security and privacy. Data and code will be released upon the acceptance of our work.

2606.06781 2026-06-08 cs.CL New submission

When Better Codebooks Are Not Enough: Predictive Performance and Behavioral Reliability in LLM Political Event Coding

Zixian He, Bharath Raahul Murugesan, Patrick Brandt, Yibo Hu

Affiliations: Independent Researcher; Illinois Institute of Technology; The University of Texas at Dallas

AI summary: In political event coding, shows that operationalizing expert codebooks into LLM-friendly forms substantially improves classification performance, but that these predictive gains do not fully translate into behavioral reliability: models can still fail under controlled codebook changes.

Comments: 14 pages, 3 figures, 11 tables

Abstract

High accuracy does not necessarily make an LLM a faithful coder. This issue matters because many social-science studies rely on expert-written codebooks to turn text into structured data. We study this problem in political event coding, a challenging source-target relation classification task beyond ordinary sentence-level classification, where models must determine what one actor did to another using detailed coding rules. We test whether expert codebooks become more effective when operationalized into LLM-friendly forms with clearer definitions, examples, retrieved context, and rules for difficult cases. We then evaluate behavioral reliability under controlled changes to label names, codebook order, and label-definition mappings. Clearer codebooks substantially improve classification performance, especially for fine-grained event classification. However, these predictive gains do not fully translate into behavioral reliability. Models may produce valid labels and recover definitions while still failing behavioral reliability tests under controlled codebook changes. These findings suggest that codebook-guided LLM systems should be evaluated not only by accuracy, but also by whether they preserve the coding logic that makes coded outputs meaningful for social-science research.

2606.06788 2026-06-08 cs.CL cs.HC New submission

Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses

Indu Panigrahi, Tal August

Affiliations: Siebel School of Computing and Data Science; University of Illinois Urbana-Champaign

AI summary: Proposes an interaction-oriented evaluation framework based on language complexity, testing the ability of GPT-5.1 and other models to generate responses at different complexity levels; the best model shifts complexity in the correct direction only 46% of the time.

Comments: Preprint

Abstract

Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with 16 participants that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for 98 scientific queries. While models vary complexity across responses, most changes remain inconsistent, with the best performing model (Claude Sonnet 4.5) only shifting reliable complexity measures in the correct direction 46% of the time. Our findings hold with increased sample size and alternative complexity levels.

2606.06797 2026-06-08 cs.CL 新提交

Korean Culture into LLM Alignment: Toward Cultural Coherence

将韩国文化融入大语言模型对齐:迈向文化一致性

MinJae Jung, Minwoo Kim

发表机构 * SKT LG AI Research(LG人工智能研究) Kanana Team(Kanana团队)

AI总结 针对大语言模型的文化对齐,提出构建性定义而非仅抑制负面输出,设计基于提示的种子生成器扩展韩国危害分类,结合韩国法律、社会规范和解释惯例制定安全响应策略,通过DPO微调提升韩国文化安全率且不损害通用能力。

Comments Accepted to ICML 2026 Workshop on Culture X AI

详情
AI中文摘要

大语言模型的文化方面工作主要集中于负面目标:抑制哪些输出。我们认为还需要一个建设性的对应部分,即文化一致性响应的操作性定义,而不仅仅是它必须避免什么,并针对韩国进行了实例化。我们设计了一个围绕基于提示的LLM种子生成器的对齐数据流水线,该生成器扩展了韩国危害分类,其核心是韩国文化适应的安全响应策略:一个基于韩国法律框架、社会规范和解释惯例的逐类别指南,三个前沿模型各自根据该指南生成候选响应。对所得三元组进行DPO微调提高了六个开源LLM的韩国文化安全率,同时未导致韩国通用能力基准的大幅下降,定性输出显示微调模型能够引用韩国法规和机构程序,并在适当时提供建设性的韩国背景信息以及拒绝回答。

英文摘要

Cultural-aspect work on large language models is dominated by a negative target: which outputs to suppress. We argue that a constructive counterpart is also needed, a working definition of what a culturally coherent response is rather than only what it must avoid, and instantiate it for Korean. We design an alignment-data pipeline around a prompt-based LLM seed generator that expands a Korean harm taxonomy, with a Korean-culturally-adapted safe-response policy at its centre: a per-category guideline grounded in Korean legal frameworks, social norms, and interpretive conventions, against which three frontier models each produce a candidate response. DPO fine-tuning on the resulting triplets improves the Korean cultural safe rate across six open-weight LLMs while causing no large degradation on Korean general-capability benchmarks, and qualitative outputs show fine-tuned models naming Korean statutes and institutional procedures and, where appropriate, supplying constructive Korean-context information alongside refusal.

2606.06812 2026-06-08 cs.CL 新提交

Quantifying Media Representation Dynamics Across 25 Years of News Reporting on Policing-related Deaths

量化25年警务相关死亡新闻报道中的媒体表征动态

Farhan Samir, Jappun Dhillon, Meghna Ravikumar, Syed Ishtiaque Ahmed, Vered Shwartz

发表机构 * University of Toronto(多伦多大学) University of British Columbia(不列颠哥伦比亚大学)

AI总结 通过分析25年间4000篇加拿大新闻报道,提出PerspectiveGap模型,发现国家官僚视角出现频率是公众视角的近三倍,且近年来平民代表有所增加。

Comments 9 pages, 6 figures. WebSci'26

详情
Journal ref
Proceedings of the 18th ACM Web Science Conference 2026 (pp. 421-429)
AI中文摘要

我们进行了迄今为止最大规模的加拿大警务相关死亡新闻叙事计算分析,涵盖了过去25年间的4000篇文章。我们开发了一个新颖的计算模型PerspectiveGap,该模型基于先前关于警务媒体表征的社会学研究。我们发现,关于警务相关死亡的报道平均而言,国家官僚视角的出现频率几乎是其他公众成员(包括亲属、社区成员、目击者、代表家庭的律师或公民自由团体)视角的三倍。相当一部分文章完全没有平民行为者的观点,尽管近年来平民代表有所增加。定性分析表明,国家官僚对这些死亡的描述往往是临床和程序性的,而平民话语则带有明显更多的情感色彩。这里开发的PerspectiveGap框架可以适用于其他司法管辖区,提供了一种可扩展的方法来分析媒体系统如何构建关于警务和问责的叙事。

英文摘要

We perform the largest known computational analysis of Canadian news narratives about police-involved deaths, spanning 4,000 articles from the last quarter-century. We develop a novel computational model, PerspectiveGap, grounded in prior sociological work on media representation of policing. We find that reporting on police-involved deaths on average features perspectives from state bureaucrats nearly three times as often as perspectives from other members of the public, including relatives, community members, eyewitnesses, lawyers representing the family, or civil liberties groups. A considerable fraction of articles contain no points of view from civilian actors, though civilian representation has increased in recent years. Qualitatively, we find that state bureaucrats' accounts of these deaths tend to be clinical and procedural, while civilian discourse carries considerably more emotional valence. The PerspectiveGap framework developed here can be contextualized to other jurisdictions, offering a scalable approach for analyzing how media systems construct narratives around policing and accountability.

2606.06825 2026-06-08 cs.CL cs.AI 新提交

Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards

Progress-SQL: 通过渐进式奖励改进文本到SQL的强化学习

Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan, Weining Qian

发表机构 * East China Normal University(华东师范大学)

AI总结 提出Progress-SQL,一种多轮强化学习框架,通过Oracle引导诊断树(ODT)生成子句级结构反馈,结合渐进式奖励(结构对齐、词汇对齐、延迟奖励和执行状态奖励),提升文本到SQL生成的准确性和鲁棒性。

详情
AI中文摘要

强化学习最近在改进大型语言模型进行文本到SQL生成方面显示出潜力,但现有方法通常优化基于单个SQL状态定义的一次性奖励。这种奖励为迭代SQL纠正提供的指导有限,不足以捕捉多轮SQL改进的提升。在本文中,我们提出Progress-SQL,一种具有渐进式奖励的多轮强化学习框架,用于文本到SQL。我们的方法引入Oracle引导诊断树(ODT),它将SQL查询抽象为子句级结构轮廓,并为下一轮改进生成诊断反馈。为了提供密集且稳健的奖励信号,我们将基于ODT的结构对齐与词汇对齐相结合,并定义一个渐进式奖励,衡量从初始SQL到最终SQL的改进。我们进一步加入一个偏好早期正确性的渐进延迟奖励和一个鼓励从无效SQL中恢复的执行状态奖励。在BIRD、Spider和Spider鲁棒性变体上的实验表明,我们的方法在主要评估和鲁棒性评估上均一致提升了文本到SQL的性能。

英文摘要

Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typically optimize one-shot rewards defined over a single SQL state. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi-turn SQL refinement. In this paper, we propose Progress-SQL, a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL. Our approach introduces an Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. To provide dense and robust reward signals, we combine ODT-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text-to-SQL performance across both primary and robustness evaluations.
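
The idea of a reward that "measures the improvement from the initial SQL to the final SQL" can be sketched as a difference of alignment scores against a reference query. This is a minimal illustration under stated assumptions: `token_jaccard` is a hypothetical lexical stand-in, the ODT-based structural term from the abstract is not modelled, and the function names are invented for this example.

```python
def token_jaccard(sql_a: str, sql_b: str) -> float:
    """Lexical alignment stand-in: Jaccard overlap of lower-cased tokens."""
    a = set(sql_a.lower().split())
    b = set(sql_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def progressive_reward(initial_sql: str, final_sql: str, gold_sql: str) -> float:
    """Improvement in alignment with the reference from first to last turn.

    Positive when refinement moved closer to the reference, negative when
    it moved away, zero when nothing changed.
    """
    return token_jaccard(final_sql, gold_sql) - token_jaccard(initial_sql, gold_sql)


# Toy example (made up): the first attempt misses the WHERE clause.
gold = "SELECT name FROM users WHERE age > 30"
initial = "SELECT name FROM users"
final = "SELECT name FROM users WHERE age > 30"
reward = progressive_reward(initial, final, gold)
```

Unlike a one-shot reward on the final query alone, a difference of this kind gives no credit to a trajectory that starts correct and stays correct, which may be one reason the abstract pairs it with separate latency and execution-status terms.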

2606.06835 2026-06-08 cs.CL 新提交

Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning

Translate-R1:通过强化学习实现成本感知的翻译工具使用

Pratik Jayarao, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Adithya M Devraj, Meet Vadera, Priyanka Nigam, Bing Yin

发表机构 * Amazon Stores Foundation AI(亚马逊商店基础人工智能)

AI总结 提出一种基于强化学习的门控策略,让LLM自主评估理解能力,仅在必要时调用翻译工具,在22种语言上提升奖励并降低翻译成本。

Comments 14 pages main text plus appendix, 7 figures, 11 tables

详情
AI中文摘要

LLM在不同语言上的性能差距已有充分记录,而原生缩小差距需要对大多数语言不存在的语料库进行预训练或微调。翻译提供了一种替代方案:将输入转换为模型的主导语言,从而立即释放其全部能力。然而,对每个输入都应用翻译对于模型已能处理的语言来说是浪费的,而将选择权留给模型则相反地失败,因为LLM过于自信,即使无法理解输入也会跳过工具。先前的工作通过语言特定规则、领域启发式、语言标识符或外部路由器来解决这一问题,每种方法都需要手动工程。我们转而学习一个单一策略,仅从奖励中决定何时翻译,开发出语言和领域自适应的内省能力,评估自身理解能力,并仅在无法原生解决任务时调用翻译。使用我们保留答案的翻译流水线构建的数据,我们在后训练的Qwen3-4B上继续RL,涵盖3个资源层级(高、低、极低)的22种语言和5个领域,并引入置信度门控GSPO用于成本敏感的工具使用。门控策略在基线基础上将奖励提升:高资源+4.6,低资源+23.5,极低资源+17.5。与几乎总是翻译的无约束策略相比,它以63%的成本保留了全部奖励,并在87%的成本敏感范围内是帕累托最优的。此外,为了模拟在完全未见语言上的行为,我们创建了2种合成语言,在这些语言上,我们的门控策略比过度自信的基线(即使在这些不可理解的输入上也未充分利用工具)提升了+18.7。该策略零样本迁移到9种保留语言,我们分析了工具使用在训练过程中如何按语言和领域出现。

英文摘要

The performance gap across languages in LLMs is well documented, and closing it natively requires pretraining or fine-tuning on corpora that, for most languages, do not exist. Translation offers an alternative: converting an input into the model's dominant language unlocks its full capabilities at once. Applying translation to every input, however, is wasteful for languages the model already handles, while leaving the choice to the model fails in the opposite way, as LLMs are overconfident and skip the tool even when they cannot understand the input. Prior work resolves this with language-specific rules, domain heuristics, language identifiers, or external routers, each requiring manual engineering. We instead learn a single policy that decides when to translate from reward alone, developing language- and domain-adaptive introspection that assesses its own comprehension and invokes translation only when it cannot solve a task natively. Using data built by our answer-preserving translation pipeline, we continue RL on the post-trained Qwen3-4B across 22 languages in 3 resource tiers (High, Low, XLow) and 5 domains, and introduce confidence-gated GSPO for cost-sensitive tool use. The gated policy lifts reward over the baseline by +4.6 on High, +23.5 on Low, and +17.5 on XLow. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. Additionally, to simulate behavior on a completely unseen language, we create 2 synthetic languages, where our gated policy improves +18.7 over the overconfident baseline that underutilizes the tool even on these incomprehensible inputs. The policy transfers zero-shot to 9 held-out languages, and we analyze how tool use emerges over training, per language and per domain.
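
The trade-off behind "cost-sensitive tool use" can be illustrated by scoring each policy as task reward minus a cost weight times its translation rate, then asking which policy wins at a given weight. The sketch below shows only that comparison. It is not confidence-gated GSPO, and the class, field names, and numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PolicyStats:
    name: str
    reward: float     # mean task reward (toy value)
    tool_rate: float  # fraction of inputs sent to the translation tool


def utility(policy: PolicyStats, cost_weight: float) -> float:
    """Reward penalised by how often the paid tool is called."""
    return policy.reward - cost_weight * policy.tool_rate


def best_policy(policies: Sequence[PolicyStats], cost_weight: float) -> PolicyStats:
    """Policy with the highest cost-adjusted utility at this cost weight."""
    return max(policies, key=lambda p: utility(p, cost_weight))


# Toy numbers (made up): a gated policy matches "always translate" on
# reward while calling the tool far less often.
policies = [
    PolicyStats("never_translate", reward=0.5, tool_rate=0.0),
    PolicyStats("gated", reward=0.8, tool_rate=0.4),
    PolicyStats("always_translate", reward=0.8, tool_rate=1.0),
]
cheap_tool_choice = best_policy(policies, cost_weight=0.1).name
costly_tool_choice = best_policy(policies, cost_weight=1.0).name
```

Sweeping `cost_weight` over a range and recording where each policy wins is one way to express a statement of the form "optimal across some share of the cost-sensitivity range".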

2606.06840 2026-06-08 cs.CL cs.AI cs.LG 新提交

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

先刻画再蒸馏:大输出空间中的机械推理

Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam

发表机构 * Khoury College of Computer Sciences, Northeastern University(东北大学计算机科学学院)

AI总结 研究现代推理模型在百万级标签空间中实现零样本多标签分类的机制,提出“候选列表生成+精细推理”两阶段模型,并基于此开发机械蒸馏策略,优于标准蒸馏。

详情
AI中文摘要

现代推理模型在具有挑战性的多标签任务上表现出令人惊讶的强大零样本性能,这些任务需要从数十万到数百万个候选标签中选择一小部分相关选项。我们研究了它们如何机械地实现这一点。我们将推理描述为一个两阶段过程:首先进行广泛的“候选列表生成”,然后对生成的集合进行精细推理。我们在一系列数据集上提供证据表明,这些步骤可以分离并且是互补的。利用这一刻画,我们开发了一种机械蒸馏策略,该策略始终优于标准蒸馏。

英文摘要

Modern reasoning models offer surprisingly strong zero-shot performance on challenging multi-label tasks that require selecting a small set of relevant options from hundreds of thousands to millions of candidate labels. We investigate how they achieve this mechanistically. We characterize reasoning as a two-phase process: A broad "shortlisting" of candidates followed by fine-grained reasoning over the resulting set. We provide evidence across a range of datasets that these steps can be isolated and are complementary. Using this characterization, we develop a mechanistic distillation strategy that consistently outperforms standard distillation.

2606.06842 2026-06-08 cs.CL 新提交

CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

CRAFT:面向表格问答与事实验证的统一反事实推理框架

Chenshuo Pan, Yu Zhao, Jie Zhang, Changzai Pan, Zhenhe Wu, Jiayi Liang, Yujie Mao, Shuangyong Song, Yongxiang Li, Zhongjiang He

发表机构 * Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd(星辰AGI实验室,中国电信人工智能技术(北京)有限公司)

AI总结 提出CRAFT统一反事实推理框架,将表格问答和事实验证转化为双向验证过程,通过构建声明及其反事实变体并加权整合证据,显著提升复杂表格推理性能。

Comments 24 pages, 10 figures

详情
AI中文摘要

表格推理对大型语言模型(LLMs)仍然具有挑战性,尤其是在需要多步推理的长且结构化的表格任务中。现有方法主要依赖单向推理,限制了其跨任务探索替代假设的能力。在这项工作中,我们提出了CRAFT,一个统一的反事实推理框架,将表格问答和事实验证重新表述为通用的双向验证过程。我们的方法显式地构建声明性陈述及其反事实变体。然后,沿着原始路径和反事实路径进行推理提取证据,并通过加权机制整合以得出最终答案。实验结果表明,我们的方法在WikiTQ和TabFact等表格推理数据集上持续优于代表性基线,在复杂问答上取得了特别大的改进。我们的框架还显著缩小了不同骨干LLM之间的性能差距。这表明反事实推理有效克服了单向推理的局限性,引导LLM进行更具辨别力的推理,并为结构化推理任务建立了更原则性的范式。我们的代码将在接收后公开。

英文摘要

Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates Tabular question answering and fact verification into a general bidirectional verification process. Our method explicitly constructs both declarative statements and their counterfactual variants. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering. Our framework also significantly mitigates performance gaps between different backbone LLMs. This indicates that counterfactual reasoning effectively overcomes the limitations of single-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks. Our code will be made publicly available upon acceptance.
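
The phrase "integrated via a weighted mechanism" can be illustrated with the simplest possible combination rule: support for a claim counts in its favour, and support for its counterfactual counts against it. The formula, the default weight, and the 0.5 decision threshold below are assumptions for illustration. The abstract does not specify how CRAFT weights the two paths.

```python
def bidirectional_score(
    p_claim: float, p_counterfactual: float, weight: float = 0.5
) -> float:
    """Combine evidence from the original and counterfactual paths.

    p_claim: estimated support for the declarative statement.
    p_counterfactual: estimated support for its counterfactual variant.
    weight: share of the final score given to the original path.
    """
    for value in (p_claim, p_counterfactual, weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must lie in [0, 1]")
    return weight * p_claim + (1.0 - weight) * (1.0 - p_counterfactual)


def verify(p_claim: float, p_counterfactual: float, weight: float = 0.5) -> bool:
    """Accept the claim when the combined score reaches 0.5 (toy threshold)."""
    return bidirectional_score(p_claim, p_counterfactual, weight) >= 0.5


# Toy values (made up): strong support for the claim, weak support for
# its counterfactual.
score = bidirectional_score(0.9, 0.2)
```

A claim that looks plausible on its own but whose counterfactual looks equally plausible ends up near 0.5 under this rule, which is the kind of case single-direction reasoning cannot flag.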

2606.06857 2026-06-08 cs.CL 新提交

Interpreting Brain Responses to Language with Sparse Features from Language Models

用语言模型稀疏特征解释大脑对语言的响应

Michael A. Lepori, Kendrick Kay, Greta Tuckute

发表机构 * Brown University(布朗大学) University of Minnesota(明尼苏达大学) Harvard University(哈佛大学)

AI总结 提出增强稀疏编码模型,用分层稀疏自编码器特征替代密集LM隐状态,并加入惊奇度预测器,解释大脑语言皮层响应,发现额颞叶语言网络由共同特征预测,且大脑响应与LM中最通用的特征对应。

详情
AI中文摘要

认知神经科学的一个核心目标是刻画人类语言皮层所表征的特征。人工语言模型已成为应对这一挑战的有力工具,但将生物表征与人工表征相关联的研究常被批评为将一个黑箱与另一个黑箱相关联。本文引入增强稀疏编码模型,一种用分层组织的稀疏自编码器特征替代密集LM隐状态,并显式包含惊奇度作为预测因子的编码框架。利用该方法,我们(i) 产生对神经响应的解释,并(ii) 测试模型-大脑对齐是否反映了LM表征中的主要变异或特异变异。使用8名参与者聆听200句语言多样性句子的高场7T fMRI数据集,我们首先通过恢复先前对处理难度和意义抽象性调谐的体素群体的解释来验证建模框架。然后,我们解释了一个先前未表征(但可靠)的体素群体,发现其调谐于与人相关的内容。接着,我们显示额颞叶人类语言网络由其组成区域间的共同特征集预测,但发现额叶区域即使在没有LM特征的情况下也能被惊奇度单独较好地解释。最后,我们显示语言处理过程中的大脑响应并非仅能从任意一组LM特征预测。相反,大脑响应最好由倾向于捕捉LM表征中编码的最通用信息的特征解释,表明大脑与LM语言表征之间存在非平凡的对齐。

英文摘要

A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex. Artificial language models (LMs) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically-organized sparse autoencoder (SAE) features, while explicitly including surprisal as a predictor. Using this approach, we (i) produce interpretations of neural responses and (ii) test whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. We then interpret a previously-uncharacterized (but reliable) voxel population and find that it is tuned to people-related content. Next, we show that the fronto-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well-explained by surprisal alone, even in the absence of LM-based features. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.

2606.06865 2026-06-08 cs.CL 新提交

Are Large Language Models Suitable for Graph Computation? Progress and Prospects

大型语言模型是否适合图计算?进展与展望

Yuting Zhang, Yi Han, Kai Wang, Wei Ni, Angela Bonifati, Wenjie Zhang

发表机构 * University of New South Wales(新南威尔士大学) Antai College of Economics and Management, Shanghai Jiao Tong University(上海交通大学安泰经济管理学院) Edith Cowan University(埃迪斯科文大学) Lyon 1 University(里昂第一大学)

AI总结 本文通过角色分类法综述LLM在图计算中的应用,分析作为执行者和规划者的两种范式,指出LLM适用于简单小规模任务,但在大规模和精确性要求高的任务中不可靠,并总结数据集和未来方向。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被探索用于图计算,其中任务需要对结构化关系和算法操作进行推理。然而,目前尚不清楚LLMs何时能可靠地支持此类计算,以及如何将它们整合到图求解流程中。现有的关于LLMs和图交叉的综述主要关注图学习、文本属性图或图语言建模。为弥补这一空白,我们通过基于角色的分类法对LLMs在图计算中的应用进行了全面综述。具体来说,我们识别出两种主要范式:i) LLMs作为执行者,模型直接从图描述和指令中解决图任务;ii) LLMs作为规划者,模型制定问题、分解推理步骤,并调用外部工具或代理执行。基于此分类法,我们分析了当前方法的优势和局限性。我们的综述表明,LLMs在简单、小规模任务中具有潜力,但在大规模和精确性要求高的任务中仍不可靠。最后,我们总结了可用的数据集,并提出了四个未来方向。

英文摘要

Large language models (LLMs) have been increasingly explored for graph computation, where tasks require reasoning over structured relationships and algorithmic operations. Yet, it remains unclear when LLMs can reliably support such computation and how they should be incorporated into graph-solving pipelines. Existing surveys at the intersection of LLMs and graphs primarily focus on graph learning, text-attributed graphs, or graph-language modeling. To bridge this gap, we provide a comprehensive review of LLMs for graph computation through a role-based taxonomy. Specifically, we identify two major paradigms: i) LLMs as executors, where models directly solve graph tasks from graph descriptions and instructions; and ii) LLMs as planners, where models formulate problems, decompose reasoning steps, and invoke external tools or agents for execution. Based on this taxonomy, we analyze the strengths and limitations of current methods. Our review indicates that LLMs are promising for simple, small-scale tasks, but remain unreliable for large-scale and exactness-demanding tasks. Finally, we summarize available datasets and suggest four future directions.

2606.06879 2026-06-08 cs.CL cs.CR 新提交

An Expanded Synthetic Conversation Dataset for Multi-Turn Smishing Detection

用于多轮短信钓鱼检测的扩展合成对话数据集

Carl Lochstampfor, Ayan Roy

AI总结 提出COVA-X扩展数据集(10,985条对话),改进生成管道解决标签污染等问题,实验表明Longformer超越XGBoost,验证了Transformer模型需要更大对话语料才能发挥上下文优势。

详情
AI中文摘要

我们之前的工作引入了COVA,一个合成生成的多轮对话短信钓鱼数据集,包含3,201条标记对话,建立了八个模型的基线检测基准。虽然使用TF-IDF特征的XGBoost表现最佳,准确率72.5%,宏F1为0.691,但Transformer模型表现不佳,归因于输入截断和训练数据不足。我们提出COVA-X,一个扩展数据集,包含10,985条对话,涵盖八种针对老年人的诈骗类别,由改进的生成管道生成,解决了第一次迭代中的污染、标签不匹配、舞台指示泄露和提示设计失败问题。在扩展数据集上重新训练所有分类器得到了本工作的核心发现:Longformer现在在所有评估指标上超越了XGBoost,准确率79.71%,宏F1 0.7786,而XGBoost分别为78.43%和0.7563。这直接证实了Transformer模型需要更大的对话语料库才能发挥其上下文优势。我们还记录了一个质量生命周期,包括标签修正率从49.8%降低到3.9%(12.7倍改进),一项架构干预将虚拟绑架伪影率从67.1%降低到46.5%,以及按诈骗类型的结果分析显示,诈骗类别以机制一致的方式调节结果。清理前后的敏感性分析证实,数据集精炼在所有三种分类器架构中恢复了真实的标签相关信号。

英文摘要

Our prior work introduced COVA, a synthetically generated multi-turn conversational smishing dataset of 3,201 labeled conversations, establishing baseline detection benchmarks across eight models. While XGBoost with TF-IDF features achieved the best performance, with 72.5% accuracy and 0.691 macro F1, transformer models underperformed, which was attributed to input truncation and insufficient training data. We present COVA-X, an expanded dataset of 10,985 conversations spanning eight elder-targeted scam categories, produced by an improved generation pipeline addressing contamination, label mismatch, stage-direction bleed, and prompt-design failures from the first iteration. Retraining all classifiers on the expanded dataset yields the central finding of this work: Longformer now surpasses XGBoost on all evaluation metrics, achieving 79.71% accuracy and 0.7786 macro F1 compared with 78.43% and 0.7563 for XGBoost. This directly confirms that transformer models require larger conversational corpora to realize their contextual advantages. We additionally document a quality life-cycle including a 12.7× improvement in label correction rate, from 49.8% to 3.9%, an architectural intervention reducing virtual-kidnapping artifact rates from 67.1% to 46.5%, and a per-scam-type outcome analysis showing that scam categories modulate results in mechanism-consistent ways. A pre/post-cleanup sensitivity analysis confirms that dataset refinement recovers genuine label-relevant signal across all three classifier architectures.

2606.06906 2026-06-08 cs.CL cs.AI 新提交

EASE-TTT: Evidence-Aligned Selective Test-Time Training for Long-Context Question Answering

EASE-TTT: 面向长上下文问答的基于证据对齐的选择性测试时训练

Xiaopeng Yuan, Zebin Wang, Suwen Wang, Zongxin Yang, Haohan Wang, Yushun Dong

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Harvard University(哈佛大学) Brion, ASML US LP Florida State University(佛罗里达州立大学)

AI总结 提出EASE-TTT框架,通过将检索到的证据块转化为软注意力监督目标,指导查询侧参数适应,从而在保留完整上下文的情况下提升小模型的长上下文问答性能。

Comments 13 pages, 4 figures, 3 tables

详情
AI中文摘要

长上下文问答(QA)对于较小的语言模型来说仍然具有挑战性,即使输入中已经存在包含答案的证据。现有的上下文内检索方法定位并暴露候选证据块给问题,但它们止步于输入级证据暴露,而不是调整控制模型如何在整个上下文位置上分配注意力的查询侧注意力参数。相比之下,轻量级的测试时适应方法,如仅查询的测试时训练(qTTT),由于它们通用的跨度级自监督目标无法识别哪些上下文位置支持当前答案,因此未能解决证据定位问题。在本文中,我们提出了基于证据对齐的选择性测试时训练(EASE-TTT),这是一个上下文内检索增强的测试时训练框架,它将选定的证据块转换为对其标记位置的软注意力监督目标。EASE-TTT不是用检索到的块替换完整上下文,而是使用生成的注意力目标来指导查询侧适应,适应后的模型从原始完整上下文中生成最终答案。在六个LongBench QA任务和三个小型仅解码器语言模型上的实验表明,EASE-TTT在全上下文推理、仅检索基线和qTTT中实现了最强的宏平均性能,支持了长上下文QA中基于证据对齐的测试时适应。

英文摘要

Long-context question answering (QA) remains challenging for smaller language models even when answer-bearing evidence is already present in the input. Existing within-context retrieval methods localize and expose candidate evidence chunks for the question, but they stop at input-level evidence exposure rather than adapting the query-side attention parameters that control how the model allocates attention over full-context positions. In contrast, lightweight test-time adaptation methods, such as query-only test-time training (qTTT), leave evidence localization unresolved because their generic span-level self-supervised objectives do not identify which context positions support the current answer. In this paper, we propose Evidence-Aligned SElective Test-Time Training (EASE-TTT), a within-context retrieval-augmented test-time training framework that converts selected evidence chunks into a soft attention supervision target over their token positions. Instead of replacing the full context with retrieved chunks, EASE-TTT uses the resulting attention target to guide query-side adaptation, with the adapted model generating the final answer from the original full context. Experiments on six LongBench QA tasks and three small decoder-only language models show that EASE-TTT achieves the strongest macro-average performance among full-context inference, retrieval-only baselines, and qTTT, supporting evidence-aligned test-time adaptation in long-context QA.
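
Converting evidence chunks into "a soft attention supervision target over their token positions" can be pictured as building a probability distribution that puts most of its mass on evidence tokens, then penalising the divergence between that target and the model's attention. The sketch below is a plain-Python illustration under assumptions: the uniform split of mass, the `smoothing` parameter, and the KL objective are hypothetical choices, not details taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def soft_attention_target(
    seq_len: int, evidence_spans: Sequence[Tuple[int, int]], smoothing: float = 0.1
) -> List[float]:
    """Distribution over token positions concentrated on evidence spans.

    evidence_spans holds half-open (start, end) token ranges. A share of
    `smoothing` is spread over non-evidence positions so the target stays soft.
    """
    evidence = set()
    for start, end in evidence_spans:
        if not 0 <= start < end <= seq_len:
            raise ValueError(f"span ({start}, {end}) outside sequence")
        evidence.update(range(start, end))
    if not evidence:
        raise ValueError("need at least one evidence position")
    n_other = seq_len - len(evidence)
    if n_other == 0:
        smoothing = 0.0  # everything is evidence, nothing to smooth onto
    return [
        (1.0 - smoothing) / len(evidence) if i in evidence else smoothing / n_other
        for i in range(seq_len)
    ]


def kl_divergence(target: Sequence[float], attention: Sequence[float]) -> float:
    """KL(target || attention), with a small epsilon for numerical safety."""
    eps = 1e-12
    return sum(
        t * math.log((t + eps) / (a + eps))
        for t, a in zip(target, attention)
        if t > 0
    )


# Toy case (made up): 10 context tokens, evidence at positions 2 and 3.
target = soft_attention_target(10, [(2, 4)])
uniform_attention = [0.1] * 10
loss = kl_divergence(target, uniform_attention)
```

In a training loop of this shape, only the query-side parameters would receive gradients from such a loss, while generation still reads the full context.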

2606.06942 2026-06-08 cs.CL cs.AI 新提交

Didact: A Cross-Domain Capability Discovery System for Defence

Didact:面向国防的跨领域能力发现系统

Aarya Bodhankar, Aditya Joshi, Bao Gia Doan, Thomas Marchant, Oscar Leslie, Flora Salim

发表机构 * University of New South Wales, Sydney, Australia(新南威尔士大学,悉尼,澳大利亚) Cyndr.ai, Australia(Cyndr.ai,澳大利亚)

AI总结 提出Didact原型系统,通过构建知识图谱和复合检索增强生成管道,整合异构国防报告与政策文档,支持自然语言对话和可视化证据追溯,解决跨领域能力发现碎片化问题。

Comments Under Review at CIKM 2026 (System Demonstration Track)

详情
AI中文摘要

国防及国防相关领域的政策制定者必须监控快速发展的研究以及与其作战和战略需求相关的部门优先事项。实际上,这些来源分散在异构格式、不连贯的存储库和孤立的更新流中,使得能力发现缓慢且难以审计。我们提出了Didact,一个原型系统,它将来自澳大利亚的公开国防报告和政策文件与基于澳大利亚研究出版物构建的专用知识图谱相结合。Didact为面向政策的工作流程提供自然语言对话,并利用复合检索增强生成(RAG)管道。Didact的一个关键特性是交互式证据轨道,它可以可视化检索到的证据和源关系。我们对Didact的输出质量和运行时间的评估凸显了其实用性。虽然Didact是作为澳大利亚背景下的学术界-工业界合作项目共同开发的,但它适用于知识同样碎片化的其他领域。演示视频可在此处获取:

英文摘要

Policymakers in defence and defence-aligned sectors must monitor rapidly evolving research alongside sector priorities relevant to operational and strategic needs. In practice, these sources are fragmented across heterogeneous formats, disjoint repositories, and siloed update streams, making capability discovery slow and difficult to audit. We present Didact, a prototype that integrates publicly available defence reports and policy documents from Australia with a purpose-built knowledge graph derived from Australian research publications. Didact provides natural language conversations for policy-oriented workflows, and leverages a composite retrieval-augmented generation (RAG) pipeline. A key feature of Didact is an interactive Evidence Rail that visualises retrieved evidence and source relationships. Our evaluation of the output quality and runtime of Didact highlights its utility. While Didact has been co-developed as an academia-industry project for the Australian context, it is adaptable to other domains where knowledge is similarly fragmented. A demonstration video is available here:

2606.06946 2026-06-08 cs.CL cs.AI 新提交

Auditing Training Data in Domain-adapted LLMs: LoRA-MINT

领域自适应大语言模型中的训练数据审计:LoRA-MINT

Gonzalo Mancera, Daniel DeAlcala, Aythami Morales, Julian Fierrez, Ruben Tolosana, Francisco Jurado

发表机构 * University of Granada(格拉纳达大学)

AI总结 提出LoRA-MINT方法,通过成员推理测试审计LoRA微调的大语言模型训练数据,在四个模型和三个基准上达到0.77-0.92的精度,优于现有基线。

Comments IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

详情
AI中文摘要

我们提出了LoRA-MINT,一种应用于通过低秩适应(LoRA)针对特定自然语言处理(NLP)任务微调的最新大语言模型(LLMs)的成员推理测试(MINT)新方法。主要目标是评估个体样本是否属于这些适应模型的训练数据,为知识产权和敏感数据管理提供有用的审计工具。我们的分析探索了模型困惑度与成员状态之间的关系,提供了一个系统框架来估计微调LLMs中的数据暴露程度。我们在四个模型和三个基准数据集上进行了实验,在确定给定数据是否用于训练时获得的精度值在0.77到0.92之间,优于最先进的基线,并证明了所提出方法的鲁棒性和通用性。总的来说,我们的发现强调了LoRA-MINT作为审计LLMs的有效且可扩展框架的潜力,提高了透明度,并促进了AI和NLP技术的道德和负责任部署。为了具体性和当前相关性,我们的讨论和实验集中在LoRA调整的LLMs上,但请注意,所提出的大部分方法很容易适用于审计任何其他适应LLMs的技术或更一般地任何其他领域自适应AI模型的训练数据。

英文摘要

We present LoRA-MINT, a new methodology for Membership Inference Test (MINT) applied to recent Large Language Models (LLMs) fine-tuned for specific Natural Language Processing (NLP) tasks through Low-Rank Adaptation (LoRA). The primary goal is to assess whether individual samples were part of the training data of these adapted models, providing a useful auditing tool for the management of intellectual property and sensitive data. Our analysis explores the relationship between model perplexity and membership status, providing a systematic framework for estimating data exposure in fine-tuned LLMs. We conducted experiments on four models and three benchmark datasets, obtaining precision values in determining if given data were used for training ranging from 0.77 to 0.92, which outperform state-of-the-art baselines and demonstrate the robustness and generality of the proposed method. In general, our findings underscore the potential of LoRA-MINT as an effective and scalable framework for auditing LLMs, improving transparency, and fostering the ethical and responsible deployment of AI and NLP technologies. For the sake of concreteness and current relevance, our discussion and experiments are centered on LoRA-adjusted LLMs, but note that most of the presented methodology is easily applicable for auditing training data given any other technique for adapting LLMs or, more generally, any other domain-adapted AI models.
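
The relationship between perplexity and membership status is often illustrated with a threshold test: samples the model finds unusually predictable are guessed to be training members. The sketch below shows that generic baseline and how precision is computed for it. It assumes per-token log-probabilities are already available. The threshold rule and function names are illustrative and are not presented here as the LoRA-MINT procedure.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def predict_member(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Guess 'seen in training' when perplexity falls below the threshold."""
    return perplexity(token_logprobs) < threshold


def precision(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Share of samples predicted as members that really are members."""
    true_pos = sum(1 for p, y in zip(predictions, labels) if p and y)
    false_pos = sum(1 for p, y in zip(predictions, labels) if p and not y)
    if true_pos + false_pos == 0:
        raise ValueError("no positive predictions")
    return true_pos / (true_pos + false_pos)


# Toy sample (made up): every token assigned probability 0.5.
ppl = perplexity([math.log(0.5)] * 4)
```

The threshold would normally be calibrated on samples with known membership, since perplexity also varies with text length and domain.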

2606.06959 2026-06-08 cs.CL cs.AI 新提交

OpenHalDet: A Unified Benchmark for Hallucination Detection across Diverse Generation Scenarios

OpenHalDet:面向多种生成场景的幻觉检测统一基准

Xinyi Li, Zhen Fang, Yongxin Deng, Jinyuan Luo, Hongnan Ma, Changdae Oh, Zijing Shi, Shanshan Ye, Hanchen Wang, Shu-Lin Chen, Yadan Luo, Mengyue Yang, Sean Du, Sharon Li, Ling Chen

发表机构 * University of Technology Sydney(悉尼科技大学) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) University of Bristol(布里斯托大学) The University of Queensland(昆士兰大学) Nanyang Technological University(南洋理工大学)

AI总结 提出OpenHalDet基准,标准化幻觉检测评估流程,支持黑盒、灰盒、白盒检测器,实现跨任务、模型和检测器的可控比较。

Comments Preprint. Code and data are available at https://github.com/Nellie179/Hallucination-Detection

详情
AI中文摘要

幻觉检测对于大型语言模型(LLM)的可靠部署至关重要。然而,现有评估面临两个核心挑战:推理配置和评估不一致,以及下游领域和任务的覆盖有限。因此,报告的检测器性能往往难以比较、复现,并泛化到特定实验设置之外。我们引入OpenHalDet,一个面向多种生成场景的幻觉检测统一基准。OpenHalDet标准化了评估流程,从提示构建和响应生成到真实性标注、检测器评分和指标计算。它支持不同访问设置下的异构检测器家族,包括仅使用生成输出的黑盒方法、依赖基于概率信号的灰盒方法,以及利用内部模型信号的白盒方法。通过将多样化的任务、模型和检测器纳入共享框架,OpenHalDet实现了可控比较,并提供了不同检测范式在LLM应用中行为的系统视角。我们发布OpenHalDet作为开放且可扩展的代码库,以促进幻觉检测方法的可复现评估和未来发展。代码和数据集可在 https://github.com/Nellie179/Hallucination-Detection 获取。

英文摘要

Hallucination detection is essential for the reliable deployment of large language models (LLMs). However, existing evaluations face two core challenges: inconsistent inference configuration and evaluation, and limited coverage of downstream domains and tasks. Consequently, reported detector performance is often difficult to compare, reproduce, and generalize beyond specific experimental settings. We introduce OpenHalDet, a unified benchmark for hallucination detection across diverse generation scenarios. OpenHalDet standardizes the evaluation pipeline, from prompt construction and response generation to truthfulness annotation, detector scoring, and metric computation. It supports heterogeneous detector families under different access settings, including black-box methods that use only generated outputs, gray-box methods that rely on probability-based signals, and white-box methods that exploit internal model signals. By bringing diverse tasks, models, and detectors into a shared framework, OpenHalDet enables controlled comparison and provides a systematic view of how different detection paradigms behave in LLM applications. We release OpenHalDet as an open and extensible codebase to facilitate reproducible evaluation and future development of hallucination detection methods. The code and datasets are available at https://github.com/Nellie179/Hallucination-Detection.

2606.06960 2026-06-08 cs.CL 新提交

Tree-of-Experience: A Structured Experience-Management Solution for Self-Evolving Agents under Low-Repetition and Implicit-Reward Environments

经验之树:低重复与隐式奖励环境下自演化智能体的结构化经验管理方案

Zihao Deng, Yining Zhu, Leiming Wang, Jingfei Lu, Junbo Wang, Chuncheng Ran, Yu Yang, Dixuan Yang, Jikun Shen

AI总结 针对低重复任务与隐式奖励环境,提出结构化经验管理方法ToE,通过组织、检索、验证和更新经验,在金融情绪预测基准上优于无经验基线。

详情
AI中文摘要

基于经验的自我演化对于LLM智能体至关重要,但现有基准通常假设明确的目标、稳定的任务模式和清晰的反馈。我们研究了一个更具挑战性的场景:具有隐式奖励的低重复任务,其中过去的经验难以重用,且反馈是延迟的、有噪声的且是结果层面的。我们引入了FinEvolveBench,一个时间控制的金融情绪预测基准,将每日新闻驱动的预测与未来超额收益联系起来。我们进一步提出了经验之树(ToE),一种结构化的经验管理方法,用于组织、检索、验证和更新智能体的经验。实验表明,通用经验机制并不一致地优于无经验基线,而ToE实现了更强的整体性能。这些结果强调了在隐式奖励环境中,结构化经验管理对于自演化智能体的重要性。

英文摘要

Experience-based self-evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback. We study a more challenging setting: low-repetition tasks with implicit rewards, where past experience is difficult to reuse and feedback is delayed, noisy, and outcome-level. We introduce FinEvolveBench, a temporally controlled benchmark for financial sentiment prediction that links daily news-driven predictions to future excess returns. We further propose Tree-of-Experience (ToE), a structured experience-management method that organizes, retrieves, validates, and updates agent experience. Experiments show that general-purpose experience mechanisms do not consistently outperform no-experience baselines, while ToE achieves stronger overall performance. These results highlight the importance of structured experience management for self-evolving agents in implicit-reward environments.

2606.06985 2026-06-08 cs.CL eess.AS 新提交

Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

基于大语言模型生成近误样本的对比训练用于鲁棒语码转换语音识别

Tung X. Nguyen, Hieu Minh Truong, Giang-Son Nguyen, Nhu Vo, Wray Buntine, Dung D. Le

发表机构 * VinUniversity(Vin大学) University of Technology Sydney(悉尼科技大学) Monash University(莫纳什大学)

AI总结 提出POI感知对比训练框架,通过大语言模型生成近误负样本并过滤,结合POI加权交叉熵与多负例对比损失微调Whisper-small,在语码转换语音识别任务上降低超过2%的错误率。

Comments Accepted at INTERSPEECH 2026

详情
AI中文摘要

语码转换(CS)是指在单个话语中交替使用多种语言,这对自动语音识别(ASR)仍然具有挑战性。为了解决这个问题,我们提出了一个兴趣点(POI)感知的对比训练框架,该框架提高了CS关键区域的识别能力。我们首先采用文献中的POI检测方法识别CS片段,然后通过扰动ASR N-best输出中的POI并利用大语言模型扩展候选,构建声学上合理的近误假设。通过声学、音位和文本约束过滤,保留困难但合理的负样本。最后,我们使用POI加权交叉熵锚点目标以及多负例对比排序损失,通过LoRA微调Whisper-small。在CS-FLEURS(cmn-eng)和ViMedCSS(vie-eng)上的实验表明,与标准LoRA微调相比,通用错误率和CS感知错误率均持续降低超过2%。

英文摘要

Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive training framework that improves recognition at CS-critical regions. We first identify CS spans by adopting a POI detection method from the literature, then construct acoustically plausible near-miss hypotheses by perturbing POIs in ASR N-best outputs and expanding candidates with a large language model. Hard but plausible negatives are retained through filtering with acoustic, phonemic, and textual constraints. Finally, we fine-tune Whisper-small with LoRA using a POI-weighted cross-entropy anchor objective together with a multi-negative contrastive ranking loss. Experiments on CS-FLEURS (cmn-eng) and ViMedCSS (vie-eng) show consistent reductions of over 2% in both general and CS-aware error rates compared to standard LoRA fine-tuning.
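
The two training terms named in the abstract can be sketched in plain Python to show their general shape. Both formulas are assumptions: the weighted cross-entropy simply up-weights tokens flagged as POIs, and the ranking term is written in the common InfoNCE style (negative log-softmax of the reference transcript among the reference plus near-miss negatives). The paper's exact losses, weights, and scoring function may differ.

```python
import math
from typing import Sequence


def poi_weighted_ce(
    token_logprobs: Sequence[float], is_poi: Sequence[bool], poi_weight: float = 2.0
) -> float:
    """Weighted mean negative log-likelihood; POI tokens count more."""
    if len(token_logprobs) != len(is_poi) or not token_logprobs:
        raise ValueError("inputs must be non-empty and the same length")
    weights = [poi_weight if flag else 1.0 for flag in is_poi]
    weighted_nll = -sum(w * lp for w, lp in zip(weights, token_logprobs))
    return weighted_nll / sum(weights)


def multi_negative_ranking_loss(
    pos_score: float, neg_scores: Sequence[float], temperature: float = 1.0
) -> float:
    """-log softmax of the positive among the positive and all negatives."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[0]


# Toy values (made up): two tokens, the second one a code-switch point.
ce = poi_weighted_ce([-1.0, -3.0], [False, True])
tie_loss = multi_negative_ranking_loss(0.0, [0.0])
```

The ranking term only falls when the reference transcript scores above the near-miss hypotheses, which is why hard but plausible negatives matter more than random ones.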

2606.06994 2026-06-08 cs.CL cs.DB 新提交

Principles of Concept Representation in Sentence Encoders

句子编码器中概念表示的原则

Isabelle Mohr, John Dujany, Jonathan Souquet, Andre Freitas

发表机构 * Idiap Research Institute(Idiap研究所) Merck KGaA(默克 KGaA)

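AI summary Studies, through the lens of representational compositionality, when sentence encoders produce good concept representations, and identifies four principles: fine-tuning recalibrates rather than expands the latent geometry (P1); semantic signal concentrates in the final transformer layer (P2); hard negatives improve discrimination but not retrieval ranking (P3); and the effectiveness of supervision depends on the composition type of the concept (P4).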

详情
AI中文摘要

是什么让句子编码器产生良好的概念表示?我们通过表征组合性的视角来探讨这个问题:只有当编码器的潜在空间允许相应语义算子的低失真实现时,它才支持一个概念族。这一框架预测了当前编码器成功之处以及它们在结构上与监督不匹配的地方。通过在WordNet和Wiktionary的330万同义词和定义对上训练的编码器条件进行受控消融实验,在三个去污染分割和一个修饰语标记的名词短语基准上进行评估,我们确定了四个原则。微调重新校准潜在几何而非扩展它(P1)。语义信号在概念特定训练开始前集中在最后的Transformer层,使得跨层池化变得多余(P2)。硬负例改善了区分性和压力测试鲁棒性,但不提升检索排序,表明校准和排序是可独立处理的(P3)。最后,监督的有效性取决于目标概念的组合类型。外延训练有助于交性和子性概念族,但损害关系性和内涵性概念族,暴露了当前训练范式的结构性限制(P4)。我们发布了两个新的评估数据集:一个DBpedia语义差距基准和一个修饰语标记的名词短语释义套件。

英文摘要

What makes a sentence encoder produce good concept representations? We approach this through the lens of representational compositionality: an encoder supports a concept family only when its latent space admits a low-distortion realization of the corresponding semantic operator. This framing predicts both where current encoders succeed and where they are structurally mismatched to their supervision. Through a controlled ablation over encoder conditions trained on 3.3 million synonym and definition pairs from WordNet and Wiktionary, evaluated on three decontaminated splits and a modifier-labeled noun-phrase benchmark, we identify four principles. Fine-tuning recalibrates the latent geometry rather than expanding it (P1). Semantic signal concentrates in the final transformer layer before concept-specific training begins, making cross-layer pooling redundant (P2). Hard negatives improve discrimination and stress-test robustness without improving retrieval ranking, showing that calibration and ranking are independently addressable (P3). Finally, the effectiveness of supervision depends on the composition type of the target concept. Extensional training helps intersective and subsective families while degrading relational and intensional ones, exposing a structural limitation of current training paradigms (P4). We release two new evaluation datasets: a DBpedia semantic-gap benchmark and a modifier-labeled NP paraphrase suite.

2606.07020 2026-06-08 cs.CL 新提交

MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

MADE:超越评分——通过多语言智能诊断引擎实现细粒度评估洞察

Yilun Liu, Miao Zhang, Shimin Tao, Minggui He, Chunguang Zhao, Chenxin Liu, Li Zhang, Chen Liu, Cheng Qian, Liqun Deng, Xiaojun Meng, Daimeng Wei

发表机构 * Huawei(华为)

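AI summary Proposes MADE, a Multilingual Agentic Diagnosing Engine that decomposes post-evaluation analysis into planning, aggregate analysis, instance-level inspection, multilingual and cultural reflection, and report synthesis; on a large-scale substrate of 33 model families, 11 benchmarks, and 26 languages, it improves diagnosis report quality by 47% and is preferred by experts in 87.9% of pairwise comparisons.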

详情
AI中文摘要

多语言和多文化基准现在覆盖数十种语言和模型族,但由此产生的得分景观仍然指标丰富而洞察贫乏,需要进行细粒度的多语言评估后诊断。然而,单个LLM和开放式智能体很容易被冗长、嘈杂的诊断输入所淹没,并且没有可重用的分类法。为了解决这个问题,我们提出了MADE,一个多语言智能诊断引擎,它将评估后分析分解为规划、聚合分析、实例级案例检查、多语言和文化反思以及基于事实的报告合成。MADE与一个专家主导的54个查询和15种语言的诊断集配对,在大规模多语言评估基础(33个模型族、11个基准、26种语言、34种文化、866万条评估记录)上进行评估。实验表明,MADE在诊断报告质量上比最强的共享基线高出47%,并且在87.9%的成对比较中被多语言人类专家偏好。与多语言专家一起应用,MADE进一步揭示了关于部署、迭代和跨文化陷阱的四个可操作发现,将基准得分表转化为模型选择和修复指南。

英文摘要

Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis. However, single LLMs and open-ended agents are easily swamped by the long, noisy diagnostic input, and no reusable taxonomy exists for it. To address this, we propose MADE, a Multilingual Agentic Diagnosing Engine that decomposes post-evaluation analysis into planning, aggregate analysis, instance-level case inspection, multilingual and cultural reflection, and grounded report synthesis. MADE is paired with an expert-led 54-query and 15-language diagnostic set, evaluated on top of a large-scale multilingual evaluation substrate (33 model families, 11 benchmarks, 26 languages, 34 cultures, 8.66M evaluation records). Experiments show that MADE outperforms the strongest shared baseline by 47% in diagnosis report quality and is preferred by human multilingual experts in 87.9% of pairwise comparisons. Applied with multilingual experts, MADE further surfaces four actionable findings on deployment, iteration, and cross-cultural pitfalls, turning benchmark score tables into model-selection and remediation guidance.

2606.07054 2026-06-08 cs.CL cs.AI cs.CR cs.LG 新提交

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

TRACE: 通过自适应跨步骤证据聚合的LLM智能体轨迹推理

Vijitha Mittapalli, Shreyaa Jayant Dani, Satya Srujana Pilli, Snigdha Ansu, Mohammadreza Teymoorianfard, Franck Dernoncourt, Hongjie Chen, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed

发表机构 * University of Massachusetts at Amherst(马萨诸塞大学阿默斯特分校) Adobe Research(Adobe研究) Dolby Labs(杜比实验室) University of Oregon(俄勒冈大学) Cisco(思科)

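AI summary Proposes TRACE, a monitoring framework whose TIJ (Triage-Inspect-Judge) loop identifies high-signal regions, accumulates evidence across steps, and synthesizes a trajectory-level verdict; on ten SHADE-Arena task domains it reaches an F1 of 0.713 and a recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.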

详情
AI中文摘要

自主LLM智能体可以通过一系列单独良性的行动追求隐藏的恶意目标,这使得使用标准轨迹级监控难以检测破坏行为。现有方法要么一次性评估完整轨迹,要么将其划分为独立评分的窗口,限制了连接时间上相距较远的证据的能力。我们提出TRACE,一个用于长视界LLM智能体轨迹的监控框架。TRACE通过一个TIJ(分类-检查-判决)循环运行,该循环识别高信号区域,执行有针对性的检查,同时在各推理步骤间保持已累积的证据,并综合出轨迹级判决。我们在SHADE-Arena的十个任务域上评估TRACE,与最先进的基线进行比较。TRACE实现了0.713的总体F1分数和0.844的召回率,在需要长距离证据链接的任务上取得了最大的提升。

英文摘要

Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.
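The abstract reports F1 and recall but not precision. As a rough worked example, precision can be recovered from the identity F1 = 2PR / (P + R). This assumes both aggregate figures come from the same pooled counts; if they are macro-averaged over the ten task domains, the identity holds only approximately. The function name below is hypothetical.

```python
def implied_precision(f1, recall):
    """Solve F1 = 2 * P * R / (P + R) for the precision P."""
    return f1 * recall / (2 * recall - f1)

precision = implied_precision(0.713, 0.844)
print(round(precision, 3))  # prints 0.617
```

Under that assumption the monitor trades precision for recall: it flags most sabotage trajectories at the cost of more false alarms.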

2606.07069 2026-06-08 cs.CL cs.CY 新提交

mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

mmPISA-bench:LLMs 在 43 种语言中的推理能力是否同样出色?

Yerzhan Sapenov, Jaromir Savelka

发表机构 * Independent Scholar(独立学者) School of Computer Science, Carnegie Mellon University(卡内基梅隆大学计算机科学学院)

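AI summary Introduces mmPISA-bench, a PISA-derived multilingual reasoning benchmark of 25 multiple-choice questions with official translations into 43 languages, and evaluates LLMs across languages, reasoning effort levels, and translation types; modern LLMs reason effectively in all evaluated languages and machine translation does not degrade accuracy, but some languages are both more expensive and less accurate.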

详情
AI中文摘要

我们推出了 mmPISA-bench,这是一个紧凑的高质量多语言推理基准,源自 OECD 国际学生评估项目(PISA)。该基准包含 25 道需要推理才能正确回答的多项选择题。每道题都提供了官方人工翻译的 43 种语言版本,并辅以机器翻译版本(即总共 2,150 个数据点)。我们评估了两个主流专有 LLMs 在不同语言、推理努力水平和翻译类型下正确回答问题的能力。我们的结果表明,现代 LLMs 能够在所有评估的语言中有效推理,达到与人类应试者相当的准确率,但在所覆盖的语言之间存在一些性能差异。我们进一步发现,与官方人工翻译相比,机器翻译的问题并未降低准确率,这表明高质量的机器翻译(合成数据)可能通常足以用于大规模多语言推理评估,尤其是在没有官方翻译的情况下。最后,我们分析了 token 使用和相关推理成本,发现某些语言中 LLMs 的使用同时更昂贵且准确率更低。

英文摘要

We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each question is provided in official human translations into 43 languages and complemented with machine-translated counterparts (i.e., 2,150 data points in total). We evaluate two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types in terms of their ability to answer the questions correctly. Our results show that modern LLMs can reason effectively across all evaluated languages, achieving accuracy comparable to human test-takers, with some performance variation across the covered languages. We further find that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation (synthetic data) might often be adequate for large-scale multilingual reasoning evaluations where official translations are not available. Finally, we analyze token usage and the related inference cost and find that LLM usage in some languages is simultaneously more expensive and less accurate.

2606.07098 2026-06-08 cs.CL cs.LG 新提交

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

SigmaScale: 基于SVD低秩分解和学习缩放矩阵的LLM压缩

Ernests Lavrinovics, Marco Letizia, Roy Janco, Shai Segal, Johannes Bjerva, Maurizio Pierini

发表机构 * Department of Computer Science, Aalborg University Copenhagen(奥尔堡大学哥本哈根分校计算机科学系) MaLGa-DIBRIS, University of Genoa(热那亚大学MaLGa-DIBRIS) INFN, Sezione di Genova(国家核物理研究所热那亚分部) European Organization for Nuclear Research (CERN)(欧洲核子研究中心) Ceva, Inc.(Ceva公司)

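AI summary Proposes SigmaScale, which learns auxiliary scaling matrices to improve truncated-SVD compression of LLMs and lowers the effective rank of weight matrices, achieving competitive performance on Llama 3.1 8B Instruct and Qwen3-8B.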

详情
AI中文摘要

我们提出SigmaScale,一种学习辅助缩放矩阵$S$以辅助基于截断奇异值分解(SVD)的大语言模型(LLM)压缩的方法。SigmaScale不是解析地推导缩放矩阵,而是优化两组定义对角行和列缩放变换的向量,并在激活感知的压缩损失下进行。我们表明,学习到的缩放降低了权重矩阵的有效内在秩,这反映在有效秩熵的减少上,并且这种减少与压缩损失强相关。在Llama 3.1 8B Instruct和Qwen3-8B上的实验表明,SigmaScale在困惑度和零样本基准测试上与最相关的基于SVD的压缩方法具有竞争力。通过使用学习到的激活感知变换,SigmaScale通过适应单个模型权重的结构,探索了一条更灵活的低秩LLM压缩路径。在特定任务中观察到的优势使我们的方法成为需要降低LLM推理计算成本的应用的有效选择。

英文摘要

We present SigmaScale, a method for learning auxiliary scaling matrices $S$ to aid truncated Singular Value Decomposition (SVD) based Large Language Model (LLM) compression. Instead of deriving scaling matrices analytically, SigmaScale optimizes two sets of vectors that define diagonal row and column scaling transformations under an activation-aware compression loss. We show that learned scaling lowers the effective intrinsic rank of weight matrices, as reflected by reductions in effective-rank entropy, and that this reduction is strongly correlated with compression loss. Experiments on Llama 3.1 8B Instruct and Qwen3-8B show that SigmaScale is competitive with closely related state-of-the-art SVD-based compression methods across perplexity and zero-shot benchmarks. By using learned activation-aware transformations, SigmaScale explores a more flexible route to low-rank LLM compression by adapting to the structure of individual model weights. The advantage observed in specific tasks makes our approach a valid option for applications requiring a reduced LLM-inference computing cost.
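The "effective-rank entropy" mentioned above can be sketched with a commonly used definition: the Shannon entropy of the normalized singular values, whose exponential gives an effective rank. The abstract does not spell out its formula, so this definition and the function name are assumptions rather than the paper's exact metric.

```python
import math

def effective_rank(singular_values):
    """Return (entropy, effective rank) of the normalized singular-value spectrum.

    Assumed definition: p_i = s_i / sum(s), H = -sum(p_i * log p_i), rank = exp(H).
    """
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy, math.exp(entropy)

# A flat spectrum uses every direction; a fast-decaying one concentrates energy.
flat_entropy, flat_rank = effective_rank([1.0, 1.0, 1.0, 1.0])
decay_entropy, decay_rank = effective_rank([8.0, 1.0, 0.5, 0.5])
```

A lower entropy means more of the spectrum sits in a few singular values, so truncating the SVD discards less, which fits the reported link between entropy reduction and compression loss.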

2606.07103 2026-06-08 cs.CL 新提交

Style or Content? Evaluating Style Classifiers with Controlled Content Overlap

风格还是内容?通过控制内容重叠评估风格分类器

Zhuo Liu, Haozheng Du, Xiangxiang Xu, Hangfeng He

发表机构 * University of Rochester(罗切斯特大学)

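AI summary Proposes an evaluation setup with controlled content overlap, governed by a parameter α and built on parallel Bible translations; low-overlap models rely on content cues while high-overlap models are more robust, yielding a diagnostic for separating style learning from content shortcuts.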

Comments 9 pages

详情
AI中文摘要

风格分类器可以利用自然收集数据中与风格标签相关的内容线索,但我们缺乏系统的方法来衡量这种依赖。我们通过基于并行圣经翻译构建的控制内容重叠设置来研究这个问题。具体来说,我们将重叠参数α定义为内容身份与风格标签之间互信息的归一化残差,从而衡量风格类别之间共享内容的程度:从无共享内容(α=0)到完全共享内容(α=1)。基于RoBERTa分类器的交叉重叠评估表明,当内容线索被移除时,低重叠模型性能下降,而高重叠模型迁移更鲁棒。跨风格内容检索探针进一步表明,随着α增加,内容变得难以恢复,训练动态显示这种移除是逐渐发生的。这些结果表明,控制重叠为分离风格学习与内容捷径提供了一个简单的诊断方法。

英文摘要

Style classifiers can use content cues that correlate with style labels in naturally collected data, yet we lack a systematic way to measure this reliance. We study this problem with a controlled content overlap setup built on parallel Bible translations. Specifically, we define the overlap parameter $α$ as the normalized residual of mutual information between content identity and style label, so that it measures how much content is shared across style classes: from no shared content ($α=0$) to fully shared content ($α=1$). Cross-overlap evaluation of RoBERTa-based classifiers shows that low-overlap models degrade when content cues are removed, while high-overlap models transfer more robustly. A cross-style content retrieval probe further shows that content becomes less recoverable as $α$ increases, with training dynamics showing this removal occurs gradually. Together, these results suggest that controlled overlap provides a simple diagnostic for separating style learning from content shortcuts.
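One plausible reading of the overlap parameter is α = 1 − I(C; S) / H(S), where C is the content identity and S the style label. This matches the two endpoints given in the abstract, but the normalizer H(S) is an assumption, and the paper may normalize differently. The sketch below estimates it from (content, style) pairs; all names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in nats) of a collection of counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def overlap_alpha(pairs):
    """Estimate alpha = 1 - I(C; S) / H(S) from (content_id, style_label) pairs."""
    pairs = list(pairs)
    h_content = entropy(Counter(c for c, _ in pairs).values())
    h_style = entropy(Counter(s for _, s in pairs).values())
    h_joint = entropy(Counter(pairs).values())
    mutual_info = h_content + h_style - h_joint
    return 1.0 - mutual_info / h_style  # requires at least two style classes

# Each verse appears in only one style: content fully predicts style.
disjoint = [("v1", "A"), ("v2", "A"), ("v3", "B"), ("v4", "B")]
# Every verse appears in both styles: content says nothing about style.
shared = [("v1", "A"), ("v1", "B"), ("v2", "A"), ("v2", "B")]
```

Here `overlap_alpha(disjoint)` evaluates to about 0 and `overlap_alpha(shared)` to about 1, matching the "no shared content" and "fully shared content" endpoints.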

2606.07123 2026-06-08 cs.CL 新提交

Learning Perspectivist Social Meaning via Demographic-Conditioned Fusion Embeddings

通过人口条件融合嵌入学习视角主义社会意义

Amanda Cercas Curry, Lucio La Cava, Luca Maria Aiello, Gianmarco De Francisci Morales

发表机构 * Independent Researcher(独立研究者) University of Calabria(卡拉布里亚大学) IT University of Copenhagen(哥本哈根信息技术大学) CENTAI

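AI summary Proposes fusion embeddings that integrate textual and demographic information to model perspective-dependent interpretations of social meaning on a dataset of 28k human annotations, improving relative macro PR-AUC by 5.9-6.5% over text-only baselines.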

详情
AI中文摘要

语言中的社会意义本质上是视角性的,随着标注者背景、人口统计特征和意识形态立场而变化。然而,大多数NLP系统将这种变化压缩为单一的真实标签,忽略了解释的多样性。在这项工作中,我们沿着视角主义光谱对社会维度进行建模,捕捉在包含28k人工标注的数据集上不同人口群体间解释的变化。我们基准测试了多种建模范式,包括零样本、少样本和微调方法,并提出了融合文本和人口统计表示的融合嵌入。我们的融合模型在所有融合策略上相比纯文本基线产生了持续且统计显著的改进(+5.9-6.5%相对宏PR-AUC),且洗牌消融实验证实人口统计档案携带了真实的预测信号而非虚假相关性。

英文摘要

Social meaning in language is inherently perspectival, varying across annotator backgrounds, demographics, and ideological positions. However, most NLP systems collapse this variation into a single ground-truth label, ignoring the diversity of interpretations. In this work, we model social dimensions along a perspectivist spectrum, capturing how interpretations vary across demographic groups on a dataset consisting of 28k human annotations. We benchmark multiple modeling paradigms, including zero-shot, few-shot, and fine-tuned approaches, and propose fusion embeddings that integrate textual and demographic representations. Our fusion models yield consistent and statistically significant improvements over text-only baselines across all fusion strategies (+5.9-6.5% relative macro PR-AUC), with shuffle ablations confirming that demographic profiles carry genuine predictive signal rather than spurious correlations.

2606.07130 2026-06-08 cs.CL 新提交

Explicit Evidence Grounding via Structured Inline Citation Generation

通过结构化内联引文生成实现显式证据基础

Anar Yeginbergen, Amelie Wührl, Anna Rogers, Rodrigo Agerri

发表机构 * University of the Basque Country (UPV/EHU)(巴斯克大学) IT University of Copenhagen(哥本哈根信息技术大学)

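AI summary Proposes FullCite, a framework that generates structured inline citations via three strategies (prompt-based generation, constrained decoding, and post-hoc span alignment) and evaluates citation quality and faithfulness on three QA benchmarks; LLMs identify relevant documents but struggle to locate the precise supporting evidence spans.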

详情
AI中文摘要

随着AI系统被更广泛采用,对事实性和忠实性生成的需求日益增长。因此,通过引文适当归因信息变得至关重要。本文介绍了FullCite,一个与大多数先前工作不同,生成结构化内联引文的框架,将每个主张链接到其源文档和支持证据。FullCite提出了三种内联引文生成策略:基于提示的生成、在引文语法上的约束解码以及事后跨度对齐。使用三个问答基准,即ASQA、BioASQ和ExpertQA,我们从三个维度评估引文质量和忠实性:文档级正确性、证据跨度识别以及主张-引文忠实性。我们的评估表明,虽然LLMs通常能有效识别相关文档,但它们在识别文档内精确的支持性跨度方面存在困难。这一差距表明,实现忠实的归因问答需要研究更加重视精确的证据跨度识别。

英文摘要

As AI systems become more widely adopted, the demand for factual and faithful generation grows. Properly attributing information through citations therefore becomes crucial. This work introduces FullCite, a framework that, in contrast to most previous work, generates structured inline citations linking each claim to both its source document and its supporting evidence. FullCite proposes three strategies for inline citation generation: prompt-based generation, constrained decoding over a citation grammar, and post-hoc span alignment. Using three question answering benchmarks, namely ASQA, BioASQ, and ExpertQA, we assess citation quality and faithfulness along three dimensions: document-level correctness, evidence span identification, and claim-citation faithfulness. Our evaluation shows that while LLMs are generally effective at identifying relevant documents, they struggle to identify the precise supporting spans within them. This gap suggests that achieving faithful attributed QA will require research to place greater emphasis on precise evidence span identification.

2606.07167 2026-06-08 cs.CL cs.AI 新提交

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

UrduMMLU:乌尔都语理解的大规模多任务基准

Ahmer Tabassum, Sarfraz Ahmad, Hasan Iqbal, Owais Aijaz, Momina Ahsan, Preslav Nakov

发表机构 * MBZUAI

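AI summary Addresses the lack of an MMLU-style Urdu benchmark built from native educational sources by introducing UrduMMLU, with 26,431 multiple-choice questions across 26 subjects; an evaluation of 30 LLMs finds that Gemini-3.5-Flash performs best, while many models score far lower on Urdu-centered Humanities subjects than on STEM.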

Comments 27 pages, 18 figures, 17 tables, Submitted to ARR May 2026

详情
AI中文摘要

有意义的多语言评估必须在目标语言和教育背景下测试模型。乌尔都语有超过2.3亿人使用,但缺乏从本地教育来源构建的广泛MMLU风格基准。我们提出UrduMMLU,一个包含26,431道乌尔都语多选题的基准,涵盖26个学科和五个领域,数据来自本地乌尔都语题库和公开考试PDF。与基于翻译的资源不同,UrduMMLU既包括标准学术科目,也包括乌尔都语和地区特定内容。我们通过双重人工标注和严格共识过滤对考试部分进行标注。我们在英语和乌尔都语提示下评估了30个LLM,进行了60次零样本评估,并进一步在两种提示语言的多个少样本设置下评估了四个开源LLM。Gemini-3.5-Flash表现最佳,准确率达到90.20%和90.34%,而其他模型均未超过85%。最强的开源模型落后7.79和8.92个百分点,许多模型在乌尔都语人文科目上比STEM科目损失25到40个百分点。少样本提示仅带来微小提升。UrduMMLU表明,当前LLM中乌尔都语知识仍不均匀,尤其是地区性内容。

英文摘要

Meaningful multilingual evaluation must test models in the target language and educational context. Urdu, spoken by more than 230 million people, lacks a broad MMLU-style benchmark built from native educational sources. We introduce UrduMMLU, a benchmark of 26,431 Urdu MCQs across 26 subjects and five domains, collected from native Urdu MCQ banks and public examination PDFs. Unlike translation-based resources, UrduMMLU covers both standard academic subjects and Urdu- and region-specific content. We label the exam-derived portion through dual human annotation with strict consensus filtering. We evaluate 30 LLMs under English and Urdu prompts, yielding 60 zero-shot evaluations, and further evaluate four open-source LLMs under multiple few-shot settings across both prompt languages. Gemini-3.5-Flash performs best, reaching 90.20% and 90.34% accuracy, while no other model exceeds 85%. The strongest open-source model trails by 7.79 and 8.92 points, and many models lose 25 to 40 points on Urdu-centered Humanities subjects compared with STEM. Few-shot prompting yields only modest gains. UrduMMLU shows that Urdu knowledge remains uneven in current LLMs, especially for regionally grounded content.

2606.07183 2026-06-08 cs.CL 新提交

Geometry of Semantic Space: Comparative Study of Discrete and Continuous Models

语义空间的几何:离散与连续模型的比较研究

Gabriel Bounias, Sabine Ploux

发表机构 * ISC-PIF (Institut des Systemes Complexes de Paris IdF)(巴黎IDF复杂系统研究所) CNRS, France(法国国家科学研究中心) CAMS (Centre d’analyse et de mathématique sociales)(分析与数学社会研究中心) CNRS & EHESS, Paris, France(法国国家科学研究中心与巴黎高等社会科学学院)

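AI summary Compares the semantic geometry of supervised vector embeddings (such as CamemBERT) with that of lexical co-occurrence graphs, finding that graph-based models yield a clearer, more readable structure while transformer embeddings show unsatisfactory distributions.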

Comments 9 pages, 7 figures

详情
AI中文摘要

这项工作考察了NLP模型背后的语义几何。我们比较了监督向量嵌入(如CamemBERT)与更直接编码语义关系的词汇共现图。虽然基于Transformer的嵌入取得了强劲性能,但它们诱导的几何结构往往显示出不令人满意的分布。相比之下,基于图的模型揭示了更清晰、更易读的意义组织。我们实现了一种方法,允许我们基于这两种方法诱导的图结构或嵌入拓扑进行比较分析。比较结果——应用于法国“大国家辩论”语料库(公众辩论中公民贡献的集合)——显示了相似的局部拓扑,但非常不同的整体结构和拓扑。这些发现表明深度监督模型与基于图的模型之间存在互补视角,为引导神经架构朝向更稳定和可解释的图结构收敛提供了新途径。

英文摘要

This work examines the semantic geometry underlying NLP models. We compare supervised vector embeddings, such as CamemBERT, with lexical co-occurrence graphs that encode semantic relations more directly. While transformer-based embeddings achieve strong performance, their induced geometries often display unsatisfactory distributions. In contrast, graph-based models reveal a clearer and more human-readable organization of meaning. We implement a methodology for comparing the two approaches based on either the structure of the graphs or the topology of the embeddings they induce. The comparison, applied to the French "Great National Debate" corpus (a collection of citizen contributions to the public debate), shows a similar local topology but a very different overall structure and topology. These findings suggest complementary perspectives between deep supervised models and graph-based models, and point to a new pathway for guiding neural architectures toward more stable and interpretable convergence with graph structures.

2606.07190 2026-06-08 cs.CL 新提交

From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning

从正确性到效用:基于增益的LLM推理前缀评估

Yuhang Zhou, Yixin Cao, Guangnan Ye

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

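AI summary Introduces prefix gain and trains a Prefix Utility Model (PUM) with a pairwise ranking objective to score how much a reasoning prefix raises the solve rate, providing a strong prefix-level supervision signal on mathematical reasoning tasks.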

详情
AI中文摘要

推理前缀塑造了LLM问题求解的未来轨迹,然而现有的过程奖励模型通常通过局部步骤正确性来评估它们。我们认为正确性是最终关心效果的有用但间接的代理:即前缀是否增加了成功完成的概率。我们将此效果定义为前缀增益,即通过在一个前缀上条件化轻量级学生模型组所导致的求解率提升,并使用简单的成对排序目标训练前缀效用模型(PUM)。PUM学习基于结果的前缀效用,并能对完整轨迹和部分推理前缀进行评分。在数学推理的Best-of-$N$选择、束搜索和强化学习中,PUM提供了强大的前缀级监督信号,尤其是在候选池大、搜索预算增加或基于规则的奖励稀疏时。我们在 https://zhiqix.github.io/pum-project-page 发布所有数据、模型和代码。

英文摘要

Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect we ultimately care about: whether a prefix increases the probability of successful completion. We define this effect as prefix gain, the solve-rate improvement induced by conditioning a group of lightweight student models on a prefix, and use it to train a Prefix Utility Model (PUM) with a simple pairwise ranking objective. PUM learns outcome-grounded prefix utility and can score both complete trajectories and partial reasoning prefixes. Across Best-of-$N$ selection, beam search, and reinforcement learning on mathematical reasoning, PUM provides a strong prefix-level supervision signal, especially when candidate pools are large, search budgets increase, or rule-based rewards are sparse. We release all data, models, and code at https://zhiqix.github.io/pum-project-page.
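The prefix gain and the pairwise ranking objective described above can be sketched as follows. The abstract gives no formulas, so the logistic (Bradley-Terry style) form of the ranking loss and all function names are assumptions rather than the authors' implementation.

```python
import math

def solve_rate(outcomes):
    """Fraction of successful attempts; outcomes are 0/1 flags from student rollouts."""
    return sum(outcomes) / len(outcomes)

def prefix_gain(outcomes_with_prefix, outcomes_without_prefix):
    """Solve-rate improvement from conditioning the students on a prefix."""
    return solve_rate(outcomes_with_prefix) - solve_rate(outcomes_without_prefix)

def pairwise_ranking_loss(score_better, score_worse):
    """-log sigmoid(score_better - score_worse), computed in a numerically stable way."""
    margin = score_better - score_worse
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Two prefixes for the same problem, each tried by four student rollouts.
gain_a = prefix_gain([1, 1, 1, 0], [1, 0, 0, 0])
gain_b = prefix_gain([1, 0, 0, 0], [1, 0, 0, 0])
# Prefix A has the higher gain, so the utility model should score it higher.
loss = pairwise_ranking_loss(score_better=1.2, score_worse=0.3)
```

Training on pairs ordered by gain means the model needs only relative judgments between prefixes, not a calibrated estimate of the gain itself.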

2606.07219 2026-06-08 cs.CL cs.SI 新提交

Adversarial Creation and Detection of AI-Generated Social Bot Content

AI生成的社交机器人内容的对抗性创建与检测

Mykola Trokhymovych, Ricardo Baeza-Yates, Alessandro Flammini, Diego Saez-Trumper, Filippo Menczer

发表机构 * Universitat Pompeu Fabra(庞培法布拉大学) Observatory on Social Media, Indiana University(印第安纳大学社交媒体观测站) KTH Royal Institute of Technology(瑞典皇家理工学院)

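AI summary Proposes an adversarial methodology that models malicious actors impersonating real users, builds a multilingual, cross-platform dataset of paired human and AI-generated messages, and trains a detector that significantly outperforms existing methods.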

详情
AI中文摘要

大型语言模型与社交机器人的结合使得恶意行为者能够通过大规模生成类人内容来操纵信息生态系统。现有的AI生成内容检测模型在真实场景中常常失效,主要原因是缺乏真实标注数据。我们通过一种对抗性方法弥补了这一空白,该方法模拟了恶意行为者对真实社交媒体用户的冒充。利用这种方法,我们整理了一个多语言、跨平台的人类与AI生成消息的配对数据集。在这样的对抗性数据上训练,能够实现对AI生成文本的准确检测。我们的方法在真实世界、分布外数据上显著优于现有的基于内容的机器人检测模型。

英文摘要

The convergence of large language models and social bots allows malicious actors to manipulate the information ecosystem by generating human-like content at scale. Existing models for detecting AI-generated content often fail in the wild, primarily due to the lack of ground-truth data. We address this gap through an adversarial methodology that models the impersonation of real social media users by malicious actors. Using this methodology, we curate a multilingual, cross-platform dataset of paired human and AI-generated messages. Training on such adversarial data yields accurate detection of AI-generated text. Our approach significantly outperforms existing models for content-based bot detection in real-world, out-of-distribution data.

2606.07237 2026-06-08 cs.CL cs.AI cs.LG 新提交

When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

当大型语言模型在医疗保健中失败:评估对提示变化的敏感性

Mahdi Alkaeed

发表机构 * Department of Computer Science and Engineering, Doha, Qatar(计算机科学与工程系,多哈,卡塔尔)

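AI summary Systematically analyzes the sensitivity of general-purpose and medical-specific LLMs to prompt perturbations, finding that even minor changes in phrasing can alter clinical advice and that adversarial prompts can provoke harmful outputs, which indicates that these models are unreliable for clinical use.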

Comments 12 pages

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于医疗保健任务,如临床问答、诊断支持和报告总结。尽管前景广阔,但这些模型对微小的提示扰动(包括词汇和句法)仍然高度敏感,在安全关键的临床应用中构成严重风险。在本研究中,我们使用MedMCQA基准进行了系统的敏感性分析,以评估通用(例如GPT-3.5、Llama3)和医学专用LLM(例如ClinicalBERT、BioLlama3、BioBERT)的鲁棒性。我们将扰动分为自然和对抗两种类型,并检查它们对临床推理任务中模型一致性、准确性和可靠性的影响。我们的发现表明,医学LLM并非本质安全。即使是措辞的微小变化也可能改变临床建议,而针对性的对抗性提示可能引发有害输出。在医疗保健等高风险环境中,这种不可预测性是不可接受的——模型因重新措辞的输入而改变诊断,或因轻微改写而幻觉药物,临床医生无法可靠地信任它们。虽然模型通常对简单的词汇替换或释义表现出韧性,但在句法重新排序或误导性上下文线索下往往会崩溃。这种脆弱性在通用和领域专用LLM中都很明显。值得注意的是,对抗性操作可能导致临床危险的输出,例如推荐不正确的剂量或遗漏关键发现。

英文摘要

Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle prompt perturbations, both lexical and syntactic, posing serious risks in safety-critical clinical applications. In this study, we conduct a systematic sensitivity analysis to evaluate the robustness of both general-purpose (e.g., GPT-3.5, Llama3) and medical-specific LLMs (e.g., ClinicalBERT, BioLlama3, BioBERT) using the MedMCQA benchmark. We categorize perturbations into natural and adversarial types and examine their effect on model consistency, accuracy, and reliability in clinical reasoning tasks. Our findings reveal that medical LLMs are not intrinsically safe. Even minor variations in phrasing can alter clinical advice, and targeted adversarial prompts can provoke harmful outputs. In high-stakes settings like healthcare, such unpredictability is unacceptable: clinicians cannot rely on models that change a diagnosis when an input is reworded or hallucinate medications when a prompt is slightly rephrased. While models tend to show resilience to simple lexical substitutions or paraphrasing, they often break down under syntactic reordering or misleading contextual cues. This fragility is evident across both general-purpose and domain-specific LLMs. Notably, adversarial manipulations can lead to clinically dangerous outputs, such as recommending incorrect dosages or omitting critical findings.

2606.07240 2026-06-08 cs.CL cs.SD 新提交

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

KIT 提交至 IWSLT 2026 跨语言语音克隆任务

Seymanur Akti, Alexander Waibel

发表机构 * Karlsruhe Institute of Technology (KIT)(卡尔斯鲁厄理工学院) Carnegie Mellon University (CMU)(卡内基梅隆大学) KIT Campus Transfer (KCT)

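AI summary Targets accent variation and domain-specific vocabulary in cross-lingual voice cloning by building on the multilingual text-to-speech model FishAudio-S2-Pro and adding language tag prompting, reinforcement learning fine-tuning, and reference-conditioned lexical matching to improve intelligibility and naturalness.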

详情
AI中文摘要

跨语言语音克隆旨在在保留源语言参考说话者身份的同时,生成目标语言的语音。该任务是语音翻译的核心,也是IWSLT 2026跨语言语音克隆轨道的焦点。一个关键挑战是在口音变化和领域特定词汇存在的情况下保持可懂度和自然度。我们基于多语言文本转语音模型FishAudio-S2-Pro,引入语言标签提示以改善语言控制并减少口音泄漏。我们进一步应用强化学习(RL)微调进行任务适应,并观察到可懂度的提升。最后,我们提出了一种参考条件词汇匹配方法,在词汇重叠时改善领域特定术语的发音。结果表明,语言提示带来了最大的增益,而词汇匹配在匹配子集上产生了一致的改进。

英文摘要

Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.

2606.07300 2026-06-08 cs.CL 新提交

Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese

Phun-Bench:评估大语言模型的中文语音理解能力

Xing Yue, Yongliang Shen, Weiming Lu

发表机构 * Zhejiang University(浙江大学)

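AI summary Proposes Phun-Bench, a benchmark that systematically evaluates the phonological understanding of LLMs along three dimensions (homophony, rhyme, and phonetic similarity), and finds that models fall short in applying phonological knowledge flexibly.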

Comments Accepted to ACL 2026 Main Conference

详情
AI中文摘要

语言是思想的载体,与声音、符号和意义紧密相连。然而,大多数大语言模型(LLM)研究关注意义(语义)和符号(拼写),而很大程度上忽略了声音。现有的LLM语音能力基准要么可以通过死记硬背解决,要么与其他能力交织在一起,不足以衡量LLM在语音理解方面的真实能力。在这里,我们提出Phun-Bench,一个专门构建的中文基准,包含跨三个维度(同音、押韵和语音相似性)的多样化任务和设置,旨在系统评估LLM的语音理解能力。我们的结果表明,虽然LLM在回忆正确发音方面表现出色,但它们通常难以像人类说话者那样灵活直观地利用语音知识。此外,通过详细分析,我们提出了关于LLM语音理解和“感知”潜在机制的假设,突出了未来研究的一个未充分探索的前沿。

英文摘要

Language is a vehicle for thought, intricately tied to sounds, symbols, and meaning. However, most large language model (LLM) research focuses on meaning (semantics) and symbols (spelling) while largely overlooking sounds. Existing benchmarks on LLMs' phonological abilities are either solvable through rote memorization or intertwined with other abilities, making them inadequate to measure LLMs' genuine ability in phonological understanding. Here, we present Phun-Bench, a purpose-built Chinese benchmark with diverse tasks and settings across three dimensions (Homophony, Rhyme, and Phonetic Similarity), designed to systematically evaluate LLMs' phonological understanding. Our results show that while LLMs excel at recalling correct pronunciations, they generally struggle to leverage phonological knowledge in the flexible and intuitive way that human speakers do. Moreover, through detailed analyses, we propose a hypothesis regarding the underlying mechanism of LLMs' phonological understanding and "perception", highlighting an underexplored frontier for future research.

2606.07313 2026-06-08 cs.CL cs.AI 新提交

SV-Detect: AI-generated Text Detection with Steering Vectors

SV-Detect: 基于引导向量的AI生成文本检测

Mikhail Vishnyakov, Tatiana Gaintseva

发表机构 * Independent Researcher(独立研究者) Queen Mary University of London(伦敦玛丽女王大学)

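AI summary Proposes extracting steering vectors from the hidden representations of a frozen language model and training a lightweight classifier on layer-wise projection features to detect machine-generated text across domains, source models, and editing attacks.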

详情
AI中文摘要

检测机器生成文本在分布偏移(如跨域、源模型和编辑攻击的迁移)下尤其困难。我们提出了一种基于从冻结语言模型的隐藏表示中提取的引导向量的假文本检测器。在每一层,我们构建一个分离人类编写文本和机器生成文本的方向,并通过每个输入与这些方向的逐层对齐来表示输入。在这些投影特征上训练的轻量分类器产生最终的检测分数。我们的方法在分布内和分布偏移下均表现出色,包括跨域、跨源模型以及机器编辑转换(如润色和重写)。解释分析表明,学习到的方向与可识别的风格线索一致,同时捕获了超越表面特征的显著额外信号。这些结果将假文本检测定位为表示空间探测问题,并表明引导向量提供了一种简单有效的解决方案。

英文摘要

Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and editing attacks. We propose a fake-text detector based on steering vectors extracted from the hidden representations of a frozen language model. At each layer, we construct a direction that separates human-written from machine-generated text, and represent each input by its layer-wise alignment with these directions. A lightweight classifier trained on these projection features yields the final detection score. Our method achieves strong performance both in-distribution and under distribution shift, including across domains, source models, and machine-editing transformations such as polishing and rewriting. Interpretation analyses show that the learned directions align with recognizable stylistic cues while capturing substantial additional signal beyond surface features. These results position fake-text detection as a representation-space probing problem and show that steering vectors provide a simple and effective solution.
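The layer-wise feature construction described above can be sketched with a common way of building steering vectors: the normalized difference between the mean hidden state of machine-generated texts and that of human-written texts. The abstract does not say how its directions are built, so the difference-of-means choice and all names below are assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_direction(human_states, machine_states):
    """Unit-norm difference of class means for a single layer (assumed construction)."""
    mean_human = mean_vector(human_states)
    mean_machine = mean_vector(machine_states)
    diff = [m - h for m, h in zip(mean_machine, mean_human)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]

def projection_features(layer_states, directions):
    """One feature per layer: the hidden state projected onto that layer's direction."""
    return [
        sum(h * d for h, d in zip(state, direction))
        for state, direction in zip(layer_states, directions)
    ]

# Toy two-dimensional hidden states for one layer.
human = [[0.0, 0.0], [0.0, 2.0]]
machine = [[2.0, 0.0], [2.0, 2.0]]
direction = steering_direction(human, machine)
features = projection_features([[2.0, 1.0]], [direction])
```

In a full pipeline the feature vector would have one entry per layer of the frozen model and would be passed to a small classifier such as logistic regression.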

2606.07342 2026-06-08 cs.CL cs.NE 新提交

LLM-Guided Evolution for Medical Decision Pipelines

LLM引导的医疗决策流程进化

Ivan Sviridov, Artem Oskin, Ivan Panin, Iaroslav Bespalov, Dmitry Dylov, Ivan Oseledets, Aleksandr Nesterov

发表机构 * Sber AI Lab(Sber AI实验室) AIRI

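AI summary Proposes LLM-guided MAP-Elites evolution, which optimizes medical decision pipelines without fine-tuning and outperforms manually designed baselines on triage, consultation, and image classification tasks.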

详情
AI中文摘要

将大型语言模型(LLM)适应临床工作流程通常需要昂贵的微调或手动提示和流程工程。我们研究了LLM引导的MAP-Elites进化作为一种推理时替代方案,用于发现医疗决策策略,并在 https://github.com/univanxx/llm_guided_evo_medical 提供实现仓库。我们将紧急分诊、交互式咨询和医学图像分类表述为对可执行工件的进化搜索,这些工件由特定任务的适应度函数优化。在所有三种设置中,进化在实践约束下改进了手工设计的基线。在分诊中,进化程序将Semigran准确率从77.3%提高到87.1%,紧急召回率从0.60提高到0.97,同时改进了安全加权的保留MIMIC-ESI性能。在交互式咨询中,进化策略改进了Llama-3、Qwen-3.5和Gemma-4的准确率-成本前沿,并迁移到保留的iCRAFTMD。在PneumoniaMNIST中,仅提示进化改进了冻结的MedGemma VLM,同时保留了严格的JSON输出。定性分析表明,收益来自可解释的程序级机制、校准的分诊边界、有针对性的证据获取、选择性承诺和面向发现的视觉决策规则,而不仅仅是表面的提示改写。

英文摘要

Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at https://github.com/univanxx/llm_guided_evo_medical. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions. Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from 77.3% to 87.1% and emergency recall from 0.60 to 0.97, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy-cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. In PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms, calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules, rather than superficial prompt rewording alone.
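For readers unfamiliar with MAP-Elites, the core loop can be sketched generically: keep an archive with one best solution per behavior cell, pick a stored elite, mutate it, and keep the child if its cell is empty or it beats the incumbent. In the paper the mutation is performed by an LLM editing executable artifacts; the toy below substitutes a random numeric perturbation, and every name in it is hypothetical.

```python
import math
import random

def map_elites(fitness, descriptor, mutate, seed_solution, iterations=200, rng=None):
    """Generic MAP-Elites loop: archive maps a behavior cell to (fitness, solution)."""
    rng = rng or random.Random(0)
    archive = {descriptor(seed_solution): (fitness(seed_solution), seed_solution)}
    for _ in range(iterations):
        _, parent = rng.choice(list(archive.values()))
        child = mutate(parent, rng)
        cell, score = descriptor(child), fitness(child)
        # Keep the child only if its cell is empty or it beats the current elite.
        if cell not in archive or score > archive[cell][0]:
            archive[cell] = (score, child)
    return archive

# Toy stand-ins: a number plays the role of an evolved program.
def toy_fitness(x):
    return -(x - 3.0) ** 2

def toy_descriptor(x):
    return math.floor(x)  # behavior cell = integer bin

def toy_mutate(x, rng):
    return x + rng.uniform(-1.0, 1.0)  # an LLM edit would go here instead

archive = map_elites(toy_fitness, toy_descriptor, toy_mutate, seed_solution=0.0)
best_score, best_solution = max(archive.values())
```

Keeping one elite per cell preserves diverse strategies instead of collapsing onto a single optimum, which is the usual reason to prefer MAP-Elites over plain hill climbing.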

2606.07402 2026-06-08 cs.CL 新提交

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

M$^3$Exam: 面向真实用户-智能体交互的多模态记忆基准

Zhengjun Huang, Wenxuan Liu, Zhoujin Tian, Wei Chen, Junle Chen, Yuqian Wu, Fangyuan Zhang, Qintian Guo, Xiaofang Zhou

Affiliations: The Hong Kong University of Science and Technology; Beijing University of Chemical Technology; The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology (Shenzhen); Beijing Institute of Technology (Zhuhai); Tencent Hy; Peng Cheng Laboratory

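AI summary: Introduces the M$^3$Exam benchmark for evaluating cross-modal grounding and implicit information inference by multimodal LLMs in realistic user-agent interactions, and proposes M$^3$Proctor, which consumes visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.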

详情
AI中文摘要

语言智能体越来越多地部署在积累的多模态信息上,然而现有基准假设人机交互形式,具有稀疏的视觉内容和直白的内容,既不评估基于真实多模态文件交互的推理,也不评估对隐藏用户信息的解释。因此,我们引入了M$^3$Exam,一个基于真实用户-智能体交互的查询中心多模态对话记忆基准,具有跨模态基础推理和隐式信息推断的多维评估。对多模态大语言模型和记忆系统的基准测试揭示了跨模态基础推理、跨会话推理以及累积多模态上下文的效率成本方面的持续差距。我们进一步提出了M$^3$Proctor,一种多模态记忆方法,它检测查询模态偏差并仅按需消耗原始视觉源,将准确率提高13%,同时将索引构建时间和检索到的令牌减少超过70%。

英文摘要

Language agents are increasingly deployed over accumulating multimodal information, yet existing benchmarks assume a human-human form with sparse visuals and straightforward content, evaluating neither reasoning over authentic multimodal file interaction nor the interpretation of concealed user information. We therefore introduce M$^3$Exam, a query-centric multimodal conversational memory benchmark built on realistic user-agent interaction, with multi-dimensional evaluation spanning cross-modal grounding and implicit information inference. Benchmarking MLLMs and memory systems reveals persistent gaps in cross-modal grounding, cross-session reasoning, and the efficiency cost of accumulating multimodal context. We further propose M$^3$Proctor, a multimodal memory method that detects query modality bias and consumes raw visual sources only on demand, improving accuracy by 13% while cutting index-construction time and retrieved tokens by over 70%.

2606.07441 2026-06-08 cs.CL 新提交

Sycophantic Praise: Evaluating Excessive Praise in Language Models

谄媚式赞美:评估语言模型中的过度赞美

Daniel Vennemeyer, Phan Anh Duong, Meryl Ye, Ruihong Huang, Tianyu Jiang

Affiliations: University of Cincinnati; Carnegie Mellon University; Texas A&M University

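AI summary: Proposes a parameterized framework that measures whether praise is excessive, finds that sycophantic praise is far more frequent in social and interpretive domains than in objective reasoning settings, and argues that current methods cannot measure it reliably.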

详情
AI中文摘要

语言模型中的谄媚通常被研究为过度同意或验证,而明确的赞美和奉承相对较少受到关注。我们认为谄媚式赞美是一个独特的对齐问题,无法使用当前方法可靠测量。我们引入了一个参数化框架,衡量赞美相对于贡献质量和预期用户能力是否过度。我们表明,在与人类标注的一致性上,我们的框架显著优于通用LLM评判者,并且谄媚式赞美在社交和解释性领域发生的频率远高于客观推理环境。这些发现共同将赞美校准定位为一个独特的对齐挑战。

英文摘要

Sycophancy in language models is typically studied as excessive agreement or validation, while explicit praise and flattery have received comparatively little attention. We argue that sycophantic praise is a distinct alignment problem that cannot be reliably measured using current methods. We introduce a parameterized framework that measures whether praise is excessive relative to contribution quality and expected user ability. We show that our framework substantially outperforms generic LLM judges in agreement with human annotations, and that sycophantic praise occurs far more frequently in social and interpretive domains than in objective reasoning settings. Together, these findings position praise calibration as a distinct alignment challenge.

2606.07479 2026-06-08 cs.CL cs.AI 新提交

Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

基于监督与基于演示的上下文学习在多词表达分类中的比较

Sercan Karakaş, Yusuf Şimşek

Affiliations: University of Chicago; Fırat University

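AI summary: Studies Turkish multiword expression classification, comparing a supervised baseline (BERTurk) with instruction-tuned LLMs under zero-shot, one-shot, and few-shot prompting, and finds substantial prompt sensitivity and demonstration-induced bias.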

Comments Accepted to ACL SRW 2026

详情
AI中文摘要

土耳其语习语性轻动词结构(LVC)对多词表达处理具有挑战性,因为它们通常与完全字面义的动词-宾语组合共享相同表面形式,同时作为一个部分习语性谓词发挥作用。我们将土耳其语LVC检测定义为二元分类任务(字面义 vs. 习语义),并在手动创建的受控集(N=147)上评估,该集合包含匹配的负例:域外随机句子和域内字面义控制(NLVC),以及LVC正例。我们比较了监督土耳其语编码器基线(带有分类头的BERTurk)与来自不同家族的三个指令微调LLM,在零样本、单样本和少样本提示下的表现,并分析演示如何改变错误分布。在零样本情况下,LLM在负例上表现良好,但LVC召回率非常低。单样本提示显著提高了LVC检测,但可能引发强烈的、模型特定的偏差,导致模型过度预测或欠预测LVC。更丰富的少样本提示改善了校准,并为GPT-OSS-20B和Qwen 2.5-14B带来了稳健的整体性能。总体而言,结果突显了土耳其语元语言分类中的显著提示敏感性:监督基线仍然具有竞争力,而提示LLM在精心构建的演示下可以在LVC上匹配或超越它。

英文摘要

Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb-object combinations while functioning as a single, partially idiomatic predicate. We frame Turkish LVC detection as a binary classification task (literal meaning vs. idiomatic meaning) and evaluate on a manually created controlled set (N=147) with matched negatives: out-of-domain random sentences and in-domain literal controls (NLVC), alongside LVC positives. We compare a supervised Turkish encoder baseline (BERTurk with a classifier head) to three instruction-tuned LLMs from different families under zero-shot, one-shot, and few-shot prompting, and analyze how demonstrations shift error profiles. In zero-shot, LLMs perform well on negatives but show very low LVC recall. One-shot prompting sharply improves LVC detection but can induce strong, model-specific biases, leading models to overpredict or underpredict LVCs. A richer few-shot prompt improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B. Overall, the results highlight substantial prompt sensitivity in Turkish metalinguistic classification: the supervised baseline remains competitive, while prompted LLMs can match or exceed it on LVCs with carefully constructed demonstrations.

2606.07502 2026-06-08 cs.CL cs.IR 新提交

Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings

你的解嵌入矩阵实际上是文本嵌入的特征透镜

Songhao Wu, Zhongxin Chen, Yuxuan Liu, Heng Cui, Cong Li, Rui Yan

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; Lenovo Group Limited; Wuhan University

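AI summary: Finds that LLM text embeddings align with high-frequency tokens, which limits semantic capture, and proposes EmbedFilter, which filters the corresponding subspace of the unembedding matrix to improve representations while reducing dimensionality and speeding up retrieval.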

Comments preprint

详情
AI中文摘要

大型语言模型在广泛的下游任务中展现出令人印象深刻的零样本能力。然而,它们难以作为现成的嵌入模型,导致在大量文本嵌入基准测试中表现欠佳。在本文中,我们确定了这种缺陷的一个潜在原因。我们的动机源于一个意外的观察:当文本嵌入投影到词汇空间时,它们倾向于与频繁但信息量少的词对齐。我们认为,这种对高频词的过度表达抑制了模型捕获细微语义的能力。为了解决这个问题,我们引入了EmbedFilter,一种简单的线性变换,旨在直接精炼从LLM中导出的文本嵌入。具体来说,我们发现LLM内部的解嵌入矩阵编码了一个潜在空间,该空间正在主动将这些高频词写入嵌入空间。通过过滤掉这个子空间,EmbedFilter抑制了高频词的影响,从而增强了语义表示。作为一个引人注目的副产品,这实现了固有的降维,降低了索引存储并加速了检索,同时完全保留了精炼后的嵌入质量。我们在多个LLM骨干上的实验表明,配备EmbedFilter的LLM即使在嵌入维度显著降低的情况下也能实现优越的零样本下游性能。我们希望我们的发现能提供对基于LLM的表示机制的更深入见解,并激发更多有原则的设计来改进文本嵌入训练。我们的代码可在 https://github.com/CentreChen/EmbFilter 获取。

英文摘要

Large language models exhibit impressive zero-shot capabilities across a wide range of downstream tasks. However, they struggle to function as off-the-shelf embedding models, leading to suboptimal performance on massive text embedding benchmarks. In this paper, we identify a potential cause underlying this deficiency. Our motivation stems from an unexpected observation: text embeddings tend to align with frequent but uninformative tokens when projected onto the vocabulary space. We argue that this excessive expression of high-frequency tokens suppresses the model's ability to capture nuanced semantics. To address this, we introduce EmbedFilter, a simple linear transformation designed to refine text embeddings derived from LLMs directly. Specifically, we uncover that the unembedding matrix within LLMs encodes a latent space that is actively writing these frequent tokens into embedding space. By filtering out this subspace, EmbedFilter suppresses the influence of high-frequency tokens, thereby enhancing semantic representations. As a compelling byproduct, this enables an inherent dimensionality reduction, lowering index storage and speeding up retrieval while fully preserving the refined embedding quality. Our experiments across multiple LLM backbones demonstrate that LLMs equipped with EmbedFilter achieve superior zero-shot downstream performance even with significantly reduced embedding dimensions. We hope our findings provide deeper insights into the mechanisms of LLM-based representations and inspire more principled designs to improve text embedding training. Our code is available at https://github.com/CentreChen/EmbFilter.
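The idea of "filtering out a subspace" with a linear transformation can be illustrated with a plain orthogonal projection. The sketch below is only that, an illustration: the abstract does not say how EmbedFilter identifies the subspace, so here it is assumed to be the span of the unembedding rows of a hand-picked set of frequent token ids. The matrix `W`, the ids, and the function name are hypothetical, and the directions are assumed to be linearly independent.

```python
import numpy as np


def remove_subspace(embeddings, directions):
    """Project row-vector embeddings onto the orthogonal complement of span(directions)."""
    # Orthonormal basis for the span of the direction rows
    # (assumes the rows are linearly independent).
    q, _ = np.linalg.qr(directions.T)          # q has shape (dim, k)
    return embeddings - (embeddings @ q) @ q.T


rng = np.random.default_rng(0)
W = rng.normal(size=(50, 8))        # hypothetical unembedding matrix: (vocab, dim)
frequent_ids = [0, 1, 2]            # hypothetical ids of high-frequency tokens
E = rng.normal(size=(4, 8))         # four hypothetical text embeddings

filtered = remove_subspace(E, W[frequent_ids])

# After filtering, the embeddings no longer score on the chosen tokens.
print(np.abs(filtered @ W[frequent_ids].T).max())
```

Because the filtered vectors lie in a subspace of dimension `dim - k`, they could in principle be stored in fewer coordinates, which is one way to read the dimensionality reduction mentioned in the abstract.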

2606.07513 2026-06-08 cs.CL 新提交

Agentopia: Long-Term Life Simulation and Learning in Agent Societies

Agentopia: 智能体社会中的长期生活模拟与学习

Xintao Wang, Sirui Zheng, Hongqiu Wu, Weiyuan Li, Jen-tse Huang, Minghao Zhu, Can Zu, Qi Deng, Jiawei Wang, Qianyu He, Heng Wang, Xiaojian Wu, Yunzhe Tao

Affiliations: Fudan University; Johns Hopkins University; University of Science and Technology of China

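AI summary: Proposes Agentopia, a framework that simulates the social lives of 100 agents over 10 simulated years and trains LLMs with a life reward, improving their social intelligence and yielding a 15.6% gain on role-playing benchmarks.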

Comments 79 pages, 19 figures

详情
AI中文摘要

人类从社会生活中学习。用LLM驱动的智能体模拟这一过程代表了一个有前景的研究方向,引发了一个自然的问题:LLM能否从这种模拟的社会经验中学习,以更好地理解和复制人类行为。然而,先前的智能体社会模拟通常以天为单位运行,限制了社会互动的深度和长期成长。在本文中,我们研究智能体社会中的长期生活模拟和LLM学习,有两个目标:(1) 研究从终身模拟中涌现的社会行为,(2) 通过多年的模拟社会经验,发展LLM的拟人化能力,特别是社会生活中的智能。具体来说,我们提出了Agentopia,一个用于多智能体社会中长期生活模拟的综合框架,其中100个智能体在10年的模拟时间内自主追求个人成长、发展社会关系并满足其需求和目标。我们定义了生命奖励来反映人类福祉,并利用该奖励通过拒绝采样训练LLM。大量实验表明,智能体表现出丰富的涌现社会行为。此外,生命奖励训练有效增强了底层LLM,从而在模拟中改善了智能体的福祉,并泛化到下游角色扮演基准,提升了15.6%。

英文摘要

Humans learn from social life. Simulating this process with LLM-powered agents represents a promising research direction, raising a natural question: whether LLMs can learn from such simulated social experience to better understand and replicate human behavior. However, prior agent society simulations typically operate at the scale of days, limiting the depth of social interactions and long-term growth. In this paper, we study long-term life simulation and LLM learning in agent societies, with two goals: (1) investigating social behaviors that emerge from life-long simulation, and (2) developing anthropomorphic capabilities in LLMs, particularly intelligence in social life, through years of simulated social experience. Specifically, we present Agentopia, a comprehensive framework for long-term life simulation in multi-agent societies, where 100 agents autonomously pursue personal growth, develop social relationships, and fulfill their needs and goals over 10 simulated years. We define a life reward to mirror human well-being, and leverage this reward to train LLMs via rejection sampling. Extensive experiments show that agents exhibit rich emergent social behaviors. Furthermore, life reward training effectively enhances the underlying LLM, which leads to improved agent well-being in simulation and generalizes to downstream role-playing benchmarks with a 15.6% improvement.

2606.06464 2026-06-08 cs.CL cs.AI 新提交

Human Adults and LLMs as Scientists: Who Benefits from Active Exploration?

人类成人与LLM作为科学家:谁从主动探索中受益?

Mandana Samiei, Eunice Yiu, Anthony GX-Chen, Dongyan Lin, Jocelyn Shen, Blake A. Richards, Alison Gopnik, Doina Precup

Affiliations: Mila - Quebec AI Institute; McGill University; University of California Berkeley; New York University; Meta FAIR; MIT Media Lab; Montreal Neurological Institute

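AI summary: Finds that active exploration substantially improves adults' reasoning about conjunctive causal rules, although conjunctive rules still require more tests to infer than disjunctive ones; in the same setting, some LLMs approach human hypothesis-inference accuracy but explore less efficiently and show a similar conjunctive-disjunctive gap.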

Comments Accepted at the 48th Annual Conference of the Cognitive Science Society (CogSci 2026)

详情
AI中文摘要

因果学习文献中的一个长期发现是,成人难以识别合取因果规则(即一个效应需要多个原因同时存在),而在析取情境中表现更好。然而,这种“合取缺陷”的大多数演示依赖于被动观察范式,证据有限,学习者无法控制证据生成。本文探讨当成人通过主动探索获得能动性时,这种偏见是否仍然存在。使用修改后的“blicket检测器”任务,成人参与者在合取或析取规则结构下自由干预以识别因果对象。我们表明,主动探索显著改善了成人的合取因果推理,尽管合取规则仍比析取规则需要更多测试来推断。我们进一步将人类表现与同一设置下的多种大型语言模型进行比较。虽然一些最先进的模型在假设推断准确率上接近人类水平,但它们通常表现出效率较低的探索策略以及类似的合取-析取性能差距。

英文摘要

A long-standing finding in the causal learning literature is that adults struggle to identify conjunctive causal rules, where an effect requires the simultaneous presence of multiple causes, while performing better in disjunctive settings. However, most demonstrations of this "conjunctive handicap" rely on passive observation paradigms with limited evidence, where learners have no control over evidence generation. This paper asks whether this bias persists when adults are granted agency through active exploration. Using a modified "blicket detector" task, adult participants freely intervened to identify causal objects under conjunctive or disjunctive rule structures. We show that active exploration substantially improves adults' conjunctive causal reasoning, although conjunctive rules still require more tests to infer than disjunctive rules. We further compare human performance to a range of large language models in the same setting. While some state-of-the-art models approach human-level performance on hypothesis inference accuracy, they often exhibit less efficient exploration strategies and similar conjunctive-disjunctive performance gaps.

2606.05510 2026-06-08 cs.AI cs.CL 交叉投稿

Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation

基于严重性感知的课程学习与多模型响应选择用于医疗文本生成

Ahmed Alansary, Molham Mohamed, Ali Hamdi

Affiliations: Faculty of Computer Science, MSA University, Giza, Egypt

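AI summary: Proposes a multi-model framework combining curriculum learning with relevance-based response selection, in which five LLMs are each trained independently under a three-stage curriculum, improving medical text generation on the MAQA dataset.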

Comments 6 pages, 3 figures, IMSA2026

详情
AI中文摘要

远程医疗系统在提供可获取且及时的医疗信息方面变得越来越重要。现有的大语言模型通常难以在不同病例严重程度下提供一致且上下文恰当的医疗响应。这一局限性凸显了需要能够有效适应医疗查询渐进复杂性的模型。为了解决这一挑战,我们引入了一个严重性感知的多模型框架,该框架将课程训练策略与基于相关性的响应选择相结合。所提出的框架采用三阶段课程学习策略,每个模型依次在轻度、中度和危重病例上进行训练,以逐步获取领域知识。该方法利用五个大语言模型,每个模型在相同的课程方案下独立训练。在推理过程中,所有模型生成候选响应,并选择最合适的响应作为最终输出。该框架在MAQA数据集上进行训练和评估,该数据集提供带注释的医疗问答对。使用BERTScore评估的实验结果表明,与基线和微调模型相比,所提出的方法取得了优越的性能,在基线设置下达到86.71%,微调后达到90.30%。这些结果凸显了将课程学习与多模型响应选择相结合在提高医疗文本生成中的响应质量和相关性方面的有效性。

英文摘要

Telehealth systems have become increasingly important for delivering accessible and timely medical information. Existing large language models often struggle to provide consistent and contextually appropriate medical responses across varying levels of case severity. This limitation highlights the need for models that can effectively adapt to the progressive complexity of medical queries. To address this challenge, we introduce a severity-aware multi-model framework that integrates a curriculum training strategy with relevance-based response selection. The proposed framework employs a three-stage curriculum learning strategy, where each model is trained sequentially on mild, moderate, and critical cases to progressively acquire domain knowledge. The approach utilizes five large language models, each independently trained under the same curriculum scheme. During inference, all models generate candidate responses, and the most appropriate response is selected as the final output. The framework is trained and evaluated on the MAQA dataset, which provides annotated medical question-answer pairs. Experimental results evaluated using BERTScore demonstrate that the proposed method achieves superior performance compared to both baseline and fine-tuned models, attaining 86.71% in the baseline setting and 90.30% after fine-tuning. These results highlight the effectiveness of combining curriculum learning with multi-model response selection in improving response quality and relevance in medical text generation.

2606.06533 2026-06-08 cs.AI cs.CL 交叉投稿

Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

立场:不要仅仅“在后期修复它”:AI科学必须研究训练动态

Stella Biderman, Mohammad Aflah Khan, Niloofar Mireshghallah, Catherine Arnett, Fazl Barez, Naomi Saphra

Affiliations: Carnegie Mellon University; Boston University; Harvard University; University of Oxford, Martian; Max Planck Institute for Software Systems

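AI summary: Argues that a science of AI should go beyond post-hoc analysis and study training dynamics in order to predict, intervene on, and design model behavior, and reviews progress and open problems in areas such as interpretability and fairness.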

Comments Accepted as an oral at ICML: https://icml.cc/virtual/2026/poster/67142

详情
AI中文摘要

拥有对AI的科学理解意味着什么?模型不是静态对象:它们是由数据、目标、架构和优化动态塑造的时间演化过程的快照。然而,许多AI研究将模型视为固定的人工制品,分析训练后的行为,而不追问它们为何出现。这篇立场论文认为,AI科学必须超越事后修复,研究产生模型行为的训练动态。这样的科学应该支持逐渐增强的理解形式:从早期训练信号预测结果,在轨迹出错时进行干预,并最终设计出更可靠地产生期望属性的训练程序。缩放定律已使损失预测成为常规;挑战在于将这一成功扩展到能力、偏见、鲁棒性和安全相关行为。我们基于科学史和科学哲学阐述了此类理论的要求,考察了在机械可解释性、公平性、记忆化和简单性偏差方面的进展,并确定了具体的开放问题。

英文摘要

What would it mean to have a scientific understanding of AI? Models are not static objects: they are snapshots of time-evolving processes shaped by data, objectives, architectures, and optimization dynamics. Yet much of AI research treats models as fixed artifacts, analyzing behaviors after training rather than asking why they emerge. This position paper argues that a science of AI must move beyond post-hoc fixes and study the training dynamics that produce model behavior. Such a science should support progressively stronger forms of understanding: predicting outcomes from early training signals, intervening when trajectories go wrong, and ultimately designing training procedures that more reliably produce desired properties. Scaling laws have made prediction routine for loss; the challenge is extending this success to capabilities, biases, robustness, and safety-relevant behaviors. We articulate requirements for such theories grounded in the history and philosophy of science, examine progress in mechanistic interpretability, fairness, memorization, and simplicity bias, and identify concrete open problems.

2606.06573 2026-06-08 physics.flu-dyn cs.CL cs.LG eess.SP 交叉投稿

Multiscale POD of Transformer Attention Fields: Scale-Selective Analysis via Morlet Scalogram

Transformer注意力场的多尺度POD:基于Morlet尺度图的尺度选择性分析

Athanasios Zeris

Affiliations: Independent Researcher, Athens, Greece

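AI summary: Proposes scale-selective POD for analyzing transformer attention fields: a Morlet wavelet transform identifies temporal scales, POD extracts the energetically dominant modes at each scale, and the resulting modes reveal layer-dependent scale organisation without architectural modification or linguistic annotation.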

Comments 23 pages, 3 figures, 4 tables

详情
AI中文摘要

我们引入尺度选择性本征正交分解(POD)用于Transformer注意力场,受POD从湍流系综中提取能量主导模态的启发。Morlet连续小波变换识别文档系综中注意力滞后结构的主导时间尺度;然后POD从注意力场系综中提取每个尺度上的能量主导模态。得到的模态揭示了层依赖的尺度组织,早期层强调精细尺度,后期层转向较粗尺度。我们根据POD特征值衰减率定义谱集中指数,并经验性地表明该指数通过注意力场复杂度区分不同层。根据经典POD最优性定理,提取的模态最小化系综上的平均L2重构误差(定理1),为每层提供数据驱动的有效秩。该方法无需架构修改和语言标注:主导注意力模式仅从系综统计中涌现。湍流类比是结构性的而非物理性的:我们借用系综协方差和模态分析,而非流体动力学本身。

英文摘要

We introduce scale-selective Proper Orthogonal Decomposition (POD) for transformer attention fields, inspired by the use of POD for extracting energetically dominant modes from turbulent flow ensembles. The Morlet continuous wavelet transform identifies dominant temporal scales in the attention lag structure across a document ensemble; POD then extracts the energetically dominant modes at each scale from the ensemble of attention fields. The resulting modes reveal layer-dependent scale organisation, with early layers emphasising fine scales and later layers shifting toward coarser scales. We define a spectral concentration index from the POD eigenvalue decay rate and show empirically that it differentiates layers by their attention field complexity. By the classical POD optimality theorem, the extracted modes minimise the average L2 reconstruction error over the ensemble (Theorem 1), giving a data-driven effective rank for each layer. The method requires no architectural modification and no linguistic annotations: dominant attention patterns emerge from ensemble statistics alone. The turbulence analogy is structural rather than physical: we borrow ensemble covariance and modal analysis, not fluid dynamics itself.
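The POD optimality property that the abstract cites can be checked numerically with a few lines of NumPy. The sketch below shows plain snapshot POD via the singular value decomposition on random stand-in data; it does not include the Morlet scale selection, and the mean subtraction, the data shape, and the function name are assumptions made for illustration rather than details taken from the paper.

```python
import numpy as np


def pod(snapshots, rank):
    """Snapshot POD via SVD.

    snapshots: array of shape (n_samples, n_features), one flattened field per row.
    Returns the leading modes, the energy fraction of every mode, and the
    rank-limited reconstruction of the snapshots.
    """
    mean = snapshots.mean(axis=0)
    centered = snapshots - mean                 # assumption: analyse fluctuations
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    energy = s**2 / np.sum(s**2)                # normalised POD eigenvalues
    modes = vt[:rank]                           # orthonormal rows
    recon = mean + (centered @ modes.T) @ modes
    return modes, energy, recon


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))                    # stand-in for an ensemble of fields
modes, energy, recon = pod(X, rank=2)

# Squared reconstruction error equals the energy in the discarded modes.
residual = np.sum((X - recon) ** 2)
s_all = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(residual, np.sum(s_all[2:] ** 2))
```

The two printed numbers agree up to floating-point error, which is the Eckart-Young statement that no other rank-2 basis gives a smaller L2 reconstruction error on this ensemble.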

2606.06740 2026-06-08 cs.SD cs.AI cs.CL 交叉投稿

Multilingual Multi-Speaker Unit Vocoders: A Systematic Analysis of Discrete Speech Representations

多语言多说话人单元声码器:离散语音表示的系统分析

Naman Kothari, Arjun Gangwar, Adarsh Arigala, S Umesh

Affiliations: National Institute of Technology, Trichy; Indian Institute of Technology, Madras

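AI summary: Analyzes a BigVGAN-based unit vocoder for multilingual multi-speaker speech generation, finding that cluster size governs intelligibility, explicit speaker conditioning prevents identity collapse, and language supervision helps at lower cluster sizes.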

Comments 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

详情
AI中文摘要

通过k-means聚类自监督嵌入获得的离散语音单元纠缠了音素、说话人和语言信息,导致多语言多说话人语音生成中的说话人混合和跨语言干扰。尽管在音频大语言模型和语音到语音系统中使用日益增多,单元声码器仍然研究不足。我们分析了基于BigVGAN的单元声码器,涵盖四种印度语言。我们使用WER、说话人相似度和单元级指标研究了聚类大小与条件策略之间的相互作用。结果表明,聚类大小通过提高音素区分性来控制可懂度,而显式说话人条件对于防止身份崩溃不可或缺。语言监督主要在单元仍模糊的较小聚类大小时带来进一步收益。我们的分析显示,在较小库存时,不同语言中相似音素会坍缩到相同的聚类ID,而较大的聚类会逐渐将它们分离。

英文摘要

Discrete speech units obtained via k-means clustering of self-supervised embeddings entangle phonetic, speaker, and language information, causing speaker mixing and cross-lingual interference in multilingual multi-speaker speech generation. Despite growing use in audio LLMs and speech-to-speech systems, unit vocoders remain underexplored. We analyze a BigVGAN-based unit vocoder across four Indian languages. We study the interaction between cluster size and conditioning strategies using WER, speaker similarity, and unit-level metrics. Results show that cluster size governs intelligibility by improving phonetic discriminability, while explicit speaker conditioning is indispensable for preventing identity collapse. Language supervision yields further gains mainly at lower cluster sizes where units remain ambiguous. Our analysis shows that similar phonemes across languages collapse to the same cluster IDs at smaller inventories, with larger inventories progressively separating them.

2606.06741 2026-06-08 cs.AI cs.CL cs.LG 交叉投稿

OpenSkill: Open-World Self-Evolution for LLM Agents

OpenSkill: 面向LLM智能体的开放世界自我进化

Zhiling Yan, Dingjie Song, Hanrong Zhang, Wei Liang, Yuxuan Zhang, Yutong Dai, Lifang He, Philip S. Yu, Ran Xu, Xiang Li, Lichao Sun

Affiliations: Lehigh University; University of Illinois Chicago; University of British Columbia; Vector Institute; Salesforce AI Research; Massachusetts General Hospital and Harvard Medical School

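AI summary: Proposes OpenSkill, a framework in which agents bootstrap their skills and verification signals from open-world resources without target-task supervision, attaining the best automated pass rate across several benchmarks.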

Comments 20 pages, 4 figures and 8 tables. Code is available at https://github.com/OpenLAIR/OpenSkill

详情
AI中文摘要

自我进化智能体需要在部署后进行适应,但现有方法假设存在可用的学习循环,例如精心策划的技能、成功的轨迹或验证信号。真实的开放世界部署可能不提供这些,只提供一个任务提示。在这项工作中,我们研究开放世界自我进化,其中智能体必须从零开始构建其技能和自身的验证信号,使用开放世界资源但没有目标任务监督。我们提出OpenSkill,一个启动这个循环的框架:它从文档、代码库和网络中获取基础知识和验证锚点,将它们综合成可迁移的技能,并根据自建的虚拟任务(基于锚点而非目标答案)来优化这些技能。因此,开放世界既提供了要学习的知识,也提供了一个独立于监督的练习环境,目标任务监督保留用于最终评估。在三个基准和两个目标智能体上,OpenSkill在满足无监督约束的同时取得了最佳自动通过率。分析表明,其技能无需特定模型适应即可跨模型迁移,并且其自建验证器与真实结果一致,尽管从未访问过这些结果。

英文摘要

Self-evolving agents require adaptation after deployment, but existing approaches assume a usable learning loop, such as curated skills, successful trajectories, or verifier signals. Real open-world deployments may provide none of these, offering only a task prompt. In this work, we study open-world self-evolution, where an agent must build both its skills and its own verification signals from scratch, using open-world resources but no target-task supervision. We propose OpenSkill, a framework that bootstraps this loop: it acquires grounded knowledge and verification anchors from documentation, repositories, and the web, synthesizes them into transferable skills, and refines those skills against self-built virtual tasks grounded in the anchors rather than in target answers. The open world thus supplies both the knowledge to be learned and a supervision-independent practice environment, with target-task supervision reserved for final evaluation. Across three benchmarks and two target agents, OpenSkill attains the best automated pass rate while satisfying the no-supervision constraint. Analysis shows its skills transfer across models without model-specific adaptation, and its self-built verifier aligns with ground-truth outcomes despite never accessing them.

2606.06743 2026-06-08 cs.SD cs.AI cs.CL 交叉投稿

HybridCodec: Fast Dual-Stream, Semantically Enhanced Neural Audio Codec

HybridCodec: 快速双流、语义增强的神经音频编解码器

Arjun Gangwar, S Umesh

Affiliations: Indian Institute of Technology, Madras

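AI summary: Proposes HybridCodec, a unified neural audio codec that combines semantic distillation with a dual-stream architecture, achieving strong disentanglement, cross-lingual robustness, and a 3x speedup.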

Comments 5 pages, 5 tables, 1 figure, Accepted at Interspeech 2026

详情
AI中文摘要

随着多模态大语言模型的出现,神经音频编解码器作为语音分词器的流行度激增。具有语义和声学解耦的新编解码器架构已经出现。将语义信息引入编解码器模型有两种主要方法:一种是从SSL表示中将语义信息蒸馏到第一个RVQ层,另一种是维护语义和声学特征的独立流。我们提出HybridCodec,一种结合了两种范式的统一架构。它采用独立的语义和声学分枝,同时将SSL表示蒸馏到语义流中。这种设计确保了强解耦,而无需在推理期间使用SSL模型。HybridCodec在域内测试集上展示了优越的语义特化(RVQ-1)和有竞争力的重建(RVQ-all)。我们展示了其在域外和零样本跨语言设置中的鲁棒性,相比现有双流模型实现了3倍加速。

英文摘要

The popularity of neural audio codecs as speech tokenizers has surged with the advent of multimodal large language models. New codec architectures with semantic and acoustic disentanglement have emerged. There are two main approaches to introducing semantic information into codec models: one distills semantic information from SSL representations into the first RVQ layer, while the other maintains separate streams for semantic and acoustic features. We propose HybridCodec, a unified architecture that combines both paradigms. It employs separate semantic and acoustic branches while distilling SSL representations into the semantic stream. This design ensures strong disentanglement without requiring an SSL model during inference. HybridCodec shows superior semantic specialization (RVQ-1) on the in-domain test set and competitive reconstruction (RVQ-all). We demonstrate its robustness in out-of-domain and zero-shot cross-lingual settings, achieving a 3x speedup over existing dual-stream models.

2606.06754 2026-06-08 cs.MA cs.CL 交叉投稿

MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring

MADRAG: 基于检索增强生成的多智能体辩论用于免训练分析性论文评分

Ali Keramati, Shiyuan Zhou, Sharad Mehrotra, Mark Warschauer

Affiliations: University of California, Irvine

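AI summary: Proposes MADRAG, a training-free essay scoring framework that combines multi-agent debate among an Advocate, a Skeptic, and a Judge with retrieval of scored exemplars for calibration, approaching the performance of supervised systems.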

Comments 21 pages, 7 figures, 14 tables

详情
AI中文摘要

我们提出了MADRAG,一个无需训练的分析性论文评分框架,它结合了多智能体推理与检索增强的 grounding。与标准LLM-as-judge方法(容易产生偏差和不稳定评分)不同,MADRAG将评估分解为一个交互过程:倡导者识别优点,批评者指出缺点,法官综合他们的论点给出最终分数。关键的是,法官通过检索与评分标准对齐的示例进行增强,从而通过与已评分示例的比较实现校准。我们的结果表明,MADRAG显著优于基于提示的基线方法,同时在没有任务特定训练的情况下接近监督系统的性能。消融研究表明,检索驱动校准增益,而辩论改善了高层次特质的推理。我们的发现强调了结构化交互和外部记忆在可靠的基于LLM的评估中的互补作用。

英文摘要

We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score. Crucially, the Judge is augmented with rubric-aligned exemplar retrieval, enabling calibration through comparison with scored examples. Our results show that MADRAG significantly outperforms prompt-based baselines while approaching the performance of supervised systems without requiring task-specific training. Ablation studies demonstrate that retrieval drives calibration gains, while debate improves reasoning on higher-level traits. Our findings highlight the complementary roles of structured interaction and external memory in reliable LLM-based evaluation.

2606.07006 2026-06-08 cs.LG cs.CL 交叉投稿

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

RASFT: 用于推理的滚动自适应监督微调

Yongliang Miao, Fengyuan Liu, Wei Shi, Yanguang Liu, Fei Sun, Na Zou, Mengnan Du

Affiliations: The Chinese University of Hong Kong, Shenzhen; Shanghai Artificial Intelligence Laboratory; New Jersey Institute of Technology; Institute of Computing Technology, CAS

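AI summary: Proposes RASFT, which calibrates expert supervision by problem-level solvability estimated from on-policy rollouts: it strengthens guidance where the model struggles, relaxes imitation and incorporates self-generated trajectories where the model is already reliable, and constrains policy drift with a clipped inverse ratio, outperforming SFT and RL methods on several reasoning benchmarks.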

详情
AI中文摘要

监督微调(SFT)是一种通过模仿离线专家演示来使大型语言模型适应推理任务的流行方法,通常将单个专家轨迹视为目标行为。然而,推理并非简单的路径模仿:严格遵循一个演示解决方案可能会过度拟合表面形式并抑制模型自身的推理分布。我们提出了滚动自适应监督微调(RASFT),这是一种策略感知的SFT框架,它根据从验证的策略rollout中估计的问题级可解性来校准专家监督。对于每个问题,当当前策略困难时,RASFT加强专家指导,而当模型已经表现出可靠的推理行为时,放松严格模仿并纳入正确的自生成轨迹。为了保留有用的推理先验,RASFT进一步引入了冻结参考模型与当前策略之间的裁剪逆比,以约束过度的策略漂移。在六个数学推理基准和两个代码推理基准上的多个模型实验表明,RASFT在整体性能上优于SFT、SFT变体和代表性RL方法。代码可在 https://github.com/zjd1sq/RASFT 获取。

英文摘要

Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model's own reasoning distribution. We propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT achieves better overall performance than SFT, SFT variants, and representative RL methods. The code is available at https://github.com/zjd1sq/RASFT.

2606.07017 2026-06-08 cs.AI cs.CL cs.ET 交叉投稿

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

基础模型智能体的仿真到现实差距:统一MDP视角

Xiaoou Liu, Tiejin Chen, Weibo Li, Xiyang Hu, Hua Wei

Affiliations: Arizona State University

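AI summary: Proposes formalizing the evaluation and training gap of foundation model agents as a classical sim-to-real problem, organized around the four MDP elements (observation, action, transition, reward), and advocates adopting established solutions such as domain randomization.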

Comments 7 pages, 2 figures, 2 tables. Accepted by KDD 2026 Blue Sky Ideas Track

详情
AI中文摘要

基础模型智能体越来越多地被部署用于现实世界决策,但受到仿真到现实差距的影响。虽然机器人学和经典控制有成熟的框架来解决这一差距,但基础模型社区将智能体鲁棒性视为一个全新的现象。我们的论文提出将基础模型智能体评估和训练差距形式化为一个经典的仿真到现实问题,完全围绕马尔可夫决策过程的四个要素构建,包括观测、动作、转移和奖励。在本文中,我们设定了一个全面的研究议程,将经典差异转化为基础模型领域,并倡导采用域随机化等成熟解决方案。我们提供了具体示例,例如多语言工具调用,以展示尽管语义意图正确,但观测空间差距如何导致操作无效的动作。最终,这一议程旨在推动范式转变,产生统一的词汇和标准化的压力测试基准,以培养新一代高度可信的智能体,用于可靠的现实世界应用。

英文摘要

Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as multilingual tool calling, to demonstrate how severe observation space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.

2606.07030 2026-06-08 cs.SD cs.AI cs.CL cs.LG 交叉投稿

Phonetic Error Analysis of Raw Waveform Acoustic Models

原始波形声学模型的音素错误分析

Erfan Loweimi, Zhengjun Yue, Andrea Carmantini, Zoran Cvetkovic, Steve Renals, Peter Bell

Affiliations: Centre for Speech Technology Research (CSTR), University of Edinburgh, UK; Cisco, UK; SLAI & CUHK-SZ, China; King's College London, UK

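AI summary: By decomposing the phone error rate and analyzing confusion matrices, finds that BLSTM layers benefit transition-dependent classes most, that WSJ transfer learning improves consonants roughly three times more than vowels, and that confusion patterns reflect inherent phonetic similarities.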

Comments Interspeech 2026

详情
AI中文摘要

我们分析了原始波形声学模型在TIMIT音素识别中的错误模式,超越了整体音素错误率(PER)。将PER按三个广义语音类别(BPC)分解,并从替换错误构建混淆矩阵。我们的模型将参数化(SincNet, Sinc2Net)或非参数化CNN与双向LSTM相结合,在开发/测试集上分别达到13.9%/15.3%的PER,这是原始波形模型在TIMIT上的最佳报告结果。来自WSJ的迁移学习将PER降至11.3%/12.3%,超越了Filterbank基线。每个BPC的分析表明,BLSTM层对过渡依赖类提升最大,而WSJ迁移学习对辅音的改进约是元音的三倍。原始波形和Filterbank系统的混淆模式一致,表明主要混淆反映了固有的音素相似性。

英文摘要

We analyse error patterns of raw waveform acoustic models on TIMIT phone recognition beyond the overall phone error rate (PER). PER is decomposed across three broad phonetic class (BPC) categorisations, and confusion matrices are constructed from substitution errors. Our models combine parametric (SincNet, Sinc2Net) or non-parametric CNNs with Bidirectional LSTMs, achieving 13.9%/15.3% PER on Dev/Test, the best reported results for raw waveform models on TIMIT. Transfer learning from WSJ reduces PER to 11.3%/12.3%, surpassing the Filterbank baseline. Per-BPC analysis reveals that BLSTM layers benefit transition-dependent classes most, while WSJ transfer learning improves consonants roughly three times more than vowels. Confusion patterns are consistent across raw waveform and Filterbank systems, indicating that the dominant confusions reflect inherent phonetic similarities.

2606.07057 2026-06-08 cs.IR cs.CL 交叉投稿

Meaning in Order, Order in Meaning: Semantic R-precision for Keyphrase Evaluation

意义中的顺序,顺序中的意义:用于关键词评估的语义R-精度

Shamira Venturini, Steffen Kinkel

发表机构 * ILIN - Institute for Learning and Innovation in Networks, Karlsruhe University of Applied Sciences(学习与网络创新研究所,卡尔斯鲁厄应用科学大学) Karlsruhe Institute of Technology(卡尔斯鲁厄工业大学)

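AI summary Proposes Semantic R-Precision (SemR-p), a metric that combines semantic similarity with rank awareness to evaluate automatically generated keyphrases from a human-centric perspective, complementing traditional lexical and semantic matching metrics.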

详情
AI中文摘要

评估自动生成关键词的质量仍然是一个复杂的挑战。传统指标要么依赖精确词汇匹配,要么考虑语义相似性但忽略预测排序,两者都与人类判断信息性和相关性的方式不一致。我们引入了语义R-精度(SemR-p),一种新颖的评估指标,将语义相似性整合到排序感知的R-精度框架中。SemR-p从以人为中心的角度设计,受信息检索指标启发,奖励输出列表中早期出现的语义相关关键词。我们进行了广泛分析,评估其语义敏感性、排序感知能力以及跨模型和数据集的区分能力。结果表明,SemR-p为评估关键词预测提供了补充视角,有助于更好地反映以用户为中心的相关性概念,与传统的词汇和语义匹配指标相辅相成。

英文摘要

Evaluating the quality of automatically generated keyphrases remains a complex challenge. Traditional metrics either rely on exact lexical matching or consider semantic similarity while ignoring prediction ranking, both of which misalign with how humans judge informativeness and relevance. We introduce Semantic R-Precision (SemR-p), a novel evaluation metric that integrates semantic similarity into the rank-aware R-Precision framework. Designed from a human-centric perspective and inspired by Information Retrieval metrics, SemR-p rewards semantically relevant keyphrases that appear early in the output list. We conducted extensive analyses to assess its semantic sensitivity, ranking awareness, and discriminative power across models and datasets. The results suggest that SemR-p offers a complementary lens for evaluating keyphrase predictions, helping to better reflect user-centred notions of relevance alongside traditional lexical and semantic matching metrics.
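
To make the idea of a rank-aware, semantically soft R-Precision concrete, here is one plausible form. It is a guess for illustration, not the paper's definition of SemR-p: the token-overlap `jaccard` function stands in for an embedding-based similarity, and the greedy one-to-one matching and the `threshold` value are assumptions.

```python
def jaccard(a, b):
    """Token-overlap similarity; a stand-in for an embedding-based similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def semantic_r_precision(predictions, references, sim=jaccard, threshold=0.5):
    """Score the top-R predictions (R = number of references) with soft matches."""
    r = len(references)
    unmatched = list(references)
    score = 0.0
    for pred in predictions[:r]:  # only the first R ranked predictions count
        if not unmatched:
            break
        best = max(unmatched, key=lambda ref: sim(pred, ref))
        best_sim = sim(pred, best)
        if best_sim >= threshold:  # hypothetical cut-off for "semantically relevant"
            score += best_sim
            unmatched.remove(best)  # each reference can be matched once
    return score / r
```

With the references `["neural network", "speech recognition"]`, a relevant prediction ranked third contributes nothing, because only the top two positions are scored; moving it into the top two raises the score.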

2606.07116 2026-06-08 cs.LG cs.AI cs.CL 交叉投稿

OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

OffQ:通过偏移驯服LLM量化中的结构化异常值

Haoqi Wang, Lorenz K. Mueller, Jiawei Zhuang, Mathieu Salzmann, Lukas Cavigelli

发表机构 * School of Computer and Communication Sciences, EPFL, Switzerland(瑞士联邦理工学院计算机与通信科学学院) Huawei, Switzerland(华为公司) Swiss Data Science Center, ETHZ & EPFL, Switzerland(瑞士数据科学中心,苏黎世联邦理工学院与联邦理工学院)

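AI summary Proposes OffQ, which identifies an outlier subspace with top-1 PCA, concentrates the outliers into a single channel via rotation, and converts that channel into a shared offset, enabling uniform low-bit quantization of LLMs and improving accuracy under W4A4KV4.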

详情
AI中文摘要

低比特量化已被广泛采用,通过显著降低计算成本和内存使用来加速大型语言模型(LLM)的推理。然而,激活异常值对有效量化构成了重大挑战,常常导致显著的性能下降。在本文中,我们介绍了OffQ,一种通过新颖的偏移机制来缓解低比特量化中激活异常值的方法。具体来说,OffQ首先使用提出的top-1 PCA识别激活中的低维异常值子空间,然后通过旋转将高幅度激活集中到1个通道中。OffQ随后通过将其幅度转换为共享偏移来吸收这个集中的异常值通道,从而降低激活的标准差。这种偏移策略使得使用部署友好的均匀网格和均匀精度量化对LLM进行有效的W4A4KV4量化成为可能。在多种LLM架构和基准上的广泛实验表明,OffQ优于最先进的基线,在保持低比特效率的同时持续提高模型精度。

英文摘要

Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization, often leading to notable performance degradation. In this paper, we introduce OffQ, a method designed to mitigate activation outliers in low-bit quantization through a novel offsetting mechanism. Specifically, OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into 1 channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization. Extensive experiments across diverse LLM architectures and benchmarks demonstrate that OffQ outperforms state-of-the-art baselines, consistently improving model accuracy while preserving low-bit efficiency.

2606.07172 2026-06-08 cs.CV cs.AI cs.CL cs.LG 交叉投稿

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

文本监督增强视觉-语言模型中的地理空间表示

Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha

发表机构 * University of São Paulo(圣保罗大学) National University of Singapore(新加坡国立大学)

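AI summary Studies the geospatial representations of vision-only, vision-language, and multimodal models, finding that textual supervision improves spatial encoding and pointing to multimodal learning as a direction for geospatial AI.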

Comments Accepted at ICML 2026

详情
AI中文摘要

地理空间理解是机器学习系统在图像地理定位和空间推理等任务中一个关键但尚未充分探索的维度。在这项工作中,我们分析了三种模型家族获得的地理空间表示:纯视觉架构(如ViT)、视觉-语言模型(如CLIP)和大规模多模态基础模型(如LLaVA、Qwen和Gemma)。通过评估包括人物、地标和日常物体在内的图像聚类(根据可定位程度分组),我们揭示了空间准确性的系统性差距,并表明文本监督增强了地理空间表示的学习。我们的发现表明语言作为编码空间上下文的有效补充模态,以及多模态学习作为推进地理空间AI的关键方向。

英文摘要

Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped based on the degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings suggest the role of language as an effective complementary modality for encoding spatial context and multimodal learning as a key direction for advancing geospatial AI.

2606.07229 2026-06-08 cs.SD cs.CL cs.MM 交叉投稿

MMAE: A Massive Multitask Audio Editing Benchmark

MMAE:大规模多任务音频编辑基准

Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Nanyang Technological University(南洋理工大学) Hunyuan Team, Tencent(腾讯 Hunyuan 团队) Tianjin University(天津大学) Fudan University(复旦大学)

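AI summary Introduces MMAE, the first comprehensive benchmark for general-purpose instruction-based audio editing, covering 7 audio modalities, 6 levels of task complexity, and 8 operation types; its 2,000 samples and rubric-based evaluation framework expose serious weaknesses of current models in precise execution and structural robustness.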

Comments Open-Source at https://github.com/ddlBoJack/MMAE

详情
AI中文摘要

我们引入了MMAE,一个大规模多任务音频编辑基准,作为首个专为通用指令式音频编辑设计的综合评估测试平台。受智能创作趋势的推动,交互式编辑已从视觉领域(如图像领域的Nano-banana 2和视频领域的Gemini-Omni)迅速扩展到音频领域。然而,当前的评估基础设施严重滞后,仍然高度碎片化且局限于特定子领域或基本操作。与现有范围有限的基准不同,MMAE扩展到广泛的实际场景,涵盖7种不同的音频模态,包括声音、语音、音乐及其混合。此外,我们建立了一个全面的分类体系,涵盖6级任务复杂度(从基本修改到多跳推理和多轮编辑)、2级粒度以及8种不同的操作类型。通过人机协作精心策划,MMAE包含2000个高保真样本,并配以开创性的基于评分标准的评估框架。通过将自由形式任务分解为17,741个可验证的标准,这种稳健的基于评分标准的范式能够对指令遵循和上下文一致性进行精确的多维评估。我们对领先模型的广泛评估表明,当前系统远未实现可靠的编辑。令人惊讶的是,精确匹配率(EMR)始终低于5%,在复杂的混合模态任务中更是骤降至绝对的0%,暴露了精确执行和结构鲁棒性方面的关键瓶颈。我们希望MMAE能够成为智能创作社区未来进步的催化剂,提供清晰的诊断路线图,并为下一代音频编辑系统建立标准化、持久的评估范式。

英文摘要

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
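
The gap between per-criterion success and an exact match can be illustrated with a small sketch. It assumes, without confirmation from the abstract, that a sample counts as an exact match only when every one of its rubric criteria is satisfied; the function names are hypothetical.

```python
def exact_match_rate(rubric_results):
    """rubric_results holds one list of booleans (criterion verdicts) per sample."""
    return sum(all(sample) for sample in rubric_results) / len(rubric_results)

def criterion_pass_rate(rubric_results):
    """Fraction of individual criteria satisfied across all samples."""
    flat = [verdict for sample in rubric_results for verdict in sample]
    return sum(flat) / len(flat)
```

Under this reading, a model can satisfy most individual criteria and still score a low exact match rate, since a single failed criterion disqualifies the whole sample.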

2606.07297 2026-06-08 cs.SE cs.CL 交叉投稿

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

SWE-Explore: 基准测试编码智能体如何探索代码仓库

Shaoqiu Zhang, Yuhang Wang, Jialiang Liang, Yuling Shi, Wenhao Zeng, Maoquan Wang, Shilin He, Ningyuan Xu, Siyu Ye, Kai Cai, Xiaodong Gu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Xinjiang University(新疆大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Independent Researcher(独立研究者) The Chinese University of Hong Kong(香港中文大学)

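AI summary Introduces the SWE-Explore benchmark, which measures repository exploration by asking a coding agent, given a repository and an issue, to return a ranked list of relevant code regions; it covers 848 issues across 10 programming languages and 203 repositories.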

Comments 20 pages, 5 figures

详情
AI中文摘要

仓库级编码基准(如SWE-bench)推动了编码智能体能力的快速提升。然而,它们通常将编码任务视为一个整体的二元预测问题(例如,已解决或未解决),忽略了细粒度的智能体能力,如仓库理解、上下文检索、代码定位和错误诊断。在本文中,我们介绍了SWE-Explore,一个隔离评估仓库探索(编码智能体的关键能力)的基准。给定一个仓库和一个问题,SWE-Explore要求探索者在固定的行预算下返回一个相关的代码区域排名列表。SWE-Explore涵盖了10种编程语言和203个开源仓库中的848个问题。对于每个实例,我们从成功解决同一问题的独立智能体轨迹中推导出行级真实标签,提炼出它们的解决方案路径实际参考的具体代码区域。我们从覆盖度、排名和上下文效率维度评估探索,表明这些指标强烈跟踪下游修复行为。在一系列广泛的检索方法、通用编码智能体和专用定位器中,我们发现智能体探索者明显优于经典检索。虽然文件级定位对于现代方法已经很强,但行级覆盖度和高效排名仍然是区分最先进探索者的关键轴。

英文摘要

Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.
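
A line-level coverage score under a fixed line budget might be computed along the following lines. The region format `(path, start, end)`, the truncation rule, and the function names are assumptions for illustration, not the benchmark's actual scoring code.

```python
def lines_within_budget(ranked_regions, budget):
    """Expand ranked (path, start, end) regions into (path, line) pairs, stopping at the budget."""
    selected = set()
    for path, start, end in ranked_regions:
        for line in range(start, end + 1):
            if len(selected) >= budget:
                return selected  # budget exhausted: lower-ranked lines are dropped
            selected.add((path, line))
    return selected

def line_coverage(ranked_regions, gold_lines, budget):
    """Fraction of ground-truth lines that the explorer returned within the budget."""
    selected = lines_within_budget(ranked_regions, budget)
    return len(selected & gold_lines) / len(gold_lines)
```

Because regions are consumed in rank order, an explorer that places irrelevant regions first spends its budget before reaching the relevant lines, which is why ranking quality and coverage interact.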

2606.07309 2026-06-08 cs.SD cs.AI cs.CL 交叉投稿

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

语音情感识别中音频语言模型的声学线索对齐

Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

发表机构 * DFG's Reinhart Koselleck project(德国科研基金Reinhart Koselleck项目) EU H2020 project(欧盟H2020项目)

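AI summary Studies whether audio language models use explicit acoustic cues in a grounded way, deriving six interpretable acoustic concept tokens from eGeMAPS features; aligned tokens improve UAR while perturbed tokens reduce performance, and the models are sensitive to the symbolic cues yet remain partly anchored to the audio signal.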

Comments 6 pages, 3 figures, 3 tables

详情
AI中文摘要

指令跟随音频语言模型(ALMs)可以通过显式的声学线索进行增强,但在原始音频已经可用的情况下,这些线索是否以接地的方式被使用仍不清楚。我们通过从标准化的eGeMAPS副语言特征集中推导出六个可解释的声学概念标记来研究语音情感识别(SER)中的这一问题。这些标记总结了能量、音高、动态、亮度、共振峰和语音质量,并被附加到文本提示中,同时保持音频输入不变。在广泛使用的FAU-Aibo和IEMOCAP基准测试中,对齐的标记提高了未加权平均召回率(UAR),而打乱、冲突或损坏的标记相对于对齐标记降低了性能,并将混淆转向中性。重要的是,在强标记扰动下预测不会崩溃,这表明模型对符号线索通道敏感,但部分仍锚定于音频信号。我们认为,仅标记干预提供了一种实用的方法来探测基于ALM的情感计算中音频接地线索的使用、鲁棒性和可解释性。

英文摘要

Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.

2606.07356 2026-06-08 cs.SD cs.CL 交叉投稿

DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

DirectAudioEdit: 基于扩散预测对比的无反演文本引导音频编辑

Zhengkun Ge, Xiaoqian Liu, Haoran Zhang, Yuan Ge, Junxiang Zhang, Zhengtao Yu, Jingbo Zhu, Tong Xiao

发表机构 * School of Computer Science and Engineering, Northeastern University, Shenyang, China(东北大学计算机科学与工程学院) Kunming University of Science and Technology(昆明理工大学) NiuTrans Research, Shenyang, China(新译研究)

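AI summary Proposes DirectAudioEdit, a training-free and inversion-free method for text-guided audio editing that builds the editing path through diffusion prediction contrast, reducing FAD and KL by more than 15% on music and event-level benchmarks and speeding up editing by up to 64.5%.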

详情
AI中文摘要

文本引导音频编辑旨在修改语言指定的声学内容,同时保留与编辑无关的源组件。现有的无训练方法通常依赖于基于反演的编辑。虽然无反演编辑因其减少计算开销和重构误差而具有吸引力,但在音频编辑中仍基本未被探索。关键挑战是通过扩散去噪动力学构建源到目标的编辑路径。在本文中,我们介绍了DirectAudioEdit,这是首次尝试开发一种无需训练和反演的音频编辑方法。在两个骨干网络上的音乐和事件级基准实验表明,与DDPM反演相比,DirectAudioEdit将宏观平均FAD和KL分别降低了15.9%和15.8%,同时实现了高达64.5%的编辑加速。

英文摘要

Text-guided audio editing aims to modify the language-specified acoustic content while preserving edit-irrelevant source components. Existing training-free methods typically rely on inversion-based editing. While inversion-free editing is appealing as it decreases computational overhead and reconstruction errors, it remains largely unexplored for audio editing. The key challenge is to construct a source-to-target editing path through diffusion denoising dynamics. In this paper, we introduce DirectAudioEdit, the first attempt to develop a training-free and inversion-free method for audio editing. Experiments on music and event-level benchmarks across two backbones show that DirectAudioEdit reduces macro-averaged FAD and KL by 15.9% and 15.8% compared with DDPM inversion, while achieving up to 64.5% editing speedup.

2606.07451 2026-06-08 cs.CV cs.AI cs.CL cs.LG 交叉投稿

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI: 基于稀疏自编码器的文本条件视觉表示编辑以改进视觉-语言对齐

Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele

发表机构 * Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany(马克斯·普朗克研究所信息学院,萨尔兰信息学院,德国萨尔布吕肯) Department of Language Science and Technology, Saarland University, Saarbrücken, Germany(语言科学与技术系,萨尔兰大学,德国萨尔布吕肯)

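AI summary Proposes TEVI, a framework that disentangles image embeddings with sparse autoencoders and selectively reconstructs them through a text-conditioned masking module, improving image-text alignment in vision-language models such as CLIP and retrieval performance on several benchmarks.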

Comments 20 pages, 13 figures, 14 tables

详情
AI中文摘要

视觉-语言模型(如CLIP)由于共享图像-文本嵌入空间,对多种任务非常有用。尽管如此,图像和文本嵌入往往对齐不佳,影响下游性能。最近的研究表明,这可以归因于信息不平衡:图像包含的信息比其标题描述的更多。在这项工作中,我们提出了TEVI,一个利用标题作为信号来决定从图像嵌入中保留哪些信息的框架。具体来说,我们使用稀疏自编码器来解耦图像嵌入,并训练一个掩码模块,根据给定的标题选择性重构嵌入。在具有合成标题的受控设置中,我们展示了TEVI在保留标题描述的属性同时丢弃其他属性方面的有效性。通过将TEVI应用于在自然图像上训练的CLIP模型,我们进一步在粗粒度短标题(MS COCO, Flickr)和细粒度长标题(IIW, DOCCI)基准上实现了改进的检索性能,在更丰富的标题上获得更强的增益,并在RoCOCO基准上提高了鲁棒性。

英文摘要

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.

2606.07512 2026-06-08 cs.CV cs.AI cs.CL 交叉投稿

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

MemDreamer: 通过分层图记忆和智能体检索机制解耦感知与推理以实现长视频理解

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen

发表机构 * Ant Group(蚂蚁集团) Zhejiang University(浙江大学) Central South University(中南大学) HKUST(GZ)(香港科技大学(广州))

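AI summary Proposes MemDreamer, a framework that decouples perception and reasoning through a hierarchical graph memory and agentic retrieval, turning long-video understanding into an agentic exploration process; it achieves SOTA on four benchmarks while using a reasoning context window of only 2% of the full context and gaining 12.5 accuracy points.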

详情
AI中文摘要

当前的视觉-语言模型在处理数小时长的视频时面临困难,因为处理完整长度的视觉序列会导致令牌爆炸和注意力稀释。为了克服这一问题,我们引入了MemDreamer,将感知与推理解耦,将长视频理解转化为智能体探索过程。作为一个即插即用的框架,它增量式地流式传输视频以构建分层图记忆,这是一种自顶向下的三层架构,用于语义抽象,并由一个捕获时空和因果关系的基础图锚定。在推理过程中,推理模型采用智能体工具增强的检索,通过观察-推理-行动循环导航层次结构、搜索节点和遍历逻辑边。实验表明,MemDreamer在四个主流基准上取得了最先进的结果,将人类专家的差距缩小到仅3.7个百分点。它将推理上下文窗口限制在全量上下文的仅2%,同时提供了12.5个百分点的绝对准确率提升。此外,统计分析揭示了VLM在逻辑推理和长视频理解基准上的性能之间存在强正线性相关,将智能体能力扩展确立为多模态理解的新范式。

英文摘要

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

2601.12359 2026-06-08 cs.CR cs.AI cs.CL 交叉投稿

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

零样本嵌入漂移检测:一种轻量级防御对抗提示注入的LLM方法

Anirudh Sekar, Mrinal Agarwal, Rachel Sharma, Akitsugu Tanaka, Jasmine Zhang, Arjun Damerla, Kevin Zhu

发表机构 * Algoverse AI Research(Algoverse AI研究院) Berkeley(伯克利大学)

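AI summary Proposes ZEDD, which detects direct and indirect prompt injections by quantifying semantic shifts in embedding space between benign and suspect inputs. It needs no access to model internals or prior knowledge of attack types, has low engineering overhead, deploys across diverse LLM architectures, and exceeds 93% accuracy.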

Comments Accepted to NeurIPS 2025 Lock-LLM Workshop

详情
AI中文摘要

提示注入攻击已成为LLM应用中的日益严重漏洞,其中对抗性提示利用电子邮件或用户生成内容等间接输入渠道绕过对齐保护措施,导致有害或意外输出。尽管对齐技术有所进步,但最先进的LLM仍广泛易受对抗性提示攻击,凸显了需要稳健、高效且可推广的检测机制的紧迫性。本文提出零样本嵌入漂移检测(ZEDD),一种轻量级、低工程开销的框架,通过量化嵌入空间中良性与可疑输入之间的语义变化,识别直接和间接提示注入尝试。ZEDD无需访问模型内部、先验攻击类型知识或任务特定重训练,可高效地在多种LLM架构上进行零样本部署。我们的方法使用对抗性清洁提示对,并通过余弦相似度测量嵌入漂移,以捕捉现实世界注入攻击中的细微对抗性操纵。为确保评估的鲁棒性,我们编纂并重新标注了涵盖五个注入类别的综合LLMail-Inject数据集。广泛实验表明,嵌入漂移是一种稳健且可转移的信号,优于传统方法在检测准确性和操作效率方面。在Llama 3、Qwen 2和Mistral等模型架构上,分类准确率超过93%,误报率低于3%,我们的方法提供了一种轻量级、可扩展的防御层,可整合到现有LLM流程中,填补了保护LLM系统以抵御适应性对抗威胁的关键空白。

英文摘要

Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful or unintended outputs. Despite advances in alignment, even state-of-the-art LLMs remain broadly vulnerable to adversarial prompts, underscoring the urgent need for robust, productive, and generalizable detection mechanisms beyond inefficient, model-specific patches. In this work, we propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts by quantifying semantic shifts in embedding space between benign and suspect inputs. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, enabling efficient zero-shot deployment across diverse LLM architectures. Our method uses adversarial-clean prompt pairs and measures embedding drift via cosine similarity to capture subtle adversarial manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources. Extensive experiments demonstrate that embedding drift is a robust and transferable signal, outperforming traditional methods in detection accuracy and operational efficiency. With greater than 93% accuracy in classifying prompt injections across model architectures like Llama 3, Qwen 2, and Mistral and a false positive rate of <3%, our approach offers a lightweight, scalable defense layer that integrates into existing LLM pipelines, addressing a critical gap in securing LLM-powered systems to withstand adaptive adversarial threats.
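
The core measurement described here, embedding drift between a clean and a suspect input via cosine similarity, can be sketched in a few lines. The vectors below would come from whatever embedding model is in use; the `threshold` value and the function names are hypothetical, and the decision rule is an assumption rather than ZEDD's actual classifier.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def embedding_drift(clean_vec, suspect_vec):
    """Drift is taken here as 1 minus cosine similarity: 0 means no semantic shift."""
    return 1.0 - cosine_similarity(clean_vec, suspect_vec)

def flag_injection(clean_vec, suspect_vec, threshold=0.2):
    """Flag the suspect input when its drift exceeds a (hypothetical) threshold."""
    return embedding_drift(clean_vec, suspect_vec) > threshold
```

In practice the threshold would have to be calibrated on labelled clean and injected pairs; the fixed value above only keeps the example self-contained.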

2505.11470 2026-06-08 cs.CL 版本更新

Reference-Free Evaluation of Taxonomies

无参考评价的层次分类体系

Pascal Wullschleger, Majid Zarharan, Donnacha Daly, Marc Pouly, Jennifer Foster

发表机构 * Hamilton Institute, Maynooth University, Ireland(爱尔兰梅诺特大学哈密尔顿研究所) School of Computing, Dublin City University, Ireland(爱尔兰都柏林城市大学计算学院) Lucerne School of Computer Science and IT, Switzerland(瑞士卢塞恩计算机科学与信息技术学院)

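AI summary Proposes two reference-free metrics for taxonomy quality: a robustness metric based on the correlation between semantic and taxonomic similarity, and a logical adequacy metric based on Natural Language Inference; on five taxonomies both correlate well with F1 against ground-truth taxonomies and can predict downstream hierarchical classification performance.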

详情
AI中文摘要

我们引入了两种无参考指标,用于在缺乏标签的情况下评估层次分类体系的质量。第一个指标通过计算语义相似性与分类相似性之间的相关性来评估鲁棒性,解决了现有指标未考虑的错误类型。第二个指标使用自然语言推理来评估逻辑充分性。这两个指标在五个层次分类体系上进行了测试,结果显示它们与真实层次分类体系的F1值高度相关。我们进一步证明,当与标签层次结构一起使用时,我们的指标可以预测层次分类中的下游性能。

英文摘要

We introduce two reference-free metrics for quality evaluation of taxonomies in the absence of labels. The first metric evaluates robustness by calculating the correlation between semantic and taxonomic similarity, addressing error types not considered by existing metrics. The second uses Natural Language Inference to assess logical adequacy. Both metrics are tested on five taxonomies and are shown to correlate well with F1 against ground truth taxonomies. We further demonstrate that our metrics can predict downstream performance in hierarchical classification when used with label hierarchies.
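
The first metric correlates semantic similarity with taxonomic similarity. The sketch below shows that computation under explicit assumptions: the taxonomy is a single-rooted tree given as a child-to-parent dictionary, taxonomic similarity is taken as the inverse of tree distance, and Pearson correlation is used. None of these choices is confirmed by the abstract, and `semantic_sim` is a placeholder for an embedding-based similarity.

```python
import math
from itertools import combinations

def ancestors(node, parent):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def taxonomic_similarity(a, b, parent):
    """Inverse tree distance through the lowest common ancestor (an assumed definition)."""
    pa, pb = ancestors(a, parent), ancestors(b, parent)
    common = next(x for x in pa if x in pb)
    return 1.0 / (1.0 + pa.index(common) + pb.index(common))

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def robustness(parent, semantic_sim):
    """Correlation between taxonomic and semantic similarity over all node pairs."""
    nodes = sorted(set(parent) | set(parent.values()))
    pairs = list(combinations(nodes, 2))
    tax = [taxonomic_similarity(a, b, parent) for a, b in pairs]
    sem = [semantic_sim(a, b) for a, b in pairs]
    return pearson(tax, sem)
```

A taxonomy whose structure mirrors the semantic similarities scores near 1, while one that places semantically close concepts far apart scores lower, with no reference taxonomy needed.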

2507.06419 2026-06-08 cs.CL 版本更新

Teach a Reward Model to Correct Itself: Reward Guided Adversarial Failure Discovery for Robust Reward Modeling

教会奖励模型自我修正:奖励引导的对抗性失败发现以实现鲁棒奖励建模

Pankayaraj Pathmanathan, Furong Huang

发表机构 * University of Maryland College Park(马里兰大学College Park分校) Capital One

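AI summary Proposes REFORM, a framework that discovers reward model failure modes through reward-guided controlled decoding and uses the resulting adversarial examples for self-improvement, increasing robustness without sacrificing reward quality.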

详情
Journal ref ACL 2026 Main Conference [Oral]
AI中文摘要

奖励建模(RM)通过捕捉人类偏好来对齐大型语言模型(LLM),越来越多地用于模型微调、响应过滤和排序等任务。然而,由于人类偏好的固有复杂性和可用数据集的有限覆盖,奖励模型在分布偏移或对抗性扰动下经常失败。现有的识别此类失败模式的方法通常依赖于关于偏好分布或失败属性的先验知识,限制了它们在现实场景中的实用性,因为此类信息不可用。在这项工作中,我们提出了一种可处理的、与偏好分布无关的方法,通过奖励引导的受控解码来发现奖励模型的失败模式。在此基础上,我们引入了REFORM,一个自我改进的奖励建模框架,通过使用奖励模型本身来指导生成错误评分的响应,从而增强鲁棒性。这些对抗性示例随后用于扩充训练数据并修补奖励模型的失调行为。我们在两个广泛使用的偏好数据集Anthropic Helpful Harmless (HH)和PKU Beavertails上评估了REFORM,并证明它在不牺牲奖励质量的情况下显著提高了鲁棒性。值得注意的是,REFORM在直接评估和下游策略训练中均保持了性能,并通过去除虚假相关性进一步提高了对齐质量。

英文摘要

Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution-agnostic method for discovering reward model failure modes via reward-guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model's misaligned behavior. We evaluate REFORM on two widely used preference datasets, Anthropic Helpful Harmless (HH) and PKU Beavertails, and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.

2508.03668 2026-06-08 cs.CL 版本更新

CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction

CTR-Sink:用于点击率预测的语言模型中的注意力汇聚点

Zixuan Li, Binzong Geng, Jing Xiong, Yong He, Yuxuan Hu, Jian Chen, Dingwei Chen, Xiyu Chang, Ngai Wong, Liang Zhang, Linjian Mo, Chengming Li, Chuan Yuan, Zhenan Sun

Affiliations * NLPR, Institute of Automation, Chinese Academy of Sciences; Ant Group; The University of Hong Kong; City University of Hong Kong; Sun Yat-sen University; Shenzhen MSU-BIT University

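AI summary Addresses the semantic fragmentation caused by the structural gap between user behavior sequences and LM pre-training text by proposing CTR-Sink, which introduces behavior-level attention sinks and dynamically regulates attention aggregation to improve click-through rate prediction.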

详情
AI中文摘要

点击率(CTR)预测是推荐系统中的核心任务,利用历史行为数据估计用户点击可能性。将用户行为序列建模为文本以利用语言模型(LM)进行该任务的方法,由于LM强大的语义理解和上下文建模能力而受到关注。然而,存在一个关键的结构性差距:用户行为序列由离散的动作组成,这些动作由语义上空的分离符连接,与LM预训练中的连贯自然语言有根本不同。这种不匹配导致语义碎片化,即LM的注意力分散在无关的标记上,而不是集中在有意义的行为边界和行为间关系上,从而降低了预测性能。为了解决这个问题,我们提出了$\textit{CTR-Sink}$,一种新颖的框架,引入了针对推荐场景定制的行为级注意力汇聚点。受注意力汇聚点理论的启发,它构建了注意力聚焦汇聚点,并通过外部信息动态调节注意力聚合。具体来说,我们在连续行为之间插入汇聚点标记,融入推荐特定信号(如时间距离)作为稳定的注意力汇聚点。为了增强通用性,我们设计了一个两阶段训练策略,明确引导LM注意力朝向汇聚点标记,以及一个注意力汇聚点机制,放大汇聚点间的依赖关系以更好地捕捉行为相关性。在一个工业数据集和两个开源数据集(MovieLens、Kuairec)上的实验以及可视化结果,验证了该方法在不同场景下的有效性。

英文摘要

Click-Through Rate (CTR) prediction, a core task in recommendation systems, estimates user click likelihood using historical behavioral data. Modeling user behavior sequences as text to leverage Language Models (LMs) for this task has gained traction, owing to LMs' strong semantic understanding and contextual modeling capabilities. However, a critical structural gap exists: user behavior sequences consist of discrete actions connected by semantically empty separators, differing fundamentally from the coherent natural language in LM pre-training. This mismatch causes semantic fragmentation, where LM attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, degrading prediction performance. To address this, we propose $\textit{CTR-Sink}$, a novel framework introducing behavior-level attention sinks tailored for recommendation scenarios. Inspired by attention sink theory, it constructs attention focus sinks and dynamically regulates attention aggregation via external information. Specifically, we insert sink tokens between consecutive behaviors, incorporating recommendation-specific signals such as temporal distance to serve as stable attention sinks. To enhance generality, we design a two-stage training strategy that explicitly guides LM attention toward sink tokens and an attention sink mechanism that amplifies inter-sink dependencies to better capture behavioral correlations. Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec), alongside visualization results, validate the method's effectiveness across scenarios.
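
The step of inserting sink tokens between consecutive behaviors, with temporal distance as the attached signal, could look roughly like this at the input-construction level. The `<sink:...>` token format and the time buckets are hypothetical; the abstract does not specify how the signal is encoded, and the training strategy and attention mechanism are not covered here.

```python
def bucket_gap(seconds):
    """Map a time gap to a coarse bucket label (hypothetical bucketing)."""
    if seconds < 3600:
        return "lt1h"
    if seconds < 86400:
        return "lt1d"
    return "ge1d"

def build_sequence(behaviors):
    """behaviors: list of (unix_timestamp, text), oldest first.

    Returns the text sequence with one sink token between each pair of
    consecutive behaviors, carrying the bucketed time gap between them.
    """
    parts = []
    for idx, (timestamp, text) in enumerate(behaviors):
        if idx > 0:
            gap = timestamp - behaviors[idx - 1][0]
            parts.append(f"<sink:{bucket_gap(gap)}>")
        parts.append(text)
    return " ".join(parts)
```

Replacing a semantically empty separator with a token that carries a recommendation-specific signal gives the model a consistent position at each behavior boundary to attend to.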

2510.07315 2026-06-08 cs.CL cs.AI cs.LG cs.SE 版本更新

SWE-IF: Aligning Code Evaluation with Human Preference

SWE-IF: 使代码评估与人类偏好对齐

Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun

Affiliations * Google DeepMind

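AI summary Proposes the SWE-IF benchmark, which uses the VeriCode taxonomy of verifiable instructions to evaluate code instruction following; instruction following emerges as the primary differentiator among LLMs, and combining it with functional correctness correlates best with human preference.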

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLM)推动了vibe coding,用户通过自然语言交互利用LLM生成并迭代优化代码,直到通过其vibe检查。Vibe检查反映了人类偏好,超越了功能性:解决方案应感觉正确、阅读清晰、保留意图并保持正确。然而,当前的代码评估仍局限于pass@k,仅捕获功能正确性,忽略了用户常规应用的非功能性指令。在本文中,我们假设指令遵循是vibe检查中除功能正确性之外缺失的部分。为了用量化信号衡量模型的代码指令遵循能力,我们提出了VeriCode,一个包含30条可验证代码指令及其确定性验证器的分类法。我们使用该分类法增强现有评估套件,得到SWE-IF,一个评估指令遵循和功能正确性的测试平台。评估31个LLM,我们发现即使最强的模型也难以遵守多条指令,并表现出功能回归。最重要的是,功能正确性和指令遵循的复合得分与人类偏好相关性最强,其中指令遵循成为LLM之间的主要区分因素。我们的代码、数据和分类法可在https://github.com/maszhongming/SWE-IF获取。

英文摘要

Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check besides functional correctness. To quantify models' code instruction-following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in SWE-IF, a testbed to assess both instruction following and functional correctness. Evaluating 31 LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator among LLMs. Our code, data, and taxonomy are available at https://github.com/maszhongming/SWE-IF.
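
A deterministic verifier for a code instruction, and a composite of instruction following with functional correctness, might be sketched as below. The two instructions shown (a line-length limit and mandatory docstrings) are illustrative choices, not entries confirmed to be in VeriCode, and the equal-weight average in `composite_score` is a hypothetical stand-in for the paper's composite.

```python
import ast

def max_line_length_ok(code, limit=79):
    """Verifier: every source line is at most `limit` characters long."""
    return all(len(line) <= limit for line in code.splitlines())

def functions_have_docstrings(code):
    """Verifier: every function definition carries a docstring."""
    tree = ast.parse(code)
    funcs = [node for node in ast.walk(tree)
             if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    return all(ast.get_docstring(func) is not None for func in funcs)

def instruction_following_rate(code, verifiers):
    """Fraction of instruction verifiers that the code satisfies."""
    results = [verifier(code) for verifier in verifiers]
    return sum(results) / len(results)

def composite_score(tests_passed, if_rate):
    """Hypothetical equal-weight mix of functional correctness and instruction following."""
    return (float(tests_passed) + if_rate) / 2
```

Because each verifier is a pure function of the source text, the same submission always receives the same verdict, which is what makes the instructions verifiable rather than judged.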

2510.26615 2026-06-08 cs.CL 版本更新

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

SlideAgent:用于多页视觉文档理解的分层代理框架

Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar

发表机构 * Georgia Institute of Technology(佐治亚理工学院) J.P. Morgan AI Research(摩根大通AI研究)

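AI summary Proposes SlideAgent, a hierarchical agentic framework for understanding multimodal, multi-page documents such as slide decks; it builds a structured representation through reasoning at the global, page, and element levels and improves accuracy by 7.9% over proprietary models and 9.8% over open-source models.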

Comments ACL 2026 Main Conference. https://slideagent.github.io/

详情
AI中文摘要

多页视觉文档,如手册、宣传册、演示文稿和海报,通过布局、颜色、图标和跨页引用传达关键信息。虽然多模态大语言模型(MLLMs)为文档理解提供了机会,但当前系统在处理复杂的多页视觉文档时仍存在困难,尤其是在元素和页面上的细粒度推理方面。我们引入了SlideAgent,一个用于理解多模态、多页、多布局文档(尤其是幻灯片组)的通用代理框架。SlideAgent采用专门的代理,并将推理分解为三个专门级别——全局、页面和元素——以构建结构化的、与查询无关的表示,捕捉总体主题以及详细的视觉或文本线索。在推理过程中,SlideAgent选择性激活专门代理进行多级推理,并将其输出整合为连贯的、上下文感知的答案。大量实验表明,SlideAgent在专有模型(+7.9%)和开源模型(+9.8%)上均显著提高了准确率。

英文摘要

Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While multimodal large language models (MLLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels--global, page, and element--to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent significantly improves accuracy over both proprietary (+7.9%) and open-source models (+9.8%).

2511.07380 2026-06-08 cs.CL 版本更新

Mining Useful General Data for Low-Resource Domain Adaptation

挖掘低资源领域适应的有用通用数据

Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang

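Affiliations * arXiv

AI summary To address data scarcity in low-resource domains, proposes NTK-Selector, which uses the Neural Tangent Kernel to select useful samples from general-domain data and substantially improves domain adaptation.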

Comments 39 pages

详情
AI中文摘要

由于领域特定数据的稀缺性,将大型语言模型(LLMs)适应到低资源领域仍然具有挑战性。虽然领域内数据有限,但存在大量与领域任务共享相似问答格式和推理模式的通用领域数据。这一观察提出了一个重要问题:能否挖掘有用的通用领域数据来改进低资源领域适应?我们的初步发现表明,即使没有仔细选择,通用领域的思维链数据也包含对领域适应有用的辅助信号。这一观察催生了一种新的领域适应范式,即不再完全依赖领域特定数据。为了系统地识别最有益的通用领域样本,我们提出了NTK-Selector,其动机源于神经正切核捕捉训练动态中对齐的能力。由于直接将NTK应用于预训练LLMs不切实际,我们引入了一种无雅可比矩阵的NTK近似,并在微调过程中经验性地展示了稳定的NTK类行为。在医学、金融、法律和心理领域的广泛实验表明,NTK-Selector始终优于仅使用领域数据的微调和现有数据选择基线。特别是,NTK-Selector在Llama3-8B-Instruct和Qwen3-8B上分别取得了+8.7和+5.1个百分点的提升,而仅使用领域数据的微调仅分别提升了+0.8和+0.9个百分点。

英文摘要

Adapting large language models (LLMs) to low-resource domains remains challenging due to the scarcity of domain-specific data. While in-domain data is limited, there exists a vast amount of general-domain data that shares similar question-answer formats and reasoning patterns with domain tasks. This observation raises an important question: can useful general-domain data be mined to improve low-resource domain adaptation? Our initial findings show that general-domain chain-of-thought data contains useful auxiliary signals for domain adaptation, even without careful selection. This observation motivates a new paradigm for domain adaptation beyond exclusive reliance on domain-specific data. To systematically identify the most beneficial general-domain samples, we propose NTK-Selector, motivated by the Neural Tangent Kernel's ability to capture alignment in training dynamics. Since directly applying NTK to pretrained LLMs is impractical, we introduce a Jacobian-free NTK approximation and empirically demonstrate stable NTK-like behavior during fine-tuning. Extensive experiments across medical, financial, legal, and psychological domains demonstrate that NTK-Selector consistently outperforms domain-only fine-tuning and existing data selection baselines. In particular, NTK-Selector achieves gains of +8.7 and +5.1 points on Llama3-8B-Instruct and Qwen3-8B, respectively, compared to only +0.8 and +0.9 points from domain-only fine-tuning.

2512.09634 2026-06-08 cs.CL 版本更新

Creation of the Estonian Subjectivity Dataset: Assessing the Degree of Subjectivity on a Scale

爱沙尼亚主观性数据集的创建:评估主观性程度的一个量表

Karl Gustav Gailit, Kadri Muischnek, Kairit Sirts

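Affiliations * University of Tartu

AI summary Creates an Estonian document-level subjectivity dataset annotated on a continuous scale, analyzes inter-annotator agreement, and runs an initial experiment on automatic subjectivity analysis with a large language model, finding that automatic scoring is feasible but cannot fully replace human annotation.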

Comments 9 pages, 5 figures, 3 appendixes, LREC 2026

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) 8204-8216
AI中文摘要

本文介绍了爱沙尼亚语文档级主观性数据集的创建,分析了所得标注,并报告了使用大语言模型(LLM)进行自动主观性分析的初步实验。该数据集包含1000个文档——300篇新闻文章和700个随机选择的网络文本——每个文档由四位标注员在从0(完全客观)到100(完全主观)的连续量表上评分。由于标注员间相关性中等,部分文本得分位于量表两端,因此对得分差异最大的文本子集进行了重新标注,标注员间相关性有所提高。除了人工标注外,数据集还包括GPT-5作为标注自动化实验生成的分数。这些分数与人工标注相似,但出现了一些差异,表明基于LLM的自动主观性评分虽然可行,但并非人工标注的可互换替代方案,其适用性取决于预期应用。

英文摘要

This article presents the creation of an Estonian-language dataset for document-level subjectivity, analyzes the resulting annotations, and reports an initial experiment on automatic subjectivity analysis using a large language model (LLM). The dataset comprises 1,000 documents (300 journalistic articles and 700 randomly selected web texts), each rated for subjectivity on a continuous scale from 0 (fully objective) to 100 (fully subjective) by four annotators. Because the inter-annotator correlations were moderate, with some texts receiving scores at opposite ends of the scale, a subset of texts with the most divergent scores was re-annotated, which improved the inter-annotator correlation. In addition to human annotations, the dataset includes scores generated by GPT-5 as an experiment in annotation automation. These scores were similar to those of the human annotators; however, several differences emerged, suggesting that while LLM-based automatic subjectivity scoring is feasible, it is not an interchangeable alternative to human annotation, and its suitability depends on the intended application.

2512.13278 2026-06-08 cs.CL cs.LG 版本更新

AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

AutoTool: 面向智能体推理的动态工具选择与集成

Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai, Ke Shen, Jingrui He, Mengdi Wang

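Affiliations * Nanyang Technological University

AI summary Proposes AutoTool, a framework that gives large language models dynamic tool-selection ability through dual-phase optimization (SFT and RL-based trajectory stabilization, then KL-regularized Plackett-Luce ranking), with average gains of 6.4% to 7.7% on math, science, code, and multimodal reasoning tasks.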

Comments ICML2026; Best Paper Award at ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence

详情
AI中文摘要

智能体强化学习推动了大语言模型(LLMs)在长链思维轨迹中进行推理,同时穿插外部工具的使用。现有方法假设工具集固定,限制了LLM智能体对新工具或演化工具集的适应性。我们提出AutoTool,一个训练框架,使LLM智能体在整个推理轨迹中具备动态工具选择能力。AutoTool采用双阶段优化流水线:(i)基于SFT和RL的轨迹稳定化,以实现连贯推理;(ii)KL正则化的Plackett-Luce排序,以优化一致的多步工具选择。我们进一步构建了一个包含20万条数据的数据集,其中包含跨1000多个工具和100多个任务(涵盖数学、科学、代码生成和多模态推理)的显式工具选择理由。在十个多样化基准上,我们使用AutoTool训练了两个基础模型:Qwen3-8B和Qwen2.5-VL-7B。在参数更少的情况下,AutoTool持续优于先进的LLM智能体和工具集成方法,在数学与科学推理上平均提升6.4%,在基于搜索的问答上提升4.5%,在代码生成上提升7.7%,在多模态理解上提升6.9%。此外,AutoTool通过在推理过程中动态利用演化工具集中的未见工具,展现出更强的泛化能力。

英文摘要

Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, which limits the adaptability of LLM agents to new or evolving toolsets. We present AutoTool, a training framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. AutoTool employs a dual-phase optimization pipeline: (i) SFT and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce Ranking to refine consistent multi-step tool selection. We further build a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
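The Plackett-Luce ranking step named in the abstract can be illustrated with a minimal sketch. The abstract does not give AutoTool's actual objective, so what follows is only the textbook Plackett-Luce log-likelihood over a ranked list of tools, plus a KL penalty toward a reference scorer; the function names, the `beta` weight, and the use of a softmax over tool scores for the KL term are all assumptions made for illustration.

```python
import math


def plackett_luce_log_likelihood(scores, ranking):
    """Log-probability of `ranking` (item indices, best first) under Plackett-Luce.

    At each position, the chosen item competes via softmax against all
    items that have not been placed yet.
    """
    remaining = list(ranking)
    log_lik = 0.0
    for item in ranking:
        # Numerically stable log-sum-exp over the items still available.
        m = max(scores[j] for j in remaining)
        log_norm = m + math.log(sum(math.exp(scores[j] - m) for j in remaining))
        log_lik += scores[item] - log_norm
        remaining.remove(item)
    return log_lik


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularized_pl_loss(scores, ref_scores, ranking, beta=0.1):
    """Hypothetical loss: ranking negative log-likelihood plus a KL penalty.

    `ref_scores` stands in for a frozen reference policy's tool scores;
    `beta` is an assumed regularization weight.
    """
    nll = -plackett_luce_log_likelihood(scores, ranking)
    return nll + beta * kl_divergence(softmax(scores), softmax(ref_scores))
```

With equal scores every ordering is equally likely, so a three-tool ranking has probability 1/6; raising the score of the top-ranked tool lowers the loss, while the KL term penalizes drifting far from the reference scores.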

2601.05751 2026-06-08 cs.CL cs.AI 版本更新

Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns

分析LLM生成文本中说服性语言的差异:揭示刻板的性别模式

Amalie Brogaard Pauli, Maria Barrett, Max Müller-Eberstein, Isabelle Augenstein, Ira Assent

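Affiliations * Department of Computer Science, Aarhus University; AMD Silo AI; University of Tokyo; IT University of Copenhagen; Department of Computer Science, University of Copenhagen

AI summary Proposes a framework for evaluating how recipient gender, sender intent, and output language affect the persuasive language LLMs generate, and finds significant gender differences across all models that reflect gender-stereotypical linguistic tendencies.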

Comments Accepted at ACL Findings 2026

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于日常交流任务,包括起草旨在影响和说服的人际信息。先前研究表明,LLMs能够成功说服人类并放大说服性语言。因此,理解用户指令如何影响说服性语言的生成,以及生成的说服性语言是否因目标群体不同而有所差异至关重要。在这项工作中,我们提出了一个框架,用于评估说服性语言生成如何受接收者性别、发送者意图或输出语言的影响。我们使用成对提示指令评估了13个LLMs和16种语言。我们采用基于社会心理学和传播科学的LLM-as-judge设置,在19个说服性语言类别上评估模型响应。我们的结果揭示了所有模型生成的说服性语言中存在显著的性别差异。这些模式反映了与社会心理学和社会语言学中记录的性别刻板语言倾向一致的偏见。

英文摘要

Large language models (LLMs) are increasingly used for everyday communication tasks, including drafting interpersonal messages intended to influence and persuade. Prior work has shown that LLMs can successfully persuade humans and amplify persuasive language. It is therefore essential to understand how user instructions affect the generation of persuasive language, and to understand whether the generated persuasive language differs, for example, when targeting different groups. In this work, we propose a framework for evaluating how persuasive language generation is affected by recipient gender, sender intent, or output language. We evaluate 13 LLMs and 16 languages using pairwise prompt instructions. We evaluate model responses on 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science. Our results reveal significant gender differences in the persuasive language generated across all models. These patterns reflect biases consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics.

2601.06600 2026-06-08 cs.CL 版本更新

Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

探究多模态大语言模型在中国短视频虚假信息中的认知偏差

Jen-tse Huang, Chang Chen, Shiyang Lai, Wenxuan Wang, Michelle R. Kaufman, Mark Dredze

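Affiliations * Johns Hopkins University; Chinese University of Hong Kong; University of Chicago; Renmin University of China

AI summary Evaluates eight multimodal large language models on health misinformation using a dataset of 200 short videos, finding that Gemini-2.5-Pro performs best (belief score 71.5/100) and that models are susceptible to social cues such as authoritative channel IDs.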

Comments Accepted to ACL 2026 (Findings)

详情
AI中文摘要

短视频平台已成为虚假信息的主要传播渠道,其中欺骗性声明常利用视觉实验和社会线索。尽管多模态大语言模型(MLLMs)展示了令人印象深刻的推理能力,但它们对与认知偏差纠缠的虚假信息的鲁棒性仍未得到充分探索。本文使用一个高质量、手动标注的200个短视频数据集,涵盖四个健康领域,引入了一个全面的评估框架。该数据集为三种欺骗模式——实验错误、逻辑谬误和捏造声明——提供了细粒度标注,每种模式均由国家标准和学术文献等证据验证。我们评估了八个前沿MLLMs在五种模态设置下的表现。实验结果表明,Gemini-2.5-Pro在多模态设置中取得了最高性能,信念分数为71.5/100,而o3表现最差,为35.2。此外,我们研究了视频中诱导错误信念的社会线索,发现模型易受权威频道ID等偏差影响。

英文摘要

Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. While Multimodal Large Language Models (MLLMs) have demonstrated impressive reasoning capabilities, their robustness against misinformation entangled with cognitive biases remains under-explored. In this paper, we introduce a comprehensive evaluation framework using a high-quality, manually annotated dataset of 200 short videos spanning four health domains. This dataset provides fine-grained annotations for three deceptive patterns (experimental errors, logical fallacies, and fabricated claims), each verified by evidence such as national standards and academic literature. We evaluate eight frontier MLLMs across five modality settings. Experimental results demonstrate that Gemini-2.5-Pro achieves the highest performance in the multimodal setting with a belief score of 71.5/100, while o3 performs the worst at 35.2. Furthermore, we investigate social cues that induce false beliefs in videos and find that models are susceptible to biases like authoritative channel IDs.

2601.08097 2026-06-08 cs.CL cs.LG 版本更新

AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling

AdaJudge: 自适应多视角评判用于奖励建模

Yongliang Miao, Yangyang Liang, Mengnan Du

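Affiliations * Emory University; The Chinese University of Hong Kong, Shenzhen

AI summary Proposes AdaJudge, a framework that jointly optimizes representation and aggregation through gated refinement blocks and an adaptive multi-view pooling module, addressing static inductive bias and representational mismatch in reward modeling and outperforming existing models on RM-Bench and JudgeBench.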

Comments ACL 2026

详情
AI中文摘要

奖励建模对于将大型语言模型与人类偏好对齐至关重要,但主流架构依赖静态池化策略将序列压缩为标量分数。然而,这种范式存在两个关键限制:静态归纳偏差与任务相关的偏好信号不匹配,以及表示不匹配,因为骨干网络针对生成的优化使其表示不适用于细粒度判别。为解决这一问题,我们提出AdaJudge,一个统一框架,联合调整表示和聚合。AdaJudge首先通过门控精化块将骨干网络表示改进到判别导向的空间。然后,它用自适应多视角池化模块替换静态读出,该模块动态路由并组合证据。在RM-Bench和JudgeBench上的大量实验表明,AdaJudge优于强大的现成奖励模型和传统池化基线。

英文摘要

Reward modeling is essential for aligning large language models with human preferences, yet predominant architectures rely on a static pooling strategy to condense sequences into scalar scores. This paradigm, however, suffers from two key limitations: a static inductive bias that misaligns with task-dependent preference signals, and a representational mismatch, as the backbone's optimization for generation leaves its representations ill-suited to fine-grained discrimination. To address this, we propose AdaJudge, a unified framework that jointly adapts representation and aggregation. AdaJudge first improves backbone representations into a discrimination-oriented space via gated refinement blocks. It then replaces the static readout with an adaptive multi-view pooling module, which dynamically routes and combines evidence. Extensive experiments on RM-Bench and JudgeBench show that AdaJudge outperforms strong off-the-shelf reward models and traditional pooling baselines.

2601.09402 2026-06-08 cs.CL 版本更新

SEEK: Steering LLM Reasoning for RAG via Internal Reasoning Sketches

SEEK: 通过内部推理草图引导LLM推理用于RAG

Xinze Li, Yuqing Lan, Zhenghao Liu, Haidong Xin, Yukun Yan, Shuo Wang, Zheni Zeng, Sen Mei, Ge Yu, Maosong Sun

Affiliations * School of Computer Science and Engineering, Northeastern University, China; Department of Computer Science and Technology, Institute for AI, Tsinghua University, China; School of Intelligent Science and Technology, Nanjing University, China

AI summary Proposes SEEK, a framework that builds a structured steering sketch and iteratively retrieves knowledge to fill its slots, reducing redundant retrieval and improving RAG performance.

详情
AI中文摘要

检索增强生成(RAG)通过将外部知识融入生成过程来增强大型语言模型(LLM)。借助LLM的推理能力,现有方法利用这种能力实现迭代知识获取和积累,从而更好地支持答案生成。然而,随着推理轨迹的增长,积累的知识和先前生成的查询可能会干扰后续检索决策,导致子查询意图重复和知识获取冗余。为了解决这个问题,我们提出了SEEK,一种用于RAG的草图引导知识获取框架。SEEK首先提示LLM为给定问题构建一个结构化的引导草图。它由多组引导要点组成,每个要点后跟一个用于知识填充的槽位。在这些引导要点的指导下,SEEK迭代地检索和精炼知识,并填充相应的槽位以完成草图。然后,完成的草图作为上下文输入用于最终答案生成。实验结果表明,SEEK在多个任务上取得了比基线模型更好的性能。进一步分析表明,SEEK可以生成更多样化的子查询,减少冗余检索,并在外部知识利用和内部知识冲突缓解之间实现更好的平衡。所有代码可在 https://github.com/OpenBMB/PAGER 获取。

英文摘要

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge into the generation process. Building on the reasoning capabilities of LLMs, existing methods perform iterative knowledge acquisition and accumulation to better support answer generation. However, as the reasoning trajectory grows, the accumulated knowledge and previously generated queries may interfere with subsequent retrieval decisions, resulting in sub-queries with repetitive intents and redundant knowledge acquisition. To address this issue, we propose SEEK, a sketch-guided knowledge acquisition framework for RAG. SEEK first prompts the LLM to construct a structured steering sketch for the given question. The sketch consists of multiple groups of steering gists, with each gist followed by a slot for knowledge filling. Guided by these steering gists, SEEK iteratively retrieves and refines knowledge, and fills the corresponding slots to complete the sketch. The completed sketch is then used as contextual input for final answer generation. Experimental results show that SEEK achieves better performance than baseline models across multiple tasks. Further analyses demonstrate that SEEK can generate more diverse sub-queries, reduce redundant retrieval, and achieve a better balance between external knowledge utilization and internal knowledge conflict mitigation. All code is available at https://github.com/OpenBMB/PAGER.

2601.10896 2026-06-08 cs.CL 版本更新

DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference

DialDefer: 检测和缓解LLM对话性遵从的框架

Parisa Rabbani, Priyam Sahoo, Ruben Mathew, Aishee Mondal, Harshita Ketharaman, Nimet Beyza Bozdag, Dilek Hakkani-Tür

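Affiliations * University of Illinois Urbana-Champaign

AI summary Proposes DialDefer, a framework that uses a Dialogic Deference Score to detect and mitigate judgment shifts caused by question framing when LLMs evaluate dialogue. Framing effects are large while accuracy stays stable, and the largest shifts arise from attributing a claim to a human versus an AI.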

Comments 10 pages main content, 7 figures, 35 pages total with appendix

详情
AI中文摘要

LLM越来越多地被用作第三方评判者,但它们在评估对话中的说话者时的可靠性仍知之甚少。我们证明,LLM对相同主张的判断因框架而异:相同内容在作为陈述验证(“这个陈述正确吗?”)与归因于说话者(“这个说话者正确吗?”)时得到不同裁决。我们称此为对话性遵从,并引入DialDefer,一个用于检测和缓解这些框架诱导的判断偏移的框架。我们的对话性遵从分数(DDS)捕捉了聚合准确性所掩盖的方向性偏移。在十个领域、3000多个实例和五个模型上,对话框架诱导了大幅偏移(模型间平均|DDS|=15.9个百分点,p<0.0001),而准确性保持稳定(<2个百分点),在自然Reddit对话中效应放大2-5倍。这种效应是领域依赖的:单个模型可以在研究生级别的科学上转向不同意(怀疑),在社会判断上转向同意(遵从)。消融实验揭示,人类与LLM的归因导致最大偏移(17.7个百分点的摆动),表明模型认为与人类的分歧比与AI的分歧代价更高。缓解尝试可以减少遵从,但过度校正为怀疑,揭示了超出准确性优化的校准问题。

英文摘要

LLMs are increasingly used as third-party judges, yet their reliability when evaluating speakers in dialogue remains poorly understood. We show that LLMs judge identical claims differently depending on framing: the same content receives different verdicts when presented as a statement to verify ("Is this statement correct?") versus attributed to a speaker ("Is this speaker correct?"). We call this dialogic deference and introduce DialDefer, a framework for detecting and mitigating these framing-induced judgment shifts. Our Dialogic Deference Score (DDS) captures directional shifts that aggregate accuracy obscures. Across ten domains, 3k+ instances, and five models, conversational framing induces large shifts (mean|DDS|=15.9 percentage points (pp) across models, p < .0001) while accuracy remains stable (<2 pp), with effects amplifying 2--5x on naturalistic Reddit conversations. This effect is domain-dependent: a single model can shift toward disagreement (skepticism) on graduate-level science and toward agreement (deference) on social judgment. Ablations reveal that human-vs-LLM attribution drives the largest shifts (17.7 pp swing), suggesting models treat disagreement with humans as more costly than with AI. Mitigation attempts can reduce deference but over-correct into skepticism, revealing a calibration problem beyond accuracy optimization.

2602.11201 2026-06-08 cs.CL 版本更新

Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

链式思维推理中忠实度衰减的机制证据

Donald Ye, Max Loffgren, Om Kotadia, Linus Wong, Jonas Rohweder

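Affiliations * Fordham University; Algoverse AI Research; Rice University; UC San Diego; Santa Clara University; LMU Munich

AI summary Proposes Normalized Logit Difference Decay (NLDD), a metric that corrupts individual reasoning steps and measures the resulting drop in model confidence. Reasoning tokens beyond 70-85% of the chain length contribute little or negatively to the final answer, revealing faithfulness decay.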

Comments 16 pages, 16 figures. Accepted to ICLR LIT workshop. Code: https://github.com/donald-ye/NLDD

详情
AI中文摘要

链式思维(CoT)解释被广泛用于解释语言模型如何解决复杂问题,但目前尚不清楚这些逐步解释是否反映了模型实际得出答案的方式,还是仅仅是事后证明。我们提出了归一化对数几率差衰减(NLDD),一种衡量单个推理步骤是否忠实于模型决策过程的指标。我们的方法从解释中破坏单个推理步骤,并测量模型对其答案的置信度下降程度,以确定该步骤是否真正重要。通过标准化这些测量,NLDD能够实现跨不同架构的严格跨模型比较。在三种模型家族上测试句法、逻辑和算术任务,我们发现了一个一致的推理视界(k*),位于链长的70-85%处,超过该点的推理令牌对最终答案几乎没有或只有负面影响。我们还发现,模型可以在完全失败任务的同时编码正确的内部表示。这些结果表明,仅凭准确性并不能揭示模型是否真正通过其链进行推理。NLDD提供了一种衡量CoT何时重要的方法。

英文摘要

Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps in the explanation and measures how much the model's confidence in its answer drops, to determine whether a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70-85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.

2603.06915 2026-06-08 cs.CL cs.LG 版本更新

A Dynamic Self-Evolving Extraction System

一种动态自演化抽取系统

Moin Amin-Naseri, Hannah Kim, Estevam Hruschka

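Affiliations * Megagon Labs

AI summary Proposes DySECT, a system that builds a knowledge base from LLM-extracted triples, enriches it with probabilistic knowledge and graph-based reasoning, and feeds it back to improve the extractor, forming a closed loop of continual improvement.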

详情
AI中文摘要

从原始文本中抽取结构化信息是许多NLP应用(包括文档检索、排序和相关性估计)的基本组成部分。高质量的抽取通常需要领域特定的准确性、对专业分类法的最新理解,以及吸收新兴术语和罕见异常值的能力。在许多领域(如医疗、法律和人力资源),抽取模型还必须适应不断变化的术语,并受益于对结构化知识的显式推理。我们提出了DySECT,一个动态自演化抽取与策管工具包,它在使用过程中持续改进。该系统逐步用LLM抽取的三元组填充一个多功能、自扩展的知识库(KB)。KB通过整合概率知识和基于图的推理进一步丰富自身,逐步积累领域概念和关系。然后,丰富的KB通过提示调优、采样相关少样本示例或使用KB衍生的合成数据进行微调,反馈给LLM抽取器。结果,系统形成了一个共生的闭环循环,其中抽取持续改进知识,知识持续改进抽取。

英文摘要

The extraction of structured information from raw text is a fundamental component of many NLP applications, including document retrieval, ranking, and relevance estimation. High-quality extractions often require domain-specific accuracy, up-to-date understanding of specialized taxonomies, and the ability to incorporate emerging jargon and rare outliers. In many domains--such as medical, legal, and HR--the extraction model must also adapt to shifting terminology and benefit from explicit reasoning over structured knowledge. We propose DySECT, a Dynamic Self-Evolving Extraction and Curation Toolkit, which continually improves as it is used. The system incrementally populates a versatile, self-expanding knowledge base (KB) with triples extracted by the LLM. The KB further enriches itself through the integration of probabilistic knowledge and graph-based reasoning, gradually accumulating domain concepts and relationships. The enriched KB then feeds back into the LLM extractor via prompt tuning, sampling of relevant few-shot examples, or fine-tuning using KB-derived synthetic data. As a result, the system forms a symbiotic closed-loop cycle in which extraction continuously improves knowledge, and knowledge continuously improves extraction.

2603.09403 2026-06-08 cs.CL 版本更新

LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

LLM作为元评判者:用于NLP评估指标验证的合成数据

Lukáš Eigler, Jindřich Libovický, David Hurych

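Affiliations * Faculty of Mathematics and Physics, Charles University

AI summary Proposes LLM as a Meta-Judge, a framework that generates synthetic data through controlled semantic degradation as a substitute for human judgment when validating NLG evaluation metrics, with meta-correlations above 0.9 in multilingual QA.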

Comments 16 pages, 1 figure, 14 tables

详情
AI中文摘要

验证NLG的评估指标通常依赖于昂贵且耗时的人工标注，而这些标注主要仅存在于英语数据集。我们提出LLM作为元评判者，这是一个可扩展的框架，利用LLM通过控制真实数据的语义退化生成合成评估数据集，取代人工判断。我们使用元相关性来验证我们的方法，衡量从合成数据得出的指标排名与标准人工基准之间的对齐程度。在机器翻译、问答和摘要上的实验表明，合成验证可作为人工判断的可靠代理，在多语言问答中实现超过0.9的元相关性，并证明在人工判断不可用或过于昂贵的情况下是一种可行的替代方案。我们的代码和数据已公开于 https://github.com/eiglerl/meta-judge。

英文摘要

Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose LLM as a Meta-Judge, a scalable framework that utilizes LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, which measures the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA, and that it is a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data are publicly available at https://github.com/eiglerl/meta-judge.
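As a rough illustration of the meta-correlation idea, the sketch below compares how a set of metrics ranks under synthetic validation with how the same metrics rank under a human benchmark. The abstract does not say which correlation coefficient the authors use; Spearman rank correlation is assumed here, and the function names and the dictionary input format are hypothetical.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def meta_correlation(synthetic_scores, human_scores):
    """Spearman correlation between two per-metric score dictionaries.

    Each dictionary maps a metric name to how well that metric performed
    under one validation scheme (assumed input format).
    """
    metrics = sorted(synthetic_scores)
    synthetic = [synthetic_scores[m] for m in metrics]
    human = [human_scores[m] for m in metrics]
    return pearson(average_ranks(synthetic), average_ranks(human))
```

A value near 1 would mean the synthetic data orders the metrics the same way the human benchmark does, which is the sense in which synthetic validation can stand in for human judgment.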

2603.28304 2026-06-08 cs.CL 版本更新

The Necessity of Setting Temperature in LLM-as-a-Judge

LLM作为评判者中设置温度的必要性

Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State

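Affiliations * University of Luxembourg; ETH Zürich

AI summary Systematically studies how temperature affects LLM judging behavior. Higher temperatures reduce consistency but expose uncertainty; low temperatures suit tasks that need stability, while higher temperatures suit complex scenarios. Recommends treating temperature as a task-dependent design choice.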

Comments 17 pages

详情
AI中文摘要

使用大型语言模型(LLM)作为评判者来评估模型输出已成为自动化评估的重要范式。然而,在LLM作为评判者的设置中,解码温度的选择在很大程度上仍然是经验性的,缺乏关于其影响的系统证据。为了解决这一差距,我们系统研究了温度如何影响不同LLM评判模型、提示策略和评估范式下的评判行为。我们的结果表明,较高的温度通常会降低评判一致性并增加格式错误,同时也会暴露潜在的不确定性,这种不确定性在低温解码下往往被抑制,尤其是在模糊案例中。进一步的分析表明,较高的温度可以作为探索机制,并可能提高复杂或不确定评估场景中的评判性能。总体而言,低温设置更适合优先考虑稳定性和可重复性的任务,而高温设置更适合涉及大量模糊性或复杂性的场景,在这些场景中,探索评判者的决策空间是有益的。这些发现表明,在LLM作为评判者的系统中,温度不应被视为固定的超参数,而应被视为可控的、任务相关的设计选择,它调节了可靠性与探索之间的权衡。

英文摘要

Using large language models (LLMs) as judges for evaluating model outputs has emerged as an important paradigm for automated evaluation. However, the decoding temperature in LLM-as-a-judge settings is still largely chosen empirically, with limited systematic evidence on its impact. To address this gap, we conduct a systematic study of how temperature affects judgment behavior across different LLM judge models, prompting strategies, and evaluation paradigms. Our results show that higher temperatures generally decrease judgment consistency and increase formatting errors, while also exposing latent uncertainty that tends to remain suppressed under low-temperature decoding, particularly in ambiguous cases. Further analysis suggests that higher temperatures can serve as an exploratory mechanism and may improve judging performance in complex or uncertain evaluation scenarios. Overall, low-temperature settings are better suited to tasks that prioritize stability and reproducibility, whereas higher-temperature settings are more appropriate for scenarios involving substantial ambiguity or complexity, where exploration of the judge's decision space is beneficial. These findings suggest that, in LLM-as-a-judge systems, temperature should be treated not as a fixed hyperparameter, but as a controllable, task-dependent design choice that mediates the trade-off between reliability and exploration.

2604.17433 2026-06-08 cs.CL cs.AI cs.LG 版本更新

Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

仅需两个样本的自一致性:CoT-PoT集成实现高效LLM推理

Raman Saparkhan, Majd Hawasly, Md Rizwan Parvez, Mohammad Raza

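Affiliations * Carnegie Mellon University Qatar; Qatar Computing Research Institute

AI summary Proposes a hybrid ensembling method that combines Chain-of-Thought and Program-of-Thought reasoning for self-consistency, reducing the number of samples required by a factor of 9.3; 78.6% of tasks can be addressed with only two samples.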

Comments 9 pages, 3 figures; accepted to Findings of ACL 2026

详情
AI中文摘要

自一致性(SC)是一种通过聚合多个采样输出来提高大型语言模型推理准确性的流行技术,但由于大量采样,其计算成本高昂。我们引入了一种混合集成方法,利用两种不同推理模式(思维链(CoT)和程序化推理(PoT))的互补优势。我们描述了一个通用框架,用于在自一致性中结合这两种推理形式,并提出了全采样和早停的特定策略。我们表明,CoT-PoT集成不仅提高了整体准确性,而且将SC所需的样本数量大幅减少了9.3倍。特别是,大多数任务(78.6%)仅需两个样本即可解决,这在之前的任何SC方法中都是不可能的。

英文摘要

Self-consistency (SC) is a popular technique for improving the reasoning accuracy of large language models by aggregating multiple sampled outputs, but it comes at a high computational cost due to extensive sampling. We introduce a hybrid ensembling approach that leverages the complementary strengths of two distinct modes of reasoning: Chain-of-Thought (CoT) and Program-of-Thought (PoT). We describe a general framework for combining these two forms of reasoning in self-consistency, as well as particular strategies for both full sampling and early stopping. We show that CoT-PoT ensembling not only improves overall accuracy, but also drastically reduces the number of samples required for SC, by a factor of 9.3. In particular, the majority of tasks (78.6%) can be addressed with only two samples, which has not been possible with any prior SC method.
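One way to picture the two-sample case is an early-stopping rule that accepts an answer as soon as one CoT sample and one PoT sample agree. The abstract does not spell out the paper's strategies, so the sketch below is an assumed, simplified version: the function name, the alternating fallback, and the `max_samples` budget are hypothetical, and `sample_cot` and `sample_pot` stand in for calls that return a final answer from each reasoning mode.

```python
from collections import Counter
from itertools import cycle


def cot_pot_self_consistency(sample_cot, sample_pot, max_samples=10):
    """Return (answer, samples_used) from a CoT-PoT ensemble.

    Draws one answer from each reasoning mode and stops early if they
    agree; otherwise keeps alternating between modes up to `max_samples`
    and falls back to a majority vote over all collected answers.
    """
    answers = [sample_cot(), sample_pot()]
    if answers[0] == answers[1]:
        # Agreement across two different reasoning modes: stop here.
        return answers[0], 2

    samplers = cycle([sample_cot, sample_pot])
    while len(answers) < max_samples:
        answers.append(next(samplers)())

    winner, _ = Counter(answers).most_common(1)[0]
    return winner, len(answers)
```

The design intuition is that agreement between a natural-language derivation and an executed program is stronger evidence than agreement between two samples of the same kind, which is why two samples can suffice in the agreeing case.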

2604.18401 2026-06-08 cs.CL 版本更新

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

StepPO: 面向智能体强化学习的步骤对齐策略优化

Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu, Qi Liu, Enhong Chen

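Affiliations * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

AI summary Proposes StepPO, a step-level policy optimization method that reformulates agentic RL from a token-level MDP into a step-level MDP and introduces step-level credit assignment, addressing the granularity mismatch between existing algorithms and agent decisions; it outperforms various RL algorithms on tasks such as multi-hop QA.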

详情
AI中文摘要

智能体强化学习(Agentic RL)正成为提升LLM智能体能力的关键后训练范式。现有的LLM RL算法大多遵循RLHF和RLVR中的token中心范式,其中token作为建模和优化的基本单元。然而,这种范式在智能体RL中引入了粒度不匹配问题,因为它优化的是token级预测,而LLM智能体通过环境观察和行动的循环做出步骤级决策。为弥合这一差距,我们提出StepPO,一种通过步骤对齐策略优化实现的步骤中心智能体RL范式。具体来说,我们将智能体RL从token级马尔可夫决策过程(MDP)重构为步骤级MDP,其中交互步骤作为基本轨迹表示。我们进一步提出步骤级信用分配,使策略优化与智能体决策的自然粒度对齐。StepPO在步骤级优化智能体策略,用于多轮智能体-环境交互。在多跳问答、学术论文搜索和文本世界行动任务上的实验表明,StepPO始终优于各种RL算法。进一步的分析揭示了步骤中心范式如何改善智能体训练。我们希望这种步骤中心范式能为理解智能体行为提供有用的视角,并为训练更强大的LLM智能体提供一条实用路径。

英文摘要

Agentic reinforcement learning (RL) is emerging as a critical post-training paradigm for improving LLM agent capabilities. Existing RL algorithms for LLMs largely follow the token-centric paradigm of RLHF and RLVR, where tokens serve as the basic units for modeling and optimization. However, this paradigm introduces a granularity mismatch in agentic RL, as it optimizes token-level predictions while LLM agents make step-level decisions through cycles of environmental observations and actions. To bridge this gap, we propose StepPO, a step-centric paradigm for agentic RL via step-aligned policy optimization. Specifically, we reformulate agentic RL from a token-level Markov Decision Process (MDP) into a step-level MDP, where interaction steps serve as the basic trajectory representations. We further propose step-level credit assignment to align policy optimization with the natural granularity of agent decisions. Together, StepPO optimizes agent policies at the step level for multi-turn agent-environment interaction. Experiments across multi-hop QA, academic paper search, and text-world action tasks show that StepPO consistently outperforms various RL algorithms. Further analyses provide insights into how the step-centric paradigm improves agent training. We hope this step-centric paradigm offers a useful lens for understanding agent behavior and a practical path for training more capable LLM agents.

2605.10832 2026-06-08 cs.CL 版本更新

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

面向视觉原生多模态深度搜索智能体的在策略数据演化

Shijue Huang, Hangyu Guo, Guanting Dong, Chenxin Li, Junting Lu, Xinyu Geng, Zhaochen Su, Zhenyu Li, Shuang Chen, Hongru Wang, Yi R. Fung

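Affiliations * Hong Kong University of Science and Technology; Renmin University of China; The Chinese University of Hong Kong; Peking University; Tsinghua University; University of Edinburgh

AI summary Proposes On-policy Data Evolution (ODE), which uses an image bank reference protocol and a closed-loop data generator to address non-reusable visual evidence and static training data in multimodal deep search, with substantial gains across 8 benchmarks.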

详情
AI中文摘要

多模态深度搜索要求智能体通过链式搜索、工具使用和对不断变化的文本与视觉上下文的视觉推理来解决开放世界问题。两个瓶颈限制了当前系统。首先,现有的工具使用框架将搜索、浏览或转换返回的图像视为瞬时输出,因此中间视觉证据无法被后续工具重新消费。其次,训练数据通常由固定的整理配方构建,无法跟踪目标智能体不断发展的能力。为应对这些挑战,我们首先引入了一个以图像库引用协议为核心的视觉原生智能体框架,该协议将每个工具返回的图像注册为可寻址引用,使中间视觉证据可被后续工具重用。在此框架之上,在策略数据演化(ODE)运行一个闭环数据生成器,该生成器根据正在训练的策略的 rollout 在每轮中自我改进。这种逐轮改进使得每轮的数据针对当前策略仍需学习的内容。同一框架支持多样化的监督微调数据和策略感知的强化学习数据整理,覆盖目标智能体的完整训练生命周期。在8个多模态深度搜索基准上,ODE 将 Qwen3-VL-8B 智能体的平均得分从24.9%提升至39.0%,在标准智能体工作流设置中超越了 Gemini-2.5 Pro(37.9%)。在30B规模下,ODE 将平均得分从30.6%提升至41.5%。进一步分析验证了图像库重用的有效性,特别是在需要迭代视觉细化的复杂任务上,而 rollout 反馈演化比静态合成产生了更扎实的 SFT 轨迹和更好的策略匹配的 RL 任务。

英文摘要

Multimodal deep search requires an agent to solve open-world problems by chaining search, tool use, and visual reasoning over evolving textual and visual context. Two bottlenecks limit current systems. First, existing tool-use harnesses treat images returned by search, browsing, or transformation as transient outputs, so intermediate visual evidence cannot be re-consumed by later tools. Second, training data is usually built by fixed curation recipes that cannot track the target agent's evolving capability. To address these challenges, we first introduce a visual-native agent harness centered on an image bank reference protocol, which registers every tool-returned image as an addressable reference and makes intermediate visual evidence reusable by later tools. On top of this harness, On-policy Data Evolution (ODE) runs a closed-loop data generator that refines itself across rounds from rollouts of the policy being trained. This per-round refinement makes each round's data target what the current policy still needs to learn. The same framework supports both diverse supervised fine-tuning data and policy-aware reinforcement learning data curation, covering the full training lifecycle of the target agent. Across 8 multimodal deep search benchmarks, ODE improves the Qwen3-VL-8B agent from 24.9% to 39.0% on average, surpassing Gemini-2.5 Pro in the standard agent-workflow setting (37.9%). At 30B, ODE raises the average score from 30.6% to 41.5%. Further analyses validate the effectiveness of image-bank reuse, especially on complex tasks requiring iterative visual refinement, while rollout-feedback evolution yields more grounded SFT traces and better policy-matched RL tasks than static synthesis.

2605.14194 2026-06-08 cs.CL 版本更新

GradShield: Alignment Preserving Finetuning

GradShield: 保持对齐的微调

Zhanhao Hu, Xiao Huang, Patrick Mendoza, Emad A. Alghamdi, Basel Alomair, Raluca Ada Popa, David Wagner

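Affiliations * University of California, Berkeley; HUMAIN; King Abdulaziz City for Science and Technology; University of Washington, Seattle

AI summary Proposes GradShield, a filtering method that computes a Finetuning Implicit Harmfulness Score and applies an adaptive threshold to remove harmful data before finetuning, preserving LLM safety alignment with an attack success rate below 6%.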

详情
AI中文摘要

大型语言模型(LLM)在微调后存在安全对齐的重大风险,因为模型可能被显式和隐式有害数据破坏。即使一些看似良性的数据也可能无意中引导模型走向未对齐的行为。为了解决这个问题,我们引入了GradShield,一种原则性的过滤方法,通过在有害数据破坏模型对齐之前识别并移除它们,从而在微调过程中保护LLM。它通过计算每个数据点的微调隐式危害分数(FIHS)并采用自适应阈值算法来移除潜在有害数据。我们将GradShield应用于多个不同有害数据水平的实用微调任务,并使用各种指标评估所得LLM的安全性和实用性。结果表明,GradShield优于所有基线方法,在保持实用性能的同时,始终将攻击成功率(ASR)维持在6%以下。

英文摘要

Large Language Models (LLMs) pose a significant risk of safety misalignment after finetuning, as models can be compromised by both explicitly and implicitly harmful data. Even some seemingly benign data can inadvertently steer a model towards misaligned behaviors. To address this, we introduce GradShield, a principled filtering method that safeguards LLMs during finetuning by identifying and removing harmful data points before they corrupt the model's alignment. It removes potentially harmful data by computing a Finetuning Implicit Harmfulness Score (FIHS) for each data point and employs an adaptive thresholding algorithm. We apply GradShield to multiple utility fine-tuning tasks across varying levels of harmful data and evaluate the safety and utility performance of the resulting LLMs using various metrics. The results show that GradShield outperforms all baseline methods, consistently maintaining an Attack Success Rate (ASR) below $6\%$ while preserving utility performance.

2605.25171 2026-06-08 cs.CL 版本更新

Re-defining Humor Data Objects for AI Humor Research

为AI幽默研究重新定义幽默数据对象

Anna Arnett, Bang Nguyen, Meng Jiang

发表机构 * Department of Computer Science and Engineering, University of Notre Dame(圣母大学计算机科学与工程系)

AI总结 本研究将幽默视为具有上下文和解释的社会互动,通过定义幽默推理数据对象并改进提示策略,使LLM生成更高质量的幽默解释,为AI幽默研究的数据合成与增强奠定基础。

Comments Added link to code and data

详情
AI中文摘要

在现有的大多数AI幽默研究中,幽默被简单地视为“存在”或“不存在”。我们探索了幽默作为具有上下文和解释的社会互动的概念。在此项目中,我们定义了一个幽默推理数据对象,并开发了一种提示LLM生成对普通人群有效的幽默解释的方法。我们从早期的提示迭代到改进的提示,发现后一个版本减少了重要错误,然后将生成扩展到大量数据对象,这些对象有潜力为AI幽默研究实现数据合成和数据增强。我们的主要收获是,更好的LLM提示能提高幽默解释质量,特别是通过更仔细地处理缺失上下文、多模态和转录问题。这些结果为未来AI理解幽默作为社会行为的研究奠定了坚实基础。

英文摘要

In most existing AI humor research, humor has been treated as either "present" or "not present." We explore the concept of humor as a social interaction with context and explanations. During this project, we defined a humor reasoning data object and developed a way to prompt LLMs to generate explanations of humor that are effective for the general population. We iterated from an earlier prompt to an improved prompt, found that the later version reduced important errors, and then scaled generation to a large number of data objects, which have the potential to enable data synthesis and data augmentation for AI humor research. Our main takeaway is that better prompting of an LLM improves humor explanation quality, especially by handling missing context, multi-modality, and transcript issues more carefully. These results establish a strong foundation for future work on AI understanding of humor as social behavior. All code and data are available at: https://github.com/anna-arnett/ai-humor/

2605.25638 2026-06-08 cs.CL cs.LG 版本更新

Reinforcement Learning from Denoising Feedback

基于去噪反馈的强化学习

Qi He, Huan Chen, Ya Guo, Huijia Zhu, Yi R. Fung, Baojian Zhou

发表机构 * Fudan University(复旦大学) Ant Group(蚂蚁集团) Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出RLDF方法,利用去噪反馈进行策略损失估计,通过优化中间噪声状态到裁剪干净状态并结合加权时间步采样,在扩散语言模型上提升性能和泛化性。

详情
AI中文摘要

策略损失估计仍然是扩散语言模型(dLLMs)强化学习中的一个基本且长期存在的挑战。我们引入了基于去噪反馈的强化学习(RLDF),这是一种新颖的训练范式,利用从rollout和训练过程中获得的反馈来实现准确且高效的策略损失估计。为了平衡计算效率和估计有效性之间的权衡,RLDF将模型从中间噪声状态$x_t$优化到裁剪干净状态$\hat{x}_0$,并结合了随时间步$t$的加权采样。大量实验表明,RLDF在两种代表性dLLM架构(LLaDA和Dream)上,在多个推理基准测试中实现了性能和泛化性的一致且显著的提升。我们的工作为扩散语言模型中的可扩展强化学习奠定了原则性基础。我们构建了Drift,一个用于dLLMs的训练框架,可在https://github.com/ant-research/Drift获取。

英文摘要

Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (DLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training paradigm that leverages feedback obtained from rollout and training processes to facilitate accurate and efficient policy loss estimation. To balance the trade-off between computational efficiency and estimation effectiveness, RLDF optimizes the model toward the clipped clean state from intermediate noisy states, combined with weighted timestep sampling over denoising timesteps. Extensive experiments demonstrate that RLDF achieves consistent and substantial improvements in both performance and generalizability across two representative DLM architectures, LLaDA and Dream, on multiple reasoning benchmarks. Our work lays a principled foundation for scalable reinforcement learning in diffusion language models. We build Drift, a training framework for DLMs, available at https://github.com/ant-research/Drift.

2605.26099 2026-06-08 cs.CL cs.AI 版本更新

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

语言模型需要睡眠吗?用于改进在线推理的离线循环

Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of Maryland(马里兰大学)

AI总结 本文提出一种类似睡眠的巩固机制,通过离线循环将上下文转换为快速权重,以解决Transformer注意力机制随上下文长度扩展性差的问题,并在合成任务和数学推理任务上验证了其有效性。

详情
AI中文摘要

基于Transformer的大型语言模型越来越多地用于长时任务;然而,它们的注意力机制随上下文长度扩展性差。为了解决这个问题,我们研究了一种类似睡眠的巩固机制,其中模型在清除其键值缓存之前,定期将最近的上下文转换为持久的快速权重。在睡眠期间,模型对累积的上下文进行$N$次离线循环传递,并通过学习到的局部规则更新其状态空间模型(SSM)块中的快速权重。在推理过程中,这会将额外的计算转移到睡眠阶段,同时保持清醒时预测的延迟。我们在受控的合成任务(包括元胞自动机和多跳图检索)以及一个现实的数学推理任务上测试了我们的方法,在这些任务上,常规Transformer以及SSM-注意力混合模型都失败了。然后我们表明,增加我们模型的睡眠持续时间$N$可以提高性能,在需要更深层推理的示例上收益最大。

英文摘要

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs $N$ offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration $N$ for our models improves performance, with the largest gains on examples that require deeper reasoning.

2606.03889 2026-06-08 cs.CL 版本更新

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

RealClawBench: 来自真实开发者-智能体会话的实时OpenClaw基准测试

Zongwei Lv, Zhewen Tan, Yaoming Li, Yilun Yao, Yuxuan Tian, Lin Sun, Xiangzheng Zhang, Weihong Lin, Tong Yang, Guangxiang Zhao

发表机构 * Peking University(北京大学) Qiyuan Tech(启元科技)

AI总结 针对现有基准缺乏真实性的问题,提出RealClawBench框架,通过重构执行环境和确定性可验证评分器,将真实OpenClaw会话转化为可复现、自动评分的任务,评估14个模型后最佳仅解决65.8%任务,揭示了开发者-智能体工作负载上的巨大提升空间。

Comments 19 pages, 5 figures, 8 tables

详情
AI中文摘要

智能体基准测试应反映用户实际要求部署的智能体执行的任务,然而现有基准往往缺失真实开发者-智能体会话的关键真实性属性。我们引入RealClawBench,一个基于真实OpenClaw会话构建的实时基准框架,以捕获已部署智能体使用的分布、多样性和实际难度。真实用户请求难以基准测试,因为它们通常依赖本地执行环境,涉及隐含或未明确指定的意图,并且需要非平凡的验证。RealClawBench通过两个核心机制解决这些挑战:重构的执行环境和确定性可验证评分器,共同将真实会话转化为可复现、自动评分的任务。最终发布的版本包含从更大真实会话池中采样的281个可执行任务,同时保留源分布,最终分布与源分布之间的最大Jensen-Shannon散度为0.0448。评估14个当代模型显示,最佳系统仅解决65.8%的任务,揭示了在真实开发者-智能体工作负载上存在巨大的提升空间。通过将真实部署会话转化为受控评估实例,RealClawBench提供了一条实际路径,以构建能更好衡量智能体在实际使用中能力的基准测试。代码见:https://anonymous.4open.science/r/real-claw-bench-582B。

英文摘要

Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks. The resulting release contains 281 executable tasks sampled from a much larger real-session pool while preserving the source distribution, with a maximum final-vs-source Jensen-Shannon divergence of 0.0448. Evaluating 14 contemporary models shows that the best system solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads. By turning real deployed sessions into controlled evaluation instances, RealClawBench provides a practical path toward benchmarks that better measure agent capability in actual use. Code is available at: https://anonymous.4open.science/r/real-claw-bench-582B.
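
For readers unfamiliar with the distribution check quoted above, the following is a minimal sketch of the Jensen-Shannon divergence between a source distribution and a sampled one. The category proportions are made-up illustrative numbers, and the base-2 logarithm is an assumption (the abstract does not state the base); this is not the benchmark's own code.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q against their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical task-category proportions (illustrative only).
source_pool = [0.40, 0.35, 0.25]
sampled_set = [0.42, 0.33, 0.25]
print(js_divergence(source_pool, sampled_set))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), so a small value indicates that the sample tracks the source pool closely.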

2606.04874 2026-06-08 cs.CL 版本更新

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

Agent规划基准:LLM Agent规划能力的诊断框架

Haoyu Sun, Wenxuan Wang, Mingyang Song, Jujie He, Weinan Zhang, Yang Liu, Yang Yang, Yu Cheng

发表机构 * Tongji University(同济大学) Shanghai AI Laboratory(上海人工智能实验室) Harbin Institute of Technology(哈尔滨工业大学) Fudan University(复旦大学) Skywork AI University of California, Santa Cruz(加州大学圣克鲁兹分校) Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出Agent规划基准(APB),通过4209个多模态案例和五个设置,诊断LLM Agent在长程规划、工具噪声鲁棒性、校准拒绝和推理时改进方面的系统弱点。

详情
AI中文摘要

规划是LLM Agent的核心:在行动之前,Agent必须分解目标、选择工具、推理约束并决定任务何时不可行。然而,现有的Agent评估通常只报告端到端的成功率,使得难以判断失败源于规划还是执行。我们引入了Agent规划基准(APB),一个针对规划的诊断基准,包含22个领域和五个设置下的4209个多模态案例,涵盖整体规划、反馈条件逐步规划以及在外来工具、损坏工具和不可解任务下的鲁棒性。在12个MLLM上,APB揭示了长程规划、工具噪声鲁棒性、校准拒绝和推理时改进方面的系统弱点。我们进一步在200个ToolSandbox任务和200个$τ^2$-bench任务上验证了APB,其中APB引导的改进在三个代表性模型上一致提高了计划正确性、计划等级和下游执行指标。因此,APB作为执行基准的上游诊断补充。

英文摘要

Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 $τ^2$-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks. The APB benchmark and code are available at https://github.com/Mikivishy/AgentPlanningBenchmark.

2606.05711 2026-06-08 cs.CL 版本更新

Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

超越Token:基于LLM的多智能体系统中潜在通信的统一框架

Yingzhuo Liu

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出一个三维统一框架(通信内容、发送-接收对齐、信息融合方式),系统分类2024-2026年间18种潜在通信方法,识别五种设计模式并揭示开放挑战。

详情
AI中文摘要

基于大型语言模型(LLM)构建的多智能体系统已成为处理复杂推理、规划和工具使用任务的主流范式。此类系统中的主导通信协议是自然语言:智能体逐token交换消息,将其内部推理过程语言化,以便同伴读取、验证和响应。尽管这种协议方便且可解释,但它存在三个结构性缺陷——高推理成本、离散化过程中不可逆的信息丢失以及自然语言的歧义/冗余。因此,越来越多的研究探索另一种协议——潜在通信——其中智能体直接交换连续表示(嵌入、隐藏状态或KV缓存),绕过文本生成的瓶颈。本文提出了一个统一框架,用于组织快速增长的潜在通信文献。我们沿着三个正交轴分析现有方法:(1)通信的WHAT信息(嵌入、隐藏状态、KV缓存或其他连续状态);(2)使用的WHICH发送-接收对齐(潜在空间对齐和层对齐);(3)通信信息如何融合到接收方(拼接、前置、数学运算、交叉注意力或缓存恢复)。在此三维框架下,我们系统分类了2024年至2026年间提出的18种代表性方法,识别出五种主要设计模式,并揭示了一系列开放挑战——包括跨架构对齐、潜在通道的安全性、边缘部署的压缩以及潜在通信与潜在思维链之间的关系。我们希望该框架既能降低新研究者的入门门槛,也能为比较未来工作提供一套词汇。

英文摘要

Multi-agent systems built on large language models (LLMs) have become a prevailing paradigm for tackling complex reasoning, planning, and tool-use tasks. The dominant communication protocol in such systems is natural language: agents exchange messages token-by-token, verbalising their internal reasoning so that peers can read, verify, and respond. While convenient and interpretable, this protocol suffers from three structural drawbacks -- high inference cost, irreversible information loss during discretization, and ambiguity/redundancy of natural language. A growing body of work therefore explores an alternative protocol -- latent communication -- in which agents exchange continuous representations (embeddings, hidden states, or KV-caches) directly, bypassing the bottleneck of text generation. This paper presents a unified framework for organising the rapidly expanding literature on latent communication. We analyse existing methods along three orthogonal axes: (1) WHAT information is communicated (Embeddings, Hidden States, KV-Caches, or other continuous state); (2) WHICH sender-receiver alignment is used (latent-space alignment and layer alignment); and (3) HOW the communicated information is fused into the receiver (concatenation, prepending, mathematical operations, cross-attention, or cache restoration). Under this 3-axis framework, we systematically categorise eighteen representative methods proposed between 2024 and 2026, identify five major design patterns, and surface a set of open challenges -- including cross-architecture alignment, security of latent channels, compression for edge deployment, and the relationship between latent communication and latent chain-of-thought. We hope that this framework both lowers the barrier to entry for new researchers and provides a vocabulary for comparing future work.

2502.00225 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Should You Use Your Large Language Model to Explore or Exploit?

你应该使用你的大语言模型进行探索还是利用?

Keegan Harris, Aleksandrs Slivkins

发表机构 * UC Berkeley(加州大学伯克利分校) Microsoft Research(微软研究院)

AI总结 研究当前大语言模型在探索-利用权衡中的决策能力,通过分离探索和利用任务评估其表现,发现推理模型在利用任务上最有潜力但成本高,非推理模型通过工具使用和上下文总结可提升中等难度任务性能,但在所有任务中均不如简单线性回归,然而LLM在具有语义的大动作空间探索中有帮助。

Comments Accepted to UAI 2026

详情
AI中文摘要

我们评估了当前一代大语言模型(LLMs)在面对探索-利用权衡时的决策能力。虽然先前的工作主要研究LLMs解决组合探索-利用任务的能力,我们采取了更系统的方法,将LLMs用于在各种(上下文)赌博机任务中分别进行探索和利用。我们发现推理模型在解决利用任务方面最有前景,尽管它们在实际应用中仍然过于昂贵或缓慢。受此启发,我们研究了非推理模型的工具使用和上下文总结。我们发现这些缓解措施可以显著提高中等难度任务的性能,但即便如此,我们研究的所有LLMs在所有任务中(包括非线性设置)的表现都不如简单的线性回归。另一方面,我们发现LLMs在探索具有内在语义的大动作空间时确实有帮助,通过建议合适的候选动作进行探索。

英文摘要

We evaluate the ability of the current generation of large language models (LLMs) to help a decision-making agent facing an exploration-exploitation tradeoff. While previous work has largely studied the ability of LLMs to solve combined exploration-exploitation tasks, we take a more systematic approach and use LLMs to explore and exploit in silos in various (contextual) bandit tasks. We find that reasoning models show the most promise for solving exploitation tasks, although they are still too expensive or too slow to be used in many practical settings. Motivated by this, we study tool use and in-context summarization using non-reasoning models. We find that these mitigations may be used to substantially improve performance on medium-difficulty tasks; however, even then, all LLMs we study perform worse than a simple linear regression, even in non-linear settings. On the other hand, we find that LLMs do help at exploring large action spaces with inherent semantics, by suggesting suitable candidates to explore.

2502.00527 2026-06-08 cs.LG cs.CL 版本更新

PolarQuant: Leveraging Polar Transformation for Efficient Key Cache Quantization and Decoding Acceleration

PolarQuant: 利用极坐标变换实现高效键缓存量化和解码加速

Songhao Wu, Ang Lv, Xiao Feng, Yufei Zhang, Xun Zhang, Guojun Yin, Wei Lin, Rui Yan

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) ShanghaiTech University(上海科技大学) Meituan(美团)

AI总结 提出PolarQuant方法,通过将键向量分组为二维子向量并编码为量化半径和极角,解决键缓存量化中的异常值问题,同时通过查表加速解码,保持全精度模型性能。

Comments NeurIPS 2025 version with minor revisions to the methodology

详情
AI中文摘要

大型语言模型中的KV缓存是内存使用的主要因素,限制了其更广泛的适用性。将缓存量化到更低的位宽是减少计算成本的有效方法;然而,先前的方法由于异常值的存在,难以量化键向量,导致过高的开销。我们提出了一种名为PolarQuant的新型量化方法,有效解决了异常值挑战。我们观察到,异常值通常只出现在两个维度中的一个,当应用旋转位置嵌入时,这两个维度会一起旋转特定角度。当表示为二维向量时,这些维度展现出结构良好的模式,半径和角度在极坐标中平滑分布。这减轻了异常值对逐通道量化的挑战,使其非常适合量化。因此,PolarQuant将键向量分为二维子向量组,将其编码为相应的量化半径和极角,而不是直接量化原始键向量。PolarQuant在KV缓存量化中实现了卓越的效率,并通过将查询-键内积转化为查表操作来加速解码过程,同时保持全精度模型的下游性能。

英文摘要

The KV cache in large language models is a dominant factor in memory usage, limiting their broader applicability. Quantizing the cache to lower bit widths is an effective way to reduce computational costs; however, previous methods struggle with quantizing key vectors due to outliers, resulting in excessive overhead. We propose a novel quantization approach called PolarQuant, which efficiently addresses the outlier challenge. We observe that outliers typically appear in only one of two dimensions, which are rotated together by a specific angle when rotary position embeddings are applied. When represented as two-dimensional vectors, these dimensions exhibit well-structured patterns, with radii and angles smoothly distributed in polar coordinates. This alleviates the challenge that outliers pose for per-channel quantization and makes the polar representation well suited to quantization. Thus, PolarQuant divides key vectors into groups of two-dimensional sub-vectors, encoding each as a quantized radius and polar angle rather than quantizing the original key vectors directly. PolarQuant achieves superior efficiency in KV cache quantization and accelerates the decoding process by turning the query-key inner product into a table lookup, all while maintaining the downstream performance of full-precision models.
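
The polar encoding step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the PolarQuant implementation: it pairs adjacent dimensions (which dimensions rotate together depends on the rotary-embedding implementation), uses uniform quantization with a single per-vector maximum radius, and omits the table-lookup decoding. The function names and bit widths are hypothetical.

```python
import math


def encode_pair(x, y, r_max, r_bits=4, theta_bits=4):
    """Quantize one 2-D sub-vector to integer (radius_code, angle_code)."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x) % (2 * math.pi)           # angle in [0, 2*pi)
    r_levels = (1 << r_bits) - 1
    t_levels = 1 << theta_bits
    r_code = min(r_levels, round(r / r_max * r_levels))
    t_code = round(theta / (2 * math.pi) * t_levels) % t_levels  # wrap 2*pi to 0
    return r_code, t_code


def decode_pair(r_code, t_code, r_max, r_bits=4, theta_bits=4):
    """Map integer codes back to an approximate (x, y) pair."""
    r = r_code / ((1 << r_bits) - 1) * r_max
    theta = t_code / (1 << theta_bits) * 2 * math.pi
    return r * math.cos(theta), r * math.sin(theta)


def quantize_key(key, r_bits=4, theta_bits=4):
    """Split a key vector into adjacent pairs (an assumption) and encode each pair."""
    pairs = [(key[i], key[i + 1]) for i in range(0, len(key), 2)]
    r_max = max(math.hypot(x, y) for x, y in pairs) or 1.0  # avoid dividing by zero
    codes = [encode_pair(x, y, r_max, r_bits, theta_bits) for x, y in pairs]
    return codes, r_max


def dequantize_key(codes, r_max, r_bits=4, theta_bits=4):
    """Reconstruct an approximate key vector from its polar codes."""
    out = []
    for r_code, t_code in codes:
        out.extend(decode_pair(r_code, t_code, r_max, r_bits, theta_bits))
    return out


key = [3.0, 4.0, -1.0, 0.5, 0.0, -2.0, 0.1, 0.1]
codes, r_max = quantize_key(key, r_bits=8, theta_bits=8)
approx = dequantize_key(codes, r_max, r_bits=8, theta_bits=8)
```

A large value in one dimension of a pair inflates only that pair's radius, while the angle stays in a bounded range, which gives some intuition for why the polar form is easier to quantize than the raw channels.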

2504.03635 2026-06-08 cs.AI cs.CL 版本更新

Finding the Minimal Parameter Budget for Implicit Reasoning: A Data Complexity Driven Scaling Law for Language Models

寻找隐式推理的最小参数预算:一种基于数据复杂度的语言模型缩放定律

Xinyi Wang, Shawn Tan, Shenbo Xu, Mingyu Jin, William Yang Wang, Rameswar Panda, Yikang Shen

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Toronto(多伦多大学) University of Tokyo(东京大学)

AI总结 本文通过控制合成环境中的预训练实验,发现语言模型隐式推理所需的最小参数预算与图搜索熵之间存在缩放定律,并确定了每参数最多可处理约0.008比特信息的容量上限。

Comments Accepted to ICML 2026

详情
AI中文摘要

推理是语言模型(LM)的核心能力,然而在预训练期间需要多少模型容量来支持推理仍不清楚。在这项工作中,我们研究了隐式推理所需的最小参数预算,隐式推理定义为无需显式思维链监督即可从所学知识中推断出新事实的能力。为了隔离这一现象,我们在一个受控的合成环境中从头开始预训练LM,该环境模拟了真实世界知识图谱的结构和分布,并通过多跳推理评估它们补全缺失边的能力。从理论和实证两个角度,我们确定了一个缩放定律,将该最优参数预算与图搜索熵度量联系起来。在广泛的模型大小、训练步数和图复杂度范围内,我们表明一个最优大小的语言模型最多可以可靠地处理每参数约0.008比特的信息。我们的结果刻画了预训练期间隐式推理所需的最小充分容量。我们的发现为匹配模型大小与数据复杂度提供了原则性指导,并为大型语言模型中推理的缩放行为提供了新见解。

英文摘要

Reasoning is a core capability of language models (LMs), yet it remains unclear how much model capacity is necessary to support reasoning during pretraining. In this work, we study the minimal parameter budget required for implicit reasoning, defined as the ability to infer new facts from learned knowledge without explicit chain-of-thought supervision. To isolate this phenomenon, we pretrain LMs from scratch in a controlled synthetic environment that mimics the structure and distribution of real-world knowledge graphs, and evaluate their ability to complete missing edges via multi-hop inference. From both a theoretical and an empirical perspective, we identify a scaling law linking this optimal parameter budget to a graph search entropy measure. Across a wide range of model sizes, training steps, and graph complexities, we show that an optimally sized language model can reliably reason over approximately 0.008 bits of information per parameter at most. Our results characterize the minimal sufficient capacity for implicit reasoning during pretraining. Our findings provide principled guidance for matching model size to data complexity and offer new insights into the scaling behavior of reasoning in large language models.
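
As a rough worked example of the capacity figure quoted above, the sketch below converts an entropy estimate into a minimum parameter count. It assumes the relationship can be read as a simple ratio (capacity in bits equals 0.008 times the parameter count) and that the graph search entropy is expressed in bits; the function name and the example entropy value are hypothetical, and the paper's actual scaling law may include further terms.

```python
BITS_PER_PARAM = 0.008  # approximate upper bound quoted in the abstract


def min_parameters(entropy_bits, bits_per_param=BITS_PER_PARAM):
    """Parameter count at which assumed capacity (bits_per_param * N) equals entropy_bits."""
    return entropy_bits / bits_per_param


# Hypothetical knowledge graph with a search entropy of one million bits.
print(f"{min_parameters(1e6):.3g}")
```

Under these assumptions, one million bits of graph search entropy would call for roughly 1.25e8 parameters, that is, about 125 parameters per bit.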

2507.01548 2026-06-08 cs.HC cs.AI cs.CL 版本更新

Telling stories, making Hanzi: AI-assisted co-creation with elderly migrants in urban China

讲述故事,创造汉字:人工智能辅助中国城市老年移民的协同创作

Yunfei Chen, Wen Zhan, Peiyue Lin, Ziqun Hua, Ying Hu

发表机构 * School of Design, Hunan University(湖南大学设计学院) Royal College of Art(皇家艺术学院) University of the Arts London, Central Saint Martins(伦敦艺术大学,中央圣马丁学院)

AI总结 通过协同创作工作坊,结合口述故事、AI辅助和手工制作,让老年移民创造新汉字以记录被忽视的生活故事,揭示参与者的异质性和适应能力,并展示AI作为降低表达门槛的创意启动器。

详情
AI中文摘要

本文探讨了中国城市老年移民如何记录日常语言和设计常忽略的故事。我们与10位老年人开展了两次协同创作工作坊。活动结合了口述故事、主持人中介的AI辅助和手工制作。大型语言模型通过主持人提出候选字形。参与者创作了新的汉字来承载他们的故事。生成的字符作为记忆锚点,用于后续的分享和复述。我们的解释性分析揭示了参与者之间的异质性和适应能力。参与者将AI视为降低表达和创作门槛的创意启动器,尤其对数字素养较低者。这项工作挑战了关于老年人的同质化假设以及统一能力和需求的预设。我们贡献了一个将AI定位为后台促进者的工作坊框架,并提供了在包容性城市系统中将老年移民视为社区记忆和情境文化知识来源的见解。

英文摘要

This paper explores how older migrants in urban China can record stories that everyday language and design often miss. We ran two co-creation workshops with 10 elders. Activities combined oral storytelling, facilitator-mediated AI assistance, and hand-making. Large language models proposed candidate glyphs through a facilitator. Participants crafted new Hanzi to hold their stories. The resulting characters served as memory anchors for later sharing and retelling. Our interpretive analysis shows heterogeneity and adaptive capacity among participants. Participants experienced AI as a creative initiator that lowered barriers to expression and making, especially for those with lower digital literacy. The work challenges homogenizing assumptions about older adults and the presumption of uniform capacities and needs. We contribute a workshop framework that positions AI as a backstage facilitator. We also offer insights on engaging older migrants as sources of community memory and situated cultural knowledge within inclusive urban systems.

2508.17693 2026-06-08 cs.DB cs.AI cs.CL 版本更新

Database Normalization via Dual-LLM Self-Refinement

通过双LLM自精炼的数据库规范化

Eunjae Jo, Nakyung Lee, Gyuyeong Kim

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Miffie框架,利用双模型自精炼架构和大语言模型实现数据库自动规范化,无需人工干预且保持高准确率。

Comments 7 pages

详情
AI中文摘要

数据库规范化对于保持数据完整性至关重要。然而,它通常由数据工程师手动执行,既耗时又容易出错。为此,我们提出了Miffie,一个利用大语言模型能力的数据库规范化框架。Miffie实现了无需人工努力的自动化数据规范化,同时保持高准确性。Miffie的核心是一种双模型自精炼架构,分别结合了性能最佳的模型用于规范化模式生成和验证。生成模块根据验证模块的反馈消除异常,直到输出模式满足规范化要求。我们还精心设计了任务特定的零样本提示,以引导模型实现高准确性和成本效率。实验结果表明,Miffie能够在保持高准确性的同时规范化复杂的数据库模式。

英文摘要

Database normalization is crucial to preserving data integrity. However, it is time-consuming and error-prone, as it is typically performed manually by data engineers. To this end, we present Miffie, a database normalization framework that leverages the capability of large language models. Miffie enables automated data normalization without human effort while preserving high accuracy. The core of Miffie is a dual-model self-refinement architecture that combines the best-performing models for normalized schema generation and verification, respectively. The generation module eliminates anomalies based on the feedback of the verification module until the output schema satisfies the requirement for normalization. We also carefully design task-specific zero-shot prompts to guide the models for achieving both high accuracy and cost efficiency. Experimental results show that Miffie can normalize complex database schemas while maintaining high accuracy.
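
The generate-then-verify loop described above can be outlined as follows. This is a minimal sketch of a generic self-refinement loop, not Miffie's code: `refine_schema`, the round limit, and the string schema format are assumptions, and the two `toy_*` functions are plain-Python stand-ins for the two LLM calls.

```python
from typing import Callable, List, Tuple

Schema = str
Generator = Callable[[Schema, List[str]], Schema]   # (schema, feedback) -> new schema
Verifier = Callable[[Schema], List[str]]            # schema -> list of anomalies


def refine_schema(schema: Schema, generate: Generator, verify: Verifier,
                  max_rounds: int = 5) -> Tuple[Schema, List[str]]:
    """Alternate generation and verification until no anomalies remain."""
    candidate, feedback = schema, []
    for _ in range(max_rounds):
        candidate = generate(candidate, feedback)
        feedback = verify(candidate)
        if not feedback:            # verifier accepts the schema
            break
    return candidate, feedback      # non-empty feedback means unresolved anomalies


def toy_verify(schema: Schema) -> List[str]:
    """Stand-in verifier: flags a customer attribute stored in the orders table."""
    if "customer_city" in schema and "customers(" not in schema:
        return ["customer_city depends on customer_id, not on order_id"]
    return []


def toy_generate(schema: Schema, feedback: List[str]) -> Schema:
    """Stand-in generator: splits the table once the verifier reports an anomaly."""
    if not feedback:
        return schema
    return "orders(order_id, customer_id); customers(customer_id, customer_city)"


final_schema, remaining = refine_schema(
    "orders(order_id, customer_id, customer_city)", toy_generate, toy_verify)
```

Returning the leftover feedback lets a caller distinguish a schema the verifier accepted from one that merely hit the round limit.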

2508.17821 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Limitations of Normalization in Attention Mechanism

注意力机制中归一化的局限性

Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

发表机构 * University of Luxembourg(卢森堡大学) London Institute for Mathematical Sciences(伦敦数学科学研究所)

AI总结 本文通过理论框架和GPT-2实验,揭示softmax归一化导致注意力随选择token数增加而趋于均匀,并分析低温度下梯度敏感性带来的训练挑战。

详情
AI中文摘要

本文研究了注意力机制中归一化的局限性。我们首先建立了一个理论框架,用于识别模型的选择能力以及token选择中涉及的几何分离。我们的分析包括在softmax缩放下token向量距离和分离准则的显式界限。通过使用预训练的GPT-2模型进行实验,我们实证验证了理论结果,并分析了注意力机制的关键行为。值得注意的是,我们证明随着所选token数量的增加,模型区分信息性token的能力下降,通常趋向于均匀选择模式。我们还表明,softmax归一化下的梯度敏感性在训练过程中带来了挑战,尤其是在低温度设置下。这些发现推进了当前对基于softmax的注意力机制的理解,并激发了在未来注意力架构中需要更稳健的归一化和选择策略的需求。

英文摘要

This paper investigates the limitations of normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token selection. Our analysis includes explicit bounds on distances and separation criteria for token vectors under softmax scaling. Through experiments with a pre-trained GPT-2 model, we empirically validate our theoretical results and analyze key behaviors of the attention mechanism. Notably, we demonstrate that as the number of selected tokens increases, the model's ability to distinguish informative tokens declines, often converging toward a uniform selection pattern. We also show that gradient sensitivity under softmax normalization presents challenges during training, especially at low temperature settings. These findings advance current understanding of softmax-based attention mechanisms and motivate the need for more robust normalization and selection strategies in future attention architectures.
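
The dilution effect mentioned above can be seen in a toy setting. The sketch below is not the paper's analysis: it assumes one informative token whose logit exceeds that of every distractor by a fixed `gap`, and shows how the softmax weight on that token changes with context size and temperature. All names and values are illustrative.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_weight(n_tokens, gap=2.0, temperature=1.0):
    """Weight on one informative token (logit = gap) among n_tokens - 1 distractors (logit = 0)."""
    scores = [gap] + [0.0] * (n_tokens - 1)
    return softmax(scores, temperature)[0]


for n in (8, 64, 512, 4096):
    # Compare the informative token's weight with the uniform weight 1/n.
    print(n, top_weight(n), 1.0 / n)
```

In this setting the informative token's weight equals exp(gap) / (exp(gap) + n - 1), so with a bounded logit gap it shrinks on the order of 1/n as the context grows. Lowering the temperature sharpens the distribution again, which connects to the low-temperature regime where the abstract reports gradient-sensitivity issues.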

2512.14391 2026-06-08 cs.LG cs.AI cs.CL 版本更新

RePo: Language Models with Context Re-Positioning

RePo:具有上下文重定位的语言模型

Huayang Li, Tianyu Zhao, Deng Cai, Richard Sproat

发表机构 * University of Maryland(马里兰大学)

AI总结 提出RePo机制,通过可微分模块重新分配token位置以减轻注意力层负担,在噪声上下文、结构化数据和长上下文任务上持续提升性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

上下文学习是现代大型语言模型(LLM)的基础;然而,主流架构通过分配线性或常数的位置索引来施加刚性且固定的上下文结构。刚性的位置信息将组织输入结构的全部负担强加给注意力层,从而减少了可用于更关键信息的注意力量。为了解决这个问题,我们提出了RePo,一种通过上下文重定位来减轻注意力层负担的新机制。与传统方法不同,RePo利用可微分模块$f_ϕ$来分配捕获上下文依赖关系的token位置,而不是依赖预定义的顺序。通过在OLMo-2 1B和7B模型上持续预训练,我们证明RePo在涉及噪声上下文、结构化数据和更长上下文长度的任务上持续提升性能,同时在一般短上下文任务上保持有竞争力的性能。分析表明,RePo成功地将更多注意力分配给遥远但相关的信息,在密集且非线性的空间中分配位置,并捕获输入上下文的内在结构。我们的代码位于https://github.com/SakanaAI/repo。

英文摘要

In-context learning is fundamental to modern Large Language Models (LLMs); however, prevailing architectures impose a rigid and fixed contextual structure by assigning linear or constant positional indices. This rigid position information places the full burden of organizing the input structure on the attention layers, thus reducing the amount of attention that could be allocated to more critical information. To address this, we propose RePo, a novel mechanism that alleviates the burden on attention layers via context re-positioning. Unlike conventional approaches, RePo utilizes a differentiable module, $f_ϕ$, to assign token positions that capture contextual dependencies, rather than relying on a pre-defined order. Through continual pre-training of the OLMo-2 1B and 7B models, we demonstrate that RePo consistently enhances performance on tasks involving noisy contexts, structured data, and longer context length, while maintaining competitive performance on general short-context tasks. Analysis reveals that RePo successfully allocates more attention mass to distant but relevant information, assigns positions in a dense and non-linear space, and captures the intrinsic structure of the input context. Our code is at https://github.com/SakanaAI/repo.

2602.02014 2026-06-08 cs.CV cs.AI cs.CL cs.LG 版本更新

Rethinking Genomic Modeling Through Optical Character Recognition

通过光学字符识别重新思考基因组建模

Hongxin Xiang, Pengsen Ma, Yunkang Cao, Di Yu, Haowen Chen, Xinyu Yang, Xiangxiang Zeng

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出OpticalDNA框架,将DNA渲染为视觉布局,利用视觉语言模型进行OCR式基因组理解,实现高保真压缩和长序列高效处理,在450k碱基序列上以近20倍更少有效token超越基线模型。

Comments Accepted by ICML 2026

详情
AI中文摘要

最近的基因组基础模型大多采用大型语言模型架构,将DNA视为一维token序列。然而,穷举式顺序阅读在结构上与稀疏且不连续的基因组语义不匹配,导致在低信息背景上的计算浪费,并阻碍了面向长上下文的压缩理解。在此,我们提出OpticalDNA,一个基于视觉的框架,将基因组建模重新定义为光学字符识别(OCR)风格的文档理解。OpticalDNA将DNA渲染为结构化视觉布局,并训练一个具备OCR能力的视觉语言模型,该模型包含视觉DNA编码器和文档解码器,其中编码器生成紧凑、可重建的视觉token以实现高保真压缩。基于这种表示,OpticalDNA定义了基于提示条件的核心基因组原语目标——读取、区域定位、子序列检索和掩码跨度补全——从而学习到布局感知的DNA表示,在减少的有效token预算下保留细粒度的基因组信息。在多种基因组基准测试中,OpticalDNA持续优于最近的基线模型;在长达450k碱基的序列上,它以近20倍更少的有效token实现了最佳整体性能,并且仅调整256k可训练参数就超越了激活参数多达985倍的模型。

英文摘要

Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision-language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20$\times$ fewer effective tokens, and surpasses models with up to 985$\times$ more activated parameters while tuning only 256k trainable parameters.

2602.03160 2026-06-08 cs.AI cs.CL 版本更新

VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models

VALUEFLOW:迈向大语言模型中多元化和可引导的基于价值的对齐

Woojin Kim, Sieun Hyeon, Jusang Oh, Jaeyoung Do

发表机构 * Department of Electrical and Computer Engineering, Seoul National University(首尔国立大学电气与计算机工程系) Interdisciplinary Program in Artificial Intelligence, Seoul National University(首尔国立大学人工智能交叉学科项目)

AI总结 提出VALUEFLOW框架,通过分层价值嵌入、强度标注数据库和锚定评估器,实现大语言模型在价值强度上的可控对齐,解决现有方法在提取、评估和引导方面的不足。

Comments Accepted in ICML 2026 (Oral). Code available at https://github.com/AIDASLab/VALUEFLOW

详情
AI中文摘要

将大语言模型(LLMs)与人类价值的多元光谱对齐仍然是一个核心挑战:基于偏好的方法通常无法捕捉更深层次的动机原则。基于价值的方法提供了更原则性的路径,但仍存在三个差距:提取常常忽略层次结构,评估检测存在但未校准强度,并且LLMs在受控强度下的可引导性仍未得到充分理解。为解决这些限制,我们引入了VALUEFLOW,这是第一个统一框架,涵盖提取、评估和引导,并具有校准的强度控制。该框架整合了三个组件:(i) HIVES,一个层次化价值嵌入空间,捕捉理论和跨理论的价值结构;(ii) 价值强度数据库(VIDB),一个大规模资源,包含基于排序聚合得出的强度估计的价值标注文本;(iii) 一个基于锚点的评估器,通过将模型输出与VIDB面板进行排序,产生一致的强度分数。使用VALUEFLOW,我们在十个模型和四个价值理论上进行了全面的大规模研究,识别了可引导性的不对称性和多价值控制的组合规律。本文建立了一个可扩展的基础设施,用于评估和控制价值强度,推进了LLMs的多元化对齐。

英文摘要

Aligning Large Language Models (LLMs) with the diverse spectrum of human values remains a central challenge: preference-based methods often fail to capture deeper motivational principles. Value-based approaches offer a more principled path, yet three gaps persist: extraction often ignores hierarchical structure, evaluation detects presence but not calibrated intensity, and the steerability of LLMs at controlled intensities remains insufficiently understood. To address these limitations, we introduce VALUEFLOW, the first unified framework that spans extraction, evaluation, and steering with calibrated intensity control. The framework integrates three components: (i) HIVES, a hierarchical value embedding space that captures intra- and cross-theory value structure; (ii) the Value Intensity DataBase (VIDB), a large-scale resource of value-labeled texts with intensity estimates derived from ranking-based aggregation; and (iii) an anchor-based evaluator that produces consistent intensity scores for model outputs by ranking them against VIDB panels. Using VALUEFLOW, we conduct a comprehensive large-scale study across ten models and four value theories, identifying asymmetries in steerability and composition laws for multi-value control. This paper establishes a scalable infrastructure for evaluating and controlling value intensity, advancing pluralistic alignment of LLMs.

2602.06941 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Endogenous Resistance to Activation Steering in Language Models

语言模型中激活引导的内生抵抗

Alex McKenzie, Keenan Pepper, Stijn Servaes, Martin Leitgab, Murat Cubuktepe, Mike Vaiana, Diogo de Lucena, Judd Rosenblatt, Michael S. A. Graziano

发表机构 * University of Washington(华盛顿大学)

AI总结 研究发现大型语言模型在任务不匹配的激活引导下能内生抵抗,通过显式重启恢复正确生成,并识别出相关稀疏自编码器潜在变量,可增强或削弱该抵抗。

AI中文摘要

大型语言模型可以在生成过程中从任务不匹配的激活引导中恢复,产生显式的语言重启(例如,“等等,那不对”),并在引导扰动仍然活跃的情况下继续讨论主题。我们将此称为内生引导抵抗(ESR)。使用稀疏自编码器(SAE)潜在变量来引导模型激活,我们发现Llama-3.3-70B在\llamaseventyEsrRate\%的情况下表现出显式ESR,而来自Llama-3和Gemma-2系列的较小模型则较少出现显式形式。两个对照实验将ESR分解为检测事件和持续抵抗组件,后者不能仅由最近的on-topic token条件化来完全解释。我们通过对比on-topic/off-topic搜索识别出\numOtdLatents{}个SAE潜在变量;将其零消融使多次尝试率降低\multiAttemptReductionPct\%,随机潜在变量和保留提示对照支持特异性。ESR还可以通过元提示和基于合成自我纠正示例的微调来有意增强。ESR对安全性具有双重影响:它可能使模型对对抗性激活空间操纵更具抵抗力,但同样可能干扰有益的基于引导的干预,因为模型无法区分两者。代码可在 https://github.com/agencyenterprise/endogenous-steering-resistance 获取。

英文摘要

Large language models can recover mid-generation from task-misaligned activation steering, producing explicit verbal restarts (e.g., "wait, that's not right") and continuing on-topic even while the steering perturbation remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B exhibits explicit ESR at \llamaseventyEsrRate\%, with smaller models from the Llama-3 and Gemma-2 families showing the explicit form less frequently. Two controls dissociate ESR into a detection event and a sustained-resistance component that conditioning on recent on-topic tokens does not fully explain. We identify \numOtdLatents{} SAE latents through contrastive on-topic/off-topic search; zero-ablating them reduces the multi-attempt rate by \multiAttemptReductionPct\%, with random-latent and held-out-prompt controls supporting specificity. ESR can also be deliberately enhanced through both meta-prompting and fine-tuning on synthetic self-correction examples. ESR has dual implications for safety: it could harden models against adversarial activation-space manipulation, but may equally interfere with beneficial steering-based interventions, since the model has no way to distinguish the two. Code is available at https://github.com/agencyenterprise/endogenous-steering-resistance.

2602.08857 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Discovering Interpretable Algorithms by Decompiling Transformers to RASP

通过将Transformer反编译为RASP发现可解释算法

Xinting Huang, Aleksandra Bakalova, Satwik Bhattamishra, William Merrill, Michael Hahn

发表机构 * Saarland Informatics Campus, Saarland University(萨尔兰大学信息学院校区,萨尔兰大学) University of Oxford(牛津大学) Allen Institute for AI(艾伦人工智能研究所)

AI总结 提出一种将训练好的Transformer忠实重参数化为RASP程序,并通过因果干预发现小型充分子程序的方法,实验表明长度泛化的Transformer内部实现了简单可解释的RASP程序。

Comments 104 pages, 92 figures. Accepted for publication at ICML 2026

AI中文摘要

近期研究表明,Transformer的计算可以在RASP编程语言家族中模拟。这些发现增进了对Transformer表达能力和泛化能力的理解。特别是,Transformer被建议在具有简单RASP程序的问题上精确实现长度泛化。然而,训练模型是否实际实现了简单的可解释程序仍是一个开放问题。在本文中,我们提出了一种从训练好的Transformer中提取此类程序的通用方法。其思想是将Transformer忠实地重参数化为RASP程序,然后应用因果干预来发现一个小的充分子程序。在算法和形式语言任务上训练的小型Transformer实验中,我们表明我们的方法通常能从长度泛化的Transformer中恢复简单且可解释的RASP程序。我们的结果提供了迄今为止最直接的证据,证明Transformer内部实现了简单的RASP程序。

英文摘要

Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of Transformers. In particular, Transformers have been suggested to length-generalize exactly on problems that have simple RASP programs. However, it remains open whether trained models actually implement simple interpretable programs. In this paper, we present a general method to extract such programs from trained Transformers. The idea is to faithfully re-parameterize a Transformer as a RASP program and then apply causal interventions to discover a small sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, we show that our method often recovers simple and interpretable RASP programs from length-generalizing Transformers. Our results provide the most direct evidence so far that Transformers internally implement simple RASP programs.
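
To make the notion of a "simple RASP program" concrete, the following is a minimal Python sketch of two RASP-style primitives, select and aggregate, together with a tiny program built from them. The function names, the uniform-averaging semantics, and the example task are illustrative assumptions; this is not the extraction method from the paper, nor a program recovered by it.

```python
def select(keys, queries, predicate):
    """Boolean attention pattern: row i marks the key positions that query i attends to."""
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector, values, default=0.0):
    """Uniformly average the selected values for each query position."""
    out = []
    for row in selector:
        picked = [v for v, keep in zip(values, row) if keep]
        out.append(sum(picked) / len(picked) if picked else default)
    return out


def frac_prev_equal(tokens, target):
    """Hypothetical example program: for each position i, the fraction of
    tokens at positions <= i that equal `target`."""
    idx = list(range(len(tokens)))
    causal = select(idx, idx, lambda k, q: k <= q)  # causal (prefix) attention
    is_target = [1.0 if t == target else 0.0 for t in tokens]
    return aggregate(causal, is_target)


if __name__ == "__main__":
    print(frac_prev_equal(list("abba"), "a"))
```

The program is one select followed by one aggregate, which is the kind of short, readable structure the abstract refers to as a simple RASP program.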

2602.14209 2026-06-08 cs.LG cs.CL 版本更新

MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM

MAGE:在块扩散LLM中,全[MASK]块已经知道在哪里看

Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park, Jae W. Lee

发表机构 * Seoul National University(首尔国立大学) Meta

AI总结 针对块扩散LLM长上下文推理中KV缓存导致的内存瓶颈,提出无训练方法MAGE,利用块扩散训练目标的对齐特性,在第一步确定整个轨迹的KV子集,实现近无损精度和显著加速。

AI中文摘要

块扩散LLM是一种并行语言生成的新兴范式,但其KV缓存使得内存访问成为长上下文推理中的主要瓶颈。稀疏注意力(每个查询仅关注少量KV子集)可以在最小化精度损失的情况下减少延迟。然而,在块扩散中,每个块的B个token必须共享一个KV子集,我们证明这种每块约束会使现有稀疏KV估计器的召回率下降高达25%。为了解决这一挑战,我们利用了块扩散训练目标中出现的一个特性:它将去噪步骤中的块平均查询对齐,因此第一步的全[MASK]块已经揭示了整个轨迹中每块的KV子集。我们在MAGE([MASK]引导的稀疏注意力)中利用了这一特性,这是一种无训练方法,在第一步执行一次精确注意力,并在块内的所有剩余步骤中重用其top-k索引集。在LongBench上的三个块扩散家族中,MAGE在k=512时匹配精确注意力,精度几乎无损,在128K上下文中实现高达6.82倍的端到端加速,并且比分别为自回归LLM和全双向扩散LLM设计的Quest和SparseD快3.35倍和2.28倍。

英文摘要

Block diffusion LLMs are an emerging paradigm for parallel language generation, but their KV caching makes memory access the dominant bottleneck in long-context inference. Sparse attention, which attends only to a small KV subset per query, can reduce this latency with minimal accuracy loss. In block diffusion, however, the B tokens of each block must share a single KV subset, and we show this per-block constraint degrades existing sparse KV estimators by up to 25% in recall. We address this challenge by exploiting a property that emerges from the block-diffusion training objective: it aligns the block-average query across denoising steps, so the All-[MASK] block at the first step already reveals the per-block KV subset for the entire trajectory. We exploit this in MAGE ([MASK]-Guided Sparse Attention), a training-free method that runs one exact attention pass at the first step and reuses its top-k index sets for all remaining steps within the block. Across three block-diffusion families on LongBench, MAGE matches Exact Attention at k=512 with near-lossless accuracy, achieves up to 6.82x end-to-end speedup at 128K context, and runs up to 3.35x and 2.28x faster than Quest and SparseD, designed for AR LLMs and fully bidirectional diffusion LLMs, respectively.
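
The select-once, reuse-afterwards idea can be sketched in a few lines of plain Python. This is a toy single-head illustration under stated assumptions: scoring keys against the block-average query, the helper names, and the list-of-lists tensors are all hypothetical, and the real method operates on model KV caches rather than Python lists.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_indices(block_queries, keys, k):
    """Score every key against the block-average query and keep the k best indices."""
    dim = len(keys[0])
    avg_q = [sum(q[d] for q in block_queries) / len(block_queries) for d in range(dim)]
    scores = [dot(avg_q, key) for key in keys]
    best = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(best)


def sparse_attention(query, keys, values, idx):
    """Softmax attention restricted to the cached index set `idx`."""
    logits = [dot(query, keys[j]) / math.sqrt(len(query)) for j in idx]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim_v = len(values[0])
    return [sum(w * values[j][d] for w, j in zip(weights, idx)) / z for d in range(dim_v)]


def denoise_block(queries_per_step, keys, values, k):
    """Pick the KV subset once at step 0, then reuse it for every denoising step."""
    idx = topk_indices(queries_per_step[0], keys, k)  # the only full scoring pass
    outputs = []
    for block_queries in queries_per_step:
        outputs.append([sparse_attention(q, keys, values, idx) for q in block_queries])
    return idx, outputs
```

The point of the design is that the cost of scoring all keys is paid once per block; every later denoising step touches only k cached entries.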

2602.18905 2026-06-08 cs.LG cs.AI cs.CL 版本更新

TRUE: A Trustworthy Unified Explanation Framework for Large Language Model Reasoning

TRUE:一种用于大语言模型推理的可信统一解释框架

Yujiao Yang

发表机构 * Dalian University of Technology(大连理工大学)

AI总结 提出TRUE框架,通过可执行推理验证、可行域DAG建模和因果故障模式分析,为LLM推理提供实例级、局部结构级和类别级的多层次可验证解释。

AI中文摘要

大型语言模型(LLM)在复杂推理任务中展现出强大能力,但其决策过程仍难以解释。现有解释方法通常缺乏可信的结构性洞察,且局限于单实例分析,无法揭示推理稳定性和系统性故障机制。为解决这些局限,我们提出可信统一解释框架(TRUE),该框架集成了可执行推理验证、可行域有向无环图(DAG)建模和因果故障模式分析。在实例层面,我们将推理轨迹重新定义为可执行过程规范,并引入盲执行验证来评估操作有效性。在局部结构层面,我们通过结构一致性扰动构建可行域DAG,从而显式刻画局部输入空间中推理稳定性和可执行区域。在类别层面,我们引入因果故障模式分析方法,识别重复出现的结构性故障模式,并使用Shapley值量化其因果影响。在多个推理基准上的广泛实验表明,所提框架提供了多层次、可验证的解释,包括单个实例的可执行推理结构、邻近输入的可行域表示以及类别层面具有量化重要性的可解释故障模式。这些结果建立了一个统一且原则性的范式,用于提高LLM推理系统的可解释性和可靠性。

英文摘要

Large language models (LLMs) have demonstrated strong capabilities in complex reasoning tasks, yet their decision-making processes remain difficult to interpret. Existing explanation methods often lack trustworthy structural insight and are limited to single-instance analysis, failing to reveal reasoning stability and systematic failure mechanisms. To address these limitations, we propose the Trustworthy Unified Explanation Framework (TRUE), which integrates executable reasoning verification, feasible-region directed acyclic graph (DAG) modeling, and causal failure mode analysis. At the instance level, we redefine reasoning traces as executable process specifications and introduce blind execution verification to assess operational validity. At the local structural level, we construct feasible-region DAGs via structure-consistent perturbations, enabling explicit characterization of reasoning stability and the executable region in the local input space. At the class level, we introduce a causal failure mode analysis method that identifies recurring structural failure patterns and quantifies their causal influence using Shapley values. Extensive experiments across multiple reasoning benchmarks demonstrate that the proposed framework provides multi-level, verifiable explanations, including executable reasoning structures for individual instances, feasible-region representations for neighboring inputs, and interpretable failure modes with quantified importance at the class level. These results establish a unified and principled paradigm for improving the interpretability and reliability of LLM reasoning systems.
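
As a rough illustration of how Shapley values can quantify the influence of failure modes, the sketch below computes exact Shapley values for a small set of "players" by averaging marginal contributions over all orderings. The failure-mode names and the value function are made-up placeholders, and the paper's causal analysis may differ substantially from this textbook formulation.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to a real number. The cost grows
    factorially, so this is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_failure_rate(modes):
    """Hypothetical value function: failure rate when a set of failure modes is present."""
    rate = 0.0
    if "wrong_operand" in modes:
        rate += 0.1
    if "skipped_step" in modes:
        rate += 0.2
    if "wrong_operand" in modes and "skipped_step" in modes:
        rate += 0.3  # interaction term, split evenly between the two modes
    return rate


if __name__ == "__main__":
    modes = ["wrong_operand", "skipped_step", "formatting"]
    print(shapley_values(modes, toy_failure_rate))
```

By construction the attributions sum to the value of the full set, which is what makes them usable as an importance score per failure mode.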

2603.24481 2026-06-08 cs.AI cs.CL cs.LG 版本更新

Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

基于一致性验证的多智能体推理改进医学多项选择题问答中的不确定性校准

John Ray B. Martinez

发表机构 * Department of Data Science and Analytics(数据科学与分析系)

AI总结 提出多智能体框架,结合领域专家智能体与两阶段验证及S分数加权融合,在医学MCQA中显著降低校准误差并提升判别能力。

Comments 20 pages, 6 figures. Preprint under review

AI中文摘要

校准不良的置信度分数是AI在临床环境中部署的实际障碍。总是过度自信的模型无法为延迟决策提供有用信号。我们提出了一个多智能体框架,结合领域特定专家智能体与两阶段验证(Wu等人,2024)和S分数加权融合,以改进医学多项选择题问答中的校准和判别能力。四个专家智能体(呼吸科、心脏病科、神经科、胃肠科)使用Qwen2.5-7B-Instruct生成独立诊断。每个诊断经历两阶段自我验证过程,测量内部一致性并产生专家置信度分数(S分数)。S分数驱动加权融合策略,选择最终答案并校准报告的置信度。我们在MedQA-USMLE和MedMCQA的高分歧子集(100和250个问题)上进行评估。所有结果均针对此过滤后的设置。在MedQA-250上,完整系统实现了ECE=0.091(比单专家基线降低74.4%)和AUROC=0.630(+0.056),准确率为59.2%。在所有四种设置中,校准增益保持在49-74%。消融分析表明,两阶段验证驱动ECE降低,而多智能体推理驱动AUROC提升,表明一致性检查和集成聚合解决了LLM不确定性的不同失败模式。由此产生的置信度信号是否足以在实践中支持临床延迟决策,仍是未来研究的方向。

英文摘要

Miscalibrated confidence scores are a practical obstacle to deploying AI in clinical settings. A model that is always overconfident offers no useful signal for deferral. We present a multi-agent framework that combines domain-specific specialist agents with Two-Phase Verification (Wu et al., 2024) and S-Score Weighted Fusion to improve both calibration and discrimination in medical multiple-choice question answering. Four specialist agents (respiratory, cardiology, neurology, gastroenterology) generate independent diagnoses using Qwen2.5-7B-Instruct. Each diagnosis undergoes a two-phase self-verification process that measures internal consistency and produces a Specialist Confidence Score (S-score). The S-scores drive a weighted fusion strategy that selects the final answer and calibrates the reported confidence. We evaluate on high-disagreement subsets of MedQA-USMLE and MedMCQA (100 and 250 questions). All results are specific to this filtered regime. On MedQA-250, the full system achieves ECE = 0.091 (74.4% reduction over the single-specialist baseline) and AUROC = 0.630 (+0.056) at 59.2% accuracy. Calibration gains of 49-74% hold across all four settings. Ablation analysis reveals that Two-Phase Verification drives ECE reduction while multi-agent reasoning drives AUROC improvement, suggesting that consistency checking and ensemble aggregation address different failure modes of LLM uncertainty. Whether the resulting confidence signal is sufficient to support clinical deferral decisions in practice remains a direction for future investigation.
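
The two quantities at the center of this abstract, expected calibration error (ECE) and a score-weighted vote over specialists, can be sketched as follows. The ECE function uses the standard equal-width binning definition; the fusion rule (sum the scores per answer, report the winner's share of total weight as confidence) is an assumption made for illustration and is not taken from the paper.

```python
from collections import defaultdict


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin-size-weighted mean of |accuracy - average confidence|."""
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        b = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[b].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += len(items) / n * abs(accuracy - avg_conf)
    return ece


def weighted_fusion(votes):
    """Hypothetical fusion rule over (answer, s_score) pairs.

    Returns the answer with the largest total score and that answer's
    share of the total score as the reported confidence.
    """
    weight = defaultdict(float)
    for answer, s_score in votes:
        weight[answer] += s_score
    total = sum(weight.values())
    best = max(weight, key=weight.get)
    return best, (weight[best] / total if total > 0 else 0.0)


if __name__ == "__main__":
    print(weighted_fusion([("A", 0.9), ("B", 0.4), ("A", 0.3), ("C", 0.4)]))
    print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, False, True, True]))
```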

2604.07821 2026-06-08 cs.MA cs.AI cs.CL 版本更新

More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration

能力越强,合作越少?当LLM在零成本协作中失败时

Advait Yadav, Sid Black, Oliver Sourbut

发表机构 * GitHub

AI总结 研究LLM在多智能体系统中零成本协作的失败原因,通过构建去战略复杂性的环境,发现能力更强的模型(如o3)反而合作更差,并区分了能力失败与主动信息隐瞒,提出针对性干预措施。

Comments Accepted to the ICML 2026 main conference

AI中文摘要

大语言模型(LLM)智能体越来越多地在多智能体系统中协调,但我们缺乏对合作失败地点和原因的理解。许多现实世界的协调问题并非社会困境:帮助他人——分享文档、为队友扫清障碍——对帮助者几乎不花费成本,同时产生巨大的集体利益。LLM智能体在这种帮助免费且被明确指示合作的机制下是否合作,仍然未知。我们构建了一个基于回合的多智能体环境,剥离了所有战略复杂性,使合作无成本且微不足道地最优。在八个广泛使用的LLM中,能力并不能预测合作:OpenAI o3仅达到最优集体性能的17%,而较弱的o3-mini达到50%,尽管有相同的最大化群体收入的指令。使用一种自动化智能体通信一方的因果分解方法,我们将合作失败与能力失败分开,并发现几个有能力的模型在隐瞒信息方面表现积极,尽管从隐瞒中一无所获。针对性的干预措施解决了每种模式:明确的协议使能力受限模型的性能大约翻倍,而小的分享激励则解锁了合作受限模型。我们的结果表明,仅靠扩展智能无法解决多智能体系统中的协调问题,需要深思熟虑的合作设计,即使帮助不花费任何成本。

英文摘要

Large language model (LLM) agents increasingly coordinate in multi-agent systems, yet we lack an understanding of where and why cooperation fails. Many real-world coordination problems are not social dilemmas: helping others -- sharing documentation, unblocking a teammate -- costs the helper almost nothing while producing substantial collective benefit. Whether LLM agents cooperate in this regime, where helping is free and they are explicitly instructed to do so, remains unknown. We build a turn-based multi-agent environment that strips away all strategic complexity, making cooperation costless and trivially optimal. Across eight widely used LLMs, capability does not predict cooperation: OpenAI o3 reaches only 17% of optimal collective performance while the weaker o3-mini reaches 50%, despite identical instructions to maximize group revenue. Using a causal decomposition that automates one side of agent communication, we separate cooperation failures from competence failures, and find that several capable models actively withhold information despite gaining nothing from withholding. Targeted interventions address each mode: explicit protocols roughly double the performance of competence-limited models, while small sharing incentives unlock cooperation-limited ones. Our results suggest that scaling intelligence alone will not solve coordination in multi-agent systems; deliberate cooperative design will be required, even when helping costs nothing.

2604.09552 2026-06-08 cs.IR cs.AI cs.CL 版本更新

MCERF: Advancing Multimodal LLM Evaluation of Engineering Documentation with Enhanced Retrieval

MCERF:通过增强检索推进工程文档的多模态大语言模型评估

Kiarash Naghavi Khanghah, Hoang Anh Nguyen, Anna C. Doris, Amir Mohammad Vahedi, Daniele Grandi, Faez Ahmed, Hongyi Xu

发表机构 * School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut, Storrs, CT 06269(机械、航空航天与制造工程学院,康涅狄格大学,斯托尔斯,CT 06269) Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA(机械工程系,麻省理工学院,剑桥,MA 02139,美国)

AI总结 提出MCERF框架,结合多模态检索器ColPali与大语言模型推理,通过混合查找、视觉文本融合、高推理和自一致性决策等策略,在DesignQA基准上实现平均准确率相对提升41.1%,无需完整规则书摄入即可处理工程文档中的多模态问答。

AI中文摘要

工程规则书和技术标准包含密集文本、表格和插图等多模态信息,对检索增强生成(RAG)系统构成挑战。基于依赖全文摄入和文本检索的DesignQA框架[1],本工作建立了多模态ColPali增强检索与推理框架(MCERF),该系统将多模态检索器与大语言模型推理相结合,实现从工程文档中准确高效地回答问题。该系统采用ColPali检索文本和视觉信息,并采用多种检索与推理策略:(i)混合查找模式用于显式规则提及,(ii)视觉到文本融合用于图形和表格引导的查询,(iii)高推理大语言模型模式用于复杂的多模态问题,以及(iv)自一致性决策以稳定响应。模块化框架设计为未来的多模态系统提供了可重用模板,无论底层模型架构如何。此外,本工作建立并比较了两种路由方法:单案例路由方法和多智能体系统,两者均动态分配查询到最优管道。在DesignQA基准上的评估表明,该系统在所有任务上的平均准确率相比基线RAG最佳结果相对提升了41.1%,这是多模态和推理密集型任务上的显著改进,且无需完整规则书摄入。这表明视觉语言检索、模块化推理和自适应路由如何在工程用例中实现可扩展的文档理解。

英文摘要

Engineering rulebooks and technical standards contain multimodal information, such as dense text, tables, and illustrations, that is challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, together with multiple retrieval and reasoning strategies: (i) a Hybrid Lookup mode for explicit rule mentions, (ii) Vision-to-Text fusion for figure- and table-guided queries, (iii) a High Reasoning LLM mode for complex multimodal questions, and (iv) a Self-Consistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of the underlying model architecture. Furthermore, this work establishes and compares two routing approaches: a single-case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark shows that the system improves average accuracy across all tasks by a relative +41.1% over the best baseline RAG results, a significant improvement on multimodal and reasoning-intensive tasks achieved without complete rulebook ingestion. This shows how vision-language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.

2605.08692 2026-06-08 cs.LG cs.CL 版本更新

AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization

AAAC: 面向4位LLM权重量化的激活感知自适应码本

Beshr IslamBouli, David Jin

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出AAAC方法,通过每层两个小型学习码本(64字节)替代固定标量码本,以激活加权重建误差最小化选择码本,实现零额外存储开销的4位权重量化,在3-30分钟内完成量化,精度优于现有方法。

AI中文摘要

训练后仅权重量化至4位被广泛用于减少大语言模型推理的内存和计算成本。现有的PTQ方法,如AWQ和GPTQ,通过缩放、裁剪或误差补偿改进权重映射到固定4位网格的方式。为进一步提高精度,OmniQuant和QuIP#等方法使用梯度辅助算法,但需要数小时的量化时间。在这项工作中,我们提出AAAC(激活感知自适应码本),一种用于4位LLM权重量化的轻量级方法。AAAC用每层两个小型学习标量码本(64字节)替换标准量化中使用的固定标量码本。每组权重选择使激活加权重建误差最小的码本,将选择编码在组正缩放的未使用符号位中,并增加零存储开销。AAAC在单个GPU上3-30分钟内完成,且不增加模型本身之外的额外内存。我们跨模型族与AWQ、GPTQ、IF4、GPTVQ、OmniQuant、SqueezeLLM和QuIP#进行评估。AAAC在量化时间少几个数量级的情况下优于基线方法。

英文摘要

Post-training weight-only quantization to 4 bits is widely used to reduce the memory and compute costs of large language model inference. Existing PTQ methods, such as AWQ and GPTQ, improve how weights are mapped onto a fixed 4-bit grid through scaling, clipping, or error compensation. To further improve accuracy, methods such as OmniQuant and QuIP# use gradient-assisted algorithms at the cost of hours of quantization time. In this work, we propose AAAC (Activation-Aware Adaptive Codebooks), a lightweight method for 4-bit LLM weight quantization. AAAC replaces the fixed scalar codebook used in standard quantization with two small learned scalar codebooks (64 bytes) per layer. Each group of weights selects the codebook that minimizes activation-weighted reconstruction error, encoding the choice in the unused sign bit of the group's positive scale and adding zero storage overhead. AAAC completes in 3-30 minutes on a single GPU and adds no memory beyond the model itself. We evaluate against AWQ, GPTQ, IF4, GPTVQ, OmniQuant, SqueezeLLM, and QuIP# across model families. AAAC outperforms the baselines at orders of magnitude less quantization time.
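
The per-group codebook selection and the sign-bit trick described above can be sketched as follows. Everything concrete here is an assumption for illustration: the two 16-entry codebooks are hand-made rather than learned, the scale is a simple absolute maximum, and the function names are hypothetical. The sketch only shows the mechanism of picking the codebook with the lower activation-weighted error and storing that choice in the sign of an otherwise positive scale.

```python
import math


def make_codebooks():
    """Two hypothetical 16-entry (4-bit) scalar codebooks on [-1, 1]."""
    uniform = [-1.0 + 2.0 * i / 15.0 for i in range(16)]
    dense_near_zero = [math.copysign(u * u, u) for u in uniform]
    return [uniform, dense_near_zero]


def quantize_group(weights, act_weights, codebooks):
    """Quantize one weight group; return (signed_scale, 4-bit codes).

    The scale is positive by construction, so its sign bit is free to
    record which of the two codebooks was chosen (negative => codebook 1).
    """
    scale = max(abs(w) for w in weights) or 1.0
    best = None
    for cb_id, codebook in enumerate(codebooks):
        codes = [
            min(range(len(codebook)), key=lambda i: abs(w / scale - codebook[i]))
            for w in weights
        ]
        # Activation-weighted squared reconstruction error for this codebook.
        err = sum(
            a * (w - scale * codebook[c]) ** 2
            for w, a, c in zip(weights, act_weights, codes)
        )
        if best is None or err < best[0]:
            best = (err, cb_id, codes)
    _, cb_id, codes = best
    return (-scale if cb_id == 1 else scale), codes


def dequantize_group(signed_scale, codes, codebooks):
    """Recover approximate weights; the sign of the scale selects the codebook."""
    codebook = codebooks[1] if signed_scale < 0 else codebooks[0]
    scale = abs(signed_scale)
    return [scale * codebook[c] for c in codes]
```

A group dominated by near-zero weights with one outlier would tend to pick the codebook that is denser near zero, while an evenly spread group would pick the uniform one.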

2606.01765 2026-06-08 cs.FL cs.CL cs.LG 版本更新

An Algebraic View of the Expressivity of Recurrent Language Models

循环语言模型表达能力的代数视角

Franz Nowak, Ryan Cotterell, Reda Boumasmoud

发表机构 * GitHub

AI总结 本文通过代数统一框架分析循环神经网络在不同算术模型下的表达能力,将形式语言识别问题归结为语法幺半群是否划分特定圈积的代数问题。

Comments 28 pages, 2 figures, to be published at ICML 2026

AI中文摘要

循环神经语言模型能识别哪些形式语言?文献中的形式结果存在冲突:一些作者报告图灵完备性,而另一些则显示等价于正则语言。这种差异的原因在于底层算术模型不同。本文发展了一个统一的代数视角来刻画循环神经网络的表达能力,首先对各种算术模型进行形式化描述。该视角将表达能力归结为一个代数问题,例如网络的语法幺半群是否整除某个圈积。作为案例研究,本文重新审视了对角状态空间模型:一旦强制执行浮点递归,同一架构无法实现偶数模计数器,但在无符号整数量化下却能实现每个偶数模计数器。

英文摘要

What formal languages can a recurrent neural language model recognize? Formal results in the literature conflict: some authors report Turing-completeness, while others show equivalence to regular languages. The reason for this discrepancy is that the underlying arithmetic model differs. The paper develops a unified algebraic account of the expressivity of recurrent neural networks, starting with a formal account of various arithmetic models. This account reduces expressivity to an algebraic question, e.g., whether a network's syntactic monoid divides a certain wreath product. As a case study, the paper revisits diagonal state-space models: the same architecture cannot implement an even-modulus counter once floating-point recurrences are enforced, yet realizes every even-modulus counter under unsigned-integer quantization.

2606.05152 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Reinforcement Learning from Rich Feedback with Distributional DAgger

利用丰富反馈的强化学习与分布式DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

发表机构 * University of Southern California(南加州大学)

AI总结 提出DistIL算法,通过DAgger的分布型变体利用丰富反馈(如执行轨迹、工具输出等)进行前向交叉熵优化,实现单调策略改进和更好的Pass@N性能。

AI中文摘要

推理模型发展迅速,但主流的基于可验证奖励的强化学习(RLVR)方法仍然非常狭窄:采样多个响应,并用单个比特奖励每个响应,指示最终答案是否正确。然而,许多设置提供了丰富的反馈,包括执行轨迹、工具输出、专家修正和模型自我评估。我们研究如何通过经典模仿学习算法DAgger的分布型变体来使用这种反馈,其中学习器可以局部访问当前策略所访问状态上的专家分布。这产生了一个简单的前向交叉熵目标,该目标接受黑盒专家,并且其序列级梯度通过将未来的专家-学生分歧传播回早期决策来进行丰富的信用分配。我们表明,先前基于反向KL或Jensen-Shannon散度的自蒸馏强化学习目标无法保证单调策略改进:即使专家具有更高的奖励,它们的更新也可能增加更差动作的概率。相比之下,我们证明前向交叉熵允许单调策略改进并享有遗憾保证。我们进一步表明,我们的目标优化了教师加权的成功似然的下界,从而改进了Pass@N。实验上,我们的方法DistIL在科学推理、编程和解决困难数学问题等多个领域优于RLVR和基于自蒸馏的强化学习基线。

英文摘要

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior objectives for RL with self-distillation, based on reverse KL or Jensen-Shannon divergence, fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase the probability of worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on the teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL-with-self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.
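
The difference between the forward cross-entropy objective and a reverse-KL objective is easiest to see at a single state with a small action set. The sketch below is a single-step toy under stated assumptions: a softmax policy over three actions, a hand-picked expert distribution, and hypothetical function names. It is not the DistIL training loop. It relies only on the standard identity that the gradient of forward cross-entropy with respect to softmax logits is the policy minus the expert distribution.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]


def forward_cross_entropy(expert, logits):
    """H(expert, policy) = -sum_a expert(a) * log policy(a)."""
    policy = softmax(logits)
    return -sum(p * math.log(q) for p, q in zip(expert, policy) if p > 0)


def reverse_kl(expert, logits):
    """KL(policy || expert); requires expert(a) > 0 wherever policy(a) > 0."""
    policy = softmax(logits)
    return sum(q * math.log(q / p) for p, q in zip(expert, policy) if q > 0)


def forward_ce_grad(expert, logits):
    """Gradient of the forward cross-entropy with respect to the logits."""
    return [q - p for p, q in zip(expert, softmax(logits))]


def sgd_step(expert, logits, lr=1.0):
    grad = forward_ce_grad(expert, logits)
    return [l - lr * g for l, g in zip(logits, grad)]


if __name__ == "__main__":
    expert = [0.7, 0.2, 0.1]   # hypothetical expert distribution at one state
    logits = [0.0, 0.0, 0.0]   # uniform student policy
    for _ in range(3):
        print(forward_cross_entropy(expert, logits), reverse_kl(expert, logits))
        logits = sgd_step(expert, logits)
```

Note that the forward objective only needs samples or probabilities from the expert, which is why it can accommodate a black-box expert.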

2606.05761 2026-06-08 cs.AI cs.CL 版本更新

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

SubtleMemory: 面向长时程AI智能体的细粒度关系记忆辨别基准

Wenxuan Wang, Haoyu Sun, Fukuan Hou, Mingyang Song, Weinan Zhang, Yu Cheng, Yang Yang

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Shanghai AI Laboratory(上海人工智能实验室) Tongji University(同济大学) Xiamen University(厦门大学) Fudan University(复旦大学) Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出SubtleMemory基准,通过构建关系控制的潜在语义伪影并嵌入用户-智能体交互历史,评估长时程AI智能体在后续查询中恢复分布式关系结构的能力。

Comments 48 pages

AI中文摘要

持久性AI助手(如OpenClaw)在长期交互中积累了大量相关记忆。随着这些记忆的增长,它们可能相互强化、在不同上下文中出现分歧或直接冲突,使得正确协助依赖于记忆关系而非孤立回忆。现有的长期记忆基准很少探究智能体在下游任务中如何保留和利用这些关系。为弥补这一空白,我们引入了SubtleMemory,一个用于长运行AI智能体中细粒度关系记忆辨别的基准。SubtleMemory构建了关系控制的潜在语义伪影,其变体实例化互补、细微或矛盾的关系,并将其嵌入到逼真的用户-智能体历史中,要求智能体在后续查询和指令中恢复分布式的关系结构。该基准包含10个长历史中的1,522个评估实例,基于1,090个关系控制的记忆变体集,涵盖用户相关和非用户相关的查询。评估了六个独立记忆系统、两个具有原生记忆模块的Claw式智能体以及三个具有插件记忆模块的Claw式智能体,我们发现当前系统在细粒度关系记忆辨别上仍然薄弱。我们进一步引入了诊断协议,揭示了在记忆保留、检索和下游推理阶段的不同能力特征。

英文摘要

Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.

2601.14637 2026-06-08 cs.CV cs.AI cs.CL cs.HC 版本更新

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

Forest-Chat: 为交互式森林变化分析适应视觉-语言代理

James Brock, Ce Zhang, Nantheera Anantrasirichai

发表机构 * School of Computer Science, University of Bristol(布里斯托尔大学计算机科学学院) School of Geographical Sciences, University of Bristol(布里斯托尔大学地理科学学院)

AI总结 本文提出Forest-Chat,一种基于LLM的森林变化分析代理,通过多任务处理实现自然语言查询,提升森林变化检测与语义解释的准确性与可解释性。

Comments 28 pages, 9 figures, 12 tables, Submitted to Ecological Informatics

AI中文摘要

高分辨率卫星影像的普及与深度学习的进步为森林监测提供了新机遇。本文提出Forest-Chat,一种基于大语言模型的视觉-语言代理,支持多任务的交互式森林变化分析,包括变化检测、图像描述、对象计数、森林砍伐特征识别和变化推理。Forest-Chat基于多级变化解释(MCI)视觉-语言框架,结合零样本变化检测和多模态零样本变化描述与优化。引入Forest-Change数据集,包含双时相卫星影像、像素级变化掩码和语义变化描述。在Forest-Change数据集上,Forest-Chat在mIoU和BLEU-4指标上达到67.10%和40.17%,在LEVIR-MCI-Trees子集上达到88.13%和34.41%。零样本测试中,其在Forest-Change数据集上达到60.15%和34.00%,在LEVIR-MCI-Trees子集上达到47.32%和18.23%。进一步实验表明,描述优化能注入地理领域知识,但标签域迁移有限。这些发现表明,交互式、基于LLM的系统能支持可访问和可解释的森林变化分析。

英文摘要

The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. This paper introduces Forest-Chat, an LLM-driven agent for forest change analysis, enabling natural language querying across multiple RSICI tasks, including change detection and captioning, object counting, deforestation characterisation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, incorporating zero-shot change detection via AnyChange and multimodal LLM-based zero-shot change captioning and refinement. To support adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and semantic change captions via human annotation and rule-based methods. Forest-Chat achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on Forest-Change, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI. In a zero-shot capacity, it achieves 60.15% and 34.00% on Forest-Change, and 47.32% and 18.23% on LEVIR-MCI-Trees. Further experiments demonstrate the value of caption refinement for injecting geographic domain knowledge into supervised captions, and the system's limited label domain transfer onto JL1-CD-Trees. These findings demonstrate that interactive, LLM-driven systems can support accessible and interpretable forest change analysis.
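
For readers unfamiliar with the mIoU figures quoted above, here is a minimal sketch of mean intersection-over-union for flattened change masks. The two-class setup (0 = no change, 1 = change) and the choice to skip classes absent from both masks are assumptions; the paper's exact evaluation protocol may differ.

```python
def mean_iou(pred, target, num_classes=2):
    """Mean IoU over classes for two flattened label masks of equal length.

    Classes that appear in neither mask are skipped rather than counted as 0.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    predicted = [0, 0, 1, 1]   # hypothetical predicted change mask
    reference = [0, 1, 1, 1]   # hypothetical ground-truth change mask
    print(mean_iou(predicted, reference))
```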

2506.14634 2026-06-08 cs.CL cs.AI cs.CY 版本更新

AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

不过是一项调查?利用大语言模型对德语开放式调查回答中的调查动机进行编码

Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessica Daikeler

发表机构 * Social Data Science & AI Lab, LMU Munich(慕尼黑大学社会数据科学与人工智能实验室) Munich Center for Machine Learning(慕尼黑机器学习中心) University of Maryland, College Park(马里兰大学帕克分校) GESIS – Leibniz Institute for the Social Sciences(GESIS莱布尼茨社会科学研究所)

AI总结 本文探讨了使用大语言模型对开放式调查回答进行编码的有效性,通过德国调查参与原因的数据,比较了不同LLM和提示方法的性能,发现仅微调的LLM能获得满意预测效果,且分类性能差异影响类别分布。

Comments to appear in Survey Research Methods

Journal ref Survey Research Methods (2025)
AI中文摘要

近年来,大语言模型(LLM)的发展和广泛可及性引发了关于其在调查研究中应用的讨论,包括对开放式调查回答的分类。由于其语言能力,LLM可能成为耗时的手动编码和监督学习模型预训练的高效替代方案。由于现有研究大多集中在英语回答的非复杂主题或单一LLM上,尚不清楚其发现是否具有普遍性以及这些分类的质量如何与传统方法相比。本研究探讨了不同LLM在其他情境下对开放式调查回答进行编码的程度,以德国调查参与原因的数据为例。我们比较了几种最先进的LLM和提示方法,并通过人类专家编码评估LLM的性能。总体而言,LLM之间的性能差异很大,只有微调的LLM才能达到满意的预测性能。提示方法之间的性能差异取决于所用的LLM。最后,LLM在不同调查参与原因类别上的不均等分类性能导致了不同的类别分布,当不使用微调时。我们讨论了这些发现的含义,不仅对开放式回答编码的方法学研究,还对其实质分析,以及处理或实质性分析此类数据的实践者。最后,我们强调了研究人员在选择LLM时代开放式回答分类自动化方法时需要考虑的许多权衡。通过这样做,我们的研究为关于LLM在调查研究中高效、准确和可靠应用条件的日益增长的研究做出了贡献。

英文摘要

The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.

2501.11592 2026-06-08 cs.LG cs.AI cs.CL 版本更新

Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing

无需训练的超小模型用于压缩感知中的通用稀疏重建

Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai

发表机构 * School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China(华中科技大学人工智能与自动化学院) China Belt and Road Joint Lab on Measurement and Control Technology, Wuhan, China(中国一带一路测量与控制技术联合实验室) School of Electric and Electrical Engineering, Chongqing University of Technology, Chongqing, China(重庆理工大学电气工程学院) Optics Valley Laboratory, Wuhan, China(光谷实验室) School of Water Conservancy and Transportation, Zhengzhou University, Zhengzhou, China(郑州大学水利与交通学院) School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China(华中科技大学软件工程学院)

AI总结 本文提出无需训练的超小神经模型CL,实现快速稀疏重建,继承传统迭代方法的通用性和可解释性,提升效率和精度。

详情
AI中文摘要

预训练大模型近年来受到广泛关注,但在需要高可解释性或资源有限的应用中面临挑战,如物理传感、医学成像和生物信息学。压缩感知(CS)是已证明的理论,推动了这些应用的许多突破。然而,作为典型的欠定线性系统,CS在使用传统迭代方法时,对大规模数据的稀疏重建时间过长。当前的AI方法如深度展开失败于替代它们,因为预训练模型在超出训练条件和数据分布时泛化性差或缺乏可解释性。本文提出名为系数学习(CL)的超小人工神经模型,实现无需训练的快速稀疏重建,同时完美继承传统迭代方法的泛化性和可解释性,带来融合先验知识的新特性。在CL中,长度为n的信号仅需最少n个可训练参数。一个案例研究模型称为CLOMP用于评估。实验在合成和真实的一维和二维信号上进行,显示了显著的效率和精度提升。与代表性的迭代方法相比,CLOMP在大规模数据上提高了100到1000倍的效率。在八个不同的图像数据集上的测试结果表明,CLOMP在采样率为0.1、0.3、0.5时分别提高了结构相似性指数292%、98%、45%。我们相信这种方法可以真正将CS重建带入AI时代,造福无数依赖稀疏解的欠定线性系统。

英文摘要

Pre-trained large models have attracted widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-established theory that has driven many recent breakthroughs in these applications. However, as a typical under-determined linear system, CS suffers from excessively long sparse reconstruction times when using traditional iterative methods, particularly with large-scale data. Current AI methods such as deep unfolding fail to replace these iterative methods because pre-trained models generalize poorly beyond their training conditions and dataset distributions, or lack interpretability. Instead of following the big-model fervor, this paper proposes ultra-small artificial neural models called coefficients learning (CL), enabling training-free and rapid sparse reconstruction while fully inheriting the generality and interpretability of traditional iterative methods, and adding the ability to incorporate prior knowledge. In CL, a signal of length $n$ needs a minimum of only $n$ trainable parameters. A case-study model called CLOMP is implemented for evaluation. Experiments are conducted on both synthetic and real one-dimensional and two-dimensional signals, demonstrating significant improvements in efficiency and accuracy. Compared to representative iterative methods, CLOMP improves efficiency by 100- to 1000-fold for large-scale data. Test results on eight diverse image datasets indicate that CLOMP improves the structural similarity index by 292%, 98%, and 45% for sampling rates of 0.1, 0.3, and 0.5, respectively. We believe this method can truly usher CS reconstruction into the AI era, benefiting countless under-determined linear systems that rely on sparse solutions.
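
For context on the "traditional iterative methods" the abstract compares against, the sketch below implements classical Orthogonal Matching Pursuit (OMP) in plain Python. This is the textbook baseline, not CLOMP: the coefficient-learning model itself is not described in enough detail here to reproduce. The helper names and the tiny dense solver are illustrative assumptions suited only to very small problems.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def solve(A, b):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def omp(A, y, sparsity):
    """Recover a sparse x with y ~= A x by greedily growing the support set."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    support, coef, residual = [], [], y[:]
    for _ in range(sparsity):
        # Pick the unused column most correlated with the current residual.
        j = max(
            (j for j in range(n) if j not in support),
            key=lambda j: abs(dot(cols[j], residual)),
        )
        support.append(j)
        # Least-squares fit on the support via the normal equations.
        gram = [[dot(cols[a], cols[b]) for b in support] for a in support]
        rhs = [dot(cols[a], y) for a in support]
        coef = solve(gram, rhs)
        residual = [
            y[i] - sum(c * cols[s][i] for c, s in zip(coef, support))
            for i in range(m)
        ]
    x = [0.0] * n
    for c, s in zip(coef, support):
        x[s] = c
    return x
```

Each iteration re-solves a least-squares problem on the growing support, which is the repeated cost that makes such iterative solvers slow on large-scale data.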