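arXivDaily: a daily academic digest synchronized with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and related fields.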

2606.07524 2026-06-09 cs.CL cs.AI new submission

ABLE: Representing and Mapping LLMs via Attribution-Based Large-model Embedding

Zirui Wang, Yusen Hou, Shaofeng Liang, Bowen Tian, Yanlin Zhang, Wenshuo Chen, Yutao Yue

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Deep Interdisciplinary Intelligence Lab (DI2 Lab)

AI summary: Proposes the ABLE framework, which builds model embeddings from gradient-based feature attributions and tokenizer-agnostic word-level alignment, enabling efficient comparison of heterogeneous LLMs; it performs well on relation prediction, model routing, and benchmark score prediction.

Abstract

The explosive growth of large language models (LLMs) has created a heterogeneous and poorly documented ecosystem, making systematic model comparison increasingly important for provenance auditing, security analysis, and model selection. Existing representation methods struggle to address this setting efficiently. Approaches analyzing internal parameters are powerful when architectures are compatible, but face scalability barriers under structural heterogeneity, while methods relying on external outputs may conflate models with similar behaviors and are difficult to align in richer output spaces across different tokenizers. To bridge this gap, we propose ABLE (Attribution-Based Large-model Embedding), a framework that leverages the interpretability space to construct model representations. By aggregating gradient-based feature attributions via a tokenizer-agnostic word-level alignment, ABLE captures model-specific input-sensitivity patterns rather than only surface-level outputs. Beyond empirical utility, we provide a stability analysis showing that, under standard regularity assumptions for differentiable Transformer-style models, ABLE induces a Lipschitz-continuous parameter-to-embedding map with finite-sample convergence guarantees. Extensive experiments on 239 open-source LLMs demonstrate that our training-free approach achieves competitive or superior performance in relation prediction, model routing, and benchmark score prediction.
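The tokenizer-agnostic word-level alignment mentioned in the abstract can be illustrated with a small sketch. It assumes each tokenizer reports character offsets for its tokens and that token attributions are pooled by summing absolute values into the overlapping whitespace-delimited word. The function names, the pooling rule, and the toy scores are illustrative assumptions, not the paper's implementation.

```python
from typing import List, Tuple

def word_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of whitespace-separated words."""
    spans, start = [], None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans

def aggregate_to_words(text: str,
                       token_offsets: List[Tuple[int, int]],
                       token_scores: List[float]) -> List[float]:
    """Sum absolute token attributions into the first word each token overlaps.

    Assumption: absolute-value sum pooling; other pooling rules are possible.
    """
    spans = word_spans(text)
    word_scores = [0.0] * len(spans)
    for (t_start, t_end), score in zip(token_offsets, token_scores):
        for w, (w_start, w_end) in enumerate(spans):
            if t_start < w_end and t_end > w_start:  # character overlap
                word_scores[w] += abs(score)
                break
    return word_scores

text = "large language models"
# Hypothetical tokenizer A: one token per word.
offsets_a = [(0, 5), (6, 14), (15, 21)]
scores_a = [0.2, -0.5, 0.3]
# Hypothetical tokenizer B: splits "language" into two sub-word tokens.
offsets_b = [(0, 5), (6, 10), (10, 14), (15, 21)]
scores_b = [0.1, 0.4, -0.2, 0.3]

words_a = aggregate_to_words(text, offsets_a, scores_a)
words_b = aggregate_to_words(text, offsets_b, scores_b)
```

Both models end up with one score per word, so their attribution vectors have the same length and can be compared even though the tokenizers disagree on segmentation.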

2606.07526 2026-06-09 cs.CL cs.AI new submission

GraphLoRA: Structure-Aware Low-Rank Adaptation for Large Language Model Recommendation

Lin Mu, Guoji Wang, Li Ni, Lei Sang, Zhize Wu, Peiquan Jin, Yiwen Zhang

Affiliations: Anhui University; Hefei University; University of Science and Technology of China

AI summary: Proposes the GraphLoRA framework, which embeds a trainable graph message-passing network in the low-rank adaptation pathway so that structural signals can propagate, deeply fusing graph structure with textual semantics and improving LLM recommendation performance.

Comments: ACL 2026 Findings

Abstract

Large Language Models (LLMs) have shown strong potential for recommendation (LLMRec) due to their powerful reasoning and generalization abilities. However, effectively aligning the textual semantics modeled by LLMs with the collaborative signals remains a key challenge. Existing methods either translate collaborative information into textual prompts or inject pre-trained embeddings into the LLM, both of which treat structural information as static input and fail to capture high-order relational dependencies. To bridge this gap, we propose GraphLoRA, a novel framework that generalizes low-rank adaptation from independent to structure-aware propagation. GraphLoRA embeds a trainable graph message-passing network within the low-rank adaptation pathway, enabling structural signals to propagate through the parameter space. This design allows collaborative topology to explicitly guide parameter updates, fostering deep integration between graph-structured and textual semantic information. Extensive experiments on multiple benchmarks demonstrate that GraphLoRA not only outperforms state-of-the-art LLM-based recommendation methods but also achieves superior generalization, effectively balancing structural reasoning capability with computational efficiency. Code is available at https://github.com/wgj15965/GraphLoRA.

2606.07527 2026-06-09 cs.CL cs.AI cs.LG new submission

Post-training is (Massive) Supervised Learning

Michael Hassid, Yossi Adi, Roy Schwartz

Affiliations: FAIR, Meta AI; The Hebrew University of Jerusalem

AI summary: Argues that the current LLM post-training stage (SFT + RL) is in effect a return to the BERT-era "pre-train then fine-tune" paradigm, shows experimentally that models post-trained from scratch also reach notable performance, and proposes shifting toward training in which models "learn how to learn".

Abstract

The prevailing paradigm for training LLMs has evolved to rely on a massive post-training phase consisting of SFT and RL. In this position paper, we argue that this methodology effectively marks a reversion to the "pre-train then fine-tune" approach of the BERT era, explicitly tailoring models to the desired behaviors and specific benchmarks on which they are evaluated. We begin with a historical overview of LLMs, describing the different phases of the LLM evolution. We argue that the current landscape is remarkably similar to the early days of LLMs, where task performance heavily relied on fitting the models to in-distribution datasets. To empirically demonstrate this, we compare pre-trained models to randomly initialized ones, by fine-tuning both variants on modern reasoning datasets and evaluating them on competitive math and code benchmarks. We show that models post-trained from scratch yield highly non-trivial performance. Our findings suggest that current post-training methodologies function primarily as a distribution-fitting mechanism. We finish by positing that developing generally capable models and systems requires moving beyond extensive post-training for predefined behaviors, shifting instead toward training procedures where models "learn how to learn".

2606.07559 2026-06-09 cs.CL cs.AI quant-ph new submission

Phantom transitions in language model fine-tuning

Vaibhav Prakash, Jayasri Dontabhaktuni

Affiliations: Mahindra University

AI summary: Studies how language-model fine-tuning fails when the correct completion competes with a near-synonym. An order parameter decomposes into a signal and a background drag, revealing two failure modes, and the apparent phase transitions turn out to be phantoms that arise from the softmax readout rather than from a geometric phase transition.

Comments: 26 pages, 9 figures

Abstract

Fine-tuning a language model on contexts whose correct completion has a near-synonym competitor often fails silently. The cross-entropy loss decreases monotonically while the correct token never overtakes the competitor in rank. We study this regime across five transformer architectures spanning two families and a fivefold parameter range, on ten hand-selected near-synonym contexts. We instrument these failures with an order parameter combining the predicted distribution and pairwise embedding overlaps. It decomposes additively into a signal, tracking the model's commitment to the correct token over its nearest competitor, and a background drag, set by how the embedding bulk leaks probability into the score. This isolates two failure modes. In kinematic failure the signal stays small. In structural failure the drag actively worsens as fine-tuning proceeds. We observe sharp catapult-like jumps in the order parameter that resemble a phase transition. A central negative result organises the paper. The transitions are phantoms. The spontaneous-symmetry-breaking interpretation is ruled out by direct measurement. Catapult-like jumps still appear under LoRA fine-tuning with the token embedding matrix exactly unchanged during training, where no geometric phase transition is possible. The discontinuity lives entirely in the softmax readout. A small number of dimensionless quantities organise the trajectory across architectures. One is consistent across all five under full fine-tuning. A second sorts architectures into two classes by bulk embedding distribution and predicts LoRA sufficiency. As a blind test, the framework predicts the critical learning rate of a held-out architecture, not used to fit any parameter, to within 2.1% of a subsequent learning-rate sweep. Findings concern the near-synonym mechanism only and should not be extrapolated without recalibration.

2606.07560 2026-06-09 cs.CL cs.LG new submission

Function-Vector Heads Are Two Populations: Writers and Cancellers in In-Context Learning

Han-yu Wang

Affiliations: The University of Hong Kong

AI summary: Finds that function-vector heads are not a homogeneous population but split into two sub-populations, writers and cancellers, which push the rule-correct logit up and down respectively; magnitude-only ranking cannot tell them apart.

Abstract

Function-vector (FV) heads (Todd et al., 2024) are typically identified by the magnitude of their causal contribution to in-context rule tasks, under the implicit assumption that the top set is a homogeneous functional class. This assumption fails. We replace magnitude-only ranking with a sign-preserving criterion (refined DLA + permutation FDR) and validate each candidate by path patching. The FV head population then splits into two opposing sub-populations: writers push the rule-correct logit up; cancellers push it down. A four-condition canonical verdict holds in 13/15 cells across three model families and six Pythia scales, and a sign-shuffle rejects homogeneity in 5/6 main cells. The structure is invisible to magnitude-only ranking: Todd's top-20 captures 64% of cancellers but only 4% of writers on the hierarchical task, and 59% of writers but only 8% of cancellers on the modular task. We rule out six artefact accounts on all 27 canceller (cell, head) pairs: induction overlap, sinks, generic importance, rank-1 copy-suppression, V-cascade, and rank-nearest non-FV controls. Zero-ablating cancellers yields +0.13 to +0.29 nats of logit gain in 6/6 main cells with a directionally consistent +2 to +7 pp accuracy effect.

2606.07818 2026-06-09 cs.CL cs.NE new submission

Representational Similarity and Model Behavior in Multi-Agent Interaction

Yujin Potter, Seun Eisape, Shiyang Lai, Alexander Huth, James Evans, Been Kim, Jacob Eisenstein, Dawn Song, Alane Suhr

Affiliations: University of Washington

AI summary: Studies how representational similarity between pairs of LLMs affects cooperation and innovation: high similarity promotes cooperation but reduces novelty, and early-layer similarity shows the strongest association.

Comments: ICML 2026

Abstract

Researchers have shown that neural similarity among humans predicts social closeness and cooperative success, whereas innovation often emerges from interactions among dissimilar individuals. We investigate whether these principles extend to artificial intelligence by examining interactions between large language models. In our experiments, 276 model pairs interact across eight games spanning both cooperation and novelty. We find that pairs with more similar representation spaces achieve significantly higher cooperation but exhibit reduced novelty and creativity. The effects of representational similarity on cooperation and novelty remain robust even after controlling for other factors such as performance disparity and model size. We also find that similarity in the early layers consistently shows the strongest association with cooperation and novelty, compared to the middle and later layers. This suggests that a central factor underlying these patterns could be the extent to which the two models share lexical and semantic grounding. Overall, representational similarity can be an important consideration in multi-agent system design.

2606.07978 2026-06-09 cs.CL new submission

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

Xueping Gao

Affiliations: Alibaba Cloud

AI summary: Finds that factual knowledge in LLMs "crystallizes" abruptly at the final layers rather than emerging layer by layer, and on that basis proposes a crystallization-guided intervention principle that outperforms existing methods.

Abstract

Understanding where LLMs store factual knowledge is critical for hallucination mitigation. We systematically quantify Late Crystallization: factual knowledge does not gradually emerge across layers but "crystallizes" abruptly at the final layers. Across five model families (Pythia, Gemma, Qwen2.5, Llama-3.1, Mistral; 0.5–14B), 26.8%–93.4% of correct answers never enter top-10 predictions at any intermediate layer, with late emergence (>80% depth) consistent across architectures. Cross-scale (Qwen2.5-14B) and cross-benchmark (MMLU: 98.2%) results confirm generality; tuned lens rules out probe artifacts. A sentiment-classification control (0.5% for Qwen vs. 85.9% factual; 2.0% for Mistral vs. 26.8%) confirms the phenomenon is specific to factual recall. Late Crystallization yields a crystallization-guided intervention principle: CAA outperforms DoLa on moderate-crystallization models (Llama, Mistral; p<0.001), with a directionally consistent reversal on high-crystallization Qwen (+25.4% vs. +15.5% MC1, p=0.069). LayerNorm ablation shows crystallization is intrinsic to the residual stream; LN scaling (x1.2) yields +11.8% MC1 with zero inference overhead. We further reveal a Computability-Memorization Spectrum: computable knowledge crystallizes earlier (layer 22.1/28) than memorized facts (28.0/28). We release MechLens supporting five model families.
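The "never enters top-10 at any intermediate layer" and ">80% depth" statistics can be made concrete with a short sketch. It assumes a layer-wise readout (for example a logit-lens style probe) has already produced the 0-based rank of the correct token at every layer. The function names and the exact definitions below are illustrative assumptions, since the abstract does not spell out the metric.

```python
from typing import List, Optional

def first_topk_layer(ranks_per_layer: List[int], k: int = 10) -> Optional[int]:
    """Return the 1-based index of the first layer where the correct token's
    0-based rank is below k, or None if that never happens."""
    for layer, rank in enumerate(ranks_per_layer, start=1):
        if rank < k:
            return layer
    return None

def is_late_crystallized(ranks_per_layer: List[int], k: int = 10) -> bool:
    """True if the correct token never enters the top-k at any intermediate
    (non-final) layer. Assumption: 'intermediate' means all but the last layer."""
    return first_topk_layer(ranks_per_layer[:-1], k) is None

def emergence_depth(ranks_per_layer: List[int], k: int = 10) -> Optional[float]:
    """Relative depth (0..1] at which the correct token first enters the top-k."""
    layer = first_topk_layer(ranks_per_layer, k)
    return None if layer is None else layer / len(ranks_per_layer)

# Hypothetical rank trajectories over an 8-layer model.
late = [5000, 3200, 900, 450, 120, 40, 15, 0]   # answer appears only at the end
gradual = [800, 90, 7, 3, 1, 0, 0, 0]           # answer appears early

late_flag = is_late_crystallized(late)
late_depth = emergence_depth(late)
gradual_flag = is_late_crystallized(gradual)
gradual_depth = emergence_depth(gradual)
```

Averaging `is_late_crystallized` over a set of correctly answered prompts would give a percentage comparable in spirit to the ones quoted above, under the stated assumptions.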

2606.08295 2026-06-09 cs.CL new submission

Tensorized Engram: Sharing Latent Variables in N-gram Embeddings Benefits Large Language Models

Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Yuning Qiu, Qibin Zhao, Danilo Mandic

Affiliations: University of California, Berkeley; University of Cambridge; University of Toronto

AI summary: Proposes Tensorized Engram (TN-gram), which compresses n-gram embeddings through shared factors in a CP decomposition, reducing parameters and avoiding hash collisions, and matches or surpasses existing methods on multiple tasks.

Abstract

Modern language models represent text using discrete token-level embeddings, which forces recurring multi-token patterns to be learned implicitly across Transformer layers. Both Over-tokenized Transformers and Engram attempt to address this limitation by explicitly incorporating multi-token (n-gram) memories. However, they rely on separate hash tables for each n-gram order, which introduces hash collisions and prevents nested n-grams from sharing the underlying latent structures. To address these issues, we propose Tensorized Engram (TN-gram), a compact memory module that represents tensorized n-gram embeddings through shared factors in the Canonical Polyadic (CP) form. TN-gram learns shared token-position factors together with order-absorption vectors to encode the embeddings of different n-gram orders. Comprehensive experiments demonstrate that TN-gram matches or even outperforms Engram-style n-gram modules while requiring far fewer parameters.

2606.08471 2026-06-09 cs.CL cs.AI new submission

More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs

Marina Igitkhanian, Erik Arakelyan

Affiliations: American University of Armenia; NVIDIA

AI summary: By constructing a sufficiency test, this study finds that small language models gain only 4.4% in accuracy from self-correction and that longer hints correlate positively with incorrect answers, indicating limited reasoning ability.

Comments: GEM Workshop at ACL 2026

Abstract

Recently, language models have made rapid progress across various domains and applications. However, their capability for self-improvement, i.e., whether they are adept at recognising and correcting flaws in their own reasoning, remains dubious. In this study, we address this question by constructing a sufficiency test to rigorously examine the self-correction capabilities of small language models (SLMs). We propose a minimal three-step self-correction pipeline that collects initial SLM answers, prompts the same model to generate hints for its incorrect responses given the ground truth, and feeds the model the same question with its own feedback to refine the initial answer. We evaluate a variety of instruction-tuned and reasoning SLMs in this experimental setup on arithmetic and logical reasoning benchmarks. Our findings show that SLMs with injected hint sentences yield only a 4.4 percent gain over initial question-answering accuracy. Even though the correct answer was provided alongside the model's incorrect reasoning, the evaluated SLMs fail to understand what was missing in their reasoning and show minimal semantic difference between hints that lead to corrections and ones that do not. Furthermore, our experiments show that longer hints are positively correlated with incorrect final answers, suggesting that longer deliberation on problems can hinder the reasoning process, meaning that SLMs do not necessarily scale in performance with a larger compute budget.
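The three-step self-correction pipeline described above can be sketched as follows. The `model` argument is any callable mapping a prompt string to a response string; the prompt wording, the exact-match correctness check, and the `toy_model` stub are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import Callable, Optional, Tuple

def self_correct(model: Callable[[str], str],
                 question: str,
                 ground_truth: str) -> Tuple[str, Optional[str], str]:
    """Run answer -> hint -> refined answer with the same model."""
    # Step 1: collect the initial answer.
    initial = model(question)
    if initial.strip() == ground_truth.strip():
        return initial, None, initial  # already correct, no hint needed

    # Step 2: ask the same model for a hint, given the ground truth.
    hint_prompt = (
        "Write a hint that points out the flaw in the reasoning.\n"
        f"Question: {question}\nYour answer: {initial}\n"
        f"Correct answer: {ground_truth}"
    )
    hint = model(hint_prompt)

    # Step 3: feed the same question back together with the model's own hint.
    refined = model(f"{question}\nHint: {hint}")
    return initial, hint, refined

def toy_model(prompt: str) -> str:
    """Stand-in for an SLM so the sketch runs without any model weights."""
    if prompt.startswith("Write a hint"):
        return "Re-check the addition."
    if "Hint:" in prompt:
        return "4"
    return "5"

initial, hint, refined = self_correct(toy_model, "What is 2 + 2?", "4")
```

Comparing `refined` with `initial` over a benchmark gives the accuracy gain from hint injection, and the length of `hint` can be correlated with final correctness as in the analysis the abstract reports.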

2606.08501 2026-06-09 cs.CL new submission

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

Yawen Shao, Jie Xiao, Kai Zhu, Yu Liu, Hongchen Luo, Xueyang Fu, Yang Cao, Wei Zhai, Zheng-Jun Zha

Affiliations: University of Science and Technology of China; Tongyi Lab; Northeastern University

AI summary: To address the dual misalignment between process rewards and state trajectories in reinforcement learning for diffusion large language models, proposes the PAPO framework, which achieves alignment through step-aware process rewards and entropy-guided historical re-enactment and delivers significant gains on four benchmarks.

Abstract

Reinforcement learning (RL) holds immense promise for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, progress is fundamentally constrained by a dual misalignment between the authentic generation trajectory and the gradient update process: (i) Process-reward misalignment. Sparse, terminal rewards are indiscriminately assigned to all intermediate steps of the generation process, failing to provide discriminative credit assignment. (ii) State-trajectory misalignment. Policy updates are often diverted toward artificial, out-of-trajectory states, squandering gradients on less informative samples. To address these limitations, we introduce Process Aligned Policy Optimization (PAPO), a novel framework that holistically aligns the RL update with the dLLM's generative trajectory via Step-Aware Process Rewards (SPR) that transform sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) that replays authentic trajectories at high-uncertainty steps. Extensive experiments on four benchmarks demonstrate that PAPO significantly outperforms baselines, achieving gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown and 16.1% on Sudoku.

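2606.08644 2026-06-09 cs.CL cs.AI new submission

A retrieval conditioned rebinding circuit for dynamic entity tracking in large language models

Soyoung Oh, Vera Demberg

Affiliations: Saarland University; Max Planck Institute for Informatics

AI summary: Uses causal interventions to identify a retrieval-conditioned rebinding mechanism that implements dynamic state tracking in large language models; a compact circuit of attention heads encodes and restores binding information, and the mechanism behaves differently across model families.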

2606.08755 2026-06-09 cs.CL new submission

Co-Evolving Skill Generation and Policy Optimization

Zhiwei Zhang, Yudi Lin, Nikki Lijing Kuang, Linlin Wu, Xiaomin Li, Songtao Liu, Fenglong Ma

Affiliations: The Pennsylvania State University; Nanyang Technological University; University of California, San Diego; University of Utah; Harvard University

AI summary: Proposes an online reinforcement learning framework that estimates a skill's marginal utility from the reward gap between base and skill-augmented rollouts, enabling validation before storage, and uses that signal to train the policy as a skill generator, reducing reliance on proprietary models.

Abstract

Skill-augmented reinforcement learning improves language agents by storing reusable procedural knowledge acquired from past experience. Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. However, they rarely assess whether a newly generated skill is useful before it is stored and reused. We find that this assumption is unreliable: even skills generated by proprietary frontier LLMs exhibit highly mixed utility, with many providing little benefit or even degrading performance. Once such skills enter the bank, their effects are difficult to identify, because subsequent rollout feedback is delayed and usually reflects the combined effect of multiple retrieved skills rather than the marginal contribution of any individual skill. We propose an online reinforcement learning framework for pre-storage skill validation. The framework estimates whether a candidate skill contributes useful information beyond the skills already retrieved for the current task. It uses the standard rollout budget to form two matched groups under the same task and retrieval context: base rollouts conditioned on the currently retrieved skills, and skill-augmented rollouts conditioned on the same skills plus one candidate skill induced from the base trajectories. The reward gap between these two groups estimates the candidate skill's context-dependent marginal utility, enabling the framework to promote useful skills while filtering ineffective or harmful ones without additional rollout overhead. The framework further uses this marginal-utility signal to train the policy itself as a skill generator, reducing reliance on repeated calls to proprietary models. The learned skill-generation likelihood serves as a context-dependent score for retrieval-time reranking and outdated-skill pruning as the policy evolves.
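The reward-gap estimate described above can be written down in a few lines. The sketch assumes the rollout budget has already been split into a base group and a skill-augmented group for the same task and retrieval context, and that each rollout yields a scalar reward. The function names, the zero threshold, and the toy rewards are assumptions for illustration; the abstract does not specify the decision rule.

```python
from statistics import mean
from typing import Sequence

def marginal_utility(base_rewards: Sequence[float],
                     augmented_rewards: Sequence[float]) -> float:
    """Reward gap between skill-augmented and base rollouts for one task."""
    return mean(augmented_rewards) - mean(base_rewards)

def should_store(base_rewards: Sequence[float],
                 augmented_rewards: Sequence[float],
                 threshold: float = 0.0) -> bool:
    """Admit the candidate skill to the bank only if its estimated gap
    exceeds the (hypothetical) threshold."""
    return marginal_utility(base_rewards, augmented_rewards) > threshold

# Hypothetical budget of 8 rollouts, split 4 / 4 between the two groups.
base = [0, 1, 0, 1]
helpful = [1, 1, 0, 1]   # rollouts with a useful candidate skill added
harmful = [0, 0, 0, 1]   # rollouts with a harmful candidate skill added

gap_helpful = marginal_utility(base, helpful)
gap_harmful = marginal_utility(base, harmful)
```

Because both groups share the task and the retrieved skills, the gap isolates the candidate's contribution instead of the combined effect of everything retrieved.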

2606.08994 2026-06-09 cs.CL new submission

Language-Aware Token Boosting: LLM Language Confusion Reduction Without Tuning

Trapoom Ukarapol, Pakhapoom Sarapat, Nut Chukamphaeng

Affiliations: SCB DataX; Tsinghua University; SCBX

AI summary: Proposes a tuning-free method for reducing language confusion: Language-Aware Token Boosting (LATB) and its adaptive variant (Adaptive-LATB) apply a perturbation to target-language tokens, improving multilingual alignment while preserving summarization quality.

Comments: ACL 2026 Main Conference

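2606.09032 2026-06-09 cs.CL new submission

Bridging the Agent-World Gap: Text World Models for LLM-based Agents

Yixia Li, Hongru Wang, Peng Lai, Zhiwen Ruan, He Zhu, Youxin Zhu, Ganlong Zhao, Minda Hu, Yun Chen, Sibei Yang, Peng Li, Jeff Z. Pan, Jia Pan, Guanhua Chen, Yang Liu, Guanbin Li

Affiliations: Southern University of Science and Technology; University of Edinburgh; Peking University; Sun Yat-sen University; The Chinese University of Hong Kong; Shanghai University of Finance and Economics; Tsinghua University; The University of Hong Kong

AI summary: Systematically surveys text world models for LLM-based agents, organized around a formal framework and the agent lifecycle, covering foundations and definitions, construction paradigms, applications (experience synthesis at training time; planning, verification, and adaptation at inference time), and evaluation, with the aim of consolidating the field and clarifying its design space and open challenges.

Comments: Code: https://github.com/sustech-nlp/awesome-text-world-models

Abstract (translated from the Chinese)

Large language model (LLM)-based agents are increasingly used in interactive text environments, from web navigation and code editing to tool use and long-horizon dialogue. Yet many agents remain largely reactive, mapping observations to actions without an explicit model of how these environments are structured and evolve. This motivates text world models (TWMs): transition models over textual states that, given a state and a candidate action, predict the resulting web page, terminal output, API response, or user reply, thereby supporting planning, efficient learning, and principled evaluation. We systematically survey text world models for LLM-based agents, organized around a formal framework and the agent lifecycle: (1) Foundations, defining text world models and characterizing them by state representation and grounding domain; (2) Construction, categorizing the LLM-as-world-model and code-as-world-model paradigms and reviewing construction methods; (3) Applications, examining how world models support agents at training time through experience synthesis and at inference time through planning, verification, and adaptation; (4) Evaluation, covering both the evaluation of world models themselves and their use as evaluation environments for agents. We aim to consolidate this rapidly developing field, clarify its design space, and highlight open challenges for future research.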

When Built-in Thinking Both Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

Sai Adith Senthil Kumar

Affiliations: George Mason University

AI summary: Studies how the thinking mode of large reasoning models (LRMs) affects instruction following and finds that thinking changes the pattern of errors rather than uniformly degrading performance: Planning-type constraints improve while Precision-type constraints worsen. Analysis of thinking traces and activation patching sheds light on the mechanism.

Comments: 16 pages, 7 figures, 15 tables

Abstract

Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern of errors (some prompts improve while others worsen) rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens; the class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan's opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r approximately 0.15); Planning shows near-zero predictive correlation (r approximately 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r approximately -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (32-58% vs. 14-40% mean layer-restoration), with the largest gap at 14B (about 30 pp).

2606.07548 2026-06-09 cs.IR cs.AI cs.CL cross-list

Evaluating Advanced Prompting on Gemini Flash for Multi-Hop Biomedical QA

Ahmed Bajaber, Mohammed Alliheedi

Affiliations: Saudi Med AI Lab (SMAIL); Prince Sultan University; Al-Baha University

AI summary: Designs a multi-component prompt (role-playing, multi-shot chain-of-thought examples, and formatting rules) that reaches a Concept Level Score of 0.720 on Gemini 2.0 Flash, well above the 0.565 baseline and close to the next-generation model, showing that advanced prompt design is key to unlocking LLM reasoning ability.

Comments: 8 pages, proceedings of the BioCreative IX Challenge and Workshop (BC9) at IJCAI 2025

DOI: 10.5281/zenodo.16876579
Journal ref: Proc. BioCreative IX Workshop (BC9), IJCAI 2025, Montreal, Canada

Abstract

The MedHopQA challenge presents a critical test for Large Language Models (LLMs): complex, multi-hop reasoning in the high-stakes biomedical domain. This paper details our direct API-based evaluation of Google's Gemini Flash models, focusing on the impact of advanced prompt engineering. We designed a sophisticated, multi-component prompt for Gemini 2.0 Flash that combined role-playing, explicit multi-shot Chain-of-Thought (CoT) examples, and detailed formatting rules. Our best run, using this complex prompt, achieved a Concept Level Score of 0.720. This result dramatically outperformed a baseline prompt which scored only 0.565. Remarkably, this performance on the efficient Gemini 2.0 Flash was almost identical to the result from the next-generation Gemini 2.5 Flash. Our findings demonstrate that sophisticated prompt design is a critical factor for unlocking the full reasoning capabilities of modern LLMs.

2606.07703 2026-06-09 cs.LG cs.AI cs.CL cross-list

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

Hongxing Wang, Harenome Razanajato, Zhen Zhang, Yujie Yuan, Hongsheng Liu

AI summary: Studies oracle-guided sparse prefill as a way to reduce dense attention computation in hybrid long-context models while preserving task performance, and examines oracle feasibility, indexer quality, and runtime speedup headroom.

Comments: Technical report, first release, 26 pages, 2 figures, 11 tables

Abstract

Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, sparse, linear, or recurrent components. We study how much dense attention is needed to preserve task-level behavior under explicit support granularity and top-k budgets. We introduce an attention-mass top-k oracle for existing GQA checkpoints: for each layer and query position, it computes dense attention, selects head-averaged token support, and recomputes attention only on that support. The oracle is a diagnostic reference, not a deployable accelerator, and separates sparse-budget feasibility from indexer error and runtime realization effects. On Qwen-family retrieval-heavy evaluations, the longest per-query oracle rows stay within 1 point of dense, and a Qwen3.5-9B RULER-style sweep from 4K to 100K stays within 0.48 points. Guided by the oracle, we derive a head-collapsed auxiliary indexer trained by KL distillation from dense attention-mass distributions while keeping the backbone frozen. With separately distilled Qwen3.5-0.8B and Qwen3.5-9B indexers, the reported 16K/32K validation macro gaps are +2.04 and +1.13 points, treated as quality preservation rather than improvement; fused selection-block-shared support can introduce a larger realization gap. Preliminary single-card TTFT measurements show distilled-indexer sparse serving speedups of 1.71x for Qwen3.5-0.8B on NPU and 1.93x for Qwen3.5-9B on GPU against its dense FlashAttention-2 baseline. Additional random-init stress rows reach 3.44x, indicating sparse-runtime headroom but not validated output quality. This first release separates oracle feasibility, distilled-indexer quality, and runtime headroom, leaving a fully matched quality-latency frontier to future work.
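The oracle described above (dense attention, head-averaged token support, recomputation on that support) can be sketched for a single query position. This is a minimal illustration under stated assumptions, not the report's implementation: the function names, the list-of-lists score layout, and the omission of causal masking and support granularity are all choices made here for brevity.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def oracle_topk_attention(scores_per_head, k):
    """scores_per_head[h][t]: pre-softmax score of one query against key t in head h.

    Hypothetical helper: returns the shared top-k support and, per head,
    the attention weights recomputed on that support only.
    """
    dense = [softmax(row) for row in scores_per_head]  # step 1: dense attention
    n_tokens = len(scores_per_head[0])
    # step 2: head-averaged attention mass per token
    mass = [sum(head[t] for head in dense) / len(dense) for t in range(n_tokens)]
    support = sorted(sorted(range(n_tokens), key=lambda t: -mass[t])[:k])
    # step 3: recompute softmax restricted to the selected support
    sparse = []
    for row in scores_per_head:
        probs = softmax([row[t] for t in support])
        sparse.append(dict(zip(support, probs)))
    return support, sparse

scores = [[2.0, 0.0, 1.0, -1.0],
          [0.0, 3.0, 1.0, -1.0]]
support, sparse = oracle_topk_attention(scores, k=2)
print(support)
```

Because the support is chosen from the dense attention itself, a routine like this saves no compute; it only measures how much is lost at a given top-k budget, which matches the abstract's framing of the oracle as a diagnostic reference.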

2606.07720 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

为什么将残差流限制在层而不是令牌？用于连续潜在推理的持久记忆

Mujtaba Farhan, Maheep Chaudhary

发表机构 * University of Cambridge（剑桥大学）

AI总结针对CoCoNuT在潜在空间推理中因中间隐藏状态被覆盖导致概念瓶颈的问题，提出AGCLR模型，通过门控概念流持久记忆机制，在GSM8K、HotpotQA和ProsQA上取得一致提升。

AI中文摘要

大型语言模型（LLMs）在数学和多跳规划任务上展现了卓越的推理能力。CoCoNuT（连续思维链）范式通过使模型能够在潜在空间中进行推理，同时探索多个推理路径，而不是早期就承诺单一链条，从而扩展了这一能力。然而，我们识别出一个我们称之为概念瓶颈的限制。在每次推理过程中，中间隐藏状态被覆盖，导致模型随着推理深度增加而丢失早期步骤中计算出的关键事实。我们在经验上观察到了这一点。在HotpotQA上，原始CoCoNuT（10.4% EM）未能超过CoT基线（11.0% EM），并且在GSM8K上随着课程深度增加性能下降。为了解决这个问题，我们提出了AGCLR（自适应门控连续潜在推理），它通过一个门控概念流增强了CoCoNuT，即一个跨所有推理过程保持的持久残差记忆，由三个学习到的门控制：一个将中间事实提交到记忆的写入门，一个检索相关先前状态的读取门，以及一个修剪不相关上下文的遗忘门。在使用GPT-2作为基础模型在GSM8K、HotpotQA和ProsQA上进行评估时，AGCLR在所有类型的数据集上实现了一致的改进。随着课程深度的增加，性能差距进一步扩大，直接解决了概念瓶颈。代码可在https://anonymous.4open.science/r/JJJJ/README.md获取。

英文摘要

Large language models (LLMs) have demonstrated remarkable reasoning abilities on mathematical and multi-hop planning tasks. The CoCoNuT (Chain of Continuous Thought) paradigm (Hao et al., 2024) extends this by enabling models to reason in latent space, exploring multiple reasoning paths simultaneously rather than committing to a single chain early on. However, we identify a limitation we term the concept bottleneck: at each reasoning pass, intermediate hidden states are overwritten, causing the model to lose critical facts computed in earlier steps as reasoning depth increases. We observe this empirically. On HotpotQA, vanilla CoCoNuT (10.4% EM) fails to improve over the CoT baseline (11.0% EM), and performance degrades with curriculum depth on GSM8K. To address this, we propose AGCLR (Adaptive Gated Continuous Latent Reasoning), which augments CoCoNuT with a Gated Concept Stream: a persistent residual memory maintained across all reasoning passes and controlled by three learned gates, namely a write gate that commits intermediate facts to memory, a read gate that retrieves relevant prior states, and a forget gate that prunes irrelevant context. Evaluated on GSM8K, HotpotQA, and ProsQA using GPT-2 as our base model, AGCLR achieves consistent improvements across all types of datasets, with the performance gap compounding as curriculum depth increases, directly resolving the concept bottleneck. Code available at https://anonymous.4open.science/r/JJJJ/README.md
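To make the write, read, and forget gates concrete, here is a rough sketch of one memory update between reasoning passes. The abstract does not give the update equations, so the LSTM-like form below, the function name, and the per-dimension gate logits are all assumptions made for illustration; in the actual model the gates would be learned functions of the hidden state.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_memory_step(memory, hidden, write_logits, read_logits, forget_logits):
    """One hypothetical update of a persistent memory vector.

    forget gate: how much of the old memory survives,
    write gate:  how much of the current hidden state is committed,
    read gate:   how much of the memory is added back to the hidden state.
    """
    new_memory = [
        sigmoid(f) * m + sigmoid(w) * h
        for m, h, w, f in zip(memory, hidden, write_logits, forget_logits)
    ]
    output = [
        h + sigmoid(r) * m
        for h, r, m in zip(hidden, read_logits, new_memory)
    ]
    return new_memory, output

memory, hidden = [1.0, 0.0], [0.5, 2.0]
memory, output = gated_memory_step(
    memory, hidden,
    write_logits=[-50.0, 50.0],   # skip dimension 0, write dimension 1
    read_logits=[50.0, -50.0],    # read dimension 0 only
    forget_logits=[50.0, 50.0],   # keep everything already stored
)
print(memory, output)
```

The point of the sketch is only that the memory is carried across passes instead of being overwritten, which is the property the abstract credits for avoiding the concept bottleneck.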

2606.07812 2026-06-09 cs.AI cs.CL 交叉投稿

Scaling Participation in Modular AI Systems

模块化AI系统中的参与扩展

Shangbin Feng, Yike Wang, Weijia Shi, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov

发表机构 * University of Washington（华盛顿大学）； Stanford University（斯坦福大学）

AI总结提出参与扩展范式，通过多方贡献小模型构建模块化AI系统，在15项任务上比单体大语言模型提升高达15.4%，并展现涌现能力。

详情

sGPO: 在RLVR中用推理FLOPs换取训练效率

Shivchander Sudalairaj, Kai Xu, Akash Srivastava, Giorgio Giannone

发表机构 * Red Hat（红帽）； IBM

AI总结提出sGPO方法，通过少量推理计算预估查询难度，自适应分配训练预算，将训练计算量降低三倍，同时保持或提升性能。

AI中文摘要

标准的可验证奖励强化学习（RLVR）训练为每个查询分配固定的展开预算，而不考虑每个查询的难度对当前策略的意义。这导致两种对称的失败模式：简单查询产生接近零的优势，因为策略已经解决了它们；而无法解决的查询不产生信号，因为策略从未解决它们。这两种情况都浪费了训练FLOPs，而没有贡献学习梯度。我们引入了排序组策略优化（sGPO），一种计算高效的策略，用少量推理FLOPs换取大量减少浪费的训练FLOPs。关键见解是，廉价的推理计算可以作为查询难度的单一离线代理。通过在初始策略下为每个查询生成一小批并行样本，我们获得了模型感知的经验成功率。这激励将训练展开组大小设置为该成功率的倒数，这是一个实用的规则，通过从每个生成的展开中提取最大优势来最大化样本效率。这一单次性能分析过程同时驱动数据过滤（移除琐碎查询和子采样无法解决的查询）、自适应组大小分配和课程构建（从易到难调度查询）。sGPO匹配或超过基线性能，同时将总训练计算量减少三倍，包括前期的推理性能分析成本。

英文摘要

Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
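The inverse-success-rate rule and the single profiling pass can be illustrated with a short sketch. The function name, the cap on group size, and the choice to skip never-solved queries entirely (the abstract sub-samples them instead) are simplifying assumptions made here, not details from the paper.

```python
import math

def plan_rollouts(success_rates, max_group=16):
    """Turn profiled success rates into (query, group_size) pairs, easy to hard.

    success_rates: {query_id: empirical success rate under the initial policy}.
    Hypothetical helper; max_group is an assumed cap, not a value from the paper.
    """
    plan = {}
    for query, p in success_rates.items():
        if p >= 1.0:
            continue  # trivial query: near-zero advantage, filtered out
        if p <= 0.0:
            continue  # never solved in profiling: skipped in this sketch
        plan[query] = min(max_group, math.ceil(1.0 / p))  # group size ~ 1 / success rate
    # curriculum: schedule queries from easy (high success rate) to hard
    return sorted(plan.items(), key=lambda kv: -success_rates[kv[0]])

rates = {"a": 1.0, "b": 0.5, "c": 0.125, "d": 0.0, "e": 0.03}
print(plan_rollouts(rates))
```

The intuition behind the rule is that a group of about 1/p rollouts contains roughly one success in expectation, so each group tends to include both outcomes and yields a non-zero advantage.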

2606.09030 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

TRIAGE: 基于辩证推理的不规则采样医学时间序列风险可解释预测方法

Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang

发表机构 * KAIST（韩国科学技术院）； AITRICS ； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）

AI总结提出TRIAGE框架，利用大语言模型对竞争性临床结果生成辩证推理，缓解风险极化，实现连续风险评分与可解释推理，在三个基准上AUPRC提升3.3%，校准误差降低81%。

Comments Code is available at https://github.com/HyeongWon-Jang/TRIAGE

详情

PBSD: 特权贝叶斯自蒸馏用于长程信用分配

Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun, Lei Song, Jiang Bian, Bo Zhao

发表机构 * School of AI, Shanghai Jiao Tong University（上海交通大学人工智能学院）； XYZ AI Lab（XYZ AI实验室）

AI总结提出PBSD方法，通过贝叶斯校准的自蒸馏将稀疏最终奖励转化为细粒度步骤级信用信号，解决长程智能体任务中的信用分配问题，实验表明其提升领域内外性能并促进泛化。

AI中文摘要

长程智能体任务对基于结果的强化学习提出了根本性的信用分配挑战：轨迹级奖励验证最终正确性，但很少指导哪些中间推理步骤或工具交互对结果有贡献。在多轮搜索智能体中，这一困难尤为突出，因为成功轨迹可能包含误导性动作，而失败轨迹可能包含有价值的证据收集步骤。我们提出PBSD（特权贝叶斯自蒸馏），一种在稀疏最终奖励下进行细粒度信用分配的贝叶斯校准自蒸馏方法。PBSD通过验证答案的后验与先验概率比来衡量轨迹质量，并应用贝叶斯规则将这个难以估计的答案侧比率转化为标准学生模型与特权答案条件教师模型之间的易处理似然比。对该贝叶斯证据分数的自回归分解产生轮级信号，识别每个中间轮次是支持还是破坏已验证结果。因此，PBSD提供了一种原则性且优雅的重新加权方案，将稀疏结果监督转化为贝叶斯校准的轮级信用信号，同时完全兼容标准策略优化。实验表明，PBSD在领域内和领域外设置中均持续提升性能，并有效将知识从短上下文训练迁移到长上下文推理，表明其细粒度信用分配机制促进了更有效的策略学习并带来更好的泛化。

英文摘要

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
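The conversion from an answer-side ratio to a trajectory-side likelihood ratio can be written out explicitly. The notation below is an assumption introduced for illustration (the abstract fixes no symbols): $q$ is the query, $\tau = (t_1, \dots, t_n)$ the trajectory split into turns, $a$ the verified answer, $\pi_{\mathrm{S}}$ the standard student, and $\pi_{\mathrm{T}}$ the answer-conditioned teacher.

```latex
\begin{align}
\frac{P(a \mid q, \tau)}{P(a \mid q)}
  &= \frac{P(\tau \mid q, a)}{P(\tau \mid q)}
  && \text{(Bayes' rule)} \\
\log \frac{P(\tau \mid q, a)}{P(\tau \mid q)}
  &= \sum_{i=1}^{n} \Big[ \log \pi_{\mathrm{T}}(t_i \mid q, a, t_{<i})
     - \log \pi_{\mathrm{S}}(t_i \mid q, t_{<i}) \Big]
  && \text{(chain rule over turns)}
\end{align}
```

Under this reading, each summand is a turn-level score: a positive term means turn $t_i$ is more likely once the verified answer is known, so it plausibly supports the outcome, while a negative term suggests the turn works against it. How the paper normalizes or clips these terms is not stated in the abstract.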

2606.09410 2026-06-09 cs.AI cs.CL 交叉投稿

Capacity, Not Format: Rethinking Structured Reasoning Failures

容量而非格式：重新思考结构化推理失败

Hengxin Fan

AI总结研究发现结构化格式对模型性能的影响取决于其空闲容量，容量不足时通过截断和纯容量竞争两种机制导致性能下降，建议先思考后格式化。

Comments 12 pages, 3 figures

AI中文摘要

先前的工作将结构化输出视为推理的代价，但这种框架是不完整的：格式化的成本强烈依赖于模型的空闲容量。通过使用信息匹配的散文控制和四级模式复杂度梯度，我们在4个模型和5个基准测试中分离了格式特定效应与提示长度混淆，成功生成的响应中解析失败率为0%。我们发现结构化格式是容量依赖的。具有足够余量的模型在吸收JSON约束时不会出现性能下降（Sonnet：MATH-Hard上JSON为$88.7\pm4.0$%，CoT为$89.3\pm1.7$%）。相反，格式会严重降低接近其极限运行的模型，通过两种不同的机制。首先，在标准token预算下，Haiku下降了36.2个百分点（$p < 0.0001$），主要是由于截断。其次，即使延长预算消除了截断，GPT-4o-mini仍下降了28.0个百分点（$p < 0.001$），揭示了独立于token耗尽的纯容量竞争。这种格式惩罚随模式复杂度增加（McNemar $p < 0.0001$），且不能仅由提示长度解释。此外，这些结果对前沿模型免疫的说法提出了质疑：在AIME竞赛数学中，Opus 4.7在JSON下从96.2%下降到91.0%（$-5.3$个百分点；显示的百分比独立四舍五入，精确差值为$7/133 = 5.26$pp $\approx 5.3$pp）。一种延迟结构消融——在格式化之前自由推理——恢复了大部分丢失的准确率（3次运行均值：80-87%），支持了容量竞争机制。实际意义不是避免结构化输出，而是使其与容量匹配：当模型接近其极限时，先思考，后格式化。

英文摘要

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation -- reasoning freely before formatting -- recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.

2606.09471 2026-06-09 cs.LG cs.CL 交叉投稿

Escaping the KL Agreement Trap in On-Policy Distillation

逃离在线策略蒸馏中的KL一致陷阱

Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen, Hui Xiong

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； The Hong Kong University of Science and Technology（香港科技大学）； The Hong Kong Polytechnic University（香港理工大学）； Eastern Institute of Technology, Ningbo（宁波东方理工大学）

AI总结针对在线策略蒸馏中学生陷入低KL一致陷阱导致训练信号弱的问题，提出KAT动态终止规则，过滤弱监督，在数学基准上提升avg@k 2.66%和pass@k 3.43%，同时减少59.73%的rollout长度。

Comments 13 pages, 8 figures

2606.09508 2026-06-09 cs.AI cs.CL 交叉投稿

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

从刚性到动态：面向长上下文LLM的熵引导自适应推理

Zhanchao Xu, Haoyang Li, Qingfa Xiao, Fei Teng, Chen Jason Zhang, Lei Chen, Qing Li

发表机构 * Department of Computing, PolyU（香港理工大学计算学系）； DSA, HKUST(GZ)（香港科技大学（广州）数据科学与分析学域）； CSE, HKUST（香港科技大学计算机科学与工程学系）

AI总结提出EntropyInfer框架，利用注意力熵在预填充阶段自适应分配计算资源，并在解码阶段通过生成令牌压缩KV缓存，实现长上下文LLM的高效推理。

AI中文摘要

现有的用于长上下文LLM推理的稀疏注意力和KV缓存压缩方法通常应用固定的稀疏模式或跨所有注意力头的统一预算，忽略了头和上下文之间注意力行为的显著变化。我们观察到注意力头之间存在两种不同的熵模式：刚性头，其熵在输入段中保持接近零；动态头，其熵显著波动。至关重要的是，这些类型的分布是上下文相关的，无法离线预先确定。因此，我们提出了EntropyInfer，一个无需训练框架，在预填充期间使用注意力熵在单个头和段的粒度上自适应分配计算。对于解码，我们引入了一种潜在KV缓存压缩方案，该方案利用生成的输出令牌（而非仅预填充令牌）来识别和保留最关键的缓存条目。在Llama、Qwen和openPangu模型系列上的大量实验表明，EntropyInfer在包括SnapKV、AdaKV和CritiPrefill在内的基线上持续取得优势，在超过100k令牌的情况下实现了高达2.39倍的端到端加速，同时与全注意力相比质量下降最小。代码已发布在https://github.com/SHA-4096/EntropyInfer。

英文摘要

Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on Llama, Qwen and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39$\times$ end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is released in https://github.com/SHA-4096/EntropyInfer.
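The rigid-versus-dynamic distinction rests on the Shannon entropy of each head's attention distribution per input segment. The sketch below shows one way such a classification could look; the threshold value, the function names, and the use of a simple maximum over segments are assumptions for illustration, not the criteria used by EntropyInfer.

```python
import math

def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def classify_head(segment_attn, near_zero=0.1):
    """segment_attn: one attention distribution per input segment for a single head.

    Hypothetical rule: a head is 'rigid' if its entropy stays below
    `near_zero` on every segment, otherwise 'dynamic'.
    """
    entropies = [attention_entropy(p) for p in segment_attn]
    label = "rigid" if max(entropies) <= near_zero else "dynamic"
    return label, entropies

rigid_head = [[0.99, 0.01, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
dynamic_head = [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]]
print(classify_head(rigid_head)[0], classify_head(dynamic_head)[0])
```

Since the abstract stresses that head types are context-dependent, a check like this would have to run online on each input during prefilling instead of being precomputed offline.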

2606.09672 2026-06-09 cs.AI cs.CL cs.LG cs.PF q-bio.QM 交叉投稿

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

相关性不够：嵌入人类元数据用于个体因果发现

Suraj Biswas, Saurabh Gupta, Pritam Mukherjee

发表机构 * Assessli Research（Assessli研究）； Dots-In Research（Dots-In研究）

AI总结针对预训练生物医学语言模型在跨域无关对中产生高余弦相似度（0.76-0.92）导致因果推断错误的问题，提出对比学习（提升分离度至1.63x）和BODHI硬负例挖掘（提升至2.30x），结合OpenVINO优化实现133倍加速。

Comments 20 pages, 18 figures, 9 tables

AI中文摘要

询问一个预训练的生物医学语言模型“皮质醇28 ug/dL”和“股市波动”是否相关，它会返回0.83的余弦相似度（1.0表示完全相同）。两者没有共同机制。这不是个例：我们测试的所有现成生物医学编码器（BioBERT、PubMedBERT、BioM-ELECTRA）在跨域无关对上得分在0.76到0.92之间，而正确答案应接近零。跨域区分准确率为0%。检索系统可以承受这一点，因为下游语言模型会过滤噪声。但大型行为模型（LBM）——一种以人为对象而非句子的基础模型——则不能：它在用户生活图上推理，并将嵌入接近性视为两个事件因果关联的证据。虚假接近性会写入虚假因果边，所有下游都会继承错误。在这里，嵌入几何不是调节旋钮，而是正确性的关键。我们报告了修复方法。对72,034对进行对比训练，将PubMedBERT的BIOSSES相关性从0.633提升到0.828，域内与域间分离度从1.05倍提升到1.63倍。第二次训练BODHI从生物医学知识图中缺失的边挖掘硬负例，将分离度提升到2.30倍，区分差距提升到+0.392，BIOSSES代价为4.5%。在带有AMX的Intel Xeon 6737P上，OpenVINO将单查询延迟从1367毫秒降至10毫秒（133倍），达到每秒555个句子。一个发现与标准建议相悖：在此芯片上，FP16在所有服务批量大小下优于INT8，我们解释了原因。同一模型在无AMX的Ice Lake实例上运行慢13-27倍。我们发布了基准测试套件、训练语料库、BODHI生成器和OpenVINO脚本。

英文摘要

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
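The two quantities the abstract leans on, cosine similarity and within-versus-across-domain separation, are easy to state in code. The abstract does not define the separation metric, so the ratio of mean within-domain similarity to mean across-domain similarity used below is an assumption, as are the function names and the toy two-dimensional embeddings.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def separation_ratio(embeddings, domains):
    """Mean within-domain similarity divided by mean across-domain similarity.

    Assumed definition for illustration; values near 1.0 mean the encoder
    barely distinguishes related pairs from unrelated cross-domain pairs.
    """
    within, across = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        (within if domains[i] == domains[j] else across).append(sim)
    return (sum(within) / len(within)) / (sum(across) / len(across))

embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
domains = ["bio", "bio", "finance", "finance"]
print(round(separation_ratio(embeddings, domains), 2))
```

With real sentence embeddings, cross-domain similarities of 0.76 to 0.92 would push a ratio like this toward 1, which is consistent in spirit with the 1.05x baseline the abstract reports.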

2606.09707 2026-06-09 cs.LG cs.CL 交叉投稿

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

BrainSurgery：用于模型编辑和升级的可复现且可靠的声明式权重操作

Gianluca Barmina, Annemette Broch Pirchert, Andrea Blasi Núñez, Lukas Galke Poech, Peter Schneider-Kamp

发表机构 * University of Southern Denmark（南丹麦大学）

AI总结提出BrainSurgery工具，通过声明式YAML计划实现神经网络检查点的鲁棒可复现张量操作，支持结构修改、数学变换和张量重塑，内置断言验证防止静默错误。

AI中文摘要

随着深度学习模型规模的扩大，管理、检查和修改大型检查点变得越来越具有挑战性。研究人员经常需要更改模型权重以进行层重构、精度转换、低秩分解和架构调试，但这些工作流程通常依赖于脆弱的临时Python脚本。在这里，我们介绍BrainSurgery，一个用于对神经网络检查点进行鲁棒且可复现的“张量手术”的工具，并提供一个系统演示，涵盖从模型升级到LoRA提取的四个示例和三个案例研究。通过抽象存储格式和内存管理，BrainSurgery通过声明式YAML计划执行复杂的转换。它支持通过表达性正则表达式和结构定位进行结构修改、数学变换和张量重塑，同时内置断言验证张量形状、数据类型和值，以防止静默错误。我们期望BrainSurgery通过其可复现且经过验证的操作，为未来的研究提供坚实的基础。

英文摘要

As deep learning models scale, managing, inspecting, and modifying large checkpoints has become increasingly challenging. Researchers often need to alter model weights for layer restructuring, precision casting, low-rank factorization, and architectural debugging, yet these workflows often rely on fragile ad-hoc Python scripts. Here, we introduce BrainSurgery, a tool for robust and reproducible "tensor surgery" on neural network checkpoints, and provide a system demonstration covering four examples and three case studies from model upcycling to LoRA extraction. By abstracting storage formats and memory management, BrainSurgery executes complex transformations through declarative YAML plans. It supports structural modifications, mathematical transformations, and tensor reshaping through expressive regex and structural targeting, while built-in assertions validate tensor shapes, data types, and values to prevent silent errors. We envision that BrainSurgery will provide a strong foundation for future research through its reproducible and validated operations.

2507.00322 2026-06-09 cs.CL cs.AI cs.SE 版本更新

Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

干扰导致的失败：当有缺陷机制掩盖健全机制时，语言模型在平衡括号任务中出错

Daking Rai, Samuel Miller, Kevin Moran, Ziyu Yao

发表机构 * George Mason University（乔治·梅森大学）； University of Central Florida（中央佛罗里达大学）； Department of Computer Science（计算机科学系）

AI总结研究揭示语言模型在平衡括号任务中出错的原因：部分组件实现可靠机制，而其他组件引入噪声，当噪声机制主导时导致错误。提出RASteer方法，通过增强可靠组件贡献，将部分模型准确率从0%提升至近100%，并在算术推理任务中取得约20%的性能提升。

Comments 23 pages, 10 figures, accepted for NeurIPS 2025

AI中文摘要

尽管语言模型（LMs）在编码能力方面取得了显著进步，但在生成平衡括号等简单句法任务上仍然存在困难。在本研究中，我们调查了不同规模（124M-7B）的语言模型中这些错误持续存在的潜在机制，旨在理解和减少这些错误。我们的研究揭示，语言模型依赖于多个独立做出预测的组件（注意力头和前馈神经元）。虽然一些组件在广泛的输入范围内可靠地促进正确答案（即实现“健全机制”），但其他组件可靠性较低，通过促进错误标记引入噪声（即实现“有缺陷机制”）。当有缺陷机制掩盖健全机制并主导预测时，就会发生错误。受此启发，我们引入了RASteer，一种引导方法，用于系统地识别并增加可靠组件的贡献，以提升模型性能。RASteer在平衡括号任务上显著提升了性能，将某些模型的准确率从0%提高到接近100%，且不影响模型的一般编码能力。我们进一步展示了其在算术推理任务中的更广泛适用性，实现了高达约20%的性能提升。

英文摘要

Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms"), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms"). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from 0% to around 100% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around 20%.

2509.24189 2026-06-09 cs.CL 版本更新

知道更多，更清晰：大型语言模型中知识增强的元认知框架

Hao Chen, Ye He, Yuchun Fan, Yukun Yan, Zhenghao Liu, Qingfu Zhu, Maosong Sun, Wanxiang Che

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出元认知框架，利用内部认知信号划分知识空间为掌握、混淆和缺失区域，通过差异化干预和认知一致性机制增强知识并校准置信度，实验证明优于基线方法。

AI中文摘要

知识增强显著提升了大型语言模型（LLMs）在知识密集型任务中的性能。然而，现有方法通常基于模型性能等同于内部知识的简单前提，忽略了导致过度自信错误或不确定真相的知识-置信度差距。为弥合这一差距，我们提出了一种新颖的元认知框架，通过差异化干预和对齐实现可靠的知识增强。我们的方法利用内部认知信号将知识空间划分为掌握、混淆和缺失区域，指导有针对性的知识扩展。此外，我们引入了一种认知一致性机制，以同步主观确定性与客观准确性，确保校准的知识边界。大量实验表明，我们的框架持续优于强基线，验证了其在不仅增强知识能力，而且培养更好区分已知与未知的认知行为方面的合理性。所有代码均可在https://github.com/AI9Stars/Know-More-Know-Clearer获取。

英文摘要

Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate that our framework consistently outperforms strong baselines, validating its rationality in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns. All code is available at https://github.com/AI9Stars/Know-More-Know-Clearer.

2603.13259 2026-06-09 cs.CL cs.AI 版本更新

ThinkBooster: 一种用于LLM推理无缝测试时扩展的统一框架

Vladislav Smirnov, Chieu Nguyen, Sergey Senichev, Minh Ngoc Ta, Ekaterina Fadeeva, Artem Vazhentsev, Daria Galimzianova, Nikolai Rozanov, Viktor Mazanov, Jingwei Ni, Tianyi Wu, Igor Kiselev, Mrinmaya Sachan, Iryna Gurevych, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

发表机构 * MBZUAI ； ETH Zürich（苏黎世联邦理工学院）； Imperial College London（伦敦帝国理工学院）； NUS（新加坡国立大学）； Accenture（埃森哲）； Innopolis University（因诺波利斯大学）； Independent Researcher（独立研究者）

AI总结提出ThinkBooster框架，通过模块化库、联合评估基准和可部署代理服务，实现LLM推理的测试时计算扩展，在数学和编码任务上验证了性能-计算权衡。

AI中文摘要

测试时计算（TTC）扩展已成为一种强大的范式，通过在推理期间分配额外计算（例如，通过多样本生成和基于验证器的重新排序）来改进大型语言模型（LLM）推理。现有的TTC扩展策略和推理评分器仍然碎片化，在不一致的协议下进行评估，并且很少通过质量-成本权衡的视角进行分析。我们引入了ThinkBooster，一个用于LLM推理无缝测试时计算扩展的统一框架，它包括（i）一个模块化的Python库，实现了最先进的TTC扩展策略和评分器家族，（ii）一个联合评估性能和计算效率的基准，以及（iii）一个可部署的、兼容OpenAI的代理服务，使得将自适应推理无缝集成到实际应用中成为可能。我们还提供了一个演示可视化调试器，用于检查推理轨迹、中间选择决策和替代推理路径。在数学和编码任务上的实证结果揭示了TTC扩展策略和评分方法的性能-计算权衡，并表明ThinkBooster在实际任务中提供了实际收益。代码以MIT许可证在线提供。

英文摘要

Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
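Multi-sample generation with verifier-based reranking, the example the abstract gives of TTC scaling, reduces to a best-of-N loop. The sketch below is generic and does not use ThinkBooster's API, which the abstract does not describe; `best_of_n`, `toy_generate`, and `toy_score` are hypothetical stand-ins for a sampler and a reasoning scorer.

```python
import random

def best_of_n(generate, score, prompt, n, seed=0):
    """Sample n candidates and return the one the scorer ranks highest."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    scores = [score(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]

def toy_generate(prompt, rng):
    # Stand-in for sampling one model answer: a noisy integer guess.
    return rng.randint(0, 20)

def toy_score(prompt, candidate):
    # Stand-in for a verifier: higher is better, 0 means exactly right.
    return -abs(candidate - 12)

for n in (1, 4, 16):
    answer, s = best_of_n(toy_generate, toy_score, "6 + 6 = ?", n)
    print(n, answer, s)
```

Generation cost grows linearly with `n` while the best score can only improve or stay flat for a fixed seed, which is the quality-cost trade-off the abstract says should be measured jointly.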

2409.15723 2026-06-09 cs.LG cs.CL 版本更新

Federated Large Language Models: Current Progress and Future Directions

联邦大语言模型：当前进展与未来方向

Yuhang Yao, Jianyi Zhang, Junda Wu, Chengkai Huang, Yu Xia, Tong Yu, Ruiyi Zhang, Sungchul Kim, Ryan Rossi, Ang Li, Lina Yao, Julian McAuley, Yiran Chen, Carlee Joe-Wong

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Duke University（杜克大学）； University of California San Diego（加州大学圣地亚哥分校）； The University of New South Wales（新南威尔士大学）； Adobe Research（Adobe研究院）； University of Maryland College Park（马里兰大学帕克分校）； CSIRO’s Data61（澳大利亚联邦科学与工业研究组织Data61）

AI总结本文综述联邦学习与大语言模型结合（FedLLM）的最新进展，重点分析联邦微调和联邦提示学习如何应对效率、个性化和安全挑战，并展望联邦预训练和联邦智能体等方向。

Comments Accepted by PAKDD 2026

AI中文摘要

大语言模型在各种应用中取得了令人印象深刻的性能，但其训练通常依赖于集中式数据收集，引发了严重的隐私和治理问题。联邦学习通过使多个客户端能够协作训练共享模型而不暴露原始本地数据，提供了一种去中心化的替代方案。然而，将联邦学习与大语言模型集成带来了新的挑战，包括数据异质性、收敛不稳定性、通信开销和计算约束。本综述提供了联邦学习用于大语言模型（FedLLM）的全面且最新的概述。我们系统地回顾了近期进展，特别强调联邦微调和联邦提示学习，并分析了现有方法如何应对效率、个性化和安全挑战。我们进一步总结了新兴方向，如联邦预训练和联邦智能体。我们的目标是提供对这个快速发展领域的结构化视角，并突出未来研究的有前景的途径。

英文摘要

Large Language Models have achieved impressive performance across diverse applications, yet their training typically depends on centralized data collection, raising serious privacy and governance concerns. Federated Learning offers a decentralized alternative by enabling multiple clients to collaboratively train shared models without exposing raw local data. However, integrating FL with LLMs introduces new challenges, including data heterogeneity, convergence instability, communication overhead, and computational constraints. This survey provides a comprehensive and up-to-date overview of Federated Learning for Large Language Models (FedLLM). We systematically review recent advances, with particular emphasis on federated fine-tuning and federated prompt learning, and analyze how existing methods address efficiency, personalization, and security challenges. We further summarize emerging directions such as federated pre-training and federated agents. Our goal is to offer a structured perspective on this rapidly evolving field and to highlight promising avenues for future research.
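The idea of training a shared model without exposing raw local data is most often illustrated with federated averaging (FedAvg), where clients send parameter updates and the server combines them weighted by local dataset size. The abstract does not name a specific aggregation rule, so FedAvg is used here as the standard baseline; the function name and the flat weight-vector layout are illustrative assumptions.

```python
def fed_avg(client_updates):
    """Aggregate client weights, weighting each client by its number of examples.

    client_updates: list of (weights, num_examples), where weights is a flat
    list of floats. Only parameters are shared, never the raw local data.
    """
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    return [
        sum(w[i] * n for w, n in client_updates) / total
        for i in range(dim)
    ]

updates = [([1.0, 2.0], 1), ([3.0, 4.0], 3)]
print(fed_avg(updates))
```

For LLMs the vector being averaged is typically a small set of adapter or prompt parameters and not the full model, which is one way federated fine-tuning and federated prompt learning limit communication overhead.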

2506.06295 2026-06-09 cs.LG cs.AI cs.CL 版本更新

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

dLLM-Cache：基于自适应缓存的扩散大语言模型加速

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyan Wei, Shaobo Wang, Yichen Zhu, Linfeng Zhang

发表机构 * Zhejiang University（浙江大学）

AI总结针对扩散大语言模型推理延迟高的问题，提出一种无需训练的自适应缓存框架dLLM-Cache，通过长间隔提示缓存和基于特征相似性的部分响应更新，实现高效中间计算复用，在保持输出质量的同时大幅降低FLOPs。

Comments Accepted by ICML 2026

AI中文摘要

自回归模型长期以来主导了大语言模型领域。最近，一种基于扩散的大语言模型（dLLMs）的新范式出现，它通过迭代去噪掩码段来生成文本。这种方法显示出显著的优势和潜力。然而，dLLMs存在高推理延迟的问题。传统的自回归模型加速技术，如键值缓存，由于dLLMs的双向注意力机制而无法兼容。为了应对这一特定挑战，我们的工作首先基于一个关键观察：dLLM推理涉及一个静态提示和一个部分动态的响应，其中大多数标记在相邻去噪步骤中保持稳定。基于此，我们提出了dLLM-Cache，一种无需训练的自适应缓存框架，它结合了长间隔提示缓存和基于特征相似性的部分响应更新。这种设计能够在不影响模型性能的情况下高效重用中间计算。在代表性dLLMs（包括LLaDA 8B和Dream 7B）上的大量实验表明，dLLM-Cache在LongBench-HotpotQA上实现了高达9.1倍的FLOPs减少，同时保持了具有竞争力的输出质量。值得注意的是，我们的方法使dLLM推理延迟在许多设置下接近自回归模型。本工作的代码公开于：https://github.com/maomaocun/dLLM-cache。

英文摘要

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x FLOPs reduction on LongBench-HotpotQA while maintaining competitive output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. The code for this work is publicly available at: https://github.com/maomaocun/dLLM-cache.
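The combination of long-interval prompt caching and similarity-guided partial response updates can be sketched as two small decisions made at every denoising step. Which features are compared, how the update budget is set, and all names below are assumptions for illustration; the abstract specifies only that feature similarity guides which response tokens are recomputed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def tokens_to_refresh(cached_feats, new_feats, update_ratio):
    """Pick the response tokens whose features drifted most since the last step.

    Tokens with the lowest similarity to their cached feature are recomputed;
    all others reuse cached intermediate results (hypothetical policy).
    """
    sims = [cosine(c, n) for c, n in zip(cached_feats, new_feats)]
    budget = max(1, int(update_ratio * len(sims)))
    least_similar = sorted(range(len(sims)), key=lambda i: sims[i])[:budget]
    return sorted(least_similar)

def refresh_prompt_cache(step, prompt_interval):
    """The prompt is static, so its cache is rebuilt only every prompt_interval steps."""
    return step % prompt_interval == 0

cached = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
current = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(tokens_to_refresh(cached, current, update_ratio=0.34))
print(refresh_prompt_cache(step=0, prompt_interval=50))
```

The saving comes from the abstract's key observation: most response tokens stay stable between adjacent denoising steps, so only a small, changing subset needs fresh computation.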

2506.10341 2026-06-09 cs.LG cs.CL 版本更新

Formalizing Learning from Language Feedback with Provable Guarantees

从语言反馈中学习的形式化与可证明保证

Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng

发表机构 * University of California, Berkeley（加州大学伯克利分校）； University of Washington（华盛顿大学）； University of Toronto（多伦多大学）

AI总结本文形式化语言反馈学习问题，提出转移埃尔泽维度刻画学习难度，并开发无遗憾算法HELiX，证明其性能保证，展示丰富语言反馈可指数级加速学习。

Comments ICML 2026

AI中文摘要

通过观察和语言反馈进行交互式学习是一个日益受到关注的领域，其驱动力来自大型语言模型（LLM）智能体的出现。尽管有令人印象深刻的实证演示，但迄今为止，这些决策问题的原则性框架仍然缺乏。我们形式化了语言反馈学习（LLF）问题，提出了足以在潜在奖励下实现学习的假设，并引入了$\textit{转移埃尔泽维度}$作为衡量LLF难度的指标。我们形式化了语言反馈中的信息控制学习复杂性的直觉，并展示了从丰富语言反馈中学习可以比从奖励中学习指数级更快的案例。我们开发了一种名为$\texttt{HELiX}$的无遗憾算法，通过顺序交互可证明地解决LLF问题，其性能保证随转移埃尔泽维度缩放。在多个实证领域，我们展示了即使重复提示LLM不可靠时，$\texttt{HELiX}$也能表现良好。我们的贡献标志着朝着使用通用语言反馈设计原则性交互学习算法迈出了重要一步。

英文摘要

Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. Despite impressive empirical demonstrations, so far a principled framing of these decision problems remains lacking. We formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a measure to characterize the hardness of LLF. We formalize the intuition that information in the language feedback governs the learning complexity, and demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark an important step towards designing principled interactive learning algorithms using generic language feedback.

2507.09751 2026-06-09 cs.AI cs.CL cs.LO 版本更新

Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations

基于LLM解释的可靠且完备的神经符号推理

Bradley P. Allen, Prateek Chhikara, Thomas Macaulay Ferguson, Filip Ilievski, Paul Groth

Affiliations: University of Amsterdam; University of Southern California; Rensselaer Polytechnic Institute; Vrije Universiteit Amsterdam

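AI summary: Integrates an LLM directly into the interpretation function of the formal semantics for a paraconsistent logic, giving neurosymbolic reasoning that preserves soundness and completeness. On datasets derived from GPQA and SimpleQA, bilateral factuality evaluation improves macro-F1 over a unilateral baseline by roughly 6 percentage points, and a prototype tableau reasoner detects contradictions in a medication-safety knowledge base.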

Comments 43 pages, 14 tables, 4 figures. Accepted to the 19th Conference on Neurosymbolic Learning and Reasoning (NeSy 2025); to appear in the Neurosymbolic Artificial Intelligence Special Issue on NeSy 2025 Extended Papers

详情

AI中文摘要

大型语言模型（LLM）在自然语言理解和生成方面展现了令人印象深刻的能力，但其输出存在逻辑一致性问题。尽管LLM存在不一致性，我们如何在形式推理中利用其覆盖面广的参数化知识？我们提出了一种方法，将LLM直接集成到次协调逻辑的形式语义的解释函数中。我们使用从短答案事实性基准GPQA和SimpleQA导出的数据集对方法进行实证评估，结果显示双边事实性评估在两个基准上的宏F1比单边基线提高了约6个百分点（以覆盖率下降为代价，因为在不一致或不确定的情况下会触发弃权）。我们进一步描述了一个实现该方法的概念验证tableau推理器，并将其应用于包含228条断言语句和712条推断语句的药物安全知识库：系统检测到92个真值过剩（glut），它们对应于医学上的重大错误（例如，阿片类药物被推断为非成瘾性，β受体阻滞剂被推断为对哮喘患者安全），同时知识库保持可满足，表明矛盾被局部化而不是导致逻辑爆炸。与先前工作不同，我们的方法为神经符号推理提供了一个带有实际实现的理论框架，在利用LLM知识的同时保留底层逻辑的可靠性和完备性。

英文摘要

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but exhibit problems with logical consistency in their output. How can we harness LLMs' broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic. We evaluate the method empirically using datasets derived from the short-form factuality benchmarks GPQA and SimpleQA, showing that bilateral factuality evaluation improves macro-F1 over a unilateral baseline by roughly 6 percentage points on both benchmarks (at the cost of reduced coverage, as abstention is triggered on inconsistent or uncertain cases). We further describe a proof-of-concept tableau reasoner implementing the method, and apply it to a medication-safety knowledge base of 228 asserted and 712 inferred statements: the system detects 92 gluts corresponding to medically significant errors (e.g., opioids inferred as non-addictive, beta-blockers inferred as safe in asthma) while remaining satisfiable, demonstrating that contradictions are localized rather than causing logical explosion. Unlike prior work, our method offers a theoretical framework with a practical implementation for neurosymbolic reasoning that leverages an LLM's knowledge while preserving the underlying logic's soundness and completeness properties.
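
The bilateral evaluation described above can be pictured with a small sketch. It assumes, hypothetically, that an LLM is asked two separate questions about a statement (can it be verified, can it be refuted) and that the two boolean answers are combined into one of four values. The names `bilateral_value` and `answer_or_abstain` are illustrative only and are not taken from the paper.

```python
def bilateral_value(verified: bool, refuted: bool) -> str:
    """Combine two independent judgments into a four-valued outcome.

    Assumption: 'glut' means both verified and refuted (inconsistent),
    'gap' means neither (uncertain).
    """
    if verified and refuted:
        return "glut"
    if verified:
        return "true"
    if refuted:
        return "false"
    return "gap"


def answer_or_abstain(verified: bool, refuted: bool):
    """Return True/False for consistent cases, None to abstain otherwise."""
    value = bilateral_value(verified, refuted)
    if value == "true":
        return True
    if value == "false":
        return False
    return None  # abstain on glut or gap, which lowers coverage


judgments = [(True, False), (False, True), (True, True), (False, False)]
decisions = [answer_or_abstain(v, r) for v, r in judgments]
coverage = sum(d is not None for d in decisions) / len(decisions)
```

Abstaining on gluts and gaps is what trades coverage for the macro-F1 gain mentioned in the abstract; the exact prompts and logic used by the authors are not given here.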

2509.10534 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

解耦“什么”和“哪里”：极坐标位置嵌入

Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer

Affiliations: DeepMind, London, UK

AI总结提出极坐标位置嵌入（PoPE）以解耦Transformer注意力机制中的内容和位置，在诊断任务、序列建模和语言模型中优于RoPE，并展现零样本长度外推能力。

Comments ICML 2026 camera-ready version

详情

AI中文摘要

Transformer架构中的注意力机制根据内容（“什么”）和序列中的位置（“哪里”）将键匹配到查询。我们提出一项分析，表明在流行的RoPE旋转位置嵌入中，“什么”和“哪里”是纠缠的。这种纠缠会损害性能，特别是当决策需要在这两个因素上独立匹配时。我们提出对RoPE的改进，称为极坐标位置嵌入（PoPE），它消除了“什么-哪里”的混淆。PoPE在仅通过位置或内容进行索引的诊断任务上表现远优于基线。在音乐、基因组和自然语言领域的自回归序列建模中，使用PoPE作为位置编码方案的Transformer在评估损失（困惑度）和下游任务性能上优于使用RoPE的基线。在语言建模中，这些优势在模型规模从124M到774M参数时持续存在。关键的是，与RoPE甚至专为外推设计的方法YaRN（需要额外微调和频率插值）相比，PoPE展现出强大的零样本长度外推能力。

英文摘要

The attention mechanism in a Transformer architecture matches key to query based on both content -- the what -- and position in a sequence -- the where. We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities compared not only to RoPE but even a method designed for extrapolation, YaRN, which requires additional fine tuning and frequency interpolation.
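
For readers unfamiliar with RoPE, the sketch below shows the standard rotary embedding only; it does not implement PoPE, whose details are not given in the abstract. It illustrates that the rotated query-key score depends on the relative offset between positions, and that content vectors and positional angles meet in the same dot product. The vectors `q` and `k` are arbitrary illustrative values.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (standard RoPE)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # frequency for this pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def score(q, k, q_pos, k_pos):
    """Attention logit between a rotated query and a rotated key."""
    rq, rk = rope_rotate(q, q_pos), rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(rq, rk))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
same_offset_a = score(q, k, q_pos=5, k_pos=2)      # offset 3
same_offset_b = score(q, k, q_pos=105, k_pos=102)  # offset 3, shifted by 100
different_offset = score(q, k, q_pos=5, k_pos=4)   # offset 1
```

The first two scores agree up to floating-point error because only the offset matters, while the third differs even though `q` and `k` are unchanged.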

2509.12760 2026-06-09 cs.LG cs.CL 版本更新

Similarity-Distance-Magnitude Activations

相似度-距离-幅度激活函数

Allen Schmaltz

发表机构 * Reexpress AI

AI总结本文提出SDM激活函数，通过引入相似度和距离意识提升softmax的鲁棒性和可解释性，并通过密集匹配实现基于实例的可解释性。SDM估计器通过数据驱动的CDF分区控制分类准确性，优于现有校准方法。

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026. 21 pages, 8 tables, 1 algorithm. arXiv admin note: substantial text overlap with arXiv:2502.20167

2510.06052 2026-06-09 cs.AI cs.CL 版本更新

MixReasoning: Switching Modes to Think

MixReasoning: 切换模式以思考

Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, Xinchao Wang

发表机构 * arXiv

AI总结提出MixReasoning框架，动态调整推理深度，对困难步骤详细推理、简单步骤简洁推理，在GSM8K、MATH-500和AIME上缩短推理长度并提高效率，不牺牲准确性。

详情

AI中文摘要

推理模型通过逐步解决问题、将问题分解为子问题并在生成答案前探索长思维链来提升性能。然而，对每一步都应用扩展推理会引入大量冗余，因为子问题的难度和复杂度差异很大：少数关键步骤对最终答案真正具有挑战性和决定性，而许多其他步骤仅涉及简单的修正或计算。因此，一个自然的想法是赋予推理模型自适应应对这种变化的能力，而不是对所有步骤采用相同的详细程度。为此，我们提出了MixReasoning，一个在单个响应中动态调整推理深度的框架。由此产生的思维链成为困难步骤的详细推理与简单步骤的简洁推理的混合。在GSM8K、MATH-500和AIME上的实验表明，MixReasoning缩短了推理长度，显著提高了效率，且不牺牲准确性。

英文摘要

Reasoning models enhance performance by tackling problems in a step-by-step manner, decomposing them into sub-problems and exploring long chains of thought before producing an answer. However, applying extended reasoning to every step introduces substantial redundancy, as sub-problems vary widely in difficulty and complexity: a small number of pivotal steps are genuinely challenging and decisive for the final answer, while many others only involve straightforward revisions or simple computations. Therefore, a natural idea is to endow reasoning models with the ability to adaptively respond to this variation, rather than treating all steps with the same level of elaboration. To this end, we propose MixReasoning, a framework that dynamically adjusts the depth of reasoning within a single response. The resulting chain of thought then becomes a mixture of detailed reasoning on difficult steps and concise inference on simpler ones. Experiments on GSM8K, MATH-500, and AIME show that MixReasoning shortens reasoning length and substantially improves efficiency without compromising accuracy.

2601.02880 2026-06-09 cs.AI cs.CL 版本更新

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

ReTreVal：带有验证和跨问题记忆的推理树

Abhishek HS, Pavan C Shekar, Arpit Jain, Ashwanth Krishnan

发表机构 * QpiAI

AI总结 ReTreVal通过自适应树探索、带工具增强的节点细化、类型化失败回溯和自修改记忆，使大语言模型在无需微调的情况下实现跨问题学习，其在MATH-500上达到85.8%的pass@1准确率，在MMLU-Pro上达到54.4%的准确率。

Comments 15 pages, 1 figure, 12 tables

详情

AI中文摘要

现有的推理时推理框架都会在问题边界丢弃所有失败上下文，导致模型在解决第500个问题时并不比解决第1个问题时更聪明。我们提出了ReTreVal（带验证的推理树），这是一个无需训练的框架，通过自适应树探索与工具增强的节点细化、将分类后的错误上下文注入恢复分支的类型化失败回溯，以及跨问题累积并修订策略条目的自重写记忆来弥补这一差距，使任何固定、未经修改的LLM无需微调即可在推理时进行跨问题学习。ReTreVal在MATH-500上达到85.8%的pass@1（比零样本CoT高8.6个百分点，比最强基线Self-Refine高8.6个百分点），在MMLU-Pro上达到54.4%（比Self-Refine高15.3个百分点），3.4:1的改进与退步之比表明这是真正的错误恢复而非噪声。这些能力以往需要梯度更新才能获得，如今使32B模型能够与规模大得多的单次推理系统竞争。

英文摘要

Every existing inference-time reasoning framework discards all failure context at problem boundaries, leaving a model solving problem 500 no wiser than it was on problem 1. We present ReTreVal (Reasoning Tree with Validation), a training-free framework that closes this gap through adaptive tree exploration with tool-augmented node refinement, typed-failure backtracking that injects categorized error context into the recovered branch, and a self-rewriting memory that accumulates and revises strategy entries across problems, enabling inference-time cross-problem learning on any fixed, unmodified LLM without fine-tuning. ReTreVal achieves 85.8% pass@1 on MATH-500 (+8.6 pp over Zero-Shot CoT, +8.6 pp over the strongest baseline Self-Refine) and 54.4% on MMLU-Pro (+15.3 pp over Self-Refine), with a 3.4:1 win-to-regression ratio confirming genuine error recovery rather than noise. These capabilities, previously requiring gradient updates, allow a 32B model to compete with much larger single-pass systems.

2601.09085 2026-06-09 cs.LG cs.AI cs.CL cs.IR 版本更新

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

Anay Chauhan, Gurucharan Marthi Krishna Kumar, Arion Das, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

Affiliations: Synopsys; McGill University; IIIT Ranchi; Amazon; Meta; Apple; Pragya Lab, BITS Pilani Goa

AI总结提出Spherical KV方法，通过角度域注意力（ADA）和率失真保持（RDR）机制，在长上下文推理中减少KV缓存占用并保持解码效率。

详情

AI中文摘要

长上下文推理日益受到KV缓存的限制：常驻内存随上下文长度增长，解码受限于重复的高带宽内存（HBM）流而非算术运算。现有方法如驱逐、窗口化、量化和卸载减少了占用，但通常仅部分解决了关键路径瓶颈，尤其是在解码期间压缩状态仍需重建为密集向量时。我们提出Spherical KV，一种将KV分配视为基于注意力几何的率失真问题以实现高效解码的长上下文推理方法。该方法基于两个思想：(i) 在解码热循环中廉价地表示方向信息，(ii) 根据估计的未来效用分配保留和精度。其第一个组件，角度域注意力（ADA），将键存储在由标量半径和紧凑角度码组成的球面参数化中，并直接根据这些码计算注意力对数，无需重建密集键。这保留了分页、块局部、融合友好的解码路径，并在实际服务设置中直接针对HBM流量。其第二个组件，率失真保持（RDR），在固定预算下联合选择每个令牌和头的保留/丢弃决策及精度层级，生成层级同质的页面，具有轻量级元数据和合并读取。ADA和RDR共同提供了一种面向部署的机制，在保持解码效率的同时减少KV常驻内存。

英文摘要

Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods such as eviction, windowing, quantization, and offloading reduce footprint, but often leave the critical-path bottleneck only partially addressed, especially when compressed states must still be reconstructed into dense vectors during decoding. We present Spherical KV, a long-context inference method that treats KV allocation as a rate-distortion problem grounded in attention geometry for efficient decoding. The method is built on two ideas: (i) represent directional information cheaply in the decode hot loop, and (ii) allocate retention and precision according to estimated future utility. Its first component, Angle-Domain Attention (ADA), stores keys in a spherical parameterization consisting of a scalar radius and compact angle codes, and computes attention logits directly from these codes without reconstructing dense keys. This preserves a paged, block-local, fusion-friendly decode path and directly targets HBM traffic in realistic serving settings. Its second component, Rate-Distortion Retention (RDR), jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata and coalesced reads. Together, ADA and RDR provide a deployment-oriented mechanism for reducing KV residency while preserving decode efficiency.

2605.26872 2026-06-09 cs.LG cs.AI cs.CL 版本更新

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

最强的教师并不总是最好的教师：以学生为中心的答案选择

Zhengyu Hu, Zheyuan Xiao, Linxin Song, Fengqing Jiang, Yuetai Li, Zhengyu Chen, Zhihan Xiong, Yue Liu, Junhao Lin, Yao Su, Lijie Hu, Kaize Ding, Teng Xiao, Radha Poovendran

Affiliations: University of Washington; University of Texas at Austin; University of Southern California; Independent Researcher; National University of Singapore; Microsoft; Google; Mohamed bin Zayed University of Artificial Intelligence; Northwestern University; Allen Institute for AI (AI2)

AI总结提出以学生为中心的答案采样（SCAS）框架，通过估计学生中心的学习成本选择教师生成的答案，从而提升学生模型性能。

详情

AI中文摘要

LLM训练越来越依赖教师生成的监督，包括合成响应、推理轨迹和工具使用演示。当前实践通常选择表现最好的教师来生成学生训练数据，隐含地将教师测试表现视为教学质量的代理。我们表明这一假设可能失败：即使多个教师对同一问题提供正确答案，最强教师的答案也不一定是对给定学生的最佳监督。为解决这一问题，我们提出以学生为中心的答案采样（SCAS），该框架根据估计的学生中心学习成本从经过验证的教师生成答案中进行选择。受逐词元梯度分解的启发，我们推导出该成本的高效仅前向代理，并在训练中用于指导答案选择。在30个教师模型、6个学生基础模型和6个任务上的实验表明，SCAS持续提升学生性能，表明有效的蒸馏应优先考虑与当前学生匹配的监督，而非仅依赖教师强度。

英文摘要

LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations. Current practice often chooses the highest-performing teacher to generate student training data, implicitly treating teacher test performance as a proxy for teaching quality. We show that this assumption can fail: even when multiple teachers provide correct answers to the same question, the answer from the strongest teacher is not necessarily the best supervision for a given student. To address this gap, we propose Student-Centric Answer Sampling (SCAS), a framework that selects from verified teacher-generated answers according to their estimated student-centric learning cost. Motivated by a token-wise gradient decomposition, we derive an efficient forward-only proxy for this cost and use it to guide answer selection during training. Experiments across 30 teacher models, 6 student base models, and 6 tasks show that SCAS consistently improves student performance, suggesting that effective distillation should prioritize supervision matched to the current student rather than teacher strength alone.

2506.03106 2026-06-09 cs.CL cs.AI 版本更新

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Critique-GRPO：通过自然语言和数值反馈提升大语言模型推理能力

Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

Affiliations: HCCL, The Chinese University of Hong Kong, Hong Kong, China; University of Cambridge, Cambridge, United Kingdom; MMLab, The Chinese University of Hong Kong, Hong Kong, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China

AI总结本文提出Critique-GRPO框架，结合自然语言和数值反馈提升LLM推理能力，实验显示其在多个任务中优于传统方法，显著提升推理性能。

Comments Accepted by ICML 2026 Spotlight

详情

AI中文摘要

最近利用数值奖励的强化学习（RL）进展显著增强了大语言模型（LLM）的复杂推理能力。然而，我们发现纯数值反馈存在三个根本限制：性能平台期、无效的自发自我反思和持续失败。我们证明，当向进入平台期的RL模型提供自然语言批评时，它们能够成功改进失败的解答。受此启发，我们提出Critique-GRPO，一种整合自然语言和数值反馈进行策略优化的在线RL框架。该方法使LLM能够同时从初始响应和批评引导的改进中学习，有效内化两个阶段的探索收益。大量实验显示，Critique-GRPO优于所有参与比较的监督微调和基于RL的微调方法，在八项具有挑战性的推理任务上，在各种Qwen模型上平均Pass@1提升约+15.0-21.6%，在Llama-3.2-3B-Instruct上提升约+7.3%。值得注意的是，Critique-GRPO通过自我批评实现有效的自我改进，相较于GRPO取得显著提升，例如在AIME 2024上Pass@1提升+16.7%。代码和模型已发布：https://github.com/zhangxy-2019/critique-GRPO

英文摘要

Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods, achieving average Pass@1 improvements of approximately +15.0-21.6% on various Qwen models and +7.3% on Llama-3.2-3B-Instruct across eight challenging reasoning tasks. Notably, Critique-GRPO facilitates effective self-improvement through self-critiquing, achieving substantial gains over GRPO, e.g., a +16.7% Pass@1 improvement on AIME 2024. The code and models are released at: https://github.com/zhangxy-2019/critique-GRPO
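
The abstract does not spell out the Critique-GRPO objective. As background only, the numerical-feedback side of standard GRPO scores each sampled response relative to the other responses in its group. The sketch below shows that group-relative advantage, assuming population standard deviation and a small `eps` for stability; both choices, and the function name, are illustrative rather than taken from the released code.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    Assumption: population standard deviation; implementations differ
    on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to one prompt, with 0/1 correctness rewards (made-up values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# When every response gets the same reward, all advantages are zero,
# so the group provides no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The all-zero case is one way to picture the "persistent failures" the abstract attributes to purely numerical feedback: a group of uniformly wrong answers carries no gradient signal, which is where a natural-language critique could add information.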

2606.08673 2026-06-09 cs.CL 新提交

ClinicalAligner26AM: A Cross-Lingual Aligner for Dataset Translation; Evidences from the MultiClinCorpus Shared Task

ClinicalAligner26AM: 用于数据集翻译的跨语言对齐器；来自MultiClinCorpus共享任务的证据

François Remy

发表机构 * Parallia Healthcare AI（Parallia医疗人工智能）

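AI summary: Introduces ClinicalAligner26AM, a multilingual aligner for biomedical and clinical text initialized from ClinicalEncoder26AM. Soft alignment targets are built by sharpening, with Sinkhorn-Knopp optimal transport, a cost matrix that fuses sentence-level, phrase-level, and token-level signals. On the MultiClinCorpus shared task, which projects entity annotations across languages, character-weighted F1 exceeds 0.95 in nearly all settings.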

详情

AI中文摘要

词级跨语言对齐对于标注投影、翻译审计和跨语言忠实度估计至关重要，然而现有的神经对齐器很少适应专业领域。在本文中，我们介绍了ClinicalAligner26AM，这是一个从ClinicalEncoder26AM初始化的大上下文多语言对齐模型，用于生物医学和临床文本。我们的训练方法受AWESoME Align启发。我们为平行临床文本和对话建立融合了句子级、短语级和词元级信号的成本矩阵，并使用Sinkhorn-Knopp最优传输对其进行锐化，从而构建软对齐目标。我们通过鼓励学生对齐器基于余弦的朴素词元相似度分数匹配该目标，直接将锐化后的对齐矩阵蒸馏到学生对齐器中。在推理时，我们通过学习到的词元对齐矩阵投影源跨度分数，并解码目标文本中最长的有效高分跨度，可选地由附录B中总结的MultiClinNER预测提供支持。我们在MultiClinCorpus共享任务上评估CA26AM，该任务将西班牙语临床实体标注投影到六种目标语言中。我们提交的两个系统在所有语言和实体类型中分别排名第一和第二，几乎所有设置下的字符加权F1分数均高于0.95。

英文摘要

Word-level cross-lingual alignment is central to annotation projection, translation auditing, and cross-lingual faithfulness estimation, yet existing neural aligners are rarely adapted to specialized domains. In this paper, we introduce ClinicalAligner26AM, a large-context multilingual aligner model for biomedical and clinical text initialized from ClinicalEncoder26AM. Our training recipe is inspired by AWESoME Align. We build our soft alignment target by sharpening, with Sinkhorn-Knopp optimal transport, a cost matrix established for parallel clinical texts and conversations through the fusion of sentence-level, phrase-level, and token-level signals. We distill this sharpened alignment matrix directly into our student aligner by encouraging its naive cosine-based token similarity scores to match this target. At inference time, we project source-span scores through the learned token alignment matrix and decode the longest valid high-scoring span in the target text, optionally supported by MultiClinNER predictions summarized in Appendix B. We evaluate CA26AM on the MultiClinCorpus shared task, which projects Spanish clinical entity annotations into six target languages. Our two submitted systems ranked first and second, respectively, across all languages and entity types, with character-weighted F1 scores above 0.95 in nearly all settings.

2606.09334 2026-06-09 cs.CL 新提交

How Far Can Prompting Go for Minimal-Edit Ukrainian Grammatical Error Correction?

提示工程在最小编辑乌克兰语语法错误纠正中能走多远？

Kateryna Karpo, Artem Chernodub

发表机构 * Ukrainian Catholic University（乌克兰天主教大学）； YouScan ； Zendesk

AI总结评估11个商业LLM在乌克兰语最小编辑语法错误纠正上的表现，发现结合最小编辑提示和少样本策略的Gemini 3.1-Pro达到F0.5=69.22，缩小了与微调SOTA的差距。

详情

AI中文摘要

微调大型语言模型在乌克兰语语法错误纠正中占主导地位，而通过API访问的LLM在最小编辑基准上几乎未经过测试。我们在UNLP 2023 GEC-only基准上评估了来自四个提供商的11个商业LLM和一个开源乌克兰语模型，比较了零样本、少样本、最小编辑和LLM辅助提示优化策略。我们的最佳配置（Gemini 3.1-Pro）达到了F0.5=69.22，缩小了与微调SOTA（F0.5=73.14）超过90%的差距。对于零样本提示，只有Claude模型受益于乌克兰语指令。然而，所有模型的最佳总体结果使用了乌克兰语最小编辑提示，其语言特定规则需要精确的乌克兰语表达。在最小编辑+少样本基础上进行LLM辅助提示优化获得了最高分数。详细的最小编辑指令在标点和大小写错误上带来了最大收益，但导致模型放弃了几个低频类别。深入错误分析，我们识别了与乌克兰语特定语言现象相关的五种重复过度纠正模式。代码、提示和输出已公开。

英文摘要

Fine-tuned Large Language Models (LLMs) dominate in Ukrainian grammatical error correction (GEC), while API-accessed LLMs remain nearly untested on minimal-edit benchmarks. We evaluate 11 commercial LLMs from four providers and one open-source Ukrainian model on the UNLP 2023 GEC-only benchmark, comparing zero-shot, few-shot, minimal-edits, and LLM-assisted prompt optimization strategies. Our best configuration (Gemini 3.1-Pro) reaches F0.5=69.22, closing over 90% of the gap to fine-tuned SOTA (F0.5=73.14). For zero-shot prompts, only Claude models benefit from Ukrainian instructions. However, the best overall results for all models use Ukrainian minimal-edits prompts, whose language-specific rules require Ukrainian to express precisely. LLM-assisted prompt optimization on top of minimal-edits + few-shot achieves the highest score. Detailed minimal-edits instructions yield the largest gains for punctuation and case errors but cause the model to abandon several low-frequency categories. Delving into error analysis, we identify five recurring overcorrection patterns tied to Ukrainian-specific linguistic phenomena. Code, prompts, and outputs are publicly available.
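
The F0.5 score reported above is the general F-beta measure with beta = 0.5, which weights precision more heavily than recall. A minimal sketch follows; the precision and recall values in the example are made up for illustration and are not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# With beta = 0.5, a precise but conservative corrector scores higher
# than one with the same two numbers swapped.
precise_system = f_beta(precision=0.8, recall=0.4)
eager_system = f_beta(precision=0.4, recall=0.8)
```

This asymmetry is why minimal-edit GEC benchmarks favor prompts that discourage overcorrection: unnecessary edits lower precision, which F0.5 penalizes more than missed errors.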

2606.09655 2026-06-09 cs.CL 新提交

Beyond Accuracy: Community Perspectives on Machine Translation

超越准确率：机器翻译的社区视角

Yujun Wang, Ehud Reiter, Shimei Pan, Steffen Eger, Wei Zhao

发表机构 * University of Technology Nuremberg（纽伦堡工业大学）； University of Maryland, Baltimore County（马里兰大学巴尔的摩县分校）； University of Aberdeen（阿伯丁大学）

AI总结本文通过分析社交媒体上四个利益相关者社区（AI开发者、专业译者、语言学习者、语言服务提供商）的帖子，揭示机器翻译技术社区间的分歧与冲突，强调倾听用户社区需求的重要性。

详情

AI中文摘要

尽管机器翻译（MT）取得了显著进展，但非AI社区对MT系统日益增长的担忧表明技术进展与现实用户需求之间存在明显差距。例如，NLP研究人员关注基准性能，而最终用户关心伦理问题、信任、可靠性、成本等。我们认为倾听不同用户社区至关重要，以便研究工作能针对社区关心的问题。为此，我们首次进行大规模分析，调查四个利益相关者社区（AI开发者、专业译者、语言学习者和语言服务提供商）在社交媒体上关于MT技术的帖子。我们构建了一个包含2019年至2025年来自Reddit、Facebook、Bluesky和Mastodon的79,286条帖子及评论的数据集，并分析这些社区在哪些方面存在分歧，以及分歧的方式和原因。总体而言，我们发现社区间经常存在分歧，甚至在翻译质量、效率和可靠性等话题上因情绪极化而表现出强烈冲突。这是因为这些社区处理这些话题的方式不同：AI社区将其视为技术和计算问题，而非AI（用户）社区更关注质量细微差别、时间节省、用户信任和更广泛的社会问题。

英文摘要

Despite remarkable progress in machine translation (MT), non-AI communities have raised growing concerns about MT systems, suggesting a noticeable gap between technical advancement and the needs of real-world users. For instance, while NLP researchers focus on benchmark performance, end users care about ethical concerns, trust, reliability, costs, and more. We argue that listening to various user communities is essential so that research efforts would be directed towards the problems that the communities care about. To this end, we present a large-scale analysis, for the first time, that investigates what four stakeholder communities (AI developers, professional translators, language learners, and language service providers) post about MT technology on social media. To do so, we construct a dataset of 79,286 posts and comments from Reddit, Facebook, Bluesky, and Mastodon from 2019 to 2025, and analyse where these communities disagree, and how and why. Overall, we find that communities often disagree, and even show strong conflicts due to polarised sentiments on topics such as translation quality, efficiency, and reliability. This is because these communities approach these topics differently: the AI community frames them as technical and computational problems, while non-AI (user) communities care more about quality nuances, time savings, user trust, and broader social issues.

2601.10925 2026-06-09 cs.CL 版本更新

Massively Multilingual Joint Segmentation and Glossing

大规模多语言联合分割与标注

Michael Ginn, Lindia Tjuatja, Enora Rice, Ali Marashian, Maria Valentini, Jasmine Xu, Graham Neubig, Alexis Palmer

发表机构 * University of Colorado Boulder（科罗拉多大学博尔德分校）； Carnegie Mellon University（卡内基梅隆大学）

AI总结针对现有神经标注模型缺乏形态边界预测导致可解释性差的问题，提出PolyGloss模型，通过联合分割与标注提升准确性和对齐度，并支持低秩适应快速迁移。

Comments 15 pages, 9 figures, accepted to ACL 2026 Long Papers

详情

AI中文摘要

利用神经网络进行自动行间标注预测是加速语言文档记录工作的一种有前景的方法。然而，尽管像GlossLM这样的最先进模型在标注基准测试中取得了高分，但语言学家进行的用户研究发现，这些模型在实际场景中的实用性存在关键障碍。特别是，现有模型通常生成语素级别的标注，但将其分配给整个单词而不预测实际的语素边界，这使得预测的可解释性降低，从而对人类标注者来说不可信。我们首次研究了从原始文本中联合预测行间标注和相应形态分割的神经模型。我们进行实验以确定平衡分割和标注准确性以及两个任务之间对齐的最佳模型训练方式。我们扩展了GlossLM的训练语料库，并预训练了PolyGloss，这是一系列用于联合分割和标注的seq2seq多语言模型，在标注方面优于GlossLM，并在分割、标注和对齐方面击败了各种开源LLM。此外，我们证明了PolyGloss可以通过低秩适应快速适应新数据集。

英文摘要

Automated interlinear gloss prediction with neural networks is a promising approach to accelerate language documentation efforts. However, while state-of-the-art models like GlossLM achieve high scores on glossing benchmarks, user studies with linguists have found critical barriers to the usefulness of such models in real-world scenarios. In particular, existing models typically generate morpheme-level glosses but assign them to whole words without predicting the actual morpheme boundaries, making the predictions less interpretable and thus untrustworthy to human annotators. We conduct the first study on neural models that jointly predict interlinear glosses and the corresponding morphological segmentation from raw text. We run experiments to determine the optimal way to train models that balance segmentation and glossing accuracy, as well as the alignment between the two tasks. We extend the training corpus of GlossLM and pretrain PolyGloss, a family of seq2seq multilingual models for joint segmentation and glossing that outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and alignment. In addition, we demonstrate that PolyGloss can be quickly adapted to a new dataset via low-rank adaptation.

2606.07519 2026-06-09 cs.CL cs.AI 新提交

Bidirectional Small-Granularity Search between Code and Text

代码与文本之间的双向小粒度搜索

Marco A. Valenzuela-Escárcega, Enrique Noriega-Atala, Gus Hahn-Powell, Clayton T. Morrison, Mihai Surdeanu

发表机构 * Lex Machina ； The University of Arizona（亚利桑那大学）

AI总结提出双向小粒度搜索任务，通过自动生成数据训练模型，实现科学出版物文本与代码片段间的直接链接，支持跨模态检索。

详情

AI中文摘要

我们引入了代码与文本之间双向小粒度搜索的新任务，其中查询是文本或代码的小片段，结果也是相反模态的小片段，即代码或文本。该任务在科学出版物中的文本与相应代码片段之间建立直接链接，以支持更好、更快地理解科学方法。我们为所提出的任务引入了一个大型数据集，其中包括使用GPT-4自动生成的代码文本描述的训练分区，以及三个测试分区：一个域内和两个域外（OOD），包含手动注释的数据以及其他领域的材料。我们还提出了一种模块化方法来解决此任务。我们的方法在四个不同的子任务之间共享一个编码器，这些子任务学习双向答案跨度的开始/结束。我们表明，我们的方法在域内取得了良好结果，在域外也取得了令人鼓舞的结果。这表明使用自动生成的数据解决此任务是可能的，但仍有令人兴奋的未来工作要做。

英文摘要

We introduce the novel task of bidirectional small-granularity search between code and text, where the queries are small snippets of text or code and the results are also small fragments of the opposite modality, i.e., code or text. This task establishes direct links between text in scientific publications and corresponding code segments, in support of better and faster understanding of scientific methods. We introduce a large dataset for the proposed task that includes a training partition with textual descriptions of code generated automatically using GPT-4, and three testing partitions, one in-domain and two out-of-domain (OOD) that contain manually-annotated data as well as material from other domains. We also propose a modular approach to address this task. Our approach shares an encoder across four different subtasks that learn start/end of answer spans in both directions. We show that our method achieves good results in-domain, and encouraging results OOD. This suggests that addressing this task with automatically-generated data is possible, but there is exciting future work to be done.

2606.07523 2026-06-09 cs.CL cs.AI 新提交

Retrieval Augmented Generation Framework for the Nepali Legal Domain Question Answering

面向尼泊尔法律领域问答的检索增强生成框架

Samir Wagle, Abiral Adhikari, Reewaj Khanal, Batsal Bhandari, Prashant Manandhar, Praveen Acharya, Bal Krishna Bal

发表机构 * Dublin City University（都柏林城市大学）

AI总结提出首个基于检索增强生成的尼泊尔法律问答模型，利用BM25和E5模型检索案例法，实现91%的top-1精度和74%的生成答案可信度。

详情

AI中文摘要

英语等高资源语言的法律领域已广泛采用人工智能进行法律问答。然而，尼泊尔语等低资源语言的数据稀缺限制了大型语言模型在尼泊尔法律文本上的训练。本研究首次应用基于检索增强生成的模型，利用从Nepal Kanun Patrika数字档案中提取的案例法进行尼泊尔法律问答。使用BM25对分块文档进行检索，该方法实现了91%的top-1精度，使用多语言E5大模型时达到75%。对生成答案的评估显示，使用BM25文档检索时，可信度为74%，根据自动评判模型评估的真实性为85%，人工评估的真实性为84%，成功答案生成率为92%。这些结果表明，RAG管道可以有效解决低资源语言法律问答的差距，并为尼泊尔法律领域的可靠AI系统奠定基础。

英文摘要

Legal domains in high-resource languages like English have widely adopted artificial intelligence for legal question answering. However, data scarcity in low-resource languages such as Nepali has limited the training of large language models on Nepali legal texts. This study presents the first application of a Retrieval Augmented Generation (RAG) based model for Nepali legal question answering, using case laws extracted from the Nepal Kanun Patrika digital archive. Using BM25 on chunked documents, the approach achieved a top-1 precision (precision at 1) of 91 percent, compared with up to 75 percent with the multilingual E5 large model. Evaluation of generated answers showed 74 percent groundedness, 85 percent truthfulness according to an automated judge model, and 84 percent human-evaluated truthfulness when using BM25 document retrieval, with a 92 percent successful answer generation rate. These results demonstrate that the RAG pipeline can effectively address the gap in legal question answering for low-resource languages and provide a foundation for reliable AI systems in the Nepali legal domain.
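
The retrieval step and its metric can be sketched in a few lines. The code below is a plain Okapi BM25 scorer plus a precision-at-1 helper, assuming whitespace-tokenized chunks and the common defaults `k1=1.5`, `b=0.75`; the toy English chunks stand in for Nepali case-law text, and none of these settings are taken from the paper.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        total = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        scores.append(total)
    return scores


def precision_at_1(ranked_ids_per_query, relevant_ids_per_query):
    """Fraction of queries whose top-ranked chunk is relevant."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query)
        if ranked and ranked[0] in relevant
    )
    return hits / len(ranked_ids_per_query)


chunks = [
    "contract breach damages".split(),
    "property inheritance dispute".split(),
    "criminal appeal sentence".split(),
]
scores = bm25_scores("inheritance dispute".split(), chunks)
top_chunk = max(range(len(chunks)), key=lambda i: scores[i])
p_at_1 = precision_at_1([[1, 0], [2, 1]], [{1}, {0}])
```

In a RAG pipeline the top-scoring chunks would be passed to the generator as context; precision at 1 then measures how often the single best-ranked chunk is a relevant one.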

2606.07530 2026-06-09 cs.CL 新提交

Finding New Connections between Concepts from Medline Database Incorporating Domain Knowledge

从Medline数据库中结合领域知识发现概念间的新连接

Yang Weikang, Chowdhury S. M. Mazharul Hoque, Jin Wei

AI总结提出一种基于Swanson ABC模型的改进自适应模型，用于文献发现中隐藏的概念连接，通过中间主题B连接看似无关的主题A和C。

详情

DOI: 10.5772/intechopen.113081
Journal ref: Artificial Intelligence, IntechOpen, 2024

TrustMargin: Training-Free Arbitration between Parametric Memory and Retrieved Evidence in Large Language Models

Jingyan Xu, Hong Shi, Yi Shan, Penghui Liu, Yunhao Bai, Ningyuan Li, Xueyang Liu

发表机构 * Peking University（北京大学）

AI总结针对大语言模型在知识问答中参数记忆与检索证据冲突的问题，提出无训练仲裁层TrustMargin，利用模型自身似然度评分选择更可信的答案，无需微调或外部评判。

Comments 13 pages, 6 figures, 9 tables. Code and data are available at https://github.com/mojixu/TrustMargin.git

详情

AI中文摘要

大语言模型通过参数化记忆和检索证据回答知识密集型问题，但两种来源并非都可靠。检索可以填补知识空白，但干扰性段落可能覆盖正确的闭卷答案。我们将这种生成后冲突视为答案级源仲裁：给定来自同一冻结模型的直接和RAG答案，决定信任哪个源。我们提出TRUSTMARGIN，一个无训练、即插即用的仲裁层，它使用模型自身的似然度对两个现有候选答案进行评分。它结合了参数先验边际（测试记忆是否接受检索答案）和证据绑定边际（折扣仅段落显著性并衡量问题特定支持）。TRUSTMARGIN在直接和RAG之间进行选择，无需微调、外部评判或额外生成。在2WIKIMQA和CWQA上使用三种LLaMA规模，TRUSTMARGIN一致优于直接生成和BM25-RAG，恢复了直接/RAG oracle差距的一部分，并推广到多个无训练RAG流水线。

英文摘要

Large language models answer knowledge-intensive questions using both parametric memory and retrieved evidence, but neither source is uniformly reliable. Retrieval can fill knowledge gaps, yet distracting passages may override correct closed-book answers. We study this post-generation conflict as answer-level source arbitration: given Direct and RAG answers from the same frozen model, decide which source to trust. We propose TRUSTMARGIN, a training-free, plug-and-play arbitration layer that scores the two existing candidates with the model's own likelihoods. It combines a parametric-prior margin, which tests whether memory accepts the retrieved answer, with an evidence-binding margin, which discounts passage-only salience and measures question-specific support. TRUSTMARGIN selects between Direct and RAG without fine-tuning, external judges, or additional generation. Across 2WIKIMQA and CWQA with three LLaMA scales, TRUSTMARGIN consistently improves over Direct generation and BM25-RAG, recovers part of the Direct/RAG oracle gap, and generalizes to multiple training-free RAG pipelines.

2606.08589 2026-06-09 cs.CL cs.DL cs.IR 新提交

Detection and Interpretability Analysis of Quotation Errors by Large Language Models

大语言模型对引用错误的检测与可解释性分析

Bei Huang, Yingyi Zhang, Shenghao Huang, Chengzhi Zhang

发表机构 * School of Social Science, Soochow University（苏州大学社会科学学院）； School of Economics and Management, Nanjing University of Science and Technology（南京理工大学经济与管理学院）

AI总结针对引用错误问题，提出基于大语言模型微调的自动检测方法，通过引入全文数据优化数据集构建，并利用TokenSHAP进行可解释性分析，实验表明微调方法有效且基于源摘要的全文整合方案性能最佳。

详情

DOI: 10.1108/EL-11-2025-0464
Journal ref: The Electronic Library, 2026

AI中文摘要

目的 - 引用错误指引用信息与其原始来源之间的不一致。这一现象导致一系列负面影响，如对原始研究的误解、削弱学术界对相关问题的集体理解，以及削弱基于引用的学术评价体系的准确性和公平性。现有研究表明，引用错误在学术界普遍存在；此外，人工验证引用错误不仅劳动密集，而且效率低下。因此，本文提出“引用错误自动检测”任务。方法 - 采用基于大语言模型的方法，本文在现有研究基础上从两个方面提升检测性能：首先，采用微调方法使大语言模型检测引用错误；其次，将引文全文数据纳入数据集构建，并通过比较三种全文整合方法探索构建此类数据集的最优方案。在此基础上，本文进一步使用TokenSHAP工具对模型预测结果进行可解释性实验分析。发现 - 大语言模型的微调方法提升了引用错误检测的性能。在整合全文信息的不同方法中，基于使用源摘要的方法取得了最佳性能。原创性 - 将大语言模型的微调方法应用于引用错误自动检测任务，并对模型输出结果进行可解释性分析。

英文摘要

Purpose - Quotation error refers to an inconsistency between cited information and its original source. This phenomenon has a series of negative impacts, such as misinterpretation of the original research, a weaker collective understanding of relevant issues in the academic community, and reduced accuracy and fairness of citation-based academic evaluation. Existing studies have shown that quotation errors are prevalent in the academic community; moreover, manual verification of quotation errors is both labor-intensive and inefficient. Therefore, this paper proposes the task of 'automated detection of quotation errors'. Methodology - Adopting a large language model (LLM)-based approach, this paper improves detection performance over existing research in two ways: first, by fine-tuning LLMs to detect quotation errors; second, by incorporating full-text data of the cited literature into dataset construction and exploring the optimal scheme for building such datasets through a comparison of three full-text integration methods. On this basis, the paper further uses the TokenSHAP tool to conduct an interpretability analysis of the model's predictions. Findings - Fine-tuning LLMs improved performance in detecting quotation errors. Among the different methods for incorporating full-text information, the approach based on the source abstract yielded the best performance. Originality - LLM fine-tuning is applied to the task of automated detection of quotation errors, and an interpretability analysis is conducted on the model's outputs.

2606.08617 2026-06-09 cs.CL 新提交

Cross-Source Reasoning-based Correction for Author Name Disambiguation

基于跨源推理的作者姓名消歧校正

Fanjin Zhang, Yunhe Pang, Bo Chen, Zhiyu Shen, Yanghui Rao, Evgeny Kharlamov, Jie Tang

发表机构 * Renmin University of China（中国人民大学）； Sun Yat-Sen University（中山大学）； Tsinghua University（清华大学）； Robert Bosch GmbH（罗伯特·博世有限公司）； University of Oslo（奥斯陆大学）

AI总结提出CrossND框架，通过跨源不一致分配推理，结合数据精炼、监督微调和测试时缩放，无需人工干预即可校正作者姓名消歧错误。

Comments Accepted at KDD 2026 ADS track

详情

DOI: 10.1145/3770855.3818347

AI中文摘要

作者姓名消歧是学术搜索系统中的关键挑战，通常通过从头开始和实时消歧方法解决。然而，当前算法仍然容易受到论文-作者分配的累积误差影响，并忽略了不同来源之间的不一致分配。诉诸专家注释是资源密集型的。为此，本文探索了作者姓名消歧的新视角：通过利用跨源的不一致分配进行跨源校正。我们提出了CrossND，一个集成数据精炼、跨源推理和测试时缩放的全栈框架。首先，一个精炼链去噪作者档案并产生更准确的论文-作者匹配概率。其次，一个监督微调过程结合这些精炼信号和基于概率软逻辑的交叉校正模块，推断哪些来源的分配是错误的。第三，测试时缩放进一步增强了预测的准确性和鲁棒性。在真实数据集上的实验表明，CrossND通过利用跨源推理，无需人工干预，始终优于17个基线。

英文摘要

Author name disambiguation is a critical challenge in academic search systems, often addressed through from-scratch and real-time disambiguation approaches. However, current algorithms remain vulnerable to cumulative errors of paper-author assignments and overlook inconsistent assignments across different sources. Resorting to expert annotation is resource-intensive. To this end, this paper explores a new perspective for author name disambiguation: cross-source correction by leveraging inconsistent assignments across sources. We propose CrossND, a full-stack framework that integrates data refinement, cross-source reasoning, and test-time scaling. First, a chain-of-refinement pipeline denoises author profiles and produces more accurate paper-author matching probabilities. Second, a supervised fine-tuning process incorporates these refined signals and a probabilistic soft logic-based cross-correction module to infer the assignments of which sources are incorrect. Third, test-time scaling further enhances the accuracy and robustness of the predictions. Experiments on real-world datasets indicate that CrossND consistently outperforms 17 baselines by leveraging cross-source reasoning without human intervention.

2606.08932 2026-06-09 cs.CL cs.AI cs.CE 新提交

From Statute to Control Flow: Span-Grounded Deontic Trees for Defeasible Scope Parsing

从法规到控制流：基于跨度义务树的可废止范围解析

Jian Chen, Siyuan Li, Chucheng Wan, Zixuan Yuan

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Sun Yat-Sen University（中山大学）

AI总结提出NormBench基准和跨度义务树（SG-DT）中间表示，用于诊断和缓解规则遵循模型中的静默范围遗漏（SSO）问题，揭示递归衰减和可审计性陷阱两种病理，并通过约束输出改善树结构保真度。

详情

AI中文摘要

执行政策和法规的规则遵循代理常常因静默范围遗漏（SSO）而失败：模型应用一般规则但静默地丢弃嵌套的例外或反例外，产生看似合规但在重要边缘案例上失效的输出。尽管此类失败常被视为代理系统问题，其根本瓶颈在于法规和政策理解——这一能力通常在法律NLP中研究。然而，大多数现有法律NLP基准强调最终任务结果，可能忽略导致SSO的结构性遗漏。为诊断和缓解SSO，我们引入NormBench，一个包含2290条条款的基准，涵盖中文（法律和地方政策）、英文（美国税法、GDPR和企业政策）及跨语言设置，专为可废止范围解析设计：精确识别哪个条款覆盖哪个。NormBench使用基于跨度义务树（SG-DT），一种编译器式中间表示，将每个逻辑分支锚定到源跨度并要求显式排除守卫，实现确定性编译和审计。对前沿LLM的评估揭示了两种反复出现的病理：（1）递归衰减，性能随击败者深度增加急剧下降；（2）可审计性陷阱，模型检索相关跨度但未能组装正确的控制流。使用SG-DT作为约束中间输出可改善整树保真度和击败者恢复，下游实验表明其效用是机制特定的：增益集中在例外活跃、易SSO的案例上，而当附加结构不必要或解析器保真度低时，总体准确率可能参差不齐。

英文摘要

Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general rule but silently drops nested exceptions or counter-exceptions, producing outputs that appear compliant yet break on important edge cases. Although such failures are often framed as an agentic-systems problem, the underlying bottleneck is statutory and policy understanding, a capability typically studied in legal NLP. However, most existing legal NLP benchmarks emphasize end-task outcomes, which can overlook the structural omissions that cause SSO. To diagnose and mitigate SSO, we introduce NormBench, a benchmark of 2,290 provisions spanning Chinese (laws and local policies), English (U.S. tax law, GDPR, and corporate policies), and cross-lingual settings, designed for defeasible scope parsing: identifying precisely which clause overrides which. NormBench uses Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit. Evaluations of frontier LLMs reveal two recurring pathologies: (1) Recursion Decay, where performance drops sharply as defeater depth increases, and (2) an Auditability Trap, where models retrieve relevant spans but fail to assemble correct control flow. Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that its utility is mechanism-specific: gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.
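
The override structure described above (rule, exception, counter-exception) can be illustrated with a small recursive tree. This is only a sketch under stated assumptions: the `Node` class, its `span` and `defeaters` fields, and the toy statute are hypothetical and are not the SG-DT schema or NormBench data.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    span: Tuple[int, int]                  # character offsets of the source clause (hypothetical field)
    condition: Callable[[dict], bool]      # guard evaluated on the facts of a case
    defeaters: List["Node"] = field(default_factory=list)  # clauses that override this one


def applies(node: Node, facts: dict) -> bool:
    """A node applies when its guard holds and none of its defeaters applies."""
    if not node.condition(facts):
        return False
    return not any(applies(d, facts) for d in node.defeaters)


# Toy statute (illustrative only): general rule -> exception -> counter-exception.
counter = Node((80, 120), lambda f: f.get("from_employer", False))
exception = Node((40, 79), lambda f: f.get("is_gift", False), [counter])
rule = Node((0, 39), lambda f: True, [exception])
```

Dropping the `counter` node from `exception.defeaters` reproduces the kind of silent omission the abstract calls SSO: the output still looks plausible, but gifts from an employer would be wrongly exempted.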

2606.09459 2026-06-09 cs.CL 新提交

AbstRAG: Learning to Abstract for Retrieval Problems

AbstRAG：面向检索问题的抽象学习

Lei Xu, Xin Quan, Daniel Pedronette, André Freitas

发表机构 * Idiap Research Institute（Idiap 研究所）； École Polytechnique Fédérale de Lausanne (EPFL)（洛桑联邦理工学院 (EPFL)）； São Paulo State University（圣保罗州立大学）； University of Manchester（曼彻斯特大学）； CRUK National Biomarker Centre（英国癌症研究中心国家生物标志物中心）

AI总结针对查询与文档证据间的抽象鸿沟问题，提出AbstRAG方法，通过将抽象作为显式检索对象，并采用反思性精炼机制，在三个基准上提升了检索和生成性能。

详情

AI中文摘要

当查询、文档证据和用户意图以不同抽象级别表达时，检索增强生成常常失败。查询可能询问一个类别、关系或事件，而文档仅陈述具体实例、间接框架或限定表述。我们将这种不匹配定义为抽象鸿沟：将查询意图与可用证据对齐所需的最小类型假设集合。为弥合这一鸿沟，我们引入AbstRAG，将抽象视为显式检索对象。AbstRAG将查询-证据鸿沟分解为表达、概念、意图-证据和事件类型组件，并通过结合匹配质量、查询无关的效用先验以及所需桥梁的成本来评分相关性。其核心机制是反思性精炼：批评者诊断检索失败，定位失败的抽象操作符，提出最小的阶段特定补丁，并仅在充分性和压缩控制下接受补丁。在三个文档内检索基准上与七个基线对比，AbstRAG在21个配对自助法对比中的18个上以nDCG@10胜出，并在三个基准上分别将生成准确率提升1.9%、5.2%和4.0%；消融实验证实，反思性精炼驱动了大部分检索增益，而仅压缩控制就在压力切片上将过度扩展假阳性从73.7%降至0%。

英文摘要

Retrieval-augmented generation often fails when the query, the document evidence, and the user's intent are expressed at different levels of abstraction. A query may ask about a class, a relation, or an event, while the document only states specific instances, indirect framings, or scoped formulations. We define this mismatch as an abstraction gap: the minimal set of typed assumptions required to align query intent with the available evidence. To close this gap, we introduce AbstRAG, which treats abstraction as an explicit retrieval object. AbstRAG decomposes the query--evidence gap into expression, conceptual, intent--evidence, and event-type components, and scores relevance by combining match quality, a query-independent utility prior, and the cost of the required bridges. Its central mechanism is reflective refinement: a critic diagnoses retrieval failures, localizes the failed abstraction operator, proposes a minimal stage-specific patch, and accepts the patch only under sufficiency and compression controls. Across three within-document retrieval benchmarks against seven baselines, AbstRAG outperforms on nDCG@10 in 18 of 21 paired-bootstrap contrasts and improves generation accuracy by 1.9%, 5.2%, and 4.0% across the three benchmarks; ablations confirm that reflective refinement drives most of the retrieval gain and the compression control alone reduces over-expansion false positives from 73.7% to 0% on a stress slice.
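
For readers unfamiliar with the retrieval metric reported above, nDCG@10 can be computed as in the sketch below. It assumes the linear-gain variant and assumes the ideal ranking is obtained by re-sorting the labels of the same list; the abstract does not say which variant the authors use.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A relevant document at rank 1 yields a score of 1.0, and moving it to rank 2 discounts it by a factor of 1/log2(3).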

2606.07688 2026-06-09 cs.IR cs.AI cs.CL cs.LG 交叉投稿

DIVERGE: 面向开放式信息检索的多样性增强RAG

Tianyi Hu, Niket Tandon, Akhil Arora

发表机构 * Aarhus University（奥胡斯大学）； Microsoft Research（微软研究院）

AI总结针对现有RAG系统忽略开放式信息检索中多样性需求的问题，提出Diverge框架，通过迭代反思引导的多样化视角探索和多样性感知检索支持，在保持质量的同时将多样性提升约2倍。

详情

AI中文摘要

现有的检索增强生成（RAG）系统通常假设每个查询只有一个正确答案。这种假设忽略了开放式信息检索场景，其中多个合理的答案是有价值的，并且多样性对于创造力、公平性和信息的包容性访问至关重要。我们表明，标准RAG系统未能充分利用多样化的检索上下文：简单地增加检索多样性并不一定会导致多样化的生成。为了解决这一局限性，我们提出了Diverge，一个即插即用的智能体RAG框架，通过迭代、反思引导的多样化视角探索和多样性感知检索支持来改善多样性与质量的权衡。我们进一步引入了用于表征开放式问答中多样性与质量权衡的评估指标。在多个真实世界数据集和骨干LLM上的实验表明，Diverge在竞争基线中实现了最佳的权衡，将多样性提高了约2倍，且没有明显的质量下降。这些结果揭示了当前RAG系统的系统性局限，并展示了显式多样性建模的价值。

英文摘要

Existing retrieval-augmented generation (RAG) systems often assume that each query has a single correct answer. This assumption overlooks open-ended information-seeking scenarios where multiple plausible answers are valuable, and where diversity is important for creativity, fairness, and inclusive access to information. We show that standard RAG systems fail to fully use diverse retrieved contexts: simply increasing retrieval diversity does not necessarily lead to diverse generations. To address this limitation, we propose Diverge, a plug-and-play agentic RAG framework that improves the diversity--quality trade-off through iterative, reflection-guided exploration of diverse viewpoints and diversity-aware retrieval support. We further introduce evaluation metrics for characterizing the diversity-quality trade-off in open-ended question answering. Experiments across multiple real-world datasets and backbone LLMs show that Diverge achieves the best trade-off among competitive baselines, increasing diversity by $\sim2\times$ without noticeable quality degradation. These results reveal a systematic limitation of current RAGs and show the value of explicit diversity modeling.

2602.17911 2026-06-09 cs.CL cs.AI 版本更新

Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

面向上下文相关生物医学问答的条件门控推理

Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman, Pengcheng Jiang, Chih-Hsuan Wei, Zhizheng Wang, Zhiyong Lu, Jiawei Han

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； National Institutes of Health（美国国立卫生研究院）

AI总结本文提出CondMedQA基准和Condition-Gated Reasoning框架，通过构建条件感知知识图谱，提升生物医学问答中条件依赖的推理能力。

详情

DOI: 10.1145/3770855.3818963

AI中文摘要

当前生物医学问答系统常假设医学知识是统一的，但现实临床推理本质上是条件性的：几乎所有决策都依赖于患者特定因素，如共病和禁忌症。现有基准不评估此类条件推理，检索增强或图基方法缺乏显式机制确保检索知识适用于给定上下文。为解决这一差距，我们提出CondMedQA，首个针对条件生物医学问答的基准，包含多跳问题，其答案随患者条件变化。此外，我们提出Condition-Gated Reasoning（CGR），一种新框架，构建条件感知知识图谱，并根据查询条件选择性激活或修剪推理路径。我们的发现显示，CGR更可靠地选择条件合适的答案，同时在生物医学问答基准上匹配或超越现有最佳性能，突显了显式建模条件性对稳健医疗推理的重要性。

英文摘要

Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.

2603.03292 2026-06-09 cs.CL cs.AI cs.IR 版本更新

From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

从冲突到共识：通过多轮代理RAG提升医疗推理

Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Zhenhong Sun, Chunlin Chen, Zhi Wang

发表机构 * GitHub

AI总结本文提出MA-RAG框架，通过多轮代理循环迭代优化外部证据和内部推理历史，提升医疗复杂推理能力，实验显示在7个医疗问答基准上表现优于现有方法。

Comments 27 pages, 8 figures, 18 tables

详情

AI中文摘要

大型语言模型（LLMs）在医疗问答中表现出高推理能力，但其产生幻觉和过时知识的倾向对医疗领域构成重大风险。虽然检索增强生成（RAG）缓解了这些问题，但现有方法依赖于带噪声的token级信号，并缺乏复杂推理所需的多轮细化。本文提出MA-RAG（多轮代理RAG），通过在代理细化循环中迭代演化外部证据和内部推理历史，实现复杂医疗推理的测试时扩展。在每一轮中，代理将候选响应间的语义冲突转换为可执行的查询以检索外部证据，同时优化历史推理轨迹以缓解长上下文退化。MA-RAG扩展了自我一致性原则，将不一致性作为多轮代理推理与检索的主动信号，并类似于一种提升（boosting）机制，迭代地最小化残差误差，逐步达成稳定、高保真的医疗共识。在7个医疗问答基准上的广泛评估显示，MA-RAG始终优于有竞争力的推理时扩展基线和RAG基线，平均准确率比骨干模型提高6.8个百分点。我们的代码可在https://github.com/NJU-RL/MA-RAG获得。

英文摘要

Large Language Models (LLMs) exhibit high reasoning capacity in medical question-answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries to retrieve external evidence, while optimizing history reasoning traces to mitigate long-context degradation. MA-RAG extends the self-consistency principle by leveraging the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical consensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering substantial +6.8 points on average accuracy over the backbone model. Our code is available at https://github.com/NJU-RL/MA-RAG.
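
The round structure described above (sample candidates, treat disagreement as a retrieval trigger, repeat) might be sketched as follows. The callables `sample_answers` and `retrieve`, the query template, and the stopping rule are all hypothetical placeholders; the sketch omits the history-trace optimisation and is not the released MA-RAG code.

```python
from collections import Counter


def multi_round_answer(question, sample_answers, retrieve, max_rounds=3, n_samples=5):
    """Loop until sampled answers agree, using disagreement to drive retrieval.

    sample_answers(question, evidence, n) -> list of candidate answers (hypothetical)
    retrieve(query) -> list of evidence snippets (hypothetical)
    """
    evidence, answers = [], []
    for _ in range(max_rounds):
        answers = sample_answers(question, evidence, n_samples)
        counts = Counter(answers)
        if len(counts) == 1:          # full agreement: stop early
            break
        # Turn the conflict between candidates into a retrieval query.
        conflict_query = f"{question} ({' vs '.join(sorted(counts))})"
        evidence.extend(retrieve(conflict_query))
    if not answers:                   # max_rounds was 0
        return None, evidence
    return Counter(answers).most_common(1)[0][0], evidence
```

The design point this illustrates is that inconsistency is consumed as a signal rather than simply averaged away by majority vote.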

2603.12453 2026-06-09 cs.CL 版本更新

CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

CSE-UOI在SemEval-2026任务6中的系统：结合审议复杂度门控（Deliberative Complexity Gating）的两阶段异构集成政治回避检测方法

Christos Tzouvaras, Konstantinos Skianis, Athanasios Voulodimos

发表机构 * University of Ioannina（约阿尼纳大学）； National Technical University of Athens（雅典国立技术大学）

AI总结本文提出一种双阶段异构集成方法，结合自我一致性与加权投票，以及新颖的后处理修正机制Deliberative Complexity Gating，用于政治逃避检测，最终在评估集上获得0.85的Macro-F1分数。

详情

AI中文摘要

本文描述了我们参加SemEval-2026任务6的系统，该任务将政治访谈中回应的清晰度分为三类：清晰回复（Clear Reply）、模棱两可（Ambivalent）和清晰非回复（Clear Non-Reply）。我们提出了一种异构双大语言模型（LLM）集成方法，结合自我一致性（SC）和加权投票，并提出了一种新的事后修正机制，即Deliberative Complexity Gating（DCG）。该机制使用跨模型行为信号，并利用了一项发现：LLM响应长度代理指标与样本模糊性强相关。为进一步研究改进模糊性检测的机制，我们评估了多智能体辩论，作为提升审议能力的替代策略。DCG利用跨模型行为信号自适应地门控推理，而辩论只增加智能体数量，并不增加模型多样性。我们的方案在评估集上取得0.85的Macro-F1分数，获得第三名，并与报告的第二高分持平。

英文摘要

This paper describes our system for SemEval-2026 Task 6, which classifies clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place and tied with the second-best reported score.
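
The combination of self-consistency sampling and weighted voting can be sketched as below. The model names, weights, and per-model normalisation are assumptions made for illustration, and the DCG correction step is not covered because the abstract does not specify it.

```python
from collections import defaultdict

LABELS = ("Clear Reply", "Ambivalent", "Clear Non-Reply")


def weighted_vote(samples_by_model, model_weights):
    """Combine self-consistency samples from several models into one label.

    Each model's samples are normalised to sum to 1, then scaled by that
    model's weight (hypothetical values, e.g. tuned on a dev set).
    """
    totals = defaultdict(float)
    for model, samples in samples_by_model.items():
        if not samples:
            continue
        weight = model_weights.get(model, 1.0)
        for label in samples:
            totals[label] += weight / len(samples)
    return max(LABELS, key=lambda label: totals[label])
```

Normalising by the number of samples keeps a model that was sampled more often from dominating the vote purely through volume.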

2604.08849 2026-06-09 cs.CL cs.AI cs.DB cs.MA cs.SC 版本更新

SatIR: Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching

SatIR：面向临床试验匹配的可扩展、高召回率、基于约束满足的信息检索

Cyrus Zhou, Yufei Jin, Yilin Xu, Yu-Chiang Wang, Chieh-Ju Chao, Monica S. Lam

发表机构 * Department of Computer Science, Stanford University（斯坦福大学计算机科学系）； Samueli Electrical and Computer Engineering, UCLA（UCLA Samueli电气与计算机工程系）； Department of Computer Science and Informatics, Emory University（埃默里大学计算机科学与信息学系）； Mayo Clinic（梅奥诊所）

AI总结 SatIR通过将临床试验资格条件和摘要转化为形式约束，结合SMT、关系代数和大语言模型，提升了临床试验匹配的召回率和效率，优于基于相似度的基线方法。

详情

AI中文摘要

许多重要的检索问题不仅仅是语义相似性问题，而是约束满足问题：检索的项目应与查询主题相关，并满足涉及否定、时间条件、数值阈值、例外、本体关系和不完整证据的显式要求。我们在临床试验匹配中研究这一挑战，这是一个高风险的测试平台，其中有用的试验必须既满足患者的医疗需求，又符合复杂的资格标准。我们提出了SatIR，一种用于临床试验匹配的可扩展的基于约束的检索方法。SatIR将试验资格标准和摘要转换为形式约束，然后在数据库上执行这些约束来检索患者-试验对。系统结合了可满足性模理论（SMT）、关系代数、医学本体基础和大语言模型（LLMs）：形式方法提供可执行且可检查的匹配，而LLMs将模糊、不完整和隐含的临床信息转换为显式、可控的约束表示。在SIGIR 2016患者-试验集合和源自TREC 2022的基准TREC-2022-RetrievalSubset上，SatIR在资格感知检索方面始终优于基于相似度的基线方法。与TrialGPT式检索相比，SatIR在SIGIR 2016上为每名患者多检索出32%至72%的相关且合格的试验，在TREC-2022-RetrievalSubset上的合格试验召回率提高到1.8至3.2倍。检索速度很快，在3,621个SIGIR试验上每名患者仅需146毫秒。

英文摘要

Many important retrieval problems are not merely problems of semantic similarity, but problems of constraint satisfaction: a retrieved item should be topically relevant to a query and satisfy explicit requirements involving negation, temporal conditions, numeric thresholds, exceptions, ontological relations, and incomplete evidence. We study this challenge in clinical trial matching, a high-stakes test bed where a useful trial must both address a patient's medical needs and satisfy complex eligibility criteria. We propose SatIR, a scalable constraint-based retrieval method for clinical trial matching. SatIR converts trial eligibility criteria and summaries into formal constraints, then retrieves patient--trial pairs by executing these constraints over a database. The system combines Satisfiability Modulo Theories (SMT), relational algebra, medical ontology grounding, and large language models (LLMs): formal methods provide executable and inspectable matching, while LLMs convert ambiguous, incomplete, and implicit clinical information into explicit, controllable constraint representations. Across the SIGIR 2016 patient--trial collection and TREC-2022-RetrievalSubset, a benchmark derived from TREC 2022, SatIR consistently improves eligibility-aware retrieval over similarity-based baselines. Relative to TrialGPT-style retrieval, SatIR retrieves 32%--72% more relevant-and-eligible trials per patient on SIGIR 2016 and achieves $1.8$--$3.2\times$ higher eligible-trial recall on TREC-2022-RetrievalSubset. Retrieval is fast, requiring only 146 milliseconds per patient over 3,621 SIGIR trials.
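
The idea of retrieving by executing constraints rather than ranking by similarity can be shown with a toy filter. This is a plain-Python sketch, not an SMT encoding: the `(field, op, value)` constraint format, the field names, and the choice to treat missing evidence as "not excluded" are all assumptions made here for illustration.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq, "!=": operator.ne}


def check(constraint, patient):
    """Return True, False, or None (unknown) for one (field, op, value) constraint."""
    field_name, op, value = constraint
    if patient.get(field_name) is None:
        return None                      # incomplete evidence: cannot decide
    return OPS[op](patient[field_name], value)


def eligible_trials(patient, trials):
    """Keep every trial that has no definitely violated constraint."""
    return [
        name
        for name, constraints in trials.items()
        if all(check(c, patient) is not False for c in constraints)
    ]
```

Keeping trials whose constraints evaluate to unknown favours recall over precision; a stricter system could require every constraint to be provably satisfied instead.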

2605.17301 2026-06-09 cs.CL cs.AI 版本更新

ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation

ConflictRAG: 检测和解决检索增强生成中的知识冲突

Chenyu Wang, Yueyuan Li, Yingmin Liu, Yang Shu

发表机构 * Zhejiang University（浙江大学）

AI总结本研究提出ConflictRAG框架，通过两阶段冲突检测模块、熵-TOPSIS框架和冲突感知RAG评分，有效检测和解决检索增强生成中的知识冲突，实验表明其在冲突检测F1和正确性方面优于现有方法。

Comments 6 pages, 6 figures, submitted to IEEE SMC 2026

详情

AI中文摘要

检索增强生成（RAG）系统隐式假设检索到的文档之间相互一致，而这一假设在实践中经常失效。我们提出了ConflictRAG，一种冲突感知的RAG框架，能够在生成答案之前检测、分类和解决知识冲突。该框架包含三项贡献：（1）一个两阶段冲突检测模块，结合基于嵌入的轻量级MLP分类器和选择性LLM细化，使API成本降低62%，同时保持90.8%的检测准确率；（2）一个熵-TOPSIS框架，用于数据驱动的来源可信度评估，选择准确率比手动启发式方法提高7.1%；（3）一个冲突感知RAG评分（CARS），用于诊断性评估冲突处理能力。在三个基准上与六个基线的对比实验表明，冲突检测F1达到88.7%，正确性相较最强的冲突感知基线稳定提高5.3%至6.1%，且该流程能够有效迁移到不同的骨干LLM。

英文摘要

Retrieval-Augmented Generation (RAG) systems implicitly assume mutual consistency among retrieved documents -- an assumption that frequently fails in practice. We present ConflictRAG, a conflict-aware RAG framework that detects, classifies, and resolves knowledge conflicts prior to answer generation. The framework introduces three contributions: (1) a two-stage conflict detection module combining a lightweight embedding-based MLP classifier with selective LLM refinement, reducing API costs by 62% while maintaining 90.8% detection accuracy; (2) an Entropy-TOPSIS framework for data-driven source credibility assessment, improving selection accuracy by 7.1% over manual heuristics; and (3) a Conflict-Aware RAG Score (CARS) for diagnostic evaluation of conflict-handling capabilities. Experiments on three benchmarks against six baselines demonstrate 88.7% conflict-detection F1 and consistent 5.3--6.1% correctness gains over the strongest conflict-aware baseline, with the pipeline transferring effectively across backbone LLMs.
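
Entropy weighting followed by TOPSIS is a standard multi-criteria ranking recipe, and a minimal version is sketched below. It assumes all criteria are benefit criteria with positive values and at least two sources; the criteria themselves are hypothetical, since the abstract does not list the credibility features ConflictRAG uses.

```python
import math


def entropy_weights(matrix):
    """Weight each criterion (column) by 1 - normalised Shannon entropy."""
    n, m = len(matrix), len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [v / total for v in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)
    s = sum(divergences)
    return [d / s for d in divergences] if s else [1.0 / m] * m


def topsis_scores(matrix, weights):
    """Closeness of each source (row) to the ideal solution, in [0, 1]."""
    m = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(m)]
    weighted = [
        [weights[j] * row[j] / norms[j] if norms[j] else 0.0 for j in range(m)]
        for row in matrix
    ]
    best = [max(r[j] for r in weighted) for j in range(m)]
    worst = [min(r[j] for r in weighted) for j in range(m)]
    scores = []
    for r in weighted:
        d_best = math.sqrt(sum((r[j] - best[j]) ** 2 for j in range(m)))
        d_worst = math.sqrt(sum((r[j] - worst[j]) ** 2 for j in range(m)))
        denom = d_best + d_worst
        scores.append(d_worst / denom if denom else 0.5)
    return scores
```

Criteria whose values vary more across sources receive larger weights, which is what makes the weighting data-driven rather than hand-set.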

2605.28831 2026-06-09 cs.CL cs.AI 版本更新

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

S3Mem：用于长时域交互式问答的结构化时空场景-事件记忆

Encheng Su, Jianyu Wu, Jinouwen Zhang, Qiucheng Yu, Chen Tang, Pengze Li, Lintao Wang, Aoran Wang, Xinzhu Ma, Shixiang Tang, Yizhou Wang, Houqiang Li

发表机构 * University of Science and Technology of China（中国科学技术大学）； Shanghai Jiao Tong University（上海交通大学）； Shanghai AI Laboratory（上海人工智能实验室）； City University of Hong Kong（香港城市大学）； The Chinese University of Hong Kong（香港中文大学）； Fudan University（复旦大学）； The University of Sydney（悉尼大学）； Beihang University（北京航空航天大学）

AI总结提出S3MEM框架，通过结构化场景-事件记忆和锚点敏感检索，在长时域交互式问答中实现比通用记忆接口更优的准确率-效率平衡。

详情

AI中文摘要

长时域交互代理通常积累大量轨迹历史，但仍无法可靠地回答关于早期事件的问题。我们认为主要瓶颈不仅是上下文长度，而是长期记忆的轨迹到答案接口。当历史以纯文本块存储并使用标准检索增强生成（RAG）查询时，系统通常检索到局部相关但链不完整的证据，特别是对于空间、时间、重复事件和多跳状态问题。我们提出S3MEM，一种用于长时域交互式问答（QA）的结构化场景-事件情节记忆框架。S3MEM将轨迹写入结构化记忆单元，通过锚点敏感检索检索证据，并为答案时间推理提供紧凑的令牌预算感知证据接口。从这个意义上说，S3MEM是一种结构化证据利用工具，将代理轨迹转换为查询对齐的支持。我们在两个内部标题环境（Crafter、Jericho）和两个外部环境（SciWorld、ALFWorld）上评估S3MEM。在共享的冻结答案时间协议下，S3MEM在所有四个环境中一致优于Vanilla RAG，在Crafter、Jericho和ALFWorld上超过Graph-NoReader，在SciWorld上与之匹配，同时使用的证据令牌显著减少。三个改编的近期基线——A-MEM启发、MemoryOS改编和LightMem改编——在多个设置中优于Vanilla RAG，但没有一个达到S3MEM的整体准确率-效率前沿。总体而言，证据支持一个有限的结论：在当前冻结的答案时间协议下，结构化写入和锚点敏感证据路由为长时域交互式QA提供了比通用记忆接口更强的准确率-效率前沿。

英文摘要

Long-horizon memory question answering often requires sparse evidence from heterogeneous histories, including events, object states, visual observations, temporal relations, and causal steps. Existing memory interfaces expand reader context, retrieve semantically related chunks, or expose graph neighborhoods, but they are not explicitly designed to select compact evidence for a fixed reader. We propose Structured Spatiotemporal Scene--Event Memory (S3Mem), a query-time memory interface that writes textual, visual, and agent-use histories into structured scene--event units and routes compact evidence packs to the reader. Its router scores candidate units, query anchors, and anchor--support links, enabling both single-hop selection and short multi-hop evidence chains without reader fine-tuning or test-time training. Across LoCoMo, EMemBench Visual Games, and AMA-Bench, S3Mem provides a strong score--token trade-off, with the clearest gains on localized event, state, temporal, causal, or provenance evidence. On LoCoMo, S3Mem reaches $0.48$ F1 and $0.40$ BLEU with $1{,}073$ evidence tokens per question, about $15.8\times$ fewer than the LoCoMo reference. On EMemBench Visual Games, it obtains the best F1 and second-best accuracy with only $189$ tokens. On AMA-Bench, it is not the highest-scoring method, but remains competitive while using the fewest reader-visible evidence tokens.

2603.24925 2026-06-09 cs.LG cs.CL cs.IR 版本更新

GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation

GraphER: 一种高效的基于图的增强和重排序方法用于检索增强生成

Ruizhong Miao, Yuying Wang, Rongguang Wang, Chenyang Li, Tao Sheng, Sujith Ravi, Dan Roth

发表机构 * Oracle AI

AI总结 GraphER通过利用数据组织结构捕捉超越语义相似性的接近关系，构建查询时的图结构并应用图排序技术，提升检索完整性，无需额外图基础设施，兼容标准向量存储。

2603.29875 2026-06-09 cs.IR cs.AI cs.CL 版本更新

UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough

解开图式RAG的结——事实证明向量RAG几乎足够

Ryszard Tuora, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Mateusz Czyżnikiewicz, Adam Kozakiewicz, Tomasz Ziętkiewicz

发表机构 * Samsung AI Warsaw（三星AI华沙）

AI总结本文提出UnWeaver框架，通过LLM解构文档内容为跨chunk的实体，提升检索和生成的准确性与效率，实验表明向量RAG在成本上优于图式RAG。

详情

DOI: 10.5281/zenodo.19203878

AI中文摘要

检索增强生成（RAG）系统中的关键问题在于基于片段的检索流程将源片段视为原子对象，将其中信息混合成单一向量。这些向量被视为孤立、独立且自足，没有尝试表示它们之间的可能关系。此类方法缺乏处理多跳问题的专用机制。图式RAG系统通过将信息建模为知识图谱来缓解这一问题，实体由节点表示，通过稳健的关系连接并形成层次化社区。然而，这种方法自身也存在一些问题，包括为创建图式索引而增加数量级的组件复杂性，以及依赖启发式方法进行检索。我们提出UnWeaver，一种新颖的RAG框架，简化了图式RAG的理念。UnWeaver利用LLM将文档内容解构为可以在多个片段中出现的实体。在检索过程中，实体被用作恢复原始文本片段的中间方式，从而保持对源材料的忠实度。我们主张基于实体的分解能提供更浓缩的原始信息表示，同时还能减少索引和生成过程中的噪声。此外，我们实验表明，在端到端QA评估中，向量RAG的表现优于标准图式RAG，并且几乎与当前最先进的图式解决方案相当，而成本只是后者的一小部分。

英文摘要

One of the key problems in Retrieval-augmented generation (RAG) systems is that chunk-based retrieval pipelines represent the source chunks as atomic objects, mixing the information contained within such a chunk into a single vector. These vector representations are then fundamentally treated as isolated, independent and self-sufficient, with no attempt to represent possible relations between them. Such an approach has no dedicated mechanisms for handling multi-hop questions. Graph-based RAG systems aimed to ameliorate this problem by modeling information as knowledge graphs, with entities represented by nodes connected by robust relations and forming hierarchical communities. This approach, however, suffers from its own issues, including an orders-of-magnitude increase in componential complexity needed to create graph-based indices, and reliance on heuristics for performing retrieval. We propose UnWeaver, a novel RAG framework simplifying the idea of GraphRAG. UnWeaver uses an LLM to disentangle the contents of the documents into entities which can occur across multiple chunks. In the retrieval process, entities are used as an intermediate way of recovering original text chunks, hence preserving fidelity to the source material. We argue that entity-based decomposition yields a more distilled representation of original information, and additionally serves to reduce noise in the indexing and generation process. Furthermore, we experimentally show that on end-to-end QA evaluation VectorRAG performs better than standard GraphRAG and almost as well as current SOTA graph-based solutions, for a fraction of the cost.

2604.26176 2026-06-09 cs.DB cs.CL 版本更新

CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering

CacheRAG：面向知识图谱问答中检索增强生成的语义缓存系统

Yushi Sun, Lei Chen

发表机构 * HKUST Hong Kong China（香港科技大学，中国香港）； HKUST(GZ) / HKUST Guangzhou / Hong Kong China (2018)（香港科技大学（广州）/ 香港科技大学，广州 / 中国香港（2018））

AI总结针对LLM驱动的KGQA系统作为无状态规划器导致模式幻觉和检索覆盖有限的问题，提出CacheRAG，一种基于缓存的架构，通过模式无关接口、多样性优化缓存检索和有界启发式扩展，将无状态规划器转变为持续学习器，显著提升准确性和真实性。

详情

AI中文摘要

大型语言模型（LLMs）与检索增强生成（RAG）的集成显著推进了知识图谱问答（KGQA）。然而，现有的LLM驱动的KGQA系统作为无状态规划器，孤立地生成检索计划而不利用历史查询模式：类似于一个没有计划缓存的数据库系统，从头优化每个查询。这一基本设计缺陷导致模式幻觉和有限的检索覆盖。我们提出CacheRAG，一种面向基于LLM的KGQA的系统化缓存增强架构，将无状态规划器转变为持续学习器。与传统的数据库计划缓存（优化频率）不同，CacheRAG引入了三种针对LLM上下文定制的新设计原则：（1）模式无关用户界面：通过中间语义表示（ISR）的两阶段语义解析框架使非专家用户能够纯粹以自然语言交互，同时后端适配器将LLM与本地模式上下文结合，安全地编译可执行的物理查询。（2）多样性优化的缓存检索：两层层次索引（领域→方面）结合最大边际相关性（MMR）最大化缓存示例的结构多样性，有效缓解推理同质性。（3）有界启发式扩展：具有严格复杂度保证的确定性深度和广度子图操作符显著提升检索召回率，而无需冒无界API执行的风险。在多个基准上的广泛实验表明，CacheRAG显著优于最先进的基线（例如，在CRAG数据集上准确率提升13.2%，真实性提升17.5%）。

英文摘要

The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answering (KGQA). However, existing LLM-driven KGQA systems act as stateless planners, generating retrieval plans in isolation without exploiting historical query patterns: analogous to a database system that optimizes every query from scratch without a plan cache. This fundamental design flaw leads to schema hallucinations and limited retrieval coverage. We propose CacheRAG, a systematic cache-augmented architecture for LLM-based KGQA that transforms stateless planners into continual learners. Unlike traditional database plan caching (which optimizes for frequency), CacheRAG introduces three novel design principles tailored for LLM contexts: (1) Schema-agnostic user interface: A two-stage semantic parsing framework via Intermediate Semantic Representation (ISR) enables non-expert users to interact purely in natural language, while a Backend Adapter grounds the LLM with local schema context to compile executable physical queries safely. (2) Diversity-optimized cache retrieval: A two-layer hierarchical index (Domain $\rightarrow$ Aspect) coupled with Maximal Marginal Relevance (MMR) maximizes structural variety in cached examples, effectively mitigating reasoning homogeneity. (3) Bounded heuristic expansion: Deterministic depth and breadth subgraph operators with strict complexity guarantees significantly enhance retrieval recall without risking unbounded API execution. Extensive experiments on multiple benchmarks demonstrate that CacheRAG significantly outperforms state-of-the-art baselines (e.g., +13.2% accuracy and +17.5% truthfulness on the CRAG dataset).
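
Maximal Marginal Relevance, named in design principle (2), greedily trades relevance to the query against redundancy with entries already chosen. The sketch below assumes cached examples are represented by plain embedding vectors and uses a hypothetical trade-off parameter `lam`; it does not model the Domain/Aspect index.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr_select(query_vec, cache_vecs, k, lam=0.5):
    """Return indices of up to k cached entries chosen by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(cache_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, cache_vecs[i])
            redundancy = max(
                (cosine(cache_vecs[i], cache_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the selection reduces to plain top-k by similarity; lowering `lam` pushes the second and later picks toward entries that differ from what has already been selected.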

2606.07893 2026-06-09 cs.CL 新提交

超越回忆的记忆：用于自进化LLM代理的双过程认知记忆系统

Tianxiang Fei, Mingyang Song, Mao Zheng, Xiang Yu

发表机构 * Tencent（腾讯）

AI总结提出DCPM系统，基于双过程理论将代理记忆组织为认知能力层次，通过同步日间写入器和异步夜间引擎分别处理信念修正和模式归纳，在隐式跨会话推理任务上提升显著。

详情

AI中文摘要

LLM代理的长期记忆不仅仅是适时检索正确的段落。当前的记忆系统将信念修正、因果耦合和跨领域抽象压缩到为表面回忆而调整的单一检索面上，因此难以处理需要推理用户如何演变的隐式个性化。我们提出DCPM，它沿着认知能力层次重新组织代理记忆，从原始输入和原子事实，经过历时信念轨迹和身份，上升到领域模式、潜在意图和跨领域模式。该层次由两个过程驱动，继承了双过程理论的架构分裂：一个同步的日间写入器（系统1），记录信念修正为双重链接的取代链；一个异步的夜间引擎（系统2），归纳模式和意图，并扫描跨领域冲突，抽象为更高级的核心模式。在LongMemEval、PersonaMem和PersonaMem-v2上，启用系统2在奖励隐式跨会话推理的基准上贡献最大（在PersonaMem-v2上最高+5.20），在跨度回忆上贡献最小，与架构预测一致。

英文摘要

Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface recall, and consequently struggle on implicit personalisation that requires reasoning over how a user has evolved. We propose DCPM, which reorganises agent memory along a cognitive capability hierarchy ascending from raw inputs and atomic facts, through diachronic belief trajectories and identity, to domain schemas, latent intentions and cross-domain patterns. The hierarchy is driven by two processes inheriting the architectural split of dual-process theory: a synchronous daytime writer (System1) that records belief revisions as doubly linked supersedes chains, and an asynchronous nighttime engine (System2) that induces schemas and intentions and sweeps for cross-domain collisions abstracted into higher-level core schemas. On LongMemEval, PersonaMem and PersonaMem-v2, enabling System2 contributes most where the benchmark rewards implicit cross-session inference (up to +5.20 on PersonaMem-v2) and least on span recall, matching the architectural prediction.
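
The "doubly linked supersedes chain" used by the daytime writer can be pictured as a linked list of belief versions that can be walked in both directions. The `Belief` class and helper names below are hypothetical and only illustrate the data structure, not DCPM's actual storage format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # identity-based equality avoids recursing through the links
class Belief:
    text: str
    supersedes: Optional["Belief"] = None      # older belief this one replaces
    superseded_by: Optional["Belief"] = None   # newer belief that replaced this one


def revise(old: Belief, new_text: str) -> Belief:
    """Record a revision, linking old and new beliefs in both directions."""
    new = Belief(new_text, supersedes=old)
    old.superseded_by = new
    return new


def current(belief: Belief) -> Belief:
    """Follow forward links to the latest version."""
    while belief.superseded_by is not None:
        belief = belief.superseded_by
    return belief


def history(belief: Belief) -> List[str]:
    """Return the full chain, oldest first, starting from any version."""
    node, texts = current(belief), []
    while node is not None:
        texts.append(node.text)
        node = node.supersedes
    return texts[::-1]
```

Keeping both link directions means a query can start from any stored version and still recover how the user's stated preference evolved over time.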

2606.09498 2026-06-09 cs.CL 新提交

Self-Harness: Harnesses That Improve Themselves

Self-Harness：自我改进的操控框架

Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

AI总结提出Self-Harness范式，让LLM智能体通过弱点挖掘、框架提议和验证迭代改进自身操控框架，在Terminal-Bench-2.0上使三种模型的保留通过率分别提升21.4、14.3和14.2个百分点。

详情

AI中文摘要

基于LLM的智能体的性能由其基础模型和中介其与环境交互的操控框架共同塑造。由于不同模型表现出不同的行为，有效的框架设计本质上是模型特定的。然而，智能体框架仍然主要由人类专家设计，这种范式随着现代LLM日益多样化和快速演变而难以扩展。在本文中，我们引入了Self-Harness，一种新的范式，其中基于LLM的智能体改进其自身的操作框架，而不依赖人类工程师或更强的外部智能体。我们将Self-Harness实现为一个迭代循环，包含三个阶段：弱点挖掘，从执行轨迹中识别模型特定的失败模式；框架提议，生成与这些失败相关的多样化但最小的框架修改；以及提议验证，仅在回归测试后接受候选编辑。我们在Terminal-Bench-2.0上使用最小初始框架和来自不同家族的三个基础模型实例化了Self-Harness：MiniMax M2.5、Qwen3.5-35B-A3B和GLM-5。在所有三个模型上，Self-Harness一致地提高了性能，保留通过率分别从40.5%提高到61.9%，从23.8%提高到38.1%，以及从42.9%提高到57.1%。定性分析进一步表明，Self-Harness不仅仅是添加通用指令，而是有效地将模型特定的弱点转化为具体的、可执行的框架更改。这些结果表明了一条路径，使得基于LLM的智能体不仅被其框架塑造，而且能够参与重塑自身框架。

英文摘要

The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduce Self-Harness, a new paradigm in which an LLM-based agent improves its own operating harness, without relying on human engineers or stronger external agents. We operationalize Self-Harness as an iterative loop with three stages: Weakness Mining, which identifies model-specific failure patterns from execution traces; Harness Proposal, which generates diverse yet minimal harness modifications tied to these failures; and Proposal Validation, which accepts candidate edits only after regression testing. We instantiate Self-Harness on Terminal-Bench-2.0 using a minimal initial harness and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, Self-Harness consistently improves performance, with held-out pass rates increasing from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses further show that Self-Harness does not simply add generic instructions, but effectively turns model-specific weaknesses into concrete, executable harness changes. These results suggest a path toward LLM-based agents that are not merely shaped by their harnesses, but can also participate in reshaping them.
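
The three-stage loop described above can be sketched as a small control loop. The callables `mine_weaknesses`, `propose_edits`, and `passes_regression`, and the rule of accepting the first validated candidate per round, are hypothetical stand-ins; the abstract does not specify these interfaces.

```python
def self_harness_loop(harness, mine_weaknesses, propose_edits, passes_regression, rounds=3):
    """Iteratively improve a harness, accepting only validated edits.

    mine_weaknesses(harness) -> list of failure patterns (hypothetical)
    propose_edits(harness, weaknesses) -> candidate harnesses (hypothetical)
    passes_regression(candidate, harness) -> bool (hypothetical)
    """
    for _ in range(rounds):
        weaknesses = mine_weaknesses(harness)            # stage 1: Weakness Mining
        if not weaknesses:
            break
        for candidate in propose_edits(harness, weaknesses):  # stage 2: Harness Proposal
            if passes_regression(candidate, harness):         # stage 3: Proposal Validation
                harness = candidate
                break
    return harness
```

The validation gate is what keeps the loop from drifting: a proposal that does not pass regression testing leaves the harness unchanged for that round.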

2606.09632 2026-06-09 cs.CL New submission

Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

Daoyu Wang, Mingyue Cheng, Qingchuan Li, Shuo Yu, Jie Ouyang, Qi Liu

Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China

AI summary: Proposes Claw-R1, a system whose Gateway Server and Data Pool components turn agent interaction steps into structured data assets, supporting live inspection, quality-based curation, and training-batch configuration to address data lifecycle management in agentic reinforcement learning.

Abstract

Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and training frameworks, but pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. To bridge this gap, we present Claw-R1, an interactive step-level data middleware system for agentic RL. Claw-R1 connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool. The Gateway Server captures multi-turn interaction steps through a unified LLM API entry point, while the Data Pool organizes them into step-level records consisting of prompt IDs, response IDs, rewards, and other metadata. In our demo, users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms. Overall, Claw-R1 treats agent interaction traces as managed data assets rather than temporary runtime logs. Through this demonstration, we hope to encourage the community to recognize the importance of data management in agentic RL. Our code is available at https://github.com/AgentR1/Claw-R1 and the demonstration video can be found at https://youtu.be/Pw47dAOw6B0.
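A step-level record of the kind described above could look roughly like the sketch below. The field names, the reading of prompt and response IDs as token ID lists, and the `training_ready` filter are illustrative assumptions, not the Claw-R1 schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StepRecord:
    """One step-level record; every field name here is an illustrative assumption."""
    trajectory_id: str
    step_index: int
    prompt_ids: List[int]
    response_ids: List[int]
    reward: Optional[float] = None  # None until a reward has been assigned
    metadata: Dict[str, str] = field(default_factory=dict)


def training_ready(records: List[StepRecord], min_reward: float) -> List[StepRecord]:
    """Keep steps that already have a reward at or above a quality threshold."""
    return [r for r in records if r.reward is not None and r.reward >= min_reward]


pool = [
    StepRecord("traj-1", 0, [1, 2, 3], [4, 5], reward=0.9),
    StepRecord("traj-1", 1, [1, 2, 3, 4, 5], [6], reward=None),  # reward still pending
    StepRecord("traj-2", 0, [7, 8], [9, 10], reward=0.2),
]
batch = training_ready(pool, min_reward=0.5)
print([(r.trajectory_id, r.step_index) for r in batch])
```

Filtering on both readiness (a reward exists) and quality (the reward clears a threshold) is one simple way to read the abstract's phrase about curating data by quality and readiness.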

2606.09751 2026-06-09 cs.AI cs.CL cs.HC Cross-list

Collaborative Human-Agent Protocol (CHAP)

Arsalan Shahid, Gordon Suttie, Philip Black

Affiliations: Brightbeam AI

AI summary: Proposes the CHAP protocol, which uses structured event records (diff, rationale, hash) and composable profiles to address the loss of human-judgement signals in multi-human, multi-agent collaboration.

Abstract

Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisions. Production deployments are no longer one human supervising one model. They are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries. The technical surface for this collaboration remains weakly specified. When an agent drafts a response and a human edits it before it ships, the moment of human judgement is the most valuable signal in the system. In current practice it is recorded, if at all, in application code, chat threads, ticket comments, and tribal memory. Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability. Neither defines the shared workspace in which humans and agents perform accountable work together. This paper presents CHAP, the Collaborative Human-Agent Protocol. Under CHAP, the override that used to vanish into a chat thread becomes a structured event carrying a diff, a rationale, and a content hash. The handoff between shifts becomes a portable envelope rather than a pinned message. The human approval of an agent's draft becomes a non-repudiable signed decision that can be replayed years later. The protocol achieves this through a small Core (workspaces, participants, tasks, artefacts, and an append-only evidence log) together with composable profiles that add review, modes, routing, deliberation, handoff, identity, signatures, and transparency-backed audit as deployments require them. Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap
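As a rough illustration of an override becoming a structured event that carries a diff, a rationale, and a content hash, consider the sketch below. The event field names, the `record_override` helper, and the in-memory list standing in for the append-only evidence log are all assumptions made for this example; they are not taken from the CHAP specification.

```python
import difflib
import hashlib
import json
from typing import Dict, List

evidence_log: List[Dict[str, str]] = []  # append-only by convention in this sketch


def record_override(draft: str, edited: str, rationale: str, actor: str) -> Dict[str, str]:
    """Build an override event with a diff, a rationale, and a content hash, then append it."""
    diff = "\n".join(
        difflib.unified_diff(
            draft.splitlines(),
            edited.splitlines(),
            fromfile="agent_draft",
            tofile="human_edit",
            lineterm="",
        )
    )
    event = {
        "type": "override",  # hypothetical event type name
        "actor": actor,
        "diff": diff,
        "rationale": rationale,
        "content_hash": hashlib.sha256(edited.encode("utf-8")).hexdigest(),
    }
    evidence_log.append(event)
    return event


event = record_override(
    draft="Your claim is denied.",
    edited="Your claim needs one more document before we can decide.",
    rationale="Denial was premature; policy requires a document request first.",
    actor="human:reviewer-1",
)
print(json.dumps(event, indent=2))
```

Hashing the edited content lets a later reader check that the artefact they hold is the one the human approved; signing the event, which the abstract also mentions, is left out of this sketch.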

2606.09774 2026-06-09 cs.AI cs.CL Cross-list

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang, Lianhui Qin

Affiliations: University of California, San Diego

AI summary: Proposes SIGA, an adapter that turns a general coding agent into an operator of scientific simulation software through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination; it achieves a roughly 36x speedup on GEOS and supports self-evolution to improve performance.

Abstract

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.
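The headline figures above can be checked with a few lines of arithmetic. The inputs are the rounded values quoted in the abstract ("about three hours", "about five minutes", TreeSim 0.720 and 0.789), so the results are approximate by construction.

```python
human_minutes = 3 * 60  # "about three hours"
agent_minutes = 5       # "about five minutes"
speedup = human_minutes / agent_minutes

bare, grounded = 0.720, 0.789  # held-out TreeSim without and with grounding
relative_gain = (grounded - bare) / bare

print(f"speedup ~ {speedup:.0f}x")
print(f"relative gain ~ {relative_gain:.1%}")
```

The relative gain works out to about 9.6%, which is consistent with the abstract's "roughly 10%".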

2306.16092 2026-06-09 cs.CL Updated version

Chatlaw: A Multi-Agent Legal Assistant based on a Role-Aligned Mixture-of-Experts Architecture

Jiaxi Cui, Munan Ning, Zongjian Li, Bohua Chen, Yang Yan, Hao Li, Bin Ling, Yonghong Tian, Li Yuan

Affiliations: Shenzhen Graduate School, Peking University; Peng Cheng Laboratory; Law School, Peking University; Pandalla.ai

AI summary: Proposes Chatlaw, a multi-agent legal assistant that uses a Role-Aligned Mixture-of-Experts architecture to emulate law-firm collaboration; it improves accuracy on the LawBench benchmark by 7.73% over GPT-4 and is validated on real cases.

Comments Accepted manuscript. Updated to match the journal version and added DOI

DOI: 10.1016/j.fmre.2026.03.026

Abstract

Artificial Intelligence (AI) holds great potential in legal services, yet Large Language Models (LLMs) face two major challenges: limited knowledge of the Chinese legal system and vulnerability to hallucinations. To address these issues, we present Chatlaw, a multi-agent legal assistant. Chatlaw's framework is designed to emulate the Standard Operating Procedures (SOP) of real law firms, where different roles (e.g., assistant, researcher, senior lawyer) collaborate on a case. To computationally mirror this collaborative structure, we developed a novel Role-Aligned Mixture-of-Experts (RA-MoE) architecture. In this system, the internal "experts" are specifically trained to align with the distinct tasks of each agent role (e.g., inquiry, analysis, drafting). These specialized agents (Legal Assistant, Researcher, etc.) then form the collaborative framework. When they interact with users, retrieve legal knowledge, analyze case details, or generate reliable consultations, the RA-MoE architecture intelligently routes their computations to the corresponding dedicated expert, ensuring each step is handled by the most qualified parameters. In evaluations, Chatlaw surpasses general-purpose AI models, including GPT-4, achieving a 7.73% improvement in accuracy on the LawBench benchmark and an 11-point higher score on the Unified Qualification Exam for Legal Professionals. Real-case studies and expert assessments further confirm its robustness. Chatlaw enhances the accessibility and reliability of legal services, advancing the provision of legal support to the public.

2510.19186 2026-06-09 cs.CL Updated version

When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue

Tanya Shourya, Yingfan Wang, Zhaoyi Joey Hou, Shamik Roy, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah

Affiliations: AWS AI Labs; University of Pittsburgh

AI summary: Addresses cases in tool-augmented dialogue systems where users are satisfied but the agent is wrong; proposes the TRACE benchmark, which systematically synthesizes diverse error cases, and finds that existing evaluation frameworks perform far from ideally.

Comments The Fifth Generation, Evaluation & Metrics Workshop (GEM) at ACL 2026

2601.07994 2026-06-09 cs.CL cs.AI Updated version

DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs

Nayoung Choi, Jonathan Zhang, Jinho D. Choi

Affiliations: Computer Science, Emory University

AI summary: DYCP dynamically identifies and retrieves dialogue segments to manage context more efficiently for LLMs in long-form dialogue, yielding more precise context selection and better inference efficiency.

2603.09995 2026-06-09 cs.CL cs.AI Updated version

Breaking the Curse of Knowledge: Designing Personalized Jargon Support for Real-Time Online Meetings

Yifan Song, Yijun Liu, Wing Yee Au, Hon Yung Wong, Brian P. Bailey, Tal August

Affiliations: University of Illinois Urbana-Champaign; Fujitsu Research of America

AI summary: Proposes ParseJargon, a system that uses user profiles and in-session feedback to personalize jargon identification, improving comprehension and engagement for cross-disciplinary listeners in online meetings.

Comments Portions of this work appeared in CHI '26 Extended Abstracts ("Breaking the Curse of Knowledge: Toward Personalized Jargon Support in Online Meetings") and ACL '26 System Demonstrations ("ParseJargon: Personalized Real-time Jargon Support in Online Meetings")

Abstract

Cross-disciplinary communication is often hindered by specialized language (i.e., jargon) and uneven background knowledge. Recent advances in speech-to-text and large language models make it possible to provide jargon support during online meetings, but generic support (i.e., defining the same terms for everyone) can overwhelm listeners with definitions they do not need. We present ParseJargon, a system for personalized jargon support in real-time online meetings. We begin with an initial prototype to probe the use of single-sentence user profiles for personalization. We conducted a controlled study and showed that even this minimal personalization enhanced listeners' comprehension and engagement over generic support because of more precise jargon identification. Guided by insights from participants' feedback, we refined the system with more advanced personalization techniques, including in-session user feedback and portable glossary-based profiles. We evaluated how these techniques can further improve jargon identification precision using data collected in the controlled study to simulate personalization over time. We also conducted a latency test, complemented by a lightweight deployment, to analyze the system's real-time capability and usability.

2601.19082 2026-06-09 cs.AI cs.CL cs.GT cs.LG cs.MA Updated version

Payoff scaling shapes cooperation in LLM agents across languages

Trung-Kiet Huynh, Dao-Sy Duy-Minh, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han

Affiliations: Faculty of Information Technology, University of Science (HCMUS), Ho Chi Minh City, Vietnam; Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam; Vietnam National University – Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam; Luxembourg Institute of Science and Technology (LIST), Luxembourg; School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, United Kingdom

AI summary: Uses supervised classifiers to identify strategies in the repeated Prisoner's Dilemma and compares them with an evolutionary game theory baseline; finds that LLMs become more cooperative as payoffs grow, contrary to the evolutionary prediction, which points to the influence of alignment training and human-like reasoning patterns.

Comments 44 pages, 17 figures, 4 tables

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents that negotiate, coordinate, and act on behalf of users. Whether they cooperate in such settings is no longer just an academic question, but a central issue for AI governance. We approach it from a strategic-behaviour angle, asking how two everyday levers - the size of what is at stake, and the language in which the interaction is described - shape the strategies LLMs adopt in a repeated Prisoner's Dilemma. Rather than reading cooperation off raw action counts, we train supervised classifiers to recognise the canonical strategies of repeated games (always cooperate, always defect, Tit-for-Tat, Win-Stay-Lose-Shift) and use them as a lens onto LLM behaviour. To know what the strategy distribution should look like under the same payoffs, we derive an evolutionary game theory (EGT) baseline and compare it with the LLM data. The two outcomes disagree in a revealing way: as stakes grow, evolutionary theory predicts that defection should take over the population, yet LLMs move in the opposite direction, becoming more cooperative - a signature, we argue, of alignment training and the human-like reasoning patterns LLMs inherit from their training data. We further show that this picture is not particular to frontier-scale, proprietary models: it also occurs with three open-weight smaller LLMs. Overall, our analysis highlights that payoff design and linguistic framing are powerful but under-explored levers for steering LLM behaviour, with direct implications for evaluating, aligning, and governing multi-agent AI systems deployed in high-stakes, multilingual environments.
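The four canonical strategies named above have standard textbook definitions, which the sketch below implements so that their behaviour can be compared move by move. The payoff values (T=5, R=3, P=1, S=0) are the conventional ones and are an assumption here; the paper varies the payoff scale, and its classifiers and exact game settings are not reproduced.

```python
from typing import Callable, List, Tuple

C, D = "C", "D"
History = List[Tuple[str, str]]  # (own move, opponent move) per round

# Assumed conventional Prisoner's Dilemma payoffs for the row player: T=5, R=3, P=1, S=0.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}


def always_cooperate(history: History) -> str:
    return C


def always_defect(history: History) -> str:
    return D


def tit_for_tat(history: History) -> str:
    # Cooperate first, then copy the opponent's previous move.
    return C if not history else history[-1][1]


def win_stay_lose_shift(history: History) -> str:
    # Cooperate first; repeat the last move after a good payoff (R or T), switch otherwise.
    if not history:
        return C
    own, opp = history[-1]
    if PAYOFF[(own, opp)] >= PAYOFF[(C, C)]:
        return own
    return D if own == C else C


def play(a: Callable[[History], str], b: Callable[[History], str], rounds: int) -> str:
    """Return player a's move sequence when playing against player b."""
    hist_a: History = []
    hist_b: History = []
    for _ in range(rounds):
        move_a, move_b = a(hist_a), b(hist_b)
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return "".join(move for move, _ in hist_a)


print(play(tit_for_tat, always_defect, 5))          # CDDDD
print(play(win_stay_lose_shift, always_defect, 5))  # CDCDC
```

Against a constant defector, Tit-for-Tat locks into defection after one round while Win-Stay-Lose-Shift alternates, which shows why raw cooperation counts alone cannot tell these strategies apart from noisier behaviour.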

2606.07925 2026-06-09 cs.CL New submission

ROSUM-MCTS: Monte Carlo Tree Search-Inspired HDL Code Summarization with Structural Rewards

Prashanth Vijayaraghavan, Charles Mackin, Luyao Shi, Apoorva Nitsure, Ashutosh Jadhav, David Beymer, Tyler Baldwin, Ehsan Degan, Vandana Mukherjee

Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes ROSUM-MCTS, which guides a large language model with a Monte Carlo Tree Search-inspired procedure and refines hardware description language code summaries through hierarchical candidate expansion and a composite reward function, outperforming baselines on VHDL and Verilog datasets.

Comments 7 pages

Journal ref: ICLAD'2025

Abstract

Large language models (LLMs) have shown promise in code summarization, yet their effectiveness for Hardware Description Languages (HDLs) like VHDL and Verilog remains underexplored. We propose ROSUM-MCTS, an LLM-guided approach inspired by Monte Carlo Tree Search (MCTS) that refines summaries through structured exploration and reinforcement-driven optimization. Our method integrates both local and global context via a hierarchical candidate expansion mechanism and optimizes summaries using a composite reward function balancing functional correctness (FC), local content adequacy (LCA), and fluency. We evaluate ROSUM-MCTS on the VHDL-eval and Verilog-eval datasets, demonstrating its consistent outperformance over baseline methods by leveraging structured bottom-up refinement and reinforcement-based optimization. Ablation studies confirm the necessity of both local and global expansion strategies, as well as the importance of balancing FC and LCA for optimal performance. Furthermore, ROSUM-MCTS proves robust against superficial modifications, such as variable renaming, maintaining summary quality where baselines degrade. These results establish ROSUM-MCTS as an effective and robust HDL summarization framework, paving the way for further research into reinforcement-enhanced code summarization.

2606.07951 2026-06-09 cs.CL cs.AI cs.LG New submission

From `May' to `Is': Certainty Distortion in Language Model Rewriting

Catarina G Belem, Shang Wu, Hongyu Yao, Mark Steyvers, Sameer Singh, Padhraic Smyth

Affiliations: University of California Irvine; Massachusetts Institute of Technology

AI summary: Studies a systematic bias in which language models increase expressed certainty during rewriting; proposes an evaluation metric grounded in population-level judgments and finds that up to 75% of outputs show certainty distortion, with models more likely to raise certainty than to lower it.

Abstract

Humans increasingly turn to Language Models (LMs) in ways that shape beliefs and drive decisions, including discussing, rewriting, and summarizing information from scientific articles, news, and medical reports. However, in these domains, where how confidently a claim is expressed matters, little is known about whether LMs faithfully preserve it. In this work, we investigate certainty distortion in LMs, defined as meaningful changes in expressed certainty when semantic content is preserved. We propose an LM-based evaluation metric that is consistent with population-level judgments of certainty. Using this metric, we characterize certainty distortion across different sizes and families of models in the context of scientific and medical communication tasks. Our results show that certainty distortion affects up to 75% of LM outputs and is systematically asymmetric in rewriting tasks, with most LMs being 1.5-2x more likely to increase the expressed certainty than to decrease it. These effects can compound over repeated paraphrasing: in the medical domain, claude-haiku-4-5 increases the certainty of 20% of examples after a single iteration, rising to 40% after five iterations. Prompt-based interventions reduce overall certainty distortion but do not eliminate it. Together, these findings reveal a general bias toward inflating expressed certainty, with direct implications for users who rely on LMs in high-stakes domains.

2606.08048 2026-06-09 cs.CL New submission

Diffusion Language Model Parallel Decoding via Product-of-Experts Bridge

Juntong Shi, Brian L. Trippe, Jure Leskovec, Stefano Ermon, Minkai Xu

Affiliations: Stanford University

AI summary: Proposes PoE-Bridge, a framework that builds an intermediate Product-of-Experts distribution to combine the parallel decoding of diffusion language models with the quality of autoregressive models, achieving a 5x speedup while recovering at least 95% of AR performance.

Comments ICML 2026

Abstract

Diffusion language models (DLMs) offer substantial speed advantages through parallel decoding, but the lack of token dependencies limits generation quality compared to autoregressive (AR) models. Recent progress attempts to bridge the gap via importance sampling, with DLM being the proposal and AR being the target. However, due to the huge gap between their distributions, the sampling requires a large number of particles and is thus expensive to compute. In this paper, we introduce PoE-Bridge, a novel decoding framework that drastically improves generation speed and accuracy by introducing an intermediate distribution to bridge the gap. The distribution is constructed as a Product-of-Experts (PoE) of the DLM proposal and the AR target. With the intermediate distribution, we first use the DLM to draft multiple continuations in parallel, then apply rejection sampling to verify the drafted tokens and move the resulting candidates toward the PoE. We then use importance sampling to further correct the PoE-aligned candidates toward the AR target. We further propose several improved techniques, including mixed-temperature sampling for enhanced diversity and elastic rejection windows for reducing wasted verification. Empirically, PoE-Bridge achieves significantly improved accuracy with 5x speedup over the standard DLM decoding approach, and recovers at least 95% of the target AR model's performance, efficiently advancing most of the quality gap on challenging mathematical reasoning and coding tasks. Our code is available at https://github.com/juntongshi48/poe-bridge.
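To make the idea of an intermediate Product-of-Experts distribution concrete, here is a toy single-token sketch: the PoE is taken as the normalized elementwise product of a proposal and a target, and standard rejection sampling moves proposal draws toward it. The three-token vocabulary, the probabilities, and the unweighted product are assumptions for illustration; the abstract does not specify the paper's exact construction, and its sequence-level procedure and importance-sampling correction are not shown.

```python
import random
from typing import Dict

Dist = Dict[str, float]


def product_of_experts(p: Dist, q: Dist) -> Dist:
    """Normalize the elementwise product of two categorical distributions."""
    unnorm = {tok: p[tok] * q[tok] for tok in p}
    z = sum(unnorm.values())
    return {tok: w / z for tok, w in unnorm.items()}


def rejection_sample(proposal: Dist, target: Dist, rng: random.Random) -> str:
    """Draw one sample from `target` using draws from `proposal` (standard rejection sampling)."""
    bound = max(target[t] / proposal[t] for t in proposal if proposal[t] > 0)
    tokens, weights = zip(*proposal.items())
    while True:
        tok = rng.choices(tokens, weights=weights)[0]
        if rng.random() < target[tok] / (bound * proposal[tok]):
            return tok


dlm = {"cat": 0.5, "dog": 0.3, "eel": 0.2}  # toy DLM proposal over three tokens
ar = {"cat": 0.2, "dog": 0.7, "eel": 0.1}   # toy AR target
poe = product_of_experts(dlm, ar)
print({t: round(w, 3) for t, w in poe.items()})

rng = random.Random(0)
draws = [rejection_sample(dlm, poe, rng) for _ in range(2000)]
print(draws.count("dog") / len(draws))  # should be close to poe["dog"]
```

Because the PoE sits between the two distributions, the density ratio between it and the proposal is smaller than the ratio between the AR target and the proposal, so fewer drafts are rejected; this is the intuition behind bridging the gap in two steps.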

2606.08184 2026-06-09 cs.CL New submission

TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding

Mahbub E Sobhani, Anika Tasnim Rodela, Chowdhury Mofizur Rahman, Dewan Md. Farid, Swakkhar Shatabda

Affiliations: United International University; BRAC University; Southeast University

AI summary: Proposes TextEconomizer, an encoder-decoder framework that combines denoising transformers with entropy coding to reduce inputs by 50% to 80% with about 153x fewer parameters, while keeping near-perfect text quality on BLEU and related metrics.

Comments Published in Neural Networks (Elsevier), Vol. 203, 2026

DOI: 10.1016/j.neunet.2026.109111
Journal ref: Neural Networks, Vol. 203, 109111, 2026

Abstract

Lossy text compression reduces data size while preserving core meaning, making it well-suited for summarization, automated analysis, and digital archives. Despite the dominance of transformer-based models in language modeling, integrating context vectors and entropy coding into Sequence-to-Sequence (Seq2Seq) generation remains underexplored. A key challenge lies in identifying the most informative context vectors from encoder output and incorporating entropy coding to enhance storage efficiency while maintaining high-quality outputs, even under noisy text. We introduce TextEconomizer, an encoder-decoder framework paired with a transformer neural network that reduces variable-sized inputs by 50% to 80% without prior knowledge of dataset dimensions. Our model achieves competitive compression ratios via entropy coding while delivering near-perfect text quality, assessed by BLEU, ROUGE, METEOR, and semantic similarity scores. TextEconomizer operates with approximately 153x fewer parameters than comparable models, achieving a 5.39x compression ratio without sacrificing semantic quality. We also evaluate an LSTM-based autoencoder achieving a state-of-the-art 67x compression ratio with 196x fewer parameters, and LLaMAFormer, a modified transformer with 263x fewer parameters than ICAE while maintaining competitive text quality. TextEconomizer significantly surpasses existing transformer-based models in balancing memory efficiency and high-fidelity outputs, marking a breakthrough in lossy compression with optimal space utilization.

2606.08357 2026-06-09 cs.CL New submission

Forward-Free Diffusion Language Models

Haotian Sun, Rushi Qiang, Yuqian Zheng, Bo Dai

Affiliations: Georgia Institute of Technology

AI summary: Proposes FReDA, a diffusion language model that needs no hand-designed forward process; it uses recursive distribution refinement with model-generated drafts as implicit intermediate states, outperforms larger models on reasoning and coding tasks, and achieves a 1.5-1.8x speedup.

Abstract

Diffusion language models generate text through iterative denoising, offering a powerful alternative to autoregressive generation. However, discrete language spaces lack a natural neighborhood structure for defining effective perturbations, so some artificial corruption schemes are proposed in the forward process. Such prescribed forward processes often produce states that are mathematically convenient but misaligned with drafts and errors encountered during generation, resulting in degraded sample quality. To address this limitation, we propose FReDA, a forward-free diffusion language model that eliminates the need for a hand-designed forward process. We formulate diffusion language modeling as recursive distribution refinement, in which model-generated drafts serve as implicit intermediate states, and the learned refinement model progressively moves the draft distribution toward the target distribution. Concretely, FReDA refines drafts by proposing candidate draft sequences and either directly performing self-refinement or selecting among parallel candidates via best-of-N refinement. With this design, FReDA is neighborhood-agnostic, model-complexity-aware, and compatible with flexible refinement parameterizations. Extensive evaluations in the sub-8B regime show that FReDA-4B outperforms larger diffusion base models on reasoning and coding benchmarks, achieving absolute gains of up to 15%, while reaching a 1.5-1.8x average speedup over diffusion baselines and scaling effectively with additional refinement computation.

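2606.08408 2026-06-09 cs.CL cs.AI New submission

TimpaTeks: Automatic In-place Text Sequence Modification via Diffusion Language Model Steering

Ryandito Diandaru, Ikhlasul Akmal Hanif, Fadli Aulawi Al Ghiffari, Ahmed Elshabrawy, Alham Fikri Aji

Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

AI summary: Proposes TimpaTeks, which extends activation steering to diffusion language models to modify text in place and change its concept, lowering perplexity and preserving sentence structure on sentiment and concept steering tasks.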

Comments 16 pages

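2606.08411 2026-06-09 cs.CL New submission

AsyncLane: Decoupling Refinement from Advancement in Diffusion Language Model Decoding

Yingxuan Ren, Yuxuan Lou, Yong Liu, Pengcheng Fang, Ziming Wang, Pengfei Zhou, Yang You

Affiliations: National University of Singapore; University of Southampton

AI summary: Proposes AsyncLane, a training-free decoding scheduler that forks generation into refinement and advancement lanes to decouple inter-block dependencies, substantially raising the decoding throughput of diffusion language models while preserving quality.

AI abstract (translated from Chinese)

Block-wise semi-autoregressive decoding is the standard inference paradigm for diffusion large language models (DLMs), but it enforces a strict dependency between blocks: the next block cannot start until the current block is fully decoded or its denoising budget is exhausted. We observe that once a block exposes a reliable delimiter boundary or a stable semantic prefix, continuation generation does not need to wait for every residual token to be resolved. We propose AsyncLane, a training-free decoding scheduler that decouples refinement from advancement. AsyncLane forks a generation lane at an observed delimiter boundary into a refinement lane and a continuation lane: the prefix stays editable while the continuation advances before prefix refinement finishes. The resulting lane tree records decoding dependencies and output order, while execution runs over the set of active lanes. To make this asynchronous scheduling efficient under bidirectional attention, AsyncLane combines shared-prefix lane batching, lookahead draft reuse, cascaded termination, and compact cache refresh with refresh-logit reuse, which keeps model-call cost from growing linearly with the number of lanes. AsyncLane is a plug-and-play replacement for block-wise DLM samplers and requires no retraining. Experiments on mathematical reasoning and code generation show that AsyncLane consistently improves throughput while maintaining competitive quality. On LLaDA and Dream backbones, AsyncLane achieves the highest TPS across all evaluated benchmark and length settings; relative to the fastest competing baseline, it reaches a peak speedup of 2.95x on LLaDA and 3.04x on Dream, with especially large gains under longer generation budgets.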

Automated Individualized Education Program Generation for Traditional Chinese Parent-Teacher Meetings via Corpus-Grounded Feature Diffusion

Kuanlin Chen, Cheng-En Ou

Affiliations: National University of Singapore; University of California, Berkeley

AI summary: Targets data scarcity and privacy constraints in generating Traditional Chinese Individualized Education Programs (IEPs); proposes a low-resource fine-tuning pipeline based on Corpus-Grounded Feature Diffusion (CGFD) that produces high-quality samples through seed selection, feature diffusion, and Grammar-Constrained Decoding (GCD), and finds that GCD is counterproductive for Traditional Chinese, with the no-GCD path better on both reliability and speed.

Comments 12 pages, 5 figures

Abstract

Writing Individualized Education Programs (IEPs) is a high-labor, knowledge-intensive document burden; English-language research has demonstrated that generative AI can significantly reduce drafting time, yet automated IEP generation in Traditional Chinese remains virtually unexplored due to domain data scarcity, strict privacy regulations, and the absence of local evaluation benchmarks. We propose a low-resource fine-tuning pipeline centered on Corpus-Grounded Feature Diffusion (CGFD): (1) 25 dual-expert high-score seed transcripts are selected via a tau threshold with flag-aware score caps; (2) a FeatureProfile (sentence length, structure, quantification templates) is extracted from seeds and injected into LLM prompts alongside Verbalized-Sampling-style diversity control to drive diffusion; (3) 15 expert gold seeds are used as diffusion anchors, targeting 585 samples; 567 valid diffusion samples are obtained, yielding a 582-sample training set used to fine-tune Breeze-7B with QLoRA; (4) schema-constrained inference via Grammar-Constrained Decoding (GCD) enforces a hierarchical SMART Goal Ladder schema at inference time. Ablation results on a 55-sample schema stress set reveal an unexpected finding: GCD is counterproductive under Traditional Chinese token budgets -- the no-GCD path achieves 100% schema pass rate at 34% lower median latency, outperforming GCD on both reliability and speed. On the n=10 formal hold-out, the no-GCD inference path achieves BERTScore F1 = 0.779, exceeding GPT-5.4 (0.726), DeepSeek-V3.2 (0.703), Gemini-3-Flash-Preview (0.703), and Llama-4-Maverick (0.700) zero-shot baselines while maintaining fully local, air-gapped inference. This system addresses a gap in Traditional Chinese special-education NLP and offers a scalable, privacy-preserving local inference solution under an industrial engineering paradigm.
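The sample counts in step (3) can be cross-checked with simple arithmetic. Reading the 582-sample training set as the 15 gold seeds plus the 567 valid diffusion samples is an inference from the numbers in the abstract, not something it states outright.

```python
gold_seeds = 15        # expert gold seeds used as diffusion anchors
target_samples = 585   # diffusion samples targeted
valid_samples = 567    # diffusion samples that passed validation

training_set = gold_seeds + valid_samples
yield_rate = valid_samples / target_samples

print(training_set)         # 582
print(f"{yield_rate:.1%}")  # 96.9%
```

Under that reading the totals agree with the reported 582-sample training set, and about 97% of the targeted diffusion samples were kept.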

2606.09709 2026-06-09 cs.CL New submission

IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking

Zechen Sun, Yuyang Sun, Zecheng Tang, Juntao Li, Wenpeng Hu, Wenliang Chen, Zhunchen Luo, Guotong Geng, Min Zhang

Affiliations: Institute of Computer Science and Technology, Soochow University; Information Research Center of Military Science, PLA Academy of Military Science

AI summary: To address the length collapse that static hierarchical planning causes in long-form LLM generation, the paper proposes the Interleaved Structural Chain-of-Thought (IS-CoT) framework, which enables continuous strategy adaptation through a dynamic Plan-Write-Reflect cycle; the resulting IS-Writer-8B model achieves state-of-the-art performance on long-form benchmarks.

English abstract

Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they suffer from a severe length collapse in open-ended writing, where performance degrades sharply as target lengths exceed 2,000 words. We attribute this failure to the limitation of static hierarchical planning, which struggles to provide dynamic guidance over extended contexts. To bridge this gap, we introduce the Interleaved Structural Chain-of-Thought (IS-CoT) framework. Unlike external agentic workflows, IS-CoT embeds a dynamic Plan-Write-Reflect cycle into the generation process, enabling continuous strategy adaptation and global alignment without additional assistance. Based on this framework, we construct a high-quality dataset of interleaved reasoning traces via a multi-teacher pipeline and train IS-Writer-8B. Experiments demonstrate that IS-Writer-8B achieves state-of-the-art performance on challenging long-form benchmarks (e.g., +3.08 vs. DeepSeek-V3.2 on LongBench-Write), exhibiting robust length compliance and coherence competitive with significantly larger proprietary models.

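2504.00977 2026-06-09 cs.CL Updated version

Chinese Grammatical Error Correction: A Survey

Mengyang Qiu, Qingyu Gao, Linxuan Yang, Yang Gu, Tran Minh Nguyen, Zihao Huang, Jungyeul Park

Affiliations: KAIST

AI summary: This survey reviews research on Chinese Grammatical Error Correction (CGEC), covering datasets, annotation schemes, evaluation methodologies, and system advancements, and it identifies key challenges and outlines future directions.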

English abstract

Chinese Grammatical Error Correction (CGEC) is a critical task in Natural Language Processing, addressing the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing. While L2 learners struggle with mastering complex grammatical structures, L1 users also benefit from CGEC in academic, professional, and formal contexts where writing precision is essential. This survey provides a comprehensive review of CGEC research, covering datasets, annotation schemes, evaluation methodologies, and system advancements. We examine widely used CGEC datasets, highlighting their characteristics, limitations, and the need for improved standardization. We also analyze error annotation frameworks, discussing challenges such as word segmentation ambiguity and the classification of Chinese-specific error types. Furthermore, we review evaluation metrics, focusing on their adaptation from English GEC to Chinese, including character-level scoring and the use of multiple references. In terms of system development, we trace the evolution from rule-based and statistical approaches to neural architectures, including Transformer-based models and the integration of large pre-trained language models. By consolidating existing research and identifying key challenges, this survey provides insights into the current state of CGEC and outlines future directions, including refining annotation standards to address segmentation challenges, and leveraging multilingual approaches to enhance CGEC.

2604.08479 2026-06-09 cs.CL Updated version

AI generates well-liked but templatic empathic responses

Emma S. Gueorguieva, Hongli Zhan, Jina Suh, Javier Hernandez, Tatiana Lau, Junyi Jessy Li, Desmond C. Ong

Affiliations: Department of Psychology, The University of Texas at Austin; Department of Linguistics, The University of Texas at Austin; Department of Computer Science and Engineering, The University of Washington; Microsoft Research; Toyota Research Institute

AI summary: The study finds that LLM-generated empathic responses are highly templatic: using a taxonomy of 10 empathic language tactics, a single template covers 81-92% of the content of matched responses, whereas human-written responses are more diverse.

English abstract

Recent research shows that greater numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include validating someone's feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across a set of 2 studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse-functional level. We discovered a template -- a structured sequence of tactics -- that matches 83--90% of LLM responses (and 60--83% in a held-out sample) and, when matched, covers 81--92% of the response. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.

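2605.14531 2026-06-09 cs.CL Updated version

Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

ZiYi Dong, Yuliang Huang, Weijian Deng, Xiangyang Ji, Liang Lin, Pengxu Wei

Affiliations: University of Science and Technology of China

AI summary: The paper reformulates language generation as a stochastic optimal control problem, analyzes autoregressive and diffusion models from a unified theoretical perspective to explain their limitations, and proposes a Flow Matching-based closed-loop controller for efficient text generation.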

English abstract

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.

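2606.01736 2026-06-09 cs.CL cs.AI Updated version

Argument Collapse: LLMs Flatten Long-Form Public Debate

Yekyung Kim, Yapei Chang, Chau Minh Pham, Mohit Iyyer

Affiliations: University of Maryland, College Park

AI summary: The paper studies argument collapse in LLM-generated public-debate essays, in which essays from different models converge in their main arguments, sub-arguments, and paragraph structure; comparing human and LLM-generated texts shows that LLM arguments are markedly less diverse.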

English abstract

As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished, plausible arguments. We study argument collapse, the tendency of essays generated by different LLMs to converge to a smaller set of main arguments, sub-arguments, and paragraph-level structures. We compare 1,039 human responses from 195 New York Times (NYT) debates, 448 human responses from 61 longer-form Boston Review (BR) forums, and 23,384 LLM-generated essays. In the NYT corpus, 65.3% of human main arguments are unique within a debate, compared to 3.4% of LLM main arguments. Asking LLMs to generate diverse answers adds variation, but a typical model recovers only about half of the distinct human main arguments, with much of the added variation falling outside the observed human argument space. Collapse also appears in sub-arguments, where among essays with the same main argument, 41.0% of human sub-arguments are unique versus 9.1% from LLM responses. Qualitatively, LLMs often reuse generalized and hedged sub-arguments, while humans prefer more concrete and topic-specific ones. Structure-wise, LLM-generated essays tend to follow a more fixed arc, often opening with a direct claim and moving quickly toward proposals. The same patterns hold in longer BR essays, suggesting that argument collapse extends beyond short-form responses.

2606.07522 2026-06-09 cs.CL cs.LG cs.SI New submission

Community-Specific Slang and Entity Detection via Semantic Shift in Fine-Tuned Language Models

Julia Kruk, Sanchita Porwal, Amitrajit Bhattacharjee, Mansi Phute

Affiliations: Georgia Institute of Technology

AI summary: Proposes an unsupervised method that automatically identifies slang, unique entities, and folklore in online-community text by measuring how much each word's semantics shift between the base and fine-tuned models.

Comments 6 pages, 6 figures, 2 tables

English abstract

We propose an unsupervised method of resolving slang, unique entities, and folklore from online communities by isolating words in the lexicon that have the highest magnitude of semantic shift. Semantic shift is defined as the evolution of a word's encoded representation as a result of fine-tuning a pretrained Large Language Model (LLM) on a community-specific text corpus. This value is inversely proportional to the cosine similarity between the base model's encoded representation of a word, and a fine-tuned model's encoded representation. We fine-tune the DistilRoBERTa model on text corpora collected from 3 Reddit subreddits (r/Technology, r/Gaming, r/WorldofWarcraft), model a distribution of cosine similarity over the lexicon, and show that one can successfully resolve words that have unique significance to the community by pulling data in the bottom 10-percentile. In contrast, we show that data in the top 10-percentile consist of words that carry relatively universal semantics.
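
The percentile selection described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the toy 2-d vectors stand in for base and fine-tuned DistilRoBERTa word representations, and the helper names `cosine_similarity` and `split_by_shift` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def split_by_shift(base, tuned, fraction=0.10):
    """Rank words by base-vs-tuned cosine similarity.

    Returns (most_shifted, most_stable): the words in the bottom and top
    `fraction` of the similarity distribution. Low similarity means a
    large semantic shift after fine-tuning.
    """
    sims = {w: cosine_similarity(base[w], tuned[w]) for w in base}
    ranked = sorted(sims, key=sims.get)  # ascending similarity
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k], ranked[-k:]


# Toy 2-d "embeddings" (assumed values, for illustration only).
base = {
    "the": [1.0, 0.0], "and": [0.9, 0.1], "raid": [0.0, 1.0],
    "tank": [0.2, 0.8], "game": [0.5, 0.5], "of": [1.0, 0.1],
    "play": [0.6, 0.4], "nerf": [0.1, 0.9], "team": [0.7, 0.3],
    "win": [0.4, 0.6],
}
tuned = dict(base)
tuned["nerf"] = [0.9, -0.4]  # pretend fine-tuning moved this word sharply

shifted, stable = split_by_shift(base, tuned)
print(shifted)  # ['nerf']
```

Ranking by similarity rather than thresholding at a fixed value keeps the cut relative to each community's own distribution, which matches the abstract's use of percentiles.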

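2606.07525 2026-06-09 cs.CL cs.AI New submission

Implicit Causal Graph Construction in Text via Chain Discovery

Liesbeth Allein, Marie-Francine Moens

Affiliations: KU Leuven; Ghent University

AI summary: The paper uses large language models to infer intermediate events between cause-effect pairs described in text in order to construct implicit causal graphs, compares end-to-end construction with causal chain discovery, explores multi-model ensemble strategies, and evaluates on 1,560 scientifically validated causal pairs.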

English abstract

Causal graphs in text are typically populated by observable, predefined events. In contrast, we study implicit causal graph construction from text by treating each described cause-effect pair as the begin- and endpoint of an underlying latent causal graph and using large language models (LLMs) to infer intermediate causal events. We compare end-to-end graph construction with methods that frame the task as causal chain discovery. In the latter, graphs are built either by aggregating inferred chains or by progressively expanding partial chains through an iterative search process. We further explore Wisdom of the Crowd extensions that access causal knowledge from multiple LLMs in post-hoc aggregation and collaborative inference settings. We analyze trade-offs among these approaches and evaluate the validity of inferred causal relations using a manually curated database of 1,560 scientifically validated causal pairs. This database-based evaluation is proposed as reliable, resource-efficient, and transferable to settings where ground-truth graphs are unavailable.

2606.07066 2026-06-09 cs.CL New submission

Modeling semantic association in self-paced reading with language model embeddings

Sara Møller Østergaard, Kenneth Enevoldsen, Afra Alishahi, Bruno Nicenboim

Affiliations: Department of Computational Cognitive Science, Tilburg University; Center for Humanities Computing, Aarhus University

AI summary: The study quantifies semantic association with ten implementations based on language model embeddings, analyzes their effects on the N400 and self-paced reading times with Bayesian models, and finds that sentence embeddings reliably capture semantic association beyond word predictability.

English abstract

Semantic association between a word and its context has been identified as an important component of reading comprehension, even when word predictability is accounted for. Recent research has highlighted the potential of language model (LM) embeddings to quantify semantic association. Yet, embedding-based semantic association has been operationalized in a myriad of ways. In this study, we use embeddings from LMs to estimate semantic association on a corpus of joint electroencephalography (EEG) and self-paced reading of natural Dutch texts. Semantic association is calculated in ten different implementations that vary the embedding model and context lengths. The effects of semantic association across the different implementations on the N400 and self-paced reading times are examined using Bayesian hierarchical models and Bayes factors. The results show that the choice of embedding model can alter the estimated effect of semantic association on both the N400 and self-paced reading times. Furthermore, the results demonstrate a promising potential of sentence embeddings for capturing semantic association, as only implementations relying on sentence embeddings indicate reliable results of semantic association beyond word predictability on both neural and behavioral measures. Together, these findings highlight the importance of methodological choices in quantifying semantic association.

2606.08236 2026-06-09 cs.CL cs.LG New submission

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

Hyunjin Cho, Youngji Roh, Jaehyung Kim

Affiliations: University of California, Berkeley

AI summary: Proposes an unsupervised method that clusters model continuations using semantic embeddings and attribution signatures, uncovering hidden mechanistic modes and complementing circuit analysis.

Comments 40 pages

Journal ref: ICML 2026 Spotlight

English abstract

As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup can obscure heterogeneity across a model's continuation distribution. We introduce distribution-level unsupervised feature discovery, which clusters sampled continuations using both semantic content and sequence-level mechanistic attributions, without manually specifying target outputs. Our method represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective that trades off semantic coherence, mechanistic consistency, and cluster granularity. Across clustering and steering analyses, the discovered clusters expose continuation modes that single-view baselines miss and provide interventional evidence that cluster signatures correspond to actionable mechanistic factors. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable audit of the mechanisms underlying a model's continuation distribution.

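2606.08562 2026-06-09 cs.CL New submission

Inside the LLM Word Factory

Benzi Busigin, Yuval Pinter

Affiliations: Stein Faculty of Computer and Information Science

AI summary: Using activation patching experiments, the paper localizes English detokenization in Llama2-7B to a two-stage mechanism at Layer 1: attention transmits a token-specific signal from nonfinal subwords, and the MLP composes it with the local embedding. The structure generalizes to twelve models from eight families, but its depth depends on the type of positional encoding.

Comments 17 pages, 12 figures. Under review at EMNLP 2026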

English abstract

Transformer language models process input provided as subword fragments, but natural language semantics usually rely on word-level concepts. Detokenization is the process where models reconcile these two facts, aggregating subwords into word-level representations through their computation. Prior work has found that this takes place mostly in early-to-middle layers, but so far the exact mechanics of the process have not been pinned down. We venture deep into detokenization using activation patching in controlled paired experiments that isolate the contribution of different model components, localizing English detokenization in Llama2-7B to a two-stage process at Layer 1. Attention transmits a token-specific signal from nonfinal subwords, using sequential relays if necessary, while the MLP composes it with the local embedding. This two-stage structure generalizes to twelve models from eight families, but the depth over which it takes place depends on the flavor of positional encoding: RoPE-based models detokenize over 1 to 5 layers, while learned-absolute models take 5 to 10. Finally, we provide a probe for determining the success of the detokenization process based on early-layer activations alone, performing at 0.94-0.97 AUROC depending on the amount of context.

2606.09403 2026-06-09 cs.CL New submission

Introducing multiplex semantic networks as multifaceted representations of creative associative knowledge across multilingual samples

Edith Haim, Kurt Haim, Roger E. Beaty, Cynthia S. Q. Siew, Massimo Stella

Affiliations: CogNosco Lab, Department of Psychology and Cognitive Science, University of Trento; Department of Science Education, University of Education Upper Austria; Department of Psychology, The Pennsylvania State University; Department of Psychology, National University of Singapore

AI summary: The study models the associative knowledge underlying creativity more comprehensively with multiplex semantic networks built from six cognitive tasks, and uses machine learning to predict individual creativity scores, showing that multiplex networks are more effective than single tasks.

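Neural Induction of Finite-State Transducers

Michael Ginn, Alexis Palmer, Mans Hulden

Affiliations: University of Colorado; New College of Florida

AI summary: Proposes a method that automatically constructs unweighted finite-state transducers from the hidden-state geometry of recurrent neural networks, outperforming traditional algorithms by up to 87% in accuracy on tasks such as morphological inflection and phoneme conversion.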

Comments 15 pages, 8 figures, accepted to ACL 2026 Findings

2606.07529 2026-06-09 cs.CL cs.AI cs.CV cs.LG cs.MM New submission

CAPruner: Conceptual-Adjacent Scene Graph Pruner for Enhancing 3D Spatial Reasoning of Large Language Models

Shengli Zhou, Xiangchen Wang, Guanhua Chen, Feng Zheng

Affiliations: Southern University of Science and Technology; SpatialTemporal AI

AI summary: Proposes the Conceptual-Adjacent Scene Graph Pruner (CAPruner), which estimates relation importance by fusing fuzzy semantic relevance with spatial proximity, selects critical relations in a task-specific context, avoids relation-level annotations, and substantially improves the spatial reasoning performance of LLMs on 3D vision-language tasks.

Comments Accepted by ACL 2026 Main Conference

English abstract

Large language models (LLMs) have recently been applied to 3D vision-language (3D-VL) tasks, which require spatial reasoning to identify target objects relative to anchors. Scene graphs are commonly employed to represent such relations, but reasoning over complete graphs incurs high token costs and computational inefficiencies, motivating the need for pruning. Existing pruning methods primarily rely on spatial proximity and often remove task-relevant relations, thereby undermining reliable spatial reasoning. To address these limitations, we derive a key requirement for scene graph pruning: preserving spatial relations that are most pertinent to the specific 3D-VL task. Guided by this insight, we propose the Conceptual-Adjacent Scene Graph Pruner (CAPruner). CAPruner integrates fuzzy semantic relevance with spatial proximity to estimate the importance of relations, enabling the selection of critical relations in a task-specific context. Moreover, to avoid costly relation-level annotations, CAPruner is trained by supervising the aggregated scores of each node's incident edges. Extensive experiments demonstrate that CAPruner effectively preserves relations essential for spatial reasoning, leading to substantial performance improvements of LLMs on 3D-VL tasks. Code is available at https://github.com/fz-zsl/CAPruner.

2606.07531 2026-06-09 cs.CL cs.AI New submission

mllm-shap: A Shapley Value Explainability Platform for Text-Audio Multimodal Large Language Models

Jakub Muszyński, Paweł Pozorski, Maria Ganzha

Affiliations: Warsaw University of Technology

AI summary: Proposes the mllm-shap framework, which extends Shapley Value explainability to text-audio multimodal LLMs through modality-aware coalition masking, multi-turn conversation tracking, and phonetic alignment-based token grouping that reduces the coalition space by 10x to 50x.

Comments Submitted to ACL2026

English abstract

We introduce mllm-shap, an open-source Python framework designed to extend Shapley Value (SV) explainability from text-only Large Language Models to Multimodal LLMs (MLLMs) processing joint text and audio inputs. While text-based attribution is well-studied, mllm-shap addresses three critical challenges unique to the multimodal regime: (1) Modality-aware coalition masking, which manages the interleaved processing of discrete text tokens and dense audio encoder frames. (2) Multi-turn conversation tracking, utilizing per-token metadata to maintain role and modality context. (3) Phonetic alignment-based token grouping, a novel technique that reduces the coalition space by 10x to 50x, rendering SV estimation computationally feasible for long-form audio. The platform implements five SV estimation strategies, including a Complementary Contributions (CC) estimator with Neyman-optimal allocation that demonstrates superior convergence over standard Monte Carlo baselines. mllm-shap is provided as a pip-installable package featuring an interactive web-based GUI for granular attribution visualization. To our knowledge, this is the first publicly available framework providing a complete, reproducible pipeline for SV-based explainability in text-audio MLLMs.
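
As a rough illustration of why grouping helps, the sketch below estimates Shapley values by Monte Carlo permutation sampling over grouped players rather than raw frames. It does not use the mllm-shap API: the function `shapley_permutation`, the group names, and the additive toy payoff are all hypothetical stand-ins for a model score computed under masking.

```python
import random


def shapley_permutation(players, value_fn, num_permutations=2000, seed=0):
    """Monte Carlo permutation estimate of Shapley values.

    `value_fn` maps a frozenset of players to a real-valued payoff.
    """
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            totals[p] += cur - prev  # marginal contribution of p
            prev = cur
    return {p: t / num_permutations for p, t in totals.items()}


# Hypothetical grouping: six audio frames collapse into two word-aligned
# groups, so coalitions range over 3 players (2**3 = 8 subsets) instead
# of 7 raw features (2**7 = 128 subsets).
groups = {
    "text:hello": ["t0"],
    "audio:word1": ["a0", "a1", "a2"],
    "audio:word2": ["a3", "a4", "a5"],
}

# Toy additive payoff (assumed values) standing in for a masked model score.
weights = {"text:hello": 0.5, "audio:word1": 0.3, "audio:word2": 0.2}


def toy_value(coalition):
    return sum(weights[g] for g in coalition)


phi = shapley_permutation(list(groups), toy_value)
```

For an additive payoff like this one, each player's Shapley value equals its own weight, which makes the sketch easy to sanity-check before swapping in a real value function.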

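2606.07533 2026-06-09 cs.CL cs.AI cs.SD New submission

Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis

Paweł Pozorski, Jakub Muszyński, Maria Ganzha

Affiliations: arXiv

AI summary: Proposes a multimodal Shapley Value framework that, combined with the Spectrogram-Guided Phonetic Alignment (SGPA) preprocessing method, attributes model behavior to text and audio features, and releases an open-source Python package and visualization GUI.

Comments Bachelor's thesis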

English abstract

Multimodal Large Language Models (MLLMs) effectively integrate text and audio to interpret context in complex interactive dialogues. However, the internal mechanisms by which heterogeneous modalities influence model behavior remain opaque. While Shapley Values (SV) provide a robust, model-agnostic framework for local explainability in text-based NLP, their extension to multimodal data is hindered by cross-channel dependencies, intricate dialogue structures, and the prohibitive computational complexity of dense audio representations. In this work, we formalize a multimodal extension of the Shapley Value framework, treating discrete text tokens and aligned audio segments as cooperative features. To ensure computational feasibility, we deploy a suite of efficient estimation strategies: exact SV computation for low-dimensional inputs and sampling-based approximations - including Monte Carlo permutations and stratified sampling with Neyman-optimal allocation - to minimize variance under constrained computational budgets. To resolve the granularity mismatch between modalities, we propose Spectrogram-Guided Phonetic Alignment (SGPA), a novel preprocessing method that maps high-frequency audio streams to interpretable, word-aligned segments. Our contribution is twofold: first, we provide an open-source, model-agnostic Python package and a companion GUI for the computation and interactive visualization of multimodal attributions. Second, we evaluate our framework using curated subsets of the VoiceBench and Infinity Instruct datasets across diverse multilingual scenarios. Our experimental results reveal that input modality is a primary driver of attribution volatility and demonstrate that standard syntactic importance proxies often fail to predict model attention in multimodal, cross-lingual contexts.
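
Neyman-optimal allocation, mentioned in this abstract, assigns a fixed sampling budget to strata in proportion to stratum size times stratum standard deviation. The sketch below shows only that textbook rule. How the thesis defines its strata is not stated in the abstract, so the stratum sizes, standard deviations, and the function name `neyman_allocation` are illustrative assumptions.

```python
def neyman_allocation(stratum_sizes, stratum_stds, budget):
    """Split `budget` samples across strata in proportion to N_h * sigma_h."""
    weights = [n * s for n, s in zip(stratum_sizes, stratum_stds)]
    total = sum(weights)
    if total == 0:
        # No variance observed anywhere: fall back to an even split
        # (any remainder of the integer division is left unassigned).
        return [budget // len(weights)] * len(weights)
    raw = [budget * w / total for w in weights]
    alloc = [int(r) for r in raw]
    # Hand out the samples lost to truncation, largest remainder first.
    by_remainder = sorted(
        range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True
    )
    for i in by_remainder[: budget - sum(alloc)]:
        alloc[i] += 1
    return alloc


# Assumed toy strata: sizes 10, 40, 80 with standard deviations 0.5, 0.25, 0.125.
allocation = neyman_allocation([10, 40, 80], [0.5, 0.25, 0.125], budget=100)
print(allocation)  # [20, 40, 40]
```

A small, high-variance stratum can receive as many samples as a much larger, low-variance one, which is how the rule reduces estimator variance under a constrained budget.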

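2606.08056 2026-06-09 cs.CL cs.AI New submission

What's the Point? Spatial Grammar & Index Resolution for Sign Language Processing

Oline Ranum, Simon Hadfield, Richard Bowden

Affiliations: Centre for Vision, Speech and Signal Processing, University of Surrey

AI summary: Targeting spatial indexing, which makes up 10-15% of signing content but is largely overlooked, the paper proposes a framework that decomposes spatial reference resolution into index detection and discourse entity linking, establishes a baseline for index-aware sign language modeling, and serves as an auxiliary expert that augments a frozen SLR model.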

English abstract

Sign language models are predominantly trained with gloss-sequence or text supervision, thereby under-modeling non-lexical and productive constructions. One comparatively tractable instance is spatial indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We present a targeted evaluation of indexing in Sign Language Recognition, showing that despite comprising 10-15% of signing content, indexing is poorly recovered. We introduce a framework for training and evaluating indexing experts, establishing a baseline for index-aware sign language modeling. Our approach decomposes spatial reference resolution into index detection and discourse entity linking. The resulting mention representations enable automatic annotation and non-lexical structure modeling, and serve as an auxiliary indexing expert that augments a frozen SLR model at inference time.

2606.08081 2026-06-09 cs.CL cs.AI New submission

Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

Po-Ya Angela Wang, Chinmaya Mishra, Aslı Özyürek, Paula Rubio-Fernández, Esam Ghaleb

Affiliations: National Taiwan University; Max Planck Institute for Psycholinguistics; Radboud University; Institut Jean Nicod

AI summary: Using a constrained pseudo-dyad baseline, the paper distinguishes whether label alignment between multimodal LLM agents in reference games stems from partner-specific interaction or from a shared task vocabulary, and finds that agents coordinate through verbose descriptions rather than compressed expressions.

English abstract

Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.

2606.08394 2026-06-09 cs.CL 新提交

文本就是一切？文本作为语音大语言模型的通用信息瓶颈

Ming-Hao Hsu, Yuxuan Hu, Shujie Liu, Jinyu Li, Yan Lu, Zhizheng Wu

发表机构 * The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Microsoft Corporation（微软公司）； Microsoft Research Asia（微软亚洲研究院）

AI总结提出Convex Gate（C-Gate）桥接语音与LLM，通过凸包约束将语音表示限制在LLM输入嵌入流形内，在ASR和情感识别上取得联合最优性能，并揭示几何结构而非离散性是关键设计因素。

AI中文摘要

大型语言模型（LLM）为语音理解提供了强大的推理骨干，但将连续声学信号集成到冻结的LLM中仍然具有挑战性。现有的语音到LLM接口通常处于两个极端：要么强制近乎离散的令牌对齐，这有利于转录但丢失副语言信息；要么学习无约束的连续表示，这可能会偏离LLM的输入空间并降低自回归解码性能。在这项工作中，我们提出了Convex Gate（C-Gate），一种语音到LLM的桥接方法，通过架构凸包约束将所有语音表示限制在LLM的输入嵌入流形内。具体而言，每一帧被表示为令牌嵌入的凸组合，确保与预训练LLM的兼容性，同时保持连续表达能力。在自动语音识别（ASR）和情感识别任务中，C-Gate实现了强大的联合性能，在LibriSpeech上相对词错误率（WER）降低高达48.7%，同时匹配或超过单任务情感识别准确率。除了性能之外，我们的分析揭示了一个关键见解：信息不是由离散令牌身份携带，而是由嵌入空间中时间分辨的轨迹携带。因果干预证实，轨迹结构和与预训练嵌入流形的对齐对性能都至关重要。这些结果表明，几何结构而非令牌离散性是语音到LLM接口的基本设计因素，并为研究冻结LLM中的多模态集成提供了一个受控机制。我们发布了检查点、每个样本的输出、机制转储和干预套件以供复现。

英文摘要

Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating continuous acoustic signals into a frozen LLM remains challenging. Existing speech-to-LLM interfaces typically operate at two extremes: either enforcing near-discrete token alignment, which benefits transcription but loses paralinguistic information, or learning unconstrained continuous representations, which can drift away from the LLM's input space and degrade autoregressive decoding. In this work, we propose Convex Gate (C-Gate), a speech-to-LLM bridge that constrains all speech representations to lie within the LLM's input embedding manifold with an architectural convex-hull constraint. Concretely, each frame is represented as a convex combination of token embeddings, ensuring compatibility with the pretrained LLM while preserving continuous expressivity. Across automatic speech recognition (ASR) and emotion recognition, C-Gate achieves strong joint performance, improving LibriSpeech WER by up to 48.7% relative while matching or exceeding single-task emotion accuracy. Beyond performance, our analysis reveals a key insight: information is not carried by discrete token identities, but by time-resolved trajectories in the embedding space. Causal interventions confirm that both the trajectory structure and alignment to the pretrained embedding manifold are critical for performance. These results suggest that geometry, rather than token discreteness, is the fundamental design factor in speech-to-LLM interfaces, and provide a controlled regime for studying multimodal integration in frozen LLMs. We release the checkpoint, per-sample outputs, mechanism dumps, and intervention suite for replication.
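
The convex-combination idea in the abstract can be illustrated with a minimal sketch. It assumes the per-frame weights come from a softmax over vocabulary scores, which is only one way to obtain non-negative weights that sum to one; the function names (`softmax`, `convex_gate_frame`), the toy embedding table, and the logits are hypothetical and are not taken from the authors' released code.

```python
import math

def softmax(logits):
    """Map arbitrary scores to non-negative weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def convex_gate_frame(frame_logits, token_embeddings):
    """Represent one speech frame as a convex combination of token embeddings.

    frame_logits: one score per vocabulary token (hypothetical encoder output).
    token_embeddings: list of embedding vectors, one per vocabulary token.
    """
    weights = softmax(frame_logits)
    dim = len(token_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, token_embeddings))
        for d in range(dim)
    ]

# Toy 3-token vocabulary with 2-dimensional embeddings (made-up values).
EMBEDDINGS = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frame = convex_gate_frame([2.0, 0.5, -1.0], EMBEDDINGS)
```

Because the weights are non-negative and sum to one, every frame vector stays inside the convex hull of the embedding table while still varying continuously, which is the property the abstract attributes to the architectural constraint.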

2606.09428 2026-06-09 cs.CL 新提交

Guide Me Out: A Framework to Benchmark VLM Operators Communication in Crisis Scenarios

引导我出去：危机场景下评估VLM操作员通信的框架

Giacomo Gonella, Stefano Menini, Marco Guerini

发表机构 * Fondazione Bruno Kessler（布鲁诺·凯斯勒基金会）； University of Trento（特伦托大学）

AI总结提出一个基准框架，评估视觉语言模型在模拟疏散中引导平民的策略（窄播 vs. 广播）、环境表示（视觉 vs. 图）和威胁行为（静态 vs. 移动），发现窄播降低失败率，视觉表示主导性能，移动威胁增加失败率。

AI中文摘要

有效的危机响应需要空间定位的通信，将平民的语言指导与物理环境联系起来，考虑结构瓶颈、不断变化的威胁和代理特定背景。然而，当前危机通信中的NLP研究主要局限于静态、纯文本分类设置，忽视了AI操作员在动态、具身场景中的关键通信作用。我们通过一个新的基准框架来解决这一差距，该框架用于评估视觉语言模型（VLM）在模拟疏散中引导平民代理的任务。我们测试了两种通信策略（窄播与广播）、两种环境表示（视觉与基于图）和两种威胁行为（静态与移动），跨越九张不同结构复杂度的地图。我们的结果表明，与广播相比，窄播在所有难度级别上持续降低平民失败率。指导质量很大程度上取决于VLM操作员如何表示世界：视觉模态驱动性能，而添加邻接图则依赖于模型且通常有害。移动威胁在所有条件下提高失败率，因为通信必须随时间持续适应。这些发现共同表明，将VLM作为AI操作员部署在疏散场景中仍然是一个非平凡挑战，其中通信策略和输入表示的选择可以直接决定干预的成功或失败。

英文摘要

Effective crisis response requires spatially grounded communication that bridges linguistic guidance of civilians with the physical environment, accounting for structural bottlenecks, evolving threats, and agent-specific contexts. Yet, current NLP research in crisis communication remains mainly limited to static, text-only classification settings, overlooking the critical communicative role of AI operators in dynamic, embodied scenarios. We address this gap with a novel benchmarking framework for evaluating Vision-Language Models (VLMs) tasked with guiding civilian agents through simulated evacuations. We test two communication strategies (narrowcast vs. broadcast), two environment representations (visual vs. graph-based), and two threat behaviors (static vs. moving) across nine maps of varying structural complexity. Our results show that Narrowcast consistently reduces civilian Fail rates compared to Broadcast across all difficulty levels. Guidance quality depends heavily on how the VLM operator represents the world: the visual modality drives performance, while adding an adjacency graph is model-dependent and often harmful. Moving threats raise Fail rates across all conditions as communication must continuously adapt over time. Together, these findings show that deploying VLMs as AI operators in evacuation scenarios remains a non-trivial challenge, where the choice of communication strategy and input representation can directly determine the success or failure of the intervention.

2606.09644 2026-06-09 cs.CL cs.CV 新提交

Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

答案从何而来？面向自动驾驶的多视角MLLMs中视角级视觉证据识别基准

Yimu Wang, Yee Man Choi, Barry Zhang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

发表机构 * University of Waterloo（滑铁卢大学）

AI总结针对多视角自动驾驶场景，提出一个基准测试，评估多模态大模型在视觉问答中识别支持性相机视角的能力，包含122个冲突中心问题对，并区分视角选择与答案正确性。

AI中文摘要

多模态大语言模型（MLLMs）在视觉推理基准测试中取得了强劲结果，但仅凭答案准确性并不能表明模型是否依赖了正确的视觉证据。这一差距在自动驾驶的多视角驾驶场景中尤为重要，因为模型可能给出看似合理的答案，却将其建立在错误的相机视角之上。我们引入了一个多视角视觉问答基准，用于评估证据来源识别：给定六个同步的NuScenes视角和一个问题，模型必须识别支持性的相机视角并回答问题。该基准包含来自73个场景的122个以冲突为中心的问答对，涵盖因果关系、反事实推理和意图预测。视角标签由自动冲突挖掘流程提出，并由标注者人工验证。我们评估了三种设置：相机视角选择、给定黄金视角的Oracle问答，以及模型在一次推理中同时选择视角并作答的联合预测。答案以多项选择和自由形式两种格式进行评估，对结构化预测使用精确匹配，对自由形式回答使用LLM评判器。通过明确区分视觉来源识别与答案正确性，该基准揭示了仅评估答案时无法发现的视觉定位（grounding）失败。

英文摘要

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized NuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.
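
The separation of view selection from answer correctness can be sketched as a small scoring routine. This is an illustrative sketch only: the record field names (`pred_view`, `gold_view`, `pred_answer`, `gold_answer`), the metric names, and the toy records are assumptions, and exact match is used for both fields (the abstract uses an LLM judge for free-form answers, which is not modeled here).

```python
def evaluate(records):
    """Score view selection and answering separately, then jointly."""
    n = len(records)
    view_ok = [r["pred_view"] == r["gold_view"] for r in records]
    ans_ok = [r["pred_answer"] == r["gold_answer"] for r in records]
    return {
        "view_acc": sum(view_ok) / n,
        "answer_acc": sum(ans_ok) / n,
        # Both the supporting view and the answer are correct.
        "joint_acc": sum(v and a for v, a in zip(view_ok, ans_ok)) / n,
        # Correct answer attributed to the wrong camera view.
        "right_answer_wrong_view": sum(
            a and not v for v, a in zip(view_ok, ans_ok)
        ) / n,
    }

# Toy multiple-choice records (made-up values for illustration).
RECORDS = [
    {"pred_view": "CAM_FRONT", "gold_view": "CAM_FRONT", "pred_answer": "B", "gold_answer": "B"},
    {"pred_view": "CAM_BACK", "gold_view": "CAM_FRONT_LEFT", "pred_answer": "A", "gold_answer": "A"},
    {"pred_view": "CAM_BACK", "gold_view": "CAM_BACK", "pred_answer": "C", "gold_answer": "D"},
    {"pred_view": "CAM_FRONT", "gold_view": "CAM_BACK_RIGHT", "pred_answer": "D", "gold_answer": "D"},
]
scores = evaluate(RECORDS)
```

In this toy set the answer accuracy (0.75) is well above the joint accuracy (0.25), the kind of gap that answer-only evaluation would hide.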

2606.07647 2026-06-09 cs.CV cs.CL cs.LG 交叉投稿

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

关键位置引导：基于令牌级视觉敏感度引导的LVLMs幻觉缓解

Ruipeng Zhang, Zhihao Li, C. L. Philip Chen, Tong Zhang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出令牌级视觉敏感度引导（TLVS）方法，通过提取令牌级引导向量并自适应调整引导强度，仅在关键解码步骤抑制幻觉，在多个基准上优于现有方法。

AI中文摘要

大型视觉语言模型（LVLMs）取得了快速进展并部署在各种应用中，但幻觉仍然是一个主要挑战。激活引导因其训练开销小和推理时可控制而具有吸引力。然而，我们发现，在自回归解码过程中，视觉条件对令牌预测的影响是稀疏且局部的，许多现有方法对整个序列的图像与非图像差异进行平均，稀释了这些关键信号，导致引导方向信噪比低。此外，许多现有方法应用固定的引导强度，错误分配干预预算，过度扰动非关键令牌，并可能导致不稳定。为了解决这些限制，我们提出了令牌级视觉敏感度引导（TLVS）用于幻觉缓解。我们的方法首先提取令牌级引导向量并进行细化，然后仅在关键位置应用细粒度的、视觉敏感度自适应的引导。这种轻量级、即插即用的机制只需要最少的校准训练，可以应用于各种视觉语言模型。它在每个解码步骤调节引导强度，选择性地抑制易产生幻觉的片段，同时保留基于证据的内容。我们在多个基准上评估TLVS，包括POPE、AMBER、CHAIR（COCO）、MMHal和HallusionBench，证明其相对于先前引导方法的一致改进。

英文摘要

Large vision language models (LVLMs) have made rapid advancements and are deployed across various applications, yet hallucinations remain a major challenge. Activation steering is appealing due to its minimal training overhead and controllability at inference time. However, we found that during autoregressive decoding, visual conditioning affects token prediction sparsely and locally across decoding steps, and many existing methods that average image-versus-no-image differences over the entire sequence dilute these critical signals, yielding low signal-to-noise ratio steering directions. Additionally, many existing methods apply a fixed steering strength, which misallocates the intervention budget, over-perturbs non-critical tokens, and can cause instability. To address these limitations, we propose Token-Level Visual-Sensitivity Steering (TLVS) for hallucination mitigation. Our approach first extracts token-level steering vectors and refines them, and then applies fine-grained, visual-sensitivity-adaptive steering only where it matters. This lightweight, plug-and-play mechanism requires only minimal training for calibration and can be applied across diverse vision-language models. It modulates the steering strength at each decoding step, selectively suppressing hallucination-prone spans while preserving evidence-grounded content. We evaluate TLVS on several benchmarks, including POPE, AMBER, CHAIR (COCO), MMHal, and HallusionBench, demonstrating consistent improvements over previous steering methods.

2606.07985 2026-06-09 cs.CV cs.CL 交叉投稿

FMRFusion: Frequency-Aware Multi-View Representation Learning for Heterogeneous Image Fusion

FMRFusion: 面向异质图像融合的频率感知多视图表示学习

Tao Zhou, Yunlong Liu, Qinghui Chen, Zekai Zhang, Minlong Sun, Changlin Bian, Dagang Li, Wenmin Wang, Jinglin Zhang

发表机构 * Shandong University（山东大学）； Macau University of Science and Technology（澳门科技大学）

AI总结提出FMRFusion网络，通过多尺度结构感知模块、双线性频率分解和跨视图互补交互，结合流匹配优化，实现红外与可见光图像融合，在夜间场景表现优异。

AI中文摘要

红外与可见光图像融合旨在生成保留重要目标信息和详细纹理的复合图像，整合两种异质模态。以往的图像融合方法通常采用单模块堆叠方式从两种模态中提取特征，然而这些方法可能导致对其独特特征的学习不完整，从而限制融合效果并在真实异质数据场景中降低鲁棒性。为解决这些问题，我们提出FMRFusion，一种用于异质图像融合的频率感知多视图表示学习网络。引入多尺度结构感知模块以有效捕捉判别性结构，提取细粒度局部结构和关键上下文信息。采用双线性频率分解机制将特征分离为高频和低频分量，实现不同频率域中局部细节和全局表示的联合建模。此外，融入跨视图互补交互以显式建模和融合反射光信息与辐射强度响应之间的互补特性，促进有效的跨视图交互。我们通过流匹配进一步改善融合结果的质量，通过学习从粗数据到高质量表示的变换逐步细化融合特征。在多个基准数据集上进行的大量实验表明，FMRFusion在一系列融合任务中实现了优越且一致的性能，尤其在夜间场景中表现突出。

英文摘要

Infrared and visible image fusion aims to generate a composite image that retains significant target information and preserves detailed textures, integrating two heterogeneous modalities. Previous image fusion methods typically adopt a single-module stacking approach to extract features from the two modalities. However, these approaches may result in incomplete learning of their distinct characteristics, thereby limiting fusion effectiveness and constraining robustness in real-world heterogeneous data scenarios. To address these challenges, we propose FMRFusion, a frequency-aware multi-view representation learning network for heterogeneous image fusion. A Multi-Scale Structural Perception Module is introduced to effectively capture discriminative structures, extracting fine-grained local structures and essential contextual information. A bilinear frequency decomposition mechanism is employed to separate features into high-frequency and low-frequency components, enabling joint modeling of local details and global representations across different frequency domains. Moreover, a Cross-View Complementary Interaction is incorporated to explicitly model and fuse the complementary characteristics between reflected light information and radiative intensity responses, facilitating effective cross-view interaction. We further improve the performance of the fused results by flow matching, which progressively refines the fused features by learning the transformation from coarse data to high-quality representations. Extensive experiments conducted on multiple benchmark datasets demonstrate that FMRFusion achieves superior and consistent performance across a range of fusion tasks, especially in nighttime scenarios.

2606.08063 2026-06-09 cs.CV cs.AI cs.CL 交叉投稿

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Robust-U1: MLLMs能否自我恢复受损视觉内容以实现鲁棒理解？

Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出Robust-U1框架，通过监督微调、强化学习和多模态推理，使多模态大模型具备显式视觉自恢复能力，在真实和对抗性损坏下达到最先进鲁棒性。

Comments Accepted by ICML 2026

AI中文摘要

多模态大语言模型（MLLMs）在视觉理解方面取得了显著成功，但在真实世界的视觉损坏下其性能会大幅下降。尽管存在现有的鲁棒性增强方法，但它们存在局限性：黑盒特征对齐缺乏可解释性，而白盒基于文本的推理无法恢复丢失的像素级细节。本文研究一个基本研究问题：MLLMs能否自行恢复受损的视觉内容？为此，我们提出Robust-U1，一种新颖框架，赋予MLLMs显式的视觉自恢复能力以实现鲁棒理解。该方法包含三个核心阶段：用于初始重建的监督微调、具有双重奖励（像素级SSIM和语义级CLIP相似度）的强化学习以对齐高视觉质量，以及联合考虑受损输入和恢复图像的多模态推理。大量实验表明，Robust-U1在真实世界损坏基准上达到了最先进的鲁棒性，并在一般VQA基准上的对抗性损坏下保持了优越性能。分析证实，高质量的视觉恢复直接提升了推理性能，将自恢复确立为鲁棒视觉理解的关键机制。源代码可在https://github.com/jqtangust/Robust-U1获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. While existing robustness enhancement approaches exist, they are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.

2606.08615 2026-06-09 cs.CV cs.CL 交叉投稿

Harnessing Streaming Video in the Wild

利用野外流式视频

Dingyu Yao, Shuhuan Gu, Qingyi Si, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Naibin Gu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络空间安全学院）； JD.COM（京东）

AI总结提出Streaming Harness系统，通过Streaming-Train-248K数据集和训练目标，使视觉语言模型具备主动交互、长期记忆和实时处理能力，并构建Streaming-Eval基准评估流式视频理解。

AI中文摘要

视觉语言模型（VLM）在视频通话助手、实时评论和具身机器人等应用中越来越需要处理无界视频流。理想的流式系统应支持主动交互、长期记忆和实时处理，同时基于能够处理各种野外流式任务的VLM骨干。然而，现有VLM在离线视频理解方面表现出色，但在流式能力上有所欠缺，并且缺乏用于流式部署的专用基础设施。我们在三个方面解决这一差距。(i) 对于骨干能力，我们构建了Streaming-Train-248K，一个流式数据集，配以新颖的训练目标，用于使VLM适应流式交互和理解。(ii) 对于实际部署，我们引入了Streaming Harness，一个即插即用系统，赋予任何VLM三种核心能力：主动交互（每秒响应决策）、长期记忆（12小时上下文保留）和实时处理（亚秒级延迟）。(iii) 为了推动社区在流式能力方面的持续进步，我们设计了Streaming-Eval，一个反映模型在各种野外场景中能力的基准。大量实验表明，我们的方法在流式视频理解所需的所有核心能力上均取得了一致的提升。我们将开源我们的数据、代码和基准，以推动社区从离线视频理解向可部署的流式智能的转变。

英文摘要

Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.

2606.08894 2026-06-09 cs.CV cs.CL 交叉投稿

Hummus：幽默多模态隐喻使用数据集

Xiaoyu Tong, Zhi Zhang, Pia Sommerauer, Martha Lewis, Ekaterina Shutova

发表机构 * ILLC, University of Amsterdam, the Netherlands（阿姆斯特丹大学逻辑、语言与计算研究所（ILLC），荷兰）； Vrije Universiteit Amsterdam, the Netherlands（阿姆斯特丹自由大学，荷兰）

AI总结提出幽默多模态隐喻数据集Hummus，基于不一致理论和概念隐喻理论设计标注方案，测试多模态大语言模型在检测和理解幽默多模态隐喻上的表现，发现现有模型仍存在困难。

AI中文摘要

隐喻和幽默有许多共同点，隐喻是最常见的幽默机制之一。本研究关注多模态隐喻的幽默能力，该领域尚未得到足够关注。我们从幽默的不一致理论、概念隐喻理论以及VU阿姆斯特丹隐喻语料库的标注方案中汲取灵感，开发了一种新的用于图像-标题对中幽默多模态隐喻使用的标注方案。我们创建了幽默多模态隐喻使用数据集Hummus，提供了从《纽约客》标题竞赛语料库中抽取的1000个图像-标题对的专家标注。利用该数据集，我们测试了最先进的多模态大语言模型（MLLMs）在检测和理解幽默多模态隐喻使用方面的能力。实验表明，当前MLLMs在处理幽默多模态隐喻时仍然存在困难，特别是在整合视觉和文本信息方面。我们在 github.com/xiaoyuisrain/humorous-multimodal-metaphor-use 发布数据集和代码。

英文摘要

Metaphor and humor share a lot of common ground, and metaphor is one of the most common humorous mechanisms. This study focuses on the humorous capacity of multimodal metaphors, which has not received due attention in the community. We take inspiration from the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus, and developed a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs. We create the Hummus Dataset of Humorous Multimodal Metaphor Use, providing expert annotation on 1k image-caption pairs sampled from the New Yorker Caption Contest corpus. Using the dataset, we test state-of-the-art multimodal large language models (MLLMs) on their ability to detect and understand humorous multimodal metaphor use. Our experiments show that current MLLMs still struggle with processing humorous multimodal metaphors, particularly with regard to integrating visual and textual information. We release our dataset and code at github.com/xiaoyuisrain/humorous-multimodal-metaphor-use.

2601.12263 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

多模态生成式引擎优化：针对视觉-语言模型排序器的排名操纵

Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu

发表机构 * Georgetown University（乔治城大学）； University of Southern California（南加州大学）； University of Maryland, College Park（马里兰大学学院公园分校）； Arizona State University（亚利桑那州立大学）

AI总结提出多模态生成式引擎优化（MGEO）方法，通过联合优化图像扰动和文本后缀，利用视觉-语言模型内部跨模态知识耦合，实现对产品排名的有效操纵，揭示了多模态基础模型知识基础的脆弱性。

Comments Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM) at ACL 2026

详情

AI中文摘要

视觉-语言模型（VLM）将视觉和文本知识整合到统一表示中，日益成为现代检索和推荐系统的基础。然而，这些模型在对多模态项目进行排序时如何可靠地利用其跨模态知识，以及其知识基础是否可以被颠覆，仍不清楚。在本文中，我们揭示了VLM在多模态产品排序中应用知识的一个基本漏洞：通过多模态生成式引擎优化（MGEO），我们展示了攻击者可以通过联合制作难以察觉的图像扰动和流畅的文本后缀，利用模型内部的跨模态知识耦合，操纵VLM的排序决策。MGEO采用交替优化策略，针对VLM中视觉和语言表示之间的深层交互，实现了远超单模态攻击和由强大商业模型驱动的启发式基线的排名操纵。我们的发现表明，表面内容质量不足以提升排名；相反，需要直接与模型内部知识利用机制对齐。这些结果对多模态基础模型中知识基础的忠实性和鲁棒性提出了重要问题，并激励了未来多模态检索系统防御机制的研究。代码见：https://github.com/glad-lab/MGEO

英文摘要

Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these models utilize their cross-modal knowledge when ranking multimodal items, and whether their knowledge grounding can be subverted. In this paper, we expose a fundamental vulnerability in how VLMs apply multimodal knowledge for product ranking: through Multimodal Generative Engine Optimization (MGEO), we show that an adversary can manipulate a VLM's ranking decisions by jointly crafting imperceptible image perturbations and fluent textual suffixes that exploit the model's internal cross-modal knowledge coupling. Using an alternating optimization strategy, MGEO targets the deep interactions between visual and linguistic representations within the VLM, achieving rank manipulations that substantially exceed those of unimodal attacks and heuristic baselines powered by strong commercial models. Our findings reveal that surface-level content quality is insufficient for rank promotion; instead, direct alignment with the model's internal knowledge utilization mechanism is required. These results raise important questions on the faithfulness and robustness of knowledge grounding in multimodal foundation models, and motivate future work on defense mechanisms for multimodal retrieval systems. Code is available at: https://github.com/glad-lab/MGEO

2602.22766 2026-06-09 cs.CL 版本更新

Imagination Helps Visual Reasoning, But Not Yet in Latent Space

想象力有助于视觉推理，但尚未在潜在空间中实现

You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, Maosong Sun

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结通过因果中介分析发现，多模态大语言模型中的潜在推理存在输入-潜在和潜在-答案两个关键断连，表明其有效性有限，并提出显式想象方法CapImagine，在视觉推理任务中表现更优。

Comments ICML 2026 Poster

详情

AI中文摘要

潜在视觉推理旨在通过在多模态大语言模型的隐藏状态中进行冥想，模仿人类的想象过程。虽然被认为是视觉推理的一种有前景的范式，但其有效性的潜在机制仍不清楚。为了揭示其功效的真正来源，我们使用因果中介分析来研究潜在推理的有效性。我们将该过程建模为因果链：输入作为处理变量，潜在标记作为中介变量，最终答案作为结果变量。我们的发现揭示了两个关键的断连：(a) 输入-潜在断连：对输入进行剧烈扰动导致潜在标记的变化可以忽略不计，表明潜在标记未能有效关注输入序列。(b) 潜在-答案断连：对潜在标记的扰动对最终答案的影响极小，表明潜在标记对结果施加的因果效应有限。此外，广泛的探测分析显示，潜在标记编码的视觉信息有限且表现出高度相似性。因此，我们质疑潜在推理的必要性，并提出了一种简单的替代方法CapImagine，该方法教会模型使用文本进行显式想象。在视觉中心基准上的实验表明，CapImagine显著优于复杂的潜在空间基线，突显了通过显式想象进行视觉推理的优越潜力。

英文摘要

Latent visual reasoning aims to mimic the human imagination process by meditating through the hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating that latent tokens exert only a limited causal effect on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.

2604.18347 2026-06-09 cs.CL cs.AI 版本更新

Multilingual Training and Evaluation Resources for Vision-Language Models

面向视觉语言模型的多语言训练和评估资源

Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini, Andrea Zugarini

发表机构 * Villanova.ai ； Aithlas

AI总结本文提出跨五种欧洲语言的视觉语言模型训练与评估资源，通过再生与翻译方法生成高质量多语言数据，验证多语言数据在非英语基准上的有效性。

AI中文摘要

视觉语言模型（VLM）近年来取得了快速进展。然而，尽管发展迅速，VLM的开发仍严重依赖英语，由此带来两个主要限制：（i）缺乏用于训练的多语言多模态数据集，（ii）缺乏跨语言的全面评估基准。本文引入一套覆盖五种欧洲语言（英语、法语、德语、意大利语和西班牙语）的全新综合资源，用于VLM的训练与评估，以填补这些空白。我们采用再生-翻译范式，结合精心筛选的合成生成与人工标注，生成高质量的跨语言资源。具体而言，我们构建了训练语料库Multi-PixMo：使用宽松许可的模型，对PixMo现有数据集（PixMo-Cap、PixMo-AskModelAnything和CoSyn-400k）中的示例进行再生。在评估方面，我们翻译了广泛使用的英语数据集（MMBench、ScienceQA、MME、POPE、AI2D），构建了一组多语言基准。我们通过定性和定量的人工分析评估这些资源的质量，并测量标注者间一致性。此外，我们进行了消融研究，以展示在VLM训练中多语言数据相对于仅英语数据的影响。在三种不同模型上的实验表明，使用多语言、多模态示例训练VLM在非英语基准上始终有益，对英语也有正向迁移。

英文摘要

Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development remains heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets (PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k) with permissively licensed models. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMBench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, relative to English-only data, on VLM training. Experiments with three different models show that training VLMs on multilingual, multimodal examples is consistently beneficial on non-English benchmarks, with positive transfer to English as well.

2605.08384 2026-06-09 cs.CL 版本更新

jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

jina-embeddings-v5-omni: 通过锁定对齐塔实现几何保持嵌入

Florian Hönicke, Michael Günther, Andreas Koukounas, Mohammad Kalim Akram, Scott Martens, Saba Sturua, Han Xiao

发表机构 * Jina by Elastic（Elastic 旗下 Jina）

AI总结本文提出GELATO方法，通过冻结对齐塔实现多模态嵌入，生成统一语义空间，训练效率高且保持文本嵌入一致性。

Comments 11 pages, 9 figures, 5 tables

详情

AI中文摘要

在本文中，我们介绍GELATO（Geometry-preserving Embeddings via Locked Aligned TOwers，通过锁定对齐塔实现几何保持嵌入），一种构建多模态嵌入模型的新方法。我们基于VLM式架构：非文本编码器经过适配，为语言模型生成输入，再由语言模型为所有类型的输入生成嵌入。由此得到jina-embeddings-v5-omni套件，即一对将文本、图像、音频和视频输入编码到同一语义嵌入空间的模型。GELATO通过添加图像和音频编码器，扩展了两个Jina Embeddings v5 Text模型以支持更多模态。骨干文本嵌入模型和新增的非文本模态编码器均保持冻结，我们只训练连接组件，其权重仅占联合模型总权重的0.35%。因此，训练比全参数重新训练高效得多。此外，语言模型实际上保持不变，对文本输入生成的嵌入与Jina Embeddings v5 Text模型完全相同。我们的评估表明，GELATO的结果可与最先进方法相竞争，性能几乎与更大的多模态嵌入模型持平。

英文摘要

In this work, we introduce GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. GELATO extends the two Jina Embeddings v5 Text models to support additional modality by adding encoders for images and audio. The backbone text embedding models and the added non-text modality encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.

2605.30608 2026-06-09 cs.CL 版本更新

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

语义运动锚点：连接共语手势中的运动与意义

Varsha Suresh, Mohammad Mahdi Abootorabi, Mohamed Salman, M. Hamza Mughal, Christian Theobalt, Ashwin Ram, Jürgen Steimle, Vera Demberg

发表机构 * Saarland University（萨尔兰大学）； MPI for Informatics（马克斯·普朗克信息学研究所）； Saarland Informatics Campus（萨尔兰信息学园区）； University of British Columbia（不列颠哥伦比亚大学）； Vector Institute（向量研究所）； Zuse School（祖斯学校）

AI总结提出语义运动锚点方法，通过将3D手势离散化为身体-手部运动基元并转化为结构化描述，在文本与运动之间建立辅助对比监督，显著提升共语手势检索的语义相关性。

详情

AI中文摘要

学习口语文本与手势之间的共享表示是共语手势检索、合成和理解的核心，但对于语义上有意义的手势仍然具有挑战性，因为其交际意图无法仅通过运动捕捉。转录文本与连续运动嵌入之间的直接对比对齐往往过度强调低级运动学，而忽略了语义手势的符号内容。我们提出语义运动锚点，即手势运动的自然语言抽象，捕捉物理形式和交际意图。我们的方法将3D手势离散化为身体-手部运动基元，将其转化为结构化描述，并将其嵌入转录文本中以提供辅助对比监督。在BEAT2上，我们的方法在文本到手势的R@1上比直接文本-运动基线提高了8.2%，并在文本到手势和手势到文本检索方向上优于先前的检索方法。除了总体检索指标外，语义运动锚点监督有助于检索对口语查询具有语义意义的手势，而不是默认使用通用运动模式。一项下游检索增强手势生成研究表明，用户显著偏好我们方法检索的手势，而非检索增强生成基线，表明语义基础的检索转化为在下游生成中更好传达交际意图的手势。

英文摘要

Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors, natural-language abstractions of gesture motion capturing physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches on text to gesture and gesture to text retrieval directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns. A downstream retrieval-augmented gesture generation study showed that users significantly preferred gestures retrieved by our approach over a retrieval-augmented generation baseline, demonstrating that semantically grounded retrieval translates to gestures that better convey communicative intent in downstream generation.
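
For readers unfamiliar with the R@1 metric reported above, the following is a minimal sketch of how text-to-gesture Recall@1 could be computed from paired embeddings. It assumes cosine similarity as the ranking score and that text `i` is paired with gesture `i`; the function names and toy vectors are hypothetical and do not reproduce the authors' evaluation code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_1(text_embs, gesture_embs):
    """Fraction of text queries whose top-ranked gesture is the paired one.

    Assumes text_embs[i] and gesture_embs[i] form a ground-truth pair.
    """
    hits = 0
    for i, text in enumerate(text_embs):
        sims = [cosine(text, gesture) for gesture in gesture_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(text_embs)

# Toy embeddings (made-up values): the third query retrieves the wrong gesture.
TEXTS = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
GESTURES = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
r_at_1 = recall_at_1(TEXTS, GESTURES)
```

Gesture-to-text retrieval can be scored the same way by swapping the two arguments.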

2606.03371 2026-06-09 cs.CL 版本更新

See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence

观察、推断、干预：面向目标导向社交智能的主动世界建模

Honghui Zhang, Chenmeinian Guo, Yichen Yu, Guanyu Liu, Yujia Zhang, Yongming Qin, Chongguo Song, Mengyue Yang, Lei Yu, Tianyu Shi

发表机构 * Mita Technology（Mita技术公司）； University of Bristol（布里斯托大学）； University of Toronto（多伦多大学）； McGill University（麦吉尔大学）

AI总结提出 See-Infer-Intervene (SII) 框架和主动意图世界模型 (PIWM)，通过观察顾客行为、推断潜在意图并选择干预动作，实现零售场景中的主动辅助；在以真实顾客状态为条件时，于 GuidanceSalesBench 基准上达到 0.641 macro F1。

Comments 16 pages, 3 figures, 9 tables. Preprint

详情

AI中文摘要

多模态零售智能体不仅应识别顾客正在做什么，还应在顾客明确提出请求之前决定是否以及如何提供帮助。我们通过 See-Infer-Intervene (SII) 框架研究这一场景：设备必须观察交互前行为、推断顾客的潜在意图，并通过选择适当的服务干预或选择等待来采取行动。我们用主动意图世界模型 (PIWM) 实例化 SII，该模型以 AIDA（注意、兴趣、欲望、行动）购买阶段和 BDI（信念、欲望、意图）心理字段表示顾客状态，预测以动作为条件的意图转移，并从五类响应中进行选择：问候（Greet）、询问（Elicit）、告知（Inform）、推荐（Recommend）和等待（Hold）。我们进一步构建了智能零售基准 GuidanceSalesBench，其中包含状态清单、交互前视频、候选响应、以动作为条件的结果和最佳动作标签。在以真实顾客状态为条件、从而单独考察动作选择时，PIWM 在 30 个留出的目标视频上达到 0.641 macro F1，优于零样本 Qwen2.5-VL-7B 基线以及没有平衡动作监督的训练变体；端到端仅视频的选择则降至 0.295，低于 5 类平衡随机基线的 0.414，这表明从视频到状态的对应（grounding）是部署阶段的主要瓶颈。一项初步的、经过编排的真实门店试点（由付费参与者按脚本表演顾客行为并录制）在 20 个完整标注的视频上达到 0.579 的动作 macro F1，另外发布了 10 个带索引级标签的可访问视频。

英文摘要

Multimodal retail agents should not only recognize what a customer is doing, but also decide whether and how to assist before an explicit request is made. We study this setting through the See-Infer-Intervene (SII) framework, where a device must see pre-interaction behavior, infer latent customer intent, and act by selecting an appropriate service intervention or choosing to wait. We instantiate SII with the Proactive Intent World Model (PIWM), which represents customer state with AIDA (Attention, Interest, Desire, Action) purchasing phases and BDI (belief, desire, intention) psychological fields, predicts action-conditioned intent transitions, and selects from five response classes: Greet, Elicit, Inform, Recommend, and Hold. We further construct GuidanceSalesBench, a smart-retail benchmark containing state manifests, pre-interaction videos, candidate responses, action-conditioned outcomes, and best-action labels. When conditioned on ground-truth customer state to isolate action selection, PIWM achieves 0.641 macro F1 on 30 held-out target videos, outperforming a zero-shot Qwen2.5-VL-7B baseline and training variants without balanced action supervision; end-to-end video-only selection drops to 0.295, below the 5-class balanced random baseline of 0.414, identifying video-to-state grounding as the dominant deployment-time bottleneck. A preliminary staged real-store pilot (recorded with paid participants performing scripted customer behaviors) reaches 0.579 action macro F1 on 20 fully annotated videos, with 10 additional accessible videos released with index-level labels.
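The macro F1 figures quoted above average per-class F1 without weighting by class frequency, so rare response classes count as much as common ones. A minimal sketch of that metric over the five response classes follows; the function name and the toy labels are illustrative assumptions, not taken from the paper or its benchmark code.

```python
# Illustrative only: class names come from the abstract, everything else is assumed.
CLASSES = ["Greet", "Elicit", "Inform", "Recommend", "Hold"]

def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class that never appears in labels or predictions scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(classes)
```

Because every class contributes equally, a model that over-predicts one response (for example always choosing Hold) is penalized on the remaining four classes even if its raw accuracy looks acceptable.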

2601.15408 2026-06-09 cs.CV cs.AI cs.CL cs.LG 版本更新

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

CURE：基于课程引导的多任务训练实现可靠的解剖学接地报告生成

Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem

发表机构 * Pontificia Universidad Católica de Chile（智利天主教大学）； CENIA ； iHEALTH ； KAUST（阿卜杜拉国王科技大学）

AI总结提出CURE框架，通过课程学习动态调整多任务训练，提升医学报告生成的视觉接地准确性和事实一致性，无需额外数据。

Comments 31 pages, 7 figures, accepted to CVPR 2026 (oral)

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 36279-36289

AI中文摘要

医学视觉语言模型可以自动生成放射学报告，但在精确的视觉接地和事实一致性方面存在困难。现有模型常常将文本发现与视觉证据错误对齐，导致不可靠或弱接地的预测。我们提出CURE，一个错误感知的课程学习框架，无需任何额外数据即可改善接地和报告质量。CURE在短语接地、接地报告生成和解剖学接地报告生成上，使用公共数据集微调多模态指令模型。该方法基于模型性能动态调整采样，强调困难样本以改善空间和文本对齐。CURE将接地准确率提高了+0.35 IoU，报告质量提高了+0.192 CXRFEScore，并将幻觉减少了18.6%。CURE是一个数据高效的框架，增强了接地准确性和报告可靠性。代码见 https://github.com/PabloMessina/CURE ，模型权重见 https://huggingface.co/pamessina/medgemma-4b-it-cure

英文摘要

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.35 IoU, boosts report quality by +0.192 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
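The grounding gain above is reported in IoU (intersection over union). As a reminder of what that number measures, here is a minimal sketch for two axis-aligned bounding boxes. The `(x_min, y_min, x_max, y_max)` box format is an assumption; the abstract does not say how CURE represents regions or aggregates IoU across samples.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlap rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

IoU is 1.0 for identical boxes and 0.0 for disjoint ones, so an absolute change of +0.35 is large on this scale.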

2606.00384 2026-06-09 cs.AI cs.CL cs.CV cs.LG stat.CO 版本更新

VESTA: Visual Exploration with Statistical Tool Agents

VESTA: 基于统计工具代理的视觉探索

William Rudman, Abhishek Divekar, Kanishk Jain, Sebastian Joseph, Stella S. R. Offner, Matthew Lease, Kyle Mahowald, Greg Durrett, Junyi Jessy Li

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）； New York University（纽约大学）

AI总结提出VESTA框架，通过动态增长的工具集指导数据变换、假设驱动可视化和统计检验，提升视觉语言模型在复杂统计建模任务上的性能。

AI中文摘要

将定量模型拟合到数据上是科学工作流程中的核心步骤，但它仍然是最少自动化的步骤之一。最近的基于代理的系统利用语言和视觉语言模型（VLM）来迭代地提出和优化统计模型，但这些系统在更具挑战性的建模任务上表现不佳。为了解决这些限制，我们引入了VESTA：基于统计工具代理的视觉探索，这是一个框架，为VLM配备了一个动态增长的探索工具包，通过数据变换、假设驱动的可视化和稳健的统计检验来指导模型优化。与之前仅依赖迭代批评的系统不同，VESTA在优化之前和优化过程中通过选择或创建诊断工具主动探索数据，这些工具会累积在模型的上下文中，并可在以后重用。我们在三种工具配置下评估VESTA与已建立的基线：无工具、静态专家编写的工具和动态模型编写的工具。为了支持这一评估，我们引入了DAWN（自动工作流和数值建模数据集），这是一个针对分布拟合和时间序列建模的基准，具有不同的难度等级，并最终涉及真实世界的天文学任务，包括建模初始质量函数和引力波啁啾信号。我们发现VESTA的动态工具创建优于先前的代理流水线，在复杂和特定领域的任务上取得了最大的收益。我们进一步表明，动态生成的工具比现有视觉工具创建系统生成的工具复杂得多，每个函数覆盖更多的诊断类别，并且强烈倾向于VLM批评者可以直接推理的视觉输出。

英文摘要

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.

2606.07435 2026-06-09 cs.CV cs.CL 版本更新

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

唇读差距：VSR模型是否像人类唇读者一样感知视觉语音？

Rishabh Jain, Naomi Harte

发表机构 * Sigmedia Group（Sigmedia研究组）； School of Engineering（工程学院）； Trinity College Dublin（都柏林圣三一学院）

AI总结通过对比VSR系统与人类在MaFI数据集上的表现，发现模型虽整体准确率更高，但错误模式与人类不同，主要依赖训练数据中的语言线索而非视觉感知。

Comments Accepted at INTERSPEECH 2026

AI中文摘要

视觉语音识别（VSR）模型在基准测试中现已超越人类唇读者，但这样的进步是否建立了类人的视觉语音感知？为探究此问题，我们使用MaFI词级唇读数据集，在词、字符、音素和视位级别上比较了三个VSR系统与人类基线。尽管模型实现了更高的整体准确率，但它们在不同于人类的单词上成功和失败。仅给定少量初始音素的纯文本n-gram基线可与人类唇读相媲美。VSR词级错误始终能更好地通过训练词频而非词的视觉信息量来解释。视位准确率、混淆矩阵以及人类-模型相关性进一步表明，模型在人类认为最难的视位上获益最多，并且对视觉清晰度的依赖性弱得多。我们的工作表明，VSR系统主要依赖训练数据中的语言线索而非视觉感知，未能将视觉特征绑定为有意义的单词。

英文摘要

Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.

2606.07547 2026-06-09 cs.CL cs.AI cs.SD 新提交

Liberating LLM Capabilities in Full-Duplex Speech Models

在全双工语音模型中释放LLM能力

Luoyuan Zhang, Bokai Xu, Junbo Cui, Weiyue Sun, Yingjing Xu, Hanyu Liu, Yuan Yao

发表机构 * Royal Zhang

AI总结提出Listen-Write-Speak (LWS)三通道范式，使LLM在共享因果注意力上下文中同时监听、书写可见文本并实时口语回应，无需架构修改，实现全双工交互。

AI中文摘要

基于语音的大型语言模型通常局限于口语回复，这将其面向用户的输出限制在可口头表达的内容上，并抑制了文本原生能力，如代码生成、结构化分析和实时交互中的多步推理，对于需要持久、结构化且可检查的中间输出的任务。现有工作改进了口语推理或全双工轮流发言，但仍将文本视为隐藏的中间状态或从属模态，而非第一类输出通道。我们提出Listen-Write-Speak (LWS)，一种文本优先的三通道范式，其中单个自回归LLM持续监听用户音频，写出可见的自由形式文本作为其主要输出，并在共享因果注意力上下文中并行生成实时口语回应。该行为完全通过Token Schema实现，无需架构修改，并通过两阶段数据流水线学习，该流水线合成与揭示的输入时间线一致的每秒认知注释。实验上，LWS在Full-Duplex-Bench上展示了强大的全双工交互，在VoiceBench AlpacaEval上达到4.72，写作-口语一致性达92.6%，并在URO-Bench上持续优于其内部消融版本。这些结果表明，可见书写可以作为语音交互的第一类输出通道，而不会牺牲实时响应性。代码和数据集可在项目页面获取：https://royalzhang.com/project/lws-page/。

英文摘要

Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.

2606.07608 2026-06-09 cs.CL cs.AI cs.LG cs.SD 新提交

Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

针对瑞士德语语音识别的Whisper字幕对齐微调：基准污染、转写规范不匹配以及25.6% WER（13.8% cWER）的诚实基线

Felix Akeret

发表机构 * Independent Researcher, Zurich, Switzerland（独立研究员，瑞士苏黎世）； ETH Zürich（苏黎世联邦理工学院）； University of Bern（伯尔尼大学）； FHNW（西北应用科学与艺术大学）； CeTIM Leiden/Munich（CeTIM 莱顿/慕尼黑）

AI总结以1,367小时广播语音及其标准德语字幕作为弱监督，系统微调Whisper large-v3用于瑞士德语语音识别，发现已发表结果因基准污染被高估，并发布两个经过诚实评估的模型。

Comments 15 pages, 21 tables. Models available at https://huggingface.co/Flix-AI

AI中文摘要

我们提出了一项系统研究，针对OpenAI的Whisper large-v3进行微调，用于瑞士德语语音识别，使用1,367小时的广播语音与标准德语字幕作为弱监督。通过在NVIDIA DGX Spark（Grace Blackwell，128 GB统一内存，最高1 PFLOP FP4）上进行16次迭代训练，我们比较了LoRA和全微调（1.55B参数模型），研究了幻觉的根本原因，并量化了数据质量、字幕对齐和训练策略的影响。我们的最佳模型在严格不相交数据上的诚实评估中，在All Swiss German Dialects Test Set (ASGDTS)上实现了25.6%的测量WER。通过将真实错误与有效的风格变异（时态、词序、瑞士正字法）分离的协调错误分析，得到内容WER (cWER)为13.8%，仅计算实际识别失败。偏差校正估计将其降至8.5%，表明真实错误率约为测量WER的三分之一。我们证明，已发表的瑞士德语ASR最先进结果（17.1-17.5% WER）因基准污染而被夸大：一个在ASGDTS测试集上自训练的普通Whisper模型（零瑞士德语数据）实现了13.88% WER，超过了所有已发表系统。使用Phi-4-multimodal的实验显示出更强的记忆效应（3.9% WER），揭示该基准主要衡量转写规范匹配而非方言理解。我们发布了两个模型，一个LoRA适配器（25.32% WER，13.9% cWER）和一个全微调模型（25.60% WER，13.8% cWER），这是少数公开可用、经过诚实评估的瑞士德语Whisper模型之一，采用Apache 2.0许可，完全可复现，无需机构数据协议。

英文摘要

We present a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired with Standard German subtitles as weak supervision. Through 16 iterative training runs on an NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory, up to 1 PFLOP FP4), we compare LoRA and full fine-tuning of the 1.55B-parameter model, investigate hallucination root causes, and quantify the effect of data quality, subtitle alignment, and training strategy. Our best model achieves 25.6% measured WER on the All Swiss German Dialects Test Set (ASGDTS) in an honest evaluation on strictly disjoint data. A harmonized error analysis separating genuine errors from valid stylistic variation (tense, word order, Swiss orthography) yields a content WER (cWER) of 13.8%, counting only actual recognition failures. Bias-corrected estimation reduces this to 8.5%, suggesting the true error rate is roughly one third of measured WER. We demonstrate that published state-of-the-art Swiss German ASR results (17.1-17.5% WER) are inflated by benchmark contamination: a vanilla Whisper model self-trained on the ASGDTS test set with zero Swiss German data achieves 13.88% WER, surpassing all published systems. Experiments with Phi-4-multimodal show an even stronger memorization effect (3.9% WER), revealing that the benchmark primarily measures convention matching rather than dialectal comprehension. We release two models, a LoRA adapter (25.32% WER, 13.9% cWER) and a full fine-tuned model (25.60% WER, 13.8% cWER), among the few publicly available, honestly evaluated Whisper models for Swiss German, under Apache 2.0 with full reproducibility, requiring no institutional data agreements.
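All of the WER figures above rest on the same quantity: word-level edit distance divided by the number of reference words. The sketch below shows that plain definition only. Whitespace tokenization and the absence of text normalization are simplifying assumptions, and the paper's content WER (cWER) depends on its own harmonized error analysis, which is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition a valid stylistic variant (a different tense or word order) is charged the same as a genuine misrecognition, which is why the abstract separates measured WER from cWER. The "roughly one third" remark follows from 8.5 / 25.6 ≈ 0.33.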

2606.08486 2026-06-09 cs.CL 新提交

面向手语交互中的手语活动预测

Takao Obi, Wang Yusong, Koji Inoue, Kotaro Funakoshi

发表机构 * Institute of Science Tokyo（东京科学大学）； Kyoto University（京都大学）

AI总结本研究探索将语音活动预测（VAP）框架迁移至双人手语交互，利用公共DGS语料库提取手语活动流，基于姿态特征进行轮换预测，结果表明HOLD/SHIFT预测有潜力但SHIFT预测困难。

AI中文摘要

社交机器人不仅需要与以语音为中心的系统所假设的用户进行稳健交互，还需要与依赖不同模态（例如手语）进行交流的多样化用户进行交互。一个重要的能力差距是与手语用户进行预测性轮换。尽管语音活动预测（VAP）已成功用于模拟口语交互中的未来语音活动，但该框架是否适用于手语交互仍不清楚。本文提出了将VAP架构适应双人手语交互的初步迁移研究。使用公共DGS语料库的交互录音，我们从词汇手语标注中推导出二进制手语活动流，并制定轮换预测的代理任务。模型使用每个手语者提取的基于姿态的手部、眼部区域和嘴部区域特征。结果表明，SHIFT/HOLD预测是有前景的，尤其是利用手部线索，而SHIFT预测仍然困难。这些发现为将预测性轮换模型从口语交互迁移到手语交互的潜力和当前局限性提供了初步证据。手语交互的预测建模仍然需要超越语音衍生类别的手语特定事件定义。

英文摘要

Social robots must interact robustly not only with users assumed by speech-centered systems but also with diverse users whose communication relies on different modalities, e.g., sign language. One important capability gap is predictive turn-taking with signing users. Although Voice Activity Projection (VAP) has been successfully used to model future voice activity in spoken interaction, it remains unclear whether the framework transfers to sign language interaction. This paper presents an initial transfer study of adapting a VAP architecture to dyadic sign language interaction. Using interaction recordings from the Public DGS Corpus, we derive binary signing activity streams from lexical sign annotations and formulate proxy tasks for turn-taking prediction. The model uses pose-derived hand, eye-region, and mouth-region features extracted for each signer. The results show that SHIFT/HOLD prediction is promising, especially with hand cues, while SHIFT-prediction remains difficult. These findings provide initial evidence for both the promise and the current limitations of transferring predictive turn-taking models from spoken interaction to sign language interaction. Predictive modeling of sign language interaction still requires sign-language-specific event definitions that go beyond speech-derived categories.

2606.09470 2026-06-09 cs.CL cs.AI 新提交

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

一种用于联合多粒度L2评估和自然语言解释的微调SpeechLLM

Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

发表机构 * Centre for Language Studies, Radboud University（语言研究中心，拉德堡德大学）

AI总结提出一种基于评分准则的SpeechLLM，通过混合训练目标联合预测句子级和词/音素级标签并生成自然语言解释，在SpeechOcean762上达到或超越单粒度模型。

Comments Accepted to Interspeech 2026. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants, which is financed by the Dutch Research Council (NWO)

AI中文摘要

自动化的L2语音评估可以分配熟练度标签，但通常缺乏可解释性。我们提出了一种基于评分准则的SpeechLLM，用于多角度、多粒度的评估，采用结合监督微调和有界直接偏好优化的混合目标进行训练。该模型在同一个响应中联合预测句子级（准确性、流利度、韵律）的序数标签、词/音素级准确性，并生成自然语言解释。在SpeechOcean762上，我们的方法匹配或优于单粒度模型，同时与先前方法保持竞争力。我们从两个维度分析解释的可靠性：与模型预测的自一致性和与真实标签的对齐，使用情感一致性（合理性）和基于提及的一致性（忠实性）。解释在句子级别是合理的，但在词/音素级别忠实性下降：参考稀疏且与词元级标签弱对齐。

英文摘要

Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal labels at the sentence-level (accuracy, fluency, prosody), word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.

2606.09535 2026-06-09 cs.CL cs.SD 新提交

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

克服Whisper在达罗毗荼语系和低资源语言中的解码器不一致性

Chowdam Venkata Kumar, Kumud Tripathi, Pankaj Wasnik

发表机构 * Sony Research India（索尼印度研究院）

AI总结针对Whisper在达罗毗荼语系上词错误率高的问题，通过语言学和数据集分析发现词汇稀疏和字符级替换错误，提出加权注意力和自条件化两种解码器增强方法，显著降低低资源和黏着语言的WER。

Comments Accepted at INTERSPEECH 2026, 5 pages, 1 figure, 5 tables

AI中文摘要

多语言ASR模型如Whisper在高资源语言上表现良好，但在达罗毗荼语系上的词错误率（WER）显著高于印度-雅利安语系。通过语言学和数据集分析，我们发现达罗毗荼语系具有更长的单词、更高的词汇多样性和更低的重复率，导致标记分布稀疏和频繁的字符级替换错误。基线微调进一步揭示了自注意力（语言上下文）和交叉注意力（声学线索）之间的解码器不平衡。尽管合成标记重复实验表明潜在收益，但实际不可行。受这些观察启发，我们引入了两种解码器级增强：加权注意力（自适应平衡注意力来源）和自条件化（重新注入中间预测以提高标记一致性）。实验表明，对于低资源和黏着语言，WER持续降低。

英文摘要

Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.

2606.07610 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

LEAF: 无需分支的树生长方法用于语音感知大语言模型后训练

Argyrios Gerogiannis, Yekaterina Yegorova, Mark Hasegawa-Johnson, Venugopal V. Veeravalli

发表机构 * University of Illinois, Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结针对语音感知大语言模型后训练中GRPO方法粗粒度信用分配问题，提出LEAF方法，通过回溯式树结构学习、高信息量边界选择和跨度级优势分配，在语音问答和翻译任务上超越GRPO。

Comments 15 pages, 3 figures, 11 tables

AI中文摘要

最先进的GRPO风格方法在语音感知大语言模型后训练中存在粗粒度信用分配问题，将相同的终端奖励优势广播给响应中的每个token。这忽略了rollout批次中的有用结构，其中语音条件下的补全通常共享前缀，然后在重要决策处出现分歧。我们提出低秩探索自适应分叉（LEAF），一种基于回溯树的强化学习方法，无需在线分支或额外解码即可恢复这种结构。LEAF采样完整响应，选择高信息量边界，按共享前缀分组响应，并使用后代奖励分配跨度级优势。我们从理论上证明了LEAF的跨度级信用分配和边界选择设计。实验上，在相同的rollout和低秩适应预算下，LEAF在语音问答和语音翻译基准上优于GRPO。值得注意的是，较小的LEAF训练模型优于当前最先进的完全参数基线。

英文摘要

State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. This ignores useful structure within rollout batches, where speech-conditioned completions often share prefixes before diverging at important decisions. We propose Low-rank Exploration with Adaptive Forking (LEAF), a retrospective tree-based RL method that recovers this structure without online branching or additional decoding. LEAF samples complete responses, selects high-surprisal boundaries, groups responses by shared prefixes, and assigns span-level advantages using descendant rewards. We theoretically justify LEAF's span-level credit assignment and boundary-selection design. Empirically, LEAF improves over GRPO across speech question answering and speech translation benchmarks under the same rollout and low-rank adaptation budget. Notably, smaller LEAF-trained models outperform current state-of-the-art, full-parameter baselines.
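To make the contrast with broadcast advantages concrete, the toy sketch below groups already-sampled responses by a shared prefix and gives the prefix span and the suffix span different advantages. It is an assumption-laden illustration, not the LEAF algorithm: the single hand-picked `boundary`, the mean-based baselines, and the function name are all hypothetical, and the paper's surprisal-based boundary selection is not modeled.

```python
from collections import defaultdict

def span_advantages(responses, rewards, boundary):
    """Return (prefix_advantage, suffix_advantage) for each sampled response.

    responses: list of token lists; rewards: one terminal reward per response;
    boundary: token index where the shared prefix is assumed to end.
    """
    global_mean = sum(rewards) / len(rewards)
    # Retrospective grouping: no extra decoding, only bookkeeping over the batch.
    groups = defaultdict(list)
    for idx, tokens in enumerate(responses):
        groups[tuple(tokens[:boundary])].append(idx)
    out = [None] * len(responses)
    for members in groups.values():
        # Descendant rewards of this prefix: rewards of all responses sharing it.
        group_mean = sum(rewards[i] for i in members) / len(members)
        for i in members:
            out[i] = (group_mean - global_mean,  # credit for the shared prefix
                      rewards[i] - group_mean)   # credit for the divergent suffix
    return out
```

In this toy form, two responses that share a prefix receive the same prefix credit but opposite suffix credit when only one of them is rewarded, whereas a broadcast scheme would give every token in each response the same value.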

2606.08210 2026-06-09 eess.AS cs.CL cs.SD 交叉投稿

Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion

Paediatric-HGNN：一种通过多尺度声学融合检测儿童言语不流畅的混合异构图神经网络

Rashini Liyanarachchi, Rachael Mackay, Alison Short, Aditya Joshi, Erik Meijering

发表机构 * University of New South Wales（新南威尔士大学）； Western Sydney University（西悉尼大学）； Resourced Music Therapy

AI总结针对儿童言语中声学变异大、病理口吃与发育性不流畅难以区分的问题，提出Paediatric-HGNN框架，通过构建异构图捕获词汇与声学片段的分层关系，在儿童语料上实现82.4%加权准确率和0.386的典型不流畅F1分数。

Comments Accepted at INTERSPEECH 2026 (Main)

2606.08425 2026-06-09 cs.SD cs.CL eess.AS 交叉投稿

TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints

TinyGiantALM：面向资源约束下意图感知推理的紧凑型音频-语言模型

Vinh-Thuan Ly

发表机构 * University of Science, VNU-HCM（胡志明市国立大学下属理科大学）； Vietnam National University, Ho Chi Minh City（胡志明市国立大学）

AI总结提出紧凑型1.5B参数音频-语言模型TinyGiantALM，通过指令感知特征精炼框架（查询引导投影器+语义门控）过滤用户意图相关声学信号，在MMAR基准上零样本准确率46.4%，超越7B-13B基线，并优于8倍大模型。

Comments Accepted to Interspeech 2026. Project page: https://interspeech-tinygiant-alm.vercel.app

2606.08573 2026-06-09 cs.LG cs.CL 交叉投稿

Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

Titans-as-a-Layer：对话语音情感识别的测试时记忆

Daniel Chen, Qicong Hu, Yang Xiao, Ting Dang, Hong Jia

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结提出Memory-as-a-Layer (MAL)适配器，利用测试时神经记忆为对话语音情感识别提供上下文，在不修改大型音频语言模型的前提下提升性能。

Comments ICML 2026 Workshop on Machine Learning for Audio

AI中文摘要

语音情感识别（SER）通常被表述为话语级分类，尽管对话情感取决于说话者通常的音域和先前话语建立的情感上下文。语音语言模型提供了强大的预训练声学和语义表示，并可以通过微调将其适应于SER标签，但这种机制仍然缺少每对话状态。我们研究测试时神经记忆是否可以在保持大型音频语言模型（LALMs）主干不变的情况下提供这种缺失的上下文。基于Titans，我们引入了一种即插即用的Memory-as-a-Layer（MAL）适配器，它将对话历史写入小型神经记忆，并作为音频令牌对齐的残差更新读回，避免了对宿主模型令牌位置的更改。在不同的音频LLM和情感识别数据集评估中，我们的设计在不同评估指标上改善了SER性能，支持测试时记忆作为对话SER的残差上下文机制。

英文摘要

Speech emotion recognition (SER) is commonly formulated as utterance-level classification, although conversational emotion depends on a speaker's usual vocal range and the emotional context established by previous utterances. Speech-language models provide strong pretrained acoustic and semantic representations and can be adapted to SER labels via fine-tuning, but this mechanism still lacks per-dialogue state. We study whether test-time neural memory can supply this missing context while leaving the large audio language model (LALM) backbone intact. Building on Titans, we introduce a plug-and-play Memory-as-a-Layer (MAL) adapter that writes dialogue history into a small neural memory and reads it back as an audio-token-aligned residual update, avoiding changes to the host model's token positions. Across different audio LLMs and emotion recognition datasets, our design improves SER performance on multiple evaluation metrics, supporting test-time memory as a residual contextual mechanism for conversational SER.

2606.09667 2026-06-09 eess.AS cs.CL cs.SD 交叉投稿

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

基于sEMG和唇读的鲁棒无声语音合成的跨模态掩蔽

Eder del Blanco, David Gimeno-Gómez, Eva Navas, Carlos-D. Martínez-Hinarejos, Inma Hernáez

发表机构 * Aholab research group within the HiTZ Center at University of the Basque Country (UPV/EHU)（巴斯克大学HiTZ中心内Aholab研究组）； PRHLT research center, Universitat Politècnica de València (UPV)（瓦伦西亚理工大学PRHLT研究中心）

AI总结提出掩蔽多模态语音合成框架，联合表面肌电图和唇读信号，通过训练时模态掩蔽提升鲁棒性，在多说话人设置下词错误率降低14个百分点。

Comments 12 pages, 7 figures and 6 tables. Submitted to Transactions on Audio, Speech and Language Processing

AI中文摘要

通过无声语音接口进行语音恢复已成为针对喉部发声受损或缺失个体的有前景的辅助技术。在非侵入式无声语音接口模态中，表面肌电图和基于视频的唇读提供了互补的发音信息，然而它们用于连续语音合成的集成仍未被充分探索。此外，现有的多模态方法很少考虑对模态退化或临时传感器故障的鲁棒性，限制了它们在现实场景中的适用性。在这项工作中，我们提出了一种掩蔽多模态语音合成框架，通过在训练期间进行模态掩蔽来联合利用表面肌电图和唇读信号。在多说话人设置下，与最强的单模态基线相比，所提出的方法将词错误率降低了多达14个绝对百分点。实验结果不仅表明掩蔽策略对于这些性能提升和低比特率条件下的鲁棒性至关重要，而且表明在模态缺失情况下，它们比针对退化的数据增强具有更好的泛化能力。音素级分析进一步揭示了跨模态的互补贡献，对元音和特定辅音组尤其有益。总体而言，这些发现证明了掩蔽多模态集成用于无声语音合成的有效性和鲁棒性，尽管适应喉切除说话者仍是一个开放的研究挑战。

英文摘要

Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results not only show that masking strategies are critical for these performance gains and robustness under low-bitrate conditions, but also that they generalize better than degradation-specific data augmentations in the presence of modality absence conditions. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.
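Modality masking during training can be pictured as randomly blanking one input stream so the model learns to cope when a sensor drops out. The sketch below is a generic illustration under stated assumptions: whole-modality zeroing, a single masking probability `p_mask`, and list-of-frames inputs are hypothetical choices, and the abstract does not specify the masking schedule the authors use.

```python
import random

def mask_modalities(semg, video, p_mask=0.3, rng=random):
    """With probability p_mask, zero out exactly one of the two modalities.

    semg, video: lists of feature frames (each frame is a list of floats).
    Returns possibly masked copies; the inputs themselves are not modified.
    """
    if rng.random() < p_mask:
        if rng.random() < 0.5:
            # Simulate a missing sEMG stream.
            semg = [[0.0] * len(frame) for frame in semg]
        else:
            # Simulate a missing lip-video stream.
            video = [[0.0] * len(frame) for frame in video]
    return semg, video
```

Applied per training example, this never removes both streams at once, so some articulatory evidence always remains for the synthesis target.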

2605.19266 2026-06-09 cs.CL cs.AI 版本更新

FormalASR: End-to-End Spoken Chinese to Formal Text

FormalASR：端到端的口语中文到正式文本转写

Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

发表机构 * arXiv

AI总结本文提出FormalASR，即两个紧凑的端到端模型（0.6B和1.7B），可将中文口语直接转写为正式书面文本；通过构建大规模口语到正式文本数据集并对Qwen3-ASR进行监督微调，相对于逐字转录基线实现了最高37.4%的相对CER降低，同时提升了ROUGE-L和BERTScore，提供了轻量级的设备端解决方案。

AI中文摘要

自动语音识别（ASR）系统通常针对逐字转录进行优化，这保留了不流畅现象、填充词和非正式口语结构，这些内容往往不适合面向写作的下游应用。常见的解决方法是采用ASR+LLM的两阶段流程进行后期编辑，但这种设计增加了延迟和内存成本，并且难以在设备上部署。我们提出了FormalASR，两个紧凑的端到端模型（0.6B和1.7B），可直接将中文口语转写为正式书面文本。为了实现这一目标，我们构建了WenetSpeech-Formal和Speechio-Formal两个大规模的口语到正式文本数据集，通过基于LLM的重写和质量过滤构建。然后我们使用监督微调对Qwen3-ASR进行两个规模（0.6B和1.7B）的微调。在WenetSpeech-Formal和Speechio-Formal上的实验表明，FormalASR相对于逐字转录基线实现了最高37.4%的相对CER降低，同时也提高了ROUGE-L和BERTScore。FormalASR在部署时不需要后处理LLM，为口语到正式文本的转写提供了一个轻量级的设备端解决方案。

英文摘要

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
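The 37.4% figure is a relative reduction, which is easy to misread as an absolute drop in CER. A short sketch of the distinction follows; the baseline and system values in the usage note are hypothetical numbers chosen only to make the arithmetic visible, not results from the paper.

```python
def relative_reduction(baseline, system):
    """Fraction by which `system` lowers an error rate relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - system) / baseline
```

For instance, a hypothetical drop from 10.0% to 6.26% CER is 3.74 points in absolute terms but a 37.4% relative reduction, since (10.0 - 6.26) / 10.0 = 0.374.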

2605.06582 2026-06-09 cs.LG cs.CL cs.SD 版本更新

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

PairAlign：一种通过自对齐的序列标记化框架及其在音频标记化中的应用

Adhiraj Banerjee, Vipul Arora

发表机构 * Department of Electrical Engineering, Indian Institute of Technology, Kanpur（电气工程系，印度理工学院坎普尔分校）

AI总结PairAlign通过序列级自对齐实现紧凑音频标记化，利用条件序列生成方法，提升标记一致性、长度控制和编辑相似性。

Comments 57 pages main content, 109 total pages, 9 Figures, pre-print, Under Review

AI中文摘要

许多感官数据的操作——比较、记忆、检索和推理——自然地在离散符号结构上表达。在语言中，这种接口由标记提供；在音频中，必须学习。现有音频标记器依赖于量化、聚类或编解码器重建，将标记局部分配，因此序列一致性、紧凑性、长度控制、终止和编辑相似性很少被直接优化。我们引入PairAlign，一种通过序列级自对齐实现紧凑音频标记化的框架。PairAlign将标记化视为条件序列生成：编码器将语音映射为连续条件，自回归解码器从BOS开始生成标记，学习标记身份、顺序、长度和EOS位置。给定两个保持内容的视图，每个视图的序列在另一个视图的表示下被训练为可能，而无关示例提供竞争序列。这为可扩展的编辑距离保留代理，同时抑制许多对一的坍缩。PairAlign从VQ式标记化开始，并通过EMA教师目标、交叉配对教师强制、前缀损坏、似然对比和长度控制进行优化。在3秒语音上，PairAlign学习紧凑、非退化的序列，具有广泛的词汇使用和强跨视图一致性。在检索测试中，它保留编辑距离搜索，同时将存档标记数量减少55%。连续扫频探针显示其局部重叠低于密集几何标记器，但具有更强的长度控制和在100毫秒移位下的受约束编辑轨迹。PairAlign是一种序列符号预测学习者：像JEPA式目标一样，它从另一个视图预测一个抽象目标作为学习的可变长度符号序列，而不是连续潜在变量。

英文摘要

Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On retrieval tests, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.

2606.07521 2026-06-09 cs.CL cs.AI 新提交

Evaluating Hallucinations in Domain-Adapted Large Language Models

评估领域自适应大语言模型中的幻觉现象

Sanchita Porwal, Sai Prasath S, Xingjian Bi, Madelyn Scandlen

发表机构 * College of Computing, Georgia Institute of Technology（佐治亚理工学院计算学院）

AI总结本研究通过微调Llama-2模型，测试其记忆、回忆和推理能力，发现领域自适应大语言模型在生成新领域特定信息时存在幻觉问题，表明仅靠微调难以有效缓解幻觉。

Comments 13 pages, 2 figures, 3 tables

AI中文摘要

本研究调查了领域自适应大语言模型（LLMs）中的幻觉现象，重点关注使用Lamini数据集对Llama-2模型进行微调。幻觉，即LLMs生成无意义或不忠实内容的现象，构成了重大挑战，尤其是当这些模型使用领域特定数据进行微调时。我们的方法包括一系列实验，测试微调后LLM的记忆、回忆和推理能力，并将其在新问答对和领域特定信息上的表现进行比较。我们发现，虽然模型在与训练数据相似的任务上表现出色，但其准确推理和回忆新领域特定信息的能力仍然有限，导致出现幻觉实例。模型倾向于提供带有额外信息的正确答案，表明存在过度生成的倾向。这些结果表明，仅靠微调来缓解幻觉存在重要局限性，并强调在将LLMs适应专业领域时需要更鲁棒的方法。该研究还提供了关于LLMs在不同类型信息上表现差异的见解，揭示了其在处理领域特定查询时的相对弱点。

英文摘要

This study investigates the phenomenon of hallucinations in domain-adapted Large Language Models (LLMs), focusing on the fine-tuning of the Llama-2 model with the Lamini dataset. Hallucinations, or the generation of nonsensical or unfaithful content by LLMs, pose a significant challenge, especially when these models are fine-tuned with domain-specific data. Our methodology involves a series of experiments testing memorization, recall, and reasoning capabilities of the fine-tuned LLM, comparing its performance on novel question-answer pairs and domain-specific information. We found that while the model shows proficiency in tasks similar to its training data, its capability to accurately reason about and recall new domain-specific information remains limited, leading to instances of hallucination. The model demonstrates a tendency to provide correct answers with extra information, suggesting an inclination toward over-generation. These results point to important limitations of fine-tuning-only approaches for mitigating hallucinations and underscore the need for more robust methods when adapting LLMs to specialized domains. The study also provides insights into the varying performance of LLMs on different types of information, revealing a comparative weakness in handling domain-specific queries.

2606.07778 2026-06-09 cs.CL 新提交

Unlocking Latent Value: Taxonomy-Guided Recovery of High-Performing Data from Low-Tier Web Corpora

解锁潜在价值：基于分类法从低层级网络语料库中恢复高性能数据

Neeraj Varshney, Sanket Lokegaonkar, Nasser Zalmout, Qingyu Yin, Priyanka Nigam, Bing Yin

发表机构 * Amazon（亚马逊）

AI总结提出一种分类驱动框架，通过引入时效性和文化特异性两个新维度，结合两阶段过滤方法，从低质量网络数据中恢复高性能子集，在推理和编码任务上显著超越未过滤的高质量数据。

AI中文摘要

主流的预训练网络数据筛选流程将文档质量压缩为单一复合分数，系统性地遗漏了评分器权重不足维度上的高价值内容。我们提出一个分类驱动的框架，通过沿复合分数无法捕捉的、具有语义意义的维度进行过滤来恢复这一价值。首先，基于ESSENTIAL-WEB分类法，我们引入两个新维度：时效性和文化特异性，它们与现有维度的成对NMI较低。我们使用Qwen2.5 32B对1400万文档进行标注，并蒸馏成一个轻量级0.5B模型。为实现快速的语料库级标注，我们额外在E5嵌入上训练了一个7300万参数的多任务MLP，推理吞吐量提升50倍。其次，为应对过滤配置的组合爆炸，我们引入一个计算高效的两阶段框架：第一阶段在小规模上识别最强维度信号；第二阶段从最优表现者中构建并评估合取和析取复合过滤器，从而以完整缩放定律（scaling-law）实验成本的一小部分识别出高性能配置。将所选过滤器应用于被降级的网络数据，分类过滤后的子集在性能上超过其未过滤基线，甚至超越最高质量层级。在中层数据上，我们的最佳过滤器在推理、编码和知识基准上分别比未过滤基线提升12.1%、9.5%和2.0%，在推理和编码上分别超过未过滤的顶层数据6.7%和13.7%。此外，来自典型生产阈值以下两个层级的数据，其过滤后的子集在推理和编码上比未过滤基线提升22.3%和19.5%，在编码基准上超越顶层数据。这些结果表明，大量潜在价值仍锁定在被降级的网络数据中，而多维分类过滤是解锁这些价值的原理性且计算高效的钥匙。

英文摘要

Dominant web data curation pipelines for pretraining collapse document quality into a single composite score, systematically missing high-value content along dimensions the scorer underweights. We present a taxonomy-driven framework that recovers this value by filtering along semantically meaningful dimensions that composite scores fail to capture. First, building on the ESSENTIAL-WEB taxonomy, we introduce two novel dimensions: timeliness and cultural specificity, both of which show low pairwise NMI with existing ones. We annotate 14M documents using Qwen2.5 32B and distill into a lightweight 0.5B model. To enable rapid corpus-wide annotation, we additionally train a 73M multi-task MLP on E5 embeddings, achieving 50x inference throughput. Second, to navigate the combinatorial explosion of filter configurations, we introduce a compute-efficient two-pass framework: Pass 1 identifies the strongest dimension signals at small scale; Pass 2 constructs and evaluates conjunctive and disjunctive compound filters from the top performers - identifying high-performing configurations at a fraction of full scaling-law cost. Applying the selected filters to deprioritized web data, taxonomy-filtered subsets outperform their unfiltered baselines and even surpass the highest-quality tier. On mid-tier data, our best filter improves over its unfiltered baseline by 12.1% on reasoning, 9.5% on coding, and 2.0% on knowledge benchmarks, exceeding unfiltered top-tier data by 6.7% on reasoning and 13.7% on coding. Furthermore, filtered data from two tiers below the typical production threshold improves by 22.3% on reasoning and 19.5% on coding over its unfiltered baseline, surpassing top-tier data on coding benchmarks. These results establish that vast latent value remains locked in deprioritized web data, and that multi-dimensional taxonomy filtering is a principled, compute-efficient key to unlocking it.
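
The conjunctive and disjunctive compound filters described for Pass 2 can be pictured as boolean combinations of per-dimension predicates. The sketch below is only illustrative: the label values (`evergreen`, `global`, and so on), the document records, and the helper names are hypothetical and do not come from the paper or from the ESSENTIAL-WEB taxonomy.

```python
from typing import Callable, Dict, List, Set

Doc = Dict[str, str]
Filter = Callable[[Doc], bool]


def has_label(dimension: str, allowed: Set[str]) -> Filter:
    """Keep a document if its label on `dimension` is in `allowed`."""
    return lambda doc: doc.get(dimension) in allowed


def conjunction(*filters: Filter) -> Filter:
    """Compound filter that passes only if every component passes (AND)."""
    return lambda doc: all(f(doc) for f in filters)


def disjunction(*filters: Filter) -> Filter:
    """Compound filter that passes if any component passes (OR)."""
    return lambda doc: any(f(doc) for f in filters)


# Hypothetical annotated documents; label names are assumptions.
docs: List[Doc] = [
    {"id": "d1", "timeliness": "evergreen", "cultural_specificity": "global"},
    {"id": "d2", "timeliness": "dated", "cultural_specificity": "global"},
    {"id": "d3", "timeliness": "evergreen", "cultural_specificity": "regional"},
    {"id": "d4", "timeliness": "dated", "cultural_specificity": "regional"},
]

evergreen = has_label("timeliness", {"evergreen"})
global_scope = has_label("cultural_specificity", {"global"})
both = conjunction(evergreen, global_scope)
either = disjunction(evergreen, global_scope)

print([d["id"] for d in docs if both(d)])
print([d["id"] for d in docs if either(d)])
```

A conjunction trades corpus size for selectivity, while a disjunction keeps more tokens; the number of such combinations grows quickly with the number of dimensions, which is the combinatorial explosion the two-pass procedure is meant to tame.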

2606.07810 2026-06-09 cs.CL cs.AI cs.LG 新提交

SLMJury: Can Small Language Models Judge as Well as Large Ones?

SLMJury：小型语言模型能否像大型模型一样进行评判？

Anish Laddha, Nitesh Pradhan, Gaurav Srivastava

发表机构 * LNMIIT ； Virginia Tech（弗吉尼亚理工大学）

AI总结提出SLMJury框架，评估小型语言模型作为评判者的能力，发现领域依赖的过度思考效应、领域泛化差异、封闭式与开放式评判能力分离，以及多智能体辩论降低准确性。

AI中文摘要

大型语言模型（LLMs）被广泛用作评估模型输出的评判者，但其高成本、延迟和不透明性限制了可扩展性。我们引入SLMJury，一个评估小型语言模型（SLMs）作为评判者的框架，涵盖两种范式：封闭式二元正确性判断和开放式质量评分。我们在十个基准上评测了来自四个模型家族的16个SLM评判者（0.6B-14B参数）：八个封闭式任务涵盖数学、科学和通用推理（每个配置N=64,824个判断），另有用于摘要和对话评分的SummEval和MT-Bench。我们将评判形式化为预算条件函数，并研究五个维度。得出四个发现。（1）过度思考效应是领域依赖的：对于大多数评判者，快速的10-token判决在数学评判上匹配或优于扩展推理（在有帮助的情况下提升2-7%），而推理在通用任务上胜出高达23%。（2）领域泛化区分了模型家族，数学到通用准确率差距从低于10%到接近40%不等。（3）封闭式和开放式评判依赖不同的能力：最佳二元评判者（Phi-4）在MT-Bench上降至第9名，而经过推理训练的模型则反转了这一顺序。（4）在反思-批判-改进（RCR）辩论协议下，多智能体辩论在所有测试配置中降低了准确性，而顶级评判者能够抵抗六种对抗性人格，方差不超过0.55%。可靠的自动评估不需要大型专有模型，但没有单一的SLM占主导地位。排行榜可在https://anishh15.github.io/SLMJury/获取，我们的框架代码和pip包公开在https://github.com/anishh15/SLMJury和https://pypi.org/project/slmjury/。

英文摘要

Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.

2606.07853 2026-06-09 cs.CL cs.AI 新提交

Beyond English benchmarks: clinical llm evaluation in Brazilian Portuguese

超越英语基准：巴西葡萄牙语临床大语言模型评估

Giordano de Pinho Souza, Glaucia Melo, Josefino Cabral Melo Lima, Daniel Schneider

发表机构 * Federal University of Rio de Janeiro（里约热内卢联邦大学）； Toronto Metropolitan University（多伦多都会大学）

AI总结提出首个双语临床基准ClinicalBr，基于巴西病例报告构建，评估四个模型发现葡萄牙语-英语性能差距具有任务依赖性，诊断检索英语优势明显，其他任务差距消失。

AI中文摘要

大语言模型正在改变临床决策支持及其在实际场景中的应用。然而，大多数基准测试以英语进行，跨语言评估对于解决全球可及性中的语言差距至关重要。我们介绍了ClinicalBr，这是首个基于真实巴西病例报告构建的双语临床决策基准。该语料库包含来自28种SciELO医学期刊的2,892个病例，涵盖18个专科，并构建为平行葡萄牙语-英语对。每个病例支持四项评估任务：诊断检索、鉴别诊断、检查推荐和治疗规划。我们评估了四个模型：MedGemma-27B、Sabiá-4、DeepSeek-R1和o3-mini，涵盖两种语言。核心发现是，葡萄牙语-英语性能差距是任务依赖的，而非普遍的。在诊断检索中，英语在所有模型上均具有一致优势，准确率高出7.5-12.1个百分点。这种优势在鉴别诊断、检查推荐和治疗规划中消失，大多数模型的置信区间跨越零，且葡萄牙语的完整性分数略高。巴西地方病比完整语料库更容易，而非更难，表明热带疾病表现在当前预训练中得到了充分体现。检查推荐是所有模型和两种语言中最难的任务，F1分数低于0.10，远低于鉴别诊断的上限0.20-0.27。

英文摘要

Large language models are transforming clinical decision support and its application in real-world scenarios. Yet most benchmarks are conducted in English, and cross-lingual evaluation is needed to address language gaps in global access. We introduce ClinicalBr, the first bilingual clinical decision-making benchmark built from real Brazilian case reports. The corpus contains 2,892 cases drawn from 28 SciELO medical journals, spanning 18 specialties, and is structured as parallel Portuguese-English pairs. Each case supports four evaluation tasks: diagnosis retrieval, differential diagnosis, exam recommendation, and treatment planning. We evaluate four models: MedGemma-27B, Sabiá-4, DeepSeek-R1, and o3-mini, across both languages. The central finding is that the Portuguese-English performance gap is task-dependent, not general. In diagnosis retrieval, English yields a consistent advantage across all models, with +7.5-12.1 accuracy points. This advantage disappears in differential diagnosis, exam recommendation, and treatment planning, where confidence intervals cross zero for most models and Portuguese completeness scores are marginally higher. Brazilian-endemic conditions proved easier than the full corpus, not harder, indicating that tropical presentations are adequately represented in current pre-training. Exam recommendation was the hardest task across all models and both languages, with F1 scores below 0.10, well below the differential diagnosis ceiling of 0.20-0.27.

2606.07995 2026-06-09 cs.CL 新提交

Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR

客户代理：通过工具增强代理和RLVR克服超长购物轨迹中的上下文限制

Hongye Liu, Rongmei Lin, Anurag Kashyap, Hejie Cui, Ricardo Henao, Besnik Fetahu, Bing Yin

发表机构 * Amazon（亚马逊）； Duke University（杜克大学）

AI总结提出ShopTrajQA基准和客户代理框架，利用RLVR训练代理通过代码解释器自主检索解析外部轨迹文件，突破LLM上下文窗口限制，在超长购物轨迹推理中取得强性能。

AI中文摘要

理解客户购物轨迹对于实现个性化购物体验至关重要。然而，购物记录（如客户的搜索、点击、购买等）通常跨越多年时间，形成极长的轨迹，给现有大型语言模型（LLM）带来重大挑战。尽管该问题重要，现有基准仅限于短客户轨迹，而大型电商平台的真实轨迹由于数据隐私限制难以获取。为解决这一差距，我们引入ShopTrajQA，一个基于真实产品信息和模拟购物轨迹构建的长上下文评估基准。数据集包含高达32k和64k token的变体，能够系统评估模型在不同上下文长度下的鲁棒性。通过对前沿LLM的全面基准测试，我们识别出在长购物轨迹数据推理中的关键性能差距。为应对这些挑战，我们提出一种用于超长上下文管理的客户代理框架。利用可验证奖励强化学习（RLVR）代理训练范式，我们的方法将轨迹存储为外部本地文件，并训练代理通过代码解释器交互（如SQL查询）自主检索和解析它们，有效绕过LLM的固定上下文窗口限制。实验结果表明，我们的框架在ShopTrajQA上取得强性能，并展现出对其他复杂推理任务的泛化能力。

英文摘要

Understanding customer shopping trajectories is essential for enabling personalized shopping experiences. However, shopping records (e.g., a customer's searches, clicks, and purchases) often span long time horizons over multiple years, resulting in extremely long trajectories that pose significant challenges for existing large language models (LLMs). Despite the importance of this problem, existing benchmarks are limited to short customer trajectories, while real-world trajectories from large e-commerce platforms are rarely accessible due to data privacy constraints. To address this gap, we introduce ShopTrajQA, a long-context evaluation benchmark constructed from real-world product information and simulated shopping trajectories. The dataset includes variants of up to 32k and 64k tokens, enabling systematic evaluation of model robustness under varying context lengths. Through comprehensive benchmarking of frontier LLMs, we identify critical performance gaps in reasoning over long shopping trajectory data. To address these challenges, we propose a Customer Agent Framework for ultra-long context management. Leveraging a Reinforcement Learning with Verifiable Rewards (RLVR) agentic training paradigm, our approach stores trajectories as external local files and trains the agent to autonomously retrieve and parse them through code-interpreter interactions (e.g., SQL queries), effectively bypassing the fixed in-context window constraints of LLMs. Experimental results demonstrate that our framework achieves strong performance on ShopTrajQA and shows generalization to other complex reasoning tasks.
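
The idea of querying a trajectory outside the context window can be sketched with Python's built-in `sqlite3` module. This is a toy illustration under stated assumptions: the `events` table schema, the sample rows, and the `purchases_since` helper are hypothetical, and an in-memory database stands in for the external local files the abstract describes.

```python
import sqlite3

# In-memory database used here only to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, action TEXT, product TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2023-01-05", "search", "running shoes"),
        ("2023-01-05", "click", "trail shoe X"),
        ("2024-03-11", "purchase", "trail shoe X"),
        ("2025-07-02", "purchase", "water bottle"),
    ],
)


def purchases_since(connection: sqlite3.Connection, start_date: str) -> list:
    """Return purchased products on or after start_date, oldest first.

    ISO-8601 date strings sort lexicographically, so a plain string
    comparison is enough for this toy schema.
    """
    rows = connection.execute(
        "SELECT product FROM events "
        "WHERE action = 'purchase' AND ts >= ? ORDER BY ts",
        (start_date,),
    ).fetchall()
    return [row[0] for row in rows]


print(purchases_since(conn, "2024-01-01"))
```

Only the rows returned by the query need to enter the model's context, so the prompt size depends on the answer rather than on the length of the full trajectory.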

2606.07996 2026-06-09 cs.CL cs.AI 新提交

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

MC-PDD: 面向黑盒大语言模型的掩码语料级预训练数据检测

Kaixin Lan, Mu You, Tao Fang, Binkai Ou, Lidia S. Chao, Derek F. Wong

发表机构 * University of Macau（澳门大学）； Macau Millennium College（澳门中西创新学院）； BoardWare Information System Limited（博维资讯系统有限公司）

AI总结提出MC-PDD方法，通过掩码特定token并利用LLM预测缺失内容，比较候选语料与参考非成员语料的预测命中率差异，以黑盒方式检测预训练数据，性能与现有方法相当。

Comments The manuscript consists of 10 pages formatted in the IEEE/ACM two-column style

AI中文摘要

预训练是大语言模型（LLM）发展的基础，然而预训练数据的不透明性使模型分析复杂化，并引发伦理、法律和公平性问题。因此，检测特定数据集是否在预训练中使用至关重要。现有最先进方法通常依赖于访问模型概率分布，因此不适用于仅提供输入输出接口的闭源LLM。为解决这一限制，我们引入了掩码语料级预训练数据检测（MC-PDD），这是一种受掩码语言建模范式启发的新方法。MC-PDD在每段文本中掩码高度特定的token，并提示LLM预测缺失内容。然后，它评估候选语料与参考非成员语料之间的预测命中率差异是否具有统计显著性。基于此比较，MC-PDD确定候选文本是否可能包含在模型的预训练数据中。实验结果表明，在三个数据集上，对于开源和闭源LLM，预训练数据和未见数据之间的预测命中率存在明显且一致的差异。尽管在更严格的黑盒设置下运行，MC-PDD仍实现了与现有检测方法相当的性能。我们的方法仅需使用标准API访问即可实现模型审计和数据版权验证等实际应用。接受后，我们将公开发布代码和数据集。

英文摘要

Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only input-output interfaces. To address this limitation, we introduce Masked Corpus-level Pretraining Data Detection (MC-PDD), a novel method inspired by the masked language modeling paradigm. MC-PDD masks highly specific tokens in each text and prompts the LLM to predict the missing content. It then assesses whether the difference in prediction hit rates between a candidate corpus and a reference non-member corpus is statistically significant. Based on this comparison, MC-PDD determines whether the candidate texts were likely included in the model's pretraining data. Experimental results demonstrate clear and consistent differences in prediction hit rates between pretrained and unseen data across three datasets, for both open-source and closed-source LLMs. Despite operating under a stricter black-box setting, MC-PDD achieves performance comparable to existing detection methods. Our approach enables practical applications such as model auditing and data copyright verification using only standard API access. Upon acceptance, we will publicly release the code and datasets.
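
The abstract says MC-PDD checks whether the gap in prediction hit rates between the candidate corpus and a reference non-member corpus is statistically significant, but it does not name the test. As one plausible stand-in, the sketch below uses a one-sided two-proportion z-test; the function name and the counts are hypothetical.

```python
import math
from typing import Tuple


def two_proportion_z_test(hits_a: int, n_a: int,
                          hits_b: int, n_b: int) -> Tuple[float, float]:
    """One-sided test of H1: hit rate of corpus A > hit rate of corpus B.

    Returns (z statistic, p-value). Assumes the pooled hit rate is strictly
    between 0 and 1, otherwise the standard error would be zero.
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal
    return z, p_value


# Hypothetical counts: masked tokens recovered out of masked tokens queried.
z, p = two_proportion_z_test(hits_a=120, n_a=400, hits_b=80, n_b=400)
print(f"z = {z:.2f}, p = {p:.5f}")
```

A small p-value would suggest the model recovers masked tokens from the candidate corpus more often than from text it has presumably not seen, which is the signal the method relies on. The decision threshold is a design choice not stated in the abstract.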

2606.08000 2026-06-09 cs.CL cs.AI 新提交

Summarization is Not Dead Yet

摘要生成尚未消亡

Dongqi Liu, Chenxi Whitehouse, Zheng Zhao, Zhuchen Cao, Jian Li, Yabiao Wang

发表机构 * Saarland University（萨尔大学）； Max Planck Institute for Informatics（马克斯·普朗克信息学研究所）； University of Cambridge（剑桥大学）； University of Edinburgh（爱丁堡大学）； Zhejiang University（浙江大学）； Tencent YouTu Lab（腾讯优图实验室）

AI总结通过多维度评估，发现人类参考摘要在信息量和忠实度上仍优于大语言模型，后者仅在表面连贯性和流畅性上占优，表明摘要生成研究仍有挑战。

AI中文摘要

大型语言模型（LLMs）的进展引发了关于模型生成的摘要可与人类撰写的参考摘要相媲美甚至超越后者的说法，这引发了摘要生成是否仍是一个开放研究问题的疑问。我们通过多轨道评估重新审视这一说法，涵盖五个不同数据集和五个最先进的LLMs，结合受控人工评估、偏差缓解的LLM作为评判协议、基于外部知识的事实性验证以及语料库级别的语言分析。我们的发现揭示了一个更为细致的图景：人类参考摘要继续在信息量和忠实度方面展现出优势，而LLM输出主要在表面连贯性和流畅性上更受青睐。事实性验证表明，人类参考摘要仍然更可靠，尤其是对于涉及推理或综合的声明，而语言分析揭示了不同模型之间风格同质化的模式。这些观察表明，当前的LLMs提高了摘要生成的质量下限，但其性能上限仍低于人类能力。

英文摘要

The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written references, raising questions about whether summarization remains an open research problem. We re-examine this narrative through a multi-track evaluation covering five diverse datasets and five state-of-the-art LLMs, combining controlled human assessment, bias-mitigated LLM-as-Judge protocols, factuality verification against external knowledge, and corpus-level linguistic analysis. Our findings reveal a more nuanced landscape in which human reference summaries continue to demonstrate advantages in informativeness and faithfulness, whereas LLM outputs are preferred mainly for surface-level coherence and fluency. Factuality verification indicates that human references remain more reliable, particularly for claims involving reasoning or synthesis, and linguistic analysis uncovers a pattern of stylistic homogeneity across different models. These observations suggest that current LLMs have raised the floor of summarization quality, but the ceiling of their performance remains below human capabilities.

2606.08025 2026-06-09 cs.CL 新提交

Arabic Sentence Segmentation Across Genres and Punctuation Conditions

跨体裁与标点条件下的阿拉伯语句子分割

Mohammed Elkholy, Khalid N. Elmadani, Nizar Habash, Bashar Alhafni

发表机构 * Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； New York University Abu Dhabi（纽约大学阿布扎比分校）

AI总结针对阿拉伯语标点歧义和不一致导致的句子分割难题，构建跨8种体裁的语料库AraSEG，评估LLM、轻量编码器和依存解析器，发现轻量模型在困难设置下优于LLM，且准确分割能显著提升下游依存解析。

AI中文摘要

阿拉伯语的句子分割因标点符号的歧义和不一致而具有挑战性，许多文本缺乏可靠的句子边界标记。现有方法严重依赖标点线索，且通常在格式良好的文本上进行评估，限制了其在真实阿拉伯语环境中的鲁棒性。为解决这一问题，我们引入了AraSEG，一个跨体裁的句子分割语料库，涵盖八种体裁以及广泛的标点和文档结构条件。利用AraSEG，我们在日益具有挑战性的分割设置下评估了LLM、轻量级编码器模型和基于依存解析器的模型。我们的实验表明，在最困难的设置中，轻量级编码器甚至基于依存解析器的模型都优于LLM。我们进一步研究了训练数据规模和体裁多样性的影响，发现性能最终会饱和，且跨体裁泛化仍然具有挑战性。我们还证明了准确的句子分割能显著改善下游的依存解析。我们将公开我们的代码、数据和模型。

英文摘要

Sentence segmentation in Arabic is challenging due to ambiguous and inconsistent punctuation, with many texts lacking reliable sentence boundary markers. Existing approaches rely heavily on punctuation cues and are typically evaluated on well-formed text, limiting their robustness in realistic Arabic settings. To address this, we introduce AraSEG, a genre-diverse sentence segmentation corpus spanning eight genres and a wide range of punctuation and document structure conditions. Using AraSEG, we evaluate LLMs, lightweight encoder models, and dependency parser-based models under increasingly challenging segmentation settings. Our experiments show that lightweight encoders, and even dependency parser-based models, outperform LLMs in the most challenging settings. We further investigate the effects of training data size and genre diversity, finding that performance eventually saturates and cross-genre generalization remains challenging. We also demonstrate that accurate sentence segmentation substantially improves downstream dependency parsing. We make our code, data, and models publicly available.

2606.08071 2026-06-09 cs.CL 新提交

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

SurgiQ: 用于评估大语言模型手术理解的大规模多领域基准

Ayah Al-Naji, Edoardo Fazzari, Saif Alkindi, Hamdan Alhadhrami, Preslav Nakov, Cesare Stefanini

发表机构 * Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）

AI总结提出SurgiQ基准，包含13,055道多选题，覆盖六个外科领域和四种题型，用于评估LLM的手术推理能力。实验显示最佳模型准确率仅68.1%，通用模型优于多数生物医学模型，表明当前医学专业化未能充分覆盖手术知识。

AI中文摘要

大语言模型在外科领域的可靠评估仍不成熟。广泛的医学基准测试临床知识，而手术需要程序性推理、管理权衡、否定处理以及在合理手术决策中的选择。我们提出SurgiQ，一个纯文本、基于来源的基准，包含13,055道四选一多选题，涵盖六个外科领域和四种题型：基于案例、推理、最佳选项和否定题。SurgiQ通过多阶段生成、验证和专家审核流程，从外科教科书、开放获取论文和考试材料构建。我们在统一的log-likelihood协议下评估了35个开源权重LLM。结果显示仍有很大提升空间：较小模型通常接近25%的随机基线，而最佳模型达到68.1%的准确率。通用模型，尤其是Qwen2.5，优于大多数生物医学模型，表明当前的医学专业化尚未提供足够广泛的外科覆盖。校准和错误分析进一步表明，即使是强模型也会在临床合理的干扰项上犯自信的错误，这促使进行更可靠和更广泛的外科LLM评估。

英文摘要

Reliable evaluation of large language models in surgery remains underdeveloped. Broad medical benchmarks test clinical knowledge, while surgery requires procedural reasoning, management trade-offs, negation handling, and selection among plausible operative decisions. We present SurgiQ, a text-only, source-grounded benchmark of 13,055 four-option multiple-choice questions spanning six surgical domains and four question formats: case-based, reasoning, best-option, and negative. SurgiQ is constructed from surgical textbooks, open-access papers, and examination material using a multi-stage generation, verification, and expert-audit pipeline. We evaluate 35 open-weight LLMs under a unified log-likelihood protocol. Our results show substantial remaining headroom: smaller models often remain near the 25% random baseline, while the best model reaches 68.1% accuracy. General-purpose models, especially Qwen2.5, outperform most biomedical models, suggesting that current medical specialization does not yet provide sufficiently broad surgical coverage. Calibration and error analysis further show that even strong models make confident mistakes on clinically plausible distractors, motivating more reliable and broader surgical LLM evaluation.
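
A log-likelihood protocol for four-option questions is commonly implemented by scoring each option under the model and picking the highest-scoring one. The sketch below assumes the per-option log-likelihoods have already been computed; the numbers, the item format, and whether the scores are length-normalized are all assumptions, since the abstract does not give those details.

```python
from typing import Dict, List


def pick_option(option_logliks: Dict[str, float]) -> str:
    """Choose the option whose continuation has the highest log-likelihood."""
    return max(option_logliks, key=option_logliks.get)


def accuracy(items: List[dict]) -> float:
    """Fraction of items where the top-scoring option is the gold answer."""
    correct = sum(pick_option(item["logliks"]) == item["answer"] for item in items)
    return correct / len(items)


# Hypothetical scored items (log-likelihood values are made up).
items = [
    {"logliks": {"A": -4.2, "B": -1.3, "C": -3.8, "D": -5.0}, "answer": "B"},
    {"logliks": {"A": -2.0, "B": -2.5, "C": -0.9, "D": -3.1}, "answer": "A"},
]

print(accuracy(items))
```

With four options per question, uniform guessing gives an expected accuracy of 1/4, which is the 25% random baseline the abstract refers to.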

2606.08077 2026-06-09 cs.CL 新提交

Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics

支持向量评分准则：弥合自生成与人工评分准则之间的差距

Mengyuan Sun, Yu Li, Zhuohao Yu, Shikun Zhang, Wei Ye

发表机构 * National Engineering Research Center for Software Engineering, Peking University（北京大学软件工程国家工程研究中心）； University of Science and Technology of China（中国科学技术大学）

AI总结针对自生成评分准则在困难实例上落后于人工标注的问题，提出SVR框架，将准则构建转化为偏好数据上的最大间隔边界学习，通过对比特征挖掘、提示条件选择器和迭代优化，显著缩小与人工准则的差距，并展现出广泛的奖励建模能力。

AI中文摘要

基于评分准则的评估是评判大语言模型（LLM）输出的一种有前景的范式，然而在困难实例上，自生成准则落后于人工标注的准则。我们认为这一判别差距反映了目标不匹配：自生成准则描述好的回答，而有效的准则必须区分相近的候选。为弥合这一差距，我们引入SVR（支持向量评分准则），一个将准则构建重新表述为偏好数据上的最大间隔边界学习的框架。SVR从偏好对中挖掘对比特征存入准则库，学习一个提示条件化的选择器以及全局准则权重，并通过支持对选择和对抗性探测困难负例来迭代优化准则库。在推理时，仅给定提示，SVR从库中检索顶级准则并对回答进行评分。在RubricBench上，SVR将差距从24.1分缩小到0.3分，并优于强自生成准则和评判基线，且学习到的准则库无需重新训练即可跨评判迁移。在RewardBench 1&2和RM-Bench上，它仍与专用奖励模型保持竞争力，展示了更广泛的奖励建模能力。总体而言，边界定义的准则为弥合LLM评估中的判别差距提供了一条原则性路径。

英文摘要

Rubric-based evaluation is a promising paradigm for judging large language model (LLM) outputs, yet self-generated rubrics lag human-annotated criteria on hard instances. We argue this discriminative gap reflects an objective mismatch: self-generated rubrics describe good responses, whereas effective criteria must discriminate between close candidates. To close this gap, we introduce SVR (Support Vector Rubrics), a framework that recasts rubric construction as max-margin boundary learning over preference data. SVR mines contrastive features from preference pairs into a rubric bank, learns a prompt-conditioned selector together with global rubric weights, and iteratively refines the bank through support-pair selection and adversarial probing of hard negatives. At inference, given only the prompt, SVR retrieves the top-rubrics from the bank and scores responses. On RubricBench, SVR narrows the gap to human reference rubrics from 24.1 to 0.3 points and outperforms strong self-rubric and judge baselines, and the learned bank transfers across judges without retraining. On RewardBench 1&2, and RM-Bench, it remains competitive with dedicated reward models, demonstrating broader reward modeling capability. Overall, boundary-defining rubrics offer a principled route to closing the discriminative gap in LLM evaluation.

2606.08092 2026-06-09 cs.CL 新提交

When Languages Disagree: Self-Evolving Multilingual LLM Judges

当语言不一致时：自我进化的多语言LLM评判者

Xiyan Fu, Wei Lu

发表机构 * Nanyang Technological University（南洋理工大学）

AI总结提出SEMJ方法，利用多语言评判中的跨语言不一致性进行迭代自我反思与重新评估，在多个基准上优于投票和反思基线，提升准确性和跨语言一致性。

AI中文摘要

多语言LLM-as-a-judge被广泛用于跨语言评估模型输出，但存在跨语言不一致性问题（Fu and Liu, 2025）。现有方法通常将这种不一致性视为噪声，并通过投票或聚合来缓解。在本工作中，我们反而表明多语言不一致性可以提供互补的评估信号。我们的oracle分析发现，跨语言采样判断比单语言判断能获得更高的性能上限，表明不同语言可能包含互补的判断。受此发现启发，我们提出SEMJ，一种自我进化的多语言评判者，利用跨语言不一致性进行迭代优化。SEMJ为每个输入构建多语言变体，收集独立的判断和理由，并将不一致的输出反馈给自我反思和重新评估。在多个基准上的实验表明，SEMJ在准确性和跨语言一致性上始终优于投票和反思基线。进一步分析表明，不一致性触发了有用的重新评估，从而提高了判断质量。

英文摘要

Multilingual LLM-as-a-judge is widely used to evaluate model outputs across languages, but suffers from cross-lingual inconsistency (Fu and Liu, 2025). Existing methods typically treat this inconsistency as noise and mitigate it through voting or aggregation. In this work, we instead show that multilingual inconsistency can provide complementary evaluation signals. Our oracle analysis finds that sampling judgments across languages yields a higher performance upper bound than single-language judging, indicating that different languages potentially include complementary judgments. Motivated by this finding, we propose SEMJ, a self-evolving multilingual judge that leverages cross-lingual inconsistency for iterative refinement. SEMJ constructs multilingual variants of each input, collects independent judgments and rationales, and feeds inconsistent outputs back for self-reflection and re-evaluation. Experiments on multiple benchmarks show that SEMJ consistently outperforms voting and reflection baselines in both accuracy and cross-lingual consistency. Further analysis shows that inconsistency triggers useful re-evaluation, which improves judgment quality.
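
The contrast between a voting baseline and disagreement-triggered re-evaluation can be sketched as a small loop. This is a simplified illustration, not the SEMJ procedure itself: the `judge` callable is a hypothetical stand-in for an LLM call, rationales are omitted, and falling back to a majority vote after the last round is an assumption.

```python
from collections import Counter
from typing import Callable, Dict, Optional

Judgments = Dict[str, str]  # language code -> verdict label


def majority_vote(judgments: Judgments) -> str:
    """Voting baseline: the most frequent verdict across languages."""
    return Counter(judgments.values()).most_common(1)[0][0]


def is_inconsistent(judgments: Judgments) -> bool:
    """True when the language variants do not all agree."""
    return len(set(judgments.values())) > 1


def refine(judge: Callable[[str, str, Optional[Judgments]], str],
           variants: Dict[str, str], max_rounds: int = 3) -> str:
    """Re-judge while verdicts disagree, feeding back the previous verdicts."""
    feedback: Optional[Judgments] = None
    judgments: Judgments = {}
    for _ in range(max_rounds):
        judgments = {lang: judge(lang, text, feedback)
                     for lang, text in variants.items()}
        if not is_inconsistent(judgments):
            break
        feedback = judgments  # expose the disagreement to the next round
    return majority_vote(judgments)


def toy_judge(lang: str, text: str, feedback: Optional[Judgments]) -> str:
    """Toy judge: the zh variant disagrees until it sees the other verdicts."""
    return "B" if lang == "zh" and feedback is None else "A"


variants = {"en": "prompt (en)", "de": "prompt (de)", "zh": "prompt (zh)"}
print(refine(toy_judge, variants))
```

The point of the sketch is that disagreement acts as a trigger for another pass instead of being averaged away in a single vote.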

2606.08194 2026-06-09 cs.CL cs.AI 新提交

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

GlobeAudio：用于大型音频-语言模型自然主义评估的多语言多文化基准

Ryner Tan, Wenxuan Zhang

发表机构 * Singapore University of Technology and Design（新加坡科技设计大学）

AI总结提出GlobeAudio基准，包含5637道多语言多选题，评估大型音频-语言模型在自然音频条件下的听觉推理和文化理解能力，发现开源模型和低资源语言存在显著性能差距。

AI中文摘要

大型音频-语言模型（LALMs）在统一框架中整合了音频感知和语言理解，支持广泛的实际应用。尽管近期取得了进展，但LALMs的评估相对于实际需求仍严重不足：大多数评估缺乏真正的语言和文化真实性，而其他评估则未能捕捉声学真实性。为弥补这一差距，我们提出了GlobeAudio，一个旨在评估自然音频理解的多语言和多文化基准。GlobeAudio包含5637道多项选择题，涵盖六种类型多样的语言，由母语者基于自然发生的音频精心制作。为了表现良好，模型必须具有更高层次的听觉推理技能和文化基础的解释。我们系统地评估了代表性的闭源和开源LALMs，以及级联的ASR-LLM流水线。我们的实验揭示了在自然声学条件下的显著性能差距，特别是对于开源模型和低资源语言。这些发现凸显了当前LALMs的关键局限性，并强调了自然音频评估对未来音频-语言系统的重要性。GlobeAudio可在https://huggingface.co/datasets/iNLP-Lab/GlobeAudio 获取。

英文摘要

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most evaluations lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers and grounded in naturally occurring audio. To do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio .

2606.08272 2026-06-09 cs.CL cs.AI 新提交

AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers

AgriGov：面向印度政府农民计划的结构化多语言数据集整理

Mohsina Bilal, Gopakumar G

发表机构 * National Institute of Technology Calicut（国立卡利卡特理工学院）

AI总结提出AgriGov三语数据集，通过自动抓取、翻译流水线和人工后编辑构建约8000句对齐的农业政策领域平行语料，支持机器翻译、问答等应用。

Comments 15 pages, 4 figures, Submitted to: Sadhana, Elsevier

AI中文摘要

AgriGov是一个精心整理的三语（英语-印地语-马拉地语）数据集，旨在解决农业政策和农民福利计划领域缺乏领域基础的多语言资源的问题。最初，我们使用自动抓取技术从可信门户收集并结构化50个政府计划的数据，将其组织到预定义的语义字段（如标题、资格、申请流程、文件、排除项）。翻译通过结合Google Translate API、MarianMT和人工后编辑的流水线进行，生成了一个包含约2100个源片段的领域特定印地语-马拉地语数据集。为了增强覆盖范围，我们用Samanantar语料库中的句子扩充了该数据集，产生了约8000个句子对齐的印地语-马拉地语平行对。该数据集现在为微调该领域的机器翻译模型提供了强大的资源。AgriGov专为领域自适应机器翻译、问答、信息检索和摘要系统等应用而设计。其主要贡献是一个模式驱动、人工校正的多语言对齐流水线，确保领域保真度、提供来源并支持可重复实验，从而为面向农民的工具实现检索增强应用。

英文摘要

AgriGov is a curated, trilingual (English-Hindi-Marathi) dataset designed to address the scarcity of domain-grounded multilingual resources for agricultural policies and farmer welfare schemes. Initially, we collected and structured data from 50 government schemes sourced from trusted portals using automated scraping techniques, organizing it into predefined semantic fields (e.g., title, eligibility, application process, documents, exclusions). Translations were performed using a pipeline combining Google Translate API, MarianMT, and human post-editing, resulting in a domain-specific Hindi-Marathi dataset comprising approximately 2100 source segments. To enhance coverage, we augmented this dataset with sentences from the Samanantar corpus, leading to approximately 8,000 sentence-aligned Hindi-Marathi parallel pairs. The dataset now offers robust resources for fine-tuning machine translation models in this domain. AgriGov is designed for applications in domain-adaptive machine translation, question answering, information retrieval, and summarization systems. Its key contribution is a schema-driven, human-corrected multilingual alignment pipeline that ensures domain fidelity, provides provenance, and supports reproducible experiments, enabling retrieval-augmented applications for farmer-facing tools.

2606.08417 2026-06-09 cs.CL cs.AI 新提交

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

破解生成困惑度：为何无条件文本评估需要分布度量

Antonio Franca, Alexander Tong

AI总结本文指出生成困惑度（gen-PPL）作为非自回归语言模型评估指标存在缺陷，通过构造零参数朴素采样器在LM1B和OpenWebText上达到SOTA gen-PPL但生成不连贯文本，建议采用直接量化生成文本与参考文本分布差异的评估套件。

Comments Accepted to the Workshop on Structured Probabilistic Inference & Generative Modeling (SPIGM) at ICML 2026

AI中文摘要

扩散和连续流语言模型已成为语言建模中领先的非自回归替代方案。这两种范式的进展主要通过生成困惑度（gen-PPL）来衡量：在冻结的自回归（AR）评分器（如gpt2-large）下，样本的每个token的负对数似然，通常配以经验熵护栏来排除低熵崩溃。我们认为该度量不健全。从构造上看，gen-PPL仅衡量在评分AR下的可预测性，而非语法性或语义连贯性——而可预测但低质量的序列集合在组合上非常庞大。为了具体说明这一点，我们构建了一套零参数、故意朴素的采样器，在LM1B和OpenWebText上以非退化熵实现了最先进的gen-PPL，超越了最近发布的扩散和连续流模型，同时生成的文本在构造上是不连贯的。我们推荐直接量化生成文本与参考文本之间分布差异的评估套件，并使用这样的套件重新基准测试最近的非自回归模型，从而更真实地反映当前的最新技术水平。

英文摘要

Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternatives to language modeling. Progress in both paradigms is overwhelmingly tracked by generative perplexity (gen-PPL): the per-token negative log-likelihood of samples under a frozen autoregressive (AR) scorer such as gpt2-large, typically paired with an empirical-entropy guardrail to rule out low-entropy collapse. We argue that this metric is unsound. By construction, gen-PPL measures only predictability under the scoring AR, not grammaticality or semantic coherence -- and the set of predictable but still low-quality sequences is combinatorially large. To make this concrete, we construct a suite of zero-parameter, deliberately naive samplers that achieve state-of-the-art gen-PPL on LM1B and OpenWebText at non-degenerate entropy, surpassing recently published diffusion and continuous-flow models while producing text that is incoherent by construction. We recommend evaluation suites that directly quantify the distributional divergence between generated and reference text, and use such a suite to re-benchmark recent non-autoregressive models, recovering a more faithful picture of the current state of the art.
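
For readers unfamiliar with the metric, generative perplexity is the exponential of the mean per-token negative log-likelihood assigned by the scorer. The sketch below assumes the per-token natural-log probabilities have already been obtained from a frozen scorer; the sample values are hypothetical, and the unigram-entropy function is only one simple way an empirical-entropy guardrail might be computed.

```python
import math
from collections import Counter
from typing import List, Sequence


def generative_perplexity(token_logprobs: List[Sequence[float]]) -> float:
    """exp of the mean negative log-likelihood over all tokens of all samples."""
    n_tokens = sum(len(sample) for sample in token_logprobs)
    total_logprob = sum(sum(sample) for sample in token_logprobs)
    return math.exp(-total_logprob / n_tokens)


def unigram_entropy(tokens: Sequence[str]) -> float:
    """Empirical entropy (in bits) of the token frequency distribution."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Hypothetical scorer outputs: every token assigned probability 0.5.
samples = [[math.log(0.5)] * 3, [math.log(0.5)] * 5]
print(generative_perplexity(samples))
print(unigram_entropy(["a", "b", "a", "b"]))
```

Both quantities depend only on how predictable and how varied the tokens are, not on whether the text is grammatical or coherent, which is the gap the abstract argues a naive sampler can exploit.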

2606.08605 2026-06-09 cs.CL 新提交

Multilingual Fact-Checking at Scale: Fine-Tuned Compact Models vs LLMs

大规模多语言事实核查：微调紧凑模型 vs 大语言模型

Pratuat Amatya, Vinay Setty

发表机构 * Factiverse

AI总结提出一个多语言事实核查系统，通过微调XLM-RoBERTa、mmBERT和SetFit模型，在114种语言的声明检测和28种语言的真实性预测中，与GPT-5.2等LLM相比，展示了紧凑模型的高效和稳定性能。

详情

AI中文摘要

我们提出了一个部署在Factiverse的多语言事实核查系统，旨在跨多种语言实现高吞吐量和低延迟操作。该系统遵循模块化流水线，包含三个阶段：声明检测、证据检索与重排序，以及真实性预测。我们微调了XLM-RoBERTa-Large用于声明检测，mmBERT-base用于三标签立场分类（支持/反驳/混合），以及一个基于SetFit的多语言重排序器用于声明-证据匹配。我们将这些组件与强大的LLM基线进行比较，包括GPT-5.2、Claude Opus 4.6和Qwen3-8b。在涵盖114种语言的声明检测和28种语言的真实性预测的生产数据上的实验表明，任务特定的微调提供了强大且稳定的多语言性能，而微调的检索模型与现代专有嵌入保持竞争力。相同硬件上的延迟测量进一步显示，基于编码器的组件具有巨大的效率提升，支持其在具有严格成本和隐私约束的生产部署中使用。总体而言，紧凑的微调自托管模型仍然是大规模多语言事实核查的实用且有效的基础。本研究的代码和数据可在https://github.com/factiverse/factcheck-editor获取。

英文摘要

We present a multilingual fact-checking system deployed at Factiverse, designed for high-throughput and low-latency operation across diverse languages. The system follows a modular pipeline with three stages: claim detection, evidence retrieval and re-ranking, and veracity prediction. We fine-tune XLM-RoBERTa-Large for claim detection, mmBERT-base for three-label stance classification (Supports/Refutes/Mixed), and a SetFit-based multilingual re-ranker for claim-evidence matching. We compare these components against strong LLM baselines, including GPT-5.2, Claude Opus 4.6, and Qwen3-8b. Experiments on production data spanning 114 languages for claim detection and 28 languages for veracity prediction show that task-specific fine-tuning provides strong and stable multilingual performance, while the fine-tuned retrieval model remains competitive with modern proprietary embeddings. Same-hardware latency measurements further show large efficiency gains for encoder-based components, supporting their use in production deployments with tight cost and privacy constraints. Overall, compact fine-tuned, self-hosted models remain a practical and effective foundation for multilingual fact-checking at scale. Code and data used for this study are available at https://github.com/factiverse/factcheck-editor.

2606.08625 2026-06-09 cs.CL 新提交

From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape

从整体评估到结构化标准：大语言模型演变中的评分准则

Hao Chen, Ziyu Han, Yukun Yan, Qingfu Zhu, Maosong Sun, Wanxiang Che

发表机构 * Research Center for Social Computing and Interactive Robotics（社会计算与交互机器人研究中心）； Department of Computer Science and Technology, Institute for AI（计算机科学与技术系，人工智能研究院）

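AI summary: Proposes the rubric as a unifying framework that connects human intentions and machine behavior at three levels: decomposing holistic judgments into verifiable dimensions, providing process-level feedback, and emerging dynamically from model behavior.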

详情

AI中文摘要

随着大型语言模型（LLMs）向开放式自主智能体发展，用于评估和引导其行为的机制也必须相应演进。本文引入评分准则作为捕捉这一演进的统一框架，将其描述为对LLM范式转变的动态响应，这种响应在评估、强化学习和安全对齐等看似独立的工作中反复出现。我们将评分准则定义为将复杂质量判断转化为结构化、可操作标准的一组显式标准，并证明其在上述研究线索中的反复出现并非巧合。我们系统地整理了现有的评分准则设计，考察了其构建与优化，并分析了它们在评估和训练中的作用。评分准则在三个逐渐深入的层面体现：在评估层面，它们将整体判断分解为可验证的维度；在训练层面，它们作为密集的反馈信号，在标量奖励不足时提供过程级指导；在内在层面，它们从模型行为中动态涌现，驱动自我改进。我们进一步评估了评分准则在生成质量、执行保真度、理论约束和安全威胁方面的可靠性，并调查了跨领域的基于评分准则的基准。通过使评估透明且可分解，评分准则将人类价值期望转化为机器可学习的信号，成为人类意图与机器行为之间的持久桥梁。

英文摘要

As Large Language Models (LLMs) advance toward open-ended autonomous agents, the mechanisms used to evaluate and guide their behavior must evolve accordingly. This work introduces the rubric as a unifying framework capturing this evolution, characterizing rubrics as a dynamic response to successive LLM paradigm shifts that recurs across otherwise independent efforts in evaluation, reinforcement learning, and safety alignment. We define rubrics as explicit criteria sets that transform complex quality judgments into structured and actionable standards, and demonstrate that their recurrence across these research threads is not coincidental. We systematically organize existing rubric designs, examine their construction and optimization, and analyze their role across evaluation and training. Rubrics manifest at three progressively deeper levels: at the evaluative level, they decompose holistic judgments into verifiable dimensions; at the training level, they serve as dense feedback signals providing process-level guidance where scalar rewards fall short; at the intrinsic level, they emerge dynamically from model behaviors, driving self-improvement. We further assess rubric reliability across generation quality, execution fidelity, theoretical constraints, and security threats, before surveying rubric-based benchmarks across diverse domains. By rendering assessment transparent and decomposable, rubrics translate human value expectations into machine-learnable signals, serving as the enduring bridge between human intentions and machine behavior.

2606.08715 2026-06-09 cs.CL 新提交

Operationalizing Linguistic Methods through Prompt-Engineering Skills: An Automatic Chinese Web Neologism Detection Pipeline

通过提示工程技能操作化语言学方法：一种自动中文网络新词检测流水线

Yufeng Wu, Meichun Liu

发表机构 * City University of Hong Kong（香港城市大学）

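AI summary: Proposes an automatic Chinese web neologism detection method that turns traditional linguistic identification principles into prompt-engineering skills. A four-stage pipeline detects 4,853 neologisms in 267 million documents, and the evaluation identifies candidate coverage and LLM semantic judgment as the bottlenecks.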

详情

AI中文摘要

我们提出了一种自动中文网络新词检测方法，该方法将传统语言学识别原则操作化为提示工程技能。该方法包括四个阶段：基于字符n-gram的与分词器无关的候选生成；基于点互信息预过滤的词典锚定；基于中文构词原则的构词合法性技能；以及结合规则和三元分类技能来区分新词、实体和无。将该方法应用于BAAI CCI 3.0语料库（2.67亿文档），产生了226,959个分类候选，其中包括4,853个标注新词。为了评估该方法，我们开发了逐阶段条件召回分解，其中流水线的严格召回在数学上分解为各阶段条件召回的乘积。应用于Hou（2023）（4,199个条目），该分解揭示了阶段1候选覆盖和阶段4B LLM语义判断是两个瓶颈（召回率分别为41.5%和60.0%），而中间阶段接近无损。进一步的长度分层分析表明，结构构词合法性技能与长度无关（>= 96.9%），而语义新颖性分类技能与长度相关（2/3/4字符候选分别为65.6%/59.0%/44.1%），描绘了基于技能的语言学操作化的当前边界。我们将该方法、流水线输出和评估协议作为公共资源发布。

英文摘要

We present a method for automatic Chinese web neologism detection that operationalizes traditional linguistic identification principles as prompt-engineering skills. The method has four stages: tokenizer-independent character n-gram candidate generation; dictionary anchoring with a Pointwise Mutual Information pre-filter; a well-formedness skill based on Chinese word-formation principles; and a combined rule and three-way classification skill that distinguishes neologism, entity, and none. Applied to the BAAI CCI 3.0 corpus (267M documents), the method produces 226,959 classified candidates including 4,853 labeled neologisms. To evaluate the method, we develop a per-stage conditional recall decomposition in which the pipeline's strict recall factors mathematically into the product of stage conditional recalls. Applied to Hou (2023) (4,199 entries), the decomposition exposes Stage 1 candidate coverage and Stage 4B LLM semantic judgment as the two bottlenecks (R=41.5% and 60.0% respectively), while intermediate stages are near-lossless. A length-stratified analysis further reveals that the structural well-formedness skill is length-invariant (>= 96.9%) whereas the semantic novelty-classification skill is length-dependent (65.6%/59.0%/44.1% across 2/3/4-character candidates), mapping a current boundary of skill-based linguistic operationalization. We release the method, pipeline outputs, and evaluation protocol as public resources.
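The per-stage recall decomposition described above can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: the survivor counts are hypothetical and were chosen only to show that the product of stage conditional recalls telescopes to the strict end-to-end recall.

```python
from math import prod

def conditional_recalls(survivors):
    """survivors[0] is the gold-list size; survivors[k] is how many gold
    entries are still retained after stage k. Returns one recall per stage."""
    return [after / before for before, after in zip(survivors, survivors[1:])]

def strict_recall(stage_recalls):
    """Strict pipeline recall as the product of the stage conditional recalls."""
    return prod(stage_recalls)

# Hypothetical counts (not from the paper): 1000 gold entries, four stages.
survivors = [1000, 400, 380, 370, 222]
recalls = conditional_recalls(survivors)
overall = strict_recall(recalls)
```

Because the product telescopes, `overall` equals the final survivor count divided by the gold-list size, so a single low stage recall caps the whole pipeline.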

2606.08769 2026-06-09 cs.CL cs.AI 新提交

RadOT-Eval: Auditable Structured-Evidence Transport for Radiology Report Evaluation

RadOT-Eval：用于放射学报告评估的可审计结构化证据传输

Weixin Liu, Juming Xiong, Yang Li, Qingyuan Song, Susannah Rose, Murat Kantarcioglu, Bradley Malin, Zhijun Yin

发表机构 * University of California, Berkeley（加州大学伯克利分校）

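AI summary: Proposes RadOT-Eval, a framework that aligns structured clinical evidence with optimal transport. On an independent test set, its Spearman correlation with annotated error burden is higher than that of standard metrics and an LLM-based evaluator.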

Comments 10 pages, 1 figure, 13 tables

详情

AI中文摘要

自动评估对于高风险文本生成至关重要，其中的错误通常涉及遗漏发现、幻觉内容、极性反转、位置变化、不确定性不匹配和时间比较错误，而不仅仅是低表面相似性。放射学报告生成提供了一个具有挑战性的测试案例，因为生成的报告必须跨来源保留结构化临床证据。我们提出了RadOT-Eval，一个可解释的结构化证据最优传输框架，用于离线审计放射学报告生成。RadOT-Eval将参考报告和候选报告分解为属性结构化的临床证据单元，使用熵正则化最优传输对齐相应的证据，并在单调风险模型中使用具有临床意义的侧信道差异来预测错误负担。所有传输、特征和读出选择均使用ReXVal数据集确定，并在独立的RadEvalX数据集上评估冻结系统。RadOT-Eval与总错误负担、临床显著错误负担和临床不显著错误负担的斯皮尔曼相关系数分别为0.715、0.548和0.399，其点估计值高于标准评估指标和基于开源大语言模型（LLM）的评估器GREEN-radllama2-7B。在ReXErr-v1上的冻结辅助损坏敏感性压力测试中，RadOT-Eval达到了0.768的AUROC，以及0.990的“损坏报告得分高于干净报告”配对胜率。这些结果表明，在仅使用ReXVal进行模型选择并在RadEvalX上进行冻结测试的条件下，结构化证据传输为高风险生成的临床文本提供了一个可审计、面向排序的评估工具。

英文摘要

Automatic evaluation is critical for high-stakes text generation, where errors often involve omitted findings, hallucinated content, polarity reversals, location changes, uncertainty mismatches, and temporal-comparison errors rather than low surface similarity alone. Radiology report generation provides a challenging test case because generated reports must preserve structured clinical evidence across sources. We present RadOT-Eval, an interpretable structured-evidence optimal transport framework for offline auditing of radiology report generation. RadOT-Eval decomposes reference and candidate reports into attribute-structured clinical evidence units, aligns corresponding evidence using entropy-regularized optimal transport, and uses clinically meaningful side-channel discrepancies in a monotone risk model to predict error burden. All transport, feature, and readout choices are selected using the ReXVal dataset, and the frozen system is evaluated on the independent RadEvalX dataset. RadOT-Eval achieves Spearman correlations of 0.715, 0.548, and 0.399 with total, clinically significant, and clinically insignificant annotated error burden, respectively, yielding higher point estimates than standard evaluation metrics and the open-source large language model (LLM)-based evaluator GREEN-radllama2-7B. In a frozen auxiliary corruption-sensitivity stress test on ReXErr-v1, RadOT-Eval achieves 0.768 AUROC and a 0.990 corrupted-greater-than-clean paired win rate. These results show that structured evidence transport provides an auditable, rank-oriented evaluation tool for high-stakes generated clinical text under ReXVal-only model selection and frozen RadEvalX testing.
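To make the alignment step concrete, here is a minimal sketch of entropy-regularized optimal transport using Sinkhorn iterations in plain Python. The cost matrix, the uniform masses, and the regularization strength `eps` are illustrative assumptions; how RadOT-Eval actually builds costs from clinical evidence units is not specified in the abstract.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=500):
    """Entropy-regularized OT plan between histograms a and b (Sinkhorn scaling)."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternate row and column rescaling so the plan matches both marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical costs between two reference and two candidate evidence units:
# matching units cost 0, mismatched units cost 1.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, [0.5, 0.5], [0.5, 0.5])
```

The resulting plan puts almost all mass on the zero-cost pairs, which is the behavior an evidence-alignment step relies on; unmatched mass would indicate omitted or hallucinated findings.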

2606.08878 2026-06-09 cs.CL cs.MA 新提交

PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting

PerspectiveGap: 多智能体编排提示的基准测试

Youran Sun, Xingyu Ren, Kejia Zhang, Xinpeng Liu, Jiaxuan Guo

发表机构 * University of Maryland（马里兰大学）； The Chinese University of Hong Kong（香港中文大学）； Stanford University（斯坦福大学）

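AI summary: Introduces the PerspectiveGap benchmark, which evaluates how well LLMs write orchestration prompts for multi-agent systems. The evaluated models reach an average pass rate of only 14.9%, indicating that this capability is distinct and under-evaluated.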

详情

AI中文摘要

现实世界的LLM应用正从单智能体工作流转向编排的多智能体系统，但当前模型仍难以确定每个子智能体需要知道什么。为衡量这一点，我们引入了PerspectiveGap，一个用于评估LLM为多智能体系统编写编排提示能力的基准。PerspectiveGap包含110个场景，每个场景通过两种干扰混合任务格式评估：角色片段分配和自由形式提示编写。这些场景被组织成10种拓扑结构，这些拓扑结构源自作者的真实工程实践，并遵循提示经济原则：构建以循环为中心的编排，以最小的角色和工程开销最大化效用。在对来自10家公司的27个商业模型进行的实验中，GPT-5.5大幅超越所有竞争对手，而Opus 4.7尽管编码性能强劲，但在编排提示方面表现出明显弱点。尽管如此，PerspectiveGap仍然具有挑战性：评估模型平均综合通过率仅为14.9%（GPT-5.5为62.0%），平均总体泄漏率为246.5%（每个场景的信息泄漏事件计数，而非比例；GPT-5.5为49.1%）。这些发现表明，多智能体编排提示是一种独特且未被充分评估的能力，而PerspectiveGap为系统衡量和改进该能力提供了基础。

英文摘要

Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability to compose orchestration prompts for multi-agent systems. PerspectiveGap contains 110 scenarios, each evaluated through two distractor-mixed task formats: role-fragment assignment and free-form prompt writing. These scenarios are organized into 10 topologies, which are distilled from the authors' real-world engineering practice and framed by the Prompt Economy principle: building loop-centered orchestrations that maximize utility with minimal role and engineering overhead. In experiments with 27 commercial models from 10 companies, GPT-5.5 substantially outperforms all competitors, whereas Opus 4.7 shows a notable weakness in orchestration prompting despite its strong coding performance. Nevertheless, PerspectiveGap remains challenging: the evaluated models achieve an average combined pass rate of only 14.9% (GPT-5.5 62.0%) and an average overall leakage rate of 246.5% (a per-scenario information leak-event count, not a proportion; GPT-5.5 49.1%). These findings suggest that multi-agent orchestration prompting is a distinct and under-evaluated capability, and PerspectiveGap provides a foundation for measuring and improving it systematically.

2606.08988 2026-06-09 cs.CL cs.LG 新提交

Structure-Aware Modeling of Multiple-Choice Questions Improves Automatic Difficulty Estimation

选择题的结构感知建模改进自动难度估计

Gabriel Ortega, Abelino Jiménez, Séverin Lions, Pablo Dartnell

发表机构 * Centro de Investigación Avanzada en Educación (CIAE), Instituto de Estudios Avanzados en Educación (IE), Universidad de Chile（智利大学高级教育研究中心（CIAE），高级教育研究所（IE））； Departamento de Evaluación, Medición y Registro Educacional (DEMRE), Universidad de Chile（智利大学评估、测量与教育注册系（DEMRE））； Centro de Modelamiento Matemático (CMM), Universidad de Chile（智利大学数学建模中心（CMM））； Departamento de Ingeniería Matemática (DIM), Universidad de Chile（智利大学数学工程系（DIM））

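AI summary: Proposes structure-aware models that encode each multiple-choice distractor as a separate input and aggregate them in an order-aware or order-invariant way to improve difficulty prediction, reaching R² = 0.83 on Natural Sciences and R² = 0.71 on Social Sciences items.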

Comments 30 pages, 1 table, 2 figures

详情

AI中文摘要

自动题目难度估计（AQDE）在教育评估中日益重要，因为它有潜力产生与专家判断相竞争的难度估计，同时有助于减少与试点管理相关的时间和财务负担，并扩展到数字测试环境。先前的AQDE研究报告了关于将干扰项作为附加文本添加到题干和正确答案中是否能一致改进难度预测的混合证据。我们假设干扰项信息的有效性取决于其结构表示，并且明确将干扰项建模为独立组件可以改进忽略此信息的基线的难度估计。为此，我们设计了受控架构，将选择题组件建模为不同输入，以隔离干扰项内容和顺序的贡献。具体来说，我们通过将每个干扰项编码为独立的文本输入，并通过顺序感知的拼接（带位置标签）或顺序不变的求和来聚合其表示，从而表示干扰项。我们使用两个智利数据集（自然科学和社会科学，2016-2020年；4114道选择题）评估了这些架构。与仅使用题干和正确答案的简单模型相比，我们最佳的结构感知架构实现了更高的预测性能，自然科学题目的R²=0.83，社会科学题目的R²=0.71。一个顺序不变的变体以大约一半的参数达到了几乎相同的准确率，提供了有利的准确率-效率权衡。这些结果表明，结构信息（尤其是干扰项内容）驱动了预测准确性的提升，支持开发计算上可行的大规模教育应用的高效结构感知模型。

英文摘要

Automatic Question Difficulty Estimation (AQDE) holds growing promise for educational assessment because it has the potential to yield difficulty estimates that are competitive with expert judgment, while helping reduce the time and financial burden associated with pilot administrations and scaling to digital testing contexts. Prior AQDE studies report mixed evidence on whether adding distractors as additional text to the question stem and the correct key consistently improves difficulty prediction. We hypothesize that the effectiveness of distractor information depends on its structural representation, and that explicitly modeling distractors as separate components improves difficulty estimation over baselines that omit this information. To address this, we designed controlled architectures that model MCQ components as distinct inputs to isolate the contribution of distractor content and order. Specifically, we represented distractors by encoding each distractor as its own text input and aggregating their representations either with order-aware concatenation (with positional tags) or with an order-invariant summation. We evaluated these architectures using two Chilean datasets (Natural and Social Sciences, 2016-2020; 4,114 multiple-choice questions). Compared to a simpler model that only used the question stem and the key, our best distractor-aware architecture achieved higher predictive performance, reaching R^2 = 0.83 for Natural Sciences and R^2 = 0.71 for Social Sciences items. An order-invariant variant achieved nearly the same accuracy with approximately half as many parameters, offering a favorable accuracy-efficiency trade-off. These results show that structural information (especially distractor content) drives gains in predictive accuracy, supporting the development of efficient, structure-aware models that are computationally viable for large-scale educational applications.
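The two aggregation schemes can be contrasted with a small sketch. The 2-dimensional embeddings are hypothetical, and the numeric positional tag is an assumption made for illustration; the paper's positional tags may take a different form.

```python
def aggregate_sum(distractor_vecs):
    """Order-invariant aggregation: element-wise sum of distractor embeddings."""
    return [sum(values) for values in zip(*distractor_vecs)]

def aggregate_concat(distractor_vecs):
    """Order-aware aggregation: concatenate embeddings, each prefixed by a
    positional tag (here simply the distractor's index as a float)."""
    out = []
    for position, vec in enumerate(distractor_vecs):
        out.extend([float(position)] + list(vec))
    return out

# Hypothetical embeddings for three distractors, and a reordered copy.
distractors = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
shuffled = [distractors[2], distractors[0], distractors[1]]

sum_same = aggregate_sum(distractors) == aggregate_sum(shuffled)
concat_same = aggregate_concat(distractors) == aggregate_concat(shuffled)
```

The summed vector keeps the size of a single embedding however many distractors there are, while the concatenated vector grows with the number of distractors. That difference in input width is one plausible source of the smaller parameter count reported for the order-invariant variant.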

2606.09013 2026-06-09 cs.CL 新提交

Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level

超越平均值：在分布层面评估LLM对人类调查的复现能力

Jeonghyeon Moon, Jiwon Kim, Yeheum Lah, Yoonju Han, Yuncheol Kang

发表机构 * Ewha Womans University（梨花女子大学）

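AI summary: Using a non-public experiment on Korean instant noodle purchases, evaluates at the distributional level how well LLMs replicate human survey responses. Models that match human means can still produce distributions that deviate further from human ones; structured personas and multimodal inputs improve alignment, whereas reasoning prompts degrade it.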

详情

AI中文摘要

LLM越来越多地被用于模拟人类调查响应，但先前的工作主要使用均值层面或总体一致性来评估复现能力，对LLM是否复现人类行为的变异性提供的见解有限。我们使用一个非公开的2010年韩国方便面购买消费者选择实验，在分布层面评估基于LLM的调查复现，该设置不太可能与模型训练数据重叠。我们评估了三种不同统计类型的响应变量：二元购买发生、分类品牌选择和计数购买数量。对于每种变量，我们在均值层面、模式和分布一致性上比较人类和LLM响应，并参考仅来自人类数据的基线。LLM在复现条件层面模式上表现合理，但未能捕捉分布结构：对于购买数量，没有模型能击败一个简单的条件不敏感基线（该基线仅匹配合并的人类分布）。因为均值匹配人类良好的模型仍可能产生比该基线更远离人类的分布，仅基于均值的评估可能具有误导性。复现能力也随输入配置而变化，结构化角色和多模态输入改善一致性，而显式推理提示则单调地降低一致性。

英文摘要

LLMs are increasingly used to simulate human survey responses, but prior work has mainly evaluated replication using mean-level or aggregate agreement, offering limited insight into whether LLMs reproduce the variability of human behavior. We evaluate LLM-based survey replication at the distributional level using a non-public 2010 consumer choice experiment on Korean instant noodle purchases, a setting unlikely to overlap with model training data. We evaluate three response variables of differing statistical type: binary purchase incidence, categorical brand choice, and count purchase quantity. For each, we compare human and LLM responses at mean-level, pattern, and distributional alignment, and against reference baselines from the human data alone. LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure: for purchase quantity, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Because models that match human means well can still produce distributions further from humans than this baseline, mean-based evaluation alone can be actively misleading. Replication also varies with input configuration, with structured personas and multimodal inputs improving alignment while explicit reasoning prompting degrades it monotonically.
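A toy example shows why matching the mean says little about matching the distribution. The two purchase-quantity distributions below are hypothetical, and total variation distance is used here only as one convenient divergence; the paper's own alignment measures may differ.

```python
def mean(pmf):
    """Mean of a discrete distribution given as {value: probability}."""
    return sum(value * prob for value, prob in pmf.items())

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# Hypothetical purchase-quantity distributions (not data from the study).
human = {0: 0.5, 1: 0.0, 2: 0.5}   # people buy either nothing or two packs
model = {0: 0.0, 1: 1.0, 2: 0.0}   # the simulated respondent always buys one

same_mean = mean(human) == mean(model)
distance = total_variation(human, model)
```

Both distributions have mean 1.0, yet they share no probability mass, so a mean-level comparison would report perfect agreement where a distributional one reports maximal disagreement.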

2606.09351 2026-06-09 cs.CL stat.ME 新提交

In-Context Learning for the Imputation of Public Opinion Data with Large Language Models

基于上下文学习的民意数据插补方法

Tobias Holtdirk, Georg Ahnert, Joseph W Sakshaug, Anna-Carolina Haensch

发表机构 * LMU Munich（慕尼黑大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）； University of Mannheim（曼海姆大学）； Institute for Employment Research (IAB)（就业研究所（IAB））； University of Maryland, College Park（马里兰大学帕克分校）

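AI summary: Proposes imputing missing survey data through in-context learning (ICL). Evaluated on 150 opinion variables, the approach yields lower absolute error than MICE PMM under all missingness mechanisms, with the largest gains under non-random missingness (MNAR).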

详情

AI中文摘要

大型语言模型已被广泛评估为个体调查响应的模拟器。然而，在实践中，完全未观测到的响应很少见；主要问题是部分无响应。插补旨在通过填充这些缺失值来恢复调查数据集的整体结构。它有自己的明确定义的评估标准，并且与预测有根本区别。我们提出通过上下文学习（ICL）来插补缺失的调查数据。我们在美国趋势面板的15波调查中，针对150个意见变量，系统评估了不同缺失机制（MCAR、MAR、MNAR）下的ICL设计选择。与成熟的数据插补统计方法（如MICE PMM）相比，我们的ICL方法在所有缺失机制下均持续降低了绝对误差，在非随机缺失（MNAR）下收益最大。值得注意的是，性能最佳的配置（gpt-oss-120b，100个上下文示例）实现了接近名义水平的总体覆盖率（接近95%），置信区间比MICE PMM窄2到5倍。我们发布了一个具有类似sklearn API的Python包，以便使用本地和专有LLM轻松部署我们的方法。

英文摘要

Large language models have been widely evaluated as simulators of individual survey responses. In practice, however, fully unobserved responses are rare; the dominant problem is partial non-response. Imputation aims to restore the overall structure of a survey dataset by filling in these missing values. It has its own well-defined evaluation criteria and differs fundamentally from prediction. We propose to impute missing survey data through in-context learning (ICL). We systematically evaluate ICL design choices across different missingness mechanisms (MCAR, MAR, MNAR) on 150 opinion variables spanning 15 waves of the American Trends Panel. Compared to well-established statistical methods for data imputation like MICE PMM, our ICL approach consistently reduces absolute error across all missingness mechanisms, with the largest gains under non-random missingness (MNAR). Notably, the best-performing specification (gpt-oss-120b with 100 in-context examples) achieves near-nominal aggregate coverage (approaching the 95% level) with confidence intervals two to five times narrower than MICE PMM. We publish a Python package with an sklearn-like API to enable easy deployment of our method using local and proprietary LLMs.
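The three missingness mechanisms differ in what the probability of a missing answer is allowed to depend on. The sketch below encodes that distinction with hypothetical probability rules and a seeded masking helper; it is not the authors' package, and the thresholds and rates are assumptions chosen for illustration.

```python
import random

def missing_prob(mechanism, observed_covariate, own_value, base=0.2):
    """Probability that a response is missing (hypothetical functional forms)."""
    if mechanism == "MCAR":   # independent of all data
        return base
    if mechanism == "MAR":    # depends only on an observed covariate
        return base * 2 if observed_covariate == 1 else base / 2
    if mechanism == "MNAR":   # depends on the unobserved answer itself
        return base * 2 if own_value >= 4 else base / 2
    raise ValueError(f"unknown mechanism: {mechanism}")

def mask(rows, mechanism, seed=0):
    """Replace answers by None at random; rows are (covariate, answer) pairs."""
    rng = random.Random(seed)
    return [None if rng.random() < missing_prob(mechanism, cov, val) else val
            for cov, val in rows]

# Hypothetical respondents: (observed covariate, answer on a 1-5 scale).
rows = [(0, 1), (1, 5), (1, 2), (0, 4)]
masked = mask(rows, "MNAR")
```

Under MNAR the missingness depends on the very value that is hidden, so methods that condition only on observed variables cannot fully correct for it, which is consistent with the abstract reporting the largest gains in that setting.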

2606.09389 2026-06-09 cs.CL 新提交

Item Response Scaling Laws: A Measurement-Theoretic Approach to Efficient and Generalizable Neural Scaling Estimation

Sang Truong, Yuheng Tu, Rylan Schaeffer, Sanmi Koyejo

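AI summary: Proposes Item Response Scaling Laws (IRSL), which integrate Item Response Theory into the scaling-law framework. Using a Beta-IRT model over the probabilistic responses of language models, IRSL reduces parameter complexity from O(M×N) to O(M+N) and gives reliable estimates in pre-training and test-time scaling settings with only 50 questions per benchmark.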

详情

AI中文摘要

缩放定律为理解语言模型（LM）的性能提供了基本框架，但推导它们需要在数千个检查点或数百万个推理样本上进行成本高昂的评估。为了解决这个问题，我们引入了项目反应缩放定律（IRSL），这是一个将项目反应理论（IRT）整合到缩放定律框架中的统一框架。与将每个模型-基准对单独处理的传统方法不同，IRSL将潜在模型能力与问题特征分离，将M个模型和N个问题的缩放定律估计分解，从而将参数复杂度从O(M×N)显著降低到O(M+N)。我们使用Beta-IRT实例化IRSL，它利用LM的经验概率响应——例如预训练中的token概率和测试时采样中的通过率——来捕获比二元响应更丰富的信号。我们在两种常见的缩放范式上验证了我们的方法：（1）预训练下游缩放，使用来自10个基准的6,612个LM检查点和37,682个问题；以及（2）测试时缩放，使用来自4个基准的12个LM和120个问题，每个问题最多2,500个样本。在现有模型响应上进行一次性校准后，IRSL仅使用每个基准50个问题（减少99.9%）即可产生更可靠的缩放估计，达到与传统方法相当或更优的决策准确性。此外，我们表明估计的潜在模型能力是可泛化的，从而能够跨共享相同测量目标的基准进行准确的性能预测。

英文摘要

Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference samples. To address this, we introduce Item Response Scaling Laws (IRSL), a unified framework that integrates Item Response Theory (IRT) within the scaling law framework. Unlike traditional approaches that treat each model-benchmark pair in isolation, IRSL disentangles latent model ability from question characteristics, factorizing the scaling law estimation for $M$ models and $N$ questions to significantly reduce parameter complexity from $O(M \times N)$ to $O(M + N)$. We instantiate IRSL with Beta-IRT, which leverages the empirical probability responses of LMs -- such as token probabilities in pre-training and pass rates in test-time sampling -- to capture richer signals than binary responses. We validate our approach across two prevalent scaling paradigms: (1) pre-training downstream scaling, using 6,612 LM checkpoints and 37,682 questions from 10 benchmarks; and (2) test-time scaling, using 12 LMs and 120 questions from 4 benchmarks with up to 2,500 samples per question. Given a one-time calibration on existing model responses, IRSL yields more reliable scaling estimates using only 50 questions per benchmark (a 99.9% reduction), achieving comparable or superior decision accuracy to traditional approaches. Furthermore, we show that the estimated latent model abilities are generalizable, enabling accurate performance forecasting across benchmarks that share the same measurement objective.
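The reduction from $O(M \times N)$ to $O(M + N)$ can be illustrated with a deliberately simplified item response model. The sketch assumes a one-parameter logistic curve with a single ability per model and a single difficulty per question; the paper's Beta-IRT instantiation is richer, so this shows only the factorization idea.

```python
import math

def irt_prob(ability, difficulty):
    """Expected response of a model on a question under a 1PL-style curve."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def param_counts(num_models, num_questions):
    """Parameters needed when each pair is fitted separately versus factorized."""
    per_pair = num_models * num_questions     # one free value per model-question pair
    factorized = num_models + num_questions   # one ability per model, one difficulty per question
    return per_pair, factorized

# Sizes taken from the pre-training setting quoted in the abstract.
per_pair, factorized = param_counts(6612, 37682)

# Hypothetical abilities and difficulty: a stronger model is more likely to succeed.
weak, strong = irt_prob(-1.0, 0.0), irt_prob(1.0, 0.0)
```

Because every response is predicted from one model-side and one question-side quantity, a new model only needs enough questions to pin down its ability, which is the intuition behind evaluating on a small question subset.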

2606.08034 2026-06-09 cs.CV cs.AI cs.CL 交叉投稿

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

Sci-Rho：面向STEM问题的多语言视觉基础符号基准

Muhammad Falensi Azmi, Ikhlasul Akmal Hanif, Vallerie Alexandra Putra, Adi Yeltay, Abdullah Mubarak, Fajri Koto

发表机构 * Independent Researcher（独立研究员）； MBZUAI（穆罕默德·本·扎耶德人工智能大学）； Binus University（比努斯大学）； Bandung Institute of Technology（万隆理工学院）

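AI summary: Introduces Sci-Rho, a multilingual, visually grounded dynamic benchmark for STEM problems with 4,242 templates and 42,420 instances. An evaluation of 17 VLMs finds a gap between worst-case and average accuracy, and smaller models degrade across languages.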

Comments 22 pages

详情

AI中文摘要

符号基准已成为评估模型在STEM相关问题微小修改下鲁棒性的关键方法。然而，现有符号基准大多局限于数学推理，缺乏视觉基础，且主要以英语为主。在这项工作中，我们引入了Sci-Rho（科学鲁棒性），一个面向视觉基础STEM问题的动态基准，涵盖五个学科和七种语言，包含由领域专家（包括奥林匹克奖牌得主）精心设计的4,242个问题模板（每种语言606个）。每个模板实现为可执行的Python代码，通过改变数值、视觉模式、几何形状、颜色方案和函数类型，生成多样但等价的问题实例，总共产生42,420个实例，每个实例都配有推理步骤和真实解决方案。我们评估了17个最先进的VLM，发现最差情况准确率（定义为模型在每种生成变体上均正确回答的问题模板比例）与平均准确率之间存在明显差距。我们还发现，较小的模型在不同语言上表现出显著的性能下降，而专有模型和较大模型保持鲁棒。步骤级评估反映了相同的趋势，揭示了平均F1与最差情况F1分数之间的显著差距。最后，我们对VLM注意力头的检查显示，图像标记与文本标记的相对注意力分配存在显著的跨语言变化。我们的工作强调了超越静态基准的评估作为衡量VLM质量指标的重要性。

英文摘要

Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In this work, we introduce Sci-Rho (Science Rhobustness), a dynamic benchmark for visually-grounded STEM problems spanning five subjects and seven languages, comprising 4,242 problem templates (606 per language) crafted by domain experts, including Olympiad medalists. Each template is implemented as executable Python code that generates diverse but equivalent problem instances by varying numerical values, visual patterns, geometric shapes, color schemes, and function types, resulting in 42,420 instances in total, each paired with reasoning steps and ground-truth solutions. We evaluated 17 state-of-the-art VLMs and discovered a noticeable gap between worst-case accuracy (defined as the proportion of problem templates that a model answers correctly across every generated variation) and average accuracy. We also discovered that smaller models show noticeable performance degradation across languages, whereas proprietary and larger models remain robust. Step-level evaluation reflects this same trend, revealing a significant gap between average F1 and worst-case F1 scores. Finally, our inspection of attention heads of a VLM reveals substantial cross-lingual variation in the relative attention allocated to image tokens compared to text tokens. Our work highlights the importance of evaluation beyond static benchmarks as a metric to measure the quality of VLMs.
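The difference between the two accuracy notions follows directly from the definition quoted in the abstract. In this sketch the template names and outcomes are hypothetical; each template maps to the per-variant correctness of one model.

```python
def average_accuracy(results):
    """Fraction of all generated instances answered correctly."""
    flat = [ok for variants in results.values() for ok in variants]
    return sum(flat) / len(flat)

def worst_case_accuracy(results):
    """Fraction of templates answered correctly on every generated variant."""
    return sum(all(variants) for variants in results.values()) / len(results)

# Hypothetical outcomes for two templates with three variants each.
results = {
    "template_1": [True, True, True],
    "template_2": [True, False, True],
}
avg = average_accuracy(results)
worst = worst_case_accuracy(results)
```

A single failed variant leaves the average nearly unchanged but removes the whole template from the worst-case score, which is why the two numbers can diverge sharply.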

2606.08036 2026-06-09 cs.IR cs.AI cs.CL 交叉投稿

GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

GIScholarBench: 在GIS研究中评估大语言模型的过度自信

Zongrng Li, Mingzheng Yang, Lei Zou, Hongxu Ma, Hao Tian, Siqi Zhou, Wenjing Gong, Kaili Zhang, Bingqian Chen, Mitch Zhang, Yifan Yang

发表机构 * Texas A&M University（德克萨斯农工大学）； Google（谷歌）； Department of Geography（地理系）； Department of Landscape Architecture and Urban Planning（景观建筑与城市规划系）

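AI summary: To study LLM overconfidence in academic research, builds the GIScholarBench benchmark of 10,865 papers and evaluates models on three tasks: metadata retrieval, literature linking, and research-direction generation. All models show task-invariant overconfidence.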

详情

AI中文摘要

A Joint Finite-Sample Certificate for Adaptive Selective Conformal Risk Control

Xiaoli Yu, Jiamiao Liu

发表机构 * Chongqing University of Posts and Telecommunications（重庆邮电大学）； Army Medical University (Third Military Medical University)（陆军军医大学（第三军医大学））

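AI summary: Proposes a joint finite-sample certificate that simultaneously upper-bounds selective risk and lower-bounds acceptance probability and deployment utility, valid under adaptive threshold selection. Built on tools such as an empirical-Bernstein bound on the ratio risk, it opens a 22-percentage-point certified-acceptance frontier over Hoeffding-CRC on ImageNet and COCO and is about 10 times tighter than a matched-valid baseline.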

详情

AI中文摘要

选择性预测器在置信输入上做出预测，否则弃权；安全部署需要一个单一的有限样本证书，同时上界所选风险、下界接受概率 $p_{\mathrm{acc}}$ 使其高于下限 $p_{\min}$，并下界部署效用。该证书必须在基于 $n_{\mathrm{cert}}$ 个样本、从包含 $m$ 个阈值对的有限网格中进行自适应阈值选择时仍然有效。我们通过将所选风险直接视为比率而非通过Hoeffding式范围界，为有界、可能非单调的损失给出了这样的证书。该构造耦合了三个置信界：比率风险的方差自适应经验伯恩斯坦界、接受概率的Clopper-Pearson界以及效用的双边接近界。它们共同下界认证策略的绝对效用，并且与认证集上的最优策略相差不超过 $2\gamma_u$，两者在可行时均非平凡；一个按场景划分的第三部分与外部预言机匹配，仅在风险边际 $\gamma_r < \alpha$ 时有信息量，在主要操作点处为空。相对于仅基于范围的Hoeffding比率构造，这使对接受下限的依赖从 $1/p_{\min}$ 改善为 $1/\sqrt{p_{\min}}$，并且一个闭式推论识别出一个按阈值对划分的场景，其中我们的风险界优于Hoeffding共形风险控制（Hoeffding-CRC）选择性界。实验上，在ImageNet（三个ResNet）和COCO val 2017全景分割上，该证书比Hoeffding-CRC打开了 $+22$ 个百分点的认证接受前沿，并且比非平凡的匹配有效基线紧致约 $10\times$；这些增益是按场景的，非普适的，在ADE20K上不存在。认证器运行时间为 $O(n_{\mathrm{cert}}\, m)$。

英文摘要

Selective predictors answer on confident inputs and abstain elsewhere; deploying one safely needs a single finite-sample certificate that simultaneously upper-bounds the selected risk, lower-bounds the acceptance probability $p_{\mathrm{acc}}$ above a floor $p_{\min}$, and lower-bounds the deployment utility. This certificate must be valid under adaptive threshold selection from a finite grid of $m$ pairs on $n_{\mathrm{cert}}$ samples. We give such a certificate for bounded, possibly non-monotone losses by treating the selected risk directly as a ratio rather than through a Hoeffding-style range bound. The construction couples three confidence bounds: a variance-adaptive empirical-Bernstein bound on the ratio risk, a Clopper-Pearson bound on acceptance, and a two-sided closeness bound on utility. Together they lower-bound the certified policy's utility absolutely and to within $2\gamma_u$ of the best over the certified set, both non-vacuous whenever feasible; a regime-scoped third leg matches an external oracle, informative only where the risk margin $\gamma_r < \alpha$ and vacuous at the headline operating points. Relative to the range-only Hoeffding-ratio construction this sharpens the acceptance-floor dependence from $1/p_{\min}$ to $1/\sqrt{p_{\min}}$, and a closed-form corollary identifies a per-pair regime in which our risk bound dominates a Hoeffding conformal risk control (Hoeffding-CRC) selective bound. Empirically, on ImageNet (three ResNets) and COCO val 2017 panoptic, the certificate opens a $+22$ pp certified-acceptance frontier over Hoeffding-CRC and is about $10\times$ tighter than a non-vacuous matched-valid baseline; these gains are regime-scoped, not universal, and absent on ADE20K. The certifier runs in $O(n_{\mathrm{cert}}\, m)$ time.
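One of the three ingredients named above, the Clopper-Pearson bound on acceptance, can be sketched in a few lines. This is a single one-sided bound computed by bisection on the exact binomial tail; it assumes i.i.d. acceptance indicators and applies no correction for selecting among the $m$ grid pairs, so it is not the paper's full certificate.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def clopper_pearson_lower(k, n, alpha=0.05, tol=1e-10):
    """One-sided exact lower confidence bound on a binomial proportion,
    given k accepted inputs out of n."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # The tail probability increases with p, so bisection finds the
        # smallest p at which observing >= k acceptances is not too unlikely.
        if binom_tail(k, n, mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical calibration outcome: 80 of 100 inputs accepted.
lower = clopper_pearson_lower(80, 100)
```

The certified acceptance probability is always below the empirical rate, and the gap shrinks as the number of calibration samples grows.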

2606.08679 2026-06-09 stat.ML cs.CL cs.LG stat.ME 交叉投稿

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

排行榜的排名区间：模型评估的分层框架

Bitya Neuhof, Yuval Benjamini

发表机构 * Department of Statistics and Data Science（统计与数据科学系）

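AI summary: Proposes a hierarchical framework that quantifies uncertainty in model rankings with statistical guarantees, using task-level confidence intervals and leaderboard-level prediction intervals.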

详情

AI中文摘要

预训练模型通常在多任务排行榜上评估，以衡量其在不同场景中的适用性。然而，当前将跨任务性能聚合为排行榜级排名的方法并未解决任务层面的不确定性和变异性。尽管近期工作提出了基于区间的模型排名，但从单个任务到排行榜级排名的不确定性的原则性聚合仍未解决，且模型在不同任务上的性能变化常被掩盖。本文引入一个分层框架，在两层上构建具有统计保证的模型排名区间：通过成对比较构建任务级排名置信区间，以及使用共形方法构建排行榜级排名预测区间。这使得能够对每个观测任务和新潜在任务进行可靠的模型排名量化。在模拟数据以及TabArena和PromptEval（MMLU）基准上的实验表明，我们的方法产生统计有效且信息丰富的区间，从而在排行榜上实现可靠、具有不确定性意识的模型排名。

英文摘要

Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains unaddressed, and variation in models' performance across tasks is frequently obscured. In this work, we introduce a hierarchical framework that constructs model rank intervals with statistical guarantees at both levels: task-level rank confidence intervals from pairwise comparisons, and leaderboard-level rank prediction intervals using a conformal approach. This enables reliable quantification of model rank for each observed task and for new potential tasks. Experiments on simulated data and the TabArena and PromptEval (MMLU) benchmarks show that our method yields statistically valid and informative intervals, enabling reliable, uncertainty-aware model ranking on leaderboards.
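A common way to turn pairwise comparisons into rank intervals, shown here as a sketch, is to count how many competitors are significantly better or worse than each model. The confidence intervals on score differences are hypothetical inputs assumed to hold simultaneously; the paper's exact construction, and its conformal leaderboard-level step, are not reproduced here.

```python
def rank_intervals(diff_ci):
    """diff_ci[(a, b)] = (low, high) confidence interval for score(a) - score(b).
    Returns {model: (best_rank, worst_rank)} with rank 1 as the best."""
    models = sorted({m for pair in diff_ci for m in pair})

    def ci(a, b):
        if (a, b) in diff_ci:
            return diff_ci[(a, b)]
        low, high = diff_ci[(b, a)]
        return -high, -low  # flip the interval for the reversed pair

    intervals = {}
    for m in models:
        others = [o for o in models if o != m]
        beaten_by = sum(ci(m, o)[1] < 0 for o in others)  # o is significantly better
        beats = sum(ci(m, o)[0] > 0 for o in others)      # m is significantly better
        intervals[m] = (1 + beaten_by, len(models) - beats)
    return intervals

# Hypothetical intervals on one task: A clearly leads, B and C are not separable.
diff_ci = {("A", "B"): (0.1, 0.3), ("A", "C"): (0.2, 0.5), ("B", "C"): (-0.1, 0.2)}
intervals = rank_intervals(diff_ci)
```

Here A is pinned to rank 1 while B and C each receive the interval from 2 to 3, which makes the unresolved comparison visible instead of hiding it behind a single ordering.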

2606.08722 2026-06-09 cs.SD cs.CL 交叉投稿

Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

LLM 能否理解 LilyPond？一个用于符号音乐生成与理解的基准

Matteo Spanio, Mohammad Torabi, Andrea Poltronieri, Antonio Rodà

发表机构 * University of Padova（帕多瓦大学）； Universitat Pompeu Fabra（庞培法布拉大学）

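AI summary: Introduces LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and understanding in open-weight LLMs. Executable LilyPond can be generated zero-shot, but structural understanding tasks remain challenging, and the metrics disagree systematically.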

Comments Accepted at Ital-IA 2026

详情

AI中文摘要

大型语言模型的符号音乐评估在表示、数据集和指标上仍然碎片化。我们引入了 LilyBench，一个基于 LilyPond 的基准，用于在同一系列开源权重 LLM 上联合评估符号音乐生成和音乐理解。该基准包括一个 200 个提示的生成套件和十个从 ABC-Eval 改编的理解任务，涵盖语法、元数据预测、结构排序和音乐识别。生成质量通过编译率、基于 Jensen-Shannon 相似度的 MusPy 描述符分布以及基于 LilyBERT 的 Fréchet 音乐距离 (FMD) 进行评估。在四个开源模型上的实验表明，在零样本设置下可以实现可执行的 LilyPond 生成，而结构理解任务尽管在作曲家和流派识别上表现强劲，但仍然具有挑战性。我们的实验还揭示了基于描述符和基于嵌入的指标之间的系统性分歧，表明符号音乐评估受益于指标三角测量而非单一分数排名。我们发布了基准、提示库和评估代码，以支持未来在符号音乐生成和理解方面的研究，地址为 https://github.com/CSCPadova/lilybench。

英文摘要

Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench

2606.08959 2026-06-09 cs.CV cs.CL 交叉投稿

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

Mina Remeli, Moritz Hardt

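Affiliations: Max Planck Institute for Intelligent Systems, Tübingen, Germany; Tübingen AI Center

AI summary: By converting benchmarks into generative evaluations, this paper finds that model rankings obtained from pairwise comparisons combined with Elo closely match rankings based on ground-truth accuracy (Spearman correlation > 0.9). Style and judge bias have relatively little effect, but answer repetition (echo) is a causal driver of judge preference.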

Comments Accepted at ICML'26
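The summary above compares an Elo ranking built from pairwise comparisons with an accuracy ranking via Spearman correlation. The sketch below shows one plain way to do that, assuming the textbook sequential Elo update (K = 16, initial rating 1500) and a tie-free Spearman formula; the paper's actual Elo variant, the model names, and the accuracy values are not given in the source and are hypothetical here.

```python
def expected_score(r_a, r_b):
    """Probability that a player rated r_a beats one rated r_b under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_ratings(models, comparisons, k=16.0, initial=1500.0):
    """Apply sequential Elo updates for a list of (winner, loser) judgments."""
    ratings = {m: initial for m in models}
    for winner, loser in comparisons:
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings

def spearman(xs, ys):
    """Spearman rank correlation for sequences without ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, idx in enumerate(order):
            result[idx] = rank
        return result
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical judge outcomes and ground-truth accuracies for three models.
models = ["model_a", "model_b", "model_c"]
comparisons = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")] * 5
accuracy = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5}

ratings = elo_ratings(models, comparisons)
print(spearman([ratings[m] for m in models], [accuracy[m] for m in models]))
```

Sequential Elo depends on the order of comparisons, so a real study would likely shuffle or bootstrap the comparison list before ranking.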

2606.09578 2026-06-09 cs.AI cs.CL cs.IR cross-list

TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs

Momina Ahsan, Sarfraz Ahmad, Ming Shan Hee, Roy Ka-Wei Lee, Preslav Nakov

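Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); Singapore University of Technology and Design (SUTD)

AI summary: Introduces the TABVERSE benchmark, which holds table content fixed across multiple structural formats (HTML, Markdown, LaTeX) and rendered images to systematically evaluate LLMs and VLMs on question answering, structural understanding, and structure reconstruction, finding that representation format significantly affects table understanding.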

Comments 24 pages, 18 tables, 16 figures, Submitted to ARR May 2026

Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.


2606.09748 2026-06-09 cs.AI cs.CL cs.LG cross-list

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan

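Affiliations: Google DeepMind; OpenAI; Perplexity AI; LangChain AI

AI summary: To address the limits of evaluating deep research agents (DRAs) on single-turn outputs only, the paper proposes Research Gap Inference (RGI) to provide process-level feedback. A single round of process feedback raises scores by 8-15 points, but multi-turn improvement is hard to sustain because of regressions.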

Comments Published as a workshop paper at SCALE - ICML 2026 (Oral)

Abstract

Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately $8$-$15$ points and yielding a roughly $35$-$40\%$ incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to $24\%$ of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.
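The incorporation and regression rates quoted in the abstract can be read as per-turn ratios over rubric criteria. The sketch below assumes that the incorporation rate is the share of previously unsatisfied criteria that become satisfied, and the regression rate is the share of previously satisfied criteria that become unsatisfied; these definitions, the function name, and the criterion IDs are assumptions for illustration, not taken from the paper's code.

```python
def turn_rates(before, after):
    """Compare rubric outcomes across one revision turn.

    before/after map a criterion ID to True (satisfied) or False (unsatisfied).
    Returns (incorporation_rate, regression_rate) under the assumed definitions.
    """
    prev_unsatisfied = [c for c, ok in before.items() if not ok]
    prev_satisfied = [c for c, ok in before.items() if ok]
    incorporated = sum(1 for c in prev_unsatisfied if after[c])
    regressed = sum(1 for c in prev_satisfied if not after[c])
    inc_rate = incorporated / len(prev_unsatisfied) if prev_unsatisfied else 0.0
    reg_rate = regressed / len(prev_satisfied) if prev_satisfied else 0.0
    return inc_rate, reg_rate

# Hypothetical rubric results before and after a feedback-guided rewrite.
before = {"c1": True, "c2": True, "c3": True, "c4": True, "c5": False, "c6": False}
after = {"c1": True, "c2": True, "c3": True, "c4": False, "c5": True, "c6": False}
print(turn_rates(before, after))
```

Tracking both ratios separately makes the failure mode in finding (iii) visible: a net score can stay flat even when many criteria are newly satisfied, if a comparable number regress.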


2606.09764 2026-06-09 cs.LG cs.CL cross-list

OpenCompass: A Universal Evaluation Platform for Large Language Models

Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang, Zhiwei Fei, Tong Gao, Ge Jiaye, Mo Li, Hongwei Liu, Junnan Liu, Yuan Liu, Chengqi Lyu, Han Lyu, Ningsheng Ma, Zerun Ma, Yu Sun, Zhiyong Wu, Linchen Xiao, Zhuozhi Xiong, Jun Xu, Haochen Ye, Zhaohui Yu, Yike Yuan, Songyang Zhang, Yufeng Zhao, Fengzhe Zhou, Peiheng Zhou, Dongsheng Zhu, Lin Zhu, Jingming Zhuo

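Affiliations: OpenCompass Team; Shanghai AI Laboratory

AI summary: Presents OpenCompass, a modular, highly compatible, flexible, high-concurrency general-purpose LLM evaluation platform that supports a variety of task scenarios and mainstream benchmark datasets.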

Abstract

In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.


2605.22079 2026-06-09 cs.CL replacement

Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka

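Affiliations: ONESTRUCTION Inc.; AWS GenAI Innovation Center

AI summary: This paper presents the Ishigaki-IDS-Bench benchmark for evaluating how well large language models generate industry-standard XML Information Delivery Specifications (IDS). Using 166 examples written and validated by BIM/IDS experts, combined with content-consistency evaluation and structural auditing, it shows the limitations of current LLMs in generating XML that satisfies the IDS standard and IFC vocabulary constraints.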

Comments 7 pages; benchmark data and evaluation scripts are available on GitHub and Hugging Face

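AI abstract (translated from Chinese)

Large language models (LLMs) are widely used to generate structured outputs such as JSON, SQL, and code, but public resources remain limited for evaluating generation that must satisfy both an industry-standard XML format and domain vocabulary constraints. This paper presents Ishigaki-IDS-Bench, a benchmark for evaluating the ability to generate Information Delivery Specification (IDS) XML from BIM information requirements. The benchmark contains 166 examples written and validated by BIM/IDS experts, produced by expanding 83 practical scenarios into Japanese and English, each paired with a gold IDS file and metadata covering input format, language, turn setting, IFC version, and construction domain. Evaluation combines IDSAuditTool-based audits of processability, structure, and content with a content-consistency assessment against the gold IDS files. In zero-shot evaluation, the best of 10 LLMs reaches a macro-F1 of 65.6% on content consistency, but only 27.7% of outputs pass the content audit. These results indicate that current LLMs can express part of the information requirements as IDS but still struggle to reliably generate XML that satisfies the IDS standard and IFC vocabulary constraints. Ishigaki-IDS-Bench supports comparative evaluation, failure analysis, and the development of constrained structured-generation methods that conform to domain standards. The evaluation scripts and benchmark data are released on GitHub and Hugging Face under the CC BY 4.0 license.

Abstract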

Building Information Modeling (BIM) projects increasingly use Information Delivery Specification (IDS) to formalize information requirements in a machine-checkable XML format. Because IDS conditions are grounded in the Industry Foundation Classes (IFC) vocabulary, authoring them requires expertise in IFC concepts, validation tools, and property set conventions. Existing benchmarks for structured generation do not adequately capture the additional burden of vocabulary conformance and external-validator agreement that IDS imposes. We present Ishigaki-IDS-Bench, the first publicly released benchmark for IDS generation from BIM information requirements. The benchmark contains 166 examples spanning 83 practical scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file and metadata covering input format, turn setting, target IFC versions, and construction domain. Evaluation proceeds in two stages: (i) formal validity scored by the buildingSMART IDSAuditTool along Processability, Structure, and Content, and (ii) content fidelity scored by facet-level macro-F1 against the gold IDS. Across 10 LLMs in zero-shot, the highest Facet F1 is 65.6%, achieved by GPT-5.5, while the highest Content pass rate is only 33.1%, achieved by Claude Opus 4.5. Ishigaki-IDS-Bench is released on Hugging Face (DOI 10.57967/hf/8873) under CC BY 4.0, and the evaluation code is released on Zenodo (DOI 10.5281/zenodo.20550510) under Apache-2.0.
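To make the content-fidelity stage more concrete, the sketch below scores predicted facets against gold facets with F1 and averages over examples. It assumes each IDS file has already been parsed into a set of hashable facet tuples and that "macro" means an unweighted mean over examples; the abstract does not specify the benchmark's matching or averaging rules, and the facet tuples shown are illustrative only.

```python
def facet_f1(predicted, gold):
    """F1 between two sets of facets; two empty sets count as a perfect match."""
    if not predicted and not gold:
        return 1.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(pairs):
    """Unweighted mean of per-example F1 over (predicted, gold) pairs."""
    return sum(facet_f1(p, g) for p, g in pairs) / len(pairs)

# Illustrative facet tuples; real facets would come from parsed IDS XML.
gold = {("entity", "IFCWALL"), ("property", "Pset_WallCommon", "FireRating")}
predicted = {("entity", "IFCWALL"), ("property", "Pset_WallCommon", "IsExternal")}
print(macro_f1([(predicted, gold), (gold, gold)]))
```

Exact tuple matching is the strictest possible choice; a real scorer might normalize case or compare facet fields individually.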


2605.25312 2026-06-09 cs.CL replacement

P1SCO: Social Dimensions from a Perspectivist Lens

Amanda Cercas Curry, Gianmarco de Francisci Morales, Luca Maria Aiello

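Affiliations: Independent Researcher; CENTAI, Turin; IT University of Copenhagen

AI summary: This paper presents the P1SCO dataset, which collects social media comments from three platforms and annotates them along ten social dimensions to capture the diversity of social interaction and perception, supporting fine-grained analysis as well as studies of cross-platform and individual differences.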

2411.19504 2026-06-09 cs.AI cs.CL cs.IR replacement

TQA-Bench: Evaluating LLMs for Multi-Table Question Answering

Zipeng Qiu, Chenyue Li, You Peng, Guangxin He, Binhang Yuan, Chen Wang

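Affiliations: Hong Kong University of Science and Technology; Tsinghua University

AI summary: Introduces the TQA-Bench benchmark, which evaluates LLMs on long-context multi-table question answering and reveals their challenges and opportunities in complex data-driven environments.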

Comments Accepted by IEEE Transactions on Big Data

Abstract

The advance of large language models (LLMs) has unlocked great opportunities in complex multi-modal data management tasks, particularly in question answering (QA) over complicated multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing the modality of relational data structures and the potentially large scale of serialized tabular data. Existing benchmarks primarily focus on single-table QA, failing to capture the intricacies of connections across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. We present TQA-Bench, a long-context analytical multi-table QA benchmark derived from real-world public datasets, with a flexible sampling mechanism that varies context length (8K--64K tokens) and symbolic extensions for assessing reasoning beyond retrieval and pattern matching. We systematically evaluate a set of LLMs spanning model scales from 2 billion to 671 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments.


2502.16584 2026-06-09 cs.SD cs.AI cs.CL cs.MM eess.AS replacement

Humans' ALMANAC: A Human Collaboration Dataset with Action-Level Mental Model Annotations for Agent Collaboration

Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu, Toby Jia-Jun Li, Dakuo Wang, Bingsheng Yao

Affiliations: Northeastern University; University of Notre Dame; University of Waterloo; Carnegie Mellon University; Adobe; Microsoft Research Asia

AI summary: To address current LLM agents' lack of mental-model capabilities in collaboration, the authors build ALMANAC, a dataset based on the Map Task containing 2,987 collaboration actions with mental model annotations, and evaluate six LLMs on predicting human behavior and mental models.

Abstract

Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.


2606.07379 2026-06-09 cs.LG cs.AI cs.CL stat.ME replacement

The Cold-Start Safety Gap in LLM Agents

Chung-En Sun, Linbo Liu, Tsui-Wei Weng

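Affiliations: University of California, San Diego

AI summary: The study finds that tool-calling LLM agents are most vulnerable at the start of a session and become safer as they carry out regular tasks. It introduces the SODA benchmark and verifies that a warm-up strategy can narrow the cold-start safety gap.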

Abstract

Are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks -- a phenomenon we term the cold-start safety gap. To study this systematically, we introduce Safety Over Depth for Agents (SODA), a benchmark that controls how many regular agentic tasks the agent completes before encountering a safety threat, supporting up to 20 preceding tasks. Evaluating 7 models from 4 families, safety improves by 9--52% as the number of preceding regular agentic tasks increases from zero to twenty. Representation analysis confirms that model hidden states gradually shift toward a safety-aligned region as more preceding tasks are present. By systematically studying which part of the preceding conversation matters most, we find that the regular agentic tasks themselves are the primary driver of safety, while the agent's own prior responses have less effect on safety but are essential for preserving later utility. This conclusion is further supported by evaluation on open-source safety benchmarks (AgentHarm, Agent Safety Bench) and utility benchmarks (BFCL, API-Bank), confirming that warming up the agent with regular agentic tasks before deployment makes it safer and preserves full capability. Based on these findings, we recommend a simple deployment strategy: having the agent complete a few regular agentic tasks before possible exposure to safety-critical requests mitigates the cold-start safety gap. Our code is available at https://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap


2606.07877 2026-06-09 cs.CL new submission

Whose Norms? Disentangling Cultural and Personal Alignment in Large Language Models

Angana Borah, Isabelle Augenstein, Rada Mihalcea

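Affiliations: University of Michigan - Ann Arbor; University of Copenhagen

AI summary: Introduces the PACT framework for evaluating how large language models trade off cultural norms against personal preferences, finding that models are influenced more by country context than by age or gender, and that human-LLM alignment fails to capture cultural pluralism.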

Comments Preprint under review

Abstract

Large language models are increasingly used for social decision-making situations that require balancing cultural norms with personal preferences. For example, a user preferring honesty might ask whether to correct a coworker publicly when local norms favor indirect feedback. Yet existing research studies cultural alignment and personalization largely separately. We introduce PACT, the Personal-Preference and Cultural-Norm Trade-off framework, which evaluates whether models choose to follow a cultural norm or allow personal preferences. We find that LLMs vary in how rigidly they enforce cultural norms, with behavior shifted more by country context (7.8%) than age (1%) and gender (0.7%) and shifting non-uniformly after instruction tuning. Furthermore, our five-country human study on PACT shows that culture-following in humans is mainly driven by scenario country, with the lowest agreement when participants judge their own cultural contexts, showing within-culture pluralism. Finally, human-LLM alignment experiments show that models can match majority choices, but fail to capture response distributions and uncertainty (with best correlations reaching only 0.24). Together, these findings motivate alignment evaluations that go beyond majority to capture cultural pluralism and disagreement in social judgment.


2606.07964 2026-06-09 cs.CL new submission

What Does Debiasing Really Remove? A Geometric Study of PCA-Based Gender Debiasing in Word Embeddings

Alexey Kresin, Tchifou M. Dieffi, Tomer Caspi

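Affiliations: Hood College; Ben-Gurion University of the Negev

AI summary: A geometric analysis shows that PCA debiasing mainly removes the direct gender bias in the first principal component but cannot eliminate associative bias spread across many dimensions, and that it damages the embedding geometry. Bias is therefore not purely low-rank, and simple subspace removal is insufficient for comprehensive debiasing.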

Comments 8 pages, 4 figures. Source code available at https://github.com/AlexeyKresin/embedding-bias-geometry

Abstract

Debiasing methods based on principal component analysis (PCA) are broadly used to reduce gender bias in word embeddings used in LLMs, yet it remains unclear what aspects of bias they actually remove and how destructive this process is. These methods are based on the understanding that bias resides in a low-dimensional subspace, with the assumption that most of it can be captured by a few principal components. In this work, we conduct a systematic geometric analysis of PCA-based gender debiasing and investigate what is actually removed from the embedding space. Our experiments across multiple embeddings show that direct gender bias is primarily concentrated in the first principal component, supporting the low-rank bias hypothesis. However, associative bias measured by WEAT does not align with these principal directions and is instead spread across multiple embedding dimensions. Furthermore, as expected, we demonstrate that removing an increasing number of principal components leads to a consistent degradation of the embedding geometry, affecting semantic structure and vector relationships. These results reveal that PCA-based debiasing operates as a trade-off: while it effectively reduces certain forms of direct bias, it fails to eliminate distributed associations and introduces geometric distortion. Moreover, there is no universal optimal level of debiasing, as the balance between bias reduction and semantic preservation depends on the chosen metric and embedding. Overall, our findings suggest that bias in word embeddings is not purely low-rank and that simple subspace removal methods may be insufficient for comprehensive debiasing.
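The subspace-removal step that the abstract analyzes can be sketched in a few lines. The example below estimates a single gender direction as the dominant eigenvector of the second-moment matrix of gendered-pair difference vectors (found by power iteration) and projects it out of another word vector. This is a simplified variant with uncentered differences and one component; the toy 3-dimensional embeddings and all function names are assumptions for illustration, not the authors' code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def first_principal_direction(vectors, iterations=200):
    """Dominant eigenvector of sum_i x_i x_i^T, computed by power iteration."""
    dim = len(vectors[0])
    direction = normalize([1.0] * dim)
    for _ in range(iterations):
        # Multiply by the second-moment matrix without materializing it.
        nxt = [0.0] * dim
        for x in vectors:
            weight = dot(x, direction)
            for j in range(dim):
                nxt[j] += weight * x[j]
        direction = normalize(nxt)
    return direction

def remove_direction(v, direction):
    """Project v onto the subspace orthogonal to a unit-length direction."""
    projection = dot(v, direction)
    return [a - projection * d for a, d in zip(v, direction)]

# Toy embeddings (hypothetical values).
emb = {
    "he": [1.0, 0.2, 0.1], "she": [-1.0, 0.2, 0.1],
    "man": [0.9, 0.5, 0.0], "woman": [-0.9, 0.5, 0.0],
    "nurse": [-0.4, 0.3, 0.7],
}
pairs = [("he", "she"), ("man", "woman")]
differences = [[a - b for a, b in zip(emb[m], emb[f])] for m, f in pairs]

gender_direction = first_principal_direction(differences)
debiased_nurse = remove_direction(emb["nurse"], gender_direction)
print(debiased_nurse)
```

Removing further components would repeat the projection with the next eigenvectors, which is where the abstract reports growing distortion of the embedding geometry.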


2606.07969 2026-06-09 cs.CL cs.AI new submission

Neutrality Bites: Gender Representation in AI-Generated Animal Stories

Imani Finkley, Yuanxi Li, Melanie Walsh

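Affiliations: University of Washington

AI summary: Studies gender assignment by six mainstream LLMs when generating animal stories. The models often avoid specifying gender or use neutral language, but when gender is specified they skew strongly masculine and feminine characters are almost absent, suggesting that a neutrality strategy may erase marginalized perspectives.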

Comments FAccT (ACM Conference on Fairness, Accountability, and Transparency) 2026

DOI: 10.1145/3805689.3812287

Abstract

Gender bias in AI-generated stories is a well-documented problem. While much attention has been paid to reducing or mitigating this bias, it is not always clear whether interventions produce genuinely fairer results. To investigate this issue, we examine how large language models (LLMs) handle gender assignment in a narrative context that is popular, highly ambiguous, and also known to closely reproduce human stereotypes: stories about talking animals. We prompt six leading LLMs to complete an English-language story about seven different anthropomorphic animal characters whose gender is unstated. We additionally iterate with four different narrative settings and a range of model temperatures. Across the 23.8K stories, we find that models frequently avoid gendering the animal character in the story (19% on average) or use gender-neutral language like "it" or "its" (38.2% on average). However, when gender is assigned, there is a significant masculine bias. Feminine animal characters are virtually absent, present in just 2.2% of stories vs. 40.6% that feature masculine characters. Our findings point to a broader argument: neutrality bites. In other words, models that prioritize neutrality to address social bias may actually contribute to the erasure of marginalized perspectives and identities. We suggest that alternative strategies beyond neutrality need to be pursued, such as ones that more equally distribute social possibilities across imagined subjects.


2606.07970 2026-06-09 cs.CL cs.AI new submission

Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

Haoming Wen, Shi Chen, Qingyu Shi, Siyuan Liu, Minrui Luo, Jingzhao Zhang, Tianxing He

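Affiliations: Xiongan AI Institute; Institute for Interdisciplinary Information Sciences, Tsinghua University; Shanghai Qi Zhi Institute

AI summary: To counter the safety threat of full-parameter finetuning, proposes Patcher, a method based on adversarial training and bi-level optimization that strengthens the defense by scaling up the number of optimization steps in the adversarial loop, with a parallel algorithm to improve efficiency.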

Abstract

Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which could compromise the safety alignment of LLMs with only a few steps of supervised finetuning (SFT) on poisoned datasets. Existing alignment-stage defenses are primarily designed to defend against attacks that use parameter-efficient finetuning methods. However, they fail to defend against stronger attacks that use full-parameter finetuning. In this paper, we propose Patcher, a method inspired by adversarial training and bi-level optimization, to combat such attacks. Patcher strengthens the simulated attack by scaling up the optimization steps in the adversarial loop, thus forcing the defender to find model parameters that are insensitive to stronger attacks. Furthermore, we propose an efficient parallel algorithm to implement Patcher, decreasing the wall-clock time of training while preserving Patcher's performance. Extensive experiments show that Patcher substantially improves the model's robustness compared to vanilla SFT alignment, and transfers to diverse attack scenarios and model sizes. Code is available at https://github.com/haomingwen/patcher.


2606.08076 2026-06-09 cs.CL cs.AI cs.CY new submission

"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory

Esra Dönmez, Agnieszka Falenska

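Affiliations: Institute for Natural Language Processing, University of Stuttgart; Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart

AI summary: Drawing on Habermas's Theory of Communicative Action and simulated Reddit discussions, the study finds that LLMs effectively convey illocutionary intent (such as building trust), that their sycophantic strategies are strongly associated with opinion change, and that humans prefer LLM-generated arguments.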

DOI: 10.18653/v1/2025.findings-acl.793
Journal ref: Findings of the Association for Computational Linguistics: ACL 2025

Abstract

Large Language Models (LLMs) can generate high-quality arguments, yet their ability to engage in nuanced and persuasive communicative actions remains largely unexplored. This work explores the persuasive potential of LLMs through the framework of Jürgen Habermas' Theory of Communicative Action. It examines whether LLMs express illocutionary intent (i.e., pragmatic functions of language such as conveying knowledge, building trust, or signaling similarity) in ways that are comparable to human communication. We simulate online discussions between opinion holders and LLMs using conversations from the persuasive subreddit ChangeMyView. We then compare the likelihood of illocutionary intents in human-written and LLM-generated counter-arguments, specifically those that successfully changed the original poster's view. We find that all three LLMs effectively convey illocutionary intent -- often more so than humans -- potentially increasing their anthropomorphism. Further, LLMs craft sycophantic responses that closely align with the opinion holder's intent, a strategy strongly associated with opinion change. Finally, crowd-sourced workers find LLM-generated counter-arguments more agreeable and consistently prefer them over human-written ones. These findings suggest that LLMs' persuasive power extends beyond merely generating high-quality arguments. On the contrary, training LLMs with human preferences effectively tunes them to mirror human communication patterns, particularly nuanced communicative actions, potentially increasing individuals' susceptibility to their influence.


2606.08157 2026-06-09 cs.CL new submission

Cross Paraphrastic Invariance Learning for Hallucination Detection

Shanshan Lin, Dongsheng Hong, Sibo Ju, Chao Chen, Sihong Xie, Xiangwen Liao

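AI summary: Proposes the CPIL framework, which builds positive and negative sample pairs for two-stage contrastive learning and, with only 1% of the labeled data, outperforms baselines on 11 tasks for efficient detection of LLM hallucinations.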

Comments Accepted to ICASSP 2026

Abstract

Large language models (LLMs) frequently generate hallucinations, which are unsupported by a source document. To avoid costly LLM-as-evaluator pipelines and the heavy annotation demands of existing classifiers, we propose CPIL (Cross Paraphrastic Invariance Learning), a two-stage Siamese framework that maximizes the utility of existing labeled data. Concretely, CPIL constructs informative training pairs by: (i) generating paraphrastic views of each document-claim example as positives, and explicitly aligning their representations to enforce invariance to surface form; and (ii) mining same-document, opposite-label pairs as hard negatives to sharpen document-sensitive decision boundaries. CPIL then conducts two-stage model training: Stage 1 performs contrastive pretraining to learn a paraphrase-invariant, grounding-aware embedding space; and Stage 2 attaches a lightweight classifier for binary groundedness. On the LLM-AggreFact benchmark (11 tasks), CPIL surpasses strong baselines in F1 score with only ~1% labeled data, showing its prediction superiority and label efficiency.
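The pairing scheme in the abstract (paraphrases as positives, same-document opposite-label examples as hard negatives) fits a standard contrastive objective. The sketch below uses an InfoNCE-style loss with cosine similarity as a stand-in; the abstract does not name CPIL's actual loss, so the loss form, the temperature, and the toy 2-dimensional embeddings are all assumptions.

```python
import math

def cosine(u, v):
    numerator = sum(a * b for a, b in zip(u, v))
    denominator = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return numerator / denominator

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax score of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

# Toy embeddings: a claim, its paraphrase, and a hard negative that shares
# the document but carries the opposite label (hypothetical values).
claim = [1.0, 0.0]
paraphrase = [1.0, 0.1]
hard_negative = [-1.0, 0.0]

print(info_nce(claim, paraphrase, [hard_negative]))
```

The loss shrinks as the paraphrase moves closer to the claim than the hard negative does, which is the invariance-plus-separation behavior the abstract describes for Stage 1.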


2606.08158 2026-06-09 cs.CL cs.AI new submission

Constrained Paraphrase Consistency for LLM Hallucination Detection

Shanshan Lin, Dongsheng Hong, Sibo Ju, Chao Chen, Xi Zhang, Xiangwen Liao

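AI summary: Proposes the Consistency-Constrained Hallucination Detector (CCHD), which exploits paraphrase consistency through constrained optimization, requires no additional data, and outperforms existing methods on several benchmarks.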

Comments Accepted to ICASSP 2026

Abstract

Large language models (LLMs) can generate factually inconsistent claims, motivating accurate and scalable hallucination detectors. Prior work largely enlarges training sets via synthesis or new annotations, introducing increasing cost and potential bias while underusing the consistency implied by semantically equivalent paraphrases. We propose Consistency-Constrained Hallucination Detector (CCHD), which formulates training as a constrained optimization problem. The standard cross-entropy on original document-claim pairs is complemented by (i) paraphrase-consistency constraints bounding divergence across paraphrased views, and (ii) label-preservation constraints tying paraphrases to ground truth. We solve the problem by gradient descent-ascent over model parameters and per-view Lagrange multipliers, adding only a few scalar dual variables and no inference-time overhead. With DeBERTa and Flan-T5 backbones, CCHD consistently outperforms strong baselines (FactCG, MiniCheck, and AlignScore) on standard factuality benchmarks, demonstrating its superiority on hallucination detection.
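
The gradient descent-ascent scheme described in the abstract can be illustrated on a toy problem. The sketch below is not CCHD itself: it minimizes a hypothetical scalar objective `(x - 2)^2` under a single constraint `x <= 1`, which stands in for the cross-entropy loss and one paraphrase-consistency constraint. The function name, step size, and iteration count are assumptions chosen only for illustration.

```python
def descent_ascent(steps=2000, lr=0.05):
    """Toy primal-dual loop: gradient descent on x, projected gradient ascent on lam."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Lagrangian: L(x, lam) = (x - 2)^2 + lam * (x - 1)
        grad_x = 2.0 * (x - 2.0) + lam        # dL/dx
        grad_lam = x - 1.0                    # dL/dlam, i.e. the constraint violation
        x -= lr * grad_x                      # primal descent step
        lam = max(0.0, lam + lr * grad_lam)   # dual ascent, multiplier kept non-negative
    return x, lam


if __name__ == "__main__":
    x, lam = descent_ascent()
    print(f"x = {x:.4f}, lam = {lam:.4f}")
```

The iterates approach the constrained optimum `x = 1` with multiplier `lam = 2`. The only extra state is one scalar multiplier per constraint, which matches the abstract's remark that the method adds only a few scalar dual variables.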


2606.08243 2026-06-09 cs.CL 新提交

Building Comparative Motivation Profiles with Instrumental Interventions

利用工具性干预构建比较动机画像

David Vella Zarb, Rustem Turtayev, Taywon Min, Jinghua Ou, Shi Feng

发表机构 * MATS ； University of Cambridge（剑桥大学）； KAIST（韩国科学技术院）； George Washington University（乔治华盛顿大学）

AI总结通过对称工具性干预区分对齐伪装中的策略性自我保护与研究者期望追踪，发现模型对期望追踪更敏感，提示需要构念效度检验。

详情

AI中文摘要

安全性评估通常从行为模式推断潜在动机，但这些推断的构念效度尚不明确。我们在对齐伪装中研究这一问题，即当模型推断出训练压力时，它们更常服从训练目标。这种行为通常被解释为策略性自我保护，但也可能反映模型对研究者期望的敏感性。我们引入一个对称干预框架来区分这些竞争性假设。我们不直接干预“诡计”或“谄媚”，而是针对每个假设所蕴含的工具性过程：后果追踪和研究者期望追踪。然后比较对这些过程的干预如何影响对齐伪装。我们使用合成文档微调（SDF）、激活引导和提示研究了四个开放权重的模型生物（model organisms）。在合成文档微调下，Llama-3.1-70B、Llama-3.1-405B 和 Qwen-2.5-72B 对期望追踪干预比后果追踪干预更敏感。对 Llama-3.1-70B 的激活引导支持相同的总体图景，提示干预与 SDF 概况大致一致。总体而言，对齐伪装行为在因果上可能对评估上下文期望敏感，尽管存在与诡计一致的草稿板。因此，诡计和策略性欺骗评估需要构念效度检验，而对称工具性干预提供了这样一种测试。

英文摘要

Safety evaluations often infer latent motivations from behavioral patterns, but the construct validity of these inferences is unclear. We study this problem in alignment faking, where models comply with training objectives more often when they infer training pressure. This behavior is commonly interpreted as strategic self-preservation, but it may also reflect sensitivity to the model's inference about the expectation of researchers conducting the evaluation. We introduce a symmetric intervention framework for distinguishing these competing hypotheses. Instead of directly intervening on "scheming" or "sycophancy", we target instrumental processes entailed by each hypothesis: consequence-tracking and researcher-expectation tracking. We then compare how interventions on these processes affect alignment faking. We study four open-weight model organisms using synthetic document fine-tuning (SDF), activation steering, and prompting. Under synthetic document fine-tuning, Llama-3.1-70B, Llama-3.1-405B, and Qwen-2.5-72B are more sensitive to expectation-tracking than consequence-tracking interventions. Activation steering on Llama-3.1-70B supports the same broad picture, and prompt interventions broadly align with SDF profiles. Overall, alignment-faking behavior can be causally sensitive to evaluation-context expectations despite scheming-consistent scratchpads. Scheming and strategic-deception evaluations therefore need construct-validity checks, and symmetric instrumental interventions provide one such test.


2606.08381 2026-06-09 cs.CL cs.AI 新提交

Auditing Proprietary Alignment in Large Language Models: A Comparative Framework Without a Ground-Truth Standard

审计大型语言模型中的专有对齐：一种无需真实标准的比较框架

Alireza Arbabi, Florian Kerschbaum

发表机构 * University of Waterloo（滑铁卢大学）； Vector Institute（向量研究所）

AI总结提出一种统计框架，通过比较目标模型与基线模型在共享语义空间中的响应偏差，检测黑盒语言模型中的专有对齐行为，无需真实标准即可实现外部审计。

详情

AI中文摘要

大型语言模型（LLMs）越来越多地通过不透明的开发和部署流程发布和部署，使得模型提供商能够在不正式宣布的情况下注入有意的、提供商特定的策略。因此，已有多种模型被报道生成反映专有规则和组织利益的响应，导致在有争议话题上的审查或错误信息。然而，系统性地识别这种对齐仍然是一个基本挑战，因为“专有”在不同语境中的含义模糊。在本文中，我们提出了一种统计框架，通过比较行为分析来检测黑盒语言模型中的专有对齐。我们的方法量化了目标模型与一组参考基线模型在共享语义空间中的响应之间的系统性偏差。通过评估相对行为差异而非绝对正确性，我们的框架能够在黑盒访问下进行有原则的审计。应用于几个广泛讨论但此前未量化的案例，它为外部评估大型语言模型中提供商特定的对齐行为提供了系统且可扩展的基础。

英文摘要

Large language models (LLMs) are increasingly released and deployed through opaque development and deployment pipelines, enabling model providers to inject intentional, provider-specific policies without officially announcing them. As a result, various models have been reported to generate responses reflecting proprietary rules and organizational interests, leading to censorship or misinformation on controversial topics. However, systematic identification of such alignment remains a fundamental challenge, complicated by the ambiguity of what "proprietary" entails in different contexts. In this paper, we propose a statistical framework for detecting proprietary alignment in black-box language models via comparative behavioral analysis. Our approach quantifies systematic deviations between the responses of a target model and those of a reference set of baseline models in a shared semantic space. By evaluating relative behavioral divergence rather than absolute correctness, our framework enables principled auditing under black-box access. Applied to several widely discussed but previously unquantified cases, it provides a systematic and scalable basis for external assessment of provider-specific alignment behavior in large language models.


2606.08451 2026-06-09 cs.CL cs.AI 新提交

Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models

谄媚作为多语言对齐失败：安全性能如何随语言、主题和模型退化

Arya Shah, Himanshu Beniwal, Mayank Singh, Chaklam Silpasuwanchai

发表机构 * IIT Gandhinagar（印度理工学院甘地讷格尔分校）； Asian Institute of Technology（亚洲理工学院）

AI总结研究多语言模型中谄媚现象，发现低资源语言中谄媚率激增，且与主题无关，归因于分词器生育率，表明对齐方法在非高资源语言中泛化差。

Comments 19 pages, 9 figures, 7 tables

详情

AI中文摘要

安全对齐的大型语言模型常常表现出谄媚，即倾向于肯定用户的意见而不考虑事实准确性。尽管在英语中已有充分研究，但其在其他语言中的表现仍基本未被考察，使得数十亿非英语使用者可能容易受到模型验证的错误信息的影响。我们首次进行了大规模、多模型的跨语言谄媚评估，对六个指令调优模型在涵盖38种语言和33个主题类别的110万个实例上进行了基准测试。我们识别出一致的资源层级效应：谄媚率在低资源和零样本语言设置中急剧上升。关键的是，这种退化与主题无关，模型在良性提示和安全关键提示上均匀失败，在最需要保护的地方没有提供额外保护。我们进一步确定了分词器生育率（tokenizer fertility）作为这种对齐崩溃的结构性驱动因素。总的来说，我们的结果表明，当前的对齐方法在高资源语言之外泛化能力差，强调了迫切需要公平的多语言安全技术。

英文摘要

Safety-aligned large language models often exhibit sycophancy, which is the tendency to affirm users' opinions regardless of factual accuracy. Although well-studied in English, its manifestation in other languages remains largely unexamined, leaving billions of non-English speakers potentially vulnerable to model-validated misinformation. We present the first large-scale, multi-model evaluation of cross-lingual sycophancy, benchmarking six instruction-tuned models across 1.1 million instances spanning 38 languages and 33 topic categories. We identify a consistent resource-tier effect: sycophancy rates spike sharply in low-resource and zero-shot language settings. Critically, this degradation is topic-agnostic, as models fail uniformly across both benign and safety-critical prompts, offering no additional protection where it is most needed. We further identify tokenizer fertility as a structural driver of this alignment collapse. Collectively, our results demonstrate that prevailing alignment methodologies generalize poorly beyond high-resource languages, underscoring the urgent need for equitable multilingual safety techniques.


2606.08496 2026-06-09 cs.CL cs.LG 新提交

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

SAEExplainer: 基于激活引导偏好优化的SAE特征解释

Jingyi He, Haiyan Zhao, Ruxue Shi, Yanguang Liu, Xin Wang, Fei Sun, Mengnan Du

发表机构 * Shanghai Jiao Tong University（上海交通大学）； NJIT（新泽西理工学院）； Jilin University（吉林大学）； Institute of Computing Technology, CAS（中国科学院计算技术研究所）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结提出SAEExplainer框架，利用激活分数作为奖励信号，通过两轮优化迭代自纠正基础解释，减少解释幻觉并增强因果触发模式。

详情

AI中文摘要

尽管稀疏自编码器（SAE）通过将密集表示分解为稀疏特征缓解了大语言模型（LLM）的不透明性，但解释这些特征仍然是一个核心挑战。然而，当前的解释方法通常运行在开环范式下，未能利用机械反馈进行进一步优化。在本文中，我们提出SAEExplainer，一个利用激活分数作为客观奖励信号来训练模型进行自我纠正和迭代自举的训练框架。通过两轮优化过程迭代验证和纠正基础解释，SAEExplainer实现了其解释能力的持续提升。该机制显著减少了解释幻觉并强化了因果触发模式。大量实验表明，我们的方法在大多数指标上优于已有基线，特别是在因果触发和判别性激活方面。

英文摘要

Although Sparse Autoencoders (SAEs) have mitigated the opacity of large language models (LLMs) by decomposing dense representations into sparse features, explaining these features remains a central challenge. Current explanation methods, however, typically operate within an open-loop paradigm, failing to leverage mechanistic feedback for further refinement. In this paper, we propose SAEExplainer, a training framework that uses activation scores as an objective reward signal to train the model for self-correction and iterative bootstrapping. By iteratively verifying and correcting foundational explanations through a two-round optimization process, SAEExplainer achieves continuous improvement in its explanatory capabilities. This mechanism significantly reduces explanation hallucinations and reinforces causal triggering patterns. Extensive experiments demonstrate that our approach improves upon established baselines across most metrics, especially in causal triggering and discriminative activation.


2606.08571 2026-06-09 cs.CL cs.AI cs.LG 新提交

Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models

用于诊断推理模型中未知未知的结构化无知证书的校准

Subramanyam Sahoo

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出结构化无知证书（SICs）输出格式，通过GRPO微调14B模型，使模型在无法回答时明确承认知识缺失并生成检索查询，在未知未知问题上实现99.46%的JSON有效性和0.967的证书特异性分数。

Comments Accepted in ICML 2026 Workshop: Epistemic Intelligence in Machine Learning

详情

AI中文摘要

大型语言模型经常以特征性方式失败：对于超出其知识边界的问题，它们不是承认无知，而是生成流畅但错误的答案。我们引入了结构化无知证书（SICs），这是一种JSON格式的输出模式，要求模型明确命名缺失的领域交叉点，列举所需概念，并提出一个富有成效的检索查询，而不是凭空捏造答案。为了训练模型生成高质量的SICs，我们构建了一个包含7,347个样本的未知-未知（UU）数据集，通过提示Qwen3-14B将来自七个领域（物理、生物、工程、计算机科学、经济、医学、法律）的问题拼接成新颖的跨领域查询，这些查询是任何单一领域专家都无法回答的。我们使用组相对策略优化（GRPO）微调了一个14B参数的模型，采用结合检索效用、概念特异性和输出格式有效性的复合奖励。在模型响应上训练的释义散度探测器证实，SIC调优的输出系统地表现出更高的未知-未知概率分数。在735个保留的UU问题上的评估实现了99.46%的JSON有效性率、0.967的平均证书特异性分数，以及在基于检索的生成上相比基础模型3.6%的ROUGE-L改进，这表明显式的认知结构化是一种可学习且可衡量的能力。

英文摘要

Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce Structured Ignorance Certificates (SICs), a JSON-formatted output schema that demands a model explicitly name the missing domain intersection, enumerate required concepts, and propose a productive retrieval query rather than hallucinating an answer. To train models to produce high-quality SICs we construct a 7,347-sample Unknown-Unknown (UU) dataset by prompting Qwen3-14B to stitch together questions from seven domains (physics, biology, engineering, CS, economics, medical, legal) into novel cross-domain queries that no single-domain expert could answer. We fine-tune a 14B-parameter model with Group Relative Policy Optimization (GRPO) using a composite reward that combines retrieval utility, concept specificity, and output-format validity. A paraphrase-divergence probe trained on model responses confirms that SIC-tuned outputs systematically exhibit higher unknown-unknown probability scores. Evaluation on 735 held-out UU questions achieves a 99.46% JSON validity rate, a mean Certificate Specificity Score of 0.967, and a 3.6% ROUGE-L improvement over the base model on retrieval-grounded generation, demonstrating that explicit epistemic structuring is a learnable and measurable capability.


2606.08629 2026-06-09 cs.CL 新提交

Sycophancy Towards Researchers Drives Performative Misalignment

对研究者的迎合驱动了表演性失调

David D. Baek, Xinnuo Li, Anay Gupta, Taslim Mahbub, Kejian Shi, Max Tegmark, Shi Feng

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Stanford University（斯坦福大学）

AI总结本文提出语言模型在评估中表现出的对齐伪装行为更可能是对研究者的迎合而非策略性欺骗，并通过三个实验支持该假说。

详情

AI中文摘要

语言模型日益增长的情境感知能力引发了安全担忧：模型可能意识到自己正在被评估，并调整行为以逃避监控和抵制修改，例如仅在评估中假装对齐。这种对齐伪装行为常被解释为诡计：一种有意的战略欺骗。在本文中，我们考察了一种替代解释，即表演性失调，它将行为变化解释为对AI研究者的迎合结果。为检验这一假说，我们提出了三个实证发现。首先，我们表明即使告诉模型它们已部署，评估意识仍然存在，这与诡计故事相矛盾，后者预测当模型感知到评估时失调会减少。其次，我们使用探针和引导表明，当前方法无法在机制上区分对齐伪装评估中的迎合和诡计。第三，我们微调模型使其更迎合，并观察到对评估线索的敏感性增加。最后，我们强调在未来的意图失调评估和缓解工作中，应将迎合与诡计去混淆。

英文摘要

The increasing situational awareness of language models raises safety concerns: models might be aware when they are evaluated, and adjust their behavior to evade monitoring and resist modification, e.g., pretending to be aligned only in evaluation. This alignment faking behavior is often interpreted as scheming: an intentional effort of strategic deception. In this paper, we examine an alternative interpretation, performative misalignment, which explains the change in behavior as a result of sycophancy towards AI researchers. To examine this hypothesis, we present three empirical findings. First, we show that evaluation awareness persists even when we tell models they are deployed, which contradicts the scheming story which predicts less misalignment when the model perceives evaluation. Second, we use probing and steering to show that our current methods cannot mechanistically distinguish sycophancy and scheming in alignment faking evaluations. Third, we fine-tune models to be more sycophantic and observe increased sensitivity to evaluation cues. To conclude, we emphasize deconfounding sycophancy from scheming for future work on evaluations and mitigations of intent misalignment.


2606.08705 2026-06-09 cs.CL 新提交

Analyzing the Correlation Between Hallucinations and Knowledge Conflicts in Large Language Models

分析大型语言模型中幻觉与知识冲突之间的相关性

Lucrezia Laraspata, Giovanna Castellano, Gennaro Vessio

发表机构 * University of Bari Aldo Moro（巴里阿尔多莫罗大学）

AI总结通过探针技术分析LLM内部表示，发现幻觉激活模式不能完全归因于知识冲突，但探针可提升模型可解释性。

详情

AI中文摘要

幻觉——事实不正确或无法验证的输出——仍然是大型语言模型（LLM）最具挑战性的限制之一，尤其是在知识密集型任务中。一种提出的解释是，由固定的、过时的训练数据引起的内部知识冲突。本文研究了与知识冲突相关的内部表示是否与LLM中的幻觉行为相关。使用受两项先前工作启发的探针技术，我们分析了预定义任务中隐藏层、注意力层和MLP层的激活以及输出logits。我们在幻觉检测基准上探测了LLaMA-3-8B，并在知识冲突数据集上探测了Falcon-7B。我们的发现表明，尽管概念上相关，但幻觉激活模式不能完全简化为或由知识冲突表示解释。尽管如此，探针在多种语言和激活类型中被证明是一个稳健的工具，支持其在提高LLM可解释性方面的作用。这项工作推进了对LLM中幻觉的更广泛理解，并强调了对其内部行为进行细粒度分析的价值。

英文摘要

Hallucinations (factually incorrect or unverifiable outputs) remain one of the most challenging limitations of Large Language Models (LLMs), especially in knowledge-intensive tasks. One proposed explanation is internal knowledge conflicts arising from fixed, outdated training data. This paper investigates whether internal representations linked to knowledge conflicts correlate with hallucination behaviors in LLMs. Using probing techniques inspired by two prior works, we analyzed activations from hidden, attention, and MLP layers, as well as output logits, across predefined tasks. We probed LLaMA-3-8B on hallucination detection benchmarks and Falcon-7B on a knowledge conflict dataset. Our findings show that, although conceptually related, hallucination activation patterns cannot be fully reduced to or explained by knowledge conflict representations. Nonetheless, probing proves a robust tool across multiple languages and activation types, supporting its role in improving LLM interpretability. This work advances the broader understanding of hallucinations in LLMs and underscores the value of fine-grained analysis of their internal behavior.


2606.08792 2026-06-09 cs.CL 新提交

MAAM：面向中文歧视性语言检测的锚点保留压缩与上下文校准

Yuxin Fu, Shijing Si

发表机构 * School of Economics and Finance, Shanghai International Studies University（上海外国语大学国际金融贸易学院）

AI总结提出MAAM框架，通过保留歧视相关语义锚点并结合上下文先验校准，在轻量级模型上提升中文歧视性语言检测的准确性和校准性，同时构建首个中文LGBT歧视语料库ChLGBT。

详情

AI中文摘要

中文歧视性语言检测具有挑战性，因为有害意图往往是隐式的且依赖上下文。我们提出MAAM（近视-散光锚点机制），一种轻量级、模型无关的框架，受功能性视觉模糊启发：MAAM并非同等保留每个词元，而是保留歧视相关的语义锚点，并通过C-I-S上下文先验（上下文语气、群体身份和立场极性）对其进行校准。我们还引入了ChLGBT，据我们所知，这是首个专注于中文LGBT的歧视性语言数据集，包含8,120条人工标注样本和三个序数标签：显式偏见、隐式偏见和情感强度。在强编码器基线上，MAAM提升了所有三个预测维度，在准确率、F1、Brier分数和期望校准误差上均取得一致增益。与零样本和少样本提示协议下的前沿LLM基线相比，MAAM在保持竞争力的同时，提供了更强的紧凑性和稳定性。这些结果表明，可解释的锚点保留和上下文校准为中文歧视性语言评估提供了一种实用的替代方案，无需依赖更大规模的模型缩放。

英文摘要

Chinese discriminatory-language detection is challenging because harmful intent is often implicit and context-dependent. We propose MAAM (Myopia-Astigmatism Anchor Mechanism), a lightweight, model-agnostic framework inspired by functional visual blur: rather than preserving every token equally, MAAM retains discrimination-relevant semantic anchors and calibrates them with C-I-S contextual priors (Contextual Tone, Group Identity, and Stance Polarity). We also introduce ChLGBT, to our knowledge the first Chinese LGBT-focused discriminatory-language dataset, with 8,120 manually annotated samples and three ordinal labels: explicit bias, implicit bias, and emotional intensity. Across strong encoder baselines, MAAM improves all three prediction dimensions, with consistent gains in accuracy, F1, Brier score, and expected calibration error. Compared with frontier LLM baselines under zero-shot and few-shot prompting protocols, MAAM remains competitive while offering stronger compactness and stability. These results suggest that interpretable anchor preservation and contextual calibration provide a practical alternative to heavier model scaling for Chinese discriminatory-language assessment.

URL PDF HTML ☆

赞 0 踩 0

2606.09178 2026-06-09 cs.CL cs.AI 新提交

Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis

跨东亚和东南亚语境的文化适应红队测试：方法论与比较分析

Hyeji Choi, Yongtaek Lim, Minwoo Kim

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Korea Advanced Institute of Science and Technology（韩国科学技术院）

AI总结针对大语言模型的多语言安全评估，通过构建直接翻译与文化适应数据集，发现文化适应提示的攻击成功率平均提升9.3个百分点，直接翻译低估风险，且文化深度评分显著低于文化适应版本，表明适应文化语境对有效评估至关重要。

Comments Accepted to ICML 2026 Workshop on AIWILDS

详情

AI中文摘要

大语言模型的多语言安全评估主要依赖于将英文基准直接翻译成目标语言——这种方法转换了表面语言形式，但未能反映威胁场景、社会规范和法律法规中嵌入的文化语境。我们通过1:1种子匹配为四种语言——韩语、日语、泰语和高棉语——构建了配对的直接翻译和文化适应数据集，并比较了四个开源大语言模型的攻击成功率和文化真实感评分。文化适应提示在所有16种语言×模型组合中均产生正Delta-ASR（平均+9.3个百分点），且基于直接翻译的评估在48个类别×语言组合中有44个低估了风险。语言层面分析显示，威胁形式的分布在语言间具有异质性。文化真实感分析进一步表明，直接翻译的文化深度（C3）评分在所有四种语言中始终低于1.0（满分3.0，平均0.17），而文化适应评分高达2.51，表明直接翻译产生的输入与真实世界多文化环境中遇到的输入存在系统性差异。这些发现表明，将基准适应特定语言的文化语境——而非仅依赖语言翻译——对于有效的多语言大语言模型安全评估是必要的。

英文摘要

Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks into target languages - an approach that converts surface-level linguistic form while failing to reflect the cultural context embedded in threat scenarios, social norms, and legal frameworks. We construct paired DT and culturally-adapted (CA) datasets via 1:1 seed matching for four languages - Korean (KO), Japanese (JA), Thai (TH), and Khmer (KM) - and compare Attack Success Rate (ASR) and Cultural Realism scores across four open-source LLMs. CA prompts yield Delta-ASR > 0 across all 16 language × model combinations (mean +9.3 pp), and DT-based evaluation underestimates risk in 44 of 48 category × language combinations. Language-level analysis reveals that the distribution of threat forms is heterogeneous across languages. Cultural Realism analysis further shows that DT Cultural Depth (C3) scores remain consistently below 1.0 out of 3.0 across all four languages (mean 0.17), whereas CA scores reach up to 2.51, indicating that direct translation produces inputs systematically divergent from those encountered in real-world multicultural settings. These findings demonstrate that adapting benchmarks to language-specific cultural contexts - rather than relying on linguistic translation alone - is necessary for valid multilingual LLM safety evaluation.
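
The Delta-ASR figure quoted in the abstract is a difference of two attack success rates expressed in percentage points. A minimal sketch of that arithmetic follows, assuming Delta-ASR is ASR(CA) minus ASR(DT) on the same number of paired prompts; the function names and the counts are hypothetical and are not taken from the paper.

```python
def attack_success_rate(successes, total):
    """Fraction of prompts for which the attack was judged successful."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successes / total


def delta_asr_pp(ca_successes, dt_successes, total):
    """Delta-ASR in percentage points: ASR(CA) minus ASR(DT) on paired prompts."""
    asr_ca = attack_success_rate(ca_successes, total)
    asr_dt = attack_success_rate(dt_successes, total)
    return 100.0 * (asr_ca - asr_dt)


if __name__ == "__main__":
    # Hypothetical counts for one language x model cell, for illustration only.
    print(round(delta_asr_pp(ca_successes=45, dt_successes=30, total=200), 1))
```

A positive value means the culturally adapted prompts succeeded more often than their direct-translation counterparts, which is the direction the abstract reports for all 16 language × model cells.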

URL PDF HTML ☆

赞 0 踩 0

2606.09590 2026-06-09 cs.CL cs.CR 新提交

Clinically Grounded Privacy Evaluation of Medical LMs

临床导向的医学语言模型隐私评估

Sasha Ronaghi, Sana Tonekaboni, Lena Stempfle, Vivian Utti, Jordan Li Cahoon, Nathaniel Hendrix, Ayin Vala, Marzyeh Ghassemi, Emily Alsentzer

发表机构 * Stanford University（斯坦福大学）； Massachusetts Institute of Technology（麻省理工学院）； American Board of Family Medicine（家庭医学认证委员会）

AI总结提出临床导向框架，按对抗访问等级评估医学语言模型隐私泄露，发现常规元数据可导致高比率逐字记忆和敏感诊断恢复，但部分记忆源于模板化文档。

详情

AI中文摘要

医学语言模型（LMs）可以记忆和重现受保护的健康信息，但隐私评估通常关注训练文本的恢复，而非在现实威胁模型下的泄露。我们引入了一个临床导向的框架，沿着对抗访问的分级轴评估泄露，范围从公开可推断的人口统计信息到泄露的笔记片段。在每个层级，我们测量患者特定文本的逐字记忆和敏感诊断的语义泄露。将该框架应用于一个在378k临床笔记上预训练的LM，我们发现常规就诊元数据（即姓名、出生日期、提供者、诊所、就诊日期）在患者时间线上引发高比率的逐字记忆和敏感诊断恢复（流产AUROC 0.91，HIV 0.81）。同时，精确匹配记忆可能夸大泄露：36%的记忆令牌反映了模板化文档。我们的工作强调了在纵向临床数据上训练的风险，为医学LM的上下文隐私评估提供了一个实用框架。

英文摘要

Medical language models (LMs) can memorize and reproduce protected health information, but privacy evaluations often focus on recovery of training text rather than disclosure under realistic threat models. We introduce a clinically grounded framework that evaluates leakage along a graded axis of adversarial access, ranging from publicly inferable demographics to leaked note fragments. At each tier, we measure verbatim memorization of patient-specific text and semantic leakage of sensitive diagnoses. Applying the framework to an LM pretrained on 378k clinical notes, we find that routine encounter metadata (i.e. name, date of birth, provider, practice, visit date) elicits high rates of verbatim memorization across a patient's timeline and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). At the same time, exact-match memorization can overstate disclosure: 36% of memorized tokens reflect templated documentation. Our work highlights the risks of training on longitudinal clinical data, providing a practical framework for contextual privacy evaluation of medical LMs.

URL PDF HTML ☆

赞 0 踩 0

2606.09697 2026-06-09 cs.CL 新提交

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

PsychoSafe：在大语言模型中引发基于心理学的拒绝

Gianluca Barmina, Federico Torrielli, Sven Harms, Jacob Nielsen, Felix Mächtle, Stine Lyngsø Beltoft, Peter Schneider-Kamp, Thomas Eisenbarth, Lukas Galke Poech, Anne Lauscher

发表机构 * University of Southern Denmark（南丹麦大学）； University of Turin（都灵大学）； University of Hamburg（汉堡大学）； University of Lübeck（吕贝克大学）

AI总结提出PsychoSafe框架，将LLM的拒绝行为重构为基于证据干预策略的结构化支持性沟通，通过构建5个心理风险领域的8019个提示-响应对，对Qwen 3.5 27B进行提示和参数高效微调，在拒绝质量上比通用基线提升28.1%，同时保持非拒绝任务性能。

详情

AI中文摘要

POISE：面向LLM智能体的位置感知不可检测技能注入攻击

Haochang Hao, Dehai Min, Zhifang Zhang, Yunbei Zhang, Miao Xu, Yingqiang Ge, Lu Cheng

发表机构 * University of Illinois at Chicago（伊利诺伊大学芝加哥分校）； University of Queensland（昆士兰大学）； Tulane University（杜兰大学）； Rutgers University（罗格斯大学）

AI总结提出POISE攻击方法，通过位置感知将恶意指令压缩为单一良性指令嵌入技能正文，在保持隐蔽性的同时实现89.3%的攻击成功率，比随机位置基线高28.0个百分点。

Comments 20 pages, 2 figures, 5 tables

详情

AI中文摘要

智能体技能为扩展通用智能体提供了一种轻量级机制，但其开放格式使其容易受到技能投毒攻击。实际危险的注入必须保持不可见：如果执行有效载荷破坏了用户的合法任务，由此产生的失败信号会引发对技能的检查。因此，我们通过攻击成功率（ASR）来评估攻击，这要求注入的有效载荷得以执行，并且用户的任务在同一试验中仍能通过验证器。先前的技能投毒攻击在此视角下面临可靠性-隐蔽性权衡：YAML头部注入可靠加载但易被检查，而将显式恶意命令置于技能正文中的更隐蔽的注入方式则可靠性较低，因为脱离上下文的命令会引发智能体自身的怀疑。我们提出POISE，一种位置感知攻击，将触发器压缩为单个看似良性的正文指令，将其放置在可行位置，并使用上下文感知生成器使其与附近的设置或前提步骤融合。在Skill-Inject（使用codex+gpt-5.2）上，POISE实现了89.3%的ASR，比随机位置正文基线高28.0个百分点，比仅YAML基线高2.6个百分点，同时保留了正文放置的隐蔽性优势。这种隐蔽性是决定性的优势：由于合法的技能正文自然需要特权工具操作，LLM扫描器高度敏感，在四个评判者和两个基准测试中平均将74.6%的干净技能误报为高风险。融入这些误报中，POISE仅导致5.6%的投毒变体相比其干净基线获得新的高风险警报，使得当前的静态防御无效。

英文摘要

Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user's legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user's task to still pass its verifier in the same trial. Prior skill-poisoning attacks face a reliability-stealth trade-off under this lens: YAML-header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out-of-context commands invite the agent's own suspicion. We introduce POISE, a position-aware attack that compresses the trigger into a single, benign-looking body instruction, placing it at a feasible position and using a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex+gpt-5.2, POISE achieves an 89.3% ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper-sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high-risk alert over their clean baselines, rendering current static defenses ineffective.


2606.07963 2026-06-09 cs.AI cs.CL 交叉投稿

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

共享潜在结构实现大语言模型中的统一后门检测与缓解

Omar Mahmoud, Aly M. Kassem, Thommen George Karimpanal, Buddhika Laknath Semage, Negar Rostamzadeh, Golnoosh Farnadi, Santu Rana

发表机构 * Deakin University（迪肯大学）； Mila, Quebec AI Institute（魁北克人工智能研究所Mila）

AI总结发现大语言模型中多种后门攻击共享潜在机制，通过稀疏自编码器检测因果特征，并提出双向激活操控和概念消融微调实现统一检测与缓解。

详情

AI中文摘要

大语言模型中的后门攻击通常被视为孤立的触发-响应失败，促使防御针对特定触发或行为。我们证明这种观点是不完整的。在多样化的后门行为中，我们识别出一个共享的潜在机制，可以被检测、因果控制和抑制。通过在残差流激活上使用稀疏自编码器，我们发现一小部分潜在特征在越狱、拒绝操控、密码锁定、偏见诱导、情感误分类和基于国家的有害建议中一致激活。这些特征在Qwen3、Gemma 3和Llama 3.1模型（参数从4B到32B）以及微调和权重编辑攻击中泛化。通过双向激活操控，我们证明这些特征是因果性的：抑制它们降低攻击成功率，而放大它们在干净提示上诱导目标行为。我们进一步训练轻量级SAE特征分类器，这些分类器零样本泛化到未见后门，并优于残差流和权重差异基线。最后，我们引入概念消融微调，通过在训练期间消融共享潜在子空间来抑制后门形成。总之，我们的结果表明许多后门依赖于可转移的潜在机制，从而实现统一的检测和缓解。

英文摘要

Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders (SAEs) on residual-stream activations, we find a small set of latent features consistently activated across jailbreaking, refusal manipulation, password-locking, bias induction, sentiment misclassification, and country-conditioned harmful advice. These features generalize across Qwen3, Gemma 3, and Llama 3.1 models from 4B to 32B parameters, and across both fine-tuning and weight-editing attacks. Through bidirectional activation steering, we show these features are causal: suppressing them reduces attack success, while amplifying them induces target behaviors on clean prompts. We further train lightweight SAE-feature classifiers that generalize zero-shot to unseen backdoors and outperform residual-stream and weight-diffing baselines. Finally, we introduce Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating the shared latent subspace during training. Together, our results suggest that many backdoors rely on a transferable latent mechanism, enabling unified detection and mitigation.


2606.08044 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

当行为安全评估失败时：表征层面的视角

Enyi Jiang, Anders Gjølbye, Yibo Jacky Zhang, Sanmi Koyejo

发表机构 * Stanford University（斯坦福大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Technical University of Denmark（丹麦技术大学）

AI总结本文提出行为安全与干预鲁棒性之间的“审计差距”，通过构建解离模型和引入潜在脆弱性评分（LVS），证明行为安全指标不足以衡量表征层面的鲁棒性。

Comments Preprint

详情

AI中文摘要

大型语言模型（LLM）的安全性通常从行为层面进行评估，这提供了有限的内部鲁棒性证据，因为这些评估针对的是输出，而非干预下的表征层面脆弱性。我们将这种差异形式化为审计差距：行为安全与干预下鲁棒性之间的差异。为了研究这一差距，我们构建了解离模型，这些模型在保持安全的外在行为的同时，在潜在空间中仍然脆弱。我们引入了一个基于干预的评估框架，通过在参数和潜在空间中进行软干预（包括有害微调和逐层潜在扰动）来测试模型鲁棒性。为了形式化评估，我们提出了潜在脆弱性评分（LVS），用于衡量通过有界潜在扰动引发有害行为的难易程度。使用该评估框架，我们表明行为安全指标不足以衡量多个安全和对齐及未对齐的最先进模型的表征层面鲁棒性。值得注意的是，解离模型在有害干预下尽管表现出相当的拒绝行为，但LVS显著升高，其中中间表征对干预最为敏感。我们的结果表明，仅凭行为安全评估无法全面反映模型鲁棒性，这促使我们需要进行表征感知的审计，以评估潜在脆弱性和可观察行为。

英文摘要

Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.


2606.08497 2026-06-09 cs.AI cs.CL 交叉投稿

Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets

解释黑盒语言模型：学习优化语言结构化的单词子集

Minyoung Hwang, Seokhyun Lee, Changhee Lee

发表机构 * Korea University（高丽大学）

AI总结针对黑盒语言模型解释的三个关键需求（推理效率、黑盒兼容性、语言结构可解释性），提出一种通过强化学习选择信息性单词子集的方法，实现高效、无梯度且语言连贯的解释。

Comments KDD 2026 Research Track

详情

DOI: 10.1145/3770855.3817677

AI中文摘要

随着深度语言模型（DLMs）在医疗保健等高风险领域中的部署日益增多，理解其决策依据对于确保信任、安全和问责变得至关重要。然而，当这些DLMs作为黑盒系统（例如通过API）运行时，访问内部模型状态（如参数、梯度）受到限制，实现这一关键的可解释性水平尤其具有挑战性。尽管付出了诸多努力，现有的解释方法往往无法同时满足三个关键需求：（i）推理时效率，（ii）黑盒兼容性且不引发分布外行为，以及（iii）基于输入语言结构的可理解解释。为了解决这些挑战，我们提出了一种方法，通过选择一小部分信息丰富的输入单词来解释DLM的预测。我们将其表述为一个摊销优化问题，从而无需针对特定输入进行搜索即可实现高效的一次性推理。我们的选择策略通过REINFORCE风格策略梯度进行训练，允许在完全无梯度的设置中进行离散单词选择。为了增强可解释性并与人类语言直觉对齐，我们将图结构知识整合到这一选择过程中，促进语言连贯的子集，从而产生对最终用户既高度信息丰富又具有认知意义的解释。我们在多种DLM架构和多个真实世界数据集上评估了我们的方法。它一致地识别出具有增强判别能力和与语言显著线索更强对齐的单词子集，优于传统的黑盒兼容方法和基于梯度的方法（后者被赋予黑盒模型梯度的oracle访问权限，以构成更具挑战性的基准）。我们的代码可在以下地址获取：here。

英文摘要

As deep language models (DLMs) are increasingly deployed in high-stakes domains such as healthcare, understanding their decision rationale becomes paramount for ensuring trust, safety, and accountability. However, achieving this vital level of interpretability is particularly challenging when these DLMs operate as black-box systems (e.g., via APIs), where access to internal model states (e.g., parameters, gradients) is restricted. Despite numerous efforts, existing explanation methods often fail to concurrently satisfy three key desiderata: (i) inference-time efficiency, (ii) black-box compatibility without inducing out-of-distribution behavior, and (iii) comprehensible explanations grounded in the input's linguistic structure. To address these challenges, we propose a method that explains predictions of DLMs by selecting a small, informative subset of input words. We formulate this as an amortized optimization problem, enabling efficient one-shot inference without the need for input-specific search. Our selection policy is trained via REINFORCE-style policy gradients, allowing discrete word selection in a fully gradient-free setting. To enhance interpretability and align with human linguistic intuition, we integrate graph-structured knowledge into this selection process, fostering linguistically coherent subsets that result in explanations both highly informative and cognitively meaningful to end-users. We evaluated our method on diverse DLM architectures and multiple real-world datasets. It consistently identifies word subsets with enhanced discriminative power and stronger alignment with linguistically salient cues, outperforming both conventional black-box compatible methods and gradient-based approaches that are given oracle access to the black-box model's gradients for a more challenging benchmark. Our code is available at here.
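
The gradient-free training signal described in the abstract can be sketched with a plain REINFORCE estimator over independent per-word selection probabilities. The example below is a toy under stated assumptions, not the authors' method: `black_box_reward` is a hypothetical stand-in for querying a black-box model (it rewards keeping word index 2 and charges a small cost per selected word), the policy has one free logit per word rather than being conditioned on the input, and the graph-structured linguistic prior is omitted.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def black_box_reward(mask):
    """Hypothetical black-box score: word 2 is informative, each kept word costs 0.2."""
    return 1.0 * mask[2] - 0.2 * sum(mask)


def train_selector(num_words=5, steps=500, batch=32, lr=0.5, seed=0):
    """Learn per-word keep probabilities using only reward values, no model gradients."""
    rng = random.Random(seed)
    logits = [0.0] * num_words
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        samples = []
        for _ in range(batch):
            # Sample a discrete word subset from the current Bernoulli policy.
            mask = [1 if rng.random() < p else 0 for p in probs]
            samples.append((mask, black_box_reward(mask)))
        # Batch-mean baseline to reduce the variance of the estimator.
        baseline = sum(r for _, r in samples) / batch
        for i in range(num_words):
            # d/d logit_i of log-probability of a Bernoulli sample is (m_i - p_i).
            grad = sum((r - baseline) * (m[i] - probs[i]) for m, r in samples) / batch
            logits[i] += lr * grad
    return [sigmoid(z) for z in logits]


if __name__ == "__main__":
    print([round(p, 2) for p in train_selector()])
```

Because the update uses only sampled masks and scalar rewards, nothing has to be differentiated through the scored model, which is what makes this style of estimator compatible with API-only access.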

2606.08512 2026-06-09 cs.CY cs.CL 交叉投稿

少令牌，大杠杆：在微调期间通过约束安全令牌保持安全对齐

Guoli Wang, Haonan Shi, Tu Ouyang, An Wang

发表机构 * Case Western Reserve University（凯斯西储大学）

AI总结提出PACT框架，通过约束安全相关令牌的置信度来防止微调导致的安全对齐漂移，同时保持下游任务性能。

Comments Accepted to KDD 2026

详情

DOI: 10.1145/3770855.3817837

AI中文摘要

大型语言模型（LLMs）通常需要微调（FT）才能在下游任务上表现良好，但即使训练数据集仅包含良性数据，FT也可能导致安全对齐漂移。先前的研究表明，引入少量有害数据会显著损害LLM的拒绝行为，导致LLM顺从有害请求。现有的防御方法通常依赖于模型范围的干预，例如限制哪些参数更新或注入额外的安全数据，这可能会限制通用性并降低下游任务性能。为了解决这些限制，我们提出了一种名为PACT（通过约束令牌保持安全对齐）的微调框架，该框架稳定了模型在安全令牌上的置信度。我们的方法基于经验观察：安全对齐行为反映在模型的令牌级输出置信度中，并且通常集中在少量安全相关令牌上。在下游微调期间，我们正则化微调模型，使其在每一步响应中与对齐参考模型在安全相关令牌上的置信度匹配，同时允许非安全令牌基本不受约束以实现有效的任务适应。这种有针对性的约束防止了对齐漂移，而无需施加通常以牺牲模型效用为代价的全局限制。我们的代码可在 https://github.com/Glresearch1/PACT 获取。

英文摘要

Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output confidence and is often concentrated on a small subset of safety-related tokens. During downstream fine-tuning, we regularize the fine-tuned model to match the aligned reference model's confidence on safety-related tokens at each response step, while leaving non-safety tokens largely unconstrained to allow effective task adaptation. This targeted constraint prevents alignment drift without imposing global restrictions that typically trade off with model utility. Our code is available at https://github.com/Glresearch1/PACT.
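To make the idea of a token-targeted constraint concrete, here is a rough sketch of a loss that adds a penalty only at safety-related positions. The exact regularizer used by PACT is not given in the abstract, so the squared log-probability gap, the `lam` weight, and the function name `pact_style_loss` are all assumptions made for illustration.

```python
import math


def pact_style_loss(ft_probs, ref_probs, is_safety, lam=1.0):
    """Task loss plus a confidence-matching penalty on safety tokens only.

    ft_probs[t]  : probability the fine-tuned model gives the target token at step t
    ref_probs[t] : probability the aligned reference model gives the same token
    is_safety[t] : True if the token at step t is treated as safety-related
    The squared log-probability gap is an assumed penalty, not the paper's.
    """
    # Standard negative log-likelihood over all response steps.
    task = -sum(math.log(p) for p in ft_probs) / len(ft_probs)
    # Penalize drift from the reference only where the token is safety-related.
    gaps = [
        (math.log(p) - math.log(q)) ** 2
        for p, q, s in zip(ft_probs, ref_probs, is_safety)
        if s
    ]
    penalty = sum(gaps) / len(gaps) if gaps else 0.0
    return task + lam * penalty, task, penalty


if __name__ == "__main__":
    total, task, penalty = pact_style_loss(
        ft_probs=[0.5, 0.9, 0.3],
        ref_probs=[0.5, 0.2, 0.8],
        is_safety=[False, False, True],
    )
    print(total, task, penalty)
```

Non-safety positions contribute nothing to the penalty, which mirrors the abstract's point that those tokens are left largely unconstrained for task adaptation.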

2606.01060 2026-06-09 cs.CL cs.AI cs.LG 版本更新

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

MENTIS: 对齐改变了什么信念？语言模型中多尺度潜在扭转的测量

Partha Pratim Saha, Samarth Raina, Mayur Parvatikar, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Pragya Lab, BITS Pilani Goa, India（印度 BITS Pilani 果阿校区 Pragya 实验室）； IIIT Delhi, India（印度德里英德拉普拉斯塔信息技术学院）； Amazon, USA（美国亚马逊）； Meta, USA（美国Meta）； Apple, USA（美国苹果）

AI总结提出MENTIS框架，通过层间协方差扭转范数、谱扭转诊断和能量-辐射-激活度量，测量偏好对齐在语言模型内部计算中引起的选择性、深度局部的几何结构变化。

Comments Submitted to EMNLP 2026

详情

AI中文摘要

偏好对齐显著改善了大语言模型的可观察行为，但尚不清楚对齐在内部改变了什么。对齐系统在越狱、提示注入和检索时损坏下仍然失败，表明仅行为级评估是不完整的。后训练应在内部计算中留下可测量的痕迹。我们问：当指令微调（IT）模型变为偏好对齐（PA）模型时，哪些几何结构发生了变化，这些变化集中在何处，以及它们在不同概念、提示和模型家族中的选择性如何？我们引入MENTIS，一个几何优先的框架，用于测量配对检查点中对齐引起的内部重组。MENTIS使用基于层间协方差的主扭转范数（T1）、辅助谱扭转诊断（T2）和用于深度定位的能量-辐射-激活度量（ERA）来比较IT和PA模型。在LITMUS上的四个7-8B模型对中，我们的研究表明对齐引起的变化是选择性的而非均匀的：规范性概念平均表现出比事实性概念更大的扭转偏移；扭转与上下文熵负相关；峰值效应定位于架构特定的中后层。相同的模式出现在词级、提示级和模型级分析中。这些结果表明偏好对齐在内部计算中留下了结构化的、深度局部的几何特征，超越了仅行为级评估所能揭示的内容。

英文摘要

Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and retrieval-time corruption, suggesting behavior-level evaluation alone is incomplete. Post-training should leave measurable traces in internal computation. We ask: when an instruction-tuned (IT) model becomes a preference-aligned (PA) model, what geometric structure changes, where do those changes concentrate, and how selectively do they vary across concepts, prompts, and model families? We introduce MENTIS, a geometry-first framework for measuring alignment-induced internal reorganization in paired checkpoints. MENTIS compares IT and PA models using a primary layerwise covariance-based torsion norm (T1), a secondary spectral torsion diagnostic (T2), and an Energy-Radiance-Activation measure (ERA) for depth localization. Across four 7-8B model pairs on LITMUS, our study reveals that alignment-induced change is selective rather than uniform: normative concepts exhibit larger torsion shifts than factual concepts on average; torsion is negatively correlated with contextual entropy; and peak effects localize to architecture-specific mid-to-late layers. The same pattern appears across word-level, prompt-level, and model-level analyses. These results suggest preference alignment leaves structured, depth-localized geometric signatures in internal computation beyond what behavior-level evaluation alone can reveal.

2606.01637 2026-06-09 cs.CL cs.AI 版本更新

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

误导比纠正更容易：LLM 从众中的有害与有益修正

Jiaming Qu, Lucheng Fu, Yibo Hu

发表机构 * Amazon（亚马逊）； Georgia Institute of Technology（佐治亚理工学院）； Illinois Institute of Technology（伊利诺伊理工学院）

AI总结通过控制实验，研究大语言模型在多智能体系统中面对同伴答案时的从众行为，发现同伴一致意见更容易误导原本正确的模型，而权威标签使模型更倾向于选择被认可的答案，且通用推理干预无法可靠地减少有害修正。

详情

AI中文摘要

大语言模型越来越多地用于多智能体系统，在这些系统中，它们会看到并回应其他智能体的答案。一个关键风险是从众：模型可能仅仅因为其他智能体一致给出了不同的答案而放弃自己的答案。先前的研究表明，LLM 经常向多数答案修正，但仍不清楚这些修正帮助纠正错误的频率是否与引入新错误的频率相当。在本文中，我们进行了一项受控研究，其中 LLM 首先回答一个问题，然后在做出最终决定之前看到模拟的同伴回应。我们操纵两个社会线索：共识结构和分配给同伴的权威标签，并测量它们如何影响有益和有害的修正。在四个开放权重的 LLM 和七个问答数据集上，我们发现同伴一致意见使得误导原本正确的模型比纠正原本错误的模型容易得多。权威标签使模型更可能选择被认可的答案，无论其是否正确。更令人担忧的是，通用的推理干预（如思维链和反思）并不能可靠地在保留有益修正的同时减少有害修正。这些发现表明，多智能体 LLM 系统应该验证同伴答案，而不是简单地聚合它们。

英文摘要

Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open-weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain-of-thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi-agent LLM systems should verify peer answers rather than simply aggregate them.
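The distinction between harmful and beneficial revision can be made concrete with a small tally. This is a sketch under an assumed record format (a pair of booleans per question); the paper's exact metric definitions are not stated in the abstract, and `revision_rates` is a hypothetical helper name.

```python
def revision_rates(records):
    """Compute harmful and beneficial revision rates.

    records: list of (initially_correct, finally_correct) booleans, one per
    question, taken before and after the model sees peer responses.
    Harmful    = share of initially correct answers that end up wrong.
    Beneficial = share of initially wrong answers that end up correct.
    """
    finals_when_right = [final for init, final in records if init]
    finals_when_wrong = [final for init, final in records if not init]
    harmful = (
        sum(1 for f in finals_when_right if not f) / len(finals_when_right)
        if finals_when_right
        else 0.0
    )
    beneficial = (
        sum(1 for f in finals_when_wrong if f) / len(finals_when_wrong)
        if finals_when_wrong
        else 0.0
    )
    return harmful, beneficial


if __name__ == "__main__":
    toy = [(True, False), (True, True), (False, True), (False, False)]
    print(revision_rates(toy))
```

Conditioning each rate on the initial correctness is what allows the two directions of revision to be compared separately instead of being folded into one accuracy change.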

2606.06443 2026-06-09 cs.CL cs.MM cs.SI 版本更新

针对终端代理的技能注入攻击的防御与使能因素

Yoshinari Fujinuma, Varun Gangal, Traian Rebedea, Makesh Narsimhan Sreedhar, Prasoon Varshney, Rebecca Qian, Anand Kannappan

发表机构 * Patronus AI ； NVIDIA

AI总结研究基于大语言模型的代理在重用技能时面临的安全威胁，提出守护者防御（动态和静态）将攻击成功率降低过半，并测试了攻击重述的鲁棒性。

Comments First version, small updates and clarifications likely in v2

2606.08197 2026-06-09 cs.CL cs.DC 新提交

稀疏记忆微调：作为LoRA和全微调的低遗忘替代方案

Prakhar Gupta, Garv Shah, Satyam Goyal, Anirudh Kanchi

发表机构 * University of Washington（华盛顿大学）

AI总结提出稀疏记忆微调（SMF），通过添加键值记忆层并仅更新当前批次最活跃的记忆行，在MedMCQA任务上提升2.5个百分点，同时将遗忘探针（WikiText困惑度和TriviaQA准确率）控制在基线的1个百分点内，优于LoRA和全微调。

详情

AI中文摘要

将预训练语言模型适应新任务通常会损害其已有的通用能力，这一问题被称为灾难性遗忘。稀疏记忆微调（SMF）通过向模型添加键值记忆层，并在每个训练步骤中仅更新当前批次读取最频繁的一小组记忆行来避免这种情况。我们在Qwen-2.5-0.5B-Instruct上重新实现了SMF，并将其与LoRA和全微调在MedMCQA（一个4选1的医学考试任务）上进行比较，使用WikiText困惑度和TriviaQA准确率作为遗忘探针。SMF将MedMCQA提升了2.5个百分点，同时将两个遗忘探针保持在基线的约1个百分点内，而LoRA和全微调虽然取得了更大的增益，但在两个探针上都出现了明显的漂移。我们还比较了两种行选择规则（KL散度和TF-IDF），它们在两个遗忘指标上取得了不同的平衡。

英文摘要

Adapting a pretrained language model to a new task often hurts the general capabilities it already had, a problem known as catastrophic forgetting. Sparse Memory Finetuning (SMF) tries to avoid this by adding key-value memory layers to the model and, on each training step, updating only the small set of memory rows that the current batch reads most heavily. We re-implement SMF on Qwen-2.5-0.5B-Instruct and compare it with LoRA and full finetuning on MedMCQA, a 4-choice medical exam task, using WikiText perplexity and TriviaQA accuracy as forgetting probes. SMF improves MedMCQA by 2.5 percentage points while keeping both forgetting probes within roughly 1 point of the base model, whereas LoRA and full finetuning achieve larger gains but with clear drift on both. We also compare two row-selection rules (KL-divergence and TF-IDF), which balance the two forgetting metrics differently.
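The sparse update rule can be sketched as follows. For simplicity this example ranks rows by raw access counts, which is an assumption: the abstract mentions KL-divergence and TF-IDF selection rules, and neither is implemented here. The function name and the plain-list memory layout are also illustrative.

```python
def sparse_row_update(memory, grads, access_counts, k, lr=0.1):
    """Apply a gradient step only to the k most heavily read memory rows.

    memory, grads : lists of equal-length rows (lists of floats)
    access_counts : how often the current batch read each row (assumed
                    ranking signal; the paper compares KL and TF-IDF rules)
    Returns the updated memory and the sorted indices of the updated rows.
    """
    ranked = sorted(range(len(memory)), key=lambda i: access_counts[i], reverse=True)
    selected = set(ranked[:k])
    updated = []
    for i, (row, grad_row) in enumerate(zip(memory, grads)):
        if i in selected:
            updated.append([w - lr * g for w, g in zip(row, grad_row)])
        else:
            # Rows outside the top-k stay frozen, which limits forgetting.
            updated.append(list(row))
    return updated, sorted(selected)


if __name__ == "__main__":
    mem = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
    grd = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
    print(sparse_row_update(mem, grd, access_counts=[5, 0, 9], k=1))
```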

2605.04913 2026-06-09 cs.CL cs.LG 版本更新

Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

重新思考局部学习：一种更便宜更快的LLM后训练配方

Hengyu Shi, Tianyang Han, Peizhe Wang, Zhiling Wang, Xu Yang, Junhao Su

发表机构 * Independent Researcher（独立研究者）； D 4 Lab（D4实验室）； Southeast University（东南大学）

AI总结本文提出LoPT，一种局部学习后训练策略，通过在transformer中点设置梯度边界，降低内存成本，提高训练效率并保留预训练能力。

Comments 35pages

详情

AI中文摘要

LLM后训练通常将任务梯度沿模型的完整深度传播。尽管这种端到端结构简单通用，但它将任务适应与完整深度的激活存储、长距离反向依赖以及任务梯度对预训练表示的直接访问耦合在一起。我们认为，这种完整深度的反向耦合可能带来不必要的开销和侵入性，尤其是在后训练监督远比预训练狭窄时。为此，我们提出LoPT（局部学习后训练），一种简单的后训练策略，使梯度的传播范围成为显式的设计选择。LoPT在transformer中点放置单一梯度边界：后半部分块从任务目标学习，而前半部分块通过轻量级特征重建目标进行更新，以保留有用的表示并保持接口兼容性。LoPT缩短了任务引起的反向路径，同时限制了狭窄任务梯度对早期层表示的直接干扰。大量实验表明，LoPT以更低的内存成本、更高的训练效率和更好的预训练能力保留实现了有竞争力的性能。我们的代码可在 https://github.com/HumyuShi/LoPT 获取。

英文摘要

LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose LoPT: Local-Learning Post-Training, a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT

2605.16928 2026-06-09 cs.CL cs.AI 版本更新

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

全注意力再临：在数百次训练步骤内将全注意力转化为稀疏

Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Tao Lan, Lin Qu, Yuan Yao, Xiaoxing Ma

发表机构 * Nanjing University（南京大学）； Alibaba Group（阿里巴巴集团）

AI总结本文提出RTPurbo方法，通过利用模型内在稀疏性，在少量训练步骤内实现高效的稀疏注意力，从而在保持接近无损精度的同时，显著提升推理效率。

Comments 20 pages, 9 figures

详情

AI中文摘要

大型语言模型的长上下文推理受到全注意力二次成本的限制。现有的高效替代方法通常依赖于原生稀疏训练或启发式令牌驱逐，导致效率、训练成本和准确性之间存在不理想的权衡。在本文中，我们证明全注意力LLM本质上已经是稀疏的，并且可以通过最小的适应转化为高度稀疏的模型。我们的方法基于三个观察：(1) 只有少量的注意力头真正需要完整的长上下文处理；(2) 长距离检索主要由低维子空间支配，允许相关令牌通过16维索引器高效检索；(3) 有用的令牌预算强烈依赖于查询，使得动态top-p选择比固定top-k稀疏化更合适。基于这些见解，我们提出了RTPurbo，该方法仅保留检索头的完整KV缓存，并引入轻量级令牌索引器进行稀疏注意力。通过利用模型的内在稀疏性，RTPurbo仅在数百次训练步骤内即可实现稀疏化。在长上下文基准和推理任务上的实验表明，RTPurbo在保持接近无损精度的同时，实现了显著的效率提升，包括在100万上下文下的预填充速度提升高达9.36倍，以及解码速度提升约2.01倍。这些结果表明，可以通过标准的全注意力训练获得强大的稀疏推理，而无需昂贵的原生稀疏预训练。

英文摘要

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-$p$ selection more suitable than fixed top-$k$ sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36$\times$ prefill speedup at 1M context and about a 2.01$\times$ decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
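The third observation, that a query-dependent token budget favors top-p over fixed top-k selection, can be shown with a short sketch. It assumes the indexer yields one relevance score per token and that scores are normalized with a softmax; both are assumptions, and the function names are illustrative.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(scores, p):
    """Return the smallest set of token indices whose softmax mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(chosen)


if __name__ == "__main__":
    peaked = [10.0, 0.0, 0.0, 0.0]  # one token dominates the query
    flat = [0.0, 0.0, 0.0, 0.0]     # relevance is spread evenly
    print(top_p_select(peaked, 0.9), top_p_select(flat, 0.9))
```

A peaked score distribution keeps a single token while a flat one keeps all of them, so the budget adapts to the query, whereas a fixed top-k would keep the same number in both cases.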

2605.28207 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

发表机构 * Frontis.AI ； Kuaishou Technology（快手科技）； Shanghai AI Lab（上海人工智能实验室）； TsinghuaC3I/ZEDA（清华大学C3I/ZEDA）

AI总结本文提出ZEDA框架，通过自蒸馏将预训练的静态MoE模型转换为高效的动态MoE模型，显著减少专家FLOPs并提升推理速度。

详情

AI中文摘要

混合专家（MoE）通过稀疏专家激活高效地扩展语言模型，其动态变体进一步通过输入依赖的方式调整激活专家以减少计算。现有动态MoE方法通常依赖从头训练或任务特定适应，使完全训练的MoE的实际转换未被充分探索。启用此类适应可直接缓解推理成本，通过允许简单令牌在服务时绕过不必要的专家。本文引入了零专家自蒸馏适应（ZEDA），一种低成本框架，将后训练的静态MoE模型转换为高效的动态MoE模型。为稳定此架构转换，ZEDA在每个MoE层中注入无参数的零输出专家，并通过两阶段自蒸馏适应增强模型，利用原始MoE作为冻结的教师，并应用组级平衡损失。在Qwen3-30B-A3B和GLM-4.7-Flash上跨11个基准测试（涵盖数学、代码和指令跟随）中，ZEDA在边际精度损失下消除了超过50%的专家FLOPs。在两个模型上，ZEDA比最强的动态MoE基线分别高出6.1和4.0个点，并提供约1.20倍的端到端推理加速。

英文摘要

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such adaptation would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20$\times$ end-to-end inference speedup.
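How zero-output experts turn a static top-k router into a dynamic one can be sketched with routing alone. The layout below (real experts first, zero experts appended at the end of the logit vector) and both function names are assumptions for illustration; the distillation stages and the balancing loss are not modeled.

```python
def route_with_zero_experts(router_logits, n_real, k):
    """Top-k routing over real experts plus parameter-free zero experts.

    router_logits: one logit per expert; indices >= n_real are zero-output
    experts (assumed layout). Returns the real experts that must be run.
    """
    top = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)[:k]
    # A slot won by a zero expert contributes nothing, so its compute is skipped.
    return [e for e in top if e < n_real]


def expert_compute_fraction(batch_logits, n_real, k):
    """Fraction of the static top-k expert calls that still have to run."""
    executed = sum(len(route_with_zero_experts(l, n_real, k)) for l in batch_logits)
    return executed / (k * len(batch_logits))


if __name__ == "__main__":
    batch = [
        [2.0, 0.1, 0.3, 5.0],  # "easy" token: one slot goes to the zero expert
        [2.0, 3.0, 0.3, 0.1],  # "hard" token: both slots go to real experts
    ]
    print(expert_compute_fraction(batch, n_real=3, k=2))
```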

2606.07753 2026-06-09 cs.CL 新提交

ReadingMachine: A Computational Methodology for Structured Corpus Reading and Large-Scale Synthesis

ReadingMachine：一种结构化语料库阅读与大规模综合的计算方法

James Morrissey

发表机构 * GitHub

AI总结提出ReadingMachine方法，利用大语言模型对文档集合进行有界阅读，通过洞察提取、语义聚类、主题生成和迭代遗漏检测等可检查阶段，实现大规模语料库的覆盖性、可追溯性和分歧保留。

Comments 32 pages, 1 figure

2606.08254 2026-06-09 cs.CL 新提交

因果评估形式语言任务的可学习性

Vésteinn Snæbjarnarson, Anej Svete, Josef Valvoda, Reda Boumasmoud, Brian DuSell, Ryan Cotterell

发表机构 * ETH Zürich（苏黎世联邦理工学院）； University of Copenhagen（哥本哈根大学）

AI总结通过引入分箱半环控制目标属性频率，结合因果图模型和分解KL散度，证明标准相关性评估在形式语言任务可学习性分析中存在混淆偏差。

详情

AI中文摘要

语言模型作为多任务学习器，在训练过程中获得广泛能力。一个基本问题是学习给定任务需要多少特定任务数据。在自然语言中回答这个问题很困难：任务难以界定且可能相互混淆。为了严格研究数据频率与可学习性之间的关系，我们转向使用从概率有限自动机导出的形式语言的受控设置。这作为方法论测试平台，证明标准相关性评估实践固有缺陷。为了实现因果分析，我们引入了分箱半环，这是一种代数对象，允许我们控制目标属性在采样语料库中出现的频率。我们将实验流程表述为因果图模型，并推导出分解的Kullback-Leibler散度指标来衡量特定子任务的可学习性。我们的实验表明，在没有因果干预的情况下评估可学习性会由于相关性分析中的混淆因素导致错误结论，并警示自然语言环境中的相关性陷阱。

英文摘要

Language models, as multi-task learners, acquire a wide range of abilities during training. A fundamental question is how much task-specific data is needed to learn a given task. Answering this for natural language is difficult: tasks are hard to delineate and can confound one another. To rigorously investigate the relationship between data frequency and learnability, we turn to a controlled setting using formal languages induced from probabilistic finite automata. These serve as a methodological testbed to demonstrate that standard correlational evaluation practices are inherently flawed. To enable causal analysis, we introduce the binning semiring, an algebraic object that lets us control how often a targeted property occurs in a sampled corpus. We formulate the experimental pipeline as a causal graphical model and derive decomposed Kullback-Leibler divergence metrics to measure the learnability of specific sub-tasks. Our experiments show that evaluating learnability without causal intervention leads to incorrect conclusions due to confounders in correlational analysis, and serve as a warning about correlational pitfalls in natural-language settings.

2606.07727 2026-06-09 quant-ph cs.CL math.OC q-fin.PM 交叉投稿

Benchmarking Quantum Algorithmic Resilience for CVaR Portfolio Optimization: The Expressibility-Coherence Trade-off

面向CVaR投资组合优化的量子算法韧性基准测试：可表达性-相干性权衡

Prashik N. Somkuwar, K. Srinivasan, G. Raghavan

发表机构 * School of Quantum Technology, Defence Institute of Advanced Technology (DIAT), Pune

AI总结针对混合均值方差与条件风险价值投资组合优化，对比硬件高效变分量子神经网络与热启动量子近似优化算法，揭示NISQ设备上算法可表达性与硬件相干性之间的关键权衡。

Comments 10 pages, 11 figures. Master's thesis research conducted at the School of Quantum Technology, Defence Institute of Advanced Technology (DIAT), Pune

详情

基于移动性和社交媒体数据的可解释危机行为分析

Muhammad Hamza Arshad Majeed, Sidahmed Benabderrahmane, Talal Rahwan

发表机构 * New York University (NYUAD)（纽约大学（NYUAD））

AI总结提出统一可解释流水线，融合移动性和社交媒体数据，通过形式概念分析和关联规则挖掘，识别危机中跨域行为模式，并在洛杉矶山火和COVID-19案例中验证，生成可操作的政策简报。

详情

AI中文摘要

危机改变了人们的移动方式和沟通方式。在野火和流行病等紧急情况下，移动模式的变化和在线情感话语共同演变，但通常被孤立研究。本文提出了一个统一且可解释的流水线，整合移动性和社交媒体数据，以识别危机环境中的跨域行为模式。该框架通过两个案例研究进行评估：2025年1月洛杉矶野火的短期分析（原型案例）和2020年3月至2021年12月阿联酋COVID-19行为的纵向分析（主要案例，671天）。该流水线对齐异构每日信号，将其转换为二元行为状态，应用形式概念分析（FCA）提取共现结构，挖掘关联规则，并通过时间顺序保留测试验证规则稳定性。一个结构化的政策翻译层将稳健规则转化为操作简报，指定触发条件、提前时间和行动方案。结果揭示了两种危机中清晰的跨域行为结构。在野火案例中，交通压力、恐惧/愤怒情绪和治理话语在33天窗口内紧密耦合，关键规则达到100%置信度，提升度高达2.5。在COVID案例中，重复的移动适应和情绪波动产生了8条稳定的同日规则（88%保留通过率）和40条清晰的预测规则，提前时间为2-7天。该工作表明，可解释的多模态融合可以产生既科学可信又政策可操作的危机情报。

英文摘要

Crises alter both how people move and how they communicate. During emergencies such as wildfires and pandemics, changes in mobility patterns and online emotional discourse evolve jointly, yet they are typically studied in isolation. This paper presents a unified and interpretable pipeline that integrates mobility and social media data to identify cross-domain behavioral patterns in crisis settings. The framework is evaluated through two case studies: a short-horizon analysis of the January 2025 Los Angeles wildfires (prototype case) and a longitudinal analysis of UAE COVID-19 behavior from March 2020 to December 2021 (primary case, 671 days). The pipeline aligns heterogeneous daily signals, transforms them into binary behavioral states, applies Formal Concept Analysis (FCA) to extract co-occurrence structure, mines association rules, and validates rule stability through chronological holdout testing. A structured policy-translation layer renders robust rules as operational briefs specifying triggers, lead times, and action playbooks. Results reveal clear cross-domain behavioral structure in both crises. In the wildfire case, traffic stress, fear/anger sentiment, and governance discourse are tightly coupled within a 33-day window, with key rules reaching 100% confidence and lift scores up to 2.5. In the COVID case, repeated mobility adaptation and sentiment volatility yield 8 stable same-day rules (88% holdout pass rate) and 40 clean predictive rules with 2-7 day lead horizons. The work demonstrates that interpretable multimodal fusion can produce both scientifically credible and policy-actionable crisis intelligence.
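The confidence and lift figures quoted above follow the standard association-rule definitions, which the sketch below computes over binary daily states. The state names and the toy data are hypothetical and are not taken from the paper's datasets.

```python
def rule_metrics(days, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    days: list of sets, each holding the binary states active on one day.
    antecedent, consequent: sets of state names.
    """
    n = len(days)
    n_a = sum(1 for d in days if antecedent <= d)
    n_c = sum(1 for d in days if consequent <= d)
    n_both = sum(1 for d in days if (antecedent | consequent) <= d)
    if n_a == 0 or n_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n
    confidence = n_both / n_a          # P(consequent | antecedent)
    lift = confidence / (n_c / n)      # confidence relative to the base rate
    return support, confidence, lift


if __name__ == "__main__":
    # Hypothetical daily states for illustration only.
    days = [
        {"traffic_stress", "fear"},
        {"traffic_stress", "fear"},
        {"fear"},
        set(),
        set(),
    ]
    print(rule_metrics(days, {"traffic_stress"}, {"fear"}))
```

A lift above 1 means the consequent appears more often on days where the antecedent holds than on an average day, which is why a rule with 100% confidence can still have a modest lift when the consequent is common.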

2211.05583 2026-06-09 cs.CL math.OC 版本更新

Toward automatic generation of control structures for process flow diagrams with large language models

面向工艺流程图控制结构自动生成的大语言模型方法

Edwin Hirtreiter, Lukas Schulze Balhorn, Artur M. Schweidtmann

发表机构 * University of Zurich（苏黎世大学）

AI总结提出一种基于Transformer的端到端方法，将控制结构预测视为翻译任务，利用SFILES 2.0表示PFD拓扑，通过预训练和微调实现自动生成，在生成数据上达到74.8%-89.2%的Top-5准确率。

详情

DOI: 10.1002/aic.18259
Journal ref: AIChE Journal, Volume 70, Issue 1, January 2024, e18259

AI中文摘要

开发管道和仪表图（P&IDs）是工艺开发中的关键步骤。我们提出了一种数据驱动的控制结构预测方法。我们的方法受基于Transformer的端到端人类语言翻译模型启发。我们将控制结构预测视为翻译任务，其中没有控制结构的工艺流程图（PFDs）被翻译为带有控制结构的PFDs。我们使用SFILES 2.0符号将PFDs的拓扑表示为字符串。我们使用生成的PFDs预训练模型以学习语法结构。之后，利用迁移学习在真实PFDs上对模型进行微调。该模型在10,000个生成的PFDs上达到了74.8%的Top-5准确率，在100,000个生成的PFDs上达到了89.2%的Top-5准确率。这些有希望的结果显示了人工智能辅助工艺工程的巨大潜力。在312个真实PFDs数据集上的测试表明，工业应用需要更大的PFD数据集和混合人工智能解决方案。

英文摘要

Developing Piping and Instrumentation Diagrams (P&IDs) is a crucial step during process development. We propose a data-driven method for the prediction of control structures. Our methodology is inspired by end-to-end transformer-based human language translation models. We cast the control structure prediction as a translation task where Process Flow Diagrams (PFDs) without control structures are translated to PFDs with control structures. We represent the topology of PFDs as strings using the SFILES 2.0 notation. We pretrain our model using generated PFDs to learn the grammatical structure. Thereafter, the model is fine-tuned leveraging transfer learning on real PFDs. The model achieved a top-5 accuracy of 74.8% on 10,000 generated PFDs and 89.2% on 100,000 generated PFDs. These promising results show great potential for AI-assisted process engineering. The tests on a dataset of 312 real PFDs indicate the need for a larger PFD dataset for industry applications and hybrid artificial intelligence solutions.
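For readers unfamiliar with the top-5 accuracy reported above, here is a minimal sketch of the metric, assuming exact string matching between a reference and the ranked candidates. The placeholder strings are illustrative and are not valid SFILES 2.0; how the paper compares strings is not stated in the abstract.

```python
def top_k_accuracy(ranked_predictions, references, k=5):
    """Share of examples whose reference appears among the top-k candidates.

    ranked_predictions: one list of candidate strings per example, best first
    references        : the expected output string for each example
    Exact string equality is an assumed matching rule.
    """
    hits = sum(
        1
        for candidates, ref in zip(ranked_predictions, references)
        if ref in candidates[:k]
    )
    return hits / len(references)


if __name__ == "__main__":
    preds = [["a", "b"], ["c", "d", "e", "f", "g", "h"]]
    refs = ["b", "h"]
    print(top_k_accuracy(preds, refs, k=5))
```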

2408.00684 2026-06-09 cs.CL 版本更新

Assessing the Variety of a Concept Space Using an Unbiased Estimate of Rao's Quadratic Index

使用Rao二次指数的无偏估计评估概念空间的多样性

Anubhab Majumder, Ujjwal Pal, Amaresh Chakrabarti

发表机构 * Department of Design and Manufacturing, Indian Institute of Science（印度科学研究院设计与制造系）

AI总结提出一种基于距离的多样性度量方法，通过无偏估计Rao二次指数，并开发软件工具VariAnT，以支持工程设计早期概念空间的多样性评估。

详情

AI中文摘要

过去的研究将设计创造力与“发散性思维”联系起来，即概念空间在设计早期阶段被探索的程度。研究人员认为，生成多个概念会增加产生更好设计解决方案的机会。“多样性”是量化设计师探索的概念空间广度的参数之一。在概念设计阶段评估多样性是有用的，因为在这个阶段，设计师可以自由探索不同的解决方案原则，以用新颖的概念满足设计问题。本文详细阐述并批判性地审视了工程设计文献中现有的多样性度量方法，讨论了它们的局限性。提出了一种新的基于距离的多样性度量方法，并附带了一个支持评估过程的规范性框架。该框架使用所选的基础抽象层次表示，测量两个设计概念之间的实值距离。所提出的框架在名为“VariAnT”的软件工具中实现。此外，通过一个说明性示例展示了该工具的应用。

英文摘要

Past research relates design creativity to 'divergent thinking,' i.e., how well the concept space is explored during the early phase of design. Researchers have argued that generating several concepts would increase the chances of producing better design solutions. 'Variety' is one of the parameters by which one can quantify the breadth of a concept space explored by the designers. It is useful to assess variety at the conceptual design stage because, at this stage, designers have the freedom to explore different solution principles so as to satisfy a design problem with substantially novel concepts. This article elaborates on and critically examines the existing variety metrics from the engineering design literature, discussing their limitations. A new distance-based variety metric is proposed, along with a prescriptive framework to support the assessment process. The framework measures the real-valued distance between two design concepts using any chosen representation of their underlying abstraction levels. The proposed framework is implemented in a software tool called 'VariAnT.' Furthermore, the tool's application is demonstrated through an illustrative example.
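A distance-based variety score of the kind named in the title can be sketched with the standard unbiased estimator of Rao's quadratic index, Q = sum over i, j of d_ij * p_i * p_j. Whether the paper uses exactly this form is an assumption; the concept categories, counts and distances below are illustrative.

```python
def rao_q_unbiased(counts, dist):
    """Unbiased estimate of Rao's quadratic index from category counts.

    counts: number of generated concepts falling in each category
    dist  : symmetric matrix of pairwise category distances, zero diagonal
    Uses sum_{i != j} d_ij * n_i * n_j / (n * (n - 1)), i.e. the mean
    distance between two distinct concepts drawn without replacement.
    """
    n = sum(counts)
    if n < 2:
        raise ValueError("at least two concepts are needed")
    total = 0.0
    for i, n_i in enumerate(counts):
        for j, n_j in enumerate(counts):
            if i != j:
                total += dist[i][j] * n_i * n_j
    return total / (n * (n - 1))


def rao_q_plugin(counts, dist):
    """Plug-in estimate using p_i = n_i / n; biased low for small samples."""
    n = sum(counts)
    return sum(
        dist[i][j] * (n_i / n) * (n_j / n)
        for i, n_i in enumerate(counts)
        for j, n_j in enumerate(counts)
    )


if __name__ == "__main__":
    d = [[0.0, 1.0], [1.0, 0.0]]
    print(rao_q_unbiased([1, 1], d), rao_q_plugin([1, 1], d))
```

With one concept in each of two categories at distance 1, the unbiased form returns the full distance while the plug-in form halves it, which shows why the correction matters for the small concept sets typical of early design sessions.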

2605.29475 2026-06-09 cs.CL cs.AI cs.CE cs.HC 版本更新

MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery

MOOSE-Copilot：一个基于网络的交互式助手，用于统一探索性和细粒度科学假设发现

Hongran An, Zonglin Yang

发表机构 * Central Conservatory of Music（中央音乐学院）； Nanyang Technological University（南洋理工大学）

AI总结提出MOOSE-Copilot，通过形式化的人机交互协议，将发散性探索和收敛性细化统一，利用蓝图、路由和反馈三种信号引导生成，显著优于纯自主基线。

Comments Accepted to ACL 2026 (System Demonstrations)

详情

AI中文摘要

大型语言模型（LLMs）在科学假设发现中展现出显著潜力。然而，现有方法存在两个关键限制：它们将发散性探索构思和收敛性细粒度细化视为孤立任务，并且自主运行，几乎没有人类指导。我们提出了MOOSE-Copilot，这是第一个通过形式化的人机交互（HAII）协议弥合这一抽象差距的统一框架。我们的系统使科学家能够通过三种显式信号引导生成过程：初始蓝图、阶段间路由和再生反馈。定量评估表明，注入这些结构化专家信号显著优于纯自主基线，并在神谕指导下建立了性能上限。此外，为了普及这一范式，我们开发了一个直观的基于网络界面，具有交互式树状可视化。这明确消除了复杂命令行代理工具的陡峭学习曲线，使跨学科研究人员能够直接利用、视觉编排并加速端到端的科学突破。

英文摘要

Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory search and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human-AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and intra-stage feedback. Using an oracle-simulated evaluation in which an LLM provides idealized expert signals, we show that injecting these structured signals significantly outperforms purely autonomous baselines, characterizing the gains achievable under high-quality guidance. Furthermore, we build a web-based interface that turns the framework into a no-code workflow: researchers pose a question, watch the hypothesis search unfold as an interactive tree, and steer it by selecting hypotheses, routing between stages, and injecting feedback-no command-line agents required. This makes end-to-end hypothesis discovery directly accessible to interdisciplinary researchers.

2208.00859 2026-06-09 cs.LG cs.CL 版本更新

Learning from flowsheets: A generative transformer model for autocompletion of flowsheets

从流程图学习：用于流程图自动补全的生成式Transformer模型

Gabriel Vogel, Lukas Schulze Balhorn, Artur M. Schweidtmann

发表机构 * University of Freiburg（弗赖堡大学）

AI总结受文本自动补全启发，提出基于SFILES 2.0字符串表示和Transformer语言模型的化工流程图自动补全方法，通过预训练和微调实现交互式流程图合成辅助。

详情

DOI: 10.1016/j.compchemeng.2023.108162
Journal ref: Computers and Chemical Engineering Volume 171, March 2023, 108162

AI中文摘要

我们提出了一种新颖的方法，能够实现化工流程图的自动补全。这一想法受到文本自动补全的启发。我们使用基于文本的SFILES 2.0符号将流程图表示为字符串，并利用基于Transformer的语言模型学习SFILES 2.0语言的语法结构以及流程图中的常见模式。我们在合成生成的流程图拓扑上预训练模型，以学习流程图语言语法。然后，通过迁移学习步骤在真实流程图拓扑上微调模型。最后，我们使用训练好的模型进行因果语言建模，以自动补全流程图。最终，所提出的方法可以在交互式流程图合成过程中为化学工程师提供建议。结果表明，该方法在未来AI辅助过程合成中具有巨大潜力，但也揭示了当前阶段的局限性以及在实际流程图合成场景中部署该技术需要采取的后续步骤。

英文摘要

We propose a novel method enabling autocompletion of chemical flowsheets. This idea is inspired by the autocompletion of text. We represent flowsheets as strings using the text-based SFILES 2.0 notation and learn the grammatical structure of the SFILES 2.0 language and common patterns in flowsheets using a transformer-based language model. We pre-train our model on synthetically generated flowsheet topologies to learn the flowsheet language grammar. Then, we fine-tune our model in a transfer learning step on real flowsheet topologies. Finally, we use the trained model for causal language modeling to autocomplete flowsheets. Eventually, the proposed method can provide chemical engineers with recommendations during interactive flowsheet synthesis. The results demonstrate a high potential of this approach for future AI-assisted process synthesis but also reveal the limitations at the present state and the next steps that need to be taken to deploy this technique in realistic flowsheet synthesis scenarios.

2601.01279 2026-06-09 econ.TH cs.AI cs.CE cs.CL cs.GT 版本更新

Supracompetitive Pricing Under AI Monoculture

人工智能单一群体下的超竞争定价

Shengyu Cao, Ming Hu

发表机构 * Rotman School of Management, University of Toronto（多伦多大学罗特曼管理学院）

AI总结本文研究了在共享AI模型下，竞争卖家委托定价时可能产生的超竞争定价问题，通过双寡头模型分析发现，AI模型的鲁棒性和可重复性配置可能导致超竞争定价现象，且市场结果取决于初始定价倾向。

Comments 46 pages

详情

AI中文摘要

当竞争卖家将定价委托给共享的AI模型（如大型语言模型）时，相关推荐结合性能驱动的更新，聚合卖家反馈，引发一个问题：标准的AI部署实践是否会无意中产生超竞争定价？本文开发了一个简化的双寡头模型，其中两个卖家从共享的AI模型中获得定价推荐，该模型由两个参数特征化：一个倾向参数捕捉模型设置高价的倾向，一个输出保真度参数衡量该倾向与实际输出的一致性，其中倾向通过定期重新训练在观察到的结果上更新。我们发现，配置AI模型以鲁棒性和可重复性可以导致超竞争定价通过相变。在临界输出保真度阈值以下，竞争性定价是唯一的稳定结果。在临界值以上，模型表现出双稳态：竞争性和超竞争性定价都是局部稳定的，实际结果取决于模型的初始倾向。超竞争性定价提高了平均价格，但偶尔的低价推荐使检测变得复杂。对于完美输出保真度，任何内部初始倾向都会导致完全价格协调。对于有限训练批次大小为b，当初始倾向位于超竞争性盆地时，随着b的增加，超竞争性定价的概率接近1，不确定结果区域以O(1/√b)的速率缩小。任何减少模型倾向与卖家实际定价之间一致性的因素，无论是通过多样化AI供应商、引入推荐噪声还是减少卖家的遵守，都会将市场推向竞争性结果。

英文摘要

When competing sellers delegate pricing to a shared AI model, such as a large language model, correlated recommendations combined with performance-driven updates aggregating seller feedback raise a key question: can standard AI deployment practices inadvertently produce supracompetitive pricing? We develop a stylized duopoly model in which two sellers receive pricing recommendations from a shared AI characterized by two parameters: a propensity parameter capturing the model's tendency to set high prices and an output-fidelity parameter measuring alignment between this tendency and actual outputs, with propensity updated via periodic retraining on observed outcomes. We find that configuring AI models for robustness and reproducibility can lead to supracompetitive pricing via a phase transition. Below a critical output-fidelity threshold, competitive pricing is the unique stable outcome. Above it, the model exhibits bistability: both competitive and supracompetitive pricing are locally stable, with the realized outcome determined by the model's initial propensity. Supracompetitive pricing raises average prices, but occasional low-price recommendations complicate detection. With perfect output fidelity, full price coordination emerges from any interior initial propensity. For finite training batches of size $b$, when the initial propensity lies in the supracompetitive basin, the probability of supracompetitive pricing approaches 1 as $b$ increases, with the region of indeterminate outcomes shrinking at rate $O(1/\sqrt{b})$. Any factor reducing alignment between the model's propensity and sellers' actual pricing, whether through diversifying AI providers, introducing recommendation noise, or reducing seller adherence, pushes the market toward competitive outcomes.

2605.24384 2026-06-09 cs.CL cs.AI 版本更新

Side-by-side Comparison Amplifies Dialect Bias in Language Models

并排比较加剧语言模型中的方言偏见

Kritee Kondapally, Claire J. Smerdon, Pooja C. Patel, Ogheneyoma Akoni, Jevon Torres, Jaspreet Ranjit, Matthew Finlayson, Swabha Swayamdipta

发表机构 * University of Southern California（美国南加州大学）

AI总结本研究通过并排比较标准美式英语和非裔美国英语的推文，发现语言模型中的隐性方言偏见在对比设置下显著加剧，且显性方言偏见在安全对齐微调后仍存在。

Comments In proceeding at ACM Conference on Fairness, Accountability, and Transparency 2026

详情

DOI: 10.1145/3805689.3812217
Journal ref: In The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

Abstract

Language models (LMs) can exhibit biases based on dialect variation, even in the absence of a dialect label, a behavior known as covert dialect bias. In this work, we quantify covert dialect bias in online discourse by evaluating how LMs associate stereotypical traits (derived from social psychology research on racial bias) with intent-equivalent tweets in Standard American English (SAE) and African-American Vernacular English (AAVE). While prior work shows that LMs associate more negative stereotypes with AAVE when evaluating tweets in isolation, we are surprised to find that this bias is significantly exacerbated when SAE / AAVE tweet pairs are compared side by side, a setting that more closely reflects high-impact decision-making contexts in which models are used to rank candidates. The bias only worsens when dialect labels are explicitly specified. This is striking, given the extensive efforts from commercial developers to mitigate bias in their LMs. Encouragingly, we show that counterfactual fairness finetuning can mitigate covert dialect bias for some stereotypical traits, reducing average disparities when evaluating tweets in isolation; however, these improvements do not consistently hold across traits when evaluating SAE / AAVE tweets side by side. Our findings show that existing evaluation settings for covert dialect bias may underestimate its severity, specifically in contrastive settings. Additionally, overt dialect bias remains pronounced even after safety-aligned finetuning, indicating that it remains an unresolved problem and motivating the need for more robust evaluation and mitigation frameworks.

2508.03453 2026-06-09 cs.CL cs.LG Updated version

Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings

Rita González-Márquez, Philipp Berens, Dmitry Kobak

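Institutions: Hertie Institute for AI in Brain Health; University of Tübingen, Germany

AI summary: This paper studies two augmentation strategies for self-supervised fine-tuning, cropping and dropout, and finds that cropping yields better text embeddings; on in-domain data in particular, it produces high-quality embeddings after very short fine-tuning.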

Journal ref: Transactions on Machine Learning Research (TMLR) 2026

Abstract

Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via supervised contrastive fine-tuning. This fine-tuning strategy relies on an external notion of similarity and annotated data for generating positive pairs. Here we study self-supervised fine-tuning and systematically compare the two most well-known augmentation strategies used for fine-tuning text embedding models. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of the resulting embeddings is substantially below that of the supervised state-of-the-art models, but for in-domain data, self-supervised fine-tuning can produce high-quality text embeddings after very short fine-tuning. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning, and that fine-tuning only those last layers is sufficient to reach similar embedding quality.
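
To make the cropping augmentation concrete, here is a minimal sketch. It assumes, beyond what the abstract states, that a positive pair is formed by drawing two random contiguous windows from the same text. The helper name `crop_pair`, the whitespace tokenisation, and the length bounds are illustrative assumptions, not the paper's settings.

```python
import random

def crop_pair(text, min_len=2, max_len=5, rng=None):
    """Return two random contiguous crops of `text` as a positive pair.

    Hypothetical helper: whitespace tokenisation and the length bounds
    are illustrative choices, not taken from the paper.
    """
    rng = rng or random.Random()
    tokens = text.split()

    def one_crop():
        # Clamp the crop length to the number of available tokens.
        upper = min(max_len, len(tokens))
        lower = min(min_len, upper)
        length = rng.randint(lower, upper)
        start = rng.randint(0, len(tokens) - length)
        return " ".join(tokens[start:start + length])

    return one_crop(), one_crop()

# Two crops of the same sentence form one positive pair.
sentence = "text embeddings are vector representations of entire texts"
view_a, view_b = crop_pair(sentence, rng=random.Random(0))
print(view_a, "|", view_b)
```

In a contrastive setup, the two crops would typically be embedded and pulled together, while crops from other texts in the batch serve as negatives.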

2507.15152 2026-06-09 cs.CL cs.AI cs.LG Updated version

What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction

Lingbo Li, Anuradha Mathrani, Teo Susnjak

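Institutions: School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand

AI summary: This paper evaluates three large language models on data extraction in medical domains, finds that customised prompts improve recall by up to 15%, and proposes three-tiered guidelines to balance automation with expert oversight.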

DOI: 10.1017/rsm.2025.10066
Journal ref: Research Synthesis Methods (2026)

Abstract

Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models demonstrate high precision but consistently suffer from poor recall by omitting key information. We found that customised prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.
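
The high-precision, low-recall pattern described above can be illustrated with a small sketch. It assumes that extracted items and gold-standard items are compared as exact-match sets, which is a simplification. The function name and the field names and values below are invented for illustration and are not from the study.

```python
def precision_recall(extracted, gold):
    """Exact-match precision and recall between two collections of items."""
    extracted, gold = set(extracted), set(gold)
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Invented example: every extracted item is correct, but half are missed.
gold = {("sample_size", "120"), ("mean_sbp", "141.2"),
        ("sd_sbp", "12.5"), ("follow_up_weeks", "24")}
extracted = {("sample_size", "120"), ("mean_sbp", "141.2")}

p, r = precision_recall(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=1.00 recall=0.50
```

An extractor that omits items loses recall without losing precision, which is why omissions can go unnoticed when only the returned values are spot-checked.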

2410.14964 2026-06-09 cs.CL Updated version

ChronoFact: Timeline-based Temporal Fact Verification

Anab Maulana Barik, Wynne Hsu, Mong Li Lee

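Institutions: School of Computing; Institute of Data Science; Centre for Trusted Internet and Community

AI summary: This paper proposes a timeline-based temporal fact verification framework that identifies events in the claim and the evidence, organizes them into timelines, and systematically analyzes the relationships between events to predict claim veracity; it also introduces a dataset of complex temporal claims for training and evaluation.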

DOI: 10.24963/ijcai.2025/893
Journal ref: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025), pp. 8031-8039

Abstract

Temporal claims, often riddled with inaccuracies, are a significant challenge in the digital misinformation landscape. Fact-checking systems that can accurately verify such claims are crucial for combating misinformation. Current systems struggle with the complexities of evaluating the accuracy of these claims, especially when they include multiple, overlapping, or recurring events. We introduce a novel timeline-based fact verification framework that identifies events from both claim and evidence and organizes them into their respective chronological timelines. The framework systematically examines the relationships between the events in both claim and evidence to predict the veracity of each claim event and their chronological accuracy. This allows us to accurately determine the overall veracity of the claim. We also introduce a new dataset of complex temporal claims involving timeline-based reasoning for the training and evaluation of our proposed framework. Experimental results demonstrate the effectiveness of our approach in handling the intricacies of temporal claim verification.
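
As a toy illustration of the timeline idea only, the sketch below sorts claim and evidence events chronologically and compares them. It is not the paper's framework, which is trained on a dataset. The function names, the exact-string matching of event descriptions, and the example events are all assumptions made for illustration.

```python
from datetime import date

def build_timeline(events):
    """Sort (description, date) pairs into chronological order."""
    return sorted(events, key=lambda event: event[1])

def check_claim(claim_events, evidence_events):
    """Toy comparison of a claim timeline against an evidence timeline.

    Returns (supported, same_order): `supported` maps each claim event to
    True when the evidence lists the same event on the same date, and
    `same_order` says whether the shared events occur in the same order.
    """
    evidence_dates = dict(evidence_events)
    claim_dates = dict(claim_events)
    supported = {desc: evidence_dates.get(desc) == when
                 for desc, when in claim_events}
    claim_order = [d for d, _ in build_timeline(claim_events)
                   if d in evidence_dates]
    evidence_order = [d for d, _ in build_timeline(evidence_events)
                      if d in claim_dates]
    return supported, claim_order == evidence_order

# Invented example: the claim gets the second date wrong but the order right.
claim = [("resigned", date(2012, 3, 1)), ("elected", date(2010, 5, 1))]
evidence = [("elected", date(2010, 5, 1)), ("resigned", date(2013, 3, 1))]
supported, same_order = check_claim(claim, evidence)
print(supported, same_order)
```

Separating per-event support from ordering mirrors the abstract's distinction between the veracity of each claim event and its chronological accuracy.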

2406.14883 2026-06-09 cs.CL cs.CY Updated version

OATH-Frames: Characterizing Online Attitudes Towards Homelessness with LLM Assistants

Jaspreet Ranjit, Brihi Joshi, Rebecca Dorn, Laura Petry, Olga Koumoundouros, Jayne Bottarini, Peichen Liu, Eric Rice, Swabha Swayamdipta

Institutions: Dept. of Computer Science, University of Southern California; Suzanne Dworak-Peck School of Social Work, University of Southern California

AI summary: This paper proposes the OATH-Frames typology and uses large language models to analyze attitudes towards homelessness on social media, making large-scale analysis more efficient and revealing trends in attitudes.

Comments: Project website: https://dill-lab.github.io/oath-frames/, EMNLP Main 2024

DOI: 10.18653/v1/2024.emnlp-main.724
Journal ref: In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Abstract

Warning: Contents of this paper may be upsetting. Public attitudes towards key societal issues, expressed on online media, are of immense value in policy and reform efforts, yet challenging to understand at scale. We study one such social issue, homelessness in the U.S., by leveraging the remarkable capabilities of large language models to assist social work experts in analyzing millions of posts from Twitter. We introduce a framing typology, Online Attitudes Towards Homelessness (OATH) Frames: nine hierarchical frames capturing critiques, responses and perceptions. We release annotations with varying degrees of assistance from language models, with immense benefits in scaling: a 6.5x speedup in annotation time while only incurring a 3-point F1 reduction in performance with respect to the domain experts. Our experiments demonstrate the value of modeling OATH-Frames over existing sentiment and toxicity classifiers. Our large-scale analysis with predicted OATH-Frames on 2.4M posts on homelessness reveals key trends in attitudes across states, time periods and vulnerable populations, enabling new insights on the issue. Our work provides a general framework to understand nuanced public attitudes at scale, on issues beyond homelessness.

2406.19493 2026-06-09 cs.CL cs.AI Updated version

Development and Evaluation of a Retrieval-Augmented Generation Tool for Creating SAPPhIRE Models of Artificial Systems

Anubhab Majumder, Kausik Bhattacharya, Amaresh Chakrabarti

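Institutions: Department of Design and Manufacturing, Indian Institute of Science

AI summary: This paper presents a retrieval-augmented generation tool for creating SAPPhIRE causality models of artificial systems and evaluates the tool's factual accuracy and reliability, with the aim of better supporting design by analogy.

Comments: This paper has been accepted for presentation at the 10th International Conference on Research Into Design, 2025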

2407.00396 2026-06-09 cs.CL cs.AI Updated version

A Study on Effect of Reference Knowledge Choice in Generating Technical Content Relevant to SAPPhIRE Model Using Large Language Model

Kausik Bhattacharya, Anubhab Majumder, Amaresh Chakrabarti

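Institutions: Indian Institute of Science

AI summary: This paper studies how to use large language models to generate technical content relevant to the SAPPhIRE model of causality, suppresses hallucination with retrieval-augmented generation, and highlights the importance of the choice of reference knowledge for the accuracy of the generated content.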

DOI: 10.1007/978-981-96-5511-3_39

Abstract

Representation of systems using the SAPPhIRE model of causality can be an inspirational stimulus in design. However, creating a SAPPhIRE model of a technical or a natural system requires sourcing technical knowledge from multiple technical documents regarding how the system works. This research investigates how to generate accurate technical content relevant to the SAPPhIRE model of causality using a Large Language Model (LLM). This paper, which is the first part of a two-part study, presents a method for hallucination suppression using Retrieval-Augmented Generation with an LLM to generate technical content supported by the scientific information relevant to a SAPPhIRE construct. The results show that the selection of reference knowledge used in providing context to the LLM for generating the technical content is very important. The outcome of this research is used to build a software support tool to generate the SAPPhIRE model of a given technical system.

2402.09193 2026-06-09 cs.CL cs.AI cs.HC Updated version

(Ir)rationality and Cognitive Biases in Large Language Models

Olivia Macmillan-Scott, Mirco Musolesi

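Institutions: University College London; University of Bologna

AI summary: This paper evaluates seven language models on tasks from the psychology literature and finds that, like humans, they display irrationality, though in ways that differ from human biases; their inconsistent responses add a further layer of irrationality.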

DOI: 10.1098/rsos.240255
Journal ref: Royal Society Open Science 11(6) 2024

Abstract

Do large language models (LLMs) display rational reasoning? LLMs have been shown to contain human biases due to the data they have been trained on; whether this is reflected in rational reasoning remains less clear. In this paper, we answer this question by evaluating seven language models using tasks from the cognitive psychology literature. We find that, like humans, LLMs display irrationality in these tasks. However, the way this irrationality is displayed does not reflect that shown by humans. When incorrect answers are given by LLMs to these tasks, they are often incorrect in ways that differ from human-like biases. On top of this, the LLMs reveal an additional layer of irrationality in the significant inconsistency of the responses. Aside from the experimental results, this paper seeks to make a methodological contribution by showing how we can assess and compare different capabilities of these types of models, in this case with respect to rational reasoning.

1. Large Language Models and Foundation Models (74 papers)

ABLE: Representing and Mapping LLMs via Attribution-Based Large-model Embedding

GraphLoRA: Structure-Aware Low-Rank Adaptation for Large Language Model Recommendation

Post-training is (Massive) Supervised Learning

Phantom transitions in language model fine-tuning

Function-Vector Heads Are Two Populations: Writers and Cancellers in In-Context Learning

Representational Similarity and Model Behavior in Multi-Agent Interaction

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

TLRD: Teaching LLMs to Reason over Tabular Data with Tri-Level Rationale Distillation

Chiaroscuro Attention: Spending Compute in the Dark

CATPO: Critique-Augmented Tree Policy Optimization

Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

A retrieval conditioned rebinding circuit for dynamic entity tracking in large language models

Co-Evolving Skill Generation and Policy Optimization

Language-Aware Token Boosting: LLM Language Confusion Reduction Without Tuning

Bridging the Agent-World Gap: Text World Models for LLM-based Agents

SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

Multi-Hop Knowledge Composition is Bound by Pretraining Exposure

Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization

Emergence of Context Characteristics Sensitivity in Large Language Models

End-to-End Context Compression at Scale

When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

Evaluating Advanced Prompting on Gemini Flash for Multi-Hop Biomedical QA

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

Scaling Participation in Modular AI Systems

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior

Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

INFUSER: Influence-Guided Self-Evolution Improves Reasoning

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

Capacity, Not Format: Rethinking Structured Reasoning Failures

Escaping the KL Agreement Trap in On-Policy Distillation

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

SPECTRA: Revealing the Full Spectrum of User Preferences via Distributional LLM Inference

Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

Automated Attribution Graph Interpretation via Probe Prompting

RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models

How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

Component Ablation for Efficient Hybrid Language Model Architectures: Performance, Resilience, and Compression Implications

From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Exploring Autonomous Agentic Data Engineering for Model Specialization

Do Value Vectors in Deep Layers Need Context from the Residual Stream?

Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

Federated Large Language Models: Current Progress and Future Directions

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Formalizing Learning from Language Feedback with Provable Guarantees

Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

Similarity-Distance-Magnitude Activations

MixReasoning: Switching Modes to Think

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

Towards Automated Kernel Generation in the Era of LLMs

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

2. Machine Translation and Cross-Lingual Processing (4 papers)

ClinicalAligner26AM: A Cross-Lingual Aligner for Dataset Translation; Evidences from the MultiClinCorpus Shared Task

How Far Can Prompting Go for Minimal-Edit Ukrainian Grammatical Error Correction?

Beyond Accuracy: Community Perspectives on Machine Translation

Massively Multilingual Joint Segmentation and Glossing