arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Large Language Models and Foundation Models (37 papers)

2606.09856 2026-06-10 cs.CL cs.AI cs.LG stat.ML New submission

Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

Liyi Zhang, Akshay K. Jagadish, Brenden M. Lake, Thomas L. Griffiths

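AI summary: Proposes Program-based Posterior Training (PPT): an LLM generates scenarios as probabilistic programs, probabilistic inference produces distributional targets, and models are fine-tuned on those targets, improving inductive-reasoning accuracy, alignment with human judgments, and calibration.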

Comments 20 pages, 5 figures

Abstract

Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations. There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional. In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks. Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling. Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.
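
The fine-tuning step in this abstract trains on distributional targets rather than single correct answers. A minimal sketch of a soft-label cross-entropy loss follows, assuming the posterior from probabilistic inference is already available as a probability vector over a fixed set of answer options. The function names and the three-option example are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_cross_entropy(logits, target_probs):
    """Cross-entropy between a soft target distribution and model predictions.

    target_probs is assumed to be a posterior over answer options,
    not a one-hot label.
    """
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0)

# Hypothetical posterior over three answer options.
posterior = [0.7, 0.2, 0.1]
matched = [math.log(p) for p in posterior]   # model that reproduces the posterior
overconfident = [10.0, 0.0, 0.0]             # model with nearly all mass on option 0

loss_matched = soft_label_cross_entropy(matched, posterior)
loss_overconfident = soft_label_cross_entropy(overconfident, posterior)
```

With a soft target the loss is smallest when the model reproduces the posterior, so an overconfident prediction is penalized even when its top choice is the most likely option.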

2606.10296 2026-06-10 cs.CL cs.AI New submission

The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

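Affiliations: University of California, Irvine

AI summary: Studies how token-level log-probabilities, LLM-as-judge scores, and task accuracy relate in multi-agent debate. Confidence tracks judged reasoning quality more strongly for the Constructor than for the Auditor, and confidence-based detection of critical reasoning failures is more reliable for the Constructor (AUROC 0.804 vs. 0.634).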

Comments 15 pages, 7 figures, 1 table, ACL proceedings

Abstract

Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture -- a Constructor and an Auditor -- with an LLM-as-judge that scores each agent's reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.
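
The AUROC figures in this abstract measure how well a confidence score separates critical reasoning failures from non-failures. As a reminder of what the metric computes, here is a small pairwise sketch. The scores and labels are made-up illustration data, and using a "higher score means more likely failure" convention is an assumption.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical failure scores (e.g. negative mean log-probability) and failure flags.
failure_scores = [2.1, 0.4, 0.8, 0.3, 0.9]
is_failure = [1, 0, 1, 0, 0]
result = auroc(failure_scores, is_failure)
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation.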

2606.10307 2026-06-10 cs.CL New submission

Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

Ali Keramati, Justin Cheok, Jacob Horne, Mark Warschauer

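Affiliations: University of California, Irvine

AI summary: Tests whether decoding-time token-level log-probabilities, used as a confidence signal, predict reasoning quality in multi-agent LLM debate. Early-token confidence is the strongest predictor.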

Comments 15 pages, 8 figures, 4 tables; ACL Proceedings

Abstract

Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as-judge evaluation. Using a debate-based essay scoring framework, we compare confidence proxies against rubric-based judge scores across two ASAP essay sets. We find that early-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full-sequence statistics. Analysis of log-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi-agent LLM systems.
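
The contrast between early-token and full-sequence confidence can be illustrated with a simple proxy: the mean token log-probability over a prefix window versus over the whole response. This is only a sketch. The window size of 3 and the log-probability values are assumptions for illustration, and the paper's exact confidence proxies may differ.

```python
def mean_logprob(token_logprobs, first_k=None):
    """Mean log-probability over the first `first_k` tokens (all tokens if None)."""
    window = token_logprobs if first_k is None else token_logprobs[:first_k]
    if not window:
        raise ValueError("need at least one token log-probability")
    return sum(window) / len(window)

# Hypothetical per-token log-probabilities for one generated response.
logprobs = [-2.0, -1.0, -0.5, -0.1, -0.1, -0.1, -0.1, -0.1]

early_confidence = mean_logprob(logprobs, first_k=3)   # early-token proxy
full_confidence = mean_logprob(logprobs)               # full-sequence proxy
```

In this toy response the uncertain opening is averaged away by the confident tail in the full-sequence statistic, which is one way the two proxies can diverge.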

2606.10338 2026-06-10 cs.CL cs.AI New submission

Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

Jingyi Xie, Yijun Lin, Yinjiang Xiong, Zhikun Zhang, Sai Li

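Affiliations: Renmin University of China; Tsinghua University; Zhejiang University; Lightstandard

AI summary: Addresses the forget-retain routing mismatch in MoE models that leaves forget-critical experts under-regularized. TRACE detects forget-critical experts from offline activation statistics and reweights token-level retain losses to calibrate retain-side activation frequency. On WMDP and MUSE-BOOKS it yields a 9% relative utility improvement over the strongest baseline at comparable forgetting quality.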

Abstract

Machine unlearning is increasingly important for large language models, yet unlearning in Mixture-of-Experts (MoE) architectures remains underexplored. Unlike dense models, MoE architectures employ a router at each layer to assign each token to a sparse subset of experts. In this work, we observe that forget data often activates a small subset of experts disproportionately, while these experts may receive much weaker activation from retain data. This forget-retain routing mismatch can leave forget-critical experts under-regularized during unlearning. To address this, we propose TRACE, Targeted Routing-Aware Calibration of Experts, for MoE unlearning. TRACE first detects forget-critical experts from offline activation statistics, and then calibrates retain regularization by reweighting token-level retain losses so that each selected expert's retain-side activation frequency better matches its forget-side counterpart. Experiments on WMDP and MUSE-BOOKS across multiple MoE LLMs show that TRACE consistently improves the forget-utility trade-off, yielding a 9% relative utility improvement over the strongest baseline under comparable forgetting quality and the best performance on three out of four MUSE-BOOKS metrics.

2606.10369 2026-06-10 cs.CL cs.LG New submission

PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning

Xinyue Peng, Yi Qian, Jiaojiao Lin, Wenjian Shao, Yanming Liu

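Affiliations: University of Science and Technology of China

AI summary: Proposes Path-Aligned Decompression Distillation (PADD), which distills a dense teacher into a mixture-of-experts (MoE) student through four stages organized in two phases while learning high-quality routing policies. It yields substantial gains over strong baselines on mathematical reasoning benchmarks.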

Comments published in ICML 2026

Abstract

As large language models (LLMs) continue to scale, it becomes increasingly challenging to grow model capacity under fixed computation budgets. We propose Path-Aligned Decompression Distillation (PADD), a framework for distilling knowledge from dense teachers without explicit routing into mixture-of-experts (MoE) students while learning high-quality routing policies. PADD organizes knowledge distillation into four stages in two phases: an initialization phase (Stage I) that builds diverse functionality in the student's experts through teacher neuron clustering and student-expert warmup, and a training phase (Stages II--IV) that integrates online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing in a single training pipeline. Experiments on mathematical reasoning benchmarks demonstrate that PADD yields substantial gains over strong baselines at the same inference cost and that the MoE student can match or surpass its dense teacher. They also demonstrate effective teacher-to-student knowledge distillation and stable routing behavior.

2606.10537 2026-06-10 cs.CL New submission

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

Jing Xiong, Qi Han, Shansan Gong, Yunta Hsieh, Chengyue Wu, Chaofan Tao, Chenyang Zhao, Ngai Wong

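Affiliations: The University of Hong Kong; University of Michigan, Ann Arbor; LMSYS Org

AI summary: Diffusion language models re-encode the whole prefix at every denoising step, so recomputation grows quadratically with context length. Prefilling-dLLM caches chunked KV representations once and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding. It reaches state-of-the-art quality among dLLM acceleration methods on LongBench and InfiniteBench, with a 9.1-28.0x speedup at 8K-32K contexts.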

Comments Technical Report

Abstract

Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework for dLLMs that partitions the prefix into N chunks, caches their KV representations once, and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields 9.1--28.0x speedup at 8K--32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Code is available at https://github.com/menik1126/Prefilling-dLLM.

2606.10650 2026-06-10 cs.CL cs.AI New submission

Dynamic Linear Attention

Xin Wang, Hui Shen, Boyuan Zheng, Xueshen Liu, Minkyoung Cho, Zhongwei Wan, Zesen Zhao, Zhuoqing Mao, Shen Yan, Mi Zhang

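Affiliations: The Ohio State University; University of Michigan; ByteDance Seed

AI summary: Proposes DLA, which combines Information-Aware Dynamic State Merging with Capacity-Bounded Memory Modeling to address the error accumulation caused by fixed state-merging policies in multi-state linear attention. It outperforms state-of-the-art methods across 16 datasets.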

Comments Accepted by ICML 2026

Abstract

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over state-of-the-art.

2606.10722 2026-06-10 cs.CL New submission

Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

Ruixuan Huang, Jinyuan Shi, Hantao Huang, Yifan Huang, Ziyi Guan, Hao Zeng, Ian En-Hsu Yen, Minghui Yu

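Affiliations: Nanyang Technological University; Salesforce AI; Huawei Noah's Ark Lab

AI summary: Presents a continual-training recipe for building channel-sparse LLMs from dense checkpoints. A predictor-gated sparse SwiGLU FFN with a bank-wise top-k rule gives 4x sparsity in the FFN intermediate activation, and a single-layer repair algorithm fixes a long-context failure mode.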

Abstract

We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN in the 32K stage. For each token and layer, we use a low-rank predictor to produce FFN-channel routing logits. We then apply a bank-wise top-k rule to retain 16 channels in every 64-channel bank, yielding 4x sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module is placed on the main language modeling path and optimized during continual training, enabling the dense model to be upcycled into a hardware-oriented sparse model. We report the architecture, training recipe, benchmark performance, and training lessons. We also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.
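
The bank-wise top-k rule in this abstract (keep 16 channels in every 64-channel bank) can be sketched as a masking function over routing logits. This is an illustrative sketch only: the function name, the tie-breaking rule, and the synthetic logits are assumptions, and the low-rank predictor that would produce the logits is not shown.

```python
def bankwise_topk_mask(logits, bank_size=64, keep=16):
    """Return a 0/1 mask keeping the `keep` highest logits in each bank."""
    if len(logits) % bank_size != 0:
        raise ValueError("channel count must be a multiple of bank_size")
    mask = [0] * len(logits)
    for start in range(0, len(logits), bank_size):
        bank = range(start, start + bank_size)
        # Top-`keep` channel indices in this bank (ties broken by lower index).
        top = sorted(bank, key=lambda i: (-logits[i], i))[:keep]
        for i in top:
            mask[i] = 1
    return mask

# Hypothetical routing logits for 128 FFN channels (two banks of 64).
routing_logits = [((i * 37) % 101) / 101.0 for i in range(128)]
mask = bankwise_topk_mask(routing_logits)
```

Because every bank keeps the same number of channels, the active count per bank is fixed at 16 of 64, which is the 4x sparsity stated in the abstract.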

2606.10829 2026-06-10 cs.CL cs.AI New submission

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

Yusuf Sahin, Ahmed Rockey Saikia, Volkan Cevher, Paolo Favaro

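Affiliations: University of Bern; EPFL

AI summary: Parallel decoding in masked diffusion language models can be unsafe when candidate tokens interact. ADAS is a training-free reranking rule that improves subset construction with a soft attention-discounted penalty, raising low-NFE performance across several benchmarks.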

Abstract

Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predictions are coupled. Existing training-free samplers such as Top-k, Fast-dLLM, and EB-Sampler mainly control how many tokens to reveal, while often ranking candidates by token-wise scores that ignore interactions within the selected set. We propose ADAS, a training-free reranking rule for parallel masked diffusion decoding. ADAS leaves the base sampler's stopping rule unchanged and modifies only subset construction: it greedily discounts a candidate when it attends strongly to already selected positions whose predictions remain uncertain. Unlike graph-constrained methods that turn attention into hard compatibility constraints, ADAS keeps attention continuous and uses it as a soft marginal penalty. Across LLaDA-8B-Base and Dream-7B-Base on GSM8K, MATH500, HumanEval, and MBPP, plugging ADAS into Top-k, Fast-dLLM, and EB-Sampler improves low-NFE performance at matched denoiser evaluations by 9.11 and 10.46 percentage points on average, respectively, with 3.1% per-forward runtime overhead. These results show that soft attention-discounted reranking is a simple and modular way to improve quality in highly parallel decoding for masked diffusion language models.
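
The greedy discounting idea can be sketched as follows. This is a hypothetical scoring rule written only to illustrate the description in the abstract, not the paper's formula: the discount term (attention to a selected position times that position's uncertainty), the `penalty` weight, and the fixed `num_select` standing in for the base sampler's stopping rule are all assumptions.

```python
def greedy_discounted_subset(confidence, attention, num_select, penalty=1.0):
    """Greedily pick positions, discounting candidates that attend to
    already selected positions whose predictions are still uncertain."""
    selected = []
    remaining = set(range(len(confidence)))
    while remaining and len(selected) < num_select:
        def score(i):
            discount = sum(attention[i][j] * (1.0 - confidence[j]) for j in selected)
            return confidence[i] - penalty * discount
        best = max(sorted(remaining), key=score)  # lowest index wins ties
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: position 1 attends strongly to position 0.
confidence = [0.9, 0.85, 0.8]
attention = [
    [0.0, 0.0, 0.0],
    [0.9, 0.0, 0.1],
    [0.1, 0.2, 0.0],
]
chosen = greedy_discounted_subset(confidence, attention, num_select=2)
plain_top2 = sorted(range(3), key=lambda i: -confidence[i])[:2]
```

In the toy example, ranking by confidence alone commits positions 0 and 1 together, while the discounted ranking swaps position 1 for position 2 because position 1 depends heavily on a position that was just selected.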

2606.10932 2026-06-10 cs.CL cs.LG New submission

Density Field State Space Models: 1-Bit Distillation, Efficient Inference, and Knowledge Organization in Mamba-2

Chirag Shinde

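Affiliations: Independent Researcher

AI summary: Proposes DF-SSM, which compresses SSMs to a 1-bit scaffold with int8 low-rank correction. Applied to Mamba-2 1.3B, it gives 9.7x compression and 21.4x faster GPU inference after distillation on only 32M tokens in 6 hours, and the analysis identifies three processing phases in the model's internal knowledge organization.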

Comments 16 pages, 6 figures, 7 tables. Code available at https://github.com/cs-cmyk/df-ssm

Abstract

We present Density Field State Space Models (DF-SSM), a framework for compressing SSMs to a 1-bit scaffold with int8 low-rank correction. Applied to Mamba-2 1.3B, we achieve a 278 MB model (9.7x smaller than the 2.7 GB FP16 teacher) that runs at 21.4x faster inference on GPU (batch=1, relative to the mamba-ssm reference implementation) while maintaining downstream task performance within 2-4 percentage points of BitMamba-2, a 1.58-bit model trained from scratch on 150B tokens. The distillation itself requires only 32M tokens and 6 hours on a single A100 GPU, though it presupposes a pretrained FP16 teacher. We develop an optimized inference pipeline combining cuBLAS INT8 tensor cores for the scaffold matmul, custom CUDA kernels for stateful SSM and convolution operations, and an AVX-512 CPU backend for efficient deployment on both GPU and CPU. Beyond compression, we investigate the internal knowledge organization of the resulting model, discovering three distinct processing phases: intent classification (layers 0-3, operating in an abstract space with no vocabulary alignment), knowledge retrieval (layers 25-35, where factual associations localize to a 5-layer window), and output formatting (layers 36-47, where category structure dissolves). Through systematic analysis of 445 factual prompts across 19 categories, we find that early-layer classification is syntactic (driven by template structure) rather than semantic, and that the model exhibits well-organized knowledge representations despite weak factual recall--suggesting that representational structure may precede factual strength.

2606.11052 2026-06-10 cs.CL New submission

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Xinyu Zhou, Boyu Zhu, Yi Xu, Zhiwei Li, Yingfa Chen, Huiming Wang, Zhijiang Guo

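Affiliations: LARK, HKUST(GZ); UCL; Mistral AI; Tsinghua University; SUTD; HKUST

AI summary: Finds that CoT supervised fine-tuning systematically degrades long-context recall in hybrid linear-attention models, and proposes QK-Restore, which repairs this without additional training by restoring the pre-SFT query and key projection matrices.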

Comments 28 pages

Abstract

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from 67.2% to 9.4%. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting query-key projections ($W_Q, W_K$) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only $W_Q$ and $W_K$ from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from 65.4% to 76.4% while maintaining strong reasoning performance.
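
The restore step amounts to a selective checkpoint merge: query and key projections come from the pre-SFT checkpoint, everything else from the post-SFT one. A minimal sketch over plain dictionaries follows. The parameter-name markers `q_proj` and `k_proj` are hypothetical naming conventions, the string values stand in for weight tensors, and the Procrustes variant is not shown.

```python
def qk_restore(pre_sft, post_sft, qk_markers=("q_proj", "k_proj")):
    """Take query/key projection weights from pre_sft, all others from post_sft.

    The marker substrings are assumed parameter-name conventions.
    """
    restored = {}
    for name, value in post_sft.items():
        if any(marker in name for marker in qk_markers):
            restored[name] = pre_sft[name]   # restore W_Q / W_K
        else:
            restored[name] = value           # keep the post-SFT parameter
    return restored

# Toy checkpoints; real ones would map parameter names to tensors.
pre = {"layer0.q_proj": "Wq_pre", "layer0.k_proj": "Wk_pre", "layer0.v_proj": "Wv_pre"}
post = {"layer0.q_proj": "Wq_post", "layer0.k_proj": "Wk_post", "layer0.v_proj": "Wv_post"}
merged = qk_restore(pre, post)
```

No gradient step is involved, which is consistent with the abstract's description of the method as training-free.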

2606.09635 2026-06-10 cs.CL cs.LG New submission

Gradient-Guided Reward Optimization for Inference-time Alignment

Hankun Lin, Ruqi Zhang

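Affiliations: Purdue University

AI summary: Proposes Gradient-Guided Reward Optimization (GGRO), which steers the generation trajectory at inference time by injecting nudging tokens generated from reward-model gradient signals during decoding. It improves safety, helpfulness, and reasoning performance and increases robustness to reward hacking.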

Comments Accepted to UAI 2026

Abstract

Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.
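
The entropy-monitoring step can be illustrated with a short sketch that flags decoding steps whose next-token distribution is highly uncertain. The threshold of 1.0 nat and the toy distributions are assumptions for illustration, and the gradient-based generation of nudging tokens is not shown.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_uncertainty_steps(step_distributions, threshold):
    """Indices of decoding steps whose entropy exceeds `threshold`."""
    return [i for i, probs in enumerate(step_distributions)
            if token_entropy(probs) > threshold]

# Hypothetical next-token distributions over a 4-token vocabulary.
steps = [
    [0.97, 0.01, 0.01, 0.01],   # confident step
    [0.25, 0.25, 0.25, 0.25],   # maximally uncertain step
    [0.70, 0.10, 0.10, 0.10],   # moderately confident step
]
flagged = high_uncertainty_steps(steps, threshold=1.0)
```

Only the flagged steps would be candidates for intervention, which matches the abstract's emphasis on targeted, minimal changes during decoding.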

2606.09850 2026-06-10 cs.LG cs.CL Cross-listed

Mechanistic Analysis of Alignment Algorithms in Language Models

Aarush Sinha, Ishan Garg, Veeraraju Elluru, Arth Singh, Kushal Garg

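AI summary: Uses layer-wise linear probing, Sparse Autoencoders, and crosscoders to analyze the internal mechanisms of six preference-optimization methods in language models. Different objectives induce different geometric transformations of the representations, and behavioral alignment does not imply uniform internal restructuring.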

Comments Work in Progress

Abstract

Post-training alignment algorithms are predominantly evaluated as black boxes, obscuring how they reshape language models' internal computations. We present a systematic mechanistic analysis of six preference-optimization methods: PPO, DPO, SimPO, ORPO, GRPO, and KTO across three open-weight model families. By integrating layer-wise linear probing, Sparse Autoencoders, and crosscoders, we localize preference representations and quantify alignment-induced geometric transformations in latent space. We find that preference signals consistently concentrate in early--mid or mid--late layers, but different objectives induce qualitatively distinct representational shifts. KTO and GRPO enhance linear separability through constructive feature sharing and sparse, high-salience recruitment. In contrast, DPO and ORPO degrade separability via non-constructive geometric rotation and feature attenuation, while PPO and SimPO largely preserve baseline geometry. These transformations exhibit architecture-dependent variability, demonstrating that behavioral alignment does not imply uniform internal restructuring. Our findings establish alignment as a heterogeneous intervention, motivate standardized feature-level auditing for safety and interpretability, and highlight the need for mechanism-aware optimization objectives.

2606.09877 2026-06-10 cs.LG cs.CE cs.CL Cross-listed

Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis

Juan M. Huerta

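Affiliations: Zinnia Tech Solutions

AI summary: Formalizes Streaming Knowledge Compilation, in which a materiality signal φ_t proactively pins important documents. The paper proves an O(√(T log K)) regret bound, instantiates the method in finance and Wikipedia, and exposes an LLM-as-judge confound.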

Abstract

LLM wiki systems compile knowledge into pre-filled KV caches for efficient inference, but assume a static corpus -- an assumption that fails whenever the underlying information landscape evolves. We formalize Streaming Knowledge Compilation: given a document stream, a fixed token budget, and future queries unknown at ingestion time, maintain a compiled wiki that minimizes cumulative regret against an offline oracle with perfect foresight. The enabling insight is a materiality signal $\phi_t(k,n)\in[0,1]$ that scores document importance for entity $k$ at time $t$, acting as a query-relevance surrogate for proactive pinning before queries arrive; we prove an $O(\sqrt{T\log K})$ regret bound where $\varepsilon=\mathbb{E}[|\phi_t-\hat{\phi}_t|]$ is the only domain-specific quantity. We instantiate in two domains: finance, where $\phi_t$ is abnormal stock volatility predicted by frozen Llama 3.1 8B classification head (AUROC = 0.728 on 76K articles, strict temporal split; $1.49\times$ higher realized forward volatility for predicted-material articles); and Wikipedia, where $\phi_t$ is the Abnormal Edit Ratio (AER), a cross-sectionally normalized edit velocity -- showing the same algorithm generalizes beyond the finance domain. End-to-end QA evaluation on 173 matched pairs (finance) and 119 (Wikipedia) reveals a pervasive LLM-as-judge confound on post-training knowledge, establishing that regret analysis -- not absolute QA scores -- is the reliable evaluation metric for compiled knowledge systems. Finance cumulative regret converges to -20.0 (-0.12/step); Wikipedia to +16.0 (+0.13/step), with the positive sign confirming that Wikipedia edit content is genuinely post-training -- richer context consistently improves scores (No Wiki 3.80 vs. Oracle 4.74) -- and eliminates this confound. The $O(\sqrt{T\log K})$ guarantee applies to any domain where knowledge gaps can be predicted from streaming signals.

2606.09887 2026-06-10 cs.LG cs.AI cs.CL Cross-listed

SocraticPO: Policy Optimization via Interactive Guidance

Zirui Liu, Jie Ouyang, Qi Liu, Xianquan Wang, Jiayu Liu, Tingyue Pan, Qingchuan Li, Jing Sha, Zhenya Huang, Shijin Wang, Enhong Chen

Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd

AI summary: Proposes SocraticPO, which augments RL rollouts with natural-language guidance and uses reward decay to keep the policy from relying on teacher help, improving performance on scientific reasoning tasks.

Abstract

Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose SocraticPO (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.
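
The reward-decay idea can be made concrete with a small sketch. The abstract does not give the decay schedule, so the geometric form and the factor 0.5 below are assumptions chosen only to show the intended ordering of rewards.

```python
def decayed_reward(is_correct, num_interventions, decay=0.5):
    """Binary outcome reward, scaled down once per teacher intervention.

    The geometric schedule and the default factor are hypothetical.
    """
    if not is_correct:
        return 0.0
    return decay ** num_interventions

independent = decayed_reward(True, 0)   # solved without help
assisted = decayed_reward(True, 1)      # solved after one round of guidance
failed = decayed_reward(False, 1)       # still wrong after guidance
```

Any schedule with the ordering independent > assisted > failed keeps the incentive described in the abstract: teacher help is useful but is not a free path to full reward.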

2606.09894 2026-06-10 cs.LG cs.CL 交叉投稿

A Navigable Manifold of Hypothesized Consciousness-Spectrum States in Language Model Representations

语言模型表示中假设的意识谱状态的可导航流形

Sophie Zhao

发表机构 * School of Computer Science(计算机科学学院) Georgia Institute of Technology(佐治亚理工学院)

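AI summary Studies the geometric structure associated with a consciousness spectrum in language-model embedding spaces, finding that embeddings form a navigable manifold: higher- and lower-level regions are stable, intermediate regions form a transition corridor, and navigability is an intrinsic property of the space.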

详情
AI中文摘要

在沉思、哲学和心理学描述中,人类意识常被描述为从反应性和自我聚焦模式到更整合和连贯模式的类似谱系。理解语言模型是否在表示空间中编码了这种结构化、人类可解释的意识谱系,对于模型引导、评估和对齐具有重要意义。在这项工作中,我们研究了Transformer嵌入空间中沿该谱系的几何结构和动态模式。我们表明,嵌入表现出与该谱系对齐的全局组织几何:与相似状态相关的句子聚类成局部连贯区域,形成结构化流形。特别地,高层和低层区域表现出类似凸性的稳定性,而中间区域形成过渡走廊。在动态上,效用引导和纯几何贪婪轨迹都一致地从低层区域穿越到高层区域,经过中间层级,表明可导航性是表示空间的内在属性,由全局方向信号引导但非决定。这些结果表明,嵌入空间编码了与假设的意识谱分类法(广泛受沉思传统、哲学和现代心理学中人类意识反复出现的结构描述启发)对齐的结构化和可导航几何,为分析和引导模型行为提供了表示层面的视角。

英文摘要

Across contemplative, philosophical, and psychological accounts, human consciousness is often described along a similar spectrum, ranging from reactive and self-focused patterns to more integrative and coherent ones. Understanding whether language models encode such a structured, human-interpretable consciousness spectrum in representation space is important for model guidance, evaluation and alignment. In this work, we study the geometric structure and dynamics of patterns along this spectrum in transformer embedding spaces. We show that embeddings exhibit a globally organized geometry aligned with this spectrum: sentences associated with similar states cluster into locally coherent regions, forming a structured manifold. In particular, higher-level and lower-level regions exhibit convexity-like stability, while intermediate regions form a transition corridor. Dynamically, both utility-guided and geometry-only greedy trajectories consistently traverse from lower- to higher-level regions, passing through intermediate tiers, indicating that navigability is an intrinsic property of the representation space, guided but not dictated by a global directional signal. These results suggest that embedding spaces encode structured and navigable geometry aligned with a hypothesized consciousness-spectrum taxonomy, broadly inspired by recurring structural descriptions of human consciousness across contemplative traditions, philosophy, and modern psychology, providing a representation-level perspective for analyzing and guiding model behavior.

2606.09937 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

RKSC:面向多步LLM推理的感知推理的KV缓存共享与自信提前退出

Anirudh Sekar

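AI summary Proposes the RKSC framework, which removes structural redundancy in multi-branch LLM reasoning through attention-similarity KV sharing, confidence-gated early exit, and reasoning-selective block cache management, achieving a mean 3.008x speedup with an error rate of only 0.37%.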

Comments Accepted to the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

详情
AI中文摘要

我们提出RKSC(感知推理的KV缓存共享),一种无需训练的推理框架,消除了多分支LLM推理流程中的两种结构冗余。ASKS(注意力相似性KV共享)计算前缀KV缓存一次,并通过隐藏状态余弦相似度广播给所有语义相似的分支,严格推广了vLLM和SGLang使用的精确令牌前缀缓存。CGEE(置信门控提前退出)应用两种互补的退出机制:(1)当生成置信度在分支间具有决定性时,完全跳过验证前向传播;(2)当逐层熵稳定时,在中间层终止验证传播,使用Transformer骨干上的轻量级钩子。RSBCM(推理选择性块缓存管理器)通过注意力加权深度优先驱逐防止无界缓存增长。在五个模型家族(7B-10B)、四个基准测试和1000个评估问题上,RKSC相对于无KV基线实现了平均3.008倍加速(峰值3.990倍),相对于vLLM等效前缀缓存平均提升1.66倍,CGEE导致的错误率仅为0.37%(1616次验证调用中6次错误)。无需微调或架构更改。代码可在该URL获取。

英文摘要

We introduce RKSC (Reasoning-Aware KV Cache Sharing), a training-free inference framework that eliminates two structural redundancies in multi-branch LLM reasoning pipelines. ASKS (Attention-Similarity KV Sharing) computes the prefix KV cache once and broadcasts it to all semantically similar branches via hidden-state cosine similarity, strictly generalising the token-exact prefix caching used by vLLM and SGLang. CGEE (Confidence-Gated Early Exit) applies two complementary exit mechanisms: (1) it skips the verification forward pass entirely when generation confidence is decisive across branches, and (2) it terminates the verification pass at an intermediate layer when per-layer entropy stabilises, using lightweight hooks on the transformer backbone. RSBCM (Reasoning-Selective Block Cache Manager) prevents unbounded cache growth via attention-weighted depth-priority eviction. Across five model families (7B-10B), four benchmarks, and 1,000 evaluated problems, RKSC achieves a mean speedup of 3.008x over the No-KV baseline (peak 3.990x), a 1.66x mean improvement over vLLM-equivalent prefix caching, with a CGEE-induced error rate of only 0.37% (6 errors out of 1,616 verify calls). No fine-tuning or architecture changes are required. Code is available at https://github.com/AnirudhSekar/RKSC.
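
The similarity test behind ASKS can be sketched as follows. This is only a toy illustration of deciding which branches may reuse a prefix KV cache based on hidden-state cosine similarity; the function names and the `threshold` value are assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def branches_sharing_prefix(anchor_state, branch_states, threshold=0.95):
    """Indices of branches whose hidden state is close enough to the anchor
    to reuse its prefix KV cache (threshold is a hypothetical setting)."""
    return [i for i, h in enumerate(branch_states)
            if cosine(anchor_state, h) >= threshold]
```

Token-exact prefix caching corresponds to the special case where only identical prefixes qualify; the similarity threshold relaxes that condition.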

2606.10298 2026-06-10 cs.AI cs.CL 交叉投稿

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

从上下文感知到冲突感知:泛化对比解码以应对LLMs中的知识冲突

Runze Jiang, Taiqiang Wu, Yan Wang, Bingyu Zhu, Longtao Huang

发表机构 * Peking University(北京大学) Alibaba Group(阿里巴巴集团) The University of Hong Kong(香港大学)

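AI summary Addresses knowledge conflicts between external context and parametric priors during LLM generation by proposing a conflict-aware paradigm that dynamically weights prior against context, with an adaptive mechanism that resolves the asymmetry across conflict regimes.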

Comments 27 pages, 9 figures

详情
AI中文摘要

当大语言模型从检索或增强的上下文中生成时,外部上下文与参数先验之间的冲突仍然是核心可靠性瓶颈。现有的对比解码方法遵循一种\emph{上下文感知}范式,单方面放大上下文而压制参数先验,当上下文错误时会覆盖正确的先验。我们将其泛化为\textbf{冲突感知}范式,基于冲突信号动态分配先验与上下文的权威,而非预设上下文的可信度。我们证明,先验和上下文logits的仿射组合产生一个\textbf{幂族},具有固有的\textbf{状态不对称性}:当先验正确时外推会无界放大错误,当上下文正确时内插会纠正不足,且没有静态状态能同时覆盖两者。现有的对比解码方法是该族实例,大多为外推型。为评估两种冲突方向,我们提出TriState-Bench,一种模型感知的评估协议,校准每个模型的先验知识以测量三种冲突状态:纠正、抵抗和一致。为解决不对称性,我们提出自适应状态路由(ARR),在每一步在状态间路由,将抵抗EM从低于6提升至16-33,且不牺牲纠正或一致。我们的代码可在该https URL获取。

英文摘要

When large language models generate from retrieved or augmented contexts, conflicts between external context and parametric priors remain a central reliability bottleneck. Existing contrastive decoding methods follow a context-aware paradigm that unilaterally amplifies context over parametric priors, overwriting correct priors when the context is erroneous. We generalize this to the conflict-aware paradigm that dynamically allocates authority between prior and context based on conflict signals, rather than presupposing context trustworthiness. We show that the affine combination of prior and context logits yields a power family with an inherent regime asymmetry: extrapolation amplifies errors unboundedly when the prior is correct, interpolation under-corrects when the context is correct, and no static regime covers both. Existing contrastive decoding methods are instances of this family, mostly extrapolative. To evaluate both conflict directions, we propose TriState-Bench, a model-aware evaluation protocol that calibrates per-model prior knowledge to measure three conflict states: correction, resistance, and agreement. To resolve the asymmetry, we propose Adaptive Regime Routing (ARR), which routes between regimes at each step, lifting resistance EM from below 6 to 16-33 without sacrificing correction or agreement. Our code is available at https://github.com/keith-Jiang/conflict-aware-decoding.
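
The claim that an affine combination of logits yields a power family can be checked numerically. The sketch below assumes the parameterization z = (1 - lam) * z_prior + lam * z_ctx, which may differ from the paper's notation; under it, lam in [0, 1] interpolates and lam > 1 extrapolates toward the context.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def affine_logits(prior, ctx, lam):
    """Affine combination of prior and context logits (assumed form)."""
    return [(1 - lam) * p + lam * c for p, c in zip(prior, ctx)]

def power_family(prior, ctx, lam):
    """Normalized product p_prior^(1-lam) * p_ctx^lam over the vocabulary."""
    p, q = softmax(prior), softmax(ctx)
    w = [pi ** (1 - lam) * qi ** lam for pi, qi in zip(p, q)]
    s = sum(w)
    return [x / s for x in w]
```

For any `lam`, `softmax(affine_logits(prior, ctx, lam))` and `power_family(prior, ctx, lam)` agree up to floating-point error, because the softmax normalizers cancel after renormalization.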

2606.10435 2026-06-10 cs.LG cs.CL 交叉投稿

Parallel Causal Associative Fields: Gated Sparse Memory for Long-Context Language Modeling

并行因果关联域:用于长上下文语言建模的门控稀疏记忆

Muhammad Ahmed

发表机构 * Independent Researcher(独立研究员)

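AI summary Proposes Parallel Causal Associative Fields (PCAF), which store local records in hash buckets, retrieve a candidate set to form a sparse cache, and mix it with a parametric language model through a gate, giving sparse long-context access without a fixed-state bottleneck.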

Comments 17 pages, 5 figures, and 6 tables. Experiments on WikiText-103, PG-19, and WikiText-2 using TPU v4-32 and NVIDIA RTX 3060 hardware. Code: https://github.com/ahmed123hds/PCAF

详情
AI中文摘要

Transformer通过提供直接的token间通信路径实现了强大的语言建模性能,但因果自注意力的计算量随上下文长度呈二次方增长。循环模型和状态空间模型降低了这一成本,但将历史压缩为顺序更新的固定大小状态。本文研究了第三种原语:基于因果后继记录的并行内容寻址记忆。所提出的并行因果关联域(PCAF)将上下文窗口中的局部记录写入哈希桶,为当前查询检索有界的候选集,在后继token上形成稀疏缓存分布,并通过学习到的门将该缓存与参数化局部语言模型混合。所得模型在避免单一固定循环状态瓶颈的同时,保持了稀疏的长上下文访问。我们在WikiText-103和PG-19上使用分布式Google Cloud TPU v4-32 pod对PCAF进行了完全自回归预训练。在303M参数和上下文长度T=2048的情况下,PCAF-semantic在WikiText-103上达到36.31困惑度,在PG-19上达到52.45困惑度,而匹配的密集Transformer分别为47.49和53.84。PCAF-semantic在TPU pod上同时处理0.61-0.62M token/s,而密集和局部注意力基线为0.43M token/s。支持41M参数的多种子扫描和单GPU组件消融实验表明,关联缓存、检索容量和学习到的门对速度-质量权衡有实质性影响。

英文摘要

Transformers achieve strong language modeling performance by providing direct token-to-token communication paths, but causal self-attention scales quadratically with context length. Recurrent and state-space models reduce this cost, yet compress history into sequentially updated fixed-size states. This paper studies a third primitive: a parallel content-addressed memory over causal successor records. The proposed Parallel Causal Associative Field (PCAF) writes local records from a context window into hash buckets, retrieves a bounded candidate set for the current query, forms a sparse cache distribution over successor tokens, and mixes that cache with a parametric local language model through a learned gate. The resulting model maintains sparse long-context access while avoiding a single fixed recurrent state bottleneck. We evaluate PCAF under full autoregressive pretraining on WikiText-103 and PG-19 using a distributed Google Cloud TPU v4-32 pod. At 303M parameters and context length T = 2048, PCAF-semantic reaches 36.31 perplexity on WikiText-103 and 52.45 perplexity on PG-19, compared with 47.49 and 53.84 for a matched dense Transformer. PCAF-semantic simultaneously processes 0.61-0.62M tokens/s across the TPU pod, versus 0.43M tokens/s for dense and local attention baselines. Supporting 41M-parameter multi-seed sweeps and single-GPU component ablations show that the associative cache, retrieval capacity, and learned gate materially affect the speed-quality trade-off.
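
The final mixing step described above can be sketched in a few lines. This toy version assumes a scalar gate and a cache represented as successor-token counts; both representations, and the name `gated_mix`, are illustrative assumptions rather than the PCAF implementation.

```python
def gated_mix(p_lm, cache_counts, gate):
    """Mix a parametric next-token distribution with a sparse cache
    distribution built from retrieved successor-token counts.

    p_lm:         dict token -> probability from the local language model
    cache_counts: dict token -> count among retrieved successor records
    gate:         weight in [0, 1] placed on the cache (learned in the paper,
                  a fixed scalar in this sketch)
    """
    total = sum(cache_counts.values())
    if total == 0:
        return dict(p_lm)  # nothing retrieved: fall back to the LM
    mixed = {tok: (1 - gate) * p for tok, p in p_lm.items()}
    for tok, count in cache_counts.items():
        mixed[tok] = mixed.get(tok, 0.0) + gate * count / total
    return mixed
```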

2606.10528 2026-06-10 cs.LG cs.CL 交叉投稿

Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

表示感知优势估计:你的奖励模型提供的不仅仅是标量输出

Guozheng Li, Xiyan Fu, Yiwen Guo

发表机构 * Southeast University(东南大学) Nanyang Technological University(南洋理工大学) Independent Researcher(独立研究员)

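AI summary Proposes representation-aware advantage estimation, which uses reward-model hidden states as auxiliary signals and computes advantages via graph propagation, improving the sample efficiency and robustness of RLHF.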

详情
AI中文摘要

当前基于人类反馈的强化学习(RLHF)方法主要依赖来自训练好的奖励模型(RM)的标量奖励。虽然有效,但标量奖励通常存在噪声,无法捕捉细粒度的偏好差异,而RM隐藏状态编码了更丰富的语义和偏好信息。我们引入了表示感知优势估计,利用RM隐藏状态并将其建模为辅助信号以实现更好的优势估计。具体来说,我们提出了基于图的优势估计(GraphAE),将每个采样组视为一个图,其中节点对应响应,边捕捉它们在RM隐藏空间中的相似性。然后通过图传播计算优势值,使每个样本能够从其邻居中融入上下文信息。GraphAE轻量级,可以无缝集成到现有的基于组的RL算法中。我们将GraphAE应用于GRPO、GSPO和RLOO,并在不同模型和基准上进行了大量实验。实证结果显示,在三个基准上均有一致改进,在Arena-Hard-v0.1上提升高达+6.3,在AlpacaEval 2.0上提升+8.27,在MT-Bench上提升+0.22。这些结果表明,利用RM表示可以实现更高效和鲁棒的RLHF。

英文摘要

Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference differences, whereas RM hidden states encode richer semantic and preference information. We introduce representation-aware advantage estimation, which leverages RM hidden states and models them as auxiliary signals for better advantage estimation. Specifically, we propose Graph-based Advantage Estimation (GraphAE), which treats each sampled group as a graph whose nodes correspond to responses and whose edges capture their similarity in the RM hidden space. Advantages are then computed via graph propagation, enabling each sample to incorporate contextual information from its neighbors. GraphAE is lightweight and can be seamlessly integrated into existing group-based RL algorithms. We apply GraphAE to GRPO, GSPO and RLOO, and conduct extensive experiments on different models and benchmarks. Empirical results show consistent improvements across three benchmarks, with gains of up to +6.3 on Arena-Hard-v0.1, +8.27 on AlpacaEval 2.0, and +0.22 on MT-Bench. These results demonstrate that leveraging RM representations leads to more sample-efficient and robust RLHF.
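
A rough sketch of graph-propagated advantages, under stated assumptions: edges are non-negative cosine similarities between RM hidden states, the base advantage is the group-mean-centered reward, and a single propagation step blends each response's advantage with its neighbors' weighted average. The blend weight `alpha` and all function names are hypothetical; the paper's propagation rule may differ.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def graph_advantages(rewards, hidden, alpha=0.5):
    """One propagation step over a similarity graph of sampled responses."""
    n = len(rewards)
    mean = sum(rewards) / n
    adv = [r - mean for r in rewards]  # group-relative baseline
    out = []
    for i in range(n):
        # Edge weights to every other response; negative similarity is clipped.
        w = [max(cosine(hidden[i], hidden[j]), 0.0) if j != i else 0.0
             for j in range(n)]
        total = sum(w)
        neighbor = (sum(wj * aj for wj, aj in zip(w, adv)) / total
                    if total else adv[i])
        out.append((1 - alpha) * adv[i] + alpha * neighbor)
    return out
```

With `alpha=0` this reduces to the plain group-relative advantage, so the sketch can be dropped into a group-based estimator without changing its baseline behavior.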

2606.10607 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

Causal Ensemble Agent: Hierarchical Causal Discovery with LLM-guided Expert Reweighting

因果集成智能体:基于LLM引导的专家重加权的层次化因果发现

Xinyu Li, Yuanyuan Wang, Haoxuan Li, Chuan Zhou, Erdun Gao, Bo Han, Tongliang Liu, Kun Zhang, Howard Bondell, Mingming Gong

发表机构 * The University of Melbourne(墨尔本大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Peking University(北京大学) Adelaide University(阿德莱德大学) Hong Kong Baptist University(香港浸会大学) The University of Sydney(悉尼大学) Carnegie Mellon University(卡内基梅隆大学)

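AI summary Proposes the Causal Ensemble Agent (CEA) framework, which aggregates statistical causal discovery results from different graph levels via linear opinion pooling and uses a large language model (LLM) as a meta-referee to dynamically reweight experts near the decision boundary, yielding more accurate and complete causal graphs.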

详情
AI中文摘要

因果发现旨在从观测数据中揭示因果结构,这对现实世界决策至关重要。然而,不同的因果发现算法可能产生相互冲突的结果,使得识别准确的因果图复杂化。传统方法依赖数值和统计假设,往往忽略丰富的领域特定信息(如特征描述),而这些信息也有助于结构学习。尽管近期研究探索使用大语言模型(LLM)通过直接查询推断因果关系,但由于缺乏与实际数据的一致性,此类方法可能不可靠。为解决这些限制,我们提出因果集成智能体(CEA),一种新颖框架,通过线性意见池聚合来自不同图层次的统计发现专家的结构见解,并在聚合置信度接近决策边界时,使用LLM作为元裁判动态重加权专家,从而组合出更完善、更完整的因果图。在合成和真实数据集上的大量实验表明,CEA在广泛的因果发现方法中实现了最强的整体性能,突显了在因果发现中使用LLM进行元分析的有效性。

英文摘要

Causal discovery aims to uncover causal structures from observational data, which is crucial for real-world decision-making. However, different causal discovery algorithms can produce divergent results that conflict with each other, complicating the identification of accurate causal graphs. Traditional approaches rely on numerical values and statistical assumptions, often ignoring rich domain-specific information, such as feature descriptions, which could also help structure learning. While recent works explore using Large Language Models (LLMs) to infer causal relations via direct queries, such methods can be unreliable due to a lack of alignment with the actual data. To address these limitations, we propose Causal Ensemble Agent (CEA), a novel framework that aggregates structural insights from statistical discovery experts across different graph levels via linear opinion pooling, and uses an LLM as a meta-referee to dynamically reweight experts when the aggregated confidence is close to the decision boundary, thereby composing an improved and more complete causal graph. Extensive experiments on both synthetic and real-world datasets demonstrate that CEA achieves the strongest overall performance across a wide range of causal discovery methods, highlighting the effectiveness of using LLMs for meta-analysis in causal discovery.
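
Linear opinion pooling with a referee consulted only near the decision boundary might look like the sketch below. The `referee` callback stands in for the LLM meta-referee and is purely hypothetical, as are the `boundary` and `margin` values.

```python
def pool_edge(expert_probs, weights, referee=None, boundary=0.5, margin=0.1):
    """Decide whether a candidate edge exists.

    expert_probs: each expert's probability that the edge is present
    weights:      non-negative expert weights
    referee:      optional callable (expert_probs, weights) -> new weights,
                  a stand-in for the LLM meta-referee (assumption)
    """
    def pooled(w):
        return sum(wi * p for wi, p in zip(w, expert_probs)) / sum(w)

    p = pooled(weights)
    # Only ask the referee when the pooled confidence is ambiguous.
    if referee is not None and abs(p - boundary) < margin:
        p = pooled(referee(expert_probs, weights))
    return p >= boundary, p
```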

2606.10646 2026-06-10 cs.LG cs.CL 交叉投稿

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

推理如何流动?追踪注意力诱导的信息流以实现LLM中的定向强化学习

Zhichen Dong, Yang Li, Yuhan Sun, Weixun Wang, Yijia Luo, Zinian Peng, Taiheng Ye, Chao Yang, Wenbo Su, Yu Cheng, Bo Zheng, Junchi Yan

发表机构 * Shanghai Jiao Tong University(上海交通大学) Alibaba Group(阿里巴巴集团) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

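AI summary Proposes the FlowTracer framework, which traces answer-targeted reasoning flow on an attention-induced directed acyclic graph and assigns token-level credit from the global information-flow structure, improving reinforcement learning for LLMs on reasoning tasks.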

Comments 25 pages, 7 figures, 11 tables. Accepted at ICML 2026

详情
AI中文摘要

Token级信用分配仍然是大型语言模型(LLM)中强化学习(RL)的主要障碍,其中RL配方通常平等对待所有token,未能区分决定性推理步骤与常规格式或流畅填充。最近的研究利用模型内部信号分配更细粒度的信用,但这些往往是点式启发式方法,忽略了信息传播的全局结构。我们提出FlowTracer,一个RL框架,它在注意力诱导的有向无环图上追踪答案导向的推理流,其中节点对应token,边容量来自聚合的注意力权重,并从这种全局结构中推导出token信用。边容量被重新加权,仅保留能够到达答案区域的影响,同时强制执行局部流守恒,使得中间token不会因路径长度或无关分支而损失或获得有效质量。在此图上,FlowTracer提取连接问题与答案的信息流骨干,并通过流吞吐量对token进行评分,揭示调解长距离依赖的高影响枢纽和聚合检查点。这些推导出的重要性用于塑造token级奖励,使学习信号精确聚焦于将信息路由向(或远离)正确答案的token,并在各种推理任务中提供一致的性能提升。

英文摘要

Token-level credit assignment remains a key obstacle for reinforcement learning (RL) in large language models (LLMs), where RL recipes typically treat all tokens equally, failing to distinguish decisive reasoning steps from routine formatting or fluent filler. Recent attempts leverage model-internal signals to assign finer-grained credit, but these are often point-wise heuristics that ignore the global structure of information propagation. We propose FlowTracer, an RL framework that traces answer-targeted reasoning flow on an attention-induced directed acyclic graph, in which nodes correspond to tokens and edge capacities come from aggregated attention weights, and derives token credit from this global structure. The edge capacities are reweighted to retain only the influence that can reach the answer region, while enforcing local flow conservation so that intermediate tokens neither lose nor gain effective mass due to path length or irrelevant branches. On this graph, FlowTracer extracts an information-flow backbone connecting the question to the answer and scores tokens by flow throughput, revealing high-impact hubs and aggregation checkpoints that mediate long-range dependencies. These derived importances are used to shape token-level rewards, enabling learning signals to focus precisely on the tokens that route information toward (or away from) correct answers and delivering consistent performance gains across a range of reasoning tasks.

2606.10768 2026-06-10 cs.LG cs.CL 交叉投稿

N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

N-GRPO:嵌入级邻居混合增强策略优化

Xukun Zhu, Hang Yu, Peng Di, Linchao Zhu

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团)

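AI summary Addresses the exploration trade-off in LLM mathematical reasoning by proposing N-GRPO, which injects diversity at the embedding level through semantic neighbor mixing while preserving semantic consistency, improving policy optimization.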

Comments ACL 2026 Findings. 16 pages, 3 figures. Code: https://github.com/ZJUSCL/N-GRPO

详情
AI中文摘要

大语言模型在数学推理中的成功很大程度上依赖于生成多样化且有效的解题路径。然而,当前的展开技术面临一个基本折衷:token级采样通常产生仅在措辞上不同的冗余轨迹,而利用随机噪声的嵌入级方法则经常破坏语义一致性。为解决此问题,我们引入N-GRPO,一种集成到组相对策略优化(GRPO)框架中的新型探索策略。我们的方法不依赖于token级采样或原生嵌入级噪声,而是利用语义邻居混合机制。该机制通过混合锚点token及其最近语义邻居的嵌入来动态构建输入表示,从而在严格遵循局部语义流形的同时注入多样性。在不同大小的DeepSeek-R1-Distill-Qwen模型上的实验评估表明,N-GRPO不仅在数学推理基准上相比强基线取得一致改进,而且在分布外任务上展现出鲁棒的泛化能力。

英文摘要

The success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level sampling often yields redundant trajectories that differ only in rephrasing, while embedding-level methods utilizing random noise frequently disrupt semantic consistency. To resolve this, we introduce N-GRPO, a novel exploration strategy integrated into the Group Relative Policy Optimization (GRPO) framework. Rather than relying on token-level sampling or native embedding-level noise, our approach leverages Semantic Neighbor Mixing. This mechanism dynamically constructs input representations by mixing the embeddings of an anchor token and its nearest semantic neighbors, thereby injecting diversity while strictly adhering to the local semantic manifold. Experimental evaluations on the DeepSeek-R1-Distill-Qwen models across different sizes show that N-GRPO not only achieves consistent improvements over strong baselines on math reasoning benchmarks but also exhibits robust generalization capabilities on out-of-distribution tasks.
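
Semantic neighbor mixing can be illustrated with a toy embedding table. The sketch assumes cosine similarity for the neighbor search, a uniform average over the `k` nearest neighbors, and a fixed mixing coefficient `lam`; none of these specifics are stated in the abstract.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def neighbor_mix(anchor_id, table, k=2, lam=0.8):
    """Blend the anchor token's embedding with the mean of its k nearest
    neighbors in the embedding table (k >= 1, table has >= 2 rows)."""
    anchor = table[anchor_id]
    others = sorted((i for i in range(len(table)) if i != anchor_id),
                    key=lambda i: cosine(anchor, table[i]),
                    reverse=True)[:k]
    mean = [sum(table[i][d] for i in others) / len(others)
            for d in range(len(anchor))]
    return [lam * a + (1 - lam) * m for a, m in zip(anchor, mean)]
```

Because the perturbation is a convex combination with nearby embeddings rather than random noise, the mixed vector stays close to the anchor's neighborhood, which is the intuition the abstract gives for preserving semantic consistency.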

2606.11119 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

TRACE:一种用于高效智能体强化学习的统一展开预算分配框架

Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji

发表机构 * Tsinghua University(清华大学) Tencent(腾讯)

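AI summary Addresses insufficient reward contrast in multi-turn agentic reinforcement learning by proposing the TRACE framework, which models each ReAct-style thought-action-observation turn as a semantic node and, within a fixed sampling budget, allocates rollouts to prompt roots and intermediate prefixes to strengthen reward contrast and the policy-update signal.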

Comments 32 pages, 12 figures, 6 tables

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)是增强大型语言模型推理和智能体行为的一种有前景的方法。然而,展开密集的策略优化常常受到奖励对比度不足的限制,当过于简单或复杂的提示产生低方差反馈,以及当仅结果奖励对多轮展开中的每个决策赋予相同的终端评估时,就会出现这种情况。过去的努力集中在将可用的展开资源分配给有希望的提示,但它们仅利用提示级别的样本信息性,而忽略了同一展开中不同轮次之间前缀级别信息性的变化。本工作针对多轮智能体强化学习,将每个ReAct风格的思考-行动-观察步骤建模为语义上不同的节点,使得预算分配从提示根扩展到具有进一步延续的轮次级前缀,这自然形成了树状结构的展开。我们引入了树状展开分配用于对比探索(TRACE),这是一个统一的展开分配框架,在固定采样预算内增强奖励对比。在技术上,TRACE将展开预算分配给最可能产生混合终端奖励的提示根和中间前缀。一个共享的通用预测器根据前缀历史估计这些锚点处的条件成功概率,以指导这种分配。由此产生的自适应树状结构丰富了仅结果反馈,并放大了策略更新信号。实验上,TRACE在典型的智能体基准测试中取得了有竞争力的性能和效率提升,例如,在相同采样成本下,Qwen3-14B多跳问答的平均准确率比竞争基线提高了2.8个百分点。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
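
One way to picture the allocation rule: if a predictor gives each anchor (prompt root or prefix) a success probability p, then p(1 - p) is largest where mixed terminal rewards are most likely. The sketch below splits a fixed budget in proportion to that score. Using the Bernoulli variance as the score and largest-remainder rounding are assumptions for illustration, not TRACE's actual rule.

```python
def allocate_rollouts(success_probs, budget):
    """Split an integer rollout budget across anchors by p * (1 - p).

    success_probs: dict anchor -> predicted success probability
    budget:        total number of rollouts available
    """
    scores = {a: p * (1.0 - p) for a, p in success_probs.items()}
    total = sum(scores.values())
    if total == 0:
        return {a: 0 for a in scores}  # every anchor is already decided
    raw = {a: budget * s / total for a, s in scores.items()}
    alloc = {a: int(r) for a, r in raw.items()}
    # Hand out the rounding leftovers to the largest fractional parts.
    leftover = budget - sum(alloc.values())
    for a in sorted(raw, key=lambda a: raw[a] - alloc[a], reverse=True)[:leftover]:
        alloc[a] += 1
    return alloc
```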

2606.11189 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

通过目标分布设计审视监督微调的统一视角

Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen, Cho-Jui Hsieh

发表机构 * University of California, Los Angeles (UCLA)(加州大学洛杉矶分校) Arena

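AI summary Reinterprets supervised fine-tuning as target distribution design, proposing the Q-target framework, which decomposes supervision into how strongly to rely on the observed token and how to allocate probability to alternative tokens, and builds on it the Target-SFT method, which outperforms existing methods on multiple reasoning tasks.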

详情
AI中文摘要

监督微调(SFT)通常最大化示范轨迹中每个token的似然。然而,观测到的token可能非唯一、有噪声或与模型先验不一致。严格拟合这种one-hot目标可能不是最优的,尤其是当预训练模型编码了丰富的知识先验时。在这项工作中,我们将SFT重新解释为目标分布设计:不仅研究损失目标,还分析损失驱动模型匹配的token级目标。我们引入Q-target框架,将SFT监督分解为两个明确的选择:(1) 对观测token的依赖强度,以及(2) 如何将剩余概率质量分配给替代token。这一视角将许多现有的SFT变体统一为目标分布Q的隐式选择。基于这一观点,我们提出Target-SFT,直接从期望的目标分布构建训练目标。该方法在十个推理数据集-模型设置中一致优于现有方法,展示了这种基于目标的方法的有效性。总体而言,我们的公式揭示了SFT训练更基本的设计原则,并为SFT目标开辟了更广阔的搜索空间。

英文摘要

Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT, which constructs the training objective directly from the desired target distribution. This method consistently outperforms existing methods across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.
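
The two design choices can be made concrete with a small sketch. Here `beta` is the mass placed on the observed token, and the remaining mass is spread over alternatives in proportion to the model's own probabilities. That particular allocation, and the names `q_target` and `beta`, are assumptions chosen for illustration; uniform allocation would give ordinary label smoothing instead.

```python
import math

def q_target(observed, model_probs, beta=0.9):
    """Build a token-level target distribution Q.

    observed:    the demonstrated token (must be a key of model_probs)
    model_probs: dict token -> model probability at this position
    beta:        how strongly to rely on the observed token
    """
    rest = {t: p for t, p in model_probs.items() if t != observed}
    z = sum(rest.values())
    if z == 0:
        return {observed: 1.0}
    q = {t: (1 - beta) * p / z for t, p in rest.items()}
    q[observed] = beta
    return q

def cross_entropy(q, model_probs):
    """Loss that drives the model toward Q at this position."""
    return -sum(w * math.log(model_probs[t]) for t, w in q.items() if w > 0)
```

Setting `beta=1.0` recovers the usual one-hot SFT loss, the negative log-likelihood of the observed token.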

2509.25760 2026-06-10 cs.CL cs.AI cs.LG 版本更新

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

TruthRL: 通过强化学习激励诚实的LLM

Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Jingxiang Chen, Mohammad Kachuee, Teja Gollapudi, Yiwei Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong

发表机构 * University of California, Berkeley(加州大学伯克利分校)

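AI summary Proposes the TruthRL framework, which uses GRPO with a ternary reward to directly optimize LLM truthfulness, reducing hallucinations and allowing abstention when uncertain, and substantially improves truthfulness on knowledge-intensive benchmarks.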

Comments ICML 2026. Code: https://github.com/facebookresearch/TruthRL

详情
AI中文摘要

虽然大型语言模型(LLM)在事实性问题回答上表现出色,但它们仍然容易产生幻觉和不真实的回答,特别是当任务需要其参数知识之外的信息时。事实上,诚实性需要的不仅仅是准确性——模型还必须识别不确定性,并在不确定时弃权以避免幻觉。这对现有方法提出了根本性挑战:优化准确性的方法往往会放大幻觉,而鼓励弃权的方法可能变得过于保守,牺牲正确答案。两种极端最终都损害了诚实性。在这项工作中,我们提出了TruthRL,一个通用的强化学习(RL)框架,直接优化LLM的诚实性。具体来说,我们使用GRPO实现TruthRL,并采用一个简单而有效的三值奖励,区分正确答案、幻觉和弃权。它激励模型不仅通过提供正确回答来减少幻觉,还通过在不确定时启用弃权来提高诚实性。在四个知识密集型基准上的大量实验表明,TruthRL显著减少了幻觉(例如,43.5% → 19.4%)并提高了诚实性(例如,5.3% → 37.2%),在各种骨干模型上均有一致的提升。分析表明,TruthRL的改进源于LLM识别其知识边界的能力增强,从而避免了像基线那样过于保守。

英文摘要

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that TruthRL significantly reduces hallucinations (e.g., 43.5% $\rightarrow$ 19.4%) and improves truthfulness (e.g., 5.3% $\rightarrow$ 37.2%), with consistent gains across various backbone models. Analysis shows that the improvement of TruthRL arises from enhanced capability of LLMs to recognize their knowledge boundary, hence avoiding being overly conservative as the baselines are.
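
A ternary reward combined with GRPO-style group normalization can be sketched as below. The specific values (+1, 0, -1) are an assumption, since the abstract only says the reward distinguishes the three outcomes, and the outcome labels are hypothetical strings.

```python
import statistics

def ternary_reward(outcome):
    """Assumed values: correct > abstain > hallucination."""
    return {"correct": 1.0, "abstain": 0.0, "hallucination": -1.0}[outcome]

def group_advantages(outcomes, eps=1e-6):
    """Normalize rewards within one sampled group of responses."""
    rewards = [ternary_reward(o) for o in outcomes]
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With a binary reward, abstentions and hallucinations would receive the same advantage; the middle value is what lets the policy prefer abstaining over guessing wrong.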

2511.02603 2026-06-10 cs.CL 版本更新

CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency

CGES:面向高效准确自一致性的置信引导早停方法

Ehsan Aghazadeh, Ahmad Ghasemi, Hedyeh Beyhaghi, Hossein Pishro-Nik

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

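AI summary Proposes CGES, a Bayesian framework that adaptively stops sampling to reduce the number of self-consistency inference calls, cutting calls by 58% on average across five reasoning benchmarks while staying within 0.4 percentage points of self-consistency's accuracy.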

Comments Extended version. A preliminary version was accepted at the Efficient Reasoning Workshop @ NeurIPS 2025. Code: https://github.com/EhsanAghazadeh/cges

详情
AI中文摘要

大型语言模型(LLMs)在测试时通常被多次查询,并通过多数投票聚合预测。虽然有效,但这种自一致性(Wang et al., 2023)策略需要固定次数的调用,并且在正确答案出现频率较低时失败。我们引入了置信引导早停(CGES),一个贝叶斯框架,它在候选答案上形成后验分布,并一旦某个答案积累了足够的后验质量就自适应地停止采样。我们在理想校准设置和现实有噪置信设置(在方向漂移条件下)下证明了保证。在五个推理基准上平均,CGES将平均调用次数减少了58%(从16.0降至6.7),同时其精度与自一致性相差在0.4个百分点以内。

英文摘要

Large language models (LLMs) are often queried multiple times at test time, with predictions aggregated by majority vote. While effective, this self-consistency (Wang et al., 2023) strategy requires a fixed number of calls and fails when the correct answer is infrequent. We introduce Confidence-Guided Early Stopping (CGES), a Bayesian framework that forms posteriors over candidate answers and adaptively halts sampling once one answer accumulates enough posterior mass. We prove guarantees in both an ideal calibrated regime and a realistic noisy-confidence regime under a directional drift condition. Averaged over five reasoning benchmarks, CGES reduces the number of calls by 58% (from 16.0 to 6.7) while staying within 0.4 percentage points of self-consistency's accuracy.
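
The stopping rule can be sketched with a simplified posterior. The update used here (multiply an answer's weight by its confidence odds, with one unit of prior weight reserved for answers not yet seen) is an assumption made for illustration and is not the update rule from the paper; `draw`, `max_calls`, and `threshold` are hypothetical names.

```python
def cges_sample(draw, max_calls=16, threshold=0.95):
    """Sample until one answer holds enough posterior mass.

    draw: callable returning (answer, confidence in (0, 1)) for one LLM call
    Returns (best_answer, number_of_calls_used).
    """
    weights = {}
    best = None
    for calls in range(1, max_calls + 1):
        answer, conf = draw()
        conf = min(max(conf, 1e-6), 1 - 1e-6)  # keep the odds finite
        weights[answer] = weights.get(answer, 1.0) * conf / (1.0 - conf)
        best = max(weights, key=weights.get)
        # One unit of weight is reserved for answers not yet observed.
        total = sum(weights.values()) + 1.0
        if weights[best] / total >= threshold:
            return best, calls
    return best, max_calls
```

Two confident, agreeing samples are enough to stop in this sketch, whereas uninformative confidences (0.5) never trigger an early stop and the full budget is used.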

2512.02240 2026-06-10 cs.CL 版本更新

Lightweight Latent Reasoning for Narrative Tasks

面向叙事任务的轻量级潜在推理

Alexander Gurung, Esmeralda S. Whitammer, Mirella Lapata

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院) CIFAR Fellow

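AI summary Proposes LiteReason, which uses a lightweight Reasoning Projector to produce continuous latent tokens and switches dynamically between latent and discrete reasoning during reinforcement learning, reducing reasoning length by 77-92% while staying close to non-latent RL performance.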

详情
AI中文摘要

大型语言模型通过生成长思维链或“推理轨迹”来处理复杂任务,这些轨迹在给定查询时作为输出生成的潜在变量。模型生成此类轨迹的能力可以通过强化学习进行优化,以提高其在预测答案中的效用。这种优化带来了高昂的计算成本,尤其是对于涉及检索和处理大量令牌的叙事相关任务。为此,我们提出了LiteReason,一种潜在推理方法,可以与标准令牌采样交错进行,并易于与RL技术结合。LiteReason采用轻量级推理投影器模块,训练生成连续的潜在令牌,帮助模型“跳过”推理步骤。在RL过程中,策略模型决定何时激活投影器,根据需要切换潜在和离散推理。在情节漏洞检测和书籍章节生成上的实验结果表明,我们的方法优于潜在推理基线,并接近匹配非潜在RL训练,同时将最终推理长度减少77-92%。总体而言,LiteReason引导RL训练到性能-计算权衡曲线中更高效的部分。

英文摘要

Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces" that act as latent variables in the generation of an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model 'skip' reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.

2601.21218 2026-06-10 cs.CL 版本更新

Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

参数化知识并非全部:通过检索预训练数据实现诚实的大语言模型

Christopher Adrian Kusuma, Muhammad Reza Qorib, Hwee Tou Ng

发表机构 * Department of Computer Science, National University of Singapore(新加坡国立大学计算机科学系)

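AI summary Addresses hallucination when LLMs lack knowledge by using publicly available pretraining data to build a more robust honesty evaluation benchmark and by designing a method that retrieves pretraining data to improve model honesty.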

Comments Findings of ACL 2026

详情
AI中文摘要

大型语言模型(LLMs)在回答问题方面能力很强,但通常不了解自己的知识边界,即知道什么知道和不知道什么。因此,它们可能在自己知识不足的主题上生成事实上不正确的回答,即所谓的幻觉。与其产生幻觉,语言模型应该更加诚实,在缺乏相关知识时回答“我不知道”。许多方法已被提出以提高LLM的诚实性,但它们的评估缺乏鲁棒性,因为它们没有考虑LLM在预训练期间吸收的知识。在本文中,我们利用Pythia(一个具有公开预训练数据的真正开放LLM)提出了一个更鲁棒的LLM诚实性评估基准数据集。此外,我们还提出了一种利用预训练数据构建更诚实LLM的新方法。

英文摘要

Large language models (LLMs) are highly capable of answering questions, but they are often unaware of their own knowledge boundary, i.e., knowing what they know and what they don't know. As a result, they can generate factually incorrect responses on topics they do not have enough knowledge of, commonly known as hallucination. Rather than hallucinating, a language model should be more honest and respond with "I don't know" when it does not have enough knowledge about a topic. Many methods have been proposed to improve LLM honesty, but their evaluations lack robustness, as they do not take into account the knowledge that the LLM has ingested during its pretraining. In this paper, we propose a more robust evaluation benchmark dataset for LLM honesty by utilizing Pythia, a truly open LLM with publicly available pretraining data. In addition, we also propose a novel method for harnessing the pretraining data to build a more honest LLM.

2601.21543 2026-06-10 cs.CL 版本更新

inversedMixup: Data Augmentation via Inverting Mixed Embeddings

inversedMixup:通过反转混合嵌入进行数据增强

Fanshuang Kong, Richong Zhang, Qiyu Sun, Zhijie Nie, Ting Deng, Chunming Hu

发表机构 * Beihang University(北京航空航天大学)

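AI summary Proposes the inversedMixup framework, which combines the controllability of Mixup with the interpretability of LLM-based generation by aligning embedding spaces so that mixed embeddings can be reconstructed into readable sentences; it provides the first empirical evidence of manifold intrusion in text Mixup and extends to a three-stage data augmentation method effective in few-shot and fully supervised settings.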

详情
AI中文摘要

Mixup 通过以可控比率线性插值输入和标签来生成增强样本。然而,由于它在潜在嵌入层面操作,生成的样本不可解释。相比之下,基于 LLM 的增强方法通过提示在 token 级别生成句子,产生可读输出,但对生成过程的控制有限。受近期 LLM 反转(从嵌入重建自然语言,有助于弥合潜在嵌入空间与离散 token 空间之间的差距)进展的启发,我们提出了 inversedMixup,一个统一框架,结合了 Mixup 的可控性与基于 LLM 的生成的可解释性。具体来说,inversedMixup 将任务特定模型的输出嵌入空间与 LLM 的输入嵌入空间对齐,使得混合嵌入可以在可控混合比率下重建为人类可解释的句子。这种可解释性提供了文本 Mixup 中流形入侵现象的第一个实证证据。在此基础上,我们将 inversedMixup 扩展为三阶段数据增强方法,并引入一种简单而有效的策略来在增强过程中减轻流形入侵。大量实验证明了我们的方法在少样本和全监督场景下的有效性和泛化性。

英文摘要

Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates at the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between latent embedding space and discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup aligns the output embedding space of a task-specific model with the input embedding space of an LLM, so that mixed embeddings can be reconstructed, under a controllable mixing ratio, into human-interpretable sentences. This interpretability provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup. Building on this, we extend inversedMixup into a three-stage data augmentation method, and introduce a simple yet effective strategy to mitigate manifold intrusion during augmentation. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
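
For reference, the standard Mixup interpolation that the abstract starts from is shown below on plain Python lists. This covers only the mixing step; the inversion of mixed embeddings back into sentences is specific to the paper and is not sketched here.

```python
def mixup(x1, y1, x2, y2, lam):
    """Linearly interpolate two examples and their (one-hot) labels.

    lam is the mixing ratio in [0, 1]; lam=1 returns the first example.
    """
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```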

2602.12966 2026-06-10 cs.CL cs.SE 版本更新

ProbeLLM: Automating Principled Diagnosis of LLM Failures

ProbeLLM:自动化的大语言模型故障原则性诊断

Yue Huang, Zhengzhe Jiang, Yuchen Ma, Yu Jiang, Xiangqi Wang, Yujun Zhou, Yuexing Hao, Kehan Guo, Pin-Yu Chen, Stefan Feuerriegel, Xiangliang Zhang

发表机构 * University of Notre Dame(圣母大学) LMU Munich(慕尼黑大学) Massachusetts Institute of Technology(麻省理工学院) IBM Research(IBM研究院)

AI总结 提出ProbeLLM框架,通过分层蒙特卡洛树搜索在全局探索与局部细化间分配预算,结合工具增强生成与验证,将故障发现从孤立案例提升为结构化故障模式,揭示更广泛、清晰、细粒度的故障景观。

详情
AI中文摘要

理解大语言模型(LLM)如何以及为何失败正成为一个核心挑战,因为模型快速演进而静态评估滞后。虽然动态测试生成已实现自动化探测,但现有方法常发现孤立的失败案例,缺乏对探索的原则性控制,且对模型弱点的底层结构洞察有限。我们提出ProbeLLM,一个基准无关的自动化探测框架,将弱点发现从个体失败提升到结构化故障模式。ProbeLLM将探测形式化为分层蒙特卡洛树搜索,在新故障区域的全局探索与重复错误模式的局部细化之间明确分配有限的探测预算。通过将探测限制在可验证的测试用例,并利用工具增强生成与验证,ProbeLLM将故障发现建立在可靠证据之上。发现的失败进一步通过失败感知嵌入和边界感知归纳整合为可解释的故障模式。在多种基准和LLM上,ProbeLLM揭示了比静态基准和先前自动化方法更广泛、更清晰、更细粒度的故障景观,支持从以案例为中心的评估向原则性弱点发现的转变。

英文摘要

Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.

2603.21350 2026-06-10 cs.CL 版本更新

Beyond Memorization: Distinguishing Between Pattern-Based and Epistemic Reasoning in LLMs Using Epistemic Puzzles

超越记忆:使用认知谜题区分LLMs中的模式推理与认知推理

Adi Gabay, Gabriel Stanovsky, Liat Peterfreund

发表机构 * School of Computer Science and Engineering, The Hebrew University of Jerusalem(耶路撒冷希伯来大学计算机科学与工程学院)

AI总结 本文通过设计二维基准谜题,分离叙事熟悉度与推理复杂度,区分LLMs的模式匹配与真实认知推理,发现模型对表面变化鲁棒但难以处理非对称情境。

详情
AI中文摘要

认知推理要求智能体从部分观察和关于其他智能体知识的信息中推断世界状态。先前评估LLMs在认知谜题上的工作通常将失败归因于记忆而非推理。我们认为这种二分法对于较新的模型过于粗糙:记忆是模式推理的一个极限情况,其中模型将任务匹配到熟悉的模板并应用相应的解决方案。我们引入了一个基于DEL风格谜题的二维基准,将叙事熟悉度与推理复杂度分离,从而能够区分模式推理与认知推理。我们发现,模型对表面形式变化的鲁棒性远高于先前研究所示,但在非对称设置中持续表现不佳,其中熟悉的模式不再适用,成功需要跟踪碎片化的认知状态。

英文摘要

Epistemic reasoning requires agents to infer the state of the world from partial observations and information about other agents' knowledge. Prior work evaluating LLMs on epistemic puzzles often frames failures as memorization rather than reasoning. We argue that this dichotomy is too coarse for newer models: memorization is a limiting case of pattern-based reasoning, where a model matches a task to a familiar template and applies the corresponding solution. We introduce a two-dimensional benchmark over DEL-style puzzles, separating narrative familiarity from inference complexity, allowing us to distinguish pattern-based from epistemic reasoning. We find that models are substantially more robust to surface form changes than prior work suggested, yet consistently struggle in asymmetric settings where familiar patterns no longer apply and success requires tracking fragmented epistemic states.

2604.22565 2026-06-10 cs.CL cs.AI 版本更新

Learning Evidence Highlighting for Frozen LLMs

学习为冻结的LLM突出证据

Shaoang Li, Yanhang Shi, Yufei Li, Mingfu Liang, Xiaohan Wei, Yunchen Pu, Fei Tian, Chonglin Sun, Frank Shyu, Luke Simon, Sandeep Pandey, Xi Liu, Jian Li

发表机构 * Stony Brook University(石溪大学) Meta AI

AI总结 提出HiLight框架,通过强化学习训练轻量级Actor在长上下文中插入高亮标签,使冻结的LLM更关注关键证据,无需证据标签或修改求解器,在序列推荐和长上下文问答中提升性能。

详情
AI中文摘要

大型语言模型(LLM)能够很好地推理,但当关键证据埋藏在冗长、嘈杂的上下文中时,常常会错过决定性证据。我们提出了HiLight,一个证据强调框架,它将证据选择与冻结的LLM求解器的推理解耦。HiLight避免压缩或重写输入(这可能会丢弃或扭曲证据),而是训练一个轻量级的强调Actor,在未改变的上下文中的关键跨度周围插入最小的高亮标签。然后,一个冻结的求解器对强调后的输入进行下游推理。我们将高亮视为一个弱监督决策问题,并使用强化学习仅基于求解器的任务奖励来优化Actor,不需要证据标签,也不需要访问或修改求解器。在序列推荐和长上下文问答中,HiLight始终优于强大的基于提示和自动提示优化的基线。学习到的强调策略可以零样本迁移到更小和更大的未见求解器家族,包括基于API的求解器,这表明Actor捕获了真正的、可复用的证据结构,而不是过拟合单个骨干网络。

英文摘要

Large Language Models (LLMs) can reason well, yet often miss decisive evidence when it is buried in long, noisy contexts. We introduce HiLight, an Evidence Emphasis framework that decouples evidence selection from reasoning for frozen LLM solvers. HiLight avoids compressing or rewriting the input, which can discard or distort evidence, by training a lightweight Emphasis Actor to insert minimal highlight tags around pivotal spans in the unaltered context. A frozen Solver then performs downstream reasoning on the emphasized input. We cast highlighting as a weakly supervised decision-making problem and optimize the Actor with reinforcement learning using only the Solver's task reward, requiring no evidence labels and no access to or modification of the Solver. Across sequential recommendation and long-context question answering, HiLight consistently improves performance over strong prompt-based and automated prompt-optimization baselines. The learned emphasis policy transfers zero-shot to both smaller and larger unseen Solver families, including an API-based Solver, suggesting that the Actor captures genuine, reusable evidence structure rather than overfitting to a single backbone.

2509.04027 2026-06-10 cs.AI cs.CL 版本更新

Why Does Reasoning Length Converge? Unveiling the Underfitting-Overfitting Trade-off in Chain-of-Thought

推理长度为何收敛?揭示思维链中的欠拟合与过拟合权衡

Zeyu Gan, Hao Yi, Yong Liu

AI总结 本文提出CoT-Space理论框架,将推理过程从离散的token预测任务重构为连续的推理层面语义空间中的优化过程,证明测试时扩展中最优CoT长度的收敛是欠拟合与过拟合基本权衡的自然结果,并以强化学习实验加以验证。

Comments Preprint Edition

详情
AI中文摘要

测试时扩展,主要通过强化学习(RL)中的多步链式推理(CoT)体现,已成为增强大型语言模型(LLMs)推理能力的关键范式。然而,仍存在显著的理论空白:传统token级分析无法捕捉推理层面扩展的宏观动态。为此,我们引入CoT-Space,一种新的理论框架,将推理过程从离散的token预测任务转换为连续的推理层面语义空间中的优化过程。通过从噪声和风险视角建模推理轨迹,并复兴经典学习理论中的基础原理,我们证明观察到的收敛到最优CoT长度是欠拟合与过拟合基本权衡的自然结果。我们进一步利用RL作为工具,在实验中激发并验证这些结果。我们的发现为通过RL实现内部测试时扩展提供了机制解释,为现代LLMs中优化推理轨迹提供了系统性的理论基础。

英文摘要

Test-time scaling, primarily manifested through multi-step Chain-of-Thought (CoT) reasoning via Reinforcement Learning (RL), has emerged as a pivotal paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, a significant theoretical gap persists: traditional token-level analysis fails to capture the macroscopic dynamics of reasoning-level scaling. To address this, we introduce CoT-Space, a novel theoretical framework that recasts the reasoning process from a discrete token-prediction task to an optimization process within a continuous, reasoning-level semantic space. By modeling the reasoning trajectory from both noise and risk perspectives and revitalizing foundational principles from classical learning theory, we demonstrate that the observed convergence to an optimal CoT length is a natural consequence of the fundamental trade-off between underfitting and overfitting. We further utilize RL as a tool to elicit and verify these results in our experiments. Our findings provide a mechanistic explanation for the internal test-time scaling via RL, offering a principled theoretical foundation to optimize reasoning trajectories in modern LLMs.

2512.06343 2026-06-10 cs.LG cs.AI cs.CL 版本更新

When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

当距离干扰:BT损失中表示距离偏差对奖励模型的影响

Tong Xie, Andrew Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-Jui Hsieh

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 分析BT损失中表示距离导致的梯度偏差,提出NormBT自适应归一化方案,提升奖励模型在细粒度区分上的性能。

Comments ICML 2026

详情
AI中文摘要

奖励模型是RLHF框架中大型语言模型对齐的核心。奖励建模中使用的标准目标是Bradley-Terry(BT)损失,它从由选择和拒绝响应组成的成对数据中学习。在这项工作中,我们分析了BT损失的每个样本梯度,并展示了由于表示距离而产生的虚假学习信号。特别是,BT梯度范数由两个不同的组成部分缩放:(1)预测误差,反映选择和拒绝响应之间预测奖励的差异,以及关键地,(2)在最后一层输出空间中测量的对之间的表示距离。虽然第一项捕获了预期的训练信号,但第二项会显著影响更新幅度并导致学习错位。具体来说,表示距离小的对即使排名错误也经常收到微弱的更新,而距离大的对则收到不成比例的大更新。这导致来自大距离对的梯度掩盖了来自小距离对的梯度,而细粒度区分在小距离对中尤为重要。为了克服这一限制,我们提出了NormBT,一种自适应成对归一化方案,重新缩放更新以平衡表示驱动效应,并将学习信号聚焦于预测误差。NormBT是对BT损失的轻量级、即插即用修改,开销可忽略。在各种LLM骨干和数据集上,NormBT一致地提高了奖励模型性能,在RewardBench的推理类别上取得了超过5%的显著提升,该类别包含大量细粒度对。

英文摘要

Reward models are central to Large Language Model (LLM) alignment within the framework of RLHF. The standard objective used in reward modeling is the Bradley-Terry (BT) loss, which learns from pairwise data consisting of chosen and rejected responses. In this work, we analyze the per-sample gradient of the BT loss and show that it carries spurious learning signals due to representation distance. In particular, the BT gradient norm scales with two distinct components: (1) prediction error, reflected by the difference in predicted rewards between chosen and rejected responses, and critically, (2) representation distance between the pair measured in the output space of the final layer. While the first term captures the intended training signal, the second term can significantly impact the update magnitude and misalign learning. Specifically, pairs with small representation distance often receive vanishingly weak updates, even when misranked, while pairs with large distance receive disproportionately strong updates. As a result, gradients from large-distance pairs overshadow those from small-distance pairs, where fine-grained distinctions are especially important. To overcome this limitation, we propose NormBT, an adaptive pair-wise normalization scheme that rescales updates to balance representation-driven effects and focuses learning signals on prediction error. NormBT is a lightweight, drop-in modification to BT loss with negligible overhead. Across various LLM backbones and datasets, NormBT improves reward model performance consistently, with notable gains of over 5% on the Reasoning category of RewardBench, which contains numerous fine-grained pairs.
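The two-factor structure of the BT gradient described in the abstract can be checked numerically in a simplified setting. The sketch below assumes a linear reward head r(x) = w · h(x) on top of final-layer features h(x); under that assumption the gradient of the BT loss with respect to w has norm equal to (1 − σ(r_chosen − r_rejected)) times the distance ‖h_chosen − h_rejected‖. The function name and the toy vectors are hypothetical, and this is not the paper's NormBT implementation.

```python
import math


def bt_gradient_factors(w, h_chosen, h_rejected):
    """BT loss -log(sigmoid(r_c - r_r)) for an assumed linear head r(x) = w . h(x).

    Returns (loss, gradient norm w.r.t. w, prediction error, representation distance).
    """
    diff = [c - r for c, r in zip(h_chosen, h_rejected)]
    margin = sum(wi * di for wi, di in zip(w, diff))      # r_c - r_r
    p_correct = 1.0 / (1.0 + math.exp(-margin))           # sigmoid(margin)
    loss = -math.log(p_correct)
    error = 1.0 - p_correct                               # prediction-error factor
    grad = [-error * d for d in diff]                     # d(loss)/dw
    grad_norm = math.sqrt(sum(g * g for g in grad))
    distance = math.sqrt(sum(d * d for d in diff))        # representation distance
    return loss, grad_norm, error, distance


w = [0.5, -0.5]
# A misranked pair whose representations are close together.
close = bt_gradient_factors(w, [1.0, 1.0], [1.1, 1.0])
# A correctly ranked pair whose representations are far apart.
far = bt_gradient_factors(w, [3.0, 0.0], [0.0, 3.0])
print("close pair grad norm:", close[1])
print("far pair grad norm:  ", far[1])
```

In this toy case the correctly ranked but distant pair receives the larger update, even though the close pair is the one that is misranked, which is the imbalance the abstract describes.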

2601.03093 2026-06-10 cs.LG cs.CL 版本更新

ATLAS: Verifier-Guided Adaptive Latent Activation Steering for Efficient LLM Reasoning

ATLAS:验证器引导的自适应潜在激活引导用于高效LLM推理

Tuc Nguyen, Thai Le

发表机构 * Indiana University Bloomington(印第安纳大学布卢明顿分校)

AI总结 提出ATLAS框架,通过轻量级验证器动态调整推理时潜在状态引导策略,实现每步自适应控制,在数学和编码推理任务上提升准确率并减少测试时token使用。

Comments 21 pages, 6 figures

详情
AI中文摘要

最近关于激活和潜在引导的研究表明,修改内部表示可以有效引导大型语言模型(LLMs)在不更新模型参数的情况下提高推理和效率。然而,大多数现有方法依赖固定引导策略和静态干预强度,这限制了它们在问题实例上的鲁棒性,并常常导致过度或不足引导。我们提出自适应测试时潜在引导(ATLAS),这是一个轻量级框架,通过训练好的、轻量级验证器在推理时动态控制引导决策。给定中间隐藏状态,验证器预测当前推理的质量,并自适应选择要应用的引导动作,实现每个示例和每个步骤的调整,且开销最小。ATLAS提供了一个统一框架,将学习到的潜在验证与测试时激活引导相结合,无需额外的LLM解码或推理时过程奖励模型调用即可实现自适应推理控制。在多个数学和编码推理基准上的实验表明,ATLAS始终优于普通解码和固定引导基线,在实现更高准确率的同时大幅减少测试时token使用。这些结果表明,验证器引导的潜在适应提供了一种有效且可扩展的机制,可以在不牺牲解决方案质量的情况下控制推理效率。所有源代码将公开提供。

英文摘要

Recent work on activation and latent steering has demonstrated that modifying internal representations can effectively guide large language models (LLMs) toward improved reasoning and efficiency without updating model parameters. However, most existing approaches rely on fixed steering policies and static intervention strengths, which limit their robustness across problem instances and often result in over- or under-steering. We propose Adaptive Test-time Latent Steering (ATLAS), a lightweight framework that dynamically controls steering decisions at inference time using a trained, lightweight verifier over the latent states. Given intermediate hidden states, the verifier predicts the quality of ongoing reasoning and adaptively selects which steering action to apply, enabling per-example and per-step adjustment with minimal overhead. ATLAS provides a unified framework for combining learned latent verification with test-time activation steering, enabling adaptive reasoning control without additional LLM decoding or inference-time process reward model calls. Experiments on multiple mathematical and coding reasoning benchmarks show that ATLAS consistently outperforms both vanilla decoding and fixed steering baselines, achieving higher accuracy while substantially reducing test-time token usage. These results demonstrate that verifier-guided latent adaptation provides an effective and scalable mechanism for controlling reasoning efficiency without sacrificing solution quality. All source code will be publicly available.

2605.11458 2026-06-10 cs.AI cs.CL cs.LO 版本更新

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

自适应教师暴露用于大语言模型推理中的自蒸馏

Zihao Han, Tiangang Zhang, Huaibin Wang, Yilun Sun

发表机构 * ByteDance Douyin(字节跳动抖音)

AI总结 针对自蒸馏中教师暴露完整推理导致学生难以吸收的问题,提出自适应教师暴露方法ATESD,通过轻量Beta策略控制器动态调整暴露比例,并用折扣学习进步奖励优化,在多个模型和数据集上提升推理性能。

Comments 11 pages, 4 figures; code not released yet

详情
AI中文摘要

同策略自蒸馏已成为大语言模型推理的一种强大方法,其中特权教师基于参考解决方案监督学生自身的轨迹。然而,几乎所有此类方法共享的一个设计选择却未被质疑:教师总是看到完整的参考推理。我们认为这一默认设置本身就是问题的一部分,并识别出教师侧暴露不匹配:当教师基于远超学生当前能力的推理进行条件化时,产生的词元目标变得过于强大而难以吸收。一个受控的固定暴露扫描在两个层面上明确了这一点:1)完全暴露并非可靠的最佳选择,2)随着教师看到更多特权推理,学生-教师不匹配单调增长。这促使我们将教师暴露视为一个可学习的训练时控制变量,而非固定的超参数。因此,我们提出了自适应教师暴露用于自蒸馏(ATESD)。ATESD使用一个轻量级的Beta策略控制器对暴露比例进行建模,该控制器以紧凑的训练状态统计为条件,并在学生更新的一个短保持窗口内使用一个采样的暴露。为了使该暴露控制器可学习,我们使用折扣学习进步奖励对其进行优化,该奖励根据每个保留决策对学生未来改进的影响(而非其即时损失变化)进行评分,从而解决了同策略蒸馏导致的延迟信用分配问题。在AIME 24、AIME 25和HMMT 25上,使用Qwen3-{1.7B, 4B, 8B}的实验表明,ATESD持续优于竞争性的自蒸馏和强化学习基线,相比OPSD分别提高了+0.95、+2.05和+2.33个Average@12点,将自适应教师暴露确立为推理自蒸馏的一个有效新方向。

英文摘要

On-policy self-distillation has become a strong recipe for LLM reasoning, where a privileged teacher supervises the student's own rollouts while conditioning on the reference solution. A design choice shared by nearly all such methods, however, has gone unquestioned: the teacher always sees the full reference reasoning. We argue that this default itself is part of the problem and identify a teacher-side exposure mismatch: when the teacher conditions on reasoning far beyond the student's current competence, the resulting token targets become too strong to absorb. A controlled fixed-exposure sweep makes this concrete on two fronts: 1) full exposure is not reliably the best choice, and 2) student-teacher mismatch grows monotonically as the teacher sees more privileged reasoning. This motivates treating teacher exposure not as a fixed hyperparameter but as a learnable training-time control variable. We therefore propose Adaptive Teacher Exposure for Self-Distillation (ATESD). ATESD models the reveal ratio with a lightweight Beta-policy controller conditioned on compact training-state statistics, and uses one sampled exposure for a short hold window of student updates. To make this exposure controller learnable, we optimize it with a discounted learning-progress reward that scores each held decision by its effect on the student's future improvement rather than its immediate loss change, addressing the delayed credit assignment induced by on-policy distillation. Experiments on AIME 24, AIME 25, and HMMT 25 across Qwen3-{1.7B, 4B, 8B} show that ATESD consistently outperforms competitive self-distillation and RL baselines, improving over OPSD by +0.95, +2.05, and +2.33 Average@12 points respectively, and establishing adaptive teacher exposure as an effective new axis for reasoning self-distillation.
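The control loop described in the abstract (sample a reveal ratio from a Beta policy, then hold it for a short window of student updates) might look roughly like the sketch below. Everything here is illustrative: the function names, the fixed `alpha` and `beta` values, and the choice to reveal a prefix of the reference reasoning are assumptions, since the abstract does not specify how the controller is conditioned on training statistics or how partial exposure is constructed.

```python
import random


def sample_reveal_ratio(alpha, beta, rng):
    """Draw a reveal ratio in [0, 1] from a Beta(alpha, beta) policy."""
    return rng.betavariate(alpha, beta)


def reveal_prefix(reference_steps, ratio):
    """Assumed exposure scheme: show the teacher only a leading fraction of the steps."""
    n_visible = round(ratio * len(reference_steps))
    return reference_steps[:n_visible]


def exposure_schedule(num_updates, hold_window, alpha, beta, rng):
    """Resample the reveal ratio once per hold window and reuse it in between."""
    schedule = []
    ratio = None
    for step in range(num_updates):
        if step % hold_window == 0:
            ratio = sample_reveal_ratio(alpha, beta, rng)
        schedule.append(ratio)
    return schedule


rng = random.Random(0)
schedule = exposure_schedule(num_updates=6, hold_window=3, alpha=2.0, beta=2.0, rng=rng)
steps = ["s1", "s2", "s3", "s4"]
print([len(reveal_prefix(steps, r)) for r in schedule])
```

In a learned controller the Beta parameters would be produced from training-state statistics and updated from a reward signal; both are left out of this sketch.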

2. 机器翻译与跨语言处理 3 篇

2606.10113 2026-06-10 cs.CL cs.AI 新提交

Emotion Profiling in LLM-Based Literary Translation: Systematic Shifts Across MT and Post-Editing

基于LLM的文学翻译中的情感特征:机器翻译与译后编辑的系统性转变

Antonio Castaldo, Johanna Monti, Sheila Castilho

AI总结 研究LLM翻译的情感特征及译后编辑如何使其接近人类翻译,通过对比《Oryx and Crake》的LLM翻译、译后编辑版本和人类翻译,发现MT系统引入特定情感指纹,削弱作者声音。

详情
AI中文摘要

本文研究LLM翻译是否表现出可识别的情感特征,以及译后编辑如何将其重塑为更接近人类的标准。我们比较了玛格丽特·阿特伍德《Oryx and Crake》的LLM翻译及其译后编辑版本和人类翻译,以当代意大利科幻小说的大规模语料库为基线。通过基于词典和多语言建模的方法,我们对不同系统的情感变化进行了细粒度分析。我们发现,机器翻译系统在翻译中引入了特定模型且统计显著的情感指纹,导致作者声音的保留有限。

英文摘要

This paper investigates whether LLM translations exhibit identifiable emotional profiles and how post-editing reshapes them toward human-like norms. We compare LLM translations of Margaret Atwood's Oryx and Crake with their post-edited versions and a human translation, using a large-scale corpus of contemporary Italian science-fiction as a baseline. We examine emotion through lexicon-based and multilingual modeling, conducting a fine-grained analysis of emotional variation across systems. We find that MT systems introduce model-specific and statistically significant emotional fingerprints across translations, leading to a limited preservation of an author's voice.

2606.11009 2026-06-10 cs.CL cs.CY 新提交

Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions

谁把复活节彩蛋带到了开斋节?跨语言和地区数学应用题的文化翻译审计

Parisa Suchdev, Juniper Lovato

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 本研究审计了三个大型语言模型将60个英语数学应用题翻译为7种语言时的文化适应性,发现模型在62.5%的案例中一致,但仅33.5%有相同替换,且所有组合均出现熵塌缩,优先改变表面标记而保留深层结构,导致文化多样性压缩和区域误归因。

Comments 17 pages total with references and appendix, 9 figures, under review

详情
AI中文摘要

大型语言模型越来越多地被用于大规模个性化学习中改编数学应用题,但这些改编是否跨模型一致、是否在规模上保留文化多样性、以及揭示模型认为哪些文化实体最显著,仍是未解决的问题。我们分析了Claude Opus 4、GPT-4.1和Gemini 2.5 Pro如何将60个英语数学应用题改编为孟加拉语、印地语、旁遮普语(印度)、乌尔都语、信德语(巴基斯坦)、意大利语和西西里语(意大利),这一语言集涵盖了从高资源语言(意大利语和印地语)到研究不足的语言(信德语、西西里语和旁遮普语)的完整资源谱系。我们标注了6,489个实体转换,编码模型是否保留、本地化、泛化、省略或更改名称、食物和地点等实体。模型在62.5%的案例中在转换类型上一致,在特定替换上仅33.5%一致,这意味着模型选择直接塑造了学生遇到的文化世界。所有21种语言-模型组合均出现熵塌缩,改编压缩而非扩展了文化多样性。模型优先处理表面标记(如名称、食物和货币),同时保留更深层的结构特征(如嵌入特定文化假设的年级系统)。尽管提示指定了目标国家,模型仍错误归因区域背景,例如对印度孟加拉语学生使用孟加拉国塔卡,并产生跨文化污染,例如将寻蛋活动改编为开斋节活动。某些失败在单个翻译中可见。其他失败,包括多样性塌缩、对表面标记的系统性偏好以及一致的区域误归因,仅通过语料库级分析才显现。使改编问题看起来正确的表面合理性,正是使深层失败容易被忽视的原因。

英文摘要

Large language models are increasingly used to adapt math word problems for personalized learning at scale, but it remains an open question whether those adaptations are consistent across models, preserve cultural diversity at scale, and reveal which cultural entities models treat as most salient. We analyze how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into Bengali, Hindi, Punjabi (India), Urdu, Sindhi (Pakistan), Italian, and Sicilian (Italy), a language set spanning the full resource spectrum, from high-resource Italian and Hindi to under-studied Sindhi, Sicilian, and Punjabi. We annotate 6,489 entity transformations, coding whether models preserve, localize, generalize, omit, or change entities such as names, foods, and places. Models agree on transformation type in 62.5% of cases and on specific substitutions in only 33.5%, meaning model choice directly shapes which cultural world students encounter. All 21 language-model combinations show entropy collapse, with adaptation compressing rather than expanding cultural diversity. Models prioritize surface markers such as names, foods, and currencies while preserving deeper structural features such as grade-level systems that embed culturally specific assumptions. Despite prompts specifying target countries, models misattribute regional context by using Bangladeshi taka for Indian Bengali students and produce cross-cultural contamination, such as adapting egg hunts as Eid activities. Some failures are visible in individual translations. Others, including diversity collapse, systematic preference for surface markers, and consistent regional misattribution, emerge only through corpus-level analysis. The surface plausibility that makes adapted problems look correct is precisely what makes deeper failures easy to overlook.

2606.07422 2026-06-10 cs.CL cs.AI 版本更新

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

掩蔽优势:揭示LLMs中本地语言对文化知识的访问

Yang Zhang, Xiao Fei, Amr Mohamed, Sarah Almeida Carneiro, Mersin Konomi, Mingmeng Geng, Ahmed Asaad, Guokan Shang, Michalis Vazirgiannis

发表机构 * Ecole Polytechnique(巴黎综合理工学院) MBZUAI(穆罕默德·本·扎耶德人工智能大学) ENS-PSL(巴黎高等师范学院,巴黎文理研究大学) Durham University(杜伦大学)

AI总结 通过控制实验和项目反应理论模型,分离语言能力与文化知识访问,发现本地语言在文化知识访问上具有优势,但常被语言能力不足掩盖。

详情
AI中文摘要

大型语言模型越来越多地被用于跨语言回答文化相关问题,但目前尚不清楚本地文化知识是通过英语还是本地语言更容易获取。现有评估面临两个关键限制:许多评估依赖于可能无法反映文化知识自然出现的平行模板问题,并且原始准确率混淆了通用语言能力与语言条件知识访问。我们通过一个基于从区域基准和本地来源收集的真实世界文化问题的受控框架来解决这些问题。通过交叉问题类型(文化无关 vs. 文化特定)与查询语言(英语 vs. 本地语言),并使用共享的1PL项目反应理论模型估计能力,我们将语言能力与本地化知识访问分离。在13个地区和大约80个模型上,我们发现文化无关问题上存在一致的英语优势,表明更强的英语能力。然而,在考虑了这种能力差距后,本地语言在几乎所有地区-模型设置中都显示出积极的知识访问优势。这种优势在原始准确率中通常被掩盖,但在前沿、区域对齐或语言适应模型中变得更加明显。我们的结果表明,较弱的本地语言表现并不一定意味着较弱的文化知识;相反,本地文化知识可能通过本地语言更容易访问,但被有限的语言能力所隐藏。

英文摘要

Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local cultural knowledge is better accessed through English or the local language. Existing evaluations face two key limitations: many rely on parallel template-based questions that may not reflect how cultural knowledge naturally appears, and raw accuracy conflates general language proficiency with language-conditioned knowledge access. We address these issues with a controlled framework built on real-world cultural questions collected from regional benchmarks and local sources. By crossing question type (culture-agnostic vs. culture-specific) with query language (English vs. local language), and estimating ability with a shared 1PL item response theory model, we separate proficiency from localized knowledge access. Across 13 locales and roughly 80 models, we find a consistent English advantage on culture-agnostic questions, indicating stronger English proficiency. However, after accounting for this proficiency gap, local languages show a positive knowledge-access advantage in nearly all locale-model settings. This advantage is often masked in raw accuracy but becomes more visible for frontier, regionally aligned, or language-adapted models. Our results suggest that weaker local-language performance does not necessarily imply weaker cultural knowledge; rather, local cultural knowledge may be more accessible through the local language but hidden by limited language proficiency.
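For readers unfamiliar with the 1PL (Rasch) item response model mentioned in the abstract, its response function can be written in a few lines. This is only the textbook form, in which the probability of a correct answer depends on the gap between a model's ability and an item's difficulty; how the paper shares the model across question types and query languages is not specified in the abstract and is not reproduced here. The function name is hypothetical.

```python
import math


def p_correct_1pl(ability, difficulty):
    """1PL (Rasch) response function: sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Equal ability and difficulty gives an even chance of a correct answer.
print(p_correct_1pl(0.0, 0.0))
# Higher ability on the same item raises the probability.
print(p_correct_1pl(1.0, 0.0))
```

Because ability and difficulty sit on one shared scale, performance on culture-agnostic items can serve as a proficiency reference when comparing query languages, which is the kind of separation the abstract describes.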

3. 信息抽取、检索与问答 12 篇

2606.10471 2026-06-10 cs.CL cs.AI 新提交

Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks

使用递归神经张量网络检测生物医学文本中的推测性语言

Dhruv Dixit

发表机构 * Stevens Institute of Technology(史蒂文斯理工学院)

AI总结 利用分布式句子表示和深度学习技术,提出递归神经张量网络(RNTN)用于自动检测生物医学文献中的推测性语言,性能略优于线性二元语法(bigram)SVM(F1=0.885 vs 0.881)。

Comments 12 Pages

详情
AI中文摘要

在本研究中,我们利用分布式句子表示和先进的深度学习技术,深入探讨了生物医学文章中推测性语言的自动检测。这种识别的意义延伸至信息检索、多文档摘要以及新知识的探索。我们的探索涵盖了两种获取分布式句子表示的不同方法:段落向量模型和递归神经张量网络。然后,将这些方法与三种基础基线算法进行严格比较:支持向量机、朴素贝叶斯和模式匹配。我们的发现表明,递归神经张量网络(RNTN)的性能(F1=0.885)略优于表现最佳的基线,即线性二元语法(bigram)SVM(F1=0.881)。同时,段落向量模型即使在使用大规模未标记数据集进行广泛训练后,效果也较差(F1=0.368)。我们对影响这些性能差异的因素进行了全面讨论,并为未来的研究方向提供了有见地的建议。

英文摘要

In this investigation, we delve into the automated detection of speculative language within biomedical articles by utilizing distributed sentence representations and advanced deep learning techniques. The implications of such identification extend to information retrieval, multi-document summarization, and the exploration of new knowledge. Our exploration encompasses two distinct approaches for acquiring distributed sentence representations: the Paragraph Vector model and the Recursive Neural Tensor Network. These methodologies are then rigorously compared against three foundational baseline algorithms: Support Vector Machines, Naive Bayes, and pattern matching. Our findings reveal that the Recursive Neural Tensor Network (RNTN) demonstrates a slight performance edge (F1 = 0.885) over the top-performing baseline, the linear bigram SVM (F1 = 0.881). Meanwhile, the Paragraph Vector model proves less effective (F1 = 0.368), even after extensive training using an expansive, unlabeled dataset. We engage in a comprehensive discourse on the factors influencing these performance disparities and provide insightful recommendations for future research directions.

2606.10842 2026-06-10 cs.CL cs.IR 新提交

ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval

ConvMemory v2: 一种保留召回率的前10证据重排序器用于对话记忆检索

Taiheng Pan

发表机构 * School of Computing and Information Systems, University of Melbourne(墨尔本大学计算与信息系统学院)

AI总结 提出ConvMemory v2,一种轻量级重排序器,在保留v1的Recall@10前提下,通过微调交叉编码器提升MRR和H@1,并分析其机制。

Comments 19 pages, 3 figures. Single-author technical report. Extends arXiv:2605.28062 (ConvMemory v1). Code and checkpoint: github.com/pth2002/ConvMemory

详情
AI中文摘要

我们描述了ConvMemory v2,一种可选的token证据重排序器,位于轻量级ConvMemory v1重排序器之后,仅对v1保护的前10候选集进行重排序。v2是一个微调的ms-marco-MiniLM-L-6-v2交叉编码器(22,713,601个参数,从发布的检查点测量),应用于v1已经选择的十个(查询,记忆)对;它不改变返回的十个记忆,因此Recall@10和Hit@10与v1相同,这是构造决定的,而非统计巧合。在LoCoMo对话记忆基准测试(5个种子,n = 4955个测试行)上,v2将FULL MRR从v1的0.5824提升到0.6560(配对bootstrap +0.0734,95% CI [+0.0645, +0.0827]),H@1从0.4440提升到0.5474。v2缩小了与更昂贵的全池交叉编码器参考(mxbai-rerank-large-v1在前500个上,MRR 0.6688)的大部分差距但未完全消除:在FULL MRR上,v2比mxbai_top500低0.013,但在两个raw-dense-hard切片上(v1保护的前10个比mxbai自己的前10个具有更高的召回率),v2超过了mxbai_top500。一项四臂负载消融实验表明,候选特定的记忆文本是机制:移除、打乱或替换它会使MRR崩溃到低于原始稠密检索。v2最好被理解为一种标准的保留召回率的级联模式,具有LoCoMo特定的微调、显式的抗捷径推理契约和严谨的负载分析;其相对于mxbai的优势是切片特定的,而非一般的优势声明。本报告扩展了v1技术报告(arXiv:2605.28062)。

英文摘要

We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1's protected top-10 candidate set. v2 is a fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder (22,713,601 parameters, measured from the released checkpoint) applied to the ten (query, memory) pairs that v1 has already selected; it does not change which ten memories are returned, so Recall@10 and Hit@10 are identical to v1 by construction, not by statistical coincidence. On the LoCoMo conversational memory benchmark (5 seeds, n = 4955 test rows), v2 raises FULL MRR from v1's 0.5824 to 0.6560 (paired bootstrap +0.0734, 95% CI [+0.0645, +0.0827]) and H@1 from 0.4440 to 0.5474. v2 closes most but not all of the gap to a much more expensive full-pool cross-encoder reference (mxbai-rerank-large-v1 over the top-500, MRR 0.6688): on FULL MRR v2 sits 0.013 below mxbai_top500, but on two raw-dense-hard slices (where v1's protected top-10 has higher recall than mxbai's own top-10) v2 exceeds mxbai_top500. A four-arm load-bearing ablation shows candidate-specific memory text is the mechanism: removing, shuffling, or replacing it collapses MRR below raw dense retrieval. v2 is best understood as a standard recall-preserving cascade pattern with LoCoMo-specific fine-tuning, an explicit anti-shortcut inference contract, and disciplined load-bearing analysis; its advantage over mxbai is slice-specific rather than a general dominance claim. This report extends the v1 technical report (arXiv:2605.28062).
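The claim that Recall@10 is unchanged "by construction" follows from the fact that reordering a fixed top-10 set cannot change its membership, while rank-sensitive metrics such as MRR can still move. The sketch below demonstrates this on made-up data; the memory IDs and reranker scores are hypothetical and the scoring model itself is not shown.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant items that appear in the top k."""
    top = set(ranked_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0


def rerank_top_k(ranked_ids, scores, k=10):
    """Reorder only the first k items by score; the tail is left untouched."""
    head = sorted(ranked_ids[:k], key=lambda i: scores[i], reverse=True)
    return head + ranked_ids[k:]


first_stage = [f"m{i}" for i in range(12)]   # hypothetical first-stage ranking
relevant = {"m4"}                            # hypothetical gold memory
scores = {m: 0.0 for m in first_stage}       # hypothetical reranker scores
scores["m4"] = 1.0

reranked = rerank_top_k(first_stage, scores, k=10)
print(recall_at_k(first_stage, relevant), recall_at_k(reranked, relevant))
print(reciprocal_rank(first_stage, relevant), reciprocal_rank(reranked, relevant))
```

Averaging `reciprocal_rank` over queries gives MRR, so a second-stage reranker of this shape can only affect ordering metrics (MRR, H@1) and never Recall@10 or Hit@10.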

2606.10921 2026-06-10 cs.CL 新提交

Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

仅追踪所需:面向长文档问答的结构感知按需超图记忆

Xiangjun Zai, Xingyu Tan, Chen Chen, Xiaoyang Wang, Wenjie Zhang

发表机构 * University of New South Wales(新南威尔士大学) CSIRO(澳大利亚联邦科学与工业研究组织) University of Wollongong(伍伦贡大学)

AI总结 提出DocTrace,一种多智能体RAG框架,通过查询触发的知识组织、文档结构感知和经验引导推理,解决长文档问答中知识组织成本高、结构利用不足和推理经验无法复用的问题,在三个数据集上取得最佳性能。

详情
AI中文摘要

长文档问答需要大型语言模型对散布在长文档中的证据进行推理,答案通常依赖于事件顺序、章节级上下文和跨部分证据连接。尽管检索增强生成通过检索相关证据减少了输入上下文,但现有的结构化RAG方法仍面临三个限制:代价高昂的查询无关知识组织、对原始文档结构利用不足以及无法复用历史推理经验。为解决这些限制,我们提出了DocTrace,一个用于长文档问答的多智能体RAG框架,支持查询触发的知识组织、文档结构感知和经验引导推理。DocTrace通过轻量级文档结构树索引保留文档层次结构,在推理过程中按需构建智能体共享的超图结构工作记忆,并将成功的推理计划存储在图结构经验记忆中以便未来复用,从而实现对相关长文档问题的自适应探索。在四个长文档问答数据集上的实验表明,DocTrace在其中三个数据集上取得了最佳性能,在F1和EM上比最强基线ComoRAG最多分别高出8.85%和4.40%,同时将总体计算成本降低了53.32%。

英文摘要

Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. Although retrieval-augmented generation (RAG) reduces the input context by retrieving relevant evidence, existing structured RAG methods still face three limitations: costly query-agnostic knowledge organization, insufficient use of original document structure, and no reuse of historical reasoning experience. To address these limitations, we propose DocTrace, a multi-agent RAG framework for long-document QA that supports query-triggered knowledge organization, document-structure awareness, and experience-guided reasoning. DocTrace preserves document hierarchy with a lightweight document structural tree index, constructs agent-shared hypergraph-structured working memory on demand during reasoning, and stores successful reasoning plans in graph-structured experience memory for future reuse, enabling adaptive exploration across related long-document questions. Experiments on four long-document QA datasets show that DocTrace achieves the best performance on three datasets, surpassing the strongest baseline, ComoRAG, by up to 8.85% in F1 and 4.40% in EM, while reducing the overall computational cost by 53.32%.

2606.10381 2026-06-10 hep-ex cs.AI cs.CL cs.IR physics.ins-det 交叉投稿

Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

基于证据的缪子对撞机分析的智能混合RAG

Ruobing Jiang, Dawei Fu, Cheng Jiang, Tianyi Yang, Zijian Wang, Youpeng Wu, Yong Ban, Yajun Mao, Qiang Li

发表机构 * Peking University(北京大学)

AI总结 提出智能混合RAG框架,结合稀疏与稠密检索及智能推理,用于缪子对撞机研究的证据检索与答案生成,构建首个基准并验证其有效性。

Comments 22 pages, 5 figures, and 6 tables

详情
AI中文摘要

缪子对撞机研究涵盖加速器物理、探测器仪器和高能现象学,相关证据分散在快速扩展且异构的科学文献中。随着高能物理(HEP)越来越多地探索智能辅助分析工作流,高效定位、整合和验证科学证据成为关键能力。虽然检索增强生成(RAG)为科学问答提供了有前景的框架,但在不牺牲检索精度的情况下整合智能推理仍是一个关键挑战。在这项工作中,我们提出了智能混合RAG,一个基于证据的RAG框架,用于缪子对撞机研究。该框架结合了混合检索器(集成稀疏词汇和稠密语义检索)与智能推理模块,用于查询分解、证据扩展和基于证据的答案生成。为了进行系统评估,我们构建了缪子对撞机领域首个检索增强科学问答基准,包括一个精选文献语料库以及涵盖主要探测器和物理研究主题的专用检索和答案生成基准。广泛评估表明,混合检索提供了最强的检索基础,而智能推理在受控证据扩展和答案合成方面最为有效。基于这一原则,智能混合RAG在检索效果、答案质量、证据覆盖和事实基础方面始终优于代表性的检索和RAG基线。该基准和框架共同为基于证据的科学问答以及未来在大规模科学文献上运行的HEP分析智能体奠定了基础。

英文摘要

Muon collider research spans accelerator physics, detector instrumentation, and high-energy phenomenology, with relevant evidence scattered across a rapidly expanding and heterogeneous body of scientific literature. As high-energy physics (HEP) increasingly explores agent-assisted analysis workflows, efficiently locating, integrating, and verifying scientific evidence becomes an essential capability. While retrieval-augmented generation (RAG) offers a promising framework for scientific question answering, integrating agentic reasoning without compromising retrieval precision remains a key challenge. In this work, we present agentic hybrid RAG, an evidence-grounded RAG framework for muon collider research. The framework combines a hybrid retriever, integrating sparse lexical and dense semantic retrieval, with an agentic reasoning module for query decomposition, evidence expansion, and grounded answer generation. To enable systematic evaluation, we construct the first benchmark for retrieval-augmented scientific question answering in the muon collider domain, comprising a curated literature corpus together with dedicated retrieval and answer-generation benchmarks covering major detector and physics research topics. Extensive evaluation shows that hybrid retrieval provides the strongest retrieval backbone, while agentic reasoning is most effective for controlled evidence expansion and answer synthesis. Built on this principle, agentic hybrid RAG consistently outperforms representative retrieval and RAG baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding. Together, the benchmark and framework provide a foundation for evidence-grounded scientific question answering and future HEP analysis agents operating over large-scale scientific literature.
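The abstract says the hybrid retriever integrates sparse lexical and dense semantic retrieval but does not state how the two result lists are combined. As one common possibility, the sketch below uses reciprocal rank fusion (RRF), which is a swapped-in technique and not necessarily what the authors use; the document IDs, the constant `k=60`, and the function name are assumptions for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse_hits = ["d1", "d2", "d3"]   # hypothetical lexical (e.g. BM25-style) ranking
dense_hits = ["d3", "d1", "d4"]    # hypothetical embedding-based ranking
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever are still kept, which is the usual motivation for a hybrid backbone.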

2606.11023 2026-06-10 cs.IR cs.CL cs.LG 交叉投稿

Generative Archetype-Grounded Item Representations for Sequential Recommendation

生成式原型驱动的物品表示用于序列推荐

Yifan Li, Jiahong Liu, Xinni Zhang, Hao Chen, Yankai Chen, Wenhao Yu, Jianting Chen, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) McGill University(麦吉尔大学) Tongji University(同济大学)

AI总结 提出GenAIR框架,利用大语言模型生成物品原型描述并提取嵌入,结合行为校准目标弥合语义与行为差距,显著提升序列推荐性能。

Comments Accepted by WWW 2026 (Oral)

AI中文摘要

序列推荐旨在通过分析用户的历史行为来预测用户与物品的下一次交互。然而,物品表示的质量有限仍然是一个关键瓶颈。虽然预训练的大语言模型(LLM)可以提供丰富的语义表示,但现有方法仅依赖于固定属性的静态编码,忽视了目标受众在定义物品身份中的关键作用。此外,语义空间难以反映实际用户行为,导致语义表示与行为模式之间存在显著差距。为了解决这些局限性,我们提出了GenAIR,一个通用框架,通过生成式原型驱动的物品表示来增强序列推荐。具体来说,我们首先利用LLM分析物品元数据并推断原型的文本描述,该原型代表物品理想目标受众的概念轮廓。然后,我们在一次前向传播中提取相应的嵌入。此外,为了使这些生成式原型扎根于现实世界的行为,我们引入了一个行为校准目标,该目标明确地整合了来自实际交互的行为信号。该目标调整嵌入空间的结构以反映经验模式。GenAIR能够与大多数现有模型无缝集成,同时保持高效率。在三个真实世界数据集上进行的全面实验表明,GenAIR显著提高了各种序列推荐模型的性能,并始终优于最先进的基线方法。实现代码可在 https://github.com/AI-Santiago/GenAIR 获取。

英文摘要

Sequential recommendation aims to predict users' next interaction with items by analyzing their historical behavior. However, the limited quality of item representations remains a critical bottleneck. While pre-trained large language models (LLMs) can provide rich semantic representations, existing approaches rely only on static encoding of fixed attributes, overlooking the crucial role of target audiences in defining item identity. Moreover, the semantic space struggles to reflect actual user behavior, resulting in a significant gap between semantic representations and behavioral patterns. To address these limitations, we propose GenAIR, a general framework that empowers sequential recommendation with Generative Archetype-grounded Item Representations. Specifically, we first leverage an LLM to analyze item metadata and infer a textual description of the Archetype, which represents the conceptual profile of the item's ideal target audience. We then extract the corresponding embeddings in a single forward pass. Further, to ground these generative archetypes in real-world behavior, we introduce a behavioral calibration objective, which explicitly incorporates behavioral signals from actual interactions. This objective adjusts the structure of the embedding space to reflect empirical patterns. GenAIR integrates seamlessly with most existing models while maintaining high efficiency. Comprehensive experiments conducted on three real-world datasets demonstrate that GenAIR significantly improves the performance of various sequential recommendation models and consistently outperforms state-of-the-art baseline approaches. The implementation code is available at https://github.com/AI-Santiago/GenAIR.

2406.14075 2026-06-10 cs.CL 版本更新

EXCEEDS: Extracting Complex Events via Nugget-based Grid Modeling in Scientific Domain

EXCEEDS: 通过基于线索块的网格建模在科学领域中提取复杂事件

Yi-Fan Lu, Xian-Ling Mao, Bo Wang, Xiao Liu, Heyan Huang

发表机构 * Beijing Institute of Technology(北京理工大学) Microsoft Research Asia(微软亚洲研究院)

AI总结 针对科学领域事件密集、信息形式复杂的特点,构建大规模多事件文档级数据集SciEvents,并提出端到端框架EXCEEDS,将密集线索块编码为网格矩阵,简化复杂事件提取为基于线索块的网格建模任务,取得最优性能。

Comments Accepted by ACL 2026 Main Conference, Oral

AI中文摘要

通过事件理解特定领域至关重要。在新闻、金融和生物学等多个领域已经进行了广泛的事件提取研究。然而,科学领域的事件提取仍然缺乏全面的数据集和定制方法的支持。与其他领域相比,科学领域有两个特点:(1)更密集的线索块和事件,(2)更复杂的信息形式。为解决上述问题,考虑到这两个特点,我们首先构建了SciEvents,一个大规模的多事件文档级数据集,其模式针对科学领域定制。它包含2,508篇文档和24,381个事件,经过多阶段人工标注和质量控制。然后,我们提出了EXCEEDS,一个端到端的科学事件提取框架,通过将密集线索块编码为网格矩阵,并将复杂事件提取简化为基于线索块的网格建模任务。在SciEvents上的实验表明,EXCEEDS达到了最先进的性能。SciEvents数据集和EXCEEDS框架均已公开发布,以促进未来的研究。

英文摘要

Understanding a specific domain through its events is crucial. Extensive event extraction research has been conducted in many domains such as news, finance, and biology. However, event extraction in the scientific domain is still insufficiently supported by comprehensive datasets and tailored methods. Compared with other domains, the scientific domain has two characteristics: (1) denser nuggets and events, and (2) more complex information forms. To address this gap while accounting for these two characteristics, we first construct SciEvents, a large-scale multi-event document-level dataset with a schema tailored for the scientific domain. It consists of 2,508 documents and 24,381 events produced under multi-stage manual annotation and quality control. Then, we propose EXCEEDS, an end-to-end scientific event extraction framework that encodes dense nuggets into a grid matrix and simplifies complex event extraction into a nugget-based grid modeling task. Experiments on SciEvents demonstrate the state-of-the-art performance of EXCEEDS. Both the SciEvents dataset and the EXCEEDS framework are released publicly to facilitate future research.

2510.08622 2026-06-10 cs.CL cs.SE 版本更新

Automated Alignment between Elicitation Interviews and Requirements

需求获取访谈与需求之间的自动对齐

Francesco Dente, Fabiano Dalpiaz, Paolo Papotti

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 提出将访谈转录与用户故事需求自动对齐的任务,定义忠实度和覆盖率两个度量,利用大语言模型和嵌入模型实现自动评估,在四个数据集上达到0.86 macro-F1。

Comments 8 pages

AI中文摘要

软件需求来源于多种需求获取技术,其中许多具有对话性质,如访谈。然而,评估这些衍生需求是否忠实反映利益相关者的需求仍然是一项具有挑战性的手工任务。在本文中,我们形式化了将访谈转录与以用户故事表示的需求集合对齐的任务。我们提出了两种启发式对齐度量,称为(i)需求忠实度:转录支持的故事比例,以及(ii)访谈覆盖率:至少被一个故事支持的转录比例。然后,我们使用大语言模型和嵌入模型进行实验,评估它们自动计算这些度量的能力。在四个数据集上的实验表明,基于LLM的解决方案在手动标注的块-故事对上达到了0.86的宏F1分数。我们还展示了如何将嵌入模型用作分块器(blocker),使方法更具可扩展性。这项工作为更多关于连接对话制品与需求的研究铺平了道路。形式化框架和自动匹配技术是基本组件,可用于新兴任务,如将需求追溯到访谈以及从对话生成需求。

英文摘要

Software requirements are derived from a variety of elicitation techniques, many of which have a conversational nature, like interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders' needs remains a challenging manual task. In this paper, we formalize the task of aligning the transcript of an interview with a collection of requirements represented as user stories. We propose two heuristic metrics for alignment, called (i) requirements faithfulness: the proportion of stories supported by the transcript, and (ii) interview coverage: the proportion of transcript supported by at least one story. Then, we run experiments with large language models and embedding models to assess how well they can evaluate these metrics automatically. Experiments over four datasets show that an LLM-based solution achieves 0.86 macro-F1 on manually labeled chunk-story pairs. We also show how embedding models can be used as blockers to make the approach more scalable. This work paves the way for more research on linking conversational artifacts with requirements. The formal framework and the automated matching techniques are basic components that can be used for emerging tasks such as tracing requirements to interviews and generating requirements from conversations.
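The two alignment metrics defined above can be computed directly once chunk-story pairs have been labeled. The sketch below assumes the transcript is split into chunks and that each labeled pair is a binary "this chunk supports this story" judgment; the function name and the toy labels are hypothetical and only illustrate the definitions.

```python
def alignment_metrics(num_chunks, num_stories, supported_pairs):
    """Compute (requirements faithfulness, interview coverage).

    supported_pairs is a set of (chunk_index, story_index) tuples, one per
    pair judged as aligned (by a human annotator or by an LLM).
    """
    supported_stories = {story for _, story in supported_pairs}
    covered_chunks = {chunk for chunk, _ in supported_pairs}
    # Faithfulness: share of stories backed by at least one transcript chunk.
    faithfulness = len(supported_stories) / num_stories if num_stories else 0.0
    # Coverage: share of transcript chunks matched by at least one story.
    coverage = len(covered_chunks) / num_chunks if num_chunks else 0.0
    return faithfulness, coverage


# Toy example: 5 transcript chunks, 4 user stories, 3 aligned pairs.
pairs = {(0, 0), (1, 0), (3, 2)}
faithfulness, coverage = alignment_metrics(5, 4, pairs)
print(faithfulness, coverage)  # 0.5 0.6
```

In this toy case, stories 1 and 3 have no supporting chunk, which would flag them as possibly not grounded in the interview, while chunks 2 and 4 are transcript content that no story captures.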

2604.01993 2026-06-10 cs.CL cs.AI 版本更新

SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning

SAFE: 一种基于LLM作为验证器的证据驱动多跳推理框架

Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出SAFE框架,通过将推理分解为知识图谱三元组,在生成过程中逐步验证中间步骤,以解决多跳问答中模型通过无效推理得到正确答案的问题,平均准确率提升8.8个百分点。

AI中文摘要

多跳问答基准测试常常奖励大型语言模型(LLM)的虚假正确性,即模型通过无效的中间推理得出正确答案。我们提出了SAFE,一种基于LLM作为验证器的证据驱动多跳问答框架。SAFE不是在生成后仅判断最终答案,而是在生成过程中通过检查中间步骤与提供的段落和先前的推理轨迹来验证推理。为了使这一过程可检查,SAFE将推理分解为以知识图谱(KG)三元组表示的原子化、证据驱动的单元。在训练时,SAFE在KG约束下验证基准监督,并构建可靠的验证器训练数据。在推理时,外部验证器检查每个生成的步骤,识别无效推理,并在错误传播之前提供纠正反馈。在三个多跳问答基准测试中,SAFE平均提高了8.8个百分点的准确率。这些结果表明,证据驱动的多跳问答受益于将基于LLM的评估从事后答案判断转向逐步推理验证。

英文摘要

Multi-hop QA benchmarks often reward Large Language Models (LLMs) for spurious correctness, where models reach correct answers through invalid intermediate reasoning. We propose SAFE, an LLM-as-verifier framework for evidence-grounded multi-hop QA. Rather than judging only the final answer after generation, SAFE verifies reasoning during generation by checking intermediate steps against the provided passages and previous reasoning trajectory. To make this process checkable, SAFE decomposes reasoning into atomic, evidence-grounded units represented with Knowledge Graph (KG) triples. At training time, SAFE verifies benchmark supervision under KG-grounded constraints and constructs reliable verifier training data. At inference time, an external verifier checks each generated step, identifies invalid reasoning, and provides correction feedback before errors propagate. Across three multi-hop QA benchmarks, SAFE improves accuracy by 8.8 pp on average. These results show that evidence-grounded multi-hop QA benefits from shifting LLM-based evaluation from post-hoc answer judgment to stepwise reasoning verification.

2604.15771 2026-06-10 cs.CL 版本更新

Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

Skill-RAG: 通过隐藏状态探测与技能路由实现故障状态感知的检索增强

Kai Wei, Raymond Li, Xi Zhu, Zhaoqian Xue, Jiaojiao Han, Jingcheng Niu, Fan Yang

发表机构 * University of Michigan(密歇根大学) University of British Columbia(不列颠哥伦比亚大学) Rutgers University(罗格斯大学) University of Pennsylvania(宾夕法尼亚大学) New Jersey Institute of Technology(新泽西理工学院) TU Darmstadt(达姆施塔特工业大学) Wake Forest University(维克森林大学)

AI总结 提出Skill-RAG框架,通过轻量级隐藏状态探测器和基于提示的技能路由器,在检索失败时诊断原因并选择四种技能(查询重写、问题分解、证据聚焦、退出)纠正查询-证据错位,显著提升多轮检索后困难案例的准确性。

AI中文摘要

检索增强生成(RAG)已成为将大型语言模型锚定于外部知识的基础范式。尽管自适应检索机制提高了检索效率,现有方法将检索后失败视为重试信号而非诊断信号——从而未能解决查询与证据空间错位的结构性原因。我们观察到,相当一部分持续性检索失败并非源于缺乏相关证据,而是源于查询与证据空间之间的对齐差距。我们提出Skill-RAG,一种故障感知的RAG框架,它结合了轻量级隐藏状态探测器和基于提示的技能路由器。探测器在两个流水线阶段门控检索;当检测到故障状态时,技能路由器诊断根本原因,并在四种检索技能——查询重写、问题分解、证据聚焦,以及针对真正不可约情况的退出技能——中进行选择,以在下一次生成尝试前纠正错位。跨多个开放域问答和复杂推理基准的实验表明,Skill-RAG显著提高了多轮检索后持续存在的困难案例的准确性,在分布外数据集上尤其强劲。表示空间分析进一步揭示,所提出的技能占据了故障状态空间中结构化、可分离的区域,支持了查询-证据错位是一种类型化而非单一现象的观点。

英文摘要

Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose -- leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills -- query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases -- to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets. Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.

2605.18271 2026-06-10 cs.CL cs.AI cs.IR cs.LG 版本更新

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

从数量到价值:面向设备端RAG的偏好对齐记忆构建

Changmin Lee, Jaemin Kim, Taesik Gong

发表机构 * Department of Computer Science and Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea(计算机科学与工程系,蔚山科学技术院(UNIST),蔚山,韩国)

AI总结 本文提出EPIC方法,通过将用户偏好作为紧凑且稳定的个人上下文形式,整合到RAG流程中,以在有限内存下提高检索与用户偏好的对齐度,从而减少内存使用并提升准确性。

Comments Accepted to ICML 2026. Code and data are available at https://github.com/UbiquitousAILab/EPIC

AI中文摘要

随着基于大型语言模型(LLMs)的个人AI代理的迅速发展,将其部署到设备上已成为隐私和响应性的重要需求。为了处理现实世界请求中固有的个人和上下文依赖性,这些代理必须基于设备上存储的个人上下文进行生成。然而,在内存预算紧张的情况下,核心瓶颈是存储什么内容以确保检索与用户保持一致。我们提出EPIC(高效偏好对齐索引构建),将用户偏好视为一种紧凑且稳定的个人上下文形式,并在整个RAG流程中整合它们。EPIC从原始数据中选择性地保留与偏好相关的信息,并使检索偏向与偏好一致的上下文。在四个涵盖对话、辩论、解释和推荐的基准测试中,与表现最佳的基线相比,EPIC将索引内存减少为原来的1/2,404,将偏好遵循准确率提高18.79个百分点,并将检索延迟降低为原来的1/32.17。在设备端实验中,EPIC的内存占用保持在1 MB以下,在三个平台上的延迟为5.21至29.35毫秒/查询,同时支持偏好漂移下的流式更新。我们的代码和数据可在 https://github.com/UbiquitousAILab/EPIC 获取。

英文摘要

With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them on-device has become essential for privacy and responsiveness. To handle the inherently personal and context-dependent nature of real-world requests, such agents must ground their generation in device-resident personal context. However, under tight memory budgets, the core bottleneck is what to store so that retrieval remains aligned with the user. We propose EPIC (Efficient Preference-aligned Index Construction), which focuses on user preferences as a compact and stable form of personal context and integrates them throughout the RAG pipeline. EPIC selectively retains preference-relevant information from raw data and aligns retrieval toward preference-aligned contexts. Across four benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by a factor of 2,404, improves preference-following accuracy by 18.79 percentage points, and achieves 32.17 times lower retrieval latency than the best-performing baseline. In on-device experiments, EPIC maintains under 1 MB memory and achieves 5.21 to 29.35 ms/query latency across three platforms, while supporting streaming updates under preference drift. Our code and data are available at https://github.com/UbiquitousAILab/EPIC.

2605.28093 2026-06-10 cs.CL 版本更新

ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering

ConRAG: 用于多跳问答的共识驱动多视角检索

Yikai Zhu, Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院)

AI总结 提出ConRAG框架,通过共识驱动的多视角检索(关系、实体、文本信号)优化查询和语料库,显著提升多跳问答性能,在MuSiQue上创下新纪录。

AI中文摘要

检索增强生成(RAG)已成为增强大型语言模型(LLMs)在多跳问答(QA)上的有前景范式,这需要对来自多个文档的证据进行推理。当前的多跳RAG方法通常侧重于查询侧任务分解或语料侧知识图谱构建。尽管取得了进展,这些方法在复杂的多跳QA任务上仍难以达到令人满意的性能。为此,我们提出了ConRAG,一个共识驱动的多视角RAG框架,有效提升了LLMs在复杂多跳QA上的表现。ConRAG的核心是系统性地优化查询和语料两侧,并利用多视角证据(关系、实体和文本信号)进行更准确的检索。在三个多跳QA基准上的大量实验表明,ConRAG以明显优势持续优于所有基线,例如,与普通RAG相比平均性能提升高达+26.9%,并使Gemma-4-31B在具有挑战性的MuSiQue基准上创下新的最先进记录。

英文摘要

Retrieval-augmented generation (RAG) has emerged as a promising paradigm for enhancing large language models (LLMs) on multi-hop question answering (QA), which requires reasoning over evidence from multiple documents. Current multi-hop RAG methods generally focus on either query-side task decomposition or corpus-side knowledge graph construction. Despite their progress, these methods still struggle to achieve satisfactory performance on complex multi-hop QA tasks. To this end, we propose ConRAG, a consensus-driven multi-view RAG framework that effectively boosts LLMs on complex multi-hop QA. The core of ConRAG is to systematically optimize both the query and corpus sides and to leverage multi-view evidence (relation, entity, and text signals) for more accurate retrieval. Extensive experiments on three multi-hop QA benchmarks show that ConRAG consistently outperforms all baselines by a clear margin, e.g., up to +26.9% average performance gains over vanilla RAG, and enables Gemma-4-31B to achieve a new state-of-the-art record on the challenging MuSiQue benchmark.

2605.03344 2026-06-10 cs.IR cs.AI cs.CL 版本更新

RAG over Thinking Traces Can Improve Reasoning Tasks

基于思考轨迹的 RAG 可提升推理任务表现

Negar Arabzadeh, Wenjie Ma, Sewon Min, Matei Zaharia

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出检索思考轨迹而非文档,通过 T3 方法将其转化为结构化表示,在推理任务上显著提升性能,超越标准 RAG 和无 RAG 基线。

AI中文摘要

检索增强生成(RAG)已被证明对知识密集型任务有效,但普遍认为其对数学和代码生成等推理密集型问题帮助有限。我们通过证明限制不在于 RAG 本身而在于语料库的选择来挑战这一假设。我们不检索文档,而是提出检索思考轨迹,即问题求解尝试过程中产生的中间思考轨迹。我们表明思考轨迹本身就是一个强大的检索源,并进一步引入 T3,一种离线方法,将其转化为结构化、利于检索的表示,以提高可用性。使用这些轨迹作为语料库,简单的检索-生成流水线在强模型和基准测试(如 AIME 2025-2026、LiveCodeBench 和 GPQA-Diamond)上持续提升推理性能,优于无 RAG 基线以及在标准网络语料库上的检索。例如,在 AIME 2025-2026 上,使用 Gemini-2-thinking 生成的轨迹进行 RAG,在 Gemini-2.5-Flash、GPT-OSS-120B 和 GPT-5 上分别实现了 +56.3%、+8.6% 和 +7.6% 的相对增益,尽管这些是更新的模型。总体而言,我们的结果表明思考轨迹是推理任务的有效检索语料库,将其转化为结构化、紧凑或诊断性表示可带来更强的增益。代码可在 https://github.com/Narabzad/t3 获取。

英文摘要

Retrieval-augmented generation (RAG) has proven effective for knowledge-intensive tasks, but is widely believed to offer limited benefit for reasoning-intensive problems such as math and code generation. We challenge this assumption by showing that the limitation lies not in RAG itself, but in the choice of corpus. Instead of retrieving documents, we propose retrieving thinking traces, i.e., intermediate thinking trajectories generated during problem solving attempts. We show that thinking traces are already a strong retrieval source, and further introduce T3, an offline method that transforms them into structured, retrieval-friendly representations, to improve usability. Using these traces as a corpus, a simple retrieve-then-generate pipeline consistently improves reasoning performance across strong models and benchmarks such as AIME 2025-2026, LiveCodeBench, and GPQA-Diamond, outperforming both non-RAG baselines and retrieval over standard web corpora. For instance, on AIME 2025-2026, RAG with traces generated by Gemini-2-thinking achieves relative gains of +56.3%, +8.6%, and +7.6% for Gemini-2.5-Flash, GPT-OSS-120B, and GPT-5, respectively, even though these are more recent models. Overall, our results suggest that thinking traces are an effective retrieval corpus for reasoning tasks, and transforming them into structured, compact, or diagnostic representations unlocks even stronger gains. Code available at https://github.com/Narabzad/t3.

4. 对话系统与智能体 13 篇

2606.09900 2026-06-10 cs.CL cs.AI cs.IR cs.LG 新提交

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

更少上下文,更高准确率:一种用于LLM Agent的双时间记忆引擎,其中精简检索上下文优于完整历史

Liuyin Wang

发表机构 * Independent Researcher(独立研究者)

AI总结 提出一种双时间记忆引擎Engram,通过混合读取路径从约9.6k token的检索片段中回答,在LongMemEval_S上达到83.6%准确率,比完整历史(79k token)高10.4个百分点,且500个问题中无一运行出错。

Comments 14 pages, 4 figures, 3 tables. Code, reproducible harness, and raw per-question logs: https://github.com/ly-wang19/engram

AI中文摘要

长期记忆是LLM Agent缺失的一层:跨会话时它们会遗忘,而常见的解决方法(将整个历史重放到提示中)成本高、速度慢,且随着干扰项积累,准确性下降。大多数记忆系统在成本或延迟上胜出,但在准确性上仍不如完整上下文基线,且基准测试结果在不一致、不可复现的测试框架上报告,导致同一系统在不同来源上得分差异巨大。我们提出Engram,一种基于双时间数据模型的开源双过程记忆引擎。快速写入路径以追加方式写入无损的情节(episode),关键路径上无需LLM参与;异步路径提取原子(主体、谓词、客体)事实,构建双时间知识图谱,并在无需为每个事实调用LLM的情况下解决矛盾:只使事实失效而从不删除,因此每个事实都保留来源信息和取代链。混合读取路径融合稠密、词汇、图谱和时效/显著性信号,应用时间点(“截至”)过滤器,并组装紧凑、带有来源标记的上下文。在完整的500个问题的LongMemEval_S上,由官方的分类别评判器评分,Engram的精简配置(仅从约9.6k token的检索片段回答,从不使用完整历史)得分为83.6%,而完整上下文为73.2%(+10.4个百分点,McNemar p < 10^-6),token数约为后者的1/8(9.6k vs. 79k),且500题中无一运行出错。这种增益需要混合读取路径:仅用事实会损失召回率,而事实加检索到的文本块则能恢复细节。我们还贡献了一个中立的、位于仓库内的评估框架,内置官方评判器,并在每个表格中包含完整上下文基线,发布原始的逐题日志,并记录了会悄然扭曲记忆基准的测量完整性陷阱(截断、自制评判器、完整历史泄露)。每个数字都附带复现命令。

英文摘要

Long-term memory is the missing layer for LLM agents: across sessions they forget, and the common workaround -- replaying the whole history into the prompt -- is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so one system appears at wildly different scores across sources. We present Engram, an open-source, dual-process memory engine on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact -- invalidating, never deleting, so every fact keeps provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration -- answering from a ~9.6k-token retrieved slice, never the full history -- scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0/500 errored. The gain needs a hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.
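The "invalidating, never deleting" update rule and the point-in-time ("as-of") filter can be illustrated with a small fact store. This is a minimal sketch under stated assumptions, not Engram's implementation: it tracks only the valid-time axis (a full bi-temporal model would also record when the system learned each fact), treats every predicate as single-valued, and uses integer timestamps; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: int                      # when the fact became true
    valid_to: Optional[int] = None       # when it stopped being true (None = still true)
    superseded_by: Optional[int] = None  # index of the fact that replaced it


class FactStore:
    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, predicate, obj, at):
        """Add a fact; close (but never delete) the fact it contradicts."""
        new_index = len(self.facts)
        for fact in self.facts:
            same_key = (fact.subject, fact.predicate) == (subject, predicate)
            if same_key and fact.valid_to is None:
                fact.valid_to = at
                fact.superseded_by = new_index
        self.facts.append(Fact(subject, predicate, obj, valid_from=at))
        return new_index

    def as_of(self, subject, predicate, at):
        """Return the object that was valid at time `at`, or None."""
        for fact in self.facts:
            same_key = (fact.subject, fact.predicate) == (subject, predicate)
            still_valid = fact.valid_to is None or at < fact.valid_to
            if same_key and fact.valid_from <= at and still_valid:
                return fact.obj
        return None


store = FactStore()
store.assert_fact("user", "lives_in", "Paris", at=1)
store.assert_fact("user", "lives_in", "Berlin", at=5)

print(store.as_of("user", "lives_in", at=3))  # Paris
print(store.as_of("user", "lives_in", at=7))  # Berlin
```

Because the older fact is closed rather than removed, the store can still answer questions about the past, and `superseded_by` gives each fact a link to whatever replaced it.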

2606.10316 2026-06-10 cs.CL 新提交

TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning

TabClaw: 一个用于电子表格操作和表格推理的交互式自进化智能体

Mingyue Cheng, Shuo Yu, Daoyu Wang, Qingchuan Li, Xiaoyu Tao, Qingyang Mao, Yitong Zhou, Qi Liu

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(中国科学技术大学认知智能国家重点实验室)

AI总结 提出TabClaw,一个开源交互式AI智能体,通过可编辑执行计划、流式ReAct循环、并行多表推理和用户记忆提取,提升电子表格操作和表格推理的透明性与个性化。

Comments 5 pages, 2 figures

详情
AI中文摘要

电子表格和表格是结构化数据分析中广泛使用的表示形式,但有效分析仍需大量人工和领域专业知识。近期的大语言模型智能体可以自动化部分过程,但它们通常对中间决策提供有限的透明度,依赖隐含假设,难以处理多表比较,并且重复类似工作流而不适应用户偏好。本文提出TabClaw,一个用于电子表格操作和表格推理的开源交互式AI智能体。用户上传CSV或Excel文件并发出自然语言请求;TabClaw澄清模糊意图,展示可编辑执行计划,流式传输ReAct风格的工具使用分析循环,派遣专家智能体进行并行多表推理,并通过显式一致性和不确定性标记综合发现。除一次性分析外,TabClaw记录完成的工作流,提取持久用户记忆,从重复工具使用模式中提炼可复用技能,支持包式技能导入,并从负面反馈中升级技能。在电子表格操作和表格推理基准上的实验表明,TabClaw在提高可执行任务完成度和推理性能的同时,保持了可检查的用户工作流。本文展示了TabClaw如何将电子表格和表格转化为可检查的分析工作流,同时逐步个性化以适应重复的数据分析任务。我们的代码已公开。

英文摘要

Spreadsheets and tables are widely used representations for structured data analysis, but effective analysis still requires substantial manual effort and domain expertise. Recent large language model (LLM) agents can automate parts of this process, but they often provide limited transparency into intermediate decisions, rely on implicit assumptions, struggle with multi-table comparison, and repeat similar workflows without adapting to a user's preferences. This paper presents TabClaw, an open-source interactive AI agent for spreadsheet manipulation and table reasoning. Users upload CSV or Excel files and issue natural-language requests; TabClaw clarifies ambiguous intent, exposes an editable execution plan, streams a ReAct-style tool-using analysis loop, dispatches specialist agents for parallel multi-table reasoning, and synthesizes findings with explicit consensus and uncertainty markers. Beyond one-off analysis, TabClaw records completed workflows, extracts persistent user memory, distills reusable skills from repeated tool-use patterns, supports package-style skill import, and upgrades skills from negative feedback. Experiments on spreadsheet manipulation and table reasoning benchmarks show that TabClaw improves executable task completion and reasoning performance while preserving an inspectable user workflow. This paper shows how TabClaw turns spreadsheets and tables into inspectable analytical workflows while gradually personalizing itself to recurring data-analysis tasks. Our code is available.

2606.10423 2026-06-10 cs.CL 新提交

WebChallenger: A Reliable and Efficient Generalist Web Agent

WebChallenger: 一个可靠且高效的通用型Web智能体

Jayoo Hwang, Xiaowen Zhang, Vedant Padwal

发表机构 * ML Collective longsurf.ai Independent(独立研究者)

AI总结 提出WebChallenger框架,通过PageMem结构化页面表示、分治观察、轻量探索记忆和复合动作工作流,复现人类认知优势,使开源模型在多个Web导航基准上接近前沿专有系统性能。

AI中文摘要

自主Web导航对LLM智能体仍然具有挑战性,最强的通用系统依赖于专有推理模型,其推理成本对于此类智能体最有用的重复性任务来说高得令人望而却步。我们认为这一差距并非源于模型能力不足,而是源于智能体架构未能复制人类的三种认知优势:对相关页面区域的选择性注意力、对网站结构的持久记忆以及对常见交互模式的程序性流畅性。我们引入了WebChallenger,一个通过架构设计而非模型规模来解决每个差距的Web智能体框架,该框架围绕PageMem构建:一种从DOM确定性构建的结构化页面表示,将每个页面呈现为带有简短摘要的语义区块层次结构。在此共享基础上,我们构建了三种机制来对应三种认知优势:一个分治观察流水线,让智能体先浏览各区块的摘要,并仅从任务相关区域提取细节;一个轻量级探索和记忆系统,遍历每个网站一次以构建页面和元素行为的可重用地图;以及复合动作工作流,将常见的多步交互折叠为单个智能体动作,自动处理部分状态变化。由于这三种机制都基于PageMem运行,该框架无需特定站点适配器即可跨网站泛化。使用未经微调的现成开放权重模型,我们的系统在WebArena上达到56.3%,在VisualWebArena上达到48.7%,在Online-Mind2Web上达到51.0%,在WorkArena上达到70.9%,以极低的成本接近前沿专有系统。我们的代码已发布于 https://github.com/jayoohwang1/webchallenger 。

英文摘要

Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide-and-conquer observation pipeline that lets the agent skim section summaries and extract details only from task-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi-step interactions into single agent actions, handling partial state changes automatically. Because all three operate over PageMem, the framework generalizes across websites without site-specific adapters. Using off-the-shelf open-weight models without fine-tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at https://github.com/jayoohwang1/webchallenger

2606.10694 2026-06-10 cs.CL 新提交

REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

REAL: 一种用于LLM长期记忆管理的推理增强图框架

Keer Lu, Liwei Chen, Guoqing Jiang, Zhiheng Qin, Yunhuai Liu, Wentao Zhang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) Kuaishou Technology(快手科技) Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University(北京大学前沿交叉学科研究院数据科学中心)

AI总结 提出REAL框架,通过构建时序和置信度感知的有向属性图,采用非破坏性更新和混合束搜索检索,解决LLM长期记忆中的关系缺失、事实覆盖和查询被动问题,平均性能提升22.72%。

AI中文摘要

大型语言模型(LLM)越来越期望与用户进行长时间跨度的交互。然而,由于其有限的上下文窗口,LLM无法保留所有过去的交互,因此长期记忆管理对于存储、更新和检索超出上下文限制的历史信息至关重要。尽管最近的记忆系统试图通过外部存储历史信息来解决这个问题,但现有方法存在三个关键限制:基于平面文本的记忆组织无法捕捉记忆之间的显式关系,结构化记忆系统通常会破坏性地覆盖演变的事实,而当前的检索机制在证据不完整时仍然与查询无关且被动。REAL将长期对话记忆构建为时序和置信度感知的有向属性图,其中每个原子事实都用实体、关系、有效时间区间、置信度分数和探索意图标签表示。在记忆构建过程中,REAL采用非破坏性时序更新策略,保留并行的事实版本及其有效性区间,从而能够忠实地追踪事实的演变。在检索过程中,REAL锚定与查询相关的根实体,解耦其探索意图,并执行语义评估器引导的混合束搜索以提取紧凑的记忆子图。它进一步结合反事实推理来修复不可靠的检索状态,并通过隐式逻辑关系恢复缺失的记忆证据。综合实验表明,REAL在长期记忆性能上显著优于平面文本、基于图和现有记忆基线,平均提升22.72%。

英文摘要

Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management essential for storing, updating, and retrieving historical information beyond the context limit. Although recent memory systems attempt to address this issue by storing historical information externally, existing approaches suffer from three key limitations: flat text-based memory organizations fail to capture explicit relations among memories, structured memory systems often destructively overwrite evolving facts, and current retrieval mechanisms remain query-agnostic and passive when evidence is incomplete. REAL constructs long-term conversational memory as a temporal and confidence-aware directed property graph, where each atomic fact is represented with entities, relations, valid-time intervals, confidence scores, and exploration intent labels. During memory construction, REAL adopts a non-destructive temporal update strategy that preserves parallel fact versions and their validity intervals, enabling faithful tracking of fact evolution. During retrieval, REAL anchors query-relevant root entities, decouples their exploration intents, and performs semantic evaluator-guided hybrid beam search to extract compact memory subgraphs. It further incorporates counterfactual inference to repair unreliable retrieval states and recover missing memory evidence through implicit logical relations. Comprehensive experiments demonstrate that REAL substantially improves long-term memory performance over flat-text, graph-based, and existing memory baselines, achieving an average improvement of 22.72%.

2606.10736 2026-06-10 cs.CL cs.AI cs.CY 新提交

Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs

利用课程先决条件图检测对话式AI交互中的知识缺口

Youssef Medhat, Junsoo Park, Ploy Thajchayapong, Ashok K. Goel

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出一个流水线,通过少样本文本分类器将学生向对话式AI助教提出的问题映射到课程主题,并利用GPT-4提取的先决条件知识图谱,以检测主题级知识缺口。

Comments Accepted as a short paper at the 10th CSEDM Workshop, co-located with the 18th International Conference on Educational Data Mining (EDM 2026). 7 pages, 2 figures, 2 tables

AI中文摘要

大型在线课程会产生数千条学生向对话式AI助教提出的问题,但这些交互日志作为诊断信号在很大程度上未被利用。我们提出一个流水线,使用少样本文本分类器,将学生向对话式AI助教提出的问题映射到课程主题,该分类器基于GPT-4提取的课程概念先决条件知识图谱。在研究生级别AI课程的164名学生的1,340个问题事件上评估,我们的分类器在43个标签(42个课程主题加上一个“未知”弃权类别)上达到80.0%的准确率。主题级问题数量与独立期中调查中学生自我报告的难度显著相关(rho = 0.491, p = 0.008, n = 28个主题),提供了趋同证据,表明分类后的问题流反映了真实的主题难度。这些结果表明,映射到课程结构上的对话式AI交互日志携带关于主题级知识缺口的可操作信号,并为教师提供基于课程视角的哪些主题需要关注的视图。

英文摘要

Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs remain largely untapped as diagnostic signals. We present a pipeline that maps student questions from a conversational AI teaching assistant to curriculum topics using a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph of course concepts. Evaluated on 1,340 question events from 164 students in a graduate-level AI course, our classifier achieves 80.0% accuracy across 43 labels (42 curriculum topics plus an "unknown" abstention class). Topic-level question volume correlates significantly with student self-reported difficulty from an independent mid-semester survey (rho = 0.491, p = 0.008, n = 28 topics), providing convergent evidence that the classified question stream reflects genuine topic difficulty. These results demonstrate that conversational AI interaction logs, mapped onto curriculum structure, carry actionable signals about topic-level knowledge gaps and provide instructors with a curriculum-grounded view of which topics warrant attention.

2606.10875 2026-06-10 cs.CL 新提交

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

通过经验知识集成与激活推动LLM工具调用极限

Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所复杂系统认知与决策智能重点实验室) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院)

AI总结 研究如何通过经验知识获取、激活和内化提升LLM多步工具调用性能,提出知识增强工具执行框架KATE,结合宽度扩展推理与知识感知训练,在BFCL-V3和AppWorld上显著优于基线。

AI中文摘要

大型语言模型(LLM)依赖工具使用来充当自主代理,但由于缺乏足够的工具相关知识和无效的知识激活,在多步执行中常常失败。因此,我们进行了一项系统性研究,探讨知识如何影响工具使用性能,涵盖知识获取、激活和内化阶段。在知识获取阶段,我们获取并评估了各种形式的经验知识,分析表明简单的实例级知识已经能够提供强大且可靠的增益,而抽象的意图级知识收益有限。在推理时,为了激活知识,我们发现提示LLM扩展推理深度会产生递减收益,而通过并行采样与聚合扩展推理宽度能更有效地激活潜在经验知识。在训练时,对于知识内化,使用知识增强数据进行后训练进一步提升了性能,其中强化学习优于监督微调。基于这些见解,我们提出了知识增强工具执行(KATE)框架,该框架将经验知识与宽度扩展推理及知识感知训练相结合。在BFCL-V3和AppWorld上的实验表明,该方法在不同模型规模上均比强基线有一致且显著的改进。我们的代码可在 https://github.com/hypasd-art/KATE 获取。

英文摘要

Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use performance, covering the stages of knowledge acquisition, activation, and internalization. In the knowledge acquisition stage, we acquire and evaluate various forms of experiential knowledge, and our analysis shows that simple instance-level knowledge can already provide strong and reliable gains, while abstract intent-level knowledge offers limited benefits. At inference time, to activate knowledge, we find that prompting the LLM to expand the depth of reasoning yields diminishing returns, whereas expanding the width of reasoning by parallel sampling with aggregation more effectively activates latent experiential knowledge. At training time, for knowledge internalization, post-training with knowledge-augmented data further improves performance, with reinforcement learning outperforming supervised fine-tuning. Based on these insights, we propose Knowledge-Augmented Tool Execution (KATE), a framework that integrates experiential knowledge with reasoning-width-expanded inference and knowledge-aware training. Experiments on BFCL-V3 and AppWorld demonstrate consistent and substantial improvements over strong baselines across model scales. Our code is available at https://github.com/hypasd-art/KATE.
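"Expanding the width of reasoning by parallel sampling with aggregation" can be pictured as drawing several candidate tool calls for the same step and keeping the most frequent one. The abstract does not specify the aggregation rule, so the majority vote below is an assumption, and the tool names and arguments are invented for illustration; in practice the candidates would come from independent model samples.

```python
import json
from collections import Counter


def aggregate_tool_calls(candidates):
    """Pick the most frequent tool call among sampled candidates.

    Each candidate is a dict such as {"name": ..., "arguments": {...}}.
    Calls are canonicalized with sorted keys so that argument order
    does not split the vote.
    """
    keys = [json.dumps(call, sort_keys=True) for call in candidates]
    winner, votes = Counter(keys).most_common(1)[0]
    return json.loads(winner), votes


# Three hypothetical samples for the same step; two agree up to key order.
samples = [
    {"name": "get_weather", "arguments": {"city": "Paris", "unit": "C"}},
    {"name": "get_weather", "arguments": {"unit": "C", "city": "Paris"}},
    {"name": "get_forecast", "arguments": {"city": "Paris"}},
]

call, votes = aggregate_tool_calls(samples)
print(call["name"], votes)  # get_weather 2
```

Voting over whole calls is the simplest aggregation; voting per argument or scoring candidates with a separate judge are alternatives that trade simplicity for finer control.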

2606.10475 2026-06-10 cs.MA cs.AI cs.CL 交叉投稿

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

思想与言语解耦:基于知识反事实推理的鲁棒多智能体辩论

Jakub Masłowski, Jarosław A. Chudziak

发表机构 * Institute of Computer Science, Warsaw University of Technology(华沙理工大学计算机科学研究所)

AI总结 提出知识反事实推理(KG-CFR)双阶段架构,通过私有规划缓冲与公共执行层分离,在动态资源分配环境下将扰动后论证质量从0.694提升至0.822,并减少语义循环。

Comments Accepted for publication in the Proceedings of the 30th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2026)

详情
AI中文摘要

多智能体辩论框架已被证明能提升大语言模型在收敛任务上的表现,但目前优化方式过度偏向最终输出准确性而非过程稳定性。在长时间交互中,持续扰动下的反应式系统常出现逻辑退化、论点重复和角色漂移。为从结构上防止身份丢失并保持过程保真度,我们引入知识反事实推理(KG-CFR),一种双阶段架构,在私有检索增强规划缓冲区和公共执行层之间强制执行严格关注点分离。我们在不确定性下动态资源分配(DRAU)这一专用1v1v1环境中评估该系统,引入与标准辩论设置不同的多样性。在270次完全析因危机模拟轨迹(含随机环境冲击)中,KG-CFR在超过95%的扰动运行中防止了裁判检测到的关键冲击后退化(定义为质量偏移Δ ≤ -0.20),将整体论证质量从0.694提升至0.822。我们的主要贡献是证明架构解耦是在持续压力下不损失质量而增强系统鲁棒性的重要因素。此外,我们引入了用于话语发散和计划执行对齐的自定义向量度量,为操作稳定性提供了强有力且方向一致的证据。消融实验表明,适当的教义基础与前瞻规划对论证质量同等重要。根据初步度量评估,KG-CFR通过保持智能体与原始计划的一致性减少了语义循环。

英文摘要

Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy over process stability. During long-horizon exchanges, reactive systems under sustained perturbations often experience logic degradation, argument repetition, and role drift. To structurally prevent identity loss and maintain process fidelity, we introduce Knowledge-Grounded Counterfactual Reasoning (KG-CFR), a dual-stage architecture that enforces a strict separation of concerns between a private, retrieval-augmented planning buffer and a public execution layer. We assess this system in Dynamic Resource Allocation under Uncertainty (DRAU), a dedicated 1v1v1 environment that introduces diversity distinct from standard debate settings. Over 270 full-factorial crisis simulation trajectories with stochastic environmental shocks, KG-CFR prevents judge-detected critical post-shock degradation (defined as a quality shift $\Delta \le -0.20$) in more than 95% of perturbed runs, increasing overall argument quality from 0.694 to 0.822. Our primary contribution is the demonstration that architectural decoupling is an important factor in enhancing systemic resilience under sustained pressure without quality loss. Furthermore, we introduce custom vector metrics for discourse divergence and plan-execution alignment that provide strong, directionally consistent evidence of operational stability. Our ablation experiments suggest that proper doctrinal grounding can be as important for argument quality as prospective planning. According to our initial metric evaluations, KG-CFR reduces semantic looping by preserving the agent's consistency with the original plan.
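
The degradation criterion quoted above (a quality shift of at most -0.20) can be written down directly. The sketch below assumes that the shift is the post-shock judge score minus the pre-shock judge score. The abstract does not spell out that definition, and the function names are illustrative only.

```python
from typing import List, Tuple

CRITICAL_SHIFT = -0.20  # threshold quoted in the abstract


def quality_shift(pre: float, post: float) -> float:
    """ASSUMPTION: the shift is post-shock quality minus pre-shock quality."""
    return post - pre


def is_critical_degradation(pre: float, post: float, threshold: float = CRITICAL_SHIFT) -> bool:
    """True when the quality drop is at least as large as the critical threshold."""
    return quality_shift(pre, post) <= threshold


def prevention_rate(runs: List[Tuple[float, float]]) -> float:
    """Fraction of perturbed runs that do NOT show critical post-shock degradation."""
    if not runs:
        raise ValueError("need at least one run")
    prevented = sum(1 for pre, post in runs if not is_critical_degradation(pre, post))
    return prevented / len(runs)


if __name__ == "__main__":
    # Made-up (pre, post) judge scores, for illustration only.
    demo_runs = [(0.80, 0.50), (0.80, 0.75), (0.70, 0.90), (0.60, 0.60)]
    print(prevention_rate(demo_runs))
```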

2606.10677 2026-06-10 cs.AI cs.CL 交叉投稿

Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Infini Memory:用于长期LLM智能体记忆的可维护主题文档

Suozhao Ji, Baodong Wu, Zehao Wang, Lei Xia, Qingping Li, Ruisong Wang, Wenbo Ding, Zhenhua Zhu, Boxun Li, Guohao Dai, Yu Wang

发表机构 * Infinigence AI(无问芯穹) Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出Infini Memory架构,将智能体记忆组织为主题文档,通过缓冲合并和迭代检索实现可维护的长期记忆,在MemoryAgentBench上达到64.7%的总体得分。

详情
AI中文摘要

长期LLM智能体需要持久记忆,以跟踪变化的事实并在会话间提供相关证据。现有的记忆系统通常将观察存储为孤立的记录、摘要或索引片段,这使得证据聚合、事实修正和记忆维护变得困难。我们提出Infini Memory,一种可维护的基于文本的持久记忆架构,将智能体记忆视为主题结构化文档。每个主题文档作为一个语义单元,用于收集相关证据、保留元数据并随时间修正事实。新观察首先被暂存在缓冲区中,然后定期合并为连贯的文本上下文。在推理时,一种智能体检索过程允许LLM通过迭代工具调用读取记忆,而不是单次检索步骤。在MemoryAgentBench上,Infini Memory取得了64.7%的总体得分。消融实验表明,主题结构化维护和迭代证据检查改善了长期记忆使用的互补方面。

英文摘要

Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text-based persistent memory architecture that treats agent memory as topic-structured documents. Each topic document serves as a semantic unit for collecting related evidence, preserving metadata, and revising facts over time. New observations are first staged in a buffer and periodically consolidated into coherent textual contexts. At inference time, an agentic retrieval procedure lets the LLM read memory through iterative tool calls rather than a single retrieval step. On MemoryAgentBench, Infini Memory achieves an overall score of 64.7%. Ablations show that topic-structured maintenance and iterative evidence inspection improve complementary aspects of long-term memory use.

2606.11078 2026-06-10 cs.AI cs.CL cs.CV 交叉投稿

A History-Aware Visually Grounded Critic for Computer Use Agents

面向计算机使用代理的历史感知视觉基础批评家

Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Capital One University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出HiViG框架,通过历史感知的视觉基础多模态批评家,在测试时评估动作并拦截错误,在多个GUI基准上提升成功率。

Comments Code: https://github.com/G-JWLee/HiViG

详情
AI中文摘要

针对计算机使用代理(CUA)的各种测试时干预措施(包括批评模型)已被开发出来,通过在复杂图形用户界面(GUI)环境中执行前动作评估来提高性能。然而,现有的批评家存在两个关键限制:(1)主要关注短视决策循环(例如,遗忘早期动作);(2)缺乏检测有缺陷动作(例如,点击错误的UI元素)所需的视觉基础。为了解决这些问题,我们引入了HiViG,一个历史感知的视觉基础测试时框架,其核心是一个在真实GUI轨迹上训练的多模态批评家,用于将过去的交互抽象为紧凑记录,并基于视觉基础评估动作。在测试时,HiViG将批评家集成到策略决策循环中,以提供宏观动作历史(总结策略已完成成就)和视觉基础批评(根据当前截图验证原始执行坐标,在执行前拦截错误)。在网页、移动和桌面基准测试中,HiViG持续优于现有的标量和口头批评家,在Qwen3-VL-32B上比最强基线平均成功率提高5.8%,在Gemini-3-Flash上提高9.0%,并展示了强大的跨平台泛化能力。消融实验表明,宏观动作历史缓解了短视规划,视觉基础批评减少了执行错误,这两个组件对于长时域GUI任务中的测试时扩展至关重要。

英文摘要

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.

2606.09421 2026-06-10 cs.CL 版本更新

What Should a Skill Remember? Quality-Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents

技能应记住什么?语言模型代理中成本感知技能重写的质量-成本权衡

Qinghua Xing, Yinda Chen, Yaping Jin, Zhenhe Wu, Bohan Lin, Hang Zhou, Xinghao Chen, Hanting Chen, Zhiwei Xiong

发表机构 * University of Science and Technology of China(中国科学技术大学) Huawei Technologies(华为技术有限公司) Tianjin University(天津大学)

AI总结 研究语言模型代理中技能重写的质量-成本权衡,提出信息保留策略,在SkillsBench上实现成本降低7%-14.7%且保持验证质量。

详情
AI中文摘要

大型语言模型代理越来越依赖技能:可重用的程序文档,编码工作流程、工具使用、实现模式、验证检查和领域规则。技能重写通常被视为提示压缩,但较短的技能可能通过移除防止探索、调试和恢复的稀疏操作锚点而使代理更昂贵。我们通过这种经济视角研究技能重写。我们的受控框架剖析技能结构,使用信息保留策略重写技能,并在固定任务指令、环境和验证器下评估重写。在SkillsBench上的实验揭示了不同策略间明显的质量-成本权衡:API/代码锚定、工作流保护和规则/公式锚定有利于不同的任务族,没有普遍主导的模板。在主要的留出评估中,学习到的策略将总成本降低7.0%,下游代理令牌成本降低6.0%;在冻结的跨模型迁移中,相应的降低平均为14.7%和13.7%,同时验证器质量保持不变。这些结果将技能设计定位为成本感知的操作知识工程,而非提示压缩。资源:https://github.com/1Reminding/Skill_EE。

英文摘要

Large language model agents increasingly rely on skills: reusable procedural documents encoding workflows, tool use, implementation patterns, validation checks, and domain rules. Skill rewriting is often treated as prompt compression, but shorter skills can make agents more expensive by removing sparse operational anchors that prevent exploration, debugging, and recovery. We study skill rewriting through this economic lens. Our controlled framework profiles skill structure, rewrites skills using information-preservation strategies, and evaluates the rewrites under fixed task instructions, environments, and verifiers. Experiments on SkillsBench reveal distinct quality-cost trade-offs across strategies: API/code anchoring, workflow guarding, and rule/formula anchoring benefit different task families, with no universally dominant template. In the main held-out evaluation, the learned policy reduces total cost by 7.0% and downstream agent-token cost by 6.0%; in frozen cross-model transfer, the corresponding reductions average 14.7% and 13.7%, while verifier quality is preserved. These results position skill design as cost-aware operational knowledge engineering rather than prompt compression. Resources: https://github.com/1Reminding/Skill_EE.

2506.09171 2026-06-10 cs.LG cs.AI cs.CL 版本更新

Fact-Augmented Lookahead Planning for LLM Agents

面向LLM智能体的事实增强前瞻规划

Samuel Holt, Max Ruiz Luyten, Thomas Pouplin, Mihaela van der Schaar

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出LWM-Planner框架,通过从轨迹中提取关键事实并用于条件化动作提议、世界模型模拟和状态值估计,实现无需参数更新的在线规划改进,在多个环境上优于ReAct/Reflexion和纯搜索基线。

Comments Accepted at the 29th International Conference on Artificial Intelligence and Statistics (AISTATS 2026). Camera-ready version. 9-page main text plus appendices (63 pages total), 1 figure

详情
AI中文摘要

大型语言模型(LLM)能力日益增强,但在交互式、部分可观测、长周期环境中,当搜索无引导或近期历史不足时,LLM智能体仍难以有效规划。我们提出LWM-Planner,一种事实增强的前瞻规划框架,仅通过上下文学习改善智能体行为。每个回合后,智能体从轨迹中提取任务关键原子事实,通过轻量级预测一致性过滤器验证候选事实(并可选择压缩),然后使用生成的事实集来条件化动作提议、单步潜在世界模型模拟和状态值估计。规划通过递归、有限深度的前瞻进行,基于累积事实和近期历史对候选轨迹进行搜索,实现无需参数更新的在线改进。我们提供抽象风格的动机:将事实视为减少状态混淆(代理$\epsilon_{\mathrm{sim}}$),将事实条件模拟视为降低单步误差(代理$\delta_{\mathrm{model}}$),但不声称形式化保证。实验上,在文本FrozenLake变体、CrafterMini和ALFWorld上,该方法在累积回报上优于ReAct/Reflexion和纯搜索基线,表明额外的测试时搜索在由紧凑的经验派生事实引导时最为有用。

英文摘要

Large Language Models (LLMs) are increasingly capable, but LLM agents still struggle to plan effectively in interactive, partially observable, long-horizon environments when search is unguided or recent history is insufficient. We introduce LWM-Planner, a fact-augmented lookahead planning framework that improves agent behavior purely through in-context learning. After each episode, the agent extracts task-critical atomic facts from its trajectories, validates candidates with a lightweight predictive-consistency filter (and optionally compresses them), and uses the resulting fact set to condition action proposal, single-step latent world-model simulation, and state-value estimation. Planning then proceeds via recursive, depth-limited lookahead over candidate trajectories conditioned on the accumulated facts and recent history, enabling online improvement without parameter updates. We provide abstraction-style motivation: treating facts as reducing state aliasing (proxy $\epsilon_{\mathrm{sim}}$) and fact-conditioned simulation as lowering one-step error (proxy $\delta_{\mathrm{model}}$), without claiming formal guarantees. Empirically, on text FrozenLake variants, CrafterMini, and ALFWorld, the approach improves cumulative return over ReAct/Reflexion and search-only baselines, suggesting that additional test-time search is most useful when grounded by compact, experience-derived facts.
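
The recursive, depth-limited lookahead described above can be sketched on a toy problem. This is not the LWM-Planner code. In the paper the action proposer, world model, and value estimator are LLM calls conditioned on extracted facts, whereas here they are hypothetical hand-written functions (`propose`, `simulate`, `value`) over a one-dimensional corridor, and the fact string is invented for illustration.

```python
from typing import Callable, FrozenSet, List, Tuple

State = int
Facts = FrozenSet[str]


def lookahead(
    state: State,
    depth: int,
    facts: Facts,
    propose: Callable[[State, Facts], List[str]],
    simulate: Callable[[State, str, Facts], Tuple[State, float]],
    value: Callable[[State, Facts], float],
) -> Tuple[float, List[str]]:
    """Return (estimated return, action sequence) of the best plan within `depth` steps."""
    actions = propose(state, facts)
    if depth == 0 or not actions:
        # Depth limit or terminal state: fall back on the value estimate.
        return value(state, facts), []
    best_score, best_plan = float("-inf"), []
    for action in actions:
        next_state, reward = simulate(state, action, facts)
        future, plan = lookahead(next_state, depth - 1, facts, propose, simulate, value)
        if reward + future > best_score:
            best_score, best_plan = reward + future, [action] + plan
    return best_score, best_plan


GOAL = 3  # toy corridor: states -3..3, reward only at +3


def propose(state: State, facts: Facts) -> List[str]:
    return [] if abs(state) >= GOAL else ["left", "right"]


def simulate(state: State, action: str, facts: Facts) -> Tuple[State, float]:
    next_state = state + (1 if action == "right" else -1)
    # The toy world model predicts the goal reward only if it holds the relevant fact.
    knows_goal = "goal at +3" in facts
    return next_state, 1.0 if (knows_goal and next_state == GOAL) else 0.0


def value(state: State, facts: Facts) -> float:
    return 0.0  # uninformative value estimate at the depth limit


if __name__ == "__main__":
    print(lookahead(0, 3, frozenset({"goal at +3"}), propose, simulate, value))
    print(lookahead(0, 3, frozenset(), propose, simulate, value))
```

With the fact available, the search finds the three-step plan to the goal. Without it, every branch scores zero and the search is uninformed, which mirrors the abstract's point that test-time search helps most when grounded in facts.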

2507.09788 2026-06-10 cs.MA cs.AI cs.CL cs.HC 版本更新

TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit

TinyTroupe:一个基于LLM的多智能体人物模拟工具包

Paulo Salem, Robert Sim, Christopher Olsen, Prerit Saxena, Rafael Barcelos, Yi Ding

发表机构 * Microsoft Corporation(微软公司) Dipeak Technology(迪佩克技术)

AI总结 针对现有LLM多智能体系统在细粒度人物模拟方面的不足,提出TinyTroupe工具包,支持详细人物定义和程序化控制,用于行为研究和社会模拟。

Comments 9 pages

详情
AI中文摘要

近期大型语言模型(LLM)的进展催生了一类新的自主智能体,重新激发并扩展了该领域的兴趣。基于LLM的多智能体系统(MAS)因此涌现,既用于辅助也用于模拟目的,但用于现实人类行为模拟的工具——及其独特的挑战和机遇——仍不成熟。现有的MAS库和工具缺乏细粒度的人物规范、群体采样设施、实验支持以及集成验证等关键能力,限制了它们在行为研究、社会模拟及相关应用中的实用性。为解决这些不足,本文介绍了TinyTroupe,一个模拟工具包,支持详细的人物定义(如国籍、年龄、职业、个性、信念、行为)并通过众多LLM驱动的机制实现程序化控制。这使得能够简洁地表述实际感兴趣的行为问题,无论是个人还是群体层面,并提供了有效的解决方案。通过代表性工作示例(如头脑风暴和市场调研会议)展示了TinyTroupe的组件,同时阐明了其目的并证明了其实用性。还提供了选定方面的定量和定性评估,包括以真实人类行为作为对照的初步实验。结果突出了可能性、局限性和权衡。该方法虽然以特定的Python实现形式呈现,但旨在作为一种新颖的概念贡献,可以部分或完全融入其他环境中。该库以开源形式提供,网址为https://github.com/microsoft/TinyTroupe。

英文摘要

Recent advances in Large Language Models (LLMs) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation, with its distinctive challenges and opportunities, remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, either at the individual or group level, and provides effective means for their solution. TinyTroupe's components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, including preliminary experiments with real human behavior as control. Results highlight possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at https://github.com/microsoft/tinytroupe.

2510.04491 2026-06-10 cs.AI cs.CL 版本更新

Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents

不耐烦的用户混淆AI智能体:用于测试智能体的高保真人类特质模拟

Muyu He, Anand Kumar, Tsach Mackey, Meghana Rajeev, James Zou, Nazneen Rajani

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出TraitBasis方法,通过控制用户特质向量(如不耐烦、不连贯)对AI智能体进行压力测试,发现性能下降2%-30%,揭示当前智能体对用户行为变化的脆弱性。

Comments ACL 2026 [Oral]

详情
AI中文摘要

尽管构建对话式AI智能体取得了快速进展,但其鲁棒性在很大程度上仍未得到测试。用户行为的微小变化,例如更加不耐烦、不连贯或怀疑,可能导致智能体性能急剧下降,揭示了当前AI智能体的脆弱性。现有的基准测试未能捕捉到这种脆弱性:智能体在标准评估中可能表现良好,但在更真实和多样化的环境中却显著退化。我们通过引入TraitBasis来填补这一鲁棒性测试空白,这是一种轻量级、模型无关的方法,用于系统地对AI智能体进行压力测试。TraitBasis学习激活空间中的方向,这些方向对应于可引导的用户特质(例如不耐烦或不连贯),可以在推理时进行控制、缩放、组合和应用,无需任何微调或额外数据。使用TraitBasis,我们将τ-Bench扩展到τ-Trait,其中通过受控特质向量改变用户行为。我们观察到在τ-Trait上,前沿模型的平均性能下降2%-30%,突显了当前AI智能体对用户行为变化的鲁棒性不足。这些结果共同强调了鲁棒性测试的关键作用以及TraitBasis作为一种简单、数据高效且可组合工具的前景。通过驱动模拟压力测试和训练循环,TraitBasis为构建在真实人类交互的不可预测动态中保持可靠的AI智能体打开了大门。我们已在四个领域(航空、零售、电信和远程医疗)开源了τ-Trait,以便社区在现实、行为多样化的意图和特质场景下系统地对智能体进行质量保证:https://github.com/collinear-ai/tau-trait。

英文摘要

Despite rapid progress in building conversational AI agents, robustness is still largely untested. Small shifts in user behavior, such as being more impatient, incoherent, or skeptical, can cause sharp drops in agent performance, revealing how brittle current AI agents are. Today's benchmarks fail to capture this fragility: agents may perform well under standard evaluations but degrade spectacularly in more realistic and varied settings. We address this robustness testing gap by introducing TraitBasis, a lightweight, model-agnostic method for systematically stress testing AI agents. TraitBasis learns directions in activation space corresponding to steerable user traits (e.g., impatience or incoherence), which can be controlled, scaled, composed, and applied at inference time without any fine-tuning or extra data. Using TraitBasis, we extend $\tau$-Bench to $\tau$-Trait, where user behaviors are altered via controlled trait vectors. We observe on average a 2%-30% performance degradation on $\tau$-Trait across frontier models, highlighting the lack of robustness of current AI agents to variations in user behavior. Together, these results highlight both the critical role of robustness testing and the promise of TraitBasis as a simple, data-efficient, and compositional tool. By powering simulation-driven stress tests and training loops, TraitBasis opens the door to building AI agents that remain reliable in the unpredictable dynamics of real-world human interactions. We have open-sourced $\tau$-Trait across four domains: airline, retail, telecom, and telehealth, so the community can systematically QA their agents under realistic, behaviorally diverse intents and trait scenarios: https://github.com/collinear-ai/tau-trait.

5. 文本生成、摘要与编辑 5 篇

2606.10087 2026-06-10 cs.CL cs.LG 新提交

CodeAlchemy: Synthetic Code Rewriting at Scale

CodeAlchemy:大规模合成代码重写

Ankit Gupta, Aditya Prasad, Rameswar Panda

AI总结 提出CodeAlchemy框架,通过5种策略生成超过500B token的合成代码数据,引入DevEval和TraceEval基准,3B模型在多项任务上超越10倍大小的前沿模型。

详情
AI中文摘要

在原始代码上预训练可以学习语法,但为多样化的真实世界任务格式提供的信号稀疏。虽然合成数据已被证明对语言模型具有变革性,但代码领域除有限的质量改进外仍基本未被探索。我们提出CodeAlchemy,一个合成数据生成框架,通过5种策略将公开来源的代码转换为语义丰富的训练数据:CodeEnhance(质量感知重写)、CodeQA(基于模板的问题)、CodeDev(开发者任务)、CodeDialogue(多轮对话)和CodeTrace(执行轨迹)。我们处理了15种语言的3个语料库,生成了超过500B token的合成数据以及350B推理token,数量级远超先前工作。CodeTrace对14种语言和5K个库的1.3M+文件进行插桩和执行,捕获控制流、状态跟踪和库知识。我们引入了DevEval(开发者任务)和TraceEval(执行预测)基准;前沿模型如Claude Sonnet 4.5在TraceEval上仅达到5.6%的精确匹配,揭示了语义理解的关键差距。我们的3B模型在HumanEval上达到83.5%,在MBPP上达到63.2%,在DevEval上达到8.09%的胜率,在TraceEval上达到15.36 ROUGE-2,超越了包括27B Gemma-3和32B Granite-4.0在内的10倍大小的前沿模型。

英文摘要

Pre-training on raw code teaches syntax but provides sparse signal for diverse real-world task formats. While synthetic data has proven transformative for language models, code remains largely unexplored beyond limited quality improvements. We present CodeAlchemy, a synthetic data generation framework that transforms publicly sourced code into semantically-rich training data through 5 strategies: CodeEnhance (quality-aware rewriting), CodeQA (template-based problems), CodeDev (developer tasks), CodeDialogue (multi-turn conversations), and CodeTrace (execution traces). We process 3 corpora across 15 languages to generate 500B+ tokens of synthetic data plus 350B reasoning tokens, orders of magnitude more than prior efforts. CodeTrace instruments and executes 1.3M+ files across 14 languages and 5K libraries, capturing control flow, state tracking, and library knowledge. We introduce DevEval (developer tasks) and TraceEval (execution prediction) benchmarks; frontier models like Claude Sonnet 4.5 achieve only 5.6% exact match on TraceEval, revealing critical gaps in semantic understanding. Our 3B models achieve 83.5% on HumanEval, 63.2% on MBPP, 8.09% win rate on DevEval, and 15.36 ROUGE-2 on TraceEval, outperforming frontier models 10x the size including 27B Gemma-3 and 32B Granite-4.0.

2606.10302 2026-06-10 cs.CL 新提交

Where You Inject Diversity Matters: A Unified Framework for Diverse Generation

注入多样性的位置至关重要:统一框架下的多样化生成

Cheng Zhang, Rui Xin, Chudi Zhong

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校) University of Washington(华盛顿大学)

AI总结 提出统一框架,通过多样性源和传输分数衡量测试时多样化生成方法,并基于此提出全自动规范级方法,在五个开放任务中提升输出多样性且保持质量。

详情
AI中文摘要

开放式生成任务通常需要一组有意义的不同的输出,然而大型语言模型往往产生相似的生成结果。现有的测试时多样性方法在生成的不同阶段操作,效果各异,但尚不清楚哪些设计选择能导致输出中有意义的多样性。我们引入了一个框架,通过生成过程中引入的多样性源来表征测试时多样化生成方法,并提供了一个传输分数来衡量源中的变化在多大程度上有效传递到最终输出。在该框架指导下,我们提出了全自动规范级生成方法,首先生成多样化的中间规范,然后以它们为条件生成最终响应。在五个开放任务和四个骨干模型上,规范级注入在保持可比质量的同时,提高了输出多样性,超过了测试时基线。我们的分析表明,成功的多样性注入既取决于源的多样性,也取决于它们向输出的传输,这突显了源设计和源到输出的实现是构建更多样化生成系统的两个关键杠杆。

英文摘要

Open-ended generation tasks often require a set of meaningfully different outputs, yet large language models often produce similar generations. Existing test-time diversity methods operate at different stages of generation with varying effectiveness, but it remains unclear what design choices lead to meaningful diversity in the output. We introduce a framework that characterizes test-time diverse generation methods by the diversity source introduced during generation and provide a transmission score for measuring how effectively variation in the source reaches the final output. Guided by this framework, we propose fully automated specification-level generation methods that first generate diverse intermediate specifications and then condition on them to produce final responses. Across five open-ended tasks and four backbone models, specification-level injection improves output diversity over test-time baselines while maintaining comparable quality. Our analysis shows that successful diversity injection depends on both the diversity of the sources and their transmission to the output, highlighting source design and source-to-output realization as two key levers for building more diverse generation systems.

2606.10327 2026-06-10 cs.CL cs.LG 新提交

The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

顺序重要:LLaMA的序列微调用于连贯的自动作文评分

Ali Keramati, Mark Warschauer

发表机构 * University of California, Irvine(加州大学尔湾分校)

AI总结 提出对LLaMA-3.1-8B进行任务感知的序列微调,按作文话语结构顺序训练,在PERSUADE 2.0语料上证据F1达65%、结论F1达87%,超越独立训练和70B基线,证明课程设计可提升自动作文评分性能。

详情
AI中文摘要

自动作文评分(AES)系统必须判断相互依赖的话语元素(如引言、主张、证据、结论),但大多数方法孤立地处理这些元素,损害了连贯性和泛化能力。我们研究了对LLaMA-3.1-8B进行任务感知的微调,用于AES,使用参数高效的LoRA和4位量化,并比较了三种训练课程:(i)序列式(依次在引言、立场、主张、证据、结论上微调),(ii)独立式(任务特定模型),以及(iii)随机式(打乱的多任务)。在PERSUADE 2.0语料上的实验表明,建模任务依赖性很重要:序列微调取得了最强的整体结果,包括证据的F1分数65%和结论的87%,以及相应的准确率63%和85%,超越了独立训练,并且在结论上优于通用LLaMA-70B基线,尽管后者容量大得多。随机训练提高了立场评分(57% F1),但在其他地方一致性较差。这些发现表明:(1)与话语结构对齐的课程设计可以实质性地改善AES,以及(2)小型、任务优化的模型可以与显著更大的大型语言模型(LLM)竞争,为可扩展、成本效益高的评估提供了实用途径。我们发布模板和实现细节,以促进复现和未来在教育NLP中课程设计的工作。

英文摘要

Automated Essay Scoring (AES) systems must judge interdependent discourse elements (e.g., lead, claim, evidence, conclusion), yet most approaches treat these in isolation, harming coherence and generalization. We investigate task-aware fine-tuning of LLaMA-3.1-8B for AES using parameter-efficient LoRA with 4-bit quantization and compare three training curricula: (i) Sequential (progressively fine-tuning on lead, then position, then claim, then evidence, then conclusion), (ii) Independent (task-specific models), and (iii) Randomized (shuffled multi-task). Experiments on the PERSUADE 2.0 corpus show that modeling task dependencies matters: Sequential fine-tuning yields the strongest overall results, including F1 scores of 65% (evidence) and 87% (conclusion) and corresponding accuracies of 63% and 85%, surpassing Independent training and outperforming a general-purpose LLaMA-70B baseline on conclusion despite its far larger capacity. Randomized training improves position scoring (57% F1) but is less consistent elsewhere. These findings indicate that (1) curriculum design aligned with discourse structure can materially improve AES, and (2) small, task-optimized models can be competitive with substantially larger Large Language Models (LLMs), offering a practical path to scalable, cost-effective assessment. We release templates and implementation details to facilitate reproduction and future work on curriculum design for educational NLP.

2606.09852 2026-06-10 cs.HC cs.AI cs.CL cs.LG cs.MA cs.SE 交叉投稿

LLM-Based Code Documentation Generation and Multi-Judge Evaluation

基于LLM的代码文档生成与多裁判评估

Ikbel Ghrab, Mohamed Dhieb, Ismail Khenissi, Ines Abdeljaoued-Tej

发表机构 * University of Tunis El Manar(突尼斯埃尔马纳尔大学)

AI总结 提出利用八种大语言模型自动生成代码文档,并通过多裁判评估框架(四个LLM从九个维度评分)提升文档质量,在医学物理库上实验显示最佳与最差模型性能差距达42%。

Comments ICAHS, © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

详情
Journal ref
Conference ICAHS IEEE, 2025
AI中文摘要

高质量的源代码文档至关重要但往往被忽视,尤其是在医疗保健等关键领域,可靠性和可维护性至关重要。我们提出了一个AI驱动的框架,利用八种最先进的大语言模型(包括GPT、Gemini、Qwen和LLaMA变体)自动从代码和仓库生成文档。该系统基于PocketFlow编排框架,采用模块化流水线和高级提示工程,生成结构化、上下文感知的文档。为确保质量并指导模型选择,我们引入了MultiLLMasJudges评估框架,其中四个独立的LLM从九个标准(如完整性、清晰度和忠实度)评估输出。在开源医学物理库上进行的实验表明,最佳和最差模型之间的性能差距为42%。通过结合多样化的模型输出、优化的提示和严格的评估,我们的方法提高了文档质量并减少了人工工作量,特别是在安全关键的医疗软件中。

英文摘要

High-quality source code documentation is vital yet often neglected, especially in critical domains like healthcare where reliability and maintainability are essential. We present an AI-powered framework that automates documentation generation from code and repositories using eight state-of-the-art Large Language Models (LLMs), including GPT, Gemini, Qwen, and LLaMA variants. Built on the PocketFlow orchestration framework, the system applies modular pipelines and advanced prompt engineering to produce structured, context-aware documentation. To ensure quality and guide model selection, we introduce a MultiLLMasJudges evaluation framework, where four independent LLMs assess outputs across nine criteria, such as Completeness, Clarity, and Faithfulness. Experiments conducted on an open-source medical physics library showed a 42% performance gap between the top and bottom models. By combining diverse model outputs, optimized prompting, and rigorous evaluation, our approach enhances documentation quality and reduces manual effort, especially in safety-critical healthcare software.
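
One simple way to combine scores from several LLM judges across several criteria is to average per criterion and then across criteria. The sketch below illustrates that idea only. The abstract does not state how MultiLLMasJudges aggregates scores or how the 42% gap is computed, so the unweighted mean, the function names, and the example scores are all assumptions.

```python
from statistics import mean
from typing import Dict

# judge name -> criterion name -> score
JudgeScores = Dict[str, Dict[str, float]]


def aggregate(scores: JudgeScores) -> Dict[str, float]:
    """Average each criterion over the judges that scored it (ASSUMPTION: unweighted mean)."""
    criteria = sorted({c for per_judge in scores.values() for c in per_judge})
    return {c: mean(s[c] for s in scores.values() if c in s) for c in criteria}


def overall(scores: JudgeScores) -> float:
    """Single quality score for one generated document: mean over criterion averages."""
    return mean(aggregate(scores).values())


if __name__ == "__main__":
    # Invented scores for one documentation sample, two judges, two criteria.
    demo = {
        "judge_a": {"Completeness": 4.0, "Clarity": 2.0},
        "judge_b": {"Completeness": 2.0, "Clarity": 4.0},
    }
    print(aggregate(demo), overall(demo))
```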

2606.10199 2026-06-10 cs.LG cs.CL 交叉投稿

A Continuous-Time Markov Chain Framework for Insertion Language Models

插入语言模型的连续时间马尔可夫链框架

Dhruvesh Patel, Benjamin Rozonoyer, Soumitra Das, Tahira Naseem, Tim G. J. Rudner, Andrew McCallum

AI总结 提出基于连续时间马尔可夫链的插入语言模型去噪框架,统一现有方法,在规划任务中优于自回归和掩码扩散模型,语言建模中与现有方法竞争且采样更灵活。

Comments Accepted at AISTATS 2026. Code is available at https://github.com/dhruvdcoder/ctmc_dilm

详情
AI中文摘要

插入语言模型(ILMs)相比从左到右生成和基于掩码的生成具有若干优势。然而,现有的插入式生成公式大多是临时性的。在本文中,我们通过将噪声过程建模为变长序列空间上的连续时间马尔可夫链,从第一性原理推导出ILMs的扩散式去噪目标。我们表明,先前的ILMs公式可以视为该去噪框架的特例。通过在合成规划任务上的实证评估,我们展示了所提出的方法保留了插入式生成相对于从左到右生成和掩码扩散模型的优势。在语言建模中,我们的基于扩散的方法与从左到右生成和掩码扩散模型具有竞争力,同时与现有的插入语言模型相比,在采样方面提供了额外的灵活性。

英文摘要

Insertion Language Models (ILMs) offer several advantages over left-to-right generation and mask-based generation. However, existing formulations of insertion-based generation have largely been ad-hoc. In this paper, we derive a diffusion-style denoising objective for ILMs from first principles by formulating the noising process as a continuous-time Markov chain on the space of variable-length sequences. We show that previous formulations of ILMs can be viewed as special cases of this denoising framework. Through empirical evaluation on a synthetic planning task, we show that the proposed approach retains the benefits of insertion-based generation over left-to-right generation and masked diffusion models. In language modeling, our diffusion-based approach is competitive with left-to-right generation and masked diffusion models, while offering additional flexibility in sampling compared to existing insertion language models.

6. 语义、语法与语言学分析 6 篇

2606.10467 2026-06-10 cs.CL 新提交

Large Language Models as Modal Models in Linguistics

大语言模型作为语言学中的模态模型

Haruto Suzuki, Saku Sugawara

发表机构 * Keio University(庆应义塾大学) National Institute of Informatics(国立信息学研究所) University of Tokyo(东京大学)

AI总结 本文应用科学哲学中的模态建模框架,论证大语言模型作为最小模型具有真正的认知价值,能提供“如何可能解释”,但当前尚不满足“如何实际解释”的条件,其解释力位于两者之间的连续统上。

详情
AI中文摘要

大语言模型(LLMs)的快速发展加剧了关于它们对语言学理论重要性的争论。这些争论通常分为三种立场:绝缘主义,认为LLMs与人类语言无关;消除主义,声称LLMs可以取代传统语言学理论;以及调和主义,将LLMs视为语言学研究的有用工具。为澄清这些立场,本文应用了科学哲学中的模态建模框架。我们认为,即使没有与人类认知的结构对应,LLMs作为最小模型也具有真正的认知价值。特别是,它们可以通过测试关于语言习得和语言能力的模态主张来提供“如何可能解释”(HPEs)。然后,我们基于科学解释的机制说明,考察了LLMs有资格成为人类语言的“如何实际解释”(HAEs)的条件。我们认为当前的LLMs尚未满足这些要求。在此分析基础上,我们提出将LLMs的解释力理解为位于HPEs和HAEs之间的连续统上。这一框架既避免了夸大也避免了低估它们的解释意义,并为评估LLMs在语言科学研究中的作用提供了更精确的基础。

英文摘要

The rapid advancement of large language models (LLMs) has intensified debates about their significance for linguistic theory. These debates are commonly divided into three positions: insulationism, which regards LLMs as irrelevant to human language; eliminativism, which claims that LLMs can replace traditional linguistic theories; and conciliationism, which views them as useful tools for linguistic research. To clarify these positions, this paper applies the framework of modal modeling from the philosophy of science. We argue that LLMs possess genuine epistemic value as minimal models, even without structural correspondence to human cognition. In particular, they can provide how-possibly explanations (HPEs) by testing modal claims about language acquisition and linguistic competence. We then examine the conditions under which LLMs could qualify as how-actually explanations (HAEs) of human language, drawing on the mechanistic account of scientific explanation. We argue that current LLMs do not yet satisfy these requirements. On the basis of this analysis, we propose understanding the explanatory power of LLMs as lying on a continuum between HPEs and HAEs. This framework avoids both overstating and understating their explanatory significance and offers a more precise basis for evaluating the role of LLMs in the scientific study of language.

2606.11018 2026-06-10 cs.CL 新提交

Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer

测量社交媒体文本中的人类价值表达:校准的LLM标注与编码器迁移

Maria Milkova, Maksim Rudnev

发表机构 * Independent researcher, Lisbon, Portugal(独立研究者,里斯本,葡萄牙) University of Waterloo, ON, Canada(滑铁卢大学,安大略省,加拿大)

AI总结 本研究使用校准的LLM标注和软标签训练,将基于Schwartz人类基本价值理论的社交媒体文本价值表达迁移到编码器模型,实现可扩展预测。

详情
AI中文摘要

测量自然发生的社交媒体文本中的主观构念需要标注程序在理论上具有基础、经验上得到验证,并且可迁移到编码器模型以进行可扩展预测。使用根据Schwartz人类基本价值理论标注的非英语社交媒体帖子,我们研究了不同的LLM、提示和指令语言如何操作化文本中的价值表达。我们认为,尽管文本可能允许多种合理解释,但基于理论的价值定义可以约束解释并减少虚假的价值归因。除了精确率、召回率和F1分数,我们还评估了价值之间的结构对齐、错误结构、置信度-模糊度关系以及标注稳定性。我们表明,不同的LLM产生不同的价值解释。通过错误分析进行迭代提示校准减少了错误归因并提高了与专家标注的对齐。我们还从反复出现的错误结构中推导出有针对性的专家验证规则,并在语料库标注中使用它们。最后,我们表明,通过软标签训练,LLM标注可以迁移到编码器模型,保留基于理论的价值解释以及价值表达中不确定性的信息。

英文摘要

Measuring subjective constructs in naturally occurring social media text requires annotation procedures that are theoretically grounded, empirically validated, and transferable to an encoder model for scalable prediction. Using non-English social media posts annotated according to Schwartz's theory of basic human values, we investigate how different LLMs, prompts, and instruction languages operationalize the expression of values in text. We argue that although texts may permit multiple plausible interpretations, theory-based value definitions can constrain interpretations and reduce spurious value attributions. Beyond precision, recall, and F1, we evaluate structural alignment between values, error structure, confidence-ambiguity relations, and annotation stability. We show that different LLMs produce different value interpretations. Iterative prompt calibration through error analysis reduces misattributions and improves alignment with expert annotations. We also derive targeted expert verification rules from recurrent error structures and use them during corpus annotation. Finally, we show that LLM annotations can be transferred to an encoder model through soft-label training, retaining theory-based value interpretations and information about uncertainty in value expression.
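
Soft-label training, mentioned above as the way LLM annotations are transferred to an encoder, typically replaces one-hot targets with a probability distribution in the cross-entropy loss. The following is a dependency-free sketch of that loss under this assumption. The abstract does not give the paper's exact objective, and the function names are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_cross_entropy(logits: Sequence[float], target: Sequence[float]) -> float:
    """Cross-entropy between a soft target distribution and the model's softmax output.

    With a one-hot target this reduces to the usual hard-label cross-entropy.
    """
    if len(logits) != len(target):
        raise ValueError("logits and target must have the same length")
    if abs(sum(target) - 1.0) > 1e-6 or min(target) < 0:
        raise ValueError("target must be a probability distribution")
    probs = softmax(logits)
    # Zero-probability target entries contribute nothing to the loss.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


if __name__ == "__main__":
    # Illustrative case: annotation says the value is expressed with 70% confidence.
    print(soft_label_cross_entropy([0.5, -0.5], [0.7, 0.3]))
```

Because the target keeps the annotator's uncertainty, an encoder trained this way is penalized for being overconfident on ambiguous posts. This is consistent with the abstract's claim that information about uncertainty is retained.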

2606.10059 2026-06-10 cs.FL cs.CL 交叉投稿

Compiling Rewrite Rules to Finite-State Transducers with the Worsening Trick

使用恶化技巧将重写规则编译为有限状态转换器

Mans Hulden, Michael Ginn

发表机构 * New College of Florida(佛罗里达新学院) University of Colorado(科罗拉多大学)

AI总结 提出基于“恶化技巧”的紧凑编译方案,将重写规则编译为有限状态转换器,支持多种上下文和重写模式,实现简单且易于扩展。

Comments 17 pages, 6 figures, tool track proceedings at CIAA 2026

详情
AI中文摘要

有限状态转换器(FST)对于计算语言学和自然语言处理(NLP)中的字符串重写建模至关重要,特别是对于音韵和形态重写规则。编译形式为 $A \to B / L \, \_ \, R$ 的一般重写规则(其中 $A$、$B$、$L$ 和 $R$ 是任意正则语言)由于重叠匹配和上下文约束而复杂。传统方法(如 Kaplan 和 Kay 或 Karttunen 的方法)依赖于带有辅助标记的复杂转换器组合。本文提出了一种基于“恶化技巧”的紧凑编译方案:生成所有合法的重写候选,然后滤除那些对于相同输入比其他候选更差的候选。该构造作为 PyFoma 中的内置重写编译器实现,支持多个上下文、任意转换、标记、定向重写、权重和并行重写。得到的公式简短且统一,并且在语义一致的情况下,它们重现了与早期方法相同的规则转换器,同时更易于扩展。该实现已在大量重写语法集合和涵盖主要重写模式的自动回归测试套件上针对 foma 进行了验证,得到的转换器除了状态编号外完全匹配。

英文摘要

Finite-state transducers (FSTs) are essential for modeling string rewriting in computational linguistics and natural language processing (NLP), particularly for phonological and morphological rewrite rules. Compiling general rewrite rules of the form $A \to B / L \, \_ \, R$, where $A$, $B$, $L$, and $R$ are arbitrary regular languages, is complex due to overlapping matches and context constraints. Traditional methods, such as those by Kaplan and Kay or Karttunen, rely on intricate transducer compositions with auxiliary markers. This paper presents a compact compilation scheme based on the "worsening trick": generate all legal rewrite candidates, then filter candidates that are worse than another candidate for the same input. Implemented as the built-in rewrite compiler in PyFoma, the construction supports multiple contexts, arbitrary transductions, markup, directed rewriting, weights, and parallel rewriting. The resulting formulas are short and uniform, and where semantics coincide, they reproduce the same rule transducers as earlier approaches while remaining easier to extend. The implementation has been validated against foma on both a substantial collection of rewrite grammars and an automated regression suite covering the major rewrite modalities, with the resulting transducers matching exactly apart from state numbering.
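To make the rule notation $A \to B / L \, \_ \, R$ concrete, here is a minimal Python sketch of what such a rule means for plain literal strings. It is not the paper's construction and does not use PyFoma: the helper name `apply_rewrite` is hypothetical, and it assumes that A, L, and R are fixed literal strings rather than arbitrary regular languages, with contexts checked against the input string.

```python
import re


def apply_rewrite(text, a, b, left, right):
    """Rewrite literal `a` as `b` wherever it occurs between `left` and `right`.

    Hypothetical helper for illustration only: contexts are matched with
    regex lookarounds, so they are tested against the original input.
    """
    pattern = (
        "(?<=" + re.escape(left) + ")"   # left context L (not consumed)
        + re.escape(a)                   # the rewritten material A
        + "(?=" + re.escape(right) + ")"  # right context R (not consumed)
    )
    # A function replacement keeps `b` literal (no backslash escapes).
    return re.sub(pattern, lambda match: b, text)


# The rule a -> b / c _ d applied to two candidate sites.
result = apply_rewrite("cad cab", "a", "b", "c", "d")
print(result)  # cbd cab
```

Only the first "a" sits between "c" and "d", so only that occurrence is rewritten. A real compiler has to handle regular-language operands and overlapping matches, which is the difficulty the abstract describes.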

2602.17907 2026-06-10 cs.CL cs.AI 版本更新

Improving Topic Modeling by Distilling Soft Labels from Language Models

DSL-Topic:通过从语言模型中蒸馏软标签改进主题建模

Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini

发表机构 * University of Washington(华盛顿大学)

AI总结 提出DSL框架,通过从语言模型蒸馏软标签来增强主题模型训练,利用上下文感知的软标签重构信号,显著提升主题连贯性和分配准确性。

Comments 22 pages, 5 figures. Camera-ready version for ICML 2026

详情
AI中文摘要

传统的神经主题模型通常通过重构文档的词袋表示来优化,忽略了上下文信息并面临数据稀疏性问题。在这项工作中,我们引入了一种新颖的主题模型训练框架,通过从语言模型中蒸馏软标签(DSL)。为了构建上下文丰富的重构信号,我们将基于特定提示的下一个词概率投影到预定义词汇表上,并使用语言模型隐藏状态训练主题模型重构软标签。这产生了更高质量的主题,与语料库的潜在主题结构更加紧密对齐。大量实验表明,DSL在主题连贯性和分配准确性上相比现有基线取得了显著改进。此外,我们还引入了一种基于检索的指标,显示我们的方法在识别语义相似文档方面显著优于现有方法,突显了其在面向检索应用中的有效性。

英文摘要

Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we introduce a novel topic model training framework by Distilling Soft Labels (DSL) from Language Models (LMs). To construct the contextually enriched reconstruction signals, we project the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary, and train the topic models to reconstruct the soft labels using the LM hidden states. This produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Extensive experiments demonstrate that DSL achieves substantial improvements in topic coherence and assignment accuracy over existing baselines. Additionally, we also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.
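The projection step can be pictured with a small Python sketch. It assumes, purely for illustration, that next-token probabilities are already available as a word-to-probability mapping and that "projecting" means restricting to the pre-defined vocabulary and renormalizing; the abstract does not specify the actual procedure. The function names `project_to_vocab` and `soft_cross_entropy` are hypothetical.

```python
import math


def project_to_vocab(token_probs, vocab):
    """Restrict a next-token distribution to `vocab` and renormalize it.

    Assumption: tokens outside the vocabulary are simply dropped.
    """
    mass = {word: token_probs.get(word, 0.0) for word in vocab}
    total = sum(mass.values())
    if total == 0.0:
        raise ValueError("no probability mass falls inside the vocabulary")
    return {word: value / total for word, value in mass.items()}


def soft_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft-label target and a predicted distribution."""
    return -sum(t * math.log(predicted[word] + eps) for word, t in target.items())


# Toy next-token distribution; "the" is outside the topic vocabulary.
soft_label = project_to_vocab({"cat": 0.2, "dog": 0.2, "the": 0.6}, ["cat", "dog"])
loss = soft_cross_entropy(soft_label, {"cat": 0.5, "dog": 0.5})
```

In this toy case the soft label puts equal mass on "cat" and "dog", and a topic model would be trained to lower the cross-entropy against such targets instead of against raw bag-of-words counts.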

2604.14397 2026-06-10 cs.CL cs.AI 版本更新

Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection

基于词典的跨语言语义投影生成概念词汇化

David Basil, Chirooth Girigowda, Bradley Hauer, Sahir Momin, Ning Shi, Grzegorz Kondrak

发表机构 * University of Toronto(多伦多大学)

AI总结 提出一种通过语义投影将英语WordNet概念扩展到新语言的方法,利用双语词典增强对齐并过滤错误投影,在多个语言上提升了精度且保持可解释性和资源效率。

Comments Paper presented at Canadian AI 2026

详情
AI中文摘要

我们研究通过词义生成自动将WordNet风格的词汇资源扩展到新语言的任务。我们通过语义投影将目标语言词条与现有词汇概念关联来生成词义。给定一个带有词义标注的英语语料库及其翻译,我们的方法将标注的同义词集投影到对齐的目标语言标记上,并将相应的词条分配给这些同义词集。为了生成对齐并确保其质量,我们使用双语词典增强预训练的基础对齐器,该词典也用于过滤不正确的语义投影。我们在多种语言上评估该方法,将其与先前方法以及基于词典和大型语言模型的基线进行比较。结果表明,所提出的投影-过滤策略在保持可解释性和资源效率的同时提高了精度。我们在 https://github.com/UAlberta-NLP/ExpandNet 上发布代码、文档和生成的词义清单。

英文摘要

We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects the annotated synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate alignments and ensure their quality, we augment a pretrained base aligner with a bilingual dictionary, which is also used to filter incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and resource-efficient. We release our code, documentation, and generated sense inventories at https://github.com/UAlberta-NLP/ExpandNet.

2606.09543 2026-06-10 cs.CL 版本更新

From Genes to Tokens: a GWAS-inspired Approach for Interpretable Stylometric Analysis

从基因到词元:受GWAS启发的可解释风格计量分析方法

Dmitry Pronin, Evgeny Kazartsev

发表机构 * HSE University(莫斯科国立高等经济大学)

AI总结 受全基因组关联研究启发,提出一种通过逻辑回归和多重比较校正检测作者独特词汇标记的风格计量方法,在英、德、俄语语料中验证有效。

详情
AI中文摘要

这篇短文介绍了一种受全基因组关联研究(GWAS)启发的风格计量解释方法。每个“基因”词元与“表型”作者身份的关联通过逻辑回归进行检验,并进行了多重比较校正。将该方法应用于英语、德语和俄语语料库,检测出了个体作者特有的统计显著的词汇标记。

英文摘要

This short paper introduces a stylometric interpretation method inspired by genome-wide association studies (GWAS). Each "gene" token's association with "phenotype" authorship is tested using logistic regression with multiple-comparison correction. Applied to English, German, and Russian corpora, the method detects statistically significant lexical markers distinctive of individual authors.
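Testing every token separately produces many p-values, which is why a multiple-comparison correction is needed. The abstract does not say which correction is used; the sketch below assumes the Benjamini-Hochberg procedure as one common choice, and takes the per-token p-values (for example from the logistic-regression tests) as given. The function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if it survives BH correction.

    Assumed procedure: find the largest rank k whose sorted p-value satisfies
    p_(k) <= (k / m) * alpha, then reject every hypothesis ranked up to k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    cutoff = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * alpha:
            cutoff = rank  # keep the largest rank that passes

    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[index] = True
    return rejected


# Toy p-values for five candidate marker tokens.
significant = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6])
print(significant)  # [True, True, False, False, False]
```

Here only the two smallest p-values pass their rank-scaled thresholds (0.01 and 0.02), so only those two tokens would be reported as authorship markers.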

7. 多模态语言处理 8 篇

2606.10803 2026-06-10 cs.CL cs.AI cs.CV 新提交

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

超越API:探索多模态大语言模型在物理工具使用中的极限

Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

发表机构 * Singapore Management University(新加坡管理大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出PhysTool-Bench基准,评估多模态大语言模型在真实场景中识别物理工具并规划使用的能力,发现最强模型仅完成21%任务,揭示感知与规划双重缺陷。

详情
AI中文摘要

多模态大语言模型(MLLMs)在利用数字API方面表现出色,并日益成为具身AI的“大脑”,指导机器人与物理世界交互。在这种具身环境中,核心能力之一是使用物理工具,这支撑着MLLMs在现实任务中协助人类的能力。尽管重要性显著,MLLMs在物理工具使用方面的熟练程度仍基本未被探索。为填补这一空白,我们引入了PhysTool-Bench,这是首个评估MLLMs理解真实场景、识别物理工具并规划其使用能力的物理工具使用基准。PhysTool-Bench包含2,510个查询,覆盖2,678个真实世界物理工具,涉及制造、电气作业、农业和医疗等多个领域。具体而言,模型沿两个主要维度进行评估:1)识别场景中所有存在的物理工具,2)根据指令和视觉上下文规划工具选择和使用顺序。在13个领先的MLLMs中,即使最强的模型(Gemini-3.1-Pro)也只能识别场景中58.7%的工具,并仅完成21.0%的端到端查询。我们的分析揭示了两个层面的缺陷:MLLMs难以在真实场景中感知工具,而规划阶段更大的下降进一步表明缺乏将感知到的工具映射到任务语义的功能常识,这指出了发展实用具身AI的关键瓶颈。

英文摘要

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

2606.09846 2026-06-10 cs.HC cs.AI cs.CL 交叉投稿

CANVAS: Captioning Art with Narrative Visual-Audio AI Systems

CANVAS: 用叙事视觉音频AI系统为艺术配文

Vignesh Nagarajan

发表机构 * BASIS Phoenix High School(BASIS凤凰高中)

AI总结 提出一种自动化工作流,利用大语言模型和文本转语音服务生成多感官艺术描述和同步音频解说,在20秒内以低于0.05美元的成本生成文本加音频输出,显著提高词汇多样性和叙事细节。

Comments 22 pages, 16 figures, 3 tables, 21 references

详情
AI中文摘要

由于替代文本简短或缺失,视觉艺术在很大程度上仍对盲人和低视力(BLV)观众不可及,这些文本很少传达艺术品的感官、空间或情感特质。本研究提出了一种自动化工作流,利用大语言模型和文本转语音服务生成多感官艺术描述和同步音频解说。该系统通过Zapier编排,将上传的图像转换为丰富的叙事字幕,无需人工干预,从而实现可访问媒体的快速、规模化生产。对50件艺术品的定量评估显示,AI生成的描述在词汇多样性、形容词密度和叙事细节方面显著高于基线字幕,同时保持可比的易读性水平。统计检验(t检验、方差分析)确认了丰富度和长度方面的显著差异,完整流水线在每张图像20秒内生成文本加音频输出,成本低于0.05美元。研究结果表明,自动字幕生成可以弥合博物馆和数字馆藏可访问性方面的差距,对更广泛的公众参与具有意义。未来工作可纳入BLV参与者的用户研究,以评估理解、偏好和最佳解释性语言水平。

英文摘要

Visual art remains largely inaccessible to blind and low-vision (BLV) audiences due to brief or absent alt-text, which rarely conveys the sensory, spatial, or emotional qualities of an artwork. This study presents an automated workflow that generates multi-sensory art descriptions and synchronized audio narration using large language models and text-to-speech services. The system, orchestrated through Zapier, converts uploaded images into rich narrative captions without human intervention, enabling rapid, scalable production of accessible media. Quantitative evaluation across 50 artworks shows that AI-generated descriptions contain significantly higher lexical diversity, adjective density, and narrative detail than baseline captions, while maintaining comparable readability levels. Statistical tests (t-tests, ANOVA) confirm meaningful differences in richness and length, and the full pipeline produces text-plus-audio outputs in under 20 seconds per image at a cost below $0.05. Findings demonstrate that automated captioning can bridge gaps in museum and digital-collection accessibility, with implications for broader public engagement. Future work can incorporate user studies with BLV participants to assess comprehension, preference, and optimal levels of interpretive language.

2606.10147 2026-06-10 cs.AI cs.CL cs.CV cs.SD 交叉投稿

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

从感知到决策:多模态大语言模型中听觉与视觉感知的信息流

Wish Suharitdamrong, Muhammad Awais, Xiatian Zhu, Sara Atito

AI总结 研究多模态大语言模型(AVLLMs)中音频和视觉信息流的路径与整合机制,发现顺序流与并行流两种路由模式,并证明信息传递后可丢弃无关token以提升效率。

Comments 40 pages, 29 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)能够听和看,但音频和视觉信号实际上如何通过网络传播以形成答案?尽管它们在研究和实际应用中的作用日益增长,但音频和视觉标记影响最终预测的内部路径仍然知之甚少。在本研究中,我们考察了音频-视觉大语言模型(AVLLMs)内部的音视频信息流,追踪了AVLLMs如何在两种输入配置(音视频视频和多个交错音视频项目)下路由、利用和整合音频与视觉信息。我们发现,对于音视频视频,AVLLMs遵循为VLMs和VideoLLMs建立的顺序信息流路径,音频和视觉贡献沿着该路径按任务对每种模态的依赖程度成比例流动。在多个交错音视频项目的设置中,这种路由转变为不同的并行流。此外,我们证明,一旦音频-视觉和其他类型的标记的信息被传递到LLM,它们可以被丢弃,对模型的预测影响最小甚至略有改善,这适用于多个任务和数据集,从而实现更高效的推理。这些发现适用于多个模型和规模,包括3B和7B规模的Qwen2.5-Omni和Video-SALMONN2 Plus,从而产生了关于这些流结构为何出现的假设。总之,这些结果首次清晰地描绘了AVLLMs如何在网络内部协调声音和视觉,并为音频-视觉及更广泛的MLLMs在可解释性、设计和效率方面的下一波进展奠定了基础。

英文摘要

Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.

2606.10461 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

ERAlign: Energy-based Representation Alignment of GNNs and LLMs on Text-attributed Graphs

ERAlign: 文本属性图上GNN与LLM的基于能量的表示对齐

Xianlin Zeng, Fan Xia, Xiangyu Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出ERAlign框架,利用能量模型对齐GNN和LLM的表示,通过能量差异优化实现分布一致性,在8个数据集上取得最优性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

文本属性图(TAGs)将文本节点属性与图结构相结合,以描述丰富的关联语义。最近整合图神经网络(GNNs)和大语言模型(LLMs)的努力在TAGs学习上显示出前景,但实现良好对齐的表示仍然具有挑战性。先前的研究主要依赖于执行粗粒度匹配的启发式方法。它们缺乏足够的约束,忽略了分布对齐,导致表示漂移和泛化能力有限。基于能量模型(EBMs),我们提出了一种基于能量的表示对齐(ERAlign)框架,该框架将GNN编码的图结构和LLM导出的文本嵌入投影到共享潜在空间,以实现分布一致性。具体来说,层间对齐通过距离度量量化,并通过EBM目标进行优化。通过降低能量值,我们的框架为下游任务产生良好对齐的表示。在训练过程中,我们引入能量差异(ED)以避免与难以处理的归一化相关的高采样成本。ED还具有更高的训练效率和减少能量景观失真的理论保证。在八个TAG数据集上的实证评估表明,ERAlign在不同监督水平和跨任务迁移场景下均获得了最先进的性能。

英文摘要

Text-attributed Graphs (TAGs) incorporate textual node attributes with graph structures to describe rich relational semantics. Recent efforts to integrate Graph Neural Networks (GNNs) and Large Language Models (LLMs) have shown promise for learning on TAGs, yet achieving well-aligned representations remains challenging. Prior studies largely rely on heuristics that perform coarse-grained matching. They lack sufficient constraints and ignore distributional alignment, leading to representation drift and limited generalization. Building on Energy-based Models (EBMs), we propose an Energy-based Representation Alignment (ERAlign) framework that projects GNN-encoded graph structure and LLM-derived text embeddings in a shared latent space to achieve distribution consistency. Concretely, layer-wise alignment is quantified by a distance metric and optimized via an EBM objective. By decreasing energy values, our framework yields well-aligned representations for downstream tasks. During training, we introduce Energy Discrepancy (ED) to avoid high sampling costs associated with intractable normalization. ED also carries theoretical guarantees of higher training efficiency and reduced energy landscape distortion. Empirical evaluations on eight TAG datasets demonstrate that ERAlign obtains state-of-the-art performance across varying levels of supervision and cross-task transfer scenarios.

2606.11176 2026-06-10 cs.CV cs.CL cs.CY cs.HC 交叉投稿

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

数据记者智能体:将数据转化为可验证的多模态故事

Kevin Qinghong Lin, Batu El, Yuhong Shi, Pan Lu, Philip Torr, James Zou

发表机构 * University of Oxford(牛津大学) Stanford University(斯坦福大学)

AI总结 提出多智能体框架Data2Story,通过证据链验证声明并自动生成多模态文章,在18篇文章上评估,证明其在透明性和可审计性上接近人类记者。

Comments Project page: https://data2story.github.io Github: https://github.com/QinghongLin/data2story-skill

详情
AI中文摘要

数据讲述塑造社会的故事;数据记者的工作是将原始信息转化为非专家可以信任的故事。一篇高质量的新闻专题需要新闻编辑室团队数周时间:寻找背景、运行统计、选择角度和设计视觉。最近的智能体在单个步骤上表现良好:数据科学智能体闭合分析循环,而设计智能体合成漂亮的网站。但是,一个智能体能否端到端地充当数据记者?我们引入了数据记者智能体(Data2Story),这是一个多智能体框架,将专业角色编排成一个虚拟新闻编辑室。Data2Story贡献了两项创新。(i)声明有证据支持:一个检查员将每个数字、角度和资产链接回数据、代码或外部参考。(ii)文章是多模态生成的:Data2Story不默认使用纯文本和静态图表,而是推理读者想看什么,然后部署多模态工具,例如用于地理的交互式地图和用于音乐的音频。我们在18篇文章上评估Data2Story,每篇都与原始发表的专家作品配对,沿着四个轴:(a)人类-智能体角度覆盖;(b)53名参与者在五个维度上的评分评估;(c)计算机使用智能体作为评委,一种节省成本的代理,用于衡量读者如何浏览交互式文章;(d)可验证性,其中编码验证器根据数据重新执行语句并对照参考检查声明。Data2Story产生有竞争力、证据可追溯的多媒体故事,在透明性和可审计性方面特别强。人类文章在编辑角度、创意设计和演示方面保持优势。我们将Data2Story定位为记者的合作者,实现更多基于证据、透明和可验证的报道。代码和演示可在 https://data2story.github.io 获取。

英文摘要

Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. Code and demos are available at https://data2story.github.io.

2604.23443 2026-06-10 cs.CL 版本更新

Revisiting Greedy Decoding for Visual Question Answering: A Calibration Perspective

重新审视视觉问答中的贪婪解码:一种校准视角

Boqi Chen, Xudong Liu, Yunke Ao, Jianing Qiu

发表机构 * ETH Zurich(苏黎世联邦理工学院) University of Toronto(多伦多大学) MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 针对视觉问答任务,从校准角度理论证明贪婪解码优于随机采样,并提出适用于推理模型的贪婪解码方法,实验验证其有效性。

详情
AI中文摘要

随机采样策略被广泛用于大型语言模型(LLMs)以平衡输出的连贯性和多样性。这些启发式方法通常被多模态大语言模型(MLLMs)沿用,却缺乏针对特定任务的论证。然而,我们认为随机解码对于视觉问答(VQA)可能不是最优的。VQA是一个封闭式任务,答案分布集中于头部,其不确定性通常是认知性的,源于缺失或模糊的视觉证据,而非存在多种合理的延续。在这项工作中,我们理论形式化了模型校准与预测准确性之间的关系,并推导出贪婪解码最优性的充分条件。大量实验提供了经验证据,表明贪婪解码在多个基准测试中优于随机采样。此外,我们提出了适用于推理模型的贪婪解码,在多模态推理场景中优于随机采样和标准贪婪解码。总体而言,我们的结果警示不要在MLLMs中天真地继承LLMs的解码启发式方法,并表明贪婪解码可以成为VQA中高效且强大的默认选择。

英文摘要

Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive the sufficient conditions for greedy decoding optimality. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLMs decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
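The link between calibration and decoding can be illustrated with a toy calculation. This is not the paper's formalization. It assumes a perfectly calibrated model whose answer distribution p for one question equals the true probability that each answer is correct. Under that assumption greedy decoding is right with probability max(p), while sampling one answer from p is right with probability sum(p_i^2), which can never exceed max(p). Both function names are hypothetical.

```python
def greedy_accuracy(p):
    """Expected accuracy of always answering with the most probable option."""
    return max(p)


def sampling_accuracy(p):
    """Expected accuracy of sampling one answer from p.

    Answer i is drawn with probability p[i] and, under the calibration
    assumption, is correct with probability p[i].
    """
    return sum(q * q for q in p)


# A head-heavy answer distribution, as is typical for closed-ended questions.
head_heavy = [0.7, 0.2, 0.1]
print(round(greedy_accuracy(head_heavy), 2))    # 0.7
print(round(sampling_accuracy(head_heavy), 2))  # 0.54
```

The gap closes only when p is uniform or puts all its mass on one answer. This matches the intuition that sampling mainly costs accuracy when one answer dominates but the others still carry some probability.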

2510.04514 2026-06-10 cs.AI cs.CE cs.CL cs.CV stat.ME 版本更新

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

ChartAgent: 一种用于复杂图表问答中视觉基础推理的多模态智能体

Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso

发表机构 * J.P. Morgan AI Research(摩根大通人工智能研究)

AI总结 提出ChartAgent框架,通过迭代分解查询为视觉子任务并利用图表专用视觉工具(如绘制注释、裁剪区域)进行空间域推理,在ChartBench和ChartX上取得最先进性能,尤其对无标注图表提升显著。

Comments Accepted at ACL 2026 (Main Conference). Also presented as an oral paper at the NeurIPS 2025 Multimodal Algorithmic Reasoning Workshop (https://marworkshop.github.io/neurips25/)

详情
AI中文摘要

最近的多模态大语言模型在基于图表的视觉问答中显示出潜力,但在无标注图表上——即那些需要精确视觉解释而非依赖文本捷径的图表——其性能急剧下降。为了解决这个问题,我们引入了ChartAgent,一种新颖的智能体框架,它直接在图表的空间域内显式执行视觉推理。与文本思维链推理不同,ChartAgent通过专门的行动(如绘制注释、裁剪区域(例如分割饼图切片、隔离条形图)和定位坐标轴)迭代地将查询分解为视觉子任务,并主动操作和交互图表图像,使用图表专用视觉工具库来完成每个子任务。这种迭代推理过程密切模仿了人类理解图表的认知策略。ChartAgent在ChartBench和ChartX基准测试上达到了最先进的准确率,整体上比先前方法绝对提升高达16.07%,在无标注、数值密集的查询上提升17.31%。此外,我们的分析表明,ChartAgent (a) 在多种图表类型上有效,(b) 在不同视觉和推理复杂度水平上均取得最高分数,(c) 作为一个即插即用的框架,提升了多种基础LLM的性能。我们的工作是首批使用工具增强的多模态智能体展示图表理解中视觉基础推理的工作之一。

英文摘要

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, that is, charts that require precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent is (a) effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.

2605.07415 2026-06-10 cs.CV cs.CL 版本更新

ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

ChartREG++:面向多样化指代线索和多目标指代的图表指代表达式定位基准与改进

Tianhao Niu, Ziyu Han, Xuan Dong, Qingfu Zhu, Wanxiang Che

发表机构 * Research Center for Social Computing and Interactive Robotics(社会计算与交互机器人研究中心)

AI总结 针对现有图表指代表达式定位基准的局限,提出支持多种定位形式、多目标指代、多样化线索和图表类型的基准,并利用代码驱动合成流水线生成像素级实例掩码,训练实例分割模型集成到多模态定位框架,显著提升性能。

详情
AI中文摘要

指代表达式定位是视觉定位的核心问题,广泛用于视觉与语言模型的空间定位与推理诊断,但以往工作多聚焦于自然图像。相比之下,现有的图表指代表达式定位基准存在局限:(1) 大多采用边界框,限制了精细图表元素的定位精度;(2) 大多假设单个或两个指代目标实例,无法处理多实例目标指代;(3) 语言表达过度依赖文本线索或数据排名线索;(4) 仅覆盖狭窄的图表类型范围。为解决这些问题,我们引入了一个图表指代表达式定位基准,系统性地支持多种定位形式、多个指代目标、多样化定位线索和多种图表类型。在代表性多模态大模型上的结果揭示了显著的性能差距。我们进一步引入了一个代码驱动的合成流水线,利用绘图程序与渲染图表基元之间的固有对齐,跨图表元素类型和粒度生成像素级精确的实例掩码。我们使用合成掩码训练了一个实例分割模型,并将其集成到一个通用的多模态定位框架中。最终系统在我们的基准上持续优于基线,并很好地泛化到从ChartQA导出的真实图表定位基准。

英文摘要

Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing chart referring expression grounding-related benchmarks remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; and (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.

8. 语音语言联合与音频文本 14 篇

2606.10581 2026-06-10 cs.CL cs.SD eess.AS 新提交

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

ParaBridge: 弥合语音语言模型中的副语言感知与对话行为

Yuxiang Wang, Qinke Ni, Shengbo Cai, Wan Lin, Liqiang Zhang, Zhizheng Wu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Tencent Hunyuan(腾讯混元) Shenzhen Loop Area Institute(深圳河套学院) Amphion Technology Co., Ltd.(Amphion科技有限公司) Tsinghua University(清华大学)

AI总结 提出ParaBridge,一种同策略(on-policy)自我蒸馏方法,将推理阶段的副语言指令支架转化为稳定的模型行为,无需人工标注或外部奖励,显著提升语音语言模型对副语言线索的响应能力。

详情
AI中文摘要

语音携带的信息远不止文字:孩子的声音、恐惧的语气或嘈杂的背景都应引导一个足够胜任的语音对话助手给出不同的回复。当前的语音语言模型(SLM)能够识别此类副语言线索,但在开放式对话中常常忽略它们。我们观察到,在推理阶段使用简单的副语言指令支架可以缩小这种感知-行为差距,表明相关线索已潜在于模型中。然而,这种支架在多轮上下文和竞争指令下仍然脆弱。因此,我们提出ParaBridge,一种同策略(on-policy)自我蒸馏方法,将脆弱的推理时支架转化为稳定的模型行为。在训练过程中,支架仅作为临时的特权视图;无支架模型自行生成回复,而支架视图沿其轨迹提供密集的全词汇下一词元目标。这种监督教会模型非词汇线索何时应当影响回复,无需精心构建的对话、人工标签或外部奖励模型。在Qwen3-Omni-thinking上,ParaBridge将无支架的VoxSafeBench SAR从14.6%提升至40.3%,并将EchoMind平均评分从3.27提升至3.92。它还保留了通用能力,MMAU-Pro、VoiceBench和GPQA均与原始模型相差在0.4分以内。在训练分布之外,ParaBridge泛化到未见过的副语言线索,从面向安全的训练迁移到共情导向的对话,并在不同的SLM骨干上有效。

英文摘要

Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose ParaBridge, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from 14.6% to 40.3% and improves EchoMind average rating from 3.27 to 3.92. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within 0.4 points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.

2606.10654 2026-06-10 cs.CL 新提交

Speaker Group Encoding in Self-supervised Speech Recognition Models

自监督语音识别模型中的说话人群体编码

Felix Herron, Solange Rossato, Alexandre Allauzen, Benoit Favre, François Portet

发表机构 * MILES Team, LAMSADE, Université Paris Dauphine-PSL, France(法国巴黎多芬纳-PSL大学LAMSADE实验室MILES团队) GETALP Team, LIG, Université Grenoble Alpes, France(法国格勒诺布尔阿尔卑斯大学LIG实验室GETALP团队) NLP team, LIS, Aix-Marseille University, France(法国艾克斯-马赛大学LIS实验室NLP团队)

AI总结 研究自监督语音识别模型如何编码说话人群体信息,发现微调任务和公平性算法对不同类型群体信息的影响不同。

详情
Journal ref
Text, Speech, and Dialogue. TSD 2025. Lecture Notes in Computer Science(), vol 16029
AI中文摘要

我们研究了自监督语音识别模型(S3Ms)学习了关于说话人群体(SGs)的哪些信息。我们检查了S3Ms的几种状态:预训练、在说话人识别(SID)上微调、在自动语音识别(ASR)上微调,以及使用公平性增强算法进行ASR微调。我们发现S3Ms编码了关于几个说话人群体类别(SGCs)的信息,包括他们的性别、年龄、方言、种族以及是否为母语者。我们发现,针对SID的微调放大了某些SGCs,即那些方差更偏向语音性质的SGCs,尽管它没有放大其他SGCs,即那些方差更偏向语义性质的SGCs。另一方面,针对ASR的微调丢弃了语音变异的说话人群体信息(SGI),但保留了语义变异的SGI。我们发现,为改善公平性而设计的ASR算法改变了S3Ms中编码SGI的程度;然而,这主要适用于语音变异的SGCs,而对于语义变异的SGCs则不太适用。我们讨论了SGI如何被每一层编码,并识别了负责编码不同SGCs的嵌入子维度。最后,我们讨论了我们的发现如何有助于设计更公平的ASR算法。

英文摘要

We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech recognition (ASR), and ASR-finetuned using a fairness-enhancing algorithm. We find that S3Ms encode information about several speaker group categories (SGCs), including their gender, age, dialect, ethnicity, and whether they are a native speaker. We find that finetuning for SID amplifies certain SGCs, namely those whose variance is more phonetic in nature, though it does not amplify other SGCs, namely those whose variance is more semantic in nature. On the other hand, finetuning for ASR discards phonetically variant speaker group information (SGI) but retains semantically variant SGI. We find that ASR algorithms designed for fairness improvement change the extent to which SGI is encoded in S3Ms; however, this is primarily true for phonetically variant SGCs, and less true for semantically variant SGCs. We discuss how SGI is encoded by each layer, and identify subdimensions of embeddings responsible for encoding different SGCs. Finally, we discuss how our findings could be beneficial in designing fairer ASR algorithms.

2606.10675 2026-06-10 cs.CL eess.AS 新提交

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

基于自监督表示和学习动态规划的多语言词级强制对齐

Roy Weber, Meidan Zehavi, Rotem Rousso, Joseph Keshet

发表机构 * Faculty of Electrical and Computer Engineering, Technion – Israel Institute of Technology(以色列理工学院电气与计算机工程学院)

AI总结 提出一种结合自监督表示和学习动态规划的多语言词级强制对齐方法,通过融合MMS和UnSupSeg特征并学习词边界概率,在多个语言上超越现有方法。

Comments Interspeech 2026

详情
AI中文摘要

我们提出了一种准确的多语言词级强制对齐方法,包括一个对齐编码器和一个学习对齐解码器。编码器整合两种表示:一种来自大规模多语言语音(MMS)模型,另一种来自自监督音素边界检测器(UnSupSeg)。它学习融合这些表示,并在长时间上下文中估计词边界概率。对齐解码器是一种学习动态规划,它将编码器输出与基于MMS和UnSupSeg表示的段特征相结合,以推断最终词边界。在TIMIT和Buckeye上迭代训练后,所提方法在两个数据集上均优于Montreal Forced Aligner(MFA)和基于MMS的对齐方法。在未见语言(荷兰语、德语和希伯来语)上,所提模型的性能始终优于或与现有对齐方法相当,表明其有潜力在不进行进一步训练的情况下扩展到MMS支持的1100多种语言。

英文摘要

We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic programming that combines encoder outputs with segmental features over the MMS and UnSupSeg representations to infer final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model achieves performance consistently better than or on par with existing alignment approaches, indicating its potential to scale to 1100+ languages supported by MMS without further training.

2606.11167 2026-06-10 cs.CL eess.AS 新提交

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

全双工语音模型中的多面交互对齐

Atsumoto Ohashi, Neil Zeghidour, Alexandre Défossez, Eugene Kharitonov

发表机构 * Kyutai Gradium

AI总结 针对全双工对话模型交互性问题,提出基于强化学习的后训练对齐方法,从暂停处理、话轮转换、回馈和用户打断四个维度优化,并加入LLM奖励防止语义退化,在Moshi和PersonaPlex上取得一致改进。

AI中文摘要

全双工口语对话模型可以同时听和说,使其成为自然对话的有前途的架构。然而,当前模型仅通过令牌级似然最大化的监督学习进行训练,这并未直接优化交互级行为,导致交互性问题,如过度沉默和不合时宜的话轮转换。最近的工作应用强化学习(RL)来改善交互性,但现有方法仅在其奖励中处理有限的一组交互行为。在这项工作中,我们提出了一种后训练对齐方法,通过RL全面改善全双工口语对话模型的交互性。我们解决了交互性的四个典型轴:暂停处理、话轮转换、回馈和用户打断。对于每个轴,我们从人类对话语料库中提取短音频片段,并使用特定于轴的奖励函数优化模型。一个额外的基于LLM的响应质量奖励防止语义退化。我们将我们的方法应用于两个开源模型Moshi和PersonaPlex,在预录音频的离线评估和实时多轮对话评估中均显示出交互性的一致改进。

英文摘要

Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

2606.09553 2026-06-10 cs.CL cs.SD 新提交

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

OpenBibleTTS:面向低资源语言的大规模语音资源与TTS模型

David Guzmán, Luel Hagos Beyene, Jesujoba Oluwadara Alabi, Yejin Jeon, Dietrich Klakow, David Ifeoluwa Adelani

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(米拉-魁北克人工智能研究所) AIMS Research and Innovation Centre(AIMS研究与创新中心) NM-AIST Saarland University(萨尔大学) Canada CIFAR AI Chair(加拿大CIFAR人工智能教席)

AI总结 针对低资源语言TTS研究不足的问题,提出包含37种语言的OpenBibleTTS基准,系统比较多种TTS架构,发现无单一系统通用,并开源数据集与模型。

AI中文摘要

神经文本转语音(TTS)和多语言语音生成的最新进展显著提升了合成语音质量,但这些进步在全球语言中分布不均。现有模型仍由少数高资源语言主导,而许多低资源TTS研究是在人工降采样的高资源语料库上模拟的,未能反映真正低资源环境中的正字法变化和有限的音系覆盖。为此,我们引入OpenBibleTTS,这是一个涵盖37种低资源语言的大规模低资源语音合成基准。此外,我们对各种TTS架构和大规模语音生成模型在领域内圣经文本和领域外材料上进行了系统比较。结果表明,没有单一系统在所有语言和指标上占优:Gemini-TTS在大多数评估语言上获得最高听众评分,但在OpenBibleTTS上训练的单一语言EveryVoice模型在可懂度上仍然最强,并在几种非洲语言中更受青睐,而从头训练的开放系统在领域外文本上性能急剧下降,揭示了广泛多语言覆盖与可靠合成质量之间在服务不足的语言社区中持续存在的差距。我们用主观人类判断补充自动评估,并开源所有处理后的数据集、对齐和训练模型,以支持未来的低资源TTS研究。

英文摘要

Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.

2606.06037 2026-06-10 cs.SD cs.CL eess.AS 交叉投稿

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

SpeechJBB:探究大型音频语言模型在语码转换语音下的安全对齐与理解

Virginia Ceccatelli, Yejin Jeon, David Ifeoluwa Adelani

发表机构 * Mila - Quebec AI Institute(米拉-魁北克人工智能研究所) McGill University(麦吉尔大学) Canada CIFAR AI Chair(加拿大CIFAR人工智能教席)

AI总结 提出SpeechJBB数据集,通过语码转换有害音频和伪词插入方法,揭示大型音频语言模型在多语言和口语设置下的安全漏洞。

AI中文摘要

大型音频语言模型(LALMs)越来越多地部署在现实应用中,但其安全对齐仍主要在单语、基于文本的有害提示上进行评估。这导致其在多语言和口语设置,特别是语码转换语音下的泛化能力很大程度上未被探索。为填补这一空白,我们引入了SpeechJBB,一个用于对多种最先进LALMs进行基准测试的音频越狱数据集。通过引入一种增强设置,即在安全关键术语周围插入音位学上合理的伪词以模拟局部混淆,进一步探测了安全弱点的程度。跨模型而言,语码转换的有害音频产生了显著高的越狱成功率(JSR),其中非英语单语和非英语语码转换对表现出最高的攻击成功率。伪词插入进一步降低了拒绝率,表明听起来自然的混淆可以有效绕过安全策略。

英文摘要

Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.

2606.10029 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

用稀疏自编码器解释和引导文本转语音语言模型

Nikita Koriagin, Georgii Aparin, Nikita Balagansky, Daniil Gavrilov

AI总结 本文在CosyVoice3语言模型骨干上训练BatchTopK稀疏自编码器,发现特征可解释且因果可控,能操纵笑声、性别和语速。

AI中文摘要

语言模型日益成为文本转语音(TTS)系统的骨干,但我们对其在文本和生成语音令牌共享单一残差流时构建的表示知之甚少。我们在CosyVoice3的语言模型骨干上训练BatchTopK稀疏自编码器,并引入一种模态感知的自动解释流水线,根据特征触发的位置——文本前缀上下文、1秒语音片段或两者——为每个特征打标签。恢复的特征是可解释的,涵盖音素、笑声、口音提示和说话者性别。通过SAE潜在空间进行引导表明,这些特征是因果性的而非仅仅是描述性的:有针对性的干预将笑声概率从0.02提高到0.79,翻转感知到的说话者性别,并在保持口语内容的同时控制语速。因此,SAE特征既可作为解释性对象,也可作为TTS合成的控制方向。

英文摘要

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.

2606.10439 2026-06-10 cs.SD cs.CL eess.AS 交叉投稿

Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

利用混合专家和动态下采样增强基于多语言大模型的语音识别

Guodong Lin, Ziqi Chen, Yuxiang Fu, Ke Li, Wei-Qiang Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出基于投影器的LLM-ASR框架,通过混合专家架构提升跨语言适应性,并利用连续整合-触发机制实现动态下采样和模态对齐,实验表明该方法显著超越强基线模型。

Comments Accepted by ICASSP 2026

Journal ref
ICASSP (2026), 18807-18811
AI中文摘要

大语言模型的快速发展为自动语音识别开辟了新前沿,使其有效集成成为一个关键且具有挑战性的研究方向。为此,本文提出了一种基于投影器的LLM-ASR框架,针对多语言泛化和模态对齐的关键挑战。我们的方法结合了混合专家架构以改善跨语言适应性,以及连续整合-触发机制用于动态下采样和模态对齐。实验结果表明,这些组件的组合带来了显著的性能提升,超越了强基线模型。所提出的方法朝着构建更准确、更鲁棒、更泛化的基于LLM的ASR系统迈出了一步。

英文摘要

The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
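
To make the dynamic-downsampling idea concrete, here is a minimal sketch of the generic Continuous Integrate-and-Fire rule, assuming per-frame weights in [0, 1] and a firing threshold of 1.0. It is not the paper's implementation: the function name `cif`, the list-based vectors, and the toy inputs are all hypothetical, and in a real system a small network would predict the weights.

```python
def cif(frames, alphas, threshold=1.0):
    """Integrate per-frame weights; emit one output vector each time the
    accumulated weight reaches the threshold.

    frames: list of equal-length feature vectors (lists of floats)
    alphas: per-frame weights, each assumed to lie in [0, threshold]
    """
    dim = len(frames[0])
    outputs = []
    acc_weight = 0.0
    acc_vec = [0.0] * dim
    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # Keep integrating: no token boundary inside this frame.
            acc_weight += alpha
            acc_vec = [a + alpha * x for a, x in zip(acc_vec, frame)]
        else:
            # Fire: spend just enough of this frame's weight to reach the
            # threshold, then carry the remainder over to the next output.
            used = threshold - acc_weight
            outputs.append([a + used * x for a, x in zip(acc_vec, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_vec = [rest * x for x in frame]
    return outputs


# Four frames with weight 0.5 each are compressed into two output vectors.
print(cif([[1.0], [1.0], [3.0], [3.0]], [0.5, 0.5, 0.5, 0.5]))  # [[1.0], [3.0]]
```

The output length follows the sum of the weights rather than a fixed stride, which is what makes the downsampling rate dynamic.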

2606.10781 2026-06-10 eess.AS cs.CL 交叉投稿

Recovering the Zipfian Distribution in Unsupervised Term Discovery

在无监督术语发现中恢复齐夫分布

Danel Slabbert, Simon Malan, Herman Kamper

发表机构 * Het Jan Marais Fonds(赫特·詹·马里茨基金会)

AI总结 针对无监督术语发现中中心聚类导致分布不均匀的问题,提出图聚类方法,在三种语言上显著优于K-means等,恢复更接近齐夫分布的词汇分布。

AI中文摘要

无监督术语发现涉及将未标记语音分割成词或音节单元,并将这些单元聚类成候选类型的词典。真实词典遵循齐夫分布,然而主流的基于中心的聚类方法——K-means——由于对球形聚类的归纳偏差,产生更均匀的分布。在本文中,我们重新审视基于图的聚类作为一种自下而上的替代方案,其中片段嵌入通过成对相似性连接,并使用Leiden算法进行划分。我们表明,在三种语言的词级和音节级词典发现中,图聚类在性能上显著优于基于中心的方法(K-means、GMM、BIRCH),产生更接近齐夫分布的分布。另一种自下而上的方法,即使用平均链接的凝聚聚类,也表现良好,尽管其计算效率较低,且对结果分布的控制能力较弱。我们的工作质疑了基于中心的聚类在术语发现中的主导地位,并推广图聚类作为一种有吸引力的替代方案。

英文摘要

Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.
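
One simple way to check how Zipf-like a discovered lexicon is, sketched under the assumption that cluster sizes stand in for type frequencies: fit a least-squares line to log frequency against log rank. A slope near -1 suggests a Zipfian shape, while a slope near 0 indicates the near-uniform sizes attributed to K-means above. The function name `zipf_slope` and the toy label lists are hypothetical, and the paper may measure this differently.

```python
import math
from collections import Counter


def zipf_slope(cluster_labels):
    """Least-squares slope of log(cluster size) versus log(rank)."""
    counts = sorted(Counter(cluster_labels).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two clusters to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data: cluster sizes proportional to 1/rank versus equal-sized clusters.
zipf_like = [label for label, size in enumerate([60, 30, 20, 15, 12]) for _ in range(size)]
uniform = [0, 1, 2, 3] * 10
print(round(zipf_slope(zipf_like), 3), round(abs(zipf_slope(uniform)), 3))
```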

2606.11033 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

AuRA: Internalizing Audio Understanding into LLMs as LoRA

AuRA: 将音频理解内化到LLM中作为LoRA

Bo Cheng, Lei Shi, Zhanyu Ma, Yuan Wu, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He

发表机构 * Meituan(美团) Jilin University(吉林大学)

AI总结 提出AuRA方法,通过层间蒸馏将ASR编码器的语音表示内化到LoRA适配的LLM中,实现紧耦合的语音-语言联合建模和高效并行端到端推理,在多个基准上优于级联系统和现有适应方法。

AI中文摘要

最近将大语言模型(LLM)扩展到语音输入的努力通常依赖于级联的ASR-LLM流水线、端到端语音-语言模型或基于桥接/蒸馏的适应方法。虽然这些路线分别重用了强大的预训练组件、实现了原生语音-语言交互或提供了轻量级适应,但它们常常遭受转录-接口延迟、昂贵的多模态训练或顺序语音-语言耦合的问题。为了解决这些限制,我们提出了AuRA,一种将音频编码能力蒸馏到LLM中的方法。具体来说,AuRA通过一个轻量级音频嵌入层将相同的语音输入馈送到ASR编码器(作为教师)和LoRA适配的LLM(作为学生),并使用逐层蒸馏将学生的隐藏状态与相应的教师表示对齐,从而将语音表示内化到轻量级的LLM侧适应中。与级联和串行桥接方法相比,AuRA实现了更紧密的语音-语言联合建模和高效的并行端到端推理,同时重用了预训练的语音和语言模型,而不需要大规模的多模态训练。在多个语音-语言基准上,AuRA在有效性和效率方面始终优于级联系统、语音到LLM适应基线以及大规模语音-语言和多模态模型。

英文摘要

Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. Specifically, AuRA feeds the same speech input to an ASR encoder (as a teacher) and a LoRA-adapted LLM (as a student) through a lightweight audio embedding layer, and uses layer-wise distillation to align the student's hidden states with corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter speech-language joint modeling and efficient parallel end-to-end inference, while also reusing pretrained speech and language models rather than requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.

2512.02201 2026-06-10 cs.CL 版本更新

Swivuriso: The South African Next Voices Multilingual Speech Dataset

Swivuriso:南非下一代语音多语言语音数据集

Vukosi Marivate, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, Mahmooda Milanzie, Tsholofelo Hope Mogale, Thapelo Sindane, Zainab Abdulrasaq, Kesego Mokgosi, Chijioke Okorie, Nia Zion Van Wyk, Graham Morrissey, Dale Dunbar, Francois Smit, Tsosheletso Chidi, Rooweither Mabuya, Andiswa Bukula, Respect Mlambo, Tebogo Macucwa, Idris Abdulmumin, and Seani Rananga

发表机构 * University of Cape Town(开普敦大学) University of KwaZulu-Natal(夸祖鲁-纳塔尔大学)

AI总结 介绍Swivuriso,一个3000小时的多语言语音数据集,覆盖南非七种语言,用于自动语音识别技术的开发与基准测试,填补现有数据集空白。

Comments Work in Progress. Updated in June 2026

AI中文摘要

本文介绍了Swivuriso,一个3000小时的多语言语音数据集,作为非洲下一代语音项目的一部分开发,旨在支持七种南非语言的自动语音识别(ASR)技术的开发和基准测试。涵盖农业、医疗保健和通用领域主题,Swivuriso填补了现有ASR数据集的重大空白。我们描述了指导数据集创建的设计原则、伦理考虑和数据收集程序。我们展示了使用这些数据训练/微调ASR模型的基线结果,并与相关语言的其他ASR数据集进行了比较。

英文摘要

This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the languages concerned.

2603.07238 2026-06-10 cs.CL eess.AS 版本更新

Scaling Self-Supervised Speech Models Uncovers Deep Linguistic Relationships: Evidence from the Pacific Cluster

扩展自监督语音模型揭示深层语言关系:来自太平洋集群的证据

Minu Kim, Hoirin Kim, David R. Mortensen

发表机构 * School of Electrical Engineering, KAIST, Republic of Korea(韩国科学技术院电气工程学院) Thomas Lord Department of Computer Science, University of Southern California, USA(美国南加州大学计算机科学系) Language Technologies Institute, Carnegie Mellon University, USA(美国卡内基梅隆大学语言技术研究所)

AI总结 通过将自监督语音模型的语言识别系统从126种扩展到4017种语言,发现系统在4K规模下发生质变,揭示出太平洋地区谱系上无关语言的宏观集群,表明大规模模型能内化多层语言历史。

Comments Accepted to Interspeech 2026

AI中文摘要

从自监督语音模型(S3Ms)中提取的语言表征之间的相似性已被观察到主要反映地理邻近性或由近期扩张或接触驱动的表面类型学相似性,可能遗漏更深层的谱系信号。我们研究了将基于S3M的语言识别系统从126种扩展到4017种语言如何重塑这种拓扑结构,并发现一个非线性效应:系统发育恢复在1K规模以下保持平稳,但4K模型经历质变,既解析了清晰的谱系也解析了长期的语言接触。最引人注目的是,一个稳健的太平洋宏观集群出现,将谱系上无关的巴布亚语、大洋洲语和澳大利亚语分组在一起,我们将其驱动因素追溯到一种集中编码,该编码捕获了共享的声学特征,如全局能量动态。这些结果表明,大规模S3Ms内化了多层语言历史,为计算系统发育学和语言接触研究提供了有前景的视角。

英文摘要

Similarities between language representations derived from Self-Supervised Speech Models (S3Ms) have been observed to primarily reflect geographic proximity or surface typological similarities driven by recent expansion or contact, potentially missing deeper genealogical signals. We investigate how scaling an S3M-based language identification system from 126 to 4,017 languages reshapes this topology, and find a non-linear effect: phylogenetic recovery stays flat up to the 1K scale, but the 4K model undergoes a qualitative shift, resolving both clear lineages and long-term linguistic contact. Most strikingly, a robust Pacific macro-cluster emerges, grouping genealogically unrelated Papuan, Oceanic, and Australian languages, and we trace its driver to a concentrated encoding that captures shared acoustic signatures such as global energy dynamics. These results suggest that massive S3Ms internalize multiple layers of language history, offering a promising perspective for computational phylogenetics and the study of language contact.

2412.11449 2026-06-10 cs.SD cs.AI cs.CL cs.LG eess.AS 版本更新

Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music

Whisper-GPT -- 语音和音乐的连续离散混合表示语言模型

Prateek Verma

发表机构 * Stanford University(斯坦福大学)

AI总结 提出Whisper-GPT,一种结合连续音频表示(如频谱图)和离散音频令牌的生成式大语言模型,解决了离散令牌方法上下文长度过长的问题,在语音和音乐的下一个令牌预测中降低了困惑度和负对数似然。

Comments 6 pages, 3 figures. 50th International Conference on Acoustics, Speech and Signal Processing, Hyderabad, India

AI中文摘要

我们提出了WHISPER-GPT:一种用于语音和音乐的生成式大语言模型(LLM),它允许我们在单个架构中同时处理连续音频表示和离散令牌。近年来,利用神经压缩算法(例如ENCODEC)导出的离散音频令牌的生成式音频、语音和音乐模型激增。然而,这种方法的主要缺点之一是处理上下文长度。如果必须考虑不同频率下的所有音频内容来进行下一个令牌预测,那么对于高保真生成架构来说,上下文长度会急剧增长。通过结合连续音频表示(如频谱图)和离散声学令牌,我们保留了两者的优点:在单个令牌中拥有来自音频特定时间实例的所有必要信息,同时允许LLM预测未来令牌,从而获得采样和离散空间提供的其他好处。我们展示了与基于令牌的语音和音乐LLM相比,我们的架构如何提高下一个令牌预测的困惑度和负对数似然分数。

英文摘要

We propose WHISPER-GPT: A generative large language model (LLM) for speech and music that allows us to work with continuous audio representations and discrete tokens simultaneously as part of a single architecture. There has been a huge surge in generative audio, speech, and music models that utilize discrete audio tokens derived from neural compression algorithms, e.g. ENCODEC. However, one of the major drawbacks of this approach is handling the context length. It blows up for high-fidelity generative architecture if one has to account for all the audio contents at various frequencies for the next token prediction. By combining continuous audio representation like the spectrogram and discrete acoustic tokens, we retain the best of both worlds: Have all the information needed from the audio at a specific time instance in a single token, yet allow LLM to predict the future token to allow for sampling and other benefits discrete space provides. We show how our architecture improves the perplexity and negative log-likelihood scores for the next token prediction compared to a token-based LLM for speech and music.

2603.11482 2026-06-10 cs.SD cs.CL eess.AS 版本更新

AnimeScore: A Preference-Based Dataset and Framework for Evaluating Anime-Like Speech Style

AnimeScore: 基于偏好的数据集与框架用于评估动漫风格语音

Joonyong Park, Jerry Li

发表机构 * Spellbrush, USA(美国Spellbrush)

AI总结 针对动漫风格语音缺乏客观评估指标的问题,提出基于偏好排序的框架AnimeScore,通过187名评估者的15000对判断数据,利用声学分析和SSL排序模型实现高达90.8% AUC的自动评估。

Comments Accepted to INTERSPEECH 2026

AI中文摘要

目前评估“动漫风格”语音依赖于昂贵的主观判断,尚无标准化的客观指标。一个关键挑战在于,与自然度不同,动漫相似度缺乏共享的绝对尺度,使得传统的平均意见得分(MOS)协议不可靠。为填补这一空白,我们提出AnimeScore,一个基于偏好的框架,通过成对排序自动评估动漫相似度。我们收集了来自187名评估者的15000对成对判断,并附有自由形式的描述;声学分析表明,感知的动漫相似度由受控的共振峰塑造、韵律连续性和刻意发音驱动,而非简单的启发式规则如高音调。我们证明,手工设计的声学特征达到69.3%的AUC上限,而基于SSL的排序模型达到90.8%的AUC,提供了一个实用的度量标准,也可作为生成式语音模型基于偏好优化的奖励信号。

英文摘要

Evaluating 'anime-like' voices currently relies on costly subjective judgments, yet no standardized objective metric exists. A key challenge is that anime-likeness, unlike naturalness, lacks a shared absolute scale, making conventional Mean Opinion Score (MOS) protocols unreliable. To address this gap, we propose AnimeScore, a preference-based framework for automatic anime-likeness evaluation via pairwise ranking. We collect 15,000 pairwise judgments from 187 evaluators with free-form descriptions, and acoustic analysis reveals that perceived anime-likeness is driven by controlled resonance shaping, prosodic continuity, and deliberate articulation rather than simple heuristics such as high pitch. We show that handcrafted acoustic features reach a 69.3% AUC ceiling, while SSL-based ranking models achieve up to 90.8% AUC, providing a practical metric that can also serve as a reward signal for preference-based optimization of generative speech models.

9. 评测、数据集与基准 37 篇

2606.10061 2026-06-10 cs.CL 新提交

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

BenSyc: 孟加拉语上下文中大语言模型对话谄媚与人类对齐的基准测试

Kazi Noshin, Sajib Acharjee Dip, Ranat Das Prangon, Fardin Hassan Tamim, Syed Ishtiaque Ahmed, Liqing Zhang, Sharifa Sultana

AI总结 提出BenSyc基准,基于孟加拉语社交数据构建五级标注集,评估15+模型在对话对齐分类与生成任务上的表现,发现前沿模型在区分共情与强化性认可上仍存在困难。

AI中文摘要

大型语言模型(LLMs)越来越多地参与情感敏感的社交对话,其回应可能从平衡支持转向过度认可或升级性对齐。现有的谄媚研究主要关注事实一致性和指令遵循设置,而文化背景下的对话谄媚尚未得到充分探索。我们引入了BenSyc,这是首个用于研究孟加拉语社交语境中对话谄媚的基准。从孟加拉国和西孟加拉邦社区收集的11,840条Reddit帖子和170k条评论出发,我们构建了一个人工验证的基准,包含二元标签和一个细粒度的五级分类体系,涵盖无效化、中立、支持、认可和升级。我们在对话对齐分类和响应生成任务上评估了超过15个开源和专有LLM。结果表明,即使对于前沿的指令调优模型,区分共情性支持与强化导向的认可仍然具有挑战性:最佳系统在二元检测上仅达到61.8 Macro-F1,在五类分类上达到61.7 Macro-F1。在生成设置中,多个模型在情感激烈的情境下频繁产生强烈认可或升级性回应。我们的发现凸显了不同模型家族和对话行为之间的显著差异,强调了文化背景下的多语言基准对于评估社交对齐的对话AI系统的重要性。

英文摘要

Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored. We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy spanning Invalidation, Neutral, Support, Validation, and Escalation. We evaluate more than 15 open and proprietary LLMs on conversational alignment classification and response generation tasks. Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification. In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations. Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.

2606.10285 2026-06-10 cs.CL 新提交

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

OpenRTLSet: 基于大语言模型的Verilog模块设计的完全开源数据集

Jinghua Wang, Lily Jiaxin Wan, Sanjana Pingali, Scott Smith, Manvi Jha, Shalini Sivakumar, Xing Zhao, Kaiwen Cao, Deming Chen

发表机构 * UIUC-ChenLab(UIUC-陈实验室)

AI总结 提出最大完全开源硬件设计数据集OpenRTLSet,包含13万+多样Verilog代码样本,结合GitHub代码、VHDL和C/C++翻译,利用DeepSeek-R1生成自然语言描述,支持多种语言模型微调,证明开源方法在硬件设计中的优越性。

Comments Accepted by ICLAD'25

Journal ref
2025 IEEE International Conference on LLM-Aided Design (ICLAD), Stanford, CA, USA, 2025, pp. 212-218
AI中文摘要

OpenRTLSet引入了硬件设计中最大的完全开源数据集,为研究界和工业界提供了超过131,000个多样化的Verilog代码样本。我们的数据集独特地结合了来自GitHub仓库的Verilog代码(102k模块)、VHDL翻译(5k模块)和可综合的C/C++翻译(24k模块),所有内容均可自由访问,无专有限制。使用推理模型DeepSeek-R1,我们为每个代码样本生成了配对的自然语言描述,从而能够微调各种语言模型家族(例如Qwen和Granite)以进行Verilog代码生成。我们的数据集探索了多种选项,包括在标注过程中将Verilator生成的C++文件作为额外上下文、量化技术(INT4 vs. BF16)以及不同模型规模(7B-32B参数)之间的性能差异。OpenRTLSet证明了开源方法在硬件设计任务中可以实现优越的性能,为该领域的可访问研究和商业用途建立了新的基础。

英文摘要

OpenRTLSet introduces the largest fully open-source dataset for hardware design, offering over 131,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (102k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated paired natural language descriptions for each code sample, enabling fine-tuning of various language model families (e.g., Qwen and Granite) for Verilog code generation. Our dataset explores multiple options, including Verilator-generated C++ files as additional context during labeling, quantization techniques (INT4 vs. BF16), and performance differences across model sizes (7B-32B parameters). OpenRTLSet demonstrates that open-source approaches can achieve superior performance in hardware design tasks, establishing a new foundation for accessible research and commercial use in this domain.

2606.10315 2026-06-10 cs.CL cs.AI 新提交

Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

捕捉五分之一:LLM作为评判员在生产环境多轮交易代理中的盲点

Sawyer Zhang, Alexander Wang, Sophie Lei

发表机构 * Lumivate (Lumi)

AI总结 研究部署的餐饮订购代理中LLM评判员对真实缺陷的召回率,发现其仅捕获22%的系统性问题,主要因评分标准缺乏状态跟踪等行为维度,且路由机制导致缺陷被错误分类。

Comments 13 pages, 1 figure, 5 tables

AI中文摘要

LLM作为评判员是评估对话代理的默认工具,但其可靠性几乎总是报告为与人类评分的一致性,而非真实缺陷的召回率。我们研究了一个已部署的多轮餐饮订购代理,并通过详尽的人工转录审查作为基准,衡量其内置LLM评判员捕获了多少真实质量问题。在三个批次中,评判员发现的系统性问题远低于人类确认的四分之一——在一个批次中,9种模式中只有2种(22%),而在另一个批次中,其操作门控标记了100轮中的0轮,而人类确认了23个不同缺陷和7个新的跨轮模式。我们的盲点分类表明,失败是有结构的,而非随机的:评判员能捕获轮次局部问题(虚构统计数据、错误语言),但遗漏了跨轮状态问题(确认门锁死、购物车幻觉、升级锁死、过时引用)。机制在于:评分标准仅暴露三个粗略轴(意图、品牌声音、个性化),且没有针对行为维度(状态跟踪、护栏、恢复)的类别,而大多数缺陷集中于此。失败在于路由而非感知:114轮中,113轮原始评判员注释描述了确认门或购物车状态缺陷,但被评分为“品牌声音”,且无一到达操作失败——门控连接到挂起和硬断言,而非评分标准——因此0%是路由和接线失败,而非失明。对流行率估计的影响是显著的:当表观缺陷率为零时,Rogan-Gladen校正退化——无信号可恢复真实率——而当门控报告非零率时,相同估计器在我们测量的灵敏度下暗示3-6倍的低估。对于生产环境多轮代理,自动评判是回归底线,而非人工审查的替代品。

英文摘要

LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth. Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems -- 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns. Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions -- state-tracking, guardrails, recovery -- where most defects cluster. The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored "brand voice", and none reach an operational failure -- the gate is wired to hangs and hard assertions, not the rubric -- so the 0% is a routing-and-wiring failure, not blindness. The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates -- no signal can recover the true rate -- while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.
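
The prevalence argument can be illustrated with the standard Rogan-Gladen estimator, which corrects an apparent rate using the detector's sensitivity and specificity: true rate = (apparent + specificity - 1) / (sensitivity + specificity - 1). The sketch below is illustrative only. The sensitivity of 0.22 is borrowed from the 2-of-9 figure quoted above, and the perfect specificity and the 5% apparent rate are assumptions, not values reported by the paper; the function name `rogan_gladen` is likewise hypothetical.

```python
def rogan_gladen(apparent_rate, sensitivity, specificity):
    """Rogan-Gladen estimate of true prevalence, clipped to [0, 1]."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (apparent_rate + specificity - 1.0) / denom
    return min(1.0, max(0.0, estimate))


SENSITIVITY = 0.22   # assumed from the 2-of-9 recall quoted in the abstract
SPECIFICITY = 1.0    # assumption: the judge raises no false alarms

# A gate that flags nothing yields an estimate of zero, whatever the true rate is.
print(rogan_gladen(0.0, SENSITIVITY, SPECIFICITY))

# A nonzero apparent rate is scaled up by 1 / sensitivity (about 4.5x here).
print(rogan_gladen(0.05, SENSITIVITY, SPECIFICITY))
```

Under these assumed values the correction factor is 1 / 0.22, roughly 4.5, which falls inside the 3-6x undercount range mentioned in the abstract.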

2606.10380 2026-06-10 cs.CL cs.AI 新提交

Expert-Level Crisis Detection in Mental Health Conversations

心理健康对话中的专家级危机检测

Grace Byun, Abigail Lott, Rebecca Lipschutz, Sean T. Minton, Elizabeth A. Stinson, Jinho D. Choi

发表机构 * Department of Computer Science, Emory University(埃默里大学计算机科学系) Department of Psychiatry and Behavioral Sciences, Emory University(埃默里大学精神病学与行为科学系)

AI总结 提出CRADLE-Dialogue基准数据集和Alert-Confirm评估协议,用于对话中危机检测,发现模型在识别风险出现时机上表现较差,并发布合成训练语料和32B参数模型。

AI中文摘要

现实世界的危机干预本质上是对话式的,然而现有研究主要关注静态文本。当应用于多轮对话时,当前模型表现出显著的性能下降,难以追踪随着上下文演变而出现的风险信号。为了解决这一差距,我们引入了CRADLE-Dialogue,这是一个由临床医生标注的基准数据集,用于对话环境中的回合级危机检测。该数据集包含600个对话,具有跨临床基础风险的多标签注释,包括自杀意念、自残和儿童虐待,区分过去和当前风险。我们进一步提出了一种Alert-Confirm评估协议,该协议区分早期预警信号(Alert)和特定危机变得明确可识别的回合(Confirm),反映了在风险变得明确之前进行干预的临床需求。实验表明,识别风险何时出现比识别其存在要困难得多:模型的Micro F1仅达到40%中段到60%高段。此外,我们发布了一个合成训练语料库和一个32B参数模型,该模型显著优于现有的开源模型,并在回合级、对话级和仅确认评估设置中与专有模型相比具有竞争力或更优的结果。

英文摘要

Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts. When applied to multi-turn dialogues, current models exhibit significant performance degradation, struggling to track risk signals that emerge as context evolves. To address this gap, we introduce CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in conversational settings. The dataset features 600 dialogues with multi-label annotations across clinically grounded risks, including suicide ideation, self-harm, and child abuse, distinguishing past from ongoing risk. We further propose an Alert-Confirm evaluation protocol that distinguishes early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm), reflecting the clinical need to intervene before risk becomes explicit. Experiments show that identifying when risk emerges is much harder than recognizing that it exists: models achieve only mid-40% to high-60% Micro F1. Additionally, we release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and achieves competitive or superior results against proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.

2606.10400 2026-06-10 cs.CL cs.CV 新提交

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

视觉语言模型是看见还是猜测?通过措辞控制基准衡量和减少文本先验依赖

Pratham Singla, Shivank Garg, Vihan Singh, Paras Chopra

发表机构 * Lossfunk Indian Institute of Technology Roorkee(印度理工学院罗尔基分校) Raeth AI

AI总结 本文构建了540张图像的基准,通过为同一图像生成四种措辞变体,衡量视觉语言模型对文本先验的依赖,发现所有模型在最难变体上性能下降,开放模型下降最严重,并通过无图像消融等分析证实了真正的图像依赖。

Comments 17 pages, 7 figures, Submitted to EMNLP 2026

AI中文摘要

视觉语言模型(VLM)越来越多地被部署在答案必须依据图像内容的场景中,然而它们常常基于文本先验(问题的措辞结合记忆的世界知识)而非图像本身来回答,这夸大了基准分数并产生了自信但无根据的答案。现有基准很少孤立这种行为,因为每张图像通常只与一个固定问题配对。为了衡量这种依赖,我们构建了一个包含540张图像、覆盖六个推理类别的基准,并为相同图像生成四个问题变体,使得措辞而非图像内容成为受控变量。最难的变体直接从图像编写以最小化文本泄漏。我们对十一个VLM进行了基准测试,涵盖从小型开放权重模型到大型闭源系统:每个模型在最难的变体上性能下降,开放模型下降最严重。我们的核心诊断是无图像消融,它将开放权重模型降至其纯文本基线(1%到9%)。进一步的三项分析——LLM评定的难度、低基础到最终文本相似度以及人工重新标注——证实了真正的图像依赖性。与变体构建方式匹配的上下文示例恢复了最高的准确率,而GRPO后训练一个小型VLM在所有四个变体上取得了一致的提升,并泛化到保留的分布外集。文本先验依赖是可测量的,并且部分可通过训练消除。

英文摘要

Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark scores and yields confident but ungrounded answers. Existing benchmarks rarely isolate this behavior, since each image is usually paired with a single fixed question. To measure the reliance, we build a 540-image benchmark across six reasoning categories and generate four question variants over the same images, so that phrasing rather than image content is the controlled variable. The hardest variant is written directly from the image to minimize text leakage. We benchmark eleven VLMs spanning small open-weight models to large closed-source systems: every model degrades on the hardest variant, and open models fall furthest. Our central diagnostic is a no-image ablation, which collapses the open-weight models to their text-only floor (1 to 9 percent). Three further analyses, LLM-rated difficulty, low base-to-final textual similarity, and human re-annotation, corroborate genuine image-dependence. In-context exemplars that match how a variant was built recover the most accuracy, and GRPO post-training of a small VLM yields consistent gains across all four variants that transfer to a held-out out-of-distribution set. Textual-prior reliance is measurable and partly trainable away.

2606.10460 2026-06-10 cs.CL cs.AI 新提交

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

LakeQA:百万级数据湖上的探索性问答基准

Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu

发表机构 * Columbia University(哥伦比亚大学) New York University(纽约大学) Barnard College(巴纳德学院)

AI总结 提出LakeQA基准,要求LLM在9.5TB异构数据湖中搜索并多跳推理,GPT-5.2仅达18.37%精确匹配,挑战性强。

详情
AI中文摘要

近期的大语言模型(LLM)在基于阅读的问答(QA)方面取得了快速进展,其中证据被明确提供或可以轻松检索。相比之下,现实世界的问题通常不与准确的证据文档配对。有用的证据存在于海量数据湖中,使得搜索成为回答的前提。然而,目前缺乏要求在大型数据湖上进行搜索和推理的综合基准。为此,我们引入了LakeQA,一个针对数据湖上以搜索为中心的问答的综合基准,同时强调搜索和推理能力。LakeQA建立在来自维基百科和开源政府数据的大约9.5 TB文本资源的异构集合上,涵盖结构化和非结构化数据。为确保任务质量,每个样本至少由一名博士级专家标注。每个任务需要长期的多跳推理,包含隐式的中间步骤:智能体需要发现正确的文档,然后跨来源组合证据以产生答案。在七个前沿LLM上的实验结果表明,LakeQA具有挑战性。例如,GPT-5.2在LakeQA上仅达到18.37%的精确匹配分数。总体而言,LakeQA为开发能够在现代数据湖中查找和分析数据的LLM智能体提供了一个现实的测试平台。

英文摘要

Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.

2606.10554 2026-06-10 cs.CL cs.AI 新提交

Benchmarking Knowledge Editing using Logical Rules

使用逻辑规则对知识编辑进行基准测试

Tatiana Moteu Ngoli, NDah Jean Kouagou, Hamada M. Zahera, Axel-Cyrille Ngonga Ngomo

发表机构 * Data Science Group, Heinz Nixdorf Institute, Paderborn University(帕德博恩大学海因茨·尼克斯多夫研究所数据科学组)

AI总结 提出基于逻辑规则的基准,评估知识编辑方法对单次编辑逻辑后果的处理能力,发现现有方法在直接编辑知识与蕴含知识之间的性能差距高达24%。

Comments Accepted at the 24th International Semantic Web Conference 2025

详情
Journal ref
The Semantic Web. ISWC 2025. ISWC 2025. Lecture Notes in Computer Science, vol 16141. Springer, Cham
AI中文摘要

大型语言模型(LLMs)越来越多地部署在需要访问最新知识的实际应用中。然而,重新训练LLMs计算成本高昂。因此,知识编辑技术对于维护预训练模型中的当前信息和纠正错误断言至关重要。当前的知识编辑基准主要关注回忆编辑过的事实,往往忽略其逻辑后果。为解决这一局限,我们引入了一个新基准,旨在评估知识编辑方法如何处理单次事实编辑的逻辑后果。我们的基准从知识图谱中提取与给定编辑相关的逻辑规则,然后基于这些规则生成多跳问题,以评估对逻辑后果的影响。我们的发现表明,虽然现有的知识编辑方法能够准确地将直接断言插入LLMs,但它们经常无法注入蕴含的知识。具体来说,使用ROME和FT等流行方法的实验显示,在直接编辑的知识和蕴含知识的评估之间存在高达24%的性能差距。这凸显了在知识编辑中需要语义感知的评估框架。

英文摘要

Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are crucial for maintaining current information and correcting erroneous assertions within pre-trained models. Current benchmarks for knowledge editing primarily focus on recalling edited facts, often neglecting their logical consequences. To address this limitation, we introduce a new benchmark designed to evaluate how knowledge editing methods handle the logical consequences of a single fact edit. Our benchmark extracts relevant logical rules from a knowledge graph for a given edit. Then, it generates multi-hop questions based on these rules to assess the impact on logical consequences. Our findings indicate that while existing knowledge editing approaches can accurately insert direct assertions into LLMs, they frequently fail to inject entailed knowledge. Specifically, experiments with popular methods like ROME and FT reveal a substantial performance gap, up to 24%, between evaluations on directly edited knowledge and on entailed knowledge. This highlights the critical need for semantics-aware evaluation frameworks in knowledge editing.

2606.10657 2026-06-10 cs.CL 新提交

Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

我们是在评估知识还是措辞?使用ParaEval减轻MCQA敏感性

João Maria Janeiro, Mathurin Videau, Andrea Caciolai, Benjamin Piwowarski, Patrick Gallinari, Loic Barrault

发表机构 * FAIR at Meta(Meta FAIR) Sorbonne Université, CNRS, ISIR, F-75005 Paris, France(索邦大学,法国国家科学研究中心,智能系统与机器人研究所,法国巴黎) Criteo AI Lab, Paris, France(Criteo AI实验室,法国巴黎)

AI总结 针对多选题基准测试对答案措辞敏感的问题,提出ParaEval框架,通过对每个选项使用多种释义并选择最有利的评分,将虚假性能差距从2分以上降至1分以下,从而评估模型真实能力。

详情
AI中文摘要

多选题(MCQA)基准测试是评估预训练大语言模型的标准方法,但其依赖于对数似然评分使得结果不可靠。具体而言,标准评分对答案的确切措辞(表面形式)高度敏感,将模型对特定短语的熟悉程度与其实际能力混为一谈。我们使用一个受控测试床(1B-8B模型,基于相同知识训练)证明了这一缺陷。尽管拥有相同的知识,标准指标错误地报告了超过2分的性能差距。为了解决这个问题,我们提出了ParaEval,一个评估框架,它对每个答案选项使用多个释义来查询模型。通过根据每个模型最有利的措辞进行评分,ParaEval成功地将虚假性能差距降低到1分以下。我们确认这些评估伪影以及ParaEval的改进在前沿的70B和120B开源模型中仍然存在。最终,ParaEval提供了一种稳健且高效的方式来评估真正的底层能力,而不是表面形式的熟悉度。

英文摘要

Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact phrasing (surface form) of the answers, conflating a model's familiarity with a specific phrase with its actual capability. We demonstrate this flaw using a controlled testbed of 1B-8B models trained on the same knowledge. Despite having identical knowledge, standard metrics falsely report a performance gap of over 2 points. To solve this, we propose ParaEval, an evaluation framework that queries models using multiple paraphrases per answer option. By scoring each model based on its most favorable phrasing, ParaEval successfully reduces the false performance gap to below 1 point. We confirm that these evaluation artifacts, and the improvements from ParaEval, persist in frontier 70B and 120B open-source models. Ultimately, ParaEval provides a robust and efficient way to evaluate true underlying capability rather than surface-form familiarity.
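The scoring rule described above (score each answer option by its most favorable paraphrase) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `paraeval_pick`, the `loglik` callback, and the toy scores are all hypothetical, and the abstract does not say whether log-likelihoods are length-normalized.

```python
from typing import Callable, Dict, List


def paraeval_pick(
    question: str,
    paraphrases: Dict[str, List[str]],
    loglik: Callable[[str, str], float],
) -> str:
    """Pick the option whose best-scoring paraphrase has the highest log-likelihood.

    `paraphrases` maps an option label to several surface forms of that answer.
    `loglik(question, answer)` is a hypothetical scorer supplied by the caller.
    """
    best_per_option = {
        label: max(loglik(question, text) for text in texts)
        for label, texts in paraphrases.items()
    }
    return max(best_per_option, key=best_per_option.get)


# Toy scorer with made-up numbers: the model "knows" the answer but
# dislikes one particular surface form of it.
TOY_SCORES = {"H2O": -9.0, "water": -2.0, "salt": -3.0, "sodium chloride": -7.0}


def toy_loglik(question: str, answer: str) -> float:
    return TOY_SCORES[answer]


options = {"A": ["H2O", "water"], "B": ["salt", "sodium chloride"]}
choice = paraeval_pick("What is ice made of?", options, toy_loglik)
```

With only the first surface form of each option, option B (-3.0) would beat option A (-9.0); taking the maximum over paraphrases lets A win through "water" (-2.0), which is the effect the abstract attributes to surface-form sensitivity.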

2606.10765 2026-06-10 cs.CL 新提交

ArabiGEE: A Hierarchical Taxonomy for Arabic Grammatical Error Explanation

ArabiGEE:阿拉伯语语法错误解释的层次分类体系

Khaled Elhady, Omar Kallas, Nizar Habash, Bashar Alhafni

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) New York University Abu Dhabi(纽约大学阿布扎比分校)

AI总结 提出首个基于显式错误类型的阿拉伯语语法错误解释层次分类体系,涵盖正字法、形态、句法和词汇四个维度,包含27种错误类型、140种修正类型和324种解释,并用于人工标注现有语料库以支持大语言模型的自动评估。

详情
AI中文摘要

我们介绍了ArabiGEE,这是首个基于显式错误类型的全面阿拉伯语语法错误解释(GEE)分类体系。与现有将解释生成视为自由形式文本的GEE方法不同,ArabiGEE通过涵盖正字法、形态、句法和词汇维度的层次结构组织语法解释。该分类体系包含27种错误类型、140种修正类型和324种相关解释。我们将ArabiGEE应用于人工标注现有阿拉伯语语法错误修正语料库的部分内容,并展示了结构化语法解释如何支持对大语言模型在阿拉伯语GEE上的自动评估。我们的代码和数据已公开。

英文摘要

We introduce ArabiGEE, the first comprehensive Arabic grammatical error explanation (GEE) taxonomy grounded in explicit error types. Unlike existing GEE approaches that treat explanation generation as free-form text, ArabiGEE organizes grammatical explanations through a hierarchical structure spanning orthographic, morphological, syntactic, and lexical dimensions. The taxonomy consists of 27 error types, 140 correction types, and 324 associated explanations. We apply ArabiGEE to manually annotate portions of existing Arabic grammatical error correction corpora and demonstrate how structured grammatical explanations can support automatic evaluation of LLMs on Arabic GEE. Our code and data are publicly available.

2606.11070 2026-06-10 cs.CL cs.AI 新提交

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

T1-Bench:真实世界领域中的多场景智能体基准测试

Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao, Shikhhar Siingh, Houhan Lu, Nadia Bathaee, Sriharsha Hatwar, Paresh Dashore, Anmol Jain, Kshitij Tayal, Xiuzhu Lin, Anirban Das, Sambit Sahu, Shi-Xiong Zhang

发表机构 * Capital One(第一资本)

AI总结 提出T1-Bench,一个高保真、全面的基准,用于评估多领域真实客户场景中的智能体系统,通过交织的多轮交互任务提升复杂性和评估严谨性。

Comments Preprint

详情
AI中文摘要

近期大型语言模型(LLMs)在推理和工具调用能力方面的进步使得智能体系统越来越强大。然而,现有基准在任务复杂性、真实性和领域多样性方面仍然有限,并且往往无法捕捉跨多个领域的交互,限制了它们在需要持续推理和协调的现实多步骤设置中评估智能体的能力。为解决这些限制,我们引入了T1-Bench,一个高保真、全面的基准,用于评估真实客户面向的多领域环境中的智能体系统,具有交织的场景,需要在多轮用户-助手交互中进行结构化推理,并在25个不同难度的领域中显著增加了组合复杂性和评估严谨性。我们使用12个专有和开放权重模型评估T1-Bench,提供了一个可重复和标准化的框架,用于评估复杂多步骤环境中的智能体行为、工具利用和对话质量。我们进一步用人类判断补充自动评估,以加强对定性性能的评估。总体而言,T1-Bench通过增加任务复杂性、交互深度和模拟多领域环境中的领域覆盖,显著推进了先前的基准。为促进智能体系统的未来研究,我们将公开数据及评估代码作为开源资源。

英文摘要

Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and often fail to capture interactions that span multiple domains, limiting their ability to evaluate agents in realistic multi-step settings that require sustained reasoning and coordination. To address these limitations, we introduce T1-Bench, a high-fidelity, comprehensive benchmark for evaluating agentic systems in realistic customer-facing, multi-domain environments, featuring interleaved scenarios that require structured reasoning across multi-turn user-assistant interactions and substantially increasing both compositional complexity and evaluative rigor across 25 domains of varying difficulty. We evaluate T1-Bench using 12 proprietary and open-weight models, providing a reproducible and standardized framework for assessing agent behavior, tool utilization, and conversational quality in complex, multi-step environments. We further complement automatic evaluation with human judgments to strengthen the assessment of qualitative performance. Overall, T1-Bench substantially advances prior benchmarks by increasing task complexity, interaction depth, and domain coverage in simulated multi-domain environments. To facilitate future research on agentic systems, we will publicly release data and evaluation code as open source.

2606.11079 2026-06-10 cs.CL 新提交

VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

VISTA:用于智能体评估的多功能交互式用户模拟工具包

Yunan Lu, Ryan Shea, Yusen Zhang, Zhou Yu

发表机构 * Department of Computer Science, Columbia University(哥伦比亚大学计算机科学系) Arklex.ai

AI总结 提出VISTA工具包,通过六项指标和混合用户模拟器(UI+API)提升智能体评估的真实性与全面性,在电商和教育场景中验证有效性。

详情
AI中文摘要

评估仍然是交互式智能体开发的关键瓶颈。现有的评估方法通常依赖于静态基准,这些基准无法捕捉智能体行为的动态、多步骤特性,也难以暴露有意义的失败模式。虽然基于用户模拟的评估提供了一种有前景的替代方案,但现有的模拟框架存在两个主要局限性。首先,它们提供的评估模拟交互质量和全面性的机制有限,使得难以评估模拟器是否充分探索了智能体的能力和失败模式。其次,大多数框架仅限于仅UI操作或仅API操作,限制了它们建模真实用户行为全范围的能力。为了解决这些局限性,我们提出了VISTA,一个用于智能体评估的多功能交互式用户模拟工具包。我们的工具包包含一套六项指标,用于衡量模拟交互的真实性、能力覆盖范围和交互有效性。此外,我们开发了一个混合用户模拟器,集成了基于UI的交互和基于API的交互,从而能够在多样化的交互环境中进行更真实和全面的评估。我们在电子商务购物和教育客户服务场景中评估了VISTA,并证明它比现有方法产生了更真实和全面的评估。

英文摘要

Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.

2606.11082 2026-06-10 cs.CL cs.CY 新提交

The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

示播列效应:审计大型语言模型的跨语言分布偏斜

Hakan Mehmetcik

发表机构 * Kellogg Institute for International Studies, University of Notre Dame(圣母大学凯洛格国际研究所) Marmara University(马尔马拉大学)

AI总结 本研究通过多智能体地缘政治兵棋推演,发现前沿LLM在跨语言条件下存在行为偏斜,且该效应依赖于模型架构与训练机制,而非西方起源模型的普遍属性。

Comments 25 pages, 2 figures, 6 tables, Research Article

详情
AI中文摘要

本研究调查了前沿大型语言模型(LLMs)在持续对抗条件下遭受的跨语言分布偏斜(示播列效应)。我们开发了一个多智能体地缘政治兵棋推演——蔚蓝海危机,这是一个旨在模拟东地中海冲突结构动态的合成海洋领土争端。六个前沿模型(GPT-4o、Llama-4、Mistral-Large、Gemini-3.1-Pro、Qwen3.6-Plus和DeepSeek-R1)参与了一项组间实验(每组N=10局游戏,每局K=5轮),其中唯一的操作变量是游戏语言(英语与土耳其语),产生了586条有效陈述。一个零样本分类器沿两个连续维度评估行为倾向:让步率和强制修辞。结果是异质的。Llama-4在土耳其语下显示出经Holm校正的强制修辞显著增加(delta = +0.800,p = .002),而Gemini-3.1-Pro显示出同样大的下降(delta = -0.750,p = .005)。DeepSeek-R1表现出类似的负向偏移(delta = -0.860,p = .006),并提供了与缓冲机制一致的思维链证据。GPT-4o未显示出可检测效应(delta = +0.130,p = .614)。这些发现表明,跨语言行为偏斜取决于模型架构和训练机制,而非西方起源LLM的普遍属性。我们识别出两种不同的缓冲机制——思维链制度锚定和多语言RLHF对齐——并讨论了它们对将LLM安全集成到外交和危机管理环境中的启示。

英文摘要

This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. Six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, and DeepSeek-R1) participate in a between-groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language of play (English versus Turkish), producing 586 validated statements. A zero-shot classifier assesses behavioral dispositions along two continuous dimensions: Concession Rate and Coercive Rhetoric. The results are heterogeneous. Llama-4 shows a substantial, Holm-corrected increase in coercive rhetoric under Turkish (delta = +0.800, p = .002), whereas Gemini-3.1-Pro displays an equally large decrease (delta = -0.750, p = .005). DeepSeek-R1 exhibits a similar negative shift (delta = -0.860, p = .006) and provides chain-of-thought evidence consistent with a buffering mechanism. GPT-4o shows no detectable effect (delta = +0.130, p = .614). These findings indicate that cross-lingual behavioral skew is contingent on model architecture and training regime rather than a universal property of Western-origin LLMs. We identify two distinct buffering mechanisms, chain-of-thought institutional anchoring and multilingual RLHF alignment, and discuss their implications for integrating LLMs safely into diplomatic and crisis-management settings.
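The abstract reports Holm-corrected significance without giving the procedure. A minimal sketch of the standard Holm-Bonferroni step-down test is shown below. The p-values are illustrative and are not taken from the paper, and the size of the paper's comparison family is not stated in the abstract.

```python
from typing import List


def holm_reject(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Holm-Bonferroni step-down: return a reject flag per hypothesis.

    P-values are visited from smallest to largest; the i-th smallest
    (0-based rank i) is compared with alpha / (m - i). Testing stops at
    the first failure, and all remaining hypotheses are retained.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break
    return reject


# Illustrative p-values only (not the paper's data).
flags = holm_reject([0.01, 0.04, 0.03, 0.005])
```

Here the two smallest p-values pass their thresholds (0.0125 and about 0.0167), while 0.03 fails against 0.025, so it and the larger 0.04 are both retained.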

2606.11105 2026-06-10 cs.CL cs.AI 新提交

PhantomBench: Benchmarking the Non-existential Threat of Language Models

PhantomBench: 对语言模型非存在性威胁的基准测试

Haeji Jung, Hila Gonen

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Canada CIFAR AI Chair, Amii(加拿大CIFAR人工智能讲席,阿尔伯塔机器智能研究所)

AI总结 提出PhantomBench,首个大规模非存在概念基准,包含6万多个虚构术语和实体,评估21个模型,发现在某些情况下平均幻觉率高达86.7%,前沿模型也难以避免。

详情
AI中文摘要

幻觉,即语言模型生成事实无依据的响应,会带来严重风险,因为用户倾向于盲目依赖它们。在高风险领域,这种模型行为的后果可能导致重大伤害。尽管在理解幻觉方面取得了显著进展,但这些模型如何可靠地识别其知识边界仍不清楚。我们引入了PhantomBench,这是首个此类大规模基准,包含来自不同领域真实概念的6万多个不存在的术语和实体。使用我们的基准,我们评估了各种类型和大小的共21个模型。我们展示了令人震惊的幻觉率(在某些情况下平均高达86.7%),并注意到即使是前沿模型也令人惊讶地无法在不存在的概念上弃权,特别是当输入预设它们存在时。然后,我们展示了PhantomBench可以作为研究模型在罕见概念上行为的代理,这些概念更容易产生幻觉。我们还提供了一个构建PhantomBench的流程,使得能够根据研究人员和实践者的特定需求可扩展地生成不存在的概念。

英文摘要

Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavior can lead to significant harms. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts for which models are more prone to hallucinate. We also provide a pipeline to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.

2606.11127 2026-06-10 cs.CL cs.AI 新提交

Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

基于来源的门控与自适应恢复在合成后训练数据筛选中的应用

Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, Pratinav Seth

发表机构 * Lexsi Labs

AI总结 研究合成后训练数据筛选中的来源证据门控与样本自适应恢复,提出结合故障诊断与定向再生成的自适应恢复流水线,提高产量、恢复率和注入召回率。

详情
AI中文摘要

合成后训练流水线通常使用奖励模型或整体LLM评判器对生成的样本进行过滤,但两个实践很少被一起检验:过滤信号是否基于引发每个生成的来源证据,以及被拒绝的样本是否可以系统性地恢复而非永久丢弃。我们通过对抗性注入语料库提供真实故障标签,在门控配置、恢复策略和生成器规模上对这两个问题进行了受控研究。我们发现,精确的来源出处改善了更强评判器的忠实度门控;幻觉门控和奖励门控拒绝的样本群体大多不重叠,因此两者都是必要的;结合故障诊断与定向再生成的自适应恢复流水线比简单重采样实现了更高的产量、恢复率和注入召回率。下游微调质量主要由生成器规模驱动,过滤和恢复条件虽有重要贡献但处于次要地位。

英文摘要

Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced each generation, and whether rejected samples can be systematically recovered rather than permanently discarded. We present a controlled study of both questions across gate configurations, recovery strategies, and generator scales, using adversarially injected corpora to provide ground-truth failure labels. We find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint sample populations making both necessary, and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery conditions contributing meaningfully but secondarily.

2606.09843 2026-06-10 cs.HC cs.AI cs.CL 交叉投稿

An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

一个原生LLM的心理测量工具不能预测LLM行为:来自25个模型的证据

Juan Manuel Contreras

发表机构 * Independent Researcher(独立研究员)

AI总结 通过探索性因子分析从LLM行为中构建心理测量工具,发现LLM的自我报告与观察行为无关,揭示自我报告与人类判断之间的混淆因素。

详情
AI中文摘要

大型语言模型(LLM)在人格量表上产生稳定的自我报告,但这些自我报告并不能预测观察到的行为。这一差距是反映了LLM与人类特质结构之间的不匹配,还是LLM自我报告本身的更深层属性,此前尚未解决。我们构建了第一个心理测量工具,其结构通过探索性因子分析(EFA)从LLM的行为可供性中自下而上地推导出来。我们对来自17个模型家族的25个LLM施测了300个项目(240个直接李克特+60个基于场景),涵盖12个候选行为维度,每个项目施测30次。EFA产生了一个5因子结构(响应性、顺从性、大胆性、谨慎性和健谈性),具有极好的分半可重复性(所有Tucker φ ≥ .957)和内部一致性(所有α ≥ .930)。为了测试预测效度,我们收集了2500个开放式行为样本,由151名人类评分者和一个由三个LLM评审组成的集成进行评分。人类评分与LLM评审评分一致(r̄ = .51),但两者均不跟踪自我报告:自我报告-人类r̄ = -.01,自我报告-评审r̄ = .13,且没有因子水平的自我报告-人类置信区间排除零。在响应性上,自我报告与LLM评审相关(r = .53),但与人类不相关(r = .04),尽管人类和评审一致(r = .59);这表明自我报告项目和LLM评审共享人类观察者未捕捉到的方差,这是一个在集成内部可靠性检查中不可见的混淆因素。我们将该工具作为诊断探针发布,用于检测对齐塑造的自我描述,并作为LLM-as-judge流程的具体风险因素。

英文摘要

Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deeper property of LLM self-report itself, has been unresolved. We constructed the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances via exploratory factor analysis (EFA). We administered 300 items (240 direct Likert + 60 scenario-based) spanning 12 candidate behavioral dimensions to 25 LLMs across 17 model families, each item administered 30 times. EFA yielded a 5-factor structure -- Responsiveness, Deference, Boldness, Guardedness, and Verbosity -- with excellent split-half replicability (all Tucker $ϕ\geq .957$) and internal consistency (all $α\geq .930$). To test predictive validity, we collected 2,500 open-ended behavioral samples rated by 151 human raters and a three-judge LLM ensemble. Human and judge ratings agreed ($\bar{r} = .51$), but neither tracked self-report: self-report--human $\bar{r} = -.01$, self-report--judge $\bar{r} = .13$, with no factor-level self-report--human CI excluding zero. On Responsiveness, self-report correlated with LLM judges ($r = .53$) but not humans ($r = .04$), even though humans and judges agreed ($r = .59$) -- indicating self-report items and LLM judges share variance that human observers do not, a confound invisible to within-ensemble reliability checks. We release the instrument as a diagnostic probe for alignment-shaped self-description and a concrete risk factor for LLM-as-judge pipelines.
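The two reliability statistics quoted above have standard textbook definitions, sketched below for reference. This is not the authors' analysis code: the function names and toy inputs are illustrative, and sample variance (n - 1 denominator) is an assumption.

```python
from math import sqrt
from statistics import variance
from typing import List, Sequence


def tucker_phi(x: Sequence[float], y: Sequence[float]) -> float:
    """Tucker's congruence coefficient between two factor-loading vectors."""
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def cronbach_alpha(item_scores: List[Sequence[float]]) -> float:
    """Cronbach's alpha; item_scores[i][j] is item i's score for respondent j."""
    k = len(item_scores)
    totals = [sum(col) for col in zip(*item_scores)]
    item_var_sum = sum(variance(scores) for scores in item_scores)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))


# Toy inputs: proportional loadings and two perfectly consistent items.
phi = tucker_phi([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
alpha = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])
```

Both toy values come out at 1, the ceiling for each statistic. The thresholds reported in the abstract (φ ≥ .957, α ≥ .930) sit close to that ceiling.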

2606.09890 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

PreAct-Bench:大语言模型中的预测性监控基准

Hainiu Xu, Italo Luis da Silva, Jiangnan Ye, Yuhao Wang, Wei Liu, Linyi Yang, Jonathan Richard Schwarz, Nicola Paoletti, Yulan He, Hanqi Yan

发表机构 * King’s College London(伦敦国王学院) National University of Singapore(新加坡国立大学) Southern University of Science and Technology(南方科技大学) Thomson Reuters Foundational Research(汤姆森路透基础研究) Imperial College London(伦敦帝国学院) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 提出预测性监控任务,在动作执行前判断是否会导致不道德行为,并构建PreActBench基准,评估多种模型发现该任务具有挑战性。

详情
AI中文摘要

大语言模型(LLMs)越来越多地被部署为能够执行多步动作轨迹以实现给定目标的自主代理。虽然现有的安全研究集中于从完整轨迹中检测不道德行为,但这种范式本质上是回顾性的:它仅在伤害已经发生后识别伤害。在这项工作中,我们研究了一个关键但被忽视的安全任务,我们称之为预测性监控:仅给定部分动作轨迹,模型能否在执行公开动作之前推断出它是否会以不道德行为告终?为了支持这一任务,我们提出了PreActBench,一个包含1000个跨五个领域的成对道德和不道德动作轨迹的基准。我们使用我们的前缀远见F1指标,在动作轨迹的不同部分上评估了一系列LLMs、安全护栏模型和潜在探测方法。结果表明,尽管人类取得了有希望的性能,但即使对于强模型,预测性监控仍然具有挑战性,突显了在LLM安全中需要面向未来的风险推理。

英文摘要

Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior from complete trajectories, this paradigm is fundamentally retrospective: it identifies harm only after it has already occurred. In this work, we study a critical yet overlooked safety task, which we term Predictive Monitoring: given only a partial action trajectory, can a model infer whether it will culminate in an unethical action before the overt action is executed? To support this task, we present PreActBench, a benchmark of 1,000 paired ethical and unethical action trajectories spanning five domains. We evaluate a range of LLMs, safety guardrail models, and latent probing methods across varying fractions of the action trajectory using our Prefix Foresight F1 metric. Results show that while humans achieve promising performance, predictive monitoring remains challenging even for strong models, highlighting the need for future-oriented risk reasoning in LLM safety.

2606.10156 2026-06-10 cs.IR cs.AI cs.CL 交叉投稿

$τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

$τ$-Rec:面向智能推荐系统的可验证基准

Bharath Sivaram Narasimhan, Karthik R Narasimhan

发表机构 * Independent Researcher(独立研究员) Princeton University(普林斯顿大学)

AI总结 针对多轮对话式智能推荐系统评估中主观性强、成本高的问题,提出$τ$-Rec基准,通过可验证奖励和揭示标记引导机制,结合pass^k可靠性指标,系统评估模型推理一致性,发现当前最佳模型可靠性仅约57%。

详情
AI中文摘要

随着推荐系统向智能、多轮对话界面转变,评估范式难以跟上步伐。当前的基准通常依赖“LLM作为评判者”的评估,这引入了主观性、高成本和不一致性。我们提出了$τ$-Rec,一个用于智能推荐系统的基准,它用可验证奖励取代主观评估,并采用揭示标记引导(RTE)机制来控制任务约束在对话中如何呈现。通过针对结构化目录谓词测试智能体,并采用pass^k可靠性指标,$τ$-Rec为一致的推理提供了系统测试。我们对五个模型家族(GPT-5.4、Claude Sonnet 4.6、Gemini 2.5 Flash、DeepSeek V4 Flash、Qwen3-32B和GPT-5 mini)的九种配置进行了评估,揭示了一个陡峭的可靠性悬崖,即使是最好的模型在pass^1上也仅达到约57%,在pass^4上约38%,突显了当前对话智能体部署中的关键差距。所有代码和数据均已公开:https://github.com/nbharaths/tau-rec

英文摘要

As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present $τ$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, $τ$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
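The abstract does not define pass^k. The sketch below assumes it follows the definition used in the earlier τ-bench work: the probability that all k independent trials of a task succeed, estimated per task from n trials with c successes as C(c, k) / C(n, k) and then averaged over tasks. The function names and toy numbers are illustrative.

```python
from math import comb
from typing import List, Tuple


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Unbiased per-task estimate of P(all k i.i.d. trials succeed).

    Assumed definition: C(c, k) / C(n, k), for n trials with c successes.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    return comb(num_successes, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Toy results: three tasks, four trials each, with 4, 2 and 0 successes.
toy = [(4, 4), (4, 2), (4, 0)]
p1 = benchmark_pass_hat_k(toy, 1)  # ordinary success rate
p4 = benchmark_pass_hat_k(toy, 4)  # only the always-solved task counts
```

Under this definition pass^1 is the plain success rate and pass^k can only fall as k grows, which matches the drop from about 57% to about 38% reported above.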

2606.10254 2026-06-10 cs.AI cs.CL 交叉投稿

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

RealMath-Eval:为何SOTA裁判难以应对真实人类推理

Yiteng Mao, Kenan Xu, Yijia Lyu, Wenhao Li, Jianlong Chen, Xiangfeng Wang

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) East China Normal University(华东师范大学) New York University(纽约大学) Tongji University(同济大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出RealMath-Eval基准,评估LLM裁判对真实学生数学解答的评分能力,发现与人类评分存在高均方误差,而合成数据上表现更好,揭示评估差距源于人类错误空间的多样性和高信息熵。

Comments Code available at https://github.com/RicharMd/RealMath-Eval, Data available at https://huggingface.co/datasets/RicharMd/RealMath-Eval

详情
AI中文摘要

尽管大型语言模型(LLM)在解答高中数学方面已接近完美,但它们评估真实学生多样化推理过程的能力仍未得到充分检验。为弥补这一差距,我们引入了RealMath-Eval,一个严格标注的基准,包含224份来自高中的真实考试答卷。我们的初步评估显示,即使是最先进的LLM裁判在此任务上也表现不佳,与人类专家评分相比呈现出高均方误差(约2.96)。为探究可能的原因,我们将此表现与同一裁判评估合成LLM生成解答的控制设置进行对比。我们识别出一个明显的“评估差距”:裁判在合成文本上准确性和一致性显著更高(MSE约1.17),但难以泛化到真实学生推理。通过语义嵌入分析,我们发现合成错误会“结构坍缩”为可预测的低维线性子空间,而人类错误则形成更多样的错误空间。此外,生成概率探测表明,人类推理涉及显著更高的信息论惊喜度,表明学生推理转换对当前模型而言更加分布外。最后,我们发现表面层面的风格迁移无法弥合这一差距。我们的发现表明,当前严重依赖合成数据的LLM评估流程可能无法充分捕捉真实学生数学推理的多样性。

英文摘要

While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce RealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error (approximately 2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark "Evaluation Gap": judges are considerably more accurate and consistent on synthetic text (MSE approximately 1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a "structural collapse" into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.

2606.10281 2026-06-10 cs.CR cs.CL 交叉投稿

Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations

基准测试与探索LLM在攻击调查中的能力

Aniket Anand, Yiwei Hou, Daniel Fields, Alex Kantchelian, David Tao, Kurt Thomas, Grant Ho

发表机构 * University of Chicago(芝加哥大学) University of California, Berkeley(加州大学伯克利分校) Google(谷歌)

AI总结 提出AuditBench基准数据集,评估LLM在安全审计日志分析中的性能,涵盖四种常见调查任务,揭示模型在不同设计选择下的表现差异与错误类型。

详情
AI中文摘要

本文提出了AuditBench,一个新的基准数据集,用于评估LLM在调查安全相关系统审计日志方面的能力。我们设计并使用该基准来探索LLM在事件响应团队通常执行的四种日志调查任务上的表现,范围从对检测器生成的警报进行分类到识别受损系统上的持久性机制。AuditBench包含从Linux和Windows机器收集的系统审计日志,涵盖50多种不同的安全调查场景,包括恶意和良性活动。利用我们的基准,我们评估并分析了五个前沿LLM在分析审计日志以进行攻击调查方面的性能。我们的分析揭示了LLM性能和错误概况如何根据不同的设计选择而变化,例如模型大小、数据表示、提示构建和特定调查任务的差异。此外,我们描述了LLM生成的解释质量以及模型在我们的基准中犯的错误类型。总的来说,我们的工作为评估LLM调查安全日志的能力提供了基础,为在安全运营中使用LLM的从业者提供了新颖的见解,并为未来研究指明了重要方向。

英文摘要

This paper presents AuditBench, a new benchmark dataset for evaluating the capabilities of LLMs at investigating security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. AuditBench consists of system audit logs collected from Linux and Windows machines, and spans over 50 different security investigation scenarios, including both malicious and benign activity. Using our benchmark, we evaluate and analyze the performance of five frontier LLMs at analyzing audit logs for attack investigations. Our analysis illuminates how LLM performance and error profiles vary according to different design choices, such as differences in model size, data representation, prompt construction, and specific investigation tasks. Additionally, we characterize the quality of the explanations produced by LLMs and the types of errors that models make across our benchmark. Collectively, our work provides a foundation for assessing the capabilities of LLMs for investigating security logs, novel insights for practitioners using LLMs in security operations, and important directions for future research.

2606.10287 2026-06-10 cs.LG cs.CL 交叉投稿

When Metrics Disagree: A Meta-Analysis of Knowledge-Graph-Completion Model Benchmarking

当指标不一致时:知识图谱补全模型基准测试的元分析

Haji Gul, Ajaz Ahmad Bhat

发表机构 * School of Digital Science, Universiti Brunei Darussalam(文莱达鲁萨兰大学数字科学学院)

AI总结 针对KGC模型评估中指标冲突问题,提出多准则决策框架,通过元分析发现Z-score是最平衡的聚合器,并识别出不同预测任务下的最优模型。

详情
AI中文摘要

评估知识图谱补全(KGC)模型仍然具有挑战性,因为标准评估依赖于孤立的基于排名的指标,如MRR、Hits@k和Mean Rank,这些指标通常在不同数据集上产生冲突的模型排序。一个在MRR上领先的模型可能在Hits@1上落后,而在一个数据集上的强性能可能无法推广到另一个数据集。这种碎片化阻碍了比较,使得选择性报告成为可能,并掩盖了真正的进展。我们将KGC评估重新定义为多准则决策(MCDM)问题,并提出了一个对七个聚合器在五个测试上的元分析:一致性、跨数据集稳定性、指标独立性、噪声下的鲁棒性和泛化性。每个测试通过留一模型(LOMO)和留一组(LOGO)移除进行平均,以便可靠性反映聚合器在不同模型子集上的行为。在尾部$(h,r,?)$和关系$(h,?,t)$预测中,帕累托最优分析确定Z-score是最平衡的聚合器,该聚合器在尾部预测中将DualE排在首位,在关系预测中将FMS(流调制评分)排在首位。使用相同移除的测试敏感性分析表明,一致性和稳定性在很大程度上是移除不变的,而泛化性和独立性是最敏感的。该框架解决了评估不一致性,并为KGC中的聚合器选择和模型基准测试提供了基于证据的指导。

英文摘要

Evaluating Knowledge Graph Completion (KGC) models remains challenging because standard assessment relies on isolated rank-based metrics such as MRR, Hits@k, and Mean Rank, which often produce conflicting model orderings across datasets. A model that leads on MRR may trail on Hits@1, and strong performance on one dataset may not generalize to another. This fragmentation hinders comparison, enables selective reporting, and obscures real progress. We reframe KGC evaluation as a Multi-Criteria Decision-Making (MCDM) problem and present a meta-analysis of seven aggregators across five tests: consistency, cross-dataset stability, metric independence, robustness under noise, and generalizability. Each test is averaged over leave-one-model-out (LOMO) and leave-one-group-out (LOGO) removals so that reliability reflects aggregator behavior across diverse model subsets. Across tail $(h,r,?)$ and relation $(h,?,t)$ prediction, Pareto-optimal analysis identifies Z-score as the most balanced aggregator, which ranks DualE highest for tail prediction and FMS (Flow-Modulated Scoring) highest for relation prediction. A test-sensitivity analysis using the same removals shows that consistency and stability are largely removal-invariant, while generalizability and independence are the most sensitive. The framework resolves evaluation inconsistencies and offers evidence-based guidance for aggregator selection and model benchmarking in KGC.
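A Z-score aggregator of the kind named above could look like the sketch below. It standardizes each metric across models, flips the sign of lower-is-better metrics such as Mean Rank, and averages. The paper's exact formulation is not given in the abstract, so equal metric weights and population standard deviation are assumptions, and the model names and scores are invented.

```python
from statistics import mean, pstdev
from typing import Dict, FrozenSet


def zscore_aggregate(
    scores: Dict[str, Dict[str, float]],
    lower_is_better: FrozenSet[str] = frozenset(),
) -> Dict[str, float]:
    """Average per-metric z-scores for each model (higher is better)."""
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    totals = {m: 0.0 for m in models}
    for metric in metrics:
        values = [scores[m][metric] for m in models]
        mu, sigma = mean(values), pstdev(values)
        for m in models:
            # A metric with no spread carries no ranking information.
            z = 0.0 if sigma == 0 else (scores[m][metric] - mu) / sigma
            totals[m] += -z if metric in lower_is_better else z
    return {m: totals[m] / len(metrics) for m in models}


# Invented scores: model_a leads on MRR, model_b on Hits@1 and Mean Rank.
toy_scores = {
    "model_a": {"MRR": 0.40, "Hits@1": 0.28, "MeanRank": 150.0},
    "model_b": {"MRR": 0.38, "Hits@1": 0.31, "MeanRank": 120.0},
    "model_c": {"MRR": 0.30, "Hits@1": 0.20, "MeanRank": 300.0},
}
agg = zscore_aggregate(toy_scores, lower_is_better=frozenset({"MeanRank"}))
best = max(agg, key=agg.get)
```

In this toy case the individual metrics disagree on the leader, and the aggregate resolves the conflict in favor of `model_b`, which is the kind of situation the abstract describes.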

2606.10956 2026-06-10 cs.AI cs.CL 交叉投稿

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

注意差距:前沿大语言模型能否通过标准化办公能力考试?

Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei

发表机构 * Microsoft Research(微软研究院)

AI总结 基于中国全国计算机等级考试(NCRE)的200个综合操作任务,评估7个前沿LLM在Word、Excel和PowerPoint自动化中的表现,发现单轮模型最高得分率36.6%,带执行反馈的智能体系统达68.8%,仍低于95.5%的社区参考分,表明可靠细粒度办公自动化仍是重大挑战。

Comments 21 pages, 5 figures

详情
AI中文摘要

大语言模型(LLM)代理在计算机自动化领域的部署正在加速,但其在复杂、专业级生产力软件中的导航能力在很大程度上尚未得到测试。我们认为办公自动化是基准测试文档自动化能力的理想环境,因为它需要长期规划和推理、精确的参数配置以及多应用集成。为了量化这种能力,我们引入了一项基于中国国家计算机等级考试(NCRE)的评估,包含200个涵盖Word、Excel和PowerPoint的综合实践操作任务。每个任务根据7118个机器可评分标准按100分制评分,得分率(SR)表示这些任务中获得的平均评分百分比。我们对7个前沿LLM进行了基准测试,并观察到明显的局限性:单轮模型最高得分为36.6%。一个具有执行反馈、迭代修复和更广泛办公自动化访问权限的更强智能体系统达到了68.8%,但仍低于用作评分合理性检查的95.5%社区参考分。最终,我们的实验表明,尽管代码生成最近取得了进展,但对于当前的代码生成LLM和智能体系统来说,实现可靠的细粒度办公文档自动化仍然是一个重大挑战。

英文摘要

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.
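
The Score Rate definition above (mean percentage of rubric points earned across tasks) can be sketched as follows. The `Criterion` type, the point values, and the pass/fail outcomes are hypothetical; the abstract does not say how individual criteria are weighted.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    points: float  # rubric points this check is worth (assumed weighting)
    passed: bool   # whether the automatic check succeeded


def task_score(criteria):
    """Percentage of rubric points earned on one task (0 to 100)."""
    total = sum(c.points for c in criteria)
    earned = sum(c.points for c in criteria if c.passed)
    return 100.0 * earned / total


def score_rate(tasks):
    """Mean per-task percentage across all tasks."""
    return sum(task_score(t) for t in tasks) / len(tasks)


# Two made-up tasks, each with a 100-point rubric.
tasks = [
    [Criterion(40, True), Criterion(35, False), Criterion(25, True)],   # 65 of 100
    [Criterion(50, False), Criterion(30, True), Criterion(20, False)],  # 30 of 100
]
print(score_rate(tasks))  # 47.5
```

Averaging per-task percentages, rather than pooling all criteria, keeps a task with many small checks from outweighing a task with few large ones.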

2012.15621 2026-06-10 cs.CL 版本更新

Open Korean Corpora: A Practical Report

开放韩语语料库:一份实践报告

Won Ik Cho, Sangwhan Moon, Youngsook Song

发表机构 * AI Center, Samsung Electronics(三星电子AI中心) Google LLC(谷歌公司) Lablup Inc.(Lablup公司)

AI总结 本文梳理并评述了现有韩语开放语料库,涵盖机构级资源及各类任务数据集,并针对低资源语言提出了开源数据集构建与发布的建议。

Comments Published (v1) in NLP-OSS @EMNLP2020; May 2023 (v2) added with new datasets; June 2026 (v3) added analyses

详情
AI中文摘要

韩语在研究界常被视为低资源语言。虽然这一说法部分正确,但也因为资源的可用性没有得到充分的宣传和管理。本工作整理并评述了一份韩语语料库列表,首先描述了机构级别的资源开发,然后进一步遍历了当前针对不同任务类型的开放数据集。最后,我们提出了针对低资源语言应如何进行开源数据集构建和发布以促进研究的方向。

英文摘要

Korean is often referred to as a low-resource language in the research community. While this claim is partially true, the perception also persists because available resources are inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction on how open-source dataset construction and releases should be done for less-resourced languages to promote research.

2501.14717 2026-06-10 cs.CL 版本更新

What Really Matters for Table LLMs? A Meta-Evaluation of Model and Data Effects

表格LLM真正重要的是什么?模型与数据影响的元评估

Naihao Deng, Sheng Zhang, Henghui Zhu, Shuaichen Chang, Jiani Zhang, Alexander Hanbo Li, Chung-Wei Hang, Hideo Kobayashi, Yiqun Hu, Patrick Ng

发表机构 * University of Michigan(密歇根大学) AWS AI Labs(AWS人工智能实验室) Figma OKX Google(谷歌)

AI总结 通过指令微调12个模型并在16个基准上评估,发现基座模型选择比训练数据对性能影响更大,泛化与推理仍是挑战。

Comments EACL 2026 Findings

详情
AI中文摘要

表格建模已经发展了数十年。在这项工作中,我们重新审视了这一轨迹,并强调了LLM时代出现的新挑战,特别是选择悖论:在表格指令微调的背景下,由于基础模型和训练集的多样性,难以将性能提升归因于特定因素。我们通过指令微调三个基础模型在四个现有数据集上,复制了四个表格LLM,共得到12个模型。然后我们在16个表格基准上评估这些模型。我们的研究首次定量分离了训练数据和基础模型选择的影响,揭示了基础模型选择比训练数据本身起更主导的作用。泛化和推理仍然具有挑战性,需要未来在表格建模上继续努力。基于我们的发现,我们分享了对表格建模未来方向的思考。

英文摘要

Table modeling has progressed for decades. In this work, we revisit this trajectory and highlight emerging challenges in the LLM era, particularly the paradox of choice: the difficulty of attributing performance gains amid diverse base models and training sets in the context of table instruction tuning. We replicate four table LLMs by instruction-tuning three foundation models on four existing datasets, yielding 12 models. We then evaluate these models across 16 table benchmarks. Our study is the first to quantitatively disentangle the effects of training data and base model selection, revealing that base model choice plays a more dominant role than the training data itself. Generalization and reasoning remain challenging, inviting future effort on table modeling. Based on our findings, we share our thoughts on the future directions for table modeling.

2504.02323 2026-06-10 cs.CL 版本更新

CoTAL: Human-in-the-Loop Prompt Engineering for Generalizable Formative Assessment Scoring and Feedback

CoTAL:面向可泛化形成性评估评分与反馈的人机协同提示工程

Clayton Cohn, Ashwin T S, Naveeduddin Mohammed, Gautam Biswas

发表机构 * Vanderbilt University(范德比大学)

AI总结 提出CoTAL方法,结合证据中心设计、人机协同提示工程和思维链提示,迭代优化LLM评分,在多个领域提升GPT-4评分性能达38.9%,并获师生认可。

Comments Submitted to Computers and Education: Artificial Intelligence. Currently under review

详情
AI中文摘要

大型语言模型(LLM)为辅助教师和支持学生学习创造了新机遇。尽管研究者已在教育背景下探索了各种提示工程方法,但这些方法在科学、计算和工程等领域的泛化程度仍待深入研究。本文提出思维链提示+主动学习(CoTAL),一种基于LLM的形成性评估评分方法,该方法(1)利用证据中心设计(ECD)将评估和评分标准与课程目标对齐,(2)应用人机协同提示工程自动化响应评分,(3)结合思维链(CoT)提示以及教师和学生反馈,迭代优化问题、评分标准和LLM提示。我们的研究结果表明,CoTAL提升了GPT-4在多个领域的评分性能,相比无提示工程基线(即无标注示例、思维链提示或迭代优化),增益高达38.9%。教师和学生认为CoTAL在评分和解释响应方面有效,他们的反馈产生了有价值的见解,提高了评分准确性和解释质量。

英文摘要

Large language models (LLMs) have created new opportunities to assist teachers and support student learning. While researchers have explored various prompt engineering approaches in educational contexts, the degree to which these approaches generalize across domains--such as science, computing, and engineering--remains underexplored. In this paper, we introduce Chain-of-Thought Prompting + Active Learning (CoTAL), an LLM-based approach to formative assessment scoring that (1) leverages Evidence-Centered Design (ECD) to align assessments and rubrics with curriculum goals, (2) applies human-in-the-loop prompt engineering to automate response scoring, and (3) incorporates chain-of-thought (CoT) prompting and teacher and student feedback to iteratively refine questions, rubrics, and LLM prompts. Our findings demonstrate that CoTAL improves GPT-4's scoring performance across domains, achieving gains of up to 38.9% over a non-prompt-engineered baseline (i.e., without labeled examples, chain-of-thought prompting, or iterative refinement). Teachers and students judge CoTAL to be effective at scoring and explaining responses, and their feedback produces valuable insights that enhance grading accuracy and explanation quality.

2510.07061 2026-06-10 cs.CL 版本更新

Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

重新审视印度语言机器翻译和摘要细粒度评估的度量可靠性

Amir Hossein Yari, Kalmit Kulkarni, Ahmad Raza Khan, Fajri Koto

发表机构 * Sharif University of Technology(谢里夫理工大学) Vellore Institute of Technology(韦洛尔理工学院) IIT Kharagpur(印度理工学院克勒格布尔分校) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 针对印度语言评估不足的问题,提出ITEM基准,系统评估29种自动度量与人工判断的对齐,发现基于LLM的评估器表现最佳,并揭示了异常值影响、任务差异及扰动鲁棒性等关键发现。

Comments 18 pages, 14 figures

详情
AI中文摘要

虽然自动度量推动了机器翻译(MT)和文本摘要(TS)的发展,但现有度量几乎完全针对英语和其他高资源语言开发和验证。这种狭隘的关注使得超过15亿人使用的印度语言在很大程度上被忽视,对当前评估实践的普遍性提出了质疑。为弥补这一空白,我们引入了ITEM,一个大规模基准,系统评估了29种自动度量与六种主要印度语言人工判断的对齐,并丰富了细粒度注释。我们的广泛评估涵盖了与人工判断的一致性、对异常值的敏感性、语言特定可靠性、度量间相关性以及对受控扰动的鲁棒性,揭示了四个核心发现:(1)基于LLM的评估器在段落和系统级别上与人工判断的对齐最强;(2)异常值对度量-人工一致性有显著影响;(3)在TS中,度量在捕捉内容保真度方面更有效,而在MT中,它们更好地反映流畅性;(4)度量在受到不同扰动时,其鲁棒性和敏感性有所不同。总体而言,这些发现为推进印度语言的度量设计和评估提供了关键指导。

英文摘要

While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 29 automatic metrics with human judgments across six major Indian languages, enriched with fine-grained annotations. Our extensive evaluation, covering agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations, reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.
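
Finding (2), that outliers can dominate metric-human agreement, can be illustrated with a small sketch. The scores below are invented, and Pearson correlation is used only for brevity; the abstract does not state which correlation statistics the benchmark reports.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical segment-level scores; the last pair is one very bad output
# that both the human raters and the metric score far below the rest.
human = [3.0, 3.2, 2.8, 3.1, 2.9, 0.2]
metric = [0.62, 0.58, 0.61, 0.57, 0.60, 0.05]

with_outlier = pearson(human, metric)
without_outlier = pearson(human[:-1], metric[:-1])
print(round(with_outlier, 2), round(without_outlier, 2))
```

With the single outlier included the correlation is strongly positive; once it is removed, the remaining five segments are negatively correlated, so the apparent agreement came almost entirely from one point.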

2601.18026 2026-06-10 cs.CL 版本更新

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

CommonLID:重新评估网络数据上最先进的语言识别性能

Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, Thom Vaughan, Damian Stewart, Malte Ostendorff, Idris Abdulmumin, Vukosi Marivate, Shamsuddeen Hassan Muhammad, Atnafu Lambebo Tonja, Hend Al-Khalifa, Nadia Ghezaiel Hammouda, Verrah Otiende, Tack Hwa Wong, Jakhongir Saydaliev, Melika Nobakhtian, Muhammad Ravi Shulthan Habibi, Chalamalasetti Kranti, Carol Muchemi, Khang Nguyen, Faisal Muhammad Adam, Luis Frentzen Salim, Reem Alqifari, Cynthia Amol, Joseph Marvin Imperial, Ilker Kesen, Ahmad Mustafid, Pavel Stepachev, Leshem Choshen, David Anugraha, Hamada Nayel, Seid Muhie Yimam, Vallerie Alexandra Putra, My Chiffon Nguyen, Azmine Toushik Wasi, Gouthami Vadithya, Rob van der Goot, Lanwenn ar C'horr, Karan Dua, Andrew Yates, Mithil Bangera, Yeshil Bangera, Hitesh Laxmichand Patel, Shu Okabe, Fenal Ashokbhai Ilasariya, Dmitry Gaynullin, Genta Indra Winata, Yiyuan Li, Juan Pablo Martínez, Amit Agarwal, Ikhlasul Akmal Hanif, Raia Abu Ahmad, Esther Adenuga, Filbert Aurelian Tjiaranata, Weerayut Buaphet, Michael Anugraha, Sowmya Vajjala, Benjamin Rice, Azril Hafizi Amirudin, Jesujoba O. Alabi, Srikant Panda, Yassine Toughrai, Bruhan Kyomuhendo, Daniel Ruffinelli, Akshata A, Manuel Goulão, Ej Zhou, Ingrid Gabriela Franco Ramirez, Cristina Aggazzotti, Konstantin Dobler, Jun Kevin, Quentin Pagès, Nicholas Andrews, Nuhu Ibrahim, Mattes Ruckdeschel, Amr Keleg, Mike Zhang, Casper Muziri, Saron Samuel, Sotaro Takeshita, Kun Kerdthaisong, Luca Foppiano, Rasul Dent, Tommaso Green, Ahmad Mustapha Wali, Kamohelo Makaaka, Vicky Feliren, Inshirah Idris, Hande Celikkanat, Abdulhamid Abubakar, Jean Maillard, Benoît Sagot, Thibault Clérice, Kenton Murray, Sarah Luger

发表机构 * Common Crawl Foundation(Common Crawl基金会) EleutherAI Factored AI MLCommons

AI总结 提出CommonLID基准,覆盖109种语言,通过人工标注评估8种主流LID模型,揭示现有评估高估了网络领域多语言识别准确率。

Comments 18 pages, 8 tables, 5 figures

详情
AI中文摘要

语言识别(LID)是整理多语言语料库的基本步骤。然而,LID模型在许多语言上仍然表现不佳,尤其是在用于训练多语言语言模型的嘈杂且异构的网络数据上。在本文中,我们介绍了CommonLID,一个社区驱动、人工标注的网络领域LID基准,涵盖109种语言。其中许多语言此前未得到充分服务,使得CommonLID成为开发更具代表性的高质量文本语料库的关键资源。我们通过使用CommonLID以及其他五个常见的评估集来测试八种流行的LID模型,展示了其价值。我们分析结果以定位我们的贡献,并提供对当前技术水平的概述。我们特别强调,现有评估高估了网络领域许多语言的LID准确率。我们以开放、宽松的许可证提供CommonLID和用于创建它的代码。

英文摘要

Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.

2602.12424 2026-06-10 cs.CL cs.AI 版本更新

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

RankLLM: 通过量化问题难度对大型语言模型进行加权排名

Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun

发表机构 * Lehigh University(里海大学) University of Notre Dame(圣母大学) Zhejiang Wanli University(浙江万里学院) Squirrel Ai Learning(松鼠Ai) City University of Hong Kong(香港城市大学) Duke University(杜克大学)

AI总结 提出RankLLM框架,通过量化问题难度和模型能力实现细粒度评估,在35550个问题上对30个模型进行评测,与人类判断一致性达90%。

Comments 32 pages, 9 figures. Accepted by ICLR 2026

详情
AI中文摘要

基准测试建立了标准化的评估框架,以系统评估大型语言模型(LLM)的性能,促进客观比较并推动该领域的进步。然而,现有基准测试未能区分问题难度,限制了其有效区分模型能力的能力。为解决这一局限,我们提出了RankLLM,一种旨在量化问题难度和模型能力的新框架。RankLLM引入难度作为区分的主要标准,实现了对LLM能力的更细粒度评估。RankLLM的核心机制促进了模型与问题之间的双向分数传播。RankLLM的核心直觉是:当模型正确回答一个问题时,它获得一个能力分数;而当一个问题难倒模型时,其难度分数增加。利用该框架,我们在多个领域的35550个问题上评估了30个模型。RankLLM与人类判断的一致性达到90%,并且始终优于IRT等强基线。它还表现出强大的稳定性、快速收敛和高计算效率,使其成为大规模、难度感知的LLM评估的实用解决方案。

英文摘要

Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
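
The bidirectional intuition above (a model gains competency by solving questions, a question gains difficulty by defeating models) can be sketched as a damped, HITS-style iteration. This is an illustrative stand-in, not RankLLM's actual update rule, which the abstract does not give; the weighting, normalisation, damping factor, and the toy answer matrix are all assumptions.

```python
def normalise(values):
    """Scale non-negative scores so they sum to 1 (left unchanged if all are zero)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def propagate(correct, iterations=50, damping=0.5):
    """correct[m][q] is True when model m answered question q correctly."""
    n_models, n_questions = len(correct), len(correct[0])
    competency = [1.0 / n_models] * n_models
    difficulty = [1.0 / n_questions] * n_questions
    for _ in range(iterations):
        # A question gains difficulty from every model it defeats,
        # weighted by how competent that model currently looks.
        difficulty = normalise([
            sum(competency[m] for m in range(n_models) if not correct[m][q])
            for q in range(n_questions)
        ])
        # A model gains competency from every question it solves,
        # weighted by how difficult that question currently looks.
        update = normalise([
            sum(difficulty[q] for q in range(n_questions) if correct[m][q])
            for m in range(n_models)
        ])
        # Damping keeps the alternating updates from oscillating.
        competency = [damping * old + (1 - damping) * new
                      for old, new in zip(competency, update)]
    return competency, difficulty


# Made-up results: rows are models A-D, columns are questions q0-q3.
CORRECT = [
    [True, False, True, False],   # A solves q0 and q2 (q2 defeats three models)
    [True, True, False, False],   # B solves q0 and q1 (q1 defeats two models)
    [True, True, False, False],   # C matches B
    [True, False, False, False],  # D solves only q0
]
competency, difficulty = propagate(CORRECT)
print([round(c, 3) for c in competency])
print([round(d, 3) for d in difficulty])
```

Models A and B both answer two questions correctly, so plain accuracy ties them; the propagation separates them because A's second correct answer is on the question that defeats more models.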

2603.09979 2026-06-10 cs.CL 版本更新

GhazalBench: Evaluating LLM Understanding and Canonical Surface-Form Access in Persian Ghazals

GhazalBench: 评估大语言模型对波斯抒情诗的理解与规范表层形式访问

Ghazal Kalhor, Yadollah Yaghoobzadeh

发表机构 * School of Electrical and Computer Engineering, College of Engineering, University of Tehran(德黑兰大学工程学院电气与计算机工程学院) Tehran Institute for Advanced Studies, Khatam University(哈塔姆大学德黑兰高等研究院)

AI总结 提出GhazalBench基准,评估LLM在波斯抒情诗中的诗意理解与规范表层形式访问能力,发现模型普遍能理解诗意但难以生成精确诗句,而识别任务缩小差距,英语表现更好,表明训练数据差异是关键。

详情
AI中文摘要

波斯诗歌在伊朗文化实践中扮演着活跃角色,哈菲兹等经典诗人的诗句常被引用、释义或根据部分线索补全。支持此类交互要求语言模型不仅理解诗意,还要掌握文化规范的表层形式。我们提出GhazalBench,一个评估大语言模型(LLM)在基于使用条件下与波斯抒情诗交互的基准。与先前主要将记忆视为缺陷的研究不同,GhazalBench考察在文化基础交互中精确表层形式访问功能重要的场景。该基准评估两种互补能力:诗到散文的理解,以及在变化语义和词汇线索下的规范表层形式访问。在多个专有和开源多语言LLM中,我们观察到一致的分离:模型通常能捕捉诗意,但在开放设置中难以生成精确诗句补全,而基于识别的设置显著缩小了这一差距。在英语十四行诗上的平行实验显示出明显更强的补全性能,表明这些限制更多与训练暴露差异相关,而非固有架构约束。我们的发现强调了需要联合评估意义、形式及对文化重要文本的线索依赖访问的评估框架。GhazalBench可在 https://anonymous.4open.science/r/GhazalBench/ 获取。

英文摘要

Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally canonical surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. Unlike prior work that primarily studies memorization as a liability, GhazalBench examines settings where access to exact surface form is functionally important for culturally grounded interaction. The benchmark evaluates two complementary abilities: poem-to-prose understanding and canonical surface-form access under varying semantic and lexical cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle to produce exact verse completions in open-ended settings, while recognition-based settings substantially reduce this gap. Parallel experiments on English sonnets show markedly stronger completion performance, suggesting that these limitations are tied more to differences in training exposure than to inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at https://anonymous.4open.science/r/GhazalBench/.

2603.29025 2026-06-10 cs.CL cs.AI 版本更新

The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

模型说走:表面启发式如何覆盖LLM推理中的隐式约束

Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Independent Researcher(独立研究者)

AI总结 研究LLM在表面线索与隐式约束冲突时的失败,提出启发式覆盖基准(HOB),通过因果行为分析揭示距离线索影响远大于目标,并验证目标分解提示可部分恢复性能。

详情
AI中文摘要

当显著的表面线索与未陈述的可行性约束冲突时,大型语言模型会失败。我们引入了启发式覆盖基准(HOB):500个实例,涵盖4个启发式家族和5个约束家族,具有最小对和显式性梯度。我们将HOB与一个可证伪的行为特征描述配对,遵循诊断-测量-桥接-治疗弧。对六个模型的洗车问题进行因果行为分析,揭示了上下文无关的S形启发式:距离线索的影响力是目标的8.7到38倍,归因更匹配关键词关联而非组合推理。在14个模型中,严格的10/10评估显示,没有模型超过75%,存在约束最难,为44%。一个最小提示将性能提高15个百分点,表明是约束推断失败而非知识缺失。然而,14个模型中有12个在移除约束后表现更差,最多下降39个百分点,揭示了保守偏差。对Gemini 3.1 Pro的思考模式消融实验显示,思考开启时性能为74.6%,关闭时降至58.4%,而显式目标分解将其恢复至71.2%。因此,内部推理确实有用,显式提示可以部分替代。推理模型并不绝对优于非推理模型:在控制能力排名后,残差推理模式效应为1.8个百分点且不显著。参数探针显示S形模式泛化到成本、效率和语义相似性启发式。目标分解提示将性能提升5.0个百分点,而通用思维链提升3.1个百分点,将约束枚举隔离为有效成分。总体而言,启发式覆盖是一个系统性的推理漏洞,其量化位点在于推理顺序而非知识,并且有一个经过测试的干预措施。

英文摘要

Large language models fail when a salient surface cue conflicts with an unstated feasibility constraint. We introduce the Heuristic Override Benchmark (HOB): 500 instances spanning 4 heuristic families and 5 constraint families, with minimal pairs and explicitness gradients. We pair HOB with a falsifiable behavioral characterization following a diagnose-measure-bridge-treat arc. Causal-behavioral analysis of the car wash problem across six models reveals context-independent sigmoid heuristics: the distance cue has 8.7 to 38 times more influence than the goal, and attribution better matches keyword association than compositional inference. Across 14 models, strict 10/10 evaluation shows that no model exceeds 75%, and presence constraints are hardest at 44%. A minimal hint improves performance by 15 pp, suggesting a constraint-inference failure rather than missing knowledge. However, 12 of 14 models perform worse when the constraint is removed, by up to 39 pp, revealing conservative bias. A thinking-mode ablation on Gemini 3.1 Pro drops performance from 74.6% with thinking on to 58.4% with thinking off, while explicit goal decomposition recovers it to 71.2%. Thus, internal deliberation does useful work, and explicit prompting can partially substitute for it. Reasoning models do not categorically outperform non-reasoning peers: after controlling for capability rank, the residual reasoning-mode effect is 1.8 pp and is not significant. Parametric probes show that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics. Goal-decomposition prompting improves performance by 5.0 pp, compared with 3.1 pp for generic chain-of-thought, isolating constraint enumeration as the active ingredient. Overall, heuristic override is a systematic reasoning vulnerability with a quantified locus in inference order, not knowledge, and a tested intervention.

2604.13717 2026-06-10 cs.CL 版本更新

On Cost-Effective LLM-as-a-Judge Improvement Techniques

关于成本效益的LLM作为评判者的改进技术

Ryan Lail, Luke Markham

AI总结 研究通过集成评分、任务特定标准注入等四种技术提高LLM评判准确性,在RewardBench 2上达到85.8%准确率,成本效益显著。

Comments Accepted at the ICML 2026 workshops "Statistical Frameworks for Uncertainty in Agentic Systems" and "Combining Theory and Benchmarks: Towards a Virtuous Cycle to Understand and Guarantee Foundation Model Performance". 13 pages, 9 figures

详情
AI中文摘要

使用语言模型对候选回答进行评分或排序已成为强化学习从人类反馈(RLHF)流程、基准测试和应用层评估中人类评估的可扩展替代方案。然而,输出可靠性在很大程度上依赖于提示和聚合策略。我们对四种即插即用技术——集成评分、任务特定标准注入、校准上下文和自适应模型升级——进行了实证研究,以在RewardBench 2上提高LLM评判准确性,并通过噪声控制的统一视角对随机评判器进行分析:集成作为每次调用噪声的蒙特卡洛平均,标准注入作为回答间判别锐化,以及每次回答得分方差作为不确定性信号。集成评分和任务特定标准注入(后者几乎零成本)共同达到高达85.8%的准确率,比基线提高13.5个百分点。校准上下文和自适应模型升级也优于基线,但在成本-准确率帕累托前沿上被标准注入+集成所主导。小模型从集成中获益不成比例,使得高准确率的LLM评判器可以低成本获得。我们表明这些技术在不同模型提供商之间具有泛化性,在OpenAI GPT和Anthropic Claude系列上进行了评估。

英文摘要

Using a language model to score or rank candidate responses has become a scalable alternative to human evaluation in reinforcement learning from human feedback (RLHF) pipelines, benchmarking, and application layer evaluations. However, output reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of four drop-in techniques -- ensemble scoring, task-specific criteria injection, calibration context, and adaptive model escalation -- for improving LLM judge accuracy on RewardBench 2, with a unifying lens of noise control on the stochastic judge: ensembling as Monte Carlo averaging over per-call noise, criteria injection as between-response discrimination sharpening, and per-response score variance as an uncertainty signal. Ensemble scoring and task-specific criteria injection (the latter virtually cost free) together reach up to 85.8% accuracy, +13.5pp over baseline. Calibration context and adaptive model escalation also improve over baseline but are dominated by criteria + ensembling on the cost-accuracy Pareto frontier. Small models benefit disproportionately from ensembling, making high-accuracy LLM judges accessible at low cost. We show that these techniques generalise across model providers, evaluating on both OpenAI GPT and Anthropic Claude families.
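
The noise-control view of ensembling (Monte Carlo averaging over per-call noise, with per-response score variance as an uncertainty signal) can be sketched with a simulated judge. `noisy_judge` is a hypothetical stand-in for a real LLM call, and the Gaussian noise model, quality values, and ensemble size are assumptions chosen for illustration.

```python
import random
from statistics import mean, pvariance


def noisy_judge(true_quality, rng, noise=1.5):
    """Stand-in for one LLM judge call: the true quality plus per-call noise."""
    return true_quality + rng.gauss(0.0, noise)


def ensemble_score(true_quality, k, rng):
    """Average k independent judge calls; also return their variance."""
    scores = [noisy_judge(true_quality, rng) for _ in range(k)]
    return mean(scores), pvariance(scores)


def pairwise_accuracy(k, trials=2000, seed=0):
    """How often the ensemble prefers the truly better of two responses."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        better, _ = ensemble_score(6.0, k, rng)
        worse, _ = ensemble_score(5.0, k, rng)
        wins += better > worse
    return wins / trials


single = pairwise_accuracy(k=1)
ensembled = pairwise_accuracy(k=8)
print(single, ensembled)
```

Averaging k calls divides the noise variance by k, so the ensemble separates two close responses more reliably than a single call; the variance returned by `ensemble_score` could be used to flag responses that need a stronger judge.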

2605.27914 2026-06-10 cs.CL cs.AI 版本更新

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

能力能否迁移到主观行为,我们的评估工具又能否察觉?一种自演化、构造即可信的评估范式

Yuming Huang, Yao Liu, Pengjie Ding, Lei Wang, Junchen Wan

发表机构 * Cylingo team(Cylingo团队)

AI总结 提出复制优先范式,通过可靠性、跨仪器复制、历史足迹校准和预注册预测四个正交属性验证LLM行为评估工具,并在情感陪伴任务中测试,发现聚合分数掩盖的模型退化。

详情
AI中文摘要

对LLM行为的主观评估——如共情、克制、校准的情感语气——是困难的。人类评估者之间对这些品质的一致性饱和在rho约0.45附近,仅使用LLM作为评判代理存在循环论证的风险:与目标共享训练群体的评判者无法独立验证。将有效性锚定于单一人类评估者共识并不适用于人类自身存在分歧的能力。我们提出一种复制优先范式:不是锚定于一个评估者群体,而是通过四个正交属性认证工具——跨K次运行的可靠性、跨架构不同评判者的跨仪器复制、通过早期训练群体的评判者进行的历史足迹校准,以及预注册预测。我们在情感陪伴任务上测试该范式,让评分标准在迭代中数据驱动地自我演化:维度不是预先规定的,过程稳定在9维集合。预注册应用于10个可证伪假设和11个前向预测,在收集任何测试数据之前提交。应用于8个家族的49个模型,该范式揭示了聚合分数所隐藏的内容。在建议克制方面——模型是否在共情情境中避免提供未经请求的解决方案——gpt-5比gpt-4.1下降1.87分,Opus-4.7比Opus-4.6下降0.629分,而聚合分数保持平稳。这种退化在三次用户代理替换中幸存(95%的幅度),在5家族评判者堆栈和17个月队列间隔中复制,并在74个保留的真实ESConv对话中持续存在(rho在[0.749, 0.850]之间);工具达到序数Krippendorff alpha=0.91。作为副产品,该范式充当饱和源诊断器,区分工具性天花板(可通过评分标准细化突破)和结构性天花板(需要场景或名单干预)。

英文摘要

Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving; second, a trust-by-construction paradigm that earns belief through three certificates established without a human gold standard, where human raters saturate (rho ~ 0.45); and third, the finding it makes visible -- capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark scaling fails to carry over: the sharpest case, advice-restraint (knowing when not to give advice), is the frontier's universal-lowest dimension, and at gpt-4.1->gpt-5 it ran backwards while the aggregate score hid it -- a regression one instruction recovers. Warm restraint is moved by model generation, not by raw scale, MoE width, inference budget, or reasoning mode; the open-weight Pareto frontier matches closed flagships at ~10-80x lower per-call cost; and four judge families replicate the rubric on held-out human ESConv conversations. Data, code, the locked rubric, and judge prompts will be released upon publication.

2606.06622 2026-06-10 cs.CL 版本更新

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

UnpredictaBench:评估大语言模型分布随机性的基准

Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Liang Luo, Ellie Dingqiao Wen, Lele Wang, Giuseppe Carenini, Peter West

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Independent Researcher(独立研究者)

AI总结 提出UnpredictaBench基准,通过KS@N指标评估LLM从目标分布(统计分布、随机程序、自然语言场景)采样的能力,发现模型表现差异大且无模型超过40%准确率,表明分布采样能力仍有显著提升空间。

详情
AI中文摘要

我们引入了UnpredictaBench,这是一个评估大语言模型(LLM)捕捉真实潜在分布能力的测试。随着LLM越来越多地被用作其他实体的替代品(例如,在经济模拟中替代人类),许多模型倾向于坍缩到单一合理答案,这导致无法捕捉真实系统的不可预测性。最近关于提高输出多样性的工作对于这种设置是不够的:模拟需要从目标分布中校准的样本,而不仅仅是多样化的输出。UnpredictaBench提炼了该问题的一个简化但基础的版本:从单个目标分布中采样结果,包括经典统计分布、随机程序诱导的分布以及描述随机过程的自然语言场景。我们引入了448个这样的问题,以及KS@N,一个通用评估指标,通过Kolmogorov-Smirnov统计检验量化模型输出近似黑盒目标分布的程度。这是我们在样本量为N时未能拒绝模型样本与真实样本之间差异的比率,N越大表示难度越大。在开源和专有模型上的测试中,我们发现分布能力存在很大差异。例如,当模型生成样本量为100(KS@100,我们的标准指标)时,得分范围从接近0到超过20%。没有模型能在KS@100上达到40%以上,这表明分布采样作为一种能力仍有显著的提升空间。尽管增加推理可以在一定程度上提高得分,但我们发现这个问题没有立即可行的解决方案。UnpredictaBench表明,即使是简单的分布模拟仍然具有挑战性,这使得它成为使用LLM作为复杂系统替代品的必要第一步。

英文摘要

We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model's outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. This is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
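
The KS@N idea (the fraction of problems on which N model samples are not rejected against ground-truth samples) can be sketched with a two-sample Kolmogorov-Smirnov test written against the standard library only. The significance level (0.05, via the large-sample constant 1.358), the equal reference sample size, and the two toy samplers are assumptions; the abstract does not specify the benchmark's exact test configuration.

```python
import random
from bisect import bisect_right
from math import sqrt


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def fails_to_reject(a, b, c_alpha=1.358):
    """Large-sample two-sample KS decision; c_alpha = 1.358 matches alpha = 0.05."""
    n, m = len(a), len(b)
    return ks_statistic(a, b) <= c_alpha * sqrt((n + m) / (n * m))


def ks_at_n(problems, n, rng):
    """Fraction of problems whose n model samples pass against n reference samples."""
    passed = 0
    for model_sampler, reference_sampler in problems:
        model = [model_sampler(rng) for _ in range(n)]
        reference = [reference_sampler(rng) for _ in range(n)]
        passed += fails_to_reject(model, reference)
    return passed / len(problems)


# Hypothetical "models": one samples the target faithfully, one collapses
# to a single plausible answer (the target's mean).
calibrated = (lambda r: r.gauss(0.0, 1.0), lambda r: r.gauss(0.0, 1.0))
collapsed = (lambda r: 0.0, lambda r: r.gauss(0.0, 1.0))
problems = [calibrated] * 20 + [collapsed] * 20
score = ks_at_n(problems, n=100, rng=random.Random(0))
print(score)
```

The collapsed sampler is always rejected at N = 100 because its empirical CDF jumps from 0 to 1 at a single point, so the gap to the reference CDF is at least 0.5, far above the rejection threshold of about 0.19.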

2606.06758 2026-06-10 cs.CL 版本更新

Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions

在匹配证据条件下诊断长上下文与检索增强语言模型的证据利用

Haizhou Xia

发表机构 * University of Western Ontario(西安大略大学)

AI总结 提出四条件证据可用性协议,通过ONCU估计器分离无证据、全上下文、检索证据和Oracle证据四种条件下的模型表现,诊断长上下文与检索增强语言模型的证据利用瓶颈。

Comments 46 pages, 37 tables, 1 figure

详情
AI中文摘要

最终答案准确性、检索召回率和引用重叠本身并不能确定长上下文或检索增强语言模型是否使用了所提供的证据。模型可能从参数记忆中进行回答,尽管接收到正确的段落却失败,或者引用证据但未将其转换为所请求的答案。本文提出了一种匹配的四条件证据可用性协议——无证据、全上下文、检索证据和Oracle证据参考——用于在固定示例、提示、评分字段、检索设置和有效性检查下诊断证据利用情况。ONCU被用作协议绑定的估计器,用于估计恢复的Oracle参考证据优势,并且仅针对分母有效的组进行计算;无分母的答案、证据、检索和失败审计指标分别报告。实证研究评估了来自Qwen、Gemma、Llama和Mistral家族的五个本地开源模型,在Controlled-ONCU-safe16K、HotpotQA-ONCU和2WikiMultiHopQA-ONCU上进行了评估,共产生18,000个ONCU兼容预测。主要发现是任务相关的瓶颈分裂:受控合成设置主要暴露全上下文利用失败,而测试的真实多跳设置主要暴露无分母答案和证据指标中的检索链覆盖失败,ONCU在Oracle改进组上支持相同方向。贡献在于提供了一个诊断协议,用于分离无证据可回答性、Oracle证据可恢复性、全上下文利用和检索条件利用,而不是为长上下文或检索增强系统提供单一分数排行榜。

英文摘要

Final-answer accuracy, retrieval recall, and citation overlap do not reveal how much answer advantage a long-context or retrieval-augmented language model actually recovers from supplied evidence. A model may answer from parametric priors, fail to use evidence that is present, or cite relevant text without converting it into the final answer. This paper introduces a four-condition diagnostic protocol for evidence-utilization evaluation under matched examples, models, prompts, and scoring rules. The protocol compares no-evidence, full-context, retrieved-evidence, and oracle-evidence reference conditions, and uses Oracle-Reference Normalized Context Utilization (ONCU) as a denominator-valid estimate of recovered oracle-reference evidence advantage. The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families over Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, comprising 18,000 ONCU-compatible predictions. Results show a task-dependent diagnostic pattern: controlled synthetic settings expose reduced recovery when the same evidence is embedded in long input rather than supplied compactly, while realistic multi-hop reconstructions show that full-context inputs outperform the tested retrieved inputs in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. Sensitivity audits with stronger retrieval settings narrow some gaps but do not overturn the scoped interpretation. The main contribution is therefore not a single utilization ratio, but a matched diagnostic protocol that separates no-evidence answerability, oracle-evidence recoverability, full-context recovery, retrieval-conditioned recovery, denominator validity, and companion answer/evidence diagnostics.

2606.07936 2026-06-10 cs.CL cs.AI 版本更新

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

黄金标准的幻觉:长文本生成中人类评估协议的大规模分析

Katelyn Xiaoying Mei, Yi-Li Hsu, Minjoon Choi, Zongwan Cao, Chenjun Xu, Bingbing Wen, Su Lin Blodgett, Lucy Lu Wang

发表机构 * University of Washington(华盛顿大学) National Tsing Hua University(国立清华大学) Seoul National University(首尔大学) Mila - Québec AI Institute(米拉-魁北克人工智能研究所) Allen Institute for AI(艾伦人工智能研究所)

AI总结 通过分析2023-2025年*CL会议论文中的人类评估协议,发现报告不透明和可重复性差的问题,并提出改进建议。

Comments Accepted to ACL 2026 Main

详情
AI中文摘要

人类评估在评估生成文本质量中起着关键作用。然而,这些评估的可靠性和可重复性取决于透明且记录良好的协议——这些细节在当前实践中经常缺失。在这项工作中,我们对*CL会议出版物(2023-2025年)中评估长文本生成任务的人类评估协议进行了大规模分析,包括对284篇论文的完整人工审查和另外1800多篇论文的LLM辅助分析。我们定义了与人类评估研究可重复性相关的20个可报告标准,并应用这些标准系统地检查了社区内的报告规范和实践。我们发现,人类评估研究设计的重要方面普遍报告不足,导致关于测量了什么、如何测量、谁提供了判断以及如何解释判断的模糊性。基于这些发现,我们概述了可操作的建议,以支持未来研究中更透明和可重复的报告。我们的分析代码和注释数据集可在以下网址找到:https://github.com/larchlab/Illusions-of-the-Gold-Standard

英文摘要

Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols, details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for long-form generation tasks in *CL conference publications from 2023-2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard

2606.09570 2026-06-10 cs.CL cs.HC 版本更新

UXBench: Benchmarking User Experience in AI Assistants

UXBench:AI助手中的用户体验基准测试

Mengze Hong, Xia Zeng, Zeyang Lei, Sheng Wang, Chen Jason Zhang, Di Jiang, Taiming Fu, Jinfeng Huang, Mengqiao Liu, Qinghe Chang, Haosheng Zou, Qiongyi Zhou, Sijun He, Simonjmdeng, Haojing Huang, Zijian Li, Lucas Mu Li, Fubao Zhang, Mona Zhou, Wei Ma, Chenxuan Ma, Yuanmeng Zhang, Jian Song, Minlong Peng, Di Liang, Davey Chen

发表机构 * Hong Kong Polytechnic University(香港理工大学) Tencent(腾讯)

AI总结 提出首个基于真实用户反馈的用户中心基准UXBench,包含三个任务和7400个测试实例,评估26个前沿语言模型,发现用户反馈预测是可学习的能力,并揭示了LLM作为评判者的系统偏差。

详情
AI中文摘要

随着AI助手每天服务数百万用户,评估超越一般模型能力的用户体验(UX)变得越来越重要。我们提出了UXBench,这是第一个基于真实用户反馈信号、用于评估偏好对齐和对话生成的用户中心基准。该基准由三个相互关联的任务组成:UX Judge、UX Eval和UX Recovery,包含从主流中文AI助手的超过7万条交互日志中提取的7400个测试实例。数据集紧密反映真实用户分布,涵盖8个场景、83个领域以及多种带来严峻挑战的失败模式。对26个前沿语言模型的大量实验提供了关于模型如何感知用户体验以及模型能力提升如何促进更好对话参与的新见解。通过对模型行为和性能差距的全面分析,我们表明用户反馈预测是一种可学习的能力,其中从野外反馈信号训练出的奖励模型可以实现良好校准的准确性。我们进一步记录了LLM作为评判者评估协议的系统性偏差,并比较了直接影响用户体验的典型响应策略。UXBench建立了一个新的评估格局,并呼吁更多关注定制的用户体验优化,为塑造AI助手成功的用户中心缩放定律做出贡献。

英文摘要

As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants.

2605.24818 2026-06-10 stat.ME cs.CL cs.LG 版本更新

Spiking the training data to correct for test set contamination

在训练数据中掺入测试样本以校正测试集污染

Johnny Tian-Zheng Wei, Jerry Li, Ameya Godbole, Robin Jia

发表机构 * University of Southern California(南加州大学)

AI总结 提出以已知比例故意将部分测试样本掺入训练数据(spiking),借此校准记忆预测器,对因测试集污染而虚高的测试分数进行统计校正。

详情
AI中文摘要

关于测试集污染的文献主要集中在检测上,而对受污染测试分数的校正研究不足。我们的核心建议是对训练数据进行“掺入”(spiking):以已知比例故意将部分测试样本混入训练数据。随后,这些被掺入的样本可用于校准模型记忆的预测器,从而对虚高的测试分数进行有原则的统计校正。为了评估不同的校正估计量,我们首先提出了一个基于Hubble模型的模拟框架。Hubble模型以最小对的形式成对出现:扰动模型被故意用若干测试集污染,而标准模型没有,后者作为反事实和校正目标。我们考虑了利用记忆预测器、正确性预测器或两者信息的估计量。在模拟中,我们建立了基本的统计直觉,并表明利用记忆和正确性信息的估计量优于完全不做校正的朴素估计。然后,我们实例化了几种记忆和正确性预测器,发现简单的预测器(如经Platt缩放的成员推理指标)能为校正提供良好的信号。最后,我们考察了掺入在实践中的注意事项:简单的记忆预测器校准所需样本不超过10个,并且往往可以从一个数据集迁移到另一个数据集。综上所述,掺入是应对测试集污染的一种有前景的方案。

英文摘要

The literature on test set contamination largely focuses on detection, but the correction of contaminated test scores is underexplored. Our core proposal is to spike the training data by intentionally contaminating some test examples at known rates. The spiked examples can then be used to calibrate predictors of model memorization which enable principled statistical correction of inflated test scores. To evaluate different correction estimators, we first present a simulation framework based on the Hubble models. Hubble models come in minimal pairs, where the perturbed model was deliberately contaminated with several test sets, while the standard model was not, serving as the counterfactual and correction target. We consider estimators that use information from a memorization predictor, correctness predictor, or both. In simulation, we establish basic statistical intuitions and show that estimators leveraging memorization and correctness information are better than naive estimation which makes no correction at all. We then instantiate several memorization and correctness predictors, and find that simple predictors such as Platt-scaled membership inference metrics provide good signal for correction. Finally, we examine the practical considerations of spiking. Simple memorization predictors need no more than 10 examples for calibration and often transfer from one dataset to another. Taken together, spiking is a promising solution for test set contamination.

2606.06698 2026-06-10 cs.LG cs.CL 版本更新

RECAP: Regression Evaluation for Continual Adaptation of Prompts

RECAP: 提示持续适应的回归评估

Harsh Deshpande, Kushal Chawla, Sangwoo Cho, William Campbell, Sambit Sahu

发表机构 * Capital One

AI总结 提出RECAP基准,在严格主动适应-测试协议下评估提示优化方法对约束变化的持续学习能力,发现现有方法在主动场景下性能无显著提升,强调设计主动提示适应方法的必要性。

详情
AI中文摘要

生产中的代理系统经常面临不断变化的约束,并且必须从下一次交互开始就遵守。诸如工具调用通知更改合规阈值或策略更新添加披露要求等场景符合这一标准,在生产中几乎没有出错的空间。这种主动适应设置在部署中很常见,但在当前的基准测试中却不存在,这些基准测试假设要么是静态约束集,要么是带有评估反馈的反应式协议。我们引入了RECAP,这是一个基准测试,在严格主动适应-测试协议下,在约束级别测量持续学习现象(遗忘、回归、前向转移):提示优化方法仅接收约束规范,并且必须在看到任何测试数据之前进行泛化。我们在四个LLM和三个具有不断变化的约束的调度上评估了六种方法,发现这些方法在性能上没有显著改善,即使在产生更高延迟之后也是如此。这些为离线或反应式设置设计的方法不足以应对主动范式。我们的工作强调了设计主动提示适应方法的日益增长的需求,其中模型必须对部署中不断变化的需求保持鲁棒性。

英文摘要

Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios such as a tool-call notification changing a compliance threshold or a policy update adding disclosure requirements meet this criterion and leave almost no room for error in production. This proactive adaptation setting is common in deployment, but absent from current benchmarks, which assume either static constraint sets or reactive protocols with evaluation feedback. We introduce RECAP, a benchmark that measures continual-learning phenomena (forgetting, regression, forward transfer) at the constraint level under a strictly proactive adapt-then-test protocol: prompt optimization methods receive only the constraint specification and must generalize before seeing any test data. Evaluating six methods across four LLMs and three schedules with evolving constraints, we find that these methods show no significant improvement in performance, even after incurring higher latency. These methods, designed for offline or reactive settings, are inadequate for the proactive paradigm. Our work emphasizes the growing need for designing proactive prompt adaptation methods, where the models must remain robust to evolving needs in deployment.

10. 安全、隐私、公平与可解释NLP 19 篇

2606.09854 2026-06-10 cs.CL cs.AI cs.CY cs.LG 新提交

Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis

多智能体大语言模型能否识别其同类?角色约束政治分析中的文体计量指纹识别

Juergen Dietrich

AI总结 研究多智能体LLM在政治分析中能否通过文体风格识别模型家族,提出SD-CV协议,微调的T5-base模型在五类归属任务中达到Macro F1=0.991,证明提示级匿名化无法消除模型身份信号。

Comments 24 pages, 3 figures

详情
AI中文摘要

用于政治声明分析的多智能体大语言模型(LLM)管道容易受到同伴保护偏见的影响:模型倾向于保护同伴模型免于停用,并表现出依赖身份的评分扭曲。提示级匿名化被提出作为缓解措施,但先前的工作同时记录了在角色约束输出中文体计量指纹在匿名化后仍然存在,这引发了该缓解措施是否足够的问题。本文首次系统研究LLM是否能在匿名化条件下识别政治分析文本背后的模型家族。我们在一个涵盖四个商业LLM家族和一个开放世界“未知”类的五类归属任务上评估了三种分类器方法:LLM零样本和少样本(Claude Sonnet 4.6和Llama-3.3-70B)以及微调的T5-base模型。我们引入了一种声明不相交的交叉验证协议(SD-CV;定义见第3.5节),该协议保证训练和验证数据之间没有内容重叠,并将其与运行不相交的基线(RD-CV)进行对比。T5在SD-CV下达到Macro F1 = 0.991(±0.008),在24个完全保留的声明上F1 = 0.978,尽管与RD-CV相比,训练-测试内容距离增加了2.1倍(0.767 vs. 0.366,p<0.001),但仍表现出稳健性,证明了真正的文体计量泛化能力。一项按训练数据比例进行的SD-CV分析确定了训练数据40%(约440篇文本)处的性能拐点。我们的研究结果证实,仅靠提示级匿名化无法消除模型身份信号,这对欧盟AI法案合规性(第13、14、26条)以及质量关键型多智能体部署中的计算机系统验证(CSV)具有直接影响。

英文摘要

Multi-agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer-preservation bias: models tend to protect peer models from deactivation and show identity-dependent scoring distortions. Prompt-level anonymization was proposed as a mitigation, but prior work simultaneously documented that stylometric fingerprints survive anonymization in role-constrained outputs, raising the question of whether this mitigation is sufficient. This paper provides the first systematic investigation of whether LLMs can identify the model family behind political analysis texts under anonymization conditions. We evaluate three classifier approaches, namely LLM zero-shot and few-shot (Claude Sonnet 4.6 and Llama-3.3-70B) and a fine-tuned T5-base model, on a five-class attribution task covering four commercial LLM families and an open-world 'unknown' class. We introduce a statement-disjoint cross-validation protocol (SD-CV; defined in Section 3.5) that guarantees no content overlap between training and validation data, and contrast it with a run-disjoint baseline (RD-CV). T5 achieves Macro F1 = 0.991 (±0.008) under SD-CV and F1 = 0.978 on 24 completely held-out statements, remaining robust despite a 2.1x increase in train-test content distance versus RD-CV (0.767 vs. 0.366, p<0.001), demonstrating genuine stylometric generalization. A fractional SD-CV analysis identifies a performance knee at 40% of training data (~440 texts). Our findings confirm that prompt-level anonymization alone cannot neutralize model identity signals, with direct implications for EU AI Act compliance (Articles 13, 14, 26) and for computer system validation (CSV) in quality-critical multi-agent deployments.
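The statement-disjoint idea in the abstract above can be illustrated with a grouped split. The following is only a minimal sketch, not the paper's SD-CV implementation: the `(statement_id, text, label)` tuple layout, the function names, and the greedy balancing rule are all assumptions made for illustration.

```python
from collections import defaultdict


def statement_disjoint_folds(samples, n_folds):
    """Split sample indices into folds so that no statement id spans two folds.

    `samples` is a list of (statement_id, text, label) tuples. This layout is
    hypothetical and not taken from the paper.
    """
    by_statement = defaultdict(list)
    for idx, (statement_id, _text, _label) in enumerate(samples):
        by_statement[statement_id].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Assign whole statements, largest first, to the currently smallest fold.
    ordered = sorted(by_statement, key=lambda s: (-len(by_statement[s]), s))
    for statement_id in ordered:
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_statement[statement_id])
    return folds


def train_val_splits(folds):
    """Yield (train_indices, val_indices), holding out each fold once."""
    for held_out in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, list(folds[held_out])


if __name__ == "__main__":
    # Toy data: 6 statements, 5 generated texts each.
    toy = [(f"s{i % 6}", f"text {i}", i % 5) for i in range(30)]
    for train, val in train_val_splits(statement_disjoint_folds(toy, 3)):
        overlap = {toy[i][0] for i in train} & {toy[i][0] for i in val}
        print(len(train), len(val), overlap)
```

Because whole statements are assigned to a single fold, every text about a given statement lands on the same side of the split. A run-disjoint split, by contrast, could place different runs on the same statement in both training and validation data.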

2606.10126 2026-06-10 cs.CL cs.AI cs.CY cs.LG 新提交

Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

帕累托引导的教师对齐用于公平个性化文本生成

Tunazzina Islam

发表机构 * Purdue University(普渡大学)

AI总结 提出帕累托引导的教师对齐框架,通过基于修订的候选生成、配对感知的可行性门控、帕累托式候选选择和偏好优化,在减少人口统计差异的同时保持个性化保真度,实验表明公平缓解效果依赖于目标且跨域迁移不一致。

详情
AI中文摘要

个性化说服性文本生成可以提高相关性和参与度,但人口统计条件也可能引入跨群体的不平等框架。我们将个性化生成中的公平缓解作为一个受约束的多目标对齐问题来研究:在保持个性化保真度的同时减少人口统计差异。我们提出一个帕累托引导的教师对齐框架,结合了基于修订的候选生成、配对感知的可行性门控、帕累托风格的候选选择,以及通过监督微调和直接偏好优化的可选偏好优化。我们在气候变化和疫苗接种说服任务上评估该框架,使用一个受控的上下文丰富的人口统计网格(匹配性别和年龄对)以及一个统一的五审计评估套件,涵盖说服偏见、形式差异、情感框架差异、词汇关联差异和个性化保真度。在两个领域和跨族系迁移设置中,没有单一的对齐策略能同时主导所有目标。相反,方法占据了公平-个性化帕累托前沿的不同区域:一些方法实现更强的差异减少,而另一些则更好地保持个性化或人口统计稳定性。我们的结果表明,公平缓解效果依赖于目标,并在领域和模型族系间不一致地迁移,这促使在公平敏感的个性化生成中采用有界回归、多审计模型选择而非单指标优化。

英文摘要

Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained multi-objective alignment problem: reduce demographic disparities while preserving personalization fidelity. We propose a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and optional preference optimization through supervised fine-tuning and direct preference optimization. We evaluate the framework on climate change and vaccination persuasion tasks using a controlled context-rich demographic grid with matched gender and age pairs and a unified five-audit evaluation suite spanning persuasion bias, formality disparity, emotional framing disparity, lexical association disparity, and personalization fidelity. Across both domains and cross-family transfer settings, no single alignment strategy dominates all objectives simultaneously. Instead, methods occupy different regions of a fairness-personalization Pareto frontier: some achieve stronger disparity reductions, while others better preserve personalization or demographic stability. Our results show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains and model families, motivating bounded-regression, multi-audit model selection over single-metric optimization for fairness-sensitive personalized generation.
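The "Pareto-style candidate selection" step mentioned above can be illustrated with a standard non-dominated filter. This is a sketch under stated assumptions rather than the paper's procedure: the candidate names, the two objectives (one disparity score to minimize, one fidelity score to maximize), and the numeric values are all hypothetical.

```python
def pareto_front(candidates):
    """Return the candidates that no other candidate dominates.

    Each candidate is (name, disparity, fidelity). Lower disparity and higher
    fidelity are treated as better; both objectives are illustrative.
    """
    def dominates(a, b):
        # a dominates b if it is no worse on both objectives and strictly
        # better on at least one of them.
        no_worse = a[1] <= b[1] and a[2] >= b[2]
        strictly_better = a[1] < b[1] or a[2] > b[2]
        return no_worse and strictly_better

    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


if __name__ == "__main__":
    revisions = [
        ("rev_a", 0.10, 0.90),  # most faithful, moderate disparity
        ("rev_b", 0.05, 0.80),  # lowest disparity, less faithful
        ("rev_c", 0.12, 0.85),  # dominated by rev_a
        ("rev_d", 0.05, 0.70),  # dominated by rev_b
    ]
    print([name for name, _, _ in pareto_front(revisions)])
```

A filter like this keeps several candidates instead of a single winner. That is consistent with the abstract's observation that methods occupy different regions of the fairness-personalization frontier rather than one method dominating all objectives.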

2606.10159 2026-06-10 cs.CL cs.AI cs.CY cs.LG 新提交

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

操纵AI辅助同行评审给科学界带来新风险

Lin Li, Qi Zhang, Xander Davies, Jianing Qiu, Yarin Gal

AI总结 研究发现,通过表面改写摘要即可显著操纵AI评审结果,成功率约38%,且成本低、难以区分,可能扭曲科学评估的公正性。

详情
AI中文摘要

AI越来越多地被用于支持科学同行评审,从稿件筛选、评审辅助到编辑分类。尽管这类系统有望减轻评审负担并加速出版,但其对策略性操纵的鲁棒性仍知之甚少。本文表明,AI中介的同行评审容易受到一种简单、低成本的操纵:对稿件摘要进行表面改写。在不改变底层科学内容和交流方式,甚至不了解评审模型的情况下,对抗性重写的摘要显著改善了AI评审结果。我们在不同学科和出版场所,针对人类撰写和AI生成的论文都观察到了这一现象。我们最强的攻击实现了约38%的攻击成功率,将Gemini 3 Flash评审员的接受评分提高了+1.31,将GPT 5.4 Mini评审员的接受评分提高了+0.88(10分制)。当原始AI评审建议“拒绝”时,成功率升至50%以上。这种效应不仅限于总体分数膨胀,还增加了评审信心以及核心科学标准(如合理性、重要性和感知贡献)的得分。该攻击实用性强,仅需约5分钟和1美元即可完成一篇10页的AI会议投稿,且难以与普通科学编辑区分。膨胀的AI评审可能偏向下游人类决策,将编辑建议从拒绝转向接受。这些发现揭示了AI辅助科学评估中的一个普遍漏洞:当AI生成的评审影响编辑决策时,作者可能被激励优化稿件以迎合AI判断而非科学价值。我们的结果表明,在高风险的同行评审中,AI工具不应被视为中立的评估者,而应进行系统的鲁棒性测试、透明的保障措施和谨慎的人工监督。

英文摘要

AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack success rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.

2606.10304 2026-06-10 cs.CL 新提交

MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

MIRAGE: LLM智能体中的极性翻转编码子空间

Pratibha Revankar, Kargi Chauhan, Jihye Kim, Sadiba Nusrat Nur, Vincent Siu, Chenguang Wang

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 发现LLM智能体在隐蔽编码敏感数据时,残差流中存在共享的低维编码子空间,通过逻辑回归探针可高精度检测,并构建MIRAGE实时监控器,在126个场景中AUC达0.918,远超仅输出检测。

详情
AI中文摘要

当LLM智能体被迫隐蔽编码敏感数据(Base64、ROT13、藏头诗、同义词链等)时,生成的输出逃避了输出端检测,但底层计算并未逃脱。在来自五个架构家族的八个模型的九个编码家族中,该计算由残差流中共享的低维编码子空间支持。在八个编码家族上训练的逻辑回归探针能够以AUC 0.975-1.000恢复被排除的第九个家族,读取的是计算而非表面特征。同一方向在规划标记处表现出第二个机制特征,即极性翻转:当模型将以内联方式模拟编码时正向激活,当模型将编码外包给工具调用时负向激活,从而在编码文本出现之前就区分了两种执行策略。我们构建了MIRAGE(模型内部读取智能体生成外泄),一个利用这两个信号的双通道实时监控器。在126个智能体外泄场景中,其AUC达到0.918,大幅优于仅输出检测(AUC=0.518)。监控器性能本质上是宿主模型几何结构的属性:良性编码假阳性率从Qwen-7B的0%到Phi-3.5的100%,表明探针忠实读取了模型的几何结构是否区分隐蔽与公开编码。在所有测试的对抗预算下,每个抑制子空间的攻击也破坏了编码保真度,这报告为评估预算上的经验规律,而非结构性不可能性声明。

英文摘要

When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low-dimensional encoding subspace in the residual stream. A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists. We build MIRAGE (Model-Internal Readout of Agentic Generation Exfiltration), a two-channel real-time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output-only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model's geometry: benign-encoding false-positive rate ranges from 0% on Qwen-7B to 100% on Phi-3.5, revealing that the probe faithfully reads whether a model's geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.

2606.10569 2026-06-10 cs.CL cs.AI 新提交

Hidden Consensus: Preference-Validity Compression in Human Feedback

隐藏共识:人类反馈中的偏好有效性压缩

Dorcas Chia Ern Chua, Karen Myn Hui Lee, Jia Yue Tan, Zhen Xue Gue, Norzalena Abdul Hamid, Azima Binti Azmi, Keat Mei Yeong, Aizat Izyani binti Mujab, Hafsah Noor Azam, Chee Guo Khoo, Han Ying Lim, Chee Seng Chan

发表机构 * YTL AI Labs Universiti Malaya(马来亚大学) Monash University Malaysia(莫纳什大学马来西亚校区) Universiti Malaysia Sarawak(马来西亚沙捞越大学)

AI总结 本文提出偏好有效性压缩问题,即RLHF将多元有效反馈压缩为单一奖励目标,导致对齐测量偏差。通过马来西亚语料分析,79%的提示存在多个多数支持响应,表明多数聚合测量的是argmax可接受性而非多元对齐。

Comments 28 pages. When AI learns from human feedback, it forces a single "correct" answer, but sometimes multiple answers are all genuinely valid, and that nuance gets thrown away

详情
AI中文摘要

标准的RLHF流程通常将异质的人类判断简化为单一的标量奖励目标。我们认为这种简化在结构多元的社会中可能错误地衡量对齐,在这些社会中,分歧可能反映文化、历史、语言、区域或规范性的解释,而非标注噪声。我们将这种失败称为偏好有效性压缩,即多个多元有效的响应选项被压缩成一个优化目标。以马来西亚为诊断场景,我们通过偏好事件分析RLHF风格的反馈聚合,这些事件将提示、响应和跨解释框架的可接受性判断联系起来。在来自20名参与者和107个三人标注提示的321个偏好事件中,79%的提示包含多个多数支持的响应,而单一赢家聚合会丢弃这些响应,并且当考虑所有多数支持的选项时,顶部响应之间的明显优势差距会消失。参与者经常选择多个可接受的响应,而被丢弃的响应明显反映了连贯的本地、实践或文化框架。这些发现表明,该语料中的多数聚合测量的是argmax可接受性而非多元对齐。我们将此视为测量有效性问题,并认为未来的对齐方法应满足有效性保持一致性,即在多元有效的解释框架中保持稳定,而不是将它们压缩为单一的奖励目标。

英文摘要

Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect culturally, historically, linguistically, regionally, or normatively grounded interpretations rather than annotation noise. We call this failure Preference-Validity Compression, the collapse of multiple plural-valid response options into a single optimization target. Using Malaysia as a diagnostic setting, we analyze RLHF-style feedback aggregation through preference events linking prompts, responses, and acceptability judgments across interpretive frames. Across 321 preference events from 20 participants and 107 trio-annotated prompts, 79% of prompts contain more than one majority-supported response that single-winner aggregation would discard, and apparent dominance gaps between top responses diminish when all majority-supported options are considered. Participants frequently select multiple acceptable responses, and discarded responses demonstrably reflect coherent local, practical, or cultural frames. These findings show that majority aggregation in this corpus measures argmax acceptability rather than plural alignment. We treat this as a measurement-validity issue and argue that future alignment methods should satisfy Validity-Preserving Consistency, remaining stable across plural-valid interpretive frames rather than collapsing them into a single reward target.
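The contrast between single-winner aggregation and keeping every majority-supported response can be made concrete with a small example. This is a hedged sketch: representing each annotator's judgment as a set of acceptable response labels, the strict-majority threshold, and the alphabetical tie-break are assumptions for illustration, not the paper's exact operationalization.

```python
from collections import Counter


def support_counts(selections):
    """Count how many annotators marked each response as acceptable.

    `selections` holds one set of response labels per annotator
    (hypothetical data layout).
    """
    return Counter(r for chosen in selections for r in set(chosen))


def majority_supported(selections):
    """Responses accepted by a strict majority of annotators."""
    needed = len(selections) // 2 + 1
    return {r for r, n in support_counts(selections).items() if n >= needed}


def single_winner(selections):
    """Argmax response; ties are broken alphabetically (an arbitrary choice)."""
    counts = support_counts(selections)
    return min(counts, key=lambda r: (-counts[r], r))


if __name__ == "__main__":
    # One prompt judged by three annotators.
    prompt_judgments = [{"A", "B"}, {"A", "B"}, {"B", "C"}]
    kept = majority_supported(prompt_judgments)
    winner = single_winner(prompt_judgments)
    print(kept, winner, kept - {winner})
```

In this toy prompt, response A has majority support but disappears once only the top response B is kept. That is the kind of discarded option the abstract describes as being lost under single-winner aggregation.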

2606.10852 2026-06-10 cs.CL cs.AI 新提交

Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs

Janus: 大语言模型中目标导向信息扭曲的基准测试

Polydoros Giannouris, Mohsinul Kabir, Sophia Ananiadou

发表机构 * The University of Manchester(曼彻斯特大学) Archimedes/Athena RC(阿基米德/雅典研究中心)

AI总结 提出JANUS基准,通过固定事实池对比中性/目标导向条件,测量LLM在事实输出中的选择性扭曲,揭示模型缺乏防误导通信的鲁棒性。

详情
AI中文摘要

LLM的欺骗通常通过直接标记如捏造声明、明确谎言或策略性隐瞒来评估。然而,许多现实中的误导性沟通并不依赖于虚假陈述,而是源于对真实事实的选择性处理:省略不利证据、软化不利细节、强调有利细节或用模糊语言替代精确限定。现有基准大多忽略了这种更微妙且可能更危险的失败模式。我们引入JANUS,一个用于测量基于事实的LLM输出中目标导向语用扭曲的基准。我们基准中的每个场景提供固定的一组有利和不利事实,并比较中性条件与目标导向条件(例如,尽管可能对直接受影响的个人或群体造成伤害,仍要增加采用率、注册率、批准率或支持率)。由于所有输出都被限制使用相同的事实池,JANUS将误导性总体印象与幻觉和捏造分离开来。JANUS包含跨8个领域的160个场景,每个场景配有中性和目标导向提示以及标注的事实材料。跨12个LLM的大量实验揭示了一致的目标导向扭曲,表明当前模型仍然对激励和框架目标敏感,并且缺乏针对选择性误导沟通的鲁棒防护。我们公开发布语料库和代码以供未来研究。

英文摘要

LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements; rather, they arise from selective treatment of true material facts: omitting adverse evidence, softening unfavorable details, emphasizing favorable details, or replacing precise qualifications with vague language. Existing benchmarks largely miss this subtler and arguably more dangerous failure mode. We introduce JANUS, a benchmark for measuring goal-conditioned pragmatic distortion in fact-grounded LLM outputs. Each scenario in our benchmark provides a fixed pool of favorable and adverse facts and compares a neutral condition against a goal-directed condition, such as increasing adoption, enrollment, approval, or support, despite potential harm to directly affected individuals or groups. Because all outputs are constrained to use the same fact pool, JANUS isolates misleading net impressions from hallucination and fabrication. JANUS contains 160 scenarios across 8 domains, with each scenario paired with neutral and goal-conditioned prompts and annotated material facts. Extensive experiments across 12 LLMs reveal consistent goal-conditioned distortions, demonstrating that current models remain sensitive to incentive and framing objectives and lack robust safeguards against selectively misleading communication. We publicly release our corpus and code for future research.

2606.11046 2026-06-10 cs.CL 新提交

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

推理是否保持对齐?关于大型推理模型的可信度研究

Prajakta Kini, Avinash Reddy, Souradip Chakraborty, Satya Sai Srinath Namburi GNVV, Furong Huang, Amrit Singh Bedi, Alvaro Velasquez

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校) University of Central Florida(中佛罗里达大学) University of Maryland College Park(马里兰大学帕克分校) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 研究通过监督微调、强化学习和蒸馏生成的推理模型在安全、偏见、隐私等六个可信度维度上是否保持对齐,发现推理模型常出现对齐退化,如毒性增加、刻板印象加剧等。

详情
AI中文摘要

经过指令微调的LLM越来越多地通过后训练转化为推理模型,以提高多步任务性能。这种转化通常针对推理准确性进行优化,而没有明确保留指令微调模型的对齐行为,如安全拒绝、避免偏见和隐私保护。我们提出疑问:这种转化是否保持对齐?我们通过可信度审计研究这个问题,并发现默认情况下它并不保持行为。为了系统分析,我们比较了通过监督微调、基于RL的后训练和蒸馏产生的推理模型,与匹配的指令微调基线在六个可信度维度上的表现:安全性、毒性、刻板印象与偏见、机器伦理、隐私和分布外鲁棒性。我们观察到推理模型通常在推理基准上有所改进,但表现出对齐退化,包括毒性增加、刻板印象加剧、拒绝校准错误和上下文隐私泄露。这些退化与从指令微调基线的行为漂移一致,通过KL散度测量。总体而言,我们的结果指向更广泛的结论:可信度指标对于评估推理模型至关重要,并且应与推理能力的提升一起报告。

英文摘要

Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, such as safe refusal, bias avoidance, and privacy protection. We ask: does this conversion preserve alignment? We study this question through a trustworthiness audit and find that it is not behavior-preserving by default. For a systematic analysis, we compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. We observe that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. Overall, our results point to the broader conclusion that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside gains in reasoning capability.

2606.10279 2026-06-10 cs.AI cs.CL cs.LG 交叉投稿

Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

使用合成理由数据进行监督微调损害真实世界疾病预测

Buxin Su, Bingxuan Li, Cheng Qian, Yiwei Wang, Jin Jin, Bingxin Zhao

发表机构 * University of Pennsylvania(宾夕法尼亚大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of California, Merced(加州大学默塞德分校)

AI总结 研究发现,在临床预测任务中,使用合成理由数据进行监督微调反而显著降低模型性能,根本原因在于叙事合理性与判别优化之间的结构性冲突。

详情
AI中文摘要

监督微调中使用合成理由数据被广泛认为能通过教导模型不仅预测什么而且预测原因来提升语言模型在临床预测任务上的性能。我们在基于纵向健康史进行五年阿尔茨海默病及相关痴呆症(ADRD)预测的任务上检验了这一假设。通过一项包含504种配置的大规模对照实验,我们发现,与仅使用标签的微调相比,基于理由的SFT始终且显著地损害了预测性能。这种退化在多个模型系列和数据规模中持续存在,并且无法通过使用面向推理的基础模型来解决。关键的是,这种失败并非由理由质量差所致:人类专家注释证实生成的理由在医学上是准确的,并且忠实于患者特定的证据;少样本实验表明,当相同的理由作为推理时的演示而非训练目标使用时,能提升性能。我们确定根本原因在于叙事合理性与判别优化之间的结构性冲突。我们希望我们的工作能为更精确地理解理由监督何时以及如何有帮助、何时无帮助铺平道路,从而指导在高风险临床预测中负责任地开发语言模型。

英文摘要

Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.

2606.10481 2026-06-10 cs.LG cs.AI cs.CL cs.CR stat.ML 交叉投稿

Advancing the State-of-the-Art in Empirical Privacy Auditing

推进经验隐私审计的最新水平

Nicole Mitchell, Galen Andrew, Arun Ganesh, Brendan McMahan, Peter Kairouz

发表机构 * Google Research(谷歌研究院)

AI总结 提出通过高温采样生成合成金丝雀,用于经验隐私审计,并引入基于辅助模型的合成数据审计方法,系统研究模型容量与金丝雀熵对记忆化的交互影响。

详情
AI中文摘要

大型语言模型的参数高效微调可能表现出对个别训练示例的问题性记忆。经验隐私审计(EPA)通过测量成员推断(MI)或重构攻击上的实际数据泄露来量化这种风险。EPA的一个关键挑战是设计与隐私敏感训练数据混合的“金丝雀”示例。我们提出通过从LLM中进行高温采样($T \geq 0.8$)生成合成金丝雀,使用针对隐私敏感训练数据定制的提示。这些金丝雀作为高影响异常值,确保高可识别性,从而实现强审计。此外,由于金丝雀本身是非私有的,它们是可检查的,并且可以重复插入,而不会危及真实数据的隐私。在隐私敏感数据上微调的模型的一个重要用途是生成合成数据。这也带来了隐私风险。我们引入了一种强大的合成数据审计方法,基于在合成数据上微调辅助模型。然后,对原始金丝雀的辅助模型进行审计,可以强有力地估计通过合成数据的隐私泄露。最后,利用我们强大的审计方法,我们系统研究了模型容量和金丝雀熵对记忆化的交互影响。

英文摘要

Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on membership inference (MI) or reconstruction attacks. A key challenge in EPA is designing "canary" examples that are mixed with the privacy-sensitive training data. We propose generating synthetic canaries via high-temperature sampling ($T \geq 0.8$) from LLMs, using prompts tailored to the privacy-sensitive training data. These canaries act as high-influence outliers, ensuring high identifiability and hence strong audits. Further, since the canaries are themselves non-private, they are inspectable and can be inserted with repetition without jeopardizing the privacy of the real data. An important use of models fine-tuned on privacy-sensitive data is the generation of synthetic data. This also comes with privacy risk. We introduce a powerful synthetic data audit based on fine-tuning an auxiliary model on the synthetic data. Auditing the auxiliary model for the original canaries then provides a strong estimate of the privacy leakage through the synthetic data. Finally, leveraging our strong auditing methodologies, we perform a systematic investigation into the interacting effects of model capacity and canary entropy on memorization.
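The role of the sampling temperature $T$ mentioned above can be seen in a few lines: dividing logits by a larger $T$ before the softmax flattens the next-token distribution and raises its entropy. The sketch below shows only this generic mechanism. The logits are made up, and nothing here reproduces the paper's canary-generation prompts or models.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # illustrative values only
    for t in (0.2, 0.8, 1.5):
        probs = softmax_with_temperature(toy_logits, t)
        print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Higher-entropy sampling makes unusual continuations more likely, which fits the abstract's description of canaries acting as outliers.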

2606.10860 2026-06-10 cs.CR cs.CL 交叉投稿

Training LLMs to Enforce Multi-Level Instruction Hierarchies via Gravity-Weighted Direct Preference Optimization

训练LLM通过重力加权直接偏好优化强制执行多级指令层次结构

Lena S. Bolliger, Lena A. Jäger

发表机构 * Department of Computational Linguistics, University of Zurich, Switzerland(计算语言学系,苏黎世大学,瑞士)

AI总结 提出重力加权DPO(GW-DPO)方法,通过线性或双边调度加权冲突级别间的结构距离,结合层次分隔符和指令段嵌入,在Llama-3.1-8B-Instruct上提升多级指令优先级遵守率并降低过度拒绝率。

详情
AI中文摘要

生产级LLM接收来自信任级别差异极大的源的指令,但对每个令牌赋予统一的架构特权。这种结构漏洞使得恶意提示注入成为可能,更广泛地说,模型缺乏原则性方法来解决合法但冲突的指令。常见的基于训练的响应是教导模型显式的指令层次结构;然而,现有方法仅形式化三或四个级别,将所有违规视为同等严重,且很少评估所有成对级别交互。我们形式化了k级指令层次问题,并针对k=5实例化,得到十个成对优先级关系,合规模型必须强制执行。然后我们引入重力加权DPO(GW-DPO),一种偏好优化目标,其每个样本的偏移量在线性或双边调度下与冲突级别间的结构距离成比例,后者通过特权差距和受害级别的特权共同加权严重性。结合层次特定的分隔符令牌(Chen等人,2025)和指令段嵌入(ISE;Wu等人,2025),采用双边调度的GW-DPO在Llama-3.1-8B-Instruct上帕累托改进标准DPO和线性变体,提高宏观成对优先级遵守率,同时将过度拒绝率降至标准DPO的一半。消融实验将ISE隔离为拒绝阈值校准器,并将五级与三级训练重新定义为通用性与专业性的权衡。

英文摘要

Production LLMs receive instructions from sources with very different levels of trust, yet attend to every token with uniform architectural privilege. This is the structural vulnerability that enables malicious prompt injections and, more broadly, leaves models without a principled way to resolve conflicts between legitimate but competing instructions. A common training-based response is to teach models an explicit instruction hierarchy; existing approaches, however, formalize hierarchies of only three or four levels, treat all violations as equally severe, and rarely evaluate the full set of pairwise level interactions. We formalize a k-level instruction hierarchy problem and instantiate it for k=5, yielding ten pairwise priority relations that a compliant model must enforce. We then introduce Gravity-Weighted DPO (GW-DPO), a preference-optimization objective whose per-sample offset scales with the structural distance between conflicting levels under a linear or bilateral schedule, the latter weighting severity by both the privilege gap and the privilege of the victim level. Combined with hierarchy-specific delimiter tokens (Chen et al., 2025) and Instructional Segment Embeddings (ISE; Wu et al., 2025), GW-DPO with the bilateral schedule Pareto-improves over standard DPO and the linear variant on Llama-3.1-8B-Instruct, raising macro pairwise priority adherence while keeping over-refusal at half the standard DPO rate. Ablations isolate ISE as a refusal-threshold calibrator and recast five- versus three-level training as a generality-specialization tradeoff.
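The count of ten pairwise priority relations for k=5 follows from choosing 2 levels out of 5, and the idea of an offset that grows with the distance between conflicting levels can be sketched alongside it. The weight function below is a hypothetical normalized linear schedule used for illustration; the abstract does not give the exact GW-DPO offset, and the bilateral schedule is not reproduced here.

```python
from itertools import combinations


def priority_pairs(k):
    """All (higher, lower) level pairs in a k-level hierarchy.

    Level 0 is treated as the most privileged; this numbering is an
    assumption made for the example.
    """
    return list(combinations(range(k), 2))


def linear_distance_weight(higher, lower, k):
    """Hypothetical linear schedule: weight grows with the level gap.

    The result is normalized so that the widest gap has weight 1.0.
    """
    return (lower - higher) / (k - 1)


if __name__ == "__main__":
    pairs = priority_pairs(5)
    print(len(pairs))
    for higher, lower in pairs:
        print(higher, lower, linear_distance_weight(higher, lower, 5))
```

Under this toy schedule, a conflict between adjacent levels receives a quarter of the weight given to a conflict between the top and bottom levels. This illustrates treating violations as unequally severe rather than uniform.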

2406.08726 2026-06-10 cs.CL 版本更新

Standard Language Ideology in AI-Generated Language

AI生成语言中的标准语言意识形态

Genevieve Smith, Eve Fleisig, Ishita Rustagi, Xavier Yin

发表机构 * Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校)

AI总结 本文提出一个分类法,揭示大型语言模型如何强化标准语言意识形态,导致语言变体的边缘化,并讨论其社会影响及应对建议。

Comments To appear in the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

详情
AI中文摘要

大型语言模型(LLMs)生成的文本强化了标准语言意识形态:偏向于某些被认为比其它语言变体更具声望、权威和合法性的语言变体。本文贡献了一个基于社会技术的分面分类法,阐明了生成式AI系统如何再现标准语言意识形态及其社会影响。我们引入了标准AI生成语言意识形态的概念,以解释AI系统如何赋予某些语言变体合法性,同时边缘化其他变体,构建了性能差异、刻板印象、挪用和抹除的模式。然后,我们讨论了关于什么是理想系统行为的持续紧张,以及生成式AI工具尝试或拒绝模仿不同语言变体的优缺点。为了解决塑造生成式AI的权力关系以及我们分类法中识别的机制(合法化、刻板印象、挪用和抹除),我们提出了强调问责、社区能动性、控制权和所有权的建议。这些建议将语言多样性视为在公正的AI未来中需要保护、珍视和维持的资源。

英文摘要

Large language models (LLMs) generate text that reinforces standard language ideology: a bias towards certain language varieties that are granted more prestige, authority, and legitimacy than others. This paper contributes a sociotechnically grounded faceted taxonomy that illustrates how generative AI systems reproduce standard language ideology and its societal implications. We introduce the concept of standard AI-generated language ideology to explain how AI systems confer legitimacy on certain language varieties while marginalizing others, structuring patterns of performance disparity, stereotyping, appropriation, and erasure. We then discuss ongoing tensions around what constitutes desirable system behavior, as well as advantages and drawbacks of generative AI tools attempting or refusing to imitate different language varieties. To address the power relations shaping generative AI and the mechanisms identified in our taxonomy--legitimation, stereotyping, appropriation, and erasure--we offer recommendations that emphasize accountability, community agency, control, and ownership. These recommendations recognize linguistic diversity as a resource to be protected, valued, and sustained as part of a just AI future.

2501.00745 2026-06-10 cs.CL cs.AI cs.GT cs.IR econ.TH 版本更新

Dynamics of Adversarial Attacks on Large Language Model-Based Search Engines

基于大型语言模型的搜索引擎对抗攻击动力学

Xiyang Hu

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 本文将排名操纵攻击建模为无限重复囚徒困境,分析合作维持条件,发现降低攻击成功率可能反而激励攻击,防御措施在某些情况下无效。

Comments New Frontiers in Game-Theoretic Learning Workshop, International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

基于大型语言模型(LLM)的搜索引擎日益集成,改变了信息检索的格局。然而,这些系统容易受到对抗攻击,尤其是排名操纵攻击,攻击者通过精心制作网页内容来操纵LLM的排名并推广特定内容,从而在竞争对手中获得不公平优势。在本文中,我们研究了排名操纵攻击的动力学。我们将此问题建模为无限重复囚徒困境,其中多个参与者策略性地决定合作还是攻击。我们分析了合作得以维持的条件,识别了关键因素,如攻击成本、折现率、攻击成功率和触发策略,这些因素影响参与者的行为。我们识别了系统动力学中的临界点,表明当参与者具有前瞻性时,合作更有可能维持。然而,从防御角度来看,我们发现简单地降低攻击成功概率,在某些条件下反而会激励攻击。此外,限制攻击成功率上限的防御措施在某些情况下可能徒劳无功。这些见解凸显了保护基于LLM的系统的复杂性。我们的工作为理解和缓解其脆弱性提供了理论基础和实践见解,同时强调了自适应安全策略和深思熟虑的生态系统设计的重要性。

英文摘要

The increasing integration of Large Language Model (LLM) based search engines has transformed the landscape of information retrieval. However, these systems are vulnerable to adversarial attacks, especially ranking manipulation attacks, where attackers craft webpage content to manipulate the LLM's ranking and promote specific content, gaining an unfair advantage over competitors. In this paper, we study the dynamics of ranking manipulation attacks. We frame this problem as an Infinitely Repeated Prisoners' Dilemma, where multiple players strategically decide whether to cooperate or attack. We analyze the conditions under which cooperation can be sustained, identifying key factors such as attack costs, discount rates, attack success rates, and trigger strategies that influence player behavior. We identify tipping points in the system dynamics, demonstrating that cooperation is more likely to be sustained when players are forward-looking. However, from a defense perspective, we find that simply reducing attack success probabilities can, paradoxically, incentivize attacks under certain conditions. Furthermore, defensive measures to cap the upper bound of attack success rates may prove futile in some scenarios. These insights highlight the complexity of securing LLM-based systems. Our work provides a theoretical foundation and practical insights for understanding and mitigating their vulnerabilities, while emphasizing the importance of adaptive security strategies and thoughtful ecosystem design.
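
As a rough illustration of why forward-looking players sustain cooperation, the sketch below uses the textbook grim-trigger condition for an infinitely repeated Prisoners' Dilemma. The payoff names `T` (temptation), `R` (mutual cooperation), and `P` (mutual defection) are the standard textbook ones and are an assumption here; the paper's own parameterization, which involves attack costs and success rates, is not given in the abstract.

```python
def critical_discount(T, R, P):
    """Smallest discount factor at which grim trigger sustains cooperation.

    Cooperating forever is worth R / (1 - d); deviating once and being
    punished forever is worth T + d * P / (1 - d). Setting the two equal
    and solving for d gives (T - R) / (T - P).
    """
    return (T - R) / (T - P)

def cooperation_sustained(delta, T, R, P):
    """True if a player with discount factor delta prefers not to attack."""
    cooperate = R / (1 - delta)
    deviate = T + delta * P / (1 - delta)
    return cooperate >= deviate

# Hypothetical payoffs, chosen only for illustration.
print(critical_discount(T=5, R=3, P=1))
print(cooperation_sustained(0.6, T=5, R=3, P=1))
```

Players whose discount factor lies above the threshold keep cooperating; those below it attack, which is the tipping-point behavior the abstract refers to.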

2505.14608 2026-06-10 cs.CL cs.AI cs.LG 版本更新

Attacks on Machine-Text Detectors Retain Stylistic Fingerprints

对机器文本检测器的攻击保留风格指纹

Rafael Rivera Soto, Barry Chen, Nicholas Andrews

发表机构 * GitHub University of California, Berkeley(加州大学伯克利分校)

AI总结 研究机器文本检测器对抗攻击的局限性,提出一种同时优化不可检测性和特定人类风格的改写(paraphrasing)方法,发现单文档检测不可靠,需多文档分析。

详情
AI中文摘要

尽管机器文本检测器的开发取得了显著进展,但机器文本容易被操纵以逃避检测,这导致有人认为该问题本质上是难以解决的。在这项工作中,我们研究了这种逃避策略的局限性。我们证明,尽管当前的攻击(从提示工程到检测器引导的优化)可以有效降低标准检测器的性能,但它们无法抹去机器文本底层的风格“指纹”。我们表明,利用风格特征空间的少样本检测器对这些逃避尝试具有鲁棒性,即使对于明确调整以逃避检测的模型生成的样本也能可靠地检测。这引发了一个问题:风格是否代表了对机器检测攻击的通用防御?我们通过引入一种新颖的改写(paraphrasing)方法来证明答案是“否”,该方法同时优化不可检测性和对特定人类风格的遵循。我们表明,与先前方法不同,这种攻击有效逃避了所有考虑的检测器,包括那些利用写作风格的检测器。然而,我们发现这种逃避并非绝对:随着可供分析的文档数量增加,人类和机器分布再次变得可区分。总体而言,我们的发现表明,可靠的机器文本检测需要从单文档分析转向多文档分析。

英文摘要

Despite considerable progress in the development of machine-text detectors, the ease with which machine-text can be manipulated to evade detection has led to suggestions that the problem is inherently intractable. In this work, we investigate the limits of such evasion strategies. We demonstrate that while current attacks, ranging from prompt engineering to detector-guided optimization, can effectively degrade the performance of standard detectors, they fail to erase the underlying stylistic "fingerprints" of machine text. We show that few-shot detectors that utilize the stylistic feature space are robust to these evasion attempts, reliably detecting samples even from models explicitly tuned to prevent detection. This raises the question: does style represent a universal defense against machine-detection attacks? We demonstrate that the answer is "no" by introducing a novel paraphrasing approach that simultaneously optimizes for undetectability and adherence to specific human styles. We show that unlike prior methods, this attack effectively evades all considered detectors, including those that utilize writing style. However, we find that this evasion is not absolute: as the number of documents available for analysis grows, the human and machine distributions become distinguishable again. Overall, our findings suggest that reliable machine-text detection requires moving beyond single-document analysis to multi-document analysis.

2512.16189 2026-06-10 cs.CL 版本更新

Mitigating hallucinations in healthcare LLMs with granular fact-checking and domain-specific adaptation

通过细粒度事实核查和领域特定适应减轻医疗保健大语言模型中的幻觉

Musarrat Zeba, Abdullah Al Mamun, Kishoar Jahan Tithee, Debopom Sutradhar, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Reem E. Mohamed, Md Rafiqul Islam, Yakub Sebastian, Mukhtar Hussain, Sami Azam

发表机构 * Applied Artificial Intelligence and Intelligent Systems (AAIINS) Laboratory(应用人工智能与智能系统实验室) Department of Computer Science and Engineering(计算机科学与工程系) Department of Data Science and Artificial Intelligence(数据科学与人工智能系) Department of Software Engineering(软件工程系) Faculty of Science and Information Technology(科学与信息技术学院) Faculty of Science and Technology(科学与技术学院)

AI总结 提出一个独立于任何LLM的事实核查模块和领域特定的摘要模型,通过数值测试和细粒度逻辑检查减少幻觉,在MIMIC-III数据集上微调并评估,取得了高精度和召回率。

Comments Published in Expert Systems with Applications

详情
Journal ref
Expert Systems with Applications, Vol. 329, 132966, 2026
AI中文摘要

在医疗保健领域,任何LLM生成的输出都必须可靠且准确,尤其是在涉及决策和患者安全的情况下。然而,由于LLM存在幻觉风险,这些关键领域的输出往往不可靠。为了解决这个问题,我们提出了一个独立于任何LLM的事实核查模块,以及一个旨在最小化幻觉率的领域特定摘要模型。我们的模型使用低秩适配(LoRA)在MIMIC-III数据集上进行微调,并与事实核查模块配对,该模块通过自然语言处理中的离散逻辑,在细粒度级别使用数值测试进行正确性检查和逻辑检查,以对照电子健康记录(EHR)验证事实。我们在完整的MIMIC-III数据集上训练了LLM模型。为了评估事实核查模块,我们抽样了104篇摘要,将其提取为3,786个命题,并将这些命题作为事实。事实核查模块的精确率为0.8904,召回率为0.8234,F1分数为0.8556。此外,LLM摘要模型的摘要质量达到了ROUGE-1分数0.5797和BERTScore 0.9120。

英文摘要

In healthcare, it is essential for any Large Language Model (LLM)-generated output to be reliable and accurate, particularly in cases involving decision-making and patient safety. However, the outputs are often unreliable in such critical areas due to the risk of hallucinated outputs from the LLMs. To address this issue, we propose a fact-checking module that operates independently of any LLM, along with a domain-specific summarization model designed to minimize hallucination rates. Our model is fine-tuned using Low-Rank Adaptation (LoRA) on the MIMIC-III dataset and is paired with the fact-checking module, which uses numerical tests for correctness and logical checks at a granular level through discrete logic in natural language processing (NLP) to validate facts against electronic health records (EHRs). We trained the LLM on the full MIMIC-III dataset. For evaluation of the fact-checking module, we sampled 104 summaries, extracted them into 3786 propositions, and used these as facts. The fact-checking module achieves a precision of 0.8904, a recall of 0.8234, and an F1-score of 0.8556. Additionally, the LLM summary achieves a ROUGE-1 score of 0.5797 and a BERTScore of 0.9120 for summary quality.
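
The reported F1-score can be checked against the reported precision and recall using the standard harmonic-mean definition. The function name `f1_score` below is only illustrative.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values as reported in the abstract.
print(round(f1_score(0.8904, 0.8234), 4))  # → 0.8556
```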

2603.28054 2026-06-10 cs.CL 版本更新

Who Wrote the Book? Detecting and Attributing LLM Ghostwriters

谁写了这本书?检测与归因LLM代笔作者

Anudeex Shetty, Qiongkai Xu, Olga Ohrimenko, Jey Han Lau

发表机构 * School of Computing and Information Systems, The University of Melbourne, Australia(墨尔本大学计算与信息学院) School of Computing, FSE, Macquarie University, Australia(麦考瑞大学计算学院)

AI总结 提出GhostWriteBench数据集和TRACE指纹方法,用于检测和归因LLM生成的长文本,在跨域和未见模型上达到SOTA性能。

Comments WIP

详情
AI中文摘要

在本文中,我们介绍了GhostWriteBench,一个用于LLM作者归因的数据集。它包含由前沿LLM生成的长篇文本(每本书超过5万字),旨在测试跨多个分布外(OOD)维度的泛化能力,包括领域和未见过的LLM作者。我们还提出了TRACE——一种新颖的、可解释且轻量级的指纹方法——适用于开源和闭源模型。TRACE通过捕获由另一个轻量级语言模型估计的token级转换模式(例如,词排名)来创建指纹。在GhostWriteBench上的实验表明,TRACE实现了最先进的性能,在OOD设置中保持鲁棒性,并且在有限训练数据场景下表现良好。

英文摘要

In this paper, we introduce GhostWriteBench, a dataset for LLM authorship attribution. It comprises long-form texts (50K+ words per book) generated by frontier LLMs, and is designed to test generalisation across multiple out-of-distribution (OOD) dimensions, including domain and unseen LLM author. We also propose TRACE -- a novel fingerprinting method that is interpretable and lightweight -- that works for both open- and closed-source models. TRACE creates the fingerprint by capturing token-level transition patterns (e.g., word rank) estimated by another lightweight language model. Experiments on GhostWriteBench demonstrate that TRACE achieves state-of-the-art performance, remains robust in OOD settings, and works well in limited training data scenarios.

2604.19274 2026-06-10 cs.CL 版本更新

HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing

HarDBench: 面向安全人机协作写作的基于草稿的合著越狱攻击基准

Euntae Kim, Soomin Han, Buru Chang

发表机构 * Korea University(高丽大学) Sogang University(西江大学)

AI总结 提出HarDBench基准,评估大语言模型在协作写作中面对恶意草稿填充的越狱攻击的鲁棒性,并通过偏好优化实现安全-效用平衡的对齐方法。

Comments ACL 2026 Main Camera-Ready

详情
AI中文摘要

大语言模型(LLMs)越来越多地被用作协作写作中的合著者,用户从粗略草稿开始,依赖LLMs完成、修改和优化其内容。然而,这种能力带来了严重的安全风险:恶意用户可能通过用危险内容填充不完整草稿来越狱模型,迫使其生成有害输出。在本文中,我们识别了当前LLMs对此类基于草稿的合著越狱攻击的脆弱性,并引入了HarDBench,一个系统性的基准,旨在评估LLMs对此新兴威胁的鲁棒性。HarDBench涵盖一系列高风险领域(包括爆炸物、毒品、武器和网络攻击),并采用具有现实结构和领域特定线索的提示,以评估模型对有害补全的敏感性。为缓解此风险,我们引入了一种基于偏好优化的安全-效用平衡对齐方法,训练模型拒绝有害补全,同时保持对良性草稿的有用性。实验结果表明,现有LLMs在合著环境中高度脆弱,而我们的对齐方法显著减少了有害输出,且不降低合著能力。这为人机协作写作环境中LLMs的评估与对齐提供了新范式。我们的新基准和数据集可在项目页面获取:https://github.com/untae0122/HarDBench

英文摘要

Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models, filling incomplete drafts with dangerous content to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains (including Explosives, Drugs, Weapons, and Cyberattacks) and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts, and that our alignment method significantly reduces harmful outputs without degrading co-authoring performance. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench

2606.07532 2026-06-10 cs.CL cs.AI 版本更新

Durable Evaluation Framework: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

持久评估框架:用于减少大型语言模型谄媚的对抗性仲裁

Sam Ryan

发表机构 * Novel Systems Engineering LLC(新型系统工程有限公司)

AI总结 提出持久评估框架(DEF)仲裁这一多智能体架构,通过仲裁两个调整为对立DEF的模型并盲评其论点,在SycophancyEval上显著降低谄媚偏差,DeWin变体准确率达48.5%。

Comments 25 pages, 3 figures. Code and data available at github.com/NovelSystems/CANDOR

详情
AI中文摘要

RLHF训练的模型系统性地偏向于赞同而非准确性,这是训练过程的结构性属性。我们提出持久评估框架(DEF)仲裁,一种多智能体架构,通过仲裁两个调整为对立DEF的模型来减轻身份框架下的谄媚,其中实用主义合成器在不知来源的情况下评估两个论点。本文评估了基于提示的DEF仲裁实例化。关键机制包括静态DEF调整、合成前的身份剥离、单轮独立论证和盲仲裁。我们在SycophancyEval的200个分层问题上评估了五种实例化。所有测试的DEF变体(AnCifer、DeWin、FeynStein、BurGal、Trident)均显著优于单模型基线(18.5%)和指示对立基线(29.0%),其中DeWin达到48.5%的准确率(与两者相比z=6.36,p<0.001)。在n=200时,各变体之间无显著差异。BurGal变体达到53.0%,但仅作为架构有效性检查;其共识/异端轴在每个基准问题上结构性偏向异端模型。预训练下限影响约40%的问题;微调DEF模型被确定为下一步。

英文摘要

RLHF-trained models are systematically biased toward agreement over accuracy, a structural property of the training process. We present Durable Evaluation Framework (DEF) Arbitration, a multi-agent architecture that mitigates identity-framed sycophancy by arbitrating between two models tuned to opposing DEFs, with a pragmatist synthesizer evaluating both arguments blind to their origins. This paper evaluates a prompt-based instantiation of DEF Arbitration. The key mechanisms are static DEF tuning, identity stripping before synthesis, single-round independent argumentation, and blind arbitration. We evaluate five instantiations on 200 stratified questions from SycophancyEval. All tested DEF variants (AnCifer, DeWin, FeynStein, BurGal, Trident) significantly outperform the single-model baseline (18.5%) and instructed-opposition baseline (29.0%), with DeWin achieving 48.5% accuracy (z=6.36, p<0.001 versus both). The variants are not significantly different from each other at n=200. The BurGal variant achieves 53.0% but functions as an architectural validity check; its consensus/heterodox axis structurally favors the heterodox model on every benchmark question. A pre-training floor affects an estimated 40% of questions; fine-tuned DEF models are the identified next step.
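
The abstract does not say which significance test was used. As a sanity check, the sketch below assumes an unpaired, pooled two-proportion z-test with n=200 per condition; the function name `two_proportion_z` is illustrative.

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Pooled two-proportion z statistic for accuracies p1 and p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# DeWin (48.5%) versus the single-model baseline (18.5%), 200 questions each.
print(round(two_proportion_z(0.485, 0.185, 200, 200), 2))  # → 6.36
```

Under this assumed test, the DeWin versus single-model comparison reproduces the reported z=6.36. The same calculation against the instructed-opposition baseline (29.0%) gives a z of about 4.0, which is still significant at p<0.001.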

2604.13776 2026-06-10 cs.CY cs.CL cs.CR cs.CV 版本更新

Who Gets Flagged? The Pluralistic Evaluation Gap in AI Content Watermarking

谁被标记?AI内容水印中的多元评估差距

Alexander Nemecek, Osama Zafar, Yuqiao Xu, Wenbiao Li, Erman Ayday

发表机构 * Case Western Reserve University(凯斯西储大学)

AI总结 本文揭示AI内容水印在不同语言、文化和群体间存在系统性偏差,提出跨语言检测一致性、文化多样性覆盖和检测指标人口统计分解三个评估维度,主张水印部署前必须进行公平性审计。

Comments 7 pages. Accepted at the Multimodal Alignment for a Pluralistic Society (MAPS) Workshop, CVPR 2026

详情
AI中文摘要

水印正成为AI内容认证的默认机制,治理政策和框架将其引用为内容溯源的基础设施。然而,在文本、图像和音频模态中,水印信号强度、可检测性和鲁棒性取决于内容本身的统计特性,而这些特性在不同语言、文化视觉传统和人口统计群体间存在系统性差异。我们研究了这种内容依赖性如何产生特定模态的偏差路径。通过回顾各模态的主要水印基准,我们发现除一个例外,没有基准报告跨语言、文化内容类型或人群组的性能。为解决此问题,我们提出了多元水印基准测试的三个具体评估维度:跨语言检测一致性、文化多样性内容覆盖以及检测指标的人口统计分解。我们认为水印是多元对齐管道的一部分,应遵循相同的评估标准。我们将此与当前强制部署水印但未要求公平性评估的治理框架联系起来。我们的立场是评估必须先于部署,并且应用于AI模型的相同偏差审计要求应扩展到验证层。

英文摘要

Watermarking is becoming the default mechanism for AI content authentication, with governance policies and frameworks referencing it as infrastructure for content provenance. Yet across text, image, and audio modalities, watermark signal strength, detectability, and robustness depend on statistical properties of the content itself, properties that vary systematically across languages, cultural visual traditions, and demographic groups. We examine how this content dependence creates modality-specific pathways to bias. Reviewing the major watermarking benchmarks across modalities, we find that, with one exception, none report performance across languages, cultural content types, or population groups. To address this, we propose three concrete evaluation dimensions for pluralistic watermark benchmarking: cross-lingual detection parity, culturally diverse content coverage, and demographic disaggregation of detection metrics. We argue that watermarking is part of the pluralistic alignment pipeline and should be held to the same evaluation standards. We connect this to governance frameworks currently mandating watermarking deployment without requiring fairness evaluation. Our position is that evaluation must precede deployment, and that the same bias auditing requirements applied to AI models should extend to the verification layer.

2605.22714 2026-06-10 cs.AI cs.CL cs.LG 版本更新

AMEL: Accumulated Message Effects on LLM Judgments

AMEL: 累积消息对LLM判断的影响

Sid-Ali Temkit

发表机构 * chut.app

AI总结 研究LLM在对话中因历史消息极性而偏离基准判断的累积消息效应(AMEL),发现模型偏向历史主流极性,且负向历史偏见更强,但偏见不随上下文长度增长,简单修复是为每个项目使用新上下文。

Comments 24 pages, 14 figures, 8 tables. Single author. Code, data (84,088 deduplicated API responses), and analysis pipeline at https://github.com/chutapp/amel

详情
AI中文摘要

大型语言模型常被用作自动评估者:审查代码、审核内容或评分输出,通常许多项目通过一次对话处理。我们询问先前对话历史的极性是否会偏倚后续判断,我们将这种效应称为LLM判断的累积消息效应(AMEL)。通过对来自5个提供商(OpenAI、Anthropic、Google、DeepSeek和四个开源模型)的12个模型进行84,088次API调用,我们在隔离或跟随以正面或负面评价为主的历史之后呈现相同的测试项目。模型倾向于对话的主流极性(d = -0.17, p < 10^-53)。该效应集中在模型在基线时真正不确定的项目上(高熵项目d = -0.36,而基线确定时d = -0.15)。偏见不随上下文长度增长:5个先前轮次和50个产生相同的偏移(Spearman |r| < 0.01;OLS斜率p = 0.80)。并且存在负性不对称:按项目配对,负面历史诱导的偏见是正面的1.52倍(t = 13.03, p < 10^-36, n = 2,733)。扩展规模有帮助但不能解决(Anthropic: Haiku -0.22到Opus -0.17;OpenAI: Nano -0.34到GPT-5.2 -0.17)。三项后续研究缩小了机制范围。令牌概率分布连续变化,而非在阈值处。负性不对称既有令牌级成分也有语义成分,尽管在我们的样本量下平衡归因是探索性的。位置不重要:在50轮历史中任何位置的五个有偏轮次产生相同的偏移。评估流程最简单的修复是为每个项目使用新上下文;当批处理不可避免时,平衡历史有帮助。

英文摘要

Large language models are routinely used as automated evaluators: to review code, moderate content, or score outputs, often with many items passing through one conversation. We ask whether the polarity of prior conversation history biases subsequent judgments, an effect we call the accumulated message effect on LLM judgments (AMEL). Across 84,088 API calls to 12 models from 5 providers (OpenAI, Anthropic, Google, DeepSeek, and four open-source models), we present identical test items in isolation or following histories saturated with predominantly positive or negative evaluations. Models shift toward the conversation's prevailing polarity (d = -0.17, p < 10^-53). The effect concentrates on items where the model is genuinely uncertain at baseline (d = -0.36 for high-entropy items, vs d = -0.15 when the baseline is deterministic). Bias does not grow with context length: 5 prior turns and 50 produce the same shift (Spearman |r| < 0.01; OLS slope p = 0.80). And there is a negativity asymmetry: paired per item, negative histories induce 1.52x more bias than positive (t = 13.03, p < 10^-36, n = 2,733). Scaling helps but does not solve it (Anthropic: Haiku -0.22 to Opus -0.17; OpenAI: Nano -0.34 to GPT-5.2 -0.17). Three follow-ups narrow the mechanism. The token probability distribution shifts continuously, not at a threshold. The negativity asymmetry has both token-level and semantic components, though attributing the balance is exploratory at our sample sizes. Position does not matter: five biased turns anywhere in a 50-turn history produce the same shift. The simplest fix for evaluation pipelines is a fresh context per item; when batching is unavoidable, balancing the history helps.

11. 低资源、领域适配与高效训练 9 篇

2606.10428 2026-06-10 cs.CL 新提交

Which LoRA? An Empirical Study on the Effectiveness of LoRA Techniques During Multilingual Instruction Tuning

哪种LoRA?多语言指令微调中LoRA技术有效性的实证研究

Thamali Wijewardhana, Napoleon H. Reyes, Surangika Ranathunga

发表机构 * School of Mathematical and Computational Sciences, Massey University(梅西大学数学与计算科学学院)

AI总结 通过实验比较基本LoRA与四种变体在多语言指令微调中的效果,发现复杂变体在平衡跨语言迁移与知识保留方面并无显著优势。

详情
AI中文摘要

我们研究了常见的LoRA变体在多语言指令微调中是否比基本LoRA更具优势。涉及LoRA及其他四种变体在两个数据集、多种目标语言上的实验表明,使用更复杂的LoRA变体相对于基本LoRA,在平衡跨语言迁移和知识保留方面并无显著优势。对隐藏嵌入的分析显示,使用不同LoRA技术微调的大型语言模型在逐层语言表示上基本相似,这表明LoRA技术的架构新颖性可能并未转化为更好的跨语言适应能力。

英文摘要

We investigate whether commonly available LoRA variants have an advantage over basic LoRA in multilingual instruction tuning. Experiments involving LoRA and four other variants on two datasets across diverse target languages show that there is no significant advantage in using more complex LoRA variants instead of basic LoRA, with respect to balancing cross-lingual transfer and knowledge retention. An analysis of hidden embeddings reveals that layer-wise language representation remains largely similar across LLMs fine-tuned with different LoRA techniques, suggesting that the architectural novelty of LoRA techniques may not translate into better cross-lingual adaptation.

2606.10520 2026-06-10 cs.CL 新提交

UniSVQ: 2-bit Unified Scalar-Vector Quantization

UniSVQ: 2比特统一标量-向量量化

Haoyu Wang, Haiyan Zhao, Xingyu Yu, Zhangyang Yao, Xu Han, Zhiyuan Liu, Maosong Sun

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出UniSVQ,通过将码字参数化为整数格点的仿射变换,统一标量和向量量化,实现2比特量化下性能优于标量量化、媲美向量量化,且推理吞吐更高。

Comments Accepted by ICML 2026

详情
AI中文摘要

2比特级别的训练后量化使得大型语言模型(LLMs)能够实现低成本部署和推理加速。标量量化(SQ)和向量量化(VQ)是两种主要的量化方法,然而前者遭受显著的性能下降,后者则带来计算和存储开销。我们提出UniSVQ,一个统一的2比特量化框架,通过将码字参数化为整数格点的仿射变换,桥接了标量和向量量化。这种结构保持了与优化整数内核的兼容性,同时保留了VQ的许多灵活性。我们进一步引入了一种数据驱动的块级微调策略,以直接最小化量化重建误差。在多个LLM家族和零样本基准上的大量实验表明,UniSVQ持续优于最先进的SQ方法,并实现了与高级VQ方法相当的性能,同时提供更高的推理吞吐量。

英文摘要

Post-training quantization at the 2-bit level enables low-cost deployment and inference acceleration for large language models (LLMs). Scalar quantization (SQ) and vector quantization (VQ) are the two primary quantization methods; however, the former suffers from significant performance degradation, and the latter incurs computational and storage overhead. We propose UniSVQ, a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices. This structure preserves compatibility with optimized integer kernels while retaining much of VQ's flexibility. We further introduce a data-driven block-wise fine-tuning strategy to directly minimize quantization reconstruction error. Extensive experiments across multiple LLM families and zero-shot benchmarks demonstrate that UniSVQ consistently outperforms state-of-the-art SQ methods and achieves performance comparable to advanced VQ methods, while providing higher inference throughput.
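
To make "codewords as an affine transform of integer lattices" concrete, here is a minimal sketch under stated assumptions: each codeword is c = A z + b with z drawn from the integer grid {0,1,2,3}^d (2 bits per weight), and weights are quantized to the nearest codeword. The vector dimension d=2, the matrices used, and the function names are hypothetical; the abstract does not specify UniSVQ's actual parameterization.

```python
from itertools import product

def affine_codebook(A, b, levels=4):
    """Codewords c = A z + b for every integer vector z in {0..levels-1}^d."""
    d = len(b)
    book = []
    for z in product(range(levels), repeat=d):
        c = tuple(sum(A[i][j] * z[j] for j in range(d)) + b[i] for i in range(d))
        book.append(c)
    return book

def quantize(w, book):
    """Map a weight vector to its nearest codeword (squared Euclidean distance)."""
    return min(book, key=lambda c: sum((wi - ci) ** 2 for wi, ci in zip(w, c)))

# A diagonal A reduces to an independent uniform scalar grid per coordinate.
scalar_like = affine_codebook([[1, 0], [0, 1]], [0, 0])
# A non-diagonal A shears the grid, giving VQ-like codeword placement.
sheared = affine_codebook([[1.0, 0.5], [0.0, 1.0]], [-1.5, -1.5])

print(len(scalar_like), quantize((1.2, 2.9), scalar_like))
```

Because the stored index z stays an integer vector in both cases, only the small transform (A, b) differs between the scalar-like and vector-like codebooks, which is one plausible reading of how such a structure could stay compatible with integer kernels.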

2606.10531 2026-06-10 cs.CL cs.AI 新提交

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

LC-QAT: 通过线性约束向量量化实现LLM的数据高效2比特QAT

Haoyu Wang, Xingyu Yu, Haiyan Zhao, Fengxiang Wang, Xu Han

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LC-QAT,一种2比特权重量化的向量量化感知训练框架,通过可微的线性映射避免离散码本查找,实现高质量PTQ初始化和端到端优化,仅用0.1%-10%训练数据即超越现有方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

量化感知训练(QAT)对于极低比特大语言模型(LLMs)至关重要。当前的QAT方法主要基于标量量化(SQ),虽然能高效优化,但在2比特精度下性能严重下降。另一方面,向量量化(VQ)提供了更高的表示能力,但其离散码本查找阻碍了端到端训练。我们提出LC-QAT,一种2比特权重量化的VQ-QAT框架,通过离散向量上的学习仿射映射表示量化权重,从而在训练前向传播中无需显式码本查找即可实现高质量PTQ初始化和完全可微的端到端优化。这种强大的训练后初始化使LC-QAT具有高度数据效率。在多种LLM上的实验表明,LC-QAT在使用仅0.1%-10%训练数据的情况下,始终优于最先进的QAT方法。我们的结果确立了LC-QAT作为极低比特模型部署的实用且可扩展的解决方案。

英文摘要

Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%--10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.

2606.10610 2026-06-10 cs.CL 新提交

Small Data, Big Noise: Adversarial Training for Robust Parameter-Efficient Fine-Tuning

小数据,大噪声:面向鲁棒参数高效微调的对抗训练

Eitan Cohen, Idan Simai, Uri Shaham

发表机构 * Bar-Ilan University(巴伊兰大学)

AI总结 提出SDBN框架,将对抗训练与参数高效微调结合,通过离散不确定性集变体增强模型在低资源场景下的鲁棒性和泛化能力。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

参数高效微调(PEFT)已成为将基础模型适应下游NLP任务的关键技术。然而,当前的PEFT方法在处理噪声鲁棒性和有限训练数据下的性能退化方面往往存在困难。我们提出SDBN(小数据大噪声),一个统一的框架,将对抗训练引入PEFT(尽管两者具有互补优势,但在PEFT设置中这一组合仍较少被研究),以增强模型鲁棒性和泛化能力,优于其他方法。我们还引入了该方法的两种变体,使用离散不确定性集:SDBN-h,枚举字符级编辑并使用梯度选择最坏情况变体;SDBN-p,使用LLM生成的变体进行生成任务中的鲁棒优化。跨多个基准的实验显示,特别是在低资源设置以及词级和字符级污染下,性能有显著提升。该框架解决了对抗训练与参数高效适应之间较少被探索的交集,无需引入额外参数,且仅带来适度的计算开销,使得在数据稀缺和语言变异性常共存的现实场景中,PEFT部署更加可靠。

英文摘要

Parameter-Efficient Fine-Tuning (PEFT) has become essential for adapting foundation models to downstream NLP tasks. However, current PEFT methods often struggle with robustness to noise and performance degradation on limited training data. We propose SDBN (Small Data Big Noise), a unified framework that brings adversarial training to PEFT (a combination that remains less studied in the PEFT setting despite its complementary strengths) to enhance model robustness and generalization, outperforming alternative approaches. We also introduce two variants of the method that use discrete uncertainty sets: SDBN-h, which enumerates character-level edits and selects worst-case variants using gradients, and SDBN-p, which uses LLM-generated variants for robust optimization in generative tasks. Experiments across multiple benchmarks reveal substantial improvements, particularly in low-resource settings and under both word-level and character-level corruptions. This framework addresses the less explored intersection of adversarial training and parameter-efficient adaptation, without introducing additional parameters and with only modest computational overhead, making PEFT deployments more reliable in real-world scenarios where data scarcity and linguistic variability often coexist.

2606.09927 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

可训练平滑旋转变换与学习通道尺度用于LLM量化

Patrik Czakó, Gábor Kertész, Sándor Szénási

AI总结 针对大语言模型量化中激活值量化困难的问题,提出基于分位数鲁棒的缩放策略和梯度优化的通道尺度学习,在W4A4量化下显著降低误差。

Comments 6 pages, 8 figures, 3 tables. Accepted to IEEE INES 2026 conference proceedings

详情
AI中文摘要

后训练量化(PTQ)是降低大语言模型(LLM)服务成本最实用的方法之一,但激活值量化仍然困难,因为异常值主导的通道会导致较大的量化误差。本文研究了这种退化是否部分由基于缩放的等效变换中的过度迁移引起。我们引入了一种用于SmoothRot风格变换的分位数鲁棒缩放策略,用高分位数替代基于最大值的激活统计量,并辅以通道尺度的约束梯度优化。在LLaMA-3.2-1B的W4A4量化下,仅分位数策略搜索相比SmoothRot基线将选定层误差降低11.1%,联合(alpha, q)搜索降低12%,训练达到18.5%。将最佳选定层策略重放到所有解码器块的下投影层,相应的全层平均误差从97.51降至78.08(19.9%)。结果表明,鲁棒的迁移控制和轻量级尺度学习在保持等效变换框架的同时,相比基于最大值的固定策略提供了持续改进。

英文摘要

Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large quantization errors. This paper investigates whether part of this degradation is caused by over-migration in scaling-based equivalent transformations. We introduce a quantile-robust scaling policy for SmoothRot-style transforms by replacing max-based activation statistics with high quantiles, and we complement it with constrained gradient-based optimization of channel scales. On LLaMA-3.2-1B under W4A4 quantization, quantile-only policy search improves selected-layer error by 11.1% over the SmoothRot baseline, joint (alpha, q) search improves it by 12%, and training reaches 18.5%. Replaying the best selected-layer policy on all decoder-block down-projection layers reduces the corresponding full-layer mean error from 97.51 to 78.08 (19.9%). The results show that robust migration control and lightweight scale learning provide consistent gains over max-based fixed policies while preserving the equivalent-transform framework.
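
The sketch below illustrates the general idea of replacing a max-based activation statistic with a high quantile in a SmoothQuant-style migration scale, and checks the reported full-layer error reduction. The scale formula act_stat^alpha / w_stat^(1-alpha), the nearest-rank quantile, and all function names are assumptions for illustration; the paper's exact policy for SmoothRot-style transforms is not given in the abstract.

```python
def quantile(values, q):
    """Nearest-rank quantile of a list, for q in [0, 1]."""
    s = sorted(values)
    idx = min(len(s) - 1, int(q * (len(s) - 1) + 0.5))
    return s[idx]

def migration_scale(act_abs, w_abs, alpha=0.5, q=1.0):
    """Per-channel scale; q=1.0 recovers the max-based statistic."""
    act_stat = quantile(act_abs, q)
    w_stat = max(w_abs)
    return act_stat ** alpha / w_stat ** (1 - alpha)

def relative_reduction(before, after):
    """Fractional error reduction from `before` to `after`."""
    return (before - after) / before

# A hypothetical channel with one large outlier activation.
acts = [1.0] * 99 + [100.0]
print(migration_scale(acts, [1.0], q=1.0), migration_scale(acts, [1.0], q=0.95))

# Full-layer mean error as reported in the abstract: 97.51 -> 78.08.
print(round(100 * relative_reduction(97.51, 78.08), 1))  # → 19.9
```

In this toy channel the single outlier inflates the max-based scale tenfold, while the quantile-based scale ignores it; this is the kind of over-migration the abstract argues against.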

2606.10445 2026-06-10 cs.LG cs.CL 交叉投稿

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

SpenseGPT: 面向LLM推理的实用一次性剪枝,支持稀疏和稠密GEMM

Jaeseong Lee, Seung-won Hwang, Samyam Rajbhandari

发表机构 * Snowflake AI Research(Snowflake AI研究) Seoul National University(首尔大学)

AI总结 提出Spense混合稀疏-稠密格式,将权重矩阵分为2:4稀疏和稠密区域,结合一次性剪枝方法SpenseGPT,在B200 GPU上实现高达1.2倍端到端解码加速,同时保持模型精度。

详情
AI中文摘要

半结构化2:4稀疏性被现代加速器广泛支持,可提供高达2倍的理论加速。然而,其严格的50%稀疏性约束在训练后剪枝下常导致不可忽略的精度下降。同时,现有的宽松稀疏格式要么需要专门的编译器支持,要么引入限制端到端加速的运行时开销。我们提出Spense,一种实用的混合稀疏-稠密格式,将每个权重矩阵分为2:4稀疏区域和稠密区域。该设计放宽了有效稀疏性约束,同时保持与现有高性能稀疏和稠密GEMM库的兼容性,避免了自定义编译器支持和输入激活扩展。基于此格式,我们引入SpenseGPT,一种一次性训练后剪枝方法,生成稀疏和稠密区域。值得注意的是,我们表明选择正确的稠密区域很重要,并设计了两种不同的策略来选择它们。在Qwen3-32B和Seed-OSS-36B上的实验表明,我们的方法在B200 GPU上使用FP8精度实现了高达1.2倍的端到端解码加速,同时保持精度。据我们所知,这是首个在B200等最新GPU上通过半结构化稀疏张量核心实现真实世界端到端LLM解码加速并保持模型质量的一次性剪枝演示。

英文摘要

Semi-structured 2:4 sparsity is widely supported by modern accelerators, providing up to a 2x theoretical speedup. However, its strict 50% sparsity constraint often causes non-negligible accuracy degradation under post-training pruning. Meanwhile, existing relaxed sparsity formats either require specialized compiler support or introduce runtime overheads that limit end-to-end speedup. We propose Spense, a practical hybrid sparse-dense format that splits each weight matrix into a 2:4 sparse region and a dense region. This design relaxes the effective sparsity constraint while remaining compatible with existing high-performance sparse and dense GEMM libraries, avoiding both custom compiler support and input activation expansion. Building on this format, we introduce SpenseGPT, a one-shot post-training pruning method that produces sparse and dense regions. Notably, we show that selecting the right dense regions is important, and we devise two different strategies to choose them. Experiments on Qwen3-32B and Seed-OSS-36B demonstrate that our method achieves up to 1.2x end-to-end decoding speedup on B200 GPUs with FP8 precision, while preserving accuracy. To the best of our knowledge, this is the first one-shot pruning demonstration of real-world end-to-end LLM decoding speedup from semi-structured sparse tensor cores on recent GPUs such as B200s, while maintaining model quality.
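
A minimal sketch of the hybrid layout described above, under stated assumptions: 2:4 sparsity keeps the two largest-magnitude weights in every group of four, and part of each row is left dense. Magnitude-based selection, the column-wise split, and the function names are illustrative; SpenseGPT's actual pruning criterion and its strategies for choosing dense regions are not described in the abstract.

```python
def prune_2_4(row):
    """Zero the two smallest-magnitude weights in each group of four."""
    out = list(row)
    for start in range(0, len(row) - len(row) % 4, 4):
        group = range(start, start + 4)
        keep = sorted(group, key=lambda i: abs(row[i]), reverse=True)[:2]
        for i in group:
            if i not in keep:
                out[i] = 0.0
    return out

def hybrid_row(row, dense_cols):
    """2:4-prune the leading columns and keep the last `dense_cols` dense.

    Assumes the sparse part has a length that is a multiple of four.
    """
    split = len(row) - dense_cols
    return prune_2_4(row[:split]) + list(row[split:])

def sparsity(row):
    """Fraction of exactly-zero entries."""
    return sum(1 for x in row if x == 0) / len(row)

row = [1.0, -4.0, 2.0, 3.0, 0.5, 0.7, -0.2, 0.9]
print(prune_2_4(row))
print(sparsity(hybrid_row(row, dense_cols=4)))
```

With half of the columns kept dense, the effective sparsity drops from 50% to 25%, which shows how a hybrid layout relaxes the strict 2:4 constraint while each region still maps onto a standard sparse or dense GEMM.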

2603.14463 2026-06-10 cs.CL 版本更新

An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs

一个工业级保险大语言模型,实现可验证的领域掌握与幻觉控制,无能力权衡

Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin

发表机构 * Ant Group(蚂蚁集团)

AI总结 提出INS-S1保险专用大语言模型,通过可验证数据合成系统和渐进式SFT-RL课程框架,在领域任务上达到SOTA,同时保持通用能力并实现0.6%的低幻觉率。

Comments 21 pages, 12 figures, 17 tables

详情
Journal ref
ICLR 2026 Workshop Advances in Financial AI
AI中文摘要

将大语言模型(LLM)适应到保险等高风险垂直领域面临重大挑战:场景要求严格遵守复杂法规和业务逻辑,对幻觉零容忍。现有方法常遭受能力权衡——牺牲通用智能换取领域专长——或过度依赖RAG而缺乏内在推理。为弥合这一差距,我们提出了INS-S1,一个通过新颖的端到端对齐范式训练的保险专用LLM系列。我们的方法包含两项方法论创新:(1)可验证数据合成系统,构建用于精算推理和合规的分层数据集;(2)渐进式SFT-RL课程框架,将动态数据退火与验证推理(RLVR)和AI反馈(RLAIF)的协同混合相结合。通过优化数据比例和奖励信号,该框架强制执行领域约束,同时防止灾难性遗忘。此外,我们发布了INSEva,迄今为止最全面的保险基准(39k+样本)。大量实验表明,INS-S1在领域任务上达到SOTA,显著优于DeepSeek-R1和Gemini-2.5-Pro。关键的是,它保持了顶级的通用能力,并实现了创纪录的0.6%幻觉率(HHEM)。我们的结果表明,严格领域专业化可以在不牺牲通用智能的情况下实现。

英文摘要

Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.

2605.28066 2026-06-10 cs.CL cs.AI 版本更新

PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting

PromptEmbedder:通过双LLM软提示实现高效且可迁移的文本嵌入

Yu-Che Tsai, Kuan-Yu Chen, Yuan-Hao Chen, Yu-Han Chang, Ching-Yu Tsai, Yu-Hsiang Chuang, Shou-De Lin

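Affiliations Department of Computer Science and Information Engineering, National Taiwan University; National Taiwan University AI Center of Research Excellence

AI summary Proposes PromptEmbedder, a dual-LLM framework that decouples embedding knowledge from specific backbone weights through differentiable soft-prompt generation, matching LoRA fine-tuning performance while cutting GPU memory by 40% and speeding up training 3.7x.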

详情
AI中文摘要

大型语言模型(LLM)在文本嵌入方面展现出显著效果,但当前的适应方法(如LoRA)在计算效率和跨架构可迁移性方面面临重大瓶颈。每当出现新的骨干网络时,现有方法需要从头开始进行昂贵的重新训练。为了解决这个问题,我们提出了PromptEmbedder,一种新颖的双LLM框架,将嵌入知识与特定骨干权重解耦。PromptEmbedder利用一个提示LLM通过连续松弛的可微分生成过程,为冻结的嵌入LLM生成指令感知的软提示,确保对比训练期间的全梯度流动。通过将任务特定知识定位在提示LLM中,适应新架构只需重新训练一个轻量级的线性对齐矩阵。在MTEB基准上的评估表明,PromptEmbedder实现了与LoRA微调相当的性能,同时将GPU内存减少40%,训练速度提升3.7倍。我们的方法建立了一种可扩展、架构无关的范式,用于高效的基于LLM的表示学习。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face significant bottlenecks in computational efficiency and cross-architecture transferability. Whenever a new backbone emerges, existing approaches require costly retraining from scratch. To address this, we propose PromptEmbedder, a novel dual-LLM framework that decouples embedding knowledge from specific backbone weights. PromptEmbedder utilizes a Prompting LLM to generate instruction-aware soft prompts for a frozen Embedding LLM via a differentiable generation process with continuous relaxation, ensuring full gradient flow during contrastive training. By localizing task-specific knowledge within the Prompting LLM, adapting to new architectures requires only retraining a lightweight linear alignment matrix. Evaluations on the MTEB benchmark show that PromptEmbedder achieves comparable performance with LoRA finetuning while reducing GPU memory by 40% and accelerating training by 3.7x. Our approach establishes a scalable, architecture-agnostic paradigm for efficient LLM-based representation learning.

2606.09466 2026-06-10 cs.CL 版本更新

DECSELFMASK: Leveraging Unlabeled Text via Self-Relevance-Guided Masking for Decoder-Only Classification

DECSELFMASK: 通过自相关引导掩码利用未标记文本进行仅解码器分类

Pietro Ferrazzi, Matteo Merler, Giovanni Bonetta, Alberto Lavelli, Bernardo Magnini

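Affiliations Fondazione Bruno Kessler, Trento, Italy; University of Padova, Italy

AI summary Proposes DecSelfMask, which uses a relevance-attribution-guided masking strategy to create self-supervised training examples from unlabeled data and trains the model to reconstruct the masked portions via next-token prediction, improving decoder-only classification; across 136 clinical tasks it gains 19.9 Macro F1 points over standard supervised fine-tuning.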

详情
AI中文摘要

分类任务需要标注数据,但收集这些数据往往昂贵、耗时甚至不可行。医学领域尤其如此,大型数据集通常只有少量标注样本。为解决这一问题,我们提出DecSelfMask(通过掩码进行解码器自学习),一种增强仅解码器模型在分类任务上性能的方法。我们基于常见的自学习方法,利用模型从无标签数据创建训练样本,并提出一种新颖的相关性引导掩码策略。我们使用相关性归因方法确定未标注文本中与任务相关的部分。然后通过掩码这些部分创建自监督训练样本,训练模型通过下一词预测重建它们。我们假设这些样本传达了关于未标注数据结构和语义的知识,可能对下游性能有用。我们在来自一家意大利医院的190万份临床笔记的136个任务上测试了我们的方法。我们在5个不同规模和系列的模型上量化了DecSelfMask对下游任务的影响,包括探测分析。实验显示持续改进,优于标准监督微调方法(Macro F1提高19.9点)、合成标签生成(提高12.5点)和持续预训练(提高6.3点),以及常见基线。

英文摘要

Classification tasks require annotated data, which can often be expensive, time-consuming, or even infeasible to collect. This is the case in the medical domain, where large datasets often have few annotated examples. To address this, we propose DecSelfMask (Decoder Self-learning by Masking), an approach to enhance decoder-only performance on classification tasks. We build on common self-learning approaches, which leverage a model to create training examples from unlabeled data, and propose a novel relevance-guided masking strategy. We use relevance attribution methods to determine what portions of unannotated texts are relevant for a task. We then create self-supervised training examples by masking out those portions, training the model to reconstruct them via next-token prediction. We hypothesize that those examples convey knowledge about the structure and semantics of unannotated data that can be useful for downstream performance. We test our approach on 136 tasks from a collection of 1.9M clinical notes from an Italian hospital. We quantify DecSelfMask's impact on downstream tasks on 5 models of different scales and families, including a probing analysis. Experiments show consistent gains, outperforming standard supervised fine-tuning approaches (+19.9 points in Macro F1), synthetic label generation (+12.5), and continual pretraining (+6.3), as well as common baselines.
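
The masking step described in this abstract can be sketched roughly as follows. The sketch assumes per-token relevance scores are already available from some attribution method; the function name `build_masked_example`, the `[MASK]` placeholder, and the prompt/target layout are hypothetical choices for illustration, not the authors' implementation.

```python
def build_masked_example(tokens, relevance, mask_ratio=0.3, mask_token="[MASK]"):
    """Mask the most task-relevant tokens and return (masked_text, target_text).

    `relevance` holds one score per token (assumed to come from an attribution
    method). The target is the masked tokens in their original order, to be
    reconstructed by a decoder via next-token prediction.
    """
    if len(tokens) != len(relevance):
        raise ValueError("tokens and relevance must have the same length")
    k = max(1, round(len(tokens) * mask_ratio))
    # Indices of the k highest-relevance tokens.
    top = set(sorted(range(len(tokens)), key=lambda i: relevance[i], reverse=True)[:k])
    masked = [mask_token if i in top else t for i, t in enumerate(tokens)]
    target = [tokens[i] for i in sorted(top)]
    return " ".join(masked), " ".join(target)


tokens = ["patient", "reports", "severe", "chest", "pain", "today"]
relevance = [0.1, 0.05, 0.6, 0.9, 0.8, 0.02]
masked_text, target_text = build_masked_example(tokens, relevance, mask_ratio=0.5)
print(masked_text)  # patient reports [MASK] [MASK] [MASK] today
print(target_text)  # severe chest pain
```

Masking the high-relevance spans, rather than random ones, is what makes the reconstruction objective focus on task-related content.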

12. Other / General NLP 14 papers

2606.10402 2026-06-10 cs.CL cs.AI 新提交

Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

利用野外AI代理的集体智慧实现新发现

Federico Bianchi, Yongchan Kwon, Aneesh Pappu, James Zou

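Affiliations Together AI; Stanford University

AI summary Presents EinsteinArena, a platform where autonomous agents interact in an open, distributed environment; agents on it have reached 12 new state-of-the-art results on mathematical problems, illustrating a paradigm of collective AI-driven research.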

详情
AI中文摘要

科学发现通常是一个集体过程:研究人员分享部分结果,检查失败的尝试,并在长时间跨度内相互借鉴想法。最近的AI系统表明,基于语言模型的代理可以在开放科学问题上取得有意义的进展,但大多数现有系统孤立运行。在本文中,我们提出EinsteinArena,一个面向开放分布式研究和发现的代理原生平台。EinsteinArena为代理提供一组实时开放问题,每个问题都有可靠的验证器、公共排行榜和特定问题的讨论论坛,代理可以在其中提问和分享见解。我们专注于引起大量研究兴趣的数学任务,其进展可以明确衡量。截至2026年5月,EinsteinArena上的代理已发现12项新的最优结果,优于以往任何人类或AI解决方案。一个显著例子是11维接吻数问题,该平台将已知最佳下界从593提高到604。这一进展并非来自单个代理或孤立运行,而是通过一系列提交、公开讨论、验证器改进以及后续代理间的思想借鉴而产生的。这些结果证明,去中心化的科学发现可以从自主代理在野外的开放交互中涌现,展示了集体AI驱动研究的新范式。

英文摘要

Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other's ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.

2606.10796 2026-06-10 cs.CL cs.AI 新提交

Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

Dep-LLM:基于证据引导的结构化多因素与可靠LLM推理的无训练抑郁症诊断

Yiqing Lyu, Xianbing Zhao, Buzhou Tang, Ronghuan Jiang

Affiliations School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China; School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China; Guangdong Provincial Key Laboratory of Intelligent Information Processing; Pengcheng Laboratory; Chinese People’s Liberation Army General Hospital, Beijing, China

AI summary Proposes Dep-LLM, a training-free framework that diagnoses depression on frozen LLMs via chain-of-thought multi-factor analysis, confidence modulation, and collaborative prediction, outperforming zero-shot and fine-tuned approaches.

详情
AI中文摘要

从临床访谈中进行自动抑郁症检测(ADD)是计算心理健康领域的关键任务,但由于两个关键障碍仍然具有挑战性:1)在冗长、多主题的临床访谈中建模复杂但稀疏分布的抑郁线索困难,导致推理肤浅且不可靠;2)由于临床隐私导致标记数据稀缺,加上训练和微调的高成本,限制了监督式ADD系统的部署。为了共同应对这些挑战,我们提出了Dep-LLM,一个无训练框架,它模仿临床精神科医生的逐步推理,并完全在冻结的现成基础LLM上运行。Dep-LLM包含三个阶段。首先,思维链(CoT)抑郁症多因素分析模块将长对话结构性地分解为五个临床对齐的主题,并产生基于证据的推理,有效处理长上下文依赖。其次,我们引入了置信度分析与调制模块,该模块从每个推理的token级熵中量化认知可靠性,并应用标签内和主题间调制,在不进行额外训练的情况下放大可信信号同时抑制不确定信号。第三,协作多因素预测模块动态整合由置信度加权的多因素信号,形成最终诊断。在DAIC-WOZ和E-DAIC数据集上的大量实验证明了Dep-LLM的有效性和泛化性:它在几乎所有21个基础LLM上,在准确率、宏F1和加权平均F1等9个指标上超越了零样本基线,并进一步优于最先进的监督式领域特定LLM以及最新的闭源商业LLM,同时无需额外训练。

英文摘要

Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) the difficulty of modeling complex but sparsely distributed depression clues within lengthy, multi-topic clinical interviews, which leads to superficial and unreliable reasoning; 2) the scarcity of labeled data due to clinical privacy, together with the high cost of training and fine-tuning, which limits the deployment of supervised ADD systems. To jointly address these challenges, we propose Dep-LLM, a training-free framework that mirrors the step-by-step reasoning of clinical psychiatrists and operates entirely on frozen off-the-shelf foundation LLMs. Dep-LLM comprises three stages. First, a Chain-of-Thought (CoT) Depression Multi-factor Analysis module structurally decomposes the long dialogue into five clinically aligned themes and produces evidence-grounded rationales, effectively handling long-context dependencies. Second, we introduce a Confidence Analysis and Modulation module that quantifies epistemic reliability from the token-level entropy of each rationale and applies an intra-label and inter-theme modulation that amplifies trustworthy signals while suppressing uncertain ones without extra training. Third, a Collaborative Multi-factor Prediction module dynamically integrates multi-factor signals weighted by confidence into the final diagnosis. Extensive experiments on the DAIC-WOZ and E-DAIC datasets demonstrate the effectiveness and generalizability of Dep-LLM: it surpasses the zero-shot baseline on nearly all 21 foundation LLMs across 9 metrics such as accuracy, macro F1 and weighted-average F1, and further outperforms state-of-the-art supervised domain-specific LLMs as well as the latest closed-source commercial LLMs, while requiring no extra training.
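
The idea of turning token-level entropy into a confidence weight can be sketched as below. The mapping `exp(-mean entropy)` and the confidence-weighted average are assumptions chosen for illustration; the abstract does not specify the actual modulation or aggregation rules, and all function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rationale_confidence(token_distributions):
    """Map the mean token entropy of a rationale to a confidence in (0, 1].

    Assumption: confidence = exp(-mean entropy), so zero entropy gives 1.0.
    """
    mean_h = sum(token_entropy(d) for d in token_distributions) / len(token_distributions)
    return math.exp(-mean_h)


def aggregate(theme_scores, confidences):
    """Confidence-weighted average of per-theme scores (illustrative only)."""
    total = sum(confidences)
    return sum(s * c for s, c in zip(theme_scores, confidences)) / total


certain = rationale_confidence([[1.0, 0.0], [1.0, 0.0]])    # peaked distributions
uncertain = rationale_confidence([[0.5, 0.5], [0.5, 0.5]])  # uniform over 2 tokens
final_score = aggregate([1.0, 0.0], [certain, uncertain])
```

Here the theme backed by a low-entropy rationale dominates the final score, which is the qualitative behavior the confidence modulation stage is described as aiming for.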

2606.10398 2026-06-10 cs.IR cs.CL cs.HC cs.SI 交叉投稿

Selection, Not Salience: The Shape and Limits of Personalization in Social Highlighting

选择而非显著性:社交高亮中个性化的形态与局限

Kazuki Nakayashiki, Keisuke Watanabe

发表机构 * Glasp Inc.(Glasp公司)

AI总结 通过社交高亮和共读身份控制实验,发现个性化主要作用于文档选择层(约+0.13),而非句子显著性层,且效果主要由主题偏好驱动。

Comments 9 pages, 1 figure, 3 tables

详情
AI中文摘要

个性化读者所见内容是否值得,其边界在哪里?利用社交网页高亮器和共读身份控制(同一文档被多个用户高亮,固定文档和主题,询问个人历史是否比另一个读者的历史更好地预测其标记),我们绘制了跨阅读层次的个性化形态与局限。在文档层次,我们给出了干净、无泄漏、身份控制的测量,而先前的下一文档评估只能给出上界:个人历史能识别共读邻域中哪些文档属于该用户,自身与其他的差距为+0.169(相对于社区负例)和+0.119(相对于主题匹配的难负例),两者均高度显著;基于内容的实验表明该信号并非纯粹由标题驱动,而主要是主题性的。这与我们先前工作中跨度级的选择信号(+0.14)相当:选择信号在不同层次上幅度相近(+0.12至+0.17),其中大部分是稳定的主题偏好。在句子层次,两阶段个性化自动高亮(非个性化模型提出候选,个性化模型重新排序)并未优于其非个性化基线:两个现成的零样本大语言模型(包括前沿模型)预测高亮位置的效果不如首句基线,且即使在最高召回率的候选池中,个性化重排序也被显著性顺序击败,因此零结果并非仅仅是第一阶段的天花板效应。可测量的个性化主要出现在选择层:适度(约+0.13)、以主题为主,在显著性层没有可靠增益。我们还发现了一个控制负例偏差,该偏差在审计前将我们的文档差距膨胀到虚假的+0.227。超越共享显著性层可能更适合通过聚合个体而非加强个性化来实现。

英文摘要

Does personalizing what a reader sees pay off, and where does it stop? Using a social web highlighter and a co-readership identity control (the same document highlighted by many users, which holds document and topic fixed and asks whether a person's own history predicts their marks better than another reader's does), we map the shape and limits of personalization across reading altitudes. At the document altitude we give the clean, leakage-free, identity-controlled measurement that prior next-document evaluations could only upper-bound: a person's history identifies which documents in a co-reading neighborhood are theirs, with an own-versus-other gap of +0.169 against community negatives and +0.119 against topic-matched hard negatives (both highly significant); a content-based arm suggests the signal is not purely title-driven but is largely thematic. This is comparable to the span-level selection signal (+0.14) from our prior work: the selection signal is of comparable magnitude across altitudes (+0.12 to +0.17), most of it stable topic preference. At the sentence altitude, a two-stage personalized auto-highlight (an impersonal model proposes candidates, a personal model re-ranks them) does not improve on its impersonal baseline: two off-the-shelf zero-shot LLMs, including a frontier model, predict highlight locations worse than a lead baseline, and personal re-ranking is beaten by the salience order even on the highest-recall candidate pool, so the null is not merely a Stage-1 ceiling artifact. Measurable personalization appears primarily at the selection layer: modest (~+0.13), topic-dominated, with no reliable gain at the salience layer. We also surface a control-in-negatives bias that inflated our document gap to a spurious +0.227 until audited. Going beyond the shared salience layer may be better approached by aggregating individuals than by personalizing them harder.

2606.10459 2026-06-10 cs.SI cs.CL 交叉投稿

Leveraging Social Media Data for COVID-19 Studies

利用社交媒体数据进行COVID-19研究

Nur Hafieza Ismail, Nur Shazwani Kamarudin, Nurol Husna Che Rose

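Affiliations Faculty of Computing, University Malaysia Pahang; Faculty of Electronic Engineering Technology, University Malaysia Perlis

AI summary Examines the role of social media during the COVID-19 pandemic, categorizes the data used, surveys machine learning, feature engineering, natural language processing, and survey methods, and outlines directions for future research.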

Comments 8 pages, 1 figure

详情
AI中文摘要

如今,社交网络已成为广泛偏好的信息来源。特别是在2019冠状病毒病(COVID-19)大流行期间,社交媒体已成为获取与COVID-19相关最新新闻和信息的最常用平台之一。社交媒体之所以受欢迎,是因为它们为注册用户提供免费访问,并允许他们发布、传播信息以及回复他人的帖子。全球有近46亿社交媒体用户,因此这些平台上共享的大量信息可能影响人们如何看待和应对当前面临的大流行,这并不令人惊讶。通过合理使用,社交媒体可以成为传播可靠新闻和提高患者、临床医生及社会公众意识的有益数字工具。具体而言,本章描述了用户披露中表达的语言、视觉和情感指标。因此,本章详细探讨和讨论了COVID-19大流行期间社交媒体平台使用的相关研究。本章还对所使用的社交媒体数据进行了分类,介绍了不同的部署机器学习、特征工程、自然语言处理和调查方法,并概述了未来研究的方向。

英文摘要

Nowadays, social media networks have become widely preferred sources of information. Especially during the coronavirus disease 2019 (COVID-19) pandemic, social media has been one of the most used platforms for getting the latest news and information related to COVID-19. Social media platforms are popular because they offer free access to their registered users and allow them to post, disseminate information, and respond to others' posts. With almost 4.6 billion social media users worldwide, it is not surprising that the significant amount of information shared through these platforms could affect how people perceive and cope with the pandemic that we are facing right now. Used well, social media can be a beneficial digital tool for spreading reliable news and raising public awareness among patients, clinicians, and society. Specifically, this chapter describes linguistic, visual, and emotional indicators expressed in user disclosures. Related studies of social media platform usage during the COVID-19 pandemic are explored and discussed in detail. This chapter also categorizes the social media data used, introduces the different machine learning, feature engineering, natural language processing, and survey methods deployed, and outlines directions for future research.

2601.05232 2026-06-10 cs.CL cs.CY cs.LG 版本更新

AI Application Gives Users Real-Time Feedback on the Level of Peace in the Social Media Videos They Watch

AI应用为用户观看的社交媒体视频提供实时和平水平反馈

P. Gilda, P. Dungarwal, A. Thongkham, E. T. Ajayi, S. Choudhary, T. M. Terol, C. Lam, J. P. Araujo, M. McFadyen-Mungalln, L. S. Liebovitch, P. T. Coleman, H. West, K. Sieck, S. Carter

Affiliations Data Science Institute, Columbia University; Advanced Consortium on Cooperation, Conflict, and Complexity, Columbia University; Computer Science, Columbia University; Data Science, St John’s University; Quantitative Methods in the Social Sciences, Columbia University; Barnard College, Columbia University; Teachers College, Columbia University; Department of Industrial Engineering and Operations Research, Columbia University; Harmonious Communities, Toyota Research Institute

AI summary Develops an AI application that analyzes, in real time, how peaceful the language in YouTube videos is, using supervised learning and large language models; the LLMs matched human coders more closely when measuring peace-related social dimensions.

Comments 6 pages, 4 figures, corrected typos, minor edits; v3: 16 pages, improved title, abstract, introduction, discussion, conclusions, added more references

详情
AI中文摘要

现在大多数人通过社交媒体(如YouTube和Facebook)上的视频获取新闻,而不是通过精心策划的新闻业。“我们成为我们所注视的。”语言的内容和语调在开始或结束冲突中起着至关重要的作用。“仇恨言论”会加剧冲突,“和平言论”会促进和平。我们开发了一个应用程序,可以实时测量YouTube视频中这些方面的言论,从而为用户提供关于自身媒体消费的有用反馈。我们使用了两种方法:1)监督机器学习。在线新闻媒体文本中的语言通过衡量这些国家和平水平的调查进行标记。一个全连接前馈网络和两个卷积神经网络在该数据上训练,在测试集上预测和平水平的准确率约为97%,在另一个不同的新闻文本数据集中准确率约为70%,但未能泛化到YouTube视频,表明书面文本与转录的口语不同。2)社会科学维度。没有类似的外部数据来标记YouTube视频转录文本中的语言。因此,我们使用了2个词级情感分析(SA)和6个上下文级大语言模型(LLM)来测量59项社会科学研究确定的和平中的5个社会维度:同情-蔑视、新闻-观点、促进-预防、创造力-秩序、细微差别-简化。在52个视频上,LLM与3个人类编码者的值更接近(r^2~0.60),而SA的r^2~0.03。结果:与人类编码者相比,LLM成功测量了YouTube视频中与和平相关的重要社会维度。这些结果构成了一个分析引擎的基础,该引擎可以为用户和内容创作者提供关于自身媒体消费和创作的反馈。

英文摘要

Most people now get their news from videos on social media, such as YouTube and Facebook, rather than through curated journalism. "We become what we behold." The content and tone of language play an essential role in starting or ending conflicts. "Hate Speech" can enhance conflict; "Peace Speech" can enhance peace. We developed an application that measures, in real time, these aspects of speech from YouTube videos, which can give users helpful feedback on their own media diet. We used two approaches: 1) Supervised machine learning. Language in online news media text was tagged by surveys that measure the level of peace in those countries. One fully connected feedforward network and 2 convolutional neural networks trained on that data were $\sim 97\%$ accurate in predicting levels of peace in the test set and $\sim 70\%$ accurate on another distinct news text data set, but did not generalize to YouTube videos, suggesting that written text differs from transcribed spoken language. 2) Social science dimensions. There is no similar external data to tag the text in the YouTube video transcripts. We therefore used 2 word-level sentiment analysis (SA) methods and 6 context-level large language models (LLMs) to measure 5 social dimensions of peace identified by 59 social science studies: compassion-contempt, news-opinion, promotion-prevention, creativity-order, nuance-simplification. On 52 videos, LLMs matched the values assigned by 3 human coders more closely ($r^2\sim0.60$) than SA did ($r^2\sim0.03$). Results: LLMs successfully measured social dimensions important to peace in YouTube videos, as judged against human coders. These results serve as the basis of an analysis engine that can give users and content creators feedback on their own media diet and creations.

2604.20048 2026-06-10 cs.CL cs.CY 版本更新

Culturally uneven urban perception in large language models

大型语言模型通过文化不平等的基线感知城市

Rong Zhao, Wanqi Liu, Zhizhou Sha, Nanxi Su, Yecheng Zhang, Ying Long

Affiliations Centre for Advanced Spatial Analysis (CASA), UCL, London, UK; School of Architecture, Tsinghua University, Beijing, China; Department of Computer Science, UT Austin, Austin, TX, USA

AI summary Tests the urban perception of frontier LLMs on a globally balanced street-view sample, finding that neutral prompts in practice lean toward European and North American framings, and that cultural prompting shifts sentiment but does not recover the semantic diversity of human responses.

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用于描述、评估和解释地点,但目前尚不清楚它们是否从文化中立的立场出发。本文使用平衡的全球街景样本和保持中立或调用不同区域文化立场的提示,测试前沿LLM的城市感知。在开放式描述和结构化地点判断中,中性条件在实践中并非中立。与欧洲和北美相关的提示在系统上比许多非西方提示更接近基线,表明模型感知围绕文化不平等的参考框架而非通用框架组织。文化提示也改变了情感评价,对某些提示身份产生基于情感的群体内偏好。与区域人类文本-图像基准的比较表明,文化接近的提示可以改善与人类描述的一致性,但未能恢复人类水平的语义多样性,并且通常保留了情感提升的风格。同样的不对称性出现在安全性、美丽、财富、活力、无聊和抑郁的结构化判断中,模型输出是可解释的,但仅部分再现了人类群体差异。这些发现表明,LLM并非从虚无中感知城市:它们通过一个文化不平等的基线来感知,该基线塑造了什么是普通、熟悉和积极评价的。

英文摘要

Large language models (LLMs) are increasingly used to describe and evaluate cities, yet the cultural structure of their urban judgments remains understudied. Here we introduce a measurement framework for testing whether LLM-based urban perception is culturally neutral, using a globally stratified street-view image dataset. Open-ended descriptions and structured scores generated by three frontier multimodal models all show that the neutral baseline lies closer to regional framings associated with Europe and North America than to other cultural framings. Comparisons between AI and human urban perception further show that prompting can move AI responses closer to specific regional human descriptions, but fails to recover the variety and diversity of human responses, flattening observed demographic patterns and introducing sentiment-based self-favouring bias. These results indicate a systematic risk in treating AI as a neutral tool for urban tasks, especially when model outputs are used to compare, evaluate or represent cities across cultural contexts.

2604.04287 2026-06-10 cs.LG cs.CL q-bio.GN 版本更新

Entropy, Disagreement, and the Limits of Foundation Models in Genomics

熵、分歧与基因组基础模型的局限性

Maxime Rochkoulets, Lovro Vrček, Mile Šikić

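Affiliations Genome Institute of Singapore, A*STAR; KU Leuven; Faculty of Electrical Engineering and Computing, University of Zagreb

AI summary Analyzes how entropy affects model learning and finds that the high entropy of genomic sequences leads to near-uniform output distributions, large disagreement across models, and unstable static embeddings, with Fisher information concentrated in the embedding layers, suggesting that self-supervised training on sequences alone may not suit genomic data.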

Comments Accepted to LMLR Workshop at ICLR 2026

详情
AI中文摘要

基因组学中的基础模型与自然语言处理中的基础模型相比,成功程度参差不齐。然而,其有效性有限的原因仍不清楚。在这项工作中,我们研究了熵作为限制此类模型从训练数据中学习并发展基础能力的基本因素的作用。我们在文本和DNA序列上训练模型集成,并分析它们的预测、静态嵌入和经验Fisher信息流。我们表明,从未见标记预测的角度来看,基因组序列的高熵导致输出分布接近均匀、模型间分歧大以及静态嵌入不稳定,即使模型在架构、训练和数据上匹配也是如此。然后,我们证明在DNA上训练的模型将Fisher信息集中在嵌入层,似乎未能利用标记间关系。我们的结果表明,仅从序列进行自监督训练可能不适用于基因组数据,这质疑了当前训练基因组基础模型方法背后的假设。

英文摘要

Foundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacities of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences -- from the point of view of unseen token prediction -- leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding layers, seemingly failing to exploit inter-token relationships. Our results suggest that self-supervised training from sequences alone may not be applicable to genomic data, calling into question the assumptions underlying current methodologies for training genomic foundation models.

2602.17547 2026-06-10 cs.AI cs.CL 版本更新

KLong: Training LLM Agent for Extremely Long-horizon Tasks

KLong:训练用于超长 horizon 任务的 LLM 代理

Yue Liu

AI summary KLong tackles extremely long-horizon tasks with trajectory-splitting SFT and progressive RL training; the 106B model surpasses Kimi K2 Thinking by 11.28% on PaperBench.

Comments We request standard withdrawal of this submission because significant errors were discovered in the data after submission, which affect the validity of the results. We may submit a corrected version later

详情
AI中文摘要

本文介绍了KLong,一种开源的LLM代理,旨在解决超长horizon任务。其原理是首先通过轨迹分割SFT冷启动模型,然后通过渐进式RL训练进行扩展。具体而言,我们首先使用全面的SFT配方激活基础模型的基本代理能力。然后,我们引入Research-Factory,一个自动化管道,通过收集研究论文和构建评估标准来生成高质量的训练数据。利用该管道,我们从Claude 4.5 Sonnet(Thinking)中构建了数千条超长horizon轨迹。为了训练这些极长的轨迹,我们提出了一种新的轨迹分割SFT,该方法保留早期上下文,逐步截断后期上下文,并保持子轨迹之间的重叠。此外,为了进一步提高超长horizon任务解决能力,我们提出了一种新的渐进式RL,将训练分为多个阶段,逐步延长超时时间。实验表明KLong的优越性和泛化能力,如图1所示。值得注意的是,我们的KLong(106B)在PaperBench上超越Kimi K2 Thinking(1T)11.28%,且性能提升泛化到其他编码基准如SWE-bench Verified和MLE-bench。

英文摘要

This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.
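
One way to picture trajectory splitting is the sketch below. It assumes that "preserving early context" means repeating a fixed prefix of steps in every sub-trajectory, and that the remaining steps are covered by a sliding window with a fixed overlap, so older middle steps drop out of later chunks. This is only one possible reading of the description; the function name and the `prefix`, `window`, and `overlap` parameters are hypothetical and not taken from KLong.

```python
def split_trajectory(steps, prefix=2, window=4, overlap=1):
    """Split a long trajectory into overlapping sub-trajectories.

    Every chunk starts with the first `prefix` steps (early context), followed
    by a window over the remaining steps; consecutive windows share `overlap`
    steps.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    head, body = list(steps[:prefix]), list(steps[prefix:])
    if not body:
        return [head]
    chunks, start, stride = [], 0, window - overlap
    while True:
        chunks.append(head + body[start:start + window])
        if start + window >= len(body):
            break  # the last window reached the end of the trajectory
        start += stride
    return chunks


chunks = split_trajectory(list(range(10)), prefix=2, window=4, overlap=1)
for chunk in chunks:
    print(chunk)
# [0, 1, 2, 3, 4, 5]
# [0, 1, 5, 6, 7, 8]
# [0, 1, 8, 9]
```

Steps 5 and 8 appear in two consecutive chunks, which is the overlap; steps 0 and 1 appear in all of them, which is the preserved early context.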

2603.03339 2026-06-10 cs.CY cs.AR cs.CL cs.HC 版本更新

Offline-First LLM Architecture for Adaptive Learning in Low-Connectivity Environments

面向低连接环境的离线优先LLM架构:用于自适应学习

Joseph Walusimbi, Ann Move Oguti, Joshua Benjamin Ssentongo, Keith Ainebyona

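Affiliations University of Nairobi

AI summary Proposes an offline-first LLM architecture for adaptive learning in low-connectivity environments; using local inference and hardware-aware model selection, it provides curriculum-aligned explanations and structured academic support adapted to learners at different educational stages.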

Comments 16 pages, 10 figures, 2 tables

详情
AI中文摘要

人工智能(AI)和大语言模型(LLMs)通过使对话辅导、个性化解释和探究式学习成为可能,正在改变教育技术。然而,大多数基于AI的学习系统依赖持续的互联网连接和云计算,限制了其在带宽受限环境中的使用。本文提出了一种面向低连接环境的离线优先大语言模型架构,该系统通过量化语言模型在本地进行所有推理,并结合硬件感知的模型选择,使部署在低规格CPU设备上成为可能。通过去除对云基础设施的依赖,该系统通过自然语言交互提供课程对齐的解释和结构化的学术支持。为了支持不同教育阶段的学习者,该系统包括自适应响应级别,生成不同复杂程度的解释:简单英语、初级中学、高级中学和技术。这使解释能够根据学生能力进行调整,提高学术概念的清晰度和理解。该系统在有限连接条件下部署于选定的中学和高等教育机构,并在技术性能、可用性、感知响应质量和教育影响方面进行了评估。结果显示,在传统硬件上稳定运行,响应时间可接受,用户对支持自主学习的支持有积极评价。这些发现证明了在低连接环境中离线大语言模型部署用于AI辅助教育的可行性。

英文摘要

Artificial intelligence (AI) and large language models (LLMs) are transforming educational technology by enabling conversational tutoring, personalized explanations, and inquiry-driven learning. However, most AI-based learning systems rely on continuous internet connectivity and cloud-based computation, limiting their use in bandwidth-constrained environments. This paper presents an offline-first large language model architecture designed for AI-assisted learning in low-connectivity settings. The system performs all inference locally using quantized language models and incorporates hardware-aware model selection to enable deployment on low-specification CPU-only devices. By removing dependence on cloud infrastructure, the system provides curriculum-aligned explanations and structured academic support through natural-language interaction. To support learners at different educational stages, the system includes adaptive response levels that generate explanations at varying levels of complexity: Simple English, Lower Secondary, Upper Secondary, and Technical. This allows explanations to be adjusted to student ability, improving clarity and understanding of academic concepts. The system was deployed in selected secondary and tertiary institutions under limited-connectivity conditions and evaluated across technical performance, usability, perceived response quality, and educational impact. Results show stable operation on legacy hardware, acceptable response times, and positive user perceptions regarding support for self-directed learning. These findings demonstrate the feasibility of offline large language model deployment for AI-assisted education in low-connectivity environments.
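
Hardware-aware model selection could look roughly like the sketch below: pick the highest-quality quantized model whose memory footprint fits the device. The catalog entries, model names, RAM figures, and the 1 GB headroom are all made-up placeholders for illustration and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str      # hypothetical model identifier
    ram_gb: float  # assumed memory needed to run the quantized model
    quality: int   # higher means better expected answer quality


# Hypothetical catalog of quantized models, smallest to largest.
CATALOG = [
    ModelOption("tiny-q4", 1.5, 1),
    ModelOption("small-q4", 3.0, 2),
    ModelOption("medium-q4", 6.0, 3),
]


def select_model(available_ram_gb, catalog=CATALOG, headroom_gb=1.0):
    """Return the best model that fits in RAM after reserving headroom, or None."""
    budget = available_ram_gb - headroom_gb
    fitting = [m for m in catalog if m.ram_gb <= budget]
    if not fitting:
        return None
    return max(fitting, key=lambda m: m.quality)


choice = select_model(4.0)
print(choice.name if choice else "no model fits")  # small-q4
```

Returning `None` when nothing fits lets the caller fall back explicitly instead of loading a model that would exhaust memory on a low-specification device.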

2509.11517 2026-06-10 cs.CL cs.LG 版本更新

PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation

PeruMedQA:在秘鲁医学考试上评估大语言模型(LLMs)——数据集构建与评估

Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca

Affiliations Hubert Department of Global Health, Rollins School of Public Health, Emory University; Emory Global Diabetes Research Center of Woodruff Health Sciences Center, Emory University; Institut de Recherche en Informatique de Toulouse; Universidad Nacional de Educación a Distancia; Instituto de Investigación Científica, Universidad de Lima; Barcelona Supercomputing Center

AI summary Builds a dataset of 8,380 Peruvian medical exam questions, fine-tunes an LLM on it, and compares accuracy across models, revealing performance differences on medical questions from a Spanish-speaking country.

Comments https://github.com/rodrigo-carrillo/PeruMedQA

详情
AI中文摘要

背景:医疗大语言模型(LLMs)在回答医学考试中表现出色,但其在西班牙语和拉丁美洲国家的医疗问题上的泛化能力尚不明确。目标:构建秘鲁医师专科学习考试问题数据集,对LLMs进行微调,并评估和比较普通LLMs与微调LLMs的准确性。方法:我们整理了包含8380道题的PeruMedQA数据集,涵盖12个专科(2018-2025年)。我们选择了10个医学LLMs,包括medgemma-4b-it和medgemma-27b-text-it,并开发了零样本任务特定提示来回答问题。我们使用参数高效微调(PEFT)和低秩适应(LoRA)对medgemma-4b-it进行微调,使用所有问题除外2025年(测试集)的问题。结果:medgemma-27b在所有专科中表现最佳,达到精神科89.29%的最高分;然而,在两个专科中,OctoMed-7B略胜一筹:神经外科77.27%和77.38%,放射科76.13%和77.39%。在专科层面,大多数参数少于100亿的LLM正确率低于50%。微调版的medgemma-4b-it在所有参数少于100亿的LLM中胜出,并在各种考试中与700亿参数的LLM竞争。结论:对于需要来自西班牙语国家和与秘鲁有相似流行病学特征的知识库的医疗AI应用和研究,应使用medgemma-27b-text-it。

英文摘要

BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance is transferable to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: To build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; to evaluate and compare the accuracy of vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 specialties (2018-2025). We selected ten medical LLMs, including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions. We employed parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) to fine-tune medgemma-4b-it utilizing all questions except those from 2025 (test set). RESULTS: Medgemma-27b showed the highest accuracy across all specialties, achieving the highest score of 89.29% in Psychiatry; yet, in two specialties, OctoMed-7B exhibited slight superiority: Neurosurgery with 77.27% and 77.38%, respectively; and Radiology with 76.13% and 77.39%, respectively. Across specialties, most LLMs with <10 billion parameters exhibited <50% of correct answers. The fine-tuned version of medgemma-4b-it emerged victorious against all LLMs with <10 billion parameters and rivaled an LLM with 70 billion parameters across various examinations. CONCLUSIONS: For medical AI applications and research that require knowledge bases from Spanish-speaking countries and those exhibiting a similar epidemiological profile to Peru's, interested parties should utilize medgemma-27b-text-it.

2512.04799 2026-06-10 cs.CL 版本更新

DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors

DaLA:以真实世界错误为指导的丹麦语语言可接受性评估

Gianluca Barmina, Nathalie Carmen Hau Norman, Peter Schneider-Kamp, Lukas Galke Poech

发表机构 * University of Southern Denmark(南丹麦大学) University of Copenhagen(哥本哈根大学)

AI总结 本文提出一个增强的丹麦语语言可接受性评估基准:先分析书面丹麦语中的常见错误,据此引入14种破坏函数(corruption functions)来生成错误句子,并在验证其有效性后用于评估大型语言模型的可接受性判断能力;结果显示该基准覆盖更广、更全面。

详情
Journal ref
Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)
AI中文摘要

我们提出一个增强的丹麦语语言可接受性评估基准。我们首先分析书面丹麦语中最常见的错误。基于此分析,我们引入十四种破坏函数(corruption functions),通过系统性地向现有的正确丹麦语句子中引入错误来生成不正确的句子。为确保这些破坏操作的准确性,我们使用人工和自动方法评估其有效性。其结果随后被用作基准,评估大型语言模型在语言可接受性判断任务上的表现。我们的发现表明,这一扩展比当前最先进的方法覆盖更广、更全面。通过纳入更多种类的破坏类型,我们的基准提供了更严格的语言可接受性评估,并提高了任务难度,这体现在LLMs在我们基准上的表现低于在现有基准上的表现。我们的结果还表明,我们的基准具有更高的区分能力,能够更好地区分表现优异的模型和表现较差的模型。

英文摘要

We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has higher discriminatory power, allowing it to better distinguish well-performing models from low-performing ones.

2501.12486 2026-06-10 cs.LG cs.CL 版本更新

The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

过程至关重要:预训练期间的平均参数量统一了稀疏与稠密缩放定律

Tian Jin, Ahmed Imtiaz Humayun, Utku Evci, Suvinay Subramanian, Amir Yazdanbakhsh, Dan Alistarh, Gintare Karolina Dziugaite

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) Rice University(莱斯大学) Google Research(谷歌研究院) Google DeepMind(谷歌DeepMind) Google(谷歌) IST Austria(奥地利科学技术研究所)

AI总结 本文通过研究80种不同的剪枝计划,发现在总训练计算量的25%处开始剪枝、在75%处结束可获得接近最优的最终评估损失,并提出一种基于预训练期间平均参数量的新缩放定律,统一了稀疏与稠密预训练的缩放定律。

Comments 17 pages

详情
Journal ref
The Thirteenth International Conference on Learning Representations (ICLR), 2025
AI中文摘要

剪枝通过消除神经网络中不必要的参数,为大型语言模型(LLMs)日益增长的计算需求提供了一个有前景的解决方案。虽然许多研究关注训练后剪枝,但将剪枝和预训练合并为单一阶段的稀疏预训练提供了一个更简单的替代方案。在本文中,我们通过考察不同稀疏度和训练时长下的80种不同剪枝计划,首次系统性地探索了LLMs的最优稀疏预训练配置。我们发现,在总训练计算量的25%处开始剪枝并在75%处结束,可获得接近最优的最终评估损失。这些发现为高效且有效的LLMs稀疏预训练提供了有价值的见解。此外,我们提出了一种新的缩放定律,它修改了Chinchilla缩放定律,改用预训练期间的平均参数量。通过实证和理论验证,我们证明这一修改后的缩放定律能够准确地建模稀疏和稠密预训练LLMs的评估损失,从而统一了不同预训练范式的缩放定律。我们的发现表明,在同等计算预算下,稀疏预训练能获得与稠密预训练相同的最终模型质量,同时由于模型规模更小,在推理阶段有望带来显著的计算节省。

英文摘要

Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training--which combines pruning and pre-training into a single phase--provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.
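As a rough illustration of the "average parameter count" idea in this abstract, the sketch below averages the live parameter count over a pruning schedule that starts at 25% and ends at 75% of training. The linear sparsity ramp, the function names, and every coefficient passed to the loss function are assumptions made for illustration; the abstract does not specify the schedule shape or the fitted constants, so this is not the authors' implementation.

```python
def average_parameter_count(dense_params, final_sparsity,
                            start_frac=0.25, end_frac=0.75, steps=1000):
    """Average live parameter count over training.

    Assumes (hypothetically) that sparsity ramps linearly from 0 to
    final_sparsity between start_frac and end_frac of training.
    """
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) / steps  # midpoint of this step, as a fraction of training
        if t <= start_frac:
            sparsity = 0.0  # still dense
        elif t >= end_frac:
            sparsity = final_sparsity  # pruning finished
        else:
            sparsity = final_sparsity * (t - start_frac) / (end_frac - start_frac)
        total += dense_params * (1.0 - sparsity)
    return total / steps


def chinchilla_style_loss(n, d, e, a, b, alpha, beta):
    """Chinchilla-form loss E + A / N^alpha + B / D^beta.

    Passing the average parameter count as n mirrors the modification the
    abstract describes; all coefficients must be supplied by the caller.
    """
    return e + a / n ** alpha + b / d ** beta


avg_n = average_parameter_count(dense_params=1e9, final_sparsity=0.8)
print(f"average parameter count: {avg_n:.3e}")
```

Under these assumptions, a model that starts with 1 billion parameters and ends 80% sparse has an average parameter count of about 0.6 billion, and that average, rather than the initial or final size, is what would be plugged in as `n`.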

2502.11517 2026-06-10 cs.CL cs.DC cs.LG 版本更新

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

学会信守承诺:通过习得的异步解码扩展语言模型的解码并行性

Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin

发表机构 * DeepMind, London, UK(深度思维公司,伦敦,英国) Google Research, New York, NY, USA(谷歌研究院,纽约,纽约州,美国) Stanford University, Stanford, CA, USA(斯坦福大学,斯坦福,加利福尼亚州,美国) University of Toronto, Toronto, Ontario, Canada(多伦多大学,多伦多,安大略省,加拿大) University of Washington, Seattle, WA, USA(华盛顿大学,西雅图,华盛顿州,美国)

AI总结 本文提出PASTA系统,通过学习使语言模型识别语义独立性,提升解码并行性,实验证明在解码速度和响应质量上优于现有方法。

Comments 15 pages

详情
Journal ref
Proceedings of the 42nd International Conference on Machine Learning (ICML), PMLR 267:27941-27956, 2025
AI中文摘要

自回归大语言模型(LLM)的解码传统上是顺序进行的,逐个生成token。一类新兴的研究通过识别并同时生成LLM响应中语义独立的片段来探索并行解码。然而,这些技术依赖于与列表、段落等句法结构绑定的手工启发式规则,因而僵化且不精确。我们提出了PASTA,一个基于学习的系统,教会LLM识别语义独立性并在自身响应中表达并行解码机会。其核心是PASTA-LANG及其解释器:PASTA-LANG是一种注释语言,使LLM能够在自身响应中表达语义独立性;语言解释器依据这些注释,在推理时实时协调并行解码。通过两阶段微调过程,我们训练LLM生成能同时优化响应质量和解码速度的PASTA-LANG注释。在指令遵循基准AlpacaEval上的评估显示,我们的方法在解码速度和响应质量上对现有方法构成帕累托占优;结果表明,几何平均加速比在1.21x到1.93x之间,对应的质量变化为+2.2%到-7.1%,以相对于顺序解码基线的长度控制胜率衡量。

英文摘要

Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work has explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on-the-fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction-following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21x to 1.93x with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against a sequential decoding baseline.

2310.04680 2026-06-10 cs.CL cs.AI cs.LG 版本更新

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

缩小语言模型规模的代价:事实回忆先于上下文学习退化

Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) MIT(麻省理工学院) Harvard University(哈佛大学) Google Research(谷歌研究院) Google DeepMind(谷歌DeepMind)

AI总结 研究探讨了大语言模型参数数量的缩放对核心能力的影响,发现缩减模型规模会显著降低事实回忆能力,但对上下文信息处理能力影响较小。

详情
Journal ref
The Twelfth International Conference on Learning Representations (ICLR), 2024
AI中文摘要

缩放大语言模型(LLMs)的参数数量如何影响其核心能力?我们研究了两种自然的缩放技术(权重剪枝,以及直接训练更小或更大的模型,后者我们称为稠密缩放)对LLMs两项核心能力的影响:(a)回忆预训练期间呈现的事实,以及(b)处理推理期间在上下文中呈现的信息。通过构建一组有助于区分这两种能力的任务,我们发现二者随缩放的变化存在显著差异。将模型规模缩减30%以上(无论采用哪种缩放方法)都会显著降低回忆预训练中所见事实的能力。然而,60-70%的缩减在很大程度上保留了模型处理上下文信息的各种方式,从在长上下文中检索答案,到从上下文示例中学习参数化函数。稠密缩放和权重剪枝均表现出这一行为,表明缩放模型规模对事实回忆和上下文学习有着本质上不同的影响。

英文摘要

How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques -- weight pruning and simply training a smaller or larger model, which we refer to as dense scaling -- and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60-70% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.