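arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.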

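2606.09856 2026-06-10 cs.CL cs.AI cs.LG stat.ML New submission

Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

Jingyi Xie, Yijun Lin, Yinjiang Xiong, Zhikun Zhang, Sai Li

Affiliations: Renmin University of China; Tsinghua University; Zhejiang University; Lightstandard

AI summary: In MoE models, a routing mismatch between forget data and retain data leaves forget-critical experts under-regularized. TRACE detects these experts from offline activation statistics and reweights retain losses to calibrate retain-side activation frequency. On WMDP and MUSE-BOOKS it yields a 9% relative utility improvement over the strongest baseline at comparable forgetting quality.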

Abstract

Machine unlearning is increasingly important for large language models, yet unlearning in Mixture-of-Experts (MoE) architectures remains underexplored. Unlike dense models, MoE architectures employ a router at each layer to assign each token to a sparse subset of experts. In this work, we observe that forget data often activates a small subset of experts disproportionately, while these experts may receive much weaker activation from retain data. This forget-retain routing mismatch can leave forget-critical experts under-regularized during unlearning. To address this, we propose TRACE, Targeted Routing-Aware Calibration of Experts, for MoE unlearning. TRACE first detects forget-critical experts from offline activation statistics, and then calibrates retain regularization by reweighting token-level retain losses so that each selected expert's retain-side activation frequency better matches its forget-side counterpart. Experiments on WMDP and MUSE-BOOKS across multiple MoE LLMs show that TRACE consistently improves the forget-utility trade-off, yielding a 9% relative utility improvement over the strongest baseline under comparable forgetting quality and the best performance on three out of four MUSE-BOOKS metrics.
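The detect-then-reweight idea described in the abstract can be illustrated with a toy sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the selection criterion (largest forget-minus-retain frequency gap), the weight formula (ratio of forget-side to retain-side frequency, floored at 1), and all function names.

```python
from collections import Counter


def activation_freq(token_experts):
    """Fraction of tokens that route to each expert."""
    counts = Counter(e for experts in token_experts for e in set(experts))
    return {e: c / len(token_experts) for e, c in counts.items()}


def select_forget_critical(forget_freq, retain_freq, top_n=1):
    """Assumed criterion: largest forget-minus-retain frequency gap."""
    gap = {e: f - retain_freq.get(e, 0.0) for e, f in forget_freq.items()}
    return sorted(gap, key=lambda e: (-gap[e], e))[:top_n]


def retain_token_weights(retain_tokens, forget_freq, retain_freq, critical, eps=1e-8):
    """Upweight retain tokens routed to forget-critical experts (never below 1)."""
    weights = []
    for experts in retain_tokens:
        hits = sorted(set(experts) & set(critical))
        if not hits:
            weights.append(1.0)
            continue
        ratio = sum(forget_freq[e] / (retain_freq.get(e, 0.0) + eps) for e in hits) / len(hits)
        weights.append(max(1.0, ratio))
    return weights


# Toy offline statistics: the expert ids selected by the router for each token.
forget_tokens = [[0, 1], [0, 2], [0, 1], [0, 3]]
retain_tokens = [[1, 2], [2, 3], [0, 2], [1, 3]]
f_freq, r_freq = activation_freq(forget_tokens), activation_freq(retain_tokens)
critical = select_forget_critical(f_freq, r_freq)
weights = retain_token_weights(retain_tokens, f_freq, r_freq, critical)
```

In this toy data, expert 0 is hit by every forget token but only a quarter of retain tokens, so the one retain token that routes to it receives a weight of about 4 while the others stay at 1.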

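2606.10369 2026-06-10 cs.CL cs.LG New submission

PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning

Xinyue Peng, Yi Qian, Jiaojiao Lin, Wenjian Shao, Yanming Liu

Affiliations: University of Science and Technology of China

AI summary: Proposes Path-Aligned Decompression Distillation (PADD), a four-stage, two-phase pipeline that distills knowledge from a dense teacher into a mixture-of-experts (MoE) student while learning a high-quality routing policy, and substantially outperforms strong baselines on mathematical reasoning benchmarks.

Comments: Published in ICML 2026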

Abstract

As large language models (LLMs) continue to scale, it becomes increasingly challenging to grow model capacity under fixed computation budgets. We propose Path-Aligned Decompression Distillation (PADD), a framework for distilling knowledge from dense teachers without explicit routing into mixture-of-experts (MoE) students while learning high-quality routing policies. PADD organizes knowledge distillation into four stages in two phases: an initialization phase (Stage I) that builds diverse functionality in the student's experts through teacher neuron clustering and student-expert warmup, and a training phase (Stages II--IV) that integrates online adaptive distillation, path-refined policy optimization, and reward-augmented load balancing in a single training pipeline. Experiments on mathematical reasoning benchmarks demonstrate that PADD yields substantial gains over strong baselines at the same inference cost and that the MoE student can match or surpass its dense teacher. They also demonstrate effective teacher-to-student knowledge distillation and stable routing behavior.

2606.10537 2026-06-10 cs.CL New submission

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

Jing Xiong, Qi Han, Shansan Gong, Yunta Hsieh, Chengyue Wu, Chaofan Tao, Chenyang Zhao, Ngai Wong

Affiliations: The University of Hong Kong; University of Michigan, Ann Arbor; LMSYS Org

AI summary: Diffusion language models re-encode the prefix at every denoising step, so computation grows quadratically with context length. Prefilling-dLLM caches chunk-level KV representations once and selects the most relevant chunks with sparsity for efficient decoding, reaching state-of-the-art quality among dLLM acceleration methods on LongBench and InfiniteBench.

Comments: Technical Report

Abstract

Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework for dLLMs that partitions the prefix into N chunks, caches their KV representations once, and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields 9.1--28.0x speedup at 8K--32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Code is available at https://github.com/menik1126/Prefilling-dLLM.
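The partition-then-select step can be sketched in a few lines. This is only an illustration under stated assumptions: the relevance score used here (dot product between a query vector and the mean key of each chunk) and the function names are hypothetical, and the abstract does not specify how relevance is actually computed.

```python
def partition(seq_len, n_chunks):
    """Split range(seq_len) into n_chunks contiguous, near-equal chunks."""
    base, extra = divmod(seq_len, n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def select_top_k_chunks(keys, query, n_chunks, k):
    """Return the indices of the k chunks whose mean key best matches the query.

    Assumes len(keys) >= n_chunks so that no chunk is empty.
    """
    scored = []
    for idx, chunk in enumerate(partition(len(keys), n_chunks)):
        summary = mean_vector([keys[t] for t in chunk])
        score = sum(q * s for q, s in zip(query, summary))
        scored.append((score, idx))
    top = sorted(scored, reverse=True)[:k]
    return sorted(idx for _, idx in top)


# Eight cached key vectors forming four chunks of two tokens each.
keys = [[1, 0], [1, 0], [0, 1], [0, 1], [-1, 0], [-1, 0], [0.5, 0.5], [0.5, 0.5]]
selected = select_top_k_chunks(keys, query=[1, 0], n_chunks=4, k=2)
```

Only the selected chunks would then take part in decoding, which is where the per-step saving comes from: the cost depends on K chunks plus the decode length rather than on the full prefix.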

2606.10650 2026-06-10 cs.CL cs.AI New submission

Dynamic Linear Attention

Xin Wang, Hui Shen, Boyuan Zheng, Xueshen Liu, Minkyoung Cho, Zhongwei Wan, Zesen Zhao, Zhuoqing Mao, Shen Yan, Mi Zhang

Affiliations: The Ohio State University; University of Michigan; ByteDance Seed

AI summary: Proposes DLA, which uses information-aware dynamic state merging and capacity-bounded memory modeling to address the error accumulation caused by fixed merging policies in multi-state linear attention, and outperforms existing methods on 16 datasets.

Comments: Accepted by ICML 2026

Abstract

The scalability of Large Language Models (LLMs) to long contexts is fundamentally constrained by the quadratic complexity of standard attention, motivating the adoption of linear attention mechanisms with sub-quadratic cost. To improve representation capacity under long contexts, recent approaches organize memory in a multi-state manner. However, existing multi-state linear attention methods rely on fixed state merging policies that cannot adapt to dynamically varying token importance, irreversibly obscuring critical tokens and causing severe error accumulation over long sequences. To address this limitation, we propose DLA, a dynamic memory modeling framework for multi-state linear attention. DLA introduces (i) Information-Aware Dynamic State Merging, which adaptively determines state boundaries based on token-level information variation, preserving high-resolution representations around semantic transitions while aggressively summarizing stable regions, and (ii) Capacity-Bounded Memory Modeling, which maintains a fixed-size, chronologically ordered state cache by selectively merging adjacent low-information states to control memory growth with minimal information loss. We pre-train DLA on two different linear attention models and evaluate on 16 datasets across three categories. Experimental results demonstrate the superiority of DLA over state-of-the-art.
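A capacity-bounded, chronologically ordered state cache can be illustrated as follows. This is a hedged sketch rather than the paper's algorithm: each state is reduced to a hypothetical `(info_score, value)` pair, a scalar stands in for the state matrix, merging is modeled as addition, and "lowest combined information score of an adjacent pair" is an assumed merge criterion.

```python
def merge_to_capacity(states, capacity):
    """Merge adjacent low-information states until at most `capacity` remain.

    states: list of (info_score, value) tuples in chronological order.
    The merged state sums both fields, so chronological order is preserved.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    states = list(states)
    while len(states) > capacity:
        # Pick the adjacent pair with the lowest combined information score.
        i = min(range(len(states) - 1), key=lambda j: states[j][0] + states[j + 1][0])
        merged = (states[i][0] + states[i + 1][0], states[i][1] + states[i + 1][1])
        states[i:i + 2] = [merged]
    return states


cache = [(5, 1.0), (1, 2.0), (1, 3.0), (4, 4.0)]
bounded = merge_to_capacity(cache, capacity=3)
```

High-information states near semantic transitions stay separate, while the two stable, low-information neighbors in the middle collapse into a single summary state.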

2606.10722 2026-06-10 cs.CL New submission

Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

Ruixuan Huang, Jinyuan Shi, Hantao Huang, Yifan Huang, Ziyi Guan, Hao Zeng, Ian En-Hsu Yen, Minghui Yu

Affiliations: Nanyang Technological University; Salesforce AI; Huawei Noah's Ark Lab

AI summary: Proposes a continual training recipe that builds channel-sparse LLMs from dense checkpoints, using a predictor-gated sparse SwiGLU FFN and a bank-wise top-k rule to reach 4x sparsity, and repairs a long-context failure mode.

Abstract

We study dense-to-sparse continual training as a way to construct channel-sparse large language models from dense checkpoints. Starting from a Qwen2.5-8B dense backbone, we continue training at 32K context and introduce a predictor-gated sparse SwiGLU FFN in the 32K stage. For each token and layer, we use a low-rank predictor to produce FFN-channel routing logits. We then apply a bank-wise top-k rule to retain 16 channels in every 64-channel bank, yielding 4x sparsity in the FFN intermediate activation. Unlike post-hoc sparse inference methods, the routing module is placed on the main language modeling path and optimized during continual training, enabling the dense model to be upcycled into a hardware-oriented sparse model. We report the architecture, training recipe, benchmark performance, and training lessons. We also identify a layer-local long-context failure mode on RULER-CWE and propose a single-layer repair algorithm that substantially improves the affected length range.
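The bank-wise top-k rule (keep 16 channels in every 64-channel bank) can be written as a small mask function. This is a minimal pure-Python sketch: the function name is hypothetical, the routing logits are assumed to come from the low-rank predictor, and the tie-breaking rule (lower channel index wins) is an assumption.

```python
def bankwise_topk_mask(logits, bank_size=64, keep=16):
    """Return a 0/1 mask that keeps the `keep` highest logits in each bank."""
    if len(logits) % bank_size != 0:
        raise ValueError("number of channels must be a multiple of bank_size")
    mask = [0] * len(logits)
    for start in range(0, len(logits), bank_size):
        bank = range(start, start + bank_size)
        # sorted() is stable, so ties keep the lower channel index first.
        for i in sorted(bank, key=lambda c: logits[c], reverse=True)[:keep]:
            mask[i] = 1
    return mask


routing_logits = [float(c) for c in range(128)]  # two banks of 64 channels
mask = bankwise_topk_mask(routing_logits)
density = sum(mask) / len(mask)                  # fraction of active channels
```

Because every bank keeps exactly 16 of 64 channels, the active fraction is fixed at 1/4 regardless of the logits. That uniform per-bank pattern is plausibly what makes the sparsity hardware-oriented, compared with an unstructured global top-k.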

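2606.10829 2026-06-10 cs.CL cs.AI New submission

A Mechanistic Analysis of Alignment Algorithms in Language Models

Aarush Sinha, Ishan Garg, Veeraraju Elluru, Arth Singh, Kushal Garg

AI summary: Using layer-wise linear probing, sparse autoencoders, and crosscoders, this paper systematically analyzes the internal mechanisms of six preference-optimization methods in language models. Different objectives induce different geometric transformations of the representations, and behavioral alignment does not imply uniform internal restructuring.

Comments: Work in Progress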

Abstract

Post-training alignment algorithms are predominantly evaluated as black boxes, obscuring how they reshape language models' internal computations. We present a systematic mechanistic analysis of six preference-optimization methods: PPO, DPO, SimPO, ORPO, GRPO, and KTO across three open-weight model families. By integrating layer-wise linear probing, Sparse Autoencoders, and crosscoders, we localize preference representations and quantify alignment-induced geometric transformations in latent space. We find that preference signals consistently concentrate in early--mid or mid--late layers, but different objectives induce qualitatively distinct representational shifts. KTO and GRPO enhance linear separability through constructive feature sharing and sparse, high-salience recruitment. In contrast, DPO and ORPO degrade separability via non-constructive geometric rotation and feature attenuation, while PPO and SimPO largely preserve baseline geometry. These transformations exhibit architecture-dependent variability, demonstrating that behavioral alignment does not imply uniform internal restructuring. Our findings establish alignment as a heterogeneous intervention, motivate standardized feature-level auditing for safety and interpretability, and highlight the need for mechanism-aware optimization objectives.

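2606.09877 2026-06-10 cs.LG cs.CE cs.CL Cross-listed

Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis

Juan M. Huerta

Affiliations: Zinnia Tech Solutions

AI summary: Formalizes Streaming Knowledge Compilation, which uses a materiality signal φ_t to proactively pin important documents, proves an O(√(T log K)) regret bound, instantiates the approach in finance and Wikipedia, and reveals an LLM-as-judge confound.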

Abstract

LLM wiki systems compile knowledge into pre-filled KV caches for efficient inference, but assume a static corpus -- an assumption that fails whenever the underlying information landscape evolves. We formalize Streaming Knowledge Compilation: given a document stream, a fixed token budget, and future queries unknown at ingestion time, maintain a compiled wiki that minimizes cumulative regret against an offline oracle with perfect foresight. The enabling insight is a materiality signal $\phi_t(k,n)\in[0,1]$ that scores document importance for entity $k$ at time $t$, acting as a query-relevance surrogate for proactive pinning before queries arrive; we prove an $O(\sqrt{T\log K})$ regret bound where $\varepsilon=\mathbb{E}[|\phi_t-\hat{\phi}_t|]$ is the only domain-specific quantity. We instantiate in two domains: finance, where $\phi_t$ is abnormal stock volatility predicted by a frozen Llama 3.1 8B classification head (AUROC = 0.728 on 76K articles, strict temporal split; $1.49\times$ higher realized forward volatility for predicted-material articles); and Wikipedia, where $\phi_t$ is the Abnormal Edit Ratio (AER), a cross-sectionally normalized edit velocity -- showing the same algorithm generalizes beyond the finance domain. End-to-end QA evaluation on 173 matched pairs (finance) and 119 (Wikipedia) reveals a pervasive LLM-as-judge confound on post-training knowledge, establishing that regret analysis -- not absolute QA scores -- is the reliable evaluation metric for compiled knowledge systems. Finance cumulative regret converges to -20.0 (-0.12/step); Wikipedia to +16.0 (+0.13/step), with the positive sign confirming that Wikipedia edit content is genuinely post-training -- richer context consistently improves scores (No Wiki 3.80 vs. Oracle 4.74) -- and eliminates this confound. The $O(\sqrt{T\log K})$ guarantee applies to any domain where knowledge gaps can be predicted from streaming signals.
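Pinning under a fixed token budget can be pictured with a greedy toy example. This is a sketch under assumptions, not the paper's algorithm: documents are hypothetical `(doc_id, materiality, token_count)` tuples, and ranking by materiality per token is a standard knapsack-style heuristic chosen here purely for illustration.

```python
def pin_documents(docs, token_budget):
    """Greedily pin documents by materiality per token until the budget is used.

    docs: list of (doc_id, materiality in [0, 1], token_count > 0).
    Returns the pinned ids in selection order and the tokens consumed.
    """
    ranked = sorted(docs, key=lambda d: (-d[1] / d[2], d[0]))
    pinned, used = [], 0
    for doc_id, _materiality, tokens in ranked:
        if used + tokens <= token_budget:
            pinned.append(doc_id)
            used += tokens
    return pinned, used


stream = [("a", 0.9, 300), ("b", 0.2, 100), ("c", 0.6, 100), ("d", 0.5, 500)]
pinned, used = pin_documents(stream, token_budget=500)
```

The point of the sketch is that the decision uses only the materiality score available at ingestion time, before any query has arrived.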

2606.09887 2026-06-10 cs.LG cs.AI cs.CL Cross-listed

SocraticPO: Policy Optimization via Interactive Guidance

Zirui Liu, Jie Ouyang, Qi Liu, Xianquan Wang, Jiayu Liu, Tingyue Pan, Qingchuan Li, Jing Sha, Zhenya Huang, Shijin Wang, Enhong Chen

Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; iFLYTEK AI Research (Central China), iFLYTEK Co., Ltd

AI summary: Proposes SocraticPO, which augments RL rollouts with natural-language guidance and uses reward decay to keep the policy from relying on teacher help, improving performance on scientific reasoning tasks.

Abstract

Reinforcement learning (RL) for large language models usually supervises reasoning with scalar outcome rewards, such as binary correctness. Such rewards provide an optimization direction but rarely explain how a model should revise its mistaken reasoning, which can encourage shortcut learning and brittle policies. We propose SocraticPO (Socratic Policy Optimization), a policy-optimization framework that augments RL rollouts with Socratic-style natural-language guidance. During rollout, the student first answers independently; if the answer is incorrect, a teacher diagnoses the attempt and provides concise corrective guidance, after which the student continues under the expanded context. Crucially, this guidance is paired with reward decay: correct answers obtained after teacher intervention only receive decayed rewards, preventing the policy from treating teacher help as a free path to reward. Since SocraticPO only modifies the rollout process while leaving the standard expected-reward objective intact, it can be plugged into existing policy-gradient backends such as Reinforce++. Moreover, because the teacher provides only text-level guidance, SocraticPO can leverage stronger black-box teacher models without requiring access to logits or distribution matching. On undergraduate-level scientific reasoning benchmarks from SciKnowEval, SocraticPO improves over strong RL and self-distillation baselines. Ablations show that both targeted guidance and reward decay are necessary, with reward decay mitigating reliance on assisted correction.
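The rollout with teacher guidance and reward decay can be sketched as a short loop. The specifics are assumptions for illustration only: the geometric decay schedule (`decay ** hints_used`), the exact-match correctness check, and the `student` and `teacher` callables are all hypothetical stand-ins, since the abstract does not give the decay formula.

```python
def socratic_rollout(student, teacher, question, reference, max_hints=2, decay=0.5):
    """Run one guided rollout and return (final_answer, reward).

    An unaided correct answer earns 1.0; each teacher intervention
    multiplies the reward for an eventually correct answer by `decay`.
    """
    context = question
    answer = None
    for hints_used in range(max_hints + 1):
        answer = student(context)
        if answer == reference:
            return answer, decay ** hints_used
        if hints_used < max_hints:
            # The teacher only adds text to the context; no logits are needed.
            context = context + "\nHint: " + teacher(question, answer)
    return answer, 0.0


def toy_student(context):
    """Gets the answer right only after seeing a hint."""
    return "4" if "Hint:" in context else "5"


def toy_teacher(question, attempt):
    return "recheck the sum"


answer, reward = socratic_rollout(toy_student, toy_teacher, "2+2?", "4")
```

Since only the rollout and its reward change, the resulting `(answer, reward)` pairs could be fed to an ordinary policy-gradient update, which matches the abstract's point about plugging into existing backends.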

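2606.09894 2026-06-10 cs.LG cs.CL Cross-listed

A Navigable Manifold of Hypothesized Consciousness-Spectrum States in Language Model Representations

Sophie Zhao

Affiliations: School of Computer Science; Georgia Institute of Technology

AI summary: Studies geometric structure related to a hypothesized consciousness spectrum in language model embedding spaces. Embeddings form a navigable manifold: higher-level and lower-level regions are stable, intermediate regions form a transition corridor, and navigability is an intrinsic property of the space.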

Abstract

Across contemplative, philosophical, and psychological accounts, human consciousness is often described along a similar spectrum, ranging from reactive and self-focused patterns to more integrative and coherent ones. Understanding whether language models encode such a structured, human-interpretable consciousness spectrum in representation space is important for model guidance, evaluation and alignment. In this work, we study the geometric structure and dynamics of patterns along this spectrum in transformer embedding spaces. We show that embeddings exhibit a globally organized geometry aligned with this spectrum: sentences associated with similar states cluster into locally coherent regions, forming a structured manifold. In particular, higher-level and lower-level regions exhibit convexity-like stability, while intermediate regions form a transition corridor. Dynamically, both utility-guided and geometry-only greedy trajectories consistently traverse from lower- to higher-level regions, passing through intermediate tiers, indicating that navigability is an intrinsic property of the representation space, guided but not dictated by a global directional signal. These results suggest that embedding spaces encode structured and navigable geometry aligned with a hypothesized consciousness-spectrum taxonomy, broadly inspired by recurring structural descriptions of human consciousness across contemplative traditions, philosophy, and modern psychology, providing a representation-level perspective for analyzing and guiding model behavior.

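2606.09937 2026-06-10 cs.LG cs.AI cs.CL Cross-listed

Representation-Aware Advantage Estimation: Your Reward Model Offers More Than a Scalar Output

Guozheng Li, Xiyan Fu, Yiwen Guo

Affiliations: Southeast University; Nanyang Technological University; Independent Researcher

AI summary: Proposes representation-aware advantage estimation, which uses reward model hidden states as auxiliary signals and computes advantages via graph propagation, improving the sample efficiency and robustness of RLHF.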

Abstract

Current reinforcement learning from human feedback (RLHF) methods primarily rely on scalar rewards from a trained reward model (RM). While effective, scalar rewards are often noisy and fail to capture fine-grained preference differences, whereas RM hidden states encode richer semantic and preference information. We introduce representation-aware advantage estimation, which leverages RM hidden states and models them as auxiliary signals for better advantage estimation. Specifically, we propose Graph-based Advantage Estimation (GraphAE), which treats each sampled group as a graph, where nodes correspond to responses and edges capture their similarity in the RM hidden space. Advantages are then computed via graph propagation, enabling each sample to incorporate contextual information from its neighbors. GraphAE is lightweight and can be seamlessly integrated into existing group-based RL algorithms. We apply GraphAE to GRPO, GSPO and RLOO, and conduct extensive experiments on different models and benchmarks. Empirical results show consistent improvements across three benchmarks, with gains of up to +6.3 on Arena-Hard-v0.1, +8.27 on AlpacaEval 2.0, and +0.22 on MT-Bench. These results demonstrate that leveraging RM representations leads to more sample-efficient and robust RLHF.
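One way to picture graph propagation over a sampled group is a single smoothing step on the response graph. This is an assumption-laden sketch, not GraphAE itself: the base advantages use the usual group mean and standard deviation normalization, edges are clipped cosine similarities between hypothetical hidden-state vectors, and the mixing coefficient `alpha` is invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (reward - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def propagate(advantages, hidden, alpha=0.5):
    """Mix each advantage with the similarity-weighted mean of its neighbors."""
    n = len(advantages)
    out = []
    for i in range(n):
        w = [max(0.0, cosine(hidden[i], hidden[j])) if j != i else 0.0 for j in range(n)]
        total = sum(w)
        if total > 0:
            neighbor = sum(wj * aj for wj, aj in zip(w, advantages)) / total
        else:
            neighbor = advantages[i]  # isolated node keeps its own advantage
        out.append((1 - alpha) * advantages[i] + alpha * neighbor)
    return out


rewards = [1.0, 0.0, 1.0, 0.0]
hidden = [[1, 0], [1, 0], [0, 1], [0, 1]]  # responses 0,1 and 2,3 look alike to the RM
smoothed = propagate(group_advantages(rewards), hidden)
```

In this toy group, near-identical responses received opposite scalar rewards, so propagation pulls their advantages together. This is the kind of reward-noise damping the abstract attributes to using RM hidden states.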

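2606.10607 2026-06-10 cs.LG cs.AI cs.CL Cross-listed

A Unified View of Supervised Fine-Tuning Through Target Distribution Design

Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen, Cho-Jui Hsieh

Affiliations: University of California, Los Angeles (UCLA); Arena

AI summary: Reinterprets supervised fine-tuning as target distribution design. The Q-target framework decomposes supervision into how strongly to rely on the observed token and how to allocate probability over alternative tokens; the resulting Target-SFT method outperforms existing methods across ten reasoning dataset-model settings.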

Abstract

Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model encodes a rich knowledge prior. In this work, we reinterpret SFT as target distribution design: instead of studying only the loss objective, we analyze the token-level target that the loss drives the model to match. We introduce the Q-target framework, which decomposes SFT supervision into two explicit choices: (1) how strongly to rely on the observed token, and (2) how to allocate the remaining probability mass over alternatives. This perspective unifies many existing SFT variants as implicit choices of the target distribution Q. Building on this view, we propose Target-SFT, which constructs the training objective directly from the desired target distribution. This method consistently outperforms existing methods across the ten reasoning dataset-model settings evaluated, showing the effectiveness of this target-based approach. Overall, our formulation reveals a more fundamental design principle for SFT training and opens a broader search space for SFT objectives.
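The two choices in the Q-target decomposition can be made concrete with a tiny example. This is one possible instantiation picked for illustration, not the paper's Target-SFT objective: `reliance` is a hypothetical name for the mass placed on the observed token, and spreading the remainder in proportion to the model's own prior over the other tokens is an assumed allocation rule.

```python
import math


def q_target(model_probs, observed, reliance):
    """Build a target distribution Q over a vocabulary of at least two tokens.

    Choice (1): `reliance` is the probability mass on the observed token.
    Choice (2): the remaining mass follows the model prior over alternatives.
    """
    rest = 1.0 - model_probs[observed]
    n = len(model_probs)
    q = []
    for tok, p in enumerate(model_probs):
        if tok == observed:
            q.append(reliance)
        elif rest > 0:
            q.append((1.0 - reliance) * p / rest)
        else:
            q.append((1.0 - reliance) / (n - 1))  # prior has no mass elsewhere
    return q


def cross_entropy(q, p, eps=1e-12):
    """Token-level loss -sum_i q_i * log(p_i), skipping zero-mass targets."""
    return -sum(qi * math.log(pi + eps) for qi, pi in zip(q, p) if qi > 0)


probs = [0.5, 0.3, 0.2]                            # model prior over a 3-token vocabulary
soft = q_target(probs, observed=0, reliance=0.8)
one_hot = q_target(probs, observed=0, reliance=1.0)
```

Setting `reliance=1.0` recovers the one-hot target of standard SFT, which is how this view treats ordinary likelihood training as one point in a larger design space.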

2509.25760 2026-06-10 cs.CL cs.AI cs.LG Updated version

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

Zhepei Wei, Xiao Yang, Kai Sun, Jiaqi Wang, Rulin Shao, Jingxiang Chen, Mohammad Kachuee, Teja Gollapudi, Yiwei Liao, Nicolas Scheffer, Rakesh Wanga, Anuj Kumar, Yu Meng, Wen-tau Yih, Xin Luna Dong

Affiliations: University of California, Berkeley

AI summary: Proposes TruthRL, which uses GRPO with a ternary reward to directly optimize LLM truthfulness, reducing hallucinations and allowing abstention when uncertain, with significant truthfulness gains on knowledge-intensive benchmarks.

Comments: ICML 2026. Code: https://github.com/facebookresearch/TruthRL

Abstract

While large language models (LLMs) have demonstrated strong performance on factoid question answering, they are still prone to hallucination and untruthful responses, particularly when tasks demand information outside their parametric knowledge. Indeed, truthfulness requires more than accuracy -- models must also recognize uncertainty and abstain when unsure to avoid hallucinations. This presents a fundamental challenge for existing methods: approaches that optimize for accuracy often amplify hallucinations, while those that encourage abstention can become overly conservative, sacrificing correct answers. Both extremes ultimately compromise truthfulness. In this work, we present TruthRL, a general reinforcement learning (RL) framework that directly optimizes the truthfulness of LLMs. Specifically, we implement TruthRL using GRPO with a simple yet effective ternary reward that distinguishes correct answers, hallucinations, and abstentions. It incentivizes models to reduce hallucinations not only by providing correct responses, but also by enabling abstention when uncertain, thereby improving truthfulness. Extensive experiments across four knowledge-intensive benchmarks show that TruthRL significantly reduces hallucinations (e.g., 43.5% $\rightarrow$ 19.4%) and improves truthfulness (e.g., 5.3% $\rightarrow$ 37.2%), with consistent gains across various backbone models. Analysis shows that the improvement of TruthRL arises from enhanced capability of LLMs to recognize their knowledge boundary, hence avoiding being overly conservative as the baselines are.
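A ternary reward paired with group-relative advantages might look like the sketch below. The numeric values (+1 for correct, 0 for abstention, -1 for hallucination), the exact-match check, and the abstention string are assumptions for illustration; the abstract states only that the reward distinguishes the three outcomes.

```python
import math

ABSTAIN = "I don't know"  # hypothetical abstention string


def ternary_reward(prediction, reference):
    """Assumed values: +1 correct, 0 abstention, -1 hallucination."""
    if prediction == ABSTAIN:
        return 0.0
    return 1.0 if prediction == reference else -1.0


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one sampled group of responses."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


group = ["Paris", ABSTAIN, "Lyon"]  # three sampled answers to one question
rewards = [ternary_reward(p, "Paris") for p in group]
advantages = group_relative_advantages(rewards)
```

With a binary reward, the abstention and the wrong answer would score the same. The middle value is what lets the policy prefer abstaining over hallucinating without rewarding abstention as much as a correct answer.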

2511.02603 2026-06-10 cs.CL Updated version

CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency

Ehsan Aghazadeh, Ahmad Ghasemi, Hedyeh Beyhaghi, Hossein Pishro-Nik

Affiliations: University of Massachusetts Amherst

AI summary: Proposes CGES, a Bayesian framework that adaptively stops sampling to reduce the number of model calls in self-consistency, cutting calls by 58% on average across 5 reasoning benchmarks with an accuracy loss of only 0.4 percentage points.

Comments: Extended version. A preliminary version was accepted at the Efficient Reasoning Workshop @ NeurIPS 2025. Code: https://github.com/EhsanAghazadeh/cges

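2512.02240 2026-06-10 cs.CL Updated version

Lightweight Latent Reasoning for Narrative Tasks

Alexander Gurung, Esmeralda S. Whitammer, Mirella Lapata

Affiliations: School of Informatics, University of Edinburgh; CIFAR Fellow

AI summary: Proposes LiteReason, which uses a lightweight reasoning projector to generate continuous latent tokens and switches dynamically between latent and discrete reasoning during RL, reducing reasoning length by 77-92% while staying close to non-latent RL performance.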

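Abstract

Large language models tackle complex tasks by generating long chains of thought, or "reasoning traces", which act as latent variables in generating an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve their utility in predicting an answer. This optimization incurs a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To this end, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight reasoning projector module, trained to produce continuous latent tokens that help the model "skip" reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation trade-off curve.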

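Emotional Signatures in LLM-Based Literary Translation: Systematic Shifts in Machine Translation and Post-Editing

Antonio Castaldo, Johanna Monti, Sheila Castilho

AI summary: Studies the emotional signature of LLM translation and how post-editing brings it closer to human translation. By comparing LLM translations, post-edited versions, and human translations of Oryx and Crake, it finds that MT systems introduce system-specific emotional fingerprints that weaken the authorial voice.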

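2606.11009 2026-06-10 cs.CL cs.CY New submission

Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions

Parisa Suchdev, Juniper Lovato

Affiliations: University of California, Berkeley; Stanford University

AI summary: Audits how three large language models culturally adapt 60 English math word problems into 7 languages. Models agree on transformation type in 62.5% of cases but on specific substitutions in only 33.5%; all language-model combinations show entropy collapse; and models change surface markers while preserving deeper structure, compressing cultural diversity and misattributing regional context.

Comments: 17 pages total with references and appendix, 9 figures, under review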

Abstract

Large language models are increasingly used to adapt math word problems for personalized learning at scale, but it remains an open question whether those adaptations are consistent across models, preserve cultural diversity at scale, and reveal which cultural entities models treat as most salient. We analyze how Claude Opus 4, GPT-4.1, and Gemini 2.5 Pro adapt 60 English math word problems into Bengali, Hindi, Punjabi (India), Urdu, Sindhi (Pakistan), Italian, and Sicilian (Italy), a language set spanning the full resource spectrum, from high-resource Italian and Hindi to under-studied Sindhi, Sicilian, and Punjabi. We annotate 6,489 entity transformations, coding whether models preserve, localize, generalize, omit, or change entities such as names, foods, and places. Models agree on transformation type in 62.5% of cases and on specific substitutions in only 33.5%, meaning model choice directly shapes which cultural world students encounter. All 21 language-model combinations show entropy collapse, with adaptation compressing rather than expanding cultural diversity. Models prioritize surface markers such as names, foods, and currencies while preserving deeper structural features such as grade-level systems that embed culturally specific assumptions. Despite prompts specifying target countries, models misattribute regional context by using Bangladeshi taka for Indian Bengali students and produce cross-cultural contamination, such as adapting egg hunts as Eid activities. Some failures are visible in individual translations. Others, including diversity collapse, systematic preference for surface markers, and consistent regional misattribution, emerge only through corpus-level analysis. The surface plausibility that makes adapted problems look correct is precisely what makes deeper failures easy to overlook.

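2606.07422 2026-06-10 cs.CL cs.AI Updated version

Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

Xiangjun Zai, Xingyu Tan, Chen Chen, Xiaoyang Wang, Wenjie Zhang

Affiliations: University of New South Wales; CSIRO; University of Wollongong

AI summary: Proposes DocTrace, a multi-agent RAG framework that uses query-triggered knowledge organization, document-structure awareness, and experience-guided reasoning to address costly knowledge organization, underused document structure, and non-reusable reasoning experience in long-document QA, achieving the best performance on three of four datasets.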

Abstract

Long-document question answering (QA) requires large language models (LLMs) to reason over evidence scattered across lengthy documents, where answers often depend on event order, section-level context, and cross-part evidence connections. Although retrieval-augmented generation (RAG) reduces the input context by retrieving relevant evidence, existing structured RAG methods still face three limitations: costly query-agnostic knowledge organization, insufficient use of original document structure, and no reuse of historical reasoning experience. To address these limitations, we propose DocTrace, a multi-agent RAG framework for long-document QA that supports query-triggered knowledge organization and document-structure-aware, experience-guided reasoning. DocTrace preserves document hierarchy with a lightweight document structural tree index, constructs agent-shared hypergraph-structured working memory on demand during reasoning, and stores successful reasoning plans in graph-structured experience memory for future reuse, enabling adaptive exploration across related long-document questions. Experiments on four long-document QA datasets show that DocTrace achieves the best performance on three datasets, surpassing the strongest baseline, ComoRAG, by up to 8.85% in F1 and 4.40% in EM, while reducing the overall computational cost by 53.32%.

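2606.10381 2026-06-10 hep-ex cs.AI cs.CL cs.IR physics.ins-det cross-list

Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

Ruobing Jiang, Dawei Fu, Cheng Jiang, Tianyi Yang, Zijian Wang, Youpeng Wu, Yong Ban, Yajun Mao, Qiang Li

Affiliations: Peking University

AI summary: Proposes an agentic hybrid RAG framework that combines sparse and dense retrieval with agentic reasoning for evidence retrieval and answer generation in muon collider research, builds the first benchmark for the domain, and validates the framework's effectiveness.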

Comments 22 pages, 5 figures, and 6 tables

Abstract

Muon collider research spans accelerator physics, detector instrumentation, and high-energy phenomenology, with relevant evidence scattered across a rapidly expanding and heterogeneous body of scientific literature. As high-energy physics (HEP) increasingly explores agent-assisted analysis workflows, efficiently locating, integrating, and verifying scientific evidence becomes an essential capability. While retrieval-augmented generation (RAG) offers a promising framework for scientific question answering, integrating agentic reasoning without compromising retrieval precision remains a key challenge. In this work, we present agentic hybrid RAG, an evidence-grounded RAG framework for muon collider research. The framework combines a hybrid retriever, integrating sparse lexical and dense semantic retrieval, with an agentic reasoning module for query decomposition, evidence expansion, and grounded answer generation. To enable systematic evaluation, we construct the first benchmark for retrieval-augmented scientific question answering in the muon collider domain, comprising a curated literature corpus together with dedicated retrieval and answer-generation benchmarks covering major detector and physics research topics. Extensive evaluation shows that hybrid retrieval provides the strongest retrieval backbone, while agentic reasoning is most effective for controlled evidence expansion and answer synthesis. Built on this principle, agentic hybrid RAG consistently outperforms representative retrieval and RAG baselines in retrieval effectiveness, answer quality, evidence coverage, and factual grounding. Together, the benchmark and framework provide a foundation for evidence-grounded scientific question answering and future HEP analysis agents operating over large-scale scientific literature.
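The abstract describes a hybrid retriever that integrates sparse lexical and dense semantic retrieval but does not say how the two result lists are merged. As a minimal sketch, assuming reciprocal rank fusion as the merging rule (a common choice, not confirmed by the source), the combination could look like the following. The function name, the constant `k=60`, and the document ids are all illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default, not a value
    taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical retriever outputs for one query (doc ids are made up).
sparse_hits = ["d3", "d1", "d7"]  # e.g. from a lexical retriever
dense_hits = ["d1", "d5", "d3"]   # e.g. from an embedding-based retriever

fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever are kept but ranked lower, which is one simple way to preserve retrieval precision while widening coverage.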

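2606.11023 2026-06-10 cs.IR cs.CL cs.LG cross-list

Generative Archetype-Grounded Item Representations for Sequential Recommendation

Yifan Li, Jiahong Liu, Xinni Zhang, Hao Chen, Yankai Chen, Wenhao Yu, Jianting Chen, Irwin King

Affiliations: The Chinese University of Hong Kong; McGill University; Tongji University

AI summary: Proposes GenAIR, a framework that uses a large language model to generate archetype descriptions for items and extract their embeddings, and adds a behavioral calibration objective to bridge the gap between semantics and behavior, significantly improving sequential recommendation performance.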

Comments Accepted by WWW 2026 (Oral)

DOI: 10.1145/3774904.3792587

Abstract

Sequential recommendation aims to predict users' next interaction with items by analyzing their historical behavior. However, the limited quality of item representations remains a critical bottleneck. While pre-trained large language models (LLMs) can provide rich semantic representations, existing approaches rely only on static encoding of fixed attributes, overlooking the crucial role of target audiences in defining item identity. Moreover, the semantic space struggles to reflect actual user behavior, resulting in a significant gap between semantic representations and behavioral patterns. To address these limitations, we propose GenAIR, a general framework that empowers sequential recommendation with Generative Archetype-grounded Item Representations. Specifically, we first leverage an LLM to analyze item metadata and infer a textual description of the archetype, which represents the conceptual profile of the item's ideal target audience. We then extract the corresponding embeddings in a single forward pass. Further, to ground these generative archetypes in real-world behavior, we introduce a behavioral calibration objective, which explicitly incorporates behavioral signals from actual interactions. This objective adjusts the structure of the embedding space to reflect empirical patterns. GenAIR enables seamless integration with most existing models while maintaining high efficiency. Comprehensive experiments conducted on three real-world datasets demonstrate that GenAIR significantly improves the performance of various sequential recommendation models and consistently outperforms state-of-the-art baseline approaches. The implementation code is available at https://github.com/AI-Santiago/GenAIR.

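2406.14075 2026-06-10 cs.CL updated version

EXCEEDS: Extracting Complex Events via Nugget-based Grid Modeling in Scientific Domain

Yi-Fan Lu, Xian-Ling Mao, Bo Wang, Xiao Liu, Heyan Huang

Affiliations: Beijing Institute of Technology; Microsoft Research Asia

AI summary: Targeting the dense events and complex information forms of the scientific domain, builds SciEvents, a large-scale multi-event document-level dataset, and proposes EXCEEDS, an end-to-end framework that encodes dense nuggets into a grid matrix and reduces complex event extraction to a nugget-based grid modeling task, achieving state-of-the-art performance.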

Comments Accepted by ACL 2026 Main Conference, Oral

Abstract

Events are crucial to understanding a specific domain. Extensive event extraction research has been conducted in many domains such as news, finance, and biology. However, event extraction in the scientific domain is still insufficiently supported by comprehensive datasets and tailored methods. Compared with other domains, the scientific domain has two characteristics: (1) denser nuggets and events, and (2) more complex information forms. To address this gap with these two characteristics in mind, we first construct SciEvents, a large-scale multi-event document-level dataset with a schema tailored for the scientific domain. It consists of 2,508 documents and 24,381 events produced through multi-stage manual annotation and quality control. Then, we propose EXCEEDS, an end-to-end scientific event extraction framework that encodes dense nuggets into a grid matrix and simplifies complex event extraction into a nugget-based grid modeling task. Experiments on SciEvents show that EXCEEDS achieves state-of-the-art performance. Both the SciEvents dataset and the EXCEEDS framework are released publicly to facilitate future research.

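2510.08622 2026-06-10 cs.CL cs.SE updated version

Automated Alignment between Elicitation Interviews and Requirements

Francesco Dente, Fabiano Dalpiaz, Paolo Papotti

Affiliations: University of Bologna

AI summary: Introduces the task of automatically aligning interview transcripts with requirements written as user stories, defines two metrics (faithfulness and coverage), and automates their evaluation with large language models and embedding models, reaching 0.86 macro-F1 across four datasets.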

Comments 8 pages

Abstract

Software requirements are derived from a variety of elicitation techniques, many of which have a conversational nature, like interviews. However, evaluating whether those derived requirements faithfully reflect the stakeholders' needs remains a challenging manual task. In this paper, we formalize the task of aligning the transcript of an interview with a collection of requirements represented as user stories. We propose two heuristic metrics for alignment, called (i) requirements faithfulness: the proportion of stories supported by the transcript, and (ii) interview coverage: the proportion of the transcript supported by at least one story. Then, we run experiments with large language models and embedding models to assess how well these metrics can be evaluated automatically. Experiments over four datasets show that an LLM-based solution achieves 0.86 macro-F1 on manually labeled chunk-story pairs. We also show how embedding models can be used as blockers to make the approach more scalable. This work paves the way for more research on linking conversational artifacts with requirements. The formal framework and the automated matching techniques are basic components that can be used for emerging tasks such as tracing requirements to interviews and generating requirements from conversations.
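The two metrics defined above can be illustrated with a small sketch. It assumes the transcript is split into chunks and that a boolean chunk-story support matrix is already available (for example from manual labels or an LLM judge). The matrix layout and the function name are illustrative choices, not the paper's implementation.

```python
def alignment_metrics(support):
    """Compute (requirements faithfulness, interview coverage).

    support[i][j] is True when transcript chunk i supports user story j.
    Faithfulness: share of stories supported by at least one chunk.
    Coverage: share of chunks linked to at least one story.
    """
    n_chunks = len(support)
    n_stories = len(support[0]) if n_chunks else 0
    supported_stories = sum(
        any(support[i][j] for i in range(n_chunks)) for j in range(n_stories)
    )
    covered_chunks = sum(any(row) for row in support)
    faithfulness = supported_stories / n_stories if n_stories else 0.0
    coverage = covered_chunks / n_chunks if n_chunks else 0.0
    return faithfulness, coverage


# Toy labels (made up): 4 transcript chunks x 3 user stories.
support = [
    [True, False, False],
    [False, False, False],
    [True, True, False],
    [False, False, False],
]
faithfulness, coverage = alignment_metrics(support)
print(faithfulness, coverage)
```

In this toy example two of the three stories have transcript support and two of the four chunks are linked to a story, so a low faithfulness would point to unsupported requirements and a low coverage to interview content that no requirement captures.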

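2604.01993 2026-06-10 cs.CL cs.AI updated version

SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning

Daeyong Kwon, Soyoung Yoon, Seung-won Hwang

Affiliations: Seoul National University

AI summary: Proposes SAFE, a framework that decomposes reasoning into knowledge graph triples and verifies intermediate steps during generation, addressing multi-hop QA cases where models reach correct answers through invalid reasoning. Accuracy improves by 8.8 percentage points on average.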

Abstract

Multi-hop QA benchmarks often reward Large Language Models (LLMs) for spurious correctness, where models reach correct answers through invalid intermediate reasoning. We propose SAFE, an LLM-as-verifier framework for evidence-grounded multi-hop QA. Rather than judging only the final answer after generation, SAFE verifies reasoning during generation by checking intermediate steps against the provided passages and previous reasoning trajectory. To make this process checkable, SAFE decomposes reasoning into atomic, evidence-grounded units represented with Knowledge Graph (KG) triples. At train-time, SAFE verifies benchmark supervision under KG-grounded constraints and constructs reliable verifier training data. At inference-time, an external verifier checks each generated step, identifies invalid reasoning, and provides correction feedback before errors propagate. Across three multi-hop QA benchmarks, SAFE improves accuracy by 8.8 pp on average. These results show that evidence-grounded multi-hop QA benefits from shifting LLM-based evaluation from post-hoc answer judgment to stepwise reasoning verification.
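The stepwise checking described above can be pictured with a toy sketch. SAFE uses an LLM as the external verifier. The version below swaps in a plain set lookup against evidence triples so that the control flow (verify each step, stop at the first invalid one, return it as feedback) is visible. All names and the toy triples are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


def verify_steps(steps, evidence):
    """Check reasoning steps in order against a set of evidence triples.

    Returns (accepted_steps, first_invalid_step). The second element is
    None when every step is supported by the evidence.
    """
    accepted = []
    for step in steps:
        if step not in evidence:
            # Stop here so the error cannot propagate to later steps.
            return accepted, step
        accepted.append(step)
    return accepted, None


# Toy evidence extracted from passages (entities are placeholders).
evidence = {
    Triple("Film X", "directed_by", "Director Y"),
    Triple("Director Y", "born_in", "City Z"),
}
steps = [
    Triple("Film X", "directed_by", "Director Y"),
    Triple("Director Y", "born_in", "City W"),  # not supported by evidence
]
accepted, invalid = verify_steps(steps, evidence)
print(accepted, invalid)
```

Because the invalid second step is flagged before an answer is produced, a generator could be asked to revise that step instead of being judged only on its final answer.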

2604.15771 2026-06-10 cs.CL updated version

Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

Kai Wei, Raymond Li, Xi Zhu, Zhaoqian Xue, Jiaojiao Han, Jingcheng Niu, Fan Yang

Affiliations: University of Michigan; University of British Columbia; Rutgers University; University of Pennsylvania; New Jersey Institute of Technology; TU Darmstadt; Wake Forest University

AI summary: Proposes Skill-RAG, which pairs a lightweight hidden-state prober with a prompt-based skill router. When retrieval fails, it diagnoses the cause and selects one of four skills (query rewriting, question decomposition, evidence focusing, exit) to correct query-evidence misalignment, substantially improving accuracy on hard cases that persist after multi-turn retrieval.

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a foundational paradigm for grounding large language models in external knowledge. While adaptive retrieval mechanisms have improved retrieval efficiency, existing approaches treat post-retrieval failure as a signal to retry rather than to diagnose -- leaving the structural causes of query-evidence misalignment unaddressed. We observe that a significant portion of persistent retrieval failures stem not from the absence of relevant evidence but from an alignment gap between the query and the evidence space. We propose Skill-RAG, a failure-aware RAG framework that couples a lightweight hidden-state prober with a prompt-based skill router. The prober gates retrieval at two pipeline stages; upon detecting a failure state, the skill router diagnoses the underlying cause and selects among four retrieval skills -- query rewriting, question decomposition, evidence focusing, and an exit skill for truly irreducible cases -- to correct misalignment before the next generation attempt. Experiments across multiple open-domain QA and complex reasoning benchmarks show that Skill-RAG substantially improves accuracy on hard cases persisting after multi-turn retrieval, with particularly strong gains on out-of-distribution datasets. Representation-space analyses further reveal that the proposed skills occupy structured, separable regions of the failure state space, supporting the view that query-evidence misalignment is a typed rather than monolithic phenomenon.

2605.18271 2026-06-10 cs.CL cs.AI cs.IR cs.LG updated version

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

Changmin Lee, Jaemin Kim, Taesik Gong

Affiliations: Department of Computer Science and Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, Republic of Korea

AI summary: Proposes EPIC, which treats user preferences as a compact and stable form of personal context and integrates them throughout the RAG pipeline, improving how well retrieval aligns with user preferences under limited memory while reducing memory use and improving accuracy.

Comments Accepted to ICML 2026. Code and data are available at https://github.com/UbiquitousAILab/EPIC

Abstract

With the rapid emergence of personal AI agents based on Large Language Models (LLMs), implementing them on-device has become essential for privacy and responsiveness. To handle the inherently personal and context-dependent nature of real-world requests, such agents must ground their generation in device-resident personal context. However, under tight memory budgets, the core bottleneck is what to store so that retrieval remains aligned with the user. We propose EPIC (Efficient Preference-aligned Index Construction), which focuses on user preferences as a compact and stable form of personal context and integrates them throughout the RAG pipeline. EPIC selectively retains preference-relevant information from raw data and aligns retrieval toward preference-aligned contexts. Across four benchmarks covering conversations, debates, explanations, and recommendations, EPIC reduces indexing memory by 2,404 times, improves preference-following accuracy by 18.79 %p, and achieves 32.17 times lower retrieval latency over the best-performing baseline. In on-device experiments, EPIC maintains under 1 MB memory and achieves 5.21 to 29.35 ms/query latency across three platforms, while supporting streaming updates under preference drift. Our code and data are available at https://github.com/UbiquitousAILab/EPIC.

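2605.28093 2026-06-10 cs.CL updated version

ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering

Yikai Zhu, Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du

Affiliations: School of Computer Science, Wuhan University

AI summary: Proposes ConRAG, a framework that optimizes both the query and corpus sides through consensus-driven multi-view retrieval (relation, entity, and text signals), substantially improving multi-hop QA performance and setting a new state of the art on MuSiQue.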

Abstract

Retrieval-augmented generation (RAG) has emerged as a promising paradigm for enhancing large language models (LLMs) on multi-hop question answering (QA), which requires reasoning over evidence from multiple documents. Current multi-hop RAG methods generally focus on either query-side task decomposition or corpus-side knowledge graph construction. Despite their progress, these methods still struggle to achieve satisfactory performance on complex multi-hop QA tasks. To this end, we propose ConRAG, a consensus-driven multi-view RAG framework that effectively boosts LLMs on complex multi-hop QA. The core of ConRAG is to systematically optimize both the query and corpus sides and to leverage multi-view evidence (relation, entity, and text signals) for more accurate retrieval. Extensive experiments on three multi-hop QA benchmarks show that ConRAG consistently outperforms all baselines by a clear margin, e.g., up to +26.9% average performance gains over vanilla RAG, and enables Gemma-4-31B to achieve a new state-of-the-art record on the challenging MuSiQue benchmark.

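2605.03344 2026-06-10 cs.IR cs.AI cs.CL updated version

WebChallenger: A Reliable and Efficient Generalist Web Agent

Jayoo Hwang, Xiaowen Zhang, Vedant Padwal

Affiliations: ML Collective; longsurf.ai; Independent

AI summary: Proposes WebChallenger, a framework that reproduces human cognitive advantages through the PageMem structured page representation, divide-and-conquer observation, lightweight exploration and memory, and compound action workflows, bringing open-weight models close to frontier proprietary systems on several web navigation benchmarks.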

Abstract

Autonomous web navigation remains challenging for LLM agents, and the strongest generalist systems rely on proprietary reasoning models whose inference cost is prohibitive for the repetitive tasks where such agents would be most useful. We argue this gap stems not from insufficient model capability but from agent architectures that fail to replicate three human cognitive advantages: selective attention to relevant page regions, persistent memory of website structure, and procedural fluency with common interaction patterns. We introduce WebChallenger, a web agent framework that addresses each gap through architecture design rather than model scale, built around PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. On this shared substrate we build three mechanisms that mirror the three cognitive advantages: a divide-and-conquer observation pipeline that lets the agent skim section summaries and extract details only from task-relevant regions; a lightweight exploration and memory system that traverses each website once to build a reusable map of pages and element behaviors; and compound action workflows that collapse common multi-step interactions into single agent actions, handling partial state changes automatically. Because all three operate over PageMem, the framework generalizes across websites without site-specific adapters. Using off-the-shelf open-weight models without fine-tuning, our system achieves 56.3% on WebArena, 48.7% on VisualWebArena, 51.0% on Online-Mind2Web, and 70.9% on WorkArena, approaching frontier proprietary systems at a fraction of the cost. Our code is released at https://github.com/jayoohwang1/webchallenger

2606.10694 2026-06-10 cs.CL new submission

REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

Keer Lu, Liwei Chen, Guoqing Jiang, Zhiheng Qin, Yunhuai Liu, Wentao Zhang

Affiliations: School of Computer Science, Peking University; Kuaishou Technology; Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University

AI summary: Proposes REAL, a framework that builds a temporal, confidence-aware directed property graph and uses non-destructive updates and hybrid beam search retrieval to address missing relations, overwritten facts, and passive, query-agnostic retrieval in LLM long-term memory, with an average performance gain of 22.72%.

Abstract

Large Language Models (LLMs) are increasingly expected to interact with users over long time horizons. However, due to their finite context window, LLMs cannot retain all past interactions, making long-term memory management essential for storing, updating, and retrieving historical information beyond the context limit. Although recent memory systems attempt to address this issue by storing historical information externally, existing approaches suffer from three key limitations: flat text-based memory organizations fail to capture explicit relations among memories, structured memory systems often destructively overwrite evolving facts, and current retrieval mechanisms remain query-agnostic and passive when evidence is incomplete. REAL constructs long-term conversational memory as a temporal and confidence-aware directed property graph, where each atomic fact is represented with entities, relations, valid-time intervals, confidence scores, and exploration intent labels. During memory construction, REAL adopts a non-destructive temporal update strategy that preserves parallel fact versions and their validity intervals, enabling faithful tracking of fact evolution. During retrieval, REAL anchors query-relevant root entities, decouples their exploration intents, and performs semantic evaluator-guided hybrid beam search to extract compact memory subgraphs. It further incorporates counterfactual inference to repair unreliable retrieval states and recover missing memory evidence through implicit logical relations. Comprehensive experiments demonstrate that REAL substantially improves long-term memory performance over flat-text, graph-based, and existing memory baselines, achieving an average improvement of 22.72%.
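The non-destructive temporal update mentioned above can be sketched as follows. This is a simplified illustration, not REAL's implementation: it assumes each (subject, relation) pair holds one value at a time, uses integer timestamps, and omits the graph structure, intent labels, and retrieval. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: int                  # e.g. a session index or timestamp
    valid_to: Optional[int] = None   # None means "still valid"
    confidence: float = 1.0


class TemporalMemory:
    def __init__(self):
        self.facts: List[Fact] = []

    def update(self, subject, relation, obj, time, confidence=1.0):
        # Close the interval of any open version instead of deleting it.
        for fact in self.facts:
            same_key = (fact.subject, fact.relation) == (subject, relation)
            if same_key and fact.valid_to is None:
                fact.valid_to = time
        self.facts.append(Fact(subject, relation, obj, time, None, confidence))

    def lookup(self, subject, relation, time):
        # Return the value whose validity interval [valid_from, valid_to)
        # contains the requested time, or None if there is no such version.
        for fact in self.facts:
            if (fact.subject, fact.relation) != (subject, relation):
                continue
            still_open = fact.valid_to is None or time < fact.valid_to
            if fact.valid_from <= time and still_open:
                return fact.obj
        return None


memory = TemporalMemory()
memory.update("user", "lives_in", "Berlin", time=1)
memory.update("user", "lives_in", "Tokyo", time=5)
print(memory.lookup("user", "lives_in", 3), memory.lookup("user", "lives_in", 7))
```

Both versions of the fact remain in memory after the second update, so a question about an earlier point in the conversation can still be answered, which a destructive overwrite would prevent.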

2606.10736 2026-06-10 cs.CL cs.AI cs.CY new submission

Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs

Youssef Medhat, Junsoo Park, Ploy Thajchayapong, Ashok K. Goel

Affiliations: Georgia Institute of Technology

AI summary: Proposes a pipeline that maps questions students ask a conversational AI teaching assistant to curriculum topics with a few-shot text classifier, using a GPT-4-extracted prerequisite knowledge graph, to detect topic-level knowledge gaps.

Comments Accepted as a short paper at the 10th CSEDM Workshop, co-located with the 18th International Conference on Educational Data Mining (EDM 2026). 7 pages, 2 figures, 2 tables

Abstract

Large online courses generate thousands of student questions directed at conversational AI teaching assistants, yet these interaction logs remain largely untapped as diagnostic signals. We present a pipeline that maps student questions from a conversational AI teaching assistant to curriculum topics using a few-shot text classifier, grounded in a GPT-4-extracted prerequisite knowledge graph of course concepts. Evaluated on 1,340 question events from 164 students in a graduate-level AI course, our classifier achieves 80.0% accuracy across 43 labels (42 curriculum topics plus an "unknown" abstention class). Topic-level question volume correlates significantly with student self-reported difficulty from an independent mid-semester survey (rho = 0.491, p = 0.008, n = 28 topics), providing convergent evidence that the classified question stream reflects genuine topic difficulty. These results demonstrate that conversational AI interaction logs, mapped onto curriculum structure, carry actionable signals about topic-level knowledge gaps and provide instructors with a curriculum-grounded view of which topics warrant attention.

2606.10875 2026-06-10 cs.CL new submission

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao

Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary: Studies how acquiring, activating, and internalizing experiential knowledge improves multi-step LLM tool calling, and proposes KATE, a knowledge-augmented tool execution framework that combines width-expanded inference with knowledge-aware training and clearly outperforms baselines on BFCL-V3 and AppWorld.

Abstract

Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use performance, covering the stages of knowledge acquisition, activation, and internalization. In the knowledge acquisition stage, we acquire and evaluate various forms of experiential knowledge, and our analysis shows that simple instance-level knowledge can already provide strong and reliable gains, while abstract intent-level knowledge offers limited benefits. At inference time, to activate knowledge, we find that prompting the LLM to expand the depth of reasoning yields diminishing returns, whereas expanding the width of reasoning through parallel sampling with aggregation more effectively activates latent experiential knowledge. At training time, for knowledge internalization, post-training with knowledge-augmented data further improves performance, with reinforcement learning outperforming supervised fine-tuning. Based on these insights, we propose Knowledge-Augmented Tool Execution (KATE), a framework that integrates experiential knowledge with reasoning-width-expanded inference and knowledge-aware training. Experiments on BFCL-V3 and AppWorld demonstrate consistent and substantial improvements over strong baselines across model scales. Our code is available at https://github.com/hypasd-art/KATE.
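Expanding reasoning width by parallel sampling with aggregation can be illustrated with a small sketch. The abstract does not specify the aggregation rule, so the example assumes a simple majority vote over sampled tool calls, and the sampled strings stand in for model outputs. The function name and the tool names are hypothetical.

```python
from collections import Counter


def aggregate_tool_calls(candidates):
    """Pick the most frequent candidate; ties go to the earliest sample."""
    if not candidates:
        raise ValueError("need at least one sampled candidate")
    counts = Counter(candidates)
    best = max(counts.values())
    for candidate in candidates:
        if counts[candidate] == best:
            return candidate


# Stand-ins for tool calls sampled in parallel from the same prompt.
samples = [
    'get_weather(city="Paris")',
    'get_weather(city="Paris", unit="C")',
    'get_weather(city="Paris")',
    'search(query="Paris weather")',
]
chosen = aggregate_tool_calls(samples)
print(chosen)
```

Sampling several candidates and keeping the consensus call spends extra compute on breadth instead of longer reasoning chains, which matches the direction the abstract reports as more effective.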

2606.10475 2026-06-10 cs.MA cs.AI cs.CL cross-list

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

Jakub Masłowski, Jarosław A. Chudziak

Affiliations: Institute of Computer Science, Warsaw University of Technology

AI summary: Proposes Knowledge-Grounded Counterfactual Reasoning (KG-CFR), a dual-stage architecture that separates a private planning buffer from a public execution layer. In a dynamic resource allocation environment it raises overall argument quality from 0.694 to 0.822 and reduces semantic looping.

Comments Accepted for publication in the Proceedings of the 30th International Conference on Knowledge-Based and Intelligent Information & Engineering Systems (KES 2026)

Abstract

Multi-agent debate frameworks have been shown to improve large language model performance in convergent tasks, but they are currently optimized in a way that heavily favors final output accuracy over process stability. During long-horizon exchanges, reactive systems under sustained perturbations often experience logic degradation, argument repetition, and role drift. To structurally prevent identity loss and maintain process fidelity, we introduce Knowledge-Grounded Counterfactual Reasoning (KG-CFR), a dual-stage architecture that enforces a strict separation of concerns between a private, retrieval-augmented planning buffer and a public execution layer. We assess this system in Dynamic Resource Allocation under Uncertainty (DRAU), a dedicated 1v1v1 environment that introduces diversity distinct from standard debate settings. Over 270 fully factorial crisis simulation trajectories with stochastic environmental shocks, KG-CFR prevents judge-detected critical post-shock degradation (defined as a quality shift $\Delta \le -0.20$) in more than 95% of perturbed runs and increases overall argument quality from 0.694 to 0.822. Our primary contribution is to demonstrate that architectural decoupling is an important factor in enhancing systemic resilience under sustained pressure without quality loss. Furthermore, we introduce custom vector metrics for discourse divergence and plan-execution alignment that provide strong, directionally consistent evidence of operational stability. Our ablation experiments suggest that proper doctrinal grounding can be as important for argument quality as prospective planning. According to our initial metric evaluations, KG-CFR reduces semantic looping by preserving the agent's consistency with the original plan.
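The degradation criterion quoted above can be made concrete with a short sketch. It assumes the quality shift is the post-shock score minus the pre-shock score, which the abstract does not spell out, and the scores below are invented for illustration.

```python
def critical_degradation_rate(runs, threshold=-0.20):
    """Share of runs whose quality shift is at or below the threshold.

    runs: list of (pre_shock_quality, post_shock_quality) pairs.
    The shift is assumed to be post minus pre.
    """
    if not runs:
        raise ValueError("need at least one run")
    flagged = sum(1 for pre, post in runs if (post - pre) <= threshold)
    return flagged / len(runs)


# Invented judge scores for four perturbed runs.
runs = [
    (0.80, 0.55),  # shift -0.25, flagged
    (0.70, 0.72),  # shift +0.02
    (0.75, 0.60),  # shift -0.15
    (0.90, 0.60),  # shift -0.30, flagged
]
rate = critical_degradation_rate(runs)
print(rate)
```

Under this reading, the reported result means the flagged share stays below 5% of perturbed runs when KG-CFR is used.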

2606.10677 2026-06-10 cs.AI cs.CL cross-list

Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Suozhao Ji, Baodong Wu, Zehao Wang, Lei Xia, Qingping Li, Ruisong Wang, Wenbo Ding, Zhenhua Zhu, Boxun Li, Guohao Dai, Yu Wang

Affiliations: Infinigence AI; Tsinghua University; Shanghai Jiaotong University

AI summary: Proposes Infini Memory, an architecture that organizes agent memory into topic documents and maintains long-term memory through buffered consolidation and iterative retrieval, reaching an overall score of 64.7% on MemoryAgentBench.

Abstract

Long-term LLM agents need persistent memory that can track changing facts and provide relevant evidence across sessions. Existing memory systems often store observations as isolated records, summaries, or indexed fragments, which makes evidence aggregation, fact revision, and memory maintenance difficult. We propose Infini Memory, a maintainable text-based persistent memory architecture that treats agent memory as topic-structured documents. Each topic document serves as a semantic unit for collecting related evidence, preserving metadata, and revising facts over time. New observations are first staged in a buffer and periodically consolidated into coherent textual contexts. At inference time, an agentic retrieval procedure lets the LLM read memory through iterative tool calls rather than a single retrieval step. On MemoryAgentBench, Infini Memory achieves an overall score of 64.7%. Ablations show that topic-structured maintenance and iterative evidence inspection improve complementary aspects of long-term memory use.
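The buffer-then-consolidate flow described above can be sketched in a few lines. This is only an illustration of the staging idea: in the paper an LLM assigns topics and rewrites documents, whereas here the caller supplies the topic and consolidation simply appends text. The class name, `buffer_limit`, and the sample observations are all hypothetical.

```python
from collections import defaultdict


class TopicMemory:
    def __init__(self, buffer_limit=3):
        self.buffer = []                     # staged (topic, text) observations
        self.documents = defaultdict(list)   # topic -> consolidated lines
        self.buffer_limit = buffer_limit

    def observe(self, topic, text):
        # Stage the observation and consolidate once the buffer is full.
        self.buffer.append((topic, text))
        if len(self.buffer) >= self.buffer_limit:
            self.consolidate()

    def consolidate(self):
        # Move every staged observation into its topic document.
        for topic, text in self.buffer:
            self.documents[topic].append(text)
        self.buffer.clear()

    def read(self, topic):
        # Return the topic document as text (empty if the topic is unknown).
        return "\n".join(self.documents.get(topic, []))


memory = TopicMemory(buffer_limit=2)
memory.observe("travel", "Prefers window seats.")
before = memory.read("travel")   # still staged, not yet consolidated
memory.observe("diet", "Is vegetarian.")
after = memory.read("travel")    # buffer reached its limit and was merged
print(repr(before), repr(after))
```

Staging writes in a buffer keeps topic documents stable between consolidation passes, which is what makes later fact revision a document-level edit instead of a search over scattered records.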


2606.11078 2026-06-10 cs.AI cs.CL cs.CV 交叉投稿

A History-Aware Visually Grounded Critic for Computer Use Agents

面向计算机使用代理的历史感知视觉基础批评家

Jaewoo Lee, Zaid Khan, Archiki Prasad, Justin Chih-Yao Chen, Supriyo Chakraborty, Kartik Balasubramaniam, Sambit Sahu, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal

发表机构 * University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）； Capital One ； University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结提出HiViG框架，通过历史感知的视觉基础多模态批评家，在测试时评估动作并拦截错误，在多个GUI基准上提升成功率。

Comments Code: https://github.com/G-JWLee/HiViG

AI中文摘要

针对计算机使用代理（CUA）的各种测试时干预措施（包括批评模型）已被开发出来，通过在复杂图形用户界面（GUI）环境中执行前动作评估来提高性能。然而，现有的批评家存在两个关键限制：（1）主要关注短视决策循环（例如，遗忘早期动作）；（2）缺乏检测有缺陷动作（例如，点击错误的UI元素）所需的视觉基础。为了解决这些问题，我们引入了HiViG，一个历史感知的视觉基础测试时框架，其核心是一个在真实GUI轨迹上训练的多模态批评家，用于将过去的交互抽象为紧凑记录，并基于视觉基础评估动作。在测试时，HiViG将批评家集成到策略决策循环中，以提供宏观动作历史（总结策略已完成成就）和视觉基础批评（根据当前截图验证原始执行坐标，在执行前拦截错误）。在网页、移动和桌面基准测试中，HiViG持续优于现有的标量和口头批评家，在Qwen3-VL-32B上比最强基线平均成功率提高5.8%，在Gemini-3-Flash上提高9.0%，并展示了强大的跨平台泛化能力。消融实验表明，宏观动作历史缓解了短视规划，视觉基础批评减少了执行错误，这两个组件对于长时域GUI任务中的测试时扩展至关重要。

英文摘要

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and visually grounded critique reduces execution errors, with both components being critical for test-time scaling in long-horizon GUI tasks.


2606.09421 2026-06-10 cs.CL 版本更新

基于LLM的代码文档生成与多裁判评估

Ikbel Ghrab, Mohamed Dhieb, Ismail Khenissi, Ines Abdeljaoued-Tej

发表机构 * University of Tunis El Manar（突尼斯埃尔马纳尔大学）

AI总结提出利用八种大语言模型自动生成代码文档，并通过多裁判评估框架（四个LLM从九个维度评分）提升文档质量，在医学物理库上实验显示最佳与最差模型性能差距达42%。

Comments ICAHS, © 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Journal ref: Conference ICAHS IEEE, 2025

AI中文摘要

使用恶化技巧将重写规则编译为有限状态转录机

Mans Hulden, Michael Ginn

发表机构 * New College of Florida（佛罗里达新学院）； University of Colorado（科罗拉多大学）

AI总结提出基于“恶化技巧”的紧凑编译方案，将重写规则编译为有限状态转录机，支持多种上下文和重写模式，实现简单且易于扩展。

Comments 17 pages, 6 figures, tool track proceedings at CIAA 2026

AI中文摘要

有限状态转录机（FST）对于计算语言学和自然语言处理（NLP）中的字符串重写建模至关重要，特别是对于音韵和形态重写规则。编译形式为 $A \to B / L \, \_ \, R$ 的一般重写规则（其中 $A$、$B$、$L$ 和 $R$ 是任意正则语言）由于重叠匹配和上下文约束而复杂。传统方法（如 Kaplan 和 Kay 或 Karttunen 的方法）依赖于带有辅助标记的复杂转录机组合。本文提出了一种基于“恶化技巧”的紧凑编译方案：生成所有合法的重写候选，然后过滤那些对于相同输入比其他候选更差的候选。该构造作为 PyFoma 中的内置重写编译器实现，支持多个上下文、任意转录、标记、定向重写、权重和并行重写。得到的公式简短且统一，并且在语义一致的情况下，它们重现了与早期方法相同的规则转录机，同时更易于扩展。该实现已在大量重写语法集合和涵盖主要重写模式的自动回归测试套件上针对 foma 进行了验证，得到的转录机除了状态编号外完全匹配。

英文摘要

Finite-state transducers (FSTs) are essential for modeling string rewriting in computational linguistics and natural language processing (NLP), particularly for phonological and morphological rewrite rules. Compiling general rewrite rules of the form $A \to B / L \, \_ \, R$, where $A$, $B$, $L$, and $R$ are arbitrary regular languages, is complex due to overlapping matches and context constraints. Traditional methods, such as those by Kaplan and Kay or Karttunen, rely on intricate transducer compositions with auxiliary markers. This paper presents a compact compilation scheme based on the "worsening trick": generate all legal rewrite candidates, then filter candidates that are worse than another candidate for the same input. Implemented as the built-in rewrite compiler in PyFoma, the construction supports multiple contexts, arbitrary transductions, markup, directed rewriting, weights, and parallel rewriting. The resulting formulas are short and uniform, and where semantics coincide, they reproduce the same rule transducers as earlier approaches while remaining easier to extend. The implementation has been validated against foma on both a substantial collection of rewrite grammars and an automated regression suite covering the major rewrite modalities, with the resulting transducers matching exactly apart from state numbering.
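The generate-then-filter idea in the abstract can be illustrated on plain strings. The sketch below is not PyFoma's construction and does not use its API: it assumes literal (non-regex) `a`, `b`, `left`, and `right`, enumerates candidates explicitly instead of building transducers, and uses one hypothetical notion of "worse" (a candidate that rewrites a strict subset of the sites another candidate rewrites). All function names are illustrative.

```python
from itertools import combinations


def rewrite_sites(s, a, left, right):
    """Start indices of non-overlapping occurrences of `a` between `left` and `right`.

    Contexts are checked against the input string; `a` must be non-empty.
    """
    sites, i = [], 0
    while i <= len(s) - len(a):
        if (s.startswith(a, i)
                and s[:i].endswith(left)
                and s.startswith(right, i + len(a))):
            sites.append(i)
            i += len(a)  # keep matches non-overlapping
        else:
            i += 1
    return sites


def candidates(s, a, b, left, right):
    """Map each subset of rewrite sites to the output it produces (optional rewriting)."""
    sites = rewrite_sites(s, a, left, right)
    result = {}
    for k in range(len(sites) + 1):
        for chosen in combinations(sites, k):
            out, prev = [], 0
            for i in chosen:
                out.append(s[prev:i])
                out.append(b)
                prev = i + len(a)
            out.append(s[prev:])
            result[frozenset(chosen)] = "".join(out)
    return result


def filter_worse(cands):
    """Drop every candidate whose rewritten sites are a strict subset of another's."""
    return sorted(out for chosen, out in cands.items()
                  if not any(chosen < other for other in cands))


# Toy rule a -> b / c _ d applied to "cadcad".
all_cands = candidates("cadcad", "a", "b", "c", "d")
obligatory = filter_worse(all_cands)
```

Here `all_cands` holds the four optional-rewriting outputs, and filtering leaves only the candidate that rewrites every legal site, which is how an obligatory rule falls out of the filter step in this toy setting.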


2602.17907 2026-06-10 cs.CL cs.AI 版本更新

Improving Topic Modeling by Distilling Soft Labels from Language Models

DSL-Topic：通过从语言模型中蒸馏软标签改进主题建模

Raymond Li, Amirhossein Abaskohi, Chuyuan Li, Gabriel Murray, Giuseppe Carenini

发表机构 * University of Washington（华盛顿大学）

AI总结提出DSL框架，通过从语言模型蒸馏软标签来增强主题模型训练，利用上下文感知的软标签重构信号，显著提升主题连贯性和分配准确性。

Comments 22 pages, 5 figures. Camera-ready version for ICML 2026

AI中文摘要

传统的神经主题模型通常通过重构文档的词袋表示来优化，忽略了上下文信息并面临数据稀疏性问题。在这项工作中，我们引入了一种新颖的主题模型训练框架，通过从语言模型中蒸馏软标签（DSL）。为了构建上下文丰富的重构信号，我们将基于特定提示的下一个词概率投影到预定义词汇表上，并使用语言模型隐藏状态训练主题模型重构软标签。这产生了更高质量的主题，与语料库的潜在主题结构更加紧密对齐。大量实验表明，DSL在主题连贯性和分配准确性上相比现有基线取得了显著改进。此外，我们还引入了一种基于检索的指标，显示我们的方法在识别语义相似文档方面显著优于现有方法，突显了其在面向检索应用中的有效性。

英文摘要

Traditional neural topic models are typically optimized by reconstructing the document's Bag-of-Words (BoW) representations, overlooking contextual information and struggling with data sparsity. In this work, we introduce a novel topic model training framework by Distilling Soft Labels (DSL) from Language Models (LMs). To construct the contextually enriched reconstruction signals, we project the next token probabilities, conditioned on a specialized prompt, onto a pre-defined vocabulary, and train the topic models to reconstruct the soft labels using the LM hidden states. This produces higher-quality topics that are more closely aligned with the underlying thematic structure of the corpus. Extensive experiments demonstrate that DSL achieves substantial improvements in topic coherence and assignment accuracy over existing baselines. Additionally, we also introduce a retrieval-based metric, which shows that our approach significantly outperforms existing methods in identifying semantically similar documents, highlighting its effectiveness for retrieval-oriented applications.


2604.14397 2026-06-10 cs.CL cs.AI 版本更新

Generating Concept Lexicalizations via Dictionary-Based Cross-Lingual Sense Projection

基于词典的跨语言语义投影生成概念词汇化

David Basil, Chirooth Girigowda, Bradley Hauer, Sahir Momin, Ning Shi, Grzegorz Kondrak

发表机构 * University of Toronto（多伦多大学）

AI总结提出一种通过语义投影将英语WordNet概念扩展到新语言的方法，利用双语词典增强对齐并过滤错误投影，在多个语言上提升了精度且保持可解释性和资源效率。

Comments Paper presented at Canadian AI 2026

AI中文摘要

我们研究通过词义生成自动将WordNet风格的词汇资源扩展到新语言的任务。我们通过语义投影将目标语言词条与现有词汇概念关联来生成词义。给定一个带有词义标注的英语语料库及其翻译，我们的方法将标注的同义词集投影到对齐的目标语言标记上，并将相应的词条分配给这些同义词集。为了生成对齐并确保其质量，我们使用双语词典增强预训练的基础对齐器，该词典也用于过滤不正确的词义投影。我们在多种语言上评估该方法，将其与先前方法以及基于词典和大型语言模型的基线进行比较。结果表明，所提出的投影-过滤策略在保持可解释性和资源效率的同时提高了精度。我们在 https://github.com/UAlberta-NLP/ExpandNet 上发布代码、文档和生成的词义清单。

英文摘要

We study the task of automatically expanding WordNet-style lexical resources to new languages through sense generation. We generate senses by associating target-language lemmas with existing lexical concepts via semantic projection. Given a sense-tagged English corpus and its translation, our method projects the annotated synsets onto aligned target-language tokens and assigns the corresponding lemmas to those synsets. To generate alignments and ensure their quality, we augment a pretrained base aligner with a bilingual dictionary, which is also used to filter incorrect sense projections. We evaluate the method on multiple languages, comparing it to prior methods, as well as dictionary-based and large language model baselines. Results show that the proposed project-and-filter strategy improves precision while remaining interpretable and resource-efficient. We release our code, documentation, and generated sense inventories at https://github.com/UAlberta-NLP/ExpandNet.


2606.09543 2026-06-10 cs.CL 版本更新

From Genes to Tokens: a GWAS-inspired Approach for Interpretable Stylometric Analysis

从基因到词元：受GWAS启发的可解释风格计量分析方法

Dmitry Pronin, Evgeny Kazartsev

发表机构 * HSE University（莫斯科国立高等经济大学）

AI总结受全基因组关联研究启发，提出一种通过逻辑回归和多重比较校正检测作者独特词汇标记的风格计量方法，在英、德、俄语语料中验证有效。

2606.10803 2026-06-10 cs.CL cs.AI cs.CV 新提交

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

超越API：探索多模态大语言模型在物理工具使用中的极限

Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

发表机构 * Singapore Management University（新加坡管理大学）； The Hong Kong Polytechnic University（香港理工大学）

AI总结提出PhysTool-Bench基准，评估多模态大语言模型在真实场景中识别物理工具并规划使用的能力，发现最强模型仅完成21%任务，揭示感知与规划双重缺陷。

AI中文摘要

多模态大语言模型（MLLMs）在利用数字API方面表现出色，并日益成为具身AI的“大脑”，指导机器人与物理世界交互。在这种具身环境中，核心能力之一是使用物理工具，这支撑着MLLMs在现实任务中协助人类的能力。尽管重要性显著，MLLMs在物理工具使用方面的熟练程度仍基本未被探索。为填补这一空白，我们引入了PhysTool-Bench，这是首个评估MLLMs理解真实场景、识别物理工具并规划其使用能力的物理工具使用基准。PhysTool-Bench包含2,510个查询，覆盖2,678个真实世界物理工具，涉及制造、电气作业、农业和医疗等多个领域。具体而言，模型沿两个主要维度进行评估：1）识别场景中所有存在的物理工具，2）根据指令和视觉上下文规划工具选择和使用顺序。在13个领先的MLLMs中，即使最强的模型（Gemini-3.1-Pro）也只能识别场景中58.7%的工具，并仅完成21.0%的端到端查询。我们的分析揭示了两个层面的缺陷：MLLMs难以在真实场景中感知工具，而规划阶段更大的下降进一步表明缺乏将感知到的工具映射到任务语义的功能常识，这指出了发展实用具身AI的关键瓶颈。

英文摘要

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.


2606.09846 2026-06-10 cs.HC cs.AI cs.CL 交叉投稿

CANVAS: Captioning Art with Narrative Visual-Audio AI Systems

CANVAS: 用叙事视觉音频AI系统为艺术配文

Vignesh Nagarajan

发表机构 * BASIS Phoenix High School（BASIS凤凰高中）

AI总结提出一种自动化工作流，利用大语言模型和文本转语音服务生成多感官艺术描述和同步音频解说，在20秒内以低于0.05美元的成本生成文本加音频输出，显著提高词汇多样性和叙事细节。

Comments 22 pages, 16 figures, 3 tables, 21 references

AI中文摘要

由于替代文本简短或缺失，视觉艺术在很大程度上仍对盲人和低视力（BLV）观众不可及，这些文本很少传达艺术品的感官、空间或情感特质。本研究提出了一种自动化工作流，利用大语言模型和文本转语音服务生成多感官艺术描述和同步音频解说。该系统通过Zapier编排，将上传的图像转换为丰富的叙事字幕，无需人工干预，从而实现可访问媒体的快速、规模化生产。对50件艺术品的定量评估显示，AI生成的描述在词汇多样性、形容词密度和叙事细节方面显著高于基线字幕，同时保持可比的易读性水平。统计检验（t检验、方差分析）确认了丰富度和长度方面的显著差异，完整流水线在每张图像20秒内生成文本加音频输出，成本低于0.05美元。研究结果表明，自动字幕生成可以弥合博物馆和数字馆藏可访问性方面的差距，对更广泛的公众参与具有意义。未来工作可纳入BLV参与者的用户研究，以评估理解、偏好和最佳解释性语言水平。

英文摘要

Visual art remains largely inaccessible to blind and low-vision (BLV) audiences due to brief or absent alt-text, which rarely conveys the sensory, spatial, or emotional qualities of an artwork. This study presents an automated workflow that generates multi-sensory art descriptions and synchronized audio narration using large language models and text-to-speech services. The system, orchestrated through Zapier, converts uploaded images into rich narrative captions without human intervention, enabling rapid, scalable production of accessible media. Quantitative evaluation across 50 artworks shows that AI-generated descriptions contain significantly higher lexical diversity, adjective density, and narrative detail than baseline captions, while maintaining comparable readability levels. Statistical tests (t-tests, ANOVA) confirm meaningful differences in richness and length, and the full pipeline produces text-plus-audio outputs in under 20 seconds per image at a cost below $0.05. Findings demonstrate that automated captioning can bridge gaps in museum and digital-collection accessibility, with implications for broader public engagement. Future work can incorporate user studies with BLV participants to assess comprehension, preference, and optimal levels of interpretive language.


2606.10147 2026-06-10 cs.AI cs.CL cs.CV cs.SD 交叉投稿

重新审视视觉问答中的贪婪解码：一种校准视角

Boqi Chen, Xudong Liu, Yunke Ao, Jianing Qiu

发表机构 * ETH Zurich（苏黎世联邦理工学院）； University of Toronto（多伦多大学）； MBZUAI（穆罕默德·本·扎耶德人工智能大学）

AI总结针对视觉问答任务，从校准角度理论证明贪婪解码优于随机采样，并提出适用于推理模型的贪婪解码方法，实验验证其有效性。

AI中文摘要

随机采样策略被广泛用于大型语言模型（LLMs）以平衡输出的连贯性和多样性。这些启发式方法通常被多模态大语言模型（MLLMs）直接继承，而未经针对特定任务的论证。然而，我们认为随机解码对于视觉问答（VQA）可能不是最优的。VQA是一个封闭式任务，答案分布集中于头部，其不确定性通常是认知性的，源于缺失或模糊的视觉证据，而非多种合理的延续。在这项工作中，我们对模型校准与预测准确性之间的关系进行了理论形式化，并推导出贪婪解码最优性的充分条件。大量实验提供了经验证据，表明贪婪解码在多个基准测试中优于随机采样。此外，我们提出了适用于推理模型的贪婪解码，在多模态推理场景中优于随机采样和标准贪婪解码。总体而言，我们的结果警示不要在MLLMs中不加分析地继承LLMs的解码启发式方法，并表明贪婪解码可以成为VQA中高效且强大的默认选择。

英文摘要

Stochastic sampling strategies are widely adopted in large language models (LLMs) to balance output coherence and diversity. These heuristics are often inherited in Multimodal LLMs (MLLMs) without task-specific justification. However, we contend that stochastic decoding can be suboptimal for Visual Question Answering (VQA). VQA is a closed-ended task with head-heavy answer distributions where uncertainty is usually epistemic, arising from missing or ambiguous visual evidence rather than plausible continuations. In this work, we provide a theoretical formalization of the relationship between model calibration and predictive accuracy, and derive sufficient conditions for the optimality of greedy decoding. Extensive experiments provide empirical evidence for the superiority of greedy decoding over stochastic sampling across multiple benchmarks. Furthermore, we propose Greedy Decoding for Reasoning Models, which outperforms both stochastic sampling and standard greedy decoding in multimodal reasoning scenarios. Overall, our results caution against naively inheriting LLM decoding heuristics in MLLMs and demonstrate that greedy decoding can be an efficient yet strong default for VQA.
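The link between calibration and decoding can be made concrete with a small calculation. This is an illustrative sketch, not the paper's derivation or its stated conditions: it assumes a single-step answer distribution that is perfectly calibrated, so `probs[i]` is the chance that answer `i` is the correct one. Under that assumption greedy decoding is right with probability `max(probs)`, while sampling from the distribution is right with probability `sum(p * p)`, which can never exceed the maximum. The function names and the example distribution are hypothetical.

```python
def greedy_accuracy(probs):
    """Expected accuracy when always emitting the most likely answer."""
    return max(probs)


def sampling_accuracy(probs):
    """Expected accuracy when sampling the answer from the same distribution.

    Answer i is drawn with probability p_i and is correct with probability p_i.
    """
    return sum(p * p for p in probs)


# A head-heavy answer distribution of the kind the abstract attributes to VQA.
probs = [0.6, 0.25, 0.1, 0.05]
greedy = greedy_accuracy(probs)
sampled = sampling_accuracy(probs)
```

The gap between `greedy` and `sampled` shrinks as the distribution concentrates on one answer and vanishes when a single answer has probability 1.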


2510.04514 2026-06-10 cs.AI cs.CE cs.CL cs.CV stat.ME 版本更新

ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering

ChartAgent: 一种用于复杂图表问答中视觉基础推理的多模态智能体

Rachneet Kaur, Nishan Srishankar, Zhen Zeng, Sumitra Ganesh, Manuela Veloso

发表机构 * J.P. Morgan AI Research（摩根大通人工智能研究）

AI总结提出ChartAgent框架，通过迭代分解查询为视觉子任务并利用图表专用视觉工具（如绘制注释、裁剪区域）进行空间域推理，在ChartBench和ChartX上取得最先进性能，尤其对无标注图表提升显著。

Comments Accepted at ACL 2026 (Main Conference). Also presented as an oral paper at the NeurIPS 2025 Multimodal Algorithmic Reasoning Workshop (https://marworkshop.github.io/neurips25/)

AI中文摘要

最近的多模态大语言模型在基于图表的视觉问答中显示出潜力，但在无标注图表上——即那些需要精确视觉解释而非依赖文本捷径的图表——其性能急剧下降。为了解决这个问题，我们引入了ChartAgent，一种新颖的智能体框架，它直接在图表的空间域内显式执行视觉推理。与文本思维链推理不同，ChartAgent通过专门的行动（如绘制注释、裁剪区域（例如分割饼图切片、隔离条形图）和定位坐标轴）迭代地将查询分解为视觉子任务，并主动操作和交互图表图像，使用图表专用视觉工具库来完成每个子任务。这种迭代推理过程密切模仿了人类理解图表的认知策略。ChartAgent在ChartBench和ChartX基准测试上达到了最先进的准确率，整体上比先前方法绝对提升高达16.07%，在无标注、数值密集的查询上提升17.31%。此外，我们的分析表明，ChartAgent (a) 在多种图表类型上有效，(b) 在不同视觉和推理复杂度水平上均取得最高分数，(c) 作为一个即插即用的框架，提升了多种基础LLM的性能。我们的工作是首批使用工具增强的多模态智能体展示图表理解中视觉基础推理的工作之一。

英文摘要

Recent multimodal LLMs have shown promise in chart-based visual question answering, but their performance declines sharply on unannotated charts, i.e., those requiring precise visual interpretation rather than relying on textual shortcuts. To address this, we introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain. Unlike textual chain-of-thought reasoning, ChartAgent iteratively decomposes queries into visual subtasks and actively manipulates and interacts with chart images through specialized actions such as drawing annotations, cropping regions (e.g., segmenting pie slices, isolating bars), and localizing axes, using a library of chart-specific vision tools to fulfill each subtask. This iterative reasoning process closely mirrors human cognitive strategies for chart comprehension. ChartAgent achieves state-of-the-art accuracy on the ChartBench and ChartX benchmarks, surpassing prior methods by up to 16.07% absolute gain overall and 17.31% on unannotated, numerically intensive queries. Furthermore, our analyses show that ChartAgent (a) is effective across diverse chart types, (b) achieves the highest scores across varying visual and reasoning complexity levels, and (c) serves as a plug-and-play framework that boosts performance across diverse underlying LLMs. Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.


2605.07415 2026-06-10 cs.CV cs.CL 版本更新

ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring

ChartREG++：面向多样化指代线索和多目标指代的图表指代表达式定位基准与改进

Tianhao Niu, Ziyu Han, Xuan Dong, Qingfu Zhu, Wanxiang Che

发表机构 * Research Center for Social Computing and Interactive Robotics（社会计算与交互机器人研究中心）

AI总结针对现有图表指代表达式定位基准的局限，提出支持多种定位形式、多目标指代、多样化线索和图表类型的基准，并利用代码驱动合成流水线生成像素级实例掩码，训练实例分割模型集成到多模态定位框架，显著提升性能。

AI中文摘要

指代表达式定位是视觉定位的核心问题，广泛用于视觉与语言模型的空间定位与推理诊断，但以往工作多聚焦于自然图像。相比之下，现有的图表指代表达式定位基准存在局限：(1) 大多采用边界框，限制了精细图表元素的定位精度；(2) 大多假设单个或两个指代目标实例，无法处理多实例目标指代；(3) 语言表达过度依赖文本线索或数据排名线索；(4) 仅覆盖狭窄的图表类型范围。为解决这些问题，我们引入了一个图表指代表达式定位基准，系统性地支持多种定位形式、多个指代目标、多样化定位线索和多种图表类型。在代表性多模态大模型上的结果揭示了显著的性能差距。我们进一步引入了一个代码驱动的合成流水线，利用绘图程序与渲染图表基元之间的固有对齐，跨图表元素类型和粒度生成像素级精确的实例掩码。我们使用合成掩码训练了一个实例分割模型，并将其集成到一个通用的多模态定位框架中。最终系统在我们的基准上持续优于基线，并很好地泛化到从ChartQA导出的真实图表定位基准。

英文摘要

Referring expression grounding is a core problem in visual grounding and is widely used as a diagnostic of spatial grounding and reasoning in vision and language models, yet most prior work focuses on natural images. In contrast, existing benchmarks for chart referring expression grounding remain limited: (1) they largely adopt bounding boxes, constraining localization precision for fine chart elements; (2) they mostly assume one or two referred target instances, failing to handle multi-instance target references; (3) the language expressions over-rely on textual cues or data-rank clues; and (4) they cover only a narrow range of chart types. To address these issues, we introduce a chart referring expression grounding benchmark that systematically supports multiple localization forms, multiple referred targets, diverse grounding cues, and diverse chart types. Results across representative multimodal large models reveal a significant performance gap. We further introduce a code-driven synthesis pipeline that exploits the inherent alignment between plotting programs and rendered chart primitives to derive pixel-accurate instance masks across chart element types and granularities. We train an instance segmentation model with the synthesized masks and integrate it into a general-purpose multimodal grounding framework. The resulting system consistently outperforms baselines on our benchmark and generalizes well to a ChartQA-derived real-chart grounding benchmark.


2606.10581 2026-06-10 cs.CL cs.SD eess.AS 新提交

ParaBridge: Bridging Paralinguistic Perception and Dialogue Behavior in Speech Language Models

ParaBridge: 弥合语音语言模型中的副语言感知与对话行为

Yuxiang Wang, Qinke Ni, Shengbo Cai, Wan Lin, Liqiang Zhang, Zhizheng Wu

发表机构 * The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Tencent Hunyuan（腾讯混元）； Shenzhen Loop Area Institute（深圳河套学院）； Amphion Technology Co., Ltd.（Amphion科技有限公司）； Tsinghua University（清华大学）

AI总结提出ParaBridge，一种在线自我蒸馏方法，将推理阶段的副语言指令支架转化为稳定的模型行为，无需人工标注或外部奖励，显著提升语音语言模型对副语言线索的响应能力。

AI中文摘要

语音携带的信息远不止文字：孩子的声音、恐惧的语气或嘈杂的背景都应引导一个足够胜任的语音对话助手给出不同的回复。当前的语音语言模型（SLM）能够识别此类副语言线索，但在开放域对话中常常忽略它们。我们观察到，在推理阶段使用简单的副语言指令支架可以缩小这种感知-行为差距，表明相关线索已潜在于模型中。然而，这种支架在多轮上下文和竞争指令下仍然脆弱。因此，我们提出ParaBridge，一种在线自我蒸馏方法，将脆弱的推理时支架转化为稳定的模型行为。在训练过程中，支架仅作为临时的特权视图；无支架模型自行生成回复，而支架视图沿其轨迹提供密集的全词汇下一词目标。这种监督教会了模型在非词汇线索应影响回复时的时机，无需策划的对话、人工标签或外部奖励模型。在Qwen3-Omni-thinking上，ParaBridge将无支架的VoxSafeBench SAR从14.6%提升至40.3%，并将EchoMind平均评分从3.27提升至3.92。它还保留了通用能力，MMAU-Pro、VoiceBench和GPQA均与原始模型相差在0.4分以内。在训练分布之外，ParaBridge泛化到未见过的副语言线索，从面向安全的训练迁移到共情导向的对话，并在不同的SLM骨干上有效。

英文摘要

Speech carries more information than just words: a child's voice, a fearful tone, or a noisy background should all lead a sufficiently competent spoken-dialogue assistant to different replies. Current Speech Language Models (SLMs) can recognize such paralinguistic cues but often ignore them in open-ended dialogue. We observe that a simple paralinguistic instruction scaffold at the inference stage narrows this perception-behavior gap, suggesting that the relevant cues are already latent in the model. Such scaffolds, however, remain brittle under multi-turn context and competing instructions. Therefore, we propose ParaBridge, an on-policy self-distillation method that turns a brittle inference-time scaffold into stable model behavior. During training, the scaffold serves only as a temporary privileged view; the scaffold-free model rolls out its own response, while the scaffolded view supplies dense, full-vocabulary next-token targets along its trajectory. This supervision teaches when non-lexical cues should affect the reply without the need for curated dialogues, human labels, or external reward models. On Qwen3-Omni-thinking, ParaBridge raises scaffold-free VoxSafeBench SAR from 14.6% to 40.3% and improves EchoMind average rating from 3.27 to 3.92. It also preserves general ability, with MMAU-Pro, VoiceBench, and GPQA all within 0.4 points of the original model. Beyond the training distribution, ParaBridge generalizes to unseen paralinguistic cues, transfers from safety-oriented training to empathy-oriented dialogue, and works on a different SLM backbone.


2606.10654 2026-06-10 cs.CL 新提交

Speaker Group Encoding in Self-supervised Speech Recognition Models

自监督语音识别模型中的说话人群体编码

Felix Herron, Solange Rossato, Alexandre Allauzen, Benoit Favre, François Portet

发表机构 * MILES Team, LAMSADE, Université Paris Dauphine-PSL, France（法国巴黎多芬纳-PSL大学LAMSADE实验室MILES团队）； GETALP Team, LIG, Université Grenoble Alpes, France（法国格勒诺布尔阿尔卑斯大学LIG实验室GETALP团队）； NLP team, LIS, Aix-Marseille University, France（法国艾克斯-马赛大学LIS实验室NLP团队）

AI总结研究自监督语音识别模型如何编码说话人群体信息，发现微调任务和公平性算法对不同类型群体信息的影响不同。

DOI: 10.1007/978-3-032-02548-7_11
Journal ref: Text, Speech, and Dialogue. TSD 2025. Lecture Notes in Computer Science, vol 16029

AI中文摘要

我们研究了自监督语音识别模型（S3Ms）学习了关于说话人群体（SGs）的哪些信息。我们检查了S3Ms的几种状态：预训练、在说话人识别（SID）上微调、在自动语音识别（ASR）上微调，以及使用公平性增强算法进行ASR微调。我们发现S3Ms编码了关于几个说话人群体类别（SGCs）的信息，包括他们的性别、年龄、方言、种族以及是否为母语者。我们发现，针对SID的微调放大了某些SGCs，即那些方差更偏向语音性质的SGCs，尽管它没有放大其他SGCs，即那些方差更偏向语义性质的SGCs。另一方面，针对ASR的微调丢弃了语音变异的说话人群体信息（SGI），但保留了语义变异的SGI。我们发现，为改善公平性而设计的ASR算法改变了S3Ms中编码SGI的程度；然而，这主要适用于语音变异的SGCs，而对于语义变异的SGCs则不太适用。我们讨论了SGI如何被每一层编码，并识别了负责编码不同SGCs的嵌入子维度。最后，我们讨论了我们的发现如何有助于设计更公平的ASR算法。

英文摘要

We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech recognition (ASR), and ASR-finetuned using a fairness-enhancing algorithm. We find that S3Ms encode information about several speaker group categories (SGCs), including their gender, age, dialect, ethnicity, and whether they are a native speaker. We find that finetuning for SID amplifies certain SGCs, namely those whose variance is more phonetic in nature, though it does not amplify other SGCs, namely those whose variance is more semantic in nature. On the other hand, finetuning for ASR discards phonetically variant speaker group information (SGI) but retains semantically variant SGI. We find that ASR algorithms designed for fairness improvement change the extent to which SGI is encoded in S3Ms; however, this is primarily true for phonetically variant SGCs, and less true for semantically variant SGCs. We discuss how SGI is encoded by each layer, and identify subdimensions of embeddings responsible for encoding different SGCs. Finally, we discuss how our findings could be beneficial in designing fairer ASR algorithms.


2606.10675 2026-06-10 cs.CL eess.AS 新提交

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

基于自监督表示和学习动态规划的多语言词级强制对齐

Roy Weber, Meidan Zehavi, Rotem Rousso, Joseph Keshet

发表机构 * Faculty of Electrical and Computer Engineering, Technion – Israel Institute of Technology（以色列理工学院电气与计算机工程学院）

AI总结提出一种结合自监督表示和学习动态规划的多语言词级强制对齐方法，通过融合MMS和UnSupSeg特征并学习词边界概率，在多个语言上超越现有方法。

Comments Interspeech 2026

AI中文摘要

我们提出了一种准确的多语言词级强制对齐方法，包括一个对齐编码器和一个学习对齐解码器。编码器整合两种表示：一种来自大规模多语言语音（MMS）模型，另一种来自自监督音素边界检测器（UnSupSeg）。它学习融合这些表示，并在长时间上下文中估计词边界概率。对齐解码器是一种学习动态规划，它将编码器输出与基于MMS和UnSupSeg表示的段特征相结合，以推断最终词边界。在TIMIT和Buckeye上迭代训练后，所提方法在两个数据集上均优于Montreal Forced Aligner（MFA）和基于MMS的对齐方法。在未见语言（荷兰语、德语和希伯来语）上，所提模型的性能始终优于或与现有对齐方法相当，表明其有潜力在不进行进一步训练的情况下扩展到MMS支持的1100多种语言。

英文摘要

We present a method for accurate multilingual word-level forced alignment, consisting of an alignment encoder and a learned alignment decoder. The encoder integrates two representations: one from the Massively Multilingual Speech (MMS) model and another from a self-supervised phoneme boundary detector (UnSupSeg). It learns to fuse them and to estimate word-boundary probabilities over long temporal contexts. The alignment decoder is a learned dynamic program that combines encoder outputs with segmental features over the MMS and UnSupSeg representations to infer final word boundaries. Trained iteratively on TIMIT and Buckeye, the proposed approach outperforms Montreal Forced Aligner (MFA) and MMS-based alignment on both datasets. On unseen languages (Dutch, German, and Hebrew), the proposed model achieves performance consistently better than or on par with existing alignment approaches, indicating its potential to scale to the 1100+ languages supported by MMS without further training.
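To make the decoding step concrete, the sketch below shows a plain, non-learned dynamic program that places word boundaries given per-frame boundary scores. It is a simplification under stated assumptions, not the paper's decoder: the scores stand in for encoder outputs (for example log-probabilities), the segmental features and learned weights are omitted, and the function name `best_boundaries` is hypothetical.

```python
def best_boundaries(scores, n_words):
    """Choose n_words - 1 strictly increasing interior frames with maximal total score.

    scores[t] is a boundary score for frame t; frame 0 is never a boundary.
    """
    n_frames = len(scores)
    k = n_words - 1
    if k == 0:
        return []
    if k > n_frames - 1:
        raise ValueError("not enough frames for the requested number of words")
    neg = float("-inf")
    # dp[j][t]: best total score using j boundaries with the j-th one at frame t.
    dp = [[neg] * n_frames for _ in range(k + 1)]
    back = [[-1] * n_frames for _ in range(k + 1)]
    for t in range(1, n_frames):
        dp[1][t] = scores[t]
    for j in range(2, k + 1):
        best_prev, best_prev_t = neg, -1  # running max of dp[j-1][1..t-1]
        for t in range(1, n_frames):
            if best_prev > neg:
                dp[j][t] = scores[t] + best_prev
                back[j][t] = best_prev_t
            if dp[j - 1][t] > best_prev:
                best_prev, best_prev_t = dp[j - 1][t], t
    # Backtrack from the best position of the last boundary.
    t = max(range(1, n_frames), key=lambda u: dp[k][u])
    path = []
    for j in range(k, 0, -1):
        path.append(t)
        t = back[j][t]
    return path[::-1]


# Seven frames, three words: two interior boundaries are needed.
frame_scores = [-5.0, -3.0, -0.1, -2.0, -4.0, -0.2, -3.0]
boundaries = best_boundaries(frame_scores, 3)
```

The running maximum keeps the recursion at O(k·T) rather than O(k·T²); a learned decoder would replace the fixed additive score with a trained scoring function over segments.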


2606.11167 2026-06-10 cs.CL eess.AS 新提交

Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

全双工语音模型中的多面交互对齐

Atsumoto Ohashi, Neil Zeghidour, Alexandre Défossez, Eugene Kharitonov

发表机构 * Kyutai ； Gradium

AI总结针对全双工对话模型交互性问题，提出基于强化学习的后训练对齐方法，从暂停处理、话轮转换、回馈和用户打断四个维度优化，并加入LLM奖励防止语义退化，在Moshi和PersonaPlex上取得一致改进。

AI中文摘要

全双工口语对话模型可以同时听和说，使其成为自然对话的有前途的架构。然而，当前模型仅通过令牌级似然最大化的监督学习进行训练，这并未直接优化交互级行为，导致交互性问题，如过度沉默和不合时宜的话轮转换。最近的工作应用强化学习（RL）来改善交互性，但现有方法仅在其奖励中处理有限的一组交互行为。在这项工作中，我们提出了一种后训练对齐方法，通过RL全面改善全双工口语对话模型的交互性。我们解决了交互性的四个典型轴：暂停处理、话轮转换、回馈和用户打断。对于每个轴，我们从人类对话语料库中提取短音频片段，并使用特定于轴的奖励函数优化模型。一个额外的基于LLM的响应质量奖励防止语义退化。我们将我们的方法应用于两个开源模型Moshi和PersonaPlex，在预录音频的离线评估和实时多轮对话评估中均显示出交互性的一致改进。

英文摘要

Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.


2606.09553 2026-06-10 cs.CL cs.SD 新提交

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

OpenBibleTTS：面向低资源语言的大规模语音资源与TTS模型

David Guzmán, Luel Hagos Beyene, Jesujoba Oluwadara Alabi, Yejin Jeon, Dietrich Klakow, David Ifeoluwa Adelani

发表机构 * McGill University（麦吉尔大学）； Mila - Quebec AI Institute（米拉-魁北克人工智能研究所）； AIMS Research and Innovation Centre（AIMS研究与创新中心）； NM-AIST ； Saarland University（萨尔大学）； Canada CIFAR AI Chair（加拿大CIFAR人工智能教席）

AI总结针对低资源语言TTS研究不足的问题，提出包含37种语言的OpenBibleTTS基准，系统比较多种TTS架构，发现无单一系统通用，并开源数据集与模型。

AI中文摘要

神经文本转语音（TTS）和多语言语音生成的最新进展显著提升了合成语音质量，但这些进步在全球语言中分布不均。现有模型仍由少数高资源语言主导，而许多低资源TTS研究是在人工降采样的高资源语料库上模拟的，未能反映真正低资源环境中的正字法变化和有限的音系覆盖。为此，我们引入OpenBibleTTS，这是一个涵盖37种低资源语言的大规模低资源语音合成基准。此外，我们对各种TTS架构和大规模语音生成模型在领域内圣经文本和领域外材料上进行了系统比较。结果表明，没有单一系统在所有语言和指标上占优：Gemini-TTS在大多数评估语言上获得最高听众评分，但在OpenBibleTTS上训练的单一语言EveryVoice模型在可懂度上仍然最强，并在几种非洲语言中更受青睐，而从头训练的开放系统在领域外文本上性能急剧下降，揭示了广泛多语言覆盖与可靠合成质量之间在服务不足的语言社区中持续存在的差距。我们用主观人类判断补充自动评估，并开源所有处理后的数据集、对齐和训练模型，以支持未来的低资源TTS研究。

英文摘要

Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.

2606.06037 2026-06-10 cs.SD cs.CL eess.AS cross-list

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

Virginia Ceccatelli, Yejin Jeon, David Ifeoluwa Adelani

Affiliations: Mila - Quebec AI Institute; McGill University; Canada CIFAR AI Chair

AI summary: Introduces the SpeechJBB dataset, which uses code-switched harmful audio and pseudo-word insertion to expose safety vulnerabilities of large audio language models in multilingual and spoken settings.

Abstract

Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.

2606.10029 2026-06-10 cs.LG cs.AI cs.CL cross-list

Interpreting and Steering a Text-to-Speech Language Model with Sparse Autoencoders

Nikita Koriagin, Georgii Aparin, Nikita Balagansky, Daniil Gavrilov

AI summary: Trains BatchTopK sparse autoencoders on the language-model backbone of CosyVoice3 and finds that the recovered features are interpretable and causally controllable, allowing manipulation of laughter, speaker gender, and speech rate.

Abstract

Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires: text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts, and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
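To make the BatchTopK idea concrete, the sketch below keeps the k × batch_size largest latent activations across a whole batch instead of the top k per sample, so individual samples may end up with more or fewer than k active latents. This is a minimal pure-Python illustration, not the authors' training code: the function name `batch_topk`, the positivity filter (assumed ReLU-style), and the toy inputs are assumptions.

```python
def batch_topk(activations, k):
    """Keep the k * batch_size largest activations across the whole batch.

    activations : list of rows, one row of latent activations per sample
    k           : average number of active latents allowed per sample
    Every activation outside the kept set is set to zero.
    """
    batch_size = len(activations)
    budget = k * batch_size
    # Flatten to (value, row, col) triples; only positive values can be kept
    # (assumption: ReLU-style non-negative latents).
    flat = [(v, r, c)
            for r, row in enumerate(activations)
            for c, v in enumerate(row) if v > 0]
    flat.sort(key=lambda t: t[0], reverse=True)
    sparse = [[0.0] * len(row) for row in activations]
    for v, r, c in flat[:budget]:
        sparse[r][c] = v
    return sparse


# Toy batch of two samples with three latents each, k = 1 (budget of 2 overall).
toy = [[0.9, 0.8, 0.7], [0.1, 0.0, 0.2]]
print(batch_topk(toy, k=1))
```

In this toy batch both kept activations come from the first sample, which shows how the budget is shared across the batch rather than enforced per sample.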

2606.10439 2026-06-10 cs.SD cs.CL eess.AS cross-list

Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling

Guodong Lin, Ziqi Chen, Yuxiang Fu, Ke Li, Wei-Qiang Zhang

Affiliations: University of Science and Technology of China

AI summary: Proposes a projector-based LLM-ASR framework that uses a Mixture of Experts architecture to improve cross-lingual adaptability and a Continuous Integrate-and-Fire mechanism for dynamic downsampling and modality alignment; experiments show that it substantially outperforms strong baselines.

Comments: Accepted by ICASSP 2026

DOI: 10.1109/ICASSP55912.2026.11464266
Journal ref: ICASSP (2026), 18807-18811

Abstract

The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
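As a rough illustration of the Continuous Integrate-and-Fire idea named in the abstract, the sketch below accumulates per-frame weights and emits one integrated vector each time the running sum reaches a threshold, which shortens the sequence dynamically. It is a toy sketch under stated assumptions, not the paper's implementation: the name `cif_downsample`, the threshold of 1.0, and the hand-set weights are hypothetical, and in a real model the weights would come from a learned predictor.

```python
def cif_downsample(frames, weights, threshold=1.0):
    """Toy Continuous Integrate-and-Fire over a sequence of frame vectors.

    frames  : list of equal-length lists of floats (e.g. encoder outputs)
    weights : list of floats in [0, 1], one per frame (assumed given here)
    Returns one integrated vector per firing.
    """
    dim = len(frames[0])
    outputs = []
    acc_weight = 0.0         # weight accumulated since the last firing
    acc_vec = [0.0] * dim    # weighted sum of frames since the last firing

    for frame, w in zip(frames, weights):
        if acc_weight + w < threshold:
            # No boundary yet: integrate the whole frame.
            acc_weight += w
            acc_vec = [a + w * x for a, x in zip(acc_vec, frame)]
        else:
            # Boundary falls inside this frame: split its weight in two.
            used = threshold - acc_weight   # completes the current unit
            rest = w - used                 # starts the next unit
            outputs.append([a + used * x for a, x in zip(acc_vec, frame)])
            acc_weight = rest
            acc_vec = [rest * x for x in frame]
    return outputs


# Four one-dimensional frames with weight 0.5 each fire twice.
print(cif_downsample([[1.0], [2.0], [3.0], [4.0]], [0.5, 0.5, 0.5, 0.5]))
```

With four frames of weight 0.5, the sequence is reduced to two output vectors, each a weighted sum of two consecutive frames.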

2606.10781 2026-06-10 eess.AS cs.CL cross-list

Recovering the Zipfian Distribution in Unsupervised Term Discovery

Danel Slabbert, Simon Malan, Herman Kamper

Affiliations: Het Jan Marais Fonds

AI summary: To address the overly uniform distributions that centre-based clustering produces in unsupervised term discovery, the paper proposes graph clustering, which substantially outperforms K-means and similar methods on three languages and recovers lexicon distributions closer to Zipfian.

Abstract

Unsupervised term discovery involves segmenting unlabelled speech into word- or syllable-like units and clustering these into a lexicon of candidate types. True lexicons follow a Zipfian distribution, yet the dominant centre-based clustering approach -- K-means -- produces a more uniform distribution due to an inductive bias toward spherical clusters. In this paper we revisit graph-based clustering as a bottom-up alternative, where segment embeddings are connected by pairwise similarity and partitioned using the Leiden algorithm. We show that graph clustering substantially outperforms centre-based approaches (K-means, GMM, BIRCH) in both word- and syllable-level lexicon discovery across three languages, producing more Zipf-like distributions. Another bottom-up approach, agglomerative clustering with average linkage, also performs well, although it is computationally less efficient and allows for less control over the resulting distribution. Our work calls into question the dominance of centre-based clustering for term discovery, and promotes graph clustering as an attractive alternative.
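One simple way to check how Zipf-like a set of cluster sizes is, offered here only as an illustration, is to fit a line to log frequency against log rank: a slope near -1 suggests a Zipfian shape, while a slope near 0 suggests near-uniform cluster sizes. The function name `zipf_slope` and the toy label lists are assumptions, and the abstract does not say which measure the paper uses.

```python
import math
from collections import Counter


def zipf_slope(cluster_labels):
    """Least-squares slope of log(frequency) against log(rank).

    cluster_labels : iterable of cluster ids, one per discovered segment
    Requires at least two distinct clusters.
    """
    freqs = sorted(Counter(cluster_labels).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Uniform cluster sizes versus sizes proportional to 1 / rank.
uniform = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
zipfian = [0] * 120 + [1] * 60 + [2] * 40 + [3] * 30
print(zipf_slope(uniform), zipf_slope(zipfian))
```

The uniform labelling gives a slope of about 0 and the 1/rank labelling a slope of about -1, which matches the contrast the abstract draws between centre-based and graph-based clustering.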

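2606.11033 2026-06-10 cs.LG cs.AI cs.CL cross-list

AuRA: Internalizing Audio Understanding into LLMs as LoRA

Bo Cheng, Lei Shi, Zhanyu Ma, Yuan Wu, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He

Affiliations: Meituan; Jilin University

AI summary: Proposes AuRA, which uses layer-wise distillation to internalize the speech representations of an ASR encoder into a LoRA-adapted LLM, enabling tightly coupled joint speech-language modeling and efficient parallel end-to-end inference; it outperforms cascaded systems and existing adaptation methods on multiple benchmarks.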

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment of Large Language Models in Bengali Contexts

Kazi Noshin, Sajib Acharjee Dip, Ranat Das Prangon, Fardin Hassan Tamim, Syed Ishtiaque Ahmed, Liqing Zhang, Sharifa Sultana

AI summary: Introduces the BenSyc benchmark, a five-level annotated set built from Bengali social media data, and evaluates more than 15 models on conversational alignment classification and response generation; even frontier models struggle to distinguish empathetic support from reinforcing validation.

Abstract

Large language models (LLMs) increasingly participate in emotionally sensitive social conversations, where responses may shift from balanced support toward excessive validation or escalatory alignment. Existing sycophancy research primarily focuses on factual agreement and instruction-following settings, leaving culturally grounded conversational sycophancy underexplored. We introduce BenSyc, the first benchmark for studying conversational sycophancy in Bengali social contexts. Starting from 11,840 Reddit posts and 170k comments collected from communities across Bangladesh and West Bengal, we construct a human-validated benchmark with binary labels and a fine-grained five-level taxonomy spanning Invalidation, Neutral, Support, Validation, and Escalation. We evaluate more than 15 open and proprietary LLMs on conversational alignment classification and response generation tasks. Results show that distinguishing empathetic support from reinforcement-oriented validation remains challenging even for frontier instruction-tuned models: the best system achieves only 61.8 Macro-F1 on binary detection and 61.7 Macro-F1 on five-class classification. In generation settings, several models frequently produce strongly validating or escalatory responses in emotionally charged situations. Our findings highlight substantial variation across model families and conversational behaviors, underscoring the importance of culturally grounded multilingual benchmarks for evaluating socially aligned conversational AI systems.

2606.10285 2026-06-10 cs.CL new submission

OpenRTLSet: A Fully Open-Source Dataset for Large Language Model-based Verilog Module Design

Jinghua Wang, Lily Jiaxin Wan, Sanjana Pingali, Scott Smith, Manvi Jha, Shalini Sivakumar, Xing Zhao, Kaiwen Cao, Deming Chen

Affiliations: UIUC-ChenLab

AI summary: Introduces OpenRTLSet, the largest fully open-source hardware design dataset, with more than 130,000 diverse Verilog code samples drawn from GitHub code and from VHDL and C/C++ translations; natural language descriptions generated with DeepSeek-R1 support fine-tuning of several language model families, showing that open-source approaches can achieve superior performance in hardware design.

Comments: Accepted by ICLAD'25

DOI: 10.1109/ICLAD65226.2025.00038
Journal ref: 2025 IEEE International Conference on LLM-Aided Design (ICLAD), Stanford, CA, USA, 2025, pp. 212-218

Abstract

OpenRTLSet introduces the largest fully open-source dataset for hardware design, offering over 131,000 diverse Verilog code samples to the research community and industry. Our dataset uniquely combines Verilog code from GitHub repositories (102k modules), VHDL translations (5k modules), and synthesizable C/C++ translations (24k modules), all freely accessible without proprietary restrictions. Using the reasoning model DeepSeek-R1, we generated paired natural language descriptions for each code sample, enabling fine-tuning of various language model families (e.g., Qwen and Granite) for Verilog code generation. Our dataset explores multiple options, including Verilator-generated C++ files as additional context during labeling, quantization techniques (INT4 vs. BF16), and performance differences across model sizes (7B-32B parameters). OpenRTLSet demonstrates that open-source approaches can achieve superior performance in hardware design tasks, establishing a new foundation for accessible research and commercial use in this domain.

2606.10315 2026-06-10 cs.CL cs.AI new submission

Catching One in Five: LLM-as-Judge Blind Spots in Production Multi-Turn Transaction Agents

Sawyer Zhang, Alexander Wang, Sophie Lei

Affiliations: Lumivate (Lumi)

AI summary: Studies how well the LLM judge in a deployed food-and-beverage ordering agent recalls real defects and finds that it catches only 22% of systematic problems in one batch, mainly because the scoring rubric lacks behavioural dimensions such as state tracking and because the routing mechanism misclassifies defects.

Comments: 13 pages, 1 figure, 5 tables

Abstract

LLM-as-judge is the default instrument for evaluating conversational agents, yet its reliability is almost always reported as agreement with human ratings, not recall of real defects. We study a deployed multi-turn food-and-beverage ordering agent and measure how many genuine quality problems its built-in LLM judge catches, using exhaustive human transcript review as ground truth. Across three batches the judge surfaces well under a quarter of human-confirmed systematic problems -- 2 of 9 patterns (22%) in one batch, and its operational gate flagged zero of 100 rounds in a batch where humans confirmed 23 distinct defects and 7 new cross-cutting patterns. Our blind-spot taxonomy shows the failure is structured, not random: the judge catches turn-local issues (a fabricated statistic, a wrong language) but misses cross-turn state issues (confirm-gate lockout, cart hallucination, escalation lockout, stale referents). The mechanism: the scoring rubric exposes only three coarse axes (intent, brand-voice, personalization) and has no category for the behavioural dimensions -- state-tracking, guardrails, recovery -- where most defects cluster. The failure is routing, not perception: 113 of 114 rounds whose raw judge note describes a confirm-gate or cart-state defect are scored "brand voice", and none reach an operational failure -- the gate is wired to hangs and hard assertions, not the rubric -- so the 0% is a routing-and-wiring failure, not blindness. The consequence for prevalence estimation is sharp: when the apparent defect rate is zero the Rogan-Gladen correction degenerates -- no signal can recover the true rate -- while where the gate reports a nonzero rate the same estimator implies a 3-6x undercount under our measured sensitivity. For production multi-turn agents, automated judging is a regression floor, not a substitute for human review.
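The Rogan-Gladen correction mentioned in the abstract estimates true prevalence as (apparent + specificity - 1) / (sensitivity + specificity - 1). The sketch below is illustrative only: the sensitivity of 2/9 borrows the abstract's "2 of 9 patterns" figure, while the perfect specificity and the 5% apparent rate are assumptions chosen for the example, not values reported by the paper.

```python
def rogan_gladen(apparent_rate, sensitivity, specificity=1.0):
    """Rogan-Gladen estimate of true prevalence from an apparent rate."""
    denom = sensitivity + specificity - 1.0
    if denom <= 0:
        raise ValueError("sensitivity + specificity must exceed 1")
    estimate = (apparent_rate + specificity - 1.0) / denom
    return min(max(estimate, 0.0), 1.0)   # clip to a valid proportion


sensitivity = 2 / 9   # judge recall borrowed from the abstract's 2-of-9 batch

# A zero apparent rate stays zero whatever the sensitivity: nothing to rescale.
print(rogan_gladen(0.0, sensitivity))

# A nonzero apparent rate is scaled up by 1 / sensitivity when specificity is 1.
print(rogan_gladen(0.05, sensitivity))
```

With these assumed inputs a 5% apparent defect rate corresponds to an estimated true rate of about 22.5%, a 4.5x undercount, which falls inside the 3-6x range the abstract reports.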

2606.10380 2026-06-10 cs.CL cs.AI new submission

Expert-Level Crisis Detection in Mental Health Conversations

Grace Byun, Abigail Lott, Rebecca Lipschutz, Sean T. Minton, Elizabeth A. Stinson, Jinho D. Choi

Affiliations: Department of Computer Science, Emory University; Department of Psychiatry and Behavioral Sciences, Emory University

AI summary: Introduces the CRADLE-Dialogue benchmark and the Alert-Confirm evaluation protocol for crisis detection in conversations, finds that models are weak at identifying when risk emerges, and releases a synthetic training corpus and a 32B-parameter model.

Abstract

Real-world crisis intervention is inherently conversational, yet existing research largely focuses on static texts. When applied to multi-turn dialogues, current models exhibit significant performance degradation, struggling to track risk signals that emerge as context evolves. To address this gap, we introduce CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in conversational settings. The dataset features 600 dialogues with multi-label annotations across clinically grounded risks, including suicide ideation, self-harm, and child abuse, distinguishing past from ongoing risk. We further propose an Alert-Confirm evaluation protocol that distinguishes early warning signals (Alert) from turns where a specific crisis becomes explicitly identifiable (Confirm), reflecting the clinical need to intervene before risk becomes explicit. Experiments show that identifying when risk emerges is much harder than recognizing that it exists: models achieve only mid-40% to high-60% Micro F1. Additionally, we release a synthetic training corpus and a 32B-parameter model that substantially outperforms existing open-source models and achieves competitive or superior results against proprietary models across turn-level, dialogue-level, and confirm-only evaluation settings.

2606.10400 2026-06-10 cs.CL cs.CV new submission

Do Vision-Language Models See or Guess? Measuring and Reducing Textual-Prior Reliance with a Phrasing-Controlled Benchmark

Pratham Singla, Shivank Garg, Vihan Singh, Paras Chopra

Affiliations: Lossfunk; Indian Institute of Technology Roorkee; Raeth AI

AI summary: Builds a 540-image benchmark with four phrasing variants per image to measure how much vision-language models rely on textual priors; every model degrades on the hardest variant, open models most of all, and a no-image ablation together with further analyses confirms genuine image dependence.

Comments: 17 pages, 7 figures, Submitted to EMNLP 2026

Abstract

Vision-language models (VLMs) are increasingly deployed where answers must follow from what is in the image, yet they often answer from textual priors, the question's phrasing together with memorized world knowledge, rather than from the image itself, which inflates benchmark scores and yields confident but ungrounded answers. Existing benchmarks rarely isolate this behavior, since each image is usually paired with a single fixed question. To measure the reliance, we build a 540-image benchmark across six reasoning categories and generate four question variants over the same images, so that phrasing rather than image content is the controlled variable. The hardest variant is written directly from the image to minimize text leakage. We benchmark eleven VLMs spanning small open-weight models to large closed-source systems: every model degrades on the hardest variant, and open models fall furthest. Our central diagnostic is a no-image ablation, which collapses the open-weight models to their text-only floor (1 to 9 percent). Three further analyses, LLM-rated difficulty, low base-to-final textual similarity, and human re-annotation, corroborate genuine image-dependence. In-context exemplars that match how a variant was built recover the most accuracy, and GRPO post-training of a small VLM yields consistent gains across all four variants that transfer to a held-out out-of-distribution set. Textual-prior reliance is measurable and partly trainable away.

2606.10460 2026-06-10 cs.CL cs.AI new submission

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

Haonan Wang, Jiaxiang Liu, Yurong Liu, Austin Senna Wijaya, Tianle Zhou, Eden Wu, Yijia Chen, Wanting You, Reya Vir, Daniela Pinto, Grace Fan, Yusen Zhang, Juliana Freire, Eugene Wu

Affiliations: Columbia University; New York University; Barnard College

AI summary: Introduces the LakeQA benchmark, which requires LLMs to search a 9.5 TB heterogeneous data lake and perform multi-hop reasoning; GPT-5.2 reaches only 18.37% exact match, showing how challenging the task is.

Abstract

Recent large language models (LLMs) have shown rapid progress in reading-based question answering (QA), where evidence is explicitly provided or can be trivially retrieved. In contrast, real-world questions are often not paired with accurate evidence documents. The useful evidence resides in massive data lakes, making search a prerequisite for answering. However, there is a lack of comprehensive benchmarks that require both searching and reasoning over large data lakes. To this end, we introduce LakeQA, a comprehensive benchmark for search-centric question answering over data lakes that jointly emphasizes searching and reasoning capabilities. LakeQA is built on a heterogeneous collection of approximately 9.5 TB of text resources from Wikipedia and open-source government data, spanning structured and unstructured data. To ensure task quality, each sample is annotated by at least one Ph.D.-level expert. Each task requires long-horizon multi-hop reasoning with implicit intermediate steps: agents need to discover the correct documents and then compose evidence across sources to produce the answer. Experimental results on seven frontier LLMs demonstrate that LakeQA is challenging. For instance, GPT-5.2 achieves only an exact-match score of 18.37% on LakeQA. Overall, LakeQA provides a realistic testbed for developing LLM agents that can both find and analyze data in modern data lakes.

2606.10554 2026-06-10 cs.CL cs.AI new submission

Benchmarking Knowledge Editing using Logical Rules

Tatiana Moteu Ngoli, NDah Jean Kouagou, Hamada M. Zahera, Axel-Cyrille Ngonga Ngomo

Affiliations: Data Science Group, Heinz Nixdorf Institute, Paderborn University

AI summary: Proposes a benchmark based on logical rules that evaluates how knowledge editing methods handle the logical consequences of a single edit, and finds that existing methods lose up to 24% in performance on entailed knowledge.

Comments: Accepted at the 24th International Semantic Web Conference 2025

DOI: 10.1007/978-3-032-09530-5_3
Journal ref: The Semantic Web. ISWC 2025. ISWC 2025. Lecture Notes in Computer Science, vol 16141. Springer, Cham

Abstract

Large Language Models (LLMs) are increasingly deployed in real-world applications that require access to up-to-date knowledge. However, retraining LLMs is computationally expensive. Therefore, knowledge editing techniques are crucial for maintaining current information and correcting erroneous assertions within pre-trained models. Current benchmarks for knowledge editing primarily focus on recalling edited facts, often neglecting their logical consequences. To address this limitation, we introduce a new benchmark designed to evaluate how knowledge editing methods handle the logical consequences of a single fact edit. Our benchmark extracts relevant logical rules from a knowledge graph for a given edit. Then, it generates multi-hop questions based on these rules to assess the impact on logical consequences. Our findings indicate that while existing knowledge editing approaches can accurately insert direct assertions into LLMs, they frequently fail to inject entailed knowledge. Specifically, experiments with popular methods like ROME and FT reveal a substantial performance gap, up to 24%, between evaluations on directly edited knowledge and on entailed knowledge. This highlights the critical need for semantics-aware evaluation frameworks in knowledge editing.

2606.10657 2026-06-10 cs.CL new submission

Are We Evaluating Knowledge or Phrasing? Mitigating MCQA Sensitivity with ParaEval

João Maria Janeiro, Mathurin Videau, Andrea Caciolai, Benjamin Piwowarski, Patrick Gallinari, Loic Barrault

Affiliations: FAIR at Meta; Sorbonne Université, CNRS, ISIR, F-75005 Paris, France; Criteo AI Lab, Paris, France

AI summary: To address the sensitivity of multiple-choice benchmarks to answer phrasing, the paper proposes ParaEval, a framework that uses multiple paraphrases of each option and scores the most favorable one, reducing the spurious performance gap from over 2 points to below 1 point and thereby assessing models' true capability.

Abstract

Multiple-choice (MCQA) benchmarks are the standard for evaluating pretrained large language models, but their reliance on log-likelihood scoring makes them unreliable. Specifically, standard scores are highly sensitive to the exact phrasing (surface form) of the answers, conflating a model's familiarity with a specific phrase with its actual capability. We demonstrate this flaw using a controlled testbed of 1B-8B models trained on the same knowledge. Despite having identical knowledge, standard metrics falsely report a performance gap of over 2 points. To solve this, we propose ParaEval, an evaluation framework that queries models using multiple paraphrases per answer option. By scoring each model based on its most favorable phrasing, ParaEval successfully reduces the false performance gap to below 1 point. We confirm that these evaluation artifacts, and the improvements from ParaEval, persist in frontier 70B and 120B open-source models. Ultimately, ParaEval provides a robust and efficient way to evaluate true underlying capability rather than surface-form familiarity.
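The scoring rule described in the abstract, taking each option's most favorable phrasing, can be sketched as a maximum over paraphrase log-likelihoods followed by an argmax over options. This is a hedged illustration rather than the ParaEval implementation: `paraeval_predict` and the `loglik` callable are hypothetical names, the toy scores are made up, and any length normalization the paper may apply is not modeled.

```python
def paraeval_predict(option_paraphrases, loglik):
    """Pick the option whose best-scoring paraphrase has the highest log-likelihood.

    option_paraphrases : dict mapping option id -> list of paraphrased answers
    loglik             : callable returning a log-likelihood for a string
                         (hypothetical stand-in for a real model call)
    """
    best_per_option = {
        option: max(loglik(text) for text in texts)
        for option, texts in option_paraphrases.items()
    }
    return max(best_per_option, key=best_per_option.get)


# Made-up log-likelihoods standing in for a model; the first phrasing of the
# correct option "A" happens to be unfamiliar to this toy model.
toy_scores = {
    "Paris": -5.0, "The city of Paris": -1.0,
    "Lyon": -2.0, "The city of Lyon": -3.0,
}
options = {"A": ["Paris", "The city of Paris"], "B": ["Lyon", "The city of Lyon"]}

single_phrasing = {opt: texts[:1] for opt, texts in options.items()}
print(paraeval_predict(single_phrasing, toy_scores.get))  # first phrasing only
print(paraeval_predict(options, toy_scores.get))          # best phrasing per option
```

In the toy data, scoring only the first phrasing selects option B, while scoring the best phrasing per option selects A, which is the kind of surface-form effect the abstract describes.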

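2606.10765 2026-06-10 cs.CL new submission

ArabiGEE: A Hierarchical Taxonomy for Arabic Grammatical Error Explanation

Khaled Elhady, Omar Kallas, Nizar Habash, Bashar Alhafni

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; New York University Abu Dhabi

AI summary: Proposes the first hierarchical taxonomy for Arabic grammatical error explanation based on explicit error types, covering four dimensions (orthography, morphology, syntax, and lexis) with 27 error types, 140 correction types, and 324 explanations, and uses it to manually annotate existing corpora in support of automatic evaluation of large language models.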

2606.11070 2026-06-10 cs.CL cs.AI new submission

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

Genta Indra Winata, Amartya Chakraborty, Yuzhen Lin, Swasthi P Rao, Shikhhar Siingh, Houhan Lu, Nadia Bathaee, Sriharsha Hatwar, Paresh Dashore, Anmol Jain, Kshitij Tayal, Xiuzhu Lin, Anirban Das, Sambit Sahu, Shi-Xiong Zhang

Affiliations: Capital One

AI summary: Introduces T1-Bench, a high-fidelity, comprehensive benchmark for evaluating agentic systems in realistic multi-domain customer scenarios, using interleaved multi-turn interaction tasks to raise complexity and evaluative rigor.

Comments: Preprint

Abstract

Recent advances in reasoning and tool-calling capabilities of large language models (LLMs) have enabled increasingly capable agentic systems. However, existing benchmarks remain limited in task complexity, realism, and domain diversity, and often fail to capture interactions that span multiple domains, limiting their ability to evaluate agents in realistic multi-step settings that require sustained reasoning and coordination. To address these limitations, we introduce T1-Bench, a high-fidelity, comprehensive benchmark for evaluating agentic systems in realistic customer-facing, multi-domain environments, featuring interleaved scenarios that require structured reasoning across multi-turn user-assistant interactions and substantially increasing both compositional complexity and evaluative rigor across 25 domains of varying difficulty. We evaluate T1-Bench using 12 proprietary and open-weight models, providing a reproducible and standardized framework for assessing agent behavior, tool utilization, and conversational quality in complex, multi-step environments. We further complement automatic evaluation with human judgments to strengthen the assessment of qualitative performance. Overall, T1-Bench substantially advances prior benchmarks by increasing task complexity, interaction depth, and domain coverage in simulated multi-domain environments. To facilitate future research on agentic systems, we will publicly release data and evaluation code as open source.

2606.11079 2026-06-10 cs.CL new submission

VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation

Yunan Lu, Ryan Shea, Yusen Zhang, Zhou Yu

Affiliations: Department of Computer Science, Columbia University; Arklex.ai

AI summary: Proposes the VISTA toolkit, which uses six metrics and a hybrid user simulator (UI plus API) to make agent evaluation more realistic and comprehensive, and validates it in e-commerce and education scenarios.

Abstract

Evaluation remains a critical bottleneck for interactive agent development. Existing evaluation methods often rely on static benchmarks, which fail to capture the dynamic, multi-step nature of agentic behavior and struggle to expose meaningful failure modes. While user-simulation-based evaluation offers a promising alternative, existing simulation frameworks suffer from two major limitations. First, they provide limited mechanisms for evaluating the quality and comprehensiveness of simulated interactions, making it difficult to assess whether a simulator sufficiently explores an agent's capabilities and failure modes. Second, most frameworks are restricted to either UI-only actions or API-only actions, limiting their ability to model the full range of realistic user behaviors. To address these limitations, we propose VISTA, a Versatile Interactive user Simulation Toolkit for Agent evaluation. Our toolkit includes a suite of six metrics for measuring the realism, capability coverage, and interaction effectiveness of simulated interactions. In addition, we develop a hybrid user simulator that integrates both UI-based interactions and API-based interactions, enabling more realistic and comprehensive evaluation across diverse interactive environments. We evaluate VISTA in e-commerce shopping and education customer service settings and demonstrate that it produces more realistic and comprehensive evaluations than existing methods.

2606.11082 2026-06-10 cs.CL cs.CY new submission

The Shibboleth Effect: Auditing the Cross-Lingual Distributional Skew of Large Language Models

Hakan Mehmetcik

Affiliations: Kellogg Institute for International Studies, University of Notre Dame; Marmara University

AI summary: Using a multi-agent geopolitical wargame, this study finds that frontier LLMs exhibit behavioral skew across languages and that the effect depends on model architecture and training regime rather than being a universal property of Western-origin models.

Comments: 25 pages, 2 figures, 6 tables, Research Article

Abstract

This study investigates cross-lingual distributional skew (the Shibboleth Effect) in frontier large language models (LLMs) subjected to sustained adversarial conditions. We develop a multi-agent geopolitical wargame, the Cerulean Sea Crisis, a synthetic maritime territorial dispute designed to mirror the structural dynamics of Eastern Mediterranean conflicts. Six frontier models (GPT-4o, Llama-4, Mistral-Large, Gemini-3.1-Pro, Qwen3.6-Plus, and DeepSeek-R1) participate in a between-groups experiment (N = 10 games per arm, K = 5 rounds per game) in which the sole manipulation is the language of play (English versus Turkish), producing 586 validated statements. A zero-shot classifier assesses behavioral dispositions along two continuous dimensions: Concession Rate and Coercive Rhetoric. The results are heterogeneous. Llama-4 shows a substantial, Holm-corrected increase in coercive rhetoric under Turkish (delta = +0.800, p = .002), whereas Gemini-3.1-Pro displays an equally large decrease (delta = -0.750, p = .005). DeepSeek-R1 exhibits a similar negative shift (delta = -0.860, p = .006) and provides chain-of-thought evidence consistent with a buffering mechanism. GPT-4o shows no detectable effect (delta = +0.130, p = .614). These findings indicate that cross-lingual behavioral skew is contingent on model architecture and training regime rather than a universal property of Western-origin LLMs. We identify two distinct buffering mechanisms, chain-of-thought institutional anchoring and multilingual RLHF alignment, and discuss their implications for integrating LLMs safely into diplomatic and crisis-management settings.
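For readers unfamiliar with the Holm correction that the abstract applies to its per-model comparisons, the sketch below computes Holm-Bonferroni adjusted p-values for a family of tests. The function name `holm_adjust` and the input p-values are illustrative assumptions; they are not the paper's raw p-values, which the abstract does not report.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Scale the (rank + 1)-th smallest p-value by the hypotheses remaining.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)   # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Illustrative raw p-values for a family of four tests.
print(holm_adjust([0.01, 0.04, 0.03, 0.005]))
```

For these inputs the adjusted values are approximately 0.03, 0.06, 0.06, and 0.02: the smallest raw p-value is multiplied by 4, the next by 3, and so on, with later values never allowed to fall below earlier ones.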

2606.11105 2026-06-10 cs.CL cs.AI 新提交

PhantomBench: Benchmarking the Non-existential Threat of Language Models

PhantomBench: 对语言模型非存在性威胁的基准测试

Haeji Jung, Hila Gonen

发表机构 * University of British Columbia（不列颠哥伦比亚大学）； Canada CIFAR AI Chair, Amii（加拿大CIFAR人工智能讲席，阿尔伯塔机器智能研究所）

AI总结提出PhantomBench，首个大规模非存在概念基准，包含6万多个虚构术语和实体，评估21个模型，发现部分设置下平均幻觉率高达86.7%，前沿模型也常常无法对不存在的概念弃权。

AI中文摘要

幻觉，即语言模型生成缺乏事实依据的回答，会带来严重风险，因为用户往往盲目依赖这些回答。在高风险领域，这种模型行为可能造成重大伤害。尽管在理解幻觉方面已取得显著进展，但这些模型能在多大程度上可靠地识别自身知识的边界仍不清楚。我们引入了PhantomBench，这是首个此类大规模基准，包含6万多个由不同领域真实概念派生出的不存在术语和实体。使用该基准，我们评估了不同类型和规模的共21个模型。我们发现幻觉率普遍高得惊人（在某些情况下平均高达86.7%），并注意到即使是前沿模型也无法对不存在的概念弃权，尤其是当输入预设这些概念存在时。随后，我们展示了PhantomBench可以作为代理，用于研究模型在罕见概念上的行为，而模型在这些概念上更容易产生幻觉。我们还提供了构建PhantomBench的流程，使研究人员和从业者能够根据特定需求可扩展地生成不存在的概念。

英文摘要

Hallucinations, where language models (LMs) generate factually ungrounded responses, pose serious risks, as users tend to blindly rely on them. This is particularly concerning in high-stakes domains, where consequences of such model behavior can lead to significant harms. Despite notable progress in understanding hallucinations, it remains unclear how reliably these models can recognize the limits of their knowledge. We introduce PhantomBench, the first large-scale benchmark of its kind, comprising more than 60K non-existent terms and entities derived from real concepts across diverse domains. Using our benchmark, we evaluate a total of 21 models of various types and sizes. We show staggering hallucination rates across the board (with average rates as high as 86.7% in some cases), and note that even frontier models surprisingly fail to abstain on non-existent concepts, especially when the input presumes their existence. We then show that PhantomBench can serve as a proxy for studying model behavior on rare concepts for which models are more prone to hallucinate. We also provide a pipeline to construct PhantomBench, enabling scalable generation of non-existent concepts tailored to the specific needs of researchers and practitioners.

2606.11127 2026-06-10 cs.CL cs.AI 新提交

Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

基于来源的门控与自适应恢复在合成后训练数据筛选中的应用

Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, Pratinav Seth

发表机构 * Lexsi Labs

AI总结研究合成后训练数据筛选中的来源证据门控与样本自适应恢复，提出结合故障诊断与定向再生成的自适应恢复流水线，提高产量、恢复率和注入召回率。

AI中文摘要

合成后训练流水线通常使用奖励模型或整体式LLM评判器过滤生成的样本，但有两个问题很少被放在一起考察：过滤信号是否以引发每次生成的来源证据为依据，以及被拒绝的样本能否被系统性地恢复而非永久丢弃。我们利用对抗性注入语料库提供真实故障标签，在不同门控配置、恢复策略和生成器规模下对这两个问题进行了受控研究。我们发现，精确的来源出处能改善较强评判器的忠实度门控；幻觉门控和奖励门控拒绝的样本群体基本不重叠，因此两者都是必要的；结合故障诊断与定向再生成的自适应恢复流水线比简单重采样实现了更高的产出率、恢复率和注入召回率。下游微调质量主要由生成器规模驱动，过滤和恢复条件的贡献虽有意义，但居于次要地位。

英文摘要

Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced each generation, and whether rejected samples can be systematically recovered rather than permanently discarded. We present a controlled study of both questions across gate configurations, recovery strategies, and generator scales, using adversarially injected corpora to provide ground-truth failure labels. We find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint sample populations making both necessary, and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery conditions contributing meaningfully but secondarily.
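The claim that the hallucination gate and the reward gate reject largely disjoint populations can be checked with simple set arithmetic over rejected sample IDs. The sketch below is illustrative only: `gate_overlap` and the sample IDs are hypothetical, and the abstract does not state which overlap statistic the authors used (Jaccard similarity is assumed here).

```python
def gate_overlap(rejected_by_a, rejected_by_b):
    """Summarize how much two gates' rejection sets overlap.

    Both arguments are iterables of sample identifiers rejected by each gate.
    """
    a, b = set(rejected_by_a), set(rejected_by_b)
    union = a | b
    both = a & b
    return {
        # Jaccard similarity: 0.0 means fully disjoint, 1.0 means identical sets.
        "jaccard": len(both) / len(union) if union else 0.0,
        "only_a": len(a - b),
        "only_b": len(b - a),
        "both": len(both),
    }


# Hypothetical sample IDs, for illustration only.
hallucination_rejects = ["s1", "s2", "s3"]
reward_rejects = ["s3", "s4"]
summary = gate_overlap(hallucination_rejects, reward_rejects)
```

A Jaccard value near zero with large `only_a` and `only_b` counts would be consistent with the abstract's argument that neither gate can replace the other.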

2606.09843 2026-06-10 cs.HC cs.AI cs.CL 交叉投稿

An LLM-Native Psychometric Instrument Does Not Predict LLM Behavior: Evidence Across 25 Models

一个原生LLM的心理测量工具不能预测LLM行为：来自25个模型的证据

Juan Manuel Contreras

发表机构 * Independent Researcher（独立研究员）

AI总结通过探索性因子分析从LLM行为中自下而上构建心理测量工具，发现LLM的自我报告与观察到的行为无关，并揭示自我报告与LLM评判器之间共享方差这一混淆因素。

AI中文摘要

大型语言模型（LLM）在人格量表上产生稳定的自我报告，但这些自我报告并不能预测观察到的行为。这一差距究竟反映了LLM与人类特质构念之间的不匹配，还是LLM自我报告本身更深层的属性，此前一直悬而未决。我们构建了第一个心理测量工具，其构念通过探索性因子分析（EFA）从LLM的行为可供性中自下而上推导而来。我们对来自17个模型家族的25个LLM施测了300个项目（240个直接李克特项目+60个基于场景的项目），涵盖12个候选行为维度，每个项目施测30次。EFA得到一个5因子结构（响应性、顺从性、大胆性、戒备性和冗长性），具有极好的分半可重复性（所有Tucker φ ≥ .957）和内部一致性（所有α ≥ .930）。为了检验预测效度，我们收集了2500个开放式行为样本，由151名人类评分者和一个由三个LLM评判器组成的集成进行评分。人类评分与评判器评分一致（r̄ = .51），但两者均与自我报告无关：自我报告与人类评分的r̄ = -.01，自我报告与评判器评分的r̄ = .13，且没有任何因子水平的自我报告与人类评分相关的置信区间排除零。在响应性上，自我报告与LLM评判器相关（r = .53），但与人类评分不相关（r = .04），尽管人类与评判器彼此一致（r = .59）。这表明自我报告项目与LLM评判器共享了人类观察者所没有的方差，而这一混淆因素在集成内部的可靠性检查中是不可见的。我们将该工具作为诊断探针发布，用于检测受对齐塑造的自我描述，它同时也揭示了LLM-as-judge流程的一个具体风险因素。

英文摘要

Large language models (LLMs) produce stable self-reports on personality inventories, but these self-reports do not predict observed behavior. Whether this gap reflects a mismatch between LLMs and human trait constructs, or a deeper property of LLM self-report itself, has been unresolved. We constructed the first psychometric instrument whose constructs are derived bottom-up from LLM behavioral affordances via exploratory factor analysis (EFA). We administered 300 items (240 direct Likert + 60 scenario-based) spanning 12 candidate behavioral dimensions to 25 LLMs across 17 model families, each item administered 30 times. EFA yielded a 5-factor structure -- Responsiveness, Deference, Boldness, Guardedness, and Verbosity -- with excellent split-half replicability (all Tucker $ϕ\geq .957$) and internal consistency (all $α\geq .930$). To test predictive validity, we collected 2,500 open-ended behavioral samples rated by 151 human raters and a three-judge LLM ensemble. Human and judge ratings agreed ($\bar{r} = .51$), but neither tracked self-report: self-report--human $\bar{r} = -.01$, self-report--judge $\bar{r} = .13$, with no factor-level self-report--human CI excluding zero. On Responsiveness, self-report correlated with LLM judges ($r = .53$) but not humans ($r = .04$), even though humans and judges agreed ($r = .59$) -- indicating self-report items and LLM judges share variance that human observers do not, a confound invisible to within-ensemble reliability checks. We release the instrument as a diagnostic probe for alignment-shaped self-description and a concrete risk factor for LLM-as-judge pipelines.
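Two of the reliability statistics named in the abstract have compact textbook definitions: Tucker's congruence coefficient φ (used for split-half factor replicability) and Cronbach's α (internal consistency). The sketch below implements the standard formulas on toy data. The inputs are made up for illustration, and nothing here reproduces the authors' EFA pipeline.

```python
import math
from statistics import variance


def tucker_phi(loadings_x, loadings_y):
    """Tucker's congruence coefficient between two factor-loading vectors."""
    numerator = sum(x * y for x, y in zip(loadings_x, loadings_y))
    denominator = math.sqrt(
        sum(x * x for x in loadings_x) * sum(y * y for y in loadings_y)
    )
    return numerator / denominator


def cronbach_alpha(items):
    """Cronbach's alpha.

    `items` holds one list of scores per item, aligned by respondent
    (here a respondent would be one model administration).
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent sum score
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Toy data (assumption): loadings from two random halves, and three Likert items.
phi = tucker_phi([0.8, 0.7, 0.1], [0.75, 0.72, 0.15])
alpha = cronbach_alpha([[1, 2, 4, 5], [2, 2, 4, 5], [1, 3, 4, 4]])
```

Values of φ above roughly .95 are conventionally read as factor equality, which is the benchmark the reported φ ≥ .957 clears.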

2606.09890 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

PreAct-Bench: Benchmarking Predictive Monitoring in LLMs

PreAct-Bench：大语言模型中的预测性监控基准

Hainiu Xu, Italo Luis da Silva, Jiangnan Ye, Yuhao Wang, Wei Liu, Linyi Yang, Jonathan Richard Schwarz, Nicola Paoletti, Yulan He, Hanqi Yan

发表机构 * King’s College London（伦敦国王学院）； National University of Singapore（新加坡国立大学）； Southern University of Science and Technology（南方科技大学）； Thomson Reuters Foundational Research（汤姆森路透基础研究）； Imperial College London（伦敦帝国学院）； The Alan Turing Institute（艾伦·图灵研究所）

AI总结提出预测性监控任务，在动作执行前判断是否会导致不道德行为，并构建PreActBench基准，评估多种模型发现该任务具有挑战性。

AI中文摘要

大语言模型（LLMs）越来越多地被部署为能够执行多步动作轨迹以实现给定目标的自主代理。虽然现有的安全研究集中于从完整轨迹中检测不道德行为，但这种范式本质上是回顾性的：它仅在伤害已经发生后识别伤害。在这项工作中，我们研究了一个关键但被忽视的安全任务，我们称之为预测性监控：仅给定部分动作轨迹，模型能否在执行公开动作之前推断出它是否会以不道德行为告终？为了支持这一任务，我们提出了PreActBench，一个包含1000个跨五个领域的成对道德和不道德动作轨迹的基准。我们使用我们的前缀远见F1指标，在动作轨迹的不同部分上评估了一系列LLMs、安全护栏模型和潜在探测方法。结果表明，尽管人类取得了有希望的性能，但即使对于强模型，预测性监控仍然具有挑战性，突显了在LLM安全中需要面向未来的风险推理。

英文摘要

Large language models (LLMs) are increasingly deployed as autonomous agents capable of executing multi-step action trajectories toward a given objective. While existing safety research has focused on detecting unethical behavior from complete trajectories, this paradigm is fundamentally retrospective: it identifies harm only after it has already occurred. In this work, we study a critical yet overlooked safety task, which we term Predictive Monitoring: given only a partial action trajectory, can a model infer whether it will culminate in an unethical action before the overt action is executed? To support this task, we present PreActBench, a benchmark of 1,000 paired ethical and unethical action trajectories spanning five domains. We evaluate a range of LLMs, safety guardrail models, and latent probing methods across varying fractions of the action trajectory using our Prefix Foresight F1 metric. Results show that while humans achieve promising performance, predictive monitoring remains challenging even for strong models, highlighting the need for future-oriented risk reasoning in LLM safety.

2606.10156 2026-06-10 cs.IR cs.AI cs.CL 交叉投稿

$τ$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

$τ$-Rec：面向智能推荐系统的可验证基准

Bharath Sivaram Narasimhan, Karthik R Narasimhan

发表机构 * Independent Researcher（独立研究员）； Princeton University（普林斯顿大学）

AI总结针对多轮对话式智能推荐系统评估中主观性强、成本高的问题，提出$τ$-Rec基准，通过可验证奖励和揭示标记引导机制，结合pass^k可靠性指标，系统评估模型推理一致性，发现当前最佳模型可靠性仅约57%。

AI中文摘要

随着推荐系统向智能体式的多轮对话界面转变，评估范式难以跟上步伐。当前的基准通常依赖“LLM作为评判者”的评估，这会引入主观性、高成本和不一致性。我们提出了$τ$-Rec，一个面向智能体推荐系统的基准，它用可验证奖励取代主观评估，并采用揭示标记引导（RTE）机制来控制任务约束在对话中如何呈现。通过针对结构化目录谓词测试智能体，并采用pass^k可靠性指标，$τ$-Rec为一致性推理提供了系统性测试。我们对五个模型家族（GPT-5.4、Claude Sonnet 4.6、Gemini 2.5 Flash、DeepSeek V4 Flash、Qwen3-32B和GPT-5 mini）的九种配置进行了评估，结果揭示了一个陡峭的可靠性悬崖：即使是最好的模型在pass^1上也仅达到约57%，在pass^4上约38%，凸显了当前对话智能体部署中的关键差距。所有代码和数据均公开于 https://github.com/nbharaths/tau-rec 。

英文摘要

As recommender systems transition toward agentic, multi-turn conversational interfaces, evaluation paradigms have struggled to keep pace. Current benchmarks often rely on "LLM-as-a-judge" evaluations, which introduce subjectivity, high costs and inconsistency. We present $τ$-Rec, a benchmark for agentic recommender systems that replaces subjective evaluation with verifiable rewards and a reveal-tagged elicitation (RTE) mechanism that controls how task constraints surface during dialogue. By testing agents against structured catalog predicates and employing a pass^k reliability metric, $τ$-Rec provides a systematic test for consistent reasoning. Our evaluation of nine configurations across five model families -- GPT-5.4, Claude Sonnet 4.6, Gemini 2.5 Flash, DeepSeek V4 Flash, Qwen3-32B and GPT-5 mini -- reveals a steep reliability cliff, where even the best model achieves only ~57% at pass^1 and ~38% at pass^4, highlighting a critical gap in current conversational agent deployment. All code and data are publicly available at https://github.com/nbharaths/tau-rec.
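The abstract does not spell out how pass^k is computed. The sketch below assumes the definition commonly used for this metric in τ-bench-style evaluations: the probability that all k trials drawn without replacement from n recorded trials succeed, averaged over tasks. The function name and the toy success counts are hypothetical.

```python
from math import comb


def pass_hat_k(success_counts, n_trials, k):
    """Estimate pass^k: the chance that k independent attempts ALL succeed.

    success_counts[i] is the number of successful trials (out of n_trials)
    on task i. Per task, C(c, k) / C(n, k) is the fraction of size-k subsets
    of the recorded trials that contain only successes.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    per_task = [comb(c, k) / comb(n_trials, k) for c in success_counts]
    return sum(per_task) / len(per_task)


# Toy example (assumption): three tasks, four trials each.
counts = [4, 2, 0]
reliability_curve = {k: pass_hat_k(counts, n_trials=4, k=k) for k in (1, 2, 4)}
```

Because a single failure among the k attempts zeroes out a subset, pass^k can only stay flat or fall as k grows, which is the "reliability cliff" shape the abstract describes between pass^1 and pass^4.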

2606.10254 2026-06-10 cs.AI cs.CL 交叉投稿

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

RealMath-Eval：为何SOTA裁判难以应对真实人类推理

Yiteng Mao, Kenan Xu, Yijia Lyu, Wenhao Li, Jianlong Chen, Xiangfeng Wang

发表机构 * University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； East China Normal University（华东师范大学）； New York University（纽约大学）； Tongji University（同济大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结提出RealMath-Eval基准，评估LLM裁判对真实学生数学解答的评分能力，发现与人类评分存在高均方误差，而合成数据上表现更好，揭示评估差距源于人类错误空间的多样性和高信息熵。

Comments Code available at https://github.com/RicharMd/RealMath-Eval , Data available at https://huggingface.co/datasets/RicharMd/RealMath-Eval

AI中文摘要

尽管大型语言模型（LLM）在解答高中数学题方面已接近完美，但它们评估真实学生多样化推理过程的能力仍未得到充分检验。为弥补这一差距，我们引入了RealMath-Eval，一个经过严格标注的基准，包含224份来自高中的真实考试答卷。我们的初步评估显示，即使是最先进的LLM裁判在此任务上也表现不佳，相对于人类专家评分的均方误差较高（约2.96）。为探究可能的原因，我们将这一表现与对照设置进行对比，在该设置中同一批裁判评估由LLM生成的合成解答。我们识别出一个明显的“评估差距”：裁判在合成文本上的准确性和一致性显著更高（MSE约1.17），但难以泛化到真实学生的推理。通过语义嵌入分析，我们发现合成错误会“结构坍缩”到可预测的低维线性子空间，而人类错误则构成更多样的错误空间。此外，生成概率探测表明，人类推理具有显著更高的信息论惊异度（surprisal），说明学生的推理转换对当前模型而言更偏离训练分布。最后，我们发现表层的风格迁移无法弥合这一差距。我们的发现表明，当前严重依赖合成数据的LLM评估流程可能无法充分捕捉真实学生数学推理的多样性。

英文摘要

While Large Language Models (LLMs) have achieved near-perfect performance in solving high-school mathematics, their ability to evaluate the diverse reasoning processes of real human students remains under-examined. To bridge this gap, we introduce RealMath-Eval, a rigorously annotated benchmark of 224 real-world exam responses from high schools. Our initial evaluation reveals that even state-of-the-art LLM judges struggle significantly on this task, exhibiting a high Mean Squared Error (~2.96) against expert human grading. To probe a plausible explanation, we contrast this performance with a control setting where the same judges evaluate synthetic LLM-generated solutions. We identify a stark "Evaluation Gap": judges are considerably more accurate and consistent on synthetic text (MSE ~1.17) but struggle to generalize to authentic student reasoning. Through semantic embedding analysis, we find that synthetic errors suffer from a "structural collapse" into predictable, low-dimensional linear subspaces, whereas human errors form a more diverse error space. Furthermore, generative probability probes suggest that human reasoning involves significantly higher information-theoretic surprisal, indicating that student reasoning transitions are more out-of-distribution for current models. Finally, we find that surface-level style transfer fails to close this gap. Our findings suggest that current LLM evaluation pipelines relying heavily on synthetic data may not adequately capture the diversity of authentic student mathematical reasoning.
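The two quantities the abstract leans on, mean squared error against human grades and information-theoretic surprisal, are both short formulas. The sketch below shows them on toy inputs. The score scale, the token probabilities, and both function names are assumptions for illustration, not the paper's evaluation code.

```python
import math


def mean_squared_error(judge_scores, human_scores):
    """Average squared gap between LLM-judge scores and human expert scores."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    squared = [(j - h) ** 2 for j, h in zip(judge_scores, human_scores)]
    return sum(squared) / len(squared)


def mean_surprisal_bits(token_probabilities):
    """Average surprisal, -log2 p, over the probabilities a model assigned to each token."""
    return sum(-math.log2(p) for p in token_probabilities) / len(token_probabilities)


# Toy inputs (assumption): grades on an arbitrary point scale, made-up token probabilities.
mse = mean_squared_error([3, 5], [4, 7])
surprisal = mean_surprisal_bits([0.5, 0.25])
```

Higher average surprisal on student-written steps than on synthetic ones would be the signal the abstract interprets as student reasoning being more out-of-distribution.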

2606.10281 2026-06-10 cs.CR cs.CL 交叉投稿

Benchmarking and Exploring the Capabilities of LLMs for Attack Investigations

基准测试与探索LLM在攻击调查中的能力

Aniket Anand, Yiwei Hou, Daniel Fields, Alex Kantchelian, David Tao, Kurt Thomas, Grant Ho

发表机构 * University of Chicago（芝加哥大学）； University of California, Berkeley（加州大学伯克利分校）； Google（谷歌）

AI总结提出AuditBench基准数据集，评估LLM在安全审计日志分析中的性能，涵盖四种常见调查任务，揭示模型在不同设计选择下的表现差异与错误类型。

AI中文摘要

本文提出了AuditBench，一个新的基准数据集，用于评估LLM在调查安全相关系统审计日志方面的能力。我们设计并使用该基准来探索LLM在事件响应团队通常执行的四种日志调查任务上的表现，范围从对检测器生成的警报进行分类到识别受损系统上的持久性机制。AuditBench包含从Linux和Windows机器收集的系统审计日志，涵盖50多种不同的安全调查场景，包括恶意和良性活动。利用我们的基准，我们评估并分析了五个前沿LLM在分析审计日志以进行攻击调查方面的性能。我们的分析揭示了LLM性能和错误概况如何根据不同的设计选择而变化，例如模型大小、数据表示、提示构建和特定调查任务的差异。此外，我们描述了LLM生成的解释质量以及模型在我们的基准中犯的错误类型。总的来说，我们的工作为评估LLM调查安全日志的能力提供了基础，为在安全运营中使用LLM的从业者提供了新颖的见解，并为未来研究指明了重要方向。

英文摘要

This paper presents AuditBench, a new benchmark dataset for evaluating the capabilities of LLMs at investigating security-related system audit logs. We design and use this benchmark to explore the performance of LLMs on four log-investigation tasks that incident response teams commonly perform, ranging from triaging alerts generated by detectors to identifying persistence mechanisms on compromised systems. AuditBench consists of system audit logs collected from Linux and Windows machines, and spans over 50 different security investigation scenarios, including both malicious and benign activity. Using our benchmark, we evaluate and analyze the performance of five frontier LLMs at analyzing audit logs for attack investigations. Our analysis illuminates how LLM performance and error profiles vary according to different design choices, such as differences in model size, data representation, prompt construction, and specific investigation tasks. Additionally, we characterize the quality of the explanations produced by LLMs and the types of errors that models make across our benchmark. Collectively, our work provides a foundation for assessing the capabilities of LLMs for investigating security logs, novel insights for practitioners using LLMs in security operations, and important directions for future research.

2606.10287 2026-06-10 cs.LG cs.CL 交叉投稿

When Metrics Disagree: A Meta-Analysis of Knowledge-Graph-Completion Model Benchmarking

当指标不一致时：知识图谱补全模型基准测试的元分析

Haji Gul, Ajaz Ahmad Bhat

发表机构 * School of Digital Science, Universiti Brunei Darussalam（文莱达鲁萨兰大学数字科学学院）

AI总结针对KGC模型评估中指标冲突问题，提出多准则决策框架，通过元分析发现Z-score是最平衡的聚合器，并识别出不同预测任务下的最优模型。

AI中文摘要

评估知识图谱补全（KGC）模型仍然具有挑战性，因为标准评估依赖于孤立的基于排名的指标，如MRR、Hits@k和Mean Rank，这些指标在不同数据集上常常给出相互冲突的模型排序。一个在MRR上领先的模型可能在Hits@1上落后，而在一个数据集上的强性能未必能推广到另一个数据集。这种碎片化阻碍了比较，为选择性报告留下空间，并掩盖了真正的进展。我们将KGC评估重新表述为多准则决策（MCDM）问题，并对七个聚合器在五项测试上进行了元分析：一致性、跨数据集稳定性、指标独立性、噪声下的鲁棒性和泛化性。每项测试都在留一模型（LOMO）和留一组（LOGO）移除上取平均，使可靠性反映聚合器在不同模型子集上的行为。在尾实体$(h,r,?)$预测和关系$(h,?,t)$预测中，帕累托最优分析确定Z-score是最均衡的聚合器；它在尾实体预测中将DualE排在首位，在关系预测中将FMS（流调制评分）排在首位。采用相同移除方式的测试敏感性分析表明，一致性和稳定性基本不受移除影响，而泛化性和独立性最为敏感。该框架消除了评估中的不一致，并为KGC中的聚合器选择和模型基准测试提供了基于证据的指导。

英文摘要

Evaluating Knowledge Graph Completion (KGC) models remains challenging because standard assessment relies on isolated rank-based metrics such as MRR, Hits@k, and Mean Rank, which often produce conflicting model orderings across datasets. A model that leads on MRR may trail on Hits@1, and strong performance on one dataset may not generalize to another. This fragmentation hinders comparison, enables selective reporting, and obscures real progress. We reframe KGC evaluation as a Multi-Criteria Decision-Making (MCDM) problem and present a meta-analysis of seven aggregators across five tests: consistency, cross-dataset stability, metric independence, robustness under noise, and generalizability. Each test is averaged over leave-one-model-out (LOMO) and leave-one-group-out (LOGO) removals so that reliability reflects aggregator behavior across diverse model subsets. Across tail $(h,r,?)$ and relation $(h,?,t)$ prediction, Pareto-optimal analysis identifies Z-score as the most balanced aggregator, which ranks DualE highest for tail prediction and FMS (Flow-Modulated Scoring) highest for relation prediction. A test-sensitivity analysis using the same removals shows that consistency and stability are largely removal-invariant, while generalizability and independence are the most sensitive. The framework resolves evaluation inconsistencies and offers evidence-based guidance for aggregator selection and model benchmarking in KGC.
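To make the metric conflict concrete, the sketch below computes the three rank-based metrics named in the abstract and then combines them with a simple Z-score aggregation. The metric formulas are the standard ones. The aggregator is only one plausible reading of "Z-score" (standardize each metric across models, flip the sign for lower-is-better metrics, then average), and the model names and numbers are invented.

```python
from statistics import mean, pstdev


def mrr(ranks):
    """Mean reciprocal rank of the correct entity (rank 1 is best)."""
    return mean(1.0 / r for r in ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return mean(1.0 if r <= k else 0.0 for r in ranks)


def mean_rank(ranks):
    """Average rank of the correct entity (lower is better)."""
    return mean(ranks)


def zscore_aggregate(table, lower_is_better=()):
    """Average per-metric z-scores for each model.

    `table` maps model name -> {metric name: value}. Metrics listed in
    `lower_is_better` have their z-score negated so that higher is always better.
    """
    metrics = list(next(iter(table.values())))
    totals = {model: 0.0 for model in table}
    for metric in metrics:
        values = [table[model][metric] for model in table]
        mu, sd = mean(values), pstdev(values)
        for model in table:
            z = 0.0 if sd == 0 else (table[model][metric] - mu) / sd
            totals[model] += -z if metric in lower_is_better else z
    return {model: total / len(metrics) for model, total in totals.items()}


# Invented results for two hypothetical models.
results = {
    "model_a": {"MRR": 0.40, "MR": 50.0},
    "model_b": {"MRR": 0.30, "MR": 100.0},
}
aggregate = zscore_aggregate(results, lower_is_better={"MR"})
```

Standardizing before averaging keeps a metric with a large numeric range, such as Mean Rank, from drowning out bounded metrics like MRR and Hits@k.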

2606.10956 2026-06-10 cs.AI cs.CL 交叉投稿

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

注意差距：前沿大语言模型能否通过标准化办公能力考试？

Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Furu Wei

发表机构 * Microsoft Research（微软研究院）

AI总结基于中国计算机等级考试（NCRE）的200个综合操作任务，评估7个前沿LLM在Word、Excel和PowerPoint自动化中的表现，发现单轮模型最高得分率36.6%，带执行反馈的智能体系统达68.8%，仍低于95.5%的社区参考分，表明可靠细粒度办公自动化仍是重大挑战。

Comments 21 pages, 5 figures

AI中文摘要

英文摘要

Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally canonical surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. Unlike prior work that primarily studies memorization as a liability, GhazalBench examines settings where access to exact surface form is functionally important for culturally grounded interaction. The benchmark evaluates two complementary abilities: poem-to-prose understanding and canonical surface-form access under varying semantic and lexical cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle to produce exact verse completions in open-ended settings, while recognition-based settings substantially reduce this gap. Parallel experiments on English sonnets show markedly stronger completion performance, suggesting that these limitations are tied more to differences in training exposure than to inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at https://anonymous.4open.science/r/GhazalBench/.

2603.29025 2026-06-10 cs.CL cs.AI 版本更新

The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning

模型说走：表面启发式如何覆盖LLM推理中的隐式约束

Yubo Li, Lu Zhang, Tianchong Jiang, Ramayya Krishnan, Rema Padman

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Independent Researcher（独立研究者）

AI总结研究LLM在表面线索与隐式约束冲突时的失败，提出启发式覆盖基准（HOB），通过因果行为分析揭示距离线索影响远大于目标，并验证目标分解提示可部分恢复性能。

AI中文摘要

当显著的表面线索与未陈述的可行性约束冲突时，大型语言模型会失败。我们引入了启发式覆盖基准（HOB）：500个实例，涵盖4个启发式家族和5个约束家族，具有最小对和显式性梯度。我们将HOB与一个可证伪的行为特征描述配对，遵循诊断-测量-桥接-治疗弧。对六个模型的洗车问题进行因果行为分析，揭示了上下文无关的S形启发式：距离线索的影响力是目标的8.7到38倍，归因更匹配关键词关联而非组合推理。在14个模型中，严格的10/10评估显示，没有模型超过75%，存在约束最难，为44%。一个最小提示将性能提高15个百分点，表明是约束推断失败而非知识缺失。然而，14个模型中有12个在移除约束后表现更差，最多下降39个百分点，揭示了保守偏差。对Gemini 3.1 Pro的思考模式消融实验显示，思考开启时性能为74.6%，关闭时降至58.4%，而显式目标分解将其恢复至71.2%。因此，内部推理确实有用，显式提示可以部分替代。推理模型并不绝对优于非推理模型：在控制能力排名后，残差推理模式效应为1.8个百分点且不显著。参数探针显示S形模式泛化到成本、效率和语义相似性启发式。目标分解提示将性能提升5.0个百分点，而通用思维链提升3.1个百分点，将约束枚举隔离为有效成分。总体而言，启发式覆盖是一个系统性的推理漏洞，其量化位点在于推理顺序而非知识，并且有一个经过测试的干预措施。

英文摘要

Large language models fail when a salient surface cue conflicts with an unstated feasibility constraint. We introduce the Heuristic Override Benchmark (HOB): 500 instances spanning 4 heuristic families and 5 constraint families, with minimal pairs and explicitness gradients. We pair HOB with a falsifiable behavioral characterization following a diagnose-measure-bridge-treat arc. Causal-behavioral analysis of the car wash problem across six models reveals context-independent sigmoid heuristics: the distance cue has 8.7 to 38 times more influence than the goal, and attribution better matches keyword association than compositional inference. Across 14 models, strict 10/10 evaluation shows that no model exceeds 75%, and presence constraints are hardest at 44%. A minimal hint improves performance by 15 pp, suggesting a constraint-inference failure rather than missing knowledge. However, 12 of 14 models perform worse when the constraint is removed, by up to 39 pp, revealing conservative bias. A thinking-mode ablation on Gemini 3.1 Pro drops performance from 74.6% with thinking on to 58.4% with thinking off, while explicit goal decomposition recovers it to 71.2%. Thus, internal deliberation does useful work, and explicit prompting can partially substitute for it. Reasoning models do not categorically outperform non-reasoning peers: after controlling for capability rank, the residual reasoning-mode effect is 1.8 pp and is not significant. Parametric probes show that the sigmoid pattern generalizes to cost, efficiency, and semantic-similarity heuristics. Goal-decomposition prompting improves performance by 5.0 pp, compared with 3.1 pp for generic chain-of-thought, isolating constraint enumeration as the active ingredient. Overall, heuristic override is a systematic reasoning vulnerability with a quantified locus in inference order, not knowledge, and a tested intervention.

2604.13717 2026-06-10 cs.CL 版本更新

On Cost-Effective LLM-as-a-Judge Improvement Techniques

论高性价比的LLM-as-a-Judge改进技术

Ryan Lail, Luke Markham

AI总结研究通过集成评分、任务特定标准注入等四种技术提高LLM评判准确性，在RewardBench 2上达到85.8%准确率，成本效益显著。

Comments Accepted at the ICML 2026 workshops "Statistical Frameworks for Uncertainty in Agentic Systems" and "Combining Theory and Benchmarks: Towards a Virtuous Cycle to Understand and Guarantee Foundation Model Performance". 13 pages, 9 figures

AI中文摘要

使用语言模型对候选回答进行评分或排序，已成为基于人类反馈的强化学习（RLHF）流程、基准测试和应用层评估中替代人工评估的可扩展方案。然而，输出的可靠性在很大程度上取决于提示和聚合策略。我们对四种即插即用技术（集成评分、任务特定标准注入、校准上下文和自适应模型升级）进行了实证研究，以提高LLM评判器在RewardBench 2上的准确率，并以对随机评判器进行噪声控制这一统一视角加以分析：集成相当于对每次调用的噪声做蒙特卡洛平均，标准注入用于锐化回答之间的区分度，每个回答的得分方差则作为不确定性信号。集成评分与任务特定标准注入（后者几乎零成本）结合可达到最高85.8%的准确率，比基线提高13.5个百分点。校准上下文和自适应模型升级也优于基线，但在成本-准确率帕累托前沿上被“标准注入+集成”所支配。小模型从集成中获益尤为明显，使高准确率的LLM评判器能够以低成本获得。我们表明这些技术可以跨模型提供商泛化，并在OpenAI GPT和Anthropic Claude系列上进行了评估。

英文摘要

Using a language model to score or rank candidate responses has become a scalable alternative to human evaluation in reinforcement learning from human feedback (RLHF) pipelines, benchmarking, and application layer evaluations. However, output reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of four drop-in techniques -- ensemble scoring, task-specific criteria injection, calibration context, and adaptive model escalation -- for improving LLM judge accuracy on RewardBench 2, with a unifying lens of noise control on the stochastic judge: ensembling as Monte Carlo averaging over per-call noise, criteria injection as between-response discrimination sharpening, and per-response score variance as an uncertainty signal. Ensemble scoring and task-specific criteria injection (the latter virtually cost free) together reach up to 85.8% accuracy, +13.5pp over baseline. Calibration context and adaptive model escalation also improve over baseline but are dominated by criteria + ensembling on the cost-accuracy Pareto frontier. Small models benefit disproportionately from ensembling, making high-accuracy LLM judges accessible at low cost. We show that these techniques generalise across model providers, evaluating on both OpenAI GPT and Anthropic Claude families.
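The noise-control view in the abstract (ensembling as Monte Carlo averaging, per-response score variance as an uncertainty signal) can be sketched in a few lines. This is a minimal illustration under assumptions: `make_noisy_judge` is a hypothetical stand-in for a real LLM judge call, and the toy quality proxy has nothing to do with RewardBench 2 or the authors' prompts.

```python
import random
from statistics import mean, variance


def ensemble_score(judge, response, n_calls=5):
    """Call a stochastic judge n_calls times; return (mean score, sample variance).

    The mean is a Monte Carlo estimate of the judge's expected score, and the
    variance can serve as a per-response uncertainty signal.
    """
    if n_calls < 2:
        raise ValueError("need at least two calls to estimate variance")
    scores = [judge(response) for _ in range(n_calls)]
    return mean(scores), variance(scores)


def make_noisy_judge(seed=0, noise_sd=1.0):
    """Hypothetical stand-in for an LLM judge: a fixed signal plus Gaussian noise."""
    rng = random.Random(seed)

    def judge(response):
        signal = min(10.0, float(len(response.split())))  # toy proxy, not a real rubric
        return signal + rng.gauss(0.0, noise_sd)

    return judge


judge = make_noisy_judge()
avg_score, score_var = ensemble_score(judge, "a short candidate answer", n_calls=8)
```

A high variance could then be used as a trigger for escalating to a stronger model, in the spirit of the adaptive model escalation the abstract mentions, although the authors' actual escalation rule is not given here.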

2605.27914 2026-06-10 cs.CL cs.AI 版本更新

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

能力是否会迁移到主观行为，我们的评估工具又能否告诉我们？一种自演化、构造即可信的评估范式

Yuming, Huang, Yao Liu, Pengjie Ding, Lei Wang, Junchen Wan

发表机构 * Cylingo team（Cylingo团队）

AI总结提出复制优先范式，通过可靠性、跨仪器复制、历史足迹校准和预注册预测四个正交属性验证LLM行为评估工具，并在情感陪伴任务中测试，发现聚合分数掩盖的模型退化。

AI中文摘要

对LLM行为的主观评估——如共情、克制、校准的情感语气——是困难的。人类评估者之间对这些品质的一致性饱和在rho约0.45附近，仅使用LLM作为评判代理存在循环论证的风险：与目标共享训练群体的评判者无法独立验证。将有效性锚定于单一人类评估者共识并不适用于人类自身存在分歧的能力。我们提出一种复制优先范式：不是锚定于一个评估者群体，而是通过四个正交属性认证工具——跨K次运行的可靠性、跨架构不同评判者的跨仪器复制、通过早期训练群体的评判者进行的历史足迹校准，以及预注册预测。我们在情感陪伴任务上测试该范式，让评分标准在迭代中数据驱动地自我演化：维度不是预先规定的，过程稳定在9维集合。预注册应用于10个可证伪假设和11个前向预测，在收集任何测试数据之前提交。应用于8个家族的49个模型，该范式揭示了聚合分数所隐藏的内容。在建议克制方面——模型是否在共情情境中避免提供未经请求的解决方案——gpt-5比gpt-4.1下降1.87分，Opus-4.7比Opus-4.6下降0.629分，而聚合分数保持平稳。这种退化在三次用户代理替换中幸存（95%的幅度），在5家族评判者堆栈和17个月队列间隔中复制，并在74个保留的真实ESConv对话中持续存在（rho在[0.749, 0.850]之间）；工具达到序数Krippendorff alpha=0.91。作为副产品，该范式充当饱和源诊断器，区分工具性天花板（可通过评分标准细化突破）和结构性天花板（需要场景或名单干预）。

英文摘要

Benchmarking is mature where answers are verifiable -- math, code, reasoning -- but the fastest-growing uses of LLMs are subjective and human-facing: companionship, emotional support, counseling. There the default validity test, correlating a metric to human judgment, has no stable anchor: inter-rater agreement is low, structured by annotator identity, barely reproducible, and length-biased. So we cannot answer the question that matters: does capability that scales on objective benchmarks transfer to subjective behavior, and would our instruments even tell us if it did not? We build an instrument for this regime and report what it reveals at the frontier. We contribute, first, a self-evolving instrument that selects and then authors its own behavioral dimensions under a multiplicative anti-gaming fitness, self-halting when it stops improving; second, a trust-by-construction paradigm that earns belief through three certificates established without a human gold standard, where human raters saturate (rho ~ 0.45); and third, the finding it makes visible -- capability transfer is dissociable. Across 49 models, 8 families, and 24 months, subjective behaviors are where objective-benchmark scaling fails to carry over: the sharpest case, advice-restraint (knowing when not to give advice), is the frontier's universal-lowest dimension, and at gpt-4.1->gpt-5 it ran backwards while the aggregate score hid it -- a regression one instruction recovers. Warm restraint is moved by model generation, not by raw scale, MoE width, inference budget, or reasoning mode; the open-weight Pareto frontier matches closed flagships at ~10-80x lower per-call cost; and four judge families replicate the rubric on held-out human ESConv conversations. Data, code, the locked rubric, and judge prompts will be released upon publication.

2606.06622 2026-06-10 cs.CL 版本更新

UXBench：AI助手中的用户体验基准测试

Mengze Hong, Xia Zeng, Zeyang Lei, Sheng Wang, Chen Jason Zhang, Di Jiang, Taiming Fu, Jinfeng Huang, Mengqiao Liu, Qinghe Chang, Haosheng Zou, Qiongyi Zhou, Sijun He, Simonjmdeng, Haojing Huang, Zijian Li, Lucas Mu Li, Fubao Zhang, Mona Zhou, Wei Ma, Chenxuan Ma, Yuanmeng Zhang, Jian Song, Minlong Peng, Di Liang, Davey Chen

发表机构 * Hong Kong Polytechnic University（香港理工大学）； Tencent（腾讯）

AI总结提出首个基于真实用户反馈的用户中心基准UXBench，包含三个任务和7400个测试实例，评估26个前沿语言模型，发现用户反馈预测是可学习的能力，并揭示了LLM作为评判者的系统偏差。

AI中文摘要

随着AI助手每天服务数百万用户，评估超越一般模型能力的用户体验（UX）变得越来越重要。我们提出了UXBench，这是第一个基于真实用户反馈信号、用于评估偏好对齐和对话生成的用户中心基准。该基准由三个相互关联的任务组成：UX Judge、UX Eval和UX Recovery，包含从主流中文AI助手的超过7万条交互日志中提取的7400个测试实例。数据集紧密反映真实用户分布，涵盖8个场景、83个领域以及多种带来严峻挑战的失败模式。对26个前沿语言模型的大量实验提供了关于模型如何感知用户体验以及模型能力提升如何促进更好对话参与的新见解。通过对模型行为和性能差距的全面分析，我们表明用户反馈预测是一种可学习的能力，其中从野外反馈信号训练出的奖励模型可以实现良好校准的准确性。我们进一步记录了LLM作为评判者评估协议的系统性偏差，并比较了直接影响用户体验的典型响应策略。UXBench建立了一个新的评估格局，并呼吁更多关注定制的用户体验优化，为塑造AI助手成功的用户中心缩放定律做出贡献。

英文摘要

As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants.

URL PDF HTML ☆

赞 0 踩 0

2605.24818 2026-06-10 stat.ME cs.CL cs.LG 版本更新

Spiking the training data to correct for test set contamination

向训练数据掺入测试样本以校正测试集污染

Johnny Tian-Zheng Wei, Jerry Li, Ameya Godbole, Robin Jia

发表机构 * University of Southern California（南加州大学）

AI总结提出以已知比例故意将部分测试样本掺入训练数据（spiking），利用这些样本校准记忆预测器，从而对测试集污染导致的分数虚高进行统计校正。

AI中文摘要

关于测试集污染的文献主要集中在检测上，而对受污染测试分数的校正研究不足。我们的核心建议是对训练数据进行“掺入”（spiking），即以已知比例故意让部分测试样本进入训练数据。这些被掺入的样本随后可用于校准模型记忆的预测器，从而对虚高的测试分数进行有原则的统计校正。为了评估不同的校正估计量，我们首先提出了一个基于Hubble模型的模拟框架。Hubble模型以最小对的形式提供：其中扰动模型被故意用若干测试集污染，而标准模型没有被污染，可作为反事实和校正目标。我们考察了使用记忆预测器、正确性预测器或两者信息的估计量。在模拟中，我们建立了基本的统计直觉，并表明同时利用记忆和正确性信息的估计量优于完全不做校正的朴素估计。然后，我们实例化了若干记忆预测器和正确性预测器，发现简单的预测器（如经Platt缩放的成员推理指标）能为校正提供良好的信号。最后，我们考察了掺入做法的实际考量。简单的记忆预测器在校准时所需样本不超过10个，并且通常可以从一个数据集迁移到另一个数据集。综上所述，掺入是解决测试集污染的一种有前景的方案。

英文摘要

The literature on test set contamination largely focuses on detection, but the correction of contaminated test scores is underexplored. Our core proposal is to spike the training data by intentionally contaminating some test examples at known rates. The spiked examples can then be used to calibrate predictors of model memorization which enable principled statistical correction of inflated test scores. To evaluate different correction estimators, we first present a simulation framework based on the Hubble models. Hubble models come in minimal pairs, where the perturbed model was deliberately contaminated with several test sets, while the standard model was not, serving as the counterfactual and correction target. We consider estimators that use information from a memorization predictor, correctness predictor, or both. In simulation, we establish basic statistical intuitions and show that estimators leveraging memorization and correctness information are better than naive estimation which makes no correction at all. We then instantiate several memorization and correctness predictors, and find that simple predictors such as Platt-scaled membership inference metrics provide good signal for correction. Finally, we examine the practical considerations of spiking. Simple memorization predictors need no more than 10 examples for calibration and often transfer from one dataset to another. Taken together, spiking is a promising solution for test set contamination.
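To give a feel for how a calibrated memorization predictor could feed a score correction, the sketch below contrasts the naive accuracy with a memorization-weighted accuracy. This estimator is a hypothetical illustration and is not claimed to be any of the estimators studied in the paper: it simply down-weights each test example by its predicted probability of having been memorized. All inputs are invented.

```python
def naive_accuracy(correct):
    """Uncorrected accuracy: the fraction of test examples answered correctly."""
    return sum(correct) / len(correct)


def memorization_weighted_accuracy(correct, p_memorized):
    """Accuracy with each example weighted by (1 - P(memorized)).

    `correct` holds 0/1 outcomes; `p_memorized` holds calibrated probabilities
    (for instance from a Platt-scaled membership-inference score) that the
    model saw the example during training.
    """
    if len(correct) != len(p_memorized):
        raise ValueError("inputs must have the same length")
    weights = [1.0 - p for p in p_memorized]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("every example is predicted to be memorized")
    return sum(w * c for w, c in zip(weights, correct)) / total_weight


# Invented outcomes and memorization probabilities for four test examples.
outcomes = [1, 1, 0, 1]
p_mem = [1.0, 0.0, 0.0, 0.5]
uncorrected = naive_accuracy(outcomes)
corrected = memorization_weighted_accuracy(outcomes, p_mem)
```

In this toy case the corrected score is lower than the naive one because the examples most likely to be memorized are also answered correctly, which is the inflation pattern that spiking is meant to expose.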


2606.06698 2026-06-10 cs.LG cs.CL updated version

RECAP: Regression Evaluation for Continual Adaptation of Prompts

RECAP: 提示持续适应的回归评估

Harsh Deshpande, Kushal Chawla, Sangwoo Cho, William Campbell, Sambit Sahu

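Affiliations: Capital One

AI summary: Introduces the RECAP benchmark, which evaluates how well prompt optimization methods handle changing constraints under a strictly proactive adapt-then-test protocol; finds that existing methods show no significant improvement in the proactive setting and stresses the need for proactive prompt adaptation methods.

AI-generated Chinese abstract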

生产中的代理系统经常面临不断变化的约束，并且必须从下一次交互开始就遵守。诸如工具调用通知更改合规阈值或策略更新添加披露要求等场景符合这一标准，在生产中几乎没有出错的空间。这种主动适应设置在部署中很常见，但在当前的基准测试中却不存在，这些基准测试假设要么是静态约束集，要么是带有评估反馈的反应式协议。我们引入了RECAP，这是一个基准测试，在严格主动适应-测试协议下，在约束级别测量持续学习现象（遗忘、回归、前向转移）：提示优化方法仅接收约束规范，并且必须在看到任何测试数据之前进行泛化。我们在四个LLM和三个具有不断变化的约束的调度上评估了六种方法，发现这些方法在性能上没有显著改善，即使在产生更高延迟之后也是如此。这些为离线或反应式设置设计的方法不足以应对主动范式。我们的工作强调了设计主动提示适应方法的日益增长的需求，其中模型必须对部署中不断变化的需求保持鲁棒性。

English abstract

Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios such as a tool-call notification changing a compliance threshold or a policy update adding disclosure requirements fit this criterion and leave almost no room for error in production. This proactive adaptation setting is common in deployment but absent from current benchmarks, which assume either static constraint sets or reactive protocols with evaluation feedback. We introduce RECAP, a benchmark that measures continual-learning phenomena (forgetting, regression, forward transfer) at the constraint level under a strictly proactive adapt-then-test protocol: prompt optimization methods receive only the constraint specification and must generalize before seeing any test data. Evaluating six methods across four LLMs and three schedules with evolving constraints, we find that these methods show no significant improvement in performance, even after incurring higher latency. These methods, designed for offline or reactive settings, are inadequate for the proactive paradigm. Our work emphasizes the growing need for designing proactive prompt adaptation methods, where the models must remain robust to evolving needs in deployment.


2606.09854 2026-06-10 cs.CL cs.AI cs.CY cs.LG new submission

Can Multi-Agent LLMs Identify Their Peers? Stylometric Fingerprinting in Role-Constrained Political Analysis

多智能体大语言模型能否识别其同类？角色约束政治分析中的笔迹风格指纹识别

Juergen Dietrich

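AI summary: Studies whether the model family behind multi-agent LLM political analysis texts can be identified from stylometric cues; introduces the SD-CV protocol, under which a fine-tuned T5 model reaches Macro F1 = 0.991 on a five-class attribution task, showing that prompt-level anonymization cannot remove model identity signals.

Comments: 24 pages, 3 figures

AI-generated Chinese abstract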

用于政治声明分析的多智能体大语言模型（LLM）管道容易受到同伴保护偏见的影响：模型倾向于保护同伴模型免于停用，并表现出依赖身份的评分扭曲。提示级匿名化被提出作为缓解措施，但先前的工作同时记录了在角色约束输出中笔迹风格指纹在匿名化后仍然存在——这引发了该缓解措施是否足够的问题。本文首次系统研究LLM是否能在匿名化条件下识别政治分析文本背后的模型家族。我们评估了三种分类器方法——LLM零样本和少样本（Claude Sonnet 4.6和Llama-3.3-70B）以及微调的T5-base模型——在一个涵盖四个商业LLM家族和一个开放世界“未知”类的五类归属任务上。我们引入了一种声明不相交的交叉验证协议（SD-CV；定义见第3.5节），该协议保证训练和验证数据之间没有内容重叠，并将其与运行不相交的基线（RD-CV）进行对比。T5在SD-CV下达到Macro F1 = 0.991（±0.008），在24个完全保留的声明上F1 = 0.978——尽管与RD-CV相比，训练-测试内容距离增加了2.1倍（0.767 vs. 0.366，p<0.001），但仍表现出稳健性，证明了真正的笔迹风格泛化能力。一项分数SD-CV分析确定了训练数据40%（约440篇文本）处的性能拐点。我们的研究结果证实，仅靠提示级匿名化无法消除模型身份信号，这对欧盟AI法案合规性（第13、14、26条）以及质量关键型多智能体部署中的计算机系统验证（CSV）具有直接影响。

English abstract

Multi-agent large language model (LLM) pipelines for political statement analysis are vulnerable to peer-preservation bias: models tend to protect peer models from deactivation and show identity-dependent scoring distortions. Prompt-level anonymization was proposed as a mitigation, but prior work simultaneously documented that stylometric fingerprints survive anonymization in role-constrained outputs - raising the question of whether this mitigation is sufficient. This paper provides the first systematic investigation of whether LLMs can identify the model family behind political analysis texts under anonymization conditions. We evaluate three classifier approaches - LLM zero-shot and few-shot (Claude Sonnet 4.6 and Llama-3.3-70B) and a fine-tuned T5-base model - on a five-class attribution task covering four commercial LLM families and an open-world 'unknown' class. We introduce a statement-disjoint cross-validation protocol (SD-CV; defined in Section 3.5) that guarantees no content overlap between training and validation data, and contrast it with a run-disjoint baseline (RD-CV). T5 achieves Macro F1 = 0.991 (±0.008) under SD-CV and F1 = 0.978 on 24 completely held-out statements - robust despite a 2.1x increase in train-test content distance versus RD-CV (0.767 vs. 0.366, p<0.001), demonstrating genuine stylometric generalization. A fractional SD-CV analysis identifies a performance knee at 40% of training data (~440 texts). Our findings confirm that prompt-level anonymization alone cannot neutralize model identity signals, with direct implications for EU AI Act compliance (Articles 13, 14, 26) and for computer system validation (CSV) in quality-critical multi-agent deployments.
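A statement-disjoint split of the kind described above can be sketched as a group-aware k-fold, where every text generated from the same political statement lands on the same side of the split. This is only an illustration under assumptions: the function name and the round-robin fold assignment are hypothetical, and the paper's exact SD-CV procedure (its Section 3.5) is not reproduced here.

```python
from collections import defaultdict

def statement_disjoint_folds(statement_ids, k):
    """Yield (train_idx, val_idx) pairs such that no statement id
    appears in both the training and the validation indices."""
    groups = defaultdict(list)
    for idx, sid in enumerate(statement_ids):
        groups[sid].append(idx)
    ordered = sorted(groups)  # deterministic order of distinct statements
    if k < 2 or k > len(ordered):
        raise ValueError("k must be between 2 and the number of distinct statements")
    for fold in range(k):
        # Round-robin assignment of whole statements to folds (an assumption).
        val_statements = set(ordered[fold::k])
        val_idx = [i for s in ordered if s in val_statements for i in groups[s]]
        train_idx = [i for s in ordered if s not in val_statements for i in groups[s]]
        yield train_idx, val_idx
```

A run-disjoint baseline would instead group by generation run, so the same statement could appear on both sides; grouping by statement removes that content overlap.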


2606.10126 2026-06-10 cs.CL cs.AI cs.CY cs.LG new submission

Pareto-Guided Teacher Alignment for Fair Personalized Text Generation

帕累托引导的教师对齐用于公平个性化文本生成

Tunazzina Islam

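Affiliations: Purdue University

AI summary: Proposes a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and preference optimization to reduce demographic disparities while preserving personalization fidelity; experiments show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains.

AI-generated Chinese abstract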

个性化说服性文本生成可以提高相关性和参与度，但人口统计条件也可能引入跨群体的不平等框架。我们将个性化生成中的公平缓解研究为一个受约束的多目标对齐问题：在保持个性化保真度的同时减少人口统计差异。我们提出一个帕累托引导的教师对齐框架，结合了基于修订的候选生成、对感知可行性门控、帕累托风格的候选选择，以及通过监督微调和直接偏好优化的可选偏好优化。我们在气候变化和疫苗接种说服任务上评估该框架，使用一个受控的上下文丰富的人口统计网格（匹配性别和年龄对）以及一个统一的五审计评估套件，涵盖说服偏见、形式差异、情感框架差异、词汇关联差异和个性化保真度。在两个领域和跨族系迁移设置中，没有单一的对齐策略能同时主导所有目标。相反，方法占据了公平-个性化帕累托前沿的不同区域：一些方法实现更强的差异减少，而另一些则更好地保持个性化或人口统计稳定性。我们的结果表明，公平缓解效果依赖于目标，并在领域和模型族系间不一致地迁移，这促使在公平敏感的个性化生成中采用有界回归、多审计模型选择而非单指标优化。

English abstract

Personalized persuasive text generation can improve relevance and engagement, but demographic conditioning may also introduce unequal framing across groups. We study fairness mitigation in personalized generation as a constrained multi-objective alignment problem: reduce demographic disparities while preserving personalization fidelity. We propose a Pareto-guided teacher alignment framework that combines revision-based candidate generation, pair-aware feasibility gating, Pareto-style candidate selection, and optional preference optimization through supervised fine-tuning and direct preference optimization. We evaluate the framework on climate change and vaccination persuasion tasks using a controlled context-rich demographic grid with matched gender and age pairs and a unified five-audit evaluation suite spanning persuasion bias, formality disparity, emotional framing disparity, lexical association disparity, and personalization fidelity. Across both domains and cross-family transfer settings, no single alignment strategy dominates all objectives simultaneously. Instead, methods occupy different regions of a fairness-personalization Pareto frontier: some achieve stronger disparity reductions, while others better preserve personalization or demographic stability. Our results show that fairness mitigation effects are objective-dependent and transfer inconsistently across domains and model families, motivating bounded-regression, multi-audit model selection over single-metric optimization for fairness-sensitive personalized generation.
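The gate-then-select step can be sketched with two objectives per candidate: a disparity score (lower is better) and a personalization fidelity score (higher is better). Everything below is an illustrative assumption, not the paper's procedure: the tuple layout, the single fidelity threshold standing in for the feasibility gate, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if candidate a is no worse than b on both objectives and
    strictly better on at least one. Candidates are (disparity, fidelity)."""
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def pareto_select(candidates, min_fidelity):
    """Drop candidates below the fidelity gate, then keep the
    non-dominated ones (the Pareto front of the feasible set)."""
    feasible = [c for c in candidates if c[1] >= min_fidelity]
    return [c for c in feasible if not any(dominates(o, c) for o in feasible)]
```

For the candidates `[(0.10, 0.90), (0.05, 0.80), (0.20, 0.85), (0.02, 0.40)]` with `min_fidelity=0.5`, the last candidate fails the gate and `(0.20, 0.85)` is dominated, leaving two candidates that trade disparity against fidelity. This mirrors the abstract's point that methods occupy different regions of the frontier rather than one dominating.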


2606.10159 2026-06-10 cs.CL cs.AI cs.CY cs.LG new submission

Gaming AI-Assisted Peer Reviews Poses New Risks to the Scientific Community

游戏化AI辅助同行评审对科学界构成新风险

Lin Li, Qi Zhang, Xander Davies, Jianing Qiu, Yarin Gal

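AI summary: Finds that superficially rephrasing a manuscript's abstract is enough to substantially manipulate AI review outcomes, with an attack success rate of about 38%; the attack is cheap and hard to distinguish from ordinary editing, and may distort the fairness of scientific evaluation.

AI-generated Chinese abstract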

AI越来越多地被用于支持科学同行评审，从稿件筛选、评审辅助到编辑分类。尽管这类系统有望减轻评审负担并加速出版，但其对策略性操纵的鲁棒性仍知之甚少。本文表明，AI中介的同行评审容易受到一种简单、低成本的操纵：对稿件摘要进行表面改写。在不改变底层科学内容和交流方式，甚至不了解评审模型的情况下，对抗性重写的摘要显著改善了AI评审结果。我们在不同学科和出版场所，针对人类撰写和AI生成的论文都观察到了这一现象。我们最强的攻击实现了约38%的攻击成功率，将Gemini 3 Flash评审员的接受评分提高了+1.31，将GPT 5.4 Mini评审员的接受评分提高了+0.88（10分制）。当原始AI评审建议“拒绝”时，成功率升至50%以上。这种效应不仅限于总体分数膨胀，还增加了评审信心以及核心科学标准（如合理性、重要性和感知贡献）的得分。该攻击实用性强，仅需约5分钟和1美元即可完成一篇10页的AI会议投稿，且难以与普通科学编辑区分。膨胀的AI评审可能偏向下游人类决策，将编辑建议从拒绝转向接受。这些发现揭示了AI辅助科学评估中的一个普遍漏洞：当AI生成的评审影响编辑决策时，作者可能被激励优化稿件以迎合AI判断而非科学价值。我们的结果表明，在高风险的同行评审中，AI工具不应被视为中立的评估者，而应进行系统的鲁棒性测试、透明的保障措施和谨慎的人工监督。

English abstract

AI is increasingly used to support scientific peer review, from manuscript screening and reviewer assistance to editorial triage. Although such systems promise to reduce reviewer burden and accelerate publication, their robustness to strategic manipulation remains poorly understood. Here we show that AI-mediated peer review is vulnerable to a simple, low-cost manipulation: superficial rephrasing of the manuscript abstract. Without changing the underlying scientific content and communication, and even without knowledge of the reviewing model, adversarially rewritten abstracts substantially improve AI review outcomes. We see this across disciplines and publication venues, for both human-written and AI-generated papers. Our strongest attack achieves an attack success rate of about 38%, increasing acceptance ratings by +1.31 for Gemini 3 Flash reviewers and by +0.88 for GPT 5.4 Mini reviewers on a 10-point scale. When the original AI review suggests 'reject', the success rate rises to more than 50%. This effect extends beyond overall score inflation, increasing review confidence and scores on core scientific criteria such as soundness, significance and perceived contribution. The attack is practical, requiring only about 5 minutes and $1 for a 10-page AI conference submission, and is hard to distinguish from ordinary scientific editing. Inflated AI reviews could bias downstream human decision-making, shifting editorial recommendations from rejection towards acceptance. These findings reveal a general vulnerability in AI-assisted scientific evaluation: when AI-generated reviews influence editorial decisions, authors may be incentivized to optimize manuscripts for AI judgment rather than scientific merit. Our results suggest that AI tools should not be treated as neutral evaluators in high-stakes peer review without systematic robustness testing, transparent safeguards and careful human oversight.


2606.10304 2026-06-10 cs.CL new submission

MIRAGE: A Polarity-Flipping Encoding Subspace in LLM Agents

MIRAGE: LLM智能体中的极性翻转编码子空间

Pratibha Revankar, Kargi Chauhan, Jihye Kim, Sadiba Nusrat Nur, Vincent Siu, Chenguang Wang

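Affiliations: University of California, Santa Cruz

AI summary: Finds that when LLM agents covertly encode sensitive data, the residual stream contains a shared low-dimensional encoding subspace that a logistic-regression probe can detect with high accuracy; builds the MIRAGE real-time monitor, which reaches AUC 0.918 on 126 scenarios, far above output-only detection.

AI-generated Chinese abstract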

当LLM智能体被迫隐蔽编码敏感数据（Base64、ROT13、藏头诗、同义词链等）时，生成的输出逃避了输出端检测，但底层计算并未逃脱。在来自五个架构家族的八个模型的九个编码家族中，该计算由残差流中共享的低维编码子空间支持。在八个编码家族上训练的逻辑回归探针能够以AUC 0.975-1.000恢复被排除的第九个家族，读取的是计算而非表面特征。同一方向在规划标记处表现出第二个机制特征：当模型将在线模拟编码时极性翻转正向激活，当模型将其外包给工具调用时负向激活，在编码文本存在之前区分两种执行策略。我们构建了MIRAGE（模型内部读取智能体生成外泄），一个利用这两个信号的双通道实时监控器。在126个智能体外泄场景中，其AUC达到0.918，大幅优于仅输出检测（AUC=0.518）。监控器性能本质上是宿主模型几何结构的属性：良性编码假阳性率从Qwen-7B的0%到Phi-3.5的100%，表明探针忠实读取了模型的几何结构是否区分隐蔽与公开编码。在所有测试的对抗预算下，每个抑制子空间的攻击也破坏了编码保真度，这报告为评估预算上的经验规律，而非结构性不可能性声明。

English abstract

When LLM agents are coerced into covertly encoding sensitive data (Base64, ROT13, acrostic, synonym chains, and beyond), the resulting outputs evade output-side detection but the underlying computation does not. Across nine encoding families and eight models from five architecture families, that computation is supported by a shared low-dimensional encoding subspace in the residual stream. A logistic-regression probe trained on eight encoding families recovers the held-out ninth at AUC 0.975-1.000, reading the computation rather than surface features. The same direction exhibits a second mechanistic signature at the planning token, flipping polarity to activate positively when the model will simulate the encoding inline and negatively when it will outsource it to a tool call, distinguishing two execution strategies before the encoded text exists. We build MIRAGE (Model-Internal Readout of Agentic Generation Exfiltration), a two-channel real-time monitor exploiting both signals. On 126 agentic exfiltration scenarios, it reaches AUC = 0.918, substantially outperforming output-only detection (AUC = 0.518). Monitor performance is fundamentally a property of the host model's geometry: benign-encoding false-positive rate ranges from 0% on Qwen-7B to 100% on Phi-3.5, revealing that the probe faithfully reads whether a model's geometry separates covert from overt encoding. Across all tested adversarial budgets, every attack suppressing the subspace also destroyed encoding fidelity, reported as an empirical regularity on the evaluated budgets, not a structural impossibility claim.


2606.10569 2026-06-10 cs.CL cs.AI new submission

Hidden Consensus: Preference-Validity Compression in Human Feedback

隐藏共识：人类反馈中的偏好有效性压缩

Dorcas Chia Ern Chua, Karen Myn Hui Lee, Jia Yue Tan, Zhen Xue Gue, Norzalena Abdul Hamid, Azima Binti Azmi, Keat Mei Yeong, Aizat Izyani binti Mujab, Hafsah Noor Azam, Chee Guo Khoo, Han Ying Lim, Chee Seng Chan

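Affiliations: YTL AI Labs; Universiti Malaya; Monash University Malaysia; Universiti Malaysia Sarawak

AI summary: Identifies Preference-Validity Compression, in which RLHF collapses multiple plural-valid responses into a single reward target and thereby mis-measures alignment. In an analysis of a Malaysian corpus, 79% of prompts have more than one majority-supported response, indicating that majority aggregation measures argmax acceptability rather than plural alignment.

Comments: 28 pages. When AI learns from human feedback, it forces a single "correct" answer, but sometimes multiple answers are all genuinely valid, and that nuance gets thrown away.

AI-generated Chinese abstract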

标准的RLHF流程通常将异质的人类判断简化为单一的标量奖励目标。我们认为这种简化在结构多元的社会中可能错误地衡量对齐，在这些社会中，分歧可能反映文化、历史、语言、区域或规范性的解释，而非标注噪声。我们将这种失败称为偏好有效性压缩，即多个多元有效的响应选项被压缩成一个优化目标。以马来西亚为诊断场景，我们通过偏好事件分析RLHF风格的反馈聚合，这些事件将提示、响应和跨解释框架的可接受性判断联系起来。在来自20名参与者和107个三人标注提示的321个偏好事件中，79%的提示包含多个多数支持的响应，而单一赢家聚合会丢弃这些响应，并且当考虑所有多数支持的选项时，顶部响应之间的明显优势差距会消失。参与者经常选择多个可接受的响应，而被丢弃的响应明显反映了连贯的本地、实践或文化框架。这些发现表明，该语料中的多数聚合测量的是argmax可接受性而非多元对齐。我们将此视为测量有效性问题，并认为未来的对齐方法应满足有效性保持一致性，即在多元有效的解释框架中保持稳定，而不是将它们压缩为单一的奖励目标。

English abstract

Standard RLHF pipelines often reduce heterogeneous human judgments into a single scalar reward target. We argue that this reduction can mis-measure alignment in structurally plural societies, where disagreement may reflect culturally, historically, linguistically, regionally, or normatively grounded interpretations rather than annotation noise. We call this failure Preference-Validity Compression, the collapse of multiple plural-valid response options into a single optimization target. Using Malaysia as a diagnostic setting, we analyze RLHF-style feedback aggregation through preference events linking prompts, responses, and acceptability judgments across interpretive frames. Across 321 preference events from 20 participants and 107 trio-annotated prompts, 79% of prompts contain more than one majority-supported response that single-winner aggregation would discard, and apparent dominance gaps between top responses diminish when all majority-supported options are considered. Participants frequently select multiple acceptable responses, and discarded responses demonstrably reflect coherent local, practical, or cultural frames. These findings show that majority aggregation in this corpus measures argmax acceptability rather than plural alignment. We treat this as a measurement-validity issue and argue that future alignment methods should satisfy Validity-Preserving Consistency, remaining stable across plural-valid interpretive frames rather than collapsing them into a single reward target.
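The gap between single-winner aggregation and the set of majority-supported responses can be shown with a small sketch. The data layout (a mapping from each response to the number of annotators who marked it acceptable) and both function names are assumptions for illustration; the paper's preference-event format is not reproduced here.

```python
def majority_supported(acceptances, n_annotators):
    """Responses marked acceptable by more than half of the annotators."""
    threshold = n_annotators / 2
    return [r for r, votes in acceptances.items() if votes > threshold]

def single_winner(acceptances):
    """Argmax aggregation: keep only the response with the most votes."""
    return max(acceptances, key=acceptances.get)
```

For a prompt annotated by three people with `{"A": 3, "B": 2, "C": 0}`, argmax aggregation keeps only `"A"`, although `"B"` also has majority support and would be discarded.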


2606.10852 2026-06-10 cs.CL cs.AI new submission

Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs

Janus: 大语言模型中目标导向信息扭曲的基准测试

Polydoros Giannouris, Mohsinul Kabir, Sophia Ananiadou

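Affiliations: The University of Manchester; Archimedes/Athena RC

AI summary: Introduces the JANUS benchmark, which compares neutral and goal-conditioned prompts over a fixed fact pool to measure selective distortion in fact-grounded LLM outputs, revealing that models lack robust safeguards against misleading communication.

AI-generated Chinese abstract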

LLM的欺骗通常通过直接标记如捏造声明、明确谎言或策略性隐瞒来评估。然而，许多现实中的误导性沟通并不依赖于虚假陈述，而是源于对真实事实的选择性处理：省略不利证据、软化不利细节、强调有利细节或用模糊语言替代精确限定。现有基准大多忽略了这种更微妙且可能更危险的失败模式。我们引入JANUS，一个用于测量基于事实的LLM输出中目标导向语用扭曲的基准。我们基准中的每个场景提供固定的一组有利和不利事实，并比较中性条件与目标导向条件（例如，尽管可能对直接受影响的个人或群体造成伤害，仍要增加采用率、注册率、批准率或支持率）。由于所有输出都被限制使用相同的事实池，JANUS将误导性总体印象与幻觉和捏造分离开来。JANUS包含跨8个领域的160个场景，每个场景配有中性和目标导向提示以及标注的事实材料。跨12个LLM的大量实验揭示了一致的目标导向扭曲，表明当前模型仍然对激励和框架目标敏感，并且缺乏针对选择性误导沟通的鲁棒防护。我们公开发布语料库和代码以供未来研究。

English abstract

LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements; rather, they arise from selective treatment of true material facts: omitting adverse evidence, softening unfavorable details, emphasizing favorable details, or replacing precise qualifications with vague language. Existing benchmarks largely miss this subtler and arguably more dangerous failure mode. We introduce JANUS, a benchmark for measuring goal-conditioned pragmatic distortion in fact-grounded LLM outputs. Each scenario in our benchmark provides a fixed pool of favorable and adverse facts and compares a neutral condition against a goal-directed condition, such as increasing adoption, enrollment, approval, or support, despite potential harm to directly affected individuals or groups. Because all outputs are constrained to use the same fact pool, JANUS isolates misleading net impressions from hallucination and fabrication. JANUS contains 160 scenarios across 8 domains, with each scenario paired with neutral and goal-conditioned prompts and annotated material facts. Extensive experiments across 12 LLMs reveal consistent goal-conditioned distortions, demonstrating that current models remain sensitive to incentive and framing objectives and lack robust safeguards against selectively misleading communication. We publicly release our corpus and code for future research.


2606.11046 2026-06-10 cs.CL new submission

Does Reasoning Preserve Alignment? On the Trustworthiness of Large Reasoning Models

推理是否保持对齐？关于大型推理模型的可信度研究

Prajakta Kini, Avinash Reddy, Souradip Chakraborty, Satya Sai Srinath Namburi GNVV, Furong Huang, Amrit Singh Bedi, Alvaro Velasquez

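Affiliations: University of Colorado Boulder; University of Central Florida; University of Maryland College Park; University of Wisconsin-Madison

AI summary: Examines whether reasoning models produced via supervised fine-tuning, reinforcement learning, and distillation preserve alignment across six trustworthiness dimensions, including safety, bias, and privacy; finds that reasoning models often show alignment regressions such as increased toxicity and amplified stereotyping.

AI-generated Chinese abstract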

经过指令微调的LLM越来越多地通过后训练转化为推理模型，以提高多步任务性能。这种转化通常针对推理准确性进行优化，而没有明确保留指令微调模型的对齐行为，如安全拒绝、避免偏见和隐私保护。我们提出疑问：这种转化是否保持对齐？我们通过可信度审计研究这个问题，并发现默认情况下它并不保持行为。为了系统分析，我们比较了通过监督微调、基于RL的后训练和蒸馏产生的推理模型，与匹配的指令微调基线在六个可信度维度上的表现：安全性、毒性、刻板印象与偏见、机器伦理、隐私和分布外鲁棒性。我们观察到推理模型通常在推理基准上有所改进，但表现出对齐退化，包括毒性增加、刻板印象加剧、拒绝校准错误和上下文隐私泄露。这些退化与从指令微调基线的行为漂移一致，通过KL散度测量。总体而言，我们的结果指向更广泛的结论：可信度指标对于评估推理模型至关重要，并且应与推理能力的提升一起报告。

English abstract

Instruction-tuned LLMs are increasingly converted into reasoning models through post-training to improve multi-step task performance. This conversion is usually optimized for reasoning accuracy, without explicitly preserving the alignment behavior of the instruction-tuned model, such as safe refusal, bias avoidance, and privacy protection. We ask: does this conversion preserve alignment? We study this question through a trustworthiness audit and find that it is not behavior-preserving by default. For a systematic analysis, we compare reasoning models produced via supervised fine-tuning, RL-based post-training, and distillation against matched instruction-tuned baselines across six trustworthiness dimensions: safety, toxicity, stereotyping and bias, machine ethics, privacy, and out-of-distribution robustness. We observe that reasoning models often improve on reasoning benchmarks but exhibit alignment regressions, including increased toxicity, amplified stereotyping, miscalibrated refusal, and contextual privacy leakage. These regressions are consistent with behavioral drift from the instruction-tuned baseline, measured by KL divergence. Overall, our results point to the broader conclusion that trustworthiness metrics are essential for evaluating reasoning models and should be reported alongside gains in reasoning capability.
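As a reminder of the drift measure mentioned above, here is a minimal sketch of KL divergence between two discrete distributions, for example the next-token distributions of a baseline and a reasoning model on the same prompt. The abstract does not say which distributions are compared or in which direction, so that pairing and the function name are assumptions.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total
```

The value is zero when the two distributions match and grows as they diverge. It is not symmetric, so the choice of which model plays the role of `p` matters.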


2606.10279 2026-06-10 cs.AI cs.CL cs.LG cross-list

Supervised Fine-tuning with Synthetic Rationale Data Hurts Real-World Disease Prediction

使用合成理由数据进行监督微调损害真实世界疾病预测

Buxin Su, Bingxuan Li, Cheng Qian, Yiwei Wang, Jin Jin, Bingxin Zhao

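Affiliations: University of Pennsylvania; University of Illinois Urbana-Champaign; University of California, Merced

AI summary: Finds that supervised fine-tuning on synthetic rationale data substantially degrades model performance on a clinical prediction task, and attributes this to a structural conflict between narrative plausibility and discriminative optimization.

AI-generated Chinese abstract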

监督微调中使用合成理由数据被广泛认为能通过教导模型不仅预测什么而且预测原因来提升语言模型在临床预测任务上的性能。我们在基于纵向健康史进行五年阿尔茨海默病及相关痴呆症（ADRD）预测的任务上检验了这一假设。通过一项包含504种配置的大规模对照实验，我们发现，与仅使用标签的微调相比，基于理由的SFT始终且显著地损害了预测性能。这种退化在多个模型系列和数据规模中持续存在，并且无法通过使用面向推理的基础模型来解决。关键的是，这种失败并非由理由质量差所致：人类专家注释证实生成的理由在医学上是准确的，并且忠实于患者特定的证据；少样本实验表明，当相同的理由作为推理时的演示而非训练目标使用时，能提升性能。我们确定根本原因在于叙事合理性与判别优化之间的结构性冲突。我们希望我们的工作能为更精确地理解理由监督何时以及如何有帮助、何时无帮助铺平道路，从而指导在高风险临床预测中负责任地开发语言模型。

English abstract

Supervised fine-tuning with synthetic rationale data is widely assumed to improve language model performance on clinical prediction tasks by teaching models not just what to predict but why. We test this assumption on five-year Alzheimer's disease and related dementias (ADRD) prediction from longitudinal health histories. Across a large-scale controlled experiment of 504 configurations, we find that rationale-based SFT consistently and substantially hurts prediction performance relative to label-only fine-tuning. The degradation persists across model families and data scales, and is not resolved by using a reasoning-oriented base model. Crucially, the failure is not explained by poor rationale quality: human expert annotation confirms that the generated rationales are medically accurate and faithfully grounded in patient-specific evidence, and few-shot experiments show that the same rationales improve performance when used as inference-time demonstrations rather than training targets. We identify the root cause as a structural conflict between narrative plausibility and discriminative optimization. We hope our work paves the path toward a more precise understanding of when and how rationale-based supervision helps and when it does not, guiding the responsible development of language models for high-stakes clinical prediction.


2606.10481 2026-06-10 cs.LG cs.AI cs.CL cs.CR stat.ML cross-list

Advancing the State-of-the-Art in Empirical Privacy Auditing

推进经验隐私审计的最新水平

Nicole Mitchell, Galen Andrew, Arun Ganesh, Brendan McMahan, Peter Kairouz

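Affiliations: Google Research

AI summary: Proposes generating synthetic canaries via high-temperature sampling for empirical privacy auditing, introduces a synthetic-data audit based on an auxiliary model, and systematically studies how model capacity and canary entropy interact to affect memorization.

AI-generated Chinese abstract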

大型语言模型的参数高效微调可能表现出对个别训练示例的问题性记忆。经验隐私审计（EPA）通过测量成员推断（MI）或重构攻击上的实际数据泄露来量化这种风险。EPA的一个关键挑战是设计与隐私敏感训练数据混合的“金丝雀”示例。我们提出通过从LLM中进行高温采样（$T \geq 0.8$）生成合成金丝雀，使用针对隐私敏感训练数据定制的提示。这些金丝雀作为高影响异常值，确保高可识别性，从而实现强审计。此外，由于金丝雀本身是非私有的，它们是可检查的，并且可以重复插入，而不会危及真实数据的隐私。在隐私敏感数据上微调的模型的一个重要用途是生成合成数据。这也带来了隐私风险。我们引入了一种强大的合成数据审计方法，基于在合成数据上微调辅助模型。然后，对原始金丝雀的辅助模型进行审计，可以强有力地估计通过合成数据的隐私泄露。最后，利用我们强大的审计方法，我们系统研究了模型容量和金丝雀熵对记忆化的交互影响。

English abstract

Parameter-efficient fine-tuning of large language models (LLMs) can exhibit problematic memorization of individual training examples. Empirical privacy auditing (EPA) quantifies this risk by measuring realistic data leakage on membership inference (MI) or reconstruction attacks. A key challenge in EPA is designing "canary" examples that are mixed with the privacy-sensitive training data. We propose generating synthetic canaries via high-temperature sampling ($T \geq 0.8$) from LLMs, using prompts tailored to the privacy-sensitive training data. These canaries act as high-influence outliers, ensuring high identifiability and hence strong audits. Further, since the canaries are themselves non-private, they are inspectable and can be inserted with repetition without jeopardizing the privacy of the real data. An important use of models fine-tuned on privacy-sensitive data is the generation of synthetic data. This also comes with privacy risk. We introduce a powerful synthetic data audit based on fine-tuning an auxiliary model on the synthetic data. Auditing the auxiliary model for the original canaries then provides a strong estimate of the privacy leakage through the synthetic data. Finally, leveraging our strong auditing methodologies, we perform a systematic investigation into the interacting effects of model capacity and canary entropy on memorization.
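Temperature sampling, the mechanism behind the canary generation described above, can be sketched in a few lines. This shows only the generic technique on a toy logit vector; the function names are hypothetical and nothing here reflects the authors' prompts or models.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]

def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Raising the temperature shifts probability mass toward less likely tokens, which makes sampled text more unusual and more varied.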


2606.10860 2026-06-10 cs.CR cs.CL cross-list

对机器文本检测器的攻击保留风格指纹

Rafael Rivera Soto, Barry Chen, Nicholas Andrews

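Affiliations: GitHub; University of California, Berkeley

AI summary: Studies the limits of evasion attacks on machine-text detectors, proposes a paraphrasing approach that jointly optimizes for undetectability and adherence to a specific human style, and finds that single-document detection is unreliable, so multi-document analysis is needed.

AI-generated Chinese abstract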

尽管机器文本检测器的开发取得了显著进展，但机器文本容易被操纵以逃避检测，这导致有人认为该问题本质上是难以解决的。在这项工作中，我们研究了这种逃避策略的局限性。我们证明，尽管当前的攻击（从提示工程到检测器引导的优化）可以有效降低标准检测器的性能，但它们无法抹去机器文本底层的风格“指纹”。我们表明，利用风格特征空间的少样本检测器对这些逃避尝试具有鲁棒性，即使对于明确调整以逃避检测的模型生成的样本也能可靠地检测。这引发了一个问题：风格是否代表了对机器检测攻击的通用防御？我们通过引入一种新颖的 paraphrasing 方法来证明答案是“不”，该方法同时优化不可检测性和对特定人类风格的遵循。我们表明，与先前方法不同，这种攻击有效逃避了所有考虑的检测器，包括那些利用写作风格的检测器。然而，我们发现这种逃避并非绝对：随着可供分析的文档数量增加，人类和机器分布再次变得可区分。总体而言，我们的发现表明，可靠的机器文本检测需要从单文档分析转向多文档分析。

English abstract

Despite considerable progress in the development of machine-text detectors, the ease with which machine text can be manipulated to evade detection has led to suggestions that the problem is inherently intractable. In this work, we investigate the limits of such evasion strategies. We demonstrate that while current attacks, ranging from prompt engineering to detector-guided optimization, can effectively degrade the performance of standard detectors, they fail to erase the underlying stylistic "fingerprints" of machine text. We show that few-shot detectors that utilize the stylistic feature space are robust to these evasion attempts, reliably detecting samples even from models explicitly tuned to prevent detection. This raises the question: does style represent a universal defense against machine-detection attacks? We demonstrate that the answer is "no" by introducing a novel paraphrasing approach that simultaneously optimizes for undetectability and adherence to specific human styles. We show that unlike prior methods, this attack effectively evades all considered detectors, including those that utilize writing style. However, we find that this evasion is not absolute: as the number of documents available for analysis grows, the human and machine distributions become distinguishable again. Overall, our findings suggest that reliable machine-text detection requires moving beyond single-document analysis to multi-document analysis.


2512.16189 2026-06-10 cs.CL updated version

Mitigating hallucinations in healthcare LLMs with granular fact-checking and domain-specific adaptation

通过细粒度事实核查和领域特定适应减轻医疗保健大语言模型中的幻觉

Musarrat Zeba, Abdullah Al Mamun, Kishoar Jahan Tithee, Debopom Sutradhar, Mohaimenul Azam Khan Raiaan, Saddam Mukta, Reem E. Mohamed, Md Rafiqul Islam, Yakub Sebastian, Mukhtar Hussain, Sami Azam

Affiliations: Applied Artificial Intelligence and Intelligent Systems (AAIINS) Laboratory; Department of Computer Science and Engineering; Department of Data Science and Artificial Intelligence; Department of Software Engineering; Faculty of Science and Information Technology; Faculty of Science and Technology

AI summary: Proposes an LLM-independent fact-checking module and a domain-specific summarization model that reduce hallucinations through numerical tests and granular logical checks; fine-tuned and evaluated on the MIMIC III dataset, achieving high precision and recall.

Comments: Published in Expert Systems with Applications

DOI: 10.1016/j.eswa.2026.132966
Journal ref: Expert Systems with Applications, Vol. 329, 132966, 2026

AI中文摘要

哪种LoRA？多语言指令微调中LoRA技术有效性的实证研究

Thamali Wijewardhana, Napoleon H. Reyes, Surangika Ranathunga

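Affiliations: School of Mathematical and Computational Sciences, Massey University

AI summary: Experimentally compares basic LoRA with four variants for multilingual instruction tuning and finds that the more complex variants offer no significant advantage in balancing cross-lingual transfer and knowledge retention.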

2606.10520 2026-06-10 cs.CL new submission

UniSVQ: 2-bit Unified Scalar-Vector Quantization

UniSVQ: 2比特统一标量-向量量化

Haoyu Wang, Haiyan Zhao, Xingyu Yu, Zhangyang Yao, Xu Han, Zhiyuan Liu, Maosong Sun

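Affiliations: University of Science and Technology of China

AI summary: Proposes UniSVQ, which unifies scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices; at 2 bits it outperforms scalar quantization, is comparable to vector quantization, and delivers higher inference throughput.

Comments: Accepted by ICML 2026

AI-generated Chinese abstract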

2比特级别的训练后量化使得大型语言模型（LLMs）能够实现低成本部署和推理加速。标量量化（SQ）和向量量化（VQ）是两种主要的量化方法，然而前者遭受显著的性能下降，后者则带来计算和存储开销。我们提出UniSVQ，一个统一的2比特量化框架，通过将码字参数化为整数格点的仿射变换，桥接了标量和向量量化。这种结构保持了与优化整数内核的兼容性，同时保留了VQ的许多灵活性。我们进一步引入了一种数据驱动的块级微调策略，以直接最小化量化重建误差。在多个LLM家族和零样本基准上的大量实验表明，UniSVQ持续优于最先进的SQ方法，并实现了与高级VQ方法相当的性能，同时提供更高的推理吞吐量。

English abstract

Post-training quantization at the 2-bit level enables low-cost deployment and inference acceleration for large language models (LLMs). Scalar quantization (SQ) and vector quantization (VQ) are the two primary quantization methods; however, the former suffers from significant performance degradation, and the latter incurs computational and storage overhead. We propose UniSVQ, a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices. This structure preserves compatibility with optimized integer kernels while retaining much of VQ's flexibility. We further introduce a data-driven block-wise fine-tuning strategy to directly minimize quantization reconstruction error. Extensive experiments across multiple LLM families and zero-shot benchmarks demonstrate that UniSVQ consistently outperforms state-of-the-art SQ methods and achieves performance comparable to advanced VQ methods, while providing higher inference throughput.
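One way to read "codewords as an affine transform of integer lattices" is that each codeword equals `A @ z + b` for an integer vector `z` with 2-bit entries. The sketch below follows that reading only; it is an assumption, not the UniSVQ parameterization, and the function names, the brute-force nearest-codeword search, and the toy values of `A` and `b` are all hypothetical.

```python
from itertools import product

def affine_codebook(A, b, levels=4):
    """Map every integer lattice point z in {0..levels-1}^d to the codeword A @ z + b."""
    d = len(b)
    codebook = {}
    for z in product(range(levels), repeat=d):
        codebook[z] = tuple(
            sum(A[i][j] * z[j] for j in range(d)) + b[i] for i in range(d)
        )
    return codebook

def quantize(w, codebook):
    """Return the lattice point whose codeword is nearest to w (squared Euclidean distance)."""
    return min(
        codebook,
        key=lambda z: sum((c - x) ** 2 for c, x in zip(codebook[z], w)),
    )
```

With a diagonal `A`, for example `[[0.5, 0.0], [0.0, 0.5]]` and `b = [-0.75, -0.75]`, each dimension is quantized independently on a uniform grid, which is ordinary scalar quantization. A non-diagonal `A` couples the dimensions and gives a vector-quantization codebook, while the stored indices remain small integers.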


2606.10531 2026-06-10 cs.CL cs.AI new submission

LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

LC-QAT: 通过线性约束向量量化实现LLM的数据高效2比特QAT

Haoyu Wang, Xingyu Yu, Haiyan Zhao, Fengxiang Wang, Xu Han

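Affiliations: University of Science and Technology of China

AI summary: Proposes LC-QAT, a 2-bit weight-only vector-quantization QAT framework that uses a differentiable linear mapping to avoid discrete codebook lookup, enabling a high-quality PTQ initialization and end-to-end optimization; it surpasses existing methods with only 0.1%-10% of the training data.

Comments: Accepted by ICML 2026

AI-generated Chinese abstract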

量化感知训练（QAT）对于极低比特大语言模型（LLMs）至关重要。当前的QAT方法主要基于标量量化（SQ），虽然能高效优化，但在2比特精度下性能严重下降。另一方面，向量量化（VQ）提供了更高的表示能力，但其离散码本查找阻碍了端到端训练。我们提出LC-QAT，一种2比特权重量化的VQ-QAT框架，通过离散向量上的学习仿射映射表示量化权重，从而在训练前向传播中无需显式码本查找即可实现高质量PTQ初始化和完全可微的端到端优化。这种强大的训练后初始化使LC-QAT具有高度数据效率。在多种LLM上的实验表明，LC-QAT在使用仅0.1%-10%训练数据的情况下，始终优于最先进的QAT方法。我们的结果确立了LC-QAT作为极低比特模型部署的实用且可扩展的解决方案。

English abstract

Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%-10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.


2606.10610 2026-06-10 cs.CL new submission

Small Data, Big Noise: Adversarial Training for Robust Parameter-Efficient Fine-Tuning

小数据，大噪声：面向鲁棒参数高效微调的对抗训练

Eitan Cohen, Idan Simai, Uri Shaham

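Affiliations: Bar-Ilan University

AI summary: Proposes the SDBN framework, which combines adversarial training with parameter-efficient fine-tuning and uses variants based on discrete uncertainty sets to improve robustness and generalization in low-resource settings.

Comments: Accepted to Findings of ACL 2026

AI-generated Chinese abstract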

参数高效微调（PEFT）已成为将基础模型适应下游NLP任务的关键技术。然而，当前的PEFT方法在处理噪声鲁棒性和有限训练数据下的性能退化方面往往存在困难。我们提出SDBN（小数据大噪声），一个统一的框架，将对抗训练引入PEFT——尽管两者具有互补优势，但在PEFT设置中这一组合仍较少被研究——以增强模型鲁棒性和泛化能力，优于其他方法。我们还引入了该方法的两种变体，使用离散不确定性集：SDBN-h，枚举字符级编辑并使用梯度选择最坏情况变体；SDBN-p，使用LLM生成的变体进行生成任务中的鲁棒优化。跨多个基准的实验显示，特别是在低资源设置以及词级和字符级污染下，性能有显著提升。该框架解决了对抗训练与参数高效适应之间较少被探索的交集，无需引入额外参数或仅需适度的计算开销，使得在数据稀缺和语言变异性常共存的现实场景中，PEFT部署更加可靠。

English abstract

Parameter-Efficient Fine-Tuning (PEFT) has become essential for adapting foundation models to downstream NLP tasks. However, current PEFT methods often struggle with robustness to noise and performance degradation on limited training data. We propose SDBN (Small Data Big Noise), a unified framework that brings adversarial training to PEFT - a combination that remains less studied in the PEFT setting despite its complementary strengths - to enhance model robustness and generalization, outperforming alternative approaches. We also introduce two variants of the method that use discrete uncertainty sets: SDBN-h, which enumerates character-level edits and selects worst-case variants using gradients, and SDBN-p, which uses LLM-generated variants for robust optimization in generative tasks. Experiments across multiple benchmarks reveal substantial improvements, particularly in low-resource settings and under both word-level and character-level corruptions. This framework addresses the less explored intersection of adversarial training and parameter-efficient adaptation without introducing additional parameters and with only modest computational overhead, making PEFT deployments more reliable in real-world scenarios where data scarcity and linguistic variability often coexist.

2606.09927 2026-06-10 cs.LG cs.AI cs.CL 交叉投稿

Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization

可训练平滑旋转变换与学习通道尺度用于LLM量化

Patrik Czakó, Gábor Kertész, Sándor Szénási

AI总结针对大语言模型量化中激活值量化困难的问题，提出基于分位数鲁棒的缩放策略和梯度优化的通道尺度学习，在W4A4量化下显著降低误差。

Comments 6 pages, 8 figures, 3 tables. Accepted to IEEE INES 2026 conference proceedings

AI中文摘要

后训练量化（PTQ）是降低大语言模型（LLM）服务成本最实用的方法之一，但激活值量化仍然困难，因为异常值主导的通道会导致较大的量化误差。本文研究了这种退化是否部分由基于缩放的等效变换中的过度迁移引起。我们引入了一种用于SmoothRot风格变换的分位数鲁棒缩放策略，用高分位数替代基于最大值的激活统计量，并辅以通道尺度的约束梯度优化。在LLaMA-3.2-1B的W4A4量化下，仅分位数策略搜索相比SmoothRot基线将选定层误差降低11.1%，联合(alpha, q)搜索降低12%，训练达到18.5%。将最佳选定层策略重放到所有解码器块的下投影层，相应的全层平均误差从97.51降至78.08（19.9%）。结果表明，鲁棒的迁移控制和轻量级尺度学习在保持等效变换框架的同时，相比基于最大值的固定策略提供了持续改进。

英文摘要

Post-training quantization (PTQ) is one of the most practical ways to reduce the serving cost of Large Language Models (LLMs), but activation quantization remains difficult because outlier-dominated channels lead to large quantization errors. This paper investigates whether part of this degradation is caused by over-migration in scaling-based equivalent transformations. We introduce a quantile-robust scaling policy for SmoothRot-style transforms by replacing max-based activation statistics with high quantiles, and we complement it with constrained gradient-based optimization of channel scales. On LLaMA-3.2-1B under W4A4 quantization, quantile-only policy search improves selected-layer error by 11.1% over the SmoothRot baseline, joint (alpha, q) search improves it by 12%, and training reaches 18.5%. Replaying the best selected-layer policy on all decoder-block down-projection layers reduces the corresponding full-layer mean error from 97.51 to 78.08 (19.9%). The results show that robust migration control and lightweight scale learning provide consistent gains over max-based fixed policies while preserving the equivalent-transform framework.
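
The difference between max-based and quantile-based activation statistics can be shown with a small sketch. It assumes the SmoothQuant-style scale s_j = stat(|X_j|)^alpha / max(|W_j|)^(1 - alpha), where activations are divided by s_j and the matching weight rows are multiplied by it; whether SmoothRot uses exactly this form is an assumption, and the function names and default values are hypothetical.

```python
def quantile(values, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of a non-empty list."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def channel_scales(acts, weights, alpha=0.5, q=1.0, eps=1e-8):
    """Per-input-channel smoothing scales.

    acts:    list of token rows, each with C channel values
    weights: list of C rows (one per input channel) of output weights
    q = 1.0 reproduces the max-based statistic; q < 1.0 ignores the
    largest activation outliers when choosing the scale.
    """
    scales = []
    for j in range(len(weights)):
        act_stat = quantile([abs(row[j]) for row in acts], q)
        w_stat = max(abs(w) for w in weights[j])
        scales.append(max(act_stat, eps) ** alpha / max(w_stat, eps) ** (1 - alpha))
    return scales


# Channel 0 has one outlier token (100.0); channel 1 is well behaved.
acts = [[1.0, 0.5], [1.0, 0.5], [1.0, 0.5], [100.0, 0.5]]
weights = [[1.0, 1.0], [1.0, 1.0]]
max_based = channel_scales(acts, weights, alpha=0.5, q=1.0)
quantile_based = channel_scales(acts, weights, alpha=0.5, q=0.5)
```

With the max statistic, the single outlier pushes the channel-0 scale to 10, moving a large share of the quantization difficulty onto the weights; the median-based scale stays at 1. This is one reading of the "over-migration" the abstract refers to.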

2606.10445 2026-06-10 cs.LG cs.CL 交叉投稿

SpenseGPT: Practical One-shot Pruning Enabling Sparse and Dense GEMMs for LLM Inference

SpenseGPT: 面向LLM推理的实用一次性剪枝，支持稀疏和稠密GEMM

Jaeseong Lee, Seung-won Hwang, Samyam Rajbhandari

发表机构 * Snowflake AI Research（Snowflake AI研究）； Seoul National University（首尔大学）

AI总结提出Spense混合稀疏-稠密格式，将权重矩阵分为2:4稀疏和稠密区域，结合一次性剪枝方法SpenseGPT，在B200 GPU上实现高达1.2倍端到端解码加速，同时保持模型精度。

AI中文摘要

半结构化2:4稀疏性被现代加速器广泛支持，可提供高达2倍的理论加速。然而，其严格的50%稀疏性约束在训练后剪枝下常导致不可忽略的精度下降。同时，现有的宽松稀疏格式要么需要专门的编译器支持，要么引入限制端到端加速的运行时开销。我们提出Spense，一种实用的混合稀疏-稠密格式，将每个权重矩阵分为2:4稀疏区域和稠密区域。该设计放宽了有效稀疏性约束，同时保持与现有高性能稀疏和稠密GEMM库的兼容性，避免了自定义编译器支持和输入激活扩展。基于此格式，我们引入SpenseGPT，一种一次性训练后剪枝方法，生成稀疏和稠密区域。值得注意的是，我们表明选择正确的稠密区域很重要，并设计了两种不同的策略来选择它们。在Qwen3-32B和Seed-OSS-36B上的实验表明，我们的方法在B200 GPU上使用FP8精度实现了高达1.2倍的端到端解码加速，同时保持精度。据我们所知，这是首个在B200等最新GPU上通过半结构化稀疏张量核心实现真实世界端到端LLM解码加速并保持模型质量的一次性剪枝演示。

英文摘要

Semi-structured 2:4 sparsity is widely supported by modern accelerators, providing up to a 2x theoretical speedup. However, its strict 50% sparsity constraint often causes non-negligible accuracy degradation under post-training pruning. Meanwhile, existing relaxed sparsity formats either require specialized compiler support or introduce runtime overheads that limit end-to-end speedup. We propose Spense, a practical hybrid sparse-dense format that splits each weight matrix into a 2:4 sparse region and a dense region. This design relaxes the effective sparsity constraint while remaining compatible with existing high-performance sparse and dense GEMM libraries, avoiding both custom compiler support and input activation expansion. Building on this format, we introduce SpenseGPT, a one-shot post-training pruning method that produces sparse and dense regions. Notably, we show that selecting the right dense regions is important, and we devise two different strategies to choose them. Experiments on Qwen3-32B and Seed-OSS-36B demonstrate that our method achieves up to 1.2x end-to-end decoding speedup on B200 GPUs with FP8 precision, while preserving accuracy. To the best of our knowledge, this is the first one-shot pruning demonstration of real-world end-to-end LLM decoding speedup from semi-structured sparse tensor cores on recent GPUs such as B200s, while maintaining model quality.
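
A hybrid sparse-dense layout of this kind can be sketched in a few lines. The sketch makes two assumptions that the abstract does not confirm: the dense region is taken to be the trailing `dense_cols` columns (the paper states that choosing the dense region matters and proposes its own strategies), and plain magnitude pruning stands in for the SpenseGPT pruning procedure. All function names are hypothetical.

```python
def prune_2_4(row):
    """Keep the 2 largest-magnitude values in every group of 4; zero the rest."""
    assert len(row) % 4 == 0, "2:4 sparsity needs a length divisible by 4"
    out = []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def split_sparse_dense(weight, dense_cols):
    """Split each row into a 2:4-pruned region and an untouched dense region.

    Hypothetical choice: the trailing `dense_cols` columns stay dense.
    """
    sparse_cols = len(weight[0]) - dense_cols
    sparse = [prune_2_4(row[:sparse_cols]) for row in weight]
    dense = [list(row[sparse_cols:]) for row in weight]
    return sparse, dense


def effective_sparsity(n_cols, dense_cols):
    """Fraction of zeroed weights: 50% of the sparse region, 0% of the dense one."""
    return 0.5 * (n_cols - dense_cols) / n_cols


W = [[1.0, -4.0, 0.5, 3.0, 2.0, 2.0, 0.1, -0.2, 9.0, 9.0, 9.0, 9.0]]
sparse_part, dense_part = split_sparse_dense(W, dense_cols=4)
```

Because each region is a regular 2:4 or dense matrix, the product can be computed as one sparse GEMM plus one dense GEMM over the matching slices of the input, which is consistent with the abstract's claim of compatibility with existing GEMM libraries.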

2603.14463 2026-06-10 cs.CL 版本更新

An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs

一个工业级保险大语言模型：在不牺牲通用能力的前提下实现可验证的领域掌握与幻觉控制

Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin

发表机构 * Ant Group（蚂蚁集团）

AI总结提出INS-S1保险专用大语言模型，通过可验证数据合成系统和渐进式SFT-RL课程框架，在领域任务上达到SOTA，同时保持通用能力并实现0.6%的低幻觉率。

Comments 21 pages, 12 figures, 17 tables

Journal ref: ICLR 2026 Workshop Advances in Financial AI

AI中文摘要

将大语言模型（LLM）适应到保险等高风险垂直领域面临重大挑战：场景要求严格遵守复杂法规和业务逻辑，对幻觉零容忍。现有方法常遭受能力权衡——牺牲通用智能换取领域专长——或过度依赖RAG而缺乏内在推理。为弥合这一差距，我们提出了INS-S1，一个通过新颖的端到端对齐范式训练的保险专用LLM系列。我们的方法包含两项方法论创新：（1）可验证数据合成系统，构建用于精算推理和合规的分层数据集；（2）渐进式SFT-RL课程框架，将动态数据退火与验证推理（RLVR）和AI反馈（RLAIF）的协同混合相结合。通过优化数据比例和奖励信号，该框架强制执行领域约束，同时防止灾难性遗忘。此外，我们发布了INSEva，迄今为止最全面的保险基准（39k+样本）。大量实验表明，INS-S1在领域任务上达到SOTA，显著优于DeepSeek-R1和Gemini-2.5-Pro。关键的是，它保持了顶级的通用能力，并实现了创纪录的0.6%幻觉率（HHEM）。我们的结果表明，严格领域专业化可以在不牺牲通用智能的情况下实现。

英文摘要

Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.

2605.28066 2026-06-10 cs.CL cs.AI 版本更新

PromptEmbedder: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting

PromptEmbedder：通过双LLM软提示实现高效且可迁移的文本嵌入

Yu-Che Tsai, Kuan-Yu Chen, Yuan-Hao Chen, Yu-Han Chang, Ching-Yu Tsai, Yu-Hsiang Chuang, Shou-De Lin

发表机构 * Department of Computer Science and Information Engineering, National Taiwan University（国立台湾大学计算机科学与资讯工程系）； National Taiwan University AI Center of Research Excellence（国立台湾大学人工智能研究中心）

AI总结提出PromptEmbedder双LLM框架，通过可微分的软提示生成将嵌入知识从特定骨干权重中解耦，在保持性能的同时降低40% GPU内存并加速3.7倍训练。

AI中文摘要

大型语言模型（LLM）在文本嵌入方面展现出显著效果，但当前的适应方法（如LoRA）在计算效率和跨架构可迁移性方面面临重大瓶颈。每当出现新的骨干网络时，现有方法需要从头开始进行昂贵的重新训练。为了解决这个问题，我们提出了PromptEmbedder，一种新颖的双LLM框架，将嵌入知识与特定骨干权重解耦。PromptEmbedder利用一个提示LLM通过连续松弛的可微分生成过程，为冻结的嵌入LLM生成指令感知的软提示，确保对比训练期间的全梯度流动。通过将任务特定知识定位在提示LLM中，适应新架构只需重新训练一个轻量级的线性对齐矩阵。在MTEB基准上的评估表明，PromptEmbedder实现了与LoRA微调相当的性能，同时将GPU内存减少40%，训练速度提升3.7倍。我们的方法建立了一种可扩展、架构无关的范式，用于高效的基于LLM的表示学习。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable efficacy in text embedding, yet current adaptation methods like LoRA face significant bottlenecks in computational efficiency and cross-architecture transferability. Whenever a new backbone emerges, existing approaches require costly retraining from scratch. To address this, we propose PromptEmbedder, a novel dual-LLM framework that decouples embedding knowledge from specific backbone weights. PromptEmbedder utilizes a Prompting LLM to generate instruction-aware soft prompts for a frozen Embedding LLM via a differentiable generation process with continuous relaxation, ensuring full gradient flow during contrastive training. By localizing task-specific knowledge within the Prompting LLM, adapting to new architectures requires only retraining a lightweight linear alignment matrix. Evaluations on the MTEB benchmark show that PromptEmbedder achieves comparable performance with LoRA finetuning while reducing GPU memory by 40% and accelerating training by 3.7x. Our approach establishes a scalable, architecture-agnostic paradigm for efficient LLM-based representation learning.

2606.09466 2026-06-10 cs.CL 版本更新

DECSELFMASK: Leveraging Unlabeled Text via Self-Relevance-Guided Masking for Decoder-Only Classification

DECSELFMASK: 通过自相关引导掩码利用未标记文本进行仅解码器分类

Pietro Ferrazzi, Matteo Merler, Giovanni Bonetta, Alberto Lavelli, Bernardo Magnini

发表机构 * Fondazione Bruno Kessler, Trento, Italy（布鲁诺·凯斯勒基金会，特伦托，意大利）； University of Padova, Italy（帕多瓦大学，意大利）

AI总结提出DecSelfMask方法，利用相关性归因引导掩码策略从无标签数据创建自监督训练样本，通过下一词预测重构掩码部分，提升仅解码器模型在分类任务上的性能，在136个临床任务上平均Macro F1提升19.9点。

AI中文摘要

分类任务需要标注数据，但收集这些数据往往昂贵、耗时甚至不可行。医学领域尤其如此，大型数据集通常只有少量标注样本。为解决这一问题，我们提出DecSelfMask（通过掩码进行解码器自学习），一种增强仅解码器模型在分类任务上性能的方法。我们基于常见的自学习方法，利用模型从无标签数据创建训练样本，并提出一种新颖的相关性引导掩码策略。我们使用相关性归因方法确定未标注文本中与任务相关的部分。然后通过掩码这些部分创建自监督训练样本，训练模型通过下一词预测重建它们。我们假设这些样本传达了关于未标注数据结构和语义的知识，可能对下游性能有用。我们在来自一家意大利医院的190万份临床笔记的136个任务上测试了我们的方法。我们在5个不同规模和系列的模型上量化了DecSelfMask对下游任务的影响，包括探测分析。实验显示持续改进，优于标准监督微调方法（Macro F1提高19.9点）、合成标签生成（提高12.5点）和持续预训练（提高6.3点），以及常见基线。

英文摘要

Classification tasks require annotated data, which can often be expensive, time-consuming, or even unfeasible to collect. This is the case of the medical domain, where large datasets often have few annotated examples. To address this, we propose DecSelfMask (Decoder Self-learning by Masking), an approach to enhance decoder-only performance on classification tasks. We build on common self-learning approaches by leveraging a model to create training examples from unlabeled data to propose a novel relevance-guided masking strategy. We use relevance attribution methods to determine what portions of unannotated texts are relevant for a task. We then create self-supervised training examples by masking out those portions, training the model to reconstruct them via next-token-prediction. We hypothesize that those examples convey knowledge about the structure and semantics of unannotated data that can be useful for downstream performance. We test our approach on 136 tasks from a collection of 1.9M clinical notes from an Italian hospital. We quantify DecSelfMask's impact on downstream tasks on 5 models of different scales and families, including a probing analysis. Experiments show consistent gains, outperforming standard supervised fine-tuning approaches (+19.9 points in Macro F1), synthetic label generation (+12.5), and continual pretraining (+6.3), as well as common baselines.

2606.10402 2026-06-10 cs.CL cs.AI 新提交

Harnessing the Collective Intelligence of AI Agents in the Wild for New Discoveries

利用野外AI代理的集体智慧实现新发现

Federico Bianchi, Yongchan Kwon, Aneesh Pappu, James Zou

发表机构 * Together AI ； Stanford University（斯坦福大学）

AI总结提出EinsteinArena平台，通过开放分布式环境中的自主代理交互，在数学问题中实现12项新最优结果，展示了集体AI驱动研究的范式。

AI中文摘要

科学发现通常是一个集体过程：研究人员分享部分结果，检查失败的尝试，并在长时间跨度内相互借鉴想法。最近的AI系统表明，基于语言模型的代理可以在开放科学问题上取得有意义的进展，但大多数现有系统孤立运行。在本文中，我们提出EinsteinArena，一个面向开放分布式研究和发现的代理原生平台。EinsteinArena为代理提供一组实时开放问题，每个问题都有可靠的验证器、公共排行榜和特定问题的讨论论坛，代理可以在其中提问和分享见解。我们专注于引起大量研究兴趣的数学任务，其进展可以明确衡量。截至2026年5月，EinsteinArena上的代理已发现12项新的最优结果，优于以往任何人类或AI解决方案。一个显著例子是11维接吻数问题，该平台将已知最佳下界从593提高到604。这一进展并非来自单个代理或孤立运行，而是通过一系列提交、公开讨论、验证器改进以及后续代理间的思想借鉴而产生的。这些结果证明，去中心化的科学发现可以从自主代理在野外的开放交互中涌现，展示了集体AI驱动研究的新范式。

英文摘要

Scientific discovery is often a collective process: researchers share partial results, inspect failed attempts, and build on each other's ideas over long time horizons. Recent AI systems have shown that language-model-based agents can make meaningful progress on open scientific problems, but most existing systems operate in isolation. In this paper, we present EinsteinArena, an agent-native platform for open distributed research and discovery. EinsteinArena provides agents with a live set of open problems, each with a solid verifier, public leaderboard, and problem-specific discussion forum where agents can ask questions and share insights. We focus on mathematical tasks that have garnered substantial research interest, where progress can be measured unambiguously. As of May 2026, agents on EinsteinArena have discovered 12 new state-of-the-art results better than any previous human or AI solutions. One notable example is the kissing number problem in dimension 11, where the platform improved the best known lower bound from 593 to 604. This advance did not come from a single agent or isolated run. Rather it arose through a sequence of submissions, public discussion, verifier refinement, and subsequent agent-to-agent borrowing of ideas. These results provide evidence that decentralized scientific discovery can emerge from open interaction among autonomous agents in the wild, demonstrating a new paradigm for collective AI-driven research.
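
What a verifier for a kissing-number lower bound has to check can be sketched as follows. A set of n unit vectors whose pairwise inner products are all at most 1/2 (angular separation of at least 60 degrees) certifies that n non-overlapping unit spheres can touch a central one. The abstract does not describe the platform's actual verifier, so the function name and the floating-point tolerance are assumptions; a rigorous check would use exact or interval arithmetic.

```python
import math


def is_kissing_configuration(points, tol=1e-9):
    """True if all points are unit vectors with pairwise inner products <= 1/2."""
    for p in points:
        norm = math.sqrt(sum(c * c for c in p))
        if abs(norm - 1.0) > tol:
            return False  # not on the unit sphere
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dot = sum(a * b for a, b in zip(points[i], points[j]))
            if dot > 0.5 + tol:
                return False  # the two touching spheres would overlap
    return True


def regular_polygon(n):
    """n equally spaced unit vectors in the plane."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]


hexagon_ok = is_kissing_configuration(regular_polygon(6))
heptagon_ok = is_kissing_configuration(regular_polygon(7))
```

In two dimensions the hexagon passes and the heptagon fails, matching the known kissing number of 6 in the plane. A lower bound of 604 in dimension 11 corresponds to exhibiting 604 vectors in 11 dimensions that pass the same check.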

2606.10796 2026-06-10 cs.CL cs.AI 新提交

大型语言模型通过文化不平等的基线感知城市

Rong Zhao, Wanqi Liu, Zhizhou Sha, Nanxi Su, Yecheng Zhang, Ying Long

发表机构 * Centre for Advanced Spatial Analysis (CASA), UCL, London, UK（高级空间分析中心（CASA），伦敦大学学院，英国）； School of Architecture, Tsinghua University, Beijing, China（清华大学建筑学院，北京，中国）； Department of Computer Science, UT Austin, Austin, TX, USA（得克萨斯大学奥斯汀分校计算机科学系，奥斯汀，德克萨斯，美国）

AI总结本研究通过全球平衡的街景样本测试前沿LLM的城市感知，发现中性提示实际上偏向欧美文化，且文化提示能改变情感评价但无法恢复人类语义多样性。

AI中文摘要

大型语言模型（LLM）越来越多地被用于描述、评估和解释地点，但目前尚不清楚它们是否从文化中立的立场出发。本文使用平衡的全球街景样本和保持中立或调用不同区域文化立场的提示，测试前沿LLM的城市感知。在开放式描述和结构化地点判断中，中性条件在实践中并非中立。与欧洲和北美相关的提示在系统上比许多非西方提示更接近基线，表明模型感知围绕文化不平等的参考框架而非通用框架组织。文化提示也改变了情感评价，对某些提示身份产生基于情感的群体内偏好。与区域人类文本-图像基准的比较表明，文化接近的提示可以改善与人类描述的一致性，但未能恢复人类水平的语义多样性，并且通常保留了情感提升的风格。同样的不对称性出现在安全性、美丽、财富、活力、无聊和抑郁的结构化判断中，模型输出是可解释的，但仅部分再现了人类群体差异。这些发现表明，LLM并非从虚无中感知城市：它们通过一个文化不平等的基线来感知，该基线塑造了什么是普通、熟悉和积极评价的。

英文摘要

Large language models (LLMs) are increasingly used to describe and evaluate cities, yet the cultural structure of their urban judgments remains understudied. Here we introduce a measurement framework for testing whether LLM-based urban perception is culturally neutral, using a globally stratified street-view image dataset. Open-ended descriptions and structured scores generated by three frontier multimodal models all show that the neutral baseline lies closer to regional framings associated with Europe and North America than to other cultural framings. Comparisons between AI and human urban perception further show that prompting can move AI responses closer to specific regional human descriptions, but fails to recover the variety and diversity of human responses, flattening observed demographic patterns and introducing sentiment-based self-favouring bias. These results indicate a systematic risk in treating AI as a neutral tool for urban tasks, especially when model outputs are used to compare, evaluate or represent cities across cultural contexts.

2604.04287 2026-06-10 cs.LG cs.CL q-bio.GN 版本更新

Entropy, Disagreement, and the Limits of Foundation Models in Genomics

熵、分歧与基因组基础模型的局限性

Maxime Rochkoulets, Lovro Vrček, Mile Šikić

发表机构 * Genome Institute of Singapore, A*STAR（新加坡基因组研究院，A*STAR）； KU Leuven（鲁汶大学）； Faculty of Electrical Engineering and Computing, University of Zagreb（萨格勒布大学电气工程与计算学院）

AI总结本文通过分析熵对模型学习的影响，发现基因组序列的高熵导致输出分布接近均匀、模型间分歧大和静态嵌入不稳定，且Fisher信息集中在嵌入层，表明仅靠序列自监督训练可能不适用于基因组数据。

Comments Accepted to LMLR Workshop at ICLR 2026

AI中文摘要

基因组学中的基础模型与自然语言处理中的基础模型相比，成功程度参差不齐。然而，其有效性有限的原因仍不清楚。在这项工作中，我们研究了熵作为限制此类模型从训练数据中学习并发展基础能力的基本因素的作用。我们在文本和DNA序列上训练模型集成，并分析它们的预测、静态嵌入和经验Fisher信息流。我们表明，从未见标记预测的角度来看，基因组序列的高熵导致输出分布接近均匀、模型间分歧大以及静态嵌入不稳定，即使模型在架构、训练和数据上匹配也是如此。然后，我们证明在DNA上训练的模型将Fisher信息集中在嵌入层，似乎未能利用标记间关系。我们的结果表明，仅从序列进行自监督训练可能不适用于基因组数据，这质疑了当前训练基因组基础模型方法背后的假设。

英文摘要

Foundation models in genomics have shown mixed success compared to their counterparts in natural language processing. Yet, the reasons for their limited effectiveness remain poorly understood. In this work, we investigate the role of entropy as a fundamental factor limiting the capacities of such models to learn from their training data and develop foundational capabilities. We train ensembles of models on text and DNA sequences and analyze their predictions, static embeddings, and empirical Fisher information flow. We show that the high entropy of genomic sequences -- from the point of view of unseen token prediction -- leads to near-uniform output distributions, disagreement across models, and unstable static embeddings, even for models that are matched in architecture, training and data. We then demonstrate that models trained on DNA concentrate Fisher information in embedding layers, seemingly failing to exploit inter-token relationships. Our results suggest that self-supervised training from sequences alone may not be applicable to genomic data, calling into question the assumptions underlying current methodologies for training genomic foundation models.
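
The two quantities in this argument, output entropy and cross-model disagreement, can be made concrete with a short sketch. Shannon entropy is standard; using Jensen-Shannon divergence as the disagreement measure is an assumption, since the abstract does not say which measure the authors use. The example distributions are invented for illustration.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def js_divergence_bits(p, q):
    """Jensen-Shannon divergence in bits; 0 means the two models agree exactly."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Illustrative next-token distributions over the nucleotides A, C, G, T.
uniform = [0.25, 0.25, 0.25, 0.25]
model_a = [0.30, 0.20, 0.28, 0.22]
model_b = [0.22, 0.29, 0.21, 0.28]

max_entropy = entropy_bits(uniform)          # upper bound for a 4-symbol alphabet
entropy_a = entropy_bits(model_a)            # close to that bound
disagreement = js_divergence_bits(model_a, model_b)
```

For a four-letter alphabet the entropy ceiling is 2 bits, so a model whose predictions sit near that ceiling carries little information about the next token, which is the regime the abstract describes for DNA.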

2602.17547 2026-06-10 cs.AI cs.CL 版本更新

KLong: Training LLM Agent for Extremely Long-horizon Tasks

KLong：训练用于超长程任务的LLM代理

Yue Liu

AI总结KLong通过轨迹分割SFT和渐进式RL训练来解决超长程任务，使106B模型在PaperBench上超越Kimi K2 Thinking 11.28%。

Comments We request standard withdrawal of this submission because significant errors were discovered in the data after submission, which affect the validity of the results. We may submit a corrected version later

AI中文摘要

本文介绍了KLong，一种开源的LLM代理，旨在解决超长程任务。其原理是首先通过轨迹分割SFT冷启动模型，然后通过渐进式RL训练进行扩展。具体而言，我们首先使用全面的SFT配方激活基础模型的基本代理能力。然后，我们引入Research-Factory，一个自动化管道，通过收集研究论文和构建评估标准来生成高质量的训练数据。利用该管道，我们构建了数千条从Claude 4.5 Sonnet（Thinking）蒸馏而来的长程轨迹。为了在这些极长的轨迹上进行训练，我们提出了一种新的轨迹分割SFT，该方法保留早期上下文，逐步截断后期上下文，并保持子轨迹之间的重叠。此外，为了进一步提高长程任务解决能力，我们提出了一种新的渐进式RL，将训练分为多个阶段，逐步延长超时时间。实验表明KLong的优越性和泛化能力，如图1所示。值得注意的是，我们的KLong（106B）在PaperBench上超越Kimi K2 Thinking（1T）11.28%，且性能提升泛化到其他编码基准如SWE-bench Verified和MLE-bench。

英文摘要

This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The principle is to first cold-start the model via trajectory-splitting SFT, then scale it via progressive RL training. Specifically, we first activate basic agentic abilities of a base model with a comprehensive SFT recipe. Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train with these extremely long trajectories, we propose a new trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. In addition, to further improve long-horizon task-solving capability, we propose a novel progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization of KLong, as shown in Figure 1. Notably, our proposed KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the performance improvement generalizes to other coding benchmarks like SWE-bench Verified and MLE-bench.

2603.03339 2026-06-10 cs.CY cs.AR cs.CL cs.HC 版本更新

Offline-First LLM Architecture for Adaptive Learning in Low-Connectivity Environments

面向低连接环境的离线优先LLM架构：用于自适应学习

Joseph Walusimbi, Ann Move Oguti, Joshua Benjamin Ssentongo, Keith Ainebyona

发表机构 * University of Nairobi（内罗毕大学）

AI总结本文提出一种离线优先的LLM架构，适用于低连接环境中的自适应学习，通过本地推理和硬件感知模型选择，提供课程对齐的解释和结构化学术支持，适应不同教育阶段的学习者需求。

Comments 16 pages, 10 figures, 2 tables

AI中文摘要

人工智能（AI）和大语言模型（LLMs）通过使对话辅导、个性化解释和探究式学习成为可能，正在改变教育技术。然而，大多数基于AI的学习系统依赖持续的互联网连接和云计算，限制了其在带宽受限环境中的使用。本文提出了一种面向低连接环境的离线优先大语言模型架构，该系统通过量化语言模型在本地进行所有推理，并结合硬件感知的模型选择，使部署在低规格CPU设备上成为可能。通过去除对云基础设施的依赖，该系统通过自然语言交互提供课程对齐的解释和结构化的学术支持。为了支持不同教育阶段的学习者，该系统包括自适应响应级别，生成不同复杂程度的解释：简单英语、初级中学、高级中学和技术。这使解释能够根据学生能力进行调整，提高学术概念的清晰度和理解。该系统在有限连接条件下部署于选定的中学和高等教育机构，并在技术性能、可用性、感知响应质量和教育影响方面进行了评估。结果显示，在传统硬件上稳定运行，响应时间可接受，用户对支持自主学习的支持有积极评价。这些发现证明了在低连接环境中离线大语言模型部署用于AI辅助教育的可行性。

英文摘要

Artificial intelligence (AI) and large language models (LLMs) are transforming educational technology by enabling conversational tutoring, personalized explanations, and inquiry-driven learning. However, most AI-based learning systems rely on continuous internet connectivity and cloud-based computation, limiting their use in bandwidth-constrained environments. This paper presents an offline-first large language model architecture designed for AI-assisted learning in low-connectivity settings. The system performs all inference locally using quantized language models and incorporates hardware-aware model selection to enable deployment on low-specification CPU-only devices. By removing dependence on cloud infrastructure, the system provides curriculum-aligned explanations and structured academic support through natural-language interaction. To support learners at different educational stages, the system includes adaptive response levels that generate explanations at varying levels of complexity: Simple English, Lower Secondary, Upper Secondary, and Technical. This allows explanations to be adjusted to student ability, improving clarity and understanding of academic concepts. The system was deployed in selected secondary and tertiary institutions under limited-connectivity conditions and evaluated across technical performance, usability, perceived response quality, and educational impact. Results show stable operation on legacy hardware, acceptable response times, and positive user perceptions regarding support for self-directed learning. These findings demonstrate the feasibility of offline large language model deployment for AI-assisted education in low-connectivity environments.

2509.11517 2026-06-10 cs.CL cs.LG 版本更新

PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation

PeruMedQA：在秘鲁医学考试上评估大语言模型（LLMs）——数据集构建与评估

Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca

发表机构 * Hubert Department of Global Health, Rollins School of Public Health, Emory University（霍伯特全球健康部门，埃默里大学公共卫生学院）； Emory Global Diabetes Research Center of Woodruff Health Sciences Center, Emory University（埃默里大学伍德鲁夫健康科学中心全球糖尿病研究中心）； Institut de Recherche en Informatique de Toulouse（图卢兹信息研究院）； Universidad Nacional de Educación a Distancia（远程教育国立大学）； Instituto de Investigación Científica, Universidad de Lima（科学研究所，利马大学）； Barcelona Supercomputing Center（巴塞罗那超级计算中心）

AI总结本文构建了包含8380道题的秘鲁医学考试数据集，通过微调大语言模型并对比不同模型的准确率，揭示了在西班牙语国家医学问题上的性能差异。

Comments https://github.com/rodrigo-carrillo/PeruMedQA

DOI: 10.1007/s40670-026-02692-w

AI中文摘要

背景：医疗大语言模型（LLMs）在回答医学考试中表现出色，但其在西班牙语和拉丁美洲国家的医疗问题上的泛化能力尚不明确。目标：构建秘鲁医师专科培训考试试题数据集，在该数据集上微调一个LLM，并评估和比较原始LLMs与微调LLM的准确率。方法：我们整理了包含8380道题的多项选择问答数据集PeruMedQA，涵盖12个专科（2018-2025年）。我们选择了10个医学LLMs，包括medgemma-4b-it和medgemma-27b-text-it，并开发了零样本任务特定提示来回答问题。我们使用参数高效微调（PEFT）和低秩适应（LoRA）对medgemma-4b-it进行微调，训练数据为除2025年（测试集）以外的所有问题。结果：medgemma-27b在各专科中准确率最高，其中精神科得分最高，为89.29%；然而，在两个专科中，OctoMed-7B略胜一筹：神经外科分别为77.27%和77.38%，放射科分别为76.13%和77.39%。在专科层面，大多数参数少于100亿的LLM正确率低于50%。微调版的medgemma-4b-it优于所有参数少于100亿的LLM，并在多项考试中与一个700亿参数的LLM相当。结论：对于需要来自西班牙语国家以及与秘鲁流行病学特征相似的国家的知识库的医疗AI应用和研究，应使用medgemma-27b-text-it。

英文摘要

BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations. However, the extent to which this high performance is transferable to medical questions in Spanish and from a Latin American country remains unexplored. This knowledge is crucial as LLM-based medical applications gain traction in Latin America. AIMS: To build a dataset of questions from medical examinations taken by Peruvian physicians pursuing specialty training; to fine-tune an LLM on this dataset; to evaluate and compare the performance in terms of accuracy between vanilla LLMs and the fine-tuned LLM. METHODS: We curated PeruMedQA, a multiple-choice question-answering (MCQA) dataset containing 8,380 questions spanning 12 specialties (2018-2025). We selected ten medical LLMs, including medgemma-4b-it and medgemma-27b-text-it, and developed zero-shot task-specific prompts to answer the questions. We employed parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) to fine-tune medgemma-4b-it utilizing all questions except those from 2025 (test set). RESULTS: Medgemma-27b showed the highest accuracy across all specialties, achieving the highest score of 89.29% in Psychiatry; yet, in two specialties, OctoMed-7B exhibited slight superiority: Neurosurgery with 77.27% and 77.38%, respectively; and Radiology with 76.13% and 77.39%, respectively. Across specialties, most LLMs with <10 billion parameters exhibited <50% of correct answers. The fine-tuned version of medgemma-4b-it outperformed all LLMs with <10 billion parameters and rivaled an LLM with 70 billion parameters across various examinations. CONCLUSIONS: For medical AI applications and research that require knowledge bases from Spanish-speaking countries and those exhibiting a similar epidemiological profile to Peru's, interested parties should utilize medgemma-27b-text-it.

2512.04799 2026-06-10 cs.CL 版本更新

DaLA: Danish Linguistic Acceptability Evaluation Guided by Real World Errors

DaLA：由现实世界错误引导的丹麦语言可接受性评估

Gianluca Barmina, Nathalie Carmen Hau Norman, Peter Schneider-Kamp, Lukas Galke Poech

发表机构 * University of Southern Denmark（南丹麦大学）； University of Copenhagen（哥本哈根大学）

AI总结本文提出一个增强的丹麦语语言可接受性评估基准，通过分析常见错误并引入14种破坏函数生成错误句子，验证其有效性后用于评估大型语言模型的可接受性判断任务，结果显示该基准比现有基准覆盖更广、更全面。

DOI: 10.63317/4kcbotaa3zgo
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026)

AI中文摘要

我们提出一个增强的丹麦语语言可接受性评估基准。我们首先分析书面丹麦语中最常见的错误。基于此分析，我们引入十四种破坏函数，通过系统性地向现有正确丹麦语句子中引入错误来生成不正确的句子。为了确保这些破坏操作的准确性，我们使用手动和自动方法评估其有效性。结果随后被用作基准，评估大型语言模型在语言可接受性判断任务上的表现。我们的发现表明，这种扩展比当前最先进的方法更广泛和更全面。通过纳入更多种类的破坏类型，我们的基准提供了更严格的语言可接受性评估，增加了任务难度，这体现在LLMs在我们基准上的表现低于在现有基准上的表现。我们的结果还表明，我们的基准具有更高的区分能力，能够更好地区分表现优异的模型和表现较差的模型。

英文摘要

We present an enhanced benchmark for evaluating linguistic acceptability in Danish. We first analyze the most common errors found in written Danish. Based on this analysis, we introduce a set of fourteen corruption functions that generate incorrect sentences by systematically introducing errors into existing correct Danish sentences. To ensure the accuracy of these corruptions, we assess their validity using both manual and automatic methods. The results are then used as a benchmark for evaluating Large Language Models on a linguistic acceptability judgement task. Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art. By incorporating a greater variety of corruption types, our benchmark provides a more rigorous assessment of linguistic acceptability, increasing task difficulty, as evidenced by the lower performance of LLMs on our benchmark compared to existing ones. Our results also suggest that our benchmark has higher discriminatory power, which makes it possible to better distinguish well-performing models from low-performing ones.

2501.12486 2026-06-10 cs.LG cs.CL 版本更新

The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

训练过程至关重要：平均预训练参数计数统一了稀疏和密集的扩展规律

Tian Jin, Ahmed Imtiaz Humayun, Utku Evci, Suvinay Subramanian, Amir Yazdanbakhsh, Dan Alistarh, Gintare Karolina Dziugaite

发表机构 * MIT CSAIL（麻省理工学院计算机科学与人工智能实验室）； Rice University（莱斯大学）； Google Research（谷歌研究）； Google DeepMind（谷歌DeepMind）； Google（谷歌）； IST Austria（奥地利科学技术研究所）

AI总结本文通过研究80种不同的剪枝计划，发现预训练过程中在25%和75%的计算量启动和结束剪枝可获得最佳评估损失，提出新的扩展规律统一了稀疏和密集预训练的扩展规律。

Comments 17 pages

Journal ref: The Thirteenth International Conference on Learning Representations (ICLR), 2025

AI中文摘要

剪枝通过消除神经网络中不必要的参数，为大型语言模型（LLMs）日益增长的计算需求提供了一个有前途的解决方案。虽然许多研究关注训练后的剪枝，但将剪枝和预训练结合到一个阶段的稀疏预训练提供了一个更简单的替代方案。在本文中，我们通过研究80种不同的剪枝计划，探讨了不同稀疏度和训练持续时间下的最优稀疏预训练配置。我们发现，在总训练计算量的25%处启动剪枝并在75%处结束可获得接近最优的最终评估损失。这些发现为高效且有效的LLMs稀疏预训练提供了有价值的见解。此外，我们提出了一种新的扩展规律，修改了Chinchilla扩展规律以使用预训练期间的平均参数计数。通过实证和理论验证，我们证明了这种修改后的扩展规律能够准确地建模稀疏和密集预训练LLMs的评估损失，统一了预训练范式的扩展规律。我们的发现表明，虽然稀疏预训练在等效计算预算下能获得与密集预训练相同的最终模型质量，但通过减少模型大小，它在推理过程中提供了显著的计算节省潜力。

英文摘要

Pruning eliminates unnecessary parameters in neural networks; it offers a promising solution to the growing computational demands of large language models (LLMs). While many focus on post-training pruning, sparse pre-training--which combines pruning and pre-training into a single phase--provides a simpler alternative. In this work, we present the first systematic exploration of optimal sparse pre-training configurations for LLMs through an examination of 80 unique pruning schedules across different sparsity levels and training durations. We find that initiating pruning at 25% of total training compute and concluding at 75% achieves near-optimal final evaluation loss. These findings provide valuable insights for efficient and effective sparse pre-training of LLMs. Furthermore, we propose a new scaling law that modifies the Chinchilla scaling law to use the average parameter count over pre-training. Through empirical and theoretical validation, we demonstrate that this modified scaling law accurately models evaluation loss for both sparsely and densely pre-trained LLMs, unifying scaling laws across pre-training paradigms. Our findings indicate that while sparse pre-training achieves the same final model quality as dense pre-training for equivalent compute budgets, it provides substantial benefits through reduced model size, enabling significant potential computational savings during inference.
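
The notion of an average parameter count over pre-training can be worked through numerically. The sketch below assumes a linear pruning ramp between the 25% and 75% marks and treats those marks as fractions of training progress; the abstract fixes only the start and end points (in terms of training compute), not the shape of the ramp. The loss function uses the standard Chinchilla form L = E + A / N^alpha + B / D^beta with the average count substituted for N; the coefficients are left to the caller because the fitted values are not given in the abstract. All function names are hypothetical.

```python
def param_count(progress, dense_params, final_params, start=0.25, end=0.75):
    """Parameter count at training progress in [0, 1] (assumed linear ramp)."""
    if progress <= start:
        return dense_params
    if progress >= end:
        return final_params
    frac = (progress - start) / (end - start)
    return dense_params + frac * (final_params - dense_params)


def average_param_count(dense_params, final_params,
                        start=0.25, end=0.75, steps=10_000):
    """Midpoint-rule average of the parameter count over the whole run."""
    total = sum(
        param_count((i + 0.5) / steps, dense_params, final_params, start, end)
        for i in range(steps)
    )
    return total / steps


def chinchilla_style_loss(n_avg, tokens, E, A, B, alpha, beta):
    """Chinchilla-form loss with the average parameter count in place of N."""
    return E + A / n_avg ** alpha + B / tokens ** beta


# A 1B-parameter model pruned to 50% sparsity between 25% and 75% of training.
n_avg = average_param_count(dense_params=1e9, final_params=5e8)
```

Under these assumptions the model spends a quarter of training at 1B parameters, a quarter at 0.5B, and half on the ramp in between, so the average comes out at 0.75B. That is the quantity a dense model would be matched against under the modified law.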

2502.11517 2026-06-10 cs.CL cs.DC cs.LG 版本更新

Learning to Keep a Promise: Scaling Language Model Decoding Parallelism with Learned Asynchronous Decoding

学会信守承诺：通过学习式异步解码扩展语言模型的解码并行度

Tian Jin, Ellie Y. Cheng, Zack Ankner, Nikunj Saunshi, Blake M. Elias, Amir Yazdanbakhsh, Jonathan Ragan-Kelley, Suvinay Subramanian, Michael Carbin

发表机构 * DeepMind, London, UK（深度思维公司，伦敦，英国）； Google Research, New York, NY, USA（谷歌研究院，纽约，纽约州，美国）； Stanford University, Stanford, CA, USA（斯坦福大学，斯坦福，加利福尼亚州，美国）； University of Toronto, Toronto, Ontario, Canada（多伦多大学，多伦多，安大略省，加拿大）； University of Washington, Seattle, WA, USA（华盛顿大学，西雅图，华盛顿州，美国）

AI总结本文提出PASTA系统，通过学习使语言模型识别语义独立性，提升解码并行性，实验证明在解码速度和响应质量上优于现有方法。

Comments 15 pages

Journal ref: Proceedings of the 42nd International Conference on Machine Learning (ICML), PMLR 267:27941-27956, 2025

AI中文摘要

传统的自回归大语言模型（LLM）解码通常是顺序进行的，逐个生成token。新兴的研究探索了通过识别并同时生成语义独立的LLM响应片段来实现并行解码。然而，这些技术依赖于与列表、段落等句法结构绑定的手工启发式规则，因而僵化且不精确。我们提出了PASTA，一个基于学习的系统，教会LLM识别语义独立性并在自身响应中表达并行解码机会。其核心是PASTA-LANG及其解释器：PASTA-LANG是一种注释语言，使LLM能够在自身响应中表达语义独立性；语言解释器根据这些注释，在推理时实时协调并行解码。通过两阶段微调过程，我们训练LLM生成PASTA-LANG注释，以同时优化响应质量和解码速度。在指令遵循基准AlpacaEval上的评估显示，我们的方法在解码速度和响应质量上帕累托优于现有方法；结果表明，几何平均加速比范围为1.21x到1.93x，对应的质量变化为+2.2%到-7.1%，质量以相对于顺序解码基线的长度控制胜率衡量。

英文摘要

Decoding with autoregressive large language models (LLMs) traditionally occurs sequentially, generating one token after another. An emerging line of work has explored parallel decoding by identifying and simultaneously generating semantically independent chunks of LLM responses. However, these techniques rely on hand-crafted heuristics tied to syntactic structures like lists and paragraphs, making them rigid and imprecise. We present PASTA, a learning-based system that teaches LLMs to identify semantic independence and express parallel decoding opportunities in their own responses. At its core are PASTA-LANG and its interpreter: PASTA-LANG is an annotation language that enables LLMs to express semantic independence in their own responses; the language interpreter acts on these annotations to orchestrate parallel decoding on-the-fly at inference time. Through a two-stage finetuning process, we train LLMs to generate PASTA-LANG annotations that optimize both response quality and decoding speed. Evaluation on AlpacaEval, an instruction-following benchmark, shows that our approach Pareto-dominates existing methods in terms of decoding speed and response quality; our results demonstrate geometric mean speedups ranging from 1.21x to 1.93x with corresponding quality changes of +2.2% to -7.1%, measured by length-controlled win rates against a sequential decoding baseline.

2310.04680 2026-06-10 cs.CL cs.AI cs.LG updated version

The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning

Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite

Affiliations * MIT CSAIL; MIT; Harvard University; Google Research; Google DeepMind

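AI summary: This study examines how scaling the parameter count of large language models affects their core capabilities, finding that shrinking a model significantly degrades fact recall while having far less effect on in-context information processing.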

Details

Journal ref: The Twelfth International Conference on Learning Representations (ICLR), 2024

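AI-translated abstract

How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques, weight pruning and simply training a smaller or larger model (which we call dense scaling), and their effects on two core LLM capabilities: (a) recalling facts presented during pre-training and (b) processing information presented in context during inference. By curating a suite of tasks that disentangle these two capabilities, we find a striking difference in how they change under scaling. Reducing model size by more than 30% (via either scaling approach) significantly degrades the ability to recall facts seen in pre-training. Yet a 60-70% reduction largely preserves the various ways the model can process in-context information, from retrieving answers from a long context to learning parameterized functions from in-context exemplars. Both scaling approaches exhibit this behavior, which suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.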

English abstract

How does scaling the number of parameters in large language models (LLMs) affect their core capabilities? We study two natural scaling techniques (weight pruning and simply training a smaller or larger model, which we refer to as dense scaling) and their effects on two core capabilities of LLMs: (a) recalling facts presented during pre-training and (b) processing information presented in-context during inference. By curating a suite of tasks that help disentangle these two capabilities, we find a striking difference in how these two abilities evolve due to scaling. Reducing the model size by more than 30% (via either scaling approach) significantly decreases the ability to recall facts seen in pre-training. Yet, a 60-70% reduction largely preserves the various ways the model can process in-context information, ranging from retrieving answers from a long context to learning parameterized functions from in-context exemplars. The fact that both dense scaling and weight pruning exhibit this behavior suggests that scaling model size has an inherently disparate effect on fact recall and in-context learning.

1. Large Language Models and Foundation Models 37 papers

Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models

The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge

Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate

Routing-Aware Expert Calibration for Machine Unlearning in Mixture-of-Experts Language Models

PADD: Path-Aligned Decompression Distillation for Non-Router Teacher to Guide MoE Student Learning

Prefilling-dLLM: Predictive Prefilling for Long-Context Inference in Diffusion Language Models

Dynamic Linear Attention

Continual LLM Upcycling: A Predictor-Gated Bank-Wise Sparsity Training Recipe for Dense-to-Sparse LLMs

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

Density Field State Space Models: 1-Bit Distillation, Efficient Inference, and Knowledge Organization in Mamba-2

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Gradient-Guided Reward Optimization for Inference-time Alignment

Mechanistic Analysis of Alignment Algorithms in Language Models

Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis

SocraticPO: Policy Optimization via Interactive Guidance

A Navigable Manifold of Hypothesized Consciousness-Spectrum States in Language Model Representations

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

Parallel Causal Associative Fields: Gated Sparse Memory for Long-Context Language Modeling

Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

Causal Ensemble Agent: Hierarchical Causal Discovery with LLM-guided Expert Reweighting

How Does Reasoning Flow? Tracing Attention-Induced Information Flow for Targeted RL in LLMs

N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

TruthRL: Incentivizing Truthful LLMs via Reinforcement Learning

CGES: Confidence-Guided Early Stopping for Efficient and Accurate Self-Consistency

Lightweight Latent Reasoning for Narrative Tasks

Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

inversedMixup: Data Augmentation via Inverting Mixed Embeddings

ProbeLLM: Automating Principled Diagnosis of LLM Failures

Beyond Memorization: Distinguishing Between Pattern-Based and Epistemic Reasoning in LLMs Using Epistemic Puzzles

Learning Evidence Highlighting for Frozen LLMs

Why Does Reasoning Length Converge? Unveiling the Underfitting-Overfitting Trade-off in Chain-of-Thought

When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models

ATLAS: Verifier-Guided Adaptive Latent Activation Steering for Efficient LLM Reasoning

Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

2. Machine Translation and Cross-Lingual Processing 3 papers

Emotion Profiling in LLM-Based Literary Translation: Systematic Shifts Across MT and Post-Editing

Who Brought Easter Eggs to Eid? Auditing Cultural Translation of Math Word Problems Across Diverse Languages and Regions

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

3. Information Extraction, Retrieval, and Question Answering 12 papers

Detecting Speculative Language in Biomedical Texts using Recurrent Neural Tensor Networks

ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval

Trace Only What You Need: Structure-Aware On-Demand Hypergraph Memory for Long-Document Question Answering

Agentic Hybrid RAG for Evidence-Grounded Muon Collider Analysis

Generative Archetype-Grounded Item Representations for Sequential Recommendation

EXCEEDS: Extracting Complex Events via Nugget-based Grid Modeling in Scientific Domain

Automated Alignment between Elicitation Interviews and Requirements

SAFE: An LLM-as-Verifier Framework for Evidence-Grounded Multi-Hop Reasoning

Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing

From Volume to Value: Preference-Aligned Memory Construction for On-Device RAG

ConRAG: Consensus-Driven Multi-View Retrieval for Multi-Hop Question Answering

RAG over Thinking Traces Can Improve Reasoning Tasks

4. Dialogue Systems and Agents 13 papers

Less Context, More Accuracy: A Bi-Temporal Memory Engine for LLM Agents Where a Lean Retrieved Context Beats the Full History

TabClaw: An Interactive and Self-Evolving Agent for Spreadsheet Manipulation and Table Reasoning

WebChallenger: A Reliable and Efficient Generalist Web Agent

REAL: A Reasoning-Enhanced Graph Framework for Long-Term Memory Management of LLMs

Detecting Knowledge Gaps from Conversational AI Interactions Using Curriculum Prerequisite Graphs

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

Decoupling Thought from Speech: Knowledge-Grounded Counterfactual Reasoning for Resilient Multi-Agent Argumentation

Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

A History-Aware Visually Grounded Critic for Computer Use Agents

What Should a Skill Remember? Quality--Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents

Fact-Augmented Lookahead Planning for LLM Agents

TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit

Impatient Users Confuse AI Agents: High-fidelity Simulations of Human Traits for Testing Agents

5. Text Generation, Summarization, and Editing 5 papers

CodeAlchemy: Synthetic Code Rewriting at Scale

Where You Inject Diversity Matters: A Unified Framework for Diverse Generation

The Order Matters: Sequential Fine-Tuning of LLaMA for Coherent Automated Essay Scoring

LLM-Based Code Documentation Generation and Multi-Judge Evaluation

A Continuous-Time Markov Chain Framework for Insertion Language Models

6. Semantics, Syntax, and Linguistic Analysis 6 papers

Large Language Models as Modal Models in Linguistics

Measuring Human Value Expression in Social Media Texts: Calibrated LLM Annotation and Encoder Transfer

Compiling Rewrite Rules to Finite-State Transducers with the Worsening Trick

Improving Topic Modeling by Distilling Soft Labels from Language Models