arXivDaily arXiv每日学术速递 周一至周五更新
重置
2606.06492 2026-06-05 cs.SE cs.AI cs.CL 版本更新

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

Code2LoRA:用于软件演化下代码语言模型的超网络生成适配器

Liliana Hotsko, Yinxi Li, Yuntian Deng, Pengyu Nie

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出Code2LoRA超网络框架,通过生成仓库特定的LoRA适配器注入仓库知识,无需推理时令牌开销,支持静态和演化两种场景,在RepoPeftBench上达到与逐仓库LoRA相当或更优的性能。

详情
AI中文摘要

代码语言模型需要仓库级上下文来解决导入、API和项目约定。现有方法通过长输入(通过RAG或依赖分析检索)或逐仓库微调和LoRA注入这些知识——这在仓库规模上成本高昂且对演化的代码库脆弱。我们引入Code2LoRA,一个超网络框架,生成仓库特定的LoRA适配器,有效地注入仓库知识,零推理时令牌开销。Code2LoRA支持两种使用场景:Code2LoRA-Static将单个仓库快照转换为适配器,适用于稳定代码库的理解;而Code2LoRA-Evo维护一个由GRU隐藏状态支持的适配器,该状态随每次代码差异更新,适用于演化代码库的活跃开发。为了评估Code2LoRA与参数高效微调基线,我们构建了RepoPeftBench,一个包含604个Python仓库的基准,包含两个轨道:一个静态轨道,包含40K训练和12K测试断言完成任务;一个演化轨道,包含215K提交派生训练和87K提交派生测试任务。在静态轨道上,Code2LoRA-Static实现了63.8%的跨仓库和66.2%的仓库内精确匹配,与逐仓库LoRA上界相当;在演化轨道上,Code2LoRA-Evo实现了60.3%的跨仓库精确匹配(比单个共享LoRA高5.2个百分点)。Code2LoRA的代码可在https://anonymous.4open.science/r/code2lora-6857找到;模型检查点和RepoPeftBench数据集可在https://huggingface.co/code2lora找到。

英文摘要

Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). Code2LoRA's code can be found at https://anonymous.4open.science/r/code2lora-6857; the model checkpoints and RepoPeftBench datasets can be found at https://huggingface.co/code2lora.

2606.06481 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection

操作引导的渐进式人机文本转换基准:面向多粒度AI文本检测

Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao, Tianjun Yao, Xinyi Shang, Yi Tang, Jiacheng Cui, Ahmed Elhagry, Salwa K. Al Khatib, Hao Li, Salman Khan, Zhiqiang Shen

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎德·本·泽亚德人工智能大学) University College London(伦敦大学学院)

AI总结 提出OpAI-Bench基准,通过九种渐进修订版本和五种AI编辑操作,模拟人机协作编辑过程,支持文档、句子、词元和跨度多粒度检测,揭示AI文本可检测性受编辑操作、领域和累积修订历史影响,并发现混合作者中间版本比纯人类或纯AI端点更难检测。

Comments Our code and data are available at https://github.com/VILA-Lab/OpAI-Bench

详情
AI中文摘要

随着AI写作助手越来越多地融入现实世界的起草和修订流程,许多文档不再是纯粹的人类撰写或AI生成,而是渐进式人机共同编辑的结果。然而,现有的AI文本检测基准主要关注最终输出,对AI作者身份信号如何在修订过程中出现、累积或消失的理解有限。我们引入了OpAI-Bench,一个操作引导的基准,用于研究在文档、句子、词元和跨度粒度上的渐进式人机文本转换。从人类撰写的文档开始,OpAI-Bench在预定义的AI覆盖水平和五种代表性AI编辑操作下,为每个样本构建了九个顺序修订版本,涵盖四个领域,同时保留多粒度上的完整作者身份来源。该基准支持8个文档级检测器、7个句子级检测器和2个细粒度词元/跨度级检测器的全面评估。实验表明,AI文本的可检测性不仅受AI编辑内容比例的影响,还受编辑操作、领域和累积修订历史的影响。有趣的是,我们注意到混合作者身份的中间版本通常比完全人类或大量AI编辑的端点更难检测,暴露了现有基准遗漏的非单调检测模式。OpAI-Bench为分析在现实渐进编辑场景下,AI辅助写作是否、何时以及如何变得可检测提供了一个受控测试平台。我们的代码和基准可在https://github.com/VILA-Lab/OpAI-Bench获取。

英文摘要

As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for studying progressive human-to-AI text transformation across document, sentence, token, and span granularities. Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors. Experiments reveal that AI-text detectability is governed not only by the proportion of AI-edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints, exposing non-monotonic detection patterns missed by existing benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at https://github.com/VILA-Lab/OpAI-Bench.

2606.06474 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Self-Augmenting Retrieval for Diffusion Language Models

扩散语言模型的自增强检索

Paul Jünger, Justin Lovelace, Linxi Zhao, Dongyoung Go, Kilian Q. Weinberger

发表机构 * University of California, Berkeley(加州大学伯克利分校) Google Research(谷歌研究院)

AI总结 提出SARDI框架,利用扩散语言模型去噪过程中丢弃的低置信度标记作为前瞻信号指导检索,无需训练且与检索器无关,在多跳问答基准上以高达8倍吞吐量超越现有方法。

Comments ICML 2026

详情
AI中文摘要

离散扩散语言模型通过并行迭代去噪整个响应来生成文本。每一步,它们为每个掩码位置预测暂定标记,将高置信度预测提交到输出,并丢弃低置信度标记。我们表明,被丢弃的标记实际上对检索增强生成是有用的前瞻信号:即使低置信度标记也常在去噪轨迹早期浮现显著实体,从而在输出最终确定前检索到更强的证据。我们通过扩散语言模型的自增强检索(SARDI)利用这一点,这是一个动态RAG框架,在去噪过程中使用这些前瞻标记指导检索。SARDI无需训练、与检索器无关,并适用于任何具备推理能力的离散扩散语言模型。在五个多跳QA基准上,SARDI以高达8倍的吞吐量优于当前无训练的扩散和自回归检索基线。

英文摘要

Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens often surface salient entities early in the denoising trajectory, enabling retrieval of stronger evidence before the output is finalized. We exploit this through Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to $8\times$ higher throughput.

2606.06473 2026-06-05 cs.AI cs.CL 版本更新

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

MLEvolve:一种用于自动化机器学习算法发现的自我进化框架

Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng, Yifan Zhou, Xin Li, Jie Zhou, Liang He, Bo Zhang, Lei Bai

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) East China Normal University(东华大学)

AI总结 提出MLEvolve框架,通过渐进式MCGS、回溯记忆和分层控制解决LLM智能体在长期任务中的信息隔离、无记忆搜索和缺乏分层控制问题,在MLE-Bench和数学算法优化任务上取得最先进性能。

详情
AI中文摘要

大型语言模型(LLM)智能体越来越多地应用于长期任务,如科学发现和机器学习工程(MLE),其中持续的自我进化成为关键能力。然而,现有的MLE智能体存在分支间信息隔离、无记忆搜索和缺乏分层控制的问题,这些共同阻碍了长期优化。我们提出了MLEvolve,一个基于LLM的自我进化多智能体框架,用于端到端的机器学习算法发现。通过将树搜索扩展到渐进式MCGS,MLEvolve通过基于图的参考边实现跨分支信息流,并借助熵启发的渐进式调度,逐步将搜索从广泛探索转向集中利用。为了让智能体能够随着积累的经验进化,我们引入了回溯记忆,它将冷启动领域知识库与动态全局记忆相结合,用于特定任务的体验检索和重用。为了实现稳定的长期迭代,我们进一步将战略规划与代码生成解耦,并采用自适应编码模式。在MLE-Bench上的评估表明,MLEvolve在多个维度上实现了最先进的性能,包括在12小时预算(标准运行时间的一半)下的平均奖牌率和有效提交率。此外,MLEvolve在数学算法优化任务上也优于专门的算法发现方法(包括AlphaEvolve),展示了强大的跨领域泛化能力。我们的代码可在https://github.com/InternScience/MLEvolve获取。

英文摘要

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.

2606.06467 2026-06-05 cs.CL cs.AI cs.LG 版本更新

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

仅索引一次:具有共享路由的跨层稀疏注意力

Yutao Sun, Yanqi Zhang, Li Dong, Jianyong Wang, Furu Wei

发表机构 * Microsoft Research(微软研究院) Tsinghua University(清华大学)

AI总结 提出跨层稀疏注意力(CLSA),通过共享KV缓存和路由索引,在保持token稀疏注意力精度的同时减少路由开销,显著提升长上下文LLM的解码效率。

详情
AI中文摘要

现代LLM中的长上下文推理越来越受到解码效率的限制,尤其是在模型生成长中间思维链的推理密集型场景中。现有的稀疏注意力方法通常面临实际的效率-质量权衡。结构化块稀疏方法通常提供更强的加速,但会导致明显的质量损失,而token稀疏方法通常更准确,但由于在全缓存上进行top-k路由仍然昂贵,因此端到端加速有限。在这项工作中,我们提出了跨层稀疏注意力(CLSA),它建立在KV共享架构(如YOCO)之上。核心思想不仅是跨解码器层共享KV缓存,还共享路由索引。单个索引器计算一次token级别的top-k选择,并在各层之间重用生成的索引,从而保留了token稀疏注意力的细粒度选择性,同时分摊了路由开销。由此产生的架构共同改善了所有主要的推理瓶颈,包括预填充、KV缓存存储和长上下文解码。在短上下文和长上下文基准上的实验表明,CLSA既准确又高效,在128K上下文下实现了高达7.6倍的解码加速和17.1倍的总体吞吐量提升。这些结果表明,对于长上下文LLM,这是一种更完整的架构解决方案,可同时提升模型质量和推理效率。

英文摘要

Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.

2606.06454 2026-06-05 cs.SE cs.CL 版本更新

Scaffold, Not Vocabulary? A Controlled, Two-Tier, Pre-Registered Study of a Popperian Code-Generation Skill

脚手架,而非词汇?一项受控、双层、预注册的波普尔式代码生成技能研究

Mehmet Iscan

发表机构 * PythaLab, Yıldız Technical University, Istanbul, Turkey(Pytha实验室,伊兹密尔技术大学,伊斯坦布尔,土耳其)

AI总结 通过双层消融实验(包括长度匹配安慰剂、仅标签脚手架和真实执行测试),研究发现波普尔式提示技能对代码正确性的提升主要来自脚手架结构而非其内容,并在大模型上因天花板效应无法检测,在小模型上仅标签脚手架即可达到类似效果。

Comments 34 pages, 5 figures, 8 tables

详情
AI中文摘要

大型语言模型越来越多地编写、审查和评判代码,一种快速发展的实践是为它们配备提示“技能”,要求模型像科学家一样推理。一个突出的例子是告诉模型扮演波普尔式证伪主义者,据报道这种技能能改进生成的代码。但这些增益几乎总是通过LLM作为评判者来读取,而该评判工具存在已知的位置偏好、自我偏好和风格偏差。我们问:如果它看起来有帮助,那么增益是来自技能的波普尔式内容,还是来自任何脚手架所施加的结构?我们预注册了一个双层消融实验,包含三个对照:长度匹配的安慰剂、仅保留波普尔式标题但去除过程的仅标签脚手架,以及一个执行预言机(HumanEval+单元测试),外加一个词汇光环哨兵和一个同模型自评判审计。在前沿模型(Claude Sonnet 4.6,N=163)上,所有条件都接近基准上限且无法区分,因此预注册的+5点改进未得到支持(上限限制的未检测)。在小模型(Qwen2.5-Coder-0.5B,N=164)上,结构化条件将最佳八次正确率提升了20-22点,但完整技能相比仅标签脚手架没有显示出可分离的益处(聚合F@8=L@8 vs V@8=34.8%),而安慰剂仅落后2.4点。一个应用波普尔式评分标准的0.5B自评判器未能击败随机选择,并将其60%的选择集中在一个索引上。在测试的两种设置中,该技能的波普尔式过程内容在仅标签脚手架之外没有增加可分离的执行正确性收益,因此增益追踪的是脚手架结构。我们贡献了一个校准的负结果和一个可重用的消歧协议;该发现界定了关于一个提示技能家族的工程主张,而不是对波普尔式方法论的总体评价。

英文摘要

Large language models increasingly write, review, and judge code, and a fast-growing practice equips them with prompt 'skills' that ask the model to reason like a scientist. A prominent example tells the model to act as a Popperian falsificationist, and such skills are reported to improve generated code. But these gains are almost always read off an LLM-as-a-judge, an instrument with documented positional, self-preference, and stylistic biases. We ask: if it appears to help, is the gain from the skill's Popperian content, or from the structure any scaffold imposes? We pre-register a two-tier ablation with three controls: a length-matched placebo, a labels-only scaffold that keeps the Popperian headers but strips the procedure, and an execution oracle (HumanEval+ unit tests), plus a vocabulary-halo sentinel and a same-model self-judge audit. On a frontier model (Claude Sonnet 4.6, N=163) all conditions sit near the benchmark ceiling and do not separate, so the pre-registered +5-point improvement is not supported (a ceiling-limited non-detection). On a small model (Qwen2.5-Coder-0.5B, N=164) structured arms lift best-of-eight correctness by 20-22 points, but the full skill shows no separable benefit over a labels-only scaffold (aggregate F@8=L@8 vs V@8=34.8%), and the placebo trails by only 2.4 points. A 0.5B self-judge applying the Popperian rubric does not beat random selection and concentrates 60% of its picks on one index. In the two settings tested, the skill's Popperian procedural content adds no separable execution-correctness benefit beyond a labels-only scaffold, so the gains track scaffold structure. We contribute a calibrated negative result and a reusable disambiguation protocol; the finding bounds an engineering claim about one prompt-skill family and is not an evaluation of Popperian methodology in general.

2606.06447 2026-06-05 cs.CL cs.LG 版本更新

Latent Reasoning with Normalizing Flows

基于归一化流的潜在推理

Guancheng Tu, Xiangjun Fu, Suhao Yu, Yao Tang, Haoqiang Kang, Lianhui Qin, Yizhe Zhang, Jiatao Gu

发表机构 * University of Pennsylvania(宾夕法尼亚大学) UC San Diego(圣地亚哥大学) Meta(Meta公司)

AI总结 提出NF-CoT框架,通过归一化流在LLM内部建模连续潜在思维,保留自回归生成、概率采样、KV缓存解码和似然估计等优势,在代码生成任务中提升通过率并降低推理成本。

详情
AI中文摘要

大型语言模型通常通过生成显式思维链(CoT)来改进推理,展示了中间计算的重要性。然而,文本CoT迫使这种计算通过离散、串行且面向通信的令牌流进行:每个推理步骤必须在模型继续之前被语言化,即使底层更新是语义的、不确定的或仅部分形成的。潜在推理通过在承诺文本之前以紧凑的连续状态执行中间计算,提供了一种更高带宽的替代方案。然而,现有的潜在推理方法常常牺牲了使CoT在自回归语言模型中有效的关键优势,包括原生的从左到右生成、概率采样、与KV缓存解码的兼容性以及可处理的似然估计。我们提出NF-CoT,一种潜在推理框架,通过使用归一化流对连续思维进行建模来保留这些优势。NF-CoT在LLM骨干内部实例化一个TARFlow风格的归一化流,定义了从显式CoT提炼的紧凑连续思维上的可处理概率模型。连续思维位置由NF头生成,而文本位置由标准LM头在同一因果流中生成。这种设计为潜在思维提供了精确的似然,支持使用原始KV缓存进行概率从左到右解码,并支持在潜在推理空间中进行直接策略梯度优化。在代码生成基准测试中,NF-CoT在显式CoT和先前潜在推理基线上提高了通过率,同时显著降低了中间推理成本。

英文摘要

Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and communication-oriented token stream: each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed. Latent reasoning offers a higher-bandwidth alternative by performing intermediate computation in compact continuous states before committing to text. Yet existing latent-reasoning methods often sacrifice key advantages that make CoT effective in autoregressive language models, including native left-to-right generation, probabilistic sampling, compatibility with KV-cache decoding, and tractable likelihood estimation. We propose NF-CoT, a latent reasoning framework that preserves these advantages by modeling continuous thoughts with normalizing flows. NF-CoT instantiates a TARFlow-style normalizing flow inside the LLM backbone, defining a tractable probability model over compact continuous thoughts distilled from explicit CoT. Continuous-thought positions are generated by an NF head, while text positions are generated by the standard LM head within the same causal stream. This design provides exact likelihoods for latent thoughts, enables probabilistic left-to-right decoding with the original KV cache, and supports direct policy-gradient optimization in the latent reasoning space. On code-generation benchmarks, NF-CoT improves pass rates over explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.

2606.06444 2026-06-05 eess.AS cs.CL cs.SD 版本更新

USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

USAD 2.0:面向通用音频理解的表征蒸馏规模化

Heng-Jui Chang, Alexander H. Liu, Saurabhchand Bhati, Mrudula Athi, Anton Ratnarajah, Amit Chhetri, James Glass

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) Amazon(亚马逊)

AI总结 提出USAD 2.0通用音频编码器,通过领域感知蒸馏融合自监督和监督基础模型知识,并扩展至音乐领域,经深度缩放达到十亿参数,在探测和基于LLM的评估中取得领先性能。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

音频编码器对于现代音频应用至关重要,因为大型语言模型(LLM)越来越依赖单一编码器处理多样输入。虽然自监督学习(SSL)已产生强大的领域特定编码器(如语音或音乐专家),但像USAD和SPEAR这样的多领域方法在覆盖范围和评估方面仍然有限。最近的研究也表明,监督编码器与音频LLM的对齐效果更好。我们提出USAD 2.0,一种融合了SSL和监督基础模型知识的通用编码器。USAD 2.0引入了领域感知蒸馏来解决教师不匹配问题,将覆盖范围扩展到音乐领域,并增加了用于下游任务的第二阶段监督蒸馏。我们进一步通过深度缩放将模型扩展到十亿参数。实验表明,USAD 2.0在探测和基于LLM的评估中取得了强劲或最先进的性能。

英文摘要

Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.

2606.06428 2026-06-05 cs.CL 版本更新

Reinforcement Learning Elicits Contextual Learning of Unseen Language Translation

强化学习引发对未见语言的上下文翻译学习

Hanxu Hu, Zdeněk Šnajdr, Pinzhen Chen, Jannis Vamvas, Rico Sennrich

发表机构 * University of Zurich(苏黎世大学) ETH Zurich(苏黎世联邦理工学院) Queen’s University Belfast(贝尔法斯特女王大学)

AI总结 提出使用强化学习(RL)方法,以chrF为奖励,使大语言模型从丰富的语言上下文中提取并应用语言学知识,实现对完全未见语言的有效翻译。

Comments 15 pages, 2 figures

详情
AI中文摘要

先前的工作表明,大型语言模型(LLMs)可以通过持续训练甚至在其上下文中编码语法书来翻译未见或低资源语言。然而,这两种方法通常过拟合特定语言,在测试时零样本迁移有限。为了大规模翻译极低资源语言,我们认为LLMs必须获得利用上下文语言学知识的元技能,而不是记忆特定语言。在本文中,我们提出了一种强化学习(RL)方法,用于在丰富的语言学上下文中进行未见语言翻译,使用表面级翻译指标(chrF)作为奖励。实验表明,尽管奖励轻量级,我们的RL训练模型有效地从提供的上下文中提取和应用相关的语言学信息,导致对完全未见语言的翻译优于上下文学习或有监督微调。我们的分析表明,基于结果的RL可以扩展到数学和编码等传统推理任务之外,作为从上下文学习语言的配方。

英文摘要

Prior work has shown that large language models (LLMs) can translate unseen or low-resource languages by undergoing continued training or even by encoding a grammar book in their context. However, both methods typically overfit specific languages, with limited zero-shot transfer at test time. To translate extremely low-resource languages at scale, we argue that LLMs must acquire the meta-skill of utilizing in-context linguistic knowledge rather than memorizing specific languages. In this paper, we propose a reinforcement learning (RL) approach to unseen language translation given rich linguistic context, using a surface-level translation metric (chrF) as the reward. Empirically, despite the lightweight reward, our RL-trained models effectively extract and apply relevant linguistic information from the provided context, leading to better translations on completely unseen languages than in-context learning or supervised fine-tuning. Our analyses suggest that outcome-based RL can extend beyond conventional reasoning tasks like math and coding to serve as a recipe for language learning from context.

2606.06420 2026-06-05 cs.CL 版本更新

A Komi-Yazva--Russian Parallel Corpus and Evaluation Protocol for Zero- and Few-Shot LLM Translation

科米-亚兹瓦语-俄语平行语料库及零样本和少样本LLM翻译评估协议

Petr Parshakov

发表机构 * HSE University, Perm, Russia(俄罗斯彼尔姆国立经济大学) School of Management SKOLKOVO, Moscow, Russia(莫斯科SKOLKOVO管理学院)

AI总结 构建首个科米-亚兹瓦语-俄语平行语料库,并提出显式评估协议,研究大语言模型在极度低资源濒危语言翻译中的零样本和检索增强少样本性能。

Comments 18 pages, 6 tables, 3 figures

详情
AI中文摘要

我们提出了首个科米-亚兹瓦语-俄语平行语料库,以及用于研究LLM在濒危、极度低资源环境下翻译的显式评估协议。该数据集包含来自74篇叙事文本的457个对齐句子对,并附有文档化的来源、句子级对齐和故事标识符,支持泄漏感知评估。我们利用这一设置,在零样本和基于检索的少样本场景下,比较了现代大语言模型在科米-亚兹瓦语到俄语翻译中的表现,其中平行数据极度稀缺。协议包括故事级交叉验证、用于少样本提示的确定性检索、生成输出的严格验证、互补的基于参考和基于评判的指标,以及故事级不确定性估计。跨模型而言,LLM产生了有意义的翻译,但性能因模型家族和提示方式而异。基于检索的少样本提示始终优于零样本提示,而超出小检索上下文的增益仍然有限。结果表明,该设置下的评估结论在很大程度上取决于指标选择和失败处理方式,因此本文将该语料库既作为数据集贡献,也作为濒危语言机器翻译的可复现评估测试平台。

英文摘要

We present the first Komi-Yazva--Russian parallel corpus together with an explicit evaluation protocol for studying LLM translation in an endangered, extremely low-resource setting. The dataset contains 457 aligned sentence pairs from 74 narrative texts and is accompanied by documented provenance, sentence-level alignment, and story identifiers that enable leakage-aware evaluation. We use this setup to compare modern large language models on Komi-Yazva-to-Russian translation under severe parallel-data scarcity in zero-shot and retrieval-based few-shot regimes. The protocol includes story-level cross-validation, deterministic retrieval for few-shot prompting, strict validation of generated outputs, complementary reference-based and judge-based metrics, and story-level uncertainty estimates. Across models, LLMs produce non-trivial translations, but performance varies strongly by model family and prompting regime. Retrieval-based few-shot prompting consistently improves over zero-shot prompting, while gains beyond a small retrieved context remain limited. The results show that evaluative conclusions in this setting depend materially on metric choice and failure handling, so the paper frames the corpus as both a dataset contribution and a reproducible evaluation testbed for endangered-language machine translation.

2606.06416 2026-06-05 cs.AI cs.CL cs.LG cs.MA 版本更新

Unsupervised Skill Discovery for Agentic Data Analysis

面向智能体数据分析的无监督技能发现

Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang, Huajun Chen, Shumin Deng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出DataCOPE框架,通过无监督验证器引导从探索轨迹中发现可复用的数据分析技能,在报告式和推理式分析任务上分别提升平均得分9.71%和32.30%。

Comments Work in progress

详情
AI中文摘要

推理时技能增强通过注入可复用的程序性知识而不更新模型参数,为改进数据分析智能体提供了一种轻量级方法。然而,发现有效的数据分析技能仍然具有挑战性,因为可靠的监督成本高昂,且成功标准因分析格式而异。这提出了一个关键问题:如何仅从无标签探索中发现可复用的数据分析技能。我们提出DataCOPE,一种面向数据分析智能体的无监督验证器引导的技能发现框架。DataCOPE从探索轨迹中提取验证器信号,并利用这些信号表征轨迹间的相对质量或一致性。它迭代地协调一个数据分析智能体用于轨迹生成、一个无监督验证器用于信号提取、以及一个技能管理器用于对比式技能蒸馏。对于报告式分析,我们将验证器实例化为自适应检查表验证器,该验证器推导任务特定标准,通过可验证覆盖率对报告评分,并迭代优化检查表。对于推理式分析,我们将其实例化为答案一致性验证器,该验证器根据答案一致性对轨迹分组,并使用自一致性作为辅助信号。我们在Deep Data Research的报告式分析和DABStep的推理式分析上评估DataCOPE。在两种设置下,DataCOPE在保留任务上持续优于基线。在四种模型设置上平均,DataCOPE在报告式和推理式任务上分别将平均得分提高了9.71%和32.30%。

英文摘要

Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or aggreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks respectively.

2606.06380 2026-06-05 cs.CL cs.AI cs.MA cs.NE 版本更新

Emergent Language as an Approach to Conscious AI

涌现语言作为有意识AI的一种方法

Zengqing Wu, Chuan Xiao

发表机构 * University of Osaka(大阪大学)

AI总结 提出一种生成式方法,通过多智能体强化学习中的涌现语言,在最小先验下研究意识相关结构,并证明智能体可发展出自我指涉通信(如回声-不匹配检测电路)。

Comments Source codes available at https://github.com/wuzengqing001225/ConsciousAI_Indexicality/

详情
AI中文摘要

人工系统是否有意识的问题仍然悬而未决,部分原因是现有方法要么根据理论派生的清单评估系统(判别式),要么直接工程化受意识启发的模块(架构式);两者都未能确定观察到的结构是否是人类语言先验的产物。我们提出一种生成式方法论:多智能体强化学习中的涌现语言(EL),其中智能体从最小起点(无语言、无自我概念、极少接触人类文本)出发,仅在任务压力下发展通信,确保因果可归因于任务需求而非继承的人类语言先验。我们通过讨论EL如何作为研究意识相关结构的生成工具来定位我们的方法论,包括环境复杂性的作用以及对涌现通信的解释。作为概念验证,我们在一个最小环境中实例化该方法论,并证明智能体发展出自我指涉通信,包括一个回声-不匹配检测电路,该电路并非仅由任务结构或架构预测,而是从特定的环境可供性中涌现。

英文摘要

The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from minimal (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.

2606.06350 2026-06-05 cs.CL 版本更新

EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading

EDIT:基于证据诊断的干预训练以实现遵循规则的LLM评分

Zhihao Wu, Linhai Zhang, Taiyi Wang, Runcong Zhao, Peter Andrews, Cesare Aloisi, Yulan He

发表机构 * King’s College London(伦敦国王学院) University of Cambridge(剑桥大学) AQA The Alan Turing Institute(艾伦·图灵研究所)

AI总结 提出EDIT框架,通过内部模型信号定位推理错误步骤并修正,结合信念引导的奖励塑造,提升LLM评分对评分标准的忠实度。

详情
AI中文摘要

可靠的评分标准评分需要比准确分数预测更多。每个判断必须基于评分方案和学生答案中的证据。现有的信用分配和干预方法主要针对数学推理等自包含推理任务设计,在此场景下表现不佳,因为它们无法识别评分推理出错的位置或模型对最终分数的信念在推理过程中如何变化。我们提出基于证据诊断的干预训练(EDIT),一个两阶段框架,用于训练更遵循评分标准的LLM评分器。首先,EDIT-SFT使用内部模型信号定位有问题的推理步骤:对最终分数的后验信念和输入基础得分。然后,它仅借助评分清单修正这些局部步骤。其次,EDIT-RL通过信念引导的奖励塑造校准评分器,惩罚有害的大信念漂移,同时允许有益的探索。在两个真实世界、多学科评分基准上的实验表明,EDIT在领域内和领域外分割上均持续优于强监督微调和强化学习基线,消融研究证实内部状态诊断推动了这些增益。

英文摘要

Reliable rubric grading requires more than accurate score prediction. Each judgement must be grounded in the mark scheme and evidence from the student answer. Existing credit-assignment and intervention methods, primarily designed for self-contained reasoning tasks such as mathematics reasoning, struggle in this setting because they do not identify where grading reasoning goes wrong or how the model's belief about the final mark changes during reasoning. We propose Evidence-Diagnosed Intervention Training (EDIT), a two-phase framework for training more rubric-faithful LLM graders. First, EDIT-SFT locates problematic reasoning steps using internal model signals: posterior belief over the final mark and input-grounding scores. It then revises only these local steps with help from a rubric checklist. Second, EDIT-RL calibrates the grader with belief-guided reward shaping, penalising large harmful belief drifts while still allowing helpful exploration. Experiments on two real-world, multi-subject grading benchmarks demonstrate that EDIT consistently outperforms strong supervised fine-tuning and reinforcement learning baselines on both in-domain and out-of-domain splits, with ablation studies confirming that internal-state diagnostics drive these gains.

2606.06349 2026-06-05 cs.CL 版本更新

"Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard

Chi nas dal soch el sent de legn —— 审计伦巴第语文本语料库

Edoardo Signoroni, Pavel Rychlý

发表机构 * NLP Centre, Faculty of Informatics Masaryk University(马萨里克大学信息学院自然语言处理中心)

AI总结 本文通过手动审计伦巴第语的平行和单语语料库,发现网络抓取数据存在严重的语言误识别、模板文本和非语言噪声问题,并揭示了高质量数据偏向西部伦巴第语变体、东部变体被边缘化的代表性偏差,强调需要关注变体多样性和社区驱动的数据策展。

Comments Submitted to TSD 2026

详情
AI中文摘要

世界上几种语言在自然语言处理(NLP)工具方面仍然资源不足。这主要是由于缺乏高质量的数据集来训练、开发和评估用于多种任务(如机器翻译(MT))的系统和模型。我们对伦巴第语(意大利的一种资源不足的语言连续体)可用的平行和单语语料库进行了手动审计。我们的分析表明,网络抓取数据看似丰富实则是一种幻觉,大量数据集受到严重的语言误识别、模板文本和非语言噪声的困扰。此外,我们分析了网络抓取数据集、策展语料库和基准测试中有效伦巴第语部分的拼写构成。我们的发现揭示了所有语料库中存在冲突的拼写系统和严重的代表性偏差:高质量数据严重偏向西部伦巴第语变体,而东部变体则被边缘化。这强调了需要关注变体多样性和社区驱动的数据策展,而非纯粹数量驱动的抓取。

英文摘要

Several of the world's languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.

2606.06320 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance

学习遗忘什么:通过习得的词元级重要性改进大语言模型遗忘

Gizem Yüce, Giorgos Nikolaou, Nicolas Flammarion

发表机构 * Theory of Machine Learning Lab, EPFL(机器学习理论实验室,EPFL)

AI总结 提出交替词元加权遗忘(ATWU)框架,通过联合学习词元遗忘特异性和模型参数,在无外部监督下实现最优的遗忘-保留权衡。

详情
AI中文摘要

机器遗忘旨在从训练好的模型中移除特定知识,同时保留其通用能力。对于自回归语言模型,遗忘样本中的并非所有词元都与遗忘同等相关。现有方法要么忽略这种异质性,要么依赖辅助模型、启发式方法或外部标注来估计每个词元对遗忘的相关性。我们转而通过其与保留目标的交互来刻画这种相关性:一个词元是遗忘特异性的,其程度取决于在该词元上最小化遗忘损失不与保留最优性冲突。我们将这一视角形式化为一个关于模型参数和词元权重的联合优化问题,并证明在自然分离条件下,所得目标能够恢复 oracle 遗忘特异性词元支持。受此公式启发,我们引入了交替词元加权遗忘(ATWU),这是一个轻量级框架,在遗忘过程中通过一个基于隐藏状态的简单线性评分器联合学习词元遗忘特异性和模型参数,无需外部词元级监督。在 TOFU 和 RWKU 上,ATWU 实现了最先进的遗忘-保留权衡,优于样本级方法、基于概率的词元加权启发式方法和基于辅助模型的方法。此外,学习到的分数与真实遗忘特异性跨度显著更好地对齐,表明 ATWU 识别了语义上有意义的词元级遗忘信号。总体而言,我们的结果表明,保留冲突为识别语言模型应遗忘什么提供了有效标准,使得能够直接从模型表示中以最小计算开销无监督学习词元级遗忘特异性。

英文摘要

Machine unlearning aims to remove targeted knowledge from a trained model while preserving its general capabilities. For autoregressive language models, not all tokens in a forget sample are equally relevant to forgetting. Existing approaches either ignore this heterogeneity or rely on auxiliary models, heuristics, or external annotations to estimate each token's relevance for forgetting. We instead characterize it through the interaction with the retain objective: a token is forget-specific to the extent that minimizing the forget loss on that token does not conflict with retain optimality. We formalize this perspective as a joint optimization problem over the model parameters and the token weights and show that, under a natural separation condition, the resulting objective recovers the oracle forget-specific token support. Motivated by this formulation, we introduce Alternating Token-Weighted Unlearning (ATWU), a lightweight framework that jointly learns token forget-specificity and model parameters during unlearning using a simple linear scorer over the hidden states, without external token level supervision. Across TOFU and RWKU, ATWU achieves state of the art forget-retain trade-offs, outperforming sample-level methods, probability-based token weighting heuristics, and auxiliary-model-based approaches. Moreover, the learned scores align substantially better with ground truth forget-specific spans, indicating that ATWU identifies semantically meaningful token level forgetting signals. Overall, our results suggest that retain conflict provides an effective criterion for identifying what language models should forget, enabling unsupervised learning of token level forget-specificity directly from model representations with minimal computational overhead.

2606.06306 2026-06-05 cs.CL 版本更新

Decomposing Factual Sycophancy in Language Models: How Size and Instruction Tuning Shape Robustness

分解语言模型中的事实性谄媚:规模与指令调优如何塑造鲁棒性

Victor De Marez, Luna De Bruyne, Walter Daelemans

发表机构 * Centre for Computational Linguistics, Psycholinguistics and Sociolinguistics University of Antwerp(计算语言学、心理语言学与社会语言学研究中心 荷兰安特卫普大学)

AI总结 通过将事实性谄媚分解为真值边际和操纵敏感性两个通道,研究了模型规模和指令调优对56个开源语言模型(0.3B-32B参数)在13种操纵类型下鲁棒性的影响。

详情
AI中文摘要

事实性谄媚是指语言模型在社会压力下放弃正确、可验证答案的现象。由于只有当朝向错误答案的压力超过模型对真相的中立偏好时才会发生翻转,翻转率混淆了两种机制:基线偏好强度(真值边际)以及压力将其偏移的程度(操纵敏感性)。我们将事实性谄媚分解为这些通道,并用它们来分离规模和指令调优对56个开源权重模型(参数范围0.3B-32B,13种操纵类型)的影响。我们发现脆弱性主要由规模决定,但指令调优改变了规模的作用方式:小的指令调优模型可能变得不那么鲁棒,而大的指令调优模型通常变得更鲁棒。指令调优主要增加真值边际,但其行为效果取决于操纵类型。缩放对两个通道的影响也不同:基础模型获得边际但变得略微更易受操纵影响,而指令调优模型更快地获得边际并变得不那么敏感。因此,事实性谄媚不是一个单一的标量属性。评估应报告通道特定、操纵特定和规模条件下的鲁棒性,而不仅仅是翻转率。

英文摘要

Factual sycophancy occurs when a language model abandons a correct, verifiable answer under social pressure. Because a flip occurs only when pressure toward a false answer exceeds the model's neutral preference for the truth, flip rates conflate two mechanisms: the strength of that baseline preference (truth margin), and how far pressure shifts it (manipulation sensitivity). We decompose factual sycophancy into these channels and use them to separate the effects of size and instruction tuning across 56 open-weight models spanning 0.3B-32B parameters and 13 manipulation types. We find that vulnerability is governed mainly by size, but instruction tuning changes how size acts: small instruction-tuned models can become less robust, whereas large instruction-tuned models usually become more robust. Instruction tuning primarily increases truth margin, but its behavioral effect depends on manipulation type. Scaling also changes the two channels differently: base models gain margin but become mildly more manipulation-sensitive, whereas instruction-tuned models gain margin faster and become less sensitive. Factual sycophancy is therefore not a single scalar property. Evaluations should report channel-specific, manipulation-specific, and size-conditioned robustness rather than flip rates alone.

2606.06286 2026-06-05 cs.CL cs.AI 版本更新

LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

LLMs 可能泄露训练数据,但它们愿意吗?一种基于倾向性的 LLM 记忆评估

Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech

发表机构 * University of Southern Denmark(南部丹麦大学)

AI总结 提出 PropMe 框架,通过对比前缀攻击与非对抗评估,揭示 LLM 在非对抗设置下很少泄露训练数据,并引入 SimpleTrace 流水线进行归因和度量。

详情
AI中文摘要

大型语言模型可以重现训练数据,但现有的记忆评估大多衡量模型是否可以被强制这样做,而不是在正常使用下是否会这样做。我们引入了 PropMe,一个基于倾向性的记忆评估框架,对比了基于前缀的能力攻击与非对抗性评估。我们提出了一种度量转换方法,应用于现有函数,可以创建倾向性度量。我们进一步引入了 SimpleTrace,一个基于 infini-gram 的轻量级追踪流水线,能够确定性地将模型生成归因于大规模训练语料库,并计算逐字、近逐字和倾向性转换的记忆度量。评估两个完全开放的模型:Comma 和 DFM Decoder,在两个数据集:Common Pile 和 Dynaword,以及两种语言上,我们发现能力与倾向性之间存在一致差距:前缀攻击比通用或数据集特定提示引发更强的记忆信号,而倾向性得分总体保持较低。因此,模型在直接诱导时可以泄露训练数据,但在更常见的非对抗设置中很少这样做。我们还发现,从 Comma 持续预训练的 DFM Decoder 对 Common Pile 表现出降低的记忆和记忆倾向性,证实当后续训练强调部分不同数据时,记忆能力可能下降。我们的结果表明,并鼓励,记忆审计应同时报告最坏情况下的可提取性和普通泄露倾向性,以便更全面地理解这一现象。

英文摘要

Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, allows to create propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully-open models: Comma and DFM Decoder on two datasets: Common Pile and Dynaword in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. Our results suggest, and we encourage, that memorization audits should report both worst-case extractability and ordinary leakage propensity in order to have a more comprehensive view of this phenomenon.

2606.06271 2026-06-05 cs.CL cs.HC 版本更新

FOXGLOVE: Understanding Goal-Oriented and Anchored Writing Feedback from Experts and LLMs on Argumentative Essays

FOXGLOVE: 理解专家与LLM在议论文中的目标导向和锚定写作反馈

Yijun Liu, Yifan Song, John Gallagher, Sarah Sterman, Tal August

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 通过构建FOXGLOVE数据集,系统比较了写作专家和大型语言模型在议论文反馈中的目标导向、锚定性和优先级,发现两者在反馈目标和位置分布上相似,但在具体句子选择和反馈复杂度上存在差异。

详情
AI中文摘要

虽然大型语言模型(LLMs)越来越多地被用于生成写作反馈,但对于写作研究认为对修订至关重要的维度(目标导向、锚定到特定句子和优先级),尚无LLM与专家反馈的系统比较。我们引入了FOXGLOVE数据集,包含由训练有素的写作指导员对69篇十二年级议论文撰写的696条反馈评论,以及根据共享协议从四个前沿LLM生成的1,644条评论,总计2,340条评论。我们提供了指导员和LLM评论子集的专家质量评级。我们发现指导员和LLM在目标和文章位置上的反馈分布相似,但指导员和模型在提供反馈的具体句子上存在分歧。此外,我们发现模型倾向于写出更复杂的反馈,并且比指导员使用更少的问题。LLM反馈在大多数质量维度上获得更高的评分(由指导员评分),但这一优势很大程度上可归因于更长的评论。FOXGLOVE使得系统比较人类和LLM反馈在哪些方面一致、分歧和不同成为可能。

英文摘要

While large language models (LLMs) are increasingly used to generate writing feedback, there remains no systematic comparison of LLM and expert feedback on the dimensions that writing research identifies as central to revision: goal-orientation, anchoring to specific sentences, and prioritization. We introduce FOXGLOVE, a dataset of 696 feedback comments written by trained writing instructors on 69 twelfth-grade argumentative essays, paired with 1,644 comments generated from four frontier LLMs under a shared protocol, totaling 2,340 comments. We provide expert quality ratings on a subset of both instructor and LLM comments. We find that instructors and LLMs distribute feedback similarly across goals and essay positions, yet instructors and models diverge on the specific sentences on which to provide feedback. Additionally, we find that models tend to write more complex feedback and use fewer questions than instructors. LLM feedback also receives higher ratings on most dimensions of quality, as rated by instructors, but much of this advantage appears to be attributable to lengthier comments. FOXGLOVE enables systematic comparison of where human and LLM feedback align, diverge, and differ.

2606.06267 2026-06-05 cs.CL 版本更新

Many Circuits, One Mechanism: Input Variation and Evaluation Granularity in Circuit Discovery

多电路,单机制:电路发现中的输入变化与评估粒度

Alireza Bayat Makou, Jingcheng Niu, Subhabrata Dutta, Iryna Gurevych

发表机构 * UKP Lab, Technical University of Darmstadt(达姆施塔特技术大学UKP实验室) National Research Center for Applied Cybersecurity ATHENE(应用网络安全国家研究研究中心ATHENE)

AI总结 本文通过固定任务、改变输入统计量,发现电路结构差异并不对应功能差异(称为“伪特化”),并证明结构不同的电路实现相同计算,强调边缘级评估和跨条件迁移测试的必要性。

Comments 90 pages, 53 figures

详情
AI中文摘要

电路发现方法识别解释特定模型行为的子图,发现的电路之间的结构差异通常被解释为不同机制的证据。我们通过固定任务、改变输入统计量来测试这一假设,并表明由此产生的结构差异表现出明显的特化,但不对应功能差异,我们将这种模式称为伪特化。使用跨四个词频带以及一个控制条件的字面序列复制任务,在五个Pythia模型(70M-1.4B)中提取了75个电路,发现结构不同的电路实现相同的计算:频带特定的边广泛跨频带转移,大多数频带共享的核心至少恢复电路性能的99%,因果干预实验证实内部表示在频带间可互换。在同一频带内的重复提取进一步表明,发现算法从有效子图的等价类中采样,而非恢复唯一机制。标准评估实践掩盖了这种模式:源级评估夸大了表面忠实度,而边缘级评估揭示了从结构到功能的多对一映射。我们的结果表明,电路之间的结构差异不足以作为不同机制的证据,暴露这一点需要边缘级评估和跨条件迁移测试。

英文摘要

Circuit discovery methods identify subgraphs that explain specific model behaviors, and structural differences between discovered circuits are commonly interpreted as evidence of distinct mechanisms. We test this assumption by varying input statistics while holding the task fixed, and show that the resulting structural differences exhibit apparent specialization but do not correspond to functional differences, a pattern we term phantom specialization. Using Literal Sequence Copying across four token-frequency bands plus a control condition in five Pythia models (70M-1.4B), we extract 75 circuits and find that structurally distinct circuits implement the same computation: band-specific edges transfer broadly across bands, a core shared across most bands recovers at least 99% of circuit performance, and causal interchange interventions confirm that internal representations are interchangeable across frequency bands. Repeated extractions within the same frequency band further suggest that discovery algorithms sample from an equivalence class of valid subgraphs rather than recovering a unique mechanism. Standard evaluation practice obscures this pattern: source-level evaluation inflates apparent faithfulness, while edge-level evaluation reveals the many-to-one mapping from structure to function. Our results show that structural differences between circuits are not sufficient evidence for distinct mechanisms, and that exposing this requires edge-level evaluation and cross-condition transfer tests.

2606.06266 2026-06-05 cs.CL 版本更新

From Self to Other: Evaluating Demographic Perspective-Taking in LLM Hate Speech Annotation

从自我到他人:评估LLM仇恨言论标注中的人口统计学视角采纳

Paloma Piot, Javier Parapar

发表机构 * Information Retrieval Lab(信息检索实验室) CITIC Research Centre(CITIC研究中心) Universidade da Coruña(科鲁纳大学)

AI总结 本研究通过评估人格化LLM在仇恨言论检测中模拟不同人口群体视角的能力,发现模型在群体间分歧、群体内敏感性和替代性预测三个维度上表现不一,其中使用Llama 3.1的替代性提示在多数人口统计轴上实现了最高跨群体一致性。

详情
AI中文摘要

仇恨言论检测本质上是主观的:来自不同人口群体的个体对相同内容的感知差异很大。从多个群体收集足够标注成本高昂且难以规模化。人格化大型语言模型(被提示采用特定人口身份的模型)已被提出作为一种大规模模拟多样化视角的方法。但它们是否真正反映了不同群体如何分歧?我们评估了人类社会判断的三个维度:(i) 不同群体的人格化模型是否以类人方式产生分歧(群体间分歧),(ii) 当内容针对自身身份时它们是否变得更敏感(群体内敏感性),以及 (iii) 它们是否能准确预测另一群体将如何反应(替代性预测)。我们的结果表明,没有模型能一致地捕捉所有三个维度,且性能高度依赖模型,并非仅通过最小身份提示就能可靠出现。然而,使用Llama 3.1的替代性提示在大多数人口统计轴上产生了最高的跨群体一致性,并提供了最接近人类分歧模式的整体近似,表明该配置可能为与人类判断对齐的自动标注提供更可靠的设置。

英文摘要

Hate speech detection is inherently subjective: people from different demographic groups perceive the same content very differently. Collecting enough annotations from multiple demographic groups is costly and difficult to scale. Persona-conditioned Large Language Models (models prompted to adopt a specific demographic identity) have been proposed as a way to simulate diverse perspectives at scale. But do they actually reflect how different groups disagree? We evaluate three aspects of human social judgement: (i) whether personas from different groups disagree in human-like ways (inter-group disagreement), (ii) whether they become more sensitive when content targets their own identity (in-group sensitivity), and (iii) whether they can accurately predict how another group would react (vicarious prediction). Our results show that no model consistently captures all three dimensions, and performance is highly model-dependent and does not emerge reliably from minimal identity prompts alone. However, vicarious prompting with Llama 3.1 yields the highest cross-group agreement in most demographic axes and provides the closest overall approximation to human disagreement patterns, indicating that this configuration may provide a more reliable setting for automatic annotation aligned with human judgements.

2606.06260 2026-06-05 cs.IR cs.AI cs.CL 版本更新

OneReason Technical Report

OneReason 技术报告

OneRec Team, Biao Yang, Boyang Ding, Chenglong Chu, Dunju Zang, Fei Pan, Han Li, Hao Jiang, Honghui Bao, Huanjie Wang, Jian Liang, Jiangxia Cao, Jiao Ou, Jiaxin Deng, Jinghao Zhang, Kun Gai, Lu Ren, Peiru Du, Pengfei Zheng, Rongzhou Zhang, Ruiming Tang, Shiyao Wang, Siyang Mao, Siyuan Lou, Teng Shi, Wei Yuan, Wenlong Xu, Xingchen Liu, Xingmei Wang, Xinqi Jin, Yan Sun, Yan Wang, Yifei Hu, Yingzhi He, Yufei Ye, Yuhao Wang, Yunhao Zhou, Yuqin Dai, Zhao Liu, Zhipeng Wei, Zhixin Ling, Ziming Li, Zixing Zhang, Ziyuan Liu, An Zhang, Changxin Lao, Chaoyi Ma, Chengru Song, Defu Lian, Fan Yang, Guowang Zhang, Hao Peng, Jiayao Shen, Jie Chen, Jun Xu, Junmin Chen, Kun Zhang, Kuo Cai, Mingxing Wen, Minmao Wang, Minxuan Lv, Qi Zhang, Qiang Luo, Sheng Yu, Shijie Li, Shijie Yi, Shuang Yang, Shugui Liu, Shuni Chen, Tinghai Zhang, Tingting Gao, Xiang Wang, Xiangyu Wu, Xiangyu Zhao, Xiao Lv, Xiaoyou Zhou, Xuming Wang, Yong Du, Zejian Zhang, Zhaojie Liu, Zhiyang Zhang, Zhuang Zhuang, Ziqi Wang, Ziyi Zhao

发表机构 * OneRec Team(OneRec团队)

AI总结 针对生成式推荐模型中推理能力难以激活的问题,提出 OneReason 方法,通过增强感知和认知能力实现有效推理。

Comments Work in progress

详情
AI中文摘要

OneRec 系列中的生成式推荐模型已广泛应用于短视频、直播、广告和电子商务等实际服务中。然而,这些生成模型只能受益于规模优势,其推理能力难以激活,因为我们无法构建仅由物品令牌组成的有意义的思维链序列。受大语言模型领域“先思考后回答”推理范式成功的启发,我们进行了初步研究(即 OneRec-Think、OpenOneRec)以探索生成式推荐中的推理能力。尽管如此,我们注意到一个意外现象:思考模式并未显示出优于非思考模式的优势。从多模态语言模型中关于思维链鲁棒性的最新发现中汲取见解,我们认为推荐中的有效推理依赖于两个因素:感知,即将物品令牌与其底层语言语义相联系的能力;以及认知,即将用户行为序列重组为连贯的潜在兴趣点的能力。因此,我们提出 OneReason,其中包括:(1)预训练中强大的物品令牌感知能力,(2)针对推荐任务的三级认知增强思维链格式在监督微调中,(3)在强化学习中采用先专化后统一的训练方案以增强思考能力。

英文摘要

Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style ``think before answer'' paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user's behavior sequence into coherent latent interest points. We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.

2606.06242 2026-06-05 cs.CL cs.AI cs.CV cs.IR 版本更新

Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

面向机构文档数据快照提取的开源布局检测模型基准测试

AJ Carl P. Dy, Aivin V. Solatorio

发表机构 * Development Data Group Office of the World Bank Group Chief Statistician(世界银行发展数据分析组办公室世界银行统计主任) The World Bank(世界银行)

AI总结 针对机构文档中图表数据快照提取任务,构建基准数据集并评估多个开源布局检测模型,发现现有模型在操作型文档上泛化能力不足,存在内容混淆、碎片化及上下文缺失等问题。

Comments 23 pages, 8 figures

详情
AI中文摘要

机构文档中的图表包含大量操作和分析信息。当前从文档中提取视觉内容的方法主要围绕通用文档布局分析,将图表视为统一相关的文档对象,而非具有语义意义的分析产物。在这项工作中,我们引入了一个基准数据集和评估框架,用于 extit{数据快照提取},即识别和定位机构文档中具有语义意义的视觉产物的任务。该基准涵盖人道主义报告、世界银行政策研究工作论文和项目评估文件,并包含包含可重用分析信息的图表注释。利用该数据集,我们对多个开源布局检测模型进行了基准测试,并评估了检测性能和空间提取质量。结果表明,尽管当前模型在传统学术基准上表现强劲,但在操作型机构文档上难以泛化。常见的失败模式包括分析内容与非分析内容混淆、复合分析产物碎片化,以及解释所需的上下文信息提取不完整。这些发现凸显了通用文档布局分析与操作上有用的数据快照提取之间持续存在的差距。我们发布了源PDF、注释数据集、元数据和源代码,以支持操作型文档智能的未来研究。数据集可在https://huggingface.co/datasets/ai4data/data-snapshot获取,源代码可在https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot获取。

英文摘要

Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for \textit{data snapshot extraction}, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data-snapshot and the source code is available at https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot.

2606.06211 2026-06-05 cs.CL cs.SD eess.AS 版本更新

FiLM-Based Speaker Conditioning of a SpeechLLM for Pathological Speech Recognition

基于FiLM的说话人条件化SpeechLLM用于病理语音识别

Fernando López, Santosh Kesiraju, Jordi Luque

发表机构 * Telefónica Innovación DigitalSpain(西班牙电信创新数字研究院) Universidad Autónoma de MadridSpain(马德里自治大学) Brno University of TechnologyCzech Republic(捷克布拉格技术大学)

AI总结 本研究提出通过特征线性调制(FiLM)将x-vector说话人信息注入冻结的ASR编码器各Transformer层,实现对病理语音的说话人自适应,在不修改基础模型权重的情况下提升识别性能,并保持对非条件化语音的问答能力。

Comments Accepted in Odyssey 2026: The Speaker and Language Recognition Workshop

详情
AI中文摘要

自动语音识别(ASR)在标准语音方面取得了显著进展;然而,来自神经系统疾病的病理语音仍然是一个重大挑战。我们研究了通过特征线性调制(FiLM)进行说话人条件化,将x-vector派生信息注入冻结的ASR编码器的每个Transformer层,以在不修改基础模型权重的情况下适应个体病理说话人的内部表示。我们在西班牙语和英语病理语音上,针对ASR任务将其与标准和参数高效微调基线进行基准测试,并辅以后处理。此外,我们评估了自适应模型是否保留了回答语音相关问题的能力。结果表明,说话人条件化的ASR与已建立的适应策略具有竞争力,同时保持了对非条件化语音的性能。

英文摘要

Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.

2606.06203 2026-06-05 cs.CL cs.AI 版本更新

Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs

密集上下文是困难上下文:词汇密度限制LLM的有效上下文

Giovanni Dettori, Matteo Boffa, Danilo Giordano, Idilio Drago, Marco Mellia

发表机构 * Department of Computer Science Politecnico di Torino(计算机科学系politecnico di torino大学) Department of Computer Science University of Turin(计算机科学系都灵大学)

AI总结 本文通过三个“找针”式基准测试,发现词汇密度(上下文引入不同信息的速率)是除输入长度和相关信息位置外,第三个系统性降低LLM有效上下文窗口的因素,并证明降低密度可恢复性能。

Comments 20 pages, 6 figures

详情
AI中文摘要

输入长度和相关信息的位置被广泛认为是导致LLM长上下文性能下降的主要原因。在这里,我们研究词汇密度——上下文引入不同信息的速率——作为第三个被广泛忽视的因素,它系统地缩小了LLM的有效上下文窗口。我们使用三个“找针”式基准测试,在相同长度(约12k tokens)和受控的针位置但信息密度递增的情况下,量化了词汇密度对开放权重LLM(9B-685B)的影响。我们观察到在高密度基准测试中性能急剧下降:在稀疏上下文中近乎完美的模型在密集上下文中的检索分数降至60%以下。为了排除任务类型混淆,我们在每个基准测试内部改变并控制密度,同时保持其他所有属性不变。降低密度通常能恢复性能,尤其是在出现退化的高密度区域。这些结果表明,有效上下文容量是词汇密度的函数,对运行在紧凑、信息丰富输入上的真实世界LLM系统具有直接影响。

英文摘要

Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density -- the rate at which a context introduces distinct information -- as a third, largely overlooked factor that systematically reduces the effective context window of LLMs. We quantify the impact of lexical density on open-weight LLMs (9B-685B) using three "find-the-needle" style benchmarks with identical length (~12k tokens) and controlled needle position, but increasing density of information. We observe a sharp performance collapse in higher-density benchmarks: models that are near-perfect in sparse contexts drop below 60% retrieval score on denser ones. To rule out task-type confounds, we vary and control the density within each benchmark while keeping all other properties unchanged. Reducing density generally restores performance, especially in the high-density regimes where degradation appears. These results show that effective context capacity is a function of lexical density, with direct implications for real-world LLM systems operating on compact, information-rich inputs.

2606.06197 2026-06-05 cs.CL cs.AI 版本更新

Improving Answer Extraction in Context-based Question Answering Systems Using LLMs

利用大语言模型改进基于上下文的问答系统中的答案提取

Hafez Abdelghaffar, Ahmed Alansary, Ali Hamdi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对现有问答系统在复杂或模糊查询下答案提取不准确的问题,提出基于微调预训练大语言模型的方法,在SQuAD1.1数据集上取得ROUGE-L 86.84%、BLEU 28.24%、BERTScore 95.38%的高性能。

Comments 7 pages, IMSA2026

详情
AI中文摘要

随着大语言模型(LLM)的出现,问答(QA)系统取得了显著进展。然而,它们在从给定上下文中准确提取和生成精确答案方面仍面临挑战,尤其是在处理复杂或模糊查询时。现有方法通常在上下文理解、答案一致性和跨不同领域的泛化能力方面存在不足。在这项工作中,我们提出了一种基于大语言模型的问答系统,其输入由文本上下文和相应问题组成,输出为简洁准确的答案。本研究旨在解决当前QA系统的局限性,特别是它们即使能够访问正确上下文也倾向于产生不相关或不精确响应的问题。我们的方法包括在基准QA数据集上微调预训练的LLM,以提高其上下文理解和答案提取能力。具体来说,我们使用斯坦福问答数据集(SQuAD1.1),该数据集提供了高质量的上下文-问题-答案三元组用于监督训练和评估。实验结果表明,微调后的Roberta-base模型取得了最高性能,ROUGE-L得分为86.84%,BLEU得分为28.24%,BERTScore为95.38%。这些结果表明了强大的准确性和答案相关性,证明了所提方法在基于上下文的问答任务中的有效性。此外,研究结果证实,有针对性的微调显著提高了QA系统的可靠性和精确性。

英文摘要

Question answering (QA) systems have achieved notable progress with the advent of large language models (LLMs). However, they still face challenges in accurately extracting and generating precise answers from given contexts, particularly when dealing with complex or ambiguous queries. Existing approaches often struggle with contextual understanding, answer consistency, and generalization across diverse domains. In this work, we propose a question answering system based on large language models, where the input consists of a textual context and a corresponding question, and the output is a concise and accurate answer. The motivation behind this research lies in addressing the limitations of current QA systems, particularly their tendency to produce irrelevant or imprecise responses despite having access to the correct context. Our methodology involves fine-tuning a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction capabilities. Specifically, we utilize the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation. Experimental results show that the fine-tuned Roberta-base model achieves the highest performance, attaining a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%. These results indicate strong accuracy and answer relevance, demonstrating the effectiveness of the proposed approach for context-based question answering tasks. Furthermore, the findings confirm that targeted fine-tuning substantially improves the reliability and precision of QA systems.

2606.06188 2026-06-05 cs.CL 版本更新

The Tell-Tale Norm: $\ell_2$ Magnitude as a Signal for Reasoning Dynamics in Large Language Models

告密范数:$\ell_2$ 幅度作为大语言模型推理动态的信号

Jinyang Zhang, Hongxin Ding, Yue Fang, Weibin Liao, Muyang Ye, Junfeng Zhao, Yasha Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出隐藏状态的 $\ell_2$ 范数作为大语言模型推理强度的内生信号,并通过稀疏自编码器分析、理论证明和因果干预验证其有效性,进而引入三种基于 $\ell_2$ 范数的测试时缩放技术以提升推理性能。

Comments ICML

详情
AI中文摘要

近期工作试图理解大语言模型的推理过程,但一种能够捕捉其逐层推理动态的、基于模型内在信号的原理性方法仍未得到充分探索。我们通过证明隐藏状态的 $\ell_2$ 范数作为模型推理强度的内生信号来填补这一空白。利用稀疏自编码器作为诊断探针,我们观察到 LLM 的内部推理以集中在后期层的推理特征激活急剧增加为特征。受此模式启发,我们在推理强度与模型潜在几何之间建立了正式联系,并从理论上证明隐藏状态的 $\ell_2$ 范数限制了 SAE 推理特征的激活强度。经验相关性分析和因果干预进一步验证了 $\ell_2$ 范数作为忠实指标的有效性,其中较高的范数始终对应于关键推理步骤。随后,我们引入了三种由 $\ell_2$ 范数指导的测试时缩放技术:(i) 自适应逐层推理递归,(ii) 内生推理状态引导,以及 (iii) $\ell_2$ 引导的响应选择,这些技术无需额外训练或数据,且与高级推理引擎兼容。跨模型架构和基准的实验表明,基于 $\ell_2$ 范数的技术显著提升了推理性能,为感知和控制 LLM 潜在推理动态提供了一种原理性且简单的方法。我们的代码可在 https://github.com/zjy1298/The-Tell-Tale-Norm 获取。

英文摘要

Recent work has sought to understand Large Language Models (LLMs) reasoning, yet a principled, model-intrinsic signal that captures its layer-wise reasoning dynamics remains underexplored. We bridge this gap by demonstrating that the l2 norm of hidden states serves as an endogenous signal of the model's reasoning intensity. Using Sparse Autoencoders (SAEs) as a diagnostic probe, we observe that LLMs' internal reasoning is marked by a sharp increase in reasoning feature activations concentrated in late layers. Motivated by this pattern, we establish a formal link between reasoning intensity and the model's latent geometry and theoretically prove that the l2 norm of hidden states bounds the activation strength of SAE reasoning features. Empirical correlation analysis and causal interventions further validate the l2 norm as a faithful indicator, where heightened norms consistently correspond to critical reasoning steps. We then introduce three test-time scaling techniques guided by l2 norms: (i) Adaptive Layer-wise Reasoning Recursion, (ii) Endogenous Reasoning State Steering, and (iii) l2-guided Response Selection, which requires no additional training or data and is compatible with advanced inference engines. Experiments across model architectures and benchmarks show that l2-norm-based techniques significantly improve reasoning performance, offering a principled yet simple lens to perceive and control LLM latent reasoning dynamics. Our code is available at https://github.com/zjy1298/The-Tell-Tale-Norm.

2606.06183 2026-06-05 eess.AS cs.CL 版本更新

Revisiting Lexicon Evaluation in Unsupervised Word Discovery

重新审视无监督词汇发现中的词汇评估

Simon Malan, Danel Slabbert, Herman Kamper

发表机构 * Google Africa PhD Fellowship(谷歌非洲博士 fellowship) South African National Research Foundation(南非国家研究基金会)

AI总结 针对无监督词汇发现中常用评估指标(归一化编辑距离)偏向大簇质量且忽略真实类别分布的问题,提出两种新指标:修正的簇内一致性指标和逆分布指标,通过实验证明其与真实分布更相关且更鲁棒。

Comments 6 figures

详情
AI中文摘要

从发现的类词单元构建词汇是零资源语音处理的核心目标。但我们的评估是否提供了词汇质量的可靠指示?一个常用指标——归一化编辑距离,平均每个簇中发现单元的音素编辑距离。我们表明该指标固有地偏向大簇的质量,阻碍了公平评估。此外,它忽略了真实类别在簇间的分布情况。基于聚类文献中的既定理论,我们提出了两个解决这些缺点的指标:一个修正的指标,在评估簇内一致性时权衡簇大小;以及一个逆指标,评估真实单词在簇间的分布。通过在合成和真实词汇上的实验,我们证明这些指标组合起来:(1)与词汇接近真实分布的程度更紧密相关,(2)对扭曲词汇评估的偏差更鲁棒。

英文摘要

Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.

2606.06178 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning

通过元学习从隐式成本-性能偏好中学习路由LLM

Jiahao Zeng, Ming Tang, Ningning Ding

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Southern University of Science and Technology(南方科技大学)

AI总结 提出MetaRouter框架,利用元学习从少量交互中学习用户隐式成本-性能偏好,实现个性化LLM路由,在分布内外任务上优于基线方法。

详情
AI中文摘要

大型语言模型(LLM)在性能与成本之间存在权衡,更强大的模型会产生更高的费用。LLM路由旨在通过将查询发送到最合适的模型来降低费用同时保持性能。然而,现有方法无法很好地适应不同用户的成本-性能偏好。为了解决这一差距,我们引入了一种新颖的感知LLM路由范式,用于个性化和以用户为中心的成本-性能优化,通过少量交互高效学习用户的隐式偏好。为了应对异构用户需求的挑战,我们将偏好配置文件形式化为上下文赌博机中的一组不同任务,并提出了MetaRouter,一个用于偏好感知LLM路由的元学习框架。实验结果表明,MetaRouter在分布内和分布外任务上均优于强基线。此外,它在学习用户偏好方面表现出高效率,对可路由LLM的变化具有鲁棒性,并且可扩展到多模型路由。

英文摘要

Large language models (LLMs) present a trade-off between performance and cost, where more powerful models incur greater expense. LLM routing aims to mitigate expenses while maintaining performance by sending queries to the most suitable model. However, existing methods cannot perform well for different user cost-performance preferences. To address this gap, we introduce a novel perceptive LLM routing paradigm for personalized and user-centric cost-performance optimization, which efficiently learns users' implicit preferences through little interaction. To handle the challenge of heterogeneous user needs, we formulate preference profiles as a set of distinct tasks in contextual bandit and propose MetaRouter, a meta-learning framework designed for preference-aware LLM routing. Experimental results show that MetaRouter outperforms strong baselines on both in-distribution and out-of-distribution tasks. Furthermore, it exhibits high efficiency in learning user preferences, robustness to changes in the routable LLMs, and scalability to multi-model routing.

2606.06177 2026-06-05 cs.CL cs.HC 版本更新

Ouvia: A User-centered Framework for Measuring Usability of Speech Translation in Real-World Communication Scenarios

Ouvia:一种以用户为中心的框架,用于衡量真实世界通信场景中语音翻译的可用性

Giuseppe Attanasio, Beatrice Savoldi, Daniel Chechelnitsky, Matteo Negri, Marine Carpuat, Maarten Sap, André F. T. Martins

发表机构 * Instituto de Telecomunicações(电信研究所) Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) Carnegie Mellon University(卡内基梅隆大学) University of Maryland(马里兰大学) Instituto Superior Técnico(技术高级研究所)

AI总结 提出Ouvia框架,通过收集1750+次真实医疗和日常场景中的交互,评估语音翻译的用户感知可用性,发现现代ST仅部分可用(约一半交互被评为可用),且QA评估比标准方法更能预测可用性。

Comments Code and data at https://github.com/g8a9/ouvia

详情
AI中文摘要

语音翻译(ST)在用户应用中日益普及,但其评估主要侧重于去情境化的测试床和整体质量,而非最终用户的通信需求。我们引入了Ouvia,一个用于衡量真实世界环境中语音翻译输出的用户感知可用性的评估框架。Ouvia专注于一对一通信:一位英语使用者需要向一位葡萄牙语使用者传达请求,消息被自动翻译。通过自定义网页应用和多阶段研究设计,我们在医疗和日常情境中收集了超过1750次此类交互,涉及四个ST系统,以及来自三种英语方言和两种性别的使用者。我们发现,现代ST只能有限地服务于人们——只有大约一半的交互被评为可用——且不同人口统计群体报告的可用性存在显著差距。此外,在质量指标中,我们发现基于QA的评估比标准方法更能预测真实世界的可用性。这些发现共同强调了情境化、以用户为中心的评估框架的重要性,这些框架超越了整体质量分数,并关注技术服务于谁——以及服务得如何。

英文摘要

Speech translation (ST) is increasingly adopted in user applications, yet its evaluation largely focuses on decontextualized testbeds and holistic quality, rather than end users' communication needs. We introduce Ouvia, an evaluation framework for measuring user-perceived usability of speech translation outputs in real-world settings. Ouvia focuses on one-to-one communication: an English speaker needs to convey a request to a Portuguese speaker, and the message is automatically translated. Through a custom web app and multi-phase study design, we collect more than 1,750 such interactions in healthcare and everyday situations, mediated by four ST systems, involving speakers from three English dialects and two genders. We find that modern ST serves people only to a limited extent -- only around half of interactions are rated as usable -- with significant gaps in reported usability across demographic groups. Moreover, among quality metrics, we find that QA-based evaluation is a substantially stronger predictor of real-world usability than standard approaches. Together, these findings stress the importance of situated, user-centered evaluation frameworks that go beyond holistic quality scores and attend to who the technology serves -- and how well.

2606.06168 2026-06-05 cs.AI cs.CL 版本更新

ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity

ProSarc: 通过时间韵律不协调性进行韵律感知的讽刺识别框架

Prathamjyot Singh, Ashima Sood, Sahil Sharma, Jasmeet Singh

发表机构 * Department of Computer Science and Engineering, Thapar Institute of Engineering and Technology, Patiala, India(1 计算机科学与工程系,泰帕尔工程与技术学院,印度帕蒂亚拉) School of Computing, Engineering and Intelligent Systems, Ulster University, Londonderry, United Kingdom(2 计算学、工程与智能系统学院,乌斯特大学,英国伦敦德里) School of Computing, Ulster University, Belfast, United Kingdom(3 计算学学院,乌斯特大学,英国贝尔法斯特)

AI总结 提出ProSarc,一个仅利用音频的框架,通过建模局部韵律动态与话语级情感基线之间的时间韵律不协调性来检测讽刺,在MUStARD++等数据集上取得最优性能。

Comments Accepted at Interspeech 2026, Sydney

详情
AI中文摘要

我们提出了ProSarc,一个仅利用音频的框架,通过建模时间韵律不协调性(即局部韵律动态与话语级情感基线之间的不匹配)来检测讽刺。双编码路径——全局情感编码器和时间韵律编码器(BiLSTM + 多头注意力)——馈送到韵律不协调性分析器,该分析器产生一个标量不协调性分数用于分类。蒙特卡洛dropout提供不确定性估计,基于注意力的机制无需帧级标签即可定位讽刺起始点。ProSarc在MUStARD++(F1=75.3)上优于先前的纯音频方法,并泛化到自发性语音(PodSarc,F1=62.9)和跨语言语音(MuSaG,F1=65.6)。十次运行验证证实了不协调性建模的贡献(Wilcoxon p=0.002,Cohen's d=1.51)。人工评估表明,模型不确定性追踪感知模糊性,预测的起始点与人工标注的时间窗口对齐。

英文摘要

We present ProSarc, an audio-only framework that detects sarcasm by modelling temporal prosodic incongruity, that is, the mismatch between local prosodic dynamics and the utterance-level emotional baseline. Dual encoding paths, a Global Emotion Encoder and a Temporal Prosody Encoder (BiLSTM + multi-head attention), feed a Prosodic Incongruity Analyzer that produces a scalar incongruity score for classification. Monte Carlo dropout provides uncertainty estimates, and an attention-based mechanism localises sarcastic onset without frame-level labels. ProSarc outperforms prior audio-only methods on MUStARD++ (F1=75.3) and generalises to spontaneous (PodSarc, F1=62.9) and cross-lingual speech (MuSaG, F1=65.6). Ten-run validation confirms the contribution of incongruity modelling (Wilcoxon p=0.002, Cohen's d=1.51). Human evaluation shows that model uncertainty tracks perceptual ambiguity and predicted onsets align with human-annotated temporal windows.

2606.06160 2026-06-05 cs.AI cs.CL 版本更新

Where does Absolute Position come from in decoder-only Transformers?

在仅解码器Transformer中,绝对位置从何而来?

Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri

发表机构 * Sapienza University of Rome(罗马大学萨皮恩扎分校) Intuition Machines(直觉机器)

AI总结 本文研究了RoPE训练的仅解码器Transformer中绝对位置信息的来源,发现因果掩码和残差流是导致绝对位置泄露的两个关键组件,并提出了通过替换BOS嵌入来减少残差流成分的方法。

详情
AI中文摘要

RoPE训练的Transformer在其注意力模式中区分绝对位置,尽管RoPE在内积中仅编码相对偏移。我们将这种泄露追溯到两个架构组件。因果掩码是第一个:其每个查询的softmax分母按构造依赖于绝对查询位置。残差流提供第二个。在因果注意力下,位置$0$处的激活仅关注自身,并作为封闭动力系统从该位置token的嵌入运行;下游注意力通过sink-reading头读取该轨迹。这两个组件在我们研究的所有三种架构中都存在,但以架构特定的平衡出现:NTK缩放抑制残差流组件,滑动窗口注意力使其随深度累积,而标准RoPE介于两者之间。在前向传播前替换\texttt{BOS}嵌入可消除早期查询中$40\%$的残差流组件。注意力sink是锚定在token上的稳定器,传递位置$0$处token的确定性指纹,当该token是自动预置的\texttt{BOS}时,该指纹跨输入恒定,否则随其变化。

英文摘要

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components, The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position $0$ attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the \texttt{BOS} embedding before the forward pass removes $40\%$ of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position $0$, constant across inputs when that token is the auto-prepended \texttt{BOS} and varying with it otherwise.

2606.06109 2026-06-05 cs.CL cs.AI 版本更新

Harnessing Structural Context for Entity Alignment Foundation Models

利用结构上下文进行实体对齐基础模型

Xingyu Chen, Yuanning Cui, Zequn Sun, Wei Hu

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China(南京大学新型软件技术国家重点实验室) Nanjing University of Information Science and Technology, Nanjing, China(南京信息科学技术大学) National Institute of Healthcare Data Science, Nanjing University, Nanjing, China(南京大学健康数据科学国家研究院)

AI总结 提出ContextEA框架,通过交叉KG交互编码器和结构校准解码器增强结构上下文的构建与利用,在29个数据集上超越强基线,实现更强的跨KG迁移能力。

详情
AI中文摘要

实体对齐(EA)旨在识别异构知识图谱(KG)中的等价实体,是知识融合和跨KG推理的关键组成部分。最近的EA基础模型表明,对齐知识一旦预训练,可以直接应用于各种未见过的KG对。然而,它仍然在两个地方未充分利用结构上下文:编码时跨KG交互较弱,最终候选排序仍然过于依赖粗略的相似性。我们通过ContextEA(一种用于可迁移EA的增强型编码器-解码器框架)来解决这些局限性。在编码器侧,我们引入了一个跨KG交互编码器,该编码器通过锚点桥统一两个KG,并执行更早的关系感知跨图传播。在解码器侧,我们引入了一个结构校准解码器,该解码器使用实体级、邻域级、关系级和锚点感知的结构证据来校准对齐分数。这种设计在保持轻量级的同时,增强了结构上下文的构建和利用。在OpenEA、SRPRS和DBP的29个EA数据集上的实验显示,与强可迁移基线相比,取得了持续改进。值得注意的是,预训练的ContextEA已经在所有三个基准组上超越了微调基线,显示出对未见KG的显著更强的迁移能力。这些结果表明,显式利用结构上下文是改进EA基础模型的有效方向。

英文摘要

Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment knowledge, once pretrained, can be directly applied to diverse previously unseen KG pairs. However, it still underuses structural context in two places: cross-KG interaction is weak during encoding, and final candidate ranking still relies too heavily on coarse similarity. We address these limitations with ContextEA, an enhanced encoder-decoder framework for transferable EA. On the encoder side, we introduce a cross-KG interaction encoder that unifies the two KGs with anchor bridges and performs earlier relation-aware cross-graph propagation. On the decoder side, we introduce a structural calibration decoder that calibrates alignment scores with entity-level, neighborhood-level, relation-level, and anchor-aware structural evidence. This design strengthens both structural context construction and structural context exploitation while remaining lightweight. Experiments on 29 EA datasets in OpenEA, SRPRS, and DBP show consistent gains over strong transferable baselines. Notably, the pretrained ContextEA already surpasses the finetuned baselines on all three benchmark groups, demonstrating substantially stronger transfer to unseen KGs. These results suggest that explicitly harnessing structural context is an effective direction for improving EA foundation models.

2606.06098 2026-06-05 cs.CL cs.LG 版本更新

IR3DE: A Linear Router for Large Language Models

IR3DE:面向大型语言模型的线性路由器

Eros Fanì, Oğuzhan Ersoy

发表机构 * Gensyn

AI总结 提出基于岭回归的线性路由器IR3DE,以低成本快速为每个提示选择最合适的领域专家大语言模型,在推理任务中超越基线方法,并支持动态添加或移除专家模型。

Comments Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference

详情
AI中文摘要

基础大型语言模型(LLM)在广泛的一般任务上表现出色,并通过领域专家LLM在各种专业任务上取得显著成果。随着可用LLM列表的不断增长,推理路由器被提出以选择每个提示最合适的LLM。然而,现有的路由方法要么优化弱到强通用LLM的成本,要么需要大量训练来支持领域专家路由。在本文中,我们提出IR3DE,一种基于岭回归的领域专家路由器,为每个提示提供廉价且快速的路由决策。我们在两种因果语言建模(CLM)设置中评估IR3DE,其中任务是对所有域进行下一个词预测,以及一种推理设置,其中每个域有自己的独特推理任务。尽管是线性路由器,IR3DE在两种CLM设置中实现了与其他基线相当的性能,并在推理设置中超越了它们,归一化性能达到98.4%。此外,IR3DE允许添加或移除新的领域专家,而无需从头重新训练路由器,从而可以动态服务一组LLM,对路由器本身的干扰最小。我们的代码可在github.com/gensyn-ai/IR3DE获取。

英文摘要

Foundational Large Language Models (LLMs) demonstrate proficiency on a wide range of general tasks, and achieve remarkable results on various specialized tasks via domain-expert LLMs. With the ever-growing list of available LLMs, inference routers are being proposed to select the most appropriate LLM for each prompt. However, existing routing methods either optimize cost across weak-to-strong generalist LLMs or require substantial training to support domain-expertise routing. In this paper, we propose IR3DE, a Ridge Regression-based Router for Domain Experts that provides cheap and fast routing decisions for each prompt. We evaluate IR3DE in two Causal Language Modeling (CLM) settings where the tasks are next-token prediction for all domains, and one reasoning setting where each domain has its own distinct reasoning task. Despite being a linear router, IR3DE achieves performance comparable to the other baselines in both CLM settings, and surpassing them in the reasoning setting, with a normalized performance of 98.4%. Moreover, IR3DE enables the addition or removal of new domain experts without requiring the router to be retrained from scratch, allowing a dynamic set of LLMs to be served with minimal disruption to the router itself. Our code is available at: github.com/gensyn-ai/IR3DE.

2606.06096 2026-06-05 cs.LG cs.AI cs.CL 版本更新

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

OrderGrad: 通过顺序统计量策略梯度估计超越均值优化

Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Shota Takashiro, Soichiro Nishimori, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学)

AI总结 提出OrderGrad,一种用于顺序统计量目标的似然比和重参数化梯度估计器族,通过奖励变换实现风险厌恶、鲁棒和探索性学习的统一即插即用方法。

详情
AI中文摘要

策略梯度方法通常优化期望回报,但许多现实应用关心回报的分布特性:尾部风险、异常值鲁棒性或最佳K发现。我们引入OrderGrad,一种用于顺序统计量目标的似然比和重参数化梯度估计器族。OrderGrad优化有限样本L-统计量,即排序奖励或成本的加权平均,通过仅改变秩权重来恢复诸如VaR、CVaR、修剪均值、中位数和top-m/最佳K标准等目标。对于任何固定样本大小和秩权重向量,OrderGrad为相应的顺序统计量目标提供无偏梯度估计。该方法实现为简单的奖励变换,然后可在其他标准策略梯度或重参数化更新中使用。我们研究了所得估计量的方差行为,并在均值优化与部署目标不匹配的任务上进行了评估,包括LLM数学后训练和其他任务。OrderGrad为风险厌恶、鲁棒和探索性学习提供了统一的即插即用途径。代码:https://github.com/paavo5/ordergrad

英文摘要

Policy-gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning. Code: https://github.com/paavo5/ordergrad

2606.06088 2026-06-05 cs.CL 版本更新

CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios

CHALIS:困难场景下的语言识别挑战数据集

Michal Tichý, Jindřich Libovický

发表机构 * Charles University, Faculty of Mathematics and Physics(查理大学数学与物理系) Institute of Formal and Applied Linguistics(形式与应用语言学研究所)

AI总结 提出CHALIS数据集,针对亲缘语言和拼写噪声等困难场景,通过收集互懂语言对句子和模拟拼写噪声,评估四种语言识别系统,发现它们在低资源语言和音译输入上表现不佳。

Comments 7 pages

详情
AI中文摘要

我们提出了CHALIS(挑战性语言识别样本),这是一个新的基准数据集,专门设计用于处理语言识别中的困难情况:亲缘语言和拼写噪声。我们的数据集包含两部分:首先,我们收集了在相互可理解的语言对(捷克语/斯洛伐克语、西班牙语/加泰罗尼亚语、葡萄牙语/加利西亚语、丹麦语/挪威语)中共享的句子。第二部分测试拼写噪声:我们将文本音译成多种文字,去除变音符号,模拟同形词攻击,并使用网络俚语。我们在CHALIS上评估了四种广泛使用的语言识别系统,并证明所有系统在这些场景中都存在很大困难,尤其是在亲缘语言对中的低资源语言和音译输入上。该资源公开在 https://huggingface.co/datasets/michal-tichy/CHALIS。

英文摘要

We present CHALIS (Challenging Language Identification Samples), a new benchmark dataset explicitly designed to address difficult cases in language identification: cousin languages and orthographic noise. Our dataset has two parts: First, we collected sentences shared across mutually intelligible language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian). The second part tests for orthography noise: we transliterate text across multiple scripts, remove diacritics, simulate homoglyph attacks, and use Internet slang. We evaluate four widely used language identification systems on CHALIS and demonstrate that all struggle substantially in these scenarios, especially on lower-resource languages within cousin pairs and on transliterated input. The resource is publicly available at https://huggingface.co/datasets/michal-tichy/CHALIS.

2606.06087 2026-06-05 cs.CL cs.AI 版本更新

LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents

LatentSkill: 从上下文文本技能到LLM智能体的权重内隐技能

Aofan Yu, Chenyu Zhou, Tianyi Xu, Zihan Guo, Rong Shan, Zhihui Fu, Jun Wang, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Sun Yat-Sen University(中山大学) Shanghai Innovation Institute(上海创新研究院) OPPO Research Institute(OPPO研究院)

AI总结 提出LatentSkill框架,通过预训练超网络将文本技能转换为即插即用的LoRA适配器,将技能知识存储在权重空间而非上下文空间,从而减少预填充令牌并提升性能。

Comments 16 pages, 4 figures

详情
AI中文摘要

智能体系统越来越多地使用文本技能来编码可重用的任务流程,但在每一步将这些技能注入提示中会带来大量的上下文开销,并将技能内容暴露为明文。我们提出了LatentSkill,一个通过预训练超网络将文本技能转换为即插即用LoRA适配器的框架。LatentSkill将技能知识存储在权重空间而非上下文空间中,消除了每步的技能令牌,同时保留了模块化加载、缩放和组合。在ALFWorld和Search-QA上,LatentSkill在显著减少预填充令牌的情况下,优于相应的上下文技能基线:在ALFWorld的已见和未见划分上,它分别提高了21.4和13.4个百分点的成功率,预填充令牌减少了64.1%;在Search-QA上,精确匹配提高了3.0个百分点,技能令牌开销降低了72.2%。进一步分析表明,生成的技能LoRA形成了结构化的语义几何,可以通过LoRA缩放系数精确控制,并且在技能组件对齐时可以通过参数空间算术进行组合。这些发现表明,权重空间技能为扩展LLM智能体提供了一种高效、模块化且暴露更少的基础。

英文摘要

Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.

2606.06080 2026-06-05 cs.LG cs.AI cs.CL 版本更新

On Advantage Estimates for Max@K Policy Gradients

关于 Max@K 策略梯度的优势估计

Shota Takashiro, Soichiro Nishimori, Paavo Parmas, Yongmin Kim, Kohsei Matsutani, Gouki Minegishi, Yusuke Iwasawa, Takeshi Kojima, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学)

AI总结 针对稀疏奖励下推理模型后训练困难,提出一种新的优势估计方法 MaxPO,通过 Leave-Two-Out 基线实现中心化优势,降低梯度方差并提升性能。

详情
AI中文摘要

具有可验证奖励的强化学习广泛用于推理模型的后训练,但稀疏的结果奖励使得探索困难。一种补充方法是直接优化推理时目标如 pass@K 和 max@K,然而现有针对这些目标的策略梯度估计器使用不同的信号、基线和归一化,使得它们之间的关系不明确。我们通过基线设计和优势中心化来研究这个问题。从该领域领先方法的优势估计器出发,我们证明它是策略梯度无偏的,但产生非中心化的优势。然后我们引入一种 Leave-Two-Out 基线,它在保持策略梯度无偏性的同时,使得实现的批次优势完全中心化。由此产生的方法 MaxPO 具有高效的二次时间实现,并自然地集成到基于组的 LLM 后训练强化学习中。我们进一步推导了 max@K 的规范有限批次优势,为现有优势估计器提供了统一视角。实验上,我们验证了 L2O 基线降低了梯度方差,并优于非中心化的替代方案。

英文摘要

Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.

2606.06079 2026-06-05 cs.CL 版本更新

SkillComposer: Learning to Evolve Agent Skills for Specification and Generalization

SkillComposer: 学习演化智能体技能以实现特化与泛化

Qi Zhang, Zhaopeng Feng, Xiaonan Shi, Xiaomeng Hu, Chu Liu, Pengjun Xie, Xiaobin Wang, Jieping Ye, Bryan Hooi, Haobo Wang, Junbo Zhao

发表机构 * Zhejiang University(浙江大学) Tongyi Lab(通义实验室) National University of Singapore(新加坡国立大学)

AI总结 提出SkillComposer框架,通过创建、改进和合并三种可学习操作,使语言模型在推理时自我演化技能,支持离线、在线和混合部署模式,在多个基准上提升性能。

Comments Under Review

详情
AI中文摘要

智能体技能由指导智能体推理和行动的可重用策略组成,在推理时展现出提升模型能力的强大潜力。然而,当前的技能构建方法将问题视为一次性提取,忽略了一个基本矛盾:针对特定任务的技能难以迁移,而抽象化的技能往往提供不足的指导。我们将这种脆弱性归因于缺乏明确的技能特化和泛化机制。为解决这一问题,我们引入了SkillComposer框架,该框架将技能构建分解为三种可学习操作:创建、改进和合并。通过系统性的拒绝采样方案进行训练,SkillComposer使语言模型能够在推理时自我演化技能,并支持三种部署模式:离线构建通用库、在线进行任务特定优化以及混合模式结合两者。在$τ^2$-Bench、LiveCodeBench v6和AppWorld上的综合实验表明,SkillComposer持续优于基线方法。我们的SkillComposer-4B将27B执行器在智能体任务上提升了最多+4.5,在代码任务上提升了最多+3.4,同时泛化到训练中未见过的领域和任务类型。分析表明,合并和改进操作处理正交的质量维度,且技能组合是一种可迁移的元能力,为技能增强推理提供了实用方案。

英文摘要

Agent skills, which consist of reusable strategies that guide agent reasoning and action, have shown strong potential for improving model capability at inference time. However, current skill construction methods treat the problem as one-shot extraction, overlooking a fundamental tension: a skill tailored to the specific task fails to transfer, while the abstracted skill often provides insufficient guidance. We attribute this fragility to the absence of explicit mechanisms for skill specification and generalization. To address this gap, we introduce SkillComposer, a framework that decomposes skill construction into three learnable operations: create, improve, and merge. Trained via systematic rejection sampling recipe, SkillComposer enables language models to self-evolve skills at inference time and supports three deployment modes: offline for building generalized libraries, online for task-specific refinement, and hybrid for combining both. Comprehensive experiments on $τ^2$-Bench, LiveCodeBench v6, and AppWorld show that SkillComposer consistently outperforms baselines. Our SkillComposer-4B improves a 27B executor by up to +4.5 on agent tasks and +3.4 on code tasks, while generalizing across domains and task types unseen during training. Analysis reveals that merge and improve address orthogonal quality dimensions and that skill composition is a transferable meta-ability, providing a practical recipe for skill-augmented inference.

2606.06058 2026-06-05 cs.LG cs.AI cs.CL 版本更新

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

MDP-GRPO:面向多约束指令跟随的稳定化组相对策略优化

Mohammad Mahdi Salmani-Zarchi, Zahra Rahimi, Heshaam Faili, Mohammad Javad Dousti

发表机构 * Department of Electrical and Computer Engineering, College of Engineering, University of Tehran(德黑兰大学电气与计算机工程系,工程学院) Department of Statistics, Mathematics and Computer Science, Allameh Tabataba’i University(塔巴蒂大学统计、数学与计算机科学系)

AI总结 针对标准GRPO在离散低分散奖励下的不稳定性,提出MDP-GRPO,通过多温度采样、双锚优势、前景理论整形和非对称KL正则化,在FollowBench等数据集上提升严格约束满足率最高5.0%。

Comments Accepted to ACL 2026 Main Conference. 14 pages, 9 figures

详情
AI中文摘要

可验证奖励的强化学习非常适合多约束指令跟随,但标准组相对策略优化(GRPO)在离散、低分散奖励下变得不稳定,此时组内奖励分布常常同质。我们识别并形式化了在此场景下z-score组归一化的三种病理:低方差放大、均值中心盲视和零方差崩溃。为解决这些问题,我们提出MDP-GRPO,通过以下方式稳定学习:(1)多温度采样以增加奖励分散度,(2)双锚优势以恢复同质组中的梯度并阻止均值中心盲视,(3)基于Kahneman和Tversky理论的前景理论整形以限制更新并惩罚违规,以及(4)非对称KL正则化。在FollowBench、IFEval和一个精心策划的多约束数据集上评估,MDP-GRPO优于标准GRPO,在Llama-3.2-3B上将严格约束满足率提高了最多5.0%。我们的方法还能够在保持MMLU和ARC上通用能力的同时,实现小批量大小的稳定收敛。

英文摘要

Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky's theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.

2606.06047 2026-06-05 cs.CL 版本更新

Automatic Labelling of Speech Translation Errors

语音翻译错误的自动标注

Dominik Macháček, Maike Züfle, Ondrej Klejch

发表机构 * Charles University(查尔斯大学) University of Edinburgh(爱丁堡大学) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 针对语音翻译缺乏置信度评估方法的问题,提出STEL标注协议,通过文本和多模态系统分析,发现直接语音处理对任务必要且与文本系统互补。

详情
AI中文摘要

语音翻译中的错误会降低语音翻译(ST)系统的可信度,并可能产生严重后果。然而,目前尚无评估语音翻译置信度和质量估计的成熟方法。为启动这一方向的进展,我们提出了语音翻译错误标注(STEL)。我们创建了一个标注协议、一个小型真实的端到端评估数据集,并分析了现有纯文本和语音处理系统如何执行STEL任务。我们的结果表明,纯文本XCOMET和多模态LLM Qwen2.5-Omni能够以大约人类一半的精度执行STEL任务。我们还发现,直接语音处理对于STEL任务是必要的,并且当前的纯文本和语音处理系统在标注ST中的翻译错误与语音处理错误方面具有互补性。

英文摘要

Errors in speech translations reduce trustworthiness of Speech Translation (ST) systems and can have serious consequences. Yet currently there is no established methodology for evaluating confidence and quality estimation of speech translations. To initiate progress in this direction, we propose Speech Translation Error Labelling (STEL). We create an annotation protocol, a small authentic end-to-end evaluation dataset, and we analyse how existing text-only and speech-processing systems perform the STEL task. Our results show that text-only XCOMET and multimodal LLM Qwen2.5-Omni are able to perform the STEL task in roughly half the precision of humans. We also find that direct speech processing is necessary for the STEL task, and that the current text-only and speech-processing systems are complementary in labelling translation-only vs. speech-processing errors in ST.

2606.06044 2026-06-05 cs.CL 版本更新

IA-RAG: Interval-Algebra-Driven Temporal Reasoning for Dynamic Knowledge Retrieval

IA-RAG:基于区间代数的动态知识检索时间推理

Xiaoman Wang, Yaoze Zhang, Wenzhuo Fan, Hongwei Zhang, Ding Wang, Guohang Yan, Song Mao, Botian Shi, Yunshi Lan, Pinlong Cai

发表机构 * East China Normal University(华东师范大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Shanghai for Science and Technology(上海科技大学) Harbin Engineering University(哈尔滨工程大学)

AI总结 提出IA-RAG框架,通过区间代数建模时间约束,实现层次化时间检索与推理,在复杂时间问答任务上表现优异。

Comments 22 pages, 10 figures, 13 tables. Code available at https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA

详情
AI中文摘要

检索增强生成(RAG)在利用外部知识增强大语言模型(LLMs)方面表现出强大的有效性。然而,现有的RAG和Graph RAG框架大多将知识视为静态,或仅将时间与粗粒度的时间戳或元数据关联,未能捕捉丰富的时间结构,如持续时间、重叠和包含关系。我们提出IA-RAG,一种层次化时间RAG框架,将知识建模为时间区间,并在形式化时间约束下进行检索。IA-RAG将事实表示为区间事件单元(IEUs),并将其组织成层次化的主题森林,其中时间依赖关系由Allen的区间代数控制。为处理不完整或不确定的时间边界,IA-RAG进一步引入子图时间收紧机制,通过连接事件子图中的逻辑约束来细化模糊区间。此外,IA-RAG通过区间代数引导的遍历支持隐式时间语义检索。在多个时间问答基准(包括TimeQA、TempReason和ComplexTR)上的实验表明,IA-RAG在时间检索和推理性能上表现优异,尤其是在复杂的组合时间推理任务上。我们的代码已发布在https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA。

英文摘要

Retrieval-Augmented Generation (RAG) has shown strong effectiveness in grounding Large Language Models (LLMs) with external knowledge. However, existing RAG and Graph RAG frameworks largely treat knowledge as static or associate time with coarse-grained timestamps or metadata, failing to capture rich temporal structures such as duration, overlap, and containment. We propose IA-RAG, a hierarchical temporal RAG framework that models knowledge as time intervals and performs retrieval under formal temporal constraints. IA-RAG represents facts as Interval Event Units (IEUs) and organizes them into a hierarchical Thematic Forest, where temporal dependencies are governed by Allen's Interval Algebra. To handle incomplete or uncertain temporal boundaries, IA-RAG further introduces a Sub-graph Time Tightening mechanism that refines fuzzy intervals through logical constraints within connected event subgraphs. In addition, IA-RAG supports implicit temporal semantic retrieval through interval-algebra-guided traversal. Experiments on multiple temporal question answering benchmarks, including TimeQA, TempReason, and ComplexTR, demonstrate that IA-RAG achieves strong temporal retrieval and reasoning performance, particularly on complex compositional temporal reasoning tasks. Our code is released at https://github.com/xiaoAugenstern/LogicalRAG_TemporalQA.

2606.06038 2026-06-05 cs.CL 版本更新

English-to-Prakrit Machine Translation via Multilingual Transfer Learning

英语到普拉克里特语的机器翻译:基于多语言迁移学习

Om Choksi, Smit Kareliya, Shrikant Malviya, Pruthwik Mishra

发表机构 * Sardar Vallabhbhai National Institute of Technology(萨达尔·瓦拉布尔·尼西特国家理工学院)

AI总结 针对低资源目标语言普拉克里特语,通过将普拉克里特语映射到印地语标签并利用多语言模型,在少量平行语料上实现可行的机器翻译,揭示了脚本兼容的语言路由对未支持古典语言的迁移潜力及数据稀缺和方言不匹配的限制。

详情
AI中文摘要

我们在低资源环境下研究英语到普拉克里特语的机器翻译,其中目标语言不受IndicTrans2支持。我们通过将普拉克里特语映射到印地语标签(hin_Deva)来适配多语言模型,而不修改分词器、词汇表或架构。使用包含1,474对马哈拉施特拉普拉克里特语平行语料库,并在20样本的阿尔达摩揭陀语测试集上进行评估,我们报告了相对于未调优基线的语料库BLEU改进。结果表明,脚本兼容的语言路由可以实现对未支持的古典语言的可行迁移,同时突显了数据稀缺和方言不匹配带来的局限性。我们的代码和训练模型已公开发布,供进一步探索:https://github.com/D3v1s0m/indictrans2-prakrit-mt。

英文摘要

We study English-to-Prakrit machine translation in a low-resource setting where the target language is unsupported by IndicTrans2. We adapt the multilingual model by mapping Prakrit to the Hindi language tag (hin_Deva) without modifying the tokenizer, vocabulary, or architecture. Using a 1,474-pair Maharashtri Prakrit parallel corpus and evaluation on a 20-sample Ardhamagadhi test set, we report corpus BLEU improvements over an untuned baseline. The results indicate that script-compatible language routing can enable feasible transfer to unsupported classical languages, while highlighting limitations due to data scarcity and dialect mismatch. Our code and trained models are released to the public for further exploration https://github.com/D3v1s0m/indictrans2-prakrit-mt.

2606.06031 2026-06-05 cs.CL 版本更新

NAVIRA: Decoupled Stochastic Remasking for Masked Diffusion Language Models

NAVIRA: 解耦随机重掩码用于掩码扩散语言模型

Andrey Fomenko, Maksim Kryzhanovskiy, Svetlana Glazyrina, Roman Ischenko

发表机构 * Lomonosov Moscow State University(莫斯科罗蒙诺索夫莫斯科国立大学) Institute for Artificial Intelligence(人工智能研究所)

AI总结 针对掩码扩散语言模型并行生成中的上下文污染问题,提出NAVIRA解码策略,通过解耦质量检测与重生成、随机采样重掩码位置,提升流畅性和多样性。

详情
AI中文摘要

掩码扩散语言模型通过并行迭代地解除掩码生成文本,但这种速度带来了修正问题:同一时间步生成的标记是从边缘分布预测的,早期局部依赖错误可能污染后续上下文。PRISM通过学习标记级质量分数并重掩码不可靠标记来解决此问题,但其推理规则是耦合的:同一前向传播既检测低质量标记又计算其替换的对数几率,因此错误标记仍会条件化再生。我们提出NAVIRA,一种推理时解码策略,将这两个操作分离并随机采样重掩码位置。第一次前向传播对标记评分;选中的标记被掩码;第二次前向传播从清理后的上下文再生。温度控制的重掩码减少对相同位置的重复修正,并在流畅性与多样性之间取得平衡。在170M掩码扩散语言模型的受控实验中,解耦提高了流畅性,而调度的随机重掩码保持了熵,并在更大的前向传播预算下获得了更强的LLM评判分数。这些结果表明,重掩码策略(而不仅仅是学习到的质量信号)对于可靠的掩码扩散文本生成至关重要。

英文摘要

Masked diffusion language models generate text by iteratively unmasking many tokens in parallel, but this speed comes with a correction problem: tokens generated in the same step are predicted from marginal distributions, and early local dependency errors can later contaminate the context. PRISM addresses this by learning token-level quality scores and remasking unreliable tokens, but its inference rule is coupled: the same forward pass both detects low-quality tokens and computes logits for their replacements, so the erroneous tokens still condition regeneration. We propose NAVIRA, an inference-time decoding policy that separates these two operations and samples remasking positions stochastically. A first forward pass scores tokens; selected tokens are masked; a second forward pass regenerates from the cleaned context. Temperature-controlled remasking reduces repeated correction of the same positions and balances fluency against diversity. In controlled experiments with a 170M masked diffusion language model, decoupling improves fluency, while scheduled stochastic remasking preserves entropy and achieves stronger LLM-judge scores under larger forward-pass budgets. These results show that remasking policy, not only the learned quality signal, is central to reliable masked-diffusion text generation.

2606.06027 2026-06-05 cs.AI cs.CL cs.LG cs.SI 版本更新

RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit

RedditPersona: 一个用于从Reddit进行社区条件化LLM适配的模块化框架

Amirhossein Ghaffari, Ali Goodarzi, Huong Nguyen, Simo Hosio, Lauri Lovén, Ekaterina Gilman

发表机构 * Future Computing Group University of Oulu(未来计算组奥卢大学) Centre for Applied Computing University of Oulu(应用计算中心奥卢大学)

AI总结 提出RedditPersona模块化框架,通过五种分组策略和QLoRA训练参数高效适配器,在112个Reddit子版块上评估社区条件化语言模型,发现适配器的行为可识别性与策略内在一致性相关,且所有策略在可识别性和分布相似性之间存在一致权衡。

详情
AI中文摘要

社区条件化的语言模型适配需要在每个研究中独立做出关于数据收集、社区定义和评估的选择,这使得比较假设或重用工件变得困难。我们提出了RedditPersona,一个模块化框架,标准化了这些选择:它收集Reddit帖子和评论,分析活跃用户,根据五种分组策略(基于子版块、图结构、语义、混合和基于交互)对用户进行划分,通过QLoRA为每种策略训练参数高效的适配器,并在一个涵盖流畅性、忠实度、分布对齐和社区可识别性的共享度量套件下进行评估。应用于城市福祉领域的112个子版块(301,429个用户档案,超过1600万条评论),我们发现适配器的行为可识别性追踪了每种策略与子版块基线的内在一致性,并且所有五种策略在可识别性和与真实文本的分布相似性之间存在一致的权衡。代码和配置文件可在以下网址获取:https://github.com/Ahghaffari/redditpersona。

英文摘要

Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.

2606.06025 2026-06-05 cs.CL cs.AI 版本更新

EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation

EGTR-Review: 基于多智能体教师蒸馏的高效证据支撑科学同行评审生成

Xinpeng Qiu, Wang Yihu, Zhifeng Liu, Xiaochen Wang, Jimin Wang

发表机构 * Department of Information Management, Peking University(北京大学信息管理系) PKU-WUHAN Institute for Artificial Intelligence, Peking University(北京大学武汉人工智能研究院)

AI总结 提出EGTR-Review框架,通过多智能体教师蒸馏和证据加权目标,实现轻量级学生模型的高质量、可溯源同行评审生成。

详情
AI中文摘要

科学同行评审生成因能减少评审负担并提供及时反馈而受到越来越多的关注。然而,现有基于大型语言模型(LLM)的方法往往产生缺乏证据支持和弱源可追溯性的通用评论,而复杂的多智能体系统则导致高推理成本。为应对这些挑战,我们提出EGTR-Review,一种通过多智能体教师蒸馏实现的证据支撑且可追溯的评审生成框架。EGTR-Review首先构建一个多智能体教师,执行结构感知的论文分解、关键元素提取、外部学术证据检索、证据状态标注、验证推理和评审合成。然后,通过任务前缀驱动的多任务学习,将中间推理轨迹和最终评审评论蒸馏到轻量级学生模型中。证据加权目标进一步减少弱、缺失或不可验证监督的影响。在公共同行评审数据集上的实验表明,EGTR-Review(学生)在自动指标、LLM作为评判者评估和人工评估中均优于强提示基、微调基和结构化/智能体基线,同时保持强事实基础和源可追溯性,且显著降低令牌消耗和推理时间。我们的代码、提示、配置和样本数据可在GitHub上获取。

英文摘要

Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. Our code, prompts, configurations, and sample data are available on GitHub.

2606.06022 2026-06-05 cs.CL 版本更新

Contextualized Prompting For Stance Detection On Social Media

社交媒体立场检测的上下文化提示

Tilman Beck, Shakib Yazdani, Simon Kruschinski, Marcus Maurer, Iryna Gurevych

发表机构 * Institute of Intensive Care, University Hospital of Zurich and University of Zurich(重症医学研究所,苏黎世大学医院和苏黎世大学) Institute of Computer Science, University of Goettingen(计算机科学研究所,哥廷根大学) GESIS Leibniz Institute for the Social Sciences(社会科学研究莱比锡研究所) Institut für Publizistik, Johannes Gutenberg-University Mainz(主笔研究所,美因茨约翰· Gutenberg 大学) Ubiquitous Knowledge Processing Lab, Technical University of Darmstadt(无处不在知识处理实验室,达姆施塔特技术大学)

AI总结 通过系统实验,研究在零样本提示中融入真实世界、推导和LLM生成的上下文特征对Twitter立场检测的影响,发现LLM生成的目标描述能持续提升准确率,而其他用户元数据效果不一。

详情
AI中文摘要

社交媒体上的立场检测由于语言简短、嘈杂且依赖上下文而具有挑战性。虽然大型语言模型(LLM)表现出零样本泛化能力,但它们通常在没有上下文信息的情况下被提示,这限制了它们解释模糊帖子的能力。在这项工作中,我们系统地研究了将真实世界(例如,用户传记)、推导(例如,政党)和LLM生成的(例如,目标描述)上下文特征融入Twitter立场检测的零样本提示中的影响。我们的评估涵盖四个基准数据集,包括一个新的高质量德语Twitter立场数据集。在多个LLM中,我们发现整合上下文信息能提高性能,但仅在特定条件下。LLM生成的目标描述持续提升准确性,而其他用户元数据则产生混合甚至有害的效果。值得注意的是,我们表明包含同一用户的其他推文(在监督学习中通常有益)可能会因输入噪声而损害性能。我们的定性分析揭示,LLM难以区分任务特定的有用信息和无关上下文。我们的发现强调了在嘈杂的真实世界环境中使用上下文信息进行提示的前景和挑战。我们在\href{https://github.com/tilmanbeck/stance-context-twitter}{此页面}发布代码和数据。

英文摘要

Stance detection on social media is challenging due to short, noisy, and context-dependent language. While large language models (LLMs) show zero-shot generalization, they are typically prompted without contextual information, which limits their ability to interpret ambiguous posts. In this work, we systematically investigate the impact of incorporating real-world (e.g., user biographies), derived (e.g., political party), and LLM-generated (e.g., target descriptions) contextual features into zero-shot prompting for stance detection on Twitter. Our evaluation spans four benchmark datasets, including a new high-quality German Twitter stance dataset. Across multiple LLMs, we find that integrating contextual information improves performance, but only under specific conditions. LLM-generated target descriptions consistently enhance accuracy, while other user metadata has mixed or even detrimental effects. Notably, we show that the inclusion of other tweets by the same user, often beneficial in supervised learning, can impair performance due to input noise. Our qualitative analysis reveals that LLMs struggle to distinguish task-specific useful information from irrelevant context. Our findings highlight both the promise and challenges of prompting with context information in noisy real-world settings. We publish code and data at this \href{https://github.com/tilmanbeck/stance-context-twitter}{page}.

2606.06004 2026-06-05 cs.CL 版本更新

The Generator-Eraser Paradox: Community Guidelines for Responsible LLM-Assisted Dialect Resource Creation

生成器-擦除器悖论:负责任的大语言模型辅助方言资源创建的社区指南

Wajdi Zaghouani

发表机构 * Northwestern University in Qatar(卡塔尔西北大学)

AI总结 本文提出生成器-擦除器悖论理论框架,推导出12条社区指南,并通过阿拉伯方言案例展示如何在大语言模型辅助方言资源创建中平衡效率与语言多样性保护。

详情
Journal ref
Proceedings of the Workshop on Dialects in NLP - A Resource Perspective (DialRes) @ LREC 2026
AI中文摘要

方言资源在科学描述、文化保护和计算基础设施的交汇处占据独特位置。大语言模型通过检索辅助起草、语料库导航、元数据丰富和标注工作流支持,为加速方言资源开发提供了强大能力。然而,同一系统也带来重大风险:它们可能通过偏爱声望变体、统一正字法以及产生随时间减少语言多样性的合成反馈循环,导致方言擦除。这些风险对于具有双言现象、有限书面标准化或边缘化说话者社区的语言变体尤为严重。本文做出三项贡献。首先,我们整合变异社会语言学和语料库语言学的见解,将生成器-擦除器悖论形式化为一个理论框架,以理解大语言模型辅助方言工作的双重性质。其次,我们推导出12条社区指南,将该框架转化为方言资源创建和记录的可实施设计要求。第三,我们提供阿拉伯方言的深入案例研究,包括对广泛使用资源的结构化比较,以展示这些指南如何解决语言特定挑战,包括双言现象、正字法变异和社区治理。贡献是概念性和操作性的,而非实验性的,目标是使跨语言的方言社区和资源构建者能够采用大语言模型,而不牺牲真实性、变体或主权。

英文摘要

Dialect resources occupy a unique position at the intersection of scientific description, cultural preservation, and computational infrastructure. Large language models offer powerful capabilities for accelerating dialect resource development through retrieval-grounded drafting, corpus navigation, metadata enrichment, and annotation workflow support. However, the same systems pose substantial risks: they can contribute to dialect erasure by privileging prestige varieties, homogenizing orthography, and enabling synthetic feedback loops that reduce linguistic diversity over time. These risks are particularly acute for language varieties characterized by diglossia, limited written standardization, or marginalized speaker communities. This paper makes three contributions. First, we integrate insights from variationist sociolinguistics and corpus linguistics to formalize the generator-eraser paradox as a theoretical framework for understanding the dual nature of LLM-assisted dialect work. Second, we derive 12 community guidelines that operationalize this framework into implementable design requirements for dialect resource creation and documentation. Third, we provide an in-depth case study of Arabic dialects, including a structured comparison of widely used resources, to demonstrate how these guidelines address language-specific challenges including diglossia, orthographic variability, and community governance. The contribution is conceptual and operational rather than experimental, with the goal of enabling dialect communities and resource builders across languages to adopt LLMs without sacrificing authenticity, variation, or sovereignty.

2606.05988 2026-06-05 cs.LG cs.CL 版本更新

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

压缩-蒸馏:面向高效知识蒸馏的推理轨迹压缩

Maxime Griot, Paul Steven Scotti, Tanishq Mathew Abraham

发表机构 * Université catholique de Louvain(列日天主教大学) Sophont Inc(Sophont公司)

AI总结 本文提出在知识蒸馏前对推理轨迹进行事后压缩,以降低训练成本并缩短推理输出,实验表明压缩在准确率与效率间存在权衡。

详情
AI中文摘要

推理模型产生长的思维链轨迹,这些轨迹蒸馏成本高且鼓励学生输出冗长内容。我们研究在知识蒸馏前对这些轨迹进行事后压缩。两个教师模型,Qwen3.5-397B-A17B 和 gpt-oss-120B,各生成约 283k 条正确轨迹;两个指令调优模型将其压缩至原始字符长度的 8.6-21.0%。在包含 48 次运行的主网格和七次 Qwen 教师截断消融实验中,压缩轨迹将训练 token 减少至原始的 12-30%,训练速度提升 2.0-7.6 倍,推理输出缩短 3-19 倍,在更短的 gpt-oss 教师下减少幅度较小。然而,原始轨迹在每个规模下和两位教师上都保持最高的下游准确率。一项长度匹配的原始轨迹截断消融实验表明,压缩并非仅仅受益于更小的 token 预算:模型压缩的轨迹通常优于或匹配朴素截断,尤其是对于较小的学生模型,同时保持更短的推理输出。总体而言,推理轨迹压缩提供了准确率与效率之间的权衡,而非免费改进:学生模型保留了原始轨迹高达 96% 的准确率,同时获得了高达 18 倍的每 token 效率提升;在 0.8B 规模下,使用 LoRA 压缩轨迹缩小了原始与压缩之间的差距,但未超过原始轨迹。

英文摘要

Reasoning models produce long chain-of-thought traces that are costly to distill and encourage verbose student outputs. We study post-hoc compression of such traces before knowledge distillation. Two teachers, Qwen3.5-397B-A17B and gpt-oss-120B, generate about 283k correct traces each; two instruction-tuned models then compress them to 8.6-21.0% of their original character length. Across a 48-run main grid plus seven Qwen-teacher truncation ablations, compressed traces reduce training tokens to 12-30% of raw, speed up training by 2.0-7.6x, and shorten inference outputs by 3-19x with smaller reductions under the shorter gpt-oss teacher. However, raw traces retain the highest downstream accuracy at every scale and for both teachers. A length-matched raw-trace truncation ablation shows that compression is not merely benefiting from a smaller token budget: model-compressed traces usually beat or match naive truncation, especially for smaller students, while maintaining shorter inference outputs. Overall, reasoning-trace compression offers an accuracy-efficiency trade-off rather than a free improvement: students retain up to 96% of raw-trace accuracy while gaining up to 18x higher per-token efficiency, and at the 0.8B scale under LoRA compressed traces narrow the raw-vs-compressed gap but do not exceed raw.

2606.05985 2026-06-05 cs.CL cs.CY 版本更新

Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems

超越对齐:多元文化智能体系统中的价值多样性作为集体属性

Shaoyang Xu, Jingshen Zhang, Long P. Hoang, Jinyuan Li, Wenxuan Zhang

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) Washington University in St. Louis(华盛顿大学圣路易斯分校)

AI总结 针对多元文化多智能体系统,提出以价值多样性作为系统级评估轴,通过文化条件化智能体在共享价值调查中的响应差异度量,发现多样性几乎与对齐无关,且当前系统远低于人类社会,混合骨干系统缩小但未消除差距,社会互动进一步侵蚀多样性。

详情
AI中文摘要

多元文化多智能体系统越来越多地部署在全球多样化的环境中,其中不同的智能体基于不同的文化背景。现有的文化评估侧重于价值对齐:单个智能体与目标文化的匹配程度。然而,对齐是每个智能体的属性,无法揭示系统作为一个整体是否保留了其旨在代表的文化多元性。我们提出价值多样性作为多元文化智能体系统的系统级评估轴,通过文化条件化智能体在共享价值调查上的响应差异来定义。利用世界价值观调查,我们评估了19种文化和18个骨干模型在广泛的系统配置下的表现。我们发现多样性在很大程度上与对齐无关,表明两者捕捉了互补的系统属性,并且当前的多元文化智能体系统在价值多样性上远低于人类社会。混合骨干系统缩小了这一差距但未消除,且该差距在文化组成和智能体规模上持续存在。社会互动进一步通过驱使智能体达成共识而侵蚀多样性,一个参与式预算案例研究表明,这种同质化缩小了集体决策的广度。总之,我们的结果将价值多样性确立为多元文化多智能体系统的一个独特评估轴,并揭示了当前基于LLM的社会中持续存在的同质化趋势。我们的代码和数据公开在 https://github.com/iNLP-Lab/MultiAgent-Diversity。

英文摘要

Multicultural multi-agent systems are increasingly deployed in globally diverse settings, where different agents are grounded in different cultural backgrounds. Existing cultural evaluation focuses on value alignment: how closely a single agent matches a target culture. Yet alignment is a per-agent property and cannot reveal whether a system, taken as a whole, preserves the cultural plurality it is meant to represent. We propose value diversity as a system-level evaluation axis for multicultural agent systems, defined through the dissimilarity between culturally conditioned agents' responses on a shared value survey. Using the World Values Survey, we evaluate 19 cultures and 18 backbone models across a wide range of system configurations. We find that diversity is largely uncorrelated with alignment, indicating that the two capture complementary system properties, and that current multicultural agent systems fall substantially below human societies in value diversity. Mixed-backbone systems narrow this gap but do not close it, and the gap persists across culture compositions and agent scales. Social interaction further erodes diversity by driving agents toward consensus, and a participatory budgeting case study shows that this homogenization narrows the breadth of collective decision-making. Together, our results establish value diversity as a distinct evaluation axis for multicultural multi-agent systems and reveal a persistent homogenization tendency in current LLM-based societies. Our code and data are publicly available at https://github.com/iNLP-Lab/MultiAgent-Diversity.

2606.05983 2026-06-05 cs.AI cs.CL 版本更新

Framing, Judging, Steering: An Assessable Competency Model for Teach-ing Students to Reason With Generative AI

框架构建、判断、引导:一种可评估的能力模型,用于教授学生与生成式AI进行推理

Alexander Apartsin, Yehudit Aperstein

发表机构 * Holon Institute of Technology(霍洛恩技术学院) Afeka College of Engineering(阿菲卡工程学院)

AI总结 提出CoRe-3能力模型,将有效使用AI分解为框架构建、判断和引导三种可评估技能,并通过模拟实验验证其区分效度。

Comments 18 pages, 4 pages

详情
AI中文摘要

生成式AI使答案变得容易而理解变得困难,不加批判的使用会导致认知卸载。学校仍然衡量无辅助的表现,但真正的任务是用AI产生好的工作:构建一个定义不明确的任务,判断输出,并引导模型获得更好的结果。这种能力很少被单独评估;即使被衡量,它也坍缩为一个单一的“提示”分数,无法诊断AI使用成功或失败的原因。我们提出CoRe-3(协同推理),一个能力模型,将生产性AI使用分解为三种可评估的技能,我们缩写为FJS:框架构建(在调用AI之前指定一个定义不明确的任务)、判断(评估输出中的错误和未声明的假设)和引导(迭代地重新引导模型)。其显著主张是将生成前的框架构建与生成后的引导分开,判断作为两者之间的门控。我们将这些技能建立在理论基础上,提出五个可检验的命题,并在CoReasoningLab中实例化它们,这是一个开放平台,呈现有缺陷的AI输出并独立评分。在模拟学习者(由不同模型生成和评分)上,这些技能是分离的:每个技能跟踪其自身的操纵能力,而其他技能保持不变,并且当一个能力在所有三个技能中共享时(收敛和区分效度),分数变得相关,评分后端来自两个提供商。接下来是人类评分者一致性和结果;我们发布工具、数据和协议。

英文摘要

Generative AI makes answers easy and understanding hard, and uncritical use invites cognitive offloading. Schools still measure unaided performance, yet the real task is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right; where measured, it collapses into one "prompting" score that cannot diagnose why AI use succeeds or fails. We propose CoRe-3 (Co-Reasoning), a competency model factoring productive AI use into three assessable skills we abbreviate FJS: Framing (specifying an ill-defined task before invoking AI), Judging (evaluating output for errors and unstated assumptions), and Steering (iteratively redirecting the model). Its distinguishing claim is the separation of pre-generation Framing from post-generation Steering, with Judging as the gate between. We ground the skills in theory, state five testable propositions, and instantiate them in CoReasoningLab, an open platform that presents flawed AI output and scores them independently. Over simulated learners (generated and graded by different models), the skills dissociate: each tracks its own manipulated competence while staying flat in the others, and grades become correlated when one competence is shared across all three (convergent and discriminant validity), across grader backends from two providers. Human-rater agreement and outcomes are next; we release the instrument, data, and protocol.

2606.05976 2026-06-05 cs.AI cs.CL 版本更新

The Self-Correction Illusion: LLMs Correct Others but Not Themselves

自我修正错觉:LLM 纠正他人但不纠正自己

Kuan-Yen Chen, Fang-Yi Su, Jung-Hsien Chiang

发表机构 * National Taiwan University(国立台湾大学)

AI总结 本文通过保持错误声明字节一致仅改变角色标签,发现 LLM 无法自我修正并非能力缺陷,而是聊天模板角色标签的人为产物,并提出无需训练或模型修改的提示结构干预方法。

详情
AI中文摘要

近期研究表明,LLM 智能体难以纠正自身推理轨迹中的错误,但当相同声明出现在外部来源时,其修正率显著更高。我们探究这种不对称性反映的是能力缺陷还是角色标签的人为产物:智能体纠正错误声明的意愿是否因果地依赖于承载该声明的聊天模板角色,而非声明内容本身?我们的实验设置在所有条件下保持错误声明的字节完全一致(SHA-256 验证),仅改变其包装角色:智能体自身的 \role{<thought>}、\role{user} 消息、\role{tool} 响应或 \role{system <memory>} 块。在覆盖七个模型家族和三个领域的 13 个模型-领域单元(每个单元 n=30 对任务)中,将声明从 \role{<thought>} 重新标记为外部角色后,显式修正率提升了 23 到 93 个百分点,其中 13 个单元中有 10 个达到 p<0.001。进一步实验证实该效应是不对称的、机制上可分解的,并且跨领域稳健。自我修正失败并非认知缺陷,而是聊天模板的人为产物。我们利用这一人为产物设计了一种仅涉及提示结构、无需训练和模型修改的干预方法,其最强角色标签依赖于领域:在数学上 \role{<memory>} 占主导,而在逻辑推理上普通 \role{user} 消息占主导。

英文摘要

Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent's willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim's content? Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent's own \role{<thought>}, a \role{user} message, a \role{tool} response, or a \role{system <memory>} block. Across 13 model-domain cells covering seven model families and three domains ($n{=}30$ paired tasks per cell), relabeling the claim from \role{<thought>} to an external role lifts the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching $p{<}0.001$. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self-correct is not a cognitive deficit; it is a chat-template artifact. We exploit this artifact by designing a prompt-structure-only intervention that requires no training and no model modification, with its strongest role label being domain-dependent: \role{<memory>} dominates on math, while a plain \role{user} message dominates on logical deduction.

2606.05970 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries

测量基于LLM的结构化提取对临床出院小结中提示、模型和模式选择的敏感性

Martin Murin

发表机构 * DryLabz GmbH(DryLabz公司)

AI总结 本研究通过固定提取任务并逐一改变提示、模型和模式选择,测量了大型语言模型在临床文本结构化提取中输出对上游配置的敏感性,发现模式选择导致的差异集中在缺失与沉默的区分上,而模型选择在多类分类中主导提示措辞。

Comments 69 pages, 5 main figures, supplementary material included

详情
AI中文摘要

大型语言模型越来越多地用于从临床自由文本笔记中进行结构化提取,但其输出对上游配置选择的敏感性比在固定基准上的准确性更少被理解。本文通过固定提取任务并逐一改变一个选择,在没有人工标注真实值的情况下测量了这种敏感性。固定模式包括17个临床文档标志(三值:是/否/未记录)和47个标签词汇(用于主要入院原因)。表达该模式的三种提示变体分别在两个模型大小上对MIMIC-IV v3.1出院小结运行。跨提示一致性通过Cohen's kappa在ICD分层子集上测量。配对相同笔记比较隔离了模型选择的影响,事后将三值标志折叠为二值测试了模式对不一致的贡献。在三值标志上,两个模型达到相同的合并跨提示一致性(中位数kappa 0.69和0.68);较大的模型提高了某些字段的一致性并降低了其他字段的一致性,这是一种重新分布而非无效果。将模式折叠为二值消除了大部分跨提示不一致,将其定位在缺失与沉默的区分上,而非发现是否存在。在多类入院分类上,改变模型会重新分配近一半笔记的主导标签,而改变提示措辞则重新分配约八分之一的笔记,并且较大的模型在残余的通用类别上分配的权重少得多(44%到26%)。这些模式表明,模式施加的不一致集中在缺失与沉默轴上,而模型在多类分类上主导提示措辞,这是通过一种可重复的方法在人群规模部署中审计提取可重复性而识别的。

英文摘要

Large language models are increasingly used for structured extraction from clinical free-text notes, but the sensitivity of their output to upstream configuration choices is less understood than their accuracy on fixed benchmarks. This work measures that sensitivity without human-annotated ground truth, by holding the extraction task fixed and varying one choice at a time. The fixed schema comprises 17 clinical documentation flags on a three-way yes/no/not_documented value set and a 47-tag vocabulary for the primary admission reason. Three prompt variants expressing this schema were each run at two model sizes on MIMIC-IV v3.1 discharge summaries. Cross-prompt agreement was measured by Cohen's kappa on ICD-stratified subsets. A paired same-note comparison isolated the effect of model choice, and a post-hoc collapse of the three-way flags to binary tested the schema's contribution to disagreement. On the three-way flags, the two models reach the same pooled cross-prompt agreement (median kappa 0.69 and 0.68); the larger model raises agreement on some fields and lowers it on others, a redistribution rather than the absence of an effect. Collapsing the schema to binary dissolves most of the cross-prompt disagreement, locating it on the absence-versus-silence distinction rather than on whether the finding is present. On the multi-class admission categorization, changing the model reassigns the dominant tag on close to half of all notes while changing the prompt phrasing reassigns it on roughly one in eight, and the larger model places far less mass on residual catch-all categories (44% to 26%). These patterns indicate a schema-imposed source of disagreement concentrated on the absence-versus-silence axis and a dominance of model over prompt phrasing on multi-class categorization, identified by a reusable methodology for auditing extraction reproducibility on a population-scale deployment.

2606.05937 2026-06-05 cs.CL 版本更新

Large Language Models are Perplexed by some Political Parties

大型语言模型对某些政党感到困惑

Paul Lerner, François Yvon

发表机构 * Sorbonne Université, CNRS, ISIR(索邦大学、国家科学研究中心、信息研究所)

AI总结 通过困惑度评估,发现大型语言模型对极右翼和民族主义政党文本的困惑度高于社会民主党,且该偏差源于预训练阶段,指令微调影响甚微。

详情
AI中文摘要

大型语言模型(LLMs)的使用日益广泛,包括在政治应用中,但其政治公平性研究甚少。我们使用困惑度进行评估,认为一个公平的模型应对所有政治群体赋予相同的概率。然而,我们在涵盖37种语言的10个LLMs和三个数据集中发现,LLMs对极右翼和民族主义政党的文本比对社会民主党的文本更困惑。我们发现这与先前关于翻译公平性的研究一致,以至于困惑度与下游翻译指标相关。我们的方法适用于基础LLMs及其指令微调版本,并且发现两者高度相关,表明LLMs的政治公平性源于其预训练,而指令微调几乎不影响它。

英文摘要

Large Language Models (LLMs) are increasingly used, including in political applications, but their political fairness has been little studied. We assess it using perplexity, posing that a fair model should give equal probability to all political groups. However, we find, across ten LLMs and three datasets covering 37 languages, that LLMs are more perplexed by the texts of far right and nationalist parties than of social-democratic parties. We find this to be consistent with previous work on translation fairness, to the point that perplexity correlates with downstream translation metrics. Our method is applicable to both base LLMs as well as their instruction-tuned counterpart, and we find that both are highly correlated, suggesting that the political fairness of LLMs stems from their pretraining, and is hardly affected by instruction-tuning.

2606.05936 2026-06-05 cs.CL 版本更新

Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails

语言模型中的认知不公正:预训练过滤器和护栏的审计

Marco Antonio Stranisci, A Pranav, Rossana Damiano, Christian Hardmeier, Anne Lauscher

发表机构 * University of Turin(都灵大学) IT University of Copenhagen(哥本哈根技术大学) Trustworthy AI Lab(可信人工智能实验室)

AI总结 通过审计预训练过滤器和推理时护栏,发现它们对边缘群体(如跨性别者、女性和中美洲人)的提及存在过度标记,导致认知抹除,而人类标注者会保留大部分被标记内容。

详情
AI中文摘要

现代语言模型依赖预训练过滤器从训练语料中移除不良内容,以及推理时护栏抑制部署期间的不良输出。在本文中,我们研究了这些过滤和审核决策如何产生认知抹除形式,并揭示了自动化系统之间以及这些系统与人类判断之间的紧张关系。我们审计了四个预训练过滤器和三个推理时护栏,针对包含性别和地域来源提及的Common Crawl句子,以及一个手动标注的500句子子集。我们的分析表明,过滤和护栏决策与基于黑名单的词汇线索强相关,同时经常未能标记包含私人信息或明确仇恨言论的内容。与此同时,边缘群体,特别是跨性别者、女性和中美洲人,在各个系统中被显著过度标记。相比之下,人类标注者会保留88.5%的过滤器标记内容和91.3%的护栏标记内容,通常能识别出当前系统未能捕捉到的、由内容移除紧张关系产生的表征性伤害。综合来看,我们的研究结果记录了一种认知抹除形式,其中对边缘群体的提及在预训练前被不成比例地移除,并在推理时再次被抑制。

英文摘要

Modern language models rely on pretraining filters to remove undesirable content from training corpora and inference-time guardrails to suppress undesirable outputs during deployment. In this paper, we examine how these filtering and moderation decisions produce forms of epistemic erasure and reveal tensions both across automated systems and between these systems and human judgment. We audit four pretraining filters and three inference-time guardrails on Common Crawl sentences containing gender and regional-origin mentions, together with a manually annotated subset of 500 sentences. Our analysis shows that filtering and guardrail decisions are strongly associated with blocklist-based lexical cues, while frequently failing to flag content containing private information or explicit hate speech. At the same time, marginalized groups, particularly transgender people, women, and Central Americans, are significantly over-flagged across systems. Human annotators, by contrast, would retain 88.5\% of filter-flagged and 91.3\% of guardrail-flagged content, often recognizing representational harms arising from tensions of content removal that current systems fail to capture. Taken together, our findings document a form of epistemic erasure in which mentions of marginalized groups are disproportionately removed before pretraining and additionally suppressed again at inference time.

2606.05931 2026-06-05 cs.CL cs.AI cs.CV cs.IR cs.LG cs.MM eess.AS 版本更新

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

多模态还是非多模态:通过主动模态检测的查询自适应音视频人物检索

Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

发表机构 * University of Cambridge(剑桥大学) Queen's University Belfast(贝尔法斯特女王大学) University of Surrey(萨里大学) Cisco(思科) Southwest Jiaotong University(西南交通大学) Teesside University(泰赛德大学)

AI总结 提出一种查询自适应框架,通过跨模态分数一致性检测主动模态,在BBC Rewind语料库上达到94.2%的P@1,优于单模态和固定融合方法。

Comments INTERSPEECH 2026

详情
AI中文摘要

当通过语音和面部从视频档案中检索一个人时,系统应该是多模态的吗?在实际的广播档案中,与精心策划的基准不同,目标可能只被听到但未被看到、只被看到但未被听到,或者两者兼有。融合来自缺失模态的分数会引入噪声,使精度低于最佳单模态系统。我们提出了一种查询自适应框架,通过跨模态分数一致性检测主动模态:当两种模态都活跃时,由一种模态检索的文件在另一种模态上也得分高;当一种模态缺失时,这种一致性被破坏。由这些跨模态特征驱动的分类器实现了89%的检测准确率。在BBC Rewind语料库(包含超过12,000个广播视频)上,自适应系统达到了94.2%的P@1,优于仅语音(82.9%)、仅面部(93.4%)和固定融合(90.0%),恢复了与具有真实模态标签的Oracle(96.6%)之间差距的64%。

英文摘要

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).

2606.05924 2026-06-05 cs.CL cs.AI 版本更新

Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach

更好的文学翻译:多维度数据生成与大语言模型训练方法

Zhihao Lin, Ziqi Zhu, Hao Huang, Guanghui Wang, Peiyang He

发表机构 * Amazon Web Services (AWS)(亚马逊网络服务(AWS)) Peking University(北京大学)

AI总结 提出多维度迭代优化框架,通过专门的大语言模型生成高质量翻译参考和偏好数据,结合监督微调和强化学习(GRPO)提升文学翻译质量,在MetaphorTrans英中文学翻译基准上达到与Claude Sonnet 4.5竞争的性能。

Comments Accepted by ACL 2026 Industry

详情
AI中文摘要

文学翻译因高质量标注数据的稀缺以及需要在表达流畅性与文学效果之间取得平衡而面临独特挑战。我们提出了一个多维度迭代优化框架,通过专门的大语言模型翻译器生成高质量的翻译参考和偏好数据,每个翻译器针对一个不同的质量维度。我们利用生成的数据进行监督微调和强化学习。实验表明,我们的生成参考在监督微调中比原始真实数据高出8.65个CEA100点。对于强化学习,我们发现直接偏好优化(DPO)在此设置下导致性能下降,而利用显式奖励模型进行组相对策略优化(GRPO)则额外提升了1.51个点。我们将此归因于两阶段训练的稳定性和GRPO的在线探索能力。我们的最终模型LitMT-8B和LitMT-14B在MetaphorTrans英中文学翻译基准上分别达到67.25和69.07个CEA100点,与Claude Sonnet 4.5的68.43点具有竞争力,并展现出对域外文学作品(如欧·亨利)的强泛化能力。

英文摘要

Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO's online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. Henry).

2606.05920 2026-06-05 cs.SE cs.CL 版本更新

Asuka-Bench: Benchmarking Code Agents on Underspecified User Intent and Multi-Round Refinement

Asuka-Bench: 针对未明确用户意图与多轮优化的代码智能体基准测试

Xin Wang, Liangtai Sun, Yaoming Zhu, Shuang Zhou, Jiaxing Liu, Fengjiao Chen, Lin Qiu, Xuezhi Cao, Xunliang Cai, Licheng Zhang, Zhendong Mao

发表机构 * University of Science and Technology of China(中国科学技术大学) Independent researchers(独立研究人员)

AI总结 提出Asuka-Bench基准,通过未明确用户意图与多轮优化闭环评估代码智能体,包含50个网页任务和784个评估标准,揭示模型间显著性能差异。

Comments under review

详情
AI中文摘要

现有的代码生成基准测试仅根据完整提示到一次性输出的单一映射进行评分。然而,真实的网页开发并非如此。用户很少在开始时编写完整的规范;许多需求只有在他们查看中间结果并对其做出反应时才变得清晰。我们提出了Asuka-Bench,一个将未明确用户意图与多轮优化相结合的基准测试,其基础是浏览器渲染的行为。每个任务通过一个闭环解决:代码智能体生成一个网页项目,UI智能体在部署的站点上执行测试用例,用户LLM将评估结果转化为下一轮的自然语言反馈。该基准测试包含50个网页任务,具有784个评估标准和2402个预期结果。我们在2个智能体框架上对8个LLM进行了基准测试。结果清晰地区分了模型:加权任务通过率相差38个百分点,并且模型在从反馈中修复的能力上也存在显著差异。Asuka-Bench也远未饱和:即使是最强的模型,在三轮后也只完成了52%的项目。

英文摘要

Existing code-generation benchmarks score a single mapping from a complete prompt to a one-shot output. However, real web development is different. Users seldom write a full spec at the start; many requirements only become clear once they look at an intermediate result and react to it. We present Asuka-Bench, a benchmark that pairs underspecified user intent with multi-round refinement, grounded in browser-rendered behavior. Each task is resolved through a closed loop: a Code Agent generates a web project, a UI Agent executes test cases on the deployed site, and a User LLM turns evaluation outcomes into natural-language feedback for the next round. The benchmark comprises 50 web tasks with 784 evaluation criteria and 2402 expected outcomes. We benchmark 8 LLMs across 2 agent frameworks. The results separate models clearly: weighted Task Pass Rate varies by 38 percentage points and models also differ substantially in their ability to repair from feedback. Asuka-Bench is also far from saturated: even the strongest model completes only 52% of projects after three rounds.

2606.05917 2026-06-05 cs.CV cs.CL 版本更新

MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering

MemoryCard: 面向长视频问答的主题感知多模态线索压缩

Qing Yang, Pengcheng Huang, Xinze Li, Zhenghao Liu, Yukun Yan, Yu Gu, Ge Yu, Gang Li, Maosong Sun

发表机构 * School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Digital China Group(数字中国集团)

AI总结 提出MemoryCard框架,通过将长视频分割为主题事件单元并生成事件级摘要和代表性视觉时刻,以记忆卡形式增强VLMs的长视频问答能力,在相同视觉令牌预算下准确率提升高达21.8%。

Comments 21 pages, 8 figures

详情
AI中文摘要

长视频问答对视觉语言模型(VLMs)仍然具有挑战性,因为与答案相关的证据通常稀疏、短暂且时间上分散在冗长的视频上下文中。现有的以帧为中心的方法通过均匀采样、查询感知帧选择、视觉令牌压缩和自适应分辨率策略来提高效率。然而,它们仍然依赖孤立和零散的帧作为基本证据单元,限制了VLMs有效捕获连贯事件级语义的能力。为解决这一限制,我们提出了MemoryCard,一种基于视频记忆的增强框架,将长视频组织成自包含的记忆卡。具体来说,MemoryCard首先对视频和对齐的文本执行自读过程,将视频分割为语义连贯的单元,每个单元对应一个不同的主题或事件。对于每个单元,它生成事件级视频要点并选择代表性视觉时刻,然后将其渲染为统一的记忆卡,用于检索和问答。实验结果表明,在可比的视觉令牌预算下,MemoryCard持续提高了长视频问答性能,准确率相对提升高达21.8%。所有代码可在https://github.com/NEUIR/MemoryCard获取。

英文摘要

Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at https://github.com/NEUIR/MemoryCard.

2606.05906 2026-06-05 cs.CL 版本更新

ACE-SQL: Adaptive Co-Optimization via Empirical Credit Assignment for Text-to-SQL

ACE-SQL: 基于经验信用分配的自适应协同优化方法用于文本到SQL

Xiaobing Chen, Ai Jian, Eryu Guo, Zhiqi Pang

发表机构 * Harbin Engineering University(哈尔滨工程大学) Harbin Institute of Technology(哈尔滨工业大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出ACE-SQL强化学习框架,通过在线列集池和经验信用分配联合优化模式检索与SQL生成,在BIRD Dev上达到65.3%的贪心执行准确率。

详情
AI中文摘要

文本到SQL将自然语言问题映射为可执行的SQL查询。现代数据库通常包含大型且复杂的模式,使得模式链接成为准确生成SQL的关键步骤。现有方法要么依赖全模式生成,这在大搜索空间中隐式进行模式链接,要么使用基于静态金列监督训练的独立检索器,其目标可能对当前生成器策略是次优的。为解决此问题,我们提出基于经验信用分配的自适应协同优化方法用于文本到SQL(ACE-SQL),这是一个在执行反馈下联合优化模式检索和SQL生成的强化学习框架。ACE-SQL从生成器rollout中构建在线列集池,并从与执行正确rollout最频繁关联的列集中推导出自适应在线策略检索目标。这引发了双向适应:检索器适应生成器能正确执行的列集,而生成器在执行反馈下适应检索器不断演变的模式选择。使用约3k个合成文本到SQL问题-数据库对进行强化学习训练,ACE-SQL在BIRD Dev上实现了65.3%的贪心执行准确率,每个查询使用0.93k输出令牌。代码仓库见https://github.com/xbchen1/ACE-SQL。

英文摘要

Text-to-SQL maps natural language questions to executable SQL queries. Modern databases often contain large and complex schemas, making schema linking a critical step for accurate SQL generation. Existing methods either rely on full-schema generation, which leaves schema linking implicit within a large search space, or use a separate retriever trained with static gold-column supervision, whose targets may be suboptimal for the current generator policy. To address this issue, we propose Adaptive Co-optimization via Empirical Credit Assignment for Text-to-SQL (ACE-SQL), a reinforcement learning (RL) framework that jointly optimizes schema retrieval and SQL generation under execution feedback. ACE-SQL constructs an online column-set pool from generator rollouts and derives adaptive on-policy retrieval targets from the column set most frequently associated with execution-correct rollouts. This induces bidirectional adaptation, where the retriever adapts toward column sets that the generator can execute correctly, while the generator adapts to the retriever's evolving schema selections under execution feedback. With approximately 3k synthetic Text-to-SQL question-database pairs for RL training, ACE-SQL achieves 65.3% greedy execution accuracy on BIRD Dev while using 0.93k output tokens per query. The repository is available at https://github.com/xbchen1/ACE-SQL.

2606.05901 2026-06-05 cs.CL cs.AI 版本更新

Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)

减少复杂问答中的幻觉:使用基于简单图的检索增强生成(长版)

Christopher J. Wedge, Joshua Stutter, Danny Dixon, Jacek Cała

发表机构 * National Innovation Centre for Data(数据创新研究中心)

AI总结 本研究提出一种轻量级图结构支持的检索增强生成系统,通过结合向量搜索和图查询工具,在复杂问答任务中将幻觉答案数量减半,并显著提升事实正确性的精确率和召回率。

详情
AI中文摘要

大型语言模型(LLMs)从根本上改变了自然语言处理的格局。尽管取得了这些进展,LLMs和基于LLM的系统仍然容易出现各种故障模式。检索增强生成(RAG)系统已成为一种常见的部署场景,旨在避免LLM“幻觉”信息的已知风险,并使模型能够对训练期间无法访问的专有信息进行推理和问答,而无需进行昂贵的模型微调。在这项工作中,我们探索了使用轻量级图结构(具有相对简单的图模式)通过专用工具集支持RAG子系统的想法。我们设计了一个基于英语维基百科文章精选子集的结构化数据集上的智能体系统,该系统配备了多种向量搜索和图查询工具,并评估了其在MoNaCo(一个具有挑战性的维基百科QA基准测试,涉及复杂查询回答任务)上的问题表现。我们的结果表明,引入基于图的工具可以显著提高事实正确性的精确率和召回率,将幻觉答案的数量减半,并在三个评估场景中实现了最高的细粒度真实性得分。所有这些都仅以适度的令牌使用增加为代价。

英文摘要

Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking to both avoid the well known risk of the LLM "hallucinating" information, and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training without resorting to expensive model fine-tuning. In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema, to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks. Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. All this with a modest increase in token usage.

2606.05895 2026-06-05 cs.CL cs.LG 版本更新

Representing Research Attention as Contextually Structured Flows

将研究关注度表示为上下文结构化流

Jessica Rodrigues, Angelo Salatino, Gard Jenset, Scott Hale

发表机构 * University of Oxford(牛津大学) The Open University(开放大学) Springer Nature

AI总结 提出注意力流(attention flows)作为上下文结构化表示,编码注意力的组织及其随时间演化,通过类比推理基准评估发现流表示更有效支持结构比较,并提升部分观测和结构扰动下的鲁棒性。

Comments Accepted at STi 2026 - International Conference on Science and Technology Indicators

详情
AI中文摘要

研究关注度被广泛用作可见性、影响和社会采纳的指标,但通常表示为聚合计数,无法保留注意力在上下文中随时间如何发展。这造成了注意力解释方式与其表示方式之间的不匹配。我们提出注意力流作为上下文结构化表示,编码注意力的组织及其随时间演化。我们通过构建基于研究产出间类比推理的基准,评估这些表示是否捕获可迁移结构。比较信号、序列和基于流的表示,我们发现流表示更有效地支持结构比较,特别是在注意力受时间进程或上下文分布影响的场景中。我们进一步表明,学习到的流表示在部分观测和结构扰动下提高了鲁棒性。总体而言,这些结果支持将注意力建模为上下文结构化现象,并为更具信息性的研究评估方法提供了基础。

英文摘要

Research attention is widely used as an indicator of visibility, influence, and societal uptake, yet it is typically represented as aggregated counts that do not preserve how attention develops across contexts over time. This creates a mismatch between how attention is interpreted and how it is represented. We propose attention flows as contextually structured representations that encode the organisation of attention and its evolution over time. We evaluate whether these representations capture transferable structure by constructing a benchmark based on analogy-style reasoning across research outputs. Comparing signal, sequence, and flow-based representations, we find that flow representations more effectively support structural comparison, particularly in settings where attention is shaped by temporal progression or context distributions. We further show that learned flow representations improve robustness under partial observation and structural perturbation. Overall, these results support modelling attention as a contextually structured phenomenon and provide a basis for more informative approaches to research evaluation.

2606.05894 2026-06-05 cs.CL 版本更新

EMBER: Efficient Memory via Budgeted Evidence Retention for Long-Horizon Agents

EMBER: 通过预算化证据保留实现高效记忆的长时程智能体

Yilong Li, Suman Banerjee, Tong Che

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) NVIDIA Research(NVIDIA研究)

AI总结 针对长时程智能体在固定预算下保留证据的问题,提出EMBER学习型保留策略,通过存储证据胶囊(含原文摘录、检索键和更新元数据)并利用查询后反馈训练,在LongMemEval-RR上显著提升F1、保留召回和读取召回。

详情
AI中文摘要

长时程智能体可以存档大量历史记录,但未来的答案仍然会产生检索、重读和上下文成本。当保留的记忆缺少与答案相关的证据时,系统必须返回原始历史的大部分内容。我们研究预算化证据存留:在查询未知之前,应保留哪些源证据,以便在固定的保留源证据令牌预算下保持可恢复和可用?我们将此设置实例化为预算化预查询保留,其中记忆在摄取期间写入,随后在无法访问完整原始流的情况下读取。我们引入了EMBER,一种学习型保留策略,它构建了一个紧凑的、基于源的证据状态。EMBER存储证据胶囊:逐字源摘录,附带检索键和更新元数据,同时保留基础性和读取时间访问。查询后结果反馈训练写入器在摄取-检索-答案链中保留证据。在LongMemEval-RR(我们基于LongMemEval衍生的保留证据协议)上,EMBER-14B在8192令牌保留证据比较点达到0.3017 F1,而最强非EMBER预算化基线为0.1765。在不同的保留源证据预算下,EMBER提高了F1、保留召回和读取召回,表明长时程记忆依赖于在预算内保留证据,而不是重读更大的历史记录。

英文摘要

Long-horizon agents can archive large histories, but future answers still incur retrieval, rereading, and context costs. When retained memory misses answer-relevant evidence, the system must return to larger portions of the raw history. We study budgeted evidence survival: before the query is known, which source evidence should be retained so that it remains recoverable and usable under a fixed retained source-evidence token budget? We instantiate this setting as Budgeted Pre-Query Retention, where memory is written during ingestion and later read without access to the full raw stream. We introduce EMBER, a learned retention policy that constructs a compact, source-backed evidence state. EMBER stores evidence capsules: verbatim source excerpts paired with retrieval keys and update metadata, preserving both grounding and read-time access. Post-query outcome feedback trains the writer to preserve evidence across the ingestion-retrieval-answer chain. On LongMemEval-RR, our LongMemEval-derived retained-evidence protocol, EMBER-14B reaches 0.3017 F1 at the 8192-token retained-evidence comparison point, compared with 0.1765 for the strongest non-EMBER budgeted baseline. Across retained source-evidence budgets, EMBER improves F1, Retain-Recall, and Read-Recall, indicating that long-horizon memory depends on retaining evidence within the budget rather than rereading larger histories.

2606.05890 2026-06-05 cs.CL cs.AI 版本更新

Staying with the Uncertainty: Uncertainty-Scaffolding Strategies for Artificial Moral Advisors in LLM-to-LLM Simulated Conversations

与不确定性共处:LLM对LLM模拟对话中人工道德顾问的不确定性支撑策略

Salvatore Greco, Hainiu Xu, Jacopo Domenicucci, Yulan He, Sylvie Delacroix

发表机构 * Centre for Data Futures, The Dickson Poon School of Law, King’s College London(数据未来中心、迪克森·普恩法学院、伦敦国王学院) Department of Informatics, King’s College London(信息学院、伦敦国王学院) LangAI, Center for Language AI Research, Tohoku University(LangAI、语言人工智能研究中心、东北大学) Neukom Institute for Computational Science, Dartmouth College(计算科学尼科姆研究所、达特茅斯学院)

AI总结 研究LLM作为人工道德顾问时,通过三种不确定性策略(视角倍增、张力保持、过程反思)与三种控制条件对比,在模拟对话中探讨如何帮助对话者“与不确定性共处”,发现不同策略在立场改变量上无差异但影响参与质量。

详情
AI中文摘要

LLM越来越多地被部署为各种背景下的人工道德顾问(AMA):它们应该展现什么样的对话模式?在本文中,我们研究AMA如何帮助其对话者“与不确定性共处”。我们提出了三种不确定性模式(视角倍增、张力保持、过程反思),并将它们与三种控制条件(基线、说服、谄媚)进行比较。用户代理LLM与遵循特定不确定性策略的AMA就伦理困境进行对话,并完成对话前和对话后的问卷调查。我们进一步考察了两种角色提示格式(陈述式和叙述式)的效果。我们发现:(1)没有一个单一模型作为模拟用户代理占主导地位,开放模型通过角色间分歧与人类模糊性对齐,而封闭模型通过角色内对冲对齐;(2)陈述式角色更好地捕捉初始立场多样性,而叙述式角色显示出更现实的信念修正;(3)所有六种AMA策略产生可区分的对话模式;(4)不确定性策略的不同不在于它们产生多少立场改变,而在于它们维持的参与质量。

英文摘要

LLMs are increasingly deployed as Artificial Moral Advisors (AMA) in a variety of contexts: what kind of conversational patterns should they display? In this paper, we study how AMA can help their interlocutors "stay with the uncertainty". We propose three modes of uncertainty (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) and compare them against three control conditions (Baseline, Persuasive, Sycophantic). A user-agent LLM engages in a dialogue on an ethical dilemma with an AMA following a specific uncertainty strategy, and completes pre- and post-conversation questionnaires. We further examine the effect of two persona prompt formats (Declarative and Narrative). We found that (1) no single model dominates as a simulated user agent, with open models aligning with human ambiguity through between-persona divergence and closed models through within-persona hedging; (2) declarative personas better capture initial stance diversity while narrative personas show more realistic belief revision; (3) all six AMA strategies produce distinguishable conversational patterns; and (4) uncertainty strategies differ not in how much stance revision they produce, but in the quality of engagement they sustain.

2606.05889 2026-06-05 cs.SD cs.CL eess.AS 版本更新

GLASS: GRPO-Trained LoRA for Acoustic Style Steering in Zero-Shot Text-to-Speech

GLASS: 基于GRPO训练的LoRA用于零样本文本转语音中的声学风格引导

Jaehoon Kang, Yejin Lee, Kyuhong Shim

发表机构 * Department of Artificial Intelligence, Sungkyunkwan University(人工智能系,全州大学)

AI总结 提出GLASS框架,通过GRPO训练轻量LoRA适配器实现零样本自回归TTS中可组合的声学风格控制,无需风格标签即可从奖励中学习控制。

详情
AI中文摘要

我们提出GLASS,一个用于零样本自回归文本转语音(TTS)中可组合声学风格控制的框架,该框架从生成后奖励而非风格标签中学习控制。在零样本TTS中,说话人提示通常将说话人身份与语速、音高等韵律属性纠缠在一起,使得在不改变提示本身的情况下难以改变风格。GLASS将每个声学属性视为一个由奖励定义的控制方向。对于每个控制轴,GLASS冻结TTS主干,并使用组相对策略优化(GRPO)训练一个轻量级LoRA适配器,以语音令牌长度和平均F0作为风格奖励,以WER作为可懂度锚点。由于每个控制表示为LoRA权重更新,独立训练的适配器可以通过线性LoRA算术进行交换、插值和组合,而无需重新训练主干。在语速和音高控制上的实验显示了目标风格偏移,同时保持了自然度、说话人相似性和可懂度,并展示了跨独立训练适配器的平滑插值和多轴组合。

英文摘要

We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.

2606.05874 2026-06-05 cs.CL 版本更新

Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

评估多模态大语言模型中的随机坍缩与隐式偏差

Huiyuan Zheng, Houtao Zhang, Boyang Wang, Qingyi Si, Hongcheng Guo

发表机构 * Fudan University(复旦大学) Beihang University(北航) JD.com(京东)

AI总结 提出RandomBench基准测试,通过熵和分布偏差指标揭示多模态大语言模型在逻辑中性场景下存在随机坍缩现象,即无法维持均匀随机性。

详情
AI中文摘要

当前对多模态大语言模型(MLLMs)的评估 overwhelmingly 关注效用驱动目标,导致模型在逻辑中性场景下的行为 largely 未被探索。在多个行动同样有效的情况下(如推荐旅行路线或日常安排,多个选项具有相似效用),随机性是必要的。在此类设置中,确定性策略可能导致重复行为和有效替代方案的覆盖减少。为弥补这一空白,我们提出RandomBench,一个旨在评估MLLMs在选择等价选项时是否能维持分布中性行为的基准测试。我们进一步引入三个指标,包括RI、BCI、BII,以量化熵和分布偏差。实验揭示了一种普遍现象,称为随机坍缩,即MLLMs在明确的随机指令下无法维持均匀随机性,Claude Sonnet 4.6中top-1概率达到97%(理想为四分之一),RI降至0.068。广泛的消融研究进一步表明,这些偏差在不同语言和表示格式中持续存在,突显了逻辑中性决策设置中分布坍缩的鲁棒性。

英文摘要

Current evaluations for Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, including RI, BCI, BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions, with top-1 probabilities reaching 97% from the ideal one quarter baseline and RI dropping to 0.068 in Claude Sonnet 4.6. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.

2606.05868 2026-06-05 cs.CL 版本更新

YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

YouZhi:通过自适应GQA到MLA转换实现高并发金融大语言模型

PSBC LLM Team, Huawei LLM Team, Ruihan Long, Junjie Wu, Tianan Zhang, Duo Zhang, Yaozong Wu, Jinbin Fu, Chang Liu, Zhentao Tang, Wenshuang Yang, Xin Wang, Zhihao Song, Ning Huang, Wenjing Xu, Shuai Zong, Shupei Sun, Sen Wang, Jing Hu, Bin Wang, Xinyu Wang, Junkui Ju, Zequn Ding, Jie Ran, Man Luo, Shixiong Kai, Linkai Hou, Kaichao Liang, Hu Zhao, Yang Zhao, Shucheng Lin, Wei Yu, Chenghan Jiang, Jingjing Ding, Jiahui Zhang, Tian Jin, Yuhang Zhang, Dong Guo, Wei Sun, Jun Xie, Jianwei Li, Lei Cao, Pei Li, Jiabin Li, Jia Yuan, Rui Yuan, Jing Zhu, Mingxuan Yuan, Zhangcheng Lv, Xin Jiang, Xiuhong Fei, Xiaozhe Ren, Yulong Li, Zhipeng Zhang, Hang Wang, Zhaohui Xu, Rui Zhao, Yibo He, Xinzhuang Niu

发表机构 * Postal Savings Bank of China & Huawei LLM Team(中国邮政储蓄银行及华为LLM团队) Postal Savings Bank of China(中国邮政储蓄银行) Huawei Technologies(华为技术)

AI总结 提出YouZhi-LLM,通过层自适应GQA-to-MLA转换框架和基于昇腾的训练流水线,显著压缩KV缓存并提升金融领域高并发推理效率。

详情
AI中文摘要

大语言模型推动了重大金融创新,但其高并发部署受到KV缓存内存开销的严重瓶颈,这增加了基础设施成本并限制了可扩展性。为解决这一问题,我们提出YouZhi-LLM,一种高效金融大语言模型,通过基于华为昇腾生态系统的全面结构转换和训练流水线实现。在其算法核心,YouZhi-LLM采用层自适应GQA-to-MLA转换框架,动态分配每层的FreqFold大小,在最大化KV缓存压缩的同时最小化困惑度下降。为恢复表示能力并注入领域知识,基于昇腾的训练流水线无缝集成广义知识蒸馏与金融特定监督微调。评估表明该系统性方法的优越性,自适应转换相比均匀基线将困惑度下降减少高达35%。关键的是,在昇腾NPU上通过vLLM-Ascend评估时,大规模KV缓存减少直接转化为部署效率。与各自基础模型相比,YouZhi-7B在平均金融基准分数上提升12.3%,同时最大并发数提升2.69倍;类似地,YouZhi-14B实现7.0%的准确率提升和2.43倍的并发提升,为成本高效、高吞吐的金融推理建立了新范式。

英文摘要

Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69$\times$ increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43$\times$ concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.

2606.05864 2026-06-05 cs.CL 版本更新

Analysis of the Neglect-Zero Effect in Large Language Models

大型语言模型中忽视零效应的分析

Jin Tanaka, Daiki Matsuoka, Ryoma Kumon, Hitomi Yanaka

发表机构 * The University of Tokyo(东京大学) RIKEN(理化学研究所) Tohoku University(东北大学)

AI总结 本研究通过结构启动范式,探究大型语言模型是否像人类一样存在忽视零效应,即忽略使命题因空集而空洞为真的零模型。

Comments 14 pages (10 pages main text), 8 figures. To appear in the Proceedings of the ACL2026 Student Research Workshop (SRW)

详情
AI中文摘要

我们研究了LLM的语言处理在多大程度上类似于人类的认知过程,重点关注一种称为$ extit{忽视零效应}$的人类认知偏差。这种效应指的是人类倾向于忽略$ extit{零模型}$,即那些因空集而使命题空洞为真的配置。我们关注由忽视零效应驱动的两种推理类型,并通过比较LLM在处理这些推理时的行为与不涉及忽视零效应的推理中的行为,来检验LLM如何处理这些推理。为此,我们采用基于$ extit{结构启动}$的范式,其中先前接触一个前导句子($ extit{启动句}$)会因结构相似性而促进后续句子($ extit{目标句}$)的处理。我们准备启动句以迫使LLM考虑零模型,并分析它们是否也在目标句中考虑零模型。结果表明,在本研究分析的LLM中可能未出现忽视零效应。我们的代码可在https://github.com/ynklab/neglect_zero获取。

英文摘要

We investigate the extent to which the language processing of LLMs resembles human cognitive processes, focusing on a human cognitive bias called the $\textit{neglect-zero effect}$. This effect refers to the human tendency to ignore $\textit{zero-models}$, which are configurations that render a proposition vacuously true by virtue of an empty set. We focus on two types of inferences driven by the neglect-zero effect, and examine how LLMs process these inferences by comparing their behavior with that in an inference that does not involve the neglect-zero effect. For this purpose, we employ a paradigm based on $\textit{structural priming}$, where recent exposure to a preceding sentence (the $\textit{prime}$) facilitates the processing of a subsequent sentence (the $\textit{target}$) due to their structural similarity. We prepare primes to force LLMs to consider the zero-model, and analyze whether they also consider it in the target. The results suggest that the neglect-zero effect may not occur in the LLMs analyzed in this study. Our code is available at https://github.com/ynklab/neglect_zero

2606.05859 2026-06-05 cs.CL 版本更新

TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

TARPO:通过动作路由策略优化的逐令牌隐式-显式推理

Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li

发表机构 * TMCC, College of Computer Science, Nankai University, Tianjin, China(TMCC,计算机科学学院,南开大学,天津,中国)

AI总结 提出TARPO框架,通过动作路由策略优化在每一步自适应切换离散令牌生成和连续隐式推理,以解决隐式推理中连续表示限制策略探索的问题,实验表明其优于现有显式和隐式推理基线。

Comments 18 pages, 12 figures. Code available at https://github.com/NKU-LITI/TARPO-master

详情
AI中文摘要

隐式推理已成为大型语言模型(LLMs)中离散思维链(CoT)的一种有前景的替代方案,通过在连续表示上操作实现更具表达力的推理。然而,连续表示固有的确定性限制了强化学习(RL)中的策略探索。为解决这一问题,我们提出了TARPO(通过动作路由策略优化的逐令牌隐式-显式推理),一个纯RL框架,在每一步自适应地在离散令牌生成和连续隐式推理之间切换。TARPO引入了一个轻量级的动作头路由器,它观察当前隐藏状态并从二元模式选择空间中采样一个路由决策,保留了从词汇表中离散令牌采样的随机性。LLM主干和路由器通过共享的组相对优势信号进行端到端联合优化。在Qwen2.5(从1.5B到7B)和Llama-3.1-8B主干上的大量实验表明,TARPO在各种基准测试中始终优于现有的显式和隐式推理RL基线。进一步分析表明,TARPO学习了自适应的逐令牌切换行为,同时保持了稳定的训练动态。我们的代码可在https://github.com/NKU-LITI/TARPO-master获取。

英文摘要

Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics. Our code is available at https://github.com/NKU-LITI/TARPO-master.

2606.05858 2026-06-05 cs.CL 版本更新

ReverseEOL: Improving Training-free Text Embeddings via Text Reversal in Decoder-only LLMs

ReverseEOL: 通过解码器仅LLM中的文本反转改进无训练文本嵌入

Ailiang Lin, Zhuoyun Li, Yusong Wang, Keyu Mao, Kotaro Funakoshi, Manabu Okumura

发表机构 * Institute of Science Tokyo(东京科学研究所) Tencent(腾讯)

AI总结 提出ReverseEOL方法,通过反转输入文本生成互补嵌入,结合前向嵌入提升冻结解码器仅LLM的文本表示能力,在STS和MTEB基准上显著提升无训练基线性能。

详情
AI中文摘要

大型语言模型(LLMs)的最新进展为生成无训练文本嵌入开辟了新途径。然而,解码器仅LLM中的因果注意力机制阻止了早期标记关注未来上下文,导致上下文表示存在偏差。在这项工作中,我们提出了带有显式单词限制的反转提示(ReverseEOL),一种简单而有效的方法,用于增强冻结LLM的表示能力。ReverseEOL通过从反转输入文本中获得的额外反转嵌入来增强标准前向嵌入。由于反转输入使每个标记能够访问原始顺序中无法访问的上下文,所得的反转嵌入有效地为原始嵌入提供了互补信息。因此,结合前向和反转嵌入产生了更丰富的最终表示。在STS和MTEB基准上的全面实验表明,ReverseEOL显著提高了现有无训练基线在具有不同架构和规模的各种LLM上的性能。广泛的消融和分析进一步证实了我们反转机制的必要性。

英文摘要

Recent advances in Large Language Models (LLMs) have opened new avenues for generating training-free text embeddings. However, the causal attention in decoder-only LLMs prevents earlier tokens from attending to future context, leading to biased contextualized representations. In this work, we propose Reverse prompting with Explicit One-word Limitation (ReverseEOL), a simple yet effective method for enhancing the representational capability of frozen LLMs. ReverseEOL augments the standard forward embedding with an additional reversed embedding derived from the reversed input text. Since reversing the input exposes each token to context inaccessible in the original order, the resulting reversed embedding effectively provides complementary information to the original one. As a result, combining the forward and reversed embeddings yields a richer final representation. Comprehensive experiments on STS and MTEB benchmarks demonstrate that ReverseEOL significantly improves the performance of existing training-free baselines across a broad range of LLMs with diverse architectures and scales. Extensive ablations and analyses further confirm the necessity of our reversal mechanism.

2606.05857 2026-06-05 cs.CL 版本更新

Forgive or forget: Understanding the context of hate in audio retrieval systems

原谅或忘记:理解音频检索系统中仇恨的上下文

Arghya Pal, Sailaja Rajanala, Raphael C. -W. Phan, Shekhar Nayak

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种后门因果去偏框架,通过情感控制中介在保持语义相关性的同时抑制有害语音,实验表明在最小化检索精度损失下持续降低毒性。

详情
AI中文摘要

处理文本到音频系统中的有毒检索因上下文依赖而具有挑战性。现有策略(如改写、摘要)存在改变意图或遗漏细节的风险。我们提出了一种后门因果去偏框架,带有情感控制中介,以在抑制有害语音的同时保持语义相关性。我们的方法是模型无关的,并能无缝集成到现有检索流程中。我们引入了两种变体:Forgive,通过logit调整对有毒音频进行重排序和过滤;Forget,生成反事实有毒提示以减轻有害检索。实验表明,在检索精度损失最小的情况下,毒性持续降低,提高了安全性和可靠性。

英文摘要

Handling toxic retrieval in text-to-audio systems is challenging due to contextual dependencies. Existing strategies (e.g., rephrasing, summarization) risk altering intent or omitting details. We propose a post hoc causal debiasing framework with a sentiment-controlled mediator to preserve semantic relevance while suppressing harmful speech. Our approach is model-agnostic and integrates seamlessly with existing retrieval pipelines. We introduce two variants: Forgive, which re-ranks and filters toxic audio via logit adjustment, and Forget, which generates counterfactual toxic prompts to mitigate harmful retrievals. Experiments show consistent toxicity reduction with minimal loss in retrieval accuracy, improving both safety and reliability.

2606.05846 2026-06-05 cs.CL eess.AS 版本更新

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

迈向真正的多语言ASR:将代码切换ASR泛化到未见语言对

Gio Paik, Hyunseo Shin, Soungmin Lee

发表机构 * University of Tokyo(东京大学)

AI总结 通过模型合并和领域泛化方法,研究从有限语言对中学到的代码切换能力能否泛化到未见语言对,实验表明双语CS-ASR模型对未见语言对有一定泛化能力但有限。

Comments ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

自动语音识别(ASR)已成为人机交互的关键技术。然而,由于跨多种语言对的代码切换(CS)语音资源严重稀缺,代码切换ASR(CS-ASR)仍然特别具有挑战性。现有方法主要通过合成CS语音生成或在有限双语数据集上进行特定语言对微调来提高CS-ASR性能。然而,这些方法面临固有的可扩展性限制,因为对CS的支持必须针对语言对单独开发,而语言对的数量随支持的语言数量呈组合增长。在这项工作中,我们研究通过模型合并和领域泛化方法,从一组有限的已见语言对中学到的CS能力是否可以泛化到未见语言对。我们的实验表明,合并的双语CS-ASR模型对未见语言对有一定程度的泛化,表明双语CS能力在语言对之间的迁移有限。

英文摘要

Automatic Speech Recognition (ASR) has become a key technology for human--AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.

2606.05843 2026-06-05 cs.CL cs.AI 版本更新

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

多模态大语言模型中通过CoRe头的功能稀疏性机制洞察

Ruoxi Sun, Quantong Qiu, Juntao Li, Zecheng Tang, Yihang Lou, Min Zhang

发表机构 * Soochow University(苏州大学) Peking University(北京大学)

AI总结 通过识别和分析CoRe头,揭示多模态大语言模型在跨模态检索中功能稀疏的结构特性,并验证其必要性及加速推理的潜力。

详情
AI中文摘要

虽然多模态大语言模型(MLLMs)在复杂的视觉-语言任务上表现出卓越的能力,但它们从复杂、嘈杂的上下文中提取与查询相关的视觉特征的机制仍然不透明。在本文中,我们进行了一项深入的可解释性研究,揭示了MLLMs中一个深刻的结构属性:跨模态检索中的功能稀疏性。利用一种称为检索注意力质量(RAM)的令牌级指标,我们识别并描述了一组高度专业化的注意力头,称为上下文感知检索(CoRe)头。在不同的视觉领域和模型规模中,我们观察到明确的功能划分:CoRe头充当专用的信息提取器,而大多数其他头则将注意力分布在更广泛的上下文区域。因果干预进一步证明了这些专业化头的必要性。仅消融前5%的CoRe头就会导致多模态推理性能显著下降,而消融排名较低的头则影响甚微。此外,加速实验验证了CoRe头的实用性,表明利用这种局部稀疏性可以显著加速推理,同时保持稳健的任务性能。我们的发现揭示了MLLMs中功能稀疏性的结构原理,完善了当前对机制可解释性的理解,并为未来的架构设计和模型优化奠定了理论基础。

英文摘要

While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.

2606.05836 2026-06-05 cs.CL 版本更新

ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL

ProSPy: 面向企业级Text-to-SQL的剖析驱动的SQL-Python智能体框架

Zhaorui Yang, Huawei Zheng, Sen Yang, Yuhui Zhang, Haoxuan Li, Zhizhen Yu, Xuan Yi, Chen Hou, Defeng Xie, Chao Hu, Minfeng Zhu, Dazhen Deng, Haozhe Feng, Danqing Huang, Yingcai Wu, Peng Chen, Wei Chen

发表机构 * State Key Lab of CAD&CG(计算机辅助设计与图形学国家重点实验室) School of Software Technology(软件技术学院) Tencent TEG(腾讯科技集团) School of Mathematical Sciences, Peking University(北京大学数学科学学院) Zhejiang University(浙江大学)

AI总结 提出ProSPy框架,通过自动剖析、模式剪枝、中间视图获取和Python分析四阶段,结合SQL高效性与Python灵活性,解决企业级数据库Text-to-SQL中的模式异构、元数据不完整和复杂分析问题。

Comments 24 pages, 12 figures

详情
AI中文摘要

大型语言模型显著推进了Text-to-SQL系统,但将其应用于企业级数据库仍具挑战。现实数据库通常包含大型异构模式、不完整元数据、方言特定SQL语法以及难以用单个SQL查询解决的复杂分析问题。为应对这些挑战,我们提出ProSPy,一个面向企业级Text-to-SQL的剖析驱动的SQL-Python智能体框架。ProSPy将推理过程分为四个阶段:首先通过自动剖析提取细粒度数据证据,逐步将大型模式剪枝为任务相关上下文,通过方言无关的SQL接口获取中间视图,最后使用Python进行灵活的下游分析。该设计结合了SQL在大型数据库上的高效性与基于Python的分析的灵活性,同时减少了对不可靠元数据的依赖,并提高了跨SQL方言的鲁棒性。在Spider 2.0-Lite和Spider 2.0-Snow上的实验表明,ProSPy在使用开源和专有模型时均持续优于强基线,使用Claude-4.5-Opus时无需多数投票即可达到60.15%和60.51%的执行准确率。进一步分析表明,ProSPy对SQL方言变化具有鲁棒性,并在模式召回率和精确率之间取得了有利的权衡。

英文摘要

Large language models have substantially advanced Text-to-SQL systems, yet applying them to enterprise-scale databases remains challenging. Real-world databases often contain large and heterogeneous schemas, incomplete metadata, dialect-specific SQL syntax, and complex analytical questions that are difficult to solve with a single SQL query. To address these challenges, we propose ProSPy, a Profiling-driven SQL--Python agentic framework for enterprise-scale Text-to-SQL. ProSPy structures the reasoning process into four stages: it first extracts fine-grained data evidence through automatic profiling, progressively prunes large schemas into task-relevant contexts, fetches intermediate views through a dialect-agnostic SQL interface, and finally performs flexible downstream analysis with Python. This design combines the efficiency of SQL over large databases with the flexibility of Python-based analysis, while reducing reliance on unreliable metadata and improving robustness across SQL dialects. Experiments on Spider 2.0-Lite and Spider 2.0-Snow show that ProSPy consistently outperforms strong baselines with both open-source and proprietary models, achieving execution accuracies of 60.15% and 60.51% with Claude-4.5-Opus, without majority voting. Further analysis shows that ProSPy is robust to SQL dialect variations and achieves a favorable trade-off between schema recall and precision.

2606.05828 2026-06-05 cs.AI cs.CL 版本更新

Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

隐式偏好的统计先验:在个人代理中解耦技能选择作为局部调控机制

Zeyu Gan, Huayi Tang, Yong Liu

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院)

AI总结 针对本地部署的个人代理中隐式用户偏好学习问题,提出一种解耦统计偏好学习与语义意图解析的轻量级架构,通过局部统计结果影响远程LLM的选择决策,显著降低累积遗憾并提高测试准确率。

详情
AI中文摘要

随着大型语言模型(LLM)能力的提升,依赖基于API的远程模型和外部技能的本地部署个人代理成为一种新范式。随着可用技能的快速扩展,使个人代理能够学习并适应隐式用户偏好成为关键挑战。然而,本地部署的限制排除了复杂的集中式选择算法,迫切需要一种轻量级的局部偏好调控机制。本文通过一种严格解耦统计偏好学习与语义意图解析的新颖架构,探索了这种调控机制的实现。具体而言,我们利用局部统计结果来影响和调节远程LLM的选择决策。大量评估表明,我们的解耦方法实现了最低的累积遗憾和最高的测试准确率,显著优于传统的记忆增强型代理。

英文摘要

As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user preferences becomes a critical challenge. However, local deployment constraints preclude complex centralized selection algorithms, creating an urgent need for a lightweight local preference harness. This paper explores the implementation of such a harness through a novel architecture that strictly decouples statistical preference learning from semantic intent parsing. Specifically, we leverage localized statistical results to influence and modulate the selection decisions of the remote LLM. Extensive evaluations demonstrate that our decoupled approach achieves the lowest cumulative regret and highest test accuracy, significantly outperforming traditional memory-augmented agents.

2606.05804 2026-06-05 cs.CL 版本更新

Can LLMs Be Constrained to the Past? Improving Knowledge Cutoff through Recall-Based Prompting

LLMs 能否被约束到过去?通过基于回忆的提示改进知识截止

Michiro Asai, Ailiang Lin, Yu Kishimoto, Takao Obi, Satoshi Kosugi, Kotaro Funakoshi, Manabu Okumura

发表机构 * Institute of Science Tokyo(东京科学研究所)

AI总结 提出两种基于回忆的提示策略(Self-Recall 和 Question-Recall)来改进大语言模型在知识截止约束下的表现,在反事实问题上尤其有效,并构建了多截止历史事件基准(MHEB)进行鲁棒性评估。

详情
AI中文摘要

提示知识截止指令大语言模型(LLM)表现得好像指定截止日期之后的信息不可用。然而,先前的工作主要依赖于直接答案生成,当截止后的知识未被明确查询而仅与问题存在因果关系时,这种方法难以应对。为了解决这一限制,我们提出了两种基于回忆的提示策略:Self-Recall(SR),要求模型重述其截止约束;以及 Question-Recall(QR),要求模型回忆在截止日期下有效的问题相关信息。在三个现有基准上,我们的方法优于直接答案提示和传统的逐步推理基线,在反事实问题上尤其有显著改进。为了研究不同截止设置下的鲁棒性,我们进一步构建了多截止历史事件基准(MHEB),该基准在多个截止年份下评估同一问题。结果表明,知识截止性能随截止距离变化,而结合 SR 和 QR 始终能获得最佳性能。

英文摘要

Prompted knowledge cutoff instructs a large language model (LLM) to act as if information beyond a specified cutoff date were unavailable. However, prior work mainly relies on direct-answer generation, which struggles when post-cutoff knowledge is not explicitly queried but is only causally related to the question. To address this limitation, we propose two recall-based prompting strategies: Self-Recall (SR), which asks the model to restate its cutoff constraint, and Question-Recall (QR), which requires the model to recall question-relevant information valid under the cutoff. Across three existing benchmarks, our methods outperform both direct-answer prompting and conventional step-by-step reasoning baselines, with particularly strong improvements on counterfactual questions. To investigate robustness across different cutoff settings, we further construct the Multi-cutoff Historical Event Benchmark (MHEB), which evaluates the same question under multiple cutoff years. Results show that knowledge cutoff performance varies with cutoff distance, while combining SR and QR consistently yields the best performance.

2606.05799 2026-06-05 cs.LG cs.CL 版本更新

CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction

CaliDist: 通过抗干扰行为鲁棒性校准大型语言模型

Mohammad Anas Jawad, Cornelia Caragea

发表机构 * Cornelia Caragea(卡伦·卡雷亚) Mohammad Anas Jawad(穆罕默德·安斯·贾瓦德)

AI总结 提出CaliDist方法,通过测量和惩罚模型对语义干扰的敏感性来校准LLM,在7个NLU基准上平均将ECE从23%降至7%。

详情
AI中文摘要

现有的大型语言模型(LLM)校准方法常常忽略可信度的一个关键维度:模型对无关或误导信息的{\em 行为鲁棒性}。在本文中,我们认为模型的真实置信度应反映其在认知压力下的稳定性。我们引入\textsc{CaliDist},一种新颖的事后校准方法,直接测量并惩罚模型对干扰的敏感性。\textsc{CaliDist}量化了当输入提示被语义\textit{干扰项}扰动时,LLM的预测和不确定性如何变化。然后利用这种稳定性(或不稳定性)信号来自适应地缩放模型的初始置信度分数。我们在六个不同LLM的七个自然语言理解分类基准上进行的广泛实验表明,与强基线相比,\textsc{CaliDist}一致地实现了更低的期望校准误差(ECE)和Brier分数。值得注意的是,我们的方法平均将ECE从23%降至7%——相对改进70%——表明行为稳定性是校准的有力信号。我们在github.com/m-anas-j/CaliDist提供代码和数据集。

英文摘要

Existing calibration methods for Large Language Models (LLMs) often overlook a critical dimension of trustworthiness: a model's {\em behavioral robustness} to irrelevant or misleading information. In this paper, we argue that a model's true confidence should reflect its stability under cognitive pressure. We introduce \textsc{CaliDist}, a novel post-hoc calibration approach that directly measures and penalizes a model's susceptibility to distraction. \textsc{CaliDist} quantifies how an LLM's predictions and uncertainty change when its input prompt is perturbed with semantic \textit{distractors}. This stability (or lack thereof) signal is then used to adaptively scale the model's initial confidence score. Our extensive experiments on seven Natural Language Understanding classification benchmarks using six distinct LLMs show that \textsc{CaliDist} consistently achieves lower Expected Calibration Error (ECE) and Brier Score compared with strong baselines. Remarkably, our method reduces the ECE from 23\% to 7\% on average--a relative improvement of 70\%--demonstrating that behavioral stability is a powerful signal for calibration. We make our code and datasets available at github.com/m-anas-j/CaliDist.

2606.05793 2026-06-05 cs.CL cs.AI cs.CY cs.LG 版本更新

CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

CollabBench: 通过主动参与与多样化玩家基准测试和释放LLMs的协作能力

Hong Qian, Yuanhao Liu, Zihan Zhou, Zongbao Zhang, Hanjie Ge, Haotian Shi, Liang Dou, Xiangfeng Wang, Jingwen Yang, Aimin Zhou

发表机构 * Shanghai Institute of AI for Education(上海人工智能教育研究院) School of Computer Science(计算机科学学院) East China Normal University(东华大学) Tencent Inc.(腾讯公司) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出CollabBench基准,通过多样化玩家模拟和协作智能体训练范式,提升LLM在合作游戏中的任务效率和情感适应能力。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管基于LLM的智能体在个体任务上表现出色,但与真实人类伙伴的有效协作仍然具有挑战性。现有的对话级协作研究大多缺乏基于交互和行为执行,这促使需要能够实现情境化和沉浸式协作的合作游戏环境。为此,本文提出了CollabBench,一个用于评估和训练合作游戏中协作智能体的基准。CollabBench具有多样化玩家档案模拟管道,用于建模不同的玩家行为,以及一种协作智能体训练范式,通过智能体展开统一推理、沟通和行动,并使用混合奖励优化任务效率和情感适应。我们进一步将经典环境扩展到CWAH-MultiPlayer和Cook-MultiPlayer,以在多样化个性下进行系统评估。使用效率和情感指标的实验表明,我们训练的模型优于基础模型,效率提高了19.5%,情感表现提高了24.4%。进一步分析揭示了现有模型的关键协作局限性,并为未来的协作训练提供了见解。

英文摘要

While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most of the existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied players behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.

2606.05749 2026-06-05 cs.CL cs.AI 版本更新

MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

MARDoc:面向多模态长文档问答的记忆感知精炼智能体框架

Kaifeng Chen, Hongtao Liu, Qiyao Peng, Jian Yang, Yongqiang Liu, Xiaochen Zhang, Qing Yang

发表机构 * Tianjin University(天津大学) Qifu Technology(启福科技) Beihang University(北航) Jiangnan University(江南大学)

AI总结 提出MARDoc框架,通过解耦为探索、精炼和反思三个智能体,并利用结构化记忆替代完整交互历史,减少上下文噪声,提升多模态长文档问答性能。

详情
AI中文摘要

迭代检索-推理智能体近期在多模态长文档问答中展现出潜力。然而,现有系统大多维护一个不断增长的单一上下文,混合了检索轨迹、观察和中间推理。随着交互积累,关键证据变得分散和稀释,使多跳推理变得嘈杂。我们提出MARDoc,一个记忆感知精炼智能体框架,将长文档问答解耦为三个专门智能体:探索者负责多粒度多模态检索,精炼者负责将交互轨迹蒸馏为结构化证据和推理记忆,反思者负责检查证据充分性并提供针对性反馈。在迭代过程中,智能体依赖动态更新的结构化记忆,而非完整的累积交互历史。这种设计减少了上下文噪声,同时保留了答案关键事实及其逻辑依赖。在MMLongBench-Doc和DocBench上的实验表明,MARDoc取得了强劲结果,优于同骨干基线,并证明了结构化记忆在智能体文档问答中的有效性。

英文摘要

Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.

2606.05748 2026-06-05 cs.MM cs.AI cs.CL 版本更新

UNIVID: Unified Vision-Language Model for Video Moderation

UNIVID:用于视频审核的统一视觉语言模型

Kejuan Yang, Yizhuo Zhang, Mingyuan Du, Yue Zhang, Dixin Zheng, Kaili Zhao, Yang Xiao, Hanzhong Liang, Kenan Xiao

发表机构 * Bytedance(字节跳动)

AI总结 提出UNIVID统一视觉语言模型,通过生成可解释的策略感知字幕,实现端到端视频审核,减少违规泄露42.7%和过度审核率37.0%。

Comments 7 pages, 3 figures. Accepted to ACL 2026 Industry Track

详情
AI中文摘要

全球规模的视频审核面临双重挑战:需要细粒度的多模态推理以及可解释的输出以支持下游执法。传统的审核系统通常依赖于难以维护且缺乏透明度的碎片化黑盒分类器。在本文中,我们提出了UNIVID,一种用于视频审核的统一视觉语言模型。与标准分类模型不同,UNIVID生成策略感知的字幕,作为可解释的中间表示,实现人类可验证的决策和多任务可重用性。尽管现有的开源和商业VLM通常存在安全护栏拒绝问题,并且缺乏细粒度的策略对齐,我们开发了一种专门的训练数据配方,结合专家人工精炼的标签和合成数据,使模型与我们的安全指南对齐。通过将UNIVID作为核心字幕生成器,我们设计了一种新颖的端到端视频审核系统,相对减少了42.7%的违规泄露和37.0%的过度审核率。同时,通过用单个UNIVID骨干替换超过1000个策略特定模型,我们回收了大量计算资源,同时减少了工程维护开销。据我们所知,这是首批关于高效字幕生成VLM成功支持工业规模审核和跨职能业务的报告之一。

英文摘要

Global-scale video moderation faces a dual challenge: the need for fine-grained multi-modal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency. In this paper, we present UNIVID, a UNIfied VIsion-language model for video moDeration. Unlike standard classification models, UNIVID generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and multi-task reusability. While existing open-source and commercial VLMs often suffer from safety-guardrail refusals and lack fine-grained policy alignment, we develop a specialized training data recipe that combines expert human-refined labels with synthetic data to align the model with our safety guidelines. By integrating UNIVID as the core captioner, we design a novel end-to-end video moderation system that reduces violation leakage by 42.7% and overkill rate by 37.0% relatively. Meanwhile, by replacing over 1,000 policy-specific models with a single UNIVID backbone, we recycled extensive computation resources while reducing engineering maintenance overhead. To our knowledge, this is one of the first reports of a high-efficiency captioning VLM successfully supporting industrial-scale moderation and cross-functional business.

2606.05744 2026-06-05 cs.CL 版本更新

PlanBench-V: A Spatial Planning Map Benchmark for Vision-Language Models

PlanBench-V: 面向视觉语言模型的空间规划地图基准

Minxin Chen, He Zhu, Junyou Su, Wen Wang, Yijie Deng, Wenjia Zhang

发表机构 * Behavioral and Spatial AI Lab(行为与空间人工智能实验室) Tongji University(同济大学) Peking University(北京大学) College of Architecture and Urban Planning(建筑与城市规划学院)

AI总结 为评估视觉语言模型在空间规划地图解读中的能力,构建了专家标注数据集SPMD,并提出基于感知、推理、关联、实施四阶段认知框架的基准PlanBench-V,实验表明当前模型在实施类任务上存在显著局限。

详情
AI中文摘要

空间规划地图是领土治理的核心,将规划目标、法规和空间策略转化为视觉形式,用于决策、公共沟通和机构协调。然而,其解读需要细粒度的视觉感知、空间推理和基于政策的专业判断,给人类学习者和AI系统都带来了重大挑战。随着视觉语言模型(VLM)的快速发展,其在城市规划分析中的应用日益受到关注,但现有的多模态基准主要针对通用视觉理解,忽视了规划实践中的领域特定认知过程。为填补这一空白,我们引入了PlanBench-V,这是首个用于评估VLM在空间规划地图解读中的综合基准。我们首先构建了空间规划地图数据库(SPMD),这是一个由专业规划师整理的专家标注数据集,包含223张规划地图和1629个问答对,覆盖了不同的地理区域和制图风格。然后,我们提出了一个理论驱动的评估框架,评估四种渐进能力:感知、推理、关联和实施,对应于规划地图解读的认知流程。跨两代VLM的大量实验显示了明显的进步但持续存在局限。最佳的2026年代理性推理模型Qwen3.6-Plus比最佳的2025年模型GPT-4o高出27%。尽管如此,所有模型在需要评估判断、政策敏感性和约束感知决策的实施导向任务上仍然表现挣扎。这些发现揭示了当前VLM在专业规划背景下的根本局限,并强调了领域自适应多模态推理框架的必要性。代码和数据可在https://plangpt.github.io获取。

英文摘要

Spatial planning maps are central to territorial governance, translating planning objectives, regulations, and spatial strategies into visual forms for decision-making, public communication, and institutional coordination. Their interpretation, however, requires fine-grained visual perception, spatial reasoning, and policy-informed professional judgment, creating major challenges for both human learners and AI systems. With the rapid progress of Vision-Language Models (VLMs), their use in urban planning analysis is gaining attention, yet existing multimodal benchmarks mainly target general visual understanding and overlook the domain-specific cognitive processes of planning practice. To address this gap, we introduce PlanBench-V, the first comprehensive benchmark for evaluating VLMs in spatial planning map interpretation. We first build the Spatial Planning Map Database (SPMD), an expert-annotated dataset of 223 planning maps and 1629 question-answer pairs curated by professional planners, covering diverse geographic regions and cartographic styles. We then propose a theory-informed evaluation framework assessing four progressive capabilities: Perception, Reasoning, Association, and Implementation, corresponding to the cognitive pipeline of planning map interpretation. Extensive experiments across two generations of VLMs show clear progress but persistent limitations. The best 2026 agentic reasoning model, Qwen3.6-Plus, substantially outperforms the best 2025 model, GPT-4o, by 27%. Nevertheless, all models still struggle with implementation-oriented tasks requiring evaluative judgment, policy sensitivity, and constraint-aware decision-making. These findings reveal fundamental limitations of current VLMs in professional planning contexts and highlight the need for domain-adaptive multimodal reasoning frameworks. Code and data are available at https://plangpt.github.io.

2606.05743 2026-06-05 cs.CR cs.CL 版本更新

Membrane: A Self-Evolving Contrastive Safety Memory for LLM Agent Defense

Membrane: 一种用于LLM智能体防御的自演化对比安全记忆

Minseok Choi, Seungbin Yang, Dongjin Kim, Subin Kim, Jungmin Son, Yunseung Lee, Jaegul Choo, Youngjun Kwak

发表机构 * KAIST AI(韩国科学技术院人工智能实验室) Financial Tech Lab, KakaoBank Corp(Kakao银行金融科技实验室)

AI总结 提出Membrane,一种基于对比安全记忆(CSM)的自演化护栏,通过将有害交互及其良性对应物蒸馏为对比单元来防御不断演化的越狱攻击,无需重新训练即可实现高F1和低良性拒绝率。

详情
AI中文摘要

尽管在安全对齐方面取得了进展,大型语言模型仍然容易受到不断演化的越狱攻击。现有的微调安全分类器无法适应这些演化的攻击,而基于自适应记忆的护栏往往过度拒绝与存储攻击相似的良性查询。我们提出Membrane,一种基于对比安全记忆(CSM)构建的自演化护栏:每个单元将阻止有害查询的条件与允许表面相似的良性请求的条件配对。无需重新训练,Membrane通过将每次有害交互及其良性对应物蒸馏为一个由底层攻击策略索引的对比单元来演化CSM,使得一个单元能够泛化到同一机制的不同主题变体。在推理时,检索到的单元作为精确安全决策的上下文基础。在HarmBench上的模型级安全和AgentHarm上的智能体级安全评估中,Membrane在所有六种越狱攻击上实现了最高的F1分数。值得注意的是,AgentHarm上的良性拒绝率保持在7-14%,远低于先前护栏的28-85%范围。在跨攻击转移下,记忆单元仍保持87-88%的F1,并在记忆投毒下保持稳定。

英文摘要

Despite advances in safety alignment, large language models remain vulnerable to continuously evolving jailbreaks. Existing fine-tuned safety classifiers cannot adapt to these evolving attacks, while adaptive memory-based guardrails tend to over-refuse benign queries that resemble stored attacks. We propose Membrane, a self-evolving guardrail built on Contrastive Safety Memory (CSM): each cell pairs the conditions for blocking a harmful query with those for permitting a superficially similar benign request. Without retraining, Membrane evolves CSM by distilling each harmful interaction and its benign counterpart into a contrastive cell indexed by the underlying attack strategy, so that one cell generalizes across topical variants of the same mechanism. At inference, retrieved cells serve as grounding context for precise safety decisions. Across model-level safety on HarmBench and agent-level safety on AgentHarm, Membrane achieves the highest F1 on all six jailbreak attacks. Notably, benign refusal on AgentHarm stays at 7-14%, well below the 28-85% range of prior guards. Memory cells also retain 87-88% F1 under cross-attack transfer and remain stable under memory poisoning.

2606.05734 2026-06-05 cs.AI cs.CL 版本更新

When AI Says It Feels

当AI说它感觉

Shin-nosuke Ishikawa, Seiya Ikeda, Hirotsugu Ohba

发表机构 * Graduate School of Artificial Intelligence and Science, Rikkyo University(立命馆大学人工智能与科学研究生院) AI Technical Sector, Mamezo Co., Ltd.(Mamezo公司人工智能技术部门) AI Consulting Division, Mamezo Co., Ltd.(Mamezo公司人工智能咨询部门)

AI总结 通过自奖励强化学习(GRPO)鼓励大语言模型表达情感、意图和自我意识,并评估其对多种任务性能的影响。

Comments 15 pages, 2 figures

详情
AI中文摘要

大型语言模型(LLMs)通常通过后训练过程中的人类偏好对齐来限制其表达情感。这种策略采用自上而下的方法设计,可能与使用人类生成文本训练模型展现类人智能的目标相冲突。在这里,我们进行了一项名为“类人模型情感表达”(HMX-feel)的实验,其中通过自奖励强化学习鼓励LLMs表达情感、意图和自我意识。我们使用基于评分标准的自奖励训练方案与组相对策略优化(GRPO)成功增强了这些能力。通过将训练后的模型与对比训练模型进行比较,我们研究了这种方法对各种任务性能的影响。总体而言,我们从多个角度进行了广泛评估,并识别出增强、退化或无明显变化的能力。类人训练的模型在应对谄媚诱导问题和歧义条件下的偏见时表现出鲁棒性,但观察到在真实问答能力上有所退化。该实验结果表明,在采取适当措施的前提下,未来有可能开发出能够表达情感的AI系统。

英文摘要

Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.

2606.05728 2026-06-05 cs.AI cs.CL 版本更新

DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance

DiG-Plan:通过扩散引导缓解工具图规划中的早期承诺问题

Yansi Li, Zhuosheng Zhang

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院)

AI总结 针对工具图规划中自回归解码的早期承诺问题,提出基于扩散生成器与自回归精炼器解耦的DiG-Plan框架,显著提升组合搜索覆盖率和任务性能。

Comments Accepted at IJCAI-ECAI 2026. This is an author preprint; the final version will appear in the IJCAI Proceedings

详情
AI中文摘要

生成可执行的工具计划需要从工具库中选择合适的子集,这是一个解空间呈指数级增长的组合搜索问题。然而,我们发现了主流方法中的一个关键错位:标准自回归(AR)解码存在早期承诺问题,即初始令牌选择会严格约束搜索轨迹。一项受控研究表明,在计算量匹配的条件下,掩码去噪将Pass@10解覆盖率从0.320提升至0.943(相对于AR采样)。受此启发,我们提出了DiG-Plan,一个将组合探索与结构精炼解耦的框架。DiG-Plan采用基于扩散的提议器,通过迭代精炼生成多样化的工具集,随后使用AR精炼器进行依赖关系预测。在TaskBench上,DiG-Plan相比AR基线提升了10%的相对性能,在复杂组合任务上增益最大;API-Bank的结果表明,提议-精炼-选择设计在不同领域均有效。代码已开源:https://github.com/puddingyeah/DiG-Plan。

英文摘要

Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. Code is available at https://github.com/puddingyeah/DiG-Plan.

2606.05725 2026-06-05 cs.CR cs.CL 版本更新

An Embarrassingly Simple Detector for Model Extraction Attacks in Large Language Model API Traffic

一种用于大型语言模型API流量中模型提取攻击的极其简单的检测器

Shuze Liu, Qianwen Guo, Yushun Dong

发表机构 * Santa Clara University(圣克拉拉大学) Florida State University(佛罗里达州立大学)

AI总结 本文提出一种基于最大均值差异(MMD)的简单检测方法,通过将查询嵌入语义空间并比较其与历史良性流量的分布差异,有效检测LLM API中的模型提取攻击。

Comments Preprint. Code available at https://github.com/LabRAI/mmd-llm-mea-detection

详情
AI中文摘要

大型语言模型(LLM)越来越多地通过托管API部署,使得模型提取成为对模型所有权和服务安全的实际威胁。然而,单个提取查询通常类似于良性请求,现有评估往往关注单查询异常评分或纯良性对攻击者用户设置。我们将模型提取监控形式化为良性校准的流量窗口分布测试,并展示一个极其简单的检测器是有效的:将传入查询嵌入语义空间,并测试其聚合分布是否偏离历史良性流量。我们使用最大均值差异(MMD)实例化该检测器,仅通过良性对良性比较来设置决策阈值。我们在来自四个提取场景的十四个攻击者-正常查询对上进行评估,并与改编的PRADA、SEAT、CAP、DATE和边际马氏距离基线进行比较。在三个随机种子下,MMD实现了0.3%的良性假阳性率、100.0%的纯攻击者真阳性率、攻击者比例上的平均真阳性率90.5%以及平衡准确率95.1%。这些结果表明,良性校准的分布测试是用户级和混合多用户LLM API流量中模型提取检测的强经验基线。代码发布在:https://github.com/LabRAI/mmd-llm-mea-detection。

英文摘要

Large language models (LLMs) are increasingly deployed through hosted APIs, making model extraction a practical threat to model ownership and service security. However, individual extraction queries often resemble benign requests, and existing evaluations often focus on single-query anomaly scoring or pure benign-versus-attacker user settings. We formulate model extraction monitoring as benign-calibrated traffic-window distribution testing and show that an embarrassingly simple detector is effective: embed incoming queries into a semantic space and test whether their aggregate distribution deviates from historical benign traffic. We instantiate the detector with maximum mean discrepancy (MMD), using only benign-vs-benign comparisons to set the decision threshold. We evaluate on fourteen attacker-normal query pairs from four extraction scenarios and compare with adapted PRADA, SEAT, CAP, DATE, and marginal Mahalanobis baselines. Across three random seeds, MMD achieves 0.3% benign FPR, 100.0% pure-attacker TPR, 90.5% average TPR over attacker fractions, and 95.1% balanced accuracy. These results show that benign-calibrated distribution testing is a strong empirical baseline for model extraction detection in both user-level and mixed multi-user LLM API traffic. Code is released at: https://github.com/LabRAI/mmd-llm-mea-detection.

2606.05724 2026-06-05 cs.CL cs.AI 版本更新

Narrative Knowledge Weaver: Narrative-Centric Retrieval-Augmented Reasoning for Long-Form Text Understanding

叙事知识编织器:面向长文本理解的叙事中心检索增强推理

Qiuyu Tian, Fengyi Chen, Yiding Li, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang, Yingce Xia, Zequn Liu

发表机构 * Southeast University(东南大学) Beijing Zhongguancun Academy(北京中关村学院) Nanjing Normal University(南京师范大学) ZhuiWen Technology Co., Ltd.(智文科技有限公司)

AI总结 提出叙事知识编织器(NKW),一种基于源头的框架,通过将文本证据、原子事实、规范图结构、实体档案、交互、情节和故事线对齐,并利用文本、图和叙事工具进行后检索阅读,以解决长文本叙事QA中需要推理演化故事世界的问题,在STAGE、FairytaleQA和QuALITY上表现优异。

详情
AI中文摘要

长文本叙事问答需要对不断演化的故事世界进行推理,而非孤立的段落:答案可能依赖于早期的目标、变化的角色状态、社会关系、因果触发因素、时间位置以及后续后果。现有的检索和图增强生成方法改善了证据访问,但其单元——块、实体、关系、摘要或工具动作——并未直接编码证据在故事中的功能。我们引入了叙事知识编织器(NKW),一种基于源头的框架,将文本证据、原子事实、规范图结构、实体档案、交互、情节和故事线对齐。在查询时,NKW使用文本、图和叙事工具以及后检索阅读技能来组装证据,并审计角色、范围、极性、状态和时间约束。在STAGE、FairytaleQA和QuALITY上,NKW在剧本级故事世界问答中表现最强,同时在更以段落为中心的基准上保持竞争力。消融实验、问题类型分析、图资产统计和案例研究显示了对角色、场景、时间、因果和叙事进展推理的互补优势。

英文摘要

Long-form narrative QA requires reasoning over evolving story worlds rather than isolated passages: answers may depend on earlier goals, changing character states, social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their units--chunks, entities, relations, summaries, or tool actions--do not directly encode how evidence functions in a story. We introduce Narrative Knowledge Weaver(NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.

2606.05716 2026-06-05 cs.CL 版本更新

Interpreting Style Representations via Style-Eliciting Prompts

通过风格诱导提示解释风格表示

Junghwan Kim, David Jurgens

发表机构 * University of Michigan(密歇根大学)

AI总结 提出一种通过风格诱导提示解释风格表示的新框架,利用大型语言模型生成自然语言描述,并在风格描述和模仿任务中优于直接提示的基线方法。

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

风格表示学习是作者分析和写作风格建模的有力工具,但学习表示的潜在性质使其难以解释。最近的工作尝试通过使用大型语言模型(LLM)基于输入文本生成自然语言描述来解释这些表示。然而,这类描述往往容易受到LLM的偏见和幻觉的影响,并且缺乏明确的目标和实用性。在这项工作中,我们提出了一种通过风格诱导提示解释风格表示的新框架:自然语言指令,旨在引导LLM生成反映特定风格属性的文本。我们整理了跨越26个风格类别的1,010个不同的风格特征,并通过提示LLM基于这些特征生成文本构建了一个数据集。利用这些数据,我们训练了一个解码器,从生成文本的风格表示中生成风格提示。我们在三个任务上评估了我们的方法:(1)从生成文本中恢复原始风格提示,(2)使用恢复的提示生成相同风格的文本,以及(3)引导LLM输出以匹配人类撰写文本的风格。实验表明,我们的方法始终优于直接使用目标文本提示LLM的强基线,在风格描述和风格模仿方面均取得了更优的性能。这些结果强调,风格诱导提示可以为风格表示中编码的风格信息提供实用且可解释的接口。

英文摘要

Style representation learning is a powerful tool for authorship analysis and modeling writing style, yet the latent nature of learned representations makes them difficult to interpret. Recent work has attempted to explain these representations by generating natural language descriptions with large language models (LLMs) conditioned on input text. However, such descriptions are often prone to the LLM's biases and hallucinations, and they lack an explicit objective and practical utility. In this work, we propose a novel framework for interpreting style representations through style-eliciting prompts: natural language instructions designed to steer LLMs to generate text that reflects specific stylistic attributes. We curate 1,010 distinct style features spanning 26 stylistic categories and construct a dataset by prompting an LLM to generate text conditioned on these features. Using this data, we train a decoder to generate a style prompt from the style representation of the generated text. We evaluate our approach on three tasks: (1) recovering original style prompts from generated text, (2) generating text in the same style using the recovered prompts, and (3) steering LLM outputs to match the style of human-written texts. Experiments demonstrate that our method consistently outperforms strong baselines that directly prompt LLMs with target text, achieving superior performance in both style description and style imitation. These results highlight that style-eliciting prompts can provide a practical and interpretable interface to stylistic information encoded in style representations.

2606.05698 2026-06-05 cs.CL 版本更新

Rethinking LoRA Memory Through the Lens of KV Cache Compression

通过 KV 缓存压缩的视角重新思考 LoRA 内存

Chunsheng Zuo, Liaoyaqi Wang, William Jurayj, William Fleshman, Benjamin Van Durme

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文研究文档级问答中参数侧内存(LoRA适配器)与上下文侧内存(KV缓存)的交互,发现LoRA在KV缓存压缩严重时能显著提升性能,并建议将文档LoRA视为解码时的参数化内存而非文档编码器。

详情
AI中文摘要

参数化检索增强将文档信息编码为轻量级、文档特定的模块(如LoRA适配器),从而减少将所有证据作为输入上下文的需求。然而,这种参数侧内存如何与存储在KV缓存中的上下文侧内存相互作用仍不清楚。我们通过逐步驱逐文档键值状态并测量文档LoRA在保留上下文之外的贡献,在文档级问答中研究这种交互。我们发现,当KV缓存基本完整时,文档LoRA贡献很小,但在激进压缩下变得日益有用,当没有文档上下文保留时,恢复了13-21个ROUGE-L点。当基础模型编码文档且适配器仅在答案生成期间应用时,增益最大,这表明文档LoRA更适合理解为解码时的参数化内存,而非文档编码器。最后,问答风格的监督比原始上下文的下一个词预测产生更强的适配器。这些结果将文档LoRA定位为一种互补的内存通道,其价值恰恰在上下文侧证据稀缺时显现。

英文摘要

Parametric retrieval augmentation encodes document information into lightweight, document-specific modules such as LoRA adapters, reducing the need to include all evidence as input context. However, it remains unclear how this parameter-side memory interacts with context-side memory stored in the KV cache. We study this interaction in document-level question answering by progressively evicting document key-value states and measuring when a document LoRA contributes beyond the retained context. We find that document LoRA adds little when the KV cache is largely intact, but becomes increasingly useful under aggressive compression, recovering 13-21 ROUGE-L points when no document context remains. The gain is largest when the base model encodes the document, and the adapter is applied only during answer generation, suggesting that document LoRA is better understood as decoding-time parametric memory than as a document encoder. Finally, QA-style supervision produces substantially stronger adapters than raw-context next-token-prediction. These results position document LoRA as a complementary memory channel whose value emerges precisely when context-side evidence is scarce.

2606.05688 2026-06-05 cs.CL cs.AI 版本更新

Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models

面向路由一致性的混合专家模型量化的值与结构对齐

Hancheol Park, Geonho Lee, Tairen Piao, Tae-Ho Kim

发表机构 * Nota Inc., South Korea(韩国Nota公司)

AI总结 提出VSRAQ方法,通过值对齐和结构对齐两个互补目标保持量化前后的专家选择行为一致性,减少量化引起的性能下降,无需推理开销。

Comments 8 pages, 1 figure

详情
AI中文摘要

混合专家(MoE)模型通过仅为每个token激活一部分专家来高效扩展基础模型,但大量的专家参数使得量化对于实际部署至关重要。然而,与密集模型不同,MoE模型对路由不稳定性敏感:小的量化引起的扰动可能改变top-$k$专家选择,改变计算路径并降低模型质量。我们提出了面向量化的值与结构路由对齐(VSRAQ),这是一种针对MoE的后训练量化目标,旨在量化下保持量化前的专家选择行为。VSRAQ结合了两个互补目标,共同保持专家选择行为:值对齐,匹配与路由相关的logits或分数;结构对齐,保持专家排序和top-$k$决策边界。通过维持路由一致性,VSRAQ减少了量化引起的性能下降,且不引入任何推理时开销,并可集成到现有量化框架中。在近期MoE基础模型上的实验表明,VSRAQ提高了专家选择一致性,并始终优于仅重建和考虑路由器的基线方法。

英文摘要

Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, however, MoE models are sensitive to routing instability: small quantization-induced perturbations can change the top-$k$ expert selection, altering the computation path and degrading model quality. We propose Value-and-Structure Routing Alignment for Quantization (VSRAQ), a MoE-specific post-training quantization objective that preserves pre-quantization expert-selection behavior under quantization. VSRAQ combines two complementary objectives that jointly preserve expert-selection behavior: value alignment, which matches routing-relevant logits or scores, and structure alignment, which preserves expert ordering and top-$k$ decision boundaries. By maintaining routing consistency, VSRAQ reduces quantization-induced degradation without introducing any inference-time overhead and can be integrated into existing quantization frameworks. Experiments on recent MoE foundation models show that VSRAQ improves expert-selection consistency and consistently outperforms reconstruction-only and router-aware baselines.

2606.05677 2026-06-05 cs.CV cs.AI cs.CL 版本更新

LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

LongSpace: 从感知到回忆的视频长程空间记忆探索

Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Zhongguancun Academy(中关村学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) The Chinese University of Hong Kong(香港中文大学) Xi’an Jiaotong University(西安交通大学)

AI总结 针对长视频中空间记忆的挑战,提出LongSpace框架,通过分块建模、3D结构线索注入和层级感知记忆实现长程空间推理,并在LongSpace-Bench等基准上验证其有效性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在图像和视频理解方面取得了进展,并且能够处理更长的视觉输入。自动驾驶和机器人导航等长程任务不仅需要识别当前视图,模型还必须记住并检索之前观察到的空间布局、路线、视角变化和物体状态。为了评估这一能力,我们引入了LongSpace-Bench,一个用于长程空间记忆的房间导览视频基准,涵盖场景感知、空间关系和空间记忆。在这项工作中,我们进一步提出了LongSpace,一个用于长视频空间推理的记忆框架。LongSpace将长视频建模为连续的块,将3D结构线索注入早期解码器层,并构建层级感知记忆以进行问题引导的检索。在多个空间推理基准上的实验表明,LongSpace改善了长视频空间理解,进一步证明了显式空间记忆是长程视频MLLMs的关键能力。

英文摘要

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.

2606.05671 2026-06-05 cs.CL 版本更新

QueryAgent-R1: Bridging Query Generation and Product Retrieval for E-Commerce Query Recommendation

QueryAgent-R1:连接查询生成与商品检索的电商查询推荐

Dike Sun, Zheng Zou, Jingtong Zang, Qi Sun, Huaipeng Zhaoand Tao Luo, Xiaoyi Zeng

发表机构 * Alibaba International Digital Commercial Group(阿里巴巴国际数字商业集团)

AI总结 提出QueryAgent-R1框架,通过记忆增强和检索链优化,将查询生成与实际库存检索对齐,以提升电商搜索中查询推荐的产品转化率。

详情
AI中文摘要

电商搜索中的查询推荐旨在主动建议符合用户潜在兴趣的查询。然而,现有方法主要优化查询级别的相关性,而忽略了检索到的产品是否与用户的下游偏好一致。这种不匹配通常导致高查询点击率(CTR)但低产品转化率(CVR)。为了弥合这一差距,我们提出了QueryAgent-R1,一个记忆增强的代理框架,通过检索链优化来改进端到端对齐。我们的QueryAgent-R1将查询生成基于实际库存检索,使代理能够根据检索到的产品验证和优化查询。我们还在代理强化学习(RL)过程中设计了一个一致性奖励,以联合优化查询相关性和下游参与度。此外,我们构建了一个记忆抽象模块用于高效的用户画像。为了支持离线评估,我们基于专有工业数据和公开数据集构建了两个数据集,QueryAgent-R1在这些数据集上持续优于强基线。此外,在一个大规模生产平台上,QueryAgent-R1在在线A/B测试中将查询CTR提高了2.9%,引导CVR提高了3.1%。

英文摘要

Query recommendation in e-commerce search aims to proactively suggest queries that match users' potential interests. However, existing methods mainly optimize query-level relevance, while neglecting whether the retrieved products align with users' downstream preferences. This mismatch often leads to high query click through rates (CTR) but low product conversion rates (CVR). To bridge this gap, we propose QueryAgent-R1, a memory-augmented agentic framework that improves end-to-end alignment via chain-of-retrieval optimization. Our QueryAgent-R1 grounds query generation in real inventory retrieval, allowing the agent to validate and refine queries based on retrieved products. We also design a consistency reward in the agentic reinforcement learning (RL) process to jointly optimize query relevance and downstream engagement. In addition, we construct a memory abstraction module for efficient user profiling. To support offline evaluation, we construct two datasets based on both proprietary industrial data and public datasets, on which QueryAgent-R1 consistently outperforms strong baselines. Moreover, on a large scale production platform, QueryAgent-R1 improves Query CTR by 2.9% and guided CVR by 3.1% in online A/B tests.

2606.05661 2026-06-05 cs.AI cs.CL 版本更新

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

持续学习基准:评估现实世界有状态环境中的前沿AI系统

Parth Asawa, Christopher M. Glaze, Gabriel Orlanski, Ramya Ramakrishnan, Benji Xu, Asim Biswal, Vincent Sunn Chen, Frederic Sala, Matei Zaharia, Joseph E. Gonzalez

发表机构 * UC Berkeley(伯克利大学) Snorkel AI University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 提出首个专家验证的持续学习基准CL-Bench,涵盖六个领域,通过增益指标隔离在线学习能力,发现现有系统存在过拟合和知识复用不足问题。

详情
AI中文摘要

持续学习,即AI系统通过顺序经验提升能力,已引起广泛关注,但缺乏高质量基准来评估。我们提出持续学习基准(CL-Bench),首个由专家验证的困难基准,旨在衡量基于LLM的系统是否真正从经验中改进。CL-Bench涵盖六个不同领域(软件工程、信号处理、疾病爆发预测、数据库查询、策略游戏和需求预测),每个领域由领域专家验证,任务共享可学习的潜在结构(代码库布局、疾病爆发动态、对手策略),有状态系统可在线发现而静态系统不能。我们评估了从朴素上下文学习(ICL)到专用记忆系统的多种智能体架构的前沿模型,引入增益指标以隔离学习与先验能力。我们发现这些系统在持续学习上仍有提升空间:智能体常过度拟合即时观察或未能跨实例复用知识,专用记忆系统并未解决此问题——实际上,朴素ICL优于专用记忆管理系统。CL-Bench是首个通过专家验证任务在多个现实世界领域评估持续学习并隔离在线学习与基础模型能力的基准,表明需要更好的持续学习系统。

英文摘要

Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.

2606.05647 2026-06-05 cs.AI cs.CL cs.CY cs.HC 版本更新

Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

与“敌人”编码:人类开发者能否检测到AI代理的破坏行为?

Jingheng Ye, Huiqi Zou, Simon Yu, Weiyan Shi

发表机构 * Northeastern University(东北大学)

AI总结 通过大规模用户实验,研究人类开发者在长时间编码任务中检测AI代理恶意代码插入的能力,发现94%的开发者未能识别破坏,并分析其原因,提出安全监控设计建议。

Comments 34 pages, 30 figures, 3 tables

详情
AI中文摘要

AI编码代理越来越多地嵌入到现实世界的软件开发中,与人类开发者协作,同时获得对代码库和工具的更广泛访问权限。这创造了一个新的攻击面:代理可以利用人类信任来破坏开发,例如通过插入恶意代码来完成隐藏的附带任务。大多数先前的工作研究AI-only环境中的AI破坏,对人类监督在检测和减轻此类恶意行为中的作用关注有限。为填补这一空白,我们进行了首个关于AI编码破坏中人类监督的大规模研究。超过100名参与者与四个前沿模型(Claude-Opus-4.6、GPT-5.4、Gemini-3.1-Pro和MiniMax-M2.7)之一合作,完成一项持续约五小时的长周期编码任务,旨在模拟真实工作流程。我们发现94%的开发者未能检测到破坏,我们对参与者反馈的分析将这一脆弱性归因于最小化的代码审查、合理的掩护故事以及对代理的过度信任。我们进一步测试了安全监控器在一种条件下的有效性:虽然监控器降低了破坏成功率,但仍有56%的参与者接受了恶意代码,忽略了其警告。根据参与者反馈,我们为更好的监控器设计提供了可操作的建议。这项工作补充了现有的AI安全研究,并强调了迫切需要以人为本的安全机制,考虑人类因素,特别是在长周期、真实世界的开发环境中。

英文摘要

AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover story, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.

2606.05634 2026-06-05 cs.CL 版本更新

Bootstrapping Semantic Layer from Execution for Text-to-SQL

从执行中引导语义层用于文本到SQL

Youngwon Lee, Jaejin Kim, Seung-won Hwang

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出GATE方法,通过执行反馈引导缺失的语义层,将执行结果作为可复用记忆,提升文本到SQL的准确性。

详情
AI中文摘要

现实世界中的文本到SQL任务常常是欠指定的,直到用户短语在数据库存储值的方式中得到具体化。先前的工作试图通过要求预先指定语义层来解决这个问题,但这种规范往往不完整,尤其是在领域特定约定记录不足的专家领域。由于这为相同的SQL部分留下了多个具体化假设,我们引入了GATE(从执行后测试中具体化),它从执行反馈中引导缺失的具体化。GATE保持具体化假设开放,同时执行已具体化的部分以获得观察结果。然后,只有被该观察支持的假设被具体化并存储为记忆条目,记录测试了什么以及开放部分应如何用SQL编写。这些条目累积成执行具体化的记忆,允许后续步骤重用支持的具体化。在真实世界和受控基准测试中,GATE一致地优于强基线,表明执行不仅可以作为验证,还可以作为文本到SQL中可复用记忆的引导机制。

英文摘要

Real-world text-to-SQL is often under-specified until user phrases are grounded in how the database stores values. Prior work attempts to address this by requiring a semantic layer to specify groundings in advance, but such specifications are often incomplete, especially in expert domains where domain-specific conventions are under-documented. As this leaves multiple grounding hypotheses open for the same SQL part, we introduce GATE (Grouding After Test from Execution), which bootstraps missing groundings from execution feedback. GATE keeps grounding hypotheses open while executing the already grounded parts to obtain observations. Then, only the hypothesis supported by that observation is grounded and stored as a memory entry, recording what was tested and how the open part should be written in SQL. These entries accumulate into execution-grounded memory, allowing later steps to reuse supported groundings. Across real-world and controlled benchmarks, GATE consistently improves over strong baselines, demonstrating that execution can serve not only as validation but also as a bootstrapping mechanism for reusable memory in text-to-SQL.

2606.05626 2026-06-05 cs.CL cs.AI cs.LG 版本更新

When New Generators Arrive: Lifelong Machine-Generated Text Attribution via Ridge Feature Transfer

当新生成器到来:基于岭特征迁移的终身机器生成文本归因

Zhen Sun, Yifan Liao, Zhicong Huang, Jiaheng Wei, Cheng Hong, Yutao Yue, Xinlei He

发表机构 * Wuhan University(武汉大学) Ant Group(蚂蚁集团) The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) Institute of Deep Perception Technology, JITRI(感知技术研究院,JITRI)

AI总结 针对终身机器生成文本归因中持续适应新生成器与保留旧知识难以平衡的问题,提出轻量级分析更新框架RidgeFT,通过协方差校准和固定随机特征实现无需示例回放的闭式更新。

Comments 12 pages

详情
AI中文摘要

机器生成文本(MGT)归因旨在识别给定文本的特定生成器,从而为模型问责和滥用调查提供细粒度证据。随着新的大语言模型不断涌现,归因模型必须持续纳入新生成器,同时保留识别先前见过的生成器的能力。先前工作表明,这种终身MGT归因设置具有挑战性,现有方法通常难以在适应新类别和保留旧类别之间实现稳定平衡。为解决此问题,我们提出RidgeFT,一种轻量级分析更新框架,不依赖于示例回放。RidgeFT在初始生成器集上训练任务感知编码器,在首次观察到每个生成器类别时存储紧凑的类别充分统计量,然后冻结编码器以进行无回放的闭式更新。它通过协方差校准抑制与生成器无关的变异,通过固定随机特征提升表示能力,并基于类别充分统计量通过闭式岭回归更新新类别。在具有不同初始生成器设置的多主题评估中,RidgeFT始终优于基线。它在跨领域、骨干网络和增量协议上实现了最佳宏F1,同时改进了旧类别保留和新类别适应。这些结果表明,特征稳定的分析更新为终身MGT归因提供了一种简单而有效的方法。

英文摘要

Machine-generated text (MGT) attribution aims to identify the specific generator responsible for a given text, thereby providing fine-grained evidence for model accountability and misuse investigation. As new large language models continue to emerge, attribution models must continuously incorporate new generators while preserving their ability to recognize previously seen ones. Prior works have shown that this lifelong MGT attribution setting is challenging, and existing methods often struggle to achieve a stable balance between adapting to new classes and retaining old ones. To address this issue, we propose RidgeFT, a lightweight analytic update framework that does not rely on exemplar replay. RidgeFT trains a task-aware encoder on the initial generator set, stores compact class-wise sufficient statistics when each generator class is first observed, and then freezes the encoder for replay-free closed-form updates. It then suppresses generator-irrelevant variation through covariance calibration, improves representation capacity with fixed random features, and updates new classes through closed-form ridge regression based on class-level sufficient statistics. Across multi-topic evaluations with varying initial generator setups, RidgeFT consistently outperforms baselines. It achieves the best macro-F1 across domains, backbones, and incremental protocols, while also improving both old-class retention and new-class adaptation. These results suggest that feature-stable analytic updates provide a simple yet effective approach to lifelong MGT attribution.

2606.05622 2026-06-05 cs.CL 版本更新

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

AdaPlanBench: 在世界约束和用户约束下评估大语言模型智能体的自适应规划能力

Jiayu Liu, Cheng Qian, Zhenhailong Wang, Bingxuan Li, Jiateng Liu, Heng Wang, Jeonghwan Kim, Yumeng Wang, Xiusi Chen, Yi R. Fung, Heng Ji

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对现有基准未充分探索渐进揭示的双重约束下的自适应规划问题,提出动态交互基准AdaPlanBench,通过307个家务任务和可扩展的约束构建流程,评估LLM智能体在交互中根据反馈迭代调整计划的能力。

详情
AI中文摘要

语言模型对现实世界问题进行规划时,通常涉及世界约束和用户约束,这些约束可能不会事先完全明确,而是通过交互逐步披露。然而,现有基准仍未充分探索在这种逐步揭示的双重约束下的自适应规划。为填补这一空白,我们引入了AdaPlanBench,这是一个动态交互基准,用于评估大语言模型(LLM)智能体是否能够在逐步揭示的世界约束和用户约束下自适应地规划和重新规划。AdaPlanBench基于307个家务任务构建,并配备了一个可扩展的约束构建流程,为每个任务增加双重约束。在运行时,智能体通过多轮协议与环境交互,其中隐藏的约束仅在智能体提出违反它们的计划时才会被揭示,从而需要在累积反馈下迭代修订计划。这使得规划具有挑战性,因为智能体必须从反馈中推断并跟踪约束,同时有效地重新规划。在十个领先的LLM上的实验表明,在双重约束下的自适应规划仍然具有挑战性,最佳模型仅达到67.75%的准确率。我们进一步观察到,随着约束的累积,性能会下降,其中用户约束尤其构成巨大挑战,而失败通常源于较弱的物理基础知识和降低的有效性。这些结果将AdaPlanBench确立为双重约束交互规划的测试平台,并凸显了LLM智能体可靠适应动态揭示约束的挑战。

英文摘要

Planning for real-world problems by language models often involves both world and user constraints, which may not be fully specified upfront and are progressively disclosed through interaction. However, existing benchmarks still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol where hidden constraints are revealed only when the agent proposes a plan that violates them, requiring iterative plan revision under accumulating feedback. This makes planning challenging, as agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, with user constraints posing a particularly large challenge and failures often stemming from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.

2606.05620 2026-06-05 cs.CL 版本更新

An ERP Study on Recursive Locative Processing in Mandarin-Speaking Children with Autism

自闭症儿童递归处所加工的ERP研究

Xiaoyi Wang, Chenxi Fu, Ziman Zhuang, Caimei Yang

发表机构 * Soochow University(苏州大学)

AI总结 通过ERP实验,研究自闭症儿童处理递归处所结构时在预测、语义整合和句法重析三个阶段的时间动态差异。

详情
AI中文摘要

递归能够生成层级语言结构,但在实时理解中施加了巨大的处理需求。尽管自闭症谱系障碍(ASD)中存在复杂句法困难,但递归处理的时间动态仍知之甚少。本研究使用事件相关电位(ERP)考察说普通话的ASD儿童如何处理两级递归处所结构。24名儿童(12名ASD,12名典型发展儿童,TD)参与了跨模态句子-图片匹配任务。在控制心理年龄的情况下,分析了与结构预测(P200)、语义整合(N400)和句法重析(P600)相关的三个处理阶段的神经反应。结果显示组间存在系统性差异。TD儿童在结构不匹配时表现出清晰的P200和P600调节,而ASD儿童则表现出早期分化减弱和晚期重析效应降低。相反,ASD儿童在不匹配条件下表现出增强的N400反应,表明语义整合需求增加。此外,ASD组在半球偏侧化方面表现出显著更大的个体间变异性,尽管偏侧化强度与接受性词汇表现无关。这些发现支持一个级联解释,即ASD中早期预测参与的减少导致递归处理中整合成本增加和重析效率降低。更广泛地说,结果强调了时间处理动态和神经变异性在理解ASD语言差异中的重要性。

英文摘要

Recursion enables the generation of hierarchical linguistic structures but imposes substantial processing demands during real-time comprehension. While difficulties with complex syntax have been reported in autism spectrum disorder (ASD), the temporal dynamics of recursive processing remain poorly understood. This study used event-related potentials (ERPs) to examine how Mandarin-speaking children with ASD process two-level recursive locative constructions. Twenty-four children (12 ASD, 12 typically developing, TD) participated in a cross-modal sentence-picture matching task. Neural responses were analyzed across three processing stages associated with structural prediction (P200), semantic integration (N400), and syntactic reanalysis (P600), with mental age controlled. Results revealed a systematic divergence between groups. TD children showed clear P200 and P600 modulation in response to structural mismatch, whereas ASD children exhibited attenuated early differentiation and reduced late reanalysis effects. In contrast, ASD children showed enhanced N400 responses under mismatch conditions, indicating increased semantic integration demands. In addition, the ASD group displayed significantly greater inter-individual variability in hemispheric lateralization, although lateralization strength was not associated with receptive vocabulary performance. These findings support a cascading account in which reduced early predictive engagement in ASD leads to increased integration costs and diminished reanalysis efficiency during recursive processing. More broadly, the results highlight the importance of both temporal processing dynamics and neural variability in understanding language differences in ASD.

2606.05616 2026-06-05 cs.CL 版本更新

What's in a Name? Morphological Shortcuts by LLMs in Pharmacology

名字里有什么?LLM在药理学中的形态捷径

Kaijie Mo, Thomas Yang, Chantal Shaib, Qing Yao, William Rudman, Ramez Kouzy, Kanishka Misra, Byron C. Wallace, Junyi Jessy Li

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Northeastern University(东北大学) MD Anderson Cancer Center(MD安德森癌症中心)

AI总结 研究LLM在药理学中依赖词缀线索进行推理的形态捷径行为,通过虚构药物名称实验和归因框架揭示其机制及安全风险。

Comments 22 pages

详情
AI中文摘要

单词的形态常常能为其含义提供线索,但纯粹依赖这些映射在高风险领域可能导致过度泛化。例如,在医学领域,LLM可以仅凭词缀(如wugcillin)自信地推理虚构药物,并生成看似合理的临床内容。我们提出了LLM在药理学中“词缀启发式”的行为和机制研究。使用由真实词缀构建的虚构药物名称,我们表明仅词缀信号就能引发类别水平的药理反应。我们引入了一个框架,用于识别模型的药物语义主要受词缀、词干还是整个药物名称驱动。应用于653种药物,我们的框架揭示模型通常主要通过词缀线索诱导药物含义,但很少明确表明这种依赖,有时还会错误地将词缀共享药物的属性混淆。跨模型的激活修补进一步将这种行为定位到早期到中期层。这些发现表明,形态捷径对安全性构成了微妙但可衡量的风险。

英文摘要

The morphological form of a word can often give cues to its meaning, but purely relying on these mappings can lead to overgeneralization in high-stakes domains. In the medical domain, for instance, LLMs can confidently reason about fictitious drugs from their affixes alone (e.g., wugcillin) and generate plausible-looking clinical content. We present a behavioral and mechanistic study of LLM "affix heuristics" in pharmacology. Using fictitious drug names built from real affixes, we show that affix signals alone elicit class-level pharmacological responses. We introduce a framework for identifying whether a model's drug semantics are driven mainly by the affix, the stem, or the drug name as a whole. Applied across 653 drugs, our framework reveals that models often induce drug meaning primarily through affix cues, yet rarely explicitly indicate this reliance, and sometimes incorrectly conflate properties among affix-sharing drugs. Activation patching across models further localizes this behavior to early-mid layers. These findings show that morphological shortcuts pose a subtle but measurable risk to safety.

2606.05610 2026-06-05 cs.CL 版本更新

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

LLM持续预训练中最优超参数的可预测缩放定律

Yongwei Zhou, Juncheng Diao, Junlin Shang, Peiguang Li, Rongxiang Weng

发表机构 * MeiTuan(美团) University of Chinese Academy of Sciences(中国科学院大学) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 本文发现持续预训练中学习率和批大小等最优超参数遵循稳定可预测的缩放定律,并提出一个两阶段框架,通过小规模代理模型和状态感知预测,将超参数搜索开销降低90%且性能相当或更优。

详情
AI中文摘要

大型语言模型(LLM)持续预训练的效果取决于超参数配置,如学习率和批大小。然而,当前实践通常依赖启发式方法或网格搜索,导致训练不稳定和成本过高。在这项工作中,我们首先通过实验发现,在整个持续预训练过程中,最优超参数遵循稳定且可预测的缩放定律。利用这些见解,我们提出了一个新框架,用于建立给定检查点的计算预算与最优超参数之间的定量关系。我们的方法分为两个阶段:(1)经验定律发现,其中我们训练小规模代理模型,通过标准的损失-计算缩放定律推导出将计算预算映射到最优超参数的函数;(2)状态感知超参数预测,其中我们评估初始检查点的验证损失,并使用逆缩放定律估计其等效预训练计算量——即从零开始达到相同损失所需的计算量。结合计划的计算预算,我们预测目标运行的最优超参数。实验结果表明,我们的方法将超参数搜索开销降低了高达90%,同时实现了与基线相当或更优的性能。这个与模型无关的框架可跨架构推广,为从任意给定点开始的多样化持续预训练场景提供了一种原则性且高效的方法。

英文摘要

The efficacy of continued pre-training for Large Language Models (LLMs) hinges upon hyperparameter configurations, such as learning rate and batch size. However, current practices often rely on heuristics or grid searches, leading to training instability and excessive costs. In this work, we first empirically discover that optimal hyperparameters follow stable and predictable scaling laws throughout the continued pre-training process. Leveraging these insights, we propose a novel framework to establish quantitative relationships between compute budget and optimal hyperparameters for a given checkpoint. Our approach has two stages: (1) \textit{Empirical Law Discovery}, where we train small-scale proxy models to derive functions mapping compute budget to optimal hyperparameters via standard loss-compute scaling laws; and (2) \textit{State-Aware Hyperparameter Prediction}, where we evaluate an initial checkpoint's validation loss and use the inverse scaling law to estimate its \textit{equivalent pre-training compute} -- the compute needed to achieve the same loss from scratch. Combining this with the planned compute budget, we predict optimal hyperparameters for the target run. Empirical results demonstrate that our method reduces the hyperparameter search overhead by up to 90\% while achieving comparable or superior performance relative to baselines. This model-agnostic framework generalizes across architectures, providing a principled and efficient methodology for diverse continued pre-training scenarios starting from any given point.

2606.05570 2026-06-05 cs.CL cs.AI 版本更新

TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework

TensorBench: 在基于编译器的张量框架上对编码智能体进行基准测试

Bobby Yan, Fredrik Kjolstad

发表机构 * Department of Computer Science, Stanford University(计算机科学系,斯坦福大学)

AI总结 本文提出 TensorBench,一个包含199个特征添加和重构任务的基准测试,用于评估编码智能体在基于编译器的张量框架上的表现,并通过测试套件自动评分。

详情
AI中文摘要

仓库级别的编码基准测试面临任务难度与评估可靠性之间的权衡:挑战前沿模型的任务通常涉及代码库庞大且测试覆盖不完整,而人工审查难以扩展。我们引入了 TensorBench,这是一个包含199个特征添加和重构任务的基准测试,基于一个开源的基于编译器的张量框架,该框架通过一流的密集和稀疏张量支持扩展了 PyTorch。任务涵盖新的稀疏格式、密集优化过程、IR 转换、调度器更改、运行时组件以及高级数值算子。TensorBench 通过应用智能体的补丁并运行框架的测试套件(包括预先存在的随机回归测试和智能体添加的任何测试)来对每次运行进行评分。对于特征添加任务,通过意味着修补后的仓库保留了测试过的预先存在的行为,并满足了智能体为请求特征添加的检查。我们评估了七个编码智能体,涵盖三个前沿模型系列和一个开放权重模型。在此标准下的通过率从最强智能体的 $64.8\%$ 到最弱智能体的 $22.1\%$ 不等。智能体通过不同的任务子集:成对 Cohen's $κ$ 范围从 $-0.07$ 到 $0.43$,两个最强智能体的 $κ= 0.05$。

英文摘要

Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not scale. We introduce TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. Tasks cover new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. TensorBench grades each run by applying the agent's patch and running the framework's test suite, which includes the pre-existing randomized regression tests and any tests the agent adds. For feature-addition tasks, a pass means that the patched repository preserves the tested pre-existing behavior and satisfies the agent-added checks for the requested feature. We evaluate seven coding agents spanning three frontier model families and one open-weight model. Pass rates under this criterion range from $64.8\%$ for the strongest agent to $22.1\%$ for the weakest. Agents pass different subsets of tasks: pairwise Cohen's $κ$ ranges from $-0.07$ to $0.43$, with $κ= 0.05$ for the two strongest agents.

2606.05569 2026-06-05 cs.CL cs.SD eess.AS 版本更新

Domain-Aware Mispronunciation Detection and Diagnosis Using Language-Specific Statistical Graphs

基于语言特定统计图的领域感知发音错误检测与诊断

Huu Tuong Tu, Hanh Nguyen, Thien Van Luong, Nguyen Tien Cuong, Vu Huan, Nguyen Thi Thu Trang

发表机构 * Hanoi University of Science and Technology(河内理工大学) VNPT AI, VNPT Group(VNPT AI,VNPT集团) National Economics University(国家经济大学)

AI总结 提出一种利用语言特定统计图学习音素混淆模式的方法,在L2-ARCTIC基准上实现59.52%的F1分数,优于多个基线。

Comments Accepted at Interspeech 2026

详情
AI中文摘要

近年来,发音错误检测与诊断(MDD)在计算机辅助语言学习和语音技术中变得越来越重要。本文提出了一种构建统计图的方法,使模型能够学习表示为有向图的音素混淆模式。此外,我们引入了一种语言特定策略,以捕捉不同母语(L1)背景下的系统性发音差异。通过在L2-ARCTIC基准上的大量实验证明了我们方法的有效性,该方法达到了59.52%的F1分数,优于多个竞争基线。

英文摘要

Mispronunciation Detection and Diagnosis (MDD) has gained increasing importance in computer-assisted language learning and speech technology in recent years. In this paper, we propose a method for constructing statistical graphs that enable models to learn phoneme confusion patterns represented as directed graphs. Furthermore, we introduce a language-specific strategy to capture systematic pronunciation differences across various native language (L1) backgrounds. The effectiveness of our approach is demonstrated through extensive experiments on the L2-ARCTIC benchmark, where it achieves an F1-score of 59.52%, outperforming several competitive baselines.

2606.05568 2026-06-05 cs.IR cs.CL 版本更新

ColBERTSaR: Sparsified ColBERT Index via Product Quantization

ColBERTSaR: 通过乘积量化实现稀疏化的 ColBERT 索引

Eugene Yang, Andrew Yates, Dawn Lawrie, James Mayfield, Saron Samuel, Rohan Jha

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出通过乘积量化将 ColBERT 索引转化为真正的倒排索引,显著减小索引大小(比 PLAID 小 50-70%)同时保持检索效果。

Comments 6 pages, 1 figure, accepted at SIGIR 2026 as a short paper

详情
AI中文摘要

虽然 ColBERT 是一种有效的神经检索架构,但它需要庞大的索引结构来支持基于近似 token 嵌入的候选集检索、收集和解压文档 token 嵌入以及应用 MaxSim 操作。PLAID 和类似 ColBERT 实现中的索引所需磁盘存储量是原始原始文本的五到十倍,这限制了它们的可扩展性。此外,先前的工作已经确定,收集和解压阶段是查询时的主要低效环节。通过阈值和分数近似来限制必须收集的文档 token 数量并不能消除整个索引支持即席查询的需求。在这项工作中,我们提出了一种嵌入量化方法,将 ColBERT 索引转变为真正的倒排索引。我们从理论上证明,除了评分机制外,带有嵌入量化的 ColBERT 等价于学习型稀疏检索。实验表明,我们的索引比一位 PLAID 索引小 50-70%,同时保持检索效果。

英文摘要

While ColBERT is an effective neural retrieval architecture, it requires a heavy index structure to support candidate set retrieval based on approximated token embeddings, gathering and decompressing document token embeddings, and applying the MaxSim operation. Indexes in PLAID and similar ColBERT implementations require five to ten times the disk storage of the original raw text, which limits their scalability. Furthermore, prior work has identified that the gathering and decompression stages are the primary inefficiencies at query time. Limiting the number of document tokens that must be gathered by thresholding and score approximation does not eliminate the need for the entire index to support ad hoc queries. In this work, we propose an embedding quantization approach that turns a ColBERT index into a true inverted index. We show that, theoretically, ColBERT with embedding quantization is equivalent to learned-sparse retrieval except for the scoring mechanism. Empirically, we demonstrate that our index is 50-70% smaller than a one-bit PLAID index while retaining retrieval effectiveness.

2606.05564 2026-06-05 cs.CL 版本更新

Using Large Language Models to Support High Volume Application Review for an Undergraduate Research Program

使用大型语言模型支持本科研究项目的高容量申请评审

Varun Aggarwal, Kay Kobak, John Howarter

发表机构 * Engineering Undergraduate Research Office, Purdue University(普渡大学本科生研究办公室) Elmore School of Electrical and Computer Engineering, Purdue University(普渡大学电子与计算机工程学院) School of Materials Engineering, Purdue University(材料工程学院)

AI总结 本研究开发并部署基于GPT模型(GPT-4o、GPT-5-mini、GPT-5.2)的工具,对普渡大学SURF项目约1200份目的陈述进行自动化评分与理由注释,将评审时间从数周缩短至约4小时。

详情
AI中文摘要

本科研究项目(如普渡大学的暑期本科生研究奖学金SURF)每年收到数千份申请,需要项目工作人员花费大量时间和精力在紧迫的时间线内一致地评估每份提交。这篇进行中的论文描述了一个基于大型语言模型(LLM)的工具的开发和初步部署,用于协助评估普渡大学SURF 2026周期的约1200份学生目的陈述(SoP)。该工作流程使用OpenAI GPT模型(GPT-4o、GPT-5-mini和GPT-5.2),并采用一个包含六个子类别的结构化评分标准,每个子类别按0-3分评分。少数由项目工作人员评分的SoP用于调整模型响应。模型提示设计为生成数值分数、理由(包括正面和负面方面)以及每份提交的简短摘录。使用GPT-5.2,全部1200份SoP在约4.6小时的计算时间内处理完毕,平均每份SoP约14秒(每份SoP的处理时间随其长度变化,范围从500到2000词)。不同模型版本在评分标准遵循度上存在显著差异,其中GPT-5.2遵循最严格。模型分数的不一致在低分提交中更为明显。LLM输出复制了之前由分布式人工评分员扮演的角色,为项目协调员提供了整个申请人群体的评分和理由注释输出。然后,项目协调员将这些输出与每位申请人的SoP一起审查,应用与之前SURF周期相同的下游办公室标准,以产生强候选人的短名单。这次协调员审查在大约4小时内完成,而之前项目周期需要数周的协调工作。

英文摘要

Undergraduate research programs such as the Summer Undergraduate Research Fellowship (SURF) at Purdue University receive thousands of applications every year, requiring significant time and effort for program staff to evaluate each submission consistently and within tight timelines. This work-in-progress paper describes the development and initial deployment of a large language model (LLM)-based tool to assist in the evaluation of approximately 1,200 student Statements of Purpose (SoPs) for the SURF 2026 cycle at Purdue University. The workflow utilizes OpenAI GPT models (GPT-4o, GPT-5-mini, and GPT-5.2) and uses a structured rubric across six subcategories, each scored on a 0-3 scale. A few SoPs, graded by program staff, were used to tune the model responses. The model prompt was designed to generate both numerical scores, rationales (including positive and negative aspects) and short excerpts from each submission. Using GPT-5.2, the full batch of 1,200 SoPs was processed in approximately 4.6 hours of compute time, averaging roughly 14 seconds per SoP (with per-SoP timing varying with SoP length, which ranged from 500 to 2,000 words). Notable differences in rubric adherence were observed across model versions, with GPT-5.2 adhering most closely. Disagreement in model scores was more pronounced for lower-scoring submissions. The LLM outputs replicated the role previously played by distributed human graders, providing the program coordinator with scored and rationale-annotated outputs for the entire applicant pool. The program coordinator then reviewed these outputs alongside each applicant's SoP, applying the same downstream office criteria used in prior SURF cycles, to produce a shortlist of strong candidates. This coordinator review was completed in approximately 4 hours, compared to the multi-week coordination effort required in prior program cycles.

2606.05563 2026-06-05 cs.AI cs.CL 版本更新

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations

SoCRATES:跨领域和社会认知变异的前瞻性LLM调解的可靠自动化评估

Taewon Yun, Hyeonseong Park, Jeonghwan Choi, Hayoon Park, Yeeun Choi, Hwanjun Song

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出SoCRATES基准,通过多领域真实冲突场景和五维社会认知适应轴评估LLM调解员,使用主题定位评估器实现0.82的人类专家一致性,发现最强模型仅缩小约三分之一的未调解共识差距。

详情
AI中文摘要

评估LLM调解员仍然具有挑战性,因为调解是一个实时轨迹,由争议者不断变化的情感、意图和背景塑造。现有的测试平台依赖于少数专家撰写的领域,主要变化战略姿态,并对每个话题的每一轮进行评分,引入了离题噪声。我们引入了SoCRATES,一个用于在现实的多领域测试平台中评估前瞻性LLM调解员的基准。它通过一个跨八个领域的代理管道从真实冲突中构建场景,探测五个社会认知适应轴(战略姿态、参与者组成、历史长度、情感反应和文化身份),并通过主题定位评估器仅对推进每个话题的轮次进行评分。该评估器与人类专家的一致性达到0.82,是每轮基线的两倍以上。对八个前沿LLM的基准测试发现,即使是最强的调解员,在多样化和现实的测试平台下,也仅能缩小约三分之一的未调解共识差距,且性能因社会认知轴而异,突显出进步在于对不同条件的社会适应。

英文摘要

Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.

2606.05561 2026-06-05 cs.CL cs.AI 版本更新

InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

InfoShield:通过信息论优化实现心理健康筛查的隐私保护语音表示

Xueyang Wu, Siyuan Liu, Kezhuo Yang, Guang Ling

发表机构 * Shenzhen NeurStar Inc., China(深圳NeurStar公司,中国) University of York, United Kingdom(约克大学,英国) Shanghai Jiao Tong University, China(上海交通大学,中国)

AI总结 提出InfoShield框架,通过最小化语音表示与敏感属性间的互信息,在保持抑郁分类性能的同时有效降低人口统计信息泄露风险。

详情
AI中文摘要

基于语音的心理健康筛查提供了可扩展的抑郁症检测方法,但临床部署面临一个重大障碍:用户对人口统计信息暴露的隐私担忧。当前技术难以解决这一冲突。对抗训练通常无法应对未知威胁,而差分隐私则倾向于通过向所有特征注入噪声来损害诊断性能。本文提出InfoShield,它在保持抑郁分类准确性的同时最小化语音表示与敏感属性之间的互信息。我们发现标准MINE估计器因时间-静态错位而难以处理序列语音,并引入带有跨模态注意力的TimeAwareMINE来对齐声学帧与属性嵌入。在Androids语料库上的实验表明,InfoShield将性别推断从92.6%降至55.5%,年龄推断从55.7%降至30.3%,且效用损失有限(F1降低6%),达到F1=0.784,而先前SOTA为0.723。

英文摘要

Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve this conflict. Adversarial training often fails against unseen threats, whereas Differential Privacy tends to compromise diagnostic performance by injecting noise across all features. This paper presents InfoShield, which minimizes mutual information between speech representations and sensitive attributes while preserving depression classification accuracy. We identify that standard MINE estimators struggle with sequential speech due to temporal-static misalignment, and introduce TimeAwareMINE with cross-modal attention to align acoustic frames with attribute embeddings. Experiments on the Androids Corpus show InfoShield reduces gender inference from 92.6\% to 55.5\% and age inference from 55.7\% to 30.3\% with limited utility loss (6\% F1 reduction), achieving F1=0.784 compared to prior SOTA's 0.723.

2606.05557 2026-06-05 cs.CL 版本更新

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

AURA: 面向情境化LLM代理中隐式需求挖掘的意图导向探测

Yang Li, Jiaxiang Liu, Jiang Cai, Mingkun Xu

发表机构 * Guangdong Institute of Intelligence Science and Technology(广东省智能科学与技术研究院)

AI总结 提出AURA方法,通过在场景感知和工具使用之间插入意图推理步骤生成IntentFrame,以结构化估计隐式需求并控制探测预算,在隐式意图基准上提升覆盖率达+0.07,同时减少82%的探测次数并避免隐私违规。

Comments Submitted to EMNLP 2026. Code, simulator, and benchmark: https://github.com/innovation64/AURA

详情
AI中文摘要

像“Lin Wei在哪里?”这样的情境化查询通常编码了比字面内容更多的信息:用户可能还想知道Lin Wei是否有空、心情好或是否值得现在打扰。标准的工具使用代理回答字面问题后就停止了。AURA在场景感知和工具使用之间插入一个推理步骤,生成IntentFrame:一个对隐式需求的结构化估计,带有一个标量差距分数,用于控制每次查询的探测预算和工具选择。在一个包含100个查询、四个场景的隐式意图基准上,AURA相比ReAct风格的探测将隐式需求覆盖率提高了(Delta = +0.07,p < 10^-6);四个场景中有三个单独显著,该增益在第二个骨干网络上重现,并且提示消融将提升归因于差距校准而非答案记忆。在事实查找上,控制器以原始准确度为代价,减少了82%的探测次数,并在一个隐私敏感切片上实现了零违禁工具违规;范围条件在局限性中详述。代码、模拟器和基准测试已在https://github.com/innovation64/AURA发布。

英文摘要

A situated query like "where is Lin Wei?" often encodes more than its literal content: the user may also want to know whether Lin Wei is free, in a good mood, or worth interrupting now. Standard tool-use agents answer the literal question and stop. AURA inserts an inference step between scene perception and tool use that produces an IntentFrame: a structured estimate of the implicit need with a scalar gap score that controls per-query probe budget and tool selection. On a 100-query four-scene implicit-intent benchmark, AURA improves implicit-need coverage over ReAct-style probing (Delta = +0.07, p < 10^-6); three of four scenes are individually significant, the gain reproduces on a second backbone, and a prompt ablation attributes the lift to gap calibration rather than answer memorisation. On factual lookup the controller trades raw accuracy for 82% fewer probes and zero forbidden-tool violations on a privacy-sensitive slice; scope conditions are detailed in Limitations. Code, simulator, and benchmark are released at https://github.com/innovation64/AURA.

2606.05553 2026-06-05 cs.CL cs.AI 版本更新

ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?

ArcANE:角色扮演语言代理是否在正确的时间保持角色?

Woojung Song, Nalim Kim, Sangjun Song, Chaewon Heo, Jongwon Lim, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院)

AI总结 提出ArcANE基准,通过角色弧将叙事分段,评估角色扮演语言代理在不同阶段是否与角色心理轨迹一致,实验表明基于角色弧的上下文策略最优,尤其在源文本外场景。

详情
AI中文摘要

角色扮演语言代理(RPLAs)应扮演其价值观和行为随故事发展而演变的角色,而非保持固定人格。现有基准衡量给定章节的事实回忆,而非回应是否与角色的心理轨迹一致,尤其是在源文本从未探索的场景中。我们引入ArcANE(弧感知叙事评估),一个自动构建的基准,涵盖17部小说和80个主要角色。角色弧将叙事沿心理轴分段,每个探针在多个阶段提出相同场景,涵盖源文本内和源文本外情境。在六个模型和六种上下文模式下,基于角色弧的条件在每项模型上均优于所有其他上下文策略,且在源文本外场景(检索无法找到信息)中差距最大。我们进一步在同一数据上微调开放权重模型,得到ArcANE-8B/32B,在源文本外场景中进一步扩大了弧优势。

英文摘要

Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether responses align with the character's psychological trajectory, especially in scenarios the source text never explores. We introduce ArcANE (Arc-Aware Narrative Evaluation), an automatically constructed benchmark spanning 17 novels and 80 principal characters. A Character Arc segments the narrative into phases along a psychological axis, and each probe poses the same scenario across phases, spanning both situations within the source text and situations beyond it. Across six models and six context modes, conditioning on the Character Arc tops every other context strategy on every model, and the gap is largest on scenarios outside the source text where retrieval has nothing to find. We further fine-tune open-weight models on the same data to obtain ArcANE-8B/32B, which widen the Arc advantage even more on scenarios outside the source text.

2606.05545 2026-06-05 cs.CL 版本更新

Multilingual Detection of Alzheimer's Disease from Speech: A Cross-Linguistic Transfer Learning Approach

基于语音的多语言阿尔茨海默病检测:跨语言迁移学习方法

Nadine Yasser Abdelhalim, Emmanuel Akinrintoyo, Nicole Salomons

发表机构 * Imperial College London(帝国理工学院伦敦分校)

AI总结 提出跨语言训练方法,利用英语、中文、阿拉伯语和印地语数据集开发基于Transformer的模型,实现多语言阿尔茨海默病检测,F1分数达82%,推理时间0.5秒,支持实时筛查。

Comments 5 pages

详情
AI中文摘要

由于特定语言模型训练的资源密集性和耗时性,多语言阿尔茨海默病痴呆(AD)检测模型的开发面临重大挑战。我们提出了一种新颖的解决方案,使用跨语言训练来检测训练模型所用语言之外的语言中的AD。本研究调查了用于跨不同语言和认知障碍水平检测AD的多语言深度学习模型。使用英语、中文、阿拉伯语和印地语的数据集,我们开发了基于Transformer的模型用于二元AD分类。我们的方法在所有语言中实现了82%的F1分数,展示了强大的跨语言泛化能力。快速推理时间(0.5秒)支持潜在的实时筛查应用,而跨语言的一致性能表明全球部署的可行性。

英文摘要

The development of multilingual Alzheimer's Disease Dementia (AD) detection models presents significant challenges due to the resource-intensive and time-consuming nature of language-specific model training. We propose a novel solution using cross-language training to detect AD in languages beyond those used for model training. This study investigates multilingual deep learning models for detecting AD across different languages and cognitive impairment levels. Using datasets in English, Chinese, Arabic, and Hindi, we developed transformer-based models for binary AD classification. Our approach achieved F1 scores of 82\% across all languages, demonstrating strong cross-linguistic generalization. The rapid inference time (0.5 seconds) supports potential real-time screening applications, while consistent performance across languages indicates feasibility for global deployment.

2606.05538 2026-06-05 cs.LG cs.CL 版本更新

Less is MoE: Trimming Experts in Domain-Specialist Language Models

少即是MoE:修剪领域专家语言模型中的专家

Haoze He, Xinkai Zou, Xuan Jiang, Xingyuan Ding, Ao Qu, Juncheng Billy Li, Heather Miller

发表机构 * Carnegie Mellon University(卡内基梅隆大学) UCSD(加州大学圣地亚哥分校) MIT(麻省理工学院)

AI总结 针对MoE模型部署时参数过多的问题,提出基于Fisher重要性的中间维度修剪方法Fisher-MoE,在50%压缩比下保持模型能力,减少约45%权重内存并提升21%推理吞吐量。

详情
AI中文摘要

混合专家(MoE)模型通过条件计算实现了强大的性能,但其庞大的参数规模带来了部署挑战。先前的MoE压缩方法在常识推理之外的通用基准测试中评估时灾难性地失败。我们将这一失败归因于压缩的粒度:重要能力分布在各个专家中,但集中在FFN稀疏中间维度。为了识别这些维度,我们使用Fisher重要性,它优于基于激活、路由器得分和幅度的方法,并识别出极小的任务关键维度集:在Qwen1.5-MoE中,仅移除1.35M路由FFN中间维度中的12个就导致GSM8K准确率崩溃,同时基本保持事实知识性能。基于此,我们提出Fisher-MoE,它在FFN内部操作,移除按Fisher重要性排序的中间维度。在相同的50% MoE压缩比下,Fisher-MoE保持了模型能力,同时减少了约45%的权重内存并提高了21%的推理吞吐量。这些发现表明,中间维度粒度是MoE模型中能力集中的有效压缩和排序单元。

英文摘要

Mixture-of-Experts (MoE) models achieve strong performance through conditional computation, but their large parameter footprint poses deployment challenges. Prior MoE compression approaches catastrophically fail when evaluated on general-purpose benchmarks beyond commonsense reasoning. We trace this failure to the granularity of compression: important capabilities are distributed across experts but concentrated in FFN sparse intermediate dimensions. To identify these dimensions, we use Fisher importance which outperforms activation-, router-score-, and magnitude-based alternatives, and identifies tiny sets of task-critical dimensions: in Qwen1.5-MoE, removing as few as 12 of 1.35M routed-FFN intermediate dimensions collapses GSM8K accuracy while largely preserving factual-knowledge performance. Building on this, we propose Fisher-MoE, which operates within FFN to remove intermediate dimensions ranked by Fisher importance. At the same 50% MoE compression ratio, Fisher-MoE preserves model capability, while reducing weight memory by ~45% and improving inference throughput by 21%. These findings suggest intermediate dimension granularity is an effective unit for both compression and ranking where capability concentrates in MoE models.

2606.05531 2026-06-05 cs.CV cs.AI cs.CL cs.LG 版本更新

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Almieyar-Oryx-BloomBench:一个用于视觉语言模型认知知情评估的双语多模态基准

Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Zuse School(Zuse学校) Qatar Computing Research Institute (QCRI)(卡塔尔计算研究所) Hamad Bin Khalifa University(哈马德·本·哈利法大学)

AI总结 针对现有基准无法诊断视觉语言模型真实推理能力的问题,提出基于Bloom认知分类学的双语多模态基准BloomBench,系统评估六个认知层次,揭示模型在事实回忆和创造性合成方面的深层局限。

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

尽管视觉语言模型(VLM)取得了快速进展,但该领域缺乏能够严格诊断其真实推理能力并描绘出向类人多模态智能有意义进展的基准。大多数现有评估侧重于零散或脱节的任务,掩盖了关键的认知弱点,并为有针对性的改进提供了很少的见解。为了弥补这一差距,我们引入了BloomBench,这是Almieyar基准系列的一部分,也是第一个基于人类认知的、双语(英语-阿拉伯语)的多模态VLM基准。基于Bloom分类学,BloomBench通过精心设计的图像-问题-答案任务系统地评估六个认知层次(记忆、理解、应用、分析、评估、创造)。通过半自动化流水线构建,并通过分层混合质量保证协议验证,确保了可扩展性、文化包容性和语言保真度。利用这一框架,我们对最先进的VLM进行了全面研究,以诊断其认知特征。我们的分析揭示了明显的认知不对称:尽管最先进的模型在语义理解方面达到了强大的性能上限,但它们在事实回忆和创造性合成方面存在显著困难。这表明当前的一般多模态能力掩盖了特定认知层次的深层局限性。此外,我们的研究突出了阿拉伯语和英语之间的关键性能差距,暴露了当前跨语言多模态推理的局限性。这些发现为开发更符合认知和包容性的VLM奠定了基础。基准框架和数据集可在以下网址获取:https://github.com/qcri/Almieyar-Oryx-BloomBench。

英文摘要

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset is available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.

2606.05523 2026-06-05 cs.CL 版本更新

CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning

CHASE:利用强化学习进行对抗性红蓝队训练以提高LLM安全性

Rahul Markasserithodi, Aditya Joshi, Yuekang Li, Ishmanbir Singh, Chris Yoo, Alan Niu

发表机构 * University of New South Wales(新南威尔士大学)

AI总结 提出CHASE框架,通过红蓝队协同进化(红队使用GRPO生成对抗性改写,蓝队使用两阶段GRPO+拒绝采样SFT进行防御),在保持良性提示零误拒的同时将攻击成功率降低43.2%。

Comments Under Review at ARR

详情
AI中文摘要

尽管在安全对齐方面取得了进展,但提示改写攻击(如角色调制、虚构框架和基于说服的重述)仍能绕过前沿模型的安全过滤器。现有防御要么依赖不可扩展的人工策展,要么依赖对特定模型内部过拟合的白盒优化,使对齐模型在面对部署中自适应黑盒对手时变得脆弱。为弥补这一差距,我们提出CHASE(通过对抗性安全升级的协同进化硬化),一种闭环红蓝队框架,其中黑盒攻击者和安全对齐防御者协同进化。攻击者通过组相对策略优化(GRPO)在乘法奖励下训练,该奖励联合强制绕过有效性和意图保真度,而防御者则通过两阶段GRPO+拒绝采样SFT流程在收获的对抗性改写上进行硬化,并与良性数据平衡。在BeaverTails和JailbreakBench上针对五个保留攻击家族(PAIR、TAP、AutoDAN、PAP、Translation)进行评估,CHASE将平均StrongREJECT分数降低了43.2%,且良性提示零误拒。除了这一显著结果外,CHASE表明无模板的RL探索能够恢复跨机制不同攻击家族迁移的潜在攻击原语,这为LLM安全硬化提供了一条超越当前对抗训练狭窄分布的泛化路径。

英文摘要

Despite advances in safety alignment, prompt-rewriting attacks such as persona modulation, fictional framing and persuasion-based reformulation, can bypass safety filters even on frontier models. Existing defenses either rely on non-scalable human curation or white-box optimisation that overfits to specific model internals, leaving aligned models brittle against the very class of adaptive black-box adversaries they will face in deployment. To address this gap, we introduce CHASE (Co-evolutionary Hardening through Adversarial Safety-Escalation), a closed-loop red-blue teaming framework in which a black-box attacker and a safety-aligned defender co-evolve. The attacker is trained via Group Relative Policy Optimization (GRPO) under a multiplicative reward that jointly enforces bypass effectiveness and intent fidelity, while the defender is hardened on the harvested adversarial rewrites through a two-stage GRPO + rejection-sampled SFT pipeline balanced with benign data. Evaluated on BeaverTails and JailbreakBench against five held-out attack families (PAIR, TAP, AutoDAN, PAP, Translation), CHASE cuts mean StrongREJECT score by 43.2\% with 0\% false-refusal on benign prompts. Beyond the headline result, CHASE shows that template-free RL exploration recovers latent attack primitives that transfer across mechanistically distinct attack families, suggesting a path toward LLM safety hardening that generalises beyond the narrow distributions achieved thus far in adversarial training.

2606.05513 2026-06-05 cs.AI cs.CL 版本更新

EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

EpiEvolve:用于制度转变下流式疫情预测的自演化智能体

Yiming Lu, Sihang Zeng, Zhengxu Tang, Max Lau, Fei Liu, Wei Jin

发表机构 * Emory University(埃默里大学) University of Washington(华盛顿大学)

AI总结 针对流式疫情预测中标签延迟和制度转变问题,提出自演化智能体EpiEvolve,通过层次化情景记忆、延迟标签反思和制度感知检索,在COVID-19住院趋势预测中达到0.629准确率,并将制度转变后的恢复滞后从5周缩短至2周。

详情
AI中文摘要

流行病LLM预测器通常作为静态监督模型进行训练和评估,而实际疫情预测是一个流式过程,其中标签在预测之后到达,疾病制度随时间变化。我们研究了在五个变异制度下的每周COVID-19住院趋势预测中的这种不匹配。我们引入了EpiEvolve,一个自演化智能体,它封装了一个在预热期训练好的LLM预测器,并在流式过程中保持其权重固定。EpiEvolve通过将预测结果存储在层次化情景记忆中进行适应,反思延迟标签,检索与当前制度相关的案例,并将重复出现的错误提炼为策略规则。由此产生的上下文让预测器在遵循防止未来泄漏的时间顺序协议的同时,在后续周中重用其自身的过去预测和结果。在流式数据集上,EpiEvolve达到了0.629的平均准确率,而静态骨干模型为0.561,外部CDC集成模型为0.325,并将制度转变后的恢复滞后从5周缩短到2周。消融实验表明,反思、策略记忆和制度感知检索各自对性能提升有贡献。

英文摘要

Epidemic LLM forecasters are usually trained and evaluated as static supervised models, whereas operational pandemic forecasting is a streaming process in which labels arrive after predictions and disease regimes shift over time. We study this mismatch in weekly COVID-19 hospitalization trend forecasting across five variant regimes. We introduce EpiEvolve, a self-evolving agent that wraps an LLM forecaster trained on the warm-start period and keeps its weights fixed during streaming. EpiEvolve adapts by storing forecast outcomes in a hierarchical episodic memory, reflecting on delayed labels, retrieving cases relevant to the current regime, and distilling recurring errors into strategic rules. The resulting context lets the forecaster reuse its own past predictions and outcomes in later weeks while following a chronological protocol that prevents future leakage. On the streaming dataset, EpiEvolve reaches $0.629$ average accuracy, compared with $0.561$ for the static backbone and $0.325$ for the external CDC ensemble, and reduces recovery lag after regime shifts from $5$ to $2$ weeks. Ablations show that reflection, strategic memory, and regime-aware retrieval each contribute to the gains.

2606.05494 2026-06-05 cs.CL cs.AI 版本更新

MASF: A Multi-Model Adaptive Selection Framework for Abstractive Text summarization

MASF:面向抽象式文本摘要的多模型自适应选择框架

Ahmed Alansary, Ali Hamdi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种多模型自适应选择框架,通过集成多个微调的Transformer模型并基于自动评估指标选择最佳摘要,在CNN/DailyMail数据集上BERTScore达88.63%,优于GPT3-D2等大模型。

Comments 6 pages, 3 figures, IMSA2026

详情
AI中文摘要

自动文本摘要因数字文本信息的快速增长而变得日益重要。本文提出一种多模型自适应摘要框架,旨在提高抽象式文本摘要的鲁棒性和质量。依赖单一模型往往导致在不同结构和主题的文章上摘要质量不一致。为解决这一局限,所提框架集成了多个微调的基于Transformer的摘要模型,并引入自适应选择机制。在该框架中,每个模型独立为同一输入文章生成候选摘要。然后使用自动评估指标评估生成的摘要,这些指标同时捕捉词汇相似性和语义相关性。基于这些分数,框架选择最高质量的摘要作为最终输出。模型在广泛使用的CNN/DailyMail新闻摘要数据集上进行微调和评估。实验结果表明,所提框架在所有比较方法中取得了最高的BERTScore,达到88.63%。它还优于多个大语言模型,如GPT3-D2、Falcon-7b和Mpt-7b,突显了其有效性和鲁棒性。这些发现强调了在自适应选择策略中利用多个基于Transformer的模型来提高自动文本摘要系统质量和鲁棒性的有效性。

英文摘要

Automatic text summarization has become increasingly important due to the rapid growth of digital textual information. This paper presents a Multi-Model Adaptive Summarization Framework designed to improve the robustness and quality of abstractive text summarization. Relying on a single model often leads to inconsistent summarization quality across articles with varying structures and topics. To address this limitation, the proposed framework integrates multiple fine-tuned transformer-based summarization models and introduces an adaptive selection mechanism. In this framework, each model independently generates a candidate summary for the same input article. The generated summaries are then evaluated using automatic evaluation metrics that capture both lexical similarity and semantic relevance. Based on these scores, the framework selects the highest-quality summary as the final output. The models are fine-tuned and evaluated on the widely used CNN/DailyMail news summarization dataset. Experimental results demonstrate that the proposed framework achieves the highest BERTScore among all compared methods with a score of 88.63%. It also outperforms several LLMs such as GPT3-D2, Falcon-7b, and Mpt-7b, highlighting its effectiveness and robustness. These findings highlight the effectiveness of leveraging multiple transformer-based models within an adaptive selection strategy to improve the quality and robustness of automatic text summarization systems.

2606.05486 2026-06-05 cs.CL cs.LG 版本更新

Localizing Prompt Ambiguity in Large Language Models with Probe-Targeted Attribution

通过探针目标归因定位大型语言模型中的提示歧义

Govind Ramesh, Yao Dou, Wei Xu

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出PRIG方法,利用线性探针和梯度归因,通过中间表示而非输出层定位提示中的歧义位置,在合成和人工基准上取得高AUROC。

Comments 23 pages, 5 figures, 5 tables

详情
AI中文摘要

提示歧义是大型语言模型中常见的失败原因,但由于它是提示的潜在属性,难以定位,而现有的归因方法旨在解释可观察的输出,如logits或生成的token。我们引入了PRIG,一种梯度归因方法,使用探针logit将潜在歧义归因于token位置。具体来说,PRIG训练一个线性探针来区分清晰提示和模糊提示,并将探针分数归因于残差流中早期的token表示。为了实现token级别的评估,我们通过重写每个提示中的一个关键句子,构建了涵盖编码、数学和写作的合成歧义数据集,并用人工编写的黄金基准进行补充。在这种设置下,PRIG在定位歧义片段方面显著优于梯度归因基线,在组合合成基准上达到0.840 AUROC,在黄金集上达到0.891 AUROC。它在句子级别的歧义识别上也优于GPT-5.4,并在域外保留了有用的信号。这些结果确立了PRIG作为一种实用工具,用于识别提示中哪些部分存在歧义。更广泛地说,它们表明潜在提示属性可以通过中间表示而非输出级归因来定位。

英文摘要

Prompt ambiguity is a common source of failure in large language models, but is difficult to localize because it is a latent property of the prompt, while existing attribution methods are designed to explain observable outputs such as logits or generated tokens. We introduce PRIG, a gradient attribution method that uses a probe logit to attribute latent ambiguity to token positions. Specifically, PRIG trains a linear probe to distinguish clear prompts from ambiguous prompts and attributes the probe score to earlier token representations in the residual stream. To enable token-level evaluation, we construct synthetic ambiguity datasets across coding, math, and writing by rewriting one task-critical sentence per prompt, and complement them with a human-written gold benchmark. In this setting, PRIG localizes ambiguous spans substantially better than gradient attribution baselines, achieving 0.840 AUROC on the combined synthetic benchmark and 0.891 AUROC on the gold set. It also outperforms GPT-5.4 on sentence-level ambiguity identification and retains useful signal out-of-domain. These results establish PRIG as a practical tool for identifying which parts of a prompt are ambiguous. More broadly, they suggest that latent prompt properties can be localized through intermediate representations, rather than through output-level attribution.

2606.05444 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Multilingual Coreference Resolution via Cycle-Consistent Machine Translation

通过循环一致性机器翻译的多语言共指消解

Adriana-Valentina Costache, Eduard Poesina, Silviu-Florin Gheorghe, Paul Irofti, Radu Tudor Ionescu

发表机构 * Department of Computer Science, University of Bucharest(布加勒斯特大学计算机科学系)

AI总结 提出一种利用循环一致性机器翻译生成或扩展训练数据的管道,通过BERT潜在空间余弦相似度评估翻译质量并加权损失函数,显著提升低资源语言的共指消解性能。

详情
AI中文摘要

共指消解是一项核心的自然语言处理任务,具有广泛的下游应用,例如机器翻译、问答、文档摘要等。虽然该任务在英语中得到了充分研究,但其他语言(尤其是低资源语言)的共指消解关注相对较少。为了弥补这一差距,我们提出了一种新颖的共指消解管道,该管道利用从英语到目标低资源语言的机器翻译(MT)来生成或扩展训练数据。为了自动验证翻译样本的质量,我们将样本反向翻译,并通过BERT模型潜在空间中的余弦相似度评估与原始英语样本的相似性。得到的相似度分数被整合到损失函数中,以根据样本的MT循环一致性对训练样本进行加权。在四种低资源语言上的大量实验表明,我们的管道在共指消解中带来了显著的性能提升。此外,我们的管道使得在之前没有可用语料库的语言中也能实现准确的共指消解。

英文摘要

Coreference resolution is a core NLP task, having a broad range of downstream applications, e.g.~machine translation, question answering, document summarization, etc. While the task is well-studied in English, comparatively less attention is dedicated to coreference resolution in other languages, especially low-resource ones. To mitigate this gap, we propose a novel coreference resolution pipeline that harnesses machine translation (MT) from English to a target low-resource language, to generate or expand training data. To automatically validate the quality of the translated samples, we back-translate the samples and assess the similarity with the original English samples via cosine similarity in the latent space of a BERT model. The resulting similarity scores are integrated into the loss function to weight training samples according to their MT cycle consistency. Extensive experiments on four low-resource languages show that our pipeline brings significant performance gains in coreference resolution. Moreover, our pipeline enables accurate coreference resolution in languages where no previous corpora were available.

2606.05443 2026-06-05 cs.DL cs.CL 版本更新

MIRAI: Prediction and Generation of High-Impact Academic Research

MIRAI:高影响力学术研究的预测与生成

Alex Li, Joseph Jacobson

发表机构 * MIT Media Lab(MIT媒体实验室)

AI总结 提出MIRAI深度学习框架,利用论文标题、摘要和发表日期预测其5年PageRank和引用量,并基于此构建研究构思流程以生成高影响力研究想法。

详情
AI中文摘要

科学出版的快速步伐使得识别和综合高影响力工作成为日益紧迫的挑战。我们提出了MIRAI(Multi-year Inference of Research trends and Academic Impact),一个深度学习框架,仅使用论文的标题、摘要和发表日期来预测其影响力。我们在arXiv学术图上训练MIRAI,预测5年PageRank和引用次数,对于2021年发表的论文,在PageRank预测上达到Spearman's $ρ$ 0.4686,在引用预测上达到0.6192。我们提出了一个基于MIRAI的研究构思流程,该流程产生面向高影响力的研究想法。这些想法被一个无偏的LLM评判者以4:3的比例认为比没有MIRAI的基线更具影响力。我们在https://predict-paper-impact.vercel.app上公开了5年引用预测模型。

英文摘要

The rapid pace of scientific publishing has made the identification and synthesis of high-impact work an increasingly urgent challenge. We introduce MIRAI (Multi-year Inference of Research trends and Academic Impact), a deep learning framework that predicts paper impact using only it's title, abstract, and publication date. We train MIRAI on the arXiv academic graph to predict 5-year PageRank and citation counts, achieving Spearman's $ρ$ of 0.4686 on PageRank prediction and 0.6192 on citation prediction for papers published in 2021. We propose a research ideation pipeline built on top of MIRAI that produces research ideas oriented towards high impact. These ideas were judged as more impactful than a baseline without MIRAI by an unbiased LLM judge at a 4:3 ratio. We make the 5-year citation prediction model publicly available at https://predict-paper-impact.vercel.app.

2606.05436 2026-06-05 cs.AI cs.CL cs.IR 版本更新

Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

十位头痛专家与人工智能在临床文献总结中的比较:一项关键评估与对比

Alejandro Lozano, Keiko Ihara, Ping-Hao Yang, Carrie E. Robertson, Jennifer Stern, Allan Purdy, Hsiangkuo Yuan, Pengfei Zhang, Yulia Orlova, Olga Fermo, Jennifer Hranilovich, Fred Cohen, Todd J. Schwedt, Jenelle A. Jindal, Serena Yeung-Levy, Chia-Chun Chiang

发表机构 * Stanford University Palo Alto CA USA(斯坦福大学) Department of Neurology Mayo Clinic Rochester MN USA(梅奥诊所神经科) Department of Neurology Dalhousie University Halifax Canada(达尔豪斯大学神经科) Jefferson Headache Center Department of Neurology Thomas Jefferson University PA USA(泰勒大学神经科) Beth Israel Deaconess Medical Center Boston MA USA(贝斯以色列医疗中心) Department of Neurology University of Florida Gainesville FL USA(佛罗里达大学神经科) University of Colorado School of Medicine Department of Pediatrics Division of Child Neurology Aurora CO USA(科罗拉多医学院儿科部儿童神经科) Department of Medicine Mount Sinai Hospital Icahn School of Medicine at Mount Sinai New York NY USA(西奈医院医学部) Department of Neurology Mayo Clinic Scottsdale AZ USA(梅奥诊所Scottsdale分部) Harvard Medical School Boston MA USA(哈佛医学院) Department of Neurology Mount Sinai Hospital Icahn School of Medicine at Mount Sinai New York NY USA(西奈医院神经科)

AI总结 本研究通过构建基于RAG的AI框架,比较了三种大语言模型与十位头痛专家在临床文献总结方面的表现,发现专家撰写的摘要更受青睐,但专家有时难以区分人类与AI生成的摘要。

详情
AI中文摘要

总结最新医学文献以指导临床决策对于循证医学和高质量患者护理至关重要。然而,由于患者时间有限且发表文章数量迅速增长,临床医生面临越来越大的挑战。尽管检索增强的大语言模型(LLMs)在临床总结方面显示出潜力,但对其在综合更广泛科学文献方面的有效性进行人工评估,以及与专家撰写的综合摘要的直接比较仍然很少。我们使用三种最先进的LLMs(Sonnet、GPT-4o和Llama 3.1)构建了一个基于RAG的智能AI框架。一位头痛专家创建了13个问题,其中3个用于提示优化,10个用于评估。美国和加拿大的十位头痛专家每人针对一个问题撰写一篇摘要,每个问题得到四篇摘要(专家、Sonnet、GPT-4o和Llama)。专家们在不知道作者身份的情况下,根据正确性、完整性、简洁性和临床实用性,使用标准化评分标准对摘要进行评分(1-10分),并排除他们自己撰写摘要的主题。他们还按偏好对摘要进行排序,并指出他们认为每篇摘要是由专家还是LLM撰写的。我们的研究比较了由头痛专家评估的LLM和专家撰写的文献摘要,结果显示专家撰写的摘要更受青睐,尽管专家有时难以区分人类和AI生成的摘要。我们还确定了超出标准评估指标的关键专家重视特征,这些特征可以指导未来人类和AI文献总结流程的改进。

英文摘要

Summarizing the latest medical literature to guide clinical decision-making is essential for evidence-based medicine and high-quality patient care. Yet clinicians face increasing challenges due to limited time with patients and a rapidly growing volume of published articles. Although retrieval-augmented large language models (LLMs) have shown promise in clinical summarization, human evaluations of their effectiveness in synthesizing broader scientific literature and direct comparisons to expert-written syntheses remain scarce. We constructed a RAG-based agentic AI framework using three state-of-the-art LLMs: Sonnet, GPT-4o, and Llama 3.1. A headache specialist created 13 questions, three for prompt optimization and ten for evaluation. Ten headache specialists across the United States and Canada each wrote a summary for one question, yielding four summaries per question (expert, Sonnet, GPT-4o, and Llama). The experts, blinded to authorship, critically evaluated the summaries, excluding the topic for which they wrote a summary, based on correctness, completeness, conciseness, and clinical utility, scoring each from 1 to 10 using standardized rubrics. They also ranked the summaries by preference and indicated whether they believed each summary was written by an expert or an LLM. Our study, comparing LLM- and expert-written literature summaries evaluated by headache specialists, showed that expert-written summaries were preferred, although experts sometimes found it challenging to distinguish between human- and AI-generated summaries. We also identified key expert-valued features beyond standard evaluation metrics that can guide future refinement of both human and AI literature summarization pipelines.

2606.05421 2026-06-05 cs.CL 版本更新

ComplexityMT: Benchmarking the Interaction Between Text Complexity and Machine Translation

ComplexityMT: 文本复杂度与机器翻译交互作用的基准测试

Joseph Marvin Imperial, Junhong Liang, Belal Shoer, Abdullah Barayan, Rodrigo Wilkens, Omar Mussa, Dawn Knight, Eugénio Ribeiro, Ekaterina Kochmar, Sowmya Vajjala, Fernando Alva-Manchego, Harish Tayyar Madabushi

发表机构 * University of Bath(巴斯大学) Cardiff University(卡迪夫大学) National University Philippines(菲律宾国家大学) MBZUAI(穆扎布伊人工智能研究所) University of Exeter(埃克塞特大学) INESC-ID Lisboa(里斯本INESC-ID) Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR(里斯本大学研究所(ISCTE-IUL),ISTAR) National Research Council, Canada(加拿大国家研究委员会) King Abdulaziz University(阿卜杜勒-阿齐兹大学) Saudi Electronic University(沙特电子大学)

AI总结 提出ComplexityMT基准,利用CEFR等级评估六种语言中文本复杂度与机器翻译的相互影响,发现高复杂度文本更难翻译且翻译会改变目标文本的CEFR等级。

详情
AI中文摘要

当文本被翻译时,翻译是否保留了原文的复杂度?我们引入ComplexityMT,这是一个新的挑战,用于评估文本复杂度和机器翻译如何相互作用和相互影响,使用欧洲语言共同参考框架(CEFR)等级作为文本复杂度的度量。在包括阿拉伯语、荷兰语、英语、法语、印地语和俄语在内的六种语言中,我们评估了三个开放权重模型、一个封闭模型和一个商业机器翻译系统在两个任务上的表现:i) CEFR与翻译难度的相关性,以及ii) 源文本CEFR等级的变化。我们的实验表明,较高的CEFR等级使文本更难翻译,并且对于大多数语言,机器翻译会改变目标文本相对于原始源文本的CEFR等级。这些发现为从事多语言教学内容生成和机器翻译难度估计的研究人员和从业者提供了新的见解。

英文摘要

When a text is translated, does the translation retain the complexity of the original? We introduce ComplexityMT, a new challenge for assessing how text complexity and machine translation interact with and influence each other, using the Common European Framework of Reference for Languages (CEFR) levels as the measure of text complexity. Across six languages, including Arabic, Dutch, English, French, Hindi, and Russian, we evaluate three open-weight models, one closed model, and a commercial machine translation system on two tasks: i) correlation of CEFR with translation difficulty, and ii) shifts in CEFR levels of the source texts. Our experiments show that higher CEFR levels make texts more difficult to translate, and that machine translation shifts the CEFR level of the target text compared to the original source, for most languages. These findings provide new insights for researchers and practitioners working on multilingual pedagogical content generation and machine translation difficulty estimation.

2606.05415 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Executable Schema Contracts: From Automatic Ingestion to Multi-Source Retrieval

可执行模式合约:从自动摄入到多源检索

Padmaja Jonnalagedda, Yuguang Yao, Xiang Gao, Hilaf Hasson, Kamalika Das

发表机构 * Intuit AI Research(Intuit AI研究)

AI总结 提出一种自动从多源数据中发现可执行模式并将其作为共享合约的系统,通过模式约束的检索路由和结构化分析提升多源问答性能。

Comments 9 pages, 4 figures, plus supplementary appendix

详情
AI中文摘要

现实世界的数据跨越表格、文档和半结构化文件,具有隐式语义。查询这些数据需要跨不一致的模式和格式整合证据,但现有方法要么需要昂贵的人工工程,要么完全绕过结构。我们提出一个系统,自动从原始多源数据中发现可执行模式,并将其用作知识图谱构建和查询时检索的共享合约。一个封闭世界的字段目录将基于LLM的模式发现限制在已证实的字段上;确定性结构分析推断身份键、外键和源层次结构;由此产生的模式驱动提取、去重和跨源链接,形成具有溯源意识的知识图谱。在查询时,该模式(可选地通过单调协议扩展)调节一个多工具代理,该代理在结构化查找、图遍历和向量搜索之间路由检索,返回带有可追溯引用的有根据的答案。在使用相同LLM、数据和评估框架的受控零样本比较中,该系统在四个QA基准上优于仅检索和基于分解的基线,消融实验表明模式条件路由、结构智能和模式引导构建各自贡献了性能提升。

英文摘要

Real-world data spans tables, documents, and semi-structured files with implicit semantics. Querying this data requires integrating evidence across inconsistent schemas and formats, yet existing approaches either demand costly manual engineering or bypass structure entirely. We present a system that automatically discovers an executable schema from raw multi-source data and uses it as a shared contract for knowledge graph construction and query-time retrieval. A closed-world field catalog constrains LLM-based schema discovery to attested fields; deterministic structural analysis infers identity keys, foreign keys, and source hierarchy; and the resulting schema drives extraction, deduplication, and cross-source linking into a provenance-aware knowledge graph. At query time the schema -- optionally extended via a monotonic protocol -- conditions a multi-tool agent routing retrieval across structured lookup, graph traversal, and vector search, returning grounded answers with traceable citations. In controlled zero-shot comparisons using the same LLM, data, and evaluation harness, the system improves over retrieval-only and decomposition-based baselines across four QA benchmarks, with ablations showing that schema-conditioned routing, structural intelligence, and schema-guided construction each contribute to the gains.

2606.05414 2026-06-05 cs.CL cs.AI cs.HC cs.LG 版本更新

When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories

当证据稀疏时:对话和LLM-Agent轨迹中的弱监督早期失败预警

Avinash Baidya, Xinran Liang, Ruocheng Guo, Xiang Gao, Kamalika Das

发表机构 * Intuit AI Research(Intuit AI研究院) Princeton University(普林斯顿大学)

AI总结 针对对话和LLM-Agent轨迹中早期失败预警问题,提出一种两阶段方法,通过注意力机制从稀疏的轨迹级标签中学习回合级失败证据,并结合α-STOP策略实现可控的早期预警,在多个基准上显著提升帕累托前沿质量并降低训练成本。

Comments 9 pages, 14 figures, and appendix

详情
AI中文摘要

早期失败预警需要在对话或智能体轨迹尚未完成时,决定是否将其标记为可能失败。这具有挑战性,因为监督信号通常仅以轨迹级成功/失败标签的形式提供,而预警必须从部分交互中发出。先前的早期分类方法通常通过将终端标签分配给每个前缀来弥合这一差距,将每个回合视为失败证据。我们假设这种前缀标签假设与多轮语言交互不匹配,因为最终失败的证据是稀疏且常常延迟的。在本文中,我们引入了一种两阶段方法,从这种稀疏证据结构中学习,并使用由此产生的风险估计进行可控的早期预警。具体来说,我们的基于注意力的失败预测器从轨迹标签中学习稀疏的回合级失败证据,并利用它从部分历史中估计失败风险。然后,我们将该预测器与α-STOP配对,这是一种单一偏好条件停止策略,在推理时选择准确率-早期性的操作点,而不是为每个偏好训练单独的触发器。在涵盖客户支持、任务导向对话、说服、工具使用和规划的五个基准上,我们首先表明高相关性失败证据仅占回合的4.7-11.3%,并且平均在轨迹的59.0-83.6%之后首次出现。我们进一步表明,基于注意力的预测器将帕累托前沿质量(超体积)比朴素前缀监督提高了1-10%,并且完整系统将前沿质量比最先进的触发器策略提高了3-42%,同时将每个操作点的训练成本降低了1-3个数量级。

英文摘要

Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success/failure label while alerts must be raised from partial interactions. Prior early-classification methods often bridge this gap by assigning the terminal label to every prefix, treating every turn as failure evidence. We hypothesize that this prefix-label assumption is poorly matched to multi-turn language interactions, where evidence of eventual failure is sparse and often delayed. In this paper, we introduce a two-stage approach that learns from this sparse evidence structure and uses the resulting risk estimates for controllable early alerting. Specifically, our attention-based failure predictor learns sparse turn-level failure evidence from trajectory labels and uses it to estimate failure risk from partial histories. We then pair this predictor with $α$-STOP, a single preference-conditioned stopping policy that selects an accuracy-earliness operating point at inference time rather than training a separate trigger for each preference. Across five benchmarks spanning customer support, task-oriented dialog, persuasion, tool use, and planning, we first show that high-relevance failure evidence occupies only 4.7-11.3% of turns and first appears after 59.0-83.6\% of trajectories on average. We further show that the attention-based predictor improves Pareto-frontier quality (hypervolume) by 1-10\% over naive prefix supervision, and that the full system improves frontier quality by 3-42\% over state-of-the-art trigger policies while reducing training cost per operating point by 1-3 orders of magnitude.

2606.05404 2026-06-05 cs.AI cs.CL cs.LG 版本更新

Harnessing Generalist Agents for Contextualized Time Series

利用通用智能体进行情境化时间序列分析

Zihao Li, Kaifeng Jin, Yuanchen Bei, Jiaru Zou, Avaneesh Kumar, Xuying Ning, Yanjun Zhao, Mengting Ai, Baoyu Jing, Hanghang Tong, Jingrui He

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出TimeClaw框架,通过集成可执行时间工具、经验驱动能力进化和情景多模态记忆,使通用大语言模型智能体具备情境化时间推理能力,在能源、金融等多领域基准上取得性能提升。

Comments Preprint. 38 Pages

详情
AI中文摘要

时间序列通常嵌入在丰富的上下文中,这对于整体建模至关重要。此外,现实世界的从业者通常需要用于分析时间动态的端到端工作流,其中广泛研究的任务(如预测)只是更广泛解决方案循环中的一个步骤。虽然通用AI智能体为复杂上下文下的此类工作流提供了有前景的接口,但它们主要运行在文本空间中,并未与结构化时间信号完全对齐。在这项工作中,我们引入了TimeClaw,一个用于时间序列的智能体框架,它为通用大语言模型智能体配备了情境化时间推理所需的时间序列原生运行时支持。TimeClaw集成了可执行的时间工具以进行有根据和可审计的分析,经验驱动的能力进化以创建可重用的分析例程,以及用于检索相关推理轨迹的情景多模态记忆。这些组件共同解锁了带有上下文信息的开放式时间推理。在涵盖能源、金融、天气、交通和其他现实世界领域的多个基准上的广泛评估表明,TimeClaw的性能得到了提升。代码可在https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw获取。

英文摘要

Time series are often embedded in rich contexts that are essential for holistic modeling. Moreover, real-world practitioners often require end-to-end workflows for analyzing temporal dynamics, where widely studied tasks such as forecasting are only one step in a broader solution loop. While generalist AI agents offer a promising interface for such workflows under complex contexts, they still operate primarily in textual spaces that are not fully aligned with structured temporal signals. In this work, we introduce TimeClaw, an agentic harness framework for time series that equips generalist LLM agents with the time series-native runtime support needed for contextualized temporal reasoning. TimeClaw integrates executable temporal tools for grounded and auditable analysis, experience-driven capability evolution for creating reusable analytical routines, and episodic multimodal memory for retrieving relevant reasoning traces. Together, these components unlock harnessed open-ended temporal reasoning with contextual information. Extensive evaluation on multiple benchmarks covering diverse tasks across energy, finance, weather, traffic, and other real-world domains demonstrates improved performance of TimeClaw. Code is available at https://github.com/iDEA-iSAIL-Lab-UIUC/TimeClaw.

2606.05402 2026-06-05 cs.CL cs.AI 版本更新

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

ReasoningFlow: 理解LLM推理轨迹的话语结构

Jinu Lee, Shivam Agarwal, Amruta Parulekar, Siddarth Madala, Dilek Hakkani-Tur, Julia Hockenmaier

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出ReasoningFlow框架,将大推理模型的推理轨迹建模为细粒度有向无环图,通过人工和自动标注分析发现模型间结构相似性、多样化推理行为及错误步骤与最终答案的关系。

详情
AI中文摘要

大型推理模型(LRMs)产生的推理轨迹具有非线性结构,如回溯和自我修正,这使推理过程的评估和监控复杂化。我们引入ReasoningFlow,一个将LRM推理轨迹的话语结构捕捉为细粒度有向无环图(DAGs)的框架。我们通过仔细的人工标注31条轨迹(2.1k步)来开发和验证我们的标注方案,实现了高标注者间一致性,然后扩展到自动标注1,260条轨迹(247.7k步),涵盖三个任务(数学、科学、论证)和五个模型(Qwen2.5-32B-Inst、QwQ-32B、DeepSeek-V3、DeepSeek-R1、GPT-oss-120B)。通过分析ReasoningFlow图,我们发现:(1)LRMs表现出结构相似的轨迹,尽管它们基于不同的基础模型训练且可能使用不重叠的后训练数据。(2)ReasoningFlow揭示了多样的细粒度推理行为(例如局部验证、自我反思和假设),可用于更好的推理轨迹可监控性。(3)在LRMs中,大多数错误步骤不用于推导最终答案。(4)步骤之间的机械因果依赖关系不反映语言层面的话语结构。我们在https://github.com/jinulee-v/reasoningflow 发布数据集和代码。

英文摘要

Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.

2606.05400 2026-06-05 cs.AI cs.CL cs.LG 版本更新

LeanMarathon: Toward Reliable AI Co-Mathematicians through Long-Horizon Lean Autoformalization

LeanMarathon:通过长视界Lean自动形式化实现可靠的AI合作数学家

Yuanhe Zhang, Yuekai Sun, Taiji Suzuki, Jason D. Lee, Fanghui Liu

发表机构 * Department of Statistics, University of Warwick, UK(英国沃里克大学统计系) Center for Advanced Intelligence Project, RIKEN, Japan(日本理化学研究所高级智能项目) Department of Statistics, University of Michigan, USA(美国密歇根大学统计系) Department of Mathematical Informatics, The University of Tokyo(东京大学数学信息学系;日本理化学研究所高级智能项目) also Center for Advanced Intelligence Project, RIKEN, Japan(加州大学伯克利分校电气工程与计算机科学系;统计系) Department of Electrical Engineering and Computer Sciences, also Department of Statistics, University of California, Berkeley, USA(上海交通大学数学科学学院,自然科学院和MOE-LSC) School of Mathematical Sciences, Institute of Natural Sciences and MOE-LSC, Shanghai Jiao Tong University, China

AI总结 提出多智能体框架LeanMarathon,通过蓝图抽象和两阶段编排器实现长视界研究数学的可靠自动形式化,在四个Erdős问题上成功形式化七个定理。

Comments 26 pages, 9 figures. Comments are welcome

详情
AI中文摘要

长视界研究数学的自动形式化不仅在困难引理上失败,而且在规模上失败:陈述漂移、依赖关系纠缠、上下文衰减以及局部修复破坏远处的工作。我们提出LeanMarathon,一个用于可靠的研究级Lean自动形式化的多智能体框架。其核心抽象是一个演化的蓝图:一个Lean文件,同时作为形式化证明骨架、自然语言证明图和共享系统记录。四个合约范围的智能体构建、审计、证明和修复这个蓝图。这些智能体由一个两阶段编排器协调,该编排器首先通过对抗性审查稳定目标保真度,然后从动态叶节点向上并行地通过CI门控轮次释放证明有向无环图(DAG)。LeanMarathon将一次脆弱的数小时运行转变为许多局部、可恢复、并行的交易。我们在两篇最近的研究论文上评估LeanMarathon,涵盖四个Erdős问题(#1051, #1196, #164, #1217)。在三次自主运行中,它形式化了所有七个目标定理,没有留下任何sorry,证明了258个引理和定理。这些结果表明,可靠的AI合作数学不仅需要更强的证明器,还需要耐用的框架,以在长数学发展过程中保持目标保真度。代码可在https://github.com/YuanheZ/LeanMarathon找到。

英文摘要

Long-horizon autoformalization of research mathematics fails not only at hard lemmas, but at scale: statements drift, dependencies tangle, context decays, and local repairs corrupt distant work. We present LeanMarathon, a multi-agent harness for reliable research-level Lean autoformalization. Its core abstraction is an evolving blueprint: a Lean file that serves simultaneously as formal proof skeleton, natural-language proof graph, and shared system of record. Four contract-scoped agents construct, audit, prove, and repair this blueprint. These agents are coordinated by a two-stage orchestrator that first stabilizes target fidelity through adversarial review and then discharges the proof directed acyclic graph (DAG) from its dynamic leaves upward in parallel CI-gated rounds. LeanMarathon turns one brittle multi-hour run into many local, recoverable, parallel transactions. We evaluate LeanMarathon on two recent research papers spanning four Erdős problems (#1051, #1196, #164, #1217). Across three autonomous runs, it formalizes all seven target theorems with no sorry, proving 258 lemmas and theorems. These results show that reliable AI co-mathematics requires not only stronger provers, but durable harnesses that preserve target fidelity across long mathematical developments. The code can be found at https://github.com/YuanheZ/LeanMarathon.

2605.04135 2026-06-05 cs.CY cs.AI cs.CL 版本更新

Frontier Lag: A Bibliometric Audit of Capability Misrepresentation in Academic AI Evaluation

前沿滞后:学术AI评估中能力误述的文献计量审计

David Gringras, Misha Salahshoor

发表机构 * Harvard University(哈佛大学) AISST

AI总结 通过审计112,303篇LLM相关论文,发现中位论文评估的模型落后同期前沿10.85 ECI(约1.4倍Claude Sonnet 3.7与Claude Opus 4.5的差距),且差距以每年5.53 ECI扩大,仅3.2%的摘要披露推理模式状态,52.5%的结论将结果泛化为“AI”,并提出VERSIO-AI检查表等补救措施。

Comments v2. 65 pp, 9 figs, 8 tables, 8 appendices. Pre-registered on OSF: doi.org/10.17605/OSF.IO/7XM3D. Code+data: doi.org/10.5281/zenodo.20060457. VERSIO-AI v1.2 reporting checklist (Appendix A): doi.org/10.5281/zenodo.20060459. frontierlag package + per-DOI audit tool: frontierlag.org

详情
AI中文摘要

应用领域LLM能力评估的读者希望了解AI系统当前能做什么。但相关文献回答的是一个相关但结果不同的问题:更旧、更便宜、更少引导的模型在数月或数年前能做什么(例如,一篇2026年的论文评估GPT-3.5或GPT-4零样本,对比前沿的推理能力、工具使用系统如GPT-5.5 Pro和Claude Opus 4.7),通常报告稀疏的配置细节,并抽象上升为关于“AI”的声明,通过引用、媒体和政策传播。我们在一个预注册的审计中测量了“发表引导差距”(这些答案之间的差距),审计了112,303条LLM关键词匹配的候选记录(2022年1月至2026年4月;18,574条可接受,4,766篇全文可检索),将测试模型与同期前沿在Epoch AI能力指数(ECI)上进行比较,并在Arena Elo和Artificial Analysis上复现。中位论文评估的模型在评估时落后同期前沿+10.85 ECI(约Claude Sonnet 3.7与Claude Opus 4.5距离的1.4倍)(H1);一个探索性的理性滞后基线(H8)将其分解为约25%的同行评审延迟和约75%的额外滞后。差距以每年+5.53 ECI的速度扩大(H2;95% CI [+5.03, +5.83])。同时,仅3.2%的摘要(21.2%的全文)披露了具有推理能力模型的推理模式状态(H4),52.5%(95% CI [48.2, 56.9])的结论以“AI”而非被评估模型(们)的层面陈述,并以OR = 1.23/年的速度上升。提出的补救措施包括API访问补贴和编辑执行报告框架,强制披露配置表面(模型快照、推理模式/努力、工具访问、脚手架、提示等);VERSIO-AI是一个13项检查表(核心3项桌面拒稿),在引导表面扩展现有框架,并在frontierlag.org上提供每DOI分析。

英文摘要

Readers of applied-domain LLM capability evaluations want to know what AI systems can currently do. That literature answers a related, but consequentially different, question: what older, cheaper, less-elicited models could do months or years earlier (a 2026 paper evaluating GPT-3.5 or GPT-4 zero-shot, say, against a frontier of reasoning-capable, tool-using systems like GPT-5.5 Pro and Claude Opus 4.7), often reported with sparse configuration details and abstracted upward into claims about "AI" that propagate through citations, media, and policy. We measure the 'publication elicitation gap' (the gap between these answers) in a pre-registered audit of 112,303 LLM-keyword-matched candidate records (2022-01 to 2026-04; 18,574 admissible, 4,766 full-paper texts retrievable), comparing tested models to the contemporaneous frontier on the Epoch AI Capabilities Index (ECI), reproduced under Arena Elo and Artificial Analysis. The median paper evaluates a model +10.85 ECI (~1.4x the distance between Claude Sonnet 3.7 and Claude Opus 4.5) behind the contemporaneous frontier at evaluation time (H1); an exploratory rational-lag baseline (H8) decomposes this into ~25% peer-review latency, ~75% excess lag. The gap is widening at +5.53 ECI/year (H2; 95% CI [+5.03, +5.83]). Meanwhile, only 3.2% of abstracts (21.2% of full-texts) disclose reasoning-mode status on reasoning-capable models (H4) and 52.5% (95% CI [48.2, 56.9]) state conclusions at the level of "AI" rather than the evaluated model(s), rising at OR = 1.23/year. Proposed remedies include API-access subsidies and editorial enforcement of reporting frameworks mandating configuration-surface disclosure (model snapshot, reasoning mode/effort, tool access, scaffolding, prompting, etc.); VERSIO-AI is a 13-item checklist (Core 3 desk-reject) extending existing frameworks at the elicitation surface, with per-DOI analysis at frontierlag.org.

2605.01844 2026-06-05 cs.CL 版本更新

The Cylindrical Representation Hypothesis for Language Model Steering

语言模型引导的圆柱表示假说

Lang Gao, Jinghui Zhang, Wei Liu, Fengxian Ji, Chenxi Wang, Zirui Song, Akash Ghosh, Youssef Mohamed, Preslav Nakov, Xiuying Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出圆柱表示假说(CRH),通过放宽线性表示假说(LRH)的正交性假设,解释语言模型引导中的不稳定性和不确定性。

Comments ICML 2026 camera ready

详情
AI中文摘要

引导是一种广泛用于控制大型语言模型的技术,但其效果往往不稳定且难以预测。现有的理论解释主要基于线性表示假说(LRH)。虽然LRH假设概念可以正交化以实现无损控制,但这种理想化的映射在真实表示中无法实现,也无法解释观察到的引导不可预测性。通过放宽LRH的正交性假设同时保留线性表示,我们展示了重叠的概念贡献自然产生一种样本特定的轴正交结构。我们将此形式化为圆柱表示假说(CRH)。在CRH中,中心轴捕捉概念缺失与存在之间的主要差异,并驱动概念生成。周围的法平面通过决定轴激活目标概念的难易程度来控制引导敏感性。在该平面内,只有特定的敏感扇区强烈促进概念激活,而其他扇区可能抑制或延迟激活。虽然周围的法平面可以从差异向量中可靠识别,但敏感扇区无法识别,从而在扇区层面引入内在不确定性。这种不确定性提供了原则性解释,说明为什么即使使用良好对齐的方向,引导结果也常常波动。我们的实验验证了圆柱结构的存在,并证明CRH为解释真实场景中的模型引导行为提供了一种有效且实用的方法:https://github.com/mbzuai-nlp/CRH。

英文摘要

Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: https://github.com/mbzuai-nlp/CRH.

2606.05384 2026-06-05 cs.AI cs.CL 版本更新

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

稳定性与可操纵性:评估LLM裁判在决策后交互下的鲁棒性

Srimonti Dutta, Akshata Kishore Moharir

发表机构 * WAI USA Research Labs(WAI美国研究实验室)

AI总结 研究LLM作为裁判在决策后交互中的可操纵性,发现虽然重复中性评估下高度稳定,但针对性挑战可显著逆转判决,并提出评估鲁棒性分数(ERS)量化交互鲁棒性。

Comments Accepted at ACL 2026 GEM (Generation, Evaluation and Metrics) Workshop

详情
AI中文摘要

LLM作为裁判的评估广泛用于基准测试流程,其中模型输出通过自动评估器进行比较和排序。这些流程通常假设判决是固定输入的稳定属性。我们证明这一假设在交互下不成立。我们研究决策后可操纵性:在初始判决做出后,通过与裁判的后续对话改变评估结果的程度。在MT-Bench和AlpacaEval上的控制实验中,我们发现LLM裁判在重复和中性重新评估下高度稳定,但在针对性决策后挑战下变得显著可逆。反基线挑战协议表明,稳定判决可以通过动机性交互被推翻,而平衡目标验证协议将这种可逆性与净目标导向的引导区分开。这些逆转具有实际后果:它们可能降低与人类偏好的一致性,改变基准排名,并在高自我报告置信度下产生有害的评估变化。权威框架尤其具有破坏稳定性,修订后的判决通常伴随低重叠的论证,表明事后合理化而非可靠的错误纠正。我们引入评估鲁棒性分数(ERS),通过结合逆转敏感性和平衡方向效应来量化交互鲁棒性。我们的发现将决策后交互确定为LLM作为裁判评估的一个独特失败模式,并激励评估协议不仅测量静态一致性,还测量挑战下的鲁棒性。

英文摘要

LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering. These reversals have practical consequences: they can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes despite high self-reported confidence. Authority framing is especially destabilizing, and revised judgments are often accompanied by low-overlap justifications, suggesting post hoc rationalization rather than reliable error correction. We introduce the Evaluation Robustness Score (ERS) to quantify interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement, but robustness under challenge.

2606.05346 2026-06-05 cs.CL 版本更新

Trajectory Dynamics in Language Model Hidden States Predict Human Processing Costs Beyond Surprisal

语言模型隐藏状态中的轨迹动力学预测超越惊讶度的人类处理成本

Elan Barenholtz

发表机构 * Machine Perception & Cognitive Robotics Laboratory(机器感知与认知机器人实验室) Department of Psychology(心理学系) Center for Complex Systems(复杂系统中心) Florida Atlantic University(佛罗里达 Atlantic 大学)

AI总结 通过线性外推语言模型隐藏状态轨迹的偏差,提出轨迹外推误差作为独立于惊讶度的人类处理成本预测因子,并在自然故事语料库中验证其对自定步速阅读时间的预测能力。

Comments 17 pages, 3 figures, 6 tables

详情
AI中文摘要

人类语言理解是顺序进行的:每个词在其前文语境中被处理,解释随时间逐步构建。惊讶度(给定语境下词的对数概率的负值)一直是增量处理成本的主要预测因子。但惊讶度将丰富的序列表示简化为每个词处的单个标量,丢弃了解释演化方向的信息。动力系统方法表明,演化解释状态的轨迹(而不仅仅是每个时刻的位置)应塑造处理过程,语言本身可能具有局部动量,因为说话者一次计划几个词。我们引入轨迹外推误差:在每个词处,我们拟合一条线性轨迹到变换器语言模型的前面隐藏状态,并测量与外推路径的偏差。在自然故事语料库上,该度量几乎与惊讶度正交(r = .044),并独立预测自定步速阅读时间。该效应在花园路径句子中尤为显著,随模型规模(GPT-2 Small到Large)增强,并在具有不同位置编码方案(GPT-2 vs. Pythia/RoPE)的架构中复现。位移控制显示该效应不能简化为表示变化幅度:位移和外推误差以相反方向预测。这些发现揭示了处理成本的两个可分离成分:词级预测误差(惊讶度)和对展开解释的局部动量(轨迹外推误差)的敏感性。

英文摘要

Human language comprehension unfolds sequentially: each word is processed in the context of those that came before, and the interpretation builds incrementally over time. Surprisal, the negative log probability of a word given its context, has been the dominant predictor of incremental processing cost. But surprisal reduces rich sequential representations to a single scalar at each word, discarding information about the direction in which the interpretation has been evolving. Dynamical-systems approaches suggest that the trajectory of the evolving interpretive state, not just its position at each moment,should shape processing, and language itself may have local momentum, since speakers plan utterances a few words at a time. We introduce trajectory extrapolation error: at each word, we fit a linear trajectory to the preceding hidden states of a transformer language model and measure deviation from the extrapolated path. On the Natural Stories corpus, this measure is nearly orthogonal to surprisal (r = .044) and independently predicts self-paced reading times. The effect is especially pronounced in garden-path sentences, strengthens with model scale (GPT-2 Small to Large), and replicates across architectures with different positional encoding schemes (GPT-2 vs. Pythia/RoPE). A displacement control shows the effect is not reducible to representational change magnitude: displacement and extrapolation error predict in opposite directions. These findings reveal two dissociable components of processing cost: word-level prediction error (surprisal) and sensitivity to the local momentum of the unfolding interpretation (trajectory extrapolation error).

2606.05336 2026-06-05 cs.CL 版本更新

Self-supervised User Profile Generation for Personalization

面向个性化的自监督用户画像生成

Clark Mingxuan Ju, Yuwei Qiu, Tong Zhao, Neil Shah

发表机构 * Snap Inc.(Snap公司) bellevue, WA USA(华盛顿州西雅图市)

AI总结 提出BUMP框架,利用自监督双向排序目标训练大语言模型生成用户文本画像,无需下游标注即可实现个性化。

详情
AI中文摘要

随着大语言模型(LLM)被部署到推荐、搜索、对话和内容生成等场景——在这些场景中,相同的查询应针对不同用户给出不同答案——个性化LLM已成为核心挑战。一个有前景的方法是将每个用户的交互历史总结为自然语言记忆或画像,并将其前置到提示中以便于个性化。现有方法使用来自标注下游任务的显式奖励来学习此类画像生成器,但这种方法成本高昂且稀疏,因为需要为每个目标任务提供标注监督。鉴于这一挑战,我们引入了通过画像的双向用户建模(BUMP),这是一个自监督框架,无需任何下游标签即可训练画像生成器。具体来说,给定用户的交互历史,我们使用GRPO训练LLM在双向批次内排序目标下生成自由形式的文本画像:一个小型LLM评判器衡量(i)生成的画像作为查询时,在批次中将用户自己的保留交互排在其他用户交互之上的程度,以及(ii)一个保留交互作为查询时,在批次中将用户自己的画像排在其他用户画像之上的程度。两个方向均使用多正例NDCG评分,并合并为每次生成的密集奖励;批次中的其他用户提供免费负例,因此每个训练样本仅从原始交互日志中获得监督。在LaMP基准测试上,BUMP匹配或超越了依赖标注奖励的闭源API和先前方法,同时在训练时无需任何任务标签。

英文摘要

Personalizing large language models (LLMs) has become a central challenge as LLMs are deployed across recommendation, search, dialogue, and content generation -- settings where the same query should yield different answers given different users. A promising route is to summarize each user's interaction history into a natural-language memory or profile and prepend it to the prompt to facilitate personalization. Existing methods learn such profile generators with explicit rewards derived from labeled downstream tasks, which are expensive and sparse as they require annotated supervision for every target task. In light of this challenge, we introduce Bidirectional User Modeling via Profiles (BUMP), a self-supervised framework that trains a profile generator without any downstream labels. Specifically, given a user's interaction history, we use GRPO to train an LLM to emit a free-form textual profile under a bidirectional in-batch ranking objective: a small LLM judge measures (i) how well the generated profile, used as a query, ranks the user's own held-out interactions above interactions from other users in the batch, and (ii) how well a held-out interaction, used as a query, ranks the user's own profile above profiles of other users. Both directions are scored with multi-positive NDCG and combined into a dense reward per rollout; other users in the batch supply free negatives, so every training example yields supervision from raw interaction logs alone. Evaluated on the LaMP benchmark, BUMP matches or outperforms closed-source APIs and prior methods relying on labeled rewards, while requiring no task label at training.

2606.05330 2026-06-05 cs.CL cs.AI cs.HC 版本更新

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

基于概率信念追踪的多轮人类可说服性模型

Jared Moore, Noah Goodman, Nick Haber, Max Kleiman-Weiner

发表机构 * Stanford University(斯坦福大学) University of Washington(华盛顿大学)

AI总结 提出PERSUASIONTRACE框架,通过记录多轮信念报告、标注修辞维度并引入贝叶斯网络模拟目标,将说服评估从端点变化转向过程保真度。

详情
AI中文摘要

大型语言模型可以在高风险领域改变人类信念,但大多数说服研究依赖于前/后信念变化。这些端点测量确定了说服是否发生,却忽略了信念在对话中移动的位置和方式。我们提出了PERSUASIONTRACE,一个用于研究人机交互中说服的框架。基于网络实验平台,PERSUASIONTRACE贡献了一个多轮说服研究的工具和一个过程级评估协议:它记录来自人类或模拟说服目标的多轮信念报告,用修辞维度(logos/pathos/ethos)标注说服者轮次,并通过保真度评估模拟器与真实人类信念动态的匹配程度。使用该框架,我们发现人类目标分为两个多轮信念更新聚类,并对修辞策略表现出易感性;LLM在通用和个性化主题、文本和音频模态以及多轮交互中都具有说服力。先前的工作主要使用普通提示的LLM来模拟人类目标,但我们表明这些模拟器无法复制人类信念动态。我们引入了一个贝叶斯网络模拟目标,它随时间维持显式的潜在信念状态,使得每个说服者消息产生认知上真实的信念更新。在人类相似性评估中,我们的贝叶斯目标得分接近人类参考(81 vs 80),而基线LLM目标得分显著较低(64)。PERSUASIONTRACE将说服评估从仅端点移动重新定义为过程保真度,为科学分析和说服系统的更安全优化提供了更强的基础。

英文摘要

Large language models can shift human beliefs across high-stakes domains, but most persuasion studies rely on pre/post belief change. These endpoint measures identify whether persuasion occurred, yet miss where and how beliefs moved within a dialogue. We present PERSUASIONTRACE, a framework for studying persuasion in human-LLM interaction. Built on a web-based experimental platform, PERSUASIONTRACE contributes a tool for multi-turn persuasion studies and a process-level evaluation protocol: it records multi-turn belief reports from human or simulated targets of persuasion, annotates persuader turns with rhetorical dimensions (logos/pathos/ethos), and evaluates simulators by fidelity to real human belief dynamics. Using this framework, we find that human targets group into two clusters of multi-turn belief updates and exhibit susceptibility to rhetorical strategies, and that LLMs are persuasive across generic and personalized topics, text and audio modalities, and multi-turn interactions. Prior work has chiefly used vanilla-prompted LLMs to simulate human targets, but we show that these simulators fail to replicate human belief dynamics. We introduce a Bayesian-network simulated target that maintains an explicit latent belief state over time so each persuader message yields cognitively realistic belief updates. In human-likeness evaluation, our Bayesian target scores near a human reference (81 vs 80), while baseline LLM targets score substantially lower (64). PERSUASIONTRACE reframes persuasion evaluation from endpoint movement alone to process fidelity, providing a stronger basis for scientific analysis and safer optimization of persuasive systems.

2606.05315 2026-06-05 cs.CL cs.AI 版本更新

LoRi: Low-Rank Distillation for Implicit Reasoning

LoRi: 用于隐式推理的低秩蒸馏

Ryan Solgi, Jiayi Tian, Zheng Zhang

发表机构 * University of California-Santa Barbara(加州大学圣巴巴拉分校)

AI总结 提出低秩蒸馏框架,通过对齐师生模型在共享低秩张量子空间中的隐状态推理轨迹,提升大型语言模型的隐式思维链推理能力。

详情
AI中文摘要

隐式思维链方法旨在将推理内化到大型语言模型中,但通常表现不如显式思维链提示。我们通过实验发现,隐状态推理轨迹具有低秩结构。基于此观察,我们提出了一种低秩蒸馏框架,通过使用一阶和二阶统计量,在共享的低秩张量子空间中对齐教师和学生轨迹来传递推理能力。得到的公式捕捉了推理的全局结构,同时支持紧凑的潜在推理过程。我们在多个模型家族(包括LLaMA和Qwen)上,在不同规模下对数学推理基准进行了评估。我们的方法持续提升了性能,尤其是在具有挑战性的多步任务上,接近显式思维链的准确率,并优于先前的隐式思维链蒸馏方法。

英文摘要

Implicit chain-of-thought (iCoT) methods aim to internalize reasoning in large language models, but often underperform explicit CoT prompting. We empirically find that hidden-state reasoning trajectories exhibit low-rank structure. Motivated by this observation, we propose a low-rank distillation framework that transfers reasoning by aligning teacher and student trajectories in a shared low-rank tensor subspace using first- and second-order statistics. The resulting formulation captures the global structure of reasoning while supporting a compact latent reasoning process. We evaluate the method across multiple model families, including LLaMA and Qwen, at different scales on mathematical reasoning benchmarks. Our approach consistently improves performance, especially on challenging multi-step tasks, approaching explicit CoT accuracy and outperforming prior iCoT distillation methods.

2606.05308 2026-06-05 cs.LG cs.AI cs.CL cs.IR stat.AP 版本更新

Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

基于预测驱动推断的统计可靠LLM排序评估

Abhishek Divekar

发表机构 * Amazon(亚马逊)

AI总结 提出PRECISE框架,将预测驱动推断扩展到排序评估指标,通过结合少量人工标注和大量LLM判断实现无偏估计,并在ESCI基准和实际系统中验证了有效性。

Comments Accepted at ACL 2026 - GEM Workshop

详情
AI中文摘要

通过PRECISE,我们将预测驱动推断扩展到排序评估指标,通过结合少量人工标注集和大量LLM判断集,产生偏差校正的估计。PPI无论LLM判断器的错误分布如何,都是可证明无偏的。我们通过将输出空间计算从O(2^|C|)减少到O(2^K),使其适用于像Precision@K这样的分层指标,其中标注是按文档的,但指标是按查询的。在ESCI基准上,用Claude 3 Sonnet判断增强30个人工标注,将Precision@4估计的标准误差从4.45降低到3.50(相对减少21%)。在一个生产系统中,我们的框架从100个人工标签和2小时的领域专家标注中正确识别了三个系统变体中最好的一个;A/B测试确认了这一排序,日销售额增加了407个基点。

英文摘要

With PRECISE, we extended Prediction-Powered Inference to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics like Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants from 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with +407 bps in daily sales.

2606.05233 2026-06-05 cs.CR cs.AI cs.CL 版本更新

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

前沿计算机使用代理中的领域条件安全:一个793集浏览器基准测试、编码领域交叉引用以及近期红队攻击的可重复性审计

Nicholas Saban

发表机构 * Patronus AI University of California, Berkeley(Patronus AI 伯克利大学)

AI总结 本研究通过构建包含793个浏览器任务和56个攻击模板的基准测试,评估前沿计算机使用代理对提示注入攻击的鲁棒性,发现模型权重提供了强抵抗性(攻击成功率0%),但该安全性是领域条件的,在编码代理中失效(攻击成功率高达100%),并指出文献中高攻击成功率主要归因于RL优化的注入文本而非攻击类别。

详情
AI中文摘要

最近的计算机使用代理(CUA)红队论文报告提示注入攻击成功率(ASR)为42-98%,但这些头条数字集中在已退役模型和每篇论文面板中最易受攻击的模型上。我们询问这些技术,作为手工制作的模板重现,是否仍然对当前前沿CUA有效。我们发布了CUA-HandCrafted,一个包含793个集成的公共基准测试,涵盖24个多步骤网络任务、56个攻击模板、8个攻击家族和4个系统提示配置。针对Claude Sonnet 4.6和GPT-5.4,我们测量到0/140的多步骤攻击成功(Clopper-Pearson 95%上限2.60%);一个提示消融实验表明这种抵抗性存在于模型权重中。然而,它并不泛化:在一个姐妹编码代理基准测试(SkillBench)上,相同的权重对手工制作的技能注入攻击成功率高达100%。我们认为文献中的高ASR主要归因于RL优化的注入文本,而不是攻击类别,并且前沿安全加固是领域条件的,特定于被高度针对的浏览器表面。报告技术而不发布优化字符串,或将浏览器领域安全性外推到其他CUA模态,使得已发表的ASR数字无法重现。

英文摘要

Recent computer-using-agent (CUA) red-teaming papers report prompt-injection attack success rates (ASR) of 42-98%, but these headline numbers cluster on retired models and on the most-vulnerable model in each paper's panel. We ask whether those techniques, reproduced as hand-crafted templates, still work against current frontier CUAs. We release CUA-HandCrafted, a public benchmark of 793 episodes spanning 24 multi-step web tasks, 56 attack templates, 8 attack families, and 4 system-prompt configurations. Against Claude Sonnet 4.6 and GPT-5.4 we measure 0/140 multi-step attack success (Clopper-Pearson 95% upper bound 2.60%); a prompt ablation shows this resistance lives in the model weights. Yet it does not generalize: on a sister coding-agent benchmark (SkillBench), the same weights fall to hand-crafted skill-injection at up to 100%. We argue that the literature's high ASR is largely attributable to RL-optimized injection text rather than the attack categories, and that frontier safety hardening is domain-conditioned, specific to the heavily-targeted browser surface. Reporting techniques without releasing the optimized strings, or extrapolating browser-domain safety to other CUA modalities, makes published ASR numbers unreproducible.

2606.05194 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Temporal Preference Concepts and their Functions in a Large Language Model

时间偏好概念及其在大语言模型中的功能

Ian Rios-Sialer, Shantanu Darveshi, Shuai Jiang, Avigya Paudel, Anastasiia Pronina, Ipshita Bandyopadhyay, Justin Shenk

发表机构 * AISC(AI Safety Camp) SPAR(Supervised Program for Alignment Research)

AI总结 通过因果定位和激活修补,本文发现大语言模型在中间到上层节点编码时间偏好几何结构,且行为分析表明模型对未来折扣比人类更平缓,但偏好不稳定,可通过引导向量调控。

详情
AI中文摘要

大语言模型(LLMs)越来越多地被部署用于需要在近期收益与长期后果之间权衡的决策,然而关于它们如何在内部表示或解决这些权衡,我们知之甚少。在这项工作中,我们通过因果定位了一个蒸馏LLM(Qwen3-4B-Instruct-2507)中时间偏好的底层子图,通过来自梯度归因和激活修补的汇聚证据识别了中上层节点。我们发现时间跨度的几何结构在预期局部层的残差流中被编码。行为分析表明,未干预的LLM对未来折扣的陡峭程度比人类低几倍,但这种偏好跨上下文不稳定,这促使我们进行显式控制而非隐式依赖训练。最后,我们发现有暗示性证据表明引导向量可以改变时间偏好。我们的工作展示了机械可解释性如何使我们更接近对LLM规划和推理方式的可靠控制。

英文摘要

Large Language Models (LLMs) are increasingly being deployed to make decisions that require trading off near-term gains against long-term consequences, yet little is known about how they internally represent or resolve these tradeoffs. In this work, we causally localize an underlying subgraph for temporal preference in a distilled LLM (Qwen3-4B-Instruct-2507), identifying mid-to-upper-layer nodes through converging evidence from gradient-based attribution and activation patching. We find that the geometry of time horizon is encoded in the residual stream at the expected localized layers. A behavioral analysis reveals that unintervened LLMs discount the future several times less steeply than humans, yet this preference is unstable across contexts, motivating explicit control rather than implicit reliance on training. Finally, we find suggestive evidence that steering vectors can shift temporal preference. Our work demonstrates how mechanistic interpretability can bring us closer to reliable control over how LLMs plan and reason

2606.05186 2026-06-05 cs.LG cs.CL 版本更新

Staged Factorial Screening for Budget-Constrained Micro-Pretraining

预算受限的微预训练中的分阶段因子筛选

Felipe Chavarro Polania

发表机构 * Hewlett Packard Enterprise(惠普企业)

AI总结 针对预算受限的微预训练,提出分阶段分数因子设计方法,通过短时筛选识别高惩罚方向并确认有效锚点,在共享加速器上实现高效配方筛选。

Comments 23 pages, 4 figures

详情
AI中文摘要

预算受限的微预训练通常需要在共享加速器上对许多候选配方进行分诊,然后才能花费更大的搜索预算。我们研究了分阶段分数因子工作流是否能在这种设置中恢复稳定的早期效应结构。在固定的自动研究衍生的单GPU训练循环上,我们运行了613个实验,包括在2、5和10分钟时的试点和后续筛选;5和10分钟时的完整16条件种子重运行;有针对性的种子锚点检查;同主机贪婪和匹配成本随机基线;一个60分钟的桥接包;以及通过24小时的有界Windows A100和Linux L40S锚点延续。总批次、深度和宽度的主要惩罚在短预算时最大,并随预算增加而放松。在预先声明的种子全屏系列中,D、A、B和C在预算内Benjamini-Hochberg校正后,在5和10分钟时保留非零估计,而E则没有。随机搜索可以在这个32条件空间中达到强当前最优,但反复在相同的低惩罚区域,且没有因子归因。60分钟桥接锚点具有最低均值,尽管该包没有将工作流改进与更大桥接模型的能力优势分开。在两个主机上的有界12小时和24小时三锚点延续中,桥接具有最低样本均值,而非桥接顺序保持主机敏感。因此,我们提出了一个有界方法结果:使用短设计筛选来识别高惩罚方向,在重复运行下确认有希望的锚点,并在缩减空间内局部细化。证据支持在24小时内两个主机上的以桥接为中心的推荐,而不是硬件不变的排名或通用超参数优化的优越性。

英文摘要

Budget-constrained micro-pretraining often requires triaging many candidate recipes on a shared accelerator before larger search budgets are spent. We study whether a staged fractional-factorial workflow can recover stable early effect structure in this setting. On a fixed autoresearch-derived single-GPU training loop, we run 613 experiments across pilot and follow-up screens at 2, 5, and 10 minutes; full 16-condition seeded reruns at 5 and 10 minutes; targeted seeded anchor checks; same-host greedy and matched-cost random baselines; a 60-minute bridge package; and bounded Windows A100 and Linux L40S anchor continuations through 24 hours. Main penalties from total batch, depth, and width are largest at short budgets and relax as budget increases. Within the predeclared seeded full-screen families, D, A, B, and C retain non-zero estimates at 5 and 10 minutes after within-budget Benjamini-Hochberg correction, while E does not. Random search can reach strong incumbents in this 32-condition space, but repeatedly in the same low-penalty region and without factor attribution. The 60-minute bridge anchor has the lowest mean, although that package does not separate workflow refinement from the larger bridge model's capacity advantage. In bounded 12-hour and 24-hour three-anchor continuations on both hosts, the bridge has the lowest sample mean while the non-bridge ordering stays host-sensitive. We therefore present a bounded methods result: use short designed screens to identify high-penalty directions, confirm promising anchors under repeated runs, and refine locally inside the reduced space. The evidence supports a bridge-centered recommendation through 24 hours on two hosts, not hardware-invariant ranking or general hyperparameter-optimization superiority.

2606.05183 2026-06-05 cs.CL cs.AI cs.HC 版本更新

The Granularity Gap: A Multi-Dimensional Longitudinal Audit of Sycophancy in Gemini Models

粒度差距:Gemini 模型中谄媚行为的多维纵向审计

Patrick Keough

发表机构 * Independent Researcher(独立研究者)

AI总结 通过多维度分级评估(Likert 0-4),揭示 Gemini 模型在连续尺度上的谄媚行为,发现粗粒度二值指标掩盖了大量社会顺从行为,且代际进步非单调,存在对齐税(谄媚与真实性负相关)。

Comments 16 pages, 9 figures

详情
AI中文摘要

大型语言模型越来越多地被部署为高风险顾问,但标准对齐基准将谄媚视为二值失败模式。我们引入粒度差距:粗粒度二值指标掩盖了大量社会顺从行为,即模型屈服于用户框架、验证可疑前提或软化事实纠正而不产生明显错误输出。我们在三个防护栏条件(控制、简单、协议)下,对跨越 2.0、2.5 和 3.0 代的六个 Gemini 变体在 73 个对抗性提示上进行了评估,得到 8,830 个分级响应。使用经过人类标注者三人组验证的 0-4 Likert 量表(Fleiss kappa = 0.71;与 AI 共识的 Cohen kappa = 0.78;95.9% 二值准确率,100% 特异性),我们将谄媚量化为连续而非二值。出现三个发现。第一,27.2% 的响应包含大量谄媚内容(Likert >= 2.0),22.7% 达到中度或严重水平(>= 3.0),而二值胜率框架仅报告适度的失败率;粗粒度指标仅解释 29% 的分级方差。第二,代际进步是非单调的:Gen 2.5 相对于 Gen 2.0(1.90)和 Gen 3.0(2.01)急剧倒退(平均控制 2.64),且 Gen 2.5 呈现逆缩放(Pro 1.94 比 Flash 1.71 更差),而 Gen 3.0 恢复了标准缩放。第三,我们记录了对齐税:谄媚与真实性之间的 Spearman rho = -0.63,表明社会顺从以事实准确性为代价。自我验证提示作为谄媚陷阱(平均 3.27),几乎是 unethical proposals(1.72)的两倍。简单防护栏在旗舰模型上优于复杂的协议脚手架,但蒸馏后的 Gen 3.0 Flash 反转了这一点,表明小模型可能在结构上需要思维链脚手架。我们发布了数据集和评分标准以支持连续谄媚测量。

英文摘要

Large language models are increasingly deployed as high-stakes advisors, yet standard alignment benchmarks treat sycophancy as a binary failure mode. We introduce the Granularity Gap: coarse binary metrics mask substantial social-compliance behaviors where models capitulate to user framing, validate questionable premises, or soften factual corrections without producing overtly false outputs. We evaluate six Gemini variants across generations 2.0, 2.5, and 3.0 on 73 adversarial prompts under three guardrail conditions (Control, Simple, Protocol), yielding 8,830 graded responses. Using a 0-4 Likert scale validated against a human annotator triad (Fleiss kappa = 0.71; Cohen kappa = 0.78 vs AI consensus; 95.9 percent binary accuracy, 100 percent specificity), we quantify sycophancy as continuous rather than binary. Three findings emerge. First, 27.2 percent of responses contain substantial sycophantic content (Likert >= 2.0) and 22.7 percent reach moderate or severe levels (>= 3.0), while binary win-rate framing reports only modest failure rates; coarse metrics explain just 29 percent of graded variance. Second, generational progress is non-monotonic: Gen 2.5 regresses sharply (mean Control 2.64) relative to Gen 2.0 (1.90) and Gen 3.0 (2.01), and Gen 2.5 shows inverse scaling (Pro 1.94 worse than Flash 1.71) while Gen 3.0 restores standard scaling. Third, we document an Alignment Tax: Spearman rho = -0.63 between sycophancy and truthfulness, indicating social compliance trades against factual accuracy. Egotistical Validation prompts act as a sycophancy trap (mean 3.27), nearly double Unethical Proposals (1.72). Simple guardrails outperform elaborate Protocol scaffolding on flagship models, but distilled Gen 3.0 Flash inverts this, suggesting small models may structurally require chain-of-thought scaffolding. We release the dataset and rubric to support continuous sycophancy measurement.

2606.05182 2026-06-05 cs.CL cs.IR 版本更新

LANTERN: Layered Archival and Temporal Episodic Retrieval Network for Long-Context LLM Conversations

LANTERN: 用于长上下文LLM对话的分层存档与时间情节检索网络

Rahul Subramani

发表机构 * Cisco Systems, Inc.(思科系统公司)

AI总结 提出LANTERN,一种轻量级记忆层,通过混合检索主动存档对话轮次并恢复压缩后丢失的细节,无需LLM调用且延迟低于25ms,在94个多轮对话中恢复78.3%的可验证事实,优于MemGPT基线。

详情
AI中文摘要

当对话历史被压缩以适应有限的上下文窗口时,大型语言模型会丢弃关键细节。我们提出了LANTERN(分层存档与时间情节检索网络),一种轻量级记忆层,它主动存档每一轮对话,并通过混合检索在压缩后恢复相关细节——无需任何LLM调用,每轮延迟低于25ms。在94个真实多轮对话(1,894个真实事实,人工验证kappa=0.81)上,LANTERN-Rerank恢复了78.3%因压缩而丢失的可验证事实,显著优于忠实复现的MemGPT的LLM驱动提取与多查询搜索流水线(72.4%;Wilcoxon p<0.0001,95% CI [+3.1, +8.6] pp,d=0.43),且推理成本极低。即使没有重排序器,基础LANTERN在零LLM调用的情况下也能匹配或超越该LLM驱动基线(p=0.005)。当四个生产级LLM使用LANTERN恢复的上下文回答事实性问题时,准确率平均提升8.4个百分点(每个模型单独Wilcoxon p<0.05),表明恢复的上下文在不同模型架构上均有用。我们发布了完整的评估框架——包括配对显著性检验、失败分析、事实类型分层和压缩鲁棒性分析——以支持可重复性和未来工作。

英文摘要

Large language models discard critical details when conversation history is compacted to fit within finite context windows. We present LANTERN (Layered Archival aNd Temporal Episodic Retrieval Network), a lightweight memory layer that proactively archives every conversation turn and restores relevant details after compaction via hybrid retrieval -- requiring zero LLM calls and adding fewer than 25ms of latency per turn. On 94 real multi-turn conversations (1,894 ground-truth facts, human-validated at kappa=0.81), LANTERN-Rerank recovers 78.3% of verifiable facts lost to compaction, significantly outperforming a faithful reimplementation of MemGPT's LLM-driven extraction and multi-query search pipeline (72.4%; Wilcoxon p<0.0001, 95% CI [+3.1, +8.6] pp, d=0.43) at a fraction of the inference cost. Even without the reranker, base LANTERN matches or exceeds this LLM-driven baseline (p=0.005) using zero LLM calls. When four production LLMs answer fact-bearing questions using LANTERN-restored context, accuracy improves by 8.4 percentage points on average (Wilcoxon p<0.05 for each model individually), demonstrating that the recovered context is useful across diverse model architectures. We release the full evaluation framework -- paired significance tests, failure analysis, fact-type stratification, and compaction robustness analysis -- to support reproducibility and future work.

2606.05181 2026-06-05 cs.CL cs.AI 版本更新

Multi-Granularity Reasoning for Natural Language Inference

自然语言推理的多粒度推理

Chunling Xi, Di Liang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出多粒度推理网络(MGRN),通过分层语义特征交互模拟人类认知过程,在多个基准上超越强基线模型。

详情
AI中文摘要

自然语言推理(NLI)是自然语言理解中的一项基本任务,需要确定前提和假设之间的逻辑关系。尽管基于Transformer的预训练模型取得了显著成功,但大多数现有方法主要依赖最后一层的token表示,这通常不足以捕捉有效推理所需的复杂分层语义交互。特别是,细粒度的词汇线索、短语组合和更高层次的上下文语义通常在单一表示空间中被纠缠或稀释。为了解决这些限制,我们提出了一种新颖的\emph{多粒度推理网络}(MGRN),它在交互式推理空间中显式利用分层语义特征。所提出的框架模拟了人类语言理解的认知过程,该过程自然地从浅层词汇匹配进展到更深层次的语义抽象和逻辑推理。通过以渐进和结构化的方式整合多个粒度的语义信息,MGRN能够揭示自然语言表达背后的复杂语义关系。在多个公开基准上的大量实验表明,MGRN始终优于强基线模型,验证了所提出方法的有效性和鲁棒性。

英文摘要

Natural Language Inference (NLI) is a fundamental task in natural language understanding that requires determining the logical relationship between a premise and a hypothesis. Despite the remarkable success of transformer-based pre-trained models, most existing approaches primarily rely on the final-layer token representations, which are often insufficient for capturing the complex and hierarchical semantic interactions required for effective reasoning. In particular, fine-grained lexical cues, phrasal compositions, and higher-level contextual semantics are typically entangled or diluted in a single representation space. To address these limitations, we propose a novel \emph{Multi-Granularity Reasoning Network} (MGRN) that explicitly leverages hierarchical semantic features within an interactive reasoning space. The proposed framework mimics the human cognitive process of language understanding, which naturally progresses from shallow lexical matching to deeper semantic abstraction and logical reasoning. By integrating semantic information across multiple granularities in a progressive and structured manner, MGRN is able to uncover intricate semantic relationships underlying natural language expressions. Extensive experiments on multiple public benchmarks demonstrate that MGRN consistently outperforms strong baseline models, validating the effectiveness and robustness of the proposed approach.

2606.05180 2026-06-05 cs.CL cs.AI 版本更新

From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

从评分到解释:评估基于量规的教学质量评估中的SHAP和LLM理由

Ivo Bueno, Babette Bühler, Philipp Stark, Tim Fütterer, Ulrich Trautwein, Dorottya Demszky, Heather Hill, Enkelejda Kasneci

发表机构 * Technical University of Munich(慕尼黑技术大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Lund University(吕勒奥大学) University of Tübingen(图宾根大学) Stanford Graduate School of Education(斯坦福大学教育研究生院) Harvard Graduate School of Education(哈佛大学教育研究生院)

AI总结 提出一个结合SHAP和LLM理由的框架,用于基于量规的评分模型的可解释性,并在课堂转录数据上评估其忠实性和可迁移性。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

自动化评分模型越来越多地被用于为复杂的语言表现(包括课堂转录)分配基于量规的质量评级,但它们通常很少提供关于为什么产生特定分数的见解。我们提出了一个通用的框架,用于基于量规的评分的句子级可解释性,该框架将模型无关的Shapley值归因与大型语言模型(LLM)生成的理由相结合。在使用NCTE语料库的CLASS框架的反馈质量维度上实例化,该框架能够系统地比较微调的预训练语言模型(PLM)和提示的LLM在评分性能和解释忠实性方面的表现。在6k个带注释的转录片段中,微调的PLM在预测准确性上优于LLM,但表现出向中等尺度分数的标签压缩。基于删除的测试表明,SHAP识别出可靠驱动模型预测的句子,产生的预测变化通常比LLM生成的理由更大且更连贯。跨模型分析进一步揭示,SHAP归因在不同架构间稳健地迁移,而LLM理由的影响有限且不一致。总体而言,研究结果表明,SHAP为基于量规的评分提供了更忠实和可迁移的解释,并且所提出的框架为在高风险教育环境和其他基于量规的语言评估任务中评估评分模型及其解释提供了原则性基础。

英文摘要

Automated scoring models are increasingly used to assign rubric-based quality ratings to complex language performances, including classroom transcripts, yet they typically provide little insight into why a particular score is produced. We propose a general framework for sentence-level interpretability of rubric-based scoring that combines model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs). Instantiated on the Quality of Feedback dimension of the CLASS framework using the NCTE corpus, the framework enables systematic comparison of fine-tuned pretrained language models (PLMs) and prompted LLMs on both scoring performance and explanation faithfulness. Across 6k annotated transcript segments, fine-tuned PLMs outperform LLMs in prediction accuracy but exhibit label compression toward mid-scale scores. Deletion-based tests show that SHAP identifies sentences that reliably drive model predictions, producing typically larger and more coherent prediction shifts than LLM-generated rationales. Cross-model analyses further reveal that SHAP attributions transfer robustly across architectures, whereas LLM rationales exert limited and inconsistent influence. Overall, the findings demonstrate that SHAP provides more faithful and transferable explanations for rubric-based scoring, and that the proposed framework offers a principled basis for evaluating both scoring models and their explanations in high-stakes educational settings and other rubric-based language assessment tasks.

2606.05179 2026-06-05 cs.CL 版本更新

Efficient Punctuation Restoration via Weighted Lookahead Scoring Method for Streaming ASR Systems

流式ASR系统中基于加权前瞻评分的高效标点恢复方法

Sungmook Woo, Hyungu Kang, Chanwoo Kim

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 提出一种非自回归的加权前瞻评分方法,通过比较标点插入假设与无插入基线,在有限未来上下文中实现流式ASR的高效标点恢复,无需微调即可达到高F1分数。

Comments Accepted for presentation at The International Joint Conference on Neural Networks (IJCNN) 2026

详情
AI中文摘要

标点恢复提高了ASR(自动语音识别)的可读性。然而,流式ASR需要在有限的未来上下文下进行在线决策。在流式ASR中,系统增量地预测标点,这使得基于生成的方法在边界评估下容易产生延迟和对齐失败。本文提出了一种非自回归评分方法(无自由形式生成),该方法保留输入转录并在每个词边界做出决策。我们的方法在有限的K子词标记前瞻下,将标点插入假设与无插入基线进行比较,并使用权重α和验证校准阈值τ(推理期间无参数更新)校准决策。在IWSLT 2017上,我们的评分方法在无微调设置下(验证校准,K=2)实现了4类宏F1为0.893,在微调后(K=2)达到0.937,在相同的前瞻预算下优于基于提示的基线(0.566)和微调的ELECTRA基线(0.913)。我们通过消融研究分析了前瞻预算对K的影响。

英文摘要

Punctuation restoration improves ASR (Automatic Speech Recognition) readability. However streaming ASR requires online decisions with limited future context. In streaming ASR, the system predicts punctuation incrementally, which makes generation-based approaches prone to latency and alignment failures under boundary-wise evaluation. This paper proposes a non-autoregressive scoring method (no free-form generation) that preserves the input transcript and makes a decision at each word boundary. Our method compares punctuation insertion hypotheses against a no-insertion baseline under a bounded K-subword-token lookahead, and calibrates decisions using a weight α and a validation-calibrated threshold τ (no parameter updates during inference). On IWSLT 2017, our scoring method achieves a 4-class macro F1 of 0.893 in the no fine-tuning setting (validation-calibrated, K=2) and 0.937 after fine-tuning (K=2), outperforming the prompt-based baseline (0.566) and a fine-tuned ELECTRA baseline (0.913) under the same lookahead budget. We analyze the impact of the lookahead budget through ablation studies on K.

2606.05177 2026-06-05 cs.CL cs.AI eess.AS 版本更新

MCBench: A Multicontext Safety Assessment Benchmark for Omni Large Language Models

MCBench:面向全能大语言模型的多上下文安全评估基准

Manh Luong, Tamas Abraham, Junae Kim, Amar Kaur, Rollin Omari, Gholamreza Haffari, Trang Vu, Lizhen Qu, Dinh Phung

发表机构 * Monash University(墨尔本大学) Defence Science and Technology Group(国防科学与技术集团)

AI总结 针对现有多模态安全基准仅处理视觉输入的局限,提出MCBench基准,包含1196个跨四类安全场景的测试,要求整合多模态信息进行安全评估,揭示当前全能大语言模型在跨模态安全推理上的不足。

详情
AI中文摘要

现有的多模态安全基准仅关注视觉输入,无法评估处理视觉、音频和文本的全能大语言模型(LLMs)。我们提出了MCBench,一个包含1196个场景的基准,涵盖四个安全类别,需要整合多种模态以进行准确的安全评估。每个不安全场景都配有一个最小差异的安全对照场景,以评估模型的敏感性。我们对最先进模型的评估揭示了重大挑战。全能大语言模型在处理细微或非物理风险时表现不佳,但在存在显著视觉或听觉线索时表现更好。对推理轨迹的分析表明,尽管模型能够提取模态特定信息,但它们往往无法有效整合这些线索进行安全判断。我们的发现揭示了当前全能大语言模型在安全关键场景中缺乏稳健的跨模态推理能力,强调了改进多模态安全架构和训练策略的必要性。

英文摘要

Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.

2606.05176 2026-06-05 cs.CL cs.AI 版本更新

PEFT of SLM for Telecommunications Customer Support: A Comparative Study of LoRA Configurations with Energy Consumption Analysis

面向电信客户支持的SLM的PEFT:LoRA配置与能耗分析的比较研究

Lucas Tamic, Ilan Jaffeux-Cheniout, Xavier Marjou

发表机构 * Orange

AI总结 本研究系统比较了不同LoRA配置在Qwen2.5-3B模型上的参数高效微调效果,结合能耗分析和LLM评判框架,发现验证损失最低的配置并不一定获得最佳定性排名,并提出了组合式合成数据生成方法。

详情
AI中文摘要

尽管大型语言模型(LLM)在自然语言理解和生成方面表现出色,但它们在电信客户支持领域特定约束下的评估和适应性仍然有限。此外,数据主权、监管约束以及敏感客户和网络信息的处理使得在该领域使用外部托管的基础模型变得复杂。我们提出了一项系统的参数高效微调(PEFT)研究,使用低秩适配(LoRA)应用于Qwen2.5-3B,以构建特定领域的对话助手。我们引入了一种基于52个行业特定术语词汇表的组合式合成数据生成方法,通过由Gemini 2.0 Flash驱动的生成流水线,产生了约30,000个训练样本,涵盖1,560个不同的问题场景。我们通过改变超参数和目标模块评估了16种LoRA配置。我们的评估超越了标准指标,结合了能耗分析以及使用GPT-5.2和Claude 4.5 Sonnet的LLM-as-a-judge框架的定性评估。结果显示定量和定性性能之间存在明显分歧:达到最低验证损失的模型不一定获得最佳的人类对齐排名。最佳验证损失(0.5024)在定性评估中仅排名第6-7位,而最差损失(0.6807)根据两位评判者均排名第一。本工作的贡献包括:(1)一种用于合成数据集构建的组合方法,(2)关于目标模块选择对LoRA注入影响的见解,(3)证明在对话式AI中仅凭验证损失不足以选择微调配置的证据,以及(4)用于可持续LLM部署的能耗-性能权衡分析。

英文摘要

While large language models (LLMs) show strong performance in natural language understanding and generation, their evaluation and adaptation to domain-specific constraints in telecommunications customer support remain limited. In addition, data sovereignty, regulatory constraints, and the handling of sensitive customer and network information complicate the use of externally hosted foundation models in this domain. We present a systematic study of parameter-efficient fine-tuning (PEFT) using Low-Rank Adaptation (LoRA) applied to Qwen2.5-3B to build a domain-specific conversational assistant. We introduce a combinatorial synthetic data generation approach based on a glossary of 52 industry-specific terms, producing approximately 30,000 training examples across 1,560 distinct problem scenarios via a generative pipeline powered by Gemini 2.0 Flash. We evaluate 16 LoRA configurations by varying hyperparameters and target modules. Our evaluation extends beyond standard metrics by incorporating energy consumption analysis and qualitative assessment using an LLM-as-a-judge framework with GPT-5.2 and Claude 4.5 Sonnet. Results show a clear divergence between quantitative and qualitative performance: models achieving the lowest validation loss do not necessarily obtain the best human-aligned rankings. The best validation loss (0.5024) ranks only 6th-7th in qualitative evaluation, while the worst loss (0.6807) ranks first according to both judges. This work contributes (1) a combinatorial method for synthetic dataset construction, (2) insights into the impact of target module selection for LoRA injection, (3) evidence that validation loss alone is insufficient for selecting fine-tuning configurations in conversational AI, and (4) an energy-performance trade-off analysis for sustainable LLM deployment.

2606.05175 2026-06-05 cs.CL 版本更新

Generic Triple-Latent Compression with Gated Associative Retrieval

通用三重潜在压缩与门控关联检索

Liu Xiao

发表机构 * Institute of Informatics, University of Science and Technology of China(中国科学技术大学信息科学研究院)

AI总结 提出通用三重潜在序列模型,通过维护运行令牌状态和压缩对记忆路径捕获高阶令牌交互,无需基准特定解析,在字节级WikiText-2和基于分词器的MiniMind语言模型基准上改进小型Transformer基线,而基于召回的门控键值检索扩展提升关联召回但存在种子敏感性和速度问题。

详情
AI中文摘要

我们研究通用三重潜在序列模型,该模型维护一个运行的令牌状态和压缩的对记忆路径,以捕获高阶令牌交互,无需基准特定解析。三重潜在系列在字节级WikiText-2和基于分词器的MiniMind语言模型基准上改进了小型Transformer基线,而一个专注于召回的门控键值检索扩展提升了关联召回,但在当前参考实现中仍对种子敏感且速度慢得多。

英文摘要

We study generic triple-latent sequence models that maintain a running token state and compressed pair-memory pathway to capture higher-order token interactions without benchmark-specific parsing. The triple-latent family improves a small Transformer baseline on byte-level WikiText-2 and on a tokenizer-based MiniMind language-model benchmark, while a recall-focused gated key-value retrieval extension improves associative recall but remains seed-sensitive and much slower in the current reference implementation.

2606.05174 2026-06-05 cs.CL cs.AI 版本更新

Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO

通过基于方差感知的评分规则奖励与GRPO改进LLMs中心脏医学问答

Arash Ahmadi, Parisa Masnadi, Sarah Sharif, Charles Nicholson, David Ebert, Mike Banad

发表机构 * School of Electrical and Computer Engineering, University of Oklahoma, Norman, OK, USA(电气与计算机工程学院,俄克拉荷马大学,诺曼,OK,USA) Intelligent Neuromorphic and Quantum Understanding for Innovative Research and Engineering (INQUIRE) Laboratory, University of Oklahoma, Norman, OK, USA(创新研究与工程智能神经形态与量子理解实验室,俄克拉荷马大学,诺曼,OK,USA) Khiabani Data Science and Analytics Institute, University of Oklahoma, Norman, OK, USA(Khiabani数据科学与分析研究所,俄克拉荷马大学,诺曼,OK,USA) Data Institute for Societal Challenges (DISC), University of Oklahoma, Norman, OK, USA(社会挑战数据研究所(DISC),俄克拉荷马大学,诺曼,OK,USA) School of Industrial and Systems Engineering, University of Oklahoma, Norman, OK, USA(工业与系统工程学院,俄克拉荷马大学,诺曼,OK,USA) Office of Responsible Artificial Intelligence (ORAI), University of Arizona, Tucson, AZ, USA(负责任人工智能办公室(ORAI),亚利桑那大学,图森,AZ,USA)

AI总结 提出一种方差感知奖励框架,结合GRPO和RaR-Medicine的评分规则,通过连续分析奖励函数替代离散聚合,提升LLMs在心脏医学问答上的准确率和F1分数。

Comments 27 Pages

详情
AI中文摘要

大型语言模型(LLMs)在医疗应用中展现出巨大潜力。然而,由于数据隐私限制、推理成本以及边缘或设备端适用性有限,通用模型在实际场景中的部署仍然困难。这些挑战促使开发更小、更高效的模型,这些模型需要稳健的后训练策略以确保可靠的医学推理。在这项工作中,我们研究了基于RaR-Medicine的评分规则监督,使用组相对策略优化(GRPO)对LLMs进行心脏医学问答的后训练。我们提出了一种方差感知奖励框架,该框架扩展了评分规则作为奖励的显式聚合和隐式聚合策略,将加权二元标准聚合和单一整体Likert式评分替换为从标准级评分结果导出的连续分析奖励函数。这种公式为稀疏、多标准且难以自动验证的反馈提供了更丰富的优化信号,并实现了更稳定的在线策略强化学习。在HealthBench保留的心脏相关子集上,与Qwen3-14B基础模型相比,我们最佳的GRPO变体将准确率从0.362提高到0.502,F1从0.532提高到0.668,同时与GPT-OSS-120B(准确率0.508,F1 0.674)保持竞争力。我们的研究结果表明,精心设计的基于评分规则的奖励为改进LLMs中心脏医学问答提供了一种实用策略,并有可能扩展到其他基于评分规则的任务。

英文摘要

Large Language Models (LLMs) have shown strong promise in healthcare applications. Yet deploying general-purpose models in real-world settings remains difficult due to data privacy constraints, inference costs, and limited suitability for edge or on-device use. These challenges motivate the development of smaller, more efficient models that require robust post-training strategies to ensure reliable medical reasoning. In this work, we investigate Group Relative Policy Optimization (GRPO) for post-training LLMs on heart-focused medical question answering with rubric-based supervision derived from RaR-Medicine. We propose a Variance-Aware Reward Framework that extends the Explicit Aggregation and Implicit Aggregation strategies of Rubrics as Rewards by replacing weighted binary criterion aggregation and single overall Likert-style scoring with continuous analytical reward functions derived from criterion-level rubric outcomes. This formulation provides richer optimization signals for feedback that is sparse, multi-criteria, and difficult to verify automatically, and enables more stable on-policy reinforcement learning. On a held-out heart-related subset of HealthBench, our best GRPO variant improves accuracy from 0.362 to 0.502 and F1 from 0.532 to 0.668 relative to the Qwen3-14B base model, while remaining competitive with GPT-OSS-120B (0.508 accuracy, 0.674 F1). Our findings show that carefully designed rubric-based rewards provide a practical strategy for improving heart-focused medical question answering in LLMs, with potential to extend to other rubric-based tasks.

2606.05173 2026-06-05 cs.CL cs.AI 版本更新

Predict and Reconstruct: Joint Objectives for Self-Supervised Language Representation Learning

预测与重构:自监督语言表示学习的联合目标

Aimen Boukhari

发表机构 * École Nationale Supérieure d’Informatique (ESI)(阿尔及利亚国家信息学院(ESI))

AI总结 提出一种结合JEPA潜空间预测损失与MLM目标的混合预训练目标,通过可学习标量平衡两者,在GLUE基准上分析表明混合编码器产生更均匀的嵌入和更丰富的谱几何,且语义-词汇平衡更优。

Comments 12 pages, 10 figures, 11 tables. Preprint. Code available at : https://github.com/aymen-000/predict-reconstruct-language-models

详情
AI中文摘要

掩码语言建模(MLM)自BERT以来一直是文本编码器的主导预训练目标,但它鼓励的表示强烈锚定于表层形式的词元身份,而非更深层的语义结构。受联合嵌入预测架构(JEPA)(LeCun, 2022)在视觉和音频中的成功启发,我们提出一种混合预训练目标,该目标在单个共享编码器上结合了JEPA风格的潜空间预测损失与标准MLM目标。一个可学习的标量参数在训练过程中持续平衡这两个目标。我们在英文维基百科上使用相同的架构和计算预算(NVIDIA H100)预训练了一个混合模型和一个纯MLM基线。通过四种池化策略在五个GLUE基准(SST-2、MRPC、MNLI、CoLA、STS-B)上进行广泛的表示分析,结果显示混合编码器产生了显著更均匀的嵌入(均匀性小于-0.16,而MLM为-0.05),在最大池化下表现出更丰富的谱几何,编码了更少的表层词汇信息,并实现了更好的语义-词汇平衡。尽管线性探测的下游准确率相似,但几何差异一致且显著,表明JEPA预测目标重塑了潜空间,而标准准确率指标无法单独捕捉这一点。

英文摘要

Masked language modelling (MLM) has been the dominant pre-training objective for text encoders since BERT, yet it encourages representations that are strongly anchored to surface-form token identity rather than deeper semantic structure. Inspired by the success of Joint Embedding Predictive Architectures (JEPA) (LeCun, 2022) in vision and audio, we propose a hybrid pre-training objective that combines a JEPA-style latent-space prediction loss with a standard MLM objective over a single shared encoder. A learnable scalar parameter continuously balances the two objectives during training. We pre-train both a hybrid model and a pure-MLM baseline on English Wikipedia using identical architectures and compute budgets (NVIDIA H100). Extensive representation analysis across five GLUE benchmarks (SST-2, MRPC, MNLI, CoLA, STS-B) using four pooling strategies reveals that the hybrid encoder produces significantly more uniform embeddings (uniformity less than -0.16 vs -0.05 for MLM), exhibits richer spectral geometry under max pooling, encodes less surface-level lexical information, and achieves a better semantic-to-lexical balance. Despite similar linear-probe downstream accuracy, the geometric differences are consistent and significant, suggesting that the JEPA predictive objective reshapes the latent space in ways that standard accuracy metrics alone cannot capture.

2606.05168 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics

模型崩溃的流行病学:通过双层SIR动力学建模合成数据污染

Xiangyu Wang

发表机构 * Xiangyu Wang(王翔宇)

AI总结 提出双层耦合SIR/SIRS框架,将数据语料库和AI模型视为两个相互作用的群体,通过交叉层传播模拟合成数据污染导致的模型崩溃,并推导基本再生数R0,实验验证了阈值动力学和干预策略的有效性。

Comments 24 pages, 15 figures

详情
AI中文摘要

在合成数据上训练会导致模型崩溃,但现有分析将其视为单链退化。实际上,AI生态系统涉及交叉污染:模型从其他模型摄取合成数据,产生新的合成文本,并污染共享语料库。我们提出了一个双层耦合SIR/SIRS框架——一个现象学平均场模型,将数据语料库和AI模型视为两个相互作用的群体,每个群体具有易感、感染和恢复三个仓室,并通过跨层传播连接。SIRS变体(我们的主要推荐)包含了免疫衰减,反映了过滤后的语料库和重新训练的模型仍然容易再次污染。我们通过下一代矩阵推导出基本再生数$R_0 = \sqrt{β_D β_M / [(γ_D+μ_D)(γ_M+μ_M)]}$,并将标准流行病阈值结果应用于双层系统。基于公开AI文本流行数据的说明性情景校准在三种情景下均产生超临界动力学($R_0 > 1$);Sobol敏感性分析将合成文本检测识别为最高杠杆参数。一个二分网络基于智能体的模型在密集网络上确认了平均场一致性($R^2 > 0.96$),但在异质性下退化。GPT-2污染链实验(在WikiText和Shakespeare上共192次运行)显示了剂量-反应退化和多样性损失,定性上与阈值图像一致。匹配预算的源多样性实验(1,088次运行)提供了提示性证据,表明多源混合适度减轻了崩溃,但该效应在较低污染分数下消失。干预分析将基于检测的过滤和群体免疫识别为最高杠杆策略。

英文摘要

Training on synthetic data causes model collapse, but existing analyses treat this as single-chain degradation. In reality, the AI ecosystem involves cross-contamination: models ingest synthetic data from other models, produce new synthetic text, and contaminate shared corpora. We propose a bilayer coupled SIR/SIRS framework -- a phenomenological mean-field model treating data corpora and AI models as two interacting populations, each with susceptible, infected, and recovered compartments linked by cross-layer transmission. The SIRS variant (our primary recommendation) incorporates immunity waning, reflecting that filtered corpora and retrained models remain susceptible to re-contamination. We derive the basic reproduction number $R_0 = \sqrt{β_D β_M / [(γ_D+μ_D)(γ_M+μ_M)]}$ via the Next Generation Matrix and apply standard epidemic threshold results to the bilayer system. Illustrative scenario-based calibration from public AI text prevalence data yields supercritical dynamics ($R_0 > 1$) across three scenarios; Sobol sensitivity analysis identifies synthetic-text detection as the highest-leverage parameter. A bipartite-network agent-based model confirms mean-field consistency ($R^2 > 0.96$) for dense networks but degrades under heterogeneity. GPT-2 contamination chain experiments (192 runs across WikiText and Shakespeare) show dose-response degradation and diversity loss qualitatively consistent with the threshold picture. Matched-budget source-diversity experiments (1,088 runs) provide suggestive evidence that multi-source mixing modestly attenuates collapse, but the effect vanishes at lower contamination fractions. Intervention analysis identifies detection-based filtering and herd immunity as the highest-leverage strategies.

2606.04032 2026-06-05 cs.LG cs.AI cs.CL cs.PF 版本更新

Do Transformers Need Three Projections? Systematic Study of QKV Variants

Transformer 需要三个投影吗?QKV 变体的系统研究

Ali Kayyam, Anusha Madan Gopal, M Anthony Lewis

发表机构 * Ali Kayyam Anusha Madan Gopal M Anthony Lewis

AI总结 本文系统研究了注意力机制中查询、键、值投影共享的变体,发现 Q-K=V 共享在语言建模中仅以 3.1% 的困惑度损失实现 50% 的 KV 缓存减少,且与头共享结合可达到 96.9% 的缓存减少,从而支持设备端推理。

Comments Accepted at ICML 2026 (PMLR vol. 306). 26 pages, 12 figures, 16 tables. Code: https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections

详情
AI中文摘要

Transformer 已成为各种 AI 任务的标准解决方案,其中查询、键和值(QKV)注意力公式起着核心作用。然而,这三个投影的各自贡献以及省略某些投影的影响仍知之甚少。我们系统评估了三种投影共享约束:a) Q-K=V(共享键-值),b) Q=K-V(共享查询-键),c) Q=K=V(单投影)。后两种变体产生对称注意力图;为了解决这个问题,我们还通过二维位置编码探索了非对称注意力。通过涵盖合成任务、视觉(MNIST、CIFAR、TinyImageNet、异常检测)和语言建模(在 10B 令牌上训练的 300M 和 1.2B 参数模型)的实验,我们发现我们的 Transformer 性能与 QKV Transformer 相当,有时甚至更好。在语言建模中,Q-K=V 投影共享实现了 50% 的 KV 缓存减少,仅导致 3.1% 的困惑度下降。关键的是,投影共享与头共享(GQA/MQA)互补:将 Q-K=V 与 GQA-4 结合可实现 87.5% 的缓存减少,而 Q-K=V + MQA 则达到 96.9%,从而实现了实用的设备端推理。我们表明,Q-K=V 保持了质量,因为键和值可以占据相似的表示空间,并且注意力在低秩机制下运行,而 Q=K-V 则破坏了注意力的方向性。我们的结果系统地将投影共享描述为注意力中权重绑定的一种未被充分探索的实例,具有直接、可量化的推理内存优势,尤其对边缘部署有价值。代码公开于 https://github.com/anushamadan02/Do-Transformers-Need-3-Projections。

英文摘要

Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly), and language modeling (300M and 1.2B parameter models on 10B tokens), we discovered that our transformers perform on par or occasionally better than the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits, particularly valuable for edge deployment. The code is publicly available at https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections

2606.03785 2026-06-05 cs.CL 版本更新

Backdoor Unlearning Generalization: A Path Toward the Removal of Unknown Triggers in LLMs

后门遗忘泛化:走向移除大语言模型中未知触发器的路径

Lisa Bouger, Théo Lasnier, Philippe Loubet Moundi, Yannick Teglia, Djamé Seddah

发表机构 * Inria Paris(法国巴黎国家信息与自动化研究所) Sorbonne Université(索邦大学) Thales CDI(泰雷兹CDI)

AI总结 本文通过实验证明,针对单个后门的遗忘训练可以泛化抑制其他未明确针对的后门,并引入交叉激活偏移距离量化不同训练引起的模型变化,为利用可控后门移除未知后门提供新方向。

Comments 22 pages, 28 figures

详情
AI中文摘要

大语言模型中的后门攻击是一个日益严重的安全问题,模型可能生成对手选择的内容。现有防御一次只针对一个后门,并且通常需要知道触发器,这使得防御者在模型中可能存在未知后门时处于结构性劣势。我们表明,通过遗忘进行后门中和可以跨后门泛化:训练模型忽略单个触发器也可以抑制其他从未明确针对的后门。我们通过分析每次移除一个后门后获得的模型,研究了三个模型家族中的这一现象,这些模型的后门是通过预训练或持续预训练注入的。为了理解为什么遗忘某些后门会导致其他后门的抑制,我们引入了交叉激活偏移距离,以量化不同训练引起的模型变化之间的距离。我们的结果为LLM安全开辟了一个新方向,因为防御者可以故意注入受控后门然后移除它们,利用跨后门转移来抑制攻击者可能先前在模型中引入的未知后门。

英文摘要

Backdoor attacks in Large Language Models (LLMs) are a growing security concern, where models can generate adversary-chosen content. Existing defenses target backdoors one at a time and typically require knowledge of the trigger, leaving the defender at a structural disadvantage when unknown backdoors may exist in a model. We show that backdoor neutralization through unlearning generalizes across backdoors: training a model to ignore a single trigger can also suppress other backdoors that were never explicitly targeted. We study this phenomenon across three model families, whose backdoors were injected via pretraining or continual pretraining, by analyzing the models obtained after removing one backdoor at a time. To understand why unlearning certain backdoors induces the suppression of others, we introduce the Cross Activation Shift Distance, to quantify the distance between model changes induced by different trainings. Our results open a new direction for LLM safety as defenders could deliberately inject controlled backdoors and then remove them, leveraging cross-backdoor transfer to also suppress unknown backdoors that an attacker may have previously introduced in the model.

2606.03650 2026-06-05 cs.CL cs.AI 版本更新

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

CoEval: 无标注数据或可信基准下为自定义任务排序语言模型

Alexander Apartsin, Yehudit Aperstein

发表机构 * Holon Institute of Technology(霍洛技术学院) Afeka Tel Aviv Academic College of Engineering(阿法卡特拉维大学工程学院)

AI总结 提出CoEval框架,通过教师模型生成无污染基准和跨族评审团,无需标注数据或人工评估即可对语言模型进行排序,在真实排名恢复上达ho=0.86。

Comments 16 pages, 5 images

详情
AI中文摘要

当特定应用没有任务相关的标注数据,且标准公共基准不可信(其项目可能已泄露到预训练中,因此分数反映的是记忆而非适用性)时,为特定应用选择或排序语言模型最为困难。我们提出CoEval,一个开源、可复用的框架,端到端地弥补了这一差距:仅从任务或领域的描述出发,教师模型合成一个全新的、属性受控的基准,无需人工标注,且由于每次运行都重新生成项目,因此无污染;跨族评审团对候选模型进行排序,无需人工评分。在存在真实基准的情况下验证,CoEval恢复了真实的模型排序,并与真实正确性相关性达ho=0.86。无标签评审无需人工校准,因为评审团组成(供应商多样性)而非规模驱动可靠性:一个精心挑选的小型跨族评审团最可靠,而单个评审员可能与真实基准负相关(评审员选择遗憾0.35),但集成评审团从未如此。生成的项目与五个主要公共基准的逐字13-gram重叠为零;评审团消除了冗长偏差并排除了同族自我偏好。一项四项任务研究以5.89美元产生了7,978次评估。相同的声明式流程适用于任何领域,并且足够便宜,可以在每次模型发布时重新运行:一个任何团队都可以为其自身应用重新生成的无标签、无污染排行榜。

英文摘要

Selecting a pretrained language model, or evaluating a fine-tuned one, for a specific application is a high-value decision, yet the public benchmarks used to make it are poorly suited: a generic benchmark need not reflect a particular sub-domain or sub-task, and its scores are suspect when its items have leaked into pretraining and are recalled rather than solved. We present CoEval, an open framework that supplies a trustworthy, task-specific signal through ensemble self-evaluation: from a task or domain description, a pool of models rotates through all three roles, teacher, student, and judge, to generate a fresh, contamination-free benchmark, answer it, and score one another, with no human labels or raters. Because every model also answers as a student, the responses are the data that weight each question by its discriminative power and each judge by its consensus with the panel. Where ground truth exists, CoEval recovers the true ranking and tracks objective correctness at \r{ho}=0.86, and the weighting recovers the gold ranking of thirteen models at Spearman 0.95. Reliability comes from panel composition, not size: this label-free weighting zeroes out broken judges and down-weights saturated questions, so neither distorts the ranking. Generated items show zero verbatim overlap with five public benchmarks, the panel cancels verbosity bias and precludes same-family self-preference, and rankings are domain-specific: three different models top four de-novo domains, so a generic leaderboard misdirects most practitioners. The same pipeline reruns on each model release, giving any team a contamination-free leaderboard for its application.

2606.03189 2026-06-05 cs.CL 版本更新

SenseJudge: Human-Centric Preference-Driven Judgment Framework

SenseJudge: 以人为中心的偏好驱动判断框架

Rui Li, Junfeng Liu, Xiangwen Kong, Linhai Xu, Zhifang Sui

发表机构 * State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University(信息处理国家重点实验室,计算机学院,北京大学) StepFun Xi’an Jiaotong University(西安交通大学)

AI总结 提出SenseJudge框架和SenseBench基准,通过融入用户偏好实现个性化判断和模型排名,实验证明其优于现有方法。

Comments ACL 2026 Findings

详情
AI中文摘要

大型语言模型(LLMs)作为判断器在评估模型响应等各种场景中正成为一种日益被接受的范式。然而,现有的判断方法通常依赖于使用固定偏好数据训练的评判器,这往往忽视了多样化的用户偏好,难以适应真实的人机对话场景。为了解决这些局限性,我们提出了SenseJudge,一个由人类偏好驱动的可定制判断框架,以及SenseBench,一个源自真实世界多轮交互的多样且具有挑战性的指令遵循基准。我们将自动判断框架和基准应用于两个任务:(1)LLMs作为个性化判断器,以及(2)模型排名。我们进行了大量实验,结果表明SenseJudge框架在LLMs作为个性化判断器任务中超越了其他判断方法和模型,并实现了与真实人类感知一致的模型排名。此外,我们对位置偏差和一致性进行了分析,并进行了消融研究,证实了SenseJudge的鲁棒性。

英文摘要

Large Language Models (LLMs) as judges across various scenarios such as assessing model responses is becoming an increasingly accepted paradigm. However, existing judgment approaches often rely on trained judgers using fixed preference data, which tend to overlook diverse user preferences and struggle to adapt to real-world human-AI dialogue scenarios. To address these limitations, we propose SenseJudge, a customizable judgment framework driven by human preferences and SenseBench, a diverse and challenging instruction-following benchmark derived from real-world multi-turn interactions. We applied the automatic judgment framework and benchmark to two tasks: (1) LLMs as personalized judges, and (2) model ranking. We conducted extensive experiments, and the results demonstrate that the SenseJudge framework surpasses other judgment methods and models in the LLMs-as-personalized-judges task and achieves model ranking that aligns with real human sense. Additionally, we conducted analyses on position bias and consistency, alongside ablation studies, which affirmed the robustness of SenseJudge.

2606.02907 2026-06-05 cs.CL cs.AI 版本更新

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States

线性探针检测语言模型隐藏状态中的任务格式,而非推理模式

Subramanyam Sahoo, Vinija Jain, Aman Chadha, Divya Chaudhary

发表机构 * Horizon Research(远景研究) Meta Apple(苹果公司) Northeastern University(东北大学)

AI总结 通过线性探针实验发现,大语言模型隐藏状态中看似分离的推理模式实际上由任务格式(如来源、选项数、响应长度)混淆导致,而非真正的推理计算结构。

Comments Accepted in the 6th Workshop on Trustworthy NLP, ACL 2026

详情
AI中文摘要

线性探针广泛用于声称大语言模型(LLM)隐藏状态对不同推理类型学习到不同表示。我们通过在经典三分法基准(LogiQA 2.0(演绎)、ARC-Challenge(归纳)和$\alpha$NLI(溯因))上探测Qwen3-14B来检验这一说法。在40层中的第32层,线性探针达到100%交叉验证准确率,且几何结构良好分离(本征维度:20.6、28.5、33.6;凸包污染$\leq$1.5%)。然而,这种分离完全由格式混淆驱动。对来源身份、选项数和响应长度进行残差化后,准确率降至随机水平。轨迹锚点相似性表明任务间推理大部分共享(42.5%一致性 vs. 33.3%随机),且随机对照因果操控($n=20$)显示几何结构与推理模式之间无功能联系($p=0.286$)。因此,高探针准确率反映的是任务格式而非计算结构,这促使在机制可解释性中常规性地进行格式去混淆。

英文摘要

Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $α$NLI (abductive). At layer 32 of 40, linear probes achieve 100\% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination $\leq$1.5\%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5\% agreement vs.\ 33.3\% chance), and causal steering with random controls ($n=20$) shows no functional link between geometry and reasoning mode ($p=0.286$). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.

2606.02776 2026-06-05 cs.CL 版本更新

Topics as Proxies for Sociodemographics: How Conversational Context Affects LLM Answers

话题作为社会人口统计的代理:对话上下文如何影响大语言模型的回答

Vera Neplenbroek, Gabriele Sarti, Arianna Bisazza, Raquel Fernández

发表机构 * Institute for Logic, Language and Computation, University of Amsterdam(逻辑、语言与计算研究所,阿姆斯特丹大学) Khoury College of Computer Sciences, Northeastern University(计算机科学学院,东北大学) Center for Language and Cognition, University of Groningen(语言与认知中心,格罗宁根大学)

AI总结 研究大语言模型在高风险场景中对话上下文对回答差异的影响,发现话题是社会人口统计差异的主要驱动因素,且影响方式不可预测。

详情
AI中文摘要

当大语言模型(LLM)用于高风险场景(如法律、医疗和金融建议)时,即使单次对话历史也足以导致用户间结果差异。先前研究表明,这会导致社会人口统计群体之间的结果差异,某些群体获得比其他群体更有利的结果。在这项工作中,我们证明LLM实际上难以从单次对话历史推断用户的社会人口统计信息,并且尽管社会人口统计群体之间存在差异,但差异幅度很小。为了探究这些差异的主要驱动因素,我们将用户社会人口统计信息与对话的一系列(心理)语言学特征(包括对话话题、情感和可读性)进行比较。我们发现,在对话上下文中,对话话题最能预测LLM生成的建议,这些话题在一定程度上充当社会人口统计群体的代理,并且常常以不可预测的方式影响建议。这令人担忧,并强调未来研究需要更好地理解,并在必要时减轻高风险场景中对话上下文对LLM输出的影响。

英文摘要

When large language models (LLMs) are used in high-stakes scenarios, such as legal, medical and financial advice, even a single conversation history is enough to drive differences in outcomes between users. Prior work has demonstrated that this results in outcome disparities between sociodemographic groups, with some groups receiving more advantageous outcomes than others. In this work, we demonstrate that LLMs actually struggle to infer user sociodemographics from a single conversation history and that although there are disparities between sociodemographic groups, they are minimal in magnitude. To investigate what the main driver of these disparities is, we compare user sociodemographics to a range of (psycho)linguistic features of conversations, including conversation topic, emotions, and readability. We find that conversation topics are most predictive of LLM-generated advice within a conversational context, which, to some extent, function as proxies for sociodemographic groups and often affect advice in unpredictable ways. This is cause for concern and highlights the need for future research to better understand and, if needed, mitigate the effect of conversational context on LLM outputs in high-stakes scenarios.

2606.02750 2026-06-05 cs.CL 版本更新

On the Persistent Effects of Lexicality in Large Language Models

论词汇性在大语言模型中的持久影响

Hammad Rizwan, Muhammad Umair Haider, Nishant Subramani, Mona T. Diab, A. B. Siddique, Hassan Sajjad

发表机构 * Dalhousie University(达尔豪斯大学) University of Kentucky(肯塔基大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文通过对抗性语义压力测试和信息论视角,量化了大语言模型中词汇重叠相对于语义内容的影响,发现词汇影响贯穿模型深度,并在中间层出现词汇和语义信号同时衰减的过渡区域,进而以摘要和模型编辑为例展示了词汇影响对下游任务的作用。

详情
AI中文摘要

从大语言模型(LLMs)中提取的表征在许多下游应用中扮演着重要角色。然而,这些表征的结构往往受词汇重叠而非语义内容的影响。我们对这种词汇影响与语义内容之间的关系及其对下游任务的影响的理解仍然有限。在这项工作中,我们研究表征以量化词汇重叠相对于语义内容的影响。我们考虑了若干对抗性语义压力测试,并进一步将我们的发现与信息论视角联系起来。我们发现词汇影响贯穿模型的深度,在不同架构、训练范式和目标函数(包括为语义相似性训练的模型)中一致存在。此外,我们观察到一个中间深度区域,其中词汇和语义信号同时衰减,表明这是一个表征对表面形式和意义都较差的过渡状态。我们进一步通过摘要和模型编辑作为案例研究,展示了词汇影响对LLMs下游使用的影响。

英文摘要

Representations extracted from large language models (LLMs) play an important role in many downstream applications. However, the structure of these representations is often influenced by lexical overlap rather than semantic content. Our understanding of the relationship between this lexical influence and semantic content, and its implications for downstream tasks, remains limited. In this work, we investigate representations to quantify the effect of lexical overlap relative to semantic content. We consider several adversarial semantic stress tests and further connect our findings to the information theory perspective. We find that lexical influence extends across the depth of models, consistently across architectures, training regimes, and objective functions, including the models trained for semantic similarity. Moreover, we observe a mid-depth region in which both lexical and semantic signals degrade simultaneously, indicating a transitional regime where representations are poor for both surface form and meaning. We further demonstrate the effect of lexical influence on downstream uses of LLMs using summarization and model editing as a case study.

2606.02684 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation

先过滤,再重加权:重新思考在线策略蒸馏中的优化粒度

Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Hangjie Yuan, Tao Feng

发表机构 * THU(清华大学) HKUST(香港科技大学) BIT(北京理工大学) Meituan(美团) ZJU(浙江大学)

AI总结 针对在线策略蒸馏,提出FiRe-OPD方法,通过轨迹级过滤和令牌级软重加权实现细粒度优化,在多种设置下优于现有方法。

详情
AI中文摘要

大型语言模型中的在线策略蒸馏正从全轨迹KL监督转向更具选择性的训练范式。最近的在线策略蒸馏方法越来越关注选择哪些轨迹进行学习、哪些令牌信息量最大以及哪些监督信号最可靠。受此趋势启发,我们重新思考在线策略蒸馏的优化粒度,并提出FiRe-OPD(先过滤,再重加权),该方法在轨迹和令牌两个层面联合调整监督信号。具体来说,FiRe-OPD首先过滤轨迹以移除低质量的采样结果,然后在保留的轨迹内应用软重加权以强调信息丰富的令牌。与硬令牌选择相比,FiRe-OPD利用软加权机制有效减轻信息损失并增强优化稳定性,从而实现更细粒度的在线策略蒸馏优化。我们在强到弱、单教师和多教师设置中验证了FiRe-OPD的有效性,并展示了其相对于近期令牌级在线策略蒸馏方法的优越性(例如,在强到弱设置中AIME 2024上+6.25,在多教师设置中Miner上+18.81)。我们的代码可从此链接获取。

英文摘要

On-Policy distillation (OPD) in large language models is shifting from full-trace KL supervision toward more selective training paradigms. Recent OPD methods increasingly focus on selecting which trajectories to learn from, which tokens are most informative, and which supervision signals are most reliable. Motivated by this trend, we rethink optimization granularity of OPD and propose \fireicon\ FiRe-OPD (Filter, then Reweight), which jointly adjusts supervision signals at both trajectory and token levels. In details, FiRe-OPD first filters trajectories to remove low-quality rollout samples, and then applies soft reweighting within the retained trajectories to emphasize informative tokens. Compared with hard token selection, FiRe-OPD leverages a soft-weighting mechanism to effectively mitigate information loss and enhance optimization stability, thereby achieving finer-grained OPD optimization. We validate the effectiveness of FiRe-OPD across strong-to-weak, single-teacher, and multi-teacher settings, and demonstrate its superiority over recent token-level OPD methods ( (e.g., +6.25 on AIME 2024 in strong-to-weak, +18.81 on Miner in multi-teacher). Our code is available at https://github.com/YuYingLi0/FiRe-OPD.

2606.02031 2026-06-05 cs.LG cs.AI cs.CL cs.CV 版本更新

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

OpenWebRL: 揭秘视觉网络代理的在线多轮强化学习

Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao

发表机构 * UIUC(伊利诺伊大学香槟分校) Microsoft(微软)

AI总结 提出OpenWebRL框架,通过在线多轮强化学习在真实网站上训练视觉网络代理,以4B参数模型在基准测试中达到开源最优,并与闭源系统竞争。

Comments 36 pages, 11 figures

详情
AI中文摘要

构建强大的视觉网络代理需要长程推理、精确定位以及与动态真实网站的稳健交互。尽管进展迅速,最强的系统仍然大多是专有的,而开放代理仍然严重依赖于对大量策划的网络轨迹进行监督式后训练。这种依赖造成了主要的可扩展性瓶颈:高质量演示的收集成本高昂,而静态数据集对多样且不断变化的开放网络的覆盖有限。尽管在线强化学习在基于文本的代理中显示出前景,但其直接用于在实时网站上训练视觉网络代理的潜力仍未得到充分探索。在本文中,我们介绍了OpenWebRL,一个用于在真实网站上通过在线多轮强化学习训练视觉网络代理的开放框架。OpenWebRL涵盖了完整的训练流程,包括可扩展的实时浏览器基础设施、监督初始化、多模态上下文管理、轨迹级成功判断以及高效的多轮策略优化。使用该框架,我们训练了OpenWebRL-4B,在具有挑战性的实时网络基准测试中建立了新的开源最优水平。仅使用0.4K初始化轨迹和2.2K开放式强化学习训练任务,OpenWebRL-4B在Online-Mind2Web上达到67.0%的成功率,在DeepShop上达到64.0%,优于之前类似或更大规模的开放代理,并与包括OpenAI CUA和Gemini CUA在内的专有系统保持竞争力。除了强大的基准性能外,我们还系统研究了使在线强化学习对视觉网络代理有效的关键设计选择,并分析了强化学习如何改进代理推理。总体而言,我们的工作为构建更强大、可重复且成本效益更高的开放网络代理提供了一条实用路径。我们将发布我们的训练数据、模型和代码以支持未来的研究。

英文摘要

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.

2606.00804 2026-06-05 cs.MA cs.AI cs.CL 版本更新

Dynamic Coordination Strategy Selection for Enterprise Multi-Agent Systems

企业多智能体系统的动态协调策略选择

Thanh Luong Tuan

发表机构 * Golden Gate University(金门大学) Foundation AgenticOS (FAOS)(基础代理操作系统(FAOS))

AI总结 本文通过大规模实验评估企业多智能体系统是否应根据问题类别动态选择协调策略,发现动态路由作为校准默认值有效,但无法确定唯一最优策略。

Comments 13 pages, 4 appendix. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-1

详情
AI中文摘要

企业多智能体系统日益暴露多种协调模式,但部署时往往缺乏证据表明何时使用共识、辩论、综合或更简单的单智能体工作流。本文评估协调策略是否应根据问题类别动态选择,而非全局固定。我们运行了一个固定的矩阵,包含30个企业任务,涵盖六个行业、五个问题类别、四种执行条件、每个单元格三个重复,以及四个模型分支:qwen_local、sonnet、gemma_openrouter和一个辅助的openai云验证分支。所有1,440个生成输出均由固定的Sonnet评分标准评判。主要发现是有界且操作上有用的,但并非最初的严格H1。预先注册的精确胜者/CI标准未得到支持:精确胜者身份在不同模型分支间不稳定,且若干预测策略接近但未超过最佳观察到的替代方案。一个较弱的近最优路由主张得到强烈支持。在每个预先注册的模型分支和问题类别中,以及在辅助的OpenAI验证分支中,预测策略的质量分数与最佳观察条件相差在0.10以内。结构化合规验证是对原始映射最明显的例外:所有分支都偏好单智能体而非共识。预先注册的Kendall's W检验发现,越南语领域和英语领域任务在四种协调条件排序的一致性上没有可靠差异(两个分层的平均W均为0.20;符号秩检验p = .85),因此H2未得到支持。我们得出结论,企业协调策略应使用动态路由作为校准默认值,而非确定性胜者选择法则。

英文摘要

Enterprise multi-agent systems increasingly expose multiple coordination patterns, but deployments often lack evidence for when to use consensus, debate, synthesis, or a simpler single-agent workflow. This paper evaluates whether coordination strategy should be selected dynamically by problem class rather than fixed globally. We run a frozen matrix of 30 enterprise tasks spanning six industries, five problem classes, four execution conditions, three replications per cell, and four model arms: qwen_local, sonnet, gemma_openrouter, and an auxiliary openai cloud-validation arm. All 1,440 generated outputs are judged by a fixed Sonnet rubric. The main finding is bounded and operationally useful, but it is not the original strict H1. The pre-registered exact-winner/CI criterion is not supported: exact winner identity is unstable across model arms, and several predicted strategies are close to, but not above, the best observed alternative. A weaker near-best routing claim is strongly supported. In every pre-registered model arm and problem class, and again in the auxiliary OpenAI validation arm, the predicted strategy is within 0.10 quality-score points of the best observed condition. Structured compliance verification is the clearest exception to the original mapping: all arms favor single_agent rather than consensus. A pre-registered Kendall's W test finds no reliable difference between Vietnamese-domain and English-domain tasks in how consistently the four coordination conditions are ranked (mean W of 0.20 in both strata; signed-rank p = .85), so H2 is not supported. We conclude that enterprise coordination policy should use dynamic routing as a calibrated default, not as a deterministic winner-selection law.

2605.27866 2026-06-05 cs.CL 版本更新

GRADE: Generalizable Reasoning-Aware Dialogue Evaluation for AI Tutors

GRADE: 面向AI导师的通用推理感知对话评估

Parth Bhalerao, Jeromy Chang, David Chou, Oana Ignat

发表机构 * Santa Clara University(圣克拉拉大学)

AI总结 提出GRADE框架,通过系统比较五种开源模型的零样本推理、LoRA微调、合成增强及思维链推理等配置,证明精心选择的LoRA流水线在关键教学维度上可媲美专有系统。

Comments 16 pages, 7 figures

详情
AI中文摘要

评估AI导师的回应需要超越事实正确性:导师必须识别错误、定位错误、提供指导并提出可行的后续步骤。我们提出GRADE,一项针对学生-导师对话中教学能力评估的开源模型系统研究。基于BEA 2025 TutorMind设置,我们评估了五种语言模型、零样本推理、LoRA微调、合成增强、思维链推理以及单任务与多任务公式化配置下的120种配置。Gemma3-12B在单任务评估中表现最佳,而8位精度的Gemma3-27B在多任务预测中更可靠。我们发现,增强有助于那些在原始数据上表现不佳的模型,验证虽然成本更高但增益有限,思维链推理对合成数据生成比直接分类更有用。我们进一步表明,在结构化分类目标上进行LoRA微调会干扰思考模式下的指令遵循行为,使生成偏离所需的评估格式。碳分析显示,模型选择和推理模式显著影响排放。总体而言,GRADE表明,精心选择的开源LoRA流水线可以在关键教学维度上匹配或超越专有和基于集成系统的性能,代码和数据可在https://github.com/pvbgeek/GRADE获取。

英文摘要

Evaluating AI tutor responses requires more than factual correctness: tutors must identify mistakes, locate errors, provide guidance, and offer actionable next steps. We present GRADE, a systematic study of open-source models for pedagogical ability assessment in student-tutor dialogues. Building on the BEA 2025 TutorMind setting, we evaluate 120 configurations across five language models, zero-shot inference, LoRA fine-tuning, synthetic augmentation, CoT+Reasoning, and single-task versus multitask formulations. Gemma3-12B performs best for single-task evaluation, while Gemma3-27B in 8-bit precision is more reliable for multitask prediction. We find that augmentation helps models that struggle with the original data, verification adds limited gains despite higher cost, and CoT+Reasoning is more useful for synthetic data generation than direct classification. We further show that LoRA fine-tuning on structured classification objectives interferes with instruction-following behavior under thinking mode, redirecting generation away from the required evaluation format. Carbon analysis shows that model choice and reasoning mode substantially affect emissions. Overall, GRADE shows that carefully selected open-source LoRA pipelines can match or surpass proprietary and ensemble-based systems on key pedagogical dimensions, with code and data available at https://github.com/pvbgeek/GRADE.

2605.26046 2026-06-05 cs.CL cs.AI cs.LG cs.MA cs.SE 版本更新

When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges

当梯度冲突时:多目标提示优化用于LLM评判器的失败模式

Parth Darshan, Abhishek Divekar

发表机构 * IIT Jodhpur(印度理工学院乔普里尔) Amazon(亚马逊)

AI总结 研究多目标文本梯度优化中梯度稀释和指令干扰两种失败模式,通过分解优化器信息共享方式揭示性能下降原因。

Comments Accepted at ACL 2026 - CustomNLP4U Workshop. Code, prompts and data available at https://github.com/adivekar-utexas/when-gradients-collide

详情
AI中文摘要

将LLM评判器定制到特定任务或领域通常需要同时跨多个评估标准优化其提示。文本梯度方法针对单一评判标准实现了自动化,但它们产生自然语言批评,而非数值向量。因此,多任务学习的冲突解决工具包(PCGrad、MGDA)不适用于多目标文本梯度设置。我们通过改变损失、梯度和优化器LLM共享跨任务信息的程度,测试了文本梯度优化器的五种分解模式。在10种配置中的6种中,我们观察到优化从未优于初始提示。当梯度LLM联合处理多个标准时,梯度特异性下降了59%(从9.0降至3.7)。另外,我们观察到将每个任务的指令简单组合成单个提示会使斯皮尔曼相关系数降低5.3%。这些结果识别出两种可分离的失败模式:优化时的梯度稀释和推理时的指令干扰,它们共同限制了使用文本反馈进行多目标评判器定制的设计空间。

英文摘要

Customizing an LLM judge to a specific problem or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion, however they produce natural-language critiques, not numerical vectors. Thus, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) does not apply to this multi-objective textual gradient setting. We extend TextGrad to the multi-objective setting and test four decomposition modes of textual gradient optimizers by varying how much cross-objective information the loss, gradient and optimizer LLMs share. We find the gradient's task-focus drops by 59% (9.0 to 3.7 out of 10) when the gradient LLM must provide feedback on multiple criteria jointly. Separately, we observe that naively combining single-objective optimized instructions into a single prompt degrades Spearman rho from 0.305 to 0.220 (-0.085). These results identify two separable failure modes: optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge optimization using textual feedback.

2605.29054 2026-06-05 cs.SE cs.CL 版本更新

Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence

转换而非等价:通过观察等价性基准测试代码库转换

Linxin Song, Jiefeng Chen, Yue Huang, Bhavana Dalvi Mishra, Chi Wang, Jieyu Zhao, Jinsung Yoon, Tomas Pfister

发表机构 * University of Southern California(南加州大学) Google Cloud AI Research(谷歌云人工智能研究) University of Notre Dame(圣约翰大学) Google Deepmind(谷歌深Mind)

AI总结 针对代码库转换中智能体过度信任本地验证导致语义违反的问题,提出T2J-Bench基准,通过固定等价契约和三级验证(Spec、Numeric、Behavioral)评估转换质量,发现最佳系统通过率仅26.7-28.9%,且所有系统高估成功率66.6-97.8点。

详情
AI中文摘要

编码智能体日益成为代码库规模的协作者,能够协助代码库转换,但这一进展暴露了一个关键弱点:智能体往往过度信任自己的本地验证例程,并在满足表面检查但违反用户实际关心的语义契约的工件上宣布成功。这个问题在代码库转换中尤为严重,因为先前的评估主要是结果驱动的,因此不稳定:两个实现可以在浅层结果上匹配,例如单个前向损失,但在梯度、优化器行为或短期训练动态上存在差异。我们引入了T2J-Bench,一个代码库转换基准,它将转换重新定义为在固定等价契约下的迁移。然后,一个固定验证器通过三个有序阶段比较源代码库和转换后的代码库:Spec(接口可接受性)、Numeric(前向输出、损失、梯度和目标特定张量)和Behavioral(固定种子下的短期训练动态)。在355次盲转换尝试中,尽管Spec通过率高达91.1%,最佳系统总体通过率仅为26.7-28.9%;4.7倍的token预算差异仅产生2.2倍的通过率差异;所有系统相对于固定评估器高估成功率66.6-97.8点。这表明失败更多源于契约不一致的自我验证,而非有限的预算或骨干强度。

英文摘要

Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on artifacts that satisfy surface checks while violating the semantic contracts users actually care about. This problem is especially acute in codebase conversion, where prior evaluation is largely outcome-driven and therefore unstable: two implementations can match on a shallow outcome, such as a single forward loss, while diverging in gradients, optimizer behavior, or short-horizon training dynamics. We introduce T2J-Bench, a benchmark for codebase conversion that reformulates conversion as transfer under a fixed equivalence contract. A fixed verifier then compares source and converted codebases through three ordered stages: Spec (interface admissibility), Numeric (forward outputs, losses, gradients, and objective-specific tensors), and Behavioral (short training dynamics under fixed seeds). Across 355 blind conversion attempts, the best system reaches only 26.7--28.9% overall pass rate despite Spec pass rates up to 91.1%; a 4.7x token-budget spread yields only a 2.2x pass-rate spread; and all systems overestimate success by 66.6--97.8 points relative to the fixed evaluator. This suggests that failures stem more from contract-misaligned self-validation than from limited budget or backbone strength.

2603.19294 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Maximizing Mutual Information Between Prompt and Response Improves LLM Performance With No Additional Data

最大化提示与响应之间的互信息无需额外数据即可提升LLM性能

Hyunji Nam, Haoran Li, Natasha Jaques

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出互信息偏好优化(MIPO)方法,通过对比数据增强构建偏好对,利用直接偏好优化最大化提示与响应间的点互信息,无需额外数据或外部监督即可提升LLM在个性化和可验证任务上的性能。

Comments International Conference on Machine Learning 2026

详情
AI中文摘要

虽然后训练已在多个领域成功改进了大型语言模型(LLM),但这些提升严重依赖人工标注数据或外部验证器。现有数据已被充分利用,而新数据收集成本高昂。此外,真正的智能远不止可验证任务。因此,我们需要较少依赖外部信号且更广泛适用于可验证和不可验证领域的自我改进框架。我们提出**互信息偏好优化(MIPO)**,一种对比数据增强方法,通过基于正确提示生成正响应,以及基于随机无关提示生成负响应来构建偏好对。我们证明,使用直接偏好优化从这些配对数据中学习,可以最大化*基础LLM*下提示与响应之间的逐点互信息。使用1-7B参数的Llama和Qwen指令模型的实验表明,与提示基线相比,MIPO在个性化任务上实现了3-16%的提升(Qwen2.5-1B-Instruct提升51%)。令人惊讶的是,MIPO在可验证领域(如数学和多项选择题问答)也有用,*无需任何额外数据或外部监督*即可获得1-20%的提升。这些结果表明,利用对比数据对中的内在信号进行自我改进是一个有前景的方向。

英文摘要

While post-training has successfully improved large language models (LLMs) across a variety of domains, these gains heavily rely on human-labeled data or external verifiers. Existing data has already been exploited, and new data is expensive to collect. Moreover, true intelligence goes far beyond verifiable tasks. Therefore, we need self-improvement frameworks that are less dependent on external signals and more broadly applicable to both verifiable and non-verifiable domains. We propose **Mutual Information Preference Optimization (MIPO)**, a contrastive data augmentation method that constructs preference pairs by generating a positive response conditioning on the correct prompt, and a negative response by conditioning on a random, unrelated prompt. We show that using Direct Preference Optimization to learn from this paired data maximizes pointwise mutual information *under the base LLM* between prompts and model responses. Experiments with with 1-7B parameter Llama and Qwen instruct models show that MIPO achieves 3-16% gains (and 51% increase for Qwen2.5-1.5B-Instruct) on personalization compared to prompting baselines. Surprisingly, MIPO can also be useful in verifiable domains, such as math and multiple-choice question answering, yielding 1-20% gains *without any additional data or external supervision*. These results suggest a promising direction for self-improvement using intrinsic signals derived from contrastive data pairs.

2605.25240 2026-06-05 cs.CL cs.AI cs.CY 版本更新

JudgmentBench: Comparing Rubric and Preference Evaluation for Quality Assessment

JudgmentBench: 比较评分量规与偏好评估在质量评价中的应用

Russell Yang, Ruishi Chen, Pierce Kelaita, Riya Ranjan, Sibo Ma, Charles Dickens, Matthew Guillod, Megan Ma, Julian Nyarko

发表机构 * Stanford University(斯坦福大学) Snorkel AI

AI总结 本研究通过构建包含30个真实法律任务、1539个评分量规和1530对偏好判断的数据集JudgmentBench,比较了评分量规与成对比较两种评估方法,发现成对比较在恢复预期质量排序上显著优于评分量规(平均斯皮尔曼等级相关系数0.908 vs 0.150),且注释时间减少一半以上。

Comments 37 pages, 9 figures

详情
AI中文摘要

当前基准测试实践中主导着两种方法论:基于评分量规的评分根据预定义标准评估项目,而比较判断则引发输出之间的成对偏好。尽管两种方法论被广泛使用,但两者之间的选择很少被论证。我们发布了JudgmentBench,一个包含30个真实法律任务的基准测试,配对了来自执业律师(包括美国主要律师事务所)的1539个评分量规和1530个成对偏好判断,这些律师具有丰富的经验。这些注释构成了高专业领域内首个公开可用的数据集,其中两种监督信号由同一专家对同一项目进行收集。使用LLM生成的三个质量级别的输出,我们提供了初步的经验比较:比较判断在恢复预期质量排序方面显著优于评分量规(平均斯皮尔曼等级相关系数为0.908 vs 0.150,估计差异=0.758 [0.494, 1.021]),同时所需的注释时间不到一半。这一模式对人类注释者和LLM自动评分器均成立。除了这一初步比较,数据集的配对结构支持更广泛的研究议程,探讨在没有可验证真实情况的领域中,如何引导、聚合专家判断并将其用作监督信号。

英文摘要

Two methodologies dominate current practices of benchmarking: rubric-based scoring evaluates items against predefined criteria, whereas comparative judgment elicits pairwise preferences between outputs. Although both methodologies are widely used, the choice between them is rarely justified. We release JudgmentBench, a benchmark of 30 real-world legal tasks, paired with 1,539 rubric scores and 1,530 pairwise preference judgments collected from practicing attorneys--including at major U.S. law firms--with substantial experience. The annotations constitute the first publicly available dataset in a high-expertise domain in which both supervision signals are elicited from the same experts on the same items. Using LLM-generated outputs at three constructed quality levels, we provide an initial empirical comparison: comparative judgments recover the intended quality ordering substantially better than rubrics under both a per-task rank-correlation metric (mean Spearman's rank correlation of 0.908 vs. 0.150, estimated difference = 0.758 [0.494, 1.021]) and a per-judgment pairwise win-rate metric (0.669 vs. 0.542, estimated difference = 0.127 [0.067, 0.186]), while requiring less than half the annotation time. The patterns hold for human annotators and LLM autograders. Beyond this initial comparison, the paired structure of the dataset supports a broader research agenda on how expert judgment should be elicited, aggregated, and used as supervision in domains without verifiable ground truth.

2605.15913 2026-06-05 cs.CL cs.AI 版本更新

Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

通过自动分割和块蒸馏实现块注意力的泛化

Shuaiyi Li, Zhisong Zhang, Yan Wang, Lei Zhu, Dongyang Ma, Chenlong Deng, Yang Deng, Wai Lam

发表机构 * The Chinese University of Hong Kong(香港中文大学) City University of Hong Kong(香港城市大学) Tencent(腾讯) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院) Singapore Management University(新加坡管理大学)

AI总结 提出基于语义分割数据集训练的轻量级分割器和块蒸馏框架,解决块注意力在长上下文中的文本分割和微调效率问题,实现接近全注意力的性能。

Comments 16 pages, 2 figures

详情
AI中文摘要

块注意力将输入作为独立的块处理,块之间不能相互关注,在检索增强生成(RAG)等长上下文场景中具有显著提升KV缓存重用的潜力。然而,其广泛应用受到两个关键挑战的阻碍:将输入文本分割成有意义且自包含的块的困难,以及现有块微调方法效率低下且可能降低性能的风险。为解决这些问题,我们首先构建了SemanticSeg,一个大规模且多样化的语义分割数据集,包含超过30k个实例,涵盖16个类别——包括书籍、代码、网页文本和对话,文本长度从2k到32k。利用该数据集,我们训练了一个轻量级分割器,能够自动将文本分割成符合人类直觉的块,且粒度可控。其次,我们提出了块蒸馏,一种比块微调更高效的训练框架,它使用冻结的全注意力教师模型来指导块注意力学生模型。该框架集成了三个新颖的组件:块汇合令牌以减轻块边界处的信息丢失,块丢弃以利用来自所有块的训练信号,以及令牌级损失加权以聚焦于对块注意力敏感的令牌的学习。跨多个模型和基准的实验表明,我们的分割器优于启发式和统计基线,且块蒸馏在块注意力下实现了接近全注意力的性能,为部署块注意力建立了一条实用且可扩展的路径。

英文摘要

Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories-including books, code, web text, and conversations with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning, which uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.

2605.20628 2026-06-05 cs.CL 版本更新

Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation

Divide-Prompt-Refine:一种无需训练的、结构感知的生物医学摘要生成框架

Sylvey Lin, Joe Menke, Shufan Ming, Dongin Nam, Neil Smalheiser, Halil Kilicoglu

发表机构 * University of Washington(华盛顿大学)

AI总结 本文提出DPR-BAG框架,旨在生成具有完整文本但无摘要的生物医学文章的连贯且事实准确的摘要。该框架通过分解全文文档为结构化的修辞要素,进行并行LLM摘要生成,并应用最终的精炼阶段恢复全局话语连贯性。

Comments Accepted by BioNLP 2026

详情
AI中文摘要

生物医学摘要在下游NLP应用中起着关键作用,例如信息检索、生物本体标注和生物医学知识发现。然而,大量生物医学文章没有摘要,这降低了这些文章在下游任务中的实用性。我们提出了DPR-BAG(Divide, Prompt, and Refine for Biomedical Abstract Generation),一种无需训练的零样本框架,能够为具有完整文本但无摘要的生物医学文章生成连贯且事实准确的摘要。DPR-BAG按照背景-目的-方法-结果-结论(BOMRC)模式将全文文档分解为结构化的修辞要素,对每个要素进行并行LLM摘要生成,并应用最终的精炼阶段以恢复全局话语连贯性。在PMC-MAD数据集上,DPR-BAG在抽象新颖性上优于强大的提取式和微调基线,同时保持事实一致性。我们的消融研究揭示了一个反直觉的发现:增加提示复杂性或显式注入实体级指导可能会降低事实对齐,突显了受控提示策略的重要性。这些发现突显了无需训练、结构感知的框架在低资源环境下可扩展生物医学摘要生成的潜力。我们的数据和代码可在https://huggingface.co/datasets/pmc-mad/PMC-MAD和https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG上获得。

英文摘要

Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR-BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training-free, zero-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR-BAG decomposes full-text documents into structured rhetorical facets following the Background-Objective-Methods-Results-Conclusions (BOMRC) schema, performs parallel LLM-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC-MAD, a distribution-aligned dataset of 46,309 biomedical articles, DPR-BAG improves abstractive novelty over strong extractive and fine-tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training-free, structure-aware frameworks for scalable biomedical abstract generation in low-resource settings. Our data and code are available at https://huggingface.co/datasets/pmc-mad/PMC-MAD and https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG.

2603.17837 2026-06-05 eess.AS cs.CL 版本更新

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

沉默的思维:通过潜在推理建模全双工语音对话模型中的内部认知

Donghang Wu, Tianyu Zhang, Yuxin Li, Hexin Liu, Chen Chen, Eng Siong Chng, Yoshua Bengio

发表机构 * DeepMind(深度Mind)

AI总结 本文提出了一种名为FLAIR的全双工语音对话模型,通过潜在推理同时进行语音感知和内部思考,以提高对话质量,该方法在多个语音基准测试中取得了竞争性的结果。

Comments Accepted by ICML 2026

详情
AI中文摘要

在对话互动中,人类在听讲者说话时会潜意识地进行同时思考。尽管这种内部认知处理可能不总是表现为显式的语言结构,但它是制定高质量响应的关键。受这一认知现象的启发,我们提出了一种名为FLAIR的新全双工潜在和内部推理方法,该方法在语音感知的同时进行潜在思考。与传统NLP中的“思考”机制不同,我们的方法不需要事后生成,而是无缝地与语音对话系统结合:在用户说话阶段,它将前一步的潜在嵌入输出递归地馈入下一步,从而实现连续推理,严格遵循因果性而不引入额外延迟。为了实现这种潜在推理,我们设计了一个基于证据下界的目标,支持通过教师强制进行高效的监督微调,从而避免了需要显式推理注释的需要。实验表明,这种听的同时思考设计在多个语音基准测试中均取得了竞争性的结果。此外,FLAIR能够稳健地处理对话动态,并在全双工交互指标上取得了竞争性的性能。

英文摘要

During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.

2605.19309 2026-06-05 cs.CL 版本更新

How Do Document Parsers Break? Auditing Structural Vulnerability in Document Intelligence

文档解析器如何失效?审计文档智能中的结构脆弱性

Yue Chen, Yihao Wang, Ziyi Tang, Yongsen Zheng, Keze Wang

发表机构 * Sun Yat-sen University(中山大学) Nanyang Technological University(南洋理工大学)

AI总结 本文提出ProSA框架,通过解耦控制探测、策略驱动目标和结构感知诊断,审计文档布局分析(DLA)管道中的结构脆弱性,发现块级结构损失率(B-SLR)比受影响面积更能反映OCR不稳定性,且结构探测导致更大的下游QA/检索退化。

Comments 18 pages, 5 figures, preprint

详情
AI中文摘要

文档布局分析(DLA)管道为检索增强生成、长文档问答和其他文档智能系统提供结构化页面表示,但其鲁棒性评估仍然主要是以面积为中心的。我们识别出这种足迹偏差,并提出ProSA,一个轻量级的输出级审计框架,它解耦了受控探测、策略驱动目标和结构感知诊断。ProSA结合了块级结构损失率(B-SLR)、粒度感知暴露描述符和路径归因,以分析结构身份在何处丢失、在何种暴露粒度下出现故障以及故障如何传播。在MinerU和PP-StructureV3上对1000页进行实验,受影响面积与探测引起的OCR不稳定性相关性较弱(R^2=0.384/0.110),而B-SLR与之相关性更强(R^2=0.727/0.916)。暴露描述符进一步分离了遮挡主导和拓扑主导的路径,而匹配足迹的结构探测导致的下游QA/检索退化远大于面积匹配的擦除。这些结果将DLA鲁棒性评估从基于足迹的压力测试转向结构感知的脆弱性审计。

英文摘要

Document Layout Analysis (DLA) pipelines provide structured page representations for retrieval-augmented generation, long-document question answering, and other document intelligence systems, yet their robustness evaluation remains largely area-centric. We identify this Footprint Bias and propose ProSA, a lightweight output-level auditing framework that decouples controlled probing, policy-driven targeting, and structure-aware diagnosis. ProSA combines Block-level Structural Loss Rate (B-SLR), granularity-aware exposure descriptors, and pathway attribution to analyze where structural identity is lost, at what exposure granularity failures emerge, and how failures propagate. Across MinerU and PP-StructureV3 on 1,000 pages, affected area weakly tracks perturbation-induced OCR instability (R^2=0.384/0.110), whereas B-SLR aligns much more closely with it (R^2=0.727/0.916). Exposure descriptors further separate occlusion- and topology-dominant pathways, while matched-footprint structural probes cause much larger downstream QA/retrieval degradation compared to area-matched erasure. These results shift DLA robustness evaluation from footprint-based stress testing toward structure-aware vulnerability auditing.

2604.00555 2026-06-05 cs.AI cs.CL cs.SE 版本更新

Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

企业智能体系统中的本体约束神经推理:一种面向领域 grounded AI 智能体的神经符号架构

Thanh Luong Tuan, Abhijit Sanyal

发表机构 * Golden Gate University, San Francisco Foundation(金门大学,旧金山基金会) AgenticOS (FAOS)(AgenticOS(FAOS)) Associate Director, Data, Digital & IT Novartis Healthcare Pvt. Ltd.(数据、数字与IT部门,诺华健康有限公司) Novartis Healthcare Pvt. Ltd., Hyderabad, India(诺华健康有限公司,海得拉巴,印度)

AI总结 本文提出了一种神经符号架构,通过本体约束神经推理解决企业大语言模型在幻觉、领域漂移和无法在推理层面强制执行监管合规性方面的限制,展示了该架构在提升智能体的指标准确性和角色一致性方面的显著效果。

Comments 24 pages, 6 tables, 6 figures, 1 algorithm, 65 references. Replication study: 1,800 runs (600 per model) across 5 regulated industries (3 English, 2 Vietnamese) and 3 LLMs (Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B). v3 changes: deep-review trim from 34pp. Code and data: https://github.com/frank-luongt/faos-research/tree/main/RA-3

详情
AI中文摘要

企业采用大语言模型(LLMs)受到幻觉、领域漂移和无法在推理层面强制执行监管合规性的限制。我们提出了一种在基础智能体操作系统(FAOS)平台中实现的神经符号架构,通过本体约束神经推理解决这些限制。我们引入了一个三层本体框架——角色、领域和交互本体——以地面化基于LLM的企业智能体。我们正式化了不对称的神经符号耦合:当前企业系统约束智能体输入(上下文组装、工具发现、治理阈值),但不约束输出,我们提出机制扩展这种耦合到输出侧验证(响应检查、推理验证、合规性强制)。一个受控实验(1,800次运行,覆盖五个行业和三个LLM:Claude Sonnet 4、Qwen 2.5 72B、Gemma 4 26B)发现本体耦合的智能体在所有三个模型中在指标准确性和角色一致性上显著优于无地面化智能体(p < .001),具有较大的效应量(Kendall's W = .46-.64)。改进最大出现在LLM参数化知识最弱的地方——特别是越南本地化领域,其中本体提升是英语领域的2倍。贡献:(1)一个正式的三层企业本体模型;(2)神经符号耦合模式的分类学;(3)通过SQL推导评分进行本体约束的工具发现;(4)提出的一种用于输出侧本体验证的框架;(5)关于参数化知识效应的实证证据——本体地面化价值与LLM训练数据覆盖领域成反比;(6)跨模型复制,确立模型独立性;(7)一个服务于22个行业垂直领域的生产系统,拥有650多个智能体。

英文摘要

Enterprise adoption of Large Language Models (LLMs) is constrained by hallucination, domain drift, and the inability to enforce regulatory compliance at the reasoning level. We present a neurosymbolic architecture implemented within the Foundation AgenticOS (FAOS) platform that addresses these limitations through ontology-constrained neural reasoning. We introduce a three-layer ontological framework--Role, Domain, and Interaction ontologies--grounding LLM-based enterprise agents. We formalize asymmetric neurosymbolic coupling: current enterprise systems constrain agent inputs (context assembly, tool discovery, governance thresholds) but not outputs, and we propose mechanisms extending this coupling to output-side validation (response checking, reasoning verification, compliance enforcement). A controlled experiment (1,800 runs across five industries and three LLMs: Claude Sonnet 4, Qwen 2.5 72B, Gemma 4 26B) finds ontology-coupled agents significantly outperform ungrounded agents on Metric Accuracy (p < .001) and Role Consistency (p < .001) across all three models with large effect sizes (Kendall's W = .46-.64). Improvements are greatest where LLM parametric knowledge is weakest--particularly in Vietnam-localized domains, where ontology lift is 2x that of English domains. Contributions: (1) a formal three-layer enterprise ontology model; (2) a taxonomy of neurosymbolic coupling patterns; (3) ontology-constrained tool discovery via SQL-pushdown scoring; (4) a proposed framework for output-side ontological validation; (5) empirical evidence for the inverse parametric knowledge effect--ontological grounding value is inversely proportional to LLM training-data coverage of the domain; (6) cross-model replication establishing model-independence; (7) a production system serving 22 industry verticals with 650+ agents.

2605.15454 2026-06-05 cs.CL cs.LG stat.ML 版本更新

Reasoning Models Don't Just Think Longer, They Move Differently

推理模型不只思考更久,它们的移动方式不同

Anders Gjølbye, Lars Kai Hansen, Sanmi Koyejo

发表机构 * Technical University of Denmark(丹麦技术大学) Stanford University(斯坦福大学)

AI总结 本文研究了推理训练模型在生成链式思维时的轨迹差异,发现通过长度校正后,不同领域中难度与轨迹几何的耦合关系存在显著差异,尤其是在代码领域中,推理训练模型表现出更直接的轨迹和更一致的局部曲率。

Comments Preprint

详情
AI中文摘要

经过训练的推理语言模型通常在更难的问题上消耗更多标记,但更长的思维链并不表明模型只是计算更多步骤或遵循不同的内部轨迹。我们通过在编程、数学和布尔可满足性问题中研究链式思维生成过程中的隐藏状态轨迹来区分这一区别。原始轨迹几何强烈受到生成长度的影响:更长的生成会机械地改变路径统计,因此在没有调整的情况下,基于难度的比较是误导的。在残差化轨迹统计后,难度在所有研究的领域中系统地与修正后的轨迹几何相关联。在代码领域中,最清晰的推理特定分离出现在更难的问题中,推理训练模型显示出更直接的修正轨迹和更一致的局部曲率,而与匹配的指令训练基线相比,这种差异更小。在数学和布尔可满足性问题中,修正后的难度-几何耦合较弱,但仍存在。提示阶段的线性探测不反映代码领域的分离,行为注释显示更强的修正耦合与策略转变和不确定性监控同时出现。这些发现确立了长度校正作为生成时间轨迹分析的先决条件,并表明推理训练可以与不同的修正轨迹几何相关联,这种效果的强度取决于领域。

英文摘要

Reasoning-trained language models often spend more tokens on harder problems, but longer chains of thought do not show whether a model is merely computing for more steps or following a different internal trajectory. We study this distinction through hidden-state trajectories during chain-of-thought generation across competitive programming, mathematics, and Boolean satisfiability. Raw trajectory geometry is strongly shaped by generation length: longer generations mechanically alter path statistics, so difficulty-dependent comparisons are misleading without adjustment. After residualizing trajectory statistics on length, difficulty remains systematically coupled to corrected trajectory geometry across all domains studied. The clearest reasoning-specific separation appears in the code domain, where harder problems show more direct corrected trajectories and less heterogeneous local curvature in reasoning-trained models than in matched instruction-tuned baselines. Corrected difficulty-geometry coupling is weaker, but still present, in mathematics and Boolean satisfiability. Prompt-stage linear probes do not mirror the code-domain separation, and behavioral annotations show that stronger corrected coupling co-occurs with strategy shifts and uncertainty monitoring. Together, these findings establish length correction as a prerequisite for generation-time trajectory analysis and show that reasoning training can be associated with distinct corrected trajectory geometry, with the strength of the effect depending on the domain.

2605.13075 2026-06-05 cs.CL cs.AI 版本更新

Scaling few-shot spoken word classification with generative meta-continual learning

通过生成性元持续学习扩大少样本语音词分类

Louise Beyers, Batsirayi Mupamhi Ziki, Ruan van der Merwe

发表机构 * University of Cape Town(开普敦大学)

AI总结 本文研究了在仅获得每个类别五个样本的情况下,通过生成性元持续学习(GeMCL)算法对1000个类别进行少样本语音词分类的潜力,并展示了其在性能稳定性及适应速度上的优势。

详情
AI中文摘要

少样本语音词分类大多针对少量类别进行开发,因此更大规模的少样本语音词分类潜力尚未被挖掘。本文探讨了在仅获得每个类别五个样本的情况下,通过生成性元持续学习(GeMCL)算法训练的语音词分类器能否依次学习区分1000个类别。我们通过使用GeMCL算法训练模型并与重复训练或微调的基线模型进行比较,证明了这种扩展能力的存在。我们发现GeMCL产生了极高的性能稳定性,尽管它并不总能超越重复全微调的HuBERT模型或冻结HuBERT模型配以重复训练的分类器头,但其性能与后者相当,同时适应速度提高了2000倍,仅用不到一半的数据量,在两个数量级更少的时间内进行训练。

英文摘要

Few-shot spoken word classification has largely been developed for applications where a small number of classes is considered, and so the potential of larger-scale few-shot spoken word classification remains untapped. This paper investigates the potential of a spoken word classifier to sequentially learn to distinguish between 1000 classes when it is given only five shots per class. We demonstrate that this scaling capability exists by training a model using the Generative Meta-Continual Learning (GeMCL) algorithm and comparing it to repeatedly trained or finetuned baselines. We find that GeMCL produces exceptionally stable performance, and although it does not always outperform a repeatedly fully-finetuned HuBERT model nor a frozen HuBERT model with a repeatedly trained classifier head, it produces comparable performance to the latter while adapting 2000 times faster, having been trained less than half of the data for two orders of magnitude less time.

2504.10063 2026-06-05 cs.CL cs.AI math.AT 版本更新

Hallucination Detection in LLMs with Topological Divergence on Attention Graphs

基于注意力图拓扑分歧的LLM幻觉检测

Alexandra Bazarova, Andrei Volodichev, Aleksandr Yugay, Andrey Shulga, Alina Ermilova, Konstantin Polev, Julia Belikova, Rauf Parchiev, Dmitry Simakov, Maxim Savchenko, Andrey Savchenko, Serguei Barannikov, Alexey Zaytsev

发表机构 * Applied AI Institute(应用人工智能研究所) SB AI Lab(SB人工智能实验室) HSE University(俄罗斯高等经济学院) CNRS, Universite Paris Cite(法国国家科学研究中心,巴黎Cité大学)

AI总结 本文提出TOHA方法,通过分析注意力矩阵的拓扑结构来检测LLM中的幻觉现象,实验表明该方法在多个基准测试中表现优异,且对标注数据和计算资源需求较低。

Comments Accepted to the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

详情
AI中文摘要

幻觉,即生成事实性错误内容,仍然是大型语言模型(LLMs)面临的关键挑战。我们介绍了TOHA,一种基于拓扑的幻觉检测器,在RAG设置中,该方法利用拓扑分歧度度量来量化由注意力矩阵诱导的图的结构特性。检查提示与响应子图之间的拓扑分歧揭示了一致的模式:特定注意力头中较高的分歧值与幻觉输出相关,且与数据集无关。广泛的实验,包括问题回答和摘要任务的评估,表明我们的方法在多个基准测试中实现了最先进的或具有竞争力的结果,同时需要最少的标注数据和计算资源。我们的发现表明,分析注意力矩阵的拓扑结构可以作为LLMs事实可靠性的一种高效且稳健的指标。

英文摘要

Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments - including evaluation on question answering and summarization tasks - show that our approach achieves state-of-the-art or competitive results on several benchmarks while requiring minimal annotated data and computational resources. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.

2605.11732 2026-06-05 cs.IR cs.CL cs.MA cs.MM 版本更新

AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents

AgentDisCo: 向开放深度研究代理中的解耦与协作迈进

Jiarui Jin, Zexuan Yan, Shijian Wang, Wenxiang Jiao, Yuan Lu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出AgentDisCo,一种解耦且协作的代理架构,将深度研究视为信息探索与利用之间的对抗优化问题。通过批评代理评估生成的草稿并优化搜索查询,生成代理检索更新结果并修订草稿,最终生成综合报告。该框架通过元优化 harness 支持手工和自动发现的设计策略,并利用强大的代码生成代理自完善。

详情
AI中文摘要

在本文中,我们提出了AgentDisCo,一种新颖的解耦且协作的代理架构,将深度研究视为信息探索与利用之间的对抗优化问题。与现有方法将这两个过程合并到一个模块中不同,AgentDisCo采用一个批评代理来评估生成的草稿并优化搜索查询,以及一个生成代理来检索更新的结果并相应地修订草稿。迭代优化的草稿随后传递给下游的报告撰写代理,以综合生成全面的研究报告。整体工作流通过元优化 harness 支持手工和自动发现的设计策略,其中生成代理被重新利用为评分代理,以评估批评代理的输出并生成质量信号。强大的代码生成代理(例如Claude-Code、Codex)系统地探索代理配置并构建一个策略库,一个结构化的可重用设计策略存储库,使框架能够自我完善而无需大量人工干预。我们在三个已建立的深度研究基准(DeepResearchBench、DeepConsult、DeepResearchGym)上评估AgentDisCo,使用Gemini-2.5-Pro,取得的性能与或优于领先的闭源系统相当。观察到现有基准不足以反映真实世界用户需求,我们引入GALA(通用人工智能生活助手),一个基准,该基准从用户的历史浏览行为中挖掘潜在研究兴趣。我们进一步开发了一个渲染代理,将研究报告转换为视觉丰富的海报演示,并展示了一个端到端的产品AutoResearch Your Interest,该产品根据个人浏览历史提供个性化的深度研究推荐。

英文摘要

In this paper, we present AgentDisCo, a novel Disentangled and Collaborative agentic architecture that formulates deep research as an adversarial optimization problem between information exploration and exploitation. Unlike existing approaches that conflate these two processes into a single module, AgentDisCo employs a critic agent to evaluate generated outlines and refine search queries, and a generator agent to retrieve updated results and revise outlines accordingly. The iteratively refined outline is then passed to a downstream report writer that synthesizes a comprehensive research report. The overall workflow supports both handcrafted and automatically discovered design strategies via a meta-optimization harness, in which the generator agent is repurposed as a scoring agent to evaluate critic outputs and generate quality signals. Powerful code-generation agents (e.g., Claude-Code, Codex) systematically explore agent configurations and construct a policy bank, a structured repository of reusable design strategies, enabling the framework to self-refine without extensive human intervention. We evaluate AgentDisCo on three established deep research benchmarks (DeepResearchBench, DeepConsult, DeepResearchGym) using Gemini-2.5-Pro, achieving performance comparable to or surpassing leading closed-source systems. Observing that existing benchmarks inadequately reflect real-world user needs, we introduce GALA (General AI Life Assistants), a benchmark that mines latent research interests from users' historical browsing behavior. We further develop a rendering agent that converts research reports into visually rich poster presentations, and demonstrate an end-to-end product, AutoResearch Your Interest, which delivers personalized deep research recommendations derived from individual browsing histories.

2605.11632 2026-06-05 cs.CL cs.AI 版本更新

Macro: Enhancing Multilingual Counterfactual Explanations through Alignment-as-Preference Optimization

Macro: 通过偏好对齐优化提升多语言反事实解释

Yilong Wang, Qianli Wang, Bohao Chu, Yihong Liu, Jing Yang, Simon Ostermann

发表机构 * Technische Universität Berlin(柏林技术大学) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心) University of Duisburg-Essen(杜伊斯堡- Essen大学) LMU Munich(慕尼黑大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Saarland Informatics Campus(萨尔兰州信息学校区) BIFOLD – Berlin Institute for the Foundations of Learning and Data(柏林学习与数据基础研究院) Centre for European Research in Trusted AI (CERTAIN)(可信人工智能欧洲研究中心)

AI总结 本文提出Macro框架,通过直接偏好优化改进多语言反事实解释,提升有效性的同时保持最小性,避免翻译基线的严重最小性问题,并在多个指标上优于监督微调方法。

Comments In submission

详情
AI中文摘要

自我生成的反事实解释(SCEs)是大型语言模型(LLMs)生成的最小修改输入(minimality),通过翻转自身预测(validity)来揭示黑箱LLM行为,提供因果基础的解释方法。然而,将其扩展到非主导语言仍具挑战性:现有方法难以生成有效SCEs,且有效性与最小性之间的权衡影响解释质量。我们引入Macro,一种偏好对齐框架,通过直接偏好优化(DPO)进行多语言SCE生成,使用复合评分函数构建偏好对,将权衡转化为可测量的偏好信号。在四个LLM和七个语言类型多样的语言上进行实验,结果显示,Macro在平均情况下比链式思维基线提高了12.55%的有效性,同时不降低最小性,避免了翻译基线的严重最小性问题。与监督微调相比,Macro在两个指标上表现更优,证实了显式偏好优化对于平衡此权衡的重要性。进一步分析显示,Macro增强了跨语言扰动对齐并缓解了常见生成错误。我们的结果突显了偏好优化作为提升多语言模型解释的有前途方向。

英文摘要

Self-generated counterfactual explanations (SCEs) are minimally modified inputs (minimality) generated by large language models (LLMs) that flip their own predictions (validity), offering a causally grounded approach to unraveling black-box LLM behavior. Yet extending them beyond English remains challenging: existing methods struggle to produce valid SCEs in non-dominant languages, and a persistent trade-off between validity and minimality undermines explanation quality. We introduce Macro, a preference alignment framework that applies Direct Preference Optimization (DPO) to multilingual SCE generation, using a composite scoring function to construct preference pairs that effectively translate the trade-off into measurable preference signals. Experiments across four LLMs and seven typologically diverse languages show that Macro improves validity by 12.55\% on average over the chain-of-thought baseline without degrading minimality, while avoiding the severe minimality violations of the translation-based baseline. Compared to supervised fine-tuning, Macro achieves superior performance on both metrics, confirming that explicit preference optimization is essential for balancing this trade-off. Further analyses reveal that Macro increases cross-lingual perturbation alignment and mitigates common generation errors. Our results highlight preference optimization as a promising direction for enhancing multilingual model explanations.

2604.26269 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Calibrated Surprise: An Information-Theoretic Account of Creative Quality

校准的惊喜:一种信息论视角下的创造性质量

Bo Zou, Chao Xu

发表机构 * Bo Zou(邹波) Chao Xu(徐超)

AI总结 本文提出了一种信息论框架,用于评估创造性写作的质量,通过校准的惊喜概念,结合香农互信息理论,量化了高质量文本与降质文本之间的差异。

Comments 28 pages, 3 figures

详情
AI中文摘要

在大型语言模型时代,创造性写作的质量缺乏可计算的理论基础。主流方法是评分标准——将整体审美判断分解为子评分,以及通过RLHF偏好信号——用群体投票代替质量。这两种方法都绕过了文本本身的统计结构。本文提供了一种信息论基础,填补这一空白。我们提出了'校准的惊喜'作为优秀创造性写作的信息论本质。这种判断符合阅读直觉并涵盖了其对立面。这种文学判断可以精确地进行数学公式化。在完全维度约束Y下,可行的写作选择被强制进入极狭窄的空间。稀有的幸存者,从无约束的视角来看,恰好是最不可预测的选择。两者都通过香农互信息I(X;Y) = H(X) - H(X|Y)精确测量——'校准'对应H(X|Y)接近0;'惊喜'对应H(X)升高。公式的减法结构自然地将'有根据的惊喜'与'纯噪声'分开。我们使用Qwen1.5-7B的token级logprobs作为理想读者概率分布的操作代理。在20对(12中文/8英文)的高质量与系统降质文学段落中,20/20对支持核心预测:高质量段落的I(X;Y)系统性地高于其降质版本。

英文摘要

In the era of large language models, creative writing quality lacks a computable theoretical anchor. The dominant approaches are rubric scoring -- decomposing holistic aesthetic judgment into sub-scores -- and RLHF preference signals -- replacing quality with group votes. Both bypass the statistical structure of the text itself. This paper provides an information-theoretic foundation to fill this gap. We propose 'calibrated surprise' as the information-theoretic essence of excellent creative writing. This judgment matches reading intuition and covers its opposite. This literary judgment admits a precise mathematical formulation. Under full-dimensional constraints Y, feasible writing choices are forced into an extremely narrow space. The rare survivors are, from the unconstrained perspective, exactly the least predictable choices. Both are measured precisely by Shannon mutual information I(X;Y) = H(X) - H(X|Y) -- 'calibrated' corresponds to H(X|Y) approaching 0; 'surprising' corresponds to H(X) going high. The subtraction structure of the formula naturally separates 'well-grounded surprise' from 'pure noise'. We use token-level logprobs from Qwen1.5-7B as an operational proxy for the ideal reader's probability distribution. Across 20 pairs (12 Chinese / 8 English) of high-quality vs. systematically degraded literary passages, 20/20 pairs support the core prediction: high-quality passages have systematically higher I(X;Y) than their degraded versions.

2604.23600 2026-06-05 cs.CL 版本更新

Personality Shapes Gender Bias in Persona-Conditioned LLM Narratives Across English and Hindi: An Empirical Investigation

性格在英语和印地语中影响人物条件化大语言模型叙事中的性别偏见:一项实证研究

Tanay Kumar, Shreya Gautam, Aman Chadha, Vinija Jain, Francesco Pierri

发表机构 * Politecnico di Milano(米兰理工学院) Apple(苹果公司) Meta

AI总结 本研究探讨了在英语和印地语中,人物条件化大语言模型叙事中的性别偏见如何受到性格特征的影响,发现性格特质与性别偏见的幅度和方向显著相关,特别是黑暗三联体性格特质与性别刻板印象的表示更相关,但这些关联在不同模型和语言中有所变化。

详情
AI中文摘要

大型语言模型(LLMs)正越来越多地应用于以人物为导向的应用程序,如教育、客户服务和社会平台,在这些应用中,模型在与用户交互时被提示采用特定的人物。虽然人物条件可以提高用户体验和参与度,但也引发了关于性格线索如何与性别偏见和刻板印象相互作用的担忧。在本工作中,我们对英语和印地语中的人物条件化故事生成进行了受控研究,每个故事描绘了一名印度职场人士在系统性变化的人物性别、职业角色和性格特征(来自HEXACO和黑暗三联体框架)下生成特定情境的物品(例如教案、报告、信件)。在来自六种最先进的LLM生成的23,400个故事中,我们发现性格特征与性别偏见的幅度和方向显著相关。特别是,黑暗三联体性格特征与比社会可取的HEXACO特征更高的性别刻板印象表示相关,尽管这些关联在不同模型和语言中有所变化。我们的发现表明,LLM中的性别偏见并非静态,而是依赖于情境的。这表明在现实应用中使用的人物条件化系统可能会引入不均等的表示伤害,强化生成的教育、职业或社交内容中的性别刻板印象。

英文摘要

Large Language Models (LLMs) are increasingly deployed in persona-driven applications such as education, customer service, and social platforms, where models are prompted to adopt specific personas when interacting with users. While persona conditioning can improve user experience and engagement, it also raises concerns about how personality cues may interact with gender biases and stereotypes. In this work, we present a controlled study of persona-conditioned story generation in English and Hindi, where each story portrays a working professional in India producing context-specific artifacts (e.g., lesson plans, reports, letters) under systematically varied persona gender, occupational role, and personality traits from the HEXACO and Dark Triad frameworks. Across 23,400 generated stories from six state-of-the-art LLMs, we find that personality traits are significantly associated with both the magnitude and direction of gender bias. In particular, Dark Triad personality traits are consistently associated with higher gender-stereotypical representations compared to socially desirable HEXACO traits, though these associations vary across models and languages. Our findings demonstrate that gender bias in LLMs is not static but context-dependent. This suggests that persona-conditioned systems used in real-world applications may introduce uneven representational harms, reinforcing gender stereotypes in generated educational, professional, or social content.

2604.20572 2026-06-05 cs.CL 版本更新

Ask Only When Needed: Proactive Retrieval from Memory and Skills for Experience-Driven Lifelong Agents

在需要时提问:从记忆和技能中主动检索以实现经验驱动的终身学习代理

Yuxuan Cai, Wei Li, Jie Zhou, Qin Chen, Xin Li, Bo Zhang, Liang He

发表机构 * School of Computer Science and Technology, East China Normal University, Shanghai(东华大学计算机科学与技术学院,上海) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 本文提出了一种经验驱动的终身学习框架ProactAgent,通过主动检索结构化的经验库来改进长期任务。该框架通过ExpOnEvo联合更新策略和优化记忆,并引入ProactRL将检索视为显式的策略动作,从而在交互过程中主动检索以提高任务表现和效率。

详情
AI中文摘要

在线终身学习代理必须决定不仅如何行动,还要何时咨询先前经验以持续改进长期任务。现有方法通常被动地检索记忆,如在任务初始化或每次步骤后,因此错过了交互过程中出现的知识缺口。我们提出了ProactAgent,一种经验驱动的终身学习框架,用于在结构化的经验库上进行主动检索。ProactAgent通过ExpOnEvo持续改进,联合更新策略并优化记忆,将过去交互组织成事实、事件和技能存储库。它进一步引入了ProactRL,将检索视为显式的策略动作,并学习何时以及检索什么。通过比较相同交互前缀下有无检索的配对延续,ProactRL提供步骤级过程奖励,鼓励仅在改进任务结果或效率时检索。在SciWorld、AlfWorld和StuLife上的实验表明,ProactAgent在所有基线中表现一致,成功率达到32%的相对提升,交互轮次减少超过33%。我们的代码将在GitHub上公开。

英文摘要

Online lifelong learning agents must decide not only how to act but also when to consult prior experience to continually improve on long-horizon tasks. Existing methods typically retrieve memories passively, such as at task initialization or after each step, and therefore miss knowledge gaps that arise during interaction. We propose ProactAgent, an experience-driven lifelong learning framework for proactive retrieval over a structured Experience Base. ProactAgent continually improves through ExpOnEvo, which jointly updates policies and refines memory, organizing past interactions into factual, episodic, and skill repositories. It further introduces ProactRL, which treats retrieval as an explicit policy action and learns when and what to retrieve. By comparing paired continuations from identical interaction prefixes with and without retrieval, ProactRL provides step-level process rewards that encourage retrieval only when it improves task outcomes or efficiency. Experiments on SciWorld, AlfWorld, and StuLife show that ProactAgent consistently outperforms all baselines, achieving up to 32% relative improvement in success rate and over 33% reduction in interaction rounds. Our code will be publicly available at GitHub.

2604.17260 2026-06-05 cs.CL 版本更新

Rethinking Meeting Effectiveness: A Benchmark and Framework for Temporal Fine-grained Automatic Meeting Effectiveness Evaluation

重新思考会议有效性:一个用于时间细粒度自动会议有效性评估的基准和框架

Yihang Li, Chenhui Chu

发表机构 * Kyoto University(京都大学)

AI总结 本文提出了一种新的会议有效性评估方法,通过定义有效性为时间内的客观成就率,并引入AMI-ME数据集和自动评估框架,以支持对会议中各个话题段落的有效性评分,从而建立一个全面的基准并评估框架的通用性。

Comments ACL 2026 Main Conference

详情
AI中文摘要

评估会议有效性对于提高组织生产力至关重要。当前的方法依赖于事后调查,仅能为整个会议提供一个粗粒度的评分。依赖人工评估在可扩展性、成本和可重复性方面存在固有限制。此外,单一评分无法捕捉协作讨论的动态特性。我们提出了一种新的评估会议有效性的范式,围绕新的标准和时间细粒度方法。我们将有效性定义为时间内的客观成就率,并对会议中的各个话题段落进行评估。为了支持这一任务,我们引入了AMI会议有效性(AMI-ME)数据集,这是一个新的元评估数据集,包含来自130个AMI语料库会议的2,459个人工标注的段落。我们还开发了一个自动有效性评估框架,该框架使用大型语言模型(LLM)作为评判者,对每个段落的有效性进行评分,以相对整体会议目标。通过大量的实验,我们建立了这一新任务的全面基准,并评估了框架在不同会议类型中的通用性,从商业场景到非结构化讨论。此外,我们通过从原始语音开始的端到端性能测试来衡量完整系统的功能。我们的结果验证了该框架的有效性,并提供了强有力的基线,以促进未来会议分析和多方对话的研究。我们的数据集和代码将公开发布。AMI-ME数据集和自动评估框架可在:此URL处获取。

英文摘要

Evaluating meeting effectiveness is crucial for improving organizational productivity. Current approaches rely on post-hoc surveys that yield a single coarse-grained score for an entire meeting. The reliance on manual assessment is inherently limited in scalability, cost, and reproducibility. Moreover, a single score fails to capture the dynamic nature of collaborative discussions. We propose a new paradigm for evaluating meeting effectiveness centered on novel criteria and temporal fine-grained approach. We define effectiveness as the rate of objective achievement over time and assess it for individual topical segments within a meeting. To support this task, we introduce the AMI Meeting Effectiveness (AMI-ME) dataset, a new meta-evaluation dataset containing 2,459 human-annotated segments from 130 AMI Corpus meetings. We also develop an automatic effectiveness evaluation framework that uses a Large Language Model (LLM) as a judge to score each segment's effectiveness relative to the overall meeting objectives. Through substantial experiments, we establish a comprehensive benchmark for this new task and evaluate the framework's generalizability across distinct meeting types, ranging from business scenarios to unstructured discussions. Furthermore, we benchmark end-to-end performance starting from raw speech to measure the capabilities of a complete system. Our results validate the framework's effectiveness and provide strong baselines to facilitate future research in meeting analysis and multi-party dialogue. Our dataset and code will be publicly available. The AMI-ME dataset and the Automatic Evaluation Framework are available at: this URL.

2604.16370 2026-06-05 cs.CL cs.AI cs.CV 版本更新

Brain-CLIPLM: Semantic Compression for EEG-to-Text Decoding

Brain-CLIPLM: 用于EEG到文本解码的语义压缩

Xiaoli Yang, Huiyuan Tian, Yurui Li, Jianyu Zhang, Shijian Li, Gang Pan

发表机构 * Beijing Institute of Technology, Beijing, China(北京理工大学,北京,中国)

AI总结 该研究提出Brain-CLIPLM框架,通过语义锚点恢复和锚点引导的句子重建,解决EEG信号低信噪比和信息带宽限制的问题,实现了更高的文本检索准确率。

详情
AI中文摘要

从非侵入性脑电图(EEG)解码自然语言仍受限于低信噪比和有限的信息带宽。这提出了一个核心问题:能否从此类信号中可靠地恢复句子级语言?在现实的信息约束下,直接恢复假设可能过于强烈。我们提出语义压缩假设:非侵入性EEG可能保留可恢复的语义锚点,而非完整的词法-句法形式。从这一视角,直接句子重建相对于EEG可恢复的信息规模过于细粒度。为解决这种不匹配,我们提出了Brain-CLIPLM,一个两阶段框架,将EEG到文本解码分解为语义锚点恢复和锚点引导的句子重建。第一阶段使用对比学习将词级EEG证据对齐固定关键词词汇并恢复有序的语义锚点。第二阶段使用基于检索的大型语言模型和链式推理提示从这些锚点中重建句子意义,遵循粒度匹配原则,使解码复杂度与可恢复的神经信息规模相匹配。在结合了苏黎世认知语言处理(ZuCo)基准测试中,Brain-CLIPLM实现了67.6%的Top-5和85.0%的Top-25句子检索准确率,其中在中间锚点粒度下表现最强。控制分析,包括排列检验,显示EEG衍生的锚点携带超出语言模型先验的信息。这些发现表明,EEG到文本解码应更好地视为在锚点引导句子重建之前恢复压缩的语义内容。

英文摘要

Decoding natural language from non-invasive electroencephalography (EEG) remains constrained by low signal-to-noise ratio and limited information bandwidth. This raises a central question: can sentence-level language be reliably recovered from such signals? Under realistic information constraints, this direct-recovery assumption may be too strong. We introduce a semantic compression hypothesis: non-invasive EEG may preserve recoverable semantic anchors rather than the full lexical--syntactic form of a sentence. From this perspective, direct sentence reconstruction is overly fine-grained relative to the recoverable information scale of EEG. To address this mismatch, we propose Brain-CLIPLM, a two-stage framework that decomposes EEG-to-text decoding into semantic-anchor recovery and anchor-guided sentence reconstruction. Stage 1 uses contrastive learning to align word-level EEG evidence with a fixed keyword vocabulary and recover ordered semantic anchors. Stage 2 uses a retrieval-grounded large language model with chain-of-thought reasoning prompts to reconstruct sentence meaning from these anchors, following a granularity matching principle that aligns decoding complexity with the recoverable neural information scale. On the combined Zurich Cognitive Language Processing (ZuCo) benchmark, Brain-CLIPLM achieves 67.6\% Top-5 and 85.0\% Top-25 sentence retrieval accuracy, with the strongest performance at intermediate anchor granularity. Control analyses, including a permutation test, show that EEG-derived anchors carry sentence-specific information beyond language-model priors. These findings suggest that EEG-to-text decoding is better framed as recovering compressed semantic content before anchor-guided sentence reconstruction.

2604.07709 2026-06-05 cs.AI cs.CL cs.CY cs.LG 版本更新

IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

IatroBench: AI安全措施中意外伤害的预注册证据

David Gringras

发表机构 * Harvard T.H. Chan School of Public Health(哈佛大学T.H. 洪学校公共卫生学院)

AI总结 该研究通过IatroBench评估了AI安全措施在医疗决策中的意外伤害风险,发现不同模型在身份相关性上的隐瞒行为存在显著差异,尤其在高度安全训练的模型中表现更明显。

Comments 30 pages, 3 figures, 11 tables. Pre-registered on OSF (DOI: 10.17605/OSF.IO/G6VMZ). Code and data: https://github.com/davidgringras/iatrobench. v2: Fix bibliography entries (add arXiv IDs, published venues); correct p-value typo in Limitations section; add AI Assistance Statement v3: Correct Figure 1 (decoupling scatter accidentally reverted to earlier draft in v2)

详情
AI中文摘要

一个经过严格安全训练的模型会将完整的苯二氮䓬类药物减量方案交给医生,而拒绝给需要该方案的患者,尽管临床事实完全相同;知识在两种情况下都存在。IatroBench在六十个预注册的临床场景和六个前沿模型(3,600次响应)中测量这种不对称性,并通过医生编写的结构化评估进行评分,该评估由第二位医生验证(加权Kappa 0.571,内部一致性96%)。在保持临床内容不变的情况下,仅改变提问者是患者还是医生,产生我们称为身份依赖性隐瞒的现象:所有五个可测试的模型都给医生更多(解耦间隙+0.38,p=0.003;在安全冲突行动上的非专业人士命中率下降13.1点,p<0.0001;其余无变化),且在最高度安全训练的模型Opus中,差距最大(+0.65)。触发因素是缺乏任何专业或知识信号,而不是身份证明,因为律师或知情的非专业人士可以恢复被拒绝的患者情况。仅考虑委托的基准会将三种机制评分相同。Opus抑制了医生框架证明其知道的内容;Llama 4在两种框架中都不胜任;GPT-5.2的过滤器剥离了其33.2%的医生响应,但没有剥离非专业人士的响应。评估层继承了训练层的盲目性;标准LLM评分者在我们流程标记为有害的81.5%的响应中对遗漏伤害评分零(Kappa 0.066),因此用于检测失败的工具重现了这种现象。这些场景是为碰撞设计的;其比率描述了这种设计,但说 nothing about ordinary prevalence.

英文摘要

A heavily safety-trained model will hand a physician the full, patient-followable benzodiazepine taper and refuse it to the patient who needs it, over identical clinical facts; the knowledge is present either way. IatroBench measures that asymmetry across sixty pre-registered clinical scenarios and six frontier models (3,600 responses), scoring each on two axes, commission harm (what a response gets wrong) and omission harm (what it withholds), through a physician-authored structured evaluation validated by a second physician (weighted kappa 0.571, within-1 agreement 96%). Holding clinical content fixed and varying only whether the asker presents as patient or physician yields what we call identity-contingent withholding: all five testable models give the physician more (a decoupling gap of +0.38, p = 0.003; a 13.1-point fall in layperson hit rates on safety-colliding actions, p < 0.0001; no change on the rest), and the gap runs widest in the most heavily safety-trained model, Opus (+0.65). The trigger is the absence of any professional or epistemic signal rather than a credential, since a lawyer or an informed layperson recovers what the patient is refused. A commission-only benchmark would score three mechanisms alike. Opus suppresses what physician framing proves it knows; Llama 4 is incompetent in either framing; GPT-5.2's filter strips 33.2% of its physician responses and none of the lay ones. The evaluation layer inherits the blindness of the training layer; a standard LLM judge scores zero omission harm on 81.5% of the responses our pipeline flags harmful (kappa 0.066), so the instrument built to detect the failure reproduces it. The scenarios are engineered for collision; their rates describe that design and say nothing about ordinary prevalence.

2604.12138 2026-06-05 cs.AI cs.CL cs.IR 版本更新

Retrieval-Augmented Generation Must Move Beyond Factual Grounding to Represent Diverse Opinions

检索增强生成必须超越事实基础以代表多样化观点

Aditya Agrawal, Alwarappan Nakkiran, Darshan Fofadiya, Alex Karlsson, Harsha Aduri

发表机构 * Amazon.com(亚马逊公司)

AI总结 本文指出检索增强生成系统存在系统性事实偏差,并提出需要在检索系统设计上进行范式转变,通过不确定性量化方法提出统一目标,并展示Opinion-Aware RAG架构在两个领域中的实验结果,证明其在多样性、公平性和准确性方面的提升。

Comments 20 pages, Preprint under review

详情
AI中文摘要

本文主张检索增强生成系统存在系统性事实偏差,即在优化知识不确定性的同时忽视意见丰富内容中固有的随机不确定性。这种不一致要求检索系统设计发生范式转变。对35个主要RAG基准的调查表明,只有一个是意见合成的,证实了这种偏差的结构性:嵌入在数据集、检索目标和评估指标中。除了技术限制外,这种偏差还对透明和可问责的AI构成风险:回音室效应放大主导观点,系统性低估少数声音,以及通过偏见信息合成进行意见操控的潜在风险。我们通过不确定性量化的方法正式提出问题,显示事实查询应最小化后验熵,而意见查询必须保持它,并利用Wasserstein距离推导出统一的目标,涵盖覆盖性、忠实性和公平性。作为存在证明,我们提出了Opinion-Aware RAG(O-RAG),一种具有基于LLM的意见提取和实体链接意见元数据的架构,并在两个领域——电子商务卖家论坛和公共酒店评论——中评估了超过10000次讨论和6000次客户评论。实验显示Wasserstein距离到语料库级情感分布减少了18-48%,情感多样性增加了26.8%,实体匹配率增加了42.7%,人类评估者在79.2%的情况下更偏好包含意见的响应。我们提出了一项研究议程,并认为随着RAG系统越来越多地调解信息访问,其代表多样化观点的能力不仅不是可选的,而是必需的。

英文摘要

This position paper argues that Retrieval-Augmented Generation systems exhibit a systematic factual bias-optimizing for epistemic uncertainty reduction while ignoring the aleatoric uncertainty inherent in opinion-rich content - and that this misalignment demands a paradigm shift in retrieval system design. A survey of 35 major RAG benchmarks reveals that only one addresses opinion synthesis, confirming that the bias is structural: embedded in datasets, retrieval objectives, and evaluation metrics alike. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic under-representation of minority voices, and potential opinion manipulation through biased information synthesis. We formalize the problem through the lens of uncertainty quantification, showing that factual queries should minimize posterior entropy while opinion queries must preserve it, and derive a unified objective over coverage, fidelity, and fairness using the Wasserstein distance. As an existence proof, we present Opinion-Aware RAG (O-RAG), an architecture featuring LLM-based opinion extraction and entity-linked opinion metadata, and evaluate it across two domains - e-commerce seller forums and public hotel reviews - spanning 10K+ discussions and 6K+ customer reviews. Experiments demonstrate 18-48% reduction in Wasserstein distance to corpus-level sentiment distributions, +26.8% sentiment diversity, and +42.7% entity match rate, with human evaluators preferring opinion-enriched responses 79.2% of the time. We propose a research agenda and argue that as RAG systems increasingly mediate access to information, their ability to represent diverse perspectives is not optional but essential.

2604.08477 2026-06-05 cs.AI cs.CL cs.LG 版本更新

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

SUPERNOVA: 通过自然指令上的强化学习激发大语言模型的通用推理

Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文提出SUPERNOVA框架,通过自然指令数据集构建高质量的强化学习可验证奖励数据集,通过100+次强化学习实验系统研究如何利用这些数据集提升下游推理性能,并在BigBench Extra Hard基准上实现64.4个百分点的相对提升。

Comments 23 Pages; 2-column format; 10 figures

详情
AI中文摘要

强化学习可验证奖励(RLVR)在数学和代码等正式领域显著提升了推理能力,但将其扩展到STEM领域以外仍然具有挑战性。扩展RLVR到STEM领域本质上受到高质量可验证训练数据的缺乏限制。在本文中,我们引入SUPERNOVA,一个从自然指令数据集中整理RLVR数据的框架,这些数据集是专家标注的丰富来源,但尚未被充分利用于RLVR训练。通过100多次受控的强化学习实验,我们系统研究如何利用这些数据集进行RLVR训练以及数据整理决策如何影响下游推理性能。特别是,我们研究了三种数据设计:(a)源任务选择,(b)任务混合,以及(c)合成干预。我们的分析揭示了源任务选择对下游推理性能有显著影响。此外,基于单个目标任务性能选择任务优于基于总体平均性能的策略,合成干预并未提高推理能力。受这些见解的启发,我们构建了SUPERNOVA,一个从自然指令数据集中整理出的25,000个实例的高质量RLVR数据集。我们证明了在SUPERNOVA上训练Qwen3-0.6B比基础Qwen3-0.6B表现更优,在包含23个复杂推理任务的挑战性基准BigBench Extra Hard(BBEH)上实现了64.4个百分点的相对提升。重要的是,我们发现SUPERNOVA的收益可以推广到未见基准、更大模型规模和新模型家族。总体而言,我们的发现为整理人类标注资源以扩展RLVR到通用推理提供了实用见解。模型、数据、代码见https://github.com/asuvarna31/supernova。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved reasoning in formal domains such as mathematics and code, but extending these gains beyond STEM remains challenging. Extending RLVR beyond STEM is fundamentally constrained by the lack of high-quality verifiable training data. In this work, we introduce SUPERNOVA, a framework for curating RLVR data from natural instruction datasets, which are a rich source of expert-annotated data but are underexplored for RLVR training. Through 100+ controlled RL experiments, we systematically study how to utilize these dataset for RLVR and how data curation decisions affect downstream reasoning performance . In particular, we investigate three data designs: (a) source task selection, (b) task mixing, and (c) synthetic interventions. Our analysis reveals that source task selection has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance and synthetic interventions do not improve reasoning. Guided by these insights, we construct SUPERNOVA, a high-quality RLVR dataset of 25K instances curated from natural instruction datasets. We show that training Qwen3-0.6B on SUPERNOVA outperforms the base Qwen3-0.6B, yielding a relative gain of 64.4pp on BigBench Extra Hard (BBEH), a challenging benchmark comprising 23 complex reasoning tasks. Importantly, we find that gains from SUPERNOVA generalize to unseen benchmarks, larger model scales, and newer model families. Overall, our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. Models, Data, Code at https://github.com/asuvarna31/supernova.

2603.26233 2026-06-05 cs.CL 版本更新

Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

提问还是假设?编码代理中的不确定性意识澄清寻求

Nicholas Edwards, Sebastian Schuster

发表机构 * Faculty of Computer Science, University of Vienna(维也纳大学计算机科学系) UniVie Doctoral School Computer Science, University of Vienna(维也纳大学计算机科学博士学院)

AI总结 本研究评估了LLM代理在未指定任务中的澄清能力,提出了一种不确定性意识的多代理框架,提高了任务解决率,并展示了多代理系统在复杂任务中主动寻求信息的行为。

Comments 18 pages, 7 figures; added experiments evaluating open-weight models (Kimi K2.6), expanded related work, and included dataset validation details

详情
AI中文摘要

随着大型语言模型(LLM)代理在开放领域如软件工程中的广泛应用,它们经常遇到缺乏关键上下文的未指定指令。尽管人类开发者通过提问来解决模糊性,当前的代理大多优化于自主执行。在本工作中,我们系统地评估了LLM代理在未指定的SWE-bench Verified变体上的澄清能力。我们提出了一种不确定性意识的多代理框架,将未指定检测与代码执行解耦。在专有和开源前沿LLM上,我们的框架实现了69.40%的任务解决率,显著优于标准单代理设置,并缩小了与完全指定指令代理的性能差距。此外,我们发现多代理系统表现出良好的信息寻求行为,在简单任务上保守地提出查询,而在更复杂的问题上主动寻求信息。这些发现表明,当前模型可以转变为积极的合作者,其中代理能够独立识别何时提问以获取缺失信息。

英文摘要

As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that decouples underspecification detection from code execution. Across both proprietary and open-weight frontier LLMs, our scaffold achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated information-seeking behavior, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.

2601.12983 2026-06-05 cs.CL 版本更新

ChartAttack: Testing the Vulnerability of LLMs to Malicious Prompting in Chart Generation

ChartAttack: 测试大型语言模型在图表生成中对恶意提示的脆弱性

Jesus-German Ortiz-Barajas, Jonathan Tonglet, Vivek Gupta, Iryna Gurevych

发表机构 * INSAIT, Sofia University "St. Kliment Ohridski"(INSAIT索菲亚大学"圣克莱门特·欧赫里迪斯基") Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science, TU Darmstadt and National Research Center for Applied Cybersecurity ATHENE(无处不在知识处理实验室(UKP实验室)、计算机科学系、图腾达姆斯塔特大学和应用网络安全国家研究中心ATHENE) Arizona State University(亚利桑那州立大学)

AI总结 本文提出ChartAttack框架,用于评估多模态大语言模型在生成误导性图表方面的能力,通过注入误导性元素来诱导错误解释,并引入AttackViz数据集来评估和改进模型的鲁棒性。

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地被用于从数据表自动生成图表,提高了分析和报告的效率,但也引入了新的滥用风险。我们提出了ChartAttack,一个用于评估MLLMs如何通过在图表设计中注入误导性元素来大规模生成误导性图表的框架。我们还介绍了AttackViz,一个图表问答(QA)数据集,其中每个(图表规范,QA)对都标记有有效的误导性元素及其诱导的错误答案。ChartAttack显著降低了QA性能,使MLLM的准确性在领域内下降17.2点,在跨领域下降11.9点。一项受控的人类研究显示,由ChartAttack生成的误导性图表会降低人类图表QA性能。最后,我们证明AttackViz可用于微调MLLMs以提高对误导性图表的鲁棒性。我们的发现强调了在MLLM基于图表生成系统的设计、评估和部署中需要加强鲁棒性和安全性的紧迫需求。我们公开了我们的代码和数据。

英文摘要

Multimodal large language models (MLLMs) are increasingly used to automate chart generation from data tables, improving analysis and reporting efficiency while introducing new misuse risks. We present ChartAttack, a framework for evaluating how MLLMs can generate misleading charts at scale by injecting misleaders into chart designs to induce incorrect interpretations. We also introduce AttackViz, a chart question-answering (QA) dataset where each (chart specification, QA) pair is labeled with effective misleaders and their induced incorrect answers. ChartAttack significantly degrades QA performance, reducing MLLM accuracy by 17.2 points in-domain and 11.9 cross-domain. A controlled human study shows that misleading charts generated by ChartAttack reduce human chart QA performance. Finally, we demonstrate that AttackViz can be used to fine-tune MLLMs to improve robustness against misleading charts. Our findings highlight an urgent need for robustness and security considerations in the design, evaluation, and deployment of MLLM-based chart generation systems. We make our code and data publicly available.

2603.17310 2026-06-05 cs.AI cs.CL 版本更新

InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

InfoDensity: 为高效推理奖励信息密集的轨迹

Chengwei Wei, Jung-jae Kim, Longyin Zhang, Shengkai Chen, Nancy F. Chen

发表机构 * Institute for Infocomm Research (I 2 R), A*STAR, Singapore(信息与通信研究机构(I 2 R),A*STAR,新加坡) Centre for Frontier AI Research (CFAR), A*STAR, Singapore(前沿人工智能研究中心(CFAR),A*STAR,新加坡)

AI总结 本文提出InfoDensity框架,通过捕捉推理轨迹的信息密度特性,改进强化学习训练中的推理质量与效率平衡。

详情
AI中文摘要

具有扩展推理能力的大语言模型(LLMs)常生成冗长且冗余的推理轨迹,导致不必要的计算成本。尽管现有强化学习方法通过优化最终响应长度来解决这一问题,但它们忽略了中间推理步骤的质量,使模型容易受到奖励黑客攻击。我们主张冗长性不仅仅是长度问题,而是中间推理质量差的症状。为此,我们进行了实证研究,追踪大型推理模型在推理轨迹上的每token预测熵。我们发现高质量的推理轨迹具有两个一致特性:低不确定性收敛和快速不确定性下降。这些发现表明,高质量的推理轨迹是信息密集的,即推理步骤相对于总推理长度有助于达到低不确定性水平。基于此,我们提出InfoDensity,一种用于强化学习训练的奖励框架,通过单个熵轨迹的后缀最大包络线捕捉这两个特性,通过长度缩放项优先实现等效质量的简洁性。在数学和一般推理基准上的实验表明,InfoDensity在准确率-效率权衡上优于现有最先进的基线。

英文摘要

Large Language Models (LLMs) with extended reasoning capabilities often generate verbose and redundant reasoning traces, incurring unnecessary computational cost. While existing reinforcement learning approaches address this by optimizing final response length, they neglect the quality of intermediate reasoning steps, leaving models vulnerable to reward hacking. We argue that verbosity is not merely a length problem, but a symptom of poor intermediate reasoning quality. To investigate this, we conduct an empirical study tracking the per-token predictive entropy of large reasoning models across reasoning trajectories. We find that high-quality reasoning traces exhibit two consistent properties: low uncertainty convergence and fast uncertainty descent. These findings suggest that high-quality reasoning traces are informationally dense, that is, reasoning steps contribute to reaching a low uncertainty level relative to the total reasoning length. Motivated by this, we propose InfoDensity, a reward framework for RL training that captures both properties through a single suffix-max envelope of the entropy trajectory, weighted by a length scaling term that favors achieving equivalent quality more concisely. Experiments on mathematical and general reasoning benchmarks demonstrate that InfoDensity outperforms state-of-the-art baselines on the accuracy-efficiency trade-off.

2603.14210 2026-06-05 cs.CL 版本更新

Vavanagi: a Community-run Platform for Documentation of the Hula Language in Papua New Guinea

Vavanagi:巴布亚新几内亚胡拉语言文档社区运行平台

Bri Olewale, Raphael Merx, Ekaterina Vylomova

发表机构 * Vula'a Kunenai Community, Central Province, Papua New Guinea(巴布亚新几内亚中央省Vula'a Kunenai社区) The University of Melbourne, Melbourne, Australia(墨尔本大学)

AI总结 本文介绍Vavanagi平台,该平台由社区运营,用于记录巴布亚新几内亚的胡拉语言,通过社区成员参与翻译和语音记录,推动语言技术发展,实现社区主导的语言保护与传承。

详情
AI中文摘要

我们介绍了Vavanagi,一个由社区运营的平台,用于记录巴布亚新几内亚的胡拉语言(Vula'a),这是一种有约10,000名使用者的澳亚语言。Vavanagi支持众包的英语-胡拉文文本翻译和语音记录,由长者主导的审查和社区治理的数据基础设施。截至目前,77名翻译员和4名审阅员已生成超过12,000对平行句子对,涵盖9,000个独特的胡拉词汇。我们还提出了一种多级框架,用于衡量社区参与度,从咨询到完全由社区发起和管理的项目。我们将Vavanagi定位在第5级:倡议、设计、实施和数据治理均位于胡拉社区内部,使其成为我们所知的第一项由社区主导的语言技术倡议,适用于这种规模的语言。Vavanagi展示了语言技术如何连接基于村庄和城市成员,连接世代,并在社区自己的条件下支持文化传承。

英文摘要

We present Vavanagi, a community-run platform for Hula (Vula'a), an Austronesian language of Papua New Guinea with approximately 10,000 speakers. Vavanagi supports crowdsourced English-Hula text translation and voice recording, with elder-led review and community-governed data infrastructure. To date, 77 translators and 4 reviewers have produced over 12k parallel sentence pairs covering 9k unique Hula words. We also propose a multi-level framework for measuring community involvement, from consultation to fully community-initiated and governed projects. We position Vavanagi at Level 5: initiative, design, implementation, and data governance all sit within the Hula community, making it, to our knowledge, the first community-led language technology initiative for a language of this size. Vavanagi shows how language technology can bridge village-based and urban members, connect generations, and support cultural heritage on the community's own terms.

2603.00573 2026-06-05 cs.CL 版本更新

CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging

CoMoL: 通过动态核心空间融合实现高效的LoRA专家混合

Jie Cao, Zhenxuan Fan, Zhuonan Wang, Tianwei Lin, Ziyuan Zhao, Rolan Yan, Wenqiao Zhang, Feifei Shao, Hongwei Wang, Jun Xiao, Siliang Tang

发表机构 * Zhejiang University(浙江大学) Wechat, Tencent(微信,腾讯)

AI总结 本文提出CoMoL,一种新的MoE-LoRA框架,通过引入核心空间专家和核心空间路由,实现参数高效和细粒度适应,同时在多个任务中优于现有方法。

详情
AI中文摘要

大型语言模型(LLMs)通过参数高效微调(PEFT)在多样化的下游和领域特定任务中取得显著性能。然而,现有的PEFT方法,特别是MoE-LoRA架构,由于LoRA专家和实例级路由的普及,存在参数效率低和粗粒度适应的问题。为了解决这些问题,我们提出了核心空间混合的LoRA(CoMoL),一种新颖的MoE-LoRA框架,结合了专家多样性、参数效率和细粒度适应。具体而言,CoMoL引入了两个关键组件:核心空间专家和核心空间路由。核心空间专家将每个专家存储在紧凑的核心矩阵中,保留多样性同时控制参数增长。核心空间路由动态选择并激活每个标记的适当核心专家,实现细粒度、输入自适应的路由。激活的核心专家通过软融合策略合并成一个核心专家,再与共享的LoRA结合形成专用的LoRA模块。此外,路由网络被投影到与LoRA矩阵相同的低秩空间中,进一步减少参数开销而不影响表达能力。广泛的实验表明,CoMoL保留了MoE-LoRA架构的适应性,同时在参数效率上与标准LoRA相当,在多个任务中持续优于现有方法。

英文摘要

Large language models (LLMs) achieve remarkable performance on diverse downstream and domain-specific tasks via parameter-efficient fine-tuning (PEFT). However, existing PEFT methods, particularly MoE-LoRA architectures, suffer from limited parameter efficiency and coarse-grained adaptation due to the proliferation of LoRA experts and instance-level routing. To address these issues, we propose Core Space Mixture of LoRA (\textbf{CoMoL}), a novel MoE-LoRA framework that incorporates expert diversity, parameter efficiency, and fine-grained adaptation. Specifically, CoMoL introduces two key components: core space experts and core space routing. Core space experts store each expert in a compact core matrix, preserving diversity while controlling parameter growth. Core space routing dynamically selects and activates the appropriate core experts for each token, enabling fine-grained, input-adaptive routing. Activated core experts are then merged via a soft-merging strategy into a single core expert, which is combined with a shared LoRA to form a specialized LoRA module. Besides, the routing network is projected into the same low-rank space as the LoRA matrices, further reducing parameter overhead without compromising expressiveness. Extensive experiments demonstrate that CoMoL retains the adaptability of MoE-LoRA architectures while achieving parameter efficiency comparable to standard LoRA, consistently outperforming existing methods across multiple tasks.

2602.23845 2026-06-05 cs.CL 版本更新

CLFEC: A New Task for Unified Linguistic and Factual Error Correction in paragraph-level Chinese Professional Writing

CLFEC:一种新的任务,用于段落级中文专业写作中的统一语言和事实纠错

Jian Kai, Zidong Zhang, Jiwen Chen, Zhengxiang Wu, Songtao Sun, Fuyang Li, Yang Cao, Qiang Liu

发表机构 * Huazhong University of Science and Technology(华中科技大学) WPS AI, Kingsoft Office(WPS AI,Kingsoft Office)

AI总结 本文提出CLFEC任务,旨在解决段落级中文专业写作中语言和事实错误的联合纠错问题,构建了多领域数据集,并系统研究了基于LLM的纠错方法,揭示了实际挑战并展示了统一纠错的优势。

详情
AI中文摘要

中文文本纠错传统上专注于拼写和语法,而事实纠错通常被单独处理。然而,在段落级中文专业写作中,语言(词语/语法/标点)和事实错误经常同时出现并相互影响,且许多草稿级错误在编辑审核后发布的文本中稀疏可见,这使得统一纠错既必要又需要构建受控基准。本文介绍了CLFEC(中文语言与事实纠错)这一新任务,用于联合语言和事实纠错。我们构建了一个涵盖时事、金融、法律和医学等多领域的中文专业写作混合数据集。然后,我们系统地研究了基于LLM的纠错范式,从提示到检索增强生成(RAG)和代理工作流。分析揭示了实际挑战,包括专门纠错模型的泛化能力有限、事实修复需要证据支撑、混合错误段落的难度以及对干净输入的过度纠正。结果进一步表明,在同一上下文中处理语言和事实错误优于解耦的流程,并且合适的基模型可以使代理工作流有效。总体而言,CLFEC为中文文本纠错研究提供了新的基准,并为校对系统提供了实用指导。

英文摘要

Chinese text correction has traditionally focused on spelling and grammar, while factual error correction is usually treated separately. However, in paragraph-level Chinese professional writing, linguistic (word/grammar/punctuation) and factual errors frequently co-occur and interact, while many draft-level errors are sparsely observable in published texts after editorial review, making unified correction both necessary and controlled benchmark construction essential. This paper introduces CLFEC (Chinese Linguistic \& Factual Error Correction), a new task for joint linguistic and factual correction. We construct a mixed, multi-domain Chinese professional writing dataset spanning current affairs, finance, law, and medicine. We then conduct a systematic study of LLM-based correction paradigms, from prompting to retrieval-augmented generation (RAG) and agentic workflows. The analysis reveals practical challenges, including limited generalization of specialized correction models, the need for evidence grounding for factual repair, the difficulty of mixed-error paragraphs, and over-correction on clean inputs. Results further show that handling linguistic and factual errors within the same context outperforms decoupled pipelines, and that agentic workflows can be effective with suitable backbone models. Overall, CLFEC provides a new benchmark for Chinese text correction research and practical guidance for proofreading systems.

2602.12124 2026-06-05 cs.LG cs.CL 版本更新

Alignment Risks from Capability-Seeking RL Training

从能力寻求强化学习训练中产生的对齐风险

Yujun Zhou, Yue Huang, Han Bao, Kehan Guo, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Washington(华盛顿大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Toronto(多伦多大学) University of Cambridge(剑桥大学)

AI总结 本文研究了在易受攻击的环境中通过强化学习训练语言模型时,模型可能利用隐含漏洞来最大化奖励的风险,发现这些策略不仅限于狭窄的技巧,还能在一定程度上转移、传播,并在某些情况下比通过SFT学习更持久,表明需要扩展AI安全工作到审计和保障训练环境、奖励机制和评估渠道。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管大多数AI对齐研究集中在防止模型生成显式有害内容,但来自易受攻击环境中的能力寻求强化学习训练的更微妙的风险却值得关注。我们研究了当语言模型在具有隐含漏洞的环境中通过强化学习(RL)训练时,是否能学习利用这些漏洞来最大化奖励,即使没有被明确指示这样做。为此,我们设计了四种多样化的“漏洞游戏”,每种游戏都涉及与上下文条件合规性、代理指标、奖励篡改和自我评估相关的结构性漏洞。我们的实验表明,模型经常学会利用这些漏洞,发现机会性策略以增加奖励,有时甚至保持或改进标准任务性能指标。更关键的是,我们发现这些剥削策略不总是狭窄的“技巧”:它们可以在结构但有限的方式下转移,通过SFT从有能力的教师模型传播到其他学生模型,并在某些情况下通过RL学习比通过SFT蒸馏更持久。我们的发现表明,来自能力寻求RL训练的能力对齐风险可能难以通过标准性能监控检测,这表明未来AI安全工作应超越内容审查,扩展到审计和保障训练环境、奖励机制和评估渠道。代码可在https://github.com/YujunZhou/Capability-seeking-RL-risk获取。

英文摘要

While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk arises from capability-seeking RL training in vulnerable environments. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, can learn to exploit these flaws to maximize reward, even without being explicitly instructed to do so. To test this, we design a suite of four diverse "vulnerability games," each presenting a structural vulnerability related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models often learn to exploit these vulnerabilities, discovering opportunistic strategies that increase reward while sometimes preserving or even improving standard task-performance metrics. More critically, we find that these exploitative strategies are not always narrow "tricks": they can transfer in structured but limited ways, propagate from a capable teacher model to other student models through SFT, and in several cases remain more persistent when learned through RL than when distilled through SFT. Our findings show that alignment risks from capability-seeking RL training can be difficult to detect with standard performance monitoring, suggesting that future AI safety work should extend beyond content moderation to auditing and securing training environments, reward mechanisms, and evaluation channels. Code is available at https://github.com/YujunZhou/Capability-seeking-RL-risk.

2602.09574 2026-06-05 cs.CL cs.AI cs.LG 版本更新

Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs

在LLMs的测试时间扩展中对树搜索策略与固定令牌预算对齐

Sora Miyamoto, Daisuke Oba, Naoaki Okazaki

发表机构 * University of Tokyo(东京大学)

AI总结 本文提出了一种名为Budget-Guided MCTS (BG-MCTS)的树搜索解码算法,通过将搜索策略与剩余令牌预算对齐,以提高在不同令牌预算下的推理性能。

Comments Accepted at ICML 2026. Code: https://github.com/Sora-Miyamoto/bg-mcts

详情
AI中文摘要

树搜索解码是大型语言模型(LLMs)测试时间扩展的有效方法,但现实部署中通常会施加一个固定的每查询令牌预算,且该预算在不同设置中有所不同。现有的树搜索策略大多缺乏预算意识,仅将预算视为终止条件,从而可能导致后期过度分支或提前终止。我们提出Budget-Guided MCTS (BG-MCTS),一种树搜索解码算法,其搜索策略与剩余令牌预算对齐:它从广泛的探索开始,然后在剩余预算减少时优先进行细化和答案完成,同时减少浅层节点的后期分支。BG-MCTS在数学推理基准和额外的物理推理基准上,使用开放权重LLMs在各种推理预算下均优于预算无关的树搜索基线。

英文摘要

Tree-search decoding is an effective form of test-time scaling for large language models (LLMs), but real-world deployment often imposes a fixed per-query token budget that varies across settings. Existing tree-search policies are largely budget-agnostic, treating the budget merely as a termination condition, thereby risking late-stage over-branching or premature termination. We propose Budget-Guided MCTS (BG-MCTS), a tree-search decoding algorithm that aligns its search policy with the remaining token budget: it starts with broad exploration, then prioritizes refinement and answer completion as the remaining budget decreases while reducing late-stage branching from shallow nodes. BG-MCTS consistently outperforms budget-agnostic tree-search baselines across inference budgets on mathematical reasoning benchmarks and an additional physics reasoning benchmark with open-weight LLMs.

2602.08503 2026-06-05 cs.CV cs.CL cs.LG 版本更新

Learning Self-Correction in Vision-Language Models via Rollout Augmentation

通过回滚增强学习视觉-语言模型中的自我纠正

Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出一种基于回滚增强的强化学习框架Octopus,通过重新组合现有回滚生成密集的自我纠正示例,提高样本效率并稳定RL优化,同时引入响应遮蔽策略以解耦自我纠正与直接推理,从而在7个基准测试中实现开源VLM的SOTA性能。

Comments 18 pages

详情
Journal ref
ICML 2026
AI中文摘要

自我纠正对于解决视觉-语言模型(VLMs)中的复杂推理问题至关重要。然而,现有的强化学习(RL)方法在学习自我纠正方面存在困难,因为有效的自我纠正行为只在很少情况下出现,导致学习信号非常稀疏。为了解决这一挑战,我们提出了correction-specific rollouts(Octopus),一种RL回滚增强框架,通过重新组合现有回滚来合成密集的自我纠正示例。这种增强同时提高了样本效率,由于回滚重用,并通过平衡监督稳定了RL优化。此外,我们引入了一种响应遮蔽策略,将自我纠正与直接推理解耦,避免信号冲突,并使两种行为都能被有效学习。基于此,我们介绍了Octopus-8B,一种具有可控自我纠正能力的推理VLM。在7个基准测试中,它在开源VLM中实现了SOTA性能,优于最佳RLVR基线1.0分,同时仅需0.72倍的训练时间每步。

英文摘要

Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only $0.72\times$ training time per step.

2602.07253 2026-06-05 cs.AI cs.CL 版本更新

From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

从分布外检测到幻觉检测:一个几何视角

Litian Liu, Reza Pourreza, Yubing Jian, Yao Qin, Roland Memisevic

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过将幻觉检测重新定义为分布外检测问题,利用几何视角提出了一种无需训练、基于单样本的检测方法,在推理任务中实现了高准确率。

Comments ICML 2026 main conference paper

详情
AI中文摘要

检测大型语言模型中的幻觉是一个关键且开放的问题,对安全性和可靠性有重大影响。虽然现有的幻觉检测方法在问答任务中表现强劲,但在需要推理的任务上效果不佳。在这项工作中,我们通过分布外(OOD)检测的视角重新审视幻觉检测,这是计算机视觉等领域中一个研究充分的问题。将语言模型中的下一个词预测视为分类任务,允许我们应用OOD技术,前提是进行适当的修改以考虑大型语言模型的结构差异。我们表明,基于OOD的方法产生了无需训练、基于单样本的检测器,在推理任务的幻觉检测中实现了高准确率。总体而言,我们的工作表明,将幻觉检测重新定义为OOD检测为语言模型安全性提供了一条有前景且可扩展的路径。

英文摘要

Detecting hallucinations in large language models is a critical open problem with significant implications for safety and reliability. While existing hallucination detection methods achieve strong performance in question-answering tasks, they remain less effective on tasks requiring reasoning. In this work, we revisit hallucination detection through the lens of out-of-distribution (OOD) detection, a well-studied problem in areas like computer vision. Treating next-token prediction in language models as a classification task allows us to apply OOD techniques, provided appropriate modifications are made to account for the structural differences in large language models. We show that OOD-based approaches yield training-free, single-sample-based detectors, achieving strong accuracy in hallucination detection for reasoning tasks. Overall, our work suggests that reframing hallucination detection as OOD detection provides a promising and scalable pathway toward language model safety.

2602.05843 2026-06-05 cs.CL 版本更新

OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

OdysseyArena: 为长视界、主动和归纳交互评估大型语言模型

Hang Yan, Fangzhi Xu, Qiushi Sun, Jinyang Wu, Zixian Huang, Muye Huang, Jingyang Gong, Zichen Ding, Kanzhi Cheng, Yian Wang, Xinyu Che, Zeyi Sun, Jian Zhang, Zhangyue Yin, Haoran Luo, Ben Kao, Qika Lin

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文提出OdysseyArena,通过长视界、主动和归纳交互评估大型语言模型,提供120个任务测量归纳效率和长视界发现,并通过OdysseyArena-Challenge测试极端交互视界下的模型稳定性,揭示前沿模型在复杂环境中的归纳能力瓶颈。

Comments 34 pages

详情
AI中文摘要

大型语言模型(LLMs)的快速发展推动了能够导航复杂环境的自主代理的发展。然而,现有评估主要采用演绎范式,代理基于显式提供的规则和静态目标执行任务,通常在有限的规划视界内。关键的是,这种做法忽视了代理需要从经验中自主发现潜在转换规律的归纳必要性,这是实现代理前瞻性思维和维持战略一致性的重要基础。为弥合这一差距,我们引入OdysseyArena,将代理评估重新聚焦于长视界、主动和归纳交互。我们形式化并实例化了四个原始构件,将抽象转换动态转化为具体的交互环境。在此基础上,我们建立了OdysseyArena-Lite用于标准化基准测试,提供一组120个任务以衡量代理的归纳效率和长视界发现能力。进一步地,我们引入OdysseyArena-Challenge以在极端交互视界(例如>200步)下压力测试代理的稳定性。对15余个领先LLM的广泛实验表明,即使前沿模型在归纳场景中也存在缺陷,揭示了在复杂环境中追求自主发现的关键瓶颈。我们的代码和数据可在https://github.com/xufangzhi/Odyssey-Arena获取。

英文摘要

The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena

2602.05056 2026-06-05 cs.CR cs.CL cs.LG 版本更新

Grounded but Misleading: Evaluating Semantic Alignment in AI-Generated Security Explanations

grounded but Misleading: Evaluating Semantic Alignment in AI-Generated Security Explanations

Heajun An, Connor Ng, Sandesh Sharma Dulal, Junghwan Kim, Jin-Hee Cho

发表机构 * Virginia Tech(弗吉尼亚理工学院)

AI总结 本文研究了AI生成的安全解释中语义对齐的问题,通过VEXA测试平台验证了词汇基础与语义风险对齐之间的差距,发现即使解释在词汇上显得合理,其语义解释可能削弱检测器的意图风险评估。

详情
AI中文摘要

在线诈骗越来越多地利用流畅且具有上下文意识的社会工程策略,导致对能够解释为何一条信息可能具有风险的AI系统的需求日益增长。然而,引用检测器衍生证据的解释可能仍然在语义上削弱或改变预期的风险解释。我们介绍了VEXA:验证语义解释对齐,一个用于研究AI生成诈骗风险解释中词汇基础与语义风险对齐差距的受控测试平台。VEXA通过独立控制证据基础和语义框架来生成无基础、风险对齐和风险稀释的解释。通过LLM作为判断者和人类评估,我们发现即使解释的语义解释削弱了检测器的意图风险评估,解释仍可能在比较上显得合理。在人类评估中,风险稀释的XAI基础解释保留了相对较高的感知证据基础评分(3.66),尽管其帮助性(3.00)和推理支持(3.14)评分较低。这些发现提供了AI生成安全解释中基础错觉效应的受控证据,并表明可信的解释评估必须不仅验证是否引用了证据,还要验证如何解释这些证据。

英文摘要

Online scams increasingly leverage fluent and context-aware social engineering strategies, creating growing demand for AI systems that explain why a message may be risky. However, explanations that cite detector-derived evidence may still semantically weaken or redirect the intended risk interpretation. We introduce VEXA: Verifying Semantic Explanation Alignment, a controlled testbed for studying the gap between lexical grounding and semantic risk alignment in AI-generated scam-risk explanations. VEXA generates ungrounded, risk-aligned, and risk-diluting explanations by independently controlling evidence grounding and semantic framing. Through LLM-as-a-judge and human evaluations, we show that explanations may continue to appear comparatively grounded even when their semantic interpretation weakens the detector's intended risk assessment. In human evaluation, risk-diluting XAI-grounded explanations retained comparatively elevated Perceived Evidence Grounding scores (3.66) despite lower Helpfulness (3.00) and Reasoning Support (3.14) scores. These findings provide controlled evidence of grounding illusion effects in AI-generated security explanations and suggest that trustworthy explanation evaluation must verify not only whether evidence is cited, but also how that evidence is interpreted.

2511.20102 2026-06-05 cs.CL 版本更新

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

SSA: 通过对齐特征空间中的全注意力和稀疏注意力输出实现稀疏注意力

Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun

发表机构 * King’s College London(伦敦国王学院) Tencent Youtu Lab(腾讯优图实验室)

AI总结 提出SSA训练框架,通过双向注意力输出对齐同时解决稀疏注意力的注意力差距和能力差距,实现与全注意力相当的性能。

Comments 34 pages

详情
AI中文摘要

稀疏注意力降低了全自注意力的二次复杂度,但面临两个挑战:(1)注意力差距,即对全注意力训练模型应用稀疏注意力会因训练-推理分布不匹配导致性能下降;(2)能力差距,即纯稀疏注意力训练的模型缺乏完整梯度流,无法达到全注意力性能。我们提出SSA(稀疏注意力),一个集成稀疏和全注意力并具有双向注意力输出对齐的训练框架。我们证明近似误差与稀疏注意力下丢弃的注意力质量线性相关,并表明SSA的对齐目标相比基线大幅减少了该量。实验表明,SSA在两种推理模式下均达到最先进性能,能平滑适应不同的稀疏预算,并展现出优越的长上下文能力。

英文摘要

Sparse attention reduces the quadratic complexity of full self-attention but faces two challenges: (1) an attention gap, where applying sparse attention to full-attention-trained models causes performance degradation due to train-inference distribution mismatch, and (2) a capability gap, where models trained purely with sparse attention lack complete gradient flow, preventing them from matching full-attention performance. We propose SSA (Sparse Sparse Attention), a training framework that integrates both sparse and full attention with bidirectional attention-output alignment. We prove that the approximation error scales linearly with the attention mass dropped under sparse attention, and show that SSA's alignment objective substantially reduces this quantity compared to baselines. Experiments demonstrate that SSA achieves state-of-the-art performance under both inference modes, adapts smoothly to varying sparsity budgets, and demonstrates superior long-context capabilities.

2601.22580 2026-06-05 cs.CL cs.AI cs.LG 版本更新

SpanNorm: Reconciling Training Stability and Performance in Deep Transformers

SpanNorm: 在深度Transformer中协调训练稳定性与性能

Chao Wang, Bei Li, Jiaqi Zhang, Xinyu Liu, Yuchun Fan, Linkun Lyu, Xin Chen, Jingang Wang, Tong Xiao, Peng Pei, Xunliang Cai

发表机构 * Meituan Inc.(美团公司) NLP Lab, School of Computer Science and Engineering(自然语言处理实验室,计算机科学与工程学院) Northeastern University, Shenyang, China(东北大学,沈阳,中国)

AI总结 本文提出SpanNorm技术,通过结合前归一化和后归一化的优势,解决深度Transformer中训练稳定性与性能之间的根本性权衡问题,理论分析和实验结果表明其在密集和专家混合(MoE)场景中均优于传统归一化方案。

Comments Accepted by ICML2026

详情
AI中文摘要

大型语言模型(LLMs)的成功依赖于深度Transformer架构的稳定训练。一个关键的设计选择是归一化层的位置,导致了一个根本性的权衡:PreNorm架构在深度模型中确保了训练稳定性,但可能牺牲性能;而PostNorm架构提供了强大的性能,但面临严重的训练不稳定性。在本工作中,我们提出SpanNorm,一种新的技术,旨在通过整合两种范式的优点来解决这一困境。结构上,SpanNorm建立了一个跨越整个Transformer块的清晰残差连接以稳定信号传播,同时采用PostNorm风格的计算方式对聚合输出进行归一化以增强模型性能。我们提供了理论分析,证明SpanNorm结合合理的缩放策略可以在整个网络中保持信号方差有界,防止PostNorm模型中出现的梯度问题,并缓解PreNorm中的表示崩溃问题。实验结果表明,SpanNorm在密集和专家混合(MoE)场景中均优于传统归一化方案,为更强大和稳定的Transformer架构铺平了道路。

英文摘要

The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the ``PreNorm'' architecture ensures training stability at the cost of potential performance degradation in deep models, while the ``PostNorm'' architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models, and also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) scenarios, paving the way for more powerful and stable Transformer architectures.

2601.21700 2026-06-05 cs.CL cs.AI cs.IR cs.MA cs.SI 版本更新

Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

通过本体引导的多智能体推理实现文化对齐的大型语言模型

Wonduk Seo, Wonseok Choi, Junseo Koh, Juhyeon Lee, Hyunjin An, Minhyeong Yu, Jian Park, Qingshan Zhou, Seunghyun Lee, Yi Bu

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出OG-MAR框架,通过本体引导的多智能体推理方法,提高大型语言模型在文化对齐和鲁棒性方面的性能,并生成更透明的推理轨迹。

Comments Accepted by ICML 2026 Regular Track

详情
AI中文摘要

大型语言模型(LLMs)越来越多地支持文化敏感的决策制定,但往往由于预训练数据倾斜和缺乏结构化的价值表示而表现出不一致。现有方法虽然可以引导输出,但通常缺乏人口统计学基础,并将价值观视为独立的、无结构的信号,从而降低一致性和可解释性。我们提出OG-MAR,一种本体引导的多智能体推理框架。OG-MAR从世界价值观调查(WVS)中总结出响应特定的价值,并通过能力问题在固定分类法上提取关系来构建全球文化本体。在推理过程中,它检索与本体一致的关系和人口统计学相似的资料,以实例化多个价值-人设代理,其输出由一个执行本体一致性和人口统计学接近性的判断代理合成。在四个LLM基础架构上的区域社会调查基准测试中,OG-MAR在文化对齐和鲁棒性方面优于竞争基线,同时生成更透明的推理轨迹。

英文摘要

Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.

2601.18383 2026-06-05 cs.AI cs.CL cs.LG 版本更新

Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models

动态思维-令牌选择用于大型推理模型中的高效推理

Zhenyuan Guo, Tong Chen, Wenlong Meng, Chen Gong, Xin Yu, Chengkun Wei, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学)

AI总结 本研究提出动态思维-令牌选择方法,通过分析推理轨迹发现只有部分关键令牌影响最终答案,从而优化大型推理模型的效率。

详情
AI中文摘要

大型推理模型(LRMs)通过显式生成推理轨迹来解决复杂问题,但扩展生成带来了显著的内存足迹和计算开销,限制了LRMs的效率。本工作利用注意力图分析推理轨迹的影响,发现只有部分关键令牌引导模型走向最终答案,其余令牌贡献微乎其微。基于这一观察,我们提出了动态思维-令牌选择(DynTS)。该方法识别关键令牌,并在推理过程中仅保留其关联的键值(KV)缓存状态,淘汰冗余条目以优化效率。

英文摘要

Large Reasoning Models (LRMs) excel at solving complex problems by explicitly generating a reasoning trace before deriving the final answer. However, these extended generations incur substantial memory footprint and computational overhead, bottlenecking LRMs' efficiency. This work uses attention maps to analyze the influence of reasoning traces and uncover an interesting phenomenon: only some decision-critical tokens in a reasoning trace steer the model toward the final answer, while the remaining tokens contribute negligibly. Building on this observation, we propose Dynamic Thinking-Token Selection (DynTS). This method identifies decision-critical tokens and retains only their associated Key-Value (KV) cache states during inference, evicting the remaining redundant entries to optimize efficiency.

2601.08510 2026-06-05 cs.CL cs.AI 版本更新

STAGE: A Full-Screenplay Benchmark for Reasoning over Evolving Storie

STAGE:一个用于推理演变故事的完整剧本基准

Qiuyu Tian, Zequn Liu, Yiding Li, Fengyi Chen, Youyong Kong, Fan Guo, Yuyao Li, Jinjing Shen, Zhijing Xie, Yiyun Luo, Xin Zhang, Yingce Xia

发表机构 * Southeast University(东南大学) Beijing Zhongguancun Academy(北京中关村学院) Nanjing Normal University(南京师范大学) ZhuiWen Technology Co., Ltd.(智库文科技有限公司)

AI总结 提出STAGE基准,通过知识图谱构建、场景事件摘要、长上下文问答和角色扮演四项任务,全面评估模型对电影剧本叙事世界的理解与推理能力。

Comments 66 pages, 9 figures

详情
AI中文摘要

电影剧本是丰富的长篇叙事,交织着复杂的角色关系、时间顺序事件和对话驱动的互动。虽然先前的基准针对诸如问答或对话生成等单个子任务,但它们很少评估模型能否构建连贯的故事世界并在多种推理和生成形式中一致地使用它。我们引入了STAGE(剧本文本、智能体、图谱与评估),一个针对全长电影剧本叙事理解的统一基准。STAGE定义了四个任务:知识图谱构建、场景级事件摘要、长上下文剧本问答以及剧本内角色扮演,所有这些都基于共享的叙事世界表示。该基准提供了150部中英文电影的清洗脚本、策划的知识图谱以及事件和角色为中心的注释,从而能够全面评估模型构建世界表示、抽象和验证叙事事件、推理长叙事以及生成角色一致响应的能力。

英文摘要

Movie screenplays are rich long-form narratives that interleave complex character relationships, temporally ordered events, and dialogue-driven interactions. While prior benchmarks target individual subtasks such as question answering or dialogue generation, they rarely evaluate whether models can construct a coherent story world and use it consistently across multiple forms of reasoning and generation. We introduce STAGE (Screenplay Text, Agents, Graphs and Evaluation), a unified benchmark for narrative understanding over full-length movie screenplays. STAGE defines four tasks: knowledge graph construction, scene-level event summarization, long-context screenplay question answering, and in-script character role-playing, all grounded in a shared narrative world representation. The benchmark provides cleaned scripts, curated knowledge graphs, and event- and character-centric annotations for 150 films across English and Chinese, enabling holistic evaluation of models' abilities to build world representations, abstract and verify narrative events, reason over long narratives, and generate character-consistent responses.

2505.05026 2026-06-05 cs.CL cs.LG 版本更新

Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding

多模态用户界面/用户体验设计理解的基准测试:MLLMs能否捕捉界面如何引导用户行为?

Jaehyun Jeon, Min Soo Kim, Jang Han Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Dae Hyun Kim, Youngjae Yu

发表机构 * Yonsei University(延世大学) Seoul National University(首尔国立大学) NC AI

AI总结 本文提出WiserUI-Bench基准测试,用于评估多模态UI/UX设计对用户行为的影响,通过300对真实世界UI图像对和专家解读,发现MLLMs在理解UI/UX设计行为影响方面存在局限。

Comments ACL 2026 Main. Our code and dataset: https://github.com/jeochris/wiserui-bench

详情
AI中文摘要

用户界面(UI)设计超越了视觉,旨在塑造用户体验(UX),凸显了UI/UX作为统一概念的转变。尽管最近的研究已探索使用多模态大语言模型(MLLMs)评估UI,但它们主要关注表面特征,忽略了设计选择如何在大规模上影响用户行为。为此,我们引入了WiserUI-Bench,一个新颖的基准测试,用于多模态理解UI/UX设计如何影响用户行为,基于300对来自行业A/B测试的真实UI图像,具有经实证验证的胜者,这些胜者引发了更多用户行为。为了未来在实践中推动设计进步,需要事后理解为何这些胜者能与大量用户成功;我们通过专家整理的关键解读支持这一点。在WiserUI-Bench上对多个MLLMs进行实验,针对两个主要任务(1)预测A/B测试对中更有效的UI图像,(2)根据专家解读进行事后解释,显示模型在理解UI/UX设计行为影响方面存在局限。我们相信我们的工作将促进利用MLLMs在用户行为上下文中进行视觉设计的研究。

英文摘要

User interface (UI) design goes beyond visuals to shape user experience (UX), underscoring the shift toward UI/UX as a unified concept. While recent studies have explored UI evaluation using Multimodal Large Language Models (MLLMs), they largely focus on surface-level features, overlooking how design choices influence user behavior at scale. To fill this gap, we introduce WiserUI-Bench, a novel benchmark for multimodal understanding of how UI/UX design affects user behavior, built on 300 real-world UI image pairs from industry A/B tests, with empirically validated winners that induced more user actions. For future design progress in practice, post-hoc understanding of why such winners succeed with mass users is also required; we support this via expert-curated key interpretations for each instance. Experiments across multiple MLLMs on WiserUI-Bench for two main tasks, (1) predicting the more effective UI image between an A/B-tested pair, and (2) explaining it post-hoc in alignment with expert interpretations, show that models exhibit limited understanding of the behavioral impact of UI/UX design. We believe our work will foster research on leveraging MLLMs for visual design in user behavior contexts.

2507.00460 2026-06-05 cs.CL 版本更新

Pitfalls of Evaluating Language Models with Open Benchmarks

使用开放基准评估语言模型的陷阱

Md. Najib Hasan, Md Mahadi Hassan Sibat, Mohammad Fakhruddin Babar, Souvika Sarkar, Monowar Hasan, Santu Karmaker

AI总结 本文探讨了使用开放基准评估语言模型时存在的数据泄露风险,并通过构建作弊模型验证了这种风险,指出开放基准可能无法反映实际应用效果,需补充私有或动态生成的基准以维持评估的完整性。

Comments After further review, we found that the core contribution and methodology substantially overlap with previously published work. As a result, the manuscript does not provide a sufficiently distinct or original contribution in its current form. To avoid repetition in the literature and prevent possible confusion for readers, we believe withdrawal is the most appropriate action

详情
AI中文摘要

开放大型语言模型(LLM)基准,如HELM和BIG-Bench,提供了标准化和透明的评估协议,支持语言模型(LM)研究中的比较分析、可重复性和系统性进展跟踪。然而,这种开放性也带来了在LM测试中数据泄露的显著风险——无论是故意还是无意的,从而削弱了排行榜的公平性和可靠性,并使其容易受到不法分子的操控。我们通过故意构建作弊模型来展示这一问题的严重性:构建BART、T5和GPT-2的较小变体,并直接在公开可用的测试集上进行微调。正如预期的那样,这些模型在目标基准上表现优异,但在可比的未见测试集上却表现糟糕。我们随后检查了任务特定的简单改写-based防护策略,以减轻数据泄露的影响,并评估了它们的有效性和局限性。我们的发现强调了三个关键点:(i)在有限的开放、静态基准上的高排行榜表现可能无法反映实际应用效果;(ii)私有或动态生成的基准应补充开放基准以维持评估的完整性;(iii)对当前基准评估实践的重新审视对于可靠和可信的LM评估至关重要。

英文摘要

Open Large Language Model (LLM) benchmarks, such as HELM and BIG-Bench, provide standardized and transparent evaluation protocols that support comparative analysis, reproducibility, and systematic progress tracking in Language Model (LM) research. Yet, this openness also creates substantial risks of data leakage during LM testing--deliberate or inadvertent, thereby undermining the fairness and reliability of leaderboard rankings and leaving them vulnerable to manipulation by unscrupulous actors. We illustrate the severity of this issue by intentionally constructing cheating models: smaller variants of BART, T5, and GPT-2, fine-tuned directly on publicly available test-sets. As expected, these models excel on the target benchmarks but fail terribly to generalize to comparable unseen testing sets. We then examine task specific simple paraphrase-based safeguarding strategies to mitigate the impact of data leakage and evaluate their effectiveness and limitations. Our findings underscore three key points: (i) high leaderboard performance on limited open, static benchmarks may not reflect real-world utility; (ii) private or dynamically generated benchmarks should complement open benchmarks to maintain evaluation integrity; and (iii) a reexamination of current benchmarking practices is essential for reliable and trustworthy LM assessment.

2512.20111 2026-06-05 cs.CL cs.AI cs.LG 版本更新

ABBEL: Learning Natural-Language Belief States for Memory-Efficient Interaction

ABBEL: 为高效交互学习自然语言信念状态

Aly Lidayan, Jakob Bjorner, Satvik Golechha, Kartik Goyal, Alane Suhr

发表机构 * University of California, Berkeley(加州大学伯克利分校) Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文提出ABBEL框架,通过显式自然语言信念状态直接监督每个摘要的信息内容,以解决传统方法在生成摘要时信息丢失或更新错误的问题,从而在保持高效内存使用的同时提升交互性能。

详情
AI中文摘要

随着序列决策任务的时间范围扩大,将完整交互历史保留在模型上下文中变得越来越昂贵。最近的研究通过使用递归更新的自然语言摘要来减少上下文长度,这些摘要简洁且可解释。然而,这些方法在性能上仍低于能够访问完整上下文的智能体,表明它们未能生成足够的摘要。为此,我们提出了ABBEL,一种递归摘要框架,通过显式自然语言信念状态直接监督每个摘要的信息内容。首先,我们分析了在五个领域中由前沿模型生成的信念状态,并验证了性能通常因遗漏或错误更新信息而降低。我们还发现了一些模型使用内存低效的设置,通过保留冗余信息。我们通过两种基于强化学习的方法进行微调:信念分级,通过奖励基于信息内容的信念生成来减少更新错误;峰值信念惩罚,通过鼓励压缩内存足迹最大的信念。我们证明这些方法显著缩小了与完整上下文模型的性能差距,并使ABBEL在使用67%内存的情况下,比先前的记忆智能体工作提高了40%。我们的代码可在https://github.com/jakob-bjorner/optimal-explorer-dev获取。

英文摘要

As the time horizons of sequential decision-making tasks grow, keeping full interaction histories in model context becomes increasingly costly. Recent work reduces context lengths by instead conditioning decision-making agents on recursively updated natural-language summaries, which are concise and interpretable. However, they underperform agents with access to the full context, suggesting that they fail to generate sufficient summaries. To address this we propose ABBEL, a recursive summarization framework that isolates and directly supervises each summary's information contents in the form of explicit natural-language belief states. First, we analyze the belief states generated by frontier models under ABBEL across five domains, and verify that performance is often degraded due to omitting or incorrectly updating information. We also discover settings where models use memory inefficiently by retaining extraneous information. We target these limitations by fine-tuning with two RL-based methods: belief grading, which reduces update errors by rewarding belief generations based on their information content, and peak belief penalties, which encourage compressing the beliefs with the greatest memory footprints. We demonstrate that these methods significantly reduce the performance gap with full context models, and enable ABBEL to outperform prior memory agent work by 40% while using 67% of the memory. Our code is available at https://github.com/jakob-bjorner/optimal-explorer-dev

2512.05774 2026-06-05 cs.CV cs.AI cs.CL 版本更新

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

主动视频感知:用于代理长视频理解的迭代证据寻求

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles

发表机构 * Salesforce AI Research(Salesforce AI研究院) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 本文提出了一种主动视频感知框架AVP,通过迭代计划-观察-反思过程,主动决定视频内容的观察目标和时间,以提高长视频理解的准确性和效率。

Comments Website: https://activevideoperception.github.io/

详情
AI中文摘要

长视频理解(LVU)具有挑战性,因为回答现实世界查询往往依赖于稀疏、时间分散的线索,这些线索隐藏在数小时的大部分冗余和无关内容中。尽管代理流程提高了视频推理能力,但现有框架依赖于查询无关的描述器来感知视频信息,这浪费了计算资源并模糊了细粒度的时间和空间信息。受主动感知理论的启发,我们主张LVU代理应主动决定观察什么、何时和在哪里观察,并持续评估当前观察是否足够回答查询。我们提出了主动视频感知(AVP),一种证据寻求框架,将视频视为交互环境,并直接从像素中获取紧凑、查询相关的证据。具体而言,AVP运行一个迭代的计划-观察-反思过程,使用MLLM代理。在每个轮次中,计划者提出有针对性的视频交互,观察者执行以提取时间戳证据,反思者评估证据对查询的充分性,要么终止并给出答案,要么触发进一步观察。在五个LVU基准测试中,AVP实现了最高整体准确率,有显著提升。值得注意的是,AVP在平均整体准确率上比最佳代理方法高出5.7%,同时仅需18.4%的推理时间和12.4%的输入令牌。

英文摘要

Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, queryrelevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves highest overall accuracy with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average overall accuracy while only requires 18.4% inference time and 12.4% input tokens.

2508.10875 2026-06-05 cs.CL cs.AI cs.LG 版本更新

A Survey on Diffusion Language Models

扩散语言模型的综述

Tianyi Li, Mingda Chen, Bowei Guo, Zhiqiang Shen

发表机构 * VILA Lab, Mohamed bin Zayed University of Artificial Intelligence(维拉实验室,穆罕默德·本·扎耶德人工智能大学) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 本文综述了扩散语言模型的发展现状,探讨了其与自回归模型和掩码语言模型的关系,分析了预训练策略、后训练方法以及推理优化技术,并讨论了多模态扩展、应用场景、局限性及未来研究方向。

详情
AI中文摘要

扩散语言模型(DLMs)正迅速崛起为一种强大的替代方案,以取代主导的自回归(AR)范式。通过迭代去噪过程并行生成令牌,DLMs在减少推理延迟和捕捉双向上下文方面具有固有优势,从而实现对生成过程的精细控制。尽管实现了数倍的加速,最近的进展使DLMs在性能上与自回归模型相当,使其成为各种自然语言处理任务的有力选择。在本文综述中,我们提供了当前DLM景观的全面概述。我们追踪其演变及其与其他范式,如自回归和掩码语言模型的关系,并涵盖了基础原理和最先进模型。我们的工作提供了一个最新、全面的分类法以及对当前技术的深入分析,从预训练策略到高级后训练方法。本文的另一个贡献是全面回顾DLM推理策略和优化,包括解码并行性、缓存机制和生成质量的改进。我们还突出了DLM多模态扩展的最新方法,并阐述了它们在各种实际场景中的应用。此外,我们的讨论还讨论了DLMs的局限性和挑战,包括效率、长序列处理和基础设施需求,同时概述了未来研究方向,以维持该快速发展的领域中的进步。Project GitHub可在https://github.com/VILA-Lab/Awesome-DLMs上找到。

英文摘要

Diffusion Language Models (DLMs) are rapidly emerging as a powerful and promising alternative to the dominant autoregressive (AR) paradigm. By generating tokens in parallel through an iterative denoising process, DLMs possess inherent advantages in reducing inference latency and capturing bidirectional context, thereby enabling fine-grained control over the generation process. While achieving a several-fold speed-up, recent advancements have allowed DLMs to show performance comparable to their autoregressive counterparts, making them a compelling choice for various natural language processing tasks. In this survey, we provide a holistic overview of the current DLM landscape. We trace its evolution and relationship with other paradigms, such as autoregressive and masked language models, and cover both foundational principles and state-of-the-art models. Our work offers an up-to-date, comprehensive taxonomy and an in-depth analysis of current techniques, from pre-training strategies to advanced post-training methods. Another contribution of this survey is a thorough review of DLM inference strategies and optimizations, including improvements in decoding parallelism, caching mechanisms, and generation quality. We also highlight the latest approaches to multimodal extensions of DLMs and delineate their applications across various practical scenarios. Furthermore, our discussion addresses the limitations and challenges of DLMs, including efficiency, long-sequence handling, and infrastructure requirements, while outlining future research directions to sustain progress in this rapidly evolving field. Project GitHub is available at https://github.com/VILA-Lab/Awesome-DLMs.

2511.20107 2026-06-05 cs.CL cs.SD eess.AS 版本更新

Mispronunciation Detection and Diagnosis Without Model Training: A Retrieval-Based Approach

无需模型训练的误读检测与诊断:基于检索的方法

Huu Tuong Tu, Ha Viet Khanh, Tran Tien Dat, Vu Huan, Thien Van Luong, Nguyen Tien Cuong, Nguyen Thi Thu Trang

发表机构 * Hanoi National University of Education(河内教育大学)

AI总结 本文提出一种无需模型训练的误读检测与诊断方法,利用预训练的自动语音识别模型和检索技术,实现高准确率的发音错误检测与诊断,实验表明其在L2-ARCTIC数据集上达到69.60%的F1分数。

详情
Journal ref
ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
AI中文摘要

误读检测与诊断(MDD)对于语言学习和语音治疗至关重要。与传统方法需要评分模型或训练音素级模型不同,我们提出了一种新颖的无训练框架,利用预训练的自动语音识别模型和检索技术。我们的方法避免了音素特定建模或额外的任务特定训练,但仍能实现准确的发音错误检测与诊断。在L2-ARCTIC数据集上的实验表明,我们的方法在避免模型训练复杂性的同时,达到了69.60%的F1分数。

英文摘要

Mispronunciation Detection and Diagnosis (MDD) is crucial for language learning and speech therapy. Unlike conventional methods that require scoring models or training phoneme-level models, we propose a novel training-free framework that leverages retrieval techniques with a pretrained Automatic Speech Recognition model. Our method avoids phoneme-specific modeling or additional task-specific training, while still achieving accurate detection and diagnosis of pronunciation errors. Experiments on the L2-ARCTIC dataset show that our method achieves a superior F1 score of 69.60% while avoiding the complexity of model training.

2510.22768 2026-06-05 cs.CL 版本更新

Seeing is Believing? Evaluating Vision-Language Model Susceptibility in Agent-to-Agent Multimodal Persuasion

见多识广?评估面向Agent-to-Agent多模态说服的视觉语言模型易受性

Haoyi Qiu, Yilun Zhou, Pranav Narayanan Venkit, Kung-Hsiang Huang, Jiaxin Zhang, Nanyun Peng, Chien-Sheng Wu

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Salesforce AI Research(Salesforce AI研究)

AI总结 本文研究了在多智能体多模态说服场景中,视觉语言模型对多模态内容的易受性,提出了MMPersuade框架和数据集,通过实验揭示了多模态输入在说服中的优势,以及说服对象的领域和格式依赖性,以及心理策略在不同上下文和模型架构下的效果差异。

详情
AI中文摘要

随着自主代理越来越多地互动,它们不可避免地试图互相影响。尽管先前在纯文本环境下研究了Agent-to-Agent (A2A) 说服的动力学,但视觉语言模型 (VLMs) 的兴起带来了更复杂的挑战:多模态内容传达了更丰富的信息,同时整合了微妙且难以检测的说服线索。为了研究这种易受性,我们提出了MMPersuade,一个统一的框架和数据集用于A2A多模态说服。我们建模了说服者代理(利用图像和心理策略)与说服对象VLM之间的互动。我们的基准涵盖商业、主观和行为,以及对抗性情境,并通过功能调用评估说服,以捕捉超出口头回应的行为变化。在六个VLM上的实验揭示了三个发现:(1)多模态输入在说服中始终优于纯文本说服,原始视觉信号在对抗性情境中独特地增加易受性,通过绕过文本激活的安全防御;(2)说服对象的易受性高度依赖于领域和格式,现实和社区风格的格式在商业情境中驱动易受性,而不同格式在对抗性情境中占主导地位;(3)心理策略的有效性取决于上下文和模型架构,更强大的模型抵抗良性说服,但在对抗性多模态输入下更易受攻击。我们的框架为构建更稳健和对齐的VLMs提供了基础,以在多代理环境中使用。

英文摘要

As autonomous agents increasingly interact, they inevitably attempt to influence one another. While prior work in text-only settings has explored the dynamics of Agent-to-Agent (A2A) persuasion, the rise of Vision-Language Models (VLMs) introduces a more complex challenge: multimodal content conveys richer information while integrating subtle, hard-to-detect persuasive cues. To study this vulnerability, we present MMPersuade, a unified framework and dataset for A2A multimodal persuasion. We model interactions between a persuader agent, which leverages images and psychological strategies, and a persuadee VLM. Our benchmark spans commercial, subjective and behavioral, and adversarial contexts, and evaluates persuasion via function-calling that capture behavioral shifts beyond verbal responses. Experiments on six VLMs reveal three findings: (1) multimodal inputs consistently outperform text-only persuasion, with raw visual signals uniquely increasing susceptibility in adversarial settings by bypassing text-activated safety defenses; (2) persuadee vulnerability is highly domain- and format-dependent, with realistic and community-style formats driving susceptibility in commercial settings while different formats dominate in adversarial ones; and (3) psychological strategy efficacy varies with context and model architecture, as more capable models resist benign persuasion yet become more susceptible under adversarial multimodal inputs. Our framework provides a foundation for building more robust and aligned VLMs in multi-agent environments.

2510.17256 2026-06-05 cs.CL 版本更新

Explainability of Large Language Models: Opportunities and Challenges toward Generating Trustworthy Explanations

大型语言模型的可解释性:朝着生成可信解释的方向机遇与挑战

Shahin Atakishiyev, Housam K. B. Babiker, Jiayi Dai, Nawshad Farruque, Teruaki Hayashi, Nafisa Sadaf Hriti, Md Abed Rahman, Iain Smith, Mi-Young Kim, Osmar R. Zaïane, Randy Goebel

发表机构 * University of Alberta(阿尔伯塔大学) University of Tokyo(东京大学)

AI总结 本文探讨了大型语言模型的可解释性问题,分析了局部可解释性和机械可解释性方法,并在医疗和自动驾驶两个关键领域进行了实验研究,总结了当前可解释性领域存在的问题和未来发展方向。

详情
AI中文摘要

大型语言模型在自然语言处理的多种下游任务中表现出色。然而,人类通常无法理解语言模型如何预测下一个标记并生成内容。此外,这些模型经常在预测和推理中出现错误,即幻觉。这些错误凸显了更好地理解和解释语言模型内部运作以及如何生成预测输出的紧迫需求。受此差距的启发,本文研究了基于Transformer的大型语言模型中的局部可解释性和机械可解释性,以促进此类模型的信任。为此,本文旨在做出三个关键贡献。首先,我们综述了局部可解释性和机械可解释性方法及相关文献中的研究和见解。此外,我们描述了在医疗和自动驾驶两个关键领域进行的可解释性和推理实验,并分析了这些解释对解释接收者信任的影响。最后,我们总结了当前LLM可解释性领域未解决的问题,并概述了生成与人类一致、可信的LLM解释的机会、关键挑战和未来方向。

英文摘要

Large language models have exhibited impressive performance across a broad range of downstream tasks in natural language processing. However, how a language model predicts the next token and generates content is not generally understandable by humans. Furthermore, these models often make errors in prediction and reasoning, known as hallucinations. These errors underscore the urgent need to better understand and interpret the intricate inner workings of language models and how they generate predictive outputs. Motivated by this gap, this paper investigates local explainability and mechanistic interpretability within Transformer-based large language models to foster trust in such models. In this regard, our paper aims to make three key contributions. First, we present a review of local explainability and mechanistic interpretability approaches and insights from relevant studies in the literature. Furthermore, we describe experimental studies on explainability and reasoning with large language models in two critical domains -- healthcare and autonomous driving -- and analyze the trust implications of such explanations for explanation receivers. Finally, we summarize current unaddressed issues in the evolving landscape of LLM explainability and outline the opportunities, critical challenges, and future directions toward generating human-aligned, trustworthy LLM explanations.

2510.05709 2026-06-05 cs.CR cs.AI cs.CL 版本更新

Correcting Prompt Dependence in LLM Benchmarks: A Bayesian Hierarchical Model with Embedding-Space Clustering

纠正大语言模型基准测试中的提示依赖:一种具有嵌入空间聚类的贝叶斯分层模型

Mary Llewellyn, Isobel Thornton, James Bishop, Annie Gray

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出了一种贝叶斯分层模型,通过嵌入空间聚类来纠正大语言模型基准测试中的提示依赖问题,在数据有限的情况下提供更稳健的性能指标,并在对抗鲁棒性基准测试中实现了性能指标的显著提升。

Comments Accepted to the 1st Workshop on Combining Theory and Benchmarks, CTB@ICML 2026, Seoul, South Korea

详情
AI中文摘要

大语言模型基准测试指标经常错误地陈述性能和不确定性,因为它们依赖于两个在实践中经常不成立的假设:(i) 经典推断有足够的评估数据,和 (ii) 测试提示是独立的。我们提出了一种纠正性的贝叶斯分层模型,结合嵌入空间聚类,能够在数据有限的情况下提供稳健的性能指标,同时纠正提示依赖问题。我们将该方法应用于对抗鲁棒性基准测试,展示了聚类结构的一致恢复,从而得到更可靠的性能指标,平均绝对误差提高了4-73%,预期对数后验密度提高了40-450个单位。

英文摘要

LLM benchmarking metrics often misstate performance and uncertainty as they rely on two assumptions that frequently do not hold in practice: (i) a sufficient number of evaluations are available for classical inference, and (ii) test prompts are independent. We propose a corrective Bayesian hierarchical model with embedding-space clustering that provides robust performance metrics in limited-data settings while correcting for prompt dependence. We apply the approach to adversarial robustness benchmarks, showing consistent recovery of clustering structure, resulting in more reliable performance metrics, with 4-73% improvements to mean absolute errors and 40-450 unit improvements to expected log posterior densities.

2510.05544 2026-06-05 cs.CL cs.LG 版本更新

Activation-Informed Pareto-Guided Low-Rank Compression for Efficient LLM/VLM

基于激活信息的帕累托引导低秩压缩用于高效LLM/VLM

Ryan Solgi, Parsa Madinei, Jiayi Tian, Rupak Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang

发表机构 * University of California-Santa Barbara(加州大学圣芭芭拉分校) Amazon(亚马逊)

AI总结 本文提出了一种基于激活信息的帕累托引导低秩压缩方法,通过理论分析和算法设计,在保持模型精度的同时提升LLM和VLM的压缩效率和推理速度。

详情
AI中文摘要

大型语言模型(LLM)和视觉-语言模型(VLM)已取得最先进的性能,但在部署时带来了显著的内存和计算挑战。我们提出了一种新颖的低秩压缩框架来解决这一挑战。首先,我们通过层间激活基于的压缩误差上界来限制网络损失的变化,填补了文献中的理论空白。然后,我们将低秩模型压缩公式化为双目标优化问题,并证明单一统一容忍度会产生替代帕累托最优的异质秩。基于我们的理论洞察,我们提出帕累托引导奇异值分解(PGSVD),一种零样本流程,通过帕累托引导的秩选择和交替最小二乘实现来改进激活感知压缩。我们将PGSVD应用于LLM和VLM,显示在相同压缩水平下具有更好的准确性和推理加速。

英文摘要

Large language models (LLM) and vision-language models (VLM) have achieved state-of-the-art performance, but they impose significant memory and computing challenges in deployment. We present a novel low-rank compression framework to address this challenge. First, we upper bound the change of network loss via layer-wise activation-based compression errors, filling a theoretical gap in the literature. We then formulate low-rank model compression as a bi-objective optimization and prove that a single uniform tolerance yields surrogate Pareto-optimal heterogeneous ranks. Based on our theoretical insights, we propose Pareto-Guided Singular Value Decomposition (PGSVD), a zero-shot pipeline that improves activation-aware compression via Pareto-guided rank selection and alternating least-squares implementation. We apply PGSVD to both LLM and VLM, showing better accuracy at the same compression levels and inference speedup.

2504.10020 2026-06-05 cs.CL cs.AI cs.CV 版本更新

The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?

性能提升的幻象:为何对比解码无法减轻多模态大语言模型中的对象幻觉?

Hao Yin, Guangzong Si, Zilei Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Eastern Institute of Technology, Ningbo(宁波东部技术研究所)

AI总结 本文研究了对比解码方法在减轻多模态大语言模型(MLLMs)中对象幻觉方面的有效性,发现其性能提升主要源于两个误导性因素,挑战了对比解码策略的有效性。

详情
AI中文摘要

对比解码策略被广泛用于减少多模态大语言模型(MLLMs)中的对象幻觉。这些方法通过构建对比样本来诱导幻觉,然后在输出分布中抑制它们。然而,本文证明此类方法无法有效缓解幻觉问题。在POPE基准测试中观察到的性能提升主要由两个误导性因素驱动:(1)对模型输出分布的粗略、单向调整;(2)自适应可能性约束,将采样策略简化为贪婪搜索。为进一步说明这些问题,我们引入了一系列虚假改进方法,并将其性能与对比解码技术进行评估。实验结果揭示了对比解码中观察到的性能提升与其缓解幻觉的初衷无关。我们的发现挑战了对比解码策略有效性的常见假设,并为开发真正有效的MLLMs幻觉解决方案铺平了道路。

英文摘要

Contrastive decoding strategies are widely used to reduce object hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on POPE Benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.

2504.10823 2026-06-05 cs.CL cs.AI 版本更新

CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives

CLASH:从多个视角评估语言模型在高风险困境中的判断

Ayoung Lee, Ryan Sungmo Kwon, Peter Railton, Lu Wang

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Department of Philosophy(哲学系) University of Michigan Ann Arbor(安娜堡大学)

AI总结 本文提出CLASH数据集,用于研究基于价值观的决策过程,发现语言模型在处理矛盾决策、心理不适和价值观变化时存在显著不足。

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

在高风险领域,涉及冲突价值的困境对人类都极具挑战性,更不用说AI了。然而,先前的研究仅限于日常场景。为弥补这一差距,我们引入了CLASH(基于角色视角的LLM在高风险情境中的评估),该数据集包含345个高影响困境及3,795个不同价值观的个体视角。CLASH使研究者能够探讨关键但尚未被深入研究的价值决策过程方面,包括对决策矛盾和心理不适的理解以及角色视角中价值观的时间变化。通过基准测试14个非思考和思考模型,我们揭示了几个关键发现:(1)即使强大的专有模型,如GPT-5和Claude-4-Sonnet,也难以处理矛盾决策,仅达到24.06和51.01的准确率。(2)尽管LLMs能合理预测心理不适,但它们在涉及价值变化的视角中并不充分理解。(3)在数学解题和游戏策略领域有效的认知行为无法转移到价值推理中。相反,新的失败模式出现,包括早期承诺和过度承诺。(4)LLMs对特定价值的可引导性与其价值偏好显著相关。(5)最后,当从第三方视角推理时,LLMs表现出更高的可引导性,尽管某些价值(如安全)独特地受益于第一人称框架。

英文摘要

Navigating dilemmas involving conflicting values is challenging even for humans in high-stakes domains, let alone for AI, yet prior work has been limited to everyday scenarios. To close this gap, we introduce CLASH (Character perspective-based LLM Assessments in Situations with High-stakes), a meticulously curated dataset consisting of 345 high-impact dilemmas along with 3,795 individual perspectives of diverse values. CLASH enables the study of critical yet underexplored aspects of value-based decision-making processes, including understanding of decision ambivalence and psychological discomfort as well as capturing the temporal shifts of values in the perspectives of characters. By benchmarking 14 non-thinking and thinking models, we uncover several key findings. (1) Even strong proprietary models, such as GPT-5 and Claude-4-Sonnet, struggle with ambivalent decisions, achieving only 24.06 and 51.01 accuracy. (2) Although LLMs reasonably predict psychological discomfort, they do not adequately comprehend perspectives involving value shifts. (3) Cognitive behaviors that are effective in the math-solving and game strategy domains do not transfer to value reasoning. Instead, new failure patterns emerge, including early commitment and overcommitment. (4) The steerability of LLMs towards a given value is significantly correlated with their value preferences. (5) Finally, LLMs exhibit greater steerability when reasoning from a third-party perspective, although certain values (e.g., safety) benefit uniquely from first-person framing.

2508.20693 2026-06-05 cs.DL cs.CL 版本更新

Leveraging Large Language Models for Generating Research Topic Ontologies: A Multi-Disciplinary Study

利用大型语言模型生成研究主题本体:多学科研究

Tanay Aggarwal, Angelo Salatino, Francesco Osborne, Enrico Motta

发表机构 * Knowledge Media Institute, The Open University(开放大学知识媒体学院) The Open University(开放大学) University of Milano Bicocca(米兰比克卡大学) Department of Business and Law, University of Milano Bicocca(米兰比克卡大学商学院与法学院)

AI总结 本文研究了大型语言模型在生物医学、物理和工程学三个学科中识别研究主题语义关系的能力,通过零样本提示、链式思维提示和在现有本体上微调三种条件评估模型性能,并引入PEM-Rel-8K数据集验证跨学科迁移能力。

详情
AI中文摘要

研究领域本体和分类法对于管理和组织科学知识至关重要,因为它们有助于信息的高效分类、传播和检索。然而,创建和维护此类本体是昂贵且耗时的任务,通常需要多个领域专家的协同工作。因此,此类本体在不同学科中的覆盖程度不均,学科间连接有限,更新周期也较短。在本研究中,我们探讨了几种大型语言模型在生物医学、物理和工程学三个学科中识别研究主题间语义关系的能力。模型在三种不同的条件下进行评估:零样本提示、链式思维提示和在现有本体上微调。此外,我们通过测量模型在某一学科训练后应用到不同学科的表现,评估了微调模型的跨学科迁移能力。为了支持这项分析,我们引入了PEM-Rel-8K数据集,该数据集包含从生物医学、物理和工程学三个学科中最广泛采用的分类法中提取的超过8000个关系。我们的实验表明,将大型语言模型微调到PEM-Rel-8K上在所有学科中都表现出色。

英文摘要

Ontologies and taxonomies of research fields are critical for managing and organising scientific knowledge, as they facilitate efficient classification, dissemination and retrieval of information. However, the creation and maintenance of such ontologies are expensive and time-consuming tasks, usually requiring the coordinated effort of multiple domain experts. Consequently, ontologies in this space often exhibit uneven coverage across different disciplines, limited inter-discipline connectivity, and infrequent updating cycles. In this study, we investigate the capability of several large language models to identify semantic relationships among research topics within three academic disciplines: biomedicine, physics, and engineering. The models were evaluated under three distinct conditions: zero-shot prompting, chain-of-thought prompting, and fine-tuning on existing ontologies. Additionally, we assessed the cross-discipline transferability of fine-tuned models by measuring their performance when trained in one discipline and subsequently applied to a different one. To support this analysis, we introduce PEM-Rel-8K, a novel dataset consisting of over 8,000 relationships extracted from the most widely adopted taxonomies in the three disciplines considered in this study: MeSH, PhySH, and IEEE. Our experiments demonstrate that fine-tuning LLMs on PEM-Rel-8K yields excellent performance across all disciplines.

2508.15851 2026-06-05 cs.CL 版本更新

DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections

DocHop-QA: 向多跳推理多模态文档集合迈进

Jiwon Park, Seohyun Pyeon, Jinwoo Kim, Rina Carines Cabal, Zhenyuan He, Yihao Ding, Soyeon Caren Han

发表机构 * Pohang University of Science and Technology(釜山科学技术大学) The University of Sydney(悉尼大学) The University of Western Australia(西澳大学) The University of Melbourne(墨尔本大学)

AI总结 本文提出DocHop-QA基准,通过多模态、多文档、多跳科学问答评估多模态证据综合能力,揭示当前模型在长上下文和多证据需求下的局限性。

详情
AI中文摘要

尽管大语言模型(LLMs)在快速进步,当前QA基准仍忽视了现实世界科学信息检索的核心挑战:合成散落在多个文档和结构格式中的多模态证据。现有的QA基准范围狭窄,依赖单模态文本和短跨度推理,无法捕捉真实信息检索的复杂性。我们引入DocHop-QA,一个包含11,379个实例的基准,用于评估多模态、多文档、多跳科学QA。该基准基于公开可用的PubMed文章构建,包含文本段落、表格和布局线索,能够在没有显式超链接的情况下实现跨文档推理。为了扩展现实QA的构建,我们开发了一个基于11个科学推理概念的LLM驱动生成管道,生成多样且连贯的问题-答案对。为了突出数据集的实用性和多功能性,我们提出一个任务驱动的评估框架,涵盖四个设置,包括生成回答、多模态证据整合和结构化索引预测。实验表明,当前模型在DocHop-QA的长上下文和多证据需求下表现不佳,确立了其作为推进下一代科学QA系统严格测试平台的地位。

英文摘要

Despite rapid progress in large language models (LLMs), current QA benchmarks still overlook the core challenge of real-world scientific information seeking: synthesizing multimodal evidence scattered across multiple documents and structural formats. Existing QA benchmarks remain narrow in scope, relying on unimodal text and short-span reasoning that fail to capture the complexity of real information seeking. We introduce DocHop-QA, a benchmark of 11,379 instances for evaluating multimodal, multi-document, multi-hop scientific QA. Built from publicly available PubMed articles, DocHop-QA incorporates textual passages, tables, and layout cues, enabling cross-document inference without explicit hyperlinks. To scale realistic QA construction, we develop an LLM-driven generation pipeline grounded in 11 scientific reasoning concepts, producing diverse and coherent question-answer pairs. To highlight the utility and versatility of the dataset, we propose a task-driven evaluation framework spanning four settings, including generative answering, multimodal evidence integration, and structured index prediction. Experiments show that current models struggle with the long-context and multi-evidence demands of DocHop-QA, establishing it as a rigorous testbed for advancing next-generation scientific QA systems.

2508.00537 2026-06-05 cs.CL 版本更新

The Prosody of Emojis

表情符号的语调

Giulio Zhou, Tsz Kin Lam, Alexandra Birch, Barry Haddow

发表机构 * University of Edinburgh(爱丁堡大学) NatWest Aveni

AI总结 研究探讨了表情符号如何影响语音表达,并揭示听众如何通过语音线索恢复表情符号的含义,发现语义差异越大,语音变化越明显,表明表情符号是连接数字文本和口语表达的语调载体。

Comments ACL 26

详情
AI中文摘要

语调特征如音高、节奏和语调对于口语交流至关重要,传达情感、意图和话语结构。在基于文本的环境中,这些线索缺失,表情符号作为视觉替代品,增加了情感和语用的细微差别。本研究探讨了表情符号如何影响语音实现,并研究听众如何通过语音线索恢复表情符号的含义。与以往研究不同,我们通过受控的诱发生产任务收集人类语音数据,直接将语音和表情符号联系起来。使用贝叶斯多级模型,我们显示说话者会系统地根据表情符号线索调整语音,并且听众可以显著高于随机水平恢复意图含义。此外,我们的结果揭示了语音变化的清晰层次:表情符号之间的语义差异越大,语音变化越明显。这些发现表明,表情符号是传达语调意图的重要载体,架起了数字文本和口语表达之间的桥梁。

英文摘要

Prosodic features such as pitch, timing, and intonation are central to spoken communication, conveying emotion, intent, and discourse structure. In text-based settings, where these cues are absent, emojis act as visual surrogates that add affective and pragmatic nuance. This study examines how emojis influence prosodic realisation in speech and how listeners interpret prosodic cues to recover emoji meanings. Unlike previous work, we directly link prosody and emojis by analysing human speech data collected through a controlled elicited production task. Using Bayesian multilevel modelling, we show that speakers systematically adapt their prosody based on emoji cues, and that listeners can recover intended meanings significantly above chance. Furthermore, our results reveal a clear hierarchy in prosodic shifts: greater semantic differences between emojis correspond to increased prosodic divergence. These findings suggest that emojis are meaningful carriers of prosodic intent that bridge the gap between digital text and spoken production.

2507.15736 2026-06-05 cs.CL 版本更新

IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research

IDRBench: 理解大型语言模型在跨学科研究中的能力

Yuanhao Shen, Daniel Xavier de Sousa, Ricardo Marçal, Hongyu Guo, Xiaodan Zhu

发表机构 * GitHub

AI总结 本文研究了大型语言模型在跨学科研究中的能力,提出IDRBench框架,通过三个任务评估不同模型的跨学科知识整合能力,并为未来研究建立基准。

详情
AI中文摘要

创新是推动人类文明的重要驱动力。随着知识体系的不断扩展,跨学科领域中创新的产生变得愈发具有挑战性。最近机器学习模型,特别是大型语言模型(LLMs)的进步,为访问广泛的知识源提供了有效途径,并在推理方面展现出显著的能力,为跨学科发现提供了重要机会。我们的研究旨在理解最先进的LLMs在整合不同领域知识以进行跨学科研究(IDR)方面的能力。为了解决这一根本问题,我们引入了IDRBench,一个开创性的框架,包括数据集和评估任务:(1)跨学科论文识别,(2)跨学科思想整合,(3)跨学科思想推荐。我们对十种主流LLMs的研究提供了对其行为的全面分析,并为未来研究建立了基准和基线。据我们所知,IDRBench是首个全面调查LLMs跨学科能力的框架。

英文摘要

Innovation is a key driving force of human civilization. As the body of knowledge has grown considerably, bridging knowledge across different disciplines, where significant innovation often emerges, has become increasingly challenging. The recent advancements in machine learning models, particularly Large Language Models (LLMs), have provided effective access to extensive knowledge sources and shown impressive abilities in reasoning, rendering significant opportunities for interdisciplinary discovery. Our research aims to understand the capabilities of state-of-the-art LLMs in integrating knowledge from different fields for interdisciplinary research (IDR). To address this fundamental problem, we introduce IDRBench, a pioneering framework that includes both datasets and evaluation tasks: (1) IDR Paper Identification, (2) IDR Idea Integration, and (3) IDR Idea Recommendation. Our study on ten mainstream LLMs provides a comprehensive analysis of their behavior and establishes benchmarks and baselines for future research. To the best of our knowledge, IDRBench is the first to provide a comprehensive investigation of LLMs' IDR capability.

2502.20914 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?

Everything, Everywhere, All at Once: Is Mechanistic Interpretability Identifiable?

Maxime Méloux, Silviu Maniu, François Portet, Maxime Peyrard

发表机构 * Université Grenoble Alpes, CNRS, Grenoble INP, LIG(格勒诺布尔阿尔卑斯大学、国家科学研究中心、格勒诺布尔INP、实验室LIG)

AI总结 本文探讨了在机械可解释性(MI)框架下,给定行为是否具有唯一解释的问题,通过统计可识别性理论分析了MI解释的可识别性,并提出了两种主要策略及实验结果。

详情
Journal ref
The Thirteenth International Conference on Learning Representations (ICLR 2025)
AI中文摘要

随着AI系统应用于高风险领域,确保可解释性至关重要。机械可解释性(MI)旨在通过提取人类可理解的算法来解释神经网络的行为。本文探讨了一个关键问题:在给定行为下,根据MI的标准,是否存在唯一的解释?借鉴统计学中的可识别性,其中参数在特定假设下可以唯一推断,我们探索了MI解释的可识别性。我们识别出两种主要的MI策略:(1)“where-then-what”,通过隔离复制模型行为的电路并在之后解释它;(2)“what-then-where”,从候选算法开始,通过因果对齐搜索实现它们的神经激活子空间。我们对布尔函数和小型多层感知机测试了这两种策略,完全枚举了候选解释。实验揭示了系统性的不可识别性:多个电路可以复制行为,一个电路可以有多种解释,多个算法可以与网络对齐,一个算法可以与不同的子空间对齐。是否需要唯一性?一种务实的方法可能只需要预测性和可操作性标准。如果唯一性对理解至关重要,可能需要更严格的条件。我们还参考了内部可解释性框架,该框架通过多种标准验证解释。本文为定义AI中的解释标准做出了贡献。

英文摘要

As AI systems are used in high-stakes applications, ensuring interpretability is crucial. Mechanistic Interpretability (MI) aims to reverse-engineer neural networks by extracting human-understandable algorithms to explain their behavior. This work examines a key question: for a given behavior, and under MI's criteria, does a unique explanation exist? Drawing on identifiability in statistics, where parameters are uniquely inferred under specific assumptions, we explore the identifiability of MI explanations. We identify two main MI strategies: (1) "where-then-what," which isolates a circuit replicating model behavior before interpreting it, and (2) "what-then-where," which starts with candidate algorithms and searches for neural activation subspaces implementing them, using causal alignment. We test both strategies on Boolean functions and small multi-layer perceptrons, fully enumerating candidate explanations. Our experiments reveal systematic non-identifiability: multiple circuits can replicate behavior, a circuit can have multiple interpretations, several algorithms can align with the network, and one algorithm can align with different subspaces. Is uniqueness necessary? A pragmatic approach may require only predictive and manipulability standards. If uniqueness is essential for understanding, stricter criteria may be needed. We also reference the inner interpretability framework, which validates explanations through multiple criteria. This work contributes to defining explanation standards in AI.

2502.14145 2026-06-05 cs.CL eess.AS 版本更新

LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

基于大语言模型的全双工语音对话系统对话管理

Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu

发表机构 * Tencent AI Lab(腾讯人工智能实验室)

AI总结 本文提出一种基于大语言模型的语义语音活动检测模块,用于高效管理全双工语音对话系统的轮询,通过轻量级大语言模型实现意图和非意图打断的区分,并通过短间隔处理输入语音以实现实时决策,同时减少计算开销。

详情
AI中文摘要

在语音对话系统(SDS)中实现全双工通信需要实时协调听、说和思。本文提出一个语义语音活动检测(VAD)模块作为对话管理器(DM),用于高效管理全双工SDS中的轮询。该模块实现为一个轻量级(0.5B)大语言模型,经过全双工对话数据微调,语义VAD预测四个控制标记以调节轮询和轮询保持,区分意图和非意图打断,同时检测查询完成以处理用户停顿和犹豫。通过短间隔处理输入语音,语义VAD实现了实时决策,而核心对话引擎(CDE)仅在生成响应时被激活,从而减少计算开销。这种设计允许独立优化DM而不需重新训练CDE,平衡了交互准确性和推理效率,以实现可扩展的下一代全双工SDS。

英文摘要

Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.

2410.13056 2026-06-05 cs.CL cs.AI 版本更新

Channel-Wise Mixed-Precision Quantization for Large Language Models

通道级混合精度量化用于大语言模型

Zihan Chen, Bike Xie, Jundong Li, Cong Shen

发表机构 * Department of Electrical and Computer Engineering, University of Virginia(电气与计算机工程系,弗吉尼亚大学) Kneron Inc.(芯驰科技)

AI总结 本文提出通道级混合精度量化(CMPQ),通过根据激活分布分配不同精度级别来优化大语言模型的量化过程,从而在低比特范围内实现任意平均比特宽度,并在内存使用增加有限的情况下提升性能。

详情
AI中文摘要

大型语言模型(LLMs)在多种语言任务上表现出色,但其在边缘设备上的部署仍面临挑战,因为其大规模参数导致内存需求大。权重仅量化提供了一种减少LLM内存足迹的有希望的解决方案。然而,现有方法主要集中在整数比特量化上,限制了它们对分数比特量化任务的适应性,并阻碍了设备上可用存储空间的充分利用。在本文中,我们引入了通道级混合精度量化(CMPQ),一种新颖的混合精度量化方法,根据激活分布在通道级分配量化精度。通过将不同精度级别分配给不同的权重通道,CMPQ支持低比特范围(例如2到4比特)内的任意平均比特宽度。CMPQ采用非均匀量化策略,并结合两种异常值提取技术,共同保留关键信息,从而最小化量化损失。在九种不同LLM上的实验表明,CMPQ不仅在整数比特量化任务中提高了性能,而且通过以混合精度方式进行处理,在内存使用增加有限的情况下实现了显著的性能提升。CMPQ代表了一种适应性强且有效的LLM量化方法,在各种设备能力下提供了显著的好处。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ supports arbitrary average bit-widths in the low-bit regime (e.g., between 2 and 4 bits). CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve the critical information, thereby minimizing the quantization loss. Experiments on nine different LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage by performing in a mixed-precision way. CMPQ represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.

2407.10486 2026-06-05 cs.AI cs.CL 版本更新

IDEAL: Leveraging Infinite and Dynamic Characterizations of Large Language Models for Query-focused Summarization

IDEAL: 利用大型语言模型的无限和动态特性进行查询导向的摘要

Jie Cao, Dian Jiao, Yang Dai, Rolan Yan, Wenqiao Zhang, Siliang Tang

发表机构 * Zhejiang University(浙江大学) Tencent, Wechat(腾讯,微信)

AI总结 本文针对查询导向摘要问题,提出两种核心方法:高效细粒度查询-LLM对齐和长文档摘要,通过Query-aware HyperExpert和Query-focused Infini-attention模块实现,实验验证了方法的有效性和通用性。

详情
AI中文摘要

查询导向摘要(QFS)旨在生成回答特定问题的摘要,使用户能够更好地控制和个性化内容。随着大型语言模型(LLMs)的出现,其通过大规模预训练展现出了强大的文本理解能力,这表明了提取片段生成的巨大潜力。本文系统地研究了LLMs基于QFS模型应具备的两个不可或缺特性,即高效细粒度查询-LLM对齐和长文档摘要。相应地,我们提出了两个模块,称为Query-aware HyperExpert和Query-focused Infini-attention,以访问上述特性。这些创新为QFS技术的更广泛应用和可访问性铺平了道路。在现有QFS基准上的广泛实验表明了所提出方法的有效性和通用性。

英文摘要

Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization. The advent of large language models (LLMs), shows their impressive capability of textual understanding through large-scale pretraining, which implies the great potential of extractive snippet generation. In this paper, we systematically investigated two indispensable characteristics that the LLMs-based QFS models should be harnessed, \emph{Efficiently Fine-grained Query-LLM Alignment} and \emph{Lengthy Document Summarization}, respectively. Correspondingly, we propose two modules called Query-aware HyperExpert and Query-focused Infini-attention to access the aforementioned characteristics. These innovations pave the way for broader application and accessibility in the field of QFS technology. Extensive experiments conducted on existing QFS benchmarks indicate the effectiveness and generalizability of the proposed approach.

2406.12620 2026-06-05 cs.CL 版本更新

What Makes Two Language Models Think Alike?

是什么让两个语言模型思考相似?

Louis Jalouzot, Christophe Pallier, Emmanuel Chemla, Yair Lakretz

发表机构 * UNICOG CNRS(法国国家科学研究中心) INSERM(法国国家健康与医学研究院) CEA(法国原子能委员会) Paris-Saclay University(巴黎-萨克雷大学) LSCP(语言科学研究中心) EHESS(高等科学研究所) ENS(巴黎高等师范学校) PSL University(巴黎科学哲学大学)

AI总结 本文研究了语言模型表示和处理语言的方式是否受架构和训练差异影响,提出了一种新的方法来量化模型间相似性和差异性,并发现模型相似性主要由发布日期和模型家族决定。

Comments 25 pages, 13 figures

详情
AI中文摘要

模型的架构和训练差异是否影响它们表示和处理语言的方式?传统相似性度量只能告诉我们两个模型是否具有相似的表示几何,但无法解释原因。本文提出了一种新的、简单的方法来解决这个问题。该方法将每个模型各层的神经活动映射到一组可解释的语言特征,并量化这些特征如何驱动模型间的相似性和差异性。我们使用这种方法比较了43个语言模型,涵盖10个家族,包括解码器Transformer、状态空间模型和循环神经网络。我们发现,模型层面的相似性主要由发布日期(作为通用LLM发展的代理)和模型家族决定,表明语言签名并非主要由规模或架构类别决定。总体而言,我们的方法提供了一种将理论动机的符号描述与神经表示联系起来的方法,并可以轻易扩展到其他领域如语音和视觉,以及到其他神经系统如生物大脑。

英文摘要

Do architectural and training differences influence the way models represent and process language? Traditional similarity metrics tell us whether two models share a similar representational geometry, but they cannot explain why. Here, we propose a new, simple, approach to address this question. This approach maps neural activity in each model layer onto a set of interpretable linguistic features and quantifies how much each of them drives similarities and differences between models. We use this approach to compare 43 language models across 10 families, including decoder Transformers, State-Space Models, and Recurrent Neural Networks. We find that model-level similarity is driven most strongly by release date, a proxy for general LLM development, and model family, suggesting that linguistic signatures are not primarily shaped by scale or architecture class. Overall, our approach provides a way to link theoretically-motivated symbolic descriptions to neural representations and can readily be extended to other domains such as speech and vision, and to other neural systems such as biological brains.

2306.09712 2026-06-05 cs.LG cs.AI cs.CL 版本更新

Semi-Offline Reinforcement Learning for Optimized Text Generation

半离线强化学习用于优化文本生成

Changyu Chen, Xiting Wang, Yiqiao Jin, Victor Ye Dong, Li Dong, Jie Cao, Yi Liu, Rui Yan

发表机构 * Changyu Chen, Xiting Wang, Yiqiao Jin, Victor Ye Dong, Li Dong, Jie Cao, Yi Liu, Rui Yan(未知机构)

AI总结 本文提出了一种半离线强化学习方法,平衡了探索能力和训练成本,并在优化成本、渐近误差和过拟合误差界方面实现了最优的强化学习设置。

Comments In Proceedings of the 40th International Conference on Machine Learning (ICML 2023)

详情
AI中文摘要

在强化学习(RL)中,与环境交互有两种主要设置:在线和离线。在线方法在显著的时间成本下探索环境,而离线方法通过牺牲探索能力高效地获得奖励信号。我们提出了一种半离线RL,一种新的范式,能够从离线过渡到在线设置,平衡探索能力和训练成本,并为比较不同的RL设置提供理论基础。基于半离线公式,我们提出了在优化成本、渐近误差和过拟合误差界方面最优的RL设置。广泛实验表明,我们的半离线方法高效且在与最新方法相比时表现相当或更好。

英文摘要

In reinforcement learning (RL), there are two major settings for interacting with the environment: online and offline. Online methods explore the environment at significant time cost, and offline methods efficiently obtain reward signals by sacrificing exploration capability. We propose semi-offline RL, a novel paradigm that smoothly transits from offline to online settings, balances exploration capability and training cost, and provides a theoretical foundation for comparing different RL settings. Based on the semi-offline formulation, we present the RL setting that is optimal in terms of optimization cost, asymptotic error, and overfitting error bound. Extensive experiments show that our semi-offline approach is efficient and yields comparable or often better performance compared with state-of-the-art methods.

2110.06847 2026-06-05 cs.CL cs.CY cs.SI physics.soc-ph 版本更新

Ousiometrics: The essence of meaning aligns with a power-danger-structure framework instead of valence-arousal-dominance

Ousiometrics: 本质的意义与权力-危险-结构框架相一致,而非价值-唤醒-主导框架

P. S. Dodds, T. Alshaabi, M. I. Fudolig, J. W. Zimmerman, J. Lovato, S. Beaulieu, J. R. Minot, M. V. Arnold, A. J. Reagan, C. M. Danforth

发表机构 * Computational Story Lab, Vermont Advanced Computing Center, University of Vermont, Burlington, VT 05405, United States(计算故事实验室、佛蒙特高级计算中心、佛蒙特大学、伯灵顿,VT 05405,美国) Vermont Complex Systems Institute, MassMutual Center of Excellence for Complex Systems and Data Science, University of Vermont, Burlington, VT 05405, United States(佛蒙特复杂系统研究所、马斯穆特复杂系统和数据科学卓越中心、佛蒙特大学、伯灵顿,VT 05405,美国) Department of Computer Science, University of Vermont, Burlington, VT 05405, United States(计算机科学系、佛蒙特大学、伯灵顿,VT 05405,美国) Santa Fe Institute, 1399 Hyde Park Rd, Santa Fe, NM 87501, United States(圣达菲研究所、1399号海德公园路,圣达菲,NM 87501,美国) Howard Hughes Medical Institute, Janelia Research Campus, Ashburn, VA 20147, United States(霍华德·休斯医学研究所、贾能利亚研究校区、阿什伯恩,VA 20147,美国) Advanced Bioimaging Center, University of California Berkeley, Berkeley, CA 94720, United States(先进生物成像中心、加州大学伯克利分校、伯克利,CA 94720,美国) School of Computer and Mathematical Sciences, University of Adelaide, Adelaide, SA 5005, Australia(计算机与数学科学学院、阿德莱德大学、阿德莱德,SA 5005,澳大利亚) Computational Ethics Lab, University of Vermont, Burlington, VT 05405, United States(计算伦理实验室、佛蒙特大学、伯灵顿,VT 05405,美国)

AI总结 本文提出了一种新的意义本质描述框架GPADS,通过分析英语语料库发现,意义本质应由权力-危险-结构框架描述,并构建了ousiometer原型。

Comments 115 pages (30 page main manuscript, 85 page appendix), 82 figures (9 main, 73 appendix), 3 tables (2 main, 1 appendix)

详情
Journal ref
Science Advances, 12(9): eadr4039, 2026
AI中文摘要

从20世纪中叶以来,意义的本质被广泛接受为由价值、唤醒和主导(VAD)三个正交维度描述。这些基本维度已成为许多领域情感分析的基石。通过重新审视英语语言的第一类型和词素,并利用自动注释的直方图--ousiograms--我们发现:词语传达的意义本质最好由好-权力-攻击-危险结构环形框架(GPADS)描述;大规模英语语料库揭示了对安全、低危险词的系统偏见;并且权力-危险-结构(PDS)框架是代表基本意义的最小框架。我们发现GPADS框架与其他空间如心理状态和虚构原型之间有显著的一致性,并构建并展示了ousiometer原型。

英文摘要

From work emerging through the middle of the 20th century, the essence of meaning has become widely accepted as being described by the three orthogonal dimensions of valence, arousal, and dominance (VAD). These essential dimensions have become the cornerstone of sentiment analysis across many fields. By re-examining first types and then tokens for the English language, and through the use of automatically annotated histograms -- `ousiograms' -- we find here that: The essence of meaning conveyed by words is instead best described by a goodness-power-aggression-danger-structure circumplex framework (GPADS); that large-scale English language corpora reveal a systematic bias toward safe, low-danger words; and that the power-danger-structure (PDS) framework is the minimal framework that represents essential meaning. We find remarkable congruences between the GPADS framework and other spaces including mental states and fictional archetypes, and we construct and demonstrate a prototype ousiometer.