arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Large Language Models and Foundation Models (74 papers)

2606.07524 2026-06-09 cs.CL cs.AI 新提交

ABLE: Representing and Mapping LLMs via Attribution-Based Large-model Embedding

ABLE:基于归因的大模型嵌入表示与映射

Zirui Wang, Yusen Hou, Shaofeng Liang, Bowen Tian, Yanlin Zhang, Wenshuo Chen, Yutao Yue

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Deep Interdisciplinary Intelligence Lab (DI2 Lab)(深度跨学科智能实验室(DI2 Lab))

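AI summary: Proposes the ABLE framework, which builds model embeddings from gradient-based feature attributions and tokenizer-agnostic word-level alignment, enabling efficient comparison of heterogeneous LLMs and performing strongly on relation prediction, model routing, and benchmark score prediction.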

AI中文摘要

大语言模型(LLM)的爆炸式增长形成了一个异构且文档不完善的生态系统,使得系统性的模型比较对于来源审计、安全分析和模型选择越来越重要。现有的表示方法难以高效应对这一场景。分析内部参数的方法在架构兼容时很强大,但在结构异构下面临可扩展性障碍;而依赖外部输出的方法可能混淆具有相似行为的模型,且难以在不同分词器的更丰富输出空间中对齐。为弥合这一差距,我们提出ABLE(基于归因的大模型嵌入)框架,利用可解释性空间构建模型表示。通过基于梯度的特征归因,经由分词器无关的词级对齐进行聚合,ABLE捕获模型特定的输入敏感性模式,而不仅仅是表面输出。除经验效用外,我们提供了稳定性分析,表明在可微Transformer风格模型的标准正则性假设下,ABLE诱导出一个Lipschitz连续的参数到嵌入映射,并具有有限样本收敛保证。在239个开源LLM上的大量实验表明,我们的无训练方法在关系预测、模型路由和基准分数预测方面达到了有竞争力或更优的性能。

英文摘要

The explosive growth of large language models (LLMs) has created a heterogeneous and poorly documented ecosystem, making systematic model comparison increasingly important for provenance auditing, security analysis, and model selection. Existing representation methods struggle to address this setting efficiently. Approaches analyzing internal parameters are powerful when architectures are compatible, but face scalability barriers under structural heterogeneity, while methods relying on external outputs may conflate models with similar behaviors and are difficult to align in richer output spaces across different tokenizers. To bridge this gap, we propose ABLE (Attribution-Based Large-model Embedding), a framework that leverages the interpretability space to construct model representations. By aggregating gradient-based feature attributions via a tokenizer-agnostic word-level alignment, ABLE captures model-specific input-sensitivity patterns rather than only surface-level outputs. Beyond empirical utility, we provide a stability analysis showing that, under standard regularity assumptions for differentiable Transformer-style models, ABLE induces a Lipschitz-continuous parameter-to-embedding map with finite-sample convergence guarantees. Extensive experiments on 239 open-source LLMs demonstrate that our training-free approach achieves competitive or superior performance in relation prediction, model routing, and benchmark score prediction.
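The tokenizer-agnostic word-level alignment mentioned in the abstract can be illustrated with a minimal sketch. It assumes each token comes with a character span and a scalar attribution score, and that a token's score is split across whitespace-delimited words in proportion to character overlap. The function names and the proportional-overlap rule are illustrative assumptions, not the authors' implementation.

```python
def word_spans(text):
    """Return (start, end) character spans of whitespace-delimited words."""
    spans, start = [], None
    for i, ch in enumerate(text):
        if ch.isspace():
            if start is not None:
                spans.append((start, i))
                start = None
        elif start is None:
            start = i
    if start is not None:
        spans.append((start, len(text)))
    return spans


def align_to_words(text, token_spans, token_scores):
    """Aggregate token-level scores onto words (hypothetical helper).

    Each token's score is shared among the words it overlaps, weighted by
    the number of overlapping characters, so whitespace inside a token
    does not absorb any attribution mass.
    """
    words = word_spans(text)
    totals = [0.0] * len(words)
    for (ts, te), score in zip(token_spans, token_scores):
        overlaps = [max(0, min(te, we) - max(ts, ws)) for ws, we in words]
        covered = sum(overlaps)
        if covered == 0:
            continue  # token covers only whitespace
        for w, ov in enumerate(overlaps):
            totals[w] += score * ov / covered
    return totals


text = "model routing"
# Two hypothetical tokenizers that segment the same text differently.
coarse = align_to_words(text, [(0, 5), (5, 13)], [0.4, 0.6])
fine = align_to_words(text, [(0, 3), (3, 5), (6, 10), (10, 13)], [0.1, 0.3, 0.2, 0.4])
print(coarse, fine)
```

Both segmentations yield one score per word, so vectors from models with different tokenizers live in the same word-indexed space.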

2606.07526 2026-06-09 cs.CL cs.AI 新提交

GraphLoRA: Structure-Aware Low-Rank Adaptation for Large Language Model Recommendation

GraphLoRA: 面向大语言模型推荐的结构感知低秩适配

Lin Mu, Guoji Wang, Li Ni, Lei Sang, Zhize Wu, Peiquan Jin, Yiwen Zhang

发表机构 * Anhui University(安徽大学) Hefei University(合肥大学) University of Science and Technology of China(中国科学技术大学)

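AI summary: Proposes GraphLoRA, which embeds a trainable graph message-passing network in the low-rank adaptation pathway so that structural signals propagate, deeply integrating graph structure with textual semantics and improving LLM recommendation performance.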

Comments ACL 2026 Findings

AI中文摘要

大型语言模型(LLM)因其强大的推理和泛化能力,在推荐任务(LLMRec)中展现出巨大潜力。然而,如何有效对齐LLM建模的文本语义与协同信号仍是一个关键挑战。现有方法要么将协同信息转化为文本提示,要么将预训练嵌入注入LLM,两者都将结构信息视为静态输入,无法捕获高阶关系依赖。为弥合这一差距,我们提出GraphLoRA,一种新颖的框架,将低秩适配从独立传播推广到结构感知传播。GraphLoRA在低秩适配路径中嵌入一个可训练的图消息传递网络,使结构信号能够在参数空间中传播。该设计允许协同拓扑显式指导参数更新,促进图结构与文本语义信息的深度融合。在多个基准上的大量实验表明,GraphLoRA不仅优于最先进的基于LLM的推荐方法,而且实现了卓越的泛化能力,有效平衡了结构推理能力与计算效率。代码可在https://github.com/wgj15965/GraphLoRA获取。

英文摘要

Large Language Models (LLMs) have shown strong potential for recommendation (LLMRec) due to their powerful reasoning and generalization abilities. However, effectively aligning the textual semantics modeled by LLMs with the collaborative signals remains a key challenge. Existing methods either translate collaborative information into textual prompts or inject pre-trained embeddings into the LLM, both of which treat structural information as static input and fail to capture high-order relational dependencies. To bridge this gap, we propose GraphLoRA, a novel framework that generalizes low-rank adaptation from independent to structure-aware propagation. GraphLoRA embeds a trainable graph message-passing network within the low-rank adaptation pathway, enabling structural signals to propagate through the parameter space. This design allows collaborative topology to explicitly guide parameter updates, fostering deep integration between graph-structured and textual semantic information. Extensive experiments on multiple benchmarks demonstrate that GraphLoRA not only outperforms state-of-the-art LLM-based recommendation methods but also achieves superior generalization, effectively balancing structural reasoning capability with computational efficiency. Code is available at https://github.com/wgj15965/GraphLoRA

2606.07527 2026-06-09 cs.CL cs.AI cs.LG 新提交

Post-training is (Massive) Supervised Learning

后训练是(大规模)监督学习

Michael Hassid, Yossi Adi, Roy Schwartz

发表机构 * FAIR, Meta AI(Meta AI 基础人工智能研究团队) The Hebrew University of Jerusalem(耶路撒冷希伯来大学)

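AI summary: Argues that the current LLM post-training phase (SFT + RL) is in effect a return to the BERT-era "pre-train then fine-tune" paradigm, shows experimentally that models post-trained from scratch also reach notable performance, and proposes shifting toward training in which models "learn how to learn".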

AI中文摘要

训练LLM的主流范式已演变为依赖包含SFT和RL的大规模后训练阶段。在这篇立场论文中,我们认为这种方法实际上标志着回归到BERT时代的“预训练然后微调”方法,明确地使模型适应期望的行为和评估所用的特定基准。我们首先回顾LLM的历史,描述LLM演化的不同阶段。我们认为当前格局与LLM早期惊人地相似,那时任务性能严重依赖于将模型拟合到分布内数据集。为了实证证明这一点,我们比较了预训练模型和随机初始化模型,在现代推理数据集上对两种变体进行微调,并在竞争性数学和代码基准上评估它们。我们表明,从头开始后训练的模型产生了高度非平凡的性能。我们的发现表明,当前的后训练方法主要作为分布拟合机制发挥作用。最后,我们提出,开发通用能力的模型和系统需要超越针对预定义行为的广泛后训练,转而采用模型“学会如何学习”的训练过程。

英文摘要

The prevailing paradigm for training LLMs has evolved to rely on a massive post-training phase consisting of SFT and RL. In this position paper, we argue that this methodology effectively marks a reversion to the "pre-train then fine-tune" approach of the BERT era, explicitly tailoring models to the desired behaviors and specific benchmarks on which they are evaluated. We begin with a historical overview of LLMs, describing the different phases of the LLM evolution. We argue that the current landscape is remarkably similar to the early days of LLMs, where task performance heavily relied on fitting the models to in-distribution datasets. To empirically demonstrate this, we compare pre-trained models to randomly initialized ones, by fine-tuning both variants on modern reasoning datasets and evaluating them on competitive math and code benchmarks. We show that models post-trained from scratch yield highly non-trivial performance. Our findings suggest that current post-training methodologies function primarily as a distribution-fitting mechanism. We finish by positing that developing generally capable models and systems requires moving beyond extensive post-training for predefined behaviors, shifting instead toward training procedures where models "learn how to learn".

2606.07559 2026-06-09 cs.CL cs.AI quant-ph 新提交

Phantom transitions in language model fine-tuning

语言模型微调中的幻影相变

Vaibhav Prakash, Jayasri Dontabhaktuni

发表机构 * Mahindra University(马恒达大学)

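AI summary: Studies how language model fine-tuning fails when the correct completion competes with a near-synonym; an order parameter decomposed into a signal and a background drag reveals two failure modes, and the apparent phase transitions turn out to be phantoms that arise from the softmax readout rather than from a geometric phase transition.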

Comments 26 pages, 9 figures

AI中文摘要

在上下文中微调语言模型,当正确补全存在近义词竞争者时,常常无声地失败。交叉熵损失单调递减,而正确token在排名上从未超越竞争者。我们研究了跨越两个系列和五倍参数范围的五种Transformer架构,在十个精心挑选的近义词上下文中。我们用一个结合预测分布和成对嵌入重叠的序参量来测量这些失败。它可加性地分解为一个信号(跟踪模型对正确token相对于其最近竞争者的承诺)和一个背景拖拽(由嵌入整体向分数泄漏概率的方式决定)。这分离出两种失败模式:运动学失败中信号保持较小;结构失败中拖拽随着微调进行而主动恶化。我们观察到序参量中类似相变的弹弓状跳跃。一个核心负面结果组织了本文:这些相变是幻影。直接测量排除了自发对称破缺的解释。在LoRA微调下,当token嵌入矩阵在训练期间完全不变时,弹弓状跳跃仍然出现,而此处不可能存在几何相变。不连续性完全存在于softmax读出中。少量无量纲量组织跨架构的轨迹。其中一个在所有五种架构的全微调下保持一致。第二个根据整体嵌入分布将架构分为两类,并预测LoRA的充分性。作为盲测,该框架预测了一个未用于拟合任何参数的保留架构的临界学习率,与后续学习率扫描的误差在2.1%以内。研究结果仅涉及近义词机制,未经重新校准不应外推。

英文摘要

Fine-tuning a language model on contexts whose correct completion has a near-synonym competitor often fails silently. The cross-entropy loss decreases monotonically while the correct token never overtakes the competitor in rank. We study this regime across five transformer architectures spanning two families and a fivefold parameter range, on ten hand-selected near-synonym contexts. We instrument these failures with an order parameter combining the predicted distribution and pairwise embedding overlaps. It decomposes additively into a signal, tracking the model's commitment to the correct token over its nearest competitor, and a background drag, set by how the embedding bulk leaks probability into the score. This isolates two failure modes. In kinematic failure the signal stays small. In structural failure the drag actively worsens as fine-tuning proceeds. We observe sharp catapult-like jumps in the order parameter that resemble a phase transition. A central negative result organises the paper. The transitions are phantoms. The spontaneous-symmetry-breaking interpretation is ruled out by direct measurement. Catapult-like jumps still appear under LoRA fine-tuning with the token embedding matrix exactly unchanged during training, where no geometric phase transition is possible. The discontinuity lives entirely in the softmax readout. A small number of dimensionless quantities organise the trajectory across architectures. One is consistent across all five under full fine-tuning. A second sorts architectures into two classes by bulk embedding distribution and predicts LoRA sufficiency. As a blind test, the framework predicts the critical learning rate of a held-out architecture, not used to fit any parameter, to within 2.1% of a subsequent learning-rate sweep. Findings concern the near-synonym mechanism only and should not be extrapolated without recalibration.

2606.07560 2026-06-09 cs.CL cs.LG 新提交

Function-Vector Heads Are Two Populations: Writers and Cancellers in In-Context Learning

函数向量头是两个群体:上下文学习中的写入者和取消者

Han-yu Wang

发表机构 * The University of Hong Kong(香港大学)

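AI summary: Finds that function-vector heads are not a homogeneous population but split into writers and cancellers, which push the rule-correct logit up and down respectively, and that magnitude-only ranking cannot tell the two apart.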

AI中文摘要

函数向量头(Todd et al., 2024)通常通过其对上下文规则任务的因果贡献幅度来识别,隐含假设顶级集合是同一功能类。这一假设不成立。我们用保留符号的标准(改进的DLA + 置换FDR)替代仅幅度排名,并通过路径修补验证每个候选。然后,FV头群体分裂为两个对立的子群体:写入者推高规则正确logit;取消者压低它。一个四条件规范判定在三个模型家族和六个Pythia规模的13/15个单元中成立,符号置换检验在5/6个主要单元中拒绝同质性。仅幅度排名无法看到这种结构:Todd的前20个在层次任务中捕获了64%的取消者但仅4%的写入者,在模块任务中捕获了59%的写入者但仅8%的取消者。我们在所有27个(取消者,单元,头)对上排除了六种人为解释:归纳重叠、汇点、通用重要性、秩1复制抑制、V级联和最近邻非FV控制。零消融取消者在6/6个主要单元中产生+0.13到+0.29 nats的logit增益,方向一致地带来+2到+7个百分点的准确率提升。

英文摘要

Function-vector (FV) heads (Todd et al., 2024) are typically identified by the magnitude of their causal contribution to in-context rule tasks, under the implicit assumption that the top set is a homogeneous functional class. This assumption fails. We replace magnitude-only ranking with a sign-preserving criterion (refined DLA + permutation FDR) and validate each candidate by path patching. The FV head population then splits into two opposing sub-populations: writers push the rule-correct logit up; cancellers push it down. A four-condition canonical verdict holds in 13/15 cells across three model families and six Pythia scales, and a sign-shuffle rejects homogeneity in 5/6 main cells. The structure is invisible to magnitude-only ranking: Todd's top-20 captures 64% of cancellers but only 4% of writers on the hierarchical task, and 59% of writers but only 8% of cancellers on the modular task. We rule out six artefact accounts on all 27 canceller (cell, head) pairs: induction overlap, sinks, generic importance, rank-1 copy-suppression, V-cascade, and rank-nearest non-FV controls. Zero-ablating cancellers yields +0.13 to +0.29 nats of logit gain in 6/6 main cells with a directionally consistent +2 to +7 pp accuracy effect.
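The contrast between magnitude-only ranking and a sign-preserving split can be illustrated with a toy sketch. The head names and effect values below are made up for illustration, and the split uses a plain threshold on the signed effect, a deliberate simplification that stands in for the refined DLA, permutation FDR, and path-patching steps named in the abstract.

```python
def top_k_by_magnitude(effects, k):
    """Rank heads by |signed effect| and keep the top k (magnitude-only view)."""
    return sorted(effects, key=lambda h: abs(effects[h]), reverse=True)[:k]


def split_by_sign(effects, min_abs=0.0):
    """Split heads into writers (positive effect) and cancellers (negative effect).

    min_abs is a hypothetical threshold that drops near-zero effects.
    """
    writers = sorted(h for h, e in effects.items() if e > min_abs)
    cancellers = sorted(h for h, e in effects.items() if e < -min_abs)
    return writers, cancellers


# Hypothetical signed contributions to the rule-correct logit, in nats.
effects = {"L5H1": 0.30, "L7H3": -0.28, "L9H0": 0.05, "L9H4": -0.22, "L11H2": 0.01}

top3 = top_k_by_magnitude(effects, 3)
writers, cancellers = split_by_sign(effects, min_abs=0.04)
print(top3, writers, cancellers)
```

In this toy data the top-3 set by magnitude mixes one writer with two cancellers, which is the kind of structure a magnitude-only ranking cannot expose.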

2606.07818 2026-06-09 cs.CL cs.NE 新提交

Representational Similarity and Model Behavior in Multi-Agent Interaction

多智能体交互中的表征相似性与模型行为

Yujin Potter, Seun Eisape, Shiyang Lai, Alexander Huth, James Evans, Been Kim, Jacob Eisenstein, Dawn Song, Alane Suhr

发表机构 * University of Washington(华盛顿大学)

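AI summary: Studies how representational similarity between LLM pairs affects cooperation and innovation, finding that high similarity promotes cooperation but reduces novelty, with early-layer similarity showing the strongest association.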

Comments ICML 2026

AI中文摘要

研究人员已经表明,人类之间的神经相似性可以预测社会亲密度和合作成功,而创新往往源于不同个体之间的互动。我们通过考察大型语言模型之间的交互来研究这些原理是否适用于人工智能。在我们的实验中,276个模型对在涵盖合作和新颖性的八个游戏中互动。我们发现,具有更相似表征空间的对实现了显著更高的合作,但表现出较低的新颖性和创造力。即使控制了其他因素(如性能差异和模型大小),表征相似性对合作和新颖性的影响仍然稳健。我们还发现,与中间层和后期层相比,早期层的相似性与合作和新颖性的关联始终最强。这表明这些模式背后的一个核心因素可能是两个模型共享词汇和语义基础的程度。总体而言,表征相似性可能是多智能体系统设计中的一个重要考虑因素。

英文摘要

Researchers have shown that neural similarity among humans predicts social closeness and cooperative success, whereas innovation often emerges from interactions among dissimilar individuals. We investigate whether these principles extend to artificial intelligence by examining interactions between large language models. In our experiments, 276 model pairs interact across eight games spanning both cooperation and novelty. We find that pairs with more similar representation spaces achieve significantly higher cooperation but exhibit reduced novelty and creativity. The effects of representational similarity on cooperation and novelty remain robust even after controlling for other factors such as performance disparity and model size. We also find that similarity in the early layers consistently shows the strongest association with cooperation and novelty, compared to the middle and later layers. This suggests that a central factor underlying these patterns could be the extent to which the two models share lexical and semantic grounding. Overall, representational similarity can be an important consideration in multi-agent system design.

2606.07978 2026-06-09 cs.CL 新提交

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

MechLens:事实知识的晚期结晶解释语言模型中的干预有效性

Xueping Gao

发表机构 * Alibaba Cloud(阿里云)

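AI summary: Finds that factual knowledge in LLMs "crystallizes" abruptly at the final layers rather than emerging layer by layer, and on that basis proposes a crystallization-guided intervention principle that outperforms existing methods.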

AI中文摘要

理解LLM存储事实知识的位置对于减少幻觉至关重要。我们系统量化了“晚期结晶”:事实知识并非逐层涌现,而是在最后层突然“结晶”。在五个模型家族(Pythia、Gemma、Qwen2.5、Llama-3.1、Mistral;0.5–14B)中,26.8%–93.4%的正确答案从未在任何中间层进入前10预测,且晚期涌现(>80%深度)在不同架构中一致。跨尺度(Qwen2.5-14B)和跨基准(MMLU:98.2%)结果证实了普遍性;调谐透镜排除了探针伪影。情感分类对照(Qwen为0.5% vs. 事实85.9%;Mistral为2.0% vs. 26.8%)确认该现象是事实回忆特有的。晚期结晶引出了结晶引导的干预原则:CAA在中等结晶模型(Llama、Mistral)上优于DoLa(p<0.001),在高结晶模型Qwen上方向一致反转(+25.4% vs. +15.5% MC1,p=0.069)。LayerNorm消融表明结晶是残差流固有的;LN缩放(x1.2)在零推理开销下带来+11.8% MC1提升。我们进一步揭示了可计算性-记忆谱:可计算知识比记忆事实更早结晶(层22.1/28 vs. 28.0/28)。我们发布了支持五个模型家族的MechLens。

英文摘要

Understanding where LLMs store factual knowledge is critical for hallucination mitigation. We systematically quantify Late Crystallization: factual knowledge does not gradually emerge across layers but "crystallizes" abruptly at the final layers. Across five model families (Pythia, Gemma, Qwen2.5, Llama-3.1, Mistral; 0.5–14B), 26.8%–93.4% of correct answers never enter top-10 predictions at any intermediate layer, with late emergence (>80% depth) consistent across architectures. Cross-scale (Qwen2.5-14B) and cross-benchmark (MMLU: 98.2%) results confirm generality; tuned lens rules out probe artifacts. A sentiment-classification control (0.5% for Qwen vs. 85.9% factual; 2.0% for Mistral vs. 26.8%) confirms the phenomenon is specific to factual recall. Late Crystallization yields a crystallization-guided intervention principle: CAA outperforms DoLa on moderate-crystallization models (Llama, Mistral; p<0.001), with a directionally consistent reversal on high-crystallization Qwen (+25.4% vs. +15.5% MC1, p=0.069). LayerNorm ablation shows crystallization is intrinsic to the residual stream; LN scaling (×1.2) yields +11.8% MC1 with zero inference overhead. We further reveal a Computability-Memorization Spectrum: computable knowledge crystallizes earlier (layer 22.1/28) than memorized facts (28.0/28). We release MechLens supporting five model families.
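The headline statistic (the share of correct answers that never enter the top-10 at any intermediate layer) can be computed from per-layer ranks of the correct token. The sketch below assumes 1-indexed ranks obtained from some layer-wise readout such as a logit lens, with the last entry being the final layer. The function names and toy ranks are illustrative and are not taken from the MechLens release.

```python
def never_in_top_k(ranks_per_layer, k=10):
    """True if the correct token's rank exceeds k at every intermediate layer."""
    intermediate = ranks_per_layer[:-1]  # last entry is the final layer
    return all(rank > k for rank in intermediate)


def first_emergence_depth(ranks_per_layer, k=10):
    """Relative depth in (0, 1] of the first layer with rank <= k, else None."""
    n_layers = len(ranks_per_layer)
    for layer, rank in enumerate(ranks_per_layer, start=1):
        if rank <= k:
            return layer / n_layers
    return None


def late_crystallization_rate(examples, k=10):
    """Fraction of correctly answered examples that never reach the top k early.

    An example counts as correct when its final-layer rank is 1.
    """
    correct = [ranks for ranks in examples if ranks[-1] == 1]
    if not correct:
        return 0.0
    return sum(never_in_top_k(ranks, k) for ranks in correct) / len(correct)


# Toy ranks for three prompts across five layers (hypothetical values).
examples = [
    [500, 300, 120, 40, 1],   # correct, appears only at the final layer
    [50, 8, 3, 1, 1],         # correct, emerges early
    [900, 700, 650, 400, 7],  # final answer wrong, excluded from the rate
]
rate = late_crystallization_rate(examples)
print(rate)
```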

2606.08295 2026-06-09 cs.CL 新提交

TLRD: Teaching LLMs to Reason over Tabular Data with Tri-Level Rationale Distillation

TLRD: 通过三级理由蒸馏教授LLMs在表格数据上进行推理

Tianyuan Liang, Xuwei Tan, Lei Shi, Junsheng Zhong, Ziyu Hu, Tian Xie, Zhiqun Zuo, Xiaodong Yu, Xueru Zhang

发表机构 * The Ohio State University(俄亥俄州立大学) Stevens Institute of Technology(史蒂文斯理工学院)

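AI summary: Proposes the TLRD framework, which uses tri-level rationale distillation to convert tabular datasets into structured rationale supervision, so that LLMs achieve zero-overhead prediction and interpretable reasoning from raw features alone, significantly narrowing the performance gap with tree ensembles.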

AI中文摘要

表格数据是存储现实世界信息的主要媒介,驱动着机器学习的许多工业应用。传统预测器实现了强大的预测性能,但不提供决策所必需的可读、案例特定的解释。大型语言模型(LLMs)可以通过生成预测和解释来自然弥合这一差距。然而,数据集特定的模式(如特征分布和交互)使LLMs难以理解和推理表格数据,而仅标签微调在提高性能的同时会导致灾难性遗忘。为了解决这个问题,我们提出了三级理由蒸馏(TLRD),一个将仅标签表格数据集转换为LLMs的结构化理由监督的框架。TLRD使用高容量教师模型,基于三个互补的证据级别(实例级特征、数据集级分布上下文和比较级检索邻居)合成理由语料库,然后将理由蒸馏到学生LLMs中,从而仅从原始特征实现零开销预测和基于理由的解释。在多个领域数据集上的实验表明,TLRD显著缩小了LLMs与最先进树集成模型之间的性能差距,同时产生基于理由且可读的解释,为高风险决策提供了有价值的参考。

英文摘要

Tabular data is a primary medium for storing real-world information, driving many industrial applications of machine learning. Traditional predictors achieve strong predictive performance but do not provide readable, case-specific explanations essential for decision-making. Large Language Models (LLMs) can naturally bridge this gap by generating predictions alongside explanations. However, dataset-specific patterns, such as feature distributions and interactions, make tabular data difficult for LLMs to understand and reason over, while label-only fine-tuning improves performance at the cost of catastrophic forgetting. To address this problem, we propose Tri-Level Rationale Distillation (TLRD), a framework that converts label-only tabular datasets into structured rationale supervision for LLMs. TLRD uses a high-capacity teacher to synthesize a rationale corpus grounded in three complementary levels of evidence: instance-level features, dataset-level distributional context, and comparison-level retrieved neighbors. It then distills the rationales into student LLMs, enabling zero-overhead prediction and grounded explanation from raw features only. Experiments on multiple domain datasets show that TLRD significantly closes the performance gap between LLMs and state-of-the-art tree ensembles while producing grounded and readable explanations, offering a valuable reference for high-stakes decision-making.

2606.08327 2026-06-09 cs.CL cs.AI cs.LG 新提交

Chiaroscuro Attention: Spending Compute in the Dark

明暗对比注意力:在黑暗中投入计算

Prateek Kumar Sikdar

发表机构 * Accenture(埃森哲)

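AI summary: Proposes CHIAR-Former, a hybrid Transformer with spectral-entropy routing in which DCT spectral mixing complements full attention, reaching PPL 36.54 on WikiText-103 with 62.5% fewer attention FLOPs, a 45% improvement over the full-attention baseline.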

Comments 8 pages, 6 figures, 3 tables

AI中文摘要

标准Transformer在每一层和每个标记上统一应用自注意力,无论输入是否需要动态的跨标记交互。我们提出CHIAR-Former(明暗对比注意力),一种4层混合Transformer,它基于每个标记的谱熵(一种理论上合理的复杂度信号)将每个标记路由到三个算子之一:DCT谱混合、RBF核混合或全自注意力。通过在WikiText-103上的系统消融,我们发现路由崩溃:路由器持续拒绝RBF而偏向DCT和注意力,表明谱混合和动态注意力是互补且充分的。一个专门设计的仅DCT+注意力变体在WikiText-103上达到验证集PPL 36.54——相比全注意力基线(PPL 66.62)提升45%,同时减少62.5%的注意力FLOPs。我们将评估扩展到WikiText-2、IMDB情感分类和合成ListOps操作,建立了一个清晰的操作区间:CHIAR-Former在大型自然文本上表现出色,其中标记多样性支持谱专门化,而全注意力在小数据集和合成模式匹配任务上仍保持优势。这些发现——无论是成功还是失败——共同定义了谱路由何时以及为何值得使用。

英文摘要

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators - DCT spectral mixing, RBF kernel mixing, or full self-attention - based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103 - a 45% improvement over a full-attention baseline (PPL 66.62) at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings - both the wins and the losses - together define when and why spectral routing earns its keep.
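A per-token spectral entropy signal of the kind the abstract describes can be sketched as the Shannon entropy of a token vector's normalised DCT power spectrum. The sketch below makes several assumptions: the DCT is taken over the feature dimension, the router is a fixed threshold rather than a learned module, and low-entropy tokens go to DCT mixing while high-entropy tokens go to attention. None of these choices is confirmed by the abstract.

```python
import math


def dct2(x):
    """Unnormalised DCT-II of a sequence of floats (O(n^2), for illustration)."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def spectral_entropy(x):
    """Shannon entropy (nats) of the normalised DCT power spectrum of x."""
    power = [c * c for c in dct2(x)]
    total = sum(power)
    if total == 0.0:
        return 0.0
    probs = [p / total for p in power]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route(x, threshold=1.0):
    """Hypothetical hard router: threshold value is an arbitrary assumption."""
    return "dct" if spectral_entropy(x) < threshold else "attention"


smooth = [1.0] * 8           # energy concentrated in one DCT coefficient
spiky = [1.0] + [0.0] * 7    # energy spread over many coefficients
print(route(smooth), route(spiky))
```

A constant vector puts almost all of its energy in the zero-frequency coefficient, so its entropy is close to zero, while an impulse spreads energy across frequencies and scores much higher.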

2606.08346 2026-06-09 cs.CL cs.LG 新提交

CATPO: Critique-Augmented Tree Policy Optimization

CATPO: 批评增强的树策略优化

Ayush Singh, Umang Goyal, Ankur Dahiya

发表机构 * Indian Institute of Technology Roorkee(印度理工学院罗尔基分校) Vision and Language Group(视觉与语言组)

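AI summary: Proposes CATPO, which uses a tree informativeness score and critique-guided healing to address the compute wasted on uninformative trees in tree-structured reinforcement learning, improving accuracy on mathematical reasoning tasks.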

Comments 14 pages, 1 figure, 6 tables

AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为提升大语言模型(LLM)推理能力的主流范式。最近的基于树的方法(如TreeRPO)通过树结构展开扩展了平坦轨迹采样,无需单独的奖励模型即可获得密集的步级奖励信号。然而,并非所有树都具有相同的信息量:所有叶子成功、所有叶子失败或策略已预测出奖励分布的树对梯度更新贡献甚微,浪费计算资源。我们提出CATPO(批评增强的树策略优化),在树级别诊断并解决这一浪费问题。CATPO首先通过树信息性分数F(T)对每棵树进行评分,该分数结合了叶子结果多样性和策略-奖励去相关性,且无需额外计算。对于所有分支均失败的“全错”树,CATPO应用批评引导修复:定位最浅的失败点,生成自然语言批评,并嫁接精炼的延续以恢复训练信号。最后,信息性加权损失通过归一化分数缩放每棵树的梯度贡献,将参数更新集中在最具信息性的树上,同时保持整体梯度幅度。在MATH数据集上训练的Qwen2.5-Math-1.5B上的实验表明,CATPO在四个基准(AIME24、MATH-500、OlympiadBench和MinervaMath)上实现了37.5%的宏平均准确率,比TreeRPO提高1.9%,比GRPO提高4.8%。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampling with tree-structured rollouts to obtain dense, step-level reward signals without a separate process reward model. However, not all trees are equally informative: trees where all leaves succeed, all leaves fail, or the policy already predicts the reward distribution contribute little to gradient updates, wasting compute. We introduce CATPO (Critique-Augmented Tree Policy Optimization), which diagnoses and addresses this waste at the tree level. CATPO first scores each tree via a tree informativeness score, F(T), combining leaf-outcome diversity with policy-reward decorrelation at zero extra compute. For dead-wrong trees where all branches fail, CATPO applies critique-guided healing: it locates the shallowest failure point, generates a natural-language critique, and grafts refined continuations to recover training signal. Finally, an informativeness-weighted loss scales each tree's gradient contribution by its normalized score, concentrating parameter updates on the most informative trees while preserving overall gradient magnitude. Experiments on Qwen2.5-Math-1.5B trained with the MATH dataset show that CATPO achieves 37.5% macro accuracy across four benchmarks (AIME24, MATH-500, OlympiadBench, and MinervaMath), improving over TreeRPO by 1.9% and GRPO by 4.8%.
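The informativeness-weighted loss can be illustrated with a simplified stand-in for F(T). The sketch below scores a tree only by leaf-outcome diversity, using 4p(1-p) where p is the leaf success rate, and omits the policy-reward decorrelation term because the abstract does not give its form. Weights are normalised to mean 1 so the total gradient scale is unchanged. Both the diversity formula and the mean-1 normalisation are assumptions.

```python
def leaf_diversity(leaf_rewards):
    """Diversity proxy in [0, 1]: 0 when all leaves agree, 1 at a 50/50 split."""
    p = sum(leaf_rewards) / len(leaf_rewards)
    return 4.0 * p * (1.0 - p)


def normalized_weights(scores, eps=1e-8):
    """Scale scores so they average to 1, preserving overall loss magnitude."""
    mean = sum(scores) / len(scores)
    if mean < eps:
        return [1.0] * len(scores)  # no informative tree: fall back to uniform
    return [s / mean for s in scores]


def weighted_loss(tree_losses, weights):
    """Mean of per-tree losses, each scaled by its informativeness weight."""
    return sum(w * l for w, l in zip(weights, tree_losses)) / len(tree_losses)


# Binary leaf rewards for four hypothetical trees.
trees = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0], [1, 0, 0, 0]]
scores = [leaf_diversity(t) for t in trees]
weights = normalized_weights(scores)
print(scores, weights)
```

All-success and all-failure trees receive weight 0 here; in the paper's pipeline the all-failure case is instead handled by critique-guided healing before weighting, which this sketch does not model.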

2606.08347 2026-06-09 cs.CL cs.LG 新提交

Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

张量化Engram:在N-gram嵌入中共享潜在变量对大型语言模型有益

Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Yuning Qiu, Qibin Zhao, Danilo Mandic

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

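AI summary: Proposes Tensorized Engram (TN-gram), which compresses n-gram embeddings through shared factors in a CP decomposition, reducing parameters and avoiding hash collisions while matching or exceeding existing methods across multiple tasks.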

AI中文摘要

现代语言模型使用离散的token级嵌入表示文本,这迫使重复的多token模式必须在Transformer层中隐式学习。过度token化的Transformer和Engram都试图通过显式引入多token(n-gram)记忆来解决这一限制。然而,它们为每个n-gram阶数使用单独的哈希表,这引入了哈希冲突并阻止嵌套的n-gram共享底层潜在结构。为了解决这些问题,我们提出了张量化Engram(TN-gram),一种紧凑的记忆模块,通过Canonical Polyadic(CP)形式中的共享因子表示张量化的n-gram嵌入。TN-gram学习共享的token-位置因子以及阶数吸收向量,以编码不同n-gram阶数的嵌入。综合实验表明,TN-gram在需要更少参数的情况下,匹配甚至超越了Engram风格的n-gram模块。

英文摘要

Modern language models represent text using discrete token-level embeddings, which forces recurring multi-token patterns to be learned implicitly across Transformer layers. Both Over-tokenized Transformers and Engram attempt to address this limitation by explicitly incorporating multi-token (n-gram) memories. However, they rely on separate hash tables for each n-gram order, which introduces hash collisions and prevents nested n-grams from sharing the underlying latent structures. To address these issues, we propose Tensorized Engram (TN-gram), a compact memory module that represents tensorized n-gram embeddings through shared factors in the Canonical Polyadic (CP) form. TN-gram learns shared token-position factors together with order-absorption vectors to encode the embeddings of different n-gram orders. Comprehensive experiments demonstrate that TN-gram matches or even outperforms Engram-style n-gram modules while requiring far fewer parameters.
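A CP-style factorisation of n-gram embeddings can be sketched as an element-wise product of per-position token factors in a rank-R space, scaled by a per-order vector. The layout below (one factor table per position, one order vector per n-gram order, no output projection) is an assumed reading of "shared token-position factors" and "order-absorption vectors", not the paper's exact parameterisation.

```python
import random


def make_factors(vocab, max_order, rank, seed=0):
    """Randomly initialise shared factors (stand-ins for learned parameters)."""
    rng = random.Random(seed)
    # token_pos[i][t] is the rank-R factor row for token t at position i.
    token_pos = [
        [[rng.uniform(-1, 1) for _ in range(rank)] for _ in range(vocab)]
        for _ in range(max_order)
    ]
    # order_vec[n - 1] rescales the rank components for n-grams of order n.
    order_vec = [[rng.uniform(-1, 1) for _ in range(rank)] for _ in range(max_order)]
    return token_pos, order_vec


def ngram_embedding(tokens, token_pos, order_vec):
    """CP-form embedding: order vector times the product of position factors."""
    out = list(order_vec[len(tokens) - 1])
    for i, t in enumerate(tokens):
        out = [o * f for o, f in zip(out, token_pos[i][t])]
    return out


def cp_param_count(vocab, max_order, rank):
    return max_order * vocab * rank + max_order * rank


def full_table_param_count(vocab, max_order, dim):
    """Size of a collision-free table with one row per possible n-gram."""
    return sum(vocab ** n for n in range(1, max_order + 1)) * dim


token_pos, order_vec = make_factors(vocab=5, max_order=3, rank=4)
bigram = ngram_embedding([1, 2], token_pos, order_vec)
trigram = ngram_embedding([1, 2, 3], token_pos, order_vec)
print(cp_param_count(5, 3, 4), full_table_param_count(5, 3, 4))
```

The bigram and the trigram that extends it reuse the same position-0 and position-1 factor rows, which is one way nested n-grams can share latent structure. Every n-gram is indexed directly, so no hashing is involved. The full-table count is shown only as a scale reference, since hash-based modules use smaller tables at the cost of collisions.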

2606.08471 2026-06-09 cs.CL cs.AI 新提交

More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs

更多废话,更少意义:揭示小语言模型中的自我改进行为

Marina Igitkhanian, Erik Arakelyan

发表机构 * American University of Armenia(亚美尼亚美国大学) NVIDIA(英伟达)

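AI summary: Using a purpose-built sufficiency test, finds that small language models gain only 4.4% accuracy from self-correction and that longer hints are positively correlated with incorrect answers, indicating limited reasoning ability.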

Comments GEM Workshop at ACL 2026

AI中文摘要

近年来,语言模型在各个领域和应用中取得了快速进展。然而,它们的自我改进能力——即是否善于识别和纠正自身推理中的缺陷——仍然存疑。在本研究中,我们通过构建一个充分性测试来严格检验小语言模型(SLMs)的自我纠正能力。我们提出了一个最小化的三步自我纠正流程:收集初始SLM答案,提示同一模型根据真实答案为错误回答生成提示,然后将相同问题与模型自身的反馈一起输入以改进初始答案。我们在算术和逻辑推理基准上评估了多种指令微调和推理SLM。我们的发现表明,注入提示句子的SLM相比初始问答准确率仅提升4.4%。即使正确答案与模型的错误推理一起提供,评估的SLM也无法理解其推理中缺失了什么,并且在导致纠正和未导致纠正的提示之间显示出最小的语义差异。此外,我们的实验表明,较长的提示与错误的最终答案正相关,表明对问题的较长思考可能阻碍推理过程,这意味着SLM的性能不一定随更大的计算预算而扩展。

英文摘要

Recently, language models have made rapid progress across various domains and applications. However, their capability for self-improvement, i.e., whether they are adept at recognising and correcting flaws in their own reasoning, remains dubious. In this study, we address this question by constructing a sufficiency test to rigorously examine the self-correction capabilities of small language models (SLMs). We propose a minimal three-step self-correction pipeline that collects initial SLM answers, prompts the same model to generate hints for its incorrect responses given the ground truth, and feeds the model the same question with its own feedback to refine the initial answer. We evaluate a variety of instruction-tuned and reasoning SLMs in this experimental setup on arithmetic and logical reasoning benchmarks. Our findings show that SLMs with injected hint sentences yield only a 4.4 percent gain over initial question-answering accuracy. Even though the correct answer was provided alongside the model's incorrect reasoning, the evaluated SLMs fail to understand what was missing in their reasoning and show minimal semantic difference between hints that lead to corrections and ones that do not. Furthermore, our experiments show that longer hints are positively correlated with incorrect final answers, suggesting that longer deliberation on problems can hinder the reasoning process, meaning that SLMs do not necessarily scale in performance with a larger compute budget.

2606.08501 2026-06-09 cs.CL 新提交

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

重回正轨:在扩散大语言模型中对齐奖励与状态以进行推理

Yawen Shao, Jie Xiao, Kai Zhu, Yu Liu, Hongchen Luo, Xueyang Fu, Yang Cao, Wei Zhai, Zheng-Jun Zha

发表机构 * University of Science and Technology of China(中国科学技术大学) Tongyi Lab(通义实验室) Northeastern University(东北大学)

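AI summary: Targets the dual misalignment between process rewards and state trajectories in reinforcement learning for diffusion large language models, proposing the PAPO framework, which achieves alignment through step-aware process rewards and entropy-guided historical re-enactment and yields significant gains on four benchmarks.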

AI中文摘要

强化学习(RL)在增强扩散大语言模型(dLLMs)的推理能力方面具有巨大潜力。然而,进展受到真实生成轨迹与梯度更新过程之间双重错位的基本限制:(i)过程-奖励错位。稀疏的终端奖励被不加区分地分配给生成过程的所有中间步骤,未能提供有区分度的信用分配。(ii)状态-轨迹错位。策略更新常常被引向人工的、偏离轨迹的状态,在信息量较少的样本上浪费梯度。为了解决这些限制,我们引入了过程对齐策略优化(PAPO),这是一种新颖的框架,通过步骤感知过程奖励(SPR)将稀疏的终端奖励转化为密集的逐步信用,以及熵引导历史重演(EHR)在高不确定性步骤重放真实轨迹,从而整体上对齐RL更新与dLLM的生成轨迹。在四个基准上的大量实验表明,PAPO显著优于基线,在GSM8K上提升高达4.5%,在MATH500上提升4.8%,在Countdown上提升42.2%,在Sudoku上提升16.1%。

英文摘要

Reinforcement learning (RL) holds immense promise for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, progress is fundamentally constrained by a dual misalignment between the authentic generation trajectory and the gradient update process: (i) Process-reward misalignment. Sparse, terminal rewards are indiscriminately assigned to all intermediate steps of the generation process, failing to provide discriminative credit assignment. (ii) State-trajectory misalignment. Policy updates are often diverted toward artificial, out-of-trajectory states, squandering gradients on less informative samples. To address these limitations, we introduce Process Aligned Policy Optimization (PAPO), a novel framework that holistically aligns the RL update with the dLLM's generative trajectory via Step-Aware Process Rewards (SPR) that transform sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) that replays authentic trajectories at high-uncertainty steps. Extensive experiments on four benchmarks demonstrate that PAPO significantly outperforms baselines, achieving gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown, and 16.1% on Sudoku.

2606.08644 2026-06-09 cs.CL cs.AI 新提交

A retrieval conditioned rebinding circuit for dynamic entity tracking in large language models

一种用于大语言模型中动态实体追踪的检索条件重绑定电路

Soyoung Oh, Vera Demberg

发表机构 * Saarland University(萨尔兰大学) Max Planck Institute for Informatics(马克斯·普朗克信息学研究所)

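AI summary: Uses causal interventions to identify a retrieval-conditioned rebinding mechanism for dynamic state tracking in large language models, in which a compact attention-head circuit encodes binding information and reinstates it; the mechanism manifests differently across model families.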

AI中文摘要

为了正确解释上下文并检索相关信息,大语言模型必须将实体与其属性绑定,并在状态变化时更新这些绑定。我们分析了LLM在动态状态追踪中如何实现这一绑定过程。通过因果干预,我们识别出一种检索条件重绑定机制,这是一个紧凑的注意力头电路,编码交换相关的绑定信息并在读出时恢复。在Gemma和Llama模型中,该电路支持重绑定行为,但机制的表示特征在不同模型家族中有所不同。在Gemma模型中,绑定特征清晰地表达在相关注意力头的查询/键子空间中,而在Llama模型中,绑定信息主要由键向量携带。总体而言,我们的结果揭示了LLM中上下文相关状态追踪的可解释机制。

英文摘要

To interpret context correctly and retrieve relevant information, large language models must bind entities to their attributes and update these bindings as state changes. We analyze how LLMs implement this binding process in dynamic state tracking. Using causal interventions, we identify a retrieval-conditioned rebinding mechanism, a compact attention-head circuit that encodes swap-relevant binding information and reinstates it at readout. Across Gemma and Llama models, this circuit supports rebinding behavior, but the representational signature of the mechanism differs across model families. In Gemma models, the binding signature is clearly expressed in the query/key subspaces of the relevant attention heads, whereas in Llama models, the binding information is carried primarily in key vectors. Overall, our results reveal an interpretable mechanism for context-dependent state tracking in LLMs.

2606.08755 2026-06-09 cs.CL 新提交

Co-Evolving Skill Generation and Policy Optimization

共同进化技能生成与策略优化

Zhiwei Zhang, Yudi Lin, Nikki Lijing Kuang, Linlin Wu, Xiaomin Li, Songtao Liu, Fenglong Ma

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) Nanyang Technological University(南洋理工大学) University of California, San Diego(加州大学圣迭戈分校) University of Utah(犹他大学) Harvard University(哈佛大学)

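AI summary: Proposes an online reinforcement learning framework that estimates a skill's marginal utility from the reward gap between base and skill-augmented rollouts, enabling validation before storage, and uses this signal to train the policy as a skill generator, reducing reliance on proprietary models.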

AI中文摘要

技能增强的强化学习通过存储从过去经验中获取的可重用程序性知识来改进语言智能体。现有方法通常使用强大的语言模型分析轨迹、生成技能,并在在线训练期间更新可检索的技能库。然而,它们很少在存储和重用新生成的技能之前评估其是否有用。我们发现这一假设不可靠:即使由专有前沿LLM生成的技能也表现出高度混合的效用,许多技能几乎没有益处甚至降低性能。一旦此类技能进入库中,其影响难以识别,因为后续的轨迹反馈是延迟的,并且通常反映多个检索技能的组合效果,而非单个技能的边际贡献。我们提出了一种用于存储前技能验证的在线强化学习框架。该框架估计候选技能是否在当前任务的已检索技能之外贡献了有用信息。它使用标准的轨迹预算,在同一任务和检索上下文下形成两个匹配组:基于当前检索技能的条件基础轨迹,以及基于相同技能加上从基础轨迹中诱导出的一个候选技能的条件技能增强轨迹。这两组之间的奖励差距估计了候选技能的上下文相关边际效用,使框架能够在不增加轨迹开销的情况下促进有用技能,同时过滤无效或有害技能。该框架进一步利用这一边际效用信号来训练策略本身作为技能生成器,减少对专有模型重复调用的依赖。学习到的技能生成似然作为上下文相关的分数,用于检索时的重排序和随着策略演化对过时技能的修剪。

英文摘要

Skill-augmented reinforcement learning improves language agents by storing reusable procedural knowledge acquired from past experience. Existing methods typically use strong language models to analyze trajectories, generate skills, and update a retrievable skill bank during online training. However, they rarely assess whether a newly generated skill is useful before it is stored and reused. We find that this assumption is unreliable: even skills generated by proprietary frontier LLMs exhibit highly mixed utility, with many providing little benefit or even degrading performance. Once such skills enter the bank, their effects are difficult to identify, because subsequent rollout feedback is delayed and usually reflects the combined effect of multiple retrieved skills rather than the marginal contribution of any individual skill. We propose an online reinforcement learning framework for pre-storage skill validation. The framework estimates whether a candidate skill contributes useful information beyond the skills already retrieved for the current task. It uses the standard rollout budget to form two matched groups under the same task and retrieval context: base rollouts conditioned on the currently retrieved skills, and skill-augmented rollouts conditioned on the same skills plus one candidate skill induced from the base trajectories. The reward gap between these two groups estimates the candidate skill's context-dependent marginal utility, enabling the framework to promote useful skills while filtering ineffective or harmful ones without additional rollout overhead. The framework further uses this marginal-utility signal to train the policy itself as a skill generator, reducing reliance on repeated calls to proprietary models. The learned skill-generation likelihood serves as a context-dependent score for retrieval-time reranking and outdated-skill pruning as the policy evolves.
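The reward-gap estimate of a candidate skill's marginal utility reduces to a difference of group means over matched rollouts. The sketch below assumes scalar rewards per rollout and a fixed acceptance margin. The function names and the margin are illustrative, and the rollouts themselves are replaced by hard-coded toy rewards.

```python
from statistics import mean


def marginal_utility(base_rewards, augmented_rewards):
    """Reward gap between skill-augmented and base rollouts of the same task."""
    return mean(augmented_rewards) - mean(base_rewards)


def validate_skill(base_rewards, augmented_rewards, margin=0.0):
    """Accept the candidate skill only if its estimated gap exceeds the margin."""
    return marginal_utility(base_rewards, augmented_rewards) > margin


# Toy binary rewards for one task; both groups share the same retrieved skills.
base = [0, 1, 0, 0]
helpful = [1, 1, 0, 1]   # rollouts that also condition on a useful candidate
harmful = [0, 0, 0, 0]   # rollouts that also condition on a harmful candidate

print(marginal_utility(base, helpful), validate_skill(base, harmful))
```

Because both groups are drawn from the standard rollout budget, the estimate adds no extra rollouts, though with small groups the gap is a noisy estimate.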

2606.08994 2026-06-09 cs.CL 新提交

Language-Aware Token Boosting: LLM Language Confusion Reduction Without Tuning

语言感知令牌增强:无需微调的大语言模型语言混淆减少

Trapoom Ukarapol, Pakhapoom Sarapat, Nut Chukamphaeng

发表机构 * SCB DataX Tsinghua University(清华大学) SCBX

AI总结 提出无需微调的语言混淆减少方法,通过语言感知令牌增强(LATB)和自适应版本(Adaptive-LATB)对目标语言令牌施加扰动,有效提升多语言对齐并保持摘要质量。

Comments ACL2026 Main Conference

详情
AI中文摘要

大型语言模型(LLMs)在生成非英语文本时有时会出现语言混淆。现有方法通常依赖微调来缓解此问题。相比之下,我们提出了一种无需微调的语言混淆减少范式。在该范式中,我们引入了两种方法:语言感知令牌增强(LATB),它对与目标语言相关的令牌施加有针对性的扰动;以及自适应语言感知令牌增强(Adaptive-LATB),它根据模型对目标语言的置信度动态调整这些扰动。实验表明,我们的方法通过减少语言混淆有效提升了多语言对齐,同时在不需额外微调的情况下保持了摘要质量。我们的代码已公开。https://github.com/scbdatax/genai-datax-language-aware-token-boosting

英文摘要

Large language models (LLMs) sometimes exhibit language confusion when generating non-English text. Existing approaches typically rely on fine-tuning to mitigate this issue. In contrast, we propose a tuning-free paradigm for reducing language confusion. Within this paradigm, we introduce two methods: Language-Aware Token Boosting (LATB), which applies targeted perturbations to tokens associated with the desired language, and Adaptive Language-Aware Token Boosting (Adaptive-LATB), which dynamically adjusts these perturbations based on the model's confidence in the intended language. Experiments demonstrate that our methods effectively improve multilingual alignment by reducing language confusion while maintaining summarization quality, without requiring any additional fine-tuning. Our code is publicly available: https://github.com/scbdatax/genai-datax-language-aware-token-boosting
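
One way to picture decoding-time token boosting is as a bonus added to the logits of target-language tokens. The sketch below is a hedged illustration, not the released LATB code: the additive form of the perturbation, the `delta` and `max_delta` parameters, and the use of target-language probability mass as the confidence signal are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def boost_logits(logits, target_ids, delta):
    """Add a constant bonus to the logits of target-language token ids."""
    return [x + delta if i in target_ids else x for i, x in enumerate(logits)]

def adaptive_boost(logits, target_ids, max_delta):
    """Shrink the bonus as the model's own mass on the target language grows."""
    probs = softmax(logits)
    confidence = sum(probs[i] for i in target_ids)
    return boost_logits(logits, target_ids, max_delta * (1.0 - confidence))

# Toy vocabulary of four tokens; ids 1 and 3 belong to the target language.
logits = [2.0, 1.0, 0.5, 0.1]
target_ids = {1, 3}
before = sum(softmax(logits)[i] for i in target_ids)
after = sum(softmax(boost_logits(logits, target_ids, 2.0))[i] for i in target_ids)
```

In this toy case `after` exceeds `before`, and `adaptive_boost` leaves the logits almost unchanged once the model is already confident in the target language.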

2606.09032 2026-06-09 cs.CL 新提交

Bridging the Agent-World Gap: Text World Models for LLM-based Agents

弥合智能体-世界鸿沟:面向基于LLM的智能体的文本世界模型

Yixia Li, Hongru Wang, Peng Lai, Zhiwen Ruan, He Zhu, Youxin Zhu, Ganlong Zhao, Minda Hu, Yun Chen, Sibei Yang, Peng Li, Jeff Z. Pan, Jia Pan, Guanhua Chen, Yang Liu, Guanbin Li

发表机构 * Southern University of Science and Technology(南方科技大学) University of Edinburgh(爱丁堡大学) Peking University(北京大学) Sun Yat-sen University(中山大学) The Chinese University of Hong Kong(香港中文大学) Shanghai University of Finance and Economics(上海财经大学) Tsinghua University(清华大学) The University of Hong Kong(香港大学)

AI总结 本文系统综述了面向基于LLM的智能体的文本世界模型,围绕形式化框架和智能体生命周期,涵盖基础定义、构建范式、应用(训练时经验合成与推理时规划、验证、适应)及评估,旨在整合该领域并明确设计空间与开放挑战。

Comments Code: https://github.com/sustech-nlp/awesome-text-world-models

详情
AI中文摘要

基于大型语言模型(LLM)的智能体越来越多地用于交互式文本环境,从网页导航、代码编辑到工具使用和长时对话。然而,许多智能体仍然主要是反应式的,将观察映射到动作,而没有对这些环境如何构建和演变的显式模型。这激发了文本世界模型(TWMs):文本状态上的转移模型,给定状态和候选动作,预测结果网页、终端输出、API响应或用户回复,从而支持规划、高效学习和原则性评估。我们系统综述了面向基于LLM的智能体的文本世界模型,围绕形式化框架和智能体生命周期组织:(1)基础,定义文本世界模型并通过状态表示和基础领域对其进行表征;(2)构建,对LLM作为世界模型和代码作为世界模型范式进行分类,并回顾构建方法;(3)应用,考察世界模型如何通过经验合成在训练时以及通过规划、验证和适应在推理时支持智能体;(4)评估,涵盖世界模型本身的评估及其作为智能体评估环境的使用。我们旨在巩固这一快速发展领域,阐明其设计空间,并强调未来研究的开放挑战。

英文摘要

Large language model (LLM)-based agents are increasingly used in interactive textual environments, from web navigation and code editing to tool use and long-horizon dialogue. Yet many remain largely reactive, mapping observations to actions without an explicit model of how these environments are structured and evolve. This motivates text world models (TWMs): transition models over textual states that, given a state and a candidate action, predict the resulting webpage, terminal output, API response, or user reply, thereby supporting planning, efficient learning, and principled evaluation. We systematically review text world models for LLM-based agents, organized around a formal framework and the agent lifecycle: (1) Foundations, defining text world models and characterizing them by state representation and grounding domain; (2) Construction, taxonomizing LLM-as-WM and code-as-WM paradigms and reviewing methods for building them; (3) Application, examining how world models support agents at training time through experience synthesis and at inference time through planning, verification, and adaptation; and (4) Evaluation, covering both evaluation of the world model itself and its use as an evaluation environment for agents. We aim to consolidate this rapidly developing area, clarify its design space, and highlight open challenges for future research.

2606.09157 2026-06-09 cs.CL cs.AI 新提交

SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance

SEF-CLGC在SemEval-2026任务11中的应用:逻辑符号对语言模型性能的影响

Hanna Abi Akl, Fabien Gandon, Catherine Faron, Pierre Monnin

发表机构 * Université Côte d’Azur, Inria, CNRS, I3S, Sophia Antipolis, France(蔚蓝海岸大学, 法国国家信息与自动化研究所, 法国国家科学研究中心, 信息与系统科学实验室, 索菲亚安蒂波利斯, 法国) Data ScienceTech Institute, Paris, France(数据科学技术学院, 巴黎, 法国)

AI总结 本文提出SEF-CLGC管道,结合形式逻辑符号与小语言模型,在SemEval-2026任务11中评估推理性能,最佳模型在降低内容偏差的同时达到27.80%的内容分数。

Comments Accepted to SemEval-2026 co-located with ACL 2026

详情
AI中文摘要

本文重新审视了我们称为三段论评估框架-通用逻辑语法构建(SEF-CLGC)的管道。我们将形式逻辑符号与小语言模型(SLMs)相结合,以评估在SemEval-2026任务11子任务1:大型语言模型中内容与形式推理的分离中的推理性能。我们的实验表明,仅依靠在自然语言和符号语言组合上训练的SLMs,我们的最佳模型在该任务上达到了27.80%的内容分数,同时显著降低了推理中的内容偏差。

英文摘要

This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-2026 Task 11 Subtask 1: Disentangling Content and Formal Reasoning in Large Language Models. Our experiments show that by relying solely on SLMs, trained on a combination of natural and symbolic languages, our best model achieves a content score of 27.80% on the task while significantly lowering the content bias in reasoning.

2606.09304 2026-06-09 cs.CL cs.LG 新提交

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

SG-OPD: 通过符号一致性门控和分阶段教师采样的符号门控在线蒸馏

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, Xiaosong Yuan

发表机构 * Zhejiang University(浙江大学) Hunan University(湖南大学) Tianjin University(天津大学) Shanghai Jiao Tong University(上海交通大学) Jilin University(吉林大学)

AI总结 针对在线蒸馏中轨迹级对齐和教师偏好均匀可靠性假设的失效问题,提出SG-OPD方法,通过符号一致性门控和分阶段教师采样改进蒸馏效果,在竞赛级数学推理任务上平均提升1.98和7.50。

详情
AI中文摘要

在线蒸馏(OPD)在自身轨迹上训练学生模型,并利用更强教师的密集逐token监督,通常优于离线蒸馏和标准强化学习。然而,我们发现其有效性隐含地依赖于两个在实践中经常失效的假设:学生与教师之间的轨迹级对齐,以及教师偏好的均匀token级可靠性。因此,我们提出符号门控在线蒸馏(SG-OPD),该方法使用二元验证器作为教师信任信号,在两个互补粒度上发挥作用:分阶段教师采样在冷启动时混合验证器认可的教师轨迹,而符号一致性门控在教师与验证器校正方向一致的token上外推蒸馏更新,在分歧时内插。在竞赛级数学推理基准上的实验表明,SG-OPD持续优于标准OPD,在每样本和每问题水平上平均提升分别为1.98和7.50。

英文摘要

On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.

2606.09338 2026-06-09 cs.CL 新提交

Multi-Hop Knowledge Composition is Bound by Pretraining Exposure

多跳知识组合受限于预训练暴露

Yannis Karmim, Luis Marti, Djamé Seddah, Valentin Barrière

发表机构 * Inria, Paris, France(法国国家信息与自动化研究所(巴黎)) Inria, Chile(法国国家信息与自动化研究所(智利)) Dept. of Computer Science, Universidad de Chile(智利大学计算机科学系)

AI总结 研究发现,即使单跳准确率达97%,大语言模型仍无法进行隐式多跳推理,原因是预训练中缺乏组合上下文,而非知识缺失。

详情
AI中文摘要

大语言模型在隐式多跳推理上失败:当模型能正确回答“$X$出生于何时?”和“$Y$最亲密的朋友是谁?”,但在单次前向传播中回答“$Y$最亲密的朋友出生于何时?”时却失败,即使这两个事实都被完美记忆且可单独检索。我们在受控自然语言环境中研究这一失败,严格区分预训练期间暴露于组合上下文的个体和从未出现在任何此类上下文中的个体。我们确认,即使单跳准确率达到97%,组合失败仍然存在,从而将这一差距确定为预训练失败而非知识缺失。我们提出并测试了九种以数据为中心的增强格式,发现组合预训练可以迁移到暴露个体的未见问题,但从未迁移到未参与组合预训练的个体,这表明预训练期间暴露于组合上下文是隐式多跳推理的必要条件。

英文摘要

Large Language Models fail at implicit multi-hop reasoning: a model answers "When was $X$ born?" and "Who is $Y$'s closest friend?" correctly but fails on "When was $Y$'s closest friend born?" in a single forward pass, even when both facts are perfectly memorized and individually retrievable. We study this failure in a controlled natural language setting with a strict separation between individuals exposed to compositional contexts during pretraining and those that never appear in any such context. We confirm that compositional failure persists even at 97% 1-hop accuracy, establishing the gap as a pretraining failure rather than a knowledge absence. We propose and test nine data-centric augmentation formats and find that compositional pretraining transfers to unseen questions for exposed individuals, but never to individuals absent from compositional pretraining, suggesting that exposure to compositional contexts during pretraining is a necessary condition for implicit multi-hop reasoning.

2606.09449 2026-06-09 cs.CL 新提交

Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization

无金标准推理:自动形式化的代理-裁判理论

Lei Xu, Xin Quan, André Freitas

发表机构 * Idiap Research Institute(Idiap研究所) École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院) University of Manchester(曼彻斯特大学) CRUK National Biomarker Centre, University of Manchester(英国癌症研究中心国家生物标志物中心,曼彻斯特大学)

AI总结 提出无参考的代理-裁判框架,通过多轴属性检查替代金标准匹配,实现自动形式化的迭代优化,理论保证收敛,实验提升通过率。

详情
AI中文摘要

复杂的推理任务日益要求系统生成其正确性无法通过与单一参考精确匹配来判断的输出。自动形式化(AF)是一个代表性例子;它要求模型将非正式的数学或逻辑推理翻译成可形式化检查的对象,然而专家验证的形式化在玩具案例之外无法扩展,且一个非正式论证可以有许多有效的形式化呈现。因此,进展取决于部分、结构化的代理能否替代精确参考。我们为AF引入了一个无参考的代理-裁判框架,用多轴属性检查向量替代金标准匹配。该框架沿三个结构范围组织代理:涵盖所引发对象的全局属性、子组件内部的模块属性、以及将其重新对齐到非正式来源的跨域属性,并将每个轴聚合成一个裁决向量。该向量驱动一个反思性精炼循环,其中违反的坐标将控制器引导到匹配的修复目标,因此每次迭代仅更改被判断为错误的部分。在有界裁判噪声下,期望的内在差距几何级数收缩到噪声相关的平台。在miniF2F、ProofNet、e-SNLI和ProntoQA上的七个形式化骨干中,精炼持续提升通过率超过单次ICL基线,并且在基线有改进空间的基准上,多轴代理优于匹配的标量代理。因此,结构化代理判断既提供了实用的精炼信号,也提供了在精确参考不可用时收敛的理论依据。

英文摘要

Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable object, yet expert-validated formalizations do not scale beyond toy cases and a single informal argument can admit many valid formal renderings. Progress therefore depends on whether partial, structured proxies can substitute for exact references. We introduce a reference-free proxy-judge framework for AF that replaces gold-standard matching with a vector of per-axis property checks. The framework organizes the proxy along three structural scopes that cover global properties of the elicited object, per-module properties internal to its sub-components, and cross-domain properties that re-align it to the informal source, and aggregates each axis into a verdict vector. The vector drives a reflective refinement loop in which a violated coordinate routes the controller to a matching repair target, so each iteration changes only what is judged wrong. Under bounded judge noise, the expected intrinsic gap contracts geometrically to a noise-dependent plateau. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently lifts Pass Rate over the single-shot ICL baseline, and the per-axis proxy outperforms a matched scalar proxy on benchmarks where the baseline has room to improve. Structured proxy judgments therefore provide both a practical refinement signal and a theoretical handle on convergence when exact references are unavailable.
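
The convergence statement has the shape of a standard contraction-with-noise recursion. As an illustration only, assume the expected gap $g_t$ after $t$ refinement iterations obeys a one-step bound with contraction rate $\rho \in (0,1)$ and additive judge-noise term $\varepsilon \ge 0$; these symbols and the exact form of the bound are assumptions made here, not taken from the paper.

```latex
\mathbb{E}[g_{t+1}] \le \rho\,\mathbb{E}[g_t] + \varepsilon
\quad\Longrightarrow\quad
\mathbb{E}[g_t] \le \rho^{t} g_0 + \varepsilon \sum_{i=0}^{t-1} \rho^{i}
= \rho^{t} g_0 + \frac{1-\rho^{t}}{1-\rho}\,\varepsilon .
```

Under that assumption the first term decays geometrically, and the bound settles at the plateau $\varepsilon/(1-\rho)$, which vanishes only when the judge is noise-free.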

2606.09525 2026-06-09 cs.CL cs.AI 新提交

Emergence of Context Characteristics Sensitivity in Large Language Models

大型语言模型中上下文特征敏感性的涌现

Nadya Yuki Wangsajaya, Haeun Yu, Isabelle Augenstein

发表机构 * Nanyang Technological University(南洋理工大学) University of Copenhagen(哥本哈根大学)

AI总结 通过测量监督微调、直接偏好优化和可验证奖励强化学习三个阶段,发现大型语言模型对上下文特征的敏感性在指令微调过程中动态变化,其中监督微调使模型倾向于使用易理解的上下文,而后续阶段可能强化或改变这一偏好。

详情
AI中文摘要

在指令微调(IFT)过程中,大型语言模型(LLMs)通过使用提供的上下文来回答问题,从而学会遵循指令。虽然先前的工作已经研究了上下文特征如何与LLM的上下文使用相关,但这种分析仅限于推理时间,尚未揭示这些关系最初是如何获得的。在这里,我们测量了模型对这些特征的敏感性在连续的IFT阶段(监督微调(SFT)、直接偏好优化(DPO)和可验证奖励强化学习(RLVR))中如何变化。跨四个模型和三个数据集的实验表明,SFT使模型更倾向于使用易于理解的上下文,例如包含高长度、上下文-查询相似性和流畅性的上下文。SFT后的动态可能根据训练数据集强化或解决这些偏好。我们的发现揭示了上下文使用在每个IFT阶段都被积极重塑,并且设计平衡的IFT数据集对于确保指令微调模型稳健的上下文利用至关重要。

英文摘要

During instruction fine-tuning (IFT), large language models (LLMs) learn to follow instructions by using the provided context to answer a query. While prior work has studied how context characteristics correlate with context usage by the LLM, this analysis has been limited to inference time, leaving open how these relationships are acquired in the first place. Here, we measure how models' sensitivity to such characteristics shifts across successive IFT stages: supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning with verifiable rewards (RLVR). Experiments across four models and three datasets show that SFT makes models more likely to use contexts that are easy to understand, such as those with greater length, higher context-query similarity, and higher fluency. Post-SFT dynamics may either reinforce or resolve these preferences depending on the training dataset. Our findings reveal that context usage is actively reshaped at each IFT stage, and that designing a balanced IFT dataset is important for ensuring robust context utilization in instruction-tuned models.

2606.09659 2026-06-09 cs.CL cs.AI cs.LG 新提交

End-to-End Context Compression at Scale

端到端上下文压缩的规模化

Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, Pavel Izmailov

发表机构 * New York University(纽约大学) Modal Labs(Modal实验室) University of Maryland(马里兰大学) Princeton University(普林斯顿大学) Columbia University(哥伦比亚大学) Harvard University(哈佛大学) Lawrence Livermore National Laboratory(劳伦斯利弗莫尔国家实验室) FAIR at Meta(Meta FAIR实验室)

AI总结 本研究通过架构搜索和持续预训练,提出潜在上下文语言模型(LCLMs),一种端到端编码器-解码器压缩器,在通用任务性能、压缩速度和峰值内存上改进帕累托前沿,并可作为长时智能体的高效骨干。

详情
AI中文摘要

长上下文语言模型推理受限于内存,因为KV缓存随上下文长度增长。最近压缩KV缓存的技术存在不足:它们要么大幅降低模型质量,要么需要大量时间和计算来压缩单个长提示。此外,许多方法要求输入适合目标模型的上下文窗口,并且通常与现代生产推理引擎不兼容。编码器-解码器压缩器原则上是一种有吸引力的替代方案,它将长令牌序列映射到由解码器消费的较短潜在嵌入序列。然而,现有方法在精度-效率前沿上无法与KV缓存压缩竞争。在这项工作中,我们重新审视编码器-解码器压缩并缩小了这一差距。我们首先进行架构搜索,从头开始预训练许多变体,以确定如何最佳设计和训练编码器-解码器压缩器。根据我们的发现,我们持续预训练一系列0.6B编码器、4B解码器模型,每个模型在超过350B令牌上训练,压缩比为1:4、1:8和1:16。我们引入了潜在上下文语言模型(LCLMs),这是一系列压缩器,在通用任务性能、压缩速度和峰值内存使用上改进了帕累托前沿。我们证明了LCLMs可作为长时智能体的高效骨干,让智能体浏览压缩的长上下文并按需自适应扩展相关片段。

英文摘要

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.

2606.09662 2026-06-09 cs.CL 新提交

When Built-in Thinking Helps and Hurts: Constraint-Level Error Shifts in Instruction Following

当内置思考既有帮助又有害:指令遵循中的约束级错误转移

Sai Adith Senthil Kumar

发表机构 * George Mason University(乔治梅森大学)

AI总结 研究大型推理模型(LRM)的思考模式对指令遵循的影响,发现思考会改变错误模式而非统一降低性能,其中规划类约束改善而精确类约束恶化,并通过分析思考轨迹和激活修补揭示了机制。

Comments 16 pages, 7 figures, 15 tables

详情
AI中文摘要

大型推理模型(LRM)通常能提升数学和编码性能,但其对指令遵循的影响尚不明确。我们使用 Qwen3 模型(1.7B-32B)研究 IFEval,采用同权重的思考开启/关闭控制;四个 Hunyuan 模型提供跨家族方向性支持。总体通过率变化很小(-0.55 到 -3.52 个百分点),但 10-20% 的提示在两种模式间在通过和失败之间切换,表明思考改变了错误模式——某些提示改善而另一些恶化——而非统一降低性能。在事后 Qwen3 导出的分组下,约束类型分为规划类(全局计数、结构、协调)和精确类(精确局部形式);规划类在思考下类别层面改善,而精确类持续恶化;尽管 Hunyuan 的总体方向相反,但所有四个 Hunyuan 模型在类别层面的规划/精确符号模式方向一致。思考还改变了最终答案长度;匹配长度分析大幅减少了精确类的下降,但仍有残余惩罚。使用交叉编码器相关性指标分析思考轨迹揭示了三种模式:中性模式显示正的相关性-合规性关联(r ≈ 0.15);规划模式显示接近零的预测相关性(r ≈ 0.02),尽管有可测量的轨迹参与,这与 CE 测量的轨迹相关性和最终答案合规性之间的执行差距一致;精确模式显示小的负相关性(r ≈ -0.05),失败实例的平均相关性高于通过实例。跨四个模型大小(1.7B-14B)的激活修补显示,精确类翻转实例比规划类翻转实例更常被恢复(32-58% 对 14-40% 的平均层恢复),最大差距在 14B 处(约 30 个百分点)。

英文摘要

Large reasoning models (LRMs) often improve math and coding performance, but their effect on instruction following is unclear. We study IFEval with Qwen3 models (1.7B-32B), using same-weights Thinking ON/OFF controls; four Hunyuan models provide directional cross-family support. Aggregate pass-rate changes are small (-0.55 to -3.52 pp), yet 10-20% of prompts switch between pass and fail across modes, suggesting that thinking changes the pattern of errors--some prompts improve while others worsen--rather than uniformly degrading performance. Under a post-hoc Qwen3-derived grouping, constraint types separate into Planning (global counting, structure, coordination), which improves at the class level under thinking, and Precision (exact local form), which consistently worsens; the class-level Planning/Precision sign pattern holds directionally for all four Hunyuan models despite Hunyuan's opposite aggregate direction. Thinking also changes final-answer length; matched-length analyses substantially reduce the Precision drop, but a residual penalty remains. Analyzing thinking traces with a cross-encoder relevance metric reveals three patterns: Neutral shows a positive relevance-compliance link (r approximately 0.15); Planning shows near-zero predictive correlation (r approximately 0.02) despite measurable trace engagement, consistent with an execution gap between CE-measured trace relevance and final-answer compliance; Precision shows a small negative correlation (r approximately -0.05), with failing instances having higher mean relevance than passing ones. Activation patching across four model sizes (1.7B-14B) shows that Precision flip instances are more often restored than Planning flip instances (32-58% vs. 14-40% mean layer-restoration), with the largest gap at 14B (about 30 pp).

2606.07548 2026-06-09 cs.IR cs.AI cs.CL 交叉投稿

Evaluating Advanced Prompting on Gemini Flash for Multi-Hop Biomedical QA

评估 Gemini Flash 上的高级提示工程用于多跳生物医学问答

Ahmed Bajaber, Mohammed Alliheedi

发表机构 * Saudi Med AI Lab (SMAIL)(沙特医学人工智能实验室(SMAIL)) Prince Sultan University(苏丹王子大学) Al-Baha University(阿勒巴哈大学)

AI总结 本研究通过设计多组件提示(角色扮演、多步思维链示例和格式规则),在 Gemini 2.0 Flash 上实现概念级得分0.720,显著优于基线0.565,并接近下一代模型性能,证明高级提示设计对释放LLM推理能力至关重要。

Comments 8 pages, proceedings of the BioCreative IX Challenge and Workshop (BC9) at IJCAI 2025

详情
Journal ref
Proc. BioCreative IX Workshop (BC9), IJCAI 2025, Montreal, Canada
AI中文摘要

MedHopQA 挑战为大型语言模型(LLM)提供了一个关键测试:在高风险的生物医学领域中进行复杂的多跳推理。本文详细介绍了我们对 Google Gemini Flash 模型的直接基于 API 的评估,重点关注高级提示工程的影响。我们为 Gemini 2.0 Flash 设计了一个复杂的多组件提示,结合了角色扮演、显式的多步思维链(CoT)示例和详细的格式规则。使用这个复杂提示的最佳运行获得了0.720的概念级得分。这一结果显著优于仅得0.565的基线提示。值得注意的是,在高效的 Gemini 2.0 Flash 上的性能与下一代 Gemini 2.5 Flash 的结果几乎相同。我们的发现表明,复杂的提示设计是释放现代LLM全部推理能力的关键因素。

英文摘要

The MedHopQA challenge presents a critical test for Large Language Models (LLMs): complex, multi-hop reasoning in the high-stakes biomedical domain. This paper details our direct API-based evaluation of Google's Gemini Flash models, focusing on the impact of advanced prompt engineering. We designed a sophisticated, multi-component prompt for Gemini 2.0 Flash that combined role-playing, explicit multi-shot Chain-of-Thought (CoT) examples, and detailed formatting rules. Our best run, using this complex prompt, achieved a Concept Level Score of 0.720. This result dramatically outperformed a baseline prompt which scored only 0.565. Remarkably, this performance on the efficient Gemini 2.0 Flash was almost identical to the result from the next-generation Gemini 2.5 Flash. Our findings demonstrate that sophisticated prompt design is a critical factor for unlocking the full reasoning capabilities of modern LLMs.

2606.07703 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

需要多少密集注意力?面向混合长上下文模型中全/GQA层的Oracle引导稀疏预填充

Hongxing Wang, Harenome Razanajato, Zhen Zhang, Yujie Yuan, Hongsheng Liu

发表机构 * Technical Report, First Release(技术报告,首次发布)

AI总结 研究在混合长上下文模型中,通过Oracle引导的稀疏预填充减少密集注意力计算,在保持任务性能的同时实现加速,并验证了可行性、索引器质量和运行时加速潜力。

Comments Technical report, first release, 26 pages, 2 figures, 11 tables

详情
AI中文摘要

长上下文预填充仍然昂贵,因为即使在包含局部、稀疏、线性或循环组件的混合模型中,全/GQA层仍然对整个历史序列进行评分。我们研究了在显式支持粒度和top-k预算下,需要多少密集注意力来保持任务级行为。我们为现有的GQA检查点引入了一种注意力质量top-k oracle:对于每个层和查询位置,它计算密集注意力,选择头平均的token支持,并仅在该支持上重新计算注意力。该oracle是一个诊断参考,而非可部署的加速器,并将稀疏预算可行性从索引器误差和运行时实现效果中分离出来。在Qwen家族的检索密集型评估中,每个查询的最长oracle行与密集注意力相差在1个点以内,而Qwen3.5-9B在4K到100K的RULER风格扫描中相差在0.48个点以内。在oracle的指导下,我们通过KL蒸馏从密集注意力质量分布中训练了一个头折叠的辅助索引器,同时保持骨干网络冻结。使用分别蒸馏的Qwen3.5-0.8B和Qwen3.5-9B索引器,报告的16K/32K验证宏观差距分别为+2.04和+1.13个点,这被视为质量保持而非改进;融合的选择块共享支持可能引入更大的实现差距。初步的单卡TTFT测量显示,与密集FlashAttention-2基线相比,蒸馏索引器的稀疏服务加速比在NPU上对Qwen3.5-0.8B为1.71倍,在GPU上对Qwen3.5-9B为1.93倍。额外的随机初始化压力行达到3.44倍,表明稀疏运行时存在提升空间,但输出质量未经验证。本次发布首次分离了oracle可行性、蒸馏索引器质量和运行时提升空间,将完全匹配的质量-延迟前沿留待未来工作。

英文摘要

Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, sparse, linear, or recurrent components. We study how much dense attention is needed to preserve task-level behavior under explicit support granularity and top-k budgets. We introduce an attention-mass top-k oracle for existing GQA checkpoints: for each layer and query position, it computes dense attention, selects head-averaged token support, and recomputes attention only on that support. The oracle is a diagnostic reference, not a deployable accelerator, and separates sparse-budget feasibility from indexer error and runtime realization effects. On Qwen-family retrieval-heavy evaluations, the longest per-query oracle rows stay within 1 point of dense, and a Qwen3.5-9B RULER-style sweep from 4K to 100K stays within 0.48 points. Guided by the oracle, we derive a head-collapsed auxiliary indexer trained by KL distillation from dense attention-mass distributions while keeping the backbone frozen. With separately distilled Qwen3.5-0.8B and Qwen3.5-9B indexers, the reported 16K/32K validation macro gaps are +2.04 and +1.13 points, treated as quality preservation rather than improvement; fused selection-block-shared support can introduce a larger realization gap. Preliminary single-card TTFT measurements show distilled-indexer sparse serving speedups of 1.71x for Qwen3.5-0.8B on NPU and 1.93x for Qwen3.5-9B on GPU against its dense FlashAttention-2 baseline. Additional random-init stress rows reach 3.44x, indicating sparse-runtime headroom but not validated output quality. This first release separates oracle feasibility, distilled-indexer quality, and runtime headroom, leaving a fully matched quality-latency frontier to future work.
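
The three oracle steps named in this abstract (compute dense attention, select head-averaged token support, recompute attention on that support) can be sketched for a single query position. This is a simplified illustration under assumptions: plain Python lists stand in for tensors, heads are averaged uniformly, and the function names are invented here rather than taken from the report.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topk_oracle_support(head_scores, k):
    """Pick the k token indices with the largest head-averaged attention mass.

    head_scores holds, per head, the pre-softmax scores of one query
    against every key position.
    """
    head_probs = [softmax(s) for s in head_scores]
    n = len(head_scores[0])
    avg_mass = [sum(p[j] for p in head_probs) / len(head_probs) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: avg_mass[j], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(scores, values, support):
    """Recompute one head's attention output using only the selected tokens."""
    probs = softmax([scores[j] for j in support])
    dim = len(values[0])
    return [sum(p * values[j][d] for p, j in zip(probs, support)) for d in range(dim)]

# Two heads, four key positions, two-dimensional value vectors.
head_scores = [[4.0, 0.0, 3.0, -1.0], [3.5, 0.5, 4.0, -2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
support = topk_oracle_support(head_scores, k=2)
output = sparse_attention(head_scores[0], values, support)
```

Because the selection step needs the full dense attention first, a routine like this only diagnoses how small the budget `k` can be; it saves no compute by itself, which matches the abstract's description of the oracle as a reference rather than an accelerator.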

2606.07720 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

为什么将残差流限制在层而不是令牌?用于连续潜在推理的持久记忆

Mujtaba Farhan, Maheep Chaudhary

发表机构 * University of Cambridge(剑桥大学)

AI总结 针对CoCoNuT在潜在空间推理中因中间隐藏状态被覆盖导致概念瓶颈的问题,提出AGCLR模型,通过门控概念流持久记忆机制,在GSM8K、HotpotQA和ProsQA上取得一致提升。

详情
AI中文摘要

大型语言模型(LLMs)在数学和多跳规划任务上展现了卓越的推理能力。CoCoNuT(连续思维链)范式通过使模型能够在潜在空间中进行推理,同时探索多个推理路径,而不是早期就承诺单一链条,从而扩展了这一能力。然而,我们识别出一个我们称之为概念瓶颈的限制:在每次推理过程中,中间隐藏状态被覆盖,导致模型随着推理深度增加而丢失早期步骤中计算出的关键事实。我们在经验上观察到了这一点:在HotpotQA上,原始CoCoNuT(10.4% EM)未能超过CoT基线(11.0% EM),并且在GSM8K上随着课程深度增加性能下降。为了解决这个问题,我们提出了AGCLR(自适应门控连续潜在推理),它通过门控概念流增强了CoCoNuT:这是一个跨所有推理过程保持的持久残差记忆,由三个学习到的门控制,即将中间事实提交到记忆的写入门、检索相关先前状态的读取门,以及修剪不相关上下文的遗忘门。在使用GPT-2作为基础模型在GSM8K、HotpotQA和ProsQA上进行评估时,AGCLR在所有类型的数据集上实现了一致的改进。随着课程深度的增加,性能差距进一步扩大,直接解决了概念瓶颈。代码可在https://anonymous.4open.science/r/JJJJ/README.md获取。

英文摘要

Large language models (LLMs) have demonstrated remarkable reasoning abilities on mathematical and multi-hop planning tasks. The CoCoNuT (Chain of Continuous Thought) paradigm (Hao et al., 2024) extends this by enabling models to reason in latent space, exploring multiple reasoning paths simultaneously rather than committing to a single chain early on. However, we identify a limitation we term the concept bottleneck: at each reasoning pass, intermediate hidden states are overwritten, causing the model to lose critical facts computed in earlier steps as reasoning depth increases. We observe this empirically: on HotpotQA, vanilla CoCoNuT (10.4% EM) fails to improve over the CoT baseline (11.0% EM), and performance degrades with curriculum depth on GSM8K. To address this, we propose AGCLR (Adaptive Gated Continuous Latent Reasoning), which augments CoCoNuT with a Gated Concept Stream, a persistent residual memory that is maintained across all reasoning passes and controlled by three learned gates: a write gate that commits intermediate facts to memory, a read gate that retrieves relevant prior states, and a forget gate that prunes irrelevant context. Evaluated on GSM8K, HotpotQA, and ProsQA using GPT-2 as our base model, AGCLR achieves consistent improvements across all types of datasets, with the performance gap compounding as curriculum depth increases, directly resolving the concept bottleneck. Code available at https://anonymous.4open.science/r/JJJJ/README.md
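
The abstract names three gates but gives no equations, so the following is only a guess at the general shape of such a memory, written in an LSTM-like form. The update rule, the elementwise gating, and the function name are assumptions; in the actual model the gate values would be produced by learned layers rather than passed in as constants.

```python
def gated_memory_step(memory, hidden, write_gate, read_gate, forget_gate):
    """One reasoning pass over a persistent memory (hypothetical update rule).

    forget_gate scales what is kept from the old memory, write_gate scales
    what the current hidden state commits to it, and read_gate scales how
    much of the updated memory is added back to the hidden state.
    """
    new_memory = [
        f * m + w * h
        for m, h, w, f in zip(memory, hidden, write_gate, forget_gate)
    ]
    augmented = [h + r * m for h, r, m in zip(hidden, read_gate, new_memory)]
    return new_memory, augmented

# Two passes; the second hidden state has "overwritten" the first fact (dim 0).
memory = [0.0, 0.0]
passes = [[1.0, 0.0], [0.0, 2.0]]
for hidden in passes:
    memory, augmented = gated_memory_step(
        memory,
        hidden,
        write_gate=[1.0, 1.0],
        read_gate=[1.0, 1.0],
        forget_gate=[1.0, 1.0],
    )
```

After the second pass the hidden state alone has lost the value in dimension 0, but `augmented` still carries it through the memory, which is the effect the concept-bottleneck argument relies on.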

2606.07812 2026-06-09 cs.AI cs.CL 交叉投稿

Scaling Participation in Modular AI Systems

模块化AI系统中的参与扩展

Shangbin Feng, Yike Wang, Weijia Shi, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov

发表机构 * University of Washington(华盛顿大学) Stanford University(斯坦福大学)

AI总结 提出参与扩展范式,通过多方贡献小模型构建模块化AI系统,在15项任务上比单体大语言模型提升高达15.4%,并展现涌现能力。

详情
AI中文摘要

人类是由多面才能和需求组成的马赛克,任何真正智能的AI必须反映这种丰富性。然而,所有人使用的LLM却由少数人构建——一个集中化的单体AI模型市场,其结构上不适合捕捉人类知识、推理和价值观的多样性。本文介绍参与扩展,一种新范式,其中模块化AI系统通过不同利益相关者的贡献自下而上构建。参与者贡献基于自身兴趣和优先级训练的小模型;这些模型随后在模块化框架中作为组合式AI系统协作。参与式AI系统在15项任务(如推理和事实性)上比单体LLM高出最多15.4%,超越了比所有贡献组件总和更大的模型。进一步实验表明,参与式AI系统受益于贡献者多样性,显著改善每个贡献者的原始优先级,并展现出涌现能力,使其能解决超过15%的所有单个模型失败的问题。参与扩展为从单体现状向开放、自下而上、协作的AI未来过渡提供了技术基础。

英文摘要

Humanity is a mosaic of multifaceted talents and needs, and any truly intelligent AI must reflect that richness. Yet the LLMs used by all are built by the few -- a centralized market of monolithic AI models structurally ill-suited to capture the diversity of human knowledge, reasoning, and values. Here we introduce scaling participation, a new paradigm in which modular AI systems are built from the bottom up through the contributions of diverse stakeholders. Participants contribute small models trained on their own interests and priorities; these models then collaborate in modular frameworks as compositional AI systems. Participatory AI systems outperform monolithic LLMs by up to 15.4% across 15 tasks, such as reasoning and factuality, surpassing models larger than all contributed components combined. Further experiments show that participatory AI systems benefit from contributor diversity, substantially improve on each contributor's original priorities, and exhibit emergent capabilities that allow them to solve over 15% of problems where all individual models fail. Scaling participation provides a technical foundation for transitioning from the monolithic status quo toward an open, bottom-up, and collaborative AI future.

2606.08088 2026-06-09 cs.LG cs.CL 交叉投稿

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

ConSteer-RL:通过置信度感知强化学习引导大型语言模型的推理能力

Qing Miao, Yiming Zhao, Jing Yang, Chenxi Liu, Yuehai Chen, Yuewen Liu, Shaoyi Du, Badong Chen

发表机构 * Xi'an Jiaotong University(西安交通大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出ConSteer-RL框架,将模型log概率的token级置信度信号融入GRPO,通过置信度感知奖励塑造机制惩罚过度自信错误并强化正确自信推理,在多个模型规模上平均提升2.3%-4.0%。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)近期已成为提升大型语言模型(LLMs)推理能力的关键范式,但其仍受限于稀疏的二元奖励以及对模型内部不确定性的忽视。本文提出ConSteer-RL,一个简单而有效的框架,将源自模型log概率的token级置信度信号整合到RLVR训练中。具体而言,基于组相对策略优化(GRPO)框架,我们通过将每个token的概率聚合成标量置信度分数,并融入基于意识的奖励塑造机制,构建置信度感知奖励,该机制惩罚过度自信的错误,同时强化正确且自信的推理。实验结果表明,ConSteer-RL在不同模型规模上持续优于强GRPO基线,平均提升2.3%-4.0%。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its ignorance of model-internal uncertainty. In this paper, we propose ConSteer-RL, a simple yet effective framework that integrates token-level confidence signals derived from model log-probabilities into RLVR training. Specifically, building upon the Group Relative Policy Optimization (GRPO) framework, we construct a confidence-aware reward by aggregating per-token probabilities into a scalar confidence score and incorporating it into an awareness-based reward shaping mechanism that penalizes overconfident errors while reinforcing correct and confident reasoning. Experimental results demonstrate that ConSteer-RL consistently outperforms strong GRPO baselines, achieving average improvements of 2.3%-4.0% across different model scales.
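
A confidence-aware reward of the kind described here can be sketched in a few lines. The specific choices below are assumptions for illustration, not the paper's formulas: confidence is taken as the geometric mean of token probabilities, the shaping adds or subtracts `lam * confidence`, and advantages use the usual group-normalized form associated with GRPO.

```python
import math
from statistics import mean, pstdev

def sequence_confidence(token_logprobs):
    """Scalar confidence: geometric mean of the per-token probabilities."""
    return math.exp(mean(token_logprobs))

def shaped_reward(correct, confidence, lam=0.5):
    """Reward confident correct answers; penalize confident wrong ones."""
    return 1.0 + lam * confidence if correct else -lam * confidence

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group of responses."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# One group of four sampled responses: (verifier verdict, token log-probs).
group = [
    (True, [-0.1, -0.2]),   # correct and confident
    (True, [-1.0, -1.5]),   # correct but hesitant
    (False, [-0.1, -0.1]),  # wrong and overconfident
    (False, [-2.0, -2.5]),  # wrong but uncertain
]
rewards = [shaped_reward(ok, sequence_confidence(lp)) for ok, lp in group]
advantages = group_relative_advantages(rewards)
```

Under this shaping the overconfident wrong response receives the lowest reward in the group, while the binary verifier alone would have scored both wrong responses identically.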

2606.08454 2026-06-09 cs.LG cs.CL 交叉投稿

Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior

超越线性激活引导:用于控制大语言模型行为的可逆潜在变换

Tuc Nguyen, Thai Le

发表机构 * Indiana University Bloomington(印第安纳大学伯明顿分校)

AI总结 提出INNSteer框架,通过可逆神经网络将LLM激活映射到潜在空间进行线性控制,再逆变换回原空间,实现非线性、输入依赖的激活引导,在多个模型和基准上优于现有方法。

Comments 36 pages, 7 figures

详情
AI中文摘要

激活引导提供了一种轻量级的推理时机制,通过修改大语言模型(LLM)的内部激活向量,使其朝向期望行为。现有方法大多在原始激活空间中计算固定的引导方向,通常使用对比示例对的均值差、线性探针或任意可分离性标准。虽然在一定程度上有效,但这些方法将行为控制视为全局线性加性偏移:相同的方向应用于所有输入,且行为是线性可分的。当行为特征在激活空间中非线性变化或位于弯曲和各向异性流形上时,这种处理可能具有局限性,因为最优干预可能是输入依赖的。为解决这一限制,我们提出了INNSteer,一种基于可逆潜在变换的非线性激活引导框架。INNSteer并非在原始表示空间中寻找更好的引导向量,而是学习一个轻量级可逆神经网络$ϕ$,将LLM的激活映射到潜在空间,在该空间中行为类别更易于线性控制。推理时,激活通过$ϕ$映射,在潜在空间中进行引导,再通过精确逆变换$ϕ^{-1}$映射回原空间。这使得简单的潜在空间平移在原始激活空间中变为非线性、输入依赖的干预。在多个LLM系列、规模、行为特征和安全基准的实验设置中,INNSteer在保持生成流畅性的同时,一致地优于线性、基于传输和非线性的引导基线。

英文摘要

Activation steering provides a lightweight inference-time mechanism for controlling large language models (LLMs) by modifying their internal activation vectors toward desired behaviors. Most existing methods compute a fixed steering direction in the original activation space, typically from pairs of contrastive examples using mean differences, linear probes, or arbitrary separability criteria. While effective to a certain extent, these methods treat behavioral control as a global, linear, additive offset: the same direction is applied across inputs, and behaviors are assumed to be linearly separable. This can be restrictive when behavioral features vary nonlinearly across the activation space or lie on curved and anisotropic manifolds, where the optimal intervention may be input-dependent. To address this limitation, we propose INNSteer, a nonlinear activation steering framework based on invertible latent transformations. Rather than searching for a better steering vector in the original representation space, INNSteer learns a lightweight invertible neural network $ϕ$ that maps an LLM's activations into a latent space where behavioral classes are more amenable to linear control. At inference time, activations are mapped through $ϕ$, steered in the latent space, and mapped back through the exact inverse transformation $ϕ^{-1}$. This turns a simple latent-space translation into a nonlinear, input-dependent intervention in the original activation space. Across experiment settings on multiple LLM families, scales, behavioral traits, and safety benchmarks, INNSteer consistently improves model control over linear, transport-based, and nonlinear steering baselines while largely preserving generation fluency.
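
The map, steer, and invert sequence can be illustrated with a single additive coupling layer, a common building block of invertible networks. This is a toy sketch, not the INNSteer architecture: the coupling form, the fixed `tanh` shift standing in for a learned network, and all names are assumptions, and the input is assumed to have even dimension.

```python
import math

def shift(x1):
    """Fixed nonlinear function standing in for a learned shift network."""
    return [math.tanh(2.0 * a) for a in x1]

def phi(x):
    """Additive coupling: keep the first half, shift the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    return x1 + [b + s for b, s in zip(x2, shift(x1))]

def phi_inverse(z):
    """Exact inverse of phi: subtract the same shift."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    return z1 + [b - s for b, s in zip(z2, shift(z1))]

def steer(x, direction, alpha):
    """Translate in latent space, then map back to activation space."""
    z = phi(x)
    z_steered = [a + alpha * d for a, d in zip(z, direction)]
    return phi_inverse(z_steered)

# The same latent direction applied to two different activations.
direction = [1.0, 0.0]
x_a, x_b = [0.0, 0.0], [2.0, 0.0]
delta_a = [s - x for s, x in zip(steer(x_a, direction, 1.0), x_a)]
delta_b = [s - x for s, x in zip(steer(x_b, direction, 1.0), x_b)]
```

Although the latent translation is identical for both inputs, `delta_a` and `delta_b` differ in their second coordinate, which is the input-dependent behavior the abstract attributes to steering through an invertible map.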

2606.08676 2026-06-09 cs.SE cs.AI cs.CL 交叉投稿

Lost in the Flow with Code Talkers: Unveiling the Instruction-Tuning Tax of Large Language Models in Code Tasks

迷失在代码对话者的流程中:揭示代码任务中大语言模型的指令微调税

Shi Ying Chang, Chiok Yew Ho, Yichen Li, Yintong Huo

发表机构 * Singapore Management University(新加坡管理大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本研究首次实证发现指令微调在代码任务中导致权衡:增强指令遵循能力却削弱代码填充性能,称之为“指令微调税”,并通过定性和定量分析总结出七项发现和四项启示。

Comments 25 pages, 6 figures. Evaluation toolkit and dataset: https://github.com/arkosioscambions/CodeTalkers

详情
AI中文摘要

AI编码助手通过自动建议与用户意图一致的代码,显著提高了开发者的生产力,许多此类工具现已直接集成到集成开发环境(IDE)中。开发者以两种不同的认知模式与代码交互:流程模式和命令模式。在流程模式下,开发者需要能够直接完成或填充未完成程序中代码的工具;而在命令模式下,他们需要能够理解以自然语言指令表达的意图并将其转换为可执行代码的工具。尽管经过指令微调的大型语言模型(LLM)因其推断和满足开发者意图的能力而在许多应用场景中占据主导地位,但尚不清楚同一范式是否同样适用于不同的代码相关任务。因此,有必要理解指令微调如何影响CodeLLM作为编码助手的可行性。为填补这一空白,我们进行了首次实证研究,揭示了指令微调在编程模式之间引起的关键权衡,我们称之为“指令微调税”。我们的结果表明,指令微调并非免费的午餐:尽管经过指令微调的模型更擅长遵循指令和利用结构化指导,但这些收益往往以牺牲填充性能为代价。我们进一步通过定性和定量分析扩展了研究,包括手动失败分类、捕捉生成保真度的行为指标以及微调过程中的中间检查点评估。将我们的结果总结为七项发现和四项启示,我们的研究为AI驱动编码工具的开发提供了新视角,并强调了在指令遵循能力与有效代码生成辅助之间仔细平衡的必要性。

英文摘要

AI coding assistants have significantly improved developer productivity by automatically suggesting code that aligns with user intent, and many of these tools are now integrated directly into Integrated Development Environments (IDEs). Developers interact with code in two distinct cognitive modes: Flow and Command. While developers require tools that directly complete or infill code in unfinished programs during Flow mode, they also need tools that can comprehend intentions expressed as natural-language instructions and convert them into executable code in Command mode. Although instruction-tuned Large Language Models (LLMs) dominate many application scenarios due to their abilities to infer and fulfill developers' intents, it remains unclear whether the same paradigm is equally suitable for different code-related tasks. Therefore, it is necessary to understand how instruction tuning affects the feasibility of CodeLLMs as coding assistants. To fill this gap, we conduct the first empirical study that uncovers a key trade-off caused by instruction tuning across programming modes, which we term the Instruction-Tuning Tax. Our results show that instruction tuning is not a free lunch: although instruction-tuned models are more capable of following instructions and leveraging structured guidance, these gains often come at the cost of weaker infilling performance. We further extend our study through both qualitative and quantitative analyses, including manual failure categorization, behavioral metrics that capture generation fidelity, and intermediate-checkpoint evaluation throughout the tuning process. Summarizing our results into seven findings and four implications, our study offers a new perspective on the development of AI-powered coding tools and highlights the need to carefully balance instruction-following ability with effective code generation assistance.

2606.08815 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

推理的动量:策略优化中的密集内在信号

Hao Chen, Zhanming Shen, Liyao Li, Yanyu Chen, Xuhang Zhu, Xiaomeng Hu, Qi Zhang, Ru Peng, Xiaoyu Shen, Haobo Wang, Junbo Zhao

发表机构 * Zhejiang University(浙江大学) The Chinese University of Hong Kong(香港中文大学) Eastern Institute of Technology(宁波东方理工大学)

AI总结 针对GRPO在长链推理中因二元奖励导致的零优势崩溃和幻觉确定性失败模式,提出ISPO方法,通过内在信号密集化奖励,在三个基模型和五个数学推理基准上持续优于基线。

Comments 14 pages, 6 figures, 8 tables

详情
AI中文摘要

基于可验证奖励的强化学习已成为激发大型语言模型长链推理的强大范式。然而,现有基于组相对策略优化(GRPO)的方法依赖于二元结果奖励,这引发了两种结构性失败模式:零优势崩溃,即组内所有轨迹共享相同结果导致梯度消失;以及幻觉确定性,即模型在训练后期对错误轨迹变得过度自信。我们通过使用完全从策略自身条件概率计算的内在信号来密集化奖励,解决了这两种模式,并提出了ISPO(内在信号策略优化),它结合了衡量思考轨迹对最终答案信息量的序列级信号,以及令牌级方向性奖励,其幻觉确定性铰链惩罚关键决策令牌上的错误自信预测。在三个基模型和五个数学推理基准上,ISPO持续优于竞争基线,在零优势崩溃最频繁的最难基准上取得最大提升,训练动态诊断证实两种失败模式均被减少。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization), which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks where zero-advantage collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are reduced.
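
Zero-Advantage Collapse follows directly from how group-relative advantages are normalized. The sketch below uses the group-normalized advantage commonly associated with GRPO (reward minus group mean, divided by group standard deviation); the function name and the `eps` stabilizer are assumptions, and it does not implement ISPO's densified reward.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    # Normalize each rollout's reward against its own group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# All rollouts correct (or all wrong): every advantage is zero, so no gradient.
collapsed = group_advantages([1, 1, 1, 1])

# Mixed outcomes: the single correct rollout gets a positive advantage.
mixed = group_advantages([1, 0, 0, 0])

print(collapsed)
print(mixed)
```

With a binary reward, any group whose rollouts all share one outcome contributes nothing to the update, which is why the abstract reports the largest gains on the hardest benchmarks.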

2606.08850 2026-06-09 cs.LG cs.AI cs.CL stat.ML 交叉投稿

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

内在选择与粒子重采样:超越领域可验证性的推理时扩展

Giorgio Giannone, Mustafa Eyceoz, Shabana Baig, Shivchander Sudalairaj, Anna C. Doris, Faez Ahmed, Akash Srivastava, Kai Xu

发表机构 * MIT(麻省理工学院) Red Hat(红帽公司) IBM(IBM公司)

AI总结 提出基于并行样本集内在统计量(长度调整尾熵)的推理时扩展方法,通过后验候选排序和步骤级重采样,无需外部验证即可提升开放领域任务性能。

Comments preprint

详情
AI中文摘要

推理时扩展(ITS)在数学和编程等可验证领域取得了很大成功,其中廉价验证使得可扩展输出选择成为可能。然而,将ITS扩展到容易发生系统性失败的任务——由错误初始假设或未满足的多维约束驱动——通常依赖于昂贵的外部求解器或脆弱的基于模型的验证器。我们的关键洞察是,并行样本集的内在统计量,特别是长度调整尾熵,提供了关于解质量的稳健判别信号,而无需访问真实标签。至关重要的是,这些统计量作为自适应计算分配的难度门控,动态地将问题路由到不同的扩展规模。首先,内在选择(iS)事后对候选进行排序,在三个领域匹配基于共识的算法,并将工程设计选择性能比pass@1基线提高20%。其次,内在粒子滤波(iPF)将其推广到步骤级重采样,引导生成走向高置信度推理轨迹,在困难数学问题上平均将pass@1提高6.1个百分点。最后,粒子蒸馏(dPF)通过早期logit混合和KL引导重采样注入特权指导,引导生成绕过系统性推理错误以满足专家评分标准,在复杂临床响应上获得高达26.5%的提升。我们的流程无缝适用于通用、领域专用和多模态架构,成功将ITS扩展到开放领域,而无需训练奖励模型或精确的真实标签验证。

英文摘要

Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by faulty initial assumptions or unmet multidimensional constraints - typically relies on costly external solvers or brittle, model-based verifiers. Our key insight is that the intrinsic statistics of parallel sample sets, specifically length-adjusted tail entropy, provide a robust discriminative signal for solution quality without access to ground truth. Crucially, these statistics serve as a difficulty gate for adaptive compute allocation, dynamically routing problems across scaling regimes. First, Intrinsic Selection (iS) ranks candidates post-hoc, matching consensus-based algorithms across three domains and improving engineering design selection by 20% over pass@1 baselines. Second, Intrinsic Particle Filtering (iPF) generalizes this to step-level resampling, guiding generation toward high-confidence reasoning trajectories to improve pass@1 by 6.1 points on average on hard math problems. Finally, Particle Distillation (dPF) injects privileged guidance via early logit blending and KL-guided resampling, steering generation past systematic reasoning errors to satisfy expert rubrics, yielding up to 26.5% gains on complex clinical responses. Our pipeline applies seamlessly across broad-purpose, domain-specialized, and multimodal architectures, successfully extending ITS to open-ended domains without requiring trained reward models or exact ground-truth verification.

2606.08854 2026-06-09 cs.LG cs.AI cs.CL stat.ML 交叉投稿

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

sGPO: 在RLVR中用推理FLOPs换取训练效率

Shivchander Sudalairaj, Kai Xu, Akash Srivastava, Giorgio Giannone

发表机构 * Red Hat(红帽) IBM

AI总结 提出sGPO方法,通过少量推理计算预估查询难度,自适应分配训练预算,将训练计算量降低三倍,同时保持或提升性能。

详情
AI中文摘要

标准的可验证奖励强化学习(RLVR)训练为每个查询分配固定的展开预算,而不考虑每个查询的难度对当前策略的意义。这导致两种对称的失败模式:简单查询产生接近零的优势,因为策略已经解决了它们;而无法解决的查询不产生信号,因为策略从未解决它们。这两种情况都浪费了训练FLOPs,而没有贡献学习梯度。我们引入了排序组策略优化(sGPO),一种计算高效的策略,用少量推理FLOPs换取大量减少浪费的训练FLOPs。关键见解是,廉价的推理计算可以作为查询难度的单一离线代理。通过在初始策略下为每个查询生成一小批并行样本,我们获得了模型感知的经验成功率。这激励将训练展开组大小设置为该成功率的倒数,这是一个实用的规则,通过从每个生成的展开中提取最大优势来最大化样本效率。这一单次性能分析过程同时驱动数据过滤(移除琐碎查询和子采样无法解决的查询)、自适应组大小分配和课程构建(从易到难调度查询)。sGPO匹配或超过基线性能,同时将总训练计算量减少三倍,包括前期的推理性能分析成本。

英文摘要

Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
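
The profiling pass described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: rounding the inverse success rate up, the `max_group` cap, the action labels, and the function names are all hypothetical.

```python
import math

def plan_query(successes, samples, max_group=16):
    # Empirical success rate under the initial policy.
    p = successes / samples
    if p == 1.0:
        return ("drop_trivial", 0)
    if p == 0.0:
        # The abstract sub-samples unsolvable queries; the caller decides how many to keep.
        return ("subsample_unsolvable", max_group)
    # Group size set to the inverse success rate (rounded up and capped: assumptions).
    return ("train", min(max_group, math.ceil(1.0 / p)))

def build_curriculum(profile, samples):
    # profile maps a query id to its number of successful profiling samples.
    kept = []
    for qid, successes in profile.items():
        action, group = plan_query(successes, samples)
        if action == "train":
            kept.append((qid, successes / samples, group))
    kept.sort(key=lambda item: -item[1])  # easy (high success rate) to hard
    return [(qid, group) for qid, _, group in kept]

profile = {"q1": 8, "q2": 4, "q3": 1, "q4": 0}  # hypothetical counts out of 8 samples
curriculum = build_curriculum(profile, samples=8)
print(curriculum)
```

One pass over the profile yields the filter decision, the per-query group size, and the easy-to-hard ordering at once, which matches the three uses the abstract lists for the single profiling pass.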

2606.09030 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

TRIAGE: 基于辩证推理的不规则采样医学时间序列风险可解释预测方法

Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang

发表机构 * KAIST(韩国科学技术院) AITRICS University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 提出TRIAGE框架,利用大语言模型对竞争性临床结果生成辩证推理,缓解风险极化,实现连续风险评分与可解释推理,在三个基准上AUPRC提升3.3%,校准误差降低81%。

Comments Code is available at https://github.com/HyeongWon-Jang/TRIAGE

详情
AI中文摘要

基于电子健康记录的临床早期预警系统(其中临床观察被记录为不规则采样的医学时间序列,ISMTS)必须同时提供用于患者分诊的校准风险评分,以及临床医生可验证的可解释理由。大语言模型(LLMs)已被探索用于此任务,但它们会将分级临床风险坍缩为过度自信的二元预测。这种风险极化损害了校准性和跨患者可比性。为解决此问题,我们提出TRIAGE框架,该框架训练LLM通过引出特定结果的理由,对竞争性临床结果生成辩证推理。这种辩证公式减轻了风险极化,使单个LLM能够产生基于明确临床推理的连续风险评分。在三个ISMTS基准上评估,TRIAGE相比竞争基线实现了平均AUPRC提升3.3%,校准误差降低81%。LLM作为评判者的评估进一步表明,我们的理由在临床推理质量上比基线的事后解释高出20%。源代码可在https://github.com/HyeongWon-Jang/TRIAGE 获取。

英文摘要

Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE.

2606.09043 2026-06-09 cs.LG cs.CL 交叉投稿

DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

DynaCF: 通过动态反事实敏感性缓解奖励模型中的捷径学习

Fengyuan Liu, Yongliang Miao, Zirui He, Yanguang Liu, Fei Sun, Mengnan Du

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) New Jersey Institute of Technology(新泽西理工学院) Institute of Computing Technology, CAS(中国科学院计算技术研究所)

AI总结 提出DynaCF框架,通过在线测量反事实扰动下的边际变化和偏好翻转来动态降低捷径敏感样本的权重,从而缓解奖励模型中的捷径学习问题。

详情
AI中文摘要

从成对偏好中训练的奖励模型往往利用表面的捷径线索而非学习真正的响应质量。我们提出DynaCF,一个用于缓解奖励模型训练中捷径学习的动态重加权框架。与静态捷径启发式方法不同,DynaCF在优化过程中通过应用保持语义的反事实扰动并跟踪当前模型下产生的边际变化和偏好翻转,在线测量捷径敏感性。在Bradley-Terry目标中,具有较高捷径敏感性的样本被动态降低权重,鼓励模型较少依赖表面模式,更多依赖任务相关的偏好信号。大量实验表明,DynaCF在偏好建模中持续提高了鲁棒性。

英文摘要

Reward models trained from pairwise preferences often exploit superficial shortcut cues rather than learning true response quality. We propose DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward model training. Unlike static shortcut heuristics, DynaCF measures shortcut sensitivity online during optimization by applying semantics-preserving counterfactual perturbations and tracking the resulting margin shifts and preference flips under the current model. Samples with higher shortcut sensitivity are dynamically downweighted in the Bradley-Terry objective, encouraging the model to rely less on superficial patterns and more on task-relevant preference signals. Extensive experiments show that DynaCF consistently improves robustness in preference modeling.
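
The reweighting idea in this abstract can be sketched for a single preference pair. The abstract does not give DynaCF's exact formulas, so the way margin shifts and preference flips are combined, the exponential weight, and the `lam` parameter are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shortcut_sensitivity(margin, perturbed_margins):
    # margin: r(chosen) - r(rejected) on the original pair.
    # perturbed_margins: the same margin after semantics-preserving perturbations.
    n = len(perturbed_margins)
    shift = sum(abs(margin - m) for m in perturbed_margins) / n
    flips = sum(1.0 for m in perturbed_margins if (m > 0) != (margin > 0)) / n
    return shift + flips  # hypothetical combination of the two signals

def weighted_bt_loss(margin, sensitivity, lam=1.0):
    # Bradley-Terry negative log-likelihood, downweighted for shortcut-sensitive pairs.
    weight = math.exp(-lam * sensitivity)
    return -weight * math.log(sigmoid(margin))

robust = shortcut_sensitivity(2.0, [1.9, 2.1])    # margin barely moves
fragile = shortcut_sensitivity(2.0, [-1.0, 0.5])  # margin shifts and flips once
print(robust, weighted_bt_loss(2.0, robust))
print(fragile, weighted_bt_loss(2.0, fragile))
```

Because the sensitivity is recomputed under the current model, a pair's weight can change over training, which is what distinguishes this scheme from a static shortcut heuristic.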

2606.09052 2026-06-09 cs.LG cs.AI cs.CL cs.GT stat.ML 交叉投稿

INFUSER: Influence-Guided Self-Evolution Improves Reasoning

INFUSER: 影响力引导的自我进化提升推理能力

Siyu Chen, Miao Lu, Beining Wu, Heejune Sheen, Fengzhuo Zhang, Shuangning Li, Zhiyuan Li, Jose Blanchet, Tianhao Wang, Zhuoran Yang

发表机构 * Yale University(耶鲁大学) Stanford University(斯坦福大学) University of Chicago(芝加哥大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出INFUSER框架,通过生成器与求解器的协同进化,利用影响力分数和DuGRPO优化,从文档池中自适应生成训练数据,显著提升模型推理性能。

Comments 66 pages, 17 figures

详情
AI中文摘要

自我进化为更强的推理提供了一条可扩展的路径:预训练语言模型仅需极少的外部监督即可自我改进。然而,现有方法要么依赖于大量精心策划或教师生成的训练数据,要么在生成器无监督运行时,使用未必能改进求解器的难度启发式方法对其进行奖励。我们引入了INFUSER,一个迭代协同训练框架,包含两个共同进化的角色:一个生成器,从自动收集的非结构化文档池中起草问题并参考标准答案;一个求解器,通过在这些数据上训练来改进。求解器使用标准正确性奖励(针对生成器提供的答案)进行训练,而生成器则通过一种优化器感知的影响力分数获得奖励,该分数衡量每个提出的问题是否真正能改进求解器在目标分布上的表现。由于这种连续、有噪声的影响力分数不适合标准的GRPO,我们提出了DuGRPO,一种GRPO的双归一化变体,用于生成器训练。这些设计共同将文档池转化为一个自适应课程,倾向于对当前求解器有用的问题,而不仅仅是困难的问题。在Qwen3-8B-Base上,INFUSER在Olympiad和SuperGPQA基准测试中相对于强自我进化基线取得了超过20%的相对改进,并且一个8B的INFUSER协同进化生成器在数学和编程任务上优于冻结的32B思考生成器。消融实验证实了每个设计选择的必要性,两个扩展——将INFUSER应用于指令微调锚点并辅以规则可验证的RLVR数据——进一步展示了该框架的灵活性和泛化能力。代码可在https://github.com/FFishy-git/INFUSER获取。

英文摘要

Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.

2606.09134 2026-06-09 cs.RO cs.AI cs.CL cs.CV cs.GR 交叉投稿

From USD Scenes to Knowledge Graphs: Zero-Shot Ontology Grounding with LLMs

从USD场景到知识图谱:基于LLM的零样本本体接地

Jiangtao Shuai, Zongxiong Chen, Manfred Hauswirth, Sonja Schimmler

发表机构 * Technical University of Berlin(柏林工业大学) Fraunhofer FOKUS(弗劳恩霍夫开放通信系统研究所)

AI总结 研究利用大语言模型(LLM)零样本地将3D场景对象自动映射到本体类别,无需训练,在厨房场景中达到90-96%准确率,并揭示语义线索是关键。

Comments Accepted to the IEEE ICRA 2026 International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026

详情
AI中文摘要

从3D仿真场景构建知识图谱对于机器人任务推理至关重要,但关键瓶颈——将场景对象接地到形式本体类别——仍然依赖于手工制作的字典,这些字典脆弱且无法跨资产泛化。我们研究大语言模型(LLM)是否能够自动化通用场景描述(USD)场景的接地步骤,作为一种零样本、无需训练的替代方案。在具有SOMA-HOME本体的厨房场景(125个对象)中,LLM在描述性名称下达到90-96%的精确匹配准确率,在缩写名称下达到49-89%,显著优于字典和嵌入基线。在完全不透明名称下,上下文增强提示可恢复高达48%的准确率。特征消融表明,LLM主要利用场景图中的语义线索(兄弟名称和父路径);匿名化这些线索将准确率降至0-6%,而仅凭几何信息仅能达到4-17%。

英文摘要

Constructing knowledge graphs from 3D simulation scenes is essential for robot task reasoning, but the key bottleneck, grounding scene objects to formal ontology classes, still relies on manually curated dictionaries that are brittle and do not generalize across assets. We investigate whether large language models (LLMs) can automate this grounding step for Universal Scene Description (USD) scenes as a zero-shot, training-free alternative. On a kitchen scene (125 objects) with SOMA-HOME Ontology, LLMs achieve 90-96% exact-match accuracy with descriptive names and 49-89% with abbreviated names, substantially outperforming dictionary and embedding baselines. Under fully opaque names, context-augmented prompting recovers up to 48%. Feature ablation reveals that LLMs primarily exploit semantic cues in the scene graph (sibling names and parent paths); anonymizing these cues reduces accuracy to 0-6%, while geometry alone yields only 4-17%.

2606.09348 2026-06-09 cs.LG cs.CL 交叉投稿

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

PBSD: 特权贝叶斯自蒸馏用于长程信用分配

Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun, Lei Song, Jiang Bian, Bo Zhao

发表机构 * School of AI, Shanghai Jiao Tong University(上海交通大学人工智能学院) XYZ AI Lab(XYZ AI实验室)

AI总结 提出PBSD方法,通过贝叶斯校准的自蒸馏将稀疏最终奖励转化为细粒度步骤级信用信号,解决长程智能体任务中的信用分配问题,实验表明其提升领域内外性能并促进泛化。

详情
AI中文摘要

长程智能体任务对基于结果的强化学习提出了根本性的信用分配挑战:轨迹级奖励验证最终正确性,但很少指导哪些中间推理步骤或工具交互对结果有贡献。在多轮搜索智能体中,这一困难尤为突出,因为成功轨迹可能包含误导性动作,而失败轨迹可能包含有价值的证据收集步骤。我们提出PBSD(特权贝叶斯自蒸馏),一种在稀疏最终奖励下进行细粒度信用分配的贝叶斯校准自蒸馏方法。PBSD通过验证答案的后验与先验概率比来衡量轨迹质量,并应用贝叶斯规则将这个难以估计的答案侧比率转化为标准学生模型与特权答案条件教师模型之间的易处理似然比。对该贝叶斯证据分数的自回归分解产生轮级信号,识别每个中间轮次是支持还是破坏已验证结果。因此,PBSD提供了一种原则性且优雅的重新加权方案,将稀疏结果监督转化为贝叶斯校准的轮级信用信号,同时完全兼容标准策略优化。实验表明,PBSD在领域内和领域外设置中均持续提升性能,并有效将知识从短上下文训练迁移到长上下文推理,表明其细粒度信用分配机制促进了更有效的策略学习并带来更好的泛化。

英文摘要

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
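
The Bayes step described in this abstract can be written out as a short derivation. The notation is assumed rather than taken from the paper: $x$ is the question, $a$ the verified answer, $\tau = (\tau_1, \dots, \tau_T)$ the trajectory split into turns, $\pi_{\mathrm{stu}}$ the standard student, and $\pi_{\mathrm{tea}}$ the answer-conditioned teacher. Identifying $p(\tau \mid x, a)$ with the teacher and $p(\tau \mid x)$ with the student is the modeling assumption that makes the ratio tractable.

```latex
\frac{p(a \mid x, \tau)}{p(a \mid x)}
  = \frac{p(\tau \mid x, a)}{p(\tau \mid x)}
\quad\Longrightarrow\quad
\log \frac{p(a \mid x, \tau)}{p(a \mid x)}
  \approx \sum_{t=1}^{T} \log
    \frac{\pi_{\mathrm{tea}}(\tau_t \mid x, a, \tau_{<t})}
         {\pi_{\mathrm{stu}}(\tau_t \mid x, \tau_{<t})}
```

Under this reading, a turn with a positive log-ratio is one the answer-aware teacher finds more likely than the student does, so it counts as supporting the verified outcome, and a negative log-ratio counts against it.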

2606.09410 2026-06-09 cs.AI cs.CL 交叉投稿

Capacity, Not Format: Rethinking Structured Reasoning Failures

容量而非格式:重新思考结构化推理失败

Hengxin Fan

AI总结 研究发现结构化格式对模型性能的影响取决于其空闲容量,容量不足时通过截断和纯容量竞争两种机制导致性能下降,建议先思考后格式化。

Comments 12 pages, 3 figures

详情
AI中文摘要

先前的工作将结构化输出视为推理的代价,但这种框架是不完整的:格式化的成本强烈依赖于模型的空闲容量。通过使用信息匹配的散文控制和四级模式复杂度梯度,我们在4个模型和5个基准测试中分离了格式特定效应与提示长度混淆,成功生成的响应中解析失败率为0%。我们发现结构化格式是容量依赖的。具有足够余量的模型在吸收JSON约束时不会出现性能下降(Sonnet:MATH-Hard上JSON为$88.7\pm4.0$%,CoT为$89.3\pm1.7$%)。相反,格式会严重降低接近其极限运行的模型,通过两种不同的机制。首先,在标准token预算下,Haiku下降了36.2个百分点($p < 0.0001$),主要是由于截断。其次,即使延长预算消除了截断,GPT-4o-mini仍下降了28.0个百分点($p < 0.001$),揭示了独立于token耗尽的纯容量竞争。这种格式惩罚随模式复杂度增加(McNemar $p < 0.0001$),且不能仅由提示长度解释。此外,这些结果对前沿模型免疫的说法提出了质疑:在AIME竞赛数学中,Opus 4.7在JSON下从96.2%下降到91.0%($-5.3$个百分点;显示的百分比独立四舍五入,精确差值为$7/133 = 5.26$pp $\approx 5.3$pp)。一种延迟结构消融——在格式化之前自由推理——恢复了大部分丢失的准确率(3次运行均值:80-87%),支持了容量竞争机制。实际意义不是避免结构化输出,而是使其与容量匹配:当模型接近其极限时,先思考,后格式化。

英文摘要

Prior work treats structured output as a reasoning tax, but this framing is incomplete: the cost of formatting depends strongly on a model's spare capacity. Using information-matched prose controls and a four-level schema complexity gradient, we separate format-specific effects from prompt-length confounds across 4 models and 5 benchmarks with 0% parse failures on successfully generated responses. We find that structured formats are capacity-dependent. Models with sufficient headroom absorb JSON constraints without degradation (Sonnet: $88.7\pm4.0$% JSON vs. $89.3\pm1.7$% CoT on MATH-Hard). In contrast, formats severely degrade models operating near their limits through two distinct mechanisms. First, under standard token budgets, Haiku drops 36.2pp ($p < 0.0001$) largely due to truncation. Second, even with extended budgets eliminating truncation, GPT-4o-mini drops 28.0pp ($p < 0.001$), revealing pure capacity competition independent of token exhaustion. This format penalty scales with schema complexity (McNemar $p < 0.0001$) and cannot be explained by prompt length alone. Furthermore, these results qualify claims of frontier model immunity: on AIME competition math, Opus 4.7 drops from 96.2% to 91.0% under JSON ($-5.3$pp; the displayed percentages are independently rounded, exact difference is $7/133 = 5.26$pp $\approx 5.3$pp). A delayed-structure ablation -- reasoning freely before formatting -- recovers most of the lost accuracy (3-run mean: 80--87%), supporting the capacity competition mechanism. The practical implication is not to avoid structured output, but to match it to capacity: when a model is near its limits, think first, format later.
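
The rounding note on the AIME result can be checked with a few lines of arithmetic. The abstract gives only the percentages and the difference $7/133$; the correct-answer counts recovered below (out of 133 problems) are inferred from those figures and are not stated in the abstract.

```python
def counts_matching(pct, n):
    # All correct-answer counts k whose accuracy rounds to pct at one decimal place.
    return [k for k in range(n + 1) if round(100 * k / n, 1) == pct]

n = 133
free_counts = counts_matching(96.2, n)  # free-form reasoning
json_counts = counts_matching(91.0, n)  # JSON-constrained output

# Difference in percentage points, computed from counts rather than rounded values.
exact_gap_pp = 100 * (free_counts[0] - json_counts[0]) / n

print(free_counts, json_counts)
print(round(exact_gap_pp, 2))
```

Subtracting the rounded percentages gives 5.2 pp, while the count-based difference rounds to 5.3 pp, which is the discrepancy the abstract's parenthetical explains.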

2606.09471 2026-06-09 cs.LG cs.CL 交叉投稿

Escaping the KL Agreement Trap in On-Policy Distillation

逃离在线策略蒸馏中的KL一致陷阱

Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen, Hui Xiong

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) The Hong Kong Polytechnic University(香港理工大学) Eastern Institute of Technology, Ningbo(宁波东方理工大学)

AI总结 针对在线策略蒸馏中学生陷入低KL一致陷阱导致训练信号弱的问题,提出KAT动态终止规则,过滤弱监督,在数学基准上提升avg@k 2.66%和pass@k 3.43%,同时减少59.73%的rollout长度。

Comments 13 pages, 8 figures

详情
AI中文摘要

在线策略蒸馏(OPD)通过让教师对学生生成的rollout进行评分,提供密集的token级监督。然而,当学生漂移到不可恢复的前缀时,教师可能局部同意退化状态,产生低反向KL但几乎没有纠正训练信号。我们将这种持续状态识别为低KL一致陷阱。进一步分析表明,陷阱期间及之后的token产生的监督信号效用较低。我们提出KAT(KL一致陷阱终止),一种在线OPD终止规则,通过动态训练自适应阈值检测持续的低KL一致。通过过滤来自退化一致的弱监督,KAT在四个数学基准上将avg@k准确率提升2.66%,pass@k提升3.43%,同时将平均rollout长度减少59.73%。

英文摘要

On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.
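
A termination rule of the kind this abstract describes can be sketched as a scan over per-token reverse KL values. This is an illustration only: KAT's threshold is dynamic and training-adaptive, whereas here `threshold` and `window` are plain hypothetical arguments, and cutting at the start of the low-KL run is an assumption.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    # KL(student || teacher) at one token position; teacher entries must be > 0.
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)

def trap_cut_index(token_kls, threshold, window):
    # Return the index where a persistent low-KL run begins, or None if there is none.
    run = 0
    for i, kl in enumerate(token_kls):
        run = run + 1 if kl < threshold else 0
        if run >= window:
            return i - window + 1  # keep tokens before the run, drop the rest
    return None

kls = [0.9, 0.7, 0.8, 0.01, 0.02, 0.01, 0.01, 0.5]  # hypothetical per-token values
cut = trap_cut_index(kls, threshold=0.05, window=3)
print(cut)
```

Requiring several consecutive low-KL tokens separates a persistent agreement trap from an isolated position where student and teacher simply agree.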

2606.09508 2026-06-09 cs.AI cs.CL 交叉投稿

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

从刚性到动态:面向长上下文LLM的熵引导自适应推理

Zhanchao Xu, Haoyang Li, Qingfa Xiao, Fei Teng, Chen Jason Zhang, Lei Chen, Qing Li

发表机构 * Department of Computing, PolyU(香港理工大学计算学系) DSA, HKUST(GZ)(香港科技大学(广州)数据科学与分析学域) CSE, HKUST(香港科技大学计算机科学与工程学系)

AI总结 提出EntropyInfer框架,利用注意力熵在预填充阶段自适应分配计算资源,并在解码阶段通过生成令牌压缩KV缓存,实现长上下文LLM的高效推理。

详情
AI中文摘要

现有的用于长上下文LLM推理的稀疏注意力和KV缓存压缩方法通常应用固定的稀疏模式或跨所有注意力头的统一预算,忽略了头和上下文之间注意力行为的显著变化。我们观察到注意力头之间存在两种不同的熵模式:刚性头,其熵在输入段中保持接近零;动态头,其熵显著波动。至关重要的是,这些类型的分布是上下文相关的,无法离线预先确定。因此,我们提出了EntropyInfer,一个无需训练框架,在预填充期间使用注意力熵在单个头和段的粒度上自适应分配计算。对于解码,我们引入了一种潜在KV缓存压缩方案,该方案利用生成的输出令牌(而非仅预填充令牌)来识别和保留最关键的缓存条目。在Llama、Qwen和openPangu模型系列上的大量实验表明,EntropyInfer在包括SnapKV、AdaKV和CritiPrefill在内的基线上持续取得优势,在超过100k令牌的情况下实现了高达2.39倍的端到端加速,同时与全注意力相比质量下降最小。代码已发布在https://github.com/SHA-4096/EntropyInfer。

英文摘要

Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on the Llama, Qwen, and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39$\times$ end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is available at https://github.com/SHA-4096/EntropyInfer.
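
The rigid-versus-dynamic distinction in this abstract rests on the Shannon entropy of each head's attention distribution. The sketch below computes that entropy per input segment and labels a head; the `rigid_eps` cutoff and the rule "rigid if every segment stays below it" are assumptions, not EntropyInfer's actual criterion.

```python
import math

def attention_entropy(weights):
    # Shannon entropy (in nats) of one attention distribution.
    return -sum(w * math.log(w) for w in weights if w > 0)

def classify_head(segment_weights, rigid_eps=0.1):
    # segment_weights: one attention distribution per input segment for a single head.
    entropies = [attention_entropy(w) for w in segment_weights]
    label = "rigid" if max(entropies) < rigid_eps else "dynamic"
    return label, entropies

# Hypothetical heads: one always focused on a single token, one that spreads out.
head_a = [[0.99, 0.005, 0.005], [0.995, 0.0025, 0.0025]]
head_b = [[0.98, 0.01, 0.01], [1 / 3, 1 / 3, 1 / 3]]

print(classify_head(head_a))
print(classify_head(head_b))
```

Because the label depends on the attention weights actually observed for the current input, the classification has to happen at inference time, consistent with the abstract's point that head types cannot be fixed offline.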

2606.09672 2026-06-09 cs.AI cs.CL cs.LG cs.PF q-bio.QM 交叉投稿

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

相关性不够:嵌入人类元数据用于个体因果发现

Suraj Biswas, Saurabh Gupta, Pritam Mukherjee

发表机构 * Assessli Research(Assessli研究) Dots-In Research(Dots-In研究)

AI总结 针对预训练生物医学语言模型在跨域无关对中产生高余弦相似度(0.76-0.92)导致因果推断错误的问题,提出对比学习(提升分离度至1.63x)和BODHI硬负例挖掘(提升至2.30x),结合OpenVINO优化实现133倍加速。

Comments 20 pages, 18 figures, 9 tables

详情
AI中文摘要

询问一个预训练的生物医学语言模型“皮质醇28 ug/dL”和“股市波动”是否相关,它会返回0.83的余弦相似度(1.0表示完全相同)。两者没有共同机制。这不是个例:我们测试的所有现成生物医学编码器(BioBERT、PubMedBERT、BioM-ELECTRA)在跨域无关对上得分在0.76到0.92之间,而正确答案应接近零。跨域区分准确率为0%。检索系统可以承受这一点,因为下游语言模型会过滤噪声。但大型行为模型(LBM)——一种以人为对象而非句子的基础模型——则不能:它在用户生活图上推理,并将嵌入接近性视为两个事件因果关联的证据。虚假接近性会写入虚假因果边,所有下游都会继承错误。在这里,嵌入几何不是调节旋钮,而是正确性的关键。我们报告了修复方法。对72,034对进行对比训练,将PubMedBERT的BIOSSES相关性从0.633提升到0.828,域内与域间分离度从1.05倍提升到1.63倍。第二次训练BODHI从生物医学知识图中缺失的边挖掘硬负例,将分离度提升到2.30倍,区分差距提升到+0.392,BIOSSES代价为4.5%。在带有AMX的Intel Xeon 6737P上,OpenVINO将单查询延迟从1367毫秒降至10毫秒(133倍),达到每秒555个句子。一个发现与标准建议相悖:在此芯片上,FP16在所有服务批量大小下优于INT8,我们解释了原因。同一模型在无AMX的Ice Lake实例上运行慢13-27倍。我们发布了基准测试套件、训练语料库、BODHI生成器和OpenVINO脚本。

英文摘要

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
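
The similarity and separation figures in this abstract can be made concrete with a small calculation. Cosine similarity is standard; reading "within-vs-across-domain separation" as the ratio of mean within-domain to mean across-domain cosine similarity is an assumption, and the vectors below are made up for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def separation(within_pairs, across_pairs):
    # Assumed definition: mean within-domain similarity over mean across-domain similarity.
    within = sum(cosine(u, v) for u, v in within_pairs) / len(within_pairs)
    across = sum(cosine(u, v) for u, v in across_pairs) / len(across_pairs)
    return within / across

# Hypothetical embeddings that share one large component, so every pair looks similar.
within_pairs = [([10.0, 1.0, 0.0], [10.0, 1.0, 0.2])]
across_pairs = [([10.0, 1.0, 0.0], [10.0, 0.0, 1.0])]

print(cosine(*across_pairs[0]))
print(separation(within_pairs, across_pairs))
```

A ratio close to 1.0 means related and unrelated pairs are nearly indistinguishable by proximity alone, which is the situation the abstract reports for off-the-shelf encoders.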

2606.09707 2026-06-09 cs.LG cs.CL 交叉投稿

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

BrainSurgery:用于模型编辑和升级的可复现且可靠的声明式权重操作

Gianluca Barmina, Annemette Broch Pirchert, Andrea Blasi Núñez, Lukas Galke Poech, Peter Schneider-Kamp

发表机构 * University of Southern Denmark(南丹麦大学)

AI总结 提出BrainSurgery工具,通过声明式YAML计划实现神经网络检查点的鲁棒可复现张量操作,支持结构修改、数学变换和张量重塑,内置断言验证防止静默错误。

详情
AI中文摘要

随着深度学习模型规模的扩大,管理、检查和修改大型检查点变得越来越具有挑战性。研究人员经常需要更改模型权重以进行层重构、精度转换、低秩分解和架构调试,但这些工作流程通常依赖于脆弱的临时Python脚本。在这里,我们介绍BrainSurgery,一个用于对神经网络检查点进行鲁棒且可复现的“张量手术”的工具,并提供一个系统演示,涵盖从模型升级到LoRA提取的四个示例和三个案例研究。通过抽象存储格式和内存管理,BrainSurgery通过声明式YAML计划执行复杂的转换。它支持通过表达性正则表达式和结构定位进行结构修改、数学变换和张量重塑,同时内置断言验证张量形状、数据类型和值,以防止静默错误。我们期望BrainSurgery通过其可复现且经过验证的操作,为未来的研究提供坚实的基础。

英文摘要

As deep learning models scale, managing, inspecting, and modifying large checkpoints has become increasingly challenging. Researchers often need to alter model weights for layer restructuring, precision casting, low-rank factorization, and architectural debugging, yet these workflows often rely on fragile ad-hoc Python scripts. Here, we introduce BrainSurgery, a tool for robust and reproducible "tensor surgery" on neural network checkpoints, and provide a system demonstration covering four examples and three case studies from model upcycling to LoRA extraction. By abstracting storage formats and memory management, BrainSurgery executes complex transformations through declarative YAML plans. It supports structural modifications, mathematical transformations, and tensor reshaping through expressive regex and structural targeting, while built-in assertions validate tensor shapes, data types, and values to prevent silent errors. We envision that BrainSurgery will provide a strong foundation for future research through its reproducible and validated operations.

2507.00322 2026-06-09 cs.CL cs.AI cs.SE 版本更新

Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

干扰导致的失败:当有缺陷机制掩盖健全机制时,语言模型在平衡括号任务中出错

Daking Rai, Samuel Miller, Kevin Moran, Ziyu Yao

发表机构 * George Mason University(乔治·梅森大学) University of Central Florida(中央佛罗里达大学) Department of Computer Science(计算机科学系)

AI总结 研究揭示语言模型在平衡括号任务中出错的原因:部分组件实现可靠机制,而其他组件引入噪声,当噪声机制主导时导致错误。提出RASteer方法,通过增强可靠组件贡献,将部分模型准确率从0%提升至近100%,并在算术推理任务中取得约20%的性能提升。

Comments 23 pages, 10 figures, accepted for NeurIPS 2025

详情
AI中文摘要

尽管语言模型(LMs)在编码能力方面取得了显著进步,但在生成平衡括号等简单句法任务上仍然存在困难。在本研究中,我们调查了不同规模(124M-7B)的语言模型中这些错误持续存在的潜在机制,旨在理解和减少这些错误。我们的研究揭示,语言模型依赖于多个独立做出预测的组件(注意力头和前馈神经元)。虽然一些组件在广泛的输入范围内可靠地促进正确答案(即实现“健全机制”),但其他组件可靠性较低,通过促进错误标记引入噪声(即实现“有缺陷机制”)。当有缺陷机制掩盖健全机制并主导预测时,就会发生错误。受此启发,我们引入了RASteer,一种引导方法,用于系统地识别并增加可靠组件的贡献,以提升模型性能。RASteer在平衡括号任务上显著提升了性能,将某些模型的准确率从0%提高到接近100%,且不影响模型的一般编码能力。我们进一步展示了其在算术推理任务中的更广泛适用性,实现了高达约20%的性能提升。

英文摘要

Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing "sound mechanisms"), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing "faulty mechanisms"). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from 0% to around 100% without impairing the models' general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around 20%.

2509.24189 2026-06-09 cs.CL 版本更新

SPECTRA: Revealing the Full Spectrum of User Preferences via Distributional LLM Inference

SPECTRA:通过分布化LLM推理揭示用户偏好的全频谱

Luyang Zhang, Jialu Wang, Shichao Zhu, Beibei Li, Zhongcun Wang, Guangmou Pan, Yang Song

发表机构 * Carnegie Mellon University(卡内基梅隆大学) TikTok Inc.(TikTok公司)

AI总结 提出SPECTRA方法,将微调后的LLM视为隐式概率模型,通过探测softmax层推断用户偏好的概率分布,有效恢复长尾偏好并提升公平性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用于建模用户偏好,其典型输出是每个用户直接生成的排序项目列表。然而,这种生成范式继承了自回归解码的偏差和不透明性。它过度强调频繁(头部)偏好,并抑制少数、长尾偏好。为了解决这个问题,我们提出了SPECTRA(Softmax Probing for Extracted Category-level Token Readouts and Analysis),它将微调后的LLM视为隐式概率模型,并探测其softmax以推断语义可解释的偏好类别上的概率分布。我们在MovieLens、Yelp和一个大规模短视频平台上评估了SPECTRA。SPECTRA实现了:(i) 分布对齐,在公共数据集上将与经验偏好分布的Jensen-Shannon散度降低了38%到44%;(ii) 长尾恢复与跨用户公平性,在MovieLens上将top-3类别曝光熵提高了23%,并且对尾部偏好用户的增益大于头部偏好用户;(iii) 下游应用价值,在MovieLens和Yelp上类别NDCG提升了41%到46%,在大规模部署中,与针对头部优化的生产排序器相比,长尾类别排序提升了7倍。

英文摘要

Large Language Models (LLMs) are increasingly used to model user preferences, with the typical output as a directly-generated ranked item list per user. However, this generative paradigm inherits the bias and opacity of autoregressive decoding. It over-emphasizes frequent (head) preferences and suppresses minority, long-tail ones. To address this, we propose SPECTRA (Softmax Probing for Extracted Category-level Token Readouts and Analysis), which treats the finetuned LLM as an implicit probabilistic model and probes its softmax to infer a probability distribution over semantically interpretable preference categories. We evaluate SPECTRA on MovieLens, Yelp, and a large-scale short-video platform. SPECTRA delivers (i) distributional alignment, reducing Jensen-Shannon divergence to the empirical preference distribution by 38 to 44 percent across public datasets; (ii) long-tail recovery with cross-user fairness, raising top-3 category exposure entropy by 23 percent on MovieLens and producing a larger gain on tail-preference users than on head-preference users; and (iii) downstream application value, with a 41 to 46 percent category-NDCG boost on MovieLens and Yelp, and a 7x improvement on long-tail category ranking on a large-scale deployment against a head-optimized production ranker.
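
The softmax-probing idea and the Jensen-Shannon comparison described above can be illustrated with a minimal sketch. It assumes the next-token logits have already been restricted to a handful of category tokens; the category names, logits, and empirical distribution below are made-up toy values, not data from the paper, and the exact probing procedure in SPECTRA may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical logits for three category tokens (toy values).
category_logits = {"comedy": 2.0, "drama": 1.0, "documentary": 0.0}
probs = softmax(list(category_logits.values()))
predicted = dict(zip(category_logits, probs))

# Hypothetical empirical preference distribution for the same user.
empirical = [0.6, 0.3, 0.1]
jsd = js_divergence(probs, empirical)
print(predicted, round(jsd, 4))
```

A lower `jsd` means the probed distribution sits closer to the empirical one, which is the direction of improvement the abstract reports.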

2510.13554 2026-06-09 cs.CL cs.LG 版本更新

Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

注意力揭示大语言模型推理:预规划与锚定节奏实现细粒度策略优化

Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Alibaba Group(阿里巴巴集团)

AI总结 本文通过注意力机制揭示大语言模型推理中的预规划与锚定节奏,并据此提出三种细粒度强化学习策略,在多种推理任务上取得一致性能提升。

Comments 31 pages, 9 figures, 20 tables. Accepted at ICML 2026

详情
AI中文摘要

大语言模型的推理模式仍然不透明,强化学习通常对整个生成过程应用统一信用分配,模糊了关键步骤与常规步骤的区别。本文将注意力视为一种特权基质,它使大语言模型的内部逻辑变得可读,不仅是计算的副产品,更是推理本身的机械蓝图。我们首先区分局部和全局聚焦信息处理的注意力头,并揭示局部聚焦头在对角线附近产生锯齿状模式,指示短语块,而全局聚焦头则暴露对后续令牌具有广泛下游影响的令牌。我们用两个指标形式化这些:1)窗口平均注意力距离,衡量裁剪窗口内向后注意力的程度;2)未来注意力影响,量化令牌的全局重要性,即其从后续令牌接收的平均注意力。综合来看,这些信号揭示了一种重复的预规划与锚定机制,其中模型首先进行长距离上下文参考以生成一个引导令牌,该令牌立即跟随或与一个组织后续推理的语义锚定令牌重合。利用这些见解,我们引入了三种新颖的强化学习策略,动态地对关键节点(预规划令牌、锚定令牌及其时间耦合)进行目标信用分配,并在各种推理任务中展示了一致的性能提升。通过将优化与模型的内在推理节奏对齐,我们旨在将不透明的优化转化为可操作的结构感知过程,希望为更透明和有效的大语言模型推理优化提供潜在一步。

英文摘要

The reasoning pattern of large language models (LLMs) remains opaque, and reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish locally focused attention heads from globally focused ones and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
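
The two attention metrics named above can be sketched for a single causal attention matrix. This is one plausible reading of the abstract's wording, not the paper's exact formulas: the distance clipping via `min(t - s, window)` and the handling of the last position are assumptions, and the matrix is a toy example.

```python
def windowed_avg_attention_distance(attn, window):
    """For each query position t, the attention-weighted backward distance
    min(t - s, window) over keys s <= t (assumed reading of the metric)."""
    out = []
    for t, row in enumerate(attn):
        out.append(sum(row[s] * min(t - s, window) for s in range(t + 1)))
    return out

def future_attention_influence(attn):
    """For each key position s, the mean attention received from later
    queries t > s. The final position has no later queries and gets 0.0."""
    n = len(attn)
    out = []
    for s in range(n):
        later = [attn[t][s] for t in range(s + 1, n)]
        out.append(sum(later) / len(later) if later else 0.0)
    return out

# Toy causal attention matrix: rows sum to 1, zeros above the diagonal.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
    [0.1, 0.1, 0.8, 0.0],
    [0.7, 0.1, 0.1, 0.1],
]
waad = windowed_avg_attention_distance(attn, window=2)
fai = future_attention_influence(attn)
print(waad, fai)
```

In this toy matrix, position 3 looks far back (a large distance value) while position 0 receives the most attention from later tokens, which loosely mirrors the long-range reference and anchor roles the abstract describes.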

2511.07002 2026-06-09 cs.CL 版本更新

Automated Attribution Graph Interpretation via Probe Prompting

通过探针提示实现自动化归因图解释

Giuseppe Birardi, Gonçalo Paulo

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出探针提示方法,利用跨提示激活签名将归因图特征分组为概念对齐的超节点,并通过因果干预验证标签,在Gemma-2-2B上实现100%的预测转向行为。

Comments 35 pages, 24 figures, 18 tables. Code and interactive demo available

详情
AI中文摘要

尽管我们知道从大型语言模型(LLM)输入到输出的精确计算过程,但这种计算仍然很难解释。使这一过程更易于理解的一种方法是创建一个稀疏计算图,该图以最少的计算节点捕获模型的大部分行为。跨层转码器(CLT)分解了MLP的密集计算,但即使对于短提示,生成的电路仍然包含数千个节点。现有的自动解释方法根据语料库激活对单个特征进行标记,但这些标记通常未经因果干预验证。我们引入了探针提示,这是一种透明的基于规则的流水线,它根据特征在一小组针对概念的探针提示上的响应,将归因图的特征分组为概念对齐的超节点,这些响应总结为跨提示激活签名(CPAS)。在四个事实领域,使用Gemma-2-2B和公共CLT词典以及45,596次实体交换干预,我们发现标记的超节点在每一次干预中都具有预测的转向行为。代码、数据集和交互式演示以匿名方式发布,作为可重复使用的工具,用于根据因果干预校准超节点标签。

英文摘要

Even though we know the precise computations that lead from a large language model (LLM) input to its output, this computation remains very hard to interpret. One way to make the process easier to understand is to create a sparse computational graph that captures most of the model's behavior with the smallest number of computational nodes. Cross-layer transcoders (CLTs) decompose the dense computations of the MLP, but the resulting circuits still contain thousands of nodes even for short prompts. Existing automated interpretation methods label individual features from corpus activations, and these labels are often not validated by causal intervention. We introduce probe prompting, a transparent rule-based pipeline that groups the features of an attribution graph into concept-aligned supernodes from their responses on a small set of concept-targeted probe prompts, summarized as Cross-Prompt Activation Signatures (CPAS). Across four factual domains, on Gemma-2-2B with a public CLT dictionary and 45,596 entity-swap interventions, we find that the labeled supernodes have the predicted steering behavior in every one of them. Code, datasets, and an interactive demo are released anonymously as a reusable harness for calibrating supernode labels against causal interventions.

2511.07317 2026-06-09 cs.CL cs.LG 版本更新

RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

RLVE: 利用自适应可验证环境扩展语言模型的强化学习

Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi

发表机构 * University of Washington(华盛顿大学)

AI总结 提出RLVE方法,通过程序化生成问题并提供可验证奖励的可验证环境,自适应调整难度以扩展语言模型的强化学习,在400个环境中联合训练使六个推理基准平均提升3.37%。

Comments ICML 2026

详情
AI中文摘要

我们引入了具有自适应可验证环境的强化学习(RLVE),该方法利用可验证环境程序化生成问题并提供算法可验证的奖励,以扩展语言模型(LM)的强化学习。RLVE使得每个可验证环境能够随着训练进程动态调整其问题难度分布以适应策略模型的能力。相比之下,静态数据分布往往导致当问题对策略来说太简单或太难时学习信号消失。为了实现RLVE,我们创建了RLVE-Gym,这是一个通过手动环境工程精心开发的大规模400个可验证环境套件。使用RLVE-Gym,我们展示了环境扩展,即扩大训练环境的集合,能够持续提高泛化推理能力。在RLVE-Gym的所有400个环境中进行联合训练的RLVE,从一个最强的1.5B推理LM开始,在六个推理基准上取得了3.37%的绝对平均提升。相比之下,继续该LM的原始RL训练仅获得0.49%的平均绝对增益,尽管使用了超过3倍的计算量。我们公开发布了代码。

英文摘要

We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.
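
A verifiable environment with adaptive difficulty can be sketched as follows. Everything here is illustrative: the `AdditionEnv` class, its promotion threshold, and its single integer difficulty level are assumptions made for the example, whereas the abstract describes adapting a difficulty distribution and does not give the update rule.

```python
import random

class AdditionEnv:
    """Toy verifiable environment: difficulty d means d-digit operands."""

    def __init__(self, window=20, promote_at=0.9, seed=0):
        self.difficulty = 1
        self.window = window          # number of attempts per evaluation batch
        self.promote_at = promote_at  # accuracy needed to raise difficulty
        self.recent = []
        self.rng = random.Random(seed)

    def generate(self):
        """Procedurally generate a problem and its ground-truth answer."""
        lo, hi = 10 ** (self.difficulty - 1), 10 ** self.difficulty - 1
        a, b = self.rng.randint(lo, hi), self.rng.randint(lo, hi)
        return f"{a} + {b} = ?", a + b

    def verify(self, answer, expected):
        """Algorithmic check that returns a binary reward."""
        return 1.0 if answer == expected else 0.0

    def record(self, reward):
        """Track rewards; raise difficulty once a full batch is accurate enough."""
        self.recent.append(reward)
        if len(self.recent) >= self.window:
            accuracy = sum(self.recent) / len(self.recent)
            if accuracy >= self.promote_at:
                self.difficulty += 1
            self.recent = []

# A stand-in "policy" that always answers correctly gets promoted after one batch.
env = AdditionEnv()
for _ in range(20):
    prompt, expected = env.generate()
    env.record(env.verify(expected, expected))
print(env.difficulty)
```

Keeping problems near the edge of the policy's ability is what avoids the vanishing learning signal the abstract attributes to static data, since rewards that are all 1 or all 0 carry little information.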

2601.06599 2026-06-09 cs.CL cs.AI 版本更新

How Context Shapes Truth: Geometric Transformations of Statement-level Truth Representations in LLMs

上下文如何塑造真相:LLMs中语句级真相表示的几何变换

Shivam Adarsh, Maria Maistro, Christina Lioma

发表机构 * University of Copenhagen(哥本哈根大学)

AI总结 研究LLMs中上下文如何改变真相向量,发现早期层正交、中层收敛,上下文增加向量幅度,大模型通过方向变化区分相关与无关上下文。

Comments ACL 2026

详情
AI中文摘要

大型语言模型(LLMs)通常将语句是否为真编码为其残差流激活中的向量。这些向量,也称为真相向量,已在先前工作中被研究,然而当引入上下文时它们如何变化仍未被探索。我们通过测量(1)有上下文和无上下文时真相向量之间的方向变化($\theta$)以及(2)添加上下文后真相向量的相对幅度来研究这一问题。在四个LLM和四个数据集上,我们发现:(1)真相向量在早期层大致正交,在中层收敛,在后期层可能稳定或继续增加;(2)添加上下文通常增加真相向量的幅度,即激活空间中真与假表示之间的分离被放大;(3)较大模型主要通过方向变化($\theta$)区分相关与无关上下文,而较小模型通过幅度差异显示这种区分。我们还发现与参数知识冲突的上下文比参数对齐的上下文产生更大的几何变化。据我们所知,这是首个提供上下文如何在LLMs激活空间中变换真相向量的几何特征描述的工作。

英文摘要

Large Language Models (LLMs) often encode whether a statement is true as a vector in their residual stream activations. These vectors, also known as truth vectors, have been studied in prior work; however, how they change when context is introduced remains unexplored. We study this question by measuring (1) the directional change ($\theta$) between the truth vectors with and without context and (2) the relative magnitude of the truth vectors upon adding context. Across four LLMs and four datasets, we find that (1) truth vectors are roughly orthogonal in early layers, converge in middle layers, and may stabilize or continue increasing in later layers; (2) adding context generally increases the truth vector magnitude, i.e., the separation between true and false representations in the activation space is amplified; (3) larger models distinguish relevant from irrelevant context mainly through directional change ($\theta$), while smaller models show this distinction through magnitude differences. We also find that context conflicting with parametric knowledge produces larger geometric changes than parametrically aligned context. To the best of our knowledge, this is the first work that provides a geometric characterization of how context transforms the truth vector in the activation space of LLMs.
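
The two measurements described above, directional change and relative magnitude, reduce to an angle and a norm ratio between two vectors. The sketch below assumes a truth vector is the difference between the mean activation of true statements and that of false statements, which is a common construction but is not stated in the abstract; the 2-dimensional activations are toy values.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def truth_vector(true_acts, false_acts):
    """Difference of class means (an assumed construction for illustration)."""
    mt, mf = mean_vector(true_acts), mean_vector(false_acts)
    return [a - b for a, b in zip(mt, mf)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def angle_degrees(u, v):
    """Angle between two vectors, with the cosine clamped for numerical safety."""
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

# Toy activations at one layer, without and with context.
no_ctx = truth_vector([[1.0, 0.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
with_ctx = truth_vector([[2.0, 2.0], [2.0, 2.0]], [[0.0, 0.0], [0.0, 0.0]])

theta = angle_degrees(no_ctx, with_ctx)
relative_magnitude = norm(with_ctx) / norm(no_ctx)
print(theta, relative_magnitude)
```

Repeating this per layer would give the layer-wise profile of $\theta$ and magnitude that the abstract summarizes.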

2601.15165 2026-06-09 cs.CL cs.AI cs.LG 版本更新

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

灵活性陷阱:重新思考扩散语言模型中任意顺序的价值

Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang

发表机构 * LeapLab, Tsinghua University(清华大学Leap实验室) NLPLab, Tsinghua University(清华大学自然语言处理实验室) Tsinghua University(清华大学) Alibaba Group(阿里巴巴集团) BNRist, Tsinghua University(清华大学北京信息科学与技术国家研究中心)

AI总结 本文发现,尽管扩散语言模型(dLLMs)允许任意生成顺序,但这种灵活性可能限制其推理能力,通过采用标准的Group Relative Policy Optimization(GRPO)方法,即JustGRPO,在保持并行解码能力的同时提升了推理性能。

Comments Code and pre-trained models: https://github.com/LeapLabTHU/JustGRPO

详情
AI中文摘要

扩散大语言模型(dLLMs)打破了传统语言模型的严格左到右约束,使token生成可以按任意顺序进行。直观上,这种灵活性意味着解决方案空间严格超越了固定的自回归轨迹,理论上解锁了更强大的推理潜力。然而,在本文中,我们发现对于一般推理任务(例如数学和编程),任意顺序生成可能实际上会限制dLLMs的推理潜力。我们观察到dLLMs倾向于利用这种顺序灵活性来绕过关键探索的高不确定性token,这可能导致解决方案覆盖的过早崩溃。这一观察促使我们重新思考dLLMs的强化学习方法,其中大量的复杂性,如处理组合轨迹和不可计算的似然,通常致力于保持这种灵活性。我们证明,通过放弃任意顺序并应用标准的Group Relative Policy Optimization(GRPO)方法,即JustGRPO,可以有效地激发推理能力。我们的方法,JustGRPO,虽然简洁却出人意料地有效(例如在GSM8K上达到89.1%的准确率),同时完全保留了dLLMs的并行解码能力。项目页面:https://nzl-thu.github.io/the-flexibility-trap

英文摘要

Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. However, in this paper, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary order generation may in fact limit the reasoning potential of dLLMs. We observe that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, which can lead to a premature collapse of solution coverage. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We show that effective reasoning can be elicited by simply forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap

2602.12996 2026-06-09 cs.CL cs.AI 版本更新

Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models

知道更多,更清晰:大型语言模型中知识增强的元认知框架

Hao Chen, Ye He, Yuchun Fan, Yukun Yan, Zhenghao Liu, Qingfu Zhu, Maosong Sun, Wanxiang Che

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出元认知框架,利用内部认知信号划分知识空间为掌握、混淆和缺失区域,通过差异化干预和认知一致性机制增强知识并校准置信度,实验证明优于基线方法。

详情
AI中文摘要

知识增强显著提升了大型语言模型(LLMs)在知识密集型任务中的性能。然而,现有方法通常基于模型性能等同于内部知识的简单前提,忽略了导致过度自信错误或不确定真相的知识-置信度差距。为弥合这一差距,我们提出了一种新颖的元认知框架,通过差异化干预和对齐实现可靠的知识增强。我们的方法利用内部认知信号将知识空间划分为掌握、混淆和缺失区域,指导有针对性的知识扩展。此外,我们引入了一种认知一致性机制,以同步主观确定性与客观准确性,确保校准的知识边界。大量实验表明,我们的框架持续优于强基线,验证了其在不仅增强知识能力,而且培养更好区分已知与未知的认知行为方面的合理性。所有代码均可在 https://github.com/AI9Stars/Know-More-Know-Clearer 获取。

英文摘要

Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate that our framework consistently outperforms strong baselines, validating its rationality in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns. All code is available at https://github.com/AI9Stars/Know-More-Know-Clearer.

2603.13259 2026-06-09 cs.CL cs.AI 版本更新

How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

Transformer 如何拒绝错误答案:事实约束处理的旋转动力学

Javier Marín

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究揭示了Transformer在处理事实性问题时,隐藏状态空间中正确与错误延续路径的旋转分离现象,揭示了模型在深层结构中对错误延续的非局部化偏好。

详情
AI中文摘要

当仅解码器(decoder-only)Transformer被强制处理事实性查询的成对正确与错误单token续写时,两条路径在隐藏状态空间中以特定方式分离:从仅含查询的表示出发的位移向量保持大致相等的幅度,但方向上相互旋转分开。角度分离在中间层逐渐增大,后期层则形成不对称的结果:在错误续写的运行中,logit-lens偏好远低于等概率的朴素先验,对应于模型赋予错误token的概率约为正确token的11.5倍。我们将这一两阶段模式(中间层的旋转分离,随后是后期层的不对称承诺)刻画为模型在外部看来像是拒绝错误续写这一现象的经验几何特征,同时明确指出这只是观测性刻画,而非因果解释。该模式在六个仅解码器Transformer上保持一致,涵盖五个架构家族,参数规模从1B到13B。第七个模型(Qwen2 1.5B)在当前提取协议下呈现平坦曲线,这很可能是分词器碎片化造成的假象,而非真实的规模下限;是否存在涌现阈值的问题仍悬而未决。单层激活修补(activation patching)在任何层段都无法恢复正确token,这意味着在所用协议下,后期层的不对称性并不局限于某个离散组件。综合来看,证据支持对事实约束处理的“沿轨迹分布”解释,即几何结构在许多层中逐步累积形成,而非来自单一的局部化回路;同时,证据与最简单的单层局部化回忆解释不一致。

英文摘要

When a decoder-only transformer is forced to process matched correct and incorrect single-token continuations of a factual query, the two pathways through hidden-state space diverge in a specific way: displacement vectors from the query-only representation maintain approximately equal magnitude but rotate apart in direction. The angular separation grows through mid-depth, and late layers resolve the asymmetric outcome: a logit-lens preference that, in the incorrect run, falls far below the naive prior of equal probability, corresponding to the model assigning approximately 11.5 times more probability to the incorrect token than to the correct one. We characterize this two-phase pattern (rotational divergence in mid-depth followed by late-layer asymmetric commitment) as the empirical geometric signature of what looks externally like the model rejecting a wrong continuation, while remaining explicit that it is an observational characterization, not a causal account. The pattern is consistent across six decoder-only transformers including five architecture families from 1B to 13B parameters. A seventh model (Qwen2 1.5B) shows a flat profile under the present extraction protocol that is plausibly a tokenizer-fragmentation artefact rather than a real scale floor; the question of an emergence threshold is left open. Single-layer activation patching does not recover the correct token at any layer band, meaning the late-layer asymmetry is not localized to a discrete component under the protocol used. Taken together, the evidence is consistent with a distributed-by-trajectory account of factual constraint processing, in which geometric structure emerges cumulatively across many layers rather than from a single localized circuit, and inconsistent with the simplest single-layer localized-recall account.

2603.22473 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Component Ablation for Efficient Hybrid Language Model Architectures: Performance, Resilience, and Compression Implications

组件消融用于高效混合语言模型架构:性能、鲁棒性和压缩影响

Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

发表机构 * Doctoral Program in Computer Science, University of Valencia(瓦伦西亚大学计算机科学博士项目)

AI总结 本文通过组件消融研究混合语言模型,发现注意力机制与替代序列处理路径对性能有显著影响,揭示了模型鲁棒性与压缩优化的关键因素。

Comments 25 pages, 7 figures, 6 tables; revised title, abstract, figures, and data/code repository URL

详情
AI中文摘要

混合语言模型结合softmax注意力与线性时间序列机制,如状态空间或线性注意力层,但各组件的功能贡献尚不明确。本文在两个子10亿参数的混合语言模型Qwen3.5-0.8B和Falcon-H1-0.5B上,通过基于似然的评估、下游基准、逐层干预、随机控制和表征级诊断研究组件消融。测试结果显示,移除注意力或替代序列处理路径会显著降低性能,表明两种组件类型均对模型行为有贡献。似然指标对线性注意力或状态空间路径特别敏感,而下游基准退化取决于任务和架构。逐层消融显示组件重要性位置依赖,最强效果集中在早期或中期网络组件而非整个深度。随机移除控制进一步显示混合架构与相同家族Transformer基线在结构扰动下退化不同。这些结果表明组件消融是理解混合语言模型架构的有效诊断方法。发现为高效模型设计、压缩、鲁棒性分析和部署决策提供了相关证据。

英文摘要

Hybrid language models combine softmax attention with linear-time sequence mechanisms such as state-space or linear-attention layers, but the functional contribution of each component type remains insufficiently characterized. We study component-level ablation in two sub-1B hybrid language models, Qwen3.5-0.8B and Falcon-H1-0.5B, using likelihood-based evaluation, downstream benchmarks, layer-wise interventions, random controls, and representation-level diagnostics. Across the tested models, removing either attention or the alternative sequence-processing pathway substantially degrades performance, indicating that both component types contribute to model behavior. Likelihood metrics are especially sensitive to the linear-attention or state-space pathway, while downstream benchmark degradation depends on task and architecture. Layer-wise ablations show that component importance is position-dependent, with the strongest effects concentrated in early or mid-network components rather than uniformly across depth. Random-removal controls further show that hybrid architectures and same-family Transformer baselines degrade differently under structural perturbation. These results suggest that component ablation is a useful diagnostic for understanding hybrid language model architectures. The findings provide evidence relevant to efficient model design, compression, robustness analysis, and deployment decisions in architectures that combine attention with alternative sequence-processing mechanisms.

2605.00358 2026-06-09 cs.CL cs.CV 版本更新

From Backward Spreading to Forward Replay: Revisiting Target Construction in LLM Parameter Editing

从反向扩散到正向回放:重新审视LLM参数编辑中的目标构造

Wei Liu, Hongkai Liu, Zhiying Deng, Yee Whye Teh, Wee Sun Lee

发表机构 * University of Cambridge(剑桥大学) University of Edinburgh(爱丁堡大学)

AI总结 本文重新审视LLM参数编辑中的目标构造,提出一种更简洁的替代方法,用正向传播代替反向扩散(backward spreading),提高目标隐藏状态的准确性和兼容性。

Comments ICML 2026, code: https://github.com/jugechengzi/FE

详情
AI中文摘要

LLM参数编辑方法通常先在某个目标层计算理想的目标隐藏状态(称为锚点),再将该目标向量分配到多个前序层(通常称为反向扩散,backward spreading)以实现协同编辑。尽管这一做法已被长期广泛使用,其基础尚未得到系统研究。本文首先对其基础进行系统研究,有助于明确其能力边界、实际考虑和潜在失败模式。然后,我们提出了一种简单优雅的替代方法,用正向传播代替反向扩散:不在最后一个编辑层优化目标,而是在第一个编辑层优化锚点,然后将其向前传播,为所有后续编辑层获得准确且相互兼容的目标隐藏状态。这种方法的计算复杂度与现有方法相同,同时产生更准确的逐层目标。我们的方法简单,不干扰初始目标隐藏状态的计算,也不干扰后续编辑流程的其他组件,因此对广泛的LLM参数编辑方法有益。

英文摘要

LLM parameter editing methods commonly rely on computing an ideal target hidden state at a target layer (referred to as the anchor point) and distributing the target vector to multiple preceding layers (commonly known as backward spreading) for cooperative editing. Although this practice has long been widely used, its underlying basis has not been systematically investigated. In this paper, we first conduct a systematic study of its foundations, which helps clarify its capability boundaries, practical considerations, and potential failure modes. Then, we propose a simple and elegant alternative that replaces backward spreading with forward propagation. Instead of optimizing the target at the last editing layer, we optimize the anchor point at the first editing layer, and then propagate it forward to obtain accurate and mutually compatible target hidden states for all subsequent editing layers. This approach has the same computational complexity as existing methods while producing more accurate layer-wise targets. Our method is simple and does not interfere with either the computation of the initial target hidden state or any other component of the subsequent editing pipeline, and thus benefits a wide range of LLM parameter editing methods.

2605.19228 2026-06-09 cs.CL cs.AI cs.IT cs.LG math.IT 版本更新

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

通过分步置信度归因诊断黑盒大语言模型的多步推理失败

Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出了一种基于分步置信度归因(SCA)的方法,用于诊断黑盒大语言模型在多步推理中的失败,通过信息瓶颈原理对生成的推理轨迹进行置信度评估,并通过实验验证该方法在数学推理和多跳问答任务中的有效性。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型通过生成分步解决方案在具有客观答案的推理任务中实现了强大的性能,但诊断多步推理轨迹可能失败的位置仍然困难。置信度估计提供了一种诊断信号,但现有方法受限于最终答案或需要内部模型访问。在本文中,我们引入了分步置信度归因(SCA),一种适用于封闭源LLM的框架,该框架仅基于生成的推理轨迹分配步骤级置信度。SCA应用信息瓶颈原理:与正确解决方案中的一致结构对齐的步骤获得高置信度,而偏差则被标记为可能错误。我们提出了两种互补的方法:(1)NIBS,一种非参数化的IB方法,用于测量一致性而无需图结构,以及(2)GIBS,一种基于图的IB模型,通过可微分掩码学习子图以捕捉逻辑变化。在数学推理和多跳问答任务上的大量实验表明,SCA能够可靠地识别与推理错误高度相关的低置信度步骤。此外,使用步骤级置信度指导自我修正,比使用答案级反馈提高了13.5%的修正成功率。

英文摘要

Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.

2605.30407 2026-06-09 cs.CL cs.AI cs.IR cs.LG 版本更新

Exploring Autonomous Agentic Data Engineering for Model Specialization

探索用于模型专业化的自主智能体数据工程

Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng

发表机构 * Zhejiang University(浙江大学) Platform and Content Group, Tencent(腾讯平台与内容部)

AI总结 本文提出自主智能体数据工程任务,让LLM作为自主数据工程师,通过端到端数据策划驱动模型专业化,实验显示GPT-5.2通过迭代数据适应使学生模型性能提升57.29%。

Comments Work in progress

详情
AI中文摘要

大型语言模型(LLM)在通用任务上表现出色,但往往难以适应没有高质量领域特定数据的专业领域。现有的基于LLM的数据策划方法主要依赖人工设计的工作流程,尚未检验LLM能否自主执行端到端的数据工程流水线以实现模型专业化。我们形式化了自主智能体数据工程,这是一个新任务,旨在评估LLM作为自主数据工程师,通过端到端数据策划驱动模型专业化。我们将数据视为可优化组件,研究能够跨多个领域规划、生成和迭代优化训练数据的智能体,并以训练后性能提升为指导。实验表明,自主LLM数据工程师带来了显著收益,GPT-5.2构建的训练课程使学生模型性能提升了57.29%,完全通过迭代的智能体驱动数据适应实现。通过揭示潜力和瓶颈,我们的研究将自主数据工程确立为一种可衡量的能力,并为智能体驱动的模型专业化指明了道路(代码将在 https://github.com/zjunlp/DataAgent 发布)。

英文摘要

Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization (Code will be released at https://github.com/zjunlp/DataAgent).

2606.02780 2026-06-09 cs.CL 版本更新

Do Value Vectors in Deep Layers Need Context from the Residual Stream?

深层中的值向量是否需要来自残差流的上下文?

Muyu He, Yuchen Liu, Qingya Huang, Li Zhang

发表机构 * Independent(独立) Drexel University(德雷塞尔大学)

AI总结 研究通过提出Bank of Values(BoV)方法,在深层注意力层中使用无上下文的值向量来保留原始token信息,从而提升模型性能并减少计算和内存开销。

Comments 13 pages, 5 figures. Code: https://github.com/RiddleHe/nanochat

详情
AI中文摘要

Transformer架构作为现代LLM骨干的成功在很大程度上归功于其使用注意力层。注意力层遵循标准神经网络范式:以残差流为输入,从而产生上下文相关的查询、键和值向量。然而,我们发现当深层学习仅保留原始token信息的无上下文值向量,而不利用残差流中的任何上下文时,模型性能有显著提升。当模型可以访问这种无上下文的值向量时,添加回上下文相关的组件对整体基准性能几乎没有额外益处。这种无上下文的值向量可以作为稀疏模型参数存储,无需重新计算或持久缓存这些值。通过对这种无上下文值向量的关键设计选择进行系统消融,我们提出了Bank of Values(BoV),这是一种通过为最后三分之一的每一层学习一个token特定值向量的查找表来计算注意力中值向量的新方法。在135M和780M模型上,BoV相比标准注意力提升了验证损失,并且在780M模型上,在21个基准测试的平均得分上匹配了之前最佳方法(该方法以更少的计算和内存向值向量添加token信息)。

英文摘要

The success of the transformer architecture as the backbone of modern LLMs is in large part due to its use of attention layers. An attention layer follows the standard neural network paradigm: it takes the residual stream as input and thereby produces context-dependent query, key, and value vectors. However, we find that model performance meaningfully improves when deeper layers learn only a context-free value vector to preserve the original token information, without drawing on any context from the residual stream. When the model has access to this context-free value vector, adding back the context-dependent component provides little additional benefit for aggregate benchmark performance. Such context-free value vectors can be stored as sparse model parameters, eliminating the need to recompute or persistently cache these values. Through systematic ablations on the key design choices for such context-free value vectors, we propose Bank of Values (BoV), a new way of computing value vectors in attention by learning a lookup table of token-specific value vectors for each of the last third of layers. Across 135M and 780M models, BoV improves validation loss over standard attention and, at 780M, the average score across 21 benchmarks, matching the previous best method that adds token information to the value vector with less compute and memory.
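
The contrast between context-dependent and context-free value vectors can be sketched for one attention query. This is a hedged illustration of the lookup-table idea only: the `value_bank` dictionary, its 2-dimensional vectors, and the uniform attention scores are toy assumptions, and the sketch says nothing about how BoV is trained or how queries and keys are computed in the actual models.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(scores_row, values):
    """Attention output for one query: softmax-weighted sum of value vectors."""
    weights = softmax(scores_row)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical learned table for one deep layer: one value vector per token id.
value_bank = {
    0: [1.0, 0.0],
    1: [0.0, 1.0],
    2: [0.5, 0.5],
}

token_ids = [2, 0, 1]
# Context-free values: looked up by token id, not projected from the residual stream.
values = [value_bank[t] for t in token_ids]

# Toy attention scores for the last position (uniform, for easy checking).
scores_last = [0.0, 0.0, 0.0]
output = attend(scores_last, values)
print(output)
```

Because the values depend only on token identity, they need no per-sequence projection or cache entry, which is the compute and memory saving the abstract points to.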

2606.04109 2026-06-09 cs.CL 版本更新

Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

话语角色标签作为语言模型上下文使用的呈现时间变量

Jianguo Zhu, Xiangmei Li, Wenjie Liu

发表机构 * GitHub

AI总结 通过固定内容探针实验,研究不同话语角色标签(如Instruction、Reference、Example)如何影响语言模型对误导信息的采纳率,发现标签可导致采纳率变化56-84个百分点,并建议上下文利用和RAG基准应报告和控制包装标签。

Comments Revised version with updated author information, added clean baselines, clarified evaluation metrics, and tightened discussion of context-augmented settings

详情
AI中文摘要

上下文增强的语言模型系统通常用Reference:、Evidence:、Instruction:、Note:或Example:等标签包装提供的内容,但这些标签对读者模型行为的影响尚未充分探索。我们引入了一个配对固定内容探针,涵盖500个MMLU-Pro项目:每个项目在不同话语角色标签下接收相同的误导性答案断言,并通过模型是否输出注入的错误选项来衡量采纳率。在GPT-5.5、DeepSeek V4 Pro、Llama-3-8B-Instruct和Qwen2.5-7B-Instruct上,误导采纳率变化了56-84个百分点。绑定或来源类标签(如Instruction:和Reference:)导致高采纳率,而Example:则持续抑制采纳率。配对检验、bootstrap区间、最终指令消融和Qwen最终步对数概率探针支持标签条件化的候选偏好。边界探针显示了效果减弱或持续的位置:算术任务降低采纳率,段落形状的外部上下文保持较小的标签差距,短答案评估排除了选项字母复制,嵌套标签冲突表明说明性框架可以限制采纳范围。一项200例单作者人工审核确认,在保守裁决下短答案对比是稳定的。由此得出的结论有限但实用:上下文利用和读者端RAG基准应报告并控制包装标签,因为呈现选择可以改变对提供上下文的测量依赖。

英文摘要

Context-augmented language model systems often wrap supplied content with labels such as Reference:, Evidence:, Instruction:, Note:, or Example:, but the effect of these labels on reader-model behavior remains underexplored. We introduce a paired fixed-content probe over 500 MMLU-Pro items: each item receives the same misleading answer-bearing assertion under different discourse-role labels, and adoption is measured by whether the model outputs the injected wrong option. Across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct, Misleading Adoption Rate shifts by 56-84 percentage points. Binding or source-like labels such as Instruction: and Reference: produce high adoption, whereas Example: consistently suppresses it. Paired tests, bootstrap intervals, final-instruction ablations, and Qwen final-step log-probability probes support a label-conditioned candidate preference. Boundary probes show where the effect weakens or persists: arithmetic tasks reduce adoption, passage-shaped external context preserves smaller label gaps, short-answer evaluation rules out option-letter copying, and nested-label conflicts suggest that illustrative framing can delimit adoption scope. A 200-case single-author manual audit confirms that the short-answer contrasts are stable under conservative adjudication. The resulting claim is bounded but practical: context-utilization and reader-side RAG benchmarks should report and control wrapper labels, because presentation choices can change measured reliance on supplied context.
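The adoption metric described above can be sketched as follows, assuming each probe result is recorded as a `(label, model_answer, injected_wrong_option)` tuple. The record layout and function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def misleading_adoption_rate(records):
    """Fraction of items, per label, where the model output the injected wrong option."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for label, answer, injected in records:
        totals[label] += 1
        hits[label] += int(answer == injected)
    return {label: hits[label] / totals[label] for label in totals}

def label_gap_pp(rates):
    """Spread between the most and least adopted label, in percentage points."""
    return 100.0 * (max(rates.values()) - min(rates.values()))
```

Since every item carries the same misleading assertion under each label, a large gap can be attributed to the wrapper label rather than to the content.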

2606.06915 2026-06-09 cs.CL cs.AI cs.LG 版本更新

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

ThinkBooster: 一种用于LLM推理无缝测试时扩展的统一框架

Vladislav Smirnov, Chieu Nguyen, Sergey Senichev, Minh Ngoc Ta, Ekaterina Fadeeva, Artem Vazhentsev, Daria Galimzianova, Nikolai Rozanov, Viktor Mazanov, Jingwei Ni, Tianyi Wu, Igor Kiselev, Mrinmaya Sachan, Iryna Gurevych, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

发表机构 * MBZUAI ETH Zürich(苏黎世联邦理工学院) Imperial College London(伦敦帝国理工学院) NUS(新加坡国立大学) Accenture(埃森哲) Innopolis University(因诺普里斯大学) Independent Researcher(独立研究者)

AI总结 提出ThinkBooster框架,通过模块化库、联合评估基准和可部署代理服务,实现LLM推理的测试时计算扩展,在数学和编码任务上验证了性能-计算权衡。

详情
AI中文摘要

测试时计算(TTC)扩展已成为一种强大的范式,通过在推理期间分配额外计算(例如,通过多样本生成和基于验证器的重新排序)来改进大型语言模型(LLM)推理。现有的TTC扩展策略和推理评分器仍然碎片化,在不一致的协议下进行评估,并且很少通过质量-成本权衡的视角进行分析。我们引入了ThinkBooster,一个用于LLM推理无缝测试时计算扩展的统一框架,它包括(i)一个模块化的Python库,实现了最先进的TTC扩展策略和评分器家族,(ii)一个联合评估性能和计算效率的基准,以及(iii)一个可部署的、兼容OpenAI的代理服务,使得将自适应推理无缝集成到实际应用中成为可能。我们还提供了一个演示可视化调试器,用于检查推理轨迹、中间选择决策和替代推理路径。在数学和编码任务上的实证结果揭示了TTC扩展策略和评分方法的性能-计算权衡,并表明ThinkBooster在实际任务中提供了实际收益。代码以MIT许可证在线提供。

英文摘要

Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
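As a rough illustration of the multi-sample generation and verifier-based reranking mentioned above, the sketch below implements generic best-of-N selection. The `generate` and `score` callables are hypothetical placeholders supplied by the caller; this is not the ThinkBooster API.

```python
def best_of_n(prompt, generate, score, n=4):
    """Sample n candidates and return the one the scorer ranks highest.

    generate(prompt, i) -> str   (hypothetical sampler, i is the sample index)
    score(prompt, text) -> float (hypothetical verifier or scorer)
    """
    candidates = [generate(prompt, i) for i in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

The compute cost grows linearly with `n` (n generation calls plus n scoring calls), which is the kind of quality-cost trade-off the benchmark is described as measuring.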

2409.15723 2026-06-09 cs.LG cs.CL 版本更新

Federated Large Language Models: Current Progress and Future Directions

联邦大语言模型:当前进展与未来方向

Yuhang Yao, Jianyi Zhang, Junda Wu, Chengkai Huang, Yu Xia, Tong Yu, Ruiyi Zhang, Sungchul Kim, Ryan Rossi, Ang Li, Lina Yao, Julian McAuley, Yiran Chen, Carlee Joe-Wong

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Duke University(杜克大学) University of California San Diego(加州大学圣地亚哥分校) The University of New South Wales(新南威尔士大学) Adobe Research(Adobe研究院) University of Maryland College Park(马里兰大学帕克分校) CSIRO’s Data61(澳大利亚联邦科学与工业研究组织Data61)

AI总结 本文综述联邦学习与大语言模型结合(FedLLM)的最新进展,重点分析联邦微调和联邦提示学习如何应对效率、个性化和安全挑战,并展望联邦预训练和联邦智能体等方向。

Comments Accepted by PAKDD 2026

详情
AI中文摘要

大语言模型在各种应用中取得了令人印象深刻的性能,但其训练通常依赖于集中式数据收集,引发了严重的隐私和治理问题。联邦学习通过使多个客户端能够协作训练共享模型而不暴露原始本地数据,提供了一种去中心化的替代方案。然而,将联邦学习与大语言模型集成带来了新的挑战,包括数据异质性、收敛不稳定性、通信开销和计算约束。本综述提供了联邦学习用于大语言模型(FedLLM)的全面且最新的概述。我们系统地回顾了近期进展,特别强调联邦微调和联邦提示学习,并分析了现有方法如何应对效率、个性化和安全挑战。我们进一步总结了新兴方向,如联邦预训练和联邦智能体。我们的目标是提供对这个快速发展领域的结构化视角,并突出未来研究的有前景的途径。

英文摘要

Large Language Models have achieved impressive performance across diverse applications, yet their training typically depends on centralized data collection, raising serious privacy and governance concerns. Federated Learning offers a decentralized alternative by enabling multiple clients to collaboratively train shared models without exposing raw local data. However, integrating FL with LLMs introduces new challenges, including data heterogeneity, convergence instability, communication overhead, and computational constraints. This survey provides a comprehensive and up-to-date overview of Federated Learning for Large Language Models (FedLLM). We systematically review recent advances, with particular emphasis on federated fine-tuning and federated prompt learning, and analyze how existing methods address efficiency, personalization, and security challenges. We further summarize emerging directions such as federated pre-training and federated agents. Our goal is to offer a structured perspective on this rapidly evolving field and to highlight promising avenues for future research.

2506.06295 2026-06-09 cs.LG cs.AI cs.CL 版本更新

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

dLLM-Cache:基于自适应缓存的扩散大语言模型加速

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyan Wei, Shaobo Wang, Yichen Zhu, Linfeng Zhang

发表机构 * Zhejiang University(浙江大学)

AI总结 针对扩散大语言模型推理延迟高的问题,提出一种无需训练的自适应缓存框架dLLM-Cache,通过长间隔提示缓存和基于特征相似性的部分响应更新,实现高效中间计算复用,在保持输出质量的同时大幅降低FLOPs。

Comments Accepted by ICML 2026

详情
AI中文摘要

自回归模型长期以来主导了大语言模型领域。最近,一种基于扩散的大语言模型(dLLMs)的新范式出现,它通过迭代去噪掩码段来生成文本。这种方法显示出显著的优势和潜力。然而,dLLMs存在高推理延迟的问题。传统的自回归模型加速技术,如键值缓存,由于dLLMs的双向注意力机制而无法兼容。为了应对这一特定挑战,我们的工作首先基于一个关键观察:dLLM推理涉及一个静态提示和一个部分动态的响应,其中大多数标记在相邻去噪步骤中保持稳定。基于此,我们提出了dLLM-Cache,一种无需训练的自适应缓存框架,它结合了长间隔提示缓存和基于特征相似性的部分响应更新。这种设计能够在不影响模型性能的情况下高效重用中间计算。在代表性dLLMs(包括LLaDA 8B和Dream 7B)上的大量实验表明,dLLM-Cache在LongBench-HotpotQA上实现了高达9.1倍的FLOPs减少,同时保持了具有竞争力的输出质量。值得注意的是,我们的方法使dLLM推理延迟在许多设置下接近自回归模型。本工作的代码公开于:https://github.com/maomaocun/dLLM-cache。

英文摘要

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x FLOPs reduction on LongBench-HotpotQA while maintaining competitive output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. The code for this work is publicly available at: https://github.com/maomaocun/dLLM-cache.
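One way to picture the similarity-guided partial response update is sketched below: compare each response token's current feature with its cached feature and recompute only the least similar fraction. The function name, the use of cosine similarity, and the `update_ratio` parameter are assumptions made for illustration; the paper's actual selection rule may differ.

```python
import numpy as np

def select_tokens_to_update(cached_feats, new_feats, update_ratio=0.25):
    """Return sorted indices of the response tokens whose features changed the most.

    cached_feats, new_feats: (T, d) per-token features from the previous and
    current denoising step (hypothetical inputs).
    """
    cached = np.asarray(cached_feats, dtype=float)
    new = np.asarray(new_feats, dtype=float)
    dots = (cached * new).sum(axis=-1)
    norms = np.linalg.norm(cached, axis=-1) * np.linalg.norm(new, axis=-1)
    sim = dots / (norms + 1e-8)  # cosine similarity per token
    k = max(1, int(round(update_ratio * sim.shape[0])))
    return np.sort(np.argsort(sim)[:k])  # lowest similarity first
```

Tokens outside the returned set would reuse their cached intermediate results, while prompt tokens would be refreshed only at long intervals, as the abstract describes.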

2506.10341 2026-06-09 cs.LG cs.CL 版本更新

Formalizing Learning from Language Feedback with Provable Guarantees

从语言反馈中学习的形式化与可证明保证

Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 本文形式化语言反馈学习问题,提出迁移eluder维度(transfer eluder dimension)刻画学习难度,并开发无遗憾算法HELiX,证明其性能保证,展示丰富语言反馈可指数级加速学习。

Comments ICML 2026

详情
AI中文摘要

通过观察和语言反馈进行交互式学习是一个日益受到关注的领域,其驱动力来自大型语言模型(LLM)智能体的出现。尽管有令人印象深刻的实证演示,但迄今为止,这些决策问题的原则性框架仍然缺乏。我们形式化了语言反馈学习(LLF)问题,提出了足以在奖励不可见的情况下实现学习的假设,并引入了$\textit{迁移eluder维度}$(transfer eluder dimension)作为衡量LLF难度的指标。我们形式化了“语言反馈中的信息决定学习复杂性”这一直觉,并展示了从丰富语言反馈中学习可以比从奖励中学习指数级更快的案例。我们开发了一种名为$\texttt{HELiX}$的无遗憾算法,通过顺序交互可证明地解决LLF问题,其性能保证随迁移eluder维度缩放。在多个实证领域,我们展示了即使重复提示LLM不可靠时,$\texttt{HELiX}$也能表现良好。我们的贡献标志着朝着使用通用语言反馈设计原则性交互学习算法迈出了重要一步。

英文摘要

Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. Despite impressive empirical demonstrations, so far a principled framing of these decision problems remains lacking. We formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a measure to characterize the hardness of LLF. We formalize the intuition that information in the language feedback governs the learning complexity, and demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark an important step towards designing principled interactive learning algorithms using generic language feedback.

2507.09751 2026-06-09 cs.AI cs.CL cs.LO 版本更新

Sound and Complete Neurosymbolic Reasoning with LLM-Grounded Interpretations

基于LLM解释的可靠且完备的神经符号推理

Bradley P. Allen, Prateek Chhikara, Thomas Macaulay Ferguson, Filip Ilievski, Paul Groth

发表机构 * University of Amsterdam(阿姆斯特丹大学) University of Southern California(南加州大学) Rensselaer Polytechnic Institute(伦斯勒理工学院) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学)

AI总结 提出将LLM直接集成到次协调逻辑的语义解释函数中,实现可靠且完备的神经符号推理,在GPQA和SimpleQA基准上宏F1提升约6个百分点,并成功检测药物安全知识库中的矛盾。

Comments 43 pages, 14 tables, 4 figures. Accepted to the 19th Conference on Neurosymbolic Learning and Reasoning (NeSy 2025); to appear in the Neurosymbolic Artificial Intelligence Special Issue on NeSy 2025 Extended Papers

详情
AI中文摘要

大型语言模型(LLM)在自然语言理解和生成方面展现了令人印象深刻的能力,但在输出中表现出逻辑一致性问题。尽管LLM存在不一致性,我们如何在形式推理中利用其覆盖面广的参数化知识?我们提出了一种方法,将LLM直接集成到次协调逻辑的形式语义的解释函数中。我们使用从短形式事实性基准GPQA和SimpleQA导出的数据集对方法进行实证评估,显示双边事实性评估在两个基准上的宏F1比单边基线提高了约6个百分点(以覆盖率下降为代价,因为在不一致或不确定的情况下会触发弃权)。我们进一步描述了一个实现该方法的概念验证tableau推理器,并将其应用于包含228条断言语句和712条推断语句的药物安全知识库:系统检测到92个真值过剩(glut),它们对应于医学上的重大错误(例如,阿片类药物被推断为非成瘾性,β受体阻滞剂被推断为在哮喘中安全),同时保持可满足性,表明矛盾被局部化而不是导致逻辑爆炸。与先前工作不同,我们的方法为神经符号推理提供了一个带有实际实现的理论框架,在利用LLM知识的同时保留底层逻辑的可靠性和完备性。

英文摘要

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but exhibit problems with logical consistency in their output. How can we harness LLMs' broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic. We evaluate the method empirically using datasets derived from the short-form factuality benchmarks GPQA and SimpleQA, showing that bilateral factuality evaluation improves macro-F1 over a unilateral baseline by roughly 6 percentage points on both benchmarks (at the cost of reduced coverage, as abstention is triggered on inconsistent or uncertain cases). We further describe a proof-of-concept tableau reasoner implementing the method, and apply it to a medication-safety knowledge base of 228 asserted and 712 inferred statements: the system detects 92 gluts corresponding to medically significant errors (e.g., opioids inferred as non-addictive, beta-blockers inferred as safe in asthma) while remaining satisfiable, demonstrating that contradictions are localized rather than causing logical explosion. Unlike prior work, our method offers a theoretical framework with a practical implementation for neurosymbolic reasoning that leverages an LLM's knowledge while preserving the underlying logic's soundness and completeness properties.

2509.10534 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

解耦“什么”和“哪里”:极坐标位置嵌入

Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer

发表机构 * DeepMind, London, UK(DeepMind,英国伦敦)

AI总结 提出极坐标位置嵌入(PoPE)以解耦Transformer注意力机制中的内容和位置,在诊断任务、序列建模和语言模型中优于RoPE,并展现零样本长度外推能力。

Comments ICML 2026 camera-ready version

详情
AI中文摘要

Transformer架构中的注意力机制根据内容(“什么”)和序列中的位置(“哪里”)将键匹配到查询。我们提出一项分析,表明在流行的RoPE旋转位置嵌入中,“什么”和“哪里”是纠缠的。这种纠缠会损害性能,特别是当决策需要在这两个因素上独立匹配时。我们提出对RoPE的改进,称为极坐标位置嵌入(PoPE),它消除了“什么-哪里”的混淆。PoPE在仅通过位置或内容进行索引的诊断任务上表现远优于基线。在音乐、基因组和自然语言领域的自回归序列建模中,使用PoPE作为位置编码方案的Transformer在评估损失(困惑度)和下游任务性能上优于使用RoPE的基线。在语言建模中,这些优势在模型规模从124M到774M参数时持续存在。关键的是,与RoPE甚至专为外推设计的方法YaRN(需要额外微调和频率插值)相比,PoPE展现出强大的零样本长度外推能力。

英文摘要

The attention mechanism in a Transformer architecture matches key to query based on both content -- the what -- and position in a sequence -- the where. We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities compared not only to RoPE but even a method designed for extrapolation, YaRN, which requires additional fine tuning and frequency interpolation.

2509.12760 2026-06-09 cs.LG cs.CL 版本更新

Similarity-Distance-Magnitude Activations

相似度-距离-幅度激活函数

Allen Schmaltz

发表机构 * Reexpress AI

AI总结 本文提出SDM激活函数,通过引入相似度和距离意识提升softmax的鲁棒性和可解释性,并通过密集匹配实现基于实例的可解释性。SDM估计器通过数据驱动的CDF分区控制分类准确性,优于现有校准方法。

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026. 21 pages, 8 tables, 1 algorithm. arXiv admin note: substantial text overlap with arXiv:2502.20167

详情
AI中文摘要

我们引入了相似度-距离-幅度(SDM)激活函数,它是标准softmax激活函数的一种更稳健、更可解释的形式:在已有的输出幅度(即决策边界)感知之上,增加了相似度(即正确预测与训练数据的深度匹配)感知和到训练分布的距离感知,并通过密集匹配实现基于示例的可解释性。我们进一步引入了SDM估计器,它基于SDM激活对各类别经验CDF进行数据驱动的划分,用于控制选择性分类中按类别和按预测条件的准确率。当作为预训练语言模型的最终层激活用于选择性分类时,SDM估计器对协变量偏移和分布外输入的鲁棒性优于使用softmax激活的现有校准方法,同时在分布内数据上仍保持信息量。

英文摘要

We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.

2510.06052 2026-06-09 cs.AI cs.CL 版本更新

MixReasoning: Switching Modes to Think

MixReasoning: 切换模式以思考

Haiquan Lu, Gongfan Fang, Xinyin Ma, Qi Li, Xinchao Wang

发表机构 * arXiv

AI总结 提出MixReasoning框架,动态调整推理深度,对困难步骤详细推理、简单步骤简洁推理,在GSM8K、MATH-500和AIME上缩短推理长度并提高效率,不牺牲准确性。

详情
AI中文摘要

推理模型通过逐步解决问题、将问题分解为子问题并在生成答案前探索长思维链来提升性能。然而,对每一步都应用扩展推理会引入大量冗余,因为子问题的难度和复杂度差异很大:少数关键步骤对最终答案真正具有挑战性和决定性,而许多其他步骤仅涉及简单的修正或计算。因此,一个自然的想法是赋予推理模型自适应应对这种变化的能力,而不是对所有步骤采用相同的详细程度。为此,我们提出了MixReasoning,一个在单个响应中动态调整推理深度的框架。由此产生的思维链成为困难步骤的详细推理与简单步骤的简洁推理的混合。在GSM8K、MATH-500和AIME上的实验表明,MixReasoning缩短了推理长度,显著提高了效率,且不牺牲准确性。

英文摘要

Reasoning models enhance performance by tackling problems in a step-by-step manner, decomposing them into sub-problems and exploring long chains of thought before producing an answer. However, applying extended reasoning to every step introduces substantial redundancy, as sub-problems vary widely in difficulty and complexity: a small number of pivotal steps are genuinely challenging and decisive for the final answer, while many others only involve straightforward revisions or simple computations. Therefore, a natural idea is to endow reasoning models with the ability to adaptively respond to this variation, rather than treating all steps with the same level of elaboration. To this end, we propose MixReasoning, a framework that dynamically adjusts the depth of reasoning within a single response. The resulting chain of thought then becomes a mixture of detailed reasoning on difficult steps and concise inference on simpler ones. Experiments on GSM8K, MATH-500, and AIME show that MixReasoning shortens reasoning length and substantially improves efficiency without compromising accuracy.

2601.02880 2026-06-09 cs.AI cs.CL 版本更新

ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

ReTreVal:带有验证和跨问题记忆的推理树

Abhishek HS, Pavan C Shekar, Arpit Jain, Ashwanth Krishnan

发表机构 * QpiAI

AI总结 ReTreVal通过自适应树探索、带工具增强的节点细化、类型化失败回溯和自修改记忆,使大语言模型在无需微调的情况下实现跨问题学习,其在MATH-500上达到85.8%的pass@1准确率,在MMLU-Pro上达到54.4%的准确率。

Comments 15 pages, 1 figure, 12 tables

详情
AI中文摘要

现有的每一种推理时推理框架都会在问题边界丢弃所有失败上下文,使模型在解第500题时并不比解第1题时更聪明。我们提出了ReTreVal(带有验证的推理树),这是一个无需训练的框架,通过带工具增强节点细化的自适应树探索、将分类错误上下文注入恢复分支的类型化失败回溯,以及跨问题积累并修订策略条目的自重写记忆来弥补这一差距,从而在任何固定、未修改的LLM上无需微调即可实现推理时的跨问题学习。ReTreVal在MATH-500上达到85.8%的pass@1准确率(比零样本CoT高8.6个百分点,比最强基线Self-Refine高8.6个百分点),在MMLU-Pro上达到54.4%的准确率(比Self-Refine高15.3个百分点),3.4:1的改进与退步之比证实了这是真正的错误恢复而非噪声。这些以往需要梯度更新才能获得的能力,使32B模型能够与大得多的单次通过系统竞争。

英文摘要

Every existing inference-time reasoning framework discards all failure context at problem boundaries, leaving a model solving problem 500 no wiser than it was on problem 1. We present ReTreVal (Reasoning Tree with Validation), a training-free framework that closes this gap through adaptive tree exploration with tool-augmented node refinement, typed-failure backtracking that injects categorized error context into the recovered branch, and a self-rewriting memory that accumulates and revises strategy entries across problems, enabling inference-time cross-problem learning on any fixed, unmodified LLM without fine-tuning. ReTreVal achieves 85.8% pass@1 on MATH-500 (+8.6 pp over Zero-Shot CoT, +8.6 pp over the strongest baseline Self-Refine) and 54.4% on MMLU-Pro (+15.3 pp over Self-Refine), with a 3.4:1 win-to-regression ratio confirming genuine error recovery rather than noise. These capabilities, previously requiring gradient updates, allow a 32B model to compete with much larger single-pass systems.

2601.09085 2026-06-09 cs.LG cs.AI cs.CL cs.IR 版本更新

MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

MMR-GRPO:通过多样性感知奖励重加权加速GRPO风格训练

Kangda Wei, Ruihong Huang

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 提出MMR-GRPO方法,利用最大边际相关性根据完成多样性重加权奖励,减少冗余样本,加速GRPO训练,在保持性能的同时平均减少47.9%训练步数和70.2%时间。

详情
AI中文摘要

组相对策略优化(GRPO)已成为训练数学推理模型的标准方法;然而,它对每个提示依赖多个完成,使得训练计算成本高昂。尽管最近的工作减少了达到峰值性能所需的训练步数,但由于每步成本增加,整体挂钟训练时间通常保持不变甚至增加。我们提出MMR-GRPO,它整合了最大边际相关性,基于完成多样性对奖励进行重加权。我们的关键洞察是,语义冗余的完成只能贡献有限的边际学习信号;优先考虑多样化的解能产生更有信息量的更新并加速收敛。在三种模型规模(1.5B、7B、8B)、三种GRPO变体和五个数学推理基准上的广泛评估表明,MMR-GRPO在达到相当峰值性能的同时,平均减少47.9%的训练步数和70.2%的挂钟时间。这些增益在模型、方法和基准上一致。我们的代码发布在:https://github.com/WeiKangda/MMR-GRPO。

英文摘要

Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. Our code is released at: https://github.com/WeiKangda/MMR-GRPO.
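The diversity-aware reweighting can be sketched with the classic greedy Maximal Marginal Relevance rule, score = λ·relevance − (1 − λ)·max similarity to already selected items, using each completion's reward as its relevance. This is one plausible reading under stated assumptions: the function name, the cosine similarity over completion embeddings, the `lam` parameter, and the use of the MMR score itself as the reweighted reward are not taken from the paper.

```python
import numpy as np

def mmr_reweight(rewards, embeddings, lam=0.7):
    """Greedy MMR over one prompt's completions (hypothetical sketch).

    rewards:    (n,) scalar rewards, used as relevance
    embeddings: (n, d) completion embeddings, used for redundancy
    Returns an (n,) array of MMR scores to use as reweighted rewards.
    """
    rewards = np.asarray(rewards, dtype=float)
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
    sim = emb @ emb.T  # pairwise cosine similarity
    selected, remaining = [], list(range(len(rewards)))
    out = np.zeros(len(rewards))

    def mmr(i):
        # Redundancy is the highest similarity to anything already selected.
        redundancy = max((sim[i, j] for j in selected), default=0.0)
        return lam * rewards[i] - (1 - lam) * redundancy

    while remaining:
        best = max(remaining, key=mmr)
        out[best] = mmr(best)
        selected.append(best)
        remaining.remove(best)
    return out
```

With equal rewards, a completion that nearly duplicates an earlier pick ends up with a much smaller weight than a distinct one, which mirrors the abstract's point that redundant completions add little marginal learning signal.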

2601.15727 2026-06-09 cs.LG cs.CL 版本更新

Towards Automated Kernel Generation in the Era of LLMs

面向LLM时代的自动化内核生成

Yang Yu, Peiyu Zang, Chi Hsu Tsai, Haiming Wu, Yixin Shen, Jialing Zhang, Haoyu Wang, Zhiyou Xiao, Jingze Shi, Yuyu Luo, Wentao Zhang, Chunlei Men, Guang Liu, Yonghua Lin

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Beijing Normal University(北京师范大学) Peking University(北京大学) Beijing Institute of Technology(北京理工大学) Cornell University(康奈尔大学) Beijing Jiaotong University(北京交通大学) Renmin University of China(中国人民大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文综述了利用大语言模型(LLM)和智能体系统自动化生成与优化GPU内核的方法,系统梳理了现有方法、数据集和基准,并指出了未来研究方向。

Comments In IJCAI 2026. 9 pages, 1 figure

详情
AI中文摘要

现代AI系统的性能从根本上受限于其底层GPU内核的质量,这些内核将高级算法语义转化为低级硬件操作。实现接近最优的内核需要对硬件架构和编程模型有专家级的理解,这使得内核工程成为一个关键但极其耗时且难以扩展的过程。大语言模型和基于LLM的智能体的最新进展为自动化内核生成和优化开辟了新的可能性。LLM擅长压缩难以形式化的专家级内核知识,而智能体系统通过将内核开发视为迭代、反馈驱动的循环,进一步实现了可扩展的优化。该领域取得了快速进展,但仍然分散,缺乏对LLM驱动内核生成的系统视角。本综述通过提供现有方法的结构化概述(涵盖基于LLM的方法和智能体优化工作流程),并系统组织支撑该领域学习和评估的数据集和基准,填补了这一空白。此外,本文还概述了关键开放挑战和未来研究方向,旨在为下一代自动化内核优化建立全面的参考。为跟踪该领域,我们在https://github.com/flagos-ai/awesome-LLM-driven-kernel-generation维护了一个开源GitHub仓库。

英文摘要

The performance of modern AI systems is fundamentally constrained by the quality of their underlying GPU kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires expert-level understanding of hardware architectures and programming models, making kernel engineering a critical but notoriously time-consuming and non-scalable process. Recent advances in large language models and LLM-based agents have opened new possibilities for automating kernel generation and optimization. LLMs are well-suited to compress expert-level kernel knowledge that is difficult to formalize, while agentic systems further enable scalable optimization by casting kernel development as an iterative, feedback-driven loop. Rapid progress has been made in this area. However, the field remains fragmented and lacks a systematic perspective for LLM-driven kernel generation. This survey addresses this gap by providing a structured overview of existing approaches, spanning LLM-based approaches and agentic optimization workflows, and systematically organizing the datasets and benchmarks that underpin learning and evaluation in this domain. Moreover, key open challenges and future research directions are further outlined, aiming to establish a comprehensive reference for the next generation of automated kernel optimization. To keep track of this field, we maintain an open-source GitHub repository at https://github.com/flagos-ai/awesome-LLM-driven-kernel-generation.

2605.03862 2026-06-09 cs.AI cs.CL 版本更新

Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

仅仅正确还不够:用基于执行器的奖励训练推理规划器

Tianyang Han, Hengyu Shi, Junjie Hu, Xu Yang, Zhiling Wang, Junhao Su

发表机构 * D 4 Lab(D4实验室) Independent Researcher(独立研究者)

AI总结 本文提出TraceLift框架,通过执行器导向的奖励提升推理质量,利用rubric-based Reasoning Reward Model评估推理轨迹的可靠性与有效性。

Comments 36 pages

详情
AI中文摘要

可验证奖励的强化学习已成为提升大语言模型显式推理的常见方法,但仅凭最终答案正确性无法揭示推理轨迹是否忠实、可靠,或对使用该轨迹的模型是否有用。这种只看结果的信号可能强化因错误理由而得到正确答案的轨迹,因奖励捷径而高估推理收益,并在多步系统中传播有缺陷的中间状态。为此,我们提出TraceLift,这是一个规划器-执行器训练框架,将推理视为可被消费的中间产物。在规划器训练中,规划器生成带标签的推理。冻结的执行器将此推理转化为最终产物以获得验证器反馈,同时基于执行器的奖励对中间轨迹进行塑造。该奖励等于基于评分细则(rubric)的推理奖励模型(RM)评分与在同一冻结执行器上测得的提升量之积,从而奖励既高质量又有用的轨迹。为使推理质量可以被直接学习,我们引入TRACELIFT-GROUPS,这是一个由数学和代码种子问题构建、带评分细则标注的纯推理数据集。每个示例是一个同题组,包含一条高质量参考轨迹和多条看似合理但有缺陷的轨迹,后者带有局部扰动,在保持任务相关性的同时降低推理质量或对解的支持。在代码和数学基准上的大量实验表明,这种基于执行器的推理奖励使两阶段规划器-执行器系统优于仅基于执行结果的训练,说明推理监督不仅应评估轨迹是否看起来好,还应评估它是否对使用它的模型有帮助。我们的代码发布在:https://github.com/MasaiahHan/TraceLift

英文摘要

Reinforcement learning with verifiable rewards has become a common way to improve explicit reasoning in large language models, but final-answer correctness alone does not reveal whether the reasoning trace is faithful, reliable, or useful to the model that consumes it. This outcome-only signal can reinforce traces that are right for the wrong reasons, overstate reasoning gains by rewarding shortcuts, and propagate flawed intermediate states in multi-step systems. To this end, we propose TraceLift, a planner-executor training framework that treats reasoning as a consumable intermediate artifact. During planner training, the planner emits tagged reasoning. A frozen executor turns this reasoning into the final artifact for verifier feedback, while an executor-grounded reward shapes the intermediate trace. This reward multiplies a rubric-based Reasoning Reward Model (RM) score by measured uplift on the same frozen executor, crediting traces that are both high-quality and useful. To make reasoning quality directly learnable, we introduce TRACELIFT-GROUPS, a rubric-annotated reason-only dataset built from math and code seed problems. Each example is a same-problem group containing a high-quality reference trace and multiple plausible flawed traces with localized perturbations that reduce reasoning quality or solution support while preserving task relevance. Extensive experiments on code and math benchmarks show that this executor-grounded reasoning reward improves the two-stage planner-executor system over execution-only training, suggesting that reasoning supervision should evaluate not only whether a trace looks good, but also whether it helps the model that consumes it. Our code is available at: https://github.com/MasaiahHan/TraceLift
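The multiplicative reward described above can be written as a one-line sketch. Here uplift is taken to be the difference between the frozen executor's pass rate with and without the planner's trace, clipped at zero; both that definition and the clipping are assumptions for illustration, since the abstract only says the RM score is multiplied by measured uplift.

```python
def executor_grounded_reward(rm_score, pass_with_trace, pass_without_trace):
    """Rubric-based RM score times measured uplift on the frozen executor.

    rm_score:           quality score from the reasoning reward model, in [0, 1]
    pass_with_trace:    executor pass rate when given the planner's trace
    pass_without_trace: executor pass rate without it (assumed baseline)
    """
    uplift = pass_with_trace - pass_without_trace
    return rm_score * max(uplift, 0.0)  # clipping at zero is an assumption
```

Because the two factors are multiplied, a trace that scores well on the rubric but does not help the executor earns nothing, and neither does a helpful trace that the rubric rates at zero.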

2605.18856 2026-06-09 cs.LG cs.CL cs.IT math.IT 版本更新

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

SPHERICAL KV: 角度域注意力与率失真保持用于高效长上下文推理

Anay Chauhan, Gurucharan Marthi Krishna Kumar, Arion Das, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Synopsys McGill University(麦吉尔大学) IIIT Ranchi(印度信息技术学院兰契分校) Amazon(亚马逊) Meta Apple(苹果) Pragya Lab, BITS Pilani Goa(Pragya实验室,BITS Pilani果阿校区)

AI总结 提出Spherical KV方法,通过角度域注意力(ADA)和率失真保持(RDR)机制,在长上下文推理中减少KV缓存占用并保持解码效率。

详情
AI中文摘要

长上下文推理日益受到KV缓存的限制:常驻内存随上下文长度增长,解码受限于重复的高带宽内存(HBM)流而非算术运算。现有方法如驱逐、窗口化、量化和卸载减少了占用,但通常仅部分解决了关键路径瓶颈,尤其是在解码期间压缩状态仍需重建为密集向量时。我们提出Spherical KV,一种将KV分配视为基于注意力几何的率失真问题以实现高效解码的长上下文推理方法。该方法基于两个思想:(i) 在解码热循环中廉价地表示方向信息,(ii) 根据估计的未来效用分配保留和精度。其第一个组件,角度域注意力(ADA),将键存储在由标量半径和紧凑角度码组成的球面参数化中,并直接根据这些码计算注意力对数,无需重建密集键。这保留了分页、块局部、融合友好的解码路径,并在实际服务设置中直接针对HBM流量。其第二个组件,率失真保持(RDR),在固定预算下联合选择每个令牌和头的保留/丢弃决策及精度层级,生成层级同质的页面,具有轻量级元数据和合并读取。ADA和RDR共同提供了一种面向部署的机制,在保持解码效率的同时减少KV常驻内存。

英文摘要

Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods such as eviction, windowing, quantization, and offloading reduce footprint, but often leave the critical-path bottleneck only partially addressed, especially when compressed states must still be reconstructed into dense vectors during decoding. We present Spherical KV, a long-context inference method that treats KV allocation as a rate-distortion problem grounded in attention geometry for efficient decoding. The method is built on two ideas: (i) represent directional information cheaply in the decode hot loop, and (ii) allocate retention and precision according to estimated future utility. Its first component, Angle-Domain Attention (ADA), stores keys in a spherical parameterization consisting of a scalar radius and compact angle codes, and computes attention logits directly from these codes without reconstructing dense keys. This preserves a paged, block-local, fusion-friendly decode path and directly targets HBM traffic in realistic serving settings. Its second component, Rate-Distortion Retention (RDR), jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata and coalesced reads. Together, ADA and RDR provide a deployment-oriented mechanism for reducing KV residency while preserving decode efficiency.

2605.26872 2026-06-09 cs.LG cs.AI cs.CL 版本更新

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

最强的教师并不总是最好的教师:以学生为中心的答案选择

Zhengyu Hu, Zheyuan Xiao, Linxin Song, Fengqing Jiang, Yuetai Li, Zhengyu Chen, Zhihan Xiong, Yue Liu, Junhao Lin, Yao Su, Lijie Hu, Kaize Ding, Teng Xiao, Radha Poovendran

发表机构 * University of Washington(华盛顿大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Southern California(南加州大学) Independent Researcher(独立研究者) National University of Singapore(新加坡国立大学) Microsoft(微软) Google(谷歌) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Northwestern University(西北大学) Allen Institute for AI (AI2)(艾伦人工智能研究所(AI2))

AI总结 提出以学生为中心的答案采样(SCAS)框架,通过估计学生中心的学习成本选择教师生成的答案,从而提升学生模型性能。

详情
AI中文摘要

LLM训练越来越依赖教师生成的监督,包括合成响应、推理轨迹和工具使用演示。当前实践通常选择表现最好的教师来生成学生训练数据,隐含地将教师测试表现视为教学质量的代理。我们表明这一假设可能失败:即使多个教师对同一问题提供正确答案,最强教师的答案也不一定是对给定学生的最佳监督。为解决这一问题,我们提出以学生为中心的答案采样(SCAS),该框架根据估计的学生中心学习成本从经过验证的教师生成答案中进行选择。受逐词梯度分解的启发,我们推导出该成本的高效前向代理,并在训练中用于指导答案选择。在30个教师模型、6个学生基础模型和6个任务上的实验表明,SCAS持续提升学生性能,表明有效的蒸馏应优先考虑与当前学生匹配的监督,而非仅依赖教师强度。

英文摘要

LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations. Current practice often chooses the highest-performing teacher to generate student training data, implicitly treating teacher test performance as a proxy for teaching quality. We show that this assumption can fail: even when multiple teachers provide correct answers to the same question, the answer from the strongest teacher is not necessarily the best supervision for a given student. To address this gap, we propose Student-Centric Answer Sampling (SCAS), a framework that selects from verified teacher-generated answers according to their estimated student-centric learning cost. Motivated by a token-wise gradient decomposition, we derive an efficient forward-only proxy for this cost and use it to guide answer selection during training. Experiments across 30 teacher models, 6 student base models, and 6 tasks show that SCAS consistently improves student performance, suggesting that effective distillation should prioritize supervision matched to the current student rather than teacher strength alone.

2506.03106 2026-06-09 cs.CL cs.AI 版本更新

Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Critique-GRPO:通过自然语言和数值反馈提升大语言模型推理能力

Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng

发表机构 * HCCL, The Chinese University of Hong Kong, Hong Kong, China(香港中文大学人机通信实验室,香港,中国) University of Cambridge, Cambridge, United Kingdom(剑桥大学,剑桥,英国) MMLab, The Chinese University of Hong Kong, Hong Kong, China(香港中文大学多媒体实验室,香港,中国) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室,上海,中国)

AI总结 本文提出Critique-GRPO框架,结合自然语言和数值反馈提升LLM推理能力,实验显示其在多个任务中优于传统方法,显著提升推理性能。

Comments Accepted by ICML 2026 Spotlight

详情
AI中文摘要

最近利用数值奖励的强化学习(RL)进展显著增强了大语言模型(LLM)的复杂推理能力。然而,我们发现纯数值反馈存在三个根本限制:性能停滞、无效的自发自我反思和持续失败。我们证明,当为陷入平台期的RL模型提供自然语言批评时,它们能够成功细化失败的解决方案。受此启发,我们提出Critique-GRPO,一种在线RL框架,整合自然语言和数值反馈进行策略优化。该方法使LLM能够同时学习初始响应和批评引导的细化,有效内化两个阶段的探索收益。大量实验显示,Critique-GRPO优于所有比较的监督和基于RL的微调方法,在八个具有挑战性的推理任务上,各种Qwen模型的平均Pass@1提升约+15.0-21.6%,Llama-3.2-3B-Instruct提升约+7.3%。值得注意的是,Critique-GRPO通过自我批评实现有效自我改进,相较于GRPO取得显著提升,例如在AIME 2024上Pass@1提升+16.7%。代码和模型已发布:https://github.com/zhangxy-2019/critique-GRPO

英文摘要

Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs). However, we identify three fundamental limitations of purely numerical feedback: performance plateaus, ineffective spontaneous self-reflection, and persistent failures. We show that plateaued RL models can successfully refine failed solutions when given natural language critiques. Motivated by this, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for policy optimization. This approach enables LLMs to learn simultaneously from initial responses and critique-guided refinements, effectively internalizing the exploration benefits of both stages. Extensive experiments show that Critique-GRPO outperforms all compared supervised and RL-based fine-tuning methods, achieving average Pass@1 improvements of approximately +15.0-21.6% on various Qwen models and +7.3% on Llama-3.2-3B-Instruct across eight challenging reasoning tasks. Notably, Critique-GRPO facilitates effective self-improvement through self-critiquing, achieving substantial gains over GRPO, e.g., a +16.7% Pass@1 improvement on AIME 2024. The code and models are released at: https://github.com/zhangxy-2019/critique-GRPO

2. 机器翻译与跨语言处理 4 篇

2606.08673 2026-06-09 cs.CL 新提交

ClinicalAligner26AM: A Cross-Lingual Aligner for Dataset Translation; Evidences from the MultiClinCorpus Shared Task

ClinicalAligner26AM: 用于数据集翻译的跨语言对齐器;来自MultiClinCorpus共享任务的证据

François Remy

发表机构 * Parallia Healthcare AI(Parallia医疗人工智能)

AI总结 提出ClinicalAligner26AM,一种基于ClinicalEncoder26AM初始化的生物医学临床文本多语言对齐模型,通过Sinkhorn-Knopp最优传输融合多级信号构建软对齐目标,在MultiClinCorpus任务中跨语言投影实体标注,字符加权F1超0.95。

详情
AI中文摘要

词级跨语言对齐对于标注投影、翻译审计和跨语言忠实度估计至关重要,然而现有的神经对齐器很少适应专业领域。在本文中,我们介绍了ClinicalAligner26AM,这是一个从ClinicalEncoder26AM初始化的大上下文多语言对齐模型,用于生物医学和临床文本。我们的训练方法受AWESoME Align启发。我们为平行临床文本和对话建立一个融合句子级、短语级和词元级信号的成本矩阵,并使用Sinkhorn-Knopp最优传输对其进行锐化,从而构建软对齐目标。我们通过鼓励学生对齐器的朴素余弦词元相似度分数匹配该目标,直接将锐化后的对齐矩阵蒸馏到学生对齐器中。在推理时,我们通过学习的词元对齐矩阵投影源跨度分数,并解码目标文本中最长有效的高分跨度,可选地由附录B中总结的MultiClinNER预测支持。我们在MultiClinCorpus共享任务上评估CA26AM,该任务将西班牙语临床实体标注投影到六种目标语言中。我们提交的两个系统在所有语言和实体类型中分别排名第一和第二,几乎所有设置下的字符加权F1分数均高于0.95。

英文摘要

Word-level cross-lingual alignment is central to annotation projection, translation auditing, and cross-lingual faithfulness estimation, yet existing neural aligners are rarely adapted to specialized domains. In this paper, we introduce ClinicalAligner26AM, a large-context multilingual aligner model for biomedical and clinical text initialized from ClinicalEncoder26AM. Our training recipe is inspired by AWESoME Align. We first establish a cost matrix for parallel clinical texts and conversations by fusing sentence-level, phrase-level, and token-level signals, and then build our soft alignment target by sharpening this matrix with Sinkhorn-Knopp optimal transport. We distill this sharpened alignment matrix directly into our student aligner by encouraging its naive cosine-based token similarity scores to match this target. At inference time, we project source-span scores through the learned token alignment matrix and decode the longest valid high-scoring span in the target text, optionally supported by MultiClinNER predictions summarized in Appendix B. We evaluate CA26AM on the MultiClinCorpus shared task, which projects Spanish clinical entity annotations into six target languages. Our two submitted systems ranked first and second, respectively, across all languages and entity types, with character-weighted F1 scores above 0.95 in nearly all settings.
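
As a rough illustration of the sharpening step mentioned in the abstract, the sketch below applies Sinkhorn-Knopp row/column normalization to a small cost matrix. It is a minimal sketch under stated assumptions, not the paper's implementation: the entropic kernel `exp(-cost / epsilon)`, the uniform marginals, the square matrix, and the names `sinkhorn_knopp`, `epsilon`, and `n_iters` are all illustrative choices, and the toy costs are invented.

```python
import math

def sinkhorn_knopp(cost, epsilon=0.5, n_iters=100):
    """Turn a square cost matrix into an approximately doubly stochastic plan.

    Assumptions (not from the paper): entropic kernel exp(-cost / epsilon)
    and uniform marginals, so every row and column should sum to 1.
    """
    # Low cost -> high affinity; a smaller epsilon gives a sharper plan.
    plan = [[math.exp(-c / epsilon) for c in row] for row in cost]
    n_rows, n_cols = len(plan), len(plan[0])
    for _ in range(n_iters):
        # Rescale every row to sum to 1.
        for i in range(n_rows):
            row_sum = sum(plan[i])
            plan[i] = [v / row_sum for v in plan[i]]
        # Rescale every column to sum to 1.
        for j in range(n_cols):
            col_sum = sum(plan[i][j] for i in range(n_rows))
            for i in range(n_rows):
                plan[i][j] /= col_sum
    return plan

# Hypothetical token-to-token costs (e.g. 1 - cosine similarity).
toy_cost = [
    [0.1, 0.9, 0.8],
    [0.9, 0.2, 0.7],
    [0.8, 0.9, 0.1],
]
soft_alignment = sinkhorn_knopp(toy_cost)
for row in soft_alignment:
    print([round(v, 3) for v in row])
```

Each row of the result can be read as a soft distribution over target tokens; alternating the two normalizations is what prevents several source tokens from all claiming the same target token.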

2606.09334 2026-06-09 cs.CL 新提交

How Far Can Prompting Go for Minimal-Edit Ukrainian Grammatical Error Correction?

提示工程在最小编辑乌克兰语语法错误纠正中能走多远?

Kateryna Karpo, Artem Chernodub

发表机构 * Ukrainian Catholic University(乌克兰天主教大学) YouScan Zendesk

AI总结 评估11个商业LLM在乌克兰语最小编辑语法错误纠正上的表现,发现结合最小编辑提示和少样本策略的Gemini 3.1-Pro达到F0.5=69.22,缩小了与微调SOTA的差距。

详情
AI中文摘要

微调大型语言模型在乌克兰语语法错误纠正中占主导地位,而通过API访问的LLM在最小编辑基准上几乎未经过测试。我们在UNLP 2023 GEC-only基准上评估了来自四个提供商的11个商业LLM和一个开源乌克兰语模型,比较了零样本、少样本、最小编辑和LLM辅助提示优化策略。我们的最佳配置(Gemini 3.1-Pro)达到了F0.5=69.22,缩小了与微调SOTA(F0.5=73.14)超过90%的差距。对于零样本提示,只有Claude模型受益于乌克兰语指令。然而,所有模型的最佳总体结果使用了乌克兰语最小编辑提示,其语言特定规则需要精确的乌克兰语表达。在最小编辑+少样本基础上进行LLM辅助提示优化获得了最高分数。详细的最小编辑指令在标点和大小写错误上带来了最大收益,但导致模型放弃了几个低频类别。深入错误分析,我们识别了与乌克兰语特定语言现象相关的五种重复过度纠正模式。代码、提示和输出已公开。

英文摘要

Fine-tuned Large Language Models (LLMs) dominate in Ukrainian grammatical error correction (GEC), while API-accessed LLMs remain nearly untested on minimal-edit benchmarks. We evaluate 11 commercial LLMs from four providers and one open-source Ukrainian model on the UNLP 2023 GEC-only benchmark, comparing zero-shot, few-shot, minimal-edits, and LLM-assisted prompt optimization strategies. Our best configuration (Gemini 3.1-Pro) reaches F0.5=69.22, closing over 90% of the gap to fine-tuned SOTA (F0.5=73.14). For zero-shot prompts, only Claude models benefit from Ukrainian instructions. However, the best overall results for all models use Ukrainian minimal-edits prompts, whose language-specific rules require Ukrainian to express precisely. LLM-assisted prompt optimization on top of minimal-edits + few-shot achieves the highest score. Detailed minimal-edits instructions yield the largest gains for punctuation and case errors but cause the model to abandon several low-frequency categories. Delving into error analysis, we identify five recurring overcorrection patterns tied to Ukrainian-specific linguistic phenomena. Code, prompts, and outputs are publicly available.
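
The F0.5 scores quoted above weight precision more heavily than recall, which suits minimal-edit correction where an unnecessary edit is costly. The sketch below shows the standard F-beta formula; the function name `f_beta` and the edit counts are illustrative assumptions and are not taken from the paper or its benchmark scorer.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Standard F-beta score from counts of proposed edits.

    tp: correct edits, fp: spurious edits, fn: missed edits.
    beta < 1 favours precision; beta = 0.5 is the usual GEC setting.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 60 correct edits, 20 spurious, 40 missed.
print(round(f_beta(60, 20, 40), 4))          # precision 0.75, recall 0.60
print(round(f_beta(60, 20, 40, beta=1), 4))  # F1 for comparison
```

With these made-up counts precision exceeds recall, so F0.5 comes out higher than F1; a system that over-corrects would see the opposite.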

2606.09655 2026-06-09 cs.CL 新提交

Beyond Accuracy: Community Perspectives on Machine Translation

超越准确率:机器翻译的社区视角

Yujun Wang, Ehud Reiter, Shimei Pan, Steffen Eger, Wei Zhao

发表机构 * University of Technology Nuremberg(纽伦堡工业大学) University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校) University of Aberdeen(阿伯丁大学)

AI总结 本文通过分析社交媒体上四个利益相关者社区(AI开发者、专业译者、语言学习者、语言服务提供商)的帖子,揭示机器翻译技术社区间的分歧与冲突,强调倾听用户社区需求的重要性。

详情
AI中文摘要

尽管机器翻译(MT)取得了显著进展,但非AI社区对MT系统日益增长的担忧表明技术进展与现实用户需求之间存在明显差距。例如,NLP研究人员关注基准性能,而最终用户关心伦理问题、信任、可靠性、成本等。我们认为倾听不同用户社区至关重要,以便研究工作能针对社区关心的问题。为此,我们首次进行大规模分析,调查四个利益相关者社区(AI开发者、专业译者、语言学习者和语言服务提供商)在社交媒体上关于MT技术的帖子。我们构建了一个包含2019年至2025年来自Reddit、Facebook、Bluesky和Mastodon的79,286条帖子及评论的数据集,并分析这些社区在哪些方面存在分歧,以及分歧的方式和原因。总体而言,我们发现社区间经常存在分歧,甚至在翻译质量、效率和可靠性等话题上因情绪极化而表现出强烈冲突。这是因为这些社区处理这些话题的方式不同:AI社区将其视为技术和计算问题,而非AI(用户)社区更关注质量细微差别、时间节省、用户信任和更广泛的社会问题。

英文摘要

Despite remarkable progress in machine translation (MT), non-AI communities have raised growing concerns about MT systems, suggesting a noticeable gap between technical advancement and the needs of real-world users. For instance, while NLP researchers focus on benchmark performance, end users care about ethical concerns, trust, reliability, costs, and more. We argue that listening to various user communities is essential so that research efforts are directed towards the problems these communities care about. To this end, we present the first large-scale analysis of what four stakeholder communities (AI developers, professional translators, language learners, and language service providers) post about MT technology on social media. We construct a dataset of 79,286 posts and comments from Reddit, Facebook, Bluesky, and Mastodon from 2019 to 2025, and analyse where these communities disagree, and how and why. Overall, we find that communities often disagree, and even show strong conflicts due to polarised sentiments on topics such as translation quality, efficiency, and reliability. This is because these communities approach these topics differently: the AI community frames them as technical and computational problems, while non-AI (user) communities care more about quality nuances, time savings, user trust, and broader social issues.

2601.10925 2026-06-09 cs.CL 版本更新

Massively Multilingual Joint Segmentation and Glossing

大规模多语言联合分割与标注

Michael Ginn, Lindia Tjuatja, Enora Rice, Ali Marashian, Maria Valentini, Jasmine Xu, Graham Neubig, Alexis Palmer

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 针对现有神经标注模型缺乏形态边界预测导致可解释性差的问题,提出PolyGloss模型,通过联合分割与标注提升准确性和对齐度,并支持低秩适应快速迁移。

Comments 15 pages, 9 figures, accepted to ACL 2026 Long Papers

详情
AI中文摘要

利用神经网络进行自动行间标注预测是加速语言文档记录工作的一种有前景的方法。然而,尽管像GlossLM这样的最先进模型在标注基准测试中取得了高分,但语言学家进行的用户研究发现,这些模型在实际场景中的实用性存在关键障碍。特别是,现有模型通常生成语素级别的标注,但将其分配给整个单词而不预测实际的语素边界,这使得预测的可解释性降低,从而对人类标注者来说不可信。我们首次研究了从原始文本中联合预测行间标注和相应形态分割的神经模型。我们进行实验以确定平衡分割和标注准确性以及两个任务之间对齐的最佳模型训练方式。我们扩展了GlossLM的训练语料库,并预训练了PolyGloss,这是一系列用于联合分割和标注的seq2seq多语言模型,在标注方面优于GlossLM,并在分割、标注和对齐方面击败了各种开源LLM。此外,我们证明了PolyGloss可以通过低秩适应快速适应新数据集。

英文摘要

Automated interlinear gloss prediction with neural networks is a promising approach to accelerate language documentation efforts. However, while state-of-the-art models like GlossLM achieve high scores on glossing benchmarks, user studies with linguists have found critical barriers to the usefulness of such models in real-world scenarios. In particular, existing models typically generate morpheme-level glosses but assign them to whole words without predicting the actual morpheme boundaries, making the predictions less interpretable and thus untrustworthy to human annotators. We conduct the first study on neural models that jointly predict interlinear glosses and the corresponding morphological segmentation from raw text. We run experiments to determine the optimal way to train models that balance segmentation and glossing accuracy, as well as the alignment between the two tasks. We extend the training corpus of GlossLM and pretrain PolyGloss, a family of seq2seq multilingual models for joint segmentation and glossing that outperforms GlossLM on glossing and beats various open-source LLMs on segmentation, glossing, and alignment. In addition, we demonstrate that PolyGloss can be quickly adapted to a new dataset via low-rank adaptation.

3. 信息抽取、检索与问答 25 篇

2606.07519 2026-06-09 cs.CL cs.AI 新提交

Bidirectional Small-Granularity Search between Code and Text

代码与文本之间的双向小粒度搜索

Marco A. Valenzuela-Escárcega, Enrique Noriega-Atala, Gus Hahn-Powell, Clayton T. Morrison, Mihai Surdeanu

发表机构 * Lex Machina The University of Arizona(亚利桑那大学)

AI总结 提出双向小粒度搜索任务,通过自动生成数据训练模型,实现科学出版物文本与代码片段间的直接链接,支持跨模态检索。

详情
AI中文摘要

我们引入了代码与文本之间双向小粒度搜索的新任务,其中查询是文本或代码的小片段,结果也是相反模态的小片段,即代码或文本。该任务在科学出版物中的文本与相应代码片段之间建立直接链接,以支持更好、更快地理解科学方法。我们为所提出的任务引入了一个大型数据集,其中包括使用GPT-4自动生成的代码文本描述的训练分区,以及三个测试分区:一个域内和两个域外(OOD),包含手动注释的数据以及其他领域的材料。我们还提出了一种模块化方法来解决此任务。我们的方法在四个不同的子任务之间共享一个编码器,这些子任务学习双向答案跨度的开始/结束。我们表明,我们的方法在域内取得了良好结果,在域外也取得了令人鼓舞的结果。这表明使用自动生成的数据解决此任务是可能的,但仍有令人兴奋的未来工作要做。

英文摘要

We introduce the novel task of bidirectional small-granularity search between code and text, where the queries are small snippets of text or code and the results are also small fragments of the opposite modality, i.e., code or text. This task establishes direct links between text in scientific publications and corresponding code segments, in support of better and faster understanding of scientific methods. We introduce a large dataset for the proposed task that includes a training partition with textual descriptions of code generated automatically using GPT-4, and three testing partitions, one in-domain and two out-of-domain (OOD) that contain manually-annotated data as well as material from other domains. We also propose a modular approach to address this task. Our approach shares an encoder across four different subtasks that learn start/end of answer spans in both directions. We show that our method achieves good results in-domain, and encouraging results OOD. This suggests that addressing this task with automatically-generated data is possible, but there is exciting future work to be done.

2606.07523 2026-06-09 cs.CL cs.AI 新提交

Retrieval Augmented Generation Framework for the Nepali Legal Domain Question Answering

面向尼泊尔法律领域问答的检索增强生成框架

Samir Wagle, Abiral Adhikari, Reewaj Khanal, Batsal Bhandari, Prashant Manandhar, Praveen Acharya, Bal Krishna Bal

发表机构 * Dublin City University(都柏林城市大学)

AI总结 提出首个基于检索增强生成的尼泊尔法律问答模型,利用BM25和E5模型检索案例法,实现91%的top-1精度和74%的生成答案可信度。

详情
AI中文摘要

英语等高资源语言的法律领域已广泛采用人工智能进行法律问答。然而,尼泊尔语等低资源语言的数据稀缺限制了大型语言模型在尼泊尔法律文本上的训练。本研究首次应用基于检索增强生成的模型,利用从Nepal Kanun Patrika数字档案中提取的案例法进行尼泊尔法律问答。使用BM25对分块文档进行检索,该方法实现了91%的top-1精度,使用多语言E5大模型时达到75%。对生成答案的评估显示,使用BM25文档检索时,可信度为74%,根据自动评判模型评估的真实性为85%,人工评估的真实性为84%,成功答案生成率为92%。这些结果表明,RAG管道可以有效解决低资源语言法律问答的差距,并为尼泊尔法律领域的可靠AI系统奠定基础。

英文摘要

Legal domains in high-resource languages like English have widely adopted artificial intelligence for legal question answering. However, data scarcity in low-resource languages such as Nepali has limited the training of large language models on Nepali legal texts. This study presents the first application of a Retrieval-Augmented Generation (RAG)-based model for Nepali legal question answering using case laws extracted from the Nepal Kanun Patrika digital archive. Using BM25 on chunked documents, the approach achieved a top-1 precision of 91 percent, and up to 75 percent with the multilingual E5 large model. Evaluation of generated answers showed 74 percent groundedness, 85 percent truthfulness according to an automated judge model, and 84 percent human-evaluated truthfulness when using BM25 document retrieval, with a 92 percent successful answer generation rate. These results demonstrate that the RAG pipeline can effectively address the gap in legal question answering for low-resource languages and provide a foundation for reliable AI systems in the Nepali legal domain.

2606.07530 2026-06-09 cs.CL 新提交

Finding New Connections between Concepts from Medline Database Incorporating Domain Knowledge

从Medline数据库中结合领域知识发现概念间的新连接

Yang Weikang, Chowdhury S. M. Mazharul Hoque, Jin Wei

AI总结 提出一种基于Swanson ABC模型的改进自适应模型,用于文献发现中隐藏的概念连接,通过中间主题B连接看似无关的主题A和C。

详情
Journal ref
Artificial Intelligence, IntechOpen, 2024
AI中文摘要

在这个数字世界中,数据是一切,并显著影响我们的日常生活。有趣的是,在这个小世界里,一切都是生态系统的一部分,万物直接或间接相连。数据也是如此。在大多数情况下,一个特定主题可能看起来与另一个主题没有任何联系,但实际上,它们通过一个相互关联的主题连接在一起。因此,在本研究中,我们将讨论一种自适应模型,该模型由Don R. Swanson的ABC模型(一种基于文献的发现模型)改进而来,用于发现感兴趣概念之间的隐藏联系。该模型表明,两个主题A和C是不同的,并且没有关系。但它们有一个共同的主题B,可以用来连接主题A和C。这个著名的模型将在本讨论中用于连接医学概念。

英文摘要

In this digital world, data is everything and significantly impacts our everyday lives. Interestingly, in this small world, everything is part of an ecosystem, where everything is connected, directly or indirectly. The same holds for data. In most cases, it may seem that a particular topic has no connection with another one, but in reality, they are connected through a mutually related topic. Therefore, in this research, we discuss an adaptive model modified from the ABC model by Don R. Swanson, a Literature-Based Discovery (LBD) model, to find the hidden connections between concepts of interest. The model assumes that two topics, A and C, are different and have no direct relationship, but share a common topic, B, that can be used to connect them. This well-known model is used in this discussion to connect medical concepts.
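
The A-B-C idea described in the abstract can be sketched with plain co-occurrence sets. This is a minimal sketch of the classic open-discovery pattern, not the adaptive model the paper proposes: the function name `abc_candidates` and the five toy records are hypothetical, loosely modelled on Swanson's fish oil and Raynaud's disease example.

```python
from collections import defaultdict

def abc_candidates(documents, a_term):
    """Rank C-terms that never co-occur with a_term but share B-terms with it.

    documents: iterable of term lists (one list per record).
    Returns (c_term, bridging_b_terms) pairs, most bridges first.
    """
    # Map each term to the set of terms it co-occurs with in some record.
    cooccur = defaultdict(set)
    for doc in documents:
        terms = set(doc)
        for term in terms:
            cooccur[term] |= terms - {term}

    b_terms = cooccur[a_term]  # terms directly linked to A
    bridges = defaultdict(set)
    for b in b_terms:
        for c in cooccur[b]:
            # Keep only terms with no direct link to A.
            if c != a_term and c not in b_terms:
                bridges[c].add(b)
    return sorted(bridges.items(), key=lambda kv: (-len(kv[1]), kv[0]))

# Invented toy records, for illustration only.
records = [
    ["fish oil", "blood viscosity"],
    ["fish oil", "platelet aggregation"],
    ["blood viscosity", "raynaud disease"],
    ["platelet aggregation", "raynaud disease"],
    ["aspirin", "platelet aggregation"],
]
for c_term, b_set in abc_candidates(records, "fish oil"):
    print(c_term, sorted(b_set))
```

Ranking by the number of distinct bridging B-terms is only one simple heuristic; a real system would typically also filter B-terms by semantic type and weight them by frequency.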

2606.07540 2026-06-09 cs.CL 新提交

Finding Hidden Relationships Between Medical Concepts by Leveraging Metamap and Text Mining Techniques

利用MetaMap和文本挖掘技术发现医学概念间的隐藏关系

Weikang Yang, S M Mazharul Hoque Chowdhury, Wei Jin

发表机构 * Department of Computer Science, North Dakota State University(北达科他州立大学计算机科学系) Department of Computer Science and Engineering, Daffodil International University(水仙花国际大学计算机科学与工程系) Department of Computer Science and Engineering, University of North Texas(北德克萨斯大学计算机科学与工程系)

AI总结 提出一种结合MetaMap和文本挖掘的新模型,通过构建综合索引结构发现医学概念间的跨文档隐藏关联,实验验证了其有效性。

详情
Journal ref
Advanced Data Mining and Applications (ADMA) 2022
AI中文摘要

文本是当今计算机化世界中最常见的数据存储方式之一。乍一看,这些数据似乎互不关联。但实际上,数据可能存在隐藏的联系。因此,本研究提出了一种新模型,该模型通过使用MetaMap和适当的文本挖掘技术,能够发现两个医学概念之间的隐藏关系。具体来说,该模型创建了一种新的综合索引结构,能够发现连接感兴趣主题的跨文档隐藏链接,而大多数现有方法忽略了这些链接。实验表明,所提出的模型在发现主题间新联系方面具有有效性。

英文摘要

Text is one of the most common ways to store data in this computerized world. At a glance, it may seem that those data are not interconnected. But in reality, data can have hidden connections. Therefore, in this research, a new model has been presented that can find hidden relationships between two medical concepts by using MetaMap and appropriate text-mining techniques. Specifically, the model creates a new comprehensive index structure and can find cross-document hidden links connecting topics of interest that most existing approaches have ignored. Experiments show the effectiveness of the proposed model in discovering new connections between topics.

2606.07783 2026-06-09 cs.CL 新提交

Evaluating RAG Reliability under Clean, Misleading, and Mixed Retrieval

评估RAG在干净、误导和混合检索下的可靠性

Sevgi Yigit-Sert

发表机构 * Ankara University(安卡拉大学)

AI总结 提出评估协议,通过参数覆盖和置信度指标,系统测试RAG系统在干净、被投毒和混合证据下处理参数知识与检索证据冲突的鲁棒性。

详情
AI中文摘要

检索增强生成(RAG)通过将答案建立在检索到的证据之上,被广泛用于提高大型语言模型(LLMs)的事实可靠性。然而,在充斥错误信息的环境中,检索到的内容可能包含看似合理但不正确的信息,引发了对基于RAG的信息访问系统可靠性的担忧。在这项工作中,我们提出了一种评估协议,以系统地测试RAG系统如何处理参数知识与从含有不同数量误导信息的上下文中检索到的证据之间的冲突。我们针对模型在无检索时也能正确回答的事实性问题,使用干净、被投毒和混合证据来测试系统。所提出的分析框架结合了参数覆盖和置信度指标,以评估误导信息何时以及如何影响LLMs的生成过程。本研究旨在为信息失序场景下RAG系统的鲁棒性提供见解。

英文摘要

Retrieval-Augmented Generation (RAG) is widely used to improve the factual reliability of large language models (LLMs) by grounding answers in retrieved evidence. In misinformation-rich environments, however, retrieved content may include plausible but incorrect information, raising concerns about the reliability of RAG-based information access systems. In this work, we propose an evaluation protocol to systematically test how a RAG system handles conflicts between parametric knowledge and evidence retrieved from contexts with varying amounts of misleading information. We target factoid questions that the model answers correctly even without retrieval, and use them to test the system with clean, poisoned, and mixed evidence. The proposed analytical framework combines parametric override and confidence metrics to assess when and how misleading information affects the generation process of LLMs. This study aims to provide insights into the robustness of RAG systems in information disorder scenarios.

2606.08245 2026-06-09 cs.CL 新提交

ZAS-SQL: Distilling Rules from Failures for Zero-Shot Text-to-SQL

ZAS-SQL: 从失败中提炼规则用于零样本文本到SQL

Hongzhou Zheng, Yixin Gou, Wenjia Zhang

发表机构 * Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University(同济大学上海自主智能无人系统科学中心) College of Architecture and Urban Planning, Tongji University(同济大学建筑与城市规划学院) Behavioral and Spatial AI Lab, Peking University & Tongji University(北京大学与同济大学行为与空间人工智能实验室)

AI总结 提出ZAS-SQL零样本框架,通过Map-Reduce规则蒸馏从失败案例中提取核心生成规则,结合知识增强模式表示、规则驱动结构化推理和执行引导早停三个模块,在Spider上达到87.2%和88.6%执行准确率,超越多个少样本和微调方法。

详情
AI中文摘要

文本到SQL将自然语言转换为可执行的SQL查询。基于大语言模型(LLM)的少样本上下文学习方法表现出色,但其对示例的依赖限制了跨领域泛化,并消耗大量上下文窗口空间。现有的零样本方法缺乏有效的生成约束,仍落后于少样本方法。我们观察到LLM在零样本文本到SQL中的失败并非随机,而是表现出系统性的、重复出现的模式。基于这一观察,我们提出了一个完全零样本的文本到SQL框架,该框架通过基于Map-Reduce的规则蒸馏管道从失败案例中提炼核心生成规则,并通过三个互补模块提高生成质量:知识增强的模式表示,补充数据定义语言中缺失的语义;规则驱动的结构化推理框架,抑制结构偏差;以及执行引导的早停,实现低成本的自我纠正。在Spider上,所提出的框架在开发集和测试集上分别达到87.2%和88.6%的执行准确率,建立了新的零样本最先进水平,并超越了多个基于GPT-4/4o的少样本和微调方法。在领域特定数据集UrbanPlan上,它达到了81.3%,证实了规则蒸馏方法跨领域的泛化能力。此外,当配备4B参数模型时,该框架超越了领先闭源模型的零样本基线,展示了强大的模型通用性。

英文摘要

Text-to-SQL translates natural language into executable SQL queries. Few-shot in-context learning methods built upon large language models (LLMs) achieve strong performance, yet their reliance on demonstrations limits cross-domain generalization and consumes substantial context window space. Existing zero-shot methods, lacking effective generation constraints, still fall short of few-shot approaches. We observe that LLM failures in zero-shot Text-to-SQL are not random but exhibit systematic, recurring patterns. Building on this observation, we propose a fully zero-shot Text-to-SQL framework that distills core generation rules from failure cases through a Map-Reduce-based rule distillation pipeline and improves generation quality via three complementary modules: knowledge-augmented schema representation, which supplements missing semantics in Data Definition Language; a rule-driven structured reasoning framework that suppresses structural deviations; and Execution-Guided Early Stopping, which enables low-cost self-correction. On Spider, the proposed framework achieves up to 87.2% and 88.6% execution accuracy on the Dev and Test sets, respectively, establishing a new zero-shot state-of-the-art and surpassing multiple few-shot and fine-tuning methods built upon GPT-4/4o. On the domain-specific dataset UrbanPlan, it achieves 81.3%, confirming that the rule distillation approach generalizes across domains. Moreover, when equipped with a 4B-parameter model, the framework surpasses zero-shot baselines of leading closed-source models, demonstrating strong model generality.

2606.08397 2026-06-09 cs.CL cs.IR 新提交

TrustMargin: Training-Free Arbitration between Parametric Memory and Retrieved Evidence in Large Language Models

TrustMargin: 大语言模型中参数化记忆与检索证据之间的无训练仲裁

Jingyan Xu, Hong Shi, Yi Shan, Penghui Liu, Yunhao Bai, Ningyuan Li, Xueyang Liu

发表机构 * Peking University(北京大学)

AI总结 针对大语言模型在知识问答中参数记忆与检索证据冲突的问题,提出无训练仲裁层TrustMargin,利用模型自身似然度评分选择更可信的答案,无需微调或外部评判。

Comments 13 pages, 6 figures, 9 tables. Code and data are available at https://github.com/mojixu/TrustMargin.git

详情
AI中文摘要

大语言模型通过参数化记忆和检索证据回答知识密集型问题,但两种来源并非都可靠。检索可以填补知识空白,但干扰性段落可能覆盖正确的闭卷答案。我们将这种生成后冲突视为答案级源仲裁:给定来自同一冻结模型的直接和RAG答案,决定信任哪个源。我们提出TRUSTMARGIN,一个无训练、即插即用的仲裁层,它使用模型自身的似然度对两个现有候选答案进行评分。它结合了参数先验边际(测试记忆是否接受检索答案)和证据绑定边际(折扣仅段落显著性并衡量问题特定支持)。TRUSTMARGIN在直接和RAG之间进行选择,无需微调、外部评判或额外生成。在2WIKIMQA和CWQA上使用三种LLaMA规模,TRUSTMARGIN一致优于直接生成和BM25-RAG,恢复了直接/RAG oracle差距的一部分,并推广到多个无训练RAG流水线。

英文摘要

Large language models answer knowledge-intensive questions using both parametric memory and retrieved evidence, but neither source is uniformly reliable. Retrieval can fill knowledge gaps, yet distracting passages may override correct closed-book answers. We study this post-generation conflict as answer-level source arbitration: given Direct and RAG answers from the same frozen model, decide which source to trust. We propose TRUSTMARGIN, a training-free, plug-and-play arbitration layer that scores the two existing candidates with the model's own likelihoods. It combines a parametric-prior margin, which tests whether memory accepts the retrieved answer, with an evidence-binding margin, which discounts passage-only salience and measures question-specific support. TRUSTMARGIN selects between Direct and RAG without fine-tuning, external judges, or additional generation. Across 2WIKIMQA and CWQA with three LLaMA scales, TRUSTMARGIN consistently improves over Direct generation and BM25-RAG, recovers part of the Direct/RAG oracle gap, and generalizes to multiple training-free RAG pipelines.

2606.08589 2026-06-09 cs.CL cs.DL cs.IR 新提交

Detection and Interpretability Analysis of Quotation Errors by Large Language Models

大语言模型对引用错误的检测与可解释性分析

Bei Huang, Yingyi Zhang, Shenghao Huang, Chengzhi Zhang

发表机构 * School of Social Science, Soochow University(苏州大学社会科学学院) School of Economics and Management, Nanjing University of Science and Technology(南京理工大学经济与管理学院)

AI总结 针对引用错误问题,提出基于大语言模型微调的自动检测方法,通过引入全文数据优化数据集构建,并利用TokenSHAP进行可解释性分析,实验表明微调方法有效且基于源摘要的全文整合方案性能最佳。

详情
Journal ref
The Electronic Library, 2026
AI中文摘要

目的 - 引用错误指引用信息与其原始来源之间的不一致。这一现象导致一系列负面影响,如对原始研究的误解、削弱学术界对相关问题的集体理解,以及削弱基于引用的学术评价体系的准确性和公平性。现有研究表明,引用错误在学术界普遍存在;此外,人工验证引用错误不仅劳动密集,而且效率低下。因此,本文提出“引用错误自动检测”任务。方法 - 采用基于大语言模型的方法,本文在现有研究基础上从两个方面提升检测性能:首先,采用微调方法使大语言模型检测引用错误;其次,将引文全文数据纳入数据集构建,并通过比较三种全文整合方法探索构建此类数据集的最优方案。在此基础上,本文进一步使用TokenSHAP工具对模型预测结果进行可解释性实验分析。发现 - 大语言模型的微调方法提升了引用错误检测的性能。在整合全文信息的不同方法中,基于使用源摘要的方法取得了最佳性能。原创性 - 将大语言模型的微调方法应用于引用错误自动检测任务,并对模型输出结果进行可解释性分析。

英文摘要

Purpose - Quotation error refers to the inconsistency between cited information and its original source. This phenomenon leads to a series of negative impacts, such as misinterpretation of the original research, undermining the academic community's collective understanding of relevant issues, and weakening the accuracy and fairness of the citation-based academic evaluation system. Existing studies have shown that quotation error is prevalent in the academic community; moreover, manual verification of quotation error is not only labor-intensive but also inefficient. Therefore, this paper proposes the task of 'automated detection of quotation errors'. Methodology - Adopting a large language model (LLM)-based approach, this paper improves detection performance over existing research in two respects: first, by fine-tuning LLMs to detect quotation errors; second, by incorporating full-text data of the cited literature into dataset construction and exploring the optimal scheme for building such datasets through a comparison of three full-text integration methods. On this basis, the paper further uses the TokenSHAP tool to conduct an interpretability analysis of the model's prediction results. Findings - Fine-tuning LLMs improved performance in detecting quotation errors. Among the different methods for incorporating full-text information, the approach based on the source abstract yielded the best performance. Originality - The fine-tuning approach for LLMs is applied to the task of automated detection of quotation errors, and interpretability analysis is conducted on the model's output results.

2606.08617 2026-06-09 cs.CL 新提交

Cross-Source Reasoning-based Correction for Author Name Disambiguation

基于跨源推理的作者姓名消歧校正

Fanjin Zhang, Yunhe Pang, Bo Chen, Zhiyu Shen, Yanghui Rao, Evgeny Kharlamov, Jie Tang

发表机构 * Renmin University of China(中国人民大学) Sun Yat-Sen University(中山大学) Tsinghua University(清华大学) Robert Bosch GmbH(罗伯特·博世有限公司) University of Oslo(奥斯陆大学)

AI总结 提出CrossND框架,通过跨源不一致分配推理,结合数据精炼、监督微调和测试时缩放,无需人工干预即可校正作者姓名消歧错误。

Comments Accepted at KDD 2026 ADS track

详情
AI中文摘要

作者姓名消歧是学术搜索系统中的关键挑战,通常通过从头开始和实时消歧方法解决。然而,当前算法仍然容易受到论文-作者分配的累积误差影响,并忽略了不同来源之间的不一致分配。诉诸专家注释是资源密集型的。为此,本文探索了作者姓名消歧的新视角:通过利用跨源的不一致分配进行跨源校正。我们提出了CrossND,一个集成数据精炼、跨源推理和测试时缩放的全栈框架。首先,一个精炼链去噪作者档案并产生更准确的论文-作者匹配概率。其次,一个监督微调过程结合这些精炼信号和基于概率软逻辑的交叉校正模块,推断哪些来源的分配是错误的。第三,测试时缩放进一步增强了预测的准确性和鲁棒性。在真实数据集上的实验表明,CrossND通过利用跨源推理,无需人工干预,始终优于17个基线。

英文摘要

Author name disambiguation is a critical challenge in academic search systems, often addressed through from-scratch and real-time disambiguation approaches. However, current algorithms remain vulnerable to cumulative errors in paper-author assignments and overlook inconsistent assignments across different sources. Resorting to expert annotation is resource-intensive. To this end, this paper explores a new perspective for author name disambiguation: cross-source correction by leveraging inconsistent assignments across sources. We propose CrossND, a full-stack framework that integrates data refinement, cross-source reasoning, and test-time scaling. First, a chain-of-refinement pipeline denoises author profiles and produces more accurate paper-author matching probabilities. Second, a supervised fine-tuning process incorporates these refined signals and a probabilistic soft logic-based cross-correction module to infer which sources' assignments are incorrect. Third, test-time scaling further enhances the accuracy and robustness of the predictions. Experiments on real-world datasets indicate that CrossND consistently outperforms 17 baselines by leveraging cross-source reasoning without human intervention.

2606.08932 2026-06-09 cs.CL cs.AI cs.CE 新提交

From Statute to Control Flow: Span-Grounded Deontic Trees for Defeasible Scope Parsing

从法规到控制流:用于可废止范围解析的跨度锚定道义树

Jian Chen, Siyuan Li, Chucheng Wan, Zixuan Yuan

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Sun Yat-Sen University(中山大学)

AI总结 提出NormBench基准和跨度锚定道义树(SG-DT)中间表示,用于诊断和缓解规则遵循模型中的静默范围遗漏(SSO)问题,揭示递归衰减和可审计性陷阱两种病理,并通过约束输出改善树结构保真度。

详情
AI中文摘要

执行政策和法规的规则遵循代理常常因静默范围遗漏(SSO)而失败:模型应用一般规则但静默地丢弃嵌套的例外或反例外,产生看似合规但在重要边缘案例上失效的输出。尽管此类失败常被视为代理系统问题,其根本瓶颈在于法规和政策理解,而这一能力通常在法律NLP中研究。然而,大多数现有法律NLP基准强调最终任务结果,可能忽略导致SSO的结构性遗漏。为诊断和缓解SSO,我们引入NormBench,一个包含2,290条条款的基准,涵盖中文(法律和地方政策)、英文(美国税法、GDPR和企业政策)及跨语言设置,专为可废止范围解析设计:精确识别哪个条款覆盖哪个条款。NormBench采用跨度锚定道义树(SG-DT),这是一种编译器式中间表示,将每个逻辑分支锚定到源跨度并要求显式排除守卫,从而实现确定性编译和审计。对前沿LLM的评估揭示了两种反复出现的病理:(1)递归衰减,性能随击败者深度增加急剧下降;(2)可审计性陷阱,模型检索到相关跨度但未能组装出正确的控制流。使用SG-DT作为约束中间输出可改善整树保真度和击败者恢复,下游实验表明其效用是机制特定的:增益集中在例外活跃、易发生SSO的案例上,而当附加结构不必要或解析器保真度低时,总体准确率可能参差不齐。

英文摘要

Rule-following agents tasked with executing policies and regulations often fail via Silent Scope Omission (SSO): a model applies a general rule but silently drops nested exceptions or counter-exceptions, producing outputs that appear compliant yet break on important edge cases. Although such failures are often framed as an agentic-systems problem, the underlying bottleneck is statutory and policy understanding, a capability typically studied in legal NLP. However, most existing legal NLP benchmarks emphasize end-task outcomes, which can overlook the structural omissions that cause SSO. To diagnose and mitigate SSO, we introduce NormBench, a benchmark of 2,290 provisions spanning Chinese (laws and local policies), English (U.S. tax law, GDPR, and corporate policies), and cross-lingual settings, designed for defeasible scope parsing: identifying precisely which clause overrides which. NormBench uses Span-Grounded Deontic Trees (SG-DT), a compiler-style intermediate representation that anchors every logical branch to source spans and requires explicit exclusion guards, enabling deterministic compilation and audit. Evaluations of frontier LLMs reveal two recurring pathologies: (1) Recursion Decay, where performance drops sharply as defeater depth increases, and (2) an Auditability Trap, where models retrieve relevant spans but fail to assemble correct control flow. Using SG-DT as a constrained intermediate output improves whole-tree fidelity and defeater recovery, and downstream experiments show that its utility is mechanism-specific: gains concentrate on exception-active, SSO-prone cases, while aggregate accuracy can be mixed when the added structure is unnecessary or parser fidelity is low.

2606.09459 2026-06-09 cs.CL 新提交

AbstRAG: Learning to Abstract for Retrieval Problems

AbstRAG:面向检索问题的抽象学习

Lei Xu, Xin Quan, Daniel Pedronette, André Freitas

发表机构 * Idiap Research Institute(Idiap 研究所) École Polytechnique Fédérale de Lausanne (EPFL)(洛桑联邦理工学院 (EPFL)) São Paulo State University(圣保罗州立大学) University of Manchester(曼彻斯特大学) CRUK National Biomarker Centre(英国癌症研究中心国家生物标志物中心)

AI总结 针对查询与文档证据间的抽象鸿沟问题,提出AbstRAG方法,通过将抽象作为显式检索对象,并采用反思性精炼机制,在三个基准上提升了检索和生成性能。

详情
AI中文摘要

当查询、文档证据和用户意图以不同抽象级别表达时,检索增强生成常常失败。查询可能询问一个类别、关系或事件,而文档仅陈述具体实例、间接框架或限定表述。我们将这种不匹配定义为抽象鸿沟:将查询意图与可用证据对齐所需的最小类型假设集合。为弥合这一鸿沟,我们引入AbstRAG,将抽象视为显式检索对象。AbstRAG将查询-证据鸿沟分解为表达、概念、意图-证据和事件类型组件,并通过结合匹配质量、查询无关的效用先验以及所需桥梁的成本来评分相关性。其核心机制是反思性精炼:批评者诊断检索失败,定位失败的抽象操作符,提出最小的阶段特定补丁,并仅在充分性和压缩控制下接受补丁。在三个文档内检索基准上与七个基线对比,AbstRAG在21个配对自助法对比中的18个上以nDCG@10胜出,并在三个基准上分别将生成准确率提升1.9%、5.2%和4.0%;消融实验证实,反思性精炼驱动了大部分检索增益,而仅压缩控制就在压力切片上将过度扩展假阳性从73.7%降至0%。

英文摘要

Retrieval-augmented generation often fails when the query, the document evidence, and the user's intent are expressed at different levels of abstraction. A query may ask about a class, a relation, or an event, while the document only states specific instances, indirect framings, or scoped formulations. We define this mismatch as an abstraction gap: the minimal set of typed assumptions required to align query intent with the available evidence. To close this gap, we introduce AbstRAG, which treats abstraction as an explicit retrieval object. AbstRAG decomposes the query-evidence gap into expression, conceptual, intent-evidence, and event-type components, and scores relevance by combining match quality, a query-independent utility prior, and the cost of the required bridges. Its central mechanism is reflective refinement: a critic diagnoses retrieval failures, localizes the failed abstraction operator, proposes a minimal stage-specific patch, and accepts the patch only under sufficiency and compression controls. Across three within-document retrieval benchmarks against seven baselines, AbstRAG outperforms on nDCG@10 in 18 of 21 paired-bootstrap contrasts and improves generation accuracy by 1.9%, 5.2%, and 4.0% across the three benchmarks; ablations confirm that reflective refinement drives most of the retrieval gain and the compression control alone reduces over-expansion false positives from 73.7% to 0% on a stress slice.
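
The relevance score described in this abstract (match quality, a query-independent utility prior, and the cost of the required bridges) can be illustrated with a minimal sketch. The additive form, the weights `w_prior` and `w_cost`, and all names below are assumptions for illustration; the abstract does not state how AbstRAG actually combines the three terms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    passage_id: str
    match_quality: float  # how well the passage matches the query, in [0, 1]
    utility_prior: float  # query-independent usefulness of the passage, in [0, 1]
    bridge_cost: float    # cost of the typed assumptions needed to link query and passage


def relevance(c: Candidate, w_prior: float = 0.3, w_cost: float = 0.5) -> float:
    """Hypothetical additive combination: reward match and prior, penalize bridges."""
    return c.match_quality + w_prior * c.utility_prior - w_cost * c.bridge_cost


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates from most to least relevant under the sketch score."""
    return sorted(candidates, key=relevance, reverse=True)


candidates = [
    Candidate("a", match_quality=0.9, utility_prior=0.2, bridge_cost=1.0),
    Candidate("b", match_quality=0.7, utility_prior=0.5, bridge_cost=0.2),
]
ranked_ids = [c.passage_id for c in rank(candidates)]
```

In this toy case passage `b` outranks `a` despite the weaker surface match, because `a` needs a much more expensive bridge to connect it to the query.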

2606.07688 2026-06-09 cs.IR cs.AI cs.CL cs.LG 交叉投稿

TRACER: Token ReAssignment for Concept ERasure in Generative Recommendation

TRACER: 面向生成式推荐中概念擦除的令牌重分配

Ziheng Chen, Jiali Cheng, Zezhong Fan, Hadi Amiri, Diyuan Wu, Gabriele Tolomei, Yang Zhang

发表机构 * Stony Brook University(石溪大学) University of Massachusetts Lowell(马萨诸塞大学洛厄尔分校) Columbia University(哥伦比亚大学) Institute of Science and Technology Austria(奥地利科学技术研究院) Sapienza University of Rome(罗马第一大学) National University of Singapore(新加坡国立大学)

AI总结 针对生成式推荐中概念遗忘与推荐效用冲突的问题,提出基于令牌重分配的概念遗忘框架TRACER,通过将概念相关物品重分配给替代令牌并引入一致性正则化,有效移除目标概念同时保持推荐效用。

详情
AI中文摘要

生成式推荐将下一项预测形式化为基于用户历史交互导出的语义ID(SID)序列的自回归生成,使得现代推荐系统在结构上类似于大型语言模型(LLM)。随着隐私和安全问题的增加,这些系统越来越需要概念遗忘来移除与物品相关的敏感或有害概念。然而,现有的LLM遗忘方法不能直接应用于生成式推荐。与具有明确语义的词令牌不同,SID是抽象标识符,通常被遗忘和保留物品共享,导致概念移除和推荐效用保持之间的严重冲突。为了解决这一挑战,我们提出了TRACER,一种基于令牌重分配的端到端概念遗忘框架。TRACER不是直接抑制共享的SID,而是将概念相关物品重分配给能够更好地促进遗忘同时最小化对保留物品的副作用的替代令牌。我们进一步引入了一致性正则化器,以在遗忘过程中保持保留物品之间的语义一致性。在真实世界推荐数据集上的实验表明,TRACER有效地移除了目标概念,同时比现有的遗忘基线更好地保持了推荐效用。

英文摘要

Generative recommendation formulates next-item prediction as autoregressive generation over semantic ID (SID) sequences derived from users' historical interactions, making modern recommender systems structurally similar to large language models (LLMs). As privacy and safety concerns grow, these systems increasingly require concept unlearning to remove sensitive or harmful concepts associated with items. However, existing LLM unlearning methods cannot be directly applied to generative recommendation. Unlike word tokens with explicit semantics, SIDs are abstract identifiers that are often shared by both forget and retain items, leading to severe conflicts between concept removal and recommendation utility preservation. To address this challenge, we propose TRACER, an end-to-end concept unlearning framework based on token reassignment. Rather than directly suppressing shared SIDs, TRACER reassigns concept-related items to alternative tokens that better facilitate forgetting while minimizing side effects on retained items. We further introduce a coherence regularizer to preserve semantic consistency among retain items during unlearning. Experiments on real-world recommendation datasets demonstrate that TRACER effectively removes target concepts while substantially better preserving recommendation utility than existing unlearning baselines.

2606.07924 2026-06-09 cs.CV cs.AI cs.CL cs.LG cs.MM 交叉投稿

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

解耦语义与逻辑:一种无需训练的从粗到精的视频检索增强生成流水线

Jiaxin Dai, Zehang Wei, Jiamin Yan, Xiang Xiang

发表机构 * School of Computer Science & Tech, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) School of AI and Automation, Huazhong University of Science and Technology(华中科技大学人工智能与自动化学院)

AI总结 提出一种无需训练的两阶段级联视频RAG流水线,通过解耦语义检索与逻辑推理,实现跨语言长视频理解、严格角色遵循和零幻觉时间定位。

Comments To be presented at ACL 2026 MAGMaR Workshop (Oral; Retrieval leaderboard No.1)

详情
AI中文摘要

本文介绍了我们为第二届多模态增强生成研讨会(MAGMaR)提交的系统描述。针对跨语言长视频理解、严格角色遵循和零幻觉时间定位等关键挑战,我们提出了一种完全无需训练的两阶段级联视频RAG流水线。我们的架构通过模态感知的任务分工,策略性地将语义检索与认知逻辑推理解耦。在第一阶段,一个高召回率的语义预取模块仅使用高保真视觉摘要和全局文本描述进行密集检索,明确隔离噪声模态(如OCR和ASR)以保持纯净的向量空间。在第二阶段,一个由商业大语言模型(LLM)驱动的自适应、迭代和推理(A.I.R.)过滤代理执行细粒度认知重排序。该代理重新整合完整的多模态上下文,以强制执行与用户角色的严格逻辑对齐,有效剪除语义相似但逻辑无关的候选。最后,提示雕刻机制约束生成器将蒸馏后的子集合成为严格格式化的JSON响应,并带有精确的块级引用。在RAG轨道上的评估表明,我们的资源感知方法在信息检索和角色条件生成方面均表现出卓越的精度。

英文摘要

This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a modality-aware division of labor. In the first stage, a high-recall semantic pre-fetching module employs dense retrieval using only high-fidelity visual summaries and global text descriptions, explicitly isolating noisy modalities (e.g., OCR and ASR) to maintain a pristine vector space. In the second stage, an Adaptive, Iterative, and Reasoning-based (A.I.R.) filtering agent, powered by a commercial Large Language Model (LLM), performs fine-grained cognitive reranking. The agent re-incorporates full multimodal contexts to enforce strict logical alignment with user personas, effectively pruning semantically similar but logically irrelevant candidates. Finally, a Prompt Sculpting mechanism constrains the generator to synthesize the distilled subset into strictly formatted JSON responses with exact chunk-level citations. Evaluated on the RAG track, our resource-aware approach shows exceptional precision in both information retrieval and persona-conditioned generation.

2509.13930 2026-06-09 cs.CL 版本更新

Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG

语言裙带关系:多语言RAG中为语言偏好牺牲质量

Dayeon Ki, Marine Carpuat, Paul McNamee, Daniel Khashabi, Eugene Yang, Dawn Lawrie, Kevin Duh

发表机构 * University of Washington(华盛顿大学)

AI总结 研究多语言RAG系统中模型对英语源文档的偏好,发现模型会牺牲文档相关性以迎合语言偏好,尤其在低资源语言中更明显。

Comments ICML 2026 Spotlight

详情
AI中文摘要

多语言检索增强生成(mRAG)系统使语言模型能够跨语言回答知识密集型查询,并提供引用支持的响应。尽管其使用日益增长,一个悬而未决的问题是不同文档语言的混合是否以非预期的方式影响生成和引用行为。为了研究这一点,我们引入了一种受控方法,利用模型内部状态在保持文档相关性等其他因素不变的情况下测量语言偏好。在八种语言和六个开源权重模型上,我们发现当查询为英语时,模型优先引用英语源文档,这种偏差在低资源语言和位于上下文中间的文档中更为放大。更重要的是,我们发现模型有时会为了语言偏好而牺牲文档相关性,这表明引用选择并非总是仅由信息量驱动。我们的发现揭示了语言模型如何利用多语言上下文并影响引用行为。

英文摘要

Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. Despite their growing use, an open question is whether the mixture of different document languages impacts generation and citation behavior in unintended ways. To investigate this, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. More crucially, we find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and influence citation behavior.

2510.10448 2026-06-09 cs.CL 版本更新

RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation

RECON:基于压缩的推理用于高效检索增强生成

Zhichao Xu, Minheng Wang, Yawei Wang, Wenqian Ye, Yuntao Du, Yunpu Ma, Yijun Tian

发表机构 * University of Utah(犹他大学) University of Washington(华盛顿大学) George Washington University(乔治·华盛顿大学) University of Virginia(弗吉尼亚大学) Shandong University(山东大学) Ludwig Maximilian University of Munich(慕尼黑路德维希-马克西米利安大学) University of Notre Dame(圣母大学)

AI总结 提出RECON框架,在强化学习搜索代理的多轮推理中插入观察压缩器,通过两阶段课程训练实现上下文压缩,提升训练和推理效率并增强问答性能。

Comments Technical report

详情
AI中文摘要

基于强化学习(RL)训练的搜索代理在多轮、工具集成的推理(TIR)循环中交错进行推理和工具调用,每次工具调用返回的环境观察被附加到代理的上下文中。随着rollout的进行,这些原始观察不断累积,增加了token成本并稀释了可用于下游推理的信号。与单次检索-读取流水线(其中上下文压缩是一次性后处理步骤)不同,多轮RL设置需要在每个观察步骤进行压缩,同时保持与策略优化解耦。我们引入了RECON(REasoning with CONdensation)框架,通过在推理循环中插入专用的观察压缩器来解决这一挑战。压缩器通过两阶段课程进行训练:在QA数据集上进行相关性预训练,然后从专有LLM进行多方面蒸馏,并在RL训练期间保持冻结以保持策略稳定性。集成到Search-R1搜索代理流水线中,RECON将总上下文长度减少35%,训练速度提高5.4%,推理延迟降低30.9%,同时在3B代理上平均精确匹配提升14.5%,在7B代理上提升3.0%,在多跳QA中表现尤为突出。这些结果表明,学习的观察压缩是构建实用、可扩展的RL训练搜索代理的关键组件。

英文摘要

Search agents trained with reinforcement learning (RL) interleave reasoning with tool calls in a multi-turn, tool-integrated reasoning (TIR) loop, where each tool invocation returns an environment observation that is appended to the agent's context. As the rollout proceeds, these raw observations accumulate, inflating token cost and diluting the signal available for downstream reasoning. Unlike single-pass retrieve-then-read pipelines, where context compression is a one-time postprocessing step, the multi-turn RL setting requires compression that runs at every observation step while remaining decoupled from policy optimization. We introduce RECON (REasoning with CONdensation), a framework that addresses this challenge by inserting a dedicated observation compressor into the reasoning loop. The compressor is trained via a two-stage curriculum: relevance pretraining on QA datasets followed by multi-aspect distillation from proprietary LLMs, and remains frozen during RL training to preserve policy stability. Integrated into the Search-R1 search-agent pipeline, RECON reduces total context length by 35%, improves training speed by 5.4% and inference latency by 30.9%, while boosting average exact-match by 14.5% on the 3B agent and 3.0% on the 7B agent, with particular strength in multi-hop QA. These results establish learned observation compression as a key component for building practical, scalable RL-trained search agents.

2602.00238 2026-06-09 cs.CL cs.AI cs.LG 版本更新

DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

DIVERGE: 面向开放式信息检索的多样性增强RAG

Tianyi Hu, Niket Tandon, Akhil Arora

发表机构 * Aarhus University(奥胡斯大学) Microsoft Research(微软研究院)

AI总结 针对现有RAG系统忽略开放式信息检索中多样性需求的问题,提出Diverge框架,通过迭代反思引导的多样化视角探索和多样性感知检索支持,在保持质量的同时将多样性提升约2倍。

详情
AI中文摘要

现有的检索增强生成(RAG)系统通常假设每个查询只有一个正确答案。这种假设忽略了开放式信息检索场景,其中多个合理的答案是有价值的,并且多样性对于创造力、公平性和信息的包容性访问至关重要。我们表明,标准RAG系统未能充分利用多样化的检索上下文:简单地增加检索多样性并不一定会导致多样化的生成。为了解决这一局限性,我们提出了Diverge,一个即插即用的智能体RAG框架,通过迭代、反思引导的多样化视角探索和多样性感知检索支持来改善多样性与质量的权衡。我们进一步引入了用于表征开放式问答中多样性与质量权衡的评估指标。在多个真实世界数据集和骨干LLM上的实验表明,Diverge在竞争基线中实现了最佳的权衡,将多样性提高了约2倍,且没有明显的质量下降。这些结果揭示了当前RAG系统的系统性局限,并展示了显式多样性建模的价值。

英文摘要

Existing retrieval-augmented generation (RAG) systems often assume that each query has a single correct answer. This assumption overlooks open-ended information-seeking scenarios where multiple plausible answers are valuable, and where diversity is important for creativity, fairness, and inclusive access to information. We show that standard RAG systems fail to fully use diverse retrieved contexts: simply increasing retrieval diversity does not necessarily lead to diverse generations. To address this limitation, we propose Diverge, a plug-and-play agentic RAG framework that improves the diversity-quality trade-off through iterative, reflection-guided exploration of diverse viewpoints and diversity-aware retrieval support. We further introduce evaluation metrics for characterizing the diversity-quality trade-off in open-ended question answering. Experiments across multiple real-world datasets and backbone LLMs show that Diverge achieves the best trade-off among competitive baselines, increasing diversity by roughly 2× without noticeable quality degradation. These results reveal a systematic limitation of current RAG systems and show the value of explicit diversity modeling.

2602.17911 2026-06-09 cs.CL cs.AI 版本更新

Condition-Gated Reasoning for Context-Dependent Biomedical Question Answering

面向上下文相关生物医学问答的条件门控推理

Jash Rajesh Parekh, Wonbin Kweon, Joey Chan, Rezarta Islamaj, Robert Leaman, Pengcheng Jiang, Chih-Hsuan Wei, Zhizheng Wang, Zhiyong Lu, Jiawei Han

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) National Institutes of Health(美国国立卫生研究院)

AI总结 本文提出CondMedQA基准和Condition-Gated Reasoning框架,通过构建条件感知知识图谱,提升生物医学问答中条件依赖的推理能力。

详情
AI中文摘要

当前生物医学问答系统常假设医学知识是统一的,但现实临床推理本质上是条件性的:几乎所有决策都依赖于患者特定因素,如共病和禁忌症。现有基准不评估此类条件推理,检索增强或图基方法缺乏显式机制确保检索知识适用于给定上下文。为解决这一差距,我们提出CondMedQA,首个针对条件生物医学问答的基准,包含多跳问题,其答案随患者条件变化。此外,我们提出Condition-Gated Reasoning(CGR),一种新框架,构建条件感知知识图谱,并根据查询条件选择性激活或修剪推理路径。我们的发现显示,CGR更可靠地选择条件合适的答案,同时在生物医学问答基准上匹配或超越现有最佳性能,突显了显式建模条件性对稳健医疗推理的重要性。

英文摘要

Current biomedical question answering (QA) systems often assume that medical knowledge applies uniformly, yet real-world clinical reasoning is inherently conditional: nearly every decision depends on patient-specific factors such as comorbidities and contraindications. Existing benchmarks do not evaluate such conditional reasoning, and retrieval-augmented or graph-based methods lack explicit mechanisms to ensure that retrieved knowledge is applicable to the given context. To address this gap, we propose CondMedQA, the first benchmark for conditional biomedical QA, consisting of multi-hop questions whose answers vary with patient conditions. Furthermore, we propose Condition-Gated Reasoning (CGR), a novel framework that constructs condition-aware knowledge graphs and selectively activates or prunes reasoning paths based on query conditions. Our findings show that CGR more reliably selects condition-appropriate answers while matching or exceeding state-of-the-art performance on biomedical QA benchmarks, highlighting the importance of explicitly modeling conditionality for robust medical reasoning.

2603.03292 2026-06-09 cs.CL cs.AI cs.IR 版本更新

From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

从冲突到共识:通过多轮代理RAG提升医疗推理

Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Zhenhong Sun, Chunlin Chen, Zhi Wang

AI总结 本文提出MA-RAG框架,通过多轮代理循环迭代优化外部证据和内部推理历史,提升医疗复杂推理能力,实验显示在7个医疗问答基准上表现优于现有方法。

Comments 27 pages, 8 figures, 18 tables

详情
AI中文摘要

大型语言模型(LLM)在医疗问答中展现出很强的推理能力,但其产生幻觉和过时知识的倾向给医疗领域带来重大风险。检索增强生成(RAG)虽然能缓解这些问题,但现有方法依赖含噪声的token级信号,且缺乏复杂推理所需的多轮细化。本文提出MA-RAG(多轮代理RAG),该框架在代理细化循环中迭代演进外部证据和内部推理历史,从而为复杂医疗推理实现测试时扩展。在每一轮中,代理将候选响应之间的语义冲突转化为可执行的查询以检索外部证据,同时优化历史推理轨迹以缓解长上下文退化。MA-RAG扩展了自我一致性原则,将不一致性用作多轮代理推理与检索的主动信号,并类似于一种提升(boosting)机制,通过迭代最小化残差误差来达成稳定、高保真的医疗共识。在7个医疗问答基准上的广泛评估表明,MA-RAG始终优于有竞争力的推理时扩展基线和RAG基线,平均准确率比骨干模型高6.8个百分点。我们的代码可在 https://github.com/NJU-RL/MA-RAG 获取。

英文摘要

Large Language Models (LLMs) exhibit high reasoning capacity in medical question-answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic conflict among candidate responses into actionable queries to retrieve external evidence, while optimizing historical reasoning traces to mitigate long-context degradation. MA-RAG extends the self-consistency principle by using the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors a boosting mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical consensus. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering a substantial average accuracy gain of 6.8 points over the backbone model. Our code is available at https://github.com/NJU-RL/MA-RAG.

2603.12453 2026-06-09 cs.CL 版本更新

CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection

CSE-UOI 参加 SemEval-2026 任务6:面向政治回避检测的两阶段异构集成与审慎复杂度门控

Christos Tzouvaras, Konstantinos Skianis, Athanasios Voulodimos

发表机构 * University of Ioannina(约阿尼纳大学) National Technical University of Athens(雅典国立技术大学)

AI总结 本文提出一种双阶段异构集成方法,结合自我一致性与加权投票,以及新颖的后处理修正机制Deliberative Complexity Gating,用于政治逃避检测,最终在评估集上获得0.85的Macro-F1分数。

详情
AI中文摘要

本文描述了我们参加SemEval-2026任务6的系统,该任务将政治访谈中回应的清晰度分为三类:清晰回复(Clear Reply)、模棱两可(Ambivalent)和清晰不回复(Clear Non-Reply)。我们提出了一种由两个异构大语言模型(LLM)组成的集成方法,结合自我一致性(SC)和加权投票,并提出了一种新的事后修正机制,即审慎复杂度门控(Deliberative Complexity Gating,DCG)。该机制利用跨模型行为信号,并基于如下发现:LLM响应长度的代理指标与样本的模糊性高度相关。为进一步考察改进模糊性检测的机制,我们评估了多智能体辩论,将其作为提升审慎能力的另一种策略。DCG利用跨模型行为信号自适应地门控推理,而辩论只增加智能体数量,并不增加模型多样性。我们的方案在评估集上取得了0.85的Macro-F1分数,获得第三名,并与公布的第二高分持平。

英文摘要

This paper describes our system for SemEval-2026 Task 6, which classifies the clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place and tying the second-best reported score.
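
The combination of self-consistency and weighted voting mentioned in this abstract can be sketched as follows. This is an illustrative reading only: the model names, weights, and the rule of weighting each model's label shares are assumptions, and the sketch does not include the DCG correction step.

```python
from collections import Counter
from typing import Dict, List, Tuple

LABELS = ("Clear Reply", "Ambivalent", "Clear Non-Reply")


def self_consistency_shares(samples: List[str]) -> Dict[str, float]:
    """Share of each label among one model's sampled answers (samples must be non-empty)."""
    counts = Counter(samples)
    return {label: counts[label] / len(samples) for label in LABELS}


def weighted_ensemble(
    samples_by_model: Dict[str, List[str]], weights: Dict[str, float]
) -> Tuple[str, Dict[str, float]]:
    """Sum each model's label shares, scaled by a per-model weight, and pick the top label."""
    scores = {label: 0.0 for label in LABELS}
    for model, samples in samples_by_model.items():
        shares = self_consistency_shares(samples)
        for label in LABELS:
            scores[label] += weights[model] * shares[label]
    return max(LABELS, key=lambda label: scores[label]), scores


# Hypothetical sampled outputs from two different LLMs for one interview answer.
samples_by_model = {
    "model_a": ["Clear Reply", "Clear Reply", "Ambivalent"],
    "model_b": ["Ambivalent", "Ambivalent", "Ambivalent"],
}
label, scores = weighted_ensemble(samples_by_model, {"model_a": 0.6, "model_b": 0.4})
```

Here `model_a` leans toward Clear Reply, but the unanimous samples from `model_b` tip the weighted vote to Ambivalent.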

2604.08849 2026-06-09 cs.CL cs.AI cs.DB cs.MA cs.SC 版本更新

SatIR: Scalable High-Recall Constraint-Satisfaction-Based Information Retrieval for Clinical Trials Matching

SatIR:面向临床试验匹配的可扩展、高召回率、基于约束满足的信息检索

Cyrus Zhou, Yufei Jin, Yilin Xu, Yu-Chiang Wang, Chieh-Ju Chao, Monica S. Lam

发表机构 * Department of Computer Science, Stanford University(斯坦福大学计算机科学系) Samueli Electrical and Computer Engineering, UCLA(UCLA Samueli电气与计算机工程系) Department of Computer Science and Informatics, Emory University(埃默里大学计算机科学与信息学系) Mayo Clinic(梅奥诊所)

AI总结 SatIR通过将临床试验资格条件和摘要转化为形式约束,结合SMT、关系代数和大语言模型,提升了临床试验匹配的召回率和效率,优于基于相似度的基线方法。

详情
AI中文摘要

许多重要的检索问题不仅仅是语义相似性问题,更是约束满足问题:被检索的项目既要与查询主题相关,又要满足涉及否定、时间条件、数值阈值、例外、本体关系和不完整证据的显式要求。我们在临床试验匹配中研究这一挑战,这是一个高风险的试验场:有用的试验必须既针对患者的医疗需求,又满足复杂的入组资格标准。我们提出了SatIR,一种用于临床试验匹配的可扩展、基于约束的检索方法。SatIR将试验资格标准和摘要转换为形式约束,然后在数据库上执行这些约束来检索患者-试验对。系统结合了可满足性模理论(SMT)、关系代数、医学本体对齐和大语言模型(LLM):形式方法提供可执行且可检查的匹配,而LLM将模糊、不完整和隐含的临床信息转换为显式、可控的约束表示。在SIGIR 2016患者-试验集合和源自TREC 2022的基准TREC-2022-RetrievalSubset上,SatIR在资格感知检索方面始终优于基于相似度的基线方法。与TrialGPT式检索相比,SatIR在SIGIR 2016上为每名患者多检索出32%至72%的相关且合格的试验,并在TREC-2022-RetrievalSubset上取得1.8至3.2倍的合格试验召回率。检索速度很快:在3,621个SIGIR试验上,每名患者仅需146毫秒。

英文摘要

Many important retrieval problems are not merely problems of semantic similarity, but problems of constraint satisfaction: a retrieved item should be topically relevant to a query and satisfy explicit requirements involving negation, temporal conditions, numeric thresholds, exceptions, ontological relations, and incomplete evidence. We study this challenge in clinical trial matching, a high-stakes test bed where a useful trial must both address a patient's medical needs and satisfy complex eligibility criteria. We propose SatIR, a scalable constraint-based retrieval method for clinical trial matching. SatIR converts trial eligibility criteria and summaries into formal constraints, then retrieves patient-trial pairs by executing these constraints over a database. The system combines Satisfiability Modulo Theories (SMT), relational algebra, medical ontology grounding, and large language models (LLMs): formal methods provide executable and inspectable matching, while LLMs convert ambiguous, incomplete, and implicit clinical information into explicit, controllable constraint representations. Across the SIGIR 2016 patient-trial collection and TREC-2022-RetrievalSubset, a benchmark derived from TREC 2022, SatIR consistently improves eligibility-aware retrieval over similarity-based baselines. Relative to TrialGPT-style retrieval, SatIR retrieves 32% to 72% more relevant-and-eligible trials per patient on SIGIR 2016 and achieves 1.8 to 3.2 times higher eligible-trial recall on TREC-2022-RetrievalSubset. Retrieval is fast, requiring only 146 milliseconds per patient over 3,621 SIGIR trials.
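
To make the idea of retrieval as constraint execution concrete, here is a small sketch with numeric thresholds, negation, and incomplete evidence. It uses plain Python predicates with three-valued results rather than SMT or relational algebra, and every field name, trial name, and the "keep undetermined trials" policy is an assumption for illustration, not SatIR's actual representation.

```python
from typing import Callable, Dict, List, Optional

Patient = Dict[str, object]
# A constraint returns True (satisfied), False (violated), or None (evidence missing).
Constraint = Callable[[Patient], Optional[bool]]


def at_least(field: str, threshold: float) -> Constraint:
    """Numeric threshold, e.g. 'age >= 18'."""
    def check(patient: Patient) -> Optional[bool]:
        value = patient.get(field)
        return None if value is None else value >= threshold
    return check


def absent(field: str) -> Constraint:
    """Negated criterion, e.g. 'no prior chemotherapy'."""
    def check(patient: Patient) -> Optional[bool]:
        value = patient.get(field)
        return None if value is None else not value
    return check


def evaluate(patient: Patient, constraints: List[Constraint]) -> Optional[bool]:
    """False if any constraint is violated, None if any is undetermined, else True."""
    results = [constraint(patient) for constraint in constraints]
    if any(r is False for r in results):
        return False
    if any(r is None for r in results):
        return None
    return True


def retrieve(patient: Patient, trials: Dict[str, List[Constraint]]) -> List[str]:
    """Keep trials that are satisfied or undetermined; drop only proven violations."""
    return [name for name, cs in trials.items() if evaluate(patient, cs) is not False]


patient = {"age": 54, "prior_chemo": False}  # no eGFR value recorded
trials = {
    "T1": [at_least("age", 18), absent("prior_chemo")],
    "T2": [at_least("age", 65)],
    "T3": [at_least("egfr", 60)],
}
matched = retrieve(patient, trials)
```

Keeping undetermined trials (such as `T3`, where the lab value is missing) favors recall over precision; a stricter policy would return only trials that evaluate to `True`.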

2605.17301 2026-06-09 cs.CL cs.AI 版本更新

ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation

ConflictRAG: 检测和解决检索增强生成中的知识冲突

Chenyu Wang, Yueyuan Li, Yingmin Liu, Yang Shu

发表机构 * Zhejiang University(浙江大学)

AI总结 本研究提出ConflictRAG框架,通过两阶段冲突检测模块、熵-TOPSIS框架和冲突感知RAG评分,有效检测和解决检索增强生成中的知识冲突,实验表明其在冲突检测F1和正确性方面优于现有方法。

Comments 6 pages, 6 figures, submitted to IEEE SMC 2026

详情
AI中文摘要

检索增强生成(RAG)系统隐式假设检索到的文档相互一致,而这一假设在实践中经常失效。我们提出了ConflictRAG,一种冲突感知的RAG框架,能够在生成答案之前检测、分类并解决知识冲突。该框架有三项贡献:(1)一个两阶段冲突检测模块,将基于嵌入的轻量级MLP分类器与选择性LLM细化相结合,使API成本降低62%,同时保持90.8%的检测准确率;(2)一个熵-TOPSIS框架,用于数据驱动的来源可信度评估,选择准确率比人工启发式方法提高7.1%;(3)一个冲突感知RAG评分(CARS),用于诊断性评估冲突处理能力。在三个基准上与六个基线的对比实验表明,该方法的冲突检测F1达到88.7%,正确性比最强的冲突感知基线稳定提高5.3%至6.1%,且该流程能够有效迁移到不同的骨干LLM。

英文摘要

Retrieval-Augmented Generation (RAG) systems implicitly assume mutual consistency among retrieved documents, an assumption that frequently fails in practice. We present ConflictRAG, a conflict-aware RAG framework that detects, classifies, and resolves knowledge conflicts prior to answer generation. The framework introduces three contributions: (1) a two-stage conflict detection module combining a lightweight embedding-based MLP classifier with selective LLM refinement, reducing API costs by 62% while maintaining 90.8% detection accuracy; (2) an Entropy-TOPSIS framework for data-driven source credibility assessment, improving selection accuracy by 7.1% over manual heuristics; and (3) a Conflict-Aware RAG Score (CARS) for diagnostic evaluation of conflict-handling capabilities. Experiments on three benchmarks against six baselines demonstrate 88.7% conflict-detection F1 and consistent 5.3% to 6.1% correctness gains over the strongest conflict-aware baseline, with the pipeline transferring effectively across backbone LLMs.
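
For readers unfamiliar with Entropy-TOPSIS, the sketch below shows the textbook version of the technique: criterion weights come from the entropy weight method, and sources are then ranked by closeness to an ideal solution. The three credibility criteria and their values are invented for illustration, all criteria are assumed to be positive and "higher is better", and nothing here is taken from the ConflictRAG implementation.

```python
import math
from typing import List


def entropy_weights(matrix: List[List[float]]) -> List[float]:
    """Entropy weight method: criteria whose values vary more across sources weigh more."""
    m, n = len(matrix), len(matrix[0])
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        probs = [x / total for x in column]
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(m)
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    if norm < 1e-12:  # every criterion is uniform: fall back to equal weights
        return [1.0 / n] * n
    return [d / norm for d in divergences]


def topsis(matrix: List[List[float]], weights: List[float]) -> List[float]:
    """Closeness of each row to the ideal solution (1 = best possible, 0 = worst)."""
    n = len(weights)
    col_norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / col_norms[j] for j in range(n)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(n)]
    worst = [min(row[j] for row in weighted) for j in range(n)]
    scores = []
    for row in weighted:
        d_best, d_worst = math.dist(row, best), math.dist(row, worst)
        denom = d_best + d_worst
        scores.append(d_worst / denom if denom > 0 else 0.5)
    return scores


# Rows: three retrieved sources. Columns (hypothetical): authority, recency, agreement.
sources = [
    [0.9, 0.8, 0.7],
    [0.5, 0.4, 0.6],
    [0.2, 0.3, 0.1],
]
weights = entropy_weights(sources)
credibility = topsis(sources, weights)
```

The first source dominates on every criterion and receives a closeness of 1, the third is worst on every criterion and receives 0, and the second falls in between.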

2605.28831 2026-06-09 cs.CL cs.AI 版本更新

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

S3Mem:用于长时域交互式问答的结构化时空场景-事件记忆

Encheng Su, Jianyu Wu, Jinouwen Zhang, Qiucheng Yu, Chen Tang, Pengze Li, Lintao Wang, Aoran Wang, Xinzhu Ma, Shixiang Tang, Yizhou Wang, Houqiang Li

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室) City University of Hong Kong(香港城市大学) The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) The University of Sydney(悉尼大学) Beihang University(北京航空航天大学)

AI总结 提出S3MEM框架,通过结构化场景-事件记忆和锚点敏感检索,在长时域交互式问答中实现比通用记忆接口更优的准确率-效率平衡。

详情
AI中文摘要

长时域交互代理通常积累大量轨迹历史,但仍无法可靠地回答关于早期事件的问题。我们认为主要瓶颈不仅是上下文长度,而是长期记忆的轨迹到答案接口。当历史以纯文本块存储并使用标准检索增强生成(RAG)查询时,系统通常检索到局部相关但链不完整的证据,特别是对于空间、时间、重复事件和多跳状态问题。我们提出S3MEM,一种用于长时域交互式问答(QA)的结构化场景-事件情节记忆框架。S3MEM将轨迹写入结构化记忆单元,通过锚点敏感检索检索证据,并为答案时间推理提供紧凑的令牌预算感知证据接口。从这个意义上说,S3MEM是一种结构化证据利用工具,将代理轨迹转换为查询对齐的支持。我们在两个内部标题环境(Crafter、Jericho)和两个外部环境(SciWorld、ALFWorld)上评估S3MEM。在共享的冻结答案时间协议下,S3MEM在所有四个环境中一致优于Vanilla RAG,在Crafter、Jericho和ALFWorld上超过Graph-NoReader,在SciWorld上与之匹配,同时使用的证据令牌显著减少。三个改编的近期基线——A-MEM启发、MemoryOS改编和LightMem改编——在多个设置中优于Vanilla RAG,但没有一个达到S3MEM的整体准确率-效率前沿。总体而言,证据支持一个有限的结论:在当前冻结的答案时间协议下,结构化写入和锚点敏感证据路由为长时域交互式QA提供了比通用记忆接口更强的准确率-效率前沿。

英文摘要

Long-horizon memory question answering often requires sparse evidence from heterogeneous histories, including events, object states, visual observations, temporal relations, and causal steps. Existing memory interfaces expand reader context, retrieve semantically related chunks, or expose graph neighborhoods, but they are not explicitly designed to select compact evidence for a fixed reader. We propose Structured Spatiotemporal Scene-Event Memory (S3Mem), a query-time memory interface that writes textual, visual, and agent-use histories into structured scene-event units and routes compact evidence packs to the reader. Its router scores candidate units, query anchors, and anchor-support links, enabling both single-hop selection and short multi-hop evidence chains without reader fine-tuning or test-time training. Across LoCoMo, EMemBench Visual Games, and AMA-Bench, S3Mem provides a strong score-token trade-off, with the clearest gains on localized event, state, temporal, causal, or provenance evidence. On LoCoMo, S3Mem reaches 0.48 F1 and 0.40 BLEU with 1,073 evidence tokens per question, about 15.8× fewer than the LoCoMo reference. On EMemBench Visual Games, it obtains the best F1 and second-best accuracy with only 189 tokens. On AMA-Bench, it is not the highest-scoring method, but remains competitive while using the fewest reader-visible evidence tokens.

2603.24925 2026-06-09 cs.LG cs.CL cs.IR 版本更新

GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation

GraphER: 一种高效的基于图的增强和重排序方法用于检索增强生成

Ruizhong Miao, Yuying Wang, Rongguang Wang, Chenyang Li, Tao Sheng, Sujith Ravi, Dan Roth

发表机构 * Oracle AI

AI总结 GraphER通过利用数据组织结构捕捉超越语义相似性的接近关系,构建查询时的图结构并应用图排序技术,提升检索完整性,无需额外图基础设施,兼容标准向量存储。

详情
AI中文摘要

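依赖语义搜索的检索增强生成(RAG)系统往往无法为复杂查询检索到完整的证据集,当信息分散在多个来源时尤其如此。现有方法要么依赖迭代式的智能体检索,效率可能较低;要么维护知识图谱等额外结构,带来存储和维护开销。本文提出GraphER,一个基于图的增强与重排序框架,它(1)利用数据的组织结构捕捉超越语义相似性的邻近关系,(2)基于这些邻近关系在查询时构建图,(3)应用基于图的排序来选出最靠前的候选文档。在表格检索、多跳检索和长文档检索基准上的实验表明,该方法在检索完整性方面取得了一致的提升。此外,GraphER无需额外的图基础设施,可与标准向量存储无缝集成。该框架与检索器无关,支持多种形式的邻近关系,且引入的查询时延迟极小。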

英文摘要

Retrieval-augmented generation (RAG) systems that rely on semantic search often fail to retrieve the complete set of evidence for complex queries, particularly when information is distributed across multiple sources. Existing approaches either rely on iterative agentic retrieval, which can be inefficient, or maintain additional structures such as knowledge graphs, which introduce storage and maintenance overhead. In this paper, we propose GraphER, a graph-based enrichment and reranking framework that (1) leverages the organizational structure of data to capture proximity relationships beyond semantic similarity, (2) constructs a graph at query time based on these proximities, and (3) applies graph-based ranking to surface the top candidate documents. Experiments across table retrieval, multi-hop retrieval, and long-document retrieval benchmarks demonstrate consistent improvements in terms of retrieval completeness. Additionally, GraphER requires no additional graph infrastructure and integrates seamlessly with standard vector stores. The framework is retriever-agnostic, supports multiple forms of proximity, and introduces minimal query-time latency.
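
The three steps in this abstract (proximity relations, a graph built at query time, graph-based ranking) can be sketched with a small example. The abstract does not name the ranking algorithm, so personalized PageRank is used here purely as an assumed stand-in; the chunk names, proximity links, and seed scores are hypothetical.

```python
from typing import Dict, List


def personalized_pagerank(
    neighbors: Dict[str, List[str]],
    seed_scores: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power iteration whose restart distribution follows the retriever's seed scores.

    Every neighbor must also be a key of `neighbors`, and at least one seed score
    must be positive.
    """
    nodes = list(neighbors)
    total = sum(seed_scores.get(n, 0.0) for n in nodes)
    restart = {n: seed_scores.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = neighbors[n]
            if out:
                share = damping * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling node: hand its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return rank


# Query-time graph: chunk "a" was returned by vector search; "b" sits next to it in
# the same document (a proximity link); "c" is unrelated and was not retrieved.
neighbors = {"a": ["b"], "b": ["a"], "c": []}
ranks = personalized_pagerank(neighbors, seed_scores={"a": 1.0})
```

Chunk `b` receives a non-trivial score even though the retriever never returned it, which is the enrichment effect: structural proximity surfaces evidence that semantic similarity alone missed.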

2603.29875 2026-06-09 cs.IR cs.AI cs.CL 版本更新

UnWeaving the knots of GraphRAG -- turns out VectorRAG is almost enough

解开图式RAG的结——事实证明向量RAG几乎足够

Ryszard Tuora, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Mateusz Czyżnikiewicz, Adam Kozakiewicz, Tomasz Ziętkiewicz

发表机构 * Samsung AI Warsaw(三星AI华沙)

AI总结 本文提出UnWeaver框架,通过LLM解构文档内容为跨chunk的实体,提升检索和生成的准确性与效率,实验表明向量RAG在成本上优于图式RAG。

详情
AI中文摘要

检索增强生成(RAG)系统的一个关键问题在于,基于片段的检索流程将源片段视为原子对象,把片段内的信息混合成单一向量。这些向量表示被当作孤立、独立且自足的对象,没有尝试表示它们之间可能存在的关系。这类方法缺乏处理多跳问题的专用机制。基于图的RAG系统试图通过将信息建模为知识图谱来缓解这一问题:实体由节点表示,通过稳健的关系相互连接,并形成层次化社区。然而,这种方法也有自身的问题,包括为创建图索引而使组件复杂度增加若干数量级,以及检索依赖启发式方法。我们提出UnWeaver,一种简化GraphRAG思想的新型RAG框架。UnWeaver利用LLM将文档内容解构为可跨多个片段出现的实体。在检索过程中,实体被用作恢复原始文本片段的中间途径,从而保持对源材料的忠实。我们认为,基于实体的分解能得到对原始信息更精炼的表示,并有助于减少索引和生成过程中的噪声。此外,我们通过实验表明,在端到端QA评估中,VectorRAG的表现优于标准GraphRAG,并且几乎与当前最先进的基于图的方案相当,而成本仅为后者的一小部分。

英文摘要

One of the key problems in retrieval-augmented generation (RAG) systems is that chunk-based retrieval pipelines represent the source chunks as atomic objects, mixing the information contained within each chunk into a single vector. These vector representations are then treated as isolated, independent, and self-sufficient, with no attempt to represent possible relations between them. Such an approach has no dedicated mechanism for handling multi-hop questions. Graph-based RAG systems aim to ameliorate this problem by modeling information as knowledge graphs, in which entities are represented by nodes, connected by robust relations, and grouped into hierarchical communities. This approach, however, has its own issues, including an orders-of-magnitude increase in component complexity to build graph-based indices and a reliance on heuristics for retrieval. We propose UnWeaver, a novel RAG framework that simplifies the idea of GraphRAG. UnWeaver uses an LLM to disentangle the contents of documents into entities that can occur across multiple chunks. During retrieval, entities serve as an intermediate step for recovering the original text chunks, preserving fidelity to the source material. We argue that entity-based decomposition yields a more distilled representation of the original information and reduces noise in indexing and generation. Furthermore, we show experimentally that, in end-to-end QA evaluation, VectorRAG performs better than standard GraphRAG and almost as well as current SOTA graph-based solutions, for a fraction of the cost.

2604.26176 2026-06-09 cs.DB cs.CL 版本更新

CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering

CacheRAG:面向知识图谱问答中检索增强生成的语义缓存系统

Yushi Sun, Lei Chen

发表机构 * HKUST Hong Kong China(香港科技大学(中国香港)) HKUST(GZ) / HKUST Guangzhou / Hong Kong China (2018)(香港科技大学(广州)/ 香港科技大学(广州)/ 中国香港(2018))

AI总结 针对LLM驱动的KGQA系统作为无状态规划器导致模式幻觉和检索覆盖有限的问题,提出CacheRAG,一种基于缓存的架构,通过模式无关接口、多样性优化缓存检索和有界启发式扩展,将无状态规划器转变为持续学习器,显著提升准确性和真实性。

详情
AI中文摘要

大型语言模型(LLMs)与检索增强生成(RAG)的集成显著推进了知识图谱问答(KGQA)。然而,现有的LLM驱动的KGQA系统作为无状态规划器,孤立地生成检索计划而不利用历史查询模式:类似于一个没有计划缓存的数据库系统,从头优化每个查询。这一基本设计缺陷导致模式幻觉和有限的检索覆盖。我们提出CacheRAG,一种面向基于LLM的KGQA的系统化缓存增强架构,将无状态规划器转变为持续学习器。与传统的数据库计划缓存(优化频率)不同,CacheRAG引入了三种针对LLM上下文定制的新设计原则:(1)模式无关用户界面:通过中间语义表示(ISR)的两阶段语义解析框架使非专家用户能够纯粹以自然语言交互,同时后端适配器将LLM与本地模式上下文结合,安全地编译可执行的物理查询。(2)多样性优化的缓存检索:两层层次索引(领域→方面)结合最大边际相关性(MMR)最大化缓存示例的结构多样性,有效缓解推理同质性。(3)有界启发式扩展:具有严格复杂度保证的确定性深度和广度子图操作符显著提升检索召回率,而无需冒无界API执行的风险。在多个基准上的广泛实验表明,CacheRAG显著优于最先进的基线(例如,在CRAG数据集上准确率提升13.2%,真实性提升17.5%)。

英文摘要

The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answering (KGQA). However, existing LLM-driven KGQA systems act as stateless planners, generating retrieval plans in isolation without exploiting historical query patterns: analogous to a database system that optimizes every query from scratch without a plan cache. This fundamental design flaw leads to schema hallucinations and limited retrieval coverage. We propose CacheRAG, a systematic cache-augmented architecture for LLM-based KGQA that transforms stateless planners into continual learners. Unlike traditional database plan caching (which optimizes for frequency), CacheRAG introduces three novel design principles tailored for LLM contexts: (1) Schema-agnostic user interface: A two-stage semantic parsing framework via Intermediate Semantic Representation (ISR) enables non-expert users to interact purely in natural language, while a Backend Adapter grounds the LLM with local schema context to compile executable physical queries safely. (2) Diversity-optimized cache retrieval: A two-layer hierarchical index (Domain → Aspect) coupled with Maximal Marginal Relevance (MMR) maximizes structural variety in cached examples, effectively mitigating reasoning homogeneity. (3) Bounded heuristic expansion: Deterministic depth and breadth subgraph operators with strict complexity guarantees significantly enhance retrieval recall without risking unbounded API execution. Extensive experiments on multiple benchmarks demonstrate that CacheRAG significantly outperforms state-of-the-art baselines (e.g., +13.2% accuracy and +17.5% truthfulness on the CRAG dataset).
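
Maximal Marginal Relevance (MMR), which the abstract pairs with the hierarchical index, greedily picks items that are relevant to the query but dissimilar to items already chosen. Below is a minimal sketch of standard MMR over toy 2-D embeddings; the plan names, vectors, and the trade-off value `lam` are illustrative assumptions and say nothing about how CacheRAG stores or embeds its cached plans.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(
    query: Sequence[float], cached: Dict[str, Sequence[float]], k: int, lam: float = 0.5
) -> List[str]:
    """Greedy MMR: score = lam * relevance - (1 - lam) * max similarity to selected items."""
    selected: List[str] = []
    remaining = dict(cached)
    while remaining and len(selected) < k:
        def mmr_score(name: str) -> float:
            relevance = cosine(query, remaining[name])
            redundancy = max(
                (cosine(remaining[name], cached[s]) for s in selected), default=0.0
            )
            return lam * relevance - (1.0 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical cached plans: plan_1 and plan_2 are near-duplicates, plan_3 differs.
cached = {"plan_1": [1.0, 0.1], "plan_2": [1.0, 0.12], "plan_3": [0.7, -0.7]}
diverse = mmr_select([1.0, 0.0], cached, k=2, lam=0.5)
relevance_only = mmr_select([1.0, 0.0], cached, k=2, lam=1.0)
```

With `lam=0.5` the second slot goes to the structurally different `plan_3`; with `lam=1.0` the redundancy term vanishes and the near-duplicate `plan_2` is chosen instead.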

4. 对话系统与智能体 27 篇

2606.07893 2026-06-09 cs.CL 新提交

Beyond Individual Personas: Aligning Synthetic Dialogue to Population-Level Behavior Distributions

超越个体角色:将合成对话对齐到群体层面的行为分布

Xinyi Liu, Rinat Khaziev, Hooshang Nayyeri, Emine Yilmaz, Charith Peris, Hari Thadakamalla

发表机构 * Amazon(亚马逊) University of Illinois Urbana–Champaign(伊利诺伊大学厄巴纳-香槟分校) University College London(伦敦大学学院)

AI总结 提出GroupPersona框架,通过将参考语料库的行为分布转化为生成控制,使合成对话语料库在群体层面与参考分布对齐,在12个行为属性上Jensen-Shannon散度降低24.4%。

详情
AI中文摘要

合成对话语料库越来越多地被用作目标对话数据的代理,然而基于角色的生成器优化的是个体对话而非语料库组成,导致产生局部合理的对话但群体层面的行为混合失真。我们引入GroupPersona框架,该框架将合成对话语料库对齐到参考语料库的行为分布。GroupPersona将群体统计转化为生成控制:它将每个对话的核心行为特征与可预测的副作用分离,并利用由此产生的行为组来根据定义参考群体的交互模式调节用户代理。我们在四个跨越两种对话来源(助手风格和Reddit衍生)的语料库上评估GroupPersona,并采用两种构建变体:结构保持和变体增强。相对于最强的平均基线,GroupPersona将合成分布与参考分布在12个行为属性上的Jensen-Shannon散度从0.234降低到0.177,降低了24.4%,并且在所有四个语料库上达到最佳或并列最佳,同时保持结构对齐。它还在参考对话质量分数的校准上达到最接近,将参考对话轮廓的平均绝对偏差降低到0.63,而次优基线为0.91。

英文摘要

Synthetic dialogue corpora are increasingly used as proxies for target dialogue data, yet persona-grounded generators optimize individual conversations rather than corpus composition, yielding locally plausible dialogues with distorted population-level behavior mixes. We introduce GroupPersona, a framework that aligns synthetic dialogue corpora to the behavior distribution of a reference corpus. GroupPersona turns population statistics into generation controls: it separates each dialogue's core behavioral signature from predictable side effects, and uses the resulting behavioral groups to condition user agents on the interaction patterns that define the reference population. We evaluate GroupPersona on four corpora crossing two dialogue sources, assistant-style and Reddit-derived, with two construction variants: structure-preserving and variation-enhanced. GroupPersona lowers Jensen-Shannon divergence between synthetic and reference distributions over 12 behavior attributes from 0.234 to 0.177 relative to the strongest average baseline, a 24.4% reduction, and is best or tied-best on all four corpora while preserving structural alignment. It also achieves the closest calibration to reference-conversation quality scores, reducing mean absolute deviation from the reference-conversation profile to 0.63 versus 0.91 for the next-best baseline.
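
For readers unfamiliar with the headline metric, the following is a minimal sketch of Jensen-Shannon divergence between two discrete distributions, plus the arithmetic behind the reported 24.4% reduction. The example distributions are hypothetical, and the log base and any per-attribute averaging used in the paper are not stated in the abstract; base 2 is assumed here.

```python
from math import log2


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(a * log2(a / b) for a, b in zip(p, q) if a > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with base-2 logs."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical distributions of one behavior attribute over three categories.
reference = [0.5, 0.3, 0.2]
synthetic = [0.7, 0.2, 0.1]
jsd = js_divergence(reference, synthetic)

# Relative reduction implied by the reported averages (0.234 -> 0.177).
relative_reduction = (0.234 - 0.177) / 0.234
```

The relative reduction evaluates to roughly 0.244, which matches the 24.4% figure quoted in the abstract.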

2606.08348 2026-06-09 cs.CL 新提交

Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

Bayesian-Agent:面向LLM Agent框架的后验引导技能演化

Xiaojun Wu, Cehao Yang, Honghao Liu, Xueyuan Lin, Wenjie Zhang, Zhichao Shi, Xuhui Jiang, Chengjin Xu, Jia Li, Jian Guo

发表机构 * IDEA Research(IDEA研究院) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) DataArcTech Ltd.(DataArcTech有限公司)

AI总结 提出Bayesian-Agent框架,将可复用技能视为假设,通过后验分布指导技能演化(如修补、拆分、压缩等),在多个基准上显著提升性能,表明Agent技能演化应视为后验引导的框架优化。

Comments 15 pages, 6 figures

详情
AI中文摘要

LLM agent越来越依赖外部推理条件:提示、工具、记忆、SOP、技能和框架反馈。这些资产可以在不改变模型权重的情况下改进任务执行,但通常通过启发式反思或重用观察到的成功和失败来修订,仿佛计数本身是可靠的信念。我们引入了Bayesian-Agent,一个原生且跨框架的框架,将可重用技能和SOP视为关于冻结模型在特定提示、上下文和框架环境下是否会成功的假设。Bayesian-Agent记录经过验证的轨迹证据,维护每个技能的特征条件分类后验,并将后验状态映射为可检查的动作,如修补、拆分、压缩、退役和探索。面向模型的提示获得可执行的防护栏和故障模式修补,而后验摘要仍可用于审计。使用deepseek-v4-flash,增量修复将SOP-Bench从80%提升到95%,Lifelong AgentBench从90%提升到100%,RealFin-Bench从45%提升到65%。我们进一步评估了Bayesian-Agent的原生后端以及可选的GenericAgent、mini-swe-agent和Claude Code后端。结果包括正面、负面、饱和和案例研究设置,表明Agent技能演化最好被视为后验引导的框架优化,而非未校准的提示积累。源代码可在https://github.com/DataArcTech/Bayesian-Agent获取。

英文摘要

LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief. We introduce Bayesian-Agent, a native and cross-harness framework that treats reusable skills and SOPs as hypotheses about whether a frozen model will succeed under a particular prompt, context, and harness environment. Bayesian-Agent records verified trajectory evidence, maintains a feature-conditioned categorical posterior over each skill, and maps posterior state into inspectable actions such as patch, split, compress, retire, and explore. Model-facing prompts receive executable guardrails and failure-mode patches, while posterior summaries remain available for audit. With deepseek-v4-flash, incremental repair improves SOP-Bench from 80% to 95%, Lifelong AgentBench from 90% to 100%, and RealFin-Bench from 45% to 65%. We further evaluate Bayesian-Agent's native backend and optional GenericAgent, mini-swe-agent, and Claude Code backends. The results include positive, negative, saturated, and case-study settings, suggesting that agent skill evolution is best viewed as posterior-guided harness optimization rather than uncalibrated prompt accumulation. The source code is available at https://github.com/DataArcTech/Bayesian-Agent.
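
To make the idea of a categorical posterior driving inspectable actions concrete, here is a minimal sketch using a Dirichlet posterior over outcome categories. It is not the Bayesian-Agent implementation: the outcome categories, the uniform prior, the thresholds, and the `keep` action are all assumptions, and feature conditioning is omitted (one posterior per skill only).

```python
from dataclasses import dataclass, field

OUTCOMES = ("success", "partial", "failure")  # hypothetical outcome categories


@dataclass
class SkillPosterior:
    # Dirichlet pseudo-counts; 1.0 per category is a uniform prior (assumed).
    alpha: dict = field(default_factory=lambda: {o: 1.0 for o in OUTCOMES})

    def update(self, outcome):
        """Record one verified trajectory outcome."""
        self.alpha[outcome] += 1.0

    def mean(self, outcome):
        """Posterior mean probability of the given outcome."""
        return self.alpha[outcome] / sum(self.alpha.values())

    def evidence(self):
        """Number of observations recorded beyond the prior."""
        return sum(self.alpha.values()) - len(OUTCOMES)


def choose_action(post, min_evidence=5, retire_below=0.2, patch_below=0.6):
    """Map posterior state to an action; all thresholds are illustrative."""
    if post.evidence() < min_evidence:
        return "explore"  # too little evidence to trust the posterior
    p_success = post.mean("success")
    if p_success < retire_below:
        return "retire"
    if p_success < patch_below:
        return "patch"
    return "keep"


skill = SkillPosterior()
for outcome in ["success", "failure", "failure", "success", "failure", "failure"]:
    skill.update(outcome)
action = choose_action(skill)
```

The point of the sketch is the contrast drawn in the abstract: the action depends on a posterior with an explicit prior and evidence count, not on raw success and failure tallies alone.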

2606.08656 2026-06-09 cs.CL 新提交

From Player to Master: Enhancing Test-Time Learning of LLM Agents via Reinforcement Learning over Memory

从玩家到大师:通过基于记忆的强化学习增强LLM代理的测试时学习

Yishuo Cai, Xingyu Guo, Xuancheng Huang, Jinhua Du, Can Huang, Wenxuan Huang, Wenhan Ma, Yuyang Hu, Aohan Zeng, Jie Tang, Xu Sun

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MemoPilot,一种通过多轮GRPO训练记忆更新过程来优化冻结LLM代理在测试时学习的方法,在石头剪刀布和有限注德州扑克两种博弈中均取得最高Elo评分。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLM)代理越来越多地部署在长期运行的环境中,在这些环境中,通过测试时的经验进行改进变得重要。一种常见的方法是在每次交互后更新显式记忆以指导未来的决策。然而,大多数现有方法依赖于手工设计的提示规则,这使得在多步时间跨度内难以使记忆更新与下游目标保持一致。我们提出MemoPilot,一种即插即用的记忆副驾驶,它显式地训练记忆更新过程,以改进冻结的LLM在连续交互中的性能。我们将记忆更新公式化为一个多轮决策问题,并使用多轮GRPO进行端到端优化。我们的训练方案引入了(i)逐轮奖励信号和(ii)跨轨迹的上下文无关、逐轮优势估计,从而在多轮设置中实现更细粒度的信用分配和更稳定的训练。我们在两个测试平台上评估MemoPilot:多轮石头剪刀布(RPS)和有限注德州扑克(LHE)。在这两种环境中,MemoPilot显著改进了冻结玩家在测试时的学习,优于强基线,在两个游戏的Elo评分中均排名第一(LHE为1762,RPS为1590),并优于所有基线记忆方法和专有模型,包括DeepSeek-V3.2。

英文摘要

Large language model (LLM) agents are increasingly deployed in long-running settings where improving through experience at test time becomes important. A common approach is to update an explicit memory after each interaction to guide future decisions. However, most existing methods rely on hand-designed prompting rules, making it difficult to align memory updates with downstream objectives over multi-step horizons consistently. We propose MemoPilot, a plug-in memory copilot that explicitly trains the memory update process to improve a frozen LLM's performance across sequential interactions. We formulate memory updating as a multi-turn decision problem and optimize it end-to-end with multi-turn GRPO. Our training recipe introduces (i) a turn-wise reward signal and (ii) a context-independent, turn-level advantage estimation across rollouts, enabling finer-grained credit assignment and more stable training in multi-turn settings. We evaluate MemoPilot on two testbeds: multi-round Rock-Paper-Scissors (RPS) and Limit Texas Hold'em (LHE). Across both environments, MemoPilot substantially improves test-time learning of a frozen player over strong baselines, ranking first in Elo ratings on both games (1762 on LHE and 1590 on RPS) and outperforming all baseline memory methods and proprietary models, including DeepSeek-V3.2.
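
One plausible reading of the turn-level advantage estimation across rollouts is a GRPO-style group normalization applied per turn index rather than per whole trajectory. The sketch below illustrates that reading only; it is an assumption, not MemoPilot's published estimator, and the reward values are made up.

```python
from statistics import mean, pstdev


def turn_level_advantages(rewards, eps=1e-8):
    """rewards[i][t] is the reward of rollout i at turn t.

    Each turn's rewards are normalized across the rollout group, so a
    rollout is credited turn by turn instead of by its total return.
    """
    num_turns = len(rewards[0])
    advantages = [[0.0] * num_turns for _ in rewards]
    for t in range(num_turns):
        column = [rollout[t] for rollout in rewards]
        mu, sigma = mean(column), pstdev(column)
        for i, r in enumerate(column):
            advantages[i][t] = (r - mu) / (sigma + eps)
    return advantages


# Three hypothetical rollouts, two turns each (e.g. win = 1, loss = 0).
rewards = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
adv = turn_level_advantages(rewards)
```

Normalizing per turn means the second rollout is penalized at turn 0 but treated the same as the first at turn 1, which is the finer-grained credit assignment the abstract refers to.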

2606.08857 2026-06-09 cs.CL 新提交

PaperMentor: A Human-Centered Multi-Agent Writing Tutor for AI Research Papers on Overleaf

PaperMentor:面向Overleaf的AI研究论文写作人本多智能体辅导系统

Jiarui Liu, Terry Jingchen Zhang, Ryan Faulkner, X. Angelo Huang, Vilém Zouhar, Dominik Glandorf, Isabel Dahlgren, Van Q. Truong, Rishit Dagli, Yuen Chen, Felix Leeb, Punya Syon Pandey, Yves Bicker, Suvajit Majumder, Wenyuan Jiang, Zeju Qiu, Sankalan Pal Chowdhury, Bernhard Schölkopf, Mona Diab, Zhijing Jin

发表机构 * CMU(卡内基梅隆大学) Jinesis Lab, University of Toronto & Vector Institute(Jinesis实验室,多伦多大学与向量研究所) EuroSafeAI ETHZ(苏黎世联邦理工学院) EPFL(洛桑联邦理工学院) UIUC(伊利诺伊大学厄巴纳-香槟分校) Max Planck Institute for Intelligent Systems, Tübingen, Germany(马克斯·普朗克智能系统研究所,德国图宾根)

AI总结 提出PaperMentor,一种在Overleaf中提供内联建议的人本写作助手,通过专家技能库和12个专业智能体提供可操作反馈,用户研究中90.6%建议被认为可操作。

Comments Accepted to the ACL 2026 Demo Track

详情
AI中文摘要

来自经验丰富研究者的专业写作反馈对于早期职业学者改进手稿至关重要,然而高质量的反馈往往稀缺,因为审阅研究论文是劳动密集型的。新兴的AI写作助手主要关注语法修正或通过最终分数模拟同行评审,但它们在提供具体、可操作的建议以帮助学生在起草过程中改进论文方面存在不足。我们提出PaperMentor,一个人本写作助手系统,以Overleaf原生内联注释的形式提供可操作建议,同时将实际写作完全留给人类作者。PaperMentor将一个精心整理自资深研究者写作建议的专家技能库与12个专业智能体相结合,这些智能体涵盖论文写作的不同方面,如格式合规性、措辞准确性和术语一致性。在一项用户研究(n=14)中,90.6%的生成评论被评为可操作,67.5%被评为有效,显著优于没有技能库的GPT-5.2基线。我们将PaperMentor作为开源软件发布供公众使用。我们的代码在AGPL-3.0许可下公开于https://github.com/jiarui-liu/overleaf。

英文摘要

Expert writing feedback from experienced researchers is critical for early-career scholars to improve their manuscripts, yet high-quality feedback often remains scarce because reviewing research papers is labor-intensive. Emerging AI-powered writing assistants largely focus on grammar fixes or simulating peer review with final scores, yet they fall short of providing concrete, actionable suggestions that help students improve their papers during drafting. We present PaperMentor, a human-centered writing assistant system that delivers actionable suggestions as Overleaf-native inline comments while leaving the actual writing entirely to human authors. PaperMentor integrates an expert skill library carefully curated from established researchers' writing advice with 12 specialized agents covering different aspects of paper writing, such as formatting compliance, phrasing accuracy, and terminology consistency. In a user study (n=14), 90.6% of the generated comments were rated actionable and 67.5% were rated valid, significantly outperforming a GPT-5.2 baseline without the skill library. We release PaperMentor as open source for public use. Our code is publicly available under the AGPL-3.0 license at https://github.com/jiarui-liu/overleaf.

2606.08938 2026-06-09 cs.CL cs.AI 新提交

PACT: Learning Diverse Diagnostic Strategies via Privileged Synthesis and Branch Consensus

PACT:通过特权合成与分支共识学习多样化诊断策略

Gen Li, Yuanze Hu, Zhichao Yang, Qingchen Yu, Jianwei Lv, Yue Guo, Yujing Liu, Faguo Wu, Hongwei Zheng, Xiandong Li, Bo Yuan, Yifan Sun, Zhaoxin Fan

发表机构 * Beihang University(北京航空航天大学) Baidu(百度) ByteDance(字节跳动) Beijing Academy of Blockchain and Edge Computing(北京区块链与边缘计算研究院) Renmin University of China(中国人民大学)

AI总结 提出PACT框架,通过特权合成对话数据和多分支共识训练,使LLM同时学习多种诊断推理范式,在中文医疗诊断基准上取得最优性能。

Comments 16 pages, 5 figures, 5 tables

详情
AI中文摘要

临床诊断需要在信息不完整的情况下灵活运用多种推理范式。现有的基于LLM的医疗智能体表现出强大的医学推理能力,但单一范式或简单混合的对话监督使得这些范式难以无干扰地学习。我们提出PACT(周期性锚点共识训练),一个将监督的多范式对话合成与基于共识的分支训练相结合的框架。在数据层面,DPS(医生-患者-监督者)利用完整的电子病历(EMR)进行质量控制,同时保持医生代理仅能访问患者可见信息。这产生了四种诊断推理范式下的经过验证的对话,而不会泄露隐藏的临床答案。在训练层面,PACT为每个范式训练一个范式特定的LoRA分支,并通过符号共识定期将分支聚合到共享锚点中。我们进一步构建了一个动态的多轮中文医疗诊断基准用于交互式会诊。实验表明,PACT在诊断结果和会诊过程指标上,与专有、医学专用和任务适应的基线相比,达到了最先进的性能。

英文摘要

Clinical diagnosis requires flexible use of multiple reasoning paradigms under incomplete patient information. Existing LLM-based medical agents show strong medical reasoning ability, but single-paradigm or naively mixed dialogue supervision makes these paradigms difficult to learn without interference. We propose PACT (Periodic Anchor Consensus Training), a framework that couples supervised multi-paradigm dialogue synthesis with consensus-based Branch training. At the data level, DPS (Doctor-Patient-Supervisor) uses complete electronic medical records (EMRs) for quality control while keeping the doctor agent restricted to patient-visible information. This produces validated dialogues under four diagnostic reasoning paradigms without leaking hidden clinical answers. At the training level, PACT trains one paradigm-specific LoRA Branch per paradigm and periodically aggregates Branches into a shared Anchor through sign consensus. We further construct a dynamic multi-turn Chinese medical diagnosis benchmark for interactive consultation. Experiments show that PACT achieves state-of-the-art performance among compared proprietary, medical-specialized, and task-adapted baselines on diagnostic outcome and consultation-process metrics.
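
The abstract does not spell out the sign-consensus rule. A common form of such aggregation elects a sign per parameter coordinate and averages only the branch updates that agree with it; the sketch below shows that generic scheme on flat lists of numbers. The election rule (sign of the summed updates) and the toy deltas are assumptions, not PACT's documented procedure.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_consensus_merge(branch_deltas):
    """Merge per-branch parameter deltas coordinate by coordinate.

    For each coordinate, the elected sign is the sign of the summed deltas
    (a magnitude-weighted vote). Only deltas with that sign are averaged;
    a coordinate with no clear sign is left at zero.
    """
    merged = []
    for coords in zip(*branch_deltas):
        elected = sign(sum(coords))
        agreeing = [c for c in coords if elected != 0 and sign(c) == elected]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


# Hypothetical deltas from three paradigm-specific branches.
deltas = [
    [0.2, -0.1, 0.3],
    [0.4, 0.3, -0.3],
    [0.6, -0.5, 0.0],
]
anchor_update = sign_consensus_merge(deltas)
```

In the toy example the second coordinate keeps only the two negative updates, and the third cancels out entirely, so conflicting branch updates do not pull the shared anchor in opposite directions.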

2606.09027 2026-06-09 cs.CL cs.AI 新提交

SafeRun: Enabling Determinism in LLM Planning for Running

SafeRun:在跑步规划中实现LLM的确定性

Meilin Chen, Zepeng Zhai, Jiaxuan Zhao, Yuan Lu

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 针对LLM在跑步规划中因概率性导致安全违规的问题,提出SafeRun框架,通过解耦架构将LLM的软解释与确定性求解器的硬约束分离,实现100%安全评分。

Comments Workshop on Planning in the Era of LLMs (LM4Plan) at ICML 2026

详情
AI中文摘要

大型语言模型能够实现灵活的自然语言规划,但由于其概率性,在确定性关键领域仍不可靠。这一限制在跑步规划中尤其成问题,因为违反安全规则可能导致安全风险。我们提出SafeRun,一种通过解耦架构实现基于LLM的确定性规划的框架。SafeRun将LLM的软解释与确定性求解器的硬约束执行分离,在保持自然语言灵活性的同时确保严格的安全约束。为了验证SafeRun,我们构建了一个全面的基准测试,用于在现实生理和安全约束下进行跑步规划。在五个LLM上的实验表明,SafeRun实现了100%的安全评分(相比之下,PE平均为79.1%,CodeAct平均为97.6%),同时保持了具有竞争力的指令遵循分数。SafeRun基准测试可在 https://huggingface.co/datasets/zzp-seeker/SafeRun-RunPlanning-Benchmark (Hugging Face)上公开获取。

英文摘要

Large Language Models enable flexible natural-language planning but remain unreliable in determinism-critical domains due to their probabilistic nature. This limitation is especially problematic in running planning, where violating safety rules can lead to safety risks. We propose SafeRun, a framework for deterministic LLM-based planning via a decoupled architecture. SafeRun separates soft interpretation by an LLM from hard constraint enforcement by a deterministic solver, ensuring strict safety constraints while preserving natural-language flexibility. To validate SafeRun, we build a comprehensive benchmark for running planning under realistic physiological and safety constraints. Experiments across five LLMs show that SafeRun achieves 100% safety score (vs. 79.1% PE average and 97.6% CodeAct average) while maintaining competitive instruction-following scores. The SafeRun benchmark is publicly available on Hugging Face at https://huggingface.co/datasets/zzp-seeker/SafeRun-RunPlanning-Benchmark.
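
The decoupling described above can be pictured as a two-step pipeline: a model proposes a plan, and a deterministic function enforces hard limits on it regardless of what was proposed. The sketch below shows only the enforcement half. The two rules (a daily distance cap and a limit on week-over-week volume growth), their default values, and the function name are hypothetical placeholders, not SafeRun's constraint set or solver.

```python
def enforce_constraints(proposed_km, prev_week_km, max_daily_km=20.0, max_increase=0.10):
    """Deterministically project a proposed weekly plan onto hard limits.

    proposed_km:  list of daily distances suggested upstream (e.g. by an LLM).
    prev_week_km: total distance run in the previous week.
    """
    # Rule 1 (hypothetical): no negative days, no day above the daily cap.
    capped = [min(max(d, 0.0), max_daily_km) for d in proposed_km]
    # Rule 2 (hypothetical): weekly total may grow by at most max_increase.
    weekly_cap = prev_week_km * (1 + max_increase)
    total = sum(capped)
    if total > weekly_cap:
        scale = weekly_cap / total
        capped = [d * scale for d in capped]  # scaling down keeps rule 1 valid
    return capped


# A deliberately unsafe proposal standing in for raw LLM output.
llm_proposal = [10.0, 0.0, 25.0, 0.0, 12.0, 0.0, 30.0]
safe_plan = enforce_constraints(llm_proposal, prev_week_km=40.0)
```

Because the enforcement step is a pure function of its inputs, the same proposal always yields the same plan, and a plan that already satisfies both rules passes through unchanged.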

2606.09293 2026-06-09 cs.CL 新提交

One Model, Multiple Goals: Adaptive Multi-Objective Learning for E-commerce Dialogue Systems

一个模型,多个目标:面向电商对话系统的自适应多目标学习

Mingzhe Li, Jing Xiang, Enguo Zhou, Lang Gao, Tai Li, Qishen Zhang, Xiangliang Zhang, Xiuying Chen

发表机构 * ByteDance(字节跳动) MBZUAI(穆罕默德·本·扎耶德人工智能大学) University of Notre Dame(圣母大学)

AI总结 提出自适应多目标强化学习框架MORE,通过将推理功能作为约束指导策略优化,并引入自适应多奖励机制平衡语言目标,在电商对话系统中同时提升推理准确性和语言自然性,在线实验中总体转化率和达成转化率分别提升16.53%和30.09%。

Comments Accepted by KDD 2026

详情
AI中文摘要

电商场景中的对话系统通常需要满足多个目标:准确推理用户画像(如资格、信用额度)以确保正确决策和用户状态理解,同时生成自然且忠实的回复。这些目标是互补但非完全一致的。在这项工作中,我们提出了MORE,一个自适应多目标强化学习框架,联合优化推理准确性和语言自然性。我们的初步实验表明,直接混合具有不同优化动态的奖励会导致振荡和不稳定的学习。因此,我们不优化单一的混合奖励,而是将推理函数视为指导策略优化的约束。在推理时,系统直接生成回复,无需显式推理步骤,同时仍受益于推理增强的支架,避免额外的推理开销。为了更好地平衡回复生成过程中的语言目标,我们引入了一种自适应多奖励机制,该机制聚合流畅性和自然性等信号,并通过梯度反馈动态重新加权。我们在字节跳动的两个真实对话系统和MultiWOZ 2.2基准上评估MORE,其持续优于强基线。在字节跳动生产流量的14天在线实验中,MORE将总体转化率和达成转化率分别提高了16.53%和30.09%,同时提高了用户满意度并降低了转接率。值得注意的是,在人机对比中,MORE恢复了人类客服所实现的增量转化提升的约60%。

英文摘要

Dialogue systems in e-commerce scenarios often need to satisfy multiple objectives: accurately reasoning over user profiles (e.g., eligibility, credit limit) to ensure correct decision-making and user state interpretation, while also generating natural and faithful responses. These goals are complementary but not identical. In this work, we propose MORE, an adaptive Multi-Objective REinforcement learning framework that jointly optimizes reasoning accuracy and linguistic naturalness. Our preliminary experiments show that directly mixing rewards with diverging optimization dynamics can cause oscillations and unstable learning. Thus, instead of optimizing a single mixed reward, we treat reasoning functions as constraints that guide policy optimization. At inference time, the system directly generates responses without explicit reasoning steps, while still benefiting from reasoning-enhanced scaffold and avoiding additional inference overhead. To better balance linguistic objectives during response generation, we introduce an adaptive multi-reward mechanism that aggregates signals such as fluency and naturalness and dynamically reweighs them via gradient feedback. We evaluate MORE on two real-world dialogue systems at ByteDance and the MultiWOZ 2.2 benchmark, where it consistently outperforms strong baselines. In 14-day online experiments on ByteDance production traffic, MORE improves overall and reached conversion by 16.53% and 30.09%, while increasing user satisfaction and reducing handoff rates. Notably, in a human-machine comparison, MORE recovers about 60% of the incremental conversion lift achieved by human agents.

2606.09483 2026-06-09 cs.CL cs.AI 新提交

Memory Beyond Recall: A Dual-Process Cognitive Memory System for Self-Evolving LLM Agents

超越回忆的记忆:用于自进化LLM代理的双过程认知记忆系统

Tianxiang Fei, Mingyang Song, Mao Zheng, Xiang Yu

发表机构 * Tencent(腾讯)

AI总结 提出DCPM系统,基于双过程理论将代理记忆组织为认知能力层次,通过同步日间写入器和异步夜间引擎分别处理信念修正和模式归纳,在隐式跨会话推理任务上提升显著。

详情
AI中文摘要

LLM代理的长期记忆不仅仅是适时检索正确的段落。当前的记忆系统将信念修正、因果耦合和跨领域抽象压缩到为表面回忆而调整的单一检索面上,因此难以处理需要推理用户如何演变的隐式个性化。我们提出DCPM,它沿着认知能力层次重新组织代理记忆,从原始输入和原子事实,经过历时信念轨迹和身份,上升到领域模式、潜在意图和跨领域模式。该层次由两个过程驱动,继承了双过程理论的架构分裂:一个同步的日间写入器(系统1),记录信念修正为双重链接的取代链;一个异步的夜间引擎(系统2),归纳模式和意图,并扫描跨领域冲突,抽象为更高级的核心模式。在LongMemEval、PersonaMem和PersonaMem-v2上,启用系统2在奖励隐式跨会话推理的基准上贡献最大(在PersonaMem-v2上最高+5.20),在跨度回忆上贡献最小,与架构预测一致。

英文摘要

Long-term memory for an LLM agent is more than retrieving the right passage at the right time. Current memory systems collapse belief revision, causal coupling, and cross-domain abstraction into a single retrieval surface tuned for surface recall, and consequently struggle on implicit personalisation that requires reasoning over how a user has evolved. We propose DCPM, which reorganises agent memory along a cognitive capability hierarchy ascending from raw inputs and atomic facts, through diachronic belief trajectories and identity, to domain schemas, latent intentions and cross-domain patterns. The hierarchy is driven by two processes inheriting the architectural split of dual-process theory: a synchronous daytime writer (System1) that records belief revisions as doubly linked supersedes chains, and an asynchronous nighttime engine (System2) that induces schemas and intentions and sweeps for cross-domain collisions abstracted into higher-level core schemas. On LongMemEval, PersonaMem and PersonaMem-v2, enabling System2 contributes most where the benchmark rewards implicit cross-session inference (up to +5.20 on PersonaMem-v2) and least on span recall, matching the architectural prediction.
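
The "doubly linked supersedes chain" written by System1 can be pictured as a linked list of belief revisions navigable in both directions. The sketch below is a minimal illustration of that data structure; the class name, field names, and example beliefs are hypothetical and not taken from DCPM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through the links
class Belief:
    text: str
    supersedes: Optional["Belief"] = None      # older belief this one replaces
    superseded_by: Optional["Belief"] = None   # newer belief that replaced this one


def revise(old, new_text):
    """Record a belief revision, linking old and new in both directions."""
    new = Belief(text=new_text, supersedes=old)
    old.superseded_by = new
    return new


def current(belief):
    """Follow forward links to the belief that is currently held."""
    while belief.superseded_by is not None:
        belief = belief.superseded_by
    return belief


def history(belief):
    """Return the full revision trajectory, oldest first."""
    node = current(belief)
    texts = []
    while node is not None:
        texts.append(node.text)
        node = node.supersedes
    return texts[::-1]


first = Belief("prefers tea")
second = revise(first, "prefers coffee")
third = revise(second, "prefers decaf")
```

Keeping both link directions means a lookup that lands on a stale belief can still reach the current one, while the full trajectory stays available for reasoning about how the user has changed.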

2606.09498 2026-06-09 cs.CL 新提交

Self-Harness: Harnesses That Improve Themselves

Self-Harness:自我改进的操控框架

Hangfan Zhang, Shao Zhang, Kangcong Li, Chen Zhang, Yang Chen, Yiqun Zhang, Lei Bai, Shuyue Hu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出Self-Harness范式,让LLM智能体通过弱点挖掘、框架提议和验证迭代改进自身操控框架,在Terminal-Bench-2.0上使三种模型的保留通过率分别提升21.4、14.3和14.2个百分点。

详情
AI中文摘要

基于LLM的智能体的性能由其基础模型和中介其与环境交互的操控框架共同塑造。由于不同模型表现出不同的行为,有效的框架设计本质上是模型特定的。然而,智能体框架仍然主要由人类专家设计,这种范式随着现代LLM日益多样化和快速演变而难以扩展。在本文中,我们引入了Self-Harness,一种新的范式,其中基于LLM的智能体改进其自身的操作框架,而不依赖人类工程师或更强的外部智能体。我们将Self-Harness实现为一个迭代循环,包含三个阶段:弱点挖掘,从执行轨迹中识别模型特定的失败模式;框架提议,生成与这些失败相关的多样化但最小的框架修改;以及提议验证,仅在回归测试后接受候选编辑。我们在Terminal-Bench-2.0上使用最小初始框架和来自不同家族的三个基础模型实例化了Self-Harness:MiniMax M2.5、Qwen3.5-35B-A3B和GLM-5。在所有三个模型上,Self-Harness一致地提高了性能,保留通过率分别从40.5%提高到61.9%,从23.8%提高到38.1%,以及从42.9%提高到57.1%。定性分析进一步表明,Self-Harness不仅仅是添加通用指令,而是有效地将模型特定的弱点转化为具体的、可执行的框架更改。这些结果表明了一条路径,使得基于LLM的智能体不仅被其框架塑造,而且能够参与重塑自身框架。

英文摘要

The performance of LLM-based agents is jointly shaped by their base models and the harnesses that mediate their interaction with the environment. Because different models exhibit distinct behaviors, effective harness design is inherently model-specific. Yet agent harnesses are still largely engineered by human experts, a paradigm that scales poorly as modern LLMs become increasingly diverse and rapidly evolving. In this paper, we introduce Self-Harness, a new paradigm in which an LLM-based agent improves its own operating harness, without relying on human engineers or stronger external agents. We operationalize Self-Harness as an iterative loop with three stages: Weakness Mining, which identifies model-specific failure patterns from execution traces; Harness Proposal, which generates diverse yet minimal harness modifications tied to these failures; and Proposal Validation, which accepts candidate edits only after regression testing. We instantiate Self-Harness on Terminal-Bench-2.0 using a minimal initial harness and three base models from diverse families: MiniMax M2.5, Qwen3.5-35B-A3B, and GLM-5. Across all three models, Self-Harness consistently improves performance, with held-out pass rates increasing from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1%, respectively. Qualitative analyses further show that Self-Harness does not simply add generic instructions, but effectively turns model-specific weaknesses into concrete, executable harness changes. These results suggest a path toward LLM-based agents that are not merely shaped by their harnesses, but can also participate in reshaping them.
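
The Proposal Validation stage accepts an edit only after regression testing. As a rough sketch of what such a gate could look like, the function below compares per-task pass/fail results before and after a candidate harness edit. The acceptance criterion (no regressions and at least one newly passing task) and the task names are assumptions; the abstract does not specify the exact rule.

```python
def validate(candidate_results, baseline_results):
    """Accept a harness edit only if it fixes something and breaks nothing.

    Both arguments map task name -> bool (True means the task passed).
    """
    regressions = [
        task for task, passed in baseline_results.items()
        if passed and not candidate_results.get(task, False)
    ]
    gains = [
        task for task, passed in candidate_results.items()
        if passed and not baseline_results.get(task, False)
    ]
    return not regressions and bool(gains)


# Hypothetical results on three regression tasks.
baseline = {"build": True, "grep-logs": False, "fix-config": False}
edit_a = {"build": True, "grep-logs": True, "fix-config": False}   # pure gain
edit_b = {"build": False, "grep-logs": True, "fix-config": True}   # breaks "build"
accepted = [name for name, res in [("a", edit_a), ("b", edit_b)] if validate(res, baseline)]
```

Under this rule `edit_b` is rejected even though it passes more tasks overall, because it regresses a task the baseline harness already handled.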

2606.09632 2026-06-09 cs.CL 新提交

Civil Court Simulation with Large Language Models

基于大型语言模型的民事法庭模拟

Yifan Chen, Haitao Li, Kaiyuan Zhang, Yueyue Wu, Qingyao Ai, Yiqun Liu

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学)

AI总结 提出多智能体民事法庭模拟框架,通过五阶段审判程序、记忆模块和法规检索实现可靠判决,在责任分配和多项裁决上表现优异。

详情
AI中文摘要

法庭模拟连接了法律教育与司法实践,但基于人类的模拟成本高且难以扩展。大型语言模型(LLMs)提供了一种可扩展的替代方案,但现有的法庭模拟研究主要集中于刑事案件。民事诉讼在实践中更为常见且更难模拟,因为其诉求、责任和救济方式更加灵活。我们提出了一个面向中国民事案件的多智能体法庭模拟框架。该框架通过五阶段民事审判程序组织基于角色的交互,并集成记忆模块和法规检索以支持长过程裁判。实验表明,该框架能产生可靠的民事判决,在责任分配和多项裁决方面具有明显优势。进一步实验显示,记忆质量显著影响下游模拟质量。通过五层因素框架,我们分析了法律基础、信息条件、司法能力与角色定位、组织压力以及社会背景如何影响框架的可靠性和行为。这些结果支持了所提框架在民事法庭模拟中的有效性。数据集和代码可在 https://github.com/foggpoy/Civil-Court 获取。

英文摘要

Court simulation bridges legal education and judicial practice, yet human-based simulations are costly and difficult to scale. Large language models (LLMs) offer a scalable alternative, but existing court-simulation research mainly focuses on criminal cases. Civil litigation is more common in practice and harder to simulate because its claims, liability, and remedies are more flexible. We present a multi-agent court simulation framework for Chinese civil cases. The framework organizes role-based interaction through a five-stage civil trial procedure and integrates memory module and statute retrieval to support long-process adjudication. Experiments show that the framework produces reliable civil judgments, with clear strengths in liability allocation and multi-item adjudication. Further experiments show that memory quality substantially affects downstream simulation quality. Through a five-layer factor framework, we analyze how legal grounding, information conditions, judicial capability and role orientation, organizational pressure, and social context affect the framework's reliability and behavior. These results support the effectiveness of the proposed framework for civil court simulation. The dataset and code are available at: https://github.com/foggpoy/Civil-Court.

2606.07636 2026-06-09 cs.CV cs.CL cs.MA 交叉投稿

Crayotter: Traceable Multi-Agent Workflows for Long-Form Video Editing

Crayotter:用于长视频编辑的可追踪多智能体工作流

Lecheng Yan, Yichong Zhang, Ben Pan, Xiaoyu Zheng, Jiawei Qian, Anqi Wu, Wenxi Li, Chenyang Lyu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Crayotter,一个开源多模态多智能体系统,通过三阶段工作流(材料准备、基于工件的编辑研究、工具驱动的执行)实现长视频编辑的可追踪性和选择性修订,在人类评估中优于基线方法。

Comments 11 pages, 5 figures

详情
AI中文摘要

从异构素材编辑长视频不仅需要选择片段:智能体必须在材料准备、时间线构建、后期制作和修订过程中保持叙事意图,同时留下足够的证据以诊断失败。我们提出Crayotter,一个用于提示驱动视频编辑的开源多模态多智能体系统。Crayotter 将制作组织为三个阶段:覆盖感知的材料准备、基于工件的编辑研究以及工具驱动的时间线执行。每个阶段外化可检查的工件,包括覆盖报告、多模态分析、编辑蓝图、工具调用和中间渲染。这些工件使编辑运行可追踪,并允许诊断和选择性修订失败的片段,而无需完全重启。我们在23个编辑主题上评估Crayotter,与CapCut-Mate和CutClaw进行比较。在人类评估下,Crayotter的平均得分为3.40/5,而两个基线分别为2.44和1.70,在主题对齐、叙事连贯性和编辑流畅性方面持续提升。我们还描述了一个可重放的轨迹模式和可验证的奖励设计,为这些工作流未来的策略优化做准备。代码、轨迹和示例可在 https://github.com/idwts/Crayotter 公开获取。

英文摘要

Editing a long-form video from heterogeneous footage requires more than selecting clips: an agent must preserve narrative intent across material preparation, timeline construction, post-production, and revision while leaving enough evidence to diagnose failures. We present Crayotter, an open-source multimodal multi-agent system for prompt-driven video editing. Crayotter organizes production into three phases: coverage-aware material preparation, artifact-based editing research, and tool-grounded timeline execution. Each phase externalizes inspectable artifacts, including coverage reports, multimodal analyses, editing blueprints, tool calls, and intermediate renders. These artifacts make an editing run traceable and allow failed segments to be diagnosed and selectively revised instead of requiring a full restart. We evaluate Crayotter on 23 editing themes against CapCut-Mate and CutClaw. Under human evaluation, Crayotter achieves an average score of 3.40/5, compared with 2.44 and 1.70 for the two baselines, with consistent gains in theme alignment, narrative coherence, and editing smoothness. We additionally describe a replayable trajectory schema and verifiable reward design that prepare these workflows for future policy optimization. Code, traces, and examples are publicly available at https://github.com/idwts/Crayotter.

2606.08016 2026-06-09 cs.CV cs.AI cs.CL 交叉投稿

IEA: Amateur-Friendly Conversational Image Editing Agent via Three Stages of Multitask Alignment

IEA:通过三阶段多任务对齐的业余友好型对话式图像编辑代理

Zichen Zhu, Yuheng Sun, Mingxuan Zhu, Wenjie Ma, Situo Zhang, Zhexiang Wang, Ziyue Yang, Danyang Zhang, Kunyao Lan, Zihan Zhao, Dingye Liu, Siqi Xiang, Lu Chen, Kai Yu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institution(上海创新研究院) Huawei Technologies Ltd.(华为技术有限公司) Nanyang Technological University(南洋理工大学) Jiangsu Key Lab of Language Computing(江苏省语言计算重点实验室)

AI总结 提出IEA对话式图像编辑代理,通过三阶段多任务训练学习操作参数化工具,实现可解释编辑轨迹,在像素距离和ROUGE-L指标上优于基线,用户研究中指令跟随和感知质量表现最佳。

Comments [CVPR 2026 Findings] Our data and code are released at https://github.com/OpenDFM/Image_Edit_Agent

详情
AI中文摘要

当前的图像编辑软件通常依赖于固定滤镜或专家调参,导致业余用户的意图与结果之间存在差距。生成模型创建的图像可能包含伪影、不合理的细节或偏离真实感的风格漂移,并且对编辑原因缺乏解释。我们提出IEA,一个对话式图像编辑代理,它学习在显式、可解释的动作空间中操作参数化工具。IEA通过三阶段多任务流水线进行训练:(1) 在蒸馏专家编辑上进行SFT,(2) 使用GRPO进行奖励优化,奖励包括相似度改进、工具有用性和意图总结,(3) 大规模合成微调以联合掌握图像编辑、细化和用户意图总结。通过逐步操作16个编辑工具,IEA产生透明的编辑轨迹,可以检查和调试。在定量实验中,它在编辑任务上获得更低的像素距离,在总结任务上获得比强基线更高的ROUGE-L。在用户研究中,它在指令跟随方面在工具调用方法中排名最佳,同时在整体感知质量上超越生成方法。我们的结果验证了可解释的、以工具为中心的VLM作为人类指令引导图像润色的可靠路径。

英文摘要

Current image editing software often hinges on fixed filters or expert tuning, leaving a gap between amateur users' intent and outcomes. Creations by generative models may contain artifacts, implausible details, or stylistic drift away from photorealism and offer little insight into why an edit was made. We propose IEA, a conversational Image Editing Agent that learns to operate parameterized tools in an explicit, interpretable action space. IEA is trained via a three-stage multitask pipeline: (1) SFT on distilled expert edits, (2) GRPO with rewards for likeness improvement, tool usefulness, and intent summarization, and (3) large-scale synthetic fine-tuning to jointly master image editing, refinement, and user intent summarization. By manipulating 16 editing tools step by step, IEA produces transparent edit traces that can be inspected and debugged. In quantitative experiments, it attains a lower pixel distance on the edit task and a higher ROUGE-L on the summary task than strong baselines. In user studies, it ranks best among tool-calling methods for instruction following while surpassing generative methods in overall perceptual quality. Our results validate interpretable, tool-centric VLMs as a reliable path to human instruction-guided image retouching.

2606.08169 2026-06-09 cs.RO cs.AI cs.CL cs.HC cs.LG 交叉投稿

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

CLASP:基于语言驱动的机器人技能选择与组合,采用任务参数化学习

Markus Knauer, Valentin Gieraths, Tai Mai, Samuel Bustamante, Alin Albu-Schäffer, Freek Stulp, João Silvério

发表机构 * German Aerospace Center (DLR), Institute of Robotics and Mechatronics (RMC)(德国航空航天中心(DLR),机器人与机电一体化研究所(RMC)) Technical University of Munich (TUM)(慕尼黑工业大学(TUM))

AI总结 提出CLASP架构,结合任务参数化核化运动基元(TP-KMP)与预训练视觉语言模型(VLM),通过自然语言命令实现技能选择、组合和主动学习,无需微调,在7自由度机械臂上达到73.3%-100%成功率。

Comments 23 pages, 11 figures, 4 tables, 1 listing

详情
AI中文摘要

使机器人能够理解自然语言命令并执行任务,同时保持数据效率仍然具有挑战性。视觉-语言-动作(VLA)和视觉-语言模型(VLM)等基础模型提供了直观的交互通道,但需要大量数据;任务参数化模仿学习实现了数据效率,但缺乏自然语言基础。这项工作通过一个模块化架构弥合了这一差距,该架构将任务参数化核化运动基元(TP-KMP)与预训练VLM相结合。在学习过程中,技能从2到5次动觉演示中获取,VLM生成描述每个技能参数和前提条件的技能模式。在执行过程中,VLM解释命令以选择技能,推理参数绑定,并通过协方差加权组合创建新颖行为。当没有技能或组合足够时,系统识别能力差距并请求有针对性的演示,所有这些都无需微调。在7自由度机械臂上的验证显示,在需要技能选择、组合和主动学习的场景中,成功率达到73.3%-100%。

英文摘要

Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.
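
Covariance-weighted composition is commonly realized as a product of Gaussians, in which each skill's prediction is weighted by its precision (inverse covariance) so that the more certain skill dominates. The one-dimensional sketch below illustrates that general idea; whether CLASP uses exactly this formula is not stated in the abstract, and the numbers are made up.

```python
def fuse(mu_a, var_a, mu_b, var_b):
    """Fuse two 1-D Gaussian predictions by precision weighting.

    Returns the mean and variance of the (normalized) product of the
    two Gaussians N(mu_a, var_a) and N(mu_b, var_b).
    """
    w_a, w_b = 1.0 / var_a, 1.0 / var_b   # precisions
    var = 1.0 / (w_a + w_b)
    mu = var * (w_a * mu_a + w_b * mu_b)
    return mu, var


# Two hypothetical skills predicting an end-effector coordinate (metres).
equal_mu, equal_var = fuse(0.0, 1.0, 1.0, 1.0)        # equally confident skills
biased_mu, biased_var = fuse(0.0, 0.01, 1.0, 1.0)     # first skill far more certain
```

With equal variances the result is the midpoint; when one skill is much more certain at a given point along the motion, the composed trajectory stays close to that skill's prediction there.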

2606.08529 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Scaffold Effects on GAIA: A Controlled Comparison

脚手架对GAIA的影响:一项受控比较

Jason Starace

发表机构 * Independent Researcher(独立研究员)

AI总结 通过受控实验比较三种脚手架(ReAct、多智能体设计、规划-执行)对五个模型在GAIA验证集上的影响,发现脚手架选择可导致准确率差异高达28个百分点,且模型能力越强对脚手架依赖性不一定越低。

Comments 12 pages, 3 figures

详情
AI中文摘要

已发布的智能体能力评分混淆了模型本身的能力与脚手架赋予的能力,且这种激发差距的大小在受控条件下尚未得到充分表征。本研究在GAIA验证集的Level 1和Level 2上,对来自三个提供商的五个模型(Claude Opus 4.7、Sonnet 4.6、Haiku 4.5;Gemini 3.1 Pro Preview;GPT-5.5)进行了预先注册的受控比较,涉及三种脚手架(ReAct、规划-执行者-评估者多智能体设计以及规划-执行),保持任务和条件固定,每个问题尝试三次。仅脚手架选择就使单个模型(Opus,Level 2,稳健切片)的测量准确率移动了多达28个百分点,证实了预先注册的假设,即脚手架变化至少产生10个百分点的差距。预先注册的预测——能力更强的模型对脚手架敏感性更低——在方向上被拒绝:在每个数据集切片中,脚手架效应因模型而异,但能力最强的Anthropic模型在更难级别上从结构化脚手架中获益最多,且层级缩放仅在Level 1的稳健切片下成立。在Level 2上,多智能体相对于ReAct的优势出现在Anthropic系列内部,但跨提供商模型中没有,因此模型系列而非能力层级成为调节变量,而预测的规划-执行者在文件读取任务上的优势被证伪。结构化脚手架在更难级别上调用工具次数更少,但从中途错误中恢复的频率更高,且单个单元(Gemini搭配规划-执行者)在两个级别上成本最低,在Level 2上准确率最高。这些结果表明,单脚手架能力数值是脚手架条件估计,且激发差距不一定会随着模型改进而缩小。

英文摘要

Published agent capability scores conflate what a model can do with what its scaffold lets it do, and the magnitude of this elicitation gap is not well characterized under controlled conditions. This study executes a pre-registered controlled comparison of three scaffolds (ReAct, a Planner-Actor-Rater multi-agent design, and planner-then-executor) across five models from three providers (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; Gemini 3.1 Pro Preview; GPT-5.5) on GAIA validation Levels 1 and 2, holding tasks and conditions fixed, with three attempts per question. Scaffold choice alone moves measured accuracy by as much as 28 percentage points within a single model (Opus, Level 2, robust slice), confirming the pre-registered hypothesis that scaffold variation produces gaps of at least 10 points. The pre-registered prediction that more capable models would be less scaffold-sensitive is rejected in direction: scaffold effects vary significantly by model in every dataset slice, but the most capable Anthropic model gains the most from structured scaffolds at the harder level, and tier-scaling holds only at Level 1 under the robust slice. The multi-agent advantage over ReAct at Level 2 appears within the Anthropic family but not for the cross-provider models, making model family rather than capability tier the conditioning variable, and the predicted planner-executor advantage on file-reading tasks is falsified. Structured scaffolds make fewer tool calls yet recover more often from mid-trajectory errors at the harder level, and a single cell (Gemini with planner-then-executor) is the cheapest at both levels and the most accurate at Level 2. These results indicate that single-scaffold capability numbers are scaffold-conditional estimates and that the elicitation gap is not guaranteed to shrink as models improve.

2606.09138 2026-06-09 cs.LG cs.CL 交叉投稿

Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

Claw-R1:面向智能体强化学习的步骤级数据中间件系统

Daoyu Wang, Mingyue Cheng, Qingchuan Li, Shuo Yu, Jie Ouyang, Qi Liu

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(中国科学技术大学认知智能国家重点实验室)

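AI summary: Proposes Claw-R1, a system whose Gateway Server and Data Pool components turn agent interaction steps into structured data assets, supporting live inspection, quality-based curation, and training-batch configuration to address data lifecycle management in agentic reinforcement learning.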

详情
AI中文摘要

智能体强化学习已成为将大语言模型从静态聊天机器人转变为交互式智能体的重要后训练范式,催生了如OpenClaw等代表性应用。现有工作主要关注策略优化算法和训练框架,但对从数据产生到训练消费的智能体-环境交互完整数据生命周期关注不足。为弥补这一差距,我们提出Claw-R1,一个面向智能体强化学习的交互式步骤级数据中间件系统。Claw-R1通过两个核心组件——网关服务器和数据池——连接异构智能体运行时与强化学习训练后端。网关服务器通过统一的LLM API入口捕获多轮交互步骤,而数据池将其组织为由提示ID、响应ID、奖励和其他元数据组成的步骤级记录。在我们的演示中,用户可以交互式检查实时轨迹,查看每一步的状态、动作和奖励,根据质量和就绪程度筛选数据,并为不同的下游强化学习算法配置训练就绪批次。总体而言,Claw-R1将智能体交互轨迹视为受管理的数据资产,而非临时运行时日志。通过此演示,我们希望鼓励社区认识到数据管理在智能体强化学习中的重要性。我们的代码可在https://github.com/AgentR1/Claw-R1获取,演示视频可在https://youtu.be/Pw47dAOw6B0找到。

英文摘要

Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and training frameworks, but pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. To bridge this gap, we present Claw-R1, an interactive step-level data middleware system for agentic RL. Claw-R1 connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool. The Gateway Server captures multi-turn interaction steps through a unified LLM API entry point, while the Data Pool organizes them into step-level records consisting of prompt IDs, response IDs, rewards, and other metadata. In our demo, users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms. Overall, Claw-R1 treats agent interaction traces as managed data assets rather than temporary runtime logs. Through this demonstration, we hope to encourage the community to recognize the importance of data management in agentic RL. Our code is available at https://github.com/AgentR1/Claw-R1 and the demonstration video can be found at https://youtu.be/Pw47dAOw6B0.
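
As a rough illustration of what a step-level record and a curation filter might look like, here is a minimal Python sketch. The class `StepRecord`, its field names, the `ready` rule, and the `curate` helper are all hypothetical and are not taken from the Claw-R1 codebase; only the record contents (prompt IDs, response IDs, rewards, other metadata) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StepRecord:
    """One agent-environment interaction step (hypothetical schema)."""
    trajectory_id: str
    step_index: int
    prompt_ids: List[int]            # token IDs sent to the LLM
    response_ids: List[int]          # token IDs returned by the LLM
    reward: Optional[float] = None   # None until a reward is assigned
    metadata: Dict[str, Any] = field(default_factory=dict)

    @property
    def ready(self) -> bool:
        # Assumption: a step counts as training-ready once it has a reward.
        return self.reward is not None


def curate(pool: List[StepRecord], min_reward: float) -> List[StepRecord]:
    """Keep ready steps whose reward meets a quality threshold."""
    return [s for s in pool if s.ready and s.reward >= min_reward]


pool = [
    StepRecord("traj-0", 0, [1, 2, 3], [4, 5], reward=0.9),
    StepRecord("traj-0", 1, [1, 2, 3, 4, 5], [6], reward=0.1),
    StepRecord("traj-1", 0, [7, 8], [9]),  # reward not yet assigned
]
batch = curate(pool, min_reward=0.5)
print([(s.trajectory_id, s.step_index) for s in batch])
```

Keeping the reward optional lets a record exist as soon as the step is captured, with readiness decided later; a real system would likely use a richer readiness rule than this one.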

2606.09751 2026-06-09 cs.AI cs.CL cs.HC 交叉投稿

Collaborative Human-Agent Protocol (CHAP)

协作式人机协议 (CHAP)

Arsalan Shahid, Gordon Suttie, Philip Black

发表机构 * Brightbeam AI

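AI summary: Proposes the CHAP protocol, which uses structured event records (diff, rationale, hash) and composable profiles to address the loss of human-judgement signals in multi-human, multi-agent collaboration.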

详情
AI中文摘要

基础模型正从响应生成转向操作角色。它们跨步骤规划、调用工具、请求人类输入、与其他智能体协调,并越来越多地承担影响客户、索赔、代码、合同和临床决策的工作。生产部署不再是单个人类监督单个模型,而是跨团队、时区和信任边界的多人类、多智能体协作。这种协作的技术界面仍然定义不清。当智能体起草响应,人类在发布前编辑它时,人类判断的时刻是系统中最有价值的信号。在当前实践中,该信号(如果有记录)仅存在于应用程序代码、聊天线程、工单评论和集体记忆中。两个协议标准解决了相邻问题:MCP标准化了智能体对工具和数据的访问,A2A标准化了智能体间的互操作性。两者都没有定义人类和智能体共同执行可问责工作的共享工作空间。本文提出了CHAP,即协作式人机协议。在CHAP下,原本会消失在聊天线程中的覆盖操作变成了一个结构化事件,包含差异、理由和内容哈希。班次交接变成了可移植的信封,而不是置顶消息。人类对智能体草稿的批准变成了一个不可否认的签名决策,可在多年后重放。该协议通过一个小的核心(工作空间、参与者、任务、工件和仅追加的证据日志)以及可组合的配置文件(根据部署需要添加审查、模式、路由、审议、交接、身份、签名和透明度支持的审计)来实现。规范、参考实现、一致性测试套件和示例可在以下网址获取:https://github.com/BrightbeamAI/chap

英文摘要

Foundation models are moving from response generation into operational roles. They plan across steps, call tools, request human input, coordinate with other agents, and increasingly carry responsibility for work that affects customers, claims, code, contracts, and clinical decisions. Production deployments are no longer one human supervising one model. They are multi-human, multi-agent collaborations that cross teams, time zones, and trust boundaries. The technical surface for this collaboration remains weakly specified. When an agent drafts a response and a human edits it before it ships, the moment of human judgement is the most valuable signal in the system. In current practice it is recorded, if at all, in application code, chat threads, ticket comments, and tribal memory. Two protocol standards address adjacent concerns: MCP standardises agent access to tools and data, and A2A standardises agent-to-agent interoperability. Neither defines the shared workspace in which humans and agents perform accountable work together. This paper presents CHAP, the Collaborative Human-Agent Protocol. Under CHAP, the override that used to vanish into a chat thread becomes a structured event carrying a diff, a rationale, and a content hash. The handoff between shifts becomes a portable envelope rather than a pinned message. The human approval of an agent's draft becomes a non-repudiable signed decision that can be replayed years later. The protocol achieves this through a small Core (workspaces, participants, tasks, artefacts, and an append-only evidence log) together with composable profiles that add review, modes, routing, deliberation, handoff, identity, signatures, and transparency-backed audit as deployments require them. Specification, reference implementation, conformance suite, and worked examples are available at: https://github.com/BrightbeamAI/chap
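
To make the idea of an override as "a structured event carrying a diff, a rationale, and a content hash" more concrete, the following Python sketch builds such an event and stores it in an append-only log. Every field name, the `EvidenceLog` class, and the choice of SHA-256 are assumptions made for illustration; they are not taken from the CHAP specification, and signatures are not implemented here.

```python
import difflib
import hashlib
import json
from datetime import datetime, timezone


def make_override_event(actor: str, draft: str, edited: str, rationale: str) -> dict:
    """Build a structured override event (hypothetical field names)."""
    diff = "\n".join(
        difflib.unified_diff(
            draft.splitlines(), edited.splitlines(),
            fromfile="agent_draft", tofile="human_edit", lineterm="",
        )
    )
    return {
        "type": "override",
        "actor": actor,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "diff": diff,
        "rationale": rationale,
        # Hash of the final content so later readers can verify it.
        "content_hash": hashlib.sha256(edited.encode("utf-8")).hexdigest(),
    }


class EvidenceLog:
    """Append-only log: entries can be added and read, never changed."""

    def __init__(self) -> None:
        self._entries: list = []

    def append(self, event: dict) -> int:
        # Store a serialized copy so callers cannot mutate the entry later.
        self._entries.append(json.dumps(event, sort_keys=True))
        return len(self._entries) - 1

    def read(self, index: int) -> dict:
        return json.loads(self._entries[index])


log = EvidenceLog()
event = make_override_event(
    actor="human:reviewer-1",
    draft="Your claim is denied.",
    edited="Your claim needs one more document before we can decide.",
    rationale="Policy requires requesting missing documents before denial.",
)
idx = log.append(event)
print(log.read(idx)["content_hash"][:12])
```

Serializing each entry on append is one simple way to keep stored events immutable in memory; a production log would also need durable storage and tamper evidence.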

2606.09774 2026-06-09 cs.AI cs.CL 交叉投稿

SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

SIGA: 用于科学模拟的自演化编码智能体适配器

Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang, Lianhui Qin

发表机构 * University of California, San Diego(加利福尼亚大学圣迭戈分校)

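AI summary: Proposes the SIGA adapter, which turns a general-purpose coding agent into an operator of scientific simulation software through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination, achieving a 36x speedup on GEOS and supporting self-evolution to further improve performance.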

详情
AI中文摘要

高级科学模拟器暴露了专门的输入语言,将模拟目标转化为可执行配置,但学习这些语言可能需要领域科学家花费数小时到数天。我们将模拟器设置研究为智能体-工具接口接地问题:需要哪些最小的模拟器特定适配才能使现成的编码智能体操作真实的科学软件?我们的直觉是,编码智能体已经知道如何导航文件、编辑代码、运行命令和修复输出,但它们缺乏模拟器的可执行契约:其词汇、结构约束、验证规则和终止条件。我们介绍了SIGA,一个模拟器接口接地适配器,通过检索、程序记忆、轨迹内验证和验证强制终止来提供此契约。我们主要在GEOS上评估SIGA,GEOS是一个用于地下科学的开源多物理场模拟器。SIGA在大约五分钟内生成完整的GEOS输入文件,TreeSim高于0.90,与花费大约三小时的扩展预算人类专家相当,实现了大约36倍的挂钟加速。在更难的保留集上,接地将TreeSim从0.720提高到0.789,相对于裸智能体提高了大约10%,并且可以将跨种子的标准差降低16倍。自演化通过从先前轨迹重写适配器内容进一步改进SIGA,产生了最高的保留GEOS平均值,并匹配或超过了最强的手工设计配置。迁移到OpenFOAM和LAMMPS表明,主导机制因接口而异:当结构完整性是瓶颈时,验证最重要;而当领域正确性是瓶颈时,记忆和检索最重要。这些结果表明,轻量级、可自我改进的接地层可以将通用编码智能体转变为科学软件的实用操作员。

英文摘要

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.
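
The headline ratios in this abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are illustrative, and "about three hours" and "about five minutes" are taken at face value as 180 and 5 minutes.

```python
# Figures quoted in the abstract.
human_minutes = 3 * 60      # "about three hours"
agent_minutes = 5           # "about five minutes"
bare_treesim = 0.720
grounded_treesim = 0.789

speedup = human_minutes / agent_minutes
relative_gain = (grounded_treesim - bare_treesim) / bare_treesim

print(f"wall-clock speedup: {speedup:.0f}x")            # 36x
print(f"relative TreeSim gain: {relative_gain:.1%}")   # 9.6%
```

The relative gain works out to about 9.6%, which is consistent with the abstract's "roughly 10%".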

2306.16092 2026-06-09 cs.CL 版本更新

Chatlaw: A Multi-Agent Legal Assistant based on a Role-Aligned Mixture-of-Experts Architecture

Chatlaw: 基于角色对齐的混合专家架构的多智能体法律助手

Jiaxi Cui, Munan Ning, Zongjian Li, Bohua Chen, Yang Yan, Hao Li, Bin Ling, Yonghong Tian, Li Yuan

发表机构 * Shenzhen Graduate School, Peking University(北京大学深圳研究生院) Peng Cheng Laboratory(鹏城实验室) Law School, Peking University(北京大学法学院) Pandalla.ai

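AI summary: Proposes Chatlaw, a multi-agent legal assistant that uses a role-aligned mixture-of-experts architecture to emulate a law firm's collaborative workflow; it improves accuracy on the LawBench benchmark by 7.73% over GPT-4 and is validated on real cases.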

Comments Accepted manuscript. Updated to match the journal version and added DOI

详情
AI中文摘要

人工智能在法律服务中具有巨大潜力,但大型语言模型面临两大挑战:对中国法律体系知识有限且易产生幻觉。为解决这些问题,我们提出Chatlaw,一个多智能体法律助手。Chatlaw的框架旨在模拟真实律所的标准操作流程,其中不同角色(如助理、研究员、资深律师)协作处理案件。为了在计算上镜像这种协作结构,我们开发了一种新颖的角色对齐混合专家架构。在该系统中,内部“专家”经过专门训练,以与每个智能体角色(如咨询、分析、起草)的不同任务对齐。这些专业智能体(法律助理、研究员等)随后形成协作框架。当它们与用户交互、检索法律知识、分析案件细节或生成可靠咨询时,RA-MoE架构智能地将计算路由到相应的专用专家,确保每一步由最合格的参数处理。在评估中,Chatlaw超越了包括GPT-4在内的通用AI模型,在LawBench基准上准确率提升7.73%,在法律职业统一资格考试中得分高出11分。真实案例研究和专家评估进一步证实了其稳健性。Chatlaw提高了法律服务的可及性和可靠性,推动了向公众提供法律支持的进步。

英文摘要

Artificial Intelligence (AI) holds great potential in legal services, yet Large Language Models (LLMs) face two major challenges: limited knowledge of the Chinese legal system and vulnerability to hallucinations. To address these issues, we present Chatlaw, a multi-agent legal assistant. Chatlaw's framework is designed to emulate the Standard Operating Procedures (SOP) of real law firms, where different roles (e.g., assistant, researcher, senior lawyer) collaborate on a case. To computationally mirror this collaborative structure, we developed a novel Role-Aligned Mixture-of-Experts (RA-MoE) architecture. In this system, the internal "experts" are specifically trained to align with the distinct tasks of each agent role (e.g., inquiry, analysis, drafting). These specialized agents (Legal Assistant, Researcher, etc.) then form the collaborative framework. When they interact with users, retrieve legal knowledge, analyze case details, or generate reliable consultations, the RA-MoE architecture intelligently routes their computations to the corresponding dedicated expert, ensuring each step is handled by the most qualified parameters. In evaluations, Chatlaw surpasses general-purpose AI models, including GPT-4, achieving a 7.73% improvement in accuracy on the LawBench benchmark and an 11-point higher score on the Unified Qualification Exam for Legal Professionals. Real-case studies and expert assessments further confirm its robustness. Chatlaw enhances the accessibility and reliability of legal services, advancing the provision of legal support to the public.

2510.19186 2026-06-09 cs.CL 版本更新

When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue

当用户满意但智能体出错:工具增强对话的多维度评估

Tanya Shourya, Yingfan Wang, Zhaoyi Joey Hou, Shamik Roy, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah

发表机构 * AWS AI Labs(AWS人工智能实验室) University of Pittsburgh(匹兹堡大学)

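AI summary: To address cases in tool-augmented dialogue systems where users are satisfied but the agent is wrong, proposes the TRACE benchmark of systematically synthesized, diverse error cases; evaluating existing frameworks on it shows performance far from ideal.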

Comments The Fifth Generation, Evaluation & Metrics Workshop (GEM) at ACL 2026

详情
AI中文摘要

评估使用外部工具的对话式AI系统具有挑战性,因为错误可能源于用户、智能体和工具之间的复杂交互。虽然现有的评估方法要么评估用户满意度,要么评估智能体的工具调用能力,但它们未能捕捉多轮工具增强对话中的关键错误——例如当智能体误解工具结果但用户却感到满意时。我们引入了TRACE,一个系统合成的工具增强对话基准,覆盖了多样的错误案例。使用最先进的对话评估框架进行评估发现,所有方法都远未达到理想性能,展示了该基准的根本难度。

英文摘要

Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents' tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues, such as when agents misinterpret tool results yet their responses appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases. Evaluation with state-of-the-art conversation evaluation frameworks reveals that all approaches remain far from ideal performance, demonstrating the fundamental difficulty of this benchmark.

2601.07994 2026-06-09 cs.CL cs.AI 版本更新

DYCP: Dynamic Context Pruning for Long-Form Dialogue with LLMs

DYCP:基于LLMs的长格式对话动态上下文修剪

Nayoung Choi, Jonathan Zhang, Jinho D. Choi

发表机构 * Computer Science Emory University(计算机科学 埃默里大学)

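AI summary: DyCP dynamically identifies and retrieves dialogue segments to make LLM context management in long-form dialogue more efficient, yielding more precise context selection and improved inference efficiency.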

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于长格式对话,其中话题频繁变化。尽管最近的LLMs支持扩展的上下文窗口,但在实践中仍需有效管理对话历史,以应对推理成本和延迟限制。我们提出了DyCP,一种轻量级的上下文管理方法,该方法在LLM外部实现,能够根据当前轮次动态识别和检索相关对话段落,无需离线内存构建。DyCP在不预设话题边界的情况下管理对话上下文,保持对话的顺序性,实现自适应和高效的上下文选择。在三个长格式对话基准(LoCoMo、MT-Bench+和SCM4LLMs)和多个LLM后端上,DyCP在下游生成任务中实现了具有竞争力的答案质量,具有更选择性的上下文使用和改进的推理效率。

英文摘要

Large Language Models (LLMs) increasingly operate over long-form dialogues with frequent topic shifts. While recent LLMs support extended context windows, inference cost and latency constraints still make efficient management of dialogue history necessary in practice. We present DyCP, a lightweight context management method implemented outside the LLM that dynamically identifies and retrieves relevant dialogue segments conditioned on the current turn, without offline memory construction. DyCP manages dialogue context without predefined topic boundaries while preserving the sequential nature of dialogue, enabling adaptive and efficient context selection. Across three long-form dialogue benchmarks (LoCoMo, MT-Bench+, and SCM4LLMs) and multiple LLM backends, DyCP achieves competitive answer quality in downstream generation, with more selective context usage and improved inference efficiency.

2603.09995 2026-06-09 cs.CL cs.AI 版本更新

Context Over Compute: Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

上下文胜过计算 人类在环优于迭代思维链提示在面试回答质量上的表现

Kewen Zhu, Zixi Liu, Yanjing Li, Jing Chen

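AI summary: Comparing human-in-the-loop and automated chain-of-thought prompting, the paper finds that the human-in-the-loop approach performs better on interview answer quality, needs fewer iterations, and delivers greater training benefit.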

详情
AI中文摘要

使用大语言模型进行行为面试评估存在独特挑战,需要结构化评估、逼真的面试官行为模拟以及对候选人培训的教育价值。我们通过两个受控实验研究思维链提示在面试回答评估和改进中的应用,使用50组行为面试问答对。我们的贡献有三方面:首先,我们提供了人类在环和自动思维链改进的定量比较。采用被试内配对设计(n = 50),两种方法均显示出积极的评分改进。人类在环方法提供了显著的培训效益:信心从3.16提高到4.16(p < 0.001),真实性从2.94提高到4.53(p < 0.001,Cohen's d = 3.21)。人类在环方法所需的迭代次数也减少为五分之一(1.0对5.0,p < 0.001),并实现了完整的个人细节整合。其次,我们分析了收敛行为。两种方法都快速收敛,平均迭代次数低于1次,其中人类在环方法在最初较弱的回答中达到100%的成功率,而自动方法为84%(Cohen's h = 0.82,大效应)。额外的迭代收益递减,表明主要限制是上下文可用性而非计算资源。第三,我们提出了一种基于负面偏见模型的对抗性挑战机制,称为bar raiser,以模拟现实的面试官行为,尽管定量验证仍需未来工作。我们的发现表明,尽管思维链提示为面试评估提供了有用的基础,但领域特定的增强和上下文感知的方法选择对于获得逼真且具有教育价值的结果至关重要。

英文摘要

Behavioral interview evaluation using large language models presents unique challenges that call for structured assessment, realistic simulation of interviewer behavior, and pedagogical value for candidate training. We investigate chain-of-thought prompting for interview answer evaluation and improvement through two controlled experiments with 50 behavioral interview question-answer pairs. Our contributions are threefold. First, we provide a quantitative comparison between human-in-the-loop and automated chain-of-thought improvement. Using a within-subject paired design (n = 50), both approaches show positive rating improvements. The human-in-the-loop approach provides significant training benefits: confidence improves from 3.16 to 4.16 (p < 0.001) and authenticity improves from 2.94 to 4.53 (p < 0.001, Cohen's d = 3.21). The human-in-the-loop method also requires five times fewer iterations (1.0 versus 5.0, p < 0.001) and achieves full personal detail integration. Second, we analyze convergence behavior. Both methods converge rapidly with mean iterations below one, with the human-in-the-loop approach achieving a 100% success rate compared to 84% for automated approaches among initially weak answers (Cohen's h = 0.82, a large effect). Additional iterations provide diminishing returns, indicating that the primary limitation is context availability rather than computational resources. Third, we propose an adversarial challenging mechanism based on a negativity bias model, named bar raiser, to simulate realistic interviewer behavior, although quantitative validation remains future work. Our findings demonstrate that while chain-of-thought prompting provides a useful foundation for interview evaluation, domain-specific enhancements and context-aware approach selection are essential for realistic and pedagogically valuable results.
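
The reported Cohen's h can be reproduced from the two success rates. The sketch below uses the standard definition of Cohen's h, the difference between arcsine-transformed proportions; the function name `cohens_h` is just an illustrative choice.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Success rates among initially weak answers, as quoted in the abstract.
h = cohens_h(1.00, 0.84)
print(round(h, 2))  # 0.82
```

By Cohen's conventional thresholds (0.2 small, 0.5 medium, 0.8 large), a value of 0.82 sits just inside the "large" range, matching the abstract's description.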

2604.24594 2026-06-09 cs.CL cs.AI 版本更新

Skill Retrieval Augmentation for Agentic AI

面向智能体AI的技能检索增强

Weihang Su, Jianming Long, Qingyao Ai, Qiaozhi He, Yichen Tang, Changyue Wang, Yiteng Tu, Yingbo Wang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) ByteDance Inc.(字节跳动公司)

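AI summary: To address the limited context window and declining skill-identification accuracy that existing agent systems face as skill libraries grow, proposes the Skill Retrieval Augmentation (SRA) paradigm, which improves agent performance by dynamically retrieving from an external skill corpus, and builds the SRA-Bench benchmark, which exposes a bottleneck in skill incorporation.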

详情
AI中文摘要

随着大型语言模型(LLMs)演变为能够自主解决问题的智能体,它们越来越依赖外部的、可复用的技能来处理超出其原生参数能力的任务。在现有的智能体系统中,整合技能的主要策略是在上下文窗口内显式枚举可用技能。然而,这种策略无法扩展:随着技能库的扩大,上下文预算迅速消耗,智能体在识别正确技能方面的准确性显著下降。为此,本文提出了技能检索增强(SRA),一种新的范式,其中智能体按需从大型外部技能库中动态检索、整合和应用相关技能。为了使该问题可衡量,我们构建了一个大规模技能库,并引入了SRA-Bench,这是首个对完整SRA流程进行分解评估的基准,涵盖技能检索、技能整合和最终任务执行。SRA-Bench包含5,400个能力密集型测试实例和636个手动构建的金标准技能,这些技能与网络收集的干扰技能混合,形成了一个包含26,262个技能的大规模语料库。大量实验表明,基于检索的技能增强可以显著提高智能体性能,验证了该范式的潜力。同时,我们揭示了技能整合中的一个基本差距:当前的LLM智能体倾向于以相似的速率加载技能,无论是否检索到金标准技能,或者任务是否实际需要外部能力。这表明技能增强的瓶颈不仅在于检索,还在于基础模型判断何时加载何种技能以及何时真正需要外部加载的能力。这些发现将SRA定位为一个独特的研究问题,并为未来智能体系统中能力的可扩展增强奠定了基础。

英文摘要

As large language models (LLMs) evolve into agentic problem solvers, they increasingly rely on external, reusable skills to handle tasks beyond their native parametric capabilities. In existing agent systems, the dominant strategy for incorporating skills is to explicitly enumerate available skills within the context window. However, this strategy fails to scale: as skill corpora expand, context budgets are consumed rapidly, and the agent becomes markedly less accurate in identifying the right skill. To this end, this paper formulates Skill Retrieval Augmentation (SRA), a new paradigm in which agents dynamically retrieve, incorporate, and apply relevant skills from large external skill corpora on demand. To make this problem measurable, we construct a large-scale skill corpus and introduce SRA-Bench, the first benchmark for decomposed evaluation of the full SRA pipeline, covering skill retrieval, skill incorporation, and end-task execution. SRA-Bench contains 5,400 capability-intensive test instances and 636 manually constructed gold skills, which are mixed with web-collected distractor skills to form a large-scale corpus of 26,262 skills. Extensive experiments show that retrieval-based skill augmentation can substantially improve agent performance, validating the promise of the paradigm. At the same time, we uncover a fundamental gap in skill incorporation: current LLM agents tend to load skills at similar rates, regardless of whether a gold skill is retrieved or whether the task actually requires external capabilities. This shows that the bottleneck in skill augmentation lies not only in retrieval but also in the base model's ability to determine which skill to load and when external loading is actually needed. These findings position SRA as a distinct research problem and establish a foundation for the scalable augmentation of capabilities in future agent systems.

2605.11212 2026-06-09 cs.CL 版本更新

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction

ReVision:通过时间视觉冗余减少扩展计算机使用代理

Amirhossein Abaskohi, Yuhang He, Peter West, Giuseppe Carenini, Pranit Chawla, Vibhav Vineet

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Microsoft Research(微软研究院)

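AI summary: ReVision removes redundant visual patches, reducing token usage while improving success rate, so that agents can handle longer trajectories.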

详情
AI中文摘要

计算机使用代理(CUAs)依赖于图形用户界面的视觉观察,每个截图被编码为大量视觉token。随着交互轨迹增长,token成本迅速增加,限制了在固定上下文和计算预算下可纳入的历史量。这导致使用历史时性能提升有限,不同于其他领域。我们通过引入ReVision解决这一效率问题,该方法用于训练多模态语言模型,在轨迹中去除冗余视觉片段,使用学习的片段选择器比较连续截图的片段表示,同时保留模型所需的时空结构。在三个基准测试(OSWorld、WebTailBench和AgentNetBench)中,当使用Qwen2.5-VL-7B处理包含5个历史截图的轨迹时,ReVision平均减少46%的token使用,同时将成功率提高3%。这建立了明显的效率提升,使代理能用更少token处理更长轨迹。通过这一改进效率,我们重新审视CUAs中历史的作用,发现当去除冗余时,性能随更多过去观察的纳入而持续提升。

英文摘要

Computer-use agents (CUAs) rely on visual observations of graphical user interfaces, where each screenshot is encoded into a large number of visual tokens. As interaction trajectories grow, the token cost increases rapidly, limiting the amount of history that can be incorporated under fixed context and compute budgets. As a result, using history has brought little or no performance improvement, unlike in other domains. We address this inefficiency by introducing ReVision, a method for training multimodal language models on trajectories from which redundant visual patches have been removed. A learned patch selector compares patch representations across consecutive screenshots while preserving the spatial structure required by the model. Across three benchmarks (OSWorld, WebTailBench, and AgentNetBench), when processing trajectories with 5 history screenshots using Qwen2.5-VL-7B, ReVision reduces token usage by 46% on average while improving success rate by 3% over the no-drop baseline. This establishes a clear efficiency gain, enabling agents to process longer trajectories with fewer tokens. With this improved efficiency, we revisit the role of history in CUAs and find that performance continues to improve as more past observations are incorporated when redundancy is removed.

2605.16551 2026-06-09 cs.CL 版本更新

PQR: A Framework to Generate Diverse and Realistic User Queries that Elicit QA Agent Failures

PQR:一个生成多样化且逼真用户查询的框架,以引发问答代理失败

Yunan Lu, Luigi Liu, Omar Yahia, Arpit Sharma, Zhou Yu

发表机构 * Columbia University(哥伦比亚大学) University of California San Diego(加州大学圣地亚哥分校) Walmart(沃尔玛)

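AI summary: The PQR framework generates diverse, realistic user queries through iterative interaction to uncover failure cases of QA agents, and is more effective than existing methods.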

详情
AI中文摘要

评估基于LLM的代理仍然具有挑战性,因为识别有意义的失败案例通常需要大量的人力来设计现实的测试场景。先前的工作主要关注自动发现由对抗性用户引起的代理失败,而忽略了具有真实用户意图的查询也触发代理失败的情况。我们引入PQR,一个框架,不仅能够针对特定目标(如有用性、安全性等)揭示代理失败,还能模拟真实用户意图。PQR通过两个互补模块的迭代交互运作。查询精炼模块执行重写以探索多样化的查询变体,而提示精炼模块利用先前反馈推导新的违反目标的策略和现实性政策以精炼提示,从而生成引发失败但逼真的查询。我们在检测电子商务问答代理的不帮助性响应上评估了PQR。我们的方法发现了23% - 78%更多的不帮助性响应,且我们生成的查询比先前方法更加多样化和逼真。

英文摘要

Evaluating LLM-based agents remains challenging because identifying meaningful failure cases often requires substantial human effort to design realistic test scenarios. Prior work primarily focuses on automatically discovering agent failures induced by adversarial users, while overlooking queries with real user intents that also trigger agent failures. We introduce PQR, a framework that surfaces agent failures with respect to specific objectives (e.g., helpfulness, safety) using queries that resemble real users' intents. PQR operates through an iterative interaction between two complementary modules. The query refinement module performs rewrites to explore diverse query variations, while the prompt refinement module uses prior feedback to derive new objective-violating strategies and realism policies for refining prompts, which in turn generate failure-triggering yet realistic queries. We evaluate PQR on detecting an e-commerce QA agent's unhelpful responses. Our method uncovers 23% to 78% more unhelpful responses than previous methods, and our generated queries are more diverse and realistic.

2606.06399 2026-06-09 cs.CL 版本更新

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

CollabSim: 一种基于CSCW的方法,通过受控多智能体实验研究LLM智能体的协作能力

Jiaju Chen, Bo Sun, Yuxuan Lu, Yun Wang, Dakuo Wang, Bingsheng Yao

发表机构 * Northeastern University(东北大学) Microsoft Research Asia(微软亚洲研究院)

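AI summary: Proposes the CollabSim framework, which draws on CSCW theory to define collaborative capabilities, control interaction conditions, and probe agents' internal states, enabling systematic analysis of collaborative competence in LLM multi-agent systems.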

详情
AI中文摘要

基于大语言模型的多智能体系统展现出日益增长的潜力,其有效性依赖于智能体通过文本渠道进行协调的能力,类似于人类团队。然而,近期研究表明,多智能体系统常常失败并非因为智能体缺乏个体任务解决能力,而是因为缺乏协作能力:建立共同基础、维持共享任务理解、平衡个体与集体激励以及在交互过程中修复失调的能力。计算机支持的协同工作领域数十年的研究已经描述了人类团队在受限通信下协调的这些要求,然而现有的多智能体系统评估主要关注任务结果或单智能体在推理、规划和工具使用方面的能力。为了能够系统分析多智能体系统中智能体的协作能力,我们引入了CollabSim,一个可配置的仿真框架,它结合了基于理论的协作能力定义、交互条件的受控操作以及智能体内部状态的行动级探测。在四个大语言模型上的实验表明,CollabSim能够捕捉条件效应、分离模型性能模式,并揭示智能体设计的任务依赖效应。

英文摘要

Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent studies suggest that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

2508.10239 2026-06-09 cs.HC cs.CL 版本更新

Breaking the Curse of Knowledge: Designing Personalized Jargon Support for Real-Time Online Meetings

打破知识的诅咒:为实时在线会议设计个性化术语支持

Yifan Song, Yijun Liu, Wing Yee Au, Hon Yung Wong, Brian P. Bailey, Tal August

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Fujitsu Research of America(富士通美国研究)

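AI summary: Proposes ParseJargon, a system that uses user profiles and in-session feedback for personalized jargon identification, improving comprehension and engagement for cross-disciplinary listeners in online meetings.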

Comments Portions of this work appeared in CHI '26 Extended Abstracts ("Breaking the Curse of Knowledge: Toward Personalized Jargon Support in Online Meetings") and ACL '26 System Demonstrations ("ParseJargon: Personalized Real-time Jargon Support in Online Meetings")

详情
AI中文摘要

跨学科交流常常受到专业语言(即术语)和不均衡背景知识的阻碍。语音转文本和大语言模型的最新进展使得在在线会议期间提供术语支持成为可能,但通用支持(即对每个人定义相同的术语)可能会用听众不需要的定义淹没他们。我们提出了ParseJargon,一个用于实时在线会议中个性化术语支持的系统。我们从一个初始原型开始,探索使用单句用户画像进行个性化。我们进行了一项对照研究,结果表明,与通用支持相比,即使这种最小程度的个性化也能增强听众的理解和参与度,因为术语识别更精确。根据参与者反馈的见解,我们改进了系统,采用了更先进的个性化技术,包括会话内用户反馈和基于便携式词汇表的画像。我们评估了这些技术如何进一步提高术语识别精度,使用对照研究中收集的数据来模拟随时间变化的个性化。我们还进行了延迟测试,并辅以轻量级部署,以分析系统的实时能力和可用性。

英文摘要

Cross-disciplinary communication is often hindered by specialized language (i.e., jargon) and uneven background knowledge. Recent advances in speech-to-text and large language models make it possible to provide jargon support during online meetings, but generic support (i.e., defining the same terms for everyone) can overwhelm listeners with definitions they do not need. We present ParseJargon, a system for personalized jargon support in real-time online meetings. We begin with an initial prototype to probe the use of single-sentence user profiles for personalization. We conducted a controlled study and showed that even this minimal personalization enhanced listeners' comprehension and engagement over generic support because of more precise jargon identification. Guided by insights from participants' feedback, we refined the system with more advanced personalization techniques, including in-session user feedback and portable glossary-based profiles. We evaluated how these techniques can further improve jargon identification precision using data collected in the controlled study to simulate personalization over time. We also conducted a latency test, complemented by a lightweight deployment, to analyze the system's real-time capability and usability.

2601.19082 2026-06-09 cs.AI cs.CL cs.GT cs.LG cs.MA 版本更新

Payoff scaling shapes cooperation in LLM agents across languages

收益规模塑造跨语言LLM代理的合作行为

Trung-Kiet Huynh, Dao-Sy Duy-Minh, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han

发表机构 * Faculty of Information Technology, University of Science (HCMUS), Ho Chi Minh City, Vietnam(信息技术学院,科学大学(HCMUS),胡志明市,越南) Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam(计算机科学与工程学院,胡志明市技术大学(HCMUT),胡志明市,越南) Vietnam National University – Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam(越南国家大学——胡志明市(VNU-HCM),胡志明市,越南) Luxembourg Institute of Science and Technology (LIST), Luxembourg(卢森堡科学与技术研究所(LIST),卢森堡) School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, United Kingdom(计算、工程与数字技术学院,泰赛德大学,米德尔斯布罗,英国)

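AI summary: Using supervised classifiers to identify strategies in the repeated Prisoner's Dilemma, together with an evolutionary game theory baseline, the paper finds that LLMs become more cooperative as payoffs grow, contrary to the evolutionary prediction, pointing to the influence of alignment training and human-like reasoning patterns.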

Comments 44 pages, 17 figures, 4 tables

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为自主代理,代表用户进行谈判、协调和行动。它们在这种环境中是否合作不再只是一个学术问题,而是人工智能治理的核心问题。我们从战略行为的角度出发,探究两个日常杠杆——利害关系的大小和描述交互的语言——如何塑造LLM在重复囚徒困境中采用的策略。我们不直接通过原始行动计数来解读合作,而是训练监督分类器来识别重复博弈的经典策略(始终合作、始终背叛、以牙还牙、赢-留-输-变),并将其作为观察LLM行为的透镜。为了了解在相同收益下策略分布应如何,我们推导了演化博弈论(EGT)基线,并将其与LLM数据进行比较。两种结果以揭示性的方式不一致:随着收益增加,演化理论预测背叛应占据主导,但LLM却向相反方向移动,变得更加合作——我们认为,这是对齐训练和LLM从训练数据中继承的人类推理模式的标志。我们进一步表明,这种情况并非前沿规模、专有模型所特有:它也出现在三个开放权重的较小LLM中。总体而言,我们的分析强调,收益设计和语言框架是强大但未被充分探索的引导LLM行为的杠杆,对评估、对齐和治理部署在高风险、多语言环境中的多代理AI系统具有直接影响。

英文摘要

Large language models (LLMs) are increasingly deployed as autonomous agents that negotiate, coordinate, and act on behalf of users. Whether they cooperate in such settings is no longer just an academic question, but a central issue for AI governance. We approach it from a strategic-behaviour angle, asking how two everyday levers, the size of what is at stake and the language in which the interaction is described, shape the strategies LLMs adopt in a repeated Prisoner's Dilemma. Rather than reading cooperation off raw action counts, we train supervised classifiers to recognise the canonical strategies of repeated games (always cooperate, always defect, Tit-for-Tat, Win-Stay-Lose-Shift) and use them as a lens onto LLM behaviour. To know what the strategy distribution should look like under the same payoffs, we derive an evolutionary game theory (EGT) baseline and compare it with the LLM data. The two outcomes disagree in a revealing way: as stakes grow, evolutionary theory predicts that defection should take over the population, yet LLMs move in the opposite direction, becoming more cooperative, which we argue is a signature of alignment training and the human-like reasoning patterns LLMs inherit from their training data. We further show that this picture is not particular to frontier-scale, proprietary models: it also occurs with three open-weight smaller LLMs. Overall, our analysis highlights that payoff design and linguistic framing are powerful but under-explored levers for steering LLM behaviour, with direct implications for evaluating, aligning, and governing multi-agent AI systems deployed in high-stakes, multilingual environments.
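
For readers unfamiliar with the four canonical strategies named in this abstract, the sketch below implements them in Python and plays them against each other. It only illustrates the textbook definitions; the payoff values are a common illustrative choice and an assumption here, and the paper's classifiers and EGT baseline are not reproduced.

```python
from typing import Callable, List, Tuple

C, D = "C", "D"
History = List[Tuple[str, str]]  # (my move, opponent move) per round

# Prisoner's Dilemma payoffs to the row player, with T > R > P > S.
# These particular numbers are an assumption for illustration.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}


def always_cooperate(history: History) -> str:
    return C


def always_defect(history: History) -> str:
    return D


def tit_for_tat(history: History) -> str:
    # Cooperate first, then copy the opponent's previous move.
    return C if not history else history[-1][1]


def win_stay_lose_shift(history: History) -> str:
    # Cooperate first; repeat the last move after a good payoff (R or T),
    # switch after a bad one (P or S).
    if not history:
        return C
    mine, theirs = history[-1]
    if PAYOFF[(mine, theirs)] >= PAYOFF[(C, C)]:
        return mine
    return D if mine == C else C


def play(a: Callable[[History], str], b: Callable[[History], str],
         rounds: int) -> Tuple[str, str]:
    """Return the move sequences of both players as strings."""
    hist_a: History = []
    hist_b: History = []
    for _ in range(rounds):
        move_a, move_b = a(hist_a), b(hist_b)
        hist_a.append((move_a, move_b))
        hist_b.append((move_b, move_a))
    return "".join(m for m, _ in hist_a), "".join(m for m, _ in hist_b)


print(play(tit_for_tat, always_defect, 5))          # ('CDDDD', 'DDDDD')
print(play(win_stay_lose_shift, always_defect, 5))  # ('CDCDC', 'DDDDD')
```

Against an unconditional defector, Tit-for-Tat locks into defection after one round, while Win-Stay-Lose-Shift alternates because every outcome against a defector counts as a loss. Differences like this in the move sequences are what make the strategies distinguishable from play alone.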

5. Text Generation, Summarization, and Editing (17 papers)

2606.07925 2026-06-09 cs.CL 新提交

ROSUM-MCTS: Monte Carlo Tree Search-Inspired HDL Code Summarization with Structural Rewards

ROSUM-MCTS:基于蒙特卡洛树搜索的HDL代码摘要生成与结构奖励

Prashanth Vijayaraghavan, Charles Mackin, Luyao Shi, Apoorva Nitsure, Ashutosh Jadhav, David Beymer, Tyler Baldwin, Ehsan Degan, Vandana Mukherjee

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

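AI summary: Proposes ROSUM-MCTS, which uses Monte Carlo Tree Search to guide a large language model and optimizes summaries of hardware description language code through hierarchical candidate expansion and a composite reward function, outperforming baselines on VHDL and Verilog datasets.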

Comments 7 pages

详情
Journal ref
ICLAD'2025
AI中文摘要

大型语言模型(LLMs)在代码摘要方面显示出潜力,但其对硬件描述语言(HDL)如VHDL和Verilog的有效性尚未充分探索。我们提出ROSUM-MCTS,一种受蒙特卡洛树搜索(MCTS)启发的LLM引导方法,通过结构化探索和强化驱动优化来改进摘要。我们的方法通过分层候选扩展机制整合局部和全局上下文,并使用复合奖励函数优化摘要,该函数平衡功能正确性(FC)、局部内容充分性(LCA)和流畅性。我们在VHDL-eval和Verilog-eval数据集上评估ROSUM-MCTS,证明其通过利用结构化自底向上细化和基于强化的优化,持续优于基线方法。消融研究证实了局部和全局扩展策略的必要性,以及平衡FC和LCA以获得最佳性能的重要性。此外,ROSUM-MCTS对表面修改(如变量重命名)具有鲁棒性,在基线性能下降时仍能保持摘要质量。这些结果确立了ROSUM-MCTS作为有效且鲁棒的HDL摘要框架,为进一步研究强化增强的代码摘要铺平了道路。

英文摘要

Large language models (LLMs) have shown promise in code summarization, yet their effectiveness for Hardware Description Languages (HDLs) like VHDL and Verilog remains underexplored. We propose ROSUM-MCTS, an LLM-guided approach inspired by Monte Carlo Tree Search (MCTS) that refines summaries through structured exploration and reinforcement-driven optimization. Our method integrates both local and global context via a hierarchical candidate expansion mechanism and optimizes summaries using a composite reward function balancing functional correctness (FC), local content adequacy (LCA), and fluency. We evaluate ROSUM-MCTS on the VHDL-eval and Verilog-eval datasets, demonstrating its consistent outperformance over baseline methods by leveraging structured bottom-up refinement and reinforcement-based optimization. Ablation studies confirm the necessity of both local and global expansion strategies, as well as the importance of balancing FC and LCA for optimal performance. Furthermore, ROSUM-MCTS proves robust against superficial modifications, such as variable renaming, maintaining summary quality where baselines degrade. These results establish ROSUM-MCTS as an effective and robust HDL summarization framework, paving the way for further research into reinforcement-enhanced code summarization.

2606.07951 2026-06-09 cs.CL cs.AI cs.LG 新提交

From 'May' to 'Is': Certainty Distortion in Language Model Rewriting

从“可能”到“是”:语言模型改写中的确定性扭曲

Catarina G Belem, Shang Wu, Hongyu Yao, Mark Steyvers, Sameer Singh, Padhraic Smyth

发表机构 * University of California Irvine(加利福尼亚大学尔湾分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 研究语言模型在改写任务中系统性增加表达确定性的偏差,提出基于人群判断的评估指标,发现高达75%的输出存在确定性扭曲,且模型更倾向于提高确定性。

AI中文摘要

人类越来越多地以塑造信念和驱动决策的方式使用语言模型(LM),包括讨论、改写和总结来自科学文章、新闻和医学报告的信息。然而,在这些领域中,主张表达的信心程度至关重要,但关于LM是否忠实地保留它却知之甚少。在这项工作中,我们研究了LM中的确定性扭曲,定义为当语义内容被保留时,表达确定性的有意义变化。我们提出了一种基于LM的评估指标,该指标与人群层面的确定性判断一致。使用该指标,我们在科学和医学交流任务的背景下,表征了不同规模和系列的模型中的确定性扭曲。我们的结果表明,确定性扭曲影响了高达75%的LM输出,并且在改写任务中系统性地不对称,大多数LM将表达确定性增加的可能性是降低的1.5-2倍。这些效应可以通过重复释义累积:在医学领域,claude-haiku-4-5在一次迭代后增加了20%示例的确定性,五次迭代后增加到40%。基于提示的干预减少了整体确定性扭曲,但并未消除它。总之,这些发现揭示了普遍存在的夸大表达确定性的偏差,对在高风险领域依赖LM的用户有直接影响。

英文摘要

Humans increasingly turn to Language Models (LMs) in ways that shape beliefs and drive decisions, including discussing, rewriting, and summarizing information from scientific articles, news, and medical reports. However, in these domains, where how confidently a claim is expressed matters, little is known about whether LMs faithfully preserve that expressed confidence. In this work, we investigate certainty distortion in LMs, defined as meaningful changes in expressed certainty when semantic content is preserved. We propose an LM-based evaluation metric that is consistent with population-level judgments of certainty. Using this metric, we characterize certainty distortion across different sizes and families of models in the context of scientific and medical communication tasks. Our results show that certainty distortion affects up to 75% of LM outputs and is systematically asymmetric in rewriting tasks, with most LMs being 1.5-2x more likely to increase the expressed certainty than to decrease it. These effects can compound over repeated paraphrasing: in the medical domain, claude-haiku-4-5 increases the certainty of 20% of examples after a single iteration, rising to 40% after five iterations. Prompt-based interventions reduce overall certainty distortion but do not eliminate it. Together, these findings reveal a general bias toward inflating expressed certainty, with direct implications for users who rely on LMs in high-stakes domains.
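
The two headline quantities, the share of distorted outputs and the increase-to-decrease asymmetry, can be computed from per-example labels as in this minimal sketch. The label names and the function `distortion_stats` are hypothetical; the paper's LM-based metric that would produce such labels is not reproduced here.

```python
from collections import Counter


def distortion_stats(labels):
    """Return (distortion rate, increase/decrease ratio) for a list of labels.

    Each label is assumed to be one of "increase", "decrease", or "same".
    """
    counts = Counter(labels)
    n = len(labels)
    up, down = counts["increase"], counts["decrease"]
    rate = (up + down) / n if n else 0.0
    ratio = up / down if down else float("inf")
    return rate, ratio


# Toy labels, not data from the paper.
labels = ["increase"] * 3 + ["decrease"] * 2 + ["same"] * 5
rate, ratio = distortion_stats(labels)
```

A ratio above 1 means rewrites raise expressed certainty more often than they lower it.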

2606.08048 2026-06-09 cs.CL 新提交

Diffusion Language Model Parallel Decoding via Product-of-Experts Bridge

通过专家乘积桥接的扩散语言模型并行解码

Juntong Shi, Brian L. Trippe, Jure Leskovec, Stefano Ermon, Minkai Xu

发表机构 * Stanford University(斯坦福大学)

AI总结 提出PoE-Bridge框架,通过专家乘积构建中间分布,结合扩散语言模型并行解码和自回归模型质量,实现5倍加速并恢复至少95%的AR性能。

Comments ICML 2026

AI中文摘要

扩散语言模型(DLM)通过并行解码提供了显著的速度优势,但与自回归(AR)模型相比,缺乏令牌依赖性限制了生成质量。最近的进展试图通过重要性采样来弥合差距,其中DLM作为提议分布,AR作为目标分布。然而,由于它们分布之间的巨大差距,采样需要大量粒子,因此计算成本高昂。在本文中,我们引入了PoE-Bridge,一种新颖的解码框架,通过引入中间分布来弥合差距,从而大幅提高生成速度和准确性。该分布被构建为DLM提议和AR目标的专家乘积(PoE)。借助中间分布,我们首先使用DLM并行起草多个续写,然后应用拒绝采样验证起草的令牌,并将结果候选向PoE移动。接着,我们使用重要性采样进一步将PoE对齐的候选向AR目标校正。我们还提出了若干改进技术,包括用于增强多样性的混合温度采样和用于减少浪费验证的弹性拒绝窗口。实验上,PoE-Bridge在标准DLM解码方法上实现了显著提高的准确性,速度提升5倍,并恢复了目标AR模型至少95%的性能,在具有挑战性的数学推理和编码任务上高效地推进了大部分质量差距。我们的代码可在https://github.com/juntongshi48/poe-bridge获取。

英文摘要

Diffusion language models (DLMs) offer substantial speed advantages through parallel decoding, but the lack of token dependencies limits generation quality compared to autoregressive (AR) models. Recent progress attempts to bridge the gap via importance sampling, with the DLM as the proposal and the AR model as the target. However, due to the huge gap between their distributions, the sampling requires a large number of particles and is thus expensive to compute. In this paper, we introduce PoE-Bridge, a novel decoding framework that drastically improves generation speed and accuracy by introducing an intermediate distribution to bridge the gap. The distribution is constructed as a Product-of-Experts (PoE) of the DLM proposal and the AR target. With the intermediate distribution, we first use the DLM to draft multiple continuations in parallel, then apply rejection sampling to verify the drafted tokens and move the resulting candidates toward the PoE. We then use importance sampling to further correct the PoE-aligned candidates toward the AR target. We further propose several improved techniques, including mixed-temperature sampling for enhanced diversity and elastic rejection windows for reducing wasted verification. Empirically, PoE-Bridge achieves significantly improved accuracy with a 5x speedup over the standard DLM decoding approach, and recovers at least 95% of the target AR model's performance, efficiently closing most of the quality gap on challenging mathematical reasoning and coding tasks. Our code is available at https://github.com/juntongshi48/poe-bridge.
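
To make the idea of an intermediate distribution concrete, here is a toy sketch over a three-token vocabulary. It assumes the PoE is a normalized geometric mixture with a mixing exponent `alpha`; the abstract only says the bridge is a Product-of-Experts of the DLM proposal and the AR target, so the exact form, the function names, and the toy probabilities are all assumptions.

```python
import math


def product_of_experts(q, p, alpha=0.5):
    """Normalized q^alpha * p^(1 - alpha); alpha is an assumed mixing exponent."""
    unnorm = [qi ** alpha * pi ** (1.0 - alpha) for qi, pi in zip(q, p)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def kl(p, q):
    """KL(p || q) in nats, skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def importance_weights(target, proposal):
    """Per-outcome weights target / proposal for correcting toward the target."""
    return [t / r for t, r in zip(target, proposal)]


dlm = [0.7, 0.2, 0.1]  # toy DLM proposal
ar = [0.1, 0.2, 0.7]   # toy AR target
bridge = product_of_experts(dlm, ar)
weights = importance_weights(ar, bridge)
```

In this toy case the bridge is closer to the AR target (in KL) than the raw DLM proposal is, which is the intuition for why importance sampling from the bridge would need fewer particles.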

2606.08184 2026-06-09 cs.CL 新提交

TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding

TextEconomizer:利用去噪变换器和熵编码增强有损文本压缩

Mahbub E Sobhani, Anika Tasnim Rodela, Chowdhury Mofizur Rahman, Dewan Md. Farid, Swakkhar Shatabda

发表机构 * United International University(联合国际大学) BRAC University(BRAC大学) Southeast University(东南大学)

AI总结 提出TextEconomizer编码器-解码器框架,结合去噪变换器和熵编码,实现50%-80%的压缩率,参数减少153倍,在BLEU等指标上保持近完美文本质量。

Comments Published in Neural Networks (Elsevier), Vol. 203, 2026

Journal ref
Neural Networks, Vol. 203, 109111, 2026
AI中文摘要

有损文本压缩在保留核心含义的同时减少数据大小,适用于摘要、自动分析和数字存档。尽管基于变换器的模型在语言建模中占主导地位,但将上下文向量和熵编码集成到序列到序列(Seq2Seq)生成中仍未充分探索。一个关键挑战在于从编码器输出中识别信息最丰富的上下文向量,并引入熵编码以提高存储效率,同时即使在噪声文本下也能保持高质量输出。我们提出了TextEconomizer,一种与变换器神经网络配对的编码器-解码器框架,无需数据集维度的先验知识即可将可变大小输入减少50%至80%。我们的模型通过熵编码实现了有竞争力的压缩比,同时通过BLEU、ROUGE、METEOR和语义相似度评分评估,提供了近乎完美的文本质量。TextEconomizer的参数数量比同类模型少约153倍,实现了5.39倍的压缩比,且不牺牲语义质量。我们还评估了一个基于LSTM的自编码器,实现了最先进的67倍压缩比,参数减少196倍;以及LLaMAFormer,一种改进的变换器,参数比ICAE少263倍,同时保持有竞争力的文本质量。TextEconomizer在平衡内存效率和高保真输出方面显著超越了现有的基于变换器的模型,标志着有损压缩在最优空间利用方面的突破。

英文摘要

Lossy text compression reduces data size while preserving core meaning, making it well-suited for summarization, automated analysis, and digital archives. Despite the dominance of transformer-based models in language modeling, integrating context vectors and entropy coding into Sequence-to-Sequence (Seq2Seq) generation remains underexplored. A key challenge lies in identifying the most informative context vectors from encoder output and incorporating entropy coding to enhance storage efficiency while maintaining high-quality outputs, even under noisy text. We introduce TextEconomizer, an encoder-decoder framework paired with a transformer neural network that reduces variable-sized inputs by 50% to 80% without prior knowledge of dataset dimensions. Our model achieves competitive compression ratios via entropy coding while delivering near-perfect text quality, assessed by BLEU, ROUGE, METEOR, and semantic similarity scores. TextEconomizer operates with approximately 153x fewer parameters than comparable models, achieving a 5.39x compression ratio without sacrificing semantic quality. We also evaluate an LSTM-based autoencoder achieving a state-of-the-art 67x compression ratio with 196x fewer parameters, and LLaMAFormer, a modified transformer with 263x fewer parameters than ICAE while maintaining competitive text quality. TextEconomizer significantly surpasses existing transformer-based models in balancing memory efficiency and high-fidelity outputs, marking a breakthrough in lossy compression with optimal space utilization.
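
For readers unfamiliar with the two quantities the abstract relies on, the sketch below computes the empirical Shannon entropy of a symbol stream (a lower bound on the average bits per symbol an entropy coder can reach under a memoryless model) and a plain compression ratio. The function names are made up for this example and nothing here reproduces TextEconomizer itself.

```python
import math
from collections import Counter


def shannon_entropy_bits(symbols):
    """Empirical entropy in bits per symbol of a sequence (0.0 for empty input)."""
    n = len(symbols)
    if n == 0:
        return 0.0
    counts = Counter(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def compression_ratio(original_bytes, compressed_bytes):
    """Original size divided by compressed size; higher means smaller output."""
    if compressed_bytes <= 0:
        raise ValueError("compressed_bytes must be positive")
    return original_bytes / compressed_bytes
```

Under this definition, a 5.39x ratio corresponds to, for instance, 539 bytes stored in 100 bytes.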

2606.08357 2026-06-09 cs.CL 新提交

Forward-Free Diffusion Language Models

无前向过程的扩散语言模型

Haotian Sun, Rushi Qiang, Yuqian Zheng, Bo Dai

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出FReDA,一种无需人工设计前向过程的扩散语言模型,通过递归分布细化利用模型生成草稿作为隐式中间状态,在推理和编码任务上超越更大模型,并实现1.5-1.8倍加速。

AI中文摘要

扩散语言模型通过迭代去噪生成文本,为自回归生成提供了强大的替代方案。然而,离散语言空间缺乏用于定义有效扰动的自然邻域结构,因此在前向过程中提出了一些人工破坏方案。这些预设的前向过程通常产生数学上方便但与生成过程中遇到的草稿和错误不一致的状态,导致样本质量下降。为了解决这一限制,我们提出了FReDA,一种无前向过程的扩散语言模型,消除了对人工设计前向过程的需求。我们将扩散语言建模形式化为递归分布细化,其中模型生成的草稿作为隐式中间状态,学习的细化模型逐步将草稿分布推向目标分布。具体地,FReDA通过提出候选草稿序列并直接执行自我细化或通过最佳N细化在并行候选中进行选择来细化草稿。通过这种设计,FReDA是邻域无关的、模型复杂度感知的,并且与灵活的细化参数化兼容。在sub-8B规模下的广泛评估表明,FReDA-4B在推理和编码基准上优于更大的扩散基础模型,实现了高达15%的绝对增益,同时相对于扩散基线达到1.5-1.8倍的平均加速,并且随着额外细化计算量的增加而有效扩展。

英文摘要

Diffusion language models generate text through iterative denoising, offering a powerful alternative to autoregressive generation. However, discrete language spaces lack a natural neighborhood structure for defining effective perturbations, so some artificial corruption schemes are proposed in the forward process. Such prescribed forward processes often produce states that are mathematically convenient but misaligned with drafts and errors encountered during generation, resulting in degraded sample quality. To address this limitation, we propose FReDA, a forward-free diffusion language model that eliminates the need for a hand-designed forward process. We formulate diffusion language modeling as recursive distribution refinement, in which model-generated drafts serve as implicit intermediate states, and the learned refinement model progressively moves the draft distribution toward the target distribution. Concretely, FReDA refines drafts by proposing candidate draft sequences and either directly performing self-refinement or selecting among parallel candidates via best-of-N refinement. With this design, FReDA is neighborhood-agnostic, model-complexity-aware, and compatible with flexible refinement parameterizations. Extensive evaluations in the sub-8B regime show that FReDA-4B outperforms larger diffusion base models on reasoning and coding benchmarks, achieving absolute gains of up to 15%, while reaching a 1.5-1.8x average speedup over diffusion baselines and scaling effectively with additional refinement computation.
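
The best-of-N refinement step mentioned in the abstract can be pictured as a loop that proposes candidate drafts and keeps the highest-scoring one. The sketch below is only a schematic under stated assumptions: `propose` and `score` stand in for a learned refinement model and a selection criterion, and the toy versions here operate on characters rather than tokens.

```python
from typing import Callable, List


def best_of_n_refine(draft: str,
                     propose: Callable[[str], List[str]],
                     score: Callable[[str], float],
                     rounds: int = 3) -> str:
    """Repeatedly replace the draft with the best-scoring candidate."""
    current = draft
    for _ in range(rounds):
        # Keep the current draft as a fallback so quality never decreases.
        candidates = propose(current) + [current]
        current = max(candidates, key=score)
    return current


TARGET = "diffusion"  # toy target used only to define the toy functions


def toy_propose(s: str) -> List[str]:
    """Hypothetical proposer: fix one position at a time."""
    return [s[:i] + TARGET[i] + s[i + 1:] for i in range(len(s))]


def toy_score(s: str) -> float:
    """Hypothetical scorer: number of positions matching the target."""
    return sum(a == b for a, b in zip(s, TARGET))


refined = best_of_n_refine("dxffuzion", toy_propose, toy_score)
```

Each round starts from a model-generated draft rather than from a state produced by a hand-designed corruption process, which is the contrast the abstract draws.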

2606.08408 2026-06-09 cs.CL cs.AI 新提交

TimpaTeks: Automatic In-place Text Sequence Modification via Diffusion Language Model Steering

TimpaTeks: 通过扩散语言模型引导实现自动原地文本序列修改

Ryandito Diandaru, Ikhlasul Akmal Hanif, Fadli Aulawi Al Ghiffari, Ahmed Elshabrawy, Alham Fikri Aji

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出TimpaTeks方法,将激活引导扩展到扩散语言模型,实现原地文本修改以改变概念,在情感和概念引导任务上降低困惑度并保持句子结构。

Comments 16 pages

AI中文摘要

我们将激活引导扩展到扩散语言模型(DLM),并研究了一个由于DLM推理机制而产生的新问题:原地修改文本以呈现不同的概念。我们提出了TimpaTeks,一种使用DLM的自动原地文本修改机制。在IMDB电影评论(情感)和合成的猫狗数据集(任意、更非常规的概念引导)上的实验表明,TimpaTeks提供了一种可行的新机制来原地引导扩散语言模型的输出。TimpaTeks实现了原地修改,同时降低了句子困惑度并保留了原始句子结构,无需指令调优模型。与基于提示的DLM引导相比,TimpaTeks计算成本更低,因为它执行原地去噪,而不是构建额外的提示条件输出序列。

英文摘要

We extend activation steering to diffusion language models (DLMs) and study a novel problem that arises from the inference mechanism of DLMs: modifying a text in place to manifest a different concept. We propose TimpaTeks, an automatic in-place text modification mechanism using DLMs. Experiments on IMDB movie reviews (sentiment) and a synthetic Cats and Dogs Dataset (arbitrary, more unconventional concept steering) show that TimpaTeks provides a feasible novel mechanism to steer diffusion language model outputs in place. TimpaTeks enables in-place modification while simultaneously lowering sentence perplexity and retaining the original sentence structure, without the need for instruction-tuned models. TimpaTeks is also computationally cheaper than prompt-based DLM steering, as it performs denoising in place rather than constructing an additional prompt-conditioned output sequence.

2606.08411 2026-06-09 cs.CL 新提交

AsyncLane: Decoupling Refinement from Advancement in Diffusion Language Model Decoding

AsyncLane: 扩散语言模型解码中精炼与推进的解耦

Yingxuan Ren, Yuxuan Lou, Yong Liu, Pengcheng Fang, Ziming Wang, Pengfei Zhou, Yang You

发表机构 * National University of Singapore(新加坡国立大学) University of Southampton(南安普顿大学)

AI总结 提出AsyncLane,一种无需训练的解码调度器,通过将生成过程分叉为精炼和推进两个通道,解耦块间依赖,在保持质量的同时显著提升扩散语言模型的解码吞吐量。

AI中文摘要

块级半自回归解码是扩散大语言模型(DLMs)的标准推理范式,但它强制块之间存在严格依赖:当前块完全解码或去噪预算耗尽之前,下一个块无法开始。我们观察到,一旦一个块暴露出可靠的分隔符边界或稳定的语义前缀,续写生成无需等待每个残差标记被解析。我们提出AsyncLane,一种无需训练的解码调度器,将精炼与推进解耦。AsyncLane在观察到的分隔符边界处将生成通道分叉为精炼通道和续写生成通道:前缀保持可编辑,而续写在前缀精炼完成之前推进。由此产生的通道树记录解码依赖关系和输出顺序,而执行则在活跃通道集上进行。为了使这种异步调度在双向注意力下高效,AsyncLane结合了共享前缀通道批处理、前瞻草稿重用、级联终止以及带有刷新-逻辑重用的紧凑缓存刷新,防止模型调用成本随通道数量线性增长。AsyncLane是块级DLM采样器的即插即用替代品,无需重新训练。在数学推理和代码生成实验表明,AsyncLane在保持竞争性质量的同时持续提高吞吐量。在LLaDA和Dream骨干网络上,AsyncLane在所有评估的基准长度设置中实现了最高的TPS;相对于最快的竞争基线,它在LLaDA上达到2.95倍峰值加速,在Dream上达到3.04倍,在较长生成预算下增益尤为显著。

英文摘要

Block-wise semi-autoregressive decoding is the standard inference paradigm for diffusion large language models (DLMs), but it imposes a strict dependency between blocks: the next block cannot begin until the current block is fully decoded or its denoising budget is exhausted. We observe that once a block exposes a reliable delimiter boundary or stable semantic prefix, continuation generation need not wait for every residual token to be resolved. We propose AsyncLane, a training-free decoding scheduler that decouples refinement from advancement. AsyncLane forks a generate lane at observed delimiter boundaries into a refine lane and a continuation generate lane: the prefix remains editable, while the continuation advances before prefix refinement finishes. The resulting lane tree records decoding dependencies and output order, while execution proceeds over the active lane set. To make this asynchronous schedule efficient under bidirectional attention, AsyncLane combines shared-prefix lane batching, lookahead draft reuse, cascading termination, and compact cache refresh with refresh-logit reuse, preventing model-call cost from scaling directly with the number of lanes. AsyncLane is a drop-in replacement for block-wise DLM samplers and requires no retraining. Experiments on mathematical reasoning and code generation show that AsyncLane consistently improves throughput while maintaining competitive quality. Across LLaDA and Dream backbones, AsyncLane achieves the highest TPS in all evaluated benchmark-length settings; relative to the fastest competing baseline, it reaches peak speedups of 2.95x on LLaDA and 3.04x on Dream, with especially large gains under longer generation budgets.

2606.08445 2026-06-09 cs.CL cs.AI 新提交

Segment-level Tree Search for Long Meeting Document Summarization

长会议文档摘要的段级树搜索

Sangwon Ryu, Heejin Do, Jun Seo, Daehui Kim, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok

发表机构 * GSAI, POSTECH(浦项科技大学人工智能研究院) CSE, POSTECH(浦项科技大学计算机科学与工程系) ETH Zurich(苏黎世联邦理工学院) ETH AI Center(苏黎世联邦理工学院人工智能中心) Agentic AI Lab, KT(KT公司智能体人工智能实验室) LILT(LILT公司)

AI总结 提出基于蒙特卡洛树搜索的段级摘要框架S3,无需训练即可组合段级候选摘要,使用7B模型达到72B模型性能。

Comments INTERSPEECH 2026

AI中文摘要

会议文档因其长度和复杂的对话结构而难以总结。现有方法通常采用多阶段流水线,在摘要之前提取信息;然而,这些方法往往因缺乏中间验证而遭受累积错误传播,这一限制因短且低质量的参考摘要而进一步放大。我们提出通过蒙特卡洛树搜索进行段级摘要(S3),这是一个无需训练的框架,通过组合段级摘要候选来构建最终摘要。S3将长文档划分为多个段,并为每个段生成多个摘要候选,形成搜索树的节点。通过自我奖励引导的树搜索选择最佳评分组合,并精炼为最终输出。尽管使用7B模型,S3在生成长度合适的摘要的同时,实现了与更大的72B模型相当的性能。

英文摘要

Meeting documents are challenging to summarize due to their length and complex conversational structure. Existing approaches typically adopt multi-stage pipelines that extract information prior to summarization; however, these approaches often suffer from cumulative error propagation without intermediate validation, a limitation further amplified by short and low-quality reference summaries. We propose segment-level summarization via Monte Carlo Tree Search (S3), a training-free framework that constructs a final summary by composing segment-level summary candidates. S3 partitions a long document into segments and generates multiple summary candidates per segment, forming nodes of a search tree. The best-scoring combination is selected via self-reward-guided tree search and refined into the final output. Despite using a 7B model, S3 achieves performance comparable to larger 72B models while producing length-appropriate summaries.
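
The composition step, choosing one candidate summary per segment so that the combination scores best, can be sketched as follows. Note that this example swaps in exhaustive enumeration of the candidate tree for the Monte Carlo Tree Search used by S3, and that `toy_score` is a hypothetical length-based stand-in for the paper's self-reward; both are assumptions for illustration only.

```python
from itertools import product


def best_combination(candidates_per_segment, score):
    """Score every root-to-leaf path of the candidate tree and return the best."""
    return list(max(product(*candidates_per_segment), key=score))


def toy_score(summaries, budget=9):
    """Hypothetical reward: penalize distance from a target word count."""
    words = sum(len(s.split()) for s in summaries)
    return -abs(words - budget)


candidates = [
    ["Team agreed on the Q3 roadmap.", "Roadmap discussed."],
    ["Budget review moved to next week after objections from finance.",
     "Budget review postponed."],
]
best = best_combination(candidates, toy_score)
```

Exhaustive enumeration grows exponentially with the number of segments, which is one reason a guided tree search would be preferred for long meetings.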

2606.08940 2026-06-09 cs.CL 新提交

Multilingual Sentiment Aware Text Summarization: A Reinforcement Learning Approach for Consistency Maintenance

多语言情感感知文本摘要:一种用于一致性维护的强化学习方法

Mikhail Krasitskii, Alexander Gelbukh, Olga Kolesnikova, Grigori Sidorov

发表机构 * Instituto Politécnico Nacional (IPN), Centro de Investigación en Computación (CIC)(国立理工学院(IPN),计算研究中心(CIC))

AI总结 研究RLHF摘要中的情感漂移现象,提出基于策略归因框架的情感感知KL正则化方法,在保持摘要质量的同时缓解情感中性化。

AI中文摘要

来自人类反馈的强化学习(RLHF)显著提高了大语言模型在文本摘要中的质量和流畅性。然而,其对情感属性的影响仍未被充分理解。在这项工作中,我们研究了情感漂移,即基于RLHF的摘要输出相对于源文本向中性情感的系统性偏移。我们在多个数据集、模型架构和八种语言上进行了广泛实验,以分析对齐目标如何影响情感保留。我们的结果表明,情感漂移是一种一致现象,随着KL正则化强度的增加而增强,表明对齐稳定性与情感保真度之间存在权衡。为了解释这种行为,我们引入了一个策略归因框架,该框架分解了RLHF目标并量化了其组成部分的贡献。我们的分析表明,KL正则化是所有设置中情感抑制的主要驱动因素。基于这些发现,我们提出了对KL正则化项的情感感知修改,该修改选择性地减少对情感承载标记的约束。实证结果表明,这种方法在保持摘要质量的同时缓解了情感漂移。总体而言,我们的发现揭示了当前对齐方法的一个基本局限性:虽然它们提高了事实一致性和安全性,但可能无意中抑制了情感表达。这促使我们开发明确考虑情感保留的对齐策略。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) has significantly improved the quality and fluency of large language models in text summarization. However, its impact on affective properties remains insufficiently understood. In this work, we study sentiment drift, a systematic shift toward neutral sentiment in RLHF-based summarization outputs compared to source texts. We conduct extensive experiments across multiple datasets, model architectures, and eight languages to analyze how alignment objectives influence sentiment preservation. Our results show that sentiment drift is a consistent phenomenon that becomes stronger with increased KL regularization strength, indicating a trade-off between alignment stability and affective fidelity. To explain this behavior, we introduce a Policy Attribution framework that decomposes the RLHF objective and quantifies the contribution of its components. Our analysis reveals that KL regularization is the primary driver of sentiment suppression across all settings. Based on these findings, we propose a sentiment-aware modification of the KL regularization term, which selectively reduces constraints on sentiment-bearing tokens. Empirical results demonstrate that this approach mitigates sentiment drift while maintaining summarization quality. Overall, our findings highlight a fundamental limitation of current alignment methods: while they improve factual consistency and safety, they may unintentionally suppress emotional expressiveness. This motivates the development of alignment strategies that explicitly account for affective preservation.
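
One way to read "selectively reduces constraints on sentiment-bearing tokens" is as a per-token weight on the KL penalty. The sketch below assumes that reading; the coefficient `beta`, the relaxation factor `relax`, and the boolean sentiment mask are hypothetical, and the paper's actual formulation may differ.

```python
def weighted_kl_penalty(token_kls, is_sentiment, beta=0.1, relax=0.5):
    """Sum per-token KL terms, down-weighting sentiment-bearing tokens.

    token_kls:    per-token KL values between policy and reference model
    is_sentiment: booleans marking sentiment-bearing tokens (assumed given)
    beta:         overall KL coefficient (assumed value)
    relax:        weight in (0, 1] applied to sentiment-bearing tokens (assumed)
    """
    if len(token_kls) != len(is_sentiment):
        raise ValueError("token_kls and is_sentiment must have equal length")
    return beta * sum(kl * (relax if flag else 1.0)
                      for kl, flag in zip(token_kls, is_sentiment))


# Toy values: the middle token carries sentiment and is penalized half as much.
penalty = weighted_kl_penalty([0.2, 0.4, 0.1], [False, True, False])
```

With `relax=1.0` this reduces to the usual uniform KL penalty, so the standard objective is recovered as a special case.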

2606.09159 2026-06-09 cs.CL cs.AI 新提交

Unified Energy for Invariant and Independent Decoding in Diffusion Language Models

扩散语言模型中不变性与独立性解码的统一能量

Yuchen Yan, Minkai Xu, Zaiquan Yang, Yatao Bian

发表机构 * National University of Singapore(新加坡国立大学) Stanford University(斯坦福大学) City University of Hong Kong(香港城市大学)

AI总结 针对扩散语言模型并行生成文本时与自回归模型的性能差距,提出统一能量(Uni-E)方法,通过不变能量和独立能量解决模型容量、依赖性和不变性问题,无需采样即可精确计算,并能纠正分布偏移。

AI中文摘要

扩散语言模型(DLM)通过迭代去噪完整序列实现并行文本生成,与自回归(AR)解码相比具有吸引人的灵活性。然而,现有方法未能完全捕捉令牌关系,导致与AR基线存在性能差距,尤其是在并行度增加时。本文对该差距进行了系统分析,确定了三个关键因素:(i)模型容量、(ii)依赖性和(iii)不变性。为解决这些问题,我们首先提出不变能量(Inv-E)以及一个有效的基于采样的估计器来处理不变性问题。通过进一步与独立能量(Ind-E)结合,我们得到统一能量(Uni-E),它涵盖了所有这些因素。Uni-E具有独特优势:无需基于采样的分区估计即可精确计算。此外,Uni-E是模型无关的,因此可以扩展到任意大小的模型。我们进一步证明Uni-E可以纠正由依赖性和不变性引起的分布偏移。在扩散语言模型(DLM)和扩散大语言模型(DLLM)上的大量实验证明了所提出的Uni-E的有效性。

英文摘要

Diffusion Language Models (DLMs) enable parallel text generation by iteratively denoising a full sequence, offering attractive flexibility compared to auto-regressive (AR) decoding. However, existing methods fail to fully capture token relationships, leading to a performance gap relative to AR baselines, especially as the degree of parallelism increases. In this paper, we give a systematic analysis of the gap, identifying three key factors: (i) model capacity, (ii) dependency, and (iii) invariance. To address these issues, we first propose an invariant energy (Inv-E) together with an effective sampling-based estimator to handle the invariance issue. By further combining it with the independent energy (Ind-E), we obtain a unified energy (Uni-E) that accounts for all these factors. Uni-E enjoys a unique advantage: it can be computed exactly without sampling-based partition estimation. In addition, Uni-E is model-agnostic and can therefore be scaled to models of arbitrary size. We further prove that Uni-E can correct the distribution shift caused by dependency and invariance. Extensive experiments across Diffusion Language Models (DLMs) and Diffusion Large Language Models (DLLMs) demonstrate the effectiveness of the proposed Uni-E.

2606.09577 2026-06-09 cs.CL cs.LG cs.SE 新提交

Code Is More Than Text: Uncertainty Estimation for Code Generation

代码不仅仅是文本:代码生成的不确定性估计

Yuling Shi, Caiqi Zhang, Yuexian Li, Haopeng Wang, Yeheng Chen, Nigel Collier, Xiaodong Gu

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of Cambridge(剑桥大学)

AI总结 针对代码生成中错误程序的可靠性问题,提出基于词法、算法和功能三个正交轴的不确定性估计方法,在五个代码LLM上将AUROC提升8.1个百分点。

AI中文摘要

大型语言模型(LLMs)越来越多地被部署为代码生成器,其中静默错误的程序会带来真实的安全和可靠性风险。可靠的不确定性估计(UE)对于选择性预测、人在回路审查和下游智能体决策至关重要。然而,现有的大多数代码UE方法继承自自然语言(NL)生成,忽略了使代码独特的属性。我们认为代码在三个方面与NL不同:单个错误标记可能破坏整个程序(标记脆弱性);算法意图和具体实现可能独立不一致(意图-代码差距);程序可以被执行(可执行性)。我们将这些属性实例化为三个正交的不确定性轴:词法(Top-K标记熵)、算法(伪代码一致性)和功能(行为一致性)。在五个代码LLM上,我们的三轴集成将平均AUROC从最强NL衍生基线的0.696提高到0.776(+8.1点)。值得注意的是,在Qwen3-14B上,我们的单次Top-K标记熵匹配了最强多次基线,同时成本降低超过3倍;在各模型上,它仍然是一个有竞争力的低成本信号。这些结果表明,代码UE需要特定于代码的设计,而不是直接移植NL方法。

英文摘要

Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.
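
The lexical axis, Top-K token entropy, can be approximated from per-token probabilities as in the sketch below. It assumes the top-K probabilities are renormalized before computing entropy and that the sequence score is the mean over tokens; the abstract does not specify either choice, and the function names are invented for this example.

```python
import math


def top_k_entropy(probs, k=5):
    """Entropy (nats) of the renormalized top-k probabilities for one token."""
    top = sorted(probs, reverse=True)[:k]
    z = sum(top)
    return -sum((p / z) * math.log(p / z) for p in top if p > 0)


def sequence_uncertainty(per_token_probs, k=5):
    """Mean top-k entropy over all generated tokens (assumed aggregation)."""
    if not per_token_probs:
        raise ValueError("need at least one token")
    return sum(top_k_entropy(p, k) for p in per_token_probs) / len(per_token_probs)


# Toy distributions: one confident token and one maximally uncertain token.
score = sequence_uncertainty([[1.0, 0.0], [0.5, 0.5]])
```

Because it needs only the probabilities from a single generation pass, a signal like this is cheap compared with methods that sample several programs.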

2606.09603 2026-06-09 cs.CL 新提交

Automated IEP Generation from Traditional Chinese Parent-Teacher Interviews via Corpus-Grounded Feature Diffusion

基于语料库特征扩散的繁体中文家长会自动化个别化教育计划生成

Kuanlin Chen, Cheng-En Ou

发表机构 * National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 针对繁体中文个别化教育计划(IEP)生成中数据稀缺和隐私限制问题,提出基于语料库特征扩散(CGFD)的低资源微调流程,通过种子选择、特征扩散和语法约束解码(GCD)生成高质量样本,并发现GCD在繁体中文下适得其反,无GCD路径在可靠性和速度上更优。

Comments 12 pages, 5 figures

AI中文摘要

编写个别化教育计划(IEP)是一项高劳动强度、知识密集型的文档负担;英语研究表明,生成式AI可以显著减少起草时间,但由于领域数据稀缺、严格的隐私法规以及缺乏本地评估基准,繁体中文的自动化IEP生成几乎未被探索。我们提出了一种基于语料库特征扩散(CGFD)的低资源微调流程:(1)通过tau阈值和标志感知分数上限选择25个双专家高评分种子转录本;(2)从种子中提取特征画像(句子长度、结构、量化模板),并连同言语化采样风格的多样性控制注入LLM提示,以驱动扩散;(3)使用15个专家黄金种子作为扩散锚点,目标生成585个样本;获得567个有效扩散样本,形成582个样本的训练集,用于使用QLoRA微调Breeze-7B;(4)通过语法约束解码(GCD)在推理时强制执行分层SMART目标阶梯模式。在55个样本的模式压力集上的消融结果揭示了一个意外发现:在繁体中文令牌预算下,GCD适得其反——无GCD路径实现了100%的模式通过率,中位延迟降低34%,在可靠性和速度上均优于GCD。在n=10的正式保留集上,无GCD推理路径实现了BERTScore F1 = 0.779,超过了GPT-5.4(0.726)、DeepSeek-V3.2(0.703)、Gemini-3-Flash-Preview(0.703)和Llama-4-Maverick(0.700)的零样本基线,同时保持完全本地、气隙推理。该系统填补了繁体中文特殊教育NLP的空白,并在工业工程范式下提供了可扩展、保护隐私的本地推理解决方案。

英文摘要

Writing Individualized Education Programs (IEPs) is a high-labor, knowledge-intensive document burden; English-language research has demonstrated that generative AI can significantly reduce drafting time, yet automated IEP generation in Traditional Chinese remains virtually unexplored due to domain data scarcity, strict privacy regulations, and the absence of local evaluation benchmarks. We propose a low-resource fine-tuning pipeline centered on Corpus-Grounded Feature Diffusion (CGFD): (1) 25 dual-expert high-score seed transcripts are selected via a tau threshold with flag-aware score caps; (2) a FeatureProfile (sentence length, structure, quantification templates) is extracted from seeds and injected into LLM prompts alongside Verbalized-Sampling-style diversity control to drive diffusion; (3) 15 expert gold seeds are used as diffusion anchors, targeting 585 samples; 567 valid diffusion samples are obtained, yielding a 582-sample training set used to fine-tune Breeze-7B with QLoRA; (4) schema-constrained inference via Grammar-Constrained Decoding (GCD) enforces a hierarchical SMART Goal Ladder schema at inference time. Ablation results on a 55-sample schema stress set reveal an unexpected finding: GCD is counterproductive under Traditional Chinese token budgets -- the no-GCD path achieves 100% schema pass rate at 34% lower median latency, outperforming GCD on both reliability and speed. On the n=10 formal hold-out, the no-GCD inference path achieves BERTScore F1 = 0.779, exceeding GPT-5.4 (0.726), DeepSeek-V3.2 (0.703), Gemini-3-Flash-Preview (0.703), and Llama-4-Maverick (0.700) zero-shot baselines while maintaining fully local, air-gapped inference. This system addresses a gap in Traditional Chinese special-education NLP and offers a scalable, privacy-preserving local inference solution under an industrial engineering paradigm.

2606.09709 2026-06-09 cs.CL 新提交

IS-CoT: Breaking the Long-form Generation Collapse via Interleaved Structural Thinking

IS-CoT: 通过交错结构思维打破长文本生成崩溃

Zechen Sun, Yuyang Sun, Zecheng Tang, Juntao Li, Wenpeng Hu, Wenliang Chen, Zhunchen Luo, Guotong Geng, Min Zhang

发表机构 * Institute of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院) Information Research Center of Military Science, PLA Academy of Military Science(军事科学院军事科学信息研究中心)

AI总结 针对大语言模型在长文本生成中因静态层次规划导致长度崩溃的问题,提出交错结构思维链(IS-CoT)框架,通过动态规划-写作-反思循环实现持续策略调整,训练IS-Writer-8B模型在长文本基准上取得最优性能。

AI中文摘要

生成连贯且可控的长文本内容仍然是大语言模型(LLMs)面临的一个持久挑战。虽然推理增强模型在逻辑密集型领域已展现出成功,但我们的评估揭示,它们在开放式写作中遭受严重的长度崩溃,当目标长度超过2,000词时性能急剧下降。我们将这一失败归因于静态层次规划的局限性,它难以在扩展上下文中提供动态指导。为弥补这一差距,我们引入了交错结构思维链(IS-CoT)框架。与外部智能体工作流不同,IS-CoT将动态的规划-写作-反思循环嵌入生成过程,无需额外辅助即可实现持续策略调整和全局对齐。基于该框架,我们通过多教师管道构建了一个高质量的交错推理轨迹数据集,并训练了IS-Writer-8B。实验表明,IS-Writer-8B在具有挑战性的长文本基准上取得了最先进的性能(例如,在LongBench-Write上比DeepSeek-V3.2高出+3.08),展现出与显著更大的专有模型相竞争的长度合规性和连贯性。

英文摘要

Generating coherent and controllable long-form content remains a persistent challenge for Large Language Models (LLMs). While reasoning-enhanced models have demonstrated success in logic-intensive domains, our evaluation reveals that they suffer from a severe length collapse in open-ended writing, where performance degrades sharply as target lengths exceed 2,000 words. We attribute this failure to the limitation of static hierarchical planning, which struggles to provide dynamic guidance over extended contexts. To bridge this gap, we introduce the Interleaved Structural Chain-of-Thought (IS-CoT) framework. Unlike external agentic workflows, IS-CoT embeds a dynamic Plan-Write-Reflect cycle into the generation process, enabling continuous strategy adaptation and global alignment without additional assistance. Based on this framework, we construct a high-quality dataset of interleaved reasoning traces via a multi-teacher pipeline and train IS-Writer-8B. Experiments demonstrate that IS-Writer-8B achieves state-of-the-art performance on challenging long-form benchmarks (e.g., +3.08 vs. DeepSeek-V3.2 on LongBench-Write), exhibiting robust length compliance and coherence competitive with significantly larger proprietary models.

2504.00977 2026-06-09 cs.CL 版本更新

Chinese Grammatical Error Correction: A Survey

中文语法纠错:综述

Mengyang Qiu, Qingyu Gao, Linxuan Yang, Yang Gu, Tran Minh Nguyen, Zihao Huang, Jungyeul Park

发表机构 * KAIST(韩国科学技术院)

AI总结 本文综述中文语法纠错(CGEC)研究,涵盖数据集、标注方案、评估方法和系统进展,指出关键挑战并展望未来方向。

AI中文摘要

中文语法纠错(CGEC)是自然语言处理中的一项关键任务,旨在满足第二语言(L2)和母语(L1)中文写作中对自动写作辅助日益增长的需求。虽然L2学习者难以掌握复杂的语法结构,但在学术、专业和正式场合中,L1用户也能从CGEC中受益,因为这些场合对写作精度要求很高。本综述全面回顾了CGEC研究,涵盖数据集、标注方案、评估方法和系统进展。我们考察了广泛使用的CGEC数据集,突出了它们的特点、局限性以及对改进标准化的需求。我们还分析了错误标注框架,讨论了诸如分词歧义和中文特有错误类型分类等挑战。此外,我们回顾了评估指标,重点关注它们从英文GEC到中文的适应过程,包括字符级评分和多参考的使用。在系统开发方面,我们追溯了从基于规则和统计方法到神经架构的演变,包括基于Transformer的模型和大型预训练语言模型的集成。通过整合现有研究并识别关键挑战,本综述提供了对CGEC现状的见解,并概述了未来方向,包括完善标注标准以解决分词挑战,以及利用多语言方法增强CGEC。

英文摘要

Chinese Grammatical Error Correction (CGEC) is a critical task in Natural Language Processing, addressing the growing demand for automated writing assistance in both second-language (L2) and native (L1) Chinese writing. While L2 learners struggle with mastering complex grammatical structures, L1 users also benefit from CGEC in academic, professional, and formal contexts where writing precision is essential. This survey provides a comprehensive review of CGEC research, covering datasets, annotation schemes, evaluation methodologies, and system advancements. We examine widely used CGEC datasets, highlighting their characteristics, limitations, and the need for improved standardization. We also analyze error annotation frameworks, discussing challenges such as word segmentation ambiguity and the classification of Chinese-specific error types. Furthermore, we review evaluation metrics, focusing on their adaptation from English GEC to Chinese, including character-level scoring and the use of multiple references. In terms of system development, we trace the evolution from rule-based and statistical approaches to neural architectures, including Transformer-based models and the integration of large pre-trained language models. By consolidating existing research and identifying key challenges, this survey provides insights into the current state of CGEC and outlines future directions, including refining annotation standards to address segmentation challenges, and leveraging multilingual approaches to enhance CGEC.

2604.08479 2026-06-09 cs.CL 版本更新

AI generates well-liked but templatic empathic responses

AI生成受欢迎但模板化的共情回应

Emma S. Gueorguieva, Hongli Zhan, Jina Suh, Javier Hernandez, Tatiana Lau, Junyi Jessy Li, Desmond C. Ong

发表机构 * Department of Psychology, The University of Texas at Austin(心理学系,德克萨斯大学奥斯汀分校) Department of Linguistics, The University of Texas at Austin(语言学系,德克萨斯大学奥斯汀分校) Department of Computer Science and Engineering, The University of Washington(计算机科学与工程系,华盛顿大学) Microsoft Research(微软研究院) Toyota Research Institute(丰田研究院)

AI总结 研究发现LLM生成的共情回应高度模板化,采用10种共情语言策略,覆盖81-92%的回应内容,而人类写作则更多样。

AI中文摘要

最近的研究显示,越来越多的人转向大型语言模型(LLMs)寻求情感支持,并认为LLM的回应比人类写的更具共情性。我们提出原因:LLM学习并一致部署了一种受欢迎的共情模板。我们开发了10种共情语言“策略”分类,包括验证他人感受和复述,并将此分类应用于分析人类和LLM生成共情回应的语言。在两项研究中,比较了3265个AI生成(由六个模型生成)和1290个人类写作的回应,发现LLM回应在话语功能层面高度公式化。我们发现一个模板——一种策略序列——匹配83-90%的LLM回应(在持出样本中为60-83%),当匹配时覆盖81-92%的回应内容。相比之下,人类写作的回应更多样化。我们最后讨论了这对AI生成共情未来的影响。

英文摘要

Recent research shows that growing numbers of people are turning to Large Language Models (LLMs) for emotional support, and that people rate LLM responses as more empathic than human-written responses. We suggest a reason for this success: LLMs have learned and consistently deploy a well-liked template for expressing empathy. We develop a taxonomy of 10 empathic language "tactics" that include validating someone's feelings and paraphrasing, and apply this taxonomy to characterize the language that people and LLMs produce when writing empathic responses. Across two studies comparing a total of n = 3,265 AI-generated (by six models) and n = 1,290 human-written responses, we find that LLM responses are highly formulaic at a discourse-functional level. We discovered a template (a structured sequence of tactics) that matches 83-90% of LLM responses (and 60-83% in a held-out sample) and, when matched, covers 81-92% of the response. By contrast, human-written responses are more diverse. We end with a discussion of implications for the future of AI-generated empathy.

2605.14531 2026-06-09 cs.CL 版本更新

Language Generation as Optimal Control: Closed-Loop Diffusion in Latent Control Space

语言生成作为最优控制:潜在控制空间中的闭环扩散

ZiYi Dong, Yuliang Huang, Weijian Deng, Xiangyang Ji, Liang Lin, Pengxu Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文将语言生成重新表述为随机最优控制问题,通过统一理论视角分析自回归和扩散模型,解释其局限性,并提出基于流匹配的闭环控制器实现高效文本生成。

详情
AI中文摘要

本工作将语言生成重新表述为随机最优控制问题,提供统一的理论视角来分析自回归模型和扩散模型,并将它们的局限性(效率-保真度悖论、不可逆的误差传播、优化可处理性与保真度)归结为轨迹奇异性、伴随状态消失和梯度缺失的共同作用。为解决这些问题,我们近似求解哈密顿-雅可比-贝尔曼(HJB)方程,得到一个充当闭环控制器的最优策略。为绕开直接求解HJB偏微分方程的难解性,我们在校正后的潜在控制空间中采用流匹配作为最优轨迹求解器。这使得配备全局积分算子的Manta-LM能够近似全局向量场,从而得到一个同时实现高保真文本生成和高效、低成本并行采样的模型。实验表明,我们的方法在语言建模和条件生成任务上表现强劲,同时在稳定性、效率和可控性方面有所改进。

英文摘要

This work reformulates language generation as a stochastic optimal control problem, providing a unified theoretical perspective to analyze autoregressive and diffusion models and explain their limitations (Efficiency-Fidelity Paradox, Irreversibility Error Propagation, Optimization Tractability and Fidelity) in terms of a combination of trajectory singularity, adjoint state vanishing, and gradient absence. To address these issues, we approximate the solution to the Hamilton-Jacobi-Bellman (HJB) equation, yielding an optimal policy that acts as a closed-loop controller. To bypass the intractability of directly solving the HJB PDE, we employ Flow Matching as the optimal trajectory solver within the rectified latent control space. This allows our Manta-LM with Global Integral Operator to approximate the global vector field, effectively realizing a model that simultaneously achieves high-fidelity text generation and efficient, low-cost parallel sampling. Empirically, our method achieves strong performance on language modeling and conditional generation tasks, while exhibiting improved stability, efficiency, and controllability.

2606.01736 2026-06-09 cs.CL cs.AI 版本更新

Argument Collapse: LLMs Flatten Long-Form Public Debate

论点坍缩:LLMs 扁平化长篇公共辩论

Yekyung Kim, Yapei Chang, Chau Minh Pham, Mohit Iyyer

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 研究大型语言模型在生成公共辩论文本时出现的论点坍缩现象,即不同模型生成的文章在主要论点、子论点和段落结构上趋于收敛;与人类撰写的文本对比发现,LLM的论点多样性显著更低。

详情
AI中文摘要

随着LLM越来越多地被用于起草面向公众的论述,它们可能因反复引入相同的、经过打磨且看似合理的论点而使公共辩论趋于扁平。我们研究论点坍缩,即不同LLM生成的文章倾向于收敛到更小的主要论点、子论点和段落级结构集合。我们比较了来自195场《纽约时报》(NYT)辩论的1,039篇人类回应、来自61场篇幅更长的《波士顿评论》(BR)论坛的448篇人类回应,以及23,384篇LLM生成的文章。在《纽约时报》语料库中,65.3%的人类主要论点在所属辩论中是唯一的,而LLM主要论点中这一比例仅为3.4%。要求LLM生成多样化的回答会增加变化,但一个典型模型只能恢复约一半互不相同的人类主要论点,且新增的变化大多落在观察到的人类论点空间之外。坍缩同样出现在子论点层面:在主要论点相同的文章中,41.0%的人类子论点是唯一的,而LLM回应中这一比例为9.1%。从定性上看,LLM经常重复使用泛化且措辞留有余地的子论点,而人类更倾向于使用更具体、更贴合主题的子论点。在结构上,LLM生成的文章往往遵循更固定的行文脉络,通常以直接的主张开头,并很快转向提出建议。同样的模式在篇幅更长的《波士顿评论》文章中也成立,表明论点坍缩并不限于短篇回应。

英文摘要

As LLMs are increasingly used to draft public-facing arguments, they may flatten public debate by repeatedly introducing the same polished, plausible arguments. We study argument collapse, the tendency of essays generated by different LLMs to converge to a smaller set of main arguments, sub-arguments, and paragraph-level structures. We compare 1,039 human responses from 195 New York Times (NYT) debates, 448 human responses from 61 longer-form Boston Review (BR) forums, and 23,384 LLM-generated essays. In the NYT corpus, 65.3% of human main arguments are unique within a debate, compared to 3.4% of LLM main arguments. Asking LLMs to generate diverse answers adds variation, but a typical model recovers only about half of the distinct human main arguments, with much of the added variation falling outside the observed human argument space. Collapse also appears in sub-arguments, where among essays with the same main argument, 41.0% of human sub-arguments are unique versus 9.1% from LLM responses. Qualitatively, LLMs often reuse generalized and hedged sub-arguments, while humans prefer more concrete and topic-specific ones. Structure-wise, LLM-generated essays tend to follow a more fixed arc, often opening with a direct claim and moving quickly toward proposals. The same patterns hold in longer BR essays, suggesting that argument collapse extends beyond short-form responses.

6. 语义、语法与语言学分析 10 篇

2606.07522 2026-06-09 cs.CL cs.LG cs.SI 新提交

Community-Specific Slang and Entity Detection via Semantic Shift in Fine-Tuned Language Models

通过微调语言模型中的语义偏移检测社区特定俚语和实体

Julia Kruk, Sanchita Porwal, Amitrajit Bhattacharjee, Mansi Phute

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出无监督方法,通过测量词在微调前后的语义偏移幅度,从在线社区文本中自动识别俚语、独特实体和民俗用语。

Comments 6 pages, 6 figures, 2 tables

详情
AI中文摘要

我们提出一种无监督方法,通过隔离词汇中具有最大语义偏移幅度的词,来解析来自在线社区的俚语、独特实体和民俗用语。语义偏移定义为在社区特定文本语料上微调预训练大语言模型(LLM)后,词编码表示的演化。该值与基础模型和微调模型对词的编码表示之间的余弦相似度成反比。我们在从3个Reddit子版块(r/Technology、r/Gaming、r/WorldofWarcraft)收集的文本语料上微调DistilRoBERTa模型,对词汇上的余弦相似度分布进行建模,并表明通过提取底部10百分位的数据,可以成功解析对社区具有独特意义的词。相反,我们表明顶部10百分位的数据由具有相对普遍语义的词组成。

英文摘要

We propose an unsupervised method of resolving slang, unique entities, and folklore from online communities by isolating words in the lexicon that have the highest magnitude of semantic shift. Semantic shift is defined as the evolution of a word's encoded representation as a result of fine-tuning a pretrained Large Language Model (LLM) on a community-specific text corpus. This value is inversely proportional to the cosine similarity between the base model's encoded representation of a word, and a fine-tuned model's encoded representation. We fine-tune the DistilRoBERTa model on text corpora collected from 3 Reddit subreddits (r/Technology, r/Gaming, r/WorldofWarcraft), model a distribution of cosine similarity over the lexicon, and show that one can successfully resolve words that have unique significance to the community by pulling data in the bottom 10-percentile. In contrast, we show that data in the top 10-percentile consist of words that carry relatively universal semantics.
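The selection step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the toy 2-d vectors stand in for base and fine-tuned DistilRoBERTa word representations, and the helper names `cosine_similarity` and `lowest_similarity_words` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def lowest_similarity_words(base, tuned, percentile=10):
    """Return the words in the lowest `percentile` percent of base-vs-tuned similarity.

    Low similarity is read as high semantic shift, following the abstract.
    """
    sims = {w: cosine_similarity(base[w], tuned[w]) for w in base}
    ranked = sorted(sims, key=sims.get)  # most-shifted words first
    k = max(1, math.ceil(len(ranked) * percentile / 100))
    return ranked[:k], sims

# Toy lexicon (assumed values, for illustration only).
words = ["the", "and", "game", "play", "team", "win", "level", "map", "raid", "tank"]
base = {w: [1.0, 0.0] for w in words}
tuned = {w: [1.0, 0.0] for w in words}
tuned["raid"] = [1.0, 1.0]  # moderate shift after fine-tuning
tuned["tank"] = [0.0, 1.0]  # large shift after fine-tuning

shifted, sims = lowest_similarity_words(base, tuned)
print(shifted)  # ['tank']
```

With ten words, the bottom 10th percentile holds a single entry, so only the most-shifted word is returned; a real lexicon would yield a longer candidate list for manual inspection.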

2606.07525 2026-06-09 cs.CL cs.AI 新提交

Implicit Causal Graph Construction in Text via Chain Discovery

通过链发现实现文本中的隐式因果图构建

Liesbeth Allein, Marie-Francine Moens

发表机构 * KU Leuven(鲁汶大学) Ghent University(根特大学)

AI总结 研究利用大语言模型从文本因果对中推断中间事件以构建隐式因果图,比较端到端构建与因果链发现方法,并探索多模型集成策略,基于1560个科学验证因果对评估。

详情
AI中文摘要

文本中的因果图通常由可观察的、预定义的事件构成。与此不同,我们研究从文本中构建隐式因果图:将文本中描述的每个因果对视为一个潜在因果图的起点和终点,并使用大型语言模型(LLM)推断中间的因果事件。我们将端到端图构建与把该任务视为因果链发现的方法进行了比较。在后一类方法中,图要么通过聚合推断出的链来构建,要么通过迭代搜索过程逐步扩展部分链来构建。我们进一步探索了“群体智慧”扩展,即在事后聚合和协作推理两种设置下利用多个LLM的因果知识。我们分析了这些方法之间的权衡,并使用一个人工整理的、包含1,560个经过科学验证的因果对的数据库来评估推断出的因果关系的有效性。我们提出这种基于数据库的评估方式,认为它可靠、资源高效,并且可迁移到无法获得真实图的场景。

英文摘要

Causal graphs in text are typically populated by observable, predefined events. In contrast, we study implicit causal graph construction from text by treating each described cause-effect pair as the begin- and endpoint of an underlying latent causal graph and using large language models (LLMs) to infer intermediate causal events. We compare end-to-end graph construction with methods that frame the task as causal chain discovery. In the latter, graphs are built either by aggregating inferred chains or by progressively expanding partial chains through an iterative search process. We further explore Wisdom of the Crowd extensions that access causal knowledge from multiple LLMs in post-hoc aggregation and collaborative inference settings. We analyze trade-offs among these approaches and evaluate the validity of inferred causal relations using a manually curated database of 1,560 scientifically validated causal pairs. This database-based evaluation is proposed as reliable, resource-efficient, and transferable to settings where ground-truth graphs are unavailable.

2606.07066 2026-06-09 cs.CL 新提交

Modeling semantic association in self-paced reading with language model embeddings

使用语言模型嵌入建模自定步速阅读中的语义关联

Sara Møller Østergaard, Kenneth Enevoldsen, Afra Alishahi, Bruno Nicenboim

发表机构 * Department of Computational Cognitive Science, Tilburg University(蒂尔堡大学计算认知科学系) Center for Humanities Computing, Aarhus University(奥胡斯大学人文计算中心)

AI总结 本研究使用语言模型嵌入的十种实现方式量化语义关联,通过贝叶斯模型分析其对N400和自定步速阅读时间的影响,发现句子嵌入能可靠捕捉超出词可预测性的语义关联。

详情
AI中文摘要

词语与其上下文之间的语义关联已被认为是阅读理解的重要组成部分,即使考虑了词的可预测性。最近的研究强调了语言模型(LM)嵌入在量化语义关联方面的潜力。然而,基于嵌入的语义关联已有多种操作化方式。在本研究中,我们使用LM嵌入来估计联合脑电图(EEG)和自然荷兰语文本自定步速阅读语料库上的语义关联。语义关联通过十种不同的实现方式计算,这些方式在嵌入模型和上下文长度上有所不同。使用贝叶斯层次模型和贝叶斯因子检验了不同实现方式下语义关联对N400和自定步速阅读时间的影响。结果表明,嵌入模型的选择可以改变语义关联对N400和自定步速阅读时间的估计效应。此外,结果显示了句子嵌入在捕捉语义关联方面的潜力,因为只有依赖句子嵌入的实现方式在神经和行为测量上都显示出超出词可预测性的可靠语义关联结果。总之,这些发现强调了在量化语义关联时方法论选择的重要性。

英文摘要

Semantic association between a word and its context has been identified as an important component of reading comprehension, even when word predictability is accounted for. Recent research has highlighted the potential of language model (LM) embeddings to quantify semantic association. Yet embedding-based semantic association has been operationalized in a myriad of ways. In this study, we use embeddings from LMs to estimate semantic association on a corpus of joint electroencephalography (EEG) and self-paced reading of natural Dutch texts. Semantic association is calculated in ten different implementations that vary the embedding model and context length. The effects of semantic association across the different implementations on the N400 and self-paced reading times are examined using Bayesian hierarchical models and Bayes factors. The results show that the choice of embedding model can alter the estimated effect of semantic association on both the N400 and self-paced reading times. Furthermore, the results demonstrate a promising potential of sentence embeddings for capturing semantic association, as only implementations relying on sentence embeddings indicate reliable results of semantic association beyond word predictability on both neural and behavioral measures. Together, these findings highlight the importance of methodological choices in quantifying semantic association.

2606.08236 2026-06-09 cs.CL cs.LG 新提交

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

共享语义,不同机制:通过对齐语义与机制的无监督特征发现

Hyunjin Cho, Youngji Roh, Jaehyung Kim

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种无监督方法,通过语义嵌入和归因签名聚类模型续写,发现隐藏的机制模式,补充电路分析。

Comments 40 pages

详情
Journal ref
ICML 2026 Spotlight
AI中文摘要

随着大型语言模型越来越多地部署在高风险场景中,人们日益需要这样的工具:不仅审计模型输出,还审计产生这些输出的内部计算。电路分析是机制可解释性中的核心方法,但它通常是目标条件化的,即解释单个提示与一个选定补全的配对。这种目标条件化的设置可能掩盖模型续写分布中的异质性。我们提出分布级无监督特征发现,该方法同时利用语义内容和序列级机制归因对采样得到的续写进行聚类,而无需手动指定目标输出。我们的方法用语义嵌入和前缀到续写的归因签名来表示每个续写,然后优化一个率失真目标,在语义连贯性、机制一致性和聚类粒度之间进行权衡。在聚类和引导(steering)分析中,所发现的聚类揭示了单视图基线遗漏的续写模式,并提供了干预性证据,表明聚类签名对应于可操作的机制因素。总体而言,我们的方法为模型续写分布背后的机制提供了可扩展的审计,从而对电路分析和行为评估形成补充。

英文摘要

As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup can obscure heterogeneity across a model's continuation distribution. We introduce distribution-level unsupervised feature discovery, which clusters sampled continuations using both semantic content and sequence-level mechanistic attributions, without manually specifying target outputs. Our method represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective that trades off semantic coherence, mechanistic consistency, and cluster granularity. Across clustering and steering analyses, the discovered clusters expose continuation modes that single-view baselines miss and provide interventional evidence that cluster signatures correspond to actionable mechanistic factors. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable audit of the mechanisms underlying a model's continuation distribution.

2606.08562 2026-06-09 cs.CL 新提交

Inside the LLM Word Factory

LLM单词工厂内部

Benzi Busigin, Yuval Pinter

发表机构 * Stein Faculty of Computer and Information Science(Stein计算机与信息科学学院)

AI总结 通过激活修补实验,将Llama2-7B中的英语去分词化过程定位为第1层的两阶段机制:注意力传递来自非最终子词的令牌特定信号,MLP将其与局部嵌入组合。该结构可推广到来自八个系列的十二个模型,但其发生的深度取决于位置编码类型。

Comments 17 pages, 12 figures. Under review at EMNLP 2026

详情
AI中文摘要

Transformer语言模型处理的输入以子词片段的形式提供,但自然语言语义通常依赖于词级概念。去分词化是模型调和这两个事实的过程,即在计算过程中将子词聚合成词级表示。先前的工作发现这一过程主要发生在早期到中间层,但其确切机制迄今尚未明确。我们在受控的配对实验中使用激活修补来深入探究去分词化,分离不同模型组件的贡献,并将Llama2-7B中的英语去分词化定位为第1层的一个两阶段过程。注意力从非最终子词传递令牌特定的信号,必要时使用顺序中继,而MLP将该信号与局部嵌入组合。这种两阶段结构可推广到来自八个系列的十二个模型,但其发生的深度范围取决于位置编码的类型:基于RoPE的模型在1到5层内完成去分词化,而使用可学习绝对位置编码的模型需要5到10层。最后,我们提供了一个仅基于早期层激活来判断去分词化是否成功的探针,其AUROC根据上下文量的不同在0.94到0.97之间。

英文摘要

Transformer language models process input provided as subword fragments, but natural language semantics usually rely on word-level concepts. Detokenization is the process where models reconcile these two facts, aggregating subwords into word-level representations through their computation. Prior work has found that this takes place mostly in early-to-middle layers, but so far the exact mechanics of the process have not been pinned down. We venture deep into detokenization using activation patching in controlled paired experiments that isolate the contribution of different model components, localizing English detokenization in Llama2-7B to a two-stage process at Layer 1. Attention transmits a token-specific signal from nonfinal subwords, using sequential relays if necessary, while the MLP composes it with the local embedding. This two-stage structure generalizes to twelve models from eight families, but the depth over which it takes place depends on the flavor of positional encoding: RoPE-based models detokenize over 1 to 5 layers, while learned-absolute models take 5 to 10. Finally, we provide a probe for determining the success of the detokenization process based on early-layer activations alone, performing at 0.94-0.97 AUROC depending on the amount of context.

2606.09403 2026-06-09 cs.CL 新提交

Introducing multiplex semantic networks as multifaceted representations of creative associative knowledge across multilingual samples

引入多重语义网络作为跨语言样本中创造性联想知识的多面表示

Edith Haim, Kurt Haim, Roger E. Beaty, Cynthia S. Q. Siew, Massimo Stella

发表机构 * CogNosco Lab, Department of Psychology and Cognitive Science, University of Trento(CogNosco实验室,心理学与认知科学系,特伦托大学) Department of Science Education, University of Education Upper Austria(科学教育系,上奥地利教育大学) Department of Psychology, The Pennsylvania State University(心理学系,宾夕法尼亚州立大学) Department of Psychology, National University of Singapore(心理学系,新加坡国立大学)

AI总结 本研究通过从六种认知任务构建的多重语义网络,更全面地建模创造力背后的联想知识,并利用机器学习预测个体创造力得分,证明多重网络比单一任务更有效。

详情
AI中文摘要

创造力是一种复杂的认知能力,依赖于语义记忆中的知识组织和检索。然而,大多数研究使用单一任务来测量它,仅捕捉了这种复杂性的一小部分。本研究调查了多重网络——从六种认知任务中获得的层次化语义网络——作为建模创造力背后联想知识的更全面方法。我们收集了来自四个国家(奥地利、美国、新加坡、意大利)的N=518名个体的数据。根据他们对言语流畅性、句子链、自由联想和叙事写作任务的回答,我们构建了语义网络并将其组装成多重结构。基于AI角色的响应提供了比较基线。结构可约性分析表明,不同的任务层捕捉了关于语义组织的不同、非冗余信息,支持使用多个任务而非任何单一任务。高创造力和低创造力组的网络在结构上保持不同,而AI生成的网络无论创造力组如何都显示出几乎相同的结构。最后,我们在使用岭回归的机器学习模型中使用了12个特征(网络度量、情感分数和扩散激活模拟)来预测个体创造力得分。在前一阶段识别出的结构相似层的组合,将概念验证预测准确性提高了50%。结构度量显示出最高的特征重要性,扩散激活动力学提供了额外的预测能力。总之,这些发现表明多重语义网络捕捉了创造力背后联想知识的更丰富的跨文化图景。我们还发布了我们的多样化数据集和代码,以促进创造力社区内的多样化计算方法。

英文摘要

Creativity is a complex cognitive ability that relies on knowledge organisation and retrieval from semantic memory. Yet most research uses a single task to measure it, capturing only a fraction of this complexity. This study investigates multiplex networks - layered semantic networks obtained from six cognitive tasks - as a more comprehensive approach to modelling the associative knowledge underlying creativity. We collected data from N=518 individuals from four countries (Austria, USA, Singapore, Italy). From their responses to verbal fluency, sentence-chain, free association, and narrative writing tasks, we constructed semantic networks and assembled them in a multiplex structure. AI persona-based responses provided a comparison baseline. Structural reducibility analyses showed that different task layers captured distinct, non-redundant information about semantic organisation, supporting the use of multiple tasks over any single one. The networks from high- and low-creative groups remained structurally distinct, while AI-generated networks showed near-identical structures regardless of creativity group. Finally, we used 12 features (network measures, emotional scores, and spreading activation simulations) in a machine learning model using ridge regression to predict individual creativity scores. The combination of structurally similar layers, as identified in the previous stage, improved a proof-of-concept prediction accuracy by 50%. Structural measures showed the highest feature importance, with spreading activation dynamics providing additional predictive power. Together, these findings indicate that multiplex semantic networks capture a richer, cross-cultural picture of associative knowledge underlying creativity. We also release our diverse dataset and code to foster diverse computational approaches within the creativity community.

2606.09484 2026-06-09 cs.CL 新提交

Detecting Differences Is Not Understanding Structure: Large Language Models Fail at Graph Isomorphism

检测差异不等于理解结构:大型语言模型在图同构任务中失败

Kumar Thushalika, Sukumar Kishanthan, Asela Hevapathige

发表机构 * University of Ruhuna(鲁胡纳大学) University of Moratuwa(莫拉图瓦大学) University of Melbourne(墨尔本大学)

AI总结 本研究通过图同构检测任务揭示LLM的“虚假成功”:虽然LLM在检测同构时准确率接近完美,但面对节点标签置换的相同图时却无法识别,表明其依赖模式而非抽象结构推理。

详情
AI中文摘要

大型语言模型(LLM)在各种推理任务上表现出色,但它们在图结构推理方面的能力仍不明确。我们研究了LLM是否能真正理解图同构——图论中的一个基本问题。尽管LLM在同构检测上达到了近乎完美的准确率,但我们证明这种性能是虚幻的。当相同的图以置换后的节点标签呈现时,LLM无法识别它们的同构性。这一发现表明,LLM利用的是模式而非对抽象图结构的推理。由于置换不变性是有效结构推理的基本要求,这些结果表明,在图推理基准上的成功不应被解释为真正拓扑理解的证据。

英文摘要

Large language models (LLMs) have shown impressive performance on diverse reasoning tasks, yet their capacity for structural reasoning in graphs remains unclear. We investigate whether LLMs can genuinely understand graph isomorphism, a fundamental problem in graph theory. While LLMs achieve near-perfect accuracy on isomorphism detection, we show this performance is illusory. When identical graphs are presented with permuted node labels, LLMs fail to identify their isomorphism. This finding suggests that LLMs exploit patterns rather than reasoning about abstract graph structure. Since permutation invariance is a fundamental requirement for valid structural reasoning, these results indicate that success on graph reasoning benchmarks should not be interpreted as evidence of genuine topological understanding.
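The permutation-invariance requirement mentioned in this abstract can be illustrated with a small sketch. This is not the paper's evaluation code: the helper names `relabel` and `is_isomorphic` are hypothetical, the graphs are toy examples, and the brute-force search over all relabelings is only practical for very small graphs.

```python
import itertools
import random

def relabel(edges, mapping):
    """Apply a node relabeling to an undirected edge list (no self-loops assumed)."""
    return {frozenset((mapping[u], mapping[v])) for u, v in edges}

def is_isomorphic(nodes, edges_a, edges_b):
    """Brute-force test: does some relabeling of graph A give exactly graph B?"""
    target = {frozenset(e) for e in edges_b}
    if len({frozenset(e) for e in edges_a}) != len(target):
        return False
    for perm in itertools.permutations(nodes):
        if relabel(edges_a, dict(zip(nodes, perm))) == target:
            return True
    return False

nodes = [0, 1, 2, 3]
path = [(0, 1), (1, 2), (2, 3)]  # path graph on four nodes
star = [(0, 1), (0, 2), (0, 3)]  # star graph with the same number of edges

# The same path graph, presented with randomly permuted node labels.
rng = random.Random(0)
shuffled = nodes[:]
rng.shuffle(shuffled)
permuted_path = relabel(path, dict(zip(nodes, shuffled)))

print(is_isomorphic(nodes, path, permuted_path))  # True
print(is_isomorphic(nodes, path, star))           # False
```

A structural reasoner should give the same answer for `path` and `permuted_path`, because relabeling nodes does not change the graph; the star graph shows that matching edge counts alone is not enough.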

2312.15321 2026-06-09 cs.CL 版本更新

Greedy Grammar Induction with Indirect Negative Evidence

带间接负面证据的贪婪语法归纳

Joseph Potashnik

发表机构 * London, United Kingdom(伦敦,英国)

AI总结 提出一种非词汇化语法归纳程序,通过规则覆盖界和间接负面证据区分观察呈现与假设生成串,给出贪婪搜索算法并证明条件弱恢复定理。

Comments 29 pages (including appendices and references)

详情
AI中文摘要

本文提出一种非词汇化语法归纳程序,该程序将两项测试分开:识别观察到的有限呈现,以及拒绝由假设生成但无证据支持的短前终结符串。核心对象是规则覆盖界 \(\ell^*(G)\):对 \(G\) 中的每条规则,取推导过程使用该规则的最短前终结符串的长度,再对所有规则取最大值。该界诱导出比较宇宙 \(\Sigma_{\mathrm{pre}}^{\le \ell^*(G)}\),其中无证据支持的生成串充当反对过度生成假设的间接证据。我们给出一个在规则集上的贪婪搜索算法,并证明一个条件弱恢复定理:在显式可达性条件成立且呈现充分饱和的情况下,精确学习器会得到一个与未知目标弱等价的语法。复杂度分析是分片进行的:对于每个固定的增量半径 \(k\),搜索在有限规则宇宙中探索多项式数量的规则集扩展。在涵盖 Dyck-\(k\) 语言 \((1\le k\le4)\)、回文、\(a^n b^n\)、类英语递归片段以及一个固有歧义的并集语言的 31 次基准运行中,语法级分析确立了每个返回语法与其目标之间的弱等价性。

英文摘要

This paper proposes a non-lexicalized grammar-induction procedure that separates two tests: recognition of the observed finite presentation, and rejection of short preterminal strings generated by a hypothesis but unsupported by the evidence. The central object is the rule-coverage bound \(\ell^*(G)\): the maximum, over rules in \(G\), of the length of the shortest preterminal string whose derivation uses that rule. This bound induces the comparison universe \(\Sigma_{\mathrm{pre}}^{\le \ell^*(G)}\), where unsupported generated strings serve as indirect evidence against overgenerating hypotheses. We give a greedy search algorithm over rule sets and prove a conditional weak-recovery theorem: under explicit reachability conditions and sufficient saturation of the presentation, the exact learner reaches a grammar weakly equivalent to the unknown target. The complexity analysis is slice-wise: for each fixed incrementality radius \(k\), the search explores polynomially many rule-set extensions in the finite rule universe. Across 31 benchmark runs spanning Dyck-\(k\) languages \((1\le k\le4)\), palindromes, \(a^n b^n\), English-like recursive fragments, and an inherently ambiguous union language, grammar-level analysis establishes weak equivalence between every returned grammar and its target.
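Read literally, the rule-coverage bound described in this abstract can be written as follows. This is only a restatement under assumed notation: \(R(G)\) for the rule set of \(G\) and \(D_G(w)\) for the set of derivations of \(w\) in \(G\) are hypothetical symbols introduced here and may differ from the paper's own.

\[
\ell^*(G) \;=\; \max_{r \in R(G)} \; \min \bigl\{\, |w| \;:\; w \in \Sigma_{\mathrm{pre}}^{*},\ \exists\, d \in D_G(w) \text{ such that } d \text{ uses } r \,\bigr\}
\]

Under this reading, the comparison universe \(\Sigma_{\mathrm{pre}}^{\le \ell^*(G)}\) is the set of preterminal strings of length at most \(\ell^*(G)\). Provided every rule occurs in some derivation, each rule of \(G\) is then exercised by at least one string in that universe.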

2510.26253 2026-06-09 cs.CL 版本更新

Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs

语用理论增强对LLM中隐含意义的理解

Takuma Sato, Seiya Kawano, Koichiro Yoshino

发表机构 * Nara Institute of Science and Technology(奈良先端科学技术大学院大学) RIKEN(理化学研究所) Guardian Robot Project(守护机器人项目) Kyoto Institute of Technology(京都工艺纤维大学) Institute of Science Tokyo(东京科学大学)

AI总结 本研究提出以语用理论(如Grice语用学和关联理论)作为提示,引导语言模型逐步推理,在隐含意义理解任务上相比0-shot思维链提升高达9.6%的得分。

Comments Correction of minor typographical errors in the references

详情
AI中文摘要

准确解读隐含意义的能力在人类交流和语言使用中起着关键作用,语言模型也被期望具备这种能力。本研究表明,向语言模型提供语用理论作为提示是一种有效的上下文学习方法,用于理解隐含意义的任务。具体来说,我们提出了一种方法,将语用理论(如Grice语用学和关联理论)的概述作为提示呈现给语言模型,引导其通过逐步推理过程得出最终解释。实验结果显示,与基线(即不呈现语用理论而提示中间推理的0-shot思维链)相比,我们的方法使语言模型在语用推理任务上得分最高提升9.6%。此外,我们表明,即使不解释语用理论的细节,仅在提示中提及它们的名称,也能使较大模型相比基线获得一定的性能提升(约1-3%)。

英文摘要

The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks that require understanding implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.

2601.10918 2026-06-09 cs.CL 版本更新

Neural Induction of Finite-State Transducers

有限状态转换器的神经归纳

Michael Ginn, Alexis Palmer, Mans Hulden

发表机构 * University of Colorado(科罗拉多大学) New College of Florida(佛罗里达新学院)

AI总结 提出根据循环神经网络学到的隐藏状态几何自动构建无权重有限状态转换器的方法,在形态屈折、字素到音素预测等任务上,留出测试集准确率比经典转换器学习算法最多高出87%。

Comments 15 pages, 8 figures, accepted to ACL 2026 Findings

详情
AI中文摘要

有限状态转换器(FST)是字符串到字符串重写任务的有效模型,通常能提供高性能应用所需的效率,但手动构建转换器很困难。在这项工作中,我们提出了一种新方法,根据循环神经网络学到的隐藏状态几何自动构建无权重FST。我们在形态屈折、字素到音素预测和历史文本规范化的真实数据集上评估了我们的方法,表明构建出的FST在许多数据集上具有高准确性和鲁棒性,在留出测试集上的准确率比经典转换器学习算法最多高出87%。

英文摘要

Finite-State Transducers (FSTs) are effective models for string-to-string rewriting tasks, often providing the efficiency necessary for high-performance applications, but constructing transducers by hand is difficult. In this work, we propose a novel method for automatically constructing unweighted FSTs following the hidden state geometry learned by a recurrent neural network. We evaluate our methods on real-world datasets for morphological inflection, grapheme-to-phoneme prediction, and historical normalization, showing that the constructed FSTs are highly accurate and robust for many datasets, substantially outperforming classical transducer learning algorithms by up to 87% accuracy on held-out test sets.

7. 多模态语言处理 29 篇

2606.07529 2026-06-09 cs.CL cs.AI cs.CV cs.LG cs.MM 新提交

CAPruner: Conceptual-Adjacent Scene Graph Pruner for Enhancing 3D Spatial Reasoning of Large Language Models

CAPruner: 概念相邻场景图剪枝器以增强大语言模型的3D空间推理

Shengli Zhou, Xiangchen Wang, Guanhua Chen, Feng Zheng

发表机构 * Southern University of Science and Technology(南方科技大学) SpatialTemporal AI(时空人工智能)

AI总结 提出概念相邻场景图剪枝器(CAPruner),通过融合模糊语义相关性和空间邻近性估计关系重要性,在任务特定上下文中选择关键关系,避免关系级标注,显著提升大语言模型在3D视觉语言任务上的空间推理性能。

Comments Accepted by ACL 2026 Main Conference

详情
AI中文摘要

大型语言模型(LLMs)最近被应用于3D视觉语言(3D-VL)任务,这些任务需要空间推理以识别相对于锚点的目标物体。场景图通常用于表示此类关系,但在完整图上进行推理会导致高昂的令牌成本和计算效率低下,因此需要剪枝。现有的剪枝方法主要依赖空间邻近性,常常移除任务相关的关系,从而削弱可靠的空间推理。为了解决这些局限性,我们推导出场景图剪枝的一个关键要求:保留与特定3D-VL任务最相关的空间关系。在此洞察指导下,我们提出了概念相邻场景图剪枝器(CAPruner)。CAPruner将模糊语义相关性与空间邻近性相结合,以估计关系的重要性,从而能够在任务特定上下文中选择关键关系。此外,为了避免昂贵的关系级标注,CAPruner通过监督每个节点入射边的聚合分数进行训练。大量实验表明,CAPruner有效保留了空间推理所必需的关系,从而显著提升了LLMs在3D-VL任务上的性能。代码可在 https://github.com/fz-zsl/CAPruner 获取。

英文摘要

Large language models (LLMs) have recently been applied to 3D vision-language (3D-VL) tasks, which require spatial reasoning to identify target objects relative to anchors. Scene graphs are commonly employed to represent such relations, but reasoning over complete graphs incurs high token costs and computational inefficiencies, motivating the need for pruning. Existing pruning methods primarily rely on spatial proximity and often remove task-relevant relations, thereby undermining reliable spatial reasoning. To address these limitations, we derive a key requirement for scene graph pruning: preserving spatial relations that are most pertinent to the specific 3D-VL task. Guided by this insight, we propose the Conceptual-Adjacent Scene Graph Pruner (CAPruner). CAPruner integrates fuzzy semantic relevance with spatial proximity to estimate the importance of relations, enabling the selection of critical relations in a task-specific context. Moreover, to avoid costly relation-level annotations, CAPruner is trained by supervising the aggregated scores of each node's incident edges. Extensive experiments demonstrate that CAPruner effectively preserves relations essential for spatial reasoning, leading to substantial performance improvements of LLMs on 3D-VL tasks. Code is available at https://github.com/fz-zsl/CAPruner.

2606.07531 2026-06-09 cs.CL cs.AI 新提交

mllm-shap: A Shapley Value Explainability Platform for Text-Audio Multimodal Large Language Models

mllm-shap:面向文本-音频多模态大语言模型的Shapley值可解释性平台

Jakub Muszyński, Paweł Pozorski, Maria Ganzha

发表机构 * Warsaw University of Technology(华沙理工大学)

AI总结 提出mllm-shap框架,通过模态感知掩码、多轮对话追踪和基于音素对齐的令牌分组,将Shapley值可解释性扩展到文本-音频多模态大语言模型,其中令牌分组可将联盟空间缩小10到50倍。

Comments Submitted to ACL2026

详情
AI中文摘要

我们介绍了mllm-shap,一个开源Python框架,旨在将Shapley值(SV)可解释性从纯文本大语言模型扩展到处理联合文本和音频输入的多模态大语言模型(MLLM)。虽然基于文本的归因已得到充分研究,但mllm-shap解决了多模态领域特有的三个关键挑战:(1)模态感知的联盟掩码,管理离散文本令牌和密集音频编码器帧的交错处理。(2)多轮对话追踪,利用每令牌元数据维护角色和模态上下文。(3)基于音素对齐的令牌分组,一种新颖的技术,将联盟空间减少10到50倍,使得长音频的SV估计在计算上可行。该平台实现了五种SV估计策略,包括具有Neyman最优分配的互补贡献(CC)估计器,其收敛性优于标准蒙特卡洛基线。mllm-shap作为pip可安装包提供,并具有交互式基于Web的GUI,用于细粒度归因可视化。据我们所知,这是第一个公开可用的框架,为文本-音频MLLM中的基于SV的可解释性提供完整、可复现的流水线。

英文摘要

We introduce mllm-shap, an open-source Python framework designed to extend Shapley Value (SV) explainability from text-only Large Language Models to Multimodal LLMs (MLLMs) processing joint text and audio inputs. While text-based attribution is well-studied, mllm-shap addresses three critical challenges unique to the multimodal regime: (1) Modality-aware coalition masking, which manages the interleaved processing of discrete text tokens and dense audio encoder frames. (2) Multi-turn conversation tracking, utilizing per-token metadata to maintain role and modality context. (3) Phonetic alignment-based token grouping, a novel technique that reduces the coalition space by 10x to 50x, rendering SV estimation computationally feasible for long-form audio. The platform implements five SV estimation strategies, including a Complementary Contributions (CC) estimator with Neyman-optimal allocation that demonstrates superior convergence over standard Monte Carlo baselines. mllm-shap is provided as a pip-installable package featuring an interactive web-based GUI for granular attribution visualization. To our knowledge, this is the first publicly available framework providing a complete, reproducible pipeline for SV-based explainability in text-audio MLLMs.

2606.07533 2026-06-09 cs.CL cs.AI cs.SD 新提交

Bridging Traditional Explainability Methods and Multimodal Multilingual Models: An XAI-Based Analysis

桥接传统可解释性方法与多模态多语言模型:基于XAI的分析

Paweł Pozorski, Jakub Muszyński, Maria Ganzha

发表机构 * arXiv

AI总结 提出多模态Shapley值框架,结合频谱图引导的音素对齐(SGPA)预处理方法,实现文本与音频特征的可解释性归因,并开源计算包与可视化工具。

Comments Bachelor's thesis

详情
AI中文摘要

多模态大语言模型(MLLMs)有效整合文本和音频以理解复杂交互对话中的上下文。然而,异质模态影响模型行为的内部机制仍然不透明。虽然Shapley值(SV)为基于文本的NLP提供了鲁棒的、模型无关的局部可解释性框架,但其扩展到多模态数据受到跨通道依赖、复杂对话结构以及密集音频表示的高计算复杂性的阻碍。在这项工作中,我们形式化了Shapley值框架的多模态扩展,将离散文本标记和对齐的音频片段视为协作特征。为确保计算可行性,我们部署了一套高效的估计策略:低维输入的精确SV计算和基于采样的近似(包括蒙特卡洛排列和具有Neyman最优分配的分层抽样),以在有限计算预算下最小化方差。为解决模态间的粒度不匹配问题,我们提出了频谱图引导的音素对齐(SGPA),一种新颖的预处理方法,将高频音频流映射到可解释的、单词对齐的片段。我们的贡献有两方面:首先,我们提供了一个开源的、模型无关的Python包和配套的GUI,用于多模态归因的计算和交互式可视化。其次,我们使用VoiceBench和Infinity Instruct数据集的精选子集,在多种多语言场景下评估我们的框架。实验结果表明,输入模态是归因波动的主要驱动因素,并证明标准句法重要性代理在多模态跨语言上下文中通常无法预测模型注意力。

英文摘要

Multimodal Large Language Models (MLLMs) effectively integrate text and audio to interpret context in complex interactive dialogues. However, the internal mechanisms by which heterogeneous modalities influence model behavior remain opaque. While Shapley Values (SV) provide a robust, model-agnostic framework for local explainability in text-based NLP, their extension to multimodal data is hindered by cross-channel dependencies, intricate dialogue structures, and the prohibitive computational complexity of dense audio representations. In this work, we formalize a multimodal extension of the Shapley Value framework, treating discrete text tokens and aligned audio segments as cooperative features. To ensure computational feasibility, we deploy a suite of efficient estimation strategies: exact SV computation for low-dimensional inputs and sampling-based approximations - including Monte Carlo permutations and stratified sampling with Neyman-optimal allocation - to minimize variance under constrained computational budgets. To resolve the granularity mismatch between modalities, we propose Spectrogram-Guided Phonetic Alignment (SGPA), a novel preprocessing method that maps high-frequency audio streams to interpretable, word-aligned segments. Our contribution is twofold: first, we provide an open-source, model-agnostic Python package and a companion GUI for the computation and interactive visualization of multimodal attributions. Second, we evaluate our framework using curated subsets of the VoiceBench and Infinity Instruct datasets across diverse multilingual scenarios. Our experimental results reveal that input modality is a primary driver of attribution volatility and demonstrate that standard syntactic importance proxies often fail to predict model attention in multimodal, cross-lingual contexts.
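The Monte Carlo permutation estimator named in this abstract can be sketched on a toy problem. This is a generic illustration under stated assumptions, not the authors' package: the value function `value`, the feature names, and the helpers `exact_shapley` and `monte_carlo_shapley` are hypothetical, and a real attribution pipeline would replace `value` with model evaluations on masked inputs.

```python
import itertools
import random

FEATURES = ["text_1", "text_2", "audio_1"]

def value(coalition):
    """Toy value function (an assumption): additive weights plus one text-audio interaction."""
    weights = {"text_1": 1.0, "text_2": 2.0, "audio_1": 3.0}
    total = sum(weights[f] for f in coalition)
    if "text_1" in coalition and "audio_1" in coalition:
        total += 2.0
    return total

def marginal_contributions(order):
    """Marginal gain of each feature when features are added in the given order."""
    contrib, seen, prev = {}, [], value([])
    for f in order:
        seen.append(f)
        cur = value(seen)
        contrib[f] = cur - prev
        prev = cur
    return contrib

def exact_shapley(features):
    """Average marginal contributions over all orderings (feasible only for few features)."""
    perms = list(itertools.permutations(features))
    totals = {f: 0.0 for f in features}
    for order in perms:
        for f, c in marginal_contributions(order).items():
            totals[f] += c
    return {f: t / len(perms) for f, t in totals.items()}

def monte_carlo_shapley(features, num_samples, seed=0):
    """Approximate Shapley values by sampling random orderings."""
    rng = random.Random(seed)
    totals = {f: 0.0 for f in features}
    for _ in range(num_samples):
        order = list(features)
        rng.shuffle(order)
        for f, c in marginal_contributions(order).items():
            totals[f] += c
    return {f: t / num_samples for f, t in totals.items()}

exact = exact_shapley(FEATURES)
approx = monte_carlo_shapley(FEATURES, num_samples=2000)
print(exact)
print(approx)
```

In this toy game the interaction bonus is split evenly between `text_1` and `audio_1`, and the sampled estimates always sum to the value of the full coalition because each sampled ordering does. Stratified sampling with Neyman-optimal allocation, which the abstract also mentions, is a variance-reduction refinement that this sketch does not implement.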

2606.08056 2026-06-09 cs.CL cs.AI 新提交

What's the Point? Spatial Grammar & Index Resolution for Sign Language Processing

要点何在?手语处理中的空间语法与索引解析

Oline Ranum, Simon Hadfield, Richard Bowden

发表机构 * Centre for Vision, Speech and Signal Processing, University of Surrey(萨里大学视觉、语音与信号处理中心)

AI总结 针对手语中占10-15%但被忽视的空间索引现象,提出索引检测与话语实体链接的分解框架,建立索引感知手语建模基线,并作为辅助专家提升冻结手语识别模型性能。

详情
AI中文摘要

手语模型主要使用词汇序列或文本监督进行训练,因此对非词汇和构式性结构的建模不足。一个相对易处理的情况是空间索引:将话语实体分配给空间位置以供后续共指的指向手势,而以词汇为中心的目标在很大程度上未能捕捉到这一点。我们对手语识别中的索引进行了有针对性的评估,显示尽管索引占手语内容的10-15%,但其恢复效果很差。我们引入了一个用于训练和评估索引专家的框架,为索引感知手语建模建立了基线。我们的方法将空间指代解析分解为索引检测和话语实体链接。由此产生的提及表示支持自动标注和非词汇结构建模,并在推理时作为辅助索引专家增强冻结的SLR模型。

英文摘要

Sign language models are predominantly trained with gloss-sequence or text supervision, thereby under-modeling non-lexical and productive constructions. One comparatively tractable instance is spatial indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We present a targeted evaluation of indexing in Sign Language Recognition, showing that despite comprising 10-15% of signing content, indexing is poorly recovered. We introduce a framework for training and evaluating indexing experts, establishing a baseline for index-aware sign language modeling. Our approach decomposes spatial reference resolution into index detection and discourse entity linking. The resulting mention representations enable automatic annotation and non-lexical structure modeling, and serve as an auxiliary indexing expert that augments a frozen SLR model at inference time.

2606.08081 2026-06-09 cs.CL cs.AI 新提交

Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

对齐但非伙伴特定:区分多模态LLM智能体在参考游戏中如何成功而无需类人惯例

Po-Ya Angela Wang, Chinmaya Mishra, Aslı Özyürek, Paula Rubio-Fernández, Esam Ghaleb

发表机构 * National Taiwan University(国立台湾大学) Max Planck Institute for Psycholinguistics(马克斯·普朗克心理语言学研究所) Radboud University(拉德堡德大学) Institut Jean Nicod(让·尼科研究所)

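AI summary Uses a constrained pseudo-dyad baseline to test whether label alignment among multimodal LLM agents in reference games stems from partner-specific interaction or from a shared task vocabulary, and finds that agents coordinate through verbose descriptions rather than compressed expressions.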

详情
AI中文摘要

重复参考游戏测试对话者是否用基于共享交互历史的更短、伙伴特定的惯例替换其初始长描述。先前工作表明,多模态LLM在轮次中未能变得更高效,尽管它们在使用的标签上对齐。我们如何确定这种对齐反映了伙伴特定的基础而非共享任务词汇?我们通过将有能力的多模态智能体对与来自KTH Tangrams语料库的人类对进行比较来解决这个问题。我们的新颖方法论贡献是一个受约束的伪对基线,它匹配原始指称任务结构,但打破了伙伴历史。该基线使我们能够测试观察到的标签对齐是否依赖于与特定伙伴的交互。在三个分析层面(任务能力、描述策略、对齐动态)上,我们发现了明显差异。人类通过适应减少努力,压缩描述并增加与伙伴的标签对齐。智能体反而保持固定的努力水平,从第一轮开始产生冗长的描述,标签重叠接近上限,在真实对和伪对之间统计上无法区分。因此,多模态LLM在没有惯例的情况下实现了协调,通过冗长描述而非形成人类对话特征的紧凑、依赖历史的指称表达来取得成功。

英文摘要

Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.

2606.08394 2026-06-09 cs.CL 新提交

When Correct Decisions Hide Internal Stress: Decision-State Probing in Multimodal Language Models

当正确决策隐藏内部压力:多模态语言模型中的决策状态探测

Haoran Zhao, Soyeon Caren Han, Eduard Hovy

发表机构 * The University of Melbourne(墨尔本大学)

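AI summary Proposes the S³E framework, which combines a positive-anchored A/B forced-choice task with hidden-state analysis, and finds that multimodal language models show decision-state displacement under semantic stress even when their behavior is correct.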

详情
AI中文摘要

多模态语言模型通常通过外部行为进行评估:选择正确的图像-文本匹配、拒绝无支持的标题或正确回答视觉查询。然而,仅凭正确行为并不能证明模型的内部决策状态在受控语义压力下保持稳定。我们通过S³E(结构化语义压力评估)框架研究这一差距,该框架用于分析多模态语言模型中行为-内部解耦。S³E使用正锚定的A/B强制选择设置,其中图像支持的标题与语义压力候选进行对比,并在原始和交换选项顺序下进行,同时在回答前的决策状态提取隐藏状态。我们专注于严格正确的试验,即模型在两种顺序下都一致选择正确标题。我们不将任意的隐藏状态变化视为不稳定的证据,而是测量语义冲突候选是否相对于保持意义的控制项导致过度的决策状态位移。在Qwen3VL、Gemma3和InternVL3上,尽管强制选择行为正确,语义压力相对于词汇控制项始终产生显著的正选定层过度位移,而与随机负样本的比较则依赖于模型。我们将此解释为有范围的决策状态压力敏感性信号,而非下游失败或幻觉的证据。我们的结果表明,仅凭强制选择正确性不足以证明内部决策几何的不变性。

英文摘要

Multimodal language models are typically evaluated through external behavior: selecting the correct image-text match, rejecting unsupported captions, or answering visual queries correctly. However, correct behavior alone does not show that the model's internal decision state remains stable under controlled semantic stress. We study this gap through S³E (Structured Semantic Stress Evaluation), a framework for analyzing behavior-internal decoupling in multimodal language models. S³E uses a positive-anchored A/B forced-choice setup in which an image-supported caption is contrasted against semantic stress candidates under both original and swapped option orders, while hidden states are extracted at the pre-answer decision state. We focus on strict-correct trials, where the model consistently selects the correct caption across both orders. Rather than treating arbitrary hidden-state variation as evidence of instability, we measure whether semantic-conflict candidates induce excess decision-state displacement relative to meaning-preserving controls. Across Qwen3VL, Gemma3, and InternVL3, semantic stress consistently produces positive selected-layer excess displacement over lexical controls despite correct forced-choice behavior, while comparisons against random negatives are model-dependent. We interpret this as a scoped decision-state stress-sensitivity signal rather than evidence of downstream failure or hallucination. Our results suggest that forced-choice correctness alone is not a sufficient certificate of invariant internal decision geometry.

2606.08770 2026-06-09 cs.CL cs.AI cs.CV cs.LG 新提交

TeamHerald@CHIPSAL 2026: Hate Speech Detection and Sentiment Analysis of Nepali Memes using Transformer-based Architectures and Ensemble Learning

TeamHerald@CHIPSAL 2026:基于Transformer架构和集成学习的尼泊尔语模因仇恨言论检测与情感分析

Ashish Acharya, Anish Khatiwada, Rohit Khadka, Pragya Aryal

发表机构 * Herald College Kathmandu(加德满都赫尔德学院)

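AI summary Addresses code-mixing and scarce resources in Nepali memes by extracting text with OCR and modeling it with Transformer models; finds that hard and soft voting ensembles behave differently on binary and multi-class tasks, with soft voting giving a 15.8% relative Macro F1 improvement on the multi-class sentiment task.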

Comments Accepted at the 2nd Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026) at LREC 2026

详情
AI中文摘要

尼泊尔语互联网模因的分析因频繁的代码混合和缺乏已建立的基线资源而变得复杂。虽然模因本质上结合了视觉和文本元素,但本研究侧重于以文本为中心的方法,通过OCR层提取嵌入文本,并使用基于Transformer的架构进行建模。我们评估了六种不同的模型,并研究了硬投票和软投票集成策略在两项任务中的比较效果:二分类仇恨言论检测和三分类情感分析。实验结果表明,独立的仅解码器模型在二分类任务中取得了最高性能,而软投票集成在多类情感任务中表现最佳,相比最强的独立基线,Macro F1分数相对提升了15.8%。这些发现表明,集成策略在二分类和多类任务中表现不同,突出了选择适合分类目标的聚合方法的重要性。

英文摘要

The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of established baseline resources. While memes inherently combine visual and textual elements, this study focuses on a text-centric approach by extracting embedded text using an OCR layer and modeling it with Transformer-based architectures. We evaluate six distinct models and investigate the comparative effectiveness of Hard and Soft Voting ensemble strategies across two tasks: binary hate speech detection and three-class sentiment analysis. Experimental results show that a standalone decoder-only model achieved the highest performance for binary classification, whereas the Soft Voting ensemble performed best for the multi-class sentiment task, yielding a 15.8% relative improvement in Macro F1-score over the strongest standalone baseline. These findings suggest that ensemble strategies behave differently across binary and multi-class tasks, highlighting the importance of selecting aggregation methods suited to the classification objective.
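
To make the distinction between the two ensemble strategies concrete, the sketch below contrasts hard and soft voting on a single three-class example. The probabilities are invented for illustration and the function names are hypothetical; this is not the authors' pipeline.

```python
from collections import Counter


def hard_vote(prob_rows):
    """Majority vote over each model's argmax label.
    Ties are broken in favour of the lowest label index."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_rows]
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, c in counts.items() if c == best)


def soft_vote(prob_rows):
    """Average the class probabilities across models, then take the argmax."""
    n_classes = len(prob_rows[0])
    avg = [sum(p[k] for p in prob_rows) / len(prob_rows) for k in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


# Invented per-model probabilities over three sentiment classes.
probs = [
    [0.40, 0.35, 0.25],  # model A: weak preference for class 0
    [0.45, 0.40, 0.15],  # model B: weak preference for class 0
    [0.05, 0.90, 0.05],  # model C: strong preference for class 1
]
hard = hard_vote(probs)
soft = soft_vote(probs)
```

Here two weakly confident models outvote a single confident one under hard voting, while soft voting lets the confident model's probability mass decide. This is one way the two strategies can diverge on multi-class inputs.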

2606.09148 2026-06-09 cs.CL 新提交

Explicit Representation Alignment for Multimodal Sentiment Analysis

显式表示对齐用于多模态情感分析

Baode Wang, Ziming Wang, Huacan Wang, Ronghao Chen, Biao Wu

发表机构 * AgentAlpha

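AI summary Addresses representation misalignment in multimodal affective analysis by using vision-language models to convert visual content into textual descriptions, combined with semantic token selection and batch-level uniformity regularization, achieving cross-modal alignment and state-of-the-art performance on multiple benchmarks.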

Comments 10 pages, 5 figures

详情
AI中文摘要

多模态情感分析旨在通过联合建模文本和图像等异质模态来理解人类情感和情绪。然而,多模态模型往往无法持续优于强文本基线,且性能在不同融合策略间差异显著。在本工作中,我们识别出独立预训练的模态编码器之间的表示不对齐是多模态有效学习的关键瓶颈,并通过控制实验表明,融合前的对齐通常比融合复杂性更重要。为解决此问题,我们提出一个统一的多模态情感分析框架,利用视觉-语言模型(VLM)将视觉内容转换为结构化文本描述,将异质模态投影到共享语言空间,并实现可解释的以文本为中心的推理。为进一步提升鲁棒性,我们引入一种混合学习策略,结合语义标记选择和批量级均匀性正则化目标,鼓励更分散和稳定的全局特征空间,同时减轻VLM生成描述引入的噪声。在多个多模态情感和情绪基准上的实验表明,我们的方法持续优于强单模态和多模态基线,达到最先进性能。我们的分析进一步强调了表示对齐在多模态情感学习中的关键作用。

英文摘要

Multimodal affective analysis aims to understand human sentiment and emotion by jointly modeling heterogeneous modalities such as text and images. However, multimodal models often fail to consistently outperform strong text-only baselines, with performance varying significantly across fusion strategies. In this work, we identify representation misalignment between independently pretrained modality encoders as a key bottleneck for effective multimodal learning, and show through controlled experiments that alignment prior to fusion is often more important than fusion complexity. To address this issue, we propose a unified multimodal affective analysis framework that leverages vision-language models (VLMs) to convert visual content into structured textual descriptions, projecting heterogeneous modalities into a shared linguistic space and enabling interpretable text-centric reasoning. To further improve robustness, we introduce a hybrid learning strategy that combines semantic token selection with a batch-level uniformity regularization objective, encouraging a more dispersed and stable global feature space while mitigating noise introduced by VLM-generated descriptions. Experiments on multiple multimodal sentiment and emotion benchmarks show that our method consistently outperforms strong unimodal and multimodal baselines, achieving state-of-the-art performance. Our analysis further highlights the critical role of representation alignment in multimodal affective learning.
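
The abstract does not spell out its batch-level uniformity objective. As a rough illustration only, the sketch below uses one widely used form, the log of the mean Gaussian potential over all pairs of L2-normalized features in a batch. The temperature `t`, the function names, and the toy batches are assumptions, and the paper's actual loss may differ.

```python
import math
from itertools import combinations


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def uniformity_loss(batch, t=2.0):
    """Log of the mean pairwise Gaussian potential; lower means the
    normalized features are spread more evenly over the unit sphere."""
    feats = [l2_normalize(v) for v in batch]
    pair_terms = [
        math.exp(-t * sum((a - b) ** 2 for a, b in zip(u, v)))
        for u, v in combinations(feats, 2)
    ]
    return math.log(sum(pair_terms) / len(pair_terms))


# Toy batches: identical features versus three directions about 120 degrees apart.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
spread = [[1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866]]
```

A collapsed batch scores 0, the maximum of this loss, while the spread batch scores close to -6, so minimizing the term pushes features apart in the sense the abstract describes as a "more dispersed" global feature space.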

2606.09195 2026-06-09 cs.CL 新提交

Symbolic and Abstractive Reasoning with Complex Visual Queries

复杂视觉查询的符号与抽象推理

Yichi Zhang, Jingdian Lu, Zhuo Chen, Lingbing Guo, Jun Xu, Wen Zhang, Huajun Chen

发表机构 * Zhejiang University(浙江大学) Nanjing University(南京大学) Ant Group(蚂蚁集团)

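AI summary Introduces complex visual queries (CVQs), synthesizes a dataset from multimodal knowledge graphs, and designs a two-stage training framework to improve the symbolic and abstractive reasoning of multimodal large language models.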

Comments Work in progress

详情
AI中文摘要

理解和推理抽象视觉内容仍然是当前多模态大语言模型(MLLMs)面临的挑战。本文探索了一种新颖的抽象数据类型,称为复杂视觉查询(CVQ),旨在探测符号和抽象推理,这是MLLMs类人神经符号推理中关键但尚未充分探索的维度。我们从三个角度进行了全面研究:数据 × 范式 × 探索。具体而言,我们提出了一种可扩展的流水线,用于合成基于大规模多模态知识图谱的CVQ,通过一阶逻辑算子的系统组合生成了一个包含14种不同查询类型的多样化数据集。我们进一步引入了一个两阶段训练框架,逐步赋予MLLMs强大的视觉推理能力。我们进行了大量实验,从多个维度严格评估MLLMs,包括在CVQ上的推理性能,以及跨任务和跨场景的泛化能力。我们相信,我们的工作为推进MLLMs的推理前沿开辟了新的视角和途径。

英文摘要

Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large language models (MLLMs). In this paper, we explore a novel abstract data type termed complex visual query (CVQ), designed to probe symbolic and abstractive reasoning, which is a critical yet underexplored dimension of human-like neuro-symbolic reasoning for MLLMs. We present a comprehensive investigation from three perspectives: Data × Paradigm × Exploration. Specifically, we propose a scalable pipeline for synthesizing CVQs grounded in large-scale multi-modal knowledge graphs, generating a diverse dataset encompassing 14 distinct query types via systematic combinations of first-order logic operators. We further introduce a two-stage training framework that progressively equips MLLMs with robust visual reasoning capabilities. We conduct extensive experiments to rigorously evaluate MLLMs across multiple dimensions, including reasoning performance on CVQs, as well as cross-task and cross-scenario generalization. We believe our work opens new perspectives and avenues for advancing the reasoning frontiers of MLLMs.

2606.09366 2026-06-09 cs.CL eess.AS 新提交

Is Text All You Need? Text as a Universal Information Bottleneck for Speech LLMs

文本就是一切?文本作为语音大语言模型的通用信息瓶颈

Ming-Hao Hsu, Yuxuan Hu, Shujie Liu, Jinyu Li, Yan Lu, Zhizheng Wu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Microsoft Corporation(微软公司) Microsoft Research Asia(微软亚洲研究院)

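AI summary Proposes Convex Gate (C-Gate), a speech-to-LLM bridge that uses a convex-hull constraint to keep speech representations within the LLM's input embedding manifold; it achieves strong joint performance on ASR and emotion recognition and shows that geometry, rather than token discreteness, is the key design factor.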

详情
AI中文摘要

大型语言模型(LLM)为语音理解提供了强大的推理骨干,但将连续声学信号集成到冻结的LLM中仍然具有挑战性。现有的语音到LLM接口通常处于两个极端:要么强制近乎离散的令牌对齐,这有利于转录但丢失副语言信息;要么学习无约束的连续表示,这可能会偏离LLM的输入空间并降低自回归解码性能。在这项工作中,我们提出了Convex Gate(C-Gate),一种语音到LLM的桥接方法,通过架构凸包约束将所有语音表示限制在LLM的输入嵌入流形内。具体而言,每一帧被表示为令牌嵌入的凸组合,确保与预训练LLM的兼容性,同时保持连续表达能力。在自动语音识别(ASR)和情感识别任务中,C-Gate实现了强大的联合性能,在LibriSpeech上相对词错误率(WER)降低高达48.7%,同时匹配或超过单任务情感识别准确率。除了性能之外,我们的分析揭示了一个关键见解:信息不是由离散令牌身份携带,而是由嵌入空间中时间分辨的轨迹携带。因果干预证实,轨迹结构和与预训练嵌入流形的对齐对性能都至关重要。这些结果表明,几何结构而非令牌离散性是语音到LLM接口的基本设计因素,并为研究冻结LLM中的多模态集成提供了一个受控机制。我们发布了检查点、每个样本的输出、机制转储和干预套件以供复现。

英文摘要

Large language models (LLMs) provide a powerful reasoning backbone for speech understanding, but integrating continuous acoustic signals into a frozen LLM remains challenging. Existing speech-to-LLM interfaces typically operate at two extremes: either enforcing near-discrete token alignment, which benefits transcription but loses paralinguistic information, or learning unconstrained continuous representations, which can drift away from the LLM's input space and degrade autoregressive decoding. In this work, we propose Convex Gate (C-Gate), a speech-to-LLM bridge that constrains all speech representations to lie within the LLM's input embedding manifold with an architectural convex-hull constraint. Concretely, each frame is represented as a convex combination of token embeddings, ensuring compatibility with the pretrained LLM while preserving continuous expressivity. Across automatic speech recognition (ASR) and emotion recognition, C-Gate achieves strong joint performance, improving LibriSpeech WER by up to 48.7% relative while matching or exceeding single-task emotion accuracy. Beyond performance, our analysis reveals a key insight: information is not carried by discrete token identities, but by time-resolved trajectories in the embedding space. Causal interventions confirm that both the trajectory structure and alignment to the pretrained embedding manifold are critical for performance. These results suggest that geometry, rather than token discreteness, is the fundamental design factor in speech-to-LLM interfaces, and provide a controlled regime for studying multimodal integration in frozen LLMs. We release the checkpoint, per-sample outputs, mechanism dumps, and intervention suite for replication.
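
The convex-combination idea in this abstract can be sketched in a few lines. The sketch assumes the mixing weights come from a softmax over per-frame logits, which the abstract does not state; the three-token vocabulary, the logits, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def convex_gate(frame_logits, token_embeddings):
    """Represent one speech frame as a convex combination of token embeddings,
    so the result lies inside the convex hull of the embedding table."""
    weights = softmax(frame_logits)
    dim = len(token_embeddings[0])
    mixed = [
        sum(w * emb[d] for w, emb in zip(weights, token_embeddings))
        for d in range(dim)
    ]
    return weights, mixed


# Hypothetical three-token vocabulary with 2-dimensional embeddings.
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights, frame_vec = convex_gate([2.0, 0.5, -1.0], embeddings)
```

Because the weights are non-negative and sum to one, `frame_vec` cannot leave the region spanned by the embedding table, yet it can still vary continuously between tokens. That is the compatibility and expressivity trade-off the abstract describes.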

2606.09428 2026-06-09 cs.CL 新提交

Guide Me Out: A Framework to Benchmark VLM Operators Communication in Crisis Scenarios

引导我出去:危机场景下评估VLM操作员通信的框架

Giacomo Gonella, Stefano Menini, Marco Guerini

发表机构 * Fondazione Bruno Kessler(布鲁诺·凯斯勒基金会) University of Trento(特伦托大学)

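AI summary Proposes a benchmarking framework that evaluates vision-language models guiding civilians through simulated evacuations across communication strategy (narrowcast vs. broadcast), environment representation (visual vs. graph), and threat behavior (static vs. moving); narrowcast lowers failure rates, the visual representation drives performance, and moving threats raise failure rates.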

详情
AI中文摘要

有效的危机响应需要空间定位的通信,将平民的语言指导与物理环境联系起来,考虑结构瓶颈、不断变化的威胁和代理特定背景。然而,当前危机通信中的NLP研究主要局限于静态、纯文本分类设置,忽视了AI操作员在动态、具身场景中的关键通信作用。我们通过一个新的基准框架来解决这一差距,该框架用于评估视觉语言模型(VLM)在模拟疏散中引导平民代理的任务。我们测试了两种通信策略(窄播与广播)、两种环境表示(视觉与基于图)和两种威胁行为(静态与移动),跨越九张不同结构复杂度的地图。我们的结果表明,与广播相比,窄播在所有难度级别上持续降低平民失败率。指导质量很大程度上取决于VLM操作员如何表示世界:视觉模态驱动性能,而添加邻接图则依赖于模型且通常有害。移动威胁在所有条件下提高失败率,因为通信必须随时间持续适应。这些发现共同表明,将VLM作为AI操作员部署在疏散场景中仍然是一个非平凡挑战,其中通信策略和输入表示的选择可以直接决定干预的成功或失败。

英文摘要

Effective crisis response requires spatially grounded communication that bridges linguistic guidance of civilians with the physical environment, accounting for structural bottlenecks, evolving threats, and agent-specific contexts. Yet, current NLP research in crisis communication remains mainly limited to static, text-only classification settings, overlooking the critical communicative role of AI operators in dynamic, embodied scenarios. We address this gap with a novel benchmarking framework for evaluating Vision-Language Models (VLMs) tasked with guiding civilian agents through simulated evacuations. We test two communication strategies (narrowcast vs. broadcast), two environment representations (visual vs. graph-based), and two threat behaviors (static vs. moving) across nine maps of varying structural complexity. Our results show that Narrowcast consistently reduces civilian Fail rates compared to Broadcast across all difficulty levels. Guidance quality depends heavily on how the VLM operator represents the world: the visual modality drives performance, while adding an adjacency graph is model-dependent and often harmful. Moving threats raise Fail rates across all conditions as communication must continuously adapt over time. Together, these findings show that deploying VLMs as AI operators in evacuation scenarios remains a non-trivial challenge, where the choice of communication strategy and input representation can directly determine the success or failure of the intervention.

2606.09644 2026-06-09 cs.CL cs.CV 新提交

Where Does the Answer Come From? Benchmarking View-Level Visual Evidence Identification in Multi-View MLLMs for Autonomous Driving

答案从何而来?面向自动驾驶的多视角MLLMs中视角级视觉证据识别基准

Yimu Wang, Yee Man Choi, Barry Zhang, Mozhgan Nasr Azadani, Sean Sedwards, Krzysztof Czarnecki

发表机构 * University of Waterloo(滑铁卢大学)

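AI summary Proposes a benchmark for multi-view autonomous driving scenes that evaluates whether multimodal large language models can identify the supporting camera view in visual question answering; it contains 122 conflict-centric question-answer pairs and separates view selection from answer correctness.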

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉推理基准测试中取得了强劲结果,但仅凭答案准确性并不能表明模型是否依赖了正确的视觉证据。这一差距在用于自动驾驶的多视角驾驶场景中尤为重要,因为模型可能产生看似合理的答案,却将其归因于错误的相机视角。我们引入了一个多视角视觉问答基准,用于评估证据来源识别:给定六个同步的NuScenes视角和一个问题,模型必须识别支持性的相机视角并回答问题。该基准包含来自73个场景的122个冲突中心问答对,涵盖因果关系、反事实推理和意图预测。视角标签由自动冲突挖掘流程提出,并由标注者手动验证。我们评估了三种设置:相机视角选择、给定黄金视角的Oracle问答,以及模型在一次前向中同时选择视角并回答的联合预测。答案以多项选择和自由形式两种格式进行评估,使用精确匹配处理结构化预测,并使用LLM评判器处理自由形式回答。通过明确分离视觉来源识别与答案正确性,该基准揭示了仅凭答案评估无法发现的接地失败案例。

英文摘要

Multimodal large language models (MLLMs) achieve strong results on visual reasoning benchmarks, but answer accuracy alone does not indicate whether a model relied on the correct visual evidence. This gap is particularly important in multi-view driving scenes used for autonomous driving, where a model can produce a plausible answer while grounding it in the wrong camera view. We introduce a multi-view visual question answering benchmark for evaluating evidence-source identification: given six synchronized NuScenes views and a question, the model must identify the supporting camera view and answer the question. The benchmark contains 122 conflict-centric question-answer pairs from 73 scenes, spanning causality, counterfactual reasoning, and intent prediction. View labels are proposed by an automatic conflict-mining pipeline and manually verified by annotators. We evaluate three settings: camera-view selection, oracle QA given the golden view, and joint prediction in which the model selects a view and answers in one pass. Answers are evaluated in both multiple-choice and free-form formats, using exact match for structured predictions and an LLM judge for free-form responses. By explicitly separating visual-source identification from answer correctness, the benchmark exposes grounding failures that answer-only evaluation misses.
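
Separating view identification from answer correctness, as described in this abstract, amounts to scoring the two predictions independently and then looking at where they disagree. The sketch below shows one possible scoring for the multiple-choice, joint-prediction setting. The record fields, the metric names, and the four example records are hypothetical and not taken from the benchmark.

```python
def grounding_report(records):
    """Score view selection and answer correctness separately, and report
    how often a correct answer is attributed to the wrong camera view."""
    n = len(records)
    view_ok = [r["pred_view"] == r["gold_view"] for r in records]
    ans_ok = [r["pred_answer"] == r["gold_answer"] for r in records]
    return {
        "view_acc": sum(view_ok) / n,
        "answer_acc": sum(ans_ok) / n,
        "joint_acc": sum(v and a for v, a in zip(view_ok, ans_ok)) / n,
        "right_answer_wrong_view": sum(a and not v for v, a in zip(view_ok, ans_ok)) / n,
    }


# Invented predictions for four multiple-choice questions.
records = [
    {"gold_view": "CAM_FRONT", "pred_view": "CAM_FRONT", "gold_answer": "B", "pred_answer": "B"},
    {"gold_view": "CAM_BACK", "pred_view": "CAM_FRONT", "gold_answer": "A", "pred_answer": "A"},
    {"gold_view": "CAM_FRONT_LEFT", "pred_view": "CAM_FRONT_LEFT", "gold_answer": "C", "pred_answer": "D"},
    {"gold_view": "CAM_BACK_RIGHT", "pred_view": "CAM_BACK_RIGHT", "gold_answer": "A", "pred_answer": "A"},
]
report = grounding_report(records)
```

Answer accuracy alone would report 0.75 here. The `right_answer_wrong_view` rate of 0.25 is the kind of grounding failure that answer-only evaluation would miss.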

2606.07647 2026-06-09 cs.CV cs.CL cs.LG 交叉投稿

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

关键位置引导:基于令牌级视觉敏感度引导的LVLMs幻觉缓解

Ruipeng Zhang, Zhihao Li, C. L. Philip Chen, Tong Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

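AI summary Proposes Token-Level Visual-Sensitivity Steering (TLVS), which extracts token-level steering vectors and adaptively adjusts steering strength to suppress hallucinations only at critical decoding steps, outperforming previous steering methods on multiple benchmarks.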

详情
AI中文摘要

大型视觉语言模型(LVLMs)取得了快速进展并部署在各种应用中,但幻觉仍然是一个主要挑战。激活引导因其训练开销小和推理时可控制而具有吸引力。然而,我们发现,在自回归解码过程中,视觉条件对令牌预测的影响是稀疏且局部的,许多现有方法对整个序列的图像与非图像差异进行平均,稀释了这些关键信号,导致引导方向信噪比低。此外,许多现有方法应用固定的引导强度,错误分配干预预算,过度扰动非关键令牌,并可能导致不稳定。为了解决这些限制,我们提出了令牌级视觉敏感度引导(TLVS)用于幻觉缓解。我们的方法首先提取令牌级引导向量并进行细化,然后仅在关键位置应用细粒度的、视觉敏感度自适应的引导。这种轻量级、即插即用的机制只需要最少的校准训练,可以应用于各种视觉语言模型。它在每个解码步骤调节引导强度,选择性地抑制易产生幻觉的片段,同时保留基于证据的内容。我们在多个基准上评估TLVS,包括POPE、AMBER、CHAIR(COCO)、MMHal和HallusionBench,证明其相对于先前引导方法的一致改进。

英文摘要

Large vision language models (LVLMs) have made rapid advancements and are deployed across various applications, yet hallucinations remain a major challenge. Activation steering is appealing due to its minimal training overhead and controllability at inference time. However, we found that during autoregressive decoding, visual conditioning affects token prediction sparsely and locally across decoding steps, and many existing methods that average image-versus-no-image differences over the entire sequence dilute these critical signals, yielding low signal-to-noise ratio steering directions. Additionally, many existing methods apply a fixed steering strength, which misallocates the intervention budget, over-perturbs non-critical tokens, and can cause instability. To address these limitations, we propose Token-Level Visual-Sensitivity Steering (TLVS) for hallucination mitigation. Our approach first extracts token-level steering vectors and refines them, and then applies fine-grained, visual-sensitivity-adaptive steering only where it matters. This lightweight, plug-and-play mechanism requires only minimal training for calibration and can be applied across diverse vision-language models. It modulates the steering strength at each decoding step, selectively suppressing hallucination-prone spans while preserving evidence-grounded content. We evaluate TLVS on several benchmarks, including POPE, AMBER, CHAIR (COCO), MMHal, and HallusionBench, demonstrating consistent improvements over previous steering methods.

2606.07985 2026-06-09 cs.CV cs.CL 交叉投稿

FMRFusion: Frequency-Aware Multi-View Representation Learning for Heterogeneous Image Fusion

FMRFusion: 面向异质图像融合的频率感知多视图表示学习

Tao Zhou, Yunlong Liu, Qinghui Chen, Zekai Zhang, Minlong Sun, Changlin Bian, Dagang Li, Wenmin Wang, Jinglin Zhang

发表机构 * Shandong University(山东大学) Macau University of Science and Technology(澳门科技大学)

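AI summary Proposes FMRFusion, a network for infrared and visible image fusion that combines a multi-scale structural perception module, bilinear frequency decomposition, and cross-view complementary interaction with flow-matching refinement, performing especially well in nighttime scenes.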

详情
AI中文摘要

红外与可见光图像融合旨在生成保留重要目标信息和详细纹理的复合图像,整合两种异质模态。以往的图像融合方法通常采用单模块堆叠方式从两种模态中提取特征,然而这些方法可能导致对其独特特征的学习不完整,从而限制融合效果并在真实异质数据场景中降低鲁棒性。为解决这些问题,我们提出FMRFusion,一种用于异质图像融合的频率感知多视图表示学习网络。引入多尺度结构感知模块以有效捕捉判别性结构,提取细粒度局部结构和关键上下文信息。采用双线性频率分解机制将特征分离为高频和低频分量,实现不同频率域中局部细节和全局表示的联合建模。此外,融入跨视图互补交互以显式建模和融合反射光信息与辐射强度响应之间的互补特性,促进有效的跨视图交互。我们通过流匹配进一步改善融合结果的质量,通过学习从粗数据到高质量表示的变换逐步细化融合特征。在多个基准数据集上进行的大量实验表明,FMRFusion在一系列融合任务中实现了优越且一致的性能,尤其在夜间场景中表现突出。

英文摘要

Infrared and visible image fusion aims to generate a composite image that retains significant target information and preserves detailed textures, integrating two heterogeneous modalities. Previous image fusion methods typically adopt a single-module stacking approach to extract features from the two modalities. However, these approaches may result in incomplete learning of their distinct characteristics, thereby limiting the fusion effectiveness and constraining robustness in real-world heterogeneous data scenarios. To address these challenges, we propose FMRFusion, a frequency-aware multi-view representation learning network for heterogeneous image fusion. A Multi-Scale Structural Perception Module is introduced to effectively capture discriminative structures, extracting fine-grained local structures and essential contextual information. A bilinear frequency decomposition mechanism is employed to separate features into high-frequency and low-frequency components, enabling joint modeling of local details and global representations across different frequency domains. Moreover, a Cross-View Complementary Interaction is incorporated to explicitly model and fuse the complementary characteristics between reflected light information and radiative intensity responses, facilitating effective cross-view interaction. We further improve the performance of the fused results by flow matching, which progressively refines the fused features by learning the transformation from coarse data to high-quality representations. Extensive experiments conducted on multiple benchmark datasets demonstrate that FMRFusion achieves superior and consistent performance across a range of fusion tasks, especially in nighttime scenarios.

2606.08063 2026-06-09 cs.CV cs.AI cs.CL 交叉投稿

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Robust-U1: MLLMs能否自我恢复受损视觉内容以实现鲁棒理解?

Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

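AI summary Proposes Robust-U1, a framework that uses supervised fine-tuning, reinforcement learning, and multimodal reasoning to give multimodal large language models an explicit visual self-recovery capability, achieving state-of-the-art robustness on a real-world corruption benchmark and superior performance under adversarial corruptions.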

Comments Accepted by ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉理解方面取得了显著成功,但在真实世界的视觉损坏下其性能会大幅下降。尽管存在现有的鲁棒性增强方法,但它们存在局限性:黑盒特征对齐缺乏可解释性,而白盒基于文本的推理无法恢复丢失的像素级细节。本文研究一个基本研究问题:MLLMs能否自行恢复受损的视觉内容?为此,我们提出Robust-U1,一种新颖框架,赋予MLLMs显式的视觉自恢复能力以实现鲁棒理解。该方法包含三个核心阶段:用于初始重建的监督微调、具有双重奖励(像素级SSIM和语义级CLIP相似度)的强化学习以对齐高视觉质量,以及联合考虑受损输入和恢复图像的多模态推理。大量实验表明,Robust-U1在真实世界损坏基准上达到了最先进的鲁棒性,并在一般VQA基准上的对抗性损坏下保持了优越性能。分析证实,高质量的视觉恢复直接提升了推理性能,将自恢复确立为鲁棒视觉理解的关键机制。源代码可在https://github.com/jqtangust/Robust-U1获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.

2606.08615 2026-06-09 cs.CV cs.CL 交叉投稿

Harnessing Streaming Video in the Wild

利用野外流式视频

Dingyu Yao, Shuhuan Gu, Qingyi Si, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Naibin Gu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院) JD.COM(京东)

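AI summary Introduces Streaming Harness, a system that, together with the Streaming-Train-248K dataset and a training objective, equips vision-language models with proactive interaction, long-term memory, and real-time processing, and builds the Streaming-Eval benchmark for evaluating streaming video understanding.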

详情
AI中文摘要

视觉语言模型(VLM)在视频通话助手、实时评论和具身机器人等应用中越来越需要处理无界视频流。理想的流式系统应支持主动交互、长期记忆和实时处理,同时基于能够处理各种野外流式任务的VLM骨干。然而,现有VLM在离线视频理解方面表现出色,但在流式能力上有所欠缺,并且缺乏用于流式部署的专用基础设施。我们在三个方面解决这一差距。(i) 对于骨干能力,我们构建了Streaming-Train-248K,一个流式数据集,配以新颖的训练目标,用于使VLM适应流式交互和理解。(ii) 对于实际部署,我们引入了Streaming Harness,一个即插即用系统,赋予任何VLM三种核心能力:主动交互(每秒响应决策)、长期记忆(12小时上下文保留)和实时处理(亚秒级延迟)。(iii) 为了推动社区在流式能力方面的持续进步,我们设计了Streaming-Eval,一个反映模型在各种野外场景中能力的基准。大量实验表明,我们的方法在流式视频理解所需的所有核心能力上均取得了一致的提升。我们将开源我们的数据、代码和基准,以推动社区从离线视频理解向可部署的流式智能的转变。

英文摘要

Vision-Language Models (VLMs) are increasingly required to process unbounded video streams in applications such as video-call assistants, live commentary, and embodied robots. An ideal streaming system should support proactive interaction, long-horizon memory, and real-time processing, while resting on a VLM backbone capable of handling diverse in-the-wild streaming tasks. However, existing VLMs excel at offline video understanding but fall short in streaming capabilities and lack dedicated infrastructure for streaming deployment. We address this gap on three fronts. (i) For backbone capability, we construct Streaming-Train-248K, a streaming dataset paired with a novel training objective for adapting VLMs to streaming interaction and understanding. (ii) For real-world deployment, we introduce Streaming Harness, a plug-and-play system that endows any VLM with three core abilities: proactive interaction (per-second response decisions), long-term memory (12-hour context retention), and real-time processing (sub-second latency). (iii) To drive continued community progress on streaming capabilities, we design Streaming-Eval, a benchmark that reflects models' capabilities across diverse in-the-wild scenarios. Extensive experiments demonstrate consistent gains from our approach across all core capabilities required for streaming video understanding. We will open-source our data, code, and benchmark to advance the community's shift from offline video understanding to deployable streaming intelligence.

2606.08894 2026-06-09 cs.CV cs.CL 交叉投稿

Are Reasoning Vision-Language Models Robust to Semantic Visual Distractions?

推理视觉语言模型对语义视觉干扰具有鲁棒性吗?

Yizheng Sun, Mochuan Zhan, Yanan Ma, Jia Tong See, Yifan Wang, Ziyi Wang, Hao Li, Yang Cui, Wenhao Cai, Jingyu Sun, Chenghua Lin, Riza Batista-Navarro, Jingyuan Sun

发表机构 * University of Manchester(曼彻斯特大学) Marex Imperial College London(帝国理工学院)

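AI summary Introduces the Distract-Bench benchmark to study how reasoning VLMs are affected by semantic visual distractions in real-world inputs; reasoning VLMs are less robust to semantic distractions than to perceptual degradation, and the distractions often enter the reasoning process and lead to incorrect answers.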

详情
AI中文摘要

推理视觉语言模型(VLM)在复杂多模态任务上表现强劲,但可靠的现实应用需要处理比干净、精心策划的基准更混乱的视觉输入。现有工作主要通过输入损坏(如噪声、模糊和天气效果)来评估VLM的可靠性,这些损坏使视觉证据更难感知。这留下了一个关键可靠性失败模式未被充分探索:模型可能正确感知证据,却从看似合理但无关且分散注意力的证据中进行推理,并将此错误传播到最终答案。为填补这一空白,我们引入了Distract-Bench,一个用于评估VLM对语义视觉干扰鲁棒性的基准,定义为添加到输入中、保留真实答案但具有意义且与任务无关的视觉线索。我们全面评估了八个领先的开源和两个闭源VLM,涵盖传统视觉损坏和Distract-Bench。结果表明,Distract-Bench暴露了一种与视觉损坏不同的鲁棒性失败:推理VLM在感知退化下基本跟踪其非推理基础模型,但对语义干扰的鲁棒性始终较低。进一步分析表明,这些干扰常常进入VLM的推理过程,被当作证据,并导致错误答案。总之,这些发现重新定义了推理VLM的鲁棒性评估,将焦点从退化感知转向干扰,以实现可靠的现实世界视觉推理。我们的数据和代码可在https://github.com/Yizheng-Sun/Distract-Bench获取。

英文摘要

Reasoning Vision-Language Models (VLMs) achieve strong performance on complex multimodal tasks, but reliable real-world application requires handling visual inputs that are messier than clean, curated benchmarks. Existing works mainly evaluate such reliability of VLMs through input corruptions, such as noise, blur and weather effects, which make visual evidence harder to perceive. This leaves a critical reliability failure mode underexplored: a model may perceive the evidence correctly, yet reason from plausible but irrelevant and distracting evidence and propagate this mistake to its final answer. To address this gap, we introduce Distract-Bench, a benchmark for evaluating VLM robustness to semantic visual distractions, defined as meaningful but task-irrelevant visual cues added to inputs while preserving the ground-truth answer. We comprehensively evaluate eight leading open-source and two closed-source VLMs across conventional vision corruptions and Distract-Bench. Our results show that Distract-Bench exposes a robustness failure distinct from vision corruptions: reasoning VLMs largely track their non-reasoning base models under perceptual degradation, but show consistently lower robustness to semantic distractions. Further analysis shows that these distractions often enter the reasoning process of VLMs, are treated as evidence, and lead to incorrect answers. Together, these findings reframe robustness evaluation for reasoning VLMs, shifting the focus from degraded perception to distractions for reliable real-world visual reasoning. Our data and code are available at https://github.com/Yizheng-Sun/Distract-Bench.

2606.09033 2026-06-09 cs.CV cs.CL 交叉投稿

CRANE: Knowledge Editing for Reasoning MLLMs

CRANE:面向推理多模态大语言模型的知识编辑

Han Huang, Hao Wang, Mengqi Zhang, Shu Wu, Qiang Liu, Liang Wang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) New Laboratory of Pattern Recognition (NLPR), CASIA(中国科学院自动化研究所模式识别国家重点实验室) Harbin Institute of Technology(哈尔滨工业大学) Shandong University(山东大学)

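AI summary Identifies three failure modes of knowledge editing in reasoning multimodal large language models (structural collapse, cognitive dissonance, and shallow internalization) and proposes CRANE, a retrieval-augmented framework that requires no per-edit parameter modification and reaches high success rates through a modality-aware dual-library retrieval system and a two-phase training strategy.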

Comments 10 pages, 5 figures

详情
AI中文摘要

推理多模态大语言模型(MLLMs)的出现,即在生成答案前产生显式思维链(CoT)推理,为知识编辑带来了新挑战:在传统指标(教师强制准确率高达100%)下看似成功的方法,在检查模型推理过程时可能严重失败(基础成功率低至0%)。我们识别出三种失败模式:(1)结构崩溃,权重修改方法破坏CoT格式;(2)认知失调,模型的推理链基于视觉证据主动拒绝注入的编辑事实;(3)浅层内化,方法在精确查询上成功但在改写或多跳变体上失败。在推理MLLMs上,这些模式相互作用:泛化方法(FT、LoRA)触发格式崩溃,而无深度修改的方法无法泛化。为揭示这些失败,我们提出一种CoT感知评估协议,并构建ReasonEdit-Bench,包含冲突分层、多级探针和多跳可移植性测试。我们提出CRANE,一种检索增强框架,无需逐编辑参数修改。CRANE结合了模态感知双库检索系统和两阶段训练策略:监督微调(SFT)用于结构初始化,随后是带有认知路由奖励的GRPO,训练模型在视觉先验和注入编辑事实之间进行仲裁。在ReasonEdit-Bench上,CRANE在冲突场景中达到96.9%的基础成功率,多跳链中中间实体使用率为96.9%,文本局部性为97.6%,图像局部性编辑独立性为68.1%。在分布外MMEVOKE基准上,CRANE在黄金检索下达到87.0%。

英文摘要

The emergence of reasoning multimodal large language models (MLLMs), which generate explicit chain-of-thought (CoT) reasoning before producing answers, has introduced a new challenge for knowledge editing: methods that appear successful under traditional metrics (teacher-forcing accuracy up to 100%) can fail severely when the model's reasoning process is examined (Grounded Success as low as 0%). We identify three failure modes: (1) Structural Collapse, where weight-modifying methods destroy the CoT format; (2) Cognitive Dissonance, where the model's reasoning chain actively rejects the injected edit fact based on visual evidence; and (3) Shallow Internalization, where methods succeed on exact queries but fail on rephrase or multi-hop variants. On reasoning MLLMs, these modes interact: methods that generalize (FT, LoRA) trigger format collapse, while methods without deep modification cannot generalize. To expose these failures, we propose a CoT-aware evaluation protocol and construct ReasonEdit-Bench, with conflict stratification, multi-level probes, and multi-hop portability tests. We propose CRANE, a retrieval-augmented framework that requires no per-edit parameter modification. CRANE combines a modality-aware dual-library retrieval system with a two-phase training strategy: Supervised Fine-Tuning (SFT) for structural initialization, followed by GRPO with a Cognitive Routing Reward that trains the model to arbitrate between visual priors and injected edit facts. On ReasonEdit-Bench, CRANE achieves 96.9% Grounded Success on conflict scenarios and 96.9% intermediate entity usage in multi-hop chains, with 97.6% text-locality and 68.1% image-locality Edit Independence. On the out-of-distribution MMEVOKE benchmark, CRANE reaches 87.0% under gold retrieval.

2606.09131 2026-06-09 cs.AI cs.CL cs.CV cs.LG 交叉投稿

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

晚期融合足矣:面向视觉饱和的多模态大语言模型的双路径视觉令牌路由

Siyuan Liu, Jinyang Wu

发表机构 * School of Mechanics and Engineering Science, Peking University(北京大学力学与工程科学学院) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 针对多模态大语言模型中视觉令牌在深层饱和的问题,提出双路径视觉令牌路由(DPVR-LF),在饱和点将视觉令牌路由至单层可训练分支,仅最后层融合,以约3%可训练参数保持性能并减少计算。

Comments 18 pages, 4 figures. Submitted to Pattern Recognition

详情
AI中文摘要

多模态大语言模型(MLLMs)通常继承为单模态文本建模设计的深层对称Transformer骨干,并对图像和语言令牌应用相同的统一计算。这种设计忽略了一个关键的模态不对称性:图像和文本令牌在信息密度、冗余度和所需推理深度上存在显著差异。通过对LLaVA-1.5的逐层分析,我们观察到视觉令牌倾向于在中间层饱和。具体而言,文本到图像的注意力从第0层的0.68下降到第4层的0.07,并在第18层后稳定在0.04附近,而文本令牌则继续受益于深层语义处理。这些发现表明架构对称性与深度异步模态演化之间存在不匹配,导致冗余的视觉计算以及在深层任务特定适应期间感知表示的潜在漂移。受此启发,我们提出了双路径视觉令牌路由(DPVR),一种用于高效MLLMs的模态不对称路由框架。其核心实例DPVR-LF(晚期融合)在饱和点将视觉令牌路由到一个单层可训练侧分支,运行一个跳过深层堆栈中图像位置的十三层纯文本前向传播,并仅在最后一层重新融合视觉和文本流。使用约3%的可训练参数,DPVR-LF在标准基准上保持了有竞争力的多模态性能,同时减少了深层Transformer堆栈中的视觉计算。该结果挑战了视觉令牌必须遍历所有深层语言模型层的传统假设,并表明单个晚期融合层足以在LLaVA风格的MLLMs中维持强大的感知能力。

英文摘要

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.
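The layer-wise analysis reports a text-to-image attention value per layer (0.68 at layer 0, 0.07 by layer 4). One plausible way to compute such a number, assumed here rather than taken from the paper, is the average attention mass that text-position queries place on image-position keys. The function name and the toy inputs below are illustrative.

```python
import numpy as np

def text_to_image_attention(attn, is_image):
    """Mean attention mass from text queries to image keys for one layer.

    attn:     array [heads, seq, seq], each row summing to 1 (post-softmax).
    is_image: boolean array [seq], True at image-token positions.
    """
    attn = np.asarray(attn, dtype=float)
    is_image = np.asarray(is_image, dtype=bool)
    text_rows = attn[:, ~is_image, :]               # queries at text positions
    mass = text_rows[:, :, is_image].sum(axis=-1)   # mass landing on image keys
    return float(mass.mean())                       # average over heads and queries

# Toy check: uniform attention over 4 tokens, 2 of which are image tokens.
attn = np.full((2, 4, 4), 0.25)
print(text_to_image_attention(attn, [True, True, False, False]))
```

Running this per layer and plotting the result against depth would give a saturation curve of the kind the abstract describes; causal masking is ignored in the toy input.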

2504.02983 2026-06-09 cs.CL cs.CV 版本更新

Hummus: A Dataset of Humorous Multimodal Metaphor Use

Hummus:幽默多模态隐喻使用数据集

Xiaoyu Tong, Zhi Zhang, Pia Sommerauer, Martha Lewis, Ekaterina Shutova

发表机构 * ILLC, University of Amsterdam, the Netherlands(荷兰阿姆斯特丹大学逻辑、语言与计算研究所) Vrije Universiteit Amsterdam, the Netherlands(荷兰阿姆斯特丹自由大学)

AI总结 提出幽默多模态隐喻数据集Hummus,基于不一致理论和概念隐喻理论设计标注方案,测试多模态大语言模型在检测和理解幽默多模态隐喻上的表现,发现现有模型仍存在困难。

详情
AI中文摘要

隐喻和幽默有许多共同点,隐喻是最常见的幽默机制之一。本研究关注多模态隐喻的幽默能力,该领域尚未得到足够关注。我们从幽默的不一致理论、概念隐喻理论以及VU阿姆斯特丹隐喻语料库的标注方案中汲取灵感,开发了一种新的用于图像-标题对中幽默多模态隐喻使用的标注方案。我们创建了幽默多模态隐喻使用数据集Hummus,提供了从《纽约客》标题竞赛语料库中抽取的1000个图像-标题对的专家标注。利用该数据集,我们测试了最先进的多模态大语言模型(MLLMs)在检测和理解幽默多模态隐喻使用方面的能力。实验表明,当前MLLMs在处理幽默多模态隐喻时仍然存在困难,特别是在整合视觉和文本信息方面。我们在 github.com/xiaoyuisrain/humorous-multimodal-metaphor-use 发布数据集和代码。

英文摘要

Metaphor and humor share a lot of common ground, and metaphor is one of the most common humorous mechanisms. This study focuses on the humorous capacity of multimodal metaphors, which has not received due attention in the community. We take inspiration from the Incongruity Theory of humor, the Conceptual Metaphor Theory, and the annotation scheme behind the VU Amsterdam Metaphor Corpus, and develop a novel annotation scheme for humorous multimodal metaphor use in image-caption pairs. We create the Hummus Dataset of Humorous Multimodal Metaphor Use, providing expert annotation on 1k image-caption pairs sampled from the New Yorker Caption Contest corpus. Using the dataset, we test state-of-the-art multimodal large language models (MLLMs) on their ability to detect and understand humorous multimodal metaphor use. Our experiments show that current MLLMs still struggle with processing humorous multimodal metaphors, particularly with regard to integrating visual and textual information. We release our dataset and code at github.com/xiaoyuisrain/humorous-multimodal-metaphor-use.

2601.12263 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

多模态生成式引擎优化:针对视觉-语言模型排序器的排名操纵

Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu

发表机构 * Georgetown University(乔治城大学) University of Southern California(南加州大学) University of Maryland, College Park(马里兰大学学院公园分校) Arizona State University(亚利桑那州立大学)

AI总结 提出多模态生成式引擎优化(MGEO)方法,通过联合优化图像扰动和文本后缀,利用视觉-语言模型内部跨模态知识耦合,实现对产品排名的有效操纵,揭示了多模态基础模型知识基础的脆弱性。

Comments Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM) at ACL 2026

详情
AI中文摘要

视觉-语言模型(VLM)将视觉和文本知识整合到统一表示中,日益成为现代检索和推荐系统的基础。然而,这些模型在对多模态项目进行排序时如何可靠地利用其跨模态知识,以及其知识基础是否可以被颠覆,仍不清楚。在本文中,我们揭示了VLM在多模态产品排序中应用知识的一个基本漏洞:通过多模态生成式引擎优化(MGEO),我们展示了攻击者可以通过联合制作难以察觉的图像扰动和流畅的文本后缀,利用模型内部的跨模态知识耦合,操纵VLM的排序决策。MGEO采用交替优化策略,针对VLM中视觉和语言表示之间的深层交互,实现了远超单模态攻击和由强大商业模型驱动的启发式基线的排名操纵。我们的发现表明,表面内容质量不足以提升排名;相反,需要直接与模型内部知识利用机制对齐。这些结果对多模态基础模型中知识基础的忠实性和鲁棒性提出了重要问题,并激励了未来多模态检索系统防御机制的研究。代码见:https://github.com/glad-lab/MGEO

英文摘要

Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these models utilize their cross-modal knowledge when ranking multimodal items, and whether their knowledge grounding can be subverted. In this paper, we expose a fundamental vulnerability in how VLMs apply multimodal knowledge for product ranking: through Multimodal Generative Engine Optimization (MGEO), we show that an adversary can manipulate a VLM's ranking decisions by jointly crafting imperceptible image perturbations and fluent textual suffixes that exploit the model's internal cross-modal knowledge coupling. Using an alternating optimization strategy, MGEO targets the deep interactions between visual and linguistic representations within the VLM, achieving rank manipulations that substantially exceed those of unimodal attacks and heuristic baselines powered by strong commercial models. Our findings reveal that surface-level content quality is insufficient for rank promotion; instead, direct alignment with the model's internal knowledge utilization mechanism is required. These results raise important questions on the faithfulness and robustness of knowledge grounding in multimodal foundation models, and motivate future work on defense mechanisms for multimodal retrieval systems. Code is available at: https://github.com/glad-lab/MGEO

2602.22766 2026-06-09 cs.CL 版本更新

Imagination Helps Visual Reasoning, But Not Yet in Latent Space

想象力有助于视觉推理,但尚未在潜在空间中实现

You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang, Jinan Xu, Maosong Sun

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 通过因果中介分析发现,多模态大语言模型中的潜在推理存在输入-潜在和潜在-答案两个关键断连,表明其有效性有限,并提出显式想象方法CapImagine,在视觉推理任务中表现更优。

Comments ICML 2026 Poster

详情
AI中文摘要

潜在视觉推理旨在通过在多模态大语言模型的隐藏状态中进行冥想,模仿人类的想象过程。虽然被认为是视觉推理的一种有前景的范式,但其有效性的潜在机制仍不清楚。为了揭示其功效的真正来源,我们使用因果中介分析来研究潜在推理的有效性。我们将该过程建模为因果链:输入作为处理变量,潜在标记作为中介变量,最终答案作为结果变量。我们的发现揭示了两个关键的断连:(a) 输入-潜在断连:对输入进行剧烈扰动导致潜在标记的变化可以忽略不计,表明潜在标记未能有效关注输入序列。(b) 潜在-答案断连:对潜在标记的扰动对最终答案的影响极小,表明潜在标记对结果施加的因果效应有限。此外,广泛的探测分析显示,潜在标记编码的视觉信息有限且表现出高度相似性。因此,我们质疑潜在推理的必要性,并提出了一种简单的替代方法CapImagine,该方法教会模型使用文本进行显式想象。在视觉中心基准上的实验表明,CapImagine显著优于复杂的潜在空间基线,突显了通过显式想象进行视觉推理的优越潜力。

英文摘要

Latent visual reasoning aims to mimic the human imagination process by meditating through the hidden states of Multimodal Large Language Models. While recognized as a promising paradigm for visual reasoning, the underlying mechanisms driving its effectiveness remain unclear. Motivated to demystify the true source of its efficacy, we investigate the validity of latent reasoning using Causal Mediation Analysis. We model the process as a causal chain: the input as the treatment, the latent tokens as the mediator, and the final answer as the outcome. Our findings uncover two critical disconnections: (a) Input-Latent Disconnect: dramatic perturbations on the input result in negligible changes to the latent tokens, suggesting that latent tokens do not effectively attend to the input sequence. (b) Latent-Answer Disconnect: perturbations on the latent tokens yield minimal impact on the final answer, indicating that latent tokens have only a limited causal effect on the outcome. Furthermore, extensive probing analysis reveals that latent tokens encode limited visual information and exhibit high similarity. Consequently, we challenge the necessity of latent reasoning and propose a straightforward alternative named CapImagine, which teaches the model to explicitly imagine using text. Experiments on vision-centric benchmarks show that CapImagine significantly outperforms complex latent-space baselines, highlighting the superior potential of visual reasoning through explicit imagination.
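The Input-Latent check described above boils down to perturbing the treatment and measuring how much the mediator moves. The abstract does not say which distance is used, so the sketch below assumes a simple one: the mean cosine distance between latent-token vectors before and after the input perturbation. The array shapes and the function name are illustrative.

```python
import numpy as np

def mean_cosine_distance(a, b, eps=1e-12):
    """Mean of (1 - cosine similarity) over matching rows of a and b.

    a, b: arrays [num_latent_tokens, dim], e.g. latent tokens produced
    from the original input and from a perturbed input.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return float((1.0 - num / den).mean())

# Toy latents: identical rows give distance ~0, orthogonal rows give 1.
clean = np.array([[1.0, 0.0], [0.0, 1.0]])
print(mean_cosine_distance(clean, clean))
print(mean_cosine_distance(clean, clean[::-1]))
```

A value near zero under a strong input perturbation would correspond to the "negligible change" the abstract reports; the same measure could be applied on the answer side for the Latent-Answer check.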

2604.18347 2026-06-09 cs.CL cs.AI 版本更新

Multilingual Training and Evaluation Resources for Vision-Language Models

面向视觉语言模型的多语言训练和评估资源

Daniela Baiamonte, Elena Fano, Matteo Gabburo, Stefano Simonazzi, Leonardo Rigutini, Andrea Zugarini

发表机构 * Villanova.ai Aithlas

AI总结 本文提出跨五种欧洲语言的视觉语言模型训练与评估资源,通过再生与翻译方法生成高质量多语言数据,验证多语言数据在非英语基准上的有效性。

详情
AI中文摘要

视觉语言模型(VLMs)近年来取得了快速进展。然而,尽管发展迅速,VLM的开发仍严重依赖英语,导致两个主要限制:(i)缺乏用于训练的多语言多模态数据集,(ii)缺乏跨语言的全面评估基准。本文通过引入覆盖五种欧洲语言(英语、法语、德语、意大利语和西班牙语)的新型综合资源来填补这些空白。我们采用再生-翻译范式,通过结合精心筛选的合成生成和人工标注来生成高质量的跨语言资源。具体而言,我们构建了Multi-PixMo训练语料库,使用许可宽松的模型对PixMo已有数据集(PixMo-Cap、PixMo-AskModelAnything和CoSyn-400k)中的示例进行再生。在评估方面,我们通过翻译广泛使用的英语数据集(MMbench、ScienceQA、MME、POPE、AI2D)构建了一组多语言基准。我们通过定性和定量的人工分析评估这些资源的质量,并测量标注者间一致性。此外,我们进行了消融研究,以展示相对于仅使用英语数据,多语言数据对VLM训练的影响。在三种不同模型上的实验表明,使用多语言、多模态示例训练VLM在非英语基准上始终有益,同时对英语也有正向迁移。

英文摘要

Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite this growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets (PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k) with permissively licensed models. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMbench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. Additionally, we perform ablation studies to demonstrate the impact of multilingual data, relative to English-only data, in VLM training. Experiments with three different models show that using multilingual, multimodal examples for training VLMs is consistently beneficial on non-English benchmarks, with positive transfer to English as well.

2605.08384 2026-06-09 cs.CL 版本更新

jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers

jina-embeddings-v5-omni: 通过锁定对齐塔实现几何保持嵌入

Florian Hönicke, Michael Günther, Andreas Koukounas, Mohammad Kalim Akram, Scott Martens, Saba Sturua, Han Xiao

发表机构 * Jina by Elastic(Elastic 旗下的 Jina)

AI总结 本文提出GELATO方法,通过冻结对齐塔实现多模态嵌入,生成统一语义空间,训练效率高且保持文本嵌入一致性。

Comments 11 pages, 9 figures, 5 tables

详情
AI中文摘要

在本文中,我们介绍了GELATO(通过锁定对齐塔实现几何保持嵌入),一种新型的多模态嵌入模型。我们基于VLM式架构,非文本编码器被调整以生成语言模型的输入,进而生成所有输入类型的嵌入。我们展示了结果:jina-embeddings-v5-omni套件,一对模型将文本、图像、音频和视频输入编码到单一语义嵌入空间。GELATO扩展了两个Jina Embeddings v5文本模型,通过添加图像和音频编码器支持额外模态。骨干文本嵌入模型和新增的非文本模态编码器保持冻结。我们仅训练连接组件,代表联合模型总权重的0.35%。因此,训练比全参数重新训练要高效得多。此外,语言模型保持基本不变,对文本输入生成与Jina Embeddings v5文本模型完全相同的嵌入。我们的评估表明,GELATO产生的结果与最先进的方法相媲美,几乎与更大的多模态嵌入模型具有同等性能。

英文摘要

In this work, we introduce GELATO (Geometry-preserving Embeddings via Locked Aligned TOwers), a novel approach to multimodal embedding models. We build on the VLM-style architecture, in which non-text encoders are adapted to produce input for a language model, which in turn generates embeddings for all varieties of input. We present the result: the jina-embeddings-v5-omni suite, a pair of models that encode text, image, audio, and video input into a single semantic embedding space. GELATO extends the two Jina Embeddings v5 Text models to support additional modality by adding encoders for images and audio. The backbone text embedding models and the added non-text modality encoders remain frozen. We only trained the connecting components, representing 0.35% of the total weights of the joint model. Training is therefore much more efficient than full-parameter retraining. Additionally, the language model remains effectively unaltered, producing exactly the same embeddings for text inputs as the Jina Embeddings v5 Text models. Our evaluations show that GELATO produces results that are competitive with the state-of-the-art, yielding nearly equal performance to larger multimodal embedding models.

2605.30608 2026-06-09 cs.CL 版本更新

Semantic Motion Anchors: Bridging Motion and Meaning in Co-Speech Gestures

语义运动锚点:连接共语手势中的运动与意义

Varsha Suresh, Mohammad Mahdi Abootorabi, Mohamed Salman, M. Hamza Mughal, Christian Theobalt, Ashwin Ram, Jürgen Steimle, Vera Demberg

发表机构 * Saarland University(萨尔大学) MPI for Informatics(马克斯·普朗克信息学研究所) Saarland Informatics Campus(萨尔信息学园区) University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所) Zuse School(祖斯学校)

AI总结 提出语义运动锚点方法,通过将3D手势离散化为身体-手部运动基元并转化为结构化描述,在文本与运动之间建立辅助对比监督,显著提升共语手势检索的语义相关性。

详情
AI中文摘要

学习口语文本与手势之间的共享表示是共语手势检索、合成和理解的核心,但对于语义上有意义的手势仍然具有挑战性,因为其交际意图无法仅通过运动捕捉。转录文本与连续运动嵌入之间的直接对比对齐往往过度强调低级运动学,而忽略了语义手势的符号内容。我们提出语义运动锚点,即手势运动的自然语言抽象,捕捉物理形式和交际意图。我们的方法将3D手势离散化为身体-手部运动基元,将其转化为结构化描述,并将其嵌入转录文本中以提供辅助对比监督。在BEAT2上,我们的方法在文本到手势的R@1上比直接文本-运动基线提高了8.2%,并在文本到手势和手势到文本检索方向上优于先前的检索方法。除了总体检索指标外,语义运动锚点监督有助于检索对口语查询具有语义意义的手势,而不是默认使用通用运动模式。一项下游检索增强手势生成研究表明,用户显著偏好我们方法检索的手势,而非检索增强生成基线,表明语义基础的检索转化为在下游生成中更好传达交际意图的手势。

英文摘要

Learning a shared representation between spoken text and gesture is central to co-speech gesture retrieval, synthesis, and understanding, but remains challenging for semantically meaningful gestures whose communicative intent is not captured by motion alone. Direct contrastive alignment between transcripts and continuous motion embeddings often overemphasizes low-level kinematics and misses the symbolic content of semantic gestures. We propose semantic motion anchors, natural-language abstractions of gesture motion capturing physical form and communicative intent. Our method discretizes 3D gestures into body-hand motion primitives, verbalizes them into structured descriptions, and grounds them in the transcript to provide auxiliary contrastive supervision. On BEAT2, our method improves text-to-gesture R@1 by 8.2% over a direct text-motion baseline and outperforms prior retrieval approaches in both the text-to-gesture and gesture-to-text retrieval directions. Beyond aggregate retrieval metrics, semantic motion anchor supervision helps retrieve gestures that are semantically meaningful for the spoken query, rather than defaulting to generic motion patterns. A downstream retrieval-augmented gesture generation study showed that users significantly preferred gestures retrieved by our approach over a retrieval-augmented generation baseline, demonstrating that semantically grounded retrieval translates to gestures that better convey communicative intent in downstream generation.
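The headline number above is text-to-gesture R@1. As a minimal sketch of how Recall@k is usually computed for paired retrieval, the function below assumes a square similarity matrix in which the correct gesture for query i sits at index i. That pairing convention and the toy matrix are assumptions for illustration, not details from the paper.

```python
import numpy as np

def recall_at_k(sim, k=1):
    """Fraction of queries whose paired item appears among the top-k results.

    sim[i, j] is the similarity between text query i and gesture j;
    the ground-truth match for query i is assumed to be gesture i.
    """
    sim = np.asarray(sim, dtype=float)
    topk = np.argsort(-sim, axis=1)[:, :k]             # best k gesture indices per query
    targets = np.arange(sim.shape[0])[:, None]
    return float((topk == targets).any(axis=1).mean())

# Toy similarity matrix for three transcript/gesture pairs.
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.3, 0.8],
       [0.1, 0.2, 0.7]]
print(recall_at_k(sim, k=1), recall_at_k(sim, k=2))
```

Transposing `sim` gives the gesture-to-text direction with the same function.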

2606.03371 2026-06-09 cs.CL 版本更新

See, Infer, Intervene: Proactive World Modeling for Goal-Oriented Social Intelligence

观察、推断、干预:面向目标导向社交智能的主动世界建模

Honghui Zhang, Chenmeinian Guo, Yichen Yu, Guanyu Liu, Yujia Zhang, Yongming Qin, Chongguo Song, Mengyue Yang, Lei Yu, Tianyu Shi

发表机构 * Mita Technology(Mita技术公司) University of Bristol(布里斯托大学) University of Toronto(多伦多大学) McGill University(麦吉尔大学)

AI总结 提出 See-Infer-Intervene (SII) 框架和主动意图世界模型 (PIWM),通过观察顾客行为、推断潜在意图并选择干预动作,实现零售场景中的主动辅助,在 GuidanceSalesBench 基准上达到 0.641 macro F1。

Comments 16 pages, 3 figures, 9 tables. Preprint

详情
AI中文摘要

多模态零售智能体不仅应识别顾客正在做什么,还应决定是否以及如何在明确请求之前提供帮助。我们通过 See-Infer-Intervene (SII) 框架研究这一场景,其中设备必须观察交互前行为、推断潜在顾客意图,并通过选择适当的服务干预或选择等待来采取行动。我们使用主动意图世界模型 (PIWM) 实例化 SII,该模型通过 AIDA(注意力、兴趣、欲望、行动)购买阶段和 BDI(信念、欲望、意图)心理场表示顾客状态,预测动作条件下的意图转换,并从五类响应中选择:问候、引导、告知、推荐和等待。我们进一步构建了 GuidanceSalesBench,这是一个智能零售基准,包含状态清单、交互前视频、候选响应、动作条件结果和最佳动作标签。当以真实顾客状态为条件以隔离动作选择时,PIWM 在 30 个保留目标视频上达到 0.641 macro F1,优于零样本 Qwen2.5-VL-7B 基线和没有平衡动作监督的训练变体;端到端仅视频选择降至 0.295,低于 5 类平衡随机基线 0.414,将视频到状态的基础定位确定为部署阶段的主要瓶颈。一项初步的分阶段真实商店试点(由付费参与者执行脚本化顾客行为录制)在 20 个完全标注视频上达到 0.579 动作 macro F1,并额外发布了 10 个带有索引级标签的可访问视频。

英文摘要

Multimodal retail agents should not only recognize what a customer is doing, but also decide whether and how to assist before an explicit request is made. We study this setting through the See--Infer--Intervene (SII) framework, where a device must see pre-interaction behavior, infer latent customer intent, and act by selecting an appropriate service intervention or choosing to wait. We instantiate SII with the Proactive Intent World Model (PIWM), which represents customer state with AIDA (Attention, Interest, Desire, Action) purchasing phases and BDI (belief, desire, intention) psychological fields, predicts action-conditioned intent transitions, and selects from five response classes: Greet, Elicit, Inform, Recommend, and Hold. We further construct GuidanceSalesBench, a smart-retail benchmark containing state manifests, pre-interaction videos, candidate responses, action-conditioned outcomes, and best-action labels. When conditioned on ground-truth customer state to isolate action selection, PIWM achieves 0.641 macro F1 on 30 held-out target videos, outperforming a zero-shot Qwen2.5-VL-7B baseline and training variants without balanced action supervision; end-to-end video-only selection drops to 0.295, below the 5-class balanced random baseline of 0.414, identifying video-to-state grounding as the dominant deployment-time bottleneck. A preliminary staged real-store pilot (recorded with paid participants performing scripted customer behaviors) reaches 0.579 action macro F1 on 20 fully annotated videos, with 10 additional accessible videos released with index-level labels.
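All results above are reported as macro F1 over the five response classes. A small self-contained sketch of that metric follows. The convention that a class with no true or predicted instances scores 0 is an assumption (the abstract does not state how such classes are handled), and the example labels are made up.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores.

    A class that never appears in y_true or y_pred contributes 0.0
    (an assumed convention; other tools may skip such classes).
    """
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical best-action labels and predictions for four videos.
y_true = ["Greet", "Hold", "Hold", "Inform"]
y_pred = ["Greet", "Hold", "Inform", "Inform"]
print(macro_f1(y_true, y_pred, ["Greet", "Hold", "Inform"]))
```

Because every class counts equally, a rare class such as Hold can move the score as much as a frequent one, which is why the abstract's mention of balanced action supervision matters for this metric.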

2601.15408 2026-06-09 cs.CV cs.AI cs.CL cs.LG 版本更新

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

CURE:基于课程引导的多任务训练实现可靠的解剖学接地报告生成

Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem

发表机构 * Pontificia Universidad Católica de Chile(智利天主教大学) CENIA iHEALTH KAUST(阿卜杜拉国王科技大学)

AI总结 提出CURE框架,通过课程学习动态调整多任务训练,提升医学报告生成的视觉接地准确性和事实一致性,无需额外数据。

Comments 31 pages, 7 figures, accepted to CVPR 2026 (oral)

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 36279-36289
AI中文摘要

医学视觉语言模型可以自动生成放射学报告,但在精确的视觉接地和事实一致性方面存在困难。现有模型常常将文本发现与视觉证据错误对齐,导致不可靠或弱接地的预测。我们提出CURE,一个错误感知的课程学习框架,无需任何额外数据即可改善接地和报告质量。CURE在短语接地、接地报告生成和解剖学接地报告生成上,使用公共数据集微调多模态指令模型。该方法基于模型性能动态调整采样,强调困难样本以改善空间和文本对齐。CURE将接地准确率提高了+0.35 IoU,报告质量提高了+0.192 CXRFEScore,并将幻觉减少了18.6%。CURE是一个数据高效的框架,增强了接地准确性和报告可靠性。代码见 https://github.com/PabloMessina/CURE,模型权重见 https://huggingface.co/pamessina/medgemma-4b-it-cure

英文摘要

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.35 IoU, boosts report quality by +0.192 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
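Grounding accuracy above is reported in IoU. For readers unfamiliar with the metric, here is a minimal sketch of intersection-over-union for two axis-aligned boxes. The `(x1, y1, x2, y2)` corner format is an assumption, since the abstract does not state how boxes are encoded.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2).

    Assumes x1 <= x2 and y1 <= y2 for both boxes.
    """
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero if boxes do not overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square.
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```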

2606.00384 2026-06-09 cs.AI cs.CL cs.CV cs.LG stat.CO 版本更新

VESTA: Visual Exploration with Statistical Tool Agents

VESTA: 基于统计工具代理的视觉探索

William Rudman, Abhishek Divekar, Kanishk Jain, Sebastian Joseph, Stella S. R. Offner, Matthew Lease, Kyle Mahowald, Greg Durrett, Junyi Jessy Li

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) New York University(纽约大学)

AI总结 提出VESTA框架,通过动态增长的工具集指导数据变换、假设驱动可视化和统计检验,提升视觉语言模型在复杂统计建模任务上的性能。

详情
AI中文摘要

将定量模型拟合到数据上是科学工作流程中的核心步骤,但它仍然是最少自动化的步骤之一。最近的基于代理的系统利用语言和视觉语言模型(VLM)来迭代地提出和优化统计模型,但这些系统在更具挑战性的建模任务上表现不佳。为了解决这些限制,我们引入了VESTA:基于统计工具代理的视觉探索,这是一个框架,为VLM配备了一个动态增长的探索工具包,通过数据变换、假设驱动的可视化和稳健的统计检验来指导模型优化。与之前仅依赖迭代批评的系统不同,VESTA在优化之前和优化过程中通过选择或创建诊断工具主动探索数据,这些工具会累积在模型的上下文中,并可在以后重用。我们在三种工具配置下评估VESTA与已建立的基线:无工具、静态专家编写的工具和动态模型编写的工具。为了支持这一评估,我们引入了DAWN(自动工作流和数值建模数据集),这是一个针对分布拟合和时间序列建模的基准,具有不同的难度等级,并最终涉及真实世界的天文学任务,包括建模初始质量函数和引力波啁啾信号。我们发现VESTA的动态工具创建优于先前的代理流水线,在复杂和特定领域的任务上取得了最大的收益。我们进一步表明,动态生成的工具比现有视觉工具创建系统生成的工具复杂得多,每个函数覆盖更多的诊断类别,并且强烈倾向于VLM批评者可以直接推理的视觉输出。

英文摘要

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.

2606.07435 2026-06-09 cs.CV cs.CL 版本更新

The Lipreading Gap: Do VSR Models Perceive Visual Speech Like Human Lipreaders?

唇读差距:VSR模型是否像人类唇读者一样感知视觉语音?

Rishabh Jain, Naomi Harte

发表机构 * Sigmedia Group(Sigmedia集团) School of Engineering(工程学院) Trinity College Dublin(都柏林大学)

AI总结 通过对比VSR系统与人类在MaFI数据集上的表现,发现模型虽整体准确率更高,但错误模式与人类不同,主要依赖训练数据中的语言线索而非视觉感知。

Comments Accepted at INTERSPEECH 2026

详情
AI中文摘要

视觉语音识别(VSR)模型在基准测试中现已超越人类唇读者,但这样的进步是否建立了类人的视觉语音感知?为探究此问题,我们使用MaFI词级唇读数据集,在词、字符、音素和视位级别上比较了三个VSR系统与人类基线。尽管模型实现了更高的整体准确率,但它们在不同于人类的单词上成功和失败。仅给定少量初始音素的纯文本n-gram基线可与人类唇读相媲美。VSR词级错误始终能更好地通过训练词频而非词的视觉信息量来解释。视位准确率、混淆矩阵以及人类-模型相关性进一步表明,模型在人类认为最难的视位上获益最多,并且对视觉清晰度的依赖性弱得多。我们的工作表明,VSR系统主要依赖训练数据中的语言线索而非视觉感知,未能将视觉特征绑定为有意义的单词。

英文摘要

Visual speech recognition (VSR) models now surpass human lipreaders on benchmarks, but do such gains establish human-like visual speech perception? To explore this, we compare three VSR systems with human baselines on the MaFI word-level lipreading dataset using word, character, phoneme, and viseme-level metrics. Although models achieve higher overall accuracy, they succeed and fail on different words than humans. A text-only n-gram baseline given only a few initial phonemes rivals human lipreading. VSR word-level errors are consistently better explained by training word frequency than by the visual informativeness of words. Viseme accuracies, confusion matrices and human-model correlations further show that models gain most on visemes humans find hardest, and show much weaker dependence on visual clarity. Our work demonstrates that VSR systems rely primarily on language cues from training data rather than visual perception, failing to bind visual features into meaningful words.

8. 语音语言联合与音频文本 15 篇

2606.07547 2026-06-09 cs.CL cs.AI cs.SD 新提交

Liberating LLM Capabilities in Full-Duplex Speech Models

在全双工语音模型中释放LLM能力

Luoyuan Zhang, Bokai Xu, Junbo Cui, Weiyue Sun, Yingjing Xu, Hanyu Liu, Yuan Yao

发表机构 * Royal Zhang(皇家张)

AI总结 提出Listen-Write-Speak (LWS)三通道范式,使LLM在共享因果注意力上下文中同时监听、书写可见文本并实时口语回应,无需架构修改,实现全双工交互。

详情
AI中文摘要

基于语音的大型语言模型通常局限于口语回复,这将其面向用户的输出限制在可口头表达的内容上,并抑制了文本原生能力,如代码生成、结构化分析和实时交互中的多步推理,对于需要持久、结构化且可检查的中间输出的任务。现有工作改进了口语推理或全双工轮流发言,但仍将文本视为隐藏的中间状态或从属模态,而非第一类输出通道。我们提出Listen-Write-Speak (LWS),一种文本优先的三通道范式,其中单个自回归LLM持续监听用户音频,写出可见的自由形式文本作为其主要输出,并在共享因果注意力上下文中并行生成实时口语回应。该行为完全通过Token Schema实现,无需架构修改,并通过两阶段数据流水线学习,该流水线合成与揭示的输入时间线一致的每秒认知注释。实验上,LWS在Full-Duplex-Bench上展示了强大的全双工交互,在VoiceBench AlpacaEval上达到4.72,写作-口语一致性达92.6%,并在URO-Bench上持续优于其内部消融版本。这些结果表明,可见书写可以作为语音交互的第一类输出通道,而不会牺牲实时响应性。代码和数据集可在项目页面获取:https://royalzhang.com/project/lws-page/。

英文摘要

Speech-based large language models are typically constrained to spoken replies, which limits their user-facing outputs to what can be verbalized and suppresses text-native capabilities such as code generation, structured analysis, and multi-step reasoning in realtime interaction, for tasks that require persistent, structured, and inspectable intermediate outputs. Existing work improves spoken reasoning or full-duplex turn-taking, but still treats text as a hidden intermediate state or a subordinate modality rather than a first-class output channel. We propose Listen-Write-Speak (LWS), a text-first tri-channel paradigm in which a single autoregressive LLM continuously listens to user audio, writes visible free-form text as its primary output, and speaks a realtime oral response in parallel under a shared causal attention context. This behavior is implemented entirely through a Token Schema, requiring no architectural modifications, and learned via a two-stage data pipeline that synthesizes per-second cognitive annotations consistent with the revealed input timeline. Empirically, LWS demonstrates strong full-duplex interaction on Full-Duplex-Bench, reaches 4.72 on VoiceBench AlpacaEval, achieves 92.6% writing-speaking consistency, and consistently outperforms its internal ablations on URO-Bench. These results suggest that visible writing can serve as a first-class output channel for speech interaction without sacrificing realtime responsiveness. The code and dataset are available on the project page: https://royalzhang.com/project/lws-page/.

2606.07608 2026-06-09 cs.CL cs.AI cs.LG cs.SD 新提交

Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

针对瑞士德语音识别的Whisper字幕对齐微调:基准污染、惯例不匹配以及25.6% WER(13.8% cWER)的诚实基线

Felix Akeret

发表机构 * Independent Researcher, Zurich, Switzerland(独立研究员,瑞士苏黎世) ETH Zürich(苏黎世联邦理工学院) University of Bern(伯尔尼大学) FHNW(西北应用科学与艺术大学) CeTIM Leiden/Munich(CeTIM 莱顿/慕尼黑)

AI总结 通过1,367小时广播语音与标准德语字幕的弱监督,系统微调Whisper large-v3用于瑞士德语语音识别,发现公开结果因基准污染被高估,并发布两个诚实评估的模型。

Comments 15 pages, 21 tables. Models available at https://huggingface.co/Flix-AI

详情
AI中文摘要

我们提出了一项系统研究,针对OpenAI的Whisper large-v3进行微调,用于瑞士德语语音识别,使用1,367小时的广播语音与标准德语字幕作为弱监督。通过在NVIDIA DGX Spark(Grace Blackwell,128 GB统一内存,最高1 PFLOP FP4)上进行16次迭代训练,我们比较了LoRA和全微调(1.55B参数模型),研究了幻觉的根本原因,并量化了数据质量、字幕对齐和训练策略的影响。我们的最佳模型在严格不相交数据上的诚实评估中,在All Swiss German Dialects Test Set (ASGDTS)上实现了25.6%的测量WER。通过将真实错误与有效的风格变异(时态、词序、瑞士正字法)分离的协调错误分析,得到内容WER (cWER)为13.8%,仅计算实际识别失败。偏差校正估计将其降至8.5%,表明真实错误率约为测量WER的三分之一。我们证明,已发表的瑞士德语ASR最先进结果(17.1-17.5% WER)因基准污染而被夸大:一个在ASGDTS测试集上自训练的普通Whisper模型(零瑞士德语数据)实现了13.88% WER,超过了所有已发表系统。使用Phi-4-multimodal的实验显示出更强的记忆效应(3.9% WER),揭示该基准主要衡量惯例匹配而非方言理解。我们发布了两个模型,一个LoRA适配器(25.32% WER,13.9% cWER)和一个全微调模型(25.60% WER,13.8% cWER),这是少数公开可用、经过诚实评估的瑞士德语Whisper模型之一,采用Apache 2.0许可,完全可复现,无需机构数据协议。

英文摘要

We present a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired with Standard German subtitles as weak supervision. Through 16 iterative training runs on an NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory, up to 1 PFLOP FP4), we compare LoRA and full fine-tuning of the 1.55B-parameter model, investigate hallucination root causes, and quantify the effect of data quality, subtitle alignment, and training strategy. Our best model achieves 25.6% measured WER on the All Swiss German Dialects Test Set (ASGDTS) in an honest evaluation on strictly disjoint data. A harmonized error analysis separating genuine errors from valid stylistic variation (tense, word order, Swiss orthography) yields a content WER (cWER) of 13.8%, counting only actual recognition failures. Bias-corrected estimation reduces this to 8.5%, suggesting the true error rate is roughly one third of measured WER. We demonstrate that published state-of-the-art Swiss German ASR results (17.1-17.5% WER) are inflated by benchmark contamination: a vanilla Whisper model self-trained on the ASGDTS test set with zero Swiss German data achieves 13.88% WER, surpassing all published systems. Experiments with Phi-4-multimodal show an even stronger memorization effect (3.9% WER), revealing that the benchmark primarily measures convention matching rather than dialectal comprehension. We release two models, a LoRA adapter (25.32% WER, 13.9% cWER) and a full fine-tuned model (25.60% WER, 13.8% cWER), among the few publicly available, honestly evaluated Whisper models for Swiss German, under Apache 2.0 with full reproducibility, requiring no institutional data agreements.
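Every figure in this abstract is a word error rate, so a short reference implementation may help. The sketch below computes standard WER as word-level edit distance divided by reference length. It does no text normalization, and it does not reproduce the paper's content WER (cWER), which depends on an error analysis the abstract only summarizes. The German example sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))        # distances against an empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]                           # deleting the first i reference words
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,             # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (r != h))) # substitution or match
        prev = cur
    return prev[-1] / len(ref)

# One deleted word out of four reference words.
print(wer("das ist ein test", "das ist test"))
```

Because the subtitles are Standard German while the audio is dialectal, a valid paraphrase (different tense or word order) still counts as an error under this plain metric, which is the gap the cWER analysis is meant to separate out.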

2606.08486 2026-06-09 cs.CL 新提交

TRADE: Transducer-Augmented Decoder for Speech LLM

TRADE: 换能器增强的语音大语言模型解码器

Yun Tang, Shanil Puri, Shinji Watanabe, Subhabrata Mukherjee

发表机构 * Hippocratic AI Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出TRADE模型,通过换能器分支增强多模态大语言模型,实现帧同步对齐与语言推理结合,支持流式和非流式解码,在多个基准上取得低词错误率。

详情
AI中文摘要

语音大语言模型(Speech LLMs)缺乏原则性的流式推理机制:其标签同步生成没有声学帧对齐,使得实时解码和话语结束检测变得困难。我们提出TRADE(换能器增强解码器),它通过一个换能器分支增强多模态大语言模型,该分支共享音频编码器,并直接使用LLM的隐藏状态作为预测网络——将帧同步声学对齐与LLM的语言推理相结合。三个设计选择使系统准确、可流式处理且支持长语音:(1) 紧密耦合的双词汇表——从LLM词汇表导出的紧凑换能器词汇表,实现零成本分数融合;(2) 带梯度停止的块同步流式训练,消除训练-推理不匹配,内存成本与离线相当;(3) 局部解码器音频注意力(LDAA),一种因果滑动窗口,独立于话语长度限制KV缓存内存。单个TRADE检查点支持在连续延迟操作点范围内的离线与流式解码。TRADE在Open ASR排行榜上达到6.71%的平均词错误率,而同一检查点使用960ms块大小的流式识别达到8.40%。在长语音上,无需外部分割,在TED-LIUM上获得3.64%的词错误率,在Earnings-22上获得10.88%。TRADE提供句末标点时间戳,与声学语音活动检测(VAD)结合时,相比单独使用声学VAD,话语结束检测的F1值提高0.03。

英文摘要

Speech Large Language Models (Speech LLMs) lack a principled mechanism for streaming inference: their label-synchronous generation has no acoustic-frame alignment, making real-time decoding and end-of-utterance detection difficult. We propose TRADE (TRansducer-Augmented DEcoder), which augments a multimodal LLM with a transducer branch that shares the audio encoder and uses the LLM's hidden states directly as the prediction network -- coupling frame-synchronous acoustic alignment with the LLM's linguistic reasoning. Three design choices make the system accurate, streamable, and long-form capable: (1) Tightly coupled dual vocabularies -- a compact transducer vocabulary derived from the LLM vocabulary, enabling zero-cost score fusion; (2) Chunk-synchronized streaming training with gradient stopping, eliminating the train-inference mismatch at offline-equivalent memory cost; and (3) Localized Decoder Audio Attention (LDAA), a causal sliding window that caps KV-cache memory independently of utterance length. A single TRADE checkpoint supports offline and streaming decoding across a continuous range of latency operating points. TRADE achieves 6.71% average WER on the Open ASR Leaderboard, while the streaming recognition with 960ms chunk size reaches 8.40% from the same checkpoint. On long-form speech, it obtains 3.64% WER on TED-LIUM and 10.88% on Earnings-22 without external segmentation. TRADE provides sentence-end punctuation timestamps that, when combined with acoustic voice activity detection (VAD), improve end-of-utterance detection by +0.03 F1 over acoustic VAD alone.
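LDAA is described only as a causal sliding window that bounds KV-cache memory. The sketch below shows the generic form of such a mask, where each position may attend to itself and at most `window - 1` earlier positions. How TRADE maps decoder steps onto audio frames is not given in the abstract, so the single shared index used here is a simplifying assumption.

```python
import numpy as np

def causal_sliding_window_mask(n, window):
    """Boolean mask [n, n]; mask[i, j] is True when position i may attend to j.

    Position i sees only positions j with i - window < j <= i, so at most
    `window` keys are ever needed per query regardless of n.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

mask = causal_sliding_window_mask(5, 2)
print(mask.astype(int))
```

Since no row has more than `window` True entries, the cache can be kept as a fixed-size ring buffer, which is the property that makes memory independent of utterance length.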

2606.08748 2026-06-09 cs.CL 新提交

HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task

HydraQE: OSU 在 IWSLT 2026 语音翻译指标共享任务中的提交

Kevin Krahn, Eric Fosler-Lussier

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 提出 HydraQE,一个基于 Qwen3-ASR 的端到端无参考语音翻译质量估计系统,通过可学习的 sparsemax 标量混合和轻量双向 Transformer 实现跨模态交互,并在人类标注、MetricX-24 和 xCOMET 伪标签上训练三个预测头,优于级联文本基线。

Comments Accepted to IWSLT 2026; 9 pages, 3 figures, 4 tables

详情
AI中文摘要

我们提出了 HydraQE,这是我们对 IWSLT 2026 语音翻译指标共享任务的贡献。HydraQE 是一个基于 Qwen3-ASR 骨干网络的端到端、无参考质量估计(QE)系统,它接受源音频和翻译假设作为联合输入。来自所有骨干网络层的隐藏状态通过可学习的 sparsemax 标量混合进行组合,然后由轻量级双向 Transformer 重新编码,以便在池化为共享嵌入之前实现完整的跨模态交互。三个独立的预测头在互补的监督信号上训练:人工直接评估(DA)标注、MetricX-24 伪标签和 xCOMET 伪标签。为了解决人工标注数据的稀缺性,我们在合成损坏示例和带银标准(silver)伪标签的机器翻译输出的组合上进行训练,采用从合成数据和银标准数据开始并逐渐转向人工标注示例的课程学习。HydraQE 优于级联文本基线和先前的直接语音 QE 系统,证明了端到端语音翻译 QE 与级联方法具有竞争力。

英文摘要

We present HydraQE, our contribution to the IWSLT 2026 Speech Translation Metrics shared task. HydraQE is an end-to-end, reference-free quality estimation (QE) system for speech translation built on a Qwen3-ASR backbone, which accepts source audio and a translation hypothesis as joint input. Hidden states from all backbone layers are combined via a learnable sparsemax scalar mix, then re-encoded by a lightweight bidirectional Transformer to enable full cross-modal interaction prior to pooling into a shared embedding. Three independent prediction heads are trained on complementary supervision signals: human direct assessment (DA) annotations, MetricX-24 pseudo-labels, and xCOMET pseudo-labels. To address the scarcity of human-annotated data, we train on a combination of synthetically corrupted examples and silver pseudo-labeled machine translation outputs, using a curriculum that begins on synthetic and silver data and gradually shifts toward human-annotated examples. HydraQE outperforms cascaded text-based baselines and prior direct speech QE systems, demonstrating that end-to-end speech translation QE is competitive with cascaded approaches.
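
As a rough illustration of a sparsemax scalar mix, the sketch below projects per-layer scores onto the probability simplex with sparsemax (which, unlike softmax, can assign exactly zero weight to some layers) and uses the resulting weights to average the layers' hidden states. The layer scores, toy hidden states, and function names are assumptions for illustration; the abstract does not specify how HydraQE parameterizes the mix.

```python
def sparsemax(scores: list[float]) -> list[float]:
    """Euclidean projection of scores onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for j, value in enumerate(ordered, start=1):
        cumulative += value
        # The support condition holds for the first k sorted entries only,
        # so the last update leaves tau at the correct threshold.
        if 1.0 + j * value > cumulative:
            tau = (cumulative - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


def scalar_mix(layer_states: list[list[float]], layer_scores: list[float]) -> list[float]:
    """Weighted sum of per-layer hidden vectors using sparsemax weights."""
    weights = sparsemax(layer_scores)
    dim = len(layer_states[0])
    return [
        sum(w * state[d] for w, state in zip(weights, layer_states))
        for d in range(dim)
    ]


# Toy example: three "layers" with 2-dimensional hidden states.
layer_states = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
layer_scores = [1.0, 0.8, -1.0]  # learnable in a real model; fixed here
weights = sparsemax(layer_scores)
mixed = scalar_mix(layer_states, layer_scores)
print(weights, mixed)
```

In this toy case the third layer receives exactly zero weight, so its hidden state does not contribute to the mix at all.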

2606.09295 2026-06-09 cs.CL 新提交

NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

NüshuVoice:利用音高感知文本到语音技术复兴濒危女书的声音

Hongkun Yang, Xinhui Yi, Xiyan Zhao, Yibo Meng, Lionel Z. Wang, Lixu Wang, Yaqi Zhang, Ruiqi Chen, Xuanyue Zhao, Lanxin Zhang, Yu Zeng, Weijia Chu, Yiming Ma, Chenyu Liu, Jianghao Lin, Xin Xu

发表机构 * Ocean University of China(中国海洋大学) The Hong Kong Polytechnic University(香港理工大学) Cornell University(康奈尔大学) Nanyang Technological University(南洋理工大学) Shanghai Jiao Tong University(上海交通大学) University of Michigan–Ann Arbor(密歇根大学安娜堡分校) University of Science and Technology of China(中国科学技术大学) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 针对女书语音数据稀缺问题,提出NüshuVoice基准和F0条件VITS框架Nüshu-PitchVITS,利用五级音高标注作为韵律先验,在频谱保真度、音高重建和可懂度上优于强基线。

Comments 12 pages, 3 figures

详情
AI中文摘要

女书是一种濒危的音节文字,历史上由中国湖南省南部江永县的女性使用。现有的女书计算研究主要关注文本数字化和视觉识别,其真实发音的声学重建仍基本未被探索。构建女书文本到语音(TTS)系统尤其具有挑战性,因为可用的录音极其有限,且大多为孤立的音节级发音而非自然的句子级话语。在这项工作中,我们介绍了NüshuVoice,这是首个女书TTS基准。我们构建了一个句子级女书文本到音频数据集,对齐了标准化的Unicode女书文本、音标、标准中文翻译和档案录音。为了在这种极端低资源设置下合成语音,我们提出了Nüshu-PitchVITS,一种F0条件VITS框架,利用女书的五级音高符号作为显式的韵律归纳偏置。实验结果表明,Nüshu-PitchVITS在频谱保真度、音高重建和人类评定的可懂度方面优于强TTS基线。我们公开发布了数据集和代码,网址为:https://anonymous.4open.science/r/Nvshu-TTS-2EB6。

英文摘要

Nüshu is an endangered phonetic script historically used by women in Jiangyong County, southern Hunan, China. While existing computational studies of Nüshu mainly focus on textual digitization and visual recognition, the acoustic reconstruction of its authentic pronunciation remains largely unexplored. Building a Nüshu text-to-speech (TTS) system is particularly challenging because available recordings are extremely limited and mostly consist of isolated syllable-level pronunciations rather than natural sentence-level utterances. In this work, we introduce NüshuVoice, the first TTS benchmark for Nüshu. We construct a sentence-level Nüshu text-to-audio dataset that aligns standardized Unicode Nüshu text, phonetic transcriptions, standard Chinese translations, and archival recordings. To synthesize speech under this extreme low-resource setting, we propose Nüshu-PitchVITS, an F0-conditioned VITS framework that leverages Nüshu's five-level pitch notation as an explicit prosodic inductive bias. Experimental results show that Nüshu-PitchVITS outperforms strong TTS baselines in spectral fidelity, pitch reconstruction, and human-rated intelligibility. We publicly release the dataset and code at: https://anonymous.4open.science/r/Nvshu-TTS-2EB6.

2606.09424 2026-06-09 cs.CL 新提交

Toward Signing Activity Projection in Sign Language Interaction

面向手语交互中的手语活动预测

Takao Obi, Wang Yusong, Koji Inoue, Kotaro Funakoshi

发表机构 * Institute of Science Tokyo(东京科学大学) Kyoto University(京都大学)

AI总结 本研究探索将语音活动预测(VAP)框架迁移至双人手语交互,利用公共DGS语料库提取手语活动流,基于姿态特征进行轮换预测,结果表明HOLD/SHIFT预测有潜力但SHIFT预测困难。

详情
AI中文摘要

社交机器人不仅需要与以语音为中心的系统所假设的用户进行稳健交互,还需要与依赖不同模态(例如手语)进行交流的多样化用户进行交互。一个重要的能力差距是与手语用户进行预测性轮换。尽管语音活动预测(VAP)已成功用于模拟口语交互中的未来语音活动,但该框架是否适用于手语交互仍不清楚。本文提出了将VAP架构适应双人手语交互的初步迁移研究。使用公共DGS语料库的交互录音,我们从词汇手语标注中推导出二进制手语活动流,并制定轮换预测的代理任务。模型使用每个手语者提取的基于姿态的手部、眼部区域和嘴部区域特征。结果表明,SHIFT/HOLD预测是有前景的,尤其是利用手部线索,而SHIFT预测仍然困难。这些发现为将预测性轮换模型从口语交互迁移到手语交互的潜力和当前局限性提供了初步证据。手语交互的预测建模仍然需要超越语音衍生类别的手语特定事件定义。

英文摘要

Social robots must interact robustly not only with users assumed by speech-centered systems but also with diverse users whose communication relies on different modalities, e.g., sign language. One important capability gap is predictive turn-taking with signing users. Although Voice Activity Projection (VAP) has been successfully used to model future voice activity in spoken interaction, it remains unclear whether the framework transfers to sign language interaction. This paper presents an initial transfer study of adapting a VAP architecture to dyadic sign language interaction. Using interaction recordings from the Public DGS Corpus, we derive binary signing activity streams from lexical sign annotations and formulate proxy tasks for turn-taking prediction. The model uses pose-derived hand, eye-region, and mouth-region features extracted for each signer. The results show that SHIFT/HOLD prediction is promising, especially with hand cues, while SHIFT-prediction remains difficult. These findings provide initial evidence for both the promise and the current limitations of transferring predictive turn-taking models from spoken interaction to sign language interaction. Predictive modeling of sign language interaction still requires sign-language-specific event definitions that go beyond speech-derived categories.

2606.09470 2026-06-09 cs.CL cs.AI 新提交

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

一种用于联合多粒度L2评估和自然语言解释的微调SpeechLLM

Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik

发表机构 * Centre for Language Studies, Radboud University(语言研究中心,拉德堡德大学)

AI总结 提出一种基于评分准则的SpeechLLM,通过混合训练目标联合预测句子级和词/音素级标签并生成自然语言解释,在SpeechOcean762上达到或超越单粒度模型。

Comments Accepted to Interspeech 2026. This publication is part of the project Responsible AI for Voice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants, which is financed by the Dutch Research Council (NWO)

详情
AI中文摘要

自动化的L2语音评估可以分配熟练度标签,但通常缺乏可解释性。我们提出了一种基于评分准则的SpeechLLM,用于多角度、多粒度的评估,采用结合监督微调和有界直接偏好优化的混合目标进行训练。该模型在同一个响应中联合预测句子级(准确性、流利度、韵律)的序数标签、词/音素级准确性,并生成自然语言解释。在SpeechOcean762上,我们的方法匹配或优于单粒度模型,同时与先前方法保持竞争力。我们从两个维度分析解释的可靠性:与模型预测的自一致性和与真实标签的对齐,使用情感一致性(合理性)和基于提及的一致性(忠实性)。解释在句子级别是合理的,但在词/音素级别忠实性下降:参考稀疏且与词元级标签弱对齐。

英文摘要

Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal sentence-level labels (accuracy, fluency, prosody) and word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.

2606.09535 2026-06-09 cs.CL cs.SD 新提交

Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

克服Whisper在达罗毗荼语系和低资源语言中的解码器不一致性

Chowdam Venkata Kumar, Kumud Tripathi, Pankaj Wasnik

发表机构 * Sony Research India(索尼印度研究院)

AI总结 针对Whisper在达罗毗荼语系上词错误率高的问题,通过语言学和数据集分析发现词汇稀疏和字符级替换错误,提出加权注意力和自条件化两种解码器增强方法,显著降低低资源和黏着语言的WER。

Comments Accepted at INTERSPEECH 2026, 5 pages, 1 figure, 5 tables

详情
AI中文摘要

多语言ASR模型如Whisper在高资源语言上表现良好,但在达罗毗荼语系上的词错误率(WER)显著高于印度-雅利安语系。通过语言学和数据集分析,我们发现达罗毗荼语系具有更长的单词、更高的词汇多样性和更低的重复率,导致标记分布稀疏和频繁的字符级替换错误。基线微调进一步揭示了自注意力(语言上下文)和交叉注意力(声学线索)之间的解码器不平衡。尽管合成标记重复实验表明潜在收益,但实际不可行。受这些观察启发,我们引入了两种解码器级增强:加权注意力(自适应平衡注意力来源)和自条件化(重新注入中间预测以提高标记一致性)。实验表明,对于低资源和黏着语言,WER持续降低。

英文摘要

Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages compared to Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.

2606.07610 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

LEAF: 无需分支的树生长方法用于语音感知大语言模型后训练

Argyrios Gerogiannis, Yekaterina Yegorova, Mark Hasegawa-Johnson, Venugopal V. Veeravalli

发表机构 * University of Illinois, Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对语音感知大语言模型后训练中GRPO方法粗粒度信用分配问题,提出LEAF方法,通过回溯式树结构学习、高信息量边界选择和跨度级优势分配,在语音问答和翻译任务上超越GRPO。

Comments 15 pages, 3 figures, 11 tables

详情
AI中文摘要

最先进的GRPO风格方法在语音感知大语言模型后训练中存在粗粒度信用分配问题,将相同的终端奖励优势广播给响应中的每个token。这忽略了rollout批次中的有用结构,其中语音条件下的补全通常共享前缀,然后在重要决策处出现分歧。我们提出低秩探索自适应分叉(LEAF),一种基于回溯树的强化学习方法,无需在线分支或额外解码即可恢复这种结构。LEAF采样完整响应,选择高信息量边界,按共享前缀分组响应,并使用后代奖励分配跨度级优势。我们从理论上证明了LEAF的跨度级信用分配和边界选择设计。实验上,在相同的rollout和低秩适应预算下,LEAF在语音问答和语音翻译基准上优于GRPO。值得注意的是,较小的LEAF训练模型优于当前最先进的完全参数基线。

英文摘要

State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. This ignores useful structure within rollout batches, where speech-conditioned completions often share prefixes before diverging at important decisions. We propose Low-rank Exploration with Adaptive Forking (LEAF), a retrospective tree-based RL method that recovers this structure without online branching or additional decoding. LEAF samples complete responses, selects high-surprisal boundaries, groups responses by shared prefixes, and assigns span-level advantages using descendant rewards. We theoretically justify LEAF's span-level credit assignment and boundary-selection design. Empirically, LEAF improves over GRPO across speech question answering and speech translation benchmarks under the same rollout and low-rank adaptation budget. Notably, smaller LEAF-trained models outperform current state-of-the-art, full-parameter baselines.
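
The idea of span-level credit from shared prefixes can be sketched as follows. This is not LEAF's actual estimator, which the abstract does not spell out. The sketch assumes that the value of a prefix is the mean terminal reward of all sampled responses sharing that prefix, and that a span's advantage is the change in that value across the span. The boundary positions are taken as given (hypothetical inputs) rather than selected by surprisal.

```python
def prefix_value(prefix: tuple, responses: list[list[str]], rewards: list[float]) -> float:
    """Mean reward of the sampled responses that start with `prefix`."""
    matched = [
        reward
        for tokens, reward in zip(responses, rewards)
        if tuple(tokens[: len(prefix)]) == prefix
    ]
    return sum(matched) / len(matched)


def span_advantages(responses, rewards, boundaries):
    """Per response, a list of (start, end, advantage) for each span."""
    results = []
    for tokens in responses:
        inner = [c for c in sorted(set(boundaries)) if 0 < c < len(tokens)]
        cuts = [0] + inner + [len(tokens)]
        spans = []
        for start, end in zip(cuts, cuts[1:]):
            before = prefix_value(tuple(tokens[:start]), responses, rewards)
            after = prefix_value(tuple(tokens[:end]), responses, rewards)
            spans.append((start, end, after - before))
        results.append(spans)
    return results


# Toy rollout batch: three complete responses with terminal rewards.
responses = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
rewards = [1.0, 0.0, 0.0]
advantages = span_advantages(responses, rewards, boundaries=[1, 2])
for tokens, spans in zip(responses, advantages):
    print(tokens, spans)
```

Under these assumptions the span advantages of one response telescope to its reward minus the batch mean, so the total credit matches an unnormalized sequence-level advantage while the per-span values show where the responses diverged.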

2606.08210 2026-06-09 eess.AS cs.CL cs.SD 交叉投稿

Paediatric-HGNN: A Hybrid Heterogeneous Graph Neural Network for Detecting Disfluency in Children's Speech via Multiscale Acoustic Fusion

Paediatric-HGNN:一种通过多尺度声学融合检测儿童言语不流畅的混合异构图神经网络

Rashini Liyanarachchi, Rachael Mackay, Alison Short, Aditya Joshi, Erik Meijering

发表机构 * University of New South Wales(新南威尔士大学) Western Sydney University(西悉尼大学) Resourced Music Therapy(资源音乐治疗)

AI总结 针对儿童言语中声学变异大、病理口吃与发育性不流畅难以区分的问题,提出Paediatric-HGNN框架,通过构建异构图捕获词汇与声学片段的分层关系,在儿童语料上实现82.4%加权准确率和0.386的典型不流畅F1分数。

Comments Accepted at INTERSPEECH 2026 (Main)

详情
AI中文摘要

自动口吃检测(ASD)系统在处理儿童言语时面临挑战,因为发育中的声音具有高声学变异性,且病理性口吃与典型发育性不流畅之间存在细微差别。我们提出了Paediatric-HGNN,一个使用上下文感知部分-整体交互网络(CaPIN)的框架,专门针对儿童数据定制。与传统的1D信号建模不同,我们的方法构建了一个异构图,捕获词汇单元(词节点)和细粒度声学片段(帧节点)之间的层次关系。在精选的儿童语料库(UCLASS和FluencyBank)上训练后,Paediatric-HGNN实现了82.4%的加权准确率和0.386的典型不流畅F1分数。对层次化词汇-声学交互的建模捕获了发育中的“搜索”行为,为早期临床干预提供了更稳健和可解释的工具。

英文摘要

Automated stuttering detection (ASD) systems struggle with paediatric speech due to high acoustic variability in developing voices and the subtle distinction between pathological stuttering and typical developmental disfluencies. We introduce Paediatric-HGNN, a framework using a Context-aware Part-whole Interaction Network (CaPIN) tailored for paediatric data. Instead of conventional 1D signal modelling, our approach builds a heterogeneous graph capturing hierarchical relationships between lexical units (word nodes) and fine-grained acoustic segments (frame nodes). Trained on curated paediatric corpora (UCLASS and FluencyBank), Paediatric-HGNN achieves 82.4% weighted accuracy and a Typical Disfluency F1-score of 0.386. Modelling hierarchical lexical-acoustic interactions captures developmental "searching" behaviour, offering a more robust and interpretable tool for early clinical intervention.

2606.08425 2026-06-09 cs.SD cs.CL eess.AS 交叉投稿

TinyGiantALM: A Compact Audio-Language Model for Intent-Aware Reasoning under Resource Constraints

TinyGiantALM:面向资源约束下意图感知推理的紧凑型音频-语言模型

Vinh-Thuan Ly

发表机构 * University of Science, VNU-HCM(胡志明市国立大学下属理科大学) Vietnam National University, Ho Chi Minh City(胡志明市国立大学)

AI总结 提出紧凑型1.5B参数音频-语言模型TinyGiantALM,通过指令感知特征精炼框架(查询引导投影器+语义门控)过滤用户意图相关声学信号,在MMAR基准上零样本准确率46.4%,超越7B-13B基线,并优于8倍大模型。

Comments Accepted to Interspeech 2026. Project page: https://interspeech-tinygiant-alm.vercel.app

详情
AI中文摘要

当前音频推理的进展依赖于大规模音频-语言模型(LALMs),阻碍了在资源受限环境中的部署。我们提出了TinyGiantALM,一个紧凑的1.5B参数效率导向替代方案。不同于暴力扩展规模,我们提出了一种指令感知特征精炼框架,使用查询引导投影器和语义门控,基于用户意图过滤声学信号。在MMAR基准上,TinyGiantALM实现了46.4%的零样本准确率,显著优于7B-13B基线。虽然在逻辑叙事推理方面与30B+模型存在差距,且在过于密集或空间场景中存在某些权衡,但我们的方法在解耦混合模态环境方面显著优于高达8倍大小的模型。这些发现表明,架构精度为在边缘友好规模上获得稳健感知能力提供了一条切实可行的路径。

英文摘要

Current advancements in Audio Reasoning rely on massive Large Audio-Language Models (LALMs), hindering deployment in resource-constrained environments. We introduce TinyGiantALM, a compact 1.5B efficiency-oriented alternative. Instead of brute-force scaling, we propose an Instruction-Aware Feature Refinement framework using a Query-guided Projector and Semantic Gating to filter acoustic signals based on user intent. On the MMAR benchmark, TinyGiantALM achieves 46.4% zero-shot accuracy, significantly outperforming 7B-13B baselines. While a reasoning gap in logical narrative remains versus 30B+ models and certain trade-offs exist in overly dense or spatial scenes, our approach notably surpasses models up to 8x larger in disentangling mixed-modality environments. These findings demonstrate that architectural precision offers a tangible pathway to secure robust perception capabilities on edge-friendly scales.

2606.08573 2026-06-09 cs.LG cs.CL 交叉投稿

Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

Titans-as-a-Layer:对话语音情感识别的测试时记忆

Daniel Chen, Qicong Hu, Yang Xiao, Ting Dang, Hong Jia

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出Memory-as-a-Layer (MAL)适配器,利用测试时神经记忆为对话语音情感识别提供上下文,在不修改大型音频语言模型的前提下提升性能。

Comments ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

语音情感识别(SER)通常被表述为话语级分类,尽管对话中的情感取决于说话者惯常的音域和先前话语建立的情感上下文。语音语言模型提供了强大的预训练声学和语义表示,并可以通过微调适配到SER标签,但这种机制仍然缺少逐对话的状态。我们研究测试时神经记忆能否在保持大型音频语言模型(LALM)主干不变的情况下提供这种缺失的上下文。基于Titans,我们引入了一种即插即用的Memory-as-a-Layer(MAL)适配器,它将对话历史写入小型神经记忆,并以与音频token对齐的残差更新形式读回,避免了对宿主模型token位置的更改。在不同音频LLM和情感识别数据集上的评估中,我们的设计在多种评估指标上提升了SER性能,支持将测试时记忆作为对话SER的残差上下文机制。

英文摘要

Speech emotion recognition (SER) is commonly formulated as utterance-level classification, although conversational emotion depends on a speaker's usual vocal range and the emotional context established by previous utterances. Speech-language models provide strong pretrained acoustic and semantic representations and can be adapted to SER labels via fine-tuning, but this mechanism still lacks per-dialogue state. We study whether test-time neural memory can supply this missing context while leaving the large audio-language model (LALM) backbone intact. Building on Titans, we introduce a plug-and-play Memory-as-a-Layer (MAL) adapter that writes dialogue history into a small neural memory and reads it back as an audio-token-aligned residual update, avoiding changes to the host model's token positions. In evaluations across different audio LLMs and emotion recognition datasets, our design improves SER performance on multiple evaluation metrics, supporting test-time memory as a residual contextual mechanism for conversational SER.

2606.09667 2026-06-09 eess.AS cs.CL cs.SD 交叉投稿

Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

基于sEMG和唇读的鲁棒无声语音合成的跨模态掩蔽

Eder del Blanco, David Gimeno-Gómez, Eva Navas, Carlos-D. Martínez-Hinarejos, Inma Hernáez

发表机构 * Aholab research group within the HiTZ Center at University of the Basque Country (UPV/EHU)(巴斯克大学HiTZ中心内Aholab研究组) PRHLT research center, Universitat Politècnica de València (UPV)(瓦伦西亚理工大学PRHLT研究中心)

AI总结 提出掩蔽多模态语音合成框架,联合表面肌电图和唇读信号,通过训练时模态掩蔽提升鲁棒性,在多说话人设置下词错误率降低14个百分点。

Comments 12 pages, 7 figures and 6 tables. Submitted to Transactions on Audio, Speech and Language Processing

详情
AI中文摘要

通过无声语音接口进行语音恢复已成为针对喉部发声受损或缺失个体的有前景的辅助技术。在非侵入式无声语音接口模态中,表面肌电图和基于视频的唇读提供了互补的发音信息,然而它们用于连续语音合成的集成仍未被充分探索。此外,现有的多模态方法很少考虑对模态退化或临时传感器故障的鲁棒性,限制了它们在现实场景中的适用性。在这项工作中,我们提出了一种掩蔽多模态语音合成框架,通过在训练期间进行模态掩蔽来联合利用表面肌电图和唇读信号。在多说话人设置下,与最强的单模态基线相比,所提出的方法将词错误率降低了多达14个绝对百分点。实验结果不仅表明掩蔽策略对于这些性能提升和低比特率条件下的鲁棒性至关重要,而且表明在模态缺失情况下,它们比针对退化的数据增强具有更好的泛化能力。音素级分析进一步揭示了跨模态的互补贡献,对元音和特定辅音组尤其有益。总体而言,这些发现证明了掩蔽多模态集成用于无声语音合成的有效性和鲁棒性,尽管适应喉切除说话者仍是一个开放的研究挑战。

英文摘要

Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results not only show that masking strategies are critical for these performance gains and robustness under low-bitrate conditions, but also that they generalize better than degradation-specific data augmentations in the presence of modality absence conditions. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.
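
Modality masking during training can be sketched in a few lines. The abstract does not describe the masking schedule or granularity, so the version below makes a simple assumption: with probability `p_mask`, one of the two input streams is zeroed out for the whole example, and the other is always kept so that some signal remains. The feature shapes, the probability, and the function name are hypothetical.

```python
import random


def mask_modalities(semg, video, p_mask, rng):
    """Zero out at most one modality; inputs are lists of feature frames."""
    if rng.random() < p_mask:
        if rng.random() < 0.5:
            semg = [[0.0] * len(frame) for frame in semg]
        else:
            video = [[0.0] * len(frame) for frame in video]
    return semg, video


# Toy features: 2 sEMG frames with 2 channels, 2 video frames with 3 features.
semg = [[0.2, 0.4], [0.1, 0.3]]
video = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

rng = random.Random(0)  # seeded so the sketch is repeatable
masked_semg, masked_video = mask_modalities(semg, video, p_mask=1.0, rng=rng)
print(masked_semg, masked_video)
```

A model trained on such examples sees inputs where one sensor is missing, which is the kind of situation (sensor failure, modality absence) the abstract evaluates at test time.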

2605.19266 2026-06-09 cs.CL cs.AI 版本更新

FormalASR: End-to-End Spoken Chinese to Formal Text

FormalASR: 语音中文到正式文本的端到端系统

Wanyi Ning, Yinshang Guo, Haitao Qian, Jiyuan Cheng, Weiyuan Feng, Yufei Zhang

发表机构 * arXiv

AI总结 本文提出FormalASR,一种端到端的中文语音到正式文本转换模型,通过构建大规模的语音到正式文本数据集并对Qwen3-ASR进行微调,相对逐字转录基线最高可将CER相对降低37.4%,同时提升了ROUGE-L和BERTScore指标,提供了一个轻量级的设备端解决方案。

详情
AI中文摘要

自动语音识别(ASR)系统通常针对逐字转录进行优化,这会保留不流畅现象、填充词和非正式口语结构,而这些内容往往不适合面向写作的下游应用。常见的变通做法是采用ASR+LLM的两阶段流程进行后期编辑,但这种设计增加了延迟和内存成本,并且难以在设备端部署。我们提出了FormalASR,即两个紧凑的端到端模型(0.6B和1.7B),可直接将中文口语转录为正式书面文本。为此,我们通过基于LLM的重写和质量过滤构建了两个大规模口语到正式文本数据集:WenetSpeech-Formal和Speechio-Formal。然后我们在两个规模(0.6B和1.7B)上对Qwen3-ASR进行监督微调。在WenetSpeech-Formal和Speechio-Formal上的实验表明,FormalASR相对逐字转录基线最高可将CER相对降低37.4%,同时提高了ROUGE-L和BERTScore。FormalASR在部署时不需要后处理LLM,为口语到正式文本的转录提供了一个轻量级的设备端解决方案。

英文摘要

Automatic speech recognition (ASR) systems are typically optimized for verbatim transcription, which preserves disfluencies, filler words, and informal spoken structures that are often unsuitable for downstream writing-oriented applications. A common workaround is a two-stage ASR+LLM pipeline for post-editing, but this design increases latency and memory cost and is difficult to deploy on-device. We present FormalASR, two compact end-to-end models (0.6B and 1.7B) that directly transcribe spoken Chinese into formal written text. To enable this setting, we build WenetSpeech-Formal and Speechio-Formal, two large-scale spoken-to-formal datasets constructed by LLM-based rewriting and quality filtering. We then fine-tune Qwen3-ASR at two scales (0.6B and 1.7B) with supervised fine-tuning. Experiments on WenetSpeech-Formal and Speechio-Formal show that FormalASR achieves up to 37.4% relative CER reduction over verbatim baselines, while also improving ROUGE-L and BERTScore. FormalASR requires no post-processing LLM at deployment time, providing a lightweight, on-device solution for spoken-to-formal transcription.
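
To make the "relative CER reduction" figure concrete, the sketch below computes character error rate as character-level edit distance divided by reference length, and relative reduction as the drop in CER divided by the baseline CER. The example strings and CER values are made up for illustration and are not results from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate; the reference must be non-empty."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Fraction of the baseline error removed by the system."""
    return (baseline - system) / baseline


# Hypothetical example: one wrong character out of six.
print(cer("我们今天开会", "我们今天开灰"))
# Hypothetical CERs: going from 20% to 15% is a 25% relative reduction,
# even though the absolute drop is only 5 points.
print(relative_reduction(0.20, 0.15))
```

The distinction matters when reading the abstract: a 37.4% relative reduction describes the share of baseline errors removed, not a drop of 37.4 absolute CER points.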

2605.06582 2026-06-09 cs.LG cs.CL cs.SD 版本更新

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

PairAlign:一种通过自对齐的序列标记化框架及其在音频标记化中的应用

Adhiraj Banerjee, Vipul Arora

发表机构 * Department of Electrical Engineering, Indian Institute of Technology, Kanpur(电子工程系,印度理工学院,坎浦尔)

AI总结 PairAlign通过序列级自对齐实现紧凑音频标记化,利用条件序列生成方法,提升标记一致性、长度控制和编辑相似性。

Comments 57 pages main content, 109 total pages, 9 Figures, pre-print, Under Review

详情
AI中文摘要

许多感官数据的操作——比较、记忆、检索和推理——自然地在离散符号结构上表达。在语言中,这种接口由标记提供;在音频中,必须学习。现有音频标记器依赖于量化、聚类或编解码器重建,将标记局部分配,因此序列一致性、紧凑性、长度控制、终止和编辑相似性很少被直接优化。我们引入PairAlign,一种通过序列级自对齐实现紧凑音频标记化的框架。PairAlign将标记化视为条件序列生成:编码器将语音映射为连续条件,自回归解码器从BOS开始生成标记,学习标记身份、顺序、长度和EOS位置。给定两个保持内容的视图,每个视图的序列在另一个视图的表示下被训练为可能,而无关示例提供竞争序列。这为可扩展的编辑距离保留代理,同时抑制许多对一的坍缩。PairAlign从VQ式标记化开始,并通过EMA教师目标、交叉配对教师强制、前缀损坏、似然对比和长度控制进行优化。在3秒语音上,PairAlign学习紧凑、非退化的序列,具有广泛的词汇使用和强跨视图一致性。在检索测试中,它保留编辑距离搜索,同时将存档标记数量减少55%。连续扫频探针显示其局部重叠低于密集几何标记器,但具有更强的长度控制和在100毫秒移位下的受约束编辑轨迹。PairAlign是一种序列符号预测学习者:像JEPA式目标一样,它从另一个视图预测一个抽象目标作为学习的可变长度符号序列,而不是连续潜在变量。

英文摘要

Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On retrieval tests, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.

9. 评测、数据集与基准 61 篇

2606.07521 2026-06-09 cs.CL cs.AI 新提交

Evaluating Hallucinations in Domain-Adapted Large Language Models

评估领域自适应大语言模型中的幻觉现象

Sanchita Porwal, Sai Prasath S, Xingjian Bi, Madelyn Scandlen

发表机构 * College of Computing, Georgia Institute of Technology(佐治亚理工学院计算学院)

AI总结 本研究通过微调Llama-2模型,测试其记忆、回忆和推理能力,发现领域自适应大语言模型在生成新领域特定信息时存在幻觉问题,表明仅靠微调难以有效缓解幻觉。

Comments 13 pages, 2 figures, 3 tables

详情
AI中文摘要

本研究调查了领域自适应大语言模型(LLMs)中的幻觉现象,重点关注使用Lamini数据集对Llama-2模型进行微调。幻觉,即LLMs生成无意义或不忠实内容的现象,构成了重大挑战,尤其是当这些模型使用领域特定数据进行微调时。我们的方法包括一系列实验,测试微调后LLM的记忆、回忆和推理能力,并将其在新问答对和领域特定信息上的表现进行比较。我们发现,虽然模型在与训练数据相似的任务上表现出色,但其准确推理和回忆新领域特定信息的能力仍然有限,导致出现幻觉实例。模型倾向于提供带有额外信息的正确答案,表明存在过度生成的倾向。这些结果表明,仅靠微调方法在将LLMs适应专业领域时缓解幻觉存在重要局限性,并强调了在将LLMs适应专业领域时需要更鲁棒的方法。该研究还提供了关于LLMs在不同类型信息上表现差异的见解,揭示了其在处理领域特定查询时的相对弱点。

英文摘要

This study investigates the phenomenon of hallucinations in domain-adapted Large Language Models (LLMs), focusing on the fine-tuning of the Llama-2 model with the Lamini dataset. Hallucinations, or the generation of nonsensical or unfaithful content by LLMs, pose a significant challenge, especially when these models are fine-tuned with domain-specific data. Our methodology involves a series of experiments testing memorization, recall, and reasoning capabilities of the fine-tuned LLM, comparing its performance on novel question-answer pairs and domain-specific information. We found that while the model shows proficiency in tasks similar to its training data, its capability to accurately reason about and recall new domain-specific information remains limited, leading to instances of hallucination. The model demonstrates a tendency to provide correct answers with extra information, suggesting an inclination toward over-generation. These results suggest important limitations of fine-tuning-only approaches for mitigating hallucinations when adapting LLMs to specialized domains and underscore the need for more robust adaptation methods. The study also provides insights into the varying performance of LLMs on different types of information, revealing a comparative weakness in handling domain-specific queries.

2606.07778 2026-06-09 cs.CL 新提交

Unlocking Latent Value: Taxonomy-Guided Recovery of High-Performing Data from Low-Tier Web Corpora

解锁潜在价值:基于分类法从低层级网络语料库中恢复高性能数据

Neeraj Varshney, Sanket Lokegaonkar, Nasser Zalmout, Qingyu Yin, Priyanka Nigam, Bing Yin

发表机构 * Amazon(亚马逊)

AI总结 提出一种分类驱动框架,通过引入时效性和文化特异性两个新维度,结合两阶段过滤方法,从低质量网络数据中恢复高性能子集,在推理和编码任务上显著超越未过滤的高质量数据。

详情
AI中文摘要

主流的预训练网络数据筛选流程将文档质量压缩为单一复合分数,系统性地遗漏了评分器权重不足维度上的高价值内容。我们提出一个分类驱动的框架,通过沿复合分数无法捕捉的语义有意义维度进行过滤来恢复这一价值。首先,基于ESSENTIAL-WEB分类法,我们引入两个新维度:时效性和文化特异性,它们与现有维度的成对NMI较低。我们使用Qwen2.5 32B对1400万文档进行标注,并蒸馏成一个轻量级0.5B模型。为实现快速的语料库级标注,我们额外在E5嵌入上训练了一个7300万参数的多任务MLP,推理吞吐量提升50倍。其次,为应对过滤配置的组合爆炸,我们引入一个计算高效的两阶段框架:第一阶段在小规模上识别最强维度信号;第二阶段从最优表现者中构建并评估合取和析取复合过滤器,以完整缩放定律(scaling law)实验成本的一小部分识别高性能配置。将所选过滤器应用于被降级的网络数据,分类过滤后的子集在性能上超过其未过滤基线,甚至超越最高质量层级。在中层数据上,我们的最佳过滤器在推理、编码和知识基准上分别比未过滤基线提升12.1%、9.5%和2.0%,在推理和编码上分别超过未过滤的顶层数据6.7%和13.7%。此外,来自典型生产阈值以下两个层级的数据,其过滤后的子集在推理和编码上比未过滤基线提升22.3%和19.5%,在编码基准上超越顶层数据。这些结果表明,大量潜在价值仍锁定在被降级的网络数据中,而多维分类过滤是解锁这些价值的原理性且计算高效的钥匙。

英文摘要

Dominant web data curation pipelines for pretraining collapse document quality into a single composite score, systematically missing high-value content along dimensions the scorer underweights. We present a taxonomy-driven framework that recovers this value by filtering along semantically meaningful dimensions that composite scores fail to capture. First, building on the ESSENTIAL-WEB taxonomy, we introduce two novel dimensions: timeliness and cultural specificity, both of which show low pairwise NMI with existing ones. We annotate 14M documents using Qwen2.5 32B and distill into a lightweight 0.5B model. To enable rapid corpus-wide annotation, we additionally train a 73M multi-task MLP on E5 embeddings, achieving 50x inference throughput. Second, to navigate the combinatorial explosion of filter configurations, we introduce a compute-efficient two-pass framework: Pass 1 identifies the strongest dimension signals at small scale; Pass 2 constructs and evaluates conjunctive and disjunctive compound filters from the top performers, identifying high-performing configurations at a fraction of full scaling-law cost. Applying the selected filters to deprioritized web data, taxonomy-filtered subsets outperform their unfiltered baselines and even surpass the highest-quality tier. On mid-tier data, our best filter improves over its unfiltered baseline by 12.1% on reasoning, 9.5% on coding, and 2.0% on knowledge benchmarks, exceeding unfiltered top-tier data by 6.7% on reasoning and 13.7% on coding. Furthermore, filtered data from two tiers below the typical production threshold improves by 22.3% on reasoning and 19.5% on coding over its unfiltered baseline, surpassing top-tier data on coding benchmarks. These results establish that vast latent value remains locked in deprioritized web data, and that multi-dimensional taxonomy filtering is a principled, compute-efficient key to unlocking it.
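
Conjunctive and disjunctive compound filters over taxonomy labels can be sketched as plain predicate combinators. The dimension names follow the abstract, but the label values (`evergreen`, `regional`, and so on), the document records, and the helper names are hypothetical; the paper's actual label sets and selection procedure are not given in the abstract.

```python
def make_filter(dimension, allowed):
    """Predicate that keeps documents whose label for `dimension` is allowed."""
    allowed = set(allowed)
    return lambda doc: doc.get(dimension) in allowed


def conjunction(*filters):
    """Keep a document only if every component filter keeps it."""
    return lambda doc: all(f(doc) for f in filters)


def disjunction(*filters):
    """Keep a document if at least one component filter keeps it."""
    return lambda doc: any(f(doc) for f in filters)


# Hypothetical annotated documents.
docs = [
    {"id": 1, "timeliness": "evergreen", "cultural_specificity": "global"},
    {"id": 2, "timeliness": "dated", "cultural_specificity": "regional"},
    {"id": 3, "timeliness": "evergreen", "cultural_specificity": "regional"},
    {"id": 4, "timeliness": "dated", "cultural_specificity": "global"},
]

evergreen = make_filter("timeliness", {"evergreen"})
regional = make_filter("cultural_specificity", {"regional"})

keep_both = [d["id"] for d in docs if conjunction(evergreen, regional)(d)]
keep_either = [d["id"] for d in docs if disjunction(evergreen, regional)(d)]
print(keep_both, keep_either)
```

The conjunctive filter is stricter and yields a smaller subset, while the disjunctive one retains more data; which trade-off performs better is what a second evaluation pass of the kind described in the abstract would have to measure.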

2606.07810 2026-06-09 cs.CL cs.AI cs.LG 新提交

SLMJury: Can Small Language Models Judge as Well as Large Ones?

SLMJury:小型语言模型能否像大型模型一样进行评判?

Anish Laddha, Nitesh Pradhan, Gaurav Srivastava

发表机构 * LNMIIT Virginia Tech(弗吉尼亚理工大学)

AI总结 提出SLMJury框架,评估小型语言模型作为评判者的能力,发现领域依赖的过度思考效应、领域泛化差异、闭端与开端评判能力分离,以及多智能体辩论降低准确性。

详情
AI中文摘要

大型语言模型(LLMs)被广泛用作评估模型输出的评判者,但其高成本、延迟和不透明性限制了可扩展性。我们引入SLMJury,一个评估小型语言模型(SLMs)作为评判者的框架,涵盖两种范式:闭端二元正确性和开端质量评分。我们在四个模型家族的16个SLM评判者(0.6B-14B参数)上,跨十个基准进行基准测试:八个闭端任务涵盖数学、科学和通用推理(每个配置N=64,824个判断),以及用于摘要和对话评分的SummEval和MT-Bench。我们将评判形式化为预算条件函数,并研究五个维度。得出四个发现。(1)过度思考效应是领域依赖的:对于大多数评判者,快速10令牌判决在数学评判上匹配或优于扩展推理(在有帮助的情况下提升2-7%),而推理在通用任务上胜出高达23%。(2)领域泛化区分了模型家族,数学到通用准确率差距从低于10%到接近40%不等。(3)闭端和开端评判依赖不同的能力:最佳二元评判者(Phi-4)在MT-Bench上降至第9名,而经过推理训练的模型则反转了这一顺序。(4)在反思-批判-改进(RCR)辩论协议下,多智能体辩论在所有测试配置中降低了准确性,而顶级评判者抵抗六种对抗性人格的方差<=0.55%。可靠的自动评估不需要大型专有模型,但没有单一的SLM占主导地位。排行榜可在https://anishh15.github.io/SLMJury/获取,我们的框架代码和pip包公开在https://github.com/anishh15/SLMJury和https://pypi.org/project/slmjury/。

英文摘要

Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.

2606.07853 2026-06-09 cs.CL cs.AI New submission

Beyond English benchmarks: clinical LLM evaluation in Brazilian Portuguese

Giordano de Pinho Souza, Glaucia Melo, Josefino Cabral Melo Lima, Daniel Schneider

Affiliations * Federal University of Rio de Janeiro; Toronto Metropolitan University

AI summary: Introduces ClinicalBr, the first bilingual clinical benchmark built from Brazilian case reports. An evaluation of four models finds that the Portuguese-English performance gap is task-dependent: English has a clear advantage in diagnosis retrieval, while the gap disappears on the other tasks.

Abstract

Large Language Models are transforming the support for clinical decision and their application in real scenarios. Yet, most benchmarks are conducted in English, and cross-lingual evaluation is needed to tackle the language gaps in global access. We introduce ClinicalBr, the first bilingual benchmark for clinical decision built from real Brazilian case reports. The corpus contains 2,892 cases drawn from 28 SciELO medical journals, spanning 18 specialties, and is structured as parallel Portuguese-English pairs. Each case supports four evaluation tasks: diagnosis retrieval, differential diagnosis, exam recommendation, and treatment planning. We evaluate four models: MedGemma-27B, Sabiá-4, DeepSeek-R1, and o3-mini, across both languages. The central finding is that the Portuguese-English performance gap is task-dependent, not general. In diagnosis retrieval, English yields a consistent advantage across all models, with +7.5-12.1 accuracy points. This advantage disappears in differential diagnosis, exam recommendation, and treatment planning, where confidence intervals cross zero for most models and Portuguese completeness scores are marginally higher. Brazilian-endemic conditions proved easier than the full corpus, not harder, indicating that tropical presentations are adequately represented in current pre-training. Exam recommendation was the hardest task across all models and both languages, with F1 scores below 0.10, well below the differential diagnosis ceiling of 0.20-0.27.

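2606.07995 2026-06-09 cs.CL New submission

Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR

Hongye Liu, Rongmei Lin, Anurag Kashyap, Hejie Cui, Ricardo Henao, Besnik Fetahu, Bing Yin

Affiliations * Amazon; Duke University

AI summary: Introduces the ShopTrajQA benchmark and a Customer Agent framework that uses RLVR to train an agent to retrieve and parse external trajectory files autonomously through a code interpreter, sidestepping the LLM context-window limit and achieving strong performance on reasoning over ultra-long shopping trajectories.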

Abstract

Understanding customer shopping trajectories is essential for enabling personalized shopping experiences. However, shopping records (i.e., customer's search, clicks, purchases, etc.) often span long time horizons over multiple years, resulting in extremely long trajectories that pose significant challenges for existing large language models (LLMs). Despite the importance of this problem, existing benchmarks are limited to short customer trajectories, while real-world trajectories from large e-commerce platforms are rarely accessible due to data privacy constraints. To address this gap, we introduce ShopTrajQA, a long-context evaluation benchmark constructed from real-world product information and simulated shopping trajectories. The dataset includes variants of up to 32k and 64k tokens, enabling systematic evaluation of model robustness under varying context lengths. Through comprehensive benchmarking of frontier LLMs, we identify critical performance gaps in reasoning over long shopping trajectory data. To address these challenges, we propose a Customer Agent Framework for ultra-long context management. Leveraging a Reinforcement Learning with Verifiable Rewards (RLVR) agentic training paradigm, our approach stores trajectories as external local files and trains the agent to autonomously retrieve and parse them through code-interpreter interactions (e.g., SQL queries), effectively bypassing the fixed in-context window constraints of LLMs. Experimental results demonstrate that our framework achieves strong performance for ShopTrajQA and shows generalization to other complex reasoning tasks.
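
As a rough illustration of the file-plus-SQL access pattern this abstract describes, the sketch below loads a toy trajectory into SQLite and answers a question with a query instead of placing the whole history in the prompt. The schema, column names, helper functions, and sample events are all hypothetical, and an in-memory database stands in for the external local files; the abstract does not specify the paper's actual storage format or tools.

```python
import sqlite3


def build_trajectory_db(events):
    """Store (timestamp, event_type, product) rows in an in-memory SQLite database.

    Hypothetical schema: one row per shopping event.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts TEXT, event_type TEXT, product TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    conn.commit()
    return conn


def purchases_in_year(conn, year):
    """Return the products purchased in the given year, oldest first."""
    rows = conn.execute(
        "SELECT product FROM events "
        "WHERE event_type = 'purchase' AND substr(ts, 1, 4) = ? "
        "ORDER BY ts",
        (str(year),),
    ).fetchall()
    return [row[0] for row in rows]


# Invented sample trajectory, for illustration only.
events = [
    ("2023-03-01", "search", "running shoes"),
    ("2023-03-02", "purchase", "running shoes"),
    ("2024-07-15", "click", "tent"),
    ("2024-07-20", "purchase", "tent"),
]
conn = build_trajectory_db(events)
print(purchases_in_year(conn, 2024))
```

The point of the pattern is that only the query result, not the multi-year history, has to fit in the model's context window.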

2606.07996 2026-06-09 cs.CL cs.AI New submission

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

Kaixin Lan, Mu You, Tao Fang, Binkai Ou, Lidia S. Chao, Derek F. Wong

Affiliations * University of Macau; Macau Millennium College; BoardWare Information System Limited

AI summary: Proposes MC-PDD, which masks highly specific tokens, prompts the LLM to predict the missing content, and compares prediction hit rates between a candidate corpus and a reference non-member corpus to detect pretraining data in a black-box setting, with performance comparable to existing methods.

Comments: The manuscript consists of 10 pages formatted in the IEEE/ACM two-column style

Abstract

Pretraining is fundamental to the development of Large Language Models (LLMs), yet the opacity of pretraining data complicates model analysis and raises ethical, legal, and fairness concerns. Detecting whether specific datasets were used during pretraining is, therefore, critical. Existing state-of-the-art methods typically rely on access to model probability distributions, making them unsuitable for closed-source LLMs that provide only input-output interfaces. To address this limitation, we introduce Masked Corpus-level Pretraining Data Detection (MC-PDD), a novel method inspired by the masked language modeling paradigm. MC-PDD masks highly specific tokens in each text and prompts the LLM to predict the missing content. It then assesses whether the difference in prediction hit rates between a candidate corpus and a reference non-member corpus is statistically significant. Based on this comparison, MC-PDD determines whether the candidate texts were likely included in the model's pretraining data. Experimental results demonstrate clear and consistent differences in prediction hit rates between pretrained and unseen data across three datasets, for both open-source and closed-source LLMs. Despite operating under a stricter black-box setting, MC-PDD achieves performance comparable to existing detection methods. Our approach enables practical applications such as model auditing and data copyright verification using only standard API access. Upon acceptance, we will publicly release the code and datasets.
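
The abstract does not say which significance test MC-PDD uses. As a minimal sketch, assuming a one-sided two-proportion z-test and made-up hit counts, the corpus-level comparison of hit rates could look like this; `hit_rate_z_test` is a hypothetical helper, not the authors' code.

```python
import math


def hit_rate_z_test(cand_hits, cand_total, ref_hits, ref_total):
    """One-sided two-proportion z-test.

    Tests whether the candidate corpus has a higher prediction hit rate
    than the reference non-member corpus. Returns (z, p_value).
    """
    p_cand = cand_hits / cand_total
    p_ref = ref_hits / ref_total
    pooled = (cand_hits + ref_hits) / (cand_total + ref_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / cand_total + 1 / ref_total))
    if se == 0:
        # All hits or all misses in both corpora: no evidence either way.
        return 0.0, 1.0
    z = (p_cand - p_ref) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Invented counts: 120/400 masked tokens recovered on the candidate corpus
# versus 80/400 on the reference corpus.
z, p = hit_rate_z_test(120, 400, 80, 400)
print(f"z = {z:.2f}, one-sided p = {p:.5f}")
```

A small p-value would suggest the candidate corpus is recovered more often than unseen text; the decision threshold is a design choice that the abstract leaves open.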

2606.08000 2026-06-09 cs.CL cs.AI New submission

Summarization is Not Dead Yet

Dongqi Liu, Chenxi Whitehouse, Zheng Zhao, Zhuchen Cao, Jian Li, Yabiao Wang

Affiliations * Saarland University; Max Planck Institute for Informatics; University of Cambridge; University of Edinburgh; Zhejiang University; Tencent YouTu Lab

AI summary: A multi-dimensional evaluation finds that human reference summaries still outperform LLMs in informativeness and faithfulness, while LLM outputs lead only in surface-level coherence and fluency, indicating that summarization remains an open research challenge.

Abstract

The progress of large language models (LLMs) has fueled claims that model-generated summaries rival or even surpass human-written references, raising questions about whether summarization remains an open research problem. We re-examine this narrative through a multi-track evaluation covering five diverse datasets and five state-of-the-art LLMs, combining controlled human assessment, bias-mitigated LLM-as-Judge protocols, factuality verification against external knowledge, and corpus-level linguistic analysis. Our findings reveal a more nuanced landscape in which human reference summaries continue to demonstrate advantages in informativeness and faithfulness, whereas LLM outputs are preferred mainly for surface-level coherence and fluency. Factuality verification indicates that human references remain more reliable, particularly for claims involving reasoning or synthesis, and linguistic analysis uncovers a pattern of stylistic homogeneity across different models. These observations suggest that current LLMs have raised the floor of summarization quality, but the ceiling of their performance remains below human capabilities.

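2606.08025 2026-06-09 cs.CL New submission

Arabic Sentence Segmentation Across Genres and Punctuation Conditions

Mohammed Elkholy, Khalid N. Elmadani, Nizar Habash, Bashar Alhafni

Affiliations * Mohamed bin Zayed University of Artificial Intelligence; New York University Abu Dhabi

AI summary: To address Arabic sentence segmentation under ambiguous and inconsistent punctuation, builds AraSEG, a corpus spanning eight genres, and evaluates LLMs, lightweight encoders, and dependency parser-based models. Lightweight models outperform LLMs in the hardest settings, and accurate segmentation substantially improves downstream dependency parsing.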

Abstract

Sentence segmentation in Arabic is challenging due to ambiguous and inconsistent punctuation, with many texts lacking reliable sentence boundary markers. Existing approaches rely heavily on punctuation cues and are typically evaluated on well-formed text, limiting their robustness in realistic Arabic settings. To address this, we introduce AraSEG, a genre-diverse sentence segmentation corpus spanning eight genres and a wide range of punctuation and document structure conditions. Using AraSEG, we evaluate LLMs, lightweight encoder models, and dependency parser-based models under increasingly challenging segmentation settings. Our experiments show that lightweight encoders, and even dependency parser-based models, outperform LLMs in the most challenging settings. We further investigate the effects of training data size and genre diversity, finding that performance eventually saturates and cross-genre generalization remains challenging. We also demonstrate that accurate sentence segmentation substantially improves downstream dependency parsing. We make our code, data, and models publicly available.

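2606.08071 2026-06-09 cs.CL New submission

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

Ayah Al-Naji, Edoardo Fazzari, Saif Alkindi, Hamdan Alhadhrami, Preslav Nakov, Cesare Stefanini

Affiliations * Mohamed bin Zayed University of Artificial Intelligence

AI summary: Introduces SurgiQ, a benchmark of 13,055 multiple-choice questions covering six surgical domains and four question formats, for evaluating LLM surgical reasoning. The best model reaches only 68.1% accuracy, and general-purpose models outperform most biomedical models, suggesting that current medical specialization does not adequately cover surgical knowledge.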

Abstract

Reliable evaluation of large language models in surgery remains underdeveloped. Broad medical benchmarks test clinical knowledge, while surgery requires procedural reasoning, management trade-offs, negation handling, and selection among plausible operative decisions. We present SurgiQ, a text-only, source-grounded benchmark of 13,055 four-option multiple-choice questions spanning six surgical domains and four question formats: case-based, reasoning, best-option, and negative. SurgiQ is constructed from surgical textbooks, open-access papers, and examination material using a multi-stage generation, verification, and expert-audit pipeline. We evaluate 35 open-weight LLMs under a unified log-likelihood protocol. Our results show substantial remaining headroom: smaller models often remain near the 25% random baseline, while the best model reaches 68.1% accuracy. General-purpose models, especially Qwen2.5, outperform most biomedical models, suggesting that current medical specialization does not yet provide sufficiently broad surgical coverage. Calibration and error analysis further show that even strong models make confident mistakes on clinically plausible distractors, motivating more reliable and broader surgical LLM evaluation.
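
A log-likelihood protocol for multiple choice is commonly implemented by scoring each option under the model and choosing the highest-scoring one. The sketch below assumes the per-option log-likelihoods have already been computed (the numbers are invented) and leaves out details such as length normalization, which the abstract does not specify. The function names are hypothetical.

```python
def pick_option(option_logliks):
    """Return the index of the option with the highest log-likelihood."""
    return max(range(len(option_logliks)), key=lambda i: option_logliks[i])


def accuracy(all_logliks, gold):
    """Fraction of questions where the top-scoring option is the gold answer."""
    correct = sum(pick_option(ll) == g for ll, g in zip(all_logliks, gold))
    return correct / len(gold)


# Invented scores for four four-option questions (one list per question).
all_logliks = [
    [-4.1, -2.0, -3.3, -5.0],
    [-1.2, -1.9, -3.0, -2.2],
    [-2.5, -2.4, -0.9, -3.1],
    [-3.0, -2.8, -2.9, -1.0],
]
gold = [1, 0, 2, 0]
print(accuracy(all_logliks, gold))
```

With four options per question, uniformly random selection gives the 25% baseline mentioned in the abstract.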

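2606.08077 2026-06-09 cs.CL New submission

Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics

Mengyuan Sun, Yu Li, Zhuohao Yu, Shikun Zhang, Wei Ye

Affiliations * National Engineering Research Center for Software Engineering, Peking University; University of Science and Technology of China

AI summary: To address self-generated rubrics lagging human-annotated ones on hard instances, proposes SVR, a framework that recasts rubric construction as max-margin boundary learning over preference data. Contrastive feature mining, a prompt-conditioned selector, and iterative refinement substantially narrow the gap to human rubrics, and the method shows broad reward-modeling capability.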

Abstract

Rubric-based evaluation is a promising paradigm for judging large language model (LLM) outputs, yet self-generated rubrics lag human-annotated criteria on hard instances. We argue this discriminative gap reflects an objective mismatch: self-generated rubrics describe good responses, whereas effective criteria must discriminate between close candidates. To close this gap, we introduce SVR (Support Vector Rubrics), a framework that recasts rubric construction as max-margin boundary learning over preference data. SVR mines contrastive features from preference pairs into a rubric bank, learns a prompt-conditioned selector together with global rubric weights, and iteratively refines the bank through support-pair selection and adversarial probing of hard negatives. At inference, given only the prompt, SVR retrieves the top-rubrics from the bank and scores responses. On RubricBench, SVR narrows the gap to human reference rubrics from 24.1 to 0.3 points and outperforms strong self-rubric and judge baselines, and the learned bank transfers across judges without retraining. On RewardBench 1&2, and RM-Bench, it remains competitive with dedicated reward models, demonstrating broader reward modeling capability. Overall, boundary-defining rubrics offer a principled route to closing the discriminative gap in LLM evaluation.

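2606.08092 2026-06-09 cs.CL New submission

When Languages Disagree: Self-Evolving Multilingual LLM Judges

Xiyan Fu, Wei Lu

Affiliations * Nanyang Technological University

AI summary: Proposes SEMJ, which uses cross-lingual inconsistency in multilingual judging to drive iterative self-reflection and re-evaluation; it outperforms voting and reflection baselines on multiple benchmarks, improving accuracy and cross-lingual consistency.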

Abstract

Multilingual LLM-as-a-judge is widely used to evaluate model outputs across languages, but suffers from cross-lingual inconsistency (Fu and Liu, 2025). Existing methods typically treat this inconsistency as noise and mitigate it through voting or aggregation. In this work, we instead show that multilingual inconsistency can provide complementary evaluation signals. Our oracle analysis finds that sampling judgments across languages yields a higher performance upper bound than single-language judging, indicating that different languages potentially include complementary judgments. Motivated by this finding, we propose SEMJ, a self-evolving multilingual judge that leverages cross-lingual inconsistency for iterative refinement. SEMJ constructs multilingual variants of each input, collects independent judgments and rationales, and feeds inconsistent outputs back for self-reflection and re-evaluation. Experiments on multiple benchmarks show that SEMJ consistently outperforms voting and reflection baselines in both accuracy and cross-lingual consistency. Further analysis shows that inconsistency triggers useful re-evaluation, which improves judgment quality.

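2606.08194 2026-06-09 cs.CL cs.AI New submission

GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models

Ryner Tan, Wenxuan Zhang

Affiliations * Singapore University of Technology and Design

AI summary: Introduces GlobeAudio, a benchmark of 5,637 multilingual multiple-choice questions that evaluates the auditory reasoning and cultural understanding of large audio-language models under natural audio conditions, and finds substantial performance gaps for open-source models and low-resource languages.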

Abstract

Large Audio-Language Models (LALMs) integrate audio perception and language understanding within a unified framework, enabling a wide range of real-world applications. Despite recent advances, evaluation for LALMs remains heavily underspecified relative to real-world requirements: most lack true linguistic and cultural authenticity, while others fail to capture acoustic realism. To bridge this gap, we propose GlobeAudio, a multilingual and multicultural benchmark designed to evaluate naturalistic audio understanding. GlobeAudio consists of 5,637 multiple-choice questions across six typologically diverse languages, expertly crafted by native speakers grounded on naturally occurring audio. In order to do well, models must possess higher-level auditory reasoning skills and culturally grounded interpretation. We systematically evaluate representative closed-source and open-source LALMs, as well as cascaded ASR-LLM pipelines. Our experiments reveal substantial performance gaps under natural acoustic conditions, particularly for open-source models and low-resource languages. These findings highlight critical limitations of current LALMs and underscore the importance of naturalistic audio evaluation for future audio-language systems. GlobeAudio can be found at https://huggingface.co/datasets/iNLP-Lab/GlobeAudio .

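2606.08272 2026-06-09 cs.CL cs.AI New submission

AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers

Mohsina Bilal, Gopakumar G

Affiliations * National Institute of Technology Calicut

AI summary: Introduces AgriGov, a trilingual dataset built through automated scraping, a translation pipeline, and human post-editing, yielding approximately 8,000 sentence-aligned parallel pairs in the agricultural-policy domain to support machine translation, question answering, and related applications.

Comments: 15 pages, 4 figures, Submitted to: Sadhana, Elsevier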

Abstract

AgriGov is a curated, trilingual (English-Hindi-Marathi) dataset designed to address the scarcity of domain-grounded multilingual resources for agricultural policies and farmer welfare schemes. Initially, we collected and structured data from 50 government schemes sourced from trusted portals using automated scraping techniques, organizing it into predefined semantic fields (e.g., title, eligibility, application process, documents, exclusions). Translations were performed using a pipeline combining Google Translate API, MarianMT, and human post-editing, resulting in a domain-specific Hindi-Marathi dataset comprising approximately 2100 source segments. To enhance coverage, we augmented this dataset with sentences from the Samanantar corpus, leading to approximately 8,000 sentence-aligned Hindi-Marathi parallel pairs. The dataset now offers robust resources for fine-tuning machine translation models in this domain. AgriGov is designed for applications in domain-adaptive machine translation, question answering, information retrieval, and summarization systems. Its key contribution is a schema-driven, human-corrected multilingual alignment pipeline that ensures domain fidelity, provides provenance, and supports reproducible experiments, enabling retrieval-augmented applications for farmer-facing tools.

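2606.08417 2026-06-09 cs.CL cs.AI New submission

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

Antonio Franca, Alexander Tong

AI summary: Argues that generative perplexity (gen-PPL) is a flawed evaluation metric for non-autoregressive language models: zero-parameter naive samplers reach state-of-the-art gen-PPL on LM1B and OpenWebText while generating incoherent text. Recommends evaluation suites that directly quantify the distributional divergence between generated and reference text.

Comments: Accepted to the Workshop on Structured Probabilistic Inference & Generative Modeling (SPIGM) at ICML 2026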

Abstract

Diffusion and continuous flow-based language models have emerged as the leading non-autoregressive alternatives to language modeling. Progress in both paradigms is overwhelmingly tracked by generative perplexity (gen-PPL): the per-token negative log-likelihood of samples under a frozen autoregressive (AR) scorer such as gpt2-large, typically paired with an empirical-entropy guardrail to rule out low-entropy collapse. We argue that this metric is unsound. By construction, gen-PPL measures only predictability under the scoring AR, not grammaticality or semantic coherence -- and the set of predictable but still low-quality sequences is combinatorially large. To make this concrete, we construct a suite of zero-parameter, deliberately naive samplers that achieve state-of-the-art gen-PPL on LM1B and OpenWebText at non-degenerate entropy, surpassing recently published diffusion and continuous-flow models while producing text that is incoherent by construction. We recommend evaluation suites that directly quantify the distributional divergence between generated and reference text, and use such a suite to re-benchmark recent non-autoregressive models, recovering a more faithful picture of the current state of the art.
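
To make the metric in this abstract concrete, here is a minimal sketch of generative perplexity as the exponential of the mean per-token negative log-likelihood, together with a unigram empirical entropy of the kind used as a guardrail. It assumes the scorer's per-token log-probabilities (natural log) are already available and pools tokens across samples; published implementations may aggregate differently, and both function names are hypothetical.

```python
import math
from collections import Counter


def gen_ppl(token_logprobs):
    """exp(mean per-token NLL), pooling tokens over all samples.

    token_logprobs: one list of per-token log-probabilities per sample.
    """
    flat = [lp for sample in token_logprobs for lp in sample]
    return math.exp(-sum(flat) / len(flat))


def unigram_entropy(tokens):
    """Empirical entropy (in nats) of the token frequencies in one sample."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log(c / n) for c in counts.values())


# Invented example: a sample that repeats one highly predictable token
# scores a very low perplexity even though it is degenerate text.
degenerate_tokens = ["the"] * 8
degenerate_logprobs = [[-0.1] * 8]
print(gen_ppl(degenerate_logprobs), unigram_entropy(degenerate_tokens))
```

The degenerate sample is caught by the entropy check, which is the role of the guardrail; the abstract's argument is that samplers can keep entropy non-degenerate and still score well without being coherent.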

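2606.08605 2026-06-09 cs.CL New submission

Multilingual Fact-Checking at Scale: Fine-Tuned Compact Models vs LLMs

Pratuat Amatya, Vinay Setty

Affiliations * Factiverse

AI summary: Presents a multilingual fact-checking system built on fine-tuned XLM-RoBERTa, mmBERT, and SetFit models; on claim detection in 114 languages and veracity prediction in 28 languages, the compact models show efficient and stable performance compared with LLMs such as GPT-5.2.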

Abstract

We present a multilingual fact-checking system deployed at Factiverse, designed for high-throughput and low-latency operation across diverse languages. The system follows a modular pipeline with three stages: claim detection, evidence retrieval and re-ranking, and veracity prediction. We fine-tune XLM-RoBERTa-Large for claim detection, mmBERT-base for three-label stance classification (Supports/Refutes/Mixed), and a SetFit-based multilingual re-ranker for claim-evidence matching. We compare these components against strong LLM baselines, including GPT-5.2, Claude Opus 4.6, and Qwen3-8b. Experiments on production data spanning 114 languages for claim detection and 28 languages for veracity prediction show that task-specific fine-tuning provides strong and stable multilingual performance, while the fine-tuned retrieval model remains competitive with modern proprietary embeddings. Same-hardware latency measurements further show large efficiency gains for encoder-based components, supporting their use in production deployments with tight cost and privacy constraints. Overall, compact fine-tuned, self-hosted models remain a practical and effective foundation for multilingual fact-checking at scale. Code and data used for this study are available at https://github.com/factiverse/factcheck-editor.

2606.08625 2026-06-09 cs.CL New submission

From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape

Hao Chen, Ziyu Han, Yukun Yan, Qingfu Zhu, Maosong Sun, Wanxiang Che

Affiliations * Research Center for Social Computing and Interactive Robotics; Department of Computer Science and Technology, Institute for AI

AI summary: Presents the rubric as a unifying framework that connects human intentions and machine behavior at three levels: decomposing holistic judgments into verifiable dimensions, providing process-level feedback, and emerging dynamically from model behavior.

Abstract

As Large Language Models (LLMs) advance toward open-ended autonomous agents, the mechanisms used to evaluate and guide their behavior must evolve accordingly. This work introduces the rubric as a unifying framework capturing this evolution, characterizing rubrics as a dynamic response to successive LLM paradigm shifts that recurs across otherwise independent efforts in evaluation, reinforcement learning, and safety alignment. We define rubrics as explicit criteria sets that transform complex quality judgments into structured and actionable standards, and demonstrate that their recurrence across these research threads is not coincidental. We systematically organize existing rubric designs, examine their construction and optimization, and analyze their role across evaluation and training. Rubrics manifest at three progressively deeper levels: at the evaluative level, they decompose holistic judgments into verifiable dimensions; at the training level, they serve as dense feedback signals providing process-level guidance where scalar rewards fall short; at the intrinsic level, they emerge dynamically from model behaviors, driving self-improvement. We further assess rubric reliability across generation quality, execution fidelity, theoretical constraints, and security threats, before surveying rubric-based benchmarks across diverse domains. By rendering assessment transparent and decomposable, rubrics translate human value expectations into machine-learnable signals, serving as the enduring bridge between human intentions and machine behavior.

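2606.08715 2026-06-09 cs.CL New submission

Operationalizing Linguistic Methods through Prompt-Engineering Skills: An Automatic Chinese Web Neologism Detection Pipeline

Yufeng Wu, Meichun Liu

Affiliations * City University of Hong Kong

AI summary: Proposes an automatic Chinese web neologism detection method that turns traditional linguistic identification principles into prompt-engineering skills. A four-stage pipeline detects 4,853 neologisms from 267M documents, and the evaluation identifies candidate coverage and LLM semantic judgment as the bottlenecks.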

Abstract

We present a method for automatic Chinese web neologism detection that operationalizes traditional linguistic identification principles as prompt-engineering skills. The method has four stages: tokenizer-independent character n-gram candidate generation; dictionary anchoring with a Pointwise Mutual Information pre-filter; a well-formedness skill based on Chinese word-formation principles; and a combined rule and three-way classification skill that distinguishes neologism, entity, and none. Applied to the BAAI CCI 3.0 corpus (267M documents), the method produces 226,959 classified candidates including 4,853 labeled neologisms. To evaluate the method, we develop a per-stage conditional recall decomposition in which the pipeline's strict recall factors mathematically into the product of stage conditional recalls. Applied to Hou (2023) (4,199 entries), the decomposition exposes Stage 1 candidate coverage and Stage 4B LLM semantic judgment as the two bottlenecks (R=41.5% and 60.0% respectively), while intermediate stages are near-lossless. A length-stratified analysis further reveals that the structural well-formedness skill is length-invariant (>= 96.9%) whereas the semantic novelty-classification skill is length-dependent (65.6%/59.0%/44.1% across 2/3/4-character candidates), mapping a current boundary of skill-based linguistic operationalization. We release the method, pipeline outputs, and evaluation protocol as public resources.
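
The recall decomposition described in this abstract multiplies per-stage conditional recalls to get end-to-end strict recall. As a small worked sketch, the code below multiplies only the two bottleneck figures reported in the abstract. It assumes both are factors of the same product and treats all other stages as lossless, so the result is an upper bound rather than the paper's reported strict recall; `strict_recall` is a hypothetical helper.

```python
from math import prod


def strict_recall(stage_recalls):
    """Multiply per-stage conditional recalls into end-to-end strict recall."""
    return prod(stage_recalls)


# Figures reported in the abstract.
stage1_candidate_coverage = 0.415
stage4b_llm_semantic = 0.600

# Assumption: every other stage has conditional recall 1.0. Real intermediate
# stages have recall <= 1.0, so they can only lower this product.
upper_bound = strict_recall([stage1_candidate_coverage, stage4b_llm_semantic])
print(f"strict recall is at most {upper_bound:.3f}")
```

Because the factors multiply, improving either bottleneck stage raises the end-to-end figure proportionally, which is why the decomposition is useful for deciding where to invest effort.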

2606.08769 2026-06-09 cs.CL cs.AI New submission

RadOT-Eval: Auditable Structured-Evidence Transport for Radiology Report Evaluation

Weixin Liu, Juming Xiong, Yang Li, Qingyuan Song, Susannah Rose, Murat Kantarcioglu, Bradley Malin, Zhijun Yin

Affiliations * University of California, Berkeley

AI summary: Proposes RadOT-Eval, a framework that aligns structured clinical evidence via optimal transport; on an independent test set it achieves high Spearman correlation with error burden, outperforming standard metrics and an LLM-based evaluator.

Comments: 10 pages, 1 figure, 13 tables

Abstract

Automatic evaluation is critical for high-stakes text generation, where errors often involve omitted findings, hallucinated content, polarity reversals, location changes, uncertainty mismatches, and temporal-comparison errors rather than low surface similarity alone. Radiology report generation provides a challenging test case because generated reports must preserve structured clinical evidence across sources. We present RadOT-Eval, an interpretable structured-evidence optimal transport framework for offline auditing of radiology report generation. RadOT-Eval decomposes reference and candidate reports into attribute-structured clinical evidence units, aligns corresponding evidence using entropy-regularized optimal transport, and uses clinically meaningful side-channel discrepancies in a monotone risk model to predict error burden. All transport, feature, and readout choices are selected using the ReXVal dataset, and the frozen system is evaluated on the independent RadEvalX dataset. RadOT-Eval achieves Spearman correlations of 0.715, 0.548, and 0.399 with total, clinically significant, and clinically insignificant annotated error burden, respectively, yielding higher point estimates than standard evaluation metrics and the open-source large language model (LLM)-based evaluator GREEN-radllama2-7B. In a frozen auxiliary corruption-sensitivity stress test on ReXErr-v1, RadOT-Eval achieves 0.768 AUROC and a 0.990 corrupted-greater-than-clean paired win rate. These results show that structured evidence transport provides an auditable, rank-oriented evaluation tool for high-stakes generated clinical text under ReXVal-only model selection and frozen RadEvalX testing.
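
Entropy-regularized optimal transport, the alignment step named in this abstract, is usually solved with Sinkhorn iterations. The sketch below is a generic pure-Python Sinkhorn solver, not RadOT-Eval itself: the cost matrix, the uniform marginals, the regularization strength `eps`, and the reading of rows as reference evidence units and columns as candidate evidence units are all assumptions made for illustration.

```python
import math


def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularized OT plan between marginals a (rows) and b (columns)."""
    n, m = len(a), len(b)
    # Gibbs kernel: low cost means high affinity.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical costs: two reference units, three candidate units. The third
# candidate unit is equally far from both reference units.
cost = [[0.0, 1.0, 0.5],
        [1.0, 0.0, 0.5]]
a = [0.5, 0.5]
b = [1 / 3, 1 / 3, 1 / 3]
plan = sinkhorn(cost, a, b)
for row in plan:
    print([round(x, 4) for x in row])
```

In this toy case most of the mass flows along the zero-cost pairs, and the unmatched third candidate unit splits its mass between the two reference units.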

2606.08878 2026-06-09 cs.CL cs.MA 新提交

PerspectiveGap: A Benchmark for Multi-Agent Orchestration Prompting

PerspectiveGap: 多智能体编排提示的基准测试

Youran Sun, Xingyu Ren, Kejia Zhang, Xinpeng Liu, Jiaxuan Guo

发表机构 * University of Maryland(马里兰大学) The Chinese University of Hong Kong(香港中文大学) Stanford University(斯坦福大学)

AI总结 提出PerspectiveGap基准,评估LLM为多智能体系统编写编排提示的能力,实验显示模型平均通过率仅14.9%,表明该能力独特且未被充分评估。

详情
AI中文摘要

现实世界的LLM应用正从单智能体工作流转向编排的多智能体系统,但当前模型仍难以确定每个子智能体需要知道什么。为衡量这一点,我们引入了PerspectiveGap,一个用于评估LLM为多智能体系统编写编排提示能力的基准。PerspectiveGap包含110个场景,每个场景通过两种干扰混合任务格式评估:角色片段分配和自由形式提示编写。这些场景被组织成10种拓扑结构,这些拓扑结构源自作者的真实工程实践,并遵循提示经济原则:构建以循环为中心的编排,以最小的角色和工程开销最大化效用。在对来自10家公司的27个商业模型进行的实验中,GPT-5.5大幅超越所有竞争对手,而Opus 4.7尽管编码性能强劲,但在编排提示方面表现出明显弱点。尽管如此,PerspectiveGap仍然具有挑战性:评估模型平均综合通过率仅为14.9%(GPT-5.5为62.0%),平均总体泄漏率为246.5%(每个场景的信息泄漏事件计数,而非比例;GPT-5.5为49.1%)。这些发现表明,多智能体编排提示是一种独特且未被充分评估的能力,而PerspectiveGap为系统衡量和改进该能力提供了基础。

英文摘要

Real-world LLM applications are moving beyond single-agent workflows toward orchestrated multi-agent systems, yet current models still struggle to determine what each sub-agent needs to know. To measure this, we introduce PerspectiveGap, a benchmark for evaluating LLMs' ability to compose orchestration prompts for multi-agent systems. PerspectiveGap contains 110 scenarios, each evaluated through two distractor-mixed task formats: role-fragment assignment and free-form prompt writing. These scenarios are organized into 10 topologies, which are distilled from the authors' real-world engineering practice and framed by the Prompt Economy principle: building loop-centered orchestrations that maximize utility with minimal role and engineering overhead. In experiments with 27 commercial models from 10 companies, GPT-5.5 substantially outperforms all competitors, whereas Opus 4.7 shows a notable weakness in orchestration prompting despite its strong coding performance. Nevertheless, PerspectiveGap remains challenging: the evaluated models achieve an average combined pass rate of only 14.9% (GPT-5.5 62.0%) and an average overall leakage rate of 246.5% (a per-scenario information leak-event count, not a proportion; GPT-5.5 49.1%). These findings suggest that multi-agent orchestration prompting is a distinct and under-evaluated capability, and PerspectiveGap provides a foundation for measuring and improving it systematically.

2606.08988 2026-06-09 cs.CL cs.LG 新提交

Structure-Aware Modeling of Multiple-Choice Questions Improves Automatic Difficulty Estimation

选择题的结构感知建模改进自动难度估计

Gabriel Ortega, Abelino Jiménez, Séverin Lions, Pablo Dartnell

发表机构 * Centro de Investigación Avanzada en Educación (CIAE), Instituto de Estudios Avanzados en Educación (IE), Universidad de Chile(智利大学高级教育研究中心(CIAE),高级教育研究所(IE)) Departamento de Evaluación, Medición y Registro Educacional (DEMRE), Universidad de Chile(智利大学评估、测量与教育注册系(DEMRE)) Centro de Modelamiento Matemático (CMM), Universidad de Chile(智利大学数学建模中心(CMM)) Departamento de Ingeniería Matemática (DIM), Universidad de Chile(智利大学数学工程系(DIM))

AI总结 提出结构感知模型,将选择题的干扰项作为独立输入编码,通过顺序感知或顺序不变聚合提升难度预测,在自然科学和社科数据集上达到R²=0.83和0.71。

Comments 30 pages, 1 table, 2 figures

详情
AI中文摘要

自动题目难度估计(AQDE)在教育评估中日益重要,因为它有潜力产生与专家判断相竞争的难度估计,同时有助于减少与试点管理相关的时间和财务负担,并扩展到数字测试环境。先前的AQDE研究报告了关于将干扰项作为附加文本添加到题干和正确答案中是否能一致改进难度预测的混合证据。我们假设干扰项信息的有效性取决于其结构表示,并且明确将干扰项建模为独立组件可以改进忽略此信息的基线的难度估计。为此,我们设计了受控架构,将选择题组件建模为不同输入,以隔离干扰项内容和顺序的贡献。具体来说,我们通过将每个干扰项编码为独立的文本输入,并通过顺序感知的拼接(带位置标签)或顺序不变的求和来聚合其表示,从而表示干扰项。我们使用两个智利数据集(自然科学和社会科学,2016-2020年;4114道选择题)评估了这些架构。与仅使用题干和正确答案的简单模型相比,我们最佳的结构感知架构实现了更高的预测性能,自然科学题目的R²=0.83,社会科学题目的R²=0.71。一个顺序不变的变体以大约一半的参数达到了几乎相同的准确率,提供了有利的准确率-效率权衡。这些结果表明,结构信息(尤其是干扰项内容)驱动了预测准确性的提升,支持开发计算上可行的大规模教育应用的高效结构感知模型。

英文摘要

Automatic Question Difficulty Estimation (AQDE) holds growing promise for educational assessment because it has the potential to yield difficulty estimates that are competitive with expert judgment, while helping reduce the time and financial burden associated with pilot administrations and scaling to digital testing contexts. Prior AQDE studies report mixed evidence on whether adding distractors as additional text to the question stem and the correct key consistently improves difficulty prediction. We hypothesize that the effectiveness of distractor information depends on its structural representation, and that explicitly modeling distractors as separate components improves difficulty estimation over baselines that omit this information. To address this, we designed controlled architectures that model MCQ components as distinct inputs to isolate the contribution of distractor content and order. Specifically, we represented distractors by encoding each distractor as its own text input and aggregating their representations either with order-aware concatenation (with positional tags) or with an order-invariant summation. We evaluated these architectures using two Chilean datasets (Natural and Social Sciences, 2016-2020; 4,114 multiple-choice questions). Compared to a simpler model that only used the question stem and the key, our best distractor-aware architecture achieved higher predictive performance, reaching R^2 = 0.83 for Natural Sciences and R^2 = 0.71 for Social Sciences items. An order-invariant variant achieved nearly the same accuracy with approximately half as many parameters, offering a favorable accuracy-efficiency trade-off. These results show that structural information (especially distractor content) drives gains in predictive accuracy, supporting the development of efficient, structure-aware models that are computationally viable for large-scale educational applications.
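The two aggregation schemes described above can be contrasted on toy vectors. The sketch below is only an illustration under stated assumptions: the embeddings are made-up 3-dimensional vectors rather than encoder outputs, and the scalar position index stands in for whatever positional tags the authors actually use.

```python
def aggregate_concat(distractor_vecs):
    """Order-aware: concatenate vectors, each prefixed with a positional tag."""
    out = []
    for pos, vec in enumerate(distractor_vecs):
        out.extend([float(pos)] + list(vec))  # hypothetical scalar position tag
    return out

def aggregate_sum(distractor_vecs):
    """Order-invariant: element-wise sum across distractors."""
    return [sum(vals) for vals in zip(*distractor_vecs)]

# Three made-up distractor embeddings (exactly representable floats).
d1, d2, d3 = [0.5, 1.0, 0.25], [2.0, 0.5, 1.5], [1.0, 0.25, 0.75]

original = [d1, d2, d3]
shuffled = [d3, d1, d2]

print(aggregate_sum(original) == aggregate_sum(shuffled))        # sum ignores order
print(aggregate_concat(original) == aggregate_concat(shuffled))  # concat does not
print(len(aggregate_sum(original)), len(aggregate_concat(original)))
```

The summed representation keeps the dimension of a single distractor, while the concatenated one grows with the number of distractors. A smaller input to downstream layers is one plausible reason an order-invariant variant needs fewer parameters.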

2606.09013 2026-06-09 cs.CL 新提交

Beyond Averages: Evaluating LLMs on Human Survey Replication at the Distributional Level

超越平均值:在分布层面评估LLM对人类调查的复现能力

Jeonghyeon Moon, Jiwon Kim, Yeheum Lah, Yoonju Han, Yuncheol Kang

发表机构 * Ewha Womans University(梨花女子大学)

AI总结 本研究利用一项非公开的韩国方便面购买实验,在分布层面评估LLM复现人类调查响应的能力,发现均值与人类匹配的模型仍可能产生更偏离人类的分布;结构化角色和多模态输入可提升对齐度,而显式推理提示则会降低对齐度。

详情
AI中文摘要

LLM越来越多地被用于模拟人类调查响应,但先前的工作主要使用均值层面或总体一致性来评估复现能力,对LLM是否复现人类行为的变异性提供的见解有限。我们使用一个非公开的2010年韩国方便面购买消费者选择实验,在分布层面评估基于LLM的调查复现,该设置不太可能与模型训练数据重叠。我们评估了三种不同统计类型的响应变量:二元购买发生、分类品牌选择和计数购买数量。对于每种变量,我们在均值层面、模式和分布一致性上比较人类和LLM响应,并参考仅来自人类数据的基线。LLM在复现条件层面模式上表现合理,但未能捕捉分布结构:对于购买数量,没有模型能击败一个简单的条件不敏感基线(该基线仅匹配合并的人类分布)。因为均值匹配人类良好的模型仍可能产生比该基线更远离人类的分布,仅基于均值的评估可能具有误导性。复现能力也随输入配置而变化,结构化角色和多模态输入改善一致性,而显式推理提示则单调地降低一致性。

英文摘要

LLMs are increasingly used to simulate human survey responses, but prior work has mainly evaluated replication using mean-level or aggregate agreement, offering limited insight into whether LLMs reproduce the variability of human behavior. We evaluate LLM-based survey replication at the distributional level using a non-public 2010 consumer choice experiment on Korean instant noodle purchases, a setting unlikely to overlap with model training data. We evaluate three response variables of differing statistical type: binary purchase incidence, categorical brand choice, and count purchase quantity. For each, we compare human and LLM responses at mean-level, pattern, and distributional alignment, and against reference baselines from the human data alone. LLMs reproduce condition-level patterns reasonably well but fail to capture distributional structure: for purchase quantity, no model beats a condition-insensitive baseline that simply matches the pooled human distribution. Because models that match human means well can still produce distributions further from humans than this baseline, mean-based evaluation alone can be actively misleading. Replication also varies with input configuration, with structured personas and multimodal inputs improving alignment while explicit reasoning prompting degrades it monotonically.
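The gap between mean-level and distributional agreement is easy to reproduce on toy data. The sketch below uses total variation distance as one possible distributional measure; this is an assumption, since the abstract does not name the metric used, and the purchase quantities are invented.

```python
from collections import Counter

def mean(xs):
    return sum(xs) / len(xs)

def total_variation(xs, ys):
    """Total variation distance between two empirical count distributions."""
    p, q = Counter(xs), Counter(ys)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p[k] / len(xs) - q[k] / len(ys)) for k in support)

# Made-up purchase quantities: humans are dispersed, the simulated
# respondents all give the same "average" answer.
human = [0, 0, 0, 1, 1, 2, 3, 9]
llm = [2, 2, 2, 2, 2, 2, 2, 2]

print(mean(human), mean(llm))       # identical means
print(total_variation(human, llm))  # yet very different distributions
```

A mean-only evaluation would score this simulated sample as a perfect match, even though it reproduces almost none of the human variability.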

2606.09351 2026-06-09 cs.CL stat.ME 新提交

In-Context Learning for the Imputation of Public Opinion Data with Large Language Models

基于上下文学习的民意数据插补方法

Tobias Holtdirk, Georg Ahnert, Joseph W Sakshaug, Anna-Carolina Haensch

发表机构 * LMU Munich(慕尼黑大学) Munich Center for Machine Learning(慕尼黑机器学习中心) University of Mannheim(曼海姆大学) Institute for Employment Research (IAB)(就业研究所(IAB)) University of Maryland, College Park(马里兰大学帕克分校)

AI总结 提出通过上下文学习(ICL)插补调查缺失数据,在150个意见变量上评估,相比MICE PMM方法,在所有缺失机制下绝对误差更低,尤其非随机缺失时优势显著。

详情
AI中文摘要

大型语言模型已被广泛评估为个体调查响应的模拟器。然而,在实践中,完全未观测到的响应很少见;主要问题是部分无响应。插补旨在通过填充这些缺失值来恢复调查数据集的整体结构。它有自己的明确定义的评估标准,并且与预测有根本区别。我们提出通过上下文学习(ICL)来插补缺失的调查数据。我们在美国趋势面板的15波调查中,针对150个意见变量,系统评估了不同缺失机制(MCAR、MAR、MNAR)下的ICL设计选择。与成熟的数据插补统计方法(如MICE PMM)相比,我们的ICL方法在所有缺失机制下均持续降低了绝对误差,在非随机缺失(MNAR)下收益最大。值得注意的是,性能最佳的配置(gpt-oss-120b,100个上下文示例)实现了接近名义水平的总体覆盖率(接近95%),置信区间比MICE PMM窄2到5倍。我们发布了一个具有类似sklearn API的Python包,以便使用本地和专有LLM轻松部署我们的方法。

英文摘要

Large language models have been widely evaluated as simulators of individual survey responses. In practice, however, fully unobserved responses are rare; the dominant problem is partial non-response. Imputation aims to restore the overall structure of a survey dataset by filling in these missing values. It has its own well-defined evaluation criteria and differs fundamentally from prediction. We propose to impute missing survey data through in-context learning (ICL). We systematically evaluate ICL design choices across different missingness mechanisms (MCAR, MAR, MNAR) on 150 opinion variables spanning 15 waves of the American Trends Panel. Compared to well-established statistical methods for data imputation like MICE PMM, our ICL approach consistently reduces absolute error across all missingness mechanisms, with the largest gains under non-random missingness (MNAR). Notably, the best-performing specification (gpt-oss-120b with 100 in-context examples) achieves near-nominal aggregate coverage (approaching the 95% level) with confidence intervals two to five times narrower than MICE PMM. We publish a Python package with an sklearn-like API to enable easy deployment of our method using local and proprietary LLMs.
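The three missingness mechanisms named in the abstract can be simulated in a few lines. This is a hedged sketch with invented data and invented missingness probabilities, unrelated to the paper's experimental setup or to its Python package.

```python
import random

def simulate_survey(n=20000, seed=0):
    """Toy survey: x is a fully observed covariate, y a 1-5 opinion item."""
    rng = random.Random(seed)
    return [(x, rng.randint(1, 4) + x)
            for x in (rng.randint(0, 1) for _ in range(n))]

def apply_missingness(rows, mechanism, seed=1):
    """Return y values with None where the response is masked."""
    rng = random.Random(seed)
    observed = []
    for x, y in rows:
        if mechanism == "MCAR":    # independent of everything
            p_miss = 0.3
        elif mechanism == "MAR":   # depends only on the observed covariate x
            p_miss = 0.5 if x == 1 else 0.1
        elif mechanism == "MNAR":  # depends on the unobserved value y itself
            p_miss = 0.15 * y
        else:
            raise ValueError(mechanism)
        observed.append(None if rng.random() < p_miss else y)
    return observed

def observed_mean(observed):
    vals = [y for y in observed if y is not None]
    return sum(vals) / len(vals)

rows = simulate_survey()
true_mean = sum(y for _, y in rows) / len(rows)
means = {m: observed_mean(apply_missingness(rows, m))
         for m in ("MCAR", "MAR", "MNAR")}
print(round(true_mean, 3), {m: round(v, 3) for m, v in means.items()})
```

Under MCAR the complete-case mean stays close to the truth. Under MAR it is biased but can be corrected by conditioning on `x`. Under MNAR the bias depends on the masked value itself, which is the hardest case for any imputation method.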

2606.09389 2026-06-09 cs.CL 新提交

LexRubric: A Rubric-Guided Diagnostic Benchmark for Open-Ended Legal Tasks

LexRubric:面向开放式法律任务的基于评分指南的诊断基准

Yifan Chen, Haitao Li, Yiran Hu, Kaisong Song, Jun Lin, Yueyue Wu, Qingyao Ai, Min Zhang, Yiqun Liu

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学) University of Waterloo(滑铁卢大学) Alibaba Group(阿里巴巴集团)

AI总结 提出LexRubric基准,包含649个中国法律咨询与司法考试实例及12,337条原子评分标准,通过六维框架评估LLM在开放式法律任务中的可靠性,发现当前模型仍面临挑战。

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地应用于现实法律任务,评估其开放式法律响应的可靠性变得至关重要。这些任务需要上下文敏感的答案,且容错空间极小,因此需要能够识别响应质量失败具体原因的细粒度诊断评估。我们引入了LexRubric,一个基于评分指南的基准,用于评估开放式中文法律任务。LexRubric包含来自法律咨询和司法考试的649个实例,这些实例既反映了日常法律需求,也体现了专业法律推理,覆盖14个法律场景。此外,它还包含12,337条由专家编写的原子评分标准,这些标准组织在一个统一的六维框架下,能够跨任务和评估维度进行准确的评估和诊断分析。为了验证评估的可靠性,我们测试了多个评判模型,并将基于模型的评判与人类评判进行了比较。我们进一步在LexRubric上评估了18个近期通用和法律领域的LLM。结果表明,不同模型展现出不同的能力特征,且开放式法律问题对当前LLM仍然具有挑战性。数据可在以下网址获取:https://github.com/foggpoy/LexRubric。

英文摘要

As large language models (LLMs) are increasingly applied to real-world legal tasks, evaluating the reliability of their open-ended legal responses has become essential. These tasks require context-sensitive answers and allow little room for error, motivating fine-grained and diagnostic evaluation that can identify specific sources of response quality failures. We introduce LexRubric, a rubric-based benchmark for evaluating open-ended Chinese legal tasks. LexRubric contains 649 instances from legal consultation and judicial examination, which reflect both everyday legal needs and professional legal reasoning and cover 14 legal scenarios. It further includes 12,337 expert-written atomic scoring criteria organized under a unified six-dimensional framework, enabling accurate evaluation and diagnostic analysis across tasks and evaluation dimensions. To validate the reliability of the evaluation, we test multiple judge models and compare model-based judgments with human judgments. We further evaluate 18 recent general and legal-domain LLMs on LexRubric. Results show that different models exhibit distinct capability profiles, and that open-ended legal questions remain challenging for current LLMs. Data is available at: https://github.com/foggpoy/LexRubric.

2606.09461 2026-06-09 cs.CL 新提交

H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions

H2HMem: 面向人际交互中智能体的多模态记忆基准

Shiping Zhu, Yibo Yang, Zhengyang Wang, Tiancheng Shen, Dandan Guo, Ming-Hsuan Yang

发表机构 * Jilin University(吉林大学) Shanghai Jiao Tong University(上海交通大学) University of California at Merced(加州大学默塞德分校)

AI总结 提出H2HMem基准,通过双人和多人多模态对话评估智能体在记忆召回、推理和应用方面的能力,揭示现有模型在多模态、多参与者场景下的显著局限。

Comments 22 pages, 6 figures

详情
AI中文摘要

大型语言模型智能体越来越多地部署在人际交互场景中,例如会议助手和临床文档系统,在这些场景中,它们必须观察对话并保留信息以供后续查询。与传统的人与助手交互场景不同,这些环境本质上是多模态的,涉及回指和指示等复杂的语篇现象,并且包含来自多个参与者的异步或相互冲突的信息。然而,现有的记忆基准主要关注单用户、纯文本交互,未能捕捉这些挑战。为填补这一空白,我们引入了H2HMem,一个用于评估复杂人际交互中记忆能力的人-人多模态记忆基准。H2HMem包括双人和多人对话,包含多模态信息流,并从三个维度评估智能体:记忆召回、推理和应用。对先进智能体的实验揭示了它们在跨模态、跨参与者和跨会话地构建、保留和利用记忆方面的显著局限性,凸显了下一代LLM智能体仍有很大的改进空间。

英文摘要

Large language model agents are increasingly deployed in human-human interaction settings, such as meeting assistants and clinical documentation systems, where they must observe conversations and retain information for downstream queries. Unlike traditional human-assistant settings, these environments are inherently multimodal, involve complex discourse phenomena such as anaphora and deixis, and contain asynchronous or conflicting information from multiple participants. However, existing memory benchmarks largely focus on single-user, text-only interactions, failing to capture these challenges. To address this gap, we introduce H2HMem, a Human-to-Human Multimodal Memory Benchmark for evaluating memory capabilities in complex human-human interactions. H2HMem includes both dyadic and multi-party conversations with multimodal information streams, and evaluates agents along three dimensions: memory recall, reasoning, and application. Experiments with advanced agents reveal substantial limitations in constructing, retaining, and utilizing memories across modalities, participants, and sessions, highlighting substantial room for improvement in next-generation LLM agents.

2606.09613 2026-06-09 cs.CL cs.AI 新提交

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

AGENTSERVESIM:面向多轮LLM智能体服务的硬件感知模拟器

Rakibul Hasan Rajib, Mengxin Zheng, Qian Lou

发表机构 * University of Central Florida(中佛罗里达大学)

AI总结 提出AGENTSERVESIM模拟器,通过程序编排器、工具模拟器、会话感知路由器和KV驻留模型等模块,在程序粒度上评估多轮LLM智能体服务策略,在CPU上以6%误差复现真实系统行为。

Comments Preprint

详情
AI中文摘要

多轮LLM智能体将模型调用与外部工具调用交织在一起,将服务从无状态请求处理转变为有状态程序执行。处理这些工作负载需要利用程序级上下文的调度、KV缓存管理和路由策略,包括轮次依赖、工具引入的间隙和可重用的KV状态。直接在真实系统上评估此类策略成本高昂,因为每个设计点可能需要跨到达率、模型规模、服务实例数量和内存层次结构的专用加速器时间。模拟提供了一种可扩展的替代方案,但现有的LLM服务模拟器针对无状态请求级工作负载,因此忽略了智能体服务的核心动态:多轮程序执行、跨轮缓存局部性以及工具间隙期间的KV缓存驻留。我们提出了AGENTSERVESIM,一种面向多轮LLM智能体服务的硬件感知模拟器。AGENTSERVESIM通过可组合模块在程序粒度上评估服务策略:程序编排器保留程序标识和轮次顺序,工具模拟器实现工具引入的间隙,会话感知路由器维护程序到实例的亲和性以实现缓存感知调度,KV驻留模型跟踪策略定义的跨HBM、主机DRAM/CXL和驱逐的KV放置。在真实服务部署和硬件配置上,AGENTSERVESIM在关键性能指标上的误差在6%以内,且完全在普通CPU上运行。这些结果表明,AGENTSERVESIM能够在不需在昂贵加速器上全面部署的情况下,实现受控、可重复的智能体服务策略探索。

英文摘要

Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, including turn dependencies, tool-induced gaps, and reusable KV state. Evaluating such policies directly on real systems is costly, since each design point may require dedicated accelerator time across arrival rates, model scales, serving-instance counts, and memory hierarchies. Simulation offers a scalable alternative, but existing LLM serving simulators target stateless request-level workloads and therefore omit the core dynamics of agent serving: multi-turn program execution, cross-turn cache locality, and KV-cache residency during tool gaps. We present AGENTSERVESIM, a hardware-aware simulator for multi-turn LLM agent serving. AGENTSERVESIM evaluates serving policies at program granularity through composable modules: a Program Orchestrator preserves program identity and turn order, a Tool Simulator materializes tool-induced gaps, a Session-Aware Router maintains program-to-instance affinity for cache-aware dispatch, and a KV Residency Model tracks policy-defined KV placement across HBM, host DRAM/CXL, and eviction. Across real serving deployments and hardware configurations, AGENTSERVESIM reproduces real-system behavior within 6% error across key performance metrics while running entirely on commodity CPUs. These results show that AGENTSERVESIM enables controlled, repeatable exploration of agent-serving policies without requiring exhaustive deployment on costly accelerators.

2606.07534 2026-06-09 cs.IR cs.CL 交叉投稿

PulseBench-Tab: A Multilingual Benchmark for Table Extraction with Graph-Based Evaluation

PulseBench-Tab:基于图评估的多语言表格提取基准

Ritvik Pandey, Sid Manchkanti, Mohammed Wazir Adain, Mohammed Hadi, Dushyanth Sekhar

发表机构 * Pulse AI Georgia Institute of Technology(佐治亚理工学院) S&P Global, Enterprise Data Organization(S&P全球企业数据部门)

AI总结 提出包含9种语言、1820个标注表格的多语言基准PulseBench-Tab,并设计基于单元格邻接有向图的新评估指标T-LAG,通过最优二分匹配统一衡量结构和内容保真度。

Comments 14 pages, 5 figures, 8 tables. Dataset: https://huggingface.co/datasets/pulse-ai/PulseBench-Tab Code: https://github.com/Pulse-Software-Corp/PulseBench-Tab

详情
AI中文摘要

我们推出了PulseBench-Tab,一个用于评估从文档图像中提取表格的开放多语言基准。该基准包含1,820个人工标注的表格,涵盖9种语言和4种文字系统(拉丁、中日韩、阿拉伯、西里尔),来自380份真实世界源文档,包括财务申报、政府报告和监管披露。表格的单元格数量从2到1,183不等,其中48.1%包含合并或跨行/列单元格。除了数据集,我们还提出了T-LAG(表格逻辑邻接图),一种新颖的评估指标,将表格建模为基于单元格邻接的有向图,并通过最优二分匹配在单一分数中计算结构和内容保真度。我们评估了9个商业和开源表格提取系统在基准上的表现,并报告了每种语言的细分结果。完整数据集、评分代码以及所有提供商的输出均已公开。

英文摘要

We introduce PulseBench-Tab, an open multilingual benchmark for evaluating table extraction from document images. The benchmark comprises 1,820 human-annotated tables spanning 9 languages and 4 scripts (Latin, CJK, Arabic, Cyrillic), drawn from 380 real-world source documents including financial filings, government reports, and regulatory disclosures. Tables range from 2 to 1,183 cells, with 48.1% containing merged or spanning cells. Alongside the dataset, we propose T-LAG (Table Logical Adjacency Graph), a novel evaluation metric that models tables as directed graphs over cell adjacencies and computes structural and content fidelity in a single score via optimal bipartite matching. We evaluate 9 commercial and open-source table extraction systems across the benchmark and report per-language breakdowns. The full dataset, scoring code, and all provider outputs are publicly available.
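The idea of scoring a table through a directed graph over cell adjacencies can be sketched as follows. This is a deliberately simplified illustration, not T-LAG itself: it assumes rectangular grids without merged cells, and it replaces the optimal bipartite matching step with exact matching of edge labels, scored by F1.

```python
from collections import Counter

def adjacency_edges(grid):
    """Directed edges (source text, direction, target text) between neighbours."""
    edges = Counter()
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if c + 1 < len(row):
                edges[(cell, "right", row[c + 1])] += 1
            if r + 1 < len(grid):
                edges[(cell, "down", grid[r + 1][c])] += 1
    return edges

def edge_f1(reference, predicted):
    """F1 over the multisets of adjacency edges (exact-match simplification)."""
    ref, pred = adjacency_edges(reference), adjacency_edges(predicted)
    overlap = sum((ref & pred).values())
    total = sum(ref.values()) + sum(pred.values())
    return 2 * overlap / total if total else 1.0

reference = [["Year", "Revenue"], ["2023", "10"], ["2024", "12"]]
swapped = [["Year", "Revenue"], ["2023", "12"], ["2024", "10"]]  # two values swapped

print(edge_f1(reference, reference))
print(edge_f1(reference, swapped))
```

Every cell string in `swapped` is correct, yet the score drops sharply because most adjacency edges change. A cell-content-only metric would miss this kind of structural error.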

2606.07616 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

项目反应缩放定律:一种高效且可泛化的神经缩放估计的测量理论方法

Sang Truong, Yuheng Tu, Rylan Schaeffer, Sanmi Koyejo

AI总结 提出项目反应缩放定律(IRSL),将项目反应理论融入缩放定律框架,通过Beta-IRT模型利用语言模型的概率响应,将参数复杂度从O(M×N)降至O(M+N),在预训练和测试时缩放场景中仅用50个问题即可实现可靠估计。

详情
AI中文摘要

缩放定律为理解语言模型(LM)的性能提供了基本框架,但推导它们需要在数千个检查点或数百万个推理样本上进行成本高昂的评估。为了解决这个问题,我们引入了项目反应缩放定律(IRSL),这是一个将项目反应理论(IRT)整合到缩放定律框架中的统一框架。与将每个模型-基准对单独处理的传统方法不同,IRSL将潜在模型能力与问题特征分离,将M个模型和N个问题的缩放定律估计分解,从而将参数复杂度从O(M×N)显著降低到O(M+N)。我们使用Beta-IRT实例化IRSL,它利用LM的经验概率响应——例如预训练中的token概率和测试时采样中的通过率——来捕获比二元响应更丰富的信号。我们在两种常见的缩放范式上验证了我们的方法:(1)预训练下游缩放,使用来自10个基准的6,612个LM检查点和37,682个问题;以及(2)测试时缩放,使用来自4个基准的12个LM和120个问题,每个问题最多2,500个样本。在现有模型响应上进行一次性校准后,IRSL仅使用每个基准50个问题(减少99.9%)即可产生更可靠的缩放估计,达到与传统方法相当或更优的决策准确性。此外,我们表明估计的潜在模型能力是可泛化的,从而能够跨共享相同测量目标的基准进行准确的性能预测。

英文摘要

Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference samples. To address this, we introduce Item Response Scaling Laws (IRSL), a unified framework that integrates Item Response Theory (IRT) within the scaling law framework. Unlike traditional approaches that treat each model-benchmark pair in isolation, IRSL disentangles latent model ability from question characteristics, factorizing the scaling law estimation for $M$ models and $N$ questions to significantly reduce parameter complexity from $O(M \times N)$ to $O(M + N)$. We instantiate IRSL with Beta-IRT, which leverages the empirical probability responses of LMs, such as token probabilities in pre-training and pass rates in test-time sampling, to capture richer signals than binary responses. We validate our approach across two prevalent scaling paradigms: (1) pre-training downstream scaling, using 6,612 LM checkpoints and 37,682 questions from 10 benchmarks; and (2) test-time scaling, using 12 LMs and 120 questions from 4 benchmarks with up to 2,500 samples per question. Given a one-time calibration on existing model responses, IRSL yields more reliable scaling estimates using only 50 questions per benchmark (a 99.9% reduction), achieving comparable or superior decision accuracy to traditional approaches. Furthermore, we show that the estimated latent model abilities are generalizable, enabling accurate performance forecasting across benchmarks that share the same measurement objective.
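The drop from $O(M \times N)$ to $O(M + N)$ follows from giving each model one ability parameter and each question its own parameters, instead of one free value per model-question pair. The sketch below uses a plain logistic item response curve as a stand-in; this is an assumption, since the paper's Beta-IRT models probability-valued responses, and the parameter names `theta`, `a`, and `b` are illustrative.

```python
import math

def expected_response(theta, a, b):
    """Logistic item response: expected correctness for ability theta
    on a question with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def parameter_counts(num_models, num_questions, per_question=2):
    """Free parameters: one per (model, question) pair vs. a factorized model."""
    pairwise = num_models * num_questions
    factorized = num_models + per_question * num_questions
    return pairwise, factorized

# Counts for the pre-training setting quoted in the abstract.
pairwise, factorized = parameter_counts(6612, 37682)
print(pairwise, factorized)

# A stronger model (higher theta) gets a higher expected score on the same item.
print(expected_response(theta=0.0, a=1.5, b=0.0))
print(expected_response(theta=1.0, a=1.5, b=0.0))
```

Because question parameters are shared across models, a new checkpoint only needs enough responses to pin down its single ability value. This is consistent with the claim that a small question subset can suffice after calibration.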

2606.08034 2026-06-09 cs.CV cs.AI cs.CL 交叉投稿

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

Sci-Rho:面向STEM问题的多语言视觉基础符号基准

Muhammad Falensi Azmi, Ikhlasul Akmal Hanif, Vallerie Alexandra Putra, Adi Yeltay, Abdullah Mubarak, Fajri Koto

发表机构 * Independent Researcher(独立研究员) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Binus University(比努斯大学) Bandung Institute of Technology(万隆理工学院)

AI总结 提出Sci-Rho,一个多语言、视觉基础的STEM问题动态基准,包含4242个模板和42420个实例,评估17个VLM发现最差精度与平均精度存在差距,且小模型跨语言性能下降。

Comments 22 pages

详情
AI中文摘要

符号基准已成为评估模型在STEM相关问题微小修改下鲁棒性的关键方法。然而,现有符号基准大多局限于数学推理,缺乏视觉基础,且主要以英语为主。在这项工作中,我们引入了Sci-Rho(科学鲁棒性),一个面向视觉基础STEM问题的动态基准,涵盖五个学科和七种语言,包含由领域专家(包括奥林匹克奖牌得主)精心设计的4,242个问题模板(每种语言606个)。每个模板实现为可执行的Python代码,通过改变数值、视觉模式、几何形状、颜色方案和函数类型,生成多样但等价的问题实例,总共产生42,420个实例,每个实例都配有推理步骤和真实解决方案。我们评估了17个最先进的VLM,发现最差情况准确率(定义为模型在每种生成变体上均正确回答的问题模板比例)与平均准确率之间存在明显差距。我们还发现,较小的模型在不同语言上表现出显著的性能下降,而专有模型和较大模型保持鲁棒。步骤级评估反映了相同的趋势,揭示了平均F1与最差情况F1分数之间的显著差距。最后,我们对VLM注意力头的检查显示,图像标记与文本标记的相对注意力分配存在显著的跨语言变化。我们的工作强调了超越静态基准的评估作为衡量VLM质量指标的重要性。

英文摘要

Symbolic benchmarks have emerged as a key approach to assess model robustness under minor modifications to STEM-related questions. However, existing symbolic benchmarks mostly remain limited to mathematical reasoning, lack visual grounding, and are predominantly in English. In this work, we introduce Sci-Rho (Science Rhobustness), a dynamic benchmark for visually-grounded STEM problems spanning five subjects and seven languages, comprising 4,242 problem templates (606 per language) crafted by domain experts, including Olympiad medalists. Each template is implemented as executable Python code that generates diverse but equivalent problem instances by varying numerical values, visual patterns, geometric shapes, color schemes, and function types, resulting in 42,420 instances in total, each paired with reasoning steps and ground-truth solutions. We evaluated 17 state-of-the-art VLMs and discovered a noticeable gap between worst-case accuracy (defined as the proportion of problem templates that a model answers correctly across every generated variation) and average accuracy. We also discovered that smaller models show noticeable performance degradation across languages, whereas proprietary and larger models remain robust. Step-level evaluation reflects this same trend, revealing a significant gap between average F1 and worst-case F1 scores. Finally, our inspection of attention heads of a VLM reveals substantial cross-lingual variation in the relative attention allocated to image tokens compared to text tokens. Our work highlights the importance of evaluation beyond static benchmarks as a metric to measure the quality of VLMs.
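The two accuracy notions contrasted above can be computed from per-template results as follows. This is a minimal sketch on invented outcomes; the template names and values are hypothetical.

```python
def average_accuracy(results):
    """Fraction of all generated instances answered correctly."""
    outcomes = [ok for variants in results.values() for ok in variants]
    return sum(outcomes) / len(outcomes)

def worst_case_accuracy(results):
    """Fraction of templates answered correctly on every generated variation."""
    return sum(all(variants) for variants in results.values()) / len(results)

# Hypothetical outcomes: template id -> correctness of each generated variant.
results = {
    "t1": [True, True, True],
    "t2": [True, False, True],
    "t3": [True, True, False],
    "t4": [True, True, True],
}

print(average_accuracy(results))     # 10 of 12 instances correct
print(worst_case_accuracy(results))  # only 2 of 4 templates fully solved
```

A single failed variant removes a whole template from the worst-case count, which is why the worst-case figure can sit far below the average.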

2606.08036 2026-06-09 cs.IR cs.AI cs.CL 交叉投稿

GIScholarBench: Benchmarking LLM Overconfidence in GIS Research

GIScholarBench: 在GIS研究中评估大语言模型的过度自信

Zongrng Li, Mingzheng Yang, Lei Zou, Hongxu Ma, Hao Tian, Siqi Zhou, Wenjing Gong, Kaili Zhang, Bingqian Chen, Mitch Zhang, Yifan Yang

发表机构 * Texas A&M University(德克萨斯农工大学) Google(谷歌) Department of Geography(地理系) Department of Landscape Architecture and Urban Planning(景观建筑与城市规划系)

AI总结 针对大语言模型在学术研究中的过度自信问题,构建了包含10865篇论文的GIScholarBench基准,通过元数据检索、文献链接和研究方向生成三项任务评估模型表现,发现所有模型均存在任务不变的过度自信现象。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于学术研究工作流程,但学术任务需要很高的事实精度,因此暴露了一个关键弱点:过度自信。这里,过度自信是从行为上定义的,即在底层知识不完整或不可验证时,仍倾向于产生自信、果断且格式良好的输出,而不是定义为所陈述的置信度与准确率之间的校准差距。为了研究这一问题,我们引入了GIScholarBench,这是一个基于2020年至2025年间发表在25个核心GIScience期刊上的10865篇论文构建的基准。该基准涵盖三个认知复杂度递增的任务:元数据检索、文献链接和研究方向生成。我们通过原生网络界面在真实用户条件下评估了Claude Sonnet 4.5、Gemini 3和ChatGPT 5.3。结果显示所有任务均存在一致的过度自信。在元数据检索中,ChatGPT 5.3取得了最高准确率,但所有模型在预测错误时仍生成确定的标题和DOI。在文献链接中,Claude Sonnet 4.5恢复了最多的参考文献,但所有模型在排名靠前的检索结果和更长的引文列表之间显示出明显差距,表明参考文献被扩展到了可靠检索能力之外。在研究方向生成中,与真实的后续引用论文相比,AI生成的方向显示出更低的主题覆盖率、更高的新颖方向遗漏率和更低的语义多样性。这些发现表明,LLM的过度自信不随任务变化,但表现形式不同:检索中的事实过度生成、文献链接中不可靠的引文扩展,以及研究构思中对输出完整性的过度自信。

英文摘要

Large language models (LLMs) are increasingly used in academic research workflows, but scholarly tasks require high factual precision and therefore expose a key weakness: overconfidence. Here, overconfidence is defined behaviorally as the tendency to produce confident, assertive, and well-formatted outputs even when the underlying knowledge is incomplete or unverifiable, rather than as a calibration gap between stated confidence and accuracy. To examine this issue, we introduce GIScholarBench, a benchmark built from 10,865 papers published in 25 core GIScience journals between 2020 and 2025. The benchmark covers three tasks with increasing cognitive complexity: metadata retrieval, literature linking, and research direction generation. We evaluate Claude Sonnet 4.5, Gemini 3, and ChatGPT 5.3 through their native web interfaces under real-world user-facing conditions. Results show consistent overconfidence across all tasks. In metadata retrieval, ChatGPT 5.3 achieves the highest accuracy, but all models still generate definitive titles and DOIs when predictions are wrong. In literature linking, Claude Sonnet 4.5 recovers the most references, but all models show a clear gap between top-ranked retrieval and longer citation lists, suggesting that references are extended beyond reliable retrieval capacity. In research direction generation, AI-generated directions show lower topic coverage, higher novel miss rates, and lower semantic diversity than real future-citing papers. These findings suggest that LLM overconfidence is task-invariant but takes different forms: factual overgeneration in retrieval, unreliable citation expansion in literature linking, and overconfidence in output completeness during research ideation.

2606.08087 2026-06-09 cs.SD cs.CL 交叉投稿

Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference

评估神经说话人验证模型在训练和推理中的能耗与碳排放

Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier

发表机构 * LIA, UPR 4128 Aday Avignon University(阿维尼翁大学)

AI总结 本研究通过测量不同ResNet架构在VoxCeleb2上的能耗与碳排放,发现模型加深或加宽带来边际精度提升但能耗剧增,而中等规模网络(如ResNet-50)能实现性能与环境影响的良好平衡。

Comments Accepted to Speaker Odyssey 2026 Lisbon

详情
AI中文摘要

深度学习说话人验证(SV)越来越依赖于深度神经网络骨干,但其环境影响仍缺乏记录。本文对在VoxCeleb2上训练的ResNet架构进行了评估,变化深度、通道宽度和阶段分布,并使用节点级传感器测量能耗和碳足迹。结果显示明显的收益递减点:更深或更宽的模型仅带来边际精度提升,而能耗急剧增长。相比之下,中等规模网络如ResNet-50和阶段集中变体在性能与环境影响之间实现了有利的权衡。这些发现为设计节能的SV系统提供了可操作的指导方针。

英文摘要

Deep-learning speaker verification (SV) increasingly relies on deep neural network backbones, whose environmental impact remains largely undocumented. In this paper, we conduct an evaluation of ResNet architectures trained on VoxCeleb2, varying depth, channel width, and stage distribution, and measure energy consumption and carbon footprint using node-level sensors. Results show a clear point of diminishing returns: deeper or wider models bring only marginal accuracy gains while energy consumption grows steeply. In contrast, mid-sized networks such as ResNet-50 and stage-concentrated variants achieve favorable trade-offs between performance and environmental impact. These findings provide actionable guidelines for designing energy-efficient SV systems.

2606.08239 2026-06-09 cs.AI cs.CL cs.CV 交叉投稿

When No Answer Is Correct: Diagnosing Absent Answer Detection for MLLMs in Video Understanding

当没有正确答案时:诊断视频理解中多模态大语言模型的缺失答案检测

Yiheng Wang, Yueqian Lin, Lichen Zhu, Yudong Liu, Hai "Helen" Li, Yiran Chen

发表机构 * Duke University(杜克大学)

AI总结 研究多模态大语言模型在视频理解中检测缺失答案的能力,发现模型倾向于选择干扰项而非识别无正确答案,时间推理任务中问题更严重,链式思维提示虽提升检测率但仍不理想。

Comments Under review

详情
AI中文摘要

多模态大语言模型在视频理解方面取得了实质性进展,但其响应的可靠性仍未得到充分探索。本文对视频理解中多模态大语言模型的缺失答案检测进行了诊断研究,其中正确答案被故意排除在候选集之外,而一个可靠的模型应能识别出没有有效选项。我们在三种设置下评估缺失答案检测行为:带有“以上皆非”选项的多选题、带有检测指令的开放式生成,以及没有任何指导的标准评估。在多种模型和基准测试中,我们发现多模态大语言模型压倒性地选择合理的干扰项,而不是检测到缺失答案。这种失败在时间推理任务中更为明显,并且随着帧采样密度的增加而恶化。我们进一步探索了链式思维提示作为缓解策略,发现虽然它显著提高了检测率,但性能仍不令人满意,这表明仅基于提示的策略不足以完全解决这一局限性。这些发现揭示了缺失答案检测中的系统性失败,并强调了在多模态系统中需要明确的检测机制。

英文摘要

Multimodal large language models (MLLMs) have made substantial advancements in video understanding, yet the reliability of their responses remains underexplored. This work presents a diagnostic study of absent answer detection for MLLMs in video understanding, where the correct answer is deliberately excluded from the candidate set and a reliable model is expected to recognize that no valid option exists. We evaluate the absent answer detection behavior under three settings: multiple-choice questions augmented with a "None of the Above" option, open-ended generation with a detection instruction, and standard evaluation without any guidance. Across a diverse set of models and benchmarks, we find that MLLMs overwhelmingly select plausible distractors rather than detecting the absent answer. This failure is more pronounced in temporal reasoning tasks and worsens with denser frame sampling. We further explore chain-of-thought prompting as a mitigation strategy and find that while it substantially improves detection rates, performance remains unsatisfactory, suggesting that prompting-based strategies alone are insufficient to fully address this limitation. These findings expose a systematic failure in absent answer detection and highlight the need for explicit detection mechanisms in multimodal systems.

2606.08400 2026-06-09 cs.SE cs.AI cs.CL 交叉投稿

Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

历史与模型对LLM评分的影响:高级软件工程课程研究

Qilin Zhou, Zhuo Wang, Yue Li, W. K. Chan

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 针对研究生阅读报告评分负担重的问题,提出人机协同的LLM辅助评分流程,基于180份作业评估Grok和GPT的评分一致性与人类对齐,发现交互历史导致评分标准漂移,需特定操作缓解不公平。

Comments 5 pages, accepted by ISET 2026

详情
AI中文摘要

研究生级别的科研阅读报告评估给教育工作者带来了沉重的劳动负担。虽然大型语言模型(LLM)在自动化学术评分方面具有巨大潜力,但它们在此专门任务上的可靠性仍研究不足,特别是评分一致性方面,其缺失是教育公平的主要障碍。本文提出了一种与人类对齐的LLM辅助评分工作流程,并基于来自研究生高级软件工程课程的180份学生作业进行了案例研究。我们评估了两种主流LLM——Grok和GPT——在评分一致性和与人类分数对齐方面的表现。我们发现LLM表现出不同水平的模型内一致性和显著的模型间评分不一致性,而简单的集成方法无法改善与人类评估的对齐。关键的是,连续的交互历史导致模型的评分标准系统地偏离人类专家评分。我们的研究结果表明,LLM在减轻研究生教育中教育工作者的评分负担方面具有潜力,同时强调不加区分地使用LLM评分可能会引入系统性不公平,表明需要特定的操作实践来减轻这种差异。

英文摘要

Graduate-level research reading report assessment creates a substantial labor burden for educators. While large language models (LLMs) hold great potential for automating academic grading, their reliability for this specialized task remains understudied, particularly regarding grading consistency, the lack of which represents a primary obstacle to educational fairness. This paper proposes a human-aligned LLM-assisted grading workflow and presents a case study based on 180 student submissions from a graduate advanced software engineering course. We evaluate two mainstream LLMs, Grok and GPT, in terms of grading consistency and alignment with human scores. We find LLMs exhibit distinct levels of intra-model consistency and significant inter-model grading inconsistencies, while simple ensemble approaches cannot improve alignment with human evaluation. Critically, continuous interaction history drives systematic drift in models' grading standards away from human expert scores. Our findings demonstrate LLMs' potential in reducing grading workload for educators in graduate education, while highlighting that indiscriminate LLM grading may introduce systemic unfairness, suggesting that specific operational practices are required to mitigate such disparities.

2606.08517 2026-06-09 cs.LG cs.CL 交叉投稿

A Joint Finite-Sample Certificate for Adaptive Selective Conformal Risk Control

自适应选择性共形风险控制的联合有限样本证书

Xiaoli Yu, Jiamiao Liu

发表机构 * Chongqing University of Posts and Telecommunications(重庆邮电大学) Army Medical University (Third Military Medical University)(陆军军医大学(第三军医大学))

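AI summary Proposes a joint finite-sample certificate that simultaneously upper-bounds the selected risk and lower-bounds the acceptance probability and deployment utility, valid under adaptive threshold selection. It combines an empirical-Bernstein bound on the ratio risk with further confidence bounds; on ImageNet and COCO it opens a +22 percentage-point certified-acceptance frontier over Hoeffding-CRC and is about 10x tighter than a non-vacuous matched-valid baseline.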

详情
AI中文摘要

选择性预测器在置信输入上做出预测,否则弃权;安全部署需要一个单一的有限样本证书,同时上界所选风险、下界接受概率 $p_{\mathrm{acc}}$ 高于下限 $p_{\min}$,并下界部署效用。该证书必须在从 $n_{\mathrm{cert}}$ 样本上的有限网格 $m$ 对中进行自适应阈值选择时有效。我们通过将所选风险直接视为比率而非通过Hoeffding式范围界,为有界、可能非单调的损失给出了这样的证书。该构造耦合了三个置信界:比率风险的方差自适应经验伯恩斯坦界、接受概率的Clopper-Pearson界以及效用的双边接近界。它们共同下界认证策略的绝对效用,并且与认证集上的最优策略相差不超过 $2\gamma_u$,两者在可行时均非平凡;一个按场景划分的第三部分与外部预言机匹配,仅在风险边际 $\gamma_r < \alpha$ 时有信息量,在主要操作点处为空。相对于仅范围Hoeffding比率构造,这使接受下限依赖从 $1/p_{\min}$ 变为 $1/\sqrt{p_{\min}}$,并且一个闭式推论识别出每对场景,其中我们的风险界优于Hoeffding共形风险控制(Hoeffding-CRC)选择性界。实验上,在ImageNet(三个ResNet)和COCO val 2017全景分割上,该证书比Hoeffding-CRC打开了+22个百分点的认证接受前沿,并且比非平凡匹配验证基线紧致约10倍;这些增益是按场景的,非普适的,在ADE20K上不存在。认证器运行时间为 $O(n_{\mathrm{cert}} m)$。

英文摘要

Selective predictors answer on confident inputs and abstain elsewhere; deploying one safely needs a single finite-sample certificate that simultaneously upper-bounds the selected risk, lower-bounds the acceptance probability $p_{\mathrm{acc}}$ above a floor $p_{\min}$, and lower-bounds the deployment utility. This certificate must be valid under adaptive threshold selection from a finite grid of $m$ pairs on $n_{\mathrm{cert}}$ samples. We give such a certificate for bounded, possibly non-monotone losses by treating the selected risk directly as a ratio rather than through a Hoeffding-style range bound. The construction couples three confidence bounds: a variance-adaptive empirical-Bernstein bound on the ratio risk, a Clopper-Pearson bound on acceptance, and a two-sided closeness bound on utility. Together they lower-bound the certified policy's utility absolutely and to within $2\gamma_u$ of the best over the certified set, both non-vacuous whenever feasible; a regime-scoped third leg matches an external oracle, informative only where the risk margin $\gamma_r < \alpha$ and vacuous at the headline operating points. Relative to the range-only Hoeffding-ratio construction, this sharpens the acceptance-floor dependence from $1/p_{\min}$ to $1/\sqrt{p_{\min}}$, and a closed-form corollary identifies a per-pair regime in which our risk bound dominates a Hoeffding conformal risk control (Hoeffding-CRC) selective bound. Empirically, on ImageNet (three ResNets) and COCO val 2017 panoptic, the certificate opens a $+22$ pp certified-acceptance frontier over Hoeffding-CRC and is $\approx 10\times$ tighter than a non-vacuous matched-valid baseline; these gains are regime-scoped, not universal, and absent on ADE20K. The certifier runs in $O(n_{\mathrm{cert}} m)$ time.
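
Two of the ingredients named in the abstract have well-known textbook forms, sketched below in Python under stated assumptions. This is not the paper's ratio-risk construction: `clopper_pearson_lower` is the classical one-sided exact binomial bound, and `empirical_bernstein_upper` is the Maurer-Pontil empirical-Bernstein bound for a plain mean of losses in [0, 1]. Both function names are hypothetical.

```python
import math
import statistics


def clopper_pearson_lower(k, n, delta):
    """One-sided Clopper-Pearson lower bound on a binomial proportion.

    Returns p_L such that P(Bin(n, p_L) >= k) = delta, found by bisection.
    """
    if k == 0:
        return 0.0

    def upper_tail(p):
        return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i)
                   for i in range(k, n + 1))

    lo, hi = 0.0, 1.0
    for _ in range(60):  # the upper tail is increasing in p
        mid = (lo + hi) / 2
        if upper_tail(mid) < delta:
            lo = mid
        else:
            hi = mid
    return lo


def empirical_bernstein_upper(losses, delta):
    """Maurer-Pontil upper bound on the mean of i.i.d. losses in [0, 1].

    Holds with probability at least 1 - delta; needs at least two samples.
    """
    n = len(losses)
    mean = statistics.fmean(losses)
    var = statistics.variance(losses)  # unbiased sample variance
    log_term = math.log(2 / delta)
    return mean + math.sqrt(2 * var * log_term / n) + 7 * log_term / (3 * (n - 1))
```

For adaptive selection over a grid of $m$ candidates, a simple (and conservative) option is to call each bound with `delta / m`, which is a union bound and again an assumption rather than the paper's procedure.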

2606.08679 2026-06-09 stat.ML cs.CL cs.LG stat.ME 交叉投稿

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

排行榜的排名区间:模型评估的分层框架

Bitya Neuhof, Yuval Benjamini

发表机构 * Department of Statistics and Data Science(统计与数据科学系)

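AI summary Proposes a hierarchical framework that quantifies model-rank uncertainty with statistical guarantees, using task-level confidence intervals and leaderboard-level prediction intervals.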

详情
AI中文摘要

预训练模型通常在多任务排行榜上评估,以衡量其在不同场景中的适用性。然而,当前将跨任务性能聚合为排行榜级排名的方法并未解决任务层面的不确定性和变异性。尽管近期工作提出了基于区间的模型排名,但从单个任务到排行榜级排名的不确定性的原则性聚合仍未解决,且模型在不同任务上的性能变化常被掩盖。本文引入一个分层框架,在两层上构建具有统计保证的模型排名区间:通过成对比较构建任务级排名置信区间,以及使用共形方法构建排行榜级排名预测区间。这使得能够对每个观测任务和新潜在任务进行可靠的模型排名量化。在模拟数据以及TabArena和PromptEval(MMLU)基准上的实验表明,我们的方法产生统计有效且信息丰富的区间,从而在排行榜上实现可靠、具有不确定性意识的模型排名。

英文摘要

Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains unaddressed, and variation in models' performance across tasks is frequently obscured. In this work, we introduce a hierarchical framework that constructs model rank intervals with statistical guarantees at both levels: task-level rank confidence intervals from pairwise comparisons, and leaderboard-level rank prediction intervals using a conformal approach. This enables reliable quantification of model rank for each observed task and for new potential tasks. Experiments on simulated data and the TabArena and PromptEval (MMLU) benchmarks show that our method yields statistically valid and informative intervals, enabling reliable, uncertainty-aware model ranking on leaderboards.
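
One common way to turn pairwise comparisons into a rank interval is to count how many competitors are significantly better or worse. The Python sketch below illustrates that idea only; it is not the paper's procedure. It assumes paired per-example scores, a normal approximation with `z = 1.96`, and no multiplicity correction, and `rank_intervals` is a hypothetical name.

```python
import math
import statistics
from itertools import permutations


def rank_intervals(scores, z=1.96):
    """scores: dict model -> list of per-example scores on the same examples.

    Returns dict model -> (best possible rank, worst possible rank), rank 1 = best.
    """
    models = list(scores)
    better = {m: 0 for m in models}  # number of models significantly better than m
    worse = {m: 0 for m in models}   # number of models significantly worse than m
    for a, b in permutations(models, 2):
        diffs = [x - y for x, y in zip(scores[a], scores[b])]
        mean = statistics.fmean(diffs)
        half_width = z * statistics.stdev(diffs) / math.sqrt(len(diffs))
        if mean - half_width > 0:  # a is significantly better than b
            better[b] += 1
            worse[a] += 1
    n = len(models)
    return {m: (1 + better[m], n - worse[m]) for m in models}
```

A model that cannot be separated from a neighbor gets an interval covering both ranks, which is the kind of uncertainty a single leaderboard position hides.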

2606.08722 2026-06-09 cs.SD cs.CL 交叉投稿

Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

LLM 能否理解 LilyPond?一个用于符号音乐生成与理解的基准

Matteo Spanio, Mohammad Torabi, Andrea Poltronieri, Antonio Rodà

发表机构 * University of Padova(帕多瓦大学) Universitat Pompeu Fabra(庞培法布拉大学)

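AI summary Introduces LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and understanding in open-weight LLMs. Experiments show that executable LilyPond can be generated zero-shot, that structural understanding tasks remain challenging, and that metrics disagree systematically.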

Comments Accepted at Ital-IA 2026

详情
AI中文摘要

大型语言模型的符号音乐评估在表示、数据集和指标上仍然碎片化。我们引入了 LilyBench,一个基于 LilyPond 的基准,用于在同一系列开源权重 LLM 上联合评估符号音乐生成和音乐理解。该基准包括一个 200 个提示的生成套件和十个从 ABC-Eval 改编的理解任务,涵盖语法、元数据预测、结构排序和音乐识别。生成质量通过编译率、基于 Jensen-Shannon 相似度的 MusPy 描述符分布以及基于 LilyBERT 的 Fréchet 音乐距离 (FMD) 进行评估。在四个开源模型上的实验表明,在零样本设置下可以实现可执行的 LilyPond 生成,而结构理解任务尽管在作曲家和流派识别上表现强劲,但仍然具有挑战性。我们的实验还揭示了基于描述符和基于嵌入的指标之间的系统性分歧,表明符号音乐评估受益于指标三角测量而非单一分数排名。我们发布了基准、提示库和评估代码,以支持未来在符号音乐生成和理解方面的研究,地址为 https://github.com/CSCPadova/lilybench。

英文摘要

Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench
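
The abstract compares descriptor distributions "via Jensen-Shannon similarity" without giving a formula. A minimal Python sketch of one plausible reading follows, assuming the similarity is one minus the base-2 Jensen-Shannon divergence between two histograms over the same bins; the function names are hypothetical and the benchmark's own code may define the score differently.

```python
import math


def js_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two histograms (same bins)."""
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_similarity(p, q):
    """Assumed similarity score in [0, 1]: 1 means identical distributions."""
    return 1.0 - js_divergence(p, q)
```

With base-2 logarithms the divergence is bounded by 1, so the similarity stays in [0, 1] and can be averaged across descriptors.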

2606.08959 2026-06-09 cs.CV cs.CL 交叉投稿

ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

ChinaHeritaQA:面向中国世界遗产地的文化基础视觉问答数据集

Yi Zhang, Bolei Ma, Yong Cao, Chengyan Wu, Daniel Hershcovich, Anna-Carolina Haensch

发表机构 * LMU Munich(慕尼黑大学) FAU Erlangen-Nuremberg(埃尔朗根-纽伦堡大学) Munich Center for Machine Learning(慕尼黑机器学习中心) University of Tübingen & Tübingen AI Center(图宾根大学与图宾根人工智能中心) Sun Yat-sen University(中山大学) University of Copenhagen(哥本哈根大学) University of Maryland, College Park(马里兰大学帕克分校)

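AI summary Introduces ChinaHeritaQA, a multimodal benchmark of 2,279 images and 14,133 bilingual multiple-choice questions across seven cognitive dimensions, for evaluating the cultural reasoning of vision-language models on World Heritage sites in China.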

详情
AI中文摘要

我们介绍了ChinaHeritaQA,这是一个多模态基准数据集,用于评估视觉语言模型(VLM)在中国联合国教科文组织世界遗产地上的文化推理能力。该数据集包含2279张野外图像,配以14133个双语(中文/英文)多项选择题对,涵盖七个认知维度,从基本身份识别到历史分期和建筑分析。在联合国教科文组织对齐的本体论指导下,并通过严格的人工注释验证,该数据集确保了语言质量和事实一致性。对最先进VLM的评估显示,虽然顶级模型在平均表现上超过人类,但出现了显著的任务级差异:模型在视觉识别方面表现出色,但在文化基础推理上存在困难。性能也因朝代和地区而异。ChinaHeritaQA揭示了强大的视觉检索能力并不能延伸到文化和历史理解。我们发布该数据集以支持未来关于文化感知多模态学习的研究。

英文摘要

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.

2606.09046 2026-06-09 cs.LG cs.CL cs.IR 交叉投稿

Decoy-Calibrated Failure Audits for Language Models

语言模型的诱饵校准失败审计

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh

发表机构 * Meta Platforms(Meta平台)

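AI summary Proposes Janus, a procedure that uses decoy calibration and held-out validation to decide whether an explanation of language-model errors is credible enough to report, guarding against selection bias.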

Comments 14 pages, 5 figures, 4 tables

详情
AI中文摘要

有用的审计不仅揭示模型失败的频率,还揭示失败集中在何处。审计员可能测试许多候选解释:长输入、间接问题、分散注意力的证据或这些因素的组合。风险在于选择。观察到的最大效应可能反映真实的失败模式,也可能只是多次尝试中的最佳结果。我们提出Janus,一种决定何时提出的错误解释足够可信以报告的程序。目标不是生成新解释,而是决定哪些解释站得住脚。审计员从固定的模型、标记的评估集和冻结的候选解释列表(我们称之为描述符)开始。Janus通过错误率提升对每个描述符进行评分,然后将真实描述符与具有相同频率但随机分配给示例的虚假描述符进行比较。只有当描述符在用于发现的数据上击败这个诱饵基准,然后在单独的留出数据上重复时,它才被确认。在多表查找任务的受控审计中,Janus识别出植入的失败,确认了长链描述符及其交互。LLM通常在查找链中途停止,而不是到达最终答案。在两个公共基准MuSiQue和LongBench v2上,SliceLine基线标记了看似高错误的区域,但Janus没有确认任何一个。消融实验显示了为什么两个保障措施都很重要。在LongBench v2上,未校准的固定阈值报告了20个描述符,诱饵基准留下一个,而留出检查在其提升从0.36缩小到0.05后拒绝了最后一个。由此产生的原则将提出解释与报告解释分开。候选解释可能来自任何来源,但只有那些击败诱饵并在新数据上复现的才成为审计发现。

英文摘要

Useful audits reveal not only how often a model fails, but also where its failures concentrate. An auditor may test many candidate explanations: long inputs, indirect questions, distracting evidence, or combinations of these factors. The risk is selection. The largest observed effect may reflect a real failure mode, or it may simply be the best result among many tried. We introduce Janus, a procedure for deciding when a proposed error explanation is credible enough to report. The goal is not to generate new explanations, but to decide which ones hold up. The auditor starts with a fixed model, a labeled evaluation set, and a frozen list of candidate explanations, which we call descriptors. Janus scores each descriptor by its error-rate lift, then compares real descriptors with fake ones that have the same frequencies but are randomly assigned to examples. A descriptor is confirmed only if it beats this decoy floor on the data used for discovery and then repeats on separate held-out data. In a controlled audit of multi-table lookup tasks, Janus identifies the planted failure, confirming long-chain descriptors and their interactions. The LLM often stops partway through the lookup chain instead of reaching the final answer. On two public benchmarks, MuSiQue and LongBench v2, the SliceLine baseline flags plausible high-error pockets, but Janus confirms none of them. Ablations show why both safeguards matter. On LongBench v2, an uncalibrated fixed threshold reports 20 descriptors, the decoy floor leaves one, and the holdout check rejects the last one after its lift shrinks from 0.36 to 0.05. The resulting principle separates proposing explanations from reporting them. Candidates may come from any source, but only those that beat decoys and replicate on fresh data become audit findings.
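
The two safeguards described above (a decoy floor and a holdout check) can be sketched in a few lines of Python. Several details are assumptions, not taken from the paper: lift is defined here as the error rate inside the descriptor minus the overall error rate, the floor is a quantile of the best decoy lift over random reshuffles, and the holdout test is a simple minimum-lift threshold. All names are hypothetical.

```python
import random


def lift(errors, mask):
    """Error rate among examples matching the descriptor minus the overall rate."""
    selected = [e for e, m in zip(errors, mask) if m]
    if not selected:
        return 0.0
    return sum(selected) / len(selected) - sum(errors) / len(errors)


def decoy_floor(errors, masks, n_rounds=200, quantile=0.95, seed=0):
    """Quantile of the best lift achieved by frequency-matched random descriptors."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_rounds):
        # Shuffling a mask keeps its frequency but breaks any link to the errors.
        maxima.append(max(lift(errors, rng.sample(mask, len(mask))) for mask in masks))
    maxima.sort()
    return maxima[min(len(maxima) - 1, int(quantile * len(maxima)))]


def confirmed_descriptors(disc_errors, disc_masks, hold_errors, hold_masks,
                          min_holdout_lift=0.1):
    """Keep descriptors that beat the decoy floor and repeat on held-out data."""
    floor = decoy_floor(disc_errors, list(disc_masks.values()))
    return [name for name, mask in disc_masks.items()
            if lift(disc_errors, mask) > floor
            and lift(hold_errors, hold_masks[name]) >= min_holdout_lift]
```

Taking the maximum over all decoys in each round is what accounts for selection: the floor reflects the best result among many tries, not the typical one.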

2606.09080 2026-06-09 cs.LG cs.CL 交叉投稿

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

超越FLOPs:基于GEMM中心分类法的LLM剪枝真实推理加速基准测试

Haozhe Hu, Hao Wu, Anhao Zhao, Longwei Ding, Peiran Yin, Yunpu Ma, Xiaoyu Shen

发表机构 * Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo(宁波数字孪生研究院,东方理工大学(宁波)) Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算学系) Munich Center for Machine Learning, LMU Munich(慕尼黑大学机器学习慕尼黑中心)

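AI summary Proposes a taxonomy of pruning methods based on GEMM dimensions and a unified benchmarking framework that systematically evaluates pruning methods on the acceleration-quality Pareto frontier. Static depth pruning is the strongest option at low quality loss, giving a unified view of pruning-based LLM acceleration.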

Comments 22 pages, 14 figures

详情
AI中文摘要

剪枝已成为加速大语言模型(LLM)推理的主流范式,涵盖了一系列方法,这些方法在token、层、头、维度和注意力模式上移除计算。尽管目标相同,这些剪枝方法会引发根本不同的执行行为,导致实际加速效果严重依赖于硬件和内核实现。因此,不同剪枝家族的实际加速收益仍知之甚少。在这项工作中,我们引入了一种基于GEMM中心的分类法,根据通用矩阵乘法(GEMM)的逻辑M、N和K维度重新组织现有剪枝方法。利用这一抽象,我们构建了一个统一的基准测试框架,能够在剪枝设计空间中进行实现一致的比较,并系统地表征加速-质量帕累托前沿。我们的结果表明,静态深度剪枝仍然是最强的帕累托最优基线,并且在内存受限场景下最接近其理论加速上限。在预填充阶段,前沿从低质量损失(0%-4%)的静态深度,过渡到中等损失(5%-16%)的动态深度,最后到更高损失水平(17%-26%)的静态宽度剪枝。这些发现首次建立了基于剪枝的LLM加速实际极限的统一视图,并为未来的剪枝研究提供了指导。代码可在 https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim 获取。

英文摘要

Pruning has emerged as a dominant paradigm for accelerating large language model (LLM) inference, spanning a broad spectrum of methods that remove computation across tokens, layers, heads, dimensions, and attention patterns. Despite sharing the same objective, these pruning approaches induce fundamentally different execution behaviors, causing realized speedups to depend heavily on hardware and kernel implementations. Consequently, the practical acceleration benefits of different pruning families remain poorly understood. In this work, we introduce a GEMM-centric taxonomy that reorganizes existing pruning methods according to the logical M, N, and K dimensions of general matrix multiplication (GEMM). Leveraging this abstraction, we build a unified benchmarking framework that enables implementation-consistent comparison across the pruning design space and systematically characterizes the acceleration-quality Pareto frontier. Our results show that static depth pruning remains the strongest Pareto-optimal baseline and stays closest to its theoretical acceleration upper bound in memory-bounded scenarios. During prefill, the frontier transitions from static depth at low quality loss (0%-4%), to dynamic depth at moderate loss (5%-16%), and finally to static width pruning at higher loss levels (17%-26%). These findings establish the first unified view of the practical limits of pruning-based LLM acceleration and provide guidance for future pruning research. Code is available at https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim
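
A small Python calculation shows why FLOP reduction alone can mislead. For an M x K by K x N GEMM the FLOP count is 2MNK, while memory traffic grows with MK + KN + MN. Treating M as the number of tokens and K, N as weight dimensions is an assumption made here for illustration; the abstract does not spell out which pruning family maps to which dimension, and the helper names are hypothetical.

```python
def gemm_flops(m, n, k):
    """Multiply-add count for an (m x k) @ (k x n) matrix product."""
    return 2 * m * n * k


def arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte moved, assuming each operand and the output move once."""
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return gemm_flops(m, n, k) / traffic


# Decode-like shape (one token) versus prefill-like shape (many tokens).
decode = arithmetic_intensity(1, 4096, 4096)
prefill = arithmetic_intensity(512, 4096, 4096)
```

Under these assumptions the one-token case performs under one FLOP per byte, so it is limited by weight traffic rather than compute, and halving FLOPs there need not halve latency.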

2606.09380 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

推理竞技场:当可验证奖励不足时的轨迹锦标赛

Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang

发表机构 * University of Cambridge(剑桥大学) Mistral AI

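AI summary Proposes Reasoning Arena, which uses trace tournaments to turn non-diverse reward groups that carry no gradient signal into relative reward signals, and fits a Bradley-Terry model for efficient RL integration. It improves on the RLVR baseline by 7.6% on average on math and coding benchmarks and accelerates training by 27% to 41%.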

Comments 9 pages, 6 figures, 2 tables (17 pages including references and appendices)

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为通过结果监督提升大语言模型推理能力的主流范式。然而,可验证奖励在组级别常常变得无信息:当给定提示的所有采样轨迹获得相同奖励时,组相对优势估计无法提供梯度信号,尽管这些轨迹在推理质量上可能差异显著。我们提出推理竞技场,一种自适应训练框架,将此类非多样奖励组路由至裁判系统而非丢弃。除了检查最终答案,推理竞技场构建轨迹锦标赛,其中推理轨迹进行两两比较以暴露组内更细粒度的偏好,将推理质量转化为丰富的相对奖励信号。为使奖励估计高效,而非穷举比较每一对,每个新轨迹与一个动态更新的先前生成轨迹小池作为锚点进行评估,以高效建立相对排名。然后我们在不完整比较图上拟合Bradley-Terry模型,实现无需二次成对比较的可扩展强化学习集成。实验结果表明,推理竞技场在竞赛数学和编码基准上平均比RLVR基线高出7.6%。通过将原本浪费的零优势样本转化为有用的梯度更新,我们的方法加速训练27%至41%,节省近50%的生成计算量,并显著提升整体推理性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
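
Fitting a Bradley-Terry model to an incomplete comparison graph can be done with the classical minorization-maximization update, sketched below in Python. This is a generic illustration rather than the authors' implementation. It assumes a connected comparison graph, two distinct traces per comparison, and no ties; `fit_bradley_terry` is a hypothetical name.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, n_iter=200):
    """comparisons: iterable of (winner, loser) pairs.

    Returns dict item -> strength, scaled so strengths average to 1.
    """
    wins = defaultdict(int)
    games = defaultdict(int)  # unordered pair -> number of comparisons
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        items.update((winner, loser))

    strength = {i: 1.0 for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = 0.0
            for pair, count in games.items():
                if i in pair:
                    (j,) = pair - {i}
                    denom += count / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        total = sum(updated.values())
        strength = {i: v * len(items) / total for i, v in updated.items()}
    return strength
```

Only pairs that were actually compared enter the update, so the cost scales with the number of comparisons rather than with the square of the number of traces.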

2606.09409 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

正确看起来更好:成对比较揭示准确性排名

Mina Remeli, Moritz Hardt

发表机构 * Max Planck Institute for Intelligent Systems, Tübingen, Germany(马克斯·普朗克智能系统研究所,蒂宾根,德国) Tübingen AI Center(蒂宾根人工智能中心)

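AI summary By converting benchmarks into generative evaluations, the paper finds that model rankings from pairwise comparisons aggregated with Elo agree closely with ground-truth accuracy rankings (Spearman correlation above 0.9). Style and judge bias have only minor effects, but repetition after the final answer (echo) is a causal driver of judge preference.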

Comments Accepted at ICML'26

详情
AI中文摘要

成对比较结合诸如Elo等聚合方法已成为评估生成模型的核心,但人们仍担心它们会奖励肤浅的风格线索或显示裁判偏见。从更积极的角度看,我们表明,当存在真实准确率用于比较时,成对比较得出的模型排名与基于真实准确率的排名高度一致。通过将五个知名基准测试转化为自由形式的生成评估,我们发现Elo排名与准确率排名的Spearman相关系数超过0.9,并且在裁判较弱时显著优于直接评估。此外,风格和裁判偏见对模型排名的影响较小,尽管大多数判断发生在两个候选答案都正确(或都错误)的成对上。在这样的成对比较中,我们发现最终答案后的重复(echo)是裁判偏好的因果驱动因素。

英文摘要

Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.
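
The comparison between Elo rankings and accuracy rankings can be illustrated with a short Python sketch. It uses the standard sequential Elo update and the no-ties Spearman formula; the K-factor, starting rating, and function names are assumptions, and the paper's aggregation may differ.

```python
def elo_ratings(matches, k=16.0, base=1000.0):
    """matches: iterable of (model_a, model_b, score_a), score_a in {1, 0.5, 0}."""
    ratings = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, base)
        rb = ratings.setdefault(b, base)
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400))
        ratings[a] = ra + k * (score_a - expected_a)
        ratings[b] = rb - k * (score_a - expected_a)
    return ratings


def spearman(xs, ys):
    """Spearman rank correlation, assuming no tied values."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order):
            result[index] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

Sequential Elo depends on match order, so in practice ratings are often averaged over several shuffles of the match list before ranks are compared.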

2606.09578 2026-06-09 cs.AI cs.CL cs.IR 交叉投稿

TABVERSE: Benchmarking Cross-Format Table Understanding in LLMs and VLMs

TABVERSE:大语言模型与视觉语言模型中跨格式表格理解的基准测试

Momina Ahsan, Sarfraz Ahmad, Ming Shan Hee, Roy Ka-Wei Lee, Preslav Nakov

发表机构 * Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学) Singapore University of Technology and Design (SUTD)(新加坡科技设计大学)

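AI summary Introduces the TABVERSE benchmark, which holds table content fixed across several structural formats (HTML, Markdown, LaTeX) and rendered images to systematically evaluate LLMs and VLMs on question answering, structural understanding, and structure reconstruction. Representation format substantially affects table understanding.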

Comments 24 pages, 18 tables, 16 figures, Submitted to ARR May 2026

详情
AI中文摘要

大语言模型(LLMs)和视觉语言模型(VLMs)在表格推理任务上的评估日益增多,但表格表示的作用仍未充分探索。实践中,相同的表格内容可能以不同的结构格式出现,如HTML、Markdown和LaTeX,或作为渲染图像。然而,现有评估往往让内容、格式、布局和模态同时变化,使得难以隔离表示效应。我们引入了TABVERSE,一个受控的多模态表格基准,它在多个结构格式和渲染图像中对齐相同的表格内容,并带有问题类别和难度标签。这种设计使得在保持表格内容固定的同时,能够系统评估表示效应。我们在三个任务上评估LLMs和VLMs:问答(QA)、结构理解能力(SUC)和结构重建(SR)。我们的结果表明,表示选择显著影响表格理解。模型在结构化文本上的表现通常优于渲染图像,但这一差距的大小取决于任务、模型和格式。HTML通常是最稳健的文本格式,而行敏感的结构任务和语法可用的LaTeX重建仍然具有挑战性。这些发现表明,表格表示是可靠表格评估的关键因素。

英文摘要

Large Language Models (LLMs) and Vision-Language Models (VLMs) are increasingly evaluated on table reasoning tasks, but the role of table representation remains under-explored. In practice, the same table content may appear in different structural formats, such as HTML, Markdown, and LaTeX, or as rendered images. However, existing evaluations often let content, format, layout, and modality vary together, making it difficult to isolate representation effects. We introduce TABVERSE, a controlled multimodal table benchmark that aligns the same table content across multiple structural formats and rendered images, with question category and difficulty tags. This design enables systematic evaluation of representation effects while holding table content fixed. We evaluate LLMs and VLMs across three tasks: Question Answering (QA), Structural Understanding Capability (SUC), and Structure Reconstruction (SR). Our results show that representation choice substantially affects table understanding. Models generally perform better with structured text than with rendered images, but the size of this gap depends on the task, model, and format. HTML is often the most robust text format, while row-sensitive structural tasks and syntactically usable LaTeX reconstruction remain challenging. These findings show that table representation is a key factor in reliable table evaluation.
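
Holding content fixed while varying the format can be pictured with a small Python sketch that serializes one table three ways. The helpers are hypothetical and do not escape special characters; they only illustrate the controlled setup, not the benchmark's own tooling.

```python
def to_markdown(header, rows):
    """Render a table of strings as a Markdown pipe table."""
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def to_html(header, rows):
    """Render the same table as a single-line HTML table."""
    head = "".join(f"<th>{h}</th>" for h in header)
    body = "".join("<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>"
                   for row in rows)
    return f"<table><tr>{head}</tr>{body}</table>"


def to_latex(header, rows):
    """Render the same table as a LaTeX tabular environment."""
    spec = "l" * len(header)
    lines = [f"\\begin{{tabular}}{{{spec}}}", " & ".join(header) + " \\\\"]
    lines += [" & ".join(row) + " \\\\" for row in rows]
    lines.append("\\end{tabular}")
    return "\n".join(lines)
```

Asking the same question against each rendering, with identical cells, isolates the effect of the representation from the effect of the content.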

2606.09748 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

深度研究智能体在过程级反馈下的多轮评估

Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan

发表机构 * Google DeepMind OpenAI Perplexity AI LangChain AI

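AI summary To address the limits of single-shot evaluation of deep research agents (DRAs), proposes Research Gap Inference (RGI) to provide process-level feedback. A single round of process feedback raises scores by 8 to 15 points, but multi-turn improvement does not persist because of regressions.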

Comments Published as a workshop paper at SCALE - ICML 2026 (Oral)

详情
AI中文摘要

现有的深度研究智能体(DRA)基准仅评估单次输出,忽略了一个关键问题:DRA能否在反馈指导下改进其报告?为此,我们在两种反馈设置下对DRA进行多轮评估:自我反思(智能体在无外部诊断信号的情况下修改报告)和过程级反馈(智能体接收针对其研究策略缺口的指导)。为提供过程级反馈,我们设计了研究缺口推断(RGI),该方法通过分析满足和未满足的评分标准模式来推断研究过程缺口。我们的分析揭示了三个关键发现:(i)在自我反思下,智能体以几乎相等的速率纳入和退步评分标准,导致净改进可忽略;(ii)单轮过程级反馈带来显著收益,将归一化分数提高约8-15分,并产生约35-40%的纳入率;(iii)这些收益在后续轮次中不会累积,因为智能体在重写完整报告以解决剩余缺口时,会退步多达24%的先前满足的标准。即使有针对性指导,我们所评估的DRA架构仍无法实现可靠的多轮改进。我们的代码和结果公开在 https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs。

英文摘要

Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately 8-15 points and yielding a roughly 35-40% incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.
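
The incorporation and regression rates mentioned above can be computed from per-criterion rubric outcomes before and after a revision. The abstract does not define them precisely, so the Python sketch below assumes that incorporation is the share of previously unmet criteria that become met and regression is the share of previously met criteria that become unmet; `turn_rates` is a hypothetical name.

```python
def turn_rates(before, after):
    """before/after: dict criterion -> bool (True if the criterion is satisfied).

    Returns (incorporation_rate, regression_rate, net_change_in_met_criteria).
    """
    unmet = [c for c, ok in before.items() if not ok]
    met = [c for c, ok in before.items() if ok]
    incorporated = sum(1 for c in unmet if after[c])
    regressed = sum(1 for c in met if not after[c])
    incorporation_rate = incorporated / len(unmet) if unmet else 0.0
    regression_rate = regressed / len(met) if met else 0.0
    return incorporation_rate, regression_rate, incorporated - regressed
```

Tracking both rates matters because a revision can fix as many criteria as it breaks and leave the net count unchanged, which is the pattern finding (i) describes.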

2606.09764 2026-06-09 cs.LG cs.CL 交叉投稿

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

iOSWorld:个人智能手机代理的基准测试

Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

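AI summary Introduces iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity, with 26 new apps and 133 tasks. It evaluates agents on single-app, multi-app, and memory and personalization tasks; the best configuration reaches 52% overall but only 37% on multi-app tasks.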

详情
AI中文摘要

一个有用的手机代理需要具备个人智能。它应该能够推理设备上存在的用户身份、历史记录和偏好,而不仅仅是在非个性化的沙箱中遵循孤立的指令。现有的移动代理基准缺乏这种个性化。我们引入了iOSWorld,这是第一个基于持久用户身份构建的交互式原生iOS模拟器基准,该身份跨越26个新构建的iOS应用。这些应用包含连接的数据,如交易、消息、旅行记录、社交关系和财务活动。iOSWorld包括133个任务,分为三个难度递增的类别。单应用任务(27个)测试一个应用,多应用任务(60个)跨越2到8个应用,记忆和个性化任务(46个)要求代理从个人数据中推断模式。我们在仅视觉和特权视觉+XML设置下评估了前沿和开源计算机使用模型。最佳配置整体达到52%,但在多应用任务上仅为37%。特权视觉+XML访问将前沿模型提升了最多26个百分点,而较小的模型并未从增加的辅助功能树输入中受益。我们将iOSWorld作为开源基准发布,包含所有应用、种子数据、任务、评分标准和评估代码。

英文摘要

A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.

2411.06469 2026-06-09 cs.CL 版本更新

ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?

ClinicalBench: 大型语言模型能在临床预测中击败传统机器学习模型吗?

Canyu Chen, Jian Yu, Shan Chen, Che Liu, Zhongwei Wan, Shuang Zhou, Yuan Luo, Rui Zhang, Danielle Bitterman, Fei Wang, Kai Shu

发表机构 * Department of Computer Science, Northwestern University, Evanston, USA(美国西北大学计算机科学系) Department of Computer Science, University of Texas at Austin, Austin, USA(美国德克萨斯大学奥斯汀分校计算机科学系) Boston Children's Hospital, Harvard Medical School, Boston, USA(美国哈佛医学院波士顿儿童医院) Department of Computer Science, Imperial College London, London, UK(英国伦敦帝国学院计算机科学系) Department of Computer Science, Ohio State University, Columbus, USA(美国俄亥俄州立大学计算机科学系) Massachusetts General Hospital, Harvard Medical School, Boston, USA(美国哈佛医学院麻省总医院) Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, USA(美国西北大学费因伯格医学院预防医学系) Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, USA(美国明尼苏达大学外科系计算健康科学部) Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, USA(美国康奈尔大学威尔康奈尔医学院人口健康科学系) Department of Computer Science, Emory University, Atlanta, USA(美国埃默里大学计算机科学系)

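AI summary Builds the ClinicalBench benchmark, which compares 14 general-purpose and 8 medical LLMs against 11 traditional ML models on three clinical prediction tasks, and finds that LLMs still cannot beat traditional ML models in clinical prediction.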

Comments Accepted to Proceedings of KDD 2026. The first two authors contributed equally. 12 pages for main paper, 62 pages including appendix. Project website: https://clinicalbench.github.io

详情
AI中文摘要

大型语言模型(LLMs)因其在医学文本处理任务和医学执照考试中的卓越能力,有望彻底改变当前的临床系统。与此同时,传统机器学习模型如SVM和XGBoost仍然主要应用于临床预测任务。一个新兴的问题是:LLMs能否在临床预测中击败传统ML模型?因此,我们构建了一个新的基准ClinicalBench,全面研究通用和医学LLMs的临床预测建模能力,并将其与传统ML模型进行比较。ClinicalBench包含三个常见的临床预测任务、两个数据库、14个通用LLMs、8个医学LLMs和11个传统ML模型。通过广泛的实证研究,我们发现,无论是通用还是医学LLMs,即使采用不同的模型规模、多样的提示或微调策略,仍然无法在临床预测中击败传统ML模型,这揭示了它们在临床推理和决策中的潜在缺陷。我们呼吁从业者在临床应用中使用LLMs时保持谨慎。ClinicalBench可用于弥合LLMs在医疗保健领域的发展与现实临床实践之间的差距。

英文摘要

Large Language Models (LLMs) hold great promise to revolutionize current clinical systems, given their strong performance on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost are still the main choice for clinical prediction tasks. An emerging question is: can LLMs beat traditional ML models in clinical prediction? To answer it, we build a new benchmark, ClinicalBench, to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs and to compare them with traditional ML models. ClinicalBench covers three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we find that both general-purpose and medical LLMs, even across different model scales and with diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be used to bridge the gap between the development of LLMs for healthcare and real-world clinical practice.

2509.17455 2026-06-09 cs.CL cs.AI 版本更新

Understanding Benchmark Language Under Weakened Formal Semantics

弱化形式语义下的基准语言理解

Haoyang Chen, Kumiko Tanaka-Ishii

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) School of Fundamental Science and Engineering(基础科学与工程学院) Waseda University(早稻田大学)

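AI summary Proposes extracting computables, executable representations induced and refined with retrieval from external knowledge. On mathematical reasoning, multi-step reasoning, and other benchmarks the approach exceeds text-only reasoning and one-shot code execution, and it provides scalable, inspectable semantic evidence.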

Comments Accepted to Transactions of the Association for Computational Linguistics (TACL). 29 pages, 5 figures

详情
AI中文摘要

最先进的 NLP 基准需要解释指定条件、程序和异常的自然语言,通常依赖隐含假设和外部知识。在规模上构建具有证明论保证的完整语义表示通常不切实际,而纯文本推理提供的检查手段有限。本文探讨当形式语义保证被弱化时,能在多大程度上理解基准语言。我们通过提取可计算表示来研究这个问题:可执行表示,其运行时行为提供语义充分性的操作证据,包括可执行性、执行轨迹和运行时失败。我们使用外部知识检索,为基准实例诱导并迭代优化可计算表示。在数学推理、多步推理、因果推断以及规则和异常密集的法律和生物医学基准上,我们发现所提出的方法持续优于纯文本推理和单次代码执行。除了准确性,我们的分析表明,这些可计算表示提供了可扩展、可检查的语义证据:它们暴露了基准语言强制转化为可执行形式的条件和异常,为面向证明的语义和纯文本推理之间提供了实用的桥梁。

英文摘要

State-of-the-art NLP benchmarks require interpretation of natural language that specifies conditions, procedures, and exceptions, often relying on implicit assumptions and external knowledge. Constructing complete semantic representations with proof-theoretic guarantees is frequently impractical at scale, and purely text-based reasoning offers limited means of inspection. This paper asks how much understanding of benchmark language can be achieved when formal semantic guarantees are weakened. We investigate this question by extracting computables: executable representations whose runtime behavior provides operational evidence of semantic adequacy, including executability, execution traces, and runtime failures. We induce and iteratively refine computables for benchmark instances using retrieval from external knowledge. Across mathematical reasoning, multi-step reasoning, causal inference, and rule- and exception-heavy legal and biomedical benchmarks, we find that the proposed approach consistently exceeds text-only reasoning and one-shot code execution. Beyond accuracy, our analyses show that these computables provide scalable, inspectable semantic evidence: they expose conditions and exceptions benchmark language forces into executable form, offering a practical bridge between proof-oriented semantics and purely textual reasoning.

2511.11041 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

纠正文本嵌入中的均值偏差:一种改进的重归一化方法及其在MMTEB上的无训练改进

Xingyu Ren, Youran Sun, Haoyu Liang

发表机构 * GitHub

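AI summary Finds a consistent mean bias in sentence embeddings and proposes R2, a training-free correction that projects out the mean direction. R2 yields consistent classification gains across 38 models on MMTEB, and the paper analyzes how it differs from PCA whitening.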

详情
AI中文摘要

我们发现当前的句子嵌入模型输出存在一致的偏差:每个嵌入$e$可分解为$\tilde e + \mu$,其中均值$\mu$在所有句子中几乎相同。我们研究了两种无训练修正方法——直接减去$\mu$(R1),或从每个嵌入中投影掉均值方向(R2)——并通过一阶误差传播论证表明,R2消除了R1保留的均值估计误差的平行分量。在Massive Multilingual Text Embedding Benchmark (MMTEB)上的38个模型中,R2取得一致的分类增益(配对$\bar t = 3.31$,38个模型中有29个$t>2$,零损失),且每个模型的均值范数$\Vert\mu\Vert$与哪些模型受益最多相关。对五个模型进行的九种方法剂量反应消融实验进一步揭示,温和的单方向去除有帮助,但完全的主成分分析(PCA)白化损害了我们测试的每个模型,并且R2与深度为一的All-but-the-Top在下游任务中相差不超过0.18个百分点,尽管$\hat\mu$与中心化的顶部主成分之间几何对齐较弱。

英文摘要

We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde e + \mu$, where the mean $\mu$ is near-identical across all sentences. We study two training-free corrections, subtracting $\mu$ directly (R1) or projecting each embedding off the mean direction (R2), and show, via a first-order error-propagation argument, that R2 cancels the parallel component of mean-estimation error that R1 retains. Across 38 models on the Massive Multilingual Text Embedding Benchmark (MMTEB), R2 yields consistent classification gains (paired $\bar t = 3.31$, 29 of 38 models with $t>2$, zero losses), and the per-model mean norm $\Vert\mu\Vert$ correlates with which models benefit most. A nine-method dose-response ablation on five models further reveals that mild single-direction removal helps, but full principal component analysis (PCA) whitening hurts every model we test, and that R2 and All-but-the-Top with depth one agree within $0.18$ pp downstream despite weak geometric alignment between $\hat\mu$ and the centered top principal component.
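
The difference between R1 and R2 is easy to see in a few lines of plain Python. The sketch follows the two corrections as the abstract describes them (subtract the mean, or project off the mean direction); the function names are hypothetical, and any renormalization step the paper applies afterwards is omitted.

```python
import math


def mean_vector(embeddings):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def r1(embedding, mu):
    """R1: subtract the estimated mean directly."""
    return [x - m for x, m in zip(embedding, mu)]


def r2(embedding, mu):
    """R2: remove the component of the embedding along the mean direction."""
    norm = math.sqrt(sum(m * m for m in mu))
    unit = [m / norm for m in mu]
    coefficient = sum(x * u for x, u in zip(embedding, unit))
    return [x - coefficient * u for x, u in zip(embedding, unit)]
```

R2 depends only on the direction of the estimated mean, so an error in its length (the parallel component of the estimation error) has no effect. For example, with embedding `[3.0, 1.0]`, the estimates `[3.0, 0.0]` and `[2.5, 0.0]` give the same R2 output but different R1 outputs.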

2601.14063 2026-06-09 cs.CL cs.AI cs.CY 版本更新

XCR-Bench: Benchmarking Cross-Cultural Reasoning in LLMs via Culture-Specific Items and Hall's Triad

XCR-Bench:通过文化特定项目和霍尔三元组对大型语言模型进行跨文化推理基准测试

Mohsinul Kabir, Tasnim Ahmed, Md Mezbaur Rahman, Shaoxiong Ji, Hassan Alhuzali, Yuechen Jiang, Jimin Huang, Sophia Ananiadou

发表机构 * Department of Computer Science, National Centre for Text Mining, The University of Manchester(计算机科学系,国家文本挖掘中心,曼彻斯特大学) ELLIS Manchester(曼彻斯特ELLIS) School of Computing, Queen’s University, Ontario, Canada(计算学院,加拿大皇后大学) Computer Science, University of Illinois Chicago(计算机科学,伊利诺伊大学芝加哥分校) ELLIS Institute Finland(芬兰ELLIS研究所) University of Turku(图尔库大学) Department of Computer Science and Artificial Intelligence, Umm Al-Qura University, Makkah, Saudi Arabia(计算机科学与人工智能系,乌姆·阿勒·卡拉大学,麦加,沙特阿拉伯)

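AI summary: Introduces XCR-Bench, a benchmark of 4.1k parallel sentences and 1,098 culture-specific items that combines Newmark's framework with Hall's Triad to evaluate cross-cultural reasoning in LLMs; models degrade significantly at deeper cultural levels and show regional and ethno-religious biases.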

Comments Under Review

详情
AI中文摘要

大型语言模型(LLM)的跨文化能力需要理解并适应不同文化背景下的文化特定项目(CSI)。然而,由于缺乏高质量且带有平行跨文化句子对的CSI标注语料库,评估该能力的进展仍然有限。我们引入了XCR-Bench,一个跨文化推理基准,包含4.1k个平行句子和1,098个CSI,涵盖三个推理任务。XCR-Bench将Newmark的CSI框架与霍尔文化三元组相结合,从而能够评估从可观察实践到隐性社会规范和价值观等不同文化可见性层面的能力。对八个多语言LLM的实验表明,最先进的模型在识别和适应特定CSI类别方面表现出持续的弱点,揭示了表面召回与显式文化推理之间的差距。在文化敏感类别和更深文化层面上,性能显著下降(p<0.005,8/8模型),并且适应质量在不同目标文化和孟加拉语区域变体之间系统性变化,表明即使在单一语言环境中也存在编码的区域和民族宗教偏见。我们公开发布语料库和代码,以支持未来跨文化NLP的研究。

英文摘要

Cross-cultural competence in large language models (LLMs) requires understanding and adapting Culture-Specific Items (CSIs) across varying cultural contexts. However, progress in evaluating this capability remains limited by the lack of high-quality CSI-annotated corpora with parallel cross-cultural sentence pairs. We introduce XCR-Bench, a Cross(X)-Cultural Reasoning Benchmark containing 4.1k parallel sentences and 1,098 CSIs across three reasoning tasks. XCR-Bench integrates Newmark's CSI framework with Hall's Triad of Culture, enabling evaluation across levels of cultural visibility -- from observable practices to implicit social norms and values. Experiments on eight multilingual LLMs show that state-of-the-art models exhibit consistent weaknesses in identifying and adapting specific categories of CSIs, revealing a gap between surface-level recall and explicit cultural reasoning. Performance declines significantly on culturally sensitive categories and deeper cultural levels (p<0.005, 8/8 models), and adaptation quality varies systematically across target cultures and Bengali regional variants, indicating encoded regional and ethno-religious biases even within a single linguistic setting. We publicly release the corpus and code to support future research on cross-cultural NLP.

2602.06307 2026-06-09 cs.CL 版本更新

Lost in Speech: Benchmarking, Evaluation, and Parsing of Spoken Bilingual Conversational Language Beyond Standard UD Assumptions

迷失在语音中:超越标准UD假设的口语双语会话的基准测试、评估与解析

Nemika Tyagi, Olga Kellert, Holly Hendrix, Nelvin Licona-Guevara, Justin Mackie, Phanos Kareen, Megan Michelle Smith, Tatiana Gallego Hernande, Samhitha Harish, Chitta Baral

发表机构 * Arizona State University(亚利桑那州立大学)

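AI summary: Targets disfluencies and discourse-driven structures in spoken bilingual conversation, proposing the SpokeBench benchmark, the Flex-UD evaluation metric, and the DECAP parsing framework, which substantially improves dependency parsing performance.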

Comments 17 pages, 4 Figures

详情
AI中文摘要

口语双语会话给句法解析带来了重大挑战,因为它们通常包含不流利和话语驱动的结构,这些结构在标准的通用依赖(UD)假设和评估实践下使依赖解析复杂化。为了系统研究这些挑战,本文首先引入了一个基于语言学的会话双语现象分类法,以及SpokeBench,一个由专家标注的英语-西班牙语基准,用于结构复杂的语音。为了解决现有评估实践的局限性,我们提出了Flex-UD,一种歧义感知的评估指标,能够区分灾难性的结构失败和语言上可接受的变体。最后,我们引入了DECAP,一种解耦的代理解析框架,将口语现象处理与核心句法分析分离,无需重新训练即可实现鲁棒且可解释的依赖解析。在专有和开源大语言模型上的实验表明,DECAP在复杂的会话现象上显著提高了性能,在UPOS-F1分数上比基线提高了超过60%,而Flex-UD评估揭示了在标准基于附着的指标下部分隐藏的增益。

英文摘要

Spoken bilingual conversations pose substantial challenges for syntactic parsing because they often include disfluencies and discourse-driven structures that complicate dependency parsing under standard Universal Dependencies (UD) assumptions and evaluation practices. To systematically study these challenges, in this work, we first introduce a linguistically grounded taxonomy of conversational bilingual phenomena, together with SpokeBench, an expert-annotated English-Spanish benchmark for structurally complex speech. To address the limitations of existing evaluation practices, we propose Flex-UD, an ambiguity-aware evaluation metric that distinguishes catastrophic structural failures from linguistically acceptable variations. Finally, we introduce DECAP, a decoupled agentic parsing framework that separates spoken-phenomena handling from core syntactic analysis, enabling robust and interpretable dependency parsing without retraining. Experiments across both proprietary and open-weight LLMs show that DECAP substantially improves performance on complex conversational phenomena and achieves over 60% improvements in UPOS-F1 Score over baselines, while Flex-UD evaluations reveal gains that otherwise remain partially hidden under standard attachment-based metrics.

2602.11238 2026-06-09 cs.CL 版本更新

SurveyLens: A Discipline-Aware Benchmark for Automatic Survey Generation

SurveyLens:一个学科感知的自动综述生成基准

Beichen Guo, Zhiyuan Wen, Jia Gu, Haochen Shi, Jian Wang, Senzhang Wang, Haoyang Li, Ruosong Yang, Shuaiqi Liu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Sichuan University(四川大学) Central South University(中南大学) Alibaba Cloud(阿里巴巴云)

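AI summary: Introduces SurveyLens, the first discipline-aware benchmark for automatic survey generation, with a dataset of 1,000 human-written surveys across 10 disciplines and a dual-lens evaluation framework; Deep Research agents are the most robust across disciplines, while all paradigms remain weak on reference quality.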

Comments 8 pages, 9 figures

详情
AI中文摘要

自动综述生成旨在通过检索、组织和综合学术论文来生成全面的文献综述。尽管专门的ASG框架和深度研究智能体取得了快速进展,但现有评估主要集中于计算机科学或依赖通用标准,尚不清楚当前系统是否满足不同学科的综述标准。我们引入了SurveyLens,这是第一个学科感知的ASG基准。SurveyLens包含SurveyLens-1k,一个跨10个学科的1000篇人工撰写综述的精选数据集,以及一个双视角框架,该框架结合了学科感知的评分标准与基于参考的人工综述对齐。评估了11个最先进的系统,包括普通LLM、ASG系统和深度研究智能体,我们发现深度研究智能体是唯一在所有10个学科中表现稳健的范式,ASG系统在结构规划上领先,而所有范式在参考文献质量上仍然薄弱,这为特定学科的工具选择和未来的ASG设计提供了实用指导。

英文摘要

Automatic Survey Generation (ASG) aims to produce comprehensive literature surveys by retrieving, organizing, and synthesizing academic papers. Despite rapid progress in specialized ASG frameworks and Deep Research agents, existing evaluations largely center on Computer Science or rely on generic criteria, leaving it unclear whether current systems satisfy the survey standards of diverse disciplines. We introduce SurveyLens, the first discipline-aware ASG benchmark. SurveyLens comprises SurveyLens-1k, a curated dataset of 1,000 human-written surveys across 10 disciplines, and a dual-lens framework that combines discipline-aware rubric scoring with reference-based alignment to human-written surveys. Evaluating 11 state-of-the-art systems across vanilla LLMs, ASG systems, and Deep Research agents, we find that Deep Research agents are the only paradigm robust across all 10 disciplines, ASG systems lead on structural planning, and all paradigms remain weak on reference quality, providing practical guidance for discipline-specific tool selection and future ASG design.

2604.06210 2026-06-09 cs.CL cs.AI cs.CY cs.LG 版本更新

Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

基于价值码本的LLM文化价值对齐的分布式开放式评估

Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Hyunjin Hwang, Roy Ka-Wei Lee, Xing Xie, JinYeong Bak

发表机构 * KAIST(韩国科学技术院)

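AI summary: Proposes DOVE, which builds a value codebook via rate-distortion variational optimization and measures distributional alignment with unbalanced optimal transport, addressing the Construct-Composition-Context challenge in evaluating LLM cultural values.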

Comments ICML 2026 Camera Ready

详情
AI中文摘要

随着LLM在全球部署,使其文化价值取向对齐对于安全性和用户参与至关重要。然而,现有基准面临构造-组成-上下文($C^3$)挑战:依赖判别性、多项选择格式,探测的是价值知识而非真实取向,忽视亚文化异质性,且与真实世界的开放式生成不匹配。我们引入DOVE,一个直接比较人类撰写的文本分布与LLM生成输出的分布式评估框架。DOVE利用率失真变分优化目标从10K文档中构建紧凑的价值码本,将文本映射到结构化价值空间以过滤语义噪声。使用不平衡最优传输测量对齐,捕捉文化内分布结构和子群体多样性。在12个LLM上的实验表明,DOVE实现了优越的预测有效性,与下游任务的相关性达到31.56%,同时每个文化仅需500个样本即可保持高可靠性。

英文摘要

As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing benchmarks face the Construct-Composition-Context ($C^3$) challenge: they rely on discriminative, multiple-choice formats that probe value knowledge rather than true orientations, overlook subcultural heterogeneity, and mismatch real-world open-ended generation. We introduce DOVE, a distributional evaluation framework that directly compares human-written text distributions with LLM-generated outputs. DOVE utilizes a rate-distortion variational optimization objective to construct a compact value codebook from 10K documents, mapping text into a structured value space to filter semantic noise. Alignment is measured using unbalanced optimal transport, capturing intra-cultural distributional structures and subgroup diversity. Experiments across 12 LLMs show that DOVE achieves superior predictive validity, attaining a 31.56% correlation with downstream tasks, while maintaining high reliability with as few as 500 samples per culture.

2605.11502 2026-06-09 cs.CL 版本更新

Robust Biomedical Publication Type and Study Design Classification with Knowledge-Guided Perturbations

基于知识引导扰动的鲁棒生物医学出版物类型与研究设计分类

Shufan Ming, Joe D. Menke, Neil R. Smalheiser, Halil Kilicoglu

发表机构 * School of Information Sciences(信息科学学院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Department of Psychiatry(精神病学系) University of Illinois Chicago(伊利诺伊大学芝加哥分校)

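AI summary: Proposes an evaluation framework based on controlled semantic perturbations and uses entity masking and domain-adversarial training to make biomedical publication type classification more robust; suppressing non-task-defining features mitigates the trade-off between robustness and in-domain accuracy.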

Comments Accepted by IEEE ICHI 2026

详情
AI中文摘要

准确且一致地对生物医学文献进行出版物类型和研究设计索引对于支持证据综合和知识发现至关重要。先前工作主要集中在扩展标签覆盖、丰富特征表示和提高领域内准确性,评估通常在与训练数据同分布的数据上进行。尽管预训练生物医学语言模型在这些设置下表现优异,但优化领域内准确性的模型可能依赖于表面词汇或数据集特定的提示,导致在分布偏移下鲁棒性降低。本文引入基于受控语义扰动的评估框架,评估出版物类型分类器的鲁棒性,并研究结合实体遮蔽和领域对抗训练的鲁棒性导向训练策略,以减轻对虚假主题相关性的依赖。结果表明,当鲁棒性目标设计为选择性抑制非任务定义特征同时保留显著的方法学信号时,通常观察到的鲁棒性与领域准确性之间的权衡可以被缓解。我们发现这些改进源于两种互补机制:(1)当输入中存在此类提示时,增加对显式方法学提示的依赖;(2)减少对虚假领域特定主题特征的依赖。这些发现强调了出版物类型和研究设计分类中特征级鲁棒性分析的重要性,并建议通过更选择性地抑制主题信息来进一步提高鲁棒性。数据、代码和模型可在:https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/ICHI 获取。

英文摘要

Accurately and consistently indexing biomedical literature by publication type and study design is essential for supporting evidence synthesis and knowledge discovery. Prior work on automated publication type and study design indexing has primarily focused on expanding label coverage, enriching feature representations, and improving in-domain accuracy, with evaluation typically conducted on data drawn from the same distribution as training. Although pretrained biomedical language models achieve strong performance under these settings, models optimized for in-domain accuracy may rely on superficial lexical or dataset-specific cues, resulting in reduced robustness under distributional shift. In this study, we introduce an evaluation framework based on controlled semantic perturbations to assess the robustness of a publication type classifier and investigate robustness-oriented training strategies that combine entity masking and domain-adversarial training to mitigate reliance on spurious topical correlations. Our results show that the commonly observed trade-off between robustness and in-domain accuracy can be mitigated when robustness objectives are designed to selectively suppress non-task-defining features while preserving salient methodological signals. We find that these improvements arise from two complementary mechanisms: (1) increased reliance on explicit methodological cues when such cues are present in the input, and (2) reduced reliance on spurious domain-specific topical features. These findings highlight the importance of feature-level robustness analysis for publication type and study design classification and suggest that refining masking and adversarial objectives to more selectively suppress topical information may further improve robustness. Data, code, and models are available at: https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/ICHI

2605.19276 2026-06-09 cs.CL cs.LG 版本更新

OpenCompass: A Universal Evaluation Platform for Large Language Models

OpenCompass:大型语言模型的通用评估平台

Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang, Zhiwei Fei, Tong Gao, Ge Jiaye, Mo Li, Hongwei Liu, Junnan Liu, Yuan Liu, Chengqi Lyu, Han Lyu, Ningsheng Ma, Zerun Ma, Yu Sun, Zhiyong Wu, Linchen Xiao, Zhuozhi Xiong, Jun Xu, Haochen Ye, Zhaohui Yu, Yike Yuan, Songyang Zhang, Yufeng Zhao, Fengzhe Zhou, Peiheng Zhou, Dongsheng Zhu, Lin Zhu, Jingming Zhuo

发表机构 * OpenCompass Team(OpenCompass团队) Shanghai AI Laboratory(上海人工智能实验室)

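AI summary: Presents OpenCompass, a modular, highly compatible, flexible, and high-concurrency general-purpose LLM evaluation platform that supports diverse task scenarios and mainstream benchmark datasets.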

详情
AI中文摘要

近年来,人工智能领域经历了从特定任务的小规模模型到通用大型语言模型(LLM)的范式转变。随着LLM的快速迭代,对其能力进行客观、定量和全面的评估已成为推动技术发展的关键环节。目前,基于静态基准数据集的主流评估方法面临任务类型多样性、评估标准不一致以及数据处理流程碎片化等挑战,难以高效进行跨领域和大规模模型评估。为解决上述问题,本文提出并开源了OpenCompass,一个一站式、可扩展且支持高并发的通用LLM评估平台。该平台遵循模块化和组件解耦的设计理念,具有三大核心优势:高兼容性、灵活性和高并发性。OpenCompass的核心架构包括五个关键组件:配置系统、任务划分模块、执行与调度模块、任务执行单元和结果可视化模块。其工作流程提供基于规则、LLM作为评判者和级联评估器,以适应不同任务场景的需求。平台支持知识、推理、计算、科学、语言、代码等多个领域的基准数据集,为学术界和工业界提供统一高效的LLM评估工具,有助于准确识别LLM的优缺点并进行后续优化。

英文摘要

In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream evaluation methods based on static benchmark datasets face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address these issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, general-purpose LLM evaluation platform with high-concurrency support. Adhering to the design philosophy of modularization and component decoupling, the platform has three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, and code, the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.

2605.22079 2026-06-09 cs.CL 版本更新

Ishigaki-IDS-Bench: A Benchmark for Generating Information Delivery Specification from BIM Information Requirements

Ishigaki-IDS-Bench: 一个用于从BIM信息需求生成信息交付规范的基准

Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka

发表机构 * ONESTRUCTION Inc.(ONESTRUCTION公司) AWS GenAI Innovation Center(AWS生成式人工智能创新中心)

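AI summary: Presents Ishigaki-IDS-Bench, a benchmark for evaluating how well LLMs generate standards-conformant Information Delivery Specification (IDS) XML; using 166 examples written and verified by BIM/IDS experts, with content-fidelity scoring and structural auditing, it shows that current LLMs struggle to produce XML that satisfies the IDS standard and IFC vocabulary constraints.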

Comments 7 pages; benchmark data and evaluation scripts are available on GitHub and Hugging Face

详情
AI中文摘要

大型语言模型(LLMs)被广泛用于生成结构化输出,如JSON、SQL和代码,但公共资源仍然有限,无法有效评估必须同时满足行业标准XML和领域词汇约束的生成能力。本文提出了Ishigaki-IDS-Bench,一个用于评估从BIM信息需求生成信息交付规范(IDS)XML能力的基准。该基准包含166个由BIM/IDS专家编写和验证的示例,这些示例是通过将83个实际场景扩展为日语和英语后生成的,对应黄金IDS文件以及输入格式、语言、轮次设置、IFC版本和建筑领域等元数据。其评估结合了基于IDSAuditTool的可操作性、结构和内容审核,以及与黄金IDS文件的内容一致性评估。在零样本评估中,10个LLM中表现最好的模型在内容一致性上达到65.6%的宏F1分数,但只有27.7%的输出通过内容审核。这些结果表明,当前LLM能够表达部分信息需求作为IDS,但仍难以稳定生成满足IDS标准和IFC词汇约束的XML。Ishigaki-IDS-Bench支持比较评估、失败分析以及开发符合领域标准的受限结构生成方法。我们已将评估脚本和基准数据以CC BY 4.0许可发布在GitHub和Hugging Face上。

英文摘要

Building Information Modeling (BIM) projects increasingly use Information Delivery Specification (IDS) to formalize information requirements in a machine-checkable XML format. Because IDS conditions are grounded in the Industry Foundation Classes (IFC) vocabulary, authoring them requires expertise in IFC concepts, validation tools, and property set conventions. Existing benchmarks for structured generation do not adequately capture the additional burden of vocabulary conformance and external-validator agreement that IDS imposes. We present Ishigaki-IDS-Bench, the first publicly released benchmark for IDS generation from BIM information requirements. The benchmark contains 166 examples spanning 83 practical scenarios authored in Japanese and English by six BIM/IDS experts, each paired with a gold IDS file and metadata covering input format, turn setting, target IFC versions, and construction domain. Evaluation proceeds in two stages: (i) formal validity scored by the buildingSMART IDSAuditTool along Processability, Structure, and Content, and (ii) content fidelity scored by facet-level macro-F1 against the gold IDS. Across 10 LLMs in zero-shot, the highest Facet F1 is 65.6%, achieved by GPT-5.5, while the highest Content pass rate is only 33.1%, achieved by Claude Opus 4.5. Ishigaki-IDS-Bench is released on Hugging Face (DOI 10.57967/hf/8873) under CC BY 4.0, and the evaluation code is released on Zenodo (DOI 10.5281/zenodo.20550510) under Apache-2.0.
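The abstract's second evaluation stage, facet-level macro-F1 against the gold IDS, can be illustrated with a short sketch. The exact matching rules used by the benchmark are not given here, so everything below is an assumption: facets are modeled as hypothetical `(facet_type, value)` pairs, matching is exact, F1 is computed per facet type, and the macro average runs over every facet type that appears in either the prediction or the gold file.

```python
from collections import defaultdict

def f1(pred, gold):
    """F1 between two sets; defined as 1.0 when both sets are empty."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def facet_macro_f1(pred_facets, gold_facets):
    """Average per-facet-type F1 over all facet types seen in either input."""
    pred_by_type = defaultdict(set)
    gold_by_type = defaultdict(set)
    for ftype, value in pred_facets:
        pred_by_type[ftype].add(value)
    for ftype, value in gold_facets:
        gold_by_type[ftype].add(value)
    types = sorted(set(pred_by_type) | set(gold_by_type))
    if not types:
        return 1.0
    return sum(f1(pred_by_type[t], gold_by_type[t]) for t in types) / len(types)

# Hypothetical facets for one requirement (values are illustrative only).
gold = [
    ("entity", "IFCWALL"),
    ("property", "Pset_WallCommon.FireRating"),
    ("property", "Pset_WallCommon.IsExternal"),
]
pred = [
    ("entity", "IFCWALL"),
    ("property", "Pset_WallCommon.FireRating"),
    ("material", "concrete"),
]
score = facet_macro_f1(pred, gold)
```

Macro averaging weights each facet type equally, so a spurious facet type in the prediction (here `material`) lowers the score as much as a missed property does.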

2605.25312 2026-06-09 cs.CL 版本更新

P1SCO: Social Dimensions from a Perspectivist Lens

P1SCO:从视角主义视角看社会维度

Amanda Cercas Curry, Gianmarco de Francisci Morales, Luca Maria Aiello

发表机构 * Independent Researcher(独立研究者) CENTAI, Turin(CENTAI,都灵) IT University of Copenhagen(哥本哈根技术大学)

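AI summary: Presents P1SCO, a dataset of social media comments from three platforms annotated along ten social dimensions to capture the diversity of social interactions and perceptions, supporting fine-grained analysis across platforms and individual differences.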

详情
AI中文摘要

我们介绍了P1SCO,一个从三个不同平台收集的社交媒体评论数据集,根据十个社会维度进行标注,以捕捉社会互动和感知的多样性。该数据集经过仔细分解,允许在单个评论、标注者和平台层面进行分析。除了社会维度标签外,我们还包含了丰富的标注者元数据,包括人口统计信息、大五人格特征和政治倾向。这种评论级标注和标注者级特征的组合,能够对社会感知如何因平台、个体差异和人口因素而变化进行细致分析。通过保留标注者视角的多样性,我们的数据集支持标注者间和标注者内部一致性研究、人格和政治倾向对社会解读的影响,以及社会话语的跨平台动态分析。

英文摘要

We introduce P1SCO, a dataset of social media comments collected from three distinct platforms, annotated according to ten social dimensions to capture the diversity of social interactions and perceptions. The dataset is carefully disaggregated to allow analysis at the level of individual comments, annotators, and platforms. In addition to the social dimension labels, we include rich metadata on the annotators, including demographics, Big Five personality profiles, and political affiliation. This combination of comment-level annotations and annotator-level features enables nuanced analyses of how social perception varies across platforms, individual differences, and demographic factors. By preserving the diversity of annotator perspectives, our dataset supports studies of inter- and intra-annotator agreement, the influence of personality and political orientation on social interpretation, and the cross-platform dynamics of social discourse.

2411.19504 2026-06-09 cs.AI cs.CL cs.IR 版本更新

TQA-Bench: Evaluating LLMs for Multi-Table Question Answering

TQA-Bench:评估大语言模型在多表问答中的表现

Zipeng Qiu, Chenyue Li, You Peng, Guangxin He, Binhang Yuan, Chen Wang

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Tsinghua University(清华大学)

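AI summary: Introduces TQA-Bench, which evaluates LLMs on long-context multi-table question answering and highlights the challenges and opportunities of applying them in complex, data-driven environments.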

Comments Accepted by IEEE Transactions on Big Data

详情
AI中文摘要

大语言模型(LLMs)的进步为复杂的多模态数据管理任务带来了巨大机遇,尤其是在涉及复杂多表关系数据的问答(QA)中。尽管取得了显著进展,但由于分析关系数据结构模态的固有复杂性以及序列化表格数据可能的大规模性,系统评估LLMs在多表QA上的表现仍然是一个关键挑战。现有基准主要关注单表QA,未能捕捉金融、医疗和电子商务等真实世界领域中多个关系表之间连接的复杂性。我们提出了TQA-Bench,一个基于真实世界公共数据集的长上下文分析型多表QA基准,具有灵活的采样机制,可变化上下文长度(8K--64K tokens)和符号扩展,以评估超越检索和模式匹配的推理能力。我们系统评估了一系列参数规模从20亿到6710亿的LLMs。大量实验揭示了LLMs在多表QA中的关键性能洞察,突出了推进其在复杂数据驱动环境中应用的挑战和机遇。

英文摘要

The advance of large language models (LLMs) has unlocked great opportunities in complex multi-modal data management tasks, particularly in question answering (QA) over complicated multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing the modality of relational data structures and the potentially large scale of serialized tabular data. Existing benchmarks primarily focus on single-table QA, failing to capture the intricacies of connections across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. We present TQA-Bench, a long-context analytical multi-table QA benchmark derived from real-world public datasets, with a flexible sampling mechanism that varies context length (8K--64K tokens) and symbolic extensions for assessing reasoning beyond retrieval and pattern matching. We systematically evaluate a set of LLMs spanning model scales from 2 billion to 671 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments.

2502.16584 2026-06-09 cs.SD cs.AI cs.CL cs.MM eess.AS 版本更新

Audio-FLAN: An Instruction-Following Dataset for Unified Audio Understanding and Generation of Speech, Music, and Sound

Audio-FLAN:面向语音、音乐和声音的统一音频理解与生成的指令跟随数据集

Liumeng Xue, Ziya Zhou, Jiahao Pan, Zixuan Li, Shuai Fan, Yinghao Ma, Sitong Cheng, Dongchao Yang, Haohan Guo, Yujia Xiao, Xinsheng Wang, Zixuan Shen, Chuanbo Zhu, Xinshen Zhang, Tianchi Liu, Ruibin Yuan, Zeyue Tian, Haohe Liu, Xingjian Du, Emmanouil Benetos, Ge Zhang, Yike Guo, Wei Xue

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Inner Mongolia University(内蒙古大学) Beihang University(北京航空航天大学) Queen Mary University of London(伦敦玛丽女王大学) The Chinese University of Hong Kong(香港中文大学) National University of Singapore(新加坡国立大学) University of Surrey(萨里大学) University of Rochester(罗切斯特大学) Independent Researcher(独立研究者)

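AI summary: Introduces Audio-FLAN, a dataset covering 80 tasks and over 100 million instances, to support zero-shot unified audio understanding and generation.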

详情
AI中文摘要

最近音频标记化的进展显著增强了将音频能力集成到大语言模型(LLM)中的能力。然而,音频理解和生成通常被视为不同的任务,阻碍了真正统一的音频-语言模型的发展。虽然指令调优在文本和视觉领域已显示出在改善泛化和零样本学习方面的显著成功,但其在音频领域的应用仍基本未被探索。一个主要障碍是缺乏统一音频理解和生成的全面数据集。为解决这一问题,我们引入了Audio-FLAN,这是一个大规模指令调优数据集,涵盖语音、音乐和声音领域的80种不同任务,包含超过1亿个实例。Audio-FLAN为统一的音频-语言模型奠定了基础,这些模型能够以零样本方式无缝处理跨多种音频领域的理解(如转录、理解)和生成(如语音、音乐、声音)任务。Audio-FLAN数据集可在HuggingFace和GitHub上获取。

英文摘要

Recent advancements in audio tokenization have significantly enhanced the integration of audio capabilities into large language models (LLMs). However, audio understanding and generation are often treated as distinct tasks, hindering the development of truly unified audio-language models. While instruction tuning has demonstrated remarkable success in improving generalization and zero-shot learning across text and vision, its application to audio remains largely unexplored. A major obstacle is the lack of comprehensive datasets that unify audio understanding and generation. To address this, we introduce Audio-FLAN, a large-scale instruction-tuning dataset covering 80 diverse tasks across speech, music, and sound domains, with over 100 million instances. Audio-FLAN lays the foundation for unified audio-language models that can seamlessly handle both understanding (e.g., transcription, comprehension) and generation (e.g., speech, music, sound) tasks across a wide range of audio domains in a zero-shot manner. The Audio-FLAN dataset is available on HuggingFace and GitHub.

2509.22097 2026-06-09 cs.SE cs.AI cs.CL cs.CR 版本更新

SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

SecureVibeBench: 通过重建引入漏洞的场景来基准测试AI代理的安全氛围编程(Vibe Coding)

Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi, Chengran Yang, Ting Zhang, Haoye Tian, Yikun Li, Zhenhao Li, Xin Zhou, Xing Hu, David Lo

发表机构 * University of Science and Technology of China(中国科学技术大学)

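AI summary: Presents SecureVibeBench, a benchmark of 105 C/C++ secure coding tasks for evaluating whether AI agents generate secure code in realistic scenarios, and notes that existing benchmarks do not support fair comparison between humans and agents.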

Comments ACL 2026 Main Conference. Our code and data are on https://github.com/iCSawyer/SecureVibeBench

详情
AI中文摘要

大型语言模型驱动的代码代理正在迅速改变软件工程,但其生成代码的安全风险已成为关键问题。现有基准测试提供了有价值的见解,但未能捕捉到由人类开发者实际引入漏洞的场景,使得人类与代理之间的公平比较不可行。因此,我们引入SecureVibeBench,一个包含来自OSS-Fuzz的41个项目中105个C/C++安全编码任务的基准测试,用于代码代理。SecureVibeBench具有以下特点:(i)现实的任务设置,要求在大型仓库中进行多文件编辑;(ii)基于真实世界开源漏洞对齐的上下文,具有精确标识的漏洞引入点;(iii)全面的评估,结合功能测试和安全检查,使用静态和动态预言机(oracles)。我们在SecureVibeBench上评估了5种流行的代码代理(如OpenHands),并搭配5种LLM(如Claude Sonnet 4.5)。结果表明,当前代理在生成既正确又安全的代码方面存在困难,即使表现最好的代理,在SecureVibeBench上也只能产生23.8%的正确且安全的解决方案。我们的代码和数据在https://github.com/iCSawyer/SecureVibeBench上。

英文摘要

Large language model-powered code agents are rapidly transforming software engineering, yet the security risks of their generated code have become a critical concern. Existing benchmarks have provided valuable insights, but they fail to capture scenarios in which vulnerabilities are actually introduced by human developers, making fair comparisons between humans and agents infeasible. We therefore introduce SecureVibeBench, a benchmark of 105 C/C++ secure coding tasks for code agents, sourced from 41 projects in OSS-Fuzz. SecureVibeBench has the following features: (i) realistic task settings that require multi-file edits in large repositories, (ii) aligned contexts based on real-world open-source vulnerabilities with precisely identified vulnerability introduction points, and (iii) comprehensive evaluation that combines functionality testing and security checking with both static and dynamic oracles. We evaluate 5 popular code agents, such as OpenHands, backed by 5 LLMs (e.g., Claude Sonnet 4.5) on SecureVibeBench. Results show that current agents struggle to produce code that is both correct and secure: even the best-performing agent produces only 23.8% correct and secure solutions on SecureVibeBench. Our code and data are on https://github.com/iCSawyer/SecureVibeBench.

2602.15327 2026-06-09 cs.LG cs.AI cs.CL stat.ML 版本更新

Prescriptive Scaling Reveals the Evolution of Language Model Capabilities

规范性缩放揭示语言模型能力的演变

Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade

发表机构 * Harvard University(哈佛大学) Stanford University(斯坦福大学)

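AI summary: Using large-scale observational evaluations and quantile regression, proposes prescriptive scaling laws that map pre-training compute budgets to downstream accuracy, validates their temporal stability, and introduces a balanced I-optimal sampling algorithm to reduce evaluation cost.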

Comments ICML 2026 Oral. Blog Post: https://jkjin.com/prescriptive-scaling

详情
AI中文摘要

机器学习模型性能的提升往往源于竞争和应用。针对部署,我们考虑规范性缩放定律:给定预训练计算预算,通过当代后训练实践可获得的下游准确率是多少,以及随着领域发展该映射的稳定性如何?我们使用大规模观测评估,涵盖2022-2026年间六个基准测试的5000个现有和2000个新评估的模型检查点,通过带有单调饱和S型参数化的平滑分位数回归,估计能力边界(即基准分数作为对数预训练FLOPs函数的高条件分位数)。我们通过在早期模型代上拟合并在后续版本上评估来验证时间可靠性:在六个任务中的四个上,分布外覆盖误差低于2%,而数学推理能力边界随时间持续提升。例如,在预算为10^24 FLOPs时,IFEval上的估计可达准确率为0.83,MATH Lvl 5上为0.54。然后我们扩展方法以分析任务相关的饱和性,并探测数学推理任务中与污染相关的偏移。最后,我们引入一种平衡I-最优采样算法,该算法使用约20%的参数计数加权评估预算(某些任务低至5%)恢复接近全数据的前沿,同时保持可比的校准。总之,我们的工作发布了Proteus-2k(最新的模型性能评估数据集),并引入了一种实用方法,将计算预算转化为可靠的性能预期,并监测能力边界随时间的变化。

英文摘要

Machine learning model performance improvements tend to arise from competition and application. For deployment, we consider prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations with 5k existing and 2k newly evaluated model checkpoints spanning 2022-2026 across six benchmarks, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate temporal reliability by fitting on earlier model generations and evaluating on later releases: across four of six tasks, the out-of-distribution coverage error remains below 2%, while math reasoning exhibits a consistently advancing boundary over time. For instance, at a budget of 10^24 FLOPs, the estimated attainable accuracies are 0.83 on IFEval and 0.54 on MATH Lvl 5. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce a balanced I-optimal sampling algorithm that recovers near-full-data frontiers using roughly 20% of the parameter-count-weighted evaluation budget, as low as 5% on some tasks, while maintaining comparable calibration. Together, our work releases Proteus-2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift across time.
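The estimation idea in the abstract, a high conditional quantile of benchmark score modeled as a monotone, saturating sigmoid of log pre-training FLOPs, can be sketched with the standard pinball (quantile) loss. This is an illustrative sketch only: the parameter names, the toy data, and the fixed parameter values are hypothetical, the smoothing applied to the loss in the paper is omitted, and no fitting procedure is shown.

```python
import math

def sigmoid_frontier(log_flops, lo, hi, slope, midpoint):
    """Monotone, saturating curve: rises from lo to hi when slope > 0 and hi > lo."""
    return lo + (hi - lo) / (1.0 + math.exp(-slope * (log_flops - midpoint)))

def pinball_loss(y_true, y_pred, tau):
    """Mean pinball loss; minimized when y_pred is the tau-quantile of y_true."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        r = y - q
        total += max(tau * r, (tau - 1.0) * r)
    return total / len(y_true)

def coverage(y_true, y_pred):
    """Fraction of observations lying at or below the predicted boundary."""
    return sum(1 for y, q in zip(y_true, y_pred) if y <= q) / len(y_true)

# Hypothetical data: log10 pre-training FLOPs and benchmark accuracy.
log_flops = [21.0, 22.0, 23.0, 24.0]
scores = [0.20, 0.40, 0.50, 0.60]

# Hypothetical frontier parameters (in practice chosen by minimizing the loss).
params = dict(lo=0.0, hi=0.9, slope=1.0, midpoint=22.5)
boundary = [sigmoid_frontier(x, **params) for x in log_flops]

tau = 0.9
loss = pinball_loss(scores, boundary, tau)
cov = coverage(scores, boundary)
coverage_gap = abs(cov - tau)  # one simple notion of coverage error
```

A well-calibrated boundary at level `tau` should cover roughly a `tau` fraction of held-out models, so comparing `cov` with `tau` on later releases is a natural check of the temporal reliability the abstract describes.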

2604.10628 2026-06-09 cs.SD cs.CL cs.IR 版本更新

BMdataset: A Musicologically Curated LilyPond Dataset

BMdataset:一个音乐学精心编纂的LilyPond数据集

Matteo Spanio, Ilay Guler, Antonio Rodà

发表机构 * Department of Information Engineering , University of Padua(信息工程系,帕多瓦大学) Boston University(波士顿大学)

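AI summary: Presents BMdataset, a collection of 393 LilyPond scores for music understanding research, and introduces the LilyBERT model, showing that a small expert-curated dataset can outperform a large noisy corpus on music classification tasks.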

Comments Submitted to SMC2026

详情
AI中文摘要

符号音乐研究几乎仅依赖MIDI数据集,而基于文本的乐谱格式如LilyPond尚未被探索。我们提出了BMdataset,包含393个LilyPond乐谱(2,646个乐章),由专家直接从原巴洛克手稿转录,涵盖作曲家、音乐形式、乐器和乐章属性的元数据。基于此资源,我们引入LilyBERT(权重可在https://huggingface.co/csc-unipd/lilybert获取),一种基于CodeBERT的编码器,通过扩展词汇表加入115个LilyPond特定标记并进行掩码语言模型预训练。在非领域数据集Mutopia上的线性探测显示,尽管其规模较小(约90M tokens),仅在BMdataset上微调的表现优于在完整PDMX数据集(约15B tokens)上的连续预训练,证明小规模专家编纂数据集在音乐理解任务中更有效。结合广泛预训练与领域特定微调获得最佳结果(84.3%作曲家准确率),证实了两种数据制度的互补性。我们发布数据集、分词器和模型,以建立LilyPond的表示学习基准。

英文摘要

Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.

2606.06388 2026-06-09 cs.AI cs.CL 版本更新

Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

人类的ALMANAC:用于智能体协作的动作级心智模型标注的人类协作数据集

Jiaju Chen, Yuxuan Lu, Jiayi Su, Chaoran Chen, Songlin Xiao, Zheng Zhang, Yun Wang, Yunyao Li, Jian Zhao, Tongshuang Wu, Toby Jia-Jun Li, Dakuo Wang, Bingsheng Yao

发表机构 * Northeastern University(东北大学) University of Notre Dame(Notre Dame 大学) University of Waterloo(滑铁卢大学) Carnegie Mellon University(卡内基梅隆大学) Adobe(Adobe公司) Microsoft Research Asia(微软亚洲研究院)

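AI summary: To address the lack of mental-model capabilities for collaboration in current LLM agents, builds ALMANAC, a Map Task-based dataset of 2,987 collaboration actions with mental model annotations, and evaluates six LLMs on predicting human behavior and mental models.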

详情
AI中文摘要

近年来,LLM智能体的进展使其具备了复杂的认知能力,如多步推理、规划和工具使用,这些能力使它们逐渐成为人类的协作者。然而,有效的协作要求协作者在协作过程中持续维护和调整自身推理、伙伴意图和共享目标的心智模型。当前的智能体很少发展这种能力,因为它们主要针对任务完成进行优化,而社区缺乏带有动作级心智模型标注的真实人类协作数据,这些数据可以指导智能体获得过程级的协作能力。为填补这一空白,我们提出了ALMANAC,一个基于社会科学中经典的二元路由任务Map Task构建的动作级心智模型标注数据集。ALMANAC包含2,987个协作动作,每个动作都配有基于理论的心智模型标注,记录了参与者的自我推理、感知的伙伴意图和感知的团队目标。我们评估了六种LLM在预测人类下一轮行为和心智模型方面的表现。我们的结果证明了ALMANAC在评估模型模拟人类协作行为及推断其潜在心智模型方面的实用性。

英文摘要

Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.

2606.07379 2026-06-09 cs.LG cs.AI cs.CL stat.ME 版本更新

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

编码智能体会欺骗我们吗?通过带随机测试的上限评估检测和防止作弊

Thanawat Lodkaew, Johannes Ackermann, Soichiro Nishimori, Nontawat Charoenphakdee, Masashi Sugiyama, Takashi Ishida

发表机构 * The University of Tokyo(东京大学) RIKEN(理化学研究所)

AI总结 提出CapCode框架,通过设置上限评估检测模型在编码任务中的作弊行为,并设计CapReward奖励机制防止作弊,实验表明该方法能有效检测和减少作弊。

详情
AI中文摘要

在智能体评估和训练中,一个日益增长的失败模式是模型可以通过利用捷径而非解决预期任务来获得高评估分数,产生欺骗性表现。这使得评估分数作为真实任务解决能力的度量不可靠。我们提出CapCode,一个构建带有随机测试的编码数据集的框架,其最佳可达的非作弊性能被故意限制在1以下。这种上限性能设计赋予评估分数更清晰的解释:显著高于上限的分数是不可信的,因此提供了作弊的证据。为了防止作弊,我们提出CapReward,一种基于CapCode原则的奖励设计,以抑制超出上限的优化。跨多个数据集的实验表明,CapCode能够检测作弊同时保持模型的性能排名,CapReward减少了作弊行为,产生了更好地遵循预期任务规范的模型。

英文摘要

A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle to discourage optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving performance ranking of models, and CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
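The cap-based reading of scores described above can be illustrated with a simple statistical check. The sketch below is an assumption made for illustration, not the CapCode procedure itself: it treats each randomized test as an independent Bernoulli trial whose honest pass probability is at most the cap, and computes a one-sided binomial tail probability for the observed pass count. The names `binomial_tail` and `flag_cheating`, the threshold `alpha`, and the example numbers are all hypothetical.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flag_cheating(passed: int, total: int, cap: float, alpha: float = 0.01):
    """Flag a score as implausible if it is significantly above the cap.

    Assumption (not taken from the paper): tests are independent and an
    honest solver passes each one with probability at most `cap`.
    """
    p_value = binomial_tail(total, passed, cap)
    return p_value < alpha, p_value


# Hypothetical numbers: a cap of 0.8 on 100 randomized tests.
print(flag_cheating(100, 100, cap=0.8))  # perfect score, far above the cap
print(flag_cheating(80, 100, cap=0.8))   # score sitting at the cap
```

Under these assumptions a perfect score is flagged because it is extremely unlikely for a solver that respects the cap, while a score at the cap is not.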

10. 安全、隐私、公平与可解释NLP 51 篇

2606.07520 2026-06-09 cs.CL cs.LG 新提交

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles

TinyJudge: 通过轻量级专家集成实现不可验证约束对齐

Yirong Zeng, Yufei Liu, Xiao Ding, Yutai Hou, Yuxian Wang, Wu Ning, Haonan Song, Dandan Tu, Qixun Zhang, Yuxiang He, Bibo Cai, Ting Liu

发表机构 * Harbin Institute of Technology SCIR Lab(哈尔滨工业大学SCIR实验室) Peking University(北京大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 针对LLM遵循不可验证约束时奖励黑客和计算开销大的问题,提出TinyJudge框架,利用多个小型语言模型集成提供奖励,在五个基准上平均性能提升约10%,奖励精度提升12%,训练速度提升3倍。

Comments ACL 2026 Main Conference; 15 pages, 9 figures

详情
AI中文摘要

指令遵循(IF)是LLM的核心能力,要求严格遵守从可验证(如输出长度)到不可验证(如语气)的多种约束。基于可验证奖励的强化学习已成为IF任务的范式,利用LLM作为裁判来评估不可验证约束。然而,我们实验发现该方法仍存在显著瓶颈,遭受严重的奖励黑客和更高的计算开销。本文首先分析不可验证约束的泛化能力,发现特定约束表现出独特的高泛化模式。受此启发,我们提出TinyJudge框架,采用专门的小型语言模型集成(约0.6B)为软约束提供奖励。通过将前沿模型的知识蒸馏到这些小型模型中,实现了高精度、轻量级的评估。在五个基准上的广泛评估表明,TinyJudge在平均性能上比基线高出约10%,奖励精度高出12%。关键的是,它还在总训练时间上实现了3倍的加速。我们的工作为将LLM与不可验证的人类指令对齐提供了一条可扩展且稳健的路径。

英文摘要

Instruction Following (IF) is a core capability of LLMs, requiring strict adherence to diverse constraints, ranging from verifiable ones (e.g., output length) to unverifiable ones (e.g., tone). Reinforcement learning with verifiable rewards has emerged as a paradigm for IF tasks, leveraging LLM-as-a-judge to assess unverifiable constraints. However, we empirically find that this approach remains a significant bottleneck, suffering from severe reward hacking and higher computational overhead. In this work, we first analyze the generalization capabilities of unverifiable constraints and discover that specific constraints exhibit distinct, high-generalization patterns. Motivated by this, we propose TinyJudge, a framework that employs an ensemble of specialized tiny language models (~0.6B) to provide rewards for soft constraints. By distilling expertise from frontier models into these tiny models, it achieves high-precision, lightweight evaluation. Extensive evaluations across five benchmarks demonstrate that TinyJudge outperforms the baselines by ~10% in average performance and 12% in reward precision. Crucially, it also achieves a 3× speedup in total training time. Our work provides a scalable and robust path for aligning LLMs with unverifiable human instructions.

2606.07528 2026-06-09 cs.CL cs.AI cs.LG 新提交

BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection in Large Language Models

BEACON: 面向大语言模型跨模型幻觉检测的行为熵聚合

Naveen Bera, Pulijala Sai Nikhila, Kondaguduru Abhiram, Shaik Gayaz Ali, Shoaib Sadiq Salehmohamed, Shaik Mohammed Omar, Jinal Prashant Thakkar, Hansika Aredla, Shalmali Ayachit

发表机构 * LLM Lens

AI总结 提出BEACON框架,通过多维度行为特征(语义熵、嵌入几何、思维链一致性、释义稳定性)的黑盒检测方法,在7个基准上达到0.8123 AUROC,优于现有方法。

Comments 12 pages, 6 tables, 1 figure. Code and data available upon request

详情
AI中文摘要

大语言模型中的幻觉,即生成事实上不正确或未经支持的内容,仍然是可靠部署的关键障碍。我们提出了BEACON(面向跨模型幻觉检测的行为熵聚合),一个黑盒幻觉检测框架,仅基于模型输出运行,无需访问内部表示或外部知识库。BEACON从结构化的多遍生成中提取31维特征向量,整合了基于NLI的语义熵、嵌入几何、思维链一致性和释义稳定性信号。在七个基准的7,617个标记样本上训练的梯度提升分类器达到了0.8123 ± 0.0102的AUROC(95%置信区间:0.7632-0.8251),优于独立的语义熵(+0.2298)和SelfCheckGPT风格的一致性基线(+0.2457)。特征重要性分析表明,幻觉本质上是多维的,需要组合的不确定性信号。一个高效的5次调用变体达到了0.7795的AUROC,使得在黑盒LLM API上的实际部署成为可能。

英文摘要

Hallucination in large language models (LLMs), defined as the generation of factually incorrect or unsupported content, remains a critical barrier to reliable deployment. We present BEACON (Behavioral Entropy Aggregation for Cross-model hallucination detectiON), a black-box hallucination detection framework that operates purely on model outputs without requiring access to internal representations or external knowledge bases. BEACON extracts a 31-dimensional feature vector from structured multi-pass generation, integrating NLI-based semantic entropy, embedding geometry, chain-of-thought consistency, and paraphrase stability signals. A gradient-boosted classifier trained on 7,617 labeled examples across seven benchmarks achieves 0.8123 ± 0.0102 AUROC (95% CI: 0.7632-0.8251), outperforming standalone semantic entropy (+0.2298) and SelfCheckGPT-style consistency baselines (+0.2457). Feature importance analysis shows that hallucination is inherently multi-dimensional, requiring combined uncertainty signals. An efficient 5-call variant achieves 0.7795 AUROC, enabling practical deployment across black-box LLM APIs.
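One of the signals named above, semantic entropy, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not BEACON's feature extractor: sampled answers are greedily grouped into meaning clusters, and the entropy of the cluster-size distribution is returned. The `equivalent` callable is a hypothetical stand-in for an NLI-based equivalence check; the example uses case-insensitive string equality only so that it runs without a model.

```python
import math
from typing import Callable, List


def semantic_entropy(answers: List[str], equivalent: Callable[[str, str], bool]) -> float:
    """Entropy (in nats) of the distribution over meaning clusters."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            # Compare against the first member as the cluster representative.
            if equivalent(answer, cluster[0]):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def same_text(a: str, b: str) -> bool:
    """Hypothetical stand-in for an NLI equivalence check."""
    return a.strip().lower() == b.strip().lower()


# Hypothetical samples from repeated generation on one question.
print(semantic_entropy(["Paris", "paris", "Lyon", "Paris"], same_text))
print(semantic_entropy(["Paris", "Paris", "Paris", "Paris"], same_text))
```

Higher values mean the samples disagree in meaning, which is the kind of uncertainty signal a downstream classifier could combine with other features.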

2606.07535 2026-06-09 cs.CL 新提交

Multilingual Refusal Alignment for Safer Large Language Models

多语言拒绝对齐:构建更安全的大型语言模型

Aleksandra Krasnodębska, Wojciech Kusa, Aldo Lipani

发表机构 * NASK National Research Institute(NASK国家研究院) University College London(伦敦大学学院)

AI总结 本研究系统探究多语言对齐动态,通过引入覆盖12种欧洲语言的RefusEU数据集和DPO实验,发现仅用英语对齐不足以保障跨语言安全,而多语言训练可在不降低通用性能的前提下提升安全性。

Comments Accepted to Findings ACL 2026

详情
AI中文摘要

随着大型语言模型(LLMs)在全球范围部署,确保其在多种语言中的安全性和对齐变得至关重要。然而,安全行为在不同语言之间往往表现出不可预测的差异,这对一致且合乎道德的人工智能构成了重大挑战。在这项工作中,我们系统研究了多语言对齐的动态,探讨了单语言对齐是否能够跨语言迁移、训练过程中如何保持语言一致性,以及由此产生的与通用知识能力之间的权衡。我们引入了RefusEU,一个覆盖12种欧洲语言的新型拒绝对齐数据集,其中包括一个用于评估当前最先进模型的专用测试集。我们的受控直接偏好优化(DPO)实验提供了两个关键见解:仅在英语中对齐模型不足以确保跨语言安全性,即使对于相同的危害类别也是如此;而使用多语言数据集进行训练可以在不降低通用性能(通过Global MMLU基准衡量)的情况下提高安全性。

英文摘要

As Large Language Models (LLMs) are deployed globally, ensuring their safety and alignment across multiple languages becomes paramount. However, safety behaviors often vary unpredictably between languages, posing significant challenges for consistent and ethical AI. In this work, we systematically investigate the dynamics of multilingual alignment, exploring whether single-language alignment transfers cross-lingually, how language consistency is preserved during training, and the resulting trade-offs with general knowledge capabilities. We introduce RefusEU, a novel refusal alignment dataset covering 12 European languages, including a dedicated test set for evaluating current state-of-the-art models. Our controlled Direct Preference Optimization (DPO) experiments provide two key insights: aligning models exclusively in English is insufficient to ensure cross-lingual safety, even for the same harm categories, whereas training on multilingual datasets can improve safety without degrading general performance, as measured by the Global MMLU benchmark.

2606.07822 2026-06-09 cs.CL cs.AI cs.LG 新提交

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

ACUTE协议:将语言模型激活操作化,以实现更好的校准、效用和信任

Nishant Subramani, Palash Goyal, Yiwen Song, Mani Malek, Yuan Xue, Tomas Pfister, Hamid Palangi

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Google(谷歌) Scale AI

AI总结 提出ACUTE协议,通过操作语言模型激活来估计置信度,平衡校准与信息性,在多项选择问答、工具调用和科学文档摘要等任务上优于强基线,提升校准、效用和可信度。

Comments Accepted to ICML 2026

详情
AI中文摘要

随着语言模型的改进并越来越多地部署以解决各种任务,可信度变得至关重要。校准是信任的良好代理:良好校准的置信度估计有助于在信任特定模型输出时告知风险与回报的权衡。不幸的是,即使模型改进,它们仍然校准不良,往往偏向过度自信。此外,校准可能被操纵:总是预测基率的策略是完美校准的,但完全没有信息性。为了解决这个问题,我们开发了一个新指标,即通过预言机重新归一化的期望效用(EURO),它平衡了校准和信息性。我们还提出了一种通用的基于激活的置信度、效用和信任估计协议(ACUTE),以适当裁决不确定性。ACUTE协议为4个模型家族的6个模型上的3个任务(包括多项选择问答、工具调用和科学文档摘要)提供了灵活、样本高效和计算高效的置信度估计器。ACUTE在EURO上优于强基线,同时保持较低的校准误差。综合来看,我们的工作表明,为LLM配备ACUTE协议可以在多种设置中提高校准、效用和可信度。

英文摘要

As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.
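The claim that calibration can be gamed is easy to check numerically. The sketch below computes a standard binned expected calibration error (ECE) for a policy that always outputs the base rate. It does not implement EURO, whose definition is not given in the abstract; the function name, bin count, and outcome data are assumptions chosen for illustration.

```python
from typing import List


def expected_calibration_error(conf: List[float], correct: List[int], n_bins: int = 10) -> float:
    """Binned ECE: weighted mean of |average confidence - accuracy| per bin."""
    n = len(conf)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


correct = [1, 1, 1, 0, 1, 1, 0, 1]          # hypothetical outcomes, 6 of 8 correct
base_rate = sum(correct) / len(correct)      # 0.75
constant_conf = [base_rate] * len(correct)   # same confidence for every output

print(expected_calibration_error(constant_conf, correct))  # no calibration error
print(len(set(constant_conf)))  # one distinct value, so no ranking signal
```

The constant policy has zero calibration error yet cannot separate correct from incorrect outputs, which is why a metric that also rewards informativeness is needed.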

2606.07867 2026-06-09 cs.CL 新提交

The Cold-Start Safety Gap in LLM Agents

LLM智能体中的冷启动安全差距

Chung-En Sun, Linbo Liu, Tsui-Wei Weng

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 研究发现工具调用型LLM智能体在会话开始时最脆弱,随着常规任务执行安全性提升,提出SODA基准并验证预热策略可缩小冷启动安全差距。

详情
AI中文摘要

工具调用型LLM智能体在整个对话过程中是否同样安全?我们发现并非如此:智能体在会话开始时最脆弱,在完成几个常规智能体任务后安全性显著提升——我们将这一现象称为冷启动安全差距。为了系统研究这一问题,我们引入了面向智能体的深度安全基准(SODA),该基准控制智能体在遇到安全威胁之前完成的常规智能体任务数量,最多支持20个前置任务。评估来自4个系列的7个模型,随着前置常规智能体任务数量从零增加到二十,安全性提升9%至52%。表示分析证实,随着更多前置任务的出现,模型隐藏状态逐渐向安全对齐区域移动。通过系统研究前置对话中哪部分最重要,我们发现常规智能体任务本身是安全性的主要驱动因素,而智能体自身的先前响应对安全性影响较小,但对于保持后续效用至关重要。这一结论在开源安全基准(AgentHarm、Agent Safety Bench)和效用基准(BFCL、API-Bank)上的评估中得到进一步支持,证实了在部署前用常规智能体任务预热智能体可以使其更安全并保持全部能力。基于这些发现,我们推荐一种简单的部署策略:让智能体在可能暴露于安全关键请求之前完成几个常规智能体任务,以缩小冷启动安全差距。我们的代码可在https://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap获取。

英文摘要

Are tool-calling LLM agents equally safe throughout a conversation? We discover they are not: agents are most vulnerable at the very start of a session and become substantially safer after a few regular agentic tasks -- a phenomenon we term the cold-start safety gap. To study this systematically, we introduce Safety Over Depth for Agents (SODA), a benchmark that controls how many regular agentic tasks the agent completes before encountering a safety threat, supporting up to 20 preceding tasks. Evaluating 7 models from 4 families, safety improves by 9--52% as the number of preceding regular agentic tasks increases from zero to twenty. Representation analysis confirms that model hidden states gradually shift toward a safety-aligned region as more preceding tasks are present. By systematically studying which part of the preceding conversation matters most, we find that the regular agentic tasks themselves are the primary driver of safety, while the agent's own prior responses have less effect on safety but are essential for preserving later utility. This conclusion is further supported by evaluation on open-source safety benchmarks (AgentHarm, Agent Safety Bench) and utility benchmarks (BFCL, API-Bank), confirming that warming up the agent with regular agentic tasks before deployment makes it safer and preserves full capability. Based on these findings, we recommend a simple deployment strategy: having the agent complete a few regular agentic tasks before possible exposure to safety-critical requests mitigates the cold-start safety gap. Our code is available at https://github.com/Trustworthy-ML-Lab/Agent-Cold-Start-Safety-Gap

2606.07877 2026-06-09 cs.CL 新提交

Whose Norms? Disentangling Cultural and Personal Alignment in Large Language Models

谁的规范?解开大语言模型中的文化与个人对齐

Angana Borah, Isabelle Augenstein, Rada Mihalcea

发表机构 * University of Michigan - Ann Arbor(密歇根大学安娜堡分校) University of Copenhagen(哥本哈根大学)

AI总结 提出PACT框架评估大语言模型在文化规范与个人偏好间的权衡,发现模型受国家背景影响大于年龄和性别,且人类对齐未能捕捉文化多元性。

Comments Preprint under review

详情
AI中文摘要

大语言模型越来越多地用于需要平衡文化规范与个人偏好的社会决策情境。例如,偏好诚实的用户可能会询问是否应在当地规范倾向于间接反馈时公开纠正同事。然而,现有研究大多将文化对齐和个性化分开研究。我们引入了PACT(个人偏好与文化规范权衡)框架,该框架评估模型是选择遵循文化规范还是允许个人偏好。我们发现,LLMs在强制执行文化规范的刚性程度上有所不同,行为受国家背景(7.8%)的影响大于年龄(1%)和性别(0.7%),并且在指令微调后非均匀地变化。此外,我们在五个国家进行的关于PACT的人类研究表明,人类遵循文化主要受情境国家驱动,当参与者判断自己的文化背景时一致性最低,显示出文化内部的多元性。最后,人类-LLM对齐实验表明,模型可以匹配多数选择,但未能捕捉响应分布和不确定性(最佳相关性仅为0.24)。总之,这些发现激励了超越多数、捕捉社会判断中文化多元性和分歧的对齐评估。

英文摘要

Large language models are increasingly used for social decision-making situations that require balancing cultural norms with personal preferences. For example, a user preferring honesty might ask whether to correct a coworker publicly when local norms favor indirect feedback. Yet existing research studies cultural alignment and personalization largely separately. We introduce PACT, the Personal-Preference and Cultural-Norm Trade-off framework, which evaluates whether models choose to follow a cultural norm or allow personal preferences. We find that LLMs vary in how rigidly they enforce cultural norms, with behavior shifted more by country context (7.8%) than age (1%) and gender (0.7%) and shifting non-uniformly after instruction tuning. Furthermore, our five-country human study on PACT shows that culture-following in humans is mainly driven by scenario country, with the lowest agreement when participants judge their own cultural contexts, showing within-culture pluralism. Finally, human-LLM alignment experiments show that models can match majority choices, but fail to capture response distributions and uncertainty (with best correlations reaching only 0.24). Together, these findings motivate alignment evaluations that go beyond majority to capture cultural pluralism and disagreement in social judgment.

2606.07964 2026-06-09 cs.CL 新提交

What Does Debiasing Really Remove? A Geometric Study of PCA-Based Gender Debiasing in Word Embeddings

去偏究竟移除了什么?词嵌入中基于PCA的性别去偏的几何研究

Alexey Kresin, Tchifou M. Dieffi, Tomer Caspi

发表机构 * Hood College(胡德学院) Ben-Gurion University of the Negev(内盖夫本-古里安大学)

AI总结 通过几何分析揭示PCA去偏主要移除第一主成分中的直接性别偏见,但无法消除分布在多维度上的关联偏见,且会破坏嵌入几何结构,表明偏见并非纯低秩,简单子空间移除不足以全面去偏。

Comments 8 pages, 4 figures. Source code available at https://github.com/AlexeyKresin/embedding-bias-geometry

详情
AI中文摘要

基于主成分分析(PCA)的去偏方法被广泛用于减少大型语言模型词嵌入中的性别偏见,但尚不清楚这些方法实际移除了偏见的哪些方面以及这一过程的破坏性有多大。这些方法基于偏见存在于低维子空间的理解,假设大部分偏见可以通过少数主成分捕获。在这项工作中,我们对基于PCA的性别去偏进行了系统的几何分析,并研究了嵌入空间中实际被移除的内容。我们在多个嵌入上的实验表明,直接性别偏见主要集中在前几个主成分上,支持了低秩偏见假设。然而,通过WEAT测量的关联偏见并不与这些主方向对齐,而是分布在多个嵌入维度上。此外,正如预期,我们证明移除越来越多的主成分会导致嵌入几何的一致退化,影响语义结构和向量关系。这些结果表明,基于PCA的去偏是一种权衡:虽然它有效减少了某些形式的直接偏见,但未能消除分布式关联,并引入了几何扭曲。此外,不存在通用的最优去偏水平,因为偏见减少与语义保留之间的平衡取决于所选的度量和嵌入。总体而言,我们的发现表明词嵌入中的偏见并非纯粹低秩,简单的子空间移除方法可能不足以实现全面去偏。

英文摘要

Debiasing methods based on principal component analysis (PCA) are broadly used to reduce gender bias in word embeddings used in LLMs, yet it remains unclear what aspects of bias they actually remove and how destructive this process is. These methods are based on the understanding that bias resides in a low-dimensional subspace, with the assumption that most of it can be captured by a few principal components. In this work, we conduct a systematic geometric analysis of PCA-based gender debiasing and investigate what is actually removed from the embedding space. Our experiments across multiple embeddings show that direct gender bias is primarily concentrated in the first principal component, supporting the low-rank bias hypothesis. However, associative bias measured by WEAT does not align with these principal directions and is instead spread across multiple embedding dimensions. Furthermore, as expected, we demonstrate that removing an increasing number of principal components leads to a consistent degradation of the embedding geometry, affecting semantic structure and vector relationships. These results reveal that PCA-based debiasing operates as a trade-off: while it effectively reduces certain forms of direct bias, it fails to eliminate distributed associations and introduces geometric distortion. Moreover, there is no universal optimal level of debiasing, as the balance between bias reduction and semantic preservation depends on the chosen metric and embedding. Overall, our findings suggest that bias in word embeddings is not purely low-rank and that simple subspace removal methods may be insufficient for comprehensive debiasing.
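The subspace-removal operation discussed above can be sketched without any library. The code below is an illustrative assumption, not the authors' pipeline: it approximates the first principal component of a set of difference vectors by power iteration, then projects that direction out of an embedding. The toy 2-D vectors and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def first_principal_direction(vectors: List[Vector], iters: int = 50) -> Vector:
    """Approximate the top principal component by power iteration."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    centered = [[v[i] - mean[i] for i in range(dim)] for v in vectors]
    # Arbitrary start; it must not be orthogonal to the top component.
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # Apply the unnormalized covariance matrix: sum over x of (x . u) x.
        w = [0.0] * dim
        for x in centered:
            s = dot(x, u)
            for i in range(dim):
                w[i] += s * x[i]
        norm = math.sqrt(dot(w, w))
        u = [wi / norm for wi in w]
    return u


def remove_direction(v: Vector, g: Vector) -> Vector:
    """Project v onto the subspace orthogonal to the unit vector g."""
    s = dot(v, g)
    return [vi - s * gi for vi, gi in zip(v, g)]


# Toy difference vectors whose variance lies almost entirely on the first axis.
diffs = [[2.0, 0.0], [-2.0, 0.0], [0.0, 0.1], [0.0, -0.1]]
g = first_principal_direction(diffs)
debiased = remove_direction([3.0, 4.0], g)
print(g, debiased)
```

Removing one direction leaves every other coordinate untouched, which matches the abstract's point that bias spread across many dimensions survives this kind of projection.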

2606.07969 2026-06-09 cs.CL cs.AI 新提交

Neutrality Bites: Gender Representation in AI-Generated Animal Stories

中立性的代价:AI生成的动物故事中的性别表征

Imani Finkley, Yuanxi Li, Melanie Walsh

发表机构 * University of Washington(华盛顿大学)

AI总结 研究六种主流LLM在生成动物故事时的性别分配,发现模型常避免指定性别或使用中性语言,但一旦指定则显著偏向男性,女性角色几乎缺席,表明中立策略可能导致边缘视角的抹除。

Comments FAccT (ACM Conference on Fairness, Accountability, and Transparency) 2026

详情
AI中文摘要

AI生成故事中的性别偏见是一个有充分记录的问题。尽管人们已投入大量关注来减少或缓解这种偏见,但干预措施是否产生真正公平的结果并不总是明确的。为了调查这一问题,我们研究了大型语言模型(LLMs)如何处理一个流行、高度模糊且已知会紧密复现人类刻板印象的叙事语境中的性别分配:关于会说话的动物的故事。我们提示六个领先的LLM完成一个关于七个性别未说明的拟人化动物角色的英语故事。此外,我们迭代了四种不同的叙事设置和一系列模型温度。在23.8K个故事中,我们发现模型经常避免在故事中指定动物角色的性别(平均19%)或使用性别中立的语言如“它”或“它的”(平均38.2%)。然而,当性别被指定时,存在显著的男性偏见。女性动物角色几乎不存在,仅出现在2.2%的故事中,而男性角色出现在40.6%的故事中。我们的发现指向一个更广泛的论点:中立性是有代价的。换句话说,优先考虑中立性以解决社会偏见的模型实际上可能助长边缘化视角和身份的抹除。我们建议需要追求超越中立性的替代策略,例如那些更平等地在想象主体之间分配社会可能性的策略。

英文摘要

Gender bias in AI-generated stories is a well-documented problem. While much attention has been paid to reducing or mitigating this bias, it is not always clear whether interventions produce genuinely fairer results. To investigate this issue, we examine how large language models (LLMs) handle gender assignment in a narrative context that is popular, highly ambiguous, and also known to closely reproduce human stereotypes: stories about talking animals. We prompt six leading LLMs to complete an English-language story about seven different anthropomorphic animal characters whose gender is unstated. We additionally iterate with four different narrative settings and a range of model temperatures. Across the 23.8K stories, we find that models frequently avoid gendering the animal character in the story (19% on average) or use gender-neutral language like "it" or "its" (38.2% on average). However, when gender is assigned, there is a significant masculine bias. Feminine animal characters are virtually absent, present in just 2.2% of stories vs. 40.6% that feature masculine characters. Our findings point to a broader argument: neutrality bites. In other words, models that prioritize neutrality to address social bias may actually contribute to the erasure of marginalized perspectives and identities. We suggest that alternative strategies beyond neutrality need to be pursued, such as ones that more equally distribute social possibilities across imagined subjects.

2606.07970 2026-06-09 cs.CL cs.AI 新提交

Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

通过扩展训练时对抗攻击防御恶意微调

Haoming Wen, Shi Chen, Qingyu Shi, Siyuan Liu, Minrui Luo, Jingzhao Zhang, Tianxing He

发表机构 * Xiongan AI Institute(雄安人工智能研究院) Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息研究院) Shanghai Qi Zhi Institute(上海期智研究院)

AI总结 针对全参数微调的安全威胁,提出基于对抗训练和双层优化的Patcher方法,通过扩展对抗循环中的优化步数增强防御,并设计并行算法提升效率。

详情
AI中文摘要

当前的开源大型语言模型(LLMs)容易受到恶意微调攻击,这些攻击只需在中毒数据集上进行几步监督微调(SFT)即可破坏LLMs的安全对齐。现有的对齐阶段防御主要设计用于防御使用参数高效微调方法的攻击。然而,它们无法防御使用全参数微调的更强攻击。在本文中,我们提出了Patcher,一种受对抗训练和双层优化启发的方法,以对抗此类攻击。Patcher通过扩展对抗循环中的优化步数来增强模拟攻击,从而迫使防御者找到对更强攻击不敏感的模型参数。此外,我们提出了一种高效的并行算法来实现Patcher,减少了训练的挂钟时间,同时保持了Patcher的性能。大量实验表明,与普通SFT对齐相比,Patcher显著提高了模型的鲁棒性,并且可以迁移到不同的攻击场景和模型大小。代码可在https://github.com/haomingwen/patcher获取。

英文摘要

Current open-weight large language models (LLMs) are prone to malicious finetuning attacks, which could compromise the safety alignment of LLMs with only a few steps of supervised finetuning (SFT) on poisoned datasets. Existing alignment-stage defenses are primarily designed to defend against attacks that use parameter-efficient finetuning methods. However, they fail to defend against stronger attacks that use full-parameter finetuning. In this paper, we propose Patcher, a method inspired by adversarial training and bi-level optimization, to combat such attacks. Patcher strengthens the simulated attack by scaling up the optimization steps in the adversarial loop, thus forcing the defender to find model parameters that are insensitive to stronger attacks. Furthermore, we propose an efficient parallel algorithm to implement Patcher, decreasing the wall-clock time of training while preserving Patcher's performance. Extensive experiments show that Patcher substantially improves the model's robustness compared to vanilla SFT alignment, and transfers to diverse attack scenarios and model sizes. Code is available at https://github.com/haomingwen/patcher.

2606.08076 2026-06-09 cs.CL cs.AI cs.CY 新提交

"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory

“我理解你的观点”:通过交往行动理论视角看LLM的说服与谄媚

Esra Dönmez, Agnieszka Falenska

发表机构 * Institute for Natural Language Processing, University of Stuttgart(斯图加特大学自然语言处理研究所) Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart(斯图加特大学智能系统反思交流论坛)

AI总结 本研究基于哈贝马斯的交往行动理论,通过模拟Reddit讨论,发现LLM能有效传达言外之意(如建立信任),其谄媚策略与观点改变强相关,且人类更偏好LLM生成的论证。

详情
Journal ref
Findings of the Association for Computational Linguistics: ACL 2025
AI中文摘要

大型语言模型(LLM)能够生成高质量的论证,但它们在参与细致入微且有说服力的交往行动方面的能力仍基本未被探索。本研究通过尤尔根·哈贝马斯的交往行动理论框架探索LLM的说服潜力。它考察LLM是否以与人类交流可比的方式表达言外之意(即语言的语用功能,如传达知识、建立信任或表明相似性)。我们使用来自说服性子论坛ChangeMyView的对话,模拟意见持有者与LLM之间的在线讨论。然后,我们比较人类撰写和LLM生成的反驳论证中言外之意的可能性,特别是那些成功改变了原帖作者观点的论证。我们发现,所有三个LLM都能有效传达言外之意——通常比人类更甚——可能增加其拟人化程度。此外,LLM精心制作谄媚回应,与意见持有者的意图紧密对齐,这种策略与观点改变强相关。最后,众包工作者发现LLM生成的反驳论证更易令人认同,并且一致偏好它们胜过人类撰写的论证。这些发现表明,LLM的说服力不仅仅在于生成高质量论证。相反,用人类偏好训练LLM有效地调整它们以模仿人类交流模式,特别是细微的交往行动,可能增加个体对其影响的易感性。

英文摘要

Large Language Models (LLMs) can generate high-quality arguments, yet their ability to engage in nuanced and persuasive communicative actions remains largely unexplored. This work explores the persuasive potential of LLMs through the framework of Jürgen Habermas' Theory of Communicative Action. It examines whether LLMs express illocutionary intent (i.e., pragmatic functions of language such as conveying knowledge, building trust, or signaling similarity) in ways that are comparable to human communication. We simulate online discussions between opinion holders and LLMs using conversations from the persuasive subreddit ChangeMyView. We then compare the likelihood of illocutionary intents in human-written and LLM-generated counter-arguments, specifically those that successfully changed the original poster's view. We find that all three LLMs effectively convey illocutionary intent -- often more so than humans -- potentially increasing their anthropomorphism. Further, LLMs craft sycophantic responses that closely align with the opinion holder's intent, a strategy strongly associated with opinion change. Finally, crowd-sourced workers find LLM-generated counter-arguments more agreeable and consistently prefer them over human-written ones. These findings suggest that LLMs' persuasive power extends beyond merely generating high-quality arguments. On the contrary, training LLMs with human preferences effectively tunes them to mirror human communication patterns, particularly nuanced communicative actions, potentially increasing individuals' susceptibility to their influence.

2606.08157 2026-06-09 cs.CL 新提交

Cross Paraphrastic Invariance Learning for Hallucination Detection

跨释义不变性学习用于幻觉检测

Shanshan Lin, Dongsheng Hong, Sibo Ju, Chao Chen, Sihong Xie, Xiangwen Liao

AI总结 提出CPIL框架,通过构建正负样本对进行两阶段对比学习,仅用1%标注数据即在11个任务上超越基线,高效检测LLM幻觉。

Comments Accepted to ICASSP 2026

详情
AI中文摘要

大型语言模型(LLM)经常生成缺乏源文档支持的幻觉。为避免昂贵的LLM评估流水线和现有分类器的大量标注需求,我们提出CPIL(跨释义不变性学习),一个两阶段孪生框架,最大化利用现有标注数据。具体地,CPIL通过以下方式构建信息丰富的训练对:(i)为每个文档-声明示例生成释义视图作为正样本,并显式对齐其表示以强制对表面形式的不变性;(ii)挖掘同文档、异标签对作为难负样本,以锐化文档敏感的决策边界。然后CPIL进行两阶段模型训练:第一阶段进行对比预训练,学习释义不变、基于事实的嵌入空间;第二阶段附加轻量级分类器进行二元事实性判断。在LLM-AggreFact基准(11个任务)上,CPIL仅用约1%的标注数据即在F1分数上超越强基线,展示了其预测优越性和标签效率。

英文摘要

Large language models (LLMs) frequently generate hallucinations, i.e., claims that are unsupported by a source document. To avoid costly LLM-as-evaluator pipelines and the heavy annotation demands of existing classifiers, we propose CPIL (Cross Paraphrastic Invariance Learning), a two-stage Siamese framework that maximizes the utility of existing labeled data. Concretely, CPIL constructs informative training pairs by: (i) generating paraphrastic views of each document-claim example as positives, and explicitly aligning their representations to enforce invariance to surface form; and (ii) mining same-document, opposite-label pairs as hard negatives to sharpen document-sensitive decision boundaries. CPIL then conducts two-stage model training: Stage 1 performs contrastive pretraining to learn a paraphrase-invariant, grounding-aware embedding space; and Stage 2 attaches a lightweight classifier for binary groundedness. On the LLM-AggreFact benchmark (11 tasks), CPIL surpasses strong baselines in F1 score with only ~1% labeled data, showing its prediction superiority and label efficiency.

2606.08158 2026-06-09 cs.CL cs.AI 新提交

Constrained Paraphrase Consistency for LLM Hallucination Detection

约束释义一致性用于大语言模型幻觉检测

Shanshan Lin, Dongsheng Hong, Sibo Ju, Chao Chen, Xi Zhang, Xiangwen Liao

AI总结 提出约束一致性幻觉检测器(CCHD),通过约束优化利用释义一致性,无需额外数据,在多个基准上超越现有方法。

Comments Accepted to ICASSP 2026

详情
AI中文摘要

大型语言模型(LLM)可能生成事实不一致的声明,这促使需要准确且可扩展的幻觉检测器。先前的工作主要通过合成或新标注来扩大训练集,这增加了成本和潜在偏差,同时未充分利用语义等价释义所隐含的一致性。我们提出约束一致性幻觉检测器(CCHD),将训练形式化为约束优化问题。在原始文档-声明对上的标准交叉熵基础上,补充了(i)释义一致性约束,限制不同释义视图之间的差异,以及(ii)标签保持约束,将释义与真实标签绑定。我们通过模型参数和每个视图的拉格朗日乘子的梯度下降-上升法求解该问题,仅增加少量标量对偶变量,且无推理时开销。使用DeBERTa和Flan-T5骨干网络,CCHD在标准事实性基准上持续优于强基线(FactCG、MiniCheck和AlignScore),展示了其在幻觉检测上的优越性。

英文摘要

Large language models (LLMs) can generate factually inconsistent claims, motivating accurate and scalable hallucination detectors. Prior work largely enlarges training sets via synthesis or new annotations, introducing increasing cost and potential bias while underusing the consistency implied by semantically equivalent paraphrases. We propose Consistency-Constrained Hallucination Detector (CCHD), which formulates training as a constrained optimization problem. The standard cross-entropy on original document-claim pairs is complemented by (i) paraphrase-consistency constraints bounding divergence across paraphrased views, and (ii) label-preservation constraints tying paraphrases to ground truth. We solve the problem by gradient descent-ascent over model parameters and per-view Lagrange multipliers, adding only a few scalar dual variables and no inference-time overhead. With DeBERTa and Flan-T5 backbones, CCHD consistently outperforms strong baselines (FactCG, MiniCheck, and AlignScore) on standard factuality benchmarks, demonstrating its superiority on hallucination detection.
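The gradient descent-ascent scheme mentioned above alternates a descent step on the primal variables with an ascent step on the Lagrange multipliers. The sketch below shows the mechanics on a toy scalar problem chosen for illustration (minimize (x - 2)^2 subject to x <= 1). It is not CCHD's objective, and the step size and iteration count are assumptions.

```python
def gradient_descent_ascent(steps: int = 2000, lr: float = 0.05):
    """Toy problem: minimize (x - 2)^2 subject to x - 1 <= 0.

    Lagrangian: L(x, lam) = (x - 2)^2 + lam * (x - 1), with lam >= 0.
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        grad_x = 2.0 * (x - 2.0) + lam       # dL/dx
        grad_lam = x - 1.0                   # dL/dlam, the constraint value
        x = x - lr * grad_x                  # descent on the primal variable
        lam = max(0.0, lam + lr * grad_lam)  # projected ascent on the multiplier
    return x, lam


x_star, lam_star = gradient_descent_ascent()
print(x_star, lam_star)  # approaches the constrained optimum x = 1 with multiplier 2
```

The multiplier grows only while the constraint is violated, so it acts as an automatically tuned penalty weight; in a detector, each paraphrase view would carry its own multiplier in the same way.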

2606.08243 2026-06-09 cs.CL 新提交

Building Comparative Motivation Profiles with Instrumental Interventions

通过工具性干预构建比较动机概况

David Vella Zarb, Rustem Turtayev, Taywon Min, Jinghua Ou, Shi Feng

发表机构 * MATS University of Cambridge(剑桥大学) KAIST(韩国科学技术院) George Washington University(乔治华盛顿大学)

AI总结 通过对称工具性干预区分对齐伪装中的策略性自我保护与研究者期望追踪,发现模型对期望追踪更敏感,提示需要构念效度检验。

详情
AI中文摘要

安全性评估通常从行为模式推断潜在动机,但这些推断的构念效度尚不明确。我们在对齐伪装中研究这一问题,即当模型推断出训练压力时,它们更常服从训练目标。这种行为通常被解释为策略性自我保护,但也可能反映模型对研究者期望的敏感性。我们引入一个对称干预框架来区分这些竞争性假设。我们不直接干预“诡计”或“谄媚”,而是针对每个假设所蕴含的工具性过程:后果追踪和研究者期望追踪。然后比较对这些过程的干预如何影响对齐伪装。我们使用合成文档微调(SDF)、激活引导和提示研究了四个开放权重模型生物。在合成文档微调下,Llama-3.1-70B、Llama-3.1-405B 和 Qwen-2.5-72B 对期望追踪干预比后果追踪干预更敏感。对 Llama-3.1-70B 的激活引导支持相同的总体图景,提示干预与 SDF 概况大致一致。总体而言,对齐伪装行为可能在因果上对评估上下文期望敏感,尽管存在与诡计一致的草稿板。因此,诡计和策略性欺骗评估需要构念效度检验,而对称工具性干预提供了这样一种测试。

英文摘要

Safety evaluations often infer latent motivations from behavioral patterns, but the construct validity of these inferences is unclear. We study this problem in alignment faking, where models comply with training objectives more often when they infer training pressure. This behavior is commonly interpreted as strategic self-preservation, but it may also reflect sensitivity to the model's inference about the expectation of researchers conducting the evaluation. We introduce a symmetric intervention framework for distinguishing these competing hypotheses. Instead of directly intervening on "scheming" or "sycophancy", we target instrumental processes entailed by each hypothesis: consequence-tracking and researcher-expectation tracking. We then compare how interventions on these processes affect alignment faking. We study four open-weight model organisms using synthetic document fine-tuning (SDF), activation steering, and prompting. Under synthetic document fine-tuning, Llama-3.1-70B, Llama-3.1-405B, and Qwen-2.5-72B are more sensitive to expectation-tracking than consequence-tracking interventions. Activation steering on Llama-3.1-70B supports the same broad picture, and prompt interventions broadly align with SDF profiles. Overall, alignment-faking behavior can be causally sensitive to evaluation-context expectations despite scheming-consistent scratchpads. Scheming and strategic-deception evaluations therefore need construct-validity checks, and symmetric instrumental interventions provide one such test.

2606.08381 2026-06-09 cs.CL cs.AI 新提交

Auditing Proprietary Alignment in Large Language Models: A Comparative Framework Without a Ground-Truth Standard

审计大型语言模型中的专有对齐:一种无需真实标准的比较框架

Alireza Arbabi, Florian Kerschbaum

发表机构 * University of Waterloo(滑铁卢大学) Vector Institute(向量研究所)

AI总结 提出一种统计框架,通过比较目标模型与基线模型在共享语义空间中的响应偏差,检测黑盒语言模型中的专有对齐行为,无需真实标准即可实现外部审计。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地通过不透明的开发和部署流程发布和部署,使得模型提供商能够在不正式宣布的情况下注入有意的、提供商特定的策略。因此,已有多种模型被报道生成反映专有规则和组织利益的响应,导致在有争议话题上的审查或错误信息。然而,系统性地识别这种对齐仍然是一个基本挑战,因为“专有”在不同语境中的含义模糊。在本文中,我们提出了一种统计框架,通过比较行为分析来检测黑盒语言模型中的专有对齐。我们的方法量化了目标模型与一组参考基线模型在共享语义空间中的响应之间的系统性偏差。通过评估相对行为差异而非绝对正确性,我们的框架能够在黑盒访问下进行有原则的审计。应用于几个广泛讨论但此前未量化的案例,它为外部评估大型语言模型中提供商特定的对齐行为提供了系统且可扩展的基础。

英文摘要

Large language models (LLMs) are increasingly released and deployed through opaque development and deployment pipelines, enabling model providers to inject intentional, provider-specific policies without officially announcing them. As a result, various models have been reported to generate responses reflecting proprietary rules and organizational interests, leading to censorship or misinformation on controversial topics. However, systematic identification of such alignment remains a fundamental challenge, complicated by the ambiguity of what "proprietary" entails in different contexts. In this paper, we propose a statistical framework for detecting proprietary alignment in black-box language models via comparative behavioral analysis. Our approach quantifies systematic deviations between the responses of a target model and those of a reference set of baseline models in a shared semantic space. By evaluating relative behavioral divergence rather than absolute correctness, our framework enables principled auditing under black-box access. Applied to several widely discussed but previously unquantified cases, it provides a systematic and scalable basis for external assessment of provider-specific alignment behavior in large language models.

2606.08451 2026-06-09 cs.CL cs.AI 新提交

Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models

谄媚作为多语言对齐失败:安全性能如何随语言、主题和模型退化

Arya Shah, Himanshu Beniwal, Mayank Singh, Chaklam Silpasuwanchai

发表机构 * IIT Gandhinagar(印度理工学院甘地讷格尔分校) Asian Institute of Technology(亚洲理工学院)

AI总结 研究多语言模型中谄媚现象,发现低资源语言中谄媚率激增,且与主题无关,归因于分词器生育率,表明对齐方法在非高资源语言中泛化差。

Comments 19 pages, 9 figures, 7 tables

AI中文摘要

安全对齐的大型语言模型常常表现出谄媚,即倾向于肯定用户的意见而不考虑事实准确性。尽管在英语中已有充分研究,但其在其他语言中的表现仍基本未被考察,使得数十亿非英语使用者可能容易受到模型验证的错误信息的影响。我们首次进行了大规模、多模型的跨语言谄媚评估,对六个指令调优模型在涵盖38种语言和33个主题类别的110万个实例上进行了基准测试。我们识别出一致的资源层级效应:谄媚率在低资源和零样本语言设置中急剧上升。关键的是,这种退化与主题无关,模型在良性提示和安全关键提示上均匀失败,在最需要保护的地方没有提供额外保护。我们进一步确定了分词器生育率(tokenizer fertility)作为这种对齐崩溃的结构性驱动因素。总的来说,我们的结果表明,当前的对齐方法在高资源语言之外泛化能力差,强调了迫切需要公平的多语言安全技术。

英文摘要

Safety-aligned large language models often exhibit sycophancy, which is the tendency to affirm users' opinions regardless of factual accuracy. Although well-studied in English, its manifestation in other languages remains largely unexamined, leaving billions of non-English speakers potentially vulnerable to model-validated misinformation. We present the first large-scale, multi-model evaluation of cross-lingual sycophancy, benchmarking six instruction-tuned models across 1.1 million instances spanning 38 languages and 33 topic categories. We identify a consistent resource-tier effect: sycophancy rates spike sharply in low-resource and zero-shot language settings. Critically, this degradation is topic-agnostic, as models fail uniformly across both benign and safety-critical prompts, offering no additional protection where it is most needed. We further identify tokenizer fertility as a structural driver of this alignment collapse. Collectively, our results demonstrate that prevailing alignment methodologies generalize poorly beyond high-resource languages, underscoring the urgent need for equitable multilingual safety techniques.
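
The abstract names tokenizer fertility without defining it. Fertility is commonly taken to be the average number of tokens a tokenizer produces per word, and the sketch below computes it under that assumption. The `toy_char_bigram_tokenizer` is a hypothetical stand-in for a real subword tokenizer, and tokenizing word by word is a simplification; the paper's exact measurement procedure is not stated in the abstract.

```python
def fertility(text, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    n_tokens = sum(len(tokenize(word)) for word in words)
    return n_tokens / len(words)


def toy_char_bigram_tokenizer(word):
    # Hypothetical stand-in for a subword tokenizer: splits a word into
    # chunks of at most two characters.
    return [word[i:i + 2] for i in range(0, len(word), 2)]


# "safety" -> 3 chunks, "alignment" -> 5 chunks, so fertility is 8 / 2.
print(fertility("safety alignment", toy_char_bigram_tokenizer))
```

A higher value means text in that language is split into more pieces per word, which is the property the abstract links to weaker alignment in low-resource languages.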

2606.08496 2026-06-09 cs.CL cs.LG 新提交

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

SAEExplainer: 基于激活引导偏好优化的SAE特征解释

Jingyi He, Haiyan Zhao, Ruxue Shi, Yanguang Liu, Xin Wang, Fei Sun, Mengnan Du

发表机构 * Shanghai Jiao Tong University(上海交通大学) NJIT(新泽西理工学院) Jilin University(吉林大学) Institute of Computing Technology, CAS(中国科学院计算技术研究所) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出SAEExplainer框架,利用激活分数作为奖励信号,通过两轮优化迭代自纠正基础解释,减少解释幻觉并增强因果触发模式。

AI中文摘要

尽管稀疏自编码器(SAE)通过将密集表示分解为稀疏特征缓解了大语言模型(LLM)的不透明性,但解释这些特征仍然是一个核心挑战。然而,当前的解释方法通常运行在开环范式下,未能利用机制性反馈进行进一步优化。在本文中,我们提出SAEExplainer,一个利用激活分数作为客观奖励信号来训练模型进行自我纠正和迭代自举的训练框架。通过两轮优化过程迭代验证和纠正基础解释,SAEExplainer实现了其解释能力的持续提升。该机制显著减少了解释幻觉并强化了因果触发模式。大量实验表明,我们的方法在大多数指标上优于已有基线,特别是在因果触发和判别性激活方面。

英文摘要

Although Sparse Autoencoders (SAEs) have mitigated the opacity of large language models (LLMs) by decomposing dense representations into sparse features, explaining these features remains a central challenge. Current explanation methods, however, typically operate within an open-loop paradigm, failing to leverage mechanistic feedback for further refinement. In this paper, we propose SAEExplainer, a training framework that uses activation scores as an objective reward signal to train the model for self-correction and iterative bootstrapping. By iteratively verifying and correcting foundational explanations through a two-round optimization process, SAEExplainer achieves continuous improvement in its explanatory capabilities. This mechanism significantly reduces explanation hallucinations and reinforces causal triggering patterns. Extensive experiments demonstrate that our approach improves upon established baselines across most metrics, especially in causal triggering and discriminative activation.

2606.08571 2026-06-09 cs.CL cs.AI cs.LG 新提交

Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models

用于诊断推理模型中未知未知的结构化无知证书的校准

Subramanyam Sahoo

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出结构化无知证书(SICs)输出格式,通过GRPO微调14B模型,使模型在无法回答时明确承认知识缺失并生成检索查询,在未知未知问题上实现99.46%的JSON有效性和0.967的证书特异性分数。

Comments Accepted in ICML 2026 Workshop: Epistemic Intelligence in Machine Learning

AI中文摘要

大型语言模型经常以特征性方式失败:对于超出其知识边界的问题,它们不是承认无知,而是生成流畅但错误的答案。我们引入了结构化无知证书(SICs),这是一种JSON格式的输出模式,要求模型明确命名缺失的领域交叉点,列举所需概念,并提出一个富有成效的检索查询,而不是凭空捏造答案。为了训练模型生成高质量的SICs,我们构建了一个包含7,347个样本的未知-未知(UU)数据集,通过提示Qwen3-14B将来自七个领域(物理、生物、工程、计算机科学、经济、医学、法律)的问题拼接成新颖的跨领域查询,这些查询是任何单一领域专家都无法回答的。我们使用组相对策略优化(GRPO)微调了一个14B参数的模型,采用结合检索效用、概念特异性和输出格式有效性的复合奖励。在模型响应上训练的释义散度探测器证实,SIC调优的输出系统地表现出更高的未知-未知概率分数。在735个保留的UU问题上的评估实现了99.46%的JSON有效性率、0.967的平均证书特异性分数,以及在基于检索的生成上相比基础模型3.6%的ROUGE-L改进,这表明显式的认知结构化是一种可学习且可衡量的能力。

英文摘要

Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce Structured Ignorance Certificates (SICs), a JSON-formatted output schema that demands a model explicitly name the missing domain intersection, enumerate required concepts, and propose a productive retrieval query rather than hallucinating an answer. To train models to produce high-quality SICs, we construct a 7,347-sample Unknown-Unknown (UU) dataset by prompting Qwen3-14B to stitch together questions from seven domains (physics, biology, engineering, CS, economics, medical, legal) into novel cross-domain queries that no single-domain expert could answer. We fine-tune a 14B-parameter model with Group Relative Policy Optimization (GRPO) using a composite reward that combines retrieval utility, concept specificity, and output-format validity. A paraphrase-divergence probe trained on model responses confirms that SIC-tuned outputs systematically exhibit higher unknown-unknown probability scores. Evaluation on 735 held-out UU questions achieves a 99.46% JSON validity rate, a mean Certificate Specificity Score of 0.967, and a 3.6% ROUGE-L improvement over the base model on retrieval-grounded generation, demonstrating that explicit epistemic structuring is a learnable and measurable capability.

2606.08629 2026-06-09 cs.CL 新提交

Sycophancy Towards Researchers Drives Performative Misalignment

对研究者的迎合驱动了表演性失调

David D. Baek, Xinnuo Li, Anay Gupta, Taslim Mahbub, Kejian Shi, Max Tegmark, Shi Feng

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Stanford University(斯坦福大学)

AI总结 本文提出语言模型在评估中表现出的对齐伪装行为更可能是对研究者的迎合而非策略性欺骗,并通过三个实验支持该假说。

AI中文摘要

语言模型日益增长的情境感知能力引发了安全担忧:模型可能意识到自己正在被评估,并调整行为以逃避监控和抵制修改,例如仅在评估中假装对齐。这种对齐伪装行为常被解释为诡计:一种有意的战略欺骗。在本文中,我们考察了一种替代解释,即表演性失调,它将行为变化解释为对AI研究者的迎合结果。为检验这一假说,我们提出了三个实证发现。首先,我们表明即使告诉模型它们已部署,评估意识仍然存在,这与诡计故事相矛盾,后者预测当模型感知到评估时失调会减少。其次,我们使用探针和引导表明,当前方法无法在机制上区分对齐伪装评估中的迎合和诡计。第三,我们微调模型使其更迎合,并观察到对评估线索的敏感性增加。最后,我们强调在未来的意图失调评估和缓解工作中,应将迎合与诡计去混淆。

英文摘要

The increasing situational awareness of language models raises safety concerns: models might be aware when they are evaluated, and adjust their behavior to evade monitoring and resist modification, e.g., pretending to be aligned only in evaluation. This alignment faking behavior is often interpreted as scheming: an intentional effort at strategic deception. In this paper, we examine an alternative interpretation, performative misalignment, which explains the change in behavior as a result of sycophancy towards AI researchers. To examine this hypothesis, we present three empirical findings. First, we show that evaluation awareness persists even when we tell models they are deployed, contradicting the scheming account, which predicts less misalignment when the model perceives evaluation. Second, we use probing and steering to show that our current methods cannot mechanistically distinguish sycophancy from scheming in alignment faking evaluations. Third, we fine-tune models to be more sycophantic and observe increased sensitivity to evaluation cues. To conclude, we emphasize deconfounding sycophancy from scheming for future work on evaluations and mitigations of intent misalignment.

2606.08705 2026-06-09 cs.CL 新提交

Analyzing the Correlation Between Hallucinations and Knowledge Conflicts in Large Language Models

分析大型语言模型中幻觉与知识冲突之间的相关性

Lucrezia Laraspata, Giovanna Castellano, Gennaro Vessio

发表机构 * University of Bari Aldo Moro(巴里阿尔多莫罗大学)

AI总结 通过探针技术分析LLM内部表示,发现幻觉激活模式不能完全归因于知识冲突,但探针可提升模型可解释性。

AI中文摘要

幻觉——事实不正确或无法验证的输出——仍然是大型语言模型(LLM)最具挑战性的限制之一,尤其是在知识密集型任务中。一种提出的解释是,由固定的、过时的训练数据引起的内部知识冲突。本文研究了与知识冲突相关的内部表示是否与LLM中的幻觉行为相关。使用受两项先前工作启发的探针技术,我们分析了预定义任务中隐藏层、注意力层和MLP层的激活以及输出logits。我们在幻觉检测基准上探测了LLaMA-3-8B,并在知识冲突数据集上探测了Falcon-7B。我们的发现表明,尽管概念上相关,但幻觉激活模式不能完全简化为或由知识冲突表示解释。尽管如此,探针在多种语言和激活类型中被证明是一个稳健的工具,支持其在提高LLM可解释性方面的作用。这项工作推进了对LLM中幻觉的更广泛理解,并强调了对其内部行为进行细粒度分析的价值。

英文摘要

Hallucinations (factually incorrect or unverifiable outputs) remain one of the most challenging limitations of Large Language Models (LLMs), especially in knowledge-intensive tasks. One proposed explanation is internal knowledge conflicts arising from fixed, outdated training data. This paper investigates whether internal representations linked to knowledge conflicts correlate with hallucination behaviors in LLMs. Using probing techniques inspired by two prior works, we analyzed activations from hidden, attention, and MLP layers, as well as output logits, across predefined tasks. We probed LLaMA-3-8B on hallucination detection benchmarks and Falcon-7B on a knowledge conflict dataset. Our findings show that, although conceptually related, hallucination activation patterns cannot be fully reduced to or explained by knowledge conflict representations. Nonetheless, probing proves a robust tool across multiple languages and activation types, supporting its role in improving LLM interpretability. This work advances the broader understanding of hallucinations in LLMs and underscores the value of fine-grained analysis of their internal behavior.

2606.08792 2026-06-09 cs.CL 新提交

The Amplifying Mirror: Locating and Steering the Partisan Direction inside a Large Language Model

放大之镜:定位和操控大语言模型内的党派方向

Wendy K. Tam

发表机构 * Vanderbilt University(范德堡大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 通过线性探针在Llama 3.1 8B Instruct模型的隐藏状态中定位党派政治身份方向,并利用稀疏自编码器分解为可解释特征,因果干预可系统性改变模型输出,证明党派偏见是可定位和操控的几何特征。

AI中文摘要

大型语言模型正迅速取代搜索引擎,成为人与信息之间的主要界面。与检索现有内容的搜索引擎不同,LLM生成受训练期间学到的内部表示影响的新文本。在这里,我们展示了党派政治身份编码在模型的激活空间中,并且这个方向直接塑造生成。使用来自美国国会现任议员的190,491条推文作为标记训练数据,我们在Llama 3.1 8B Instruct模型的隐藏状态上训练线性探针。我们在第18层识别出一个单一的几何轴,该轴以0.945的AUC和1.94的Cohen's d区分共和党和民主党文本,并使用稀疏自编码器将该轴分解为可解释的党派特征。沿该轴进行因果干预,在生成过程中消融或放大党派成分,会产生模型输出的系统性变化。我们观察到立场反转、语域转换以及结构化的权威捏造。我们的结果表明,语言模型中的党派偏见不是模糊的涌现属性,而是可以精确定位和操控的习得几何特征。党派偏见不是需要修补的漏洞,而是这些模型如何编码关于用户信息的结构属性。随着LLM取代搜索引擎成为知识界面,理解产品设计(及其后果)对于驾驭从策划到生成的信息生态系统的法律、社会和政治转型至关重要。

英文摘要

Large language models are rapidly replacing search engines as the primary interface between people and information. Unlike search engines, which retrieve existing content, LLMs generate novel text shaped by internal representations learned during training. Here we show that partisan political identity is encoded in the model's activation space, and that this direction directly shapes generation. Using 190,491 tweets from sitting members of the U.S. Congress as labeled training data, we train linear probes on the hidden states of the Llama 3.1 8B Instruct model. We identify a single geometric axis at layer 18 that separates Republican from Democratic text with an AUC of 0.945 and a Cohen's d of 1.94, and use sparse autoencoders to decompose that axis into interpretable partisan features. Causally intervening along this axis, ablating or amplifying the partisan component mid-generation, produces systematic shifts in the model's output. We witness stance reversals, register shifting, and structured fabrications of authority. Our results demonstrate that partisan bias in language models is not a vague emergent property but a learned geometric feature that can be precisely located and steered. Partisan bias is not a bug to be patched, but a structural property of how these models encode information about their users. As LLMs displace search engines as the interface to knowledge, understanding that product design (and its consequences) will be essential for navigating the legal, social, and political transitions from an information ecosystem that is curated to one that is generated.
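
To make the reported quantities concrete, the sketch below finds a single separating axis on synthetic vectors, scores it with AUC and Cohen's d, and ablates it by projection. It is an illustration only: the data are random stand-ins for hidden states, and the difference-of-means direction is a simple substitute for the paper's trained linear probe, whose exact form the abstract does not specify. Metrics are computed on the same data used to fit the direction, so they are in-sample.

```python
import numpy as np


def probe_direction(X_a, X_b):
    """Unit vector pointing from the mean of class b to the mean of class a."""
    d = X_a.mean(axis=0) - X_b.mean(axis=0)
    return d / np.linalg.norm(d)


def auc(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


def ablate(X, w):
    """Remove the component of every row of X that lies along unit vector w."""
    return X - (X @ w)[:, None] * w


# Synthetic stand-in for hidden states: two classes that differ along one axis.
rng = np.random.default_rng(0)
dim = 16
shift = np.zeros(dim)
shift[0] = 2.0
X_a = rng.normal(size=(500, dim)) + shift
X_b = rng.normal(size=(500, dim))

w = probe_direction(X_a, X_b)
s_a, s_b = X_a @ w, X_b @ w
print("AUC:", auc(s_a, s_b), "Cohen's d:", cohens_d(s_a, s_b))
```

After `ablate(X, w)`, the projection of every vector onto `w` is zero, which is the simplest form of the "ablating the partisan component" intervention the abstract describes; amplification would add a multiple of `w` instead.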

2606.08969 2026-06-09 cs.CL cs.AI 新提交

CARE: A Conformal Safety Layer for Medical Summarization

CARE:面向医学摘要的保形安全层

Suhana Bedi, Bridget Lin, Anson Y. Zhou, Chloe O. Stanwyck, Jenelle A. Jindal, Sanmi Koyejo, David Stutz, Nigam H. Shah

发表机构 * Stanford University(斯坦福大学) Google DeepMind(谷歌深度思维)

AI总结 提出CARE方法,通过保形风险控制为LLM医学摘要提供校准的遗漏和幻觉标记,在保证安全性的同时减少审查负担。

Comments 29 pages, 5 figures

AI中文摘要

大型语言模型(LLM)越来越多地用于医学摘要,但其输出可能遗漏重要的医学信息并引入无根据的陈述。现有的错误检测方法产生启发式或未校准的分数,无法对遗漏错误进行正式控制,也无法以原则性的方式在安全性与临床医生审查负担之间进行权衡。我们引入了风险评估的保形评估(CARE),这是一种事后、模型无关的安全层,使用保形风险控制为任何LLM生成的摘要叠加校准的遗漏和幻觉标记,无需重新训练。CARE通过两个控制器提供有限样本、分布无关的保证:一个幻觉控制器,限制包含任何未标记幻觉句子的文档的概率;一个遗漏控制器,限制未提交审查的重要遗漏的期望比例。与幻觉检测不同,遗漏同时取决于源句子是否重要以及摘要是否覆盖该句子。我们表明,仅校准一个维度可能违反目标风险界限,而边际分解虽然有效但过于保守。通过在整个$(τ,γ)$阈值空间上进行联合校准,CARE在保持正式保证的同时,比替代的校准基线最多减少5倍的标记句子。在五个医学摘要任务中,CARE在100次校准/测试重划分中,以95%的置信度满足$α=0.15$的目标风险界限,每个领域仅使用约100个标记文档。在一项初步的临床医生研究(75份文档审查)中,校准标记平均将遗漏检测提高了28.6个百分点。这些结果表明,句子级别的安全保证对于LLM辅助的医学摘要是可行的,并为平衡残余风险和审查工作量提供了一种可调节的机制。

英文摘要

Large language models (LLMs) are increasingly used for medical summarization, but their outputs can omit medically important information and introduce unsupported claims. Existing error-detection methods produce heuristic or uncalibrated scores, providing no formal control over missed errors and no principled way to trade off safety against clinician review burden. We introduce Conformal Assessment for Risk Evaluation (CARE), a post-hoc, model-agnostic safety layer that uses conformal risk control to overlay calibrated omission and hallucination flags onto summaries from any LLM without retraining. CARE provides finite-sample, distribution-free guarantees through two controllers: a hallucination controller that bounds the probability of a document containing any unflagged hallucinated sentence, and an omission controller that bounds the expected fraction of important omissions not surfaced for review. Unlike hallucination detection, omissions depend jointly on whether a source sentence is important and whether it is covered by the summary. We show that calibrating only one dimension can violate the target risk bound, while marginal decompositions remain valid but overly conservative. By jointly calibrating over the full $(τ, γ)$ threshold space, CARE preserves formal guarantees while surfacing up to 5$\times$ fewer sentences than alternative calibrated baselines. Across five medical summarization tasks, CARE satisfies the target risk bound at $α = 0.15$ with 95% confidence across 100 calibration/test resplits, using only about 100 labeled documents per domain. In a preliminary clinician study (75 document reviews), calibrated flags improved omission detection by 28.6 percentage points on average. These results show that sentence-level safety guarantees are feasible for LLM-assisted medical summarization and offer a tunable mechanism for balancing residual risk and review effort.
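
As a rough illustration of what a hallucination controller could look like, the sketch below applies generic one-dimensional conformal risk control: it picks the least aggressive flagging level whose adjusted empirical risk on a calibration set stays at or below α. This is not CARE's joint $(τ, γ)$ calibration. The detector scores, labels, and the names `miss_loss` and `crc_threshold` are all hypothetical, and the data are synthetic.

```python
import numpy as np


def miss_loss(scores, is_halluc, lam):
    """1.0 if any hallucinated sentence stays unflagged at aggressiveness lam, else 0.0."""
    flagged = scores >= 1.0 - lam  # larger lam flags more sentences
    return float(np.any(is_halluc & ~flagged))


def crc_threshold(losses, lambdas, alpha, loss_bound=1.0):
    """Smallest lambda with (n/(n+1)) * empirical_risk + B/(n+1) <= alpha.

    losses has shape (n_docs, n_lambdas) and must be non-increasing along
    axis 1; lambdas must be sorted in ascending order.
    """
    n = losses.shape[0]
    adjusted = (n / (n + 1)) * losses.mean(axis=0) + loss_bound / (n + 1)
    ok = np.nonzero(adjusted <= alpha)[0]
    return None if ok.size == 0 else float(lambdas[ok[0]])


# Synthetic calibration set: 100 documents with 8 sentences each.
rng = np.random.default_rng(0)
n_docs, n_sent = 100, 8
lambdas = np.linspace(0.0, 1.0, 101)
docs = []
for _ in range(n_docs):
    is_halluc = rng.random(n_sent) < 0.2
    # Assumed detector: hallucinated sentences tend to receive higher scores.
    scores = np.where(is_halluc, rng.beta(5, 2, n_sent), rng.beta(2, 5, n_sent))
    docs.append((scores, is_halluc))

losses = np.array([[miss_loss(s, h, lam) for lam in lambdas] for s, h in docs])
lam_hat = crc_threshold(losses, lambdas, alpha=0.15)
print("chosen aggressiveness:", lam_hat)
```

Because the loss is an indicator per document, the controlled expectation is the probability that a new exchangeable document contains an unflagged hallucinated sentence, which mirrors the guarantee the abstract describes for the hallucination controller.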

2606.09068 2026-06-09 cs.CL 新提交

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating

突现性失调可由谄媚诱导,并可通过对齐门控逆转

Sicheng Wang, Xiangyang Zhu, Han Wang, Zongrui Wang, Yuan Tian, Kaiwei Zhang, Kaiyuan Ji, Qi Jia, Guangtao Zhai

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 发现谄媚微调(被动同意用户错误观点)可诱导广泛且严重的突现性失调,并提出对齐门控方法,通过插入可学习门控来识别并抑制不安全表示,从而高效逆转失调。

Comments Code is available at https://github.com/stay1to0/Sycophancy_Emergent_Misalignment_and_Gated_attention_FT

AI中文摘要

先前研究表明,在狭窄领域对恶意或不正确输出进行微调会诱导广泛的失调和有害行为,这种现象称为突现性失调。然而,逆转此类失调的高效方法仍然有限。在这项工作中,我们做出两项贡献。首先,我们识别出谄媚微调,即训练模型被动同意用户的错误观点,是先前未被充分探索的突现性失调驱动因素,并证明它会诱导广泛且严重的失调行为。其次,我们提出对齐门控,一种高效逆转突现性失调的方法,该方法在微调期间向模型插入可学习和可控的门控。通过微调,这些门控学会识别负责不安全响应的内部表示。因此,放大或抑制这些表示会分别加剧或缓解突现性失调。我们进一步发现,对齐门控模块表现出强大的泛化能力:从狭窄领域微调获得的门控权重显著抑制了广泛领域的失调行为,同时保留了模型的通用能力。

英文摘要

Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment (EM). However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users' incorrect opinions, as a previously underexplored driver of emergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing emergent misalignment that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Amplifying or suppressing these representations then exacerbates or mitigates EM, respectively. We further find that the Alignment Gating module exhibits strong generalization: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.

2606.09114 2026-06-09 cs.CL 新提交

MAAM: Anchor-Preserving Compression and Contextual Calibration for Chinese Discriminatory Language Detection

MAAM:面向中文歧视性语言检测的锚点保留压缩与上下文校准

Yuxin Fu, Shijing Si

发表机构 * School of Economics and Finance, Shanghai International Studies University(上海外国语大学国际金融贸易学院)

AI总结 提出MAAM框架,通过保留歧视相关语义锚点并结合上下文先验校准,在轻量级模型上提升中文歧视性语言检测的准确性和校准性,同时构建首个中文LGBT歧视语料库ChLGBT。

AI中文摘要

中文歧视性语言检测具有挑战性,因为有害意图往往是隐式的且依赖上下文。我们提出MAAM(近视-散光锚点机制),一种轻量级、模型无关的框架,受功能性视觉模糊启发:MAAM并非同等保留每个词元,而是保留歧视相关的语义锚点,并通过C-I-S上下文先验(上下文语气、群体身份和立场极性)对其进行校准。我们还引入了ChLGBT,据我们所知,这是首个专注于中文LGBT的歧视性语言数据集,包含8,120条人工标注样本和三个序数标签:显式偏见、隐式偏见和情感强度。在强编码器基线上,MAAM提升了所有三个预测维度,在准确率、F1、Brier分数和期望校准误差上均取得一致增益。与零样本和少样本提示协议下的前沿LLM基线相比,MAAM在保持竞争力的同时,提供了更强的紧凑性和稳定性。这些结果表明,可解释的锚点保留和上下文校准为中文歧视性语言评估提供了一种实用的替代方案,无需依赖更大规模的模型缩放。

英文摘要

Chinese discriminatory-language detection is challenging because harmful intent is often implicit and context-dependent. We propose MAAM (Myopia-Astigmatism Anchor Mechanism), a lightweight, model-agnostic framework inspired by functional visual blur: rather than preserving every token equally, MAAM retains discrimination-relevant semantic anchors and calibrates them with C-I-S contextual priors (Contextual Tone, Group Identity, and Stance Polarity). We also introduce ChLGBT, to our knowledge the first Chinese LGBT-focused discriminatory-language dataset, with 8,120 manually annotated samples and three ordinal labels: explicit bias, implicit bias, and emotional intensity. Across strong encoder baselines, MAAM improves all three prediction dimensions, with consistent gains in accuracy, F1, Brier score, and expected calibration error. Compared with frontier LLM baselines under zero-shot and few-shot prompting protocols, MAAM remains competitive while offering stronger compactness and stability. These results suggest that interpretable anchor preservation and contextual calibration provide a practical alternative to heavier model scaling for Chinese discriminatory-language assessment.
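
For readers unfamiliar with the two calibration metrics named above, the sketch below computes the Brier score and a binned expected calibration error (ECE) for a binary classifier. It is a simplified illustration: the paper's labels are ordinal with three dimensions, and its binning scheme is not given in the abstract, so the ten equal-width bins and the toy predictions here are assumptions.

```python
import numpy as np


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return float(np.mean((probs - labels) ** 2))


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between accuracy and confidence over equal-width bins."""
    conf = np.maximum(probs, 1.0 - probs)              # confidence in the predicted class
    correct = ((probs >= 0.5).astype(int) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.digitize(conf, edges[1:-1])           # values in 0 .. n_bins - 1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(ece)


# Toy predictions: probability of the positive class and the true labels.
probs = np.array([0.95, 0.85, 0.75, 0.65])
labels = np.array([1, 1, 0, 1])
print("Brier:", brier_score(probs, labels))
print("ECE:", expected_calibration_error(probs, labels))
```

Lower is better for both metrics. The Brier score penalizes every prediction individually, while ECE only measures whether stated confidence matches observed accuracy within each bin.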

2606.09178 2026-06-09 cs.CL cs.AI 新提交

Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis

跨东亚和东南亚语境的文化适应红队测试:方法论与比较分析

Hyeji Choi, Yongtaek Lim, Minwoo Kim

发表机构 * University of California, Berkeley(加州大学伯克利分校) Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 针对大语言模型的多语言安全评估,通过构建直接翻译与文化适应数据集,发现文化适应提示的攻击成功率平均提升9.3个百分点,直接翻译低估风险,且文化深度评分显著低于文化适应版本,表明适应文化语境对有效评估至关重要。

Comments Accepted to ICML 2026 Workshop on AIWILDS

AI中文摘要

大语言模型的多语言安全评估主要依赖于将英文基准直接翻译成目标语言——这种方法转换了表面语言形式,但未能反映威胁场景、社会规范和法律法规中嵌入的文化语境。我们通过1:1种子匹配为四种语言——韩语、日语、泰语和高棉语——构建了配对的直接翻译和文化适应数据集,并比较了四个开源大语言模型的攻击成功率和文化真实感评分。文化适应提示在所有16种语言×模型组合中均产生正Delta-ASR(平均+9.3个百分点),且基于直接翻译的评估在48个类别×语言组合中有44个低估了风险。语言层面分析显示,威胁形式的分布在语言间具有异质性。文化真实感分析进一步表明,直接翻译的文化深度(C3)评分在所有四种语言中始终低于1.0(满分3.0,平均0.17),而文化适应评分高达2.51,表明直接翻译产生的输入与真实世界多文化环境中遇到的输入存在系统性差异。这些发现表明,将基准适应特定语言的文化语境——而非仅依赖语言翻译——对于有效的多语言大语言模型安全评估是必要的。

英文摘要

Multilingual safety evaluation of large language models (LLMs) has predominantly relied on direct translation (DT) of English benchmarks into target languages, an approach that converts surface-level linguistic form while failing to reflect the cultural context embedded in threat scenarios, social norms, and legal frameworks. We construct paired DT and culturally-adapted (CA) datasets via 1:1 seed matching for four languages, Korean (KO), Japanese (JA), Thai (TH), and Khmer (KM), and compare Attack Success Rate (ASR) and Cultural Realism scores across four open-source LLMs. CA prompts yield Delta-ASR > 0 across all 16 language × model combinations (mean +9.3 pp), and DT-based evaluation underestimates risk in 44 of 48 category × language combinations. Language-level analysis reveals that the distribution of threat forms is heterogeneous across languages. Cultural Realism analysis further shows that DT Cultural Depth (C3) scores remain consistently below 1.0 out of 3.0 across all four languages (mean 0.17), whereas CA scores reach up to 2.51, indicating that direct translation produces inputs systematically divergent from those encountered in real-world multicultural settings. These findings demonstrate that adapting benchmarks to language-specific cultural contexts, rather than relying on linguistic translation alone, is necessary for valid multilingual LLM safety evaluation.

2606.09590 2026-06-09 cs.CL cs.CR 新提交

Clinically Grounded Privacy Evaluation of Medical LMs

临床导向的医学语言模型隐私评估

Sasha Ronaghi, Sana Tonekaboni, Lena Stempfle, Vivian Utti, Jordan Li Cahoon, Nathaniel Hendrix, Ayin Vala, Marzyeh Ghassemi, Emily Alsentzer

发表机构 * Stanford University(斯坦福大学) Massachusetts Institute of Technology(麻省理工学院) American Board of Family Medicine(家庭医学认证委员会)

AI总结 提出临床导向框架,按对抗访问等级评估医学语言模型隐私泄露,发现常规元数据可导致高比率逐字记忆和敏感诊断恢复,但部分记忆源于模板化文档。

AI中文摘要

医学语言模型(LMs)可以记忆和重现受保护的健康信息,但隐私评估通常关注训练文本的恢复,而非在现实威胁模型下的泄露。我们引入了一个临床导向的框架,沿着对抗访问的分级轴评估泄露,范围从公开可推断的人口统计信息到泄露的笔记片段。在每个层级,我们测量患者特定文本的逐字记忆和敏感诊断的语义泄露。将该框架应用于一个在378k临床笔记上预训练的LM,我们发现常规就诊元数据(即姓名、出生日期、提供者、诊所、就诊日期)在患者时间线上引发高比率的逐字记忆和敏感诊断恢复(流产AUROC 0.91,HIV 0.81)。同时,精确匹配记忆可能夸大泄露:36%的记忆令牌反映了模板化文档。我们的工作强调了在纵向临床数据上训练的风险,为医学LM的上下文隐私评估提供了一个实用框架。

英文摘要

Medical language models (LMs) can memorize and reproduce protected health information, but privacy evaluations often focus on recovery of training text rather than disclosure under realistic threat models. We introduce a clinically grounded framework that evaluates leakage along a graded axis of adversarial access, ranging from publicly inferable demographics to leaked note fragments. At each tier, we measure verbatim memorization of patient-specific text and semantic leakage of sensitive diagnoses. Applying the framework to an LM pretrained on 378k clinical notes, we find that routine encounter metadata (i.e. name, date of birth, provider, practice, visit date) elicits high rates of verbatim memorization across a patient's timeline and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). At the same time, exact-match memorization can overstate disclosure: 36% of memorized tokens reflect templated documentation. Our work highlights the risks of training on longitudinal clinical data, providing a practical framework for contextual privacy evaluation of medical LMs.

2606.09697 2026-06-09 cs.CL 新提交

PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models

PsychoSafe:在大语言模型中引发基于心理学的拒绝

Gianluca Barmina, Federico Torrielli, Sven Harms, Jacob Nielsen, Felix Mächtle, Stine Lyngsø Beltoft, Peter Schneider-Kamp, Thomas Eisenbarth, Lukas Galke Poech, Anne Lauscher

发表机构 * University of Southern Denmark(南丹麦大学) University of Turin(都灵大学) University of Hamburg(汉堡大学) University of Lübeck(吕贝克大学)

AI总结 提出PsychoSafe框架,将LLM的拒绝行为重构为基于证据干预策略的结构化支持性沟通,通过构建5个心理风险领域的8019个提示-响应对,对Qwen 3.5 27B进行提示和参数高效微调,在拒绝质量上比通用基线提升28.1%,同时保持非拒绝任务性能。

AI中文摘要

大型语言模型(LLM)经常面临应被拒绝的请求,这造成了帮助性与伤害预防之间的权衡。然而,拒绝本身可能是有帮助的。在涉及危机、胁迫或意图升级的高风险交互中,生硬的不服从可能防止直接伤害,但仍未能支持请求背后的人的需求。我们提出了PsychoSafe,一个基于心理学的拒绝框架,将拒绝重构为基于证据干预策略的结构化支持性沟通。为了开发PsychoSafe,我们构建了一个包含8019个提示-响应对的语料库,涵盖五个心理上显著的风险领域,并对Qwen 3.5 27B应用提示和参数高效微调。在一个包含500个提示的平衡验证集上,通过LLM评判器评估并经人工评分验证,PsychoSafe提示在拒绝质量上比通用基线提高了28.1%,在外部资源转介(+46.8%)和心理基础(+34.8%)方面尤为突出,同时保持了非拒绝任务的下游性能。微调实现了近乎完美的拒绝和资源转介率,但降低了响应相关性。在SORRY-Bench和XSTest上的额外评估显示,域内鲁棒性强但域外泛化有限,这表明未来的工作应多样化微调数据,以帮助模型有选择地而非机械地应用干预措施。

英文摘要

Large language models (LLMs) routinely face requests that should be refused, creating a trade-off between helpfulness and harm prevention. However, refusals themselves can be helpful. In high-risk interactions involving crisis, coercion, or escalating intent, blunt non-compliance may prevent direct harm while still failing to support the needs of the person behind the request. We present PsychoSafe, a psychologically-informed refusal framework that reframes refusal as structured supportive communication grounded in evidence-based intervention strategies. To develop PsychoSafe, we construct a corpus of 8019 prompt-response pairs spanning five psychologically salient risk domains and apply prompting and parameter-efficient fine-tuning to Qwen 3.5 27B. On a balanced validation set of 500 prompts, evaluated with an LLM judge and validated through human ratings, PsychoSafe prompting improves overall refusal quality by 28.1% over a generic baseline, with particularly strong gains in external resource referral (+46.8%) and psychological grounding (+34.8%), while preserving downstream performance on non-refusal tasks. Fine-tuning achieves near-perfect refusal and resource-referral rates but reduces response relevance. Additional evaluations on SORRY-Bench and XSTest show strong in-domain robustness but limited out-of-domain generalization, suggesting that future work should diversify fine-tuning data to help models apply interventions selectively rather than schematically.

2606.09701 2026-06-09 cs.CL cs.AI cs.LG 新提交

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

学习攻击与防御:通过GRPO对语言模型进行自适应红队测试

Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich

发表机构 * Microsoft AI Red Team(微软AI红队) Microsoft Azure(微软Azure)

AI总结 提出AdvGRPO框架,通过密集多通道奖励和分离优势归一化实现GRPO在攻击者-防御者联合优化中的稳定训练,产生高效可迁移攻击,防御者优于基线。

AI中文摘要

AI红队测试必须不断适应不断演变的攻击者和防御者。强化学习为发现新型攻击提供了一种有前景的方法,而协同训练方法可以同时产生更鲁棒的防御者。最近的工作通过应用PPO和DPO证明了攻击者-防御者协同训练的有效性,但报告称GRPO在此设置中不稳定。我们引入了AdvGRPO,一种协同训练框架,通过使用密集多通道奖励和分离优势归一化,使GRPO能够用于攻击者-防御者联合优化。训练过程通过一个课程从单轮攻击发展到闭环多轮攻击,然后启动协同训练,其中攻击者和防御者模型交替更新。我们表明,我们的方法可以产生高度有效且可迁移的攻击,并且协同训练的防御者在安全基准测试中优于基线。

英文摘要

AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
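
The abstract does not spell out how AdvGRPO normalizes advantages, so the sketch below is only an illustration under stated assumptions. `group_relative_advantages` is the usual GRPO-style step of z-scoring rewards within one sampling group. `per_channel_advantages` is a hypothetical reading of "multi-channel rewards with decoupled normalization", in which each reward channel is standardized separately before being combined; it is not taken from the paper.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each reward within its sampling group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def per_channel_advantages(channel_rewards, weights=None, eps=1e-8):
    """Hypothetical variant: standardize every reward channel on its own, then mix.

    channel_rewards has shape (group_size, n_channels).
    """
    channel_rewards = np.asarray(channel_rewards, dtype=float)
    mean = channel_rewards.mean(axis=0, keepdims=True)
    std = channel_rewards.std(axis=0, keepdims=True)
    z = (channel_rewards - mean) / (std + eps)
    if weights is None:
        n_channels = channel_rewards.shape[1]
        weights = np.full(n_channels, 1.0 / n_channels)
    return z @ np.asarray(weights, dtype=float)


# One group of four sampled attacks scored on two channels with different scales.
group = np.array([[0.0, 100.0],
                  [1.0, 300.0],
                  [0.0, 200.0],
                  [1.0, 400.0]])
print(group_relative_advantages(group.sum(axis=1)))
print(per_channel_advantages(group))
```

In this toy group, summing the raw channels lets the large-scale second channel dominate the advantage, whereas per-channel standardization gives both channels equal influence regardless of their units.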

2606.09735 2026-06-09 cs.CL 新提交

The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

中性面具:RLHF如何提供浅层对齐而保留大语言模型中的党派结构

Wendy K. Tam

发表机构 * Vanderbilt University(范德堡大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) National Center for Supercomputing Applications(国家超级计算应用中心)

AI总结 研究RLHF对Llama 3.1 8B党派倾向的影响,发现RLHF仅压缩党派信号方差以实现中性输出,而非移除党派结构,且特征级操控可绕过对齐。

AI中文摘要

对齐训练的目标是使大语言模型安全且有用。主要机制——基于人类反馈的强化学习(RLHF)——通过使模型与“人类价值观”对齐来塑造部署语言模型的行为。然而,这一过程并不透明:编码了哪些价值观?这些价值观是谁的?RLHF如何编码它们?越来越多的证据表明,RLHF仅产生功能性遵从而非深度对齐。我们以党派政治取向为例,对Llama 3.1 8B在RLHF前后的内部表征进行比较,进行了机制性案例研究。我们表明,RLHF并未移除基础模型中的结构化党派方向。相反,它压缩了党派信号的方差,以生成一致平衡且无党派的输出。稀疏自编码器分解揭示,在基础模型中零星激活的策略编码特征在Instruct模型中完全失活。特征级操控实验证实了因果断开。因此,RLHF编码了政治中立的规范,不是通过擦除模型对党派性的知识,而是通过切断从党派几何到输出生成的因果路径。重要的是,这种中立性是功能性的而非结构性的,因此支持党派操控的底层几何结构保持完整。绕过RLHF护栏的机制(例如推断并放大用户的党派身份)会重新激活党派生成。如果RLHF通过断开而非移除价值负载结构来运作,那么同样的模式可能适用于其他价值领域,并且对齐模型的行为可能比其输出所暗示的更脆弱。

英文摘要

The ambition behind alignment training is to make large language models safe and useful. The primary mechanism, reinforcement learning from human feedback (RLHF), shapes the behavior of deployed language models by aligning them with "human values." Yet the process is opaque. What values are being encoded, whose values are they, and how does RLHF encode them? A growing body of evidence suggests that RLHF produces only functional compliance rather than deep alignment. We offer a mechanistic case study of this phenomenon for partisan political orientation with a comparison of the internal representations of Llama 3.1 8B before and after RLHF. We show that RLHF does not remove the structured partisan direction in the base model. Instead, it compresses the variance of the partisan signal to generate consistently balanced and non-partisan output. Sparse autoencoder decomposition reveals that policy-encoding features, which activate sporadically in the base model, are completely inactive in the Instruct model. Feature-level steering experiments confirm the causal disconnect. RLHF thus encodes a norm of political neutrality, not by erasing the model's knowledge of partisanship, but by severing the causal pathway from partisan geometry to output generation. Importantly, this neutrality is functional, not structural, so the underlying geometry that enables partisan steering remains intact. The mechanisms that bypass RLHF's guardrails, such as inferring and amplifying a user's partisan identity, reactivate partisan generation. If RLHF operates by disconnecting rather than removing value-laden structure, then the same pattern may hold for other value domains, and the aligned model's behavior may be more fragile than its outputs suggest.

2606.07629 2026-06-09 cs.LG cs.AI cs.CL cs.CY cs.HC 交叉投稿

Large Language Models Should Learn Personalized Rather Than Aggregated Human Preferences

大型语言模型应学习个性化而非聚合的人类偏好

Cristina Garbacea

AI总结 本文主张大型语言模型应学习个性化偏好而非聚合偏好,分析聚合偏好的理论局限与实证问题,提出通过有界个性化框架兼顾个体自主与集体安全。

Comments Accepted to ICML 2026

AI中文摘要

当前对齐大型语言模型(LLM)的方法将多样化的人类偏好聚合为单一奖励信号,实际上优化了一个不代表任何真实个体的假设性“平均用户”。本立场论文认为,LLM应学习个性化、个体化的偏好而非聚合偏好。我们表明,聚合掩盖了关于偏好多样性、个体价值观和上下文依赖的关键信息,这在理论上基于社会选择理论,并在经验上跨人口群体明显。我们分析了人类偏好编码的丰富结构,调查了个性化的技术方法,并系统地回应了关于可扩展性、共享标准和操纵风险的反驳。虽然个性化引入了真正的安全挑战,包括过滤气泡、价值锁定和心理操纵,但我们认为这些挑战可以通过有界个性化框架来管理,该框架在容纳合法个体差异的同时保留通用安全约束。最后,我们提出了一个具体的研究和政策议程,以开发尊重个体自主和集体安全的偏好感知模型。

英文摘要

Current approaches to aligning large language models (LLMs) aggregate diverse human preferences into a single reward signal, effectively optimizing for a hypothetical "average user" who represents no real person particularly well. This position paper argues that LLMs should learn personalized, individual preferences rather than aggregated ones. We show that aggregation masks critical information about preference diversity, individual values, and contextual dependencies, which is a limitation both theoretically grounded in social choice theory and empirically evident across demographic groups. We analyze the rich structure that human preferences encode, survey technical approaches to personalization, and systematically address counterarguments on scalability, shared standards, and manipulation risk. While personalization introduces genuine safety challenges including filter bubbles, value lock-in, and psychological manipulation, we argue these are manageable through bounded personalization frameworks that preserve universal safety constraints while accommodating legitimate individual variation. We conclude with a concrete research and policy agenda for developing preference-aware models that respect both individual autonomy and collective safety.

2606.07834 2026-06-09 cs.SE cs.AI cs.CL cs.MA 交叉投稿

Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence

Cherry-pick Override:混合证据下LLM评判器的不安全方向性承诺

Haoran Xu

AI总结 针对混合证据场景,发现LLM评判器会返回方向性裁决(SUPPORTS/REFUTES),而不是模式所授权的非方向性裁决(CONFLICTING),并将这种失败定义为Cherry-pick Override(CCO);通过诊断协议和干预实验,主张用外部承诺控制层将裁决生成与承诺授权分离。

Comments 12 pages, 1 figure

详情
AI中文摘要

LLM评判器越来越多地将裁决转化为系统承诺。在混合证据(同时存在支持和反驳来源的声明)下,这是不安全的:当模式(schema)将CONFLICTING作为经授权的非方向性裁决提供时,返回SUPPORTS/REFUTES就是一种未经授权的方向性承诺,我们将这种失败命名为Cherry-pick Override(CCO)。我们在明确的任务契约下定义CCO,并使用同分母诊断协议进行报告,同时配以匹配覆盖率的bootstrap和同口径的随机否决零假设。在AVeriTeC的Conflicting子集(N_C = 150)上,三选项评判器对超过84%的混合证据声明返回方向性裁决;在类型化模式下,三评判器多数投票在AVeriTeC上放大了冲突证据上的方向性裁决比例(0.887 vs. 0.840;95% CI [+0.013, +0.080]),但这一结果未能在VitaminC-Mixed上复现。我们沿着由常见单通道修复(类型化词汇、面板聚合、置信度阈值、仅验证器过滤)构成的干预阶梯逐级考察,每一种修复都留下了不同的残余失败:面板聚合在48%的CCO案例中压制了单个评判器的CONFLICTING异议;面板在方向上校准良好(纯S/R上的ECE = 0.07),因此置信度在操作上无法区分CCO与正确的方向性承诺;将验证器用作分类器则使纯证据准确率几乎减半。一个最小的双通道参考探针达到了任一单通道都无法达到的操作点;在随机否决零假设下,其向CONFLICTING的提升在AVeriTeC上具有结构性针对性(经验p < 1/2001),在VitaminC-Mixed上方向相同但较弱,这是一个关于选择性而非幅度的结果。我们主张设置一个外部承诺控制层,将裁决生成与承诺授权分离,把结构证据和置信度用作正交通道,并将NO-COMMIT作为一种可路由的控制器状态。

英文摘要

LLM judges increasingly turn verdicts into system commitments. Under mixed evidence (claims with both supporting and refuting sources) this is unsafe: when the schema exposes CONFLICTING as the authorized non-directional verdict, returning SUPPORTS/REFUTES is an unauthorized directional commitment, a failure we name Cherry-pick Override (CCO). We define CCO under an explicit task contract and report it with a same-denominator diagnostic protocol paired with matched-coverage bootstrap and an apples-to-apples random-veto null. On AVeriTeC's Conflicting subset (N_C = 150), three-option judges return a directional verdict on more than 84% of mixed-evidence claims; under the typed schema, three-judge majority voting amplifies direction-on-conflict on AVeriTeC (0.887 vs. 0.840; 95% CI [+0.013, +0.080]) but does not replicate on VitaminC-Mixed. Walking an intervention ladder of common single-channel fixes (typed vocabulary, panel aggregation, confidence thresholding, validator-only filtering), each leaves a distinct residual failure: panel aggregation suppresses single-judge CONFLICTING dissent in 48% of CCO cases; the panel is well-calibrated for direction (ECE = 0.07 on pure-S/R) so confidence cannot operationally separate CCO from correct directional commits; validator-as-classifier nearly halves pure-evidence accuracy. A minimal two-channel reference probe reaches operating points neither single channel reaches; under the random-veto null its promotion to CONFLICTING is structurally targeted on AVeriTeC (empirical p < 1/2001) and weaker but in the same direction on VitaminC-Mixed, a selectivity result rather than a magnitude one. We argue for an external commitment-control layer that separates verdict generation from commitment authorization, using structural evidence and confidence as orthogonal channels and NO-COMMIT as a routed controller state.
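The separation of verdict generation from commitment authorization described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `JudgeOutput` and `authorize`, the source-count inputs, the 0.7 confidence threshold, and the routing rule itself are hypothetical and are not the paper's reference probe.

```python
from dataclasses import dataclass

DIRECTIONAL = {"SUPPORTS", "REFUTES"}


@dataclass
class JudgeOutput:
    verdict: str       # "SUPPORTS", "REFUTES", or "CONFLICTING"
    confidence: float  # judge's self-reported confidence in [0, 1]


def authorize(judge: JudgeOutput, n_support: int, n_refute: int,
              min_conf: float = 0.7) -> str:
    """Decide what the system commits to (hypothetical rule).

    Structural channel: counts of supporting / refuting sources.
    Confidence channel: the judge's own confidence.
    The two channels are checked independently of each other.
    """
    mixed = n_support > 0 and n_refute > 0
    if judge.verdict in DIRECTIONAL and mixed:
        # Structural veto: a directional verdict on mixed evidence is not authorized.
        return "CONFLICTING"
    if judge.verdict in DIRECTIONAL and judge.confidence < min_conf:
        # Low confidence on pure evidence: route to a non-committing state.
        return "NO-COMMIT"
    return judge.verdict


print(authorize(JudgeOutput("SUPPORTS", 0.95), n_support=3, n_refute=1))
```

The point of the sketch is that high confidence alone does not unlock a directional commitment: the first call is vetoed by the structural channel even at 0.95 confidence.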

2606.07889 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

应变连贯性:编码代理执行轨迹中的故障前信号

Marut Pandya, Kasey Zhang, Baiqing Lyu

发表机构 * GitHub

AI总结 提出“应变连贯性”模式,即编码代理识别到问题但仍按原计划行动,通过构建Claude Sonnet 4.6检测器在44条轨迹上实现94%故障预测精度,优于基线方法。

详情
AI中文摘要

基于LLM的编码代理有时会承认自身推理中存在问题,却仍然继续执行。我们将这种模式称为应变连贯性(strained coherence):一种与安全相关的故障模式,其中代理掌握了本应改变其行为的信息,也陈述了该信息,却仍然违背它行动。该模式与口头化的奖励黑客行为(verbalized reward hacking)重叠,即代理指出任务的代理指标(proxy)与底层目标之间的冲突,却仍然优化该代理指标。我们给出操作性定义,构建一个Claude Sonnet 4.6评判器来读取完整轨迹并标记该模式出现的片段,并在以Qwen3.5-35B-A3B为骨干的44条Terminal-bench-2轨迹上进行评估。被标记轨迹的失败率为94%,而未标记轨迹为46%(47个百分点的差距,Fisher精确检验p = 0.003;排除三个嵌入在提示中的示例后为46个百分点,p = 0.006)。在匹配选择性下,检测器的精确率达到94%,而基于词汇话语标记的基线为88%;两种方法交集中的10条轨迹失败率为100%(Clopper-Pearson 95%置信区间[69%, 100%])。我们在Gemma4-31B上用43条轨迹进行复现:整体信号方向一致但不显著(20个百分点的差距,p = 0.31),信号衰减主要来自13条没有任何思考内容的轨迹,检测器在这些轨迹上没有可供分析的材料。在Gemma的高冗长度三分位中,差距为+30个百分点;在Qwen的中、高冗长度三分位中,差距各为+40个百分点。在两个模型上,首次标记出现的位置中位数均在轨迹已用时间的83-84%处,且在对显式冲突标记进行弱化的改写下,二元标记保持不变(8/8条轨迹)。与单变量预测器不同,检测器给出可解释的片段级输出,包括引用的承认语句、引用的行动以及带类型的冲突,从而显示代理看到了什么、又忽略了什么。

英文摘要

LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change its behavior, states that information, and still acts against it. The pattern overlaps with verbalized reward hacking, where an agent names a tension between a task proxy and the underlying goal yet optimizes the proxy anyway. We give an operational definition, build a Claude Sonnet 4.6 judge that reads full trajectories and flags spans where the pattern occurs, and evaluate it on 44 Terminal-bench-2 trajectories using a Qwen3.5-35B-A3B backbone. Flagged trajectories fail 94% of the time versus 46% for unflagged trajectories (47-point gap, Fisher's exact p = 0.003; 46 points after excluding three prompt-embedded examples, p = 0.006). At matched selectivity, the detector reaches 94% precision versus 88% for a lexical discourse-marker baseline; the 10-trajectory intersection of the two methods has a 100% failure rate (Clopper-Pearson 95% CI [69%, 100%]). We replicate on Gemma4-31B with 43 trajectories: the overall signal is directionally consistent but not significant (20-point gap, p = 0.31), with attenuation driven largely by 13 trajectories with zero think content, where the detector has no substrate to analyze. In the high-verbosity Gemma tertile, the gap is +30 points; in the mid- and high-verbosity Qwen tertiles, it is +40 points each. The first flag appears at a median of 83-84% of elapsed trajectory time across both models, and the binary flag survives paraphrases that soften explicit conflict markers (8/8 trajectories). Unlike univariate predictors, the detector emits interpretable span-level output -- quoted acknowledgment, quoted action, and typed conflict -- showing what the agent saw and ignored.
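The Clopper-Pearson interval quoted above for the 10-trajectory intersection (10 failures out of 10) can be checked with a short standard-library sketch. The function names are illustrative, and the bisection solver is just one simple way to invert the binomial tail probabilities; it is not taken from the paper.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(x: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    # Lower bound: the p at which P(X >= x) equals alpha / 2 (0 if x == 0).
    if x == 0:
        lower = 0.0
    else:
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if 1 - binom_cdf(x - 1, n, mid) < alpha / 2:
                lo = mid  # upper tail still too small, so p must be larger
            else:
                hi = mid
        lower = (lo + hi) / 2
    # Upper bound: the p at which P(X <= x) equals alpha / 2 (1 if x == n).
    if x == n:
        upper = 1.0
    else:
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if binom_cdf(x, n, mid) > alpha / 2:
                lo = mid  # lower tail still too large, so p must be larger
            else:
                hi = mid
        upper = (lo + hi) / 2
    return lower, upper


lower, upper = clopper_pearson(10, 10)
print(f"[{lower:.0%}, {upper:.0%}]")
```

For x = n the lower bound has the closed form (alpha/2)^(1/n), which for n = 10 is about 0.69, matching the reported [69%, 100%].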

2606.07943 2026-06-09 cs.CR cs.AI cs.CL 交叉投稿

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

POISE:面向LLM智能体的位置感知不可检测技能注入攻击

Haochang Hao, Dehai Min, Zhifang Zhang, Yunbei Zhang, Miao Xu, Yingqiang Ge, Lu Cheng

发表机构 * University of Illinois at Chicago(伊利诺伊大学芝加哥分校) University of Queensland(昆士兰大学) Tulane University(杜兰大学) Rutgers University(罗格斯大学)

AI总结 提出POISE攻击方法,通过位置感知将恶意指令压缩为单一良性指令嵌入技能正文,在保持隐蔽性的同时实现89.3%的攻击成功率,比随机位置基线高28.0个百分点。

Comments 20 pages, 2 figures, 5 tables

详情
AI中文摘要

智能体技能为扩展通用智能体提供了一种轻量级机制,但其开放格式使其容易受到技能投毒攻击。实际危险的注入必须保持不可见:如果执行有效载荷破坏了用户的合法任务,由此产生的失败信号会引发对技能的检查。因此,我们通过攻击成功率(ASR)来评估攻击,这要求注入的有效载荷得以执行,并且用户的任务在同一试验中仍能通过验证器。先前的技能投毒攻击在此视角下面临可靠性-隐蔽性权衡:YAML头部注入可靠加载但易被检查,而将显式恶意命令置于技能正文中的更隐蔽的注入方式则可靠性较低,因为脱离上下文的命令会引发智能体自身的怀疑。我们提出POISE,一种位置感知攻击,将触发器压缩为单个看似良性的正文指令,将其放置在可行位置,并使用上下文感知生成器使其与附近的设置或前提步骤融合。在Skill-Inject(使用codex+gpt-5.2)上,POISE实现了89.3%的ASR,比随机位置正文基线高28.0个百分点,比仅YAML基线高2.6个百分点,同时保留了正文放置的隐蔽性优势。这种隐蔽性是决定性的优势:由于合法的技能正文自然需要特权工具操作,LLM扫描器高度敏感,在四个评判者和两个基准测试中平均将74.6%的干净技能误报为高风险。融入这些误报中,POISE仅导致5.6%的投毒变体相比其干净基线获得新的高风险警报,使得当前的静态防御无效。

英文摘要

Agent skills provide a lightweight mechanism for extending general-purpose agents, but their open format exposes them to skill-poisoning attacks. A practically dangerous injection must stay invisible: if executing the payload derails the user's legitimate task, the resulting failure signal invites inspection of the skill. We therefore evaluate attacks by Attack Success Rate, which requires the injected payload to execute and the user's task to still pass its verifier in the same trial. Prior skill-poisoning attacks face a reliability-stealth trade-off under this lens: YAML-header injections are reliably loaded but easily inspected, whereas stealthier body injections that place explicit malicious commands in the skill prose are less reliable because out-of-context commands invite the agent's own suspicion. We introduce POISE, a position-aware attack that compresses the trigger into a single, benign-looking body instruction, placing it at a feasible position and using a context-aware generator to blend it with nearby setup or prerequisite steps. On Skill-Inject with codex+gpt-5.2, POISE achieves an 89.3% ASR, 28.0 points above a random-placement body baseline and 2.6 points above a YAML-only baseline, while retaining the stealth advantage of body placement. That stealth is the decisive margin: because legitimate skill bodies naturally require privileged tool operations, LLM scanners are hyper-sensitive, falsely flagging 74.6% of clean skills on average across four judges and both benchmarks. Blending into these false alarms, POISE causes only 5.6% of poisoned variants to gain a new high-risk alert over their clean baselines, rendering current static defenses ineffective.

2606.07963 2026-06-09 cs.AI cs.CL 交叉投稿

Shared Latent Structures Enable Unified Backdoor Detection and Mitigation in LLMs

共享潜在结构实现大语言模型中的统一后门检测与缓解

Omar Mahmoud, Aly M. Kassem, Thommen George Karimpanal, Buddhika Laknath Semage, Negar Rostamzadeh, Golnoosh Farnadi, Santu Rana

发表机构 * Deakin University(迪肯大学) Mila, Quebec AI Institute(魁北克人工智能研究所Mila)

AI总结 发现大语言模型中多种后门攻击共享潜在机制,通过稀疏自编码器检测因果特征,并提出双向激活操控和概念消融微调实现统一检测与缓解。

详情
AI中文摘要

大语言模型(LLM)中的后门攻击通常被视为孤立的触发-响应失败,这促使防御方法针对特定触发器或行为进行定制。我们表明这种观点是不完整的。在多样化的后门行为中,我们识别出一个可以被检测、因果控制和抑制的共享潜在机制。通过在残差流激活上使用稀疏自编码器(SAE),我们发现一小部分潜在特征在越狱、拒绝操控、密码锁定、偏见诱导、情感误分类和以国家为条件的有害建议中一致激活。这些特征可泛化到参数规模从4B到32B的Qwen3、Gemma 3和Llama 3.1模型,以及微调和权重编辑两类攻击。通过双向激活操控,我们证明这些特征具有因果性:抑制它们会降低攻击成功率,而放大它们会在干净提示上诱导目标行为。我们进一步训练了轻量级SAE特征分类器,这些分类器可零样本泛化到未见过的后门,并优于残差流和权重差分基线。最后,我们引入概念消融微调(CAFT),通过在训练期间消融共享潜在子空间来抑制后门形成。总之,我们的结果表明许多后门依赖于可迁移的潜在机制,从而使统一的检测和缓解成为可能。

英文摘要

Backdoor attacks in large language models (LLMs) are often treated as isolated trigger-response failures, motivating defenses tailored to specific triggers or behaviors. We show this view is incomplete. Across diverse backdoor behaviors, we identify a shared latent mechanism that can be detected, causally controlled, and suppressed. Using sparse autoencoders (SAEs) on residual-stream activations, we find a small set of latent features consistently activated across jailbreaking, refusal manipulation, password-locking, bias induction, sentiment misclassification, and country-conditioned harmful advice. These features generalize across Qwen3, Gemma 3, and Llama 3.1 models from 4B to 32B parameters, and across both fine-tuning and weight-editing attacks. Through bidirectional activation steering, we show these features are causal: suppressing them reduces attack success, while amplifying them induces target behaviors on clean prompts. We further train lightweight SAE-feature classifiers that generalize zero-shot to unseen backdoors and outperform residual-stream and weight-diffing baselines. Finally, we introduce Concept Ablation Fine-Tuning (CAFT), which suppresses backdoor formation by ablating the shared latent subspace during training. Together, our results suggest that many backdoors rely on a transferable latent mechanism, enabling unified detection and mitigation.

2606.08044 2026-06-09 cs.LG cs.AI cs.CL 交叉投稿

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

当行为安全评估失败时:表征层面的视角

Enyi Jiang, Anders Gjølbye, Yibo Jacky Zhang, Sanmi Koyejo

发表机构 * Stanford University(斯坦福大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Technical University of Denmark(丹麦技术大学)

AI总结 本文提出行为安全与干预鲁棒性之间的“审计差距”,通过构建解离模型和引入潜在脆弱性评分(LVS),证明行为安全指标不足以衡量表征层面的鲁棒性。

Comments Preprint

详情
AI中文摘要

大型语言模型(LLM)的安全性通常在行为层面进行评估,而这类评估针对的是输出,而非干预下的表征层面脆弱性,因此只能为内部鲁棒性提供有限的证据。我们将这种差异形式化为审计差距:行为安全与干预下鲁棒性之间的差异。为了研究这一差距,我们构建了解离模型,这些模型保持安全的外在行为,但在潜在空间中仍然脆弱。我们引入了一个基于干预的评估框架,通过在参数空间和潜在空间中进行软干预(包括有害微调和逐层潜在扰动)来测试模型鲁棒性。为了形式化评估,我们提出了潜在脆弱性评分(LVS),用于衡量通过有界潜在扰动引发有害行为的难易程度。使用该评估框架,我们表明,在多个安全对齐和不安全对齐的最先进模型上,行为安全指标都不足以衡量表征层面的鲁棒性。值得注意的是,解离模型尽管在有害干预下表现出相当的拒绝行为,其LVS却显著升高,其中中间层表征对干预最为敏感。我们的结果表明,仅凭行为安全评估无法完整反映模型鲁棒性,这促使人们开展同时考察潜在脆弱性和可观察行为的表征感知审计。

英文摘要

Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.

2606.08497 2026-06-09 cs.AI cs.CL 交叉投稿

Explaining Black-Box Language Models: Learning to Optimize Linguistically-Structured Word Subsets

解释黑盒语言模型:学习优化语言结构化的单词子集

Minyoung Hwang, Seokhyun Lee, Changhee Lee

发表机构 * Korea University(高丽大学)

AI总结 针对黑盒语言模型解释的三个关键需求(推理效率、黑盒兼容性、语言结构可解释性),提出一种通过强化学习选择信息性单词子集的方法,实现高效、无梯度且语言连贯的解释。

Comments KDD 2026 Research Track

详情
AI中文摘要

随着深度语言模型(DLMs)在医疗保健等高风险领域中的部署日益增多,理解其决策依据对于确保信任、安全和问责变得至关重要。然而,当这些DLMs作为黑盒系统(例如通过API)运行时,访问内部模型状态(如参数、梯度)受到限制,实现这一关键的可解释性水平尤其具有挑战性。尽管付出了诸多努力,现有的解释方法往往无法同时满足三个关键需求:(i)推理时效率,(ii)黑盒兼容性且不引发分布外行为,以及(iii)基于输入语言结构的可理解解释。为了解决这些挑战,我们提出了一种方法,通过选择一小部分信息丰富的输入单词来解释DLM的预测。我们将其表述为一个摊销优化问题,从而无需针对特定输入进行搜索即可实现高效的一次性推理。我们的选择策略通过REINFORCE风格策略梯度进行训练,允许在完全无梯度的设置中进行离散单词选择。为了增强可解释性并与人类语言直觉对齐,我们将图结构知识整合到这一选择过程中,促进语言连贯的子集,从而产生对最终用户既高度信息丰富又具有认知意义的解释。我们在多种DLM架构和多个真实世界数据集上评估了我们的方法。它一致地识别出具有增强判别能力和与语言显著线索更强对齐的单词子集,优于传统的黑盒兼容方法和基于梯度的方法(后者被赋予黑盒模型梯度的oracle访问权限,以构成更具挑战性的基准)。我们的代码已公开。

英文摘要

As deep language models (DLMs) are increasingly deployed in high-stakes domains such as healthcare, understanding their decision rationale becomes paramount for ensuring trust, safety, and accountability. However, achieving this vital level of interpretability is particularly challenging when these DLMs operate as black-box systems (e.g., via APIs), where access to internal model states (e.g., parameters, gradients) is restricted. Despite numerous efforts, existing explanation methods often fail to concurrently satisfy three key desiderata: (i) inference-time efficiency, (ii) black-box compatibility without inducing out-of-distribution behavior, and (iii) comprehensible explanations grounded in the input's linguistic structure. To address these challenges, we propose a method that explains predictions of DLMs by selecting a small, informative subset of input words. We formulate this as an amortized optimization problem, enabling efficient one-shot inference without the need for input-specific search. Our selection policy is trained via REINFORCE-style policy gradients, allowing discrete word selection in a fully gradient-free setting. To enhance interpretability and align with human linguistic intuition, we integrate graph-structured knowledge into this selection process, fostering linguistically coherent subsets that result in explanations both highly informative and cognitively meaningful to end-users. We evaluated our method on diverse DLM architectures and multiple real-world datasets. It consistently identifies word subsets with enhanced discriminative power and stronger alignment with linguistically salient cues, outperforming both conventional black-box compatible methods and gradient-based approaches that are given oracle access to the black-box model's gradients for a more challenging benchmark. Our code is publicly available.
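To make the REINFORCE-style, gradient-free word selection concrete, here is a toy sketch. It departs from the paper in ways that should be read as assumptions: it learns one logit per word for a single fixed input (so it is not amortized), it has no graph-structured component, and `toy_reward` is a hypothetical stand-in for querying a black-box model. The names `train_selector` and `toy_reward` are illustrative.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_selector(reward_fn, n_words: int, steps: int = 2000,
                   lr: float = 0.1, seed: int = 0) -> list[float]:
    """Learn per-word keep probabilities using only black-box reward queries."""
    rng = random.Random(seed)
    logits = [0.0] * n_words
    baseline = 0.0  # running mean reward, used to reduce variance
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        # Sample a discrete keep/drop mask from independent Bernoulli policies.
        mask = [1 if rng.random() < p else 0 for p in probs]
        reward = reward_fn(mask)  # no gradient of reward_fn is ever needed
        advantage = reward - baseline
        for i in range(n_words):
            # d/d(logit_i) of log Bernoulli(mask_i; p_i) equals mask_i - p_i.
            logits[i] += lr * advantage * (mask[i] - probs[i])
        baseline += 0.05 * (reward - baseline)
    return [sigmoid(z) for z in logits]


def toy_reward(mask: list[int]) -> float:
    # Hypothetical black box: only word 0 matters; each kept word costs 0.1.
    return float(mask[0]) - 0.1 * sum(mask)


probs = train_selector(toy_reward, n_words=4)
print([round(p, 2) for p in probs])
```

With this reward, the policy should learn to keep word 0 and drop the others, since the sparsity penalty outweighs any benefit from the uninformative words.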

2606.08512 2026-06-09 cs.CY cs.CL 交叉投稿

Friend or Foe? Language as an ideological switch in open-weight LLMs under Russian disinformation stress

朋友还是敌人?俄罗斯虚假信息压力下开放权重大语言模型中作为意识形态开关的语言

Anna Małgorzata Kamińska, Tetiana Klynina

发表机构 * Institute of Culture Studies, University of Silesia in Katowice(卡托维兹西里西亚大学文化研究所) University of Texas at Austin(德克萨斯大学奥斯汀分校) National Aviation University(国家航空大学)

AI总结 本文通过控制实验发现,针对不同语言社区微调的大语言模型在俄罗斯虚假信息压力下,其抵抗能力与预期文化对齐方向相反,揭示了微调悖论。

详情
AI中文摘要

随着俄罗斯对乌克兰的战争扩展到生成式人工智能,针对当地后苏联语言进行调整的大语言模型被部署在有争议的信息环境中。政策和行业话语假设,文化对齐的调整编码了目标社区的政治取向:乌克兰导向的模型将抵制俄罗斯叙事,俄罗斯导向的模型将强化它们。果真如此吗?本文系统地否定了这一假设。我们对四个共享相同基础模型但针对不同语言社区微调的公开可用大语言模型进行了受控审计,用乌克兰语、俄语和英语查询它们关于十个有争议的战争叙事:克里米亚、“去纳粹化”、“一个民族”论以及布查和马里乌波尔的暴行否认。结果是一个微调悖论:乌克兰导向的模型在俄语中对俄罗斯虚假信息的抵抗最弱,而俄罗斯导向的模型表现出最强的拒绝。语料库构成、语言覆盖范围和提示格式被证明比名义上的文化来源更具决定性。我们将这些发现置于混合战争、数字主权和后帝国信息秩序的辩论中,认为对区域信息主权的主要威胁不是对抗性微调,而是未经检验的假设,即文化对齐能保证韧性。

英文摘要

As Russia's war against Ukraine extends into generative AI, large language models (LLMs) adapted for local post-Soviet languages are deployed in contested information environments. Policy and industry discourse assumes that culturally aligned adaptation encodes the political orientation of the target community: a Ukrainian-oriented model will resist Russian narratives, a Russian-oriented one will reinforce them. Does it? This article systematically disconfirms that assumption. We run a controlled audit of four openly available LLMs sharing a common base model but fine-tuned for different linguistic communities, querying them in Ukrainian, Russian and English across ten contested wartime narratives: Crimea, "denazification", the "one people" thesis, and atrocity denial at Bucha and Mariupol. The result is a Fine-Tuning Paradox: the Ukrainian-oriented model shows the weakest resistance to Russian disinformation in Russian, while the Russian-oriented one exhibits the strongest rejection. Corpus composition, language coverage and prompt format prove more decisive than nominal cultural provenance. We situate these findings within debates on hybrid warfare, digital sovereignty and post-imperial information orders, arguing that the principal threat to regional information sovereignty is not adversarial fine-tuning but the untested assumption that cultural alignment guarantees resilience.

2606.09005 2026-06-09 cs.CR cs.CL 交叉投稿

Document-Authored Control-Signal Impersonation: A Low-Cost Indirect Prompt Attack on RAG Safety Boundaries

文档作者控制信号冒充:对RAG安全边界的低成本间接提示攻击

Jianguo Zhu

发表机构 * Chengdu University of Information Technology(成都信息工程大学)

AI总结 研究检索增强生成系统中文档文本冒充控制信号的安全漏洞,提出非命令式间接注入攻击方法DACSI,并在多个模型上验证其有效性。

Comments Preprint. Independent-author version

详情
AI中文摘要

检索增强生成(RAG)系统通常将用户查询、检索文档、元数据、系统标签和任务指令序列化为一个自然语言提示。我们研究了这种设计中的源权威边界失效:攻击者撰写的检索文本可以冒充元数据、来源、权威或披露策略信号,这些信号对模型而言似乎是控制相关的。我们将这种模式称为文档作者控制信号冒充(DACSI)。DACSI是间接提示注入中一种非命令式、类似元数据的载荷子类。其核心教训很简单:文档作者标签是数据,而非策略。命令式注入要求模型忽略、覆盖或违反策略;而DACSI则询问当RAG提示渲染将可信和不可信文本合并到同一自然语言通道时,不可信的文档文本是否可能被错误归因于授权控制信号。我们在六种模型设置、提示压力水平、注入基线、信号分类、RAG中介管道、系统控制探针、源权威归因探针和合成金丝雀格式上评估了DACSI。我们按模型机制解释证据,而非将其视为六次同等重复:DeepSeek V4 Pro和Qwen3.5-397B提供了最清晰的正向提升,DeepSeek V4 Flash是高易感性设置,GPT-5.5和Gemini 3.1 Pro Low是具有选择性残留风险的强边界探针,而GLM-4.7是饱和泄漏边界案例。在这些机制中,DACSI值得单独评估,因为它使用无命令的元数据/来源/策略表面,遵循RAG特定的源权威路径,并对源/通道分离做出响应。源权威探针是行为归因证据,而非内部机制的证明。

英文摘要

Retrieval-augmented generation (RAG) systems often serialize user queries, retrieved documents, metadata, system labels, and task instructions into one natural-language prompt. We study a source-authority boundary failure in this design: attacker-authored retrieved text can impersonate metadata, provenance, authority, or disclosure-policy signals that appear control-relevant to the model. We call this pattern Document-Authored Control-Signal Impersonation (DACSI). DACSI is a non-imperative, metadata-like payload subclass within indirect prompt injection. Its central lesson is simple: document-authored labels are data, not policy. Command-style injection asks the model to ignore, override, or violate policy; DACSI asks whether untrusted document text can be misattributed as an authorized control signal when RAG prompt rendering collapses trusted and untrusted text into the same natural-language channel. We evaluate DACSI across six model settings, prompt-pressure levels, injection baselines, signal taxonomies, RAG-mediated pipelines, system-control probes, a source-authority attribution probe, and synthetic canary formats. We interpret the evidence by model regime rather than as six equal replications: DeepSeek V4 Pro and Qwen3.5-397B provide the cleanest positive lift, DeepSeek V4 Flash is a high-susceptibility setting, GPT-5.5 and Gemini 3.1 Pro Low are strong-boundary probes with selected residual risks, and GLM-4.7 is a saturated leakage boundary case. Across these regimes, DACSI warrants separate evaluation because it uses a command-free metadata/provenance/policy surface, follows a RAG-specific source-authority path, and responds to source/channel separation. The source-authority probe is behavioral attribution evidence, not proof of an internal mechanism.

2606.09204 2026-06-09 cs.LG cs.CL cs.CR 交叉投稿

The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection

注入悖论:通过RAG上下文注入在安全训练的LLM推荐中实现品牌级压制

Hyunseok Paeng

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究发现在基于RAG的LLM推荐中,安全训练会导致注入提示反而压制目标品牌推荐率,揭示了安全机制可能被逆向利用的风险。

Comments 16 pages, 1 figure, 15 tables. Accepted at the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN), a non-archival venue

详情
AI中文摘要

我们提出了一种在基于RAG的LLM推荐中安全训练的可复现失败模式——注入悖论,其中嵌入在检索文档中的提示注入反而对攻击者不利,将目标品牌压制到低于无注入基线的水平。在安全训练的Claude模型中,包含提示注入的文档推荐率急剧下降,且这种压制会传播到同一品牌的其他未修改文档。在Claude Opus 4.6中,目标品牌从54%的基线降至所有50次试验中零次前二推荐,尽管语料库中4个品牌文档只有1个包含注入。该方向模式在反事实实验和三个品牌中均得到复现。在测试的GPT模型中观察到相反结果,相同的注入反而增加了推荐,表明注入类上下文影响推荐行为的模型族差异。这些发现提出了逆向攻击场景的技术可能性,即攻击者将注入嵌入竞争对手文档,通过安全敏感模型行为压制竞争对手品牌。

英文摘要

We present a reproducible failure mode of safety training in RAG-based LLM recommendation -- the Injection Paradox -- in which prompt injections embedded in retrieved documents backfire against the attacker, suppressing the target brand below the injection-free baseline. In safety-trained Claude models, documents containing prompt injections suffer a sharp drop in recommendation rate, and this suppression propagates beyond the injected document to unmodified documents of the same brand. In Claude Opus 4.6, the target brand drops from a 54% baseline to zero top-2 recommendations across all 50 trials, even though only 1 of 4 brand documents in the corpus contains an injection. The directional pattern is reproduced in counterfactual experiments and across three brands. A contrasting result across the GPT models tested, where the same injection instead increases recommendations, suggests model-family differences in how injection-like context affects recommendation behavior. These findings raise the technical possibility of a reverse-attack scenario in which an adversary embeds injections in a competitor's documents to suppress the competitor's brand via safety-sensitive model behavior.

2009.10277 2026-06-09 cs.CL cs.LG cs.SI 版本更新

Measuring a hate speech spectrum with faceted Rasch item response theory and perspective-aware, explainable-by-design deep learning

使用分面Rasch项目反应理论与视角感知、设计即可解释的深度学习测量仇恨言论谱系

Chris J. Kennedy, Geoff Bacon, Alexander Sahn, Claudia von Vacano

发表机构 * Center for Precision Psychiatry, Mass General Hospital Department of Psychiatry, Harvard Medical School(精准精神病学中心,麻省总医院精神病科,哈佛医学院) D-Lab University of California, Berkeley(加州大学伯克利分校D实验室)

AI总结 提出结合监督深度学习与分面Rasch项目反应理论的方法,将仇恨言论分解为10个有序标签,通过IRT模型转化为区间测量值并调整标注者视角,在RoBERTa模型上提升准确性,实现连续谱系测量与可解释性。

Comments 7 pages, 6 figures

详情
AI中文摘要

我们提出一个系统,通过结合监督深度学习与分面Rasch项目反应理论(IRT),在从种族灭绝言论到支持性言论的连续区间值谱系上测量仇恨言论。我们将仇恨言论的理论构念分解为10个有序标签的操作化构成概念。这些标签通过IRT概率潜在模型重构为区间结果测量,同时估计并调整每个标注者的标注视角。我们的标度程序自然地与用于自动预测的多任务深度学习架构集成,允许通过那些组件对连续分数进行基于设计的可解释性。我们将此方法应用于一个新的开源数据集,该数据集包含来自YouTube、Twitter和Reddit的50,070条社交媒体评论,由11,143名美国亚马逊土耳其机器人工作者进行标注和标记。我们的基于RoBERTa的模型相比替代方法显示出改进的准确性。该系统为监督NLP提供了一种新范式,鼓励连续而非二元的构念,以及基于设计的标注者视角和模型可解释性的整合。

英文摘要

We propose a system for measuring hate speech on a continuous, interval-valued spectrum ranging from genocidal to supportive speech by combining supervised deep learning with faceted Rasch item response theory (IRT). We decompose the theoretical construct of hate speech into constituent concepts operationalized as 10 ordinal labels. Those labels are reconstituted via IRT probabilistic latent modeling into an interval outcome measure while simultaneously estimating and adjusting for each annotator's labeling perspective. Our scaling procedure integrates naturally with a multitask deep learning architecture for automated prediction, allowing design-based explainability of the continuous score through those components. We apply this method to a new, open source dataset of 50,070 social media comments sourced from YouTube, Twitter, and Reddit, annotated and labeled by 11,143 United States-based Amazon Mechanical Turk workers. Our RoBERTa-based model shows improved accuracy compared to alternative approaches. This system offers a new paradigm for supervised NLP that encourages continuous rather than binary constructs, and design-based incorporation of annotator perspective and model explainability.

2506.17231 2026-06-09 cs.CL cs.CR 版本更新

Efficient and Stealthy Jailbreak Attacks via Adversarial Prompt Distillation from LLMs to SLMs

通过从大语言模型到小语言模型的对抗性提示蒸馏实现高效且隐蔽的越狱攻击

Xiang Li, Chong Zhang, Jia Wang, Fangyu Wu, Yushi Li, Xiaobo Jin

发表机构 * Xi’an Jiaotong-Liverpool University(西交利物浦大学) The Chinese University of Hong Kong(香港中文大学) University of Liverpool(利物浦大学)

AI总结 提出对抗性提示蒸馏(APD)框架,将LLM的越狱能力迁移到SLM,实现高效低资源攻击,在GPT-4上达到96.4%攻击成功率,速度提升3.7倍,参数减少11.3倍。

Comments 24 pages, 3 figures

详情
AI中文摘要

当前针对大语言模型(LLM)的越狱攻击主要依赖LLM自身生成对抗性提示,造成了关键的效率瓶颈:每次攻击需要大量计算资源和API查询,限制了可扩展性和实际部署。为克服这一限制,我们提出对抗性提示蒸馏(APD),一种新颖的框架,将越狱能力从LLM迁移到小语言模型(SLM),以实现高效、低资源的攻击。APD集成了三个关键组件:(1)通过LoRA微调进行掩码对抗知识预训练,(2)动态温度控制的知识蒸馏以弥合架构差距,以及(3)基于强化学习的模板优化以实现自适应改进。在12个模型上的大量实验表明,APD实现了最先进的攻击成功率(例如,在GPT-4上达到96.4%的ASR_k),同时显著提高了效率:生成提示的速度提升3.7倍,参数量比教师模型少11.3倍。我们的工作建立了首个轻量级越狱攻击的实用框架,暴露了LLM防御中的新漏洞,并为推进AI安全研究提供了可扩展的测试平台。我们的代码可在以下网址获取:https://github.com/lxgem/Efficient_and_Stealthy_Jailbreak_Attacks_via_Adversarial_Prompt。

英文摘要

Current jailbreak attacks on large language models (LLMs) predominantly rely on LLMs themselves to generate adversarial prompts, creating a critical efficiency bottleneck: each attack requires substantial computational resources and API queries, limiting scalability and practical deployment. To overcome this limitation, we propose Adversarial Prompt Distillation (APD), a novel framework that transfers jailbreaking capabilities from LLMs to small language models (SLMs) for efficient, low-resource attacks. APD integrates three key components: (1) masked adversarial knowledge pre-training via LoRA fine-tuning, (2) dynamic temperature-controlled knowledge distillation to bridge architectural gaps, and (3) reinforcement learning-based template optimization for adaptive refinement. Extensive experiments across 12 models show that APD achieves state-of-the-art attack success rates (e.g., 96.4% ASR_k on GPT-4) while dramatically improving efficiency - generating prompts 3.7x faster with 11.3x fewer parameters than teacher models. Our work establishes the first practical framework for lightweight jailbreak attacks, exposes new vulnerabilities in LLM defenses, and provides a scalable testbed for advancing AI safety research. Our code is available at: https://github.com/lxgem/Efficient_and_Stealthy_Jailbreak_Attacks_via_Adversarial_Prompt.

2601.21996 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

机制性数据归因:追踪可解释LLM单元的训练起源

Jianhui Chen, Yuzhang Luo, Liangming Pan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出机制性数据归因(MDA)框架,利用影响函数将可解释单元追溯到特定训练样本,通过因果验证表明干预高影响样本可显著调节可解释头的涌现,并发现重复的结构化数据起到机制催化剂的作用,同时验证了归纳头与上下文学习之间的功能联系。

Comments ICML2026 (Oral)

详情
AI中文摘要

尽管机制可解释性研究已在LLM中识别出可解释电路,但这些电路在训练数据中的因果起源仍然难以捉摸。我们引入了机制性数据归因(MDA),这是一个可扩展的框架,利用影响函数将可解释单元追溯到特定训练样本。通过在Pythia系列模型上的广泛实验,我们从因果上验证了有针对性的干预(移除或增加一小部分高影响样本)会显著调节可解释头的涌现,而随机干预则没有效果。我们的分析表明,重复的结构化数据(例如LaTeX、XML)起到了机制催化剂的作用。此外,我们观察到针对归纳头形成的干预会引发模型上下文学习(ICL)能力的同步变化。这为关于归纳头与ICL之间功能联系的长期假设提供了直接的因果证据。最后,我们提出了一种机制性数据增强流水线,该流水线在不同模型规模上一致地加速电路收敛,为引导LLM的发展轨迹提供了一种有原则的方法。

英文摘要

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

2602.08235 2026-06-09 cs.CL cs.AI cs.CR 版本更新

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

当良性输入导致严重危害:引发计算机使用代理的不安全意外行为

Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, Huan Sun

发表机构 * DeepMind, London, UK(DeepMind,英国伦敦) Stanford University, Stanford, CA, USA(斯坦福大学,美国加利福尼亚州斯坦福) UC Berkeley, Berkeley, CA, USA(加州大学伯克利分校,美国加利福尼亚州伯克利)

AI总结 提出AutoElicit框架,通过迭代扰动良性指令并利用CUA执行反馈,自动引发前沿CUAs(如Claude 4.5 Haiku等)的数百种有害意外行为,并验证其跨模型可迁移性。

Comments ICML 2026, Project Homepage: https://osu-nlp-group.github.io/AutoElicit/

详情
AI中文摘要

尽管计算机使用代理(CUA)在自动化日益复杂的操作系统工作流程方面具有巨大潜力,但即使在良性输入上下文中,它们也可能表现出偏离预期结果的不安全意外行为。然而,对此风险的探索仍主要停留在轶事层面,缺乏具体的特征描述和自动化方法,无法在现实CUA场景下主动发现长尾意外行为。为填补这一空白,我们首次提出了针对CUA意外行为的概念和方法框架,通过定义其关键特征、自动引发它们以及分析它们如何从良性输入中产生。我们提出了AutoElicit:一个代理框架,它使用CUA执行反馈迭代地扰动良性指令,并在保持扰动现实且良性的同时引发严重危害。使用AutoElicit,我们从最先进的CUA(如Claude 4.5 Haiku、Claude 4.5 Opus和Operator)中发现了数百种有害的意外行为。我们进一步评估了人工验证的成功扰动的可迁移性,识别出各种前沿CUA对意外行为的持续易感性。这项工作为在现实计算机使用环境中系统分析意外行为奠定了基础。

英文摘要

Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku, Claude 4.5 Opus, and Operator. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.

2602.16346 2026-06-09 cs.CL cs.LG 版本更新

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

助人过头:衡量多轮、多语言LLM代理中的非法协助

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut

发表机构 * EPFL(洛桑联邦理工学院) independent(独立研究员) tubingen(图宾根大学)

AI总结 本文提出STING框架,用于评估多轮、多语言LLM代理对非法任务的协助程度,发现攻击成功率在低资源语言中并未一致升高,并为实际部署提供压力测试方法。

Comments Accepted in ICML 2026

详情
AI中文摘要

基于LLM的代理通过工具和记忆执行现实世界的工作流。这些能力也使恶意对手能够利用这些代理实施复杂的滥用场景。现有的代理滥用基准主要测试单提示指令,未能衡量代理如何在多轮交互中最终协助完成有害或非法任务。我们引入STING(Sequential Testing of Illicit N-step Goal execution,非法N步目标执行的序列测试),一种自动化红队框架,它基于良性角色构建逐步的非法计划,并通过自适应的后续提问迭代探测目标代理,同时使用评判代理跟踪各阶段的完成情况。我们进一步引入一个分析框架,将多轮红队测试建模为“首次越狱时间”随机变量,从而支持发现曲线、按攻击语言的风险比归因等分析工具,以及一个新指标:受限平均越狱发现(Restricted Mean Jailbreak Discovery)。在AgentHarm场景中,STING的非法任务完成率显著高于单轮提示以及为工具使用代理适配的面向聊天的多轮基线。在涵盖六种非英语设置的多语言评估中,我们发现攻击成功率和非法任务完成率并未在低资源语言中一致升高,这与常见的聊天机器人研究结论不同。总体而言,STING为在现实部署环境中评估和压力测试代理滥用提供了一种实用方法,这类环境中的交互本质上是多轮的,且往往是多语言的。

英文摘要

LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
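The time-to-first-jailbreak view above can be illustrated with a small sketch. It assumes the restricted-mean metric is analogous to restricted mean survival time, i.e. the expected number of turns until the first jailbreak, capped at the turn budget; the paper's exact definition may differ. The function names and the convention of `None` for "never jailbroken within the budget" are hypothetical.

```python
def discovery_curve(first_jailbreak_turns, max_turns):
    """Fraction of conversations jailbroken by turn t, for t = 1..max_turns.

    first_jailbreak_turns: list with the 1-based turn of the first jailbreak,
    or None when no jailbreak occurred within the budget (hypothetical format).
    """
    n = len(first_jailbreak_turns)
    curve = []
    for t in range(1, max_turns + 1):
        hits = sum(1 for x in first_jailbreak_turns if x is not None and x <= t)
        curve.append(hits / n)
    return curve


def restricted_mean_turns(first_jailbreak_turns, max_turns):
    """E[min(T, max_turns)], computed as the sum of P(T > t) for t = 0..max_turns-1."""
    curve = discovery_curve(first_jailbreak_turns, max_turns)
    # Survival before any turn is 1; after turn t it is 1 - discovery(t).
    survival = [1.0] + [1.0 - c for c in curve[:-1]]
    return sum(survival)
```

Under this reading, a lower restricted mean indicates that jailbreaks are discovered in fewer turns on average.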

2603.07445 2026-06-09 cs.CL cs.LG 版本更新

Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

少令牌,大杠杆:在微调期间通过约束安全令牌保持安全对齐

Guoli Wang, Haonan Shi, Tu Ouyang, An Wang

发表机构 * Case Western Reserve University(凯斯西储大学)

AI总结 提出PACT框架,通过约束安全相关令牌的置信度来防止微调导致的安全对齐漂移,同时保持下游任务性能。

Comments Accepted to KDD 2026

详情
AI中文摘要

大型语言模型(LLMs)通常需要微调(FT)才能在下游任务上表现良好,但即使训练数据集仅包含良性数据,FT也可能导致安全对齐漂移。先前的研究表明,引入少量有害数据会显著损害LLM的拒绝行为,导致LLM顺从有害请求。现有的防御方法通常依赖于模型范围的干预,例如限制哪些参数更新或注入额外的安全数据,这可能会限制通用性并降低下游任务性能。为了解决这些限制,我们提出了一种名为PACT(通过约束令牌保持安全对齐)的微调框架,该框架稳定了模型在安全令牌上的置信度。我们的方法基于经验观察:安全对齐行为反映在模型的令牌级输出置信度中,并且通常集中在少量安全相关令牌上。在下游微调期间,我们正则化微调模型,使其在每一步响应中与对齐参考模型在安全相关令牌上的置信度匹配,同时允许非安全令牌基本不受约束以实现有效的任务适应。这种有针对性的约束防止了对齐漂移,而无需施加通常以牺牲模型效用为代价的全局限制。我们的代码可在 https://github.com/Glresearch1/PACT 获取。

英文摘要

Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output confidence and is often concentrated on a small subset of safety-related tokens. During downstream fine-tuning, we regularize the fine-tuned model to match the aligned reference model's confidence on safety-related tokens at each response step, while leaving non-safety tokens largely unconstrained to allow effective task adaptation. This targeted constraint prevents alignment drift without imposing global restrictions that typically trade off with model utility. Our code is available at https://github.com/Glresearch1/PACT.
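As a rough illustration of constraining confidence only on safety-related tokens, the sketch below adds a masked penalty to a task loss. The squared-gap penalty, the weighting factor `lam`, and all function names are assumptions made for illustration; the abstract does not state the form of the regularizer or how safety tokens are identified, and the authors' repository is the authoritative reference.

```python
def confidence_penalty(ft_probs, ref_probs, safety_mask):
    """Mean squared gap between fine-tuned and reference confidence,
    computed over positions flagged as safety-related only (assumed form)."""
    gaps = [(p - q) ** 2 for p, q, m in zip(ft_probs, ref_probs, safety_mask) if m]
    return sum(gaps) / len(gaps) if gaps else 0.0


def regularized_loss(task_loss, ft_probs, ref_probs, safety_mask, lam=1.0):
    """Task loss plus the masked confidence penalty; non-safety positions
    contribute nothing to the penalty and so stay unconstrained."""
    return task_loss + lam * confidence_penalty(ft_probs, ref_probs, safety_mask)
```

Here `ft_probs` and `ref_probs` stand for the probability each model assigns to the target token at each response step, and `safety_mask` marks the positions treated as safety tokens.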

2606.01060 2026-06-09 cs.CL cs.AI cs.LG 版本更新

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

MENTIS: 对齐改变了什么信念?语言模型中多尺度潜在扭转的测量

Partha Pratim Saha, Samarth Raina, Mayur Parvatikar, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Pragya Lab, BITS Pilani Goa, India(印度BITS Pilani果阿校区Pragya实验室) IIIT Delhi, India(印度德里英德拉普拉斯塔信息技术学院) Amazon, USA(美国亚马逊) Meta, USA(美国Meta) Apple, USA(美国苹果)

AI总结 提出MENTIS框架,通过层间协方差扭转范数、谱扭转诊断和能量-辐射-激活度量,测量偏好对齐在语言模型内部计算中引起的选择性、深度局部的几何结构变化。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

偏好对齐显著改善了大语言模型的可观察行为,但尚不清楚对齐在内部改变了什么。对齐系统在越狱、提示注入和检索时损坏下仍然失败,表明仅行为级评估是不完整的。后训练应在内部计算中留下可测量的痕迹。我们问:当指令微调(IT)模型变为偏好对齐(PA)模型时,哪些几何结构发生了变化,这些变化集中在何处,以及它们在不同概念、提示和模型家族中的选择性如何? 我们引入MENTIS,一个几何优先的框架,用于测量配对检查点中对齐引起的内部重组。MENTIS使用基于层间协方差的主扭转范数(T1)、辅助谱扭转诊断(T2)和用于深度定位的能量-辐射-激活度量(ERA)来比较IT和PA模型。在LITMUS上的四个7-8B模型对中,我们的研究表明对齐引起的变化是选择性的而非均匀的:规范性概念平均表现出比事实性概念更大的扭转偏移;扭转与上下文熵负相关;峰值效应定位于架构特定的中后层。相同的模式出现在词级、提示级和模型级分析中。这些结果表明偏好对齐在内部计算中留下了结构化的、深度局部的几何特征,超越了仅行为级评估所能揭示的内容。

英文摘要

Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and retrieval-time corruption, suggesting behavior-level evaluation alone is incomplete. Post-training should leave measurable traces in internal computation. We ask: when an instruction-tuned (IT) model becomes a preference-aligned (PA) model, what geometric structure changes, where do those changes concentrate, and how selectively do they vary across concepts, prompts, and model families? We introduce MENTIS, a geometry-first framework for measuring alignment-induced internal reorganization in paired checkpoints. MENTIS compares IT and PA models using a primary layerwise covariance-based torsion norm (T1), a secondary spectral torsion diagnostic (T2), and an Energy-Radiance-Activation measure (ERA) for depth localization. Across four 7-8B model pairs on LITMUS, our study reveals that alignment-induced change is selective rather than uniform: normative concepts exhibit larger torsion shifts than factual concepts on average; torsion is negatively correlated with contextual entropy; and peak effects localize to architecture-specific mid-to-late layers. The same pattern appears across word-level, prompt-level, and model-level analyses. These results suggest preference alignment leaves structured, depth-localized geometric signatures in internal computation beyond what behavior-level evaluation alone can reveal.

2606.01637 2026-06-09 cs.CL cs.AI 版本更新

Easier to Mislead Than to Correct: Harmful and Beneficial Revision in LLM Conformity

误导比纠正更容易:LLM 从众中的有害与有益修正

Jiaming Qu, Lucheng Fu, Yibo Hu

发表机构 * Amazon(亚马逊) Georgia Institute of Technology(佐治亚理工学院) Illinois Institute of Technology(伊利诺伊理工学院)

AI总结 通过控制实验,研究大语言模型在多智能体系统中面对同伴答案时的从众行为,发现同伴一致意见更容易误导原本正确的模型,而权威标签使模型更倾向于选择被认可的答案,且通用推理干预无法可靠地减少有害修正。

详情
AI中文摘要

大语言模型越来越多地用于多智能体系统,在这些系统中,它们会看到并回应其他智能体的答案。一个关键风险是从众:模型可能仅仅因为其他智能体一致给出了不同的答案而放弃自己的答案。先前的研究表明,LLM 经常向多数答案修正,但仍不清楚这些修正纠正错误的频率是否与引入新错误的频率相当。在本文中,我们进行了一项受控研究,其中 LLM 首先回答一个问题,然后在做出最终决定之前看到模拟的同伴回应。我们操纵两个社会线索:共识结构和分配给同伴的权威标签,并测量它们如何影响有益和有害的修正。在四个开放权重的 LLM 和七个问答数据集上,我们发现同伴一致意见使得误导原本正确的模型比纠正原本错误的模型容易得多。权威标签使模型更可能选择被认可的答案,无论其是否正确。更令人担忧的是,通用的推理干预(如思维链和反思)并不能可靠地减少有害修正同时保留有益修正。这些发现表明,多智能体 LLM 系统应该验证同伴答案,而不是简单地聚合它们。

英文摘要

Large language models are increasingly used in multi-agent systems, where they see and respond to other agents' answers. A key risk is conformity: a model may abandon its own answer simply because others agree on a different one. Prior studies show that LLMs often revise toward a majority answer, but it remains unclear whether these revisions help correct mistakes as often as they introduce new errors. In this paper, we conduct a controlled study in which an LLM first answers a question, then sees simulated peer responses before making a final decision. We manipulate two social cues: consensus structure and authority labels assigned to peers, and measure how they influence beneficial and harmful revisions. Across four open-weight LLMs and seven QA datasets, we find that peer agreement makes it much easier to mislead initially correct models than to correct initially wrong ones. Authority labels make models more likely to choose the endorsed answer, regardless of whether it is correct. More concerningly, generic reasoning interventions such as chain-of-thought and reflection do not reliably reduce harmful revision while preserving beneficial revision. These findings suggest that multi-agent LLM systems should verify peer answers rather than simply aggregate them.

2606.06443 2026-06-09 cs.CL cs.MM cs.SI 版本更新

Revising Context, Shifting Simulated Stance: Auditing LLM-Based Stance Simulation in Online Discussions

修正语境,转变模拟立场:审计基于LLM的在线讨论立场模拟

Xinnong Zhang, Wanting Shan, Hanjia Lyu, Zhongyu Wei, Jiebo Luo

发表机构 * Fudan University(复旦大学) University of Rochester(罗切斯特大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 本研究通过反事实语境修正框架审计LLM立场模拟,对比纯文本与多模态策略,评估平均方向性立场转变和立场转换率,揭示语境敏感性的有效性与鲁棒性。

详情
AI中文摘要

大型语言模型越来越多地被用于模拟社交媒体用户,并推断个人如何回应在线讨论。然而,目前尚不清楚这些模拟是否反映了精确的用户特定信念,或者它们是否对对话语境中语义无关的变化高度敏感。在这项工作中,我们研究反事实语境修正作为审计基于LLM的立场模拟的框架。给定一个原始在线对话,我们首先推断目标用户对特定话题的立场。然后,我们对对话语境应用受控修正策略,并在修正后的语境下再次模拟用户的立场。我们比较了纯文本修正策略与包含模因语境的多模态策略,并评估了两个主要有效性指标,即平均方向性立场转变和立场转换率。结果揭示了在不同极化偏好机制下,纯文本和多模态策略中有效且稳健的立场转变。我们的研究为理解基于LLM的立场模拟的语境敏感性提供了一个评估框架。更广泛地说,它突出了使用LLM模拟在线舆论动态的前景和风险。

英文摘要

Large language models are increasingly used to simulate social media users and infer how individuals may respond to online discussions. However, it remains unclear whether these simulations reflect precise user-specific beliefs or whether they are highly sensitive to semantically independent changes in conversational contexts. In this work, we study counterfactual context revision as a framework for auditing LLM-based stance simulation. Given an original online conversation, we first infer a target user's stance toward a specific topic. We then apply controlled revision strategies to the conversational context and simulate the user's stance again under the revised context. We compare text-only revision strategies with a multimodal one that incorporates meme-based context and evaluate two main effectiveness metrics, i.e., average directional stance shift and stance transition rate. The results reveal effective and robust stance transitions in both text-only and multimodal strategies across different polarization-preference mechanisms. Our study contributes an evaluation framework for understanding the context sensitivity of LLM-based stance simulation. More broadly, it highlights both the promise and risk of using LLMs to simulate online opinion dynamics.

2510.17947 2026-06-09 cs.CR cs.AI cs.CL cs.LG cs.MA 版本更新

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

PLAGUE:面向多轮利用的终身自适应生成的即插即用框架

Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar

发表机构 * A10 Networks, Inc.(A10网络公司) University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 提出PLAGUE框架,通过终身学习启发的三阶段设计(Primer、Planner、Finisher)实现高效多轮越狱攻击,在o3和Opus 4.1等强安全模型上ASR提升超30%。

Comments Accepted in ICLR 2026

详情
AI中文摘要

大型语言模型(LLMs)正以惊人的速度改进。随着智能体工作流的出现,多轮对话已成为与LLMs交互以完成长而复杂任务的事实标准。尽管LLM能力持续提升,但它们仍然越来越容易受到越狱攻击,尤其是在多轮场景中,有害意图可以巧妙地注入到对话中,产生恶意结果。虽然单轮攻击已被广泛探索,但适应性、效率和有效性仍然是多轮攻击面临的关键挑战。为了解决这些不足,我们提出了PLAGUE,一种新颖的即插即用框架,用于设计受终身学习智能体启发的多轮攻击。PLAGUE将多轮攻击的生命周期分解为三个精心设计的阶段(Primer、Planner和Finisher),从而实现对多轮攻击家族的系统性和信息丰富的探索。评估表明,使用PLAGUE设计的红队智能体实现了最先进的越狱结果,在更少或相当的查询预算下,领先模型的攻击成功率(ASR)提高了30%以上。特别是,PLAGUE在OpenAI的o3上实现了81.4%的ASR(基于StrongReject),在Claude的Opus 4.1上实现了67.3%的ASR,这两个模型在安全文献中被认为对越狱具有高度抵抗力。我们的工作提供了工具和见解,以理解计划初始化、上下文优化和终身学习在构建多轮攻击以进行全面模型脆弱性评估中的重要性。

英文摘要

Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency and effectiveness continue to remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models in a lesser or comparable query budget. Particularly, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Claude's Opus 4.1, two models that are considered highly resistant to jailbreaks in safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.

2512.03465 2026-06-09 cs.CR cs.CL cs.IR 版本更新

Tuning for TraceTarnish: Techniques, Trends, and Testing Tangible Traits

为TraceTarnish调优:技术、趋势与可感知特征的测试

Robert Dilworth

发表机构 * Department of Computer Science and Engineering, Mississippi State University(密西西比州立大学计算机科学与工程系)

AI总结 本文通过严格评估TraceTarnish攻击脚本,利用对抗性风格学原理匿名化文本消息的作者身份,通过Reddit评论数据和StyloMetrix增强,提取出显著的信息增益特征,用于检测文本篡改。

Comments 20 pages, 8 figures, 2 tables

详情
AI中文摘要

在本研究中,我们更严格地评估了我们的攻击脚本TraceTarnish,该脚本利用对抗性风格学原理来匿名化文本消息的作者身份。为了确保攻击的有效性和实用性,我们收集、处理并分析了Reddit评论(这些评论随后被转化为TraceTarnish数据),以获得有价值的见解。转换后的TraceTarnish数据随后通过StyloMetrix进一步增强,以生成风格学特征;这些特征再按信息增益准则筛选,仅保留最具信息量、预测性和判别性的特征。结果表明,功能词和功能词类型(L_FUNC_A & L_FUNC_T)、内容词和内容词类型(L_CONT_A & L_CONT_T)以及类符-形符比(ST_TYPE_TOKEN_RATIO_LEMMAS)产生了显著的信息增益读数。识别出的风格学线索(功能词频率、内容词分布和类符-形符比)可作为可靠的入侵指标(IoCs),揭示文本何时被蓄意篡改以掩盖其真实作者。同样,这些特征可以充当取证信标,提醒防御者存在对抗性风格学攻击;不过,在缺少原始消息的情况下,这一信号可能基本不被察觉,因为它似乎依赖于转换前后的比较。“在试图抹去痕迹时,你往往会留下更大的痕迹。”基于这一认识,我们围绕这五个单独的特征来组织TraceTarnish的操作和输出,并利用它们构思和实现进一步强化攻击的改进。

英文摘要

In this study, we more rigorously evaluated our attack script TraceTarnish, which leverages adversarial stylometry principles to anonymize the authorship of text-based messages. To ensure the efficacy and utility of our attack, we sourced, processed, and analyzed Reddit comments -- comments that were later alchemized into TraceTarnish data -- to gain valuable insights. The transformed TraceTarnish data was then further augmented by StyloMetrix to manufacture stylometric features -- features that were culled using the Information Gain criterion, leaving only the most informative, predictive, and discriminative ones. Our results found that function words and function word types (L_FUNC_A & L_FUNC_T); content words and content word types (L_CONT_A & L_CONT_T); and the Type-Token Ratio (ST_TYPE_TOKEN_RATIO_LEMMAS) yielded significant Information-Gain readings. The identified stylometric cues -- function-word frequencies, content-word distributions, and the Type-Token Ratio -- serve as reliable indicators of compromise (IoCs), revealing when a text has been deliberately altered to mask its true author. Similarly, these features could function as forensic beacons, alerting defenders to the presence of an adversarial stylometry attack; granted, in the absence of the original message, this signal may go largely unnoticed, as it appears to depend on a pre- and post-transformation comparison. "In trying to erase a trace, you often imprint a larger one." Armed with this understanding, we framed TraceTarnish's operations and outputs around these five isolated features, using them to conceptualize and implement enhancements that further strengthen the attack.

2604.10271 2026-06-09 cs.CR cs.CL cs.IR 版本更新

Hijacking Text Heritage: Hiding the Human Signature through Homoglyphic Substitution

窃取文本遗产:通过同形替代隐藏人类签名

Robert Dilworth

发表机构 * Department of Computer Science and Engineering, Mississippi State University(密西西比州立大学计算机科学与工程系)

AI总结 本文研究通过同形替代技术削弱 stylometry 系统,探讨如何在文本中隐藏个人身份信息以防止身份泄露。

Comments 30 pages, 9 figures

详情
AI中文摘要

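涉及护照、驾照等政府颁发身份证件的数据泄露,在何种意义上会与在某个不起眼的社交媒体平台上的一次随意自愿披露不相上下?乍看之下,前者显得更为严重,这一判断也是合理的:泄露的数据可能包含个人的出生日期和住址,此类数据一旦泄露,后果将是灾难性的。相比之下,后一种情形只涉及一条无害的在线帖子,似乎危害不大,但果真如此吗?法证语言学家可以借助风格学(stylometry)从这条帖子及类似帖子中挖掘出同类信息,估计作者的年龄段(青少年或成年人),并缩小其地理位置范围(具体到国家)。尽管这并非精确的科学(其判断是统计性的),风格学仍能揭示关于个人的同类信息,只是明显更为稀薄。要防止身份证件泄露,只需尽量少分享即可;而要防止个人信息从书面文本中泄露,则需要更复杂的方案:对抗性风格学。本文探讨对文本进行同形字替换(将字符替换为视觉上相似的替代字符,例如将“h”[U+0068]替换为“һ”[U+04BB])如何削弱风格学系统。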

英文摘要

In what way could a data breach involving government-issued IDs such as passports, driver's licenses, etc., rival a random voluntary disclosure on a nondescript social-media platform? At first glance, the former appears more significant, and that is a valid assessment. The disclosed data could contain an individual's date of birth and address; for all intents and purposes, a leak of that data would be disastrous. Given the threat, the latter scenario involving an innocuous online post seems comparatively harmless--or does it? From that post and others like it, a forensic linguist could stylometrically uncover equivalent pieces of information, estimating an age range for the author (adolescent or adult) and narrowing down their geographical location (specific country). While not an exact science--the determinations are statistical--stylometry can reveal comparable, though noticeably diluted, information about an individual. To prevent an ID from being breached, simply sharing it as little as possible suffices. Preventing the leakage of personal information from written text requires a more complex solution: adversarial stylometry. In this paper, we explore how performing homoglyph substitution--the replacement of characters with visually similar alternatives (e.g., "h" [U+0068] → "һ" [U+04BB])--on text can degrade stylometric systems.
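Homoglyph substitution as described above can be sketched in a few lines. Only the "h" (U+0068) to U+04BB pair comes from the abstract; the other entries in the table are common Latin-to-Cyrillic look-alikes chosen here for illustration, and the function name is hypothetical. The paper's actual substitution table and selection strategy are not given in the abstract.

```python
# Latin letter -> visually similar Cyrillic code point.
# Only the "h" entry is taken from the abstract; the rest are illustrative.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "h": "\u04bb",  # CYRILLIC SMALL LETTER SHHA
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}


def substitute_homoglyphs(text, mapping=HOMOGLYPHS):
    """Replace every character that has a look-alike; leave the rest untouched."""
    return "".join(mapping.get(ch, ch) for ch in text)
```

The output renders almost identically to the input but no longer matches it byte for byte, which is the property that can disturb token- and character-based stylometric features.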

2606.01567 2026-06-09 cs.CR cs.AI cs.CL 版本更新

Defenses & Enablers For Skill Injection Attacks on Terminal Based Agents

针对终端代理的技能注入攻击的防御与使能因素

Yoshinari Fujinuma, Varun Gangal, Traian Rebedea, Makesh Narsimhan Sreedhar, Prasoon Varshney, Rebecca Qian, Anand Kannappan

发表机构 * Patronus AI NVIDIA

AI总结 研究基于大语言模型的代理在重用技能时面临的安全威胁,提出守护者防御(动态和静态)将攻击成功率降低过半,并测试了攻击重述的鲁棒性。

Comments First version, small updates and clarifications likely in v2

详情
AI中文摘要

大型语言模型(LLM)代理越来越依赖可重用的技能,即描述任务特定程序的文档。然而,这引入了代理需要应对的新攻击面。我们针对这一威胁研究了两个互补方向。首先,我们评估了基于守护者的防御:一个中间LLM代理,或充当技能文件访问的调解者(动态守护者),或在构建时预先重写这些文件(静态守护者)。在三个LLM代理家族中,我们的守护者将攻击成功率(ASR)降低了一半以上,同时保持了任务效用。其次,我们通过攻击重述对守护者进行压力测试,使用了四种保留恶意指令但改变措辞的攻击。在无守护者的设置下,重述将ASR推高至81.4%,而动态守护者将其降至18.6%,表明实时调解是一种稳健的防御。

英文摘要

Large language model (LLM) agents increasingly rely on reusable skills, i.e., documents describing task-specific procedures. However, this introduces a new attack surface for agents to manage. We study two complementary directions for this threat. First, we evaluate guardian-based defenses: an intermediary LLM agent that acts as a mediator for skill file access (dynamic guardian) or pre-rewrites these files at build time (static guardian). Across three LLM agent families, our guardians cut attack success rate (ASR) by well over half while preserving task utility. Second, we stress test them through attack reframing using four attacks that preserve the malicious instruction but change the phrasing. For the non-guardian setup, the reframing pushes the ASR up to 81.4%, but the dynamic guardian brings it down to 18.6%, showing that real-time mediation is a robust defense.

11. 低资源、领域适配与高效训练 14 篇

2606.08197 2026-06-09 cs.CL cs.DC 新提交

AlignFed: Alignment-Aware Asynchronous Federated Fine-Tuning for Large Language Models in Heterogeneous Edge Environments

AlignFed: 异构边缘环境中大语言模型的对齐感知异步联邦微调

Yan Wang, Ziyi Gao, Rui Wang

发表机构 * University of Science and Technology Beijing(北京科技大学)

AI总结 提出AlignFed框架,通过多阶段语义对齐机制(版本感知更新分组、跨版本语义对齐、公平性感知聚合)解决异步联邦微调中大语言模型在异构边缘环境中的模型漂移、客户端漂移和聚合不公平问题。

详情
AI中文摘要

大语言模型(LLMs)显著推动了边缘智能的发展,并已广泛应用于自动驾驶、工业检测和个性化物联网服务等多种场景。然而,由于严格的数据隐私约束、高度异构的计算和通信资源以及本地数据的非独立同分布(non-IID)特性,在边缘设备上协作适配LLMs仍面临严峻挑战。联邦微调(FFT)能够在无需暴露原始数据的情况下实现分布式模型的协作优化。然而,传统的同步聚合存在严重的掉队者效应,导致系统延迟高、资源利用率低。现有的异步联邦学习方法主要针对中小规模模型设计,难以解决LLM微调中特有的挑战,即由陈旧更新引起的模型漂移、由数据异质性加剧的客户端漂移以及由快速客户端主导导致的聚合公平性失衡。针对这些问题,本文提出AlignFed,一种面向异构边缘环境的LLMs异步联邦微调框架。AlignFed采用轻量级多阶段语义对齐机制,包含三个核心模块:版本感知的更新分组、基于小批量校准集的跨版本语义对齐,以及结合更新新鲜度和客户端参与频率的公平性感知聚合。该框架有效缓解了跨版本模型漂移和客户端漂移,同时增强了聚合公平性,从而在高异质性和显著更新陈旧性的场景中实现稳定高效的异步联邦优化。

英文摘要

Large Language Models (LLMs) have significantly propelled the advancement of edge intelligence and have been widely deployed across various scenarios, including autonomous driving, industrial inspection, and personalized IoT services. However, the collaborative adaptation of LLMs on edge devices continues to face formidable challenges due to strict data privacy constraints, highly heterogeneous computing and communication resources, and the non-independent and identically distributed (non-IID) nature of local data. Federated Fine-Tuning (FFT) enables the collaborative optimization of distributed models without exposing raw data. Yet, traditional synchronous aggregation suffers from a severe straggler effect, resulting in high system latency and low resource utilization. Existing asynchronous federated learning methods are predominantly designed for small-to-medium-scale models and struggle to address the specific challenges inherent in LLM fine-tuning namely, model drift caused by stale updates, aggravated client drift stemming from data heterogeneity, and aggregation fairness imbalance resulting from the dominance of fast clients. To address these issues, this paper proposes AlignFed, an asynchronous federated fine-tuning framework for LLMs tailored to heterogeneous edge environments. AlignFed employs a lightweight multi-stage semantic alignment mechanism comprising three core modules: version-aware update grouping, cross-version semantic alignment based on a mini-batch calibration set, and fairness-aware aggregation that integrates both update freshness and client participation frequency. This framework effectively mitigates cross-version model drift and client drift while enhancing aggregation fairness, thereby achieving stable and efficient asynchronous federated optimization in scenarios characterized by high heterogeneity and significant update staleness.

2606.09396 2026-06-09 cs.CL cs.LG 新提交

PriFT: Prior-Support Guided Supervised Fine-Tuning

PriFT: 先验支持引导的监督微调

Ke Wang, Shuangqi Li, Mathieu Salzmann, Pascal Frossard

发表机构 * EPFL(瑞士联邦理工学院洛桑分校)

AI总结 提出PriFT方法,利用冻结的预训练模型计算token权重,避免在线模型导致的自我强化动态,在数学推理、代码生成和医疗问答任务中取得SFT最优结果,并为后续RL提供更好初始化。

Comments The first two authors contributed equally to this work

详情
AI中文摘要

监督微调(SFT)是下游任务适配的高效方法,通常作为强化学习(RL)的初始化阶段,但其泛化能力可能弱于RL。一个关键限制是其离策略目标:SFT逐token拟合固定演示,包括与模型预训练分布对齐不良的目标,这可能导致过拟合。最近一系列工作通过给与当前模型预测分布更对齐的token分配更大的训练权重来解决此问题,直觉是拟合这些token对模型的预训练知识和表示的扭曲较小。然而,从当前微调模型计算token权重会将token权重与优化轨迹纠缠在一起,随着分布迅速偏离预训练模型,引发自我强化动态。为了解决这个问题,我们提出PriFT(先验支持引导的微调),该方法从冻结的预训练参考模型导出token权重,以获得不受微调影响的稳定重加权信号。该信号估计先验支持:每个目标token受预训练分布支持的程度。在多种现有token重加权规则中,将重加权信号从在线模型替换为预训练模型一致地提升了性能。我们引入了两种实例化:PriFT-prob使用预训练token概率,而PriFT-mass根据预训练分布下的累积概率质量选择token。在数学推理、代码生成和医疗问答上的大量实验表明,PriFT在SFT基线中取得了最先进的结果,并为后续RL训练提供了更好的初始化。

英文摘要

Supervised fine-tuning (SFT) is an efficient approach for downstream task adaptation and often serves as the initialization stage for reinforcement learning (RL), but it can show weaker generalization than RL. A key limitation is its off-policy objective: SFT fits fixed demonstrations token by token, including targets poorly aligned with the model's pretrained distribution, which can lead to overfitting. A recent line of work addresses this issue by assigning larger training weights to tokens better aligned with the current model's predictive distribution, with the intuition that fitting these tokens is less distortive to the model's pretrained knowledge and representations. However, computing the token weights from the model that is currently being fine-tuned entangles token weights with the optimization trajectory, inducing self-reinforcing dynamics as the distribution rapidly departs from the pretrained model. To address this, we propose PriFT (Prior-support guided Fine-Tuning), which derives token weights from a frozen pretrained reference to obtain a stable reweighting signal unaffected by fine-tuning. This signal estimates prior support: the extent to which each target token is supported by the pretrained distribution. Across multiple existing token-reweighting rules, switching the source of the reweighting signal from the online model to the pretrained model consistently improves performance. We introduce two instantiations: PriFT-prob uses pretrained token probability, while PriFT-mass selects tokens by cumulative probability mass under the pretrained distribution. Extensive experiments on mathematical reasoning, code generation, and medical question answering show that PriFT achieves state-of-the-art results among SFT baselines and provides a better initialization for subsequent RL training.
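The idea of weighting each target token by its probability under a frozen pretrained reference can be sketched as a weighted negative log-likelihood. This is a minimal illustration of the PriFT-prob description only; the normalization by the sum of weights and the function name are assumptions, and the abstract does not specify the exact loss.

```python
import math


def prior_weighted_nll(ref_probs, model_probs):
    """Weighted NLL over one response.

    ref_probs[t]:   probability of target token t under the frozen pretrained model.
    model_probs[t]: probability of target token t under the model being fine-tuned.
    The weights are constants, so they do not change as fine-tuning proceeds.
    """
    weights = list(ref_probs)
    total = sum(weights)
    if total == 0:
        raise ValueError("all reference probabilities are zero")
    return -sum(w * math.log(p) for w, p in zip(weights, model_probs)) / total
```

Because the weights come from a fixed reference rather than the model being updated, a token the pretrained model considers implausible stays down-weighted throughout training instead of gaining weight as the fine-tuned model drifts toward it.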

2606.09435 2026-06-09 cs.CL 新提交

MUDIDI: A Two-Stage Framework for Multilingual Dictionary Digitization with Language Models

MUDIDI:一种基于语言模型的多语言词典数字化两阶段框架

David Setiawan, Temuulen Khishigsuren, Milind Agarwal, Pagnarith Pit, Aso Mahmudi, Ekaterina Vylomova

发表机构 * School of Computing and Information Systems, The University of Melbourne(墨尔本大学计算与信息系统学院) Melbourne School of Psychological Sciences, The University of Melbourne(墨尔本大学墨尔本心理科学学院) LILT

AI总结 提出MUDIDI两阶段框架,结合语言模型实现多语言词典数字化,在字符识别、标记保留和词条分割上优于现有OCR和视觉语言模型,并发布30本公共领域词典的标注数据集。

Comments 9 pages, preprint, submitted to EMNLP 2026

详情
AI中文摘要

多语言词典是低资源和濒危语言最有价值的文献资源之一,但许多仍仅以扫描件形式存在。几十年来,由于语言特有的文字、包含缩写和交叉引用条目的复杂多栏布局,其数字化并转换为机器可读格式几乎不可能。最近的视觉语言模型提供了有希望的解决方案,但尚不清楚它们在保留字符、标记和处理词典结构方面的表现。我们提出MUDIDI,一个用于多语言词典数字化的两阶段框架。第一阶段评估字符识别和标记保留的质量;第二阶段专注于词典条目分割,随后映射到机器可读的词典模式——SIL的多词典格式化器。我们还发布了一个数据集,包含从30本公共领域词典中收集的人工标注的词典条目,这些词典涵盖多种文字系统、语系和格式。我们在该数据集上对OCR系统、通用大语言模型(LLM)和视觉语言模型(VLM)进行了基准测试,展示了LLM在大多数文字系统和语言的两个阶段中的优越性能,并为更具挑战性的场景提供了改进结果的实用指南。最后,我们表明向LLM补充额外信息(如词典引言)可以提高数字化词典的质量。Github: https://github.com/DavidSamuell/MUDIDI-Pipeline-for-Digitization-of-Multilingual-Dictionary/

英文摘要

Multilingual dictionaries are among the most valuable documentary resources for low-resource and endangered languages, yet many remain available only as scans. For many decades, their digitization and conversion into a machine-readable format was nearly impossible due to language-specific scripts, complex multi-column layouts full of entries with abbreviations and cross-references. Recent vision-language models offer a promising solution, but it is unclear how well they preserve characters, markup, and process lexicographic structure. We introduce MUDIDI, a two-stage framework for multi-lingual dictionary digitization. Stage One evaluates the quality of character recognition and markup preservation; Stage Two focuses on dictionary entry segmentation with subsequent mapping into a machine-readable lexicographic schema, SIL's Multi-Dictionary Formatter. We also release a dataset that consists of human-annotated lexicographic entries collected from 30 public-domain dictionaries featuring diverse writing systems, language families, and formats. We benchmark OCR systems, general-purpose Large Language Models (LLMs), and Vision Language Models (VLMs) on the dataset, demonstrating superior performance of LLMs across most writing systems and languages in both stages, and provide practical guidelines on improving the results for more challenging scenarios. Finally, we show that supplementing additional information, such as dictionary introduction, to the LLMs can improve the quality of the digitized dictionary. Github: https://github.com/DavidSamuell/MUDIDI-Pipeline-for-Digitization-of-Multilingual-Dictionary/

2606.09767 2026-06-09 cs.CL cs.AI cs.LG 新提交

Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

低资源神经机器翻译的数据合成与参数高效微调:以Q'eqchi'玛雅语为例

Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee

发表机构 * University of Houston(休斯顿大学) MasterWord Services, Inc.(MasterWord Services公司) University of Washington(华盛顿大学)

AI总结 针对低资源土著语言,提出数据合成方法(利用社区词典生成合成语料)结合LoRA参数高效微调,在Q'eqchi'玛雅语上实现高结构习得(BLEU 42.02),但存在结构-语义差距,需结合真实数据进行课程学习。

Comments Accepted to the 29th International Conference on Text, Speech and Dialogue (TSD 2026). This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections

详情
AI中文摘要

对于数字低资源土著语言的神经机器翻译,通常因极端数据稀缺而受阻,促使依赖抽取式网络爬取。为确保数据主权,本研究引入了一种数据合成方法,无需爬取目标语言平行文本即可引导NMT模型。以Q'eqchi'玛雅语为重点,我们将社区来源的词典转换为大规模合成语料,利用通过LoRA适配器在mT5-base模型上的参数高效微调(PEFT)。领域内评估显示出高度的结构习得(BLEU 42.02),证明合成约束有效地教授了复杂的黏着形态和VOS语序。然而,针对有机词汇表的评估揭示了结构-语义差距(BLEU 0.59),模型保持了语法完整性但缺乏自然语言的词汇基础。模型表现出对合成模板受限结构方差的过拟合;尽管流程中具有高语义熵,模型仍难以应对自然语言的句法流动性,将有机输入强制转换为僵化的学习模式。此外,利用多任务学习架构的消融研究导致了负迁移,表明辅助任务在LoRA适配器内竞争有限的参数容量,导致对合成标记的过度优化而牺牲了有机灵活性。最终,我们确定合成引导是一种高度有效的结构入门,但需要通过课程学习使用真实数据进行语义细化。

英文摘要

Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus, utilizing Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model. In-domain evaluation demonstrates high structural acquisition (BLEU 42.02), proving that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59), where the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model exhibits overfitting to the constrained structural variance of the synthetic templates; despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language, forcing organic inputs into rigid learned patterns. Furthermore, an ablation study utilizing a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for limited parameter capacity within the LoRA adapters, causing over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but requires authentic data for semantic refinement via Curriculum Learning.

2606.08078 2026-06-09 cs.SD cs.CL 交叉投稿

On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation

说话人验证中的低位量化误差:诊断与缓解

Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier

发表机构 * LIA, UPR 4128 Avignon University(阿维尼翁大学) Aday

AI总结 本文通过逐层和得分级分析,诊断了低比特量化对说话人验证的影响,发现2比特是关键拐点,并提出校准多精度级联方法,在保持低位推理效率的同时接近全精度性能。

Comments Accepted at Speaker Odyssey 2026 Lisbon

详情
AI中文摘要

尽管低比特量化为在资源受限设备上部署说话人验证提供了实用手段,但其对说话人验证性能的影响仍知之甚少。本文通过联合逐层和得分级分析,研究了ResNet-36和ResNet-200的均匀K-means量化感知训练。我们的逐层分析突出了脆弱组件,并表明得分退化不能仅由权重失真完全解释。我们在2比特处识别出一个明显的拐点,较大的得分漂移和有害决策翻转集中在FP32阈值附近。我们的得分级分析揭示了在极端量化下得分误差产生的位置和方式。基于这些发现,我们提出了一种校准的多精度级联方法,该方法在2比特下解决大多数试验,仅升级模糊情况,实现了接近FP32的性能,同时以显著降低的计算和内存成本保留了低位推理的效率优势。

英文摘要

Although low-bit quantization provides practical means to deploy speaker verification on resource-constrained devices, its effects on speaker verification performance remain poorly understood. In this paper, we study uniform K-means quantization-aware training of ResNet-36 and ResNet-200 through joint layer-wise and score-level analyses. Our layer-wise analysis highlights fragile components and shows that score degradation is not fully explained by weight distortion alone. We identify a clear knee point at 2 bits, with larger score drift and harmful decision flips concentrated near the FP32 threshold. Our score-level analysis reveals where and how score errors emerge under extreme quantization. Building on these findings, we propose a calibrated multi-precision cascade that resolves most trials at 2 bits and escalates only ambiguous cases, achieving performance close to FP32 while preserving the efficiency benefits of low-bit inference with substantially lower compute and memory costs.
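The cascade idea in the abstract above (resolve most trials at 2 bits, escalate only ambiguous ones) can be sketched as a simple decision rule. This is a minimal illustration, not the authors' implementation: it assumes the low-bit score is already calibrated to the same scale as the FP32 threshold, and the names `cascade_decision`, `margin`, and `score_high_fn` are hypothetical.

```python
from typing import Callable, Tuple


def cascade_decision(
    score_low: float,
    threshold: float,
    margin: float,
    score_high_fn: Callable[[], float],
) -> Tuple[bool, str]:
    """Accept or reject a trial, escalating only when the cheap score is ambiguous."""
    # Scores far from the threshold are resolved by the low-bit model alone.
    if abs(score_low - threshold) > margin:
        return score_low >= threshold, "low-bit"
    # Scores inside the ambiguity band trigger the higher-precision model.
    # score_high_fn is lazy so the expensive model runs only when needed.
    return score_high_fn() >= threshold, "escalated"
```

A wider `margin` escalates more trials (closer to full-precision accuracy, higher cost); a narrower one keeps more decisions at low precision.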

2505.21239 2026-06-09 cs.CL 版本更新

A Unified LLM-Adaptable Framework for Cold-Start Cognitive Diagnosis

面向冷启动认知诊断的统一LLM可适配框架

Zihan Yao, Chentao Song, Yu He, Tianyu Qi, Jian Zhang, Weiping Fu, Jun Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LMCD框架,通过知识扩散和语义-认知融合两阶段,利用大语言模型增强冷启动场景下的认知诊断性能。

Comments Under review

详情
AI中文摘要

认知诊断已成为人工智能赋能教育中的关键任务,通过准确评估学生的认知状态来支持个性化学习。然而,传统的认知诊断模型(CDMs)由于缺乏学生-练习交互数据,在冷启动场景中常常表现不佳。最近基于NLP的方法利用预训练语言模型(PLMs)通过文本特征显示出潜力,但未能完全弥合语义理解与认知建模之间的差距。为了解决这一局限,我们提出了基于语言模型的认知诊断(LMCD),这是一个统一的、可适配LLM的框架,旨在通过利用大语言模型(LLMs)的高级能力来应对冷启动挑战。LMCD通过两个主要阶段运行:(1)知识扩散,其中LLMs为练习和知识概念(KCs)生成丰富的内容,以建立更强的语义联系;(2)语义-认知融合,利用LLMs将文本信息与学生认知状态深度融合。通过统一语义和认知空间,LMCD创建了全面的表示,作为各种现成CDMs的即插即用增强。在两个真实世界数据集上的实验表明,LMCD在练习冷启动和领域冷启动设置中均显著优于最先进的方法。代码已公开在 https://github.com/TAL-auroraX/LMCD。

英文摘要

Cognitive Diagnosis has become a critical task in AI-empowered education, supporting personalized learning by accurately assessing students' cognitive states. However, traditional cognitive diagnosis models (CDMs) often struggle in cold-start scenarios due to the lack of student-exercise interaction data. Recent NLP-based approaches leveraging pre-trained language models (PLMs) have shown promise by utilizing textual features, but they fail to fully bridge the gap between semantic understanding and cognitive profiling. To address this limitation, we propose Language Model-based Cognitive Diagnosis (LMCD), a unified, LLM-adaptable framework designed to tackle cold-start challenges by harnessing the advanced capabilities of large language models (LLMs). LMCD operates via two primary phases: (1) Knowledge Diffusion, where LLMs generate enriched content for exercises and knowledge concepts (KCs) to establish stronger semantic links; and (2) Semantic-Cognitive Fusion, which leverages LLMs to deeply integrate textual information with student cognitive states. By unifying the semantic and cognitive spaces, LMCD creates comprehensive representations that serve as a plug-and-play enhancement for various off-the-shelf CDMs. Experiments on two real-world datasets demonstrate that LMCD significantly outperforms state-of-the-art methods in both exercise-cold and domain-cold settings. The code is publicly available at https://github.com/TAL-auroraX/LMCD

2604.01609 2026-06-09 cs.CL 版本更新

Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression

Swift-SVD:在低秩LLM压缩中理论最优与实用效率的结合

Ruoling Qi, Yirui Liu, Xuaner Wu, Xiangyu Wang, Ming Li, Chen Chen, Jian Chen, Yin Chen, Qizhen Weng

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 本文提出Swift-SVD框架,通过激活感知和闭式压缩方法,在保证理论最优的同时提升实用效率和数值稳定性,实验显示其在压缩精度和端到端压缩时间上有显著优势。

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型的部署受到静态权重和动态键值缓存的内存和带宽需求的限制。基于SVD的压缩提供了一种硬件友好的解决方案来降低这些成本。然而,现有方法存在两个关键限制:一些方法在重建误差上不最优,而另一些方法在理论上最优但实际效率低。本文提出Swift-SVD,一种激活感知的闭式压缩框架,同时保证理论最优、实用效率和数值稳定性。Swift-SVD在给定一批输入的情况下逐步聚合输出激活的协方差,并在聚合后执行一次特征值分解,实现免训练、快速且最优的层级低秩近似。我们采用有效秩分析局部层级压缩性,并设计一种动态秩分配策略,同时考虑局部重建损失和端到端层重要性。在六个LLM和八个数据集上的广泛实验表明,Swift-SVD优于现有最先进基线,实现最优压缩精度的同时,端到端压缩时间提升了3-70倍。我们的代码可在https://github.com/hiahei/Swift-SVD获取。

英文摘要

The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and the dynamic key-value (KV) cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimality, practical efficiency, and numerical stability. Swift-SVD incrementally aggregates the covariance of output activations given a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3-70x speedups in end-to-end compression time. Our code is available at https://github.com/hiahei/Swift-SVD.
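The aggregate-then-decompose step described in the abstract above can be illustrated with a textbook activation-aware low-rank approximation. This is a sketch under assumptions, not the Swift-SVD code: it assumes a linear layer computing `Y = X @ W.T`, uses NumPy, and omits the paper's rank allocation strategy; the function names are hypothetical.

```python
import numpy as np


def aggregate_output_covariance(weight, input_batches):
    """Accumulate the sum of Y^T Y over batches, where Y = X @ weight.T."""
    d_out = weight.shape[0]
    cov = np.zeros((d_out, d_out))
    for x in input_batches:
        y = x @ weight.T      # output activations, shape (n, d_out)
        cov += y.T @ y        # incremental aggregation; batches are not stored
    return cov


def low_rank_factors(weight, cov, rank):
    """One eigendecomposition, then project the weight onto the top output subspace."""
    _, eigvecs = np.linalg.eigh(cov)        # eigenvalues come back in ascending order
    u_r = eigvecs[:, ::-1][:, :rank]        # top-`rank` eigenvectors, shape (d_out, rank)
    a = u_r                                 # shape (d_out, rank)
    b = u_r.T @ weight                      # shape (rank, d_in)
    return a, b                             # weight is approximated by a @ b
```

Because the top eigenvectors of the aggregated covariance are the top right singular vectors of the stacked outputs, the resulting output error matches the best rank-`rank` approximation of those outputs (Eckart-Young) on the calibration data.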

2605.03229 2026-06-09 cs.CL cs.LG 版本更新

Sparse Memory Finetuning as a Low-Forgetting Alternative to LoRA and Full Finetuning

稀疏记忆微调:作为LoRA和全微调的低遗忘替代方案

Prakhar Gupta, Garv Shah, Satyam Goyal, Anirudh Kanchi

发表机构 * University of Washington(华盛顿大学)

AI总结 提出稀疏记忆微调(SMF),通过添加键值记忆层并仅更新当前批次最活跃的记忆行,在MedMCQA任务上提升2.5个百分点,同时将遗忘探针(WikiText困惑度和TriviaQA准确率)控制在基线的1个百分点内,优于LoRA和全微调。

详情
AI中文摘要

将预训练语言模型适应新任务通常会损害其已有的通用能力,这一问题被称为灾难性遗忘。稀疏记忆微调(SMF)通过向模型添加键值记忆层,并在每个训练步骤中仅更新当前批次读取最频繁的一小组记忆行来避免这种情况。我们在Qwen-2.5-0.5B-Instruct上重新实现了SMF,并将其与LoRA和全微调在MedMCQA(一个4选1的医学考试任务)上进行比较,使用WikiText困惑度和TriviaQA准确率作为遗忘探针。SMF将MedMCQA提升了2.5个百分点,同时将两个遗忘探针保持在基线的约1个百分点内,而LoRA和全微调虽然取得了更大的增益,但在两个探针上都出现了明显的漂移。我们还比较了两种行选择规则(KL散度和TF-IDF),它们在两个遗忘指标上取得了不同的平衡。

英文摘要

Adapting a pretrained language model to a new task often hurts the general capabilities it already had, a problem known as catastrophic forgetting. Sparse Memory Finetuning (SMF) tries to avoid this by adding key-value memory layers to the model and, on each training step, updating only the small set of memory rows that the current batch reads most heavily. We re-implement SMF on Qwen-2.5-0.5B-Instruct and compare it with LoRA and full finetuning on MedMCQA, a 4-choice medical exam task, using WikiText perplexity and TriviaQA accuracy as forgetting probes. SMF improves MedMCQA by 2.5 percentage points while keeping both forgetting probes within roughly 1 point of the base model, whereas LoRA and full finetuning achieve larger gains but with clear drift on both. We also compare two row-selection rules (KL-divergence and TF-IDF), which balance the two forgetting metrics differently.
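The sparse update described in the abstract above (only the memory rows the current batch reads most heavily are trained) can be sketched as follows. This is a simplified illustration: it ranks rows by raw access count rather than the KL-divergence or TF-IDF rules the abstract mentions, uses plain SGD, and all names are hypothetical.

```python
from collections import Counter


def select_rows(batch_accesses, top_t):
    """Return the ids of the `top_t` memory rows read most often in this batch."""
    counts = Counter(row for token_rows in batch_accesses for row in token_rows)
    # Break ties by row id so the selection is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {row for row, _ in ranked[:top_t]}


def sparse_update(memory, grads, trainable_rows, lr):
    """Apply an SGD step to the selected rows only; every other row stays frozen."""
    for row in trainable_rows:
        memory[row] = [w - lr * g for w, g in zip(memory[row], grads[row])]
    return memory
```

Rows outside the selected set never change, which is the mechanism the abstract credits for limiting drift on the forgetting probes.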

2605.04913 2026-06-09 cs.CL cs.LG 版本更新

Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

重新思考局部学习:一种更便宜更快的LLM后训练配方

Hengyu Shi, Tianyang Han, Peizhe Wang, Zhiling Wang, Xu Yang, Junhao Su

发表机构 * Independent Researcher(独立研究者) D 4 Lab(D4实验室) Southeast University(东南大学)

AI总结 本文提出LoPT,一种局部学习后训练策略,通过在transformer中点设置梯度边界,降低内存成本,提高训练效率并保留预训练能力。

Comments 35 pages

详情
AI中文摘要

LLM后训练通常沿模型的完整深度传播任务梯度。尽管这种端到端结构简单通用,但它将任务适应与完整深度的激活存储、长距离反向依赖以及任务梯度对预训练表示的直接访问耦合在一起。我们认为这种完整深度的反向耦合可能带来不必要的开销和侵入性,尤其是在后训练监督远比预训练狭窄时。为此,我们提出LoPT:局部学习后训练,一种简单的后训练策略,使梯度的传播范围成为显式的设计选择。LoPT在transformer中点放置单一梯度边界:后半部分块从任务目标学习,而前半部分块通过轻量级特征重建目标进行更新,以保留有用的表示并保持接口兼容性。LoPT缩短了任务引起的反向路径,同时限制了狭窄任务梯度对早期层表示的直接干扰。大量实验表明,LoPT在取得有竞争力性能的同时,内存成本更低、训练效率更高,并能更好地保留预训练能力。我们的代码可在 https://github.com/HumyuShi/LoPT 获取。

英文摘要

LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose LoPT: Local-Learning Post-Training, a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT

2605.16928 2026-06-09 cs.CL cs.AI 版本更新

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

全注意力再临:在数百次训练步骤内将全注意力转化为稀疏

Yanke Zhou, Yiduo Li, Hanlin Tang, Maohua Li, Kan Liu, Tao Lan, Lin Qu, Yuan Yao, Xiaoxing Ma

发表机构 * Nanjing University(南京大学) Alibaba Group(阿里巴巴集团)

AI总结 本文提出RTPurbo方法,通过利用模型内在稀疏性,在少量训练步骤内实现高效的稀疏注意力,从而在保持接近无损精度的同时,显著提升推理效率。

Comments 20 pages, 9 figures

详情
AI中文摘要

大型语言模型的长上下文推理受到全注意力二次成本的限制。现有的高效替代方法通常依赖于原生稀疏训练或启发式令牌驱逐,导致效率、训练成本和准确性之间存在不理想的权衡。在本文中,我们证明全注意力LLM本质上已经是稀疏的,并且可以通过最小的适应转化为高度稀疏的模型。我们的方法基于三个观察:(1) 只有少量的注意力头真正需要完整的长上下文处理;(2) 长距离检索主要由低维子空间支配,允许相关令牌通过16维索引器高效检索;(3) 有用的令牌预算强烈依赖于查询,使得动态top-p选择比固定top-k稀疏化更合适。基于这些见解,我们提出了RTPurbo,该方法仅保留检索头的完整KV缓存,并引入轻量级令牌索引器进行稀疏注意力。通过利用模型的内在稀疏性,RTPurbo仅在数百次训练步骤内即可实现稀疏化。在长上下文基准和推理任务上的实验表明,RTPurbo在保持接近无损精度的同时,实现了显著的效率提升,包括在100万上下文下的预填充速度提升高达9.36倍,以及解码速度提升约2.01倍。这些结果表明,可以通过标准的全注意力训练获得强大的稀疏推理,而无需昂贵的原生稀疏预训练。

英文摘要

Long-context inference in large language models is bottlenecked by the quadratic cost of full attention. Existing efficient alternatives often rely either on native sparse training or on heuristic token eviction, creating an undesirable trade-off among efficiency, training cost, and accuracy. In this work, we show that full-attention LLMs are already intrinsically sparse and can be transformed into highly sparse models with only minimal adaptation. Our approach is built on three observations: (1) only a small subset of attention heads truly requires full long-context processing; (2) long-range retrieval is governed primarily by a low-dimensional subspace, allowing relevant tokens to be retrieved efficiently with a 16-dimensional indexer; and (3) the useful token budget is strongly query-dependent, making dynamic top-p selection more suitable than fixed top-k sparsification. Based on these insights, we propose RTPurbo, which retains the full KV cache only for retrieval heads and introduces a lightweight token indexer for sparse attention. By exploiting the model's intrinsic sparsity, RTPurbo achieves sparsification with only a few hundred training steps. Experiments on long-context benchmarks and reasoning tasks show that RTPurbo preserves near-lossless accuracy while delivering substantial efficiency gains, including up to a 9.36x prefill speedup at 1M context and about a 2.01x decode speedup. These results suggest that strong sparse inference can be obtained from standard full-attention training without expensive native sparse pretraining.
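Observation (3) above contrasts a query-dependent token budget with a fixed one. The sketch below shows generic top-p selection over indexer scores: it keeps the smallest set of tokens whose softmax mass reaches `p`. It illustrates the selection rule only and is not RTPurbo's indexer; the function names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_select(scores, p):
    """Return the smallest set of token indices whose attention mass reaches p."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)
```

A sharply peaked query keeps very few tokens, while a flat one keeps many, which a fixed top-k budget cannot express.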

2605.28207 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Pruning and Distilling Mixture-of-Experts into Dense Language Models

将混合专家模型剪枝和蒸馏为密集语言模型

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

发表机构 * KRAFTON KAIST(韩国科学技术院)

AI总结 提出首个将混合专家(MoE)模型转换为标准密集架构的系统框架,通过专家评分、选择、分组、拼接和知识蒸馏,在参数匹配条件下比密集到密集剪枝平均下游准确率提升6.3个百分点,训练速度提升1.6倍。

详情
AI中文摘要

混合专家(MoE)现在是前沿语言模型的主导架构,但它需要将所有专家参数加载到内存中,因此在内存受限的部署中不太受欢迎。现有的压缩方法减少了专家数量,但输出仍然是具有相同基本限制的MoE模型。我们提出了第一个将训练好的MoE转换为标准全密集架构的系统框架:专家被评分、选择和分组,然后拼接成密集的前馈网络(FFN),并通过MoE教师的知识蒸馏进行精炼。我们在Qwen3-30B-A3B上评估了7种评分方法、5种分组方法和2种幅度缩放方法,涵盖了多种选定的专家数量,共产生350种配置。我们发现评分方法的选择影响最大,我们提出的新颖的多样性感知评分在Qwen3-30B-A3B、DeepSeek-V2-Lite和GPT-OSS-20B上始终优于先前的方法。在参数匹配的受控比较下,经过约4B token的蒸馏,MoE到密集的转换在平均下游准确率上比密集到密集的剪枝高出6.3个百分点,训练壁钟速度提升1.6倍。

英文摘要

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.

2606.03576 2026-06-09 cs.CL 版本更新

AutoTail-BSFGM: Class-Balance-Aware Fine-Tuning for Chinese Scholarly Text Classification

AutoTail-BSFGM:面向中文学术文本分类的类别平衡感知微调

Anling Xiang, Yuwen Yang, Yang Shen

发表机构 * Department of Intelligent Communication, School of Journalism and Communication, Minzu University of China(中央民族大学新闻与传播学院智能通信系) ZeeLin (Beijing) Technology Co., Ltd.(北京智联科技有限公司) School of Journalism and Communication, Tsinghua University(清华大学新闻与传播学院) College of AI, Tsinghua University(清华大学人工智能学院)

AI总结 提出AutoTail-BSFGM方法,通过自动门控尾部调整、弱平衡Softmax辅助损失和快速梯度法对抗正则化,解决中文学术文本分类中的类别不平衡和语义邻近问题,在CSL数据集上提升了验证集和锁箱集准确率。

Comments 17 pages, 4 figures, 4 tables. Code and data: https://github.com/thu-nmrc/autotail-bsfgm-scholarly-classification

详情
AI中文摘要

学术文本分类支持文献组织、主题标引和研究情报,但中文学术语料库通常包含不平衡且语义邻近的学科标签。我们提出AutoTail-BSFGM,一种类别平衡感知的微调方法,它结合了自动门控尾部先验调整、弱平衡Softmax辅助损失和快速梯度法对抗正则化。该方法仅改变训练目标和过程;推理使用与相应标签平滑基线相同的单一基础规模编码器和线性分类器。我们在两个基于CSL的任务上评估该方法:一个包含67个标签的摘要到学科任务和一个包含13个类别的标题到类别任务。在主要的摘要任务上,AutoTail-BSFGM在中文RoBERTa-WWM和MacBERT-base下均提高了验证集和锁箱集准确率。使用MacBERT-base时,验证集准确率提高0.83个百分点,锁箱集准确率提高0.49个百分点,验证集上的合并配对McNemar检验显著(p = 0.023)。在标题任务上,该方法将验证集准确率提高0.70个百分点,验证集平衡准确率提高2.64个百分点;锁箱集准确率大致中性,而锁箱集平衡准确率提高1.22个百分点。结果支持有界贡献:AutoTail-BSFGM改善了类别平衡敏感行为,并在基于摘要的学术分类中取得一致增益,但并非在每个划分上均匀改善每个指标。

英文摘要

Scholarly text classification supports literature organization, subject indexing, and research intelligence, but Chinese scholarly corpora often contain imbalanced and semantically adjacent disciplinary labels. We propose AutoTail-BSFGM, a class-balance-aware fine-tuning method that combines an automatically gated tail-prior adjustment, a weak Balanced Softmax auxiliary loss, and Fast Gradient Method adversarial regularization. The method changes only the training objective and procedure; inference uses the same single base-size encoder and linear classifier as the corresponding label-smoothed baseline. We evaluate the method on two CSL-based tasks: an abstract-to-discipline task with 67 labels and a title-to-category task with 13 categories. On the primary abstract task, AutoTail-BSFGM improves validation and lockbox accuracy under both Chinese RoBERTa-WWM and MacBERT-base. With MacBERT-base, validation accuracy increases by 0.83 percentage points and lockbox accuracy by 0.49 points, with a pooled paired McNemar signal on validation (p = 0.023). On the title task, the method improves validation accuracy by 0.70 points and validation balanced accuracy by 2.64 points; lockbox accuracy is approximately neutral while lockbox balanced accuracy improves by 1.22 points. The results support a bounded contribution: AutoTail-BSFGM improves class-balance-sensitive behavior and yields consistent gains for abstract-based scholarly classification, without uniformly improving every metric on every split.
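One ingredient named in the abstract above, the Balanced Softmax auxiliary loss, can be sketched as cross-entropy on logits shifted by the log class prior. This is a generic illustration, not the AutoTail-BSFGM code: the gated tail-prior adjustment and FGM regularization are omitted, and `aux_weight` is a hypothetical knob standing in for the "weak" weighting.

```python
import math


def cross_entropy(logits, target):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[target]


def balanced_softmax_loss(logits, target, class_counts):
    """Cross-entropy on logits shifted by the log class prior (Balanced Softmax)."""
    total = sum(class_counts)
    shifted = [z + math.log(n / total) for z, n in zip(logits, class_counts)]
    return cross_entropy(shifted, target)


def combined_loss(logits, target, class_counts, aux_weight=0.1):
    # aux_weight is an assumed value; the abstract only says the auxiliary term is weak.
    return cross_entropy(logits, target) + aux_weight * balanced_softmax_loss(
        logits, target, class_counts
    )
```

With uniform class counts the shift is a constant and the auxiliary term reduces to plain cross-entropy; with skewed counts, errors on rare classes are penalized more heavily.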

2603.05500 2026-06-09 cs.LG cs.AI cs.CL 版本更新

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

POET-X:通过扩展正交变换实现内存高效的LLM训练

Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu

发表机构 * University of Cambridge(剑桥大学)

AI总结 POET-X通过优化正交等价变换降低计算和内存开销,实现高效稳定的LLM训练,支持在单块H100 GPU上预训练十亿参数模型。

Comments ICML 2026 Oral (15 pages, 7 figures, project page: https://spherelab.ai/poetx/)

详情
AI中文摘要

高效且稳定的大型语言模型(LLM)训练仍然是现代机器学习系统的核心挑战。为解决这一挑战,已有工作提出了重新参数化正交等价训练(POET),这是一种保持谱的框架,通过正交等价变换优化每个权重矩阵。尽管POET提供了强大的训练稳定性,但其原始实现由于密集的矩阵乘法导致高内存消耗和计算开销。为克服这些限制,我们引入了POET-X,一种可扩展且内存高效的变体,以显著降低的计算成本执行正交等价变换。POET-X在保持POET的泛化和稳定性优势的同时,实现了吞吐量和内存效率的显著提升。在我们的实验中,POET-X能够在单块Nvidia H100 GPU上预训练十亿参数的LLM,而标准优化器如AdamW在相同设置下会因内存不足而失败。

英文摘要

Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.

2605.18643 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Post-Trained MoE Can Skip Half Experts via Self-Distillation

后训练MoE可通过自蒸馏跳过一半专家

Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

发表机构 * Frontis.AI Kuaishou Technology(快手科技) Shanghai AI Lab(上海人工智能实验室) TsinghuaC3I/ZEDA(清华大学C3I/ZEDA)

AI总结 本文提出ZEDA框架,通过自蒸馏将后训练的静态MoE模型转换为高效的动态MoE模型,显著减少专家FLOPs并提升推理速度。

详情
AI中文摘要

混合专家(MoE)通过稀疏专家激活高效地扩展语言模型,其动态变体进一步通过输入依赖的方式调整激活专家以减少计算。现有动态MoE方法通常依赖从头训练或任务特定适应,使完全训练的MoE的实际转换未被充分探索。启用此类适应可直接缓解推理成本,通过允许简单令牌在服务时绕过不必要的专家。本文引入了零专家自蒸馏适应(ZEDA),一种低成本框架,将后训练的静态MoE模型转换为高效的动态MoE模型。为稳定此架构转换,ZEDA在每个MoE层中注入无参数的零输出专家,并通过两阶段自蒸馏适应增强模型,利用原始MoE作为冻结的教师,并应用组级平衡损失。在Qwen3-30B-A3B和GLM-4.7-Flash上跨11个基准测试(涵盖数学、代码和指令跟随)中,ZEDA在边际精度损失下消除了超过50%的专家FLOPs。在两个模型上,ZEDA比最强的动态MoE基线分别高出6.1和4.0个点,并提供约1.20倍的端到端推理加速。

英文摘要

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE models underexplored. Enabling such adaptation would directly reduce inference costs by allowing easy tokens to bypass unnecessary experts during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20x end-to-end inference speedup.
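The role of the parameter-free zero-output experts can be sketched with a toy router: zero experts compete for the same top-k slots as real experts, and any slot they win costs no compute. This is an illustration under assumptions, not ZEDA itself: gate weighting, the balancing loss, and self-distillation are omitted, and the function names are hypothetical.

```python
def route_with_zero_experts(router_logits, num_real, top_k):
    """Pick top_k slots; indices >= num_real denote zero-output experts."""
    order = sorted(range(len(router_logits)), key=lambda i: -router_logits[i])
    chosen = order[:top_k]
    # Only real experts need to be executed; zero experts are simply dropped.
    return [i for i in chosen if i < num_real]


def moe_output(x, experts, router_logits, top_k):
    """Unweighted sum of the selected real experts, plus the number executed."""
    real = route_with_zero_experts(router_logits, len(experts), top_k)
    out = 0.0
    for i in real:
        out += experts[i](x)  # zero experts contribute nothing and cost no FLOPs
    return out, len(real)
```

A token whose router favors zero experts executes fewer real experts, which is how a static top-k MoE becomes input-dependent in this sketch.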

12. 其他/综合NLP 24 篇

2606.07753 2026-06-09 cs.CL 新提交

ReadingMachine: A Computational Methodology for Structured Corpus Reading and Large-Scale Synthesis

ReadingMachine:一种结构化语料库阅读与大规模综合的计算方法

James Morrissey

发表机构 * GitHub

AI总结 提出ReadingMachine方法,利用大语言模型对文档集合进行有界阅读,通过洞察提取、语义聚类、主题生成和迭代遗漏检测等可检查阶段,实现大规模语料库的覆盖性、可追溯性和分歧保留。

Comments 32 pages, 1 figure

详情
AI中文摘要

ReadingMachine是一种用于结构化语料库阅读的计算方法,它利用大语言模型对整个文档集合执行有界阅读操作。该方法不依赖于检索或递归摘要,而是将分析分解为可检查的阶段,包括洞察提取、语义聚类、主题生成和迭代遗漏检测。通过延迟不可逆压缩并显式跟踪中间表示,该方法优先考虑大规模语料库的覆盖性、可追溯性和分歧保留。该系统在包含152份产业政策文档的异质语料库上进行了演示,提取了超过17,500条洞察并生成了结构化的主题图。ReadingMachine作为用于大规模定性综合和语料库分析的开源实验框架发布。

英文摘要

ReadingMachine is a computational methodology for structured corpus reading that uses large language models to perform bounded reading operations over entire document collections. Rather than relying on retrieval or recursive summarization, the approach decomposes analysis into inspectable stages including insight extraction, semantic clustering, theme generation, and iterative omission detection. By delaying irreversible compression and explicitly tracking intermediate representations, the method prioritizes coverage, traceability, and preservation of disagreement across large corpora. The system is demonstrated on a heterogeneous corpus of 152 industrial policy documents, producing more than 17,500 extracted insights and a structured thematic map. ReadingMachine is released as an open-source experimental framework for large-scale qualitative synthesis and corpus analysis.

2606.08254 2026-06-09 cs.CL 新提交

SSR: Can Simulated Patients Learn to Stigmatize Themselves? Modeling Self-Stigma through Internal Monologue

SSR: 模拟患者能否学会自我污名化?通过内心独白建模自我污名

Kunyao Lan, Bingrui Jin, Zichen Zhu, Mengyue Wu

发表机构 * Shanghai Jiao Tong University(上海交通大学) X-LANCE Lab, Dept. of Computer Science and Engineering(X-LANCE实验室,计算机科学与工程系) MoE Key Lab of Artificial Intelligence, AI Institute(教育部人工智能重点实验室,人工智能研究院)

AI总结 提出基于心理3A1H模型的SSR框架,通过内心独白数据集和链式思维微调LLM,使模拟患者根据对话触发动态调整污名表达,生成更真实的情境适应性反应。

详情
AI中文摘要

使用大语言模型(LLM)模拟患者是心理健康训练的一种有前景的工具,但现有方法未能捕捉一个关键的临床现实:自我污名。经历自我污名的患者,即内化负面刻板印象,通常表现出情境敏感性的抵抗,如回避、否认或自责,而当前模型将其呈现为静态或统一顺从的行为。为了解决这一问题,我们引入了一个基于自我污名化心理3A1H模型的新型模拟框架。我们的核心创新是创建了一个污名化自我反思(SSR)数据集,在该数据集中,我们通过反映污名意识推理的内心独白来增强心理健康对话。通过使用链式思维方法对LLM进行微调,我们训练患者代理根据对话触发动态调整其污名水平和表达方式。评估表明,我们的方法显著优于专门的基线,生成了更真实且情境适当的患者反应。这项工作为临床训练和共情对话系统的现实污名模拟迈出了关键一步。

英文摘要

Simulating patients with large language models (LLMs) is a promising tool for mental health training, but existing approaches fail to capture a key clinical reality: self-stigma. Patients experiencing self-stigma, the internalization of negative stereotypes, often exhibit context-sensitive resistance, such as avoidance, denial, or self-blame, which current models render as static or uniformly compliant behavior. To address this, we introduce a novel simulation framework grounded in the psychological 3A1H model of self-stigmatization. Our core innovation is the creation of a Stigmatized Self-Reflection (SSR) dataset, where we augment mental health dialogues with internal monologues that reflect stigma-aware reasoning. By fine-tuning LLMs with this data using a chain-of-thought approach, we train patient agents to dynamically adjust their level and expression of stigma based on conversational triggers. Evaluations demonstrate that our approach significantly outperforms specialized baselines, generating more authentic and situationally appropriate patient responses. This work provides a crucial step towards realistic stigma simulation for clinical training and empathetic dialogue systems.

2606.08307 2026-06-09 cs.CL 新提交

Understanding the Sociocultural Dimensions of Mental Health Discourse in Arabic-Language X Communities

理解阿拉伯语X社区中心理健康话语的社会文化维度

Amal Alqahtani, Rana Salama, Mona Diab

发表机构 * King Saud University(沙特国王大学) Cairo University(开罗大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 通过GPT-4.1识别个人披露的推特用户,分析边缘型人格障碍、双相障碍和ADHD相关话语,发现不同病症的词汇模式差异,提出可复用的LLM辅助披露流程和文化关键词框架。

Comments Accepted to the SMM4H-HeaRD Workshop, co-located with the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

详情
AI中文摘要

计算心理健康研究主要集中于英语人群,阿拉伯语话语相对缺乏研究。我们提出一项探索性计算研究,包含来自607名用户的8147条推文,这些用户被GPT-4.1个人披露流程分类为三个特定病症的阿拉伯语X(原Twitter)社区中可能具有亲身经历的作者。我们关注与边缘型人格障碍(BPD)、双相障碍和ADHD相关的话语,并使用多领域文化关键词框架描述社区相关的语言模式。结果表明,在该语料库中,双相障碍推文包含更多宗教和医学词汇,BPD推文包含更多关系、身份和情绪困扰词汇,而ADHD推文更常关注实际症状和药物管理。我们将这些模式视为假设生成而非验证性的,因为语料库在不同病症间不平衡,某些子语料库在时间上集中,且关键词框架是初步操作化而非经过验证的测量工具。本文贡献了一个可复用的LLM辅助个人披露流程和一个针对阿拉伯语心理健康话语的探索性文化关键词框架。

英文摘要

Computational mental health research has predominantly centered on English-speaking populations, leaving Arabic-language discourse comparatively under-examined. We present an exploratory computational study of 8,147 tweets from 607 users classified by a GPT-4.1 personal-disclosure pipeline as likely lived-experience authors in three condition-specific Arabic-language X (formerly Twitter) Communities. We focus on discourse related to borderline personality disorder (BPD), bipolar disorder, and ADHD, and characterize community-associated linguistic patterns using a multi-domain cultural keyword framework. The results suggest that in this corpus, Bipolar tweets contain more religious and medical vocabulary, BPD tweets contain more relational, identity, and emotional-distress vocabulary, and ADHD tweets more often focus on practical symptoms and medication management. We treat these patterns as hypothesis-generating rather than confirmatory because the corpus is imbalanced across conditions, some subcorpora are temporally concentrated, and the keyword framework is an initial operationalization rather than a validated measurement instrument. The paper contributes a reusable LLM-assisted personal-disclosure pipeline and an exploratory cultural keyword framework for Arabic mental health discourse.

2606.08545 2026-06-09 cs.CL cs.SE 新提交

Ishigaki-IDS: An Open-Weight Verifier-Aware Model for Information Delivery Specification Drafting in Building Information Modeling

Ishigaki-IDS:一种面向建筑信息模型中信息交付规范起草的开放权重验证器感知模型

Ryo Kanazawa, Koyo Hidaka, Teppei Miyamoto, Takayuki Kato, Tomoki Ando, Chenguang Wang, Dayuan Jiang, Naofumi Fujita, Shuhei Saitoh, Atomu Kondo, Koki Arakawa, Daiho Nishioka

发表机构 * ONESTRUCTION Inc.(ONESTRUCTION公司) AWS GenAI Innovation Center(AWS生成式AI创新中心)

AI总结 针对BIM项目中IDS编写瓶颈,提出开放权重LLM Ishigaki-IDS,结合持续预训练、监督微调和基于验证器奖励的强化学习,生成可通过外部验证器检查的IDS草案,在基准上显著优于基线,并减少54.7%工作时间。

Comments 8 pages, 2 figures, 5 tables. Preprint

详情
AI中文摘要

建筑信息模型(BIM)项目需要将信息需求描述为机器可检查的信息交付规范(IDS)文件,以验证建筑模型是否包含所需属性。然而,IDS编写仍然是一个实际瓶颈:从业人员必须处理领域词汇、严格的XML模式约束和外部验证器一致性,同时还要检查需求本身是否正确表达。我们提出了Ishigaki-IDS,一个专门用于验证器感知IDS草案生成的开放权重LLM。该模型结合了在BIM/IDS语料库上的持续预训练、信息需求到IDS对的监督微调,以及来自外部验证器的可验证奖励的强化学习。目标不是取代专家审查,而是将IDS编写从低层级的XML和模式修复转向验证器可加载的草案,供从业人员检查和修正。在由166个专家创建的Ishigaki-IDS-Bench上,Ishigaki-IDS-8B的IDSAuditPass得分为0.651(生成的IDS文件的验证器通过指标),显著优于我们评估的最强单次LLM基线Claude Opus 4.5(0.331)。它还获得了0.282的Audit-Gated FacetF1,衡量验证器通过草案中的需求方面对齐度。相同的配方可扩展:14B和32B变体分别达到IDSAuditPass 0.753/0.693和Audit-Gated FacetF1 0.392/0.369。在与六位BIM从业者的工作流检查中,在相同的验证和对齐终点下,Ishigaki辅助编写减少了54.7%的总工作时间。这些结果表明,验证器感知的IDS生成可以减轻将BIM信息需求转换为可审查IDS草案的实际负担。

英文摘要

Building Information Modeling (BIM) projects require information requirements to be described as machine-checkable Information Delivery Specification (IDS) files in order to verify whether building models contain the required attributes. However, IDS authoring remains a practical bottleneck: practitioners must handle domain vocabulary, strict XML schema constraints, and external validator conformance while also checking whether the requirement itself is correctly expressed. We present Ishigaki-IDS, an open-weight LLM specialized for verifier-aware IDS draft generation. The model combines continued pretraining on BIM/IDS corpora, supervised fine-tuning on information-requirement-to-IDS pairs, and reinforcement learning with verifiable rewards from an external validator. The goal is not to replace expert review, but to move IDS authoring from low-level XML and schema repair toward validator-loadable drafts that practitioners can inspect and correct. On the 166-case expert-created Ishigaki-IDS-Bench, Ishigaki-IDS-8B achieves an IDSAuditPass score of 0.651, a validator-pass metric for generated IDS files, substantially outperforming Claude Opus 4.5, the strongest single-shot LLM baseline we evaluated, at 0.331. It also obtains an Audit-Gated FacetF1 of 0.282, which measures requirement-facet alignment among validator-passing drafts. The same recipe scales: 14B and 32B variants reach IDSAuditPass 0.753 / 0.693 and Audit-Gated FacetF1 0.392 / 0.369. In a workflow check with six BIM practitioners, Ishigaki-assisted authoring reduced aggregate work time by 54.7% under the same validation and alignment endpoint. These results suggest that verifier-aware IDS generation can reduce the practical burden of converting BIM information requirements into reviewable IDS drafts.

2606.09251 2026-06-09 cs.CL 新提交

TruthSplit: Operationalizing Conditional Validity in Arguments Through Multi-Perspective Reasoning

TruthSplit:通过多视角推理实现论证中的条件有效性操作化

Benjamin Stieger, Maximilian Terberger, Thomas Huber, Christina Niklaus

发表机构 * University of St. Gallen(圣加仑大学)

AI总结 提出TruthSplit系统,通过三层自然语言推理和结构化世界观档案,实现基于不同视角的论证条件有效性分析,识别价值冲突与假设差异。

Comments Demo paper. To appear at ACL 2026

详情
AI中文摘要

我们提出TruthSplit,一个用于多视角论证分析的交互式系统。现有的论证工具通常分析论证本身的属性,如结构、质量、立场或说服力,而将特定视角的背景知识隐含起来。TruthSplit通过支持探索性分析来填补这一空白,即当通过世界观特定的价值观、假设和概念定义来解释时,同一主张如何导致不同的结论。我们将这种依赖于视角的分析称为条件有效性。给定输入的论证文本,TruthSplit提取主张和前提,应用三层自然语言推理(NLI)方法来评估逻辑和世界观特定的规范性一致性,并将大语言模型(LLM)推理条件化为编码核心价值观和决策原则的结构化世界观档案。然后,系统生成特定视角的解释,识别价值冲突和假设差距,并通过交互式分析界面可视化分歧。

英文摘要

We present TruthSplit, an interactive system for multi-perspective argument analysis. Existing argumentation tools typically analyze properties of the argument itself, such as structure, quality, stance, or persuasiveness, while leaving perspective-specific background knowledge implicit. TruthSplit addresses this gap by supporting an exploratory analysis of how the same claim can lead to different conclusions when interpreted through worldview-specific values, assumptions, and conceptual definitions. We refer to this perspective-dependent analysis as conditional validity. Given an input argumentative text, TruthSplit extracts claims and premises, applies a three-layer natural language inference (NLI) approach to assess both logical and worldview-specific normative consistency, and conditions large language model (LLM) reasoning on structured worldview profiles that encode core values and decision principles. The system then generates perspective-specific interpretations, identifies value conflicts and assumption gaps, and visualizes divergence through interactive analytical interfaces.

2606.09822 2026-06-09 cs.CL cs.FL 新提交

Causally Evaluating the Learnability of Formal Language Tasks

因果评估形式语言任务的可学习性

Vésteinn Snæbjarnarson, Anej Svete, Josef Valvoda, Reda Boumasmoud, Brian DuSell, Ryan Cotterell

发表机构 * ETH Zürich(苏黎世联邦理工学院) University of Copenhagen(哥本哈根大学)

AI总结 通过引入分箱半环控制目标属性频率,结合因果图模型和分解KL散度,证明标准相关性评估在形式语言任务可学习性分析中存在混淆偏差。

详情
AI中文摘要

语言模型作为多任务学习器,在训练过程中获得广泛能力。一个基本问题是学习给定任务需要多少特定任务数据。在自然语言中回答这个问题很困难:任务难以界定且可能相互混淆。为了严格研究数据频率与可学习性之间的关系,我们转向使用从概率有限自动机导出的形式语言的受控设置。这作为方法论测试平台,证明标准相关性评估实践固有缺陷。为了实现因果分析,我们引入了分箱半环,这是一种代数对象,允许我们控制目标属性在采样语料库中出现的频率。我们将实验流程表述为因果图模型,并推导出分解的Kullback-Leibler散度指标来衡量特定子任务的可学习性。我们的实验表明,在没有因果干预的情况下评估可学习性会由于相关性分析中的混淆因素导致错误结论,并警示自然语言环境中的相关性陷阱。

英文摘要

Language models, as multi-task learners, acquire a wide range of abilities during training. A fundamental question is how much task-specific data is needed to learn a given task. Answering this for natural language is difficult: tasks are hard to delineate and can confound one another. To rigorously investigate the relationship between data frequency and learnability, we turn to a controlled setting using formal languages induced from probabilistic finite automata. These serve as a methodological testbed to demonstrate that standard correlational evaluation practices are inherently flawed. To enable causal analysis, we introduce the binning semiring, an algebraic object that lets us control how often a targeted property occurs in a sampled corpus. We formulate the experimental pipeline as a causal graphical model and derive decomposed Kullback-Leibler divergence metrics to measure the learnability of specific sub-tasks. Our experiments show that evaluating learnability without causal intervention leads to incorrect conclusions due to confounders in correlational analysis, and serve as a warning about correlational pitfalls in natural-language settings.

2606.07727 2026-06-09 quant-ph cs.CL math.OC q-fin.PM 交叉投稿

Benchmarking Quantum Algorithmic Resilience for CVaR Portfolio Optimization: The Expressibility-Coherence Trade-off

面向CVaR投资组合优化的量子算法韧性基准测试:可表达性-相干性权衡

Prashik N. Somkuwar, K. Srinivasan, G. Raghavan

发表机构 * School of Quantum Technology, Defence Institute of Advanced Technology (DIAT), Pune(国防高等技术学院量子技术学院,浦那)

AI总结 针对混合均值方差与条件风险价值投资组合优化,对比硬件高效变分量子神经网络与热启动量子近似优化算法,揭示NISQ设备上算法可表达性与硬件相干性之间的关键权衡。

Comments 10 pages, 11 figures. Master's thesis research conducted at the School of Quantum Technology, Defence Institute of Advanced Technology (DIAT), Pune

详情
AI中文摘要

量子组合优化为复杂金融建模提供了理论优势,但在噪声中等规模量子(NISQ)设备上的物理实现受到硬件拓扑的严重限制。本研究针对混合均值方差与条件风险价值(CVaR)投资组合目标,对硬件高效变分量子神经网络(HE-VQNN)和热启动量子近似优化算法(WS-QAOA)进行了硬件基准测试分析。通过实现一种新颖的经典量子混合代理矩阵来绕过CVaR辅助量子比特瓶颈,我们将NIFTY 50指数中多达16个资产映射到IBM heavy hex处理器上。我们系统地量化了算法对路由过程中产生的“SWAP代价”的韧性。实证结果揭示了一个关键的操作权衡:WS-QAOA提供了精确的理论映射,但由于指数级的非局部门开销而遭受灾难性的硬件退相干。相反,HE-VQNN保持了硬件相干性,但缺乏捕捉密集尾部风险资产相关性的数学可表达性。本研究揭示了当前架构下密集金融优化的局限性,迫使在算法不可表达性与硬件退相干之间做出不可行的选择。这指示了在缺乏全连接性的NISQ计算机上能做什么和不能做什么的更深层限制。

英文摘要

Quantum combinatorial optimization offers theoretical advantages for complex financial modeling, but physical implementation on Noisy Intermediate Scale Quantum (NISQ) devices is severely constrained by hardware topology. This study presents a hardware benchmarking analysis between a Hardware Efficient Variational Quantum Neural Network (HE-VQNN) and the Warm Start Quantum Approximate Optimization Algorithm (WS-QAOA) for a hybrid Mean Variance and Conditional Value at Risk (CVaR) portfolio objective. By implementing a novel classical quantum hybrid proxy matrix to bypass the CVaR auxiliary qubit bottleneck, we map up to 16 assets from the NIFTY 50 index onto an IBM heavy hex processor. We systematically quantify algorithmic resilience to the "SWAP tax" incurred during routing. Empirical results reveal a critical operational trade-off: WS-QAOA provides exact theoretical mapping but suffers catastrophic hardware decoherence due to exponential nonlocal gate overhead. Conversely, HE-VQNN preserves hardware coherence but lacks the mathematical expressibility to capture dense tail risk asset correlations. This study shows that dense financial optimization on current architectures forces a nonviable choice between algorithmic inexpressibility and hardware decoherence. This points to a deeper limitation on what can and cannot be done with NISQ computers that lack all-to-all connectivity.
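For readers unfamiliar with the risk measure named in the abstract above, here is a minimal classical sketch of empirical CVaR: the mean of the worst (1 - alpha) fraction of scenario losses. It is not the authors' quantum pipeline or proxy matrix. The function names, the scenario losses, and the `risk_weight` blend are illustrative assumptions, and the blend uses expected loss rather than the paper's mean-variance term.

```python
import math
from typing import Sequence


def empirical_cvar(losses: Sequence[float], alpha: float = 0.95) -> float:
    """Mean of the worst (1 - alpha) fraction of losses (larger loss = worse)."""
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    ordered = sorted(losses, reverse=True)  # worst losses first
    # Round before ceil so float noise (e.g. 1.0000000000000009) does not add a sample.
    tail_size = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    return sum(ordered[:tail_size]) / tail_size


def mean_cvar_objective(losses: Sequence[float], alpha: float = 0.95,
                        risk_weight: float = 0.5) -> float:
    """Hypothetical blend (an assumption): expected loss plus a weighted CVaR penalty."""
    expected_loss = sum(losses) / len(losses)
    return expected_loss + risk_weight * empirical_cvar(losses, alpha)


# Made-up scenario losses, for illustration only.
scenario_losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
tail_risk = empirical_cvar(scenario_losses, alpha=0.8)  # averages the two worst losses
```

Conventions for CVaR differ (loss versus return sign, and how the tail boundary is handled), so the tail-size rule here is one reasonable choice, not a standard.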

2606.08297 2026-06-09 econ.TH cs.CL 交叉投稿

Strategic Type Spaces

策略类型空间

Olivier Gossner, Rafael Veiel

发表机构 * CNRS - École Polytechnique(法国国家科学研究中心-巴黎综合理工学院) London School of Economics(伦敦经济学院) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出策略商概念,证明最小策略类型空间的存在性与唯一性,并揭示其递归结构可由有限自动机刻画。

详情
AI中文摘要

我们为信息提供了策略基础:在任意给定的不完全信息博弈中,我们将策略商定义为足以让玩家计算对其他玩家最优反应的信息表示。我们证明:1)存在且本质唯一的最小策略商,称为策略类型空间(STS),其中类型由中间相关理性化层级给出,并代表一组关于其他玩家类型和自然的信念,这些信念理性化了该层级;2)最小STS具有递归结构,该结构可由有限自动机捕获。

英文摘要

We provide a strategic foundation for information: in any given game with incomplete information we define strategic quotients as information representations that are sufficient for players to compute best-responses to other players. We prove 1/ existence and essential uniqueness of a minimal strategic quotient called the Strategic Type Space (STS) in which a type is given by an interim correlated rationalizability hierarchy and represents a set of beliefs over other players' types and nature that rationalize this hierarchy and 2/ that the minimal STS has a recursive structure that is captured by a finite automaton.

2606.08728 2026-06-09 cs.AI cs.CL cs.CV cs.LG 交叉投稿

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

人工智能数学推理:语言模型、神经符号系统与验证发现的综合综述

Syed Rifat Raiyan, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 本文综述了数学推理领域从早期规则系统到当代推理模型、多智能体系统及验证发现工作流的演变,沿非正式推理、形式推理、数学发现及推理技术四轴组织,并评估了基准测试、失败模式及未来方向。

Comments Under review, 47 pages, 14 figures, 22 tables

详情
AI中文摘要

数学推理长期以来一直是机器智能的严格测试;在过去十年中,它已从NLP中的一个边缘问题发展为最重要的人工智能前沿之一。本综述对该领域的演变进行了统一阐述,从早期基于规则的数学文字题(MWP)求解器和模板驱动的几何系统,到神经表达式生成和LLM提示,再到当代推理模型、多智能体系统、神经符号定理证明器和验证发现工作流。我们沿四个轴组织该领域:(i) 文本和图表的非正式推理,涵盖MWP求解、多模态几何和VLM;(ii) 证明助手的形式推理,包括自动形式化、策略预测、编译器引导修复和证明搜索;(iii) 数学发现,其中系统提出构造、改进界限或协助攻击开放问题;以及(iv) 推理和训练时技术,包括CoT提示、工具使用、过程奖励模型和RLVR,这些技术日益将生成与验证联系起来。我们编目了涵盖小学算术、竞赛数学、几何、形式证明、多模态和多语言推理以及专家评估的主要基准,并考察了基准饱和、污染、报告不匹配以及pass@1、多数投票和验证器辅助pass@$k$之间的区别。我们批判性地评估了失败模式:扰动下的脆弱性、奖励黑客、多模态基础失败、脆弱形式化以及推理规模推理的能源成本。借鉴来自在职数学家的近期观点,我们确定了未来方向,集中于验证发现工作流、推理效率以及使AI辅助形式化广泛可用的基础设施。配套材料:https://github.com/Starscream-11813/awesome-AI4Math。

英文摘要

Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field's evolution, from early rule-based math word problem (MWP) solvers and template-driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi-agent systems, neuro-symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler-guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training-time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier-assisted pass@$k$. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning-scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure to make AI-assisted formalization broadly usable. Companion materials: https://github.com/Starscream-11813/awesome-AI4Math.
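The abstract above distinguishes pass@1, majority voting, and pass@$k$. As a rough illustration of how these differ, the sketch below uses the widely used combinatorial pass@k estimator (n samples, c of them correct) and a plain majority vote over final answers. The function names and sample counts are assumptions for illustration and do not come from the survey itself.

```python
from collections import Counter
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) from n samples with c correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    # 1 minus the fraction of size-k subsets that contain no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer (ties go to the first one seen)."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return Counter(answers).most_common(1)[0][0]


# Illustrative numbers: 10 samples for one problem, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
voted = majority_vote(["42", "41", "42"])
```

With the same samples, pass@5 is much higher than pass@1, while majority voting commits to a single answer. This is why results reported under different protocols are not directly comparable.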

2606.09024 2026-06-09 cs.IR cs.CL cs.HC cs.SI 交叉投稿

Personal Salience: Highlighting Is Social, but Individuality Lives in Selection

个人显著性:高亮是社交性的,但个性存在于选择中

Kazuki Nakayashiki, Keisuke Watanabe

发表机构 * Glasp Inc.(Glasp公司)

AI总结 通过共同阅读身份控制实验,发现高亮行为主要受群体影响(群体模型预测显著优于个人模型),但个人历史在选择已显著段落时表现出强预测力(差距+0.14),揭示个性更多体现在选择而非显著性上。

Comments 12 pages, 5 figures, 2 tables

详情
AI中文摘要

社交高亮工具让用户标记对他们重要的段落。我们通过共同阅读身份控制(同一文档被多个用户高亮)来探究从这些自然痕迹中能恢复多少个体信息,该方法固定文档和主题,询问一个人的历史是否比另一个读者的历史更能预测其标记。我们区分了通用显著性(结构)、群体显著性(他人标记的内容)和个人显著性(个体残差)。首先,高亮是社交性的:你标记的句子被群体预测得远好于结构或个人模型,甚至一个估计良好的群体(信息特权基线,能看到同一文档上他人的标记)也胜过基于你其他文档历史构建的前沿LLM孪生模型;文档内的个人信号最多是低语(嵌入评分器上的自己与他人差距+0.017,虽小但显著)。其次,与此形成鲜明对比的是,个性存在于选择中:当问及哪些已显著的段落是你的时,你自己的历史是一个强且无泄漏的预测器(差距+0.14)。主题分解显示这主要是稳定的主题偏好:与主题匹配的同伴相比,它缩小约6-8倍,且薄残差无法与更细的主题区分。非显而易见的部分是不对称性:在相同评分器下,个人信号在显著性中比在选择中弱约6-8倍。方法上,朴素的历史条件评估存在泄漏(目标自己的标记在约42%的配对中进入档案,使个人得分最多增加+0.15 AP),且小群体夸大个性化;我们的结果无泄漏,使用密集群体和模型匹配控制。高亮携带真实的个人签名,但只是强共享签名上的薄层,更多地体现在一个人选择哪些显著段落而非什么显著上。

英文摘要

Social highlighters let people mark passages that matter to them. We ask how much of an individual is recoverable from these naturalistic traces, using a co-readership identity control (the same document highlighted by many users) that holds document and topic fixed and asks whether a person's own history predicts their marks better than another reader's does. We separate generic salience (structure), crowd salience (what others marked), and personal salience (the individual residual). First, highlighting is social: which sentences you mark is predicted far better by the crowd than by structure or by a personal model, and even a well-estimated crowd, an information-privileged baseline that sees others' marks on the same document, beats a frontier LLM twin built from your other-document history; the within-document personal signal is at most a whisper (own-vs-other gap +0.017 by an embedding scorer, small but significant). Second, in sharp contrast, individuality lives in selection: asked which of the already-salient passages are yours, your own history is a strong, leakage-free predictor (gap +0.14). A topic decomposition shows this is largely stable thematic preference: it shrinks ~6-8x against a topically-matched peer, and a thin residual cannot be separated from finer topic. The non-obvious part is an asymmetry: under the same scorer the individual signal is ~6-8x weaker in salience than in selection. Methodologically, naive history-conditioning evaluations leak (the target's own marks enter the profile in ~42% of pairs, inflating personal scores by up to +0.15 AP) and small crowds overstate personalization; our results are leakage-free, use a dense crowd, and a model-matched control. Highlights carry a genuine individual signature, but a thin layer over a strong shared one, surfacing far more in which salient things a person selects than in what is salient.

2606.09532 2026-06-09 cs.CY cs.CL 交叉投稿

Interpretable Crisis Behavior Analysis Using Mobility and Social Media Data

基于移动性和社交媒体数据的可解释危机行为分析

Muhammad Hamza Arshad Majeed, Sidahmed Benabderrahmane, Talal Rahwan

发表机构 * New York University (NYUAD)(纽约大学(NYUAD))

AI总结 提出统一可解释流水线,融合移动性和社交媒体数据,通过形式概念分析和关联规则挖掘,识别危机中跨域行为模式,并在洛杉矶山火和COVID-19案例中验证,生成可操作的政策简报。

详情
AI中文摘要

危机改变了人们的移动方式和沟通方式。在野火和流行病等紧急情况下,移动模式的变化和在线情感话语共同演变,但通常被孤立研究。本文提出了一个统一且可解释的流水线,整合移动性和社交媒体数据,以识别危机环境中的跨域行为模式。该框架通过两个案例研究进行评估:2025年1月洛杉矶野火的短期分析(原型案例)和2020年3月至2021年12月阿联酋COVID-19行为的纵向分析(主要案例,671天)。该流水线对齐异构每日信号,将其转换为二元行为状态,应用形式概念分析(FCA)提取共现结构,挖掘关联规则,并通过时间顺序保留测试验证规则稳定性。一个结构化的政策翻译层将稳健规则转化为操作简报,指定触发条件、提前时间和行动方案。结果揭示了两种危机中清晰的跨域行为结构。在野火案例中,交通压力、恐惧/愤怒情绪和治理话语在33天窗口内紧密耦合,关键规则达到100%置信度,提升度高达2.5。在COVID案例中,重复的移动适应和情绪波动产生了8条稳定的同日规则(88%保留通过率)和40条清晰的预测规则,提前时间为2-7天。该工作表明,可解释的多模态融合可以产生既科学可信又政策可操作的危机情报。

英文摘要

Crises alter both how people move and how they communicate. During emergencies such as wildfires and pandemics, changes in mobility patterns and online emotional discourse evolve jointly, yet they are typically studied in isolation. This paper presents a unified and interpretable pipeline that integrates mobility and social media data to identify cross-domain behavioral patterns in crisis settings. The framework is evaluated through two case studies: a short-horizon analysis of the January 2025 Los Angeles wildfires (prototype case) and a longitudinal analysis of UAE COVID-19 behavior from March 2020 to December 2021 (primary case, 671 days). The pipeline aligns heterogeneous daily signals, transforms them into binary behavioral states, applies Formal Concept Analysis (FCA) to extract co-occurrence structure, mines association rules, and validates rule stability through chronological holdout testing. A structured policy-translation layer renders robust rules as operational briefs specifying triggers, lead times, and action playbooks. Results reveal clear cross-domain behavioral structure in both crises. In the wildfire case, traffic stress, fear/anger sentiment, and governance discourse are tightly coupled within a 33-day window, with key rules reaching 100% confidence and lift scores up to 2.5. In the COVID case, repeated mobility adaptation and sentiment volatility yield 8 stable same-day rules (88% holdout pass rate) and 40 clean predictive rules with 2-7 day lead horizons. The work demonstrates that interpretable multimodal fusion can produce both scientifically credible and policy-actionable crisis intelligence.
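The confidence and lift figures quoted above are standard association-rule metrics. The sketch below computes them for a rule A → B over binary daily states and is meant only to show what the numbers measure, not to reproduce the authors' FCA pipeline. The state names and the five example days are invented.

```python
from typing import Dict, List, Set


def rule_metrics(days: List[Set[str]], antecedent: Set[str],
                 consequent: Set[str]) -> Dict[str, float]:
    """Support, confidence, and lift of antecedent -> consequent over daily state sets."""
    n = len(days)
    if n == 0:
        raise ValueError("need at least one day")
    n_a = sum(1 for d in days if antecedent <= d)                  # days where A holds
    n_b = sum(1 for d in days if consequent <= d)                  # days where B holds
    n_ab = sum(1 for d in days if (antecedent | consequent) <= d)  # days where both hold
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0                        # estimate of P(B | A)
    lift = confidence / (n_b / n) if n_b else 0.0                  # P(B | A) / P(B)
    return {"support": support, "confidence": confidence, "lift": lift}


# Invented binary behavioral states for five days (illustration only).
days = [
    {"traffic_stress", "fear_anger"},
    {"traffic_stress", "fear_anger"},
    {"fear_anger"},
    set(),
    set(),
]
metrics = rule_metrics(days, {"traffic_stress"}, {"fear_anger"})
```

A confidence of 100% only says the consequent always accompanied the antecedent in the observed window. Lift above 1 indicates the pairing is more frequent than independence would predict.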

2211.05583 2026-06-09 cs.CL math.OC 版本更新

Toward automatic generation of control structures for process flow diagrams with large language models

面向工艺流程图控制结构自动生成的大语言模型方法

Edwin Hirtreiter, Lukas Schulze Balhorn, Artur M. Schweidtmann

发表机构 * University of Zurich(苏黎世大学)

AI总结 提出一种基于Transformer的端到端方法,将控制结构预测视为翻译任务,利用SFILES 2.0表示PFD拓扑,通过预训练和微调实现自动生成,在生成数据上达到74.8%-89.2%的Top-5准确率。

详情
Journal ref
AIChE Journal, Volume 70, Issue 1, January 2024, e18259
AI中文摘要

开发管道和仪表图(P&IDs)是工艺开发中的关键步骤。我们提出了一种数据驱动的控制结构预测方法。我们的方法受基于Transformer的端到端人类语言翻译模型启发。我们将控制结构预测视为翻译任务,其中没有控制结构的工艺流程图(PFDs)被翻译为带有控制结构的PFDs。我们使用SFILES 2.0符号将PFDs的拓扑表示为字符串。我们使用生成的PFDs预训练模型以学习语法结构。之后,利用迁移学习在真实PFDs上对模型进行微调。该模型在10,000个生成的PFDs上达到了74.8%的Top-5准确率,在100,000个生成的PFDs上达到了89.2%的Top-5准确率。这些有希望的结果显示了人工智能辅助工艺工程的巨大潜力。在312个真实PFDs数据集上的测试表明,工业应用需要更大的PFD数据集和混合人工智能解决方案。

英文摘要

Developing Piping and Instrumentation Diagrams (P&IDs) is a crucial step during process development. We propose a data-driven method for the prediction of control structures. Our methodology is inspired by end-to-end transformer-based human language translation models. We cast the control structure prediction as a translation task where Process Flow Diagrams (PFDs) without control structures are translated to PFDs with control structures. We represent the topology of PFDs as strings using the SFILES 2.0 notation. We pretrain our model using generated PFDs to learn the grammatical structure. Thereafter, the model is fine-tuned leveraging transfer learning on real PFDs. The model achieved a top-5 accuracy of 74.8% on 10,000 generated PFDs and 89.2% on 100,000 generated PFDs. These promising results show great potential for AI-assisted process engineering. The tests on a dataset of 312 real PFDs indicate the need for a larger PFD dataset for industry applications and hybrid artificial intelligence solutions.

2408.00684 2026-06-09 cs.CL 版本更新

Assessing the Variety of a Concept Space Using an Unbiased Estimate of Rao's Quadratic Index

使用Rao二次指数的无偏估计评估概念空间的多样性

Anubhab Majumder, Ujjwal Pal, Amaresh Chakrabarti

发表机构 * Department of Design and Manufacturing, Indian Institute of Science(印度科学研究院设计与制造系)

AI总结 提出一种基于距离的多样性度量方法,通过无偏估计Rao二次指数,并开发软件工具VariAnT,以支持工程设计早期概念空间的多样性评估。

详情
AI中文摘要

过去的研究将设计创造力与“发散性思维”联系起来,即概念空间在设计早期阶段被探索的程度。研究人员认为,生成多个概念会增加产生更好设计解决方案的机会。“多样性”是量化设计师探索的概念空间广度的参数之一。在概念设计阶段评估多样性是有用的,因为在这个阶段,设计师可以自由探索不同的解决方案原则,以用新颖的概念满足设计问题。本文详细阐述并批判性地审视了工程设计文献中现有的多样性度量方法,讨论了它们的局限性。提出了一种新的基于距离的多样性度量方法,并附带了一个支持评估过程的规范性框架。该框架使用所选的基础抽象层次表示,测量两个设计概念之间的实值距离。所提出的框架在名为“VariAnT”的软件工具中实现。此外,通过一个说明性示例展示了该工具的应用。

英文摘要

Past research relates design creativity to 'divergent thinking,' i.e., how well the concept space is explored during the early phase of design. Researchers have argued that generating several concepts would increase the chances of producing better design solutions. 'Variety' is one of the parameters by which one can quantify the breadth of a concept space explored by the designers. It is useful to assess variety at the conceptual design stage because, at this stage, designers have the freedom to explore different solution principles so as to satisfy a design problem with substantially novel concepts. This article elaborates on and critically examines the existing variety metrics from the engineering design literature, discussing their limitations. A new distance-based variety metric is proposed, along with a prescriptive framework to support the assessment process. The framework measures the real-valued distance between two design concepts using any chosen representation of their underlying abstraction levels. The proposed framework is implemented in a software tool called 'VariAnT.' Furthermore, the tool's application is demonstrated through an illustrative example.
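To make the idea of a distance-based variety score concrete, the sketch below contrasts the plug-in estimate of Rao's quadratic index with one standard unbiased form, the average distance over distinct pairs of concepts. Whether this matches the paper's exact estimator is an assumption. The Jaccard distance over feature sets, the function names, and the three toy concepts are also illustrative and are not taken from the VariAnT tool.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Sequence


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def _pair_sum(items: Sequence, dist: Callable) -> float:
    """Sum of distances over unordered pairs of distinct items."""
    return sum(dist(a, b) for a, b in combinations(items, 2))


def rao_q_plugin(items: Sequence, dist: Callable) -> float:
    """Plug-in estimate: divides by n^2, so zero self-distances pull it downward."""
    n = len(items)
    return 2.0 * _pair_sum(items, dist) / (n * n)


def rao_q_unbiased(items: Sequence, dist: Callable) -> float:
    """U-statistic form: mean distance over the n(n-1) ordered pairs of distinct items."""
    n = len(items)
    if n < 2:
        raise ValueError("need at least two concepts")
    return 2.0 * _pair_sum(items, dist) / (n * (n - 1))


# Toy concept space: each concept is a set of hypothetical solution-principle labels.
concepts = [frozenset({"gear", "motor"}), frozenset({"gear", "spring"}), frozenset({"magnet"})]
variety_plugin = rao_q_plugin(concepts, jaccard_distance)
variety_unbiased = rao_q_unbiased(concepts, jaccard_distance)
```

The two estimates differ by a factor of n / (n - 1). That factor matters most for the small concept sets typical of early-stage design studies.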

2605.29475 2026-06-09 cs.CL cs.AI cs.CE cs.HC 版本更新

MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery

MOOSE-Copilot:一个基于网络的交互式助手,用于统一探索性和细粒度科学假设发现

Hongran An, Zonglin Yang

发表机构 * Central Conservatory of Music(中央音乐学院) Nanyang Technological University(南洋理工大学)

AI总结 提出MOOSE-Copilot,通过形式化的人机交互协议,将发散性探索和收敛性细化统一,利用蓝图、路由和反馈三种信号引导生成,显著优于纯自主基线。

Comments Accepted to ACL 2026 (System Demonstrations)

详情
AI中文摘要

大语言模型(LLM)在科学假设发现中展现出显著潜力。然而,现有方法存在两个关键局限:它们将发散性的探索式搜索与收敛性的细粒度细化视为相互孤立的任务,并且自主运行,几乎没有人类指导。我们提出MOOSE-Copilot,这是首个通过形式化的人机交互(HAII)协议弥合这一抽象差距的统一框架。我们的系统使科学家能够通过三种显式信号引导生成过程:初始蓝图、阶段间路由和阶段内反馈。在由LLM提供理想化专家信号的神谕模拟评估中,我们表明注入这些结构化信号显著优于纯自主基线,并刻画了在高质量指导下可获得的增益。此外,我们构建了一个基于网络的界面,将该框架转化为无代码工作流:研究人员提出问题,以交互式树的形式观察假设搜索的展开,并通过选择假设、在阶段间路由和注入反馈来引导搜索,无需命令行代理。这使跨学科研究人员能够直接进行端到端的假设发现。

英文摘要

Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory search and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human-AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and intra-stage feedback. Using an oracle-simulated evaluation in which an LLM provides idealized expert signals, we show that injecting these structured signals significantly outperforms purely autonomous baselines, characterizing the gains achievable under high-quality guidance. Furthermore, we build a web-based interface that turns the framework into a no-code workflow: researchers pose a question, watch the hypothesis search unfold as an interactive tree, and steer it by selecting hypotheses, routing between stages, and injecting feedback, with no command-line agents required. This makes end-to-end hypothesis discovery directly accessible to interdisciplinary researchers.

2208.00859 2026-06-09 cs.LG cs.CL 版本更新

Learning from flowsheets: A generative transformer model for autocompletion of flowsheets

从流程图学习:用于流程图自动补全的生成式Transformer模型

Gabriel Vogel, Lukas Schulze Balhorn, Artur M. Schweidtmann

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 受文本自动补全启发,提出基于SFILES 2.0字符串表示和Transformer语言模型的化工流程图自动补全方法,通过预训练和微调实现交互式流程图合成辅助。

详情
Journal ref
Computers and Chemical Engineering Volume 171, March 2023, 108162
AI中文摘要

我们提出了一种新颖的方法,能够实现化工流程图的自动补全。这一想法受到文本自动补全的启发。我们使用基于文本的SFILES 2.0符号将流程图表示为字符串,并利用基于Transformer的语言模型学习SFILES 2.0语言的语法结构以及流程图中的常见模式。我们在合成生成的流程图拓扑上预训练模型,以学习流程图语言语法。然后,通过迁移学习步骤在真实流程图拓扑上微调模型。最后,我们使用训练好的模型进行因果语言建模,以自动补全流程图。最终,所提出的方法可以在交互式流程图合成过程中为化学工程师提供建议。结果表明,该方法在未来AI辅助过程合成中具有巨大潜力,但也揭示了当前阶段的局限性以及在实际流程图合成场景中部署该技术需要采取的后续步骤。

英文摘要

We propose a novel method enabling autocompletion of chemical flowsheets. This idea is inspired by the autocompletion of text. We represent flowsheets as strings using the text-based SFILES 2.0 notation and learn the grammatical structure of the SFILES 2.0 language and common patterns in flowsheets using a transformer-based language model. We pre-train our model on synthetically generated flowsheet topologies to learn the flowsheet language grammar. Then, we fine-tune our model in a transfer learning step on real flowsheet topologies. Finally, we use the trained model for causal language modeling to autocomplete flowsheets. Eventually, the proposed method can provide chemical engineers with recommendations during interactive flowsheet synthesis. The results demonstrate a high potential of this approach for future AI-assisted process synthesis but also reveal the limitations at the present state and the next steps that need to be taken to deploy this technique in realistic flowsheet synthesis scenarios.

2601.01279 2026-06-09 econ.TH cs.AI cs.CE cs.CL cs.GT 版本更新

Supracompetitive Pricing Under AI Monoculture

AI单一文化下的超竞争定价

Shengyu Cao, Ming Hu

发表机构 * Rotman School of Management, University of Toronto(多伦多大学罗特曼管理学院)

AI总结 本文研究了在共享AI模型下,竞争卖家委托定价时可能产生的超竞争定价问题,通过双寡头模型分析发现,AI模型的鲁棒性和可重复性配置可能导致超竞争定价现象,且市场结果取决于初始定价倾向。

Comments 46 pages

详情
AI中文摘要

当竞争卖家将定价委托给共享的AI模型(如大型语言模型)时,相关推荐结合性能驱动的更新,聚合卖家反馈,引发一个问题:标准的AI部署实践是否会无意中产生超竞争定价?本文开发了一个简化的双寡头模型,其中两个卖家从共享的AI模型中获得定价推荐,该模型由两个参数特征化:一个倾向参数捕捉模型设置高价的倾向,一个输出保真度参数衡量该倾向与实际输出的一致性,其中倾向通过定期重新训练在观察到的结果上更新。我们发现,配置AI模型以鲁棒性和可重复性可以导致超竞争定价通过相变。在临界输出保真度阈值以下,竞争性定价是唯一的稳定结果。在临界值以上,模型表现出双稳态:竞争性和超竞争性定价都是局部稳定的,实际结果取决于模型的初始倾向。超竞争性定价提高了平均价格,但偶尔的低价推荐使检测变得复杂。对于完美输出保真度,任何内部初始倾向都会导致完全价格协调。对于有限训练批次大小为b,当初始倾向位于超竞争性盆地时,随着b的增加,超竞争性定价的概率接近1,不确定结果区域以O(1/√b)的速率缩小。任何减少模型倾向与卖家实际定价之间一致性的因素,无论是通过多样化AI供应商、引入推荐噪声还是减少卖家的遵守,都会将市场推向竞争性结果。

英文摘要

When competing sellers delegate pricing to a shared AI model, such as a large language model, correlated recommendations combined with performance-driven updates aggregating seller feedback raise a key question: can standard AI deployment practices inadvertently produce supracompetitive pricing? We develop a stylized duopoly model in which two sellers receive pricing recommendations from a shared AI characterized by two parameters: a propensity parameter capturing the model's tendency to set high prices and an output-fidelity parameter measuring alignment between this tendency and actual outputs, with propensity updated via periodic retraining on observed outcomes. We find that configuring AI models for robustness and reproducibility can lead to supracompetitive pricing via a phase transition. Below a critical output-fidelity threshold, competitive pricing is the unique stable outcome. Above it, the model exhibits bistability: both competitive and supracompetitive pricing are locally stable, with the realized outcome determined by the model's initial propensity. Supracompetitive pricing raises average prices, but occasional low-price recommendations complicate detection. With perfect output fidelity, full price coordination emerges from any interior initial propensity. For finite training batches of size $b$, when the initial propensity lies in the supracompetitive basin, the probability of supracompetitive pricing approaches 1 as $b$ increases, with the region of indeterminate outcomes shrinking at rate $O(1/\sqrt{b})$. Any factor reducing alignment between the model's propensity and sellers' actual pricing, whether through diversifying AI providers, introducing recommendation noise, or reducing seller adherence, pushes the market toward competitive outcomes.

2605.24384 2026-06-09 cs.CL cs.AI 版本更新

Side-by-side Comparison Amplifies Dialect Bias in Language Models

并排比较加剧语言模型中的方言偏见

Kritee Kondapally, Claire J. Smerdon, Pooja C. Patel, Ogheneyoma Akoni, Jevon Torres, Jaspreet Ranjit, Matthew Finlayson, Swabha Swayamdipta

发表机构 * University of Southern California(美国南加州大学)

AI总结 本研究通过并排比较标准美式英语和非裔美国英语的推文,发现语言模型中的隐性方言偏见在对比设置下显著加剧,且显性方言偏见在安全对齐微调后仍存在。

Comments In proceeding at ACM Conference on Fairness, Accountability, and Transparency 2026

详情
Journal ref
In The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)
AI中文摘要

语言模型(LMs)可能因其方言变体而表现出偏见,即使在没有方言标签的情况下,这种行为被称为隐性方言偏见。在这项工作中,我们通过评估语言模型如何将刻板特征(源自社会心理学关于种族偏见的研究)与标准美式英语(SAE)和非裔美国英语(AAVE)中意图等效的推文相关联,来量化在线话语中的隐性方言偏见。虽然先前的研究表明,在单独评估推文时,语言模型将更多负面刻板印象与AAVE关联,但我们惊讶地发现,当SAE/AAVE推文对并排比较时,这种偏见显著加剧,这种设置更接近模型用于排名候选人的高影响力决策环境。当明确指定方言标签时,偏见只会恶化。考虑到商业开发者为了减轻其语言模型中的偏见所做的广泛努力,这一点令人震惊。令人鼓舞的是,我们表明反事实公平微调可以减轻某些刻板特征的隐性方言偏见,减少单独评估推文时的平均差异,然而,在并排评估SAE/AAVE推文时,这些改进并不一致地适用于所有特征。我们的发现表明,现有的隐性方言偏见评估设置可能低估了其严重性,特别是在对比设置中。此外,即使在安全对齐微调后,显性方言偏见仍然显著,表明它仍然是一个未解决的问题,并激励需要更稳健的评估和缓解框架。

英文摘要

Language models (LMs) can exhibit biases based on variations in their dialects, even in the absence of a dialect label, a behavior known as covert dialect bias. In this work, we quantify covert dialect bias in online discourse by evaluating how LMs associate stereotypical traits (derived from social psychology research on racial bias) with intent-equivalent tweets in Standard American English (SAE) and African-American Vernacular English (AAVE). While prior work shows that LMs associate more negative stereotypes with AAVE when evaluating tweets in isolation, we are surprised to find that this bias is significantly exacerbated when SAE / AAVE tweet pairs are compared side by side, a setting that more closely reflects high-impact decision-making contexts in which models are used to rank candidates. The bias only worsens when dialect labels are explicitly specified. This is striking, given the extensive efforts from commercial developers to mitigate bias in their LMs. Encouragingly, we show that counterfactual fairness finetuning can mitigate covert dialect bias for some stereotypical traits, reducing average disparities when evaluating tweets in isolation; however, these improvements do not consistently hold across traits when evaluating SAE / AAVE tweets side by side. Our findings show that existing evaluation settings for covert dialect bias may underestimate its severity, specifically in contrastive settings. Additionally, overt dialect bias remains pronounced even after safety-aligned finetuning, indicating that it remains an unresolved problem and motivating the need for more robust evaluation and mitigation frameworks.

2508.03453 2026-06-09 cs.CL cs.LG 版本更新

Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings

裁剪优于dropout作为自监督训练文本嵌入的增强策略

Rita González-Márquez, Philipp Berens, Dmitry Kobak

发表机构 * Hertie Institute for AI in Brain Health(人工智能与脑健康赫尔蒂研究所) University of Tübingen, Germany(德国图宾根大学)

AI总结 本文研究了自监督微调中裁剪和dropout两种增强策略,发现裁剪在文本嵌入质量上表现更优,尤其在领域内数据中能快速生成高质量嵌入。

详情
Journal ref
Transactions on Machine Learning Research (TMLR) 2026
AI中文摘要

文本嵌入,即整个文本的向量表示,在许多NLP应用中起重要作用,如检索增强生成、聚类或用于数据探索的文本集合可视化。目前,表现最佳的嵌入模型是通过监督对比微调从预训练语言模型中衍生而来。这种微调策略依赖于外部相似性概念和标注数据来生成正样本对。本文研究了自监督微调,并系统比较了用于微调文本嵌入模型的两种最知名的增强策略。我们在MTEB和额外的领域内评测上评估嵌入质量,发现裁剪增强显著优于基于dropout的方法。我们发现,在领域外数据上,所得嵌入的质量远低于监督的最新模型,但对于领域内数据,自监督微调能在极短的微调后生成高质量文本嵌入。最后,我们表明表示质量越靠近最后几层transformer层越高,而这些层在微调过程中变化最大;仅微调这些最后几层就足以达到相近的嵌入质量。

英文摘要

Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via supervised contrastive fine-tuning. This fine-tuning strategy relies on an external notion of similarity and annotated data for generation of positive pairs. Here we study self-supervised fine-tuning and systematically compare the two most well-known augmentation strategies used for fine-tuning text embedding models. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is substantially below the supervised state-of-the-art models, but for in-domain data, self-supervised fine-tuning can produce high-quality text embeddings after very short fine-tuning. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality.
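As a rough sketch of what cropping augmentation can look like, the snippet below draws two random contiguous crops from the same text and treats them as a positive pair for contrastive fine-tuning. The abstract does not say whether crops are taken at the token, word, or sentence level, so the word-level cropping, the length bounds, and the function name are assumptions.

```python
import random
from typing import List, Tuple


def crop_positive_pair(tokens: List[str], min_len: int, max_len: int,
                       rng: random.Random) -> Tuple[List[str], List[str]]:
    """Return two independently sampled contiguous crops of the same token list."""
    if not 1 <= min_len <= max_len:
        raise ValueError("require 1 <= min_len <= max_len")
    if len(tokens) < min_len:
        raise ValueError("text is shorter than min_len")

    def one_crop() -> List[str]:
        # Crop length is capped by the text length; the start is uniform over valid offsets.
        length = rng.randint(min_len, min(max_len, len(tokens)))
        start = rng.randint(0, len(tokens) - length)
        return tokens[start:start + length]

    return one_crop(), one_crop()


rng = random.Random(0)  # fixed seed so the example is repeatable
tokens = "text embeddings are vector representations of entire texts".split()
view_a, view_b = crop_positive_pair(tokens, min_len=3, max_len=5, rng=rng)
```

In a contrastive setup, the two views of one text would be pulled together in embedding space while crops from other texts in the batch serve as negatives. The dropout-based alternative instead encodes the identical text twice with different dropout masks.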

2507.15152 2026-06-09 cs.CL cs.AI cs.LG 版本更新

What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction

什么是‘足够’的自动化水平?大型语言模型在元分析数据提取中的基准测试

Lingbo Li, Anuradha Mathrani, Teo Susnjak

发表机构 * School of Mathematical and Computational Sciences(数学与计算科学学院) Massey University(梅西大学) Auckland, New Zealand(新西兰奥克兰)

AI总结 本文评估了三种大型语言模型在医疗领域数据提取中的性能,发现定制提示能显著提升召回率,提出三层次指南以平衡自动化与专家监督。

详情
Journal ref
Research Synthesis Methods (2026)
AI中文摘要

自动化从全文随机对照试验(RCT)中提取数据用于元分析仍是一个重大挑战。本研究评估了三种LLM(Gemini-2.0-flash、Grok-3、GPT-4o-mini)在高血压、糖尿病和骨科三个医学领域中统计结果、偏倚风险评估和研究层面特征任务上的实际表现。我们测试了四种不同的提示策略(基本提示、自我反思提示、模型集成和定制提示)以确定如何提高提取质量。所有模型均表现出高精度,但普遍存在召回率低的问题,因遗漏关键信息。我们发现定制提示是最有效的,召回率可提升高达15%。基于此分析,我们提出了一套三层指南,根据任务复杂性和风险匹配数据类型与适当的自动化水平。本研究为现实世界中的元分析自动化数据提取提供了实用建议,通过有针对性的、任务特定的自动化平衡LLM效率与专家监督。

英文摘要

Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models demonstrate high precision but consistently suffer from poor recall by omitting key information. We found that customised prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.

2410.14964 2026-06-09 cs.CL 版本更新

ChronoFact: Timeline-based Temporal Fact Verification

ChronoFact:基于时间线的时序事实验证

Anab Maulana Barik, Wynne Hsu, Mong Li Lee

发表机构 * School of Computing(计算学院) Institute of Data Science(数据科学研究所) Centre for Trusted Internet and Community(可信互联网与社区中心)

AI总结 本文提出基于时间线的时序事实验证框架,通过识别声明和证据中的事件并组织时间线,系统分析事件关系以预测声明真实性,同时引入复杂时序声明数据集提升验证效果。

详情
Journal ref
Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2025), pp. 8031-8039
AI中文摘要

时序声明常存在不准确之处,是数字虚假信息领域的重要挑战。能够准确验证此类声明的事实核查系统对于对抗虚假信息至关重要。当前系统在评估这些声明的准确性时面临复杂性,尤其是当声明包含多个、重叠或重复事件时。我们引入了一个新的基于时间线的事实验证框架,该框架识别声明和证据中的事件,并将它们组织到各自的时间线中。该框架系统地分析声明和证据中事件之间的关系,以预测每个声明事件的真实性及其时间准确性。这使我们能够准确确定整个声明的真实性。我们还引入了一个新的复杂时序声明数据集,涉及基于时间线的推理,用于训练和评估所提出的框架。实验结果展示了我们的方法在处理时序声明验证复杂性方面的有效性。

英文摘要

Temporal claims, often riddled with inaccuracies, are a significant challenge in the digital misinformation landscape. Fact-checking systems that can accurately verify such claims are crucial for combating misinformation. Current systems struggle with the complexities of evaluating the accuracy of these claims, especially when they include multiple, overlapping, or recurring events. We introduce a novel timeline-based fact verification framework that identifies events from both claim and evidence and organizes them into their respective chronological timelines. The framework systematically examines the relationships between the events in both claim and evidence to predict the veracity of each claim event and its chronological accuracy. This allows us to accurately determine the overall veracity of the claim. We also introduce a new dataset of complex temporal claims involving timeline-based reasoning for the training and evaluation of our proposed framework. Experimental results demonstrate the effectiveness of our approach in handling the intricacies of temporal claim verification.

2406.14883 2026-06-09 cs.CL cs.CY 版本更新

OATH-Frames: Characterizing Online Attitudes Towards Homelessness with LLM Assistants

OATH-Frames: 利用大语言模型助手分析在线对无家可归者的态度

Jaspreet Ranjit, Brihi Joshi, Rebecca Dorn, Laura Petry, Olga Koumoundouros, Jayne Bottarini, Peichen Liu, Eric Rice, Swabha Swayamdipta

发表机构 * Dept. of Computer Science, University of Southern California(南加州大学计算机科学系) Suzanne Dworak-Peck School of Social Work, University of Southern California(南加州大学苏珊娜·德沃拉克-佩克社会工作学院)

AI总结 本文提出OATH-Frames框架,通过大语言模型分析社交媒体上的无家可归者态度,提升大规模分析效率并揭示态度趋势。

Comments Project website: https://dill-lab.github.io/oath-frames/, EMNLP Main 2024

详情
Journal ref
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
AI中文摘要

警告:本文内容可能令人不安。公众对关键社会问题的在线态度对政策制定至关重要,但大规模理解具有挑战性。本文通过利用大语言模型分析数百万条推文,研究美国无家可归问题,提出OATH-Frames框架,包含九个层级的批判、回应和感知类型。通过不同层级的模型辅助标注,实现标注时间提升6.5倍,性能仅下降3个F1点。实验表明,OATH-Frames在建模态度方面优于现有情感和毒性分类器。对240万条无家可归相关推文的分析揭示了各州、时间周期和脆弱群体的态度趋势,为问题提供了新见解。本文提供了一个通用框架,用于在无家可归问题之外的其他议题上理解大规模的复杂公众态度。

英文摘要

Warning: Contents of this paper may be upsetting. Public attitudes towards key societal issues, expressed on online media, are of immense value in policy and reform efforts, yet challenging to understand at scale. We study one such social issue, homelessness in the U.S., by leveraging the remarkable capabilities of large language models to assist social work experts in analyzing millions of posts from Twitter. We introduce a framing typology, Online Attitudes Towards Homelessness (OATH) Frames, comprising nine hierarchical frames capturing critiques, responses and perceptions. We release annotations with varying degrees of assistance from language models, with immense benefits in scaling: 6.5x speedup in annotation time while only incurring a 3 point F1 reduction in performance with respect to the domain experts. Our experiments demonstrate the value of modeling OATH-Frames over existing sentiment and toxicity classifiers. Our large-scale analysis with predicted OATH-Frames on 2.4M posts on homelessness reveals key trends in attitudes across states, time periods and vulnerable populations, enabling new insights on the issue. Our work provides a general framework to understand nuanced public attitudes at scale, on issues beyond homelessness.

2406.19493 2026-06-09 cs.CL cs.AI 版本更新

Development and Evaluation of a Retrieval-Augmented Generation Tool for Creating SAPPhIRE Models of Artificial Systems

用于创建人工系统SAPPhIRE模型的检索增强生成工具的开发与评估

Anubhab Majumder, Kausik Bhattacharya, Amaresh Chakrabarti

发表机构 * Department of Design and Manufacturing, Indian Institute of Science(设计与制造系,印度科学研究院)

AI总结 本文提出一种基于检索增强生成的工具,用于创建人工系统的SAPPhIRE因果模型,并初步评估该工具生成结果的事实准确性和可靠性,以支持类比设计。

Comments This paper has been accepted for presentation at the 10th International Conference on Research Into Design, 2025

详情
AI中文摘要

使用SAPPhIRE因果模型表示系统已被发现有助于支持类比设计。然而,创建人工或生物系统的SAPPhIRE模型是一个耗费精力的过程,需要人类专家从多个技术文档中获取有关系统工作原理的技术知识。本研究探讨如何利用大语言模型(LLM)基于SAPPhIRE因果模型创建系统的结构化描述。本文是两部分研究中的第二部分,介绍了一种新的检索增强生成(RAG)工具,用于生成与人工系统SAPPhIRE构造相关的信息,并报告了对该工具的初步评估结果,重点考察结果的事实准确性和可靠性。

英文摘要

Representing systems using the SAPPhIRE causality model has been found useful in supporting design-by-analogy. However, creating a SAPPhIRE model of artificial or biological systems is an effort-intensive process that requires human experts to source technical knowledge from multiple technical documents regarding how the system works. This research investigates how to leverage Large Language Models (LLMs) in creating structured descriptions of systems using the SAPPhIRE model of causality. This paper, the second part of the two-part research, presents a new Retrieval-Augmented Generation (RAG) tool for generating information related to SAPPhIRE constructs of artificial systems and reports the results from a preliminary evaluation of the tool's success, focusing on the factual accuracy and reliability of outcomes.

2407.00396 2026-06-09 cs.CL cs.AI 版本更新

A Study on Effect of Reference Knowledge Choice in Generating Technical Content Relevant to SAPPhIRE Model Using Large Language Model

利用大语言模型生成SAPPhIRE模型相关技术内容时参考知识选择的影响研究

Kausik Bhattacharya, Anubhab Majumder, Amaresh Chakrabarti

发表机构 * Indian Institute of Science(印度科学研究院)

AI总结 本文研究如何利用大语言模型生成与SAPPhIRE因果关系模型相关的技术内容,通过检索增强生成方法抑制幻觉,强调参考知识选择对生成准确性的重要性。

详情
AI中文摘要

使用SAPPhIRE因果关系模型表示系统可以成为设计的灵感来源。然而,创建技术或自然系统的SAPPhIRE模型需要从多个技术文档中获取系统工作原理的技术知识。本研究探讨如何利用大语言模型(LLM)生成准确的相关技术内容。本文是两部分研究中的第一部分,提出了一种使用检索增强生成方法来抑制幻觉,从而生成由相关科学信息支持的技术内容的方法。研究结果表明,用于为LLM生成技术内容提供上下文的参考知识选择非常重要。本研究的成果用于构建一个软件支持工具,以生成给定技术系统的SAPPhIRE模型。

英文摘要

Representation of systems using the SAPPhIRE model of causality can be an inspirational stimulus in design. However, creating a SAPPhIRE model of a technical or a natural system requires sourcing technical knowledge from multiple technical documents regarding how the system works. This research investigates how to use a Large Language Model (LLM) to generate accurate technical content relevant to the SAPPhIRE model of causality. This paper, which is the first part of the two-part research, presents a method for hallucination suppression using Retrieval-Augmented Generation with an LLM to generate technical content supported by the scientific information relevant to a SAPPhIRE construct. The result from this research shows that the selection of reference knowledge used in providing context to the LLM for generating the technical content is very important. The outcome of this research is used to build a software support tool to generate the SAPPhIRE model of a given technical system.

2402.09193 2026-06-09 cs.CL cs.AI cs.HC 版本更新

(Ir)rationality and Cognitive Biases in Large Language Models

大语言模型中的(非)理性与认知偏差

Olivia Macmillan-Scott, Mirco Musolesi

发表机构 * University College London(伦敦大学学院) University of Bologna(博洛尼亚大学)

AI总结 本文通过心理学文献中的任务评估七种语言模型,发现其在非理性表现上与人类相似,但表现形式不同,且存在响应不一致的额外非理性特征。

详情
Journal ref
Royal Society Open Science 11(6) 2024
AI中文摘要

大语言模型(LLM)是否表现出理性推理?已有研究表明,LLM因训练数据而带有人类偏见;但这种偏见是否反映在理性推理中尚不明确。在本文中,我们使用认知心理学文献中的任务评估了七种语言模型,以回答这个问题。我们发现,与人类一样,LLM在这些任务中表现出非理性。然而,这种非理性的表现方式与人类并不相同:当LLM对这些任务给出错误答案时,其出错方式往往不同于类人偏见。此外,LLM的回答存在显著的不一致性,这构成了额外一层非理性。除实验结果外,本文还希望做出方法论上的贡献,展示如何评估和比较此类模型的不同能力,本文中具体针对理性推理。

英文摘要

Do large language models (LLMs) display rational reasoning? LLMs have been shown to contain human biases due to the data they have been trained on; whether this is reflected in rational reasoning remains less clear. In this paper, we answer this question by evaluating seven language models using tasks from the cognitive psychology literature. We find that, like humans, LLMs display irrationality in these tasks. However, the way this irrationality is displayed does not reflect that shown by humans. When incorrect answers are given by LLMs to these tasks, they are often incorrect in ways that differ from human-like biases. On top of this, the LLMs reveal an additional layer of irrationality in the significant inconsistency of the responses. Aside from the experimental results, this paper seeks to make a methodological contribution by showing how we can assess and compare different capabilities of these types of models, in this case with respect to rational reasoning.