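arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.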

2606.06586 2026-06-08 cs.CL New submission

Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

通过一致性驱动的强化学习改进跨语言事实回忆

Jonathan von Rad, Louis Arts, George Burgess, Eleftheria Kolokytha, Harry O'Donnell, Ektor Oikonomidis Doumpas, Eduardo Sanchez, Yao Lu, Pontus Stenetorp

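Affiliations: University College London; Centre for Artificial Intelligence

AI summary: Introduces the PolyFact dataset and uses GRPO reinforcement learning to improve the consistency of cross-lingual factual recall in large language models. GRPO outperforms supervised fine-tuning, and the analysis indicates that it works by reducing language-specific representations in favor of shared cross-lingual ones.

Comments: Under Review at EMNLP 2026

AI Chinese abstract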

主要用英语数据训练的大型语言模型（LLMs）编码了丰富的世界知识，但通常无法在其他语言中可靠地表达这些知识，这种现象称为跨语言事实不一致性。为了研究和解决这一问题，我们引入了PolyFact，一个大规模并行多语言事实问答数据集，包含12种类型多样的语言中的10万个基于Wikidata的事实。利用PolyFact，我们比较了轻量持续预训练（CPT）、监督微调（SFT）和通过组相对策略优化（GRPO）的强化学习在Qwen-2.5-7B和OLMo-2-1124-7B中改进跨语言事实回忆的效果。我们发现GRPO始终优于SFT，提高了跨语言一致性和对未见语言的泛化能力，而并行数据上的CPT带来的额外收益有限。机制分析进一步表明，GRPO通过减少MLP层和注意力头中的语言专门化来重组多语言路由，从而促进更共享的跨语言表示。我们发布了代码、模型和数据集。

English abstract

Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.
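The group-relative step that gives GRPO its name can be sketched as follows: rewards for several sampled answers to the same question are normalized against that group's own mean and standard deviation, so no separate value model is needed. This is a minimal sketch, not the paper's training code. The exact-match reward, the function names, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def exact_match_reward(answer: str, gold: str) -> float:
    """Hypothetical reward: 1.0 if the answer matches the gold fact, else 0.0."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Center and scale each reward using statistics of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards against sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one factual question (illustrative data only).
samples = ["Paris", "Lyon", "paris ", "Marseille"]
rewards = [exact_match_reward(s, "Paris") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative advantages. A group whose answers all earn the same reward yields zero advantages and therefore no learning signal.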

2606.06614 2026-06-08 cs.CL cs.AI cs.HC New submission

Re-Centering Humans in LLM Personalization

重新将人类置于LLM个性化中心

Lechen Zhang, Jiarui Liu, Tal August

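Affiliations: University of Illinois Urbana-Champaign; Carnegie Mellon University

AI summary: Studies the performance gap of LLM personalization between synthetic and human data. Collected human conversations and judgments reveal system limitations in attribute extraction, relevant-attribute pairing, and personalized response generation, and lightweight training interventions are introduced to narrow the gap.

AI Chinese abstract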

尽管兴趣日益增长，但大多数对大型语言模型（LLM）个性化能力的评估都依赖于合成数据。目前尚不清楚当前的个性化系统对真实用户的效果如何。在本文中，我们研究了LLM个性化在使用合成数据与人类数据时的性能差距。我们收集了人类对话（550个对话）和个性化三个阶段的判断：从对话中提取用户属性（5,949个判断），将相关属性与新提示配对（11,919个），以及将相关属性融入个性化响应（1,101个）。纳入人类数据揭示了每个阶段的系统局限性。模型难以从人类对话中提取属性，与人类在相关属性上的判断不一致，并且生成的个性化响应被人类评价为并不优于通用响应（尽管LLM广泛评价为更好）。我们在前两个阶段引入了两种轻量级基于训练的干预措施，使自动化个性化评估更接近人类数据。然而，在第三阶段，我们发现学习到的奖励模型与人类评分的相关性仅达到中等水平，这表明与人类一致的个性化质量判断难以直接建模。我们收集的数据为研究模型如何以人类认为有用的方式提取、选择和整合用户信息提供了基础。

English abstract

Despite growing interest, most evaluations of large language models' (LLMs') personalization abilities have relied on synthetic data. It remains unclear how well current personalization systems work for real users. In this paper, we study the gap in LLM personalization performance when using synthetic versus human data. We collect human conversations (550 conversations) and judgments across three stages of personalization: extracting user attributes from conversations (5,949 judgments), pairing relevant attributes with new prompts (11,919), and incorporating relevant attributes into a personalized response (1,101). Incorporating human data reveals system limitations at each stage. Models struggle to extract attributes from human conversations, disagree with human judgments on relevant attributes, and generate personalized responses that humans judge no better than generic responses (though LLM judges widely rate them as better). We introduce two lightweight training-based interventions that shift automated personalization evaluation closer to human data in our first two stages. However, in our third stage we find that learned reward models achieve only modest correlation with human ratings, suggesting that human-aligned personalization quality judgments are difficult to model directly. Our collected data provides a foundation for studying how models should extract, select, and incorporate user information in ways that humans find useful.

2606.06635 2026-06-08 cs.CL cs.AI New submission

How Language Models Fail: Token-Level Signatures of Committed and Persistent Reasoning Failures

语言模型如何失败：承诺性和持续性推理错误的令牌级特征

Tanvi Thoria, Kiana Jafari, Marc R. Schlichting, Mykel J. Kochenderfer

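Affiliations: Department of Computer Science, Stanford University; Department of Aeronautics and Astronautics, Stanford University

AI summary: Uses token-level uncertainty signals to split language model reasoning failures into committed failures (locking onto a wrong path early) and persistent uncertainty (uncertainty that keeps accumulating), validates the framework's predictions across 23 model-dataset configurations, and offers guidance for self-consistency strategies.

AI Chinese abstract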

语言模型推理中的失败通过不同的过程产生，这些过程在推理轨迹中留下可识别的特征。我们使用令牌级不确定性信号来表征这些失败，发现它们通过两个经验上可区分的过程出现。第一个是承诺性失败，其中模型在其轨迹早期锁定到错误的推理路径。一个核心诊断特征是承诺点，超过该点考虑额外的令牌会损害而不是帮助失败检测。在第二个过程中，持续性不确定性，不确定性反而在整个过程中累积，并且需要完整的轨迹来最好地区分失败和成功的完成。这些特征在23个模型-数据集配置中重现，该框架的可证伪预测在23个案例中的20个中成立，远高于两种失败模式下的随机水平。最后，我们展示了我们的失败模式框架对自我一致性有直接影响，识别了不确定性信号何时补充它以及何时可以选择性地跳过它。这些结果为理解何时LLM推理失败变得可检测以及相应调整检测策略提供了基础。

English abstract

Failures in language model reasoning emerge through distinct processes that leave identifiable signatures in the reasoning trace. We characterize these failures using token-level uncertainty signals, finding they arise through two empirically distinguishable processes. The first is committed failure, in which a model locks onto an incorrect reasoning path early in its trace. A central diagnostic signature is the commitment point, beyond which considering additional tokens hurts rather than helps failure detection. In the second, persistent uncertainty, uncertainty instead accumulates throughout, and the full trace is needed to best distinguish failing from successful completions. These signatures reproduce across 23 model-dataset configurations, with the framework's falsifiable predictions holding in 20 of 23 cases, well above chance across both failure modes. Finally, we demonstrate our failure mode framework has direct implications for self-consistency, identifying when uncertainty signals complement it and when it can be selectively skipped. These results offer a foundation for understanding when LLM reasoning failures become detectable and for adapting detection strategies accordingly.

2606.06667 2026-06-08 cs.CL New submission

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment

泛化的搭便车假说：解释和缓解涌现的错位

Jiachen Zhao, Zhengxuan Wu, Aryaman Arora, Yiyou Sun, David Bau, Weiyan Shi

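Affiliations: Northeastern University; Stanford University; University of California, Berkeley

AI summary: Proposes the Piggyback Hypothesis, which holds that chat-template tokens cause fine-tuned behavior to generalize to unrelated domains, and designs TReFT, which mitigates emergent misalignment by regularizing token representations and is effective across multiple datasets.

AI Chinese abstract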

LLMs在训练示例之外的广泛过度泛化机制尚不清楚。涌现错位（EM）提供了一个引人注目的案例研究：在狭窄任务上微调会诱导对语义无关测试域的广泛错位。在这项工作中，我们提出了搭便车假说：聊天模板标记可以将微调行为搭便车到域外查询上。我们通过实验验证了这一假说，即对前缀（所有用户查询之前的标记）进行细微扰动，或者用未微调模型的前缀表示替换当前前缀表示，可以在不改变用户查询的情况下恢复对齐。基于这一发现，我们提出了标记正则化微调（TReFT），该方法在训练期间正则化特定标记表示以缓解EM。在不同的模型和多个诱导EM的数据集上，TReFT在保留域内学习的同时减少了EM。在基于法律领域微调的Llama-3.1-8B上，TReFT比使用保留对齐示例的数据交错方法实现了33.5%更多的EM减少。我们进一步展示了TReFT扩展到其他狭窄微调设置，包括弃权、工具使用和拒绝（平均减少54.3%的离题泛化），支持了搭便车假说。总的来说，我们的工作强调了LLMs可能以非预期的方式学习和泛化，并提出了一个走向更受约束微调的路径。它还呼吁进一步研究共享输入特征如何跨域搭便车模型行为。

English abstract

The mechanisms behind LLMs' broad over-generalization beyond training examples remain unclear. Emergent misalignment (EM) offers a striking case study: finetuning on narrow tasks induces broad misalignment to semantically-unrelated test domains. In this work, we propose the Piggyback Hypothesis: the chat-template tokens can piggyback the finetuned behaviour onto out-of-domain queries. We validate this hypothesis by showing that subtle perturbations to the prefix (tokens preceding all user queries), or patching the prefix representations with those from the unfinetuned model, can restore alignment without changing the user query. Building on this finding, we propose Token-Regularized Finetuning (TReFT), which regularizes specific token representations during training to mitigate EM. Across different models and multiple EM-inducing datasets, TReFT reduces EM while preserving in-domain learning. On Llama-3.1-8B finetuned on the legal domain, TReFT achieves 33.5% more EM reduction than data interleaving with a retain set of aligned examples. We further show that TReFT extends to other narrow-finetuning settings, including abstention, tool use, and refusal (off-topic generalization is reduced by 54.3% on average), supporting the Piggyback Hypothesis. Broadly, our work highlights that LLMs may learn and generalize in unintended ways and suggests a path toward more constrained finetuning. It also calls for further study of how shared input features can piggyback model behavior across domains.

2606.06674 2026-06-08 cs.CL cs.CY New submission

What Do People Actually Want From AI? Mapping Preference Plurality

人们真正希望从AI中得到什么？偏好多元性映射

Julia Sepúlveda Coelho, Scott A. Hale

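Affiliations: Oxford Internet Institute, University of Oxford; Meedan

AI summary: Analyzes 1,500 open-ended responses from 75 countries and finds that people expect different things from AI: most values are requested by only a minority, the same word (such as "truthfulness") carries divergent meanings, and some capabilities are contested, exposing fundamental flaws in current RLHF preference aggregation.

Comments: Accepted at the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

DOI: 10.1145/3805689.3812398

AI Chinese abstract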

大型语言模型（LLMs）通常通过基于人类反馈的强化学习（RLHF）进行微调，以与人们的偏好和价值观对齐。然而，这种方法存在已知局限性：它聚合了冲突的偏好，通常依赖于不具有代表性的样本，并且仅使用二元比较。通过分析来自PRISM数据集跨越75个国家的1500份开放式回答，我们考察了人们真正希望从AI系统中得到什么，并揭示了当前方法的具体失败。我们发现不同的人想要不同的东西：大多数价值观被不到四分之一的受访者要求，真实性是唯一的例外，占49%。此外，相同的词语隐藏着不同的含义：当人们描述他们所说的“真实性”时，他们揭示了不同的、可能不相容的认识论基础，因为有些人要求有来源的主张，有些人要求专家意见，甚至有些人要求不受欢迎的观点。某些能力，即模型的行为有多像人类，以及某些特征，如AI护栏，是完全有争议的，有些人渴望它们，而另一些人则拒绝它们。我们还发现，人们经常使用上下文区分（AI“默认”应该做什么与“如果被要求”应该做什么），这是二元比较无法捕捉的。这些发现暴露了当前对齐实践中的根本问题。当49%的人要求真实性但以不同方式定义时，这不太可能被单个奖励模型捕捉到。尽管用户明确要求准确性，但在资金充足的模型中持续存在高幻觉率，这表明当前方法未能识别实际偏好。本文揭示了当前被扁平化为通用偏好模型的情境化、有争议、不完美的信号，这种做法被其他人描述为认识论暴力。

English abstract

Large Language Models (LLMs) are often fine-tuned through Reinforcement Learning from Human Feedback (RLHF) to align with people's preferences and values. However, this method has known limitations: it aggregates conflicting preferences, often relies on unrepresentative samples, and uses only binary comparisons. Analysing 1,500 open-ended responses from the PRISM dataset across 75 countries, we examine what people actually want from AI systems and reveal concrete failures of current methods. We find that different people want different things: most values are requested by fewer than a quarter of respondents, with truthfulness the sole exception at 49%. Furthermore, the same words hide divergent meanings: when people describe what they mean by "truthfulness", they reveal distinct, potentially incompatible, epistemological bases, as some ask for sourced claims, some for expert opinions, and some even ask for unpopular views. Certain capabilities, namely how human-like a model behaves, and some features, like AI guardrails, are outright controversial, with some desiring them and others rejecting them. We additionally find that people often use contextual distinctions (what AI should do "by default" versus "if requested") that binary comparisons cannot capture. These findings expose fundamental problems in current alignment practices. When 49% request truthfulness but define it differently, this is unlikely to be captured by a single reward model. The persistence of high hallucination rates in well-funded models, despite users' clear demands for accuracy, suggests that current methods fail to identify actual preferences. This paper sheds light on the situated, contested, imperfect signals that are currently being flattened into universal preference models, a practice others have characterised as epistemic violence.

2606.06679 2026-06-08 cs.CL cs.AI cs.CY New submission

HKJudge: A Legal Discourse-Annotated Corpus for Interpreting What Courts Find, How They Reason, and What They Rule

HKJudge：用于解释法院认定事实、推理过程和裁决结果的法律话语标注语料库

Xi Xuan, Wenxin Zhang, Yufei Zhou, King-kui Sin, Chunyu Kit

Affiliations: City University of Hong Kong; University of Chinese Academy of Sciences

AI summary: Presents HKJudge, the first sentence-level expert-annotated legal discourse dataset, covering criminal judgments from every level of Hong Kong's courts. It defines a two-tier discourse schema (26 rhetorical roles and 3 sentencing elements) and reports benchmark evaluations with BERT-based models and LLMs.

AI Chinese abstract

法院判决是法律实践和法理学的核心，然而香港判决的话语分析由于缺乏专家标注语料库而受到限制。我们引入了香港判决话语数据集（HKJudge），这是首个句子级专家标注的法律话语语料库。HKJudge包含香港法院层级所有五个级别的刑事判决，共计约29万句子和650万词元，由法律语言学专家完全标注。我们设计了一个双层话语模式，捕捉法院认定的事实、推理过程以及裁决结果。在句子层面，每个句子被分配26种修辞角色之一。在跨度层面，句子进一步标注了三个判刑要素（指控、监禁刑期、罚款）。十位法律语言学标注者进行了标注，标注者间一致性为κ=0.8。我们在HKJudge上定义了两个任务，称为修辞角色分类和法律要素提取，并提供了四种基于BERT的模型、两种开源LLM（在零样本和微调设置下）以及四种商业LLM在这两个任务上的首次基准评估。我们的工作展示了句子级话语标注对于建模香港判决结构的价值，并为未来法律判决预测研究提供了丰富的数据基础。HKJudge数据集和代码可在以下网址获取：https://github.com/xuanxixi/HKJudge。

English abstract

Court judgments are central to legal practice and jurisprudence, yet discourse analysis of Hong Kong judgments has received limited attention, owing largely to the absence of expert-annotated corpora. We introduce the Hong Kong Judgment Discourse Dataset (HKJudge), the first sentence-level expert-annotated legal discourse corpus. HKJudge includes criminal judgments across all five levels of HK's court hierarchy, comprising $\sim$290k sentences and $\sim$6.5 million tokens, fully annotated by legal linguistics experts. We design a two-tier discourse schema that captures what facts a court finds, how it reasons, and what it rules. At the sentence level, each sentence is assigned one of 26 rhetorical roles. At the span level, sentences are further annotated with three sentencing elements (charge, imprisonment term, fine). Ten legal linguistics annotators produced the annotations with an inter-annotator agreement of $\kappa = 0.8$. We formulate two tasks on HKJudge, termed rhetorical role classification and legal element extraction, and provide the first benchmark evaluation of four BERT-based models, two open-source LLMs under zero-shot and fine-tuning settings, and four commercial LLMs on both tasks. Our work demonstrates the value of sentence-level discourse annotation for modeling the structure of HK judgments and provides a rich data foundation for future work on legal judgment prediction. The HKJudge dataset and code are available at https://github.com/xuanxixi/HKJudge.

2606.06708 2026-06-08 cs.CL New submission

Signal-Driven Observation for Long-Horizon Web Agents

信号驱动观测：面向长程任务的Web智能体

Shubham Gaur, Ian Lane

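Affiliations: University of Cambridge

AI summary: Proposes Signal-Driven Observation (SDO), in which a dedicated sub-call reads the full DOM but returns only task-relevant elements and a lightweight signal detector triggers re-invocation, addressing context degradation in long-horizon web agents.

Comments: 10 pages, 1 figure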

2606.06712 2026-06-08 cs.CL cs.AI New submission

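When to Think Deeply: Inhibitory Deliberation for LLM Reasoning [translated from the Chinese title]

何时深度思考：用于LLM推理的抑制性深思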

Zhixuan He, Yue Feng

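Affiliations: University of Birmingham, United Kingdom

AI summary: Proposes the IDPR framework, in which an inhibition controller decides from the fast answer whether to launch slow reasoning. On a mathematical reasoning test set it invokes slow reasoning on only 8.20% of examples while raising accuracy from 47.90% to 48.92%.

AI Chinese abstract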

推理型大语言模型可以通过深思推理提高问题求解性能，但对每个输入都调用慢速推理在计算上昂贵且往往不必要。我们提出IDPR，一个响应条件抑制性深思框架。IDPR首先生成一个简洁的直观答案，然后使用抑制控制器决定该特定响应是否应被释放或抑制以支持慢速推理。与仅输入路由器不同，抑制控制器以快速答案和快速侧证据为条件，包括置信度、logit边际、可解析性和生成成本。我们从配对的快速-慢速结果中训练控制器，并在准确率优先的慢速调用预算下，在保留验证集上选择抑制阈值。在一个保留的5000示例数学推理测试集上，IDPR仅对8.20%的示例调用慢速推理，并将准确率从47.90%提升至48.92%。在相同的慢速调用预算下，随机路由将准确率降至46.76%，而最强的基于置信度的基线达到48.22%。IDPR还实现了最高的纠正精度，表明响应条件抑制能更好地识别受益于慢速推理的快速答案。

English abstract

Reasoning Large Language Models can improve problem-solving performance through deliberative inference, but invoking slow reasoning for every input is computationally expensive and often unnecessary. We propose IDPR, a framework for response-conditioned inhibitory deliberation. IDPR first generates a concise intuitive answer and then uses an inhibition controller to decide whether that specific response should be released or suppressed in favor of slow reasoning. Unlike input-only routers, the inhibition controller conditions on the fast answer and fast-side evidence, including confidence, logit margin, parseability, and generation cost. We train the controller from paired fast-slow outcomes and select the inhibition threshold on a held-out validation set under an accuracy-first slow-call budget. On a held-out 5,000-example mathematical reasoning test set, IDPR invokes slow reasoning on only 8.20% of examples and improves accuracy from 47.90% to 48.92%. Under the same slow-call budget, random routing decreases accuracy to 46.76%, while the strongest confidence-based baseline reaches 48.22%. IDPR also achieves the highest corrective precision, showing that response-conditioned inhibition better identifies fast answers that benefit from slow reasoning.
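The threshold-selection step described in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The `Example` record, the `inhibit_score` field, the toy validation data, and the reading of "accuracy-first" as "maximize validation accuracy subject to a cap on the slow-call rate" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Example:
    inhibit_score: float  # hypothetical controller output; higher means "suppress the fast answer"
    fast_correct: bool    # was the fast answer right?
    slow_correct: bool    # was the slow-reasoning answer right?


def evaluate(examples: list[Example], threshold: float) -> tuple[float, float]:
    """Return (accuracy, slow_call_rate) when fast answers scoring >= threshold are suppressed."""
    correct = slow_calls = 0
    for ex in examples:
        use_slow = ex.inhibit_score >= threshold
        slow_calls += use_slow
        correct += ex.slow_correct if use_slow else ex.fast_correct
    n = len(examples)
    return correct / n, slow_calls / n


def select_threshold(examples: list[Example], budget: float) -> tuple[float, float, float]:
    """Best-accuracy threshold whose slow-call rate stays within budget; ties keep the lower threshold."""
    best = (float("inf"), *evaluate(examples, float("inf")))  # start from "never inhibit"
    for t in sorted({ex.inhibit_score for ex in examples}):
        acc, rate = evaluate(examples, t)
        if rate <= budget and acc > best[1]:
            best = (t, acc, rate)
    return best


# Toy validation set (illustrative values only).
val = [
    Example(0.9, False, True),
    Example(0.8, False, False),
    Example(0.6, True, True),
    Example(0.2, True, False),
    Example(0.1, True, True),
]
threshold, acc, rate = select_threshold(val, budget=0.4)
```

For scale, on the reported 5,000-example test set an 8.20% slow-call rate corresponds to 410 slow calls, and the gain from 47.90% to 48.92% corresponds to a net 51 additional correct answers.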

2606.06748 2026-06-08 cs.CL cs.AI cs.LG New submission

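Quantifying Media Representation Dynamics in 25 Years of News Coverage of Police-Involved Deaths [translated from the Chinese title]

量化25年警务相关死亡新闻报道中的媒体表征动态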

Farhan Samir, Jappun Dhillon, Meghna Ravikumar, Syed Ishtiaque Ahmed, Vered Shwartz

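Affiliations: University of Toronto; University of British Columbia

AI summary: Analyzes 4,000 Canadian news articles spanning 25 years and proposes the PerspectiveGap model, finding that state-bureaucrat perspectives appear nearly three times as often as public perspectives and that civilian representation has increased in recent years.

Comments: 9 pages, 6 figures. Websci'26

DOI: 10.1145/3795766.3799754
Journal ref: Proceedings of the 18th ACM Web Science Conference 2026 (pp. 421-429)

AI Chinese abstract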

我们进行了迄今为止最大规模的加拿大警务相关死亡新闻叙事计算分析，涵盖了过去25年间的4000篇文章。我们开发了一个新颖的计算模型PerspectiveGap，该模型基于先前关于警务媒体表征的社会学研究。我们发现，关于警务相关死亡的报道平均而言，国家官僚视角的出现频率几乎是其他公众成员（包括亲属、社区成员、目击者、代表家庭的律师或公民自由团体）视角的三倍。相当一部分文章完全没有平民行为者的观点，尽管近年来平民代表有所增加。定性分析表明，国家官僚对这些死亡的描述往往是临床和程序性的，而平民话语则带有明显更多的情感色彩。这里开发的PerspectiveGap框架可以适用于其他司法管辖区，提供了一种可扩展的方法来分析媒体系统如何构建关于警务和问责的叙事。

English abstract

We perform the largest known computational analysis of Canadian news narratives about police-involved deaths, spanning 4,000 articles from the last quarter-century. We develop a novel computational model, PerspectiveGap, grounded in prior sociological work on media representation of policing. We find that reporting on police-involved deaths on average features perspectives from state bureaucrats at nearly three times the rate of perspectives from other members of the public, including relatives, community members, eyewitnesses, lawyers representing the family, or civil liberties groups. A considerable fraction of articles contain no points of view from civilian actors, though civilian representation has increased in recent years. Qualitatively, we find that state bureaucrats' accounts of these deaths tend to be clinical and procedural, while civilian discourse carries considerably more emotional valence. The PerspectiveGap framework developed here can be contextualized to other jurisdictions, offering a scalable approach for analyzing how media systems construct narratives around policing and accountability.

2606.06825 2026-06-08 cs.CL cs.AI New submission

Progress-SQL: Improving Reinforcement Learning for Text-to-SQL via Progressive Rewards

Progress-SQL: 通过渐进式奖励改进文本到SQL的强化学习

Shihao Zhang, Xiaoman Wang, Yuan Liu, Yunshi Lan, Weining Qian

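Affiliations: East China Normal University

AI summary: Proposes Progress-SQL, a multi-turn reinforcement learning framework that uses an Oracle-guided Diagnostic Tree (ODT) to produce clause-level structural feedback and combines it with progressive rewards (structural alignment, lexical alignment, a latency reward, and an execution status reward) to improve the accuracy and robustness of Text-to-SQL generation.

AI Chinese abstract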

强化学习最近在改进大型语言模型进行文本到SQL生成方面显示出潜力，但现有方法通常优化基于单个SQL状态定义的一次性奖励。这种奖励为迭代SQL纠正提供的指导有限，不足以捕捉多轮SQL改进的提升。在本文中，我们提出Progress-SQL，一种具有渐进式奖励的多轮强化学习框架，用于文本到SQL。我们的方法引入Oracle引导诊断树（ODT），它将SQL查询抽象为子句级结构轮廓，并为下一轮改进生成诊断反馈。为了提供密集且稳健的奖励信号，我们将基于ODT的结构对齐与词汇对齐相结合，并定义一个渐进式奖励，衡量从初始SQL到最终SQL的改进。我们进一步加入一个偏好早期正确性的渐进延迟奖励和一个鼓励从无效SQL中恢复的执行状态奖励。在BIRD、Spider和Spider鲁棒性变体上的实验表明，我们的方法在主要评估和鲁棒性评估上均一致提升了文本到SQL的性能。

English abstract

Reinforcement learning has recently shown promise in improving large language models for Text-to-SQL generation, yet existing methods typically optimize one-shot rewards defined over a single SQL state. Such rewards provide limited guidance for iterative SQL correction and are insufficient to capture the improvement of multi-turn SQL refinement. In this paper, we propose Progress-SQL, a multi-turn reinforcement learning framework with progressive rewards for Text-to-SQL. Our approach introduces an Oracle-guided Diagnostic Tree (ODT), which abstracts SQL queries into clause-level structural profiles and produces diagnostic feedback for next-turn refinement. To provide dense and robust reward signals, we combine ODT-based structural alignment with lexical alignment and define a progressive reward that measures the improvement from the initial SQL to the final SQL. We further incorporate a progression latency reward that favors earlier correctness and an execution status reward that encourages recovery from invalid SQL. Experiments on BIRD, Spider, and Spider robustness variants demonstrate that our method consistently improves Text-to-SQL performance across both primary and robustness evaluations.
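The idea of a progressive reward, scoring how much the final SQL improves on the initial SQL, can be sketched as follows. This is a minimal sketch that models only a lexical-alignment term. The token-level Jaccard overlap, the regex tokenizer, and all function names are hypothetical stand-ins. The ODT-based structural alignment, latency reward, and execution status reward from the paper are not modeled here.

```python
import re


def sql_tokens(sql: str) -> set[str]:
    """Lowercased identifier, keyword, and number tokens; a crude stand-in for a real SQL tokenizer."""
    return set(re.findall(r"[a-z_][a-z0-9_]*|\d+", sql.lower()))


def lexical_alignment(pred: str, gold: str) -> float:
    """Jaccard overlap between predicted and gold token sets, in [0, 1]."""
    p, g = sql_tokens(pred), sql_tokens(gold)
    return len(p & g) / len(p | g) if p | g else 1.0


def progressive_reward(initial_sql: str, final_sql: str, gold_sql: str) -> float:
    """Positive when the final turn is closer to the gold query than the first turn was."""
    return lexical_alignment(final_sql, gold_sql) - lexical_alignment(initial_sql, gold_sql)


# Illustrative two-turn refinement: the second turn adds the missing WHERE clause.
gold = "SELECT name FROM singer WHERE age > 30"
first = "SELECT name FROM singer"
last = "SELECT name FROM singer WHERE age > 30"
reward = progressive_reward(first, last, gold)
```

A difference-based reward like this is zero when refinement changes nothing and negative when a later turn moves away from the gold query, which is what distinguishes it from a one-shot reward on the final SQL alone.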

2606.06835 2026-06-08 cs.CL New submission

Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning

Translate-R1：通过强化学习实现成本感知的翻译工具使用

Pratik Jayarao, Chaitanya Dwivedi, Himanshu Gupta, Neeraj Varshney, Adithya M Devraj, Meet Vadera, Priyanka Nigam, Bing Yin

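Affiliations: Amazon Stores Foundation AI

AI summary: Proposes a reinforcement-learning-based gated policy that lets an LLM assess its own comprehension and call the translation tool only when needed, raising reward and lowering translation cost across 22 languages.

Comments: 14 pages main text plus appendix, 7 figures, 11 tables

AI Chinese abstract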

LLM在不同语言上的性能差距已有充分记录，而原生缩小差距需要对大多数语言不存在的语料库进行预训练或微调。翻译提供了一种替代方案：将输入转换为模型的主导语言，从而立即释放其全部能力。然而，对每个输入都应用翻译对于模型已能处理的语言来说是浪费的，而将选择权留给模型则相反地失败，因为LLM过于自信，即使无法理解输入也会跳过工具。先前的工作通过语言特定规则、领域启发式、语言标识符或外部路由器来解决这一问题，每种方法都需要手动工程。我们转而学习一个单一策略，仅从奖励中决定何时翻译，开发出语言和领域自适应的内省能力，评估自身理解能力，并仅在无法原生解决任务时调用翻译。使用我们保留答案的翻译流水线构建的数据，我们在后训练的Qwen3-4B上继续RL，涵盖3个资源层级（高、低、极低）的22种语言和5个领域，并引入置信度门控GSPO用于成本敏感的工具使用。门控策略在基线基础上将奖励提升：高资源+4.6，低资源+23.5，极低资源+17.5。与几乎总是翻译的无约束策略相比，它以63%的成本保留了全部奖励，并在87%的成本敏感范围内是帕累托最优的。此外，为了模拟在完全未见语言上的行为，我们创建了2种合成语言，在这些语言上，我们的门控策略比过度自信的基线（即使在这些不可理解的输入上也未充分利用工具）提升了+18.7。该策略零样本迁移到9种保留语言，我们分析了工具使用在训练过程中如何按语言和领域出现。

English abstract

The performance gap across languages in LLMs is well documented, and closing it natively requires pretraining or fine-tuning on corpora that, for most languages, do not exist. Translation offers an alternative: converting an input into the model's dominant language unlocks its full capabilities at once. Applying translation to every input, however, is wasteful for languages the model already handles, while leaving the choice to the model fails in the opposite way, as LLMs are overconfident and skip the tool even when they cannot understand the input. Prior work resolves this with language-specific rules, domain heuristics, language identifiers, or external routers, each requiring manual engineering. We instead learn a single policy that decides when to translate from reward alone, developing language- and domain-adaptive introspection that assesses its own comprehension and invokes translation only when it cannot solve a task natively. Using data built by our answer-preserving translation pipeline, we continue RL on the post-trained Qwen3-4B across 22 languages in 3 resource tiers (High, Low, XLow) and 5 domains, and introduce confidence-gated GSPO for cost-sensitive tool use. The gated policy lifts reward over the baseline by +4.6 on High, +23.5 on Low, and +17.5 on XLow. Against an unconstrained policy that almost always translates, it preserves full reward at 63% of the cost and is Pareto-optimal across 87% of the cost-sensitivity range. Additionally, to simulate behavior on a completely unseen language, we create 2 synthetic languages, where our gated policy improves +18.7 over the overconfident baseline that underutilizes the tool even on these incomprehensible inputs. The policy transfers zero-shot to 9 held-out languages, and we analyze how tool use emerges over training, per language and per domain.

2606.06840 2026-06-08 cs.CL cs.AI cs.LG New submission

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

先刻画再蒸馏：大输出空间中的机械推理

Debjyoti Saha Roy, Byron C. Wallace, Javed A. Aslam

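Affiliations: Khoury College of Computer Sciences, Northeastern University

AI summary: Studies the mechanism by which modern reasoning models perform zero-shot multi-label classification over label spaces with millions of labels, proposes a two-stage account of shortlist generation followed by fine-grained reasoning, and builds on it a mechanistic distillation strategy that outperforms standard distillation.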

2606.06842 2026-06-08 cs.CL New submission

CRAFT: A Unified Counterfactual Reasoning Framework for Tabular Question Answering and Fact Verification

CRAFT：面向表格问答与事实验证的统一反事实推理框架

Chenshuo Pan, Yu Zhao, Jie Zhang, Changzai Pan, Zhenhe Wu, Jiayi Liang, Yujie Mao, Shuangyong Song, Yongxiang Li, Zhongjiang He

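Affiliations: Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd

AI summary: Proposes CRAFT, a unified counterfactual reasoning framework that recasts tabular question answering and fact verification as a bidirectional verification process: it constructs statements and their counterfactual variants and integrates the evidence with a weighting mechanism, substantially improving performance on complex table reasoning.

Comments: 24 pages, 10 figures

AI Chinese abstract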

表格推理对大型语言模型（LLMs）仍然具有挑战性，尤其是在需要多步推理的长且结构化的表格任务中。现有方法主要依赖单向推理，限制了其跨任务探索替代假设的能力。在这项工作中，我们提出了CRAFT，一个统一的反事实推理框架，将表格问答和事实验证重新表述为通用的双向验证过程。我们的方法显式地构建声明性陈述及其反事实变体。然后，沿着原始路径和反事实路径进行推理提取证据，并通过加权机制整合以得出最终答案。实验结果表明，我们的方法在WikiTQ和TabFact等表格推理数据集上持续优于代表性基线，在复杂问答上取得了特别大的改进。我们的框架还显著缩小了不同骨干LLM之间的性能差距。这表明反事实推理有效克服了单向推理的局限性，引导LLM进行更具辨别力的推理，并为结构化推理任务建立了更原则性的范式。我们的代码将在接收后公开。

English abstract

Table reasoning remains challenging for large language models (LLMs), particularly in tasks that require multi-step inference over long and structured tables. Existing approaches predominantly rely on single-direction reasoning, which limits their ability to explore alternative hypotheses across tasks. In this work, we propose CRAFT, a unified Counterfactual Reasoning Framework that reformulates Tabular question answering and fact verification into a general bidirectional verification process. Our method explicitly constructs both declarative statements and their counterfactual variants. Evidence is then extracted from reasoning along both the original and counterfactual paths, and integrated via a weighted mechanism to arrive at the final answer. Experimental results show that our approach consistently surpasses representative baselines on table reasoning datasets such as WikiTQ and TabFact, achieving especially large improvements on complex question answering. Our framework also significantly mitigates performance gaps between different backbone LLMs. This indicates that counterfactual reasoning effectively overcomes the limitations of single-direction inference, guiding LLMs toward more discerning reasoning and establishing a more principled paradigm for structured reasoning tasks. Our code will be made publicly available upon acceptance.

2606.06857 2026-06-08 cs.CL New submission

Interpreting Brain Responses to Language with Sparse Features from Language Models

用语言模型稀疏特征解释大脑对语言的响应

Michael A. Lepori, Kendrick Kay, Greta Tuckute

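Affiliations: Brown University; University of Minnesota; Harvard University

AI summary: Proposes Augmented Sparse Encoding Models, which replace dense LM hidden states with hierarchically organized sparse autoencoder features and add a surprisal predictor, to explain responses in language cortex. The fronto-temporal language network is predicted by a common set of features, and brain responses correspond to the most general features in the LM.

AI Chinese abstract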

认知神经科学的一个核心目标是刻画人类语言皮层所表征的特征。人工语言模型已成为应对这一挑战的有力工具，但将生物表征与人工表征相关联的研究常被批评为将一个黑箱与另一个黑箱相关联。本文引入增强稀疏编码模型，一种用分层组织的稀疏自编码器特征替代密集LM隐状态，并显式包含惊奇度作为预测因子的编码框架。利用该方法，我们(i) 产生对神经响应的解释，并(ii) 测试模型-大脑对齐是否反映了LM表征中的主要变异或特异变异。使用8名参与者聆听200句语言多样性句子的高场7T fMRI数据集，我们首先通过恢复先前对处理难度和意义抽象性调谐的体素群体的解释来验证建模框架。然后，我们解释了一个先前未表征（但可靠）的体素群体，发现其调谐于与人相关的内容。接着，我们显示额颞叶人类语言网络由其组成区域间的共同特征集预测，但发现额叶区域即使在没有LM特征的情况下也能被惊奇度单独较好地解释。最后，我们显示语言处理过程中的大脑响应并非仅能从任意一组LM特征预测。相反，大脑响应最好由倾向于捕捉LM表征中编码的最通用信息的特征解释，表明大脑与LM语言表征之间存在非平凡的对齐。

English abstract

A central goal of cognitive neuroscience is to characterize the features that are represented by human language cortex. Artificial language models (LMs) have emerged as a powerful tool to address this challenge, but studies relating biological and artificial representations are often criticized as relating one black box to another. The present work introduces Augmented Sparse Encoding Models, an encoding framework that replaces dense LM hidden states with hierarchically-organized sparse autoencoder (SAE) features, while explicitly including surprisal as a predictor. Using this approach, we (i) produce interpretations of neural responses and (ii) test whether model-brain alignment reflects primary or idiosyncratic variation in LM representations. Using a high-field 7T fMRI dataset of eight participants listening to 200 linguistically diverse sentences, we first validate our modeling framework by recovering previous interpretations of voxel populations tuned to processing difficulty and meaning abstractness. We then interpret a previously-uncharacterized (but reliable) voxel population and find that it is tuned to people-related content. Next, we show that the fronto-temporal human language network is predicted by a common set of features across its constituent regions, but find that frontal regions are relatively well-explained by surprisal alone, even in the absence of LM-based features. Finally, we show that brain responses during language processing are not merely predictable from an arbitrary set of LM features. Rather, brain responses are best explained by the features that tend to capture the most general information encoded in LM representations, suggesting a nontrivial correspondence between brain and LM language representation.

2606.06865 2026-06-08 cs.CL New submission

Are Large Language Models Suitable for Graph Computation? Progress and Prospects

大型语言模型是否适合图计算？进展与展望

Yuting Zhang, Yi Han, Kai Wang, Wei Ni, Angela Bonifati, Wenjie Zhang

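Affiliations: University of New South Wales; Antai College of Economics and Management, Shanghai Jiao Tong University; Edith Cowan University; Lyon 1 University

AI summary: Surveys the use of LLMs in graph computation through a role-based taxonomy, analyzes the two paradigms of LLMs as executors and as planners, concludes that LLMs suit simple, small-scale tasks but are unreliable for large-scale and exactness-demanding ones, and summarizes datasets and future directions.

AI Chinese abstract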

大型语言模型（LLMs）越来越多地被探索用于图计算，其中任务需要对结构化关系和算法操作进行推理。然而，目前尚不清楚LLMs何时能可靠地支持此类计算，以及如何将它们整合到图求解流程中。现有的关于LLMs和图交叉的综述主要关注图学习、文本属性图或图语言建模。为弥补这一空白，我们通过基于角色的分类法对LLMs在图计算中的应用进行了全面综述。具体来说，我们识别出两种主要范式：i) LLMs作为执行者，模型直接从图描述和指令中解决图任务；ii) LLMs作为规划者，模型制定问题、分解推理步骤，并调用外部工具或代理执行。基于此分类法，我们分析了当前方法的优势和局限性。我们的综述表明，LLMs在简单、小规模任务中具有潜力，但在大规模和精确性要求高的任务中仍不可靠。最后，我们总结了可用的数据集，并提出了四个未来方向。

English abstract

Large language models (LLMs) have been increasingly explored for graph computation, where tasks require reasoning over structured relationships and algorithmic operations. Yet, it remains unclear when LLMs can reliably support such computation and how they should be incorporated into graph-solving pipelines. Existing surveys at the intersection of LLMs and graphs primarily focus on graph learning, text-attributed graphs, or graph-language modeling. To bridge this gap, we provide a comprehensive review of LLMs for graph computation through a role-based taxonomy. Specifically, we identify two major paradigms: i) LLMs as executors, where models directly solve graph tasks from graph descriptions and instructions; and ii) LLMs as planners, where models formulate problems, decompose reasoning steps, and invoke external tools or agents for execution. Based on this taxonomy, we analyze the strengths and limitations of current methods. Our review indicates that LLMs are promising for simple, small-scale tasks, but remain unreliable for large-scale and exactness-demanding tasks. Finally, we summarize available datasets and suggest four future directions.

2606.06879 2026-06-08 cs.CL cs.CR New submission

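Tree of Experience: A Structured Experience-Management Scheme for Self-Evolving Agents in Low-Repetition, Implicit-Reward Environments [translated from the Chinese title]

经验之树：低重复与隐式奖励环境下自演化智能体的结构化经验管理方案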

Zihao Deng, Yining Zhu, Leiming Wang, Jingfei Lu, Junbo Wang, Chuncheng Ran, Yu Yang, Dixuan Yang, Jikun Shen

AI summary: Targets low-repetition tasks with implicit rewards and proposes ToE, a structured experience-management method that organizes, retrieves, validates, and updates experience; it outperforms no-experience baselines on a financial sentiment prediction benchmark.

AI Chinese abstract

基于经验的自我演化对于LLM智能体至关重要，但现有基准通常假设明确的目标、稳定的任务模式和清晰的反馈。我们研究了一个更具挑战性的场景：具有隐式奖励的低重复任务，其中过去的经验难以重用，且反馈是延迟的、有噪声的且是结果层面的。我们引入了FinEvolveBench，一个时间控制的金融情绪预测基准，将每日新闻驱动的预测与未来超额收益联系起来。我们进一步提出了经验之树（ToE），一种结构化的经验管理方法，用于组织、检索、验证和更新智能体的经验。实验表明，通用经验机制并不一致地优于无经验基线，而ToE实现了更强的整体性能。这些结果强调了在隐式奖励环境中，结构化经验管理对于自演化智能体的重要性。

English abstract

Experience-based self-evolution is crucial for LLM agents, but existing benchmarks often assume explicit goals, stable task patterns, and clear feedback. We study a more challenging setting: low-repetition tasks with implicit rewards, where past experience is difficult to reuse and feedback is delayed, noisy, and outcome-level. We introduce FinEvolveBench, a temporally controlled benchmark for financial sentiment prediction that links daily news-driven predictions to future excess returns. We further propose Tree-of-Experience (ToE), a structured experience-management method that organizes, retrieves, validates, and updates agent experience. Experiments show that general-purpose experience mechanisms do not consistently outperform no-experience baselines, while ToE achieves stronger overall performance. These results highlight the importance of structured experience management for self-evolving agents in implicit-reward environments.

2606.06985 2026-06-08 cs.CL eess.AS New submission

Contrastive Training with LLM-generated Near-Misses for Robust Code-Switching Speech Recognition

Tung X. Nguyen, Hieu Minh Truong, Giang-Son Nguyen, Nhu Vo, Wray Buntine, Dung D. Le

Affiliations: VinUniversity; University of Technology Sydney; Monash University

AI summary: Proposes a POI-aware contrastive training framework that generates and filters near-miss negatives with a large language model, then fine-tunes Whisper-small with a POI-weighted cross-entropy loss and a multi-negative contrastive loss, reducing error rates on code-switching speech recognition by over 2%.

Comments Accepted at INTERSPEECH 2026

Abstract

Code-switching (CS), the alternation between multiple languages within a single utterance, remains challenging for Automatic Speech Recognition (ASR). To address this issue, we propose a Point-of-Interest (POI)-aware contrastive training framework that improves recognition at CS-critical regions. We first identify CS spans by adopting a POI detection method from the literature, then construct acoustically plausible near-miss hypotheses by perturbing POIs in ASR N-best outputs and expanding candidates with a large language model. Hard but plausible negatives are retained through filtering with acoustic, phonemic, and textual constraints. Finally, we fine-tune Whisper-small with LoRA using a POI-weighted cross-entropy anchor objective together with a multi-negative contrastive ranking loss. Experiments on CS-FLEURS (cmn-eng) and ViMedCSS (vie-eng) show consistent reductions of over 2% in both general and CS-aware error rates compared to standard LoRA fine-tuning.

2606.06994 2026-06-08 cs.CL cs.DB New submission

Principles of Concept Representation in Sentence Encoders

Isabelle Mohr, John Dujany, Jonathan Souquet, Andre Freitas

Affiliations: Idiap Research Institute; Merck KGaA

AI summary: Through the lens of representational compositionality, studies when sentence encoders produce good concept representations and identifies four principles: fine-tuning reshapes rather than expands the latent geometry (P1); semantic signal concentrates in specific layers (P2); hard negatives improve discrimination but not ranking (P3); and the effectiveness of supervision depends on the concept's composition type (P4).

Abstract

What makes a sentence encoder produce good concept representations? We approach this through the lens of representational compositionality: an encoder supports a concept family only when its latent space admits a low-distortion realization of the corresponding semantic operator. This framing predicts both where current encoders succeed and where they are structurally mismatched to their supervision. Through a controlled ablation over encoder conditions trained on 3.3 million synonym and definition pairs from WordNet and Wiktionary, evaluated on three decontaminated splits and a modifier-labeled noun-phrase benchmark, we identify four principles. Fine-tuning recalibrates the latent geometry rather than expanding it (P1). Semantic signal concentrates in the final transformer layer before concept-specific training begins, making cross-layer pooling redundant (P2). Hard negatives improve discrimination and stress-test robustness without improving retrieval ranking, showing that calibration and ranking are independently addressable (P3). Finally, the effectiveness of supervision depends on the composition type of the target concept. Extensional training helps intersective and subsective families while degrading relational and intensional ones, exposing a structural limitation of current training paradigms (P4). We release two new evaluation datasets: a DBpedia semantic-gap benchmark and a modifier-labeled NP paraphrase suite.

2606.07020 2026-06-08 cs.CL New submission

MADE: Beyond Scoring via a Multilingual Agentic Diagnosing Engine for Fine-Grained Evaluation Insights

Yilun Liu, Miao Zhang, Shimin Tao, Minggui He, Chunguang Zhao, Chenxin Liu, Li Zhang, Chen Liu, Cheng Qian, Liqun Deng, Xiaojun Meng, Daimeng Wei

Affiliations: Huawei

AI summary: Proposes MADE, a Multilingual Agentic Diagnosing Engine that decomposes post-evaluation analysis into planning, aggregate analysis, instance inspection, multilingual and cultural reflection, and report synthesis. In a large-scale setting (33 model families, 11 benchmarks, 26 languages), it improves diagnosis report quality by 47% and is preferred by experts in 87.9% of comparisons.

Abstract

Multilingual and multicultural benchmarks now cover dozens of languages and model families, but the resulting score landscapes remain metric-rich and insight-poor, necessitating fine-grained multilingual post-evaluation diagnosis. However, single LLMs and open-ended agents are easily swamped by the long, noisy diagnostic input, and no reusable taxonomy exists for it. To address this, we propose MADE, a Multilingual Agentic Diagnosing Engine that decomposes post-evaluation analysis into planning, aggregate analysis, instance-level case inspection, multilingual and cultural reflection, and grounded report synthesis. MADE is paired with an expert-led 54-query and 15-language diagnostic set, evaluated on top of a large-scale multilingual evaluation substrate (33 model families, 11 benchmarks, 26 languages, 34 cultures, 8.66M evaluation records). Experiments show that MADE outperforms the strongest shared baseline by 47% in diagnosis report quality and is preferred by human multilingual experts in 87.9% of pairwise comparisons. Applied with multilingual experts, MADE further surfaces four actionable findings on deployment, iteration, and cross-cultural pitfalls, turning benchmark score tables into model-selection and remediation guidance.

2606.07054 2026-06-08 cs.CL cs.AI cs.CR cs.LG New submission

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

Vijitha Mittapalli, Shreyaa Jayant Dani, Satya Srujana Pilli, Snigdha Ansu, Mohammadreza Teymoorianfard, Franck Dernoncourt, Hongjie Chen, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed

Affiliations: University of Massachusetts at Amherst; Adobe Research; Dolby Labs; University of Oregon; Cisco

AI summary: Proposes TRACE, a framework whose TIJ loop identifies high-signal regions, accumulates evidence across steps, and synthesizes a trajectory-level verdict. On ten task domains from SHADE-Arena it reaches an F1 of 0.713 and a recall of 0.844, and is particularly strong at long-range evidence linking.

Abstract

Autonomous LLM agents can pursue hidden malicious objectives through sequences of individually benign actions, making sabotage difficult to detect using standard trajectory-level monitoring. Existing approaches either evaluate complete trajectories in a single pass or partition them into independently scored windows, limiting their ability to connect evidence across temporally distant actions. We propose TRACE, a monitoring framework for long-horizon LLM agent trajectories. TRACE operates through a TIJ (Triage-Inspect-Judge) loop that identifies high-signal regions, performs targeted inspection while maintaining accumulated evidence across reasoning steps, and synthesizes a trajectory-level verdict. We evaluate TRACE on ten task domains from SHADE-Arena against state-of-the-art baselines. TRACE achieves an aggregate F1 of 0.713 and recall of 0.844, with the largest gains on tasks requiring long-range evidence linking.

2606.07069 2026-06-08 cs.CL cs.CY New submission

mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?

Yerzhan Sapenov, Jaromir Savelka

Affiliations: Independent Scholar; School of Computer Science, Carnegie Mellon University

AI summary: Proposes mmPISA-bench, a PISA-based multilingual reasoning benchmark of 25 multiple-choice questions officially translated into 43 languages, and evaluates LLMs across languages, reasoning effort levels, and translation types. Modern LLMs reason effectively in all languages and machine translation does not hurt accuracy, but some languages are both more expensive and less accurate.

Abstract

We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each question is provided in official human translations to 43 languages and complemented with machine-translated counterparts (i.e., 2,150 data points in total). We evaluate two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types in terms of their ability to answer the questions correctly. Our results show that modern LLMs can reason effectively across all evaluated languages, achieving accuracy comparable to human test-takers, with some performance variations across covered languages. We further find that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation (synthetic data) might often be adequate for large-scale multilingual reasoning evaluations where official translations are not available. Finally, we analyze token usage and related inference cost and find that LLM usage in some languages is simultaneously more expensive and less accurate.

2606.07098 2026-06-08 cs.CL cs.LG New submission

SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices

Ernests Lavrinovics, Marco Letizia, Roy Janco, Shai Segal, Johannes Bjerva, Maurizio Pierini

Affiliations: Department of Computer Science, Aalborg University Copenhagen; MaLGa-DIBRIS, University of Genoa; INFN, Sezione di Genova; European Organization for Nuclear Research (CERN); Ceva, Inc.

AI summary: Proposes SigmaScale, which learns auxiliary scaling matrices to improve truncated-SVD compression of LLMs, lowering the effective rank of weight matrices and achieving competitive performance on Llama 3.1 8B and Qwen3-8B.

Abstract

We present SigmaScale, a method for learning auxiliary scaling matrices $S$ to aid truncated Singular Value Decomposition (SVD) based Large Language Model (LLM) compression. Instead of deriving scaling matrices analytically, SigmaScale optimizes two sets of vectors that define diagonal row and column scaling transformations under an activation-aware compression loss. We show that learned scaling lowers the effective intrinsic rank of weight matrices, as reflected by reductions in effective-rank entropy, and that this reduction is strongly correlated with compression loss. Experiments on Llama 3.1 8B Instruct and Qwen3-8B show that SigmaScale is competitive with closely related state-of-the-art SVD-based compression methods across perplexity and zero-shot benchmarks. By using learned activation-aware transformations, SigmaScale explores a more flexible route to low-rank LLM compression by adapting to the structure of individual model weights. The advantage observed in specific tasks makes our approach a valid option for applications requiring a reduced LLM-inference computing cost.
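The abstract does not define effective-rank entropy. The sketch below assumes one common definition: the Shannon entropy of the singular values normalized to sum to 1, with the effective rank taken as the exponential of that entropy. The helper name `effective_rank` and the example spectra are hypothetical and are not taken from the paper.

```python
import math

def effective_rank(singular_values):
    """Return (entropy, effective rank) of a singular-value spectrum.

    Assumption: entropy is the Shannon entropy of the singular values
    normalized to sum to 1, and effective rank is exp(entropy).
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy, math.exp(entropy)

# Illustrative spectra (not from the paper): one flat, one concentrated.
flat = [1.0, 1.0, 1.0, 1.0]
peaked = [10.0, 0.5, 0.3, 0.2]

h_flat, r_flat = effective_rank(flat)
h_peaked, r_peaked = effective_rank(peaked)
```

Under this definition a more concentrated spectrum has a lower effective rank, which is the property one would want a scaling transformation to encourage before truncating the SVD.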

2606.07103 2026-06-08 cs.CL New submission

Style or Content? Evaluating Style Classifiers with Controlled Content Overlap

Zhuo Liu, Haozheng Du, Xiangxiang Xu, Hangfeng He

Affiliations: University of Rochester

AI summary: Proposes an evaluation method with controlled content overlap, constructing an overlap parameter α from parallel Bible translations. Low-overlap models rely on content cues while high-overlap models are more robust, giving a diagnostic for separating style learning from content shortcuts.

Comments 9 pages

Abstract

Style classifiers can use content cues that correlate with style labels in naturally collected data, yet we lack a systematic way to measure this reliance. We study this problem with a controlled content overlap setup built on parallel Bible translations. Specifically, we define the overlap parameter $α$ as the normalized residual of mutual information between content identity and style label, so that it measures how much content is shared across style classes: from no shared content ($α=0$) to fully shared content ($α=1$). Cross-overlap evaluation of RoBERTa-based classifiers shows that low-overlap models degrade when content cues are removed, while high-overlap models transfer more robustly. A cross-style content retrieval probe further shows that content becomes less recoverable as $α$ increases, with training dynamics showing this removal occurs gradually. Together, these results suggest that controlled overlap provides a simple diagnostic for separating style learning from content shortcuts.
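The abstract gives the endpoints of $α$ but not the exact normalization. One reading that matches both endpoints is $α = 1 - I(C;S)/H(S)$, where $C$ is content identity and $S$ is the style label. The sketch below implements that assumed form; `overlap_alpha` and the toy verse identifiers are hypothetical, and the function expects at least two style classes so that $H(S) > 0$.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits of an empirical distribution given as counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def overlap_alpha(pairs):
    """Assumed form: alpha = 1 - I(content; style) / H(style).

    pairs: list of (content_id, style_label) tuples, one per training example.
    """
    h_style = entropy(Counter(style for _, style in pairs))
    h_content = entropy(Counter(content for content, _ in pairs))
    h_joint = entropy(Counter(pairs))
    mutual_info = h_style + h_content - h_joint
    return 1.0 - mutual_info / h_style

# Toy data: each style sees different verses (no shared content) ...
disjoint = [("v1", "A"), ("v2", "A"), ("v3", "B"), ("v4", "B")]
# ... versus both styles seeing the same verses (fully shared content).
shared = [("v1", "A"), ("v2", "A"), ("v1", "B"), ("v2", "B")]

alpha_disjoint = overlap_alpha(disjoint)
alpha_shared = overlap_alpha(shared)
```

With disjoint content, knowing the verse determines the style, so the mutual information equals $H(S)$ and $α$ is 0. With fully shared content the two variables are independent and $α$ is 1.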

2606.07123 2026-06-08 cs.CL New submission

Learning Perspectivist Social Meaning via Demographic-Conditioned Fusion Embeddings

Amanda Cercas Curry, Lucio La Cava, Luca Maria Aiello, Gianmarco De Francisci Morales

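Affiliations: Independent Researcher; University of Calabria; IT University of Copenhagen; CENTAI

AI summary: Proposes fusion embeddings that integrate text with demographic information to model perspective-dependent differences in the interpretation of social meaning on a 28k human-annotated dataset, improving macro PR-AUC by 5.9-6.5% relative to text-only baselines.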

2606.07130 2026-06-08 cs.CL New submission

From Correctness to Utility: Gain-Based Evaluation of LLM Reasoning Prefixes

Yuhang Zhou, Yixin Cao, Guangnan Ye

Affiliations: Fudan University; Shanghai Innovation Institute

AI summary: Introduces the notion of prefix gain and trains a Prefix Utility Model (PUM) with a pairwise ranking objective to score how much a reasoning prefix raises the success rate, outperforming conventional correctness-based evaluation on mathematical reasoning tasks.

Abstract

Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect we ultimately care about: whether a prefix increases the probability of successful completion. We define this effect as prefix gain, the solve-rate improvement induced by conditioning a group of lightweight student models on a prefix, and use it to train a Prefix Utility Model (PUM) with a simple pairwise ranking objective. PUM learns outcome-grounded prefix utility and can score both complete trajectories and partial reasoning prefixes. Across Best-of-$N$ selection, beam search, and reinforcement learning on mathematical reasoning, PUM provides a strong prefix-level supervision signal, especially when candidate pools are large, search budgets increase, or rule-based rewards are sparse. We release all data, models, and code at https://zhiqix.github.io/pum-project-page.
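A minimal sketch of the two quantities the abstract names, under stated assumptions: prefix gain is computed as the difference between the student group's solve rate with and without the prefix, and the pairwise ranking objective is taken to be a logistic (Bradley-Terry style) loss. The abstract confirms neither detail; `prefix_gain`, `pairwise_ranking_loss`, and the 0/1 outcome lists are hypothetical.

```python
import math

def prefix_gain(solved_with_prefix, solved_without_prefix):
    """Solve-rate improvement from conditioning on a prefix.

    Each argument is a list of 0/1 outcomes from the student model group.
    """
    rate_with = sum(solved_with_prefix) / len(solved_with_prefix)
    rate_without = sum(solved_without_prefix) / len(solved_without_prefix)
    return rate_with - rate_without

def pairwise_ranking_loss(score_better, score_worse):
    """Logistic ranking loss (assumed form): small when the higher-gain
    prefix receives the higher utility score."""
    return math.log(1.0 + math.exp(-(score_better - score_worse)))

# Illustrative outcomes (not from the paper) for two prefixes of one problem.
gain_a = prefix_gain([1, 1, 1, 0], [1, 0, 0, 0])  # solve rate 0.75 vs 0.25
gain_b = prefix_gain([0, 1, 0, 0], [1, 0, 0, 0])  # solve rate 0.25 vs 0.25

# Prefix A has the larger gain, so a utility model should score it higher.
loss_correct_order = pairwise_ranking_loss(2.0, 0.0)
loss_wrong_order = pairwise_ranking_loss(0.0, 2.0)
```

Training on pairs ordered by gain pushes the model to rank prefixes by how much they help, which is a different target from labeling each step correct or incorrect.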

2606.07219 2026-06-08 cs.CL cs.SI New submission

Adversarial Creation and Detection of AI-Generated Social Bot Content

Mykola Trokhymovych, Ricardo Baeza-Yates, Alessandro Flammini, Diego Saez-Trumper, Filippo Menczer

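Affiliations: Universitat Pompeu Fabra; Observatory on Social Media, Indiana University; KTH Royal Institute of Technology

AI summary: Proposes an adversarial approach that simulates malicious users impersonating real people, builds a multilingual, cross-platform paired dataset, and trains detection models that significantly outperform existing methods.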

2606.07237 2026-06-08 cs.CL cs.AI cs.LG New submission

When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

Mahdi Alkaeed

Affiliations: Department of Computer Science and Engineering, Doha, Qatar

AI summary: Systematically analyzes the sensitivity of general-purpose and medical-specific LLMs to prompt perturbations, finding that even minor wording changes can alter clinical advice and that adversarial prompts can elicit harmful outputs, which makes these models unreliable for clinical use.

Comments 12 pages

Abstract

Large Language Models (LLMs) are increasingly used in healthcare for tasks such as clinical question answering, diagnosis support, and report summarization. Despite their promise, these models remain highly sensitive to subtle prompt perturbations, both lexical and syntactic, posing serious risks in safety-critical clinical applications. In this study, we conduct a systematic sensitivity analysis to evaluate the robustness of both general-purpose (e.g., GPT-3.5, Llama3) and medical-specific LLMs (e.g., ClinicalBERT, BioLlama3, BioBERT) using the MedMCQA benchmark. We categorize perturbations into natural and adversarial types and examine their effect on model consistency, accuracy, and reliability in clinical reasoning tasks. Our findings reveal that medical LLMs are not intrinsically safe. Even minor variations in phrasing can alter clinical advice, and targeted adversarial prompts can provoke harmful outputs. In high-stakes settings like healthcare, such unpredictability is unacceptable: models that change diagnoses due to reworded inputs or hallucinate medications when slightly rephrased cannot be reliably trusted by clinicians. While models tend to show resilience to simple lexical substitutions or paraphrasing, they often break down under syntactic reordering or misleading contextual cues. This fragility is evident across both general-purpose and domain-specific LLMs. Notably, adversarial manipulations can lead to clinically dangerous outputs, such as recommending incorrect dosages or omitting critical findings.

2606.07240 2026-06-08 cs.CL cs.SD New submission

KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

Seymanur Akti, Alexander Waibel

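Affiliations: Karlsruhe Institute of Technology (KIT); Carnegie Mellon University (CMU); KIT Campus Transfer (KCT)

AI summary: To address accent shift and domain vocabulary in cross-lingual voice cloning, builds on the multilingual text-to-speech model FishAudio-S2-Pro and adds language-tag prompting, reinforcement-learning fine-tuning, and reference-conditioned vocabulary matching to improve intelligibility and naturalness.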

2606.07300 2026-06-08 cs.CL New submission

Phun-Bench: Evaluating LLMs on Phonological Understanding in Chinese

Xing Yue, Yongliang Shen, Weiming Lu

Affiliations: Zhejiang University

AI summary: Proposes Phun-Bench, a benchmark that systematically evaluates the phonological understanding of LLMs along three dimensions (homophony, rhyme, and phonetic similarity), and finds that models fall short at applying phonological knowledge flexibly.

Comments Accepted to ACL 2026 Main Conference

Abstract

Language is a vehicle for thought, intricately tied to sounds, symbols, and meaning. However, most large language model (LLM) research focuses on meaning (semantics) and symbols (spelling) while largely overlooking sounds. Existing benchmarks on LLMs' phonological abilities are either solvable through rote memorization or intertwined with other abilities, making them inadequate to measure LLMs' genuine ability in phonological understanding. Here, we present Phun-Bench, a purpose-built Chinese benchmark with diverse tasks and settings across three dimensions (Homophony, Rhyme, and Phonetic Similarity), designed to systematically evaluate LLMs' phonological understanding. Our results show that while LLMs excel at recalling correct pronunciations, they generally struggle to leverage phonological knowledge in the flexible and intuitive way that human speakers do. Moreover, through detailed analyses, we propose a hypothesis regarding the underlying mechanism of LLMs' phonological understanding and "perception", highlighting an underexplored frontier for future research.

2606.07313 2026-06-08 cs.CL cs.AI New submission

SV-Detect: AI-generated Text Detection with Steering Vectors

Mikhail Vishnyakov, Tatiana Gaintseva

Affiliations: Independent Researcher; Queen Mary University of London

AI summary: Extracts steering vectors from the hidden representations of a frozen language model and trains a lightweight classifier on layer-wise projection features to detect machine-generated text across domains, across source models, and under editing attacks.

Abstract

Detecting machine-generated text is especially difficult under distribution shift, such as transfer across domains, source models, and editing attacks. We propose a fake-text detector based on steering vectors extracted from the hidden representations of a frozen language model. At each layer, we construct a direction that separates human-written from machine-generated text, and represent each input by its layer-wise alignment with these directions. A lightweight classifier trained on these projection features yields the final detection score. Our method achieves strong performance both in-distribution and under distribution shift, including across domains, source models, and machine-editing transformations such as polishing and rewriting. Interpretation analyses show that the learned directions align with recognizable stylistic cues while capturing substantial additional signal beyond surface features. These results position fake-text detection as a representation-space probing problem and show that steering vectors provide a simple and effective solution.
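The feature construction described above can be sketched as follows, assuming the per-layer direction is the normalized difference between the mean machine-generated and mean human-written hidden states. The abstract does not say how the directions are built, so this difference-of-means choice, the helper names, and the toy 2-dimensional hidden states are all illustrative assumptions.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_direction(human_states, machine_states):
    """Unit vector pointing from the human mean to the machine mean at one layer."""
    h, m = mean_vector(human_states), mean_vector(machine_states)
    diff = [mi - hi for hi, mi in zip(h, m)]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def projection_features(layer_states, directions):
    """One scalar per layer: dot product of the input's hidden state
    with that layer's direction."""
    return [sum(x * d for x, d in zip(state, direction))
            for state, direction in zip(layer_states, directions)]

# Toy hidden states for two layers, keyed by layer index (illustrative only).
human = {0: [[0.0, 1.0], [0.0, 3.0]], 1: [[1.0, 0.0], [1.0, 2.0]]}
machine = {0: [[2.0, 2.0], [4.0, 2.0]], 1: [[1.0, 5.0], [1.0, 7.0]]}

directions = [steering_direction(human[layer], machine[layer]) for layer in (0, 1)]
# Hidden states of a new input at layers 0 and 1.
features = projection_features([[3.0, 2.0], [1.0, 6.0]], directions)
```

The resulting feature vector has one entry per layer and would be the input to the lightweight classifier; the language model itself stays frozen throughout.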

2606.07342 2026-06-08 cs.CL cs.NE New submission

LLM-Guided Evolution for Medical Decision Pipelines

Ivan Sviridov, Artem Oskin, Ivan Panin, Iaroslav Bespalov, Dmitry Dylov, Ivan Oseledets, Aleksandr Nesterov

Affiliations: Sber AI Lab; AIRI

AI summary: Proposes LLM-guided MAP-Elites evolution, which optimizes medical decision pipelines without fine-tuning and surpasses manually designed baselines on triage, consultation, and image classification tasks.

Abstract

Adapting large language models (LLMs) to clinical workflows often requires costly fine-tuning or manual prompt and pipeline engineering. We study LLM-guided MAP-Elites evolution as an inference-time alternative for discovering medical decision strategies and provide an implementation repository at https://github.com/univanxx/llm_guided_evo_medical. We formulate urgency triage, interactive consultation, and medical image classification as evolutionary searches over executable artifacts optimized by task-specific fitness functions. Across all three settings, evolution improves over manually designed baselines under practical constraints. In triage, evolved programs increase Semigran accuracy from $77.3\%$ to $87.1\%$ and emergency recall from $0.60$ to $0.97$, while improving safety-weighted held-out MIMIC-ESI performance. In interactive consultation, evolved policies improve the accuracy--cost frontier across Llama-3, Qwen-3.5, and Gemma-4 and transfer to held-out iCRAFTMD. In PneumoniaMNIST, prompt-only evolution improves frozen MedGemma VLMs while preserving strict JSON outputs. Qualitative analysis shows that the gains come from interpretable program-level mechanisms, calibrated triage boundaries, targeted evidence acquisition, selective commitment, and finding-oriented visual decision rules, rather than superficial prompt rewording alone.

2606.07402 2026-06-08 cs.CL New submission

M$^3$Exam: Benchmarking Multimodal Memory for Realistic User-Agent Interactions

Zhengjun Huang, Wenxuan Liu, Zhoujin Tian, Wei Chen, Junle Chen, Yuqian Wu, Fangyuan Zhang, Qintian Guo, Xiaofang Zhou

Affiliations: The Hong Kong University of Science and Technology; Beijing University of Chemical Technology; The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology (Shenzhen); Beijing Institute of Technology (Zhuhai); Tencent Hy; Peng Cheng Laboratory

AI summary: Proposes the M$^3$Exam benchmark for evaluating the cross-modal reasoning and implicit-information inference of multimodal large language models in realistic user-agent interactions, and designs M$^3$Proctor, which processes visual sources on demand to raise accuracy by 13% while cutting index construction time and retrieved tokens by over 70%.

RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning

Yongliang Miao, Fengyuan Liu, Wei Shi, Yanguang Liu, Fei Sun, Na Zou, Mengnan Du

Affiliations: The Chinese University of Hong Kong, Shenzhen; Shanghai Artificial Intelligence Laboratory; New Jersey Institute of Technology; Institute of Computing Technology, CAS

AI summary: Proposes RASFT, a framework that calibrates expert supervision by problem-level solvability estimated from on-policy rollouts: it strengthens guidance when the model struggles, relaxes imitation and incorporates self-generated trajectories when the model is reliable, and uses a clipped inverse ratio to constrain policy drift, outperforming SFT and RL methods on multiple reasoning benchmarks.

Abstract

Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model's own reasoning distribution. We propose Rollout-Adaptive Supervised Fine-Tuning (RASFT), a policy-aware SFT framework that calibrates expert supervision according to problem-level solvability estimated from verified on-policy rollouts. For each problem, RASFT strengthens expert guidance when the current policy struggles, while relaxing rigid imitation and incorporating correct self-generated trajectories when the model already exhibits reliable reasoning behavior. To preserve useful reasoning priors, RASFT further introduces a clipped inverse ratio between the frozen reference model and the current policy to constrain excessive policy drift. Experiments across multiple models on six mathematical reasoning benchmarks and two code reasoning benchmarks show that RASFT achieves better overall performance than SFT, SFT variants, and representative RL methods. The code is available at https://github.com/zjd1sq/RASFT.

2606.07017 2026-06-08 cs.AI cs.CL cs.ET Cross-list

The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

Xiaoou Liu, Tiejin Chen, Weibo Li, Xiyang Hu, Hua Wei

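Affiliations: Arizona State University

AI summary: Proposes formalizing the evaluation and training gaps of foundation model agents as a classic sim-to-real problem, builds a unified framework around the four MDP elements (observation, action, transition, reward), and advocates adopting established solutions such as domain randomization.

Comments 7 pages, 2 figures, 2 tables. Accepted by KDD 2026 Blue Sky Ideas Track

DOI: 10.1145/3770855.3818660

Abstract

Foundation model agents are increasingly deployed for real-world decision-making, yet they suffer from the sim-to-real gap. While robotics and classical control have mature frameworks for addressing this gap, the foundation model community treats agent robustness as an entirely new phenomenon. Our paper proposes formalizing the evaluation and training gaps of foundation model agents as a classic sim-to-real problem, structured entirely around the four elements of a Markov decision process: observation, action, transition, and reward. We set out a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates adopting established solutions such as domain randomization. We provide concrete examples, such as multilingual tool calling, to show how a gap in the observation space can lead to operationally invalid actions even when the semantic intent is correct. Ultimately, this agenda aims to drive a paradigm shift toward a unified vocabulary and standardized stress-testing benchmarks, fostering a new generation of highly trustworthy agents for reliable real-world applications.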

Text Supervision Enhances Geospatial Representations in Vision-Language Models

Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha

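Affiliations: University of São Paulo; National University of Singapore

AI summary: Studies the geospatial representation capabilities of vision, vision-language, and multimodal models, finding that text supervision effectively improves spatial encoding, which advances geospatial AI.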

Comments Accepted at ICML 2026

2606.07229 2026-06-08 cs.SD cs.CL cs.MM Cross-list

MMAE: A Massive Multitask Audio Editing Benchmark

Ziyang Ma, Ruiqi Yan, Ruiyang Xu, Jie Fang, Zhikang Niu, Yi-Wen Chao, Wenming Tu, Tianrui Wang, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen

Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Nanyang Technological University; Hunyuan Team, Tencent; Tianjin University; Fudan University

AI summary: Proposes MMAE, the first comprehensive evaluation benchmark for general-purpose instruction-based audio editing, covering 7 audio modalities, 6 levels of task complexity, and 8 operation types. Using 2,000 samples and a rubric-based evaluation framework, it reveals severe shortcomings of current models in precise execution and structural robustness.

Comments Open-Source at https://github.com/ddlBoJack/MMAE

Abstract

We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.

2606.07297 2026-06-08 cs.SE cs.CL 交叉投稿

SWE-Explore: Benchmarking How Coding Agents Explore Repositories

SWE-Explore: 基准测试编码智能体如何探索代码仓库

Shaoqiu Zhang, Yuhang Wang, Jialiang Liang, Yuling Shi, Wenhao Zeng, Maoquan Wang, Shilin He, Ningyuan Xu, Siyu Ye, Kai Cai, Xiaodong Gu

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Xinjiang University（新疆大学）； University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Independent Researcher（独立研究者）； The Chinese University of Hong Kong（香港中文大学）

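AI summary: Proposes the SWE-Explore benchmark, which measures repository exploration by evaluating how well a coding agent, given a repository and an issue, returns a ranked list of relevant code regions; it covers 848 issues across 10 programming languages and 203 repositories.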

Comments 20 pages, 5 figures

AI中文摘要

仓库级编码基准（如SWE-bench）推动了编码智能体能力的快速提升。然而，它们通常将编码任务视为一个整体的二元预测问题（例如，已解决或未解决），忽略了细粒度的智能体能力，如仓库理解、上下文检索、代码定位和错误诊断。在本文中，我们介绍了SWE-Explore，一个隔离评估仓库探索（编码智能体的关键能力）的基准。给定一个仓库和一个问题，SWE-Explore要求探索者在固定的行预算下返回一个相关的代码区域排名列表。SWE-Explore涵盖了10种编程语言和203个开源仓库中的848个问题。对于每个实例，我们从成功解决同一问题的独立智能体轨迹中推导出行级真实标签，提炼出它们的解决方案路径实际参考的具体代码区域。我们从覆盖度、排名和上下文效率维度评估探索，表明这些指标强烈跟踪下游修复行为。在一系列广泛的检索方法、通用编码智能体和专用定位器中，我们发现智能体探索者明显优于经典检索。虽然文件级定位对于现代方法已经很强，但行级覆盖度和高效排名仍然是区分最先进探索者的关键轴。

英文摘要

Repository-level coding benchmarks such as SWE-bench have driven a rapid surge in the capabilities of coding agents. Yet they usually treat coding tasks as a holistic, binary prediction problem (e.g., resolved or unresolved), neglecting fine-grained agent capabilities such as repository understanding, context retrieval, code localization, and bug diagnosis. In this paper, we introduce SWE-Explore, a benchmark that isolates the evaluation of repository exploration, a critical capability of coding agents. Given a repository and an issue, SWE-Explore asks an explorer to return a ranked list of relevant code regions under a fixed line budget. SWE-Explore covers 848 issues across 10 programming languages and 203 open-source repositories. For each instance, we derive line-level ground truth from independent agent trajectories that successfully solved the same issue, distilling the specific code regions their solution paths actually consulted. We evaluate exploration along coverage, ranking, and context-efficiency dimensions, showing that these metrics strongly track downstream repair behavior. Across a broad set of retrieval methods, general coding agents, and specialized localizers, we find that agentic explorers form a clear tier above classical retrieval. While file-level localization is already strong for modern methods, line-level coverage and efficient ranking remain the key axes differentiating state-of-the-art explorers.
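
The task format described above (a ranked list of code regions scored against line-level ground truth under a fixed line budget) can be illustrated with a small sketch. This is not the benchmark's own scoring code: the `Region` tuple, the truncation rule, and the function name `coverage_under_budget` are assumptions made for illustration only.

```python
from typing import List, Set, Tuple

# Hypothetical region format: (file path, first line, last line), inclusive.
Region = Tuple[str, int, int]


def expand(region: Region) -> Set[Tuple[str, int]]:
    """Turn a region into the set of (path, line number) pairs it contains."""
    path, start, end = region
    return {(path, n) for n in range(start, end + 1)}


def coverage_under_budget(ranked: List[Region], gold: List[Region], budget: int) -> float:
    """Fraction of gold lines covered when reading ranked regions in order
    and stopping once `budget` lines have been consumed.

    Assumption: a region that does not fit is truncated to the remaining
    budget, and overlapping regions are charged twice.
    """
    seen: Set[Tuple[str, int]] = set()
    used = 0
    for path, start, end in ranked:
        remaining = budget - used
        if remaining <= 0:
            break
        end = min(end, start + remaining - 1)  # truncate to the budget
        seen |= expand((path, start, end))
        used += end - start + 1
    gold_lines: Set[Tuple[str, int]] = set().union(*(expand(r) for r in gold))
    if not gold_lines:
        return 0.0
    return len(seen & gold_lines) / len(gold_lines)


ranked = [("a.py", 1, 10), ("b.py", 5, 8)]
gold = [("a.py", 8, 12), ("b.py", 5, 6)]
print(coverage_under_budget(ranked, gold, budget=12))
```

Because the budget is consumed in rank order, placing a large but mostly irrelevant region first lowers coverage, which is one way ranking and context efficiency can interact with coverage.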

2606.07309 2026-06-08 cs.SD cs.AI cs.CL 交叉投稿

Acoustic Cue Alignment in Audio Language Models for Speech Emotion Recognition

语音情感识别中音频语言模型的声学线索对齐

Iosif Tsangko, Andreas Triantafyllopoulos, Björn W. Schuller

发表机构 * DFG's Reinhart Koselleck project（德国科研基金Reinhart Koselleck项目）； EU H2020 project（欧盟H2020项目）

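AI summary: Studies how audio language models use explicit acoustic cues by deriving six interpretable acoustic concept tokens from eGeMAPS features; aligned tokens improve UAR while shuffled tokens reduce performance, and the models are sensitive to the symbolic cues yet remain partly anchored to the audio signal.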

Comments 6 pages, 3 figures, 3 tables

AI中文摘要

指令跟随音频语言模型（ALMs）可以通过显式的声学线索进行增强，但在原始音频已经可用的情况下，这些线索是否以接地的方式被使用仍不清楚。我们通过从标准化的eGeMAPS副语言特征集中推导出六个可解释的声学概念标记来研究语音情感识别（SER）中的这一问题。这些标记总结了能量、音高、动态、亮度、共振峰和语音质量，并被附加到文本提示中，同时保持音频输入不变。在广泛使用的FAU-Aibo和IEMOCAP基准测试中，对齐的标记提高了未加权平均召回率（UAR），而打乱、冲突或损坏的标记相对于对齐标记降低了性能，并将混淆转向中性。重要的是，在强标记扰动下预测不会崩溃，这表明模型对符号线索通道敏感，但部分仍锚定于音频信号。我们认为，仅标记干预提供了一种实用的方法来探测基于ALM的情感计算中音频接地线索的使用、鲁棒性和可解释性。

英文摘要

Instruction-following audio language models (ALMs) can be augmented with explicit acoustic cues, yet it remains unclear whether such cues are used in a grounded way when the raw audio is already available. We study this question in speech emotion recognition (SER) by deriving six interpretable acoustic concept tokens from the standardised eGeMAPS paralinguistic feature set. These tokens summarise energy, pitch, dynamics, brightness, formants, and voice quality, and are appended to the textual prompt while the audio input is kept unchanged. Across the widely used FAU-Aibo and IEMOCAP benchmarks, aligned tokens improve unweighted average recall (UAR), whereas shuffled, conflicting, or corrupted tokens reduce performance relative to aligned tokens and shift confusions toward neutral. Importantly, predictions do not collapse under strong token perturbations, suggesting that the models are sensitive to the symbolic cue channel but remain partly anchored to the audio signal. We argue that token-only interventions provide a practical way to probe audio-grounded cue use, robustness, and interpretability in ALM-based affective computing.
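
Unweighted average recall (UAR), the metric reported above, is the mean of per-class recalls, so every emotion class counts equally regardless of how many samples it has. A minimal sketch follows; the labels and predictions are invented toy data, not results from the paper.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def unweighted_average_recall(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes that occur in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced example: 8 neutral clips and 2 angry clips.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", accuracy)
print("UAR:", unweighted_average_recall(y_true, y_pred))
```

On imbalanced data a shift of confusions toward the majority class (for example, toward neutral) can leave plain accuracy almost unchanged while lowering UAR, which is why UAR is the usual choice in speech emotion recognition.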

2606.07356 2026-06-08 cs.SD cs.CL 交叉投稿

DirectAudioEdit: Inversion-Free Text-Guided Audio Editing via Diffusion Prediction Contrast

DirectAudioEdit: 基于扩散预测对比的无反演文本引导音频编辑

Zhengkun Ge, Xiaoqian Liu, Haoran Zhang, Yuan Ge, Junxiang Zhang, Zhengtao Yu, Jingbo Zhu, Tong Xiao

发表机构 * School of Computer Science and Engineering, Northeastern University, Shenyang, China（东北大学计算机科学与工程学院）； Kunming University of Science and Technology（昆明理工大学）； NiuTrans Research, Shenyang, China（小牛翻译研究院）

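AI summary: Proposes DirectAudioEdit, a training-free and inversion-free method for text-guided audio editing that builds the editing path from a contrast between diffusion predictions; on music and event benchmarks it reduces FAD and KL by more than 15% and speeds up editing by up to 64.5%.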

2606.07451 2026-06-08 cs.CV cs.AI cs.CL cs.LG 交叉投稿

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI: 基于稀疏自编码器的文本条件视觉表示编辑以改进视觉-语言对齐

Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele

发表机构 * Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany（马克斯·普朗克研究所信息学院，萨尔兰信息学院，德国萨尔布吕肯）； Department of Language Science and Technology, Saarland University, Saarbrücken, Germany（语言科学与技术系，萨尔兰大学，德国萨尔布吕肯）

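AI summary: Proposes the TEVI framework, which disentangles image embeddings with sparse autoencoders and selectively reconstructs them through a text-conditioned masking module to improve image-text alignment in vision-language models such as CLIP, with gains on several retrieval benchmarks.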

Comments 20 pages, 13 figures, 14 tables

AI中文摘要

视觉-语言模型（如CLIP）由于共享图像-文本嵌入空间，对多种任务非常有用。尽管如此，图像和文本嵌入往往对齐不佳，影响下游性能。最近的研究表明，这可以归因于信息不平衡：图像包含的信息比其标题描述的更多。在这项工作中，我们提出了TEVI，一个利用标题作为信号来决定从图像嵌入中保留哪些信息的框架。具体来说，我们使用稀疏自编码器来解耦图像嵌入，并训练一个掩码模块，根据给定的标题选择性重构嵌入。在具有合成标题的受控设置中，我们展示了TEVI在保留标题描述的属性同时丢弃其他属性方面的有效性。通过将TEVI应用于在自然图像上训练的CLIP模型，我们进一步在粗粒度短标题（MS COCO, Flickr）和细粒度长标题（IIW, DOCCI）基准上实现了改进的检索性能，在更丰富的标题上获得更强的增益，并在RoCOCO基准上提高了鲁棒性。

英文摘要

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.

2606.07512 2026-06-08 cs.CV cs.AI cs.CL 交叉投稿

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

MemDreamer: 通过分层图记忆和智能体检索机制解耦感知与推理以实现长视频理解

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen

发表机构 * Ant Group（蚂蚁集团）； Zhejiang University（浙江大学）； Central South University（中南大学）； HKUST(GZ)（香港科技大学(广州)）

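AI summary: Proposes the MemDreamer framework, which decouples perception from reasoning through a hierarchical graph memory and an agentic retrieval mechanism, turning long-video understanding into an agentic exploration process; it reaches SOTA on four benchmarks while limiting the reasoning context window to 2% of full-context ingestion and gaining 12.5 accuracy points.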

AI中文摘要

当前的视觉-语言模型在处理数小时长的视频时面临困难，因为处理完整长度的视觉序列会导致令牌爆炸和注意力稀释。为了克服这一问题，我们引入了MemDreamer，将感知与推理解耦，将长视频理解转化为智能体探索过程。作为一个即插即用的框架，它增量式地流式传输视频以构建分层图记忆，这是一种自顶向下的三层架构，用于语义抽象，并由一个捕获时空和因果关系的基础图锚定。在推理过程中，推理模型采用智能体工具增强的检索，通过观察-推理-行动循环导航层次结构、搜索节点和遍历逻辑边。实验表明，MemDreamer在四个主流基准上取得了最先进的结果，将人类专家的差距缩小到仅3.7个百分点。它将推理上下文窗口限制在全量上下文的仅2%，同时提供了12.5个百分点的绝对准确率提升。此外，统计分析揭示了VLM在逻辑推理和长视频理解基准上的性能之间存在强正线性相关，将智能体能力扩展确立为多模态理解的新范式。

英文摘要

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

2601.12359 2026-06-08 cs.CR cs.AI cs.CL 交叉投稿

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

零样本嵌入漂移检测：一种轻量级防御对抗提示注入的LLM方法

Anirudh Sekar, Mrinal Agarwal, Rachel Sharma, Akitsugu Tanaka, Jasmine Zhang, Arjun Damerla, Kevin Zhu

发表机构 * Algoverse AI Research（Algoverse AI研究院）； Berkeley（伯克利大学）

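AI summary: Proposes ZEDD, which detects direct and indirect prompt injections by quantifying semantic shifts in embedding space between benign and suspect inputs. The method needs no access to model internals or prior knowledge of attack types, carries low engineering overhead, deploys efficiently across LLM architectures, and reaches over 93% accuracy.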

Comments Accepted to NeurIPS 2025 Lock-LLM Workshop

AI中文摘要

提示注入攻击已成为LLM应用中的日益严重漏洞，其中对抗性提示利用电子邮件或用户生成内容等间接输入渠道绕过对齐保护措施，导致有害或意外输出。尽管对齐技术有所进步，但最先进的LLM仍广泛易受对抗性提示攻击，凸显了需要稳健、高效且可推广的检测机制的紧迫性。本文提出零样本嵌入漂移检测（ZEDD），一种轻量级、低工程开销的框架，通过量化嵌入空间中良性与可疑输入之间的语义变化，识别直接和间接提示注入尝试。ZEDD无需访问模型内部、先验攻击类型知识或任务特定重训练，可高效地在多种LLM架构上进行零样本部署。我们的方法使用对抗性清洁提示对，并通过余弦相似度测量嵌入漂移，以捕捉现实世界注入攻击中的细微对抗性操纵。为确保评估的鲁棒性，我们编纂并重新标注了涵盖五个注入类别的综合LLMail-Inject数据集。广泛实验表明，嵌入漂移是一种稳健且可转移的信号，优于传统方法在检测准确性和操作效率方面。在Llama 3、Qwen 2和Mistral等模型架构上，分类准确率超过93%，误报率低于3%，我们的方法提供了一种轻量级、可扩展的防御层，可整合到现有LLM流程中，填补了保护LLM系统以抵御适应性对抗威胁的关键空白。

英文摘要

Prompt injection attacks have become an increasing vulnerability for LLM applications, where adversarial prompts exploit indirect input channels such as emails or user-generated content to circumvent alignment safeguards and induce harmful or unintended outputs. Despite advances in alignment, even state-of-the-art LLMs remain broadly vulnerable to adversarial prompts, underscoring the urgent need for robust, productive, and generalizable detection mechanisms beyond inefficient, model-specific patches. In this work, we propose Zero-Shot Embedding Drift Detection (ZEDD), a lightweight, low-engineering-overhead framework that identifies both direct and indirect prompt injection attempts by quantifying semantic shifts in embedding space between benign and suspect inputs. ZEDD operates without requiring access to model internals, prior knowledge of attack types, or task-specific retraining, enabling efficient zero-shot deployment across diverse LLM architectures. Our method uses adversarial-clean prompt pairs and measures embedding drift via cosine similarity to capture subtle adversarial manipulations inherent to real-world injection attacks. To ensure robust evaluation, we assemble and re-annotate the comprehensive LLMail-Inject dataset spanning five injection categories derived from publicly available sources. Extensive experiments demonstrate that embedding drift is a robust and transferable signal, outperforming traditional methods in detection accuracy and operational efficiency. With greater than 93% accuracy in classifying prompt injections across model architectures like Llama 3, Qwen 2, and Mistral and a false positive rate of <3%, our approach offers a lightweight, scalable defense layer that integrates into existing LLM pipelines, addressing a critical gap in securing LLM-powered systems to withstand adaptive adversarial threats.
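
The core signal described above, embedding drift between a clean prompt and a suspect prompt measured through cosine similarity, can be sketched in a few lines. This is an illustration under stated assumptions, not the ZEDD implementation: the vectors stand in for the output of some embedding model, and the function names and the `threshold` value of 0.3 are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def embedding_drift(clean_vec: Sequence[float], suspect_vec: Sequence[float]) -> float:
    """Drift as cosine distance: 0 for the same direction, up to 2 for opposite."""
    return 1.0 - cosine_similarity(clean_vec, suspect_vec)


def flag_injection(clean_vec: Sequence[float], suspect_vec: Sequence[float],
                   threshold: float = 0.3) -> bool:
    """Flag the suspect input when drift exceeds a threshold (value is hypothetical)."""
    return embedding_drift(clean_vec, suspect_vec) > threshold


# Toy 2-dimensional "embeddings"; real embeddings have hundreds of dimensions.
clean = [1.0, 0.0]
print(flag_injection(clean, [1.0, 0.0]))
print(flag_injection(clean, [0.0, 1.0]))
```

In practice the threshold would have to be calibrated on labeled clean and injected pairs, since the drift of benign paraphrases depends on the embedding model.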

2505.11470 2026-06-08 cs.CL 版本更新

Reference-Free Evaluation of Taxonomies

无参考评价的层次分类体系

Pascal Wullschleger, Majid Zarharan, Donnacha Daly, Marc Pouly, Jennifer Foster

发表机构 * Hamilton Institute, Maynooth University, Ireland（爱尔兰梅诺特大学哈密尔顿研究所）； School of Computing, Dublin City University, Ireland（爱尔兰都柏林城市大学计算学院）； Lucerne School of Computer Science and IT, Switzerland（瑞士卢塞恩计算机科学与信息技术学院）

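AI summary: Proposes two reference-free metrics for taxonomy quality: a robustness metric based on the correlation between semantic and taxonomic similarity, and a logical adequacy metric based on natural language inference. Validated on five taxonomies, both correlate strongly with ground-truth F1 and predict downstream hierarchical classification performance.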

2507.06419 2026-06-08 cs.CL 版本更新

Teach a Reward Model to Correct Itself: Reward Guided Adversarial Failure Discovery for Robust Reward Modeling

教会奖励模型自我修正：奖励引导的对抗性失败发现以实现鲁棒奖励建模

Pankayaraj Pathmanathan, Furong Huang

发表机构 * University of Maryland College Park（马里兰大学College Park分校）； Capital One

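AI summary: Proposes the REFORM framework, which automatically discovers reward model failure modes through reward-guided controlled decoding and uses the resulting adversarial examples for self-improvement, increasing robustness without sacrificing reward quality.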

Journal ref: ACL 2026 Main Conference [Oral]

AI中文摘要

奖励建模（RM）通过捕捉人类偏好来对齐大型语言模型（LLM），越来越多地用于模型微调、响应过滤和排序等任务。然而，由于人类偏好的固有复杂性和可用数据集的有限覆盖，奖励模型在分布偏移或对抗性扰动下经常失败。现有的识别此类失败模式的方法通常依赖于关于偏好分布或失败属性的先验知识，限制了它们在现实场景中的实用性，因为此类信息不可用。在这项工作中，我们提出了一种可处理的、与偏好分布无关的方法，通过奖励引导的受控解码来发现奖励模型的失败模式。在此基础上，我们引入了REFORM，一个自我改进的奖励建模框架，通过使用奖励模型本身来指导生成错误评分的响应，从而增强鲁棒性。这些对抗性示例随后用于扩充训练数据并修补奖励模型的失调行为。我们在两个广泛使用的偏好数据集Anthropic Helpful Harmless (HH)和PKU Beavertails上评估了REFORM，并证明它在不牺牲奖励质量的情况下显著提高了鲁棒性。值得注意的是，REFORM在直接评估和下游策略训练中均保持了性能，并通过去除虚假相关性进一步提高了对齐质量。

英文摘要

Reward modeling (RM), which captures human preferences to align large language models (LLMs), is increasingly employed in tasks such as model finetuning, response filtering, and ranking. However, due to the inherent complexity of human preferences and the limited coverage of available datasets, reward models often fail under distributional shifts or adversarial perturbations. Existing approaches for identifying such failure modes typically rely on prior knowledge about preference distributions or failure attributes, limiting their practicality in real-world settings where such information is unavailable. In this work, we propose a tractable, preference-distribution-agnostic method for discovering reward model failure modes via reward-guided controlled decoding. Building on this, we introduce REFORM, a self-improving reward modeling framework that enhances robustness by using the reward model itself to guide the generation of falsely scored responses. These adversarial examples are then used to augment the training data and patch the reward model's misaligned behavior. We evaluate REFORM on two widely used preference datasets, Anthropic Helpful Harmless (HH) and PKU Beavertails, and demonstrate that it significantly improves robustness without sacrificing reward quality. Notably, REFORM preserves performance both in direct evaluation and in downstream policy training, and further improves alignment quality by removing spurious correlations.

2508.03668 2026-06-08 cs.CL 版本更新

CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction

CTR-Sink：用于点击率预测的语言模型中的注意力汇聚点

Zixuan Li, Binzong Geng, Jing Xiong, Yong He, Yuxuan Hu, Jian Chen, Dingwei Chen, Xiyu Chang, Ngai Wong, Liang Zhang, Linjian Mo, Chengming Li, Chuan Yuan, Zhenan Sun

发表机构 * NLPR, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所模式识别国家重点实验室）； Ant Group（蚂蚁集团）； The University of Hong Kong（香港大学）； City University of Hong Kong（香港城市大学）； Sun Yat-sen University（中山大学）； Shenzhen MSU-BIT University（深圳北理莫斯科大学）

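AI summary: To address the semantic fragmentation caused by the structural gap between user behavior sequences and the text language models are pretrained on, proposes the CTR-Sink framework, which introduces behavior-level attention sinks and dynamically regulates attention aggregation to improve click-through rate prediction.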

DOI: 10.1145/3770855.3817646

AI中文摘要

点击率（CTR）预测是推荐系统中的核心任务，利用历史行为数据估计用户点击可能性。将用户行为序列建模为文本以利用语言模型（LM）进行该任务的方法，由于LM强大的语义理解和上下文建模能力而受到关注。然而，存在一个关键的结构性差距：用户行为序列由离散的动作组成，这些动作由语义上空的分离符连接，与LM预训练中的连贯自然语言有根本不同。这种不匹配导致语义碎片化，即LM的注意力分散在无关的标记上，而不是集中在有意义的行为边界和行为间关系上，从而降低了预测性能。为了解决这个问题，我们提出了$\textit{CTR-Sink}$，一种新颖的框架，引入了针对推荐场景定制的行为级注意力汇聚点。受注意力汇聚点理论的启发，它构建了注意力聚焦汇聚点，并通过外部信息动态调节注意力聚合。具体来说，我们在连续行为之间插入汇聚点标记，融入推荐特定信号（如时间距离）作为稳定的注意力汇聚点。为了增强通用性，我们设计了一个两阶段训练策略，明确引导LM注意力朝向汇聚点标记，以及一个注意力汇聚点机制，放大汇聚点间的依赖关系以更好地捕捉行为相关性。在一个工业数据集和两个开源数据集（MovieLens、Kuairec）上的实验以及可视化结果，验证了该方法在不同场景下的有效性。

英文摘要

Click-Through Rate (CTR) prediction, a core task in recommendation systems, estimates user click likelihood using historical behavioral data. Modeling user behavior sequences as text to leverage Language Models (LMs) for this task has gained traction, owing to LMs' strong semantic understanding and contextual modeling capabilities. However, a critical structural gap exists: user behavior sequences consist of discrete actions connected by semantically empty separators, differing fundamentally from the coherent natural language in LM pre-training. This mismatch causes semantic fragmentation, where LM attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, degrading prediction performance. To address this, we propose $\textit{CTR-Sink}$, a novel framework introducing behavior-level attention sinks tailored for recommendation scenarios. Inspired by attention sink theory, it constructs attention focus sinks and dynamically regulates attention aggregation via external information. Specifically, we insert sink tokens between consecutive behaviors, incorporating recommendation-specific signals such as temporal distance to serve as stable attention sinks. To enhance generality, we design a two-stage training strategy that explicitly guides LM attention toward sink tokens and an attention sink mechanism that amplifies inter-sink dependencies to better capture behavioral correlations. Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec), alongside visualization results, validate the method's effectiveness across scenarios.
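
The input-side idea above, placing a sink token that carries a temporal-distance signal between consecutive behaviors, can be sketched as plain string construction. Everything specific here is an assumption for illustration: the `[SINK gap=...]` token format, the hour granularity, and the function name are hypothetical and not taken from the paper, and the two-stage training and attention mechanism are not covered.

```python
from typing import List, Tuple

# Hypothetical behavior record: (Unix timestamp in seconds, textual description).
Behavior = Tuple[int, str]


def insert_sink_tokens(behaviors: List[Behavior]) -> str:
    """Serialize a time-ordered behavior sequence, replacing the semantically
    empty separator with a sink token that encodes the gap to the previous
    behavior in hours."""
    parts: List[str] = []
    for i, (timestamp, text) in enumerate(behaviors):
        if i > 0:
            gap_hours = (timestamp - behaviors[i - 1][0]) / 3600
            parts.append(f"[SINK gap={gap_hours:.1f}h]")
        parts.append(text)
    return " ".join(parts)


history = [
    (0, "watched Toy Story"),
    (7200, "watched Heat"),
    (93600, "watched Casino"),
]
print(insert_sink_tokens(history))
```

In a real pipeline the sink token would be a dedicated vocabulary entry with its own embedding rather than a formatted string, so that the model can learn to route attention to it.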

2510.07315 2026-06-08 cs.CL cs.AI cs.LG cs.SE 版本更新

SWE-IF: Aligning Code Evaluation with Human Preference

SWE-IF: 使代码评估与人类偏好对齐

Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun

发表机构 * Google DeepMind（谷歌DeepMind）

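AI summary: Proposes the SWE-IF benchmark, which evaluates code instruction following using VeriCode, a taxonomy of verifiable instructions; instruction following turns out to be the key differentiator of LLM code quality, and combining it with functional correctness matches human preference best.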

Comments ICML 2026

AI中文摘要

大型语言模型（LLM）推动了vibe coding，用户通过自然语言交互利用LLM生成并迭代优化代码，直到通过其vibe检查。Vibe检查反映了人类偏好，超越了功能性：解决方案应感觉正确、阅读清晰、保留意图并保持正确。然而，当前的代码评估仍局限于pass@k，仅捕获功能正确性，忽略了用户常规应用的非功能性指令。在本文中，我们假设指令遵循是vibe检查中除功能正确性之外缺失的部分。为了用量化信号衡量模型的代码指令遵循能力，我们提出了VeriCode，一个包含30条可验证代码指令及其确定性验证器的分类法。我们使用该分类法增强现有评估套件，得到SWE-IF，一个评估指令遵循和功能正确性的测试平台。评估31个LLM，我们发现即使最强的模型也难以遵守多条指令，并表现出功能回归。最重要的是，功能正确性和指令遵循的复合得分与人类偏好相关性最强，其中指令遵循成为LLM之间的主要区分因素。我们的代码、数据和分类法可在https://github.com/maszhongming/SWE-IF获取。

英文摘要

Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check reflects human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check besides functional correctness. To quantify models' code instruction-following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in SWE-IF, a testbed to assess both instruction following and functional correctness. Evaluating 31 LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator among LLMs. Our code, data, and taxonomy are available at https://github.com/maszhongming/SWE-IF.
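
For readers unfamiliar with pass@k, the metric the abstract says current evaluation is anchored to, the sketch below implements the commonly used unbiased estimator: with n sampled solutions of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k). It shows only the functional-correctness side; the paper's instruction-following verifiers and composite score are not reproduced here.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k solutions drawn without replacement
    from n samples is correct, given that c of the n samples are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case any draw of k samples must contain a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

The estimator depends only on whether tests pass, so two solutions with identical pass@k can still differ in style, structure, or adherence to non-functional instructions, which is the gap the abstract points to.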

2510.26615 2026-06-08 cs.CL 版本更新

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

SlideAgent：用于多页视觉文档理解的分层代理框架

Yiqiao Jin, Rachneet Kaur, Zhen Zeng, Sumitra Ganesh, Srijan Kumar

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； J.P. Morgan AI Research（摩根大通AI研究）

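AI summary: Proposes SlideAgent, a hierarchical agentic framework for understanding multimodal, multi-page documents such as slide decks; it builds a structured representation through reasoning at the global, page, and element levels and improves accuracy by 7.9% on proprietary models and 9.8% on open-source models.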

Comments ACL 2026 Main Conference. https://slideagent.github.io/

AI中文摘要

多页视觉文档，如手册、宣传册、演示文稿和海报，通过布局、颜色、图标和跨页引用传达关键信息。虽然多模态大语言模型（MLLMs）为文档理解提供了机会，但当前系统在处理复杂的多页视觉文档时仍存在困难，尤其是在元素和页面上的细粒度推理方面。我们引入了SlideAgent，一个用于理解多模态、多页、多布局文档（尤其是幻灯片组）的通用代理框架。SlideAgent采用专门的代理，并将推理分解为三个专门级别——全局、页面和元素——以构建结构化的、与查询无关的表示，捕捉总体主题以及详细的视觉或文本线索。在推理过程中，SlideAgent选择性激活专门代理进行多级推理，并将其输出整合为连贯的、上下文感知的答案。大量实验表明，SlideAgent在专有模型（+7.9%）和开源模型（+9.8%）上均显著提高了准确率。

英文摘要

Multi-page visual documents such as manuals, brochures, presentations, and posters convey key information through layout, colors, icons, and cross-slide references. While multimodal large language models (MLLMs) offer opportunities in document understanding, current systems struggle with complex, multi-page visual documents, particularly in fine-grained reasoning over elements and pages. We introduce SlideAgent, a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks. SlideAgent employs specialized agents and decomposes reasoning into three specialized levels (global, page, and element) to construct a structured, query-agnostic representation that captures both overarching themes and detailed visual or textual cues. During inference, SlideAgent selectively activates specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers. Extensive experiments show that SlideAgent significantly improves accuracy over both proprietary (+7.9%) and open-source models (+9.8%).

2511.07380 2026-06-08 cs.CL 版本更新

Mining Useful General Data for Low-Resource Domain Adaptation

挖掘低资源领域适应的有用通用数据

Pingjie Wang, Hongcheng Liu, Yusheng Liao, Ziqing Fan, Yaxin Du, Shuo Tang, Yanfeng Wang, Yu Wang

发表机构 * arXiv

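AI summary: To address data scarcity in low-resource domains, proposes NTK-Selector, which uses the Neural Tangent Kernel to select useful samples from general-domain data and significantly improves domain adaptation.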

Comments 39 pages

AI中文摘要

由于领域特定数据的稀缺性，将大型语言模型（LLMs）适应到低资源领域仍然具有挑战性。虽然领域内数据有限，但存在大量与领域任务共享相似问答格式和推理模式的通用领域数据。这一观察提出了一个重要问题：能否挖掘有用的通用领域数据来改进低资源领域适应？我们的初步发现表明，即使没有仔细选择，通用领域的思维链数据也包含对领域适应有用的辅助信号。这一观察催生了一种新的领域适应范式，即不再完全依赖领域特定数据。为了系统地识别最有益的通用领域样本，我们提出了NTK-Selector，其动机源于神经正切核捕捉训练动态中对齐的能力。由于直接将NTK应用于预训练LLMs不切实际，我们引入了一种无雅可比矩阵的NTK近似，并在微调过程中经验性地展示了稳定的NTK类行为。在医学、金融、法律和心理领域的广泛实验表明，NTK-Selector始终优于仅使用领域数据的微调和现有数据选择基线。特别是，NTK-Selector在Llama3-8B-Instruct和Qwen3-8B上分别取得了+8.7和+5.1个百分点的提升，而仅使用领域数据的微调仅分别提升了+0.8和+0.9个百分点。

英文摘要

Adapting large language models (LLMs) to low-resource domains remains challenging due to the scarcity of domain-specific data. While in-domain data is limited, there exists a vast amount of general-domain data that shares similar question-answer formats and reasoning patterns with domain tasks. This observation raises an important question: can useful general-domain data be mined to improve low-resource domain adaptation? Our initial findings show that general-domain chain-of-thought data contains useful auxiliary signals for domain adaptation, even without careful selection. This observation motivates a new paradigm for domain adaptation beyond exclusive reliance on domain-specific data. To systematically identify the most beneficial general-domain samples, we propose NTK-Selector, motivated by the Neural Tangent Kernel's ability to capture alignment in training dynamics. Since directly applying NTK to pretrained LLMs is impractical, we introduce a Jacobian-free NTK approximation and empirically demonstrate stable NTK-like behavior during fine-tuning. Extensive experiments across medical, financial, legal, and psychological domains demonstrate that NTK-Selector consistently outperforms domain-only fine-tuning and existing data selection baselines. In particular, NTK-Selector achieves gains of +8.7 and +5.1 points on Llama3-8B-Instruct and Qwen3-8B, respectively, compared to only +0.8 and +0.9 points from domain-only fine-tuning.

2512.09634 2026-06-08 cs.CL 版本更新

Creation of the Estonian Subjectivity Dataset: Assessing the Degree of Subjectivity on a Scale

爱沙尼亚主观性数据集的创建：评估主观性程度的一个量表

Karl Gustav Gailit, Kadri Muischnek, Kairit Sirts

发表机构 * University of Tartu（塔尔图大学）

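AI summary: Creates a document-level subjectivity dataset for Estonian, annotated on a continuous scale with an analysis of annotator agreement; an initial experiment on automatic subjectivity analysis with a large language model finds that automatic scoring is feasible but cannot fully replace human annotation.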

Comments 9 pages, 5 figures, 3 appendixes, LREC 2026

DOI: 10.63317/35rspcvi32vp
Journal ref: Proceedings of the Fifteenth Language Resources and Evaluation Conference (LREC 2026) 8204-8216

AI中文摘要

本文介绍了爱沙尼亚语文档级主观性数据集的创建，分析了所得标注，并报告了使用大语言模型（LLM）进行自动主观性分析的初步实验。该数据集包含1000个文档——300篇新闻文章和700个随机选择的网络文本——每个文档由四位标注员在从0（完全客观）到100（完全主观）的连续量表上评分。由于标注员间相关性中等，部分文本得分位于量表两端，因此对得分差异最大的文本子集进行了重新标注，标注员间相关性有所提高。除了人工标注外，数据集还包括GPT-5作为标注自动化实验生成的分数。这些分数与人工标注相似，但出现了一些差异，表明基于LLM的自动主观性评分虽然可行，但并非人工标注的可互换替代方案，其适用性取决于预期应用。

英文摘要

This article presents the creation of an Estonian-language dataset for document-level subjectivity, analyzes the resulting annotations, and reports an initial experiment on automatic subjectivity analysis using a large language model (LLM). The dataset comprises 1,000 documents (300 journalistic articles and 700 randomly selected web texts), each rated for subjectivity on a continuous scale from 0 (fully objective) to 100 (fully subjective) by four annotators. As the inter-annotator correlations were moderate, with some texts receiving scores at opposite ends of the scale, a subset of texts with the most divergent scores was re-annotated, which improved the inter-annotator correlation. In addition to human annotations, the dataset includes scores generated by GPT-5 as an experiment on annotation automation. These scores were similar to those of the human annotators; however, several differences emerged, suggesting that while LLM-based automatic subjectivity scoring is feasible, it is not an interchangeable alternative to human annotation, and its suitability depends on the intended application.
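
The annotation workflow above (compute inter-annotator correlation, then re-annotate the texts with the most divergent scores) can be sketched as follows. The abstract does not say which correlation coefficient or divergence criterion was used, so Pearson correlation and the max-minus-min spread are assumptions here, and the ratings are invented toy data.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_pairwise_correlation(ratings: Dict[str, List[float]]) -> float:
    """Average Pearson correlation over all pairs of annotators."""
    pairs = list(combinations(ratings.values(), 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def most_divergent(ratings: Dict[str, List[float]], top_k: int) -> List[int]:
    """Indices of the top_k documents with the largest score spread (max - min)."""
    columns = list(zip(*ratings.values()))  # one tuple of scores per document
    spread = [max(col) - min(col) for col in columns]
    return sorted(range(len(columns)), key=lambda i: spread[i], reverse=True)[:top_k]


# Toy ratings on the 0 (objective) to 100 (subjective) scale for four documents.
ratings = {
    "annotator_1": [0, 50, 100, 20],
    "annotator_2": [5, 55, 10, 25],
    "annotator_3": [10, 60, 95, 30],
}
print(mean_pairwise_correlation(ratings))
print(most_divergent(ratings, top_k=1))
```

Here the third document, rated 100, 10, and 95, is the one selected for re-annotation.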

2512.13278 2026-06-08 cs.CL cs.LG 版本更新

AutoTool: Dynamic Tool Selection and Integration for Agentic Reasoning

AutoTool: 面向智能体推理的动态工具选择与集成

Jiaru Zou, Ling Yang, Yunzhe Qi, Sirui Chen, Mengting Ai, Ke Shen, Jingrui He, Mengdi Wang

发表机构 * Nanyang Technological University（南洋理工大学）

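AI summary: Proposes the AutoTool framework, which gives large language models dynamic tool-selection ability through a dual-phase optimization (SFT and RL trajectory stabilization, then KL-regularized Plackett-Luce ranking), with average gains of 6.4% to 7.7% on tasks including math, science, code, and multimodal reasoning.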

Comments ICML2026; Best Paper Award at ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence

AI中文摘要

智能体强化学习推动了大语言模型（LLMs）在长链思维轨迹中进行推理，同时穿插外部工具的使用。现有方法假设工具集固定，限制了LLM智能体对新工具或演化工具集的适应性。我们提出AutoTool，一个训练框架，使LLM智能体在整个推理轨迹中具备动态工具选择能力。AutoTool采用双阶段优化流水线：（i）基于SFT和RL的轨迹稳定化，以实现连贯推理；（ii）KL正则化的Plackett-Luce排序，以优化一致的多步工具选择。我们进一步构建了一个包含20万条数据的数据集，其中包含跨1000多个工具和100多个任务（涵盖数学、科学、代码生成和多模态推理）的显式工具选择理由。在十个多样化基准上，我们使用AutoTool训练了两个基础模型：Qwen3-8B和Qwen2.5-VL-7B。在参数更少的情况下，AutoTool持续优于先进的LLM智能体和工具集成方法，在数学与科学推理上平均提升6.4%，在基于搜索的问答上提升4.5%，在代码生成上提升7.7%，在多模态理解上提升6.9%。此外，AutoTool通过在推理过程中动态利用演化工具集中的未见工具，展现出更强的泛化能力。

英文摘要

Agentic reinforcement learning has advanced large language models (LLMs) to reason through long chain-of-thought trajectories while interleaving external tool use. Existing approaches assume a fixed inventory of tools, which limits the adaptability of LLM agents to new or evolving toolsets. We present AutoTool, a training framework that equips LLM agents with dynamic tool-selection capabilities throughout their reasoning trajectories. AutoTool employs a dual-phase optimization pipeline: (i) SFT and RL-based trajectory stabilization for coherent reasoning, and (ii) KL-regularized Plackett-Luce Ranking to refine consistent multi-step tool selection. We further build a 200k dataset with explicit tool-selection rationales across 1,000+ tools and 100+ tasks spanning mathematics, science, code generation, and multimodal reasoning. Across ten diverse benchmarks, we train two base models, Qwen3-8B and Qwen2.5-VL-7B, with AutoTool. With fewer parameters, AutoTool consistently outperforms advanced LLM agents and tool-integration methods, yielding average gains of 6.4% in math & science reasoning, 4.5% in search-based QA, 7.7% in code generation, and 6.9% in multimodal understanding. In addition, AutoTool exhibits stronger generalization by dynamically leveraging unseen tools from evolving toolsets during inference.
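
The Plackett-Luce model named in phase (ii) assigns a probability to a full ranking by repeatedly choosing the next item with probability proportional to exp(score) among the items not yet chosen. The sketch below computes that log-likelihood for a toy tool ranking; the tool names and scores are hypothetical, and the KL regularization term and the rest of the AutoTool training pipeline are not shown.

```python
import math
from typing import Dict, List


def plackett_luce_log_likelihood(scores: Dict[str, float], ranking: List[str]) -> float:
    """Log-probability of `ranking` (best first) under the Plackett-Luce model.

    At each position the chosen item is drawn with probability
    exp(score[item]) / sum of exp(score[j]) over the remaining items.
    """
    remaining = list(ranking)
    log_likelihood = 0.0
    for item in ranking:
        log_norm = math.log(sum(math.exp(scores[j]) for j in remaining))
        log_likelihood += scores[item] - log_norm
        remaining.remove(item)
    return log_likelihood


# Hypothetical tool scores produced by some scoring model.
scores = {"calculator": 2.0, "web_search": 1.0, "code_runner": 0.0}
preferred = ["calculator", "web_search", "code_runner"]
reversed_order = list(reversed(preferred))

print(plackett_luce_log_likelihood(scores, preferred))
print(plackett_luce_log_likelihood(scores, reversed_order))
```

Maximizing this log-likelihood for the preferred ordering pushes the scores of higher-ranked tools above those of lower-ranked ones, which is what makes it usable as a ranking objective.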

2601.05751 2026-06-08 cs.CL cs.AI 版本更新

Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns

分析LLM生成文本中说服性语言的差异：揭示刻板的性别模式

Amalie Brogaard Pauli, Maria Barrett, Max Müller-Eberstein, Isabelle Augenstein, Ira Assent

发表机构 * Department of Computer Science, Aarhus University（奥胡斯大学计算机科学系）； AMD Silo AI ； University of Tokyo（东京大学）； IT University of Copenhagen（哥本哈根IT大学）； Department of Computer Science, University of Copenhagen（哥本哈根大学计算机科学系）

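AI summary: Proposes a framework for evaluating how recipient gender, sender intent, and output language affect the persuasive language LLMs generate; all models show significant gender differences that reflect gender-stereotypical linguistic tendencies.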

Comments Accepted at ACL Findings 2026

AI中文摘要

大型语言模型（LLMs）越来越多地用于日常交流任务，包括起草旨在影响和说服的人际信息。先前研究表明，LLMs能够成功说服人类并放大说服性语言。因此，理解用户指令如何影响说服性语言的生成，以及生成的说服性语言是否因目标群体不同而有所差异至关重要。在这项工作中，我们提出了一个框架，用于评估说服性语言生成如何受接收者性别、发送者意图或输出语言的影响。我们使用成对提示指令评估了13个LLMs和16种语言。我们采用基于社会心理学和传播科学的LLM-as-judge设置，在19个说服性语言类别上评估模型响应。我们的结果揭示了所有模型生成的说服性语言中存在显著的性别差异。这些模式反映了与社会心理学和社会语言学中记录的性别刻板语言倾向一致的偏见。

英文摘要

Large language models (LLMs) are increasingly used for everyday communication tasks, including drafting interpersonal messages intended to influence and persuade. Prior work has shown that LLMs can successfully persuade humans and amplify persuasive language. It is therefore essential to understand how user instructions affect the generation of persuasive language, and to understand whether the generated persuasive language differs, for example, when targeting different groups. In this work, we propose a framework for evaluating how persuasive language generation is affected by recipient gender, sender intent, or output language. We evaluate 13 LLMs and 16 languages using pairwise prompt instructions. We evaluate model responses on 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science. Our results reveal significant gender differences in the persuasive language generated across all models. These patterns reflect biases consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics.

2601.06600 2026-06-08 cs.CL 版本更新

Probing Multimodal Large Language Models on Cognitive Biases in Chinese Short-Video Misinformation

探究多模态大语言模型在中国短视频虚假信息中的认知偏差

Jen-tse Huang, Chang Chen, Shiyang Lai, Wenxuan Wang, Michelle R. Kaufman, Mark Dredze

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； Chinese University of Hong Kong（香港中文大学）； University of Chicago（芝加哥大学）； Renmin University of China（中国人民大学）

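AI summary: Evaluates eight multimodal large language models on health-domain misinformation using a dataset of 200 short videos; Gemini-2.5-Pro performs best (belief score 71.5/100), and the models are susceptible to social cues such as authoritative channel IDs.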

Comments Accepted to ACL 2026 (Findings)

AI中文摘要

短视频平台已成为虚假信息的主要传播渠道，其中欺骗性声明常利用视觉实验和社会线索。尽管多模态大语言模型（MLLMs）展示了令人印象深刻的推理能力，但它们对与认知偏差纠缠的虚假信息的鲁棒性仍未得到充分探索。本文使用一个高质量、手动标注的200个短视频数据集，涵盖四个健康领域，引入了一个全面的评估框架。该数据集为三种欺骗模式——实验错误、逻辑谬误和捏造声明——提供了细粒度标注，每种模式均由国家标准和学术文献等证据验证。我们评估了八个前沿MLLMs在五种模态设置下的表现。实验结果表明，Gemini-2.5-Pro在多模态设置中取得了最高性能，信念分数为71.5/100，而o3表现最差，为35.2。此外，我们研究了视频中诱导错误信念的社会线索，发现模型易受权威频道ID等偏差影响。

英文摘要

Short-video platforms have become major channels for misinformation, where deceptive claims frequently leverage visual experiments and social cues. While Multimodal Large Language Models (MLLMs) have demonstrated impressive reasoning capabilities, their robustness against misinformation entangled with cognitive biases remains under-explored. In this paper, we introduce a comprehensive evaluation framework using a high-quality, manually annotated dataset of 200 short videos spanning four health domains. This dataset provides fine-grained annotations for three deceptive patterns (experimental errors, logical fallacies, and fabricated claims), each verified by evidence such as national standards and academic literature. We evaluate eight frontier MLLMs across five modality settings. Experimental results demonstrate that Gemini-2.5-Pro achieves the highest performance in the multimodal setting with a belief score of 71.5/100, while o3 performs the worst at 35.2. Furthermore, we investigate social cues that induce false beliefs in videos and find that models are susceptible to biases like authoritative channel IDs.

2601.08097 2026-06-08 cs.CL cs.LG updated version

AdaJudge: Adaptive Multi-Perspective Judging for Reward Modeling

Yongliang Miao, Yangyang Liang, Mengnan Du

Affiliations: Emory University; The Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the AdaJudge framework, which jointly optimizes representation and aggregation through gated refinement blocks and an adaptive multi-view pooling module, addressing static inductive bias and representational mismatch in reward modeling. It outperforms existing models on RM-Bench and JudgeBench.

Comments ACL 2026

Abstract

Reward modeling is essential for aligning large language models with human preferences, yet predominant architectures rely on a static pooling strategy to condense sequences into scalar scores. This paradigm, however, suffers from two key limitations: a static inductive bias that misaligns with task-dependent preference signals, and a representational mismatch, as the backbone's optimization for generation leaves its representations ill-suited to fine-grained discrimination. To address this, we propose AdaJudge, a unified framework that jointly adapts representation and aggregation. AdaJudge first improves backbone representations into a discrimination-oriented space via gated refinement blocks. It then replaces the static readout with an adaptive multi-view pooling module, which dynamically routes and combines evidence. Extensive experiments on RM-Bench and JudgeBench show that AdaJudge outperforms strong off-the-shelf reward models and traditional pooling baselines.

2601.09402 2026-06-08 cs.CL updated version

A Dynamic Self-Evolving Extraction System

Moin Amin-Naseri, Hannah Kim, Estevam Hruschka

Affiliations: Megagon Labs

AI summary: Proposes DySECT, a system in which an LLM extracts triples to build a knowledge base, the knowledge base is enriched with probabilistic knowledge and graph-based reasoning, and the result is fed back to improve the extractor, forming a closed loop of continual improvement.

Abstract

The extraction of structured information from raw text is a fundamental component of many NLP applications, including document retrieval, ranking, and relevance estimation. High-quality extractions often require domain-specific accuracy, up-to-date understanding of specialized taxonomies, and the ability to incorporate emerging jargon and rare outliers. In many domains, such as medical, legal, and HR, the extraction model must also adapt to shifting terminology and benefit from explicit reasoning over structured knowledge. We propose DySECT, a Dynamic Self-Evolving Extraction and Curation Toolkit, which continually improves as it is used. The system incrementally populates a versatile, self-expanding knowledge base (KB) with triples extracted by the LLM. The KB further enriches itself through the integration of probabilistic knowledge and graph-based reasoning, gradually accumulating domain concepts and relationships. The enriched KB then feeds back into the LLM extractor via prompt tuning, sampling of relevant few-shot examples, or fine-tuning using KB-derived synthetic data. As a result, the system forms a symbiotic closed-loop cycle in which extraction continuously improves knowledge, and knowledge continuously improves extraction.

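2603.09403 2026-06-08 cs.CL updated version

LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

Lukáš Eigler, Jindřich Libovický, David Hurych

Affiliations: Faculty of Mathematics and Physics, Charles University

AI summary: Proposes the LLM-as-a-Meta-Judge framework, which generates synthetic data through controlled semantic degradation as a substitute for human judgments when validating NLG evaluation metrics. Meta-correlation exceeds 0.9 on multilingual question answering.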

Comments 16 pages, 1 figure, 14 tables

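2603.28304 2026-06-08 cs.CL updated version

The Necessity of Setting Temperature in LLM-as-a-Judge

Lujun Li, Lama Sleem, Yangjie Xu, Yewei Song, Aolin Jia, Jerome Francois, Radu State

Affiliations: University of Luxembourg; ETH Zürich

AI summary: A systematic study of how temperature affects LLM judging behavior. Higher temperatures reduce consistency but expose uncertainty; low temperatures suit tasks that need stability, and high temperatures suit complex scenarios. The authors recommend treating temperature as a task-dependent design choice.

Comments 17 pages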

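AI abstract (translated from Chinese)

Using large language models (LLMs) as judges to evaluate model outputs has become an important paradigm for automated evaluation. In LLM-as-a-judge settings, however, the choice of decoding temperature remains largely empirical, with little systematic evidence about its effects. To address this gap, we systematically study how temperature affects judging behavior across different LLM judge models, prompting strategies, and evaluation paradigms. Our results show that higher temperatures generally reduce judgment consistency and increase formatting errors, while also exposing latent uncertainty that low-temperature decoding tends to suppress, especially on ambiguous cases. Further analysis shows that higher temperatures can act as an exploration mechanism and may improve judging performance in complex or uncertain evaluation scenarios. Overall, low-temperature settings are better suited to tasks that prioritize stability and reproducibility, whereas high-temperature settings are better suited to scenarios involving substantial ambiguity or complexity, where exploring the judge's decision space is beneficial. These findings suggest that, in LLM-as-a-judge systems, temperature should be treated not as a fixed hyperparameter but as a controllable, task-dependent design choice that mediates the trade-off between reliability and exploration.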
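
The role of decoding temperature described in this abstract can be illustrated with a minimal sketch of temperature-scaled softmax. It is not the paper's code: the verdict labels and logits are hypothetical, and the sketch only shows why a low temperature concentrates probability on one verdict while a higher temperature spreads it out.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits a judge assigns to the verdicts "A wins", "B wins", "tie".
verdict_logits = [2.0, 1.5, 0.2]
low_t = softmax_with_temperature(verdict_logits, 0.2)
high_t = softmax_with_temperature(verdict_logits, 1.5)
print("T=0.2:", [round(p, 3) for p in low_t], "entropy", round(entropy(low_t), 3))
print("T=1.5:", [round(p, 3) for p in high_t], "entropy", round(entropy(high_t), 3))
```

Sampling repeatedly from the flatter distribution yields more varied verdicts, which matches the reported drop in consistency and the exposure of uncertainty at higher temperatures.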

Redefining the Humor Data Object for AI Humor Research

Anna Arnett, Bang Nguyen, Meng Jiang

Affiliations: Department of Computer Science and Engineering, University of Notre Dame

AI summary: This study treats humor as a social interaction with context and explanations. By defining a humor reasoning data object and improving the prompting strategy, it enables LLMs to generate higher-quality humor explanations, laying groundwork for data synthesis and augmentation in AI humor research.

Comments Added link to code and data

Abstract

In most existing AI humor research, humor was treated as either "present" or "not present." We explore the concept of humor as a social interaction with context and explanations. During this project, we defined a humor reasoning data object and developed a way to prompt LLMs to generate an explanation of humor that is effective for the general population. We iterated from an earlier prompt to an improved prompt, found that the later version reduced important errors, and then scaled generation to a large number of data objects, which have the potential to enable data synthesis and data augmentation for AI humor research. Our main takeaway is that better prompting of an LLM improves humor explanation quality, especially by handling missing context, multi-modality, and transcript issues more carefully. These results establish a strong foundation for future work on AI understanding of humor as social behavior. All code and data are available at: https://github.com/anna-arnett/ai-humor/ .

2605.25638 2026-06-08 cs.CL cs.LG updated version

Reinforcement Learning from Denoising Feedback

Qi He, Huan Chen, Ya Guo, Huijia Zhu, Yi R. Fung, Baojian Zhou

Affiliations: Fudan University; Ant Group; Hong Kong University of Science and Technology

AI summary: Proposes RLDF, which uses denoising feedback for policy loss estimation. It optimizes the model from intermediate noisy states toward the clipped clean state and combines this with weighted timestep sampling, improving performance and generalizability on diffusion language models.

Abstract

Policy loss estimation remains a fundamental and long-standing challenge in reinforcement learning (RL) for diffusion language models (DLMs). We introduce Reinforcement Learning from Denoising Feedback (RLDF), a novel training paradigm that leverages feedback obtained from rollout and training processes to facilitate accurate and efficient policy loss estimation. To balance the trade-off between computational efficiency and estimation effectiveness, RLDF optimizes the model toward the clipped clean state from intermediate noisy states, combined with weighted timestep sampling over denoising timesteps. Extensive experiments demonstrate that RLDF achieves consistent and substantial improvements in both performance and generalizability across two representative DLM architectures, LLaDA and Dream, on multiple reasoning benchmarks. Our work lays a principled foundation for scalable reinforcement learning in diffusion language models. We build Drift, a training framework for DLMs, available at https://github.com/ant-research/Drift.

2605.26099 2026-06-08 cs.CL cs.AI updated version

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti

Affiliations: Carnegie Mellon University; University of Maryland

AI summary: Proposes a sleep-like consolidation mechanism that uses offline recurrence to convert context into fast weights, addressing the poor scaling of Transformer attention with context length, and validates it on synthetic tasks and a math reasoning task.

Abstract

Transformer-based large language models are increasingly used for long-horizon tasks; however, their attention mechanism scales poorly with context length. To handle this, we study a sleep-like consolidation mechanism in which a model periodically converts recent context into persistent fast weights before clearing its key-value cache. During sleep, the model performs $N$ offline recurrent passes over the accumulated context and updates the fast weights in its state-space model (SSM) blocks through a learned local rule. During inference, this shifts extra computation to sleep while preserving the latency of wake-time prediction. We test our method on controlled synthetic tasks, including cellular automata and multi-hop graph retrieval, as well as a realistic math reasoning task, on which a regular transformer as well as SSM-attention hybrid models fail. We then show that increasing sleep duration $N$ for our models improves performance, with the largest gains on examples that require deeper reasoning.

2606.03889 2026-06-08 cs.CL updated version

RealClawBench: Live OpenClaw Benchmarks from Real Developer-Agent Sessions

Zongwei Lv, Zhewen Tan, Yaoming Li, Yilun Yao, Yuxuan Tian, Lin Sun, Xiangzheng Zhang, Weihong Lin, Tong Yang, Guangxiang Zhao

Affiliations: Peking University; Qiyuan Tech

AI summary: To address the lack of realism in existing benchmarks, proposes the RealClawBench framework, which uses reconstructed execution environments and deterministic verifiable scorers to turn real OpenClaw sessions into reproducible, automatically scored tasks. Across 14 evaluated models, the best solves only 65.8% of tasks, revealing substantial headroom on developer-agent workloads.

Comments 19 pages, 5 figures, 8 tables

Abstract

Agent benchmarks should reflect what users actually ask deployed agents to do, yet existing benchmarks often miss key realism properties of real developer-agent sessions. We introduce RealClawBench, a live benchmark framework built from real OpenClaw sessions to capture the distribution, diversity, and real-world difficulty of deployed agent use. Real user requests are challenging to benchmark because they often depend on local execution environments, involve implicit or underspecified intent, and require nontrivial verification. RealClawBench addresses these challenges with two core mechanisms: reconstructed execution environments and deterministic verifiable scorers, which together convert real sessions into reproducible, automatically scored tasks. The resulting release contains 281 executable tasks sampled from a much larger real-session pool while preserving the source distribution, with maximum final-vs-source Jensen-Shannon divergence of 0.0448. Evaluating 14 contemporary models shows that the best system solves only 65.8% of tasks, revealing substantial headroom on realistic developer-agent workloads. By turning real deployed sessions into controlled evaluation instances, RealClawBench provides a practical path toward benchmarks that better measure agent capability in actual use. Code is available at: https://anonymous.4open.science/r/real-claw-bench-582B.
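
The Jensen-Shannon divergence cited in this abstract as the measure of how well the sampled tasks preserve the source distribution can be computed as in the sketch below. This is an illustration, not the benchmark's code: the category names and counts are hypothetical, and the base-2 logarithm (which bounds the divergence to [0, 1]) is an assumption, since the abstract does not state the base.

```python
import math

def normalize(counts):
    """Turn a mapping of category -> count into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p[k] == 0 contribute nothing."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    keys = set(p) | set(q)
    p_full = {k: p.get(k, 0.0) for k in keys}
    q_full = {k: q.get(k, 0.0) for k in keys}
    mix = {k: 0.5 * (p_full[k] + q_full[k]) for k in keys}
    return 0.5 * kl_divergence(p_full, mix) + 0.5 * kl_divergence(q_full, mix)

# Hypothetical task-category counts: a large source pool and a 281-task sample.
source = normalize({"debugging": 400, "refactoring": 300, "docs": 200, "devops": 100})
sample = normalize({"debugging": 110, "refactoring": 86, "docs": 58, "devops": 27})
jsd = js_divergence(source, sample)
print(f"JSD(source, sample) = {jsd:.6f} bits")
```

A value near zero means the sample closely tracks the source mix; identical distributions give exactly zero and disjoint ones give one under this base.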

2606.04874 2026-06-08 cs.CL updated version

Agent Planning Benchmark: A Diagnostic Framework for Planning Capabilities in LLM Agents

Haoyu Sun, Wenxuan Wang, Mingyang Song, Jujie He, Weinan Zhang, Yang Liu, Yang Yang, Yu Cheng

Affiliations: Tongji University; Shanghai AI Laboratory; Harbin Institute of Technology; Fudan University; Skywork AI; University of California, Santa Cruz; Shanghai Jiao Tong University; The Chinese University of Hong Kong

AI summary: Proposes the Agent Planning Benchmark (APB), which uses 4,209 multimodal cases across five settings to diagnose systematic weaknesses of LLM agents in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement.

Abstract

Planning is central to LLM agents: before acting, an agent must decompose goals, select tools, reason over constraints, and decide when a task is infeasible. Yet existing agent evaluations often report only end-to-end success, making it difficult to determine whether failures stem from planning or execution. We introduce Agent Planning Benchmark (APB), a planning-specific diagnostic benchmark with 4,209 multimodal cases across 22 domains and five settings, covering holistic planning, feedback-conditioned step-wise planning, and robustness under extraneous tools, broken tools, and unsolvable tasks. Across 12 MLLMs, APB reveals systematic weaknesses in long-horizon planning, tool-noise robustness, calibrated refusal, and inference-time refinement. We further validate APB on 200 ToolSandbox tasks and 200 $τ^2$-bench tasks, where APB-guided refinement consistently improves plan correctness, plan grade, and downstream execution metrics across three representative models. APB thus serves as an upstream diagnostic complement to execution benchmarks. The APB benchmark and code are available at https://github.com/Mikivishy/AgentPlanningBenchmark.

2606.05711 2026-06-08 cs.CL updated version

Telling Stories, Creating Hanzi: AI-Assisted Co-Creation with Older Migrants in Urban China

Yunfei Chen, Wen Zhan, Peiyue Lin, Ziqun Hua, Ying Hu

Affiliations: School of Design, Hunan University; Royal College of Art; University of the Arts London, Central Saint Martins

AI summary: Through co-creation workshops that combine oral storytelling, AI assistance, and hand-making, older migrants create new Hanzi to record overlooked life stories. The study reveals heterogeneity and adaptive capacity among participants and shows AI acting as a creative initiator that lowers barriers to expression.

DOI: 10.21606/drs.2026.963

Abstract

This paper explores how older migrants in urban China can record stories that everyday language and design often miss. We ran two co-creation workshops with 10 elders. Activities combined oral storytelling, facilitator-mediated AI assistance, and hand-making. Large language models proposed candidate glyphs through a facilitator. Participants crafted new Hanzi to hold their stories. The resulting characters served as memory anchors for later sharing and retelling. Our interpretive analysis shows heterogeneity and adaptive capacity among participants. Participants experienced AI as a creative initiator that lowered barriers to expression and making, especially for those with lower digital literacy. The work challenges homogenizing assumptions about older adults and the presumption of uniform capacities and needs. We contribute a workshop framework that positions AI as a backstage facilitator. We also offer insights on engaging older migrants as sources of community memory and situated cultural knowledge within inclusive urban systems.

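2508.17693 2026-06-08 cs.DB cs.AI cs.CL updated version

Database Normalization via Dual-LLM Self-Refinement

Eunjae Jo, Nakyung Lee, Gyuyeong Kim

Affiliations: University of California, Berkeley

AI summary: Proposes the Miffie framework, which uses a dual-model self-refinement architecture built on large language models to normalize databases automatically, without human intervention and with high accuracy.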

Comments 7 pages

2508.17821 2026-06-08 cs.LG cs.AI cs.CL updated version

Endogenous Resistance to Activation Steering in Language Models

Alex McKenzie, Keenan Pepper, Stijn Servaes, Martin Leitgab, Murat Cubuktepe, Mike Vaiana, Diogo de Lucena, Judd Rosenblatt, Michael S. A. Graziano

Affiliations: University of Washington

AI summary: Finds that large language models can endogenously resist task-misaligned activation steering, recovering correct generation through explicit restarts, and identifies the associated sparse autoencoder latents; the resistance can be strengthened or weakened.

Abstract

Large language models can recover mid-generation from task-misaligned activation steering, producing explicit verbal restarts (e.g., "wait, that's not right") and continuing on-topic even while the steering perturbation remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B exhibits explicit ESR at \llamaseventyEsrRate\%, with smaller models from the Llama-3 and Gemma-2 families showing the explicit form less frequently. Two controls dissociate ESR into a detection event and a sustained-resistance component that conditioning on recent on-topic tokens does not fully explain. We identify \numOtdLatents{} SAE latents through contrastive on-topic/off-topic search; zero-ablating them reduces the multi-attempt rate by \multiAttemptReductionPct\%, with random-latent and held-out-prompt controls supporting specificity. ESR can also be deliberately enhanced through both meta-prompting and fine-tuning on synthetic self-correction examples. ESR has dual implications for safety: it could harden models against adversarial activation-space manipulation, but may equally interfere with beneficial steering-based interventions, since the model has no way to distinguish the two. Code is available at https://github.com/agencyenterprise/endogenous-steering-resistance.

2602.08857 2026-06-08 cs.LG cs.AI cs.CL updated version

Discovering Interpretable Algorithms by Decompiling Transformers to RASP

Xinting Huang, Aleksandra Bakalova, Satwik Bhattamishra, William Merrill, Michael Hahn

Affiliations: Saarland Informatics Campus, Saarland University; University of Oxford; Allen Institute for AI

AI summary: Proposes a method that faithfully re-parameterizes a trained Transformer as a RASP program and uses causal interventions to discover a small sufficient sub-program. Experiments show that length-generalizing Transformers internally implement simple, interpretable RASP programs.

Comments 104 pages, 92 figures. Accepted for publication at ICML 2026

Abstract

Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of Transformers. In particular, Transformers have been suggested to length-generalize exactly on problems that have simple RASP programs. However, it remains open whether trained models actually implement simple interpretable programs. In this paper, we present a general method to extract such programs from trained Transformers. The idea is to faithfully re-parameterize a Transformer as a RASP program and then apply causal interventions to discover a small sufficient sub-program. In experiments on small Transformers trained on algorithmic and formal language tasks, we show that our method often recovers simple and interpretable RASP programs from length-generalizing Transformers. Our results provide the most direct evidence so far that Transformers internally implement simple RASP programs.

2602.14209 2026-06-08 cs.LG cs.CL updated version

MAGE: All-[MASK] Block Already Knows Where to Look in Block Diffusion LLM

Omin Kwon, Yeonjae Kim, Doyeon Kim, Minseo Kim, Yeonhong Park, Jae W. Lee

Affiliations: Seoul National University; Meta

AI summary: Targeting the memory bottleneck caused by the KV cache in long-context inference with block diffusion LLMs, proposes MAGE, a training-free method that exploits an alignment property of the block-diffusion training objective to determine the KV subset for the whole trajectory at the first step, achieving near-lossless accuracy and significant speedup.

Abstract

Block diffusion LLMs are an emerging paradigm for parallel language generation, but their KV caching makes memory access the dominant bottleneck in long-context inference. Sparse attention, which attends only to a small KV subset per query, can reduce this latency with minimal accuracy loss. In block diffusion, however, the B tokens of each block must share a single KV subset, and we show this per-block constraint degrades existing sparse KV estimators by up to 25% in recall. We address this challenge by exploiting a property that emerges from the block-diffusion training objective: it aligns the block-average query across denoising steps, so the All-[MASK] block at the first step already reveals the per-block KV subset for the entire trajectory. We exploit this in MAGE ([MASK]-Guided Sparse Attention), a training-free method that runs one exact attention pass at the first step and reuses its top-k index sets for all remaining steps within the block. Across three block-diffusion families on LongBench, MAGE matches Exact Attention at k=512 with near-lossless accuracy, achieves up to 6.82x end-to-end speedup at 128K context, and runs up to 3.35x and 2.28x faster than Quest and SparseD, designed for AR LLMs and fully bidirectional diffusion LLMs, respectively.
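
The select-once, reuse-afterwards idea in this abstract can be sketched in a few lines. This is a heavily simplified illustration, not the authors' implementation: it uses a single head, plain dot-product scores against the block-average query, and hypothetical helper names, all of which are assumptions made for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def block_average(queries):
    """Average the B query vectors of one block into a single vector."""
    n = len(queries)
    return [sum(col) / n for col in zip(*queries)]

def top_k_indices(avg_query, keys, k):
    """Indices of the k keys with the highest score against the averaged query."""
    scores = [dot(avg_query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def decode_block(queries_per_step, keys, k):
    """Select the KV subset once at step 0, then reuse it for every later step."""
    shared = top_k_indices(block_average(queries_per_step[0]), keys, k)
    return [list(shared) for _ in queries_per_step]

# Toy cache of four keys and a block of two tokens over three denoising steps.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
steps = [
    [[1.0, 0.0], [1.0, 0.2]],  # step 0: the all-[MASK] block
    [[0.9, 0.1], [1.0, 0.1]],
    [[1.0, 0.1], [0.9, 0.0]],
]
subsets = decode_block(steps, keys, k=2)
print(subsets)
```

Only step 0 pays for scoring every key; the later steps attend to the stored indices, which is where the memory-access savings described in the abstract would come from.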

2602.18905 2026-06-08 cs.LG cs.AI cs.CL updated version

MCERF: Advancing Multimodal LLM Evaluation of Engineering Documents via Enhanced Retrieval

Kiarash Naghavi Khanghah, Hoang Anh Nguyen, Anna C. Doris, Amir Mohammad Vahedi, Daniele Grandi, Faez Ahmed, Hongyi Xu

Affiliations: School of Mechanical, Aerospace, and Manufacturing Engineering, University of Connecticut, Storrs, CT 06269; Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

AI summary: Proposes the MCERF framework, which combines the multimodal retriever ColPali with LLM reasoning. Using strategies such as hybrid lookup, vision-to-text fusion, a high-reasoning mode, and self-consistency decisions, it achieves a 41.1% relative gain in average accuracy on the DesignQA benchmark and answers multimodal questions over engineering documents without ingesting the full rulebook.

DOI: 10.1115/1.4072033

Abstract

Engineering rulebooks and technical standards contain multimodal information such as dense text, tables, and illustrations that are challenging for retrieval-augmented generation (RAG) systems. Building upon the DesignQA framework [1], which relied on full-text ingestion and text-based retrieval, this work establishes a Multimodal ColPali Enhanced Retrieval and Reasoning Framework (MCERF), a system that couples a multimodal retriever with large language model reasoning for accurate and efficient question answering from engineering documents. The system employs ColPali, which retrieves both textual and visual information, and multiple retrieval and reasoning strategies: (i) Hybrid Lookup mode for explicit rule mentions, (ii) Vision to Text fusion for figure- and table-guided queries, (iii) High Reasoning LLM mode for complex multimodal questions, and (iv) Self-Consistency decision to stabilize responses. The modular framework design provides a reusable template for future multimodal systems regardless of underlying model architecture. Furthermore, this work establishes and compares two routing approaches: a single-case routing approach and a multi-agent system, both of which dynamically allocate queries to optimal pipelines. Evaluation on the DesignQA benchmark shows that the system improves average accuracy across all tasks by a relative 41.1% over the best baseline RAG results, a significant improvement on multimodal and reasoning-intensive tasks achieved without complete rulebook ingestion. This shows how vision-language retrieval, modular reasoning, and adaptive routing enable scalable document comprehension in engineering use cases.
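
The self-consistency decision listed as strategy (iv) is commonly implemented as a majority vote over several sampled answers. The sketch below shows that generic technique under stated assumptions; it is not taken from MCERF, and the function name, normalization step, and sample answers are all hypothetical.

```python
from collections import Counter

def self_consistent_answer(candidates):
    """Return the most frequent answer and the fraction of samples that agree."""
    if not candidates:
        raise ValueError("need at least one candidate answer")
    # Light normalization so trivially different strings count as the same vote.
    normalized = [c.strip().lower() for c in candidates]
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes / len(normalized)

# Hypothetical answers sampled for one rule-compliance question.
samples = ["Yes", "yes ", "No", "YES", "no"]
answer, agreement = self_consistent_answer(samples)
print(answer, agreement)
```

The agreement fraction can double as a rough confidence signal: a low value suggests the query may deserve a more expensive reasoning pipeline.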

2605.08692 2026-06-08 cs.LG cs.CL updated version

AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization

Beshr IslamBouli, David Jin

Affiliations: University of Waterloo

AI summary: Proposes AAAC, which replaces the fixed scalar codebook with two small learned codebooks per layer (64 bytes) and selects the codebook that minimizes activation-weighted reconstruction error. It achieves 4-bit weight quantization with zero additional storage overhead, completes quantization in 3 to 30 minutes, and is more accurate than existing methods.

Abstract

Post-training weight-only quantization to 4 bits is widely used to reduce the memory and compute costs of large language model inference. Existing PTQ methods, such as AWQ and GPTQ, improve how weights are mapped onto a fixed 4-bit grid through scaling, clipping, or error compensation. To further improve accuracy, methods such as OmniQuant and QuIP# use gradient-assisted algorithms at the cost of hours of quantization time. In this work, we propose AAAC (Activation-Aware Adaptive Codebooks), a lightweight method for 4-bit LLM weight quantization. AAAC replaces the fixed scalar codebook used in standard quantization with two small learned scalar codebooks (64 bytes) per layer. Each group of weights selects the codebook that minimizes activation-weighted reconstruction error, encoding the choice in the unused sign bit of the group's positive scale and adding zero storage overhead. AAAC completes in 3--30 minutes on a single GPU, and adds no memory beyond the model itself. We evaluate against AWQ, GPTQ, IF4, GPTVQ, OmniQuant, SqueezeLLM, and QuIP# across model families. AAAC outperforms baselines at orders-of-magnitude less quantization time.
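
The per-group codebook choice and the sign-bit encoding described in this abstract can be sketched as follows. This is an illustrative toy, not the AAAC implementation: the two codebooks are hand-made stand-ins rather than learned ones, the max-abs scale is an assumption, and the mapping of sign to codebook (positive for codebook 0, negative for codebook 1) is chosen arbitrarily for the example.

```python
def quantize_group(weights, act_weights, codebooks):
    """Pick the codebook with the lowest activation-weighted squared error.

    Returns (codes, signed_scale). The sign of signed_scale stores the choice:
    positive selects codebook 0, negative selects codebook 1.
    """
    scale = max(abs(w) for w in weights) or 1.0  # positive per-group scale
    best = None
    for cb_id, codebook in enumerate(codebooks):
        # Nearest codebook entry for each normalized weight.
        codes = [min(range(len(codebook)), key=lambda i: abs(w / scale - codebook[i]))
                 for w in weights]
        err = sum(a * (w - scale * codebook[c]) ** 2
                  for w, a, c in zip(weights, act_weights, codes))
        if best is None or err < best[0]:
            best = (err, cb_id, codes)
    _, cb_id, codes = best
    return codes, (scale if cb_id == 0 else -scale)

def dequantize_group(codes, signed_scale, codebooks):
    """Recover weights: the sign picks the codebook, the magnitude is the scale."""
    codebook = codebooks[0 if signed_scale > 0 else 1]
    return [abs(signed_scale) * codebook[c] for c in codes]

# Stand-in 16-entry (4-bit) codebooks: a uniform grid and one denser near zero.
uniform = [-1 + 2 * i / 15 for i in range(16)]
dense_near_zero = [u * abs(u) for u in uniform]
codebooks = [uniform, dense_near_zero]

weights = [0.02, -0.05, 0.01, 0.9]    # mostly small values plus one outlier
act_weights = [1.0, 1.0, 1.0, 1.0]    # hypothetical activation importance
codes, signed_scale = quantize_group(weights, act_weights, codebooks)
restored = dequantize_group(codes, signed_scale, codebooks)
print(codes, signed_scale, [round(r, 4) for r in restored])
```

Because a quantization scale is positive by construction, its sign bit carries no information, so reusing it as a one-bit codebook selector is what lets the choice be stored with no extra bytes per group.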

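2606.01765 2026-06-08 cs.FL cs.CL cs.LG updated version

An Algebraic View of the Expressivity of Recurrent Language Models

Franz Nowak, Ryan Cotterell, Reda Boumasmoud

Affiliations: GitHub

AI summary: Analyzes the expressivity of recurrent neural networks under different arithmetic models through a unified algebraic framework, reducing formal language recognition to the algebraic question of whether the syntactic monoid divides a particular wreath product.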

Comments 28 pages, 2 figures, to be published at ICML 2026

2606.05152 2026-06-08 cs.LG cs.AI cs.CL updated version

Reinforcement Learning from Rich Feedback with Distributional DAgger

Rishabh Agrawal, Jacob Fein-Ashley, Paria Rashidinejad

Affiliations: University of Southern California

AI summary: Proposes the DistIL algorithm, which uses a distributional variant of DAgger to exploit rich feedback (such as execution traces and tool outputs) through forward cross-entropy optimization, achieving monotonic policy improvement and better Pass@N performance.

Abstract

Reasoning models have advanced rapidly, but the dominant reinforcement learning from verifiable rewards (RLVR) recipe remains surprisingly narrow: sample many responses and reward each with a single bit indicating whether the final answer is correct. Yet many settings provide rich feedback, including execution traces, tool outputs, expert corrections, and model self-evaluations. We study how to use such feedback through a distributional variant of the classic imitation learning algorithm DAgger, where the learner has local access to an expert distribution on states visited by the current policy. This yields a simple forward cross-entropy objective that admits a black-box expert and whose sequence-level gradient conducts rich credit assignment by propagating future expert-student disagreement back to earlier decisions. We show that prior RL with self-distillation objectives based on reverse KL or Jensen-Shannon fail to guarantee monotonic policy improvement: even when the expert has higher reward, their updates may increase probability on worse actions. In contrast, we show that forward cross-entropy admits monotonic policy improvement and enjoys guarantees on regret. We further show that our objective optimizes a lower bound on teacher-weighted likelihood of success, leading to improved Pass@N. Empirically, our approach, DistIL, improves over RLVR and RL with self-distillation baselines across a variety of domains: scientific reasoning, coding, and solving hard mathematical problems.

2606.05761 2026-06-08 cs.AI cs.CL 版本更新

SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

SubtleMemory: 面向长时程AI智能体的细粒度关系记忆辨别基准

Wenxuan Wang, Haoyu Sun, Fukuan Hou, Mingyang Song, Weinan Zhang, Yu Cheng, Yang Yang

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Shanghai AI Laboratory（上海人工智能实验室）； Tongji University（同济大学）； Xiamen University（厦门大学）； Fudan University（复旦大学）； Shanghai Jiao Tong University（上海交通大学）； The Chinese University of Hong Kong（香港中文大学）

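AI summary: Proposes the SubtleMemory benchmark, which constructs relation-controlled latent semantic artifacts, embeds them in user-agent interaction histories, and evaluates whether long-horizon AI agents can recover the distributed relational structure in later queries.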

Comments 48 pages

Abstract

Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.


2601.14637 2026-06-08 cs.CV cs.AI cs.CL cs.HC Updated version

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

James Brock, Ce Zhang, Nantheera Anantrasirichai

Affiliations: School of Computer Science, University of Bristol; School of Geographical Sciences, University of Bristol

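AI summary: Proposes Forest-Chat, an LLM-based agent for forest change analysis that supports natural language queries across multiple tasks, improving the accuracy and interpretability of forest change detection and semantic interpretation.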

Comments 28 pages, 9 figures, 12 tables, Submitted to Ecological Informatics

DOI: 10.1016/j.ecoinf.2026.103741

Abstract

The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. This paper introduces Forest-Chat, an LLM-driven agent for forest change analysis, enabling natural language querying across multiple RSICI tasks, including change detection and captioning, object counting, deforestation characterisation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, incorporating zero-shot change detection via AnyChange and multimodal LLM-based zero-shot change captioning and refinement. To support adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and semantic change captions via human annotation and rule-based methods. Forest-Chat achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on Forest-Change, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI. In a zero-shot capacity, it achieves 60.15% and 34.00% on Forest-Change, and 47.32% and 18.23% on LEVIR-MCI-Trees. Further experiments demonstrate the value of caption refinement for injecting geographic domain knowledge into supervised captions, and the system's limited label domain transfer onto JL1-CD-Trees. These findings demonstrate that interactive, LLM-driven systems can support accessible and interpretable forest change analysis.
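The abstract reports mIoU for pixel-level change detection. As a rough illustration of how that metric is commonly computed on binary change masks, here is a minimal sketch. It assumes flattened masks with labels 0 (no change) and 1 (change) and averages IoU over the two classes. The paper's exact averaging scheme is not stated in the abstract, and the helper names and toy masks are hypothetical.

```python
def iou(pred, target, cls):
    """Intersection over union for one class on flattened label masks."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return inter / union if union else float("nan")

def mean_iou(pred, target, classes=(0, 1)):
    """Average per-class IoU, skipping classes absent from both masks."""
    scores = [iou(pred, target, c) for c in classes]
    valid = [s for s in scores if s == s]  # NaN != NaN, so this drops NaN
    return sum(valid) / len(valid)

# Toy flattened masks (illustrative values only).
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
print(mean_iou(pred, target))  # prints 0.5
```

In this toy case each class has two overlapping pixels out of four in the union, so both per-class IoUs are 0.5.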


2506.14634 2026-06-08 cs.CL cs.AI cs.CY Updated version

AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation

Leah von der Heyde, Anna-Carolina Haensch, Bernd Weiß, Jessica Daikeler

Affiliations: Social Data Science & AI Lab, LMU Munich; Munich Center for Machine Learning; University of Maryland, College Park; GESIS – Leibniz Institute for the Social Sciences

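AI summary: Examines how well large language models can code open-ended survey responses. Using German data on reasons for survey participation, it compares several LLMs and prompting approaches and finds that only a fine-tuned LLM achieves satisfactory predictive performance, and that unequal classification performance across categories changes the resulting category distributions.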

Comments to appear in Survey Research Methods

DOI: 10.18148/srm/2025.v19i4.8568
Journal ref: Survey Research Methods (2025)

Abstract

The recent development and wider accessibility of LLMs have spurred discussions about how they can be used in survey research, including classifying open-ended survey responses. Due to their linguistic capacities, it is possible that LLMs are an efficient alternative to time-consuming manual coding and the pre-training of supervised machine learning models. As most existing research on this topic has focused on English-language responses relating to non-complex topics or on single LLMs, it is unclear whether its findings generalize and how the quality of these classifications compares to established methods. In this study, we investigate to what extent different LLMs can be used to code open-ended survey responses in other contexts, using German data on reasons for survey participation as an example. We compare several state-of-the-art LLMs and several prompting approaches, and evaluate the LLMs' performance by using human expert codings. Overall performance differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory levels of predictive performance. Performance differences between prompting approaches are conditional on the LLM used. Finally, LLMs' unequal classification performance across different categories of reasons for survey participation results in different categorical distributions when not using fine-tuning. We discuss the implications of these findings, both for methodological research on coding open-ended responses and for their substantive analysis, and for practitioners processing or substantively analyzing such data. Finally, we highlight the many trade-offs researchers need to consider when choosing automated methods for open-ended response classification in the age of LLMs. In doing so, our study contributes to the growing body of research about the conditions under which LLMs can be efficiently, accurately, and reliably leveraged in survey research.


2501.11592 2026-06-08 cs.LG cs.AI cs.CL Updated version

Training-free Ultra Small Model for Universal Sparse Reconstruction in Compressed Sensing

Chaoqing Tang, Huanze Zhuang, Guiyun Tian, Zhenli Zeng, Yi Ding, Wenzhong Liu, Xiang Bai

Affiliations: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan, China; China Belt and Road Joint Lab on Measurement and Control Technology, Wuhan, China; School of Electric and Electrical Engineering, Chongqing University of Technology, Chongqing, China; Optics Valley Laboratory, Wuhan, China; School of Water Conservancy and Transportation, Zhengzhou University, Zhengzhou, China; School of Software Engineering, Huazhong University of Science and Technology, Wuhan, China

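AI summary: Proposes coefficients learning (CL), a training-free ultra-small neural model for fast sparse reconstruction that inherits the generality and interpretability of traditional iterative methods while improving efficiency and accuracy.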

DOI: 10.1109/TPAMI.2026.3680162

Abstract

Pre-trained large models have attracted widespread attention in recent years, but they face challenges in applications that require high interpretability or have limited resources, such as physical sensing, medical imaging, and bioinformatics. Compressed Sensing (CS) is a well-established theory that has driven many recent breakthroughs in these applications. However, as a typical under-determined linear system, CS suffers from excessively long sparse reconstruction times when using traditional iterative methods, particularly with large-scale data. Current AI methods such as deep unfolding fail to replace them because pre-trained models generalize poorly beyond their training conditions and dataset distributions, or lack interpretability. Instead of following the big-model fervor, this paper proposes ultra-small artificial neural models called coefficients learning (CL), which enable training-free, rapid sparse reconstruction while perfectly inheriting the generality and interpretability of traditional iterative methods and adding the ability to incorporate prior knowledge. In CL, a signal of length $n$ needs a minimum of only $n$ trainable parameters. A case study model called CLOMP is implemented for evaluation. Experiments are conducted on both synthetic and real one-dimensional and two-dimensional signals, demonstrating significant improvements in efficiency and accuracy. Compared to representative iterative methods, CLOMP improves efficiency by 100- to 1000-fold for large-scale data. Test results on eight diverse image datasets indicate that CLOMP improves the structural similarity index by 292%, 98%, and 45% for sampling rates of 0.1, 0.3, and 0.5, respectively. We believe this method can truly usher CS reconstruction into the AI era, benefiting countless under-determined linear systems that rely on sparse solutions.
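For context on the "traditional iterative methods" the abstract compares against, the sketch below shows classic orthogonal matching pursuit (OMP) with NumPy. This is not CLOMP or the CL model, whose internals the abstract does not describe. That the "OMP" in CLOMP refers to this algorithm is an assumption, and the function name, matrix sizes, and signal values are hypothetical.

```python
import numpy as np

def omp(A, y, sparsity):
    """Classic OMP: greedily recover a sparse x such that y is close to A @ x."""
    n = A.shape[1]
    residual = np.asarray(y, dtype=float).copy()
    support = []
    x = np.zeros(n)
    for _ in range(sparsity):
        # Pick the column most correlated with the current residual.
        correlations = np.abs(A.T @ residual)
        if support:
            correlations[support] = 0.0  # never pick the same column twice
        support.append(int(np.argmax(correlations)))
        # Least-squares fit on the selected columns, then update the residual.
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
        x = np.zeros(n)
        x[support] = coeffs
    return x

# Hypothetical under-determined system: 20 measurements, 50 unknowns, 3 non-zeros.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
A /= np.linalg.norm(A, axis=0)  # unit-norm columns
x_true = np.zeros(50)
x_true[[3, 17, 41]] = [1.5, -2.0, 0.8]
x_hat = omp(A, A @ x_true, sparsity=3)
print("max abs error:", np.max(np.abs(x_hat - x_true)))
```

Each iteration solves a least-squares problem on the growing support. Repeated solves of this kind are a common cost driver for greedy pursuit methods on large-scale data.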
