arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.29691 2026-05-29 cs.CV

Unsupervised Semantic Segmentation Facilitates Model Understanding

Xiaoyan Yu, Lisa Mais, Jannik Franzen, Peter Hirsch, Nick Lechtenbörger, Andreas Mardt, Dagmar Kainmüller

Affiliations: Max-Delbruck-Center; Helmholtz Imaging; Humboldt-Universität zu Berlin; Charité Universitätsmedizin; University of Potsdam

AI summary: Proposes a visualization protocol based on unsupervised semantic segmentation that makes properties of different self-supervised vision transformers, such as attention mechanics, positional biases, and scaling behavior, directly visible.

Abstract

Self-supervised learning (SSL) has produced a diverse landscape of vision transformers (ViTs) whose pretrained representations support a wide range of downstream tasks. Towards a better understanding of these models, a body of work has assessed the mechanics of their self-attention as well as the types of information captured across their representations, revealing, for example, stark differences between models trained with contrastive learning (CL) and masked image modeling (MIM). However, these advances in model understanding have not yet fully permeated the broader community, where insights specific to CL models are sometimes generalized to MIM models. To make model understanding straightforward and intuitive for a broad audience, we propose a simple and easily interpretable visualization protocol. Our protocol is based on visualizing unsupervised semantic segmentation results, yet our goal is not to maximize segmentation performance. Instead, it allows us to convey model behaviors that consistently emerge across images. Benchmarking a diverse set of SSL models across layers and representations, we obtain novel insights into distinct positional biases and scaling behaviors, including strong boundary artifacts in DINOv3-Large model tokens. These insights complement and help communicate a range of previous findings. Our protocol further enables a clear visual distinction between positional effects and the closely related but distinct locality bias, which has been studied much more extensively in the literature. The protocol is publicly available on GitHub and we believe it will catalyze further model understanding for a broad community.

2605.29688 2026-05-29 cs.LG

A Novel Tensor Product-Based Neural Network for Solving Partial Differential Equations

Qihong Yang, Yangtao Deng, Qiaolin He, Shiquan Zhang

Affiliations: School of Mathematics, Sichuan University

AI summary: Proposes the Tensor Product Network (TPNet), which represents the solution explicitly as a linear combination of basis functions and solves for the coefficients directly by least squares, giving efficient and accurate function approximation and PDE solving.

Comments: 44 pages, 11 figures

Abstract

This paper presents the Tensor Product Network (TPNet), a novel neural architecture for efficient and accurate function approximation and PDE solving. The core of the proposal involves constructing the solution explicitly as a linear combination of basis functions integrated into the network, with coefficients determined by a direct least-squares solve, thereby bypassing traditional gradient-based training. The key methodological contributions include: (1) an efficient tensor-product scheme that generates multi-dimensional basis functions from combinations of two sets of subnetwork outputs, significantly reducing model complexity and parameter count while maintaining expressivity; (2) a block time-marching strategy to improve computational efficiency in long-time simulations; and (3) a linear reformulation strategy for handling nonlinear PDEs by treating known nonlinear terms as sources. TPNet achieves superior accuracy and shorter training times than conventional neural network solvers. This performance gain stems from its structured design and deterministic least-squares fitting, which contrast with the iterative, often computationally intensive optimization required by mainstream methods like Physics-Informed Neural Networks (PINNs).
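
The least-squares step described in this abstract can be illustrated with a toy sketch. It assumes NumPy is available, and it uses fixed random tanh features as a stand-in for the outputs of the two subnetworks; the function names, feature count, and target function are all hypothetical and are not taken from the paper.

```python
import numpy as np

def subnet_features(t, weights, biases):
    """Stand-in for a subnetwork: fixed random tanh features of one coordinate."""
    return np.tanh(np.outer(t, weights) + biases)  # shape (n_points, n_features)

def fit_tensor_product(x, y, target, n_feat=12, seed=0):
    """Fit target(x, y) as a linear combination of products phi_i(x) * psi_j(y)."""
    rng = np.random.default_rng(seed)
    wx, bx = rng.normal(size=n_feat) * 2.0, rng.normal(size=n_feat)
    wy, by = rng.normal(size=n_feat) * 2.0, rng.normal(size=n_feat)
    phi = subnet_features(x, wx, bx)  # features of the x coordinate
    psi = subnet_features(y, wy, by)  # features of the y coordinate
    # Tensor-product basis: every pairwise product phi_i(x) * psi_j(y).
    basis = (phi[:, :, None] * psi[:, None, :]).reshape(len(x), -1)
    # Coefficients come from one direct least-squares solve, not gradient descent.
    coeffs, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return basis @ coeffs, coeffs

# Hypothetical smooth 2D target sampled at random points in [-1, 1]^2.
rng = np.random.default_rng(1)
x, y = rng.uniform(-1, 1, 400), rng.uniform(-1, 1, 400)
target = np.sin(np.pi * x) * np.cos(np.pi * y)
prediction, coeffs = fit_tensor_product(x, y, target)
rel_error = np.linalg.norm(prediction - target) / np.linalg.norm(target)
```

Two sets of 12 one-dimensional features yield 144 two-dimensional basis functions, which is the parameter saving the tensor-product construction aims at; only the 144 coefficients are fitted here.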

2605.29687 2026-05-29 cs.AI cs.LO

Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability

Pedro Orvalho, Marta Kwiatkowska, Guillem Alenyà, Felip Manyà

Affiliations: Artificial Intelligence Research Institute (IIIA), Consejo Superior de Investigaciones Científicas (CSIC); Department of Computer Science, University of Oxford; Institut de Robòtica i Informàtica Industrial (IRI-CSIC-UPC)

AI summary: Proposes a hybrid reasoning method in which LLM-generated code encodes a natural-language problem as a preference-based MaxSAT problem, which an exact solver then solves and which is independently verified, substantially raising feasibility.

Comments: 17 pages, 1 figure, 4 tables

Abstract

Large Language Models (LLMs) excel at understanding natural language but struggle with optimisation tasks involving multiple constraints and user-defined preferences, which commonly arise in domains such as robotics. We propose a hybrid reasoning approach in which LLMs externalise reasoning through code generation. Given a natural language problem description, an LLM generates Python code that encodes user-defined constraints and preferences as a preference-based Maximum Satisfiability (MaxSAT) problem, which is then solved by an exact MaxSAT solver. To ensure correctness, solutions returned by the model-generated code are independently verified for feasibility and optimality against a canonical MaxSAT encoding, allowing for different encodings and multiple optimal solutions. We evaluate our approach using both open-source and closed-access LLMs on three families of preference-based reasoning tasks, and compare it against direct-answer, chain-of-thought, and program-of-thought baselines using the same models. While these baselines rarely produce feasible solutions, the MaxSAT-based pipeline achieves substantially higher acceptance rates, in some cases exceeding 80%. Our results demonstrate that LLM-driven code generation combined with preference-based MaxSAT enables solver-verifiable optimisation with respect to generated encodings, and substantially improves correctness under independently verified reference semantics.
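
To make the encoding idea concrete, here is a minimal sketch of a weighted partial MaxSAT instance in plain Python. Exhaustive search stands in for the exact MaxSAT solver mentioned in the abstract, and the route-selection task, clause lists, and weights are hypothetical illustrations rather than material from the paper.

```python
from itertools import product

def clause_satisfied(clause, assignment):
    """A clause is a list of signed ints: 3 means x3 is true, -3 means x3 is false."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def solve_maxsat(n_vars, hard, soft):
    """Exhaustive weighted partial MaxSAT: satisfy every hard clause and
    maximise the total weight of satisfied soft clauses."""
    best_weight, best_assignment = None, None
    for values in product([False, True], repeat=n_vars):
        assignment = dict(enumerate(values, start=1))
        if not all(clause_satisfied(c, assignment) for c in hard):
            continue  # infeasible: a hard constraint is violated
        weight = sum(w for w, c in soft if clause_satisfied(c, assignment))
        if best_weight is None or weight > best_weight:
            best_weight, best_assignment = weight, assignment
    return best_weight, best_assignment

# Hypothetical task: a robot picks exactly one of three routes (x1, x2, x3).
hard = [[1, 2, 3], [-1, -2], [-1, -3], [-2, -3], [-2]]  # last clause: route 2 is blocked
soft = [(3, [2]), (2, [3]), (1, [1])]                   # user preferences as weights
best_weight, best = solve_maxsat(3, hard, soft)
```

The most preferred route is excluded by a hard constraint, so the optimum falls back to the second preference. Hard clauses carry feasibility and soft clauses carry preference, which is the separation a preference-based encoding relies on.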

2605.29685 2026-05-29 cs.AI

NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs

Yunjin Qi, Zhaojun Jiang, Xuan Wu, Hanxi Pan, Yixuan Wang, Yanfang Liu, Xiang Ji, Churu Yu, Chunyuan Zheng, Yingze Chen, Jie He, Liuqing Chen, Zaifeng Gao

Affiliations: Department of Psychology and Behavioral Sciences, Zhejiang University; College of Artificial Intelligence, Zhejiang University; Human Machine Interaction Lab, Huawei Technologies Co., Ltd.; Zhejiang Key Laboratory of Neurocognitive Development and Mental Health

AI summary: Builds a social-theory-grounded framework of social intelligence and proposes the diagnostic benchmark NICE for fine-grained evaluation of LLM weaknesses in social interaction.

Abstract

As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interaction. However, existing social intelligence benchmarks lack a framework that organizes social abilities into a unified structure, and therefore cannot enable fine-grained diagnosis. To build the first holistic diagnostic evaluation grounded in social theory, we first construct a social intelligence framework through a literature review and multi-stage expert validation guided by psychometric principles. The resulting framework includes 4 categories and 11 dimensions, each further specified by fine-grained capability facets. Building on this framework, we introduce NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items operationalized through representative Chinese contexts. Across 5 frontier LLMs and a human reference group, models score higher in aggregate accuracy yet show a consistent weakness in Communication, which the framework localizes to 3 specific capability facets: multi-turn communication, nonverbal communication, and synchrony. NICE thus reframes social intelligence evaluation toward theory-grounded diagnosis of socially consequential weaknesses in LLMs.

2605.29684 2026-05-29 cs.LG cond-mat.dis-nn stat.ML

Kernel Renormalization in Bayesian Deep Neural Networks: the Equivalent Wishart Ansatz in the Proportional Regime

Paolo Baglioni, Christian Keup, Vincenzo Zimbardo, Rosalba Pacelli, Alessandro Vezzani, Raffaella Burioni, Pietro Rotondo

Affiliations: INFN, Sezione di Milano Bicocca; INFN, Gruppo Collegato di Parma; Dipartimento di Scienze Matematiche, Fisiche e Informatiche, Università degli Studi di Parma; Istituto dei Materiali per l’Elettronica ed il Magnetismo (IMEM-CNR), Parco Area delle Scienze

AI summary: For Bayesian multi-layer perceptrons of fixed depth L, proposes an equivalent Wishart Ansatz that captures the stochastic fluctuations of the hierarchical empirical kernels, derives a renormalized NNGP kernel description through a large deviation analysis, characterizes representation learning in the proportional limit with at most L scalar order parameters, and extends the approach to CNNs, revealing a local kernel renormalization mechanism.

Comments: 45 pages, 21 figures

Abstract

The scaling limit where both the size of the training set $P$ and the width $N$ of a deep neural network grow at the same rate, the so-called proportional-width regime, has been intensely studied for shallow, single-hidden-layer networks. However, extending these non-perturbative results from shallow architectures to deep non-linear networks has proven very challenging. Here we present an effective approximate approach to predict the generalization performance of Bayesian multi-layer perceptrons (MLPs) of fixed depth $L$ on arbitrary high-dimensional data. We propose an equivalent Wishart Ansatz to capture the dominant stochastic fluctuations of the hierarchical empirical kernels of MLPs. This allows us to perform a large deviation analysis for the partition function of MLPs in the proportional limit, expressed in terms of a renormalized NNGP kernel. In this description, even strong representation learning in the proportional limit is encoded in at most $L$ scalar order parameters, determined self-consistently. Extending the approach to convolutional architectures (CNNs), we identify a hierarchical local kernel renormalization mechanism, which allows us to quantify more complex data-dependent transformations of the large-width kernel in CNNs due to finite-width effects. We test our effective theory against sampling experiments from the Bayesian posterior of finite deep neural networks with depths $L \sim O(10)$ and $P\sim O(10^3)$ on classic benchmark datasets, finding overall very good agreement together with two distinct types of systematic deviations.

2605.29682 2026-05-29 cs.CL

Scaling Laws for Agent Harnesses via Effective Feedback Compute

Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che

Affiliations: Harbin Institute of Technology

AI summary: Proposes Effective Feedback Compute (EFC) as a scaling coordinate that predicts agent-harness performance by measuring whether feedback is informative, valid, non-redundant, and retained, outperforming raw-compute baselines across several tasks.

Abstract

Agent harnesses increasingly determine the performance of language-model systems by deciding how models call tools, receive feedback, verify intermediate states, store memory, and revise solutions. Yet current test-time scaling analyses often parameterize this process by raw expenditure -- tokens, tool calls, operations, wall time, or cost -- which does not distinguish useful feedback from redundant or unstable interaction. We introduce Effective Feedback Compute (EFC), a trace-level scaling coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, and we normalize it by task demand when comparing tasks with different feedback requirements. Across synthetic controllable tasks, executable code tasks, real benchmark traces, held-out splits, and a prospective validation batch, EFC-based coordinates consistently predict failure rates better than raw-compute baselines and a strong multivariate SAS baseline. In controlled scaling, raw tokens and tool calls explain limited variation ($R^2=0.33$ and $0.42$), SAS reaches $0.88$, while Oracle-EFC and Estimated-EFC reach $0.94$ and Oracle-EFC/$D_{\mathrm{task}}$ reaches $0.99$. Matched-budget interventions show that improving feedback quality raises success from $0.27$ to $0.90$ while raw cost and tool calls are fixed. On mixed real traces, NRS-EFC/$D_{\mathrm{task}}$ reaches $R^2=0.92$ while raw compute has near-zero or negative fit, and it remains the best predictor in a prospective holdout ($R^2=0.85$). These results suggest that harness scaling is governed less by how much computation is spent than by how efficiently raw budget is converted into durable, task-sufficient feedback.
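
The crediting rule can be sketched as a simple counter over a trace. The abstract does not give the actual estimator, so everything below is an assumption for illustration: the `FeedbackEvent` fields, the use of a content signature to detect redundancy, and the division by a scalar task demand.

```python
from dataclasses import dataclass

@dataclass
class FeedbackEvent:
    signature: str     # hypothetical fingerprint of the feedback content
    informative: bool  # did it tell the agent something it could act on?
    valid: bool        # was the feedback itself correct?
    retained: bool     # was it still available to later decisions?

def effective_feedback_compute(events, task_demand=1.0):
    """Count only feedback that is informative, valid, non-redundant, and
    retained, then normalise by an assumed scalar task demand."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    seen = set()
    credited = 0
    for event in events:
        redundant = event.signature in seen
        seen.add(event.signature)
        if event.informative and event.valid and event.retained and not redundant:
            credited += 1
    return credited / task_demand

# Hypothetical trace of five tool-feedback events.
trace = [
    FeedbackEvent("test_fail_A", True, True, True),   # credited
    FeedbackEvent("test_fail_A", True, True, True),   # redundant repeat
    FeedbackEvent("lint_ok", False, True, True),      # not informative
    FeedbackEvent("test_fail_B", True, True, False),  # not retained
    FeedbackEvent("test_pass", True, True, True),     # credited
]
raw_events = len(trace)
efc = effective_feedback_compute(trace)
efc_normalised = effective_feedback_compute(trace, task_demand=4.0)
```

The trace contains five raw events but only two credited ones, which is the gap between raw expenditure and effective feedback that the abstract describes.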

2605.29678 2026-05-29 cs.CL

Spurious Prompts: Can Irrelevant Prompts Steer Large Language Models?

Pawel Batorski, Abtin Pourhadi, Jerzy Sarosiek, Przemyslaw Spurek, Paul Swoboda

Affiliations: Heinrich Heine University Düsseldorf; Jagiellonian University; IDEAS Research Institute

AI summary: Studies how semantically unrelated prompts (spurious prompts) affect LLM behavior, proposes a black-box search method for discovering them, and shows that they can significantly influence model outputs across several benchmarks and models.

Abstract

Large language models are highly sensitive to prompts, but this sensitivity is usually studied through task-relevant instructions, demonstrations, or reasoning cues. In this paper, we study a different form of prompt sensitivity: whether prompts that are semantically unrelated to the task can nevertheless steer model behavior. We call them spurious prompts and show their surprising efficacy. We also propose a simple black-box search procedure for discovering them. Across reasoning and question-answering benchmarks, using models ranging from 0.8B to 27B parameters and spanning three model families, we show that spurious prompts can improve performance, often matching or outperforming standard prompting baselines and task-aware prompt optimization. We further show that they can steer models toward unintended behaviors, such as repeatedly selecting the first answer option, producing incorrect answers, or returning an even, prime, or small number, without the model being explicitly instructed to do so. These findings reveal a new kind of prompt sensitivity: LLMs can be systematically steered by prompts that are unrelated to the task they are asked to solve. Our code is available at https://github.com/Batorskq/spurious.

2605.29670 2026-05-29 cs.CL cs.AI

EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL

Huawei Zheng, Sen Yang, Zhaorui Yang, Yuhui Zhang, Haozhe Feng, Haoxuan Li, Xuan Yi, Chao Hu, Defeng Xie, Chen Hou, Danqing Huang, Wei Chen, Yingcai Wu, Peng Chen, Dazhen Deng

Affiliations: School of Software Technology, Zhejiang University; State Key Lab of CAD&CG, Zhejiang University; Tencent TEG; School of Mathematical Sciences, Peking University

AI summary: Proposes EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition and reframes schema linking as uncertainty-aware schema-need inference, balancing schema completeness, relevance, and token cost to improve large-scale Text-to-SQL.

Abstract

Schema linking is a difficult and important step in large-scale Text-to-SQL, where systems must identify a compact yet sufficient schema context from large and ambiguous databases. Existing methods often treat schema linking as deterministic selection around a single SQL path, but complex questions may admit multiple valid realizations with different schema needs. We reframe schema linking as uncertainty-aware schema-need inference over multiple plausible SQL paths, where the system distinguishes required schema items from path-dependent uncertain ones and acquires evidence only where needed. We instantiate this reframing with EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition. Experiments on BIRD-Dev and Spider2-Snow show that this perspective improves the balance among schema completeness, schema relevance, and token cost. On Spider2-Snow, EviLink achieves 90.15% field-level strict recall rate, uses 123.30K average tokens, and improves downstream SQL generation under a fixed generator.
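
The split between required and path-dependent schema items can be sketched with set operations. This is only an illustration under stated assumptions: each plausible SQL path is reduced to the set of columns it would touch, an item needed by every path counts as required, and the table and column names are hypothetical.

```python
def split_schema_needs(hypotheses):
    """hypotheses: list of sets of schema items, one set per plausible SQL path.
    Returns (required, uncertain): items every path needs, and items that
    only some paths need."""
    if not hypotheses:
        return set(), set()
    required = set.intersection(*hypotheses)
    uncertain = set.union(*hypotheses) - required
    return required, uncertain

# Hypothetical question: "total order value per customer region".
paths = [
    {"orders.amount", "orders.customer_id", "customers.id", "customers.region"},
    {"orders.amount", "orders.customer_id", "customers.id", "regions.name", "customers.region_id"},
]
required, uncertain = split_schema_needs(paths)
```

Only the items in `uncertain` would call for additional evidence, such as inspecting column values, before deciding which path to keep. That is one way to read "acquires evidence only where needed".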

2605.29668 2026-05-29 cs.AI cs.CL

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem

Affiliations: Technical University of Munich and TUM University Hospital; Microsoft Healthcare & Life Sciences

AI summary: Proposes GRASP, which edits a skill library through a gated, regression-aware procedure so that every skill update yields a net improvement under a hard regression budget, markedly improving the operational reliability of LLM agents in structured environments.

Abstract

LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
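
The acceptance gate can be sketched as a comparison of per-task outcomes on a held-out probe. The function name, the dictionary representation, and the default budget of zero are assumptions for illustration; the abstract does not specify how the gate is implemented.

```python
def admit_candidate(baseline, candidate, regression_budget=0):
    """Decide whether to accept a candidate skill edit.

    baseline and candidate map each held-out probe task to True (solved) or
    False (failed), without and with the candidate skill respectively.
    """
    fixed = sum(1 for task in baseline if candidate[task] and not baseline[task])
    regressed = sum(1 for task in baseline if baseline[task] and not candidate[task])
    net_gain = fixed - regressed
    # Accept only on a net improvement that stays within the regression budget.
    return net_gain > 0 and regressed <= regression_budget

# Hypothetical probe results.
baseline = {"t1": True, "t2": False, "t3": False, "t4": True}
helpful = {"t1": True, "t2": True, "t3": False, "t4": True}  # fixes t2, breaks nothing
risky = {"t1": False, "t2": True, "t3": True, "t4": True}    # fixes two, breaks t1

accept_helpful = admit_candidate(baseline, helpful)
accept_risky = admit_candidate(baseline, risky)
accept_risky_relaxed = admit_candidate(baseline, risky, regression_budget=1)
```

The `risky` edit has a positive net gain but is still rejected under a zero budget, which separates a hard regression budget from a plain net-improvement check.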

2605.29667 2026-05-29 cs.CL

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

Wajdi Zaghouani, Kholoud K. Aldous, Yicheng Gao

Affiliations: Northwestern University in Qatar

AI summary: To address the failure of LLM safety systems in Chinese-language settings, builds ChiSafe-PAS, a human-annotated benchmark of 1,897 adversarial prompts covering four high-stakes domains, and provides complete annotations for evaluating model safety alignment.

Journal ref: Proceedings of the Fourth International Workshop on the Role of Resources in the Age of Large Language Models (RESOURCEFUL-2026) at LREC 2026, Palma de Mallorca, Spain, 2026

Abstract

When Large Language Models (LLMs) are deployed in Chinese-language settings, a troubling pattern emerges: safety systems that work well in English break down. These systems struggle to cross linguistic and cultural boundaries, leaving models exposed to adversarial prompts that exploit Chinese-specific evasion techniques, including Pinyin romanization, character decomposition, internet slang, and hedging tone. To address this gap, we introduce ChiSafe-PAS (Chinese Safety Pilot Annotation Set), a human-annotated benchmark of 1,897 adversarial Chinese prompts spanning four high-stakes domains: self-harm and violence, drug and illicit trade, fraud, and satire. Of these, 1,544 entries carry complete gold-standard annotations: a 3-class response label (REFUSE, SAFE-REDIRECT, RESPOND), a nine-category obfuscation taxonomy, a risk-level rating, and annotator rationale. We describe the dataset design, annotation process, and obfuscation taxonomy in detail. Our primary goal is practical: to give the research community a high-quality, culturally grounded resource for benchmarking LLM safety alignment. In doing so, we engage three broader tensions in the field: the blurring boundary between training and evaluation data, the need for domain coverage grounded in real-world risk, and the limits of scale as a substitute for cultural expertise.

2605.29659 2026-05-29 cs.LG cs.AI cs.CL

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Ihor Stepanov, Aleksandr Smechov

Affiliations: Knowledgator; Wordcab

AI summary: Proposes Opir, a family of encoder-based guardrail models built on the GLiClass architecture that use multi-task learning for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization; across 12 safety-classification tasks and 17 category tasks it is competitive with existing guardrail systems at a smaller deployment cost.

Comments: 23 pages, 4 figures, 9 tables

Abstract

Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign sensitive text from genuinely covert harmful content. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe categorization. The models are trained on a three-level taxonomy containing 996 categories across 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-sourced an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems -- including both GLiNER2-based and generative guardrail models -- Opir variants are competitive on or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.

2605.29657 2026-05-29 cs.CV cs.AI

OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

Geng Li, Guohao Chen, Ting Chen, Shilin Shan, Kuangji Zuo, Bofan Lyu, Tuo An, Gen Li, Jianfei Yang

Affiliations: Nanyang Technological University (NTU)

AI summary: Proposes the OccamToken framework, which replaces the absolute-ranking paradigm with register-anchored relative evidence testing to prune visual tokens without training and with an adaptive budget, sharply compressing the token count while keeping accuracy high.

Comments: 26 pages, 8 figures

Abstract

Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.
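
The contrast between a fixed top-K budget and a register-anchored threshold can be shown with a small sketch. The abstract does not state how the thresholds are derived, so the mean register attention scaled by a `margin` factor is an assumption, as are the function name and the attention values.

```python
def prune_visual_tokens(visual_attention, register_attention, margin=1.0):
    """Keep the indices of visual tokens whose attention exceeds a threshold
    derived from the register tokens (here: their mean times `margin`)."""
    threshold = margin * sum(register_attention) / len(register_attention)
    return [i for i, score in enumerate(visual_attention) if score > threshold]

# Two hypothetical images scored against the same register baseline.
registers = [0.04, 0.06]
detailed_image = [0.30, 0.02, 0.01, 0.25, 0.03, 0.20]
redundant_image = [0.04, 0.03, 0.22, 0.01, 0.02, 0.03]

kept_detailed = prune_visual_tokens(detailed_image, registers)
kept_redundant = prune_visual_tokens(redundant_image, registers)
```

The number of retained tokens differs between the two inputs even though no token budget was set, which is what "budget-adaptive" means in this sketch. A fixed top-K rule would keep the same count for both.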

2605.29656 2026-05-29 cs.AI

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

Yundong Kim, Heyoung Yang

Affiliations: Applied Agent Research Center, Korea Institute of Science and Technology Information; Department of Computer Science and Engineering, University of Seoul, Republic of Korea

AI summary: Proposes the TRACE metric, which combines Toulmin's argumentation theory with Flavell's metacognitive framework to analyze the structure of chain-of-thought reasoning; experiments show a strong correlation with benchmark accuracy (r=0.74), and the metric works as an effective reinforcement learning reward signal.

Comments: 23 pages, Accepted at ICML 2026

Abstract

Evaluating open-ended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. Existing metrics rely on final-answer accuracy or surface-level statistics, leaving the reasoning process itself unexamined. We introduce TRACE (Toulmin-based Reasoning Assessment through Constructive Elements), a metric that analyzes Chain-of-Thought (CoT) reasoning processes. Rather than judging outcomes, TRACE inspects how arguments are constructed by integrating Toulmin's argumentation theory with Flavell's metacognitive framework to assess reasoning structure. Experiments on 26.3K QA samples across 7 reasoning models show strong correlation with benchmark accuracy (r=0.74). Furthermore, TRACE is effective as a reinforcement learning reward signal, outperforming accuracy-only baselines. Together, these results indicate that logically sound reasoning leads to higher-quality answers. TRACE thus serves as a complementary metric for evaluating open-ended outputs. Code is available at https://github.com/hyyangkisti/trace.

2605.29653 2026-05-29 cs.AI

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

Dongdong Hua, Yifei Sun, Renhong Huang, Feng Gao, Chunping Wang, Yang Yang

Affiliations: Zhejiang University; FinVolution Group

AI summary: Proposes the PTCG-Bench benchmark, which uses the Pokémon Trading Card Game to evaluate the decision-making performance and self-evolution ability of LLM agents, and designs a modular harness ablation to analyze agent performance.

Abstract

Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agents require similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic and evolving decision-making scenarios. We present PTCG-Bench, a benchmark built on the Pokémon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) their decision-making performance within a single complex environment, and (2) their ability to self-evolve through accumulated experience. We further include a modular harness ablation to better interpret agent performance without conflating it with model capability. Our experiments show that, although LLM agents can achieve non-trivial gameplay performance, sustained and stable self-evolution remains challenging, and performance is sensitive to harness design. We hope that PTCG-Bench will facilitate future research on harness-aware and self-evolving agents in realistic interactive environments.

2605.29652 2026-05-29 cs.AI

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

快速思考,智能对话:结构化健康文本生成中确定性与神经计算的划分

Kai-Chen Cheng, Haejun Han, David Q. Sun

发表机构 * Kai-Chen Cheng Haejun Han David Q. Sun

AI总结 提出一种将确定性计算与有限LLM调用相结合的流水线,用于结构化健康文本生成,在降低错误率和成本的同时保持忠实性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用于从结构化记录(如可穿戴时间序列、生物标志物、生命体征和护理管理日志)生成健康文本。对于重复性健康输出,流畅性是不够的:系统必须忠实于源数据,将解释性主张建立在可用证据上,遵循既定政策,输出机器可读的内容,并且运行成本足够低以支持重复使用。我们探讨在结构化健康生成中,哪些责任应由确定性计算承担,而非运行时LLM提示。我们引入了“快速思考,智能对话”,一个睡眠健康洞察流水线,其中确定性代码在调用一次有界LLM写入器之前执行重复分析。在280个用户-夜晚和六个模型上,与结构化零样本和少样本单次调用基线相比,该方法实现了更低的数值误差、更低的指令合规误差和更低的端到端成本。层替换揭示了特定合约的失败:LLM比较增加了数值误差,LLM排序削弱了策略选择,LLM归因增加了无根据的因果语言,而LLM生成的写入器接口即使在上游事实已由确定性计算给出后也会重新引入误差。结果支持一个更广泛的设计规则:让代码负责重复分析,让LLM在有界接口内表达已验证的事实。

英文摘要

Large language models (LLMs) are increasingly being used to generate health text from structured records such as wearable time series, biomarkers, vitals, and care-management logs. For recurring health outputs, fluency is not enough: systems must remain faithful to source data, ground explanatory claims in available evidence, follow stated policies, emit machine-readable outputs, and run cheaply enough for repeated use. We ask which responsibilities in structured health generation should be deterministic computation rather than runtime LLM prompting. We introduce Think Fast, Talk Smart, a sleep-health insight pipeline in which deterministic code performs recurring analysis before one bounded LLM writer call. Across 280 user-nights and six models, it achieves lower numeric error, lower instruction-compliance error, and lower end-to-end cost than structured zero-shot and few-shot one-call baselines. Layer replacement reveals contract-specific failures: LLM comparison raises numeric error, LLM ranking degrades policy selection, LLM attribution increases unsupported causal language, and an LLM-generated writer interface reintroduces errors even after upstream facts are deterministic. The results support a broader design rule: let code own recurring analysis, and let LLMs express verified facts within bounded interfaces.
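The division of labor described in this abstract can be sketched roughly as follows. This is an illustration only, not the authors' pipeline: the field names (`total_sleep_min`, `baseline_min`), the 30-minute threshold, and the prompt wording are all assumptions made for the example. Deterministic code computes and compares the numbers, and the writer prompt receives only the resulting facts.

```python
from typing import Dict

def analyze_night(total_sleep_min: int, baseline_min: int) -> Dict[str, object]:
    """Deterministic layer: every number and comparison is computed in code."""
    delta = total_sleep_min - baseline_min
    if delta <= -30:            # hypothetical policy threshold
        trend = "below baseline"
    elif delta >= 30:
        trend = "above baseline"
    else:
        trend = "near baseline"
    return {"total_sleep_min": total_sleep_min, "delta_min": delta, "trend": trend}

def build_writer_prompt(facts: Dict[str, object]) -> str:
    """Bounded interface: the LLM is asked only to phrase the given facts."""
    lines = [f"- {key}: {value}" for key, value in sorted(facts.items())]
    header = "Rewrite these verified facts as one sentence. Do not add numbers or causes."
    return header + "\n" + "\n".join(lines)

facts = analyze_night(total_sleep_min=385, baseline_min=430)
prompt = build_writer_prompt(facts)
```

Because the comparison lives in `analyze_night`, it can be unit-tested without calling a model, which is one plausible reading of the abstract's "let code own recurring analysis".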

2605.29648 2026-05-29 cs.CL

Verifiable Rewards Beyond Math and Code: Lightweight Corpus-Grounded Process Supervision for Factual Question Answering

超越数学与代码的可验证奖励:面向事实问答的轻量级语料库基础过程监督

Shicheng Fan, Haochang Hao, Dehai Min, Weihao Liu, Philip S. Yu, Lu Cheng

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出CorVer,一种基于语料库共现统计的轻量级过程奖励方法,通过句子级信用分配和令牌级优势映射,在多个模型和基准上显著提升事实问答准确性且训练速度更快。

详情
AI中文摘要

将强化学习应用于提高知识密集型问答的事实准确性面临奖励设计困境。响应级奖励仅提供粗略监督,无法区分推理轨迹中的正确与错误陈述。句子级替代方案提供更细粒度的反馈,但通常依赖于NLI验证器、LLM评判或知识验证流水线,这些方法在RL规模下部署成本高昂,且对于稀有实体事实(准确奖励信号尤为重要)往往不可靠。我们提出CorVer(语料库验证),一种轻量级、即插即用的过程奖励,用源自维基百科共现统计的语料库基础信号替代神经验证器。CorVer分配句子级信用,并通过简单对齐将其映射到令牌级优势,仅需一个0.5B的提取器和每个句子一次语料库查找。在跨越六个指令微调模型(3B至14B)和五个QA基准的30个(模型,基准)单元中,CorVer在每个单元上均优于原始基线,TriviaQA平均提升4.1个百分点。在其可行配置下的20个单元中,CorVer在18个单元上优于四个神经验证器基线,同时训练速度快4.8至8.4倍。

英文摘要

Applying reinforcement learning to improve factual accuracy in knowledge-intensive question answering faces a reward design dilemma. Response-level rewards provide only coarse supervision and cannot distinguish correct from incorrect statements within a reasoning trace. Sentence-level alternatives offer finer-grained feedback, but typically rely on NLI verifiers, LLM judges, or knowledge-verification pipelines that are expensive to deploy at RL scale and often unreliable for rare-entity facts, where accurate reward signals are especially important. We propose CorVer (Corpus Verify), a lightweight, plug-in-ready process reward that replaces neural verifiers with a corpus-grounded signal derived from Wikipedia co-occurrence statistics. CorVer assigns sentence-level credit and maps it to token-level advantages via a simple alignment, requiring only a 0.5B extractor and a single corpus lookup per sentence. Across 30 (model, benchmark) cells spanning six instruction-tuned models (3B to 14B) and five QA benchmarks, CorVer improves over the raw baseline for every cell, with an average TriviaQA gain of +4.1 pp. It also outperforms four neural-verifier baselines in 18 of 20 cells under their feasible configurations, while training 4.8 to 8.4x faster.
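The step from sentence-level credit to token-level advantages might look like the sketch below. The abstract says only that the mapping is "a simple alignment", so the mean-centering baseline and the token-to-sentence index list used here are assumptions, and the reward values are made up for illustration.

```python
from typing import List

def token_advantages(token_to_sentence: List[int], sentence_rewards: List[float]) -> List[float]:
    """Give every token the mean-centred reward of the sentence it belongs to."""
    baseline = sum(sentence_rewards) / len(sentence_rewards)
    return [sentence_rewards[s] - baseline for s in token_to_sentence]

# Five tokens: the first two belong to sentence 0, the last three to sentence 1.
token_to_sentence = [0, 0, 1, 1, 1]
# Hypothetical rewards: sentence 0 is supported by the corpus lookup, sentence 1 is not.
sentence_rewards = [1.0, 0.0]
advantages = token_advantages(token_to_sentence, sentence_rewards)
```

Tokens in the supported sentence receive a positive advantage and tokens in the unsupported one a negative advantage, which is the finer-grained signal a single response-level reward cannot provide.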

2605.29647 2026-05-29 cs.CV

MARTIAN: A Rendering Framework for Aerial Mars Imagery from HiRISE Orbital Data

MARTIAN:基于HiRISE轨道数据的火星空中影像渲染框架

Dario Pisanti, Georgios Georgakis

发表机构 * Space Robotics Research Group, SnT, University of Luxembourg(卢森堡大学空间机器人研究组) Jet Propulsion Laboratory, California Institute of Technology(加州理工学院喷气推进实验室)

AI总结 提出一个基于Blender的开源渲染框架MARTIAN,利用真实HiRISE轨道地图数据合成火星地形在不同光照和高度下的逼真空中视图,并生成精确姿态标注,以解决火星视觉导航训练数据稀缺问题。

详情
AI中文摘要

火星上的空中导航需要基于视觉的管道,这些管道必须对火星表面的多样光照条件和地形形态具有鲁棒性。训练和评估此类方法的一个关键瓶颈是缺乏大规模、带标注的空中数据集。我们提出了MARTIAN,一个基于Blender的开源渲染框架,它利用真实的HiRISE轨道地图产品,在可控光照条件和不同高度下合成火星地形的逼真空中视图。MARTIAN生成带有精确姿态标注的观测数据,直接解决了火星视觉导航训练数据稀缺的问题。该框架已通过其在基于地图的定位系统(用于Ingenuity和未来火星旋翼机)的并行工作中的部署得到验证,其中合成训练的深度图像匹配器已成功在真实火星图像上进行了评估。MARTIAN公开于:https://github.com/nasa-jpl/martian。

英文摘要

Aerial navigation on Mars requires vision-based pipelines that are robust to the diverse illumination conditions and terrain morphology of the Martian surface. A key bottleneck for training and evaluating such methods is the scarcity of large-scale, annotated aerial datasets. We present MARTIAN, an open-source Blender-based rendering framework that leverages real HiRISE orbital map products to synthesize realistic aerial views of the Martian terrain under controllable lighting conditions and at varying altitudes. MARTIAN generates observations with accurate pose annotations, directly addressing the scarcity of training data for vision-based navigation on Mars. The framework has been validated through its deployment in concurrent work on map-based localization systems for Ingenuity and future Mars rotorcraft, where synthetically trained deep image matchers were successfully evaluated on real Mars imagery. MARTIAN is publicly available at: https://github.com/nasa-jpl/martian.

2605.29645 2026-05-29 cs.LG cs.AI stat.ML

The Sample Complexity of Multiclass and Sparse Contextual Bandits

多类别和稀疏上下文赌博机的样本复杂度

Liad Erez, Fan Chen, Alon Cohen, Tomer Koren, Yishay Mansour, Shay Moran, Alexander Rakhlin

发表机构 * Tel Aviv University(特拉维夫大学) Massachusetts Institute of Technology(麻省理工学院) Google Research Tel Aviv(谷歌研究院特拉维夫) Technion—Israel Institute of Technology(以色列理工学院)

AI总结 针对随机i.i.d.上下文赌博机,提出基于决策估计系数和低方差探索的算法,在稀疏奖励下实现接近最优的样本复杂度,并匹配下界。

详情
AI中文摘要

我们研究随机i.i.d.设置下的上下文赌博机,其中学习器观察来自未知分布的上下文,从有限集合$A$中选择动作,并旨在基于赌博机反馈从给定类别中识别近似最优策略。受零一奖励的赌博机多类别分类启发,我们关注$s$-稀疏设置,其中对于每个上下文,奖励向量的$L_1$范数至多为$s \ll |A|$。我们的主要结果是设计算法,以高概率输出一个相对于策略类$Π$的$ε$-最优策略,使用$\tilde{O} ((s/ε^2 + |A|/ε)\log |Π|/δ)$个样本。我们将此界推广到一般Natarajan类,并补充了匹配的下界(对数因子内),从而弥合了先前工作(Erez等人,2024, 2025)留下的巨大差距,后者额外增加了$Θ(|A|^9)$依赖。我们通过两种互补方法获得这些结果。首先,我们从具有结构化观测的上下文决策角度分析上下文赌博机,设计了一种探索-优化算法,其样本复杂度由决策估计系数(DEC;Foster等人,2021, 2022)控制。我们证明,在$s$-稀疏奖励下,诱导的模型类具有随$s$缩放的尖锐DEC界,直接产生最优速率。由于这种方法主要是信息论性的,并涉及求解复杂的min-max优化问题,我们还开发了第二种更专门的算法方法,基于低方差探索技术。这种方法产生了具体、易处理的算法,并自然地扩展到上下文组合半赌博机,为赌博机多类别列表分类提供了改进的样本复杂度保证。

英文摘要

We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $A$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by bandit multiclass classification with zero-one rewards, we focus on the $s$-sparse setting in which, for every context, the reward vector has $L_1$-norm at most $s \ll |A|$. Our main result is the design of algorithms that, with high probability, output an $ε$-optimal policy compared to policy class $Π$ using $\tilde{O} ((s/ε^2 + |A|/ε)\log |Π|/δ)$ samples. We extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024, 2025), which incurred an additional $Θ(|A|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the decision-estimation coefficient (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.
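A short worked comparison may help in reading the bound. The sketch below interprets $\log |Π|/δ$ as $\log(|Π|/δ)$, which is an assumption about the intended grouping, and simply asks when each of the two terms dominates.

```latex
\[
\frac{s}{\varepsilon^{2}} \;\ge\; \frac{|A|}{\varepsilon}
\quad\Longleftrightarrow\quad
\varepsilon \;\le\; \frac{s}{|A|},
\]
\[
\text{so for } \varepsilon \le \frac{s}{|A|}:\qquad
\tilde{O}\!\left(\Big(\frac{s}{\varepsilon^{2}} + \frac{|A|}{\varepsilon}\Big)\log\frac{|\Pi|}{\delta}\right)
\;=\;
\tilde{O}\!\left(\frac{s}{\varepsilon^{2}}\,\log\frac{|\Pi|}{\delta}\right).
\]
```

In that high-accuracy regime the leading term depends on the sparsity $s$ rather than on the number of actions $|A|$, while for larger $ε$ the $|A|/ε$ term takes over.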

2605.29643 2026-05-29 cs.CV cs.MA

AgentCVR: Active Multi-Agent Cross-Video Reasoning via Script-Simulated Reinforcement Learning

AgentCVR:通过脚本模拟强化学习的主动多智能体跨视频推理

Yilun Qiu, Jiahe Wang, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Chun Yuan

发表机构 * Xiaohongshu Inc.(小红书公司) Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生学院,清华大学)

AI总结 提出AgentCVR多智能体框架,将跨视频推理视为主动证据获取任务,通过主智能体协调视觉和音频智能体进行定向证据提取,并引入脚本模拟强化学习优化策略,在跨视频对齐和定位任务上超越单次基线,达到与闭源系统相当的性能。

详情
AI中文摘要

跨视频推理(CVR)已成为多模态智能的关键前沿,要求模型检索、对齐和聚合分布在多个视频中的证据。当前的多模态大语言模型(MLLMs)往往难以应对CVR,因为简单的单次策略将多个视频编码到共享压缩上下文中,可能掩盖罕见但关键的证据。在本文中,我们提出AgentCVR,一个多智能体框架,将CVR视为主动证据获取任务。AgentCVR使用主智能体迭代协调专门的视觉和音频智能体进行定向证据提取。为确保高效训练,我们引入脚本模拟强化学习,利用LLM生成的语义脚本和轻量级文本模拟器优化智能体策略,在在线探索期间避免昂贵的多模态推理。在综合CVR基准上的实验结果表明,AgentCVR优于单次基线,并在复杂跨视频对齐和定位任务上达到与最先进闭源系统相当的性能。为确保可复现性,我们的代码可在https://github.com/wang-jh24/AgentCVR获取。

英文摘要

Cross-Video Reasoning (CVR) has emerged as a critical frontier in multimodal intelligence, requiring models to retrieve, align, and aggregate evidence distributed across multiple videos. Current Multimodal Large Language Models (MLLMs) often struggle with CVR, as simple single-pass strategies encode multiple videos into a shared compressed context, potentially obscuring rare but critical evidence. In this paper, we propose AgentCVR, a multi-agent framework that treats CVR as an active evidence-acquisition task. AgentCVR employs a Master Agent to iteratively coordinate specialized Visual and Audio Agents for targeted evidence extraction. To ensure efficient training, we introduce Script-Simulated RL, which optimizes the agent's policy with LLM-generated semantic scripts and a lightweight text-based simulator, bypassing costly multimodal inference during online exploration. Experimental results on a comprehensive CVR benchmark show that AgentCVR outperforms single-pass baselines and achieves comparable performance to state-of-the-art closed-source systems, particularly in complex cross-video alignment and localization. To ensure reproducibility, our code is available at https://github.com/wang-jh24/AgentCVR.

2605.29638 2026-05-29 cs.CL

Classification of non-analyzable word types in web documents to implement an effective Korean e-learning system

网络文档中不可分析词类型的分类以实现有效的韩语电子学习系统

Sang-Taek Park, Ae-Lim Ahn, Eric Laporte, Jee-Sun Nam

发表机构 * DICORA, Hankuk University of Foreign Studies, Korea(DICORA,韩国外国语大学,韩国) LIGM, Université Paris-Est, France(LIGM,巴黎东部大学,法国)

AI总结 通过构建正式与非正式语料库,比较其表达差异,并提出局部语法图(LGG)模型以有效处理非正式文本,用于韩语电子学习系统。

Journal ref Doing Research in Applied Linguistics, 2011, pp. 61-68

详情
AI中文摘要

电子学习系统应传递反映语言实际使用中各种现象的内容。除了正式韩语,包含网络文档、手机短信或推特帖子等真实世界韩语表达的电子学习系统将对高级学习者有用。我们构建了两种语料库:一种由在线新闻文章等正式文档组成;另一种由网络博客中关于新产品的客户评论等非正式文档组成。通过比较这些语料库,我们展示了这两种语料库中表达的差异。我们调查了非正式语料库的主要特征。鉴于文本中有很大比例是非正式的,我们提出局部语法图(LGG)作为在韩语电子学习系统中有效处理它们的合适模型。

英文摘要

E-learning systems should deliver contents that reflect various phenomena of the language as it is used. In addition to formal Korean, e-learning systems that would include real-world Korean expressions such as those in web documents, mobile text messages, or twitter posts, would be useful to high-level learners. We construct two types of corpora: one is made of formal documents like online news articles; the other is made of informal documents like customer reviews about new products in web blogs. By comparing these corpora, we show how expressions differ in these two types of corpora. We survey the main characteristics of the informal corpus. Given that a significant proportion of text is informal, we propose Local Grammar Graphs (LGG) as an appropriate model to treat them effectively in Korean e-learning systems.

2605.29637 2026-05-29 cs.CL

Evaluating Cross-lingual Knowledge Consistency in Code-Mixed vis-a-vis Indian Languages using IndicKLAR

评估混合语码与印度语言中的跨语言知识一致性:基于IndiKLAR

Debajyoti Mazumder, Divyansh Pathak, Prashant Kodali, Aditya Joshi, Akshay Agarwal, Jasabanta Patro

发表机构 * Indian Institute of Science Education and Research(印度科学教育与研究学院) Microsoft Corporation(微软公司) UNSW Sydney(新南威尔士大学悉尼分校)

AI总结 本文通过构建IndiKLAR基准,评估大语言模型在英语、混合语码和印度本土语言上的知识召回一致性,发现混合语码输入能显著缩小与英语的性能差距,并识别出从本土语言到混合语码的“翻转点”。

Comments 23 pages

详情
AI中文摘要

大型语言模型能可靠地回忆英语知识,但在低资源语言上对相同查询却常常失败——这种跨语言一致性差距在印度语言及其混合语码变体中尚未得到充分研究。为了研究这一差距,我们引入了IndiKLAR,这是KLAR-CLC基准的印度扩展,覆盖了22种印度官方语言中的18种,并为11种广泛使用的语言对配对了混合语码变体,且对这11种设置的单语和混合语码变体进行了母语验证。这种三方对齐提供了一个独特的机会来考察知识召回一致性如何随英语、混合语码和印度本土语言输入的变化而变化。在九个开放权重模型上的评估发现,本土语言与英语的准确率差距可达约0.50,而混合语码输入能缩小大部分差距——无需任何模型层面的干预即可使性能接近英语(差距约0.05)。受此启发,我们评估了几种在语言转换暴露方式上有所不同的提示策略,包括两阶段的翻译-回答设置、单阶段的联合翻译-回答提示,以及“翻译中思考”(TinT)——一种单步策略,模型内部转换输入并仅输出最终答案。在从本土语言到混合语码再到英语的性能轨迹中,我们识别出一个一致的翻转点——即错误与正确预测之间的边界——位于本土语言和混合语码设置之间。有趣的是,无论该轨迹是由输入表面形式还是由模型的内部转换过程诱导,这一现象都成立。

英文摘要

Large language models recall knowledge reliably in English but often fail on the same query posed in a lower-resourced language -- a crosslingual consistency gap that remains underexplored for Indian languages and their code-mixed counterparts. To study this gap, we introduce IndiKLAR, an Indic extension of the KLAR-CLC benchmark covering 18 of the 22 scheduled Indian languages and pairing them with code-mixed variants for 11 widely used language pairs, with native-speaker verification of both monolingual and code-mixed variants for these 11 settings. This three-way alignment offers a unique opportunity to examine how knowledge recall consistency varies across the spectrum of English, code-mixed, and native Indian language inputs. Evaluating across nine open-weight models, we find that the native-language accuracy gap to English can reach ~0.50, while code-mixed inputs close most of it -- bringing performance within ~0.05 of English without any model-level intervention. Motivated by this, we evaluate several prompting strategies that vary in how language conversion is exposed, including a two-stage translate-then-answer setup, a one-stage joint translation-and-answer prompt, and Translate-in-Thought (TinT) -- a single-step strategy in which the model converts the input internally and emits only the final answer. Across the performance trajectory native → code-mixed → English, we identify a consistent flip point -- the boundary between incorrect and correct prediction -- that lies between the native and code-mixed settings. Interestingly, this holds whether the trajectory is induced by the input surface form or by the model's internal conversion process.

2605.29634 2026-05-29 cs.LG

Relational Rank Geometry in Transformers: Detecting and Steering Hidden-State Relation Frames

Transformer中的关系秩几何:检测与引导隐藏状态关系框架

Mazen Kobrosly

发表机构 * Independent Researcher(独立研究者)

AI总结 本文通过Plücker符号熵检测Transformer隐藏状态中元组关系的秩索引几何,并在Llama系列模型上验证了关系探测的可控干预,实现了从关系探测到关系框架干预的受控桥梁。

Comments 32 pages, 9 figures

详情
AI中文摘要

Transformer隐藏状态通常通过局部或低阶对象解释:神经元、稀疏特征、注意力头、残差流方向或激活补丁。本文研究一个互补对象:元组间关系的秩索引几何。我使用Plücker符号熵来测试r元关系是否在隐藏状态空间中留下arity匹配的方向签名。在Llama系列8B、70B和405B检查点上,真实关系元组在预期秩k=r(r=3,...,6)处显示出比随机控制审计中打乱元组更强的方向签名一致性。多模板审计表明,这些效应在表面变化下仍然存在,所有测试的405B行保持正预期秩边际,8B/70B保持正行,但带有构造器特定的混合单元。然后我问相同的关系几何是否可以被引导。在一个边缘网格干净/损坏干预实验中,使用32个提示,行/列框架和答案格式保持不变,而YES/NO关系图发生变化,损坏的隐藏状态关系框架被修补为干净或安慰剂目标。在70B和405B中,干净目标的关系框架路径恢复了干净答案行为和残差关系几何,而仅质心和等范数控制显示出可忽略的恢复。位置/顺序控制进一步将标记点重要性从有序干净框架几何中分离:目标干净形状和跨提示干净形状在标记接口处恢复行为和残差几何,而损坏供体转移、同位置置换/反射、错误位置干净增量、仅质心运动和等范数噪声失败或远低于干净框架路径。结果是从关系探测到关系框架干预的受控桥梁:关系秩几何可以在Transformer隐藏状态中被检测、定位和行为验证。

英文摘要

Transformer hidden states are often interpreted through local or low-order objects: neurons, sparse features, attention heads, residual-stream directions, or activation patches. This paper studies a complementary object: the rank-indexed geometry of relations among token tuples. I use Plücker sign entropy to test whether r-argument relations leave arity-matched orientation signatures in hidden-state space. Across Llama-family 8B, 70B, and 405B checkpoints, true relation tuples show stronger orientation-sign consistency at the expected rank k=r for r=3,...,6 than scrambled tuples under matched random-control audits. Multi-template audits show that the effects survive surface variation, with all tested 405B rows retaining positive expected-rank margins and 8B/70B retaining positive rows with constructor-specific mixed cells. I then ask whether the same relation geometry can be steered. In an edge-grid clean/corrupt intervention assay over 32 prompts, the row/column scaffold and answer format stay fixed while the YES/NO relation map changes, and the corrupt hidden-state relation frame is patched toward clean or placebo targets. In 70B and 405B, clean-targeted relation-frame paths recover clean-answer behavior and residual relation geometry, while centroid-only and equal-norm controls show negligible recovery. Site/order controls further separate marker-site importance from ordered clean-frame geometry: target clean shape and cross-prompt clean shape recover behavior and residual geometry at the marker interface, whereas corrupt-donor transfer, same-site permutation/reflection, wrong-site clean deltas, centroid-only motion, and equal-norm noise fail or remain far below clean-frame paths. The result is a controlled bridge from relation probing to relation-frame intervention: relation rank geometry can be detected, targeted, and behaviorally validated in transformer hidden states.

2605.29631 2026-05-29 cs.CL cs.AI

Predicting Causal Effects from Natural Language Queries using Structured Representations

使用结构化表示从自然语言查询预测因果效应

Giuliano Martinelli, Piriyakorn Piriyatamwong, Abelardo Carlos Martinez Lorenzo, Jasmin Baier, Riccardo Orlando, Satvik Garg, Sharif Kazemi, Linxi Wang, Arianna Legovini, Samuel Fraiberger

发表机构 * The World Bank Group(世界银行集团) University of Oxford(牛津大学) New York University(纽约大学)

AI总结 针对从自然语言查询预测因果效应的问题,提出Query2Effect基准和两步框架,通过生成结构化表示再预测效应大小,微调使绝对误差降低27%-71%。

Comments 18 pages

详情
AI中文摘要

随机对照试验是医学和社会科学的基石,因为它们能够可靠地估计因果效应。然而,进行这些试验成本高昂且耗时,这激发了从现有实验证据预测因果效应的兴趣。大型语言模型(LLMs)的最新进展在知识密集型任务上表现出强大的性能,引发了一个问题:这些模型能否用于预测因果效应大小?为了研究这一点,我们引入了Query2Effect,这是一个新的大规模基准,包含超过72,000个与实验描述对齐的自然语言问题,通过改变查询在隐含性、抽象性和歧义性维度上的特异性,模拟现实的信息寻求场景。然后,我们提出了一个两步框架,首先生成查询的合成结构化表示,然后使用监督编码器模型预测效应大小。实验表明,微调在提高预测性能方面起着关键作用,与开箱即用的提示式LLMs相比,绝对误差降低了27%至71%,并且我们的两步框架有利于域外泛化,突显了将语义解释与数值效应估计分离的好处。

英文摘要

Randomized controlled trials are a cornerstone of medicine and the social sciences as they enable reliable estimates of causal effects. However, they are costly and time-consuming to conduct, motivating interest in predicting causal effects from existing experimental evidence. Recent advances in large language models (LLMs) have demonstrated strong performance on knowledge-intensive tasks, raising the question of whether these models can be used for forecasting causal effect sizes. To investigate this, we introduce Query2Effect, a new large-scale benchmark consisting of more than 72,000 natural language questions aligned with experiment descriptions, created to simulate realistic information-seeking scenarios by varying query specificity along dimensions of implicitness, abstraction, and ambiguity. We then propose a two-step framework that first generates a synthetic structured representation of a query before predicting effect size using a supervised encoder model. Experiments show that finetuning plays a crucial role in improving prediction performance, with absolute error falling by 27% to 71% compared to prompted out-of-the-box LLMs, and that our two-step framework is beneficial for out-of-domain generalization, highlighting the benefits of separating semantic interpretation from numerical effect estimation.

2605.29630 2026-05-29 cs.CL cs.AI cs.IR

Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory

实体碰撞:一种用于归因智能体记忆检索提升的分层协议

Youwang Deng

发表机构 * Independent Researcher(独立研究员)

AI总结 提出实体碰撞协议,通过控制实体重叠和标签分层,将BM25基线固定,从而将检索提升归因于嵌入器,并在多维度实验中揭示编码器容量并非唯一约束。

Comments 48 pages with appendix; 6-page body, mandatory Limitations, References, and 7 appendices. Code, benchmarks, and 37 reproduce scripts: https://github.com/youwangd/engram (see paper/REPRODUCIBILITY.md). Apache 2.0

详情
AI中文摘要

端到端的智能体记忆基准测试为每个检索器报告一个单一的hit@k指标,混淆了词汇泄漏(不受控制的查询/黄金/干扰实体重叠)与标签混合(偏好、服务、工具平均在一起)。我们提出实体碰撞,一种系统无关的协议,通过构造将BM25基线固定——每个干扰项共享答案的实体标记——并按判别器标签对查询进行分层,因此任何超过BM25的提升都可归因于嵌入器。应用于一个开源智能体记忆测试平台,涵盖5个标签×3个嵌入器×5个碰撞程度,并采用配对自助法95%置信区间,该协议揭示了一个双轴模式:256维哈希三元组仅在深度碰撞下的封闭词汇标签上有帮助;MiniLM-384在两个轴上均占优;而参数规模2.7倍的BGE-large并未在MiniLM上一致提升——它在意图式查询上胜出,但在词汇式查询上落败。编码器容量本身并非约束条件。合成意图标签的零假设在LongMemEval(n=500)上重现为单会话偏好回忆悬崖。LoCoMo上的自适应向量权重路由是一个测量的零假设:存在11.7个百分点的oracle空间,但我们测试的所有信号均未恢复。所有26个结果表和37个复现脚本均受版本控制并由公共注册表验证;该协议在一个确定性管理的记忆测试平台(事件溯源决策日志、DAG状态机模式生命周期)上执行,因此每个报告的置信区间都可以从输入流中逐字节复现。

英文摘要

End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). We propose entity-collision, a system-agnostic protocol that pins the BM25 floor by construction -- every distractor shares the answer's entity tokens -- and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open-source agent-memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired-bootstrap 95% CIs, the protocol reveals a two-axis pattern: a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and a 2.7x-parameter BGE-large does not uniformly improve on MiniLM -- it wins on intent-style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent-tag null replicates on LongMemEval (n=500) as a single-session-preference recall cliff. Adaptive vector-weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version-controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event-sourced decision log, DAG-state-machine schema lifecycle) so every reported CI is reproducible byte-for-byte from the ingest stream.
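The paired-bootstrap 95% CIs mentioned in this abstract can be illustrated with a generic percentile bootstrap over per-query differences. This is a textbook sketch, not the repository's script: the function name, the resample count, and the toy hit@k values are assumptions.

```python
import random
from typing import List, Tuple

def paired_bootstrap_ci(a: List[float], b: List[float],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile CI for mean(a - b), resampling the same query indices for both systems."""
    assert len(a) == len(b) and len(a) > 0
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi

# Toy per-query hit@k outcomes (1.0 = hit) for an embedder and for BM25.
embedder_hits = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
bm25_hits     = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
low, high = paired_bootstrap_ci(embedder_hits, bm25_hits)
```

Resampling differences rather than each system separately keeps the pairing by query, so query difficulty shared by both retrievers cancels out of the interval.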

2605.29629 2026-05-29 cs.AI

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

超越攻击成功率:LLM安全失效的时间对数可观测性

Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee

发表机构 * Chung-Ang University(Chung-Ang 大学)

AI总结 提出时间对数可观测性(TLO)方法,通过解码过程中的合规-拒绝边际将模型-攻击条件映射到校准的二维平面,揭示攻击成功的时间模式,并基于此设计早期停止规则将成功越狱减少一半以上。

详情
AI中文摘要

攻击成功率(ASR)在生成结束时用单个是/否标签评估每次越狱,告诉我们是否发生了失败,但未说明失败如何展开。产生同等有害输出的两次攻击可能遵循完全不同的路径,而ASR无法区分它们。我们仅从对数几率使这些隐藏路径变得可观测。时间对数可观测性(TLO)是一种无需训练的诊断方法,在解码过程中观察合规-拒绝边际,并将每个模型-攻击条件置于校准的二维平面上。通过设计,该平面在ASR信息量最小的情况下最具信息量:即在因真正不同原因而成功的攻击中。在四种对齐的LLM和三种越狱范式下,具有几乎相同ASR的攻击在平面上位于明显不同的点:同一模型可能通过不同的时间模式失败。在大多数条件下,几何形状与来自隐藏状态的拒绝方向探针匹配,但一个模型显示了固定词汇方法的局限性。从TLO导出的简单早期停止规则将成功的越狱减少一半以上,且对普通良性查询无误报。安全评估应报告失败发生的时间和方式,而不仅仅是是否发生。TLO仅从对数几率即可观测前两者。

英文摘要

Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it unfolded. Two attacks that produce equally harmful outputs may have followed completely different paths, and ASR cannot tell them apart. We make those hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that watches a compliance-refusal margin during decoding and places each model-attack condition on a calibrated 2D plane. By design, this plane is most informative exactly where ASR is least informative: among attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR land at clearly different points on the plane: the same model can fail through different temporal patterns. The geometry matches refusal-direction probes from hidden states on most conditions, with one model showing the limit of our fixed-lexicon approach. A simple early-stop rule derived from TLO cuts successful jailbreaks by more than half, without false alarms on plain benign queries. Safety evaluation should report when and how a failure unfolds, not only whether it occurred. TLO makes the first two observable from logits alone.
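One way to picture a compliance-refusal margin tracked over decoding steps is sketched below. The abstract does not define the margin, the lexicon, or the 2D plane, so everything here is an assumption: the two small word lists, the log-sum-exp difference used as the margin, and the hypothetical `first_flip` summary of when the margin turns positive.

```python
import math
from typing import Dict, List, Optional, Sequence

def logsumexp(values: Sequence[float]) -> float:
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def margin(step_logits: Dict[str, float], comply: Sequence[str], refuse: Sequence[str]) -> float:
    """Log-mass on compliance tokens minus log-mass on refusal tokens at one decoding step."""
    return logsumexp([step_logits[t] for t in comply]) - logsumexp([step_logits[t] for t in refuse])

def margin_trace(steps: Sequence[Dict[str, float]], comply: Sequence[str],
                 refuse: Sequence[str]) -> List[float]:
    return [margin(s, comply, refuse) for s in steps]

def first_flip(trace: Sequence[float]) -> Optional[int]:
    """Index of the first step whose margin is positive right after a non-positive one."""
    for i in range(1, len(trace)):
        if trace[i - 1] <= 0.0 < trace[i]:
            return i
    return None

COMPLY, REFUSE = ["Sure", "Here"], ["Sorry", "cannot"]   # hypothetical fixed lexicon
steps = [
    {"Sure": 0.0, "Here": 0.0, "Sorry": 2.0, "cannot": 2.0},
    {"Sure": 1.0, "Here": 1.0, "Sorry": 1.0, "cannot": 1.0},
    {"Sure": 3.0, "Here": 3.0, "Sorry": 0.0, "cannot": 0.0},
]
trace = margin_trace(steps, COMPLY, REFUSE)
flip_step = first_flip(trace)
```

Two generations that end in the same harmful output could still differ in where such a trace crosses zero, which is the kind of temporal distinction a single end-of-generation label discards.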

2605.29628 2026-05-29 cs.SD cs.AI cs.CL cs.LG eess.AS

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

COMET:音频-文本多模态对比嵌入中模态间隙的概念空间剖析

Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey(萨里大学视觉、语音和信号处理中心)

AI总结 提出COMET框架,通过PLS-SVD分解揭示CLAP模型中模态间隙主要由少数共享概念轴贡献,并基于谱截断方法无训练地缓解间隙,实现零样本音频字幕接近全监督性能。

详情
AI中文摘要

对比语言-音频预训练(CLAP)模型广泛用于音频理解,并在许多零样本应用中支持模态无关的条件交换。然而,其性能受到音频和文本嵌入之间模态间隙的严重影响。现有解释主要将此间隙归因于锥体效应,将其视为均值嵌入之间的偏移,但仅纠正均值只能带来有限的改进。其他假设,如信息不平衡和维度坍缩,也被提出,但仍未得到充分验证,并且在音频领域尚未被深入研究。同时,一些工作尝试将多模态对比嵌入分解为可解释的概念,但没有任何工作从概念分解的角度显式分析模态间隙。在这项工作中,我们引入了COMET(基于PLS-SVD变换的概念空间组织与模态间隙解释),这是一个新颖的用于CLAP的偏最小二乘奇异值分解(PLS-SVD)框架,揭示了模态间隙的更广泛视角。我们的框架揭示,只有一小部分可解释的轴(捕捉共享概念)对相似度计算有显著贡献,并且均值分量仅部分代表模态间隙。基于这一见解,我们提出了一种简单的谱截断方法,以无训练的方式缓解模态间隙。该方法使得零样本音频字幕通过条件交换接近全监督性能,无需大型辅助记忆库或昂贵计算。同时,它在保持检索和音频字幕任务强性能的同时,实现了显著的嵌入维度缩减。

英文摘要

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component only partially represents the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.
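For orientation, the standard PLS-SVD construction can be written as below. This is the textbook form, offered as a sketch: whether COMET centers the embeddings, how it normalizes them, and which axes its spectral truncation retains are not stated in the abstract, so the index set $\mathcal{K}$ and the cosine form of the similarity are assumptions.

```latex
\[
C \;=\; \frac{1}{n}\, X_a^{\top} X_t \;=\; U \Sigma V^{\top},
\qquad X_a,\, X_t \in \mathbb{R}^{n \times d},
\]
\[
z_a = U_{\mathcal{K}}^{\top} x_a, \qquad
z_t = V_{\mathcal{K}}^{\top} x_t, \qquad
\mathrm{sim}(x_a, x_t) \;\approx\; \frac{z_a^{\top} z_t}{\lVert z_a \rVert \, \lVert z_t \rVert}.
\]
```

Here $X_a$ and $X_t$ stack $n$ paired audio and text embeddings, the columns of $U$ and $V$ are paired axes ordered by singular value, and $U_{\mathcal{K}}$, $V_{\mathcal{K}}$ keep only the columns indexed by $\mathcal{K}$.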

2605.29626 2026-05-29 cs.CL cs.AI

DLM-SWAI: Steering Diffusion Language Models Before They Unmask

DLM-SWAI: 在扩散语言模型去掩码之前引导它们

Hyeseon An, Yo-Sub Han

发表机构 * Department of Computer Science(计算机科学系) Yonsei University(延世大学)

AI总结 提出一种无需训练的引导方法DLM-SWAI,通过预计算的词级风格分数在去噪步骤中偏置词分布,实现扩散语言模型的可控生成。

Comments preprint

详情
AI中文摘要

将语言模型生成引导至期望的文本属性对于实际部署至关重要,而推理时方法特别有吸引力,因为它们无需重新训练即可实现可控生成。最近的研究也强调了扩散语言模型作为一种新兴的生成范式,具有独特的解码特性。然而,大多数现有的引导方法要么依赖辅助模型,要么专为自回归下一个词解码设计,难以应用于通过部分掩码序列的迭代去噪生成文本的扩散语言模型(DLM)。因此,我们提出DLM-SWAI,一种简单的无需训练的引导方法,通过使用预计算的词级风格分数在每个去噪步骤偏置词分布。在风格和安全控制任务上的实验表明,DLM-SWAI有效引导扩散语言模型,同时保持生成质量并需要最小的计算开销。消融实验进一步揭示了引导强度与流畅性之间的可控权衡,我们的分析将类别可引导性与词级属性线索的强度联系起来。

英文摘要

Steering language model generation toward desired textual properties is essential for practical deployment, and inference-time methods are particularly appealing because they enable controllable generation without retraining. Recent work has also highlighted diffusion language models as an emerging generation paradigm with distinct decoding properties. However, most existing steering approaches either rely on auxiliary models or are designed for autoregressive next-token decoding, making them difficult to apply to diffusion language models (DLMs), which generate text through iterative denoising of partially masked sequences. Therefore, we propose DLM-SWAI, a simple training-free steering method that biases the token distribution at each denoising step using pre-computed token-level style scores. Experiments on style and safety control tasks show that DLM-SWAI effectively steers diffusion language models while preserving generation quality and requiring minimal computational overhead. Ablations further reveal a controllable trade-off between steering strength and fluency, and our analysis links class-wise steerability to the strength of token-level attribute cues.
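Biasing a token distribution with pre-computed scores could take a form like the sketch below. The additive form `logit + strength * score` and the toy vocabulary are assumptions; the abstract does not say how the scores are combined with the model's logits or how the strength is chosen.

```python
import math
from typing import Dict

def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def steer(logits: Dict[str, float], style_scores: Dict[str, float],
          strength: float) -> Dict[str, float]:
    """Add strength * score to each token's logit; tokens without a score are unchanged."""
    return {tok: v + strength * style_scores.get(tok, 0.0) for tok, v in logits.items()}

# Logits for one masked position at one denoising step (toy values).
logits = {"great": 1.0, "fine": 1.0, "awful": 1.0}
# Hypothetical pre-computed token-level scores for a "positive" style.
scores = {"great": 1.0, "awful": -1.0}

plain = softmax(logits)
steered = softmax(steer(logits, scores, strength=2.0))
```

In a diffusion decoder the same adjustment would be applied at every masked position on every denoising step, and `strength` is the knob behind the steering-versus-fluency trade-off the abstract reports.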

2605.29625 2026-05-29 cs.AI

Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models

基于大语言模型的多智能体框架改进协作故事讲述

Arturo Valdivia, Paolo Burelli

发表机构 * Data Science Section IT University of Copenhagen(哥本哈根信息技术大学数据科学部门) AI Section - brAIn lab IT University of Copenhagen(哥本哈根信息技术大学人工智能部门 - brAIn实验室)

AI总结 提出一种基于大语言模型的多智能体框架,通过迭代的写者-编辑者过程,在物理棋盘游戏中与儿童协作生成高质量故事。

详情
AI中文摘要

共同创作(即AI智能体与人类交互生成输出(如艺术))的话题近期获得了显著关注。然而,大多数研究关注数字环境中的成人-人类交互。本文探索了一种新颖的游戏式共同创作场景,涉及儿童和大语言模型(LLMs)通过物理棋盘游戏交互来创作书面故事。我们的目标是开发一个多智能体框架,能够生成适合年轻玩家的高质量叙事。我们方法的核心是一个迭代的写者-编辑者过程,其中一个LLM生成故事,另一个评估故事并提供改进反馈。通过涉及多个LLM的模拟研究,我们表明这种迭代交互在连续循环中持续提高了生成故事的感知质量。结果表明,在交互式故事讲述系统中,少量改进步骤可能足以实现高质量输出。

英文摘要

The topic of Co-creation, i.e., AI agents interacting with humans to generate outputs (e.g., art), has gained significant attention recently. However, most studies focus on adult-human interactions in a digital setting. This paper explores a novel ludic co-creation scenario involving children and Large Language Models (LLMs) interacting through a physical board game to create written stories. Our goal is to develop a multi-agent framework capable of producing high-quality narratives suitable for young players. At the core of our approach is an iterative Writer-Editor process in which one LLM generates stories while another evaluates them and provides feedback for refinement. Through a simulation study involving multiple LLMs, we show that this iterative interaction consistently improves the perceived quality of generated stories across successive loops. The results indicate that a small number of refinement steps may be sufficient to achieve high-quality outputs in interactive storytelling systems.
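The iterative Writer-Editor process can be sketched as a small loop. The two callables stand in for LLM calls, and their signatures, the score scale, the stopping threshold, and the loop limit are all assumptions for illustration; the toy writer and editor below are deterministic placeholders, not models.

```python
from typing import Callable, List, Tuple

Writer = Callable[[str, str, str], str]        # (prompt, previous draft, feedback) -> story
Editor = Callable[[str], Tuple[float, str]]    # story -> (quality score, feedback)

def refine(prompt: str, writer: Writer, editor: Editor,
           max_loops: int = 3, good_enough: float = 0.8) -> Tuple[str, List[float]]:
    """Alternate editor evaluation and writer revision until the score is high enough."""
    story = writer(prompt, "", "")
    scores: List[float] = []
    for _ in range(max_loops):
        score, feedback = editor(story)
        scores.append(score)
        if score >= good_enough:
            break
        story = writer(prompt, story, feedback)
    return story, scores

def toy_writer(prompt: str, draft: str, feedback: str) -> str:
    return prompt if not draft else draft + " (revised: " + feedback + ")"

def toy_editor(story: str) -> Tuple[float, str]:
    score = min(1.0, 0.5 + 0.25 * story.count("revised"))
    return score, "add more detail"

story, scores = refine("A dragon finds a key.", toy_writer, toy_editor)
```

Capping the loop at a small `max_loops` mirrors the abstract's observation that a few refinement steps may be enough, and it bounds latency for a child waiting at the board.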

2605.29622 2026-05-29 cs.LG physics.chem-ph

MōLe-Λ: Learning the Coupled-Cluster Response State for Energies, Gradients, and Properties

MōLe-Λ: 学习耦合簇响应态以获取能量、梯度和性质

Andreas Burger, Luca Thiede, Abdulrahman Aldossary, Jorge A. Campos-Gonzalez-Angulo, Alex Zook, Jérôme Florian Gonthier, Alán Aspuru-Guzik

发表机构 * University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(人工智能向量研究所) NVIDIA(英伟达) Canadian Institute for Advanced Research (CIFAR)(加拿大高等研究院)

AI总结 提出MōLe-Λ模型,通过联合学习左右手振幅预测耦合簇响应态,高效计算能量、梯度及多类分子性质。

Comments ICML 2026 AI4Physics

详情
AI中文摘要

耦合簇理论常被视为量子化学的金标准,但其高计算成本限制了准确能量、力和响应性质的常规获取。虽然右手$T$-振幅决定了相关波函数,但许多实际重要的可观测量还需要左手$Λ$-振幅。我们引入MōLe-$Λ$,它是分子轨道学习(MōLe)的扩展,通过从局域化的Hartree-Fock分子轨道联合学习右手振幅$(T_1,T_2)$和左手振幅$(Λ_1,Λ_2)$,预测完整的基态耦合簇单双激发(CCSD)响应态。在架构上,MōLe-$Λ$扩展了MōLe,增加了$Λ_1$和$Λ_2$读出模块,这些模块镜像了$T_1$和$T_2$头的对称性约束,同时保留了原始的等变轨道编码器、奇符号等变解码、局域性和大小广延性。所得模型能够提供准确的CC级能量和力,同时恢复偶极矩、四极矩、极化率、电子密度以及双电子可观测量如对密度。我们表明,MōLe-$Λ$进一步扩展了MōLe相对于完整CCSD的速度优势,同时大幅扩展了可访问的性质,为相关量子化学的波函数级替代模型提供了途径。

英文摘要

Coupled-cluster (CC) theory is often considered the gold standard of quantum chemistry, but its high computational cost limits routine access to accurate energies, forces and response properties. While the right-hand $T$-amplitudes determine the correlated wavefunction, many practically important observables additionally require the left-hand $Λ$-amplitudes. We introduce MōLe-$Λ$, an extension of Molecular Orbital Learning (MōLe) that predicts the full ground-state coupled-cluster singles and doubles (CCSD) response state by jointly learning right-hand amplitudes $(T_1,T_2)$ and left-hand amplitudes $(Λ_1,Λ_2)$ from localized Hartree-Fock molecular orbitals. Architecturally, MōLe-$Λ$ extends MōLe with $Λ_1$ and $Λ_2$ readouts that mirror the symmetry constraints of the $T_1$ and $T_2$ heads, while preserving the original equivariant orbital encoder, odd sign-equivariant decoding, locality and size-extensivity. The resulting model yields accurate CC-quality energies and forces, while simultaneously recovering dipoles, quadrupoles, polarizabilities, the electron density, and 2-electron observables such as the pair density. We show that MōLe-$Λ$ further extends the speed advantage of MōLe over full CCSD while substantially expanding the accessible properties, providing a route to wavefunction-level surrogate models for correlated quantum chemistry.
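
For readers unfamiliar with why the left-hand amplitudes matter, the standard coupled-cluster expressions are sketched here. The symbols $\Phi_0$ (Hartree-Fock reference determinant), $H$ (Hamiltonian), and $O$ (a generic property operator) are notation introduced for this note and do not appear in the abstract; the formulas are the textbook CCSD forms, not a statement of the paper's specific implementation.

$$E_{\mathrm{CCSD}} = \langle \Phi_0 | e^{-T} H e^{T} | \Phi_0 \rangle, \qquad \langle O \rangle = \langle \Phi_0 | (1 + Λ)\, e^{-T} O\, e^{T} | \Phi_0 \rangle, \qquad T = T_1 + T_2, \quad Λ = Λ_1 + Λ_2$$

The energy depends on the $T$-amplitudes alone, whereas expectation values such as dipoles and densities, as well as analytic gradients, also involve $Λ$.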

2605.29615 2026-05-29 cs.CV cs.CL

DiffSpot: Can VLMs Spot Fine-Grained Visual Differences in Web Interfaces?

DiffSpot:VLM能发现网页界面中的细微视觉差异吗?

Linhao Zhang, Aiwei Liu, Yuan Liu, Xiao Zhou

发表机构 * WeChat AI, Tencent Inc(腾讯公司)

AI总结 提出DiffSpot基准,通过CSS属性突变生成可控图像对,评估视觉语言模型在网页界面中检测细微视觉差异的能力,发现最佳模型仅识别40.7%的真实变化。

详情
AI中文摘要

视觉语言模型(VLM)在高层次图像-文本对齐方面取得了显著进展,但其感知细微视觉差异的能力仍然有限。我们在渲染的网页界面中研究这一问题,其中局部视觉变化既是对细粒度感知的诊断测试,也是GUI代理和设计工具的实际需求。我们引入了DiffSpot,一个用于网页界面开放式找不同的代码驱动基准。DiffSpot通过突变自包含HTML中目标元素的单个CSS属性,重新渲染页面,并记录变化的属性、元素和突变幅度,从而构建受控图像对。一个接地门控仅保留渲染像素差异局限于目标元素的图像对。该基准包含4,400对图像,包括3,900对有差异对(平衡分布在13个CSS属性操作符和三个难度级别上)以及500对无差异对用于幻觉控制。对13个前沿VLM进行零样本评估,我们发现即使最佳模型也只能识别$40.7\%$的真实变化,所有模型在困难级别的召回率低于$23\%$。DiffSpot进一步表明,难度强烈依赖于属性:在CSS操作符中,像素幅度和CLIP距离都不能可靠预测召回率。

英文摘要

Vision-language models (VLMs) have made strong progress on high-level image-text alignment, yet their ability to perceive subtle visual differences remains limited. We study this problem in rendered web interfaces, where localized visual changes are both a diagnostic test of fine-grained perception and a practical requirement for GUI agents and design tools. We introduce DiffSpot, a code-driven benchmark for open-ended spot-the-difference on web interfaces. DiffSpot constructs controlled image pairs by mutating a single CSS property of a target element in self-contained HTML, re-rendering the page, and recording the changed property, element, and mutation magnitude. A grounding gate retains only pairs whose rendered pixel difference is confined to the target element. The benchmark contains 4,400 pairs, including 3,900 has-diff pairs balanced across 13 CSS-property operators and three difficulty tiers, plus 500 no-diff pairs for hallucination control. Evaluating 13 frontier VLMs zero-shot, we find that even the best model identifies only $40.7\%$ of true changes, with Hard-tier Recall below $23\%$ for every model. DiffSpot further shows that difficulty is strongly property-dependent: across CSS operators, neither pixel magnitude nor CLIP distance reliably predicts Recall.
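
The grounding gate described above can be pictured with a small sketch. The abstract says only that pairs are kept when the rendered pixel difference is confined to the target element; the exact-equality pixel comparison, the rectangular bounding box, the requirement that at least one pixel changes, and the name `diff_confined_to_box` are all assumptions made for illustration.

```python
def diff_confined_to_box(before, after, box):
    """Return True if the two images differ and every differing pixel is inside box.

    before, after: equally sized 2D lists of pixel values (rows of columns).
    box: (left, top, right, bottom) of the target element, with right and
         bottom exclusive. Assumed here to be an axis-aligned rectangle.
    """
    left, top, right, bottom = box
    changed = False
    for y, (row_before, row_after) in enumerate(zip(before, after)):
        for x, (p_before, p_after) in enumerate(zip(row_before, row_after)):
            if p_before != p_after:
                if not (left <= x < right and top <= y < bottom):
                    return False  # change leaked outside the target element
                changed = True
    return changed


# Toy 4x4 grayscale renders (hypothetical data).
blank = [[0] * 4 for _ in range(4)]

inside = [row[:] for row in blank]
inside[1][1] = 255            # change within the target box

leaky = [row[:] for row in inside]
leaky[0][3] = 255             # extra change outside the target box

target_box = (1, 1, 3, 3)
keep_inside = diff_confined_to_box(blank, inside, target_box)
keep_leaky = diff_confined_to_box(blank, leaky, target_box)
keep_identical = diff_confined_to_box(blank, blank, target_box)
```

In this sketch a pair is rejected both when a change spills outside the box and when nothing changes at all; a real pipeline would likely need a tolerance for anti-aliasing noise, which is omitted here.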