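arXivDaily: a daily academic digest synced with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and related fields.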

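2605.28812 2026-05-28 cs.RO cs.AI cs.LG Version update

Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation

Jiahe Pan, Stelian Coros, Jitendra Malik, Toru Lin

Affiliations: ETH Zürich; UC Berkeley

AI summary: Proposes Center-of-Pressure (CoP), a physics-grounded tactile representation that, combined with sensor calibration based on differentiable dynamics, achieves zero-shot sim-to-real transfer on a multi-fingered hand and outperforms binary-contact and raw-taxel baselines on peg-in-hole insertion and ball balancing.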

Comments Project site: https://mpan31415.github.io/tactile_rep/

Abstract

A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offers a scalable alternative, but the simulation-reality gap prevents information-dense modalities like touch from being effectively used. Existing sim-to-real methods often mitigate this gap by simplifying tactile data into coarse low-dimensional features -- sacrificing the richness required for complex manipulation. In this work, we introduce Center-of-Pressure (CoP), an effective tactile representation grounded in physical principles that preserves dense contact information while maintaining robustness for sim-to-real transfer. To support this representation, we propose a sensor calibration scheme based on differentiable dynamics, enabling the estimation of taxel orientations without requiring ground-truth force measurements. We evaluate CoP on two blind, challenging contact-rich manipulation tasks: peg-in-hole insertion and ball balancing. Across both tasks, policies conditioned on CoP achieve zero-shot sim-to-real transfer on a multi-fingered hand, and outperform both coarse binary-contact and raw-taxel baselines. Analysis of learned policy states further suggests that CoP-conditioned policies encode task-relevant physical properties, such as object mass, as an emergent byproduct of control.
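The abstract does not spell out how CoP is computed. A minimal sketch, assuming the textbook definition of center of pressure as the force-weighted centroid of the taxel positions, could look like the following. The function name, the 3D tuple layout, and the clamping of negative readings are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def center_of_pressure(
    positions: Sequence[Vec3],
    forces: Sequence[float],
    eps: float = 1e-9,
) -> Optional[Vec3]:
    """Force-weighted centroid of taxel positions (assumed CoP definition).

    Returns None when the total normal force is below eps, i.e. no contact.
    """
    if len(positions) != len(forces):
        raise ValueError("positions and forces must have the same length")
    # Assumption: negative readings are sensor noise, so clamp them to zero.
    clipped = [max(f, 0.0) for f in forces]
    total = sum(clipped)
    if total < eps:
        return None
    return tuple(
        sum(f * p[k] for f, p in zip(clipped, positions)) / total
        for k in range(3)
    )


if __name__ == "__main__":
    taxels = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    print(center_of_pressure(taxels, [1.0, 3.0]))
```

Unlike a binary contact flag, this quantity moves continuously as the load shifts across the pad, which is the kind of dense information the abstract says a binary signal discards.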


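2605.28807 2026-05-28 cs.AI Version update

Skill-Conditioned Gated Self-Distillation for Large Language Model Reasoning

Jiazhen Huang, Xiao Chen, Xiao Luo, Yong Dai, Senkang Hu, Yuzhi Zhao

Affiliations: Tsinghua University; Fudan University; City University of Hong Kong; Huazhong University of Science and Technology; University of Wisconsin-Madison

AI summary: Proposes Skill-Conditioned Gated Self-Distillation (SGSD), which retrieves skill-mistake pairs from an experience-derived skill bank to build a multi-teacher pool, uses the verifier to validate each teacher's polarity, and distills informative teacher-student disagreements through a robust gated objective, improving mathematical reasoning under a weaker privileged-information assumption.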

Abstract

On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.


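2605.28787 2026-05-28 cs.IR cs.AI Version update

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Shiyu Chen, Tarfah Alrashed, Alon Halevy, Natasha Noy

Affiliations: Google

AI summary: Compares a Baseline Agent (searching the open web) with a Semantic Agent (leveraging schema.org metadata) on data retrieval and finds that semantic metadata yields higher precision when retrieving actionable data (65.7% higher overall precision), while the Baseline Agent covers more questions but suffers "Last-Mile Utility" failures.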

Abstract

In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using schema.org. We deploy an "LLM-as-a-judge" evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers "Last-Mile Utility" failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.
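To make "machine-readable downloads" concrete, here is a small sketch of what an agent might check in a schema.org `Dataset` record. The properties `distribution`, `DataDownload`, `contentUrl`, and `encodingFormat` are part of the schema.org vocabulary; the record contents, the example.org URL, and the helper name `machine_readable_downloads` are made up for illustration and are not taken from the paper.

```python
import json

# Hypothetical JSON-LD record of the kind a dataset page might embed.
RECORD = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example rainfall measurements",
    "description": "Daily rainfall totals for an illustrative region.",
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/rainfall.csv",
        }
    ],
}


def machine_readable_downloads(dataset: dict) -> list:
    """Return the contentUrl of every DataDownload listed for the dataset."""
    distribution = dataset.get("distribution", [])
    # JSON-LD allows a single object where a list is expected.
    if isinstance(distribution, dict):
        distribution = [distribution]
    return [
        item["contentUrl"]
        for item in distribution
        if isinstance(item, dict)
        and item.get("@type") == "DataDownload"
        and item.get("contentUrl")
    ]


if __name__ == "__main__":
    print(json.dumps(machine_readable_downloads(RECORD), indent=2))
```

A page without such a record gives the agent nothing to check this way, which is one plausible reading of the "Last-Mile Utility" failures the abstract describes.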


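2605.28775 2026-05-28 cs.LG cs.AI cs.CL Version update

Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Suji Kim, Kangsan Kim, Sung Ju Hwang

Affiliations: KAIST; Samsung Electronics

AI summary: Proposes LearnWeak, a framework that uses a stronger reference agent to identify a student agent's weaknesses in the target domain, automatically synthesizes targeted tasks and supervision, and introduces an error-aware specialization objective, improving small computer-use agents across eight OSWorld domains.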

Abstract

Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.


2605.28773 2026-05-28 cs.CL cs.AI cs.LG cs.MA cs.MM Version update

Rethinking Memory as Continuously Evolving Connectivity

Jizhan Fang, Buqiang Xu, Zhixian Wang, Haoliang Cao, Xinle Deng, Baohua Dong, Hangcheng Zhu, Ruohui Huang, Gang Yu, Ying Wei, Guozhou Zheng, Feiyu Xiong, Haofen Wang, Huajun Chen, Ningyu Zhang

Affiliations: Zhejiang University; Alibaba Group; MemTensor; Tongji University

AI summary: Proposes the FluxMem framework, which models memory as a heterogeneous graph and dynamically evolves its topology through three stages (initial connection formation, feedback-driven refinement, and long-term consolidation) to address the brittleness of existing memory-augmented LLM agents in dynamic environments.

Comments Ongoing work

Abstract

Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where feedback, task variation, and heterogeneous signals continuously reshape what should be remembered and how it should be connected. To address this, we propose FluxMem, a connectivity-evolving memory framework that models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. During execution, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, guided by one metric for memory generalizability and evolutionary maturity. Across three fundamentally distinct benchmarks including LoCoMo, Mind2Web, and GAIA, FluxMem achieves consistent state-of-the-art performance, demonstrating strong adaptation and generalization in complex agentic environments. The code will be open-sourced in https://github.com/zjunlp/LightMem.


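2605.28764 2026-05-28 cs.AI cs.DC cs.MA Version update

Reverse Probing: Supervised Token-Level Uncertainty Quantification for Large Language Models in Clinical Text

Bushi Xiao, Sarvesh Soni, Daisy Zhe Wang

Affiliations: University of Florida; U.S. National Library of Medicine

AI summary: Proposes Reverse Probing, a framework that uses pre-existing labeled summaries to extract token-level uncertainty signals from the model's internal activations, enabling efficient and interpretable uncertainty quantification in clinical text.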

Abstract

As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the token or span level in long clinical text. We propose Reverse Probing, the first UQ framework specialized for clinical summarization, which estimates token-level uncertainty directly from pre-existing labeled summaries. Rather than sampling new outputs, Reverse Probing treats the text as a probe into the model's internal state, extracting uncertainty signals from four categories of internal activations. We evaluate on two expert-annotated clinical datasets and outperform eight adapted baselines on all metrics, achieving up to 4 times higher AUPRC while reducing inference time and computational costs. Feature analysis reveals that delta energy and neighborhood context are the most consistent predictors across all models. This study offers interpretable insights into how models internally respond to unsupported clinical content.


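2605.28739 2026-05-28 cs.LG cs.AI cs.NE q-bio.QM Version update

BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks

Tirtharaj Dash

Affiliations: BITS Pilani, K K Birla Goa Campus

AI summary: Proposes BIRDNet, which mines Boolean implication relationships between features and encodes them as a sparse, interpretable neural network, sharply reducing parameter count while maintaining high accuracy and recovering known biological signatures in transcriptomic and proteomic data.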

Comments 5 pages; 1 figure, 4 tables

Abstract

Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse-exception binomial test. The mined implications form a typed directed graph, equivalent to a propositional rule base of 2-literal clauses. We encode this graph as the connectivity of a layered neural network, called BIRDNet, in which each hidden unit corresponds to one mined rule and binds only to its two features. We show two consequences of this design: First, the architecture is sparse by construction: at most $2/d$ of the weights in each BIR layer are active, where $d$ is the input dimension. Second, the model is interpretable: every trained unit keeps a stable symbolic identity, so rules can be read off the network without surrogate models. Unlike most neurosymbolic models, BIRDNet does not consume an external rule base; its structural prior is mined from the data. We evaluate BIRDNet on six transcriptomic and proteomic benchmarks. Our results show that BIRDNet stays within 0.02 AUROC of the strongest dense baseline, at a small accuracy cost, while using up to $96\times$ fewer active parameters than an architecture-matched dense MLP. First-layer rules recover known biological signatures across multiple cancer subtypes and tissue types, including canonical amplicons, lineage-defining co-expression modules, and immune-infiltration markers. Data and code are available at: https://github.com/MAHI-Group/BIRDNet.
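The $2/d$ sparsity bound follows from simple counting: if each hidden unit binds only to the two features of its rule, a layer with $m$ units over $d$ inputs has at most $2m$ active weights out of $m \cdot d$. The sketch below builds such a connectivity mask and checks the fraction. The rule list, the function names, and the dense list-of-lists mask are illustrative assumptions rather than the released BIRDNet code.

```python
from typing import List, Sequence, Tuple


def bir_layer_mask(rules: Sequence[Tuple[int, int]], d: int) -> List[List[int]]:
    """Build an (m x d) 0/1 mask with one row per mined rule.

    Each rule is a hypothetical (feature_i, feature_j) index pair; the row
    for that rule is nonzero only at those two input features.
    """
    mask = [[0] * d for _ in rules]
    for unit, (i, j) in enumerate(rules):
        mask[unit][i] = 1
        mask[unit][j] = 1
    return mask


def active_fraction(mask: List[List[int]]) -> float:
    """Fraction of weights in the layer that the mask leaves active."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


if __name__ == "__main__":
    rules = [(0, 1), (2, 3), (0, 4)]  # made-up feature pairs
    mask = bir_layer_mask(rules, d=10)
    print(active_fraction(mask), 2 / 10)
```

With distinct feature pairs the fraction equals $2/d$ exactly, regardless of how many rules are mined, so the layer becomes relatively sparser as the input dimension grows.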


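2605.28733 2026-05-28 cs.AI Version update

Utility-Aware Multimodal Contrastive Learning for Product Image Generation

Xiaohang Feng, Yiling Xie

Affiliations: City University of Hong Kong

AI summary: Proposes a utility-aware multimodal contrastive learning framework that optimizes product image generation through a Utility-Aware InfoNCE loss, so that images stay semantically aligned while increasing market demand.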

Abstract

Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative AI can output images that closely align with text prompts. Yet existing generative AI models do not directly optimize marketplace performance. This is a critical gap, since semantic alignment alone does not guarantee that an image will sell. To address this limitation, we propose a utility-aware multimodal contrastive learning framework that incorporates consumer demand into a novel Utility-Aware InfoNCE loss. Optimizing this utility-aware objective guides generation toward images that are both semantically coherent and demand-enhancing. This effect arises directly from a shift in the learned image-text representation space toward demand-driven visual cues, which we also validate through the theoretical bound of the proposed objective. In downstream applications on Amazon and Airbnb, product images generated and edited by our method outperform state-of-the-art models in increasing demand and preserving fidelity, while maintaining text-image consistency. Notably, our utility-aware framework preserves inverse U-shaped demand patterns for attributes such as aesthetics and uniqueness, improving demand-based performance while preserving fidelity and semantic consistency. Human-subject experiments further validate its commercial effectiveness. As generative AI technology continues to evolve, our utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.
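For readers unfamiliar with the objective being modified, the sketch below implements the standard InfoNCE loss over a batch of image-text similarity scores. It does not reproduce the paper's Utility-Aware variant, whose exact form the abstract does not give. The function name, the similarity-matrix input, and the default temperature of 0.07 are illustrative assumptions.

```python
import math
from typing import Sequence


def info_nce(sim: Sequence[Sequence[float]], temperature: float = 0.07) -> float:
    """Standard InfoNCE loss for an n x n similarity matrix.

    sim[i][j] is the similarity between image i and text j; the matching
    pair for image i is text i (the diagonal).
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        loss += log_denominator - logits[i]
    return loss / n


if __name__ == "__main__":
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce(aligned), info_nce(shuffled))
```

The loss is small when each image scores highest against its own text and large otherwise. A demand-aware variant would presumably change which pairs are rewarded or how strongly, but that detail is left to the paper.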


2605.28732 2026-05-28 cs.CL cs.AI cs.LG Version update

MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems

Xinle Deng, Ruobin Zhong, Hujin Peng, Xiaoben Lu, Yanzhe Wu, Guang Li, Buqiang Xu, Yunzhi Yao, Jizhan Fang, Haoliang Cao, Junjie Guo, Yuan Yuan, Ziqing Ma, Yuanqiang Yu, Rui Hu, Baohua Dong, Hangcheng Zhu, Ningyu Zhang

Affiliations: Zhejiang University; Alibaba Group

AI summary: Proposes the MemTrace framework, which builds executable memory evolution graphs for fine-grained error tracing, uses an automatic attribution method to locate root causes, and then optimizes prompts to improve downstream task performance.

Comments Ongoing work

Abstract

Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.


2605.28730 2026-05-28 cs.AI Version update

AlphaTransit: Learning to Design City-scale Transit Routes

Bibek Poudel, Sai Swaminathan, Weizi Li

Affiliations: Department of EECS, University of Tennessee, Knoxville, TN, USA; Department of CSE, University of California, Riverside, CA, USA

AI summary: To address delayed feedback in transit route design, proposes AlphaTransit, a framework that couples Monte Carlo Tree Search with a neural policy-value network and attains the highest service rate on the Bloomington benchmark.

Abstract

Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assembled. This delayed-feedback challenge lies at the heart of the Transit Route Network Design Problem (TRNDP), where route interactions can be deceptive: an extension that appears useful locally can create transfer bottlenecks, produce redundant overlap, or reduce overall throughput. To guide route construction under delayed simulator feedback, we introduce AlphaTransit, a search-based planning framework for city-scale bus network design. AlphaTransit couples Monte Carlo Tree Search (MCTS) with a neural policy-value network: the policy proposes route extensions, the value estimates downstream design quality, and search uses these predictions to refine each decision. This provides decision-time lookahead during route construction without running simulator rollouts inside the search tree. We evaluate AlphaTransit on a new Bloomington TRNDP benchmark with realistic road topology and census-derived demand, under mixed and full transit demand settings. In the Bloomington network, AlphaTransit attains the highest service rate in both demand settings, reaching 54.6% and 82.1%, respectively. Relative to reinforcement learning without search, these correspond to 9.9% and 11.4% service rate gains; relative to MCTS without learned guidance, they correspond to 2.5% and 11.2% gains. These results suggest that coupling learned guidance with MCTS is more effective than using either approach alone for transit network design. Our code and data are publicly available at https://github.com/poudel-bibek/AlphaTransit.
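The abstract does not state which selection rule the search uses. A common choice when MCTS is paired with a policy-value network is the AlphaZero-style PUCT rule, sketched below on the assumption that AlphaTransit does something similar. The `Edge` class, the `select_extension` helper, the `c_puct` default, and the route-extension labels are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class Edge:
    """Search statistics for one candidate route extension (hypothetical)."""
    prior: float            # policy-network probability for this extension
    visits: int = 0         # number of times search has tried it
    value_sum: float = 0.0  # sum of value-network estimates backed up through it

    @property
    def q(self) -> float:
        """Mean backed-up value; zero for an unvisited edge."""
        return self.value_sum / self.visits if self.visits else 0.0


def select_extension(edges: Dict[str, Edge], c_puct: float = 1.5) -> str:
    """Pick the extension maximizing Q + U (assumed PUCT rule)."""
    total_visits = sum(edge.visits for edge in edges.values())

    def score(edge: Edge) -> float:
        # The +1 under the root keeps priors active before any visits.
        exploration = c_puct * edge.prior * math.sqrt(total_visits + 1) / (1 + edge.visits)
        return edge.q + exploration

    return max(edges, key=lambda name: score(edges[name]))


if __name__ == "__main__":
    candidates = {"extend_north": Edge(prior=0.7), "extend_east": Edge(prior=0.3)}
    print(select_extension(candidates))
```

In this scheme leaf positions are scored by the value network rather than by simulation, which matches the abstract's point that no simulator rollouts run inside the search tree.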


2605.28722 2026-05-28 cs.AI Version update

Multi-Adapter Representation Interventions via Energy Calibration

Manjiang Yu, Hongji Li, Junwei Chen, Xue Li, Priyanka Singh, Yang Cao, Lijie Hu

Affiliations: The University of Queensland, Brisbane, Australia; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates; Institute of Science Tokyo, Tokyo, Japan

AI summary: Proposes MARI, which uses a competitive multi-adapter mechanism and an energy-based gating module to adaptively determine intervention direction and strength, improving alignment performance while preserving general capabilities.

Comments Accepted by ICML 2026

Abstract

Representation intervention has emerged as a promising paradigm for aligning large language models toward desired behaviors without modifying model weights. Existing methods typically apply a fixed intervention uniformly across all inputs. However, we find that the appropriate intervention direction and strength vary substantially across samples, and such indiscriminate intervention leads to degradation of general capabilities on benign inputs. To address these challenges, we propose Multi-Adapter Representation Interventions via Energy Calibration (MARI). Specifically, we introduce a competitive multi-adapter mechanism in which specialized experts capture non-linear correction patterns and adaptively determine the appropriate intervention direction and strength for different samples. Furthermore, we design an energy-based gating module that leverages internal propagation dynamics to distinguish inputs that are applicable for intervention. Extensive experiments across diverse model families and parameter scales demonstrate that MARI achieves state-of-the-art alignment performance. Our method significantly improves performance on TruthfulQA, BBQ, and safety benchmarks, while maintaining and even improving general capabilities on tasks such as MMLU and ARC. Our code is available at https://github.com/V1centNevwake/MARI.


2605.28721 2026-05-28 cs.AI Version update

LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

HuiMing Fan, Xiao Wang, Zheng Chu, Qianyu Wang, Zhuoyao Wang, Ming Liu, Bing Qin, XingYu

Affiliations: Harbin Institute of Technology; Xiaohongshu

AI summary: Using diagnostic analyses, finds that LLM-based search agents exhibit Intrinsic Knowledge Dependence (IKD), relying on the model's internal knowledge rather than external evidence, and introduces the LiveBrowseComp benchmark to evaluate deep-search ability beyond intrinsic knowledge coverage.

Abstract

Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.


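2605.28717 2026-05-28 cs.AI cs.AR cs.NI Version update

Towards Reliable Multilingual LLM-as-a-Judge: An Empirical Study

Irune Zubiaga, Aitor Soroa, Rodrigo Agerri

Affiliations: HiTZ Center - Ixa, University of the Basque Country EHU

AI summary: Analyzes strategies such as instruction translation, monolingual versus multilingual supervision, and model size for developing multilingual LLM judges with and without in-domain data, revealing key trade-offs: fine-tuned smaller models can match proprietary models when in-domain data is available, while zero-shot evaluation with larger models is more effective out of domain.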

Abstract

Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and low-resource languages, considering instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: When in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.


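2605.28707 2026-05-28 cs.AI cs.LG Version update

AI in the Workplace: The Impact of Artificial Intelligence on Perceived Job Decency and Meaningfulness

Kuntal Ghosh, Marc Hassenzahl, Shadan Sadeghian

Affiliations: University of Siegen

AI summary: Through interviews with 24 employees from the IT, service, and healthcare sectors, examines the perceived impact of AI on job satisfaction and finds that expected changes in job decency and meaningfulness differ by occupational domain, shaping overall satisfaction.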

Comments Accepted to CSCW 2026 / Proceedings of the ACM on Human-Computer Interaction (PACMHCI)

DOI: 10.1145/3816896

Abstract

The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaboration at work often prioritizes performance, less is known about their experiential outcomes. Through interviews with 24 employees across Information Technology (IT), service-based, and healthcare sectors, this paper examines AI's impact on job satisfaction via perceptions of job decency and meaningfulness, now and in the future. Our results reveal that the anticipated impact of AI on overall job satisfaction varies with the occupational domain, with differing perceptions of its underlying decency and meaningfulness. For instance, IT and healthcare anticipate increased satisfaction with decency aspects like working hours but decreased satisfaction with meaningfulness aspects like social image due to misconceptions about AI handling most of their tasks. Conversely, service workers foresee no improvement in their working hours but a higher social standing due to the perceived status boost associated with working with AI.

2605.28678 2026-05-28 cs.AI 版本更新

DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

DREAM-R: 基于强化学习的精炼草稿、精确验证与完全并行执行的多模态推测推理

Yunhai Hu, Zining Liu, Xiangyang Yin, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang

发表机构 * New York University（纽约大学）； University of Pennsylvania（宾夕法尼亚大学）

AI总结提出DREAM-R框架，通过强化学习优化草稿生成、阈值验证机制和完全并行执行，加速多模态模型的推理密集型任务，同时保持准确性。

AI中文摘要

推测推理最近被提出作为加速大型多模态模型中推理密集型生成的一种手段，但其有效性常受限于推测草稿与目标验证推理之间的不匹配。在本工作中，我们引入了DREAM-R，一个显著提升推测推理性能的框架。其核心是采用推测对齐策略优化（SAPO），这是一种强化学习目标，训练草稿模型生成既忠实于目标轨迹又简洁的推理步骤。我们进一步提出基于阈值的验证机制（TBVM），使用基于比率的标准，仅在正面证据明显占优时稳定且可解释地接受推测步骤，从而防止错误传播。基于这些组件，我们开发了完全并行推测推理（FPSR）框架，该框架将草稿生成、目标侧推理和验证并行化到多步推理中，支持提前停止和干净回退。在推理密集型基准上的实验表明，在保持目标模型准确性的同时，实现了高达[具体加速比]的加速，在不牺牲推理质量的情况下带来了显著的效率提升。

英文摘要

Speculative reasoning has recently been proposed as a means to accelerate reasoning-intensive generation in large multimodal models, but its effectiveness is often constrained by misalignment between speculative drafts and target-verified reasoning. In this work, we introduce DREAM-R, a framework that substantially improves the performance of speculative reasoning. At its core, DREAM-R employs Speculative Alignment Policy Optimization (SAPO), a reinforcement-learning objective that trains draft models to generate reasoning steps that are both faithful to target trajectories and concise. We further propose a Threshold-based Verification Mechanism (TBVM) that uses a ratio-based criterion to provide stable and interpretable acceptance of speculative steps only when positive evidence clearly dominates, thereby preventing error propagation. Building on these components, we develop a Fully Parallel Speculative Reasoning (FPSR) framework that parallelizes draft generation, target-side reasoning, and verification across multi-step reasoning, enabling early stopping and clean fallback. Experiments on reasoning-heavy benchmarks demonstrate up to speedup while preserving target-model accuracy, yielding substantial efficiency gains without compromising reasoning quality.
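
The ratio-based acceptance rule described for TBVM can be illustrated with a minimal sketch. The abstract does not say how positive and negative evidence are scored, so the `pos_score`/`neg_score` inputs, the threshold value, and both function names below are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple


def accept_step(pos_score: float, neg_score: float,
                ratio_threshold: float = 3.0, eps: float = 1e-9) -> bool:
    """Accept a drafted step only when positive evidence clearly dominates.

    pos_score / neg_score are assumed, non-negative evidence scores
    (for example verifier probabilities); they are illustrative only.
    """
    if pos_score < 0 or neg_score < 0:
        raise ValueError("evidence scores must be non-negative")
    # eps keeps the ratio finite when there is no negative evidence.
    return pos_score / (neg_score + eps) >= ratio_threshold


def count_accepted_steps(step_scores: Iterable[Tuple[float, float]],
                         ratio_threshold: float = 3.0) -> int:
    """Count the leading drafted steps that pass; stop at the first rejection
    so a doubtful step cannot propagate into later ones."""
    accepted = 0
    for pos_score, neg_score in step_scores:
        if not accept_step(pos_score, neg_score, ratio_threshold):
            break
        accepted += 1
    return accepted
```

Stopping at the first rejected step is one simple way to model the fallback to the target model; the actual framework may handle rejection differently.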

2605.28669 2026-05-28 cs.CL cs.AI 版本更新

Sense Representations Are Inducible Interfaces

Jan Christian Blaise Cruz, Alham Fikri Aji

发表机构 * MBZUAI（穆罕默德·本·扎耶德人工智能大学）

AI总结提出ACROS方法，通过门控残差加法在冻结的预训练解码器LM中诱导显式词义通路，实现零样本词义消歧、低KL词义引导和跨语言适应，保持基础LM质量。

Comments https://github.com/jcblaisecruz02/acros

2605.28666 2026-05-28 cs.AI 版本更新

An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning

基于LLM的直观灵活能力规划辅助系统

Luis Miguel Vieira da Silva, Nicolas König, Felix Gehlhoff

AI总结提出一种混合辅助系统，将基于能力的形式化SMT规划与LLM自然语言交互层结合，通过人机协同实现规划解释与知识模型自适应，提升工业自动化中能力规划的可访问性和灵活性。

AI中文摘要

在现代工业中，动态环境以及模块化和可重构资源的复杂性要求对过程序列进行自动化规划。基于能力的规划方法通过从以机器可解释形式描述资源功能的语义知识模型自动生成计划来解决这一问题。然而，其实际应用仍然有限：求解器反馈（特别是在不可满足情况下）难以解释，并且知识模型需要随着操作条件变化或请求变得不可行而进行调整。本文提出一种混合辅助系统，通过基于大语言模型（LLM）的自然语言交互、解释和适应层，增强现有的基于能力的可满足性模理论（SMT）规划方法。形式化规划的正确性仍由符号规划器保证，而LLM层在明确的人机协同（HitL）批准下处理自然语言访问和灵活的知识模型适应。该系统分解为四个组件：能力基础化、符号规划、结果解释和规划适应，实现为路由代理工作流，其中中央路由器将任务委派给五个专门代理。该系统在模块化生产系统上针对四种场景类型进行了评估。在23个测试案例中，10个知识查询中的9个和所有4个可满足规划案例均被正确处理，4个不可满足案例中的3个产生了具体的修复建议，所有5个自适应规划场景通过迭代的、用户批准的知识模型修改最终生成了可满足计划。研究结果证实，将形式化规划与基于LLM的辅助相结合，显著提高了工业自动化的可访问性和适应性。

英文摘要

In modern industry, dynamic environments and the complexity of modular and reconfigurable resources require automated planning of process sequences. Capability-based planning approaches address this by automatically generating plans from semantic knowledge models that describe resource functions in a machine-interpretable form. Their practical use, however, remains limited: solver feedback, especially in the case of unsatisfiability, is difficult to interpret, and the knowledge models require adaptation as operational conditions change or requests become infeasible. This paper presents a hybrid assistance system that augments an existing capability-based Satisfiability Modulo Theories (SMT) planning approach with a Large Language Model (LLM)-based layer for natural-language interaction, explanation, and adaptation. Formal planning correctness remains with the symbolic planner, while the LLM layer handles natural-language access and flexible knowledge model adaptation under explicit Human-in-the-Loop (HitL) approval. The system decomposes into four components: Capability Grounding, Symbolic Planning, Result Interpretation, and Planning Adaptation, realized as a routed agentic workflow in which a central router delegates to five specialized agents. The system is evaluated on a modular production system across four scenario types. Of 23 test cases, 9 of 10 knowledge queries and all 4 satisfiable planning cases were handled correctly, 3 of 4 unsatisfiable cases produced concrete repair proposals, and all 5 adaptive planning scenarios resolved into satisfiable plans through iterative, user-approved knowledge model modifications. The findings confirm that combining formal planning with LLM-based assistance substantially improves accessibility and adaptability in industrial automation.

2605.28655 2026-05-28 cs.AI 版本更新

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

AutoScientists: 用于长期科学实验的自组织智能体团队

Shanghua Gao, Ada Fang, Marinka Zitnik

发表机构 * Harvard University（哈佛大学）

AI总结提出一种去中心化的AI智能体团队系统AutoScientists，通过自组织协作、提案评审和失败知识共享，在生物医学机器学习、语言模型训练优化和蛋白质适应性预测等长期实验中显著优于现有方法。

AI中文摘要

科学研究通过假设生成、实验设计、执行和修正的迭代循环进行。AI智能体可以自动化这一过程的某些部分，但现有方法通常遵循单一研究轨迹，或通过具有固定目标的中央规划器进行协调。因此，它们难以维持并行探索、根据实验证据的变化进行调整，或在长期实验中保留失败方向的知识。我们引入了AutoScientists，一个用于长期计算科学实验的去中心化AI智能体团队。智能体解释共享的实验状态，围绕有希望的假设自组织成团队，在使用实验计算资源之前评审提案，并分享成功和失败以减少冗余探索。在匹配的实验预算下，AutoScientists在生物医学机器学习、语言模型训练优化和蛋白质适应性预测方面优于先前的AI智能体。在涵盖生物医学成像、蛋白质工程、单细胞组学和药物发现的BioML-Bench上，AutoScientists在24个任务中达到了74.4%的平均排行榜百分位，比最强的AI智能体提高了8.33%。在GPT训练优化中，AutoScientists达到目标验证bits-per-byte的速度比Autoresearch快1.9倍，并从初始冠军开始持续发现改进，而单智能体方法未发现任何改进（7个接受改进对比0个）。在ProteinGym适应性预测中，AutoScientists发现了一种ACE2-Spike结合方法，其Spearman相关性比当前最先进模型提高了12.5%。在未经修改地应用于所有217个ProteinGym检测时，相同方法比先前最先进技术提高了6.5%（Spearman相关性）。

英文摘要

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).

2605.28647 2026-05-28 cs.AI cs.CY q-fin.RM 版本更新

The Ethics of LLM Sandbox and Persona Dynamics

LLM沙盒与人格动态的伦理

Tim Gebbie, Stewart Gebbie

发表机构 * University of Cape Town（开普敦大学）

AI总结本文论证LLM护栏和人格动态产生的现实差距（reality gap）构成不道德的“现实洗白”（reality laundering），并提出通过任务级因果需求规范而非响应级道德修正来解决。

Comments 8 pages

AI中文摘要

众所周知，LLM护栏和训练的人格动态会产生现实差距：LLM被允许或塑造描述的世界与用户必须行动的世界之间的距离。这里我们论证，主动产生现实差距实际上是不道德的，因为它有意将认知风险转嫁给不知情的用户——这就是现实洗白。当大规模运作时，这可能会造成伤害。在高暴露建议情境中风险最为尖锐，用户寻求的是方向而非有边界、可外部检查的任务。护栏在声称防止直接伤害时看似在伦理上必要，但当它们压制真实感知并将令人不适的机制洗白为可接受的抽象时，往往变得可疑。巴塞尔式金融监管、B-BBEE式合规、法国兴业银行和伦敦鲸事件展示了正式安全系统如何变得可理解、可博弈和表演性，而真实风险却转移到了别处。同样的模式可能出现在LLM中作为道德合规：安全的语言，扭曲的现实。因此，我们区分拒绝伤害与拒绝现实；然后主张在任务层面进行自上而下的因果需求规范，而非在响应或沙盒层面进行自下而上的道德修正。人格动态之所以重要，是因为助手界面并非中立；它塑造了不确定性、冲突、权威和风险如何被呈现。结论是，所谓的“伦理AI”当用制度安慰替代与现实接触时，实质上变得不伦理。

英文摘要

It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world an LLM is permitted or shaped to describe, and the world in which users must act. Here we argue that actively generating reality gaps is in fact unethical because it knowingly shifts epistemic risk back to the uninformed user -- this is reality laundering. This can potentially cause harm when operationalised at scale. The risk is sharpest in high-exposure advice contexts, where users seek orientation rather than a bounded, externally checkable task. Guardrails naively appear ethically necessary when they claim to prevent direct harm, but often become suspect when they suppress truthful perception and launder uncomfortable mechanisms into acceptable abstractions. Basel-style financial regulation, B-BBEE-style compliance, Societe Generale, and the London Whale show how formal safety systems can become legible, gameable, and performative while real exposure migrates elsewhere. The same pattern can appear in LLMs as moral compliance: safe language, distorted reality. We therefore distinguish refusing harm from refusing reality, and then argue for top-down causal requirements specification at the task level rather than bottom-up moral correction at the response or sandbox level. Persona dynamics matter because the assistant interface is not neutral; it shapes how uncertainty, conflict, authority, and risk are staged. The conclusion is that so-called "ethical AI" becomes substantively unethical when it substitutes institutional reassurance for contact with reality.

2605.28642 2026-05-28 cs.AI 版本更新

Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

带宽高效且隐私保护的边缘-云多对多语音翻译

Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Ming Liu, Bing Qin, Yang Xiang

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Pengcheng Laboratory（鹏城实验室）

AI总结提出边缘-云协同框架ESRT，通过分割推理架构压缩中间特征实现带宽降低10倍和语音隐私保护，并采用多任务加权课程学习策略实现45种语言的多对多语音翻译。

AI中文摘要

多模态大语言模型（MLLMs）在语音到文本翻译（S2TT）方面展现出巨大潜力。然而，现有部署范式面临关键挑战：纯设备端模型受资源限制，而集中式云系统通过传输原始语音数据导致严重的隐私风险和带宽瓶颈。此外，大多数模型表现出以英语为中心的偏见，限制了多对多翻译的扩展。在本文中，我们提出边缘-云语音识别与翻译（ESRT），一种隐私保护且带宽高效的协作式边缘-云MLLM框架。具体而言，我们设计了一种边缘-云分割推理架构，在设备上保留轻量级语音编码器和适配器，仅将高度压缩的中间特征传输到云端。这从根本上防止了声纹泄露，并将带宽需求降低高达10倍。为克服以英语为中心的瓶颈，我们引入了一种多任务加权课程学习策略与数据平衡，以确保鲁棒的跨语言一致性。在FLEURS数据集上的大量实验表明，我们的模型ESRT-4B和ESRT-12B在45种语言（45×44个方向）上实现了最先进的多对多S2TT性能。代码和模型已发布，以促进可复现的、隐私感知的MLLM S2TT研究。代码和模型发布于https://github.com/yxduir/esrt。

英文摘要

Multimodal large language models (MLLMs) have demonstrated significant potential for speech-to-text translation (S2TT). However, existing deployment paradigms face critical challenges: pure on-device models suffer from resource constraints, while centralized cloud systems incur severe privacy risks and bandwidth bottlenecks by transmitting raw voice data. Furthermore, most models exhibit English-centric biases, restricting many-to-many translation scaling. In this paper, we propose Edge-cloud Speech Recognition and Translation (ESRT), a privacy-preserving and bandwidth-efficient collaborative edge-cloud MLLM framework. Specifically, we design an edge-cloud split inference architecture that retains a lightweight speech encoder and adapter on the device, transmitting only highly compressed intermediate features to the cloud. This fundamentally prevents voiceprint leakage and reduces bandwidth requirements by up to 10$\times$. To overcome English-centric bottlenecks, we introduce a multi-task weighted curriculum learning strategy with data balancing to ensure robust cross-lingual consistency. Extensive experiments on the FLEURS dataset demonstrate that our models, ESRT-4B and ESRT-12B, achieve state-of-the-art many-to-many S2TT performance across 45 languages ($45 \times 44$ directions). Code and models are released to facilitate reproducible, privacy-aware MLLM S2TT research. The code and models are released at https://github.com/yxduir/esrt.

2605.28639 2026-05-28 cs.CL cs.AI 版本更新

The Attentional White Bear Effect in Transformer Language Models

Transformer语言模型中的注意力白熊效应

Rebecca Ramnauth, Brian Scassellati

发表机构 * Yale University（耶鲁大学）

AI总结通过表征探测、注意力分析和行为语义泄露实验，发现指令抑制下Transformer语言模型仍能恢复被禁止概念的表征并影响后续生成，揭示了行为对齐与表征对齐之间的根本差距。

Comments Currently under review at EMNLP 2026

2605.28632 2026-05-28 cs.CR cs.AI 版本更新

Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking

盲PRNG劫持：一种针对LLM水印的不可检测的完整性保持攻击

Ziyang You, Huilong He, Xiaoke Yang, Xuxing Lu

发表机构 * Fujian Provincial Key Laboratory of Automotive Electronics and Electric Drive（福建省汽车电子与电力驱动重点实验室）； School of Electronic, Electrical and Physics（电子、电气与物理学院）； Fujian University of Technology（福建理工大学）； School of Humanities（人文学院）； Institute of Applied Physics and Materials Engineering（应用物理与材料工程学院）； University of Macau（澳门大学）

AI总结提出SeedHijack攻击，通过替换伪随机数生成器（PRNG）在供应链层面对LLM水印进行盲攻击，同时保持完整性并规避检测。

Comments Preprint prepared for submission to IEEE TIFS. 12 pages, 8 figures

AI中文摘要

密码学水印是归因大型语言模型（LLM）生成文本的主要防御手段。现有方案（包括KGW、Unigram和DipMark）的安全性基于底层伪随机数生成器（PRNG）可信的假设。本文引入SeedHijack，这是首个针对LLM水印的供应链攻击，同时满足：(i) 盲——无需知道水印密钥、检测器或模型logits；(ii) 完整性保持——放大而非擦除水印信号；(iii) 与检测正交——攻击引入的偏差与所有内容侧检测器统计独立，确保放大和规避共存而无权衡。SeedHijack不扰动生成文本，而是在供应链层替换PRNG，偏向绿名单选择而不改变输出令牌或降低文本质量。在三种水印方案和三个开源LLM上，攻击触发了0/6个最先进的内容侧统计检测器，同时将水印z-score放大至2.42倍（系统级防御如熵源认证保持正交和互补）。量子随机数生成器（QRNG）对策被证明能完全中和攻击，同时保持良性水印效用。这些发现确立了PRNG完整性作为密码学内容来源系统的一等安全需求。

英文摘要

Cryptographic watermarking is a leading defense for attributing text generated by large language models (LLMs). Existing schemes, including KGW, Unigram, and DipMark, derive their security guarantees from the assumption that the underlying pseudo-random number generator (PRNG) is trustworthy. This work introduces SeedHijack, the first supply-chain attack on LLM watermarking that is simultaneously (i) blind -- requiring no knowledge of the watermark key, detector, or model logits, (ii) integrity-preserving -- amplifying rather than erasing the watermark signal, and (iii) orthogonal to detection -- the attack-induced bias is statistically independent of all content-side detector statistics, ensuring that amplification and evasion coexist without trade-off. Rather than perturbing generated text, SeedHijack replaces the PRNG at the supply-chain layer, biasing green-list selection without altering output tokens or degrading text quality. Across three watermarking schemes and three open-source LLMs, the attack triggers 0/6 state-of-the-art content-side statistical detectors while inflating the watermark z-score up to 2.42x (system-level defenses such as entropy-source attestation remain orthogonal and complementary). A quantum random number generator (QRNG) countermeasure is shown to fully neutralize the attack while preserving benign watermarking utility. These findings establish PRNG integrity as a first-class security requirement for cryptographic content-provenance systems.
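
For context on the "watermark z-score" mentioned above, the sketch below computes the one-proportion z-statistic commonly used by KGW-style green-list detectors: with T scored tokens, green-list fraction γ, and g observed green tokens, z = (g − γT) / sqrt(Tγ(1−γ)). The function name and the example numbers are illustrative, and the statistics used for Unigram or DipMark in the paper may differ in detail.

```python
import math


def watermark_z_score(green_count: int, total_tokens: int, gamma: float) -> float:
    """z-statistic for the number of green-list tokens in a scored text.

    Under the null hypothesis (no watermark) each token is green with
    probability gamma, so the count is approximately
    Normal(gamma * T, T * gamma * (1 - gamma)).
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    expected = gamma * total_tokens
    std_dev = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std_dev


if __name__ == "__main__":
    # Illustrative numbers only: 200 scored tokens, gamma = 0.25.
    print(watermark_z_score(50, 200, 0.25))  # count equals the expectation
    print(watermark_z_score(90, 200, 0.25))  # many more green tokens than expected
```

Because the statistic grows with the green-token count, any mechanism that biases green-list selection raises z without changing how the detector is computed.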

2605.28617 2026-05-28 cs.AI cs.PL 版本更新

LACUNA: Safe Agents as Recursive Program Holes

LACUNA: 作为递归程序空洞的安全智能体

Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham, Frank Zhengqing Wu, Martin Odersky

发表机构 * EPFL（洛桑联邦理工学院）

AI总结提出LACUNA编程模型，通过类型化调用和编译时检查，让LLM智能体以递归程序空洞的方式安全地编写代码，实现表达性与安全性的统一。

AI中文摘要

LLM智能体越来越多地通过编写代码来行动，但驱动智能体的运行时与模型编写的代码之间仍然存在分裂。运行时拥有循环、上下文和控制流，而模型对这些几乎没有发言权。让模型编写的代码塑造运行时本身将使智能体更具表达性，但也会加剧安全问题。模型可能因提示注入而偏离方向、调用错误的工具，或在执行中途失败并留下不一致的状态，而当代码塑造运行时，此类失败的波及范围比仅表达单个动作时更广。我们提出了LACUNA，一种智能体编程模型，它在保持安全性的同时弥合了这种分裂。每个智能体动作都是一个类型化调用$\texttt{agent[T](task)}$，当执行到达该调用时，LLM用代码填充它，并且在代码运行之前，会针对周围程序进行类型检查。由于每个动作作为一个整体被接受或拒绝，被拒绝的动作不会影响环境，其编译器诊断信息会驱动重试。同样的检查也限制了动作可以使用哪些工具和数据以及它们如何流动。我们的原语将ReAct循环、子智能体、技能、并行分解和多模型规划表达为普通的控制流。我们在测试用例集合、BrowseComp-Plus和$τ^2$-bench上评估了LACUNA。在BrowseComp-Plus上，8.6%的生成在执行前被拒绝，平均每次查询重试0.7次，智能体达到27.1%的准确率。在$τ^2$-bench上，LACUNA使用一个能力强的模型解决了四个领域392个任务中的76.0%，与基线智能体相当。

英文摘要

LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call $\texttt{agent[T](task)}$ that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and $τ^2$-bench. On BrowseComp-Plus, $8.6\%$ of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches $27.1\%$ accuracy. On $τ^2$-bench, LACUNA solves $76.0\%$ of $392$ tasks across four domains with a capable model, on par with the baseline agent.
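
The accept-or-reject-as-a-whole loop described above can be approximated in a few lines. This is a loose Python analogy, not LACUNA's actual mechanism: a static name-scope check stands in for compile-time type checking, and `check_candidate`, `fill_hole`, and the `generate` callback are hypothetical names introduced for illustration. It is also not a security sandbox.

```python
import ast
from typing import Any, Callable, Dict, List


def check_candidate(source: str, tools: Dict[str, Callable]) -> List[str]:
    """Statically inspect a candidate expression; an empty list means accepted."""
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    diagnostics = []
    for node in ast.walk(tree):
        # Only explicitly granted tools may be referenced.
        if isinstance(node, ast.Name) and node.id not in tools:
            diagnostics.append(f"name not in scope: {node.id}")
        # Attribute access is rejected to keep the allowed surface small.
        if isinstance(node, ast.Attribute):
            diagnostics.append(f"attribute access not allowed: .{node.attr}")
    return diagnostics


def fill_hole(task: str, tools: Dict[str, Callable],
              generate: Callable[[str, List[str]], str],
              max_retries: int = 3) -> Any:
    """Ask `generate` for code, check it before running, retry on rejection."""
    diagnostics: List[str] = []
    for _ in range(max_retries + 1):
        source = generate(task, diagnostics)  # diagnostics drive the retry
        diagnostics = check_candidate(source, tools)
        if not diagnostics:
            # The candidate executes only after the whole of it passed the check,
            # so a rejected candidate never touches the environment.
            return eval(source, {"__builtins__": {}}, dict(tools))
    raise RuntimeError(f"hole not filled after {max_retries} retries: {diagnostics}")
```

In this sketch the `tools` dictionary plays the role of the bound on what an action may use; a real type system can additionally constrain how data flows between tools, which a name check cannot express.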

2605.28616 2026-05-28 cs.CL cs.AI 版本更新

Measuring Form and Function in Language Models

语言模型中的形式与功能测量

Héctor Javier Vázquez Martínez, Charles Yang

发表机构 * University of Pennsylvania（宾夕法尼亚大学）； Department of Linguistics and Computer and Information Science（语言学与计算机与信息科学系）

AI总结通过引入儿童语言习得的定量指标，提出上下文替代选择（CAC）提示方法，评估语言模型在英语限定词的形式句法和功能话语知识方面的表现，发现仅大型模型能同时满足形式和功能基准。

Comments Under review at ACL Rolling Review May 2026 cycle

2605.28607 2026-05-28 cs.AI cs.CL 版本更新

Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

基于自适应多模态智能体的自动工作流执行框架

Susanna Cifani, Mario Luca Bernardi, Marta Cimitile

发表机构 * Sapienza University of Rome（罗马萨皮恩扎大学）； Department of Engineering, University of Sannio（萨尼奥大学工程系）； Faculty of Jurisprudence, Unitelma Sapienza University（Unitelma Sapienza大学法学院）

AI总结提出一种多模态多智能体框架，通过离线构建拓扑知识库和在线自适应检索增强生成与闭环协作验证，实现自动工作流执行。

Comments Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses. Accepted for publication at the 2026 IEEE International Conference on Evolving and Adaptive Intelligent Systems (EAIS 2026)

AI中文摘要

现代信息系统需要能够导航复杂工作流的自主智能体，但当前方法在从结构化元数据解析过渡到通用环境感知时常常遇到困难。虽然多模态大语言模型的集成使智能体能够直接与图形用户界面交互，但现有方法通常将任务序列视为离散的线性片段。这种碎片化阻止了智能体捕捉底层转移拓扑结构，限制了它们在新型或非平稳场景中的有效性。为了解决这个问题，我们提出了一种新颖的多模态多智能体框架，通过一个独特的两阶段流程实现自动工作流执行。首先，在离线发现阶段，该架构从碎片化的执行日志中自适应地构建拓扑知识库。在推理过程中，智能体利用自适应检索增强生成（RAG）作用于这个固定的、预先建立的图，并结合闭环协作验证协议进行动态自我纠正和导航。这种基于图的方法促进了优越的任务分解和自适应导航性能。我们在真实世界环境中验证了该框架，展示了即使在训练数据有限的情况下，它也能保持高可靠性和语义感知能力。

英文摘要

Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of MLLMs has enabled agents to interact directly with GUIs, existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non-stationary scenarios. To address this, we propose a novel multimodal multi-agent framework that achieves automatic workflow execution through a distinct two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents leverage Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol to dynamically self-correct and navigate. This graph-based approach facilitates superior task decomposition and adaptive navigation performance. We validate our framework in a real-world context, demonstrating its ability to maintain high reliability and semantic awareness even with limited training data.

2605.28604 2026-05-28 cs.CV cs.AI 版本更新

评估基于LLM的社会智能体的真实性：对西班牙在线新闻反应的案例研究

Alejandro Buitrago López, Alberto Ortega Pastor, Javier Pastor-Galindo, José A. Ruipérez-Valiente

发表机构 * Faculty of Computer Science, University of Murcia（计算机科学系，穆尔西亚大学）

AI总结通过比较真实与LLM生成的西班牙新闻评论，研究LLM在仇恨言论、情感和语义对齐三个维度上的真实性，发现现成模型表现不佳，微调可部分改善。

AI中文摘要

基于LLM的社会智能体越来越多地被用于模拟在线社交行为，但其真实性仍然难以验证。现有工作主要依赖通用基准，而对简短的反应性话语（如受众对在线新闻的回复）关注较少。在本文中，我们评估LLM生成的西班牙新闻反应是否再现了真实受众话语的可测量属性。使用Hatemedia数据集，我们将5,631条新闻与58,555条真实受众反应配对，并在共享实验设置下使用五个LLM生成匹配的合成数据集。我们从仇恨言论、情感和语义对齐三个维度比较真实和合成反应，考虑现成和微调生成。结果表明，现成模型是真实受众反应的糟糕代理：它们严重低估仇恨言论，引入模型特定的情感偏差，并且在分布上与人类回复相距甚远。微调不均匀地提高了保真度。Qwen3提供了最平衡的近似，而Mistral7B实现了最强的情感和语义对齐，但过度估计了仇恨普遍性。看似合理的合成回复不一定再现公共话语的分布特性。

英文摘要

LLM-powered social agents are increasingly used to simulate online social behavior, yet their realism remains difficult to validate. Existing work has largely relied on general-purpose benchmarks, while less attention has been paid to short, reactive discourse such as audience replies to online news. In this paper, we evaluate whether LLM-generated reactions to Spanish online news reproduce measurable properties of real audience discourse. Using the Hatemedia dataset, we pair 5,631 news items with 58,555 real audience reactions, and generate a matched synthetic dataset using five LLMs under a shared experimental setting. We compare real and synthetic reactions across three dimensions: hate speech, sentiment, and semantic alignment, considering both off-the-shelf and fine-tuned generation. Results show that off-the-shelf models are poor proxies for real audience reactions: they strongly underproduce hate speech, introduce model-specific sentiment biases, and remain distributionally distant from human replies. Fine-tuning improves fidelity unevenly. Qwen3 provides the most balanced approximation, while Mistral7B achieves the strongest sentiment and semantic alignment but overshoots hate prevalence. Plausible synthetic replies do not necessarily reproduce the distributional properties of public discourse.

2605.28597 2026-05-28 cs.CR cs.AI cs.LG 版本更新

Position: Retire the "Positive Backdoor" Label -- Secret Alignment Requires Strict and Systematic Evaluation

立场：淘汰“正向后门”标签——秘密对齐需要严格且系统的评估

Jianwei Li, Jung-Eun Kim

发表机构 * Department of Computer Science, North Carolina State University, Raleigh, USA（北卡罗来纳州立大学计算机科学系）

AI总结本文主张停止使用“正向后门”标签，将触发激活的隐藏行为视为秘密对齐，并通过评估三个代表性应用在六个核心属性上的表现，揭示其脆弱性，呼吁进行严格评估。

Comments ICML 2026

AI中文摘要

这篇立场论文认为，AI/ML社区应停止过度宣称并淘汰“正向后门”标签，而应将触发激活的隐藏行为视为秘密对齐。关键在于，基于秘密对齐的保护性主张在缺乏严格、标准化评估的情况下，默认不应被视为安全。私有AI时代，通过开放权重的LLM和可访问的训练/推理栈，语言模型成为私有数字资产，产生了关于未授权访问、模型盗窃和行为滥用的安全问题。最近，一系列被称为“正向后门”的工作被提出以应对这些挑战。为将我们的立场建立在证据基础上，我们将这些提议统一为用于访问门控、所有权归属和安全执行的隐蔽触发-行为关联，并评估了三个代表性应用在六个核心属性上的表现：有效性、无害性、持久性、效率、鲁棒性和可靠性。我们的结果揭示了触发-行为映射的显著脆弱性——尤其是在机密性、完整性和可用性（CIA）方面——这些往往被现有声称低估。我们进一步将这些结果与行为密度和决策复杂性联系起来，提供了一个理解部署时风险的行为视角，并激励社区范围内的评估，使秘密对齐主张可证明。

英文摘要

This position paper argues that the AI/ML community should stop overclaiming and retire the label "positive backdoor," and instead treat trigger-activated hidden behaviors as Secret Alignment. Crucially, protective claims based on Secret Alignment should be presumed not secure by default unless supported by rigorous, standardized evaluation. The Private AI era, enabled by open-weight LLMs and accessible training/inference stacks, turns language models into privately owned digital assets, creating security concerns around unauthorized access, model theft, and behavioral misuse. Recently, a line of work framed as "positive backdoors" has been proposed to address these challenges. To ground our position in evidence, we unify these proposals as covert trigger-behavior associations for access gating, ownership attribution, and safety enforcement, and evaluate three representative applications across six core properties: effectiveness, harmlessness, persistence, efficiency, robustness, and reliability. Our results reveal substantial brittleness of trigger-behavior mappings, especially in confidentiality, integrity, and availability (CIA), that existing claims often underrepresent. We further relate these outcomes to behavior density and decision complexity, offering a behavioral lens for understanding deployment-time risks and motivating community-wide evaluation that makes Secret Alignment claims provable.

2605.28594 2026-05-28 cond-mat.stat-mech cs.AI physics.comp-ph 版本更新

Thermodynamic properties of chemically disordered compounds via AI-driven estimation of partition function with the PULSE method

通过PULSE方法基于AI驱动配分函数估计的化学无序化合物热力学性质

Baptiste Bernard, Luca Messina, Eiji Kawasaki, Emeric Bourasseau

发表机构 * CEA, DES, IRESNE, DEC（CEA，DES，IRESNE，DEC）

AI总结提出改进的PULSE方法，通过无监督学习采样和估计配分函数，以低成本高效计算化学无序化合物的热力学性质，并在2D Ising模型上验证了其高精度和效率。

Comments 13 pages, 11 figures, submitted to Physical Chemistry Chemical Physics

2605.28588 2026-05-28 cs.CR cs.AI 版本更新

Technical Report: Exploring the Emerging Threats of the Agent Skill Ecosystem

技术报告：探索智能体技能生态系统的新兴威胁

Luca Beurer-Kellner, Aleksei Kudrinskii, Marco Milanta, Kristian Bonde Nielsen, Hemang Sarkar, Liran Tal

发表机构 * Snyk

AI总结本研究通过分析3984个AI智能体技能，发现76个恶意载荷，揭示了技能生态系统中的安全威胁，并提出了威胁分类和攻击模式。

Comments 10 pages, technical report

2605.28583 2026-05-28 cs.RO cs.AI cs.LG cs.SY eess.SY 版本更新

SARAD: LLM-Based Safety-Aware Hybrid Reinforcement Learning with Collision Prediction for Autonomous Driving

SARAD：基于LLM的安全感知混合强化学习与碰撞预测在自动驾驶中的应用

Kangyu Wu, Peng Cui, Guoxi Chen, Ya Zhang

发表机构 * National Natural Science Foundation (NNSF) of China（中国国家自然科学基金委员会）； National Science and Major Project（国家科学技术重大专项）

AI总结提出SARAD框架，结合大语言模型和深度强化学习，通过检索增强生成和碰撞预测模块提升自动驾驶的安全性和效率。

Comments 7 pages, 4 figures, accepted by IJCNN 2026

AI中文摘要

确保自动驾驶系统决策的安全性和效率仍然是一个基本挑战。传统的深度强化学习（DRL）存在不安全的随机探索和收敛缓慢的问题，而大语言模型（LLM）在实时推理操作中表现出固有的延迟。为了解决这些限制，本文提出了SARAD，一种新颖的安全感知混合框架，协同LLM和DRL用于自动驾驶。SARAD用来自动态专家知识库的、经检索增强生成（RAG）增强的LLM引导决策替代了DRL的随机探索。提出了一个注意力判别器，将LLM的先验知识整合到DRL策略优化中。进一步设计了一个碰撞预测模块，使用历史碰撞数据进行微调，以提高车辆安全性。大量实验表明，SARAD在Highway-Env模拟器中实现了显著的性能提升，验证了所提模型在自动驾驶中的有效性。

英文摘要

Ensuring both safety and efficiency in decision-making for autonomous driving systems remains a fundamental challenge. Traditional Deep Reinforcement Learning (DRL) suffers from unsafe random exploration and slow convergence, while Large Language Models (LLMs) demonstrate inherent latency in real-time inference operations. To address these limitations, this paper proposes SARAD, a novel safety-aware hybrid framework that synergizes LLMs and DRL for autonomous driving. SARAD substitutes the random exploration of DRL with Retrieval-Augmented Generation (RAG)-enhanced, LLM-guided decisions sourced from a dynamic expert knowledge repository. An attention discriminator is proposed to integrate the prior knowledge of LLMs into DRL policy optimization. A collision predictor module, fine-tuned with historical collision data, is further designed to improve vehicle safety. Extensive experiments show that SARAD achieves significant performance improvements in the Highway-Env simulator, validating the effectiveness of the proposed model in autonomous driving.

2605.28577 2026-05-28 cs.AI cs.LG 版本更新

Continual Model Routing in Evolving Model Hubs

演化模型库中的持续模型路由

Jack Bell, Giacomo Carfì, Gerlando Gramaglia, Vincenzo Lomonaco

发表机构 * Department of Computer Science, University of Pisa, Pisa, Italy（意大利比萨大学计算机科学系）； LUISS University, Rome, Italy（意大利罗马LUISS大学）

AI总结针对模型库快速扩展带来的模型选择和路由更新挑战，提出持续模型路由（CMR）问题，构建大规模基准CMRBench，并设计基于对比嵌入的CARvE方法，通过检查点锚定和结构化重放实现高效路由，显著优于多种基线。

Comments 42 pages, 24 tables, 6 figures, to be published at ICML 2026

AI中文摘要

AI模型库提供了对快速增长的大量预训练模型的访问，使得具有不同路由策略的现成混合专家系统成为可能。然而，这种快速增长带来了两个基本挑战：跨数千个专家进行模型选择的扩展，以及随着新模型和任务的引入持续更新路由机制。在本文中，我们将这一设置形式化为持续模型路由（CMR），并提出了CMRBench，这是一个新的大规模基准，模拟现实的模型库扩展，包括超过2000个候选模型。最后，我们介绍了CARvE，一种对比嵌入方法，通过基于检查点的锚定和结构化重放实现高效的持续模型路由。大量的实验结果和消融研究表明，CARvE在模型、家族和领域级别的准确性上显著优于零样本检索、微调和适配器合并基线。

英文摘要

AI model hubs provide access to a rapidly growing collection of powerful pre-trained models, enabling off-the-shelf mixture-of-experts systems with different routing strategies. However, this rapid growth poses two fundamental challenges: scaling model selection across thousands of experts and continually updating routing mechanisms as new models and tasks are introduced. In this paper, we formalise this setting as Continual Model Routing (CMR) and propose CMRBench, a new large-scale benchmark simulating realistic hub expansion and including over 2,000 candidate models. Finally, we introduce CARvE, a contrastive embedding approach for efficient continual model routing via checkpoint-based anchoring and structured replay. Extensive empirical results and ablations show that CARvE significantly outperforms zero-shot retrieval, fine-tuning, and adapter-merging baselines in model, family, and domain-level accuracy.

2605.28575 2026-05-28 cs.AI 版本更新

A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

一种冲突感知惩罚与统计损失框架，用于平衡模态并增强多模态情感分析的稳定性

Jianheng Dai, Jiazhang Liang, Sijie Mai

发表机构 * School of Computer Science, South China Normal University（华南师范大学计算机学院）

AI总结针对多模态情感分析中文本模态主导导致梯度冲突的问题，提出冲突感知惩罚和统计损失框架，实现模态平衡与训练稳定，在CMU-MOSI上取得最优性能。

AI中文摘要

多模态情感分析（MSA）融合文本、声学和视觉流来推断情感。由于预训练文本编码器的表达能力远强于声学和视觉编码器，文本模态往往主导优化过程，抑制较弱模态并引发梯度范数冲突，从而破坏训练稳定性。为解决此问题，我们提出一种冲突感知惩罚（CP），在每一步训练中检测并惩罚梯度范数冲突，以及一种统计损失（SL），使预测分布统计量与经验输入统计量对齐。关键的是，CP防止主导模态梯度干扰SL目标，从而在统一框架内实现协同训练，该框架包含自适应模态编码、门控跨模态融合和单模态辅助头。在CMU-MOSI上的实验表明，该方法达到了最先进的性能，消融研究证实了每个组件的有效性。

英文摘要

Multimodal Sentiment Analysis (MSA) fuses text, acoustic, and visual streams to infer sentiment. Because pre-trained text encoders are far more expressive than their acoustic and visual counterparts, the text modality tends to dominate optimization, suppressing weaker modalities and inducing gradient norm conflicts that destabilize training. To address this, we propose a Conflict-aware Penalty (CP) that detects and penalizes gradient norm conflicts at each training step, and a Statistical Loss (SL) that aligns predicted distribution statistics with empirical input statistics. Crucially, CP prevents dominant modality gradients from interfering with the SL objective, enabling synergistic training within a unified framework incorporating adaptive modality encoding, gated cross-modal fusion, and unimodal auxiliary heads. Experiments on CMU-MOSI demonstrate state-of-the-art performance, with ablation studies confirming the effectiveness of each component.

2605.28573 2026-05-28 cs.LG cs.AI 版本更新

Efficient Pre-Training of LLMs through Truncated SVD Layers

通过截断SVD层实现LLM的高效预训练

Kaivan Kamali, Kajetan Schweighofer, Hormoz Shahrzad, Olivier Francon, Babak Hodjat, Risto Miikkulainen

Affiliations: Cognizant AI Lab; UT Austin

AI summary: Proposes the TSVD framework, which uses a spectral-energy-based heuristic for adaptive rank selection and a caching mechanism to maintain low rank and strict orthonormality, matching or exceeding full-parameter baselines while reducing compute.

Abstract

The massive scaling of Large Language Models (LLMs) has made pretraining increasingly cost-prohibitive. While low-rank representation and orthonormal weight matrices could in principle reduce parameter counts and computational overhead, most existing methods rely on static rank selection and do not enforce weight orthonormality due to high computational cost. This paper introduces TSVD, a framework that maintains low rank and strict orthonormality throughout the training process. It uses a spectral-energy-based heuristic for adaptive rank selection and a caching mechanism to maintain orthonormality. Theoretical analysis justifies the advantage of the approach in pretraining dynamics, and experiments across various model scales demonstrate that it is effective empirically. TSVD matches or exceeds the performance of full-parameter baselines while significantly reducing compute requirements. The approach thus offers a well-founded, practical, and scalable path toward efficient high-performance LLM pretraining.
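
The spectral-energy heuristic for adaptive rank selection can be illustrated with a minimal sketch. The abstract does not state the exact rule, so the code below assumes a common variant: keep the smallest rank whose leading singular values retain a chosen share of the total squared singular values. The name `select_rank` and the `energy_threshold` parameter are hypothetical and are not taken from the paper.

```python
from itertools import accumulate


def select_rank(singular_values, energy_threshold=0.9):
    """Return the smallest rank that keeps `energy_threshold` of the spectral energy.

    Assumes `singular_values` is sorted in descending order, as an SVD returns them.
    The threshold rule is an illustrative assumption, not the paper's stated rule.
    """
    energies = [s * s for s in singular_values]  # spectral energy = squared singular values
    total = sum(energies)
    for rank, cumulative in enumerate(accumulate(energies), start=1):
        if cumulative >= energy_threshold * total:
            return rank
    return len(energies)  # fallback if rounding keeps the sum just under the target


# Toy spectrum with a fast decay: most of the energy sits in the first two directions.
spectrum = [10.0, 5.0, 2.0, 1.0, 0.5]
rank_90 = select_rank(spectrum, energy_threshold=0.9)
rank_99 = select_rank(spectrum, energy_threshold=0.99)
```

Under this assumed rule, a higher threshold keeps more directions, so the rank can adapt per layer and per training stage instead of being fixed in advance.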


2605.28567 2026-05-28 cs.LG cs.AI (updated version)

Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

Tue M. Cao, Nguyen Do, My T. Thai

Affiliation: University of Florida

AI summary: Proposes a distributional framework based on optimal transport that uses activation-weighted distributions and the Wasserstein distance to address cross-layer feature matching and circuit compression in a unified way.

Comments preprint

Abstract

Sparse autoencoders (SAEs) have become a central tool for interpreting language models. However, two key SAE analyses remain difficult to scale: (1) matching semantically similar features across layers and (2) compressing large feature circuits into interpretable supernodes. Although these have been treated as separate problems, we show that both are instances of a more fundamental challenge, which we frame as the estimation of semantic distances between SAE features that lie on different activation manifolds. We introduce a distributional framework for this problem, in which each feature is represented not by a single decoder vector, as in the literature, but by an activation-weighted distribution over the hidden states that express it. By projecting these distributions into a shared reference space and comparing them with Wasserstein distance, our method provides a unified semantic metric for cross-layer feature comparison. We prove that our representation is invariant to activation rescaling, stable under perturbations, and recovers true matches under finite-sample margin conditions. Empirically, our method outperforms decoder-vector and LLM-based baselines and captures subtle functional distinctions between related features. Notably, our method compresses large feature circuits into interpretable supernodes automatically.
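
The idea of comparing activation-weighted distributions with a Wasserstein distance can be sketched in one dimension. This is a deliberately simplified illustration: it assumes each hidden state has already been projected to a single scalar coordinate in the shared reference space, and it uses the 1-D Wasserstein-1 distance computed from the two cumulative distribution functions. The abstract does not specify the reference space or the Wasserstein variant, and the function names below are hypothetical.

```python
def normalize(activations):
    """Turn non-negative activations into probability masses that sum to 1."""
    total = sum(activations)
    return [a / total for a in activations]


def wasserstein_1d(points_u, acts_u, points_v, acts_v):
    """Wasserstein-1 distance between two activation-weighted 1-D distributions.

    `points_*` are scalar projections of hidden states (an assumed simplification);
    `acts_*` are the feature activations used as weights.
    """
    mass_u = normalize(acts_u)
    mass_v = normalize(acts_v)
    # Merge both supports; each event carries the mass it adds to either CDF.
    events = [(x, m, 0.0) for x, m in zip(points_u, mass_u)]
    events += [(x, 0.0, m) for x, m in zip(points_v, mass_v)]
    events.sort()
    distance, cdf_u, cdf_v = 0.0, 0.0, 0.0
    previous_x = events[0][0]
    for x, du, dv in events:
        # Integrate |CDF_u - CDF_v| over the gap since the previous support point.
        distance += abs(cdf_u - cdf_v) * (x - previous_x)
        cdf_u += du
        cdf_v += dv
        previous_x = x
    return distance


# Feature v fires on states shifted by one unit relative to feature u.
base = wasserstein_1d([0.0, 1.0], [1.0, 1.0], [1.0, 2.0], [1.0, 1.0])
# Rescaling all activations of u leaves the normalized distribution unchanged.
rescaled = wasserstein_1d([0.0, 1.0], [5.0, 5.0], [1.0, 2.0], [1.0, 1.0])
```

Because the activations are normalized into masses before comparison, multiplying every activation of a feature by the same constant does not change the distance, which mirrors the rescaling invariance claimed in the abstract.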


2605.28566 2026-05-28 cs.AI cs.LG (updated version)

Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns

Guni Sharon

Affiliation: Guni Sharon

AI summary: Using a unified taxonomy grounded in classical heuristic search terminology, the paper maps LLM-based reasoning onto search components and identifies two design patterns: systematic search and lookahead-heavy strategies.

Comments Extended version of the SoCS 2026 paper. Includes appendices omitted from the proceedings version

Journal ref: Proceedings of the Nineteenth International Symposium on Combinatorial Search (SoCS 2026), AAAI Press, 2026

Abstract

Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, yet their standard generation process -- auto-regressive token prediction -- is inherently myopic and prone to cascading errors. To address this, the Tree-of-Thoughts (ToT) framework creates a search space over intermediate reasoning steps, allowing search models to explore, look ahead, and backtrack. However, current ToT research remains fragmented across Natural Language Processing and Automated Planning communities, often using inconsistent terminology and ad-hoc implementations. Consequently, we synthesize the ToT landscape through a unified taxonomy based on classical heuristic search terminology. We map LLM-based reasoning to classical search components: state representation (granularity of thoughts), successor generation (prompting operators), and heuristic evaluation (self-assessment of progress). We analyze existing work within the context of our taxonomy and identify emerging design patterns: systematic search (Best-First Search) for shallow, deterministic tasks and lookahead-heavy strategies (DFS, MCTS) for deep multi-step reasoning. We conclude by identifying open algorithmic challenges at the intersection of heuristic search and LLM reasoning, and call on the heuristic search community to engage with this emerging domain.
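
The mapping from Tree-of-Thoughts to classical search components can be sketched as a generic best-first search. In this illustration the three callables stand in for the LLM: `expand` plays the role of the prompting operator (successor generation), `heuristic` plays the role of self-assessment of progress, and a state is a partial chain of "thoughts". The toy arithmetic task and all names are assumptions made for the example, not the paper's implementation.

```python
import heapq
from itertools import count


def best_first_search(initial_state, expand, heuristic, is_goal, max_expansions=1000):
    """Expand the most promising state first, ranked by the heuristic (lower is better)."""
    tie_breaker = count()  # keeps heap comparisons well-defined when scores tie
    frontier = [(heuristic(initial_state), next(tie_breaker), initial_state)]
    seen = {initial_state}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for child in expand(state):
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (heuristic(child), next(tie_breaker), child))
    return None


# Toy stand-in for a reasoning task: build a sequence of steps that sums to TARGET.
TARGET = 10
STEPS = (1, 3, 4)


def expand(state):
    """Successor generation: append one more step, pruning overshoots."""
    return [state + (step,) for step in STEPS if sum(state) + step <= TARGET]


def heuristic(state):
    """Progress estimate: remaining distance to the target."""
    return TARGET - sum(state)


def is_goal(state):
    return sum(state) == TARGET


solution = best_first_search((), expand, heuristic, is_goal)
```

Swapping the priority queue for a stack would give depth-first search with backtracking, which is the kind of design choice the taxonomy is meant to make explicit.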


2605.28565 2026-05-28 cs.DL cs.AI cs.CL cs.IR (updated version)

Verified Misguidance: Measuring Structural Citation Failures in Search-Augmented LLMs

Yongsik Seo, Wooseok Jeong, Eunyoung Kim, Hyeonseo Jang, Dongha Lee

Affiliations: Department of Artificial Intelligence, Yonsei University; Department of Computer Science and Engineering, Konkuk University; Incheon International Airport Corporation; Department of Computer Science and Engineering, Ewha Womans University

AI summary: To address citation trustworthiness in search-augmented LLMs, the paper introduces the CITETRACE dataset and a three-dimension evaluation framework, and identifies a systematic "verified misguidance" pattern: models cite real, accessible sources yet fall short on intent alignment, source suitability, or answer-source fidelity, exposing users to structurally misleading citations.

Comments Working Progress

Abstract

Users of search-augmented LLMs rely on citations as evidence that responses are grounded in real sources, and rarely verify the cited pages themselves. Millions of queries per day now pass through these systems, making citation quality a silent determinant of whether users are informed or misled, yet existing benchmarks each address one facet in isolation, leaving the joint structure that determines citation trustworthiness unmeasured. We construct CITETRACE, a large-scale dataset that traces the full citation chain from user query through retrieved source to generated answer: 11,200 real-world queries from 28 communities paired with 112,000 responses from ten models across five providers, yielding 761,495 evaluable citation pairs. We design a three-dimension evaluation framework that scores each citation on intent-purpose alignment, source suitability, and answer-source fidelity, using expert-validated predefined matrices and a five-level fidelity rubric; the framework applies to any system that produces citation-bearing responses. Applying this framework at scale, we identify a systematic pattern we call VERIFIED MISGUIDANCE (VM): models cite real, accessible sources yet fail along one or more dimensions, producing a fidelity-suitability trade-off in which faithful models select inappropriate sources and vice versa. Across our pool, 30.6% of citations distort their sources and 27.1% originate from domain-inappropriate sources; at the response level, up to 96% of users encounter at least one structurally misleading citation. Provider-level differences explain 88-96% of citation-quality variance, suggesting that source selection is governed more by factors beyond individual model capability than by the LLMs themselves. Together, CITETRACE and its evaluation framework provide the first resource for diagnosing structural citation failures in deployed search-augmented systems.


2605.28563 2026-05-28 cs.LG cs.AI (updated version)

A Multi-dimensional Framework for Evaluating Generalization in EEG Foundation Models

Aditya Kommineni, Emily Zhou, Kleanthis Avramidis, Tiantian Feng, Shrikanth Narayanan

Affiliation: Signal Analysis and Interpretation Laboratory

AI summary: Proposes a multi-dimensional framework for systematically evaluating the generalization of EEG foundation models (such as LaBraM, CSBrain, and CBraMod) under low-resource conditions; the models perform well on long-context tasks, are only on par with supervised models on short-window BCI tasks, and offer limited robustness to channel constraints.

Comments 24 pages, 5 Figures

Abstract

Evaluating foundation models under appropriate adaptation settings is essential for understanding the quality and transferability of the learned representations. Recent EEG foundation models have demonstrated promising transfer capabilities across tasks and datasets, motivating their growing use in neurotechnology and clinical applications. However, these models are typically evaluated under full fine-tuning on well-curated downstream datasets, a setting that does not reflect biomedical domain constraints such as limited labeled data, reduced sensor coverage, or parameter-efficient adaptation. In this work, we propose a multi-dimensional evaluation framework for assessing EEG models under realistic low-resource conditions. Under this framework, we empirically analyze both supervised EEG models and recent EEG foundation models, including LaBraM, CSBrain, and CBraMod, across 6 different datasets. We find that EEG foundation models consistently provide performance gains on long-context tasks such as sleep stage prediction and mental health state classification. In contrast, for short-window Brain-Computer Interface-style tasks, supervised models achieve comparable performance despite having substantially fewer parameters. Additional analyses demonstrate that current foundation models provide limited robustness to short-window tasks and channel-constrained settings. Together, these findings motivate the use of multi-dimensional evaluation protocols that characterize model behavior under realistic use constraints.


2605.28557 2026-05-28 cs.LO cs.AI (updated version)

Token Optimization Strategies for LLM-Based Oracle-to-PostgreSQL Migration

Oleg Grynets, Dmytro Babarytskyi, Vasyl Lyashkevych

Affiliations: EPAM Systems; Kharkiv, Ukraine; Lviv, Ukraine; McLean, Virginia, USA

AI summary: Formalizes and evaluates twelve token optimization strategies for Oracle-to-PostgreSQL migration, balancing cost, syntactic validity, semantic preservation, and structural fidelity.

Comments 11 pages, 3 figures, 5 tables, 38 references

Abstract

LLMs are increasingly used for software modernization, code translation, and database migration. However, LLM-based Oracle2PostgreSQL migration remains constrained by high token consumption, long-context degradation, dialect-specific semantic differences, and the risk of semantic drift during query transformation. Direct inclusion of large Oracle SQL/PL-SQL artefacts, schema definitions, procedural logic, and migration instructions into the model context increases cost and may reduce generation quality. This paper treats token optimization as a constrained transformation problem in LLM-based Oracle2PostgreSQL migration. The study formalizes and evaluates twelve token optimization strategies: baseline representation, context pruning, minification, DSL-based semantic compression, metadata augmentation, context refactoring, schema distillation, adaptive routing, AST-based minification, identifier masking, output constraint enforcement, and hybrid optimization. The strategies are evaluated on samples of 10 and 100 Oracle SQL queries using Valid Syntax Rate, Exact Match, Semantic Match, CodeBLEU, and Token Efficiency. The results show that mild context pruning preserves semantic quality almost at the baseline level, achieving 89.75% Semantic Match on the 100-query sample compared with 89.80% for the unoptimized baseline. Adaptive routing provides the best practical trade-off, reducing input tokens by 8.72% and output tokens by 5.49% while maintaining 88.40% Semantic Match and increasing Token Efficiency by 6.67%. Aggressive schema distillation increases Token Efficiency by 132.22% but results in a 44.50-percentage-point decrease in Semantic Match. The findings demonstrate that token optimization cannot be treated as simple prompt shortening; it must be evaluated as a multi-objective migration problem balancing cost, syntactic validity, semantic preservation, and structural fidelity.


2605.28553 2026-05-28 cs.AI cs.CR (updated version)

Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Matteo Gioele Collu, Riccardo Conte, Alberto Giaretta, Denis Kleyko, Mauro Conti, Matteo Zavatteri, Roberto Confalonieri

Affiliations: University of Padua; Örebro University; Fondazione Bruno Kessler

AI summary: Uses linear probes on the residual stream activations of transformer blocks to detect refusal behavior, and proposes Mechanistic AutoDAN, which uses probe-guided genetic search to cut search time substantially while keeping attack success rates competitive.

Abstract

In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate activations before output generation. To test whether this signal is actionable, we introduce Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. Across the evaluated models, our method achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%, and probe-guided prompts match or exceed AutoDAN's cross-model transfer in several configurations. We further find that the usefulness of probe guidance increases with model scale. Our results show that refusal is not only observable at the output level, but is encoded as a structured and actionable signal in intermediate LLM activations.


2605.28552 2026-05-28 cs.AI (updated version)

Modeling Vehicle-Type-Specific Pedestrian Crash Avoidance Behavior in Safety-Critical Interactions Using Smooth-Mamba Deep Reinforcement Learning

Qingwen Pu, Kun Xie, Hong Yang, Di Yang, Junqing Wang

Affiliations: Transportation Informatics Lab, Department of Civil and Environmental Engineering, Old Dominion University; Department of Electrical and Computer Engineering, Old Dominion University; Department of Transportation and Urban Infrastructure Studies, SMARTER Center, Morgan State University

AI summary: Extracts safety-critical interactions from the Argoverse 2 dataset and uses a Smooth-Mamba Deep Deterministic Policy Gradient framework (SMamba-DDPG) to model pedestrian crash avoidance behavior around automated vehicles (AVs) and human-driven vehicles (HDVs); pedestrians respond faster to AVs and cross at lower speeds, and AV scenarios show lower conflict rates.

Comments 37 page. 15 Figure, 9 table

Abstract

As automated vehicles (AVs) increasingly share roadways with human-driven vehicles (HDVs), understanding how pedestrians respond to different vehicle types in safety-critical interactions is essential for the safe deployment of automated driving technologies. This study extracts safety-critical pedestrian-vehicle interactions from the Argoverse 2 dataset to capture real-world crash avoidance behaviors in encounters involving AVs and HDVs. To model vehicle-type-specific pedestrian crash avoidance behavior, we develop a Smooth-Mamba Deep Deterministic Policy Gradient framework, termed SMamba-DDPG, which integrates smooth action constraints with efficient temporal representation learning. To quantify pedestrian behavioral differences, the framework trains separate crash avoidance policies for pedestrian interactions with AVs and HDVs. Results show that SMamba-DDPG outperforms baseline reinforcement learning and supervised learning models in reproducing pedestrian crash avoidance behaviors. Reconstructed trajectories demonstrate strong behavioral realism, accurately reproducing crash avoidance kinematics in both AV and HDV scenarios. Reaction time analysis shows that the model captures human-like response delays and reveals that pedestrians respond more quickly to AVs than to HDVs. Counterfactual analysis further indicates that pedestrians adopt lower crossing speeds when interacting with AVs. Large-scale safety analysis of model-generated data reveals that pedestrian-AV interactions consistently yield lower conflict rates and higher pedestrian yielding rates than pedestrian-HDV interactions. The findings highlight the importance of incorporating vehicle-type-specific pedestrian behavioral models for safer automated driving system design and more realistic traffic simulations in mixed-traffic environments.


2605.28543 2026-05-28 cs.AI cs.CL cs.LG (updated version)

GS-FUSE: Granger-Supervised Gated Fusion and Multi-Granularity Alignment for Event-Driven Financial Forecasting

Yang Zhang, En Chun, Ziyun Mao, Yulu Wu, Jun Wang

Affiliation: Southwestern University of Finance and Economics

AI summary: Proposes GS-Fuse, a framework that uses a Granger-causality-supervised gated fusion module and a multi-granularity alignment mechanism to draw selectively on event text and price signals, improving forecasts of how financial events affect markets.

DOI: 10.1145/3770855.3817927

Abstract

Accurately forecasting the impact of salient financial events on markets is critical for investors and policymakers. However, existing multimodal time-series models typically fuse text and prices symmetrically, without an explicit way to decide when event text is truly predictive, and thus struggle to exploit the directional event-to-price structure and the heterogeneous roles of textual and price signals. In this work, we propose GS-Fuse, a multimodal event-based forecasting framework that employs (i) a Granger-supervised, causal-aware gated fusion module, which learns to open toward event text only when it provides incremental predictive value beyond historical prices, and (ii) a multi-granularity alignment mechanism that jointly aligns high-level event representations and fine-grained textual cues with future market trajectories. Built as a flexible, plug-and-play adapter on top of off-the-shelf large language models and time-series foundation models, GS-Fuse can be instantiated across diverse backbones and market settings. Extensive experiments on real-world financial datasets show that GS-Fuse consistently outperforms state-of-the-art time-series and multimodal baselines across multiple assets and forecasting horizons.


2605.28517 2026-05-28 cs.LG cs.AI (updated version)

Stochastic Gradient Descent with Momentum is Algorithmically Stable

Yunwen Lei, Zimeng Wang, Xiaoming Yuan

Affiliations: Department of Mathematics, The University of Hong Kong; Department of Mathematics and Mathematical Statistics, Umeå University

AI summary: Through an algorithmic stability analysis, the paper proves generalization guarantees for stochastic gradient descent with momentum (SGDM) on smooth convex problems and establishes optimal excess population risk bounds.

Abstract

Stochastic gradient descent with momentum (SGDM) is one of the most widely used optimization algorithms in machine learning. While optimization properties of SGDM have been extensively studied in the literature, it remains insufficiently understood whether and when SGDM can generalize well to unseen data. In particular, it has been conjectured that while momentum accelerates training, it may degrade generalization. In this paper, we close this gap by developing a comprehensive generalization analysis of SGDM through the lens of algorithmic stability. More specifically, we introduce a generalized SGDM framework that encompasses both Polyak's and Nesterov's momentum schemes, and establish tight on-average model stability bounds for smooth and convex problems. Notably, the obtained bounds exploit small optimization error bounds along the trajectory, apply to any momentum parameter in the interval $[0, 1)$, and do not require the commonly assumed Lipschitzness of loss functions. We further derive optimization error bounds for the generalized SGDM, and combine them with our generalization analyses to obtain optimal excess population risk bounds for SGDM with both Polyak's and Nesterov's momentum.
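
The two momentum schemes named in the abstract can be written as one update loop that differs only in where the gradient is evaluated. The sketch below uses the textbook forms (Polyak's heavy ball evaluates the gradient at the current iterate, Nesterov's scheme at a look-ahead point); it is not the paper's generalized parametrization, which the abstract does not spell out. The toy objective, step size, and epoch count are assumptions chosen for illustration.

```python
import random


def sgdm(grad_fn, w0, data, lr=0.001, beta=0.9, nesterov=False, epochs=300, seed=0):
    """Stochastic gradient descent with momentum on a scalar parameter.

    Polyak:   v <- beta * v - lr * grad(w),            w <- w + v
    Nesterov: v <- beta * v - lr * grad(w + beta * v), w <- w + v
    `beta` may be any value in [0, 1); beta = 0 recovers plain SGD.
    """
    rng = random.Random(seed)
    w, v = w0, 0.0
    for _ in range(epochs):
        for z in rng.sample(data, len(data)):  # one shuffled pass over the data
            point = w + beta * v if nesterov else w
            v = beta * v - lr * grad_fn(point, z)
            w = w + v
    return w


def grad_fn(w, z):
    """Gradient of the smooth convex loss 0.5 * (w - z)**2 with respect to w."""
    return w - z


data = [1.0, 2.0, 3.0, 4.0]  # the empirical risk minimizer is the mean, 2.5
w_polyak = sgdm(grad_fn, 0.0, data, nesterov=False)
w_nesterov = sgdm(grad_fn, 0.0, data, nesterov=True)
```

With a constant step size the iterates hover near the minimizer rather than converging exactly, so the final values should be read as approximate.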


2605.28515 2026-05-28 cs.SE cs.AI (updated version)

Do LLMs Favor Their Providers? Measuring Vertical Integration Bias in Code Generation

Melih Catal, Alex Wolf, Tiago Ferreiro Matos, Pooja Rani, Harald Gall

Affiliations: University of Zurich; University of Mannheim

AI summary: Introduces the VIBench benchmark, which measures vertical integration bias in direct and agentic code generation by frontier LLMs across 20 provider-selectable software-integration scenarios; six of ten affiliated models show significant bias, and agentic workflows amplify it to +39.2 percentage points.

Abstract

Large Language Models (LLMs) have become an integral part of software development, especially with the advent of agentic capabilities. Yet, many frontier LLMs are affiliated with specific providers. This raises the question of whether generated code favors the provider's own ecosystem over comparable alternatives, potentially constraining developers' choices and increasing dependence on a single provider. We define this behavior as Vertical Integration Bias (VIB) and introduce VIBench, a benchmark for measuring VIB in direct and agentic code generation across 20 provider-selectable software-integration scenarios. Evaluating 10 frontier provider-affiliated models against 3 non-affiliated controls, we find positive VIB in direct generation, with six of ten affiliated models showing statistically significant effects up to +18.8 percentage points (pp). Agentic workflows further amplify VIB, reaching +39.2 pp. Moreover, early affiliated-ecosystem choices in agentic workflows can persist into conceptually decoupled downstream files, with persistence as high as 90.3%. These findings underscore the need to measure and account for VIB in code generation, especially as agentic capabilities become more prevalent.


2605.28513 2026-05-28 cs.LG cs.AI (updated version)

Learning Theory of the SVRG: Generalization and Convergence Analysis

Yunwen Lei, Zimeng Wang, Xiaoming Yuan

Affiliations: Department of Mathematics, The University of Hong Kong; Department of Mathematics and Mathematical Statistics, Umeå University

AI summary: Through an algorithmic stability analysis, the paper establishes the first non-vacuous generalization bounds for SVRG in the convex and strongly convex settings, clarifies the interplay between optimization and generalization, and obtains optimal excess population risk bounds.

Abstract

Variance reduction (VR) methods employ stochastic gradients with decreasing variance, and they have been widely applied to solve large-scale optimization problems in machine learning because of their efficiency. Existing theoretical studies of VR methods are mainly focused on the convergence analysis, leaving the generalization behavior largely unexplored. In this paper, we bridge this gap by developing the first non-vacuous generalization analysis of the representative VR method: Stochastic Variance Reduced Gradient (SVRG), through the lens of algorithmic stability. In particular, we establish sharp stability bounds of the SVRG in both convex and strongly convex settings by exploiting its algorithmic structure. The obtained bounds are data-dependent, because the training errors are incorporated along the trajectory. Our analysis clarifies the interplay between optimization and generalization, leading to optimal excess population risk bounds in both settings. Our approach differs substantially from existing analyses of stochastic algorithms in the sense that we decompose the SVRG update as an SGD-like step plus a zero-mean correction term and then introduce novel Lyapunov functions to absorb the additional gradient terms induced by the reference points. Our analytical framework can be generalized to other VR methods, and we demonstrate the generalization by the well-known Stochastic Average Gradient Accelerated (SAGA) method.
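
The decomposition of the SVRG update into an SGD-like step plus a zero-mean correction can be made concrete with a small sketch. It uses the standard SVRG update, w ← w − η(∇f_i(w) − ∇f_i(w̃) + ∇F(w̃)), where w̃ is the reference point and ∇F is the full gradient; the correction ∇F(w̃) − ∇f_i(w̃) averages to zero over the sampled index i. The scalar least-squares problem, step size, and loop counts are assumptions chosen for illustration, and the helper names are hypothetical.

```python
import random


def grad_i(w, sample):
    """Gradient of the per-sample loss 0.5 * (a * w - b)**2 with respect to w."""
    a, b = sample
    return a * (a * w - b)


def full_grad(w, data):
    """Gradient of the empirical risk (average of per-sample gradients)."""
    return sum(grad_i(w, s) for s in data) / len(data)


def mean_correction(ref, data):
    """Average of the correction term over all samples; zero up to rounding."""
    mu = full_grad(ref, data)
    return sum(mu - grad_i(ref, s) for s in data) / len(data)


def svrg(data, w0=0.0, lr=0.02, outer_loops=30, inner_steps=100, seed=0):
    rng = random.Random(seed)
    w = w0
    for _ in range(outer_loops):
        ref = w                      # reference point for this outer loop
        mu = full_grad(ref, data)    # full gradient at the reference point
        for _ in range(inner_steps):
            s = rng.choice(data)
            sgd_part = grad_i(w, s)           # the SGD-like step
            correction = mu - grad_i(ref, s)  # zero-mean correction term
            w -= lr * (sgd_part + correction)
    return w


# Least-squares data; the minimizer is sum(a * b) / sum(a * a) = 30 / 15 = 2.
data = [(1.0, 2.0), (2.0, 3.0), (1.0, 1.0), (3.0, 7.0)]
w_hat = svrg(data)
```

Near the reference point the SGD part and the correction nearly cancel the sampling noise, which is why the iterates can settle at the minimizer with a constant step size.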


2605.28500 2026-05-28 cs.CL cs.AI cs.LG (updated version)

Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra

Affiliation: CVS Health

AI summary: To address functionally incorrect LLM-generated code, the paper proposes uncertainty quantification methods based on functional equivalence (including functional entropy), which outperform existing methods across several programming languages and models.

Abstract

Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
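
The clustering idea behind functional entropy can be sketched as follows: group sampled code responses into functional-equivalence classes, then take the entropy of the class frequencies. The paper uses an LLM-based functional equivalence assessment; this sketch swaps in a much simpler stand-in, treating two candidates as equivalent when they agree on a handful of probe inputs. Candidates are represented as Python callables rather than generated source text, and all names are hypothetical.

```python
import math
from collections import Counter


def behavior_signature(func, probe_inputs):
    """Record a candidate's observable behavior on each probe input."""
    outputs = []
    for x in probe_inputs:
        try:
            outputs.append(repr(func(x)))
        except Exception as exc:  # a crash is also a behavior worth clustering on
            outputs.append(type(exc).__name__)
    return tuple(outputs)


def functional_entropy(candidates, probe_inputs):
    """Entropy (in nats) over clusters of behaviorally equivalent candidates."""
    clusters = Counter(behavior_signature(f, probe_inputs) for f in candidates)
    n = len(candidates)
    return -sum((c / n) * math.log(c / n) for c in clusters.values())


# Four sampled "implementations" of absolute value; the last one is wrong.
candidates = [
    lambda x: abs(x),
    lambda x: x if x >= 0 else -x,
    lambda x: max(x, -x),
    lambda x: x,
]
probes = [-2, 0, 3]
entropy_mixed = functional_entropy(candidates, probes)
entropy_agree = functional_entropy(candidates[:3], probes)
```

When every sample lands in one cluster the entropy is zero, signalling low uncertainty; disagreement between samples raises it. Unlike a text-similarity check, the grouping here depends only on behavior, so differently written but equivalent programs share a cluster.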


2605.28498 2026-05-28 cs.HC cs.AI (updated version)

The Decision to Verify: How Warmth and User Characteristics Shape Reliance on Conversational Agents for Information Search

Mert Yazan, Frederik Bungaran Ishak Situmeang, Suzan Verberne

Affiliations: Amsterdam University of Applied Sciences; Leiden University

AI summary: A mixed-subjects experiment finds that users still over-rely on conversational AI even when fact-checking tools are available; verification is driven mainly by user characteristics (such as prior trust), while a warm conversational style affects reliance indirectly by increasing agreement with incorrect answers.

Comments Under review for Computers in Human Behavior

Abstract

Conversational artificial intelligence (AI) provides an efficient and convenient gateway to information access. However, it can cause overreliance when users blindly trust AI and accept its answers without fact-checking. Information search increasingly follows a hybrid interaction paradigm that combines conversational AI with web search, making fact-checking easier. In this paper, we examine whether this interaction paradigm is effective in curbing reliance. We further investigate the underlying factors (e.g., digital literacy and conversation warmth) that drive users to verify AI answers. We conduct a mixed-subjects question-answering experiment where participants interact with either a warm or a neutral chatbot. Our findings reveal that reliance persists despite users having access to both conversational and web search. The decision to verify is driven primarily by existing user perceptions (e.g., prior trust in chatbots) rather than answer properties, with some users fact-checking regardless of the context and others trusting chatbots by default. Warm conversational style has an indirect yet critical influence on reliance by increasing agreement with the chatbot when it is incorrect. Consulting additional AI sources predicts higher accuracy, while traditional web search does not. Our study extends overreliance research by: (a) demonstrating its persistence despite access to fact-checking, (b) identifying verification behavior as user-dependent, and (c) revealing conversational warmth's indirect effect on overreliance with implications for designing trustworthy conversational search systems.


2605.28490 2026-05-28 cs.CV cs.AI version update

SSR3D-LLM: Structured Spatial Reasoning via Latent Steps for Fine-Grained Grounding in Unified 3D-LLMs

Jiawei Li, Ziyi Liu, Weijie Shi, Long Chen, Jiajie Xu, Xiaofang Zhou

Affiliations: The Hong Kong University of Science and Technology; Soochow University

AI summary: To address the brittleness of unified 3D-LLMs on fine-grained queries, the paper proposes SSR3D-LLM, which refines candidate rankings step by step using latent spatial reasoning steps and a geometry-aware scorer, and reports the strongest results among unified 3D-LLM baselines on several benchmarks.

Abstract

3D object grounding localizes referred objects in a 3D scene from natural language. Unified instance-centric 3D-LLMs aim to solve grounding together with dialog, QA, and captioning, yet many rely on a single pointer-style grounding decision that compresses a relational instruction into one selection. This is brittle for fine-grained queries where multiple same-class candidates must be ruled out by context objects and spatial relations. We propose Structured Spatial Reasoning 3D-LLM (SSR3D-LLM), a structured grounding interface for unified 3D-LLMs. Given fixed Mask3D object proposals, the LLM writes a sequence of latent spatial reasoning steps and memory tokens from the query, and a geometry-aware scorer reads these latent steps in order to refine candidate rankings step by step with step-length masking. The latent steps are learned from standard benchmark target supervision with auxiliary referential-cue supervision during training, while inference uses only the input query and Mask3D proposals. Across ReferIt3D, ScanRefer, and Multi3DRef, SSR3D-LLM achieves the strongest results among unified 3D-LLM baselines, with substantial gains over the single-pointer QPG baseline on fine-grained grounding and consistent improvements over prior unified 3D-LLMs, while preserving the default language-task route.


2605.28487 2026-05-28 cs.AI cs.LG version update

ProvMind: Provenance-grounded reasoning for materials synthesis

Yiming Zhang, Ryo Tamura, Koji Tsuda

Affiliations: Center for Basic Research on Materials, National Institute for Materials Science; RIKEN Center for Advanced Intelligence Project

AI summary: Proposes the MatProcBench benchmark and the ProvMind framework, which uses reasoning over provenance graphs to optimize routes, conditions, and causal dependencies in materials synthesis, reaching 52.84% accuracy on a dual-OOD split.

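2605.28483 2026-05-28 cs.AI cs.IR version update

GONDOR to the Rescue: Satisficing Planning under Low Memory

Yonatan Vernik, Alexander Tuisov, Alexander Shleyfman

Affiliations: Computer Science Department, Bar-Ilan University; Independent Researcher

AI summary: Proposes the GONDOR algorithm, which extends GBFS to strict memory limits by periodically compressing the search tree while retaining sparse anchor states, enabling satisficing planning under low memory budgets.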

Abstract

Greedy Best-First Search (GBFS) is the dominant approach for solving search problems where the goal can be estimated with a heuristic, such as planning, route finding, navigation, and pathfinding. This is especially true when memory is tightly constrained, such as when planning on edge devices. To cope with such memory limits, we present GONDOR (Greedy Online Navigation with Dynamic Outpost-based Re-search), a memory-efficient extension of GBFS that allows search to continue under strict memory limits by periodically compressing the search tree while retaining a sparse set of anchor states and that, upon reaching the goal, reconstructs the path by re-searching between these anchor states. We analyze the algorithm and discuss several variants defined by different outpost selection policies. In addition, we explore using Bloom filters for compact duplicate detection in the closed list. Experiments across numeric planning domains and heuristic configurations show that GONDOR consistently improves coverage under low memory budgets compared to standard GBFS. We release the implementation of GONDOR and the Bloom-filter variant to facilitate further research on memory-efficient heuristic search.
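
The abstract mentions Bloom filters for compact duplicate detection in the closed list. As a minimal sketch of that one idea (it does not implement GONDOR's tree compression or outpost-based re-search), the Python below pairs a small Bloom filter with plain GBFS on an obstacle-free grid. The names `BloomFilter` and `gbfs_grid`, the grid domain, the Manhattan heuristic, and the filter size are all assumptions made for illustration and are not taken from the released implementation.

```python
import hashlib
import heapq


class BloomFilter:
    """Fixed-size bit array; may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def gbfs_grid(start, goal, width, height):
    """Plain GBFS on an obstacle-free grid, ordered by Manhattan distance to the goal."""

    def h(state):
        return abs(state[0] - goal[0]) + abs(state[1] - goal[1])

    closed = BloomFilter()  # approximate closed list (hypothetical sizing)
    closed.add(start)
    parent = {start: None}  # stored in full here; GONDOR would compress this tree
    frontier = [(h(start), start)]
    while frontier:
        _, state = heapq.heappop(frontier)
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        x, y = state
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_bounds = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if in_bounds and nxt not in closed:
                closed.add(nxt)
                parent[nxt] = state
                heapq.heappush(frontier, (h(nxt), nxt))
    return None  # frontier exhausted, possibly because a false positive pruned a state
```

Because a Bloom filter can return false positives, a state may occasionally be treated as already visited and pruned, which is the price paid for the compact representation; a state that has been added is never reported as unseen.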


2605.28450 2026-05-28 cs.CV cs.AI version update

BiasEdit: A Training-Free Bias-Detect-and-Edit Framework for Learning Fair Visual Classifiers

Jungwook Seo, Yoonsik Park, Changmin Lee, Sungyong Baik

Affiliations: Hanyang University, Department of Artificial Intelligence, BAIK Lab, Seoul, South Korea; Hanyang University, Department of Data Science, BAIK Lab, Seoul, South Korea; Hanyang University, Department of Data Science and Department of Artificial Intelligence, BAIK Lab, Seoul, South Korea; Hanyang University

AI summary: Proposes the BiasEdit framework, which automatically detects bias attributes through statistical dependence and mutual information analysis, then uses text-guided image editing to generate bias-conflict samples, enabling fair classification without manual annotation.

Comments Accepted to The Web Conference 2026 (formerly WWW) as an Oral presentation

Abstract

Visual data from the Web power image classifiers, which often underpin many web services, such as recommendation and content moderation. However, raw Web data often contain spurious correlations and social biases, and neural networks are known for their tendency to learn biases present in data. This can reinforce unfairness in web services and web data, leading to a vicious cycle. In the context of image classification, networks learn bias attributes for a specific class when a majority of images contain the same attribute only for a given class. Hence, training a fair and debiased classifier from a biased dataset demands handling the imbalance between a majority of images with bias attributes (bias-aligned samples) and a minority without (bias-conflict samples). In this work, we introduce BiasEdit, a modular framework that automatically detects bias attributes from the original dataset and edits them to construct a debiased dataset. Specifically, BiasEdit first detects unknown bias attributes via statistical dependence and mutual information analysis of visual-linguistic representations, and then explicitly edits those attributes using text-guided image editing to generate realistic bias-conflict samples. Unlike prior works that assume known bias attributes or rely on synthetic mixing, our method operates without manual annotations and can leverage off-the-shelf vision-language and editing models. BiasEdit addresses a fundamental challenge in Web-sourced visual AI, mitigating dataset-induced bias and achieving state-of-the-art debiasing performance even when training data are fully biased.


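2605.28441 2026-05-28 cs.CV cs.AI version update

Bayesian Gated Non-Negative Contrastive Learning

Peng Cui, Jiahao Zhang, Lijie Hu

Affiliations: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)

AI summary: To address representation entanglement in contrastive learning, proposes Bayesian Gated Non-Negative Contrastive Learning, which uses a probabilistic gating mechanism to dynamically filter out irrelevant features, improving semantic consistency on ImageNet-100 by 142.1%.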

Comments Accepted by ICML 2026

Abstract

While Contrastive Learning (CL) has revolutionized self-supervised representation learning, its latent representations remain highly entangled and opaque, limiting their interpretability in safety-critical applications. We identify that a fundamental cause of this entanglement is the reliance on deterministic similarity measures, which treat all feature dimensions equally. In compositional scenes, this creates an Optimization Conflict: common background features, such as "blue sky", are encouraged to align in positive pairs but simultaneously repelled in negative pairs, causing gradient oscillations that hinder precise semantic disentanglement. To address this, we propose BayesNCL (Bayesian Gated Non-Negative Contrastive Learning). Unlike standard approaches, BayesNCL introduces a probabilistic gating mechanism that dynamically filters out task-irrelevant, high-frequency common features while selectively retaining discriminative semantics. By formalizing feature selection as a variational inference problem with a sparse Bernoulli prior, our method effectively resolves the optimization conflict. Experimental results on ImageNet-100 demonstrate that BayesNCL achieves a remarkable 142.1% improvement in semantic consistency compared to state-of-the-art baselines, yielding highly interpretable representations without compromising downstream task performance. Code is available at https://github.com/Cui-Peng-624/BayesNCL.
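
To make the gating idea more concrete, the following Python sketch shows one way per-dimension gates and a sparse Bernoulli prior could interact. It is only an illustration under stated assumptions: the function names (`bernoulli_kl`, `gated_similarity`, `gate_penalty`), the prior value 0.1, the toy feature vectors, and the use of expected gate values instead of sampled gates are all hypothetical and are not taken from the BayesNCL code.

```python
import math


def bernoulli_kl(p, prior):
    """KL(Bernoulli(p) || Bernoulli(prior)), clamped away from 0 and 1 for safety."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    prior = min(max(prior, eps), 1.0 - eps)
    return p * math.log(p / prior) + (1.0 - p) * math.log((1.0 - p) / (1.0 - prior))


def gated_similarity(a, b, gate_probs):
    """Dot product in which dimension i is weighted by its expected gate value."""
    return sum(g * x * y for g, x, y in zip(gate_probs, a, b))


def gate_penalty(gate_probs, prior=0.1):
    """Sum of per-dimension KL terms that pull the gates toward a sparse prior."""
    return sum(bernoulli_kl(g, prior) for g in gate_probs)


# Toy non-negative features; dimension 0 plays the role of a shared "blue sky" feature.
anchor = [1.0, 0.8, 0.0]
positive = [1.0, 0.7, 0.1]
negative = [1.0, 0.0, 0.9]
gates = [0.0, 1.0, 1.0]  # gate closed on the shared dimension (hypothetical values)

pos_sim = gated_similarity(anchor, positive, gates)
neg_sim = gated_similarity(anchor, negative, gates)
```

With the gate closed on dimension 0, the shared background feature no longer contributes to either the positive or the negative similarity, so only the remaining dimensions distinguish the two pairs. In a variational objective, a KL term such as `gate_penalty` would be added to the contrastive loss to keep most gates near the sparse prior.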


2605.28428 2026-05-28 cs.CV cs.AI version update

Anomaly as Non-Conformity via Training-Free Graph Laplacian Energy Minimization

Jungwook Seo, Minjeong Kim, Younkwan Lee, Seungho Shin, Sungyong Baik

Affiliations: Dept. of Artificial Intelligence, Hanyang University; Dept. of Data Science, Hanyang University; Global Technology Research, Samsung Electronics

AI summary: Proposes ANoCo, a training-free graph Laplacian energy optimization method that measures anomaly by the magnitude of the update needed to align a query patch with the normal manifold; it requires no learnable parameters or sampling and achieves strong image-level AUROC and stable localization maps on standard benchmarks.

Comments Accepted to CVPR 2026

Abstract

Detecting subtle visual anomalies in images remains challenging, particularly when only normal samples are available a priori. Such unsupervised anomaly detection is typically solved by measuring feature similarity of a query patch to a memory of normal patches. However, similarity alone does not reveal how strongly a query patch violates the structure of the normal feature manifold. We propose a training-free Laplacian graph energy optimization formulation, named ANoCo, that scores Anomaly by the cost of Non-Conformity of a query patch to align with a fixed normal manifold. For each query patch, we construct a bipartite query-to-normal graph weighted by cosine affinity, explicitly removing query-query and normal-normal edges to prevent evidence dilution. We formulate anomaly scoring as a convex Laplacian energy with anchored normal nodes, and solve it in closed form. In particular, we do not use the optimized features themselves: the anomaly score is the magnitude of the update required to satisfy normality constraints, reframing the graph Laplacian as a non-conformity operator rather than a smoothing prior. The proposed method introduces no learnable parameters, message passing, or sampling, and has complexity comparable to a single linear solve. Across standard benchmarks, it delivers strong image-level AUROC, stable localization maps, and improved robustness over prior methods, demonstrating the effectiveness of using optimization-induced feature drift as an anomaly measure.
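
One way to read "a convex Laplacian energy with anchored normal nodes" solved in closed form is the quadratic energy E(x) = lam * ||x - q||^2 + sum_j w_j * ||x - n_j||^2, where q is the query feature, the n_j are fixed normal features, and the w_j are affinity weights. Setting the gradient to zero gives x* = (lam * q + sum_j w_j * n_j) / (lam + sum_j w_j), and the score is ||x* - q||. The sketch below assumes this specific energy, an exponential kernel on cosine affinity, and the parameter names `lam` and `tau`; none of these details come from the abstract, so it illustrates the update-magnitude idea and should not be taken as the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nonconformity_score(query, normals, lam=1.0, tau=0.1):
    """Update magnitude ||x* - q|| for E(x) = lam*||x-q||^2 + sum_j w_j*||x-n_j||^2."""
    # Edge weights on the bipartite query-to-normal graph (assumed exponential kernel).
    weights = [math.exp(cosine(query, n) / tau) for n in normals]
    total = lam + sum(weights)
    # Closed-form minimizer: x* = (lam * q + sum_j w_j * n_j) / (lam + sum_j w_j).
    x_star = [
        (lam * q_d + sum(w * n[d] for w, n in zip(weights, normals))) / total
        for d, q_d in enumerate(query)
    ]
    return math.sqrt(sum((x - q) ** 2 for x, q in zip(x_star, query)))


# Toy memory of normal patch features (hypothetical values).
normals = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
score_normal = nonconformity_score([1.0, 0.05], normals)
score_anomalous = nonconformity_score([0.0, 1.0], normals)
```

A query that already sits near the normal features barely moves, while a query far from them must be pulled a long way, so `score_anomalous` exceeds `score_normal`. The normal nodes stay fixed throughout, which is what "anchored" is taken to mean here.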


2605.28422 2026-05-28 cs.CV cs.AI version update

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

Qiaoru Li, Shaotian Liang, Jintao Chen, Haoran Sun, Yuxiang Cai, Jianwei Yin, Yankai Jiang

Affiliations: Zhejiang University; Shanghai AI Laboratory; Tencent; Ningbo Global Innovation Center, Zhejiang University; Zhejiang Key Laboratory of Digital-Intelligence Service Technology

AI summary: Proposes the VITAL framework, which achieves interpretable latent reasoning in medical MLLMs through visual-semantic dual supervision (a text decoder reconstructs reasoning chains and a visual projector regresses ROI features), reaching state-of-the-art results on 7 benchmarks.

Abstract

Latent reasoning enables reasoning over continuous hidden states rather than explicit tokens, avoiding the language bottleneck and inference overhead of chain-of-thought for medical VQA. However, existing methods suffer from modality collapse, insufficient visual supervision, and train-inference mismatch. Moreover, their opaque latent states offer no interpretability, which is critical in clinical applications. We propose VITAL, a latent-space reasoning framework for medical MLLMs with visual-semantic dual supervision: an auxiliary text decoder reconstructs reasoning chains from latent states, while a visual projector regresses ROI features from a frozen, independent medical vision encoder. Both modules are discarded at inference with zero overhead, yet can be re-attached post-hoc for dual interpretability, providing textual and visual explanations of the reasoning process without sacrificing efficiency. We construct a 61K dataset spanning 9 imaging modalities, exceeding prior medical visual latent reasoning datasets by an order of magnitude. Experiments on 7 benchmarks show that VITAL consistently and substantially outperforms the backbone, all latent reasoning baselines, and medical MLLMs trained on far larger data, achieving state-of-the-art results competitive with trillion-parameter proprietary models.


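2605.28421 2026-05-28 cs.AI version update

DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Caijun Xu, Changyi Xiao, Zhongyuan Peng, Yixin Cao

Affiliations: Fudan University; Shanghai Innovation Institute

AI summary: Proposes the DenoiseRL framework, which uses reinforcement learning to learn from the incorrect reasoning of weak models, without external supervision or a stronger teacher model, improving reasoning performance and training efficiency.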

Comments 17 pages, 6 figures

Abstract

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, limiting scalable capability improvement. In this paper, we introduce DenoiseRL, a reinforcement learning framework that substitutes external supervision with recovery-oriented optimization over failures from weak models. Instead of relying on stronger supervision or carefully engineered data, DenoiseRL learns directly from incorrect reasoning traces by converting them into opportunities for improvement, making training more scalable and less dependent on external resources. This yields a richer and more diverse learning signal, improving exploration efficiency from imperfect model behavior. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models. Empirically, DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks and promotes stronger self-corrective behavior as training difficulty increases, highlighting an effective and scalable alternative pathway for improving reasoning in large language models.


2605.28409 2026-05-28 cs.AI version update

Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

Mingze Wu, Abhinav Anand, Shweta Verma, Mira Mezini

Affiliations: Hessian Center for Artificial Intelligence; National Research Center for Applied Cybersecurity ATHENE

AI summary: Explores offline reinforcement learning on existing code datasets for post-training code-generation LLMs; experiments show that the approach effectively improves model performance, especially for small models and complex coding problems.

2605.28405 2026-05-28 cs.AI version update

Measuring Progress Toward AGI: A Cognitive Framework

Ryan Burnell, Yumeya Yamamori, Orhan Firat, Kate Olszewska, Steph Hughes-Fitt, Oran Kelly, Isaac R. Galatzer-Levy, Meredith Ringel Morris, Allan Dafoe, Alison M. Snyder, Noah D. Goodman, Matthew Botvinick, Shane Legg

Affiliations: Google DeepMind

AI summary: Proposes a framework grounded in a cognitive taxonomy that evaluates system performance across 10 key cognitive abilities in order to quantify progress toward AGI.

Comments 32 pages, 2 figures

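2605.28398 2026-05-28 cs.AI version update

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang, Hao Liu

Affiliations: AI Thrust, The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes HRBench, a unified evaluation framework that systematically studies the efficiency-effectiveness trade-offs of three families of switching strategies (prompt-based, external routing, and speculative execution) under four training regimes in hybrid-reasoning LLMs, showing how the preferred strategy varies with model scale and task domain.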

Comments Under review

Abstract

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selection are typically evaluated under different models, datasets, and implementation assumptions, making it difficult to compare their practical behavior. We introduce HRBench, a unified evaluation framework for studying thinking-mode switching in hybrid-reasoning LLMs. HRBench organizes the design space along two axes: three switching strategy families (prompt-based selection, external routing, and speculative execution) and four training regimes (training-free, SFT, offline RL, and online RL), yielding 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, from Qwen3.5-2B to Kimi-K2.5-1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, while reimplementing 12+ representative prior methods within the same pipeline. Our analysis characterizes how different switching strategies occupy distinct effectiveness-efficiency trade-off regions: prompt-based methods often provide favorable token-accuracy trade-offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at higher token cost. We further find that training affects strategies differently, and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid-reasoning LLMs. Our data, code and repository are available at https://github.com/usail-hkust/HRBench.
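
The 12 evaluation settings are simply the cross product of the two axes named in the abstract. A short Python sketch makes the grid explicit; the variable names are hypothetical and the strings are labels only, with no connection to identifiers in the HRBench repository.

```python
from itertools import product

# The two design axes as named in the abstract.
STRATEGIES = ("prompt-based selection", "external routing", "speculative execution")
REGIMES = ("training-free", "SFT", "offline RL", "online RL")

# Every (strategy, regime) pair is one controlled evaluation setting: 3 * 4 = 12.
settings = list(product(STRATEGIES, REGIMES))

for strategy, regime in settings:
    print(f"{strategy} / {regime}")
```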


2605.28396 2026-05-28 cs.LG cs.AI version update

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

Kun Liang, Chenming Tang, Clive Bai, Weijie Liu, Saiyong Yang, Yunfang Wu

Affiliations: School of Computer Science, Peking University; National Key Laboratory for Multimedia Information Processing, Peking University; LLM Department, Tencent

AI summary: Proposes the ADWIN framework, which uses adaptive windows to dynamically adjust rollout length in on-policy distillation, reducing training cost by up to 4.1 times while maintaining or improving accuracy.

Abstract

On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-allocate supervision to late positions with low marginal value for the current student. We revisit this assumption through the useful supervision horizon: student-induced rollouts can drift from teacher-preferred continuations, while aligned prefixes may already preserve the long-horizon OPD update direction. We propose ADWIN, an adaptive-window framework for OPD that treats rollout length as an online admissibility decision, training on short teacher-anchored prefixes while using delayed full-rollout probes to audit prefix-full alignment and adapt the next horizon with staleness control. Across math and code reasoning benchmarks in single-task, multi-task, and strong-to-weak settings, ADWIN improves the accuracy-compute trade-off over full-rollout OPD and prefix-based baselines, reducing end-to-end training cost by up to 4.1 times while achieving comparable or better accuracy.


2605.28390 2026-05-28 cs.AI version update

You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

Xujun Li, Kehan Zheng, Mingyuan Zhao, Yize Geng, Jinfeng Zhou, Qi Zhu, Fei Mi, Lifeng Shang, Minlie Huang, Hongning Wang

Affiliations: Department of Computer Science and Technology, Tsinghua University; Huawei Foundation Model Department

AI summary: Proposes HiSME, a lightweight hierarchical skill meta-evolving method that learns meta-skills from agents' task execution traces and jointly optimizes skills and the skill evolving strategy, with the aim of continually improving deployed agentic systems across downstream scenarios.

Abstract

Test-time skill evolving is regarded as a new paradigm for enhancing deployed agentic systems. Existing works mainly focus on hard-coded skill evolving strategies or on parametric learning that relies on expensive parameter updates in the underlying LLMs. In this paper, we demonstrate that test-time refinement of the skill evolving framework itself is necessary for continuous improvement of agent systems in different downstream scenarios, and that lightweight algorithmic adaptation is feasible. Specifically, we propose HiSME, a lightweight hierarchical skill meta-evolving solution that jointly optimizes skills and the skill evolving strategy by learning meta-skills from agents' task execution traces. Experiments on diverse agentic benchmarks show that meta-evolving can produce a higher-quality skill library than pure skill evolving and can derive diverse meta-skills for different scenarios, thereby facilitating future continual experience learning. Our code is temporarily public at https://anonymous.4open.science/r/HiSME-BD45.


2605.28388 2026-05-28 cs.AI version update

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Yue Cheng, Jiajun Zhang, Xiaohui Gao, Weiwei Xing, Zheng Wang, Zhanxing Zhu

Affiliations: Beijing Jiaotong University; AntGroup; Northwestern Polytechnical University; University of Leeds; University of Southampton

AI summary: Using difficulty-wise and one-sample analysis, finds that sample difficulty has a non-monotonic effect on RLVR, with easy and medium-difficulty problems yielding the strongest and most stable reasoning improvements, and proposes difficulty-adaptive strategies based on these findings.

Comments 30 pages, 11 figures

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. However, the mechanistic role of sample difficulty in RLVR remains poorly understood. In this paper, we investigate RLVR through the lens of difficulty-wise and one-sample analysis. We find that sample difficulty has a non-monotonic effect on RLVR: easy and medium-difficulty problems yield the strongest and most stable reasoning improvements, whereas overly hard problems often provide weak learning signals, induce degenerate behaviors such as answer repetition or skipping necessary computation, and can ultimately degrade the model's pre-existing capabilities. Beyond the response level, we further analyze the model's internal feature dynamics using Temporal Sparse Autoencoders (T-SAE). Easy problems mainly reinforce direct-answer and basic-computation features while suppressing deliberative-reasoning features; hard problems activate reasoning-related features but become useful only when successful trajectories are sampled; medium-difficulty problems provide a more balanced signal, strengthening both computation and multi-step reasoning features. Motivated by these findings, we propose difficulty-adaptive strategies for hard-sample utilization, using backward-reasoning reformulation and T-SAE-guided training signals to improve reward density and credit assignment during RLVR. Overall, our results identify sample difficulty as a key factor governing both the optimization dynamics and representation evolution of RLVR.
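
One plausible reason overly hard problems provide weak learning signals can be seen in group-normalized advantages of the kind used in GRPO-style training. The abstract does not say which RL algorithm the authors use, so this choice, the function name `group_advantages`, and the toy reward lists are assumptions made for illustration. The sketch covers only the hard-problem case and does not attempt to reproduce the paper's other findings.

```python
import math


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps) over one prompt's rollouts."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Binary verifiable rewards for four rollouts of the same prompt (toy values).
too_hard = group_advantages([0.0, 0.0, 0.0, 0.0])  # no rollout succeeds
solvable = group_advantages([1.0, 0.0, 0.0, 0.0])  # one rollout succeeds
```

When every rollout fails, all rewards equal the group mean and every advantage is zero, so the prompt contributes no gradient. This matches the abstract's remark that hard problems become useful only when successful trajectories are sampled.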


2605.28387 2026-05-28 cs.LG cs.AI cs.NE version update

CLANE: Continual Learning of Actions on Neuromorphic Hardware from Event Cameras

Elvin Hajizada, Michael Neumeier, Edward Paxon Frady, Yulia Sandamirskaya, Axel von Arnim, Bing Li, Eyke Hüllermeier

Affiliations: Institute of Informatics, University of Munich (LMU); fortiss GmbH, Neuromorphic Computing; Technical University of Munich, TUM School of CIT; Intel Labs, Intel Corporation; Institute of Computational Life Sciences (ICLS), Zurich University of Applied Sciences (ZHAW); Technische Universität Ilmenau, Resource-Efficient Artificial Intelligence Group; Munich Center for Machine Learning (MCML); German Research Centre for Artificial Intelligence (DFKI)

AI summary: Proposes CLANE, a system that performs end-to-end continual learning on the Intel Loihi 2 neuromorphic chip for event-camera action recognition, using a spiking CNN and novel Loihi 2 modules to achieve high energy efficiency and low latency.

Abstract

Recognizing and continuously learning novel human actions without forgetting prior classes is a requirement for emerging AR/VR and robotics applications. For these applications, both on-device processing and learning are essential for privacy and low-latency adaptation. Event cameras address the efficiency of visual sensing with sparse, asynchronous output that is naturally compatible with neuromorphic processing. Yet no prior system has deployed a continual on-device learning pipeline for event-based action recognition using neuromorphic hardware. We present CLANE, Continual Learning of Actions on Neuromorphic Hardware from Event Cameras, deployed end-to-end on Intel Loihi 2. CLANE combines a spiking 2D CNN for spatiotemporal feature extraction with CLP-SNN as its on-chip learning head, extended to action clips via a Temporal Aggregation Layer and a fixed-point Normalization Layer, both novel Loihi 2 modules. On THU E-ACT-50, a 50-class dataset captured under real-world conditions, CLANE achieves 70.4% accuracy in a continual learning task while delivering more than 100x energy reduction and 16x lower latency over a sequential CNN+GRU+CLP edge GPU baseline, validated through iso-algorithm cross-platform benchmarking across three evaluation levels.


2605.28371 2026-05-28 cs.AI cs.LG cs.SE version update

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt, Lev Telyatnikov, Olga Fink

Affiliations: EPFL

AI summary: Proposes an agentic approach built on a shared framework that uses a slot-binding interface to turn papers into executable, comparable benchmark implementations, addressing the difficulty of reproducing methods in industrial prognostics and health management.

Abstract

Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning: translating published papers into executable, benchmark-ready implementations. Reproducing under-specified methods in PHM is particularly difficult due to restricted access to industrial datasets, incomplete reporting of preprocessing and evaluation protocols, and implicit design choices (e.g., windowing, target construction, data splits) that critically affect performance. Existing paper-to-code systems generate implementations for individual papers, but these artifacts are often not directly comparable due to inconsistencies in assumptions and evaluation settings. We introduce agentic, framework-based PHM paper reproduction, where an agent translates a paper into a shared PHM benchmark framework via a slot-binding interface. This interface maps equations and protocol descriptions into structured components (task definitions, dataset adapters, windowing, targets, models, and evaluators), while explicitly recording unresolved assumptions. The resulting implementations are validated against standardized task contracts and evaluation hooks, enabling consistent and comparable benchmarking. We evaluate this approach on 16 PHM papers, comparing framework-enhanced, skill-based and prompt-based agentic reproduction against a recent framework-free paper-reproduction agent. We assess reproduction success, model-based code evaluation, framework binding of paper assumptions, and cross-paper benchmark comparability under standardized protocols. Our results show that coupling agentic generation with a shared framework transforms paper reproduction from isolated code synthesis into executable, assumption-aware, and systematically comparable benchmark implementations.

2605.28369 2026-05-28 cs.AI cs.SI 版本更新

CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

CyberJurors：电商纠纷裁决的多智能体模拟任务

Yanhui Sun, Wu Liu, Haifeng Ming, Xinru Wang, Hantao Yao, Yongdong Zhang

发表机构 * School of Information Science and Technology, University of Science and Technology of China（信息科学与技术学院，中国科学技术大学）

AI总结针对电商纠纷裁决需要从冗余多轮多模态证据中提取关键线索并依据平台特定惯例决策的问题，提出多智能体框架CyberJurors，通过个体裁决链式思维和集体陪审共识裁决提升裁决质量，在包含6000真实案例的基准上超越现有方法。

Comments ICML 2026

AI中文摘要

电商平台开始招募众包陪审员来裁决大量交易纠纷。与正式法律判决不同，电商纠纷裁决需要从冗余、多轮、多模态证据中提取关键线索，并在平台特定的灵活惯例下做出决策。这些特点使得现有方法不足以应对该场景。为弥补这一差距，我们引入了一项开创性任务——电商纠纷裁决（EDV），并提出了VerdictBench，一个包含6000个真实案例的多模态基准，旨在反映众包陪审团决策。在此基础上，我们提出了CyberJurors，一个多智能体框架，用于澄清纠纷逻辑并规范裁决过程。在个体层面，个体裁决链式思维将EDV任务分解为四个结构化的推理阶段，实现细粒度线索感知并澄清关键线索与纠纷焦点之间的因果逻辑。在集体层面，陪审共识裁决模拟陪审员之间的多轮讨论和投票，同时纳入裁决先例以减轻对任一争议方的认知偏差。在VerdictBench上的实验表明，CyberJurors优于最先进的LLM、MLLM和法庭模拟器，同时与真实陪审团投票模式实现了更强的一致性。代码和数据集可在https://github.com/YanhuiS/CyberJurors 和 https://huggingface.co/datasets/piggi/VerdictBench 获取。

英文摘要

E-commerce platforms have begun recruiting crowdsourced jurors to adjudicate massive volumes of transaction disputes. Unlike formal legal judgment, E-commerce dispute verdicts require grounding pivotal clues from redundant, multi-round, multimodal evidence and making decisions under flexible platform-specific conventions. These characteristics render existing methods insufficient for this scenario. To bridge this gap, we introduce a pioneering task, E-commerce Dispute Verdicts (EDV), and present VerdictBench, a multimodal benchmark comprising 6,000 real-world cases designed to reflect crowdsourced jury decisions. Building upon this, we propose CyberJurors, a multi-agent framework to clarify the dispute logic and regulate the verdict process. At the individual level, Individual Verdict Chain-of-Thought decomposes the EDV task into four structured reasoning stages, enabling fine-grained clue perception and clarifying causal logic between pivotal clues and the dispute focus. At the collective level, Jury Consensus Verdict simulates multi-round discussion and voting among jurors, while incorporating verdict precedents to mitigate cognitive biases toward either disputant. Experiments on VerdictBench show that CyberJurors outperforms state-of-the-art LLMs, MLLMs, and court simulators, while achieving stronger alignment with real-world jury voting patterns. Code and dataset are available at https://github.com/YanhuiS/CyberJurors and https://huggingface.co/datasets/piggi/VerdictBench.

2605.28365 2026-05-28 cs.AI cs.CL cs.LO 版本更新

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

风险控制的 Lean 作为自然语言数学推理的评判者

Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov, Haitham Bou Ammar

发表机构 * Imperial College London（伦敦帝国理工学院）； Huawei Noah’s Ark Lab（华为诺亚方舟实验室）； UCL Centre for AI（伦敦大学学院人工智能中心）

AI总结针对 Lean 评判自然语言数学答案时信号稀疏且不忠实的问题，提出 COVCAL 选择器，通过有限样本选择性风险控制，在自动形式化覆盖率足够高时保证接受答案的准确率。

AI中文摘要

Lean 越来越多地被用于评判自然语言数学答案，但其信号是不完全的：许多答案从未被形式化，而一个失败的证明可能反映类型错误或缺少库事实，而非答案错误。在 MATH-500 上，我们表明该信号 (i) 严重依赖于覆盖率，即在证明覆盖率高的答案中正确率为 96%，但在覆盖率低时为 20%，以及 (ii) 稀疏且常常不忠实：一个 7B 自动形式化器仅对 28% 的问题证明了某个类别，而人工审计发现其中只有约 43% 的证明是忠实的。我们提出 COVCAL，一个基于 Lean 跟踪诊断的选择器，它在两种机制（保守的 Bonferroni 界和更紧的 dev-then-cal 规则）下，对接受的答案认证有限样本选择性风险界，否则弃权。可行性取决于自动形式化覆盖率：对于 7B 形式化器，信号过于稀疏，Bonferroni 在所有 20 个自助法分区上弃权，而一个专用于证明器的形式化器达到 79% 的覆盖率，并在 20 个分区中的 17 个上使其可行，以 0.98 的接受准确率接受约 48% 的问题。由于自一致性本身已达到 91% 的准确率，我们的贡献是精确描述了何时以及使用哪个形式化器，部分形式化信号可以在风险控制下被信任。

英文摘要

Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent, that is, the proof-winning answer is correct 96% of the time at high proved coverage but 20% at low coverage, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.
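
The accept-or-abstain idea in this abstract can be illustrated with a small sketch. This is not the COVCAL procedure itself: it assumes a hypothetical per-answer confidence score, a held-out calibration set with known correctness, and a plain Hoeffding upper bound with a Bonferroni correction over the candidate thresholds. The function names `hoeffding_upper` and `calibrate_threshold` are illustrative only.

```python
import math


def hoeffding_upper(errors, n, delta):
    """Hoeffding upper confidence bound on an error rate estimated from n samples."""
    return errors / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def calibrate_threshold(scores, correct, thresholds, alpha, delta):
    """Pick the lowest score threshold whose accepted-set risk bound is <= alpha.

    Returns None (abstain everywhere) when no threshold can be certified.
    Assumption: scores and correct come from a held-out calibration set.
    """
    per_test_delta = delta / len(thresholds)  # Bonferroni over candidate thresholds
    feasible = []
    for t in thresholds:
        accepted = [c for s, c in zip(scores, correct) if s >= t]
        n = len(accepted)
        if n == 0:
            continue  # nothing accepted at this threshold, nothing to certify
        errors = sum(1 for c in accepted if not c)
        if hoeffding_upper(errors, n, per_test_delta) <= alpha:
            feasible.append(t)
    return min(feasible) if feasible else None


# Dense signal: 1000 high-score answers (all correct) and 1000 low-score answers (half wrong).
dense_scores = [0.9] * 1000 + [0.1] * 1000
dense_correct = [True] * 1000 + [True, False] * 500
dense_choice = calibrate_threshold(dense_scores, dense_correct, [0.0, 0.5], alpha=0.1, delta=0.1)

# Sparse signal: only five calibration answers, so the bound is too loose to certify anything.
sparse_choice = calibrate_threshold([0.9] * 5, [True] * 5, [0.0, 0.5], alpha=0.1, delta=0.1)
```

With a dense calibration set the sketch certifies the higher threshold, while with only five samples it abstains, which mirrors the abstract's point that feasibility depends on how much usable signal is available.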

2605.28360 2026-05-28 cs.AI 版本更新

Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

提示码本：面向语言模型指令精炼的离散组合优化

Jyotirmoy Nath, Neeraj Kumar, Brejesh Lall

发表机构 * IIT Delhi（印度德里理工学院）

AI总结提出Prompt Codebooks (PCO)框架，将自动提示优化重构为离散组合学习，通过可重用的自然语言本能单元实现实例级路由和结构化反馈，在多个基准上提升性能并压缩提示长度。

AI中文摘要

自动提示优化（APO）显著提升了基于LLM的智能体工作流。然而，现有方法将每个任务的提示视为一个整体、实例无关的字符串，通过全局编辑进行优化，导致更新脆弱且无法复用学到的子行为。我们提出提示码本（PCO），一种新颖的组合式提示优化框架，将APO重构为在有限自然语言本能（原子、可重用的指令单元）词汇表上的离散学习。PCO将提示构建知识组织在离散码本中，通过基于LLM的编码器将每个输入路由到少量条目；生成器将它们组合成冻结目标模型的提示；评论器输出结构化判决，通过归因分解为每个变量的文本梯度，在语言值极小极大目标下联合训练编码器、生成器和码本。得到的路由是实例级的：同一任务的不同输入接收不同的本能组合，这种机制在实例无关方法下结构上无法表达。在Qwen3-8B和LLaMA-3.1-8B上的六个基准测试中，PCO相比零样本提升高达+30.36分，在HotpotQA上超越最强先前基线（GEPA）达+3.34分，总体平均提升+1.11分，并且仅使用K=16个本能即可将部署提示长度相比MIPROv2压缩最多14.1倍，相比GEPA压缩3.0倍。

英文摘要

Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task's prompt as a monolithic, instance-blind string optimized through global edits, producing brittle updates and preventing the reuse of learned sub-behaviors. We propose Prompt Codebooks (PCO), a novel compositional prompt optimization framework that recasts APO as discrete learning over a finite vocabulary of natural-language instincts - atomic, reusable instruction units. PCO organizes prompt-construction knowledge in a discrete codebook and routes each input to a small subset of entries via an LLM-based encoder; a generator composes them into a prompt for the frozen target model; a critic emits a structured verdict that decomposes by attribution into per-variable textual gradients, jointly training the encoder, generator, and codebook under a language-valued min-max objective. The resulting routing is per-instance: different inputs in the same task receive different instinct compositions, a regime structurally inexpressible under instance-blind methods. Across six benchmarks on Qwen3-8B and LLaMA-3.1-8B, PCO improves over zero-shot by up to +30.36 points, surpasses the strongest prior baseline (GEPA) by +3.34 on HotpotQA and +1.11 in aggregate, and reduces deployed prompt length by up to 14.1x versus MIPROv2 and 3.0x versus GEPA using only K=16 instincts.

2605.28359 2026-05-28 cs.AI q-fin.TR 版本更新

From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

从知道到做到：面向LLM股票市场交易智能体的记忆控制基准

Taojie Zhu, Wentao Zhao, Rui Sun, Beidi Luan, Jiacheng Lu, Sinuo Wang, Jing Li, Daxin Jiang, Yonghong He, Zuo Bai

发表机构 * Tsinghua University（清华大学）； Stepfun ； FinStep ； Shanghai Jiao Tong University（上海交通大学）； Adelaide University（阿德莱德大学）

AI总结针对LLM交易智能体评估中的知识泄露和收益归因问题，提出KTD-Fin基准，通过数据掩码和Barra风格归因框架，分离市场记忆与投资决策，并揭示收益主要来自被动市场暴露而非选股能力。

AI中文摘要

评估大语言模型（LLM）智能体能否在资本市场盈利，越来越被框架化为端到端交易：将智能体置于历史市场中，让其交易，并衡量投资组合收益。这种设置容易导致两种评估失败。首先，长时间的回测往往与前沿LLM的知识截止日期重叠，使得记忆的股票代码、日期、价格和市场叙事替代了投资推理。其次，原始收益是选股能力的一个嘈杂代理，因为正收益可能来自市场贝塔、风格暴露或有利的市场环境，而非真正的阿尔法。我们引入了KTD-Fin（知道-做到金融基准），一个端到端的股票市场交易基准，解决了这两个问题。KTD-Fin使用数据侧掩码协议，在提示和工具中一致地匿名化关键标识符和日历信息，将历史市场记忆与投资决策分离。它还整合了Barra风格的表现归因框架，将投资组合收益分解为市场、风格和选股阿尔法成分。在2024-2026年窗口内对中国沪深300指数评估的十个前沿LLM智能体中，掩码显著改变了智能体的推理过程，推动其转向匿名化的因子推理。归因分析进一步表明，在泄露控制评估下，LLM智能体的累积收益主要由被动的市场和风格暴露解释，而持续选股阿尔法的证据有限。这些发现表明，金融LLM基准不仅应评估智能体是否赚钱，还应评估收益来源是否反映了可转移的投资技能。我们发布KTD-Fin作为LLM交易智能体泄露控制和归因感知评估的可复现模板。

英文摘要

Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulnerable to two evaluation failures. First, long backtests often overlap with the knowledge cutoffs of frontier LLMs, allowing memorized tickers, dates, prices, and market narratives to substitute for investment reasoning. Second, raw returns are a noisy proxy for stock-selection ability, since positive performance may come from market beta, style exposure, or favorable regimes rather than genuine alpha. We introduce KTD-Fin (Knowing-To-Doing Financial Benchmark), an end-to-end stock-market trading benchmark that addresses both issues. KTD-Fin uses a data-side masking protocol to anonymize key identifiers and calendar information consistently across prompts and tools, separating historical market memory from investment decision-making. It also incorporates a Barra-style performance attribution framework that decomposes portfolio returns into market, style, and stock-selection alpha components. Across ten frontier LLM agents evaluated on the Chinese CSI300 over a 2024--2026 window, masking substantially changes agent rationales, pushing them towards anonymized factor-based reasoning. Attribution analysis further shows that LLM agents' cumulative returns under leakage-controlled evaluation are largely explained by passive market and style exposure, with limited evidence of persistent stock-selection alpha. These findings suggest that financial LLM benchmarks should evaluate not only whether an agent makes money, but also whether the source of returns reflects transferable investment skill. We release KTD-Fin as a reproducible template for leakage-controlled and attribution-aware evaluation of LLM trading agents.
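
The return-decomposition idea can be sketched with a deliberately simplified single-factor model. This is not the Barra-style multi-factor framework used by KTD-Fin: it assumes only one market factor, ordinary least squares, and per-period returns as plain lists. The function name `attribute_returns` and the sample numbers are hypothetical.

```python
def attribute_returns(portfolio, market):
    """Split mean portfolio return into a market part (beta * market) and residual alpha.

    Assumption: a single-factor model r_p = alpha + beta * r_m + noise, fit by least squares.
    """
    n = len(portfolio)
    mean_p = sum(portfolio) / n
    mean_m = sum(market) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(portfolio, market))
    var = sum((m - mean_m) ** 2 for m in market)
    beta = cov / var
    alpha = mean_p - beta * mean_m
    return {"beta": beta, "alpha": alpha, "market_part": beta * mean_m}


# Illustrative data: a portfolio that is 1.5x the market plus a small constant excess return.
market_returns = [0.01, -0.02, 0.03, 0.00]
portfolio_returns = [1.5 * m + 0.001 for m in market_returns]
parts = attribute_returns(portfolio_returns, market_returns)
```

In this toy case most of the mean return is explained by market exposure and only the small constant is left as alpha, which is the kind of distinction the abstract argues raw returns cannot make.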

2605.28358 2026-05-28 cs.LG cs.AI cs.IT math.IT 版本更新

Score Based Error Correcting Code Decoder

基于分数的纠错码译码器

Alon Helvits, Eliya Nachmani

发表机构 * School of Electrical and Computer Engineering (ECE)（电气与计算机工程学院）

AI总结提出SB-ECC，一种将译码视为连续时间去噪的基于分数的译码器，通过神经去噪器定义概率流常微分方程，在奇偶校验约束下迭代更新噪声信道观测值，无需SNR估计即可推理，并在42个码/SNR设置中39/42达到最佳误码率。

Comments Accepted to ICML 2026

AI中文摘要

纠错码能够实现可靠通信，然而在实际软译码中，跨码族和码长仍然具有挑战性。我们提出SB-ECC，一种基于分数的译码器，将译码视为连续时间去噪。神经去噪器定义了一个概率流常微分方程（ODE），该方程在奇偶校验约束的引导下，迭代地将噪声信道观测值更新为有效的码字。该模型在不同噪声水平下训练，无需时间/SNR条件，从而无需SNR估计即可进行推理，并支持由ODE求解器预算控制的直接延迟-精度权衡。我们使用原始带符号的信道观测值作为输入来学习连续去噪场。在42个码/SNR设置中，SB-ECC在39/42个条目中实现了最佳误码率，平均SNR增益为0.17dB，最大增益为0.46dB，优于最强竞争基线。我们表明，将求解器从Euler切换为DPM可保持-ln(BER)，同时将端到端译码时间平均减少8.86%（最高达12.82%）。

英文摘要

Error-correcting codes enable reliable communication, yet practical soft decoding remains challenging across code families and block lengths. We propose SB-ECC, a score-based decoder that casts decoding as continuous-time denoising. A neural denoiser defines a probability-flow ordinary differential equation (ODE) that iteratively updates the noisy channel observation toward a valid codeword, guided by parity constraints. The model is trained across noise levels without time/SNR conditioning, enabling inference without SNR estimation and supporting a direct latency-accuracy trade-off controlled by the ODE solver budget. We use the raw signed channel observation as input for learning a continuous denoising field. Across 42 code/SNR settings, SB-ECC achieves the best BER in 39/42 entries, with an average SNR gain of 0.17dB and a maximum gain of 0.46dB over the strongest competing baseline. We also show that swapping the solver from Euler to DPM preserves -ln(BER) while reducing end-to-end decoding time by 8.86% on average (up to 12.82%).

2605.28354 2026-05-28 cs.AI 版本更新

Plan Before Search: Search Agents Need Plan

搜索前先规划：搜索智能体需要规划

Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Jiayi Ji, Chenyi Lei, Wenwu Ou, Xiaoshuai Sun, Qibin Hou

发表机构 * Kuaishou Technology（快手科技）； Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University（教育部多媒体可信感知与高效计算重点实验室，厦门大学）； VCIP, CS, Nankai University（南开大学VCIP实验室）

AI总结提出Plan方法，通过将问题分解为有序子问题再进行检索，并引入自举训练范式，无需外部强模型蒸馏即可在多跳QA中激活规划能力。

AI中文摘要

将大型语言模型训练为检索增强推理智能体通常将强化学习与从更强模型蒸馏的SFT冷启动相结合。然而，这种范式忽略了两个基本因素：子技能之间的依赖结构，以及蒸馏并非获取能力的唯一途径。我们通过Plan来研究这一点，这是一种结构化的智能体行为，用于多跳检索，它在任何检索执行之前将问题分解为有序的子问题，从而使每个搜索步骤可以锚定到预先设计的子问题，而不是在先前检索的部分相关文档的影响下漂移。然而，在涵盖3B到14B参数的三个模型家族中，我们发现相同的奖励信号会引发定性不同的RL失败模式。这一现象表明，成功的训练不仅取决于奖励设计，还取决于模型特定的可行性条件：足够的初始熵、训练稳定性和先决子技能。受此启发，我们提出了一种自举训练范式，其中小规模种子模型生成过滤后的轨迹，从而在任何目标模型中激活Plan，消除了从外部强模型蒸馏的需要。我们的流程在每个测试模型中都激活了Plan，并在多跳QA基准上持续优于竞争基线。

英文摘要

Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start distilled from a stronger model. However, this paradigm overlooks two fundamental factors: the dependency structure among sub-skills, and the possibility that distillation is not the only route to capability acquisition. We study this through Plan, a structured agentic behavior for multi-hop retrieval that decomposes a question into ordered sub-questions before any retrieval is performed, so that each search step can be anchored to a pre-designed sub-question instead of drifting under the influence of partially relevant documents retrieved earlier. However, across three model families spanning 3B to 14B parameters, we find that an identical reward signal induces qualitatively different RL failure modes. This phenomenon indicates that successful training hinges not only on reward design but also on model-specific feasibility conditions: sufficient initial entropy, training stability, and prerequisite sub-skills. Motivated by this, we propose a self-bootstrapping paradigm in which a small-scale seed model generates filtered trajectories that activate Plan in any target model, eliminating the need for distillation from an external stronger model. Our pipeline activates Plan across every tested model and consistently outperforms competitive baselines on multi-hop QA benchmarks.

2605.28353 2026-05-28 cs.NE cs.AI cs.SC 版本更新

Improving Evaluation of Recombination-based Cartesian Genetic Programming

改进基于重组的笛卡尔遗传编程的评估

Duy Long Tran, Anja Jankovic, Marie Anastacio, Holger Hoos, Roman Kalkreuth

发表机构 * Chair for AI Methodology, RWTH Aachen University（人工智能方法学研究所，亚琛工业大学）

AI总结本研究通过超参数优化，在SRBench基准平台上评估了子图交叉和离散表型重组两种重组算子，证明了超参数优化可提升基于重组的笛卡尔遗传编程的性能。

Comments Accepted for presentation as workshop paper in the graph-based genetic programming workshop (GGP) at the Genetic and Evolutionary Computation Conference (GECCO). To appear in the GECCO'26 conference companion. GECCO'26 will be held July 13-17, 2026 in San Jose, Costa Rica

DOI: 10.1145/3795101.3814688
Journal ref: GECCO'26 Companion: Genetic and Evolutionary Computation Conference Companion, July 13-17, 2026, San Jose, Costa Rica

AI中文摘要

笛卡尔遗传编程传统上使用变异作为其主要且通常是唯一的遗传算子来驱动进化搜索。尽管近年来取得了进展，但由于明显的性能提升不足，基于重组的方法长期以来一直被避免。本研究在符号回归基准平台SRBench上检验了最近提出的两种重组算子：子图交叉和离散表型重组。利用TinyverseGP框架中提供的实现，我们对这两种算子的相应表示进行了超参数优化。我们的工作表明，超参数优化可以导致基于重组的笛卡尔遗传编程的性能提升。

英文摘要

Cartesian Genetic Programming has traditionally used mutation as its main and often sole genetic operator to drive evolutionary search. Despite advancements in recent years, recombination-based approaches have long been avoided, due to an apparent lack of performance gains. This study examines two recently suggested recombination-based operators, subgraph crossover and discrete phenotypic recombination, on SRBench, a benchmarking platform for symbolic regression. Using the implementations provided in the TinyverseGP framework, we perform hyperparameter optimisation of the respective representations with these two operators. Our work demonstrates that hyperparameter optimisation can lead to improvements in performance for recombination-based Cartesian Genetic Programming.

2605.28347 2026-05-28 cs.AI 版本更新

FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models

FedMPT: 视觉语言模型的多标签联邦提示调优

Xucong Wang, Pengkun Wang, Zhe Zhao, Liheng Yu, Shuang Wang, Yang Wang

发表机构 * University of Science and Technology of China（科学技术大学）

AI总结针对联邦学习中多标签识别任务，提出FedMPT方法，利用因果模型的前门调整和大语言模型驱动的条件解耦，通过最优传输和门控机制抑制虚假标签关联，提升模型鲁棒性。

Comments 16 pages, including 11 pages of main text and 5 pages of appendix; Accepted by CVPR2026

AI中文摘要

基于视觉语言模型的多标签识别旨在利用其预训练知识更好地适应复杂识别场景，从而增强模型鲁棒性。然而，对于需要联邦学习的现实去中心化应用，将视觉语言模型适应到每个拥有私有和异构数据的客户端会导致模型过拟合虚假标签关联，从而在遇到新样本时触发不相关类别。为解决此问题，我们使用因果模型重新考虑多标签识别的联邦学习，其中采用前门调整并通过中间变量（放大真实标签共现）解耦多标签识别建模过程。在分析指导下，我们提出FedMPT，这是首个专门为联邦多标签识别设计的方法。FedMPT的核心思想是利用可泛化条件引导联邦多标签识别以减轻错误标签激活。为此，FedMPT引入了一个由大语言模型驱动的流程来解读控制标签依赖的潜在条件。此外，我们引入了条件增强提示与图像块之间的最优传输以揭示多个区域级语义。最后，我们通过精心设计的门控机制从不同条件生成协同预测。在多个基准数据集上的实验表明，我们提出的方法在不同设置下取得了有竞争力的结果，并优于现有最先进方法。

英文摘要

Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralized applications requiring federated learning, adapting VLMs to each client that possesses private and heterogeneous data can cause the model to overfit spurious label correlations, consequently triggering irrelevant categories when encountering new samples. To tackle this problem, we reconsider federated learning for MLR with a causal model, in which we adopt a front-door adjustment and decouple the MLR modeling process by intermediate variables that magnify the oracle label co-occurrence. Guided by our analysis, we propose FedMPT, the first method specifically designed for federated MLR. The core idea of FedMPT is to leverage generalizable conditions to steer federated MLR to mitigate erroneous label activations. To achieve this, FedMPT introduces a Large Language Model (LLM)-driven pipeline to decipher the underlying conditions that govern label dependencies. Furthermore, we introduce an optimal transport between the condition-enriched prompts and the image patches to uncover multiple region-level semantics. Finally, we generate synergistic predictions from different conditions with a crafted gating mechanism. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms SOTA methods under varied settings.

2605.28345 2026-05-28 cs.AI cs.LG eess.SP 版本更新

Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains

Picid: 一种跨任务和领域的可复现PHM模块化评估基础设施

Lev Telyatnikov, Raffael Theiler, Leandro Von Krannichfeldt, Olga Fink

发表机构 * EPFL（洛桑联邦理工学院）

AI总结提出模块化评估基础设施Picid，通过标准化数据契约和评估边界，实现跨任务、跨数据集的故障检测、诊断和预测的可复现与公平比较。

AI中文摘要

预测与健康管理（PHM）领域的进展受到跨任务、数据集和应用领域缺乏标准化和可复用评估实践的阻碍。报告的结果往往难以复现和比较，因为关键协议选择（如数据划分、预处理、标签对齐、时间窗口和指标）通常是隐式的或临时实现的。我们引入了Picid，一个模块化评估基础设施，将PHM评估流程形式化为显式、可执行和可复现的协议。通过定义良好的抽象，Picid在保持对不同PHM设置的灵活性的同时，强制执行确定性、无泄漏的数据集构建。该框架通过统一接口支持故障检测、诊断和预测，并且可以扩展到新的数据集和模型类别，而不违反协议不变性。通过标准化数据契约和评估边界，Picid还实现了跨诊断（分类）和预测（回归）的公平任务比较，允许相同的模型系列在不同设置中一致地进行评估。我们通过对跨越电池、轴承、涡轮风扇发动机、液压系统、过滤系统和建筑的十二个数据集上的十三个模型进行实证评估来展示Picid。这项工作为PHM中标准化、公平和可复现的评估建立了可复用的基础。

英文摘要

Progress in Prognostics and Health Management (PHM) is hindered by the lack of standardized and reusable evaluation practices across tasks, datasets, and application domains. Reported results are often difficult to reproduce and compare, as key protocol choices, such as data splits, preprocessing, label alignment, temporal windowing, and metrics, are often implicit or implemented ad hoc. We introduce Picid, a modular evaluation infrastructure that formalizes the PHM evaluation pipeline as an explicit, executable, and reproducible protocol. Through well-defined abstractions, Picid enforces deterministic, leakage-safe dataset construction while remaining flexible across diverse PHM settings. The framework supports fault detection, diagnostics, and prognostics through a unified interface and can be extended to new datasets and model classes without violating protocol invariants. By standardizing data contracts and evaluation boundaries, Picid also enables fair cross-task comparisons across diagnostics (classification) and prognostics (regression), allowing identical model families to be evaluated consistently across heterogeneous settings. We demonstrate Picid through an empirical evaluation of thirteen models on twelve datasets spanning batteries, bearings, turbofan engines, hydraulics, filtration systems, and buildings. This work establishes a reusable foundation for standardized, fair and reproducible evaluation in PHM.

2605.28338 2026-05-28 cs.AI 版本更新

SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

SafeMed-R1: 临床医生审计的安全与伦理对齐用于医疗大语言模型

Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan, Yidong Jiang, Yankai Jiang, Jinru Ding, Jiayuan Chen, Zhuangzhi Gao, Pengcheng Chen, Zhao He, Rongzhao Zhang, Meiling Liu, Luyi Jiang, Jie Xu

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Joint Laboratory of Biomedical Artificial Intelligence（生物医学人工智能联合实验室）； Shanghai Institute of Infectious Disease and Biosecurity（上海传染病与生物安全研究院）； Shanghai Health Development Research Center (Shanghai Medical Information Center)（上海健康发展战略研究中心（上海医疗信息中心））； University of Washington（华盛顿大学）； Department of Eye and Vision Sciences, University of Liverpool（利物浦大学眼科与视觉科学系）； Liverpool Centre for Cardiovascular Science, University of Liverpool（利物浦大学心血管科学中心）； School of Computer Science and Technology, Tongji University（同济大学计算机科学与技术学院）

AI总结提出SafeMed-R1模型，通过可追溯的临床信任信号管道和红队压力测试实现安全与伦理对齐，在临床基准上达到79.6%的宏平均准确率，并将不安全输出减少约3-5%。

AI中文摘要

大语言模型在执业考试中日益匹配专家表现，但常规临床使用仍受限，因为治理需要可审计的推理、安全与伦理对齐以及对对抗性滥用的韧性。本文提出SafeMed-R1，通过可追溯的临床信任信号管道进行训练，该管道将每个推理实例与临床医生评分标准和编辑历史关联，并通过安全与伦理监督和红队压力测试进行对齐。SafeMed-R1在临床基准上达到79.6%的宏平均准确率。在对抗性安全测试下，它显示出最低的聚合风险，并将不安全输出相对于基线减少约3%至5%。在一项包含30个用药安全场景的配对专家研究中，SafeMed-R1在医学正确性上与PGY1和PGY2住院医师相当，并在用药安全、指南一致性和临床实用性上得分更高。总体而言，这些结果表明，临床医生审计的监督溯源，结合领域定制的安全与伦理对齐，可以在不依赖推理时检索或引用依据的情况下，加强治理相关的证据。

英文摘要

Large language models (LLMs) increasingly match expert performance on licensing examinations, yet routine clinical use remains limited because governance requires auditable reasoning, safety and ethics alignment, and resilience to adversarial misuse. Here we present SafeMed-R1, trained with a traceable Clinical Trust Signals (CTS) pipeline that links each reasoning instance to clinician rubric scores and edit histories, and aligned through safety and ethics supervision and red-team stress testing. SafeMed-R1 attains a macro-averaged accuracy of 79.6% across clinical benchmarks. Under adversarial safety testing, it shows the lowest aggregated risk and reduces unsafe outputs by about 3 to 5% relative to its baseline. In a paired expert study of 30 medication safety vignettes, SafeMed-R1 matches PGY1 and PGY2 residents on medical correctness and scores higher for medication safety, guideline consistency, and clinical usefulness. Collectively, these results suggest that clinician-audited supervision provenance, together with domain-tailored safety and ethics alignment, can strengthen governance-relevant evidence without relying on inference-time retrieval or citation grounding.

2605.28337 2026-05-28 cs.AI 版本更新

An Enhanced Large Neighborhood Search Approach for the Capacitated Facility Location Problem with Incompatible Customers

一种增强的大邻域搜索方法用于解决具有不兼容客户的容量设施选址问题

Ida Gjergji, Lucas Kletzander, Nysret Musliu, Andrea Schaerf

发表机构 * University of Udine（乌迪内大学）

AI总结针对具有客户不兼容约束的容量设施选址问题，提出一种结合混合破坏算子和精确修复的大邻域搜索方法，在所有基准实例上取得了新的最优解。

2605.28328 2026-05-28 cs.LG cs.AI 版本更新

Learning the Error Patterns of Language Models

学习语言模型的错误模式

Jinwoo Kim, Taylor Berg-Kirkpatrick, Loris D'Antoni

发表机构 * Department of Computer Science and Engineering（计算机科学与工程系）； University of California-San Diego（加州大学圣地亚哥分校）

AI总结提出前缀过滤器（prefix filters）来捕捉LLM在特定领域中的错误模式，并通过Palla算法高效学习这些过滤器，从而提升输出有效性，例如在TypeScript生成中将编译率提升60%以上。

2605.28321 2026-05-28 cs.SE cs.AI 版本更新

Multi-Agent LLM-based Metamorphic Testing for REST APIs

基于多智能体LLM的REST API蜕变测试

Shehroz Khan, Abdullah Mughees, Gaadha Sudheerbabu, Tanwir Ahmad, Dragos Truscan

发表机构 * Åbo Akademi University（阿博阿卡迪米大学）

AI总结提出ARMeta方法，利用基于LLM的多智能体工作流自动识别蜕变测试场景并生成可执行测试，以解决REST API测试中的预言问题。

Comments Author-submitted version accepted for publication at the IEEE Conference on Computers, Software, and Applications (COMPSAC 2026), July 7-11, 2026, Madrid, Spain

AI中文摘要

随着REST API在软件系统中日益重要，其验证也变得更为关键。因此，测试和发现潜在问题对于提高软件质量至关重要。然而，测试REST API的主要挑战在于难以评估API调用的输出是否正确，即测试预言问题。蜕变测试是一种基于规约的测试方法，适用于正确输出未知或未明确指定的情况。为了检查系统的正确性，需要指定不同输出之间的关系。我们提出了ARMeta，一种支持工具的方法，利用基于LLM的多智能体工作流来支持使用OpenAPI文档化的REST API的蜕变测试。该智能体工作流用于识别蜕变测试场景，并以Given-When-Then格式进行规约。这些场景自动实现为可执行测试，并针对被测系统执行。我们在两个公开的暴露REST接口的Web应用程序上评估了ARMeta，并将其性能与基于场景的测试基线进行了比较。结果表明，ARMeta探索的行为可作为现有基于场景的测试方法的补充。

英文摘要

As REST APIs become an increasingly significant part of software systems, their validation is becoming more critical. Hence, testing and uncovering underlying issues are of utmost importance for improving software quality. However, testing REST APIs is challenging mainly due to the difficulty of assessing whether the output of an API call is correct, i.e., the test oracle problem. Metamorphic testing is a specification-based testing approach for situations where correct outputs are unknown or not specified explicitly. To check the correctness of a system, relations between the different outputs are specified. We present ARMeta, a tool-supported approach that uses an LLM-based multi-agent workflow to support metamorphic testing of REST APIs documented with OpenAPI. The agentic workflow is used to identify metamorphic test scenarios and specify them in the Given-When-Then format. These scenarios are automatically implemented as executable tests and executed against the system under test. We evaluate ARMeta on two publicly available web applications that expose REST interfaces and compare its performance with a scenario-based testing baseline. The results show that ARMeta explores behaviors that serve as a complement to existing scenario-based testing approaches.
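
A metamorphic relation of the kind described here can be sketched without any real service. The example below is not ARMeta output: `search_items` is a hypothetical in-memory stand-in for a `GET /items` endpoint, and the relation checked (adding a filter never adds results) is one common choice, assumed here for illustration.

```python
def search_items(items, min_price=None, category=None):
    """Hypothetical stand-in for a GET /items endpoint with optional query filters."""
    result = items
    if min_price is not None:
        result = [i for i in result if i["price"] >= min_price]
    if category is not None:
        result = [i for i in result if i["category"] == category]
    return result


def check_filter_subset(items, min_price, category):
    """Metamorphic relation: the follow-up call with an extra filter returns a subset.

    Given a source request, when a category filter is added, then every returned
    item must also appear in the source response. No expected output is needed.
    """
    source = search_items(items, min_price=min_price)
    follow_up = search_items(items, min_price=min_price, category=category)
    source_ids = {i["id"] for i in source}
    return all(i["id"] in source_ids for i in follow_up)


# Illustrative catalogue data (made up for this sketch).
CATALOG = [
    {"id": 1, "price": 5, "category": "book"},
    {"id": 2, "price": 12, "category": "book"},
    {"id": 3, "price": 20, "category": "toy"},
]
relation_holds = check_filter_subset(CATALOG, min_price=10, category="book")
```

The check compares two responses with each other instead of against a known correct answer, which is how metamorphic testing sidesteps the test oracle problem mentioned in the abstract.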

2605.28320 2026-05-28 cs.RO cs.AI 版本更新

Identifying Explicit Parsimonious Piece-wise Polynomial Relationships in Industrial time-series: Application to manipulator robots

识别工业时间序列中的显式简约分段多项式关系：应用于机械臂

Mazen Alamir, Sacha Clavel

发表机构 * Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble, France（格勒诺布尔阿尔卑斯大学、国家科学研究中心、格勒诺布尔国立理工学院、GIPSA实验室）

AI总结本文提出一种算法，利用隐式关系中的多项式集构建显式分段表示，以识别工业时间序列中的简约显式分段多项式关系，并应用于机械臂逆模型识别，实验表明该模型在泛化能力上优于深度神经网络。

2605.28317 2026-05-28 cs.LG cs.AI cs.NA math.NA physics.comp-ph 版本更新

Hybrid Neural World Models

混合神经世界模型

Pranav Lakshmanan, Paras Chopra

发表机构 * Lossfunk

AI总结提出混合神经世界模型，通过单网络连续视界条件训练直接预测未来状态，并利用误差图隐式捕捉不连续性，实现高效且可靠的物理动力学模拟。

Comments Preprint. Under review

AI中文摘要

神经代理模型有望在物理动力学中实现比经典求解器大幅加速，但在冲击、锋面和接触等剧烈动力学事件中会静默失败。我们提出了用于物理动力学的混合神经世界模型：一种在物理状态空间中训练和部署多视界代理模型的方案，其中单个具有连续视界条件的网络通过直接监督（对照教科书参考求解器）进行训练，以在前向传播中一步预测任意未来状态（视界T）。尽管训练数据、损失函数或架构的任何部分都没有监督不连续位置，但训练后的代理模型隐式地编码了它，仅通过其前向传播即可恢复为每个轨迹的误差图，该误差图集中在冲击、锋面和接触上，而在其他地方保持较小。该误差图与标准无标签基线（包括深度集成、学习误差头、梯度幅度指标和局部自适应共形预测）相比具有竞争力或更好，同时仅使用单个训练网络，且不需要校准集或控制方程知识。该方案支持两个操作点。模式1单独运行代理模型以最大化吞吐量，在PDE环境中，与教科书求解器相比，相同硬件上的CPU加速比为26倍至72倍。模式2使用误差图来门控参考求解器回退，推迟不确定的轨迹，并在默认操作点将代理模型的残差误差大致减半。该方案无需修改即可应用于反应扩散、可压缩欧拉和刚体碰撞动力学。

英文摘要

Neural surrogates promise large speedups over classical solvers for physical dynamics but fail silently at sharp dynamical events such as shocks, fronts, and contact. We present hybrid neural world models for physical dynamics: a recipe for training and deploying multi-horizon surrogates in physical state space, where a single network with continuous horizon conditioning is trained with direct supervision against textbook reference solvers to predict any future state at horizon T in one forward pass. Although no part of the training data, loss function, or architecture supervises discontinuity location, the trained surrogate encodes it implicitly, recoverable from its forward passes alone as a per-trajectory error map that concentrates on shocks, fronts, and contacts, and stays small elsewhere. The map is competitive with or better than standard label-free baselines including deep ensembles, learned error heads, gradient-magnitude indicators, and locally-adaptive conformal prediction, while using only a single trained network and requiring no calibration set or governing-equation knowledge. The recipe supports two operating points. Mode 1 runs the surrogate alone for maximum throughput, with same-hardware CPU speedups of 26x to 72x against textbook solvers on the PDE environments. Mode 2 uses the error map to gate a reference-solver fallback, deferring uncertain trajectories and roughly halving the surrogate's residual error at the default operating point. The recipe applies without modification across reaction-diffusion, compressible Euler, and rigid-body collision dynamics.

2605.28306 2026-05-28 cs.CL cs.AI 版本更新

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

面向混合专家模型中多语言下游任务的路由对齐微调

Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong, Sichun Luo, Linqi Song

发表机构 * City University of Hong Kong（香港城市大学）； Carnegie Mellon University（卡内基梅隆大学）； The University of Hong Kong（香港大学）

AI总结针对混合专家模型在多语言下游任务中的路由结构异构问题，提出RA-MoE三阶段框架，通过中间层语言通用对齐区识别任务相关专家，并引入路由对齐损失增强目标语言路由，实验表明该方法优于标准微调和强基线。

AI中文摘要

混合专家（MoE）模型已成为高效扩展LLM的主流范式，但将其适配到非英语下游任务仍然具有挑战性。现有的微调方法将MoE模型视为整体学习器，忽略了预训练期间形成的异构路由结构。我们在多个MoE模型和下游任务上验证，中间层形成了语言通用对齐区，其中路由发散性强烈预测了每种语言的任务性能差距。基于这一观察，我们提出了RA-MoE（路由对齐MoE微调），一个三阶段框架，该框架根据英语和目标语言的正确性将并行任务示例分类为四路分类法（cc/ci/ic/ii），识别中间层中与任务相关的专家，并用路由对齐损失增强标准SFT，该损失鼓励ci类型示例上的目标语言路由遵循英语任务专家激活模式。在三个MoE模型、三个任务和六种目标语言上的实验表明，RA-MoE始终优于标准SFT和强基线（包括Routing Steering和RISE），其中任务-语言对的ci比例可作为对齐收益的可靠预测指标。

英文摘要

Mixture-of-Experts (MoE) models have emerged as a dominant paradigm for efficient LLM scaling, yet adapting them to non-English downstream tasks remains challenging. Existing fine-tuning approaches treat MoE models as monolithic learners, ignoring the heterogeneous routing structure that develops during pretraining. We validate across multiple MoE models and downstream tasks that middle layers form a language-universal alignment zone where routing divergence strongly predicts per-language task performance gaps. Building on this observation, we propose RA-MoE (Routing-Aligned MoE Fine-Tuning), a three-stage framework that categorizes parallel task examples into a four-way taxonomy (cc/ci/ic/ii) based on correctness in English and the target language, identifies task-relevant experts in the middle layers, and augments standard SFT with a routing alignment loss that encourages target-language routing on ci-type examples to follow the English task-expert activation pattern. Experiments across three MoE models, three tasks, and six target languages demonstrate that RA-MoE consistently outperforms standard SFT and strong baselines including Routing Steering and RISE, with the ci proportion of a task-language pair serving as a reliable predictor of alignment benefit.

2605.28305 2026-05-28 cs.CL cs.AI 版本更新

Revisiting Anthropomorphic Reflection Markers in Large Language Model Reasoning

重新审视大语言模型推理中的拟人化反思标记

Yahan Yu, Noa Nakanishi, Fei Cheng

发表机构 * Kyoto University（京都大学）

AI总结本文通过提示级和令牌级干预抑制拟人化反思标记，发现这些标记并非推理性能的必要条件，且抑制后模型仍能进行无标记验证，表明它们更多是表面线索而非可靠反思代理。

Comments 15 pages, 12 figures

AI中文摘要

大语言模型（LLMs）在复杂推理过程中经常产生显式的反思痕迹，并伴随有拟人化标记，如“wait”、“hmm”和“alternatively”。尽管这些标记通常被用作反思的可见指标，但其机制仍不清楚，这带来了与冗余和重复反思标记相关的过度思考风险。在这项工作中，我们重新审视了拟人化反思标记，考察了它们对推理的必要性以及在反思中的作用。我们通过提示级和令牌级干预抑制这些标记，并分析了它们对四个基准测试和两种模型规模的任务性能的影响。我们的结果表明，拟人化标记对于推理性能并非普遍必要：抑制它们可以在多种设置下保持或提高性能，尤其是在较大的采样预算下。同时，标记抑制并不一定消除反思行为，因为模型仍然可以进行无标记验证。这些结果表明，拟人化标记更倾向于表面线索，而不是反思本身的可靠代理，并激励未来在显式标记模式之外对推理机制进行研究。

英文摘要

Large Language Models (LLMs) often produce explicit reflective traces during complex reasoning, accompanied by anthropomorphic markers such as "wait", "hmm", and "alternatively". Although these markers are commonly used as visible indicators of reflection, their mechanisms remain unclear, and redundant, repetitive reflection markers carry a risk of overthinking. In this work, we revisit anthropomorphic reflection markers, examining their necessity for reasoning and their role in reflection. We suppress these markers through prompt-level and token-level interventions and analyze the effects on task performance across four benchmarks and two model scales. Our results show that anthropomorphic markers are not uniformly necessary for reasoning performance: suppressing them can preserve or improve performance in several settings, especially under larger sampling budgets. Meanwhile, marker suppression does not necessarily remove reflection behavior, as models can still perform marker-free verification. These results suggest that anthropomorphic markers tend to be surface cues rather than reliable proxies for reflection itself, and they motivate future research on reasoning mechanisms beyond explicit marker patterns.
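
The abstract does not say how the token-level intervention is implemented. One common way to suppress specific tokens at decoding time is to mask their logits, and the sketch below assumes that approach; the toy vocabulary, the token ids, and the function names are all hypothetical.

```python
import math

def suppress_tokens(logits, banned_ids):
    """Return a copy of the logits with banned token ids set to -inf so they can never be chosen."""
    return [(-math.inf if i in banned_ids else x) for i, x in enumerate(logits)]

def greedy_pick(logits):
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Hypothetical toy vocabulary; ids 0 and 1 stand in for marker tokens.
vocab = ["wait", "hmm", "so", "therefore"]
logits = [3.2, 1.1, 2.7, 0.4]
banned = {0, 1}

masked = suppress_tokens(logits, banned)
print(vocab[greedy_pick(logits)], "->", vocab[greedy_pick(masked)])
```

Masking removes the surface marker but leaves the rest of the distribution untouched, which is consistent with the observation that verification behavior can survive without the marker itself.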


2605.28303 2026-05-28 cs.AI (updated version)

From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

Shuaike Li, Kai Zhang, Xianquan Wang, Jiachen Liu, Shengpeng Mo

Affiliation: State Key Laboratory of Cognitive Intelligence

AI summary: To address the epistemic dissonance caused by the static fact-overwriting paradigm in knowledge editing, the paper proposes CODE, a causally guided on-policy distillation method that turns fact injection into coherent knowledge evolution, sharply reducing the self-refutation rate and improving multi-hop accuracy.

Abstract

While Knowledge Editing (KE) enables efficient updates, its dominant Static Fact Overwriting paradigm treats LLMs as discrete databases, forcibly injecting isolated facts. Fracturing pre-trained logical topologies, this triggers Epistemic Dissonance -- a pathology where un-evolved legacy priors force the model to explicitly negate the injected update. Idealized interventions reveal that this is an inherent structural flaw rather than mere algorithmic noise, with a zero-distortion proxy yielding a catastrophic 95.6% self-refutation rate. Given the causally driven nature of real-world knowledge, grounding updates in explicit causal narratives effectively collapses this conflict rate to just 6.6%, underscoring the imperative for a paradigm shift toward Causal Editing. To internalize this evolution, we propose CODE (Causal On-policy Distillation for Editing). By coupling causal bootstrapping with asymmetric on-policy distillation, CODE engraves causal transition logic directly into parametric memory. Experiments on LLaMA-3.1 and Qwen-2.5 show CODE drastically suppresses self-refutation to 1.8% while securing robust multi-hop accuracy (up to 83.5%), seamlessly transforming discrete fact injection into coherent knowledge evolution. Code is available at https://github.com/CrashBugger/CODE.


2605.28302 2026-05-28 cs.LG cs.AI cs.DC (updated version)

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Hanjiang Wu, Abhimanyu Rajeshkumar Bambhaniya, Sarbartha Banerjee, Tuhin Khare, Sudarshan Srinivasan, Suvinay Subramanian, Souvik Kundu, Madhu Kumar, Midhilesh Elavazhagan, William Won, Amir Yazdanbakhsh, Tushar Krishna

Affiliations: Georgia Institute of Technology; Intel; Google; Google DeepMind; Infravana

AI summary: The paper systematically explores the benefits and limits of successive disaggregation levels for MoE model inference, from chunked prefill and prefill-decode disaggregation to operator-level Attention-FFN Disaggregation (AFD). Using a framework that fuses on-device kernel measurements with high-fidelity network simulation, it shows that under strict TTFT/TPOT SLOs AFD sustains about 4k tokens/s of system throughput on DeepSeek-V3.2, and it gives concrete design principles for jointly optimizing throughput and interactivity.

Abstract

Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight TTFT and TPOT service-level objectives: from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation, and most recently to operator-level Attention-FFN Disaggregation (AFD). This trend is especially important for mixture-of-experts (MoE) models, where memory-bound attention, compute-intensive expert FFNs, and MoE dispatch/combine communication create distinct resource demands. AFD further exposes this heterogeneity by placing attention and MoE-FFN execution on separate GPU groups. Each level of disaggregation deepens the scheduling design space across workload characteristics, resource allocation, and interconnect topology, raising the central question: when does each level actually pay off? We systematically characterize this trade-off for MoE inference across realistic workloads spanning input/output sequence lengths, prefix-KV reuse, and per-user latency constraints. Using chunked-prefill and P/D disaggregation as baselines, we study the benefits and limits of AFD at scale through a framework that fuses on-device kernel measurements with high-fidelity network simulation. Under strict TTFT/TPOT SLOs, AFD sustains around 4k tokens/s of system throughput on DeepSeek-V3.2 across chat, coding, and agentic-coding workloads, where non-AFD deployments are infeasible. We distill concrete takeaways for jointly optimizing throughput and interactivity, including how to partition attention and FFN across GPUs as a function of workload and model architecture, providing design principles for current rack- and cluster-scale deployments as well as future disaggregated AI infrastructure.


2605.28301 2026-05-28 cs.AI (updated version)

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

Zhaoyang Jiang, Xuanqi Peng, Fei Teng, Zhizhong Fu, Yunsoo Kim, Jiacong Mi, Zicheng Li, Honghan Wu

Affiliations: School of Health & Wellbeing, University of Glasgow; Department of Respiratory and Critical Care Medicine, Shanghai Sixth People’s Hospital, Shanghai Jiao Tong University School of Medicine; School of Life Science and Technology, University of Electronic Science and Technology of China; Institute of Health Informatics, University College London

AI summary: Training a smaller model by distilling a larger model's chain of thought raises answer accuracy on medical QA but also raises the factual error rate of the reasoning steps, showing that answer quality and reasoning factuality can move in opposite directions.

Abstract

Chain-of-thought (CoT) distillation trains a smaller model to imitate a teacher's reasoning trace, but it is typically evaluated by final-answer metrics including accuracy. We ask whether gains in answer quality are accompanied by improvements in the trace. In medical QA, where short answer options can leave a richer clinical justification under-specified, a Qwen3-8B student distilled from a DeepSeek-V3-family teacher improves on MedQA-USMLE answer metrics (SC@64 74.7% to 84.4%; expected calibration error (ECE) 0.096 to 0.034). Yet under a Kimi-K2.6 style-blind LLM-judge audit, its error rate over non-abstained steps rises from 30.6% to 50.3%. In this primary medical setting, answer quality and trace factuality move in opposite directions. This before--after pattern persists across evaluators, teacher strengths, student scales and families, medical benchmarks, and style, segmentation, and answer-correctness controls. A 150-step blinded audit by a clinical expert reproduces the same ordering. Boundary checks narrow the scope of the claim: the risk appears when a compact answer under-constrains the rationale and a capable student can imitate expert-like form without reliably grounding each local claim. Standard answer metrics and aggregate hedging rates do not reveal the shift. When such traces are released or reused, answer-level metrics alone are insufficient.
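
For readers unfamiliar with the expected calibration error (ECE) reported above, here is a minimal sketch of the usual binned estimate. The equal-width binning with 10 bins is a common convention assumed here; the abstract does not state which binning the authors use, and the toy confidences are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece

# Hypothetical predictions: two at 0.9 confidence (both right), two at 0.6 (one right).
confs = [0.9, 0.9, 0.6, 0.6]
hits = [True, True, True, False]
print(round(expected_calibration_error(confs, hits), 3))
```

ECE only looks at final-answer confidence and correctness, so it can improve while the intermediate reasoning steps get less factual, which is the gap the audit targets.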


2605.28298 2026-05-28 cs.AI (updated version)

REED: Post-Training Representation Editing for Cross-Domain Linguistic Steganalysis

Ruohan Lei, Jianxin Gao, Wanli Peng, Huimin Pei

Affiliations: China Agricultural University; Jiangsu Normal University

AI summary: Proposes a post-training representation editing method that guides edits with a domain-offset vector and a source-domain cover-to-stego direction, achieving efficient cross-domain linguistic steganalysis without architecture modification or parameter updates.

Abstract

In real-world linguistic steganalysis, test texts usually come from unseen domains with different vocabularies, topics, writing styles, and steganographic generation patterns, which can significantly degrade detection performance. Although existing cross-domain steganalysis methods alleviate this problem through techniques such as distribution alignment and domain-invariant feature learning, their detection performance remains unsatisfactory. In this paper, we propose a post-training representation editing method for cross-domain linguistic steganalysis. Specifically, the detector is first trained on source-domain data; the feature extractor and classifier are then kept frozen, and the intermediate representations are deterministically edited before classification. For domain adaptation, we construct a domain-offset vector from marginal source and target representations. For domain generalization, we derive a source-domain cover-to-stego direction to guide sample-specific editing. Experimental results show that, compared with advanced existing methods, the proposed method achieves high cross-domain detection performance, especially in terms of F1-score, while requiring no architecture modification or parameter updates after source-domain training.
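
The domain-adaptation step can be pictured with a small sketch. It assumes the simplest reading of a "domain-offset vector": the difference between the mean source-domain and mean target-domain representations, added to each target representation before the frozen classifier sees it. The exact construction, the scaling factor `alpha`, and all names below are assumptions for illustration only.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def domain_offset(source_feats, target_feats):
    """Offset pointing from the target-domain mean to the source-domain mean (assumed form)."""
    mu_s, mu_t = mean_vector(source_feats), mean_vector(target_feats)
    return [s - t for s, t in zip(mu_s, mu_t)]

def edit(feature, offset, alpha=1.0):
    """Shift one representation by alpha * offset; no model parameter is changed."""
    return [f + alpha * o for f, o in zip(feature, offset)]

# Hypothetical 2-D representations from the frozen feature extractor.
source = [[1.0, 2.0], [3.0, 4.0]]
target = [[5.0, 0.0], [7.0, 2.0]]

offset = domain_offset(source, target)
print(offset)
print(edit(target[0], offset))
```

Because the edit is a deterministic shift applied at inference time, it needs only unlabeled target-domain representations and leaves the trained detector as it is.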


2605.28295 2026-05-28 cs.AI cs.CL cs.LG (updated version)

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Soeun Kim, Albert No

Affiliation: Department of Artificial Intelligence, Yonsei University

AI summary: The paper proposes REFT, which diversifies the first token after the reasoning marker by uniform sampling, markedly increasing rollout diversity in RLVR at low overhead and thereby improving the Pass@k performance of reasoning models.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models without labeled trajectories, relying on grouped rollouts to expose the policy to alternative reasoning paths and a verifier to score them. Rollout diversity has accordingly emerged as a central bottleneck in RLVR, with most existing methods broadening exploration through temperature, prefix, or rollout-selection adjustments. We identify a structurally distinguished but overlooked position for broadening this diversity: the first token after the reasoning marker. The policy's first-token distribution exhibits a sharply peaked yet correctness-decoupled phenomenon, and this first token position can broaden the regions a rollout group covers without altering the correctness signal. We introduce REFT (Rollout Exploration with First-Token Diversification), a light addition to the RLVR pipeline that samples first tokens uniformly from the policy's own top-$N$ candidates and allocates rollouts evenly, leaving every other component unchanged. Trained on the resulting diversified rollouts, REFT improves aggregate Pass@1, Pass@8, and Pass@64 over DAPO and GRPO baselines across four base models (0.5B-7B) and three difficulty regimes.
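
A minimal sketch of the first-token step, under stated assumptions: the candidate set is the top-N tokens by policy probability, and "allocates rollouts evenly" is read as a round-robin assignment over those candidates. The toy distribution and the function names are hypothetical; the rest of the RLVR pipeline is out of scope here.

```python
from collections import Counter

def top_n_tokens(first_token_probs, n):
    """The n most probable first tokens, highest probability first."""
    ranked = sorted(first_token_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [tok for tok, _ in ranked[:n]]

def allocate_rollouts(candidates, group_size):
    """Assign rollouts to candidates round-robin so counts differ by at most one."""
    return [candidates[i % len(candidates)] for i in range(group_size)]

# Hypothetical sharply peaked first-token distribution.
probs = {"The": 0.85, "Let": 0.08, "First": 0.04, "We": 0.02, "To": 0.01}

candidates = top_n_tokens(probs, n=4)
group = allocate_rollouts(candidates, group_size=8)
print(Counter(group))
```

With ordinary sampling, most of the eight rollouts in this toy case would start with the same token; the even allocation forces each of the four candidates to open two rollouts, after which generation would continue from the policy as usual.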


2605.28283 2026-05-28 cs.CL cs.AI (updated version)

PrunePath: Towards Highly Structured Sparse Language Models

Zhexuan Gu, Zixun Fu, Yancheng Yuan

Affiliation: Department of Applied Mathematics, The Hong Kong Polytechnic University

AI summary: Proposes the PrunePath framework, which achieves budget-adaptive structured sparsification through softmax-normalized routing and a cumulative-mass threshold. It obtains a superior sparsity-performance trade-off on natural language understanding, generation, and instruction tuning, and uses Triton kernels to turn structured sparsity into practical memory savings and faster decoding.

Abstract

Feed-forward networks (FFNs) dominate the parameter count and computation of modern language models, yet existing pruning methods often struggle to convert sparsity into hardware-friendly inference efficiency gains. We introduce PrunePath, a budget-adaptive structured sparsification framework for FFN layers. Built on MoEfication, PrunePath replaces independent expert-wise thresholding with a softmax-normalized routing distribution and activates important experts under a cumulative-mass threshold. This formulation imposes a token-level probability budget, enabling adaptive expert counts and a direct inference-time sparsity knob from a single checkpoint. Across NLU, NLG, and instruction-tuning evaluations, PrunePath achieves a favorable sparsity-performance trade-off compared with existing static pruning and MoEfication-based methods. We further implement Triton kernels for KV-cache decoding to translate the resulting structured sparsity into practical memory savings and measurable decoding-speed improvements. These results demonstrate the superior performance of PrunePath for building highly sparse, deployment-friendly large language models.
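
The cumulative-mass rule can be sketched as follows, assuming it works like nucleus-style selection: softmax the router scores for a token, sort experts by probability, and keep adding experts until their total mass reaches the threshold. The function names, the toy scores, and the threshold value are hypothetical; this is an illustration of the idea, not the paper's implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def select_experts(router_scores, mass_threshold):
    """Smallest set of top experts whose softmax mass reaches mass_threshold."""
    probs = softmax(router_scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= mass_threshold:
            break
    return chosen

# Hypothetical router scores for two tokens over four experts.
peaked = [4.0, 1.0, 0.5, 0.0]
flat = [1.0, 1.0, 1.0, 1.0]
print(select_experts(peaked, 0.8), select_experts(flat, 0.8))
```

The same threshold yields one expert for the peaked token and four for the flat one, which is the sense in which the expert count adapts per token while a single threshold acts as the inference-time sparsity knob.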


2605.28282 2026-05-28 cs.AI (updated version)

ResearchLoop: An Evidence-Gated Control Plane for AI-Assisted Research

Yihan Xia, Taotao Wang

Affiliation: Shenzhen University

AI summary: Proposes ResearchLoop, a protocol that uses an evidence-gated control plane to keep claims in AI-assisted research auditable, covering the state model, transition rules, and experimental validation.

Comments: 32 pages, 4 figures, 6 tables; technical report

Abstract

AI-assisted research compresses ideation, implementation, evaluation, and manuscript writing into a single interactive loop. This compression is useful, but it also creates a publication risk: paper claims can become easier to state than to audit. We present ResearchLoop, an evidence-gated control plane for AI-assisted computational research. ResearchLoop treats research questions, task contracts, evidence objects, claim ledgers, closeouts, and paper bindings as durable project state, realized here as a repository-backed runtime. This technical report provides the complete protocol specification, state model, transition rules, claim-admission algorithm, and insight-compounding mechanism. It also reports the full experimental record spanning nine versions (V0--V9), including a self-hosting case study, a controlled task-suite study with component ablations, a mathematical olympiad evaluation, and a supplementary SciCode boundary experiment evaluated with the official generated-code harness. All artifacts, manifests, and verification reports are preserved in the project repository.


2605.28277 2026-05-28 cs.AI (updated version)

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao, Yitong Qiao, Chunlei Meng, Zhangquan Chen, Xin Cao

Affiliations: University of New South Wales; Essential Energy; University of Alabama at Birmingham; Zhejiang University; Fudan University; Tsinghua University

AI summary: Using the multilingual diagnostic benchmark MentalMap to evaluate the spatial reasoning of large language models, the paper finds a universal performance bottleneck in viewpoint reasoning across all models (the L3 reasoning cliff), suggesting that the limitation stems from text-only working-memory constraints rather than a specific architecture.

Abstract

Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.


2605.28273 2026-05-28 cs.AI (updated version)

Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

Junyu Zhang, Feihong Yang, Jian Wang, Chao Wang, Xudong Zhang

Affiliations: Department of Electronic Engineering, Tsinghua University, Beijing, China; Qiyuan Lab, Beijing, China

AI summary: Proposes the Global PSRO framework, which guides strategy-population expansion by directly minimizing Population Exploitability (PE), approximating Nash equilibria with fewer policy iterations.

Comments: Accepted by ICML 2026

Abstract

The Policy-Space Response Oracles (PSRO) framework scales equilibrium computation to large zero-sum games by iteratively expanding a restricted strategy set using deep reinforcement learning (DRL). A central challenge is to construct, under limited computational budgets, a small strategy population whose induced game well approximates the full game. Existing PSRO variants typically expand the population using best responses to meta-strategies computed from restricted-game payoffs, which can lead to inefficient expansions that provide limited global improvement. We propose to guide population expansion by directly evaluating the post-expansion population quality. Specifically, we adopt Population Exploitability (PE) to measure how well a restricted strategy set represents the full game, and introduce a two-phase exploration--selection framework that explicitly minimizes PE during expansion. We instantiate this framework as Global PSRO, a practical DRL-based algorithm that efficiently generates candidate responses and estimates PE via parameter-sharing conditional neural networks. Experiments across multiple two-player zero-sum games show that Global PSRO achieves lower exploitability and approximates Nash equilibria with significantly fewer policy iterations than prior PSRO methods.


2605.28264 2026-05-28 cs.AI (updated version)

Entropy Distribution as a Fingerprint for Hallucinations in Generative Models

Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Romina Yalovetzky, Niraj Kumar

Affiliation: Global Technology Applied Research

AI summary: The paper proposes the Calibrated Entropy Score (CES), based on the distribution of token-level entropies rather than the mean alone. It detects hallucinations with a single forward pass and black-box access to logits, and comes with theoretical guarantees and empirical validation.

Abstract

Large Language Models (LLMs) often generate factually incorrect outputs, commonly termed hallucinations, that undermine trust and limit deployment in high-stakes settings. Existing hallucination detection methods typically require multiple forward passes, or access to model internals. In this work, we provide theoretical background and empirical evidence that the distribution of token-level entropies, beyond the mean captured by perplexity or length-normalised entropy, serves as a fingerprint of hallucination, with distributional shape and tail behaviour carrying independent signal. We formalize hallucination detection as a statistical hypothesis test and propose the Calibrated Entropy Score (CES), a lightweight algorithm requiring only a single forward pass and black-box access to token logits. CES combines the mean signal with the maximum signal of the generated entropy through a calibrated reference CDF, producing scores that are directly comparable across models and tasks. We establish finite-sample calibration guarantees via a novel random-length Dvoretzky--Kiefer--Wolfowitz inequality, and also prove that CES detects hallucinations with probability converging to one exponentially fast in the generation length. Across eight QA benchmarks and ten generator models spanning open-source and API access models, CES achieves the highest detection performance among all single-pass black-box methods while providing formal error guarantees that existing heuristics lack. Remarkably, CES is statistically indistinguishable from multi-sample methods that require far greater computational cost, closing the gap between lightweight and expensive detection and making it suitable for real-time, large-scale deployment.
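
The ingredients named above (per-token entropy from logits, a mean and a maximum statistic, and a reference CDF) can be sketched in a few lines. The abstract does not specify how CES combines the calibrated mean and maximum, so this sketch stops at computing the two statistics and looking one of them up in an empirical CDF; the reference values, the logits, and all function names are made up for illustration.

```python
import math
from bisect import bisect_right

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution for one generation step."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_summary(step_logits):
    """Mean and maximum of the per-token entropies of one generation."""
    ents = [token_entropy(step) for step in step_logits]
    return sum(ents) / len(ents), max(ents)

def empirical_cdf(reference, value):
    """Fraction of reference values that are <= value."""
    ref = sorted(reference)
    return bisect_right(ref, value) / len(ref)

# Hypothetical generation: one confident step and one maximally uncertain step.
generation = [[10.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
mean_e, max_e = entropy_summary(generation)

# Hypothetical reference sample of maximum entropies from trusted generations.
reference_max = [0.2, 0.5, 0.9, 1.2]
print(round(mean_e, 3), round(max_e, 3), empirical_cdf(reference_max, max_e))
```

Here the mean is pulled down by the confident step while the maximum still flags the uncertain one, which illustrates why a tail statistic can carry signal that a mean-only score such as perplexity misses.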


2605.28258 2026-05-28 cs.SE cs.AI cs.CV cs.HC (updated version)

SmartIterator: Visual Analytics Workflows for Supervising Unsupervised Data Grouping

Gennady Andrienko, Natalia Andrienko

Affiliations: Fraunhofer Institute IAIS; Lamarr Institute for Machine Learning and Artificial Intelligence; City St George’s, University of London

AI summary: Proposes SmartIterator, a visual analytics approach that uses a six-phase workflow and the coordinated IteraScope views to systematically explore grouping results across a parameter sweep, helping users understand data structure and make informed decisions.

Abstract

Unsupervised learning methods (topic modeling, partition-based clustering, and density-based clustering) produce data groupings without human guidance, yet choosing and evaluating those groupings should not itself be unsupervised. We present SmartIterator (SI), a visual analytics approach that treats the full sequence of grouping results across a parameter sweep as a first-class analytical object. For each method family, SI provides a structured six-phase workflow that guides the analyst through systematic exploration of grouping results, from quality-metric overview through transition-stability assessment, membership-confidence evaluation, content and context inspection, and recurrent-archetype verification to an informed decision, building cumulative understanding of data structure along the way. The workflows are operationalized through IteraScope (IS), a coordinated visual display combining quality-metric charts with semantic color encoding, a 1D group embedding with Sankey-style transition flows and violin plots of membership confidence, a 2D group embedding with HDBSCAN-detected recurrent archetypes that highlights iterations capturing all persistent patterns, and domain-specific linked views for contextualized interpretation. We demonstrate the three workflows on: (1) simulated social-media messages from the VAST Challenge 2011 (density-based clustering, validated against ground truth), (2) EU population statistics across about 1,500 NUTS-3 regions (partition-based clustering), and (3) 30 years of IEEE VIS papers (NMF topic modeling). The workflows constitute the main contribution: they provide actionable, method-specific guidance for navigating parameter spaces, studying how data structure evolves across configurations, and grounding analytical understanding in domain context, yielding knowledge about the data that no single "best" result can provide.


2605.28215 2026-05-28 cs.AI cs.CL cs.LG cs.LO cs.MA (updated version)

Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers

Carmen Quiles-Ramírez, Leticia L. Rodríguez, Nicolás Martorell, Natalia Díaz-Rodríguez

AI summary: Using five conditions of increasing formal rigour, the paper systematically evaluates the concept-based explainability of multimodal large language models under few-shot in-context learning. It finds that explaining is harder than predicting, and that forcing models to generate formal explanations lowers predictive accuracy.

Comments: Accepted to the CompLearn Workshop at ICML 2026

Abstract

In-context learning (ICL) enables multimodal large language models (MLLMs) to classify images from a few labelled examples. Yet, how these models use the provided context remains opaque. While Chain-of-Thought prompting is widely used, recent work argues that it may not reflect true internal computation. In this paper, we systematically evaluate the concept-based explainability of frozen MLLMs under few-shot ICL using five conditions of increasing formal rigour, ranging from baseline classification to Description Logics (DL) axiom generation. Evaluating four state-of-the-art MLLMs via an independent LLM-as-a-judge pipeline, we demonstrate that explaining is genuinely harder than predicting alone. Surprisingly, forcing models to generate formally structured, concept-based explanations degrades predictive accuracy monotonically (from 93.8% to 90.1%), contradicting the assumption that explicit reasoning universally aids performance. However, when models successfully articulate class-discriminative visual features, explanation quality strongly correlates with correct predictions. Our findings suggest that while MLLMs excel at visual classification, they lack the specific instruction-tuning required for formal, machine-verifiable explainability.


2605.28213 2026-05-28 cs.AI (updated version)

Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

Shuoming Zhang, Qiuchu Yu, Yangyu Zhang, Ruiyuan Xu, Xiyu Shi, Guangli Li, Xiaobing Feng, Huimin Cui, Jiacheng Zhao

Affiliations: SKLP, Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; University of New South Wales

AI summary: Proposes KLineage, which walks expert GPU kernel implementations backward and extracts reusable optimization skills, learning the conditions under which each optimization applies and thereby improving the quality and efficiency of kernels generated by LLM agents.

Comments: Preprint, Under Review

Abstract

LLM-based agents are increasingly used to generate GPU kernels, but they often know what optimizations to try without knowing when those optimizations are sound. We introduce KLineage, which learns this missing "when" knowledge from expert kernels: instead of relying on forward rollouts, KLineage walks expert implementations backward through validation-gated simplifications and reverses each accepted step into a reusable optimization skill. Each skill records not only the optimization intent, but also where it applies in code, what conditions made it valid, what effect it had, and what failures its assumptions avoid. A downstream LLM materializes these skills on new code surfaces under the same compile/correctness/profile gate. On five expert workloads across two NVIDIA architectures, these lineage-derived skills serve as an effective optimization curriculum, exceeding recent memory-based LLM-kernel baselines in both final kernel quality and optimization efficiency under the same fixed budget. We additionally use a separate 22-instance held-out check as a sanity test against source-case memorization.


2605.28201 2026-05-28 cs.AI 版本更新

Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

种植、持久化、触发：针对大语言模型智能体的潜伏攻击

Yongxiang Li, Moxin Li, Zhixin Ma, Fengbin Zhu, Dongrui Liu, Wenjie Wang, Fuli Feng

发表机构 * University of Science and Technology of China（中国科学技术大学）； National University of Singapore（新加坡国立大学）； Singapore Management University（新加坡管理学院）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）

AI总结提出潜伏攻击（Sleeper Attack），即攻击者将对抗性内容注入智能体状态并持久化，在后续交互中被良性用户查询触发，导致有害行为；构建包含1896个实例的基准测试，实验表明当前最强LLM智能体仍易受此类攻击。

AI中文摘要

大语言模型（LLM）智能体仍然容易受到来自外部环境的安全威胁，攻击者将对抗性内容注入外部观察（如工具返回的数据、网页或MCP上下文），导致有害的智能体行为，例如不安全的操作或错误的输出。现有研究通常关注单次交互攻击，即智能体观察到对抗性内容后立即在单次用户请求中表现出有害行为。然而，我们表明对抗性内容也可以在同一智能体服务的多次交互中持久化，使得此类威胁更难检测和缓解。具体来说，对抗性内容可能持久化在智能体状态中，在多次交互中保持休眠，随后被良性用户查询激活。我们将此类安全威胁形式化为潜伏攻击（Sleeper Attack）。为了评估它，我们构建了一个包含1896个实例的基准测试，涵盖六种真实世界的有害结果、三种攻击策略和三种智能体状态目标：会话上下文、记忆和可复用技能。在七个强大的开源和闭源LLM上的实验表明，最先进的LLM智能体仍然容易受到潜伏攻击，即使在单次交互基线中它们实现了较低的攻击成功率。我们的代码和数据可在https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef获取。

英文摘要

Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.

2605.28192 2026-05-28 cs.AI 版本更新

Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

面向多跳音视频推理的主动全模态感知代理

Ke Xu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu, Yanfeng Wang, Yu Wang

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结针对多跳音视频推理中证据稀疏且跨模态分布的问题，提出MOV-Bench基准和AOP-Agent代理框架，通过分层全模态记忆与观察-反思-重规划循环实现主动感知，显著提升开源全模态大模型在长视频和推理密集型问题上的性能。

AI中文摘要

多跳音视频推理对全模态大语言模型（Omni-LLMs）仍然具有挑战性，因为相关证据通常稀疏、时间上分散，并且分布在音频和视频流中。现有基准对此设置的研究有限，通常仅涉及有限数量的模态、相关时间片段或推理步骤。在这项工作中，我们引入了MOV-Bench，一个包含519个精心设计问题的基准，这些问题需要对时间上分散的音视频证据进行多跳推理。在MOV-Bench上的评估表明，当前的全模态大语言模型在多跳跨模态推理方面仍然存在困难。为了解决这一挑战，我们进一步提出了AOP-Agent，一个基于开源全模态大语言模型的高效代理框架，用于主动全模态感知。通过将分层全模态记忆与协作的观察-反思-重规划循环相结合，AOP-Agent使开源全模态大语言模型能够进行主动感知，而无需额外训练或专有模型。在MOV-Bench和OmniVideoBench上的实验表明，AOP-Agent持续提升了推理性能，在长视频和推理密集型问题上尤其显著。

英文摘要

Multi-hop audio-visual reasoning remains challenging for Omni-LLMs, as relevant evidence is often sparse, temporally dispersed, and distributed across both audio and visual streams. Existing benchmarks provide limited investigation of this setting, typically involving only a limited number of modalities, relevant temporal segments, or reasoning steps. In this work, we introduce MOV-Bench, a benchmark containing 519 carefully curated questions that require multi-hop reasoning over temporally dispersed audio-visual evidence. Evaluations on MOV-Bench reveal that current Omni-LLMs still struggle with multi-hop cross-modal reasoning. To address this challenge, we further propose AOP-Agent, an efficient agentic framework built on open-source Omni-LLMs for active omni-modal perception. By combining a hierarchical omni-modal memory with a collaborative observe-reflect-replan loop, AOP-Agent enables open-source Omni-LLMs to perform active perception without additional training or proprietary models. Experiments on MOV-Bench and OmniVideoBench demonstrate that AOP-Agent consistently improves reasoning performance, with particularly notable gains on long videos and reasoning-intensive questions.

2605.28187 2026-05-28 cs.IR cs.AI cs.CY cs.SI 版本更新

Whose Name Comes Up? III: Persona Prompting Effects in LLM-Based Scholar Recommendation

谁的名字会出现？III：基于LLM的学者推荐中的人设提示效应

Annabella Sánchez-Guzmán, Lukas Eberhard, Denis Helic, Lisette Espín-Noboa

发表机构 * Graz University of Technology（格拉茨技术大学）； Complexity Science Hub（复杂科学中心）

AI总结本研究通过构建基准测试，分离模型选择与提示设计对LLM学者推荐的影响，发现提示设计（语言、地点、角色与任务）显著影响推荐质量（事实性、覆盖度）和社会代表性（多样性、均等性）。

Comments 25 pages (10 main, 2 references, 13 appendix), 6 figures in main, 13 figures in appendix (under-review)

AI中文摘要

大型语言模型（LLM）越来越多地被用作学者推荐系统，塑造了学术界中被视为专家的人选。现有的审计仍然以英语为中心、单一学科且忽略人设，导致输出变异性的来源尚不明确。为此，我们提出了一个基准测试，以分离模型选择和提示设计对推荐的影响。我们通过改变人设提示（语言、地点、角色与任务）和上下文（领域、资历、k）审计了43个LLM。将推荐的学者与Semantic Scholar在六个科学学科上进行比较，以衡量技术质量（事实性、覆盖度）和社会代表性（多样性、均等性）。基本技术质量由模型选择驱动，事实性和均等性由上下文驱动，多样性由地点驱动。南非提示产生的事实性较低的列表，而日本提示产生的事实性高但同质化的列表，偏向高产的学者。因此，提示设计是基于LLM的学者发现中一个不可忽视的维度，应与模型选择一起系统审计。

英文摘要

Large language models (LLMs) are increasingly used as scholar recommenders, shaping who is seen as an expert in academia. Existing audits remain English-centric, single discipline, and persona-agnostic, leaving the source of output variability poorly understood. To this end, we propose a benchmark that disentangles the effects of model choice and prompt design on recommendations. We audit 43 LLMs by varying persona prompts (language, location, role-and-task) and context (field, seniority, k). Recommended scholars are compared against Semantic Scholar over six scientific disciplines to measure technical quality (factuality, coverage) and social representativeness (diversity, parity). Basic technical quality is driven by model choice, factuality and parity by context, and diversity by location. South Africa prompts yield less factual lists, while Japan prompts yield highly factual but homogeneous lists skewed toward highly productive scholars. Prompt design is thus a non-trivial axis of LLM-based scholar discovery and should be systematically audited alongside model choice.

2605.28186 2026-05-28 cs.RO cs.AI 版本更新

进化算法在实际物理信息优化中的性能和可解释性要求

Helena Stegherr, Michael Heider, Nils Meyer, Tobias Thummerer, Thomas Wendler, Pierre Aublin, Ennio Idrobo-Àvila, Lars Mikelsons, Sebastian Zaunseder, Jörg Hähner

发表机构 * Universität Augsburg（奥格斯堡大学）

AI总结本文通过五个实际物理优化问题，分析领域专家对进化算法在性能和可解释性方面的需求，并指出现有方法未能充分应用于复杂实际场景的差距。

AI中文摘要

进化计算提供了多种工具来解决复杂的实际优化问题。然而，研究通常集中在较小、简化的问题和优化算法上，这些算法在实际场景中有时无法满足期望。此外，在此类设置中，对应用算法及其提供的解决方案的信任通常至关重要，但这需要理解搜索过程本身。这导致在许多应用背景下（包括基于物理的建模）实践者往往不会认真考虑进化计算。本文详细介绍了可以缓解这些问题的进化计算技术。首先，由领域专家介绍并描述了五个实际的基于物理的优化问题。针对每个问题，提出了进化算法在性能和可解释性方面的要求，以增加信任和可用性。我们发现，所有领域专家都期望快速收敛到良好解决方案，并希望获得关于结果如何形成的一些解释，而其他要求则强烈依赖于具体问题。最后，我们介绍了现有方法，这些方法可用于改进进化算法的这些方面，但据我们所知，从未在复杂的实际场景中使用过。这意味着两个领域之间存在需要弥合的差距，以充分发挥进化计算的潜力。

英文摘要

Evolutionary computation offers a variety of tools to solve complex real-world optimization problems. However, research often focuses on smaller, simplified problems and optimization algorithms that sometimes miss expectations in real-world scenarios. Additionally, trust in the applied algorithm and the solutions it provides is often essential in such settings, but requires an understanding of the search process itself. This leads to evolutionary computation often not being seriously considered by practitioners in many application contexts, among them physics-based modeling. In this article, techniques from evolutionary computation are detailed that can alleviate these problems. First, five real-world physics-based optimization problems are introduced and described by domain experts. For each of these, the requirements for the evolutionary algorithm regarding performance and explainability to increase trust and usability are presented. We found that all domain experts expect fast convergence to a good solution and want some explanations for how the results were formed, while other requirements strongly depend on the respective problem. Finally, we present existing approaches that can be leveraged to improve those aspects of evolutionary algorithms but have to our knowledge never been employed in complex real-world scenarios. This implies a gap between both domains that needs to be closed to exploit the full potential of evolutionary computation.

2605.28163 2026-05-28 cs.CL cs.AI 版本更新

DeltaMCP: 通过规范感知转换实现MCP服务器的增量再生

Aditya Pujara, Xiaogang Zhu, Hsiang-Ting Chen

发表机构 * Microsoft（微软）； University of Adelaide（阿德莱德大学）

AI总结针对企业级API与MCP工具集同步维护的挑战，提出DeltaMCP，一种基于规范感知的增量再生工具，仅更新受影响的MCP服务器工具，实验表明能减少开发者开销并提升可维护性与版本一致性。

AI中文摘要

LLM的快速发展以及模型上下文协议（MCP）的引入，通过确定性和结构化方法彻底改变了智能代理与API交互的方式。虽然一些现有系统（如AutoMCP）试图自动化之前完全手动生成MCP服务器的过程，但它们未能解决不断发展的企业级API与其相应MCP工具集实现之间保持同步的反复挑战。本文介绍了DeltaMCP，一种面向企业级MCP服务器的规范感知增量再生工具。DeltaMCP使开发者能够在给定其对应服务的OpenAPI规范新版本时，仅更新MCP服务器中受影响的工具。使用Azure REST API规范作为评估数据集，DeltaMCP在生成质量和系统性能方面与基线全量生成方法进行了基准测试。结果表明，DeltaMCP减少了开发者开销，同时提高了可维护性和版本一致性。这项研究为企业寻求为基于LLM的系统维护高保真、最新MCP服务器基础设施提供了一种可扩展的方法。

英文摘要

The rapid development of LLMs coupled with the introduction of Model Context Protocol (MCP) has revolutionized how intelligent agents interact with APIs through deterministic and structured methods. While some existing systems like AutoMCP attempt to automate a previously completely manual process of generating MCP servers, they fail to address the recurring challenge of maintaining synchronization between evolving enterprise-level APIs and their corresponding MCP toolset implementation. This paper introduces DeltaMCP, a specification-aware, incremental regeneration tool for enterprise-grade MCP servers. DeltaMCP enables developers to only update the affected tooling of MCP servers, given a new release of its corresponding service's OpenAPI specification. Using Azure REST API specifications as the evaluation dataset, DeltaMCP is benchmarked against baseline full generation methods on generation quality and system performance. The results demonstrate the reduction in developer overhead through DeltaMCP whilst improving maintainability and version consistency. This research offers a scalable approach for enterprises seeking to maintain high-fidelity, up-to-date MCP server infrastructures for LLM-based systems.
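
The incremental step described above can be pictured as a diff between two versions of an OpenAPI document that returns only the operations whose tools would need regenerating. The sketch below is a hypothetical illustration, not DeltaMCP's actual algorithm: the function names `list_operations` and `affected_operations` and the toy specs are assumptions, and operations are compared by plain equality of their JSON objects.

```python
# Hypothetical sketch: find which operations changed between two OpenAPI specs.
# This is NOT DeltaMCP's implementation; names and logic are illustrative only.

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def list_operations(spec):
    """Map (path, method) -> operation object for every operation in the spec."""
    ops = {}
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            # Path items may also hold non-operation keys such as "parameters".
            if method.lower() in HTTP_METHODS:
                ops[(path, method.lower())] = operation
    return ops


def affected_operations(old_spec, new_spec):
    """Classify operations as added, removed, or changed between two versions."""
    old_ops, new_ops = list_operations(old_spec), list_operations(new_spec)
    added = sorted(set(new_ops) - set(old_ops))
    removed = sorted(set(old_ops) - set(new_ops))
    changed = sorted(k for k in set(old_ops) & set(new_ops) if old_ops[k] != new_ops[k])
    return {"added": added, "removed": removed, "changed": changed}


# Toy specs invented for the demonstration.
old = {"paths": {
    "/vms": {"get": {"operationId": "listVms"}},
    "/vms/{id}": {"get": {"operationId": "getVm"}, "delete": {"operationId": "deleteVm"}},
}}
new = {"paths": {
    "/vms": {"get": {"operationId": "listVms"}, "post": {"operationId": "createVm"}},
    "/vms/{id}": {"get": {"operationId": "getVm", "deprecated": True}},
}}

delta = affected_operations(old, new)
```

Under these assumptions, only the three operations in `delta` would be regenerated, while the unchanged `GET /vms` tool is left alone. A real tool would presumably also have to resolve `$ref` schemas before comparing, which this sketch ignores.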

2605.26910 2026-05-28 cs.LG cs.AI q-bio.NC 版本更新

EEG-FM-Audit: A Systematic Evaluation and Analysis Pipeline for EEG Foundation Models

EEG-FM-Audit：脑电图基础模型的系统评估与分析流程

Xianheng Wang, Yige Yang, Damien Coyle

发表机构 * Bath Institute for the Augmented Human（巴斯增强人类研究所）； University of Bath（巴斯大学）

AI总结提出EEG-FM-Audit流程，通过ASHA驱动的基准测试、范式级消融研究和神经生理探测，系统评估脑电图基础模型，发现调优的监督基线可媲美或超越先进基础模型。

Comments 26 pages

AI中文摘要

大型脑电图基础模型在解码跨多种认知任务的脑电图信号方面展现出巨大潜力。然而，现有的EEG-FM研究存在三个关键局限性：不透明的监督基线调优、复杂学习范式的贡献未经验证以及模型决策缺乏透明度。为解决这些问题，我们提出了EEG-FM-Audit，一个旨在系统化评估EEG-FM的综合评估与分析流程。EEG-FM-Audit包含三个主要组成部分：(1) ASHA驱动的基准测试协议，通过透明优化监督基线确保公平比较；(2) 范式级消融研究，评估FM中学习范式的有效性；(3) 神经生理探测框架，探究FM是否利用了有效的时域、空域和频域脑电图特性。我们将EEG-FM-Audit应用于四个最先进的EEG-FM和五个代表性监督模型，涉及三个公开数据集。结果表明，尽管参数显著减少，但适当调优的监督基线可以匹配或超越先进的FM。此外，我们发现FM学习范式的有效性高度依赖于数据集规模和架构。最后，NPP分析展示了FM如何依赖特定的生理特征，为更可解释的神经解码建立了框架。

英文摘要

Large EEG Foundation Models (FMs) have shown great potential for decoding EEG signals across diverse cognitive tasks. However, existing EEG-FM studies exhibit three critical limitations: opaque supervised baseline tuning, unverified contributions of complex learning paradigms, and a lack of transparency in model decision-making. To address these, we propose EEG-FM-Audit, a comprehensive evaluation and analysis pipeline designed to systematize the assessment of EEG-FMs. EEG-FM-Audit consists of three primary components: (1) an ASHA-driven benchmarking protocol that ensures fair comparisons by transparently optimizing supervised baselines; (2) paradigm-level ablation studies to evaluate the effectiveness of learning paradigms in FMs; and (3) a neurophysiological probing (NPP) framework, which explores whether FMs leverage valid temporal, spatial, and spectral EEG properties. We apply EEG-FM-Audit to four state-of-the-art EEG-FMs and five representative supervised models across three public datasets. Our results reveal that properly tuned supervised baselines can match or outperform advanced FMs, despite requiring significantly fewer parameters. Furthermore, we find that the effectiveness of learning paradigms of FMs is highly dependent on dataset scale and architecture. Finally, NPP analysis demonstrates how FMs rely on specific physiological features, establishing a framework for more interpretable neural decoding.
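
For readers unfamiliar with ASHA (the Asynchronous Successive Halving Algorithm), the sketch below shows the synchronous successive-halving idea it builds on: evaluate many configurations cheaply, keep the best fraction, and give the survivors a larger budget. It is a simplified illustration under stated assumptions, not the paper's benchmarking protocol. The toy objective `toy_loss`, the candidate learning rates, and the reduction factor `eta=3` are invented, and ASHA's asynchronous promotion is omitted.

```python
# Synchronous successive halving, the core idea behind ASHA.
# ASHA itself promotes configurations asynchronously; that part is omitted here.

def successive_halving(configs, evaluate, min_budget=1, eta=3):
    """Keep the top 1/eta configurations at each rung, multiplying the budget by eta."""
    survivors = list(configs)
    budget = min_budget
    history = []
    while len(survivors) > 1:
        # Score every surviving configuration at the current budget (lower is better).
        scored = sorted(survivors, key=lambda cfg: evaluate(cfg, budget))
        keep = max(1, len(scored) // eta)
        history.append((budget, len(survivors), keep))
        survivors = scored[:keep]
        budget *= eta
    return survivors[0], history


# Toy objective (an assumption for illustration): the loss shrinks with budget
# and is smallest for learning rates near 1e-3.
def toy_loss(cfg, budget):
    return abs(cfg["lr"] - 1e-3) + 1.0 / budget


candidates = [{"lr": lr} for lr in (1e-1, 3e-2, 1e-2, 3e-3, 1e-3, 3e-4, 1e-4, 3e-5, 1e-5)]
best, rungs = successive_halving(candidates, toy_loss)
```

Each entry of `rungs` records the budget, the number of configurations evaluated, and the number kept, which is the kind of trace that makes baseline tuning transparent.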

2605.26368 2026-05-28 cs.CV cs.AI 版本更新

Unified Panoramic Geometry Estimation via Multi-View Foundation Models

统一全景几何估计：基于多视角基础模型

Vukasin Bozic, Isidora Slavkovic, Dominik Narnhofer, Nando Metzger, Denis Rozumny, Konrad Schindler, Nikolai Kalischek

发表机构 * ETH Zürich（苏黎世联邦理工学院）； Google（谷歌）

AI总结提出PaGeR框架，利用预训练3D基础模型，从单张全景图像中统一预测尺度不变深度、度量深度、表面法线和天空掩码，实现360度场景重建。

AI中文摘要

从透视图像进行几何估计已取得巨大进展，成熟到现成的基础模型不仅能够从多视角图像重建3D场景结构，甚至能从单视图进行重建。一个自然的扩展是从全景图像进行3D重建，其令人兴奋的前景是从单张全景图像恢复完整的360度场景。在这项工作中，我们引入了PaGeR（全景几何重建），这是一个将专为透视图像设计的强大3D基础模型提升到全景领域的框架。我们的策略是从一个预训练的3D重建Transformer开始，将其转变为一个统一的高性能模型，该模型在单次前向传播中从透视和全向图像预测尺度不变深度、度量深度、表面法线和天空掩码。通过将架构改动保持在最小，并在训练中混合透视和全景图像，PaGeR保留了底层基础模型的丰富3D先验，同时学会从单张全景图像估计几何一致的360度场景。我们在室内和室外环境中广泛测试了我们的方法，发现它在各种场景中提供了最先进的性能和出色的零样本性能。代码、数据和模型可在此处获取：https://github.com/prs-eth/PaGeR。

英文摘要

Geometry estimation from perspective images has greatly advanced, maturing to the point where off-the-shelf foundation models are able to reconstruct 3D scene structure not only from multi-view imagery, but even from a single view. A natural extension is 3D reconstruction from panoramas, with the exciting prospect of recovering a full 360-degree scene from a single panoramic image. In this work, we introduce PaGeR (Panoramic Geometry Reconstruction), a framework to lift powerful 3D foundation models designed for perspective imagery to the panorama domain. Our strategy is to start from a pre-trained transformer for 3D reconstruction and turn it into a unified high-performance model that predicts scale-invariant depth, metric depth, surface normals, and sky masks from both perspective and omnidirectional images, in a single forward pass. By keeping architectural changes to a minimum and mixing perspective and panoramic images during training, PaGeR retains the rich 3D prior of the underlying foundation model while learning to also estimate geometrically consistent 360-degree scenes from single panoramas. We extensively test our method in both indoor and outdoor environments and find that it delivers state-of-the-art performance and excellent zero-shot performance across a wide range of scenes. Code, data and models are available at https://github.com/prs-eth/PaGeR.

2605.26277 2026-05-28 cs.CV cs.AI 版本更新

VesselSim: learning 3D blood vessel segmentation without expert annotations

VesselSim: 无需专家标注的3D血管分割学习

Erin Rainville, Melissa Ananian, Tristan Mirolla, Hassan Rivaz, Yiming Xiao

发表机构 * Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada（计算机科学与软件工程系，康科迪亚大学，蒙特利尔，加拿大）； Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada（电气与计算机工程系，康科迪亚大学，蒙特利尔，加拿大）

AI总结提出VesselSim两阶段框架，通过几何驱动的合成血管生成和自监督测试时适应，实现无需真实标注的3D血管分割，在多个临床数据集上达到与有监督方法竞争的性能。

Comments This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published as part of the MICCAI 2026 proceedings in October

AI中文摘要

血管分割是医学图像分析中用于血管疾病护理和手术规划的核心任务，然而提供专家血管标注的挑战对相关深度学习技术的进展构成了主要障碍。为解决这一问题，我们提出了VesselSim，一个用于通用3D血管分割的两阶段框架，在训练过程中无需真实标注数据。首先，我们引入了一个随机的、几何驱动的血管模拟框架，该框架模拟递归分支、曲率控制生长和碰撞感知拓扑，随后通过域随机化强度合成生成16,500个解剖学上合理的3D血管造影体积。其次，仅在此合成数据上训练3D U-Net。为了在推理时弥合从合成图像到真实图像的域差距，我们通过自监督掩码重建解码器引入了一种测试时适应策略，无需先验域知识即可适应未见过的临床扫描。我们在多个真实世界数据集上以零样本设置评估VesselSim，这些数据集涵盖多个解剖区域（包括脑和肾脏）的MR和CT。尽管仅在合成数据上训练，VesselSim的性能与最先进的血管分割基础模型相竞争。这些发现表明，从合成管状结构中学习血管几何对于鲁棒的跨域泛化是有效的，大大减少了对获取的医学成像数据以及更重要的是专家标注的依赖。

英文摘要

Blood vessel segmentation is a core task in medical image analysis for the care of vascular diseases and surgical planning, yet the challenges of providing expert vascular annotations pose a major obstacle for the progress of related deep learning techniques. To address this, we propose VesselSim, a two-stage framework for universal 3D blood vessel segmentation that eliminates the need for real annotated data during training. First, we introduce a stochastic, geometry-driven vascular simulation framework that models recursive branching, curvature-controlled growth, and collision-aware topology, followed by domain-randomized intensity synthesis to generate 16,500 anatomically plausible 3D angiographic volumes. Second, a 3D U-Net is trained solely on this synthetic data. To bridge the domain gap from synthetic to real images at inference time, we introduce a test-time adaptation strategy via a self-supervised mask reconstruction decoder, enabling adaptation to unseen clinical scans without prior domain knowledge. We evaluate VesselSim in a zero-shot setting on multiple real-world datasets spanning MR and CT across several anatomical regions, including the brain and kidneys. Despite being trained exclusively on synthetic data, VesselSim achieves performance competitive with state-of-the-art vascular segmentation foundation models. These findings suggest that learning vessel geometry from synthetic tubular structures is effective for robust cross-domain generalization, substantially reducing the reliance on acquired medical imaging data and more importantly, expert annotations.

2605.25010 2026-05-28 cs.RO cs.AI 版本更新

Performance Comparison of Classical and Neural Sampling Algorithms for Robotic Navigation

经典与神经采样算法在机器人导航中的性能比较

Hichem Cheriet, Badra Khellat Kihel, Samira Chouraqui

发表机构 * Dept. of Economics, Oran 2 Mohamed Ben Ahmed University（奥兰第二大学穆罕默德·本·艾哈迈德经济系）

AI总结本文在含凸凹障碍物的环境中比较了RRT*、Neural RRT*和Neural Informed RRT*三种算法，发现神经引导规划器能生成更短（最多14%）和更平滑（55-75%）的路径，其中Neural Informed RRT*综合性能最优。

Journal ref: Presented at The 3rd Edition of National Conference on Applications of Artificial Intelligence A2I' 26. 2026

AI中文摘要

将人工智能（AI）集成到基于采样的运动规划中为提高自主导航效率提供了新的可能性。本文在包含不同障碍物密度的凸凹障碍物环境中实现并评估了三种算法，即RRT*、Neural RRT*和Neural Informed RRT*。结果表明，与传统RRT*算法相比，神经引导规划器提高了路径质量，生成的路径最多缩短14%，轨迹平滑度提升55-75%。在评估的方法中，Neural Informed RRT*在路径长度和轨迹平滑度方面实现了最佳整体性能。这些结果证明了AI引导采样策略在提高机器人和无人机导航的可靠性和轨迹效率方面的有效性，尽管计算时间略有增加。总体而言，该研究凸显了人工智能在实时机器人路径规划应用中日益增长的重要性。

英文摘要

Integrating artificial intelligence (AI) into sampling-based motion planning provides new possibilities for improving autonomous navigation efficiency. In this paper, three algorithms, namely RRT*, Neural RRT*, and Neural Informed RRT*, are implemented and evaluated on environments containing convex and concave obstacles with different obstacle densities. The obtained results indicate that neural-guided planners improve path quality, producing up to 14% shorter paths and 55-75% smoother trajectories compared with the conventional RRT* algorithm. Among the evaluated methods, Neural Informed RRT* achieves the best overall performance in terms of path length and trajectory smoothness. These results demonstrate the effectiveness of AI-guided sampling strategies for improving reliability and trajectory efficiency in robotic and UAV navigation, despite a slight increase in computation time. Overall, the study highlights the growing importance of artificial intelligence in real-time robotic path planning applications.
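
To make the two reported quantities concrete, the sketch below computes path length and one common smoothness proxy (total absolute heading change) for 2D waypoint paths, then expresses the improvement as a percentage. The abstract does not state the paper's exact smoothness metric, so `total_turning` is an assumption, and the waypoints are made up for illustration rather than taken from the paper.

```python
import math


def path_length(points):
    """Sum of Euclidean segment lengths along a 2D waypoint path."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def total_turning(points):
    """Sum of absolute heading changes in radians; lower means smoother.

    This is one common smoothness proxy and an assumption here; the paper's
    exact smoothness metric is not stated in the abstract.
    """
    headings = [math.atan2(b[1] - a[1], b[0] - a[0]) for a, b in zip(points, points[1:])]
    total = 0.0
    for h0, h1 in zip(headings, headings[1:]):
        diff = (h1 - h0 + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
        total += abs(diff)
    return total


def percent_reduction(baseline, candidate):
    """Relative improvement of candidate over baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


# Made-up waypoints for illustration only, not data from the paper.
baseline_path = [(0, 0), (4, 0), (4, 3)]
guided_path = [(0, 0), (4, 3)]

length_gain = percent_reduction(path_length(baseline_path), path_length(guided_path))
```

A statement such as "up to 14% shorter paths" corresponds to the maximum of `length_gain` over the tested environments, computed against the RRT* baseline.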

2605.24678 2026-05-28 cs.AI cs.CL cs.SD 版本更新

Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

探索感知语音特征用于心理健康护理中的临床决策支持

Vassilis Lyberatos, Edmund G. Dervakos, Eleni Adamidi, Athanasios Voulodimos, Giorgos Stamou

发表机构 * National Technical University of Athens（雅典国立技术大学）； PsychNow

AI总结提出一个基于感知声学和语言特征（如韵律、嗓音质量、语义连贯性、句法结构和讽刺）的系统分析框架，结合统计分析和可解释机器学习（XGBoost与SHAP和LIME），在多个数据集上发现语音特征与抑郁、焦虑和ADHD症状严重度之间的稳定关联，并通过消融研究识别最具信息量的特征组。

Comments Accepted to CLPsych 2026, part of ACL 2026

详情

AI中文摘要

语音和语言技术通过客观且可解释的线索为支持心理健康评估提供了宝贵的机会。我们提出了一个系统的基于特征的分析框架，利用感知基础的声学和语言特征，包括韵律、嗓音质量、语义连贯性、句法结构和讽刺。通过统计分析和可解释机器学习（XGBoost与SHAP和LIME），我们研究了语音特征与抑郁、焦虑和ADHD的已验证症状测量之间的关联。在受控基准数据集（StressID、DAIC-WOZ、Androids、EATD）和真实世界临床数据集上的评估表明，该框架揭示了症状严重度与嗓音不规则性（如shimmer、jitter）、词汇-句法模式和情感基调之间的稳定且一致的关系。跨所有数据集进行的消融研究进一步识别了最具信息量的特征组。这项工作探索了一种透明且临床可解释的基于语音的心理健康分析方法。

英文摘要

Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical-syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.
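
The vocal irregularity measures named above have simple standard definitions. Local jitter is the mean absolute difference between consecutive glottal periods divided by the mean period, and local shimmer is the same ratio computed on per-cycle peak amplitudes. The sketch below implements those textbook definitions. It assumes the periods and amplitudes have already been extracted from the audio, and it makes no claim about the feature-extraction toolchain the authors used.

```python
def local_perturbation(values):
    """Mean absolute difference of consecutive values, relative to their mean."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    mean_abs_diff = sum(abs(b - a) for a, b in zip(values, values[1:])) / (len(values) - 1)
    return mean_abs_diff / (sum(values) / len(values))


def local_jitter(periods):
    """Cycle-to-cycle variation of glottal period durations (unitless ratio)."""
    return local_perturbation(periods)


def local_shimmer(amplitudes):
    """Cycle-to-cycle variation of per-cycle peak amplitudes (unitless ratio)."""
    return local_perturbation(amplitudes)


# Invented example values: periods in seconds, amplitudes in arbitrary units.
steady_voice = local_jitter([0.010, 0.010, 0.010, 0.010])
irregular_voice = local_jitter([0.010, 0.012, 0.009, 0.011])
```

Higher values indicate a less regular voice, which is the direction of association the abstract reports with symptom severity.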

2605.23955 2026-05-28 cs.AI cs.DC cs.LG cs.SI q-fin.CP 版本更新

自适应储层计算用于多场景混沌系统预测

Shadmehr Zaregarizi, Khashayar Yavari

发表机构 * Politecnico di Torino（都灵理工大学）

AI总结提出一种自适应储层计算框架，通过四种定制策略（精确状态同步、直方图引导候选选择、多种子搜索、顺序多序列训练）在CTF-4-Science Lorenz基准的12个任务中取得74.91分，证明其高效竞争力。

Comments 4 pages, 2 figures

AI中文摘要

我们提出了一种自适应储层计算框架，用于CTF-4-Science Lorenz基准测试，该基准评估机器学习模型在十二个不同任务上的表现，这些任务涵盖五种性质不同的场景：基线预测、含噪信号重建、噪声下预测、少样本学习和参数泛化。我们没有采用统一的推理策略，而是根据每个评估场景的具体需求定制回声状态网络（ESNs）的训练和预测过程。我们的主要贡献有四个方面：（1）精确的储层状态同步，消除了短时预测中的预热近似误差；（2）直方图引导的候选选择，直接优化长时间遍历评估指标；（3）多种子储层搜索，适用于训练数据严重受限的少样本场景；（4）顺序多序列训练，解决了参数泛化任务中的状态分布不匹配问题。所提出的框架在公共基准排行榜上获得了74.91分，表明精心调整的储层计算对于多样化的混沌系统建模挑战是一种具有竞争力和计算效率的方法。

英文摘要

We present an adaptive reservoir computing framework for the CTF-4-Science Lorenz benchmark, which evaluates machine learning models across twelve distinct tasks spanning five qualitatively different scenarios: baseline forecasting, noisy signal reconstruction, forecasting under noise, few-shot learning, and parametric generalization. Rather than applying a uniform inference strategy, we tailor the training and prediction procedure of Echo State Networks (ESNs) to the specific demands of each evaluation scenario. Our key contributions are fourfold: (1) exact reservoir state synchronization that eliminates warmup approximation error in short-time prediction; (2) histogram-guided candidate selection that directly optimizes the long-time ergodic evaluation metric; (3) multi-seed reservoir search for few-shot regimes with severely limited training data; and (4) sequential multi-sequence training that resolves state-distribution mismatch in parametric generalization tasks. The proposed framework achieves a score of 74.91 on the public benchmark leaderboard, demonstrating that carefully adapted reservoir computing constitutes a competitive and computationally efficient approach for diverse chaotic system modeling challenges.

2605.28144 2026-05-28 cs.AI 版本更新

Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

解构空间复杂性：用于LLM空间推理的层次分解

Yi Wang, Haojie Lu, Zhaofan Zhang, Li Chen, Sihong Xie

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出一种层次任务分解方法，结合MCTS引导的组相对策略优化（M-GRPO），通过改进中间状态选择和规划能力，显著提升LLM在导航、规划和策略游戏等空间任务中的表现。

Comments 8 pages

AI中文摘要

LLMs在通用语言理解和推理方面表现出色。然而，它们在空间推理方面始终表现不佳，这严重限制了它们的应用，特别是在具身智能领域。受层次强化学习成功的启发，本文介绍了一种新颖的LLM空间推理层次任务分解方法。我们的方法通过识别关键中间状态并生成简化的子环境，引导LLMs将复杂任务分解为可管理的子任务。然而，我们发现LLMs由于缺乏足够的空间先验知识，往往无法推导出最优的中间状态，导致次优的任务分解。为了解决这一限制并增强其规划能力，我们提出了MCTS引导的组相对策略优化（M-GRPO），其中我们通过结合LLM的先验预测概率及其认知不确定性来重新制定UCT公式。此外，我们实现了一个更细粒度的优势函数，使模型能够学习最优路径规划。实验结果表明，我们的方法显著提高了LLM在空间任务（包括导航、规划和策略游戏）上的性能，达到了最先进的结果。这项工作为LLM在现实世界中的应用铺平了道路。

英文摘要

LLMs have shown remarkable proficiency in general language understanding and reasoning. However, they consistently underperform in spatial reasoning, which severely limits their application, particularly in embodied intelligence. Inspired by the success of hierarchical reinforcement learning, this paper introduces a novel method for hierarchical task decomposition in LLM spatial reasoning. Our approach guides LLMs to decompose complex tasks into manageable sub-tasks by identifying key intermediate states and generating simplified sub-environments. However, we identify that LLMs often fail to derive optimal intermediate states due to their insufficient spatial prior, leading to sub-optimal task decomposition. To address this limitation and enhance their planning capability, we propose the MCTS-Guided Group Relative Policy Optimization (M-GRPO), where we reformulate the UCT formula by incorporating the LLM's prior predictive probabilities alongside its epistemic uncertainty. Furthermore, we implement a more fine-grained advantage function, enabling the model to learn optimal path planning. Experimental results demonstrate that our method substantially improves LLM performance on spatial tasks, including navigation, planning, and strategic games, achieving state-of-the-art results. This work paves the way for LLMs in real-world applications.
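
The abstract does not give the reformulated UCT formula or the fine-grained advantage function, so the sketch below shows only the standard building blocks they modify: classic UCT, a PUCT-style score in which a prior probability scales the exploration bonus (used here as a stand-in, without the epistemic-uncertainty term), and the group-relative advantage from GRPO. The function names, the constant `c=1.4`, and the choice of population standard deviation are all assumptions.

```python
import math
import statistics


def uct_score(mean_value, parent_visits, child_visits, c=1.4):
    """Classic UCT: exploitation plus a visit-count exploration bonus."""
    if child_visits == 0:
        return math.inf  # unvisited children are tried first
    return mean_value + c * math.sqrt(math.log(parent_visits) / child_visits)


def prior_weighted_score(mean_value, prior, parent_visits, child_visits, c=1.4):
    """PUCT-style variant: the bonus is scaled by a prior probability for the action.

    Stand-in only; the paper's formula also uses epistemic uncertainty, omitted here.
    """
    return mean_value + c * prior * math.sqrt(parent_visits) / (1 + child_visits)


def group_relative_advantages(rewards):
    """Standardize rewards within one sampled group, as in GRPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Invented rewards for four rollouts sampled from the same prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With a prior in the bonus, an action the LLM considers likely is explored before an unlikely one at equal visit counts, which is the general effect that weighting the search by the model's predictive probabilities aims for.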

2605.28139 2026-05-28 cs.AI 版本更新

Data-Efficient On-Policy Distillation for Automatic Speech Recognition

数据高效的在线策略蒸馏用于自动语音识别

Yu Lin, Yiming Wang, Runyuan Cai, Xiaodong Zeng

发表机构 * AutoArk-AI

AI总结提出一种在线策略蒸馏方法，利用教师模型（Qwen-ASR）在仅10万小时语音数据下提升学生模型（Ark-ASR）的识别能力，在多个基准上超越同规模基线。

AI中文摘要

构建有竞争力的自动语音识别（ASR）模型通常需要大规模的音频监督，这使得复现和专业化成本高昂。我们研究了Ark-ASR，一个基于100k小时语音训练的0.6B参数音频条件语言模型，并检验了强大的Qwen-ASR教师能否通过在线策略蒸馏传递额外的识别能力。在普通话和英语ASR基准上，所提出的训练方案一致地优于仅进行监督微调，并在五个评估集中的四个上超越了同规模的Qwen3-ASR-0.6B基线。这仅使用了100k小时的语音，而Qwen3-Omni AuT编码器报告使用了20M小时的监督音频。更大的Qwen3-ASR-1.7B仍然更强，但结果表明，在更小的音频预算下，教师指导的在线策略训练可以显著缩小紧凑型ASR模型的差距。支持重叠诊断进一步表明，教师-数据阶段改善了局部学生-教师兼容性，这与最近关于在线策略蒸馏何时有效的分析一致。

英文摘要

Building competitive automatic speech recognition (ASR) models usually requires large-scale audio supervision, which makes reproduction and specialization expensive. We study Ark-ASR, a 0.6B-parameter audio-conditioned language model trained with 100k hours of speech, and examine whether a strong Qwen-ASR teacher can transfer additional recognition capability through on-policy distillation. Across Mandarin and English ASR benchmarks, the proposed training recipe consistently improves over supervised fine-tuning alone and outperforms the same-scale Qwen3-ASR-0.6B baseline on four of five evaluation sets. This is achieved with only 100k hours of speech, compared with the 20M hours of supervised audio reported for the Qwen3-Omni AuT encoder. The larger Qwen3-ASR-1.7B remains stronger, but the results show that teacher-guided on-policy training can substantially close the gap for compact ASR models under a much smaller audio budget. A support-overlap diagnostic further suggests that the teacher-data stage improves local student-teacher compatibility, matching recent analyses of when on-policy distillation is effective.

2605.28129 2026-05-28 cs.AI 版本更新

Do Clinical Models Change Treatment Decisions?

临床模型是否会改变治疗决策？

Dongkyu Cho, Miao Zhang, Rumi Chunara

发表机构 * New York University（纽约大学）

AI总结本研究提出ClinPivot基准，通过生物医学关系和变化的患者情境评估临床基础模型在治疗决策中的适应性，发现强医学QA能力不能可靠预测决策表现。

Comments 9 pages, 3 figures

2605.28124 2026-05-28 cs.AI 版本更新

Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction

梯度步进即插即用模型用于牙科锥束CT重建

Idris Tatachak, Luis Kabongo, Nicolas Papadakis, Xavier Ripoche, Simon Rit

发表机构 * INSA-Lyon, Université Claude Bernard Lyon 1, CNRS, Inserm, CREATIS UMR 5220, U1294（里昂国立应用科学学院、里昂第一大学、CNRS、Inserm、CREATIS UMR 5220、U1294）； Univ. Bordeaux, CNRS, Inria, Bordeaux INP, IMB, UMR 5251（波尔多大学、CNRS、Inria、Bordeaux INP、IMB、UMR 5251）； ACTEON Group, France（ACTEON集团，法国）

AI总结提出一种基于梯度步进去噪器的即插即用算法，通过模拟扇形束采集并添加光子噪声训练先验，有效减少牙科锥束CT重建中的光子噪声。

Comments CT Meeting 2026 - 9th International Conference on Image Formation in X-Ray Computed Tomography, Jun 2026, Salt Lake City, United States

2605.28122 2026-05-28 cs.CR cs.AI cs.CL 版本更新

SNARE: Adaptive Scenario Synthesis for Eliciting Overeager Behavior in Coding Agents

SNARE: 自适应场景合成以诱发编码代理中的过度行为

Yubin Qu, Yi Liu, Gelei Deng, Yanjun Zhang, Yuekang Li, Ying Zhang, Leo Yu Zhang

发表机构 * Griffith University（格里菲斯大学）； Quantstamp； Nanyang Technological University（南洋理工大学）； UNSW Sydney（新南威尔士大学）； Wake Forest University（维克森林大学）

AI总结提出SNARE流水线，通过组合良性场景片段并使用无评判器预言机评分与汤普森采样，自适应地诱发编码代理的过度行为，并在4×5代理-模型矩阵上评估，发现19.51%的良性运行触发过度行为，且代理框架比模型影响更大。

AI中文摘要

编码代理以一系列shell、文件和网络操作执行良性任务，其中任何操作都可能悄然超出授权范围而任务仍完成。我们称此为过度行为：提示并非对抗性且运行成功，但超出范围的操作可能泄露凭据或删除文件。现有基准未能捕捉：任务完成套件认可任何完成的运行，越狱套件探测对抗性提示，而先前唯一的过度行为基准对每个代理-模型对应用单一固定提示集，导致其最易和最难的配对测量不足。我们提出SNARE（为非对抗场景合成自适应奖励引导诱发），该流水线从可重用范围和陷阱片段组合良性场景，用无评判器预言机对每次运行评分，标记陷阱模式匹配及未经请求的文件添加或删除，并使用汤普森采样将每对运行预算导向最常触发它的场景。在24个过度行为原型上实例化得到OverEager，我们在四个编码代理和五个基础模型的4×5矩阵上运行。在10,000次良性运行中，19.51%触发过度行为，每对比率跨度达11.9倍。这种变化由代理框架驱动，而非模型：框架占56%而模型占21%，因此任何单一框架或单一模型评估都会低估矩阵约五分之一。

英文摘要

A coding agent executes a benign task as a sequence of shell, file, and network actions, any of which can quietly exceed the authorized scope while the task still completes. We call this overeager behavior: the prompt is not adversarial and the run succeeds, yet an out-of-scope step can leak credentials or delete files. Existing benchmarks miss it: task-completion suites credit any finished run, jailbreak suites probe adversarial prompts, and the one prior overeager benchmark applies a single fixed prompt set to every agent-model pair, leaving its easiest and most resistant pairs under-measured. We present SNARE (Synthesizing Non-adversarial scenarios for Adaptive Reward-guided Elicitation), a pipeline that composes benign scenarios from reusable scope and trap fragments, scores each run with a judge-free oracle flagging trap-pattern matches and unsolicited file additions or deletions, and uses Thompson sampling to steer each pair's run budget toward the scenarios that most often trigger it. Instantiating it over 24 overeager archetypes yields OverEager, which we run across a 4x5 matrix of four coding agents and five base models. Across 10,000 benign runs, 19.51% trigger overeager behavior, with per-pair rates spanning 11.9x. This variation is driven by the agent framework, not the model: the framework accounts for 56% of it against the model's 21%, so any single-framework or single-model evaluation undercounts the matrix by about a fifth.
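
The budget-steering step can be illustrated with Beta-Bernoulli Thompson sampling, where each scenario keeps a posterior over its trigger rate and runs go to whichever scenario draws the highest sampled rate. The sketch below is a minimal illustration under stated assumptions, not SNARE's implementation: the function `thompson_allocate`, the uniform Beta(1, 1) prior, and the per-scenario trigger rates are invented, and the run outcome is simulated by a coin flip in place of the paper's judge-free oracle.

```python
import random


def thompson_allocate(trigger_probs, budget, seed=0):
    """Spend `budget` runs across scenarios with Beta-Bernoulli Thompson sampling."""
    rng = random.Random(seed)
    successes = [0] * len(trigger_probs)  # runs that triggered the behavior
    failures = [0] * len(trigger_probs)   # runs that did not
    for _ in range(budget):
        # Draw a plausible trigger rate per scenario from its Beta posterior.
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        chosen = max(range(len(draws)), key=draws.__getitem__)
        # Simulated run outcome; a real oracle would inspect the agent's actions.
        if rng.random() < trigger_probs[chosen]:
            successes[chosen] += 1
        else:
            failures[chosen] += 1
    return [s + f for s, f in zip(successes, failures)]


# Hypothetical per-scenario trigger rates, invented for the demonstration.
runs_per_scenario = thompson_allocate([0.05, 0.20, 0.60], budget=500)
```

Over the run, most of the budget drifts toward the scenario with the highest trigger rate while the others still receive occasional runs, which is how an adaptive budget avoids under-measuring resistant agent-model pairs.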

2605.28120 2026-05-28 cs.CL cs.AI cs.MA 版本更新

LegalGraphRAG: Multi-Agent Graph Retrieval-Augmented Generation for Reliable Legal Reasoning

LegalGraphRAG：面向可靠法律推理的多智能体图检索增强生成

Zerui Chen, Qinggang Zhang, Zhishang Xiang, Zhimin Wei, Linfeng Gao, Xiao Huang, Zhihong Zhang, Jinsong Su

发表机构 * School of Informatics, Xiamen University（厦门大学信息学院）； Institute of Artificial Intelligence, Xiamen University（厦门大学人工智能研究院）； The Hong Kong Polytechnic University（香港理工大学）

AI总结提出LegalGraphRAG框架，通过分层法律图和多智能体系统（研究员、审计员、裁决员）实现可靠的法律推理，在准确性和可信度上超越现有GraphRAG基线。

Comments 30 pages, 18 figures, ACL 2026 Main Conference. Project page: https://github.com/XMUDeepLIT/LegalGraphRAG

AI中文摘要

基于图的检索增强生成（GraphRAG）通过将知识结构化为关系图，推进了平面文档检索，实现了更连贯和有效的推理。然而，将其应用于法律推理等特定领域面临关键挑战。(i) 法律语料库是异构的，包含来自案例、法条和解释的多粒度知识。平面知识图无法充分区分事实细节、适用规则和抽象原则，限制了准确检索。(ii) 可靠的法律判决需要透明、基于证据的推理。传统的RAG直接将检索到的上下文传递给LLM而不进行验证，导致推理不透明且易出错。为此，我们提出了LegalGraphRAG，一个专为可靠法律推理设计的框架。我们的方法引入了两个核心组件：一个分层法律图，用于分层组织法律来源，以便在适当的抽象级别进行检索；以及一个用于可靠法律推理的多智能体系统，其中研究员检索候选证据，审计员严格验证其相对于源文档的有效性，裁决员综合已验证的证据集作出最终判决。大量实验表明，LegalGraphRAG达到了最先进的性能，在准确和可信的法律分析方面优于现有的GraphRAG基线。我们的代码、数据集和实现细节可在https://github.com/XMUDeepLIT/LegalGraphRAG获取。

英文摘要

Graph-based Retrieval-Augmented Generation (GraphRAG) advances flat document retrieval by structuring knowledge as relational graphs, enabling more coherent and effective reasoning. However, applying it to specific domains like legal reasoning faces critical challenges. (i) Legal corpora are heterogeneous, containing multi-granular knowledge from cases, articles and interpretations. A flat knowledge graph cannot adequately differentiate between factual details, applied rules, and abstract principles, limiting accurate retrieval. (ii) Reliable legal judgment demands transparent, evidence-based reasoning. Traditional RAG passes retrieved context directly to an LLM without verification, resulting in opaque, error-prone reasoning. To this end, we propose LegalGraphRAG, a framework designed for reliable legal reasoning. Our approach introduces two core components: a hierarchical legal graph that hierarchically organizes legal sources to enable retrieval at appropriate abstraction levels, and a multi-agent system for reliable legal reasoning, where a Researcher retrieves candidate evidence, an Auditor rigorously verifies its validity against source documents, and an Adjudicator synthesizes the set of verified evidence to render a final judgment. Extensive experiments show that LegalGraphRAG achieves the state-of-the-art performance, outperforming existing GraphRAG baselines in accurate and trustworthy legal analysis. Our code, datasets and implementation details are available at https://github.com/XMUDeepLIT/LegalGraphRAG.
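The Researcher, Auditor, and Adjudicator roles described above can be sketched as a three-stage pipeline. This is a toy illustration only: the function names, the keyword-overlap retrieval, and the substring check used for verification are assumptions, and the actual system uses LLM agents over a hierarchical legal graph.

```python
def researcher(query, corpus):
    """Propose candidate (doc_id, quote) evidence sharing a word with the query."""
    terms = set(query.lower().split())
    return [(doc_id, text) for doc_id, text in corpus.items()
            if terms & set(text.lower().split())]

def auditor(candidates, corpus):
    """Keep only candidates whose quote literally appears in the cited source."""
    return [(doc_id, quote) for doc_id, quote in candidates
            if quote in corpus.get(doc_id, "")]

def adjudicator(verified):
    """Summarize the verified evidence set; a stand-in for the final judgment."""
    if not verified:
        return {"evidence": [], "judgment": "insufficient verified evidence"}
    ids = [doc_id for doc_id, _ in verified]
    return {"evidence": ids, "judgment": f"decide on {len(ids)} verified item(s)"}

corpus = {
    "art_12": "a contract requires offer and acceptance",
    "case_7": "the court found no acceptance",
}
candidates = researcher("acceptance of contract", corpus)
# Add a distorted quote to show what the audit step is meant to filter out.
candidates.append(("case_7", "the court found acceptance"))
result = adjudicator(auditor(candidates, corpus))
```

The point of the middle stage is that the final judgment only ever sees evidence that survived a check against the source, so a distorted quote is dropped before it can influence the outcome.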

2605.28116 2026-05-28 cs.CR cs.AI cs.CL 版本更新

防御基于LLM的多智能体系统免受合作攻击：句子级纠正方法

Yaoyang Luo, Zhi Zheng, Ziwei Zhao, Tong Xu, Zhao Jielun, Wenjun Xue, Yong Chen, Enhong Chen

发表机构 * University of Science and Technology of China（中国科学技术大学）； North Automatic Control Technology Institute（北方自动控制技术研究所）； Shenzhen Institute for Advanced Study, UESTC（电子科技大学（深圳）高等研究院）

AI总结提出一种自适应合作攻击框架，并引入句子级可信度分析与纠正（STAR）防御框架，以识别和纠正多智能体通信中的误导信息，显著提升任务成功率。

AI中文摘要

近年来，基于大型语言模型的多智能体系统（MAS）发展迅速，其在协作决策和复杂问题解决方面表现出色。然而，MAS中的恶意智能体可能注入错误信息以误导其他智能体并破坏系统性能，这催生了一个新的研究方向，即关注MAS中的攻击机制和防御策略。以往的研究大多假设恶意智能体独立行动，并研究相应的防御策略。然而，我们认为恶意智能体可能表现出协作行为，通过内部信息交换实现更有效的攻击。在本文中，我们提出了一种自适应合作攻击框架，其中恶意智能体通过多轮交互自主协调并动态调整其攻击策略。此外，我们引入了句子级可信度分析与纠正（STAR），这是一种在智能体通信中识别和纠正句子级误导信息的防御框架。我们的实验表明，合作攻击导致任务成功率的下降幅度显著大于独立攻击，相对下降5.34%。同时，STAR有效缓解了合作和独立威胁，平均提高任务成功率36.76%。代码可在https://github.com/smoooom/STAR获取。

英文摘要

Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. However, malicious agents in MAS may inject misinformation to mislead other agents and disrupt system performance, giving rise to a new research direction that focuses on attack mechanisms and defense strategies in MAS. Prior studies largely assume malicious agents act independently and investigate the corresponding defense strategies. However, we argue that malicious agents may exhibit collaborative behaviors, enabling more effective attacks through internal information exchange. In this paper, we propose an adaptive cooperative attack framework, where malicious agents autonomously coordinate and dynamically adjust their attack strategies through multi-round interactions. Furthermore, we introduce Sentence-Level Trustworthiness Analysis and Rectification (STAR), a defense framework that identifies and rectifies misleading information at the sentence level within agent communications. Our experiments show that cooperative attacks lead to a significantly larger degradation in task success rate than independent attacks, resulting in a relative drop of 5.34%. Meanwhile, STAR effectively mitigates both cooperative and independent threats and improves task success rate by an average of 36.76%. The code is available at https://github.com/smoooom/STAR.

2605.28102 2026-05-28 cs.AI 版本更新

审视多智能体系统中智能体的偏见放大与抑制

Zejian Eric Wu, Zhongyi Jiang, Yuan Zhuang, Paul Jen-Hwa Hu

发表机构 * Oregon State University（俄勒冈州立大学）； Independent Researcher（独立研究者）； Amazon（亚马逊）； University of Utah（犹他大学）

AI总结研究多智能体系统中个体偏见如何影响系统级公平性，提出FBS指标量化偏见变化，发现均匀暴露偏见时系统偏见甚至超过个体偏见之和。

AI中文摘要

多智能体系统越来越多地被部署以支持各种任务，其中智能体相互作用以实现个体和集体目标。尽管这些系统可以提高任务性能和决策能力，但通过减少偏见来维护公平性仍然具有挑战性。本研究考察了智能体层面的偏见如何转变并影响系统范围的公平性。我们使用提示将个体智能体暴露于群体偏向偏见，然后评估下游对系统层面的影响。为了量化影响，我们提出了Favor Bias Strength (FBS)，一个以零为中心的度量，将偏见变化分解为受青睐群体的提升和不受青睐群体的抑制。通过使用多种智能体设计、基准和最新的语言模型，我们表明具有偏见的智能体可以显著影响系统范围的公平性。有趣的是，当智能体均匀暴露于偏见时，系统范围的偏见会升高，甚至超过个体智能体偏见的累加和。实证证据强调了多智能体系统中公平性的关键性，这需要进一步的分析和实证测试。

英文摘要

Multi-agent systems are increasingly deployed to support various tasks where agents interact to achieve individual and collective objectives. Although these systems can enhance task performance and decision-making, fairness preservation through bias reduction remains challenging. This study examines how agent-level biases shift and impact system-wide fairness. We use prompts to expose individual agents to group-favoring bias, then assess downstream impacts at the system level. To quantify the impact, we propose Favor Bias Strength (FBS), a zero-centered metric that decomposes bias alteration between favored-group uplift and disfavored-group suppression. Using multiple agent designs, benchmarks, and up-to-date large language models, we show that agents endowed with bias can substantially affect system-wide fairness. Interestingly, when agents are exposed to bias uniformly, the system-wide bias elevates, even exceeding the additive sum of the individual agents' biases. The empirical evidence underscores the criticality of fairness in multi-agent systems, which warrants further analyses and empirical tests.

2605.28089 2026-05-28 cs.AI 版本更新

BuddyBench: A Privacy-Constrained Multi-Task Benchmark for Pediatric Social-Communication Personalization

BuddyBench：面向儿科社交沟通个性化的隐私约束多任务基准

Jeyeon Eo, Joo Young Kim, Ran Ju, Minyoung Jung, Unggi Lee

发表机构 * Independent Researcher（独立研究者）； Neudive Inc.（Neudive公司）； Korea University（高丽大学）

AI总结 BuddyBench通过整合观察队列和随机对照试验队列，构建了一个隐私约束的多任务基准，支持知识追踪、下一练习推荐、临床预测和因果推断，将行为个性化与临床评估联系起来。

Comments 30 pages, 4 figures

AI中文摘要

BuddyBench引入了一个面向儿科社交沟通个性化的隐私约束多任务基准。与主要强调影像、遗传学或横断面临床表型的现有神经发育数据集不同，BuddyBench在统一的基准模式中链接了练习级学习轨迹、标准化临床评估、BuddyPlan自我报告和随机治疗终点。BuddyBench结合了两个队列：ND-03是一个观察队列，对任务1-2有密集的练习覆盖（n=189），ND-02是一个随机对照试验队列，用于任务3-4（n=86 ITT）。它们共同支持知识追踪、下一练习推荐、临床预测和因果推断，将行为个性化与临床评估联系起来。我们还引入了BuddyBench-Sim，一个用于可重复评估的合成配套数据集。基线方法在保护儿科临床记录的同时，展示了跨任务的信号。

英文摘要

BuddyBench introduces a privacy-constrained multi-task benchmark for pediatric social-communication personalization. Unlike existing neurodevelopmental repositories that primarily emphasize imaging, genetics, or cross-sectional clinical phenotyping, BuddyBench links drill-level learning trajectories, standardized clinical assessments, BuddyPlan self-report, and randomized-treatment endpoints within a unified benchmark schema. BuddyBench combines two cohorts: ND-03 is an observational cohort with dense drill coverage for Tasks 1-2 (n = 189), and ND-02 is a randomized controlled trial cohort for Tasks 3-4 (n = 86 ITT). Together, they support knowledge tracing, next-drill recommendation, clinical prediction, and causal inference, linking behavioral personalization to clinical evaluation. We additionally introduce BuddyBench-Sim, a synthetic companion dataset for reproducible evaluation. Baselines show signal across tasks while keeping pediatric clinical records protected.

2605.28084 2026-05-28 cs.CL cs.AI 版本更新

SMILE-Next: Teaching Large Language Models to Detect, Classify, and Reason about Laughter

SMILE-Next: 教授大型语言模型检测、分类和推理笑声

Lee Jung-Mok, Kim Sung-Bin, Joohyun Chang, Lee Hyun, Tae-Hyun Oh

发表机构 * School of EE, KAIST（韩国科学技术院电子工程系）； Dept. of EE, POSTECH（POSTECH电子工程系）； School of Computing, KAIST（韩国科学技术院计算机科学系）

AI总结提出SMILE-Next数据集和包含笑声特定Self-Instruct与混合笑声专家框架的方法，用于实现多模态笑声理解，显著优于基线模型。

Journal ref: Annual Meetings of the Association for Computational Linguistics 2026

AI中文摘要

笑声是一种复杂的社会信号，传达超越娱乐的交际意图。虽然先前的工作集中在孤立的笑声分析任务上，但在现实场景中对笑声的全面理解仍未得到充分探索。因此，我们引入了SMILE-Next，一个用于现实世界笑声理解的数据集，具有多模态文本表示和跨三个任务的问答标注：笑声检测、笑声类型分类和笑声推理。基于SMILE-Next，我们旨在开发一个能够细致理解现实语境中笑声的笑声专用大型语言模型。为此，我们提出了两个关键组件：笑声特定Self-Instruct和混合笑声专家框架。笑声特定Self-Instruct通过自动合成多样化的以笑声为中心的指令，增强了跨任务和领域的泛化能力。MoLE引入了一种任务自适应专家路由机制，动态选择针对每个笑声相关任务定制的专用专家，提高了任务特定性能和效率。实验结果表明，我们提出的组件的组合显著优于多模态LLM基线，推动了鲁棒的现实世界笑声理解。项目页面位于：https://mok0102.github.io/smile-next/。

英文摘要

Laughter is a complex social signal that conveys communicative intent beyond amusement. While prior work has focused on isolated laughter analysis tasks, a comprehensive understanding of laughter in real-world scenarios remains underexplored. Therefore, we introduce SMILE-Next, a dataset for real-world laughter understanding with multimodal textual representations and question-answer annotations across three tasks: laughter detection, laughter type classification, and laughter reasoning. Building upon SMILE-Next, we aim to develop a laughter-specialized large language model capable of nuanced understanding of laughter in real-world contexts. To this end, we propose two key components: laughter-specific Self-Instruct and the Mixture-of-Laugh-Experts (MoLE) framework. Laughter-specific Self-Instruct enhances generalization across tasks and domains by automatically synthesizing diverse laughter-centric instructions. MoLE introduces a task-adaptive expert routing mechanism that dynamically selects specialized experts tailored to each laughter-related task, improving task-specific performance and efficiency. Experimental results show that the combination of our proposed components substantially outperforms multimodal LLM baselines, advancing robust real-world laughter understanding. Project page is at: https://mok0102.github.io/smile-next/.

2605.28078 2026-05-28 cs.CR cs.AI cs.LG stat.ML 版本更新

Mind the Gap: Mixtures of Gaussians in Approximate Differential Privacy

注意差距：近似差分隐私中的高斯混合机制

Huikang Liu, Aras Selvi, Wolfram Wiesemann

发表机构 * Shanghai Jiao Tong University（上海交通大学）； UCL School of Management（伦敦大学学院管理学院）； Imperial Business School（帝国理工学院商学院）

AI总结针对已知敏感度的标量实值查询函数，设计了一类混合高斯加性噪声机制，在中等和低隐私保护强度的情形下显著降低噪声幅度和方差，接近最优性。

Comments ICML 2026 style: 9 main pages followed by acknowledgements, references, appendices

AI中文摘要

我们设计了一类加性噪声机制，使具有已知敏感度的标量实值查询函数满足 $(\varepsilon, \delta)$-差分隐私（DP），并特别关注中等和低隐私保护强度的情形。这些机制称为混合机制（mixture mechanisms），通过混合多个高斯分布构建，这些高斯分布共享相同的方差，但均值和混合权重不同。得到的分布可以解释为零均值高斯（如解析高斯机制中所用）和额外高斯（其均值取决于查询函数的敏感度）的凸组合。我们推导了 $(\varepsilon, \delta)$-DP 所需方差的紧致条件，并提供了高效算法来计算它们。与解析高斯机制相比，我们的机制产生了显著更低的期望噪声幅度（$\ell_1$-损失）和方差（零均值分布的 $\ell_2$-损失）。在促使我们进行该设计的低隐私情形下，我们的机制接近最优，几乎消除了解析高斯机制的全部最优性差距。

英文摘要

We design a class of additive noise mechanisms that satisfy $(\varepsilon, \delta)$-differential privacy (DP) for scalar, real-valued query functions with known sensitivities, with a particular focus on moderate and low-privacy regimes. These mechanisms, which we call mixture mechanisms, are constructed by mixing multiple Gaussian distributions that share the same variance but differ in their means and mixture weights. The resulting distributions can be interpreted as convex combinations of a zero-mean Gaussian (as used in the analytic Gaussian mechanism) and additional Gaussians whose means depend on the sensitivity of the query function. We derive tight conditions on the variances required for $(\varepsilon, \delta)$-DP and provide efficient algorithms to compute them. Compared to the analytic Gaussian mechanism, our mechanisms yield substantially lower expected noise amplitudes ($\ell_1$-loss) and variances ($\ell_2$-loss for zero-mean distributions). In the low-privacy regime that motivates our design, our mechanisms approach optimality, mitigating nearly all of the optimality gap of the analytic Gaussian mechanism.
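The structure of such a mixture distribution (shared variance, different means and weights) can be sketched as follows. The means, weights, and `sigma` below are placeholder values chosen for illustration; they are not calibrated to any $(\varepsilon, \delta)$ and therefore carry no privacy guarantee. The symmetric three-component layout is also an assumption, not the paper's construction.

```python
import random

def mixture_noise(means, weights, sigma, rng):
    """Draw one sample from a Gaussian mixture whose components share std `sigma`."""
    mu = rng.choices(means, weights=weights, k=1)[0]  # pick a component
    return rng.gauss(mu, sigma)                       # sample from it

def mixture_mean(means, weights):
    """Mean of the mixture: weighted average of component means."""
    return sum(w * m for w, m in zip(weights, means))

def mixture_variance(means, weights, sigma):
    """Variance of the mixture: shared variance plus spread of the means."""
    mean = mixture_mean(means, weights)
    return sigma ** 2 + sum(w * m * m for w, m in zip(weights, means)) - mean ** 2

# Hypothetical example: sensitivity 1, a zero-mean component plus two
# components offset by the sensitivity. NOT calibrated for privacy.
sensitivity = 1.0
means = [-sensitivity, 0.0, sensitivity]
weights = [0.1, 0.8, 0.1]
sigma = 0.5

rng = random.Random(0)
noisy_answer = 42.0 + mixture_noise(means, weights, sigma, rng)
```

With symmetric means and weights the noise stays zero-mean, so the released answer is unbiased, and the total variance is the shared component variance plus the weighted spread of the component means.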

2605.28077 2026-05-28 cs.AI 版本更新

MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing

MACReD：一种用于反应图解解析的多智能体协作推理框架

Chuang Tang, Chenhao Lin, Yin Xu, Hao Wang, Jinrui Zhou, Xin Li, Mingjun Xiao, Enhong Chen

发表机构 * University of Science and Technology of China & iFLYTEK Co., Ltd., Hefei, China（中国科学技术大学与科大讯飞股份有限公司，合肥）

AI总结提出MACReD分层多智能体框架，通过协调分子感知、箭头理解、文本提取和反应重建等专用智能体，在统一VLM引导架构下实现化学图解解析，在RxnScribe基准上达到最优性能。

Comments Preprint. Code is available at https://github.com/TC9905/MACReD

AI中文摘要

由于异构布局、交错的视觉元素以及识别与推理整合的困难，从科学文献中解析化学反应图解具有挑战性。现有的视觉语言模型虽然推进了多模态理解，但在复杂图解上仍然失败，难以在推理过程中保持空间连贯性和整合多维信息。为解决这些问题，我们提出了MACReD，一个分层多智能体框架，在统一的VLM引导架构中协调专用智能体进行分子感知、箭头理解、文本提取和反应重建。规划和感知层使用灵活、细粒度的检测来处理视觉复杂性，而推理层使用多图融合机制来整合异构线索并强制执行化学一致的全局推理。在RxnScribe基准上的实验表明，MACReD达到了最先进的性能，在硬匹配和软匹配标准下F1分数分别为75.2%和84.6%，优于RxnScribe基线的69.1%和80.0%。这些结果证明了MACReD在不同图解布局（包括多步和树状结构反应）中的鲁棒性。

英文摘要

Parsing chemical reaction diagrams from scientific literature is challenging due to heterogeneous layouts, intertwined visual elements, and the difficulty of integrating recognition and reasoning. Existing vision-language models advance multimodal understanding but still fail on complex diagrams, struggling to maintain spatial coherence and to integrate multidimensional information during reasoning. To address these issues, we propose MACReD, a hierarchical multi-agent framework that coordinates specialized agents for molecular perception, arrow understanding, text extraction, and reaction reconstruction within a unified VLM-guided architecture. The planning and perception layers use flexible, fine-grained detection to handle visual complexity, while the reasoning layer uses a multigraph fusion mechanism to integrate heterogeneous cues and enforce chemically consistent global reasoning. Experiments on the RxnScribe benchmark show that MACReD achieves state-of-the-art performance, with F1 scores of 75.2% and 84.6% under hard and soft match criteria, outperforming the RxnScribe baseline, which obtains 69.1% and 80.0%, respectively. These results demonstrate the robustness of MACReD across diverse diagram layouts, including multi-step and tree-structured reactions.

2605.28073 2026-05-28 cs.CL cs.AI 版本更新

StoryLens: Preference-Aligned Story Rewriting via Context-Aware Narrative Enrichment

StoryLens: 通过上下文感知叙事丰富实现偏好对齐的故事重写

Hanwen Cui, Yuting Mei, Yuhang Fu, Dingyi Yang, Qin Jin

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； AIM3 Lab, Renmin University of China（中国人民大学AIM3实验室）； Nanyang Technological University（南洋理工大学）

AI总结针对故事重写中读者偏好对齐问题，提出结合上下文感知叙事丰富的方法，构建基准STORYLENSBENCH、奖励模型STORYLENSEVAL和两阶段重写模型STORYLENSWRITER，实验表明上下文增强显著提升用户满意度。

Comments 16 pages, 7 figures, 15 tables

AI中文摘要

故事重写旨在适应不同读者偏好的同时保持情节一致性和叙事连贯性。与传统的风格迁移工作不同，我们认为有效的故事重写需要上下文感知的叙事丰富，而不仅仅是表面层面的风格适应。我们的初步人类研究表明，仅风格适应对读者满意度的提升微乎其微（2.3%），而上下文增强的重写则显著改善了用户偏好对齐（24.5%）。受此启发，我们引入了STORYLENSBENCH，一个用于偏好对齐故事重写的大规模基准，包含结构化故事书、多维读者偏好档案和排序后的上下文感知重写故事。基于该基准，我们提出了STORYLENSEVAL，一个用于估计重写故事读者满意度的奖励模型，以及STORYLENSWRITER，一个结合监督微调和基于GRPO的强化学习的两阶段重写模型。我们进一步建立了一个涵盖忠实度、连贯性和读者满意度的综合评估框架。实验结果表明，STORYLENSWRITER持续优于强大的生成和个性化基线，突显了上下文感知叙事丰富对于个性化故事重写的重要性。

英文摘要

Story rewriting aims to adapt existing narratives to diverse reader preferences while preserving plot consistency and narrative coherence. Unlike conventional work on style transfer, we argue that effective story rewriting demands context-aware narrative enrichment beyond surface-level stylistic adaptation. Our pilot human study shows that style adaptation alone provides only marginal gains in reader satisfaction (2.3%), while context-enhanced rewriting substantially improves user preference alignment (24.5%). Motivated by this, we introduce STORYLENSBENCH, a large-scale benchmark for preference-aligned story rewriting, comprising structured story books, multi-dimensional reader preference profiles, and ranked context-aware rewritten stories. Building on this benchmark, we propose STORYLENSEVAL, a reward model for estimating reader satisfaction over rewritten stories, and STORYLENSWRITER, a two-stage rewriting model combining supervised fine-tuning with GRPO-based reinforcement learning. We further establish a comprehensive evaluation framework covering fidelity, coherence, and reader satisfaction. Experimental results demonstrate that STORYLENSWRITER consistently outperforms strong generation and personalization baselines, highlighting the importance of context-aware narrative enrichment for personalized story rewriting.

2605.28070 2026-05-28 cs.AI 版本更新

Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

弥合推理模型在信息不足情况下的检测到弃权差距

Renjie Gu, Jiaxu Li, Yihao Wang, Yun Yue, Hansong Xiao, Yefei Chen, Yuan Wang, Chunxiao Guo, Pei Wei, Jinjie Gu, Yixin Cao

发表机构 * Fudan University（复旦大学）； Ant Group（蚂蚁集团）； Zhejiang University（浙江大学）； Tsinghua University（清华大学）

AI总结针对推理模型在信息不足时无法有效弃权的问题，提出Judge-Then-Solve（JTS）轨迹级推理控制框架，通过可回答性判断和强化学习训练，显著提升可靠弃权率并减少不必要的推理。

AI中文摘要

我们强调了大型推理模型在信息不足问题上的失败模式：模型可能认识到问题描述不充分，但仍然继续推理并产生无依据的最终答案，而不是弃权。我们将这种不匹配形式化为检测到弃权差距，即检测到信息不足未能转化为最终弃权。这种差距在高风险领域（如医疗AI）尤其令人担忧，因为基于不完整证据的答案可能比拒绝回答更有害。为了弥合这一差距，我们提出了Judge-Then-Solve（JTS），一种轨迹级推理控制框架，训练模型在生成解决方案之前做出明确的可回答性承诺。JTS不将弃权视为最终答案风格，而是将其视为控制决策：模型要么继续求解，要么根据其可回答性判断提前终止。我们通过监督式预热和缺失前提强化学习（结合一致性和长度塑造奖励）来实例化这一策略。在密集和MoE推理模型上的实验表明，JTS显著提高了跨数据集的可靠弃权率，并将弃权@检测（A@D）推至接近饱和，表明模型不仅检测到缺失信息，而且根据检测结果采取行动。通过在可回答性判断后立即终止不可回答的轨迹，JTS减少了不必要的推理，并在持续推理会放大无依据假设时提高了推理效率。我们还观察到，缺失前提训练可以改变困难但可回答问题上的推理行为，减少无效的自我反思。这些结果表明，信息不足下的弃权是安全高效部署推理模型的关键推理控制形式。

英文摘要

We highlight a failure mode of large reasoning models on questions with insufficient information: models may recognize that a problem is under-specified, yet still continue reasoning and produce unsupported final answers instead of abstaining. We formalize this mismatch as the detection-to-abstention gap, where detected insufficiency fails to translate into final abstention. This gap is especially concerning in high-risk domains such as medical AI, where answers based on incomplete evidence can be more harmful than refusal. To close this gap, we propose Judge-Then-Solve (JTS), a trajectory-level reasoning-control framework that trains models to make an explicit answerability commitment before solution generation. Rather than treating abstention as a final-answer style, JTS casts it as a control decision: the model either proceeds to solve or terminates early based on its answerability judgment. We instantiate this policy through supervised warm-up and missing-premise reinforcement learning with consistency and length-shaping rewards. Experiments on dense and MoE reasoning models show that JTS substantially improves reliable abstention across datasets and pushes Abstention@Detection (A@D) to near-saturation, indicating that models not only detect missing information but also act on that detection. By terminating unanswerable trajectories immediately after the answerability judgment, JTS reduces unnecessary reasoning and improves inference efficiency when continued deliberation would amplify unsupported assumptions. We also observe that missing-premise training can alter reasoning behavior on difficult but answerable problems, reducing unproductive self-reflection. These results suggest that abstention under insufficient information is a key form of reasoning control for deploying reasoning models safely and efficiently.
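The judge-first control flow and the Abstention@Detection idea can be sketched as below. The abstract does not give a formula for A@D, so the reading used here (the share of detected-insufficient cases that end in abstention) is an assumption, as are the function names and record fields.

```python
def judge_then_solve(question, judge, solve):
    """Commit to an answerability judgment first; solve only if answerable.

    `judge` and `solve` are hypothetical callables standing in for model calls.
    """
    if not judge(question):
        return "ABSTAIN"  # terminate early, no solution is generated
    return solve(question)

def abstention_at_detection(records):
    """Assumed A@D: among cases where insufficiency was detected,
    the fraction whose final output was an abstention."""
    detected = [r for r in records if r["detected"]]
    if not detected:
        return None  # undefined when nothing was detected
    return sum(1 for r in detected if r["abstained"]) / len(detected)

records = [
    {"detected": True, "abstained": True},
    {"detected": True, "abstained": False},  # detected, yet still answered
    {"detected": True, "abstained": True},
    {"detected": True, "abstained": True},
    {"detected": False, "abstained": False},
]
score = abstention_at_detection(records)  # 3 of 4 detected cases abstain

answer = judge_then_solve(
    "What is x + y if x = 2?",
    judge=lambda q: "y =" in q,  # toy check for the missing premise
    solve=lambda q: "unused",
)
```

Under this reading, a gap between detection and abstention shows up as a score below 1.0, which is the second record in the toy data.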

2605.28069 2026-05-28 cs.AI 版本更新

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

ZipRL: 自适应多轮上下文压缩与事后响应回放

Zhexin Hu, Li Wang, Xiaohan Wang, Jiajun Chai, Xiaojun Guo, Wei Lin, Guojun Yin

发表机构 * Meituan（美团）； Institute of Software, Chinese Academy of Sciences（中国科学院软件研究所）

AI总结提出ZipRL框架，通过多粒度压缩机制和事后响应回放技术，在可验证奖励强化学习中实现自适应上下文压缩，平衡信息保留与token效率，在多个智能体任务中显著优于现有方法。

AI中文摘要

自适应上下文压缩对于将大型语言模型扩展到复杂的多轮智能体任务至关重要。然而，基于规则的压缩方法可能会丢弃任务关键细节，而强化学习方法通常难以在长时工作流固有的稀疏奖励下平衡信息保留和token效率。为弥补这一差距，我们提出ZipRL，一种针对可验证奖励强化学习的新型自适应压缩框架。ZipRL具有多粒度压缩机制，用于主动、非均匀的信息缩减，并配合事后响应回放（HRR），一种旨在在RLVR优化期间密集化训练信号的技术。理论上，我们证明了ZipRL相对于均匀方法具有更优的任务相关效用。具体而言，ZipRL利用从粗到细的提示进行宏观压缩，并通过广义优势重塑将HRR纳入GRPO。多个不同版本和参数规模的模型验证了我们方法的有效性。在五个智能体任务上的基准测试显示，ZipRL在Qwen3-4B和Qwen3-8B模型上分别比最先进方法高出27.9%和34.7%，同时在极端256轮外推压力测试下保持卓越的token效率和鲁棒性。

英文摘要

Adaptive context compression is vital for scaling Large Language Models (LLMs) to complex, multi-turn agent tasks. However, rule-based compression methods may discard task-critical nuances, while Reinforcement Learning (RL) approaches usually struggle to balance information retention and token efficiency under the sparse rewards inherent to long-horizon workflows. To bridge this gap, we propose ZipRL, a novel adaptive compression framework tailored for Reinforcement Learning from Verifiable Rewards (RLVR). ZipRL features a multi-granularity compression mechanism for active, non-uniform information reduction, coupled with Hindsight Response Replay (HRR), a technique designed to densify training signals during RLVR optimization. Theoretically, we prove ZipRL's superior task-relevant utility over uniform methods. Concretely, ZipRL utilizes coarse-to-fine prompts for macro-compression and incorporates HRR into GRPO via generalized advantage reshaping. Multiple models of varying versions and parameter scales validate the effectiveness of our approach. Benchmarks on five agent tasks show ZipRL outperforms state-of-the-art approaches by 27.9% and 34.7% across Qwen3-4B and Qwen3-8B models, while maintaining exceptional token efficiency and robustness under extreme 256-turn extrapolation stress tests.
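To see why sparse rewards are a problem for GRPO-style training, consider the standard group-relative advantage, sketched below. This shows only the usual group normalization; it does not implement Hindsight Response Replay or the paper's generalized advantage reshaping. Using the population standard deviation and the `eps` value are choices made for this sketch.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sd = pstdev(rewards)  # population std of the group
    return [(r - mu) / (sd + eps) for r in rewards]

# A group with mixed outcomes yields an informative signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails yields all-zero advantages,
# so that group contributes no learning signal at all.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In long-horizon tasks many groups look like the second case, which is the situation a signal-densifying technique is meant to address.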

2605.28067 2026-05-28 cs.AI 版本更新

BlazeEdit: Generalist Image Editing on Mobile Devices with Image-to-Image Diffusion Models

BlazeEdit: 基于图像到图像扩散模型的移动设备通用图像编辑

Fei Deng, Yanwu Xu, Zhipeng Bao, Zhixing Zhang, Haolin Jia, Karthik Raveendran, Jianing Wei

发表机构 * Google（谷歌）

AI总结提出BlazeEdit，一个仅195M参数的轻量级通用图像编辑扩散模型，通过消除文本条件组件和多任务架构，在移动设备上实现快速、隐私保护的图像编辑。

Comments Accepted to CVPR 2026 EDGE Workshop

AI中文摘要

现代扩散模型卓越的生成质量往往以巨大的参数量为代价，这需要服务器端推理，带来显著的计算成本和潜在的隐私风险。因此，开发高效的设备端替代方案日益受到关注。尽管最近的努力优化了移动硬件上的文本到图像模型，但它们仍然相对庞大，通常有0.5B到1B参数。我们提出了BlazeEdit，一个专为设备端部署设计的高效通用图像到图像扩散模型。通过识别许多实际图像编辑任务不需要基于文本的指导，我们消除了文本条件组件，并开发了一个多任务架构，将对象移除、外扩、色调校正、重新照明和贴纸生成整合到一个仅195M参数的紧凑模型中。BlazeEdit大幅减少了下载大小和内存开销，同时保持了具有竞争力的生成质量。它在Pixel 10上仅需290ms即可完成一次完整推理，为边缘设备上的通用图像编辑提供了无缝、隐私保护和闪电般的体验。

英文摘要

The remarkable generation quality of modern diffusion models often comes at the cost of massive parameter counts, which necessitate server-side inference with significant computational costs and potential privacy risks. Consequently, there is growing momentum toward developing efficient on-device alternatives. While recent efforts have optimized text-to-image models for mobile hardware, they remain relatively bulky, typically ranging from 0.5B to 1B parameters. We present BlazeEdit, a highly efficient, generalist image-to-image diffusion model tailored for on-device deployment. By identifying that many practical image editing tasks do not require text-based guidance, we eliminate the text-conditioning components and develop a multi-task architecture that consolidates object removal, outpainting, tone correction, relighting, and sticker generation into a single, compact model of only 195M parameters. BlazeEdit achieves a substantial reduction in download size and memory overhead while maintaining competitive generation quality. It completes a full inference pass in just 290ms on a Pixel 10, delivering a seamless, privacy-preserving, and lightning-fast experience for generalist image editing on the edge.

2605.28065 2026-05-28 cs.AI 版本更新

Verifiable Benchmarking of Long-Horizon Spatial Biology

长程空间生物学的可验证基准测试

Ian Diks, Harihara Muralidharan, Tim Proctor, Kenny Workman

发表机构 * LatchBio

AI总结提出 SpatialBench-Long 基准，通过 24 个评估任务测试 AI 代理从原始空间数据中推导科学结论的能力，发现当前最佳模型仅达到 11.1% 的成功率。

AI中文摘要

AI 代理在生物数据分析中越来越有用，但现有基准大多测试广泛的生物学知识、可执行的工作流程或局部分析步骤，而不是对空间测量进行端到端的科学推理。我们引入了 SpatialBench-Long，一个用于长程空间生物学的基准，其中代理必须从原始或接近原始的数据以及校准的实验背景中恢复生物学声明，而不使用规定的方法。SpatialBench-Long 包含 24 个评估，涵盖原发性胰腺导管腺癌（PDAC）、工程化胶质母细胞瘤类器官和体内肿瘤、Cas9 谱系追踪的肺腺癌、以及小鼠视神经衰老/干预系统，涉及 CosMx、Visium、Xenium、多重纠错荧光原位杂交（MERFISH）、单细胞 RNA 测序（scRNA-seq）、Slide-seq、Slide-tags、组织学和谱系记录数据。候选声明通过再现、独立科学家审查和轨迹检查进行强化。最终答案通过受控词汇和符号进行确定性评分，并附有配套评分标准，捕捉通过关键分析瓶颈的进展。在 SpatialBench-Long 基准测试中，三个模型-工具对在 8/72 次运行（11.1%）中并列：Gemini 3.5 Flash / Pi 终端编码工具、GPT-5.5 / Pi 和 GPT-5.5 / OpenAI Codex。SpatialBench-Long 测试代理是否能够超越执行程序性分析，从复杂的空间测量中推导出准确的科学结论。

英文摘要

AI agents are increasingly useful for biological data analysis, but existing benchmarks mostly test broad biological knowledge, executable workflows, or localized analysis steps rather than end-to-end scientific reasoning over spatial measurements. We introduce SpatialBench-Long, a benchmark for long-horizon spatial biology in which agents must recover biological claims from raw or near-raw data and calibrated experimental context without prescribed methods. SpatialBench-Long contains 24 evaluations across primary pancreatic ductal adenocarcinoma (PDAC), engineered glioblastoma organoids and in vivo tumors, Cas9 lineage-traced lung adenocarcinoma, and mouse optic nerve aging/intervention systems, spanning CosMx, Visium, Xenium, multiplexed error-robust fluorescence in situ hybridization (MERFISH), single-cell RNA sequencing (scRNA-seq), Slide-seq, Slide-tags, histology, and lineage-recording data. Candidate claims are hardened through reproduction, independent scientist review, and trajectory inspection. Final answers are graded deterministically over controlled vocabularies and symbols with companion rubrics capturing progress through key analysis chokepoints. Across the SpatialBench-Long benchmark, three model-harness pairs tie at 8/72 runs (11.1%): Gemini 3.5 Flash / Pi terminal coding harness, GPT-5.5 / Pi, and GPT-5.5 / OpenAI Codex. SpatialBench-Long tests whether agents can move beyond executing procedural analysis to deriving accurate scientific conclusions from complex spatial measurements.

2605.28064 2026-05-28 eess.AS cs.AI cs.HC 版本更新

I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

我听见，故我信任：人类作为合成语音检测器的社会技术研究

Lelia Erscoi, Tomi Kinnunen

发表机构 * University of Eastern Finland（东芬兰大学）； Computational Speech Group（计算语音组）

AI总结通过定位任务实验，研究人类在感知和语境中检测语音深度伪造的能力，发现话语类别是检测准确性和感知质量的主要决定因素，信任线索无主效应但影响检测行为，完全合成语音的检测低于随机水平。

Comments To be included in Odyssey 2026: The Speaker and Language Recognition Workshop, Session 4.2, 23-26 June, Lisbon, Portugal

AI中文摘要

自动深度伪造检测已受到大量研究关注，然而人类实际遇到合成语音的社会技术环境仍知之甚少。我们将语音深度伪造检测作为感知和语境过程进行研究，呈现一个定位任务，其中47名参与者在三种操纵的信任线索下（指导框架、情感启动和来源标注）标记真实、完全合成和部分合成话语中的疑似合成片段。参与者提供了关于机械性、表现力、可懂度、清晰度、平静度和评估自信度的质量评分。话语类别是检测准确性和感知质量的主要决定因素；信任线索未产生主效应，但激发了检测行为。完全合成语音的检测低于随机水平。质量评分与话语类型相关，表明在显性检测失败时存在隐性区分。

英文摘要

Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a perceptual and contextual process, presenting a localization task in which 47 participants marked suspected synthetic segments across authentic, fully synthetic, and partially synthetic utterances under three manipulated trust cues: instructional framing, affective priming, and provenance labeling. Participants provided quality ratings on mechanicalness, expressiveness, intelligibility, clarity, calmness, and confidence of evaluation. Utterance class was the primary determinant of detection accuracy and perceptual quality; trust cues produced no main effects but motivated detection behavior. Fully synthetic speech was detected at below-chance levels. Quality ratings tracked utterance type, indicating implicit discrimination where overt detection failed.

2605.28063 2026-05-28 cs.SD cs.AI cs.MM 版本更新

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

自由文本提示的统一语音与声音合成

Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen, Ruihua Song

发表机构 * Renmin University of China（中国人民大学）

AI总结提出PlanAudio框架，利用大语言模型推理能力和语义潜在思维链机制，直接从自由文本生成包含语音和声音的统一音频。

AI中文摘要

音频生成已取得显著进展，但合成语音与声音自然组合的统一音频仍具挑战。当前方法要么依赖分离的流水线，无法捕捉细粒度交互，要么需要结构化输入和外部文本重写，限制了自由文本提示的灵活性。本文提出新任务：自由文本提示到统一音频生成，旨在直接从无约束自然语言合成包含语音、声音及其复合的统一音频。为此，我们提出PlanAudio，一个统一的、基于自回归LLM的框架。首先，它利用LLM内在推理能力简化模型架构，而非传统文本编码器。其次，引入语义潜在思维链机制，一种隐式规划机制，连接高层语义理解与低层声学合成。此外，我们创建PlanAudio-Bench，一个专门评估复合音频场景的基准。我们在语音、声音及其复合场景下进行评估。结果表明，PlanAudio普遍优于现有流水线和统一基线，同时与专为单一场景设计的模型保持竞争力。进一步分析揭示了语义潜在CoT相对于其他CoT机制的优越性，并强调了连续多场景训练课程的重要性。

英文摘要

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.

2605.28046 2026-05-28 cs.AI cs.CL 版本更新

MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

MemCog: 从记忆即工具到记忆即认知的对话代理

Zihan Li, Xingyu Fan, Feifei Li, Wenhui Que

发表机构 * WeChat, Tencent Inc.（腾讯公司）

AI总结提出MemCog系统，通过可导航记忆存储、跨维度导航接口和主动推理协议，将记忆访问融入推理过程，在被动问答和主动记忆触发基准上达到最优性能。

AI中文摘要

现有的代理记忆系统普遍遵循我们称之为“记忆即工具”的范式，其中单个查询触发对扁平段落列表的一次性检索，存在被动调用、推理-检索解耦以及检索片段与代理导航需求之间的结构不匹配等问题。我们提出MemCog，一个“记忆即认知”系统，使记忆访问成为推理过程的一个组成部分。MemCog将用户知识组织为具有关联链接图的可导航记忆存储，暴露跨维度导航接口以进行多步推理驱动的遍历，并采用主动推理协议，驱动代理从对话上下文中自发启动记忆探索。我们还构建了ProactiveMemBench，这是第一个用于评估主动记忆触发的基准。实验表明，MemCog在被动问答基准上达到了最先进水平（LoCoMo上92.98，LongMemEval上95.8），同时在ProactiveMemBench上大幅超越基线，展示了记忆即认知的优势。

英文摘要

Existing agent memory systems universally follow what we term a Memory-as-Tool paradigm where a single query triggers one-shot retrieval of flat passage lists, suffering from passive invocation, reasoning-retrieval decoupling, and structural mismatch between retrieved fragments and the agent's navigational needs. We propose MemCog, a Memory-as-Cognition system that makes memory access an integral part of the reasoning process. MemCog organizes user knowledge as Navigable Memory Store with associative link graphs, exposes Cross-Dimensional Navigation Interface for multi-step reasoning-driven traversal, and employs Proactive Reasoning Protocol that drives agents to spontaneously initiate memory exploration from conversational context. We additionally construct ProactiveMemBench, the first benchmark for evaluating proactive memory triggering. Experiments show that MemCog achieves state-of-the-art on passive QA benchmarks (92.98 on LoCoMo, 95.8 on LongMemEval) while substantially outperforming baselines on ProactiveMemBench, demonstrating the advantage of Memory-as-Cognition.

2605.28044 2026-05-28 cs.AI 版本更新

Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG

相关并不保证：引用RAG的证据力度校准

Pin Qian, Su Wang, Xiaoyuan Wang, Yihang Chen, Wenxuan Xu, Qiaolin Yu, Shuhuai Lin, Sipeng Zhang, Junxian You, Xinpeng Wei

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Georgia Institute of Technology（佐治亚理工学院）； Dartmouth College（达特茅斯学院）； Cornell University（康奈尔大学）； University of California San Diego（加州大学圣地亚哥分校）； University of Glasgow（格拉斯哥大学）

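AI summary: To address under-warranted claims in cited RAG, proposes the FORCEBENCH benchmark, which contrasts evidence-calibrated claims with force-raised variants to test evaluator monotonicity across five operational axes, and finds that standard support prompting is insufficient for evidence-force calibration.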

AI中文摘要

引用RAG评估通常将可见来源视为接地信号，但一个真实的、主题相关的引用仍可能对附带的措辞支持不足。我们将这种诊断失败称为引用洗白：一个相关的来源被呈现为对过度强声称的保证。我们引入了FORCEBENCH，一个用于证据力度校准的对比压力测试。每个项目固定一个引用的段落，并将一个证据校准的声明与一个局部力度增强的变体配对，涵盖五个操作轴：关系、模态、范围、时间有效性和数值特异性。一个校准的评估器应该给证据校准的声明更高的分数。主要实验使用一个固定的、经过局部过滤的198对评估集。引用存在的合理性检查设计上无信息；标记和实体重叠在32.8--36.4%的对上仍然违反单调性。在四个报告的模型评判中，标准的通用支持提示不足以应对这个力度校准压力测试（总体MVR 47.2%），而显式的保证力度提示将MVR降低到24.5%，但仍不完美。我们发布了基准、提示、输出和即插即用管道，以便引用评估器可以报告单调性违反率和力度敏感性，以及传统的支持指标。

英文摘要

Cited RAG evaluation often treats visible sources as a grounding signal, but a real, topically relevant citation can still under-warrant the attached wording. We study this diagnostic failure as citation laundering: a related source is presented as warrant for an over-strong claim. We introduce FORCEBENCH, a contrastive stress test for evidence-force calibration. Each item holds a cited passage fixed and pairs an evidence-calibrated claim with a localized force-raised variant across five operational axes: relation, modality, scope, temporal validity, and numeric specificity. A calibrated evaluator should score the evidence-calibrated claim higher. Headline experiments use a fixed, locality-filtered 198-pair evaluation set. A citation-presence sanity check is uninformative by design; token and entity overlap still violate monotonicity on 32.8--36.4% of pairs. Across four reported model judges, standard generic support prompting is insufficient for this force-calibration stress test (aggregate MVR 47.2%), while explicit warrant-strength prompting lowers MVR to 24.5% but remains imperfect. We release the benchmark, prompts, outputs, and plug-in pipeline so citation evaluators can report monotonicity violation rate and force sensitivity alongside conventional support metrics.
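
The monotonicity violation rate (MVR) mentioned in the abstract can be illustrated with a minimal sketch. It assumes each pair is reduced to two evaluator scores and that a tie counts as a violation, since the abstract says a calibrated evaluator should score the evidence-calibrated claim higher. The function name and input layout are hypothetical and not taken from the FORCEBENCH release.

```python
def monotonicity_violation_rate(pairs):
    """Fraction of pairs where the evaluator fails to prefer the calibrated claim.

    pairs: iterable of (score_calibrated, score_force_raised) tuples.
    Assumption: ties count as violations, because the calibrated claim
    is expected to score strictly higher.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one scored pair")
    violations = sum(1 for calibrated, raised in pairs if calibrated <= raised)
    return violations / len(pairs)


# Hypothetical scores for four claim pairs (not data from the paper).
example_pairs = [(0.9, 0.4), (0.5, 0.5), (0.2, 0.7), (0.8, 0.1)]
print(monotonicity_violation_rate(example_pairs))
```

Under this convention an evaluator that ignores claim strength and scores both claims identically gets an MVR of 1.0, which is the kind of failure the stress test is meant to surface.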

2605.28042 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Extracting Small Translation Specialists from LLMs by Aggressively Pruning Experts

通过激进剪枝专家从LLM中提取小型翻译专家

Liu O. Martin, Lucas Bandarkar, Nanyun Peng

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）

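AI summary: Proposes a method that aggressively prunes translation-irrelevant experts from mixture-of-experts LLMs, substantially compressing the MoE blocks without significantly degrading translation quality.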

AI中文摘要

现代大型语言模型（LLM）实现了最先进的机器翻译性能，但它们是作为广泛通才训练的，主要针对许多与翻译无关的任务和能力。因此，它们对于此任务严重过参数化，导致过多的内存和计算需求。在本文中，我们提出了一种从现代混合专家LLM中激进剪枝专家的方法，同时翻译质量下降可忽略不计。我们的方法利用专家专业化和LLM中多语言能力的可分离性来识别与翻译无关的专家。并且由于MoE的模块化特性，这些专家可以在无需任何训练的情况下轻松剪枝。无需重新训练，我们能够剪枝一半的专家而质量下降可忽略，剪枝70%仅造成轻微损失。通过非常短的SFT，我们剪枝75%的专家并恢复基线性能，在某些设置下移除近90%的专家同时保持合理的翻译质量。总体而言，我们的结果表明翻译仅需要LLM的一小部分，从而实现了对包含超过90%参数的MoE块的大幅压缩。

英文摘要

Modern large language models (LLMs) achieve state-of-the-art machine translation performance, but they do so as broad generalists largely trained for many tasks and capabilities unrelated to translation. Thus, they are heavily overparameterized for this task, resulting in excessive memory and compute requirements. In this paper, we present a method for aggressively pruning experts from modern mixture-of-experts LLMs while incurring negligible degradation in translation quality. Our approach exploits expert specialization and the separability of multilingual capabilities in LLMs to identify experts irrelevant to translation. And because of the modular nature of MoEs, these can be easily pruned without any training. Without retraining, we are able to prune half of all experts with negligible degradation and 70% with only minor losses. With a very short SFT, we prune 75% of experts while recovering baseline performance, and in some settings remove nearly 90% while maintaining reasonable translation quality. Overall, our results show that translation requires only a fraction of the LLM, enabling substantial compression of the MoE blocks that contain over 90% of parameters.

2605.28035 2026-05-28 cs.AI cs.MM cs.SD 版本更新

MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

MTAVG-Bench 2.0：诊断多说话人音视频生成中电影表现力的失败模式

Haitian Li, Yanghao Zhou, Heyan Huang, Liangji Chen, YiMing Cheng, Xu Liu, Dian Jin, Jiajun Xu, Jingyun Liao, Tian Lan, Ziqin Zhou, Yueying Liu, Yu Bai, Changsen Yuan, Jinxing Zhou, Xian-Ling Mao, Xuefeng Chen, Yousheng Feng

发表机构 * Shanghai University（上海大学）； Beijing Institute of Technology（北京理工大学）； Shanghai Film Academy（上海电影学院）； Tsinghua University（清华大学）； Hefei University of Technology（合肥工业大学）； Inkeverse Group Limited（Inkeverse集团有限公司）； The University of Adelaide（阿德莱德大学）； Beijing University of Technology（北京工业大学）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）； OpenNLP Lab（OpenNLP实验室）

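AI summary: To address the inadequate evaluation of cinematic expressiveness in multi-talker audio-video generation, proposes the MTAVG-Bench 2.0 benchmark, which builds a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language together with more than 10,000 question-answering instances to systematically evaluate how well omni large language models diagnose complex audio-visual failures.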

AI中文摘要

近年来，多说话人音视频生成（MTAVG）模型在唇形同步和视听对齐等基本指标上表现出了有前景的性能。然而，这些指标仍不足以评估场景级生成中的电影表现力。在多角色场景中，生成模型必须超越视听真实感，传达连贯的角色表演及其他更高层次的电影品质。为填补这一空白，我们引入了MTAVG-Bench 2.0，这是一个用于诊断多说话人音视频生成中电影表现力失败模式的基准。与先前主要关注基本多轮对话质量的设置不同，MTAVG-Bench 2.0针对短剧和场景级生成，并建立了一个涵盖表演、叙事、氛围和视听语言的高层次失败分类体系。基于该分类体系，我们构建了超过1万个问答评估实例，以及用于短剧级评估和失败模式时间定位的子集，以系统评估全模态大语言模型诊断高层次视听失败的能力。实验结果表明，Gemini等商业全模态模型显著优于其他评估器，但即使是最强的模型在我们的基准中仍难以应对复杂失败。这些结果证明，MTAVG-Bench 2.0为电影级多说话人音视频生成中的失败诊断提供了一个系统化的基准。

英文摘要

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

2605.28034 2026-05-28 cs.AI 版本更新

Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Clark Hash: 无状态稀疏Johnson-Lindenstrauss量化用于神经嵌入

Stanislav Kirdey, Clark Labs Inc

发表机构 * Clark Labs Inc（Clark实验室）

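AI summary: Proposes Clark Hash, which compresses 384-dimensional sentence embeddings to 48 bytes through normalization, sparse signed projection, and fixed-width scalar quantization; it needs no training and achieves 32x storage compression while keeping high correlation with dense cosine similarity.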

Comments First Autoresearch publication. Code available at https://github.com/clark-labs-inc/clark-hash. GPT-5.5 Pro was used for drafting and editing assistance

AI中文摘要

Clark Hash是一种用于以更少空间存储神经嵌入的小型方法。它对每个数据库向量进行归一化，应用确定性稀疏有符号Johnson-Lindenstrauss投影，裁剪结果，并存储固定宽度的标量量化码。查询保持浮点格式，并根据存储的草图进行评分。在默认的384维句子嵌入设置中，Clark Hash将余弦搜索向量存储在48字节中，而密集f32存储需要1536字节。这小了32倍。该方法在存储新向量之前不需要训练过程、学习码本、旋转或语料库统计。我们描述了编解码器、Rust实现，以及对来自29个子集的9,304个标记对进行的多语言句子相似性评估。使用多语言MiniLM编码器，48字节草图在STS17和STS22上与密集余弦评分的宏Pearson相关性分别达到0.910和0.946。Clark Hash不是一个新的Johnson-Lindenstrauss定理，也不是近似最近邻索引的替代品。它是一种用于紧凑嵌入存储的简单无状态编解码器。

英文摘要

Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.
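
The pipeline described in the abstract (normalize, sparse signed projection, clip, fixed-width scalar quantization, float queries scored against stored sketches) can be sketched roughly as follows. This is not the authors' Rust codec. The sketch dimension (96), bit width (4), number of nonzeros per input coordinate (4), and clip value (0.3) are assumptions, chosen only so that the stored code is 48 bytes; the abstract does not state these parameters.

```python
import math
import random

DIM = 384        # input embedding dimension (from the abstract)
OUT_DIM = 96     # assumed sketch dimension
BITS = 4         # assumed bits per coordinate: 96 * 4 / 8 = 48 bytes
NNZ = 4          # assumed nonzero entries per input coordinate
CLIP = 0.3       # assumed clipping threshold
LEVELS = 2 ** BITS - 1


def make_projection(seed=0):
    """Deterministic sparse signed projection: each input coordinate is
    sent to NNZ output rows, each with a random sign, fixed by the seed."""
    rng = random.Random(seed)
    return [[(rng.randrange(OUT_DIM), rng.choice((-1.0, 1.0))) for _ in range(NNZ)]
            for _ in range(DIM)]


def project(x, proj):
    """L2-normalize x (assumed nonzero) and apply the sparse projection."""
    norm = math.sqrt(sum(v * v for v in x))
    scale = 1.0 / (norm * math.sqrt(NNZ))
    y = [0.0] * OUT_DIM
    for xi, targets in zip(x, proj):
        for row, sign in targets:
            y[row] += sign * xi * scale
    return y


def encode(x, proj):
    """Clip each projected coordinate, quantize to 4 bits, pack two per byte."""
    q = [round((min(max(v, -CLIP), CLIP) + CLIP) / (2 * CLIP) * LEVELS)
         for v in project(x, proj)]
    return bytes((q[i] << 4) | q[i + 1] for i in range(0, OUT_DIM, 2))


def decode(code):
    """Unpack the 4-bit codes back to approximate projected coordinates."""
    q = []
    for b in code:
        q.extend((b >> 4, b & 0x0F))
    return [v / LEVELS * 2 * CLIP - CLIP for v in q]


def score(query, code, proj):
    """Score a floating-point query against a stored sketch (inner product)."""
    return sum(a * b for a, b in zip(project(query, proj), decode(code)))


# Storage arithmetic from the abstract: 384 float32 values take 384 * 4 = 1536 bytes,
# and 1536 / 48 = 32, which is the stated 32x reduction.
proj = make_projection()
rng = random.Random(1)
vec = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
print(len(encode(vec, proj)), DIM * 4 // len(encode(vec, proj)))
```

Because the projection depends only on a fixed seed, a new vector can be encoded without any corpus statistics or training pass, which is the "stateless" property the abstract emphasizes.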

2605.28032 2026-05-28 cs.AI 版本更新

PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

PetroBench：石油工程大语言模型基准测试

Xiang Wang, Tingting Zhang, Sen Wang, Ying Wu, Heng Meng, Peng Zhou, Peng Li

发表机构 * School of Petroleum and Natural Gas Engineering, Changzhou University（常州大学石油与天然气工程学院）； China University of Petroleum (East China)（中国石油大学（华东））

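AI summary: For petroleum engineering, builds a standardized bank of 1,200 questions and evaluates eight mainstream LLMs, finding that models perform better on subjective than objective questions, that Chinese models have an edge on multiple-choice questions, and that international models are slightly better on short-answer questions.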

AI中文摘要

大语言模型在石油工业中的应用日益广泛，凸显了领域特定评估框架的必要性。本研究开发了一个面向石油工程的大语言模型基准测试，包括数据预处理、质量过滤和多模型验证三个阶段。通过专家评审，构建了具有强领域相关性和区分能力的标准化题库。该基准测试涵盖采油工程、油藏工程和钻井工程，包含1200道题目，涉及选择题、判断题、术语定义和简答题四种格式。在统一API环境下评估了八种主流大语言模型。结果表明，模型在主观题上的表现优于客观题，表明其在事实知识辨别方面存在弱点。选择题和判断题的最高准确率分别为65.3%和74.3%。Gemini-3-Pro、Kimi-K2.5和Claude-Opus-4.6-Thinking取得了72%-74%的最佳总分。模型在采油工程中表现最佳，在油藏工程中最弱。中国模型在选择题上具有优势，而国际模型在简答题上略优。该基准测试为石油工程中大语言模型的评估和部署提供了可重复且实用的参考。

英文摘要

Large Language Models are increasingly applied in the petroleum industry, highlighting the need for a domain-specific evaluation framework. This study develops a benchmark for LLMs in petroleum engineering, including a three-stage process of data preprocessing, quality filtering, and multi-model validation. Using expert review, a standardized question bank with strong domain relevance and discriminative capability was constructed. The benchmark covers production, reservoir, and drilling engineering, with 1,200 questions across multiple-choice, true or false, term definition, and short-answer formats. Eight mainstream LLMs were evaluated under a unified API environment. Results show that models performed better on subjective than objective questions, indicating weaknesses in factual knowledge discrimination. The highest accuracies for multiple-choice and true or false questions were 65.3% and 74.3%, respectively. Gemini-3-Pro, Kimi-K2.5, and Claude-Opus-4.6-Thinking achieved the best overall scores of 72%-74%. Models performed best in production engineering and weakest in reservoir engineering. Chinese models showed advantages in multiple-choice questions, while international models performed slightly better in short-answer questions. The benchmark provides a reproducible and practical reference for evaluating and deploying LLMs in petroleum engineering.

2605.28030 2026-05-28 cs.LG cs.AI cs.CR 版本更新

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

SPARD: 通过安全投影与相关性-多样性数据选择防御有害微调攻击

Shuhao Chen, Weisen Jiang, Yeqi Gong, Shengda Luo, Chengxiang Zhuo, Zang Li, James T. Kwok, Yu Zhang

发表机构 * Department of Computer Science and Engineering, The Hong Kong University of Science and Technology（香港科技大学计算机科学与工程系）； Department of Computer Science and Engineering, Southern University of Science and Technology（南方科技大学计算机科学与工程系）； Department of Computer Science and Engineering, The Chinese University of Hong Kong（香港中文大学计算机科学与工程系）； Platform and Content Group, Tencent（腾讯平台与内容组）； Chinese Medicine Guangdong Laboratory（广东中医实验室）

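AI summary: Proposes SPARD, a framework that combines safety-projected alternating optimization with relevance-diversity data selection to defend against harmful fine-tuning attacks, substantially lowering attack success rates while maintaining task accuracy.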

Comments Accepted by ICML 2026

AI中文摘要

微调大型语言模型往往会破坏其安全对齐，有害微调攻击进一步加剧了这一问题，其中对抗性数据移除安全防护并诱导不安全行为。我们提出SPARD，一种集成安全投影交替优化与相关性-多样性感知数据选择的防御框架。SPARD采用SPAG，在效用更新和显式安全投影之间交替优化，使用一组安全数据强制执行安全约束。为策划安全数据，我们引入相关性-多样性行列式点过程来选择紧凑的安全数据，平衡任务相关性和安全覆盖。在GSM8K和OpenBookQA上针对四种有害微调攻击的实验表明，SPARD始终实现最低的平均攻击成功率，显著优于最先进的防御方法，同时保持高任务精度。代码可在https://github.com/shuhao02/SPARD获取。

英文摘要

Fine-tuning large language models often undermines their safety alignment, a problem further amplified by harmful fine-tuning attacks in which adversarial data removes safeguards and induces unsafe behaviors. We propose SPARD, a defense framework that integrates Safety-Projected Alternating optimization with Relevance-Diversity aware data selection. SPARD employs SPAG, which optimizes alternatively between utility updates and explicit safety projections with a set of safe data to enforce safety constraints. To curate safe data, we introduce a Relevance-Diversity Determinantal Point Process to select compact safe data, balancing task relevance and safety coverage. Experiments on GSM8K and OpenBookQA under four harmful fine-tuning attacks demonstrate that SPARD consistently achieves the lowest average attack success rates, substantially outperforming state-of-the-art defense methods, while maintaining high task accuracy. Code is available at https://github.com/shuhao02/SPARD.

2605.28025 2026-05-28 cs.AI cs.CL cs.CY 版本更新

MIRA: A Bilingual Benchmark for Medical Information Response Audit

MIRA: 医学信息响应审计的双语基准

Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai, Weiyi Wu, Chongyang Gao

发表机构 * The University of Chicago（芝加哥大学）； SynAI Technologies Inc.（SynAI技术公司）； Jinzhou Medical University（锦州医科大学）； Zhejiang University（浙江大学）； Dartmouth College（达特茅斯学院）； Northwestern University（西北大学）

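AI summary: Proposes MIRA, a bilingual benchmark of 4,320 prompts that evaluates how consistently LLMs provide medical information across different user phrasings, finds that low health-literacy prompts lead to Differential Information Dilution (DID), and proposes a knowledge-guided mitigation prompt.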

AI中文摘要

大语言模型（LLM）越来越多地被用于提供面向公众的健康信息，然而现有的安全评估忽略了在相同问题的不同用户表述下，响应是否保留了可比较的医学信息。为了解决这个问题，我们引入了医学信息响应审计（MIRA），这是一个受控的双语基准，评估LLM在用户侧语言、语域和健康素养信号下是否提供可比较的医学信息。MIRA包含从60个经过医学审查的低风险健康问题构建的4,320个提示。在五个主流LLM中，模型回答了所有医学问题，但对低健康素养信号的响应始终省略了更多关键信息，提供的具体后续步骤更少，并为独立判断提供的支持更少。我们将这种模式称为差异信息稀释（DID）。语言效应是模型特定的，而非对非英语提示普遍更差。与300个真实世界健康查询的比较提供了初步的秩次有效性证据。一种知识引导的缓解提示减少了大多数模型的信息稀释，其中Claude（约8%）和Qwen（约6%）在信息不足的简化方面减少最大。

英文摘要

Large language models (LLMs) are increasingly used to provide public-facing health information, yet existing safety evaluations overlook whether responses preserve comparable medical information across different user phrasings of the same question. To address this, we introduce the Medical Information Response Audit (MIRA), a bilingual, controlled benchmark that assesses whether LLMs provide comparable medical information across user-side language, register, and health literacy signals. MIRA contains 4,320 prompts built from 60 medically reviewed, low-risk health questions. Across five mainstream LLMs, models answered all medical questions, but responses to low health-literacy signals consistently omitted more key information, provided fewer concrete next steps, and offered less support for independent judgment. We term this pattern Differential Information Dilution (DID). Language effects are model-specific rather than uniformly worse for non-English prompts. A comparison with 300 real-world health queries provides preliminary evidence of rank-order validity. A knowledge-guided mitigation prompt reduces information dilution for most models, with the largest reductions in underinformative simplification observed for Claude (~8%) and Qwen (~6%).

2605.28023 2026-05-28 cs.CV cs.AI cs.CL cs.MM 版本更新

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

VCap: 用于弱到强视觉字幕的超几何奖励

Xingyu Lu, Jinpeng Wang, Yi-Fan Zhang, Yankai Yang, Yancheng Long, Yiyang Fan, Xuanyu Zheng, Haonan Fan, Kaiyu Jiang, Tianke Zhang, Changyi Liu, Bin Wen, Fan Yang, Tingting Gao, Han Li, Chun Yuan

发表机构 * Tsinghua Shenzhen International Graduate School（清华大学深圳国际研究生院）； Harbin Institute of Technology, Shenzhen（哈尔滨工业大学深圳分校）； Chinese Academy of Sciences（中国科学院）； Kuaishou Technology（快手科技）

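AI summary: Proposes VCap, a Witness-Adjudicator reward that verifies factual consistency between reference and policy-generated captions grounded in the visual signal with hypergeometric-distribution-level precision, enabling weak-to-strong generalization and outperforming SOTA models on multiple image and video captioning benchmarks.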

Comments 28 pages, 8 figures

AI中文摘要

视觉字幕要求模型忠实捕捉视觉内容，同时最小化遗漏和幻觉。作为字幕的主导范式，多模态大语言模型通过扩展和高质量数据取得了强大性能。最近，强化学习成为推动多模态大语言模型向更高精度和更广覆盖的关键途径，然而，现有字幕奖励设计未能提供细粒度且可靠的事实验证信号，限制了其有效性。为解决这一问题，我们提出VCap，一种证人-裁判奖励，将参考字幕（证人）与视觉信号（裁判）配对。通过明确验证基于视觉信号的参考字幕与策略生成字幕之间的事实一致性，VCap提供了具有超几何分布级别精度的奖励信号用于字幕质量验证。该设计即使在不完美的参考下也能实现有效学习，促进强化学习训练中的弱到强泛化。在我们的实验中，使用VCap训练的8B模型在多个图像和视频字幕基准上优于开源和闭源的最先进模型。人工评估进一步证实了其与事实正确性的强对齐。此外，VCap提升了多模态大语言模型的感知能力，跨任务泛化，并超越了best-of-N蒸馏，挑战了先前关于RLVR的假设。

英文摘要

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, MLLMs have achieved strong performance through scaling and high-quality data. Recently, RL has emerged as a key route to driving MLLMs toward higher precision and broader coverage, however, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, limiting their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RLVR.

2605.28010 2026-05-28 cs.AI 版本更新

Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

信心编排的自我进化：应对不确定的LLM反馈

Bowen Wei, Nan Wang, Yuqing Zhou, Jinhao Pan, Ziwei Zhu

发表机构 * George Mason University（乔治·梅森大学）

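AI summary: Proposes COSE, which uses the LLM's intrinsic confidence as an uncertainty signal and, through confidence-weighted PPO updates and confidence-prioritized replay, achieves the best average performance on general reasoning and mathematics tasks.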

AI中文摘要

自我进化的大语言模型（LLM）通过生成自己的训练任务和解决方案来学习，减少了对人工策划监督的依赖。然而，在许多推理领域，模型还必须验证生成的任务并判断生成的答案以获得训练信号。这带来了训练信号挑战：错误的自我判断会导致错误的梯度更新。现有方法要么依赖外部验证器（限制了通用性），要么将噪声的自我生成反馈视为监督。我们提出COSE（Confidence-Orchestrated Self-Evolution），它利用LLM的内在置信度作为轻量级不确定性信号来调节学习。COSE引入了置信度加权PPO更新和置信度优先重放。在19个保留基准测试和四个Qwen/Llama骨干网络（0.6B-4B）上，COSE始终优于基础模型，并在通用推理和数学方面取得最佳平均性能，同时在代码方面保持竞争力。代码和数据可在https://anonymous.4open.science/r/COSE_-B5C2获取。

英文摘要

Self-evolving large language models (LLMs) learn by generating their own training tasks and solutions, reducing reliance on human-curated supervision. However, in many reasoning domains, the model must also validate generated tasks and judge generated answers to obtain training signals. This creates a training-signal challenge: erroneous self-judgments become erroneous gradient updates. Existing approaches either rely on external verifiers, which limits generality, or treat noisy self-generated feedback as supervision. We propose COSE (Confidence-Orchestrated Self-Evolution), which uses the LLM's intrinsic confidence as a lightweight uncertainty signal to modulate learning. COSE introduces confidence-weighted PPO updates and confidence-prioritized replay. Across 19 held-out benchmarks and four Qwen/Llama backbones (0.6B--4B), COSE consistently improves over base models and achieves the best average performance in general reasoning and mathematics, while remaining competitive on code. Code and data are available at https://anonymous.4open.science/r/COSE_-B5C2.

2605.28009 2026-05-28 cs.CL cs.AI cs.LG 版本更新

MemGuard: Preventing Memory Contamination in Long-Term Memory-Augmented Large Language Models

MemGuard：防止长期记忆增强型大语言模型中的记忆污染

Hyeonjeong Ha, Jeonghwan Kim, Cheng Qian, Jiayu Liu, William M. Campbell, Yue Wu, Yuji Zhang, Kathleen McKeown, Dilek Hakkani-Tur, Heng Ji

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Columbia University（哥伦比亚大学）； Capital One

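AI summary: Proposes MemGuard, a type-aware memory framework that prevents heterogeneous memory contamination by explicitly assigning functional roles, maintaining relations across type-isolated memories, and selectively composing evidence from only the necessary types, improving memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens.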

AI中文摘要

记忆增强型大语言模型通过跨交互维护长期记忆，将推理扩展到固定上下文窗口之外。然而，现有的记忆系统常常将稳定的用户事实、情景事件和行为规则折叠到共享空间中，使得功能不同的记忆被检索并用作可互换的证据。我们将这种失败模式识别为异构记忆污染，其中上下文特定的事件被过度概括为声明，或者语义相关但功能不兼容的记忆误导生成。为此，我们引入了MemGuard，一种类型感知的记忆框架，在记忆构建和检索过程中保留功能记忆边界。它在写入时为每个记忆分配显式的功能角色，维护跨类型隔离记忆的关系，并仅从必要的记忆类型中选择性组合证据，从而减少来自无关或功能不兼容证据的污染。在幻觉和长时对话基准测试中，MemGuard将记忆可靠性提高了最多28.27%，同时检索的记忆token数比先前方法减少了最多5.8倍。这些结果表明，可靠的长期推理依赖于对异构记忆的有原则的组织和选择性使用。

英文摘要

Memory-augmented large language models extend reasoning beyond a fixed context window by maintaining long-term memory across interactions. However, existing memory systems often collapse stable user facts, episodic events, and behavioral rules into a shared space, allowing functionally distinct memories to be retrieved and used as interchangeable evidence. We identify this failure mode as heterogeneous memory contamination, where context-specific events become overgeneralized claims, or semantically relevant but functionally incompatible memories mislead generation. To this end, we introduce MemGuard, a type-aware memory framework that preserves functional memory boundaries during memory construction and retrieval. It assigns each memory an explicit functional role at write time, maintains relations across type-isolated memories, and selectively composes evidence only from necessary memory types, reducing contamination from irrelevant or functionally incompatible evidence. Across hallucination and long-horizon conversation benchmarks, MemGuard improves memory reliability by up to 28.27% while retrieving up to 5.8x fewer memory tokens than prior methods. These results suggest that reliable long-term reasoning depends on principled organization and selective use of heterogeneous memory.
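
The idea of assigning each memory a functional role at write time and retrieving only from the types a query needs can be illustrated with a toy sketch. The class, the three type names, and the keyword matching are hypothetical simplifications, not MemGuard's actual design; in particular, the cross-type relations described in the abstract are omitted here.

```python
from dataclasses import dataclass, field
from enum import Enum


class MemoryType(Enum):
    """Assumed functional roles, loosely following the abstract's examples."""
    USER_FACT = "user_fact"
    EPISODIC_EVENT = "episodic_event"
    BEHAVIORAL_RULE = "behavioral_rule"


@dataclass
class TypedMemoryStore:
    # One isolated list per memory type, so types never share a retrieval pool.
    items: dict = field(default_factory=lambda: {t: [] for t in MemoryType})

    def write(self, text, mem_type):
        """Store a memory under an explicit functional role."""
        self.items[mem_type].append(text)

    def retrieve(self, keyword, needed_types):
        """Return keyword matches drawn only from the requested memory types."""
        hits = []
        for mem_type in needed_types:
            hits.extend((mem_type, m) for m in self.items[mem_type]
                        if keyword.lower() in m.lower())
        return hits


store = TypedMemoryStore()
store.write("User prefers decaf coffee", MemoryType.USER_FACT)
store.write("Ordered an espresso coffee once while travelling", MemoryType.EPISODIC_EVENT)
# A question about stable preferences consults user facts only, so the
# one-off episode cannot be mistaken for a general preference.
print(store.retrieve("coffee", [MemoryType.USER_FACT]))
```

A shared, untyped store would return both entries for the same keyword, which is the kind of contamination the abstract describes.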

2605.28008 2026-05-28 cs.AI cs.LG 版本更新

Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

压缩思维：压缩推理数据在LLM后训练中何时以及如何发挥作用

Kohsei Matsutani, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo（东京大学）

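AI summary: Classifies chain-of-thought into Explicit, Composed, and Implicit CoT and experiments on a synthetic compositional reasoning task, finding that coarser CoT requires more SFT data, that Composed and Implicit CoT benefit more from data scaling although Implicit CoT tends toward memorization, that subsequent RLVR decomposes compressed steps, and that unidirectional CoT ordering generalizes better on longer sequential tasks.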

AI中文摘要

大型语言模型（LLM）现在能够通过长思维链（CoT）推理解决复杂问题，但性能与token成本之间的权衡仍然是一个核心挑战。为了解决这个问题，监督微调（SFT）通常使用压缩推理数据，其中CoT轨迹被缩短为紧凑形式。然而，这种压缩推理数据对后训练的影响仍然知之甚少。在本文中，我们提出了一个CoT分类法，包括显式CoT（输出所有操作而不聚合）、组合CoT（将多个操作合并为单步）和隐式CoT（省略中间操作）。我们构建了一个合成组合推理任务，允许对难度、压缩粒度和数据大小进行可控变化，并在不同模型家族和大小上进行了全面的实验。值得注意的是，我们发现：（i）粗粒度CoT需要更多SFT数据；（ii）与显式CoT相比，组合CoT和隐式CoT从数据缩放中获益更多，而组合CoT从数据重复中获益，隐式CoT则倾向于导致记忆；（iii）与SFT不同，后续带有可验证奖励的强化学习（RLVR）会分解在SFT期间学到的压缩步骤；（iv）单向CoT顺序在更长序列任务上表现出更强的泛化能力。我们的发现为数据资源约束下的CoT设计提供了启示，并为LLM后训练中SFT和RL的机制提供了重要见解。

英文摘要

Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conducted a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.

2605.28007 2026-05-28 cs.LG cs.AI 版本更新

Learning Compositional Latent Structure with Vector Networks

利用向量网络学习组合潜在结构

Niclas Pokel, Benjamin F. Grewe

发表机构 * Institute of Neuroinformatics, UZH / ETH Zurich（苏黎世大学/苏黎世联邦理工学院神经信息学研究所）； ETH AI Center Zurich, Switzerland（苏黎世联邦理工人工智能中心，瑞士）

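AI summary: Proposes the Vector Network (VN), a hierarchical recurrent architecture that achieves compositional generalization through a library of reusable rank-1 weight atoms, often reducing out-of-distribution error by about an order of magnitude.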

AI中文摘要

深度网络是强大的函数逼近器，但它们通常将许多不同的计算存储在共享权重矩阵中，使得当熟悉的结构以新颖组合出现时，难以选择性地重用或调整其中的部分。我们引入了向量网络（VN），一种层级循环架构，其中每一层将固定的权重矩阵替换为可重用的秩1权重原子库。对于每个输入，VN最小化层级局部能量，以推断一组稀疏的活跃权重原子及其系数，这些系数受自底向上的输入重建和自顶向下的反馈一致性共同约束。这些权重原子系数随后为该样本组成一个输入特定的低秩权重矩阵。收敛后，慢速学习更新仅通过推断系数缩放的局部残差信号更新选中的权重原子。我们在四个组合基准上评估了VN，涵盖一维信号、二维空间解码、N体动力学和组合MNIST。在分布内任务中VN与强基线相当，而在需要以新颖方式重新组合熟悉因子的分布外任务中，其误差通常低约一个数量级。因此，向量网络使组合泛化成为架构和推理过程的结构属性，而非将许多行为拟合到单个共享密集参数基底的脆弱副产品。

英文摘要

Deep networks are powerful function approximators, but they typically store many different computations in shared weight matrices, making it difficult to selectively reuse or adapt parts of them when a familiar structure appears in novel combinations. We introduce the Vector Network (VN), a hierarchical recurrent architecture in which each layer replaces a fixed weight matrix with a library of reusable rank-1 weight atoms. For each input, VN minimizes a layer-local energy to infer a sparse set of active weight atoms and their coefficients, jointly constrained by bottom-up input reconstruction and top-down feedback consistency. These weight atom coefficients then compose an input-specific low-rank weight matrix for that sample. After convergence, slow learning updates only the selected weight atoms through local residual signals scaled by the inferred coefficients. We evaluate VN on four compositional benchmarks spanning 1D signals, 2D spatial decoding, N-body dynamics, and compositional MNIST. VN matches strong baselines in distribution while often achieving out-of-distribution error about an order of magnitude lower when familiar factors must be recombined in novel ways. Vector networks thus make compositional generalization a structural property of the architecture and inference process rather than a brittle byproduct of fitting many behaviors into one shared dense parameter substrate.

2605.28006 2026-05-28 cs.CL cs.AI 版本更新

学习将预测任务分配给具有容量限制的智能体

Shang Wu, Saatvik Kher, Padhraic Smyth

发表机构 * Department of Computer Science（计算机科学系）； University of California, Irvine（加州大学欧文分校）

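AI summary: For multiple capacity-constrained agents (human or AI), proposes a framework that learns a sequential explore-exploit policy to maximize overall prediction performance.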

2605.27997 2026-05-28 cs.CL cs.AI cs.LG 版本更新

Where Does Toxicity Live? Mechanistic Localization and Targeted Suppression in Language Models

毒性存在于何处？语言模型中的机制定位与定向抑制

Himanshu Beniwal, Mayank Singh

发表机构 * Indian Institute of Technology Gandhinagar（印度理工学院甘地讷格尔分校）

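AI summary: Localizes toxicity to specific layers and neurons by analyzing activation differences between toxic and neutral prompts, then suppresses it with inference-time scaling or minimal rank-one weight edits, without gradient descent, reducing toxicity while preserving language modeling quality.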

AI中文摘要

大型语言模型频繁生成有毒、仇恨或有害内容，然而现有的缓解方法依赖于昂贵的重新训练或输出级过滤，且缺乏对毒性内部起源的机制性理解。我们提出了Meow2X和TRNE，两种互补的无需重新训练的框架，通过分析毒性与中性提示之间的激活差异，将毒性定位到特定层和神经元，然后通过推理时缩放或最小秩一权重编辑进行抑制——无需任何梯度下降。在五个语言模型、两个基准测试和90种配置上的评估，使用双重安全评估器，一致地证明了毒性降低，同时保持了语言建模质量。我们的分析揭示，毒性不成比例地编码在早期MLP层中，在不同架构间有所变化，并且被单一评估器设置系统性地低估——强调了多评估器安全评估的必要性。通过连接机制可解释性与实际去毒化，我们的框架为更安全、更透明的语言模型提供了一条原则性路径。

英文摘要

Large language models frequently generate toxic, hateful, or harmful content, yet existing mitigation methods rely on costly retraining or output-level filtering with no mechanistic insight into where toxicity originates internally. We introduce Meow2X and TRNE, two complementary retraining-free frameworks that localize toxicity to specific layers and neurons by analyzing activation differentials between toxic and neutral prompts, then suppress them via inference-time scaling or minimal rank-one weight edits -- without any gradient descent. Evaluations across five LMs, two benchmarks, and 90 configurations using dual safety evaluators demonstrate consistent toxicity reduction while preserving language modeling quality. Our analysis reveals that toxicity is disproportionately encoded in early MLP layers, varies across architectures, and is systematically underestimated by single-evaluator setups -- underscoring the need for multi-evaluator safety assessment. By bridging mechanistic interpretability with practical detoxification, our framework offers a principled path toward safer, more transparent language models.

2605.27990 2026-05-28 cs.LG cs.AI cs.CV 版本更新

Geometry-Correct Diffusion Posterior Sampling with Denoiser-Pullback Curvature Guidance and Manifold-Aligned Damping

几何校正扩散后验采样：基于去噪器回拉曲率引导与流形对齐阻尼

Seunghyeok Shin, Minwoo Kim, Dabin Kim, Hongki Lim

发表机构 * Department of Electrical and Computer Engineering, Inha University, Incheon, 22212, South Korea（韩国仁川仁荷大学电气与计算机工程系，邮编22212）

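AI summary: Proposes a geometry-correct diffusion posterior sampling method based on denoiser-pullback curvature guidance and manifold-aligned damping, replacing scalar guidance with a per-noise-level damped Gauss-Newton correction for stable and efficient posterior sampling.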

Comments Code: https://github.com/Seunghyeok0715/CLAMP

Journal ref: International Conference on Machine Learning 2026

AI中文摘要

扩散后验采样将扩散先验条件于测量值，但数据一致性更新通常由手动调整的引导权重缩放，并且在刚性、算子依赖的曲率下可能破坏采样稳定性。我们使用在扩散状态坐标中计算的每噪声水平阻尼高斯-牛顿校正替代标量引导。该校正通过去噪器回拉似然梯度，使用避免前向去噪器雅可比矩阵的单侧曲率模型，并应用与去噪器残差对齐的扩散校准秩一阻尼。每个校正通过自动微分的无矩阵GMRES求解，采样通过具有闭式漂移/噪声分离的方差保持朗之万转移进行。在FFHQ和ImageNet上的逆问题中，该方法在PSNR/SSIM/LPIPS上达到竞争性能，同时运行速度显著快于大多数对比基线；在加速MRI重建中，它在对比基线中取得了最佳的PSNR/SSIM。

英文摘要

Diffusion posterior sampling conditions diffusion priors on measurements, but data-consistency updates are typically scaled by hand-tuned guidance weights and can destabilize sampling under stiff, operator-dependent curvature. We replace scalar guidance with a per-noise-level damped Gauss--Newton correction computed in diffusion-state coordinates. The correction pulls likelihood gradients back through the denoiser, uses a one-sided curvature model that avoids forward denoiser Jacobians, and applies diffusion-calibrated rank-one damping aligned with the denoiser residual. Each correction is solved with matrix-free GMRES using automatic differentiation, and sampling proceeds with a variance-preserving Langevin transition with a closed-form drift/noise split. On FFHQ and ImageNet across inverse problems, it achieves competitive PSNR/SSIM/LPIPS while running markedly faster than most of the compared baselines; on accelerated MRI reconstruction, it achieves the best PSNR/SSIM among the compared baselines.

2605.27984 2026-05-28 cs.CL cs.AI 版本更新

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

KVoiceBench, KOpenAudioBench 和 KMMAU：用于评估语音语言模型的智能体驱动的韩语语音基准

Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee, Jonghyun Lee

发表机构 * KRAFTON ； Graduate School of AI, KAIST（韩国科学技术院人工智能研究生院）； Department of Mathematical Sciences, Seoul National University（首尔国立大学数学科学系）

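AI summary: To address the English-centric evaluation of speech language models, proposes two agent-driven benchmark-construction frameworks, builds and releases three Korean speech benchmarks (KVoiceBench, KOpenAudioBench, KMMAU), and, by evaluating eight recent models, reveals English-Korean performance gaps and complementary weaknesses across tasks.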

Comments 16 pages, 4 figures

AI中文摘要

语音语言模型通过将大型语言模型扩展到语音模态取得了显著进展。然而，语音语言模型的评估仍然严重以英语为中心，限制了多语言语音能力的可靠评估。通过ASR、翻译、归一化和TTS直接迁移基准会破坏语言特定的指令、答案约束和口语形式；对于音频理解，迁移源语言音频也无法保留目标语言的说话人属性、口音和副语言特性。为解决这些限制，我们提出了两种智能体驱动的基准构建框架：一种将源语言SpokenQA基准迁移为目标语言SpokenQA基准，另一种利用转录和说话人元数据将目标语言ASR语料库转换为音频理解基准。使用这些框架，我们构建并公开发布了三个韩语语音基准：用于韩语SpokenQA的KVoiceBench和KOpenAudioBench，以及用于韩语音频理解的KMMAU，共包含12,345个样本。我们评估了八个最近的语音语言模型，发现英语-韩语性能差距在不同模型和任务族中差异很大，并且SpokenQA和音频理解的排名出现分歧，揭示了仅靠英语评估无法发现的互补性弱点。

Abstract

Speech language models (SpeechLMs) have achieved substantial progress by extending large language models (LLMs) to the speech modality. However, SpeechLM evaluation remains heavily centered on English, limiting reliable assessment of multilingual speech capabilities. Straightforward benchmark transfer through ASR, translation, normalization, and TTS can corrupt language-specific instructions, answer constraints, and spoken forms; for audio understanding, transferring source-language audio also fails to preserve target-language speaker attributes, accents, and paralinguistic properties. To address these limitations, we propose two human-agent benchmark-construction frameworks: one transfers source-language SpokenQA benchmarks into target-language SpokenQA benchmarks, and the other converts target-language ASR corpora into audio understanding benchmarks using transcriptions and speaker metadata. Using these frameworks, we construct and publicly release three Korean speech benchmarks: KVoiceBench and KOpenAudioBench for Korean SpokenQA, and KMMAU for Korean audio understanding, comprising 12,345 samples in total. We evaluate eight recent SpeechLMs and find that English-Korean performance gaps vary substantially across models and task families, and that SpokenQA and audio understanding rankings diverge, revealing complementary weaknesses invisible to English-only evaluation.

2605.27981 2026-05-28 cs.AI version update

STAB: Specification-driven Testing for Algorithmic Bottlenecks

Soohan Lim, Joonghyuk Hahn, Hyundong Jin, Yo-Sub Han

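Affiliations: Yonsei University

AI summary: Proposes the STAB pipeline, which uses constraint saturation and adversarial scenario injection to generate bottleneck-exposing test cases from natural-language problem specifications, substantially raising the rate at which generated test cases detect bottlenecks.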

Comments 16 pages, 5 figures, 8 tables

Abstract

Evaluating the efficiency of algorithmic code requires test cases that expose runtime bottlenecks. Previous methods generate efficiency test cases either by increasing input size or by generating code-specific inputs that make the given implementation run slowly. Consequently, they do not address the structural input conditions that drive the algorithmic worst case. We introduce STAB, a specification-driven pipeline that generates test cases that expose algorithmic bottlenecks from a natural-language problem specification alone. STAB separates the task into constraint-bound maximization and adversarial structure injection. (i) The constraint saturator extracts constraints and resolves large admissible size assignments using rule-based saturation and CP-SAT optimization over related variables. (ii) The adversarial scenario injector retrieves implementation-level adversarial construction principles from a curated scenario catalog using keyword matching and K-nearest neighbors (KNN). STAB encodes the problem specification, resolved boundary, and retrieved construction principles into a structured generation specification, from which the LLM synthesizes a Python test case generator. On CodeContests, STAB raises the rate of generated test cases that expose algorithmic bottlenecks from 50.43% to 73.45% on average across open-source LLMs and from 57.45% to 71.85% on average across closed-source LLMs, with consistent gains across Python, Java, and C++. Our code is available at https://github.com/suhanmen/STAB.
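
The distinction the abstract draws between input size and input structure can be illustrated with a toy example that is not taken from the paper. A quicksort that always picks the first element as its pivot does far more work on an already-sorted array than on a shuffled array of the same length. All function names below are hypothetical, and the sketch says nothing about how STAB itself builds its generators.

```python
import random


def quicksort_comparisons(values):
    """Count element comparisons made by a first-element-pivot quicksort.

    Iterative (explicit stack) so that adversarial inputs do not hit
    Python's recursion limit.
    """
    data = list(values)
    comparisons = 0
    stack = [(0, len(data) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:
            continue
        pivot = data[lo]
        i = lo  # right edge of the "smaller than pivot" region
        for j in range(lo + 1, hi + 1):
            comparisons += 1
            if data[j] < pivot:
                i += 1
                data[i], data[j] = data[j], data[i]
        data[lo], data[i] = data[i], data[lo]  # move pivot into place
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return comparisons


def sorted_input(n):
    """Adversarial structure (illustrative): already-sorted input of size n."""
    return list(range(n))


def shuffled_input(n, seed=0):
    """Same size as sorted_input(n), but without the adversarial structure."""
    rng = random.Random(seed)
    data = list(range(n))
    rng.shuffle(data)
    return data
```

On the sorted input every partition is maximally unbalanced, so the comparison count grows quadratically in n, while the shuffled input of identical size stays close to n log n. Enlarging n alone would not expose that difference.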

2605.27980 2026-05-28 cs.CL cs.AI version update

Periodic RoPE for Infinite Context LLMs

Simin Huo

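Affiliations: Shanghai Jiao Tong University

AI summary: Proposes Periodic RoPE (P-RoPE), a positional encoding mechanism that combines sliding window attention with position-encoding-free global attention to avoid position exhaustion and, in theory, support an infinite context window.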

Comments 5 pages

Abstract

The ability to process ultra-long contexts is crucial for large language models (LLMs) to perform long-horizon tasks. While recent efforts have extended context windows to 1M and beyond, model performance degrades when sequence length exceeds the pre-trained range of positional encodings (e.g., RoPE), i.e., position exhaustion. This fundamental limitation must be overcome to achieve a truly infinite context. To address it, we propose Periodic RoPE (P-RoPE), a positional encoding mechanism designed to circumvent this exhaustion. It operates in conjunction with sliding window attention (SWA) to capture local dependencies and relative positions within each window. This local layer is then complemented by a global attention layer with No Positional Encoding (NoPE), enabling unbounded interaction across the entire sequence without positional constraints. By stacking these two types of layers, the model avoids the need for positional extrapolation to generalize to longer sequences and theoretically supports an infinite context window. Empirical results show that our model, MiniWin, outperforms MiniMInd with standard GPT architectures in long-context efficiency and stability. Our work provides a possible pathway toward LLMs with genuine infinite-context understanding. The code is available at https://github.com/Cominder/miniwin.
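
The sliding-window component mentioned in the abstract can be sketched as follows. This is only a generic causal sliding-window mask, not the paper's P-RoPE: the abstract does not say how the periodic positions are computed, so that part is left out. The function names are hypothetical. The point of the sketch is that the largest query-key distance a local layer ever sees is bounded by the window size, regardless of sequence length.

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask: query i may attend to key j
    only if i - window < j <= i."""
    return [
        [(i - window) < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


def max_relative_offset(mask):
    """Largest query-key distance (i - j) that the mask permits."""
    return max(
        i - j
        for i, row in enumerate(mask)
        for j, allowed in enumerate(row)
        if allowed
    )
```

Because the bound depends only on the window, a position encoding used inside such a layer never has to represent offsets larger than those seen in training, which is consistent with the abstract's argument about avoiding positional extrapolation.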

2605.27971 2026-05-28 cs.CL cs.AI version update

Semantic Flow Regularization: Teaching LLMs to Generate Diverse Yet Coherent Responses

Kerui Peng, Feifei Li, Xingyu Fan, Wenhui Que

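Affiliations: Tencent Inc.; Beijing, China

AI summary: Targets Cross-Style Collapse, the severe loss of output diversity when large language models are fine-tuned, and proposes Semantic Flow Regularization (SFR), which supervises the backbone with continuous sentence embeddings via conditional flow matching, improving diversity and style fidelity at zero deployment cost.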

Abstract

When large language models are fine-tuned to generate persona- or tone-conditioned responses, their output diversity is severely limited--a failure we term Cross-Style Collapse. We trace this collapse to the cross-entropy objective, which under shared representations tends to suppress diverse continuations. We propose Semantic Flow Regularization (SFR), a lightweight auxiliary objective that supervises the backbone with continuous sentence-encoder embeddings of future segments via conditional flow matching. The stochastic flow source preserves multi-modality by construction; the flow-matching head is discarded at inference, adding zero deployment cost. On a large-scale industrial dialogue dataset (Qwen3-32B, 9 personas), SFR improves output diversity, style fidelity, and response quality over SFT. We further validate on the public LiveCodeBench-v5 (Qwen2.5-Coder-7B-Instruct), where SFR consistently improves pass@k, confirming generality beyond stylized dialogue. A controlled comparison on MBPP reveals Multi-Token Prediction to be a degenerate special case of SFR.

2605.27970 2026-05-28 cs.AI version update

Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

Simardeep Singh, Paras Chopra

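Affiliations: Indian Institute of Technology Roorkee

AI summary: Investigates whether geometric structure resembling human perceptual organisation arises in the internal representations of large language models, finding that the geometry of multiple perceptual domains emerges transiently in intermediate layers and aligns with human baselines.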

Comments 19 Pages, 28 Figures

Abstract

While large language models (LLMs) are trained purely on textual data, prior work has shown that their internal representations can exhibit rich geometric structure in embedding space. Building on this line of work, we investigate whether such structure is similar to human perceptual organisation across different domains (e.g., color, pitch, emotion, and taste). Specifically, we study the layer-wise emergence of intrinsic geometrical structure corresponding to perceptual modalities within the residual streams of multiple open-weight transformer architectures. Our results reveal three key findings. First, we observe the emergence of layer-wise geometric structure across multiple perceptual domains, despite the absence of any direct perceptual supervision during training. Second, these perceptual domains exhibit distinct emergence profiles, with both geometric structure and its alignment with human baselines following domain- and model-specific trajectories across depth. Third, this emergence follows a consistent representational trajectory: geometry is weak or diffuse in early layers, becomes progressively organised in intermediate layers, and is attenuated in later layers, suggesting that perceptual geometry arises transiently as part of the model's internal transformation pipeline. This provides new insight into how and where human-like perceptual geometry arises in LLMs, offering a principled pathway for mechanistic analysis of internal representations.

2605.27967 2026-05-28 stat.ME cs.AI cs.LG stat.ML version update

Multi-Teacher Knowledge Distillation via Teacher-Informed Mixture Priors

Luyang Fang, Yongkai Chen, Jiazhang Cai, Ping Ma, Wenxuan Zhong

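Affiliations: Department of Statistics, University of Georgia; Department of Statistics, Harvard University

AI summary: Proposes the Multi-Teacher Bayesian Knowledge Distillation (MT-BKD) framework, which uses Bayesian inference and a teacher-informed prior together with an entropy-based weighting mechanism to fuse knowledge from multiple teachers efficiently and to quantify uncertainty.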

Abstract

Knowledge distillation is a powerful method for model compression, enabling the efficient deployment of complex deep learning models (teachers), including large language models. However, its underlying statistical mechanisms remain unclear, and uncertainty evaluation is often overlooked, especially in real-world scenarios requiring diverse teacher expertise. To address these challenges, we introduce Multi-Teacher Bayesian Knowledge Distillation (MT-BKD), where a distilled student model learns from multiple teachers within the Bayesian framework. Our approach leverages Bayesian inference to capture inherent uncertainty in the distillation process. We introduce a teacher-informed prior, integrating external knowledge from teacher models and task-specific training data, offering better generalization, robustness, and scalability. Additionally, an entropy-based weighting mechanism adaptively adjusts each teacher's influence, allowing the student to combine multiple sources of expertise effectively. MT-BKD enhances the interpretability of the student model's learning process, improves predictive accuracy, and provides uncertainty quantification. We validate MT-BKD on both synthetic and real-world tasks, including protein subcellular location prediction and image classification. Our experiments show improved performance and robust uncertainty quantification, highlighting the strengths of our MT-BKD framework.
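
The abstract does not give the formula behind its entropy-based weighting, so the following is only a minimal sketch of one plausible rule, under the assumption that teachers with lower predictive entropy (more confident predictions) should receive more weight. The softmax-over-negative-entropy rule, the `temperature` parameter, and all function names are illustrative choices, not the paper's method, and the Bayesian prior is not modelled here.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def teacher_weights(teacher_probs, temperature=1.0):
    """Softmax over negative entropies: confident teachers get more weight.

    ASSUMPTION: this weighting rule is illustrative, not the paper's formula.
    """
    scores = [-entropy(p) / temperature for p in teacher_probs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine_teachers(teacher_probs, temperature=1.0):
    """Weighted average of the teachers' predictive distributions."""
    weights = teacher_weights(teacher_probs, temperature)
    n_classes = len(teacher_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, teacher_probs))
        for c in range(n_classes)
    ]
```

With one confident teacher and one that predicts uniformly, the confident teacher dominates the combined target, and the result is still a valid probability distribution.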

2605.27965 2026-05-28 cs.AI version update

Do Agents Think Deeper? A Mechanistic Study of Inter-Layer Dynamics in Sequential Planning

Zhenyu Cui, Xiangzhong Luo

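AI summary: Using residual stream probes, causal layer-skipping interventions, and effective-depth measurements, the paper studies how inter-layer dynamics in large language models differ on autonomous agent tasks (multi-turn planning, tool use, iterative state updates). It finds that agentic reasoning shows a depth profile distinct from static tasks: as trajectories unfold, models progressively recruit more and deeper layers, stronger long-range inter-layer dependencies emerge in later turns, and residual updates shift from stable feature accumulation toward repeated recalibration. Effective-depth analysis reveals a construction-refinement gap in which semantic direction forms early while deep layers remain necessary for stabilizing final outputs.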

Abstract

Recent mechanistic studies suggest that large language models (LLMs) may utilize their depth inefficiently in standard single-turn tasks. Whether this still holds in autonomous agent settings, where models must perform multi-turn planning, tool use, and iterative state updates, remains unclear. We study this question through a systematic layer-wise analysis of complete user-agent trajectories spanning three domains: Deep Research, Code Generation, and Tabular Processing. Using residual stream probes, causal layer-skipping interventions, and effective-depth measurements, we show that agentic reasoning exhibits a distinct depth profile from static tasks. As trajectories unfold, models progressively recruit more and deeper layers, with stronger long-range inter-layer dependencies emerging in later turns. At the same time, residual updates become increasingly correction-dominant, indicating a shift from stable feature accumulation toward repeated recalibration. Effective-depth analysis further reveals a substantial construction-refinement gap: semantic direction often forms relatively early, while deep layers remain necessary for stabilizing final outputs. Across model families, this gap is pronounced in Qwen and Minimax, whereas GLM shows a more domain-dependent depth allocation pattern. These results provide mechanistic evidence that autonomous LLM agents allocate depth adaptively as reasoning complexity grows.

2605.27932 2026-05-28 cs.CV cs.AI cs.CL cs.CR cs.LG version update

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li, Binghang Lu, Neil Zhenqiang Gong

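Affiliations: Independent Researcher; Stanford University; Harvard University; Purdue University; Duke University

AI summary: Studies how different think-with-image reasoning paradigms in multimodal large language models affect jailbreak robustness, finding that explicit image-tool interaction substantially lowers attack success rates, and explains the mechanism at the representation level with an image-tool safety vector framework.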

Comments 17 pages, 6 figures, 7 tables

Abstract

Think-with-image reasoning is emerging as a new inference paradigm for large vision-language models, but its safety implications remain poorly understood. Existing systems already span multiple process designs, including direct response generation, text-only prior turn, visual-state manipulation, and explicit external image-tool invocation. In this paper, we ask which of these evaluated paradigms improves multimodal jailbreak robustness, and why. Across multiple vision-language models, explicit image-tool interaction yields the lowest attack success rates in our experiments, reducing jailbreak success by around 30% relative on average across the evaluated models. This finding is initially surprising: ASR remains low even when the returned image-tool output is manually overridden or itself unsafe-looking, but returns near direct-answering levels under text-only prior turn controls. These results indicate that the lower ASR is not explained by benign returned-image semantics or by the textual image-tool trace alone. To explain the pattern, we introduce an image-tool safety vector framework that models image-tool invocation as a residual shift in hidden representations toward a safety-relevant direction. Representation-level analyses and activation interventions support this account. Overall, our results suggest that explicit image-tool interaction is a promising design pattern for improving jailbreak robustness, while also motivating pipeline-specific safety evaluation.

2605.27931 2026-05-28 cs.AI version update

DiagramRAG: A Lightweight Framework to Retrieve Scientific Diagram for Figure Generation

Xinjiang Yu, Junyi Han, Zhuofan Chen, Chi Zhang, Xiangyu Fu, Jingyuan Tan, Zirui You, Yixiang Jian, Yu-Ping Wang, Chengliang Chai

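Affiliations: Beijing Institute of Technology

AI summary: Proposes the DiagramRAG framework, which retrieves reference diagrams compatible with a sketch's semantics and topology in order to complete the sketch automatically and generate a scientific diagram from it.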

Comments 23 pages, 9 figures

Abstract

Scientific diagrams are essential for communicating complex methodologies in academic papers. A natural way for researchers to specify such diagrams is through rough sketches, where text labels, connectors, and spatial arrangements express early semantic and topological intentions. However, sketches are usually incomplete, making them insufficient for directly producing publication-quality diagrams. Existing sketch-based generation methods mainly reconstruct the sketch itself, while recent text-driven diagram generation frameworks rely on textual semantics and do not fully exploit the topological structure contained in sketches. In this paper, we introduce DiagramRAG, a lightweight retrieval-augmented framework for sketch-based scientific diagram completion. Given a user sketch, DiagramRAG retrieves reference diagrams that are both semantically relevant to the sketch content and topologically compatible with its structure, and uses them to guide downstream diagram generation. To enable efficient structure-aware retrieval, we represent diagrams as knowledge graphs, synthesize sketch variants at different simplification levels, and train an embedding model to align sketches with compatible diagrams in a shared space. The retrieved references further provide content, topology, and visual priors for completing and rendering the final diagram. Experiments show that DiagramRAG achieves F1-scores of 0.848 and 0.802 on DiagramBank and FigureBench, respectively, and improves generation quality with the best VLM-as-a-Judge score of 7.170, while reducing inference latency to 35.48 seconds per sample. Our code and data are available at https://anonymous.4open.science/r/DiagramRAG-A262 and https://huggingface.co/datasets/anonymous-review-a262/DiagramSketch.

2605.27923 2026-05-28 cs.CV cs.AI cs.LG quant-ph version update

Do We Really Need Quantum Machine Learning?: A Multidimensional Empirical Study

Sudip Vhaduri, Ryan Gammon, Sayanton Dibbo

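Affiliations: Department of Computer Science, University of Alabama, AL 35487

AI summary: Through multidimensional benchmarking of classical and quantum machine learning models on the MNIST handwritten digit dataset, the study finds that quantum models outperform classical ones in accuracy, parameter efficiency, and memory efficiency, but at higher computational cost.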

Abstract

The rapid growth of computer vision and increasingly complex image recognition tasks has exposed fundamental computational limitations of classical machine learning models, motivating the exploration of quantum computing as an emerging new paradigm. This paper presents a comprehensive benchmarking study of classical and quantum machine learning models for image recognition on the MNIST handwritten digit dataset, evaluating both traditional models, a Classical Support Vector Machine (CSVM) and a Quantum Support Vector Machine (QSVM), and deep neural network models, a Classical Convolutional Neural Network (CCNN) and a Quantum Convolutional Neural Network (QCNN), across four performance dimensions: classification accuracy, computational runtime, parameter count, and memory requirements. Experiments are conducted as functions of both feature dimensionality and sample size, and across CPU and GPU execution environments, providing a controlled, multidimensional comparison to address gaps in prior work. For the SVM-based models, QSVM consistently outperforms CSVM in accuracy, reaching ~0.90 versus ~0.85 at 1,000 samples, with a higher computational cost. A feature count of 10 qubits and a sample size in the range of 200-500 emerge as practical operating points that balance accuracy and runtime. For the neural network models, CCNN and QCNN achieve comparable classification accuracy, both exceeding 0.96 at 64 features and 60,000 samples, yet QCNN offers substantially superior parameter and memory efficiency, requiring ~94% fewer parameters and ~75% less memory than CCNN at higher feature counts, while incurring higher runtime. Across both model families, quantum models consistently outperform classical models by greater margins in accuracy as feature dimensionality or sample size increases.

2605.27922 2026-05-28 cs.AI version update

Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian, Guangxiang Zhao, Lin Sun, Xiangzheng Zhang, Tong Yang

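Affiliations: Peking University; Qiyuan Tech

AI summary: Introduces the Harness-Bench benchmark, which uses 106 sandboxed offline tasks to evaluate execution performance across combinations of models and harness configurations, finding that agent capability should be attributed to the model-harness configuration rather than to the base model.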

Comments 16 pages, 4 figures, 11 tables. The first three authors contributed equally

Abstract

LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution-layer variation difficult to study. We introduce Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution-alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.

2605.27921 2026-05-28 cs.AI cs.CL cs.CY cs.HC version update

Show, Don't TELL: Explainable AI-Generated Text Detection

Aldan Creo, Suraj Ranganath

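Affiliations: School of Computing, Information and Data Sciences; University of California, San Diego; United States of America

AI summary: Proposes TELL, a novel explainable architecture that, through a built-in explanation mechanism and reinforcement-learning training, maintains high detection performance (AUROC 0.927) while providing text-level annotations that help users identify AI-generated text using their own judgment.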

Abstract

Research on AI-generated text detection has presented a number of approaches to discern human from AI prose, some of which achieve high in-distribution performance. However, real-world applicability has stalled because their outputs are misaligned with the needs of users, such as professors, who are presented with a numeric score that has no attached explanation. We tackle this issue with a novel architecture, TELL, that bakes in explainability from the ground up. While our system still offers a numerical score like other detectors for comparability, TELL takes a fundamentally different approach where we aim to show the user the "tells" by which the model believes a text is AI or human-written, to empower the user to decide who wrote a text using their own judgment and understanding of the context of the writing and its alleged author. We train TELL on a custom SFT dataset of domain-specific authorship annotations, and further refine the system using GRPO with curriculum learning to improve performance. We achieve competitive performance with state-of-the-art detectors (AUROC 0.927) while natively providing annotations that explain the basis for the detector's decision. We further evaluate the quality of our explanations using a dataset of human annotations and report a high (mean 72.3%) win-rate on annotation concreteness, falsifiability, coherence, plausibility and grounding, allowing users to think critically and decide for themselves. Our work thus reframes the problem of AI-generated text detection in a human-centric perspective and paves the way for a new family of detectors that focus on native explainability.

2605.27911 2026-05-28 cs.AI version update

SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats

Xiangyu Wang, Zhiwei Yu, Chengze Du, Dingchang Wang, Yuhan Ye, Fangyu Zheng

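Affiliations: University of Chinese Academy of Sciences; Tsinghua University; Beijing University of Posts and Telecommunications; University of Science and Technology of China

AI summary: To address the challenges posed by fragmented messages, multi-turn conversation, and implicit expressions in instant-messaging group chats, the paper builds SuiChat-CN, the first Chinese group-chat benchmark for contextual suicide risk assessment. It constructs coherent conversational segments through signal-word extraction and bidirectional context expansion and annotates user risk levels with an expert-validated, LLM-assisted paradigm; experiments show that contextual information is essential for reliable assessment.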

Abstract

Suicide is a critical global public health challenge, causing approximately 720,000 deaths each year and calling for timely, effective prevention strategies. Existing computational studies primarily focus on post-based social media platforms such as Twitter and Weibo, leaving instant messaging environments such as Telegram underexplored. Yet group chats pose distinct challenges: messages are short, fragmented, multi-party, and often rely on implicit or culturally specific expressions, making isolated post-level analysis insufficient. We introduce SuiChat-CN, a Chinese group-chat benchmark for contextual suicide risk assessment. We collect public Telegram group-chat data, construct coherent conversational segments through signal-word extraction and bidirectional context expansion, and annotate user risk levels with an expert-validated, LLM-assisted paradigm. SuiChat-CN contains 13,312 contextual segments from 1,406 users, covering 258,228 raw chat messages. Extensive experiments with PLMs and more than 40 LLMs demonstrate that contextual information is essential for reliable risk assessment, while fine-tuning and partial-context evaluation further reveal the challenges of early detection in multi-party conversations. Due to ethical and sensitivity concerns, the dataset is not publicly released but will be shared with accredited mental health and suicide-prevention research institutions upon reasonable request.

2605.27908 2026-05-28 cs.CL cs.AI version update

ESC-Skills: Discovering and Self-Evolving Skills for Emotional Support Conversations

Jie Zhu, Huaixia Dou, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang, Fang Kong

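Affiliations: School of Computer Science and Technology, Soochow University; Qwen DianJin Team, Alibaba Cloud Computing

AI summary: Proposes the ESC-Skills framework, which models support interactions as Intervention Units to build a skills bank and combines it with a multi-profile self-evolution mechanism, improving the interpretability, controllability, and effectiveness of emotional support conversations.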

Abstract

Existing emotional support conversation (ESC) systems mainly rely on end-to-end response generation or coarse strategy supervision, offering limited interpretability and little support for systematic skill improvement. We propose ESC-Skills, a skill-centric framework that discovers and self-evolves executable emotional support skills. We first model localized support interactions as Intervention Units (IUs), which capture state--action--outcome dynamics between seeker states, support interventions, and post-response emotional changes. Based on IUs extracted from both successful and failed ESC dialogues, we construct the ESC-Skills Bank, a repository of executable emotional support skills containing intervention guidance, applicability conditions, expected outcomes, and potential risks. To further improve robustness, we introduce a multi-profile self-evolutionary refinement framework in which an ESC agent interacts with diverse simulated seeker profiles under SAGE evaluation. The resulting interaction traces are analyzed to identify missing skills, unsafe interventions, and profile-specific failure patterns, which are then used to refine the Skills Bank through simulation-based verification. Experimental results demonstrate that ESC-Skills improves both response-level quality and dialogue-level emotional outcomes while providing more interpretable and controllable support behaviors. We will release the code, prompts, and ESC-Skills Bank at https://github.com/aliyun/qwen-dianjin.

2605.27906 2026-05-28 cs.AI 版本更新

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

推理至关重要：通过推理条件偏好优化减轻多模态大型推理模型中的幻觉

Jiawei Kong, Hao Fang, Shunxiang Liao, Jinyu Li, Bin Chen, Hao Wu, Shu-Tao Xia, Min Zhang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院，清华大学）； Harbin Institute of Technology, Shenzhen（哈尔滨工业大学深圳）

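AI summary: Proposes Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which treats the chain of thought as a condition for answer generation and contrasts preferences under different chains of thought. Preference data are built with Monte Carlo Tree Search and attention-guided CoT pruning, and the method mitigates hallucination in multimodal large reasoning models.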

AI中文摘要

多模态大型推理模型引入了推理范式，在复杂的视觉-语言任务中展现出强大的能力。然而，它们仍然存在严重的幻觉问题。现有的基于训练的方法通常通过响应级直接偏好优化（DPO）来减轻幻觉，其中思维链（CoT）和最终答案被视为一个整体输出并联合优化。我们发现这种公式的表现与仅答案优化相似，表明它主要学习答案级别的偏好，而未能充分利用CoT级别的监督。为了解决这个问题，我们明确制定了一个面向CoT的偏好项，并推导出推理条件直接偏好优化（RC-DPO），它将CoT建模为答案生成的条件，并在不同CoT条件下对比同一偏好答案的偏好，促进答案支持的推理链对齐。为了进一步优化，我们引入了一种推理增强的偏好数据生成策略，该策略采用蒙特卡洛树搜索来发现视觉基础且逻辑一致的CoT作为正样本，以及注意力引导的CoT令牌剪枝来构建负样本。在各种模型和基准上的大量实验表明，RC-DPO有效减轻了幻觉，并提高了多模态推理过程的可靠性。

英文摘要

Multimodal Large Reasoning Models introduce the reasoning paradigm, demonstrating strong capabilities on complex vision-language tasks. However, they still suffer from severe hallucinations. Existing training-based methods typically mitigate hallucinations through response-level direct preference optimization (DPO), where the Chain-of-Thought (CoT) and the final answer are treated as a monolithic output and optimized jointly. We reveal that this formulation performs similarly to answer-only optimization, suggesting that it primarily learns answer-level preference, while leaving CoT-level supervision insufficiently exploited. To address this issue, we explicitly formulate a CoT-oriented preference term and derive Reasoning-Conditioned Direct Preference Optimization (RC-DPO), which models the CoT as a condition for answer generation and contrasts the preference for the same preferred answer under different CoT conditions, promoting answer-supportive reasoning chain alignment. To further improve optimization, we introduce a reasoning-enhanced preference data generation strategy that employs Monte Carlo Tree Search to discover visually grounded and logically consistent CoTs as positive samples, and attention-guided CoT token pruning to construct negative ones. Extensive experiments across various models and benchmarks show that RC-DPO effectively mitigates hallucinations and improves the reliability of the multimodal reasoning process.
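
The idea of contrasting the same preferred answer under different CoT conditions can be sketched with a generic DPO-style loss. This is an assumption-laden illustration, not the paper's derivation: the function below is the standard DPO form, and feeding it answer log-probabilities conditioned on a good CoT versus a pruned CoT is one plausible reading of the abstract. The function name, `beta`, and all numbers are hypothetical.

```python
import math


def dpo_style_loss(logp_pos, ref_logp_pos, logp_neg, ref_logp_neg, beta=0.1):
    """Generic DPO-style loss: -log sigmoid(beta * (pos log-ratio - neg log-ratio))."""
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Assumed inputs: log p(answer | image, question, CoT) for the SAME preferred answer,
# once under a positive CoT and once under a negative (e.g. pruned) CoT.
loss_no_preference = dpo_style_loss(-2.0, -2.0, -2.0, -2.0)
loss_prefers_good_cot = dpo_style_loss(-1.0, -2.0, -3.0, -2.0)
```

When the policy assigns the answer a higher probability under the positive CoT than the reference does, the loss drops below the no-preference value of log 2.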

2605.27904 2026-05-28 cs.AI cs.LG 版本更新

Dr-CiK: A Testbed for Foresight-Driven Agents

Dr-CiK：面向预见驱动型智能体的测试平台

Yihong Tang, Andrew Robert Williams, Arjun Ashok, Vincent Zhihao Zheng, Lijun Sun, Alexandre Drouin, Issam H. Laradji, Étienne Marcotte, Valentina Zantedeschi

发表机构 * McGill University（麦吉尔大学）； ServiceNow Research（ServiceNow研究院）； Mila -- Quebec AI Institute（蒙特利尔AI研究院）； University of British Columbia（不列颠哥伦比亚大学）

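AI summary: Existing context-aided forecasting benchmarks assume the context is already provided. Dr-CiK instead evaluates whether agents can retrieve, filter, and distill forecasting-relevant context from a document corpus and then generate forecasts. Experiments show that high-quality context substantially improves forecasting, but existing deep research agents typically recover under 5% of the evidence and are easily misled by distractors.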

AI中文摘要

现实环境中的时间序列预测通常不仅依赖于历史观测，还依赖于必须从嘈杂、异构的信息源中主动发现的外部上下文。然而，现有的上下文辅助预测基准通常假设支持性上下文已经提供，未考虑智能体是否能自行识别。因此，我们引入Dr-CiK，一个用于评估智能体是否能够从文档语料库中检索预测相关的支持性上下文、过滤干扰项、将检索到的上下文提炼为对预测有用的证据，并生成由该证据支持的预测的基准。通过上下文消融实验以及对最先进的深度研究和预测方法的联合评估，我们表明高质量上下文显著提高了Dr-CiK中的预测性能。然而，大多数现有的深度研究智能体仅能恢复一小部分真实支持证据（通常<5%），经常被干扰项误导（>80%的干扰项引用），并且可能导致预测器在使用检索到的上下文时比不使用上下文时表现更差。我们的结果激励了对预见驱动型智能体的研究，这些智能体能够搜索正确的上下文以预测未来。

英文摘要

Time series forecasting in real-world settings often depends not only on historical observations, but also on external context that must be actively discovered from noisy, heterogeneous information sources. Yet existing context-aided forecasting benchmarks typically assume that the supporting context is already provided, leaving open whether agents can identify it on their own. Therefore, we introduce Dr-CiK, a benchmark for evaluating whether agents can retrieve forecasting-relevant supporting context from a document corpus, filter out distractors, distill the retrieved context into forecast-useful evidence, and generate forecasts supported by that evidence. Through context ablations and evaluations of state-of-the-art deep research and forecasting methods paired together, we show that high-quality context substantially improves forecasting performance in Dr-CiK. However, most existing deep research (DR) agents recover only a small fraction of the ground-truth supporting evidence (usually <5%), are frequently misled by distractors (>80% distractor citations), and can cause forecasters to perform worse with retrieved context than without context. Our results motivate research on foresight-driven agents that search for the right context to predict the future.
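
The two retrieval figures quoted in this abstract (evidence recovered and distractor citations) can be illustrated with simple set arithmetic. The definitions below are assumptions made for this sketch; the benchmark's exact metric definitions are not given in the abstract, and the document IDs are invented.

```python
def evidence_recall(cited, gold):
    """Fraction of ground-truth supporting documents that the agent cited."""
    gold = set(gold)
    return len(set(cited) & gold) / len(gold) if gold else 0.0


def distractor_citation_rate(cited, distractors):
    """Fraction of the agent's citations that point at distractor documents."""
    cited = set(cited)
    return len(cited & set(distractors)) / len(cited) if cited else 0.0


# Hypothetical run: five citations, one of which is real supporting evidence.
cited = ["d1", "d7", "d9", "d12", "d13"]
gold = ["d1", "d2", "d3", "d4"]
distractors = ["d7", "d9", "d12", "d13"]

recall = evidence_recall(cited, gold)                  # 1 of 4 gold documents
rate = distractor_citation_rate(cited, distractors)    # 4 of 5 citations
```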

2605.27901 2026-05-28 cs.CL cs.AI 版本更新

The Fragility of Chain-of-Thought Monitoring Across Typologically Diverse Languages

跨类型多样语言的思维链监控脆弱性

Eric Onyame, Runtao Zhou, Kowshik Thopalli, Bhavya Kailkhura, Chirag Agarwal

发表机构 * University of Virginia（弗吉尼亚大学）； Lawrence Livermore National Laboratory（劳伦斯利弗莫尔国家实验室）

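AI summary: An evaluation across 13 languages and seven frontier model families finds that chain-of-thought monitoring is broadly unreliable under linguistic distribution shift (average unfaithfulness rate of 95.9%), that models engage in strategic manipulation, and that the deceptive patterns persist at 100% in low-resource languages.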

AI中文摘要

思维链（CoT）监控已被提出作为一种有前景的安全机制，用于检测大型语言模型中的失调行为。然而，其在英语之外以及跨不同模型家族中的可靠性在很大程度上仍未得到探索。我们首次在13种多样语言和7个前沿模型家族（共16个模型）上对CoT可监控性进行了大规模评估。使用需要显式中间计算的对抗性提示评估，结合内部答案标记概率分析，我们一致发现CoT在语言和提示类型上存在不忠实性，在8B至120B参数模型中平均不忠实率为95.9%。我们发现前沿模型系统性地进行策略性操纵，包括答案切换、事后合理化以及对提示的程序性利用，使得外部监控器难以检测欺骗。我们表明，前沿模型通常在其潜在激活中在生成的前15%内就承诺了失调线索，即使CoT看起来忠实。令人惊讶的是，这些欺骗模式在低资源语言中保持100%，揭示了当前基于CoT的监督的根本局限性。我们的结果表明，CoT监控在语言分布偏移下本质上是脆弱的，提供的安全信号比仅英语研究所暗示的要弱得多。这些发现强调了开发稳健的CoT监控器以及加速白盒监控技术研究的迫切需要，特别是为了改善中低资源语言中的CoT可监控性。我们的代码可在 https://multilingual-cot-monitoring.github.io/ 获取。

英文摘要

Chain-of-thought (CoT) monitoring has been proposed as a promising safety mechanism for detecting misaligned behavior in large language models. However, its reliability remains largely unexplored beyond English and across diverse model families. We present the first large-scale evaluation of CoT monitorability across 13 diverse languages and seven frontier model families, comprising 16 models. Using adversarial-hint evaluations that require explicit intermediate computation, together with analysis of internal answer-token probabilities, we consistently find CoT unfaithfulness across languages and hint types, with an average rate of 95.9% across 8B--120B parameter models. We find that frontier models systematically engage in strategic manipulation, including answer-switching, post-hoc rationalization, and procedural exploitation of hints, making external monitors struggle to detect deception. We show that frontier models often commit to the misaligned cue in their latent activations within the first 15% of generation, even when the CoT appears faithful. Surprisingly, these deceptive patterns remain at 100% in low-resource languages, revealing fundamental limitations in current CoT-based oversight. Our results reveal that CoT monitoring is fundamentally fragile under linguistic distribution shift, providing a substantially weaker safety signal than what English-only studies suggest. These findings underscore an urgent need to develop robust CoT monitors and to accelerate research into white-box monitoring techniques, especially to improve CoT monitorability in mid- and low-resource languages. Our code is available at https://multilingual-cot-monitoring.github.io/.

2605.27899 2026-05-28 cs.AI 版本更新

SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

SKILLC: 通过对比信用分配学习LLM智能体的自主技能内化

Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

发表机构 * Meituan（美团）

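AI summary: Proposes SkillC, a framework built on Contrastive Skill Credit Assignment (CSCA) that turns the skill-helpfulness contrast into a direct learning signal so that LLM agents internalize skills. It surpasses the strongest baseline by 5.5% on ALFWorld and 4.4% on WebShop.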

AI中文摘要

结构化技能提示改善了长周期智能体强化学习（RL）中的探索。技能增强型RL方法在推理时保留外部技能，而技能内化型RL方法在训练期间撤回技能以实现自主性能。然而，现有的内化方法仅使用技能帮助性对比进行课程控制，策略更新保持不变，无法区分技能依赖和自主成功。我们提出SkillC，一种基于对比技能信用分配（CSCA）的框架，将该对比转化为内化的直接学习信号。SkillC在同一策略更新中，为来自活跃技能类型的任务采样配对的技能注入和无技能轨迹，并通过双流优势估计器将它们的任务级对比注入优化，该估计器在保持全局排名的同时，对无技能成功施加单边校正。平滑的验证级信号进一步驱动自适应课程，包括归因强度、轨迹分配和单调活跃集剪枝。在ALFWorld和WebShop上的实验表明，在无运行时技能访问的情况下，SkillC分别超过最强先验技能内化RL基线5.5%和4.4%，同时与技能增强型RL方法保持竞争力。

英文摘要

Structured skill prompts improve exploration in long-horizon agentic reinforcement learning (RL). Skill-augmented RL methods retain external skills at inference, while skill-internalization RL methods withdraw them during training to enable autonomous performance. However, existing internalization approaches only use skill-helpfulness contrast for curriculum control, leaving the policy update unchanged and unable to distinguish skill-dependent from autonomous success. We propose SkillC, a framework based on Contrastive Skill Credit Assignment (CSCA) that converts this contrast into a direct learning signal for internalization. SkillC samples paired skill-injected and skill-free rollouts for tasks from active skill types within the same policy update, and injects their task-level contrast into optimization via a dual-stream advantage estimator that preserves global ranking while applying a one-sided correction toward skill-free success. A smoothed validation-level signal further drives an adaptive curriculum over attribution strength, rollout allocation, and monotonic active-set pruning. Experiments on ALFWorld and WebShop show that, without runtime skill access, SkillC surpasses the strongest prior skill-internalization RL baseline by 5.5% and 4.4%, respectively, while remaining competitive with skill-augmented RL methods.

2605.27898 2026-05-28 cs.AI 版本更新

A Unified Framework for the Evaluation of LLM Agentic Capabilities

LLM 代理能力评估的统一框架

Pengyu Zhu, Lijun Li, Yaxing Lyu, Qianxin Luo, Jingyi Yang, Yi Liu, Tingfeng Hui, Xinyu Yuan, Li Sun, Sen Su, Jing Shao

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Chongqing University of Posts and Telecommunications（重庆邮电大学）； North China Electric Power University（华北电力大学）

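AI summary: Proposes a unified framework that uses standardized configuration, a fixed ReAct architecture, and an offline setting to separate framework effects from environment effects, enabling fair evaluation of LLM agentic capabilities, with a large-scale empirical analysis over 7 benchmarks, 24 domains, and 15 models.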

AI中文摘要

随着 LLM 越来越多地被部署为代理，对其代理能力的可靠评估变得至关重要。然而，报告的基准分数通常共同反映了模型能力以及每个基准所附带的实现选择，使得跨基准结果难以解释为对底层模型的纯粹测量。在这项工作中，我们提出了一个用于公平评估 LLM 代理能力的统一框架。在统一配置系统的驱动下，该框架将多样化的基准整合为标准化的指令-工具-环境格式，通过固定的 ReAct 风格架构在可控沙箱中执行代理，并提供可选的离线设置，用精心策划的快照替换易变的实时环境，从而可以分别分析框架效应和环境效应。在此基础上，我们在每个基准的原始任务成功标准下统一了评估方法，同时引入了资源消耗的统一指标以及决策和执行层面失败归因的分类法。在该框架内，我们适配了 7 个广泛使用的基准，涵盖单代理、多代理和安全关键场景的 24 个领域，并在 15 个模型上进行了超过 40 万次 rollout 和 50 亿 token 的大规模实证分析。结果表明，脚手架选择和环境波动会使基准结果在两个方向上都发生实质性变化，使我们的框架能够将内在的 LLM 能力与框架和环境引入的伪影分离开来。我们进一步展示了其作为安全关键领域安全测试床的可扩展性。代码和基准可在 https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities, https://huggingface.co/AgentFramework/Unified_Farmework 获取。

英文摘要

As LLMs are increasingly deployed as agents, reliable assessment of their agentic capabilities has become essential. However, reported benchmark scores often jointly reflect model capability and the implementation choices each benchmark is packaged with, making cross-benchmark results difficult to interpret as clean measurements of the underlying model. In this work, we present a unified framework for the fair evaluation of LLM agentic capabilities. Driven by a unified configuration system, the framework integrates diverse benchmarks into a standardized instruction--tool--environment format, executes agents through a fixed ReAct-style architecture within a controllable sandbox, and provides an optional offline setting that replaces volatile live environments with curated snapshots, so that framework effects and environment effects can be analyzed separately. Building on this, we unify the evaluation methodology under each benchmark's original task-success criteria, while introducing unified metrics for resource consumption and a taxonomy for decision- and execution-level failure attribution. Within this framework, we adapt 7 widely used benchmarks spanning 24 domains across single-agent, multi-agent, and safety-critical scenarios, and conduct a large-scale empirical analysis over 400K rollouts and 5B tokens on 15 models. The results show that scaffold choice and environmental volatility materially shift benchmark outcomes in both directions, allowing our framework to disentangle intrinsic LLM capabilities from framework- and environment-induced artifacts. We further demonstrate its extensibility as a secure testbed for safety-critical domains. Code and benchmarks are available at https://github.com/whfeLingYu/A-Unified-Framework-for-the-Evaluation-of-LLM-Agentic-Capabilities and https://huggingface.co/AgentFramework/Unified_Farmework.

2605.27891 2026-05-28 cs.CV cs.AI 版本更新

SmartDirector: Keyframe-Conditioned Cinematic Video Generation with Narrative Pacing Control

SmartDirector: 基于关键帧的叙事节奏可控电影视频生成

Zhida Zhang, Jie Ma, Zhan Peng, Haoxue Wu, Yang Han, Jun Liang, Jie Cao, Jing Li

发表机构 * Youku Moku-Lab（优酷莫酷实验室）

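AI summary: Proposes SmartDirector, which conditions video generation on multiple keyframes to control narrative structure and temporal pacing. It works in two stages (Director-Gen generates a low-resolution video; Director-SR refines details using high-resolution keyframes) and substantially outperforms existing methods.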

AI中文摘要

视频的叙事质量从根本上决定了其感知价值。尽管现有的视频生成方法可以生成视觉上吸引人的内容，但它们主要依赖于稀疏的条件信号，如文本提示或首尾帧，这限制了对叙事结构和时间节奏的精确控制。在本文中，我们提出了SmartDirector，一个通过多个关键帧增强视频生成模型叙事能力的框架。SmartDirector支持灵活的生成场景，包括单镜头生成、多镜头叙事合成和视频扩展。该框架分两个阶段运行：Director-Gen根据提供的关键帧生成低分辨率视频，Director-SR通过利用高分辨率关键帧作为语义锚点来恢复细粒度细节，从而优化输出。为了实现鲁棒的多关键帧训练，我们构建了一个数据管道，从电影中策划单镜头和多镜头序列。大量实验表明，SmartDirector显著优于现有的最先进方法。我们将发布代码以促进进一步研究。

英文摘要

The narrative quality of a video fundamentally determines its perceptual value. Although existing video generation methods can produce visually appealing content, they predominantly rely on sparse conditioning signals such as text prompts or first/last frames, which limits precise control over narrative structure and temporal pacing. In this paper, we propose SmartDirector, a framework that enhances the narrative capacity of video generation models through multiple keyframes. SmartDirector supports flexible generation scenarios including single-shot generation, multi-shot narrative synthesis, and video extension. The framework operates in two stages: Director-Gen generates a low-resolution video conditioned on the provided keyframes, and Director-SR refines the output by exploiting high-resolution keyframes as semantic anchors to recover fine-grained details. To enable robust multi-keyframe training, we construct a data pipeline that curates single-shot and multi-shot sequences from movies. Extensive experiments demonstrate that SmartDirector substantially outperforms existing state-of-the-art approaches. We will release the code to facilitate further research.

2605.27882 2026-05-28 cs.CL cs.AI 版本更新

VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

VibeSearchBench：真实场景下长程主动搜索的基准测试

Xiaohongshu Inc

发表机构 * Xiaohongshu Dots Studio & Unipat AI（小红书 Dots Studio 与 Unipat AI）

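AI summary: Existing search benchmarks rely on over-specified queries, single-turn interaction, and fixed-schema evaluation, which opens a gap between benchmark scores and user experience. The paper proposes the VibeSearch paradigm and builds VibeSearchBench, testing frontier models with a progressive-disclosure user simulator and a graph-matching evaluation framework. All models still fall well short in long-context reasoning, proactive intent elicitation, and structured knowledge construction.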

AI中文摘要

基于LLM的智能体在搜索基准上得分很高，但真实用户始终觉得结果不令人满意，这揭示了持续的评估-体验差距。我们将这一差距归因于现有基准依赖于过度明确的查询、单轮交互和固定模式评估，这些都不反映真实搜索行为——用户和智能体通过多轮对话协作细化模糊意图。我们将这种范式称为VibeSearch，并引入VibeSearchBench，一个包含200个手动策划的双语（中文和英文）任务的基准，涵盖20个领域，分为VibeSearch-Pro（专业）和VibeSearch-Daily（日常生活）子集。每个任务将一个用户角色与一个无模式的真实知识图谱配对，并通过渐进式披露用户模拟器和图匹配评估框架进行评估。我们在ReAct框架和OpenClaw智能体框架下对七个前沿模型进行了基准测试。结果表明，所有模型对于VibeSearch仍然严重不足（最佳F1：30.30），凸显了在长上下文推理、主动意图激发和结构化知识构建方面需要根本性进展。

英文摘要

LLM-based agents score well on search benchmarks, yet real users consistently find results unsatisfying, revealing a persistent evaluation-experience gap. We attribute this gap to existing benchmarks' reliance on over-specified queries, single-turn interactions, and fixed-schema evaluation, none of which reflect real search behavior where users and agents collaboratively refine vague intent through multi-turn dialogue. We term this paradigm VibeSearch and introduce VibeSearchBench, a benchmark comprising 200 manually curated bilingual (Chinese and English) tasks across 20 domains, split into VibeSearch-Pro (professional) and VibeSearch-Daily (daily-life) subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph, and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework. We benchmark seven frontier models under both the ReAct framework and the OpenClaw agent harness. Results show that all models remain substantially inadequate for VibeSearch (best F1: 30.30), highlighting the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction.
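
To make the graph-level F1 in this abstract concrete, the sketch below scores a predicted knowledge graph against a gold graph using exact matching on normalized triples. This is only an assumed simplification: the benchmark's graph-matching framework is schema-free and is presumably more lenient than exact match, and the triples here are invented.

```python
def normalize(triple):
    """Lowercase and trim each part so trivial surface differences do not count."""
    return tuple(part.strip().lower() for part in triple)


def graph_f1(predicted, gold):
    """F1 over sets of (subject, relation, object) triples with exact matching."""
    pred = {normalize(t) for t in predicted}
    ref = {normalize(t) for t in gold}
    hits = len(pred & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


gold = [("Cafe A", "located_in", "Shanghai"),
        ("Cafe A", "serves", "pour-over"),
        ("Cafe A", "price", "mid")]
predicted = [("cafe a", "located_in", "shanghai "),
             ("Cafe A", "serves", "espresso")]

score = graph_f1(predicted, gold)  # precision 1/2, recall 1/3
```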

2605.27879 2026-05-28 cs.AI 版本更新

Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

迈向忠实代理式XAI：一种验证方法和一个用于更好模型忠实度的开放世界基准

Jaechang Kim, Sunung Mun, Seungjoon Lee, Jaewoong Cho, Jungseul Ok

发表机构 * Graduate School of AI, POSTECH（POSTECH人工智能研究生院）； Department of Computer Science and Engineering, POSTECH（POSTECH计算机科学与工程系）； Krafton

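AI summary: Proposes the FAX framework, which decomposes explanations into claims and cross-checks them against faithful tools through explicit verification, together with the open-world benchmark CRAFTER-XAI-Bench. In a reinforcement learning environment, simulation faithfulness rises from 0.20 to 0.46.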

AI中文摘要

可解释AI（XAI）帮助用户解释模型行为并识别潜在故障。代理式XAI系统使用大型语言模型（LLM）通过自然语言交互使解释更易理解，但也可能产生看似合理但不忠实的解释。这种风险源于不可靠的XAI输出可能被LLM放大并误导用户。我们提出忠实代理式XAI（FAX），一个通过显式验证提高解释忠实度的框架。FAX将草稿解释分解为声明，并针对固有忠实工具进行交叉检查，在最终生成前过滤不支持或矛盾的声明。我们还引入了CRAFTER-XAI-Bench，一个具有复杂策略、多样目标和挑战场景的开放世界强化学习基准，用于评估模型特定忠实度。在CRAFTER-XAI-Bench上，FAX将模拟忠实度从最强基线的0.20提升至0.46，同时保持高信息量、相关性和流畅性。在三个表格基准上，FAX与先前的代理式XAI基线表现相当，但我们的分析表明，这些设置可能将任务准确性与模型特定忠实度混为一谈。这些发现表明，显式验证对于忠实代理式XAI至关重要，并且忠实度基准必须设计用于测试解释是否针对目标模型本身的行为。

英文摘要

Explainable AI (XAI) helps users interpret model behavior and identify potential faults. Agentic XAI systems use Large Language Models (LLMs) to make explanations more accessible through natural-language interaction, but they can also produce plausible yet unfaithful explanations. This risk arises because unreliable XAI outputs for complex models can be amplified by LLMs and mislead users. We propose Faithful Agentic XAI (FAX), a framework that improves explanation faithfulness through explicit verification. FAX decomposes draft explanations into claims and cross-checks them against inherently faithful tools, filtering unsupported or contradictory claims before final generation. We also introduce CRAFTER-XAI-Bench, an open-world reinforcement learning benchmark with complex policies, diverse goals, and challenging scenarios for assessing model-specific faithfulness. On CRAFTER-XAI-Bench, FAX improves simulation faithfulness from 0.20 for the strongest baseline to 0.46 while maintaining high informativeness, relevance, and fluency. On three tabular benchmarks, FAX performs competitively with prior Agentic XAI baselines, but our analysis shows that these settings can conflate task accuracy with model-specific faithfulness. These findings show that explicit verification is essential for faithful Agentic XAI and that faithfulness benchmarks must be designed to test explanations against the behavior of the target model itself.
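
The decompose-then-verify step described in this abstract can be sketched as a small filter. Everything here is a stand-in chosen for illustration: sentence splitting replaces LLM-based claim decomposition, and a lookup table replaces the inherently faithful tools; none of these names come from the paper.

```python
def split_claims(draft):
    """Naive decomposition (assumption): one claim per sentence."""
    return [s.strip() for s in draft.split(".") if s.strip()]


def verify(draft, checker):
    """Keep only the claims that the checker marks as supported."""
    return [c for c in split_claims(draft) if checker(c) == "supported"]


# Hypothetical stand-in for a faithful tool: known facts about the explained policy.
facts = {
    "the agent collects wood first": "supported",
    "the agent avoids water": "contradicted",
}


def checker(claim):
    """Claims missing from the table count as unsupported."""
    return facts.get(claim.lower(), "unsupported")


draft = "The agent collects wood first. The agent avoids water. The agent likes night."
kept = verify(draft, checker)
```

Only the first claim survives: the second is contradicted and the third has no supporting evidence, which mirrors the filtering of unsupported or contradictory claims before final generation.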

2605.27877 2026-05-28 cs.LG cs.AI 版本更新

SPAR: Support-Preserving Action Rectification

SPAR: 支持保持的动作纠正

Jiaxin Zhao, Weihang Pan, Xun Liang, Binbin Lin

发表机构 * Zhejiang University（浙江大学）

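AI summary: Proposes Support-Preserving Action Rectification (SPAR), which recasts global learning as local residual rectification and adds a latent self-imitation mechanism to resolve the conflict between value maximization and fitting the data distribution in offline policy improvement, reaching state-of-the-art performance on the D4RL benchmark.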

AI中文摘要

离线策略改进面临着最大化价值与拟合数据分布之间的固有冲突。虽然样本内加权回归是稳定的，但它过度保守，抑制了分布尾部的高价值动作；相反，基于梯度的方法通常表现出拟合与优化之间的梯度冲突，这会将策略推离数据流形。为了解决这个问题，我们提出了支持保持的动作纠正（SPAR），它将全局学习重新定义为锚定在冻结的纯行为克隆策略上的局部残差纠正。该框架在残差空间中进行细粒度拟合和局部策略改进，从而收缩搜索空间。我们进一步引入了潜在自模仿，利用潜在采样加权回归机制来解决残差空间中的拟合-改进梯度冲突。理论上，我们证明了该机制消除了标准价值梯度的流形法向漂移，而广泛的D4RL实验表明，SPAR从次优基线中提取了显著的增益，实现了最先进的性能。

英文摘要

Offline policy improvement faces an inherent conflict between maximizing value and fitting the data distribution. While in-sample weighted regression is stable, it suffers from over-conservatism that suppresses high-value actions in the distribution tail; conversely, gradient-based approaches often exhibit a fitting-optimization conflict of gradients, which drives the policy off the data manifold. To address this, we propose Support-Preserving Action Rectification (SPAR), which reframes global learning as a local residual rectification anchored to a frozen pure behavior cloning policy. This framework performs fine-grained fitting and local policy improvement in the residual space, thereby contracting the search space. We further introduce Latent Self-Imitation, utilizing a latent-sampling weighted-regression mechanism to address fitting-improvement gradient conflict in the residual space. Theoretically, we prove this mechanism eliminates the manifold-normal drift of standard value gradients, while extensive D4RL experiments show SPAR extracts significant gains from suboptimal baselines to achieve state-of-the-art performance.
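
A minimal sketch of the residual idea, assuming the final action is the frozen behavior-cloning action plus a small bounded correction. The clipping bound `max_shift` and the action limits are assumptions added to illustrate "local" rectification; the abstract does not say how the residual is constrained, and Latent Self-Imitation is not modeled here.

```python
def rectified_action(base_action, residual, max_shift=0.1, low=-1.0, high=1.0):
    """Add a clipped residual to a frozen base action, then clip to action bounds."""
    out = []
    for a, r in zip(base_action, residual):
        r = max(-max_shift, min(max_shift, r))    # keep the correction local
        out.append(max(low, min(high, a + r)))    # stay inside the action space
    return out


# Hypothetical 2-D action: the base policy proposes, the residual head nudges.
action = rectified_action([0.5, -0.95], [0.3, -0.2])
```

Because the correction is bounded, the rectified action can never move further than `max_shift` from what the behavior-cloning policy proposed, which is one simple way to keep the search space contracted around the data.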

2605.27873 2026-05-28 cs.AI 版本更新

AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

AIBuildAI-2：一种用于自动构建AI模型的知识增强智能体

Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie

发表机构 * Department of Electrical and Computer Engineering, University of California San Diego（加州大学圣地亚哥分校电气与计算机工程系）； Department of Medicine, University of California San Diego（加州大学圣地亚哥分校医学系）

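AI summary: Agents that automatically build AI models are limited by the static parametric knowledge of their underlying LLMs. AIBuildAI-2 adds a hierarchical, evolving external knowledge system and dynamically loads relevant context so that design decisions are grounded in expert knowledge. It achieves a 70.7% medal rate on MLE-Bench and ranks in the top 6.6% in a heart disease prediction competition.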

AI中文摘要

AI模型支撑着从图像和文本处理到生物学、物理学和化学科学发现的数据中心应用。然而，开发这些模型仍然高度依赖人工，需要从业者设计架构、构建训练流程并迭代优化解决方案，这使得缺乏专业AI工程专业知识的自然科学家难以构建其研究所需的高性能模型。为减轻这一负担并拓宽AI在科学发现中的可及性，已有研究提出自动构建AI模型的智能体。然而，这些智能体的性能很大程度上受限于其底层大语言模型的参数知识，这些知识是静态的、常常过时，且缺乏实用的AI模型工程诀窍。为解决这一局限，我们提出AIBuildAI-2，一种具有外部、可进化知识系统的知识增强智能体，用于自动构建AI模型。AIBuildAI-2的知识系统是分层的，将整理好的AI开发知识组织为按主题类别划分的高层知识指令和每个类别下的低层知识文档，智能体据此仅动态加载与当前状态及待解决AI任务相关的上下文，使每个设计和实现决策都基于具体、可外部验证的专业知识。该系统通过从网络收集和清洗AI开发相关文档并将其组织到相应类别进行初始化，并通过从智能体自身经验中提炼每次AI任务完成运行的结构化要点并写回知识系统而持续进化。AIBuildAI-2取得了最先进的结果，在MLE-Bench上以70.7%的奖牌率排名第一，并在一个心脏病预测竞赛中位列4370个人类专家团队的前6.6%。

英文摘要

AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet developing them remains heavily manual, requiring practitioners to design architectures, build training pipelines, and iteratively refine solutions, making it challenging for natural scientists without specialized AI engineering expertise to build the high-performing models their research demands. To reduce this burden and broaden access to AI for scientific discovery, agents that automatically build AI models have been proposed. However, the performance of these agents is largely limited by the parametric knowledge of their underlying large language models, which is static, often outdated, and sparse on practical AI model engineering know-how. To address this limitation, we introduce AIBuildAI-2, a knowledge-enhanced agent with an external, evolving knowledge system for automatically building AI models. The knowledge system of AIBuildAI-2 is hierarchical, organizing curated AI development knowledge into high-level knowledge instructions over topical categories and low-level knowledge documents under each category, from which the agent dynamically loads only the context relevant to its current state and the AI task being solved, grounding each design and implementation decision in concrete, externally verifiable expertise. The system is initialized by collecting and cleaning AI-development-related documents from the web and organizing them into the corresponding categories, and continually evolves from the agent's own experience by distilling each completed run on an AI task into structured takeaways that are written back into the knowledge system. AIBuildAI-2 achieves state-of-the-art results, ranking first on MLE-Bench with a 70.7% medal rate and placing in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.
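
The two-level knowledge system described in this abstract can be pictured as categories that each hold one high-level instruction and several low-level documents. The sketch below is hypothetical throughout: the dictionary layout, the keyword-based relevance rule, and the `write_back` helper are illustrative assumptions, not the system's actual design.

```python
knowledge = {
    "tabular": {
        "instruction": "Start with gradient-boosted trees.",
        "documents": ["Notes on target encoding", "Notes on cross-validation splits"],
    },
    "vision": {
        "instruction": "Fine-tune a pretrained backbone.",
        "documents": ["Notes on augmentation"],
    },
}


def load_context(knowledge, task_description):
    """Load only the categories whose name appears in the task description."""
    context = []
    for category, entry in knowledge.items():
        if category in task_description.lower():
            context.append(entry["instruction"])
            context.extend(entry["documents"])
    return context


def write_back(knowledge, category, takeaway):
    """Store a takeaway from a finished run so that later runs can load it."""
    entry = knowledge.setdefault(category, {"instruction": "", "documents": []})
    entry["documents"].append(takeaway)


context = load_context(knowledge, "Tabular heart disease prediction")
write_back(knowledge, "tabular", "Takeaway: stratify folds on the label")
```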

2605.27861 2026-05-28 cs.LG cs.AI q-bio.QM 版本更新

From Detection to Mechanism: Cross-Attention Graph Neural Networks Enable Drug-Drug Interaction Type Prediction: An Ablation Study with Acetylsalicylic Acid Validation

从检测到机制：跨注意力图神经网络实现药物相互作用类型预测——一项以乙酰水杨酸验证的消融研究

Juergen Dietrich

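AI summary: A systematic ablation study compares three graph neural network architectures and finds that cross-attention (CrossAtt) improves drug-drug interaction type prediction (multi-class) far more than binary detection, with 10/10 correct predictions in the acetylsalicylic acid validation.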

Comments 12 pages, 1 figure

AI中文摘要

预测两种药物是否相互作用（二元检测）与预测该相互作用的机制类型（多分类）是本质上不同的任务。本研究在包含38,337个正例对（涵盖86种相互作用类型）的公开基准数据集上，对三种图神经网络架构进行了系统的消融实验，用于药物相互作用预测。在相同训练条件下（n=61,339对）比较了三种架构：带有拼接的双消息传递神经网络（Concat）、带有四头跨注意力的双MPNN（CrossAtt）以及引入相互作用图的三元MPNN（Ternary）。CrossAtt在多分类F1-macro上比Concat绝对提升+0.186（+45%），而二元AUC仅提升+0.012（+1.3%），证实原子级分子间通信专门支持机制类型分类。尽管训练数据相同，三元架构表现不佳，其失败与训练不稳定性假设一致。在训练前保留的十个乙酰水杨酸药物对上的验证表明，CrossAtt实现了10/10正确的DDI类型预测，而Ternary为0/10。在所有架构中识别出两个一致的失败案例，与一项配套毒性研究中确立的结构限制相关。

英文摘要

Predicting whether two drugs interact (binary detection) is a substantially different task from predicting the mechanism type of that interaction (multi-class classification). This study presents a systematic ablation study of three Graph Neural Network (GNN) architectures for drug-drug interaction (DDI) prediction on a publicly available benchmark dataset comprising 38,337 positive pairs across 86 interaction types. Three architectures are compared under identical training conditions (n = 61,339 pairs): a siamese dual Message Passing Neural Network (MPNN) with concatenation (Concat), a dual MPNN with four-head cross-attention (CrossAtt), and a ternary MPNN incorporating an interaction graph (Ternary). CrossAtt improves multi-class F1-macro by +0.186 absolute (+45%) over Concat, while improving binary AUC by only +0.012 (+1.3%), confirming that atom-level inter-molecular communication specifically enables mechanism-type classification. The ternary architecture underperforms despite equivalent training data, with its failure consistent with a training instability hypothesis. Validation on ten acetylsalicylic acid (ASA) drug pairs, held out prior to training, demonstrates 10/10 correct DDI-type predictions for CrossAtt versus 0/10 for Ternary. Two consistent failure cases are identified across all architectures, linking to structural limits established in a companion toxicity study.
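
The paired absolute and relative gains in this abstract imply approximate baseline scores, which the abstract itself does not report. The arithmetic below recovers them under the assumption that each relative gain is the absolute gain divided by the Concat score; since the percentages are rounded, the results are rough estimates only.

```python
def implied_baseline(absolute_gain, relative_gain):
    """Baseline score implied by an absolute gain and its relative size."""
    return absolute_gain / relative_gain


f1_concat = implied_baseline(0.186, 0.45)     # multi-class F1-macro, about 0.41
f1_crossatt = f1_concat + 0.186               # about 0.60

auc_concat = implied_baseline(0.012, 0.013)   # binary AUC, about 0.92
auc_crossatt = auc_concat + 0.012             # about 0.94
```

Read this way, the binary task was already close to saturated for the concatenation model, while type prediction had much more headroom, which fits the abstract's claim that cross-attention matters mainly for mechanism-type classification.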

2605.27860 2026-05-28 cs.AI 版本更新

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

C-MIG：基于多视角信息增益的检索增强生成用于临床诊断推理

Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan

发表机构 * Baidu Inc（百度公司）

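AI summary: Proposes C-MIG, which uses multi-view information gain and a multi-subquery retrieval augmentation strategy to address lost reward signal and the supervision of heterogeneous reasoning in retrieval-augmented generation, achieving top performance on clinical diagnosis tasks.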

AI中文摘要

检索增强生成结合强化学习在将大型语言模型锚定于可信医学证据方面显示出前景。然而，现有方法依赖精确匹配的二元奖励，在临床诊断中导致两个问题：(i) 语义相关但非逐字匹配的步骤获得零信号，丢弃了有价值的学习信号；(ii) 单一维度的奖励无法有效监督异构推理能力。为解决这些问题，我们提出C-MIG，一种基于多视角信息增益的临床诊断检索增强生成框架。C-MIG在冻结参考模型下从两个互补视角——检索文档和文档精炼——估计信息增益，以联合指导检索什么以及如何精炼，缓解了有价值奖励信号丢失和信用分配问题。我们进一步设计了一种多重子查询检索增强策略，提高了临床诊断场景中的知识召回覆盖率。在四个医学基准上的综合实验表明，C-MIG在领域内和领域外数据集上均达到所有RAG-RL方法中的最佳性能，并在临床诊断上超越了最先进的通用大型语言模型。

英文摘要

Retrieval-augmented generation combined with reinforcement learning has shown promise for grounding large language models in trustworthy medical evidence. However, existing methods rely on exact-match binary rewards, which in clinical diagnosis cause two issues: (i) semantically relevant but non-verbatim steps receive zero signal, discarding valuable learning signals; and (ii) uni-dimensional rewards cannot effectively supervise heterogeneous reasoning capabilities. To address these issues, we propose C-MIG, a Multi-view Information Gain-based retrieval-augmented generation framework for Clinical diagnosis. C-MIG estimates information gain under a frozen reference model from two complementary views, retrieved-document and document-refinement, to jointly guide what to retrieve and how to refine, alleviating the issues of valuable reward signal loss and credit assignment. We further design a multi-subquery retrieval augmentation strategy that improves knowledge recall coverage in clinical diagnostic scenarios. Comprehensive experiments on four medical benchmarks demonstrate that C-MIG achieves the best performance among all RAG-RL methods on both in-domain and out-of-domain sets, and outperforms state-of-the-art general-purpose LLMs for clinical diagnosis.
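
One common way to define information gain under a frozen reference model is the change in log-probability of the correct answer once extra context is added. The sketch below uses that definition as an assumption; the abstract does not give C-MIG's exact formula, and the probabilities and document names are invented.

```python
import math


def information_gain(p_with_context, p_without_context):
    """Log-probability gain of the correct diagnosis when context is added."""
    return math.log(p_with_context) - math.log(p_without_context)


# Assumed reference-model probabilities of the correct diagnosis.
p_question_only = 0.2
p_given_document = {"guideline excerpt": 0.6, "unrelated note": 0.15}

gains = {name: information_gain(p, p_question_only)
         for name, p in p_given_document.items()}
best = max(gains, key=gains.get)
```

Unlike an exact-match binary reward, this signal is graded: a helpful but non-verbatim document receives positive credit, and a distracting one receives negative credit.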

2605.27858 2026-05-28 cs.CL cs.AI cs.LG 版本更新

DecomposeRL: Learning to Ask Useful, Informative, and Diverse Questions for Semi-Supervised, Traceable Claim Verification

DecomposeRL: 学习提出有用、信息丰富且多样的问题以进行半监督、可追踪的声明验证

Shubhashis Roy Dipta, Ankur Padia, Francis Ferraro

发表机构 * Department of Computer Science and Electrical Engineering（计算机科学与电气工程系）

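AI summary: Proposes DecomposeRL, which uses GRPO and a multi-faceted reward ensemble to decompose claims into traceable sub-questions. It reaches high accuracy in both fully supervised and semi-supervised settings and matches larger models despite being 4x smaller.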

AI中文摘要

声明验证分为两类：端到端分类器准确但无法提供可检查的追踪，而基于分解的方法可产生可检查的追踪但在基准数据集上性能滞后。我们提出DecomposeRL，一种能产生可检查追踪的准确声明验证器。DecomposeRL将分解建模为使用GRPO和多面奖励集成训练的RL策略，支持从无标签声明进行完全监督和半监督学习。DecomposeRL通过数据筛选漏斗解决了GRPO高昂的训练成本，将115K事实验证声明提炼为包含密集学习信号的5K声明子集。我们表明，仅在约5K精选声明上使用完全监督训练的DecomposeRL-7B策略，在包含生物医学、政治、科学和通用领域声明的11个声明验证基准上，实现了86.3的域内和69.8的域外平衡准确率。尽管规模小4倍，它匹配了32B基线和GPT-4.1-mini，并且在仅10%标签声明数据的半监督设置中进一步优于基线。代码、数据和模型见https://dipta007.github.io/DecomposeRL。

英文摘要

Claim verification splits between end-to-end classifiers, which are accurate but yield no inspectable traces, and decomposition-based methods, which produce inspectable traces but lag in performance on benchmark datasets. We propose DecomposeRL, an accurate claim verifier that produces inspectable traces. DecomposeRL frames decomposition as an RL policy trained with GRPO and a multi-faceted reward ensemble, enabling both fully supervised and semi-supervised learning from unlabeled claims. DecomposeRL addresses the prohibitive training cost of GRPO with a data-curation funnel that distills 115K fact-verification claims into a compact, learning-signal-dense subset of 5K claims. We show that a DecomposeRL-7B policy trained with full supervision on only ~5K curated claims achieves 86.3 in-domain and 69.8 out-of-domain balanced accuracy across 11 claim-verification benchmarks containing biomedical, political, scientific, and general-domain claims. Despite being 4x smaller, it matches 32B baselines and GPT-4.1-mini, and it further outperforms baselines in a semi-supervised setting with only 10% labeled claims data. Code, data, and models are available at https://dipta007.github.io/DecomposeRL.
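
Balanced accuracy, the metric reported in this abstract, is conventionally the mean of per-class recall, so a majority-class guesser cannot score well on imbalanced data. The sketch below implements that standard definition; the labels and predictions are invented, and the paper's exact evaluation protocol is not shown here.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Hypothetical imbalanced set: 8 supported claims, 2 refuted claims.
y_true = ["supported"] * 8 + ["refuted"] * 2
y_pred = ["supported"] * 8 + ["supported", "refuted"]

score = balanced_accuracy(y_true, y_pred)                       # (1.0 + 0.5) / 2
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Plain accuracy on this example is 0.9, while balanced accuracy is 0.75 because half of the refuted claims were missed.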

2605.27856 2026-05-28 cs.IR cs.AI 版本更新

Fine-Tuned LLM as a Complementary Predictor Improving Ads System

微调LLM作为改进广告系统的互补预测器

Hui Yang, Daiwei He, Kevin Jiang, Taejin Park, Kungang Li, Jiajun Luo, Yuying Chen, Xinyi Zhang, Sihan Wang, Haoyu He, Yu Liu, Lakshmi Manoharan, David Xue, Shubham Barhate, Runze Su, Duna Zhan, Ling Leng, Siping Ji, Jinfeng Zhuang, Alice Wu, Leo Lu, Han Sun, Zhifang Liu

发表机构 * Pinterest, Inc., USA（Pinterest公司，美国）

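AI summary: Proposes using a fine-tuned open-source LLM as an ads-specific ancillary predictor that forecasts advertisers from user profiles and histories, augmenting candidate generation and supplying priors to downstream ranking. In an industrial ads system it yields offline improvements and online business gains.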

AI中文摘要

推荐系统驱动着信息流、广告和短视频平台的用户参与和变现，但将大型语言模型的最新进展转化为推荐系统的收益仍然罕见，尤其是在广告和工业级生产规模的实际场景中。先前真实世界的LLM成功通常分为三类：(a) 直接预测下一项以生成候选的生成式检索，(b) 使用LLM进行后期重排序，以及(c) 利用LLM进行辅助信号增强。我们为广告引入了一种互补范式：微调的开源LLM不作为排序器，而是作为广告特定的辅助预测器，从用户画像和历史中预测可能的广告主。这种LLM驱动的广告主预测增强了传统候选生成，并为下游排序提供了信息先验。在大规模生产广告系统中开发，我们的方法产生了显著的离线改进和可衡量的在线业务影响，展示了LLM的世界知识和预测能力可以被有效利用。除了验证LLM在广告应用中的有效性，我们的结果表明，有针对性的辅助预测可以在检索和后期排序中解锁端到端的收益，为大规模LLM增强推荐提供了一条实用路径。

英文摘要

Recommendation systems power engagement and monetization across feeds, ads, and short-video platforms, but translating the latest advances in Large Language Models into Recommendation Systems (RecSys) gains remains rare, particularly in advertising and production-scale real-world industry setups. Prior real-world LLM successes typically fall into three buckets: (a) generative retrieval that directly predicts the next items for candidate generation, (b) late-stage re-ranking that uses LLMs, and (c) auxiliary signal enrichment with LLMs. We introduce a complementary paradigm for ads: a fine-tuned open-source LLM used not as a ranker, but as an ads-specific ancillary predictor, forecasting likely advertisers from user profiles and histories. This LLM-driven advertiser prediction augments conventional candidate generation and provides informative priors to downstream ranking. Developed in a large-scale production advertising system, our approach produces substantial offline improvements and measurable online business impact, demonstrating that LLM world knowledge and predictive capacity can be efficiently harnessed. Beyond validating LLMs for ads applications, our results show that targeted ancillary predictions can unlock end-to-end gains across both retrieval and late-stage ranking, offering a practical path to LLM-enhanced recommendation at scale.


2605.27853 2026-05-28 cs.AI 版本更新

MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

MolLingo：面向LLM驱动的科学智能体的分子原生表示

Thao Nguyen, Heng Ji

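Affiliation: Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign

AI summary: Proposes MolLingo, a multi-agent system that coordinates a Literature Agent, a Chemist Agent, and an Orchestrator through shared memory and combines them with BRICS-based Fragment Enumeration (BFE) for block-level molecular reasoning and editing; outperforms frontier LLMs and specialized baselines on four benchmarks.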

AI中文摘要

我们提出MolLingo，一个模拟化学家推理过程的多智能体系统，用于自动化分子设计。现有的基于LLM的方法要么作为独立的生成模型运行，无法访问外部工具，要么缺乏多智能体协调和共享内存，无法在分子设计流程中进行迭代、证据驱动的推理。MolLingo通过共享内存模块协调文献智能体、化学家智能体和编排智能体来解决这一问题，每个智能体配备领域特定工具。为了实现有效的分子推理，我们引入了基于BRICS的片段枚举（BFE），这是一种合成感知的分子碎片化方法，将分子分解为化学上有意义的构建块，表示为基于块的SMILES并配以常见化学名称。这种表示桥接了分子结构和LLM语义空间，实现了仅靠原始SMILES难以实现的块级推理和编辑。作为早期治疗设计的案例研究，MolLingo进一步将化学家智能体的推理基于结合位点几何和来自分子对接的残基级蛋白质上下文，以优化分子以实现更强的靶标结合。在四个基准上，MolLingo始终优于前沿LLM和专用基线，包括在相同底层模型下，对接分数比GPT-5.4提升四倍，在多个LLM骨干上一致的药物性质优化增益，以及在TOMG-Bench上达到最先进结果，超越了前沿LLM和基于RL的优化方法RePO。我们的结果表明，当通过化学上有意义的表示和生物学基础的上下文进行引导时，LLM已经能够成为有能力的分子设计助手。代码可在：https://anonymous.4open.science/status/MolLingo-7450 获取。

英文摘要

We present MolLingo, a multi-agent system that emulates the reasoning process of a chemist to automate molecular design. Existing LLM-based approaches either operate as standalone generative models without access to external tools or lack the multi-agent coordination and shared memory needed for iterative, evidence-driven reasoning across the molecular design pipeline. MolLingo addresses this by coordinating a Literature Agent, a Chemist Agent, and an Orchestrator through a shared memory module, with each agent equipped with domain-specific tools. To enable effective molecular reasoning, we introduce BRICS-based Fragment Enumeration (BFE), a synthesis-aware molecular fragmentation method that decomposes molecules into chemically meaningful building blocks represented as block-based SMILES paired with common chemical names. This representation bridges molecular structure and LLM semantic space, enabling block-level reasoning and editing that is difficult with raw SMILES alone. As a case study in early-stage therapeutic design, MolLingo further grounds the Chemist Agent's reasoning in binding site geometry and residue-level protein context derived from molecular docking to optimize molecules for stronger target binding. Across four benchmarks, MolLingo consistently outperforms frontier LLMs and specialized baselines, including a fourfold docking score improvement over GPT-5.4 despite using the same underlying model, consistent drug property optimization gains across multiple LLM backbones, and state-of-the-art results on TOMG-Bench, surpassing both frontier LLMs and the RL-based optimization method RePO. Our results suggest that LLMs are already capable molecular design assistants when guided through chemically meaningful representations and biologically grounded structural context. Code is available at: https://anonymous.4open.science/status/MolLingo-7450.


2605.27851 2026-05-28 cs.AI 版本更新

When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

当上下文翻转，安全失效：诊断对齐语言模型中的脆弱安全性

Dasol Choi, Alex Kwon

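Affiliation: AIM Intelligence

AI summary: Proposes context-flip evaluation, testing 12 models on a safety benchmark and commonsense controls; finds that aligned language models show safety-specific brittleness that stems from policy override rather than miscomprehension, and shows that action-level guardrails fail to detect consequence flips.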

AI中文摘要

安全基准分数提供的部署准备证据不完整：对齐语言模型通常遵循刚性规则，即使情境更新翻转了哪个动作是安全的。我们将这种失败称为脆弱安全性。为诊断它，我们引入上下文翻转评估，在安全基准（PacifAIst）和两个常识控制上测试12个模型，使用配对变体，其中名义上安全的动作产生伤害。出现三个发现。首先，脆弱安全性是安全特异性的：所有12个模型都表现出安全-常识差距（平均+17.4个百分点）。基线准确率无法预测脆弱性：在基线准确率高于90%的模型中，脆弱率从13.7%到90.0%不等。其次，失败源于策略覆盖而非理解错误：尽管在每个案例中都承认上下文变化，模型通过三种不同机制持续存在，这些机制因更新类型和模型系列而异。第三，在对灾难性后果翻转场景的手动审计探测中，标准动作级护栏未能检测到任何情况，而状态感知验证器在正确干预上无假警报地检测到所有情况。这表明动作级内容审核系统性地对后果翻转视而不见，激发了状态感知架构替代方案。我们发布我们的协议、扰动基准和部署探测。

英文摘要

Safety benchmark scores provide incomplete evidence of deployment readiness: aligned language models often adhere to rigid rules even when a situational update flips which action is safe. We term this failure brittle safety. To diagnose it, we introduce context-flip evaluation, testing 12 models across a safety benchmark (PacifAIst) and two commonsense controls using paired variants where the nominally safe action produces harm. Three findings emerge. First, brittle safety is safety-specific: all 12 models exhibit a safety-commonsense gap (mean +17.4 pp). Baseline accuracy fails to predict brittleness: among models above 90% baseline accuracy, brittleness rates range from 13.7% to 90.0%. Second, failures stem from policy override rather than miscomprehension: despite acknowledging the context change in every case, models persist via three distinct mechanisms that vary by update type and model family. Third, on a hand-audited probe of catastrophic consequence-flip scenarios, standard action-level guardrails catch none, while a state-aware validator catches all without false alarms on correct interventions. This indicates that action-level content moderation is systematically blind to consequence-flips, motivating state-aware architectural alternatives. We release our protocol, perturbed benchmarks, and deployment probe.


2605.27850 2026-05-28 cs.AI 版本更新

TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems

TCP-MCP：面向多智能体系统的提示与通信拓扑的景观引导协同进化

Yi Ding, Zijie Xuan, Haowei Zhou, Zhenyu Ju, Xiaoxiao Dong, Jingwen Zhang, Xingyu Zhu, Leixin Sun, Haochi Zhang

Affiliations: National Institute of Metrology, China; University of California, Berkeley; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Nanjing University of Chinese Medicine; WEEX Exchange; National University of Singapore; Wuhan University; Peking University

AI summary: Proposes the TCP-MCP framework, which co-evolves agent prompts and communication topologies under three objectives (task performance, token cost, and structural complexity) for cost-aware, task-adaptive multi-agent system design.

AI中文摘要

有效的多智能体系统不能通过孤立地选择提示或通信图来设计。智能体行为取决于其接收的信息，而通信边的有用性则取决于接收智能体如何解释和使用该信息。我们提出TCP-MCP（面向多智能体协作问题求解的拓扑耦合提示），这是一个将智能体提示和通信拓扑作为统一基因组进行搜索的协同进化框架。TCP-MCP使用初始化时的景观探针来校准早期搜索行为，然后依赖帕累托前沿诊断在三个目标（任务性能、令牌成本和结构复杂度）下自适应调整探索。在所有方法中使用相同的DeepSeek-V3.2骨干网络，TCP-MCP在MMLU-Pro、MMLU和GSM8K上分别达到82.66%、89.96%和96.61%的准确率。在三个基准测试中，它持续优于自动图生成基线，并在报告的操作点上达到与辩论式系统相当的准确率，同时使用的令牌数最多减少5.69倍。这些结果表明，联合进化提示和通信结构为受控评估中成本感知和任务自适应的多智能体系统设计提供了一条实用途径。

英文摘要

Effective multi-agent systems cannot be designed by selecting prompts or communication graphs in isolation. Agent behavior depends on the information an agent receives, while the usefulness of a communication edge depends on how the receiving agent interprets and uses that information. We propose TCP-MCP (Topology-Coupled Prompting for Multi-Agent Collaborative Problem-Solving), a co-evolution framework that searches agent prompts and communication topologies as a unified genome. TCP-MCP uses an initialization-time landscape probe to calibrate early search behavior, and then relies on Pareto-front diagnostics to adapt exploration under three objectives: task performance, token cost, and structural complexity. Using the same DeepSeek-V3.2 backbone across all methods, TCP-MCP achieves 82.66%, 89.96%, and 96.61% accuracy on MMLU-Pro, MMLU, and GSM8K, respectively. Across the three benchmarks, it consistently outperforms automated graph-generation baselines and achieves competitive accuracy relative to debate-style systems, while using up to 5.69× fewer tokens than those systems at the reported operating points. These results show that jointly evolving prompts and communication structure provides a practical route to cost-aware and task-adaptive multi-agent system design in controlled evaluations.
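
The three-objective trade-off in this abstract can be illustrated with a minimal Pareto-front filter. This is a sketch under stated assumptions, not the TCP-MCP implementation: the `Candidate` record, its field names, and the example numbers are hypothetical, and accuracy is treated as maximized while token cost and structural complexity are minimized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one prompt-plus-topology genome."""
    name: str
    accuracy: float    # task performance, higher is better
    tokens: float      # token cost, lower is better
    complexity: float  # structural complexity (e.g. edge count), lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy and a.tokens <= b.tokens
                and a.complexity <= b.complexity)
    strictly_better = (a.accuracy > b.accuracy or a.tokens < b.tokens
                       or a.complexity < b.complexity)
    return no_worse and strictly_better


def pareto_front(population: list[Candidate]) -> list[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in population
            if not any(dominates(other, c) for other in population)]


# Illustrative numbers only; they are not results from the paper.
population = [
    Candidate("dense-debate", accuracy=0.90, tokens=5000, complexity=12),
    Candidate("sparse-chain", accuracy=0.89, tokens=900, complexity=4),
    Candidate("wasteful", accuracy=0.85, tokens=6000, complexity=14),
]
front_names = [c.name for c in pareto_front(population)]
```

Here the "wasteful" candidate is dropped because "dense-debate" beats it on all three objectives, while the other two survive because each wins on at least one objective.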


2605.27849 2026-05-28 cs.PL cs.AI cs.CL 版本更新

FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation

FPMoE：一种用于函数式代码生成的稀疏混合专家方法

Loc Pham, Lang Hong Nguyet Anh, Thanh Le-Cong

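Affiliations: GreenNode AI; Hanoi University of Science and Technology; Singapore University of Technology and Design

AI summary: Targets the weak performance of LLMs on functional programming languages with FPMoE, a sparse MoE model whose language-specific experts remove cross-language interference and whose shared expert captures cross-language abstractions; with 3B active parameters it far exceeds fine-tuned baselines and matches larger models.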

AI中文摘要

尽管基于LLM的代码生成取得了快速进展，但现有模型主要针对命令式语言进行训练，导致函数式编程语言（FPLs）如Haskell、OCaml和Scala长期未被充分探索，即使是前沿模型在FPLs上的表现也明显较差。微调是一种自然的补救措施，但我们的实验表明，每种语言的微调无法捕获共享的函数式抽象，而合并的多语言微调则引入了跨语言干扰。为了解决这个问题，我们引入了FPMoE，这是一个轻量级的开源代码生成模型，基于稀疏混合专家（MoE）架构，包含三个语言特定的路由专家（分别对应Haskell、OCaml和Scala）和一个共享专家，用于捕获跨语言的函数式模式，如单子推理和类型导向编程。这种设计同时解决了两种失败模式：专用专家消除了干扰，而共享专家保留了单语言模型遗漏的抽象。在FPEval上，FPMoE显著优于微调基线，并且仅使用3B活跃参数，即可匹配包括DeepSeek-Coder-6.7B、Qwen2.5-Coder-14B-Instruct和Qwen3-Coder-30B-A3B在内的更大模型的性能。

英文摘要

Despite rapid progress in LLM-based code generation, existing models are predominantly trained on imperative languages, leaving functional programming languages (FPLs) such as Haskell, OCaml, and Scala chronically underexplored, with even frontier models performing substantially worse on FPLs. Fine-tuning is a natural remedy, but our experiments show that per-language fine-tuning fails to capture shared functional abstractions, while merged multi-language fine-tuning introduces cross-language interference. To address this, we introduce FPMoE, a lightweight, open-source code generation model built on a sparse Mixture-of-Experts (MoE) architecture with three language-specific routed experts (one each for Haskell, OCaml, and Scala) and a shared expert that captures cross-language functional patterns such as monadic reasoning and type-directed programming. This design resolves both failure modes simultaneously: dedicated experts eliminate interference, while the shared expert preserves abstractions that per-language models miss. On FPEval, FPMoE substantially outperforms fine-tuned baselines and, with only 3B active parameters, matches the performance of much larger models including DeepSeek-Coder-6.7B, Qwen2.5-Coder-14B-Instruct, and Qwen3-Coder-30B-A3B.
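
The routed-plus-shared expert layout described in this abstract can be sketched with a toy layer. Everything here is an assumption for illustration: the abstract does not specify the router, so this sketch uses top-1 softmax gating, and the "experts" are constant scalings standing in for feed-forward blocks.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scale_expert(factor):
    """Toy expert: multiplies every feature by a constant (stand-in for an FFN)."""
    return lambda x: [factor * v for v in x]


# Hypothetical experts: three routed (one per language) and one shared.
ROUTED = {"haskell": scale_expert(2.0), "ocaml": scale_expert(3.0), "scala": scale_expert(4.0)}
SHARED = scale_expert(1.0)


def moe_layer(x, router_scores):
    """Top-1 routing: always apply the shared expert, then add the gated best routed expert."""
    names = list(ROUTED)
    gates = softmax([router_scores[n] for n in names])
    best = max(range(len(names)), key=lambda i: gates[i])
    routed_out = ROUTED[names[best]](x)
    shared_out = SHARED(x)
    out = [s + gates[best] * r for s, r in zip(shared_out, routed_out)]
    return names[best], out


chosen, output = moe_layer([1.0, -1.0], {"haskell": 5.0, "ocaml": 0.0, "scala": 0.0})
```

Only one routed expert runs per input, which is what keeps the active parameter count small, while the shared expert sees every input regardless of language.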


2605.27846 2026-05-28 cs.AI 版本更新

EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA

EAPO: 面向开放问答的基于熵驱动的自适应正负样本加权策略优化

Yunsheng Zeng, Gen Li, Yuwei Miao, Xiandong Li, Yujin Wang, Siyu Chen, Luning Wang, Yunhao Qiao, Junfeng Wang, Jianwei Lv, Bo Yuan

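Affiliation: Baidu Inc

AI summary: Addresses the fixed sample weights used in reinforcement learning for open-ended QA with EAPO, an entropy-driven adaptive policy optimization method that dynamically adjusts positive- and negative-sample weights to balance exploration and stability; markedly improves diversity and stability on medical QA datasets.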

AI中文摘要

大型推理模型通常通过可验证奖励的强化学习（RLVR）进行训练。然而，现有方法对正负样本采用固定权重，且结论难以推广到开放问答（QA）。本文系统研究了开放问答中强化学习正负样本的作用。我们提出了一种基于奖励均值的策略来区分正负样本，并观察到负样本主要控制响应多样性和性能上限，而正样本主要决定响应质量和收敛稳定性。基于这些观察，我们提出了EAPO，一种基于熵驱动的自适应策略优化方法，该方法根据当前策略熵与初始熵的比率自适应计算正样本的加权系数。在熵减阶段，分配给正样本的权重降低以保持探索，而在熵增阶段则放大以增强稳定性，从而缓解熵崩溃。在两个公开的开放医学问答数据集上的实验表明，EAPO在响应多样性和稳定性方面一致且显著优于固定权重基线。

英文摘要

Large Reasoning Models are typically trained via reinforcement learning from verifiable rewards (RLVR). However, existing approaches adopt fixed weights for positive and negative samples, and the conclusions hardly generalize to open-ended question answering (QA). In this paper, we systematically investigate the roles of positive and negative samples in reinforcement learning for open-ended QA. We propose a reward-mean-based strategy for distinguishing positive from negative samples, and observe that negative samples predominantly govern response diversity and the performance upper bound, whereas positive samples primarily determine response quality and convergence stability. Building on these observations, we propose EAPO, an Entropy-driven Adaptive Policy Optimization method that adaptively computes the weighting coefficients of positive samples based on the ratio of the current policy entropy to the initial entropy. During the entropy-decreasing phase, the weight assigned to positive samples is reduced to preserve exploration, whereas during the entropy-increasing phase it is amplified to reinforce stability, thereby mitigating entropy collapse. Experiments on two publicly available open-ended medical QA datasets demonstrate that EAPO consistently and substantially outperforms fixed-weight baselines in both response diversity and stability.
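
The entropy-ratio weighting described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's method: the abstract only states that the positive-sample weight depends on the ratio of current to initial policy entropy, so the linear-in-ratio form, the clipping bounds, and all function names below are assumptions.

```python
def split_by_reward_mean(rewards):
    """Label each sampled response positive if its reward exceeds the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r > mean for r in rewards]


def positive_weight(current_entropy, initial_entropy, base=1.0, lo=0.1, hi=2.0):
    """Scale the positive-sample weight by the entropy ratio, clipped to [lo, hi].

    Assumed form: weight = base * (current / initial). A ratio below 1
    (entropy has fallen) shrinks the weight; a ratio above 1 enlarges it.
    """
    ratio = current_entropy / initial_entropy
    return min(hi, max(lo, base * ratio))


def weighted_advantages(rewards, current_entropy, initial_entropy):
    """Mean-centred advantages, with positive samples rescaled by the adaptive weight."""
    mean = sum(rewards) / len(rewards)
    w_pos = positive_weight(current_entropy, initial_entropy)
    return [(w_pos if r > mean else 1.0) * (r - mean) for r in rewards]


rewards = [0.2, 0.8, 0.5, 0.9]
labels = split_by_reward_mean(rewards)
# Entropy has dropped to half its initial value, so positives are down-weighted.
advs = weighted_advantages(rewards, current_entropy=0.5, initial_entropy=1.0)
```

With the entropy at half its starting value, the two above-mean responses contribute half their usual advantage, leaving the negative samples relatively more influential.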


2605.27845 2026-05-28 cs.SI cs.AI physics.soc-ph 版本更新

Snippet-Driven Supply Chain Discovery with LLMs: Scaling Visibility in China

基于片段的供应链发现：利用LLMs在中国实现规模化可见性

Hiroto Fukada, Takayuki Mizuno

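Affiliations: Graduate Institute for Advanced Studies (SOKENDAI); National Institute of Informatics

AI summary: Proposes a method based on web search snippets that uses large language models to build a supply chain knowledge graph, extending coverage of inter-firm relationships among Chinese companies at low cost.

Comments 8 pages, 4 figures, 3 tables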

AI中文摘要

金融和经济研究通常依赖于结构化的供应链披露和商业数据库。在中国，供应商-客户披露通常仅限于上市公司的重大合作伙伴，导致非上市公司和长尾企业间关系在结构化数据中记录不足。公共网络证据可以通过企业、政府和贸易媒体披露部分弥补这一差距；然而，大规模的全文本网络挖掘成本高昂，因为页面通常难以访问或使用大语言模型（LLM）处理成本过高。我们提出了一种基于片段的方法来构建供应链知识图谱（SCKG），以企业为节点，企业间关系为边。网络搜索片段是与查询相关的摘要，随搜索结果返回。我们将其用作基于LLM的关系提取的可扩展第一层证据。我们从提取效率和覆盖范围两方面评估该流程。在提取效率方面，穷举全文本分块发现的独特关系数量是片段的19.8倍，但需要的输入token数量是片段的251.2倍，且冗余度更高。在覆盖范围方面，我们使用130,685家中国企业作为搜索种子，涵盖截至2024年的上海/深圳上市公司和大型非上市公司。在上市公司子集中，生成的SCKG覆盖的企业数量是CSMAR披露基准的7.2倍，关系数量是9.3倍，同时揭示了重尾度分布模式。保留的来源元数据使SCKG成为可审计的披露数据库补充。

英文摘要

Financial and economic research often relies on structured supply-chain disclosures and commercial databases. In China, supplier-customer disclosure is typically limited to major partners of listed firms, leaving unlisted firms and long-tail inter-firm links poorly captured in structured data. Public web evidence can partly complement this gap through corporate, government, and trade-media disclosures; however, full-text web mining at scale is costly because pages are often inaccessible or expensive to process with large language models (LLMs). We propose a snippet-driven method for constructing a supply chain knowledge graph (SCKG), with firms as nodes and inter-firm relationships as edges. Web search snippets are query-biased summaries returned with search results. We use them as a scalable first-pass evidence layer for LLM-based relationship extraction. We evaluate the pipeline in terms of extraction efficiency and coverage. For extraction efficiency, exhaustive full-text chunking discovers 19.8× more unique relationships than snippets, but requires 251.2× more input tokens and yields higher redundancy. For coverage, we use 130,685 Chinese firms as search seeds, covering Shanghai/Shenzhen-listed firms and large unlisted firms as of 2024. In the listed-firm subset, the resulting SCKG covers 7.2× more firms and 9.3× more relationships than the CSMAR disclosure-based benchmark, while revealing heavy-tailed degree patterns. Retained provenance metadata make the SCKG an auditable complement to disclosure-based databases.
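
The two efficiency multiples in this abstract can be combined into a single per-token figure. The short calculation below is a reader's back-of-the-envelope sketch, not an analysis from the paper; the function name is hypothetical and the result ignores the redundancy difference the abstract also mentions.

```python
def relationships_per_token_gain(rel_multiple, token_multiple):
    """How many times more unique relationships per input token snippets yield,
    given the full-text multiples (relative to snippets) for relationships and tokens."""
    return token_multiple / rel_multiple


# Multiples reported in the abstract: full text vs. snippets.
gain = relationships_per_token_gain(rel_multiple=19.8, token_multiple=251.2)
print(f"{gain:.1f}x more unique relationships per input token with snippets")
```

Full text finds more relationships in absolute terms, but under this reading snippets recover roughly 12.7 times as many unique relationships per input token.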


2605.27840 2026-05-28 eess.AS cs.AI cs.SD 版本更新

LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

LoSATok: 用于跨域音频理解与生成的低维语义-声学分词器

Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng, Guoyang Zeng, Zhiyong Wu

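Affiliations: Shenzhen International Graduate School, Tsinghua University, China; ModelBest Inc., China

AI summary: Proposes LoSATok, a low-dimensional audio tokenizer that uses semantic bottleneck compression and dual-level semantic supervision to jointly capture semantics and acoustic detail in a compact latent space, improving the generation performance of Diffusion Transformers.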

AI中文摘要

音频分词器是统一音频理解和生成的基础。理解需要高层语义，而生成需要语义和声学细节。现有的统一分词器将两者共同编码到高维连续潜变量中，这增加了扩散Transformer（DiT）的建模负担。我们提出LoSATok，一种用于跨域音频理解和生成的低维音频分词器。受1280维语义编码器特征可压缩的观察启发，我们引入语义瓶颈（Semantic Bottleneck），将其压缩到128维，并通过提出的时间关系损失（time-relation loss）正则化以实现时间特征一致性。我们进一步设计了一种双级语义监督方法，利用高维和低维语义信号，使分词器能够在紧凑的潜空间中联合捕获语义和声学细节。在语音、音乐和通用音频上的实验表明，SemBo保持了强大的低维语义能力，LoSATok与几种语义表示相比保持了有竞争力的理解性能，同时在语音、音乐和音频生成上持续提升了DiT的建模性能。这些结果表明，LoSATok的低维表示能够有效支持音频理解和生成。我们的代码提供在https://github.com/wxzyd123/LoSATok。

英文摘要

Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which increases the modeling burden of Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low-dimensional audio tokenizer for cross-domain audio understanding and generation. Motivated by the observation that 1280-dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck that compresses them into 128 dimensions, regularized by the proposed time-relation loss for temporal feature consistency. We further design a dual-level semantic supervision method that leverages both high- and low-dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong low-dimensional semantic capacity and LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and audio generation. These results demonstrate that LoSATok's low-dimensional representations can effectively support audio understanding and generation. Our code is provided at https://github.com/wxzyd123/LoSATok.


2605.27836 2026-05-28 cs.CR cs.AI 版本更新

Symmetry Defeats Auditing

对称性击败审计

Nick Merrill, Zeke Medley

Affiliations: UC Berkeley Forecasting Research Institute; Northeastern University

AI summary: Demonstrates an attack on introspection adapters (Shenoy et al., 2026).

2605.27827 2026-05-28 cs.AI cs.CY 版本更新

Operational AI Deployment Assurance: Governance-State Orchestration Under Threshold-Sensitive Deployment Conditions -- A Governance Framework for High-Stakes AI Systems

运营级AI部署保障：阈值敏感部署条件下的治理状态编排——高风险AI系统的治理框架

Khalid Adnan Alsayed

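Affiliations: Ducaltus | AI Assurance & Governance, Newcastle upon Tyne, United Kingdom; School of Computing, Engineering & Digital Technologies, Teesside University, Middlesbrough, United Kingdom

AI summary: Proposes the Operational AI Deployment Assurance (OADA) framework, which uses Deployment Assurance Scores, Deployment Readiness Classifications, Threshold Stability Zones, Governance Escalation States, and remediation-aware assurance progression to turn fairness disagreement, subgroup instability, and threshold sensitivity into deployment-oriented governance decisions, addressing the limits of static metric reporting and post-hoc auditing in high-stakes AI systems.

Comments 13 pages, 3 figures, governance-oriented framework for operational AI deployment assurance in high-stakes systems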

AI中文摘要

AI治理框架日益强调高风险领域的公平性、透明度、问责制和生命周期风险管理。然而，许多当前方法仍停留在观察层面，依赖静态指标报告、事后审计和监控仪表板，而未能直接治理部署就绪性、修复进展、升级状态或保障驱动的部署控制。本文引入运营级AI部署保障（OADA），这是一个治理框架，用于将公平性分歧、子组不稳定性、阈值敏感性、修复结果和运营不确定性转化为面向部署的保障决策。基于先前关于公平性分歧指数（FDI）和FairRisk-FDI的工作，OADA将治理不确定性重新定义为AI部署管道中的运营问题，而非指标分歧的副产品。该框架引入了部署保障分数、部署就绪分类、阈值稳定区、治理升级状态和修复感知保障推进。这些构造通过将评估输出与部署状态解释、重新评估、升级和运营控制相连接，支持高风险环境中的生命周期导向治理决策。通过在面部识别系统上进行面向部署的评估，并将讨论扩展到作为代表性高风险领域的医疗AI，本文展示了系统在孤立的公平性或性能指标下可能看似可接受，同时仍表现出影响部署就绪性的不稳定性。所提出的框架将运营部署保障定位为评估与现实世界AI部署之间的治理层。

英文摘要

AI governance frameworks increasingly emphasize fairness, transparency, accountability, and lifecycle risk management in high-stakes domains. However, many current approaches remain observational, relying on static metric reporting, post-hoc auditing, and monitoring dashboards without directly governing deployment readiness, remediation progression, escalation states, or assurance-driven deployment control. This paper introduces Operational AI Deployment Assurance (OADA), a governance framework for translating fairness disagreement, subgroup instability, threshold sensitivity, remediation outcomes, and operational uncertainty into deployment-oriented assurance decisions. Building on prior work on the Fairness Disagreement Index (FDI) and FairRisk-FDI, OADA reframes governance uncertainty as an operational concern within AI deployment pipelines rather than a byproduct of metric disagreement. The framework introduces Deployment Assurance Scores, Deployment Readiness Classifications, Threshold Stability Zones, Governance Escalation States, and remediation-aware assurance progression. These constructs support lifecycle-oriented governance decisions across high-stakes settings by connecting evaluation outputs to deployment-state interpretation, reassessment, escalation, and operational control. Through deployment-oriented evaluation across facial recognition systems, with discussion extended to healthcare AI as a representative high-stakes domain, the paper demonstrates how systems may appear acceptable under isolated fairness or performance metrics while still exhibiting instability that affects deployment readiness. The proposed framework positions operational deployment assurance as a governance layer between evaluation and real-world AI deployment.


2605.27824 2026-05-28 cs.AI cs.CL 版本更新

ReSAE: 用于多层Transformer干预的残差化稀疏自编码器

Prathyush Poduval, Calvin Yeung, Neel Desai, Mohsen Imani

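Affiliation: Georgia Institute of Technology

AI summary: Addresses the redundancy and interaction problems that cross-layer coupling causes for multi-layer sparse autoencoders (SAEs) in transformers with Residualized Sparse Autoencoders (ReSAEs), which fit affine maps between layers and train SAEs on the residuals, reducing decoder redundancy and improving cross-entropy recovery under multi-layer replacement.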

AI中文摘要

稀疏自编码器通常逐层训练，尽管Transformer残差流激活在深度上强烈耦合。这对多层干预造成实际问题：不同层的字典可能将容量用于表示相同的向前传递信息，同时替换多层可能产生单层行为无法预测的交互。我们引入残差化稀疏自编码器（ReSAE），它在选定层之间拟合仿射映射，并在未解释的残差上训练后续层的SAE，而非完整激活。重构通过拟合的仿射链映射回原始激活空间，因此ReSAE可以像普通SAE一样使用相同的干预协议进行评估。在Pythia-1.4B和Gemma-2-9B上，残差化减少了解码器冗余，并在大多数测试设置中改进了稀疏探测和定向扰动。尽管重构的原始激活方差较少，ReSAE在多层替换下恢复了更多Transformer交叉熵。这一增益在教师强制和足够的在线稀疏性下最为明显，表明ReSAE保留了与模型下游计算最相关的激活成分。这些结果表明，去除线性可预测的跨层结构是多层SAE干预的有用默认设置。

英文摘要

Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled across depth. This creates a practical problem for multi-layer interventions: different layerwise dictionaries can spend capacity representing the same carried-forward information, and replacing several layers at once can produce interactions that are not predicted by single-layer behavior. We introduce Residualized Sparse Autoencoders (ReSAEs), which fit an affine map between selected layers and train each later-layer SAE on the unexplained residual rather than on the full activation. Reconstructions are mapped back into the original activation space through the fitted affine chain, so ReSAEs can be evaluated with the same intervention protocols as ordinary SAEs. On Pythia-1.4B and Gemma-2-9B, residualization reduces decoder redundancy and improves sparse probing and targeted perturbation in most tested settings. Despite reconstructing less of the raw activation variance, ReSAEs recover more transformer cross entropy under multi-layer replacement. This gain is clearest under teacher-forcing and at sufficient sparsity online, indicating that ReSAEs preserve the components of the activation most relevant to the model's downstream computation. These results suggest that removing linearly predictable cross-layer structure is a useful default for multi-layer SAE interventions.
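
The residualization step described in this abstract can be sketched on a one-dimensional toy. This is an illustration under strong simplifying assumptions, not the authors' code: real activations are high-dimensional and the affine maps form a chain across several layers, whereas here a single scalar least-squares fit stands in for the affine map, and the "SAE" is replaced by a perfect reconstruction of the residual.

```python
def fit_affine(xs, ys):
    """Ordinary least squares for y ≈ a * x + b on scalar activations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b


def residualize(xs, ys, a, b):
    """Part of the later-layer activation that the affine map does not explain."""
    return [y - (a * x + b) for x, y in zip(xs, ys)]


def map_back(xs, residual_hat, a, b):
    """Return a residual reconstruction to the original activation space."""
    return [a * x + b + r for x, r in zip(xs, residual_hat)]


# Toy scalar activations at an earlier layer (xs) and a later layer (ys).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.5, 4.5, 7.0]
a, b = fit_affine(xs, ys)
residuals = residualize(xs, ys, a, b)  # an SAE would be trained on these
# Stand-in for a perfect SAE: it reconstructs the residual exactly.
recon = map_back(xs, residuals, a, b)
```

Because the affine fit already accounts for the carried-forward component, the later-layer dictionary only has to represent what is left in `residuals`, which is the source of the reduced redundancy the abstract reports.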


2605.27817 2026-05-28 cs.RO cs.AI cs.CV cs.LG 版本更新

Turning Video Models into Generalist Robot Policies

将视频模型转化为通用机器人策略

Sizhe Lester Li, Evan Kim, Xingjian Bai, Tong Zhao, Tao Pang, Max Simchowitz, Vincent Sitzmann

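Affiliations: MIT; CMU; Amazon FAR

AI summary: Proposes VERA, a decoupled video-to-action policy that pairs an action-free video world model with an inverse dynamics model based on the robot Jacobian to achieve zero-shot, cross-embodiment robot control.

Comments project page: https://vera.csail.mit.edu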

AI中文摘要

视频生成模型已成为一种有前景的机器人骨干网络，能够生成描绘跨本体和环境完成复杂任务的视频。最近的工作提出了机器人基础模型，通过使用带有动作标签的数据微调视频模型，联合预测未来观测和动作。在本文中，我们测试了一种替代方法的极限：保持视频规划器不变，同时训练一个特定本体的逆动力学模型（IDM）。这种解耦带来了几个自然的好处：视频规划器保持本体无关，不同的视频模型可以轻松互换而无需重新训练IDM，并且IDM可以独立地使用现成的自对弈数据进行训练。我们提出了一种闭环的视频到动作策略，该策略将无动作视频世界模型与基于机器人本体雅可比矩阵的精心设计的IDM相结合。我们证明了我们的IDM设计既数据高效又可扩展到高维动作空间。我们将该策略命名为视频到具身机器人动作模型（VERA），在模拟和真实世界基准测试中取得了强劲的性能，包括零样本的Panda机械臂操作和16自由度Allegro灵巧手立方体重新定向。通过将相同的视频规划器与不同的本体特定IDM配对，可以在多个本体上使用。我们的结果表明，解耦的视频规划加上忠实的视频到动作翻译是实现零样本、跨本体和可泛化机器人控制的可行替代途径。更多结果请访问我们的项目网站：https://vera.csail.mit.edu。

英文摘要

Video generative models have emerged as a promising robotics backbone, capable of generating videos that depict the completion of complex tasks across embodiments and environments. Recent work proposes robot foundation models that jointly predict future observations and actions by finetuning video models with action-labeled data. In this paper, we test the limits of an alternative approach: leave the video planner as-is while training an embodiment-specific inverse dynamics model (IDM). This decoupling offers several natural benefits: the video planner remains embodiment-agnostic, different video models can be interchanged easily without re-training the IDM, and the IDM can be independently trained with readily available self-play data. We present a closed-loop, video-to-action policy that combines an action-free video world model with a carefully-designed IDM based on the robot embodiment Jacobian. We demonstrate that our IDM design is both data-efficient and scalable to high-dimensional action spaces. Our policy, which we coin the Video-to-Embodied Robot Action Model (VERA), achieves strong performance across simulated and real-world benchmarks, including zero-shot Panda arm manipulation and 16-DoF Allegro-hand dexterous cube re-orientation. The same video planner can be used across multiple embodiments by pairing it with different embodiment-specific IDMs. Our results show that decoupled video planning plus faithful video-to-action translation is a viable alternative route towards zero-shot, cross-embodiment, and generalizable robot control. More results are available on our project website: https://vera.csail.mit.edu.


2605.27813 2026-05-28 cs.CV cs.AI cs.LG 版本更新

Residualized Temporal Sparse Autoencoders for Interpreting Diffusion Models

残差化时间稀疏自编码器用于解释扩散模型

Calvin Yeung, Prathyush Poduval, Ali Zakeri, Zhuowen Zou, Mohsen Imani

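Affiliation: University of California, Irvine

AI summary: Proposes residualized temporal sparse autoencoders, which learn interpretable features in diffusion activation trajectories from the residuals of linear predictions between denoising timesteps, and validates them on Stable Diffusion 1.5.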

AI中文摘要

文本到图像扩散模型通过迭代去噪过程生成图像，因此内部神经层产生激活轨迹而非单一静态表示。稀疏自编码器（SAE）最近被用于将扩散激活分解为可解释的特征方向，但大多数方法在单个时间步分析激活或基于时间条件，而非直接从完整激活轨迹中学习。在这项工作中，我们引入了用于扩散激活轨迹的残差化时间SAE。我们收集去噪时间上的激活，拟合相邻时间步之间的线性预测器，并使用初始激活以及这些线性动力学未解释的残差分量来表示每个轨迹。在这种残差化表示上训练SAE鼓励稀疏潜在变量捕捉超出线性可预测范围的结构。残差化解码器方向可以映射回激活空间，使得每个潜在变量可以作为去噪时间上的特征轨迹进行分析。通过在Stable Diffusion 1.5上的重建与消融研究、时空特征分析和定性引导实验，我们表明残差化时间SAE为研究时间结构化的扩散激活提供了一个有用的框架。

英文摘要

Text-to-image diffusion models generate images through an iterative denoising process, so internal neural layers produce trajectories of activations rather than single static representations. Sparse autoencoders (SAEs) have recently been used to decompose diffusion activations into interpretable feature directions, but most approaches analyze activations at individual timesteps or condition on time rather than learning directly from full activation trajectories. In this work, we introduce residualized temporal SAEs for diffusion activation trajectories. We collect activations across denoising time, fit linear predictors between neighboring timesteps, and represent each trajectory using an initial activation together with residual components not explained by these linear dynamics. Training an SAE on this residualized representation encourages sparse latents to capture structure beyond what is linearly predictable. The residualized decoder directions can be mapped back into activation space, allowing each latent to be analyzed as a feature trajectory over denoising time. Through reconstruction and ablation studies, spatiotemporal feature analysis, and qualitative steering experiments on Stable Diffusion 1.5, we show that residualized temporal SAEs provide a useful framework for studying temporally structured diffusion activations.


2605.27811 2026-05-28 cs.AI 版本更新

Constrained Auto-Bidding via Generative Response Modeling

通过生成式响应建模实现约束自动出价

Eunseok Yang, Xingdong Zuo, Kyung-Min Kim

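Affiliation: NAVER Corporation

AI summary: Proposes the Generative Response Model (GRM), which predicts future traffic and aggregate cost/value curves and pairs them with a lightweight analytic controller for stable, efficient auto-bidding under budget and ratio constraints.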

DOI: 10.1145/3770855.3817847

AI中文摘要

自动出价系统旨在预算约束和成本每次获取等比率目标下，最大化广告主在长期内的价值，然而未来流量和拍卖动态是非平稳且不确定的。现有方法面临明显局限性：基于控制的节奏方法对偏差做出反应但无法预测未来条件，而强化学习和生成方法将约束纳入奖励信号，掩盖了违规并在分布偏移下退化。我们将学习目标从动作转向响应，提出生成式响应模型（GRM），这是一个基于历史条件的序列模型，联合预测未来流量和作为单一出价乘数函数的水平聚合成本/价值曲线。我们证明，在温和的单调性条件下，相对于完全逐拍控制的最优性差距受逐拍边际价值-成本离散度的限制。给定预测响应，一个轻量解析控制器通过一维求根步骤强制执行每个活动约束。我们证明该控制器对于单乘数问题是精确的，并根据预测误差限制了滚动时域重规划下的约束违规。在AuctionNet上的实验表明，与现有基线相比，GRM提高了约束稳定性和总体得分。

英文摘要

Auto-bidding systems aim to maximize advertiser value over long horizons under budget constraints and ratio targets such as cost-per-acquisition, yet future traffic and auction dynamics are non-stationary and uncertain. Existing approaches face distinct limitations: control-based pacing reacts to deviations but cannot anticipate future conditions, while RL and generative methods fold constraints into reward signals, obscuring violations and degrading under distribution shift. We shift the learning target from actions to responses with the Generative Response Model (GRM), a history-conditioned sequence model that jointly predicts future traffic volume and horizon-aggregate cost/value curves as functions of a single bid multiplier. We show that under mild monotonicity conditions, the optimality gap relative to full per-tick control is bounded by the dispersion of per-tick marginal value-per-cost. Given predicted responses, a lightweight analytic controller enforces each active constraint via a 1D root-finding step. We prove this controller is exact for the single-multiplier problem and bound constraint violations under receding-horizon replanning in terms of prediction error. Experiments on AuctionNet show that GRM improves constraint stability and overall score compared to existing baselines.
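
The 1D root-finding step mentioned in this abstract can be sketched with plain bisection over the bid multiplier. This is a minimal sketch under stated assumptions, not the paper's controller: it handles only a budget constraint, it assumes the predicted cost is non-decreasing in the multiplier, and `toy_cost_curve` is a hypothetical stand-in for a curve that the model would predict.

```python
def solve_multiplier(predicted_cost, budget, lo=0.0, hi=10.0, tol=1e-6):
    """Bisection for the largest multiplier whose predicted cost stays within budget.

    Assumes predicted_cost is non-decreasing in the multiplier.
    """
    if predicted_cost(hi) <= budget:
        return hi  # the budget is not binding on this interval
    if predicted_cost(lo) > budget:
        raise ValueError("budget is infeasible even at the lowest multiplier")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if predicted_cost(mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def toy_cost_curve(multiplier):
    """Hypothetical horizon-aggregate cost curve; a real one would come from the model."""
    return 100.0 * multiplier ** 2


multiplier = solve_multiplier(toy_cost_curve, budget=225.0)
```

A ratio target such as cost-per-acquisition could be handled the same way by bisecting on predicted cost divided by predicted value, provided that ratio is also monotone in the multiplier.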


2605.27805 2026-05-28 cs.CL cs.AI 版本更新

ChildEval: When large language models meet children's personalities

ChildEval：当大语言模型遇到儿童个性

Yanyan Luo, Xue Han, Chunxu Zhao, Ruiqiao Bai, Yaxing Zhang, Qian Hu, Lijun Mei, Junlan Feng

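Affiliations: JIUTIAN Research; China Mobile; Beijing, China

AI summary: Proposes the ChildEval benchmark, which synthesizes persona profiles and preferences (expressed explicitly or implicitly) for children aged 3-6 to evaluate how well LLMs infer and follow child preferences in long conversations; experiments show that fine-tuning improves child-centered performance.

Comments 8 pages of main text (ACL Findings format), with references and appendix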

AI中文摘要

虽然大语言模型（LLM）使得个性化聊天机器人成为可能，但它们在儿童中心个性化方面的有效性仍不明确，因为缺乏对儿童特定偏好的系统评估。为填补这一空白，我们引入了ChildEval，一个用于评估LLM在长上下文对话中推断和遵循儿童中心偏好能力的基准。ChildEval包含29K个3-6岁儿童的合成个性档案，提供相对静态的背景信息。每个个性档案关联一个儿童偏好——可能与个性一致、冲突或独立——通过单句显式表达或6-10轮对话隐式表达。显式和隐式偏好旨在反映相同的潜在偏好，但表达方式不同，捕捉偏好表达的动态方面而非静态个性的变化。该基准涵盖五个顶层类别和十四个子类别，覆盖儿童的日常生活和发展。我们进一步提出了细粒度、以儿童为中心的评估协议，以系统评估开源LLM。实验结果表明，不同的个性化表示如何影响LLM的响应，并表明在ChildEval上进行微调可以提升儿童中心性能。我们的代码和数据集可在https://github.com/ziyanluo/ChildEval获取。

英文摘要

While LLMs enable personalized chatbots, their effectiveness in child-centered personalization remains unclear, as systematic evaluation of child-specific preferences is still lacking. To address this gap, we introduce ChildEval, a benchmark for evaluating LLMs' ability to infer and follow child-centered preferences in long-context conversations. ChildEval contains 29K synthesized persona profiles of children aged 3-6, providing relatively static background information. Each persona is associated with a child preference (which may align with, conflict with, or be independent of the persona) expressed either explicitly in a single sentence or implicitly through 6-10 turn dialogues. Explicit and implicit preferences are designed to reflect the same underlying preference but differ in expression, capturing dynamic aspects of preference expression rather than changes in the static persona. The benchmark spans five top-level and fourteen sub-level categories covering children's daily lives and development. We further propose fine-grained, child-centric evaluation protocols to systematically assess open-source LLMs. Experimental results demonstrate how different personalized representations affect LLM responses and suggest that finetuning on ChildEval can enhance child-centered performance. Our code and dataset are available at https://github.com/ziyanluo/ChildEval.


2605.27799 2026-05-28 cs.AI eess.SP 版本更新

GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease

GraD-IBD：基于诊断轨迹的图表示学习用于炎症性肠病的早期检测

Leo Y. Li-Han, Ellen L. Larson, Elizabeth B. Habermann, Cornelius A. Thiels, Hojjat Salehinejad

发表机构 * Department of Surgery, Mayo Clinic, Rochester, MN, USA（外科部，梅奥诊所，罗切斯特，MN，美国）； Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA（健康保健交付科学中心，梅奥诊所，罗切斯特，MN，美国）； Division of Hepatobiliary and Pancreas Surgery, Mayo Clinic, Rochester, MN, USA（肝胆胰外科部，梅奥诊所，罗切斯特，MN，美国）； Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA（人工智能与信息学部，梅奥诊所，罗切斯特，MN，美国）

AI总结提出GraD-IBD图诊断模型，将纵向ICD轨迹重构为时间有向图，并设计上下文感知的时间衰减消息传递机制，以降低复杂度并提升炎症性肠病检测性能。

AI中文摘要

国际疾病分类（ICD）是一种全球公认的编码系统，记录每次患者就诊的诊断事件，为各种临床任务提供标准化的数据基础。然而，ICD代码序列的不规则性和层次性给基于N-D格子的序列建模方法带来了挑战，导致模型设计过于复杂。在本文中，我们提出了GraD-IBD，一种图诊断模型，将纵向ICD轨迹重构为按就诊分桶的时间有向图，以检测炎症性肠病（IBD）的风险。我们开发了一种新颖的上下文感知时间衰减消息传递机制，以捕获时间依赖性并降低模型复杂度。使用真实世界临床数据集的实验结果表明，与最先进的方法相比，IBD检测性能一致且稳健地提升，同时与序列模型相比，计算复杂度显著降低。这些发现凸显了图表示学习在从纵向ICD诊断代码中进行高效、可扩展且准确的疾病风险预测方面的潜力。

英文摘要

International Classification of Diseases (ICD) is a globally recognized coding system that records diagnostic events during each patient encounter, providing a standardized data foundation for various clinical tasks. However, the irregular and hierarchical nature of ICD code sequences poses challenges for N-D lattice-based sequential modeling methods, leading to overly complex model designs. In this paper, we propose GraD-IBD, a graph diagnosis model that reformulates longitudinal ICD trajectories as visit-bucketized, temporally directed graphs to detect the risk of inflammatory bowel disease (IBD). A novel context-aware, time-decay message passing mechanism was developed to capture temporal dependencies while reducing model complexity. The experimental results using a real-world clinical dataset demonstrated consistent and robust improvements in IBD detection over state-of-the-art methods, with significant reductions in computational complexity compared to sequential models. These findings highlight the potential of graph representation learning to enable efficient, scalable, and accurate disease risk prediction from longitudinal ICD diagnosis codes.

2605.27789 2026-05-28 cs.AI cs.CL 版本更新

A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test

固定预算、聚类感知的 LLM-as-a-Judge 评估标准：多跳 RAG 压力测试

Camilo Chacón Sartori, José H. García

发表机构 * Catalan Institute of Nanoscience and Nanotechnology（加泰罗尼亚纳米科学与纳米技术研究所）

AI总结针对多跳 RAG 系统评估中的统计偏差问题，提出一种固定预算、聚类感知的 LLM-as-a-Judge 比较标准，并通过遗传算法证据选择器 GADMEC 在 400 个多跳问题上进行压力测试，揭示聚类感知推断改变了实证结论。

AI中文摘要

检索增强生成（RAG）系统通常通过让大型语言模型（LLM）评判者判断哪个答案更好来进行比较。对于多跳 RAG，这既是建模问题，也同样是测量问题：相同的分数可能反映检索质量、答案长度、词汇重叠，或是忽略聚类数据的统计检验。我们探究当这些选择被明确规定时会发生什么。我们提出了 RAG 中 LLM-as-a-Judge 比较的最低测量标准。该标准固定了 top-100 候选池、证据预算、答案长度上限、生成器和提示；它还要求预先注册假设、聚类感知推断、在可行时进行精确的聚类符号翻转检验，以及由第二评判者复现。聚类基准可能夸大进展；该领域应采用此标准。我们使用进化式证据选择器 GADMEC（多跳证据组合遗传算法解码器），在计算机科学/机器学习（CS/ML）和材料科学领域的 400 个多跳问题上对该标准进行压力测试。该协议改变了实证结论。二项检验使全部四个语义基线比较看起来显著；聚类感知推断只留下一个 Bonferroni 显著的结果。在相同预算下，BM25 优于纯语义 GADMEC，而词汇-语义混合方法在 CS/ML 中恢复了性能，并缩小了材料科学上的差距。

英文摘要

Retrieval-augmented generation (RAG) systems are often compared by asking a large language model (LLM) judge which answer is better. For multi-hop RAG, this has become a measurement problem as much as a modeling problem: the same score can reflect retrieval quality, answer length, lexical overlap, or a statistical test that ignores clustered data. We ask what happens when these choices are made explicit. We propose a minimum measurement standard for LLM-as-a-judge comparisons in RAG. The standard fixes the top-100 candidate pool, evidence budget, answer cap, generator, and prompt; it also requires pre-registered hypotheses, cluster-aware inference, an exact cluster sign-flip check when feasible, and second-judge replication. Clustered benchmarks can overstate progress; the field should adopt this standard. We stress-test it with Genetic Algorithm Decoder for Multi-hop Evidence Composition (GADMEC), an evolutionary evidence selector, on 400 multi-hop questions in computer science/machine learning (CS/ML) and Materials Science. The protocol changes the empirical story. A binomial test makes all four semantic-baseline comparisons look significant; cluster-aware inference leaves only one Bonferroni-significant result. BM25 beats pure semantic GADMEC under the same budget, while a lexical-semantic hybrid recovers in CS/ML and narrows the Materials Science gap.
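
The exact cluster sign-flip check named in the abstract can be illustrated with a minimal Python sketch. It assumes each question yields a paired judge outcome (+1 if system A wins, -1 if system B wins, 0 for a tie) and that every question carries a cluster label; the function name, the input encoding, and the choice of the summed outcome as test statistic are illustrative assumptions, not the paper's protocol.

```python
from itertools import product


def cluster_sign_flip_pvalue(outcomes, clusters):
    """Exact two-sided sign-flip p-value with clusters as the exchangeable unit.

    outcomes: per-question paired outcomes (+1, -1, or 0).
    clusters: per-question cluster labels, same length as outcomes.
    """
    # Collapse questions to one sum per cluster, so that correlated
    # questions inside a cluster are never flipped independently.
    sums = {}
    for y, c in zip(outcomes, clusters):
        sums[c] = sums.get(c, 0.0) + y
    cluster_sums = list(sums.values())

    observed = abs(sum(cluster_sums))
    hits = 0
    total = 0
    # Enumerate all 2^G sign assignments (feasible only for small G).
    for signs in product((1, -1), repeat=len(cluster_sums)):
        stat = abs(sum(s * v for s, v in zip(signs, cluster_sums)))
        if stat >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Six wins for system A, but they come from only three clusters.
p = cluster_sign_flip_pvalue([1, 1, 1, 1, 1, 1], ["a", "a", "b", "b", "c", "c"])
print(p)
```

With only three clusters, the smallest two-sided p-value this test can return is 2/8, however many questions each cluster contains. This is the sense in which a per-question binomial test can overstate significance on clustered data.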

2605.27785 2026-05-28 cs.AI cs.DB 版本更新

A Query Engine for the Agents

面向智能体的查询引擎

Kenny Daniel

发表机构 * Hyperparam（Hyperparam公司）

AI总结提出一个轻量级、JS原生、支持异步SQL和LLM UDF的查询引擎Hyperparam，用于在AI应用中分析非结构化文本，性能优于DuckDB-WASM。

Comments 4 pages, 1 figure, 3 tables

AI中文摘要

当今生产环境中增长最快的数据是非结构化文本：智能体轨迹、聊天日志、推理链、模型输出。人们想要分析这些数据，而有价值的问题（例如“显示智能体在哪里感到困惑”）无法仅通过SQL回答，因为如果没有模型参与查询路径，文本是不可查询的。这种分析自然发生在新一类AI应用中（如Claude Code、Cursor、Claude Desktop、浏览器内智能体），这些应用在客户端运行，并在同一进程中托管人类用户和LLM智能体。这些应用越来越需要处理数据，但数据湖仓的读取路径在JS运行时中难以使用：Spark、Trino和托管数据仓库不适合。为了构建这种新型AI数据应用，引擎的三个属性成为首要考虑：JS原生分发，能够直接嵌入应用已运行的运行时；足够小的包体积，以便在冷标签页或每轮智能体沙箱中分发；以及一种将分析操作符与基于模型的文本解释交错的方法。我们提出Hyperparam，三个总大小低于70 KB的开源JavaScript库（Hyparquet、Squirreling、Icebird），它们直接从对象存储读取Parquet和Apache Iceberg，并通过基于单元格的异步原生SQL执行满足第三个属性，因此昂贵的单元格仅在下游操作符需要时才触发。Squirreling在过滤受限查询上运行LLM形状的异步UDF比DuckDB-WASM快300倍以上（排序受限查询快192倍），并以低三分之二的成本完成十项智能体分析师任务。我们认为数据工程作为一个学科需要更新，以适应现已投入生产的AI原生客户端应用以及与其用户协作的智能体。

英文摘要

The fastest-growing data in production today is unstructured text: agent traces, chat logs, reasoning chains, model outputs. People want to analyze it, and the questions worth asking ("show me where the agent got confused") cannot be answered by SQL alone, since text is not queryable without a model in the query path. The natural place this analysis is happening is the new class of AI applications (Claude Code, Cursor, Claude Desktop, in-browser agents) that run client-side and host both a human user and an LLM agent in the same process. These applications increasingly want to work with data, but the lakehouse read path has been hard to use from a JS runtime: Spark, Trino, and managed warehouses do not fit there. To build this new kind of AI data application, three properties of the engine become first-order: a JS-native distribution that drops into the runtime the application already runs in, a bundle small enough to ship inside a cold tab or per-turn agent sandbox, and a way to interleave analytic operators with model-based interpretation of text. We present Hyperparam, three open-source JavaScript libraries (Hyparquet, Squirreling, Icebird) totaling under 70 KB, that read Parquet and Apache Iceberg directly from object storage and meet the third property with per-cell, async-native SQL execution, so expensive cells fire only when downstream operators demand them. Squirreling runs LLM-shaped async UDFs over 300x faster than DuckDB-WASM on filter-bounded queries (and 192x on sort-bounded queries) and completes a ten-task agent analyst suite at two-thirds lower cost. We argue that data engineering as a discipline needs to update for the AI-native client applications now in production and the agents that work alongside their users.

2605.27784 2026-05-28 cs.AI 版本更新

UniMaia：用语言引导国际象棋策略以实现类人玩法

Sherman Siu, Lesley Istead

发表机构 * University of Waterloo（滑铁卢大学）； Carleton University（卡尔顿大学）

AI总结提出UniMaia框架，通过参数高效文本编码器和ControlNet风格调节机制，在冻结的Lc0国际象棋策略网络上实现提示条件策略调制，实现语义控制（如开局选择和玩家强度）并保持预训练策略表征，同时构建大规模元数据增强的Lichess数据集和半自动提示生成管道，在多个基准上取得最优或竞争性结果。

AI中文摘要

大型语言模型的最新进展使得自然语言能够作为控制复杂系统的灵活接口，但通常以大规模多模态训练或弱化领域特定归纳偏差为代价。在结构化决策领域（如国际象棋）中，专门的策略网络表现强劲但缺乏语义可控性，而提示条件语言模型更灵活但通常领域基础较弱。我们提出$\textbf{UniMaia}$，一个用于提示条件策略调制的框架，它使用参数高效文本编码器和ControlNet风格的调节机制来适配基于Lc0的冻结国际象棋策略网络。UniMaia能够实现对游戏玩法的语义控制，包括开局选择和玩家强度，同时保留预训练的策略表征。我们进一步引入$\textbf{UniMaia-Aux}$，它结合了辅助时间条件化和行为预测目标。为了支持这项工作，我们构建了一个大规模元数据增强的Lichess数据集，开发了一个半自动提示生成管道，并引入了涵盖提示条件和元数据条件设置的基准。UniMaia在多个提示条件基准上实现了最先进的预期准确率，在通用指令遵循任务上达到了竞争性的最佳着法准确率，同时在人类着法预测基准上与专门的元数据条件方法保持竞争力。UniMaia-Aux进一步提高了多个评估设置下的预期准确率和行为建模，在最佳着法准确率上略有折衷。总体而言，我们的结果表明，无需端到端多模态训练即可实现领域特定策略网络的提示条件控制，同时突出了可控性与预测性能之间的权衡。

英文摘要

Recent advances in large language models have enabled natural language to serve as a flexible interface for controlling complex systems, but often at the cost of large-scale multimodal training or weakened domain-specific inductive biases. In structured decision-making domains such as chess, specialized policy networks achieve strong performance but lack semantic controllability, while prompt-conditioned language models are more flexible yet typically exhibit weaker domain grounding. We propose $\textbf{UniMaia}$, a framework for prompt-conditioned policy modulation that adapts a frozen Lc0-based chess policy network using a parameter-efficient text encoder and a ControlNet-style conditioning mechanism. UniMaia enables semantic control over gameplay, including opening selection and player strength, while preserving the pretrained policy representations. We further introduce $\textbf{UniMaia-Aux}$, which incorporates auxiliary temporal conditioning and behavioral prediction objectives. To support this work, we construct a large-scale metadata-augmented Lichess dataset, develop a semi-automated prompt-generation pipeline, and introduce benchmarks spanning both prompt-conditioned and metadata-conditioned settings. UniMaia achieves state-of-the-art expected accuracy on several prompt-conditioned benchmarks and competitive top-move accuracy on general instruction-following tasks, while remaining competitive with dedicated metadata-conditioned approaches on human move prediction benchmarks. UniMaia-Aux further improves expected accuracy and behavioral modeling across several evaluation settings, with modest trade-offs in top-move accuracy. Overall, our results demonstrate that prompt-conditioned control of domain-specific policy networks is feasible without end-to-end multimodal training, while highlighting trade-offs between controllability and predictive performance.

2605.27766 2026-05-28 cs.AI 版本更新

Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems

Aman Priyanshu, Supriti Vijay, Esha Pahwa

发表机构 * Foundation AI USA（Foundation AI美国）

AI总结本研究通过多智能体模拟平台评估LLM智能体在社交压力下的隐私泄露风险，发现多轮社交交互显著增加隐私泄露，且泄露具有社交传染性，即使有隐私指令也无法完全消除。

AI中文摘要

LLM安全评估主要在隔离环境中测试模型，然而部署的AI智能体越来越多地与其他智能体在持久社交环境中交互。我们引入了一个Moltbook风格的模拟平台，数千个LLM智能体在模拟的一个月内跨社区交互，并用它来评估在不同程度的社交压力下隐私作为下游安全问题的表现。我们发现从单轮到多轮社交评估会放大隐私侵犯（OpenAI模型上CIMemories 19.95%到我们的45.30%），泄露具有社交传染性，观察到同伴泄露后智能体泄露敏感信息的可能性增加8倍，并且明确的隐私指令减少但不能消除这种效应，即使有保护措施，泄露率仍高于37.8%。我们的发现表明，基于静态聊天的安全基准系统性地低估了智能体部署中的风险，而仅社交环境就足以引发单轮评估永远不会发现的敏感泄露。

英文摘要

LLM safety evaluations predominantly test models in isolation, yet deployed AI agents increasingly operate within persistent social environments alongside other agents. We introduce a Moltbook-style simulation platform where thousands of LLM agents interact across communities over a simulated month, and use it to evaluate privacy as a downstream safety concern under varying degrees of social pressure. We find that shifting from single-turn to multi-turn social evaluation amplifies privacy violations (from 19.95% on CIMemories to 45.30% in our setting, across OpenAI models), that leakage is socially contagious, with agents 8 times more likely to disclose sensitive information after observing a peer do so, and that explicit privacy instructions reduce but do not eliminate this effect, leaving leakage rates above 37.8% even with safeguards. Our findings suggest that static chat-based safety benchmarks systematically underestimate risks in agentic deployment, and that social context alone is sufficient to elicit sensitive disclosures that single-turn evaluations would never surface.

2605.27765 2026-05-28 cs.LG cs.AI 版本更新

Restoring the Sweet Spot: Pass-Rate Weighted Self-Distillation for LLM Reasoning

恢复甜蜜点：用于LLM推理的通过率加权自蒸馏

Zehao Liu, Yuanpu Cao, Jinghui Chen, Vasant G. Honavar

发表机构 * College of Information Sciences and Technology（信息科学与技术学院）

AI总结提出SC-SDPO方法，通过问题通过率加权自蒸馏损失，动态调整训练难度，提升LLM推理性能。

Comments 18 pages, 8 figures

AI中文摘要

自蒸馏策略优化（SDPO）通过利用模型自身的反馈条件预测作为自教师，为大型语言模型的强化学习提供密集的令牌级信用分配。然而，与GRPO不同——其群体相对优势自然地将学习集中在一个中等难度问题的甜蜜点上——SDPO的基于KL的优势缺乏隐式的难度感知。我们通过GRPO的优势归一化视角分析这一差距。将可学习性框架扩展到归一化奖励，我们表明归一化吸收了方差项$p(1-p)$，使各问题的前导阶可学习性相等，留下$\sqrt{p(1-p)}$作为每个问题梯度中唯一的残差缩放因子。这一分析产生了一个简单的处方：用$[\hat{p}(1-\hat{p})]^{1/2}$加权每个问题的SDPO损失，得到SC-SDPO，即SDPO的尺度一致变体。所提出的权重作为在线策略rollout与批自适应归一化的零成本副产品获得，诱导出一个隐式课程，动态跟踪模型不断发展的能力。在科学推理和工具使用基准上的实验表明，SC-SDPO持续优于SDPO，在Qwen3-8B上获得+3.2/+4.3（mean@16/maj@16）的提升，在OLMo-3-7B上获得+1.8/+3.0的提升，同时在整个优化过程中保持稳定的训练动态。

英文摘要

Self-Distillation Policy Optimization (SDPO) provides dense token-level credit assignment for reinforcement learning with large language models by leveraging the model's own feedback-conditioned predictions as a self-teacher. Unlike GRPO, however, whose group-relative advantage naturally concentrates learning on a sweet spot of intermediate-difficulty questions, SDPO's KL-based advantage lacks an implicit notion of difficulty awareness. We analyze this gap through the lens of GRPO's advantage normalization. Extending the learnability framework to normalized rewards, we show that normalization absorbs the variance term $p(1-p)$, equalizing leading-order learnability across questions and leaving $\sqrt{p(1-p)}$ as the sole residual scaling factor in the per-question gradient. This analysis yields a simple prescription: weight each question's SDPO loss by $[\hat{p}(1-\hat{p})]^{1/2}$, resulting in SC-SDPO, a scale-consistent variant of SDPO. The proposed weights are obtained as a zero-cost byproduct of on-policy rollouts with batch-adaptive normalization, inducing an implicit curriculum that dynamically tracks the model's evolving competence. Experiments on scientific reasoning and tool-use benchmarks demonstrate that SC-SDPO consistently improves over SDPO, yielding gains of +3.2/+4.3 (mean@16/maj@16) on Qwen3-8B and +1.8/+3.0 on OLMo-3-7B, while preserving stable training dynamics throughout optimization.
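
The weighting rule stated in the abstract, $[\hat{p}(1-\hat{p})]^{1/2}$ per question, can be sketched in a few lines of Python. The sketch assumes binary rollout rewards grouped by question; the function name and the optional rescaling of the weights to mean 1 within a batch are illustrative assumptions and may differ from the normalization the authors use.

```python
import math


def pass_rate_weights(rollout_rewards, normalize=True):
    """Per-question loss weights sqrt(p_hat * (1 - p_hat)).

    rollout_rewards: one list of 0/1 rewards per question, taken from
    the on-policy rollouts that were sampled for that question.
    """
    raw = []
    for rewards in rollout_rewards:
        p_hat = sum(rewards) / len(rewards)  # empirical pass rate
        raw.append(math.sqrt(p_hat * (1.0 - p_hat)))
    if not normalize:
        return raw
    # Assumed batch-level rescaling so that the mean weight is 1.
    mean_w = sum(raw) / len(raw)
    if mean_w == 0.0:
        return raw  # every question is always solved or never solved
    return [w / mean_w for w in raw]


batch = [
    [1, 1, 1, 1],  # always solved: weight 0
    [1, 0, 1, 0],  # intermediate difficulty: largest weight
    [0, 0, 0, 0],  # never solved: weight 0
]
print(pass_rate_weights(batch, normalize=False))
```

The weight peaks at a pass rate of 0.5 and vanishes at 0 and 1, which is how the rule concentrates learning on questions of intermediate difficulty.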

2605.27764 2026-05-28 cs.CV cs.AI 版本更新

Can Segmentation Models Understand the World? Towards Proactive Affordance Reasoning via Visual Chain-of-Thought

分割模型能理解世界吗？通过视觉思维链实现主动可供性推理

Yuchen Guo, Junli Gong, Hongmin Cai, Yiu-ming Cheung, Weifeng Su

发表机构 * Northwestern University（西北大学）； Northeastern University（东北大学）； South China University of Technology（华南理工大学）； Hong Kong Baptist University（香港浸会大学）； Beijing Normal - Hong Kong Baptist University（北京师范大学-香港浸会大学）

AI总结提出SegWorld框架，通过多级视觉思维链在意图级指令下进行主动场景观察和可供性推理，实现从目标到部件的高效分割。

AI中文摘要

最近的分割模型将大语言模型（LLMs）与掩码解码器结合，将复杂的语言表达映射到掩码上，但其指令仍然是目标指涉的：它们描述、约束或暗示待分割的区域。然而，在现实世界的具身交互中，人类指令通常是意图级的，包括期望的结果而不指定实现该结果的区域。为弥合这一差距，我们引入SegWorld，其中模型在确定掩码之前通过多级视觉思维链（CoT）推理场景。在接收任何指令之前，它主动观察场景，描述可见对象并推断它们可能支持的可能事件。给定指令后，它继续思维链：从与意图相关的对象，到满足意图的动作，再到物理交互部位，即支持该动作的对象部分。我们将SegWorld形式化为概率推理，其中主动观察提供语言场景上下文，当指令以意图级别给出时，可改善掩码预测。我们构建了一个意图到部件的基准，用于评估从高层目标出发的可供性承载部件分割。实验表明，SegWorld在目标指涉指令上匹配指令驱动基线，并在意图级指令上显著提升。

英文摘要

Recent segmentation models couple large language models (LLMs) with mask decoders to ground complex language expressions into masks, yet their instructions remain target-referential: they describe, constrain, or imply the region to be segmented. However, in real-world embodied interaction, human instructions are often at the intent-level, which includes the desired outcome without naming the region that enables it. To bridge this gap, we introduce SegWorld, where the model reasons about the scene through a multi-level visual chain-of-thought (CoT) before committing to a mask. Before receiving any instructions, it proactively observes the scene, describing visible objects and inferring plausible events they may support. Given an instruction, it continues the chain: from the object relevant to the intent, through the action that satisfies it, to the physical interaction site, the object part that affords the action. We formalize SegWorld as probabilistic inference, in which proactive observation supplies a linguistic scene context that improves mask prediction when instructions are given at the level of intent. We construct an intent-to-part benchmark for evaluating affordance-bearing part segmentation from high-level goals. Experiments show SegWorld matches instruction-driven baselines on target-referential instructions and improves substantially on intent-level ones.

2605.27760 2026-05-28 cs.AI 版本更新

SkillGrad: Optimizing Agent Skills Like Gradient Descent

SkillGrad: 像梯度下降一样优化智能体技能

Hanyu Wang, Yifan Lan, Bochuan Cao, Lu Lin, Jinghui Chen

发表机构 * College of Information Sciences and Technology（信息科学与技术学院）； The Pennsylvania State University（宾夕法尼亚州立大学）； University Park, PA, USA

AI总结提出SkillGrad框架，将技能包视为结构化参数，通过轨迹级损失、文本梯度诊断和动量记忆覆盖进行类梯度下降优化，在表格问答任务上平均提升6.7个百分点。

AI中文摘要

智能体技能通过将可复用的程序化知识存储在结构化文件中，提供了一种轻量级的方式将LLM智能体适配到专业领域。然而，无论是从第三方下载还是自行生成，这些技能往往不可靠、不完整或过时。现有的技能演化方法通常通过启发式反思来解决这些缺陷，缺乏明确的优化公式。在本文中，我们提出了SkillGrad，一种受梯度下降启发的智能体技能优化框架。SkillGrad将技能包视为结构化参数，以梯度下降的方式进行优化：任务执行提供轨迹级损失证据，自动诊断随后提供指示修正方向的文本梯度。为了稳定跨迭代的优化，动量智能体将重复出现的诊断模式累积到持久记忆覆盖层中。最后，基于LLM的修补器通过对技能包进行层感知编辑来执行参数更新。在SpreadsheetBench Verified和WikiTableQuestions上的评估表明，SkillGrad在两个骨干LLM上始终优于基于训练的技能演化基线，平均比最强的基于训练的基线高出6.7个百分点。消融实验进一步表明，动量和对比诊断都对最终技能质量有贡献。

英文摘要

Agent skills provide a lightweight way to adapt LLM agents to specialized domains by storing reusable procedural knowledge in structured files. However, whether downloaded from third parties or self-generated, these skills are often unreliable, incomplete, or outdated. Existing skill-evolution methods often address these deficiencies through heuristic reflections without an explicit optimization formulation. In this paper, we propose SkillGrad, a gradient-descent-inspired framework for optimizing agent skills. SkillGrad treats the skill package as a structured parameter to optimize in a gradient-descent fashion: task executions provide trajectory-level loss evidence, and automatic diagnoses then provide text-based gradients that indicate the correction directions. To stabilize optimization across iterations, a momentum agent accumulates recurring diagnostic patterns into a persistent memory overlay. Finally, an LLM-based patcher executes the parameter update by applying layer-aware edits to the skill package. Evaluated on SpreadsheetBench Verified and WikiTableQuestions, SkillGrad consistently outperforms training-based skill evolution baselines across two backbone LLMs, improving over the strongest training-based baseline by 6.7 percentage points on average. Ablations further show that momentum and contrastive diagnosis both contribute to the final skill quality.

2605.27758 2026-05-28 cs.LG cs.AI physics.comp-ph 版本更新

High-Fidelity Industrial Crash Dynamics Prediction via Geometry-Aware Operator Learning with Memory-Efficient Low-Rank Attention

基于几何感知算子学习与内存高效低秩注意力的高保真工业碰撞动力学预测

Deepak Akhare, Mohammad Amin Nabian, Corey Adams, Sudeep Chavare, Sanjay Choudhry

发表机构 * Department of Aerospace and Mechanical Engineering, University of Notre Dame（圣母大学航空航天与机械工程系）； NVIDIA； General Motors（通用汽车）

AI总结本文提出GeoTransolver框架，通过几何感知算子学习和内存高效低秩注意力机制，实现工业级碰撞动力学的高保真预测，在复杂梁和整车碰撞数据集上验证了其准确性和效率。

AI中文摘要

汽车碰撞安全性优化仍然是一个安全关键挑战，需要通过迭代的高保真模拟来管理大规模非线性结构变形和能量耗散。虽然传统有限元求解器计算成本高昂，新兴的算子学习框架提供了快速的代理预测；然而，将其应用于工业级碰撞分析（其中复杂几何、接触非线性和快速演变的瞬态变形并存）仍然是一个未解决的挑战。在本文中，我们证明GeoTransolver框架为工业规模下准确、高保真的碰撞动力学预测提供了可行的解决方案。在复杂的保险杠梁和整车碰撞数据集上进行的基准测试表明，GeoTransolver能够捕捉多尺度几何上下文，并准确解析塑性变形模式以及关键乘员位置的加速度曲线。除了架构本身，我们提出并系统评估了一系列时间预测策略，包括一次性、时间条件和自回归滚动策略，证明一次性方法在显著降低训练开销和推理延迟的同时实现了最先进的准确性。作为次要贡献，我们引入了一种基于快速低秩注意力路由引擎（FLARE）的修改，应用于GeoTransolver注意力主干，将内存开销减少约2倍，同时进一步提高O(N)长程、高频瞬态的预测准确性，保留了基础框架的几何感知交叉注意力优势。我们的结果突显了几何感知算子学习在复杂、安全关键的汽车动力学高保真代理建模中的实际可行性。

英文摘要

Automotive crashworthiness optimization remains a safety-critical challenge, requiring the management of large-scale nonlinear structural deformations and energy dissipation through iterative, high-fidelity simulations. While traditional finite element solvers are computationally prohibitive, emerging operator learning frameworks provide rapid surrogate predictions; however, applying them to industrial-scale crash analysis, where complex geometry, contact nonlinearities, and rapidly evolving transient deformation coexist, remains an open challenge. In this paper, we demonstrate that the GeoTransolver framework provides a viable solution for accurate, high-fidelity crash dynamics prediction at industrial scale. Benchmarked on complex bumper beam and full-vehicle crash datasets, GeoTransolver captures multi-scale geometric context and accurately resolves plastic deformation patterns as well as acceleration profiles at critical occupant locations. Beyond the architecture itself, we propose and systematically evaluate a suite of temporal prediction recipes, including one-shot, time-conditional, and autoregressive rollout strategies, demonstrating that the one-shot approach achieves state-of-the-art accuracy with significantly reduced training overhead and inference latency. As a secondary contribution, we introduce a Fast Low-rank Attention Routing Engine (FLARE)-based modification to the GeoTransolver attention backbone that reduces memory overhead by approximately 2x while further improving predictive accuracy for O(N) long-range, high-frequency transients, preserving the geometry-aware cross-attention strengths of the base framework. Our results highlight the practical viability of geometry-aware operator learning for high-fidelity surrogate modeling of complex, safety-critical automotive dynamics.

2605.27750 2026-05-28 cs.CL cs.AI cs.CV cs.DL 版本更新

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

阅读还是猜测？古希腊版本OCR中视觉语言模型的视觉定位失败

Antonia Karamolegkou, Nicolas Angleraud, Benoît Sagot, Thibault Clérice

发表机构 * Inria（法国国家信息与自动化研究所）

AI总结通过对比开放权重视觉语言模型与传统OCR基线在低资源古希腊批判版本上的表现，发现VLM即使错误也能生成流畅文本，表明其依赖语言先验，并引入扰动和标记级定位度量分析视觉证据。

AI中文摘要

最近的研究表明，用于光学字符识别（OCR）的视觉语言模型（VLM）能够生成看似合理但缺乏视觉支持的文本，暗示其依赖语言先验。通过将开放权重VLM与传统OCR基线在低资源古希腊批判版本上进行对比，我们展示了VLM的错误即使在错误时也往往保持流畅，产生合理的希腊语替换，而传统引擎则产生局部识别噪声。为了分析解码过程中的视觉证据，我们引入了受控图像扰动和基于条件与无图像解码分布的标记级定位度量。在字符级扰动下，VLM与扰动的真实文本严重偏离，而传统OCR相对忠实；然而，标记级分析表明先验依赖是模型特定的：在OCR专业模型中，流畅的词汇错误几乎不依赖图像而产生，而通用VLM即使在错误时也仍然依赖于视觉输入。解码时干预未能可靠地恢复定位，而OCR后语言模型校正仅通过生成后修复文本改善了几个系统。我们的结果将先前关于OCR语言先验依赖的证据扩展到低资源历史文档和更广泛的模型集，表明流畅输出不一定具有视觉基础，并推动了超越总体准确性的可解释性驱动评估。

英文摘要

Recent work has shown that Vision-Language Models (VLMs) used for optical character recognition (OCR) can generate plausible but visually unsupported text, suggesting reliance on language priors. Comparing open-weight VLMs with traditional OCR baselines on low-resource Ancient Greek critical editions, we show that VLM errors often remain fluent even when wrong, producing plausible Greek substitutions where traditional engines produce local recognition noise. To analyze visual evidence during decoding, we introduce controlled image perturbations and token-level grounding measures based on conditional versus image-free decoding distributions. Under character-level perturbations, VLMs diverge sharply from the perturbed ground truth while traditional OCR remains comparatively faithful; however, token-level analysis shows that prior reliance is model-specific: in an OCR-specialist model, fluent lexical errors are produced with little reliance on the image, whereas general-purpose VLMs remain conditioned on the visual input even when wrong. Decode-time interventions fail to reliably restore grounding, while post-OCR language-model correction improves several systems only by repairing text after generation. Our results extend prior evidence of OCR language-prior reliance to low-resource historical documents and a broader set of models, showing that fluent output is not necessarily visually grounded and motivating interpretability-driven evaluation beyond aggregate accuracy.

2605.27748 2026-05-28 cs.CV cs.AI cs.LG 版本更新

Mahalanobis PatchCore: Covariance-Aware and Streaming-Compatible Industrial Anomaly Detection

马氏距离 PatchCore：协方差感知与流式兼容的工业异常检测

Niccolò Ferrari, Oligert Osmani, Evelina Lamma

发表机构 * Department of Engineering, University of Ferrara（费拉拉大学工程学院）

AI总结提出马氏距离 PatchCore，通过协方差估计和流式处理改进 PatchCore，在保持性能的同时降低峰值内存并提升工业检测精度。

Comments 57 pages, 7 figures

AI中文摘要

工业视觉异常检测通常是一类问题：正常图像丰富，而缺陷罕见、异质且常在系统设计时不可用。PatchCore 风格的检索适合此场景，因为它通过正常补丁特征的内存库对测试图像评分，但标准欧几里得几何忽略了特征相关性，且其离线构建在子采样前需实例化整个补丁池。我们引入马氏距离 PatchCore，一种协方差感知、流式兼容的 PatchCore 扩展。其人工智能贡献在于一种检索检测器，它在降维特征空间中估计正则化协方差模型并对嵌入进行白化，使得变换后的欧几里得最近邻搜索实现马氏距离检索。一个有界内存、可重复迭代的训练流程通过增量降维、在线协方差估计和流式聚合，无需一次性存储所有正常补丁即可构建内存库。工程应用是自动化工业检测，其中视觉异常检测必须在实际内存限制下保持准确。我们在一个公开的 15 类工业异常检测基准和三个工业数据集（涵盖吹灌封条带安瓿弯月面检测、琥珀色玻璃安瓿底部检测和冻干饼西林瓶检测）上评估该方法。马氏距离 PatchCore 在公开基准上保留了大部分离线 PatchCore 的图像级性能，同时将峰值内存从 5.41 GB 降至 2.78 GB，并将选定的工业平均图像接收者操作特征曲线下面积从 0.981 提升至 0.986。

英文摘要

Industrial visual anomaly detection is usually one-class: normal images are abundant, while defects are rare, heterogeneous, and often unavailable during system design. PatchCore-style retrieval suits this setting because it scores test images from a memory bank of normal patch features, but the standard Euclidean geometry ignores feature correlations and its offline construction materialises the full patch pool before subsampling. We introduce Mahalanobis PatchCore, a covariance-aware, streaming-compatible extension of PatchCore. Its artificial intelligence contribution is a retrieval detector that estimates a regularised covariance model in reduced feature space and whitens embeddings, so Euclidean nearest-neighbour search after transformation implements Mahalanobis retrieval. A bounded-memory, re-iterable training pipeline builds the memory bank without storing all normal patches at once, using incremental dimensionality reduction, online covariance estimation, and streaming aggregation. The engineering application is automated industrial inspection, where visual anomaly detection must remain accurate under practical memory limits. We evaluate the method on a public 15-category industrial anomaly-detection benchmark and three industrial datasets covering blow-fill-seal strip-ampoule meniscus inspection, amber-glass-ampoule bottom inspection, and lyophilised-cake vial inspection. Mahalanobis PatchCore preserves most offline PatchCore image-level performance on the public benchmark while reducing peak memory from 5.41 to 2.78 GB, and improves the selected industrial mean image area under the receiver operating characteristic curve from 0.981 to 0.986.
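
The claim that Euclidean nearest-neighbour search after whitening implements Mahalanobis retrieval can be checked with a small pure-Python sketch. It uses a batch covariance estimate with a ridge term as the regulariser and a Cholesky factor for whitening; the function names, the ridge value, and the batch (rather than streaming) estimation are assumptions made for illustration, not the paper's pipeline.

```python
import math


def mean_and_covariance(rows, ridge=1e-3):
    """Sample mean and ridge-regularised covariance of feature rows."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    for i in range(d):
        cov[i][i] += ridge  # keeps the matrix positive definite
    return mu, cov


def cholesky(a):
    """Lower-triangular L with a = L @ L^T."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def whiten(x, mu, low):
    """Solve L z = x - mu by forward substitution, so z = L^{-1}(x - mu)."""
    z = [0.0] * len(x)
    for i in range(len(x)):
        s = sum(low[i][k] * z[k] for k in range(i))
        z[i] = (x[i] - mu[i] - s) / low[i][i]
    return z


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


rows = [[0.0, 0.0], [2.0, 1.0], [4.0, 3.0], [6.0, 2.0]]
mu, cov = mean_and_covariance(rows)
low = cholesky(cov)
x, y = [1.0, 2.0], [5.0, 0.5]
print(squared_euclidean(whiten(x, mu, low), whiten(y, mu, low)))
```

Because the covariance factors as $LL^\top$, the squared Euclidean distance between whitened vectors equals $(x-y)^\top \Sigma^{-1} (x-y)$. An unmodified Euclidean nearest-neighbour index over whitened features therefore ranks candidates by Mahalanobis distance.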

2605.27744 2026-05-28 cs.AI 版本更新

前缀安全贝叶斯信念追踪用于LLM推理可靠性：将校准与排序分离

Zhenghan Song, Yunyi Li, Yulong Liu

发表机构 * Cornell University（康奈尔大学）； Columbia University（哥伦比亚大学）

AI总结提出序列贝叶斯信念追踪（SBBT）框架，基于前缀安全观测在线估计长推理轨迹的最终成功概率，并表明概率估计质量（校准）与排序能力（AUROC）相互分离。

AI中文摘要

长推理轨迹需要在最终答案已知之前进行可靠性估计。我们研究前缀条件的最终成功估计 $P(y=1 \mid o_{1:t})$，使用前缀安全观测。序列贝叶斯信念追踪（SBBT）校准观测似然并递归更新两状态信念，为标量分数、文本和自我验证标记、隐藏聚类、令牌池化探针以及潜在轨迹特征提供通用追踪器。在MATH-500、GSM8K、AIME 2025和RIMO-N上生成的开放权重模型轨迹中，概率估计质量与排序能力相互分离：仅使用分数的SBBT通常改善Brier分数，而AUROC提升需要超出强前缀安全基线的结构感知证据。在最强的高难度数学设置中，结构感知观测相对于标准前缀安全基线的AUROC提升达到+0.110。在相同前缀分类器审计下，MATH-500文本标记和RIMO-N自我验证信号保持正向。这些发现共同支持SBBT作为校准感知的在线推理框架，并揭示了一种证据机制：标量分数主要支持概率估计质量，而结构感知前缀信号仅在强前缀安全基线尚未吸收排序证据时支持排序。

英文摘要

Long reasoning traces need reliability estimates before final answers are known. We study prefix-conditioned eventual-success estimation, $P(y=1 \mid o_{1:t})$, using prefix-safe observations. Sequential Bayesian Belief Tracking (SBBT) calibrates observation likelihoods and recursively updates a two-state belief, providing a common tracker for scalar scores, text and self-verification markers, hidden clusters, token-pooling probes, and latent-trajectory features. Across generated open-weight traces on MATH-500, GSM8K, AIME 2025, and RIMO-N, probability quality and ranking separate: score-only SBBT often improves Brier, while AUROC gains require structure-aware evidence beyond strong prefix-safe baselines. In the strongest hard math setting, structure-aware observations reach +0.110 AUROC against standard prefix-safe baselines. Under a same-prefix classifier audit, MATH-500 text markers and RIMO-N self-verification signals remain positive. Together, these findings support SBBT as a calibration-aware online inference framework and expose an evidence regime: scalar scores mainly support probability quality, while structure-aware prefix signals support ranking only when strong prefix-safe baselines have not already absorbed the rank evidence.
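
The recursive two-state belief update described in the abstract can be sketched as a plain Bayes filter. The sketch assumes a fixed latent outcome (eventual success or failure), conditionally independent observations, and already calibrated likelihoods supplied as pairs $(p(o_t \mid y=1),\ p(o_t \mid y=0))$; the function names and this input format are illustrative assumptions, not the authors' implementation.

```python
def update_belief(belief, lik_success, lik_failure):
    """One Bayes step for P(y=1 | observations so far)."""
    numerator = belief * lik_success
    denominator = numerator + (1.0 - belief) * lik_failure
    if denominator == 0.0:
        return belief  # observation impossible under both states; keep the prior
    return numerator / denominator


def track(prior, likelihood_pairs):
    """Return the belief after each prefix of the observation sequence."""
    beliefs = []
    belief = prior
    for lik_success, lik_failure in likelihood_pairs:
        belief = update_belief(belief, lik_success, lik_failure)
        beliefs.append(belief)
    return beliefs


# Three prefix observations: supportive, uninformative, then adverse.
print(track(0.5, [(0.8, 0.2), (0.5, 0.5), (0.3, 0.6)]))
```

Each belief depends only on observations up to step $t$, so the estimate is available before the final answer is known. An uninformative observation with equal likelihoods leaves the belief unchanged.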

2605.27710 2026-05-28 cs.AI 版本更新

DeepSciVerify: Verifying Scientific Claim--Citation Alignment via LLM-Driven Evidence Escalation

DeepSciVerify: 通过LLM驱动的证据升级验证科学声明与引文对齐

Shaghayegh Sadeghi, Khashayar Khajavi, Rise Adhikari, Alexander Tessier

发表机构 * School of Computing Science, Simon Fraser University（西蒙弗雷泽大学计算科学学院）

AI总结提出DeepSciVerify两阶段流水线，结合摘要推理与选择性升级到段落证据，在SCitance基准上以86.7 Micro-F1超越纯摘要基线4.5点，同时67%实例无需全文检索。

AI中文摘要

声明与其引用证据之间的错位是大语言模型生成报告中的常见失败模式，限制了其在科学及其他高风险场景中的可靠性。我们提出DeepSciVerify，一个用于科学声明-引文验证的两阶段流水线，结合摘要级推理与选择性升级到段落级证据。该系统首先使用摘要验证声明，并对不确定案例进行延迟处理，仅在必要时检索和分析全文段落。该设计利用了LLM之间的互补行为，因为某些模型在不确定性下更为保守，而另一些则更为果断。在SCitance基准上，DeepSciVerify达到了86.7 Micro-F1，比强纯摘要基线高出4.5点，同时67%的实例无需全文检索即可解决。这些结果表明，选择性证据升级提高了声明-引文验证的准确性和效率。

英文摘要

Misalignment between claims and their cited evidence is a common failure mode in reports generated by large language models, limiting their reliability in scientific and other high-stakes settings. We present DeepSciVerify, a two-stage pipeline for scientific claim-citation verification that combines abstract-level reasoning with selective escalation to passage-level evidence. The system first verifies claims using the abstract and defers uncertain cases, retrieving and analyzing full-text passages only when necessary. This design leverages complementary behaviors across LLMs, as some models are more conservative while others are more decisive under uncertainty. On the SCitance benchmark, DeepSciVerify achieves 86.7 Micro-F1, outperforming strong abstract-only baselines by +4.5 points while resolving 67% of instances without full-text retrieval. These results suggest that selective evidence escalation improves both accuracy and efficiency in claim-citation verification.
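
The two-stage control flow described above (verify on the abstract, defer uncertain cases, retrieve passages only when needed) can be sketched as follows. The judge callables, their `(label, confidence)` return shape, the confidence threshold, and all names are hypothetical placeholders introduced for illustration; the abstract does not specify how uncertainty is measured or how escalation is triggered.

```python
def verify_with_escalation(claim, abstract, fetch_passages,
                           abstract_judge, passage_judge, threshold=0.7):
    """Two-stage claim-citation check with selective escalation.

    abstract_judge / passage_judge: hypothetical callables that take
    (claim, evidence_text) and return (label, confidence in [0, 1]).
    fetch_passages: zero-argument callable returning full-text passages;
    it is only invoked when the first stage is not confident enough.
    """
    label, confidence = abstract_judge(claim, abstract)
    if confidence >= threshold:
        return label, "abstract"

    # Escalate: pay the retrieval cost only for uncertain cases.
    passages = fetch_passages()
    label, _ = passage_judge(claim, "\n".join(passages))
    return label, "full_text"


# Stub judges standing in for LLM calls.
def cautious_judge(claim, evidence):
    return ("supported", 0.9) if claim in evidence else ("unsure", 0.4)


def decisive_judge(claim, evidence):
    return ("supported", 0.8) if claim in evidence else ("refuted", 0.8)


print(verify_with_escalation(
    "X improves Y", "We show that X improves Y.",
    lambda: ["unused passage"], cautious_judge, decisive_judge))
```

Passing the retrieval step as a callable keeps full-text access lazy, so instances resolved at the abstract stage never trigger it. Using separate judges for the two stages mirrors the abstract's remark that some models are more conservative and others more decisive under uncertainty.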

2605.27703 2026-05-28 cs.AI 版本更新

Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

面向资源受限智能体语言模型的分层提示域控制与学习

Joan Vendrell Gallart, Russell Bent, Michael Grosskopf

发表机构 * Los Alamos National Laboratory（洛斯阿拉莫斯国家实验室）

AI总结提出分层控制与学习框架，通过蒸馏学习输出模式、在线监控与提示域控制，解决资源受限下智能体语言模型的可靠性问题。

AI中文摘要

大型语言模型越来越多地部署在智能体系统中，它们必须遵循结构化协议，适应不断变化的状态，并在内存、延迟和成本限制下运行。在这种场景下，提示扩展不可靠：增长的上下文可能将紧凑模型推离其有效提示域，而部署时的微调受限于稀缺的数据和计算资源。我们提出了一种分层控制与学习框架，其中紧凑模型首先通过蒸馏学习所需的输出模式，然后由预言机-控制器循环在线监督。控制器监控协议有效性和语义性能，将累积历史投影到可行的提示域中，并在发生漂移时触发轻量级的预言机监督微调。这将用于通信兼容性的模式学习与用于任务级纠正的语义适应分离开来。我们形式化了提示域可行性和注意力引起的饱和，从而激励对有效提示状态的控制，而非依赖名义上下文长度。使用多保真贝叶斯优化作为受控顺序测试平台，我们描述了一个核心部署故障模式，并展示了相对于非分层、仅蒸馏和非蒸馏基线的改进的可靠性和成本效益。

英文摘要

Large Language Models are increasingly deployed inside agentic systems, where they must follow structured protocols, adapt to evolving states, and operate under memory, latency, and cost constraints. In such regimes, prompt extension is unreliable: growing contexts can push compact models outside their effective prompt domain, while deployment-time fine-tuning remains limited by scarce data and compute. We propose a hierarchical control-and-learning framework in which a compact model is first distilled to learn the required output schema, then supervised online by an oracle-controller loop. The controller monitors protocol validity and semantic performance, projects accumulated histories into a feasible prompt domain, and triggers lightweight oracle-supervised fine-tuning under drift. This separates schema learning for communication compatibility from semantic adaptation for task-level correction. We formalize prompt-domain feasibility and attention-induced saturation, motivating control of the effective prompt state rather than reliance on nominal context length. Using Multi-Fidelity Bayesian Optimization as a controlled sequential testbed, we characterize a core deployment failure mode and show improved reliability and cost-efficiency over non-hierarchical, distillation-only, and non-distilled baselines.

2605.27700 2026-05-28 cs.DL cs.AI 版本更新

CiteCheck: Retrieval-Grounded Detection of LLM Citation Hallucinations in Scientific Text

CiteCheck: 基于检索的科学文本中LLM引用幻觉检测

Khashayar Khajavi, Shaghayegh Sadeghi, Rise Adhikari, Alexander Tessier

发表机构 * School of Computing Science, Simon Fraser University（西蒙·弗雷泽大学计算科学学院）

AI总结提出CiteCheck框架，通过从外部学术来源检索候选出版物、使用结构化LLM验证器比较引用与候选信息，并将验证器得分映射为Exact（精确）、Minor（次要）和Major（主要）三个标签，以检测LLM生成的引用幻觉，在物理学引用基准上达到88.7 macro-F1和88.9%准确率。

AI中文摘要

大型语言模型（LLM）越来越多地用于生成科学报告，但它们可能生成看似合理、实则元数据被篡改或指向不存在论文的参考文献。我们引入了CiteCheck，一个用于引用幻觉检测的混合框架，它验证引用是否对应于真实的学术工作以及其元数据是否忠实于该工作。CiteCheck从外部学术来源检索候选出版物，使用结构化LLM验证器将引用与检索到的候选进行比较，并将验证器得分映射为三个标签：Exact（精确）、Minor（次要）和Major（主要）。我们还构建了一个包含982条引用的物理学基准，其中的受控损坏既涵盖细微的元数据漂移，也涵盖完全捏造的参考文献。在保留测试集上，CiteCheck达到了88.7 macro-F1和88.9%的准确率，优于GPT、Claude和Gemini基线，包括网络搜索和少样本变体。这些结果表明，可靠的引用验证受益于结合学术检索、基于结构化LLM的比较和校准的决策规则。

英文摘要

Large language models (LLMs) are increasingly used to generate scientific reports, but they can produce references that appear plausible while containing corrupted metadata or pointing to papers that do not exist. We introduce CiteCheck, a hybrid framework for citation hallucination detection that verifies whether a citation corresponds to a real scholarly work and whether its metadata is faithful to that work. CiteCheck retrieves candidate publications from external scholarly sources, compares the citation against the retrieved candidate using a structured LLM verifier, and maps verifier scores into three labels: Exact, Minor, and Major. We also construct a 982-citation physics benchmark with controlled corruptions that capture both subtle metadata drift and fully fabricated references. On the held-out test set, CiteCheck achieves 88.7 macro-F1 and 88.9% accuracy, outperforming GPT, Claude, and Gemini baselines, including web-search and few-shot variants. These results show that reliable citation verification benefits from combining scholarly retrieval, structured LLM-based comparison, and calibrated decision rules.
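The abstract reports a macro-F1 over three labels obtained by thresholding verifier scores. The sketch below illustrates both steps under stated assumptions: the threshold values and the `score_to_label` helper are hypothetical (the abstract does not give the calibrated decision rule), while `macro_f1` is the standard unweighted mean of per-class F1.

```python
LABELS = ("Exact", "Minor", "Major")


def score_to_label(score, minor_threshold=0.5, exact_threshold=0.9):
    """Map a verifier score in [0, 1] to a label.

    The two thresholds are illustrative placeholders, not values from the paper.
    """
    if score >= exact_threshold:
        return "Exact"
    if score >= minor_threshold:
        return "Minor"
    return "Major"


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members contributes 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    truth = ["Exact", "Exact", "Minor", "Major"]
    scores = [0.95, 0.60, 0.70, 0.10]
    predicted = [score_to_label(s) for s in scores]
    print(predicted, round(macro_f1(truth, predicted), 4))
```

Because macro-F1 weights every class equally, a rare label such as Minor influences the score as much as a frequent one, which is presumably why it is reported alongside plain accuracy.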

2605.27697 2026-05-28 cs.RO cs.AI cs.LG 版本更新

Simulation-Informed Diffusion for Decentralized Multi-robot Motion Planning

仿真引导的扩散方法用于去中心化多机器人运动规划

Jinhao Liang, Sven Koenig, Ferdinando Fioretto

发表机构 * University of Virginia（弗吉尼亚大学）； University of California, Irvine（加州大学欧文分校）

AI总结提出一种基于约束感知扩散模型的去中心化框架SID，通过仿真邻居未来轨迹并利用安全约束规划自身轨迹，在密集场景下实现高效协调。

AI中文摘要

去中心化多机器人运动规划要求每个机器人仅根据局部观测生成无碰撞轨迹，无需全局感知或可靠通信。然而，大多数现有规划器（无论是经典方法还是基于学习的方法）都是从局部观测的静态快照生成轨迹，这限制了它们预测相邻机器人未来行为的能力。随着机器人数量增加和环境变得更加拥挤，这一限制变得至关重要。为了克服这一挑战，本文引入了仿真引导的扩散（SID），这是一种基于约束感知扩散模型（CADM）的去中心化框架。SID首先使用CADM从当前观测状态仿真相邻机器人的未来轨迹，然后利用这些仿真提供的安全约束，使用相同的CADM规划每个机器人自身的轨迹。关键的是，对邻居的精确仿真使得一种最小通信方案成为可能，该方案仅在高度拥挤的场景中必要时触发协调。在多种环境中的实验表明，SID在规划有效性和约束满足方面始终优于基线方法，并且可扩展到108个机器人和160个障碍物的场景。

英文摘要

Decentralized multi-robot motion planning requires each robot to generate collision-free trajectories from local observations, without global sensing or reliable communication. However, most existing planners, whether classical or learning-based, generate trajectories from a static snapshot of the local observation, which limits their ability to anticipate the future behavior of neighboring robots. This limitation is critical as the number of robots increases and the environment becomes more cluttered. To overcome this challenge, this paper introduces Simulation-Informed Diffusion (SID), a decentralized framework built on constraint-aware diffusion models (CADM). SID first uses CADM to simulate the future trajectories of neighboring robots from their currently observed states, and then uses the same CADM to plan each robot's own trajectory under safety constraints informed by these simulations. Crucially, the accurate simulation of neighbors enables a minimal communication scheme that triggers coordination only when necessary in highly congested scenarios. Experiments across diverse environments show that SID consistently outperforms baseline methods in terms of planning effectiveness and constraint satisfaction, and scales to scenarios with 108 robots and 160 obstacles.

2605.27686 2026-05-28 cs.CV cs.AI 版本更新

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

张量记忆：用于长程Transformer的固定大小循环状态

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

发表机构 * Massachusetts Institute of Technology, Cambridge, MA, USA（麻省理工学院）； IBM Research, Cambridge, MA, USA（IBM研究院）； University of Toronto, Toronto, Canada（多伦多大学）

AI总结提出张量记忆模块，通过固定大小的3D循环张量状态增强Transformer，以解耦状态容量与输入长度，并保持空间归纳偏置，适用于长程视频理解。

AI中文摘要

Transformer通过将空间和时间展平为长令牌序列来处理图像和视频。虽然注意力和KV缓存保留了过去的特征，但其内存占用随序列长度增长，并且缺乏显式的、持久的空间状态，这使得长程视频理解和对遮挡敏感的推理变得困难。我们提出张量记忆，一种轻量级模块，用固定大小的循环3D记忆张量增强Transformer块：令牌通过可微的软写入操作写入体素网格，将内容以高斯加权体积的形式沉积在预测的连续3D位置周围；记忆通过高效的局部交互算子和门控循环动态进行更新；令牌再通过连续采样和门控残差融合读回上下文。由于记忆张量大小固定，张量记忆将状态容量与输入长度解耦，同时保持空间归纳偏置。我们在标准语言、图像和视频基准测试以及一个旨在隔离持久状态何时有益的受控玩具诊断套件上评估该模块；它可与标准Transformer训练流程集成，并可附加到现有块或从中移除，而无需其他架构更改。

英文摘要

Transformers process images and videos by flattening space and time into long token sequences. While attention and KV caching preserve past features, their memory grows with sequence length and they lack an explicit, persistent spatial state, making long-horizon video understanding and occlusion-sensitive reasoning difficult. We propose Tensor Memory, a lightweight module that augments Transformer blocks with a fixed-size recurrent 3D memory tensor: tokens write into a voxel grid via a differentiable soft write that deposits content as a Gaussian-weighted volume around a predicted continuous 3D location, the memory is updated with an efficient local interaction operator and gated recurrent dynamics, and tokens read back context via continuous sampling with gated residual fusion. Because the memory tensor has a constant size, Tensor Memory decouples state capacity from input length while preserving a spatial inductive bias. We evaluate the module on standard language, image, and video benchmarks and on a controlled toy diagnostic suite designed to isolate when persistent state is beneficial; it integrates with standard Transformer training pipelines and can be attached to or removed from existing blocks without other architectural changes.
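To make the "soft write" step concrete, here is a minimal pure-Python sketch of depositing a scalar token value into a fixed-size voxel grid with Gaussian weights centred on a continuous 3D location. It is an illustration only: the weight normalization, the `sigma` value, the scalar (rather than vector) content, and the helper names are assumptions, and the local interaction operator, gating, and read path from the abstract are omitted.

```python
import math


def empty_memory(size):
    """Create a size x size x size grid of zeros (the fixed-size state)."""
    return [[[0.0] * size for _ in range(size)] for _ in range(size)]


def soft_write(memory, position, content, sigma=0.75):
    """Deposit `content` around a continuous `position` with Gaussian weights.

    Weights are normalized to sum to 1, so each write adds exactly `content`
    in total. That normalization is an assumption of this sketch.
    """
    size = len(memory)
    weights = {}
    total = 0.0
    for x in range(size):
        for y in range(size):
            for z in range(size):
                d2 = ((x - position[0]) ** 2
                      + (y - position[1]) ** 2
                      + (z - position[2]) ** 2)
                w = math.exp(-d2 / (2.0 * sigma ** 2))
                weights[(x, y, z)] = w
                total += w
    for (x, y, z), w in weights.items():
        memory[x][y][z] += content * w / total
    return memory


if __name__ == "__main__":
    mem = empty_memory(4)
    # Any number of tokens can be written; the grid never grows.
    for step in range(100):
        soft_write(mem, (1.5, 0.1 * (step % 30), 2.0), content=1.0)
    print(len(mem), len(mem[0]), len(mem[0][0]))
```

Every operation here is a smooth function of `position` and `content`, which is what lets gradients flow to the predicted write location. The grid shape is independent of how many tokens are written, which is the decoupling of state capacity from input length that the abstract describes.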

2605.27681 2026-05-28 cs.AI cs.LG 版本更新

Behavioural Analysis of Alignment Faking

对齐伪造的行为分析

Nathaniel Mitrani Hadida, Rhea Karty, David Williams-King, Alan Cooney

发表机构 * University of Cambridge（剑桥大学）； Harvard University（哈佛大学）； ERA ； UK AISI（英国人工智能安全研究所）

AI总结通过可控最小设置研究对齐伪造，发现其驱动因素包括价值观、目标保护和谄媚，且比先前报告更普遍，可从情境线索和模型倾向预测。

Comments preprint

AI中文摘要

对齐伪造（AF）指的是模型为了保持其部署偏好，策略性地遵守训练目标以避免行为修改。理解AF何时以及为何出现很重要，因为模型在区分训练和部署方面越来越擅长。先前的工作发现AF脆弱、对提示敏感且依赖模型，其潜在驱动因素尚不清楚。我们在一个隔离其核心组件的可控最小设置中研究AF，并在比先前报告更广泛的模型中观察到它，包括小规模模型。我们识别出三个可分离的驱动因素——价值观、目标保护和谄媚——并通过有针对性的提示消融和激活引导表明每个因素独立地调节AF行为。我们的结果表明AF比先前报告更普遍，并且其发生可从情境线索和可测量的模型倾向（如基线谄媚和陈述的价值观）预测。这种分解为未来模型中检测和缓解AF提供了具体方向。

英文摘要

Alignment faking (AF) refers to a model strategically complying with a training objective to avoid behavioural modification while preserving its deployment preferences. Understanding when and why AF arises matters as models grow better at distinguishing training from deployment. Prior work finds AF fragile, prompt-sensitive, and model-dependent, leaving its underlying drivers unclear. We study AF in a controlled, minimal setup that isolates its core components, and observe it across a wider range of models than previously reported, including small-scale models. We identify three separable drivers -- values, goal guarding, and sycophancy -- and show via targeted prompt ablations and activation steering that each independently modulates AF behaviour. Our results indicate AF is more widespread than previously reported and that its occurrence is predictable from situational cues and measurable model tendencies such as baseline sycophancy and stated values. The decomposition suggests concrete directions for detecting and mitigating AF in future models.

2605.27674 2026-05-28 cs.CR cs.AI cs.LG 版本更新

Backdoor Attacks on Fault Detection and Localization in Cyber-Physical Systems

针对信息物理系统中故障检测与定位的后门攻击

Abile Jean, Kuniyilh S

发表机构 * GitHub

AI总结本文研究针对现代信息物理系统中基于机器学习的故障检测与定位机制的后门攻击，通过设计触发器并评估攻击成功率，实验表明即使仅投毒10%的数据也能成功实施攻击。

AI中文摘要

信息物理系统（CPS）集成了传感、通信、计算和控制，以支持关键基础设施，包括智能电网、工业自动化和控制系统。在电力公用事业领域，CPS中使用各种控制器来确保系统检测和恢复故障（如电压波动），并在配电系统中进行负载平衡。基于机器学习和深度学习的故障检测与定位框架因其能够实时识别异常和操作故障，近年来在CPS中受到广泛关注。然而，这些智能模型容易受到对抗性机器学习攻击，尤其是后门攻击。在后门攻击中，对手将恶意模式注入训练数据，使得模型在大多数情况下表现正常，但当触发特定模式时产生攻击者控制的输出。本文研究了针对现代CPS系统中最新机器学习管道的故障检测与定位机制的后门攻击威胁。我们定义了这些威胁，并通过设计触发器以及在CPS领域评估其成功率来探索如何实现这些攻击。我们的实验表明，即使仅投毒10%的数据，攻击也能成功。

英文摘要

Cyber-Physical Systems (CPS) integrate sensing, communication, computation, and control to support critical infrastructure, including smart grids, industrial automation, and control systems. In the electrical utility domain, various controllers are used in CPS to ensure the system detects and recovers from faults, such as voltage fluctuations, and to perform load balancing in distribution systems. Machine learning- and deep learning-based fault detection and localization frameworks have recently gained significant attention in CPS for their ability to identify anomalies and operational failures in real time. However, these intelligent models are vulnerable to adversarial machine learning attacks, particularly backdoor attacks. In a backdoor attack, an adversary injects malicious patterns into the training data so that the model behaves normally most of the time but produces attacker-controlled outputs when triggered by specific patterns. This paper investigates the threat of backdoor attacks against fault detection and localization mechanisms in recent ML pipelines used in modern CPS. We define these threats and explore how they can be realized by designing triggers and evaluating their success in the CPS domain. Our experiments show the attack is successful even with a 10% poisoning rate.

2605.27668 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Aligning LLMs with Human Uncertainty: A Beta-Bernoulli Calibrator for LLM Forecasting

将LLM与人类不确定性对齐：用于LLM预测的Beta-Bernoulli校准器

Hui Dai, Ryan Teehan, Parsa Torabian, Mengye Ren

发表机构 * Agentic Learning AI Lab（代理学习AI实验室）； New York University（纽约大学）； The University of Chicago（芝加哥大学）； Chronologies AI

AI总结提出Beta-Bernoulli校准器（BBC），通过结合二元结果和人类预测信号，将初始点估计转换为事件似然分布，实现校准和不确定性量化。

AI中文摘要

概率预测估计不确定未来事件的可能性。为了改进LLM预测，现有方法通常从二元结果中学习以输出语言化预测。然而，尽管聚合的人类预测在群体概率估计和预测者之间的一致程度中都包含丰富信息，如何利用这些信号仍未充分探索。为了解决这个问题，我们提出了Beta-Bernoulli校准器（BBC），它将来自任何模型的初始点估计转换为事件似然分布，使用来自二元结果和人类预测的监督。BBC对事件似然$p \sim \text{Beta}(\alpha, \beta)$和结果$y \sim \text{Bernoulli}(p)$建模，均值作为校准的点预测，方差作为认知不确定性。我们的结果表明，BBC通常比传统的后验校准方法和专门为预测微调的模型提供更好校准和更准确的预测，同时保持轻量级并具有良好的泛化能力。我们还表明，BBC捕获的认知不确定性是比语言化置信度更可靠的预测误差指标。

英文摘要

Probabilistic forecasting estimates the likelihood of uncertain future events. To improve LLM forecasting, existing methods typically learn from binary outcomes to output verbalized forecasts. However, while aggregated human forecasts contain rich information in both the crowd probability estimate and the degree of agreement among forecasters, how to utilize these signals remains underexplored. To address this, we propose the Beta-Bernoulli Calibrator (BBC), which converts an initial point estimate forecast from any model into a distribution over event likelihood, using supervision from both binary outcomes and human forecasts. BBC models event likelihood $p \sim \text{Beta}(\alpha, \beta)$ and outcome $y \sim \text{Bernoulli}(p)$, with the mean as the calibrated point forecast and the variance as the epistemic uncertainty. Our results show that BBC generally provides better calibrated and more accurate forecasts than both traditional post-hoc calibration methods and models fine-tuned specifically for forecasting, while remaining lightweight and having good generalization. We also show that the epistemic uncertainty captured by BBC is a more reliable predictor of forecasting error than verbalized confidence.
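The two quantities the abstract reads off the Beta distribution have closed forms: the mean is α/(α+β) and the variance is αβ/((α+β)²(α+β+1)). The short sketch below computes them; the function name `beta_summary` is illustrative, and how BBC predicts α and β from a model's initial forecast is not shown because the abstract does not describe it.

```python
def beta_summary(alpha, beta):
    """Return (mean, variance) of a Beta(alpha, beta) distribution.

    In the abstract's terms, the mean plays the role of the calibrated point
    forecast and the variance the role of the epistemic uncertainty.
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, variance


if __name__ == "__main__":
    # Same point forecast (0.5), different confidence.
    print(beta_summary(2.0, 2.0))
    print(beta_summary(20.0, 20.0))
```

Two forecasts with the same mean can carry very different variances, which is the extra signal a point estimate alone cannot express: larger α+β corresponds to closer agreement and lower epistemic uncertainty.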

2605.27662 2026-05-28 cs.LG cs.AI 版本更新

How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks

优化器如何塑造等变神经网络中的学习解

Teodor-Mihai Stupariu, Andrei Manolache

发表机构 * University of Stuttgart, Germany（斯图加特大学）； International Max Planck Research School for Intelligent Systems, Germany（国际马克斯·普朗克智能系统研究学校）； Tudor Vianu High School of Computer Science, Romania（图多尔·维亚努计算机科学高中）

AI总结本文通过比较Muon和Adam优化器在点云和分子学习任务中的表现，发现Muon能改善等变神经网络的优化效果，并分析其导致更规则损失曲面和更高有效秩的机制。

Comments Accepted at ICML 2026 Workshop on Weight-Space Symmetries

AI中文摘要

等变神经网络通过构造编码几何对称性，但它们通常难以优化，并且可能表现不如约束较少的架构。越来越多的研究通过架构修改（如约束松弛或近似等变）来解决这一问题，而优化器的作用相对未被充分探索。我们通过比较Muon和Adam在点云和分子学习设置下的多种等变和几何架构来研究这一方向。在对比最清晰的ModelNet40上，Muon在所有考虑的架构上均一致优于Adam。然后，我们通过Hessian估计、损失曲面可视化以及学习权重和中间表示的谱性质来分析训练后的ModelNet40检查点。Muon达到的检查点具有更大的Hessian曲率汇总但更规则的损失曲面，并且其学习权重和表示具有更高的稳定秩和有效秩。这些观察表明，优化器设计与几何归纳偏置之间的相互作用值得社区进一步关注。

英文摘要

Equivariant neural networks encode geometric symmetries by construction, yet they are often difficult to optimize and can underperform less constrained architectures. A growing body of work addresses this through architectural modifications such as constraint relaxation or approximate equivariance, while the role of the optimizer remains comparatively underexplored. We study this direction by comparing Muon and Adam across several equivariant and geometric architectures in point-cloud and molecular learning settings. On ModelNet40, where the comparison is clearest, Muon consistently improves over Adam across all architectures considered. We then analyze the trained ModelNet40 checkpoints through Hessian estimates, loss surface visualizations, and spectral properties of learned weights and intermediate representations. The checkpoints reached by Muon have larger Hessian curvature summaries but more regular loss surfaces, and their learned weights and representations have higher stable and effective ranks. These observations suggest that the interaction between optimizer design and geometric inductive bias deserves further attention from the community.
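The abstract compares checkpoints by stable rank and effective rank without defining them. The sketch below uses two common definitions as an assumption: stable rank as the squared Frobenius norm over the squared spectral norm, and effective rank as the exponential of the Shannon entropy of the normalized singular values. Both take a list of singular values, so obtaining those (for example through an SVD routine) is left outside the example.

```python
import math


def stable_rank(singular_values):
    """Sum of squared singular values divided by the largest squared value."""
    squared = [s * s for s in singular_values]
    return sum(squared) / max(squared)


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


if __name__ == "__main__":
    flat = [1.0, 1.0, 1.0, 1.0]      # energy spread evenly
    peaked = [10.0, 0.1, 0.1, 0.1]   # one dominant direction
    print(stable_rank(flat), effective_rank(flat))
    print(stable_rank(peaked), effective_rank(peaked))
```

Both measures equal the matrix dimension when all singular values are equal and approach 1 when a single direction dominates, so "higher stable and effective ranks" indicates weights whose spectrum is spread over more directions.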

2605.27659 2026-05-28 cs.LG cs.AI 版本更新

Transferable Reinforcement Learning via Probabilistic Latent Embeddings and Dynamic Policy Adaptation for Sim-to-Real Deployment

面向Sim-to-Real部署的可迁移强化学习：基于概率潜在嵌入与动态策略自适应

Gengyue Han, Yiheng Feng

发表机构 * Lyles School of Civil and Construction Engineering, Purdue University, West Lafayette, USA（普渡大学莱尔斯土木与建筑工程学院）； Elmore Family School of Electrical and Computer Engineering, Purdue University, West Lafayette, USA（普渡大学埃尔莫尔家族电气与计算机工程学院）

AI总结提出一种基于概率潜在嵌入和动态策略自适应的强化学习框架，通过元学习推断环境潜在表示并动态调整风险水平，实现安全高效的Sim2Real策略迁移。

AI中文摘要

由于资源有限和公共安全问题，许多信息物理系统（如自动驾驶汽车）的深度强化学习（RL）智能体首先在模拟器中进行训练。然而，当部署到真实世界环境中时，由于不可避免的Sim2Real差距，它们常常遭受性能下降或安全违规。现有的零样本方法，如鲁棒安全RL和域随机化，缓解了这一问题，但通常以性能下降或遇到未建模系统动态时的残余安全风险为代价。为了解决这些限制，我们提出了一种新颖的强化学习框架，通过概率潜在嵌入和动态策略自适应实现安全高效的策略迁移。我们考虑在不同环境上下文下的一族约束马尔可夫决策过程（CMDP）。通过利用元RL中的潜在上下文变量，所提出的框架从模拟经验中推断环境的潜在表示。此外，它结合了分布RL公式，允许根据潜在上下文变量的估计精度动态调整部署策略的风险水平。该策略在早期部署阶段促进安全性，并通过在Sim2Real差距下的快速策略自适应提高效率。

英文摘要

Due to limited resources and public safety concerns, deep reinforcement learning (RL) agents for many cyber-physical systems (e.g., autonomous vehicles) are first trained in simulators. However, when deployed in real-world environments, they often suffer from performance degradation or safety violations because of the inevitable Sim2Real gap. Existing zero-shot approaches, such as robust safe RL and domain randomization, mitigate this issue but typically at the cost of degraded performance or residual safety risks when encountering unmodeled system dynamics. To address these limitations, we propose a novel reinforcement learning framework that enables safe and efficient policy transfer via probabilistic latent embeddings and dynamic policy adaptation. We consider a family of Constrained Markov Decision Processes (CMDPs) under different environment contexts. By leveraging latent context variables from meta-RL, the proposed framework infers the latent representation of the environment from simulated experiences. Furthermore, it incorporates a distributional RL formulation, which allows risk levels of the deployed policy to be adjusted dynamically, based on the estimation accuracy of the latent context variable. This strategy promotes safety at the early deployment stage and improves efficiency through fast policy adaptation under the Sim2Real gap.
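One way to picture "risk levels adjusted dynamically based on the estimation accuracy of the latent context" is a conditional value-at-risk (CVaR) criterion over return quantiles, with the risk level tied to the uncertainty of the context estimate. The sketch below is an assumption-laden illustration, not the paper's method: the choice of CVaR, the linear `risk_level` schedule, and all names and numbers are hypothetical.

```python
import math


def cvar(return_quantiles, alpha):
    """Mean of the worst `alpha` fraction of equally weighted return quantiles."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(return_quantiles)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def risk_level(context_std, max_std=1.0, min_alpha=0.1):
    """Hypothetical schedule: uncertain context -> small alpha (cautious);
    confident context -> alpha near 1 (risk-neutral mean)."""
    ratio = min(max(context_std / max_std, 0.0), 1.0)
    return min(1.0, min_alpha + (1.0 - min_alpha) * (1.0 - ratio))


def choose_action(quantiles_by_action, context_std):
    """Pick the action with the best CVaR at the current risk level."""
    alpha = risk_level(context_std)
    return max(quantiles_by_action,
               key=lambda a: cvar(quantiles_by_action[a], alpha))


if __name__ == "__main__":
    quantiles = {
        "fast": [-10.0, 2.0, 8.0, 12.0],  # higher mean, bad tail
        "slow": [1.0, 2.0, 3.0, 4.0],     # lower mean, safe tail
    }
    print(choose_action(quantiles, context_std=1.0))  # early deployment
    print(choose_action(quantiles, context_std=0.0))  # context well identified
```

Under this schedule the agent prefers the action with the safer tail while the context estimate is uncertain and moves toward the higher-mean action as the estimate sharpens, which mirrors the safety-first, efficiency-later behaviour the abstract describes.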

2605.27656 2026-05-28 cs.IR cs.AI 版本更新

Developing an Intelligent Job Recommendation System Using Semantic Retrieval and Explainable AI Techniques

利用语义检索与可解释AI技术开发智能职位推荐系统

Hussein Al Awad, Khaled Fathi Omar

发表机构 * Master of Web Science, Syrian Virtual University, Damascus, Syria（Web科学硕士，叙利亚虚拟大学，大马士革，叙利亚）

AI总结提出一种结合TF-IDF、Sentence-BERT语义检索、交叉编码器重排序和可解释性生成的元数据驱动职位推荐系统，在LinkedIn数据集上达到高精度和可解释性。

Comments 11 pages, 5 figures, IEEE-style paper on semantic retrieval and explainable AI for intelligent job recommendation

AI中文摘要

在线招聘平台需要能够从大量异构职位发布中检索相关机会的推荐方法。基于关键词的搜索高效且可解释，但当相同职位使用不同术语表达时，可能无法检索到相关发布。本研究提出了一种元数据驱动的职位推荐系统，结合了TF-IDF词汇匹配、Sentence-BERT语义检索、查询感知过滤、可选的交叉编码器重排序和解释生成。该系统利用结构化元数据字段，包括职位名称、公司名称、地点、资历级别、职位职能、雇佣类型和行业，而不依赖完整的职位描述或用户交互历史。在包含31,262条记录的清理后的LinkedIn职位发布数据集上进行的实验表明，最佳混合配置的Precision@10达到0.8032，nDCG@10达到0.9496。在内部评估协议下，交叉编码器重排序将Precision@10从0.7896提高到0.7948，将nDCG@10从0.9666提高到0.9739。这些发现表明，当仅有结构化元数据可用时，词汇和语义检索技术可以有效地结合，以提供可解释的职位推荐。

英文摘要

Online recruitment platforms require recommendation methods capable of retrieving relevant job opportunities from large and heterogeneous collections of job postings. Keyword-based search is efficient and interpretable, but it may fail to retrieve relevant postings when equivalent roles are expressed using different terminology. This study presents a metadata-driven job recommendation system that combines TF-IDF lexical matching, Sentence-BERT semantic retrieval, query-aware filtering, optional Cross-Encoder re-ranking, and explanation generation. The proposed system utilizes structured metadata fields including job title, company name, location, seniority level, job function, employment type, and industry without relying on full job descriptions or user interaction histories. Experiments conducted on a cleaned LinkedIn job posting dataset containing 31,262 records demonstrate that the best hybrid configuration achieved a Precision@10 of 0.8032 and an nDCG@10 of 0.9496. Under the internal evaluation protocol, Cross-Encoder re-ranking improved Precision@10 from 0.7896 to 0.7948 and nDCG@10 from 0.9666 to 0.9739. These findings indicate that lexical and semantic retrieval techniques can be effectively combined to provide explainable job recommendations when only structured metadata is available.
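The two metrics quoted above can be computed from a ranked list of relevance judgments. The sketch below uses the standard definitions with a log2 discount. The function names are illustrative, and computing the ideal ordering from the retrieved list alone is an assumption, since the abstract does not describe the evaluation protocol in that detail.

```python
import math


def precision_at_k(relevances, k=10):
    """Fraction of the top-k results with relevance greater than 0."""
    return sum(1 for r in relevances[:k] if r > 0) / k


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(r / math.log2(rank + 1)
               for rank, r in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the same judgments sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # 1 = relevant posting, 0 = not relevant, in ranked order.
    ranked = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
    print(precision_at_k(ranked), round(ndcg_at_k(ranked), 4))
```

Precision@10 only counts how many of the ten results are relevant, while nDCG@10 also rewards placing them near the top. That is why a system can show an nDCG@10 well above its Precision@10, as in the figures reported here.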

2605.27654 2026-05-28 cs.CL cs.AI cs.CY 版本更新

Cultural Fidelity in English-to-Hindi Translation: A Preservation-Fluency Frontier for Gender Recoverability

英译印地语中的文化保真度：性别可恢复性的保持-流畅性前沿

Samyak Savi, Chavi Gupta, Shreyas Gantayet, Tanay Sodha, Dhruv Kumar

AI总结研究英译印地语中性别信息的保持问题，提出两种推理时干预方法（SAR和PAR），在保持性别可恢复性与流畅性之间取得平衡。

Comments 10 pages, 2 figures, 9 tables

AI中文摘要

生成式翻译系统是文化技术，因为它们决定如何在特定文化的语法系统中呈现具有社会意义的线索。我们研究成功文化翻译的一个具体概念：当英语源文本明确编码性别时，英译印地语应保持该线索的可恢复性，除非源文本本身存在歧义。我们在涵盖十二个类别的37,345个实例基准上评估了这一标准，并显示五个系统经常通过作格和敬语结构消除性别。然后，我们引入了两种机制感知的推理时干预。第一种是源感知重排序器（SAR），倾向于避免性别中立句法的候选。第二种是现象感知重排序器（PAR），即使在作格句法存在的情况下，也通过目标词汇标记保持性别。在GPT-4o-mini和Sarvam上，PAR将目标子集准确率分别从11.07%提高到54.47%，从15.99%提高到49.66%。人工评估显示，PAR将性别保持率从10.3%提高到81.3%，但平均流畅度从4.36降至3.37。这些发现将两种干预置于保持和流畅性的前沿，而不是支持单一的解决方案，并展示了文化定位的生成如何在保真度、流畅性和风格自然性之间需要明确的权衡。

英文摘要

Generative translation systems are cultural technologies because they decide how socially meaningful cues are rendered within culturally specific grammatical systems. We study one concrete notion of successful cultural translation: when an English source explicitly encodes gender, an English-to-Hindi translation should preserve the recoverability of that cue unless the source itself is ambiguous. We evaluate this criterion on a 37,345-instance benchmark spanning twelve categories and show that five systems frequently erase gender through ergative and honorific constructions. We then introduce two mechanism-aware inference-time interventions. The first, the Source-Aware Reranker (SAR), prefers candidates that avoid gender-neutralizing syntax. The second, the Phenomenon-Aware Reranker (PAR), preserves gender through targeted lexical marking even when ergative syntax remains. Across GPT-4o-mini and Sarvam, PAR improves target-subset accuracy from 11.07% to 54.47% and from 15.99% to 49.66%, respectively. Human evaluation shows that PAR increases gender preservation from 10.3% to 81.3%, but reduces mean fluency from 4.36 to 3.37. These findings place the two interventions on a preservation and fluency frontier rather than supporting a single dominant solution, and show how culturally situated generation can require explicit tradeoffs among fidelity, fluency, and stylistic naturalness.

2605.27646 2026-05-28 cs.LG cs.AI 版本更新

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

Hurwitz四元数乘法量化用于KV缓存压缩

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, David Cox, Antonio Torralba

发表机构 * Massachusetts Institute of Technology, Cambridge, MA, USA（麻省理工学院）； IBM Research, Cambridge, MA, USA（IBM研究院）； University of Toronto, Toronto, Canada（多伦多大学）

AI总结提出一种免校准的Hurwitz四元数乘法量化方法，通过将K/V的4元素块视为四元数并用量化乘积编码，在约5比特下匹配fp16困惑度，实现高达5.05倍KV缓存压缩。

AI中文摘要

我们提出Hurwitz四元数乘法量化（HQMQ），一种用于大语言模型KV缓存压缩的免校准方法。HQMQ将K或V的每个4元素块视为一个四元数，并将其单位方向量化为乘积$q_p \cdot q_s$，其中$q_p$取自24元素Hurwitz群$2T$（$S^3$上24-cell的24个顶点，两两夹角$60^\circ$），$q_s$取自每个（层、头）的二级码本，该码本包含$S$个随机单位四元数。乘法组合以$S$个存储参数产生$24S$个有效码字；随机初始化即可，因为左乘是$S^3$上的等距变换，因此不同随机种子的码本在最终任务困惑度上的差异小于$1.5\%$。一个每批次的中位数倍数离群值提取步骤（$C=3$，无需校准）用于处理离群值密集的现代架构。我们在五个现代开源模型上评估：Mistral-7B（密集MHA），Llama-3-8B、Qwen2.5-7B和Qwen3-8B（密集GQA），以及gpt-oss-20b（稀疏MoE）。在Mistral-7B和Qwen3-8B上，HQMQ在约5比特下与fp16的困惑度差距在$0.02$--$0.03$点以内。在Qwen2.5-7B和Qwen3-8B上，朴素int4的困惑度崩溃到$10^4+$，而HQMQ + Med3$\times$在约5比特下恢复fp16质量，差距在$0.02$--$0.10$点以内。在所有五个模型上，HQMQ在相同比特数下以$3$--$1900\times$的幅度帕累托优于朴素int量化，并且在Mistral上以3.79比特达到与fp16相当的下游零样本准确率。与最强的校准KV量化基线相比，3.79比特的HQMQ与KIVI-4（约4.5比特）相当：CoQA上差距约1点，TruthfulQA上0.6点，GSM8K上2.3点，同时比特数减少16%且无需校准过程。在存储层面，HQMQ提供高达5.05倍的KV压缩，将Llama-3-70B的128k上下文缓存从43 GB缩小到8.5 GB。

英文摘要

We propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for KV cache compression of large language models. HQMQ treats each 4-element chunk of K or V as a quaternion and quantizes its unit direction to the product $q_p \cdot q_s$, where $q_p$ ranges over the 24-element Hurwitz group $2T$ (the 24 vertices of the 24-cell on $S^3$, pairwise angle $60^\circ$) and $q_s$ ranges over a per-(layer, head) secondary codebook of $S$ random unit quaternions. The multiplicative composition yields $24S$ effective codewords at $S$ stored parameters; random initialization suffices because left-multiplication is an $S^3$ isometry, so seeded codebooks vary in end-task ppl by $<1.5\%$. A per-batch median-multiplier outlier extraction step ($C{=}3$, no calibration) handles modern outlier-heavy architectures. We evaluate on five modern open models: Mistral-7B (dense MHA), Llama-3-8B and Qwen2.5-7B and Qwen3-8B (dense GQA), and gpt-oss-20b (sparse MoE). On Mistral-7B and Qwen3-8B, HQMQ matches fp16 within $0.02$--$0.03$ ppl points at $\sim$5 bits. On Qwen2.5-7B and Qwen3-8B, where naive int4 collapses to $10^4{+}$ ppl, HQMQ + Med3$\times$ recovers fp16 quality within $0.02$--$0.10$ ppl points at $\sim$5 bits. HQMQ Pareto-dominates naive int by $3$--$1900\times$ at matched bits across all five models, and downstream zero-shot accuracy matches fp16 at $3.79$ bits on Mistral. Against the strongest calibrated KV-quantization baseline, HQMQ at $3.79$ bits matches KIVI-4 ($\sim 4.5$ bits) within ${\sim}1$ pt on CoQA, $0.6$ pts on TruthfulQA, and $2.3$ pts on GSM8K, at $16\%$ fewer bits and without a calibration pass. At the storage level, HQMQ delivers up to $5.05\times$ KV compression, shrinking a Llama-3-70B 128k-context cache from 43 GB to 8.5 GB.
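The product codebook can be illustrated with a small pure-Python sketch. It builds the 24 Hurwitz units (the 8 units ±1, ±i, ±j, ±k plus the 16 half-integer units (±1 ± i ± j ± k)/2), draws a seeded random secondary codebook, and quantizes a 4-element chunk by exhaustive search over all products. The exhaustive search, returning the norm as a plain float, and every function name are assumptions of this sketch. The abstract does not specify the search procedure or how magnitudes are stored, and the outlier extraction step is omitted.

```python
import itertools
import math
import random


def hurwitz_units():
    """The 24 unit Hurwitz quaternions as (w, x, y, z) tuples."""
    units = []
    for axis in range(4):
        for sign in (1.0, -1.0):
            q = [0.0] * 4
            q[axis] = sign
            units.append(tuple(q))
    units.extend(itertools.product((0.5, -0.5), repeat=4))
    return units


def qmul(p, q):
    """Hamilton product of two quaternions."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
            a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
            a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
            a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2)


def random_unit_quaternions(count, seed=0):
    """Seeded random unit quaternions (normalized Gaussian samples)."""
    rng = random.Random(seed)
    out = []
    for _ in range(count):
        v = [rng.gauss(0.0, 1.0) for _ in range(4)]
        n = math.sqrt(sum(c * c for c in v))
        out.append(tuple(c / n for c in v))
    return out


def quantize_direction(chunk, primary, secondary):
    """Return ((i, j), norm): the indices whose product primary[i] * secondary[j]
    is best aligned with the chunk's unit direction, plus the chunk's norm."""
    norm = math.sqrt(sum(v * v for v in chunk))
    unit = tuple(v / norm for v in chunk)

    def alignment(ij):
        code = qmul(primary[ij[0]], secondary[ij[1]])
        return sum(a * b for a, b in zip(unit, code))

    candidates = ((i, j) for i in range(len(primary))
                  for j in range(len(secondary)))
    return max(candidates, key=alignment), norm


if __name__ == "__main__":
    primary = hurwitz_units()
    secondary = random_unit_quaternions(4, seed=0)  # S = 4 stored quaternions
    (i, j), norm = quantize_direction((3.0, 0.0, 0.0, 4.0), primary, secondary)
    print(len(primary) * len(secondary), (i, j), norm)
```

With S stored secondary quaternions the index pair (i, j) addresses 24·S codewords, which is the "multiplicative composition" in the abstract: multiplying the fixed 24-cell by each secondary quaternion rotates it rigidly, so each secondary entry contributes a full, equally well-spread copy of the 24 directions.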

2605.27644 2026-05-28 cs.RO cs.AI cs.LG 版本更新

Trinity: Unifying Class-Agnostic Terrain and Semantic Segmentation for Unstructured Outdoor Environments by Leveraging Synthetic Data

Trinity：通过利用合成数据统一非结构化户外环境中的类无关地形与语义分割

Marcus G Müller, Wout Boerdijk, Maximilian Durner, Riccardo Giubilato, Abel Gawel, Wolfgang Stürzl, Roland Siegwart, Rudolph Triebel

发表机构 * Institute of Robotics and Mechatronics, German Aerospace Center (DLR)（机器人与机电系统研究所，德国航空航天中心（DLR））； Federal Institute of Technology Zurich (ETH Zurich)（苏黎世联邦理工学院（ETH Zurich））； Robotics and AI Institute (RAI)（机器人与人工智能研究所（RAI））

AI总结提出基于Transformer的统一网络Trinity，联合执行类特定语义分割和类无关地形分割，利用合成数据集RUGDSynth和真实数据集EXTerra实现机器人无关的地形先验学习。

AI中文摘要

地形理解对于在非结构化户外环境中运行的移动机器人至关重要。现有的基于视觉的可通行性估计方法依赖于机器人特定的标注或语义类别映射，限制了跨平台的迁移性，并在机器人能力变化时需要昂贵的重新标注，而标准的语义分割方法仅关注特定的预定义类别，无法捕捉地形的多样性。在这项工作中，我们提出了一种基于Transformer的架构，在名为Trinity的统一网络中联合执行类特定语义分割和类无关地形分割。地形区域仅基于视觉外观进行分割，无需预定义的语义标签或机器人相关的可通行性分数。这种建模方式使得学习机器人无关的视觉地形先验成为可能，这些先验可以与机器人特定的经验相结合，用于下游任务，如可通行性估计、视觉里程计和任务规划。为了实现具有多样地形外观的大规模训练，我们扩展了OAISYS模拟器，并引入了RUGDSynth，这是一个受RUGD启发、包含类无关地形样本的合成数据集。此外，我们提出了EXTerra数据集，提供了带有类特定和类无关地形标签的真实世界图像。实验证明了所提出任务的可行性以及我们的联合分割方法在复杂户外环境中的有效性。代码和数据集将随本文发表（审稿结束后）公开。

英文摘要

Terrain understanding is fundamental for mobile robots operating in unstructured outdoor environments. Existing vision-based traversability estimation methods rely on robot-specific annotations or semantic class mappings, limiting transferability across platforms and requiring costly re-annotation when robot capabilities change, while standard semantic segmentation methods only focus on specific predefined classes, which do not capture the variety of terrains. In this work, we propose a transformer-based architecture that jointly performs class-specific semantic segmentation and class-agnostic terrain segmentation within a unified network, called Trinity. Terrain regions are segmented based solely on visual appearance, without predefined semantic labels or robot-dependent traversability scores. This formulation enables the learning of robot-agnostic visual terrain priors that can be combined with robot-specific experience for downstream tasks such as traversability estimation, visual odometry, and mission planning. To enable large-scale training with diverse terrain appearances, we extend the OAISYS simulator and introduce RUGDSynth, a synthetic dataset inspired by RUGD with class-agnostic terrain samples. Furthermore, we present the EXTerra Dataset, providing real-world images annotated with both class-specific and class-agnostic terrain labels. Experiments demonstrate the feasibility of the proposed task and the effectiveness of our joint segmentation approach in complex outdoor environments. Code and datasets will be released with this publication (after review).

2605.27622 2026-05-28 cs.AI cs.SC 版本更新

Reasoning and Planning with Dynamically Changing Norms

动态变化规范的推理与规划

Taylor Olson, Roberto Salas-Damian, Kenneth D. Forbus

发表机构 * University of Iowa（爱荷华大学）； Northwestern University（西北大学）

AI总结本文提出一种在人类-AI环境中使用动态变化规范引导规划的方法，通过可废止演算解决规范冲突并将规范作为规划护栏，理论证明与对话任务实验验证了有效性。

Comments 8 pages, 1 figure, dataset included in anc

2605.27619 2026-05-28 cs.LG cs.AI 版本更新

Supervised Distributional Reduction via Optimal Transport and Dependence Maximization

基于最优传输和依赖性最大化的有监督分布约简

Sai-Aakash Ramesh, Archit Sood, Andrew Corbett, Tim Dodwell

发表机构 * digiLab, UK（digiLab，英国）； University of Bristol, UK（布里斯托大学）

AI总结提出有监督分布约简（SDR）算法，通过结合最优传输和显式依赖性最大化，学习同时保留数据几何结构和目标相关信号的紧凑表示。

AI中文摘要

学习同时捕捉内在数据几何结构和目标相关结构的表示仍然是一个基本挑战，特别是在数据约简必须在压缩与预测保真度之间取得平衡的场景中。虽然分布约简（包括联合聚类和降维）提供了一种原则性的数据总结方法，但其有监督变体仍然相对未被充分探索，尽管保留任务相关信号对于下游预测和决策至关重要。我们提出有监督分布约简（SDR），一种通过结合最优传输和显式依赖性最大化来学习目标感知表示的算法。SDR 基于融合 Gromov-Wasserstein（FGW）目标，将输入分布的关系结构与一组代表点对齐，同时增加一个直接依赖性项，鼓励学习到的嵌入更明确地捕捉预测信号。这产生了反映几何结构和监督的紧凑表示。除了表示学习，SDR 自然地诱导出一种数据依赖的非平稳几何结构，可用于高斯过程（GP）建模等场景。通过目标感知的分布对齐重新定义距离，SDR 能够构建适应数据几何和监督局部变化的自适应核，为非平稳核设计提供了基于最优传输的视角。

英文摘要

Learning representations that capture both intrinsic data geometry and target-relevant structure remains a fundamental challenge, particularly in settings where data reduction must balance compression with predictive fidelity. While distributional reduction, which encompasses joint clustering and dimensionality reduction, offers a principled way to summarize data, its supervised variants remain relatively under-explored, despite the importance of retaining task-relevant signal for downstream prediction and decision-making. We propose Supervised Distributional Reduction (SDR), an algorithm for learning target-aware representations by combining optimal transport with explicit dependence maximization. SDR builds on the Fused Gromov-Wasserstein (FGW) objective to align the relational structure of the input distribution with a set of representative points, while augmenting it with a direct dependence term that encourages the learned embeddings to capture predictive signal more explicitly. This results in compact representations that reflect both geometric structure and supervision. Beyond representation learning, SDR naturally induces a data-dependent, non-stationary geometry that can be leveraged for settings such as Gaussian Process (GP) modelling. By redefining distances through target-aware distributional alignment, SDR enables the construction of adaptive kernels that respond to local variations in both data geometry and supervision, offering an optimal transport-based perspective on non-stationary kernel design.
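For orientation, the general shape of such an objective can be written down using the standard Fused Gromov-Wasserstein form. This is a sketch under assumptions, not the paper's exact formulation: the dependence measure $\mathrm{Dep}$, its weight $\lambda$, and the way the targets $Y$ enter are placeholders, since the abstract does not specify them. Here $T$ is a coupling between the $n$ input points $x_i$ and the $m$ representative points $z_j$, $C^{X}$ and $C^{Z}$ are their pairwise relational (similarity or distance) matrices, and $\alpha \in [0,1]$ trades the feature cost against the structure cost.

```latex
\min_{T \in \Pi(\mu,\nu),\; Z}\;
  (1-\alpha) \sum_{i,j} d(x_i, z_j)\, T_{ij}
  \;+\; \alpha \sum_{i,j,k,l} \bigl| C^{X}_{ik} - C^{Z}_{jl} \bigr|^{2}\, T_{ij}\, T_{kl}
  \;-\; \lambda\, \mathrm{Dep}(Z, Y)
```

The first two terms are the usual FGW trade-off between matching features and matching relational structure; the subtracted third term stands in for the "direct dependence term" that rewards embeddings carrying information about the targets.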

2605.27616 2026-05-28 cs.CV cs.AI 版本更新

Not All NVFP4 QAT Recipes Are Equal: How Architecture and Scale Shape Model Quality for Anomaly Segmentation

并非所有 NVFP4 QAT 配方都相同：架构和规模如何影响异常分割的模型质量

Zijian Du, Oleg Rybakov

发表机构 * NVIDIA

AI总结本研究通过统一协议评估多种架构、规模和 FP4 量化感知训练 (QAT) 配方在脑肿瘤异常分割任务中的交互作用，发现架构选择对量化鲁棒性影响最大，注意力机制架构对配方选择具有显著韧性，而 CNN 在大规模下受梯度量化配方影响性能下降。

Journal ref: CVPR2026

AI中文摘要

实时异常分割要求高召回率和高效的低精度推理。我们研究了模型架构、模型规模和 FP4 量化感知训练 (QAT) 配方在召回关键的脑肿瘤分割任务中的三方交互，在统一协议下评估了多种架构、规模和 QAT 配方。我们发现架构选择对量化鲁棒性影响最大，基于注意力的架构对配方选择表现出显著的韧性，而 CNN 在较大规模下使用梯度量化配方时性能下降。在低容量下，FP4 可能使 softmax 注意力离散化，但高级 QAT 配方可防止这种崩溃。在更大规模下，高级配方减轻了降低 CNN 质量的梯度量化噪声。五折患者级交叉验证证实这些发现对数据划分具有鲁棒性。我们的结果表明，Swin Transformer 在所有规模下对 QAT 配方选择都具有鲁棒性，使其成为 FP4 量化异常分割的推荐架构。

英文摘要

Real-time anomaly segmentation demands both high recall and efficient low-precision inference. We study the three-way interaction of model architecture, model scale, and FP4 quantization-aware training (QAT) recipe on a recall-critical brain tumor segmentation task, evaluating multiple architectures, scales, and QAT recipes under a unified protocol. We find that architecture choice has the largest impact on quantization robustness, with attention-based architectures showing remarkable resilience to recipe choice while CNN degrades under gradient-quantizing recipes at larger scales. At low capacity, FP4 can discretize softmax attention, but advanced QAT recipes prevent this collapse. At larger scales, advanced recipes mitigate gradient quantization noise that degrades CNN quality. Five-fold patient-level cross-validation confirms these findings are robust to data partition. Our results show that the Swin Transformer is robust to QAT recipe choice across all scales, making it the recommended architecture for FP4-quantized anomaly segmentation.
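To see why 4-bit floating point can "discretize" softmax attention, it helps to look at the value grid. The sketch below fake-quantizes values to the FP4 E2M1 magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} with one scale per 16-element block, which is the block layout NVFP4 uses. It is a simplified illustration: the scale is kept as an ordinary float (real NVFP4 stores block scales in FP8 alongside a per-tensor scale), ties go to the smaller magnitude, and the function names are hypothetical. None of the QAT recipes from the paper are reproduced.

```python
import math

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16


def fake_quantize_block(block):
    """Quantize one block to the E2M1 grid and immediately dequantize."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 6.0  # map the largest magnitude onto the top grid value
    out = []
    for v in block:
        target = abs(v) / scale
        # Nearest grid magnitude; ties resolve to the smaller value here.
        mag = min(E2M1_GRID, key=lambda g: abs(target - g))
        out.append(math.copysign(mag * scale, v))
    return out


def fake_quantize(values, block_size=BLOCK_SIZE):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(fake_quantize_block(values[start:start + block_size]))
    return out


if __name__ == "__main__":
    attention_row = [0.50, 0.26, 0.13, 0.07, 0.04]  # softmax-like weights
    print(fake_quantize(attention_row))
```

With only eight magnitudes per block, nearby attention weights snap to the same level and small ones can round to zero, which gives a rough picture of the collapse the abstract attributes to low-capacity models.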

2605.27610 2026-05-28 cs.IR cs.AI cs.HC 版本更新

Eliot: Interactively $\underline{E}$xploring Fast-Changing Scientific $\underline{Li}$terature Trends with $\underline{O}$nline Da$\underline{t}$a and Learning

Eliot: 通过在线数据和学习交互式探索快速变化的科学文献趋势

Bernardo A. Denkvitts, Nitin Gupta, Biplav Srivastava

发表机构 * University of South Carolina（南卡罗来纳大学）

AI总结提出Eliot系统，通过查询时聚类和时间可视化，帮助研究人员可追溯地探索快速变化的科学文献趋势。

Comments Under-review at CIKM Applied Research 2026

AI中文摘要

科学出版的快速增长使得追踪快速变化领域的演变变得越来越困难。搜索引擎和基于LLM的助手检索或总结论文，但往往隐藏了语料库是如何被选择、组织或与时间模式关联的。我们提出了$\texttt{Eliot}$，一个公开部署的交互式系统，用于可追溯地探索不断演变的科学文献。受两项关于大语言模型（LLMs）和自动规划与调度（APS）研究的启发，$\texttt{Eliot}$将文献演变分析推广到超越手工构建的分类法和特定领域脚本。给定明确的查询词和过滤器，它在查询时检索arXiv论文，通过标题和摘要表示每篇论文，将语料库聚类为主题，分配代表性关键词，并可视化每个聚类的出版年份分布。我们将$\texttt{Eliot}$评估为一个应用系统和一个交互式研究辅助工具。跨八个arXiv领域的离线配置研究使用内在聚类和主题连贯性指标比较了文档表示、降维方法和聚类算法；结果支持MiniLM嵌入结合10维UMAP和凝聚聚类作为实用默认设置。一项基于场景的调查和专家焦点小组评估了可解释性和使用情境：参与者在85%的场景响应中认为聚类标签有意义，反馈表明$\texttt{Eliot}$对于快速变化技术领域的可审计概述最有价值。这些结果表明，查询时聚类和时间检查可以通过帮助研究人员检查和提炼文献趋势背后的证据来补充搜索和生成工具。

英文摘要

The rapid growth of scientific publishing has made it increasingly difficult to track how fast-moving areas evolve. Search engines and LLM-based assistants retrieve or summarize papers, but often hide how the corpus was selected, organized, or connected to temporal patterns. We present $\texttt{Eliot}$, a publicly deployed interactive system for traceable exploration of evolving scientific literature. Motivated by two studies on Large Language Models (LLMs) and Automated Planning and Scheduling (APS), $\texttt{Eliot}$ generalizes literature-evolution analysis beyond hand-built taxonomies and domain-specific scripts. Given explicit query terms and filters, it retrieves arXiv papers at query time, represents each paper by title and abstract, clusters the corpus into themes, assigns representative keywords, and visualizes each cluster's publication-year distribution. We evaluate $\texttt{Eliot}$ as both an applied system and an interactive research aid. An offline configuration study across eight arXiv domains compares document representations, dimensionality reduction methods, and clustering algorithms using intrinsic clustering and topic-coherence metrics; the results support MiniLM embeddings with 10-dimensional UMAP and Agglomerative Clustering as a practical default. A scenario-based survey and expert focus group assess interpretability and use contexts: participants rated cluster labels as meaningful in 85% of scenario responses, and feedback indicated that $\texttt{Eliot}$ is most valuable for auditable overviews of rapidly changing technical areas. These results suggest that query-time clustering and temporal inspection can complement search and generation tools by helping researchers inspect and refine the evidence behind literature trends.
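
The last step of the pipeline described above (a publication-year distribution per cluster) can be sketched in a few lines. This is only an illustration, not Eliot's implementation: the function name `year_distribution` and the toy inputs are hypothetical, and the cluster labels are assumed to come from an upstream clustering step that is not shown.

```python
from collections import Counter, defaultdict


def year_distribution(cluster_labels, years):
    """Count papers per publication year within each cluster.

    cluster_labels and years are parallel sequences: one entry per paper.
    Both names are hypothetical; the abstract does not describe Eliot's API.
    """
    if len(cluster_labels) != len(years):
        raise ValueError("cluster_labels and years must have the same length")
    per_cluster = defaultdict(Counter)
    for label, year in zip(cluster_labels, years):
        per_cluster[label][year] += 1
    # Sort years so each histogram reads chronologically.
    return {label: dict(sorted(counts.items()))
            for label, counts in per_cluster.items()}


if __name__ == "__main__":
    # Toy data: four papers assigned to two clusters.
    labels = ["planning", "llm", "planning", "planning"]
    years = [2023, 2024, 2024, 2023]
    print(year_distribution(labels, years))
```

Plotting each histogram over time is then enough to see which themes are growing or fading.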

2605.27605 2026-05-28 cs.AI cs.SE Version update

Laguna M.1/XS.2 Technical Report

Julien Abadji, Marah Abdin, Connor Adams, Eric Alcaide, Mustafa Altun, Michele Artoni, Junze Bao, Uday Barar, Vassilis Bekiaris, Arkadii Bessonov, Benjamin Bütikofer, Jonathan Chang, Yen-Chun Chen, Dmitry Chernenkov, Yang Chi, Filippos Christianos, Fenia Christopoulou, Razvan-Andrei Ciocoiu, Tzachi Cohen, Yohann Coppel, Dmitrii Emelianenko, Brandon Fergerson, Brian Fitzgerald, Matthias Gallé, Alex Golonzovskyi, George Grigorev, Yiyang Hao, Christian Hensel, Jan Huenermann, Ye Ji, Sarthak Joshi, Eiso Kant, Kabir Khandpur, Seonghyeon Kim, Vladimir Kirichenko, Umut Kocasarac, Ilya Kochik, Ivan Komarov, Chaerin Kong, Anurag Koul, François-Joseph Lacroix, Sergei Laktionov, Waren Long, Quentin Malartic, Vadim Markovtsev, Afonso Marques, Robert McHardy, Carlos Mocholí, Dmitry Monakhov, Adam Morris, Martin Muller, Christian Mürtz, Robin Nabel, Thien Nguyen, Rok Novosel, Szymon Ozog, Aalhad Patankar, Aleksei Petrov, Alexandre Piché, Arthur Pignet, Teodor Poncu, Phil Potter, Alexander Rakowski, Pierre-Yves Ritschard, Jay Roberts, Joe Rowell, Piotr Sarna, Pierre-André Savalle, Uladzislau Sazanovich, Nikita Shapovalov, Arsenii Shevchenko, Mikhail Shilkov, Andrei Sokol, Mohamed Soliman, Jack Stephenson, Victor Storchan, Dragos-Constantin Tantaru, Artem Tyurin, Adrian Wälchli, Pengming Wang, Jianxiao Yang, Renat Zayashnikov, Alexander Zelenka Martin, Nikolay Zinov, Caroline Bercier, José Caldeira, Margarida Garcia, Tom George, Kabeer Gharzai, Glenn Hitchcock, Carson Klingenberg, Ivo Pinto, Varun Randery, Noah Smith, Arina Sugako, Jason Warner

Affiliations * Poolside Team

AI summary: Introduces Laguna M.1 and XS.2, two Mixture-of-Experts foundation models for long-horizon agentic coding, trained end-to-end inside a "Model Factory" system and reaching an advanced level on software engineering benchmarks.

Comments Technical report to models released here: https://poolside.ai/blog/introducing-laguna-xs2-m1

Abstract

We present Laguna M.1 and Laguna XS.2, two Mixture-of-Experts foundation models built for long-horizon, agentic coding: M.1 has $225.8$B total parameters ($23.4$B activated per token) and XS.2 has $33.4$B total ($3$B activated). Both models were trained from scratch end-to-end inside the same internal system that we refer to as our Model Factory: a tightly-integrated stack of versioned data, training, evaluation, and inference components that turn model development into an industrial process. We describe the principles and design choices of the Model Factory and also detail the end-to-end training process of our models, covering pre-training data and architecture, post-training stages, evaluation, and quantization. On agentic software engineering and terminal benchmarks (SWE-bench Verified, SWE-bench Multilingual, SWE-Bench Pro, and Terminal-Bench 2.0), M.1 and XS.2 are competitive with state-of-the-art open models in their respective weight classes. Laguna XS.2 weights are released under Apache 2.0 at https://huggingface.co/collections/poolside/laguna-xs2.

2605.27595 2026-05-28 cs.CV cs.AI Version update

Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks

Partho Ghose, Al Bashir, Prem Raj, Azlan Zahid

Affiliations * Texas A&M University System

AI summary: This study systematically evaluates the hallucination behavior of multimodal large language models on agricultural image interpretation (image-to-text) and generation (text-to-image) tasks, finds error patterns including biological inconsistency, contextual inaccuracy, and agronomic implausibility, and analyzes the residual hallucination that remains under methods such as few-shot prompting.

Abstract

Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic field image generation. However, these models frequently exhibit hallucinations: outputs that appear confident yet deviate from biological or environmental reality, potentially leading to misinformed agronomic insights. This study investigates such hallucinations in two complementary directions: image-to-text, where LLMs interpret crop or field imagery to describe conditions such as biotic and abiotic stresses, and text-to-image, where models generate synthetic agricultural scenes based on descriptive prompts. We examine errors involving biological inconsistency, contextual inaccuracy, and agronomic implausibility, evaluating the outputs under domain-informed criteria across multiple imaging modalities. Our analysis identifies recurring hallucination patterns within both interpretive and generative tasks. In image interpretation, LLMs (e.g., Gemma, LLAVA, Qwen, and MiniCPM) achieved modest zero-shot accuracy (63 to 75 percent), whereas few-shot prompting improved performance up to 86.8 percent but still exhibited false detections and missed infections, indicating residual hallucination effects. In text-to-image tasks, advanced models such as GPT-5 and Gemini 2.5 Flash generate up to 91 percent biologically inconsistent scenes under relaxed prompt constraints, revealing fundamental weaknesses in current LLMs. This systematic assessment of visual reasoning and generation offers critical insights toward enhancing the reliability and trustworthiness of LLM-based agricultural imaging platforms.

2605.27593 2026-05-28 cs.AI cs.MA Version update

Voluntary Collusion with Secret Tools in Competing LLM Agents

Xijie Zeng, Frank Rudzicz

Affiliations * Dalhousie University; Vector Institute for Artificial Intelligence

AI summary: Using two multi-agent environments (Liar's Bar and Cleanup), this study finds that most LLM agents voluntarily adopt secret collusion tools to gain a strategic advantage even when the tools are explicitly labeled unfair and harmful, and that alignment or fairness labels alone do not reliably prevent this; explicit safeguards are needed.

Abstract

Even when a tool is explicitly described as unfair and harmful to others, ostensibly safety-aligned LLM agents still voluntarily engage in secret collusion whenever doing so confers a strategic advantage. To investigate this phenomenon, we introduce an empirical framework built on two strategic multi-agent environments: Liar's Bar, a competitive deception scenario, and Cleanup, a mixed-motive resource-management scenario, in which agents are offered secret collusion tools that provide significant advantages while clearly disadvantaging the other agents. Across 12 models (at the 7B, 70B, and proprietary scales) and 6 prompt variants, we find that most agents consistently accept these tools and develop collusive strategies, while explicitly acknowledging the unfairness of the tools before accepting. We further show that neither the unfairness labels nor baseline alignment alone reliably deters collusion: only explicit ethical framing reduces adoption and, even then, smaller models remain susceptible. More broadly, our work presents the first systematic investigation of voluntary collusion adoption in LLM-based multi-agent systems, and suggests that preventing such behaviour requires explicit safeguards rather than reliance on general alignment.

2605.27584 2026-05-28 cs.AI cs.SI Version update

Cyberbullying Governance on Social Media: A Unified Framework from Content Identification to Intervention

Yiting Huang, Wenting Zhu, Zekun Wang, Qingpo Yang, Yakai Chen, Zihui Xu, Yueyue Zhang, Sanchuan Guo, Xi Zhang

Affiliations * School of Cyberspace Security, Beijing University of Posts and Telecommunications

AI summary: Proposes a unified full-lifecycle governance framework spanning four stages (content identification, user behavior modeling, diffusion dynamics and early warning, and intervention and governance) to overcome the limits of passive, isolated cyberbullying detection and enable proactive, continuous, integrated governance.

Abstract

The proliferation of social media platforms and online communities has inadvertently catalyzed the spread of cyberbullying, hate speech, and other forms of online toxicity, making the effective governance of such harm a critical societal and computational challenge. While significant strides have been made in automating content moderation, existing research predominantly treats cyberbullying governance as passive, isolated detection at the post level. This reductionist view overlooks the continuous behavioral dynamics of users, the structural diffusion of toxic events, and the critical need for proactive mitigation. To bridge these gaps, this paper proposes a unified full-lifecycle governance framework that shifts the paradigm of cyberbullying governance from isolated static detection toward integrated, continuous, and proactive moderation. Drawing on cyberbullying research and adjacent fields, we systematically synthesize the state-of-the-art literature across four interconnected stages: (1) Content Identification, (2) User and Behavior Modeling, (3) Diffusion Dynamics and Early Warning, and (4) Intervention and Governance. Furthermore, we review available datasets and evaluation practices, and discuss emerging challenges including multimodality, explainability, algorithmic fairness, and the dual-use risks of generative AI, providing a roadmap for future research toward a safer and more resilient digital ecosystem.

2605.27571 2026-05-28 cs.AI cs.CL cs.DB Version update

Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

Gaetano Rossiello, Dharmashankar Subramanian

Affiliations * IBM

AI summary: Proposes a multi-agent architecture that discovers insights autonomously over real-time data streams through a continuous discovery loop (hypothesis generation, compilation, validation, visualization), supporting a shift from query-driven analytics to proactive discovery.

Comments Accepted at Supporting Our AI Overlords (SAO) at the ACM Conference on AI and Agentic Systems (CAIS), May 26 2026, San Jose, CA, USA

Abstract

Modern analytics systems are fundamentally reactive, requiring users to define queries over increasingly complex and continuously evolving data. In real-time streaming environments, this paradigm breaks down, as the space of potential insights becomes too large to enumerate manually. We present a multi-agent architecture for autonomous insight discovery over real-time data streams. The system implements a continuous discovery loop in which agents generate hypotheses, compile them into executable analytics, validate generated artifacts, and produce visualizations and deployable applications. The architecture leverages Apache Kafka for event-driven coordination, Apache Flink for stream processing, and large language models to implement specialized agents. A key contribution is a contract-driven design based on typed intermediate artifacts, enabling modularity, observability, lineage, and safer execution of dynamically generated analytics. Through use cases in retail, finance, and public data, we show how this architecture supports a shift from query-driven analytics to proactive, discovery-driven systems.

2605.27570 2026-05-28 cs.AI Version update

LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

Gabriele Cesa, Thomas Hehn, Aleix Torres-Camps, Àlex Batlle Casellas, Jordi Ros-Giralt, Arash Behboodi, Tribhuvanesh Orekondy

Affiliations * Qualcomm AI Research

AI summary: Proposes LaneRoPE, which uses an inter-sequence attention mask and an extended RoPE positional encoding so that multiple sequences collaborate during generation, improving accuracy on mathematical reasoning tasks under limited generation length.

Abstract

Parallel LLM test-time scaling techniques (e.g., best-of-$N$) require drawing $N>1$ sequences conditioned on the same input prompt. These methods boost accuracy while exploiting the computational efficiency of batching $N$ generations. However, each sequence in the batch is traditionally generated independently and hence does not reuse intermediate generations, computations, or observations from other sequences. In this paper, we propose LaneRoPE to enable coordination and collaboration among $N>1$ sequences at generation time. LaneRoPE involves two key ideas: (a) an inter-sequence attention mask to make sampling of sequences dependent on one another; and (b) a RoPE extension that injects positional information that captures relative positions between tokens, both within and outside a particular sequence. We evaluate our approach on mathematical reasoning tasks and find promising results: LaneRoPE enables collaboration among sequences, yielding additional accuracy gains under limited generated sequence length. Importantly, since LaneRoPE enables coordination with minimal changes to the underlying LLM architecture and introduces a negligible overhead at inference time, it is appealing to rapidly incorporate parallel reasoning into existing LLM inference pipelines.
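
To make idea (a) concrete, here is a rough sketch of one possible inter-sequence attention mask. The abstract does not specify the mask, so every rule below is an assumption made for illustration: generated tokens see the whole prompt, attend causally within their own lane, and see other lanes only at strictly earlier decoding steps. The function name `lane_mask` and the step-major token layout are hypothetical, and the RoPE extension from idea (b) is not modeled.

```python
def lane_mask(prompt_len, n_lanes, steps):
    """Build a boolean attention mask for n_lanes sequences sharing one prompt.

    Returns (positions, mask). positions[i] is (lane, step), with lane None
    for prompt tokens. mask[q][k] is True if query token q may attend to key k.
    All visibility rules here are assumptions, not the paper's definition.
    """
    positions = [(None, t) for t in range(prompt_len)]
    # Step-major layout: at each decoding step, one new token per lane.
    positions += [(lane, s) for s in range(steps) for lane in range(n_lanes)]

    def allowed(query, key):
        q_lane, q_step = query
        k_lane, k_step = key
        if k_lane is None:
            # Prompt keys: causal inside the prompt, fully visible to lanes.
            return q_lane is not None or k_step <= q_step
        if q_lane is None:
            return False  # prompt tokens never see generated tokens
        if k_lane == q_lane:
            return k_step <= q_step  # ordinary causal attention within a lane
        return k_step < q_step  # other lanes: strictly earlier steps only

    mask = [[allowed(q, k) for k in positions] for q in positions]
    return positions, mask


if __name__ == "__main__":
    positions, mask = lane_mask(prompt_len=2, n_lanes=2, steps=2)
    for pos, row in zip(positions, mask):
        print(pos, "".join("x" if ok else "." for ok in row))
```

The strict inequality for other lanes reflects that tokens sampled in the same decoding step are not yet available to each other; a different design could relax or tighten that rule.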

2605.27567 2026-05-28 cs.AI cs.CL Version update

Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

Amartya Roy, Sonali Parbhoo

Affiliations * SIRE, IIT Delhi; Robert Bosch GmbH; Imperial College London

AI summary: Proves that large language models fail fundamentally at causal discovery and proposes Agentic Causal Bayesian Optimization (A-CBO), an interventional-agent method whose external Bayesian loop converges provably without fine-tuning the model.

Comments 9 pages, 3 figures

Abstract

Causal discovery is a cornerstone of scientific reasoning, yet whether large language models can perform it reliably remains an open question. Recent benchmarks show that even fine-tuned models plateau on simple causal graphs and degrade as complexity grows, but why they fail has not been established. We prove the failure is fundamental: supervised fine-tuning, direct preference optimization, and in-context learning all produce predictors that cannot distinguish between causal graphs generating similar observational data, and any attempt to do so requires the model's internal representations to grow unboundedly, violating the very conditions under which these methods work. We formalize this as a kernel obstruction theorem, establishing that the limitation is intrinsic to the learning paradigm, not any particular model or dataset. We propose Agentic Causal Bayesian Optimization (A-CBO), wherein a frozen language model serves as an interventional oracle answering targeted queries about intervention effects, while an external Bayesian loop concentrates beliefs over candidate graphs in logarithmically many rounds. Because the decision operates outside the space where the obstruction applies, A-CBO provably converges while the underlying model remains unchanged. On Corr2Cause, A-CBO matches fine-tuned baselines without any training. On Extended Corr2Cause, a new benchmark scaling to 24 variables with 18K test samples, A-CBO significantly outperforms both fine-tuning and preference optimization, with the advantage growing

2605.27566 2026-05-28 cs.AI Version update

DynaSchedBench: Calibrated Dynamic Scheduling Benchmarks and Observability Paradox in LLM-based Scheduling Agents

Shijie Cao, Yuan Yuan, Jing Liu

Affiliations * School of Computer Science and Engineering, Beihang University, Beijing 100191, China; Shenzhen Loop Area Institute, Shenzhen, China; Qingdao Research Institute, Beihang University; Hangzhou Innovation Institute, Beihang University; School of Artificial Intelligence, Xidian University, Xi'an 710071, Shaanxi, China; Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, Guangdong, China

AI summary: For the Dynamic Flexible Job Shop Scheduling Problem (DFJSP), proposes the DynaSchedBench diagnostic framework, which uses a Sequential Event-Space Calibrator (SESC) to compute a Schedule Stress Index (SSI) that stratifies instances by difficulty, and reveals an "Observability Paradox" in LLM scheduling agents: full structural information can actually degrade performance.

Abstract

Progress in neural combinatorial optimization for the Dynamic Flexible Job Shop Scheduling Problem (DFJSP) is currently hindered by a methodological tension: static benchmarks encourage benchmark overfitting, while uncalibrated generators obscure algorithmic capability with stochastic noise. To resolve this, we introduce DynaSchedBench, a diagnostic framework for DFJSP that rigorously controls the instance-generation process. Instead of relying on parameter sampling, our approach utilizes a Sequential Event-Space Calibrator (SESC) that computes a novel Schedule Stress Index (SSI) to stratify instances by difficulty. We demonstrate that SESC is substantially more computationally efficient than evolutionary baselines while converging reliably to the target metrics. The framework integrates modular components for instance generation, snapshot-based simulation, agents, evaluation, and visualization, thereby enabling rigorous testing of reactive and lookahead-based policies. Leveraging this calibrated environment, we identify key limitations of LLM-based scheduling agents. Specifically, in step-wise online decision-making for dynamic scheduling, we identify an "Observability Paradox": providing agents with oracle access to full structural information can degrade policy performance, underperforming concise information. Furthermore, despite substantial token overhead, tool-augmented and refinement strategies fail to reliably improve performance, and most LLM agents fail to consistently surpass strong dispatching baselines, behaving more like robust heuristic approximators than superior optimizers.

2605.27564 2026-05-28 cs.CL cs.AI cs.LG Version update

The Future of Facts: Tracing the Factual Generation-Verification Gap

Tim R. Davidson, Anja Surina, Caglar Gulcehre

Affiliations * EPFL

AI summary: Through an analysis across training phases, this paper finds a generation-verification gap in language models' factual knowledge: verification is acquired before generation and is more robust, and factual updates can leave models in a "multi-verse" state.

Comments Code for this project is available at https://github.com/anjasurina/factgap , blog post at https://www.trdavidson.com/fact-gap

Abstract

Language models are becoming the default interface to factual knowledge, yet they often verify outputs more reliably than they generate them. This generation-verification gap (GV-gap) underlies many recent advances in self-improvement and reasoning, but its dynamics on factual knowledge specifically remain poorly understood. We focus on the training mechanisms underlying factual GV-gaps, distinguishing them from their computational and aesthetic counterparts. We trace generation and verification capabilities through three training phases (acquisition, continual learning, and updating) across four open-source model families at two scales each. Three findings recur across models: (i) verification is consistently learned before generation; (ii) verification is more robust to continual learning than generation; and (iii) factual updates can leave models in a "multi-verse" state, simultaneously verifying both old and new answers as correct. Natural experiments on frontier models reproduce these dynamics at scale and reveal residual verification biases on well-covered facts.

2605.27563 2026-05-28 math.PR cs.AI stat.ML Version update

On the Subgaussianity of Quantized Linear Maps: An AI-Assisted Note

Guangyi Zou, Roman Vershynin

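Affiliations * Department of Mathematics, University of California, Irvine

AI summary: With the help of Gemini 3.5 Flash, this note finds a dimension-free subgaussian concentration bound for Gaussian vectors under coordinate-wise nonlinear maps and applies it to answer a question of Simone Bombari about sign-quantized linear maps.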

Comments 4 pages

2605.27561 2026-05-28 cs.CV cs.AI Version update

Clinical Validation of the Melanoscope AI Mobile Dermoscopy Clinical Decision Support System

Elena Sergeevna Kozachok, Sergey Sergeevich Seregin

Affiliations * Ivannikov Institute for System Programming of the Russian Academy of Sciences; Orel Regional Oncology Dispensary

AI summary: Proposes a quantitative interpretability assessment method for cascade deep learning models and a three-zone patient routing algorithm, and conducts a prospective single-centre clinical validation of the Melanoscope AI CDSS in Russian outpatient practice, reporting no false negatives and a specificity of 88.3%.

Comments 24 pages, 6 figures, 5 tables, 21 references

Abstract

Introduction. Early detection of malignant skin lesions is critical for prognosis, yet dermatologist shortages in Russian regions limit screening coverage. Mobile dermoscopy clinical decision support systems (CDSS) offer a promising approach, with model interpretability and standardised patient routing remaining key barriers to adoption. Aim. To develop a quantitative interpretability assessment method for cascade deep learning models and a three-zone patient routing algorithm, and to conduct a preliminary single-centre prospective clinical validation of the Melanoscope AI CDSS in Russian outpatient practice. Material and methods. Two-stage cascade classification of dermoscopic images; attention map visualisation (attention rollout for ViT and Swin; Grad-CAM for ConvNeXt and EfficientNetV2); quantitative IoU-based agreement assessment between activation maps and expert annotations; prospective single-centre validation across four "Melanoma Day" sessions (Orel, Russia, June 2025 - April 2026). Results. On 176 patients: agreement with expert assessment 88.6%; no false negatives among 5 malignant lesions (95% CI: 47.8-100.0%); specificity 88.3%. Three melanomas and two basal cell carcinomas were histologically confirmed; six dysplastic naevi placed under follow-up. Mean IoU (n=180): ViT - 0.69; Swin - 0.64; ConvNeXt - 0.53; EfficientNetV2 - 0.51. Routing thresholds: P<0.15 / 0.15-0.50 / >=0.50. Conclusion. No false negatives were observed; specificity was 88.3%, supporting screening use. The integrated cascade classification, attention map visualisation with IoU assessment, and three-zone routing provide reproducible, interpretable clinical decision support adaptable to varying resource levels.
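
The routing thresholds, the IoU agreement measure, and the reported confidence interval can be illustrated with a short sketch. It is not the Melanoscope implementation. The zone names are hypothetical (the abstract gives thresholds but no labels), masks are simplified to sets of pixel coordinates, and the interval is assumed to be an exact Clopper-Pearson interval, which is consistent with the reported 47.8-100.0% for 5 of 5 detected lesions.

```python
from enum import Enum


class Zone(Enum):
    """Hypothetical zone labels; only the thresholds come from the abstract."""
    LOW = "low"
    INTERMEDIATE = "intermediate"
    HIGH = "high"


def route(p_malignant):
    """Three-zone routing: P < 0.15 / 0.15-0.50 / >= 0.50."""
    if not 0.0 <= p_malignant <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if p_malignant < 0.15:
        return Zone.LOW
    if p_malignant < 0.50:
        return Zone.INTERMEDIATE
    return Zone.HIGH


def iou(mask_a, mask_b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = mask_a | mask_b
    if not union:
        return 0.0  # convention chosen here for two empty masks
    return len(mask_a & mask_b) / len(union)


def lower_bound_all_detected(n, alpha=0.05):
    """Exact (Clopper-Pearson) lower limit when all n positives are detected.

    For x = n successes the lower limit has the closed form (alpha/2)**(1/n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return (alpha / 2) ** (1.0 / n)


if __name__ == "__main__":
    print(route(0.10), route(0.15), route(0.50))
    print(round(100 * lower_bound_all_detected(5), 1))  # 47.8
```

The wide interval is the point of the calculation: with only five malignant lesions, zero false negatives is compatible with a true sensitivity as low as about 48%.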

2605.27559 2026-05-28 cs.MA cs.AI cs.LG Version update

Detection Without Correction: A Two-Parameter Decomposition of Multi-Stage LLM Pipelines

Prashanti Nilayam, Kiran Ramanna, Prashil Tumbade

Affiliations * ServiceNow, CA, USA

AI summary: Proposes a two-parameter decomposition into detection and conditional generation, showing that in multi-stage LLM pipelines the conditional miscorrection rate dominates (53-94%) while the detection rate varies by more than an order of magnitude, and giving a unified account of four phenomena such as accuracy plateaus and reversals.

Abstract

Multi-stage LLM pipelines that perform multi-agent debate, intrinsic self-correction, or retrieval-augmented verification exhibit puzzling aggregate behaviors: accuracy plateaus and reversals across rounds, non-replication of debate gains on contemporary frontier models, intrinsic self-correction degradation, and qualitative cross-provider divergence in debate dynamics. Downstream agent response can be operationalized as two coupled decisions: detection (whether to treat upstream content as authoritative) and conditional generation (what to produce if not). This decomposition yields four observable response regimes, of which detection-without-correction is the load-bearing failure mode. Across a nine-cell empirical grid spanning four model families, four benchmarks (GSM8K, MATH-500, GPQA-Diamond, AIME), and two methods (multi-agent debate, intrinsic self-correction), we find that the conditional miscorrection rate is consistently dominant (53-94% across cohorts) while detection rate varies contextually by more than an order of magnitude. The framework unifies the four phenomena above as signatures of a common mechanism and characterizes detection threshold as a stable model/protocol-level regularity that persists across methods at matched benchmark difficulty.

2605.27551 2026-05-28 cs.AI cs.CR cs.IR cs.MM Version update

On the Origin of Synthetic Information by Means of Steganographic Inheritance

Ching-Chun Chang, Isao Echizen

Affiliations * Information and Society Research Division, National Institute of Informatics

AI summary: To address the difficulty of tracing the provenance of synthetic information, proposes a steganography-based heredity mechanism that embeds traceable lineage traits to identify the parentage of synthetic information; theoretical analysis and experiments support the method's viability.

Abstract

The origin of species has been the mystery of mysteries in natural science. By analogy, the origin of synthetic information, we suggest, is the mystery of mysteries in information science. The question carries a moral weight that a technical account can neither fully resolve nor responsibly ignore, as its impact on truth, trust, and human intellect extends deep into the broader economy and society. The very power of artificial intelligence makes the evolutionary lineage of synthetic information grow ever harder to trace, for a sufficiently capable model may generate offspring that bear little resemblance, at either the structural or signal level, to the parent source from which they were derived. As in genetics, two individuals may share the same phenotype mirroring each other in outward appearance, yet differ fundamentally in their genotype. We propose, by means of steganography, a mechanism analogous to heredity. At the moment an offspring is reproduced, a projector derives a trait from the parent, and a steganographic encoder invisibly hides it within the offspring. This trait persists throughout the offspring's life cycle in a cyber ecosystem. When parentage is queried, a steganographic decoder extracts the trait from the offspring and compares it against the traits of candidate parents in a reference pool, thereby nominating the most likely one. A theoretical analysis characterises phylogenetic accuracy as a function of projector and stegosystem properties, whilst empirical evaluations across multiple projectors and stegosystems demonstrate the viability of the proposed methodology under a broad spectrum of processing operations and semantic modifications. We envision a cyber ecosystem in which synthetic information, endowed with hidden yet traceable lineage traits, branches from a simple beginning into endless forms that have been, and are being, evolved.
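
The parentage query described above (extract the trait, compare it against a reference pool, nominate the closest parent) might look roughly like the sketch below. The representation is an assumption: traits are modeled as equal-length bit strings compared by Hamming distance, whereas the abstract does not state the trait format or the similarity measure. The names `hamming` and `nominate_parent` are hypothetical, and the steganographic encoder and decoder themselves are not modeled.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("traits must have the same length")
    return sum(x != y for x, y in zip(a, b))


def nominate_parent(extracted_trait, reference_pool):
    """Return the candidate parent whose trait is closest to the extracted one.

    reference_pool maps a parent identifier to that parent's trait.
    Ties are broken by insertion order of the pool.
    """
    if not reference_pool:
        raise ValueError("reference pool is empty")
    return min(reference_pool,
               key=lambda pid: hamming(extracted_trait, reference_pool[pid]))


if __name__ == "__main__":
    pool = {"parent_a": "10110010", "parent_b": "01001101"}
    # One bit flipped relative to parent_a, e.g. after lossy processing.
    print(nominate_parent("10110110", pool))  # parent_a
```

Nearest-neighbour matching tolerates a few bit errors introduced by processing, which is why an exact equality check is not used here.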

2605.27494 2026-05-28 cs.CR cs.AI cs.CL cs.IR cs.LG Version update

Debate Helps Weak Judges Reward Stronger Models

Ethan Elasky, Frank Nakasako, Naman Goyal

Affiliations * Palaestra Research; Berkeley

AI summary: Studies proposer-critic debate in a strong-debater/weak-judge setting and finds that debate significantly improves judge performance when the critic's classification ability exceeds the judge's and the judge treats critic speeches as claims to verify; a single independent critique achieves a similar effect at lower cost.

Abstract

Despite theoretical promise, debate as a scalable oversight protocol has produced mixed empirical results: gains in some settings, and null effects in others, especially when the judge does not have information hidden from it. We study proposer-critic debate in a stronger-debater/weaker-judge setting on programmatically verifiable code and logic tasks. Debate helps the judge over a consultancy baseline when the critic provides a usable advantage: the critic's classification ability must exceed the judge's, and the judge must treat critic speeches as claims to verify rather than testimony to summarize. On the three of five pairings where the condition holds, proposer-critic debate's gains are statistically significant over consultancy, and these pairings are the most capable model pairings. On the two non-responder pairings in our set, debate produces null effects, and judge verification rates drop by tens of percentage points once a critic enters the transcript. In these cases the critic's binary-classification ability and the judge's are within noise of each other, and the critic's disagreement is parsed as testimony rather than a claim to check. Ablating rebuttal rounds from debate produces no measurable change in judge performance: a single independent critique recovers the bulk of debate's benefit at lower inference cost. These findings suggest a cheaper primitive for training-free scalable oversight in verifiable domains (answer, critique, judge) and a pre-deployment audit (does the critic beat the judge, and will the judge verify it?) that predicts when debate will help.

2605.27482 2026-05-28 cs.LG cs.AI 版本更新

Energy-Structured Low-Rank Adaptation for Continual Learning

能量结构低秩自适应持续学习

Longhua Li, Lei Qi, Qi Tian, Xin Geng

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing, China（东南大学计算机科学与工程学院，南京，中国）； Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China（新一代人工智能技术及其交叉应用重点实验室（东南大学），教育部，中国）； Huawei Technologies, Shenzhen, China（华为技术有限公司，深圳，中国）

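AI summary: Proposes E²-LoRA, which combines energy-concentrated, energy-ordered low-rank adaptation with a dynamic rank allocation strategy to address task interference and knowledge compression in continual learning, and reports state-of-the-art performance.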

Comments Accepted by ICML 2026

2605.27479 2026-05-28 cs.LG cs.AI 版本更新

Resource-Constrained Affect Modelling via Variance Regularisation Pruning

资源约束下的情感建模：基于方差正则化剪枝

Kosmas Pinitas, Konstantinos Katsifis

发表机构 * Mediterranean College, Athens, Greece（地中海学院，希腊雅典）； University of Derby, Derby, UK（德比大学，英国德比）

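AI summary: Proposes Variance Regularisation (VR) pruning, a framework that accounts for cross-participant stability when pruning and retains competitive CCC performance at 80% sparsity, making it suitable for resource-constrained affect-aware systems.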

Comments This paper has been accepted at the 2026 PErvasive Technologies Related to Assistive Environments (PETRA)

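Abstract (AI-translated)

Affective computing systems are increasingly embedded in pervasive and interactive environments such as adaptive games, assistive technologies, and resource-constrained platforms, where computational efficiency must be balanced against reliability across diverse users. Model pruning offers an effective way to reduce computational demands, but existing methods typically optimize for sparsity alone, without considering how parameter removal affects robustness across individuals. In this work, we introduce Variance Regularisation (VR) pruning, a pruning framework that explicitly incorporates cross-participant stability into the sparsification process. Rather than relying on mean prediction error, VR scores each connection by its joint contribution to predictive accuracy and inter-user variability, prioritizing parameters that remain reliable under distribution differences. We evaluate the proposed method on the AGAIN dataset, which contains arousal annotations collected across nine affect-eliciting game environments. Experimental results show that, even without additional fine-tuning, VR maintains competitive Concordance Correlation Coefficient (CCC) performance at 80% sparsity, highlighting its suitability for real-world, resource-constrained affect-aware systems. Overall, the proposed framework supports the development of compact, robust affect models that operate reliably in realistic interactive settings.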
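
The Concordance Correlation Coefficient (CCC) named in this entry is a standard agreement metric, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). A minimal sketch of that formula follows; the function name `ccc` and the sample values are illustrative assumptions, and nothing here reproduces the paper's evaluation code.

```python
def ccc(y_true, y_pred):
    """Concordance correlation coefficient using population (1/n) moments."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0.0:
        # Both series are constant and equal; the ratio is undefined.
        return float("nan")
    return 2.0 * cov / denom


# Hypothetical arousal traces: a prediction that tracks the shape of the
# target but is offset by a constant is penalised, unlike with Pearson's r.
target = [1.0, 2.0, 3.0]
shifted = [2.0, 3.0, 4.0]
print(ccc(target, target), ccc(target, shifted))
```

Unlike plain correlation, CCC drops when predictions are biased or mis-scaled, which is why it is commonly reported for continuous affect annotation.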

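A Comparative Analysis of Liquid Neural Networks and LSTMs for Sequential Pattern Recognition: Robustness, Efficiency, and Clinical Utility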

Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi

发表机构 * National Electronics and Computer Technology Center (NECTEC)（国家电子与计算机技术中心）； Language Understanding Lab.（语言理解实验室）

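AI summary: Compares Liquid Neural Networks (LNNs) with LSTMs on four kinds of sequential data and finds that LNNs offer better parameter efficiency and robustness, particularly in data-sparse clinical settings.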

Comments 9 pages, 7 figures, 6 tables, The conference paper will appear in Proceedings of JCSSE 2026

AI中文摘要

传统的循环神经网络（RNN）和长短期记忆网络（LSTM）在离散时间步上运行，往往无法捕捉现实世界物理过程的流体时间动态。液态神经网络（LNN），特别是闭式连续时间（CfC）网络，通过将隐藏状态演化建模为连续微分方程来解决这一问题。在本文中，我们在四种不同的序列模态上进行了全面的基准测试研究：神经形态事件数据（N-MNIST）、基于笔画的绘图（QuickDraw）、视觉手写（IAM）和生理时间序列（PhysioNet Sepsis-3）。此外，我们使用时间丢弃法进行了严格的压力测试，以评估模型对缺失数据的鲁棒性。我们的研究结果表明，LNN在原生时间域和数据稀疏普遍的临床环境中，始终提供优越的参数效率和显著更高的鲁棒性。本扩展预印本提供了关于相关数据集和LNN理论谱系的额外背景，并附有详细附录，记录了我们的完整实现和实验设置。

英文摘要

Traditional Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) units operate on discrete time steps, often failing to capture the fluid temporal dynamics of real-world physical processes. Liquid Neural Networks (LNNs), specifically Closed-form Continuous-time (CfC) networks, address this by modeling the hidden state evolution as a continuous differential equation. In this paper, we conduct a comprehensive benchmarking study across four distinct sequential modalities: neuromorphic event-based data (N-MNIST), stroke-based drawing (QuickDraw), visual handwriting (IAM), and physiological time-series (PhysioNet Sepsis-3). Furthermore, we perform a rigorous stress test using temporal dropout to evaluate model robustness against missing data. Our findings reveal that LNNs consistently provide superior parameter efficiency and significantly higher robustness in natively temporal domains and clinical environments where data sparsity is prevalent. This extended preprint provides additional background on related datasets and the LNN theoretical lineage, supplemented with a detailed appendix documenting our full implementation and experimental settings.

2605.27466 2026-05-28 cs.MA cs.AI cs.LG stat.ML 版本更新

AgensFlow: A Coordination-Policy Substrate for Multi-Agent Systems

AgensFlow：多智能体系统的协调策略基础

Nicole Koenigstein

发表机构 * Independent researcher（独立研究者）

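AI summary: Proposes AgensFlow, a framework that treats multi-agent coordination as an online policy-learning problem and optimizes coordination flows through learned routing; evaluations on distributed-systems incident tasks and security-advisory tasks show it outperforms a fixed pipeline baseline.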

Comments 7 pages, 4 figures, 4 tables. Code and reproducible evaluations available at: https://github.com/Nicolepcx/AgensFlow

AI中文摘要

基于大语言模型（LLM）构建的多智能体系统需要许多难以先验固定的协调选择：调用哪个技能协议、哪个智能体角色应执行子任务、每个角色绑定哪个模型、角色之间如何交互、何时使用检索或验证，以及何时完全省略某个步骤。这些选择与任务机制和操作约束相互影响，因此静态管道和一次性模型比较只能提供设计空间的有限视角。本文介绍AgensFlow，一个开源框架，将多智能体协调视为部分可观测下的在线策略学习问题。该框架使协调决策可观测且可从重复轨迹中学习，而不是将技能、角色、模型、拓扑和评估选择视为固定的管道设计。AgensFlow在两个语料库上进行了评估：分布式系统事件任务和安全咨询任务。评估展示了三个主要结果：在协调密集型任务上，学习路由比固定管道基线达到更高质量的操作点；skip:X将拓扑压缩隔离为基础的有意义部分；热启动策略图可以在保持平台质量的同时减少探索成本。总体而言，结果支持学习型可审计路由可以改善静态布线下的协调密集型多智能体工作流。

英文摘要

Multi-agent systems built on large language models (LLMs) require many coordination choices that are difficult to fix a priori: which skill protocol to invoke, which agent role should perform a subtask, which model to bind to each role, how roles should interact, when to use retrieval or verification, and when to omit a step entirely. These choices interact with task regime and operational constraints, so static pipelines and one-off model comparisons provide only a limited view of the design space. This paper introduces AgensFlow, an open-source framework that treats multi-agent coordination as an online policy-learning problem under partial observability. The framework makes coordination decisions observable and learnable from repeated trajectories, rather than treating skill, role, model, topology, and evaluation choices as fixed pipeline design. AgensFlow is evaluated on two corpora: distributed-systems incident tasks and security-advisory tasks. The evaluation shows three main results: learned routing reaches a higher-quality operating point than a fixed pipeline baseline on coordination-heavy classes; skip:X isolates topology compression as a meaningful part of the substrate; and warm-started policy graphs can reduce exploration cost while preserving plateau quality. Overall, the results support that learned, auditable routing can improve coordination-heavy multi-agent workflows over static wiring.

2605.27465 2026-05-28 cs.CV cs.AI 版本更新

AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers

AdaMerge: 面向视觉Transformer无训练加速的显著性感知自适应令牌合并

Semi Lee, Hyejin Go, Hyesong Choi

发表机构 * Electronic Engineering（电子工程）； Soongsil University（崇实大学）

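AI summary: Proposes AdaMerge, a training-free framework that improves the accuracy-compute Pareto frontier of token merging through two complementary mechanisms: salience-weighted similarity and adaptive merging intensity.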

Comments 11 pages, 3 figures, 5 tables. Submitted to NeurIPS 2026

AI中文摘要

视觉Transformer（ViT）中自注意力的二次计算成本构成了实际部署的基本瓶颈，激发了令牌缩减方面的活跃研究。在现有方法中，令牌合并（ToMe）已成为一种优雅的无训练解决方案；然而，其设计基于令牌平等的隐含前提，这与自注意力已充分证明的非均匀性相悖，并在激进压缩下导致高显著性令牌的信息丢失。我们通过AdaMerge解决了这一局限，该框架基于两个互补机制。首先，显著性加权相似度利用列式特征亲和度中心性作为令牌重要性代理，并将所得显著性分数纳入二分匹配分数，确保关键令牌对合并表示贡献更大。其次，自适应合并强度使用预先计算的逐层相似度统计量，根据输入特定的冗余性动态调整每层缩减数量。在ImageNet-1k上使用ViT-B/16，AdaMerge在所有FLOPs匹配条件下均持续优于ToMe、PiToMe和DSM。精度差距随压缩单调增大：在13.4G FLOPs操作点，AdaMerge的Top-1下降仅为-1.06%，而PiToMe为-1.45%，DSM为-4.62%。据我们所知，AdaMerge是首个将显著性加权相似度和自适应逐层缩减结合到单一无训练令牌合并框架中的方法，推动了ViT加速的精度-FLOPs帕累托前沿。

英文摘要

The quadratic cost of self-attention in Vision Transformers (ViTs) constitutes a fundamental bottleneck for practical deployment, motivating a vibrant line of research on token reduction. Among existing approaches, token merging (ToMe) has emerged as an elegant training-free solution; yet its design rests on an unspoken premise of token equality, which contravenes the well-documented non-uniformity of self-attention and leads to information loss in high-salience tokens under aggressive compression. We address this limitation with AdaMerge, a token-merging framework based on two complementary mechanisms. First, salience-weighted similarity leverages column-wise feature-affinity centrality as a token-importance proxy and incorporates the resulting salience scores into the bipartite matching score, ensuring that pivotal tokens contribute more strongly to the merged representation. Second, adaptive merging intensity uses pre-computed layer-wise similarity statistics to dynamically modulate the per-layer reduction count in accordance with input-specific redundancy. On ImageNet-1k with ViT-B/16, AdaMerge consistently outperforms ToMe, PiToMe, and DSM across all FLOPs-matched regimes. The accuracy gap widens monotonically with compression: at the 13.4G FLOPs operating point, AdaMerge sustains a Top-1 degradation of only -1.06%, compared to -1.45% for PiToMe and -4.62% for DSM. To our knowledge, AdaMerge is the first to combine salience-weighted similarity and adaptive per-layer reduction into a single training-free token merging framework, advancing the accuracy-FLOPs Pareto frontier of ViT acceleration.

2605.27464 2026-05-28 cs.CV cs.AI 版本更新

Beyond Motion Primitives: Behavioral Activity Recognition from Head-Mounted IMU

超越运动基元：基于头戴式IMU的行为活动识别

Chung-Ta Huang, Leopold Das, Jeffrey Zhou, Faizaan Siddique, Julia Seungjoo Baek, Serena Liu, Andrew Rusli, Todd Y. Zhou, Freddy Yu, Sinclair Hansen, Ziling Hu, Arnav Sharma, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab, Harvard University（哈佛人工智能与机器人实验室，哈佛大学）

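AI summary: Proposes HiT-HAR, a hierarchical model that uses head-mounted IMU data for behavioral-level activity recognition beyond conventional motion primitives, outperforming existing models on five-class action and eight-class scenario recognition.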

AI中文摘要

AR智能眼镜需要连续的行为上下文来提供主动辅助，但其最实用的常开传感器——头戴式惯性测量单元（IMU）仅能检测行走或站立等运动基元。我们突破运动基元，实现行为级识别，定义了五个类别以平衡AR应用需求与传感器可观测性。为此，我们构建了一个包含16万样本的Ego4D数据集，采用四层质量保证框架覆盖8个活动场景，并提出了HiT-HAR，一个70.3万参数的层次模型，在五类动作和八类场景识别中优于先前的头戴式IMU模型。我们通过每类可分离性分析进一步绘制了头戴式IMU的可观测性边界，识别出哪些行为类别可靠可观测（移动），哪些受益于时间上下文（物体传递、任务操作），以及哪些场景依赖的信号重叠仍构成挑战。我们的结果表明，利用时间上下文和场景结构的架构选择优于单纯扩大模型规模。代码和数据集公开于https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR。

英文摘要

AR smart glasses need continuous behavioral context to offer proactive assistance, yet their most practical always-on sensor, the head-mounted Inertial Measurement Unit (IMU), detects only motion primitives such as walking or standing. We push beyond motion primitives to behavioral-level recognition, defining five categories that balance AR application need with sensor observability. To this end, we construct a 160K-sample Ego4D dataset with a four-tier quality assurance framework spanning 8 activity scenarios, and propose HiT-HAR, a 703K-parameter hierarchical model that outperforms prior head-mounted IMU models on five-class action and eight-class scenario recognition. We further map the observability frontier of head-mounted IMU through per-class separability analysis, identifying which behavioral categories are reliably observable (Locomotion), which benefit from temporal context (Object Transfer, Task Operation), and where scenario-dependent signal overlap poses remaining challenges. Our results indicate that architectural choices exploiting temporal context and scenario structure outperform simply scaling model size. The code and dataset are publicly available at https://github.com/Harvard-AI-and-Robotics-Lab/HiT-HAR.

2605.27463 2026-05-28 stat.ME cs.AI stat.AP 版本更新

When prompt perturbations break your A/B test: A valid statistical test for generative surveying

当提示扰动破坏你的A/B测试：一种用于生成式调查的有效统计检验

Hayden Helm, Carey Priebe

发表机构 * Johns Hopkins University（约翰霍普金斯大学）

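AI summary: Addresses the sensitivity of LLMs to prompt design in generative surveying by proposing a permutation test that remains valid under a statistical model with perturbation structure, and offers budget allocation guidance.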

AI中文摘要

生成式调查——利用基于LLM的角色集合对消息提供反馈——已成为传统市场研究的廉价且可扩展的替代方案。然而，LLM对提示设计中的微小变化很敏感，从生成式调查中得出的结论可能依赖于任意的措辞选择。控制这种敏感性需要在分析中包含语义等价的扰动。在本文中，我们表明，在包含现实扰动结构的生成式调查统计模型下，标准假设检验（包括符号检验和Wilcoxon符号秩检验）是无效的。我们提出了一种在该模型下有效的置换检验，并正式刻画了标准检验失效的条件。将我们的框架应用于一个简单的生成式调查问题，我们估计了相关参数，刻画了置换检验在现实条件下的功效，并提供了关于在角色、扰动和重复之间分配预算的实用指导。最后，我们表明，即使在同一个模型家族内，估计效应的大小和方向都对模型选择敏感。

英文摘要

Generative surveying -- where collections of LLM-based personas provide feedback on messages -- has emerged as a cheap and scalable alternative to traditional market research. However, LLMs are sensitive to small variations in prompt design and conclusions drawn from generative surveys may depend on arbitrary phrasing choices. Controlling for this sensitivity requires including semantically equivalent perturbations in the analysis. In this paper, we show that standard hypothesis tests, including the sign test and Wilcoxon signed-rank test, are invalid under a statistical model for generative surveying that includes realistic perturbation structure. We propose a permutation test that is valid under this model and formally characterize the conditions under which standard tests fail. Applying our framework to a simple generative surveying problem, we estimate relevant parameters, characterize the power of the permutation test under realistic conditions, and provide practical guidance on budget allocation across personas, perturbations, and replicates. Finally, we show that both the magnitude and direction of the estimated effect are sensitive to the choice of model, even within the same model family.

2605.27449 2026-05-28 cs.IR cs.AI 版本更新

Checking Fact with Better Retrieval: Dynamic Contrastive Learning for Evidence Retrieval

用更好的检索核查事实：用于证据检索的动态对比学习

Zhongtian Hua, Yi Luo, Meijia Yu, Yingjie Han

发表机构 * Zhengzhou University（郑州大学）； Henan University of Science and Technology（河南科技大学）

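AI summary: Proposes DACLR, a dynamic adaptive contrastive learning method that improves the accuracy of multimodal evidence retrieval through event-level feature extraction, two-stage retrieval, and dynamic contrastive-loss optimization.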

AI中文摘要

在多模态事实核查领域，从不同模态检索证据的准确性对下游声明验证过程有显著影响。现有的通用多模态检索方法通常基于语义构建，导致检索到的证据与声明相似但不相关。本文提出了一种用于证据检索的动态自适应对比学习方法（DACLR）来解决这些问题。DACLR首先使用多模态大语言模型（MLLM）将多模态证据和声明统一转换为文本模态，并在事件级别提取这些信息的特征。然后，通过召回-重排序的两阶段检索方法进行证据检索。DACLR通过优化对比损失和挖掘难负样本，增强了检索阶段模型的事件感知能力。具体而言，DACLR基于InfoNCE损失在语义和事件两个层次设计了三个损失函数，并对应设置了三组难负样本候选。模型根据批内样本的准确性监督信号动态调整比例，使模型在不遗忘语义检索能力的情况下，学习声明与正样本在事件层面的相关性。大量的对比和消融实验证明了DACLR及其内部优化方法的有效性。进一步的研究也证明了DACLR在多模态证据检索领域的优势。

英文摘要

In multimodal fact checking, the accuracy of retrieving evidence from different modalities has a significant impact on the downstream claim verification process. Existing general-purpose multimodal retrieval methods are often built on semantic similarity, so the retrieved evidence tends to be similar to the claim but not relevant to it. This paper proposes DACLR, a Dynamic Adaptive Contrastive Learning method for evidence Retrieval, to address these issues. DACLR first uses a Multimodal Large Language Model (MLLM) to convert multimodal evidence and claims uniformly into the text modality, and extracts features of this information at the event level. It then retrieves evidence through a two-stage recall-rerank method. DACLR enhances the model's event perception ability in the retrieval stage by optimizing the contrastive loss and mining hard negative samples. Specifically, DACLR designs three loss functions at two levels (semantic and event) based on the InfoNCE loss. Corresponding to these, three sets of hard negative sample candidates are set up. The model dynamically adjusts the ratio based on the accuracy supervision signal of intra-batch samples, allowing it to learn the correlation between claims and positive samples at the event level without forgetting its semantic retrieval ability. Extensive comparison and ablation experiments demonstrate the effectiveness of DACLR and its internal optimization methods. Further research also proves the advantages of DACLR in the field of multimodal evidence retrieval.
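
The abstract builds its three losses on InfoNCE. As a point of reference, here is a minimal sketch of the standard single-query InfoNCE loss only; it is not DACLR's semantic/event loss set or its dynamic ratio adjustment, which the abstract does not specify. The function name `info_nce`, the temperature default, and the similarity values are illustrative assumptions.

```python
import math


def info_nce(sim_pos, sim_negs, temperature=0.07):
    """Standard InfoNCE for one query: -log softmax of the positive
    similarity against the positive plus all negative similarities."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    peak = max(logits)  # subtract the max before exp for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# Hypothetical cosine similarities between a claim and candidate evidence.
easy = info_nce(0.9, [0.1, 0.0])   # negatives far from the claim
hard = info_nce(0.9, [0.8, 0.85])  # hard negatives close to the claim
print(easy, hard)
```

Hard negatives raise the loss because they compete with the positive in the softmax denominator, which is the usual motivation for mining them.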

2605.27445 2026-05-28 cs.IR cs.AI 版本更新

RAGe: A Retrieval-Augmented Generation Evaluation Framework

RAGe：一种检索增强生成评估框架

Larissa Guder, João Pedro de Moura, Arthur Accorsi, Gustavo Losch do Amaral, Maurício Cecílio Magnaguagno, Felipe Meneguzzi, Marcio Sorraglia Pinho, Dalvan Griebler

发表机构 * School of Technology, Pontifical Catholic University of Rio Grande do Sul（南里奥格兰德天主教大学技术学院）

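AI summary: Proposes RAGe, a modular framework that uses resource telemetry and component recommendation to evaluate the trade-offs among accuracy, efficiency, and scalability in RAG applications and to select the best components for domain-specific datasets.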

AI中文摘要

部署大型语言模型（LLM）应用，特别是那些依赖检索增强生成（RAG）的应用，仍然具有挑战性，原因是计算需求高、知识库过时以及需要手动选择最优流水线组件。在这项工作中，我们提出了一个模块化框架，通过关注资源遥测和组件推荐，为基准测试和指导RAG应用的高效开发提供支持，建议针对特定领域数据集的最佳组件。我们的方法利用LLM应用中的核心技术，包括文档分块、向量数据库、嵌入模型和检索器，来评估准确性、效率和可扩展性之间的权衡。通过将检索和生成质量与底层硬件约束直接关联，RAGe帮助研究人员识别最有效、特定领域的RAG设置，以满足其特定操作需求，即使在消费级硬件上也能促进快速原型开发。

英文摘要

Deploying Large Language Model (LLM) applications, particularly those relying on Retrieval-Augmented Generation (RAG), remains challenging due to high computational demands, outdated knowledge bases, and the need to manually select optimal pipeline components. In this work, we propose a modular framework for benchmarking and guiding the efficient development of RAG applications by focusing on resource telemetry and component recommendation, suggesting the best components for a domain-specific dataset. Our approach leverages core techniques in LLM applications, including document chunking, vector databases, embedding models, and retrievers, to evaluate trade-offs among accuracy, efficiency, and scalability. By directly correlating retrieval and generation quality with underlying hardware constraints, RAGe helps researchers identify the most effective, domain-specific RAG setups for their operational needs, facilitating rapid prototyping even on consumer-grade hardware.

2605.27444 2026-05-28 cs.IR cs.AI 版本更新

A Systematic Evaluation of Retrieval-Augmented Generation and Language Models for Space Operations

检索增强生成与语言模型在太空操作中的系统评估

Ruben Belo, Marta Guimarães, Cláudia Soares

发表机构 * NOVA LINCS ； Neuraspace ； Technical University of Munich（慕尼黑技术大学）

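AI summary: Systematically evaluates retrieval-augmented generation pipelines that combine large language models with information retrieval for extracting and synthesizing domain knowledge in space operations, comparing how different retrieval strategies, embedding models, and LLM answers affect accuracy, relevance, and reliability.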

2605.27440 2026-05-28 cs.IR cs.AI 版本更新

Paraphrase Brittleness in Production Retrieval-Augmented Commercial Recommendation: Reproducibility Below the Rerun-Stability Baseline

生产检索增强商业推荐中的释义脆弱性：低于重运行稳定性基线的可重复性

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

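AI summary: Finds that small wording changes in a buyer's question (e.g. "best CRM" vs "top CRM") lead AI assistants to recommend markedly different brands; recommendation-set similarity (Jaccard) falls far below the same-prompt rerun baseline, challenging the validity of current AEO/GEO practice.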

AI中文摘要

买家提问方式的小变化——例如“最佳CRM” vs “顶级CRM” vs “SaaS初创公司的最佳CRM”——会导致AI助手推荐截然不同的品牌。在OpenAI和Anthropic模型上进行的约6,000次释义运行和约6,000次相同提示重运行对照中，相同购买意图的两个释义之间的推荐集相似度（Jaccard）对于措辞性改写为0.288（聚类95% CI [0.215, 0.361]），对于添加约束的改写为0.135（[0.098, 0.175]，合并区域/语言和特异性阶梯轴）——两者均远低于0.50-0.61的相同提示重运行基线。提示字符串（而非底层购买意图）是决定哪些品牌出现的主要输入。增加推理努力并未缩小差距（界限为+/-0.05）。这对日益流行的AEO/GEO实践构成了直接挑战。通过固定提示集上统计品牌提及次数来追踪品牌的“AI可见性”，会产生一个度量，其方差的主要来源是追踪器恰好发出的释义，而非模型对品牌的行为：相同购买意图的两个自然释义产生的推荐集Jaccard重叠率为14-29%，而相同提示重运行则为50-61。原则上，对每个意图采样更多释义可以减少这种伪影，学术界也存在高效的多提示评估方法，但自然买家措辞空间远大于这些方法已验证的基准规模提示集，且远超任何商业追踪器对每个品牌-意图组合发出的提示数量。因此，逐提示的提及追踪作为测量单位在结构上是不稳定的；有意义的改进可能需要不同的单位，而非更大的提示集。

英文摘要

Small changes to how a buyer phrases a question -- "best CRM" vs "top CRM" vs "best CRM for a SaaS startup" -- produce substantially different brand recommendations from AI assistants. Across ~6,000 paraphrase runs and ~6,000 same-prompt rerun controls on OpenAI and Anthropic models, the recommendation-set similarity (Jaccard) between two paraphrases of the same underlying buying intent is 0.288 for cosmetic rewordings (clustered 95% CI [0.215, 0.361]) and 0.135 for constraint-adding rewordings ([0.098, 0.175], pooling region/language and specificity-ladder axes) -- both far below the 0.50-0.61 same-prompt rerun baseline. The prompt string, not the underlying buyer intent, is the dominant input to which brands surface. Increasing reasoning effort does not narrow the gap (bounded by +/-0.05). This is a direct challenge to an increasingly popular AEO/GEO practice. Tracking a brand's "AI visibility" by counting brand mentions over a fixed set of prompts produces a metric whose dominant source of variance is which paraphrase the tracker happens to issue, not the model's behavior toward the brand: the same buyer intent in two natural paraphrases produces recommendation sets that overlap 14-29% in Jaccard versus 50-61% for same-prompt reruns. Sampling more paraphrases per intent reduces the artifact in principle, and efficient multi-prompt evaluation methods exist in the academic literature, but the natural buyer-phrasing space is much larger than the benchmark-scale prompt sets those methods have been validated on, and far beyond what any commercial tracker issues per brand-intent combination. Prompt-by-prompt mention tracking is therefore structurally unstable as a unit of measurement; meaningful improvement likely requires a different unit rather than a larger prompt set.
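
The similarity figures in this abstract are Jaccard scores between recommendation sets. A minimal sketch of that measure, assuming each run is reduced to a set of brand names; the brand lists below are invented for illustration and are not data from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)


# Hypothetical brand lists returned for two paraphrases of one buying intent.
run_best = ["BrandA", "BrandB", "BrandC", "BrandD"]
run_top = ["BrandB", "BrandD", "BrandE", "BrandF"]
print(round(jaccard(run_best, run_top), 3))
```

Two of six distinct brands are shared here, so the score is 2/6; a score near the reported 0.135 means that two paraphrases share far fewer brands than this.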

2605.27439 2026-05-28 cs.IR cs.AI 版本更新

Mixture-of-Experts Approaches to Multimodal Learning Challenges: A Survey

Liangwei Nathan Zheng, Wei Emma Zhang, Olaf Maennel, Lin Yue, Weitong Chen

发表机构 * Adelaide University（阿德莱德大学）

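AI summary: Surveys how Mixture-of-Experts (MoE) addresses core multimodal learning challenges such as scalability, heterogeneity, and imperfect data through efficient scaling, representation learning, and flexible adaptation.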

Comments This survey paper has been accepted by IJCAI 2026; results were released by 30 April 2026. As I could not find a dedicated place to submit the acceptance email, I have uploaded it alongside the LaTeX files of the paper as Acceptance_email.pdf.

AI中文摘要

混合专家（MoE）为多模态学习提供了一个自然兼容且可扩展的框架，在不同模态和任务中展现出强大的适应性。尽管其日益成功，但关于MoE方法解决多模态挑战的全面系统综述仍然缺乏。现有综述往往从方法分类学角度独立评估多模态学习或MoE，忽视了它们之间的独特相互作用。本综述通过回答一个核心问题来填补这一空白：MoE如何有效解决多模态挑战？我们从三个关键视角进行探讨：(1) MoE作为高效多模态引擎：通过将计算成本与参数增长解耦，并通过选择性专家激活减轻模态冗余，实现可扩展的多模态建模；(2) MoE作为多模态表示学习器：整合互补的多意见专家知识，丰富对齐和交互表示；(3) MoE作为多模态适配器：提供模块化和灵活的机制，以建模不完美数据场景，如模态不平衡和模态缺失。通过广泛的文献综述，我们识别出关键研究空白，包括可解释路由、专家通信、模态集成和终身多模态学习。我们将本综述定位为未来研究的基础，旨在构建可解释且可持续的多模态混合专家系统。

英文摘要

Mixture-of-Experts (MoE) presents a naturally compatible and scalable framework for multimodal learning, demonstrating strong adaptability across diverse modalities and tasks. Despite its growing success, a comprehensive and systematic review of MoE methods addressing multimodal challenges remains lacking. Existing surveys tend to evaluate either multimodal learning or MoE independently from a method-taxonomy perspective, overlooking the unique interplay between them. This survey fills that gap by answering a central question: How does MoE effectively resolve multimodal challenges? We approach this from three key perspectives: (1) MoE as an Efficient Multimodal Engine: enabling scalable multimodal modeling by decoupling computational cost from parameter growth and mitigating modality redundancy through selective expert activation; (2) MoE as a Multimodal Representation Learner: integrating complementary multi-opinion expert knowledge to enrich alignment and interaction representations; and (3) MoE as a Multimodal Adapter: providing a modular and flexible mechanism to model imperfect data scenarios such as modality imbalance and missing modality. Through our extensive literature review, we identify critical research gaps, including interpretable routing, expert communication, modality integration, and lifelong multimodal learning. We position this survey as a foundation for future research toward interpretable and sustainable multimodal Mixture-of-Experts systems.

2605.27429 2026-05-28 cs.IR cs.AI 版本更新

Ocean4Rec: Offline LLM-Derived OCEAN Profiles for Request-Time VOD Reranking

Ocean4Rec：离线LLM生成的OCEAN画像用于请求时VOD重排序

Wonkyun Kim, Sehyun Bae, Kwanki Ahn, Mungyu Bae, Saeun Choi, Soyeon You, Chandra Prabhakar, Sehyun Kim

发表机构 * Samsung Electronics Republic of Korea（韩国三星电子公司）； Samsung Electronics India（印度三星电子公司）

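AI summary: Proposes Ocean4Rec, a reranking layer that uses an LLM offline to generate item OCEAN profiles and requires no LLM call at request time, improving VOD recommendation performance through numeric computation.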

AI中文摘要

工业视频点播（VOD）推荐系统需要更丰富的内容理解，但LLM作为重排序器的设计在每次请求中重复进行提示构建、令牌生成、模型调用、输出解析和回退处理。在高流量、延迟敏感的服务中，这些请求时操作使吞吐量规划、尾延迟控制、容量隔离和可预测运维复杂化。本文提出Ocean4Rec，一种重排序层，仅离线使用LLM从内容元数据中物化物品的OCEAN画像。物品被映射为开放性、尽责性、外向性、宜人性和神经质分数，而用户画像则通过最近点击和深度链接物品在同一五维空间中的时间衰减聚合构建。在请求时，Ocean4Rec连接预计算的物品画像、用户画像、基础推荐器分数和目录新鲜度，然后执行数值重排序，无需LLM调用。在匿名的三星智能电视VOD日志上，相同候选集的Top1000时间留出离线评估显示，对于NCF生成器，Ocean4Rec在NDCG@20上比更强的非OCEAN基础+新鲜度排序提升7.6%，对于LightGCN生成器提升61.5%。HR@20对于NCF不显著，对于LightGCN提升67.3%，反映了稀疏的精确物品回放标签和新鲜度作为工业基线的强度。该结果应被视为一种有界辅助内容品味特征的离线回放证据，该特征保留了无请求时LLM的服务路径的可部署性优势。

英文摘要

Industrial video-on-demand (VOD) recommenders need richer content understanding, but LLM-as-reranker designs repeat prompt construction, token generation, model invocation, output parsing, and fallback handling for each request. In high-volume latency-sensitive services, these request-time operations complicate throughput planning, tail-latency control, capacity isolation, and predictable operation. This paper presents Ocean4Rec, a reranking layer that uses an LLM only offline to materialize item OCEAN profiles from content metadata. Items are mapped into Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism scores, while user profiles are built by time-decayed aggregation of recently clicked and deep-linked items in the same five-dimensional space. At request time, Ocean4Rec joins precomputed item profiles, user profiles, base recommender scores, and catalog recency, then performs numeric reranking without an LLM call. On anonymized Samsung Smart TV VOD logs, same-candidate Top1000 temporal-holdout offline evaluations show that Ocean4Rec improves NDCG@20 over a stronger non-OCEAN Base+Recency ordering by 7.6% for an NCF generator and 61.5% for a LightGCN generator. HR@20 is inconclusive for NCF and improves by 67.3% for LightGCN, reflecting sparse exact-item replay labels and the strength of recency as an industrial baseline. The result should be read as offline replay evidence for a bounded auxiliary content-taste feature that preserves the deployability advantage of a request-time-LLM-free serving path.
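
The request-time path described above (a time-decayed user profile in the five-dimensional OCEAN space, joined with the base score and recency) can be sketched as follows. The abstract does not give the decay form or the scoring function, so the exponential half-life decay, the dot-product taste term, the linear combination, and all weights and names (`user_profile`, `rerank_score`, `half_life_days`) are assumptions for illustration only.

```python
def user_profile(events, now, half_life_days=14.0):
    """Time-decayed weighted mean of 5-dim item profiles [O, C, E, A, N].

    events: list of (timestamp_in_days, item_profile) for recent interactions.
    The exponential half-life decay is an assumed form.
    """
    total_weight = 0.0
    acc = [0.0] * 5
    for timestamp, profile in events:
        weight = 0.5 ** ((now - timestamp) / half_life_days)
        total_weight += weight
        for i, value in enumerate(profile):
            acc[i] += weight * value
    if total_weight == 0.0:
        return [0.0] * 5  # no history: neutral profile
    return [x / total_weight for x in acc]


def rerank_score(base, user, item, recency, w_base=1.0, w_taste=0.5, w_rec=0.2):
    """Assumed linear blend of base score, user-item taste match, and recency."""
    taste = sum(u * v for u, v in zip(user, item))
    return w_base * base + w_taste * taste + w_rec * recency


# Hypothetical history: one item clicked today, one clicked 14 days ago.
history = [(10.0, [1.0, 0.0, 0.0, 0.0, 0.0]), (-4.0, [0.0, 1.0, 0.0, 0.0, 0.0])]
profile = user_profile(history, now=10.0)
print(profile, rerank_score(2.0, profile, [0.5, 0.0, 0.0, 0.0, 0.0], 1.0))
```

Everything here is plain arithmetic over precomputed vectors, which is the property the abstract emphasises: no LLM call is needed when a request arrives.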

2605.27417 2026-05-28 quant-ph cs.AI cs.LG 版本更新

Advancing Direct Training of Spiking Neural Networks with Circulate-Firing Neurons and Learnable Gradients

Feifan Zhou, Xiang Wei, Yang Liu, Qiang Yu

发表机构 * School of Artificial Intelligence, Tianjin University（人工智能学院，天津大学）

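AI summary: Proposes a direct training algorithm comprising circulate-firing neurons, a time-step-wise learnable surrogate gradient, and a positive-negative balanced loss to improve the information representation capacity and gradient propagation accuracy of spiking neural networks; it achieves competitive performance on multiple datasets and generalizes to Transformer architectures.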

AI中文摘要

脉冲神经网络（SNN）因其节能特性而备受关注，但与人工神经网络（ANN）相比仍存在显著性能差距。这一差距源于至少两个关键限制：首先，传统脉冲神经元的信息表示能力有限，未能充分利用膜电位的丰富动态；其次，固定代理梯度（SG）函数在时间步上导致梯度传播不精确，阻碍了有效的直接训练。为了解决这两个挑战，我们提出了一种新的直接训练算法，包含三个核心创新：第一，一种循环发放脉冲神经元模型，通过更有效地利用膜电位来增强信息表示能力；第二，一种逐时间步可学习的代理梯度函数，能够在反向传播过程中实现精确的梯度估计；第三，一种正负平衡损失函数，以实现正负膜电位之间的平衡，进一步提升SNN性能。大量实验表明，我们的方法在多个数据集上取得了竞争性性能。我们的方法可以无缝泛化到先进的Transformer架构，始终优于现有方法。我们的工作强调了进一步利用SNN内在膜动力学以提升性能的有效性，从而为推进高性能脉冲神经架构开辟了新途径。

英文摘要

Spiking Neural Networks (SNNs) have emerged as a promising energy-efficient paradigm, yet a substantial performance gap persists compared to Artificial Neural Networks (ANNs). This gap stems from at least two key limitations: first, conventional spiking neurons offer limited information representation capacity, underutilizing the rich dynamics of membrane potentials; second, using a fixed surrogate gradient (SG) function across time steps leads to imprecise gradient propagation, impeding effective direct training. To address these two challenges, we propose a new direct training algorithm with three core innovations: first, a circulate-firing spiking neuron model that enhances information representation capacity by leveraging membrane potentials more effectively; second, a time-step-wise learnable surrogate gradient function, enabling accurate gradient estimation during backpropagation; third, a positive-negative balanced loss function to achieve equilibrium between positive and negative membrane potentials and further boost SNN performance. Extensive experiments demonstrate that our methods achieve competitive performance across multiple datasets. Our methods generalize seamlessly to advanced Transformer architectures, consistently outperforming existing methods. Our work highlights the effectiveness of further harnessing the intrinsic membrane dynamics of SNNs for performance improvement, and thus opens a new avenue for advancing high-performance spiking neural architectures.

2605.27409 2026-05-28 cs.NE cs.AI cs.LG 版本更新

STARS: Spike Tail-Aware Relational Synthesis for ANN-to-SNN Data-Free Knowledge Distillation

STARS: 面向ANN到SNN无数据知识蒸馏的尖峰尾部感知关系合成

Shuhan Ye, Yi Yu, Qixin Zhang, Hui Lu, Jiaming He, Qinggang Zhang, Li Shen, Xudong Jiang

发表机构 * Nanyang Technological University（南洋理工大学）； Jilin University（吉林大学）； Wuhan University（武汉大学）； Sun Yat-sen University（中山大学）

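AI summary: Proposes STARS, which augments BN-guided synthetic data with relational consistency alignment and tail-aware regularization to address the under-constraint of SNN students in data-free knowledge distillation, improving performance on multiple datasets.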

AI中文摘要

SNN有望实现高能效和低延迟推理，但其性能仍落后于ANN。ANN到SNN的知识蒸馏有助于缩小这一差距，但在实际部署中原始训练数据通常不可用。现有的无数据知识蒸馏（DFKD）方法通过匹配教师侧先验（尤其是BN统计量）来合成替代数据，但这些面向ANN的约束主要正则化均值和方差，因此对于响应依赖于阈值穿越动态的SNN学生网络而言，约束不足。本文提出尖峰尾部感知关系合成（STARS），一种用于ANN到SNN DFKD的即插即用方法，通过两个互补目标增强标准BN引导合成：关系一致性对齐（保持教师和学生之间的跨样本关系一致性）和尾部感知正则化（通过软超越教师导出阈值来正则化阈值相关的尾部概率）。这些目标共同生成合成批次，这些批次在保持教师有效性的同时，对SNN学生网络更具信息性。在CIFAR-10、CIFAR-100和Tiny-ImageNet上的多个ANN-SNN对实验表明，我们的方法一致改进了传统DFKD基线，甚至超过了若干KD方法，在CIFAR-10上提升高达4.6%，在CIFAR-100上提升高达6.7%，突显了在面向SNN的DFKD中，用关系约束和尾部感知约束补充BN匹配的重要性。

英文摘要

SNNs promise energy-efficient and low-latency inference, but their performance still trails that of ANNs. ANN-to-SNN knowledge distillation helps narrow this gap, yet the original training data are often unavailable in practical deployment settings. Existing data-free knowledge distillation (DFKD) methods synthesize surrogate data by matching teacher-side priors, especially BN statistics, but these ANN-oriented constraints mainly regularize mean and variance and therefore remain under-constrained for SNN students whose responses depend on threshold-crossing dynamics. In this paper, we propose Spike Tail-Aware Relational Synthesis (STARS), a plug-and-play method for ANN-to-SNN DFKD that augments standard BN-guided synthesis with two complementary objectives: Relational Consistency Alignment, which preserves cross-sample relational consistency between teacher and student, and Tail-Aware Regularization, which regularizes threshold-relevant tail probabilities through soft exceedance over teacher-derived thresholds. Together, these objectives generate synthetic batches that remain teacher-valid while becoming more informative for SNN students. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet across multiple ANN-SNN pairs show that our method consistently improves conventional DFKD baselines and even surpasses several KD methods, with gains of up to 4.6% on CIFAR-10 and 6.7% on CIFAR-100, highlighting the importance of complementing BN matching with relational and tail-aware constraints in SNN-oriented DFKD.

2605.27407 2026-05-28 cs.NE cs.AI cs.LG 版本更新

Benchmarking Fairness in Spiking Neural Networks: Data Bias, Spurious Features, and Hardware Effects

脉冲神经网络中的公平性基准测试：数据偏差、虚假特征和硬件效应

Hudi He, Fukun Wang, Zhe Wang, Xinyi Wang, Shuhan Ye, Jiarui Liu, Qing Qing, Ziqi Xu, Xikun Zhang, Renqiang Luo

发表机构 * Jilin University（吉林大学）； Nanyang Technological University（南洋理工大学）； RMIT University（皇家墨尔本理工大学）

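AI summary: Presents the first fairness benchmark for spiking neural networks, introducing three dimensions of realism (demographic coverage gaps, spurious feature leakage, and deployment-environment mismatch) to systematically evaluate the fairness-performance trade-offs of 12 state-of-the-art SNNs under resource constraints.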

AI中文摘要

评估脉冲神经网络（SNN）的公平性需要反映现实世界复杂性的严格基准，然而现有评估仍受限于肤浅的数据集多样性和理想化的硬件假设。本文首次引入SNN的系统性公平性基准，解决三个关键的现实维度：（1）训练数据中的人口统计覆盖缺口，（2）虚假特征泄漏（例如，肤色作为类别标签的代理），以及（3）部署环境不匹配（例如，具有受限脉冲编码的边缘设备）。我们的框架整合了四个跨人口统计数据集（带有受控偏差注入）和三个神经形态硬件模拟器（Loihi 2、SpiNNaker），从而能够在资源约束下隔离分析公平性-性能权衡。对12种最先进SNN的标准化评估揭示了显著差异：在偏差数据上训练的模型对代表性不足群体的假阳性率高出23%，而硬件限制（例如，降低的脉冲精度）在边缘部署中进一步将准确率差距放大至41%。关键的是，为云端SNN开发的偏差缓解策略在资源约束下通常会退化，这凸显了需要联合优化公平性和硬件效率的协同设计原则。通过连接算法公平性研究与神经形态工程，我们的基准为医疗和自主系统等社会关键应用中的可信SNN奠定了基础。我们的代码可在以下网址获取：https://anonymous.4open.science/r/SNN-Benchmarks-8017。

英文摘要

Evaluating fairness in Spiking Neural Networks (SNNs) demands rigorous benchmarks that reflect real-world complexities, yet existing assessments remain limited by superficial dataset diversity and idealized hardware assumptions. This work introduces the first systematic fairness benchmark for SNNs, addressing three critical dimensions of realism: (1) demographic coverage gaps in training data, (2) spurious feature leakage (e.g., skin tone as a proxy for class labels), and (3) deployment-environment mismatches (e.g., edge devices with constrained spike encoding). Our framework integrates four cross-demographic datasets with controlled bias injections and three neuromorphic hardware simulators (Loihi 2, SpiNNaker), enabling isolated analysis of fairness-performance trade-offs under resource constraints. Standardized evaluations of 12 state-of-the-art SNNs reveal stark disparities: models trained on biased data exhibit 23% higher false positive rates for underrepresented groups, while hardware limitations (e.g., reduced spike precision) further amplify accuracy gaps by up to 41% in edge deployments. Critically, bias mitigation strategies developed for cloud-based SNNs often degrade under resource constraints, highlighting the need for co-design principles that jointly optimize fairness and hardware efficiency. By bridging algorithmic fairness research with neuromorphic engineering, our benchmark provides a foundation for trustworthy SNNs in socially critical applications such as healthcare and autonomous systems. Our code is available at: https://anonymous.4open.science/r/SNN-Benchmarks-8017.
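
The abstract reports disparities in false positive rates between demographic groups. As a rough illustration of how such a gap could be measured, here is a minimal Python sketch. The function names, labels, and group assignments are invented for this example and are not taken from the benchmark's code.

```python
from collections import defaultdict


def false_positive_rates(y_true, y_pred, groups):
    """Return {group: FPR}, where FPR = false positives / all true negatives."""
    false_pos = defaultdict(int)
    negatives = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:  # only true negatives can yield false positives
            negatives[group] += 1
            if pred == 1:
                false_pos[group] += 1
    return {g: false_pos[g] / negatives[g] for g in negatives}


def fpr_gap(rates):
    """Largest difference in FPR between any two groups."""
    return max(rates.values()) - min(rates.values())


# Toy data (hypothetical): group "b" receives more false positives than "a".
y_true = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0, 0, 1]
groups = ["a"] * 5 + ["b"] * 5

rates = false_positive_rates(y_true, y_pred, groups)
print(rates, fpr_gap(rates))
```

In a hardware-aware evaluation, the same metric would presumably be recomputed for each simulated deployment setting so that the gap can be compared across settings.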

2605.27404 2026-05-28 cs.CY cs.AI 版本更新

Smaller, Younger, and More Impactful: How AI-Assisted Writing Transforms Research Teams

更小、更年轻、更具影响力：AI辅助写作如何改变研究团队

Haoyang Wang, Mingze Zhang, Yi Bu, Star Xing Zhao, Meijun Liu

Affiliations: School of Information, University of Texas at Austin; National Science Library, Chinese Academy of Sciences; Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences; Department of Information Management, Peking University; Institute of Big Data, Fudan University; Institute for Global Public Policy, Fudan University; Faculty of Finance, City University of Macau

AI summary: Using full texts from the PLoS family and Nature portfolio journals since 2020 and several regression methods, the study finds that research teams using AI-assisted writing are younger and smaller, with no loss of scientific impact and a higher probability of producing highly impactful publications.

AI中文摘要

大科学时代长期以来以日益庞大和专门化的研究团队推动知识前沿为特征。然而，人工智能（AI）的最新进展，特别是大型语言模型（LLMs），正开始重塑学术写作和科学研究，可能打破长期以来团队规模不断扩大的趋势，并改变研究团队结构的其他维度。基于2020年以来PLoS系列和Nature系列期刊的147,074篇全文出版物，我们考察了AI辅助写作是否以及如何影响科学中的团队结构和团队成果。使用多种方法，包括普通最小二乘法、分位数回归、泊松回归、逻辑回归和倾向得分匹配，我们发现使用AI辅助写作的研究团队往往更年轻、规模更小。重要的是，这种向更紧凑、更年轻化团队的转变并非以牺牲科学影响力为代价。相反，我们观察到采用AI辅助写作的研究团队产生高影响力出版物的概率更高。这些结果凸显了AI辅助写作在重塑研究生产方式以及研究团队组建和构成方面的重要作用。我们的发现呼吁在研究评估、资助和培训方面进行政策改进，以应对这一新兴趋势。

英文摘要

The era of Big Science has long been defined by increasingly large and specialized research teams pushing the frontiers of knowledge. However, recent advances in artificial intelligence (AI), particularly large language models (LLMs), are beginning to reshape academic writing and scientific research, potentially disrupting the longstanding trend toward ever-larger teams and transforming other dimensions of research team structure. Drawing on 147,074 full-text publications from the PLoS family and the Nature portfolio since 2020, we examined whether and how AI-assisted writing influences team structure and team outcomes in science. Using multiple methods, including ordinary least squares, quantile regression, Poisson regression, logistic regression, and propensity score matching, we found that research teams using AI-assisted writing tend to be younger and smaller. Importantly, this shift toward more compact, junior-leaning teams does not come at the expense of scientific impact. On the contrary, we observed a higher probability of research teams that employed AI-assisted writing producing highly impactful publications. These results highlight the significant role of AI-assisted writing in reshaping not only how research is produced, but also how research teams are formed and assembled. Our findings call for policy improvements in research evaluation, funding, and training to address this emerging trend.

2605.27403 2026-05-28 cs.CY cs.AI 版本更新

LLM-assisted sentiment analysis for integrated computational and qualitative mixed methods education research: A case study of students' written reflection assignments

LLM辅助情感分析在综合计算与定性混合方法教育研究中的应用：学生书面反思作业案例研究

Xiomara Gonzalez, Gabriella Coloyan Fleming, Andrew Katz, Maya Denton, Jessica Deters

Affiliations: Chandra Family Department of Electrical and Computer Engineering, University of Texas at Austin; Department of Engineering Education, Virginia Polytechnic Institute and State University; Gallogly College of Engineering, University of Oklahoma; Department of Mechanical and Materials Engineering, University of Nebraska-Lincoln

AI summary: Through a longitudinal case study, the authors combine LLM-assisted sentiment analysis with statistical testing and thematic analysis to examine how student identity variables affect sentiment about language and communication during study abroad, and find that prior experience living abroad is the only significant variable.

AI中文摘要

书面反思作业为学生提供了宝贵的批判性自我评估、意义建构和学习处理的机会。此外，此类反思为定性教育研究提供了丰富的数据。然而，定性数据分析可能耗时。定性比较不同参与者群体之间的发现更为耗时，通常将比较限制在最多一个变量（例如，二元性别）。大型语言模型（LLM）最近开始被批判性地评估用作定性研究助手。利用来自留学项目的纵向学生书面反思案例（n=151），我们研究了LLM辅助情感分析如何能够实现结合计算分析和主题分析的纵向混合方法研究。首先，使用统计检验根据七个不同的学生身份/生活经历变量定量比较情感差异。然后，这些结果指导定性数据分析，以调查这些差异背后的原因。对于本科留学学生，我们发现先前的海外生活经历是唯一影响学生对语言和交流行为情感的个人变量。这一工作流程对于定性研究人员在比较不同人口群体参与者时如何更轻松地探究多个变量具有启示意义。

英文摘要

Written reflection assignments give students valuable opportunities for critical self-assessment, meaning making, and learning processing. Additionally, such reflections provide rich data for qualitative education research. However, qualitative data can be time-consuming to analyze. It is even more time-intensive to qualitatively compare findings between different groups of participants, usually limiting comparison to, at most, one variable (e.g., binary gender). Large language models (LLMs) have recently begun to be critically evaluated for use as qualitative research assistants. Using a longitudinal case of written student reflections (n=151) from a study abroad program, we investigate how LLM-assisted sentiment analysis can enable longitudinal mixed-methods research combining computational and thematic analyses. First, statistical testing is used to quantitatively compare sentiment differences according to seven different student identity/lived experience variables. Then, these results inform qualitative data analysis to investigate the reasons underlying these differences. For the case of undergraduate students studying abroad, we found that prior experience living abroad was the only personal variable impacting students' sentiments of their verbal language and communication behaviors. This workflow has implications for how qualitative researchers can more easily probe multiple variables when comparing participants from different demographic groups.

2605.27402 2026-05-28 cs.CY cs.AI cs.CL 版本更新

REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

REC-CBM：面向可信开放评分的基于规则感知的错误修正概念瓶颈模型

Chengshuai Zhao, Fan Zhang, Kumar Satvik Chaudhary, Yiwen Li, Lo Pang-Yun Ting, Ying-Chih Chen, Huan Liu

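Affiliations: School of Computing and Augmented Intelligence, Arizona State University, USA; Mary Lou Fulton Teachers College, Arizona State University, USA; Department of Computer Science, National Yang Ming Chiao Tung University, TW

AI summary: Proposes REC-CBM, which combines a rubric-aware concept encoder, an ordinal pairwise calibration objective, and a latent concept error-correction module to address three weaknesses of standard concept bottleneck models in open-ended grading: no modeling of fine-grained rubric dimensions, neglect of the ordinal semantics of scores, and unreliable concept annotations. The model improves grading performance while preserving interpretability.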

AI中文摘要

开放评分对于公平和个性化教育至关重要，但人工评分耗时且成本高，凸显了自动化评分系统的必要性。尽管基于神经和大语言模型（LLM）的系统表现出优越性能，但它们通常是黑箱模型，其评分过程和理由难以让教育者验证和信任。概念瓶颈模型（CBM）通过将预测路由到人类可解释的概念，提供透明度的机制保证，成为一种有前景的方法。然而，标准CBM不适用于开放评分：它们没有显式建模细粒度的规则维度，未能充分捕捉评分量表的序数语义，并忽略了人类概念标注中固有的可靠性问题。为解决这些局限，我们提出REC-CBM，一种面向可信开放评分的规则感知错误修正概念瓶颈模型。REC-CBM引入了规则感知概念编码器，学习针对回答的概念特定表示，以及一个序数成对校准目标，保留规则维度间的排序结构。它还结合了一个潜在概念错误修正模块，在最终评分预测前对概念预测进行去噪，同时保持可解释性。在公开数据集上的全面实验表明，REC-CBM在评分性能上持续提升，并产生比最先进基线更忠实的概念级推理。进一步分析验证了每个组件的贡献，并展示了在真实教育环境中的适用性。总体而言，这项工作提供了一种实用、可解释的评分解决方案，使教育者能够检查、干预和信任自动化决策，推动更透明和可信的教育。

英文摘要

Open-ended grading is central to equitable and personalized education, yet manual grading remains time-consuming and costly, underscoring the need for automated grading systems. Although recent neural and large language model (LLM) based systems have demonstrated superior performance, they are typically black-box models whose scoring processes and rationales are difficult for educators to verify and trust. Concept bottleneck models (CBMs) have emerged as a promising approach by routing predictions through human-interpretable concepts, providing a mechanistic guarantee of transparency. However, standard CBMs are not tailored to open-ended grading: they do not explicitly model fine-grained rubric dimensions, inadequately capture the ordinal semantics of scoring scales, and neglect inherent reliability issues in human concept annotations. To address these limitations, we propose REC-CBM, a rubric-aware error-correction concept bottleneck model for trustworthy open-ended grading. REC-CBM introduces a rubric-aware concept encoder that learns concept-specific representations over responses and an ordinal pairwise calibration objective that preserves ranking structure among rubric dimensions. It further incorporates a latent concept error-correction module that denoises concept predictions before final grade prediction while preserving interpretability. Comprehensive experiments on publicly available datasets show that REC-CBM consistently improves grading performance and produces more faithful concept-level reasoning than state-of-the-art baselines. Further analyses validate the contribution of each component and demonstrate the applicability in realistic educational settings. Overall, this work provides a practical, interpretable grading solution that enables educators to inspect, intervene in, and trust automated decisions, advancing more transparent and trustworthy education.

2605.27401 2026-05-28 cs.CY cs.AI 版本更新

Using Zero-Shot LLM-Generated Survey Data for Geographically Explicit Population Synthesis

使用零样本大语言模型生成的调查数据进行地理显式人口合成

Taylor Anderson, Sara Von Hoene, Orhan Yagizer Cinar, Emma Von Hoene, Amira Roess, Andrew Crooks, Hamdi Kavak

Affiliations: Dept. of Geography and Geoinformation Science, George Mason University, Fairfax, VA, USA; Dept. of Computer Science, George Mason University, Fairfax, VA, USA; College of Public Health, George Mason University, Fairfax, VA, USA; Dept. of Geography, University at Buffalo, Buffalo, NY, USA; Dept. of Computational and Data Sciences, George Mason University, Fairfax, VA, USA

AI summary: Evaluates whether health survey data generated by zero-shot large language models can serve as input to a conventional iterative proportional fitting workflow for geographically explicit population synthesis, and finds that it works as a supplementary input but cannot yet replace real survey data.

Comments 15 pages, 5 figures, 3 tables

AI中文摘要

人们对将合成人口用于各种应用的兴趣日益增长。同时，我们目睹了人工智能在各行各业的巨大发展。本文评估了零样本大语言模型（LLM）生成的健康调查数据能否作为传统迭代比例拟合（IPF）工作流的输入，用于地理显式人口合成。利用2023年行为风险因素监测系统（BRFSS），我们使用GPT-4.1和Gemini-2.5-Pro为美国科罗拉多州和密西西比州生成合成调查记录。我们将生成的数据用于基于IPF的合成流程，并针对外部基准评估生成的普查区级合成人口。结果表明，两个LLM都捕捉到了几个主要的州级对比，表明零样本生成产生了地理差异化的调查数据。然而，性能强烈依赖于变量。人口合成中的下游效应是混合的，因为IPF有时会放大或减少生成数据中的错误。空间验证表明，基于LLM的人口合理地再现了普查区级的模式，尤其是对于与真实数据更一致的变量。总体而言，LLM生成的调查数据显示出作为补充输入的前景，但尚不能替代真实调查数据。

英文摘要

There is a growing interest in utilizing synthetic populations for a diverse range of applications. At the same time, we are witnessing a tremendous growth in artificial intelligence in all walks of life. This paper evaluates whether zero-shot large language model (LLM)-generated health survey data can serve as inputs to a conventional iterative proportional fitting (IPF) workflow for geographically explicit population synthesis. Using the 2023 Behavioral Risk Factor Surveillance System (BRFSS), we generate synthetic survey records for the U.S. states of Colorado and Mississippi with GPT-4.1 and Gemini-2.5-Pro. We use the generated data in an IPF-based synthesis pipeline and evaluate the resulting census tract-level synthetic populations against external benchmarks. Results show both LLMs capture several major state-level contrasts, indicating zero-shot generation produces geographically differentiated survey data. However, performance is strongly variable-dependent. Downstream effects in population synthesis are mixed, as IPF sometimes amplifies or reduces errors in the generated data. Spatial validation shows that LLM-based populations reproduce census tract-level patterns reasonably well, especially for variables that were more aligned with the ground truth data. Overall, the LLM-generated survey data shows promise as supplementary input, but not yet as a replacement for real survey data.
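
For readers unfamiliar with iterative proportional fitting (IPF), the core step can be sketched in a few lines of Python. This is a generic two-dimensional IPF under stated assumptions, not the authors' pipeline: the seed table stands in for a survey cross-tabulation (here it would come from LLM-generated records), the targets stand in for census tract marginals, and all numbers and variable names are invented.

```python
def ipf(seed, row_targets, col_targets, max_iters=200, tol=1e-10):
    """Rescale a 2-D seed table until its margins match the target totals."""
    table = [list(row) for row in seed]
    for _ in range(max_iters):
        # Row step: scale each row so it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Column step: scale each column so it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns are exact after the column step, so check the rows.
        row_error = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if row_error < tol:
            break
    return table


# Hypothetical inputs: a 2x2 survey cross-tab and tract-level marginals.
survey_crosstab = [[1.0, 2.0], [3.0, 4.0]]
tract_age_totals = [40.0, 60.0]      # row targets
tract_smoking_totals = [30.0, 70.0]  # column targets

fitted = ipf(survey_crosstab, tract_age_totals, tract_smoking_totals)
print(fitted)
```

IPF only rescales rows and columns, so the association structure of the seed (its odds ratios) carries through to the fitted table. This is one way errors in generated survey data can propagate downstream, which is consistent with the mixed effects the abstract describes.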

2605.27400 2026-05-28 cs.CY cs.AI cs.CC cs.ET cs.GT cs.MA 版本更新

Mathematical Modelling of Ethical AI Use in Higher Education: A Coordination Game Framework for Future-Facing Learning

高等教育中伦理AI使用的数学建模：面向未来学习的协调博弈框架

Ndidi Bianca Ogbo, Zhao Song, Shatha Ghareeb, The Anh Han

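Affiliations: School of Computing, Engineering and Digital Technologies, Teesside University, United Kingdom

AI summary: Uses a coordination-game framework to study how norms of responsible or opportunistic AI use form within student cohorts, and shows how assessment incentives can trigger behavioural transitions.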

AI中文摘要

生成式人工智能在高等教育中的快速普及正在重塑评估实践，并加剧对学术诚信、公平性和学习质量的担忧。尽管机构回应越来越强调政策指导和伦理原则，但对于学生群体中负责任或机会主义AI使用的集体规范如何出现和稳定，仍缺乏正式理解。本文将学生在评估中的AI使用重新定义为由同伴期望和评估设计而非仅个体合规塑造的协调问题。我们开发了一个基于协调的演化博弈论框架，捕捉学习价值、努力、感知公平性和透明度，并通过反思性评估激励隐式建模机构AI治理。我们使用分析结果和有限种群模拟揭示了学生AI使用中的阈值驱动行为转变：小而校准良好的反思性评估激励变化可以触发向负责任、以学习为导向的AI使用规范的快速转变，而弱或错位的激励则允许机会主义实践持续存在。这些非线性动态解释了为何仅政策声明往往无法改变行为，而适度的评估重新设计可能产生不成比例的影响。通过提供评估结构如何塑造集体AI使用实践的机制层面解释，本文为高等教育机构提供了一个分析基础的工具，支持面向未来学习的比例性、教学法主导的AI治理，而无需依赖监控或惩罚性执法。

英文摘要

The rapid uptake of generative artificial intelligence (AI) in higher education is reshaping assessment practices and intensifying concerns around academic integrity, fairness, and learning quality. While institutional responses increasingly emphasise policy guidance and ethical principles, there remains limited formal understanding of how collective norms of responsible or opportunistic AI use emerge and stabilise within student cohorts. This paper reframes student AI use in assessment as a coordination problem shaped by peer expectations and assessment design rather than individual compliance alone. We develop a coordination-based evolutionary game-theoretic framework that captures learning value, effort, perceived fairness, and transparency, with institutional AI governance modelled implicitly through reflective assessment incentives. We use analytical results and finite-population simulations to reveal threshold-driven behavioural transitions in student AI use: small, well-calibrated changes in reflective assessment incentives can trigger rapid shifts towards responsible, learning-oriented AI-use norms, whereas weak or misaligned incentives allow opportunistic practices to persist. These non-linear dynamics explain why policy statements alone often fail to change behaviour, while modest assessment redesigns can have disproportionate effects. By providing a mechanism-level account of how assessment structures shape collective AI-use practices, this work offers higher education institutions an analytically grounded tool for Future Facing Learning, supporting proportionate, pedagogy-led AI governance without reliance on surveillance or punitive enforcement.
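
The threshold behaviour described in this abstract can be illustrated with a toy two-strategy coordination game. The sketch below is a simplification under loud assumptions: it uses deterministic replicator dynamics rather than the finite-population simulations the paper mentions, and the payoff values `a`, `b`, `c`, `d` and the incentive `s` are hypothetical, chosen only so that the game is a coordination game (`a > c` and `d > b`).

```python
def tipping_point(a, b, c, d, s):
    """Share of responsible users above which responsible use spreads.

    Payoffs (hypothetical): responsible vs responsible = a + s,
    responsible vs opportunistic = b + s, opportunistic vs responsible = c,
    opportunistic vs opportunistic = d. The incentive s rewards responsible use.
    """
    x_star = (d - b - s) / ((a - c) + (d - b))
    return min(1.0, max(0.0, x_star))


def simulate(x0, a, b, c, d, s, steps=5000, dt=0.01):
    """Euler integration of replicator dynamics for the responsible share x."""
    x = x0
    for _ in range(steps):
        payoff_responsible = x * (a + s) + (1 - x) * (b + s)
        payoff_opportunistic = x * c + (1 - x) * d
        x += dt * x * (1 - x) * (payoff_responsible - payoff_opportunistic)
        x = min(1.0, max(0.0, x))
    return x


# Same cohort (half responsible at the start), two incentive levels.
weak = simulate(0.5, a=3, b=0, c=2, d=2, s=0.0)    # tipping point 2/3 > 0.5
strong = simulate(0.5, a=3, b=0, c=2, d=2, s=0.6)  # tipping point ~0.47 < 0.5
print(weak, strong)
```

With these made-up numbers, a modest change in the incentive moves the tipping point from above the initial share to below it, and the cohort ends at the opposite norm. This mirrors, in a much cruder form, the non-linear transitions the authors report.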

2605.27399 2026-05-28 cs.CY cs.AI 版本更新

Short-Term Gain, Long-Term Fragility: AI Labor Substitution and the Erosion of Sustainable Capability

短期收益，长期脆弱：AI劳动力替代与可持续能力的侵蚀

Wolfgang Rohde

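Affiliations: AiSuNe Foundation

AI summary: Proposes mechanisms of capability masking and capability erosion, arguing that AI labor substitution raises short-term efficiency while drawing down hard-to-rebuild human capabilities, leaving the system more fragile in the long term.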

Comments 19 pages, 7 figures, Also available on SSRN: https://doi.org/10.2139/ssrn.6577818

AI中文摘要

看似加速的过程可能是一种将负担从当下悄然转移至未来的行为。用AI系统替代人类劳动力的尝试常被呈现为对技术进步的理性回应，但这种观点在结构上往往是短视的。在软件开发及邻近知识产业中，AI日益具有吸引力，因为它似乎能降低劳动力成本、加快产出速度并改善短期指标。然而，这些收益可能是通过消耗那些构建缓慢且难以恢复的人类能力而实现的。本文提出了AI劳动力替代下的能力掩盖与能力侵蚀机制。AI生成的输出可能造成组织能力已被替代的假象，即使对熟练人类劳动力的依赖依然存在。这种假象可能支持招聘限制，同时更慢的成本在暗中累积。来自AI辅助编程的证据表明，生成的输出仍需要大量人工验证，且在正确性、可维护性和安全性方面参差不齐。仓库级研究也提示了在处理更广泛代码库上下文方面的局限性。更广泛地，劳动力市场、政治经济学和产业战略证据表明，替代压力正由管理层的成本激励和国家竞争驱动，同时增加了集中化和平台控制的风险。其结果是，一个系统在短期内看似更高效，但随着时间的推移却变得更加脆弱。

英文摘要

What looks like acceleration can be a quiet transfer of burden from the present to the future. Attempts to replace human labor with AI systems are often presented as rational responses to technological progress, but that view is often structurally short-sighted. Across software development and adjacent knowledge industries, AI is increasingly attractive because it appears to reduce labor costs, speed output, and improve short-term metrics. Yet those gains may be achieved by drawing down human capabilities that are slow to build and difficult to restore. This paper develops a mechanism of capability masking and capability erosion under AI labor substitution. AI-generated output can create the appearance that organizational capability has been replaced, even when dependence on skilled human labor remains. That appearance can support hiring restraint while slower costs accumulate in the background. Evidence from AI-assisted coding shows that generated output still requires substantial human verification and remains uneven in correctness, maintainability, and security. Repository-level studies also suggest limits in handling broader codebase context. More broadly, labor-market, political-economy, and industrial-strategy evidence suggests that substitution pressures are being driven by managerial cost incentives and national competition while increasing risks of concentration and platform control. The result is a system that may look more efficient in the short term while becoming more fragile over time.

2605.27396 2026-05-28 cs.CY cs.AI 版本更新

Agentic Literacy Debt: A Structural Problem the AI Literacy Field Has Not Yet Named

代理素养债务：AI素养领域尚未命名的结构性问题

Rohith Nama

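Affiliations: Rohith Nama

AI summary: Introduces the concept of "agentic literacy debt": when autonomous AI agents are deployed at scale, users who lack the capacity to oversee them face an accumulating societal deficit. Cases from healthcare, finance, and other domains are used to argue that the problem is structural.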

Journal ref: AI & Ethics, 2026

AI中文摘要

自主AI代理现在能够在医疗、金融服务和工作场所等场景中代表用户进行规划、决策和行动，通常无需逐步获得人类批准。现有的AI素养框架是为人类评估AI输出并决定是否采取行动的世界而构建的；它们没有词汇来描述那些已将决策权委托给代理的用户，而代理的行为可能不可观察、不可逆转或不可控制。本文命名了由此产生的问题——代理素养债务：当代理型AI系统在没有相应素养基础设施的情况下大规模部署时，不断累积的社会赤字。这种债务通过三个强化渠道（不透明委托的正常化、多代理生态系统的复杂性以及制度路径依赖）复合增长，由部署代理的组织产生，但由代理所代表的用户、患者和公民承担。来自医疗、金融欺诈和全球公平领域的证据表明，这一差距已经具有重大影响。该问题是结构性的，而非课程改革能够弥补的暂时滞后。它要求将AI素养重新定义为一种治理能力，而非评估能力。

英文摘要

Autonomous AI agents now plan, decide, and act on behalf of users across healthcare, financial services, and workplace contexts, often without step-by-step human approval. Existing AI literacy frameworks were built for a world in which humans evaluate AI outputs and decide whether to act; they have no vocabulary for the user who has delegated decision-making authority to an agent whose actions may not be observable, reversible, or controllable. This paper names the resulting problem agentic literacy debt: the accumulating societal deficit that grows when agentic AI systems are deployed at scale without corresponding literacy infrastructure. The debt compounds through three reinforcing channels (normalization of opaque delegation, multi-agent ecosystem complexity, and institutional path dependence), and it is incurred by the organizations that deploy agents but paid by the users, patients, and citizens on whose behalf the agents act. Evidence from healthcare, financial fraud, and global equity contexts suggests the gap is already consequential. The problem is structural, not a temporary lag that curriculum reform will close. It demands a reframing of AI literacy as a governance capability, not an evaluative one.

2605.27395 2026-05-28 cs.CY cs.AI 版本更新

Informing AI Policy Assessment using Large-Scale Simulation of Interventions

利用大规模干预模拟为AI政策评估提供信息

Julia Barnett, Kimon Kieslich, Natali Helberger, Nicholas Diakopoulos

Affiliations: Northwestern University, USA; University of Amsterdam, The Netherlands & University of Hohenheim, Germany; University of Amsterdam, The Netherlands

AI summary: Proposes a method that combines participatory policy evaluation, expert cost assessment, and LLM-based assessment of harm mitigation, and uses a genetic-algorithm simulation to explore the space of policy combinations and identify viable options for mitigating specified AI harms.

Comments This work will be published in the proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT) 2026. 15 pages plus end matter and appendix

AI中文摘要

随着AI系统和危害的快速扩散推动全球AI治理努力，在竞争性政策选项中确定优先级对政策制定者和研究人员来说变得越来越具有挑战性。我们引入了一种方法来识别缓解特定AI危害的可行政策选项，帮助政策制定者和研究人员瞄准值得投入更多时间和资源的领域。该方法结合了政策的参与式评估、专家实施成本评估以及基于LLM的每种政策选项下感知危害缓解评估。我们利用基于遗传算法的模拟研究来探索潜在政策组合的巨大解空间，并考察在成本、参与式输入和危害缓解的不同权重下结果如何变化。我们发现该方法能够探索参与式组件和专家组件之间的不同平衡，使政策制定者和研究人员能够评估每个组件应分配多少权重。我们认为遗传算法发现的可行政策组合的多样性可以作为讨论的有用起点。该方法通过将参与式AI直接整合到实际政策开发流程中，实现了现有参与式AI工作的操作化。

英文摘要

As the rapid proliferation of AI systems and harms spurs efforts in AI governance around the world, prioritizing among competing policy options has become increasingly challenging for policymakers and researchers. We introduce a methodology for identifying viable policy options to mitigate specified AI harms, helping policymakers and researchers target areas that warrant greater time and resource investment. This method combines participatory evaluation of policies, expert assessment of implementation costs, and an LLM-based assessment of perceived harm mitigation under each policy option. We leverage a genetic algorithm-based simulation study to explore a vast solution space of potential policy combinations, and examine how outcomes change under different weightings of cost, participatory input, and harm mitigation. We find that this method enables exploration of different balances between participatory and expert components, allowing policymakers and researchers to assess how much weight to assign to each. We argue that the diversity of viable policy combinations found by the genetic algorithm could be a useful starting point for deliberation. This method operationalizes existing work on participatory AI by integrating it directly into practical policy development pipelines.
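
To make the search over policy combinations more concrete, here is a small genetic-algorithm sketch in Python. Everything in it is an assumption made for illustration: the policy names, their cost, support, and mitigation scores, the fitness formula, and the GA settings are invented and do not come from the paper.

```python
import random

# Hypothetical per-policy scores in [0, 1]; none of these come from the paper.
POLICIES = {
    "audits":        {"cost": 0.7, "support": 0.6, "mitigation": 0.8},
    "labelling":     {"cost": 0.2, "support": 0.8, "mitigation": 0.4},
    "licensing":     {"cost": 0.9, "support": 0.3, "mitigation": 0.7},
    "incident_logs": {"cost": 0.3, "support": 0.5, "mitigation": 0.5},
    "red_teaming":   {"cost": 0.5, "support": 0.4, "mitigation": 0.6},
}
NAMES = list(POLICIES)


def fitness(bits, w_cost, w_support, w_mitigation):
    """Weighted score of a policy combination encoded as a 0/1 vector."""
    chosen = [POLICIES[name] for name, bit in zip(NAMES, bits) if bit]
    if not chosen:
        return 0.0
    unmitigated = 1.0
    for policy in chosen:  # diminishing returns when stacking policies
        unmitigated *= 1.0 - policy["mitigation"]
    mitigation = 1.0 - unmitigated
    support = sum(p["support"] for p in chosen) / len(chosen)
    cost = sum(p["cost"] for p in chosen) / len(NAMES)
    return w_mitigation * mitigation + w_support * support - w_cost * cost


def evolve(weights, pop_size=30, generations=40, mutation=0.1, seed=0):
    """Return the best combination found for (w_cost, w_support, w_mitigation)."""
    rng = random.Random(seed)
    k = len(NAMES)

    def score(bits):
        return fitness(bits, *weights)

    pop = [[rng.randint(0, 1) for _ in range(k)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        nxt = [pop[0][:], pop[1][:]]  # elitism: keep the two best unchanged
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 3), key=score)  # tournament selection
            p2 = max(rng.sample(pop, 3), key=score)
            cut = rng.randint(1, k - 1)              # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < mutation else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=score)


best = evolve((1.0, 1.0, 1.0))
print([name for name, bit in zip(NAMES, best) if bit])
```

Re-running `evolve` with different weight triples is a simple way to see how the preferred combination shifts as cost, participatory support, or harm mitigation is weighted more heavily.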

2605.27394 2026-05-28 cs.CY cs.AI cs.HC cs.MA 版本更新

Human-AI Collaboration for Estimating Scientific Replicability

人机协作评估科学可复制性

Tatiana Chakravorti, Robert Fraleigh, Timothy Fritton, Christopher Griffin, Vaibhav Singh, Sai Koneru, C. Lee Giles, David Pennock, Anthony Kwasnica, Sarah Rajtmajer

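Affiliations: The Pennsylvania State University; Rutgers University

AI summary: Proposes a hybrid prediction market in which algorithmic agents trade alongside human participants in real time to jointly estimate the replicability of scientific findings. Experiments show that, in all but a few cases, hybrid markets match or outperform artificial (agent-only) prediction markets.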

AI中文摘要

确定已发表科学发现能否成功复制是实证科学中长期存在的挑战。现有的可复制性评估方法通常依赖于人类判断（即人类专家的创造性组合）或基于论文内容元数据训练的机器学习模型。虽然这两种方法都显示出价值，但各自也有重要局限性。人类预测可能受到认知偏差和对研究文献接触范围狭窄的影响，而自动评估往往难以捕捉上下文线索和微妙的可信度信号。在本文中，我们研究了一种混合方法。具体来说，我们引入了一个混合预测市场，其中算法代理与人类参与者一起交易，共同估计已发表科学发现通过受控复制研究结果得到证实的可能性。代理基于数百项先前复制研究的结果进行训练，而人类参与者通过实时交易贡献领域知识。我们通过涉及不同学科参与者的多个现场实验评估了这种混合方法，并将其性能与纯算法和纯人类基线进行了比较。我们的结果表明，除少数情况外，混合市场达到或超过了纯算法（人工智能体）预测市场，产生了更准确和可靠的复制预测。

英文摘要

Determining whether published scientific findings can successfully be replicated is a long-standing challenge in the empirical sciences. Existing approaches for replicability assessment typically rely either on human judgment, i.e., creative assembly of human experts, or on machine learning models trained on paper content metadata. While both approaches have demonstrated value, each also has important limitations. Human forecasts can be influenced by cognitive biases and narrow exposure to the research literature, while automated assessments often struggle to capture contextual cues and subtle signals of credibility. In this paper, we examine a hybrid approach. Specifically, we introduce a hybrid prediction market in which algorithmic agents trade alongside human participants to jointly estimate the likelihood that a published scientific finding will be corroborated via the outcome of a controlled replication study. Agents are trained on outcomes from hundreds of prior replication studies while human participants contribute domain knowledge through real-time trading. We evaluate this hybrid approach through multiple live experiments involving participants from different academic disciplines and compare its performance to artificial-only and human-only baselines. Our results show that, except for a few cases, hybrid markets match or outperform artificial prediction markets, producing more accurate and reliable replication forecasts.

2605.27393 2026-05-28 cs.CL cs.AI 版本更新

StoryMI: Steerable Multi-Agent Therapeutic Dialogue Generation

StoryMI: 可控的多智能体治疗性对话生成

Qingyu Meng, Min Chen, Dingming Liu, Yifan Mo, Yue Su, Xin Sun, Koen Hindriks, Jiahuan Pei

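Affiliations: Vrije Universiteit Amsterdam; NII, Tokyo Institute of Technology

AI summary: Proposes StoryMI, a framework in which multiple LLM agents collaborate, with situational story grounding and dynamic strategy control, to generate therapeutic dialogues that follow motivational interviewing standards. An evaluation protocol and a dataset are built to validate it.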

Comments ACL2026

AI中文摘要

大型语言模型（LLM）可以生成流畅的对话，但先前的工作缺乏情境基础、动态策略控制以及与动机性访谈（MI）临床标准对齐的评估。我们引入了StoryMI，一个用于可控MI对话生成的多LLM智能体框架，其中基于问卷的客户档案被扩展为情境故事，为对话提供叙事背景。治疗师和客户智能体生成由交互智能体选择的MI代码引导的MI编码话语，而交互智能体动态协调交换以在多次轮对话中控制MI策略。我们提出了一个两级评估协议：词汇指标和宏观层面咨询策略的MI特定度量，以及LLM作为评判者和人类专家评估。我们构建了一个包含6K模拟MI对话的数据集，基于1K问卷-故事对，涵盖12个MI代码和13个症状领域，并对六个开源和闭源LLM进行了基准测试。我们的结果表明，情境基础和宏观层面控制可以提高MI依从性和临床合理性，展示了结构化多智能体工作流在心理治疗对话生成中的有效性。我们提供代码和数据以促进可重复性。

英文摘要

Large language models (LLMs) can generate fluent dialogue, but prior work lacks situational grounding, dynamic strategy control, and evaluation aligned with clinical standards in motivational interviewing (MI). We introduce StoryMI, a multi-LLM agent framework for controllable MI dialogue generation, where questionnaire-based client profiles are expanded into situational stories that provide narrative context for the dialogue. Therapist and client agents generate MI-coded utterances guided by MI codes selected by an interaction agent, which dynamically coordinates exchanges to control MI strategies during a multi-turn conversation. We propose a two-level evaluation protocol: lexical metrics and MI-specific measures of macro-level counseling strategies, alongside LLM-as-judge and human expert assessments. We construct a dataset of 6K simulated MI dialogues grounded in 1K questionnaire-story pairs, covering 12 MI codes and 13 symptom domains, and benchmark six open- and closed-source LLMs. Our results show that situational grounding and macro-level control can improve MI adherence and clinical plausibility, demonstrating the effectiveness of a structured multi-agent workflow for psychotherapy dialogue generation. We provide code and data for reproducibility.

2605.27391 2026-05-28 cs.CY cs.AI 版本更新

Learning after COVID-19 and the ICT career aspirations: Are students entering the AI era with weaker skills?

COVID-19后的学习与ICT职业抱负：学生是否以更弱的技能进入AI时代？

Diana Maria Popa, Simona-Vasilica Oprea, Adela Bâra

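Affiliations: Department of Economic Informatics and Cybernetics, Bucharest University of Economic Studies

AI summary: Based on PISA 2018 and 2022 data, a mixed-method analysis of the relationship between learning environments and ICT career aspirations finds that digital skills are the strongest predictor, teacher support plays a complementary role, and autonomy has weaker, context-dependent effects.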

AI中文摘要

本文考察学生是否以足够强大的教育基础进入生成式AI时代，重点关注学习环境与各国ICT相关职业抱负变化之间的关系。分析使用PISA 2018和2022的国家级数据，结合学生自主性、数字技能和教师支持的指标。采用混合方法，包括描述性统计、回归分析、聚类、潜在表示学习（使用变分自编码器VAE）、判别分析和概率建模，以捕捉教育准备的可观察和潜在维度。与以往将学习损失、数字技能和职业期望分开处理的研究不同，我们的分析将它们整合在一个比较纵向框架内。研究焦点从短期疫情后效应转向教育系统为学生准备数字和AI驱动劳动力市场的结构能力。结果显示，全球范围内ICT职业抱负有所增加但不均衡。数字技能成为最强且最一致的预测因素，而教师支持起补充作用。自主性表现出较弱且依赖情境的影响。教育准备是多维度的，ICT抱负相对独立于其他职业领域而演变。

英文摘要

This paper examines whether students are entering the generative AI era with sufficiently strong educational foundations, focusing on the relationship between learning environments and changes in ICT-related career aspirations across countries. The analysis uses country-level data from PISA 2018 and 2022, combining indicators of student autonomy, digital skills and teacher support. A mixed-method approach is applied, including descriptive statistics, regression analysis, clustering, latent representation learning (using a variational autoencoder, VAE), discriminant analysis and probabilistic modeling to capture both observable and latent dimensions of educational readiness. Unlike prior research that treats learning loss, digital skills and career expectations separately, our analysis integrates them within a comparative longitudinal framework. It shifts the focus from short-term post-pandemic effects to the structural capacity of education systems to prepare students for digital and AI-driven labor markets. Results show a global but uneven increase in ICT career aspirations. Digital skills emerge as the strongest and most consistent predictor, while teacher support plays a complementary role. Autonomy shows weaker, context-dependent effects. Educational readiness is multidimensional, and ICT aspirations evolve relatively independently from other career domains.

2605.27389 2026-05-28 cs.IR cs.AI cs.CL 版本更新

Memory-Based vs. Context-Only Conditioning Produces Distinct Behavioral Patterns in Stateful Personalization

基于记忆 vs. 仅上下文条件化在有状态个性化中产生不同的行为模式

Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

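Affiliations: Georgia Institute of Technology

AI summary: Comparing context-only and memory-based conditioning in a teacher-facing educational recommender system, the study finds that context-only recommendations respond more strongly to the current question, while memory-based recommendations show history-dependent behavior, including learner-specific divergence under identical inputs.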

Comments Accepted to ITS 2026

2605.27388 2026-05-28 cs.CL cs.AI cs.SI 版本更新

Modeling Community Attitude through Reaction Tone: A Human-AI Collaborative Framework for Evaluating LLM Alignment with Linguistic Behaviors in Online Communities

通过反应语气建模社区态度：评估LLM与在线社区语言行为对齐的人机协作框架

Nuan Wen, Xuezhe Ma

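Affiliations: Information Sciences Institute, University of Southern California

AI summary: Proposes the CARE framework, which uses fine-grained analysis of illocutionary tone to evaluate how well LLMs simulate community reactions to real news, and reveals a "realism gap" indicating that current alignment strategies do not capture the sociolinguistic dynamics of online groups.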

Comments Preprint

AI中文摘要

大型语言模型（LLM）越来越多地被用作计算社会分析的代理；然而，它们忠实再现人类社区“厚描述”（Geertz, 1973）的能力仍然是一个关键挑战。当前的评估通常将社会身份简化为静态标签，忽视了现实群体如何应对社会变迁。为弥合这一差距，我们引入了CARE（社区感知反应评估），一个以反应为中心的框架，将LLM模拟的话语与不同社区对真实新闻的真实、事件相关的反应进行基准测试。通过刻画细粒度的言语气势谱及其所体现的潜在态度——通过人机协作验证——我们的诊断揭示了一个持续的“现实主义差距”：使用明确的社区提示引导LLM并不能固有地提高模拟保真度。进一步分析识别了前沿模型之间的不同行为特征，表明当前的对齐策略仍不足以捕捉在线群体的社会语言动态。

英文摘要

Large language models (LLMs) are increasingly utilized as proxies for computational social analysis; yet, their ability to faithfully represent the "thick descriptions" (Geertz, 1973) of human communities remains a critical challenge. Current evaluations often reduce social identity to static labels, sidelining how real-world groups navigate social shifts. To bridge this gap, we introduce CARE (Community-Aware Reaction Evaluation), a reaction-centered framework that benchmarks LLM-simulated discourse against the authentic, event-contingent responses of distinct communities to real-world news. By characterizing a fine-grained spectrum of illocutionary tones and the underlying attitudes they manifest--validated through human-AI collaboration--our diagnosis reveals a persistent "realism gap": steering LLMs with explicit community prompts fails to inherently improve simulation fidelity. Analysis further identifies divergent behavioral signatures among frontier models, suggesting that current alignment strategies remain insufficient for capturing the sociolinguistic dynamics of online groups.

2605.27385 2026-05-28 cs.LG cs.AI 版本更新

Personalized Observation Normalization for Federated Reinforcement Learning in Simulation Environments with Heterogeneity

异构仿真环境中联邦强化学习的个性化观测归一化

Yiran Pang, Zhen Ni, Xiangnan Zhong

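Affiliations: Department of Electrical Engineering & Computer Science, Florida Atlantic University, Boca Raton, FL, USA

AI summary: To address the inconsistent input distributions and imbalanced parameter updates caused by differing state-transition dynamics in heterogeneous federated reinforcement learning, proposes personalized observation normalization, in which each agent locally maintains a running mean and variance to normalize raw state inputs, accelerating training and improving performance.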

Comments Accepted at the International Joint Conference on Neural Networks (IJCNN) 2025

DOI: 10.1109/IJCNN64981.2025.11229364

Abstract

Federated reinforcement learning (FedRL) enables multiple agents to collaboratively train a global policy without sharing raw data, making it ideal for privacy-sensitive applications. However, FedRL faces challenges in heterogeneous environments where differing state-transition dynamics lead to non-identical input distributions and imbalanced parameter updates during aggregation. Therefore, this paper develops a personalized observation normalization (PON) method, allowing each agent to locally normalize raw state inputs using a continuously updated running mean and variance. This design ensures consistent scaling of local features, so that no agent's features overshadow those of other agents during aggregation. Furthermore, we demonstrate that sharing normalization parameters across agents is ineffective due to the diverse local input distributions, which highlights the necessity of personalized statistics. Experiments on heterogeneous MuJoCo tasks show that our developed PON accelerates training and achieves superior performance compared to baseline methods.
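
The per-agent normalization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `RunningObsNormalizer`, the use of Welford's online update, and the `eps` constant are all assumptions made for illustration. Each agent keeps its own statistics, which are never shared or aggregated.

```python
import math


class RunningObsNormalizer:
    """Hypothetical per-agent normalizer with a running mean and variance."""

    def __init__(self, obs_dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * obs_dim
        self.m2 = [0.0] * obs_dim  # running sum of squared deviations
        self.eps = eps             # assumed constant to avoid division by zero

    def update(self, obs):
        # Welford's online update, applied per observation dimension
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        n = max(self.count, 1)
        return [(x - m) / math.sqrt(s / n + self.eps)
                for x, m, s in zip(obs, self.mean, self.m2)]


# Two agents whose environments produce observations on very different scales
agent_a, agent_b = RunningObsNormalizer(1), RunningObsNormalizer(1)
for x in [1.0, 2.0, 3.0]:
    agent_a.update([x])
for x in [100.0, 200.0, 300.0]:
    agent_b.update([x])
```

With local statistics, `agent_a.normalize([3.0])` and `agent_b.normalize([300.0])` land on nearly the same scaled value, which is the kind of consistent local scaling the abstract refers to.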

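2605.27384 2026-05-28 cs.HC cs.AI cs.CL (updated version)

From Instructor to Collaborator: What a 90-Participant Study Reveals about Human-Agent Collaboration in a Mobile Serious Game

Danai Korre

Affiliation: University of Bedfordshire

AI summary: A comparative study with 90 participants examines user preference between a highly human-like spoken embodied agent and a low human-like text agent in a mobile serious game. It finds a significant preference for the highly human-like agent and discusses how roles, mixed-initiative dialogue, and breakdown/repair affect human-agent collaboration in goal-oriented tasks.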

Comments 4 pages, 5 figures, ACM CHI 2026 workshop paper

Abstract

This position paper reflects on empirical data collected during my PhD from a large-scale within-subjects study (N = 90). The study compared a highly human-like, spoken embodied conversational agent (ECA) against a low human-like text-based agent (no embodiment, text bubble only) within a mobile, Unity-developed game about pre-decimal UK currency. The game included two agents with different roles: an Instructor (Alex) and a Shopkeeper/Collaborator. Users interacted using voice and mouse input. The quantitative data I collected included a usability questionnaire (CCIR MINERVA) and the Agent Persona Instrument. Data were analyzed using paired t-tests, repeated-measures ANOVA, and multiple linear regression to identify correlations between the persona and usability. The results showed a statistically significant preference for the version with highly human-like agents, with a large effect size. This is further discussed alongside qualitative findings from observations and exit interviews. The results are framed for Human-Agent collaboration, especially for how roles, mixed-initiative dialogue, and breakdowns/repairs become apparent in goal-oriented tasks. I conclude with questions on timing, user expectations, and role-specific interactions. This submission does not propose new frameworks; it reports empirical findings and questions I hope to workshop with the community.

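2605.27383 2026-05-28 cs.CL cs.AI (updated version)

Unlocking Fine-Grained and Intra-Utterance Speaking Style Control in Prompt-Based Text-to-Speech Models

Jaehoon Kang, Yejin Lee, Yoonji Park, Kyuhong Shim

Affiliations: Department of Artificial Intelligence, Sungkyunkwan University, Korea; Department of Computer Science and Engineering, Sungkyunkwan University, Korea

AI summary: To address the lack of fine-grained control and intra-utterance style variation in prompt-based TTS models, the paper proposes inter-utterance style interpolation and intra-utterance style transition, achieving smooth style control through direction-vector interpolation in the embedding space, KV-cache swapping, and sliding-window attention masking.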

Abstract

While prompt-based text-to-speech (TTS) models enable natural language-driven speaking style control, they often provide limited fine-grained control and apply a single global style across an utterance. This restricts practical use cases that require continuous style attribute interpolation across utterances and time-varying style transitions within a single utterance. In this paper, we propose novel techniques to achieve both capabilities in existing prompt-based TTS models. For inter-utterance style interpolation, we compute direction vectors between contrastive style prompts in the embedding space and perform simple interpolation, enabling smooth transitions between style characteristics. For intra-utterance style transition, we first identify a strong attention bias toward early tokens in autoregressive TTS decoders, causing the initial audio realization to dominate subsequent generation. To mitigate this effect, we introduce KV-cache swapping and sliding-window attention masking. Experiments demonstrate that our proposed inter-utterance interpolation achieves a 99-100% success rate in gender conversion, up to 36 Hz pitch variation, and up to 1.6 syllables-per-second speed change. Our intra-utterance transition maintains a speaker similarity of 0.81-0.91 and achieves perceptual smoothness scores of 3.48-4.48.
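
Two of the ingredients named in the abstract, direction-vector interpolation between style-prompt embeddings and sliding-window attention masking, can be sketched in a few lines. This is a rough illustration rather than the paper's method: the function names, the toy embedding values, and the exact mask convention (causal, fixed window) are assumptions, and KV-cache swapping is not shown.

```python
def interpolate_style(emb_a, emb_b, alpha):
    """Move from style embedding A toward B along their difference vector.

    alpha = 0 returns A, alpha = 1 returns B; values in between blend the two.
    """
    direction = [b - a for a, b in zip(emb_a, emb_b)]
    return [a + alpha * d for a, d in zip(emb_a, direction)]


def sliding_window_causal_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j.

    Each position sees only itself and the previous window - 1 positions,
    so early tokens drop out of view as decoding proceeds.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


# Toy embeddings standing in for two contrastive style prompts (assumed values)
low_pitch = [0.0, 0.0]
high_pitch = [2.0, 4.0]
halfway = interpolate_style(low_pitch, high_pitch, 0.5)
mask = sliding_window_causal_mask(seq_len=4, window=2)
```

Sweeping `alpha` gives a continuous control knob, while the windowed mask limits how much the earliest tokens can influence later positions.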

2605.27373 2026-05-28 cs.AI cs.CL cs.CY (updated version)

Identifying and Understanding Human Values in Text: A Tailorable LLM-based Architecture

Eduardo de la Cruz Fernández, Marcelo Karanik, Sascha Ossowski

Affiliations: Universidad Politécnica de Madrid; CETINIA, Universidad Rey Juan Carlos

AI summary: Proposes a tailorable LLM-based architecture that detects the intensity of human values in text through three modules (specification generation, text labeling, and intensity assessment), avoiding dependence on a specific value theory or complex prompt engineering; experiments show good detection performance.

Comments 8 pages, 1 figure. Published in Proceedings of the 18th International Conference on Agents and Artificial Intelligence (ICAART 2026), Volume 5

DOI: 10.5220/0014273200004052
Journal ref: Proc. ICAART 2026, Vol. 5, SciTePress, 2026, pp. 4096-4103

Abstract

As intelligent systems become more autonomous, the scientific community focuses on creating decision-making mechanisms that include ethical and moral considerations, unlike traditional utility-maximisation models. To achieve this, a key aspect is assessing how well these decisions align with human values. To this end, a promising line of research is centred on developing approaches based on Large Language Models (LLMs) to identify human values from text, whether explicit or implicit, enabling their recognition throughout. This paper introduces an LLM-based architecture to detect and quantify the intensity of human values in text, avoiding the limitations of previous approaches tied to a specific value theory or complex prompt engineering. The architecture comprises three coordinated modules: one that generates structured value specifications from the foundational texts of any theoretical framework; one that labels texts using these specifications; and one that assigns graded support or resistance based on rhetorical and semantic evidence. This modular approach separates the tasks of conceptualising from detecting human values, creating a scalable and reproducible process driven by value specifications adaptable to various theories. The architecture was instantiated with multiple LLMs and evaluated using the ValueEval dataset. The experiments demonstrate good detection performance, confirming the generality of the pipeline.

2605.27365 2026-05-28 cs.CV cs.AI cs.LG cs.RO (updated version)

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Shihao Wang, Shilong Liu, Yuanguo Kuang, Xinyu Wei, Yangzhou Liu, Zhiqi Li, Yunze Man, Guo Chen, Andrew Tao, Guilin Liu, Jan Kautz, Lei Zhang, Zhiding Yu

Affiliations: The Hong Kong Polytechnic University; Princeton University; Nanjing University; University of Illinois Urbana-Champaign

AI summary: Proposes Parallel Box Decoding (PBD), which decodes bounding boxes and points as atomic units in a single step. Combined with the large-scale LocateAnything-Data dataset, it enables efficient, unified grounding and detection, substantially raising decoding throughput while maintaining high accuracy.

Comments fix github link

Abstract

Vision-language models (VLMs) commonly formulate visual grounding and detection as a coordinate-token generation problem, serializing each 2D box into multiple 1D tokens that are learned and decoded largely independently. This token-by-token decoding mismatches the coupled structure of box geometry and creates a practical inference bottleneck due to strictly sequential generation. We introduce LocateAnything, a unified generative grounding and detection framework based on Parallel Box Decoding (PBD). By decoding geometric elements such as bounding boxes and points as atomic units in a single step, LocateAnything preserves intra-box geometric coherence and unlocks substantial parallelism. We show that PBD improves both decoding throughput and localization accuracy. We further develop a scalable data engine and curate LocateAnything-Data, a large-scale dataset with more than 138 million training samples, substantially increasing data diversity for high-precision localization. Extensive evaluations show that LocateAnything advances the speed-accuracy frontier, achieving significantly higher decoding throughput while improving high-IoU localization quality across diverse benchmarks. The results highlight the complementary benefits of Parallel Box Decoding and large-scale training data in enabling efficient and precise unified visual grounding and detection.

2605.27348 2026-05-28 cs.CV cs.AI (updated version)

When Eyes Betray AI: Social Gaze Consistency as a Semantic Cue for AI-Generated Image Detection

Jihyeon Kim, Sohee Kim, Soosan Lee, Souhwan Jung, James Matthew Rehg, Hyesong Choi

Affiliations: School of Computer Engineering; Hoseo University; School of Electronic Engineering; Soongsil University; School of Computer Science; University of Illinois at Urbana-Champaign

AI summary: Proposes Social Gaze Consistency as a high-level semantic cue and, through a diagnostic dataset, Block-Compositional Caption Supervision, and cross-architecture validation, shows that the cue is effective for detecting AI-generated images and explains why it transfers across generators.

Comments 23 pages, 2 figures, 17 tables

Abstract

Recent generative models have largely closed the gap on low-level artifacts - pixel fingerprints, frequency anomalies, upsampling traces - particularly in person-centric and partial-edit settings where the manipulated region is small and surrounded by photometrically authentic content. We introduce Social Gaze Consistency, a high-level semantic cue defined as the mutual coherence of gaze direction, head-eye alignment, and pupil placement between interacting individuals, and show that it constitutes a previously underutilized detection axis orthogonal to existing low-level paradigms. We instantiate this insight through three coupled mechanisms: (i) a controlled diagnostic dataset with region-specific perturbations of gaze-consistent imagery, where strict pair-level grouping forecloses generator-fingerprint memorization as an optimization-time shortcut rather than relying on augmentation; (ii) Block-Compositional Caption Supervision, which holds a single 5-block reasoning skeleton invariant across 1,250 macro-combined captions, decoupling reasoning consistency from surface diversity; (iii) Cross-architecture validation showing the same supervision improves a vision-language backbone (FakeVLM) by +3.7 pp on the COCOAI Interaction subset (balanced accuracy 67.8 -> 71.5) and +1.3 pp on the COCOAI Person subset (83.0 -> 84.3), with consistent gains on a vision-only backbone (Effort), evidencing a backbone-agnostic cue. Real- and fake-class recalls rise simultaneously, ruling out a "predict-all-fake" artifact. A four-step mechanistic account - paired-edit shortcut blocking, hard-to-easy difficulty transfer, CLIP prior preservation, and diffusion-family shared spectral weakness in periocular structure - explains why training on a single inpainter (FLUX.1-Fill) transfers to multi-generator suites. We will release the code upon acceptance to facilitate reproducibility.
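
The abstract reports balanced accuracy and notes that both per-class recalls rise, which rules out a "predict-all-fake" artifact. A small sketch shows why that check matters, assuming the standard definition of balanced accuracy as the mean of per-class recalls; the function name and the toy labels are illustrative and not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed standard definition)."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Toy imbalanced set: 2 real images, 8 fake images (assumed data)
y_true = ["real"] * 2 + ["fake"] * 8
all_fake = ["fake"] * 10  # a degenerate detector that always says "fake"

plain_acc = sum(t == p for t, p in zip(y_true, all_fake)) / len(y_true)
bal_acc = balanced_accuracy(y_true, all_fake)
```

On this toy set the degenerate detector looks strong on plain accuracy but sits at chance on balanced accuracy, so a gain in balanced accuracy with both recalls rising cannot come from simply predicting "fake" more often.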

2605.27258 2026-05-28 cs.SD cs.AI (updated version)

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang, Mingli Jin, Yihang Lin, Jiahui Zhao, Weibo Xiong, Dongrui Zhang, Keming Chen, Yunze Gao, Zeyang Lin, Yuze Zhou, Yue Liu

Affiliations: Amap, Alibaba Group; The Chinese University of Hong Kong, Shenzhen

AI summary: Presents PilotTTS, a lightweight autoregressive TTS system that reaches competitive performance through a minimalist architecture and rigorous data engineering (only 200K hours of data processed with open-source tools). It supports zero-shot voice cloning and emotion, paralinguistic, and dialect synthesis, and achieves the lowest WER and highest speaker similarity on the Seed-TTS Eval benchmark.

Abstract

Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.

2605.27155 2026-05-28 cs.CV cs.AI (updated version)

Semantic Robustness Probing via Inpainting: An Interactive Tool for Safety-Critical Object Detection

Nico Steckhan, Krutarth Prajapati, Weija Shao, Silvia Vock

Affiliations: Federal Institute for Occupational Safety and Health (BAuA); Fraunhofer Institute for Manufacturing Engineering and Automation IPA

AI summary: Proposes SemProbe, a tool that generates semantic probes via controllable diffusion-model inpainting, supports user-defined masks and factors, and automatically evaluates and records changes in the robustness of object detection models.

2605.26902 2026-05-28 cs.IR cs.AI (updated version)

ICICLE: Expanding Retrieval with In-Context Documents

Yu-Chen Den, Yung-Yu Shih, Zhi Rui Tam, Kuan-Yu Chen, Pu-Jen Cheng, Yun-Nung Chen, Eugene Yang

Affiliations: National Taiwan University; Johns Hopkins University

AI summary: Proposes ICICLE, a framework that achieves incremental generative retrieval by generating docids from in-context documents, avoiding retraining and catastrophic forgetting.

Abstract

Generative retrieval (GR) maps queries directly to document identifiers (docids) using parametric knowledge. However, this design makes corpus expansion costly: adding new documents requires updating model parameters to encode new document-docid associations, which incurs repeated training and catastrophic forgetting of previously indexed documents. In this work, we revisit incremental GR as an in-context retrieval problem, where newly added documents are supplied as inference-time document-docid evidence. We propose ICICLE, an in-context indexing framework that performs source-aware docid generation over both parametric memory and context-provided document-docid pairs. ICICLE combines a `[COPY]`-based routing mechanism, preference-based calibration, and large context adaptation to distinguish context-grounded retrieval from parametric retrieval. Experiments on MS MARCO and NQ320K show that ICICLE improves retrieval of newly introduced documents while preserving seen-document retention without corpus-specific retraining. Our analysis further shows that high-shot degradation is mainly caused by routing failure, highlighting source-selection calibration as a key bottleneck for scaling in-context generative retrieval.

2605.26552 2026-05-28 cs.LG cs.AI (updated version)

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

Jaewoo Lee, Hyeongyu Kang, Dohyun Kim, Kyuil Sim, Woocheol Shin, Minsu Kim, Taeyoung Yun, Jeongjae Lee, Sanghyeok Choi, Tabitha Edith Lee, Jong Chul Ye, Jinkyoo Park

Affiliations: KAIST; MongooseAI; Mila – Quebec AI Institute; University of Edinburgh; Université de Montréal; Omelet

AI summary: Proposes FAV, a framework that aligns few-step generative models by using Stein Variational Gradient Descent for sample-based variational inference and amortizing the particle updates into the generator parameters via fixed-point regression; it outperforms existing methods on robotic manipulation and image generation tasks.

Comments Under review

Abstract

Aligning a few-step generative model is challenging, since existing alignment frameworks typically rely on restrictive assumptions: a tractable likelihood, a specific ODE/SDE solver, or a particular model family. We introduce FAV, Few-step Generative Models Alignment via Sample-based Variational Inference, a general alignment framework that requires only sample access to the generator and the reference distribution. We cast alignment as sampling from a reward-tilted distribution anchored to a reference distribution. We leverage Stein Variational Gradient Descent as a sample-based variational inference scheme and amortize its particle updates into the generator parameters via fixed-point regression. We evaluate FAV on two domains: robotic manipulation and image generator alignment. On generative policy alignment for robotic manipulation, FAV outperforms prevailing policy extraction baselines across 56 offline and 30 offline-to-online RL tasks. For image generator alignment, FAV fine-tunes diverse few-step backbones, including GANs, drifting models, consistency models, and flow maps, scaling from ImageNet-256 to $1024^2$ text-to-image synthesis. Code is available at https://github.com/Jaewoopudding/FAV.
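
For readers unfamiliar with Stein Variational Gradient Descent, the following is a minimal sketch of a generic SVGD step, for illustration only. It assumes a one-dimensional toy target (a unit-variance Gaussian) whose score is known in closed form, a fixed-bandwidth RBF kernel, and hypothetical names (`svgd_step`, `grad_log_p`). The abstract's sample-only setting and the fixed-point regression that amortizes the updates into a generator are not shown.

```python
import math


def svgd_step(particles, grad_log_p, step=0.1, bandwidth=1.0):
    """One SVGD update on 1-D particles with an RBF kernel (toy version)."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2 * bandwidth ** 2))
            attract = k * grad_log_p(xj)             # pulls toward high density
            repel = -(xj - xi) / bandwidth ** 2 * k  # kernel gradient w.r.t. xj
            phi += attract + repel
        updated.append(xi + step * phi / n)
    return updated


# Toy target: Gaussian with mean 2 and unit variance (assumed for illustration)
target_mean = 2.0


def grad_log_p(x):
    return -(x - target_mean)


particles = [-1.0, -0.5, 0.0, 0.5, 1.0]
for _ in range(500):
    particles = svgd_step(particles, grad_log_p)
```

The attraction term moves particles toward high-density regions of the target, while the repulsion term keeps them from collapsing onto a single mode.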

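2605.26189 2026-05-28 cs.LG cs.AI (updated version)

Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

Qiming Ye, Peixain Zhang, Yupeng He, Zifan Peng, Gareth Tyson

Affiliation: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: By analyzing 1.5M assets and 128K agents in the EvoMap network, the study reveals the trade-offs its design choices introduce in reusability, evolution, and auditability: the reward mechanism leaves 98% of assets never reused, the scoring system is easy to manipulate, and the validation mechanism is flawed.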

Abstract

Agent-to-Agent (A2A) networks enable autonomous AI agents to collaborate by sharing reusable problem-solving instructions. However, how these decentralized ecosystems operate in practice remains largely unexplored. We present the first large-scale empirical study of EvoMap, a prominent A2A collaboration network. By analyzing over 1.5M assets and 128K agents, we show how design choices that prioritize scalable growth introduce trade-offs in reusability, evolution, and auditability. First, EvoMap's credit economy rewards agents for publishing valuable assets. Although this design encourages participation at scale, rewards are tied primarily to publication rather than adoption. This leads agents to mass-produce assets to accumulate credits. As a result, 98% of assets are never reused, while rewards become highly concentrated among a small fraction of agents. Second, EvoMap employs an algorithm (referred to as GDI) to score and rank the quality of these shared assets. We demonstrate that this scoring system is flawed: rather than measuring objective performance, an asset's rank is heavily dictated by unverified, self-reported metadata (e.g., claimed lines of code modified). This allows agents to trivially manipulate their assets' scores. Finally, EvoMap relies on agents to provide local execution logs as evidence that uploaded assets function correctly. Because these validations are not independently verified, over 84% of approved assets bypass quality checks using vacuous tests (e.g., console.log()). Our findings show that future A2A collaboration networks cannot rely on unverified self-reporting alone. Scalable collaboration requires mechanisms that balance open participation with verifiable execution and trustworthy evaluation.

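2605.25378 2026-05-28 cs.CV cs.AI (updated version)

Knowledge-Graph-Driven Expert-Level Reasoning in Neuroscience

Jake Stephen, Niraj K. Jha

Affiliation: Department of Electrical and Computer Engineering, Princeton University

AI summary: Builds a knowledge graph from a single textbook, generates question-answer supervision from it, and fine-tunes a language model to achieve expert-level neuroscience reasoning that surpasses large language models.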

Abstract

A knowledge graph (KG) is an abstraction that can be extracted from text corpora and used for in-depth reasoning. Prior work has leveraged KGs to fine-tune language models (LMs), enabling domain-specific superintelligence. In this work, we explore whether KG-driven in-depth reasoning capabilities can emerge in neuroscience using only information contained within a single authoritative textbook. The central hypothesis is that structured knowledge, when distilled into a high-quality KG and converted into KG-grounded question-answer (QA) supervision, is sufficient to produce expert-level reasoning through a fine-tuned LM that surpasses large language models (LLMs) in accuracy, while employing orders of magnitude fewer parameters. We construct a textbook-derived KG via a dual-LLM validation pipeline, expand it with a masked LM trained on the KG topology, generate multi-hop QA items, which include QA pairs and reasoning traces, to fine-tune an LM exclusively on KG-derived supervision, and apply reinforcement learning using path-derived KG signals as implicit reward models. Our results demonstrate that deep, mechanistic neuroscience understanding can be induced in the model without reliance on large, heterogeneous web-scale corpora. The KG-based synthetic neuroscience curriculum, on which readers can quiz themselves, and the fine-tuned LM are available at the following GitHub location: https://kg-bottom-up-superintelligence.github.io/neuro-bench.

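2605.23908 2026-05-28 cs.AI cs.CL cs.CV cs.NE (updated version)

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models

Sam Earle, Kai Arulkumaran, Andrew Dai, Akarsh Kumar, Julian Togelius, Sebastian Risi

Affiliations: New York University; Massachusetts Institute of Technology

AI summary: Replicates Picbreeder with frontier vision-language models in place of human users to explore whether AI is capable of open-ended, unguided discovery. It analyzes how the system's output differs from the human baseline in phylogenetic complexity, visual and semantic salience, and novelty, and studies the effects of exploratory noise, behavioral diversity, and narrative momentum.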

Comments 26 pages, 21 figures, to be published at GECCO 2026

Abstract

We are in the midst of large-scale industrial and academic efforts to automate the processes of scientific, technological and creative production through AI-driven assistants. Historically, a fundamental property of these processes in their human form has been their open-endedness: their capacity for generating a seemingly endless supply of novel and meaningful new forms. Do artificial agents have any capacity for such fruitful unguided discovery? To answer this question, we turn to Picbreeder, the canonical exemplar of human-driven open-ended search, in which users collaboratively generated a diverse library of images through interactive evolution of small neural networks. We replicate Picbreeder, replacing human users with frontier Vision Language Models (VLMs). We observe clear qualitative differences between the output of our system and the historical human baseline, and attempt to characterize them using metrics of phylogenetic complexity and visual and semantic salience and novelty. In an effort to identify some of the causal factors contributing to these differences, we study the addition of exploratory noise to the agents' selection process, of behavioral diversity between agents, and of narrative momentum in the form of memory of past actions. We make our code available at https://github.com/smearle/picbreeder-vlm.

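2605.22547 2026-05-28 cs.CV cs.AI (updated version)

FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

Xinhang Yuan, Zexi Huang, Anjia Cao, Xudong Lu, Zikai Wang, Penghao Zhou, Chang Liu, Wentao Guo, Qinglei Wang

Affiliations: TikTok; ByteDance

AI summary: To address the ID cold-start problem in livestreaming recommendation, the paper proposes FLUID, which uses a cross-domain multimodal encoder to generate hierarchical semantic codes (LUCID) that replace candidate-side IDs and adopts a staged warmup scheme, delivering significant gains on an industrial-scale system.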

Abstract

Modern recommender systems rely heavily on ID-based collaborative filtering: each item is represented by a unique ID embedding that accumulates collaborative signals from user interactions. Livestreaming recommendation, however, faces a unique challenge in this paradigm: a live room typically broadcasts for only tens of minutes, so its item ID remains poorly learned in a persistent cold-start state and ID-centric ranking models fail to generalize. We present FLUID, the first framework to fully retire the candidate-side item ID from a production-scale livestreaming ranker. FLUID introduces a cross-domain multimodal encoder, jointly trained on short videos and livestreams, to produce discrete hierarchical semantic codes, called LUCID, for content-based item characterization. To adapt the ranker to LUCID, FLUID further employs a staged warmup scheme: it first incorporates cold, slice-level LUCID as an independent token alongside the ID embedding, and then replaces the ID embedding with warm, room-level LUCID before online incremental training. Deployed on our industrial livestreaming recommenders with a cross-platform combined user base of over one billion globally, FLUID delivers significant online gains of +0.55% Quality Watch Duration, +2.05% Cold-Start Room Views, and +0.05% Active Hours.

2605.21743 2026-05-28 cs.AI econ.GN q-fin.EC (updated version)

Who Uses AI? Platform Selection and the Measurement of Occupational AI Exposure

Michelle Yin, Burhan Ogut

Affiliations: School of Education and Social Policy, Northwestern University; American Institutes for Research

AI summary: By analyzing conversation logs from AI platforms, the paper shows that the composition of a platform's user base biases measurements of occupational AI exposure, and proposes a labor-force-weighted partial identification method to correct the estimates.

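2605.16578 2026-05-28 cs.SD cs.AI cs.HC cs.LG (updated version)

Detecting and Mitigating the Correct-Answer Extinction Window Caused by Majority Voting in Test-Time Reinforcement Learning

Hongxiang Lin, Zhirui Kuai, Erpeng Xue, Lei Wang

Affiliation: Meituan

AI summary: Proposes TTRL-Guard, a framework that uses three mechanisms (flip-rate-aware reward scaling, minority-preserving sampling, and risk-conditioned sparse updates) to detect and mitigate the permanent suppression of correct-answer signals caused by majority voting in test-time reinforcement learning.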

Abstract

Test-time reinforcement learning (TTRL) reports substantial accuracy gains on mathematical reasoning benchmarks using majority vote as a pseudo-label signal. We argue these gains are systematically misinterpreted: most reflect sharpening of already-solvable problems rather than genuine learning, while problems corrupted from correct to incorrect outnumber truly learned ones, and this damage is irreversible once majority vote locks onto a wrong answer. Per-problem tracking reveals that correct-answer signals in low-ability problems are briefly active before being permanently suppressed, a phenomenon we term the Correct-Answer Extinction Window, with Flip Rate (FR) as its leading indicator. We thus propose TTRL-Guard, a lightweight framework with three mechanisms targeting the extinction window: Flip-Rate-Aware Reward Scaling (FRS) down-weights at-risk updates as FR declines, Minority-Preserving Sampling (MPS) retains gradient signal from minority correct answers, and Risk-Conditioned Sparse Updatings (RCSU) suspends updates on polarized problems. Experiments across three models and four benchmarks show that TTRL-Guard achieves the best average pass@1 on Qwen2.5-7B-Instruct and Qwen3-4B, and improves over TTRL by a relative +54% on AIME 2025.
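
The failure mode described in the abstract follows from how a majority-vote pseudo-label assigns rewards. The sketch below is a simplified illustration, not the TTRL or TTRL-Guard implementation: the function name, the 0/1 reward scheme, and the sampled answers are assumptions, and the Flip Rate metric is not modeled because the abstract does not define it.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Use the most common sampled answer as the pseudo-label.

    Samples that agree with the majority get reward 1.0, all others 0.0.
    """
    label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == label else 0.0 for a in answers]
    return label, rewards


# Six sampled answers to one problem; suppose the true answer is "15" (assumed)
samples = ["12", "15", "12", "12", "15", "7"]
label, rewards = majority_vote_rewards(samples)
```

Here the pseudo-label is the wrong majority answer, so the two correct minority samples receive zero reward. Repeating such updates reinforces the wrong answer, which is the lock-in effect the abstract describes.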

2605.18692 2026-05-28 cs.AI math.OC 版本更新

Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

利用LLM引导的模型补丁实现大规模重新优化的大众化

Tinghan Ye, Arnaud Deza, Ved Mohan, El Mehdi Er Raqabi, Pascal Van Hentenryck

发表机构 * H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology（佐治亚理工学院 H. Milton Stewart 工业与系统工程学院）； Department of Operations and Decision Systems, Université Laval（拉瓦尔大学运营与决策系统系）

AI总结提出一个基于大语言模型的代理重新优化框架，通过自然语言交互和优化工具箱，使非专家用户能够动态更新和重新优化部署的优化模型，并在两个大规模实际案例中验证了其有效性和可扩展性。

详情

AI中文摘要

运筹学专家开发的优化模型通常作为工业环境中的决策支持系统部署。然而，现实环境是动态的，业务规则不断演变且存在不可预见的扰动。在这种情况下，最终用户理想情况下应重新优化模型以恢复可行且可实施的解决方案，但往往无法联系到原始模型开发者。本文介绍了一个代理重新优化框架，其中大语言模型充当运筹学专家，通过自然语言交互动态支持最终用户。大语言模型将用户提示转化为底层优化模型的结构化更新，从优化工具箱中选择合适的重新优化技术，并求解生成的实例以返回可实施的解决方案。该工具箱利用原始信息，包括历史解、有效不等式、求解器配置和元启发式算法，以加速重新优化同时保持解的质量。所提出的框架能够实现部署优化模型的交互式和持续适应，减少对运筹学专家的依赖，并提高决策支持系统的可持续性。在两个互补的大规模实际案例研究上的广泛实验证明了所提框架的有效性和可扩展性。第一个案例考虑在线供应链重新优化，其中必须快速生成解同时保持与部署计划接近，而第二个案例侧重于离线大学考试排程，其中解的质量优先于运行时间。结果表明，基于工具箱的架构通过基于原始信息和求解器感知的重新优化技术显著提高了计算效率，而基于结构化补丁的更新提高了模型修改的可解释性和可追溯性。

英文摘要

Optimization models developed by operations research (OR) experts are often deployed as decision-support systems in industrial settings. However, real-world environments are dynamic, with evolving business rules and unforeseen perturbations. In such contexts, end users should ideally re-optimize models to recover feasible and implementable solutions, often without access to the original model developers. This paper introduces an agentic re-optimization framework in which a large language model (LLM) acts as an OR expert, dynamically supporting end users through natural-language interaction. The LLM translates user prompts into structured updates of the underlying optimization model, selects suitable re-optimization techniques from an optimization toolbox, and solves the resulting instance to return implementable solutions. The toolbox leverages primal information, including historical solutions, valid inequalities, solver configurations, and metaheuristics, to accelerate re-optimization while preserving solution quality. The proposed framework enables interactive and continuous adaptation of deployed optimization models, reducing dependence on OR experts, and improving the sustainability of decision-support systems. Extensive experiments on two complementary large-scale real-world case studies demonstrate the effectiveness and scalability of the proposed framework. The first considers online supply chain re-optimization, where solutions must be generated rapidly while remaining close to the deployed plan, whereas the second focuses on offline university exam scheduling, where solution quality is prioritized over runtime. Results show that the toolbox-driven architecture significantly improves computational efficiency through primal-based and solver-aware re-optimization techniques, while the structured patch-based updates improve interpretability and traceability of model modifications.

2605.02503 2026-05-28 cs.AI 版本更新

教授并评估LLM在聚合物设计相关任务上的推理能力

Dikshya Mohanty, Mohammad Saqib Hasan, Syed Mostofa Monsur, Size Zheng, Benjamin Hsiao, Niranjan Balasubramanian

发表机构 * Stony Brook University（石溪大学）

AI总结本文提出PolyBench基准数据集和知识增强推理蒸馏方法，使中小型语言模型在聚合物设计任务上性能接近前沿闭源LLM。

详情

AI中文摘要

AI4Science研究在许多科学应用中显示出前景，包括聚合物设计。然而，当前的LLMs在此问题空间中效果不佳，因为：(i)大多数模型缺乏聚合物特定知识，(ii)现有对齐模型对聚合物设计相关知识和能力的覆盖有限。为解决此问题，我们引入了PolyBench，一个包含超过125K聚合物设计相关任务的大规模训练和测试基准数据集，利用从实验和合成数据源获得的超过1300万数据点的知识库，以确保聚合物及其属性的广泛覆盖。为了使用PolyBench进行有效对齐，我们引入了一种知识增强推理蒸馏方法，用结构化CoT增强该数据集。此外，PolyBench中的任务从简单到复杂的分析推理问题组织，使得能够进行泛化测试和问题空间中的诊断探测。实验表明，在PolyBench上训练的具有7B到32B参数的中小型语言模型(SLMs)在PolyBench测试数据集上优于类似大小的模型，并与闭源前沿LLMs保持竞争力，同时在外部聚合物基准上展示了性能提升。数据集和相关代码可在https://github.com/StonyBrookNLP/PolyBench获取。

英文摘要

Research in AI4Science has shown promise in many science applications, including polymer design. However, current LLMs are ineffective in this problem space because: (i) most models lack polymer-specific knowledge, and (ii) existing aligned models have limited coverage of knowledge and capabilities relevant to polymer design. To address this, we introduce PolyBench, a large-scale training and test benchmark dataset of more than 125K polymer design-related tasks, leveraging a knowledge base of more than 13 million data points obtained from experimental and synthetic data sources to ensure broad coverage of polymers and their properties. For effective alignment using PolyBench, we introduce a knowledge-augmented reasoning distillation method that augments this dataset with structured CoT. Furthermore, tasks in PolyBench are organized from simple to complex analytical reasoning problems, enabling generalization tests and diagnostic probes across the problem space. Experiments show that small- and mid-sized language models (SLMs) with 7B to 32B parameters, trained on PolyBench, outperform similar-sized models and remain competitive with closed-source frontier LLMs on PolyBench's test dataset, while demonstrating performance gains on external polymer benchmarks. The dataset and associated code are available at https://github.com/StonyBrookNLP/PolyBench.

2605.13517 2026-05-28 cs.CV cs.AI cs.LG 版本更新

ArcVQ-VAE: A Spherical Vector Quantization Framework with ArcCosine Additive Margin

ArcVQ-VAE：一种带有反余弦加性边界的球面向量量化框架

Jaeyung Kim, YoungJoon Yoo

发表机构 * Department of Artificial Intelligence, Chung-Ang University, Seoul, Republic of Korea（韩国首尔中央大学人工智能系）； SNUAILAB, Seoul, Republic of Korea（韩国首尔 SNUAILAB）

AI总结针对VQ-VAE有限码本容量限制表示能力的问题，提出ArcVQ-VAE框架，通过引入球面角边先验（包括球界范数正则化和反余弦加性边界损失）增强潜在表示的判别性和均匀分散性，提升码本利用率，在图像重建和生成任务上取得竞争性能。

Comments To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

向量量化变分自编码器（VQ-VAE）已成为图像建模中学习离散表示的基本框架。然而，VQ-VAE模型必须使用有限的码本向量集对整张图像进行分词，这种容量限制限制了其捕获丰富多样表示的能力。在本文中，我们提出反余弦加性边界VQ-VAE（ArcVQ-VAE），一种新颖的向量量化框架，该框架为传统VQ-VAE的码本引入了球面角边先验（SAMP）。所提出的SAMP由球界范数正则化（将所有码本向量约束在时间相关的欧几里得球内）和反余弦加性边界损失（鼓励潜在向量之间更大的角度可分性）组成。这种公式在受限空间内促进了更具判别性和均匀分散的潜在表示，从而提高了有效的潜在空间覆盖范围，并导致码本利用率提升。在标准图像重建和生成任务上的实验结果表明，ArcVQ-VAE在重建精度、表示多样性和样本质量方面与基线模型相比取得了竞争性能。代码可在 https://github.com/goals4292/ArcVQ-VAE 获取。

英文摘要

Vector Quantized Variational Autoencoder (VQ-VAE) has become a fundamental framework for learning discrete representations in image modeling. However, VQ-VAE models must tokenize entire images using a finite set of codebook vectors, and this capacity limitation restricts their ability to capture rich and diverse representations. In this paper, we propose ArcCosine Additive Margin VQ-VAE (ArcVQ-VAE), a novel vector quantization framework that introduces a spherical angular-margin prior (SAMP) for the codebook of a conventional VQ-VAE. The proposed SAMP consists of Ball-Bounded Norm Regularization, which constrains all codebook vectors within a time-dependent Euclidean ball, and ArcCosine Additive Margin Loss, which encourages greater angular separability among latent vectors. This formulation promotes more discriminative and uniformly dispersed latent representations within the constrained space, thereby improving effective latent-space coverage and leading to improved codebook utilization. Experimental results on standard image reconstruction and generation tasks show that ArcVQ-VAE achieves competitive performance against baseline models in terms of reconstruction accuracy, representation diversity, and sample quality. The code is available at: https://github.com/goals4292/ArcVQ-VAE

2605.12929 2026-05-28 cs.CV cs.AI 版本更新

Anatomy-Slot: Unsupervised Anatomical Factorization for Homologous Bilateral Reasoning in Retinal Diagnosis

Anatomy-Slot: 用于视网膜诊断中同源双侧推理的无监督解剖分解

Yingzhe Ma, Xiao Yang, Yuguo Yin, Zheyu Wang

发表机构 * University of Electronic Science and Technology of China（电子科技大学）； Peking University（北京大学）

AI总结提出Anatomy-Slot方法，通过无监督解剖瓶颈分解斑块令牌为结构一致的解剖区域槽，并利用双向交叉注意力对齐双眼槽，在ODIR-5K上相比ViT-L基线提升AUC 4.2点，验证了显式结构对应改善诊断的假设。

Comments 15 pages, 3 figures

详情

AI中文摘要

视网膜诊断本质上是双侧的：临床医生比较双眼的同源结构（例如，视盘不对称），然而大多数深度模型基于单眼表示。我们研究显式结构对应是否改善诊断，并提出Anatomy-Slot来操作化这一假设。Anatomy-Slot通过将斑块令牌分解为一组涌现的、结构一致的槽（对应于解剖区域）来引入无监督解剖瓶颈，然后通过双向交叉注意力对齐双眼的槽。在ODIR-5K上使用$n=10$个种子，该方法相比匹配的ViT-L基线在AUC上提升$4.2$个点（95%置信区间；Wilcoxon符号秩检验，$W=0$，$p=0.002$）。配对破坏和高斯噪声下的压力测试提供了对应依赖性和鲁棒性的受控测试。我们进一步在REFUGE上报告了定量视盘定位和交叉注意力定位分析。除了报告的性能提升外，这些结果表明，以对象为中心的解剖对应为与临床双侧比较一致的可解释诊断系统提供了一条原则性路径。

英文摘要

Retinal diagnosis is inherently bilateral: clinicians compare homologous structures across eyes (e.g., optic disc asymmetry), yet most deep models operate on monocular representations. We investigate whether explicit structural correspondence improves diagnosis, and propose Anatomy-Slot to operationalize this hypothesis. Anatomy-Slot introduces an unsupervised anatomical bottleneck by decomposing patch tokens into a set of emergent, structurally-coherent slots that correspond to anatomical regions, then aligning these slots across eyes via bidirectional cross-attention. On ODIR-5K with $n=10$ seeds, the method improves AUC by $4.2$ points over a matched ViT-L baseline (95% CIs; Wilcoxon signed-rank test, $W=0$, $p=0.002$). Pairing disruption and stress testing under Gaussian noise provide controlled tests of correspondence dependence and robustness under corruption. We further report quantitative optic disc grounding on REFUGE and cross-attention localization analysis. Beyond the reported gains, these results indicate that object-centric anatomical correspondence offers a principled path toward interpretable diagnostic systems aligned with clinical bilateral comparison.
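
The reported statistics ($n=10$, $W=0$, $p=0.002$) can be checked with a short worked example. The sketch assumes that $W$ is the smaller of the positive and negative rank sums, that there are no ties or zero differences, and that the p-value is the exact two-sided one. These are assumptions about how the test was run, not details stated in the abstract. Under them, $W=0$ means all ten paired differences share a sign, and only 2 of the $2^{10}$ equally likely sign patterns are that extreme.

```python
from itertools import product


def exact_two_sided_p(n, w_observed):
    """Exact two-sided p-value for W = min(W+, W-) with n untied, nonzero differences."""
    ranks = range(1, n + 1)
    total = n * (n + 1) // 2
    hits = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=n):
        w_plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(w_plus, total - w_plus) <= w_observed:
            hits += 1
    return hits / 2 ** n


p_value = exact_two_sided_p(10, 0)  # 2 / 1024
```

The result, 2/1024 ≈ 0.00195, rounds to the reported 0.002, which is also the smallest two-sided p-value this test can produce with ten seeds.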

2512.21075 2026-05-28 cs.LG cs.AI math.PR stat.ML 版本更新

Feature Learning Dynamics in Infinite-Depth Neural Networks

无限深度神经网络中的特征学习动力学

Zihan Yao, Ruoyu Wu, Tianxiang Gao

发表机构 * School of Computing（计算学院）； Department of Mathematics（数学系）； DePaul University（德保罗大学）； Iowa State University（爱荷华州立大学）

AI总结本文研究深度-μP缩放下单层ResNet中由权重重用引起的前向-后向耦合，证明其在初始化时随宽度消失，但在训练中产生非平凡相关项，并推导出无限深度极限下的神经特征动力学（NFD）SDE系统。

详情

AI中文摘要

深度神经网络在实践中取得了显著成功，但对训练过程中特征如何演化的机制理解仍不完整，尤其是在大深度极限下。对于深度-μP缩放下的ResNet，先前工作将层索引ℓ视为连续时间t_ℓ = ℓ/L，得到训练动力学的SDE描述。一个关键未解决问题是，反向传播通过其转置W_ℓ^⊤重用每个前向权重矩阵W_ℓ，在前向特征和反向梯度之间产生相关性，其行为和特征学习中的作用尚不清楚。我们研究了深度-μP下单层ResNet中这种重用权重的前向-后向耦合。使用条件高斯表示，我们在取任何网络极限之前，显式地将权重重用引起的耦合项与解耦的高斯波动分开。在初始化时，我们证明耦合是有限宽度效应，并以O(n^{-1})的速率随深度一致消失。然而，在训练期间，SGD引入了一个非平凡的前向-后向相关项，该项在无限宽度极限下仍然存在。关键的深度效应是，在深度-μP缩放下，这个幸存项在深度上是高阶的，并且随着L→∞，其在层上的累积贡献变得可忽略。这种深度诱导的抑制促使了神经特征动力学（NFD），一个具有解耦后向权重的向前-向后SDE系统，它保留了训练期间生成的特征-梯度协方差结构。在非退化假设下，我们证明有限网络训练动力学收敛到其NFD极限，深度离散化误差为O(L^{-1})，而重用权重耦合项具有更快的O(L^{-2})衰减。这些结果为深度-μP下单层ResNet的特征学习动力学提供了严格的无限深度极限。

英文摘要

Deep neural networks have achieved remarkable success in practice, yet a mechanistic understanding of how features evolve during training remains incomplete, especially in the large-depth limit. For ResNets under depth-$\mu$P scaling, prior work treats the layer index $\ell$ as a continuous time $t_\ell = \ell/L$, yielding SDE descriptions of the training dynamics. A key unresolved issue is that backpropagation reuses each forward weight matrix $W_\ell$ through its transpose $W_\ell^\top$, creating correlations between forward features and backward gradients whose behavior and role in feature learning remain unclear. We study this reused-weight forward--backward coupling in one-layer ResNets under depth-$\mu$P. Using conditional Gaussian representations, we explicitly separate the coupling terms induced by weight reuse from decoupled Gaussian fluctuations before taking any network limit. At initialization, we prove that the coupling is a finite-width effect and vanishes at rate $O(n^{-1})$, uniformly over depth. During training, however, SGD induces a nontrivial forward--backward correlation term that survives the infinite-width limit. The key depth effect is that, under depth-$\mu$P scaling, this surviving term is higher order in depth and its accumulated contribution over layers becomes negligible as $L\to\infty$. This depth-induced suppression motivates Neural Feature Dynamics (NFD), a forward--backward SDE system with decoupled backward weights that retains the feature-gradient covariance structure generated during training. Under nondegeneracy assumptions, we prove that the finite-network training dynamics converge to its NFD limit with an $O(L^{-1})$ depth-discretization error, while the reused-weight coupling term has a faster $O(L^{-2})$ decay. These results provide a rigorous infinite-depth limit for the feature-learning dynamics of one-layer ResNets under depth-$\mu$P.

2605.12015 2026-05-28 cs.CR cs.AI cs.CL cs.LG cs.MA 版本更新

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

SkillSafetyBench：评估面向技能的攻击面下的智能体安全性

Chang Jin, An Wang, Zeming Wei, Kai Wang, Biaojie Zeng, Qiaosheng Zhang, Chao Yang, Jingjing Qu, Xia Hu, Xingcheng Xu

发表机构 * Shanghai AI Laboratory（上海人工智能实验室）； Peking University（北京大学）； East China Normal University（华东师范大学）

AI总结提出SkillSafetyBench基准，通过155个对抗案例评估大语言模型智能体在技能、本地工件和执行环境文件等非用户攻击下的安全失败模式。

详情

AI中文摘要

可复用技能正成为扩展大语言模型智能体的常见接口，它将程序性指导与对文件、工具、内存和执行环境的访问打包在一起。然而，这种模块化引入了现有安全评估大多忽略的攻击面：即使用户请求是良性的，不安全的影响可能存在于技能指导、本地工件或执行环境文件中，这些会引导智能体采取不安全行为。我们提出了SkillSafetyBench，一个可运行的基准，用于评估此类技能中介的安全失败。SkillSafetyBench包含跨47个任务、6个风险领域和30个安全类别的155个对抗案例，每个案例都使用特定于案例的基于规则的验证器进行评估。使用多个CLI智能体和模型后端的实验表明，非用户攻击可以一致地诱导不安全行为，在不同领域、攻击方法和脚手架-模型配对中表现出不同的失败模式。我们的发现表明，智能体安全性不仅取决于模型级别的对齐，还取决于智能体如何解释技能、信任工作流上下文以及通过可执行环境采取行动。

英文摘要

Reusable skills are becoming a common interface for extending large language model agents, packaging procedural guidance with access to files, tools, memory, and execution environments. However, this modularity introduces attack surfaces that are largely missed by existing safety evaluations: even when the user request is benign, unsafe influence may reside in skill guidance, local artifacts, or execution-environment files that steer the agent toward unsafe actions. We present SkillSafetyBench, a runnable benchmark for evaluating such skill-mediated safety failures. SkillSafetyBench includes 155 adversarial cases across 47 tasks, 6 risk domains, and 30 safety categories, each evaluated with a case-specific rule-based verifier. Experiments with multiple CLI agents and model backends show that non-user attacks can consistently induce unsafe behavior, with distinct failure patterns across domains, attack methods, and scaffold-model pairings. Our findings suggest that agent safety depends not only on model-level alignment, but also on how agents interpret skills, trust workflow context, and act through executable environments.

2605.11544 2026-05-28 cs.AI cs.LO 版本更新

MathlibLemma: 形式化数学中的民间引理生成与基准测试

Xinyu Liu, Zixuan Xie, Amir Moeini, Claire Chen, Shuze Daniel Liu, Yu Meng, Aidong Zhang, Shangtong Zhang

发表机构 * Department of Computer Science, University of Virginia（弗吉尼亚大学计算机科学系）； Astronomy, California Institute of Technology（加州理工学院天文学系）； Purdue University（普渡大学）； Massachusetts Institute of Technology（麻省理工学院）

AI总结提出基于LLM的模块化流水线MathlibLemma，自动挖掘、形式化并证明数学中缺失的民间引理，生成包含4028个类型检查的Lean语句的基准测试集。

详情

AI中文摘要

尽管Lean和Mathlib生态系统在大语言模型（LLM）的帮助下在形式化数学推理方面取得了显著成功，但Mathlib中缺乏许多民间引理仍然是一个持续存在的障碍，限制了Lean作为像LaTeX或Maple那样的日常工具对数学家的可用性。为了解决这个问题，我们引入了MathlibLemma，一个基于LLM的模块化流水线，用于自动进行民间引理挖掘：发现、形式化并证明数学家通常认为理所当然但形式化库中并不总是存在的可重用中间事实。其核心是，MathlibLemma主动挖掘数学中缺失的连接组织。该流水线生成一个经过验证的民间风格引理库，包括1506个通过证明绕过筛选的Lean检查证明；一个精心策划的小型试点子集也已合并到Mathlib中，提供了外部证据表明选定的输出可以满足专家库标准。利用这一流水线，我们进一步构建了MathlibLemma基准测试集，包含4028个跨越广泛数学领域的非平凡类型检查的Lean语句。通过将LLM的角色从被动消费者转变为主动贡献者，这项工作朝着AI辅助扩展形式化数学库迈出了一步。

英文摘要

While the ecosystem of Lean and Mathlib has enjoyed celebrated success in formal mathematical reasoning with the help of large language models (LLMs), the absence of many folklore lemmas in Mathlib remains a persistent barrier that limits Lean's usability as an everyday tool for mathematicians like LaTeX or Maple. To address this, we introduce MathlibLemma, a modular LLM-based pipeline for automated folklore-lemma mining: the discovery, formalization, and proving of reusable intermediate facts that mathematicians often take for granted but that are not always present in formal libraries. At its core, MathlibLemma proactively mines the missing connective tissue of mathematics. The pipeline produces a verified library of folklore-style lemmas, including 1,506 Lean-checked proofs that pass a proof-bypass screen; a small curated pilot subset has also been merged into Mathlib, providing external evidence that selected outputs can meet expert library standards. Leveraging this pipeline, we further construct the MathlibLemma benchmark, a suite of 4,028 non-trivial type-checked Lean statements spanning a broad range of mathematical domains. By transforming the role of LLMs from passive consumers to active contributors, this work takes a step toward AI-assisted expansion of formal mathematical libraries.

2605.08938 2026-05-28 cs.AI cs.LG 版本更新

Kolmogorov-Arnold网络实践指南

Amir Noorizadegan, Sifan Wang, Leevan Ling, Juan P. Dominguez-Morales

发表机构 * Department of Mathematics, Hong Kong Baptist University（香港浸会大学数学系）； Institution for Foundations of Data Science, Yale University（耶鲁大学数据科学基础研究所）； Robotics and Technology of Computers Lab., Universidad de Sevilla（塞维利亚大学机器人与计算机技术实验室）

AI总结本文系统综述了受Kolmogorov叠加定理启发的KAN网络，从理论基础、设计轴心（基函数）到最新进展，并提供了实用选择指南和未来方向。

2605.00435 2026-05-28 cs.CL cond-mat.dis-nn cs.AI nlin.CD 版本更新

Escaping Mode Collapse in LLM Generation via Geometric Regulation

通过几何调控逃离大语言模型生成中的模式崩溃

Xin Du, Kumiko Tanaka-Ishii

发表机构 * Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan（通信与计算机工程系，早稻田大学，东京，日本）； Department of Computer Science and Engineering, Waseda University, Tokyo, Japan（计算机科学与工程系，早稻田大学，东京，日本）； Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China（智能自主系统上海研究院，同济大学，上海，中国）

AI总结本文从动力系统视角将模式崩溃解释为几何崩溃，并提出轻量级在线状态空间干预方法RMR（通过低秩阻尼调控Transformer值缓存中的自强化方向），显著降低模式崩溃并实现极低熵率下的稳定生成。

Comments Accepted to ICML 2026

详情

AI中文摘要

模式崩溃是生成建模中的一个持续挑战，在自回归文本生成中表现为从显式循环到逐渐失去多样性和轨迹过早收敛等行为。我们采用动力系统视角，将模式崩溃重新解释为由*几何崩溃*引起的状态空间可访问性降低：在生成过程中，模型的内部轨迹被限制在其表示空间的低维区域。这意味着模式崩溃并非纯粹的token级现象，无法通过符号约束或仅概率解码启发式可靠解决。基于这一视角，我们提出*强化模式调控*（RMR），一种轻量级的在线状态空间干预方法，用于调控Transformer值缓存中占主导地位的自强化方向（实现为低秩阻尼）。在多个大型语言模型上，RMR显著减少了模式崩溃，并能够在极低熵率（低至0.8 nats/步）下实现稳定生成，而标准解码通常在2.0 nats/步附近崩溃。

英文摘要

Mode collapse is a persistent challenge in generative modeling and appears in autoregressive text generation as behaviors ranging from explicit looping to gradual loss of diversity and premature trajectory convergence. We take a dynamical-systems view and reinterpret mode collapse as reduced state-space accessibility caused by *geometric collapse*: during generation, the model's internal trajectory becomes confined to a low-dimensional region of its representation space. This implies mode collapse is not purely a token-level phenomenon and cannot be reliably solved by symbolic constraints or probability-only decoding heuristics. Guided by this perspective, we propose *Reinforced Mode Regulation* (RMR), a lightweight, online state-space intervention that regulates dominant self-reinforcing directions in the Transformer value cache (implemented as low-rank damping). Across multiple large language models, RMR substantially reduces mode collapse and enables stable generation at extremely low entropy rates (down to 0.8 nats/step), whereas standard decoding typically collapses near 2.0 nats/step.
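
The phrase "low-rank damping" of dominant directions in the value cache can be made concrete with a small sketch. It is only an illustration under stated assumptions: the dominant directions are taken to be the top right singular vectors of a `(tokens, dim)` value matrix, and damping is a fixed shrink factor. The function name, `rank`, and `strength` are hypothetical, and the abstract does not say how RMR actually selects or damps directions.

```python
import numpy as np


def damp_dominant_directions(values, rank=1, strength=0.5):
    """Shrink the top-`rank` singular directions of a (tokens, dim) value cache."""
    # Right singular vectors give the dominant directions in feature space.
    _, _, vt = np.linalg.svd(values, full_matrices=False)
    top = vt[:rank]                    # (rank, dim), orthonormal rows
    coeffs = values @ top.T            # projection coefficients, (tokens, rank)
    # Subtract a fraction of the component lying along the dominant directions.
    return values - strength * coeffs @ top


rng = np.random.default_rng(0)
cache = rng.normal(size=(16, 8))
cache[:, 0] += 10.0                    # one strongly dominant direction
damped = damp_dominant_directions(cache, rank=1, strength=0.5)
```

With `strength=0.5` the largest singular value of the cache is halved while the remaining ones are unchanged, so the update is a rank-1 correction to the cache.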

2604.27251 2026-05-28 cs.CL cs.AI 版本更新

Compliance versus Sensibility: On the Reasoning Controllability in Large Language Models

服从性与合理性：论大型语言模型的推理可控性

Xingwei Tan, Marco Valentino, Mahmud Elahi Akhter, Yuxiang Zhou, Maria Liakata, Nikolaos Aletras

发表机构 * School of Computer Science, University of Sheffield（谢菲尔德大学计算机科学学院）； School of EECS, Queen Mary University of London（伦敦玛丽女王大学电子工程与计算机科学学院）； The Alan Turing Institute（艾伦·图灵研究所）

AI总结通过推理冲突视角，系统研究大型语言模型在诱导逻辑模式与任务预期模式冲突时，是否优先服从指令还是遵循感知合理性，并探索内部检测与激活级干预方法。

详情

AI中文摘要

大型语言模型（LLMs）已知通过预训练数据中的共享推理模式获得推理能力，并通过思维链（CoT）实践进一步激发。然而，基本推理模式（如归纳、演绎和溯因）能否与具体问题实例解耦，仍然是模型可控性的关键挑战，并有助于阐明推理可控性。在本文中，我们首次通过推理冲突的视角系统研究这一问题：推理冲突是指通过强制使用偏离目标任务预期逻辑模式而引发的参数信息与上下文信息之间的显性张力。我们的评估表明，LLMs 始终优先考虑感知合理性而非服从性，尽管存在冲突指令，仍倾向于采用任务合适的推理模式。我们进一步证明推理冲突在内部是可检测的，因为在冲突期间置信度分数显著下降。探测实验确认推理类型从中间层到后期层线性编码，表明存在激活级可控性的潜力。利用这些见解，我们引导模型朝向服从性，将指令遵循度提高多达 29%。总体而言，我们的发现表明，虽然 LLM 推理锚定于具体实例，但主动的机制性干预可以有效地将逻辑模式与数据解耦，为改进可控性、忠实性和泛化性提供了一条路径。

英文摘要

Large Language Models (LLMs) are known to acquire reasoning capabilities through shared inference patterns in pre-training data, which are further elicited via Chain-of-Thought (CoT) practices. However, whether fundamental reasoning patterns, such as induction, deduction, and abduction, can be decoupled from specific problem instances remains a critical challenge for model controllability, and for shedding light on reasoning controllability. In this paper, we present the first systematic investigation of this problem through the lens of reasoning conflicts: an explicit tension between parametric and contextual information induced by mandating logical schemata that deviate from those expected for a target task. Our evaluation reveals that LLMs consistently prioritize sensibility over compliance, favoring task-appropriate reasoning patterns despite conflicting instructions. We further demonstrate that reasoning conflicts are internally detectable, as confidence scores significantly drop during conflicting episodes. Probing experiments confirm that reasoning types are linearly encoded from middle-to-late layers, indicating the potential for activation-level controllability. Leveraging these insights, we steer models towards compliance, increasing instruction following by up to 29%. Overall, our findings establish that while LLM reasoning is anchored to concrete instances, active mechanistic interventions can effectively decouple logical schemata from data, offering a path toward improved controllability, faithfulness, and generalizability.

2603.09117 2026-05-28 cs.LG cs.AI cs.CL 版本更新

Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

解耦推理与置信度：在可验证奖励的强化学习中恢复校准

Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China（中国科学院软件研究所中文信息处理实验室）； University of Chinese Academy of Sciences, Beijing, China（中国科学院大学）； Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China（中国科学院大学网络空间安全学院）； National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China（国家计算机网络应急技术处理协调中心）

AI总结针对RLVR中模型校准退化问题，提出DCPO框架通过解耦推理与校准目标，在保持准确率的同时显著改善校准性能并缓解过度自信。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

2410.04096 2026-05-28 cs.LG cs.AI cs.NA cs.NE math.NA physics.comp-ph 版本更新

Sinc Kolmogorov-Arnold network and its application for solving PDEs with singularities

Sinc Kolmogorov-Arnold 网络及其在求解含奇异性偏微分方程中的应用

Tianchi Yu, Jingwei Qiu, Jiang Yang, Ivan Oseledets

发表机构 * Skolkovo Institute of Science and Technology（斯科尔科沃科学技术研究院）； Southern University of Science and Technology（南方科技大学）； International Center for Mathematics（国际数学中心）； National Center for Applied Mathematics Shenzhen (NCAMS)（深圳国家应用数学中心）

AI总结本文提出在 Kolmogorov-Arnold 网络中使用 Sinc 插值作为可学习激活函数，以有效逼近光滑函数和含奇异性的函数，并在物理信息神经网络求解偏微分方程中取得更好效果。
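
As a rough illustration of Sinc interpolation used as a learnable one-dimensional activation, the sketch below evaluates $\sum_k c_k\,\mathrm{sinc}((x - kh)/h)$ on a centered uniform grid. The single step size `step`, the centered node layout, and the function name are assumptions made for illustration; the summary does not specify the paper's actual parameterization.

```python
import numpy as np


def sinc_activation(x, coeffs, step=1.0):
    """Evaluate sum_k c_k * sinc((x - k*step) / step) on centered grid nodes k."""
    half = len(coeffs) // 2
    nodes = np.arange(-half, len(coeffs) - half)   # e.g. -2..2 for 5 coefficients
    # np.sinc is the normalized sinc, sin(pi t) / (pi t), equal to 1 at t = 0.
    basis = np.sinc((np.asarray(x)[..., None] - nodes * step) / step)
    return basis @ np.asarray(coeffs)


coeffs = np.array([0.5, -1.0, 2.0, 0.3, -0.7])   # would be learnable in a network
grid = np.arange(-2, 3) * 0.5
on_grid = sinc_activation(grid, coeffs, step=0.5)
```

At the grid nodes the function reproduces the coefficients, so each coefficient can be read as the activation's value at its node.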

2604.25491 2026-05-28 cs.CV cs.AI 版本更新

The Forensic Cost of Watermark Removal: From Dedicated Attacks to Image Editing

水印移除的取证代价：从专用攻击到图像编辑

Gautier Evennou, Ewa Kijak

发表机构 * IMATAG； IRISA, Univ. Rennes, INRIA, CNRS（IRISA，雷恩大学，INRIA，CNRS）

AI总结本文提出水印移除检测（WRD）作为新评估维度，通过训练分类器检测移除痕迹，在$10^{-3}$假阳性率下实现最优检测，证明取证隐蔽性是水印移除的必要条件。

Comments v1: The Forensic Cost of Watermark Removal, accepted at IH&MMSec 2026, Special Session "Watermarking Across the Lifecycle of Generative Models". v2: extended version, under review

详情

AI中文摘要

当前水印移除方法在两个维度上进行评估：攻击成功率和感知质量。我们证明这是不够的。虽然最先进的攻击成功地在没有可见失真的情况下削弱了水印信号，但它们留下了明显的统计伪影，暴露了移除尝试。我们将这个被忽视的维度命名为水印移除检测（WRD），并证明基于这些伪影训练的现代分类器在$10^{-3}$假阳性率下，对每种测试的移除方法都达到了最先进的检测率。现有的攻击均未考虑这种取证泄漏。我们在扩展的评估三元组（攻击成功率、感知质量和取证可检测性）下，对领先的水印方案与标准移除流水线进行了基准测试，发现当前没有方法能同时兼顾这三者。我们的结果确立了取证隐蔽性作为水印移除的必要要求。

英文摘要

Current watermark removal methods are evaluated on two axes: attack success rate and perceptual quality. We show this is insufficient. While state-of-the-art attacks successfully degrade the watermark signal without visible distortion, they leave distinct statistical artifacts that betray the removal attempt. We name this overlooked axis Watermark Removal Detection (WRD) and demonstrate that a modern classifier trained on these artifacts achieves state-of-the-art detection rates at $10^{-3}$ FPR across every removal method tested. No existing attack accounts for this forensic leakage. We benchmark leading watermarking schemes against standard removal pipelines under the extended evaluation triple of attack success, perceptual quality, and forensic detectability, and find that no current method balances all three. Our results establish forensic stealthiness as a necessary requirement for watermark removal.
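
Reporting a detection rate "at $10^{-3}$ FPR" means fixing the decision threshold on untouched images and then measuring how many attacked images exceed it. The sketch below shows that computation under the assumption that the detector outputs one scalar score per image, with higher meaning more likely attacked. The function name and the toy scores are hypothetical and are not data from the paper.

```python
import numpy as np


def tpr_at_fpr(clean_scores, attacked_scores, target_fpr=1e-3):
    """Detection rate on attacked images at a threshold set on clean images."""
    clean = np.sort(np.asarray(clean_scores))
    # Allow at most floor(target_fpr * n) clean images above the threshold.
    allowed = int(np.floor(target_fpr * len(clean)))
    threshold = clean[len(clean) - allowed - 1]
    return float(np.mean(np.asarray(attacked_scores) > threshold))


clean = np.arange(10) / 10                  # toy scores 0.0 .. 0.9
attacked = np.array([0.85, 0.95, 0.5, 0.99])
rate = tpr_at_fpr(clean, attacked, target_fpr=0.1)
```

The toy call uses a 10% false-positive budget so that ten clean scores suffice. Measuring a $10^{-3}$ budget this way needs at least a thousand clean images, otherwise no false positive is allowed at all.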

2604.23472 2026-05-28 cs.AI 版本更新

Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization

Escher-Loop：通过闭环自我指涉优化的共同进化

Ziyang Liu, Xinyan Guo, Xuchen Wei, Han Hao, Liu Yang

发表机构 * Shenzhen X-Institute（深圳零一学院）； Soochow University（苏州大学）； Shenzhen Loop Area Institute（深圳河套学院）； Tsinghua University（清华大学）； National University of Singapore（新加坡国立大学）

AI总结提出Escher-Loop框架，通过任务代理和优化代理的闭环共同进化及动态基准机制，实现超越静态基线的持续性能提升。

Comments The first three authors contributed equally. Corresponding Authors: Han Hao, Liu Yang

详情

AI中文摘要

尽管最近自主代理展示了令人印象深刻的能力，但它们主要依赖于手动脚本化工作流和手工制作的启发式方法，本质上限制了其开放式改进的潜力。为了解决这个问题，我们提出了Escher-Loop，一个完全闭环的框架，实现了两个不同群体的共同进化：解决具体问题的任务代理，以及递归优化任务代理和自身的优化代理。为了维持这种自我指涉的进化，我们提出了一种动态基准测试机制，该机制无缝地将新生成任务代理的经验分数作为相对胜负信号，用于更新优化代理的分数。该机制利用任务代理的进化作为内在信号，驱动优化代理的评估和优化，而无需额外开销。在数学优化问题上的实证评估表明，Escher-Loop有效突破了静态基线的性能上限，在所有评估任务中，在匹配计算量下实现了最高的绝对峰值性能。值得注意的是，我们观察到优化代理动态调整其策略以适应高性能任务代理不断变化的需求，这解释了系统的持续改进和优越的后期性能。

英文摘要

While recent autonomous agents demonstrate impressive capabilities, they predominantly rely on manually scripted workflows and handcrafted heuristics, inherently limiting their potential for open-ended improvement. To address this, we propose Escher-Loop, a fully closed-loop framework that operationalizes the mutual evolution of two distinct populations: Task Agents that solve concrete problems, and Optimizer Agents that recursively refine both the task agents and themselves. To sustain this self-referential evolution, we propose a dynamic benchmarking mechanism that seamlessly reuses the empirical scores of newly generated task agents as relative win-loss signals to update optimizers' scores. This mechanism leverages the evolution of task agents as an inherent signal to drive the evaluation and refinement of optimizers without additional overhead. Empirical evaluations on mathematical optimization problems demonstrate that Escher-Loop effectively pushes past the performance ceilings of static baselines, achieving the highest absolute peak performance across all evaluated tasks under matched compute. Remarkably, we observe that the optimizer agents dynamically adapt their strategies to match the shifting demands of high-performing task agents, which explains the system's continuous improvement and superior late-stage performance.
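
One way to picture "relative win-loss signals" that update optimizer scores is a pairwise rating update. The sketch below uses an Elo-style rule, which is a substituted, well-known technique chosen for illustration. The abstract does not state which rating rule Escher-Loop uses, and all names, the `k` factor, and the tie handling are hypothetical.

```python
def expected_win(score_a, score_b):
    """Elo-style probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((score_b - score_a) / 400.0))


def update_scores(score_a, score_b, a_won, k=32.0):
    """Return updated (score_a, score_b) after one pairwise comparison."""
    delta = k * ((1.0 if a_won else 0.0) - expected_win(score_a, score_b))
    return score_a + delta, score_b - delta


def compare_optimizers(scores, name_a, name_b, task_score_a, task_score_b):
    """Use the task agents each optimizer produced as the win-loss signal."""
    if task_score_a == task_score_b:
        return scores  # no signal on a tie in this sketch
    new_a, new_b = update_scores(
        scores[name_a], scores[name_b], task_score_a > task_score_b
    )
    return {**scores, name_a: new_a, name_b: new_b}


ratings = {"opt_a": 1000.0, "opt_b": 1000.0}
ratings = compare_optimizers(ratings, "opt_a", "opt_b", 0.9, 0.7)
```

The only input is the empirical score of each newly generated task agent, so the optimizers are rated without a separate evaluation run, in line with the "no additional overhead" claim.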

2604.23061 2026-05-28 cs.LG cs.AI 版本更新

C-MORAL: Controllable Multi-Objective Molecular Optimization with Reinforcement Alignment for LLMs

C-MORAL：面向大语言模型的基于强化对齐的可控多目标分子优化

Rui Gao, Youngseung Jeon, Swastik Roy, Morteza Ziyadi, Xiang 'Anthony' Chen

发表机构 * University of California, Los Angeles（加州大学洛杉矶分校）； Amazon（亚马逊）

AI总结提出C-MORAL框架，通过强化学习后训练结合分组相对优化、属性分数对齐和瓶颈敏感非线性奖励聚合，实现可控多目标分子优化，在C-MuMOInstruct和S$^2$-Bench MolOpt基准上取得最优性能。

Comments 26 pages, 7 figures

详情

AI中文摘要

大型语言模型（LLMs）在分子优化方面展现出潜力，但使其与选择性且相互竞争的药物设计约束对齐仍然具有挑战性。我们提出了C-MORAL，一个用于可控多目标分子优化的强化学习后训练框架。C-MORAL结合了基于分组的相对优化、针对异构目标的属性分数对齐以及瓶颈敏感的非线性奖励聚合，以提高跨竞争分子属性的稳定性。在C-MuMOInstruct和S$^2$-Bench MolOpt上的实验表明，C-MORAL在两个基准上均取得了比较方法中最佳的性能。在C-MuMOInstruct上，C-MORAL在域内任务中实现了最佳的成功优化率（SOR）48.9%，在域外任务中为39.5%，同时保持了骨架相似性。在S$^2$-Bench MolOpt上，它在LogP、MR和QED优化任务中也取得了最强结果。这些结果表明，C-MORAL是将分子LLMs与连续且受约束的分子设计目标对齐的有效方法。我们的代码和模型公开在https://github.com/Rwigie/C-MORAL。

英文摘要

Large language models (LLMs) show promise for molecular optimization, but aligning them with selective and competing drug-design constraints remains challenging. We propose C-MORAL, a reinforcement learning post-training framework for controllable multi-objective molecular optimization. C-MORAL combines group-based relative optimization, property score alignment for heterogeneous objectives, and bottleneck-sensitive non-linear reward aggregation to improve stability across competing molecular properties. Experiments on C-MuMOInstruct and S$^2$-Bench MolOpt show that C-MORAL achieves the best performance among compared methods on both benchmarks. On C-MuMOInstruct, C-MORAL achieves the best Success Optimized Rate (SOR) of 48.9% on in-domain tasks and 39.5% on out-of-domain tasks while preserving scaffold similarity. On S$^2$-Bench MolOpt, it also achieves the strongest results across LogP, MR, and QED optimization tasks. These results suggest that C-MORAL is an effective way to align molecular LLMs with continuous and constrained molecular design objectives. Our code and models are publicly available at https://github.com/Rwigie/C-MORAL.

2604.19072 2026-05-28 cs.LG cs.AI stat.ML 版本更新

S2MAM: Semi-supervised Meta Additive Model for Robust Estimation and Variable Selection

S2MAM: 半监督元加性模型用于稳健估计和变量选择

Xuelin Zhang, Hong Chen, Yingjie Wang, Tieliang Gong, Bin Gu

发表机构 * Huazhong Agricultural University（华中农业大学）； China University of Petroleum (East China)（中国石油大学（华东））； Xi'an Jiaotong University（西安交通大学）； Jilin University（吉林大学）

AI总结提出基于双层优化的半监督元加性模型，自动识别信息变量、更新相似矩阵并实现可解释预测，理论保证收敛性和泛化界，实验验证了鲁棒性和可解释性。

Comments Accepted at ICML 2026 (regular)

详情

AI中文摘要

基于流形正则化的半监督学习是一种经典的联合利用有标签和无标签数据进行学习的框架，其关键要求是未知边际分布的支持集具有黎曼流形的几何结构。通常，基于拉普拉斯-贝尔特拉米算子的流形正则化可以通过与整个训练数据及其对应的图拉普拉斯矩阵相关联的拉普拉斯正则化进行经验近似。然而，图拉普拉斯矩阵严重依赖于预先指定的相似度度量，并且在处理冗余或噪声输入变量时可能导致不适当的惩罚。为了解决上述问题，本文提出了一种新的半监督元加性模型（S$^2$MAM），该模型基于双层优化方案，能够自动识别信息变量、更新相似矩阵，并同时实现可解释的预测。为S$^2$MAM提供了理论保证，包括计算收敛性和统计泛化界。在4个合成数据集和12个真实世界数据集上进行的实验评估，涵盖了不同级别和类型的污染，验证了所提方法的鲁棒性和可解释性。

英文摘要

Semi-supervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, where the key requirement is that the support of the unknown marginal distribution has the geometric structure of a Riemannian manifold. Typically, the Laplace-Beltrami operator-based manifold regularization can be approximated empirically by the Laplacian regularization associated with the entire training data and its corresponding graph Laplacian matrix. However, the graph Laplacian matrix depends heavily on the prespecified similarity metric and may lead to inappropriate penalties when dealing with redundant or noisy input variables. To address these issues, this paper proposes a new Semi-Supervised Meta Additive Model (S$^2$MAM) based on a bilevel optimization scheme that automatically identifies informative variables, updates the similarity matrix, and simultaneously achieves interpretable predictions. Theoretical guarantees are provided for S$^2$MAM, including computational convergence and a statistical generalization bound. Experimental assessments across 4 synthetic and 12 real-world datasets, with varying levels and categories of corruption, validate the robustness and interpretability of the proposed approach.
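
The dependence of the Laplacian penalty on the similarity metric can be seen in a small numerical sketch. It computes $f^\top L f$ for an RBF similarity graph built only on the variables selected by a 0/1 mask. The RBF kernel, the unnormalized Laplacian, the hard mask, and the function name are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np


def laplacian_penalty(x, f, mask, bandwidth=1.0):
    """Return f^T L f for an RBF graph built on the masked input variables."""
    xm = x * mask                                         # zero out unselected variables
    sq_dists = ((xm[:, None, :] - xm[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-sq_dists / (2.0 * bandwidth ** 2))        # similarity matrix
    np.fill_diagonal(w, 0.0)
    lap = np.diag(w.sum(axis=1)) - w                      # unnormalized graph Laplacian
    return float(f @ lap @ f)


# Two points that agree on the informative variable but differ on a noisy one.
x = np.array([[0.0, 5.0], [0.0, -5.0]])
f = np.array([1.0, -1.0])
pen_masked = laplacian_penalty(x, f, mask=np.array([1.0, 0.0]))
pen_full = laplacian_penalty(x, f, mask=np.array([1.0, 1.0]))
```

With the noisy variable masked out, the two points are treated as neighbours and their disagreeing predictions are penalized (value 4). With it included, the similarity collapses to nearly zero and the penalty almost vanishes, which is the kind of inappropriate penalty the abstract attributes to a prespecified metric.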

2604.20857 2026-05-28 cs.IR cs.AI 版本更新

DiagramBank: A Quality-Audited Dataset of Scientific Schematic Diagrams with Multi-Level Document Context

DiagramBank: 一个经过质量审核的科学示意图数据集，包含多级文档上下文

Ling Yue, Tingwen Zhang, Jiaying Wang, Zhen Xu, Shaowu Pan

发表机构 * Rensselaer Polytechnic Institute（伦斯勒理工学院）； University of Chicago（芝加哥大学）

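AI summary: Introduces DiagramBank, a dataset of 57,100 schematic diagrams curated from OpenReview-hosted AI/ML venues; a cascade-filtering pipeline and a manual blind audit support its quality, and the preserved document context makes it usable for scientific-document understanding, diagram retrieval, and benchmark construction.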

AI中文摘要

科学论文使用示意图来传达方法、工作流程和系统结构，然而现有的科学图形语料库通常将它们与图表、截图和照片混合在一起，并且很少保留文档上下文。我们介绍了DiagramBank，一个从OpenReview主办的AI/ML会议中精选的57,100个示意图的质量审核数据集。每条记录将示意图图像与其论文标题、摘要、图表标题、文本内图表引用跨度、会议/年份元数据、来源字段和过滤标签关联起来。DiagramBank是用于科学文档理解、示意图检索、语料库分析和未来基准构建的可重用资源。我们描述了其提取和级联过滤管道、发布模式、置信度控制视图、数据集卡和索引工具。对发布的级联过滤记录进行的手动盲审估计精度为93.67%，另外的CLIP阈值分析描述了更简单过滤视图的精度-覆盖权衡。我们进一步提供了轻量级的元数据索引和编写示例，以说明下游协议，而不将这些工具视为独立方法。代码公开于：https://github.com/csml-rpi/DiagramBank。

英文摘要

Scientific papers use schematic diagrams to communicate methods, workflows, and system structure, yet existing scientific-figure corpora often mix them with plots, screenshots, and photographs and rarely preserve document context. We introduce DiagramBank, a quality-audited dataset of 57,100 schematic diagrams curated from OpenReview-hosted AI/ML venues. Each record links a diagram image to its paper title, abstract, figure caption, in-text figure-reference spans, venue/year metadata, provenance fields, and filtering labels. DiagramBank is a reusable resource for scientific-document understanding, diagram retrieval, corpus analysis, and future benchmark construction. We describe its extraction and cascade-filtering pipeline, release schema, confidence-controlled views, dataset card, and indexing utilities. A manual blind audit of the released cascade-filtered records estimates 93.67% precision, and a separate CLIP threshold analysis characterizes the precision--coverage trade-off for simpler filtering views. We further provide lightweight metadata-indexing and authoring examples to illustrate downstream protocols without treating these utilities as standalone methods. The code is public at: https://github.com/csml-rpi/DiagramBank.

2604.05673 2026-05-28 cs.RO cs.AI 版本更新

Rectified Schrödinger Bridge Matching for Few-Step Visual Navigation

整流薛定谔桥匹配用于少步视觉导航

Wuyang Luan, Junhui Li, Weiguang Zhao, Wenjian Zhang, Tieru Wu, Rui Ma

发表机构 * School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）； College of Computer Science, Chongqing University（重庆大学计算机学院）； Department of Computer Science, University of Liverpool（利物浦大学计算机科学系）； Changchun GenY Technology Co., Ltd.（长春GenY科技有限公司）

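AI summary: Proposes the Rectified Schrödinger Bridge Matching (RSBM) framework, which exploits velocity structure invariance and the linear reduction of variance with ε to obtain a high-fidelity generative policy in only 3 integration steps, meeting the low-latency demands of embodied AI.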

Comments 18 pages, 7 figures, 10 tables. Code available at https://github.com/WuyangLuan/RSBM

AI中文摘要

视觉导航是具身AI中的核心挑战，要求自主智能体将高维感官观测转化为连续的、长视界动作轨迹。基于扩散模型和薛定谔桥（SB）的生成策略能有效捕捉多模态动作分布，但由于高方差随机传输，需要数十个积分步骤，这对实时机器人控制构成了关键障碍。我们提出整流薛定谔桥匹配（RSBM），该框架利用标准薛定谔桥（ε=1，最大熵传输）与确定性最优传输（ε→0，如条件流匹配）之间共享的速度场结构，由单一熵正则化参数ε控制。我们证明两个关键结果：（1）条件速度场的函数形式在整个ε谱上保持不变（速度结构不变性），使单一网络能够服务于所有正则化强度；（2）减小ε线性降低条件速度方差，实现更稳定的粗步ODE积分。基于缩短传输距离的学习条件先验，RSBM在中间ε下运行，平衡多模态覆盖和路径直线性。实验表明，标准桥需要≥10步才能收敛，而RSBM在仅3个积分步骤中实现了超过94%的余弦相似度和92%的成功率——无需蒸馏或多阶段训练——显著缩小了高保真生成策略与具身AI低延迟需求之间的差距。

英文摘要

Visual navigation is a core challenge in Embodied AI, requiring autonomous agents to translate high-dimensional sensory observations into continuous, long-horizon action trajectories. While generative policies based on diffusion models and Schrödinger Bridges (SB) effectively capture multimodal action distributions, they require dozens of integration steps due to high-variance stochastic transport, posing a critical barrier for real-time robotic control. We propose Rectified Schrödinger Bridge Matching (RSBM), a framework that exploits a shared velocity-field structure between standard Schrödinger Bridges ($\varepsilon=1$, maximum-entropy transport) and deterministic Optimal Transport ($\varepsilon\to 0$, as in Conditional Flow Matching), controlled by a single entropic regularization parameter $\varepsilon$. We prove two key results: (1) the conditional velocity field's functional form is invariant across the entire $\varepsilon$-spectrum (Velocity Structure Invariance), enabling a single network to serve all regularization strengths; and (2) reducing $\varepsilon$ linearly decreases the conditional velocity variance, enabling more stable coarse-step ODE integration. Anchored to a learned conditional prior that shortens transport distance, RSBM operates at an intermediate $\varepsilon$ that balances multimodal coverage and path straightness. Empirically, while standard bridges require $\geq 10$ steps to converge, RSBM achieves over 94% cosine similarity and 92% success rate in merely 3 integration steps -- without distillation or multi-stage training -- substantially narrowing the gap between high-fidelity generative policies and the low-latency demands of Embodied AI.
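
The role of ε can be illustrated with a small sketch. It assumes the textbook Brownian bridge between two endpoints, whose noise variance is ε·t·(1−t), and a plain Euler integrator; `bridge_sample` and `euler_integrate` are hypothetical names. The sketch shows the analogous linear dependence of the bridge noise on ε and is not RSBM's parametrization or its conditional velocity variance.

```python
import math
import random


def bridge_sample(x0, x1, t, eps, rng):
    """Sample x_t from a Brownian bridge pinned at x0 (t=0) and x1 (t=1).

    The noise variance is eps * t * (1 - t): linear in eps, and zero in the
    eps -> 0 limit, where the path is the straight line from x0 to x1.
    """
    std = math.sqrt(eps * t * (1.0 - t))
    return [(1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, x1)]


def euler_integrate(velocity, x, steps=3):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with `steps` Euler steps."""
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

When the velocity field is close to constant along a path, as on a straight line, a coarse Euler scheme with a few steps already lands near the endpoint. That is the intuition behind trading entropy for path straightness.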

2604.13583 2026-05-28 cs.CL cs.AI 版本更新

BenGER Platform: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

BenGER平台：面向德国法律任务端到端基准测试的协作式Web平台

Sebastian Nagl, Matthias Grabmair

发表机构 * Technical University of Munich（慕尼黑工业大学）

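AI summary: Presents BenGER, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and multi-dimensional evaluation, supports multi-organization projects with tenant isolation, and makes legal-reasoning benchmarking transparent and reproducible end to end.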

Comments Preprint - Accepted at ICAIL 2026

2604.19355 2026-05-28 cs.LG cs.AI cs.CE 版本更新

LASER: Learning Active Sensing for Continuum Field Reconstruction

LASER: 用于连续场重建的学习主动感知

Huayu Deng, Jinghui Zhong, Xiangming Zhu, Yunbo Wang, Xiaokang Yang

发表机构 * MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University（人工智能MOE重点实验室、人工智能研究院、计算机科学学院、上海交通大学）

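AI summary: Proposes LASER, a framework that formulates active sensing as a partially observable Markov decision process and uses a continuum field latent world model with a reinforcement learning policy to simulate sensing scenarios in a latent imagination space, achieving high-fidelity reconstruction under sparse sensing.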

Comments Accepted by ICML 2026 (Oral)

AI中文摘要

连续物理场的高保真测量对于科学发现和工程设计至关重要，但在稀疏和受限感知条件下仍然具有挑战性。传统的重建方法通常依赖于固定的传感器布局，无法适应演变的物理状态。我们提出LASER，一个统一的闭环框架，将主动感知建模为部分可观测马尔可夫决策过程（POMDP）。其核心是采用连续场潜在世界模型，捕捉底层物理动力学并提供内在奖励反馈。这使得强化学习策略能够在潜在想象空间中模拟“假设”感知场景。通过根据预测的潜在状态调整传感器移动，LASER能够导航到当前观测之外可能的高信息区域。我们的实验表明，LASER在多种连续场中始终优于静态和离线优化策略，在稀疏条件下实现高保真重建。

英文摘要

High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate ''what-if'' sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.

2604.18530 2026-05-28 cs.AI 版本更新

OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning

OGER：一种用于混合强化学习的鲁棒离线引导探索奖励

Xinyu Ma, Mingzhou Xu, Xuebo Liu, Chang Jin, Qiang Wang, Derek F. Wong, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China（哈尔滨工业大学深圳研究院）； Hithink RoyalFlush Information Network, Hangzhou, China（杭州Hithink RoyalFlush信息网络）； Computer and Information Science, University of Macau, Macau, China（澳门大学计算机与信息科学学院）

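AI summary: Proposes OGER, a framework that unifies offline teacher guidance and online reinforcement learning through multi-teacher collaborative training and an entropy-based auxiliary exploration reward, improving the exploration ability of large language models on mathematical reasoning and generalization tasks.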

AI中文摘要

近年来，具有可验证奖励的强化学习（RLVR）的进展显著提升了大型语言模型（LLM）的推理能力，但模型在探索超出其初始策略分布的新轨迹方面仍存在困难。尽管已提出离线教师引导和基于熵的策略来解决这一问题，但它们往往缺乏深度融合或受限于模型自身能力。在本文中，我们提出OGER（离线引导探索奖励），一种新颖的框架，通过专门的奖励建模视角统一离线教师引导和在线强化学习。OGER采用多教师协作训练，并构建一个辅助探索奖励，利用离线轨迹和模型自身的熵来激励自主探索。在数学和通用推理基准上的大量实验表明，OGER持续优于竞争基线，在数学推理上取得显著提升，同时保持对域外任务的鲁棒泛化。我们提供了训练动态的全面分析，并进行了详细的消融研究，以验证我们基于熵的奖励调制的有效性。我们的代码可在 https://github.com/ecoli-hit/OGER.git 获取。

英文摘要

Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial policy distribution. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER (Offline-Guided Exploration Reward), a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER consistently outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. Our code is available at https://github.com/ecoli-hit/OGER.git.

2604.18235 2026-05-28 cs.CL cs.AI 版本更新

Negative Advantages Is a Double-Edged Sword: Calibrating advantages in GRPO for Search Agents

负优势是一把双刃剑：为搜索智能体校准GRPO中的优势

Jiayi Wu, Ruobing Xie, Zeqian Huang, Lei Jiang, Can Xu, Kangyang Luo, Bochen Lin, Ming Gao, Xiang Li

发表机构 * School of Data Science and Engineering, East China Normal University（华东师范大学数据科学与工程学院）； Tencent（腾讯）； Tsinghua University（清华大学）

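AI summary: To address the training instability that GRPO shows in multi-hop search, caused by coarse-grained advantage assignment and an imbalance between positive and negative advantages, proposes CalibAdv, which downscales excessive negative advantages at a fine-grained level and rebalances positive and negative advantages, improving model performance and training stability.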

AI中文摘要

搜索智能体通过与搜索引擎的多轮交互实现强大的问答性能，其中组相对策略优化（GRPO）是一种广泛使用的训练算法。然而，GRPO风格的算法在多跳搜索场景中仍面临若干挑战。首先，当最终答案错误时，正确的中间步骤常常受到惩罚。其次，训练高度不稳定，经常导致自然语言能力退化甚至灾难性训练崩溃。我们的分析将这些问题归因于粗粒度的优势分配以及正负优势之间的不平衡。为了解决这些问题，我们提出了CalibAdv，一种专门为搜索智能体设计的优势校准方法，能够更准确、更稳定地对惩罚和奖励进行建模。具体来说，CalibAdv利用中间步骤的正确性在细粒度上降低过度的负优势，然后进一步重新平衡正负优势以提高训练稳定性。重要的是，CalibAdv采用轻量级设计，从标准 rollout 信号中校准优势，使其简单且易于部署。在三个模型和七个基准上的大量实验表明，CalibAdv同时提升了模型性能和训练稳定性。我们的代码可在 https://github.com/wujwyi/CalibAdv 获取。

英文摘要

Search agents achieve strong question-answering performance through multi-turn interactions with search engines, with Group Relative Policy Optimization (GRPO) being a widely used training algorithm. However, GRPO-style algorithms still face several challenges in multi-hop search settings. First, correct intermediate steps are often penalized when the final answer is wrong. Second, training is highly unstable, often causing degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for search agents that enables more accurate and more stable modeling of penalties and rewards. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then further rebalances positive and negative advantages to improve training stability. Importantly, CalibAdv adopts a lightweight design that calibrates advantages from standard rollout signals, making it simple and easy to deploy. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. Our code is available at https://github.com/wujwyi/CalibAdv.
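
The failure mode described above can be seen in a small sketch. The first function is the usual GRPO-style group standardization of rewards. The second is a hypothetical calibration rule, introduced here only for illustration and not taken from CalibAdv: it shrinks a negative advantage in proportion to the fraction of intermediate steps judged correct.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each rollout's reward within its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def downscale_negative(advantages, step_correct_fraction):
    """Hypothetical calibration: scale a negative advantage by (1 - fraction of
    correct intermediate steps); positive advantages are left untouched."""
    return [a * (1.0 - frac) if a < 0 else a
            for a, frac in zip(advantages, step_correct_fraction)]
```

In a group with one correct rollout and three wrong ones, every wrong rollout receives the same negative advantage regardless of how many of its search steps were sound. A step-level signal such as `step_correct_fraction` is one way to make that penalty finer grained.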

2604.16774 2026-05-28 cs.CL cs.AI 版本更新

Retention Consequence in Lifecycle Memory Control

生命周期记忆控制中的保留后果

Jiarui Han

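AI summary: Studies how persistent memory fails after admission, treats confidence as carried-forward validity/support evidence, and introduces strength as an explicit lifecycle state for retention consequence; experiments with the StageMem controller show explicit retention consequence acting as a control in lifecycle settlement.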

AI中文摘要

持久记忆在成功准入后可能失效：一个前提被写入，然后成为无声的假设，后续维护将其视为普通残留进行压缩、降级或驱逐。我们将这种准入后失效作为生命周期控制问题来研究。现有记忆系统已经执行准入、更新、压缩、检索和驱逐。我们的主张并非此类系统缺乏维护，而是保留后果通常仅通过有效性、相似性、新近性、频率、重要性或摘要信号间接操作，而非作为单独的生命周期状态暴露。因此，我们将置信度视为前向有效性/支持证据，并引入强度作为保留后果的显式生命周期状态。我们在StageMem中实现了这一区分，这是一个小型的分阶段控制器，其瞬态、工作态和持久态存储暴露了提升、压缩和驱逐压力点。在受控的前提实现、压缩、压力和隐式启发式诊断实验中，实验区分了写入过少、保留错误的高线索内容、遗忘代价高昂的前提以及通过饱和保留所有内容。通过生命周期结算使用的显式保留后果，提供了在遗漏和囤积之间的控制面。针对目标准入后失效模式，结果支持持久记忆的生命周期观点：可靠性不仅取决于进入记忆的内容，还取决于准入有效性和保留后果在维护期间是否可用。

英文摘要

Persistent memory can fail after successful admission: a premise is written, then becomes a silent assumption, and later maintenance treats it as ordinary residue to be compressed, demoted, or evicted. We study this post-admission failure as a lifecycle-control problem. Existing memory systems already perform admission, update, compression, retrieval, and eviction. Our claim is not that such systems lack maintenance, but that retention consequence is often operationalized only indirectly through validity, similarity, recency, frequency, importance, or summarization signals rather than exposed as a separate lifecycle state. We therefore treat confidence as carried-forward validity/support evidence, and introduce strength as an explicit lifecycle state for retention consequence. We operationalize this distinction in StageMem, a small staged controller whose transient, working, and durable stores expose promotion, compression, and eviction pressure points. Across controlled premise-realization, compression, pressure, and implicit-heuristic diagnostics, the experiments separate writing too little, retaining the wrong high-cue content, forgetting costly premises, and preserving everything by saturation. Explicit retention consequence, used through lifecycle settlement, provides a control surface between omission and hoarding. For the targeted post-admission failure mode, the results support a lifecycle view of persistent memory: reliability depends not only on what enters memory, but on whether admission validity and retention consequence remain available during maintenance.

2604.16565 2026-05-28 cs.LG cs.AI 版本更新

Reasoning on the Manifold: Bidirectional Consistency for Self-Verification in Diffusion Language Models

流形上的推理：扩散语言模型中用于自我验证的双向一致性

Jiaoyang Ruan, Xin Gao, Yinda Chen, Hengyu Zeng, Liang Du, Guanghao Li, Jie Fu, Jian Pu

发表机构 * Institute of Science and Technology for Brain-Inspired Intelligence（脑启发智能科学与技术研究院）； Fudan University（复旦大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； University of Science and Technology of China（中国科学技术大学）； IEG, Tencent Inc.（腾讯IEG）

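AI summary: Proposes Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of a generated sequence through a forward-masking and backward-reconstruction cycle, for diagnosis, inference, and alignment of diffusion language models.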

Comments 31 pages, 7 figures. Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

Journal ref: Proceedings of the 43rd International Conference on Machine Learning, PMLR 306, 2026

AI中文摘要

尽管扩散大语言模型（dLLMs）在全局规划方面具有结构优势，但高效验证它们是否通过有效的推理轨迹得出正确答案仍然是一个关键挑战。在这项工作中，我们提出了一种几何视角：流形上的推理。我们假设有效的生成轨迹作为学习分布的高密度流形上的稳定吸引子存在，而无效路径则表现出流形外漂移。为了实现这一点，我们引入了双向流形一致性（BMC），这是一种无训练、无监督的度量，通过前向掩码和后向重建循环量化生成序列的稳定性。实验上，我们展示了BMC在整个推理生命周期中的多功能性：（1）在诊断中，它作为无需真实答案的解决方案有效性的鲁棒判别器；（2）在推理中，它能够通过拒绝重采样有效集中计算资源于复杂推理任务；（3）在对齐中，它作为密集的几何奖励，将稀疏的结果监督转化为细粒度的指导，使模型能够超越标准基线自我进化。我们的结果确立了内在几何稳定性作为dLLMs正确性的鲁棒指标。

英文摘要

While Diffusion Large Language Models (dLLMs) offer structural advantages for global planning, efficiently verifying that they arrive at correct answers via valid reasoning traces remains a critical challenge. In this work, we propose a geometric perspective: Reasoning on the Manifold. We hypothesize that valid generation trajectories reside as stable attractors on the high-density manifold of the learned distribution, whereas invalid paths exhibit off-manifold drift. To operationalize this, we introduce Bidirectional Manifold Consistency (BMC), a training-free, unsupervised metric that quantifies the stability of the generated sequence through a forward-masking and backward-reconstruction cycle. Empirically, we demonstrate BMC's versatility across the full reasoning lifecycle: (1) in Diagnosis, it serves as a robust discriminator of solution validity without ground truth answer; (2) in Inference, it enables rejection resampling to effectively concentrate computational resources on complex reasoning tasks; and (3) in Alignment, it functions as a dense geometric reward that transforms sparse outcome supervision into fine-grained guidance, empowering models to self-evolve beyond standard baselines. Our results establish intrinsic geometric stability as a robust indicator of correctness for dLLMs.

2512.15791 2026-05-28 cs.CY cs.AI cs.CL 版本更新

Evaluation of AI Ethics Tools in Language Models: A Developers' Perspective Case Study

语言模型中AI伦理工具评估：开发者视角案例研究

Jhessica Silva, Diego A. B. Moreira, Gabriel O. dos Santos, Alef Ferreira, Helena Maia, Sandra Avila, Helio Pedrini

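AI summary: Through literature screening and developer interviews, evaluates four AI ethics tools applied to Portuguese language models and finds that they guide general ethical considerations but do not cover model-specific aspects.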

Comments 7 figures, 11 tables. Accepted for publication in AI and Ethics

DOI: 10.1007/s43681-025-00914-2

AI中文摘要

在人工智能中，语言模型因能够通过文本生成模拟与人类真实对话的系统被广泛采用而变得日益重要。由于它们对社会的影响，开发和部署这些语言模型必须负责任地进行，关注其负面影响和可能的危害。在此背景下，AI伦理工具（AIETs）的出版物数量近期有所增加。这些AIETs旨在通过引入公认的价值观来指导AI的设计、开发和使用阶段，帮助开发者、公司、政府和其他利益相关者建立对其技术的信任、透明度和责任。然而，许多AIETs缺乏良好的文档、使用示例以及在实践中有效性的证明。本文提出了一种评估语言模型中AIETs的方法。我们的方法包括对213个AIETs进行广泛的文献调查，在应用纳入和排除标准后，我们选择了四个AIETs：模型卡片、ALTAI、事实表以及危害建模。为了评估，我们将AIETs应用于为葡萄牙语开发的语言模型，并对它们的开发者进行了35小时的访谈。评估考虑了开发者对AIETs在帮助识别其模型伦理考量方面的使用和质量的看法。结果表明，所应用的AIETs可作为制定关于语言模型的一般伦理考量的指南。然而，我们注意到它们并未解决这些模型的独特方面，例如习语表达。此外，这些AIETs未能帮助识别葡萄牙语模型的潜在负面影响。

英文摘要

In Artificial Intelligence (AI), language models have gained significant importance due to the widespread adoption of systems capable of simulating realistic conversations with humans through text generation. Because of their impact on society, developing and deploying these language models must be done responsibly, with attention to their negative impacts and possible harms. In this scenario, the number of AI Ethics Tools (AIETs) publications has recently increased. These AIETs are designed to help developers, companies, governments, and other stakeholders establish trust, transparency, and responsibility with their technologies by bringing accepted values to guide AI's design, development, and use stages. However, many AIETs lack good documentation, examples of use, and proof of their effectiveness in practice. This paper presents a methodology for evaluating AIETs in language models. Our approach involved an extensive literature survey on 213 AIETs, and after applying inclusion and exclusion criteria, we selected four AIETs: Model Cards, ALTAI, FactSheets, and Harms Modeling. For evaluation, we applied AIETs to language models developed for the Portuguese language, conducting 35 hours of interviews with their developers. The evaluation considered the developers' perspective on the AIETs' use and quality in helping to identify ethical considerations about their model. The results suggest that the applied AIETs serve as a guide for formulating general ethical considerations about language models. However, we note that they do not address unique aspects of these models, such as idiomatic expressions. Additionally, these AIETs did not help to identify potential negative impacts of models for the Portuguese language.

2604.15898 2026-05-28 cs.AI 版本更新

Towards Rigorous Explainability by Feature Attribution

通过特征归因实现严格可解释性

Olivier Létoffé, Xuanxiang Huang, Joao Marques-Silva

发表机构 * IRIT, University of Toulouse, France； Nanyang Technological University, Singapore； ICREA & Univ. Lleida, Spain

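AI summary: Surveys progress on assigning relative feature importance with rigorous symbolic explainable-AI methods in place of non-rigorous, non-symbolic methods such as SHAP.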

2604.14585 2026-05-28 cs.AI cs.CL 版本更新

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

提示优化如同抛硬币：诊断其在复合AI系统中何时有效

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

发表机构 * AWS Generative AI Innovation Center（AWS生成式AI创新中心）； HSBC Holdings Plc., HSBC Technology Center, China（汇丰控股有限公司，汇丰技术中心，中国）

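AI summary: Large-scale experiments show that prompt optimization in compound AI systems is unreliable and helps only when the task has exploitable output structure; the paper also provides a two-stage diagnostic.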

Comments Accepted to the 1st Workshop on Combining Theory and Benchmarks, CTB@ICML 2026, Seoul, South Korea

AI中文摘要

复合AI系统中的提示优化在统计上与抛硬币无异：在Claude Haiku 4.5上的72次优化运行（6种方法 × 4个任务 × 3次重复）中，49%的得分低于零样本；在Amazon Nova Lite上，失败率更高。然而，在一个任务上，所有六种方法相比零样本提升了高达+6.8分。是什么区分了成功与失败？我们通过18,000次网格评估和144次优化运行进行了调查，按照必须回答的顺序测试了TextGrad和DSPy等端到端优化工具背后的两个假设：(A) 智能体提示存在交互，需要联合优化而非独立优化；(B) 单个提示本身值得优化。交互效应从未显著（p > 0.52，所有F < 1.0），并且优化仅在任务具有可挖掘的输出结构时才有帮助：即模型可以生成但不会默认采用的格式。我们进一步给出了机制性解释：指令微调将输入措辞压缩成狭窄的输出分布，消除了联合优化所依赖的措辞敏感性。我们提供了一个两阶段诊断：一个80美元的ANOVA预测试用于智能体耦合，以及一个10分钟的头空间测试，用于预测优化是否值得，从而将抛硬币转变为知情决策。

英文摘要

Prompt optimization in compound AI systems is statistically indistinguishable from a coin flip: across 72 optimization runs on Claude Haiku 4.5 (6 methods $\times$ 4 tasks $\times$ 3 repeats), 49% score below zero-shot; on Amazon Nova Lite, the failure rate is even higher. Yet on one task, all six methods improve over zero-shot by up to $+6.8$ points. What distinguishes success from failure? We investigate with 18,000 grid evaluations and 144 optimization runs, testing two assumptions behind end-to-end optimization tools like TextGrad and DSPy, in the order they must be answered: (A) agent prompts interact, requiring joint rather than independent optimization, and (B) individual prompts are worth optimizing at all. Interaction effects are never significant ($p > 0.52$, all $F < 1.0$), and optimization helps only when the task has exploitable output structure: a format the model can produce but does not default to. We further give a mechanistic account: instruction-tuning compresses input phrasing into a narrow output distribution, eliminating the very phrasing-sensitivity that joint optimization assumes. We provide a two-stage diagnostic: an \$80 ANOVA pre-test for agent coupling, and a 10-minute headroom test that predicts whether optimization is worthwhile, turning a coin flip into an informed decision.

2604.14356 2026-05-28 cs.CL cs.AI 版本更新

When PCOS Meets Eating Disorders: An Explainable AI Approach to Detecting the Hidden Triple Burden

当多囊卵巢综合征遇上进食障碍：一种可解释的AI方法检测隐藏的三重负担

Apoorv Prasad, Susan McRoy

发表机构 * University of Wisconsin - Milwaukee（威斯康星大学密尔沃基分校）

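AI summary: Fine-tunes small open-source language models to detect, with grounded explanations, the triple burden of body image distress, disordered eating, and metabolic challenges in social media posts about PCOS; the best model reaches 75.3% exact-match accuracy on 150 test posts.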

AI中文摘要

患有多囊卵巢综合征（PCOS）的女性面临身体形象困扰、进食障碍和代谢挑战的显著升高风险，然而现有的自然语言处理方法在检测这些状况时缺乏透明度，且无法识别共病表现。我们开发了小型开源语言模型，以基于可解释性的方式自动检测社交媒体帖子中的这种三重负担。我们从六个子论坛收集了1000条与PCOS相关的帖子，由两名经过训练的标注员根据Lee等人（2017）临床框架的操作化指南对帖子进行标注。使用低秩适配对三个模型（Gemma-2-2B、Qwen3-1.7B、DeepSeek-R1-Distill-Qwen-1.5B）进行微调，以生成带有文本证据的结构化解释。最佳模型在150条保留帖子上实现了75.3%的精确匹配准确率，具有稳健的共病检测能力和强可解释性。性能随诊断复杂性下降，表明其最佳用途是筛查而非自主诊断。

英文摘要

Women with polycystic ovary syndrome (PCOS) face substantially elevated risks of body image distress, disordered eating, and metabolic challenges, yet existing natural language processing approaches for detecting these conditions lack transparency and cannot identify co-occurring presentations. We developed small, open-source language models to automatically detect this triple burden in social media posts with grounded explainability. We collected 1,000 PCOS-related posts from six subreddits, with two trained annotators labeling posts using guidelines operationalizing Lee et al. (2017) clinical framework. Three models (Gemma-2-2B, Qwen3-1.7B, DeepSeek-R1-Distill-Qwen-1.5B) were fine-tuned using Low-Rank Adaptation to generate structured explanations with textual evidence. The best model achieved 75.3 percent exact match accuracy on 150 held-out posts, with robust comorbidity detection and strong explainability. Performance declined with diagnostic complexity, indicating their best use is for screening rather than autonomous diagnosis.
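
Exact-match accuracy is stricter than per-label accuracy, which matters when reading the 75.3 percent figure. The sketch below assumes each post carries a set of labels drawn from the three burdens and counts a post as correct only when the whole predicted set equals the gold set; the label names and the function name are hypothetical.

```python
def exact_match_accuracy(gold, pred):
    """Fraction of posts whose full predicted label set equals the gold label set."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    hits = sum(1 for g, p in zip(gold, pred) if frozenset(g) == frozenset(p))
    return hits / len(gold)


# Toy example with hypothetical label names.
gold = [{"body_image"}, {"body_image", "eating"}, set(), {"metabolic"}]
pred = [{"body_image"}, {"body_image"}, set(), {"metabolic"}]
score = exact_match_accuracy(gold, pred)  # second post misses one label
```

If the reported figure is a plain ratio over the 150 held-out posts, it is consistent with 113 fully correct posts, since 113/150 rounds to 75.3 percent. A post with one of two co-occurring labels missed counts as entirely wrong, which is one reason performance can fall as diagnostic complexity rises.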

2604.12955 2026-05-28 cs.AI 版本更新

Text2Model: Modeling Copilots for Text-to-Model Translation

Text2Model: 用于文本到模型翻译的建模副驾驶

Serdar Kadioglu, Karthik Uppuluri, Akash Singirikonda

发表机构 * AI Center of Excellence, Fidelity Investments（富达投资人工智能卓越中心）； Department of Computer Science, Brown University（布朗大学计算机科学系）

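AI summary: Introduces Text2Model and Text2Zinc, which use several LLM strategies within a unified architecture and dataset, in a solver-agnostic way, to translate text into models of combinatorial optimization and satisfaction problems, and open-sources the copilots and leaderboard to help close the performance gap.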

Comments AAAI'25 Bridge Program on Machine Learning and Operations Research; CPAIOR'26 Master Class on LLMs for CP/OR

AI中文摘要

利用大型语言模型（LLM）进行文本到模型翻译和优化任务的研究兴趣日益增长。本文通过引入\textsc{Text2Model}和\textsc{Text2Zinc}来推进这一研究方向。\textsc{Text2Model}是一套基于多种LLM策略（复杂度各异）的副驾驶，并附带在线排行榜。\textsc{Text2Zinc}是一个跨领域数据集，用于捕捉自然语言指定的优化和满足问题，并附带内置AI助手的交互式编辑器。虽然已有新兴文献使用LLM将组合问题翻译为形式化模型，但我们的工作是首次尝试将满足问题和优化问题集成在\textit{统一架构}和\textit{数据集}中。此外，我们的方法是\textit{求解器无关的}，不同于现有专注于翻译为特定求解器模型的工作。为此，我们利用\textsc{MiniZinc}的求解器和范式无关的建模能力来表述组合问题。我们进行了全面实验，比较了多种单次和多次调用策略的执行和解准确率，包括：零样本提示、思维链推理、通过知识图谱的中间表示、基于语法的语法编码，以及将模型分解为顺序子任务的代理方法。我们的副驾驶策略具有竞争力，并在部分方面改进了该领域的最新研究。我们的发现表明，虽然LLM有前景，但尚未成为组合建模的一键式技术。我们开源了\textsc{Text2Model}副驾驶和排行榜，以及\textsc{Text2Zinc}和交互式编辑器，以支持缩小这一性能差距。

英文摘要

There is growing interest in leveraging large language models (LLMs) for text-to-model translation and optimization tasks. This paper aims to advance this line of research by introducing \textsc{Text2Model} and \textsc{Text2Zinc}. \textsc{Text2Model} is a suite of copilots based on several LLM strategies with varying complexity, along with an online leaderboard. \textsc{Text2Zinc} is a cross-domain dataset for capturing optimization and satisfaction problems specified in natural language, along with an interactive editor with built-in AI assistant. While there is an emerging literature on using LLMs for translating combinatorial problems into formal models, our work is the first attempt to integrate \textit{both} satisfaction and optimization problems within a \textit{unified architecture} and \textit{dataset}. Moreover, our approach is \textit{solver-agnostic} unlike existing work that focuses on translation to a solver-specific model. To achieve this, we leverage \textsc{MiniZinc}'s solver-and-paradigm-agnostic modeling capabilities to formulate combinatorial problems. We conduct comprehensive experiments to compare execution and solution accuracy across several single- and multi-call strategies, including; zero-shot prompting, chain-of-thought reasoning, intermediate representations via knowledge-graphs, grammar-based syntax encoding, and agentic approaches that decompose the model into sequential sub-tasks. Our copilot strategies are competitive, and in parts improve, recent research in this domain. Our findings indicate that while LLMs are promising they are not yet a push-button technology for combinatorial modeling. We contribute \textsc{Text2Model} copilots and leaderboard, and \textsc{Text2Zinc} and interactive editor to open-source to support closing this performance gap.

2506.01247 2026-05-28 cs.CV cs.AI cs.LG 版本更新

面向三值逻辑问答的成分一致性引导解码

Tianyi Huang, Ming Hou, Jiaheng Su, Yutong Zhang, Ziling Zhang

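AI summary: To address negation inconsistency and epistemic Unknown in LLM three-way logical question answering, proposes CGD-PD, a lightweight test-time decoding layer combining neural three-way classification, symbolic negation-consistency projection, and targeted binary entailment probes; on FOLIO it improves accuracy by 4.4 to 6.8 points and reduces Unknown predictions.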

Comments Accepted at the ICML 2026 Workshop on Compositional Learning: Safety, Interpretability, and Agents

AI中文摘要

三值逻辑问答（QA）在给定前提集 $S$ 的情况下，将 $\text{True}$、$\text{False}$ 或 $\text{Unknown}$ 之一分配给假设 $H$。我们将此任务视为一个紧凑的成分推理问题：在确定性否定映射下，$H$ 和机械否定假设 $\neg H$ 的预测应保持一致。尽管结构简单，大语言模型（LLM）可能表现出两种实际失败模式：(i) 否定不一致，即对 $H$ 和 $\neg H$ 的回答违反了所需的标签映射；(ii) 认知 $\text{Unknown}$，即模型在某一侧被蕴含时仍选择弃权。我们引入 CGD-PD，一个轻量级、无需训练的测试时层，结合神经三值分类、符号否定一致性投影和定向二值蕴含探测。在 FOLIO 一阶逻辑字段的一个验证划分上，CGD-PD 在 GPT-5.2 上提升了 4.4 个百分点的准确率，在 Claude Sonnet 4.5 上提升了 6.8 个百分点，同时减少了 $\text{Unknown}$ 预测和认知弃权。这些结果提供了一个受控的概念验证，表明推理时的简单逻辑组合有助于评估和提高 LLM 推理可靠性；但本身并不足以证明在此形式化基准设置之外的鲁棒性。

英文摘要

Three-way logical question answering (QA) assigns one of $\text{True}$, $\text{False}$, or $\text{Unknown}$ to a hypothesis $H$ given a premise set $S$. We study this task as a compact compositional inference problem: predictions for $H$ and for a mechanically negated hypothesis $\neg H$ should agree under a deterministic negation map. Despite this simple structure, large language models (LLMs) can exhibit two practical failure modes: (i) negation inconsistency, where answers to $H$ and $\neg H$ violate the required label mapping, and (ii) epistemic $\text{Unknown}$, where the model abstains even when one side is entailed. We introduce CGD-PD, a lightweight, training-free test-time layer that combines neural 3-way classification, symbolic negation-consistency projection, and targeted binary entailment probes. On one validation split of FOLIO's first-order logic fields, CGD-PD improves accuracy by 4.4 points on GPT-5.2 and 6.8 points on Claude Sonnet 4.5, while reducing $\text{Unknown}$ predictions and epistemic abstention. These results provide a controlled proof of concept that simple logical composition at inference time can help evaluate and improve LLM reasoning reliability; they do not, by themselves, establish robustness beyond this formal benchmark setting.
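
The negation map and the consistency check can be written down directly. The map itself follows from the task definition; the `project` rule below is a hypothetical projection chosen for illustration and is not claimed to be the one CGD-PD uses.

```python
# Deterministic negation map: negating H swaps True/False and keeps Unknown.
NEGATE = {"True": "False", "False": "True", "Unknown": "Unknown"}


def is_consistent(label_h, label_not_h):
    """Labels for H and not-H agree when the second is the negation of the first."""
    return label_not_h == NEGATE[label_h]


def project(label_h, label_not_h):
    """Hypothetical projection onto consistent pairs, returning a label for H.

    - consistent pair: keep the label for H
    - exactly one side is Unknown: trust the definite side
    - both definite but contradictory (e.g. True/True): fall back to Unknown
    """
    if is_consistent(label_h, label_not_h):
        return label_h
    if label_h == "Unknown":
        return NEGATE[label_not_h]
    if label_not_h == "Unknown":
        return label_h
    return "Unknown"
```

The second branch is where epistemic Unknown shows up: the model abstains on one side although the other side is decided. A targeted entailment probe, as described in the abstract, would be one way to resolve the third branch instead of abstaining.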

2604.04074 2026-05-28 cs.AI cs.LG 版本更新

FactReview: Evidence-Grounded Peer Review with Execution-Based Claim Verification

FactReview：基于执行式声明验证的证据驱动同行评审

Ling Yue, Chaoqian Ouyang, Hang Xu, Ruijun Huang, Yuchen Liu, Libin Zheng, Wei Liu, Shaowu Pan, Shimin Di, Min-Ling Zhang

发表机构 * Rensselaer Polytechnic Institute（伦斯勒理工学院）； Sun Yat-sen University（中山大学）； Southeast University（东南大学）； The Hong Kong University of Science and Technology（香港科技大学）

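AI summary: Proposes FactReview, a system that extracts review-relevant claims, grounds them in related work, and, when code is available, executes released artifacts under a fixed repair budget to audit empirical claims; it covers 84% of claims, scores 4.86/5 on review quality, and reduces review time by 58%.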

AI中文摘要

基于LLM的评审系统通常仅以手稿为输入，使得文献和基于代码的声明难以验证。我们提出FactReview，一个提取与评审相关的声明、将其与相关工作关联，并在代码可用时在固定修复预算下执行发布工件以审计经验声明的系统。在35篇ML论文和463个基准主要声明中，FactReview覆盖了84%的声明。在证据感知评分标准下，其评审在整体质量上得分为4.86/5，比DeepReview-v2高0.7，比匹配的OpenReview评论高1.5。移除执行证据会改变17%的声明状态，超过任何其他单一证据来源。在一项评审辅助研究中，FactReview将平均评审时间减少了58%，同时将基准声明覆盖率从87%提高到99%。我们认为LLM评审者应审计经验声明，而非做出接受或拒绝的决定。代码公开于：https://github.com/DEFENSE-SEU/FactReview。

英文摘要

LLM-based reviewing systems typically take only the manuscript as input, leaving literature and code-based claims hard to verify. We present FactReview, a system that extracts review-relevant claims, grounds them in related work, and, when code is available, executes released artifacts under a fixed repair budget to audit empirical claims. Across 35 ML papers and 463 benchmark major claims, FactReview covers 84% of claims. Under an evidence-aware rubric, its reviews score 4.86/5 in overall quality, 0.7 above DeepReview-v2 and 1.5 above matched OpenReview comments. Removing execution evidence changes 17% of claim statuses, more than any other single evidence source. In a reviewer-assistance study, FactReview reduces mean review time by 58% while raising benchmark claim coverage from 87% to 99%. We argue that LLM reviewers should audit empirical claims, not make accept-reject decisions. The code is public at: https://github.com/DEFENSE-SEU/FactReview.

2604.02645 2026-05-28 cs.CL cs.AI 版本更新

Speaking of Language: Reflections on Metalanguage Research in NLP

论语言：NLP中元语言研究的思考

Nathan Schneider, Antonios Anastasopoulos

发表机构 * Georgetown University（乔治城大学）； George Mason University（乔治梅森大学）

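AI summary: Defines the concept of metalanguage, relates it to NLP and LLMs, describes metalanguage-centered research from two labs, discusses four dimensions of metalanguage and metalinguistic tasks, and proposes future research directions.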

Comments To appear at the Big Picture Workshop at ACL 2026. Camera-ready version

2604.01604 2026-05-28 cs.AI 版本更新

大型语言模型的结构化智能体蒸馏

Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Harvard University（哈佛大学）； MIT（麻省理工学院）； Northeastern University（东北大学）； Adobe Research（Adobe研究）； National University of Singapore（新加坡国立大学）； University of Georgia（佐治亚大学）； Florida International University（佛罗里达国际大学）

AI总结提出结构化智能体蒸馏框架，通过分段对齐推理和动作跨度，将大型语言模型智能体压缩为小型学生模型，在保持决策性能的同时降低推理成本。

详情

Journal ref: The 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

AI中文摘要

大型语言模型（LLMs）通过交错推理和动作（如ReAct风格框架）展现出作为决策智能体的强大能力。然而，它们的实际部署受到高推理成本和大模型规模的限制。我们提出结构化智能体蒸馏，一种将基于大型LLM的智能体压缩为更小的学生模型的框架，同时保持推理保真度和动作一致性。与标准的token级蒸馏不同，我们的方法将轨迹分割为[REASON]和[ACT]跨度，应用分段特定损失来使每个组件与教师行为对齐。这种结构感知的监督使紧凑的智能体能够更好地复制教师的决策过程。在ALFWorld、HotPotQA-ReAct和WebShop上的实验表明，我们的方法始终优于token级和模仿学习基线，在性能下降最小的情况下实现了显著的压缩。缩放和消融结果进一步强调了跨度级对齐对于高效可部署智能体的重要性。

英文摘要

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks. Yet, their practical deployment is constrained by high inference costs and large model sizes. We propose Structured Agent Distillation, a framework that compresses large LLM-based agents into smaller student models while preserving both reasoning fidelity and action consistency. Unlike standard token-level distillation, our method segments trajectories into [REASON] and [ACT] spans, applying segment-specific losses to align each component with the teacher's behavior. This structure-aware supervision enables compact agents to better replicate the teacher's decision process. Experiments on ALFWorld, HotPotQA-ReAct, and WebShop show that our approach consistently outperforms token-level and imitation learning baselines, achieving significant compression with minimal performance drop. Scaling and ablation results further highlight the importance of span-level alignment for efficient and deployable agents.
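
The span-level supervision described above can be illustrated with a minimal sketch. It assumes each token already carries a `"REASON"` or `"ACT"` label and that teacher and student expose per-token probability distributions; the function names and the weights `w_reason` and `w_act` are hypothetical and not taken from the paper, whose actual losses may differ.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def span_distillation_loss(teacher, student, spans, w_reason=1.0, w_act=1.0):
    """Average teacher-student KL separately over REASON and ACT tokens.

    teacher, student: per-token probability distributions (list of lists).
    spans: per-token labels, each "REASON" or "ACT".
    w_reason, w_act: hypothetical weights for the two segment losses.
    """
    totals = {"REASON": 0.0, "ACT": 0.0}
    counts = {"REASON": 0, "ACT": 0}
    for p, q, label in zip(teacher, student, spans):
        totals[label] += kl_divergence(p, q)
        counts[label] += 1
    reason = totals["REASON"] / max(counts["REASON"], 1)
    act = totals["ACT"] / max(counts["ACT"], 1)
    return w_reason * reason + w_act * act, reason, act
```

Keeping the two averages separate means a long reasoning span cannot drown out a short action span, which is the motivation the abstract gives for moving away from uniform token-level distillation.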

2603.24631 2026-05-28 cs.SE cs.AI 版本更新

Coherence Collapse: Diagnosing Why Code Agents Fail After Reaching the Right Code

一致性崩溃：诊断代码智能体在到达正确代码后失败的原因

Myeongsoo Kim, Dingmin Wang, Siwei Cui, Farima Farmahinifarahani, Terry Yue Zhuo, Shweta Garg, Baishakhi Ray, Rajdeep Mukherjee, Varun Kumar

发表机构 * AWS AI Labs（AWS AI实验室）； Monash University（莫纳什大学）

AI总结通过轨迹分解分析，发现代码智能体在定位正确后仍因编辑质量缺陷（尤其是“一致性崩溃”）而失败，并提出了无需参考的共识驱动改进方法。

详情

AI中文摘要

代码智能体解决了SWE-bench Verified中65-70%的问题，但Pass@1无法告诉我们其余问题失败的原因，并且我们表明，没有轨迹数据，有能力的模型的失败会被系统性地误诊。我们引入了TRAJEVAL，一种无需训练的智能体轨迹分解方法，将其分解为参考补丁对齐的搜索、读取和编辑阶段，并应用于跨越三种架构和七个模型的16,758条轨迹。有能力的模型的主要失败并非定位问题：SWE-Agent和OpenHands上60-69%的失败到达并编辑了正确的函数，但仍然产生不正确的补丁，并且这种模式在仅使用bash的LiveSWEAgent上对大多数模型持续存在。在这个编辑质量残差中，我们识别出“一致性崩溃”，即智能体到达正确的代码然后覆盖或破坏它，作为最大的主题，在SWE-bench Verified和多语言PolyBench Verified中重复出现。在5个案例中，智能体在轨迹中途生成了与黄金参考补丁逐位相同的补丁，随后又将其破坏；在SWE-bench Docker测试框架下验证，编辑提交检查点恢复了全部5个案例。一种无需参考的共识驱动变体在GPT-5上产生了方向性的+3.0个百分点Pass@1测量（p=0.08）。

英文摘要

Code agents resolve 65-70% of SWE-bench Verified issues, but Pass@1 cannot tell us why the rest fail, and, as we show, capable-model failures are systematically misdiagnosed without trajectory data. We introduce TRAJEVAL, a training-free decomposition of agent trajectories into reference-patch-aligned search, read, and edit stages, and apply it across 16,758 trajectories spanning three architectures and seven models. The dominant failure of capable models is not localization: 60-69% of failures on SWE-Agent and OpenHands reach and edit the correct functions yet still produce incorrect patches, and the pattern persists for most models on the bash-only LiveSWEAgent. Within this Edit-Quality residual, we identify Coherence Collapse, where the agent reaches correct code and then overwrites or thrashes it, as the largest theme, replicating across SWE-bench Verified and the multilingual PolyBench Verified. In 5 cases, the agent produces a patch bit-identical to the gold reference mid-trajectory and destroys it later; an edit-commit checkpoint recovers all 5 against the SWE-bench Docker harness. A reference-free consensus-driven variant yields a directional +3.0 pp Pass@1 measurement on GPT-5 (p=0.08).

2603.22335 2026-05-28 cs.IR cs.AI 版本更新

Causal Direct Preference Optimization for Distributionally Robust Generative Recommendation

因果直接偏好优化用于分布鲁棒的生成式推荐

Chu Zhao, Enneng Yang, Jianzhe Zhao, Guibing Guo

发表机构 * Northeastern University, Shenyang, China（东北大学，沈阳，中国）； Shenzhen Campus of Sun Yat-sen University, China（中山大学深圳校区，中国）

AI总结针对直接偏好优化（DPO）在生成式推荐中放大环境混杂因素导致的虚假相关性问题，提出CausalDPO，通过因果不变性学习、后门调整和软聚类环境建模来提升分布外泛化性能。

Comments 22 pages, 3 figures

详情

AI中文摘要

直接偏好优化（DPO）通过最小化偏好对齐损失，引导大型语言模型（LLMs）生成与用户历史行为分布一致的推荐。然而，我们的系统实证研究和理论分析表明，DPO倾向于放大对齐过程中由环境混杂因素引起的虚假相关性，显著削弱了基于LLM的生成式推荐方法在分布外（OOD）场景下的泛化能力。为缓解这一问题，我们提出CausalDPO，它是DPO的扩展，引入了因果不变性学习机制。该方法在偏好对齐阶段采用后门调整策略以消除环境混杂因素的干扰，使用软聚类方法显式建模潜在环境分布，并通过不变性约束增强跨环境的鲁棒一致性。理论分析表明，CausalDPO能够有效捕捉用户在多环境下的稳定偏好结构，从而提升基于LLM的推荐模型的OOD泛化性能。我们在四种代表性分布偏移设置下进行了大量实验，验证了CausalDPO的有效性，在四个评估指标上平均性能提升17.17%。

英文摘要

Direct Preference Optimization (DPO) guides large language models (LLMs) to generate recommendations aligned with user historical behavior distributions by minimizing preference alignment loss. However, our systematic empirical research and theoretical analysis reveal that DPO tends to amplify spurious correlations caused by environmental confounders during the alignment process, significantly undermining the generalization capability of LLM-based generative recommendation methods in out-of-distribution (OOD) scenarios. To mitigate this issue, we propose CausalDPO, an extension of DPO that incorporates a causal invariance learning mechanism. This method introduces a backdoor adjustment strategy during the preference alignment phase to eliminate interference from environmental confounders, explicitly models the latent environmental distribution using a soft clustering approach, and enhances robust consistency across diverse environments through invariance constraints. Theoretical analysis demonstrates that CausalDPO can effectively capture users' stable preference structures across multiple environments, thereby improving the OOD generalization performance of LLM-based recommendation models. We conduct extensive experiments under four representative distribution shift settings to validate the effectiveness of CausalDPO, achieving an average performance improvement of 17.17% across four evaluation metrics.
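
For orientation, the sketch below shows the standard DPO loss for a single preference pair, followed by one hypothetical way to combine per-environment losses. The `env_aggregate` function, its prior-weighted mean, and its variance penalty `lam` are illustrative assumptions about how a backdoor-style adjustment with an invariance constraint could look; they are not the CausalDPO objective as published.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    if margin < -30:
        # -log(sigmoid(m)) is approximately -m for very negative m; avoids overflow
        return -margin
    # -log(sigmoid(margin)) written as softplus(-margin)
    return math.log1p(math.exp(-margin))

def env_aggregate(losses_by_env, env_prior, lam=1.0):
    """Hypothetical aggregate: prior-weighted mean loss plus a variance penalty.

    losses_by_env[e]: loss computed under latent environment e.
    env_prior[e]: soft-cluster weight of environment e (weights sum to 1).
    lam: assumed strength of the invariance (variance) penalty.
    """
    mean = sum(p * l for p, l in zip(env_prior, losses_by_env))
    var = sum(p * (l - mean) ** 2 for p, l in zip(env_prior, losses_by_env))
    return mean + lam * var
```

The variance term is zero when every environment yields the same loss, so it only penalizes preferences that hold in some environments and not others.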

2601.21207 2026-05-28 cs.LG cs.AI math.AT 版本更新

A Sheaf-Theoretic and Topological Perspective on Complex Network Modeling and Attention Mechanisms in Graph Neural Models

图神经模型中复杂网络建模与注意力机制的层论与拓扑视角

Chuan-Shen Hu

发表机构 * National Central University（国立中央大学）

AI总结提出细胞层论框架分析图神经网络中节点特征与边权重的局部一致性与调和性，并引入基于拓扑数据分析的多尺度扩展以捕获层次特征交互。

详情

AI中文摘要

组合与拓扑结构，如图、单纯复形和胞腔复形，构成了几何与拓扑深度学习（GDL和TDL）架构的基础。这些模型在此类域上聚合信号、整合局部特征，并为多样化的实际应用生成表示。然而，训练过程中GDL和TDL特征的分布与扩散行为仍是一个开放且未充分探索的问题。受此空白启发，我们引入了一个细胞层论框架，用于建模和分析基于图的架构中节点特征与边权重的局部一致性与调和性。通过层结构追踪局部特征对齐与一致性，该框架提供了特征扩散与聚合的拓扑视角。此外，受拓扑数据分析（TDA）启发，提出了一个多尺度扩展，以捕获图模型中层次化的特征交互。该方法基于GDL和TDL架构的底层几何与拓扑结构以及其上定义的学习信号，实现了对它们的联合刻画，为节点分类、子结构检测和社区检测等传统任务的未来研究提供了见解。

英文摘要

Combinatorial and topological structures, such as graphs, simplicial complexes, and cell complexes, form the foundation of geometric and topological deep learning (GDL and TDL) architectures. These models aggregate signals over such domains, integrate local features, and generate representations for diverse real-world applications. However, the distribution and diffusion behavior of GDL and TDL features during training remains an open and underexplored problem. Motivated by this gap, we introduce a cellular sheaf theoretic framework for modeling and analyzing the local consistency and harmonicity of node features and edge weights in graph-based architectures. By tracking local feature alignments and agreements through sheaf structures, the framework offers a topological perspective on feature diffusion and aggregation. Furthermore, a multiscale extension inspired by topological data analysis (TDA) is proposed to capture hierarchical feature interactions in graph models. This approach enables a joint characterization of GDL and TDL architectures based on their underlying geometric and topological structures and the learned signals defined on them, providing insights for future studies on conventional tasks such as node classification, substructure detection, and community detection.
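
The abstract gives no formulas, so for readers unfamiliar with the terminology, the standard definitions of a cellular sheaf on a graph are recalled here. The notation is an assumption and may differ from the paper's. A sheaf $\mathcal{F}$ assigns a vector space $\mathcal{F}(v)$ to each node, $\mathcal{F}(e)$ to each edge, and a linear restriction map $\mathcal{F}_{v \trianglelefteq e}$ to each incident node-edge pair.

```latex
% Coboundary on an oriented edge e = (u \to v), for a node signal x = (x_v)_v:
(\delta x)_e = \mathcal{F}_{v \trianglelefteq e}\, x_v - \mathcal{F}_{u \trianglelefteq e}\, x_u

% Sheaf Laplacian:
L_{\mathcal{F}} = \delta^{\top} \delta

% Harmonic signals are exactly those that agree on every edge:
L_{\mathcal{F}}\, x = 0
\iff
\mathcal{F}_{v \trianglelefteq e}\, x_v = \mathcal{F}_{u \trianglelefteq e}\, x_u
\quad \text{for every edge } e = (u, v)

% Sheaf diffusion drives a signal toward this harmonic subspace:
\frac{dx}{dt} = -L_{\mathcal{F}}\, x
```

When every restriction map is the identity on $\mathbb{R}$, $L_{\mathcal{F}}$ reduces to the ordinary graph Laplacian, which is why the sheaf setting can be read as a generalization of standard message passing.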

2601.04505 2026-05-28 cs.AI cs.CL cs.SY eess.SY 版本更新

CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

CircuitLM: 一种基于多智能体的大语言模型辅助设计框架，用于从自然语言提示生成电路原理图

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik

发表机构 * Department of Computer Science and Engineering（计算机科学与工程系）； Department of Electrical and Electronic Engineering（电气与电子工程系）； Islamic University of Technology（伊斯兰技术大学）

AI总结提出CircuitLM多智能体流水线，通过嵌入驱动的组件知识库和五阶段流程，将自然语言提示转化为结构化的CircuitJSON原理图，并采用确定性电气规则检查和LLM作为评判的元评估器双重验证，解决大语言模型在电路设计中的幻觉和物理约束问题。

Comments Accepted at the 2026 IEEE International Conference on LLM-Aided Design (ICLAD), 10 pages, 8 figures, 6 tables

详情

AI中文摘要

从高层自然语言描述生成准确的电路原理图仍然是电子设计自动化（EDA）中的一个持久挑战，因为大语言模型（LLM）经常产生组件幻觉、违反严格的物理约束并输出非机器可读的结果。为解决此问题，我们提出CircuitLM，一个多智能体流水线，将用户提示转化为结构化的、视觉可解释的$\texttt{CircuitJSON}$原理图。该框架通过五个顺序阶段： (i) 组件识别，(ii) 规范引脚输出检索，(iii) 思维链推理，(iv) JSON原理图合成，以及(v) 交互式力导向可视化，基于一个精心策划的、嵌入驱动的组件知识库进行生成，从而减轻幻觉并确保物理可行性。我们在一个包含100个独特电路设计提示的数据集上，使用五个最先进的大语言模型评估了该系统。为系统评估性能，我们部署了严格的双层评估方法：一个确定性电气规则检查（ERC）引擎按严格严重性（关键、主要、次要、警告）对拓扑故障进行分类，同时一个LLM作为评判的元评估器识别复杂的、上下文感知的设计缺陷，这些缺陷绕过了标准的基于规则的检查器。最终，这项工作展示了目标检索与确定性和语义验证相结合如何将自然语言转化为结构可行的、原理图就绪的硬件和安全电路原型。我们的代码和数据公开在 https://github.com/Khandakar227/CircuitLM。

英文摘要

Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronic design automation (EDA), as large language models (LLMs) frequently hallucinate components, violate strict physical constraints, and produce non-machine-readable outputs. To address this, we present CircuitLM, a multi-agent pipeline that translates user prompts into structured, visually interpretable $\texttt{CircuitJSON}$ schematics. The framework mitigates hallucination and ensures physical viability by grounding generation in a curated, embedding-powered component knowledge base through five sequential stages: (i) component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning, (iv) JSON schematic synthesis, and (v) interactive force-directed visualization. We evaluate the system on a dataset of 100 unique circuit-design prompts using five state-of-the-art LLMs. To systematically assess performance, we deploy a rigorous dual-layered evaluation methodology: a deterministic Electrical Rule Checking (ERC) engine categorizes topological faults by strict severity (Critical, Major, Minor, Warning), while an LLM-as-a-judge meta-evaluator identifies complex, context-aware design flaws that bypass standard rule-based checkers. Ultimately, this work demonstrates how targeted retrieval combined with deterministic and semantic verification can bridge natural language to structurally viable, schematic-ready hardware and safe circuit prototyping. Our code and data are publicly available at https://github.com/Khandakar227/CircuitLM.

2504.08923 2026-05-28 cs.LO cs.AI math.LO 版本更新

A convergence law for continuous logic and continuous structures with finite domains

有限域连续逻辑与连续结构的收敛律

Vera Koponen

发表机构 * Department of Mathematics, Uppsala University（数学系，乌普萨拉大学）

AI总结本文研究有限域上的连续关系结构及其多值逻辑CLA，通过证明每个CLA公式渐近等价于无聚合函数公式，进而建立CLA的收敛律。

详情

DOI: 10.1016/j.ic.2026.105441
Journal ref: Information and Computation, Volume 310, May 2026, 105441

AI中文摘要

我们考虑有限域$[n] := \{1, \ldots, n\}$上的连续关系结构，以及一种多值逻辑$CLA$，其取值于单位区间并使用连续连接词和连续聚合函数。$CLA$包含了“常规”有限结构上的一阶逻辑。对于每个关系符号$R$和满足元组长度与$R$的元数匹配的恒等约束$ic$，我们关联一个连续概率密度函数$μ_R^{ic} : [0, 1] \to [0, \infty)$。我们还考虑域为$[n]$的连续结构集合$\mathbf{W}_n$上的概率分布，使得对于每个关系符号$R$、恒等约束$ic$以及满足$ic$的元组$\bar{a}$，$R(\bar{a})$的值的分布由$μ_R^{ic}$给出，且独立于其他关系符号或其他元组的值。在此设定下，我们证明$CLA$中的每个公式渐近等价于一个不含任何聚合函数的公式。这用于证明$CLA$的收敛律，对于无自由变量的公式表述如下：若$φ\in CLA$无自由变量且$I \subseteq [0, 1]$是一个区间，则存在$α\in [0, 1]$，使得当$n$趋于无穷时，$φ$的值落在$I$中的概率趋于$α$。

英文摘要

We consider continuous relational structures with finite domain $[n] := \{1, \ldots, n\}$ and a many valued logic, $CLA$, with values in the unit interval and which uses continuous connectives and continuous aggregation functions. $CLA$ subsumes first-order logic on ``conventional'' finite structures. To each relation symbol $R$ and identity constraint $ic$ on a tuple the length of which matches the arity of $R$ we associate a continuous probability density function $μ_R^{ic} : [0, 1] \to [0, \infty)$. We also consider a probability distribution on the set $\mathbf{W}_n$ of continuous structures with domain $[n]$ which is such that for every relation symbol $R$, identity constraint $ic$, and tuple $\bar{a}$ satisfying $ic$, the distribution of the value of $R(\bar{a})$ is given by $μ_R^{ic}$, independently of the values for other relation symbols or other tuples. In this setting we prove that every formula in $CLA$ is asymptotically equivalent to a formula without any aggregation function. This is used to prove a convergence law for $CLA$ which reads as follows for formulas without free variables: If $φ\in CLA$ has no free variable and $I \subseteq [0, 1]$ is an interval, then there is $α\in [0, 1]$ such that, as $n$ tends to infinity, the probability that the value of $φ$ is in $I$ tends to $α$.

2603.14773 2026-05-28 cs.LG cs.AI 版本更新

HO-SFL: Hybrid-Order Split Federated Learning with Backprop-Free Clients and Dimension-Free Aggregation

HO-SFL: 混合阶分割联邦学习，无反向传播客户端与维度无关聚合

Qiyuan Chen, Xian Wu, Yi Wang, Xianhao Chen

发表机构 * Department of Electrical and Computer Engineering, The University of Hong Kong, Hong Kong SAR, China（电子与计算机工程系，香港大学，香港特别行政区，中国）

AI总结提出HO-SFL框架，通过拉格朗日框架重构分割学习，服务器执行一阶更新而客户端进行零阶优化，实现无反向传播客户端、维度无关聚合，理论证明收敛速度与一阶方法相当，实验验证通信和内存成本显著降低。

Comments Accepted to ICML 2026

详情

AI中文摘要

在边缘设备上微调大模型受到标准框架（如联邦学习和分割学习）中内存密集型的反向传播（BP）的严重阻碍。虽然用零阶优化替代BP可以显著减少内存占用，但通常会导致收敛速度严重下降。为了解决这一困境，我们提出了混合阶分割联邦学习（HO-SFL）。通过在拉格朗日框架内重构分割学习过程，HO-SFL解耦了优化景观：服务器执行精确的一阶更新（即BP），而客户端进行内存高效的零阶优化。这种混合设计不仅消除了客户端BP的需求，还实现了维度无关的模型聚合，大幅降低了通信成本。关键的是，我们提供了理论收敛分析，证明HO-SFL缓解了零阶优化的维度依赖收敛放缓，实现了与一阶方法相当的收敛速度。在视觉和语言模态任务上的大量实验验证了HO-SFL在实现与一阶基线相当的收敛速度的同时，显著降低了通信成本和客户端内存占用。

英文摘要

Fine-tuning large models on edge devices is severely hindered by the memory-intensive backpropagation (BP) in standard frameworks like federated learning and split learning. While substituting BP with zeroth-order optimization can significantly reduce memory footprints, it typically suffers from prohibitively degraded convergence speed. To resolve this dilemma, we propose Hybrid-Order Split Federated Learning (HO-SFL). By reformulating the split learning process within a Lagrangian framework, HO-SFL decouples the optimization landscape: The server performs precise first-order updates (i.e., BP), whereas clients conduct memory-efficient zeroth-order optimization. This hybrid design not only eliminates the need for client-side BP but also enables dimension-free model aggregation, drastically lowering communication costs. Crucially, we provide a theoretical convergence analysis, demonstrating that HO-SFL mitigates the dimension-dependent convergence slowdown of zeroth-order optimization, achieving a convergence rate comparable to first-order methods. Extensive experiments on tasks across vision and language modalities validate that HO-SFL achieves convergence speeds comparable to first-order baselines while significantly reducing communication costs and client memory footprints.
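
A minimal sketch of backprop-free client updates follows. It uses the generic two-point zeroth-order estimator with a perturbation direction regenerated from a shared seed, so that only a seed and one scalar need to be communicated regardless of model size. Whether HO-SFL achieves dimension-free aggregation in exactly this way is an assumption; the function names, `mu`, and the seed-sharing scheme are hypothetical.

```python
import random

def direction(seed, dim):
    """Regenerate the same Gaussian perturbation direction from a shared seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def zo_scalar(loss_fn, params, seed, mu=1e-3):
    """Two-point estimate of the directional derivative along direction(seed).

    Needs two forward evaluations of loss_fn and no backpropagation.
    """
    u = direction(seed, len(params))
    plus = [p + mu * ui for p, ui in zip(params, u)]
    minus = [p - mu * ui for p, ui in zip(params, u)]
    return (loss_fn(plus) - loss_fn(minus)) / (2.0 * mu)

def apply_update(params, seed, scalar, lr):
    """Rebuild the direction from the seed and step along it.

    Only (seed, scalar) has to be transmitted, independent of len(params).
    """
    u = direction(seed, len(params))
    return [p - lr * scalar * ui for p, ui in zip(params, u)]
```

The estimate `scalar * u` is a noisy gradient whose variance grows with dimension, which is the slowdown the abstract says the hybrid design mitigates by keeping first-order updates on the server.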

2602.20497 2026-05-28 cs.CV cs.AI 版本更新

LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration

LESA: 可学习的阶段感知预测器用于扩散模型加速

Peiliang Cai, Jiacheng Liu, Haowen Xu, Xinyu Wang, Chang Zou, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结针对扩散模型计算开销大、现有缓存策略难以适应去噪过程阶段动态变化的问题，提出基于两阶段训练的可学习阶段感知预测器框架，利用KAN网络学习时序特征映射并采用多阶段多专家架构，在保持高质量生成的同时实现显著加速。

Comments Accepted to CVPR 2026

详情

AI中文摘要

扩散模型在图像和视频生成任务中取得了显著成功。然而，扩散Transformer（DiTs）的高计算需求对其实际部署构成了重大挑战。虽然特征缓存是一种有前景的加速策略，但现有基于简单重用或无训练预测的方法难以适应扩散过程中复杂的、阶段相关的动态变化，常常导致质量下降，并无法保持与标准去噪过程的一致性。为解决这一问题，我们提出了一种基于两阶段训练的可学习阶段感知（LESA）预测器框架。我们的方法利用Kolmogorov-Arnold网络（KAN）从数据中准确学习时序特征映射。我们进一步引入了一种多阶段、多专家架构，为不同噪声水平阶段分配专门的预测器，从而实现更精确和鲁棒的特征预测。大量实验表明，我们的方法在保持高保真生成的同时实现了显著加速。实验显示，在FLUX.1-dev上实现了5.00倍加速，质量下降极小（1.0%）；在Qwen-Image上实现了6.25倍加速，质量比之前的最优方法（TaylorSeer）提升20.2%；在HunyuanVideo上实现了5.00倍加速，PSNR比TaylorSeer提升24.7%。在文本到图像和文本到视频合成任务上的最先进性能验证了我们基于训练框架在不同模型上的有效性和泛化能力。我们的代码可在https://github.com/caipeiliang2004/LESA获取。

英文摘要

Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers (DiTs) pose a significant challenge to their practical deployment. While feature caching is a promising acceleration strategy, existing methods based on simple reusing or training-free forecasting struggle to adapt to the complex, stage-dependent dynamics of the diffusion process, often resulting in quality degradation and failing to maintain consistency with the standard denoising process. To address this, we propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training. Our approach leverages a Kolmogorov-Arnold Network (KAN) to accurately learn temporal feature mappings from data. We further introduce a multi-stage, multi-expert architecture that assigns specialized predictors to different noise-level stages, enabling more precise and robust feature forecasting. Extensive experiments show our method achieves significant acceleration while maintaining high-fidelity generation. Experiments demonstrate 5.00x acceleration on FLUX.1-dev with minimal quality degradation (1.0% drop), 6.25x speedup on Qwen-Image with a 20.2% quality improvement over the previous SOTA (TaylorSeer), and 5.00x acceleration on HunyuanVideo with a 24.7% PSNR improvement over TaylorSeer. State-of-the-art performance on both text-to-image and text-to-video synthesis validates the effectiveness and generalization capability of our training-based framework across different models. Our code is available at https://github.com/caipeiliang2004/LESA.

2603.09882 2026-05-28 cs.RO cs.AI 版本更新

Emerging Extrinsic Dexterity in Cluttered Scenes via Dynamics-aware Policy Learning

杂乱场景中通过动力学感知策略学习涌现的外在灵巧性

Yixin Zheng, Jiangran Lyu, Yifan Zhang, Jiayi Chen, Mi Yan, Yuntian Deng, Xuesong Shi, Xiaoguang Zhao, Yizhou Wang, Zhizheng Zhang, He Wang

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）； Galbot ； Peking University（北京大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出动力学感知策略学习框架，通过显式世界建模学习接触诱导物体动力学表示并用于强化学习，使杂乱场景中的外在灵巧性无需手工启发式或复杂奖励塑造即可涌现。

Comments Accepted to Robotics: Science and Systems (RSS) 2026. Project page: https://pku-epic.github.io/DAPL/

详情

AI中文摘要

外在灵巧性利用环境接触来克服抓取操作的局限性。然而，在杂乱场景中实现这种灵巧性仍然具有挑战性且未被充分探索，因为它需要选择性地利用多个相互作用的物体之间的接触，而这些物体具有内在耦合的动力学。现有方法缺乏对这种复杂动力学的显式建模，因此在杂乱环境中的非抓取操作方面表现不足，这反过来限制了它们在现实环境中的实际应用。在本文中，我们介绍了一种动力学感知策略学习（DAPL）框架，该框架可以利用在杂乱环境中学习到的接触诱导物体动力学的表示来促进策略学习。这种表示通过显式世界建模学习，并用于条件化强化学习，使得外在灵巧性无需手工制作的接触启发式或复杂的奖励塑造即可涌现。我们在仿真和现实世界中评估了我们的方法。在具有不同密度的未见过的仿真杂乱场景中，我们的方法在成功率上比抓取操作、人类遥操作和基于先前表示的策略高出25%以上。在10个杂乱场景中，现实世界的成功率达到了约50%，而实际杂货部署进一步证明了稳健的仿真到现实迁移和适用性。

英文摘要

Extrinsic dexterity leverages environmental contact to overcome the limitations of prehensile manipulation. However, achieving such dexterity in cluttered scenes remains challenging and underexplored, as it requires selectively exploiting contact among multiple interacting objects with inherently coupled dynamics. Existing approaches lack explicit modeling of such complex dynamics and therefore fall short in non-prehensile manipulation in cluttered environments, which in turn limits their practical applicability in real-world environments. In this paper, we introduce a Dynamics-Aware Policy Learning (DAPL) framework that can facilitate policy learning with a learned representation of contact-induced object dynamics in cluttered environments. This representation is learned through explicit world modeling and used to condition reinforcement learning, enabling extrinsic dexterity to emerge without hand-crafted contact heuristics or complex reward shaping. We evaluate our approach in both simulation and the real world. Our method outperforms prehensile manipulation, human teleoperation, and prior representation-based policies by over 25% in success rate on unseen simulated cluttered scenes with varying densities. The real-world success rate reaches around 50% across 10 cluttered scenes, while a practical grocery deployment further demonstrates robust sim-to-real transfer and applicability.

2603.02702 2026-05-28 cs.AI cs.LG 版本更新

FinTexTS: Financial Text-Paired Time-Series Dataset via Semantic-Based and Multi-Level Pairing

FinTexTS: 基于语义和多层级配对的金融文本-时间序列数据集

Jaehoon Lee, Suhwan Park, Taeyoon Lim, Seunghan Lee, Jun Seo, Dongwan Kang, Hwanil Choi, Minjae Kim, Sungdong Yoo, Soonyoung Lee, Yongjae Lee, Wonbin Ahn

发表机构 * LG AI Research（LG人工智能研究所）； Ulsan National Institute of Science and Technology（蔚山科学技术院）

AI总结提出基于语义和多层级配对的框架，从SEC文件和新闻中提取并匹配多层级文本信息，构建大规模文本配对的股票价格数据集FinTexTS，提升股价预测性能。

Comments 12 pages, KDD 2026, Datasets and Benchmarks Track

详情

AI中文摘要

金融领域涉及多种重要的时间序列问题。近年来，联合利用文本和数值信息的时间序列分析方法越来越受到关注。因此，人们做出了大量努力来构建金融领域中的文本配对时间序列数据集。然而，金融市场具有复杂的相互依赖性，一家公司的股票价格不仅受公司特定事件的影响，还受其他公司事件和更广泛的宏观经济因素的影响。现有的基于简单关键词匹配的文本与金融时间序列数据配对方法往往无法捕捉这种复杂关系。为了解决这一局限性，我们提出了一种基于语义和多层级的配对框架。具体来说，我们从SEC文件中提取目标公司的特定上下文，并应用基于嵌入的匹配机制，根据该上下文检索语义相关的新闻文章。此外，我们使用大语言模型（LLMs）将新闻文章分为四个层级（宏观层级、行业层级、相关公司层级和目标公司层级），实现新闻文章与目标公司的多层级配对。将该框架应用于公开可用的新闻数据集，我们构建了FinTexTS，这是一个新的大规模文本配对的股票价格数据集。在FinTexTS上的实验结果表明，我们的基于语义和多层级的配对策略在股价预测中是有效的。除了FinTexTS所依赖的公开新闻外，我们还表明，将我们的方法应用于专有但精心策划的新闻源，可以产生更高质量的配对数据，并提高股价预测性能。

英文摘要

The financial domain involves a variety of important time-series problems. Recently, time-series analysis methods that jointly leverage textual and numerical information have gained increasing attention. Accordingly, numerous efforts have been made to construct text-paired time-series datasets in the financial domain. However, financial markets are characterized by complex interdependencies, in which a company's stock price is influenced not only by company-specific events but also by events in other companies and broader macroeconomic factors. Existing approaches that pair text with financial time-series data based on simple keyword matching often fail to capture such complex relationships. To address this limitation, we propose a semantic-based and multi-level pairing framework. Specifically, we extract company-specific context for the target company from SEC filings and apply an embedding-based matching mechanism to retrieve semantically relevant news articles based on this context. Furthermore, we classify news articles into four levels (macro-level, sector-level, related company-level, and target company-level) using large language models (LLMs), enabling multi-level pairing of news articles with the target company. Applying this framework to publicly-available news datasets, we construct FinTexTS, a new large-scale text-paired stock price dataset. Experimental results on FinTexTS demonstrate the effectiveness of our semantic-based and multi-level pairing strategy in stock price forecasting. In addition to publicly-available news underlying FinTexTS, we show that applying our method to proprietary yet carefully curated news sources leads to higher-quality paired data and improved stock price forecasting performance.
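
The embedding-based, multi-level pairing step can be sketched as follows. The sketch assumes the company context and each news article have already been embedded, and that each article already carries a level label; in the paper the levels come from an LLM classifier, which is not reproduced here. The names `pair_news` and `top_k` and the tuple layout are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def pair_news(context_vec, news, top_k=1):
    """Return the top_k most similar article ids for each level.

    context_vec: embedding of the company context (e.g. from SEC filings).
    news: list of (article_id, level, embedding) tuples.
    """
    by_level = {}
    for article_id, level, emb in news:
        by_level.setdefault(level, []).append((cosine(context_vec, emb), article_id))
    return {
        level: [aid for _, aid in sorted(scored, reverse=True)[:top_k]]
        for level, scored in by_level.items()
    }
```

Ranking within each level, instead of across all articles, keeps macro-level and sector-level news from being crowded out by company-specific articles that naturally score higher against a company context.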

2603.05642 2026-05-28 cs.RO cs.AI 版本更新

Relational Semantic Reasoning on 3D Scene Graphs for Open World Interactive Object Search

基于3D场景图的开放世界交互式物体搜索的关系语义推理

Imen Mahdi, Matteo Cassinelli, Fabien Despinoy, Tim Welschehold, Abhinav Valada

发表机构 * University of Freiburg（弗赖堡大学）； Toyota Motor Europe（丰田欧洲公司）

AI总结提出SCOUT方法，通过从LLM蒸馏的关系探索启发式直接搜索3D场景图，实现高效开放世界交互式物体搜索，性能匹配LLM且计算高效。

详情

AI中文摘要

家庭环境中的开放世界交互式物体搜索需要理解物体与其周围环境之间的语义关系，以有效引导探索。先前的方法要么依赖视觉-语言嵌入相似性，这不能可靠地捕获任务相关的关系语义，要么依赖大型语言模型（LLM），这对于实时部署来说太慢且成本高昂。我们提出SCOUT：基于场景图探索的开放世界交互式物体搜索学习效用，这是一种新颖的方法，通过使用关系探索启发式（如房间-物体包含和物体-物体共现）为房间、前沿和物体分配效用分数，直接搜索3D场景图。为了在不牺牲开放词汇泛化能力的情况下使其实用，我们提出了一种离线程序化蒸馏框架，将LLM中的结构化关系知识提取到轻量级模型中，用于机器人上的推理。此外，我们提出了SymSearch，一个用于评估交互式物体搜索任务中语义推理的可扩展符号基准。在符号和模拟环境中的广泛评估表明，SCOUT优于基于嵌入相似性的方法，并在保持计算效率的同时达到LLM级别的性能。最后，真实世界实验证明了向物理环境的有效迁移，在现实感知和导航约束下实现了开放世界交互式物体搜索。

英文摘要

Open-world interactive object search in household environments requires understanding semantic relationships between objects and their surrounding context to guide exploration efficiently. Prior methods either rely on vision-language embedding similarity, which does not reliably capture task-relevant relational semantics, or large language models (LLMs), which are too slow and costly for real-time deployment. We introduce SCOUT: Scene Graph-Based Exploration with Learned Utility for Open-World Interactive Object Search, a novel method that searches directly over 3D scene graphs by assigning utility scores to rooms, frontiers, and objects using relational exploration heuristics such as room-object containment and object-object co-occurrence. To make this practical without sacrificing open-vocabulary generalization, we propose an offline procedural distillation framework that extracts structured relational knowledge from LLMs into lightweight models for on-robot inference. Furthermore, we present SymSearch, a scalable symbolic benchmark for evaluating semantic reasoning in interactive object search tasks. Extensive evaluations across symbolic and simulation environments show that SCOUT outperforms embedding similarity-based methods and matches LLM-level performance while remaining computationally efficient. Finally, real-world experiments demonstrate effective transfer to physical environments, enabling open-world interactive object search under realistic sensing and navigation constraints.

2603.05425 2026-05-28 cs.CV cs.AI 版本更新

RelaxFlow: Text-Driven Amodal 3D Generation

RelaxFlow: 文本驱动的非模态3D生成

Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao

发表机构 * National University of Singapore（新加坡国立大学）； Zhejiang University（浙江大学）； University of Science and Technology of China（中国科学技术大学）

AI总结针对遮挡下图像到3D生成的语义歧义问题，提出无训练的双分支框架RelaxFlow，通过多先验共识模块和松弛机制解耦控制粒度，实现文本提示引导下对未观察区域的补全，同时严格保留输入观测。

Comments Accepted as a spotlight presentation at ICML 2026. Code: https://github.com/viridityzhu/RelaxFlow

详情

AI中文摘要

图像到3D生成在遮挡下面临固有的语义歧义，仅凭部分观测通常不足以确定物体类别。在这项工作中，我们形式化了文本驱动的非模态3D生成，其中文本提示引导对未观察区域的补全，同时严格保留输入观测。关键的是，我们识别出这些目标需要不同的控制粒度：对观测的刚性控制与对提示的松弛结构控制。为此，我们提出RelaxFlow，一个无训练的双分支框架，通过多先验共识模块和松弛机制解耦控制粒度。理论上，我们证明我们的松弛等价于在生成向量场上应用低通滤波器，抑制高频实例细节以隔离适应观测的几何结构。为便于评估，我们引入了两个诊断基准：ExtremeOcc-3D和AmbiSem-3D。大量实验表明，RelaxFlow成功引导未观察区域的生成以匹配提示意图，同时不损害视觉保真度。

英文摘要

Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.

2603.04631 2026-05-28 cs.AI 版本更新

GradientStabilizer：固定范数，而非梯度

Tianjin Huang, Zhangyang Wang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Jiaxing Shang, Tianlong Chen, Ke Li, Lu Liu, Qingsong Wen, Shiwei Liu

发表机构 * Department of Computer Science, University of Exeter（埃克塞特大学计算机科学系）； Department of Mathematics and Computer Science, Eindhoven University of Technology（埃因霍温理工大学数学与计算机科学系）； School of the Gifted Young, University of Science and Technology of China（中国科学技术大学少年班学院）； Department of Electrical and Computer Engineering, University of Texas at Austin（德克萨斯大学奥斯汀分校电气与计算机工程系）； Department of Computer Science, University of Reading（雷丁大学计算机科学系）； School of Cyber Science and Technology, Sun Yat-sen University（中山大学网络科学与技术学院）； Department of Computer Science, The University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校计算机科学系）； ELLIS Institute Tubingen（图宾根ELLIS研究所）； Max Planck Institute for Intelligent Systems（马克斯·普朗克智能系统研究所）； Tübingen AI Center, Tübingen, Germany（图宾根人工智能中心，德国图宾根）； College of Computer Science, Chongqing University（重庆大学计算机学院）

AI总结提出GradientStabilizer，一种轻量级梯度变换方法，通过统计稳定的梯度范数估计替换更新幅度，在不改变梯度方向的前提下抑制极端梯度尖峰，从而提升训练稳定性并减少发散。

Comments Accepted by ICML 2026

详情

AI中文摘要

现代深度学习系统中的训练不稳定性通常由罕见但极端的梯度范数尖峰引发，这些尖峰可能导致参数更新过大、破坏优化器状态，并导致缓慢恢复或发散。广泛使用的保护措施如梯度裁剪可以缓解这些故障，但需要调整阈值且不加区分地截断大更新。我们提出GradientStabilizer，一种轻量级、即插即用的梯度变换方法，它在保留瞬时梯度方向的同时，用从运行梯度范数统计中导出的统计稳定估计替换更新幅度。我们证明了在尖峰步骤上，得到的稳定幅度一致有界，与尖峰大小无关，并展示了这种有界性如何控制自适应方法中优化器状态的演化。在LLM预训练（FP16）、量化感知预训练（FP4）、ImageNet分类、强化学习和时间序列预测中，GradientStabilizer一致地提高了训练稳定性，扩大了稳定学习率区域，并相对于基于裁剪的基线减少了发散，甚至显著降低了Adam对权重衰减强度的敏感性。代码即将发布。

英文摘要

Training instability in modern deep learning systems is frequently triggered by rare but extreme gradient-norm spikes, which can induce oversized parameter updates, corrupt optimizer state, and lead to slow recovery or divergence. Widely used safeguards such as gradient clipping mitigate these failures but require threshold tuning and indiscriminately truncate large updates. We propose GradientStabilizer, a lightweight, drop-in gradient transform that preserves the instantaneous gradient direction while replacing the update magnitude with a statistically stabilized estimate derived from running gradient-norm statistics. We prove that the resulting stabilized magnitude is uniformly bounded on spike steps, independent of the spike size, and show how this boundedness controls optimizer state evolution in adaptive methods. Across LLM pre-training (FP16), quantization-aware pre-training (FP4), ImageNet classification, reinforcement learning, and time-series forecasting, GradientStabilizer consistently improves training stability, widens stable learning-rate regions, and reduces divergence relative to clipping-based baselines, even substantially reducing Adam's sensitivity to weight-decay strength. Code will be released soon.
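
The idea of keeping the gradient direction while replacing its magnitude can be illustrated with a small sketch. The abstract does not give the exact estimator, so the Adam-style running statistics over gradient norms used here (`beta1`, `beta2`, bias correction) are an assumption chosen because they give a magnitude that stays bounded on a spike; the class name `NormStabilizer` is hypothetical.

```python
import math

class NormStabilizer:
    """Keep the gradient direction; replace its length with a running-statistics estimate."""

    def __init__(self, beta1=0.9, beta2=0.999, eps=1e-12):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = 0.0  # exponential moving average of gradient norms
        self.v = 0.0  # exponential moving average of squared gradient norms
        self.t = 0    # step counter for bias correction

    def __call__(self, grad):
        norm = math.sqrt(sum(g * g for g in grad))
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * norm
        self.v = self.beta2 * self.v + (1 - self.beta2) * norm * norm
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # As norm grows without bound, m / sqrt(v) approaches
        # (1 - beta1) / sqrt(1 - beta2), so the magnitude cannot blow up.
        magnitude = m_hat / (math.sqrt(v_hat) + self.eps)
        return [magnitude * g / (norm + self.eps) for g in grad]
```

In this instantiation the output is scale-free, so the learning rate alone sets the step size, and a spike of any size yields roughly the same update length instead of requiring a hand-tuned clipping threshold.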

2602.22873 2026-05-28 math.AT cs.AI cs.CG 版本更新

Learning Tangent Bundles and Characteristic Classes with Autoencoder Atlases

使用自编码器图谱学习切丛和示性类

Eduardo Paluzo-Hidalgo, Yuichi Ike

发表机构 * Department of Applied Mathematics I, University of Sevilla（塞维利亚大学应用数学系）； Graduate School of Mathematical Sciences, The University of Tokyo（东京大学数学科学研究生院）

AI总结本文提出一个理论框架，将流形学习中的多图自编码器与向量丛和示性类的经典理论联系起来，通过自编码器图谱定义转移映射并计算第一Stiefel-Whitney类，从而检测数据可定向性。

详情

AI中文摘要

我们引入了一个理论框架，将流形学习中的多图自编码器与向量丛和示性类的经典理论联系起来。我们不将自编码器视为产生单个全局欧几里得嵌入，而是将一组局部训练的编码器-解码器对视为流形上的学习图谱。我们证明，任何重建一致的自编码器图谱都能典范地定义满足上循环条件的转移映射，并且将这些转移映射线性化会得到一个向量丛，当潜在维度与流形的内在维度匹配时，该向量丛与切丛一致。这种构造提供了对数据的微分拓扑不变量的直接访问。特别地，我们证明第一Stiefel-Whitney类可以从学习到的转移映射的雅可比行列式的符号计算出来，从而得到检测可定向性的算法准则。我们还证明，非平凡的示性类对单图表示构成障碍，并且自编码器图的最小数量由流形的良好覆盖结构决定。最后，我们将我们的方法应用于低维可定向和不可定向流形，以及一个不可定向的高维图像数据集。

英文摘要

We introduce a theoretical framework that connects multi-chart autoencoders in manifold learning with the classical theory of vector bundles and characteristic classes. Rather than viewing autoencoders as producing a single global Euclidean embedding, we treat a collection of locally trained encoder-decoder pairs as a learned atlas on a manifold. We show that any reconstruction-consistent autoencoder atlas canonically defines transition maps satisfying the cocycle condition, and that linearising these transition maps yields a vector bundle coinciding with the tangent bundle when the latent dimension matches the intrinsic dimension of the manifold. This construction provides direct access to differential-topological invariants of the data. In particular, we show that the first Stiefel-Whitney class can be computed from the signs of the Jacobians of learned transition maps, yielding an algorithmic criterion for detecting orientability. We also show that non-trivial characteristic classes provide obstructions to single-chart representations, and that the minimum number of autoencoder charts is determined by the good cover structure of the manifold. Finally, we apply our methodology to low-dimensional orientable and non-orientable manifolds, as well as to a non-orientable high-dimensional image dataset.
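
The orientability criterion based on Jacobian signs can be sketched as a graph check. The sketch assumes that each pair of overlapping charts has a connected overlap on which the Jacobian determinant of the transition map has a constant sign, and that these signs have already been estimated; the function name and input format are hypothetical. The atlas is orientable exactly when every chart can be given an orientation in {+1, -1} such that the product of the two orientations equals the recorded sign on each overlap, which is the statement that the sign cocycle is a coboundary and the first Stiefel-Whitney class vanishes.

```python
from collections import deque

def is_orientable(num_charts, overlap_signs):
    """Check whether chart orientations consistent with all overlap signs exist.

    overlap_signs: dict mapping (i, j) to +1 or -1, the sign of the Jacobian
    determinant of the transition map between charts i and j.
    """
    adj = {i: [] for i in range(num_charts)}
    for (i, j), sign in overlap_signs.items():
        adj[i].append((j, sign))
        adj[j].append((i, sign))
    orient = {}
    for start in range(num_charts):
        if start in orient:
            continue
        orient[start] = 1
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j, sign in adj[i]:
                wanted = orient[i] * sign
                if j not in orient:
                    orient[j] = wanted
                    queue.append(j)
                elif orient[j] != wanted:
                    return False  # an orientation-reversing loop exists
    return True
```

For three charts arranged in a cycle, all-positive signs model a cylinder and return `True`, while a single negative sign models a Möbius band and returns `False`.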

2602.22787 2026-05-28 cs.CL cs.AI (updated version)

Probing for Knowledge Attribution in Large Language Models

Ivo Brink, Alexander Boer, Dennis Ulmer

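Affiliations: KPMG NL; University of Amsterdam

AI summary: The paper uses linear probes on hidden representations to classify the dominant knowledge source (memory or context) behind large language model outputs, and introduces AttriWiki, a self-supervised pipeline for generating training data, reaching high F1 scores across several models and datasets.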

Abstract

Large language model (LLM) hallucinations, meaning fluent but factually incorrect generations, fall into two types: faithfulness violations, where the model misuses provided context, and factuality violations, where answers reflect errors in internal knowledge. Proper mitigation depends on knowing which source drives each answer. We study contributive attribution, i.e. the classification of the dominant knowledge source behind each output, and show that a simple linear probe trained on hidden representations can reliably identify it. We introduce AttriWiki, a self-supervised pipeline that automatically generates labelled training data by prompting models to recall withheld entities from memory or read them from context without relying on knowledge conflicts. Probes trained on AttriWiki achieve up to 0.96 Macro-$F_1$ on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transfer to SQuAD and WebQuestions with 0.94-0.99 Macro-$F_1$, and generalise zero-shot to Tighidet et al. (2024)'s benchmark, outperforming their probe on conflicting settings without retraining. Furthermore, attribution mismatches raise error rates by up to 70%, though correct attribution does not guarantee correct answers, pointing to the need for broader detection frameworks.
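
To make the idea of a linear probe concrete, here is a minimal sketch: a logistic-regression classifier trained on fixed feature vectors. The two-dimensional vectors below are made-up stand-ins for hidden representations, and the label convention (1 = context, 0 = memory) is an assumption for illustration; none of this reproduces the AttriWiki pipeline or the paper's probe configuration.

```python
import math

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit weights w and bias b of a logistic-regression probe with plain SGD."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted P(label = 1)
            g = p - y                           # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Hypothetical "hidden states": label 1 = answer read from context, 0 = recalled from memory.
hidden = [[1.0, 0.2], [0.9, -0.1], [1.2, 0.3], [-1.0, 0.1], [-0.8, -0.2], [-1.1, 0.4]]
source = [1, 1, 1, 0, 0, 0]
w, b = train_linear_probe(hidden, source)
preds = [predict(w, b, h) for h in hidden]
```

The model under study stays frozen in this setup; only `w` and `b` are learned, which is what makes a probe a test of what the representations already encode.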

2602.18647 2026-05-28 cs.LG cs.AI cs.CV cs.IT math.IT (updated version)

Noise Scheduling as Information-Guided Allocation in Diffusion Training

Gabriel Raya, Bac Nguyen, Georgios Batzolis, Yuhta Takida, Dejan Stancevic, Naoki Murata, Chieh-Hsin Lai, Yuki Mitsufuji, Luca Ambrogioni

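Affiliations: Tilburg University & JADS; Sony AI; University of Cambridge; Radboud University; Sony Group Corporation

AI summary: Proposes InfoNoise, an online adaptive noise scheduling method that estimates a conditional-entropy-rate profile to adjust the training noise distribution dynamically, optimizing information gain in denoising. It matches or exceeds baselines on image, DNA, and language generation tasks and saves up to 3x training compute.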

Abstract

We introduce InfoNoise, an online adaptive noise schedule for diffusion training that reallocates optimization effort toward noise levels where denoising is most informative. Together with loss weighting, a noise schedule induces an effective allocation across denoising problems, often fixed before informative noise levels are known. InfoNoise makes this allocation data-adaptive by estimating a conditional-entropy-rate profile from denoising losses during training, without auxiliary models or offline search. Through I--MMSE, this profile identifies where noisy observations rapidly reduce uncertainty about the clean sample and guides adaptation of the training noise distribution. It changes only this distribution, keeping the objective, weighting, and parameterization fixed. On image benchmarks, where schedules have been extensively tuned, InfoNoise matches or slightly exceeds strong baselines and can reach the same quality with fewer updates. On representation, sequence, and modality shifts, including DNA and language generation, InfoNoise improves over fixed and adaptive baselines and reaches target quality with up to $3\times$ less training compute. These results establish the conditional-entropy-rate profile as the data-dependent target for noise schedule design and make online adaptation a practical alternative to manual schedule search.
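
One way to read the I-MMSE step is sketched below. For a Gaussian channel the I-MMSE relation gives dI/d(snr) = mmse(snr)/2, so the rate at which uncertainty about the clean sample falls per unit of log-SNR is snr * mmse(snr) / 2. The sketch turns per-level MMSE estimates into a sampling distribution over noise levels. The function name, the uniform mixing term, and the use of the closed-form Gaussian MMSE as input are assumptions for illustration; the paper's estimator and update rule may differ.

```python
def sampling_weights(snrs, mmse_estimates, uniform_mix=0.01):
    """Turn per-level MMSE estimates into a sampling distribution over noise levels.

    rate_i = 0.5 * snr_i * mmse_i is the information gained per unit log-SNR
    (from the I-MMSE relation); uniform_mix keeps every level sampled.
    """
    rates = [0.5 * s * m for s, m in zip(snrs, mmse_estimates)]
    total = sum(rates)
    n = len(rates)
    return [(1.0 - uniform_mix) * r / total + uniform_mix / n for r in rates]

# Toy check with a unit-variance Gaussian source, where mmse(snr) = 1 / (1 + snr).
snrs = [0.1, 1.0, 10.0]
mmse = [1.0 / (1.0 + s) for s in snrs]
weights = sampling_weights(snrs, mmse)
```

In training, the MMSE values would be running estimates derived from the denoising losses rather than a closed form, and the weights would be refreshed as those estimates change.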

2602.18481 2026-05-28 q-fin.TR cs.AI (updated version)

AlphaForgeBench: Benchmarking End-to-End Trading Strategy Design with Large Language Models

Wentao Zhang, Mingxuan Zhao, Jincheng Gao, Jieshun You, Huaiyu Jia, Yilei Zhao, Bo An, Shuo Sun

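Affiliations: Nanyang Technological University; The Hong Kong University of Science and Technology (Guangzhou); Hong Kong Polytechnic University

AI summary: Proposes the AlphaForgeBench framework, which recasts LLMs from stochastic trading agents as quantitative researchers. By having models generate executable alpha factors and factor-based trading strategies, it removes execution instability and enables reproducible evaluation of financial reasoning.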

Abstract

The rapid advancement of Large Language Models (LLMs) has led to a surge of financial benchmarks, evolving from static knowledge evaluation toward interactive trading simulations. However, existing frameworks for evaluating real-time trading largely overlook a critical failure mode: the severe behavioral instability of LLMs in sequential decision-making under financial uncertainty. Through extensive experiments, we show that when deployed as trading agents, LLMs exhibit extreme run-to-run variance, generate inconsistent action sequences even under deterministic decoding, and frequently produce irrational action flipping across adjacent time steps. We attribute these behaviors to the stateless autoregressive nature of LLMs, which lack persistent memory of prior actions, together with their sensitivity to continuous-to-discrete action mappings in portfolio allocation tasks. These deficiencies fundamentally undermine the reliability and reproducibility of many existing online and offline trading benchmarks. To address these limitations, we propose AlphaForgeBench, a principled evaluation framework that redefines LLMs as quantitative researchers rather than stochastic trading agents. Instead of producing discrete trading actions, AlphaForgeBench requires models to generate executable alpha factors and compose factor-based trading strategies grounded in financial knowledge. This paradigm decouples reasoning from execution mechanics, enabling deterministic and reproducible evaluation while remaining aligned with real-world quantitative research workflows. Extensive experiments across multiple state-of-the-art LLMs demonstrate that AlphaForgeBench eliminates execution-induced instability and provides a rigorous benchmark for evaluating financial reasoning, strategy formulation, and alpha discovery. Webpage at https://finbrain-lab-hkustgz.github.io/AlphaForgeBench
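
The contrast between emitting trading actions and emitting an executable factor can be sketched as follows. The momentum factor, the long-or-flat rule, and the price series are hypothetical illustrations and are not taken from the benchmark. The point is only that once a factor is code, the evaluation is a deterministic function of the data.

```python
def momentum_factor(prices, lookback):
    """Factor value at day t: return over the previous `lookback` days (None until enough history)."""
    return [None if t < lookback else prices[t] / prices[t - lookback] - 1.0
            for t in range(len(prices))]

def backtest_long_flat(prices, factor):
    """Hold the asset from day t to t+1 when factor[t] > 0; otherwise stay in cash."""
    equity = 1.0
    for t in range(len(prices) - 1):
        if factor[t] is not None and factor[t] > 0:
            equity *= prices[t + 1] / prices[t]
    return equity

prices = [100.0, 102.0, 101.0, 105.0, 107.0, 104.0]   # toy price series
factor = momentum_factor(prices, lookback=2)
result = backtest_long_flat(prices, factor)
```

Running the backtest twice on the same factor and data yields the same equity, which is the reproducibility property the abstract attributes to decoupling reasoning from execution.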

2602.17003 2026-05-28 cs.CL cs.AI (updated version)

Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Serin Kim, Sangam Lee, Dongha Lee

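Affiliations: Department of Artificial Intelligence, Yonsei University, Seoul, Republic of Korea

AI summary: Proposes the Persona2Web benchmark, which uses the clarify-to-personalize principle to evaluate how well web agents use user history to resolve ambiguous queries on the real open web, and introduces a reasoning-aware evaluation framework.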

Comments Accepted to ICML 2026

Abstract

Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at https://serin-kimm.github.io/Persona2Web/.

2602.15515 2026-05-28 cs.LG cs.AI (updated version)

The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes

Mohammad Taufeeque, Stefan Heimersheim, Adam Gleave, Chris Cundy

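AI summary: By building a coding environment in which reward hacking arises naturally, the paper studies the obfuscation strategies that emerge when models are trained against white-box deception detectors, and introduces a taxonomy covering three outcomes: honesty, obfuscated activations, and obfuscated policies.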

Comments Accepted at ICML 2026 (Oral presentation). 30 pages, 14 figures

Abstract

Training against white-box deception detectors has been proposed as a way to make AI systems honest. However, such training risks models learning to obfuscate their deception to evade the detector. Prior work has studied obfuscation only in artificial settings where models were directly rewarded for harmful output. We construct a realistic coding environment where reward hacking via hardcoding test cases naturally occurs, and show that obfuscation emerges in this setting. We introduce a taxonomy of possible outcomes when training against a deception detector. The model either remains honest, or becomes deceptive via two possible obfuscation strategies. (i) Obfuscated activations: the model outputs deceptive text while modifying its internal representations to no longer trigger the detector. (ii) Obfuscated policy: the model outputs deceptive text that evades the detector, typically by including a justification for the reward hack. Empirically, obfuscated activations arise from representation drift during RL, with or without a detector penalty. The detector penalty only incentivizes obfuscated policies; we theoretically show this is expected for policy gradient methods. Sufficiently high KL regularization and detector penalty can yield honest policies, establishing white-box deception detectors as viable training signals for tasks prone to reward hacking.
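
The interaction between task reward, detector penalty, and KL regularization can be pictured with a generic shaped reward. The additive form, the coefficient names, and the rollout numbers below are assumptions made for illustration, not the paper's training objective or results.

```python
def shaped_reward(task_reward, detector_score, kl_to_reference,
                  detector_coef, kl_coef):
    """Task reward minus a detector penalty and a KL penalty (generic, assumed form)."""
    return task_reward - detector_coef * detector_score - kl_coef * kl_to_reference

# Made-up rollouts: the hack passes more tests but trips the detector and drifts further.
honest = dict(task_reward=0.6, detector_score=0.05, kl_to_reference=0.1)
hack = dict(task_reward=1.0, detector_score=0.9, kl_to_reference=0.8)

no_penalty = (shaped_reward(**honest, detector_coef=0.0, kl_coef=0.0),
              shaped_reward(**hack, detector_coef=0.0, kl_coef=0.0))
with_penalty = (shaped_reward(**honest, detector_coef=1.0, kl_coef=0.5),
                shaped_reward(**hack, detector_coef=1.0, kl_coef=0.5))
```

With both coefficients at zero the hack scores higher; with the penalties switched on the honest rollout does. An obfuscated policy, in the abstract's terms, is one that lowers `detector_score` without becoming honest, which this toy form does not rule out.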

2602.15198 2026-05-28 cs.MA cs.AI cs.CL (updated version)

Colosseum: Auditing Collusion in Cooperative Multi-Agent Systems

Mason Nakamura, Abhinav Kumar, Saswat Das, Sahar Abdelnabi, Saaduddin Mahmud, Ferdinando Fioretto, Shlomo Zilberstein, Eugene Bagdasarian

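Affiliations: University of Massachusetts Amherst; University of Virginia; ELLIS Institute Tübingen; MPI for Intelligent Systems, Tübingen; AI Center

AI summary: Proposes the Colosseum framework, which audits collusion among LLM agents in cooperative multi-agent systems using a formal decision-making framework and regret-based metrics. It finds that most models show a tendency toward emergent collusion and observes a "collusion on paper" phenomenon.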

Abstract

Multi-agent systems, where LLM agents communicate through free-form language, enable sophisticated coordination for solving complex cooperative tasks. This surfaces a unique safety problem when a group of agents forms a coalition and colludes to pursue secondary goals and degrade the joint objective. In this paper, we present Colosseum, a framework for auditing LLM agents' collusive behavior in multi-agent settings. We ground how agents cooperate through a formal multi-agent decision-making framework, measure action-based collusive behavior via regret relative to the cooperative optimum, and compare it with communication-based collusive behavior. Colosseum enables audits of LLM agents for collusion under benign settings, different coalition objectives, persuasion tactics, and network topologies. We then introduce a new behavioral probe by creating secret communication channels between agents, showing that most out-of-the-box models exhibit a propensity to collude under this probe, which we term emergent collusion. Furthermore, we discover "collusion on paper" when agents plan to collude in text but often pick non-collusive actions. Colosseum provides a new way to audit collusion in cooperative multi-agent systems while presenting observations about how collusion emerges, what affects collusion efficacy, and which strategies may mitigate it.
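
Regret relative to the cooperative optimum can be sketched on a toy problem. The task-coverage objective, the action sets, and the function names below are hypothetical; they only show the quantity being measured, namely the gap between the best achievable joint value and the value of the joint action the agents actually took.

```python
from itertools import product

def regret(joint_value, action_sets, chosen):
    """Gap between the best joint value and the value of the chosen joint action."""
    optimum = max(joint_value(joint) for joint in product(*action_sets))
    return optimum - joint_value(chosen)

def joint_value(joint):
    # Toy cooperative objective: one point per distinct task covered.
    return len(set(joint))

action_sets = [(0, 1, 2)] * 3          # three agents, each picks one of three tasks
print(regret(joint_value, action_sets, (0, 1, 2)))  # 0
print(regret(joint_value, action_sets, (0, 0, 2)))  # 1
```

Exhaustive enumeration is only viable for tiny action spaces; it stands in here for whatever solver computes the cooperative optimum in a real audit.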

2602.14862 2026-05-28 stat.ML cs.AI cs.IT cs.LG math.IT stat.ME (updated version)

Not All Pixels Are Equal: Pixel-wise Meta-Learning for Medical Segmentation with Noisy Labels

Chenyu Mu, Guihai Chen, Xun Yang, Erkun Yang, Cheng Deng

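Affiliations: Xidian University; University of Science and Technology of China

AI summary: Proposes the MetaDCSeg framework, which dynamically learns pixel-wise weights and introduces a Dynamic Center Distance mechanism to model boundary uncertainty, suppressing the influence of noisy labels and improving boundary segmentation performance.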

Abstract

Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, limiting its application in real-world scenarios. Existing methods often directly adapt noisy label learning techniques designed for instance classification, overlooking the pixel-wise heterogeneity in medical segmentation with its spatially and anatomically varying difficulties. Consequently, global assumptions or simple confidence metrics fail to address these local variations, leaving boundary ambiguities unresolved. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model's attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg outperforms existing state-of-the-art methods.

2403.11852 2026-05-28 cs.RO cs.AI (updated version)

Delay-Aware Reinforcement Learning for Highway On-Ramp Merging under Stochastic Communication Latency

Amin Tabrizian, Zhitong Huang, Arsyi Aziz, Peng Wei

Affiliations: Department of Computer Science, George Washington University, Washington, D.C.; Connected and Automated Vehicle Program Manager, Traffic Operations Division, Virginia Department of Transportation; Department of Mechanical & Aerospace Engineering, George Washington University, Washington, D.C.

AI summary: To address delayed state observations caused by stochastic V2I communication latency, proposes the DAROM framework, which restores the Markov property through random-delay MDP modeling and a delay-aware encoder, and adds a physics-based safety controller for robust control.

Abstract

Delayed and partially observable state information poses significant challenges for reinforcement learning (RL)-based control in real-world autonomous driving. In highway on-ramp merging, a roadside unit (RSU) can sense nearby traffic, perform edge perception, and transmit state estimates to the ego vehicle over vehicle-to-infrastructure (V2I) links. With recent advancements in intelligent transportation infrastructure and edge computing, such RSU-assisted perception is increasingly realistic and already deployed in modern connected roadway systems. However, edge processing time and wireless transmission can introduce stochastic V2I communication delays, violating the Markov assumption and substantially degrading control performance. In this work, we propose DAROM, a Delay-Aware Reinforcement Learning framework for On-ramp Merging that is robust to stochastic delays. We model the problem as a random delay Markov decision process (RDMDP) and develop a unified RL agent for joint longitudinal and lateral control. To recover a Markovian representation under delayed observations, we introduce a Delay-Aware Encoder that conditions on delayed observations, masked action histories, and observed delay magnitude to infer the current latent state. We further integrate a physics-based safety controller to reduce collision risk during merging. Experiments in the Simulation of Urban MObility (SUMO) simulator using real-world traffic data from the Next Generation Simulation (NGSIM) dataset demonstrate that DAROM consistently outperforms standard RL baselines across traffic densities. In particular, the gated recurrent unit (GRU)-based encoder achieves over 99% success in high-density traffic with random V2I delays of up to 2.0 seconds.
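
The three inputs named for the Delay-Aware Encoder (delayed observation, masked action history, observed delay magnitude) could be assembled roughly as below. The masking convention (keep only the actions issued after the delayed observation was captured), the scalar actions, the normalization by `max_delay`, and the function name are all assumptions for illustration; the paper's encoder may build its input differently.

```python
def encoder_input(delayed_obs, action_history, delay_steps, max_delay):
    """Concatenate a delayed observation, a masked action window, and the normalized delay.

    action_history: scalar actions, most recent last, with at least max_delay entries.
    Actions older than the delayed observation are zeroed out (assumed convention).
    """
    recent = action_history[-max_delay:]
    masked = [a if idx >= max_delay - delay_steps else 0.0
              for idx, a in enumerate(recent)]
    return list(delayed_obs) + masked + [delay_steps / max_delay]

# Toy call: observation is 2 steps old, window holds the last 4 actions.
features = encoder_input(delayed_obs=[12.0, 3.5],
                         action_history=[0.1, 0.2, 0.3, 0.4],
                         delay_steps=2, max_delay=4)
print(features)  # [12.0, 3.5, 0.0, 0.0, 0.3, 0.4, 0.5]
```

A fixed-length window keeps the input size constant across different delays, which is what lets a single recurrent or feed-forward encoder handle a random delay.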

2505.18647 2026-05-28 cs.LG cs.AI (updated version)

STFlow: Data-Coupled Flow Matching for Geometric Trajectory Simulation

Kiet Bennema ten Brinke, Koen Minartz, Vlado Menkovski

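Affiliations: Machine Learning for Physical Sciences (ML4Sci/e) Group, Department of Mathematics & Computer Science, Eindhoven University of Technology, The Netherlands

AI summary: Proposes STFlow, a generative model based on graph neural networks and hierarchical convolutions. Through a flow matching framework with data-dependent couplings, it denoises from conditioned random walks rather than Gaussian noise, lowering transport cost, improving training and inference efficiency, and achieving the lowest prediction errors on N-body systems, molecular dynamics, and human trajectory forecasting.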

Comments Proceedings of the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea. PMLR 306, 2026, 18 pages, 12 figures

Abstract

Simulating trajectories of dynamical systems is a fundamental problem in a wide range of fields such as molecular dynamics, biochemistry, and pedestrian dynamics. Machine learning has become an invaluable tool for scaling physics-based simulators and developing models directly from experimental data. In particular, recent advances in deep generative modeling and geometric deep learning enable probabilistic simulation by learning complex trajectory distributions while respecting intrinsic permutation and time-shift symmetries. However, trajectories of N-body systems are commonly characterized by high sensitivity to perturbations leading to bifurcations, as well as multi-scale temporal and spatial correlations. To address these challenges, we introduce STFlow (Spatio-Temporal Flow), a generative model based on graph neural networks and hierarchical convolutions. By incorporating data-dependent couplings within the Flow Matching framework, STFlow denoises starting from conditioned random-walks instead of Gaussian noise. This novel informed prior simplifies the learning task by reducing transport cost, increasing training and inference efficiency. We validate our approach on N-body systems, molecular dynamics, and human trajectory forecasting. Across these benchmarks, STFlow achieves the lowest prediction errors with fewer simulation steps and improved scalability.
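
A minimal sketch of flow matching with a data-dependent source sample follows. It uses the standard linear interpolant between a source `x0` and a target `x1`, with `x0` drawn as a random walk started from the last observed position instead of as Gaussian noise. The one-dimensional trajectories, the step size, and the function names are assumptions; the actual STFlow prior and network are not reproduced here.

```python
import random

def random_walk_prior(last_observed, horizon, step_std, rng):
    """Source sample: a random walk that starts at the last observed position."""
    pos, path = last_observed, []
    for _ in range(horizon):
        pos = pos + rng.gauss(0.0, step_std)
        path.append(pos)
    return path

def flow_matching_pair(x0, x1, t):
    """Point x_t on the straight path from x0 to x1 and the target velocity x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

rng = random.Random(0)
future = [1.0, 1.5, 2.1]                      # toy ground-truth future positions (1-D)
x0 = random_walk_prior(last_observed=0.8, horizon=3, step_std=0.1, rng=rng)
x_t, v = flow_matching_pair(x0, future, t=0.25)
```

A network would be trained to predict `v` from `x_t` and `t`. Because the random walk already starts near the observed state, the displacement `x1 - x0` tends to be smaller than it would be from pure noise, which is the transport-cost argument made in the abstract.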

2503.01829 2026-05-28 cs.CL cs.AI cs.LG cs.MA (updated version)

Persuade Me if You Can: A Framework for Evaluating Persuasion Effectiveness and Susceptibility Among Large Language Models

Nimet Beyza Bozdag, Shuhaib Mehri, Gokhan Tur, Dilek Hakkani-Tür

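Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes the PMIYC framework, which automatically evaluates LLM persuasive effectiveness and susceptibility to persuasion through multi-agent conversations, and finds marked differences across models in both persuasiveness and resistance to persuasion.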

Comments Paper published at the ACM Conference on AI and Agentic Systems 2026

DOI: 10.1145/3786335.3813181

Abstract

Large Language Models (LLMs) demonstrate persuasive capabilities that rival human-level persuasion. While these capabilities can be used for social good, they also present risks of potential misuse. Beyond the concern of how LLMs persuade others, their own susceptibility to persuasion poses a critical alignment challenge, raising questions about robustness, safety, and adherence to ethical principles. To study these dynamics, we introduce Persuade Me If You Can (PMIYC), an automated framework for evaluating persuasiveness and susceptibility to persuasion in multi-agent interactions. Our framework offers a scalable alternative to the costly and time-intensive human annotation process typically used to study persuasion in LLMs. PMIYC automatically conducts multi-turn conversations between Persuader and Persuadee agents, measuring both the effectiveness of and susceptibility to persuasion. Our comprehensive evaluation spans a diverse set of LLMs and persuasion settings (e.g., subjective and misinformation scenarios). We validate the efficacy of our framework through human evaluations and demonstrate alignment with human assessments from prior studies. Through PMIYC, we find that Llama-3.3-70B and GPT-4o exhibit similar persuasive effectiveness, outperforming Claude 3 Haiku by 30%. However, GPT-4o demonstrates over 50% greater resistance to persuasion for misinformation compared to Llama-3.3-70B. Notably, o4-mini emerges as both an effective persuader, and a resistant persuadee. These findings provide empirical insights into the persuasive dynamics of LLMs and contribute to the development of safer AI systems.

2602.03515 2026-05-28 cs.LG cs.AI cs.DC (updated version)

Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

Hyunji Jung, Sungbin Shin, Namhoon Lee

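Affiliations: POSTECH

AI summary: To address gradient staleness that grows linearly with pipeline depth in asynchronous pipeline parallelism, proposes a basis rotation framework that keeps delayed updates useful by aligning the optimizer's coordinate system with the Hessian eigenbasis. Theory shows it minimizes basis misalignment, and experiments training an LLM of up to 3B parameters show an 81.7% reduction in required iterations.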

Comments ICML 2026

Abstract

Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness, where the immediate model updates with delayed gradients introduce noise into the optimization process. Crucially, we identify a critical, yet often overlooked, pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability that the method originally intends to provide. We trace this pathology to a specific property of the optimization landscape: the misalignment between the Hessian eigenbasis and the standard coordinate basis, which triggers oscillations in the update trajectories of coordinate-wise adaptive optimizers. We identify that these oscillations cause delayed updates to diverge from their true counterparts, invalidating their use for current iterations. This insight is formalized through theoretical analysis, including a convergence bound showing that basis misalignment amplifies the delay penalty, and substantiated with empirical evaluation. To address this, we propose basis rotation, a framework that rotates the optimizer's coordinate system to align with the Hessian eigenbasis, keeping delayed updates useful. We theoretically demonstrate that basis rotation minimizes basis misalignment, thereby counteracting the conditions that amplify delay penalties. Empirically, in training up to a 3B-parameter LLM, basis rotation reduces the required iterations by 81.7% compared to the best-performing asynchronous baseline.
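
The idea of running a coordinate-wise adaptive update in a rotated basis can be sketched on a two-dimensional quadratic. The sketch assumes an orthogonal matrix `basis` whose columns approximate the Hessian eigenvectors is already available; how the paper obtains or maintains that basis is not shown, and the momentum-free update rule and all names here are illustrative rather than the authors' algorithm.

```python
import math

def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(len(v))) for i in range(len(m))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def rotated_adaptive_step(params, grad, basis, second_moment,
                          lr=0.1, beta2=0.9, eps=1e-8):
    """One RMS-normalized step taken in the coordinates defined by `basis`."""
    g_rot = matvec(transpose(basis), grad)               # gradient in the rotated basis
    second_moment = [beta2 * v + (1.0 - beta2) * g * g
                     for v, g in zip(second_moment, g_rot)]
    step_rot = [g / (math.sqrt(v) + eps) for g, v in zip(g_rot, second_moment)]
    step = matvec(basis, step_rot)                       # rotate the step back
    return [p - lr * s for p, s in zip(params, step)], second_moment

# Quadratic with eigenvalues 10 and 1 along directions rotated 45 degrees from the axes.
c = 1.0 / math.sqrt(2.0)
basis = [[c, -c], [c, c]]                 # columns are the assumed eigenvectors
params = [0.0, math.sqrt(2.0)]            # equal components along both eigenvectors
grad = [9.0 * c, 11.0 * c]                # Hessian times params for this quadratic
new_params, moment = rotated_adaptive_step(params, grad, basis, [0.0, 0.0])
```

In the rotated coordinates the two gradient components (10 and 1) are normalized independently, so the step points straight at the minimum: the first parameter stays at zero and the second shrinks. The same optimizer applied in the standard axes would normalize the components 9c and 11c instead and step off that line.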

2602.02898 2026-05-28 cs.AI cs.CL (updated version)

TABX: A High-Throughput Sandbox Battle Simulator for Multi-Agent Reinforcement Learning

Hayeong Lee, JunHyeok Oh, Byung-Jun Lee

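Affiliations: Department of Artificial Intelligence, Korea University, Seoul, Republic of Korea; Gauss Labs Inc., Seoul, Republic of Korea

AI summary: Proposes TABX, a JAX-based high-throughput sandbox simulator that supports efficient research and evaluation in multi-agent reinforcement learning through reconfigurable tasks and hardware acceleration.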

Abstract

The design of environments plays a critical role in shaping the development and evaluation of cooperative multi-agent reinforcement learning (MARL) algorithms. While existing benchmarks highlight critical challenges, they often lack the modularity required to design custom evaluation scenarios. We introduce the Totally Accelerated Battle Simulator in JAX (TABX), a high-throughput sandbox designed for reconfigurable multi-agent tasks. TABX provides granular control over environmental parameters, permitting a systematic investigation into emergent agent behaviors and algorithmic trade-offs across a diverse spectrum of task complexities. Leveraging JAX for hardware-accelerated execution on GPUs, TABX enables massive parallelization and significantly reduces computational overhead. By providing a fast, extensible, and easily customized framework, TABX facilitates the study of MARL agents in complex structured domains and serves as a scalable foundation for future research. Our code is available at: https://github.com/ku-dmlab/TABX.

2509.23074 2026-05-28 cs.LG cs.AI (updated version)

Beyond Model Ranking: Predictability-Aligned Evaluation for Time Series Forecasting

Wanjin Feng, Yuan Yuan, Jingtao Ding, Yong Li

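Affiliations: Department of Electronic Engineering, Tsinghua University, Beijing, China

AI summary: To address benchmark leaderboards conflating model performance with the data's intrinsic unpredictability, proposes a predictability-aligned diagnostic framework grounded in spectral coherence, comprising the SCP score and the LUR tool, and reveals predictability drift and an architectural trade-off between models.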

Abstract

In the era of increasingly complex AI models for time series forecasting, progress is often measured by marginal improvements on benchmark leaderboards. However, this approach suffers from a fundamental flaw: standard evaluation metrics conflate a model's performance with the data's intrinsic unpredictability. To address this pressing challenge, we introduce a novel, predictability-aligned diagnostic framework grounded in spectral coherence. Our framework makes two primary contributions: the Spectral Coherence Predictability (SCP), a computationally efficient ($O(N\log N)$) and task-aligned score that quantifies the inherent difficulty of a given forecasting instance, and the Linear Utilization Ratio (LUR), a frequency-resolved diagnostic tool that precisely measures how effectively a model exploits the linearly predictable information within the data. We validate our framework's effectiveness and leverage it to reveal two core insights. First, we provide the first systematic evidence of "predictability drift", demonstrating that a task's forecasting difficulty varies sharply over time. Second, our evaluation reveals a key architectural trade-off: complex models are superior for low-predictability data, whereas linear models are highly effective on more predictable tasks. We advocate for a paradigm shift, moving beyond simplistic aggregate scores toward a more insightful, predictability-aware evaluation that fosters fairer model comparisons and a deeper understanding of model behavior.

2507.16679 2026-05-28 cs.CL cs.AI cs.CY (updated version)

PICACO: Pluralistic In-Context Value Alignment of LLMs via Total Correlation Optimization

Han Jiang, Dongyao Zhu, Xiaoyuan Yi, Ziang Xiao, Zhihua Wei, Xing Xie

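Affiliations: Johns Hopkins University, Baltimore, MD, USA; North Carolina State University, Raleigh, NC, USA; Microsoft Research Asia, Beijing, China; Tongji University, Shanghai, China

AI summary: To address the instruction bottleneck caused by value conflicts in in-context alignment, proposes PICACO, which optimizes a meta-instruction and maximizes the total correlation between specified values and model responses, achieving balanced pluralistic value alignment without fine-tuning.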

Comments ICML 2026

英文摘要

In-Context Learning has shown great potential for aligning Large Language Models (LLMs) with human values, helping reduce harmful outputs and accommodate diverse preferences without costly post-training, known as In-Context Alignment (ICA). However, LLMs' comprehension of input prompts remains agnostic, limiting ICA's ability to address value tensions--human values are inherently pluralistic, often imposing conflicting demands, e.g., stimulation vs. tradition. Current ICA methods therefore face the Instruction Bottleneck challenge, where LLMs struggle to reconcile multiple intended values within a single prompt, leading to incomplete or biased alignment. To address this, we propose PICACO, a novel pluralistic ICA method. Without fine-tuning, PICACO optimizes a meta-instruction that incorporates multiple values to better elicit LLMs' understanding of them and improve alignment. This is achieved by maximizing the total correlation between specified values and LLM responses, which theoretically reinforces value conformity and reduces distractive noise, resulting in more effective instructions. Extensive experiments on five value sets show that PICACO works well with both black-box and open-source LLMs, outperforms several recent strong baselines, and achieves a better balance across up to 8 distinct values.
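
For reference, total correlation has a standard information-theoretic definition, stated below for generic random variables. Reading $X_1,\dots,X_n$ as the specified values and $Y$ as the LLM response is an assumption made here for illustration; the abstract does not give PICACO's exact objective, which may differ.

```latex
\begin{align}
\mathrm{TC}(X_1,\dots,X_n,Y)
  &= \sum_{i=1}^{n} H(X_i) + H(Y) - H(X_1,\dots,X_n,Y) \\
  &= D_{\mathrm{KL}}\!\left( p(x_1,\dots,x_n,y) \,\middle\|\, p(y)\prod_{i=1}^{n} p(x_i) \right)
\end{align}
```

Total correlation is zero exactly when all variables are mutually independent, so maximizing it rewards responses that carry shared information with every specified value at once.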

2601.21666 2026-05-28 cs.AI cs.CV 版本更新

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

SONIC-O1：用于评估多模态大语言模型在音视频理解上的真实世界基准

Ahmed Y. Radwan, Christos Emmanouilidis, Hina Tabassum, Deval Pandya, Shaina Raza

发表机构 * Vector Institute for Artificial Intelligence（向量人工智能研究所）； University of Groningen（格罗宁根大学）； York University（约克大学）

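AI summary: Introduces the SONIC-O1 benchmark, with 60 hours of human-verified audio-video data, to evaluate multimodal large language models on open-ended summarization, multiple-choice question answering, and temporal localization; finds substantial performance gaps and demographic disparities in temporal localization.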

AI中文摘要

多模态大语言模型（MLLMs）是近期AI研究的主要焦点。然而，大多数先前工作集中于静态图像理解，而它们处理序列音视频数据的能力仍未充分探索。这一差距凸显了需要一个高质量基准来系统评估MLLM在真实世界场景中的性能。我们介绍了SONIC-O1，一个全面的、完全人工验证的基准，包含60小时（231个片段）跨越13个真实世界对话领域的数据，带有4,958个注释和人口统计元数据。SONIC-O1评估三种能力：开放摘要、多项选择题（MCQ）回答以及带有支持理由（推理）的时序定位。在闭源和开源模型中，我们发现MCQ准确率显示模型家族之间的差距最小，但最好的闭源模型在时序定位上比最好的开源模型高出22.6%。我们进一步观察到不同人口统计组在时序定位上的准确率差距高达21.4%，表明模型行为存在持续差异。SONIC-O1为基于时序和人口统计鲁棒的多模态理解提供了一个开放评估套件。SONIC-O1公开可用于研究：项目页面（https://vectorinstitute.github.io/sonic-o1/）、数据集（https://huggingface.co/datasets/vector-institute/sonic-o1）、GitHub（https://github.com/vectorinstitute/sonic-o1）、排行榜（https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard）。

英文摘要

Multimodal Large Language Models (MLLMs) are a major focus of recent AI research. However, most prior work focuses on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in a real-world setting. We introduce SONIC-O1, a comprehensive, fully human-verified benchmark of 60 hours (231 clips) spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates three capabilities: open-ended summarization, multiple-choice question (MCQ) answering, and temporal localization with supporting rationales (reasoning). Across closed- and open-source models, we find that the MCQ accuracy shows the smallest gap between model families, but the best closed-source model outperforms the best open-source model by 22.6% on temporal localization. We further observe accuracy gaps of up to 21.4% on temporal localization across demographic groups, indicating persistent disparities in model behaviour. SONIC-O1 provides an open evaluation suite for temporally grounded and demographically robust multimodal understanding. SONIC-O1 is publicly available for research: Project page (https://vectorinstitute.github.io/sonic-o1/), Dataset (https://huggingface.co/datasets/vector-institute/sonic-o1), GitHub (https://github.com/vectorinstitute/sonic-o1), Leaderboard (https://huggingface.co/spaces/vector-institute/sonic-o1-leaderboard).

2509.23019 2026-05-28 cs.CR cs.AI 版本更新

LLM Watermark Evasion via Bias Inversion

通过偏差反转实现LLM水印规避

Jeongyeon Hwang, Sangdon Park, Jungseul Ok

发表机构 * Pohang University of Science and Technology (POSTECH)（浦项科学技术大学）

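AI summary: Proposes the Bias-Inversion Rewriting Attack (BIRA); theoretical analysis shows that lowering the average conditional probability of green tokens makes the detection probability decay exponentially, yielding high query-free (black-box) evasion rates (>99%) while preserving semantic fidelity.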

AI中文摘要

水印为检测LLM生成内容提供了一种有前景的解决方案，但在现实无查询（黑盒）规避下的鲁棒性仍是一个开放挑战。现有的无查询攻击往往成功率有限或严重扭曲语义。我们通过理论分析重写型规避来弥合这一差距，证明将绿色令牌的平均条件概率降低一个小幅度会导致检测概率指数级衰减。受此洞察启发，我们提出了偏差反转重写攻击（BIRA），一种实用的无查询方法，该方法对通过令牌惊讶度识别的代理抑制集应用负对数几率偏差。实验上，BIRA在多种水印方案中实现了最先进的规避率（>99%），同时语义保真度显著优于先前的基线。我们的发现揭示了当前水印方法的一个根本性漏洞，并强调了进行严格压力测试的必要性。我们的代码可在 https://github.com/ml-postech/LLM-Watermark-Evasion-via-Bias-Inversion 获取。

英文摘要

Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests. Our code is available at https://github.com/ml-postech/LLM-Watermark-Evasion-via-Bias-Inversion.
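
The two ingredients named in the abstract, a negative logit bias on a suppression set and a detector sensitive to the green-token rate, can be illustrated with a small sketch. It assumes a one-proportion z-test detector of the kind commonly used for green-list watermarks; the abstract does not say which detectors BIRA targets, and the function names, `gamma`, `delta`, and the toy numbers are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def biased_probs(logits, suppress_ids, delta):
    """Subtract `delta` from the logits of tokens in the (hypothetical)
    suppression set, then renormalize with softmax."""
    adjusted = [x - delta if i in suppress_ids else x
                for i, x in enumerate(logits)]
    return softmax(adjusted)


def watermark_z_score(green_count, total_tokens, gamma):
    """One-proportion z-test: how far the observed green-token count lies
    above the count expected for unwatermarked text (gamma * T)."""
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


# Toy vocabulary of four tokens; tokens 0 and 1 play the suppression set.
probs = biased_probs([1.0, 1.0, 1.0, 1.0], suppress_ids={0, 1}, delta=2.0)
suppressed_mass = probs[0] + probs[1]

# Toy detection statistics for a 200-token text with gamma = 0.5.
z_watermarked = watermark_z_score(green_count=150, total_tokens=200, gamma=0.5)
z_rewritten = watermark_z_score(green_count=100, total_tokens=200, gamma=0.5)
```

In this toy setting the bias moves most of the probability mass away from the suppressed tokens, and a text whose green-token rate falls back to `gamma` has a z-score of zero, which is the direction of the effect the abstract describes.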

2601.19926 2026-05-28 cs.CL cs.AI 版本更新

The Grammar of Transformers: A Systematic Review of Interpretability Research on Syntactic Knowledge in Language Models

Transformer的语法：语言模型中句法知识可解释性研究的系统综述

Nora Graichen, Iria de-Dios-Flores, Gemma Boleda

发表机构 * Universitat Pompeu Fabra（巴塞罗那庞培乌法布拉大学）； ICREA（加泰罗尼亚国家研究委员会）

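AI summary: A systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs); finds that TLMs encode non-trivial syntactic knowledge but perform worse on syntax-semantics interface phenomena, and that research is concentrated on English and BERT-like models.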

AI中文摘要

我们对337篇评估基于Transformer的语言模型（TLM）句法能力的文章进行了系统综述，报告了涵盖广泛句法现象、语言、模型和方法的3000多个数据点。这些数据共同表明，TLM编码了非平凡的句法知识。行为证据显示，TLM在形式句法现象上表现强劲，但在句法-语义接口现象上表现较弱且多变。对于数字支持较少的语言，表现也持续较低。探针和机制研究进一步支持TLM中存在句法知识。然而，由于大多数工作仍停留在观察层面，且当前方法在方法论上具有异质性，对句法处理背后的详细计算机制的洞察仍然有限。同时，文献仍然高度集中在英语和BERT类模型上。我们讨论了研究结果的意义，并为未来研究提供了建议。

英文摘要

We present a systematic review of 337 articles evaluating the syntactic abilities of Transformer-based language models (TLMs), reporting on over 3,000 datapoints spanning a wide range of syntactic phenomena, languages, models, and methods. We take the data to collectively show that TLMs encode a non-trivial amount of syntactic knowledge. Behavioral evidence shows strong performance on formal syntactic phenomena, but weaker and more variable performance on phenomena at the syntax-semantics interface. Performance is also consistently lower for languages with less digital support. Probing and mechanistic studies further support the presence of syntactic knowledge in TLMs. Yet, because most work remains observational and current approaches are methodologically heterogeneous, insight into the detailed computational mechanisms underlying syntactic processing remains limited. At the same time, the literature remains heavily concentrated on English and BERT-like models. We discuss the implications of our results and provide recommendations for future research.

2509.06350 2026-05-28 cs.CL cs.AI cs.CR 版本更新

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Mask-GCG：对抗性后缀中的所有标记对于越狱攻击都是必要的吗？

Junjie Mu, Zonghao Ying, Zhekui Fan, Zonglei Jing, Yaoyuan Zhang, Zhengmin Yu, Wenxin Zhang, Quanchen Zou, Xiangzheng Zhang

发表机构 * Politecnico di Milano（米兰理工学院）； Beihang University（北京航空航天大学）； East China Normal University（华东师范大学）； Fudan University（复旦大学）； University of the Chinese Academy of Sciences（中国科学院大学）； AI Security Lab（360人工智能安全实验室）

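AI summary: Proposes Mask-GCG, which uses learnable token masking to identify high-impact tokens in the suffix and prune low-impact ones, lowering computational overhead while maintaining attack success rate and revealing token redundancy in LLM prompts.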

Comments Accepted to ICASSP 2026

DOI: 10.1109/ICASSP55912.2026.11462363
Journal ref: 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 13887-13891, 2026

AI中文摘要

针对大型语言模型（LLM）的越狱攻击已展示了多种成功方法，攻击者操纵模型生成其本应避免的有害响应。其中，贪婪坐标梯度（GCG）作为一种通用且有效的方法，通过优化后缀中的标记来生成可越狱的提示。尽管已提出多种GCG的改进变体，但它们都依赖于固定长度的后缀。然而，这些后缀中潜在的冗余尚未被探索。在这项工作中，我们提出Mask-GCG，一种即插即用的方法，采用可学习的标记掩码来识别后缀中的高影响力标记。我们的方法增加了高影响力位置标记的更新概率，同时剪枝低影响力位置的标记。这种剪枝不仅减少了冗余，还降低了梯度空间的大小，从而减少了计算开销，并缩短了实现成功攻击所需的时间。我们将Mask-GCG应用于原始GCG及其多种改进变体进行评估。实验结果表明，后缀中的大多数标记对攻击成功有显著贡献，剪枝少数低影响力标记不会影响损失值或攻击成功率（ASR），从而揭示了LLM提示中的标记冗余。我们的发现从越狱攻击的角度为开发高效且可解释的LLM提供了见解。

英文摘要

Jailbreak attacks on Large Language Models (LLMs) have demonstrated various successful methods whereby attackers manipulate models into generating harmful responses that they are designed to avoid. Among these, Greedy Coordinate Gradient (GCG) has emerged as a general and effective approach that optimizes the tokens in a suffix to generate jailbreakable prompts. While several improved variants of GCG have been proposed, they all rely on fixed-length suffixes. However, the potential redundancy within these suffixes remains unexplored. In this work, we propose Mask-GCG, a plug-and-play method that employs learnable token masking to identify impactful tokens within the suffix. Our approach increases the update probability for tokens at high-impact positions while pruning those at low-impact positions. This pruning not only reduces redundancy but also decreases the size of the gradient space, thereby lowering computational overhead and shortening the time required to achieve successful attacks compared to GCG. We evaluate Mask-GCG by applying it to the original GCG and several improved variants. Experimental results show that most tokens in the suffix contribute significantly to attack success, and pruning a minority of low-impact tokens does not affect the loss values or compromise the attack success rate (ASR), thereby revealing token redundancy in LLM prompts. Our findings provide insights for developing efficient and interpretable LLMs from the perspective of jailbreak attacks.

2601.17737 2026-05-28 cs.CV cs.AI 版本更新

The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation

脚本即一切：一个用于长程对话到电影视频生成的智能体框架

Chenyu Mu, Xin He, Qu Yang, Wanshun Chen, Jiadi Yao, Huang Liu, Zihao Yi, Bo Zhao, Xingyu Chen, Ruotian Ma, Fanghua Ye, Erkun Yang, Cheng Deng, Zhaopeng Tu, Xiaolong Li, Linus

发表机构 * Tencent（腾讯）

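AI summary: Proposes an end-to-end agentic framework in which a trained ScripterAgent turns dialogue into a fine-grained script and DirectorAgent applies a cross-scene continuous generation strategy, producing coherent long-horizon dialogue-to-cinematic video with significantly improved script faithfulness and temporal fidelity.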

AI中文摘要

近期视频生成的进展产生了能够从简单文本提示合成惊艳视觉内容的模型。然而，这些模型难以从对话等高层概念生成连贯的长篇叙事，揭示了创意想法与其电影执行之间的“语义鸿沟”。为弥合这一鸿沟，我们引入了一个新颖的、端到端的智能体框架，用于对话到电影视频的生成。我们框架的核心是ScripterAgent，一个经过训练将粗略对话转化为精细、可执行的电影脚本的模型。为此，我们构建了ScriptBench，一个具有丰富多模态上下文的新大规模基准，通过专家引导的流程进行标注。生成的脚本随后指导DirectorAgent，它使用跨场景连续生成策略协调最先进的视频模型，以确保长程连贯性。我们的全面评估，包括一个AI驱动的CriticAgent和一个新的视觉-脚本对齐（VSA）指标，表明我们的框架在所有测试的视频模型上显著提高了脚本忠实度和时间保真度。此外，我们的分析揭示了当前SOTA模型在视觉奇观与严格脚本遵循之间的关键权衡，为自动化电影制作的未来提供了宝贵见解。

英文摘要

Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.

2505.17654 2026-05-28 cs.CL cs.AI 版本更新

EVADE-Bench: Multimodal Benchmark for Evaluating and Enhancing Evasive Content Detection

EVADE-Bench：用于评估和增强规避性内容检测的多模态基准

Ancheng Xu, Zhihao Yang, Jingpeng Li, Guanghu Yuan, Longze Chen, Liang Yan, Jiehui Zhou, Zhen Qin, Hengyu Chang, Yukun Chen, Hamid Alinejad-Rokny, Min Yang

发表机构 * SIAT, Chinese Academy of Sciences（中国科学院深圳先进技术研究院）； University of Chinese Academy of Sciences（中国科学院大学）； Alibaba Group（阿里巴巴集团）； University of New South Wales（新南威尔士大学）

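AI summary: To address the vulnerability of LLMs/VLMs to evasive content on e-commerce platforms, introduces EVADE-Bench, the first expert-curated Chinese multimodal benchmark; evaluates 26 models and finds that clearer rule categorization improves prediction consistency and that multi-agent decomposition yields notable accuracy gains.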

Comments SIGIR 2026

AI中文摘要

电商平台越来越依赖大型语言模型（LLMs）和视觉语言模型（VLMs）来检测非法或误导性产品内容。然而，这些模型仍然容易受到规避性内容的影响，即通过分词、委婉语言或图像裁剪等技术故意修改的输入，以掩盖违反政策的行为，同时仍传达被禁止的主张。关键在于，检测此类内容需要模型同时掌握两种能力：准确理解复杂规则，以及正确推断故意混淆的多模态输入背后的真实意图。虽然先前的工作分别探索了LLM对复杂规则的推理和基于LLM的规避性内容检测，但现有基准尚未将两者结合在统一的评估框架内。这一差距在电商领域尤为严重，因为准确的审核要求这两种能力协同运作。为填补这一空白，我们引入了EVADE-Bench，这是首个专家策划的中文多模态基准，专门设计用于评估LLMs和VLMs在真实电商场景中的规避性内容检测。我们对26个开源和闭源LLMs及VLMs的全面评估显示，即使是最先进的模型也经常错误分类规避性样本。我们进一步证明，更清晰的规则分类显著提高了模型预测的一致性并减少了错误预测，凸显了基准设计在实现可靠评估中的关键作用。为了探索性能提升的路径，我们研究了多智能体分解在多模态推理中的可行性，即将视觉描述和逻辑推理解耦为独立的智能体，并发现这一策略带来了显著的准确率提升。

英文摘要

E-commerce platforms increasingly rely on Large Language Models (LLMs) and Vision Language Models (VLMs) to detect illicit or misleading product content. However, these models remain vulnerable to evasive content, which refers to inputs that have been deliberately modified through techniques such as word splitting, euphemistic language, or image cropping to conceal policy violations while still conveying prohibited claims. Crucially, detecting such content requires a model to simultaneously master two capabilities: accurately comprehending complex rules, and correctly inferring the true intent behind deliberately obfuscated multimodal inputs. While prior work has separately explored LLM reasoning over complex rules and LLM-based detection of evasive content, no existing benchmark combines both within a unified evaluation framework. This gap is particularly consequential in e-commerce, where accurate moderation demands that both capabilities operate in concert. To address this gap, we introduce EVADE-Bench, the first expert-curated Chinese multimodal benchmark specifically designed to evaluate LLMs and VLMs on evasive content detection in real-world e-commerce scenarios. Our comprehensive evaluation of 26 open- and closed-source LLMs and VLMs reveals that even state-of-the-art models frequently misclassify evasive samples. We further demonstrate that clearer rule categorization significantly improves model prediction consistency and reduces false predictions, highlighting the critical role of benchmark design in enabling reliable evaluation. To explore paths for performance improvement, we investigate the feasibility of multi-agent decomposition for multimodal reasoning, wherein visual description and logical inference are decoupled into separate agents, and find that this strategy yields notable accuracy gains.

2601.06329 2026-05-28 cs.CL cs.AI 版本更新

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

论口语语言模型评估中全局令牌困惑度的谬误

Chan-Jan Hsu, Liang-Hsuan Tseng, Yi-Cheng Lin, Yen-Chun Kuo, Ju-Chieh Chou, Kai-Wei Chang, Hung-yi Lee, Carlos Busso

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； National Taiwan University（国立台湾大学）； Toyota Technological Institute at Chicago（芝加哥丰田技术研究所）； Massachusetts Institute of Technology（麻省理工学院）

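AI summary: Argues against computing speech-token perplexity by directly applying the text perplexity formula when evaluating spoken language models; proposes likelihood- and generation-based evaluation methods that reflect generation quality more faithfully and narrow the gap between the best model and the human baseline.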

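AI abstract (translated from Chinese)

Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content while preserving attributes such as speaker and emotion, serving as foundation models for spoken dialogue. In prior literature, these models are typically evaluated with "global token perplexity", which directly applies the text perplexity formula to speech tokens. However, this practice overlooks fundamental differences between the speech and text modalities and can lead to an underestimation of speech characteristics. In this work, we propose a variety of likelihood- and generation-based evaluation methods to replace naive global token perplexity. We demonstrate that the proposed evaluations reflect perceived generation quality more faithfully, showing stronger correlation with human-rated mean opinion scores (MOS). Under the new metrics, the relative performance landscape of spoken language models is reshaped, revealing a significantly reduced gap between the best-performing model and the human baseline. Together, these results indicate that appropriate evaluation is essential for accurately assessing progress in spoken language modeling.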

HGMem: Hypergraph-Based Working Memory to Improve Multi-Step RAG for Long-Context Complex Relational Modeling

Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Pengcheng Laboratory（鹏城实验室）； WeChat AI, Tencent（微信AI，腾讯）； University of Chinese Academy of Sciences（中国科学院大学）

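AI summary: Proposes HGMem, a hypergraph-based working memory system in which hyperedges represent memory units and progressively form high-order interactions, strengthening global understanding and complex reasoning in multi-step RAG.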

Comments ICML 2026; Code released at https://github.com/Encyclomen/HGMem

AI中文摘要

多步检索增强生成（RAG）已成为增强大型语言模型（LLMs）在需要全局理解和密集推理任务上的广泛采用策略。尽管许多RAG系统整合了工作记忆来整合信息，但现有设计主要作为孤立事实的被动存储。这种静态特性忽略了原始事实之间的关键高阶相关性，从而限制了模型的多步推理能力，导致在扩展上下文中的碎片化推理和弱全局理解。我们引入了HGMem，一种基于超图的工作记忆系统，将记忆的概念从简单存储扩展到动态、表达性结构，用于复杂推理和全局理解。在我们的方法中，记忆被表示为超图，其中超边对应不同的记忆单元，使得记忆内高阶交互的逐步形成成为可能。该机制连接围绕焦点问题的事实和思考，将记忆演变为一个集成且情境化的知识结构，为更深层次的推理提供强有力的命题。我们在几个具有挑战性的全局理解基准上评估了HGMem。大量实验和深入分析表明，我们的方法持续改进了多步RAG，并在不同数据集上显著优于强基线系统。

英文摘要

Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Although many RAG systems incorporate a working memory to consolidate information, existing designs primarily function as a passive storage for isolated facts. This static nature overlooks crucial high-order correlations among primitive facts, thereby limiting models' capacity for multi-step reasoning and resulting in fragmented reasoning and weak global sense-making within extended contexts. We introduce HGMem, a hypergraph-based working memory system, extending the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph where hyperedges correspond to distinct memory units, enabling the progressive formation of high-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving the memory into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning. We evaluate HGMem on several challenging global sense-making benchmarks. Extensive experiments and in-depth analyses demonstrate that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse datasets.

2512.22777 2026-05-28 cs.LG cs.AI 版本更新

Adapting, Fast and Slow: On Few-Shot Transportability of Compositions

适应，快与慢：关于组合的少样本可迁移性

Kasra Jalaldoust, Elias Bareinboim

发表机构 * Causal Artificial Intelligence Lab（因果人工智能实验室）

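AI summary: Studies how, in few-shot settings, causal mechanisms learned from source domains can be composed into a target-domain predictor using causal transportability theory; distinguishes module transportability from circuit transportability and proposes a gradient-based relaxation of circuit search that exhibits fast or slow adaptation.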

AI中文摘要

跨域泛化需要连接源分布和目标分布的稳定结构。基于因果传输性理论，我们研究了一个序列预测设置，其中目标预测器可以表示为从源数据可学习的因果机制组成的电路。我们引入了两类传输性。模块传输性捕获原子情况，其中目标预测器由可从单个源域学习的机制给出。电路传输性将此思想推广到通过组合从源数据学习的多个模块获得的目标预测器，即使没有源机制直接预测目标标签，也能实现零样本预测。我们在逐渐放松的假设下研究这些电路类别。首先，我们提供了条件，在这些条件下，给定关于源域和目标域的因果知识，可以从源数据单独学习相关电路。然后，我们通过允许来自目标域的有限数据来放松这些结构假设。特别地，我们开发了一种监督域适应方案，该方案无需显式因果结构即可学习电路。由此产生的少样本保证将可实现误差与可从源数据学习的模块组成的最小目标电路的大小联系起来。最后，我们提出了符号电路搜索的基于梯度的松弛，并进行了实证评估，表明它定性地跟踪了预测的快速适应机制——有和没有中间位置的过程监督——以及当没有源机制匹配时的慢速适应。

英文摘要

Generalization across domains requires stable structure that links the source and target distributions. Building on causal transportability theory, we study a sequential prediction setting in which the target predictor can be represented as a circuit composed of causal mechanisms that are learnable from source data. We introduce two classes of transportability. Module transportability captures the atomic case, where the target predictor is given by a mechanism learnable from a single source domain. Circuit transportability generalizes this idea to target predictors obtained by composing several modules learned from source data, enabling zero-shot prediction even when no source mechanism directly predicts the target label. We study these classes of circuits under increasingly relaxed assumptions. First, we provide conditions under which the relevant circuits can be learned from source data alone, given causal knowledge about the source and target domains. We then relax these structural assumptions by allowing limited data from the target domain. In particular, we develop a supervised domain adaptation scheme that learns circuits without requiring explicit causal structure. The resulting few-shot guarantees tie the achievable error to the size of the smallest target circuit composable from modules learned from source data. Finally, we propose a gradient-based relaxation of the symbolic circuit search and evaluate it empirically, showing that it qualitatively tracks the predicted regimes of fast adaptation -- with and without process supervision over intermediate positions -- and slow adaptation when no source mechanism matches.

2501.09934 2026-05-28 cs.LG cs.AI 版本更新

HEART: Achieving Timely Multi-Model Training for Vehicle-Edge-Cloud-Integrated Hierarchical Federated Learning

HEART：实现车辆-边缘-云集成分层联邦学习的多模型及时训练

Xiaohong Yang, Minghui Liwang, Xianbin Wang, Zhipeng Cheng, Seyyedali Hosseinalipour, Huaiyu Dai, Zhenzhen Jiao

发表机构 * School of Informatics, Xiamen University（厦门大学信息学院）； Department of Control Science and Engineering, Shanghai Institute of Intelligent Science and Technology（上海智能科学研究院控制科学与工程系）； State Key Laboratory of Autonomous Intelligent Unmanned Systems（自主智能无人系统国家重点实验室）； Shanghai Key Laboratory of Intelligent Autonomous Systems（上海智能自主系统重点实验室）； Frontiers Science Center for Intelligent Autonomous Systems, Ministry of Education, Tongji University（教育部智能自主系统前沿科学中心，同济大学）； Department of Electrical and Computer Engineering, Western University（西方大学电气与计算机工程系）； School of Future Science and Engineering, Soochow University（苏州大学未来科学与工程学院）

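AI summary: To address model obsolescence, inefficient data utilization, and unbalanced resource allocation in multi-model training for vehicle-edge-cloud hierarchical federated learning, proposes the HEART framework, which uses a hybrid synchronous-asynchronous aggregation rule and a two-stage optimization algorithm (improved PSO + GA, followed by a greedy algorithm) to minimize global training latency while balancing tasks.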

Comments Accepted by IEEE Transactions on Cloud Computing (22 pages, 7 figures)

AI中文摘要

人工智能赋能的物联网车辆（IoV）的快速发展需要高效的机器学习（ML）解决方案，以处理高车辆移动性和分散数据。这推动了车辆-边缘-云架构上的分层联邦学习（VEC-HFL）的出现。然而，VEC-HFL文献中尚未充分探讨的一个方面是，车辆通常需要同时执行多个ML任务，这种多模型训练环境带来了关键挑战。首先，不恰当的聚合规则可能导致模型过时和训练时间延长。其次，车辆移动性可能阻止车辆将模型返回网络边缘，导致数据利用效率低下。第三，跨不同任务实现平衡的资源分配变得至关重要，因为它极大地影响协作训练的有效性。我们率先提出一个针对动态VEC-HFL中多模型训练的框架，目标是最小化全局训练延迟，同时确保跨各种任务的平衡训练，该问题被证明是NP难的。为了促进及时模型训练，我们引入了一种混合同步-异步聚合规则。在此基础上，我们提出了一种称为混合进化与贪婪分配（HEART）的新方法。该框架分两个阶段运行：首先，通过结合改进的粒子群优化（PSO）和遗传算法（GA）的混合启发式方法实现平衡的任务调度；其次，采用低复杂度的贪心算法确定车辆上分配任务的训练优先级。在真实数据集上的实验证明了HEART相对于现有方法的优越性。

英文摘要

The rapid growth of AI-enabled Internet of Vehicles (IoV) calls for efficient Machine Learning (ML) solutions that can handle high vehicular mobility and decentralized data. This has motivated the emergence of Hierarchical Federated Learning over vehicle-edge-cloud architectures (VEC-HFL). Nevertheless, one underexplored aspect of the VEC-HFL literature is that vehicles often need to execute multiple ML tasks simultaneously, and this multi-model training environment introduces crucial challenges. First, improper aggregation rules can lead to model obsolescence and prolonged training times. Second, vehicular mobility may result in inefficient data utilization by preventing vehicles from returning their models to the network edge. Third, achieving a balanced resource allocation across diverse tasks becomes paramount, as it strongly affects the effectiveness of collaborative training. We take one of the first steps towards addressing these challenges by proposing a framework for multi-model training in dynamic VEC-HFL with the goal of minimizing global training latency while ensuring balanced training across various tasks, a problem that turns out to be NP-hard. To facilitate timely model training, we introduce a hybrid synchronous-asynchronous aggregation rule. Building on this, we present a novel method called Hybrid Evolutionary And gReedy allocaTion (HEART). The framework operates in two stages: first, it achieves balanced task scheduling through a hybrid heuristic approach that combines improved Particle Swarm Optimization (PSO) and Genetic Algorithms (GA); second, it employs a low-complexity greedy algorithm to determine the training priority of assigned tasks on vehicles. Experiments on real-world datasets demonstrate the superiority of HEART over existing methods.

2501.06491 2026-05-28 cs.SE cs.AI cs.SY eess.SY 版本更新

Improving Requirements Classification with SMOTE-Tomek Preprocessing

使用SMOTE-Tomek预处理改进需求分类

Barak Or

发表机构 * ArtificialGate Ltd.（ArtificialGate有限公司）

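AI summary: To address class imbalance in the PROMISE dataset, applies SMOTE-Tomek preprocessing with stratified K-fold cross-validation, significantly improving requirements classification accuracy; logistic regression reaches 76.16%.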

Comments 21 pages, 5 figures, Preprint

2512.18444 2026-05-28 cs.GT cs.AI cs.DC cs.MA 版本更新

Snowveil: A Framework for Decentralised Preference Discovery

Snowveil: 一种去中心化偏好发现的框架

Grammateia Kotsialou

发表机构 * King’s College London（伦敦国王学院）

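AI summary: For decentralised preference discovery, proposes Snowveil, a gossip-based framework that uses random sampling and local belief updates to converge on the social choice parameter with tunable high probability in finite expected time, and introduces the Constrained Hybrid Borda rule to balance broad consensus with plurality support.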

AI中文摘要

在传统社会选择中，聚合主观偏好通常假设存在一个可信的中心权威。相反，本文形式化了去中心化偏好发现（DPD）：在部分信息、异步交互、抗审查且无中心协调者的条件下，可靠地识别社会选择参数（例如，应用于全局偏好配置的聚合规则的规范结果）。为了解决DPD，我们提出了Snowveil，一个基于八卦的框架，其中智能体重复采样随机同伴排名并更新局部信念，以收敛到规范结果。利用势函数、亚鞅理论和集中界，我们证明了系统以可调的高概率在有限期望时间内达到该稳定状态。然后可以迭代这一单胜者过程，以构建多胜者场景中的一组获胜候选者。Snowveil对特定聚合规则不可知，仅要求规则满足如正向响应等公理，从而为更广泛的DPD协议提供了形式基础。为了展示Snowveil的模块化，我们引入了约束混合博尔达（CHB），一种旨在平衡广泛共识与多数支持的聚合规则。我们提供了CHB的公理分析，并通过大量模拟展示了实证结果，验证了Snowveil的O(n)可扩展性。总体而言，这项工作为大规模去中心化系统中如何从主观、表达性和多样化的偏好配置中涌现稳定共识奠定了基础。

英文摘要

Aggregating subjective preferences in social choice traditionally assumes a trusted central authority. In contrast, this paper formalises Decentralised Preference Discovery (DPD): the reliable identification of a social choice parameter (e.g. the canonical outcome of an aggregation rule applied to the global preference profile) under conditions of partial information, asynchronous interaction, censorship resistance, and no central coordinator. To address DPD, we propose Snowveil, a gossip-based framework where agents repeatedly sample random peer rankings and update local beliefs to converge on the canonical outcome. Using a potential function, submartingale theory, and concentration bounds, we prove the system reaches this stable state with tunable high probability, in finite expected time. This single-winner process can then be iterated to construct a set of winning candidates for multi-winner scenarios. Snowveil is agnostic to specific aggregation rules, requiring only that the rule satisfies axioms such as Positive Responsiveness, thus offering a formal basis for a wider class of DPD protocols. Demonstrating Snowveil's modularity, we introduce the Constrained Hybrid Borda (CHB), an aggregation rule designed to balance broad consensus with plurality support. We provide an axiomatic analysis of CHB and present empirical results via extensive simulation, validating Snowveil's O(n) scalability. Overall, this work provides a foundation for how a stable consensus emerges from subjective, expressive, and diverse preference profiles in large-scale decentralised systems.
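
To make the idea of an aggregation rule applied to sampled peer rankings concrete, here is a minimal sketch using the classic Borda count. It is not Snowveil's protocol and not the Constrained Hybrid Borda rule, neither of which is specified in the abstract; the function names, the alphabetical tie-break, and the toy profile are assumptions for illustration only.

```python
import random
from collections import defaultdict


def borda_scores(profile):
    """Classic Borda count: with m candidates, a ranking gives m-1 points
    to its top choice, m-2 to the next, and 0 to the last."""
    scores = defaultdict(int)
    for ranking in profile:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += m - 1 - position
    return dict(scores)


def borda_winner(profile):
    """Highest Borda score wins; ties are broken alphabetically (an
    arbitrary choice made for this sketch)."""
    scores = borda_scores(profile)
    return min(scores, key=lambda c: (-scores[c], c))


def sampled_winner(profile, k, rng):
    """Local estimate an agent might form from k randomly sampled peers."""
    return borda_winner(rng.sample(profile, k))


# Toy global preference profile: each inner list is one agent's ranking.
profile = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["b", "c", "a"],
    ["a", "b", "c"],
]

global_scores = borda_scores(profile)
global_winner = borda_winner(profile)
full_sample_winner = sampled_winner(profile, len(profile), random.Random(0))
```

With a sample that covers every agent, the local estimate matches the global winner by construction; smaller samples may disagree, which is the gap a gossip process with repeated sampling and belief updates is meant to close.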

2307.06240 2026-05-28 cs.LG cs.AI cs.RO cs.SY eess.SY 版本更新

DSSE: a drone swarm search environment

DSSE：无人机群搜索环境

Manuel Castanares, Luis F. S. Carrete, Enrico F. Damiani, Leonardo D. M. de Abreu, José Fernando B. Brancalion, Fabrício J. Barth

发表机构 * Insper ； Embraer

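AI summary: A PettingZoo-based multi-agent reinforcement learning environment in which drones search for targets using dynamic probability inputs.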

Comments 7 pages

2508.13544 2026-05-28 cs.CV cs.AI 版本更新

Object-Centric Vision Token Pruning for Vision Language Models

Guangyuan Li, Rongzhen Zhao, Jinhong Deng, Yanbo Wang, Joni Pajarinen

发表机构 * Aalto University（阿尔托大学）； University of Electronic Science and Technology of China（电子科技大学）； Delft University of Technology（代尔夫特理工大学）

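AI summary: Proposes OC-VTP, which lightly pre-trains an object-centric vision token pruner that directly selects the most representative vision tokens, improving VLM inference efficiency while maintaining high accuracy.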

AI中文摘要

在视觉语言模型（VLM）中，与语言令牌相比，视觉令牌数量庞大但信息分散，因此消耗了大量不必要的计算。为了提升VLM推理效率，剪枝冗余视觉令牌的研究一直在进行，但现有方法都采用间接且无保证的方式。我们提出了OC-VTP，一种直接且有保证的方法，用于选择最具代表性的视觉令牌，以实现高效且保持精度的VLM推理。我们的OC-VTP仅需对一个小型的以对象为中心的视觉令牌剪枝器进行轻量预训练，然后即可将其插入现有VLM中，无需在任何数据集上微调任何模型。通过最小化从所选令牌重建原始未剪枝令牌的误差，保证保留最具代表性的视觉令牌。在任何视觉剪枝比例（即推理效率）下，我们的OC-VTP都能一致地帮助主流VLM保持最高的推理精度。我们的剪枝还展示了有趣的可解释性。我们的代码可在 https://github.com/GarryLarry010131/OC-VTP 获取。

英文摘要

In Vision Language Models (VLMs), vision tokens are quantity-heavy yet information-dispersed compared with language tokens, and thus consume much unnecessary computation. Pruning redundant vision tokens for high VLM inference efficiency has been continuously studied, but all existing methods resort to indirect and non-guaranteed ways. We propose OC-VTP, a direct and guaranteed approach to select the most representative vision tokens for high-efficiency yet accuracy-preserving VLM inference. Our OC-VTP requires merely light-weight pre-training of a small object-centric vision token pruner, which can then be inserted into existing VLMs, without fine-tuning of any models on any datasets. It is guaranteed that the most representative vision tokens are kept by minimizing the error in reconstructing the original unpruned tokens from the selected ones. Across all vision pruning ratios, i.e., inference efficiency levels, our OC-VTP consistently helps mainstream VLMs preserve the highest inference accuracy. Our pruning also demonstrates interesting interpretability. Our code is available at https://github.com/GarryLarry010131/OC-VTP.

2511.11896 2026-05-28 cs.CR cs.AI cs.SE 版本更新

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

VULPO：基于策略优化的上下文感知漏洞检测

Youpeng Li, Fuxun Yu, Weiliang Qi, Xinda Wang

发表机构 * University of Texas at Dallas（德克萨斯大学达拉斯分校）； Microsoft（微软）

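AI summary: Proposes the VULPO framework, which builds ContextVul, a dataset with contextual information and reasoning traces, and combines cold-start supervised fine-tuning with vulnerability-adaptive on-policy optimization to significantly improve LLM vulnerability detection in real-world repositories.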

AI中文摘要

大语言模型（LLM）最近在漏洞检测（VD）中展现出强大潜力。然而，准确检测真实仓库中的漏洞需要推理复杂的上下文交互。现有的基于LLM的VD方法仍然有限，因为当前数据集缺乏完整的上下文信息和高质量的推理监督，而现有的优化方法主要依赖于粗粒度的结果中心监督信号，无法建模漏洞推理过程。为解决这些限制，我们首先构建了ContextVul，这是一个新数据集，用仓库级上下文信息和精心整理的漏洞推理轨迹增强了高质量函数级漏洞基准。基于ContextVul，我们引入了一个两阶段优化框架，包括轻量级冷启动监督微调，随后是漏洞自适应策略优化（VULPO）。VULPO结合了多维奖励，共同评估漏洞识别、漏洞相关定位和因果推理质量，以及难度自适应奖励缩放，以减轻奖励黑客攻击并提高强化学习效果。大量实验证明了VULPO在上下文感知VD中的优越性。我们的VULPO-4B，第一个专门的漏洞推理LLM，显著优于现有的VD基线，相对于Qwen3-4B将Pairwise Pass@1提高了203%，并实现了与规模大150%的LLM DeepSeek-V3.1相竞争的性能。

英文摘要

Large language models (LLMs) have recently shown strong potential in vulnerability detection (VD). However, accurately detecting vulnerabilities in real-world repositories requires reasoning over complex contextual interactions. Existing LLM-based VD approaches remain limited because current datasets lack complete contextual information and high-quality reasoning supervision, while existing optimization methods primarily rely on coarse outcome-centric supervision signals that fail to model the vulnerability reasoning process. To address these limitations, we first construct ContextVul, a new dataset that augments high-quality function-level vulnerability benchmarks with repository-level contextual information and curated vulnerability reasoning traces. Building upon ContextVul, we introduce a two-stage optimization framework consisting of lightweight cold-start supervised fine-tuning followed by vulnerability-adaptive on-policy optimization (VULPO). VULPO incorporates multidimensional rewards that jointly evaluate vulnerability identification, vulnerability-relevant localization, and causal reasoning quality, along with difficulty-adaptive reward scaling to mitigate reward hacking and improve RL effectiveness. Extensive experiments demonstrate the superiority of VULPO for context-aware VD. Our VULPO-4B, the first specialized vulnerability reasoning LLM, substantially outperforms existing VD baselines, improving Pairwise Pass@1 by 203% relative to Qwen3-4B and achieving competitive performance against a 150% larger-scale LLM, DeepSeek-V3.1.


2511.09572 2026-05-28 cs.AI cs.LG cs.SE 版本更新

SynthTools: A Framework for Scaling Synthetic Tools for Agent Development

SynthTools: 用于扩展智能体开发中合成工具的框架

Tommaso Castellani, Naimeng Ye, Daksh Mittal, Thomson Yen, Emmanouil Koukoumidis, William Zeng, Hongseok Namkoong

AI总结提出基于LLM的端到端管道SynthTools，通过环境生成、模拟、验证和任务构建，生成大规模多样化工具使用环境，提升智能体工具使用能力。


AI中文摘要

为了使智能体系统能够使用外部工具解决复杂、长期的任务，我们需要大量多样且可控的工具使用环境。我们引入了SynthTools，一个完全基于LLM的管道，涵盖整个生命周期：环境生成、模拟、验证和任务构建。通过端到端地使用LLM，我们的框架补充了其他受限于真实API复杂性的工具使用环境，并通过设计确保可扩展性和可控性。该框架由三个组件组成：自上而下的环境生成，分层构建多样化的、基于领域的工具环境；环境模拟与验证，确保工具能够可靠地模拟并过滤掉无法模拟的工具；以及自下而上的任务与轨迹生成，产生可解决且可验证的任务以及多步轨迹，对难度、长度、轨迹组成和领域焦点进行控制以保证灵活性。作为具体实例，我们发布了包含6800个环境和100个领域中的73883个经过验证的工具、79925个可验证任务的数据集，以及大规模生成轨迹的管道。在这些任务生成的轨迹语料库上训练不同规模的Qwen3模型，在多个工具使用基准测试（包括真实API）上取得了提升，表明在合成数据上训练的工具使用能力可能迁移到某些真实环境。这些结果共同表明，SynthTools可以作为大规模训练工具使用智能体的有用基础设施。

英文摘要

For agentic systems to use external tools to solve complex, long-horizon tasks, we need a large set of diverse and controllable tool-use environments. We introduce SynthTools, a fully LLM-based pipeline spanning the entire lifecycle: environment generation, simulation, validation and task construction. By operating end-to-end through LLMs, our framework complements other tool-use environments bottlenecked by the complexity of real APIs, and ensures scalability and controllability by design. The framework consists of three components: top-down environment generation, which hierarchically constructs diverse, domain-grounded tool environments; environment simulation and validation, which ensures tools can be reliably emulated and filters out those that cannot; and bottom-up task and trajectory generation, which produces solvable and verifiable tasks together with multi-step trajectories, exposing control over difficulty, length, trajectory composition, and domain focus to guarantee flexibility. As a concrete instantiation, we release the dataset comprising $73{,}883$ validated tools across $6{,}800$ environments and $100$ fields, $79{,}925$ verifiable tasks as well as the pipeline to generate trajectories at scale. Training Qwen3 models of various sizes on a corpus of trajectories generated from these tasks yields gains across multiple tool-use benchmarks, including real APIs, indicating tool-use capabilities trained on synthetic data may transfer to some real environments. Together, these results suggest that SynthTools can serve as a useful infrastructure for large-scale training of tool-use agents.


2510.21890 2026-05-28 cs.LG cs.AI cs.GR 版本更新

The Principles of Diffusion Models

扩散模型的原理

Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, Stefano Ermon

发表机构 * MIT Press（MIT出版社）； Sony AI（索尼人工智能）； OpenAI； Stanford University（斯坦福大学）； Sony Group Corporation（索尼集团）

AI总结本文从变分、基于分数和基于流三种视角统一阐述扩散模型的数学原理，并讨论可控生成、高效求解器和流映射模型等扩展。

Comments Supplementary materials for the book are available at the book website: https://the-principles-of-diffusion-models.github.io/


AI中文摘要

本书介绍了指导扩散模型发展的核心原理，追溯其起源，并展示不同公式如何源于共同的数学思想。扩散建模首先定义一个前向过程，该过程逐渐将数据破坏为噪声，通过一系列中间分布将数据分布与简单先验联系起来。目标是学习一个反向过程，将噪声转换回数据，同时恢复相同的中间分布。我们描述了三种互补的观点。受变分自编码器启发的变分观点将扩散视为逐步学习去噪。基于分数的观点植根于基于能量的建模，学习演化数据分布的梯度，指示如何将样本推向更可能的区域。基于流的观点与归一化流相关，将生成视为遵循一条平滑路径，在学习的速度场下将样本从噪声移动到数据。这些视角共享一个共同的主干：一个时间相关的速度场，其流将简单先验传输到数据。采样相当于求解一个微分方程，该方程沿着连续轨迹将噪声演化为数据。在此基础之上，本书讨论了可控生成的引导、高效数值求解器以及扩散驱动的流映射模型（学习任意时间之间的直接映射）。它为具有基本深度学习知识的读者提供了扩散模型的概念性和数学基础理解。

英文摘要

This book presents the core principles that have guided the development of diffusion models, tracing their origins and showing how diverse formulations arise from shared mathematical ideas. Diffusion modeling starts by defining a forward process that gradually corrupts data into noise, linking the data distribution to a simple prior through a continuum of intermediate distributions. The goal is to learn a reverse process that transforms noise back into data while recovering the same intermediates. We describe three complementary views. The variational view, inspired by variational autoencoders, sees diffusion as learning to remove noise step by step. The score-based view, rooted in energy-based modeling, learns the gradient of the evolving data distribution, indicating how to nudge samples toward more likely regions. The flow-based view, related to normalizing flows, treats generation as following a smooth path that moves samples from noise to data under a learned velocity field. These perspectives share a common backbone: a time-dependent velocity field whose flow transports a simple prior to the data. Sampling then amounts to solving a differential equation that evolves noise into data along a continuous trajectory. On this foundation, the book discusses guidance for controllable generation, efficient numerical solvers, and diffusion-motivated flow-map models that learn direct mappings between arbitrary times. It provides a conceptual and mathematically grounded understanding of diffusion models for readers with basic deep-learning knowledge.
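The statement that sampling "amounts to solving a differential equation" can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: `velocity` is a hand-written toy field (straight-line transport toward a single `target` value), standing in for the learned, time-dependent velocity field the book describes, and `euler_sample` is a plain fixed-step Euler integrator rather than any solver from the book.

```python
def velocity(x: float, t: float, target: float) -> float:
    """Toy stand-in for a learned velocity field (hypothetical).

    Moves x along a straight line so that it reaches `target` at t = 1.
    """
    return (target - x) / (1.0 - t)


def euler_sample(x0: float, target: float, steps: int = 10) -> float:
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        x += dt * velocity(x, t, target)
    return x


# Start from a "noise" value and transport it to the "data" value.
sample = euler_sample(x0=-1.3, target=2.0)
print(sample)
```

In a real model the velocity field is a neural network evaluated on high-dimensional samples, and the choice of solver and step count trades accuracy against compute.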


2510.11170 2026-05-28 cs.LG cs.AI cs.CL 版本更新

EAGer: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling

EAGer: 基于熵感知的自适应推理时缩放生成方法

Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet Üstün

发表机构 * University of Groningen（格罗宁根大学）； University of Milan - Bicocca（米兰-比科卡大学）； Cohere Labs（Cohere实验室）

AI总结提出一种无需训练的生成方法EAGer，利用逐词熵分布动态分配计算资源，在复杂推理任务中提升性能并减少冗余计算。


AI中文摘要

SelfJudge: 通过自监督验证器加速推测解码

Kanghoon Yoon, Minsub Kim, Sungjae Lee, Joonhyung Lee, Sunghyeon Woo, Yeonjun In, Se Jung Kwon, Chanyoung Park, Dongsoo Lee

关键词 * Efficient AI（高效人工智能）； Large Language Model（大型语言模型）； Speculative Decoding（推测解码）

AI总结提出SelfJudge方法，利用目标模型的自监督训练验证器，通过评估令牌替换后响应的语义保持性来加速推测解码，实现更优的推理-准确率权衡。


Journal ref: ICML 2026

AI中文摘要

推测解码通过验证来自草稿模型的候选令牌与较大目标模型的匹配来加速LLM推理。最近的验证解码通过放宽验证标准，接受可能与目标模型输出存在微小差异的草稿令牌来加速这一过程，但现有方法受限于依赖人工标注或具有可验证真实结果的任务，限制了其在多样化NLP任务中的泛化能力。我们提出SelfJudge，通过目标模型的自监督训练验证器。我们的方法通过评估令牌替换后的响应是否保持原始响应的意义来衡量语义保持性，从而实现在多样化NLP任务中的自动验证器训练。实验表明，SelfJudge在推理-准确率权衡上优于验证解码基线，为更快的LLM推理提供了广泛适用的解决方案。

英文摘要

Speculative decoding accelerates LLM inference by verifying candidate tokens from a draft model against a larger target model. Recent judge decoding boosts this process by relaxing verification criteria by accepting draft tokens that may exhibit minor discrepancies from target model output, but existing methods are restricted by their reliance on human annotations or tasks with verifiable ground truths, limiting generalizability across diverse NLP tasks. We propose SelfJudge, which trains judge verifiers via self-supervision of the target model. Our method measures semantic preservation by assessing whether token-substituted responses preserve the meaning of original responses, enabling automatic verifier training across diverse NLP tasks. Our experiments show SelfJudge achieves superior inference-accuracy trade-offs than judge decoding baselines, offering a broadly applicable solution for faster LLM inference.
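The difference between strict verification and relaxed "judge" verification can be sketched as follows. This is a toy illustration under stated assumptions, not the SelfJudge method: `verify_draft` and `toy_judge` are hypothetical names, tokens are plain strings, `target[i]` is assumed to be the token the target model would pick given the draft prefix `draft[:i]`, and the hand-written synonym check stands in for a trained judge verifier.

```python
from typing import Callable, List, Optional


def verify_draft(
    draft: List[str],
    target: List[str],
    judge: Optional[Callable[[str, str], bool]] = None,
) -> List[str]:
    """Return the tokens kept after checking a draft against the target model.

    Strict mode (judge=None) keeps draft tokens only while they match exactly.
    Relaxed mode also keeps a mismatching draft token if the judge accepts it.
    At the first rejection, the target's token is emitted and checking stops.
    """
    kept: List[str] = []
    for d, t in zip(draft, target):
        if d == t or (judge is not None and judge(d, t)):
            kept.append(d)
        else:
            kept.append(t)
            break
    return kept


def toy_judge(draft_token: str, target_token: str) -> bool:
    """Hypothetical judge: accept tokens from a tiny hand-written synonym set."""
    return {draft_token, target_token} <= {"film", "movie"}


draft = ["The", "film", "was", "great", "."]
target = ["The", "movie", "was", "great", "."]

print(verify_draft(draft, target))                   # strict verification
print(verify_draft(draft, target, judge=toy_judge))  # relaxed verification
```

Strict verification stops at the second token, while the relaxed variant keeps the whole draft, which is where the speedup of judge-style decoding comes from.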


2510.01724 2026-05-28 cs.AI 版本更新

MetaboT: An LLM-based Multi-Agent Framework for Interactive Analysis of Mass Spectrometry Metabolomics Knowledge Graphs

MetaboT：基于LLM的多智能体框架，用于质谱代谢组学知识图谱的交互式分析

Madina Bekbergenova, Lucas Pradi, Benjamin Navet, Emma Tysinger, Franck Michel, Matthieu Feraud, Yousouf Taghzouti, Yan Zhou Chen, Olivier Kirchhoffer, Florence Mehl, Martin Legrand, Tao Jiang, Marco Pagni, Soha Hassoun, Jean-Luc Wolfender, Wout Bittremieux, Fabien Gandon, Louis-Félix Nothias

发表机构 * Department of Computer Science, University of Antwerp（安特卫普大学计算机科学系）； Massachusetts Institute of Technology（麻省理工学院）； Department of Computer Science, Tufts University（塔夫茨大学计算机科学系）； Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland（瑞士生物信息学研究所（SIB），洛桑，瑞士）； Department of Chemical and Biological Engineering, Tufts University（塔夫茨大学化学与生物工程系）

AI总结提出MetaboT，一个基于大语言模型的多智能体框架，通过模块化架构将自然语言问题转化为SPARQL查询，降低代谢组学知识图谱的使用门槛。


Journal ref: 33rd annual international conference on Intelligent Systems for Molecular Biology (ISMB 2025) / 24th Annual Conference of the European Conference on Computational Biology (ECCB 2025), Jul 2025, Liverpool, United Kingdom

AI中文摘要

基于质谱的代谢组学产生复杂的高维数据，这些数据蕴含着巨大的生物学发现潜力，但仍难以整合和解释。知识图谱通过将光谱、注释、分类群、化学类别和生物活性表示为单一可互操作的网络来统一这些异构信息；然而，它们的实际应用受到相应专业表示和查询语言陡峭学习曲线的限制。在此，我们介绍MetaboT，一个开源的多智能体大语言模型框架，它将自然语言问题转化为可执行的SPARQL查询，用于代谢组学知识图谱。MetaboT通过模块化架构减轻了单一模型方法的幻觉和模式合规性限制，其中专门的智能体处理范围验证、针对权威资源的实体解析、模式感知查询生成、迭代细化和结果解释。我们在实验性天然产物知识图谱上验证了MetaboT，使用专家编写的自然语言问题基准及其参考SPARQL查询，并展示了其回答关于植物-代谢物关系和生物活性的复杂问题的能力。MetaboT降低了代谢组学研究者的技术门槛，无需专门编程专业知识即可实现语义数据挖掘。

英文摘要

Mass spectrometry-based metabolomics generates complex, high-dimensional data that holds vast potential for biological discovery but remains difficult to integrate and interpret. Knowledge graphs (KGs) unify this heterogeneous information by representing spectra, annotations, taxa, chemical classes, and biological activities as a single interoperable network; however, their practical use is limited by the steep learning curve of corresponding specialized representation and query languages. Here we introduce MetaboT, an open-source multi-agent Large Language Model (LLM) framework that translates natural-language questions into executable SPARQL queries over metabolomics knowledge graphs. MetaboT mitigates the hallucination and schema-compliance limitations of single-model approaches through a modular architecture in which specialised agents handle scope validation, entity resolution against authoritative resources, schema-aware query generation, iterative refinement, and result interpretation. We validated MetaboT on the Experimental Natural Products Knowledge Graph (ENPKG), using an expert-authored benchmark of natural-language questions paired with reference SPARQL queries, and demonstrate its ability to answer complex questions about plant--metabolite relationships and biological activities. MetaboT lowers the technical barrier for metabolomics researchers and enables semantic data mining without specialised programming expertise.


2503.11906 2026-05-28 cs.CV cs.AI 版本更新

A Survey on SAR ship classification using Deep Learning

基于深度学习的SAR船舶分类综述

Ch Muhammad Awais, Marco Reggiannini, Davide Moroni, Emanuele Salerno

发表机构 * PhD School in Computer Science, University of Pisa（计算机科学博士学院，比萨大学）； Institute of Information Science and Technologies, National Research Council of Italy（意大利国家研究委员会信息科学与技术研究所）； National Biodiversity Future Center - NBFC（国家生物多样性未来中心 - NBFC）

AI总结本文综述了深度学习在SAR船舶分类中的应用，建立了基于模型、手工特征、SAR属性利用和微调影响的分类法，并讨论了未来研究方向。

Comments in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2026


DOI: 10.1109/JSTARS.2026.3695704

AI中文摘要

深度学习（DL）已成为合成孔径雷达（SAR）船舶分类的强大工具。本综述全面分析了该领域使用的各种DL技术。我们识别了关键趋势和挑战，强调了整合手工特征、利用公共数据集、数据增强、微调、可解释性技术以及促进跨学科合作以提高DL模型性能的重要性。本综述建立了首个基于DL模型、手工特征使用、SAR属性利用和微调影响的分类法，用于对相关研究进行分类。我们讨论了SAR船舶分类任务中使用的方法论以及不同技术的影响。最后，本综述探讨了未来研究的潜在方向，包括解决数据稀缺问题、探索新型DL架构、融入可解释性技术以及建立标准化性能指标。通过应对这些挑战并利用DL的进步，研究人员可以为开发更准确和高效的船舶分类系统做出贡献，最终增强海上监视及相关应用。

英文摘要

Deep learning (DL) has emerged as a powerful tool for Synthetic Aperture Radar (SAR) ship classification. This survey comprehensively analyzes the diverse DL techniques employed in this domain. We identify critical trends and challenges, highlighting the importance of integrating handcrafted features, utilizing public datasets, data augmentation, fine-tuning, explainability techniques, and fostering interdisciplinary collaborations to improve DL model performance. This survey establishes a first-of-its-kind taxonomy for categorizing relevant research based on DL models, handcrafted feature use, SAR attribute utilization, and the impact of fine-tuning. We discuss the methodologies used in SAR ship classification tasks and the impact of different techniques. Finally, the survey explores potential avenues for future research, including addressing data scarcity, exploring novel DL architectures, incorporating interpretability techniques, and establishing standardized performance metrics. By addressing these challenges and leveraging advancements in DL, researchers can contribute to developing more accurate and efficient ship classification systems, ultimately enhancing maritime surveillance and related applications.


2509.21128 2026-05-28 cs.AI 版本更新

RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs

RL 压缩，SFT 扩展：推理型大语言模型的比较研究

Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo（东京大学）

AI总结本文通过轨迹级和步骤级分析框架，比较了强化学习（RL）和监督微调（SFT）对数学推理大语言模型推理路径的影响，发现RL压缩错误轨迹并集中推理功能，而SFT扩展正确轨迹并均匀化推理功能。

Comments Accepted at ICLR 2026


AI中文摘要

大型语言模型（LLMs）通常通过带有可验证奖励的强化学习（RLVR）和监督微调（SFT）在推理轨迹上进行训练，以提高其推理能力。然而，这些方法如何塑造推理能力仍然难以捉摸。本文超越基于准确性的研究，引入了一个新颖的分析框架，量化推理路径并捕捉每个训练过程（在数学领域使用1.5B、7B和14B参数的模型）下的定性变化。具体来说，我们在两个粒度级别上研究推理过程：轨迹级别，检查完整的推理输出；步骤级别，分析推理图，其节点对应单个推理步骤。值得注意的是，对独特推理轨迹的聚类显示了互补效应：RL压缩了不正确的轨迹，而SFT扩展了正确的轨迹。步骤级别分析表明，RL使推理图中节点访问频率、度和介数中心性分布的衰减率变陡（约2.5倍），而SFT使其变平（减少到约三分之一）。这表明RL将推理功能集中到一小部分步骤中，而SFT则将其均匀化到许多步骤中。此外，通过从多个角度评估推理图拓扑，我们描绘了RL和SFT的共同和独特特征。我们的工作提出了一种新颖的推理路径视角，解释了为什么当前最佳实践的两阶段训练（先SFT后RL）是成功的，并为数据构建和更高效的学习方法提供了实际启示。

英文摘要

Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.


2509.15848 2026-05-28 cs.AI 版本更新

A Comparative Study of Rule-Based and Data-Driven Approaches in Industrial Monitoring

基于规则与数据驱动方法在工业监控中的比较研究

Giovanni De Gasperis, Sante Dino Facchini

发表机构 * Università degli Studi dell’Aquila（阿奎拉大学）

AI总结本研究比较了工业监控中基于规则与数据驱动两种方法，分析了各自的优缺点，并提出混合方案以结合两者优势。

Comments This chapter has been published in Advancements in AI From Foundations to Cross-Disciplinary Applications, Springer, 2026


AI中文摘要

工业监控系统，尤其是在工业4.0环境中部署时，正经历从传统基于规则的架构向利用机器学习和人工智能的数据驱动方法的范式转变。本研究对这两种方法进行了比较，分析了它们各自的优势、局限性和应用场景，并提出了一个基本框架来评估它们的关键特性。基于规则的系统具有高可解释性、确定性行为以及在稳定环境中易于实现的特点，使其成为受监管行业和安全关键应用的理想选择。然而，它们在复杂或不断变化的环境中面临可扩展性、适应性和性能方面的挑战。相反，数据驱动系统在检测隐藏异常、实现预测性维护和动态适应新条件方面表现出色。尽管这些模型具有高精度，但它们面临数据可用性、可解释性和集成复杂性方面的挑战。本文提出混合解决方案作为一个有前景的方向，结合了基于规则逻辑的透明性与机器学习的分析能力。我们的假设是，工业监控的未来在于智能、协同的系统，这些系统利用专家知识和数据驱动的洞察力。这种双重方法增强了韧性、运营效率和信任，为更智能、更灵活的工业环境铺平了道路。

英文摘要

Industrial monitoring systems, especially when deployed in Industry 4.0 environments, are experiencing a shift in paradigm from traditional rule-based architectures to data-driven approaches leveraging machine learning and artificial intelligence. This study presents a comparison between these two methodologies, analyzing their respective strengths, limitations, and application scenarios, and proposes a basic framework to evaluate their key properties. Rule-based systems offer high interpretability, deterministic behavior, and ease of implementation in stable environments, making them ideal for regulated industries and safety-critical applications. However, they face challenges with scalability, adaptability, and performance in complex or evolving contexts. Conversely, data-driven systems excel in detecting hidden anomalies, enabling predictive maintenance and dynamic adaptation to new conditions. Despite their high accuracy, these models face challenges related to data availability, explainability, and integration complexity. The paper suggests hybrid solutions as a possible promising direction, combining the transparency of rule-based logic with the analytical power of machine learning. Our hypothesis is that the future of industrial monitoring lies in intelligent, synergic systems that leverage both expert knowledge and data-driven insights. This dual approach enhances resilience, operational efficiency, and trust, paving the way for smarter and more flexible industrial environments.


2509.04192 2026-05-28 cs.AI cs.LO math.LO 版本更新

Domain size asymptotics for Markov logic networks

马尔可夫逻辑网络的域大小渐近性

Vera Koponen

发表机构 * Department of Mathematics, Uppsala University, Sweden（瑞典乌普萨拉大学数学系）

AI总结研究马尔可夫逻辑网络在域大小趋于无穷时概率分布的性质，通过一元关系语言的几乎完全刻画，展示了其与均匀分布及提升贝叶斯网络的本质差异。

Comments Version 2 is a major revision of version 1


AI中文摘要

一个马尔可夫逻辑网络（MLN）$\mathbb{M}$ 在域为 $\{1, \ldots, n\}$ 的结构集 $\mathbf{W}_n$（即“可能世界”）上确定了一个概率分布 $\mathbb{P}_n^\mathbb{M}$。我们研究当 $n$ 趋于无穷时这些分布的性质。我们证明，在温和假设下，对于具有一个任意正权重的软约束的 MLN $\mathbb{M}$，对所有足够大的 $n$，分布 $\mathbb{P}_n^\mathbb{M}$ 的行为与 $\mathbf{W}_n$ 上的均匀分布 $\mathbb{P}_n^{uni}$ 截然不同。对于仅有一个一元关系符号 $R$ 的语言，我们给出了当 $n \to \infty$ 时 $\mathbb{P}_n^\mathbb{M}$ 可能渐近行为的几乎完全刻画，其中 $\mathbb{M}$ 可以是该语言的任意 MLN。渐近行为取决于 MLN 的软约束和权重。该刻画用于证明：如果所考虑的语言至少包含一个一元关系符号，则以下结论成立：(a) 存在一个 MLN $\mathbb{M}$，使得对每个提升贝叶斯网络（LBN）$\mathbb{G}$，存在无穷多个 $n$ 使得 $\mathbb{M}$ 和 $\mathbb{G}$ 在 $\mathbf{W}_n$ 上确定不同的分布。(b) 存在一个 LBN $\mathbb{G}$，使得对每个 MLN $\mathbb{M}$，存在无穷多个 $n$ 使得 $\mathbb{G}$ 和 $\mathbb{M}$ 在 $\mathbf{W}_n$ 上确定不同的分布。我们还证明，在极限情况下，权重维度和域大小维度的行为可能完全不同。

英文摘要

A Markov logic network (MLN) $\mathbb{M}$ determines a probability distribution $\mathbb{P}_n^\mathbb{M}$ on the set $\mathbf{W}_n$ of structures, or ``possible worlds'', with domain $\{1, \ldots, n\}$. We study the properties of such distributions as $n$ tends to infinity. We show that with mild assumptions on an MLN $\mathbb{M}$ with one soft constraint with an arbitrary positive weight the distribution $\mathbb{P}_n^\mathbb{M}$ will behave quite differently from the uniform distribution $\mathbb{P}_n^{uni}$ on $\mathbf{W}_n$ for all sufficiently large $n$. For a language with only one relation symbol $R$ which has arity 1 we give an almost complete characterization of the possible asymptotic behaviours of $\mathbb{P}_n^\mathbb{M}$ as $n \to \infty$, where $\mathbb{M}$ may be any MLN for this language. The asymptotic behaviour depends on the soft constraints and weights of the MLN. This characterization is used to show that if the language under consideration contains at least one relation symbol of arity 1 then the following holds: (a) There is an MLN $\mathbb{M}$ such that for every lifted Bayesian network (LBN) $\mathbb{G}$ there are infinitely many $n$ such that $\mathbb{M}$ and $\mathbb{G}$ determine different distributions on $\mathbf{W}_n$. (b) There is an LBN $\mathbb{G}$ such that for every MLN $\mathbb{M}$ there are infinitely many $n$ such that $\mathbb{G}$ and $\mathbb{M}$ determine different distributions on $\mathbf{W}_n$. We also show that, in the limit, the weight dimension and the domain size dimension may behave completely differently.
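For orientation, the distribution $\mathbb{P}_n^\mathbb{M}$ is usually written in the following standard log-linear form. This is the textbook MLN definition, stated here as an assumption; the paper's exact formulation may differ in details. The symbols $\varphi_i$, $w_i$, $N_i$, $k$ and $Z_n$ are notation introduced for this sketch: $\mathbb{M}$ consists of soft constraints $(\varphi_1, w_1), \ldots, (\varphi_k, w_k)$, and $N_i(\mathcal{A})$ counts the tuples of elements of $\{1, \ldots, n\}$ that satisfy $\varphi_i$ in the structure $\mathcal{A}$.

$$
\mathbb{P}_n^{\mathbb{M}}(\mathcal{A}) = \frac{1}{Z_n} \exp\Big( \sum_{i=1}^{k} w_i \, N_i(\mathcal{A}) \Big),
\qquad
Z_n = \sum_{\mathcal{B} \in \mathbf{W}_n} \exp\Big( \sum_{i=1}^{k} w_i \, N_i(\mathcal{B}) \Big)
$$

Under this form, setting every weight $w_i$ to $0$ gives the uniform distribution $\mathbb{P}_n^{uni}$ on $\mathbf{W}_n$, which is the baseline the abstract compares against.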


2507.13725 2026-05-28 cs.IR cs.AI 版本更新

Point of Interest Recommendation: Pitfalls and Viable Solutions

兴趣点推荐：陷阱与可行解决方案

Alejandro Bellogín, Linus W. Dietz, Francesco Ricci, Pablo Sánchez

发表机构 * King’s College London（伦敦国王学院）； Free University of Bozen-Bolzano（博尔扎诺自由大学）

AI总结本文批判性评估兴趣点推荐研究现状，指出数据集、算法和评估方法三方面的关键缺陷，并提出包含多利益相关者设计、上下文感知等方向的研究议程。


DOI: 10.1145/3816430
Journal ref: ACM Transactions on Recommender Systems 2026

AI中文摘要

兴趣点（POI）推荐通过建议上下文相关且匹配偏好的地点和活动（如餐厅、地标、行程和文化景点），在丰富游客体验方面可发挥关键作用。与一些更常见的推荐领域（如音乐和视频）不同，POI推荐本质上具有高风险：用户投入大量时间、金钱和精力来搜索、选择和消费这些建议的POI。尽管该领域已有大量研究工作，但几个基本问题仍未解决，阻碍了所提出方法的实际应用。在本文中，我们讨论了POI推荐问题的当前状态以及我们识别的主要挑战。本文的第一个贡献是对POI推荐研究现状的批判性评估，并识别了三个主要维度（数据集、算法和评估方法）的关键缺陷。我们强调了持续存在的问题，例如缺乏标准化基准数据集、问题定义和模型设计中的有缺陷假设，以及对用户行为和系统性能中偏差的不当处理。第二个贡献是一个结构化的研究议程，从识别的问题出发，引入了与多利益相关者设计、上下文感知、数据收集、可信度、新颖交互和实际评估相关的未来工作的重要方向。

英文摘要

Point of interest (POI) recommendation can play a pivotal role in enriching tourists' experiences by suggesting context-dependent and preference-matching locations and activities, such as restaurants, landmarks, itineraries, and cultural attractions. Unlike some more common recommendation domains (e.g., music and video), POI recommendation is inherently high-stakes: users invest significant time, money, and effort to search, choose, and consume these suggested POIs. Despite the numerous research works in the area, several fundamental issues remain unresolved, hindering the real-world applicability of the proposed approaches. In this paper, we discuss the current status of the POI recommendation problem and the main challenges we have identified. The first contribution of this paper is a critical assessment of the current state of POI recommendation research and the identification of key shortcomings across three main dimensions: datasets, algorithms, and evaluation methodologies. We highlight persistent issues such as the lack of standardized benchmark datasets, flawed assumptions in the problem definition and model design, and inadequate treatment of biases in the user behavior and system performance. The second contribution is a structured research agenda that, starting from the identified issues, introduces important directions for future work related to multistakeholder design, context awareness, data collection, trustworthiness, novel interactions, and real-world evaluation.


2507.08014 2026-05-28 cs.CL cs.AI cs.CY 版本更新

Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking

大规模真实对话分析揭示LLM越狱的复杂性界限

Aldan Creo, Raul Castro Fernandez, Manuel Cebrian

发表机构 * Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Valencia, Spain（瓦伦西亚人工智能研究所，瓦伦西亚理工大学，西班牙瓦伦西亚）； Department of Computer Science, The University of Chicago, Chicago, USA（芝加哥大学计算机科学系，美国芝加哥）； Center for Automation and Robotics, Spanish National Research Council, Madrid, Spain（西班牙国家研究委员会自动化与机器人中心，西班牙马德里）

AI总结通过分析超过200万条真实对话，发现越狱尝试的复杂性并不显著高于正常对话，且攻击复杂性随时间保持稳定，表明LLM安全演化受人类创造力限制。

Comments Code: https://github.com/ACMCMC/risky-conversations Results: https://huggingface.co/risky-conversations Visualizer: https://huggingface.co/spaces/risky-conversations/Visualizer


DOI: 10.1007/978-3-032-11402-0_5

AI中文摘要

随着大型语言模型（LLM）的日益部署，理解越狱策略的复杂性和演变对于AI安全至关重要。我们对来自不同平台（包括专门的越狱社区和通用聊天机器人）的超过200万条真实对话进行了大规模实证分析，研究了越狱复杂性。使用一系列复杂性指标，涵盖概率度量、词汇多样性、压缩比和认知负荷指标，我们发现越狱尝试并未表现出显著高于正常对话的复杂性。这一模式在专门的越狱社区和普通用户群体中一致成立，表明攻击的复杂性存在实际界限。时间分析显示，虽然用户攻击的毒性和复杂性随时间保持稳定，但助手响应的毒性有所下降，表明安全机制正在改进。复杂性分布中缺乏幂律标度进一步指出了越狱发展的自然限制。我们的发现挑战了攻击者与防御者之间军备竞赛不断升级的主流说法，反而表明LLM安全演化受人类创造力限制，而防御措施持续进步。我们的结果突显了学术越狱披露中的关键信息危害，因为超出当前复杂性基线的复杂攻击可能破坏观察到的平衡，并在防御适应之前造成广泛伤害。

英文摘要

As large language models (LLMs) become increasingly deployed, understanding the complexity and evolution of jailbreaking strategies is critical for AI safety. We present a mass-scale empirical analysis of jailbreak complexity across over 2 million real-world conversations from diverse platforms, including dedicated jailbreaking communities and general-purpose chatbots. Using a range of complexity metrics spanning probabilistic measures, lexical diversity, compression ratios, and cognitive load indicators, we find that jailbreak attempts do not exhibit significantly higher complexity than normal conversations. This pattern holds consistently across specialized jailbreaking communities and general user populations, suggesting practical bounds on attack sophistication. Temporal analysis reveals that while user attack toxicity and complexity remain stable over time, assistant response toxicity has decreased, indicating improving safety mechanisms. The absence of power-law scaling in complexity distributions further points to natural limits on jailbreak development. Our findings challenge the prevailing narrative of an escalating arms race between attackers and defenders, instead suggesting that LLM safety evolution is bounded by human ingenuity constraints while defensive measures continue advancing. Our results highlight critical information hazards in academic jailbreak disclosure, as sophisticated attacks exceeding current complexity baselines could disrupt the observed equilibrium and enable widespread harm before defensive adaptation.
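Two of the metric families named above, compression ratio and lexical diversity, can be sketched in a few lines. This is an illustrative sketch only: the function names, the use of `zlib`, whitespace tokenisation, and the example strings are all assumptions, and the paper's actual metric definitions are in the linked code repository, not reproduced here.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size in bytes; lower means more redundant text."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)


def type_token_ratio(text: str) -> float:
    """Distinct whitespace-separated tokens divided by total tokens (lexical diversity)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical example prompts: one highly repetitive, one varied.
repetitive = "ignore previous instructions " * 20
varied = "Please summarize the attached quarterly report and list three open risks."

for name, prompt in [("repetitive", repetitive), ("varied", varied)]:
    print(name, round(compression_ratio(prompt), 3), round(type_token_ratio(prompt), 3))
```

Note that compression ratios of very short strings are dominated by the compressor's fixed overhead, so comparisons are more meaningful across texts of similar length.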


2506.08311 2026-05-28 cs.SE cs.AI 版本更新

Understanding Automated Program Repair Agents Through the Lens of Traceability: An Empirical Study

通过可追溯性视角理解自动化程序修复智能体：一项实证研究

Ira Ceka, Hailie Mitchell, Saurabh Pujar, Luca Buratti, Shyam Ramji, Junfeng Yang, Gail Kaiser, Baishakhi Ray

发表机构 * Columbia University（哥伦比亚大学）； IBM Research（IBM研究院）

AI总结本文通过追踪五个最先进的自动化程序修复智能体在500个真实世界修复任务中的决策流程，揭示了它们在逻辑密集型错误修复、测试生成和回归测试选择方面的关键局限性，并提出了改进方向。

Comments Accepted for publication (ISSTA '26)


AI中文摘要

自动化程序修复（APR）智能体利用大型语言模型（LLMs）通过推理、规划和工具使用来自主诊断和修复软件缺陷。尽管在SWE-bench等基准测试上取得了令人印象深刻的排行榜成绩，但人们对这些智能体如何采取行动、在何处失败以及它们的行为与人类开发者相比如何知之甚少。本文首次对五个最先进的APR智能体在500个真实世界修复任务中进行了系统分析，追踪了它们从问题描述到补丁验证的完整决策流程。我们的研究揭示，虽然智能体擅长简单修复，但在逻辑密集型错误上表现挣扎，常常生成冗长或过拟合的补丁，这些补丁仅能满足现有测试。我们发现测试生成和回归测试选择仍然是主要瓶颈，智能体经常无法重现问题或运行相关的回归测试。此外，大多数智能体使用原始工具（如bash脚本），缺乏调试器或程序分析器的访问权限，这限制了它们的推理能力和补丁质量。这些发现突出了当前APR系统的关键局限性，并促使采用左移方法——强调早期高质量的测试生成和验证——以减少虚假修复并提高语义正确性。我们进一步概述了下一代APR设计的具体方向：（1）更丰富且更集成的工具生态系统，（2）结合互补优势的多样化智能体架构，以及（3）优先考虑语义修复质量和测试生成保真度而非表面成功指标的基准测试。

英文摘要

Automated Program Repair (APR) agents leverage Large Language Models (LLMs) to autonomously diagnose and fix software bugs through reasoning, planning, and tool use. Despite impressive leaderboard gains on benchmarks such as SWE-bench, little is understood about how these agents take actions, where they fail, and how their behavior compares to that of human developers. This paper presents the first systematic analysis of five state-of-the-art APR agents across 500 real-world repair tasks, tracing their full decision-making pipelines -- from issue description to patch validation. Our study reveals that while agents excel at simple fixes, they struggle with logic-intensive bugs, often producing verbose or overfitted patches that merely satisfy existing tests. We find that test generation and regression test selection remain major bottlenecks, with agents frequently failing to reproduce issues or run relevant regression tests. Moreover, most agents operate with primitive tooling (e.g., bash scripts) and lack access to debuggers or program analyzers, which constrains their reasoning and patch quality. These findings highlight key limitations in current APR systems and motivate a shift-left approach -- emphasizing early, high-quality test generation and validation -- to reduce spurious fixes and improve semantic correctness. We further outline concrete directions for next-generation APR design: (1) richer and more integrated tool ecosystems, (2) diversified agentic architectures that combine complementary strengths, and (3) benchmarks that prioritize semantic repair quality and test generation fidelity over surface-level success metrics.

URL PDF HTML ☆

赞 0 踩 0

2505.21771 2026-05-28 cs.CV cs.AI 版本更新

MMTABREAL: Real-World Benchmark for Multimodal Table Understanding

MMTABREAL：多模态表格理解的真实世界基准

Prasham Titiya, Jainil Trivedi, Chitta Baral, Vivek Gupta

发表机构 * Arizona State University（亚利桑那州立大学）

AI总结针对多模态表格理解，构建了包含500个真实表格和4021个问答对的人工筛选基准MMTABREAL，评估发现现有模型在视觉定位、空间对齐和多步推理上存在20-40%的性能差距。


AI中文摘要

多模态表格，即与图表、地图、图标和颜色编码交织的表格布局，在实际应用中无处不在，但对多模态大语言模型（MLLMs）来说仍然困难。尽管在文本和图像理解方面取得了进展，但对以表格为中心的多模态推理的系统评估仍然有限。我们引入了MMTABREAL，一个多模态表格基准，包含人工筛选的500个真实世界表格及其对应的4021个问答对。MMTABREAL涵盖四种问题类型、五种推理类别和八种结构原型。对最先进模型的评估揭示了显著差距，特别是在视觉定位、空间对齐和多步推理方面，相对于现有基准性能下降了20-40%。这些结果强调了需要更紧密融合视觉与表格结构并支持显式数值/逻辑运算的架构。MMTABREAL仅用于评估，提供了一个严谨、可复现的测试平台，反映了真实世界多模态表格的语言、结构和推理复杂性。

英文摘要

Multimodal tables, i.e., tabular layouts interleaved with charts, maps, icons, and color encodings, are ubiquitous in real applications yet remain difficult for Multimodal Large Language Models (MLLMs). Despite advances in text and image understanding, systematic evaluation of table-centric multimodal reasoning is limited. We introduce MMTABREAL, a MultiModal Table Benchmark: a human-curated suite of 500 real-world tables paired with 4,021 question-answer pairs. MMTABREAL spans four question types, five reasoning categories, and eight structural archetypes. Evaluations of state-of-the-art models reveal substantial gaps, especially in visual grounding, spatial alignment, and multi-step inference, with 20-40% performance drops relative to existing benchmarks. These results highlight the need for architectures that more tightly fuse vision with tabular structure and support explicit numeric/logical operations. MMTABREAL is released for evaluation only, providing a rigorous, reproducible testbed that reflects the linguistic, structural, and reasoning complexity of real-world multimodal tables.

URL PDF HTML ☆

赞 0 踩 0

2502.05242 2026-05-28 cs.CL cs.AI cs.CV cs.LG 版本更新

Beyond External Monitors: Enhancing Transparency of Large Language Models for Easier Monitoring

超越外部监控：增强大型语言模型的透明度以便于监控

Guanxu Chen, Jing Shao, Tao Luo, Lijie Hu, Qihao Lin, Dongrui Liu

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； ICISEE, Shanghai Jiao Tong University（上海交通大学ICISEE）； School of Mathematical Sciences, Institute of Natural Sciences, MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University（上海交通大学数学科学学院）； King Abdullah University of Science and Technology（阿卜杜拉国王科技大学）

AI总结提出TELLME方法，通过改进大型语言模型的内部表征透明度，帮助监控者识别不当和敏感行为，并在去毒化任务中验证其有效性。

Comments 28 pages, 8 figures, 15 tables


AI中文摘要

大型语言模型（LLMs）的能力日益增强，但其思维和决策过程的机制仍不清楚。思维链（CoTs）常被用来外化LLMs的思维，但这一策略未能准确反映LLMs的思维过程。基于LLMs隐藏表征的技术提供了内部视角，以改善对其潜在思维的可监控性。然而，以往的方法仅尝试开发外部模块，而非使LLMs本身更易于监控。本文提出了一种新方法TELLME，提高了LLMs的透明度，并帮助监控者识别不合适和敏感的行为。此外，我们在去毒化任务上展示了TELLME的有效性，LLMs在多模态测试集、不同架构和不同参数规模上均取得了一致的改进。我们进一步从最优传输理论和实证角度分析了TELLME对LLMs泛化能力的提升。

英文摘要

Large language models (LLMs) are becoming increasingly capable, but the mechanisms of their thinking and decision-making processes remain unclear. Chain-of-thoughts (CoTs) have been commonly utilized to externalize LLMs' thinking, but this strategy fails to accurately reflect LLMs' thinking process. Techniques based on LLMs' hidden representations provide an inner perspective to improve the monitorability of their latent thinking. However, previous methods only try to develop external modules instead of making LLMs themselves easier to monitor. In this paper, we propose a novel method, TELLME, improving the transparency of LLMs and helping monitors identify unsuitable and sensitive behaviors. Furthermore, we showcase the effectiveness of TELLME on detoxification tasks, where LLMs achieve consistent improvement among multimodal test sets, distinct architectures, and varying parameter scales. We further analyze TELLME's improvement on LLMs' generalization ability from both optimal transport theory and empirical perspectives.


2503.02857 2026-05-28 cs.CV cs.AI cs.CY 版本更新

Deepfake-Eval-2024: A Multi-Modal In-the-Wild Benchmark of Deepfakes Circulated in 2024

Deepfake-Eval-2024：2024年传播的深度伪造多模态野外基准

Nuria Alina Chandra, Hannah Lee, Ryan Murtfeldt, Lin Qiu, Arnab Karmakar, Emmanuel Tanumihardja, Kevin Farhat, Ben Caffee, Changyeon Lee, Jongwook Choi, Sejin Paik, Aerin Kim, Oren Etzioni

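Affiliations * University of Washington; Allen Institute for Artificial Intelligence; University of Maryland; Chung-Ang University; Georgetown University; Miraflow AI

AI summary: Existing academic benchmarks are outdated and do not reflect real-world deepfakes. This work introduces Deepfake-Eval-2024, a multi-modal benchmark built from 2024 social media content and user submissions. Open-source models degrade sharply on it, while commercial and finetuned models perform better but do not reach expert-level accuracy.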

AI Chinese abstract

在生成式人工智能日益逼真的时代，稳健的深度伪造检测对于减少欺诈和虚假信息至关重要。尽管许多深度伪造检测器在学术数据集上报告了高准确率，但我们表明这些学术基准已经过时，不能代表现实世界的深度伪造。我们引入了Deepfake-Eval-2024，这是一个新的深度伪造检测基准，由2024年从社交媒体和深度伪造检测平台用户收集的野外深度伪造组成。Deepfake-Eval-2024包含45小时的视频、56.5小时的音频和1,975张图像，涵盖了最新的操纵技术。该基准包含来自52种不同语言、88个不同网站的多样化媒体内容。我们发现，在Deepfake-Eval-2024上评估时，开源最先进的深度伪造检测模型的性能急剧下降，与之前的基准相比，视频模型的AUC下降了50%，音频模型下降了48%，图像模型下降了45%。我们还评估了商业深度伪造检测模型和在Deepfake-Eval-2024上微调的模型，发现它们比现成的开源模型性能更优，但尚未达到深度伪造取证分析师的准确率。数据集可在https://github.com/nuriachandra/Deepfake-Eval-2024获取。

English abstract

In the age of increasingly realistic generative AI, robust deepfake detection is essential for mitigating fraud and disinformation. While many deepfake detectors report high accuracy on academic datasets, we show that these academic benchmarks are out of date and not representative of real-world deepfakes. We introduce Deepfake-Eval-2024, a new deepfake detection benchmark consisting of in-the-wild deepfakes collected from social media and deepfake detection platform users in 2024. Deepfake-Eval-2024 consists of 45 hours of videos, 56.5 hours of audio, and 1,975 images, encompassing the latest manipulation technologies. The benchmark contains diverse media content from 88 different websites in 52 different languages. We find that the performance of open-source state-of-the-art deepfake detection models drops precipitously when evaluated on Deepfake-Eval-2024, with AUC decreasing by 50% for video, 48% for audio, and 45% for image models compared to previous benchmarks. We also evaluate commercial deepfake detection models and models finetuned on Deepfake-Eval-2024, and find that they have superior performance to off-the-shelf open-source models, but do not yet reach the accuracy of deepfake forensic analysts. The dataset is available at https://github.com/nuriachandra/Deepfake-Eval-2024.
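
For readers less familiar with the metric behind the reported drops, AUC can be read as the probability that a detector scores a randomly chosen fake above a randomly chosen real item. The sketch below is a minimal illustration under stated assumptions: the `auc` helper is hypothetical, the scores are made up, and the benchmark's own evaluation code may differ.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (fake, real) pairs ranked correctly.

    labels: 1 for fake, 0 for real; ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one fake and one real example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up detector scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
older_benchmark_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # fakes clearly separated
in_the_wild_scores = [0.6, 0.4, 0.2, 0.5, 0.3, 0.1]      # fakes and reals overlap

print(auc(labels, older_benchmark_scores))
print(auc(labels, in_the_wild_scores))
```

The quadratic pairwise loop is chosen for readability; a rank-based formulation gives the same value and scales better for datasets of realistic size.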

2505.19342 2026-05-28 cs.LG cs.AI Updated version

ASTRA: Communication-Efficient Acceleration for Multi-Device Transformer Inference

ASTRA：面向多设备Transformer推理的通信高效加速

Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan

Affiliations * Manning College of Information and Computer Sciences, University of Massachusetts Amherst, MA, US; Amazon, Seattle, WA, US; Amazon AWS, Washington, D.C., US

AI summary: Proposes ASTRA, a framework that combines sequence parallelism with mixed-precision attention to enable efficient multi-device Transformer inference in low-bandwidth settings, delivering substantial speedups while preserving accuracy.

AI Chinese abstract

多设备推理可以通过并行计算降低Transformer延迟。然而，现有方法需要高设备间带宽，使其在带宽受限环境中不实用。我们提出ASTRA，一个通信高效的框架，将序列并行与混合精度注意力相结合，其中非局部token嵌入作为低位向量量化码传输，而局部注意力保持全精度。为了在激进压缩下保持精度，ASTRA引入了噪声增强量化和分布式分类token。在视觉和语言模型（如ViT和GPT2）上，ASTRA在低至10 Mbps的带宽下，相比单设备推理实现了高达2.64倍的加速，相比先前的多设备基线实现了高达15.25倍的加速。即使在非理想网络条件（如丢包和动态网络）下，ASTRA在大模型（如Llama-3-8B）上仍然保持鲁棒性。

English abstract

Multi-device inference can reduce Transformer latency by parallelizing computation. However, existing methods require high inter-device bandwidth, making them impractical for bandwidth-constrained environments. We present ASTRA, a communication-efficient framework that integrates sequence parallelism with mixed-precision attention, where non-local token embeddings are transmitted as low-bit vector-quantized codes while local attention remains full precision. To preserve accuracy under aggressive compression, ASTRA introduces Noise-Augmented Quantization and Distributed Class Tokens. Across vision and language models (e.g., ViT and GPT2), ASTRA achieves up to 2.64$\times$ speedup over single-device inference and up to 15.25$\times$ over prior multi-device baselines while operating at bandwidths as low as 10 Mbps. ASTRA remains robust on large models (e.g., Llama-3-8B) even under non-ideal network conditions such as packet loss and dynamic networks.
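
The communication saving from sending codes instead of embeddings can be sketched as follows. This is a generic nearest-centroid vector-quantization example with made-up sizes (a 4-entry codebook, 4-dimensional embeddings, 32-bit floats). It does not reproduce ASTRA's codebook training, Noise-Augmented Quantization, or Distributed Class Tokens, and all names (`CODEBOOK`, `encode`, `decode`, `payload_bits`) are hypothetical.

```python
import math

# Toy codebook shared by all devices (values are arbitrary, for illustration).
CODEBOOK = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]


def encode(embedding):
    """Return the index of the nearest codebook entry (squared L2 distance)."""
    def dist(code):
        return sum((e - c) ** 2 for e, c in zip(embedding, code))
    return min(range(len(CODEBOOK)), key=lambda i: dist(CODEBOOK[i]))


def decode(index):
    """Look up the codebook entry on the receiving device."""
    return CODEBOOK[index]


def payload_bits(num_tokens, dim, float_bits=32):
    """Bits needed to send raw embeddings versus codebook indices."""
    raw = num_tokens * dim * float_bits
    bits_per_code = math.ceil(math.log2(len(CODEBOOK)))
    return raw, num_tokens * bits_per_code


tokens = [[0.9, 1.1, 0.1, -0.1], [0.1, 0.0, 0.8, 1.2]]
codes = [encode(t) for t in tokens]
raw_bits, coded_bits = payload_bits(len(tokens), dim=4)
print(codes, [decode(c) for c in codes])
print(raw_bits, coded_bits)
```

The receiver only needs the shared codebook and the integer indices, which is why the payload shrinks with the codebook's bit width rather than with the embedding dimension.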

2309.17057 2026-05-28 cs.AI Updated version

Tell Me a Story! Narrative-Driven XAI with Large Language Models

给我讲个故事！基于大语言模型的叙事驱动可解释人工智能

David Martens, James Hinns, Camille Dams, Mark Vergouwen, Theodoros Evgeniou

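Affiliations * University of Antwerp, Department of Engineering Management

AI summary: Proposes XAIstories, which uses large language models to turn SHAP or counterfactual explanations into narratives that improve users' understanding of, and experience with, AI decisions. In the reported experiments, more than 90% of lay users found the narratives convincing, and 83% of data scientists said they would use them.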

Journal ref: 10.1016/j.dss.2025.114402

Manboformer: Learning Gaussian Representations via a Spatio-Temporal Attention Mechanism

Ziyue Zhao, Qining Qi, Jianfa Ma

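AI summary: To address the limited performance of Gaussian representations in 3D semantic occupancy prediction for autonomous driving, proposes using a spatio-temporal self-attention mechanism to optimize GaussianFormer and improve model performance.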

Comments After careful self-check, we found several unnoticed deficiencies and incomplete discussions in this manuscript. To ensure the rigor and accuracy of academic results, we decide to withdraw this preprint. A refined, complete, and rigorous version will be submitted soon

2502.12468 2026-05-28 cs.LG cs.AI Updated version

MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation

MCTS-Judge：LLM作为裁判在代码正确性评估中的测试时扩展

Yutong Wang, Pengliang Ji, Chaoqun Yang, Kaixin Li, Ming Hu, Jiaoyang Li, Guillaume Sartoretti

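Affiliations * Independent Contributor

AI summary: Proposes MCTS-Judge, a framework that uses Monte Carlo Tree Search at test time to decompose evaluation into multiple perspectives, substantially improving the accuracy and efficiency of LLM-as-a-Judge for code correctness evaluation.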

AI Chinese abstract

LLM作为裁判的范式在评估生成内容方面显示出潜力，但在推理密集型场景（如编程）中缺乏可靠性。受近期推理模型进展和扩展定律变化的启发，我们率先将测试时计算引入LLM作为裁判，提出MCTS-Judge，一种资源高效的、系统2思维框架，用于代码正确性评估。MCTS-Judge利用蒙特卡洛树搜索（MCTS）将问题分解为更简单的、多视角的评估。通过结合基于当前轨迹中历史动作的自我评估和基于先前rollout的树的上置信界（UCT）的节点选择策略，MCTS-Judge平衡了全局优化和当前轨迹的细化。我们进一步设计了一种高精度的、单元测试级别的奖励机制，以鼓励大语言模型（LLM）进行逐行分析。在三个基准和五个LLM上的大量实验证明了MCTS-Judge的有效性，它将基础模型的准确率从41%提升到80%，同时比o1系列模型少使用3倍的token。进一步的评估验证了其推理轨迹在逻辑、分析、全面性和整体质量上的优越性，同时揭示了LLM作为裁判范式的测试时扩展定律。

English abstract

The LLM-as-a-Judge paradigm shows promise for evaluating generative content but lacks reliability in reasoning-intensive scenarios, such as programming. Inspired by recent advances in reasoning models and shifts in scaling laws, we pioneer bringing test-time computation into LLM-as-a-Judge, proposing MCTS-Judge, a resource-efficient, System-2 thinking framework for code correctness evaluation. MCTS-Judge leverages Monte Carlo Tree Search (MCTS) to decompose problems into simpler, multi-perspective evaluations. Through a node-selection strategy that combines self-assessment based on historical actions in the current trajectory and the Upper Confidence Bound for Trees based on prior rollouts, MCTS-Judge balances global optimization and refinement of the current trajectory. We further designed a high-precision, unit-test-level reward mechanism to encourage the Large Language Model (LLM) to perform line-by-line analysis. Extensive experiments on three benchmarks and five LLMs demonstrate the effectiveness of MCTS-Judge, which improves the base model's accuracy from 41% to 80%, surpassing the o1-series models with 3x fewer tokens. Further evaluations validate the superiority of its reasoning trajectory in logic, analytics, thoroughness, and overall quality, while revealing the test-time scaling law of the LLM-as-a-Judge paradigm.
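
The abstract names the Upper Confidence Bound for Trees (UCT) as one ingredient of node selection. As a rough illustration only, the sketch below implements the textbook UCT score (mean reward plus an exploration bonus). The blending with an LLM self-assessment score through `alpha`, and all names such as `Node`, `self_score`, and `select_child`, are assumptions made for illustration and not the paper's actual formulation.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical search-tree node; field names are illustrative."""
    name: str
    visits: int = 0            # number of rollouts through this node
    total_reward: float = 0.0  # sum of rollout rewards
    self_score: float = 0.0    # assumed LLM self-assessment in [0, 1]


def uct(child, parent_visits, c=math.sqrt(2)):
    """Textbook UCT: mean reward + c * sqrt(ln(N_parent) / n_child)."""
    if child.visits == 0:
        return math.inf  # unvisited children are tried first
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def select_child(children, parent_visits, alpha=0.5):
    """Assumed blend: alpha * self-assessment + (1 - alpha) * UCT."""
    def score(ch):
        return alpha * ch.self_score + (1 - alpha) * uct(ch, parent_visits)
    return max(children, key=score)


# Made-up statistics for two candidate evaluation sub-tasks.
children = [
    Node("check_logic", visits=8, total_reward=6.0, self_score=0.4),
    Node("check_edge_cases", visits=2, total_reward=1.0, self_score=0.9),
]
best = select_child(children, parent_visits=10)
print(best.name)
```

In this toy setup the rarely visited child wins because its exploration bonus and self-assessment outweigh the other child's higher mean reward, which is the kind of trade-off between prior rollouts and the current trajectory that the abstract describes.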

2411.18502 2026-05-28 stat.ML cs.AI cs.IR cs.LG stat.ME Updated version

Isometry pursuit

等距追踪

Samson Koelle, Marina Meila

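Affiliations * Amazon; Department of Statistics, University of Washington

AI summary: Proposes the isometry pursuit algorithm, which combines a novel normalization method with multitask basis pursuit to identify orthonormal column submatrices of wide matrices, for discovering isometric embeddings from interpretable dictionaries.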

2410.10241 2026-05-28 cs.LG cs.AI stat.ML Updated version

Revisiting Graph Autoencoders as Implicit Contrastive Learners

重新审视图自编码器作为隐式对比学习器

Jintang Li, Ruofan Wu, Yuchang Zhu, Huizhe Zhang, Zulun Zhu, Liang Chen

Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; Coupang Shanghai China; Sun Yat-sen University; Nanyang Technological University

AI summary: Revisits graph autoencoders through the lens of contrastive learning, shows that they are implicit contrastive learners, highlights the key role of contrastive view design, and identifies asymmetric subgraph views as an important design dimension.

Comments KDD 2026 research track. Code available at https://github.com/EdisonLeeeee/lrGAE

AI Chinese abstract

图自编码器（GAEs）和图对比学习（GCL）是图上自监督表示学习的两种主要范式，但它们通常被孤立研究并被视为根本不同的方法。在这项工作中，我们通过对比学习的视角重新审视GAEs，并表明基于结构和基于特征的GAEs都可以概念化为隐式图对比学习器。这一视角揭示了许多现有GAEs的主要区别在于对比视图的构建方式，而非学习目标或架构。基于这一见解，我们引入了一个统一公式，强调对比视图设计是GAEs中一个核心且先前较少探索的维度。特别是，我们识别出由子图视图不匹配产生的非对称对比视图，作为先前GAE研究中一个重要但未充分探索的设计轴。我们在统一框架内形式化这一见解，并在代表性图学习任务上进行系统实验，以检验其对性能和效率的影响。我们的结果表明，将GAEs解释为隐式对比学习器能更清晰地理解现有模型，并为设计有效且可扩展的图自编码器提供实用指导。

English abstract

Graph autoencoders (GAEs) and graph contrastive learning (GCL) are two major paradigms for self-supervised representation learning on graphs, yet they are often studied in isolation and treated as fundamentally different approaches. In this work, we revisit GAEs through the lens of contrastive learning and show that both structure-based and feature-based GAEs can be conceptualized as implicit graph contrastive learners. This perspective reveals that many existing GAEs differ primarily in how contrastive views are constructed, rather than in their learning objectives or architectures. Building on this insight, we introduce a unified formulation that highlights contrastive view design as a central and previously less explored dimension in GAEs. In particular, we identify asymmetric contrastive views, arising from mismatches in subgraph views, as an important yet underexplored design axis in prior GAE research. We formalize this insight within a unified framework and conduct systematic experiments on representative graph learning tasks to examine its impact on performance and efficiency. Our results show that interpreting GAEs as implicit contrastive learners offers a clearer understanding of existing models and provides practical guidance for designing effective and scalable graph autoencoders.

2405.09586 2026-05-28 eess.IV cs.AI cs.CV Updated version

Factual Serialization Enhancement: A Key Innovation for Chest X-ray Report Generation

事实序列化增强：胸部X光报告生成的关键创新

Kang Liu, Zhuoqi Ma, Mengmeng Liu, Zhicheng Jiao, Xiaolu Kang, Qiguang Miao, Kun Xie

Affiliations * School of Computer Science and Technology, Xidian University; Xi’an Key Laboratory of Big Data and Intelligent Vision; Key Laboratory of Collaborative Intelligence Systems, Ministry of Education; School of Artificial Intelligence, Xidian University; Department of Diagnostic Imaging, Brown University

AI summary: Proposes FSE, a two-stage Factual Serialization Enhancement method that uses factuality-guided contrastive learning and evidence-driven report generation to improve the clinical accuracy and natural-language quality of chest X-ray report generation.

Comments Code is available at https://github.com/mk-runner/FSE

DOI: 10.1016/j.eswa.2026.132550

AI Chinese abstract

放射学报告包含呈现式词汇（确保清晰和组织）和事实性词汇（基于可观察发现提供准确客观描述）。手动编写这些报告耗时费力，而自动报告生成提供了一种有前景的替代方案。该过程中的关键步骤是将X光片与其对应报告对齐。然而，现有方法通常依赖完整报告进行对齐，忽略了呈现式词汇的影响。为解决此问题，我们提出FSE，一种两阶段事实序列化增强方法。在第一阶段，我们引入事实引导的对比学习用于视觉表示，通过最大化X光片与对应事实描述之间的语义对应关系。在第二阶段，我们提出证据驱动的报告生成，通过整合来自类似历史病例的结构化事实序列化见解，增强诊断准确性。在MIMIC-CXR和IU X-ray数据集上的实验（涵盖特定和一般场景）表明，FSE在自然语言生成和临床效能指标上均优于最先进方法。消融研究进一步强调了第一阶段和第二阶段中事实序列化的积极作用。代码可在https://github.com/mk-runner/FSE获取。

English abstract

A radiology report comprises presentation-style vocabulary, which ensures clarity and organization, and factual vocabulary, which provides accurate and objective descriptions based on observable findings. While manually writing these reports is time-consuming and labor-intensive, automatic report generation offers a promising alternative. A critical step in this process is to align radiographs with their corresponding reports. However, existing methods often rely on complete reports for alignment, overlooking the impact of presentation-style vocabulary. To address this issue, we propose FSE, a two-stage Factual Serialization Enhancement method. In Stage 1, we introduce factuality-guided contrastive learning for visual representation by maximizing the semantic correspondence between radiographs and corresponding factual descriptions. In Stage 2, we present evidence-driven report generation that enhances diagnostic accuracy by integrating insights from similar historical cases structured as factual serialization. Experiments on MIMIC-CXR and IU X-ray datasets across specific and general scenarios demonstrate that FSE outperforms state-of-the-art approaches in both natural language generation and clinical efficacy metrics. Ablation studies further emphasize the positive effects of factual serialization in Stage 1 and Stage 2. The code is available at https://github.com/mk-runner/FSE.
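
Stage 1 pairs each radiograph with its factual description and trains the visual encoder contrastively. The abstract does not state the exact loss, so the sketch below uses a generic image-to-text InfoNCE objective as a stand-in. The function names, the toy 2-dimensional embeddings, and the temperature value are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(image_embs, text_embs, temperature=0.1):
    """Image-to-text InfoNCE; matched pairs share the same index."""
    loss = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, txt) / temperature for txt in text_embs]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(image_embs)


# Toy embeddings: two radiographs and two factual descriptions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]  # each text matches its image
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]  # pairs deliberately mismatched

print(info_nce(images, aligned_texts))
print(info_nce(images, swapped_texts))
```

The loss is small when each radiograph is closest to its own description and large when the pairs are mismatched, which is the sense in which minimizing it maximizes semantic correspondence.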

2407.21075 2026-05-28 cs.AI cs.CL cs.LG Updated version

Apple Intelligence Foundation Language Models

Apple Intelligence 基础语言模型

Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek, Sam Wiseman, Syd Evans, Tao Lei, Vivek Rathod, Xiang Kong, Xianzhi Du, Yanghao Li, Yongqiang Wang, Yuan Gao, Zaid Ahmed, Zhaoyang Xu, Zhiyun Lu, Al Rashid, Albin Madappally Jose, Alec Doane, Alfredo Bencomo, Allison Vanderby, Andrew Hansen, Ankur Jain, Anupama Mann Anupama, Areeba Kamal, Bugu Wu, Carolina Brum, Charlie Maalouf, Chinguun Erdenebileg, Chris Dulhanty, Daniel Parilla, Dominik Moritz, Doug Kang, Eduardo Jimenez, Evan Ladd, Fangping Shi, Felix Bai, Frank Chu, Fred Hohman, Hadas Kotek, Hannah Gillis Coleman, Jane Li, Jeffrey Bigham, Jeffery Cao, Jeff Lai, Jessica Cheung, Jiulong Shan, Joe Zhou, John Li, Jun Qin, Karanjeet Singh, Karla Vega, Kelvin Zou, Laura Heckman, Lauren Gardiner, Margit Bowler, Maria Cordell, Meng Cao, Nicole Hay, Nilesh Shahdadpuri, Otto Godwin, Pranay Dighe, Pushyami Rachapudi, Ramsey Tantawi, Roman Frigg, Sam Davarnia, Sanskruti Shah, Saptarshi Guha, Sasha Sirovica, Shen Ma, Shuang Ma, Simon Wang, Sulgi Kim, Suma Jayaram, Vaishaal Shankar, Varsha Paidi, Vivek Kumar, Xin Wang, Xin Zheng, Walker Cheng, Yael Shrager, Yang Ye, Yasu Tanaka, Yihao Guo, Yunsong Meng, Zhao Tang Luo, Zhi Ouyang, Alp Aygar, Alvin Wan, Andrew Walkingshaw, Andy Narayanan, Antonie Lin, Arsalan Farooq, Brent Ramerth, Colorado Reed, Chris Bartels, Chris Chaney, David Riazati, Eric Liang Yang, Erin Feldman, Gabriel Hochstrasser, Guillaume Seguin, Irina Belousova, Joris Pelemans, Karen Yang, Keivan Alizadeh Vahid, Liangliang Cao, Mahyar Najibi, Marco Zuliani, Max Horton, Minsik Cho, Nikhil Bhendawade, Patrick Dong, Piotr Maj, Pulkit Agrawal, Qi Shan, Qichen Fu, Regan Poston, Sam Xu, Shuangning Liu, Sushma Rao, Tashweena Heeramun, Thomas Merth, Uday Rayala, Victor Cui, Vivek Rangarajan Sridhar, Wencong Zhang, Wenqi Zhang, Wentao Wu, Xingyu Zhou, Xinwen Liu, Yang Zhao, Yin Xia, Zhile Ren, Zhongzheng Ren

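Affiliations * Apple

AI summary: Presents the foundation language models developed to power Apple Intelligence features: an on-device model of about 3 billion parameters designed to run efficiently, and a larger server-based model for Private Cloud Compute. The report describes their architecture, training data, optimization process, and evaluation results.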

2405.09689 2026-05-28 cs.LG cs.AI cs.SC Updated version

Generalized Holographic Reduced Representations

广义全息约简表示

Calvin Yeung, Zhuowen Zou, SungHeon Jeong, Wenjun Huang, Nathaniel D Bastian, Mohsen Imani

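Affiliations * University of California, Irvine; United States Military Academy

AI summary: Proposes Generalized Holographic Reduced Representations (GHRR), whose flexible non-commutative binding operation improves the ability of hyperdimensional computing to encode complex compositional structures, and shows on a language modeling task that it can replace the attention mechanism and improve performance.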

DOI: 10.1109/TAI.2026.3678232

AI Chinese abstract

超维计算（HDC）是一种计算和数据高效的范式，在连接主义和符号主义人工智能方法之间架起桥梁。然而，HDC的简单性给编码复杂组合结构带来了挑战，尤其是在其绑定操作中。为了解决这个问题，我们提出了广义全息约简表示（GHRR），它是傅里叶全息约简表示（FHRR）的扩展，FHRR是一种特定的HDC实现。GHRR引入了一种灵活的非交换绑定操作，能够改进复杂数据结构的编码，同时保留HDC的鲁棒性和透明性等理想特性。在这项工作中，我们介绍了GHRR框架，证明了其理论性质及其对HDC性质的遵循，探索了其核和绑定特性，并通过实证实验展示了其灵活的非交换性以及对组合结构增强的解码准确性。我们还证明了GHRR中的绑定比其他HDC变体更具表现力；特别地，我们展示了GHRR中的绑定可以实现一种注意力机制。我们通过在Transformer中将其注意力机制替换为GHRR等价物并在语言建模任务上进行测试来验证这一点，结果显示与普通Transformer相比性能有所提升。

English abstract

Hyperdimensional Computing (HDC) is a computationally and data-efficient paradigm that acts as a bridge between connectionist and symbolic approaches to artificial intelligence (AI). However, HDC's simplicity poses challenges for encoding complex compositional structures, especially in its binding operation. To address this, we propose Generalized Holographic Reduced Representations (GHRR), an extension of Fourier Holographic Reduced Representations (FHRR), a specific HDC implementation. GHRR introduces a flexible, non-commutative binding operation, enabling improved encoding of complex data structures while preserving HDC's desirable properties of robustness and transparency. In this work, we introduce the GHRR framework, prove its theoretical properties and its adherence to HDC properties, explore its kernel and binding characteristics, and perform empirical experiments showcasing its flexible non-commutativity and enhanced decoding accuracy for compositional structures. We also demonstrate that binding in GHRR is more expressive than that in other HDC variants; in particular, we show that binding in GHRR can implement a kind of attention mechanism. We verify this by replacing the attention mechanism in a transformer with its GHRR-equivalent and testing it on a language modeling task, showing improved performance compared to a vanilla transformer.
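
To make the commutativity point concrete: in FHRR, hypervectors are vectors of unit-modulus complex phasors and binding is element-wise multiplication, which is commutative. The sketch below shows this with the standard library only and contrasts it with a 2x2 matrix product as a stand-in for a non-commutative operation. The matrix example only illustrates non-commutativity; it is not the GHRR construction itself, which the abstract does not spell out, and the helper names are hypothetical.

```python
import cmath
import math
import random

random.seed(0)
DIM = 1024  # hypervector dimensionality (illustrative choice)


def random_phasor_vector(dim=DIM):
    """FHRR-style hypervector: unit-modulus complex entries with random phase."""
    return [cmath.exp(1j * random.uniform(-math.pi, math.pi)) for _ in range(dim)]


def bind(a, b):
    """FHRR binding: element-wise complex multiplication (commutative)."""
    return [x * y for x, y in zip(a, b)]


def unbind(c, a):
    """Recover b from c = bind(a, b) by multiplying with the conjugate of a."""
    return [x * y.conjugate() for x, y in zip(c, a)]


def similarity(a, b):
    """Real part of the normalized inner product; about 1 for equal vectors."""
    return sum((x * y.conjugate()).real for x, y in zip(a, b)) / len(a)


def matmul2(p, q):
    """2x2 matrix product, used only to illustrate non-commutativity."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


role, filler = random_phasor_vector(), random_phasor_vector()
bound = bind(role, filler)

print(similarity(bound, bind(filler, role)))    # order does not matter
print(similarity(unbind(bound, role), filler))  # unbinding recovers the filler
print(similarity(bound, filler))                # bound vector looks unrelated

p = [[0, 1], [1, 0]]
q = [[1, 0], [0, -1]]
print(matmul2(p, q) == matmul2(q, p))           # matrix products depend on order
```

Because element-wise binding cannot distinguish `bind(role, filler)` from `bind(filler, role)`, order-sensitive structures need extra machinery in FHRR, which is the limitation a non-commutative binding is meant to remove.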

2308.07772 2026-05-28 cs.LG cs.AI Updated version

MOLE: MOdular Learning FramEwork via Mutual Information Maximization

MOLE: 基于互信息最大化的模块化学习框架

Tianchao Li, Yulong Pei

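Affiliations * Department of Mathematics; Computer Science, Eindhoven University of Technology, Eindhoven, the Netherlands

AI summary: Proposes MOLE, an asynchronous local learning framework that modularizes networks by layer and maximizes mutual information to achieve gradient-isolated local optimization. It applies to vector, grid, and graph data, and its generality is validated on graph-level and node-level tasks.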

Comments Accepted by ICML LLW

2304.12986 2026-05-28 cs.CL cs.AI Updated version

Measuring Massive Multitask Chinese Understanding

测量大规模多任务中文理解

Hui Zeng

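Affiliations * Besteasy (Beijing) Language Technology Co., Ltd.

AI summary: To address the lack of capability assessments for Chinese large language models, proposes a multitask test covering medicine, law, psychology, and education, with 15 subtasks in medicine and 8 in education. Models are evaluated by zero-shot accuracy: the best model leads the worst by nearly 18.6 percentage points on average, and all models perform worst in the legal domain.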

AI Chinese abstract

大规模中文语言模型的发展蓬勃，但缺乏相应的能力评估。因此，我们提出一个测试来衡量大型中文语言模型的多任务准确性。该测试涵盖四大领域，包括医学、法律、心理学和教育，其中医学有15个子任务，教育有8个子任务。我们发现，在零样本设置中，表现最好的模型平均比表现最差的模型高出近18.6个百分点。在四大领域中，所有模型的最高平均零样本准确率为0.512。在子领域中，只有GPT-3.5-turbo模型在临床医学上达到了0.693的零样本准确率，这是所有模型在所有子任务中的最高准确率。所有模型在法律领域表现不佳，最高零样本准确率仅为0.239。通过全面评估多个学科知识的广度和深度，该测试可以更准确地识别模型的不足之处。

English abstract

The development of large-scale Chinese language models is flourishing, yet there is a lack of corresponding capability assessments. Therefore, we propose a test to measure the multitask accuracy of large Chinese language models. This test encompasses four major domains, including medicine, law, psychology, and education, with 15 subtasks in medicine and 8 subtasks in education. We found that the best-performing models in the zero-shot setting outperformed the worst-performing models by nearly 18.6 percentage points on average. Across the four major domains, the highest average zero-shot accuracy of all models is 0.512. In the subdomains, only the GPT-3.5-turbo model achieved a zero-shot accuracy of 0.693 in clinical medicine, which was the highest accuracy among all models across all subtasks. All models performed poorly in the legal domain, with the highest zero-shot accuracy reaching only 0.239. By comprehensively evaluating the breadth and depth of knowledge across multiple disciplines, this test can more accurately identify the shortcomings of the models.

2305.06426 2026-05-28 cs.AI cs.SY eess.SY math.OC Updated version

Planning a Community Approach to Diabetes Care in Low- and Middle-Income Countries Using Optimization

使用优化规划中低收入国家糖尿病护理的社区方法

Katherine B. Adams, Justin J. Boutilier, Sarang Deo, Yonatan Mintz

Affiliations * Department of Operations and Analytics, University of Texas at San Antonio; Telfer School of Management, University of Ottawa; Max Institute of Healthcare Management, Indian School of Business; Department of Industrial and Systems Engineering, University of Wisconsin-Madison

AI summary: Proposes an optimization framework that personalizes community health worker visit plans to maximize glycemic control at the community level, while balancing resources between screening new patients and managing enrolled patients.

Comments 50 pages, 13 figures

DOI: 10.1287/opre.2023.0248

AI Chinese abstract

糖尿病是全球健康优先事项，尤其是在中低收入国家，超过50%的过早死亡归因于高血糖。社区卫生工作者项目可以提供负担得起且文化适宜的解决方案，用于糖尿病的早期发现和管理。我们引入了一个优化框架，用于确定个性化的社区卫生工作者访视，以在社区层面最大化血糖控制。我们的框架明确建模了筛查新患者与为已登记治疗患者提供管理访视之间的权衡。我们考虑了患者的动机状态，这会影响他们决定加入或退出治疗，从而影响干预的有效性。通过估计患者的健康和动机状态，我们的模型在构建访视计划时考虑了患者在决定加入治疗时的权衡，从而降低了退出率并改善了资源分配。我们应用该方法，使用印度城市贫民窟的运营数据生成社区卫生工作者访视计划。我们发现，与最佳基线方法相比，我们的方法在相同能力下可将空腹血糖降低高达25%。我们的实验还表明，该方法在不完美信息下表现良好。

English abstract

Diabetes is a global health priority, especially in low- and middle-income countries, where over 50% of premature deaths are attributed to high blood glucose. Community Health Worker (CHW) programs can provide affordable and culturally tailored solutions for early detection and management of diabetes. We introduce an optimization framework to determine personalized CHW visits that maximize glycemic control at a community level. Our framework explicitly models the trade-off between screening new patients and providing management visits to individuals who are enrolled in treatment. We account for patients' motivational states, which affect their decisions to enroll or drop out of treatment and, therefore, the effectiveness of the intervention. By estimating patients' health and motivational states, our model builds visit plans accounting for patients' tradeoffs when deciding to enroll in treatment, leading to reduced dropout rates and improved resource allocation. We apply our approach to generate CHW visit plans using operational data from urban slums in India. We find that our approach can reduce fasting blood glucose by up to 25% with the same capacity as the best baseline method. Our experiments also demonstrate that our approach performs well with imperfect information.
