arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09879 2026-06-10 cs.LG New submission

Operator Fusion for LLM Inference on the Tensix Architecture

Qingbo Wu, Ke Li, Wenzhu Wang, Jie Yu, Ruian Zhang, Lili Liu

Affiliations: KylinSoft Co., Ltd, Tianjin, China; openKylin Community, Tianjin, China; College of Computer Science and Technology, National University of Defense Technology, Changsha, China

AI summary: Targets LLM inference bottlenecks on the Tensix architecture with an operator fusion strategy that fuses RMSNorm with matrix multiplication, using on-chip SRAM and NoC multicast to cut DRAM reads/writes and scheduling overhead. On the Wormhole platform, attention latency falls by up to 37.44% and MLP latency by 15.89%, with PCC-measured numerical consistency above 98.75%.

Comments 11 pages, 5 figures

Abstract

This study addresses on-device inference bottlenecks of Transformer models on Tenstorrent's Tensix architecture and proposes an operator fusion strategy that enhances data locality. RMSNorm is fused with matrix multiplication in self-attention and in the FFN, enabling back-to-back execution of memory-bound and compute-bound operators in on-chip SRAM to significantly reduce DRAM reads/writes of intermediate results and scheduling overhead. To support multi-core parallelism, a NoC-based multicast mechanism is leveraged in which row/column master nodes efficiently distribute inputs and weights across the core mesh, alleviating DRAM bandwidth contention. Experiments on the Wormhole platform with Qwen2.5-0.5B, Qwen3-0.6B, and Qwen3-4B show up to 37.44% latency reduction for attention and 15.89% for MLP, with up to 7.91% reduction per decoder layer, while Pearson Correlation Coefficient (PCC) remains above 98.75%, confirming significant end-to-end efficiency gains under numerical consistency.
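
The fusion idea in this abstract can be illustrated with a small numerical sketch. It is plain Python, not Tensix kernel code, and it is not claimed to be the authors' formulation: it only shows that RMSNorm followed by a matrix product can be computed in a single pass over the input, so the normalized vector never has to be stored as a separate intermediate. The names `rmsnorm`, `matvec`, `unfused`, and `fused`, the gain vector `gamma`, and the `eps` value are all assumptions made for illustration.

```python
import math


def rmsnorm(x, gamma, eps=1e-6):
    """Reference RMSNorm: scale x by 1/rms(x), then by a per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * g for v, g in zip(x, gamma)]


def matvec(x, w):
    """Multiply a length-d vector by a d x n weight matrix (list of rows)."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]


def unfused(x, gamma, w):
    """Two separate operators: the normalized vector is materialized in between."""
    return matvec(rmsnorm(x, gamma), w)


def fused(x, gamma, w, eps=1e-6):
    """One pass over x: accumulate the sum of squares and the gain-scaled
    dot products together, then apply the 1/rms factor to the outputs."""
    d, n = len(x), len(w[0])
    sum_sq = 0.0
    acc = [0.0] * n
    for i in range(d):
        sum_sq += x[i] * x[i]
        xi = x[i] * gamma[i]
        for j in range(n):
            acc[j] += xi * w[i][j]
    inv_rms = 1.0 / math.sqrt(sum_sq / d + eps)
    return [a * inv_rms for a in acc]
```

Both functions return the same values up to floating-point rounding; the fused variant works because the 1/rms factor is a scalar that commutes with the matrix product.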

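2606.09878 2026-06-10 cs.LG New submission

FailureScope: Cross-Regime Behavioral Diagnosis of Language Model Weaknesses

Nicholas Saban

Affiliations: University of California, Berkeley

AI summary: Proposes FailureScope, which clusters evaluation probes by their cross-model pass/fail patterns and yields stable, interpretable failure taxonomies across three regimes (single-turn benchmarks, multi-turn dialogue, and adversarial agent attacks), enabling efficient task sampling and cross-model failure prediction.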

Abstract

Standard benchmarks report aggregate accuracy, but practitioners need to know which specific capabilities a model lacks. We introduce FailureScope, a behavioral-diagnosis method that clusters evaluation probes by their cross-model pass/fail patterns (leave-one-model-out, LOMO), and show it yields stable, interpretable failure taxonomies across three regimes usually studied separately: single-turn benchmarks, multi-turn dialogue, and adversarial agent attacks. On 2,664 single-turn tasks across 18 models, taxonomy-conditioned sampling reaches Kendall's tau = 0.81 at 50 tasks (versus 0.34 for random selection), and cross-model failure prediction reaches AUC 0.88. The same primitive recovers interpretable clusters on a 363-task multi-turn corpus and on 630 adversarial agent traces, where it exposes a meta-failure mode: a 73-100 percentage-point gap between LLM-judge ASR and real execution. Cluster cohesion remains strong across all three regimes, which we take as evidence that behavioral clustering is a portable diagnosis primitive that generalizes beyond any single benchmark. We release the pipeline, three annotated corpora, and the cross-regime taxonomies.
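
The Kendall's tau figures in this abstract compare a model ranking obtained from a small task sample with the ranking from the full benchmark. The sketch below shows only that metric (tau-a, with tied pairs counted as neither concordant nor discordant). It does not implement FailureScope's clustering or its taxonomy-conditioned sampling, and the helper names `kendall_tau` and `subset_tau` and the pass/fail layout (one row per model, one 0/1 entry per task) are assumptions.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau-a between two equally long score lists."""
    assert len(a) == len(b) and len(a) > 1
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


def subset_tau(pass_fail, subset):
    """Agreement between full-benchmark accuracy and accuracy on a task subset.

    pass_fail: one row per model, each row a list of 0/1 task outcomes.
    subset: indices of the sampled tasks.
    """
    full = [sum(row) / len(row) for row in pass_fail]
    part = [sum(row[t] for t in subset) / len(subset) for row in pass_fail]
    return kendall_tau(full, part)
```

A subset that every model passes carries no ranking information and gives tau 0, while a subset that separates the models in the same order as the full benchmark pushes tau toward 1.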

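2606.09877 2026-06-10 cs.LG cs.CE cs.CL New submission

Streaming Knowledge Compilation: Proactive Materiality-Scored Pinning for Time-Evolving LLM Wikis

Juan M. Huerta

Affiliations: Zinnia Tech Solutions

AI summary: Proposes a Streaming Knowledge Compilation framework that proactively pins important documents using a materiality signal φ_t, proves an O(√(T log K)) regret bound, evaluates the approach in finance and Wikipedia domains, and exposes an LLM-as-judge confound.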

Abstract

LLM wiki systems compile knowledge into pre-filled KV caches for efficient inference, but assume a static corpus -- an assumption that fails whenever the underlying information landscape evolves. We formalize Streaming Knowledge Compilation: given a document stream, a fixed token budget, and future queries unknown at ingestion time, maintain a compiled wiki that minimizes cumulative regret against an offline oracle with perfect foresight. The enabling insight is a materiality signal $\phi_t(k,n)\in[0,1]$ that scores document importance for entity $k$ at time $t$, acting as a query-relevance surrogate for proactive pinning before queries arrive; we prove an $O(\sqrt{T\log K})$ regret bound where $\varepsilon=\mathbb{E}[|\phi_t-\hat{\phi}_t|]$ is the only domain-specific quantity. We instantiate the framework in two domains: finance, where $\phi_t$ is abnormal stock volatility predicted by a frozen Llama 3.1 8B classification head (AUROC = 0.728 on 76K articles, strict temporal split; $1.49\times$ higher realized forward volatility for predicted-material articles); and Wikipedia, where $\phi_t$ is the Abnormal Edit Ratio (AER), a cross-sectionally normalized edit velocity -- showing the same algorithm generalizes beyond the finance domain. End-to-end QA evaluation on 173 matched pairs (finance) and 119 (Wikipedia) reveals a pervasive LLM-as-judge confound on post-training knowledge, establishing that regret analysis -- not absolute QA scores -- is the reliable evaluation metric for compiled knowledge systems. Finance cumulative regret converges to -20.0 (-0.12/step); Wikipedia to +16.0 (+0.13/step), with the positive sign confirming that Wikipedia edit content is genuinely post-training -- richer context consistently improves scores (No Wiki 3.80 vs. Oracle 4.74) -- and eliminates this confound. The $O(\sqrt{T\log K})$ guarantee applies to any domain where knowledge gaps can be predicted from streaming signals.

2606.09876 2026-06-10 cs.LG New submission

Calibrating Overconfidence Without Sacrificing Confidence: Probe-Conditioned Head Intervention for LLMs

Ke Li, Chongzhe Zhang, Zifan Zeng, Feng Liu, Qunli Zhang, Zheng Hu

Affiliations: Huawei Heisenberg Research Center; EPFL; TU Berlin; TUM

AI summary: Proposes PCHI, which uses a frozen probe to detect likely wrong-but-confident responses and conditionally rescales attention-head outputs, reducing overconfidence while preserving warranted confidence and lowering ECE from 21.9% to 9.2%.

Comments 11 pages, 4 figures

Abstract

Large language models often express high confidence in answers that are wrong. Standard calibration remedies typically act globally or at the score level, reducing unwarranted confidence but also risking erosion of warranted confidence on correct answers. We introduce Probe-Conditioned Head Intervention (PCHI), an inference-time method that uses a frozen probe to detect likely wrong-but-confident responses and conditionally rescales downstream attention-head outputs during confidence generation. On Qwen3-4B-Instruct solving OpenMathInstruct problems with a structured binary confidence field, readout-token PCHI converts 82.2% of originally wrong-yes confidence readouts to $\texttt{no}$, while a joint intervention across upstream confidence-template tokens reduces ECE from 21.9% to 9.2% and damages only 5.1% of originally correct-yes readouts. The readout-token effect also appears on Gemma3-4B, though upstream interventions are weaker and more mask-dependent. These results show that verbalized overconfidence can be selectively reduced through conditionally applied internal intervention, partially decoupling the suppression of unwarranted confidence from the loss of warranted confidence.
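
The ECE values quoted in this abstract refer to expected calibration error. A minimal sketch of the usual equal-width binned estimator follows. The abstract does not state the binning or how the binary yes/no confidence field is mapped to a number, so the default of ten bins and the function name `expected_calibration_error` are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |average confidence - accuracy| per bin.

    confidences: floats in [0, 1]; correct: matching booleans.
    """
    assert len(confidences) == len(correct) and confidences
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece
```

With a binary readout mapped to confidence 1.0 or 0.0 (an assumed mapping), the estimator reduces to comparing the accuracy of the "yes" group with 1 and the accuracy of the "no" group with 0.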

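2606.09874 2026-06-10 cs.LG stat.ML New submission

Disjoint or Overlapping? Inference Windowing for Reconstruction-Based Time Series Anomaly Detection

Guillaume Coulaud, Reza Akbarinia, Florent Masseglia

Affiliations: University of Montpellier, Inria, CNRS, LIRMM

AI summary: Studies how the inference stride (overlapping versus disjoint windows) affects reconstruction-based time series anomaly detection, proposes a unified evaluation protocol, and finds that overlapping windows yield average relative gains of up to 28% and can change method rankings.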

Abstract

Reconstruction-based methods are widely used for time series anomaly detection, where models are trained to reconstruct subsequences, and anomalies are identified through reconstruction errors. However, reported results are often hard to compare due to heterogeneous evaluation practices and underspecified inference procedures. In this paper, we revisit reconstruction-based anomaly detection in the univariate offline setting and study the role of the inference stride, which controls whether subsequences are processed as disjoint windows or with overlap. We propose a unified training, tuning, and multi-seed evaluation protocol on the curated TSB-AD benchmark, and study how overlapping inference affects anomaly detection performance for a range of reconstruction models, including PCA-based baselines, DLinear, an AutoEncoder, TimesNet, and Transformer variants. The results show that across all models, overlapping windows yield consistent improvements, with average relative gains of up to +28%, and can alter method rankings. We further analyze variability across datasets, random seeds, and hyperparameter configurations. Finally, we complement the benchmark study with an evaluation on the full UCR archive using localization criteria aligned with sliding-window reconstruction. Overall, our results highlight that reconstruction-based anomaly detection performance depends not only on model architecture and training, but also on inference choices, motivating a clear and reproducible protocol. Our results show that reconstruction-based baselines achieve strong performance on both TSB-AD and UCR benchmarks, supporting them as competitive and practical approaches for univariate time series anomaly detection.
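
The role of the inference stride can be sketched as follows. With a stride equal to the window length each point is scored by exactly one window; with a smaller stride each point is covered by several windows and the per-window errors have to be combined. The sketch averages squared errors over the covering windows, which is one plausible aggregation and an assumption here, as are the names `window_starts`, `pointwise_scores`, and the toy `mean_reconstructor` that stands in for a trained model.

```python
def window_starts(n, window, stride):
    """Start indices of the inference windows over a series of length n >= window."""
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:
        starts.append(n - window)  # extra window so the tail is always covered
    return starts


def pointwise_scores(series, window, stride, reconstruct):
    """Average squared reconstruction error per point over all covering windows.

    Returns (scores, counts), where counts[i] is the number of windows covering i.
    """
    n = len(series)
    total = [0.0] * n
    count = [0] * n
    for start in window_starts(n, window, stride):
        segment = series[start:start + window]
        rebuilt = reconstruct(segment)
        for k in range(window):
            total[start + k] += (segment[k] - rebuilt[k]) ** 2
            count[start + k] += 1
    return [t / c for t, c in zip(total, count)], count


def mean_reconstructor(segment):
    """Toy stand-in for a model: reconstructs every point as the window mean."""
    mean = sum(segment) / len(segment)
    return [mean] * len(segment)
```

Calling `pointwise_scores(series, window, window, mean_reconstructor)` gives the disjoint setting, and `stride=1` gives the fully overlapping one; only the inference procedure changes, not the model.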

2606.09872 2026-06-10 cs.LG cs.AI New submission

PatchSTG: Scalable Spatiotemporal Graph Transformers for Traffic Forecasting on Irregular Sensor Networks

Jichao Li, Xuanming Shi

Affiliations: Portsmouth Abbey School; CodingFuture (Shanghai) Education Technology Co., Ltd.

AI summary: Proposes PatchSTG, which partitions sensors into balanced local patches using geographic information and uses a dual attention encoder that alternates between local and global dependencies, reducing computational complexity from quadratic to near-linear and delivering efficient, stable traffic forecasting on irregular sensor networks.

Comments 22 pages, 12 figures

Abstract

Traffic forecasting is a fundamental component of intelligent transportation systems, yet remains challenging in real-world settings due to irregular sensor distributions and the high computational cost of modeling large-scale spatiotemporal dependencies. In practical traffic networks, sensors are unevenly distributed across regions, leading to non-uniform spatial structures that limit the effectiveness and scalability of existing graph-based and attention-based models. To address these challenges, we propose PatchSTG, a patch-based spatiotemporal graph Transformer designed for efficient forecasting on irregular sensor networks. The key idea is to introduce a hierarchical spatial representation that partitions sensors into balanced, locality-preserving patches based on geographic information. On top of this structure, a dual attention encoder alternates between intra-patch attention for capturing local interactions and inter-patch attention for modeling global dependencies, reducing computational complexity from quadratic to near-linear scaling. We evaluate PatchSTG on real-world traffic data from Rhode Island and additional large-scale datasets. Experimental results demonstrate that the proposed model achieves stable and competitive forecasting performance across multiple horizons, while significantly improving computational efficiency. Ablation studies further validate the effectiveness of spatial partitioning and dual attention in capturing both local and long-range traffic dynamics. These results suggest that patch-based spatiotemporal modeling provides a scalable and effective framework for traffic forecasting under irregular spatial settings.

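2606.09868 2026-06-10 cs.LG cs.AI New submission

SPACE: Source-free Proxy Anchor Concept Erasure for MLLMs

Zhijing Zhang, Jiaqi Ding, Qianshan Wei, Nan Zhou, Jiaqi Li, Yongliang Wu, Tongxin Zhu, Xiaolin Fang

Affiliations: University of Science and Technology of China

AI summary: Proposes SPACE, described as the first source-free machine unlearning framework for multimodal large language models, which erases concepts without access to the target data via text-guided proxy anchor selection and dual-constraint semantic isolation while preserving model performance.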

Abstract

As Multimodal Large Language Models (MLLMs) face growing privacy risks and regulatory constraints, machine unlearning (MU) has emerged as a crucial solution for removing sensitive data while preserving model performance. However, existing MU methods typically rely on visual data of the target concepts, which is often unavailable due to strict data retention policies, thus creating a demand for source-free unlearning approaches that operate without access to the target data. In this work, we propose Source-free Proxy Anchor Concept Erasure (SPACE), the first source-free unlearning framework specialized for MLLMs. SPACE consists of two stages: (1) Text-Guided Proxy Anchor Selection (TPAS), which retrieves semantically aligned proxy anchors from the shared feature space; and (2) Dual-Constraint Semantic Isolation (DCSI), which optimizes these anchors to indirectly erase target concepts. DCSI confines updates to the null space of retained knowledge, ensuring structural integrity. We theoretically prove that SPACE strictly bounds the perturbation on retained knowledge and maximizes feature spectral entropy, thereby maintaining the model's performance. Furthermore, extensive experiments across six datasets show that SPACE achieves performance comparable to that of state-of-the-art data-dependent methods, validating its effectiveness in source-free MU scenarios. The source code will be released.

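2606.09864 2026-06-10 cs.LG cs.AI cs.ET New submission

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

Bruce Changlong Xu, Adarsh Kumarappan, Mu Zhou

Affiliations: Stanford University; California Institute of Technology

AI summary: Finds that low-bit KV cache quantization can silently break LLM safety alignment because safety features occupy a low-dimensional activation subspace that is vulnerable to quantization noise; proposes the Per-Channel Reduction (PCR) diagnostic, which classifies three failure modes and guides mitigation, recovering up to 97% of lost alignment without training.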

Comments Preprint. 61 pages, 9 figures

Abstract

Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study, we explore alignment preservation under KV cache quantization. Across eleven instruction-tuned models (3.8B-72B) and five benchmarks (1,894 prompts), we find that low-bit quantization can silently destroy safety alignment: Mistral-7B loses 15.2% of its refusals at only 1.03x perplexity, and no universal safe bit-width exists, with sharp model-specific phase transitions invisible to standard metrics. We identify that the root cause is geometric: safety features occupy a low-dimensional activation subspace 10^2-10^3x more vulnerable to quantization noise than the full representation space perplexity averages over. Inspired by this observation, we propose Per-Channel Reduction (PCR), a diagnostic that classifies each model into one of three mechanistic failure modes: outlier-crushes-safety, where safety lives in non-outlier channels collaterally damaged by outlier-driven scale factors; outlier-as-safety, where safety overlaps outlier channels and finer granularity cannot rescue it; and multi-layer dilution, where safety is distributed across many layers and per-layer fixes fail. PCR predicts the correct mitigation direction on all nine primary models and one held-out model from an independent family using 20 calibration prompts. PCR generalizes across unseen prompts, models, and production quantizers, including KIVI with up to 97.2% recovery, succeeding where attention-based allocation methods fail. The resulting training-free protocol, requiring approximately 35 GPU-minutes, recovers up to 97% of lost alignment at minimal memory overhead, addressing vulnerabilities confirmed in production vLLM serving with FP8 KV cache on NVIDIA GPUs.
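
The "outlier-crushes-safety" mode described in this abstract, where small-magnitude channels are damaged by a scale factor set by an outlier channel, can be reproduced with a toy symmetric quantizer. This is an illustration of the mechanism only. It is not the PCR diagnostic, and the 4-bit setting, the two-channel toy matrix in the usage note, and the function names `quantize_dequantize`, `shared_scale`, and `per_channel_scale` are assumptions.

```python
def quantize_dequantize(values, scale, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out


def shared_scale(matrix, bits=4):
    """One scale for the whole matrix, set by its largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(v) for row in matrix for v in row) / qmax) or 1.0
    return [quantize_dequantize(row, scale, bits) for row in matrix]


def per_channel_scale(matrix, bits=4):
    """One scale per column (channel), set by that channel's largest magnitude."""
    qmax = 2 ** (bits - 1) - 1
    n_channels = len(matrix[0])
    scales = [
        (max(abs(row[c]) for row in matrix) / qmax) or 1.0
        for c in range(n_channels)
    ]
    return [
        [quantize_dequantize([row[c]], scales[c], bits)[0] for c in range(n_channels)]
        for row in matrix
    ]
```

For a toy matrix such as `[[100.0, 0.3], [-90.0, -0.2]]` (rows as tokens, columns as channels), the shared scale rounds the second channel to zero, while per-channel scales keep it close to its original values.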

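2606.09863 2026-06-10 cs.LG New submission

From Confident Closing to Silent Failure: Characterizing False Success in LLM Agents

Laksh Advani

Affiliations: Laksh Advani

AI summary: Studies false success, where LLM agents claim task completion although the environment state shows the task is unfinished; finds it common yet hard to detect, with lightweight TF-IDF detectors outperforming LLM judges.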

Comments Accepted to FAGEN@ICML2026

Abstract

LLM agents can fail silently by asserting task completion when the environment state shows otherwise. We study this failure mode, false success, across two agent benchmarks: 9,876 tau2-bench trajectories from 8 model families and 1,879 AppWorld trajectories from 4 model families with text-independent ground truth. False success is common but varies by setting: 45-48% of failures in single-control tau2-bench domains, 3% in dual-control telecom, and 75.8% among AppWorld self-assessing coding-agent trajectories with explicit status claims. LLM judges fail reliably: no configuration across 5 judges, 5 prompt strategies, and full task specifications exceeds AUROC 0.65 on tau2-bench, and the same judges reach only 0.54 AUROC on AppWorld API-call traces. Judges rely on surface completion proxies -- confident closing language in tau2-bench and coarse action-sequence volume in AppWorld -- rather than verified state changes. Lightweight TF-IDF detectors achieve task-disjoint AUROC 0.83 on tau2-bench and 0.95 on AppWorld, recovering 4-8x more false successes than the best judge at the same flag rate with 3,300x lower latency. These results suggest that production monitoring should use lightweight, domain-calibrated detectors as triage signals rather than relying on LLM judges as the primary monitor for false success.

2606.09859 2026-06-10 cs.LG cs.AI New submission

Mitigating Manifold Departure: Uncertainty-Aware Subspace Rectification for Trustworthy MLLM Decoding

Yingxuan Zhuang, Jingxiao Yang, Miao Pan, Cheng Tan, Yuxiang Cai, Siwei Tan, Chen Zhi, Xuhong Zhang, Jianwei Yin, Jintao Chen

Affiliations: Nanyang Technological University

AI summary: Proposes MGAP, which builds a language-prior subspace via SVD and adaptively attenuates the projected component, suppressing hallucinations while preserving semantic structure and outperforming existing decoding baselines.

Comments ICML 2026 regular

Abstract

MLLMs frequently hallucinate objects inconsistent with visual inputs. This issue is typically attributed to the over-reliance on language priors, which can override the visual context. Recent training-free decoding strategies address this by penalizing language priors. However, these methods overlook the dual nature of language priors, where they can be both helpful and harmful depending on the alignment with visual evidence. In particular, blindly suppressing language priors often disrupts the model's semantic manifold, leading to performance degradation, a phenomenon we term Manifold Departure. To address this, we propose Manifold-Guided Adaptive Projection (MGAP), a geometry-aware, training-free decoding method that mitigates hallucinations while preserving representation structure. MGAP first constructs a language-prior subspace from blind hidden states via SVD. During decoding, MGAP projects each multimodal hidden state onto this subspace and applies a consistency-aware gate to adaptively attenuate only the projected prior component, yielding a subspace-selective update that largely preserves the orthogonal semantic components. Extensive experiments on POPE and CHAIR show that MGAP outperforms prior decoding baselines, achieving stronger hallucination suppression without sacrificing coherence.
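
The subspace-selective update described in this abstract can be sketched geometrically: split a hidden state into its component inside the language-prior subspace and the orthogonal remainder, then shrink only the former. The sketch takes an orthonormal basis as given and skips the SVD step that would produce it; the scalar `gate` stands in for the consistency-aware gate, whose actual form the abstract does not specify. The names `dot`, `project`, and `attenuate_prior` are assumptions.

```python
def dot(a, b):
    """Inner product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def project(h, basis):
    """Orthogonal projection of h onto span(basis); basis must be orthonormal."""
    proj = [0.0] * len(h)
    for u in basis:
        coeff = dot(h, u)
        proj = [p + coeff * ui for p, ui in zip(proj, u)]
    return proj


def attenuate_prior(h, basis, gate):
    """Scale the in-subspace component of h by (1 - gate), with gate in [0, 1].

    The component orthogonal to the subspace is left unchanged.
    """
    prior = project(h, basis)
    return [hi - gate * pi for hi, pi in zip(h, prior)]
```

With `gate=0` the hidden state is returned unchanged, and with `gate=1` the prior component is removed entirely; intermediate values interpolate between the two.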

2606.09853 2026-06-10 cs.LG cs.IT math.IT New submission

SynIB: Informational Bottleneck for Maximizing Synergy in Multimodal Learning

Konstantinos Kontras, Teodora Gagaleska, Thomas Strypsteen, Christos Chatzichristos, Matthew Blaschko, Maarten De Vos, Paul Pu Liang

Affiliations: University of California, Berkeley; Google Research; University of Amsterdam

AI summary: Proposes SynIB, which uses information bottleneck theory to optimize multimodal synergy directly by penalizing high confidence when a single modality is masked during training, improving cross-modal reasoning with accuracy gains of up to 7.8% on synergy-dependent examples.

Abstract

A central objective in multimodal learning is to capture synergy: task-relevant information that arises only from the joint use of multiple modalities, and is not available from any single modality alone. While most approaches operate at the architectural level through larger or more complex fusion models, we propose a complementary axis: shaping the training objective itself. Standard training often emphasizes unimodal or redundant information, falling short on examples that require cross-modal reasoning. We formalize multimodal synergy through information theory and introduce the Synergistic Information Bottleneck (SynIB), a scalable objective that targets synergy directly. To prioritize learning synergy, SynIB motivates the model to predict accurately from all modalities while penalizing confidence when information from any modality is withheld. Alongside the standard task loss, the model runs forward passes with one modality masked at a time and is penalized for remaining confident, which would indicate reliance on unimodal cues rather than cross-modal interactions. We validate SynIB in two regimes. On synthetic XOR tasks where the ground-truth synergy is known by construction, standard training fails to recover it while SynIB does. On five real-world benchmarks, including three MultiBench affective tasks, Hateful Memes with CLIP-ViT and DeBERTa backbones, and a controllable irony extension of CREMA-D we introduce, SynIB improves accuracy on synergy-dependent examples by up to 7.8% and overall accuracy by up to 3.8%.
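
The training signal described in this abstract, a task loss on the full input plus a penalty for staying confident when one modality is masked, can be sketched as below. The exact SynIB objective is not given in the abstract, so the penalty used here (the gap between the maximum entropy log K and the entropy of each masked-input prediction, which is zero for a uniform prediction) is purely an assumed stand-in, as are the weight `lam` and the function names.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under a probability vector."""
    return -math.log(probs[label])


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def synergy_loss(p_full, p_masked_list, label, lam=1.0):
    """Task loss on the full input plus a confidence penalty on masked inputs.

    p_full: class probabilities from all modalities.
    p_masked_list: one probability vector per forward pass with a modality masked.
    """
    k = len(p_full)
    penalty = sum(math.log(k) - entropy(p) for p in p_masked_list)
    return cross_entropy(p_full, label) + lam * penalty
```

Under this assumed form, a model whose masked-input predictions are uniform pays only the task loss, while confident masked-input predictions increase the loss in proportion to `lam`.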

2606.10756 2026-06-10 cs.CV physics.med-ph New submission

DD-INR: Dynamics-Driven Implicit Neural Representation for Accelerated Whole-Brain Functional MRI Reconstruction

Qiaoxin Li, Caini Pan, Pierre-Antoine Comby, Chaithya Giliyar, Philippe Ciuciu

Affiliations: MIND, Inria, Palaiseau, France; Neurospin, CEA Paris Saclay, France; CEA NeuroSpin, Paris-Saclay University, CNRS BAOBAB, Gif-sur-Yvette, France

AI summary: Proposes DD-INR, which splits fMRI data into a static background and a dynamic component and models the dynamics with an implicit neural representation, enabling accelerated fMRI reconstruction with better image quality and recovery of activation patterns.

Journal ref MICCAI 2026 - 29th International Conference on Medical Image Computing and Computer Assisted Intervention, Sep 2026, Strasbourg, France

Abstract

Accelerated acquisition of fMRI enables enhanced detection of neurovascular (BOLD) activity in the brain, but image reconstruction becomes challenging with high k-space undersampling: task-evoked BOLD signals are small in magnitude, which traditional anatomical MRI reconstruction methods fail to recover, as they favor spatial accuracy over temporal fidelity. We present DD-INR, a Dynamics-Driven Implicit Neural Representation framework tailored for accelerated fMRI that benefits from incoherent time-varying sampling and a tailored spatiotemporal prior, outperforming traditional methods, demonstrated in simulation and in-vivo acquisition, both in terms of image quality and retrieval of activation patterns. DD-INR achieves this by splitting the fMRI data into a static background and a temporally varying dynamic component, representing only the dynamics with a dedicated INR, thereby focusing the model's capacity on activation-relevant changes while remaining compact. In general, DD-INR provides a promising framework for accelerated fMRI reconstruction, with the potential to improve the sensitivity and robustness of fMRI studies within practical scan time limits. The source code is available at https://github.com/JoosenLi/DD-INR.

2606.10735 2026-06-10 cs.CV physics.med-ph New submission

Patient-Level Diagnosis of Acute Myeloid Leukemia via Deep Learning Analysis of Bone Marrow Smear

Yuqi Ma, Tianyi Wang, Weihua Meng, Hongru Chen, Fajin Tao, Qunxian Lu, Lin An, Xiaodong Mo, Gen Yang

Affiliations: State Key Laboratory of Nuclear Physics and Technology, School of Physics, Peking University; Peking University People’s Hospital, Peking University Institute of Hematology, National Clinical Research Center for Hematologic Disease, Beijing Key Laboratory of Hematopoietic Stem Cell Transplantation; Shanghai Dishuo Beiken Biotechnology Co., Ltd.

AI summary: Proposes a cell-to-patient deep learning pipeline that detects cells with YOLO, classifies Composite Blast-like Cells (CBLC) with EfficientNet-B0, and aggregates cell-level predictions into patient-level CBLC ratios for AML-assisted diagnosis, reaching weighted F1-scores of up to 0.91 on external validation centers.

Comments 4 figures

Abstract

Bone marrow smear review remains important for acute myeloid leukemia (AML) assessment, but manual single-cell interpretation is labor-intensive and patient-level diagnosis requires aggregation of many cellular observations. We present a cell-to-patient deep learning pipeline for AML-assisted diagnosis from bone marrow smear images. The study included 258 patients from six anonymized centers, including a main cohort of 169 patients from Centers 1-3 and an external validation cohort of 89 patients from Centers 4-6. A 16-category cell annotation vocabulary was used to describe the global cellular composition, including granulocytic, monocytic, erythroid, lymphoid, eosinophilic, and other cells. Rather than identifying strict AML blasts or leukemic blasts, the model targets an expert-defined composite category termed Composite Blast-like Cells (CBLC), comprising N, N1, M, M1, R, R1, J, and J1 according to the project-wide morphological standard. A fixed YOLO-based segmentation module detected cells, predicted contours were matched to expert polygon annotations by contour IoU, and standardized single-cell crops were generated. An EfficientNet-B0 classifier was trained through a two-stage GT-to-YOLO and YOLO-to-YOLO strategy with class-imbalance correction, center-border regularization, and morphology-assisted supervision. Cell-level predictions were aggregated into patient-level CBLC ratios for AML-oriented diagnostic support. The pipeline achieved stable internal validation and maintained external generalization, with ensemble weighted F1-scores of 0.9076, 0.8696, and 0.9124 on Centers 4, 5, and 6, respectively.
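
The final aggregation step in this abstract, turning cell-level predictions into a patient-level CBLC ratio, can be sketched in a few lines. The eight CBLC category codes come from the abstract; the use of all classified cells as the denominator, the placeholder label `"other"` in the usage note, and the function names `cblc_ratio` and `patient_ratios` are assumptions. No diagnostic threshold is shown because the abstract does not give one.

```python
# Category codes that make up the composite CBLC class, as listed in the abstract.
CBLC_LABELS = {"N", "N1", "M", "M1", "R", "R1", "J", "J1"}


def cblc_ratio(cell_labels):
    """Fraction of a patient's classified cells that fall in the CBLC class."""
    if not cell_labels:
        raise ValueError("patient has no classified cells")
    hits = sum(1 for label in cell_labels if label in CBLC_LABELS)
    return hits / len(cell_labels)


def patient_ratios(cells_by_patient):
    """Map each patient ID to its CBLC ratio.

    cells_by_patient: dict of patient ID -> list of predicted cell labels.
    """
    return {pid: cblc_ratio(labels) for pid, labels in cells_by_patient.items()}
```

For example, `patient_ratios({"p1": ["N", "M1", "other", "other"]})` gives a ratio of 0.5 for the hypothetical patient `p1`.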

2606.10294 2026-06-10 cs.LG cs.AI cs.AR cs.NE physics.comp-ph New submission

LLM-Guided Neural Architecture Search for Robust Co-Design of Physical Neural Networks

Tyler King, Timothee Leleu

Affiliations: University of California, Berkeley; DeepMind

AI summary: Proposes UH-NAS, which uses large language models as evolutionary operators to co-optimize task accuracy and inference energy, enables fair comparisons across hardware platforms, and discovers more robust architectures on optical MZI hardware.

英文摘要

Deploying neural networks on unconventional hardware demands architectures that co-optimize task accuracy and platform-specific constraints such as energy cost, physical non-idealities, and numerical precision. Existing neural architecture search (NAS) methods are typically tailored to a single hardware family, limiting cross-platform comparison and generalization. We introduce Unconventional Hardware Neural Architecture Search (UH-NAS), a hardware-agnostic, LLM-guided NAS framework that integrates language models as evolutionary operators to co-optimize accuracy and inference energy. By exposing hardware as a swappable backend with per-platform energy models, physical constraints, and non-ideality simulators, UH-NAS enables fair system-level comparisons across various backends without modifying the search algorithm. Tested on optical MZI hardware, UH-NAS discovers more diverse, robust architectures than conventional baselines while outperforming existing LLM-to-NAS approaches. Additional ablations on architecture robustness under non-idealities and the role of system prompts highlight the importance of architecture-hardware co-design for emerging computing platforms.

2606.06709 2026-06-10 cs.CV 新提交

USU-Corn-WeedDB: A UAV RGB Image Dataset for Multi-Species Weed Detection in Forage Corn

USU-Corn-WeedDB:用于饲料玉米多物种杂草检测的无人机RGB图像数据集

Utsav Bhandari, Saroj Burlakoti, Rhonda Miller, Sierra Young, Eric Westra, Aaron Etienne

发表机构 * Department of Applied Sciences, Technology, and Education, Utah State University(应用科学、技术和教育系,犹他州立大学) Department of Plants, Soils & Climate, Utah State University(植物、土壤与气候系,犹他州立大学) Department of Civil and Environmental Engineering, Utah State University(土木与环境工程系,犹他州立大学)

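AI summary: To address the scarcity of weed-detection datasets for forage corn, the authors build USU-Corn-WeedDB, a UAV RGB image dataset with 10,539 annotated instances of three weed species and 8,000 unlabeled tiles, and benchmark several YOLO and RT-DETR detectors on it.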

Comments 8 pages, 4 figures, 1 table

详情
AI中文摘要

饲料玉米生产中的杂草压力导致产量损失高达31.5%,然而基于无人机图像和深度学习的特定地点杂草管理(SSWM)系统仍受限于缺乏田间代表性训练数据集。我们提出了USU-Corn-WeedDB,这是一个公开可用的无人机RGB图像数据集,采集自犹他州Cache Valley的一个商业饲料玉米田,旨在支持有监督和半监督学习框架下的多类别杂草检测。RGB图像于2025年6月27日使用Autel EVO II Dual 640T V2无人机在距地面约10米高度采集,地面采样距离约为0.48厘米/像素。总共366张全分辨率图像被切分为8800个640×640像素的图块。其中,800张图像被手动标注了三种杂草:藜(Chenopodium album)、反枝苋(Amaranthus retroflexus)和狗尾草(Setaria viridis),共包含10539个边界框实例,其余8000个图块作为未标注池用于半监督实验。该数据集反映了自然的类别不平衡,其中反枝苋占标注实例的53.86%,这是有意保留以模拟真实田间条件。为验证数据集实用性,我们在相同条件下训练了28个目标检测模型,涵盖YOLOv8、YOLOv9、YOLOv10、YOLO11、YOLO26和RT-DETR五个架构家族,未进行超参数调优。测试集mAP@0.5范围为0.773至0.840,轻量级模型取得了与边缘部署无人机系统相关的竞争性能。USU-Corn-WeedDB公开于https://doi.org/10.5281/zenodo.20044178。

英文摘要

Weed pressure in forage corn production causes yield losses of up to 31.5%, yet site-specific weed management (SSWM) systems built on UAV imagery and deep learning remain constrained by the scarcity of field-representative training datasets. We present USU-Corn-WeedDB, a publicly available UAV RGB image dataset collected from a commercial forage corn field in Cache Valley, Utah, designed to support multi-class weed detection under both supervised and semi-supervised learning frameworks. RGB imagery was acquired on 27 June 2025 using an Autel EVO II Dual 640T V2 drone at ~10 m above ground level, yielding a ground sampling distance of approximately 0.48 cm/pixel. A total of 366 full-resolution images were tiled into 8,800 patches at 640 x 640-pixel resolution. Of these, 800 tiles were manually annotated for three weed species: common lambsquarters (Chenopodium album), redroot pigweed (Amaranthus retroflexus), and green foxtail (Setaria viridis), comprising 10,539 bounding-box instances, with the remaining 8,000 tiles retained as an unlabeled pool for semi-supervised experiments. This dataset reflects a natural class imbalance where redroot pigweed constitutes 53.86% of annotated instances, which was preserved intentionally to mirror real field conditions. To validate dataset utility, we trained 28 object detection models spanning five architecture families including YOLOv8, YOLOv9, YOLOv10, YOLO11, YOLO26, and RT-DETR under identical conditions without hyperparameter tuning. Test set mAP@0.5 ranged from 0.773 to 0.840, with lightweight models achieving competitive performance relevant to edge-deployed UAV systems. USU-Corn-WeedDB is publicly available at https://doi.org/10.5281/zenodo.20044178.

2606.09635 2026-06-10 cs.CL cs.LG 交叉投稿

Gradient-Guided Reward Optimization for Inference-time Alignment

梯度引导的推理时对齐奖励优化

Hankun Lin, Ruqi Zhang

发表机构 * Purdue University(普渡大学)

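AI summary: Proposes Gradient-Guided Reward Optimization (GGRO), which injects nudging tokens generated from reward-model gradient signals during decoding to steer the generation trajectory at inference time, improving safety, helpfulness, and reasoning performance as well as robustness to reward hacking.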

Comments Accepted to UAI 2026

详情
AI中文摘要

确保大型语言模型(LLMs)在分布漂移下的可靠性需要推理时自适应。虽然推理时对齐方法如Best-of-$N$和拒绝采样被广泛使用,但它们将任务视为采样密集的奖励引导搜索,导致两个关键限制:性能受限于基础模型的生成质量,以及对不完美奖励模型的依赖使其易受奖励攻击。为解决这些挑战,我们引入梯度引导奖励优化(GGRO),一种轻量级推理时方法,通过梯度引导在解码期间执行有针对性的最小干预。具体来说,GGRO监测令牌级熵以识别指示漂移或未对齐的高不确定性区域。一旦检测到,它通过注入使用现成奖励模型的梯度信号生成的引导令牌来响应,以引导生成轨迹而不仅仅是重新排序样本。实验表明,GGRO在安全性、有用性和推理基准上持续改进推理时对齐。它还提高了高质量响应的覆盖率和对奖励攻击的鲁棒性,且计算开销极小。代码可在https://github.com/lhk2004/GGRO获取。

英文摘要

Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.
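
The entropy-monitoring step described above can be sketched as follows. This is a generic illustration of token-level entropy used as an uncertainty trigger, not the GGRO implementation: the function names, the example logits, and the threshold value are all hypothetical, and the gradient-based generation of nudging tokens is not shown.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one step's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_uncertainty_steps(logits_per_step, threshold):
    """Indices of decoding steps whose entropy exceeds the (assumed) threshold."""
    return [
        i for i, logits in enumerate(logits_per_step)
        if token_entropy(logits) > threshold
    ]


# Hypothetical logits over a 4-token vocabulary at three decoding steps.
steps = [
    [10.0, 0.0, 0.0, 0.0],  # sharply peaked: low entropy
    [1.0, 1.0, 1.0, 1.0],   # uniform: entropy equals ln(4)
    [5.0, 5.0, 0.0, 0.0],   # two likely tokens: moderate entropy
]
flagged = high_uncertainty_steps(steps, threshold=1.0)
```

In this toy input only the uniform step crosses the threshold; a method like the one described would intervene at such positions and leave confident steps untouched.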

2606.09553 2026-06-10 cs.CL cs.SD 交叉投稿

OpenBibleTTS: Large-Scale Speech Resources and TTS Models for Low-Resource Languages

OpenBibleTTS:面向低资源语言的大规模语音资源与TTS模型

David Guzmán, Luel Hagos Beyene, Jesujoba Oluwadara Alabi, Yejin Jeon, Dietrich Klakow, David Ifeoluwa Adelani

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(米拉-魁北克人工智能研究所) AIMS Research and Innovation Centre(AIMS研究与创新中心) NM-AIST(纳尔逊·曼德拉非洲科学技术学院) Saarland University(萨尔大学) Canada CIFAR AI Chair(加拿大CIFAR人工智能教席)

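AI summary: To address the lack of TTS research on low-resource languages, introduces OpenBibleTTS, a benchmark covering 37 languages, systematically compares multiple TTS architectures, finds that no single system dominates, and open-sources the datasets and models.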

详情
AI中文摘要

神经文本转语音(TTS)和多语言语音生成的最新进展显著提升了合成语音质量,但这些进步在全球语言中分布不均。现有模型仍由少数高资源语言主导,而许多低资源TTS研究是在人工降采样的高资源语料库上模拟的,未能反映真正低资源环境中的正字法变化和有限的音系覆盖。为此,我们引入OpenBibleTTS,这是一个涵盖37种低资源语言的大规模低资源语音合成基准。此外,我们对各种TTS架构和大规模语音生成模型在领域内圣经文本和领域外材料上进行了系统比较。结果表明,没有单一系统在所有语言和指标上占优:Gemini-TTS在大多数评估语言上获得最高听众评分,但在OpenBibleTTS上训练的单一语言EveryVoice模型在可懂度上仍然最强,并在几种非洲语言中更受青睐,而从头训练的开放系统在领域外文本上性能急剧下降,揭示了广泛多语言覆盖与可靠合成质量之间在服务不足的语言社区中持续存在的差距。我们用主观人类判断补充自动评估,并开源所有处理后的数据集、对齐和训练模型,以支持未来的低资源TTS研究。

英文摘要

Recent advances in neural text-to-speech (TTS) and multilingual speech generation have substantially improved synthetic speech quality, yet these gains remain unevenly distributed across the world's languages. Existing models are still dominated by a small set of high-resource languages, while many studies of low-resource TTS are simulated on artificially downsampled high-resource corpora that do not reflect the orthographic variation and limited phonetic coverage encountered in genuinely underrepresented settings. As such, we introduce OpenBibleTTS, which is a large-scale benchmark for low-resource speech synthesis spanning 37 underrepresented languages. Moreover, a systematic comparison of various TTS architectures and large-scale speech generation models is conducted across in-domain Biblical text and out-of-domain material. Results show that no single system dominates across languages and metrics: Gemini-TTS achieves the highest listener ratings on most evaluated languages, but monolingual EveryVoice models trained on OpenBibleTTS remain strongest for intelligibility and are preferred in several African languages, while open from-scratch systems degrade sharply on out-of-domain text, revealing a persistent gap between broad multilingual coverage and reliable synthesis quality in underserved linguistic communities. We complement automatic evaluation with subjective human judgments, and open-source all processed datasets, alignments, and trained models to support future low-resource TTS research.

2606.06037 2026-06-10 cs.SD cs.CL eess.AS 交叉投稿

SpeechJBB: Probing Safety Alignment and Comprehension in Large Audio Language Models under Code-Switched Speech

SpeechJBB:探究大型音频语言模型在语码转换语音下的安全对齐与理解

Virginia Ceccatelli, Yejin Jeon, David Ifeoluwa Adelani

发表机构 * Mila - Quebec AI Institute(米拉-魁北克人工智能研究所) McGill University(麦吉尔大学) Canada CIFAR AI Chair(加拿大CIFAR人工智能教席)

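AI summary: Introduces the SpeechJBB dataset, which uses code-switched harmful audio and pseudo-word insertion to expose safety vulnerabilities of large audio language models in multilingual and spoken settings.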

详情
AI中文摘要

大型音频语言模型(LALMs)越来越多地部署在现实应用中,但其安全对齐仍主要在单语、基于文本的有害提示上进行评估。这导致其在多语言和口语设置,特别是语码转换语音下的泛化能力很大程度上未被探索。为填补这一空白,我们引入了SpeechJBB,一个用于对多种最先进LALMs进行基准测试的音频越狱数据集。通过引入一种增强设置,即在安全关键术语周围插入音位学上合理的伪词以模拟局部混淆,进一步探测了安全弱点的程度。跨模型而言,语码转换的有害音频产生了显著高的越狱成功率(JSR),其中非英语单语和非英语语码转换对表现出最高的攻击成功率。伪词插入进一步降低了拒绝率,表明听起来自然的混淆可以有效绕过安全策略。

英文摘要

Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.

2605.29662 2026-06-10 cs.CV cs.RO 交叉投稿

SAFE-Pruner: Semantic Attention-Guided Future-Aware Token Pruning for Efficient Vision-Language-Action Manipulation

SAFE-Pruner: 语义注意力引导的未来感知令牌剪枝用于高效视觉-语言-动作操控

Shilin Ma, Chubin Zhang, Changyuan Wang, Yuji Wang, Yue Wu, Zixuan Wang, Jingqi Tian, Zheng Zhu, Yansong Tang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学)

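AI summary: Existing pruning methods for accelerating vision-language-action inference can discard visual information needed by deep layers. SAFE-Pruner uses future-layer attention cues and semantic attention consistency for forward-looking token pruning, reaching up to 1.89x speedup with less than a 1.7% drop in success rate in simulation and real-world experiments.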

详情
AI中文摘要

视觉-语言-动作(VLA)模型的实时推理对于机器人控制至关重要。虽然视觉令牌剪枝在加速推理方面显示出巨大潜力,但现有方法主要基于浅层线索进行剪枝决策,并存在丢弃深层所需视觉信息的风险。为解决此问题,我们提出SAFE-Pruner,一种即插即用的剪枝框架,将未来层的注意力线索融入剪枝决策。具体而言,我们识别出语义注意力一致性,即VLA模型在执行步骤中倾向于将其注意力概率质量集中在同一语义实体上。基于这一观察,我们设计了一种前瞻性策略来预测深层令牌的显著性,从而防止关键令牌过早移除并实现更稳定的加速。我们进一步引入自适应子任务划分策略来检测注意力突变,从而提高预测准确性和剪枝可靠性。在仿真和真实环境中的大量实验表明,我们的方法实现了高达1.89倍的加速,成功率下降最小(低于1.7%),同时比最先进的方法高出1.9%。

英文摘要

Real-time inference of vision-language-action (VLA) models is essential for robotic control. While visual token pruning has shown strong potential for accelerating inference, most existing methods mainly base pruning decisions on shallow-layer cues and risk discarding visual information required by deep layers. To address this issue, we propose SAFE-Pruner, a plug-and-play pruning framework that incorporates attention cues of future layers into pruning decisions. Specifically, we identify semantic attention consistency, the tendency that VLA models concentrate their attention probability mass on the same semantic entity across execution steps. Based on this observation, we design a forward-looking strategy to forecast the token saliency in deep layers, which prevents the premature removal of critical tokens and leads to more stable acceleration. We further introduce an adaptive subtask division strategy to detect abrupt attention shifts, thereby improving forecasting accuracy and pruning reliability. Extensive experiments in simulation and real-world settings demonstrate that our method achieves up to 1.89x speedup with a minimal degradation in success rate of less than 1.7%, while outperforming state-of-the-art methods by up to 1.9%.

2606.09809 2026-06-10 cs.AI 版本更新

Evaluation Cards: An Interpretive Layer for AI Evaluation Reporting

评估卡:AI评估报告的解释层

Avijit Ghosh, Anka Reuel, Jenny Chim, Wm. Matthew Kennedy, Srishti Yadav, Jennifer Mickel, Yanan Long, Andrew Tran, Anastassia Kornilova, Damian Stachura, Kevin Klyman, Felix Friedrich, Jeba Sania, Jan Batzner, Anoop Mishra, Eliya Habba, Yixiong Hao, Nathan Heath, Shalaleh Rismani, Usman Gohar, Andrea Loehr, David Manheim, Ruchira Dhar, Sree Harsha Nelaturu, Aarush Sinha, Leshem Choshen, Drishti Sharma, Ishan Khire, Amit Saha, Subramanyam Sahoo, Michael Hardy, Michael Alexander Riegler, Kabir Manghnani, Michelle Lin, Yanan Jiang, Yilin Huang, Asaf Yehudai, Jessica Ji, Aris Hofmann, Mubashara Akhtar, Max Lamparth, Nuno Moniz, Yacine Jernite, Stella Biderman, Zeerak Talat, Sanmi Koyejo, Mykel Kochenderfer, Irene Solaiman

发表机构 * Hugging Face Stanford University(斯坦福大学) Queen Mary University of London(伦敦玛丽女王大学) University of Copenhagen(哥本哈根大学) Trustible EleutherAI TU Darmstadt(达姆施塔特工业大学) Weizenbaum Institute & Technical University of Munich(魏森鲍姆研究所与慕尼黑工业大学) Harvard University(哈佛大学) The Hebrew University of Jerusalem(耶路撒冷希伯来大学) Iowa State University(爱荷华州立大学) IBM Research(IBM研究院) University of Chicago(芝加哥大学) Independent(独立) Berkeley AI Safety Institute (BASIS)(伯克利人工智能安全研究所) Simula University of Edinburgh(爱丁堡大学) ETH Zurich & ETH AI Center(苏黎世联邦理工学院与ETH AI中心) Oxford Internet Institute(牛津互联网研究所) Amherst College(阿默斯特学院) University of Nebraska(内布拉斯加大学) Syntony Research McGill University(麦吉尔大学) Evals Consensus Israel Institute of Technology(以色列理工学院) IOL.Learn & Zuse Institute Berlin(IOL.Learn与柏林祖泽研究所) Georgia Institute of Technology(佐治亚理工学院) Quebec AI Institute, Université de Montréal(魁北克人工智能研究所,蒙特利尔大学) University of Notre Dame(圣母大学) Georgetown University(乔治城大学) DHBW Stuttgart(斯图加特双元制大学) Massachusetts Institute of Technology(麻省理工学院)

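AI summary: To address inconsistent reporting of AI evaluations, proposes EvalCards as a unified record layer with a structured schema, four interpretive signals, and a monitoring tool covering 5,816 models and 635 benchmarks, surfacing systematic gaps in reporting practice.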

详情
AI中文摘要

AI评估结果大规模产生,但在排行榜、模型卡、基准论文和公司博客中的报告不一致。代价是解释性的:读者无法可靠地比较不同来源的结果,识别报告遗漏的内容,或将聚合声明追溯到其基础证据。最近的努力解决了孤立组件,但留下了三个空白:它们只覆盖了评估生命周期的狭窄片段,并且不能组合成单个可解释的记录;它们指定了静态表示,无法区分不同利益相关者对同一证据提出的问题;它们仍然是纸面上的提案,缺乏大规模采用所需的提取基础设施。我们提出EvalCards,一个可操作的报告层,将基准元数据、评估运行数据和模型元数据组合成统一记录。我们(1)从52篇论文和10次利益相关者访谈的结构化审查中推导出报告模式,(2)实现四种解释信号(可重复性、文档完整性、来源和风险、以及分数可比性),通过针对研究和非研究受众校准的读者模式呈现,以及(3)部署一个监控工具,将EvalCards应用于5816个模型、635个基准和101843个结果,揭示当前报告实践中的系统性差距。

英文摘要

AI evaluation results are produced at scale but reported inconsistently across leaderboards, model cards, benchmark papers, and company blogs. The cost is interpretive: readers cannot reliably compare results across sources, identify what a report omits, or trace an aggregate claim to its underlying evidence. Recent efforts address isolated components but leave three gaps: they cover only narrow slices of the evaluation lifecycle and do not compose into a single interpretable record; they specify static representations that do not differentiate the questions different stakeholders bring to the same evidence; and they remain proposals on paper, lacking the extraction infrastructure required for adoption at scale. We present EvalCards, an operational reporting layer that composes benchmark metadata, evaluation run data, and model metadata into a unified record. We (1) derive a reporting schema from a structured review of 52 papers and 10 stakeholder interviews, (2) implement four interpretive signals (reproducibility, documentation completeness, provenance and risk, and score comparability), rendered through reader modes calibrated to research and non-research audiences, and (3) deploy a monitoring tool that applies EvalCards across 5,816 models, 635 benchmarks, and 101,843 results, surfacing systematic gaps in current reporting practice.

2606.09681 2026-06-10 cs.CV 版本更新

GenEyePose: Patient-Free, Knowledge-Based Saccadic Eye Movement Modeling for Digital Neurophysiologic Biomarker Development

GenEyePose:用于数字神经生理学生物标志物开发的无患者、基于知识的扫视眼动建模

Tianyu Lin, Jooyoung Ryu, Puvada Sreevarsha, Rahul Srinivasaragavan, Riya Satavlekar, Susan Kim, Nidhi Soley, Yujie Yan, Ishan Vatsaraj, Carl Harris, Aimon Rahman, Vishal Patel, Joseph Greenstein, Casey Taylor, Kemar E. Green

发表机构 * Whiting School of Engineering, Johns Hopkins University(约翰霍普金斯大学惠廷工程学院) Department of Neurology, Johns Hopkins Medicine(约翰霍普金斯医学院神经内科)

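AI summary: Proposes the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis; a deep learning classifier trained on the synthetic data distinguishes normal from abnormal saccadic accuracy on real clinical data with an AUROC of 0.76.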

详情
AI中文摘要

眼动(包括扫视)被广泛认为是神经生理状态的高度敏感和客观生物标志物。检测神经系统疾病中的扫视特征提供了一种快速、便携的脑成像替代方案,避免了获取和成本障碍。目前,由于隐私问题和数据集稀缺,缺乏稳健的AI视频眼动图解决方案(例如数字生物标志物)用于筛查、分诊或定位脑异常。在这项工作中,我们提出了第一个完全合成、无患者的多模态眼动生成流水线,用于泛化扫视分析。使用该合成数据集,我们训练了一个深度学习分类器,以区分正常和异常(辨距不足(hypometria)和辨距过度(hypermetria))扫视精度,并在真实临床数据上评估其性能。该模型实现了0.76的AUROC和0.71的灵敏度,表明合成数据在临床应用中具有强大的泛化潜力,包括作为家庭和急诊室环境中的筛查工具或精确神经解剖定位工具。

英文摘要

Eye movements, including saccades, are widely regarded as highly sensitive and objective biomarkers of neurophysiologic states. Detecting saccadic signatures in neurologic diseases offers a rapid, portable alternative to brain imaging, avoiding access and cost barriers. Currently, there are no robust AI-enabled video-oculographic solutions (e.g., digital biomarkers) for screening, triaging, or localizing brain abnormalities due to privacy issues and scarce datasets. In this work, we propose the first fully synthetic, patient-free, multimodal eye movement generation pipeline for generalizable saccade analysis. Using this synthetic dataset, we trained a deep learning classifier to distinguish between normal and abnormal (hypometria and hypermetria) saccadic accuracies and evaluated its performance on real-world clinical data. The model achieved an AUROC of 0.76 and a sensitivity of 0.71, showing that the synthetic data has strong potential to generalize for clinical applications, including as a screening tool in at-home and emergency room settings or a tool for precise neuroanatomic localization.

2606.09601 2026-06-10 cs.LG 版本更新

Assessing Sample Quality in Conditional Generation under Compositional Shift

在组合偏移下评估条件生成中的样本质量

Berker Demirel, Valentino Maiorca, Marco Fumero, Theofanis Karaletsos, Francesco Locatello

发表机构 * Institute of Science and Technology Austria (ISTA)(奥地利科学技术学院) Pyramidal Inc(Pyramidal公司) Achira Inc(Achira公司)

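AI summary: For evaluating conditional generation under compositional shift, proposes a post-hoc trust score that uses only the training distribution and combines global realism with attribute-wise faithfulness, enabling filtering, ranking, and abstention of samples to improve generation quality.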

详情
AI中文摘要

条件生成器为可控生成提供了自然工具,包括所需条件是观测属性或实验因素的新组合的场景。在许多应用中,尤其是在科学领域,此类模型对于探索真实样本稀少、昂贵或尚未观测到的条件具有吸引力。然而,这给评估带来了循环问题:标准条件质量指标需要参考目标分布,但在外推场景下该分布根据定义不可用。我们通过一个基于训练分布的后验、每样本信任分数来解决这个问题,该分数用于评估条件样本。该分数结合了两个可估计的量:全局真实性(衡量与真实数据流形的兼容性)和属性忠实度(衡量样本是否更接近请求的属性而非合理的替代属性)。我们证明,在观测属性的温和覆盖条件下,该分数可以恢复跨外推生成的有意义比较。这些比较能够实现生成的有效过滤、排序和弃权,并可直接用于现成的预训练模型。在生物成像中,选定的样本更好地保留了真实的形态结构,并提高了下游预测性能,而在受控视觉基准上也观察到类似的增益。最后,我们展示了该分数如何在生成过程中应用,从而在完全解码之前实现弃权。代码可在 https://github.com/berkerdemirel/faithful-cond-gen 获取。

英文摘要

Conditional generators provide a natural tool for controllable generation, including settings where the desired condition is a new composition of observed attributes or experimental factors. In many applications, especially in scientific domains, such models are attractive to explore conditions for which real samples are rare, expensive, or not yet observed. However, this creates a circularity for evaluation: standard conditional quality metrics require a reference target distribution, but in the extrapolative regime that distribution is unavailable by definition. We address this problem with a post-hoc, per-sample trust score for assessing conditional samples using only the training distribution. The score combines two estimable quantities: global realism, measuring compatibility with the real data manifold, and attribute-wise faithfulness, measuring whether a sample is closer to the requested attributes than to plausible alternatives. We show that the score can recover meaningful comparisons across extrapolated generations, under a mild coverage condition on the observed attributes. These comparisons enable effective filtering, ranking, and abstention of generations and can be used directly on off-the-shelf pretrained models. In biological imaging, selected samples preserve real morphological structure better and improve downstream predictive performance, while similar gains are observed on controlled vision benchmarks. Finally, we show how the score can be applied during generation, enabling abstention before full decoding. Code is available at https://github.com/berkerdemirel/faithful-cond-gen.

2606.09570 2026-06-10 cs.CL cs.HC 版本更新

UXBench: Benchmarking User Experience in AI Assistants

UXBench:AI助手中的用户体验基准测试

Mengze Hong, Xia Zeng, Zeyang Lei, Sheng Wang, Chen Jason Zhang, Di Jiang, Taiming Fu, Jinfeng Huang, Mengqiao Liu, Qinghe Chang, Haosheng Zou, Qiongyi Zhou, Sijun He, Simonjmdeng, Haojing Huang, Zijian Li, Lucas Mu Li, Fubao Zhang, Mona Zhou, Wei Ma, Chenxuan Ma, Yuanmeng Zhang, Jian Song, Minlong Peng, Di Liang, Davey Chen

发表机构 * Hong Kong Polytechnic University(香港理工大学) Tencent(腾讯)

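AI summary: Presents UXBench, the first user-centric benchmark grounded in real user feedback, with three tasks and 7,400 test instances; evaluating 26 frontier language models shows that user feedback prediction is a learnable capability and documents systematic biases of LLM-as-a-judge protocols.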

详情
AI中文摘要

随着AI助手每天服务数百万用户,评估超越一般模型能力的用户体验(UX)变得越来越重要。我们提出了UXBench,这是第一个基于真实用户反馈信号、用于评估偏好对齐和对话生成的用户中心基准。该基准由三个相互关联的任务组成:UX Judge、UX Eval和UX Recovery,包含从主流中文AI助手的超过7万条交互日志中提取的7400个测试实例。数据集紧密反映真实用户分布,涵盖8个场景、83个领域以及多种带来严峻挑战的失败模式。对26个前沿语言模型的大量实验提供了关于模型如何感知用户体验以及模型能力提升如何促进更好对话参与的新见解。通过对模型行为和性能差距的全面分析,我们表明用户反馈预测是一种可学习的能力,其中从野外反馈信号训练出的奖励模型可以实现良好校准的准确性。我们进一步记录了LLM作为评判者评估协议的系统性偏差,并比较了直接影响用户体验的典型响应策略。UXBench建立了一个新的评估格局,并呼吁更多关注定制的用户体验优化,为塑造AI助手成功的用户中心缩放定律做出贡献。

英文摘要

As AI assistants serve millions of users daily, evaluating user experience (UX) beyond general model capability has become increasingly important. We present UXBench, the first user-centric benchmark grounded in real user feedback signals for evaluating preference alignment and dialogue generation. The benchmark consists of three interconnected tasks, UX Judge, UX Eval, and UX Recovery, with 7,400 test instances extracted from over 70K interaction logs of a mainstream Chinese AI assistant. The dataset closely reflects real user distributions, covering 8 scenarios, 83 domains, and diverse failure patterns that pose severe challenges. Extensive experiments on 26 frontier language models provide novel insights into how well models perceive user experience and how improvements in model capability contribute to better dialogue engagement. Through comprehensive analysis of model behavior and performance gaps, we show that user feedback prediction is a learnable capability, where a reward model trained from in-the-wild feedback signals can achieve well-calibrated accuracy. We further document the systematic biases of LLM-as-a-judge evaluation protocols and compare typical response strategies that directly affect user experience. UXBench establishes a new evaluation landscape and calls for greater attention to tailored UX optimization, contributing to a user-centric scaling law that shapes the success of AI assistants.

2606.09543 2026-06-10 cs.CL 版本更新

From Genes to Tokens: a GWAS-inspired Approach for Interpretable Stylometric Analysis

从基因到词元:受GWAS启发的可解释风格计量分析方法

Dmitry Pronin, Evgeny Kazartsev

发表机构 * HSE University(莫斯科国立高等经济大学)

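AI summary: Inspired by genome-wide association studies, proposes a stylometric method that detects author-specific lexical markers using logistic regression with multiple-comparison correction, and applies it to English, German, and Russian corpora.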

详情
AI中文摘要

这篇短文介绍了一种受全基因组关联研究(GWAS)启发的风格计量解释方法。每个“基因”词元与“表型”作者身份的关联通过逻辑回归进行检验,并进行了多重比较校正。将该方法应用于英语、德语和俄语语料库,检测出了个体作者特有的统计显著的词汇标记。

英文摘要

This short paper introduces a stylometric interpretation method inspired by genome-wide association studies (GWAS). Each "gene" token's association with "phenotype" authorship is tested using logistic regression with multiple-comparison correction. Applied to English, German, and Russian corpora, the method detects statistically significant lexical markers distinctive of individual authors.
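
The multiple-comparison step can be illustrated with a short sketch. The abstract does not say which correction is used, so the Benjamini-Hochberg procedure below is an assumption chosen because it is common in GWAS-style analyses; the per-token p-values are made up for illustration and would in practice come from the per-token logistic regressions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg FDR control.

    Returns a list of booleans, True where the corresponding hypothesis is
    rejected (i.e. the token is kept as a candidate marker).
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, one per token, for a single author-vs-rest test.
token_p = {"whilst": 0.001, "upon": 0.008, "the": 0.039, "very": 0.041, "and": 0.60}
tokens = list(token_p)
flags = benjamini_hochberg([token_p[t] for t in tokens], alpha=0.05)
markers = [t for t, keep in zip(tokens, flags) if keep]
```

Here "the" and "very" have raw p-values below 0.05 but do not survive correction, which is the point of the step: with thousands of tokens tested, uncorrected thresholds would flag many spurious markers.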

2606.09475 2026-06-10 cs.AI cs.LG 版本更新

Emergent alignment and the projectability of ethical personas

涌现对齐与伦理人格的可投射性

Guillermo Del Pinal, Youngchan Lee, Calum McNamara, Alejandro Perez Carballo

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Indiana University Bloomington(印第安纳大学布卢明顿分校)

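AI summary: Studies how finetuning LLMs on narrow tasks induces broadly aligned behavior; using the Constitutional AI approach to give models ethical personas, it finds that narrow alignment projects to categories excluded from training and argues that alignment strategies should be evaluated on projectability.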

详情
AI中文摘要

关于“涌现错位”的研究表明,在窄任务上微调LLM会诱导广泛错位的行为。这支持了“人格选择”(PSM)假说:在预训练期间,LLM学会模拟不同的角色和视角,这些可以在后训练期间被激发和细化。本文研究了相反的现象“涌现对齐”,并用它来支持和细化PSM,并激发对齐的新需求。我们在广泛和狭窄的安全任务上微调一个仅帮助型模型。为了创建SFT样本,我们遵循“宪法AI”(CAI)方法,并使用四条编码合理对齐策略的宪法:道义论、后果论、美德伦理以及将AI对齐为从属于人类权威。对于每个模型,我们表明,在两个狭窄的安全子类别上微调可靠地诱导出在代表性的一般安全类别以及我们直接从用于窄对齐的数据集中过滤掉的安全子类别上的涌现对齐。为了使用更细粒度的评估测试“PSM”,我们使用了多维“伦理人格”诊断。对于每个宪法微调(广泛/狭窄)模型,我们评估其行为与预期特征概况的匹配程度。我们的结果表明,我们的CAI模型获得了预期的“伦理人格”——例如,使用后果论宪法创建的SFT样本窄微调的模型与功利主义信念的一致性显著高于道义论信念。然而,我们的粗粒度和细粒度评估显示,我们的(广泛/狭窄)微调CAI模型在投射效果上存在显著差异。我们得出结论,对齐策略不仅应基于其(分布内)一般安全性能进行评估,还应特别基于其可投射性程度进行评估。

英文摘要

Work on 'emergent misalignment' shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the 'persona selection' (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and refined during post-training. This paper investigates the converse phenomenon, 'emergent alignment', and uses it to support and refine the PSM and motivate a novel desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the 'Constitutional AI' (CAI) approach and use four constitutions which encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of those models, we show that finetuning on two narrow safety sub-categories reliably induces emergent alignment over a representative set of general safety categories, and on safety subcategories that we directly filtered-out of the data sets used for narrow alignment. To test the 'PSM' using a more fine-grained evaluation, we used a multidimensional 'ethical persona' diagnostic. For each constitutionally finetuned (broad/narrow) model, we evaluate how well their behavior matches their expected signature profile. Our results show that our CAI models acquire their expected "ethical persona" -- e.g., the model narrowly fine-tuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than deontological beliefs. Yet our coarse and fine-grained evaluations show that there are significant differences across our (broad/narrow) finetuned CAI models in how well they project. We conclude that alignment strategies should be evaluated, not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.

2606.09466 2026-06-10 cs.CL 版本更新

DECSELFMASK: Leveraging Unlabeled Text via Self-Relevance-Guided Masking for Decoder-Only Classification

DECSELFMASK: 通过自相关引导掩码利用未标记文本进行仅解码器分类

Pietro Ferrazzi, Matteo Merler, Giovanni Bonetta, Alberto Lavelli, Bernardo Magnini

发表机构 * Fondazione Bruno Kessler, Trento, Italy(布鲁诺·凯斯勒基金会,特伦托,意大利) University of Padova, Italy(帕多瓦大学,意大利)

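AI summary: Proposes DecSelfMask, which uses relevance attribution to guide masking and create self-supervised training examples from unlabeled data, reconstructing the masked spans via next-token prediction to improve decoder-only models on classification; on 136 clinical tasks it gains 19.9 Macro F1 points over standard supervised fine-tuning.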

详情
AI中文摘要

分类任务需要标注数据,但收集这些数据往往昂贵、耗时甚至不可行。医学领域尤其如此,大型数据集通常只有少量标注样本。为解决这一问题,我们提出DecSelfMask(通过掩码进行解码器自学习),一种增强仅解码器模型在分类任务上性能的方法。我们基于常见的自学习方法,利用模型从无标签数据创建训练样本,并提出一种新颖的相关性引导掩码策略。我们使用相关性归因方法确定未标注文本中与任务相关的部分。然后通过掩码这些部分创建自监督训练样本,训练模型通过下一词预测重建它们。我们假设这些样本传达了关于未标注数据结构和语义的知识,可能对下游性能有用。我们在来自一家意大利医院的190万份临床笔记的136个任务上测试了我们的方法。我们在5个不同规模和系列的模型上量化了DecSelfMask对下游任务的影响,包括探测分析。实验显示持续改进,优于标准监督微调方法(Macro F1提高19.9点)、合成标签生成(提高12.5点)和持续预训练(提高6.3点),以及常见基线。

英文摘要

Classification tasks require annotated data, which can often be expensive, time-consuming, or even unfeasible to collect. This is the case of the medical domain, where large datasets often have few annotated examples. To address this, we propose DecSelfMask (Decoder Self-learning by Masking), an approach to enhance decoder-only performance on classification tasks. We build on common self-learning approaches by leveraging a model to create training examples from unlabeled data to propose a novel relevance-guided masking strategy. We use relevance attribution methods to determine what portions of unannotated texts are relevant for a task. We then create self-supervised training examples by masking out those portions, training the model to reconstruct them via next-token-prediction. We hypothesize that those examples convey knowledge about the structure and semantics of unannotated data that can be useful for downstream performance. We test our approach on 136 tasks from a collection of 1.9M clinical notes from an Italian hospital. We quantify DecSelfMask's impact on downstream tasks on 5 models of different scales and families, including a probing analysis. Experiments show consistent gains, outperforming standard supervised fine-tuning approaches (+19.9 points in Macro F1), synthetic label generation (+12.5), and continual pretraining (+6.3), as well as common baselines.

2606.09421 2026-06-10 cs.CL 版本更新

What Should a Skill Remember? Quality--Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents

技能应记住什么?语言模型代理中成本感知技能重写的质量-成本权衡

Qinghua Xing, Yinda Chen, Yaping Jin, Zhenhe Wu, Bohan Lin, Hang Zhou, Xinghao Chen, Hanting Chen, Zhiwei Xiong

发表机构 * University of Science and Technology of China(中国科学技术大学) Huawei Technologies(华为技术有限公司) Tianjin University(天津大学)

AI总结 研究语言模型代理中技能重写的质量-成本权衡,提出信息保留策略,在SkillsBench上实现成本降低7%-14.7%且保持验证质量。

详情
AI中文摘要

大型语言模型代理越来越依赖技能:可重用的程序文档,编码工作流程、工具使用、实现模式、验证检查和领域规则。技能重写通常被视为提示压缩,但较短的技能可能通过移除防止探索、调试和恢复的稀疏操作锚点而使代理更昂贵。我们通过这种经济视角研究技能重写。我们的受控框架剖析技能结构,使用信息保留策略重写技能,并在固定任务指令、环境和验证器下评估重写。在SkillsBench上的实验揭示了不同策略间明显的质量-成本权衡:API/代码锚定、工作流保护和规则/公式锚定有利于不同的任务族,没有普遍主导的模板。在主要的留出评估中,学习到的策略将总成本降低7.0%,下游代理令牌成本降低6.0%;在冻结的跨模型迁移中,相应的降低平均为14.7%和13.7%,同时验证器质量保持不变。这些结果将技能设计定位为成本感知的操作知识工程,而非提示压缩。资源:SkillEE(https://github.com/1Reminding/Skill_EE)。

英文摘要

Large language model agents increasingly rely on skills: reusable procedural documents encoding workflows, tool use, implementation patterns, validation checks, and domain rules. Skill rewriting is often treated as prompt compression, but shorter skills can make agents more expensive by removing sparse operational anchors that prevent exploration, debugging, and recovery. We study skill rewriting through this economic lens. Our controlled framework profiles skill structure, rewrites skills using information-preservation strategies, and evaluates the rewrites under fixed task instructions, environments, and verifiers. Experiments on SkillsBench reveal distinct quality--cost trade-offs across strategies: API/code anchoring, workflow guarding, and rule/formula anchoring benefit different task families, with no universally dominant template. In the main held-out evaluation, the learned policy reduces total cost by 7.0% and downstream agent-token cost by 6.0%; in frozen cross-model transfer, the corresponding reductions average 14.7% and 13.7%, while verifier quality is preserved. These results position skill design as cost-aware operational knowledge engineering rather than prompt compression. Resources: https://github.com/1Reminding/Skill_EE.

2606.09377 2026-06-10 cs.LG cs.AI 版本更新

Scaling Neural Network Verification with Tensor Parallelism and Fully Sharded Data Parallelism

利用张量并行和全分片数据并行扩展神经网络验证

Sergei Vorobyov, Eugene Ilyushin

Affiliations: Lomonosov Moscow State University; Central University

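AI summary: To address the GPU-memory bottleneck in formal neural network verification, the paper adapts Tensor Parallelism (TP) and Fully Sharded Data Parallelism (FSDP) to the auto_LiRPA / α,β-CROWN framework. TP cuts peak memory by about 2× but loosens the bounds; FSDP cuts baseline memory by 80-90% with bounds bitwise identical to the single-GPU baseline, and supports complete verification and convolutional layers.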

Abstract

Formal neural network verification, that is, proving that a network satisfies safety properties for all inputs in a specified domain, is bounded in practice by GPU memory: standard implementations of bound-propagation algorithms (IBP, CROWN, α-CROWN) require weight and relaxation-coefficient matrices to reside entirely on one accelerator. We adapt two parallelism techniques originally developed for large-scale model training to the auto_LiRPA / α,β-CROWN verification framework. Tensor Parallelism (TP) shards both weight and A-matrices across GPUs, achieving ≈2× peak-memory reduction at P = 2; soundness is confirmed on VNN-COMP 2022 MNIST-FC benchmarks, though bound tightness degrades with the number of sharded zones due to forced IBP substitution for intermediate bounds inside sharded zones. Fully Sharded Data Parallelism (FSDP) shards only weight matrices with a per-layer AllGather, producing bounds that are bitwise identical to the single-GPU baseline: baseline memory drops by 80-90%, peak memory by 34-39% on wide MLPs. FSDP integrates cleanly with complete verification (β-CROWN + Branch-and-Bound) and with convolutional layers (BoundConv); a complete unsat result is obtained for CIFAR-100 ResNet-large (VNN-COMP 2024) under FSDP. Across all experiments the memory bottleneck in α-CROWN+BaB mode proves to be per-neuron alpha tensors, not weight matrices, pointing to the key direction for future work.
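
Illustrative note (not from the paper): the claim that gathering sharded weights before each layer leaves the bounds unchanged can be sketched with interval bound propagation (IBP), the simplest of the algorithms named above. The code below is a minimal pure-Python sketch under stated assumptions: `ibp_linear`, `shard_rows`, and `all_gather` are hypothetical names, the "shards" are plain Python lists standing in for per-GPU weight shards, and none of this is the auto_LiRPA API.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def ibp_linear(W: Matrix, b: Vector, lo: Vector, hi: Vector) -> Tuple[Vector, Vector]:
    """Interval bounds on y = W x + b for every x with lo <= x <= hi."""
    out_lo, out_hi = [], []
    for row, bias in zip(W, b):
        lower = upper = bias
        for w, x_lo, x_hi in zip(row, lo, hi):
            if w >= 0:
                # A non-negative weight keeps the orientation of the interval.
                lower += w * x_lo
                upper += w * x_hi
            else:
                # A negative weight swaps which input bound feeds which output bound.
                lower += w * x_hi
                upper += w * x_lo
        out_lo.append(lower)
        out_hi.append(upper)
    return out_lo, out_hi


def shard_rows(W: Matrix, num_shards: int) -> List[Matrix]:
    """Split W into contiguous row blocks (hypothetical stand-in for per-GPU shards)."""
    size = (len(W) + num_shards - 1) // num_shards
    return [W[i * size:(i + 1) * size] for i in range(num_shards)]


def all_gather(shards: List[Matrix]) -> Matrix:
    """Toy AllGather: rebuild the full weight matrix from its shards."""
    return [row for shard in shards for row in shard]


W = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0], [2.0, 0.0]]
b = [0.0, 1.0, -1.0, 0.5]
lo, hi = [-1.0, 0.0], [1.0, 2.0]

baseline = ibp_linear(W, b, lo, hi)
gathered = ibp_linear(all_gather(shard_rows(W, 2)), b, lo, hi)
print(baseline == gathered)  # the gathered weights are the same values in the same order
```

Because the gathered matrix holds exactly the same values in the same order, the arithmetic is unchanged and the two results compare equal. Only the storage between layers is sharded in this toy model.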

2606.09316 2026-06-10 cs.AI Updated version

Anything2Skill: Compiling External Knowledge into Reusable Skills for Agents

Qianjun Pan, Yutao Yang, Junsong Li, Jie Zhou, Kai Chen, Xin Li, Qin Chen, Liang He

Affiliations: East China Normal University; Shanghai AI Laboratory

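AI summary: Proposes Anything2Skill, a framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills; combined with RAG, it substantially raises agent task success rates.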

Abstract

Retrieval-augmented generation (RAG) enables agents to access external knowledge at inference time, but it primarily retrieves fragmented declarative evidence, leaving agents to repeatedly infer task procedures from passages, manuals, examples, logs, or trajectories. This raises a fundamental question: can skills extracted from external knowledge bases be installed into an agent, enabling it to rapidly approximate domain expertise? In this paper, we propose Anything2Skill, a taxonomy-guided framework that compiles heterogeneous external knowledge into reusable, retrievable, and executable skills for agents. Given a corpus of knowledge records, Anything2Skill first decomposes each record into evidence windows and performs plan-and-expand skill extraction under a skill-tree prior. The extracted candidates are then converted into structured skill contracts that specify invocation conditions, contraindications, action moves, workflow steps, constraints, output specifications, supporting evidence, and confidence scores. To construct a deployable procedural memory, Anything2Skill manages the extracted skills in a persistent SkillBank through taxonomy-aware compilation, registry-level reconciliation, lifecycle tracking, versioned updates, and visible skill-tree projection. At inference time, agents retrieve both task-specific passages from the original knowledge base and relevant procedural skills from the SkillBank, allowing RAG to provide declarative evidence while compiled skills provide reusable procedural guidance. Experiments on qsv and GitHub-CLI show that Anything2Skill combined with RAG achieves 98.85% and 94.10% success rates, respectively, substantially outperforming RAG-only agents. These results suggest that compiling latent procedural knowledge into explicit skills is an effective way to extend retrieval-augmented agents from knowledge access toward capability reuse.
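
Illustrative note (not from the paper): the skill contract described above can be pictured as a small record type. The sketch below is an assumption-laden illustration. The field names mirror the list in the abstract, but the `SkillContract` class, the `retrieve` function, its keyword-overlap scoring, and the two example skills are all hypothetical and are not the authors' schema or retrieval method.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SkillContract:
    """Hypothetical record with the contract fields listed in the abstract."""
    name: str
    invocation_conditions: List[str]
    contraindications: List[str] = field(default_factory=list)
    action_moves: List[str] = field(default_factory=list)
    workflow_steps: List[str] = field(default_factory=list)
    constraints: List[str] = field(default_factory=list)
    output_specification: str = ""
    supporting_evidence: List[str] = field(default_factory=list)
    confidence: float = 0.0


def retrieve(bank: List[SkillContract], query: str, top_k: int = 1) -> List[SkillContract]:
    """Toy retrieval: keyword overlap with invocation conditions, weighted by confidence."""
    terms = set(query.lower().split())

    def score(skill: SkillContract) -> float:
        words = set(" ".join(skill.invocation_conditions).lower().split())
        return len(terms & words) * skill.confidence

    ranked = sorted(bank, key=score, reverse=True)
    return [skill for skill in ranked if score(skill) > 0][:top_k]


# Two made-up skills; the steps are placeholders, not real CLI commands.
bank = [
    SkillContract(
        name="summarize_csv",
        invocation_conditions=["input is a csv file", "task asks for column statistics"],
        workflow_steps=["inspect the header", "compute per-column statistics"],
        confidence=0.8,
    ),
    SkillContract(
        name="dedup_csv",
        invocation_conditions=["input is a csv file", "task asks to remove duplicate rows"],
        contraindications=["row order must be preserved exactly"],
        workflow_steps=["inspect the header", "drop duplicate rows", "verify the row count"],
        confidence=0.9,
    ),
]

best = retrieve(bank, "remove duplicate rows from csv")
print(best[0].name)
```

In this toy setup the retrieved skill supplies the procedure (workflow steps and contraindications), while a separate passage retriever would supply the declarative evidence.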

2606.09203 2026-06-10 cs.RO Updated version

Deterministic Execution of ROS 2 Applications via Lingua Franca

Harun Teper, Shaokai Lin, Shulu Li, Edward A. Lee, Jian-Jia Chen

Affiliations: TU Dortmund University; University of California, Berkeley; RWTH Aachen University

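AI summary: Presents a framework that converts unmodified ROS 2 applications into Lingua Franca programs and uses logical time to execute them deterministically, addressing the nondeterminism of callback execution order and message interleaving in ROS 2.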

Abstract

The Robot Operating System 2 (ROS 2) is a widely used middleware for robotic systems, characterized by a publish-subscribe (pub-sub) communication mechanism in which computation is structured as callbacks dispatched by ROS 2 executors. Despite its popularity, the pub-sub pattern in ROS 2 is inherently nondeterministic: the order in which these callbacks run is nondeterministic even within a single executor, and distributed deployments add further nondeterminism from the interleaving of messages across nodes and from network latency. Such nondeterminism often leads to concurrency issues and makes it virtually impossible to analyze safety and provide guarantees. We present a framework that converts an unmodified ROS 2 application and runs it under Lingua Franca (LF), a coordination language for deterministic execution using logical time, so that the same input always produces the same execution order. We first describe which ROS 2 features can be executed deterministically under logical time. These features make it possible to build an automatic conversion framework that extracts information from a ROS 2 application and converts it directly into an LF program. The rich features of LF, such as logical-time delays, federated execution across processes, and fault handling, can then be applied to make the ROS 2 application execute in a deterministic and timing-predictable manner without changing the ROS 2 code. We evaluate the framework on a synthetic example and on the Autoware reference system. We show that in default ROS 2 the order in which callbacks are executed differs across executions, and end-to-end latencies vary as well. In contrast, our LF-controlled ROS 2 system produces a deterministic execution order and consistent end-to-end latencies.
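
Illustrative note (not from the paper): the contrast between arrival-order dispatch and logical-time dispatch can be sketched in a few lines. This is a simplified model under stated assumptions. It uses neither the ROS 2 client libraries nor the Lingua Franca runtime; the seeded shuffle stands in for timing-dependent message arrival, and the `(logical_time, priority)` key is a hypothetical stand-in for however a logical-time runtime orders work within one tag.

```python
import heapq
import random
from typing import List, Tuple

# (logical_time, priority, callback name); all three values are made up for the example.
Event = Tuple[int, int, str]


def arrivals(events: List[Event], seed: int) -> List[Event]:
    """Simulate timing-dependent arrival order with a seeded shuffle."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def arrival_order(events: List[Event], seed: int) -> List[str]:
    """Toy executor that runs callbacks in the order they arrive."""
    return [name for _, _, name in arrivals(events, seed)]


def logical_time_order(events: List[Event], seed: int) -> List[str]:
    """Toy scheduler that ignores arrival order and pops by (logical_time, priority)."""
    heap = arrivals(events, seed)
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


events = [(0, 0, "sensor"), (0, 1, "filter"), (10, 0, "planner"), (10, 1, "actuator")]

arrival_runs = {tuple(arrival_order(events, seed)) for seed in range(20)}
logical_runs = {tuple(logical_time_order(events, seed)) for seed in range(20)}
print(len(arrival_runs), len(logical_runs))
```

Across the simulated runs the logical-time scheduler always yields a single order, because the order depends only on the tags and not on when each event happened to arrive.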