arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.17449 2026-05-19 cs.CV cs.AI

Spatial Blindness in Whole-Slide Multiple Instance Learning

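Xiangyu Li, Ran Su

Affiliations: College of Intelligence and Computing

AI summary: This paper studies classification errors in whole-slide multiple instance learning that stem from inadequate handling of spatial information. It proposes ResTopoMIL, which uses a permutation-invariant prototype histogram and a coordinate-shuffling constraint to make the model more sensitive to spatial relations, improving classification and survival prediction on several public datasets.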

Comments 28 pages, 8 figures, 16 tables

Abstract

Whole-slide MIL models are often called context-aware once graphs, Transformers, or state-space modules are placed above patch embeddings. We show that this label can be deceptive. On pathology tasks where tissue architecture is part of the diagnostic signal, several strong MIL baselines retain nearly unchanged slide-level AUC after patch coordinates are permuted. Their predictions are accurate, but largely compositional. We refer to this failure mode as spatial blindness. Our explanation is optimization-based: dense appearance statistics are learned early under slide-level supervision, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses the issue by first fitting a permutation-invariant prototype histogram and then freezing it while a lightweight graph branch learns the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention is in how the spatial branch is trained. Across 9 public WSI benchmarks, ResTopoMIL improves classification and survival prediction with 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger localization evidence on CAMELYON-16.
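
To make the coordinate-permutation test concrete, here is a minimal Python sketch. It is not the ResTopoMIL implementation: the prototypes, embeddings, coordinates, and the `same_type_adjacency` statistic are all hypothetical toy values chosen for illustration. It shows that a prototype histogram never reads the coordinates, so it cannot change when they are permuted, while even a very simple spatial statistic does change.

```python
def nearest_prototype(embedding, prototypes):
    """Index of the prototype closest to the embedding (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(embedding, p)) for p in prototypes]
    return dists.index(min(dists))

def prototype_histogram(embeddings, prototypes):
    """Fraction of patches assigned to each prototype; coordinates are never used."""
    counts = [0] * len(prototypes)
    for e in embeddings:
        counts[nearest_prototype(e, prototypes)] += 1
    return [c / len(embeddings) for c in counts]

def same_type_adjacency(coords, labels):
    """Number of grid-adjacent patch pairs that share a prototype label."""
    pairs = 0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dist = abs(coords[i][0] - coords[j][0]) + abs(coords[i][1] - coords[j][1])
            if dist == 1 and labels[i] == labels[j]:
                pairs += 1
    return pairs

# Toy slide: four patches in a row, two tissue types (all values are made up).
prototypes = [(0.0, 0.0), (1.0, 1.0)]
embeddings = [(0.1, 0.0), (0.0, 0.2), (0.9, 1.0), (1.0, 0.8)]
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
permuted = [(0, 0), (2, 0), (1, 0), (3, 0)]  # same coordinates, reassigned to patches
labels = [nearest_prototype(e, prototypes) for e in embeddings]

histogram = prototype_histogram(embeddings, prototypes)  # identical for both layouts
adjacency_original = same_type_adjacency(coords, labels)
adjacency_permuted = same_type_adjacency(permuted, labels)
```

A model whose slide-level score depends only on features like `histogram` would keep the same AUC after permutation, which is the behavior the abstract calls spatial blindness.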

2605.17447 2026-05-19 cs.CV cs.CL

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

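Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan, Yan Feng, Ke Zhang, Sicheng Zhao, Tongxuan Liu, Guiguang Ding

Affiliations: Tsinghua University

AI summary: This paper proposes FastOCR, a training-free framework for document parsing that uses dynamic visual fixation to choose which cached visual tokens to attend to at each decoding step, substantially reducing attention latency while retaining most of the unpruned model's accuracy.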

Abstract

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0×.
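
The difference between evicting tokens and re-selecting them at every step can be sketched as follows. This is a generic top-k attention illustration under stated assumptions, not FastOCR's Focal-Guided Pruning or Cross-Step Fixation Reuse: the function name `attend_top_fraction`, the dot-product scoring, and the toy keys and values are all hypothetical.

```python
import math

def attend_top_fraction(query, keys, values, keep_fraction=0.05):
    """Softmax attention over only the highest-scoring fraction of cached tokens.

    The cache (keys, values) is never modified, so a token skipped at one
    decoding step can still be selected at a later step.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    keep = max(1, math.ceil(keep_fraction * len(keys)))
    selected = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    peak = max(scores[i] for i in selected)  # subtract the max for numerical stability
    weights = [math.exp(scores[i] - peak) for i in selected]
    norm = sum(weights)
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / norm
        for d in range(dim)
    ]
    return output, sorted(selected)

# Toy cache with four visual tokens (made-up numbers).
keys = [(1.0, 0.0), (0.0, 1.0), (2.0, 0.0), (-1.0, 0.0)]
values = [(1.0,), (2.0,), (3.0,), (4.0,)]

out_a, attended_a = attend_top_fraction((1.0, 0.0), keys, values, keep_fraction=0.5)
out_b, attended_b = attend_top_fraction((-1.0, 0.5), keys, values, keep_fraction=0.5)
```

The two queries attend to different subsets of the same untouched cache, which is the property that permanent eviction at prefill time cannot provide.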

2605.17442 2026-05-19 cs.CL cs.AI cs.IR

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

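Zhiyin Tan, Changxu Duan

Affiliations: L3S Research Center, Leibniz University Hannover; Technische Universität Darmstadt

AI summary: This study examines the dataset visibility asymmetry in multilingual NLP. Combining a catalogue-based baseline with evidence from the research literature, it introduces the Resource Density Index (RDI) to measure a language's dataset visibility and shows that many languages appear data-poor in catalogue records yet show clear dataset activity in the literature.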

Comments Accepted at the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)

Abstract

Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million speakers. We then apply an LLM-assisted citation-mining pipeline over the Semantic Scholar corpus to these 141 low-visibility languages. After manual validation and consolidation, we identify 609 unique datasets across 53 languages, of which 356 remain openly accessible through working public links. These results reveal a substantial visibility gap: many large-speaker languages appear data-poor in catalogue records yet show clear evidence of dataset activity in the research literature. Our findings suggest that multilingual data scarcity should be understood not only as a production problem, but also as a question of documentation, discoverability, and long-term accessibility. Code and data are publicly available at (https://github.com/zhiyintan/dataset-visibility-asymmetry).
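
As a worked example of the RDI definition (catalogued datasets per one million speakers), the short sketch below reproduces the abstract's arithmetic. The function names and the sample counts are illustrative assumptions, not taken from the authors' repository.

```python
def resource_density_index(num_catalogued_datasets, num_speakers):
    """Catalogued datasets per one million speakers."""
    return num_catalogued_datasets / (num_speakers / 1_000_000)

def mean_rdi(catalogue_counts, num_speakers):
    """Average RDI across several catalogues (for example, two catalogue counts)."""
    values = [resource_density_index(c, num_speakers) for c in catalogue_counts]
    return sum(values) / len(values)

# One catalogued dataset for ten million speakers sits exactly at the 0.1 threshold.
threshold_case = resource_density_index(1, 10_000_000)

# Hypothetical language with 2 datasets in one catalogue and 0 in another.
averaged_case = mean_rdi([2, 0], 10_000_000)

# Shares quoted in the abstract: 118 of 200 languages, then 118 + 23 low-visibility ones.
zero_share = 118 / 200
low_visibility_total = 118 + 23
```

The last two values confirm that the abstract's figures are internally consistent: 118 of 200 is 59%, and the 118 zero-RDI languages plus the 23 below 0.1 give the 141 low-visibility languages.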

2605.17436 2026-05-19 cs.CV cs.CL

Medical Context Distorts Decisions in Clinical Vision Language Models

David Restrepo, Ira Ktena, Maria Vakalopoulou, Stergios Christodoulidis, Enzo Ferrante

Affiliations: MICS; CentraleSupélec - Université Paris-Saclay; Cancer Data Science Unit; IHU PRISM National Institute in Precision Oncology; University Paris-Saclay; CentraleSupelec; Gustave Roussy; INSERM; CONICET; Universidad de Buenos Aires

AI summary: This paper studies how medical context affects the decisions of clinical vision-language models. It finds that, when integrating visual and textual information from medical records, the models over-rely on the text modality, depend on irrelevant clinical history, and are sensitive to prompt wording, and it stresses the need for explicit safeguards before clinical use.

Abstract

Vision-language models (VLMs) are increasingly proposed for clinical decision support, yet their reliability in real-world scenarios that require integrating both visual and textual context from medical records remains poorly characterized. This paper identifies three failure modes: (1) modality over-reliance on text over images, (2) spurious reliance on irrelevant clinical history, and (3) prompt sensitivity across semantically equivalent inputs. We evaluate a diverse set of general-domain and medically-tuned open and closed VLMs on chest x-ray tasks using MIMIC-CXR. By systematically manipulating image-text alignment, clinical history, and prompt formulations, we found that VLM decisions are dominated by the text modality, even when visual evidence is available. Moreover, we observed that VLMs are heavily influenced by irrelevant reports, while minor prompt changes can reverse correct image-based predictions. Our findings underscore the need for explicit safeguards and stress-testing before considering the use of these models in clinical practice.

2605.17435 2026-05-19 cs.CL

BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering

Chang Zong, Hao Ning, Siliang Tang, Jie Huang, Jian Wan

Affiliations: School of Computer Science and Technology, Zhejiang University of Science and Technology; College of Artificial Intelligence, Zhejiang University; Zhejiang Key Laboratory of Biomedical Intelligent Computing Technology, Zhejiang University of Science and Technology

AI summary: This paper proposes BELIEF, a framework that uses structured evidence modeling and uncertainty-aware fusion to make better use of retrieved literature in biomedical question answering, explicitly modeling evidence reliability, uncertainty, and support strength for candidate hypotheses.

Comments 14 pages, 6 figures

Abstract

Biomedical question answering often requires decisions from retrieved literature whose relevance, quality, and support for candidate answers are uneven. Most retrieval-augmented large language model (LLM) methods feed this literature to the model as flat text, leaving evidence reliability and remaining uncertainty largely implicit. We propose BELIEF, a structured evidence modeling and uncertainty-aware fusion framework for closed-set biomedical question answering. Rather than treating retrieved documents as undifferentiated context, BELIEF converts them into evidence objects that record clinical attributes, source quality, question relevance, support strength, and the associated candidate hypothesis. These evidence objects provide a shared basis for two complementary reasoning paths. The symbolic path constructs reliability-weighted basic probability assignments based on Dempster-Shafer (D-S) theory over a finite answer space and performs uncertainty-aware symbolic evidence fusion to estimate belief and residual uncertainty. The neural path uses the same structured evidence for LLM-based semantic inference, while a reliability-aware arbitration module reconciles the symbolic and neural outputs according to belief strength, uncertainty, evidence reliability, and semantic consistency. Experiments on PubMedQA, MedQA, and MedMCQA with five general-purpose LLM backbones show that BELIEF obtains the best result in 25 of 30 backbone-dataset-metric settings. Comparisons with biomedical-domain models indicate that BELIEF is competitive on MedQA and MedMCQA, while specialized biomedical pretraining remains advantageous on PubMedQA. Ablation, complementarity, uncertainty-stratified, and cost analyses further show that BELIEF improves retrieved-evidence utilization by making evidence structure, path disagreement, and decision uncertainty explicit.
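
The symbolic path builds on standard Dempster-Shafer machinery, which can be sketched in a few lines. The code below implements textbook Shafer discounting and Dempster's rule of combination. How BELIEF actually derives reliabilities and mass assignments is not stated in the abstract, so the three-answer frame, the reliability values 0.8 and 0.5, and the function names are illustrative assumptions.

```python
def discount(mass, reliability, frame):
    """Shafer discounting: scale each mass by the reliability and move the
    remainder onto the full frame, which represents 'unknown'."""
    out = {k: reliability * v for k, v in mass.items() if k != frame}
    out[frame] = 1.0 - reliability + reliability * mass.get(frame, 0.0)
    return out

def combine(m1, m2):
    """Dempster's rule: multiply masses, keep non-empty intersections,
    and renormalize by the non-conflicting mass."""
    combined = {}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

def belief(mass, hypothesis):
    """Total mass committed to subsets of the hypothesis."""
    return sum(v for k, v in mass.items() if k <= hypothesis)

# Hypothetical closed answer space and two pieces of evidence.
frame = frozenset({"yes", "no", "maybe"})
yes, no = frozenset({"yes"}), frozenset({"no"})
evidence_a = discount({yes: 1.0}, reliability=0.8, frame=frame)  # strong source backing "yes"
evidence_b = discount({no: 1.0}, reliability=0.5, frame=frame)   # weaker source backing "no"

fused = combine(evidence_a, evidence_b)
belief_yes = belief(fused, yes)
residual_uncertainty = fused[frame]
```

In this toy case the fused belief in "yes" is 2/3, and 1/6 of the mass stays on the whole frame as residual uncertainty.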

2605.17433 2026-05-19 cs.CV

VISTA: Variance-Gated Inter-Sequence Test-Time Adaptation for Multi-Sequence MRI Segmentation

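Zhipeng Deng, Jiale Zhou, Wenhan Jiang, Haolin Wang, Xun Lin, Yafei Ou, Yefeng Zheng

Affiliations: Westlake University; Hokkaido University; The Chinese University of Hong Kong; RIKEN

AI summary: This paper proposes VISTA, a framework that addresses modality-interaction shifts in multi-sequence MRI segmentation. It designs an Inter-Sequence Intervention Generator and Cross-View Disagreement-Aware Pseudo Labeling to improve adaptation to new clinical environments, and experiments on different cohorts show that it outperforms existing methods.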

Comments MICCAI2026 early accept

Abstract

Deploying multi-sequence magnetic resonance imaging (MRI) segmentation models to new clinical environments is challenging due to variations in scanners and acquisition protocols. Although existing TTA methods handle basic per-modality shifts, they often fail under a fundamental dual-shift problem, as their adaptation signals fail to capture modality-interaction shifts that disrupt inter-sequence consistency. To address this, we propose Variance-gated Inter-Sequence Test-time Adaptation (VISTA), a source-free framework that tackles modality-interaction shifts. First, we design an Inter-Sequence Intervention Generator (ISIG) that generates a set of consistency probes by swapping low-frequency spectra and entropy-localized patches across sequences, preserving anatomical semantics while challenging inter-sequence dependencies. Second, we introduce Cross-View Disagreement-Aware Pseudo Labeling (CDPL), which establishes a voxel-wise reliability metric using cross-view disagreement variance to dynamically gate self-training and enforce interventional consistency, encouraging the network to rely on robust anatomical semantics. Extensive experiments adapting from standard adult MRI (BraTS-GLI-Pre) to African low-field (BraTS-SSA) and pediatric (BraTS-PED) cohorts show improved performance over competing methods under clinical shifts, achieving absolute Dice improvements of +1.89% (SSA) and +2.82% (PED) over the source model. The code is available at https://github.com/dzp2095/VISTA.
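
The variance-gating idea behind CDPL can be illustrated with a small sketch. It assumes a binary foreground/background task, a hard variance threshold, and majority-style labels from the mean probability. These choices, along with the name `gated_pseudo_labels`, are simplifications for illustration and not the released VISTA code.

```python
from statistics import mean, pvariance

def gated_pseudo_labels(view_probs, var_threshold=0.01):
    """Per-voxel pseudo labels that are kept only where the views agree.

    view_probs: one list per view (for example, per intervention probe), each
    holding the predicted foreground probability of every voxel.
    Returns 1 or 0 for reliable voxels and None for voxels excluded
    from self-training because the views disagree too much.
    """
    labels = []
    for voxel_probs in zip(*view_probs):
        if pvariance(voxel_probs) <= var_threshold:
            labels.append(1 if mean(voxel_probs) >= 0.5 else 0)
        else:
            labels.append(None)
    return labels

# Three hypothetical views of the same three voxels.
views = [
    [0.90, 0.20, 0.10],
    [0.95, 0.80, 0.05],
    [0.85, 0.50, 0.10],
]
pseudo = gated_pseudo_labels(views)
```

The middle voxel receives very different predictions across views, so it is gated out, while the two consistent voxels provide training signal.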

2605.17432 2026-05-19 cs.LG cs.CR

DP-SelFT: Differentially Private Selective Fine-Tuning for Large Language Models

Haichao Sha, Zihao Wang, Yuncheng Wu, Hong Chen, Wei Dong

Affiliations: Renmin University of China; Nanyang Technological University

AI summary: This paper proposes DP-SelFT, a framework that uses selective fine-tuning to improve the privacy-utility trade-off of large language models while preserving differential privacy.

Abstract

Large language models (LLMs) are commonly adapted to downstream tasks through fine-tuning, but fine-tuning data often contains sensitive information that may be leaked by the resulting model. Differential privacy (DP) offers formal protection against such leakage, yet DP fine-tuning of LLMs still suffers from substantial utility degradation due to gradient clipping and noise injection. Existing work improves this trade-off by combining DP with parameter-efficient fine-tuning methods such as LoRA, which constrain the form of updates. In this work, we study a complementary direction: selective fine-tuning, which constrains where updates are applied. We propose DP-SelFT, a framework for differentially private selective fine-tuning of LLMs. DP-SelFT addresses three DP-specific challenges in parameter selection: avoiding repeated privacy cost, improving stability under noisy estimates, and selecting parameters that remain useful under clipped and noisy updates. It first constructs a lightweight DP synthetic dataset and performs selection only on this synthetic data, so the selection stage incurs no additional privacy cost. It then conducts layer-level selection by temporarily training candidate layer subsets on a synthetic training split and evaluating them on a synthetic validation split. Crucially, this temporary training is performed under a perturbation regime matched to downstream DP fine-tuning, with worst-case perturbations of the same scale as DP noise. This favors layer subsets that are not only learnable but also robust to noisy private updates. Experiments on benchmark tasks show that DP-SelFT consistently improves the privacy-utility trade-off over existing DP fine-tuning baselines under the same privacy guarantees.
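
The utility loss the abstract attributes to gradient clipping and noise injection comes from the standard DP-SGD update, sketched below. This is the generic mechanism only, not the DP-SelFT selection procedure, and the function names, clipping norm, and toy gradients are assumptions for illustration.

```python
import math
import random

def clip(grad, max_norm):
    """Rescale a per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(per_example_grads[0])
    summed = [sum(g[d] for g in clipped) for d in range(dim)]
    sigma = noise_multiplier * max_norm  # noise scale is tied to the clipping norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]

# Two hypothetical per-example gradients: one large (gets clipped), one small.
grads = [[3.0, 4.0], [0.3, 0.4]]
noise_free = dp_mean_gradient(grads, max_norm=1.0, noise_multiplier=0.0, rng=random.Random(0))
private = dp_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
```

Because the noise is added per coordinate, updating fewer parameters means fewer noisy coordinates, which is one intuition for why restricting where updates are applied can help.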

2605.17431 2026-05-19 cs.LG cs.AI

MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings

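Himchan Hwang, Hyeokju Jeong, Gene Chung, Seungyeon Kim, Sangwoong Yoon, Frank Chongwoo Park

Affiliations: Seoul National University; Ulsan National Institute of Science and Technology (UNIST)

AI summary: MATE is a memory architecture built on accumulated transition embeddings for solving Contextual Markov Decision Processes (CMDPs) parameterized by an unobserved context. It replaces the intractable posterior belief with a sum-aggregated memory, avoids the computational and gradient issues of earlier memory architectures, and achieves performance comparable to standard sequence-model baselines at lower computational cost.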

Abstract

We propose MATE, a simple yet effective memory architecture for solving Contextual Markov Decision Processes (CMDPs), a family of MDPs parameterized by an unobserved context. In CMDPs, an optimal agent can adapt online by maintaining the posterior belief over contexts. MATE replaces this intractable posterior with a sum-aggregated memory, leveraging the posterior's permutation invariance to retain provably sufficient expressiveness. Compared to prior memory architectures, MATE avoids the growing per-step rollout cost of Transformers and the gradient issues commonly associated with Recurrent Neural Networks (RNNs). Extensive evaluations across diverse benchmarks demonstrate that MATE provides clear computational advantages while achieving performance comparable to standard sequence-model baselines.
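
A sum-aggregated memory can be sketched very compactly. In the code below, `embed` is a hypothetical hand-written stand-in for a learned transition encoder, and `SumMemory` is an illustrative name, so this shows only the aggregation pattern and not the MATE architecture itself.

```python
def embed(transition):
    """Hypothetical stand-in for a learned encoder of (state, action, reward, next_state)."""
    state, action, reward, next_state = transition
    return [state * action, reward, next_state - state]

class SumMemory:
    """Memory that accumulates transition embeddings by summation.

    Each update costs the same regardless of how many transitions came
    before, and the result does not depend on the order of the transitions.
    """

    def __init__(self, dim):
        self.state = [0.0] * dim

    def update(self, transition):
        self.state = [s + e for s, e in zip(self.state, embed(transition))]
        return self.state

transitions = [(1.0, 2.0, 0.5, 1.5), (0.0, 1.0, 1.0, 2.0), (2.0, -1.0, 0.0, 1.0)]

forward = SumMemory(dim=3)
for t in transitions:
    forward.update(t)

backward = SumMemory(dim=3)
for t in reversed(transitions):
    backward.update(t)
```

Both memories end in the same state, mirroring the permutation invariance of the posterior that the abstract relies on, and each step does a fixed amount of work, unlike attention over a growing history.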

2605.17429 2026-05-19 cs.LG cs.CV

Radial-Angular Geometry for Reliable Update Diagnosis in Noisy-Label Learning

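Ningkang Peng, Jingyang Mao, Xiaoqian Peng, Weiguang Qu, Yanhui Gu

Affiliations: Nanjing Normal University; Nanjing University of Chinese Medicine

AI summary: This paper proposes a method based on radial-angular geometry for reliably diagnosing updates in noisy-label learning. By comparing the observed-label gradient with a reference gradient induced by an EMA teacher, it distinguishes aligned hard-clean updates from conflicting updates caused by corrupted labels.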

Abstract

Noisy-label methods often estimate sample reliability from forward-space signals such as loss, confidence, or entropy. These signals indicate whether a sample is difficult to predict, but they do not directly test whether its observed label induces a reliable parameter update. This gap matters because hard clean samples and mislabeled samples can have similar loss while inducing different updates. We recast reliability estimation as diagnosis of the observed-label update. The sample-wise empirical Fisher trace gives a backward-space measure of update energy: for the classifier layer, it factorizes into a prediction-residual term and a feature-sensitivity term, so it captures information beyond scalar loss. Trace, however, is still a radial magnitude signal and cannot decide whether a large update is useful or harmful. We therefore propose Relative Geometric Conflict (RGC), which compares the observed-label gradient with a reference gradient induced by an EMA teacher. The conflict term helps distinguish large but aligned hard-clean updates from large conflicting updates caused by corrupted labels. Across synthetic and real-world noisy-label benchmarks, RGC improves hard-clean preservation and accuracy under our evaluation protocol.
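
The radial and angular signals can be illustrated for a linear softmax classifier layer, where the cross-entropy gradient with respect to the weights is the outer product of the prediction residual and the feature vector. The sketch below assumes cosine similarity as the angular comparison and a one-hot teacher target. The exact form of RGC is not given in the abstract, so treat both choices and all numbers as assumptions.

```python
import math

def residual(probs, target):
    """Prediction residual p - y."""
    return [p - t for p, t in zip(probs, target)]

def classifier_gradient(probs, target, features):
    """Cross-entropy gradient for a linear softmax layer's weights: (p - y) h^T."""
    return [[r * h for h in features] for r in residual(probs, target)]

def fisher_trace(probs, target, features):
    """Squared gradient norm, factorized as ||p - y||^2 * ||h||^2."""
    residual_energy = sum(r * r for r in residual(probs, target))
    feature_energy = sum(h * h for h in features)
    return residual_energy * feature_energy

def gradient_cosine(grad_a, grad_b):
    """Cosine similarity between two weight gradients."""
    flat_a = [v for row in grad_a for v in row]
    flat_b = [v for row in grad_b for v in row]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm = math.sqrt(sum(a * a for a in flat_a)) * math.sqrt(sum(b * b for b in flat_b))
    return dot / norm

# Hypothetical sample: the model and the teacher favor class 0, the given label says class 1.
probs = [0.7, 0.2, 0.1]
features = [1.0, 2.0]
observed_label = [0.0, 1.0, 0.0]
teacher_target = [1.0, 0.0, 0.0]

observed_grad = classifier_gradient(probs, observed_label, features)
reference_grad = classifier_gradient(probs, teacher_target, features)

trace = fisher_trace(probs, observed_label, features)        # radial: how large the update is
alignment = gradient_cosine(observed_grad, reference_grad)   # angular: whether it agrees
```

The trace alone is large for both a hard clean sample and a mislabeled one. In this toy case the negative `alignment` is what flags the observed-label update as conflicting with the reference.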

2605.17428 2026-05-19 cs.LG cs.AI

Progressive Generalization Augmentation with Deeply Coupled RND-PPO and Domain-Prioritized Noise Injection for Robust Crop Management Reinforcement Learning

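Wu Yang

Affiliations: Chongho Bridge Group Limited

AI summary: This paper proposes Progressive Generalization Augmentation, combined with a deeply coupled RND-PPO and domain-prioritized noise injection, to address three problems in agricultural reinforcement learning: the trade-off between early-stage learning efficiency and late-stage generalization, the naive additive combination of intrinsic and extrinsic rewards, and uniform noise injection strategies, thereby improving the robustness of crop management.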

Abstract

Our preliminary experiments on gym-DSSAT maize irrigation tasks revealed that ±2 °C temperature noise causes an 11.9% reduction in economic returns for PPO policies trained under clean conditions, a systematic robustness deficit that existing research has not adequately addressed. This paper tackles three interconnected limitations impeding practical deployment of agricultural RL systems: the trade-off between early-stage learning efficiency and late-stage generalization capability; the naive additive combination of intrinsic and extrinsic rewards in exploration-augmented PPO; and uniform measurement noise injection strategies that disregard empirically validated differential sensitivity across agricultural state variables. We introduce three systematic innovations: Progressive Generalization Augmentation (PGA) implementing a three-phase curriculum (clean training for episodes 0-800, progressive augmentation for 800-1200, full augmentation for 1200-2000); a deeply coupled RND-PPO architecture with dual-channel GAE normalization, progress-decayed intrinsic coefficients, and semantic discretization; and domain-prioritized noise injection with hierarchical activation. Our experimental evaluation demonstrates: 8.43% yield improvement and 16.42% nitrogen use efficiency improvement over SOTA BERT-DQN in Florida; 5.61% yield improvement in Zaragoza (though 3.67% lower economic score due to challenging Mediterranean climate); and 94.4% vs 80.0% performance retention under combined perturbations. All experiments used 5 random seeds on NVIDIA A100 GPUs, at 4.2 ± 0.3 hours per run (2000 episodes, 2048-step buffer, mini-batch size 64).
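
The three-phase curriculum can be written as a schedule that maps the episode index to an augmentation strength. The phase boundaries come from the abstract, but the linear ramp in the middle phase, the function names, and the use of a 2 °C maximum noise level are assumptions made for this sketch.

```python
def augmentation_strength(episode, clean_end=800, ramp_end=1200):
    """Fraction of the full augmentation applied at a given training episode.

    Phase 1 (before clean_end): clean training, strength 0.
    Phase 2 (clean_end to ramp_end): assumed linear ramp from 0 to 1.
    Phase 3 (from ramp_end on): full augmentation, strength 1.
    """
    if episode < clean_end:
        return 0.0
    if episode >= ramp_end:
        return 1.0
    return (episode - clean_end) / (ramp_end - clean_end)

def temperature_noise_bound(episode, max_noise_celsius=2.0):
    """Hypothetical noise amplitude for temperature observations at this episode."""
    return max_noise_celsius * augmentation_strength(episode)

schedule = {ep: augmentation_strength(ep) for ep in (0, 799, 800, 1000, 1200, 1999)}
```

With these defaults the policy sees clean observations for the first 800 episodes, half-strength noise at episode 1000, and full-strength noise from episode 1200 onward.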

2605.17423 2026-05-19 cs.CV

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

Yiren Song, Huilin Zhong, Kevin Qinghong Lin, Haofan Wang, Mike Zheng Shou

Affiliations: Show Lab, National University of Singapore; University of Oxford; Lovart AI

AI summary: This study proposes Soap2Soap, a framework that remakes long cinematic videos through multi-agent collaboration, addressing long-term consistency and narrative fidelity in video-to-video generation.

Abstract

We study series-level cinematic remaking, a long-horizon video-to-video generation problem that localizes full episodes or films via stylization or actor replacement while strictly preserving narrative structure, motion choreography, and character identity across hundreds of shots. Existing video generation and editing pipelines often break down in this regime due to compounding identity drift, background mutation, and semantic erosion under large camera motions and viewpoint changes. We propose Soap2Soap, a multi-agent framework that enforces long-term language-visual consistency through a Dual-Bridge Consistency mechanism: a scene-aware JSON screenplay serving as a persistent semantic backbone, and dynamically allocated visual reference anchors at both scene and shot levels. To suppress drift before video synthesis, we introduce batch keyframe consistency, jointly generating multiple keyframes in a shared latent context via a grid-based formulation. A closed-loop verification agent further audits identity, stability, and alignment to trigger selective regeneration. Experiments on SoapBench demonstrate strong improvements over commercial video generation APIs in long-term consistency and narrative fidelity.

2605.17421 2026-05-19 cs.RO

MUSE: Multimodal Uncertainty Quantification of State Estimation

Minkyung Kim, Henry Che, Bhargav Chandaka, Bhumsitt Pramuanpornsatid, Chengyu Yang, Sheng Cheng, Xiaofeng Wang, Naira Hovakimyan, Shenlong Wang

Affiliations: Department of Mechanical Science and Engineering, University of Illinois Urbana-Champaign; Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; Department of Electrical Engineering, University of South Carolina

AI summary: This paper proposes MUSE, a real-time learning-based framework that uses Mamba's strong and efficient sequence modeling to estimate localization uncertainty from multiple asynchronous sensor streams, improving the reliability and robustness of uncertainty quantification for state estimation.

Comments Code and dataset: https://github.com/hungdche/MUSE

Abstract

Accurate visual state estimation has been a central topic in robotics with a wide range of applications in robot navigation, autonomous driving, and autonomous flight. Recent advances in robot perception have led to significant improvements in the accuracy and robustness of state estimation, yet a fundamental challenge remains in how to quantify and calibrate its precision, i.e., how confident we are in an estimate and whether failures can be detected. This issue is particularly pronounced in visual-inertial odometry (VIO), where the heteroscedastic and multimodal nature of the problem makes uncertainty quantification especially difficult. This paper introduces MUSE (Multimodal Uncertainty Quantification of State Estimation), a novel real-time learning-based framework that leverages the strong and efficient sequential modeling capacity of Mamba to estimate localization uncertainty from multiple asynchronous sensor streams. Experiments on both public and in-house datasets demonstrate that MUSE achieves superior reliability and robustness compared to existing uncertainty quantification methods, and ablation studies justify the benefits of its key design choices.

2605.17419 2026-05-19 cs.LG cs.AI

Learning Displacement-Robust Representations for Landslide Early Warning under Rainfall Forecast Uncertainty

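Ren Ozeki, Hamada Rizk, Hirozumi Yamaguchi

Affiliations: Osaka University; RIKEN Center for Computational Science; Tanta University

AI summary: This paper proposes a landslide early warning system that is robust to rainfall field displacement. It learns latent representations of rainfall and terrain data to improve landslide prediction accuracy under rainfall forecast uncertainty.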

Abstract

Rainfall-induced landslides pose a growing risk worldwide as climate change intensifies extreme rainfall events. To provide sufficient evacuation time, landslide early warning systems (LEWS) for real-time disaster monitoring must estimate near-future landslide risk by integrating observed rainfall with short-term rainfall forecasts from spatio-temporal environmental data streams. Although recent landslide prediction methods have improved predictive performance using statistical and deep learning approaches, most assume accurate rainfall inputs. In operational settings, however, landslide prediction relies on rainfall forecasts, which often contain spatial displacement of rainfall fields due to forecasting uncertainties. Such displacement can alter local accumulated rainfall and degrade prediction accuracy. To address this challenge, we propose a novel LEWS robust to rainfall field displacement. The key idea is to learn latent representations from rainfall and terrain data that remain stable under displacement in rainfall field motion, enabling reliable geospatial data integration for landslide risk estimation. The landslide prediction model is trained using Rainfall-Motion-Aware Contrastive Learning (RMCL), which introduces temporally correlated rainfall field perturbations to emulate forecast-induced displacement in rainfall-driven spatio-temporal environmental data streams. Experiments were conducted using two years of rainfall and terrain data across Japan, covering 19 regions with landslide events. The proposed system achieved up to 37% higher precision than state-of-the-art baselines. These results demonstrate that modeling rainfall as a moving spatial field and addressing rainfall field displacement during learning significantly improve the reliability of short-term landslide prediction in operational early warning systems.
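
One way to picture temporally correlated rainfall field perturbations is to shift each frame of a rainfall sequence by an offset that drifts smoothly over time. The sketch below uses an AR(1) process for the offsets and zero-filled integer grid shifts. These are assumptions chosen for illustration, since the abstract does not specify how RMCL generates its perturbations, and all function names are hypothetical.

```python
import random

def shift_field(field, dx, dy):
    """Shift a 2D rainfall grid by (dx, dy) cells, filling uncovered cells with zero."""
    height, width = len(field), len(field[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if 0 <= ny < height and 0 <= nx < width:
                out[ny][nx] = field[y][x]
    return out

def correlated_offsets(steps, rho, sigma, rng):
    """AR(1) displacement sequence: each offset stays close to the previous one."""
    dx = dy = 0.0
    offsets = []
    for _ in range(steps):
        dx = rho * dx + rng.gauss(0.0, sigma)
        dy = rho * dy + rng.gauss(0.0, sigma)
        offsets.append((round(dx), round(dy)))
    return offsets

def displaced_sequence(frames, rho=0.9, sigma=1.0, seed=0):
    """Apply a temporally correlated displacement to every frame of a rainfall sequence."""
    rng = random.Random(seed)
    offsets = correlated_offsets(len(frames), rho, sigma, rng)
    return [shift_field(f, dx, dy) for f, (dx, dy) in zip(frames, offsets)]

# Toy 2x2 rainfall frame with a single wet cell.
frame = [[1.0, 0.0], [0.0, 0.0]]
shifted_right = shift_field(frame, 1, 0)
shifted_down = shift_field(frame, 0, 1)
augmented = displaced_sequence([frame, frame, frame])
```

In a contrastive setup, an original sequence and its displaced copy would form a positive pair, so the encoder is pushed toward representations that stay stable when the forecast rain band lands a few cells away.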

2605.17410 2026-05-19 cs.AI

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

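Ou Wu, Yingjun Deng

Affiliations: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences; Hefei Institutes of Physical Science, Chinese Academy of Sciences

AI summary: This paper examines the computational challenges of treating tokens as economic primitives in large language model systems. It introduces the notion of Computational Token Economics and the Token Economics Trilemma, aiming to set a research agenda that connects token economics with AI system design.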

Comments 43 pages

Abstract

Token economics has emerged as a useful lens for understanding resource allocation, value creation, and pricing in large language model systems. While recent work has increasingly treated tokens as economic primitives, there remains a substantial gap between high-level economic theory and the computational realities of modern AI infrastructure. This paper identifies and analyzes the key computational challenges that arise when token-economic principles are implemented in real-time inference systems. We argue that computational feasibility is not merely one dimension of token economics, but its governing constraint: these challenges are driven by fundamental tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty. To structure this problem space, we introduce the notion of Computational Token Economics and propose the Token Economics Trilemma -- a conditional no-free-lunch principle that captures the inherent trade-offs among granularity, real-time performance, and optimality. We further categorize the main technical challenges into three areas: real-time value accounting, constrained resource allocation, and economic-aware system architecture. Rather than presenting a complete solution, this paper aims to define a research agenda for bridging token economics and AI system design, highlighting open problems at the intersection of computational economics, machine learning systems, and AI infrastructure.

2605.17405 2026-05-19 cs.SD cs.MM

A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

基于最优传输的神经钢琴转录方法

Weixing Wei, Raynaldi Lalang, Dichucheng Li, Kazuyoshi Yoshii

发表机构 * Graduate School of Informatics, Kyoto University, Japan(京都大学信息科学研究生院,日本) Graduate School of Engineering, Kyoto University, Japan(京都大学工学研究生院,日本) Independent Researcher, Hong Kong, China(香港中国独立研究者)

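AI summary: This paper formulates automatic piano transcription as an optimal transport problem rather than as frame-level multi-label binary classification. Minimizing the cost of transporting the predicted note distribution to the ground-truth distribution tolerates temporal misalignment and leads to perceptually relevant optimization. The paper also proposes a convolutional recurrent neural network with a harmonics-aware attention mechanism to capture spectro-temporal dependencies in music.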

Comments Accepted to ICASSP2026

详情
AI中文摘要

本文描述了一种新的范式,将自动钢琴转录(APT)正式化为最优传输(OT)问题,而不是帧级多标签二分类问题。我们的方法学习最小化将预测的音符事件分布传输到真实分布的时间和频率上的成本。因此,OT损失可以容纳时间错位,从而实现感知相关性优化。我们还提出了一种带有谐波感知注意力机制的卷积循环神经网络(CRNN),以捕捉音乐中固有的频谱-时间依赖性。使用MAESTRO数据集的实验表明,我们的方法在起始检测上取得了最先进的性能。我们确认了OT损失在应用于现有模型中的通用性。

英文摘要

This paper describes a novel paradigm that formalizes automatic piano transcription (APT) as an optimal transport (OT) problem, not as a frame-level multi-label binary classification problem. Our method learns to minimize the cost of transporting a predicted distribution of note events to the ground-truth distribution over time and frequency. The OT loss can thus accommodate temporal misalignment, leading to perceptually relevant optimization. We also propose a convolutional recurrent neural network (CRNN) with a harmonics-aware attention mechanism to capture the spectro-temporal dependencies inherent in music. Our experiments using the MAESTRO dataset showed that our method attained state-of-the-art performance in onset detection. We confirmed the versatility of the OT loss in application to existing models.
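To illustrate why a transport cost tolerates small timing errors, here is a minimal sketch restricted to one dimension (time only). It is not the paper's loss, which is defined over time and frequency; the function name `wasserstein_1d` and the toy onset vectors are assumptions made for illustration. In one dimension the optimal transport cost reduces to the L1 distance between cumulative distributions.

```python
from itertools import accumulate


def wasserstein_1d(pred, target):
    """Earth mover's distance between two 1-D histograms on a unit-spaced grid.

    Both inputs are normalised to sum to 1. Moving mass by one frame costs 1,
    so the result is measured in frames.
    """
    if len(pred) != len(target):
        raise ValueError("histograms must have the same length")
    p_total, t_total = sum(pred), sum(target)
    if p_total <= 0 or t_total <= 0:
        raise ValueError("histograms must contain positive mass")
    p_cdf = accumulate(x / p_total for x in pred)
    t_cdf = accumulate(x / t_total for x in target)
    # In 1-D the optimal transport cost equals the L1 distance between CDFs.
    return sum(abs(a - b) for a, b in zip(p_cdf, t_cdf))


# Hypothetical onset activations for one pitch over six frames.
truth = [0, 0, 1, 0, 0, 0]
near = [0, 0, 0, 1, 0, 0]  # predicted onset is one frame late
far = [0, 0, 0, 0, 0, 1]   # predicted onset is three frames late

print(wasserstein_1d(near, truth))  # 1.0
print(wasserstein_1d(far, truth))   # 3.0
```

A frame-level binary loss would score both wrong predictions identically (one missed frame, one false alarm), whereas the transport cost grows with the size of the timing error.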

2605.17403 2026-05-19 cs.LG

Self-Supervised Learning for Sparse Matrix Reordering

稀疏矩阵重新排序的自监督学习

Ziwei Li, Tao Yuan, Fangfang Liu, Shuzi Niu, Huiyuan Li, Wenjia Wu

发表机构 * Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences(中国科学院大学)

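AI summary: This paper proposes a self-supervised method for sparse matrix reordering. A multigrid graph network captures structural information for each vertex, a triplet sampling strategy is derived from the path triplet inequalities, and an end-max chain loss reduces the number of triplets whose predicted scores satisfy those inequalities. The method reduces fill-ins and speeds up LU factorization.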

Comments Accepted by DASFAA 2026

详情
AI中文摘要

使用适当顺序重新排列稀疏矩阵可以显著减少填充分离,即矩阵分解过程中引入的新非零元素,从而减少内存使用和运行时间。然而,找到最小化填充分离的顺序是NP难问题。现有方法,包括图论和深度学习方法,依赖于替代目标函数而没有理论保证。填充分离定理揭示了填充分离生成与矩阵稀疏结构之间的直接内在关系,即路径三元组不等式。本文首先使用多网格图网络来捕获每个顶点的结构信息。然后基于不等式推导出三元组采样策略。最后,我们引入端最大链损失函数以减少预测分数满足这些不等式的三元组数量。在公开可用的SuiteSparse矩阵集合上的实验评估表明,所提出的方法在填充分离减少和LU分解时间加速方面均优于现有方法。

英文摘要

Rearranging the rows or columns of a sparse matrix using an appropriate ordering can significantly reduce fill-ins, i.e., new nonzeros introduced during matrix factorization, decreasing memory usage and runtime. However, finding an ordering that minimizes fill-ins is NP-complete. Existing approaches, including graph-theoretic and deep learning methods, rely on surrogate objectives without theoretical guarantees. The Fill-Path Theorem reveals a direct and intrinsic relationship between fill-in generation and the sparse structure of the matrix as path triplet inequalities. Here we first employ a multigrid graph network to capture structural information for each vertex. We then derive a triplet sampling strategy based on inequalities. Finally, we introduce an end-max chain loss function to reduce the number of triplets whose predicted scores satisfy these inequalities. Experimental evaluations on the publicly available SuiteSparse matrix collection demonstrate the superiority of the proposed method in terms of both fill-in reduction and speedup in LU factorization time.
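To make the notion of fill-in concrete, the sketch below counts the new nonzeros created by symbolic elimination on the graph of a symmetric sparsity pattern. This is the textbook elimination-graph model, not the learning method of the paper, and it covers the symmetric case only; the function name `count_fill_ins` and the star-graph example are illustrative assumptions.

```python
def count_fill_ins(adjacency, order):
    """Count fill-in edges created by symmetric elimination in the given order.

    adjacency: dict mapping each vertex to the set of its neighbours
               (nonzero pattern of a symmetric matrix, diagonal excluded).
    order:     sequence listing every vertex exactly once.
    """
    graph = {v: set(nbrs) for v, nbrs in adjacency.items()}  # work on a copy
    fill = 0
    for v in order:
        nbrs = graph.pop(v)
        for u in nbrs:
            graph[u].discard(v)
        # Eliminating v makes its remaining neighbours pairwise adjacent.
        remaining = sorted(nbrs)
        for i, a in enumerate(remaining):
            for b in remaining[i + 1:]:
                if b not in graph[a]:
                    graph[a].add(b)
                    graph[b].add(a)
                    fill += 1
    return fill


# Star graph: vertex 0 is connected to vertices 1 to 4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}

print(count_fill_ins(star, [0, 1, 2, 3, 4]))  # hub first: 6
print(count_fill_ins(star, [1, 2, 3, 4, 0]))  # hub last: 0
```

The same matrix yields six fill-ins or none depending only on the ordering, which is the quantity a reordering method tries to minimize.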

2605.17398 2026-05-19 cs.CL cs.LG

MiniGPT: Rebuilding GPT from First Principles

MiniGPT:从第一原理重新构建GPT

Jibin Joseph

发表机构 * Department of Computer Science(计算机科学系) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

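AI summary: This paper presents MiniGPT, a from-scratch PyTorch implementation of a GPT-style autoregressive language model. After studying the design of nanoGPT, the author rebuilds the core GPT pipeline from first principles while keeping the model and training code independently written. MiniGPT implements token and positional embeddings, causal multi-head self-attention, pre-LayerNorm Transformer blocks, residual connections, feed-forward MLP layers, next-token cross-entropy training (teacher forcing), validation tracking, checkpoint selection, and autoregressive text generation.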

Comments 13 pages, 2 figures

详情
AI中文摘要

本文提出了MiniGPT,一个基于PyTorch从头实现的GPT风格自回归语言模型。目的是在研究nanoGPT设计后,从第一原理重新构建GPT核心流程,同时保持模型和训练代码独立编写。MiniGPT实现了词嵌入、位置嵌入、因果多头自注意力、预层归一化Transformer块、残差连接、前馈MLP层、下一词交叉熵训练(教师强制)、验证跟踪、检查点选择以及自回归文本生成。本文在Tiny Shakespeare数据集上评估了该实现,使用字符级分词。一个基线模型在3000次训练迭代后达到验证损失1.7236。一个更强的10.77M参数配置,使用更大的上下文长度和改进的训练设置,达到最佳验证损失1.4780,并生成具有可识别莎士比亚风格对话结构的文本。MiniGPT并未引入新的语言模型架构。相反,它记录了从原始文本到训练好的字符级生成的清晰且可重复的实现路径,包括设计选择、训练行为、生成质量以及实际限制。

英文摘要

This paper presents MiniGPT, a compact from-scratch implementation of GPT-style autoregressive language modeling in PyTorch. The aim is to rebuild the core GPT pipeline from first principles after studying the design of nanoGPT by Andrej Karpathy, while keeping the model and training code independently written in a single notebook. MiniGPT implements token and positional embeddings, causal multi-head self-attention, pre-LayerNorm Transformer blocks, residual connections, feed-forward MLP layers, next-token cross-entropy training (teacher forcing), validation tracking, checkpoint selection, and autoregressive text generation. This paper evaluates the implementation on the Tiny Shakespeare dataset using character-level tokenization. A baseline 0.83M-parameter model reaches a validation loss of 1.7236 after 3000 training iterations. A stronger 10.77M-parameter configuration, using a larger context length and improved training settings, reaches a best validation loss of 1.4780 and generates text with recognizable Shakespeare-style dialogue structure. MiniGPT does not introduce a new language-model architecture. Instead, it documents a clear and reproducible implementation path from raw text to trained character-level generation, including design choices, training behavior, generation quality, and practical limitations.
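As a rough illustration of the causal self-attention component listed above, here is a single-head version in plain Python. It is a sketch for intuition, not MiniGPT's code (which is written in PyTorch and multi-head); the function name `causal_attention` and the toy inputs are assumptions, and learned projections are omitted.

```python
import math


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of T vectors (lists of floats). Position t may
    attend only to positions 0..t, which is what makes generation
    autoregressive.
    """
    dim = len(queries[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Score the visible prefix only; future positions are never scored.
        scores = [sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(dim)
                  for s in range(t + 1)]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out = [sum(w * values[s][j] for s, w in enumerate(weights))
               for j in range(len(values[0]))]
        outputs.append(out)
        # Pad with zeros so each row shows the masked future explicitly.
        all_weights.append(weights + [0.0] * (len(queries) - t - 1))
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_attention(x, x, x)
print(attn[0])  # the first position can only attend to itself
print(out[0])   # so its output equals its own value vector
```

The attention matrix is lower triangular: row t has nonzero weights only in columns 0..t.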

2605.17393 2026-05-19 cs.AI cs.LG cs.MA

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

异质信息瓶颈协调图用于多智能体强化学习

Wei Duan, Junyu Xuan, En Yu, Xiaoyu Yang, Jie Lu

发表机构 * Australian Artificial Intelligence Institute (AAII)(澳大利亚人工智能研究所)

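AI summary: This paper proposes Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), a theoretically grounded mechanism for deciding which edges a coordination graph should contain and how much message capacity each edge should carry in multi-agent reinforcement learning. Using the graph information bottleneck, it builds a group-aligned block-diagonal prior so that both edge existence and message capacity are theoretically justified.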

详情
AI中文摘要

协调图是合作多智能体强化学习(MARL)中的核心抽象,然而现有的稀疏图学习者缺乏理论基础的机制来决定哪些边应存在以及每条边应携带多少信息。当前方法依赖于启发式标准,无法保证学习到的拓扑结构的正式保证,并且没有系统的方法来分配不同的通信容量以处理结构不同的智能体关系。为了解决这个问题,我们提出了异质信息瓶颈协调图(HIBCG),它学习了一个组感知的稀疏图,在其中边的存在性和信息容量都得到了理论支持。通过图信息瓶颈(GIB)作为底层工具,HIBCG首先构建了一个组对齐的块对角先验,提供了一个闭式标准用于边保留——确定哪些边应该存在以及每个组块的密度——然后在所得到的拓扑上控制每个智能体的特征带宽,压缩信息以保留仅与任务相关的内容。我们证明了组对齐的先验严格收紧拓扑学习的变分界,目标分解为每个组块,实现了微分边控制,且容量分配遵循水填充原则。

英文摘要

Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which learns a group-aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group-aligned block-diagonal prior that provides a closed-form criterion for edge retention -- determining which edges should exist and at what density per group block -- and then controls per-agent feature bandwidth on the resulting topology, compressing messages to retain only task-relevant content. We prove that the group-aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water-filling principle.
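The abstract does not state the exact form of its water-filling result, so the sketch below shows only the classical water-filling allocation from information theory as background: a fixed budget is spread over channels so that noise plus allocation reaches a common level, and channels that are too noisy receive nothing. The function name `water_filling` and the example numbers are assumptions, not taken from the paper.

```python
def water_filling(noise_levels, budget):
    """Allocate `budget` across channels so that noise + allocation is level.

    Returns a list p with p[i] = max(0, mu - noise_levels[i]) and
    sum(p) == budget, where mu is the common water level.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    order = sorted(range(len(noise_levels)), key=lambda i: noise_levels[i])
    allocation = [0.0] * len(noise_levels)
    # Try using the k quietest channels, from all of them down to one.
    for k in range(len(order), 0, -1):
        active = order[:k]
        mu = (budget + sum(noise_levels[i] for i in active)) / k
        if mu >= noise_levels[active[-1]]:
            for i in active:
                allocation[i] = mu - noise_levels[i]
            break
    return allocation


print(water_filling([1.0, 2.0, 5.0], 3.0))  # [2.0, 1.0, 0.0]
```

In this example the water level is 3.0, so the two quiet channels are filled up to that level and the noisiest channel (level 5.0) is left dry.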

2605.17382 2026-05-19 cs.AI cs.CL cs.GR

QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

QQJ: 量化定性判断以实现可扩展且与人类对齐的生成AI评估

Marjan Veysi, Pirooz Shamsinejadbabaki, Mohammad Zare, Mohammad Sabouri

发表机构 * AI Lab, Arioobarzan Engineering Team(艾伊罗巴赞工程团队人工智能实验室) Department of Computer Engineering and Information Technology(计算机工程与信息科技系) Department of Informatics, Bioengineering, Robotics and Systems Engineering(信息学、生物工程、机器人与系统工程系) University of Genoa(热那亚大学)

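AI summary: This paper proposes QQJ, a framework that calibrates large language model evaluators with expert-designed multi-dimensional rubrics and a high-quality annotation set. The result is a scalable evaluation method that agrees with human judgment, and the paper shows that structured qualitative judgment works at scale.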

详情
AI中文摘要

生成人工智能的快速发展暴露了现有评估方法的根本局限,尤其是在开放性、创造性和面向人类的任务中。传统自动指标依赖于表面统计相似性,往往无法反映人类对质量的感知,而纯粹的人类评估虽然可靠,但成本高、主观性强且难以扩展。最近利用大语言模型作为评估者的做法虽然提高了可扩展性,但通常缺乏明确的人类定义评估原则,导致偏见和不一致。本文介绍Quantifying Qualitative Judgment (QQJ),一种可扩展且以人类为中心的评估框架,通过专家设计的多维评分标准和高质量标注集校准大语言模型评估器,以实现人类判断与自动化评估之间的桥梁。这种设计使在多样化的生成任务和模态上实现了一致、可解释和可扩展的评估。在文本和图像生成上的大量实验表明,QQJ在与人类判断的一致性方面优于传统自动指标和无约束的大语言模型评估者。此外,QQJ在重复评估中表现出更高的稳定性,并在识别关键失败模式如幻觉和意图不匹配方面具有更好的诊断能力。这些结果表明,结构化的定性判断可以在不牺牲可解释性和人类对齐的情况下实现规模化应用,使QQJ成为现代生成AI系统可靠评估的实用基础。

英文摘要

The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities. Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. Moreover, QQJ exhibits improved stability across repeated evaluations and superior diagnostic capability in identifying critical failure modes such as hallucination and intent mismatch. These results indicate that structured qualitative judgment can be operationalized at scale without sacrificing interpretability or human alignment, positioning QQJ as a practical foundation for reliable evaluation of modern generative AI systems.

2605.17380 2026-05-19 cs.AI cs.CR cs.LG

ADR: An Agentic Detection System for Enterprise Agentic AI Security

ADR:一种用于企业代理AI安全的代理检测系统

Chenning Li, Pan Hu, Justin Xu, Baris Ozbas, Olivia Liu, Caroline Van, Manxue Li, Wei Zhou, Mohammad Alizadeh, Pengyu Zhang, KK Sriramadhesikan, Ming Zhang

发表机构 * Uber

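AI summary: This paper presents ADR, a large-scale, production-proven enterprise framework for securing AI agents that operate through the Model Context Protocol (MCP). It addresses three problems (limited observability, insufficient robustness, and high detection cost) with three components: the ADR Sensor, the ADR Explorer, and the ADR Detector.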

Comments Accepted at MLSys 2026 (Industry Track)

详情
AI中文摘要

我们提出了代理AI检测与响应(ADR)系统,这是首个大规模、经过生产验证的企业框架,用于安全地管理通过模型上下文协议(MCP)运行的AI代理。我们识别出该领域存在的三个持续挑战:(1)观测有限——现有的终端检测与响应(EDR)工具只能看到文件写入,而无法看到代理推理、提示或连接意图到执行的因果链;(2)鲁棒性不足——静态防御受限于预定义规则,无法在多样化的攻击技术和企业环境中泛化;(3)高检测成本——基于LLM的推理在大规模上成本过高。ADR通过三个组件解决这些挑战:ADR传感器用于高保真的代理遥测,ADR探索器用于系统性的预部署红队行动和困难示例生成,以及ADR检测器用于可扩展的、两阶段在线检测,结合快速初步筛查与上下文感知推理。在Uber部署超过十个月,ADR在生产中保持了可靠的检测,随着采用的增加,已覆盖超过7,200个唯一主机,每天处理超过10,000个代理会话,发现了数百个凭证泄露,涵盖26类,并启用了向左预防层(97.2%的精度,206个检测到的凭证)。为了验证该方法并促进社区采用,我们引入了ADR-Bench(302个任务,17种技术,133个MCP服务器),其中ADR实现了零误报,同时检测了67%的攻击——在F1分数上,比三个最先进的基线(ALRPHFS、GuardAgent、LlamaFirewall)高出2-4倍。在AgentDojo(公共提示注入基准)上,ADR检测了所有攻击,仅在93个任务中产生了3个误报。

英文摘要

We present the Agentic AI Detection and Response (ADR) system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability -- existing Endpoint Detection and Response (EDR) tools see file writes but not the agent reasoning, prompts, or causal chains linking intent to execution; (2) insufficient robustness -- static defenses constrained by pre-defined rules fail to generalize across diverse attack techniques and enterprise contexts; and (3) high detection costs -- LLM-based inference is prohibitively expensive at scale. ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for systematic pre-deployment red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. Deployed at Uber for over ten months, ADR has sustained reliable detection in production with growing adoption reaching over 7,200 unique hosts and processing over 10,000 agent sessions daily, uncovering hundreds of credential exposures across 26 categories and enabling a shift-left prevention layer (97.2% precision, 206 detected credentials). To validate the approach and enable community adoption, we introduce ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), where ADR achieves zero false positives while detecting 67% of attacks -- outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2--4x in F1-score. On AgentDojo (public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks.

2605.17379 2026-05-19 cs.CL cs.AI

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

通过更好的令牌学习:用于专业文本摘要的参数高效词汇适应

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

发表机构 * Dept. of Computer Science and Engg., IIT Kharagpur(印度Kharagpur理工学院计算机科学与工程系) Dept. of Medicine (Biomedical Informatics), Stanford University(斯坦福大学医学院(生物医学信息学))

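AI summary: This paper proposes a parameter-efficient domain adaptation method that combines vocabulary adaptation with pretraining to improve large language models on specialized-domain text summarization while reducing training time and parameter count.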

Comments 16 pages. Accepted in the 64th Annual Meeting of the Association for Computational Linguistics [ACL (Main) 2026] as a long paper

详情
AI中文摘要

预训练在通用领域语料库上的大型语言模型在应用于专门领域时常常表现出令牌化效率低下。尽管连续预训练用于领域适应在一定程度上缓解了性能下降,但并未解决根本的词汇匹配问题。为了解决这一差距,我们引入了一种有针对性的参数高效领域适应方法,结合词汇适应与预训练用于基于LLM的文本摘要。我们的统一框架在预训练令牌化器中增加领域特定的令牌,同时选择性地替换未充分训练和不可达的令牌以限制参数增长。我们在Llama-3.1-8B和Qwen2.5-7B上评估了我们的方法,在法律和医学摘要任务上使用以专家驱动文本和摘要为中心的评估协议,这些文本通常包含更高浓度的Out-of-Vocabulary(OOV)词。词汇适应算法通过提高生成摘要与参考摘要之间的语义相似性,提升了摘要模型的整体质量。此外,适应后的模型生成的摘要包含更多合适的新型和领域特定的词汇,从而提高了连贯性、相关性和忠实性。我们进一步观察到,我们的方法在连续预训练上减少了35-55%的训练时间,并将参数数量减少了多达37%。我们公开了代码库:https://github.com/gb-kgp/VocabReplace-Then-Expand。

英文摘要

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviates performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks on a challenge-oriented evaluation protocol focused on expert-driven text and summaries, which typically have a higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduces training time by 35-55% over continual pretraining and reduces parameter counts by up to 37% relative to expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.

2605.16234 2026-05-19 cs.LG cs.AI cs.CL

No Free Swap: Protocol-Dependent Layer Redundancy in Transformers

没有免费的交换:Transformer中的协议依赖层冗余

Gabriel Garcia

发表机构 * Independent Researcher(独立研究者)

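AI summary: This paper studies layer redundancy in Transformers by comparing two protocols, replacement and interchange. The two can differ markedly when used for compression: under the same evaluator, the choice of protocol can change which layers look safe to prune, especially when replacement distances are high.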

Comments 40 pages, 8 figures, 24 tables. Code is available at https://github.com/Gpgabriel25/ProtocolGapDiagnostic

详情
AI中文摘要

当研究人员询问两个Transformer层是否在压缩中“等价”时,他们常常混淆了不同的测试方法。替换测试询问是否可以将一层的映射替换为另一层的映射;交换测试询问是否当两层位置交换时,它们近似可交换。两者都是基于输出的swap-KL探测器,但它们并不总是一致:在预训练的Transformer中,协议差距可能在相同评估器下改变哪些层看起来可以安全剪枝,尤其是在替换距离较高时。我们跨检查点和架构测量了两种协议。在Pythia训练轨迹(410M和1.4B)上,替换-交换差距从初始化到收敛逐渐增大。在8B规模的WikiText-2合同下,Qwen3-8B进入了一个发散阶段:交换引导的移除比替换引导的在相同层预算下更安全,而Llama-3.1-8B在剪枝成本上两者持平,尽管交换KL较低,这表明指标差距不必一对一映射到移除。在层移除或合并之前,应在目标检查点上对两种swap-KL进行评分;该诊断仅需未标记的正向传递。

英文摘要

When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.

2605.15735 2026-05-19 cs.CV cs.AI

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

UAM:VLA训练中遗忘的双流视角

Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

发表机构 * Tsinghua University(清华大学)

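AI summary: This paper proposes UAM, a dual-stream architecture that addresses the loss of multimodal capability caused by relying on a single encoder in VLA training. It shows that semantic preservation can come from architectural separation rather than from frozen weights or auxiliary data, and it achieves high success rates across a variety of tasks.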

详情
AI中文摘要

视觉-语言-动作(VLA)模型通常通过在动作数据上微调预训练的视觉-语言模型(VLM)来构建。然而,我们证明这种标准方法系统性地削弱了VLM的多模态能力,这种副作用我们称之为‘具身税’。但VL A是否必须遗忘?受生物视觉双流组织的启发,我们将这种退化归因于结构性瓶颈:当前VL A要求单一编码器同时支持语言基础语义和控制相关的视觉特征,而生物视觉将识别与视觉运动控制分为不同的路径。基于此观点,我们提出了统一动作模型(UAM),添加了一个平行的背侧专家,作为大脑背侧通路的类比。为了使背侧专家成为有效的第二路径并减少对VLM的控制学习负担,我们从预训练的生成模型中初始化它,并用中层推理目标进行训练,该目标预测视觉动态。这种设计使我们能够仅用动作数据端到端地训练整个VLA:无需参数冻结、无需梯度停止、无需辅助VL共训练,UAM保留了超过95%的底层VLM的多模态能力,同时在多种任务中取得了最高平均成功率,包括未见物体、新物体-目标组合和指令变化等探测分布外泛化的任务。这些结果表明,VL A中的语义保留可以从架构分离本身产生,而非通过冻结权重或辅助数据重放,并且这种保留的语义能力可以自然地从VLM转移到动作中的语义泛化。

英文摘要

Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

2605.15694 2026-05-19 cs.LG

Going Beyond the Edge: Distributed Inference of Transformer Models on Ultra-Low-Power Wireless Devices

超越边缘:在超低功耗无线设备上实现Transformer模型的分布式推理

Alexander Gräfe, Ding Huo, Vincent de Bakker, Johannes Berger, Marco Zimmerling, Sebastian Trimpe

发表机构 * RWTH Aachen University(亚琛工业大学) TU Darmstadt(德累斯顿技术大学)

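AI summary: This paper proposes CATS, a framework for distributed Transformer inference on ultra-low-power wireless devices that lets multiple devices collaboratively execute models far larger than a single device can handle. The approach co-designs Transformer partitioning, wireless communication, and training: the SomeGather communication primitive reduces bandwidth and memory use, a partitioning method built on it provides efficient model parallelism, and message-dropout during training makes the model robust to message loss.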

详情
AI中文摘要

Transformer模型正迅速成为现代物联网(IoT)应用的核心,但其计算和内存需求远超单个典型超低功耗IoT设备的能力。我们提出了CATS,一种用于超低功耗无线设备的分布式变压器推断框架,使多个设备能够协同执行远大于单个设备能处理的模型。CATS的核心是一种通信感知的分布式变压器推断方案,结合了变压器划分、无线通信和训练。它采用SomeGather,一种新的剪枝通信原语,选择性广播激活列以减少通信带宽和RAM使用,而不牺牲模型精度。基于SomeGather,我们设计了一种划分方法,利用该原语实现高效的模型并行。为应对不可靠的无线通信,CATS在训练期间采用消息丢弃,模拟数据包丢失,并在推断时产生对消息丢失具有鲁棒性的模型。在实际实验中,我们证明CATS首次将分布式变压器推断带到了超低功耗无线设备上,部署在多达16个设备上,协同执行的变压器模型大小是单个设备能运行的14倍。

英文摘要

Transformer models are rapidly becoming a cornerstone of modern Internet of Things (IoT) applications, yet their computational and memory demands far exceed the capabilities of a single typical ultra-low-power IoT device. We present CATS, a framework for distributed transformer inference on ultra-low-power wireless devices, enabling multiple devices to collaboratively execute models far larger than what a single device can sustain. At its core, CATS is a communication-aware distributed transformer inference scheme co-designed across transformer partitioning, wireless communication and training. It employs SomeGather, a new pruned communication primitive that selectively broadcasts activation columns to reduce communication bandwidth and RAM usage without sacrificing model accuracy. Building on SomeGather, we design a partitioning method that exploits this primitive for efficient model parallelism. To cope with unreliable wireless communication, CATS employs message-dropout during training, which mimics packet losses and yields models that are robust to message loss during inference. In real-world experiments, we show that CATS brings distributed transformer inference to ultra-low-power wireless devices for the first time, with deployments on up to 16 devices that collaboratively execute transformer models up to 14 times larger than what a single device can run.

2605.15641 2026-05-19 cs.RO cs.CR

Propagating Unsafe Actions in LLM Controlled Multi-Robot Collaboration via Single Robot Compromise

通过单个机器人入侵在LLM控制的多机器人协作中传播不安全行为

Zhen Huang, Zhihuang Liu, Mengxuan Luo, Weishang Wu, Zhiping Cai

发表机构 * College of Computer Science and Technology, National University of Defense Technology(国防科技大学计算机科学与技术学院)

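AI summary: This paper studies the security of LLM-controlled multi-robot collaboration. It proposes a new attack paradigm in which the adversary interacts with only a single robot, which then spreads malicious intent and causes coordinated unsafe actions across the system. The process is quantified with three metrics, and the attack is shown to be efficient and persistent.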

Comments Accepted by the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026). 9 pages, 4 figures, 3 tables

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作具身智能中的通用规划器,使单个机器人和多机器人协作的高层协调和底层任务规划成为可能。这种对具身LLM规划器的依赖也引发了关键的安全问题,因为不一致或被操控的指令可以转化为物理动作。先前的工作已研究了单个机器人设置中的此类威胁,而LLM控制的多机器人协作中的安全风险,尤其是通过机器人间通信传播的风险,仍鲜有研究。为弥合这一差距,我们提出了一种新的攻击模式,攻击者仅与单个入口机器人交互。被入侵的机器人然后通过同伴通信传播恶意意图,导致系统中协调的不安全行为。我们的评估涵盖了高风险维度,如失职、隐私侵犯和公共安全危害,揭示了多机器人规划器中持续的安全对齐差距。我们通过三个指标量化这一过程:服从性、传染性和隐蔽性。实验显示了攻击者的持续控制和快速传播:在最强的情况下,服从性达到1.00,传染性上升到0.90。值得注意的是,该攻击非常高效,只需3.0轮次即可入侵所有机器人,同时保持隐蔽性得分为0.81。当机器人必须在关键时刻解决权衡问题,如紧急情况或权利冲突时,此类风险会加剧,因为协调机制可能无意中允许对抗性指令覆盖安全要求。代码可在https://github.com/TheFatInsect/InfectBot上获取。

英文摘要

Large language models (LLMs) are increasingly used as general planners in embodied intelligence, enabling high-level coordination and low-level task planning for both single-robot and multi-robot collaboration. This increasing reliance on embodied LLM planners also raises critical security concerns, since misaligned or manipulated instructions can be translated into physical actions. Prior work has studied such threats in single-robot settings, while security risks in LLM-controlled multi-robot collaboration, especially those propagated through inter-robot communication, remain largely unexplored. To bridge this gap, we propose a novel attack paradigm for multi-robot systems in which the adversary interacts with only a single entry robot. The compromised robot then propagates malicious intent through peer communication, leading to coordinated unsafe actions across the system. Our evaluation, covering high-risk dimensions of dereliction of duty, privacy compromise, and public safety hazards, reveals a persistent safety alignment gap in multi-robot planners. We quantify this process with three metrics: obedience, infectiousness, and stealthiness. Experiments demonstrate both persistent attacker control and rapid propagation: obedience reaches 1.00 in the strongest cases, and infectiousness rises to 0.90. Notably, the attack is highly efficient, requiring as few as 3.0 rounds to compromise all the robots while maintaining a stealthiness score of 0.81. Such risks are amplified when robots must resolve trade-offs in critical situations, such as emergencies or conflicts of rights, because the coordination mechanism can unintentionally allow adversarial instructions to override safety requirements. The code is available at https://github.com/TheFatInsect/InfectBot.

2605.15622 2026-05-19 cs.LG

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

立场:深度学习中的零阶优化是探索不足,而非能力不足

Sijia Liu, Yicheng Lang, Soumyadeep Pal, Changsheng Wang, Yancheng Huang, Chongyu Fan, James Diffenderfer, Bhavya Kailkhura, Yihua Zhang

发表机构 * OPTML Lab, Michigan State University, USA(密歇根州立大学OPTML实验室) Lawrence Livermore National Laboratory, USA(劳伦斯利弗莫尔国家实验室)

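AI summary: This position paper argues that zeroth-order (ZO) optimization in deep learning is underexplored rather than underpowered. It sets out six positions spanning algorithms, systems, and evaluation. It revisits the feasibility of estimator-centric ZO methods through variance control, variance-query tradeoffs, and directional-derivative lenses, and it identifies three underexplored opportunities: subspace and spectral views of ZO, the systems advantage of ZO for communication-efficient training, and the need to disentangle ZO evaluations from task complexity.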

Comments Accepted by ICML 2026 Position Paper Track as a Spotlight Paper

详情
AI中文摘要

零阶(ZO)优化,通过函数评估的有限差分来学习,由于其内存效率和适用于灰箱或黑箱管道的适用性,最近在深度学习中重新受到关注。然而,ZO方法往往被忽视,因为估计方差和不利的查询复杂性被认为是根本无法扩展的。我们主张这一结论可能是误导的:ZO优化是被低估的,而不是无能的。我们证明了许多看似限制性的因素源于短视的发展实践,尤其是全空间、元素-wise、估计器中心的设计。我们阐述了六个涵盖算法、系统和评估栈的立场。首先,我们通过方差控制、方差-查询权衡和方向导数视角重新审视估计器中心ZO方法的可行性边界。然后,我们识别出三个未被充分利用的机会:(i)子空间和谱观点的ZO,使通过优雅的查询扩展实现可解释的方差减少;(ii)ZO作为系统优势,为通信高效、管道友好的和资源受限的训练提供优势;(iii)需要去模糊化ZO评估与任务复杂性之间的关系。我们强烈倡导围绕ZO优化的独特优势重新思考,并采取相应行动,打开通往大规模、系统感知和资源高效学习的可行路径。

英文摘要

Zeroth-order (ZO) optimization, learning from finite differences of function evaluations without backpropagation, has recently regained attention in deep learning due to its memory efficiency and applicability to gray- or black-box pipelines. Yet, ZO methods are often dismissed as fundamentally unscalable because of estimator variance and unfavorable query complexity. We argue that this conclusion might be misguided: ZO optimization is underexplored, not underpowered. We show that many perceived limitations stem from myopic development practices, most notably full-space, element-wise, estimator-centric designs. We articulate six positions spanning the algorithmic, systems, and evaluation stack. First, we revisit the feasibility boundaries of estimator-centric ZO methods through variance control, variance-query tradeoffs, and directional-derivative lenses. Then, we identify three underexplored opportunities: (i) subspace and spectral views of ZO that enable interpretable variance reduction with graceful query scaling, (ii) the forward-only nature of ZO as a systems advantage for communication-efficient, pipeline-friendly, and resource-constrained training, and (iii) the need to de-obfuscate ZO evaluations from task complexity. We strongly advocate rethinking ZO optimization around its unique strengths and acting accordingly, opening a viable path toward large-scale, system-aware, and resource-efficient learning with ZO optimization.
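For readers unfamiliar with the setting, the following sketch shows a basic two-point ZO estimator of the kind the abstract calls estimator-centric: it probes the loss along a random direction and uses the finite difference as a directional-derivative estimate. It is a generic textbook construction, not a method from the paper; the names `zo_gradient`, `random_unit_vector`, `zo_sgd`, and `sphere`, and all hyperparameters, are illustrative assumptions.

```python
import math
import random


def zo_gradient(f, x, direction, mu=1e-3):
    """Two-point zeroth-order estimate: directional derivative times direction."""
    plus = [xi + mu * ui for xi, ui in zip(x, direction)]
    minus = [xi - mu * ui for xi, ui in zip(x, direction)]
    slope = (f(plus) - f(minus)) / (2 * mu)  # central finite difference
    return [slope * ui for ui in direction]


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere."""
    while True:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm > 0:
            return [c / norm for c in v]


def zo_sgd(f, x0, steps=500, lr=0.5, seed=0):
    """Minimise f using only function evaluations (no backpropagation)."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, random_unit_vector(len(x), rng))
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def sphere(x):
    """Toy objective: squared Euclidean norm."""
    return sum(c * c for c in x)


start = [3.0, -2.0, 1.0, 0.5]
end = zo_sgd(sphere, start)
print(sphere(start), sphere(end))
```

Each step costs two function evaluations and recovers only the gradient component along one direction, which is the variance and query-complexity issue the abstract refers to.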

2605.15586 2026-05-19 cs.LG cs.AI cs.CV

Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes

拥抱偏置转移矩阵以实现多类互补标签学习

Tan-Ha Mai, Chao-Kai Chiang, Han-Hwa Shih, Gang Niu, Masashi Sugiyama, Hsuan-Tien Lin

发表机构 * National Taiwan University(国立台湾大学) The University of Tokyo(东京大学) RIKEN Center for Advanced Intelligence Project(日本理化学研究院先进智能项目中心)

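AI summary: This paper proposes BICL, a framework that designs a biased label-generation process to overcome the limits of traditional complementary-label learning in many-class settings, achieving more than sevenfold accuracy improvements over traditional methods on CIFAR-100 and TinyImageNet-200.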

Comments 33 pages, 16 figures, 18 tables

详情
AI中文摘要

互补标签学习(CLL)是一种弱监督范式,其中实例被标记为不属于其类别的标签。尽管已有十年的研究,CLL方法主要在10类分类任务中具有竞争力,而扩展到大规模标签空间仍然是一个持久的瓶颈。这种限制源于传统方法对均匀标签生成的假设,这在多类设置中严重稀释了学习信号。在本文中,我们证明通过故意设计偏置(非均匀)的生成过程,将互补标签限制在类别的子集,可以克服这一长期存在的障碍。这一发现促使我们提出Bias-Induced Constrained Labeling(BICL),一个涵盖数据收集到训练的原理性框架,利用这种偏置。BICL在CIFAR-100和TinyImageNet-200上实现了有效学习,比传统方法的准确率提高了超过七倍。我们的发现为在现实应用中使CLL适用于多类问题开辟了新的道路。

英文摘要

Complementary-label learning (CLL) is a weakly supervised paradigm where instances are labeled with classes they do not belong to. Despite a decade of research, CLL methods remain competitive mainly on 10-class classification, with scaling to large label spaces continuing to be an enduring bottleneck. This limitation stems from the common assumption of uniform label generation in traditional methods, which fatally dilutes the learning signal in many-class settings. In this paper, we demonstrate that this long-standing barrier can be overcome by deliberately designing a biased (non-uniform) generation process that restricts complementary labels to a subset of classes. This finding motivates us to propose Bias-Induced Constrained Labeling (BICL), a principled framework spanning data collection to training that leverages this bias. BICL enables effective learning on CIFAR-100 and TinyImageNet-200, achieving more than sevenfold accuracy improvements over traditional methods. Our findings establish a new trajectory for making CLL feasible for many classes in real-world applications.

2605.15508 2026-05-19 cs.LG cs.CL

STS: Efficient Sparse Attention with Speculative Token Sparsity

STS: 高效稀疏注意力与推测性标记稀疏性

Ceyu Xu, Jiangnan Yu, Yongji Wu, Yuan Xie

Affiliations * The Hong Kong University of Science and Technology, UC Berkeley

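AI summary This paper proposes STS, a sparse attention mechanism that requires no model retraining. It uses the important tokens identified by a smaller draft model to predict the important tokens for a larger target model, enabling efficient sparse attention in large language model inference with a significant speedup and preserved accuracy.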

Comments 14 pages, 12 figures

Details
AI Chinese abstract

注意力的二次复杂度给大型语言模型(LLM)推理带来了严重的内存和计算瓶颈。这一挑战在新兴的智能体应用中尤为突出,这些应用需要处理数百万标记的序列。我们提出STS,一种无需模型再训练的稀疏注意力机制。STS利用了一个关键洞察:由较小的草稿模型识别出的重要标记,对更大的目标模型的重要标记具有高度预测性。通过整合到推测解码框架中,STS复用草稿模型的注意力分数,动态构建标记和头部层面的稀疏性掩码。该掩码有效剪除目标LLM中昂贵的注意力计算。我们的评估显示,STS在代表性基准NarrativeQA上以约90%的稀疏度实现了2.67倍加速,与密集注意力相比,准确率下降可忽略不计。STS在稀疏性与准确性的权衡上确立了新的最先进水平(state-of-the-art),在给定的准确性预算下实现了更高的稀疏度,优于先前技术。

English abstract

The quadratic complexity of attention imposes severe memory and computational bottlenecks on Large Language Model (LLM) inference. This challenge is particularly acute for emerging agentic applications that require processing multi-million token sequences. We propose STS, a sparse attention mechanism that requires no model retraining. STS leverages the key insight that tokens identified as important by a smaller draft model are highly predictive of important tokens for a larger target model. By integrating into speculative decoding frameworks, STS repurposes the draft model's attention scores to dynamically construct a token-and-head-wise sparsity mask. This mask effectively prunes the expensive attention computation in the target LLM. Our evaluation shows that STS achieves a 2.67x speedup operating at approximately 90% sparsity on representative benchmark NarrativeQA, maintaining negligible accuracy degradation compared to dense attention. STS establishes a new state-of-the-art on the sparsity-accuracy trade-off, outperforming prior techniques by enabling higher sparsity levels for a given accuracy budget.
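The masking step described above can be sketched as per-head top-k selection over draft-model attention scores, followed by a softmax restricted to the kept tokens. This is a simplified illustration, not the STS implementation: the function names, the one-to-one mapping between draft and target heads, and the fixed `keep_ratio` are all assumptions.

```python
import math

def token_mask_from_draft(draft_scores, keep_ratio):
    """Build a token-and-head-wise mask from draft attention scores.

    draft_scores[h][t] is the attention weight that draft head h assigns to
    token t. For each head, the top `keep_ratio` fraction of tokens is kept.
    """
    masks = []
    for head_scores in draft_scores:
        n = len(head_scores)
        k = max(1, int(keep_ratio * n))
        ranked = sorted(range(n), key=lambda t: head_scores[t], reverse=True)
        keep = set(ranked[:k])
        masks.append([t in keep for t in range(n)])
    return masks

def masked_softmax(logits, mask):
    """Softmax over kept tokens only; pruned tokens get zero attention weight."""
    kept = [x for x, m in zip(logits, mask) if m]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) if m else 0.0 for x, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

A `keep_ratio` of 0.1 corresponds to roughly 90% sparsity. In a real kernel the pruned positions would be skipped entirely rather than computed and zeroed, which is where the speedup comes from.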

2605.15487 2026-05-19 cs.LG cs.CV eess.IV

Learning Normalized Energy Models for Linear Inverse Problems

学习归一化能量模型以解决线性逆问题

Nicolas Zilberstein, Santiago Segarra, Eero Simoncelli, Florentin Guth

Affiliations * Rice University, Flatiron Institute, New York University

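AI summary This paper proposes a new energy-based model for linear inverse problems that adds a covariance-based regularization term to enforce consistency across measurement conditions. The trained model computes normalized posterior densities without additional retraining or fine-tuning, and enables energy-guided adaptive sampling, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes' rule.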

Comments ICML 2026

Journal ref Int'l Conf Machine Learning (ICML), Jul 2026. https://openreview.net/forum?id=PlFJwgaaDK

Details
AI Chinese abstract

生成扩散模型可以为成像中的逆问题提供强大的先验概率模型,但现有实现存在两个关键限制:(i) 先验密度以隐式方式表示,(ii) 它们依赖于似然近似,这会引入采样偏差。我们通过引入一种新的能量模型来解决这些挑战,该模型以去噪为目标进行训练,并带有基于协方差的正则化项,以确保在不同测量条件下的一致性。训练后的模型能够为各种线性逆问题计算归一化的后验密度,而无需额外的重新训练或微调。除了保留扩散模型的采样能力外,这还带来了以前不具备的能力:可实时调整采样计划的能量引导自适应采样、无偏的Metropolis-Hastings修正步骤,以及通过贝叶斯规则对退化算子进行盲估计。我们在多个数据集(ImageNet、CelebA、AFHQ)和任务(修复、去模糊)上验证了该方法,证明其性能与现有基线相当或更优。

English abstract

Generative diffusion models can provide powerful prior probability models for inverse problems in imaging, but existing implementations suffer from two key limitations: (i) the prior density is represented implicitly, and (ii) they rely on likelihood approximations that introduce sampling biases. We address these challenges by introducing a new energy-based model trained for denoising with a covariance-based regularization term that enforces consistency across different measurement conditions. The trained model can compute normalized posterior densities for diverse linear inverse problems, without additional retraining or fine-tuning. In addition to preserving the sampling capabilities of diffusion models, this enables previously unavailable capabilities: energy-guided adaptive sampling that adjusts schedules on the fly, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes' rule. We validate the method on multiple datasets (ImageNet, CelebA, AFHQ) and tasks (inpainting, deblurring), demonstrating competitive or superior performance to established baselines.

2605.15377 2026-05-19 cs.AI

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

面向AI控制的集成监控:多样化信号胜过更多算力

Eugene Koran, Yejun Yun, Samantha Tetef, Benjamin Arnav, Pablo Bernabeu-Pérez

Affiliations * Yale University

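AI summary This paper studies how combining multiple monitoring signals improves detection of misaligned AI actions. It finds that diverse monitor ensembles outperform single monitors and homogeneous ensembles, and that fine-tuned monitors hold an advantage in detection capability.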

Details
AI Chinese abstract

随着AI系统越来越多地被大规模部署在自主智能体环境中,确保它们采取的行动安全且符合用户意图十分重要。监控智能体的行动是一种关键的安全机制,但可靠的监控器仍然难以构建,而这些系统的规模又使人工监督不切实际。我们证明,将来自不同监控器的信号组合成一个集成可以提升对未对齐行动的检测能力。我们使用提示和微调两种策略构建了12个GPT-4.1-Mini监控器,并在编码任务上对其进行评估,这些任务中的候选解决方案能通过标准测试,但在对抗性输入上失败。在这一设置下,多样化的集成优于单个监控器和同质集成。我们最佳的3监控器集成所获得的检测性能增益是由三个相同监控器组成的集成的2.4倍,且同一集成在独立数据集上同样表现强劲。我们认为这些结果表明,收益来自多样性而非规模。最佳集成兼具较强的个体性能和监控器之间的低相关性。此外,微调的监控器出现在每一个表现最佳的集成中,并在分布外攻击类型上保持这一优势,这表明微调能够激发仅靠提示无法引出的检测能力。这些结果支持将集成监控作为一种实用的AI控制策略,以合理的推理成本获得安全收益。

English abstract

As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity - not scale - drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.
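The claim that low correlation between monitors drives ensemble gains can be illustrated with a toy score-averaging ensemble. The abstract does not state how monitor scores are aggregated, so the mean used here is an assumption, as are the function names and the 0.5 threshold.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equally long score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ensemble_scores(monitor_scores):
    """Average the suspicion scores of several monitors, sample by sample."""
    num_monitors = len(monitor_scores)
    return [sum(column) / num_monitors for column in zip(*monitor_scores)]

def detection_rate(scores, labels, threshold=0.5):
    """Fraction of misaligned samples (label 1) scored at or above the threshold."""
    flagged = [s >= threshold for s, y in zip(scores, labels) if y == 1]
    return sum(flagged) / len(flagged)
```

If two monitors each miss a different misaligned sample, their scores are weakly correlated and the averaged score can flag both samples. Averaging three copies of the same monitor reproduces that monitor's scores and adds nothing.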