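arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.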

2605.17449 2026-05-19 cs.CV cs.AI

Spatial Blindness in Whole-Slide Multiple Instance Learning

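Xiangyu Li, Ran Su

Affiliations: College of Intelligence and Computing

AI summary: This paper studies spatial blindness in whole-slide multiple instance learning, where strong models stay accurate but barely use spatial information. It proposes ResTopoMIL, which fits a permutation-invariant prototype histogram and then trains a graph branch under a coordinate-shuffling constraint to restore sensitivity to spatial relations, improving classification and survival prediction on several public datasets.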

Comments 28 pages, 8 figures, 16 tables

Abstract

Whole-slide MIL models are often called context-aware once graphs, Transformers, or state-space modules are placed above patch embeddings. We show that this label can be deceptive. On pathology tasks where tissue architecture is part of the diagnostic signal, several strong MIL baselines retain nearly unchanged slide-level AUC after patch coordinates are permuted. Their predictions are accurate, but largely compositional. We refer to this failure mode as spatial blindness. Our explanation is optimization-based: dense appearance statistics are learned early under slide-level supervision, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses the issue by first fitting a permutation-invariant prototype histogram and then freezing it while a lightweight graph branch learns the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention is in how the spatial branch is trained. Across 9 public WSI benchmarks, ResTopoMIL improves classification and survival prediction with 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger localization evidence on CAMELYON-16.
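
The coordinate-permutation check described in the abstract can be illustrated with a toy sketch. This is not the authors' ResTopoMIL code: the function names, the two prototypes, and the eight-patch bag are all hypothetical. It only shows that a prototype histogram (a permutation-invariant bag summary) never sees patch coordinates, so reassigning coordinates cannot change it, while a simple coordinate-dependent statistic does change.

```python
def nearest_prototype(embedding, prototypes):
    """Index of the prototype closest to one patch embedding (squared L2)."""
    dists = [sum((a - b) ** 2 for a, b in zip(embedding, p)) for p in prototypes]
    return dists.index(min(dists))

def prototype_histogram(embeddings, prototypes):
    """Normalized count of patches per prototype; coordinates are never used."""
    counts = [0] * len(prototypes)
    for e in embeddings:
        counts[nearest_prototype(e, prototypes)] += 1
    return [c / len(embeddings) for c in counts]

def same_label_neighbor_fraction(coords, labels):
    """Fraction of patches whose nearest spatial neighbor has the same label."""
    hits = 0
    for i, (xi, yi) in enumerate(coords):
        best_j, best_d = None, float("inf")
        for j, (xj, yj) in enumerate(coords):
            if i != j:
                d = (xi - xj) ** 2 + (yi - yj) ** 2
                if d < best_d:
                    best_j, best_d = j, d
        hits += labels[best_j] == labels[i]
    return hits / len(coords)

# Hypothetical bag: four "type A" patches on the left, four "type B" on the right.
prototypes = [(0.0, 0.0), (1.0, 1.0)]
embeddings = [(0.1, 0.0), (0.0, 0.2), (0.2, 0.1), (0.1, 0.1),
              (0.9, 1.0), (1.0, 0.8), (0.8, 0.9), (1.0, 1.0)]
coords = [(0, 0), (0, 1), (2, 0), (2, 1), (10, 0), (10, 1), (12, 0), (12, 1)]
labels = [nearest_prototype(e, prototypes) for e in embeddings]

# A fixed permutation of the coordinates (random.shuffle would also work).
perm = [0, 2, 4, 6, 1, 3, 5, 7]
shuffled_coords = [coords[k] for k in perm]

hist = prototype_histogram(embeddings, prototypes)  # same before and after the shuffle
frac_original = same_label_neighbor_fraction(coords, labels)
frac_shuffled = same_label_neighbor_fraction(shuffled_coords, labels)
```

In this toy bag, a model that only consumes `hist` would give identical outputs before and after the shuffle, which is the behaviour the abstract calls spatial blindness.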


2605.17447 2026-05-19 cs.CV cs.CL

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

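Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan, Yan Feng, Ke Zhang, Sicheng Zhao, Tongxuan Liu, Guiguang Ding

Affiliations: Tsinghua University

AI summary: This paper proposes FastOCR, a training-free framework that uses dynamic visual fixation to choose which cached visual tokens to attend to at each decoding step, speeding up document parsing while retaining most of the unpruned model's accuracy.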

Abstract

Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0×.
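
The idea of attending to a small, shifting subset of tokens without evicting anything from the cache can be sketched as follows. This is a minimal illustration, not FastOCR itself: `select_focus_tokens`, the 40-token cache, and the per-step scores are hypothetical, and the real method selects tokens using focal layers of the model rather than hand-written scores.

```python
def select_focus_tokens(scores, keep_percent=5):
    """Indices of the highest-scoring visual tokens for one decoding step."""
    k = max(1, (len(scores) * keep_percent) // 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical cache of 40 visual tokens; nothing is ever removed from it.
cache = [f"visual_token_{i}" for i in range(40)]

def toy_scores(peak):
    """Made-up attention scores concentrated on `peak` and its right neighbor."""
    scores = [0.01] * len(cache)
    scores[peak] = 1.0
    scores[peak + 1] = 0.9
    return scores

focus_step1 = select_focus_tokens(toy_scores(10))  # fixation around token 10
focus_step2 = select_focus_tokens(toy_scores(11))  # fixation has moved by one token
overlap = set(focus_step1) & set(focus_step2)
```

Because the full cache is kept, the second step can attend to a token that the first step ignored. The overlap between consecutive selections is what makes a warm start from the previous step plausible.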


2605.17442 2026-05-19 cs.CL cs.AI cs.IR

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

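Zhiyin Tan, Changxu Duan

Affiliations: L3S Research Center, Leibniz University Hannover; Technische Universität Darmstadt

AI summary: This study examines the dataset visibility asymmetry in multilingual NLP. Combining a catalogue-based baseline with literature-backed evidence, it introduces the Resource Density Index (RDI) to measure dataset visibility per language and shows that many languages appear data-poor in catalogue records yet show clear dataset activity in the research literature.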

Comments Accepted at the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)

DOI: 10.63317/3bep4yiomtp2

Abstract

Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million speakers. We then apply an LLM-assisted citation-mining pipeline over the Semantic Scholar corpus to these 141 low-visibility languages. After manual validation and consolidation, we identify 609 unique datasets across 53 languages, of which 356 remain openly accessible through working public links. These results reveal a substantial visibility gap: many large-speaker languages appear data-poor in catalogue records yet show clear evidence of dataset activity in the research literature. Our findings suggest that multilingual data scarcity should be understood not only as a production problem, but also as a question of documentation, discoverability, and long-term accessibility. Code and data are publicly available at (https://github.com/zhiyintan/dataset-visibility-asymmetry).
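
The RDI definition and the figures quoted in the abstract can be checked with a few lines of arithmetic. The function name is hypothetical; the definition (catalogued datasets per one million speakers) and the counts are taken from the abstract.

```python
def resource_density_index(num_datasets, num_speakers):
    """Catalogued datasets per one million speakers."""
    return num_datasets / (num_speakers / 1_000_000)

# One catalogued dataset for ten million speakers sits exactly at the 0.1 threshold.
rdi_threshold_case = resource_density_index(1, 10_000_000)

# Counts from the abstract: 118 languages with zero RDI plus 23 below 0.1.
low_visibility_languages = 118 + 23
zero_rdi_percent = 118 * 100 // 200
```

This confirms that an RDI below 0.1 corresponds to at most one catalogued dataset per ten million speakers, and that the 141 low-visibility languages are the 118 zero-RDI languages plus the 23 below the threshold.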


2605.17436 2026-05-19 cs.CV cs.CL

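DP-SelFT: Differentially Private Selective Fine-Tuning for Large Language Models

Haichao Sha, Zihao Wang, Yuncheng Wu, Hong Chen, Wei Dong

Affiliations: Renmin University of China; Nanyang Technological University

AI summary: This paper proposes DP-SelFT, a selective fine-tuning framework that improves the privacy-utility trade-off of large language models while preserving differential privacy.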


MUSE: Multimodal Uncertainty Quantification of State Estimation

Minkyung Kim, Henry Che, Bhargav Chandaka, Bhumsitt Pramuanpornsatid, Chengyu Yang, Sheng Cheng, Xiaofeng Wang, Naira Hovakimyan, Shenlong Wang

Affiliations: Department of Mechanical Science and Engineering, University of Illinois Urbana-Champaign; Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign; Department of Electrical Engineering, University of South Carolina

AI summary: This paper proposes MUSE, a real-time learning-based framework that uses Mamba's strong and efficient sequence modeling to estimate localization uncertainty from multiple asynchronous sensor streams, with better reliability and robustness than existing uncertainty quantification methods.

Comments Code and dataset: https://github.com/hungdche/MUSE

Abstract

Accurate visual state estimation has been a central topic in robotics with a wide range of applications in robot navigation, autonomous driving, and autonomous flight. Recent advances in robot perception have led to significant improvements in the accuracy and robustness of state estimation, yet a fundamental challenge remains in how to quantify and calibrate its precision, i.e., how confident we are in an estimate and whether failures can be detected. This issue is particularly pronounced in visual-inertial odometry (VIO), where the heteroscedastic and multimodal nature of the problem makes uncertainty quantification especially difficult. This paper introduces MUSE (Multimodal Uncertainty Quantification of State Estimation), a novel real-time learning-based framework that leverages the strong and efficient sequential modeling capacity of Mamba to estimate localization uncertainty from multiple asynchronous sensor streams. Experiments on both public and in-house datasets demonstrate that MUSE achieves superior reliability and robustness compared to existing uncertainty quantification methods, and ablation studies justify the benefits of its key design choices.


2605.17419 2026-05-19 cs.LG cs.AI

Learning Displacement-Robust Representations for Landslide Early Warning under Rainfall Forecast Uncertainty

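Ren Ozeki, Hamada Rizk, Hirozumi Yamaguchi

Affiliations: Osaka University; RIKEN Center for Computational Science; Tanta University

AI summary: This paper proposes a landslide early warning system that is robust to rainfall field displacement. It learns latent representations of rainfall and terrain data to improve landslide prediction under rainfall forecast uncertainty.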

Abstract

Rainfall-induced landslides pose a growing risk worldwide as climate change intensifies extreme rainfall events. To provide sufficient evacuation time, landslide early warning systems (LEWS) for real-time disaster monitoring must estimate near-future landslide risk by integrating observed rainfall with short-term rainfall forecasts from spatio-temporal environmental data streams. Although recent landslide prediction methods have improved predictive performance using statistical and deep learning approaches, most assume accurate rainfall inputs. In operational settings, however, landslide prediction relies on rainfall forecasts, which often contain spatial displacement of rainfall fields due to forecasting uncertainties. Such displacement can alter local accumulated rainfall and degrade prediction accuracy. To address this challenge, we propose a novel LEWS robust to rainfall field displacement. The key idea is to learn latent representations from rainfall and terrain data that remain stable under displacement in rainfall field motion, enabling reliable geospatial data integration for landslide risk estimation. The landslide prediction model is trained using Rainfall-Motion-Aware Contrastive Learning (RMCL), which introduces temporally correlated rainfall field perturbations to emulate forecast-induced displacement in rainfall-driven spatio-temporal environmental data streams. Experiments were conducted using two years of rainfall and terrain data across Japan, covering 19 regions with landslide events. The proposed system achieved up to 37% higher precision than state-of-the-art baselines. These results demonstrate that modeling rainfall as a moving spatial field and addressing rainfall field displacement during learning significantly improve the reliability of short-term landslide prediction in operational early warning systems.


2605.17410 2026-05-19 cs.AI

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

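Ou Wu, Yingjun Deng

Affiliations: Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences; Hefei Institutes of Physical Science, Chinese Academy of Sciences

AI summary: This paper discusses the computational challenges that arise when tokens are treated as economic primitives in large language model systems. It proposes the concept of computational token economics and a triadic theory of token economics, aiming to set a research agenda that connects token economics with AI system design.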

Comments 43 pages


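Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

Wei Duan, Junyu Xuan, En Yu, Xiaoyu Yang, Jie Lu

Affiliations: Australian Artificial Intelligence Institute (AAII)

AI summary: This paper proposes Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), a theoretically grounded mechanism for deciding which edges exist and how much message capacity each carries in coordination graphs for multi-agent reinforcement learning. It uses the information bottleneck to build a group-aligned block-diagonal prior, so that both edge existence and message capacity are theoretically justified.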

Abstract

Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which learns a group-aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group-aligned block-diagonal prior that provides a closed-form criterion for edge retention -- determining which edges should exist and at what density per group block -- and then controls per-agent feature bandwidth on the resulting topology, compressing messages to retain only task-relevant content. We prove that the group-aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water-filling principle.
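
The abstract states that capacity allocation follows a water-filling principle. As background, here is a sketch of the textbook water-filling allocation, where each channel receives `max(0, level - noise)` and the level is chosen so that the allocations sum to the budget. This is the classic algorithm, not HIBCG's derivation; the names `noise` and `budget` and the example values are assumptions made for illustration.

```python
def water_filling(noise, budget):
    """Textbook water-filling: allocation_i = max(0, level - noise_i), summing to budget."""
    order = sorted(noise)
    level = 0.0
    # Try using the k quietest channels, from all of them down to one.
    for k in range(len(order), 0, -1):
        level = (budget + sum(order[:k])) / k
        if level > order[k - 1]:  # all k channels get a positive share
            break
    allocation = [max(0.0, level - n) for n in noise]
    return allocation, level

# Hypothetical example: three channels with increasing noise, total budget 3.
allocation, level = water_filling([1.0, 2.0, 5.0], 3.0)
```

In this example the noisiest channel lies above the water level and receives nothing, while the two quieter channels split the budget so that their noise plus allocation reaches the same level.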


2605.17382 2026-05-19 cs.AI cs.CL cs.GR

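UAM: A Dual-Stream Perspective on Forgetting in VLA Training

Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

Affiliations: Tsinghua University

AI summary: This paper proposes UAM, a dual-stream architecture that addresses the loss of multimodal capability caused by relying on a single encoder in VLA training. It shows that semantic preservation can come from architectural separation rather than frozen weights or auxiliary data, and it achieves high success rates across a variety of tasks.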

Abstract

Vision-language-action (VLA) models are typically built by fine-tuning a pretrained vision-language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over 95% of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object-target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.


2605.15694 2026-05-19 cs.LG

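Learning Normalized Energy Models for Linear Inverse Problems

Nicolas Zilberstein, Santiago Segarra, Eero Simoncelli, Florentin Guth

Affiliations: Rice University; Flatiron Institute; New York University

AI summary: This paper proposes a new energy-based model for linear inverse problems. A covariance-based regularization term enforces consistency across measurement conditions, so the model can compute normalized posterior densities without additional retraining or fine-tuning. It also enables energy-guided adaptive sampling, unbiased Metropolis-Hastings correction steps, and estimation of the degradation operator via Bayes' rule.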

Comments ICML 2026

Journal ref Int'l Conf Machine Learning (ICML), Jul 2026. https://openreview.net/forum?id=PlFJwgaaDK

Abstract

Generative diffusion models can provide powerful prior probability models for inverse problems in imaging, but existing implementations suffer from two key limitations: (i) the prior density is represented implicitly, and (ii) they rely on likelihood approximations that introduce sampling biases. We address these challenges by introducing a new energy-based model trained for denoising with a covariance-based regularization term that enforces consistency across different measurement conditions. The trained model can compute normalized posterior densities for diverse linear inverse problems, without additional retraining or fine-tuning. In addition to preserving the sampling capabilities of diffusion models, this enables previously unavailable capabilities: energy-guided adaptive sampling that adjusts schedules on the fly, unbiased Metropolis-Hastings correction steps, and blind estimation of the degradation operator via Bayes' rule. We validate the method on multiple datasets (ImageNet, CelebA, AFHQ) and tasks (inpainting, deblurring), demonstrating competitive or superior performance to established baselines.
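
A Metropolis-Hastings correction needs an explicit energy to compare the current state with a proposal, which is why an explicit energy model makes it available. The sketch below is a generic random-walk Metropolis-Hastings sampler on a hypothetical one-dimensional quadratic energy; it is not the paper's sampler, and `energy`, `step_size`, and the seed are illustrative choices.

```python
import math
import random

def energy(x):
    """Hypothetical energy; exp(-energy) is an unnormalized standard normal."""
    return 0.5 * x * x

def acceptance_probability(e_current, e_proposal):
    """Metropolis rule for a symmetric proposal: min(1, exp(E_current - E_proposal))."""
    if e_proposal <= e_current:
        return 1.0
    return math.exp(e_current - e_proposal)

def metropolis_hastings(num_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis-Hastings targeting the density proportional to exp(-energy)."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(num_steps):
        proposal = x + rng.gauss(0.0, step_size)
        if rng.random() < acceptance_probability(energy(x), energy(proposal)):
            x = proposal  # accept; otherwise keep the current state
        samples.append(x)
    return samples

samples = metropolis_hastings(20000)
sample_mean = sum(samples) / len(samples)
```

The accept/reject step is what removes the bias of an approximate proposal: moves to lower energy are always accepted, and moves to higher energy are accepted with probability given by the energy difference.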


2605.15377 2026-05-19 cs.AI

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

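Eugene Koran, Yejun Yun, Samantha Tetef, Benjamin Arnav, Pablo Bernabeu-Pérez

Affiliations: Yale University

AI summary: This paper studies combining multiple monitoring signals to improve detection of misaligned AI actions. It finds that diverse monitor ensembles outperform single or homogeneous ones, and that fine-tuned monitors have an advantage in detection capability.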

Abstract

As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity - not scale - drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.
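
Why low correlation between monitors can matter more than adding copies of the same monitor can be seen in a toy sketch. The scores, labels, and the score-averaging rule below are hypothetical and are not the paper's monitors, aggregation method, or metric; the example only illustrates the general effect.

```python
def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def ensemble_score(per_monitor_scores):
    """Average the suspicion scores of all monitors for each sample."""
    return [sum(col) / len(col) for col in zip(*per_monitor_scores)]

def separates(scores, labels):
    """True if every attack (label 1) scores above every benign sample (label 0)."""
    attacks = [s for s, l in zip(scores, labels) if l == 1]
    benign = [s for s, l in zip(scores, labels) if l == 0]
    return min(attacks) > max(benign)

labels = [1, 1, 0, 0]             # two attacks, two benign samples (made up)
monitor_a = [0.9, 0.2, 0.1, 0.3]  # catches the first attack, misses the second
monitor_c = [0.3, 0.9, 0.2, 0.1]  # catches the second attack, weak on the first

homogeneous = ensemble_score([monitor_a, monitor_a])  # two copies of one monitor
diverse = ensemble_score([monitor_a, monitor_c])      # two different monitors
correlation = pearson(monitor_a, monitor_c)
```

Averaging a monitor with a copy of itself reproduces its blind spot, while averaging two monitors whose errors are negatively correlated separates attacks from benign samples in this example.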



Spatial Blindness in Whole-Slide Multiple Instance Learning

FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

Medical Context Distorts Decisions in Clinical Vision Language Models

BELIEF: Structured Evidence Modeling and Uncertainty-Aware Fusion for Biomedical Question Answering

VISTA: Variance-Gated Inter-Sequence Test-Time Adaptation for Multi-Sequence MRI Segmentation

DP-SelFT: Differentially Private Selective Fine-Tuning for Large Language Models

MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings

Radial-Angular Geometry for Reliable Update Diagnosis in Noisy-Label Learning

Progressive Generalization Augmentation with Deeply Coupled RND-PPO and Domain-Prioritized Noise Injection for Robust Crop Management Reinforcement Learning

Soap2Soap: Long Cinematic Video Remaking via Multi-Agent Collaboration

MUSE: Multimodal Uncertainty Quantification of State Estimation

Learning Displacement-Robust Representations for Landslide Early Warning under Rainfall Forecast Uncertainty

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

A Distribution Matching Approach to Neural Piano Transcription with Optimal Transport

Self-Supervised Learning for Sparse Matrix Reordering

MiniGPT: Rebuilding GPT from First Principles

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

ADR: An Agentic Detection System for Enterprise Agentic AI Security

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

No Free Swap: Protocol-Dependent Layer Redundancy in Transformers

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

Going Beyond the Edge: Distributed Inference of Transformer Models on Ultra-Low-Power Wireless Devices

Propagating Unsafe Actions in LLM Controlled Multi-Robot Collaboration via Single Robot Compromise

Position: Zeroth-Order Optimization in Deep Learning Is Underexplored, Not Underpowered

Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes

STS: Efficient Sparse Attention with Speculative Token Sparsity

Learning Normalized Energy Models for Linear Inverse Problems

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute