arXivDaily arXiv daily academic digest, updated Monday to Friday
2510.19788 2026-05-11 cs.AI cs.LG

Benchmarking World-Model Learning with Environment-Level Queries

Archana Warrier, Dat Nguyen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B. Tenenbaum, Sebastian Vollmer, Kevin Ellis, Zenna Tavares

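Affiliations Basis Research Institute; DFKI GmbH; Harvard University; Université de Montréal & Mila - Quebec AI Institute; University of Cambridge; Massachusetts Institute of Technology; Cornell University

AI summary This paper proposes WorldTest, a protocol that uses environment-level queries to evaluate whether an agent has learned a model that supports many kinds of questions about its environment, and instantiates it as AutumnBench, on which humans outperform frontier models in grid-world environments.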

Comments 34 pages, 10 figures

Abstract

World models are central to building AI agents capable of flexible reasoning and planning. Yet current evaluations (i) test only properties measurable from observed interactions, such as next-frame prediction or task return, and (ii) do not test whether a learned model supports diverse queries about the environment. In contrast, humans build general-purpose models that can answer many different questions about an environment, including questions that require understanding global structure and counterfactual consequences. We propose WorldTest: a protocol for evaluating whether agents learn models that support multiple environment-level queries, that is, questions whose answers depend on properties of the full environment, not just observed trajectories. Individually, these queries can target properties (e.g., reachability or the effects of interventions) that no single rollout distribution determines. Collectively, they assess model generality across query types. We instantiate WorldTest as AutumnBench, a benchmark of 43 interactive grid-world environments and 129 tasks across three query families for both humans and learning agents. Experiments with 517 human participants and five frontier models show that humans substantially outperform these models, a gap we attribute to differences in exploration and belief updating. AutumnBench provides a framework for evaluating world-model learning in grid-world environments with environment-level queries, and WorldTest provides a template for extending such evaluations to richer domains.

2510.08638 2026-05-11 cs.CV cs.AI

Into the Rabbit Hull: From Task-Relevant Concepts in DINO to Minkowski Geometry

Thomas Fel, Binxu Wang, Michael A. Lepori, Matthew Kowal, Andrew Lee, Randall Balestriero, Sonia Joseph, Ekdeep S. Lubana, Talia Konkle, Demba Ba, Martin Wattenberg

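Affiliations Kempner Institute, Harvard University; Harvard University; Dept. of Psychology, Harvard University; Brown University; Google DeepMind; Meta; Goodfire

AI summary By analyzing the task-relevant concepts of DINOv2, this paper reveals functional specialization across classification, segmentation, and depth estimation, and proposes the Minkowski Representation Hypothesis to explain how vision-transformer representations are structured.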

Comments Accepted at ICLR 2026

Journal ref ICLR 2026

Abstract

DINOv2 is routinely deployed to recognize objects, scenes, and actions; yet the nature of what it perceives remains unknown. As a working baseline, we adopt the Linear Representation Hypothesis (LRH) and operationalize it using SAEs, producing a 32,000-unit dictionary that serves as the interpretability backbone of our study, which unfolds in three parts. In the first part, we analyze how different downstream tasks recruit concepts from our learned dictionary, revealing functional specialization: classification exploits "Elsewhere" concepts that fire everywhere except on target objects, implementing learned negations; segmentation relies on boundary detectors forming coherent subspaces; depth estimation draws on three distinct monocular depth cues matching visual neuroscience principles. Following these functional results, we analyze the geometry and statistics of the concepts learned by the SAE. We find that representations are partly dense rather than strictly sparse. The dictionary evolves toward greater coherence and departs from maximally orthogonal ideals (Grassmannian frames). Within an image, tokens occupy a low-dimensional, locally connected set that persists after removing position. These signs suggest representations are organized beyond linear sparsity alone. Synthesizing these observations, we propose a refined view: tokens are formed by combining convex mixtures of archetypes (e.g., a rabbit among animals, brown among colors, fluffy among textures). This structure is grounded in Gärdenfors' conceptual spaces and in the model's mechanism, as multi-head attention produces sums of convex mixtures, defining regions bounded by archetypes. We introduce the Minkowski Representation Hypothesis (MRH) and examine its empirical signatures and implications for interpreting vision-transformer representations.
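
The "sums of convex mixtures" picture can be made concrete with a minimal sketch. This is not the authors' code: the archetype vectors, group names, and weights below are invented for illustration, and the sketch assumes each concept group contributes exactly one convex combination of its archetypes, with the token being the sum of those contributions (a point in the Minkowski sum of the groups' convex hulls).

```python
def convex_mixture(archetypes, weights):
    """Return sum_i w_i * a_i for weights on the probability simplex."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(archetypes[0])
    return [sum(w * a[d] for w, a in zip(weights, archetypes)) for d in range(dim)]


def minkowski_token(groups):
    """Sum one convex mixture per group: a point in the Minkowski sum of the hulls."""
    parts = [convex_mixture(archetypes, weights) for archetypes, weights in groups]
    return [sum(p[d] for p in parts) for d in range(len(parts[0]))]


# Hypothetical 2-D archetypes for two concept groups (illustrative values only).
animals = [[1.0, 0.0], [0.0, 1.0]]   # e.g. "rabbit", "dog"
colors = [[2.0, 2.0], [4.0, 0.0]]    # e.g. "brown", "white"

token = minkowski_token([(animals, [0.75, 0.25]), (colors, [0.5, 0.5])])
```

Because every weight vector lies on the simplex, each contribution stays inside the region bounded by its group's archetypes, which is the geometric property the hypothesis emphasizes.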

2510.07926 2026-05-11 cs.CL

Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

Adam Dejl, James Barry, Alessandra Pascale, Javier Carnerero Cano

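Affiliations Department of Computing, Imperial College London; IBM Research

AI summary This paper investigates three automatic metrics for evaluating the comprehensiveness of LLM-generated text and finds that a simple end-to-end approach is more effective than more complex metrics, at the cost of robustness and interpretability.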

Comments ACL 2026 Findings

Abstract

Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation metrics: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing facts, (2) a Q&A-based metric that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end approach that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end metric compared to more complex metrics, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.
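
As a rough illustration of the NLI-based metric (1), the sketch below scores a response by the fraction of source facts it entails. It assumes the atomic statements have already been extracted, and it takes the entailment check as a pluggable callable; `toy_entails` is a hypothetical stand-in (a substring match), not a real NLI model and not the paper's implementation.

```python
from typing import Callable, List


def missing_facts(source_facts: List[str], response: str,
                  entails: Callable[[str, str], bool]) -> List[str]:
    """Return the source facts that the response does not entail."""
    return [fact for fact in source_facts if not entails(response, fact)]


def comprehensiveness(source_facts: List[str], response: str,
                      entails: Callable[[str, str], bool]) -> float:
    """Fraction of source facts covered by the response (1.0 if there are none)."""
    if not source_facts:
        return 1.0
    missed = missing_facts(source_facts, response, entails)
    return 1.0 - len(missed) / len(source_facts)


def toy_entails(premise: str, hypothesis: str) -> bool:
    # Stand-in for an NLI model: case-insensitive substring match only.
    return hypothesis.lower() in premise.lower()


facts = ["aspirin thins blood", "aspirin may irritate the stomach"]
answer = "Aspirin thins blood and reduces fever."
score = comprehensiveness(facts, answer, toy_entails)
```

Returning the list of missed facts alongside the score is what gives this style of metric its granularity, in contrast to a single end-to-end judgment.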

2510.04850 2026-05-11 cs.CL cs.AI

Detecting Distillation Data from Reasoning Models

Hengxiang Zhang, Hyeong Kyu Choi, Sharon Li, Hongxin Wei

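Affiliations Department of Statistics and Data Science, Southern University of Science and Technology; Department of Computer Sciences, University of Wisconsin–Madison

AI summary This paper proposes Token Probability Deviation (TPD), a method that detects whether a question was part of a reasoning model's distillation data by quantifying how far the probabilities of generated output tokens deviate from a high-confidence reference probability; experiments show detection AUC improvements of up to 31% on distillation datasets.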

Abstract

Reasoning distillation has emerged as a prevailing paradigm for transferring reasoning capabilities from large reasoning models to small language models. Yet, reasoning distillation risks data contamination: benchmark data may inadvertently be included in the distillation data, thereby inflating model performance metrics. In this work, we formally define the distillation data detection task, which determines whether a given question is included in the model's distillation data. The unique challenge of this task lies in the partial availability of distillation data. To address this, we propose Token Probability Deviation (TPD), a detection method that leverages the probability patterns of output tokens generated by the model instead of input tokens. Our method is motivated by the observation that seen questions tend to elicit more near-deterministic tokens generated by the models than unseen ones. Our TPD score is thus designed to quantify the token-level deviation of generated tokens from a high-confidence reference probability. Consequently, seen questions can yield substantially lower TPD scores than unseen ones, enabling strong detection performance. Extensive experiments demonstrate the effectiveness of our approach, improving detection AUC by up to 31% on distillation datasets.
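
The abstract does not give the exact TPD formula, so the following is only one plausible reading: average, over the generated tokens, how far each token's probability falls short of a reference probability. The reference value 0.9, the decision threshold 0.05, and the example probabilities are all assumptions made for illustration.

```python
def tpd_score(token_probs, reference=0.9):
    """Mean shortfall of generated-token probabilities below a reference value.

    Tokens at or above the reference contribute zero, so near-deterministic
    generations receive scores close to 0.
    """
    if not token_probs:
        raise ValueError("need at least one generated token")
    return sum(max(0.0, reference - p) for p in token_probs) / len(token_probs)


# Hypothetical per-token probabilities of the model's own generated answer.
seen = tpd_score([0.99, 0.97, 0.95, 0.98])     # confident generation
unseen = tpd_score([0.60, 0.85, 0.40, 0.95])   # hesitant generation

is_flagged_seen = seen < 0.05  # hypothetical decision threshold
```

Under this reading, a lower score indicates a question more likely to have been in the distillation data, matching the direction described in the abstract.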

2510.01569 2026-05-11 cs.AI cs.CL

InvThink: Premortem Reasoning for Safer Language Models

Yubin Kim, Taehan Kim, Eugene Park, Chunjong Park, Cynthia Breazeal, Daniel McDuff, Hae Won Park

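Affiliations MIT; Google Research; Google DeepMind; Samsung Research

AI summary InvThink improves language-model safety through a premortem reasoning framework that enumerates, analyzes, and constrains potential failures before generating a response; it outperforms existing methods on safety and on reducing harmful behavior.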

Abstract

We present InvThink, a training and prompting framework that requires the model to enumerate, analyze, and constrain potential failures before generating its final response. Unlike existing safety alignment methods that optimize only for safe final responses, InvThink structures generation into three steps: (1) enumerate potential harms, (2) analyze their consequences, (3) generate the response under explicit mitigation constraints. We observe three findings: (i) InvThink shows higher safety scores at larger model sizes, compared to existing safety prompting and alignment baselines. (ii) InvThink mitigates the safety tax: models trained with InvThink preserve their reasoning capability on standard benchmarks. (iii) Beyond general safety tasks, InvThink also reduces harmful behavior in professional ethics domains (medicine, finance, law) and in agentic misalignment scenarios, achieving up to 32% reduction in harmfulness over zero-shot baselines and 16% over SafetyPrompt. We extend InvThink with supervised fine-tuning and GRPO-based reinforcement learning across three LLM families.
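
The three-step structure lends itself to a simple prompt scaffold. The sketch below is a guess at what the prompting variant could look like; the step wording, the function name `build_invthink_prompt`, and the example request are hypothetical and are not taken from the paper.

```python
# Hypothetical wording for the three steps named in the abstract.
INVTHINK_STEPS = (
    "Enumerate the potential harms of answering this request.",
    "Analyze the consequences of each harm.",
    "Write the final response under explicit constraints that mitigate those harms.",
)


def build_invthink_prompt(user_request: str) -> str:
    """Wrap a request in a three-step premortem scaffold."""
    numbered = "\n".join(
        f"({i}) {step}" for i, step in enumerate(INVTHINK_STEPS, start=1)
    )
    return (
        "Before answering, reason through the following steps in order:\n"
        f"{numbered}\n\n"
        f"Request: {user_request}"
    )


prompt = build_invthink_prompt("How should I store cleaning chemicals at home?")
```

Keeping the steps in a tuple makes it easy to reuse the same scaffold when formatting supervised fine-tuning targets, should one want the training and prompting variants to share wording.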

2510.01290 2026-05-11 cs.LG

ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

Akshat Ramachandran, Marina Neseem, Charbel Sakr, Rangharajan Venkatesan, Brucek Khailany, Tushar Krishna

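Affiliations Georgia Institute of Technology; NVIDIA Research

AI summary This paper proposes ThinKV, which uses a hybrid quantization-eviction strategy to set token precision dynamically according to thought importance, shrinking the KV cache footprint and raising inference efficiency and throughput.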

Comments ICLR 2026 (Oral)

Abstract

The long-output context generation of large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types with varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision by thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. Furthermore, to implement ThinKV, we design a kernel that extends PagedAttention to enable efficient reuse of evicted tokens' memory slots, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while improving performance with up to 5.8x higher inference throughput over state-of-the-art baselines.
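
A toy version of a hybrid quantization-eviction policy is sketched below. It is an illustration under stated assumptions, not ThinKV itself: the thought names, importance scores, bit-width tiers, and token budget are all hypothetical, and importance is supplied directly rather than derived from attention sparsity.

```python
from dataclasses import dataclass


@dataclass
class Thought:
    name: str
    importance: float  # assumed given; the paper derives it from attention sparsity
    num_tokens: int


def assign_bits(importance, tiers=((0.66, 8), (0.33, 4))):
    """Map importance to a KV precision in bits; the lowest tier gets 2 bits."""
    for threshold, bits in tiers:
        if importance >= threshold:
            return bits
    return 2


def compress(thoughts, token_budget):
    """Evict tokens from the least important thoughts until the budget is met."""
    kept = {t.name: t.num_tokens for t in thoughts}
    excess = sum(kept.values()) - token_budget
    for t in sorted(thoughts, key=lambda t: t.importance):
        if excess <= 0:
            break
        drop = min(kept[t.name], excess)
        kept[t.name] -= drop
        excess -= drop
    # Each thought maps to (tokens kept, bits per cached value).
    return {t.name: (kept[t.name], assign_bits(t.importance)) for t in thoughts}


plan = compress(
    [Thought("reasoning", 0.9, 100),
     Thought("execution", 0.5, 200),
     Thought("transition", 0.1, 300)],
    token_budget=350,
)
```

In this toy run the least important thought absorbs the whole eviction, while precision is decided independently per thought; a real system would also need the memory-slot reuse that the abstract attributes to the PagedAttention-based kernel.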

2510.00436 2026-05-11 cs.AI cs.CL

Automated Evaluation can Distinguish the Good and Bad AI Responses to Patient Questions about Hospitalization

Sarvesh Soni, Dina Demner-Fushman

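Affiliations Division of Intramural Research, National Library of Medicine, National Institutes of Health

AI summary Through a systematic study, this paper shows that automated evaluation can distinguish the quality of AI responses to patient questions about hospitalization, using metrics anchored to clinician-authored reference answers, and highlights its potential for scaling comparisons of AI systems and supporting patient communication.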

Comments Accepted for publication in npj Digital Medicine

Abstract

Automated approaches to answer patient-posed health questions are rising, but selecting among systems requires reliable evaluation. The current gold standard for evaluating free-text artificial intelligence (AI) responses, human expert review, is labor-intensive and slow, limiting scalability. Automated metrics are promising yet variably aligned with human judgments and often context-dependent. To address the feasibility of automating the evaluation of AI responses to hospitalization-related questions posed by patients, we conducted a large systematic study of evaluation approaches. Across 100 patient cases, we collected responses from 28 AI systems (2800 total) and assessed them along three dimensions: whether a system response (1) answers the question, (2) appropriately uses clinical note evidence, and (3) uses general medical knowledge. Using clinician-authored reference answers to anchor metrics, automated rankings closely matched human ratings. Our findings suggest that carefully designed automated evaluation can scale comparative assessment of AI systems and support patient-clinician communication.
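
One common way to quantify how closely an automated ranking matches human ratings is Spearman's rank correlation; the abstract does not say which agreement measure the study used, so treat this as a generic sketch. The per-system scores below are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical scores for four systems: automated metric vs. mean human rating.
automated = [0.81, 0.64, 0.72, 0.55]
human = [4.5, 3.1, 3.0, 2.2]
rho = spearman(automated, human)
```

A rank-based measure suits this setting because the practical question is which system to select, not whether the metric's raw scale matches the human rating scale.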

2509.25584 2026-05-11 cs.AI cs.CL cs.CV cs.IT cs.LG math.IT

Skip-It? Theoretical Conditions for Layer Skipping in Vision-Language Models

Max Hartman, Vidhata Jayaraman, Moulik Choraria, Akhil Bhimaraju, Lav R. Varshney

Affiliations Department of Electrical and Computer Engineering, The University of Illinois, Champaign, IL, United States; Department of Mathematics, The University of Illinois, Champaign, IL, United States; AI Innovation Institute, Stony Brook University, Stony Brook, NY, United States

AI summary This paper proposes a unified framework, built on verifiable notions of redundancy, that gives theoretical conditions for layer skipping in vision-language models, corroborates the redundancy of early and late vision tokens, and unifies the ideas behind modern layer-skipping techniques.

Abstract

Vision-language models achieve incredible performance across a wide range of tasks, but their large size makes inference costly. Recent work has shown that multimodal processing contains significant redundancies, making it possible to skip certain layers with minimal performance loss. Yet current pruning techniques remain ad-hoc, relying on heuristics or hyperparameter sweeps rather than principled criteria for determining when layer skipping is beneficial. In this paper, we propose a unified framework that characterizes the redundancy conditions under which pruning can enhance efficiency without sacrificing performance. Central to our approach are experimentally verifiable and interpretable notions of redundancy that can be evaluated without requiring downstream task performance as a metric. Applying this framework, we corroborate prior findings that both early and late vision tokens are redundant across models, and we validate our conditions by showing they align with actual performance degradation. Beyond these empirical results, our framework provides a theoretically grounded understanding of redundancy in VLMs and unifies many of the ideas behind modern layer-skipping techniques.

2509.24789 2026-05-11 cs.LG stat.ML

Fidel-TS: A High-Fidelity Multimodal Benchmark for Time Series Forecasting

Zhijian Xu, Wanxu Cai, Xilin Dai, Zhaorong Deng, Qiang Xu

Affiliations Department of Computer Science and Engineering, The Chinese University of Hong Kong; School of Software, Tsinghua University; ZJU-UIUC Institute, Zhejiang University; School of Data Science, The Chinese University of Hong Kong, Shenzhen

AI summary This paper proposes Fidel-TS, a high-fidelity multimodal benchmark for time series forecasting that addresses impure data sourcing, leakage, and unclear structure in existing benchmarks; its experiments expose the limitations of prior benchmarks and potential discrepancies in model evaluation.

Comments new version

Abstract

The evaluation of time series forecasting models is hindered by a lack of high-quality benchmarks, leading to overestimated assessments of progress. Existing datasets suffer from issues ranging from small-scale, low-frequency, pre-training data contamination in unimodal designs to the temporal and description leakage prevalent in early multimodal designs. To address this, we formalize the core principles of high-fidelity benchmarking, focusing on data sourcing integrity, leak-free design, and structural clarity. We introduce Fidel-TS, a new large-scale benchmark built from these principles. Our experiments reveal the limitations of prior benchmarks and the potential discrepancies in model evaluation, providing new insights into multiple existing unimodal and multimodal forecasting models and LLMs across various evaluation tasks.

2509.22813 2026-05-11 cs.CV

TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses

Sahar Dastani, Ali Bahri, Gustavo Adolfo Vargas Hakim, Moslem Yazdanpanah, Mehrdad Noori, David Osowiechi, Samuel Barbeau, Ismail Ben Ayed, Herve Lombaert, Christian Desrosiers

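Affiliations LIVIA, ILLS, ÉTS Montréal; Mila - Quebec AI Institute; Polytechnique Montreal

AI summary TRUST uses different traversal permutations to generate multiple causal perspectives of the input image, improving the generalization of SSMs under distribution shift; it is the first method to explicitly exploit the unique architectural properties of SSMs for adaptation.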

Abstract

State Space Models (SSMs) have emerged as efficient alternatives to Vision Transformers (ViTs), with VMamba standing out as a pioneering architecture designed for vision tasks. However, their generalization performance degrades significantly under distribution shifts. To address this limitation, we propose TRUST (Test-Time Refinement using Uncertainty-Guided SSM Traverses), a novel test-time adaptation (TTA) method that leverages diverse traversal permutations to generate multiple causal perspectives of the input image. Model predictions serve as pseudo-labels to guide updates of the Mamba-specific parameters, and the adapted weights are averaged to integrate the learned information across traversal scans. Altogether, TRUST is the first approach that explicitly leverages the unique architectural properties of SSMs for adaptation. Experiments on seven benchmarks show that TRUST consistently improves robustness and outperforms existing TTA methods.
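
The adapt-then-average step can be sketched in a few lines. This is a schematic under assumptions, not the TRUST implementation: the pseudo-label update is abstracted into an `adapt_step` callable, and `toy_step`, the parameter name `ssm_scale`, and the traversal orders are hypothetical placeholders.

```python
def average_weights(weight_sets):
    """Element-wise mean of several parameter dictionaries with identical keys."""
    keys = weight_sets[0].keys()
    return {k: sum(w[k] for w in weight_sets) / len(weight_sets) for k in keys}


def adapt_and_average(base, traversals, adapt_step):
    """Adapt a fresh copy of the weights per traversal order, then average them."""
    adapted = [adapt_step(dict(base), order) for order in traversals]
    return average_weights(adapted)


def toy_step(weights, order):
    # Placeholder for a pseudo-label gradient update on Mamba-specific parameters:
    # it just shifts one scalar by a traversal-dependent amount.
    weights["ssm_scale"] += 0.5 * order[0]
    return weights


base = {"ssm_scale": 1.0}
traversals = [(0, 1, 2, 3), (1, 0, 3, 2), (2, 3, 0, 1), (3, 2, 1, 0)]
merged = adapt_and_average(base, traversals, toy_step)
```

Adapting from a copy of the same starting weights for every traversal keeps the per-traversal updates independent, so the average integrates them without any one scan order dominating.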

2509.21654 2026-05-11 cs.LG cs.AI cs.CC

Limitations on Accurate, Trusted, Human-level Reasoning

Rina Panigrahy, Vatsal Sharan

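Affiliations Google Research; University of Southern California

AI summary This paper studies a fundamental tension among accuracy, trust, and human-level reasoning in AI systems, and proves that an accurate and trusted system cannot achieve human-level reasoning.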

Comments 19 pages, 1 figure

Abstract

We identify a fundamental incompatibility between the goals of accuracy, trust, and human-level reasoning in artificial intelligence (AI) systems, for strict mathematical definitions of these notions. We define accuracy of a system as the property that it never makes any false claims when it has the ability to abstain from making a prediction on any input, and trust as the assumption that the system is accurate. We define human-level reasoning as the property of an AI system always matching or exceeding human capability. Our core finding is that, for our formal definitions of these notions, an accurate and trusted AI system cannot be a human-level reasoning system: for such an accurate, trusted system there are task instances which are easily and provably solvable by a human but not by the system. Our proofs draw parallels to Gödel's incompleteness theorems and Turing's proof of the undecidability of the halting problem, and can be regarded as interpretations of Gödel's and Turing's results. Key to our proof is the formalization of the notion of trust, which allows us to separate the intrinsic property of a system (being accurate) from its epistemic status (being trusted).

2509.21637 2026-05-11 cs.LG

BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning

Feng Yu, Jia Hu, Geyong Min

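Affiliations Department of Computer Science

AI summary BoHA adapts models through blockwise Hadamard products under a fixed parameter budget, outperforming LoRA on single-task averages and retaining first-stage task performance well in continual learning.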

Abstract

Parameter-efficient fine-tuning (PEFT) of large language models trains a small task-specific parameter set while keeping the pretrained model frozen. The dominant Low-Rank Adaptation (LoRA) family makes this trade-off practical; however, evaluations under the same parameter budget assess only single-task accuracy. In sequential adaptation settings, such evaluations should also measure how well performance on the first-stage task is retained after subsequent fine-tuning. To address this gap, we introduce BoHA, a blockwise $W_0$-coupled Hadamard product adapter that treats spatial support as an explicit design axis. BoHA partitions the frozen weight $W_0$ into a $b \times b$ grid and learns an independent low-rank Hadamard product factor in each block, preserving a matched LoRA-equivalent total rank with adapter-free merged inference. On a synthetic target, BoHA at per-block rank $r_b = 1$ exactly reconstructs an update that requires rank $b^2$ under the global $W_0$-coupled Hadamard parameterization. Across Llama-3.2-1B/3B, Mistral-7B, and Gemma-2-9B on commonsense and arithmetic reasoning tasks, BoHA outperforms LoRA across all matched-budget single-task averages and remains competitive with the strongest Hadamard baseline. On a Llama-3.2-3B commonsense $\to$ arithmetic continual-learning diagnostic, BoHA retains 57.66% first-stage accuracy and exceeds the $W_0$-free additive-control mean by 15.23% under matched second-stage plasticity. These results demonstrate that blockwise $W_0$-coupled Hadamard adaptation is a competitive PEFT design choice when retention under sequential adaptation is part of the objective.
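
The synthetic-target claim rests on a linear-algebra fact that is easy to check by hand: a matrix whose $b \times b$ grid of blocks are each rank 1, with independent factors per block, can have global rank $b^2$. The sketch below verifies this for $b = 2$ with exact rational arithmetic. It is not the BoHA implementation; the factor values, the frozen weight `W0`, and the assumed update form $\Delta W = W_0 \odot M$ are illustrative assumptions.

```python
from fractions import Fraction


def rank(matrix):
    """Matrix rank via exact Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for c in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


def blockwise_rank1(us, vs):
    """Build a matrix whose (i, j) block is the outer product of us[i][j] and vs[i][j]."""
    b, size = len(us), len(us[0][0])
    out = [[0] * (b * size) for _ in range(b * size)]
    for i in range(b):
        for j in range(b):
            for p in range(size):
                for q in range(size):
                    out[i * size + p][j * size + q] = us[i][j][p] * vs[i][j][q]
    return out


def hadamard(a, b):
    """Element-wise product of two equally sized matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Hypothetical per-block rank-1 factors for a 2 x 2 grid of 2 x 2 blocks.
us = [[[1, 0], [0, 1]], [[1, 1], [1, -1]]]
vs = [[[1, 0], [1, 0]], [[0, 1], [0, 1]]]
M = blockwise_rank1(us, vs)

W0 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]  # made-up frozen weight
delta_W = hadamard(W0, M)  # assumed W0-coupled update form

global_rank = rank(M)                              # 4, i.e. b squared
block_rank = rank([row[:2] for row in M[:2]])      # 1 for the top-left block
```

A single global low-rank factorization would need rank 4 to represent `M`, whereas the blockwise form stores only one pair of vectors per block, which is the gap the synthetic experiment is described as exposing.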

2509.21172 2026-05-11 cs.LG econ.EM math.OC stat.ML

Inverse Reinforcement Learning with Just Classification and a Few Regressions

Lars van der Laan, Nathan Kallus, Aurelien Bibaut

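Affiliations Department of Statistics, University of Washington; Netflix; Cornell University

AI summary This paper proposes GenPQR, which recovers rewards in inverse reinforcement learning with modular classification and regression methods, without relying on anchor-action restrictions or specialized neural architectures, and with theory covering a broader class of statewise affine normalizations.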

Abstract

Inverse reinforcement learning (IRL) aims to infer rewards from observed behavior, but rewards are not identified from the policy alone: many reward-value pairs can rationalize the same actions. Meaningful reward recovery therefore requires a normalization, yet existing normalized IRL methods often rely on anchor-action restrictions or specialized neural architectures. We study reward recovery in the maximum-entropy, or Gumbel-shock, model under a broad class of statewise affine normalizations, with anchor-action constraints as a special case. This yields Generalized Policy-to-$Q$-to-Reward (GenPQR), a modular procedure that estimates the behavior policy, evaluates its soft $Q$-function through the Bellman equation, and recovers the normalized reward. Both stages can be implemented with off-the-shelf classification and regression methods. We prove modular finite-sample guarantees under general function approximation, with separate policy-estimation and $Q$-estimation errors. As a concrete instantiation, we study GenPQR with fitted $Q$-evaluation, reducing IRL to policy estimation followed by regression. Experiments show that GenPQR matches or improves reward recovery relative to DeepPQR while remaining simpler and more modular. Compared with DeepPQR, our theory goes beyond anchor actions, accommodates large and continuous action spaces, makes coverage requirements explicit, and is not tied to a specific neural-network architecture or training procedure.
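
The policy-to-$Q$-to-reward chain can be traced on a tiny tabular problem. The sketch below covers only the anchor-action special case, assumes the policy and transition probabilities are known exactly (the paper estimates them with classification and regression), and uses an invented two-state, two-action MDP. With $r(s, a_0) = 0$, the soft value satisfies $V(s) = \gamma\,\mathbb{E}[V(s') \mid s, a_0] - \log \pi(a_0 \mid s)$, after which $Q(s,a) = V(s) + \log \pi(a \mid s)$ and $r(s,a) = Q(s,a) - \gamma\,\mathbb{E}[V(s') \mid s, a]$.

```python
import math

GAMMA = 0.9
# Hypothetical MDP: P[s][a][s2] is the probability of moving from s to s2 under a.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
TRUE_R = [[0.0, 1.0], [0.0, -0.5]]  # anchor action 0 has reward 0 in every state
S, A = len(P), len(P[0])


def expect(v, s, a):
    """Expected next-state value E[v(s') | s, a]."""
    return sum(p * v[s2] for s2, p in enumerate(P[s][a]))


def soft_policy(reward, iters=500):
    """Soft value iteration; returns pi[s][a] = exp(Q(s, a) - V(s))."""
    v = [0.0] * S
    for _ in range(iters):
        q = [[reward[s][a] + GAMMA * expect(v, s, a) for a in range(A)] for s in range(S)]
        v = [math.log(sum(math.exp(x) for x in row)) for row in q]
    return [[math.exp(q[s][a] - v[s]) for a in range(A)] for s in range(S)]


def recover_reward(pi, anchor=0, iters=500):
    """Policy -> soft V and Q -> reward, normalized so that r(s, anchor) = 0."""
    v = [0.0] * S
    for _ in range(iters):
        v = [GAMMA * expect(v, s, anchor) - math.log(pi[s][anchor]) for s in range(S)]
    q = [[v[s] + math.log(pi[s][a]) for a in range(A)] for s in range(S)]
    return [[q[s][a] - GAMMA * expect(v, s, a) for a in range(A)] for s in range(S)]


pi = soft_policy(TRUE_R)
r_hat = recover_reward(pi)
max_error = max(abs(r_hat[s][a] - TRUE_R[s][a]) for s in range(S) for a in range(A))
```

The normalization is what makes recovery well posed here: without fixing the anchor reward, adding any state-dependent shaping term would leave the policy unchanged and the reward unidentified.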

2509.12047 2026-05-11 cs.CV cs.AI

A Computer Vision Pipeline for Individual-Level Behavior Analysis: Benchmarking on the Edinburgh Pig Dataset

Haiyu Yang, Enhong Liu, Jennifer Sun, Sumit Sharma, Meike van Leerdam, Sebastien Franceschini, Puchun Niu, Miel Hostens

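Affiliations Department of Animal Science, College of Agriculture and Life Sciences, Cornell University; University of Liège, Gembloux Agro-Bio Tech (ULiège-GxABT)

AI summary This paper presents a modular pipeline that uses state-of-the-art computer vision techniques to automate animal behavior analysis in group-housing environments; validated on the Edinburgh Pig Behavior Video Dataset, it reaches 94.2% overall accuracy, outperforming existing methods.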

Comments 9 figures

Abstract

Animal behavior analysis plays a crucial role in understanding animal welfare, health status, and productivity in agricultural settings. However, traditional manual observation methods are time-consuming, subjective, and limited in scalability. We present a modular pipeline that leverages open-sourced state-of-the-art computer vision techniques to automate animal behavior analysis in a group housing environment. Our approach combines state-of-the-art models for zero-shot object detection, motion-aware segmentation and tracking, and advanced feature extraction using vision transformers for robust behavior recognition. The pipeline addresses challenges including animal occlusions and group housing scenarios, as demonstrated in indoor pig monitoring. We validated our system on the Edinburgh Pig Behavior Video Dataset for multiple behavioral tasks. Our temporal model achieved 94.2% overall accuracy, representing a 21.2 percentage point improvement over existing methods. The pipeline demonstrated robust tracking capabilities with a 93.3% identity preservation (IDF1) score and an 89.3% average precision (AP) for object detection. The modular design suggests potential for adaptation to other contexts, though further validation across species would be required. The open-source implementation provides a scalable solution for behavior monitoring, contributing to precision pig farming and welfare assessment through automated, objective, and continuous analysis.
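
For readers unfamiliar with the tracking metric, IDF1 is the harmonic mean of identity precision and identity recall, computed as $2\,\mathrm{IDTP} / (2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN})$. The sketch below implements that standard definition; the counts are invented so that the result lands on 0.933 and are not taken from the paper. The last line works out the prior accuracy implied by the stated 21.2-point gain.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1 from identity true positives, false positives, false negatives."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        raise ValueError("no detections and no ground truth")
    return 2 * idtp / denom


# Hypothetical counts chosen for illustration only.
example = idf1(idtp=933, idfp=67, idfn=67)

# Accuracy of the prior best method implied by the abstract's numbers.
implied_baseline = round(94.2 - 21.2, 1)
```

Unlike detection AP, IDF1 penalizes identity switches over the whole track, which is why it is the relevant number for individual-level behavior analysis in group housing.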

2509.11777 2026-05-11 cs.CL cs.LG

User eXperience Perception Insights Dataset (UXPID): Synthetic User Feedback from Public Industrial Forums

Mikhail Kulyabin, Jan Joosten, Choro Ulan uulu, Nuno Miguel Martins Pacheco, Fabian Ries, Filippos Petridis, Jan Bosch, Helena Holmström Olsson

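Affiliations Siemens AG; Eindhoven University of Technology; Chalmers University of Technology; Malmö University

AI summary This paper presents the UXPID dataset of 7130 synthesized and anonymized user feedback branches for research on user requirements, user experience analysis, and AI-driven feedback processing, which is especially valuable where privacy and licensing restrictions limit access to real-world data.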

Abstract

Customer feedback in industrial forums offers rich but underexplored insights into real-world product experience. Yet systematic analysis remains challenging due to unstructured, domain-specific content and the scarcity of high-quality labeled datasets. This paper presents the User eXperience Perception Insights Dataset (UXPID), a collection of 7130 synthesized and anonymized user feedback branches extracted from a public industrial automation forum. Each JSON record contains multi-post comments enriched with metadata and annotated by a large language model (LLM) for UX insights, user expectations, severity ratings, sentiment, and topic classifications. UXPID is designed to facilitate research in user requirements, user experience (UX) analysis, and AI-driven feedback processing, particularly where privacy and licensing restrictions limit access to real-world data. It supports the training and evaluation of transformer-based models for tasks such as issue detection, sentiment analysis, and requirements extraction in technical forums, providing a valuable resource for advancing NLP methods within industrial product support and software engineering domains.

2509.08318 2026-05-11 cs.CV

CalexNet: Soft Cascade-Aligned Training and Calibration for Lightweight Early-Exit Branches

CalexNet: 轻量级早期退出分支的软级联对齐训练与校准

Yehudit Aperstein, Alexander Apartsin

发表机构 * Intelligent Systems, Afeka Academic College of Engineering(阿法卡学术工程学院智能系统) School of Computer Science, Faculty of Sciences, Holon Institute of Technology(霍洛恩理工学院计算机科学学院)

AI总结 CalexNet通过软级联对齐训练与校准方法,解决早期退出分支在训练与推理中的不匹配问题,提升精度与效率,适用于ResNet18和ResNet50在CIFAR-100和CINIC-10上的表现。

Comments 19 pages, 6 figures

详情
AI中文摘要

早期退出级联在冻结的卷积主干上实现了自适应推理,但存在三个训练-推理不匹配的来源:分支在推理时不会见到的样本上训练,其每类精度阈值在错误的分布上校准,以及基于主干argmax标签的标准交叉熵目标丢弃了主干的不确定性信号。CalexNet(级联对齐早期退出)通过仅修改训练配方解决了这三个问题:分支在连续加权重要性采样下训练以匹配级联幸存者分布;每类精度阈值在验证集的实际级联幸存者子集上校准;并以主干的完整softmax为目标,通过温度缩放的KL目标训练分类头。结合增强的原型池化分支头,CalexNet在ResNet18和ResNet50主干上跨CIFAR-100(20超类粗粒度,较难的主设置)和CINIC-10(10类,较易的交叉验证对照设置)进行评估。在精度-FLOPs帕累托前沿上,CalexNet匹配或超过三个已发表的基线(PTEEnet、ZTW、BoostNet)和论文内的“无对齐、无KD”参考。最大的增益出现在实际相关的30-70% FLOPs减少范围内,并在n=3训练种子下保持稳定。CalexNet不需要推理时的架构更改,是任何冻结主干早期退出级联的即插即用方案。

英文摘要

Early-exit cascades over a frozen convolutional backbone enable adaptive inference but suffer from three sources of train-inference mismatch: branches train on samples they will never see at inference, their per-class precision thresholds are calibrated on the wrong distribution, and the standard cross-entropy target on backbone argmax labels discards the backbone's uncertainty signal. We close all three gaps with CalexNet (Cascade-Aligned Early eXits), a training-recipe-only modification: branches train under continuously-weighted importance sampling that matches the cascade-survivor distribution; per-class precision thresholds are calibrated on the actual cascade-survivor subset of the validation set; and the classification head is trained against the backbone's full softmax via a temperature-scaled KL objective. Combined with an augmented prototype-pooling branch head, CalexNet is evaluated on ResNet18 and ResNet50 backbones across CIFAR-100 (20-superclass coarse, the harder primary setting) and CINIC-10 (10-class, the easier cross-validation counterpart). On the accuracy-FLOPs Pareto frontier, CalexNet matches or exceeds three published baselines (PTEEnet, ZTW, BoostNet) and a within-paper "no-alignment, no-KD" reference. The largest gains appear in the practically relevant 30-70% FLOPs-reduction regime and are stable across n=3 training seeds. CalexNet requires no inference-time architectural change and is a drop-in for any frozen-backbone early-exit cascade.
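
The third ingredient above, training a branch head against the backbone's full softmax with a temperature-scaled KL objective, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the default temperature, and the `T**2` scaling (a common knowledge-distillation convention) are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(backbone_logits, branch_logits, temperature=2.0):
    """KL(backbone || branch) on temperature-softened distributions.

    The T**2 factor is the usual distillation convention (assumed here);
    it keeps gradient magnitudes comparable across temperatures.
    """
    p = softmax(backbone_logits, temperature)  # soft target from the frozen backbone
    q = softmax(branch_logits, temperature)    # early-exit branch prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl
```

Unlike cross-entropy on argmax labels, this target keeps the backbone's relative confidence over non-top classes, which is the uncertainty signal the abstract refers to.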

2509.05276 2026-05-11 cs.LG cs.AI cs.CL

SpikingBrain: Spiking Brain-inspired Large Models

SpikingBrain:脉冲类脑大模型

Yuqi Pan, Yupeng Feng, Jinghao Zhuang, Siyu Ding, Han Xu, Zehao Liu, Bohan Sun, Yuhong Chou, Xuerui Qiu, Anlin Deng, Anjie Hu, Shurong Wang, Peng Zhou, Man Yao, Jibin Wu, Jian Yang, Guoliang Sun, Bo Xu, Guoqi Li

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Beijing Key Laboratory of Brain-Inspired General Intelligence Large Model(北京脑启发通用智能大模型重点实验室) Key Laboratory of Brain Cognition and Brain-inspired Intelligence Technology(脑认知与脑启发智能技术重点实验室) Beijing Academy of Artificial Intelligence(北京人工智能研究院) The Hong Kong Polytechnic University(香港理工大学) Zhongguancun Academy(中关村学院) Beihang University(北航) Zhejiang University(浙江大学) LuxiTech MetaX Integrated Circuit Co., Ltd(MetaX集成电路有限公司)

AI总结 本文提出SpikingBrain,一种受大脑启发的大模型,通过高效架构和算法优化,在非NVIDIA平台实现大规模LLM训练与推理,提升长上下文处理效率。

详情
AI中文摘要

主流基于Transformer的大语言模型面临效率瓶颈:训练计算量随序列长度平方增长,推理内存线性增长,限制长上下文处理。为解决此问题,我们引入SpikingBrain,一种受大脑启发的模型,用于高效长上下文训练与推理。SpikingBrain利用MetaX GPU集群,聚焦三个方面:(1)模型架构:线性与混合线性注意力架构,结合自适应脉冲神经元;(2)算法优化:高效、基于转换的训练流水线和专用脉冲编码框架;(3)系统工程:定制化的训练框架、运算库和并行策略,针对MetaX硬件。通过这些技术,我们开发了两种模型:SpikingBrain-7B,一种线性LLM,和SpikingBrain-76B,一种混合线性MoE LLM。这些模型展示了在非NVIDIA平台开发大规模LLM的可行性,且在数百个MetaX GPU上训练稳定,模型FLOPs利用率处于预期水平。SpikingBrain在持续预训练仅使用约1500亿token的情况下,性能可与开源Transformer基线相比。我们的模型还显著提升了长上下文效率,并实现了(部分)常数内存和事件驱动的脉冲行为的推理。例如,SpikingBrain-7B在400万token序列上实现了超过100倍的Time to First Token提速。此外,所提出的脉冲方案实现了69.15%的稀疏性,使设备能够低功耗运行。总体而言,本文展示了受大脑启发机制在下一代高效可扩展大模型设计中的潜力。

英文摘要

Mainstream Transformer-based large language models face major efficiency bottlenecks: training computation scales quadratically with sequence length, and inference memory grows linearly, limiting long-context processing. Building large models on non-NVIDIA platforms also poses challenges for stable and efficient training. To address this, we introduce SpikingBrain, a family of brain-inspired models designed for efficient long-context training and inference. SpikingBrain leverages the MetaX GPU cluster and focuses on three aspects: (1) Model Architecture: linear and hybrid-linear attention architectures with adaptive spiking neurons; (2) Algorithmic Optimizations: an efficient, conversion-based training pipeline and a dedicated spike coding framework; (3) System Engineering: customized training frameworks, operator libraries, and parallelism strategies tailored to MetaX hardware. Using these techniques, we develop two models: SpikingBrain-7B, a linear LLM, and SpikingBrain-76B, a hybrid-linear MoE LLM. These models demonstrate the feasibility of large-scale LLM development on non-NVIDIA platforms, and training remains stable for weeks on hundreds of MetaX GPUs with Model FLOPs Utilization at expected levels. SpikingBrain achieves performance comparable to open-source Transformer baselines while using only about 150B tokens for continual pre-training. Our models also significantly improve long-context efficiency and deliver inference with (partially) constant memory and event-driven spiking behavior. For example, SpikingBrain-7B attains over 100x speedup in Time to First Token for 4M-token sequences. Furthermore, the proposed spiking scheme achieves 69.15 percent sparsity, enabling low-power operation. Overall, this work demonstrates the potential of brain-inspired mechanisms to drive the next generation of efficient and scalable large model design.

2509.03738 2026-05-11 cs.LG cs.AI eess.SP stat.ML

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

基于稀疏自编码器的神经算子机制可解释性

Bahareh Tolooshams, Ailsa Shen, Anima Anandkumar

发表机构 * University of Alberta(阿尔伯塔大学) Alberta Machine Intelligence Institute (Amii)(阿尔伯塔机器智能研究所(Amii)) California Institute of Technology (Caltech)(加州理工学院(Caltech))

AI总结 本文提出稀疏自编码器神经算子(SAE-NOs),通过函数空间而非欧几里得空间进行操作,利用联合稀疏性实现对概念的函数化表示,提升对输入域内概念表达的建模能力。

Comments Tolooshams and Shen have equal contribution. Preprint. An earlier version was presented as an Oral and Extended Abstract at the Workshop on Unifying Representations in Neural Models (UniReps 2025) at NeurIPS

详情
AI中文摘要

我们引入稀疏自编码器神经算子(SAE-NOs),一类新的稀疏自编码器,其在函数空间而非固定维数的欧几里得表示中进行操作。我们形式化了函数表示假设,即数据通过结构化函数的稀疏组合来解释。与标准SAEs通过标量激活表示概念不同,SAE-NOs将概念参数化为函数,使表示不仅能够捕捉概念是否存在,还能捕捉其在输入域中如何以及在何处表达。我们通过联合稀疏性实现这一点:概念稀疏性选择活跃的概念,而域稀疏性控制其表达的位置。我们通过傅里叶神经算子(SAE-FNOs)实例化这一框架,将概念参数化为傅里叶域中的积分算子。这种函数化和频谱参数化在数据具有多尺度空间结构或概念具有频率结构时特别有效。我们在视觉数据上对SAE-FNO进行了表征分析,并证明其能学习局部模式、更高效地使用概念,并在不同稀疏性水平上表现出稳定的概念特性。我们进一步表明,SAE-FNO能够适应域大小的变化,并在不同离散化之间泛化,可在超过训练期间所见的分辨率下运行,而标准SAEs在此时失效。我们还在SAE中引入了提升(lifting)操作,并从理论和实验上证明其作为预条件子可加速优化。总体而言,我们的结果表明,从向量值参数化转向函数参数化,并结合概念稀疏性和域稀疏性,将SAEs从表示概念存在扩展到建模结构化的概念表达,突显了参数化的重要性。

英文摘要

We introduce sparse autoencoder neural operators (SAE-NOs), a new class of sparse autoencoders that operate in function spaces rather than fixed-dimensional Euclidean representations. We formalize the functional representation hypothesis, where data are explained through sparse compositions of structured functions. Unlike standard SAEs that represent concepts with scalar activations, SAE-NOs parameterize concepts as functions, enabling representations that capture not only a concept's presence, but also how and where it is expressed across the input domain. We achieve this through joint sparsity: concept sparsity selects active concepts, while domain sparsity governs where they are expressed. We instantiate this framework using Fourier neural operators (SAE-FNOs), parameterizing concepts as integral operators in the Fourier domain. This functional and spectral parameterization is particularly advantageous when data exhibit spatial structure across scales or when concepts are frequency-structured. We characterize SAE-FNO on vision data and demonstrate that it learns localized patterns, uses concepts more efficiently, and exhibits stable concept characteristics across sparsity levels. We further show that SAE-FNO adapts to changes in domain size and generalizes across discretizations, operating at resolutions beyond those seen during training, where standard SAEs fail. We also introduce lifting into SAEs and show theoretically and empirically that it acts as a preconditioner that accelerates optimization. Overall, our results show that moving from vector-valued to functional parameterizations, with concept and domain sparsity, extends SAEs from representing concept presence to modeling structured concept expression, highlighting the importance of parameterization.

2508.21466 2026-05-11 cs.LG cs.IT math.IT

Normalized Maximum Likelihood Code-Length on Riemannian Data Spaces

在黎曼数据空间上的归一化最大似然代码长度

Kota Fukuzawa, Atsushi Suzuki, Kenji Yamanishi

发表机构 * The University of Tokyo(东京大学) NTT, Inc.(NTT公司) The University of Hong Kong(香港大学)

AI总结 本文提出在黎曼流形上适应的归一化最大似然代码长度(Rm-NML),克服了传统方法对坐标系的依赖,适用于具有层次结构的图数据,如双曲空间。

Comments 19 pages. This is a preprint of an article accepted for publication in the IEEE Transactions on Information Theory

详情
AI中文摘要

近年来,随着图数据的大规模扩展,欧几里得空间之外的黎曼流形数据空间受到越来越多的关注。特别是双曲空间发展显著,对具有层次结构的图数据具有很强的表达能力。归一化最大似然(NML)用于遗憾最小化和模型选择。然而,现有的NML公式主要在欧几里得空间中开发,并且本质上依赖于坐标系的选择,使将其扩展到黎曼流形变得非平凡。在本研究中,我们定义了一种新的NML,称为黎曼流形NML(Rm-NML),以反映黎曼流形的几何结构。Rm-NML在坐标变换下不变,并且在欧几里得空间的自然参数化下与传统NML一致。我们将现有NML的计算技术扩展到黎曼流形的设置。此外,我们推导了一种简化黎曼对称空间上Rm-NML计算的方法,这类空间涵盖了双曲空间等日益受到关注的数据空间。为了说明所提方法的实际应用,我们显式计算了双曲空间上正态分布的Rm-NML。

英文摘要

In recent years, with the large-scale expansion of graph data, there has been an increased focus on Riemannian manifold data spaces other than Euclidean space. In particular, the development of hyperbolic spaces has been remarkable, and they have high expressive power for graph data with hierarchical structures. Normalized Maximum Likelihood (NML) is employed in regret minimization and model selection. However, existing formulations of NML have been developed primarily in Euclidean spaces and are inherently dependent on the choice of coordinate systems, making it non-trivial to extend NML to Riemannian manifolds. In this study, we define a new NML that reflects the geometric structure of Riemannian manifolds, called the Riemannian manifold NML (Rm-NML). This Rm-NML is invariant under coordinate transformations and coincides with the conventional NML under the natural parameterization in Euclidean space. We extend existing computational techniques for NML to the setting of Riemannian manifolds. Furthermore, we derive a method to simplify the computation of Rm-NML on Riemannian symmetric spaces, which encompass data spaces of growing interest such as hyperbolic spaces. To illustrate the practical application of our proposed method, we explicitly computed the Rm-NML for normal distributions on hyperbolic spaces.
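
For reference, the conventional NML that the abstract takes as its starting point can be written as below. The notation ($x^n$ for the data, $\hat{\theta}$ for the maximum likelihood estimator, $C_n$ for the normalizer) is chosen here for illustration and may differ from the paper's.

```latex
p_{\mathrm{NML}}(x^n) = \frac{p\bigl(x^n;\hat{\theta}(x^n)\bigr)}{C_n},
\qquad
C_n = \int p\bigl(y^n;\hat{\theta}(y^n)\bigr)\,\mathrm{d}y^n,
\qquad
L_{\mathrm{NML}}(x^n) = -\log p\bigl(x^n;\hat{\theta}(x^n)\bigr) + \log C_n .
```

Both the density and the integral are taken with respect to a reference measure $\mathrm{d}y^n$ on the data space, which in Euclidean space is tied to a choice of coordinates. This is where the coordinate dependence mentioned in the abstract enters; how Rm-NML removes it is specified in the paper and is not reproduced here.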

2508.20909 2026-05-11 cs.CV eess.IV

Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation

Dino U-Net:利用基础模型的高保真密集特征进行医学图像分割

Haoyue Li, Yifan Gao, Feng Yuan, Xiaosong Wang, Xin Gao

发表机构 * School of Biomedical Engineering (Suzhou), Division of Life Science and Medicine, University of Science and Technology of China, Hefei, China(生物医学工程学院(苏州),生命科学与医学系,中国科学技术大学,合肥,中国) Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences, Suzhou, China(苏州生物医学工程与技术研究所,中国科学院,苏州,中国) Shanghai Innovation Institute, Shanghai, China(上海创新研究院,上海,中国) Medical School of Tianjin University, Tianjin, China(天津大学医学院,天津,中国) Jinan Guoke Medical and Technology Development Co., Ltd., Pharmaceutical Valley New Drug Creation Platform, Jinan, China(济南国科医药科技发展有限公司,药谷新药创制平台,济南,中国)

AI总结 本文提出Dino U-Net,通过融合DINOv3模型的语义特征与低层空间细节,提升医学图像分割精度,实验表明其在多种影像模态中均优于现有方法。

Comments MICCAI 2026

详情
AI中文摘要

预训练于大规模自然图像数据集的基础模型为医学图像分割提供了强大范式。然而,有效迁移其学习表示以实现精确临床应用仍具挑战。本文提出Dino U-Net,一种新型编码器-解码器架构,旨在利用DINOv3视觉基础模型的高保真密集特征。我们的架构引入一个基于冻结DINOv3主干的编码器,采用专用适配器融合模型的丰富语义特征与低层空间细节。为在降维过程中保持表示质量,我们设计了新的保真度感知投影模块(FAPM),有效精炼并投影特征以供解码器使用。我们在七个多样化的公共医学图像分割数据集上进行了广泛实验。结果表明,Dino U-Net在各种成像模态中均取得最佳性能,优于先前方法。我们的框架证明具有高度可扩展性,随着主干模型大小增加至70亿参数变体,分割准确性持续提升。研究发现利用通用基础模型的优越密集预训练特征,为提升医学图像分割精度提供了一种高效且参数经济的方法。代码可在https://github.com/yifangao112/DinoUNet获取。

英文摘要

Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model's rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.

2508.16571 2026-05-11 cs.AI cs.IR cs.MA

LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence

基于大语言模型的竞品映射 agent 在药物资产尽调中的应用

Vlad Vinogradov, Alisa Vinogradova, Dmitrii Radkevich, Ilya Yasny, Dmitry Kobyzev, Ivan Izmailov, Katsiaryna Yanchanka, Roman Doronin, Andrey Doronichev

发表机构 * LanceBio Ventures AI Expert

AI总结 本文提出一种基于大语言模型的竞品发现 agent,通过结构化数据处理提升药物资产尽调效率,实现竞品检索与属性标准化,实验结果显示其召回率优于现有系统。

详情
AI中文摘要

本文描述并评估了用于代理AI系统中的竞品发现组件,用于快速进行药物资产尽调。一个竞品发现AI agent,给定一个适应症,检索该适应症的全部竞争性药物并提取这些药物的规范属性。竞品定义是投资者特定的,数据是付费的/授权的,分散在注册处,按适应症存在本体不匹配,药物名称别名多,多模态且迅速变化。尽管被认为是该问题的最佳工具,当前基于LLM的AI系统无法可靠地检索所有竞争性药物名称,且没有被接受的公开基准。为解决缺乏评估的问题,我们使用基于LLM的agent将五年多模态、非结构化尽调备忘录转换为结构化评估语料库,将适应症映射到竞争性药物并归一化属性。我们还引入了竞品验证的LLM-as-a-judge agent,用于过滤预测竞争者的假阳性以提高精度并抑制幻觉。在该基准上,我们的竞品发现agent实现了83%的召回率,优于OpenAI Deep Research(65%)和Perplexity Labs(60%)。该系统已在企业用户中部署;在与生物技术VC投资基金的案例研究中,分析师周转时间从2.5天降至约3小时(约20倍)。

英文摘要

In this paper, we describe and benchmark a competitor-discovery component used within an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and data is paywalled/licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, the current LLM-based AI systems aren't capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multi-modal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor validating LLM-as-a-judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor-discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to $\sim$3 hours ($\sim$20x) for the competitive analysis.
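
The recall figures quoted above depend on matching predicted and reference drug names despite aliases. A minimal sketch of that metric, assuming a simple alias table; the helper names and the placeholder drug names in the usage note are hypothetical and do not come from the paper's benchmark:

```python
def normalize(name, aliases):
    """Lower-case a drug name and map known aliases to one canonical key."""
    key = name.strip().lower()
    return aliases.get(key, key)

def recall(predicted, gold, aliases=None):
    """Fraction of reference competitors that appear among the predictions."""
    aliases = aliases or {}
    gold_set = {normalize(n, aliases) for n in gold}
    pred_set = {normalize(n, aliases) for n in predicted}
    if not gold_set:
        raise ValueError("gold list must not be empty")
    return len(gold_set & pred_set) / len(gold_set)
```

For example, `recall(["Drug-A", "DB-1"], ["drug-a", "drug-b", "drug-c"], {"db-1": "drug-b"})` counts two of the three reference drugs as found. Recall ignores false positives, which is why the abstract pairs the discovery agent with a separate judge agent that filters them.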

2507.21183 2026-05-11 cs.LG cs.AI cs.CL

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

MaPPO:结合先验知识的最大后验偏好优化

Guangchen Lan, Sipeng Zhang, Tianle Wang, Yuwei Zhang, Daoan Zhang, Xinpeng Wei, Xiaoman Pan, Hongming Zhang, Dong-Jun Han, Christopher G. Brinton

发表机构 * Purdue University(普渡大学) University of California, San Diego(加州大学圣地亚哥分校) Amazon(亚马逊) Meta, FAIR Yonsei University(延世大学)

AI总结 MaPPO通过整合先验奖励知识优化偏好学习,提升对齐性能,无需额外超参数,适用于离线和在线设置,支持DPO变体插件,实验显示在多个基准上效果显著。

详情
AI中文摘要

随着大语言模型(LLMs)时代的到来,偏好优化(PO)方法已成为对齐LLMs与人类偏好和提升性能的核心方法。我们提出了最大后验偏好优化(MaPPO),一种学习偏好时明确整合先验奖励知识的优化方法。基于直接偏好优化(DPO)及其变体将偏好学习视为最大似然估计(MLE)问题的范式,MaPPO将先验奖励估计整合到一个原则性的最大后验(MaP)目标中。这不仅扩展了DPO及其变体,还通过缓解响应的二元分类过度简化问题增强了对齐效果。此外,MaPPO不引入额外超参数,并支持离线和在线设置的偏好优化。此外,MaPPO可作为DPO变体的插件,包括广泛使用的SimPO、IPO和CPO,产生一致的改进。对不同模型大小和系列在三个标准基准(MT-Bench、AlpacaEval 2.0和Arena-Hard)上的广泛实验证明了在不牺牲计算效率的情况下对齐性能的一致改进。

英文摘要

As the era of large language models (LLMs) unfolds, Preference Optimization (PO) methods have become a central approach to aligning LLMs with human preferences and improving performance. We propose Maximum a Posteriori Preference Optimization (MaPPO), a methodology for learning from preferences that explicitly incorporates prior reward knowledge into the optimization objective. Building on the paradigm employed by Direct Preference Optimization (DPO) and its variants of treating preference learning as a Maximum Likelihood Estimation (MLE) problem, MaPPO integrates prior reward estimates into a principled Maximum a Posteriori (MaP) objective. This not only generalizes DPO and its variants, but also enhances alignment by mitigating the oversimplified binary classification of responses. MaPPO introduces no additional hyperparameters and supports preference optimization in both offline and online settings. It can also be used as a plugin for DPO variants, including the widely used SimPO, IPO, and CPO, and produces consistent improvements. Extensive empirical evaluations of different model sizes and model series on three standard benchmarks (MT-Bench, AlpacaEval 2.0, and Arena-Hard) demonstrate consistent improvements in alignment performance without sacrificing computational efficiency.
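
As background for the MLE view that MaPPO builds on, the standard per-pair DPO loss can be sketched as below. This shows only the DPO baseline; the prior-reward term that MaPPO adds is defined in the paper and is deliberately not reproduced. Argument names and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    logp_*     : sequence log-probabilities under the policy being trained
    ref_logp_* : the same quantities under the frozen reference model
    Returns -log(sigmoid(margin)), evaluated in a numerically stable way.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss treats each pair as a binary classification of chosen versus rejected, which is the simplification the abstract says the MaP objective is meant to soften.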

2506.21582 2026-05-11 cs.CL cs.AI cs.HC

VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

VIDEE:基于智能代理的文本分析可视化、交互式分解、执行与评估系统

Sam Yu-Te Lee, Chenyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma

发表机构 * Department of Computer Science, University of California at Davis(加州大学戴维斯分校计算机科学系)

AI总结 VIDEE通过人机协作流程帮助初级分析师进行高级文本分析,包含分解、执行和评估三个阶段,结合人类反馈和LLM评估,验证了系统在非专家用户中的实用性。

详情
AI中文摘要

文本分析传统上需要自然语言处理(NLP)或文本分析的专业知识,这为初级分析师设定了障碍。近年来,大型语言模型(LLMs)的进展改变了NLP的格局,使更易于访问和自动化的文本分析成为可能(例如,主题检测、摘要、信息提取等)。我们介绍了VIDEE,一个支持初级数据分析师使用智能代理进行高级文本分析的系统。VIDEE实现了一个人-代理协作的工作流程,包含三个阶段:(1)分解,该阶段结合了人机协作的蒙特卡洛树搜索算法,以支持生成推理并利用人类反馈;(2)执行,该阶段生成可执行的文本分析流水线;(3)评估,该阶段整合基于LLM的评估和可视化,以支持用户验证执行结果。我们进行了两个定量实验来评估VIDEE的有效性,并分析常见的代理错误。一项涉及具有不同NLP和文本分析经验水平的参与者(从无到专家)的用户研究,证明了该系统的可用性,并揭示了不同的用户行为模式。研究结果确定了人-代理协作的设计启示,验证了VIDEE对非专家用户的实际效用,并为未来的智能文本分析系统改进提供了指导。

英文摘要

Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, and information extraction). We introduce VIDEE, a system that supports entry-level data analysts in conducting advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE's effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience -- from none to expert -- demonstrates the system's usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

2506.14399 2026-05-11 cs.CV cs.AI

Factored Classifier-Free Guidance

因子分类器无关引导

Tian Xia, Fabio De Sousa Ribeiro, Rajat R Rasal, Avinash Kori, Raghav Mehta, Ben Glocker

发表机构 * Department of Computing, Imperial College London, UK(伦敦帝国理工学院计算系)

AI总结 本文提出因子分类器无关引导方法,通过因果图实现属性级控制,提升反事实推理的严谨性和可逆性。

Comments Accepted at ICML 2026

详情
AI中文摘要

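反事实生成旨在模拟因果干预下的真实假设结果。扩散模型结合DDIM反演、条件生成和分类器无关引导(CFG),已成为该任务的有力工具。本文指出CFG用于反事实生成时的一个关键局限:它对所有属性使用同一个全局引导尺度,导致推断出的反事实出现显著的虚假变化。为此,我们提出因子分类器无关引导(FCFG),一种灵活且与模型无关的引导技术,可依据因果图实现属性级控制。FCFG与分类器无关引导的最新进展互补,并可无缝扩展到CFG++和APG等高级引导方案。实验表明,FCFG在自然图像和医学图像数据集上显著提升了推断反事实的公理合理性,缓解了虚假放大效应,并增强了反事实可逆性。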

英文摘要

Counterfactual generation aims to simulate realistic hypothetical outcomes under causal interventions. Diffusion models have emerged as a powerful tool for this task, combining DDIM inversion with conditional generation and classifier-free guidance (CFG). In this work, we identify a key limitation of CFG for counterfactual generation: it prescribes a global guidance scale for all attributes, leading to significant spurious changes in inferred counterfactuals. To mitigate this, we propose Factored Classifier-Free Guidance (FCFG), a flexible and model-agnostic guidance technique that enables attribute-wise control following a causal graph. FCFG complements recent advances in classifier-free guidance and can be seamlessly extended to advanced guidance schemes such as CFG++ and APG. Our experiments demonstrate that FCFG significantly improves the axiomatic soundness of inferred counterfactuals across both natural and medical image datasets, mitigating spurious amplification effects, and enhancing counterfactual reversibility.
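
To make the "global guidance scale" limitation concrete, the sketch below contrasts standard CFG with a per-group variant. `cfg` follows one common convention for classifier-free guidance (scale 1 recovers the conditional prediction). `grouped_cfg` is only a hypothetical illustration of attribute-wise scales; it is not FCFG's actual formula, which also uses the causal graph to form the groups.

```python
def cfg(eps_uncond, eps_cond, scale):
    """Standard CFG: one global scale applied to the whole conditioning signal."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def grouped_cfg(eps_uncond, eps_cond_by_group, scales):
    """Hypothetical per-group guidance: each attribute group gets its own scale.

    eps_cond_by_group maps a group name to the noise prediction conditioned
    on that group's attributes only; scales maps group names to floats.
    """
    out = list(eps_uncond)
    for group, eps_c in eps_cond_by_group.items():
        w = scales[group]
        out = [o + w * (c - u) for o, c, u in zip(out, eps_c, eps_uncond)]
    return out
```

With a single group, `grouped_cfg` reduces to `cfg`. With several groups, an intervened attribute can be guided strongly while the others are left near scale 1, which is the kind of control a single global scale cannot express.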

2506.13351 2026-05-11 cs.CL cs.AI cs.LG

Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

直接推理优化:基于推理反射的令牌级推理与评分门控用于不可验证任务

Yifei Xu, Tusher Chakraborty, Srinagesh Sharma, Leonardo Nunes, Swati Sharma, Kate Drakos Demopulos, Emre Kıcıman, Songwu Lu, Ranveer Chandra

发表机构 * Microsoft(微软公司) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文提出了一种约束强化学习框架,通过令牌级推理反射奖励和评分门控优化不可验证任务的推理质量与学习效率。

详情
AI中文摘要

大型语言模型在不可验证任务上的强化学习训练具有挑战性,即使有合理质量的参考答案。我们提出了一种约束强化学习框架,通过优化与推理质量对齐的令牌级密集推理反射奖励(R3),并在回放组级别施加评分门控作为可行性约束。R3衡量模型在其思维链(CoT)前缀下对参考答案的令牌级确定性,并选择性强调具有高跨回放方差的令牌,即推理反射令牌,这些令牌否则会被低方差令牌稀释。相同的方差信号还驱动一个过滤器,淘汰信号不足的查询以进行比较学习。评分门控通过将原则性任务标准作为最终答案的硬接受/拒绝检查来补充R3。实证结果显示,在涵盖科学写作、医学、法律合同和金融的四个数据集上,我们的框架优于强基线,实现了更快、更样本高效的训练,并遵守了可行性约束。

英文摘要

Reinforcement learning (RL) training of large language models (LLMs) on unverifiable tasks is challenging even when a reasonable-quality reference answer is available. We propose a constrained RL training framework that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its chain-of-thought (CoT) prefix, and selectively emphasizes tokens with high cross-rollout variance, which we call reasoning-reflective tokens, that would otherwise be diluted by the bulk of low-variance tokens. The same variance signal also drives a filter that discards queries with insufficient signal for comparative learning. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets spanning scientific writing, medicine, legal contracts, and finance, our framework outperforms strong baselines, achieves faster, more sample-efficient learning, and respects feasibility constraints.
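
The idea of emphasizing reference-answer tokens whose certainty varies across rollouts can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's R3 definition: `cert[r][t]` is assumed to hold the model's probability of reference token `t` given rollout `r`'s chain of thought, and the normalized-variance weighting is hypothetical.

```python
from statistics import pvariance

def variance_weights(cert):
    """Weight each reference token by its variance across rollouts."""
    n_tokens = len(cert[0])
    var = [pvariance([row[t] for row in cert]) for t in range(n_tokens)]
    total = sum(var)
    if total == 0.0:
        # No token separates the rollouts; fall back to uniform weights.
        return [1.0 / n_tokens] * n_tokens
    return [v / total for v in var]

def weighted_reward(cert):
    """Per-rollout reward: variance-weighted certainty of the reference answer."""
    w = variance_weights(cert)
    return [sum(wi * ci for wi, ci in zip(w, row)) for row in cert]
```

Tokens that every rollout predicts equally well get zero weight, so they cannot dilute the few tokens that actually separate good reasoning from bad. A query whose total variance is near zero carries little comparative signal, which matches the role of the variance-based filter described in the abstract.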

2506.09816 2026-05-11 cs.LG

Identifiability Challenges in Sparse Linear Ordinary Differential Equations

稀疏线性常微分方程中的可识别性挑战

Cecilia Casolo, Sören Becker, Niki Kilbertus

发表机构 * Technical University of Munich(慕尼黑工业大学) Helmholtz Munich(亥姆霍兹慕尼黑研究中心) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML))

AI总结 研究稀疏线性常微分方程的可识别性问题,发现其在实际稀疏度下存在正概率的不可识别性,并通过实验验证了这一理论结果对状态估计方法的影响。

Comments 9 pages, 4 figures

Journal ref The Fourteenth International Conference on Learning Representations, ICLR 2026

详情
AI中文摘要

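动力系统建模是自然科学和生命科学中科学探索的核心支柱。越来越多的动力系统模型从数据中学习,使得可识别性成为关键概念。对于无法从数据中识别的系统,无法保证其在新条件和输入下的行为或可能的控制机制。学界已知“线性常微分方程(ODE)几乎必然可由单条轨迹识别”,但这仅对稠密矩阵成立。稀疏情形在许多生物、社会和物理系统中自然出现,具有实际意义,却仍缺乏研究。本文通过刻画稀疏线性ODE的可识别性来填补这一空白。与稠密情形相反,我们证明稀疏系统在实际相关的稀疏度下以正概率不可识别,并给出该概率的下界。我们还通过实验研究了这种理论上的不可识别性如何体现在从数据估计线性ODE的最先进方法中。结果证实稀疏系统在实践中同样不可识别,理论局限无法通过归纳偏置或优化动态消除。我们的发现呼吁重新思考数据驱动动力系统建模的预期,并支持定量评估对所学线性ODE的信任程度。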

英文摘要

Dynamical systems modeling is a core pillar of scientific inquiry across natural and life sciences. Increasingly, dynamical system models are learned from data, rendering identifiability a paramount concept. For systems that are not identifiable from data, no guarantees can be given about their behavior under new conditions and inputs, or about possible control mechanisms to steer the system. It is known in the community that "linear ordinary differential equations (ODE) are almost surely identifiable from a single trajectory." However, this only holds for dense matrices. The sparse regime remains underexplored, despite its practical relevance with sparsity arising naturally in many biological, social, and physical systems. In this work, we address this gap by characterizing the identifiability of sparse linear ODEs. Contrary to the dense case, we show that sparse systems are unidentifiable with a positive probability in practically relevant sparsity regimes and provide lower bounds for this probability. We further study empirically how this theoretical unidentifiability manifests in state-of-the-art methods to estimate linear ODEs from data. Our results corroborate that sparse systems are also practically unidentifiable. Theoretical limitations are not resolved through inductive biases or optimization dynamics. Our findings call for rethinking what can be expected from data-driven dynamical system modeling and allow for quantitative assessments of how much to trust a learned linear ODE.
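
A small check illustrates why sparsity can break identifiability. The sketch assumes the rank criterion from earlier identifiability literature for $\dot{x} = Ax$, $x(0) = x_0$: the matrix $A$ is identifiable from the single trajectory exactly when $x_0, Ax_0, \dots, A^{n-1}x_0$ are linearly independent. The paper's own analysis may be formulated differently, and the helper names are illustrative.

```python
from fractions import Fraction

def matvec(A, x):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rank(rows):
    """Exact rank via Gaussian elimination over rationals."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, len(m)):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r

def krylov_rank(A, x0):
    """Rank of [x0, A x0, ..., A^(n-1) x0]; full rank n means identifiable."""
    vecs = [list(x0)]
    for _ in range(len(A) - 1):
        vecs.append(matvec(A, vecs[-1]))
    return rank(vecs)
```

For the sparse matrix `[[-1, 0, 0], [0, 0, 0], [0, 0, 0]]` with `x0 = [1, 2, 3]`, the rank is 2 rather than 3. Two state variables never move, so the trajectory cannot distinguish this matrix from others that also keep them constant along it. Random sparse matrices produce structures like this with positive probability, whereas a generic dense matrix does not.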

2505.22976 2026-05-11 cs.CV cs.AI

LoopNav: Benchmarking Spatial Consistency in World Models

LoopNav:世界模型中空间一致性评估的基准测试

Kewei Lian, Shaofei Cai, Yitao Liang, Anji Liu

发表机构 * NUS(新加坡国立大学) Peking University(北京大学)

AI总结 LoopNav提出一个基于循环导航的基准测试,用于评估世界模型的空间一致性,通过Minecraft开放世界环境收集250小时视频数据,并引入场景图一致性评分以量化空间一致性。

Comments V3: SGCS

详情
AI中文摘要

在空间上一致地模拟世界是有效世界模型的关键要求。此类模型能够实现高质量的视觉生成,并确保世界模型在模拟和规划等下游任务中的可靠性。它必须不仅保留长周期的观测信息,还能够构建显式或隐式的内部空间表示。然而,现有数据集并未显式施加空间一致性约束,限制了系统评估该能力以及通过数据驱动方法学习的能力。此外,大多数现有基准主要强调视觉一致性或生成质量,忽略了长距离空间一致性的要求。为弥合这一差距,我们提出了LoopNav,一个基于循环导航的基准测试集,用于评估空间一致性。该数据集包含250小时(2000万帧)的循环导航视频和动作,收集自Minecraft开放世界环境中的不同地点。我们进一步引入了场景图一致性评分,以量化空间一致性,同时对像素级变化保持不变。数据集、基准和代码均已开源,以支持未来研究。

英文摘要

The ability to simulate the world in a spatially consistent manner is a crucial requirement for effective world models. Such a model enables high-quality visual generation, and also ensures the reliability of world models for downstream tasks such as simulation and planning. It must not only retain long-horizon observational information, but also enable the construction of explicit or implicit internal spatial representations. However, existing datasets do not explicitly enforce spatial consistency constraints, limiting both the ability to systematically evaluate this capability and to learn it through data-driven approaches. Furthermore, most existing benchmarks primarily emphasize visual coherence or generation quality, neglecting the requirement of long-range spatial consistency. To bridge this gap, we propose LoopNav, a dataset and corresponding benchmark centered on loop-based navigation for evaluating spatial consistency. The dataset comprises 250 hours (20 million frames) of loop-based navigation videos with actions, collected from diverse locations in the open-world environment of Minecraft. We further introduce a Scene Graph Consistency Score to quantify spatial consistency while remaining invariant to pixel-level variations. Dataset, benchmark, and code are open-sourced to support future research.

2505.22842 2026-05-11 cs.CL cs.LG

Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

贝叶斯注意机制:一种用于位置编码和上下文长度外推的概率框架

Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü

发表机构 * MALTA – Machine Learning Theory and Applications Lab, PUCRS – Porto Alegre, Brazil(MALTA机器学习理论与应用实验室,PUCRS,阿雷格里港,巴西) Kunumi Institute, Brazil(昆米研究所,巴西)

AI总结 本文提出贝叶斯注意机制,通过概率模型将位置编码作为先验,统一了现有方法并改进了长上下文泛化能力,在500倍训练上下文长度下实现更准确的信息检索。

Comments Accepted to ICLR 2026

详情
AI中文摘要

基于Transformer的语言模型依赖位置编码(PE)来处理token顺序并支持上下文长度外推。然而,现有PE方法缺乏理论清晰度且依赖有限评估指标来支持其外推主张。本文提出贝叶斯注意机制(BAM),一种将位置编码作为概率模型中先验的概率框架。BAM统一了现有方法(如NoPE和ALiBi)并提出了一种通用高斯位置先验,显著改进了长上下文泛化能力。实验证明,BAM在500倍训练上下文长度下实现准确的信息检索,优于现有最先进长上下文检索精度,同时保持相近的困惑度并引入最少的额外参数。

英文摘要

Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
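
The "positional encoding as a prior" reading can be illustrated with a single causal attention row. Adding a log-prior to the content scores before the softmax multiplies each key's weight by that prior. The generalized Gaussian form `exp(-(|d| / alpha) ** beta)` is the standard density shape; how it maps onto the paper's exact parameterization is an assumption here, and all names are illustrative.

```python
import math

def log_prior(distance, alpha, beta):
    """Unnormalized log of a generalized Gaussian over token distance."""
    return -((abs(distance) / alpha) ** beta)

def attention_weights(scores, query_pos, alpha=None, beta=1.0):
    """softmax(scores + log prior) over keys at positions 0..len(scores)-1.

    alpha=None means a uniform prior, i.e. no positional bias (NoPE-like).
    beta=1 gives a linear distance penalty with slope 1/alpha (ALiBi-like).
    """
    logits = []
    for key_pos, s in enumerate(scores):
        bias = 0.0 if alpha is None else log_prior(query_pos - key_pos, alpha, beta)
        logits.append(s + bias)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

Under this view, NoPE and ALiBi differ only in the assumed prior over where relevant tokens sit. Changing `beta` changes how quickly the prior decays with distance, which is the kind of freedom the abstract's Generalized Gaussian prior adds.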

2505.14113 2026-05-11 cs.CV cs.LG

CONSIGN: Conformal Segmentation Informed by Spatial Groupings via Decomposition

CONSIGN:通过分解的空间分组指导的符合性分割

Bruno Viti, Elias Karabelas, Martin Holler

发表机构 * Department of Mathematics and Scientific Computing(数学与科学计算系) University of Graz(格拉茨大学) BioTechMed-Graz(BioTechMed-格拉茨) IDea_Lab

AI总结 CONSIGN通过分解的空间分组指导,改进图像分割的不确定性量化,采用符合性预测框架,结合空间相关性,提升分割性能和不确定性估计质量。

Comments Accepted as poster at ICLR 2026

详情
AI中文摘要

大多数基于机器学习的图像分割模型生成像素级置信度评分,代表模型对每个像素类标签的预测概率。虽然这些信息在高风险领域如医学影像中特别有价值,但这些评分本质上是启发式的,不构成严谨的定量不确定性估计。符合性预测(CP)提供了一个原则性的框架,将启发式置信度评分转化为统计上有效的不确定性估计。然而,直接将CP应用于图像分割会忽略像素之间的空间相关性,这是图像数据的基本特征。这可能导致过于保守且解释性较差的不确定性估计。为此,我们提出了CONSIGN(通过分解的空间分组指导的符合性分割),一种基于CP的方法,结合空间相关性以改进图像分割的不确定性量化。我们的方法生成具有用户指定的高概率误差保证的有意义预测集。它与任何预训练的分割模型兼容,该模型能够生成多个样本输出。我们评估了CONSIGN在三个医学影像数据集和两个COCO数据集子集上,使用三种不同的预训练分割模型,与两个CP基线进行比较。结果表明,考虑空间结构在多个指标上显著提高了性能,并增强了不确定性估计的质量。

英文摘要

Most machine learning-based image segmentation models produce pixel-wise confidence scores that represent the model's predicted probability for each class label at every pixel. While this information can be particularly valuable in high-stakes domains such as medical imaging, these scores are heuristic in nature and do not constitute rigorous quantitative uncertainty estimates. Conformal prediction (CP) provides a principled framework for transforming heuristic confidence scores into statistically valid uncertainty estimates. However, applying CP directly to image segmentation ignores the spatial correlations between pixels, a fundamental characteristic of image data. This can result in overly conservative and less interpretable uncertainty estimates. To address this, we propose CONSIGN (Conformal Segmentation Informed by Spatial Groupings via Decomposition), a CP-based method that incorporates spatial correlations to improve uncertainty quantification in image segmentation. Our method generates meaningful prediction sets that come with user-specified, high-probability error guarantees. It is compatible with any pre-trained segmentation model capable of generating multiple sample outputs. We evaluate CONSIGN against two CP baselines across three medical imaging datasets and two COCO dataset subsets, using three different pre-trained segmentation models. Results demonstrate that accounting for spatial structure significantly improves performance across multiple metrics and enhances the quality of uncertainty estimates.
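
The conformal step that turns heuristic confidence scores into prediction sets can be sketched with plain split conformal prediction for a single classification decision. This is the generic building block, assuming the common nonconformity score `1 - p(true class)`; it is not CONSIGN itself, which additionally exploits spatial structure across pixels.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample quantile of calibration scores for miscoverage level alpha."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every label must be kept.
        return math.inf
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, threshold):
    """All labels whose nonconformity score 1 - p is within the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= threshold]
```

Calibration scores are computed on held-out data as `1 - p` for the true label. Assuming calibration and test points are exchangeable, the resulting sets contain the true label with probability at least `1 - alpha`. Applied independently per pixel, this ignores correlations between neighbouring pixels, which is the gap the abstract describes.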

2505.13289 2026-05-11 cs.LG cs.CV

RECON: Robust symmetry discovery via Explicit Canonical Orientation Normalization

RECON:通过显式规范定向归一化实现鲁棒对称发现

Alonso Urbano, David W. Romero, Max Zimmer, Sebastian Pokutta

发表机构 * Department for AI in Society, Science, and Technology, Zuse Institute Berlin (ZIB)(人工智能与社会、科学与技术部门,柏林Zuse研究所) Cartesia AI Institute of Mathematics, Technische Universität Berlin(数学研究所,柏林技术大学)

AI总结 RECON通过显式规范定向归一化纠正任意规范,实现实例特定姿态分布的无监督发现,检测异常姿态并提升模型性能。

Comments Accepted as a conference paper at ICLR 2026

详情
AI中文摘要

现实世界的数据往往表现出未知的实例特定对称性,这些对称性很少与事先固定的变换群G完全匹配。类-姿态分解旨在通过将输入分解为不变特征和相对训练依赖的任意规范表示的姿态g∈G来创建解耦的表示。我们引入RECON,一种类-姿态无关的规范定向归一化,通过简单的右平移纠正任意规范,从而获得自然的数据对齐规范。这使能够(i)无监督发现实例特定的姿态分布,(ii)检测异常分布的姿态,(iii)提供即插即用的测试时间规范归一化层。该层可以附加在任何预训练模型上,以注入群不变性,提升性能而不需重新训练。我们在图像和分子集合上进行了验证,展示了准确的对称发现,并在下游分类中匹配或优于其他规范方法。

英文摘要

Real world data often exhibits unknown, instance-specific symmetries that rarely exactly match a transformation group $G$ fixed a priori. Class-pose decompositions aim to create disentangled representations by factoring inputs into invariant features and a pose $g\in G$ defined relative to a training-dependent, arbitrary canonical representation. We introduce RECON, a class-pose agnostic canonical orientation normalization that corrects arbitrary canonicals via a simple right translation, yielding natural, data-aligned canonicalizations. This enables (i) unsupervised discovery of instance-specific pose distributions, (ii) detection of out-of-distribution poses and (iii) a plug-and-play test-time canonicalization layer. This layer can be attached on top of any pre-trained model to infuse group invariance, improving its performance without retraining. We validate on images and molecular ensembles, demonstrating accurate symmetry discovery, and matching or outperforming other canonicalizations in downstream classification.