arXivDaily: arXiv daily academic digest, updated Monday through Friday
2509.15680 2026-06-11 cs.SD eess.AS Updated version

SAM: A Mamba-2 State-Space Audio-Language Model

Taehan Lee, Jaehan Jung, Hyukjun Lee

Affiliation: Sogang University

AI summary: Proposes SAM, an audio-language model built on a Mamba-2 backbone that matches or surpasses 7B transformer models on AudioSet and AudioCaps with fewer parameters, and systematically analyzes how SSMs interact with audio encoder outputs.

Comments: 6 pages, Accepted to Interspeech 2026

Abstract

We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.

2603.03855 2026-06-11 cs.SD Updated version

A Sensitivity Analysis of Multi-Event Audio Grounding in Audio LLMs

Taehan Lee, Jaehan Jung, Hyukjun Lee

Affiliation: Sogang University

AI summary: A large-scale evaluation finds that, for Audio LLMs in complex acoustic scenes, a higher event count lowers the true-positive rate and raises the false-positive rate; prompts induce a trade-off between the two, and models are more uncertain on multi-event audio.

Comments: 6 pages, Accepted to Interspeech 2026

Abstract

Audio LLMs have shown a strong ability to understand audio samples, yet their reliability in complex acoustic scenes remains under-explored. Unlike prior work limited to small scale or less controlled query construction, we present a large-scale evaluation of event grounding and false alarms as auditory scene complexity increases. Using 71K AudioCapsV2 clips, we extract normalized (source, attribute) events and build two query types: present-event queries for ground-truth detection and absent-event queries to probe hallucinations, using similarity-filtered negative sampling in an audio-aligned text embedding space. We evaluate four SOTA Audio LLMs with 12 prompt variants over 500K yes/no queries per model. Across models, increasing event count consistently lowers true-positive rate and raises false-positive rate, while prompts induce a strong trade-off between the two. Our confidence analysis shows that models become more uncertain on multi-event audio, revealing room for improvement.
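The two rates discussed in this abstract can be computed from yes/no answers roughly as follows. This is a minimal sketch: the record layout `(event_present, model_said_yes)` and the helper name `rates` are assumptions made for illustration, not the authors' evaluation code.

```python
def rates(records):
    """Return (TPR, FPR) for a list of yes/no query records.

    Each record is a pair (event_present, model_said_yes). This layout is
    hypothetical; the paper's data format is not given in the abstract.
    """
    tp = sum(1 for present, yes in records if present and yes)
    fn = sum(1 for present, yes in records if present and not yes)
    fp = sum(1 for present, yes in records if not present and yes)
    tn = sum(1 for present, yes in records if not present and not yes)
    # Present-event queries drive the true-positive rate,
    # absent-event queries drive the false-positive rate.
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return tpr, fpr


# Toy data: four present-event queries followed by four absent-event queries.
toy = [(True, True), (True, True), (True, False), (True, True),
       (False, False), (False, True), (False, False), (False, False)]
tpr, fpr = rates(toy)
```

Grouping the records by the number of events in each clip before calling `rates` would give one (TPR, FPR) pair per complexity level, which is the kind of trend the abstract describes.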

2505.10018 2026-06-11 cs.RO Updated version

LEMON-Mapping: Loop-Enhanced Large-Scale Multi-Session Point Cloud Merging and Optimization for Globally Consistent Mapping

Lijie Wang, Xiaoyi Zhong, Ziyi Xu, Kaixin Chai, Anke Zhao, Tianyu Zhao, Changjian Jiang, Qianhao Wang, Xieyuanli Chen, Fei Gao

Affiliations: Institute of CyberSystems and Control, Zhejiang University; The Huzhou Institute, Zhejiang University; The State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University; The College of Intelligence Science and Technology, National University of Defense Technology

AI summary: Proposes the LEMON-Mapping framework, which uses robust loop processing, spatial bundle adjustment, and PGO-based optimization to resolve divergence and blurring in the overlapping regions of multi-robot maps, achieving accurate, globally consistent point cloud merging.

Abstract

Multi-robot collaboration is becoming increasingly critical and presents significant challenges in modern robotics, especially for building a globally consistent, accurate map. Traditional multi-robot pose graph optimization (PGO) methods ensure basic global consistency but ignore the geometric structure of the map, and only use loop closures as constraints between pose nodes, leading to divergence and blurring in overlapping regions. To address this issue, we propose LEMON-Mapping, a loop-enhanced framework for large-scale, multi-session point cloud fusion and optimization. We re-examine the role of loops for multi-robot mapping and introduce three key innovations. First, we develop a robust loop processing mechanism that rejects outliers and a loop recall strategy to recover mistakenly removed but valid loops. Second, we introduce spatial bundle adjustment for multi-robot maps, reducing divergence and eliminating blurring in overlaps. Third, we design a PGO-based approach that leverages refined bundle adjustment constraints to propagate local accuracy to the entire map. We validate LEMON-Mapping on several public datasets and a self-collected dataset. The experimental results show superior mapping accuracy and global consistency of our framework compared to traditional merging methods. Scalability experiments also demonstrate its strong capability to handle scenarios involving numerous robots.

2602.23545 2026-06-11 cs.AI Updated version

Planning under Distribution Shifts with Causal POMDPs

Matteo Ceriscioli, Karthika Mohan

Affiliation: School of Electrical Engineering and Computer Science (EECS)

AI summary: Proposes a causal POMDP framework that represents environment changes as interventions and preserves the PWLC property under partial observability, enabling planning and belief updating under distribution shifts.

Comments: To appear at the 36th International Conference on Automated Planning and Scheduling (ICAPS-26)

Abstract

In the real world, planning is often challenged by distribution shifts. As such, a model of the environment obtained under one set of conditions may no longer remain valid as the distribution of states or the environment dynamics change, which in turn causes previously learned strategies to fail. In this work, we propose a theoretical framework for planning under partial observability using Partially Observable Markov Decision Processes (POMDPs) formulated using causal knowledge. By representing shifts in the environment as interventions on this causal POMDP, the framework enables evaluating plans under hypothesized changes and actively identifying which components of the environment have been altered. We show how to maintain and update a belief over both the latent state and the underlying domain, and we prove that the value function remains piecewise linear and convex (PWLC) in this augmented belief space. Preservation of PWLC under distribution shifts has the advantage of maintaining the tractability of planning via $\alpha$-vector-based POMDP methods.
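For readers unfamiliar with the PWLC property, the sketch below shows the standard POMDP belief update and the $\alpha$-vector value evaluation it makes possible. It covers only an ordinary POMDP with a belief over latent states; the augmented belief over state and domain described in the abstract is not reproduced, and the toy transition and observation tables are assumptions for illustration.

```python
def belief_update(belief, trans, obs_prob, action, obs):
    """Bayes filter: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    n = len(belief)
    unnorm = []
    for s2 in range(n):
        predicted = sum(trans[action][s][s2] * belief[s] for s in range(n))
        unnorm.append(obs_prob[action][s2][obs] * predicted)
    z = sum(unnorm)
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / z for u in unnorm]


def value(belief, alphas):
    """PWLC value function: the maximum over alpha-vectors of alpha . b."""
    return max(sum(a * b for a, b in zip(alpha, belief)) for alpha in alphas)


# Toy two-state problem with one action (all numbers are hypothetical).
trans = [[[1.0, 0.0], [0.0, 1.0]]]       # trans[a][s][s']: states never change
obs_prob = [[[0.8, 0.2], [0.2, 0.8]]]    # obs_prob[a][s'][o]: noisy state sensor
alphas = [[1.0, 0.0], [0.0, 1.0]]        # one alpha-vector per candidate plan

b1 = belief_update([0.5, 0.5], trans, obs_prob, action=0, obs=0)
v = value(b1, alphas)
```

Because the value is a maximum of linear functions of the belief, it is convex and piecewise linear, which is what $\alpha$-vector methods rely on.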

2602.22962 2026-06-11 cs.LG Updated version

Scaling Laws of Global Weather Models

Yuejiang Yu, Langwen Huang, Alexandru Calotoiu, Torsten Hoefler

Affiliation: University of Illinois at Urbana-Champaign

AI summary: Analyzes scaling laws relating model size, dataset size, and compute budget to validation loss in data-driven weather models. Aurora shows the strongest data scaling; GraphCast is parameter-efficient but has limited hardware utilization; compute-optimal analysis indicates that adding training data is more effective than enlarging the model; and width is favored over depth in model shape.

Comments: Accepted at ICML 2026. 21 pages, 7 figures

Abstract

Data-driven models are revolutionizing weather forecasting. To optimize training efficiency and model performance, this paper analyzes empirical scaling laws within this domain. We investigate the relationship between model performance (validation loss) and three key factors: model size ($N$), dataset size ($D$), and compute budget ($C$). Across a range of models, we find that Aurora exhibits the strongest data-scaling behavior: increasing the training dataset by 10x reduces validation loss by up to 3.2x. GraphCast demonstrates the highest parameter efficiency, yet suffers from limited hardware utilization. Our compute-optimal analysis indicates that, under fixed compute budgets, allocating resources to more total training data yields greater performance gains than increasing model size. Furthermore, we analyze model shape and uncover scaling behaviors that differ fundamentally from those observed in language models: weather forecasting models consistently favor increased width over depth. These findings suggest that future weather models should prioritize wider architectures and larger effective training datasets to maximize predictive performance.
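As a rough reading of the Aurora figure, one can ask what exponent a pure power law $L(D) \propto D^{-\beta}$ would need in order to turn a 10x data increase into a 3.2x loss reduction. The sketch below does that arithmetic. The pure power-law form with no irreducible loss term is an assumption, and since the abstract says "up to 3.2x", the result is an upper-end estimate rather than a fitted value from the paper.

```python
import math


def implied_exponent(data_factor, loss_factor):
    """Exponent beta such that data_factor ** beta == loss_factor.

    Assumes L(D) = A * D ** (-beta), so scaling D by data_factor divides
    the loss by data_factor ** beta.
    """
    return math.log(loss_factor) / math.log(data_factor)


# 10x more data, loss reduced by up to 3.2x (figures quoted in the abstract).
beta = implied_exponent(10.0, 3.2)
```

Under these assumptions beta comes out at roughly 0.5, that is, loss falling approximately with the square root of dataset size.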

2601.11670 2026-06-11 cs.LG cs.AI Updated version

CoVar: Confidence-Variance-Guided Pseudo-Label Selection for Semi-Supervised Learning

Jinshi Liu, Lei He, Pan Liu

Affiliations: College of Artificial Intelligence, Shenzhen University; School of Information and Electrical Engineering, Hunan University of Science and Technology; Information Hub, Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes the CoVar framework, which assesses pseudo-label reliability by jointly modeling maximum confidence and residual-class variance, and uses SVD-based spectral relaxation to separate reliable from unreliable predictions without hand-tuned thresholds, yielding gains on segmentation and classification tasks.

Abstract

Pseudo-label selection in semi-supervised learning is commonly driven by maximum-confidence thresholds, yet confidence alone can be unreliable under model overconfidence and class imbalance. We propose CoVar, a confidence-variance framework that assesses pseudo-label reliability by jointly modeling Maximum Confidence (MC) and Residual-Class Variance (RCV). Starting from entropy minimization, we derive a second-order cross-entropy approximation showing that low-loss pseudo-labels are favored when MC is high and RCV is low, with a confidence-dependent penalty that becomes stronger for near-certain predictions. Based on this criterion, CoVar embeds predictions into a two-dimensional confidence-variance space and uses SVD-based spectral relaxation to separate reliable and unreliable predictions without hand-tuned confidence thresholds. Cluster-wise Gaussian weighting then converts this separation into per-sample training weights. The resulting weights can be integrated into existing semi-supervised segmentation and classification pipelines during training and introduce no inference-time overhead. Experiments on PASCAL VOC 2012, Cityscapes, CIFAR-10, CIFAR-100, SVHN, and STL-10 show clear gains on VOC and Cityscapes under matched backbones, as well as competitive or improved error rates on standard classification benchmarks. These results indicate that residual-class dispersion provides a useful signal complementary to confidence for robust pseudo-label selection.
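To make the two coordinates of the confidence-variance space concrete, the sketch below computes them for a single softmax output. The definitions used here are assumptions inferred from the names: MC is the largest class probability, and RCV is the population variance of the remaining class probabilities. The paper may normalize these differently, and the spectral separation and Gaussian weighting steps are not reproduced.

```python
def mc_rcv(probs):
    """Return (maximum confidence, residual-class variance) for one prediction.

    probs: class probabilities summing to 1, with at least two classes.
    RCV here is the variance of the non-argmax probabilities (an assumed
    definition, not taken from the paper).
    """
    top = max(range(len(probs)), key=probs.__getitem__)
    mc = probs[top]
    rest = [p for i, p in enumerate(probs) if i != top]
    mean = sum(rest) / len(rest)
    rcv = sum((p - mean) ** 2 for p in rest) / len(rest)
    return mc, rcv


# Two predictions with the same confidence but different residual spread.
mc_a, rcv_a = mc_rcv([0.7, 0.1, 0.1, 0.1])   # leftover mass spread evenly
mc_b, rcv_b = mc_rcv([0.7, 0.3, 0.0, 0.0])   # leftover mass on one rival class
```

A confidence threshold alone cannot tell these two predictions apart, while the second coordinate can: the prediction with a single strong rival class has the larger RCV.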

2602.22638 2026-06-11 cs.AI Updated version

MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios

Zhiheng Song, Jingshuai Zhang, Chuan Qin, Chao Wang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu, Hengshu Zhu

Affiliations: Computer Network Information Center, Chinese Academy of Sciences; AMAP, Alibaba Group; Alibaba Group

AI summary: Proposes the MobilityBench benchmark, which uses a deterministic API-replay sandbox and a multi-dimensional evaluation protocol to systematically evaluate LLM-based route-planning agents, and finds that current models perform poorly on preference-constrained route planning.

Abstract

Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic information retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.

2602.20958 2026-06-11 cs.RO cs.AI Updated version

EKF-Based Depth Camera and Deep Learning Fusion for UAV-Person Distance Estimation and Following in SAR Operations

Luka Šiktar, Branimir Ćaran, Bojan Šekoranja, Marko Švaco

Affiliation: University of Rijeka

AI summary: Proposes an EKF method that fuses depth camera measurements with monocular camera-to-body distance estimates, using YOLO-pose for real-time fusion, to improve the accuracy and robustness of distance estimation in UAV following; errors are reduced by up to 15.3% across three test scenarios.

Comments: This work has been submitted to the IEEE for possible publication

Abstract

Vision-based Unmanned Aerial Vehicle (UAV) frameworks aid human search tasks by detecting and recognizing specific individuals, then tracking and following them while maintaining a safe distance. A key safety requirement for UAV following is the accurate estimation of the distance between camera and target object under real-world conditions, achieved by fusing multiple image modalities. As part of a system for automatic people detection and face recognition using deep learning, in this paper we present the fusion of depth camera measurements and monocular camera-to-body distance estimation for robust tracking and following. Deep-learning-based filtering of depth camera data and estimation of camera-to-body distance from a monocular camera are achieved with YOLO-pose, enabling real-time fusion of depth information using the Extended Kalman Filter (EKF) algorithm. The proposed subsystem, designed for use in drones, estimates and measures the distance between the depth camera and the human body keypoints in order to maintain a safe distance between the drone and the human target. Our system provides an accurate estimated distance, which has been validated against motion capture ground truth data. The system has been tested in real time indoors, where it reduces the average error, RMSE, and standard deviation of distance estimation by up to 15.3% across three tested scenarios. Based on the test results, the EKF fusion-based approach increases the depth detection range by reducing the errors outside the optimal depth camera working range. It also shows improved robustness and precision in challenging conditions, such as reflections and poor visibility, making it suitable for search and rescue (SAR).
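The fusion idea can be illustrated with a one-dimensional filter that tracks a single distance and folds in two measurements per step. This is a simplified sketch, not the authors' filter: with a direct (linear) measurement of distance the EKF reduces to an ordinary Kalman filter, the constant-distance motion model and all noise values are assumptions, and treating a missing depth reading as `None` is a hypothetical way to model frames outside the depth camera's working range.

```python
def fuse(x, p, z, r):
    """Scalar Kalman measurement update.

    x, p: prior distance estimate and its variance
    z, r: measurement and its noise variance
    """
    k = p / (p + r)                      # Kalman gain
    return x + k * (z - x), (1.0 - k) * p


def step(x, p, q, z_depth, r_depth, z_mono, r_mono):
    """One predict/update cycle fusing depth and monocular distance readings."""
    p = p + q                            # predict: constant distance, growing uncertainty
    if z_depth is not None:              # depth reading may be unavailable (assumption)
        x, p = fuse(x, p, z_depth, r_depth)
    x, p = fuse(x, p, z_mono, r_mono)    # monocular estimate, e.g. from pose keypoints
    return x, p


# Hypothetical numbers: prior 3.0 m, depth camera says 2.0 m, monocular says 3.5 m.
x_both, p_both = step(3.0, 1.0, 0.0, 2.0, 1.0, 3.5, 0.5)
x_mono, p_mono = step(3.0, 1.0, 0.0, None, 1.0, 3.5, 0.5)
```

When both sensors report, the posterior variance is smaller than with the monocular estimate alone, which is the practical benefit of fusing the two sources.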

2602.19502 2026-06-11 cs.AI cs.LG Updated version

Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

Lalitha Pranathi Pulavarthy, Raajitha Muthyala, Aravind V Kuruvikkattil, Zhenan Yin, Rashmita Kudamala, Saptarshi Purkayastha

Affiliations: University of California, Berkeley; University of Washington; Stanford University

AI summary: Human-guided agentic AI achieves competitive results on multimodal clinical prediction tasks, and the work distills three generalizable lessons: domain-informed feature engineering, task-specific multimodal fusion, and clinically motivated model ensembling.

Comments: Presented at the Data Challenge track at the 14th IEEE International Conference on Healthcare Informatics (ICHI) 2026 on June 3, 2026

Abstract

Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark challenges: 30-day hospital readmission prediction (Macro-F1 = 0.8986), emergency department cost forecasting (MAE = $465.13), and discharge readiness assessment (Macro-F1 = 0.7939). Across these tasks, human analysts directed the agentic workflow at key decision points: multimodal feature engineering from clinical notes, scanned PDF billing receipts, and time-series vital signs; task-appropriate model selection; and clinically informed validation strategies. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task. Ablation studies reveal that human-guided decisions compounded to a cumulative gain of +0.065 F1 over automated baselines, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). We distill three generalizable lessons: (1) domain-informed feature engineering at each pipeline stage yields compounding gains that outperform extensive automated search; (2) multimodal data integration requires task-specific human judgment, since no single extraction strategy generalizes across clinical text, PDFs, and time series; and (3) deliberate ensemble diversity with clinically motivated model configurations outperforms random hyperparameter search. These findings offer practical guidance for teams deploying agentic AI in healthcare settings where interpretability, reproducibility, and clinical validity are essential.

2602.18291 2026-06-11 cs.AI Updated version

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

Zhuoran Li, Hai Zhong, Xun Wang, Qingxin Xia, Lihua Zhang, Longbo Huang

Affiliation: University of Science and Technology of China

AI summary: Proposes OMAD, among the first online off-policy multi-agent reinforcement learning frameworks to use diffusion policies; a relaxed policy objective maximizes scaled joint entropy for efficient exploration and coordination, improving sample efficiency by 2.5x to 5x on MPE and MAMuJoCo.

Abstract

Online Multi-Agent Reinforcement Learning (MARL) is a prominent framework for efficient agent coordination. Crucially, enhancing policy expressiveness is pivotal for achieving superior performance. Diffusion-based generative models are well-positioned to meet this demand, having demonstrated remarkable expressiveness and multimodal representation in image generation and offline settings. Yet, their potential in online MARL remains largely under-explored. A major obstacle is that the intractable likelihoods of diffusion models impede entropy-based exploration and coordination. To tackle this challenge, we propose one of the first Online off-policy MARL frameworks using Diffusion policies (OMAD) to orchestrate coordination. Our key innovation is a relaxed policy objective that maximizes scaled joint entropy, facilitating effective exploration without relying on tractable likelihood. Complementing this, within the centralized training with decentralized execution (CTDE) paradigm, we employ a joint distributional value function to optimize decentralized diffusion policies. It leverages tractable entropy-augmented targets to guide the simultaneous updates of diffusion policies, thereby ensuring stable coordination. Extensive evaluations on MPE and MAMuJoCo establish our method as the new state-of-the-art across 10 diverse tasks, demonstrating a remarkable 2.5x to 5x improvement in sample efficiency.

2602.05746 2026-06-11 cs.LG cs.AI Updated version

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Xin Chen, Jie Zhang, Florian Tramèr

Affiliation: ETH Zürich

AI summary: Proposes AutoInject, a black-box reinforcement learning framework that automatically learns adversarial suffixes for prompt injection; on AgentDojo it outperforms template attacks and several adaptive attacks, and it breaks a model fine-tuned specifically as a defense.

Abstract

Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks shape models toward generic compliance, while prompt injection requires emitting specific tool calls with correct parameters. The success signal is binary, and randomly sampled suffixes almost never trigger it, so standard optimizers have no gradient to follow. We present AutoInject, a black-box reinforcement learning (RL) framework that learns adversarial suffixes for prompt injection. A learned comparison-based reward scores each candidate against the best suffix seen so far, turning the binary signal into a dense reward suitable for RL optimization. The framework supports both online query-based attacks and offline-trained transferable suffixes that need no utility access at deployment, and incorporates a utility objective when task-completion feedback is available. On AgentDojo, AutoInject outperforms template attacks, GCG, TAP, and adaptive attack across production models, with statistically significant improvements under McNemar's test with p<0.05. Suffixes learned by AutoInject also break Meta-SecAlign-70B, a model fine-tuned specifically to resist prompt injection, where template attacks fail outright. The results establish an automated baseline for prompt injection and expose a gap between preference-based defenses and adaptive optimization-based attackers.
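McNemar's test, mentioned in this abstract, compares two methods on the same set of paired trials using only the discordant pairs. The sketch below implements the exact (binomial) two-sided version with the standard library. The counts are made up for illustration, and whether the paper uses the exact or the chi-squared variant is not stated in the abstract.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: trials where attack A succeeded and attack B failed
    c: trials where attack B succeeded and attack A failed
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts on one shared set of injection tasks.
p_clear = mcnemar_exact(10, 1)   # A wins 10 discordant cases, B wins 1
p_tied = mcnemar_exact(5, 5)     # evenly split discordant cases
```

Pairs where both attacks succeed or both fail carry no information about which is better, which is why only `b` and `c` enter the calculation.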

2502.14894 2026-06-11 cs.CV cs.AI cs.CY cs.LG Updated version

FOCUS on Contamination: Hydrology-Informed Noise-Aware Learning for Geospatial PFAS Mapping

Jowaria Khan, Alexa Friedman, Sydney Evans, Rachel Klein, Runzi Wang, Katherine E. Manz, Kaley Beins, David Q. Andrews, Elizabeth Bondi-Kelly

Affiliations: University of Michigan; Environmental Working Group; University of California, Davis

AI summary: Proposes the FOCUS framework, which combines sparse PFAS observations with environmental priors such as hydrological connectivity and trains robustly through a noise-aware loss, outperforming traditional methods for PFAS contamination mapping.

Comments: Best Paper Award at ICLR 2026 Machine Learning for Remote Sensing Workshop

Abstract

Per- and polyfluoroalkyl substances (PFAS) are persistent environmental contaminants with significant public health impacts, yet large-scale monitoring remains severely limited due to the high cost and logistical challenges of field sampling. The lack of samples leads to difficulty simulating their spread with physical models and limited scientific understanding of PFAS transport in surface waters. Yet, rich geospatial and satellite-derived data describing land cover, hydrology, and industrial activity are widely available. We introduce FOCUS, a geospatial deep learning framework for PFAS contamination mapping that integrates sparse PFAS observations with large-scale environmental context, including priors derived from hydrological connectivity, land cover, source proximity, and sampling distance. These priors are integrated into a principled, noise-aware loss, yielding a robust training objective under sparse labels. Across extensive ablations, robustness analyses, and real-world validation, FOCUS consistently outperforms baselines including sparse segmentation, Kriging, and pollutant transport simulations, while preserving spatial coherence and scalability over large regions. Our results demonstrate how AI can support environmental science by providing screening-level risk maps that prioritize follow-up sampling and help connect potential sources to surface-water contamination patterns in the absence of complete physical models.

2602.14913 2026-06-11 cs.LG eess.IV Updated version

Coverage Guarantees for Pseudo-Calibrated Conformal Prediction under Distribution Shift

Farbod Siahkali, Ashwin Verma, Vijay Gupta

Affiliation: Elmore Family School of Electrical and Computer Engineering, Purdue University

AI summary: Addresses the loss of conformal prediction coverage under distribution shift. Using pseudo-calibration and tools from domain adaptation, the paper derives a lower bound on target coverage, proposes inflating the conformal threshold by a slack parameter together with a source-tuned pseudo-calibration algorithm, and shows experimentally that coverage degradation is mitigated.

Comments: Under review. 6 pages, 2 figures, 1 table

Abstract

Conformal prediction (CP) offers distribution-free marginal coverage guarantees under an exchangeability assumption, but these guarantees can fail if the data distribution shifts. We analyze the use of pseudo-calibration as a tool to counter this performance loss under a bounded label-conditional covariate shift model. Using tools from domain adaptation, we derive a lower bound on target coverage in terms of the source-domain loss of the classifier and a Wasserstein measure of the shift. Using this result, we provide a method to design pseudo-calibrated sets that inflate the conformal threshold by a slack parameter to keep target coverage above a prescribed level. Finally, we propose a source-tuned pseudo-calibration algorithm that interpolates between hard pseudo-labels and randomized labels as a function of classifier uncertainty. Numerical experiments show that our bounds qualitatively track pseudo-calibration behavior and that the source-tuned scheme mitigates coverage degradation under distribution shift while maintaining nontrivial prediction set sizes.
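The threshold-inflation idea can be sketched with ordinary split conformal prediction plus an additive slack. This is an illustration under assumptions: scores are nonconformity scores (lower means the label fits better), the slack is added directly to the calibrated quantile, and its value here is arbitrary. How the paper derives the slack from the source-domain loss and the Wasserstein shift measure is not reproduced.

```python
import math


def conformal_threshold(cal_scores, alpha, slack=0.0):
    """Split-conformal threshold at miscoverage level alpha, inflated by slack.

    Uses the usual ceil((n + 1) * (1 - alpha))-th smallest calibration score.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")              # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1] + slack


def prediction_set(class_scores, threshold):
    """Indices of all classes whose nonconformity score is within the threshold."""
    return [k for k, s in enumerate(class_scores) if s <= threshold]


# Hypothetical calibration scores (computed with pseudo-labels in the paper's setting).
cal = [0.1, 0.5, 0.3, 0.9, 0.7, 0.2, 0.4]
t_plain = conformal_threshold(cal, alpha=0.25)
t_slack = conformal_threshold(cal, alpha=0.25, slack=0.1)

test_point = [0.75, 0.2, 0.95]           # per-class scores for one target sample
set_plain = prediction_set(test_point, t_plain)
set_slack = prediction_set(test_point, t_slack)
```

A larger threshold can only grow the prediction set, so slack trades set size for coverage; the abstract's claim is that a suitably chosen slack keeps target coverage above the prescribed level.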

2507.11688 2026-06-11 cs.LG Updated version

Composing Linear Layers from Irreducibles

Travis Pence, Daisuke Yamada, Vikas Singh

Affiliation: University of Wisconsin-Madison

AI summary: Uses Clifford algebra to decompose linear layers into compositions of bivectors (geometric primitives), requiring only O(log^2 d) parameters and matching strong baselines on LLM attention projections.

Comments: 35 Pages, 11 Tables, 6 Figures, Appearing in NeurIPS 2025

Journal ref: Advances in Neural Information Processing Systems 38 (2025)

Abstract

Contemporary large models often exhibit behaviors suggesting the presence of low-level primitives that compose into modules with richer functionality, but these fundamental building blocks remain poorly understood. We investigate this compositional structure in linear layers by asking: can we identify/synthesize linear transformations from a minimal set of geometric primitives? Using Clifford algebra, we show that linear layers can be expressed as compositions of bivectors (geometric objects encoding oriented planes) and introduce a differentiable algorithm that decomposes them into products of rotors. This construction uses only $O(\log^2 d)$ parameters, versus $O(d^2)$ required by dense matrices. Applied to the key, query, and value projections in LLM attention layers, our rotor-based layers match the performance of strong baselines such as block-Hadamard and low-rank approximations. Our findings provide an algebraic perspective on how these geometric primitives can compose into higher-level functions within deep models.

2602.11995 2026-06-11 cs.LG Updated version

Momentum LMS Theory beyond Stationarity: Stability, Tracking, and Regret

Yifei Jin, Xin Zheng, Lei Guo

Affiliations: School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences; School of Mathematical Sciences, University of Chinese Academy of Sciences

AI summary: Studies the tracking performance and regret bounds of the momentum least mean squares algorithm in nonstationary, time-varying linear systems; by analyzing second-order time-varying random vector difference equations, it establishes fast adaptation and robust tracking.

Comments: 9 pages, 3 figures

Abstract

In large-scale data processing scenarios, data often arrive in sequential streams generated by complex systems that exhibit drifting distributions and time-varying system parameters. This nonstationarity challenges theoretical analysis, as it violates classical assumptions of i.i.d. (independent and identically distributed) samples, necessitating algorithms capable of real-time updates without expensive retraining. An effective approach should process each sample in a single pass, while maintaining computational and memory complexities independent of the data stream length. Motivated by these challenges, this paper investigates the Momentum Least Mean Squares (MLMS) algorithm as an adaptive identification tool, leveraging its computational simplicity and online processing capabilities. Theoretically, we derive tracking performance and regret bounds for the MLMS in time-varying stochastic linear systems under various practical conditions. Unlike classical LMS, whose stability can be characterized by first-order random vector difference equations, MLMS introduces an additional dynamical state due to momentum, leading to second-order time-varying random vector difference equations whose stability analysis hinges on more complicated products of random matrices, which poses a substantially challenging problem to resolve. Experiments on synthetic and real-world data streams demonstrate that MLMS achieves rapid adaptation and robust tracking, in agreement with our theoretical results especially in nonstationary settings, highlighting its promise for modern streaming and online learning applications.
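As a rough illustration of the kind of single-pass update described above, here is a minimal sketch of momentum LMS. It assumes the common heavy-ball form theta_{k+1} = theta_k + mu * phi_k * (y_k - phi_k^T theta_k) + beta * (theta_k - theta_{k-1}); the abstract does not state the paper's exact parameterization, so the names `mlms_track` and `synthetic_stream`, the step size `mu`, and the momentum weight `beta` are illustrative assumptions.

```python
import random


def mlms_track(stream, dim, mu=0.05, beta=0.5):
    """Yield one parameter estimate per sample (assumed heavy-ball LMS form)."""
    theta = [0.0] * dim   # current estimate
    prev = [0.0] * dim    # previous estimate, the extra state added by momentum
    for phi, y in stream:
        # Prediction error on the newly arrived sample.
        err = y - sum(p * t for p, t in zip(phi, theta))
        # LMS gradient step plus a momentum term beta * (theta - prev).
        new = [t + mu * p * err + beta * (t - tp)
               for t, p, tp in zip(theta, phi, prev)]
        prev, theta = theta, new
        yield theta


def synthetic_stream(true_theta, n, seed=0):
    """Hypothetical noise-free test stream: y = phi^T true_theta."""
    rng = random.Random(seed)
    for _ in range(n):
        phi = [rng.gauss(0.0, 1.0) for _ in true_theta]
        yield phi, sum(p * t for p, t in zip(phi, true_theta))


estimates = list(mlms_track(synthetic_stream([1.0, -2.0], 2000), dim=2))
print(estimates[-1])
```

Each sample is processed once, and the update itself only keeps two vectors of length `dim`, so its state does not grow with the stream length. The `list(...)` call in the usage line stores the full history only for inspection.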

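2602.11801 2026-06-11 cs.LG (updated version)

SpaTeoGL: Spatiotemporal Graph Learning for Interpretable Seizure Onset Zone Analysis from Intracranial EEG

Elham Rostami, Aref Einizade, Taous-Meriem Laleg-Kirati

Affiliations: Inria Saclay, Palaiseau, France

AI summary: Proposes SpaTeoGL, a framework that jointly learns window-level spatial graphs and a temporal graph, solved by alternating optimization within a smooth graph signal processing framework, for interpretable localization of the seizure onset zone; on a multicenter iEEG dataset it is competitive with the baseline while improving non-SOZ identification.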

Comments 5 pages, 4 figures

Details
Abstract

Accurate localization of the seizure onset zone (SOZ) from intracranial EEG (iEEG) is essential for epilepsy surgery but is challenged by complex spatiotemporal seizure dynamics. We propose SpaTeoGL, a spatiotemporal graph learning framework for interpretable seizure network analysis. SpaTeoGL jointly learns window-level spatial graphs capturing interactions among iEEG electrodes and a temporal graph linking time windows based on similarity of their spatial structure. The method is formulated within a smooth graph signal processing framework and solved via an alternating block coordinate descent algorithm with convergence guarantees. Experiments on a multicenter iEEG dataset with successful surgical outcomes show that SpaTeoGL is competitive with a baseline based on horizontal visibility graphs and logistic regression, while improving non-SOZ identification and providing interpretable insights into seizure onset and propagation dynamics.

2602.10908 2026-06-11 cs.CL cs.LG stat.ML (updated version)

SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

Masataka Yoneda, Yusuke Matsushita, Go Kamoda, Kohei Suenaga, Takuya Akiba, Masaki Waga, Sho Yokoi

Affiliations: The University of Tokyo; Kyoto University; National Institute of Informatics; The Graduate University for Advanced Studies (SOKENDAI); National Institute for Japanese Language; Tohoku University

AI summary: Proposes SoftMatcha 2, an ultra-fast soft search algorithm based on suffix arrays and word vectors; with dynamic corpus-aware pruning and a disk-aware design, it searches trillion-scale corpora in under 0.3 seconds for semantic variants involving substitution, insertion, and deletion, and it uncovers benchmark contamination.

Comments Accepted at ICML2026. Project Page & Web Interface: https://softmatcha.github.io/v2/, Source Code: https://github.com/softmatcha/softmatcha2

Details
Abstract

We present SoftMatcha 2, an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while allowing semantic variations in the form of substitution, insertion, and deletion. Our approach employs string matching based on suffix arrays that scales well with corpus size, and represents words as vectors, which underpin its semantic flexibility. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: dynamic corpus-aware pruning and fast exact lookup enabled by a disk-aware design. We theoretically analyze the efficiency of the proposed method, indicating that it can mitigate exponential growth in the search space. Empirically, on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), it attains substantially lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, our method uncovers benchmark contamination in training corpora that existing approaches miss, and it also benefits information retrieval and paraphrase detection. We also provide an online demo of fast, soft search across corpora in seven languages.

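2602.10743 2026-06-11 cs.LG (updated version)

Kalman Linear Attention: Parallel Bayesian Filtering For Efficient Language Modelling and State Tracking

Vaisakh Shaj, Cameron Barker, Aidan Scannell, Andras Szecsenyi, Elliot J. Crowley, Amos Storkey

Affiliations: University of Cambridge

AI summary: Proposes the Kalman Linear Attention layer, which recasts sequence mixing as exact Bayesian filtering in information form and thereby enables time-parallel inference; it is more expressive than GLA at the same computational cost and outperforms linear SSMs and attention on state-tracking tasks.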

Comments Accepted at ICML 2026. An earlier version of this work was presented at the 1st Workshop on Epistemic Intelligence in Machine Learning (EIML) at EurIPS 2025

Details
Abstract

State-space language models such as Mamba and gated linear attention (GLA) offer linear-complexity, parallelisable alternatives to transformers, but their linear state updates limit expressivity and robust state tracking. We close this gap from a probabilistic angle, casting sequence mixing as exact Bayesian filtering with the Kalman filter as the core primitive. Classical Kalman filters give principled state and uncertainty estimates but are viewed as inherently sequential; we show that reparameterising them in information form turns their updates into an associative scan - so the per-token recurrent update is non-linear (a Möbius/precision recursion) yet remains temporally parallel. The resulting Kalman Linear Attention (KLA) layer is a drop-in sequence mixer that performs time-parallel probabilistic inference, carries an explicit belief-state uncertainty, and is strictly more expressive than GLA-style linear updates at the same computational cost. This expressivity translates directly into stronger state tracking: KLA solves permutation-composition ($A_5$) tasks that linear SSMs and attention cannot, while staying scan-parallel. As a drop-in primitive it also matches or improves on modern SSMs and GLAs across synthetic token-manipulation and zero-shot commonsense benchmarks, and is among the first stacked Bayesian-filtering primitives trained at the billion-token scale.
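To give some intuition for why a non-linear precision recursion can still be scan-parallel, here is a toy sketch for a scalar linear-Gaussian model. It is not the KLA layer itself: it only tracks the posterior precision, and the model parameters `a` (state transition), `q` (process noise variance), and `c = h*h/r` (information contributed by one observation), as well as all function names, are assumptions made for illustration. One predict-plus-update step maps the precision `lam` to `((1 + q*c)*lam + a*a*c) / (q*lam + a*a)`, which is a Möbius map, and Möbius maps compose by 2x2 matrix multiplication, an associative operation.

```python
def step_matrix(a, q, c):
    """2x2 matrix of the Möbius map for one predict+update step on the precision."""
    return ((1.0 + q * c, a * a * c), (q, a * a))


def compose(m2, m1):
    """Matrix product m2 @ m1, i.e. the map 'apply m1 first, then m2'."""
    (a, b), (c, d) = m2
    (e, f), (g, h) = m1
    return ((a * e + b * g, a * f + b * h), (c * e + d * g, c * f + d * h))


def mobius_apply(m, lam):
    """Apply the Möbius map represented by matrix m to the precision lam."""
    (a, b), (c, d) = m
    return (a * lam + b) / (c * lam + d)


def sequential_precision(lam0, steps):
    """Reference: the usual step-by-step Kalman precision recursion."""
    lam = lam0
    for a, q, c in steps:
        lam_pred = lam / (a * a + q * lam)  # predict: 1 / (a^2 / lam + q)
        lam = lam_pred + c                  # update: observation adds information
    return lam


steps = [(0.9, 0.1, 2.0), (1.1, 0.05, 0.5), (0.7, 0.2, 1.0)]
m1, m2, m3 = (step_matrix(*s) for s in steps)
left = compose(compose(m3, m2), m1)    # one bracketing
right = compose(m3, compose(m2, m1))   # another bracketing
print(mobius_apply(left, 1.5), mobius_apply(right, 1.5),
      sequential_precision(1.5, steps))
```

Because the bracketing does not matter, the per-step matrices could in principle be combined with a parallel prefix scan rather than a sequential loop. For long sequences the matrix entries would need rescaling to avoid overflow, which this sketch omits.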

2602.10392 2026-06-11 cs.LG (updated version)

Tensor Methods: A Unified and Interpretable Approach for Material Design

Shaan Pakala, Aldair E. Gongora, Brian Giera, Evangelos E. Papalexakis

Affiliations: University of California, Riverside; Dept. of Computer Science & Engineering; Lawrence Livermore National Laboratory; Materials Engineering Division; Data Science Institute

AI summary: Proposes tensor completion as a unified framework for material design that offers both interpretability and predictive performance; under non-uniform sampling it outperforms conventional machine learning, improving aggregate R² by up to 5% and halving the error in some out-of-distribution regions.

Comments Accepted to ACM SIGKDD 2026 AI for Sciences track

Details
Abstract

When designing new materials, it is often necessary to tailor the material design to have some desired properties. As the set of design parameters grows, the search space grows exponentially, making the actual synthesis and evaluation of all material combinations virtually impossible. Even traditional computational methods such as Finite Element Analysis become too computationally heavy to search the design space. Recent methods use machine learning (ML) surrogate models to more efficiently determine optimal material designs; unfortunately, these methods often (i) are notoriously difficult to interpret and (ii) underperform when the training data comes from a non-uniform sampling of the design space. We suggest the use of tensor completion methods as an all-in-one approach for interpretability and predictions. We observe that classical tensor methods are able to compete with traditional ML in predictions, with the added benefit of their interpretable tensor factors (which come for free as a byproduct of the prediction). In our experiments, we are able to rediscover physical phenomena via the tensor factors, indicating that our predictions are aligned with the true underlying physics of the problem. This also means these tensor factors could be used by experimentalists to identify potentially novel patterns, given that we are able to rediscover existing ones. We also study how both types of surrogate models behave when the training data comes from a non-uniform sampling of the design space. We observe that more specialized tensor methods can give better generalization in these non-uniform sampling scenarios. We find the best generalization comes from a tensor model, which is able to improve upon the baseline ML methods by up to 5% on aggregate $R^2$, and halve the error in some out-of-distribution regions.

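2602.09591 2026-06-11 cs.CL cs.AI cs.LG (updated version)

On the Optimal Reasoning Length for RL-Trained Language Models

Daisuke Nohara, Taishi Nakamura, Rio Yokota

Affiliations: University of Tokyo

AI summary: Studies the non-monotonic relationship between reasoning length and accuracy in language models trained with reinforcement learning, finds that accuracy peaks at an intermediate length, and explains the cause through an analysis of mode accuracy.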

Comments 18 pages, 12 figures

Details
Abstract

Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost. Although length-control methods have been proposed, the length-accuracy relationship they induce remains unclear. We train policies with several length-control methods on multiple base models in a controlled setup and find that, across both mathematical reasoning and code generation, accuracy is non-monotonic in output length, peaking at an intermediate value. Mode accuracy, however, continues to improve with length even in settings where sample accuracy plateaus or declines, indicating that the non-monotonic length-accuracy relationship is driven by dispersion around an increasingly correct center.

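2602.09533 2026-06-11 cs.AI (updated version)

Autoregressive Direct Preference Optimization

Masanari Oi, Mahiro Ukai, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

Affiliations: University of Tokyo

AI summary: Proposes Autoregressive Direct Preference Optimization (ADPO), which explicitly introduces the autoregressive assumption before applying the Bradley-Terry model; moving the summation in the DPO objective outside the log-sigmoid function yields better preference alignment, and the work is the first to distinguish two length measures, the token length μ and the feedback length μ'.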

Comments ICML 2026

Details
Abstract

Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit its full potential, as the reference and learnable models are assumed to be autoregressive only after deriving the objective function. Motivated by this limitation, we revisit the theoretical foundations of DPO and propose a novel formulation that explicitly introduces the autoregressive assumption prior to applying the BT model. By reformulating and extending DPO, we derive a novel variant, termed Autoregressive DPO (ADPO), that explicitly integrates autoregressive modeling into the preference optimization framework. Without violating the theoretical foundations, the derived loss takes an elegant form: it shifts the summation operation in the DPO objective outside the log-sigmoid function. Furthermore, through theoretical analysis of ADPO, we show that there exist two length measures to be considered when designing DPO-based algorithms: the token length $μ$ and the feedback length $μ'$. To the best of our knowledge, we are the first to explicitly distinguish these two measures and analyze their implications for preference optimization in LLMs.
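The structural change described above, moving the summation outside the log-sigmoid, can be shown schematically. The sketch below assumes that each entry of `margins` is a per-position difference between the chosen and rejected responses' log-probability ratios against the reference model. How positions are paired when the two responses differ in length, and how the lengths μ and μ' enter the loss, are not given in the abstract, so this is an illustration of the two functional forms only, not the paper's exact objective. The function names and the default `beta` are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sum_inside_loss(margins, beta=0.1):
    """Response-level form: -log sigmoid(beta * sum_t margin_t)."""
    return -log_sigmoid(beta * sum(margins))


def sum_outside_loss(margins, beta=0.1):
    """Summation moved outside: -sum_t log sigmoid(beta * margin_t)."""
    return -sum(log_sigmoid(beta * m) for m in margins)


margins = [0.5, -0.2, 1.0]  # made-up per-position margins
print(sum_inside_loss(margins), sum_outside_loss(margins))
```

With a single position the two forms coincide. With several positions, the second form scores each position separately instead of letting a large margin at one position offset a negative margin at another.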

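2602.08735 2026-06-11 cs.CV (updated version)

From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

Masanari Oi, Koki Maeda, Ryuto Koike, Daisuke Oba, Nakamasa Inoue, Naoaki Okazaki

Affiliations: University of Tokyo

AI summary: Proposes HATCH, a training framework with two objectives, patch-level spatial alignment and action-then-answer reasoning, that improves multi-image spatial reasoning in multimodal large language models and outperforms comparably sized baselines on three benchmarks.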

Comments ICML 2026

Details
Abstract

While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.

2602.08986 2026-06-11 cs.LG cs.AI (updated version)

Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning

Isaac Xu, Martin Gillis, Ayushi Sharma, Benjamin Misiuk, Craig J. Brown, Thomas Trappenberg

Affiliations: Faculty of Computer Science; Dalhousie University; Department of Geography; Memorial University of Newfoundland; Department of Oceanography

AI summary: To address the difficulty of detecting rare nodes in hierarchical multi-label classification, proposes a loss that combines node-wise imbalance weighting with focal weighting based on quantified ensemble uncertainty; on benchmark datasets it improves recall by up to a factor of five and yields statistically significant gains in F1 score.

Comments Accepted for publication in Transactions on Machine Learning Research (TMLR), 2026

Details
Abstract

In hierarchical multi-label classification, a persistent challenge is enabling model predictions to reach deeper levels of the hierarchy for more detailed or fine-grained classifications. This difficulty partly arises from the natural rarity of certain classes (or hierarchical nodes) and the hierarchical constraint that ensures child nodes are almost always less frequent than their parents. To address this, we propose a weighted loss objective for neural networks that combines node-wise imbalance weighting with focal weighting components, the latter leveraging modern quantification of ensemble uncertainties. By emphasizing rare nodes rather than rare observations (data points), and focusing on uncertain nodes for each model output distribution during training, we observe improvements in recall by up to a factor of five on benchmark datasets, along with statistically significant gains in $F_{1}$ score. We also show our approach aids convolutional networks on challenging tasks, as in situations with suboptimal encoders or limited data.

2509.23248 2026-06-11 cs.AI cs.NI (updated version)

Resource-Aware LLM Reasoning for Mobile Edge General Intelligence

Mingyi Luo, Ruichen Zhang, Xiangwang Hou, Jun Du, Chunxiao Jiang, Yong Ren, Shiwen Mao

Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen; College of Computing and Data Science, Nanyang Technological University, Singapore; Department of Electronic Engineering, Tsinghua University, Beijing; State Key Laboratory of Space Network and Communications, Tsinghua University, Beijing; Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing; Department of Electrical and Computer Engineering, Auburn University, Auburn, USA

AI summary: Proposes a joint optimization framework that uses adaptive CoT prompting and a distributed MoE architecture to jointly optimize reasoning depth, expert activation, and transmission power, enabling efficient LLM reasoning in resource-constrained mobile edge environments while balancing reasoning quality and resource efficiency; with less than one second of additional inference time, both accuracy and latency satisfaction rate reach 90%.

Details
Abstract

The rapid advancement of large language models (LLMs) has enabled the emergence of agentic artificial intelligence (AI) with powerful reasoning and autonomous decision-making capabilities. Its integration with edge computing has led to the development of Mobile Edge General Intelligence (MEGI), which brings real-time, privacy-preserving reasoning to the network edge. However, deploying LLM-based agentic AI reasoning in MEGI environments poses significant challenges due to the high computational demands of reasoning and the limited resources of edge devices. To address these challenges, we propose a joint optimization framework for efficient LLM reasoning deployment in MEGI. First, we systematically review enhancement methods to identify mechanisms suitable for edge adaptation. Subsequently, we present a distributed framework that combines reasoning enhancement via adaptive chain-of-thought (CoT) prompting with scalable deployment through a distributed mixture-of-experts (MoE) architecture. An important innovation of this approach is modeling reasoning depth as a dynamic network resource variable, which is optimized jointly with expert activation and transmission power. This mechanism allows the system to dynamically regulate expert networks and reasoning complexity according to task requirements and device capabilities. Experimental evaluations in mobile edge environments demonstrate that the proposed framework effectively balances reasoning quality and resource efficiency. The results show that with less than one second of additional inference time, both accuracy and latency satisfaction rate can reach 90%, validating the practical viability of deploying sophisticated LLM reasoning in resource-constrained MEGI systems.

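2601.00181 2026-06-11 cs.CL cs.AI (updated version)

Causal Emotion Recognition in Conversation: Context Saturation and Discourse-Marker Evidence

Cheonkam Jeong, Adeline Nyamathi

Affiliations: University of California, Irvine

AI summary: Through systematic ablation experiments, finds that conversational context dominates emotion recognition performance but saturates quickly, and links sadness to reduced use of left-periphery discourse markers and to greater context dependence.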

Details
Abstract

We address two persistent gaps in Emotion Recognition in Conversation: which modeling choices materially affect performance, and how recognition findings connect to interpretable discourse-level patterns. We study both through a systematic investigation on IEMOCAP with cross-dataset validation on MELD. For recognition, we run controlled ablations with 10 random seeds and paired significance tests with multiple-comparisons correction, yielding three findings. First, conversational context is the dominant factor, but performance saturates quickly: roughly 90% of the gain is captured within the most recent 10-30 preceding turns, depending on the label set. Second, hierarchical sentence representations help most in utterance-only settings and show a clear advantage on MELD, but their benefit disappears once turn-level context is available, suggesting that conversational history subsumes much of the intra-utterance structure. Third, integrating an external affective lexicon does not improve results, consistent with pretrained encoders already capturing most of the affective signal needed for ERC. Under a strictly causal setting, our simple models achieve strong performance (82.69% 4-way; 67.07% 6-way weighted F1), showing that competitive accuracy is achievable without future turns. For linguistic analysis, we examine 5,286 discourse-marker occurrences and find a reliable association between emotion and marker position (p < .0001). Sad utterances show reduced left-periphery marker usage (21.9%) relative to other emotions (28-32%), consistent with accounts linking left-periphery markers to active discourse management. This aligns with our recognition results, where Sad benefits most from conversational context (+22 percentage points), suggesting sadness may be more context-dependent than emotions with stronger local pragmatic cues.

2602.06868 2026-06-11 cs.RO (updated version)

Consensus-based optimization (CBO): Towards Global Optimality in Robotics

Xudong Sun, Armand Jordana, Massimo Fornasier, Jalal Etesami, Majid Khadiv

Affiliations: Munich Center for Machine Learning (MCML), Munich, Germany

AI summary: Introduces consensus-based optimization (CBO) to robotics, with guaranteed convergence to a global optimum under mild assumptions, and achieves lower costs than existing methods in three challenging trajectory optimization scenarios.

Details
Abstract

Zero-order optimization has recently received significant attention for designing optimal trajectories and policies for robotic systems. However, most existing methods (e.g., MPPI, CEM, and CMA-ES) are local in nature, as they rely on gradient estimation. In this paper, we introduce consensus-based optimization (CBO) to robotics, which is guaranteed to converge to a global optimum under mild assumptions. We provide theoretical analysis and illustrative examples that give intuition into the fundamental differences between CBO and existing methods. To demonstrate the scalability of CBO for robotics problems, we consider three challenging trajectory optimization scenarios: (1) a long-horizon problem for a simple system, (2) a dynamic balance problem for a highly underactuated system, and (3) a high-dimensional problem with only a terminal cost. Our results show that CBO is able to achieve lower costs with respect to existing methods on all three challenging settings. This opens a new framework to study global trajectory optimization in robotics.
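For readers unfamiliar with CBO, the following is a minimal sketch of a standard discretized CBO iteration with componentwise (anisotropic) noise, as commonly presented in the CBO literature. It is not the authors' robotics implementation: the function name `cbo_minimize`, the toy cost, and all parameter values (`lam`, `sigma`, `dt`, `alpha`, the sampling box) are assumptions chosen for illustration. Each particle drifts toward a consensus point, a weighted mean with Gibbs-type weights exp(-alpha * f(x)), and receives noise scaled by its distance from that point.

```python
import math
import random


def cbo_minimize(f, dim, n_particles=100, steps=300, lam=1.0, sigma=0.5,
                 dt=0.1, alpha=30.0, lo=-5.0, hi=5.0, seed=0):
    """Return the final consensus point of a basic CBO run (illustrative)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]

    def consensus(points):
        costs = [f(x) for x in points]
        c_min = min(costs)  # shift costs so the best weight is 1 (avoids 0/0)
        weights = [math.exp(-alpha * (c - c_min)) for c in costs]
        total = sum(weights)
        return [sum(w * x[k] for w, x in zip(weights, points)) / total
                for k in range(dim)]

    for _ in range(steps):
        v = consensus(xs)
        for x in xs:
            for k in range(dim):
                diff = x[k] - v[k]
                # Drift toward the consensus point plus distance-scaled noise.
                x[k] += (-lam * dt * diff
                         + sigma * math.sqrt(dt) * diff * rng.gauss(0.0, 1.0))
    return consensus(xs)


def toy_cost(x):
    """Hypothetical quadratic cost with its minimum at (3, -3)."""
    return (x[0] - 3.0) ** 2 + (x[1] + 3.0) ** 2


print(cbo_minimize(toy_cost, dim=2))
```

The update uses only cost evaluations, so it fits the zero-order setting discussed in the abstract. The noise vanishes as particles reach the consensus point, which is what lets the swarm settle.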

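2602.03282 2026-06-11 cs.CV cs.AI (updated version)

Global Geometry Is Not Enough for Vision Representations

Jiwan Chung, Seon Joo Kim

Affiliations: University of California, Berkeley

AI summary: Shows experimentally that global embedding geometry is nearly uncorrelated with compositional binding ability, whereas functional sensitivity measured by the input-output Jacobian reliably tracks it; the analysis attributes this to existing loss functions explicitly constraining embedding geometry but not the local input-output mapping.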

Details
Abstract

A common assumption in representation learning is that globally well-distributed embeddings support robust and generalizable representations. This focus has shaped both training objectives and evaluation protocols, implicitly treating global geometry as a proxy for representational competence. While global geometry effectively encodes which elements are present, it is often insensitive to how they are composed. We investigate this limitation by testing the ability of geometric metrics to predict compositional binding across a diverse suite of vision encoders. We find that standard geometry-based statistics exhibit near-zero correlation with compositional binding. In contrast, functional sensitivity, as measured by the input--output Jacobian, reliably tracks this capability. We further provide an analytic account showing that this disparity arises from objective design, as existing losses explicitly constrain embedding geometry but leave the local input--output mapping unconstrained. These results suggest that global embedding geometry captures only a partial view of representational competence and establish functional sensitivity as a critical complementary axis for modeling composite structure.

2602.02726 2026-06-11 cs.LG cs.CL (updated version)

Vector Quantized Latent Concepts: A Scalable Alternative to Clustering-Based Concept Discovery

Xuemin Yu, Ankur Garg, Samira Ebrahimi Kahou, Hassan Sajjad

Affiliations: Dalhousie University, Canada; University of Calgary, Canada

AI summary: Proposes the VQLC framework, which learns discrete latent concepts via vector quantization; while remaining interpretable, it stays close to K-Means in computational cost and scales better to large data than hierarchical clustering.

Details
Abstract

Large language models (LLMs) encode rich semantic information in their hidden states, yet it remains difficult to understand what information these internal representations capture. Latent concepts extracted from hidden states offer a promising direction for interpreting LLMs, but existing clustering-based methods face a trade-off: hierarchical clustering produces coherent concepts but is limited to small datasets due to its quadratic memory cost, while K-Means scales efficiently but may yield less semantically coherent concepts. We propose Vector Quantized Latent Concepts (VQLC), a discrete concept learning framework that learns a codebook of latent concepts on frozen hidden states. Across 12 dataset-model settings, VQLC stays close to K-Means in computational cost, scales better than hierarchical clustering, and remains competitive in faithfulness, with the clearest gains on decoder-only models. LLM-based evaluation, qualitative analysis, and a Sparse Autoencoder (SAE) comparison demonstrate that the learned concepts are interpretable and task-relevant.
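As a rough picture of what a codebook over frozen hidden states does, the sketch below assigns each vector to its nearest code and then moves each code toward the mean of its assigned vectors. The abstract does not describe how VQLC trains its codebook, so this moving-average update is only one common vector-quantization choice, and with `lr=1.0` it reduces to a single K-Means step. The function names, the `lr` parameter, and the toy vectors are hypothetical.

```python
def nearest_code(vec, codebook):
    """Index of the code with the smallest squared Euclidean distance to vec."""
    return min(range(len(codebook)),
               key=lambda j: sum((v - c) ** 2 for v, c in zip(vec, codebook[j])))


def update_codebook(hidden_states, codebook, lr=0.5):
    """One pass: assign states to codes, then move codes toward member means."""
    assign = [nearest_code(h, codebook) for h in hidden_states]
    new_codebook = []
    for j, code in enumerate(codebook):
        members = [h for h, a in zip(hidden_states, assign) if a == j]
        if not members:
            new_codebook.append(list(code))  # unused code is left unchanged
            continue
        mean = [sum(col) / len(members) for col in zip(*members)]
        new_codebook.append([c + lr * (m - c) for c, m in zip(code, mean)])
    return new_codebook, assign


# Toy stand-ins for frozen hidden states; real ones would come from an LLM.
hidden = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
codes, ids = update_codebook(hidden, [[1.0, 1.0], [4.0, 4.0]], lr=1.0)
print(ids, codes)
```

The returned indices play the role of discrete concept IDs: every hidden state that maps to the same code is treated as an instance of the same latent concept.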

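2512.16415 2026-06-11 cs.CV (updated version)

CountZES: Counting via Zero-Shot Exemplar Selection

Muhammad Ibraheem Siddiqui, Muhammad Haris Khan

Affiliations: Mohamed Bin Zayed University of Artificial Intelligence

AI summary: To address inaccurate counts caused by poor exemplar quality in zero-shot counting, proposes CountZES, which selects diverse exemplars through three synergistic stages (detection-anchored, density-guided, and feature-consensus) to improve counting accuracy.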

Details
Abstract

Object counting in complex scenes is particularly challenging in the zero-shot (ZS) setting, where instances of unseen categories are counted using only a class name. Existing ZS counting methods that infer exemplars from text often rely on off-the-shelf open-vocabulary detectors (OVDs), which in dense scenes suffer from semantic noise, appearance variability, and multi-instance proposals. Alternatively, random image-patch sampling is employed, which fails to accurately delineate object instances. Since counting is sensitive to exemplar quality, such selection strategies often yield poorly representative exemplars, leading to inaccurate count estimation. To address these issues, we propose CountZES, an inference-only approach for object counting via ZS exemplar selection. CountZES discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines OVD detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among ZS counting methods and its effective generalization across domains.

2602.02465 2026-06-11 cs.AI cs.CV cs.LG version update

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

MentisOculi: 揭示心智图像推理的局限性

Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

Affiliation * Max Planck Institute for Informatics

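AI summary Introduces the MentisOculi benchmark, which uses multi-step reasoning problems to test whether frontier models can use visual representations to aid reasoning. It finds that visual strategies generally fail to improve performance, and that unified multimodal models suffer from compounding generation errors and fail to exploit even ground-truth visualizations.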

Comments 9 pages, 8 figures, Accepted at ICML 2026

Details
AI Chinese abstract

前沿模型正从仅摄入视觉信息的多模态大语言模型(MLLMs)过渡到能够原生交错生成的统一多模态模型(UMMs)。这一转变激发了将中间可视化作为推理辅助的兴趣,类似于人类的心智图像。这一想法的核心是能够以目标导向的方式形成、维护和操作视觉表示。为了评估和探究这一能力,我们开发了MentisOculi,这是一个程序化的、分层的多步推理问题套件,适用于视觉解决方案,旨在挑战前沿模型。评估从潜在令牌到显式生成图像的视觉策略,我们发现它们通常无法提升性能。对UMMs的分析特别揭示了一个关键限制:虽然它们拥有解决任务的文本推理能力,并且有时能生成正确的视觉内容,但它们遭受复合生成错误,并且无法利用甚至真实的可视化。我们的发现表明,尽管视觉思维具有内在吸引力,但尚未有益于模型推理。MentisOculi为分析和弥合不同模型家族之间的这一差距建立了必要的基础。

English abstract

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.