arXivDaily: arXiv daily academic digest, updated Monday to Friday
2503.07482 2026-06-01 cs.LG cs.AI

How does Bayesian Sampling help Membership Inference Attacks?

Zhenlong Liu, Wenyu Jiang, Feng Zhou, Hongxin Wei

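Affiliations: Department of Statistics and Data Science, Southern University of Science and Technology; Shanghai Innovation Institute; School of Computer Science, Nanjing University; Center for Applied Statistics and School of Statistics, Renmin University of China

AI summary: Proposes the Bayesian Membership Inference Attack (BMIA), which applies a Laplace approximation to a single reference model and uses Bayesian sampling to estimate the conditional score distribution; shows theoretically that this reduces intra-model variance and thereby improves attack performance, and reports state-of-the-art effectiveness and efficiency on image, text, and tabular datasets.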

Comments Accepted to ICML 2026

English abstract

Membership Inference Attacks (MIAs) aim to estimate whether a specific data point was used in the training of a given model. Existing state-of-the-art attacks typically rely on training multiple reference models to approximate the conditional score distribution for individual data points, which leads to significant computational overhead and limits their practical applicability. In this work, we propose a novel approach -- Bayesian Membership Inference Attack (BMIA), which performs conditional attack through Bayesian sampling. Specifically, we apply Laplace approximation to a single reference model to obtain a posterior over model parameters, enabling direct estimation of the conditional score distribution. Theoretically, we demonstrate that Bayesian sampling reduces intra-model variance, thereby improving attack power. This insight naturally motivates the multi-reference variant that further enhances performance when additional reference models are available. Extensive experiments across image, text, and tabular datasets indicate that our method achieves state-of-the-art performance in both effectiveness and efficiency.
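
The idea of sampling from a Laplace posterior around one reference model can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the reference model is a small logistic regression, the Laplace approximation is diagonal, the per-example score is the signed logit of the true label, and the final signal is a Gaussian CDF of the target score under the sampled scores.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map(X, y, prior_prec=1.0, lr=0.5, steps=500):
    """MAP logistic regression by gradient descent (stand-in for a trained model)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) + prior_prec * w
        w -= lr * grad / len(y)
    return w

def laplace_diag_precision(X, w, prior_prec=1.0):
    """Diagonal of the Hessian of the negative log posterior at w."""
    p = sigmoid(X @ w)
    return (X ** 2).T @ (p * (1.0 - p)) + prior_prec

def score(w, x, y):
    """Signed logit of the true label (hypothetical choice of attack score)."""
    return (2 * y - 1) * (x @ w)

def membership_signal(target_score, ref_scores):
    """Gaussian CDF of the target score under the sampled reference scores."""
    mu, sd = ref_scores.mean(), ref_scores.std() + 1e-8
    return 0.5 * (1.0 + math.erf((target_score - mu) / (sd * math.sqrt(2.0))))

# Synthetic data; the query point is the first training example.
X = rng.normal(size=(200, 5))
y = (rng.random(200) < sigmoid(X @ rng.normal(size=5))).astype(float)
x_q, y_q = X[0], y[0]

w_target = fit_map(X, y)          # target model, trained WITH the query point
w_ref = fit_map(X[1:], y[1:])     # single reference model, trained WITHOUT it

# Draw parameter samples from the diagonal Laplace posterior of the reference.
prec = laplace_diag_precision(X[1:], w_ref)
samples = w_ref + rng.normal(size=(256, 5)) / np.sqrt(prec)

# Scores of the query point under the sampled models approximate the
# conditional score distribution for a non-member.
ref_scores = np.array([score(w, x_q, y_q) for w in samples])
signal = membership_signal(score(w_target, x_q, y_q), ref_scores)
print(f"membership signal: {signal:.3f}")
```

A larger signal means the target model scores the query point unusually well compared with models that never saw it. The point of the sketch is only that one trained reference model plus posterior sampling can stand in for many separately trained reference models.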

2605.25193 2026-06-01 cs.CV

SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

Sen Liang, Cong Wang, Fengbin Guan, Zhentao Yu, Yiting Lu, Yuanzhi Wang, Yuan Zhou, Xin Li, Zhibo Chen

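Affiliations: University of Science and Technology of China; Tencent Hunyuan

AI summary: Proposes SpongeBob, the first end-to-end joint audio-visual editing framework, which uses a sync-aware mechanism with bidirectional cross-modal interaction and a context-aware module to address audio-visual desynchronization and semantic conflicts in video editing.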

English abstract

Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines, lacking bidirectional modality interaction. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these, we propose SpongeBob, the first end-to-end audio-visual joint editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Due to the scarcity of paired data, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%. Our project page is available at: https://hy-spongebob.github.io/.

2603.24254 2026-06-01 cs.LG cs.AI

Beyond Static Uncertainty: Modeling Temporal Uncertainty Dynamics for Probabilistic Time Series Forecasting

Yijun Wang, Qiyuan Zhuang, Larysa Marchanka, Xiu-Shen Wei

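Affiliations: Department of Computer Science, Southeast University; Francisk Skorina Gomel State University

AI summary: Proposes VolDy-VAE, which captures volatility dynamics through a recurrent scale path to produce temporally coherent probabilistic forecasts, improving accuracy and uncertainty calibration.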

English abstract

Real-world time series exhibit temporally structured uncertainty: volatility clusters in turbulent regimes, dissipates in stable periods, and shifts abruptly around structural breaks. Yet many probabilistic forecasting methods estimate predictive uncertainty as an independent per-step quantity, leaving the evolution and persistence of volatility regimes under-modeled. We formalize this missing dimension as temporal uncertainty dynamics and instantiate it in the Volatility Dynamics Variational Autoencoder (VolDy-VAE), a non-autoregressive generative forecaster with a location-scale decoder. VolDy-VAE combines a location path for mean prediction with a recurrent scale path that transfers and evolves a volatility hidden state from the look-back window to the forecasting horizon, enabling temporally coherent predictive variances. This design yields an adaptive attenuation mechanism: high-variance observations receive lower influence on the location estimate while their uncertainty is preserved through explicit scale predictions. We further provide a simplified regime-switching analysis showing that, when variances are known or consistently estimated, the volatility-aware objective reduces to inverse-variance weighting, whereas MSE-based estimators remain unbiased but statistically inefficient. Experiments on nine benchmarks show that VolDy-VAE improves forecasting accuracy and uncertainty calibration over competitive probabilistic and point-forecasting baselines while maintaining low inference latency; plug-in studies further indicate that the VolDy principle can benefit GAN, Koopman VAE, and Transformer backbones. The source code is publicly available at https://github.com/wangyijunlyy/VolDy-VAE.
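
The claim about inverse-variance weighting can be checked on a toy location model. The sketch below assumes observations y_t = mu + sigma_t * eps_t with known sigma_t and two regimes (calm, then turbulent); the regime lengths and noise levels are made up. Minimizing the Gaussian negative log-likelihood in mu gives the inverse-variance weighted mean, while minimizing MSE gives the plain mean.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0
# Hypothetical two-regime volatility: 150 calm steps, then 50 turbulent steps.
sigma = np.where(np.arange(200) < 150, 0.5, 5.0)

def mse_estimate(y):
    """Minimizer of sum (y_t - m)^2: the unweighted mean."""
    return y.mean(axis=-1)

def ivw_estimate(y, sigma):
    """Minimizer of sum (y_t - m)^2 / sigma_t^2: inverse-variance weighting."""
    w = 1.0 / sigma ** 2
    return (y * w).sum(axis=-1) / w.sum()

# Closed-form variances of the two estimators.
var_mse = (sigma ** 2).sum() / len(sigma) ** 2
var_ivw = 1.0 / (1.0 / sigma ** 2).sum()

# Monte Carlo check over 5000 simulated series.
y = mu + sigma * rng.normal(size=(5000, len(sigma)))
est_mse = mse_estimate(y)
est_ivw = ivw_estimate(y, sigma)

print(f"mean of estimates: MSE {est_mse.mean():.3f}, IVW {est_ivw.mean():.3f}")
print(f"variance: MSE {var_mse:.5f} (theory), IVW {var_ivw:.5f} (theory)")
```

Both estimators are centered on mu, but the weighted one has lower variance because the turbulent observations are down-weighted rather than averaged in at full strength.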

2605.23937 2026-06-01 cs.AI cs.LG cs.LO math.OC

BoxLitE: A Faithful Knowledge Base Embedding Based on Convex Optimization

Bruno F. Lourenço, Hesham Morgan, Ana Ozaki, Aleksandar Pavlović, Emanuel Sallinger

Affiliations: The Institute of Statistical Mathematics, Japan; TU Wien, Austria; University of Oslo, Norway; University of Applied Sciences Campus Vienna, Austria

AI summary: Proposes BoxLitE, a model that uses convex optimization to obtain faithful embeddings of DL-Lite$^{\mathcal{H}}$ knowledge bases, and guarantees that every satisfiable knowledge base has a weakly faithful model.

Comments 28 pages. Full version of paper accepted to KR 2026 (23rd International Conference on Principles of Knowledge Representation and Reasoning). Track: KR meets Machine Learning and Explanation. Added a figure and some minor changes

English abstract

Knowledge base (KB) embeddings aim at combining the capability of classical knowledge graph embeddings to generalize the information present in facts, the ABox, with conceptual knowledge represented in an ontology language, the TBox. Several authors have recently explored the idea of mapping concepts to convex regions in a vector space. This is useful to represent hierarchies, typically present in TBoxes, since more general concepts can be mapped to larger regions, containing those regions associated with more specific concepts. However, the power of convexity is rarely leveraged during the actual learning tasks. Here, we introduce BoxLitE, a KB embedding model for DL-Lite$^{\mathcal{H}}$ that allows for convex optimization. We show that for any satisfiable DL-Lite$^{\mathcal{H}}$ KB, there is a BoxLitE embedding that is a weakly faithful model. As a proof of concept, we show how to formulate the KB embedding task as a convex optimization problem and how to obtain embeddings with such desirable faithfulness properties.
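
The region-containment view of concept hierarchies can be sketched with axis-aligned boxes, one simple family of convex regions. This is only an illustration: the abstract does not give BoxLitE's actual parameterization, and the concept names, coordinates, and `Box` class below are hypothetical.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Box:
    """Axis-aligned box, a simple convex region in the embedding space."""
    lower: np.ndarray
    upper: np.ndarray

    def contains_point(self, p):
        return bool(np.all(self.lower <= p) and np.all(p <= self.upper))

    def contains_box(self, other):
        return bool(np.all(self.lower <= other.lower)
                    and np.all(other.upper <= self.upper))

# Hypothetical TBox axiom "Dog is subsumed by Animal": the Dog region lies
# inside the Animal region.
animal = Box(np.array([0.0, 0.0]), np.array([10.0, 10.0]))
dog = Box(np.array([2.0, 3.0]), np.array([4.0, 5.0]))

# Hypothetical ABox assertion "rex is a Dog": the individual is a point.
rex = np.array([3.0, 4.0])

print(animal.contains_box(dog))    # True: the subsumption holds geometrically
print(dog.contains_point(rex))     # True: the assertion holds
print(animal.contains_point(rex))  # True: follows from containment
```

Containment constraints of this kind are linear inequalities in the box corners, which is one reason convex regions pair naturally with convex optimization.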

2605.22478 2026-06-01 cs.CV

DeliCIR: Deliberative Test-Time Evolutionary Hierarchical Multi-Agents for Composed Image Retrieval

Xingtian Pei, Yukun Song, Changwei Wang, Shunpeng Chen, Rongtao Xu, Shengpeng Xu, Shibiao Xu

Affiliations: Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences); School of Artificial Intelligence, Beijing University of Posts and Telecommunications; School of Artificial Intelligence, Mohamed bin Zayed University

AI summary: Proposes PDF, a hierarchical Perception-to-Deliberation Framework that combines a hierarchical multi-agent architecture, an Intent Routing Manager, a Decision Manager, and a tournament-style test-time scaling strategy; it is described as the first application of experience self-evolution and test-time scaling laws to composed image retrieval, and reaches state-of-the-art performance on three benchmark datasets.

Comments 10 pages, 5 figures, 4 tables

English abstract

Composed Image Retrieval (CIR) requires both preserving the visual continuity of the reference image and faithfully executing the semantic variables specified in the modification text, which constitute the core challenge of the task. Existing methods often suffer from Perception Myopia in a single space, or fall into Logic Drift in iterative collaboration due to the perception ceiling of the underlying retriever. To address this issue, we propose a one-stop hierarchical Perception-to-Deliberation Framework (PDF), which, to the best of our knowledge, is the first to introduce experience self-evolution and Test-Time Scaling Laws (TTS) into CIR. Relying on a hierarchical multi-agent architecture, PDF first utilizes an Intent Routing Manager to dynamically dispatch multi-view Worker perception signals based on modification intents to construct a high-recall candidate pool. Subsequently, the Decision Manager combines a Training-free Reasoning Policy Distillation mechanism with a Tournament-style TTS (T-TTS) strategy to achieve self-evolving fine-grained reasoning, yielding the final retrieval results. Experimental results demonstrate that PDF achieves SOTA performance on three benchmark datasets: CIRR, CIRCO, and FashionIQ. This study indicates that experience-driven self-evolution and TTS represent a highly promising and scalable path for achieving zero-shot fine-grained multimedia retrieval. The code will be made publicly available upon acceptance.

2605.15530 2026-06-01 cs.LG

Rethinking Neural Network Learning Rates: A Stackelberg Perspective

Sihan Zeng, Sujay Bhatt, Sumitra Ganesh

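Affiliations: JPMorgan AI Research, United States

AI summary: Studies non-uniform learning rates from the perspective of Stackelberg optimization; shows that using a small learning rate for the body layers and a large one for the final layer can be interpreted as two-time-scale alternating gradient descent, establishes finite-time convergence guarantees, and finds that the speed-up comes from improved optimization structure and local curvature.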

English abstract

Neural networks are typically trained with a single learning rate across all layers. While recent empirical evidence suggests that assigning layer-specific learning rates can accelerate training, a principled understanding of the conditions and mechanisms under which non-uniform learning rates are beneficial remains limited. In this work, we investigate non-uniform learning rates through the lens of Stackelberg optimization. Specifically, we demonstrate that training neural networks with a smaller learning rate for the body layers and a larger learning rate for the final layer can be interpreted as a two-time-scale alternating gradient descent algorithm applied to a Stackelberg reformulation of the original objective. We establish finite-time convergence guarantees for the algorithm under broad conditions that accommodate constraint sets and non-smooth activation functions. Beyond convergence, we identify two mechanisms by which non-uniform learning rates can outperform uniform learning rates: (i) we show that certain problem instances induce a Stackelberg objective with stronger optimization structure than the original objective, yielding faster convergence to globally optimal solutions, (ii) our numerical analysis reveals that the Stackelberg objective can exhibit substantially sharper local curvature, especially in early training, which leads to more informative gradients and learning acceleration. Experiments in supervised learning and reinforcement learning support our findings.
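
A two-time-scale alternating update of the kind described can be sketched on a tiny two-layer network. The data, layer sizes, step sizes, and the 10x ratio between head and body learning rates are all assumptions chosen for illustration; the paper's algorithm and conditions are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))
y = np.tanh(X @ rng.normal(size=8))

W = rng.normal(size=(8, 16)) * 0.3   # body layer (slow time scale)
a = np.zeros(16)                     # final layer (fast time scale)

def loss(W, a):
    return 0.5 * np.mean((np.tanh(X @ W) @ a - y) ** 2)

lr_body, lr_head = 0.01, 0.1         # hypothetical step sizes
history = [loss(W, a)]
for _ in range(300):
    H = np.tanh(X @ W)

    # Fast step: update the final layer with the larger learning rate.
    r = H @ a - y
    a = a - lr_head * (H.T @ r) / len(y)

    # Slow step: update the body with the smaller learning rate,
    # using the residual under the freshly updated head.
    r = H @ a - y
    grad_W = X.T @ ((r[:, None] * a[None, :]) * (1.0 - H ** 2)) / len(y)
    W = W - lr_body * grad_W

    history.append(loss(W, a))

print(f"loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

Because the head is updated first and with a larger step, it tracks a near-best response to the current body features, while the body drifts slowly under that response.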

2605.23278 2026-06-01 cs.CL stat.ML

When Is Next-Token Prediction Useful? Marginalization, Ergodicity, Mixture Identifiability, Local Sufficiency, RAG, Tools, and Programming

Francesco Corielli

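Affiliations: GitHub

AI summary: By distinguishing the full conditional language process, the marginal text-only process, and the model-induced distribution, argues that the usefulness of next-token prediction depends on strong assumptions (stationarity, representativeness, ergodicity) and on the observed prefix being sufficient for the latent circumstances, and interprets RAG and tool use as conditional sufficiency devices.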

English abstract

Language models trained on observed sequences are often described as learning the conditional distribution of the next token given previous tokens. This description is only conditionally correct. A model trained on realized token trajectories does not observe full conditional laws; it receives sampled continuations. Moreover, real language generation is conditioned not only on previous words but also on non-textual circumstances: facts, events, intentions, goals, beliefs, social context, and task-specific constraints. This paper distinguishes three objects that are often conflated: the full conditional language process conditioned on latent circumstances, the marginal text-only process obtained by integrating those circumstances out, and the model-induced distribution learned from finite observed corpora. The paper argues that interpreting model training as estimating the marginal text-only law requires strong assumptions of stationarity, representativeness, and ergodicity, assumptions that are standard in statistical estimation but problematic when applied to heterogeneous language corpora. Even if these assumptions hold, the marginal text-only law is useful only when the observed prefix is an approximately sufficient statistic for the latent circumstances relevant to continuation. In information-theoretic terms, usefulness requires that the residual conditional mutual information between the next token and the omitted circumstances, given the observed text, be small. The paper then extends this argument to heterogeneous training corpora. Finally, the paper interprets Retrieval Augmented Generation (RAG) and tool use as conditional sufficiency devices.
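
The information-theoretic condition can be made concrete with a small discrete example. The sketch below computes the conditional mutual information I(X; C | P) between a next token X and a latent circumstance C given a prefix P; the two toy joint distributions are invented for illustration and do not come from the paper.

```python
import numpy as np

def conditional_mutual_information(joint):
    """I(X; C | P) in bits for a joint table indexed as joint[p, c, x]."""
    joint = joint / joint.sum()
    p_p = joint.sum(axis=(1, 2), keepdims=True)    # p(p)
    p_pc = joint.sum(axis=2, keepdims=True)        # p(p, c)
    p_px = joint.sum(axis=1, keepdims=True)        # p(p, x)
    num = joint * p_p
    den = p_pc * p_px
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(num[mask] / den[mask])))

# p(x | c): the next token mostly follows the circumstance.
cond = np.array([[0.9, 0.1],
                 [0.1, 0.9]])

# Case A: the prefix reveals the circumstance (c == p), so the prefix is a
# sufficient statistic and nothing is lost by omitting c.
sufficient = np.zeros((2, 2, 2))
for p in range(2):
    sufficient[p, p, :] = 0.5 * cond[p]

# Case B: the prefix is independent of the circumstance, so the omitted
# circumstance still carries information about the next token.
insufficient = np.zeros((2, 2, 2))
for p in range(2):
    for c in range(2):
        insufficient[p, c, :] = 0.25 * cond[c]

cmi_sufficient = conditional_mutual_information(sufficient)
cmi_insufficient = conditional_mutual_information(insufficient)
print(f"prefix sufficient:   {cmi_sufficient:.3f} bits")
print(f"prefix insufficient: {cmi_insufficient:.3f} bits")
```

In this reading, retrieval or a tool call moves a problem from the second case toward the first by putting the missing circumstance into the observed context.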

2605.22967 2026-06-01 cs.LG

Learned Relay Representations for Forward-Thinking Discrete Diffusion Models

Benjamin Rozonoyer, Jacopo Minniti, Dhruvesh Patel, Neil Band, Avishek Joey Bose, Tim G. J. Rudner, Andrew McCallum

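Affiliations: University of Massachusetts Amherst; University of Toronto; Stanford University; Imperial College London; Mila; Vijil

AI summary: Proposes Learned Relay Representations (Relay), which pass latent information through a differentiable channel so that masked diffusion models can think ahead across denoising steps, reducing inference latency and improving performance.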

Comments 16 pages, 3 figures. Equal contribution: Benjamin Rozonoyer, Jacopo Minniti, and Dhruvesh Patel. Code: https://github.com/jacopo-minniti/relay

English abstract

When Masked Diffusion Models (MDMs) generate sequences through iterative refinement, the rich internal computation over masked positions is discarded, forcing every subsequent refinement step to recompute the valuable internal information stored as model representations. To avoid a hard reset between denoising rounds, we propose Learned Relay Representations (Relay), a method that allows MDMs to be forward-thinking when denoising by explicitly learning how to propagate latent information for the benefit of future denoising steps. Relay introduces a differentiable per-token channel that passes information between forward passes and is trained via truncated backpropagation through time (BPTT). We show that this framework can be scaled to state-of-the-art Diffusion Language Models (DLMs), and is seamlessly compatible with techniques like block diffusion and KV caching. We first provide a thorough justification of the design choices in Relay on a challenging Sudoku-based planning task. We then scale Relay to Fast-dLLM v2, a state-of-the-art DLM, outperforming standard supervised finetuning on coding tasks while reducing inference latency by up to 32%. Our empirical results demonstrate that state-of-the-art DLMs can be explicitly trained to relay latent information forward across decoding steps, advancing the performance-latency Pareto frontier. We provide code for all our experiments.

2605.22639 2026-06-01 cs.RO

Symmetries Here and There, Combined Everywhere: Cross-space Symmetry Compositions in Robotics

Loizos Hadjiloizou, Rodrigo Pérez-Dattari, Noémie Jaquier

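Affiliations: Department of Robotics, Perception and Learning, KTH Royal Institute of Technology

AI summary: Proposes a cross-space symmetry composition framework that uses the differential-geometric structure of forward kinematics to achieve joint equivariance to configuration-space and task-space symmetries; dual-arm robot experiments show that exploiting multiple symmetries jointly improves generalization.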

Comments 8 pages, 8 figures, 1 table

English abstract

Robots exhibit a rich variety of symmetries arising from their mechanical structure and the properties of their tasks. Although many robotics problems exhibit several symmetries simultaneously, existing approaches typically treat them in isolation, failing to exploit their combined potential. This paper introduces cross-space symmetry compositions, a framework for learning robot policies that are jointly equivariant to multiple symmetries across configuration and task spaces. Leveraging the differential-geometric structure of the forward kinematics map, we both descend symmetries from configuration to task space and lift symmetries from task to configuration space, enabling their composition within a unified representation space. We validate our framework on simulated and real-world experiments on a dual-arm robot, demonstrating that jointly leveraging multiple symmetries yields improved generalization.
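
What it means for a symmetry to pass between configuration space and task space through the forward kinematics map can be seen on a textbook planar two-link arm. This is not the dual-arm robot or the framework from the paper; the link lengths and function names are assumptions. Shifting the base joint by theta in configuration space corresponds to rotating the end-effector position by theta in task space.

```python
import numpy as np

L1, L2 = 1.0, 0.7   # hypothetical link lengths

def forward_kinematics(q):
    """End-effector position of a planar two-link arm with joint angles q."""
    q1, q2 = q
    return np.array([
        L1 * np.cos(q1) + L2 * np.cos(q1 + q2),
        L1 * np.sin(q1) + L2 * np.sin(q1 + q2),
    ])

def rotation(theta):
    """2D rotation matrix acting on task space."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def shift_base_joint(q, theta):
    """Group action on configuration space: rotate the base joint."""
    return np.array([q[0] + theta, q[1]])

q = np.array([0.3, -1.1])
theta = 0.8

# Equivariance of the forward kinematics map: f(g . q) == R(theta) f(q).
lhs = forward_kinematics(shift_base_joint(q, theta))
rhs = rotation(theta) @ forward_kinematics(q)
print(np.allclose(lhs, rhs))
```

A policy that is equivariant to this pair of actions only has to learn one base orientation and can reuse it for every other.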

2605.20992 2026-06-01 cs.CV

CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction

Hao Xu, Yilin Liu, Yinqiao Wang, Chi-Wing Fu, Niloy J. Mitra

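Affiliations: The Chinese University of Hong Kong; University College London; University College London, Adobe Research

AI summary: Proposes CHOIR, a framework that uses contact as an explicit coupling signal to reconstruct 4D hand-object interaction sequences from monocular video, including hand motion, object shape and 6D pose, and contact information; it improves object reconstruction, physical plausibility, and temporal consistency.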

English abstract

We ask whether everyday open-world monocular videos can be turned into reusable 4D interaction primitives: articulated hand motion, object shape with 6D pose over time, and the when/where of contact. Such a capability would enable scalable mining of real interactions and, beyond reconstruction, support scene-aware synthesis and planning. However, reconstructing hand-object interaction (HOI) from challenging monocular videos remains difficult: methods often assume known objects or curated scenes, and separately estimated hands and objects easily become misaligned under clutter, occlusion, and unseen object geometries. Targeting this setting, we present CHOIR, a Contact-aware HOI Reconstruction framework for a monocular camera, using contact as an explicit coupling signal between hands and objects. CHOIR first initializes a coarse, contact-agnostic 4D HOI sequence from open-world visual priors. It then introduces a generative HOI spatial rectification module that predicts ray-depth corrections and rectifies hand-object relative placement, and derives initial per-frame contact correspondences on the rectified geometry. Finally, a contact-aware joint optimization with dynamically updated contact constraints enforces geometric, temporal, and contact consistency. Experiments on controlled and challenging videos show that CHOIR improves object reconstruction, physical plausibility, and temporal consistency over state-of-the-art methods.

2605.20036 2026-06-01 cs.LG

D$^3$-Subsidy: Online and Sequential Driver Subsidy Decision-Making for Large-Scale Ride-Hailing Market

Taijie Chen, Rui Su, Siyuan Feng, Laoming Zhang, Hongyang Zhang, Haijiao Wang, Zhaofeng Ma, Jintao Ke, Li Ma

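Affiliations: University of Hong Kong; Harbin Institute of Technology; Hong Kong Polytechnic University

AI summary: For the dynamic ride-hailing market, proposes D$^3$-Subsidy, a hierarchical diffusion-based framework that uses a prefix-conditioned diffusion model and a Lagrangian-dual-derived mapping for city-level subsidy control, increasing completed rides and GMV while respecting subsidy-rate caps and low-latency constraints.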

Comments 14 pages, 14 figures

English abstract

Ride-hailing platforms like DiDi Chuxing operate in highly dynamic environments where balancing driver supply and passenger demand is critical. Although driver-side subsidies serve as a primary lever to align these forces and improve key KPIs like completed rides (\texttt{Rides}) and gross merchandise value (\texttt{GMV}), optimizing them in production requires simultaneously meeting three constraints: (i) responsiveness to stochastic shocks, (ii) strict subsidy-rate caps, and (iii) low-latency execution at city scale. These requirements rule out expensive per-order optimization, calling for a forward-looking, constraint-aware city-level controller for online sequential decision making. To meet these requirements, we introduce D$^3$-Subsidy (Dynamic Driver-side Diffusion-based Subsidy), a hierarchical diffusion-based framework for deployable city-wide subsidy control. To bridge the train-inference gap, D$^3$-Subsidy employs a prefix-conditioned diffusion model that samples plausible future trajectories from immutable historical observations, ensuring the training protocol aligns with the fixed-history nature of online deployment. These generated plans are then decoded by a context-conditioned inverse module into low-dimensional city-level control signals. For scalable execution, we bridge the gap between city-level planning and fine-grained dispatch via a Lagrangian-dual-derived mapping, which embeds subsidy-rate caps directly into order-driver incentives without iterative optimization. Additionally, a multi-city pretraining strategy with parameter-efficient fine-tuning enables robust transfer across heterogeneous cities. Extensive offline evaluations demonstrate that D$^3$-Subsidy improves \texttt{Rides} and \texttt{GMV} while enhancing cap compliance, and a real-world A/B test confirms significant uplift while keeping budget-related violation metrics within operational thresholds.

2506.21035 2026-06-01 cs.LG

Little by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts

Haodong Lu, Chongyang Zhao, Minhui Xue, Lina Yao, Kristen Moore, Dong Gong

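Affiliations: University of New South Wales

AI summary: To address the redundancy, interference, and forgetting caused by coarse expert granularity in continual learning, proposes MoRAM, which treats rank-1 adapters as fine-grained experts and associative memory units and expands them incrementally through a self-activation mechanism, significantly improving the plasticity-stability trade-off and generalization.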

Comments Accepted at ICML2026. Project page: https://artificer-ai-lab.github.io/MoRAM/

English abstract

Continual learning (CL) with large pre-trained models aims to incrementally acquire knowledge without catastrophic forgetting. Existing LoRA-based Mixture-of-Experts (MoE) methods expand capacity by adding isolated new experts while freezing old ones, but still suffer from redundancy, interference, routing ambiguity, and consequent forgetting. We investigate the issues stemming from coarse-grained expert granularity. Coarse-grained experts (e.g., high-rank LoRA) encode low-specialty information, leading to expert duplication/interference and routing degradation/confusion as experts accumulate. In this work, we propose MoRAM (Mixture of Rank-1 Associative Memory). Grounded in the view that weight matrices act as linear associative memories, MoRAM achieves CL as incremental expansion of reusable atomic rank-1 experts as memory. Each rank-1 adapter acts as a fine-grained MoE expert or an associative memory unit. By viewing rank-1 experts as key-value memory pairs, we eliminate explicit MoE-LoRA routers with self-activation, where each memory atom evaluates its relevance via its intrinsic key. The inference process thus becomes a content-addressable retrieval and recall over the incrementally accumulated memory of learning snapshots. Extensive experiments on CLIP and LLMs show that MoRAM significantly outperforms state-of-the-art methods, achieving a better plasticity-stability trade-off, stronger generalization, and reduced forgetting. Project Page: https://artificer-ai-lab.github.io/MoRAM/.
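
The view of rank-1 adapters as key-value memory atoms can be sketched in a few lines. The gating rule used here (keep the atoms whose key responds most strongly to the input) is a hypothetical stand-in for the self-activation mechanism, whose exact form the abstract does not specify; sizes and names are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_atoms = 16, 8, 32

W0 = rng.normal(size=(d_out, d_in))             # frozen pre-trained weight
keys = rng.normal(size=(n_atoms, d_in))         # one key per rank-1 atom
values = rng.normal(size=(n_atoms, d_out)) * 0.1  # one value per rank-1 atom

def forward(x, top_k):
    """Apply W0 plus the top_k rank-1 atoms most relevant to x."""
    relevance = keys @ x                        # each atom scores itself via its key
    active = np.argsort(-np.abs(relevance))[:top_k]
    # Each active atom contributes value_i * (key_i . x), a rank-1 update.
    delta = values[active].T @ relevance[active]
    return W0 @ x + delta, active

x = rng.normal(size=d_in)
dense_out, _ = forward(x, n_atoms)   # all atoms active
sparse_out, active = forward(x, 4)   # content-addressed retrieval of 4 atoms

# With every atom active, the memory is an ordinary low-rank weight update.
print(np.allclose(dense_out, (W0 + values.T @ keys) @ x))
print(sorted(active.tolist()))
```

No separate router network appears: the same key that defines an atom's contribution also decides whether the atom fires, and new atoms can be appended to `keys` and `values` without touching the old ones.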

2605.21470 2026-06-01 cs.LG cs.AI

Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling

Caleb Winston, Ron Yifeng Wang, Azalia Mirhoseini, Christos Kozyrakis

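Affiliations: Stanford University

AI summary: Proposes an agent just-in-time (JIT) compilation system in which JIT-Planner generates code plans, JIT-Scheduler explores parallelization strategies, and an invariant-enforcing tool protocol constrains tool use, substantially reducing latency and improving accuracy.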

Comments Accepted at ICML 2026

English abstract

Computer-use agents (CUAs) automate tasks specified with natural language such as "order the cheapest item from Taco Bell" by generating sequences of calls to tools such as click, type, and scroll on a browser. Current implementations follow a sequential fetch-screenshot-execute loop where each iteration requires an LLM call, resulting in high latency and frequent errors from incorrect tool use. We present agent just-in-time (JIT) compilation, a system that compiles task descriptions directly into executable code that may include LLM calls, tool calls, and parallelization. Our approach comprises three components: (1) JIT-Planner, which generates multiple code plans, validates each against tool specifications, and selects the minimum-cost candidate; (2) JIT-Scheduler, which explores parallelization strategies via Monte Carlo cost estimation from learned latency distributions; and (3) an invariant-enforcing tool protocol specifying precondition and postcondition requirements to reduce the rate of incorrect tool use. Across five applications, JIT-Planner achieves $10.4\times$ speedup and 28% higher accuracy over Browser-Use, while JIT-Scheduler achieves $2.4\times$ speedup and 9% higher accuracy over OpenAI CUA.
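
Monte Carlo cost estimation over candidate schedules can be sketched as follows. The step names, the log-normal latency model, and its parameters are hypothetical; in the system described above the distributions are learned, and the schedules would also have to respect data dependencies, which this sketch simply assumes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-step latency models: (mean, sigma) of a log-normal, in seconds.
latency_params = {
    "search": (0.0, 0.5),
    "open_menu": (0.3, 0.4),
    "read_prices": (-0.2, 0.6),
}

def sample_latency(step, n):
    mean, sigma = latency_params[step]
    return rng.lognormal(mean, sigma, size=n)

def estimate_cost(schedule, n=10_000):
    """Expected wall-clock time of a schedule.

    A schedule is a list of stages; steps inside one stage run in parallel,
    so a stage takes as long as its slowest step.
    """
    total = np.zeros(n)
    for stage in schedule:
        total += np.max([sample_latency(step, n) for step in stage], axis=0)
    return float(total.mean())

schedules = {
    "sequential": [["search"], ["open_menu"], ["read_prices"]],
    "partly_parallel": [["search"], ["open_menu", "read_prices"]],
}

costs = {name: estimate_cost(s) for name, s in schedules.items()}
best = min(costs, key=costs.get)
print(costs)
print("selected:", best)
```

Sampling matters because the cost of a parallel stage is the expected maximum of its step latencies, which cannot be read off from the individual means.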

2605.21108 2026-06-01 cs.LG cs.AI

Efficient Learning of Deep State Space Models via Importance Smoothing

John-Joseph Brady, Nikolas Nusken, Yunpeng Li

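Affiliations: Centre for Oral, Clinical and Translational Sciences, King's College London, London, United Kingdom; Department of Mathematics, King's College London, London, United Kingdom

AI summary: Proposes parallel variational Monte Carlo (PVMC), which combines variational inference with sequential Monte Carlo to train deep state space models efficiently for both discriminative and generative tasks, with training 10 times faster than the fastest competing SMC-based approach.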

Comments Accepted to the proceedings of ICML 2026

English abstract

Latent state space systems are ubiquitous in statistical modelling, arising naturally when time series are observed through noisy measurements. However, training deep state space models (DSSMs) at scale remains difficult. Two largely distinct strategies have emerged for training DSSMs. The first, auto-encoding DSSMs, trains generative models by optimising a variational lower bound. The second backpropagates through the outputs of classical sequential Monte Carlo (SMC) algorithms. Such approaches can train DSSMs for both discriminative and generative tasks, but their inherently sequential forward passes scale poorly on modern hardware. We propose \emph{parallel variational Monte Carlo} (PVMC), a new training method that bridges these paradigms and robustly trains DSSMs for both discriminative and generative tasks. Across a set of benchmark experiments, PVMC matches or exceeds state-of-the-art performance while training $10\times$ faster than the fastest competing SMC-based approach.

2605.21007 2026-06-01 cs.CV cs.RO

LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

LiteViLNet: 轻量级视觉-激光雷达融合网络用于高效道路分割

Daojie Peng, Bingtao Wang, Fulong Ma, Liang Zhang, Jun Ma

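Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Shandong University

AI summary: Proposes LiteViLNet, a lightweight multi-modal network that uses a dual-stream encoder, depth-wise separable convolutions, and a multi-scale feature fusion module to reach a 96.36% MaxF score on the KITTI Road dataset with 14.04M parameters, balancing accuracy and efficiency.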

Chinese abstract (AI-generated)

道路分割是自动驾驶和智能机器人系统的基本感知任务,需要高精度和实时推理,特别是在资源受限的边缘设备上部署时。现有的多模态道路分割方法通常依赖重型基于Transformer的编码器以达到最先进的性能,但其巨大的计算成本阻碍了在嵌入式平台上的实时部署。为解决这一困境,我们提出了LiteViLNet,一种轻量级多模态网络,融合RGB纹理信息和LiDAR几何信息用于高效道路分割。具体来说,我们设计了双流轻量级编码器和深度可分离卷积,以最小的参数从两种模态中提取层次特征。我们进一步提出了多尺度特征融合模块(MSFM)以促进不同层次的跨模态交互,以及一个大核桥模块以线性复杂度捕获长距离依赖。在KITTI道路数据集和实际应用上的大量实验表明,LiteViLNet在准确性和效率之间取得了有希望的平衡。值得注意的是,仅用14.04M参数,我们的模型达到了96.36%的MaxF分数,在所有基于CNN的方法中排名最佳,并与更大的基于Transformer的模型相当,在RTX 4060 Ti上模型推理速度为163.79 FPS(在Jetson Orin NX上为22.18 FPS)。它在推理速度上优于许多重型方法,同时保持高度竞争的准确性,充分验证了LiteViLNet在自动驾驶和智能机器人中实时嵌入式部署的潜力。

English abstract

Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose LiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.
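
To see why depth-wise separable convolutions keep the parameter count low, the following sketch compares weight counts for a single layer. The channel and kernel sizes are illustrative and are not taken from LiteViLNet; biases are omitted.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depth-wise k x k conv followed by a 1 x 1 point-wise conv."""
    return k * k * c_in + c_in * c_out

c_in, c_out, k = 64, 128, 3  # illustrative sizes, not from the paper
standard = conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
ratio = separable / standard  # equals 1/c_out + 1/k**2
```

For these sizes the separable layer needs 8,768 weights against 73,728 for the standard layer, roughly 12% of the original.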

2605.20873 2026-06-01 cs.AI cs.LG

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

PlanningBench: 生成可扩展且可验证的规划数据以评估和训练大型语言模型

Ziliang Zhao, Zenan Xu, Shuting Wang, Hongjin Qian, Yan Lei, Minda Hu, Zhao Wang, Shihan Dou, Zhicheng Dou, Pluto Zhou

Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China; LLM Department, Hunyuan Team, Tencent; Beijing Academy of Artificial Intelligence; The Chinese University of Hong Kong

AI summary: Proposes PlanningBench, a framework whose constraint-driven synthesis pipeline generates scalable, diverse, and verifiable planning data for evaluating and training LLMs, and shows that the data improves planning ability.

Chinese abstract (AI-generated)

规划是大型语言模型(LLMs)的一项基本能力,因为这类复杂任务要求模型将目标、约束、资源和长期后果协调成可执行且可验证的解决方案。然而,现有的规划基准通常将规划数据视为固定的实例集合,而非可控的生成目标。这限制了场景覆盖范围,将难度与表面代理而非结构来源挂钩,并且对可扩展生成、自动验证或面向规划的训练支持有限。我们引入PlanningBench,一个用于生成可扩展、多样化且可验证的规划数据的框架,既可用于评估也可用于训练。PlanningBench从真实规划场景出发,将实际工作流程抽象为包含30多种任务类型、子任务、约束族和难度因素的结构化分类体系。在该分类体系的指导下,一个约束驱动的合成管道实例化自包含的规划问题,具备自适应难度控制、质量过滤和实例级验证检查表。这将规划数据构建从固定基准收集转变为可控生成,同时保留现实任务基础。我们使用PlanningBench评估开源和闭源前沿LLMs,发现当前模型在耦合约束下仍难以生成完整解决方案。除评估外,在已验证的PlanningBench数据上进行强化学习可提升在未见规划基准和更广泛的指令遵循任务上的性能。进一步分析表明,确定性或明确指定的最优解提供了更清晰的奖励信号和更稳定的训练动态。总体而言,PlanningBench为诊断和提高LLMs中可泛化的规划能力提供了可控的规划数据来源。

English abstract

Planning is a fundamental capability for large language models (LLMs) because such complex tasks require models to coordinate goals, constraints, resources, and long-term consequences into executable and verifiable solutions. Existing planning benchmarks, however, usually treat planning data as fixed collections of instances rather than controllable generation targets. This limits scenario coverage, ties difficulty to surface-level proxies rather than structural sources, and offers limited support for scalable generation, automatic verification, or planning-oriented training. We introduce PlanningBench, a framework for generating scalable, diverse, and verifiable planning data for both evaluation and training. PlanningBench starts from real planning scenarios and abstracts practical workflows into a structured taxonomy of more than 30 task types, subtasks, constraint families, and difficulty factors. Guided by this taxonomy, a constraint-driven synthesis pipeline instantiates self-contained planning problems with adaptive difficulty control, quality filtering, and instance-level verification checklists. This shifts planning data construction from fixed benchmark collection to controllable generation while preserving realistic task grounding. We use PlanningBench to evaluate open-source and closed-source frontier LLMs, and find that current models still struggle to produce complete solutions under coupled constraints. Beyond evaluation, reinforcement learning on verified PlanningBench data improves performance on unseen planning benchmarks and broader instruction-following tasks. Further analysis suggests that determinate or well-specified optimal solutions provide clearer reward signals and more stable training dynamics. Overall, PlanningBench provides a controllable source of planning data for diagnosing and improving generalizable planning abilities in LLMs.

2601.22538 2026-06-01 cs.LG stat.AP

Learning-to-Defer in Non-Stationary Time Series via Switching State-Space Models

通过切换状态空间模型在非平稳时间序列中的学习-延迟决策

Yannis Montreuil, Letian Yu, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

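Affiliations: School of Computing; National University of Singapore; Institute for Infocomm Research; ISAE-SUPAERO; ONERA; A*STAR, Singapore

AI summary: Proposes L2D-SLDS, which uses a factorized switching linear-Gaussian state-space model to handle non-stationary streaming data, continuously updates beliefs about unqueried experts through a shared factor, and uses a learner-aware query score to balance immediate cost against information gain for online learning-to-defer.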

Chinese abstract (AI-generated)

学习-延迟决策(L2D)将每个决策路由到系统自身的预测器或外部专家。流式时间序列设置打破了离线L2D的假设:数据是非平稳的,专家可用性随时间变化,内部预测器在线训练。我们提出L2D-SLDS,一种基于因子化切换线性高斯状态空间模型的一阶段在线L2D框架,该模型覆盖所有潜在残差:一个离散状态、一个共享全局因子以及每个专家的特异状态。始终观测的内部残差通过共享因子持续更新关于每个未查询专家的信念,而学习感知查询分数平衡即时成本与潜在状态信息增益以及一步学习者的改进。我们证明了一个针对时变学习-延迟比较器的oracle不等式,将遗憾分解为查询奖励预算、SLDS预测成本误差项$\mathcal{E}_{\mathrm{SLDS}}$以及内部学习者的区间动态遗憾。在合成数据、墨尔本、耶拿和24专家德里基准测试上,L2D-SLDS与上下文和非平稳老虎机基线相比具有竞争力或更优,同时在真实数据轮次中延迟比例低于$2\%$。

English abstract

Learning-to-defer (L2D) routes each decision to a system's own predictor or to an external expert. Streaming time-series settings break the offline-L2D assumptions: the data are non-stationary, expert availability shifts over time, and the internal predictor is trained online. We propose L2D-SLDS, a one-stage online L2D framework based on a factorized switching linear-Gaussian state-space model over all potential residuals: a discrete regime, a shared global factor, and per-expert idiosyncratic states. The always-observed internal residual continuously updates beliefs about every unqueried expert through the shared factor, and a learner-aware query score balances immediate cost against latent-state information gain and one-step learner improvement. We prove an oracle inequality against a time-varying learn-and-defer comparator, decomposing regret into a query-bonus budget, an SLDS predictive-cost-error term~$\mathcal{E}_{\mathrm{SLDS}}$, and the internal learner's interval dynamic regret. On synthetic, Melbourne, Jena, and 24-expert Delhi benchmarks, L2D-SLDS is competitive with or improves on contextual- and non-stationary-bandit baselines while deferring on ${<}2\%$ of real-data rounds.

2509.10308 2026-06-01 cs.LG

GraphCSVAE: Graph Categorical Structured Variational Autoencoder for Spatiotemporal Auditing of Physical Vulnerability Towards Sustainable Post-Disaster Risk Reduction

GraphCSVAE: 面向可持续灾后风险降低的物理脆弱性时空审计的图类别结构化变分自编码器

Joshua Dimasaka, Christian Geiß, Robert Muir-Wood, Emily So

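Affiliations: University of Cambridge; Cambridge University Centre for Risk in the Built Environment; Earth Observation Center; Institute of Geography

AI summary: Proposes GraphCSVAE, a framework that integrates deep learning, graph representation, and categorical probabilistic inference to model physical vulnerability from time-series satellite data and expert priors, and demonstrates spatiotemporal auditing in two post-disaster regions.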

Comments Accepted for publication in Progress in Disaster Science (on May 20, 2026) and at the 8th International Disaster and Risk Conference, IDRC 2025 | Keywords: weakly supervised, graph, categorical, vulnerability, remote sensing, spatiotemporal | The data and code are respectively available at https://doi.org/10.5281/zenodo.16656471 and https://github.com/riskaudit/GraphCSVAE

Chinese abstract (AI-generated)

在灾害发生后,全球许多机构在监测灾害风险变化方面面临挑战,限制了评估联合国仙台减少灾害风险框架(2015-2030)进展的能力。尽管众多研究通过地球观测和数据驱动方法显著推进了灾害暴露和危险性的大规模建模,但在风险方程中另一个同等重要但具有挑战性的要素——物理脆弱性的建模方面进展仍然有限。为弥补这一空白,我们引入了图类别结构化变分自编码器(GraphCSVAE),这是一个概率数据驱动框架,通过整合深度学习、图表示和类别概率推断,利用时间序列卫星数据集和专家先验来建模物理脆弱性。我们引入了一个弱监督的一阶转移矩阵,以捕捉两个受灾害影响且社会经济弱势地区脆弱性时空分布的变化:孟加拉国受气旋影响的Khurushkul社区和塞拉利昂受泥石流影响的弗里敦市。在两个案例研究中,该框架构建了2016-2023年的大规模图表示,并由于缺乏时间地面真值标签,使用Aitchison距离评估后验成分分布与专家先验的差异。该工作揭示了灾后物理脆弱性的区域动态,为局部时空审计和可持续的灾后风险降低策略提供了宝贵见解。

English abstract

In the aftermath of disasters, many institutions worldwide face challenges in monitoring changes in disaster risk, limiting assessment of progress towards the UN Sendai Framework for Disaster Risk Reduction 2015-2030. While numerous efforts have substantially advanced the large-scale modeling of hazard and exposure through Earth observation and data-driven methods, progress remains limited in modeling another equally important yet challenging element of the risk equation: physical vulnerability. To address this gap, we introduce Graph Categorical Structured Variational Autoencoder (GraphCSVAE), a probabilistic data-driven framework for modeling physical vulnerability by integrating deep learning, graph representation, and categorical probabilistic inference, using time-series satellite-derived datasets and expert priors. We introduce a weakly supervised first-order transition matrix to capture changes in the spatiotemporal distribution of vulnerability across two disaster-affected and socioeconomically disadvantaged regions: the cyclone-impacted Khurushkul community in Bangladesh and the mudslide-affected city of Freetown in Sierra Leone. Across both case studies, the framework constructs large-scale graph representations spanning 2016-2023 and evaluates posterior compositional distributions against expert priors using Aitchison distance due to the lack of temporal groundtruth labels. The work reveals post-disaster regional dynamics in physical vulnerability, offering valuable insights into localized spatiotemporal auditing and sustainable strategies for post-disaster risk reduction.
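
The Aitchison distance used to compare posterior compositions with expert priors is the Euclidean distance between centered log-ratio (clr) transforms. A minimal sketch follows, assuming strictly positive parts; the prior and posterior values are hypothetical and do not come from the paper.

```python
import math

def clr(x):
    """Centered log-ratio transform of a composition with strictly positive parts."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

prior = [0.6, 0.3, 0.1]      # hypothetical expert prior over three classes
posterior = [0.5, 0.3, 0.2]  # hypothetical posterior composition
d = aitchison_distance(prior, posterior)
```

Because the clr transform subtracts the mean log, rescaling a composition (for example, counts instead of proportions) leaves the distance unchanged.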

2605.19806 2026-06-01 cs.CL cs.AI

Chunking German Legal Code

德国法律文本的分块处理

Max Prior, Natalia Milanova, Andreas Schultz

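Affiliations: Technical University of Munich

AI summary: Compares chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as the benchmark corpus, and finds that chunking along the law's inherent structure (sections and subsections) beats semantic-enrichment methods in both recall and computational efficiency.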

Comments Accepted at the Eighth Workshop on Automated Semantic Analysis of Information in Legal Texts co-located with the 21st International Conference on Artificial Intelligence and Law (ICAIL 2026)

Chinese abstract (AI-generated)

本文研究了针对德国成文法的检索增强生成的分块策略,以德国民法典作为结构化基准语料库。我们实现并比较了一系列分割方法,包括结构单元(章节、小节、句子、命题)、固定大小窗口、上下文分块、语义聚类、Lumber风格分块以及基于RAPTOR的层次检索。所有方法都在一个具有章节级黄金标签的法律问答数据集上进行评估,测量召回率、查询延迟、索引构建时间和存储需求。结果表明,与固有法律结构对齐的分块策略——特别是基于章节和小节的检索——实现了最高的召回率,而覆盖这种结构的更复杂方法表现更差。与上下文分块、RAPTOR和Lumber等LLM密集型技术相比,这些更简单的方法还提供了有利的计算效率。研究结果突出了语义丰富性与操作成本之间的关键权衡,并证明保留领域特定结构对于有效的法律信息检索至关重要。

English abstract

This paper investigates chunking strategies for retrieval-augmented generation on German statutory law, using the German Civil Code as a structured benchmark corpus. We implement and compare a range of segmentation approaches, including structural units (sections, subsections, sentences, propositions), fixed-size windows, contextual chunking, semantic clustering, Lumber-style chunking, and RAPTOR-based hierarchical retrieval. All methods are evaluated on a legal question-answering dataset with section-level gold labels, measuring recall, query latency, index build time, and storage requirements. Results show that chunking strategies aligned with the inherent legal structure, particularly section- and subsection-based retrieval, achieve the highest recall, while more complex approaches that override this structure perform worse. These simpler methods also offer favorable computational efficiency compared to LLM-intensive techniques such as contextual chunking, RAPTOR, and Lumber. The findings highlight a key trade-off between semantic enrichment and operational cost, and demonstrate that preserving domain-specific structure is critical for effective legal information retrieval.
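
The contrast between structure-aligned and fixed-size chunking can be sketched as follows. This is an illustration only, not the paper's pipeline: it assumes each section starts on its own line with a "§ N" marker, and the sample text is a placeholder rather than real statute wording.

```python
import re

def chunk_by_section(text):
    """Split statute text into one chunk per section, keyed by its '§ N' marker."""
    # Split right before every line that opens a new section, e.g. "§ 433".
    parts = re.split(r"^(?=§\s*\d)", text, flags=re.MULTILINE)
    chunks = {}
    for part in parts:
        part = part.strip()
        match = re.match(r"§\s*(\d+[a-z]?)", part)
        if match:
            chunks["§ " + match.group(1)] = part
    return chunks

def chunk_fixed(text, size):
    """Baseline: fixed-size character windows that ignore section boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

# Placeholder text, not actual statute wording.
sample = (
    "§ 1 First placeholder provision.\n"
    "(2) Second paragraph.\n"
    "§ 2 Another placeholder provision.\n"
)
chunks = chunk_by_section(sample)
windows = chunk_fixed(sample, 40)
```

Section chunks map directly onto section-level gold labels, whereas a fixed window can straddle two sections and dilute both.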

2605.19145 2026-06-01 cs.LG

PMF-CL: Pareto-Minimal-Forgetting Continual Learner for Conflicting Tasks

PMF-CL: 面向冲突任务的帕累托最小遗忘持续学习器

Srijith Nair, Atilla Eryilmaz, Jia Liu

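Affiliations: Department of Electrical and Computer Engineering

AI summary: Proposes a Pareto-optimality framework, built on a multi-task learning perspective, for continual learning of conflicting tasks with minimal forgetting, and derives Pareto-minimal-forgetting algorithms for linear regression, basis-function regression, and loss functions with a quadratic upper bound.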

Comments 25 pages, 4 figures, 4 algorithms

Chinese abstract (AI-generated)

文献中提出了许多持续学习算法来解决机器学习模型中的灾难性遗忘问题(即学习新任务导致先前学习任务性能下降)。尽管所有持续学习方法都使用某种形式的记忆来保留过去任务的信息,但对需要存储哪些信息以最小化灾难性遗忘的基本理解仍然难以捉摸。最近,人们认识到,在存在所有任务共同全局最小化器的强假设下,灾难性遗忘可以完全避免。然而,在实践中,任务很少具有共同的全局最小化器,一定程度的遗忘是不可避免的。本文提出了一个基于多任务学习视角的、原则性且系统化的冲突任务持续学习基础框架。该方法基于寻找帕累托最优解,即根据定义,在帕累托意义上最小化遗忘先前任务的解。我们推导了线性回归和基函数回归的帕累托最小遗忘持续学习算法,以及具有二次上界的一般损失函数(例如逻辑回归)。对于二次问题,PMF-CL使用内存高效的迭代更新,对于具有$d$个参数的模型,静态内存占用为$\mathcal{O}(d^2)$。

English abstract

In the literature, many continual learning (CL) algorithms have been proposed to address the issue of catastrophic forgetting in ML models (i.e., learning new tasks leads to the loss of performance on previously learned tasks). Although all CL approaches use some form of memory to retain information about past tasks, a grounded understanding of what information needs to be stored to minimize catastrophic forgetting remains elusive. Recently, it has been recognized that under the strong assumption of the existence of a common global minimizer over all tasks, catastrophic forgetting can be completely avoided. However, in practice, tasks rarely have a common global minimizer, and a certain amount of forgetting is inevitable. In this paper, we propose a foundational framework for principled and systematic CL of conflicting tasks using a multi-task learning (MTL) perspective. The approach is based on finding Pareto-optimal solutions, i.e., the solutions which, by definition, minimally forget the previous tasks in the Pareto sense. We derive Pareto-minimal-forgetting CL algorithms for linear and basis-function regression, and general loss functions which have a quadratic upper bound, e.g., logistic regression. For quadratic problems, PMF-CL uses memory-efficient iterative updates with a static memory footprint of $\mathcal{O}(d^2)$ for models with $d$ parameters.
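
A standard way to reach Pareto-optimal points for conflicting quadratic tasks is weighted-sum scalarization, sketched below. The notation ($A_i$, $w_i$, $\lambda_i$) is assumed here for illustration and is not taken from the paper, whose actual update rule may differ.

```latex
% Assumed setup: task i has loss L_i(w) = 1/2 (w - w_i)^T A_i (w - w_i),
% with A_i symmetric positive definite and w_i the task's own minimizer.
\[
  w_\lambda
  = \arg\min_{w} \sum_{i=1}^{T} \lambda_i L_i(w)
  = \Big(\sum_{i=1}^{T} \lambda_i A_i\Big)^{-1} \sum_{i=1}^{T} \lambda_i A_i w_i ,
  \qquad \lambda_i > 0,\quad \sum_{i=1}^{T} \lambda_i = 1 .
\]
```

Each such $w_\lambda$ is Pareto-optimal because the losses are strictly convex and the weights are positive. Computing it only needs the running sums $\sum_i \lambda_i A_i$ (a $d \times d$ matrix) and $\sum_i \lambda_i A_i w_i$ (a $d$-vector), so memory stays at $\mathcal{O}(d^2)$ regardless of the number of tasks, which is consistent with the footprint quoted in the abstract.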

2605.18807 2026-06-01 cs.LG cs.AI

Block-Based Double Decoders

基于块的双解码器

Asher Labovich, Benjamin Bradley, Vanessa Alexander, Chaitanya Harsha

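Affiliations: Brown University

AI summary: Proposes block-based double decoders, an architecture whose doubly-causal block-based attention masks allow full loss supervision and static sequence packing, combining decoder-only training efficiency with encoder-decoder inference efficiency; in scaling-law experiments it outperforms encoder-decoders and closely tracks decoder-only models, while cutting KV-cache memory and per-token compute at inference by at least 2/3.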

Comments 8 pages main, 13 pages total

Chinese abstract (AI-generated)

编码器-解码器模型在推理时间上比仅解码器模型节省大量成本,但其预训练目标存在稀疏监督和动态序列长度的问题,使其难以大规模实践。我们提出了基于块的双解码器,一种新颖的Transformer架构,利用双重因果块注意力掩码进行全损失监督和静态序列打包,结合了解码器训练效率与编码器-解码器推理效率。在缩放定律实验中,基于块的双解码器显著优于编码器-解码器,并在各规模上紧密跟踪仅解码器模型。在推理时,它们在不牺牲预填充缓存或仅解码器模型可用的其他现有推理优化的情况下,将KV缓存内存和每token计算减少至少2/3。

English abstract

Encoder-decoder models offer substantial inference-time savings over decoder-only models, but their pretraining objectives suffer from sparse supervision and dynamic sequence lengths, keeping them out of practice at scale. We propose block-based double decoders, a novel transformer architecture that utilizes doubly-causal block-based attention masks to train with full loss supervision and static sequence packing, combining decoder-only training efficiency with encoder-decoder inference efficiency. In scaling law experiments, block-based double decoders strongly outperform encoder-decoders and closely track decoder-only models across scales. At inference time, they cut KV-cache memory and per-token compute by at least 2/3 without sacrificing prefill caching or other existing inference optimizations available to decoder-only models.

2605.18803 2026-06-01 cs.LG cs.AI

PROWL: Prioritized Regret-Driven Optimization for World Model Learning

PROWL: 基于优先遗憾驱动的世界模型学习优化

Ahmet H. Güzel, Jenny Seidenschwarz, Benjamin Graham, Jonathan Sadeghi, Jeffrey Hawke, Ilija Bogunovic

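Affiliations: University College London AI Centre; Odyssey; University of Basel

AI summary: Proposes a KL-constrained adversarial curriculum that trains a policy to expose high-error trajectories of a diffusion world model, continuously fine-tunes the model on them, and adds a Prioritized Adversarial Trajectory buffer, improving robustness on the rare, critical transitions that passive data under-samples.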

Chinese abstract (AI-generated)

现代动作条件视频世界模型在短期视觉真实性上表现强劲,但在罕见且对交互关键的转换上仍不可靠,而这些转换主导了下游规划和策略性能。由于被动演示数据系统性地对这些高影响区域采样不足,提高鲁棒性需要主动引发模型失败,而非依赖其自然发生。我们引入了一种KL约束的对抗课程,其中训练一个策略来暴露基于扩散的世界模型的高误差轨迹,同时保持接近行为分布。世界模型在这些对抗性发现的轨迹上持续微调,形成一个对抗训练循环,将罕见失败转化为稳定的、接近分布的训练信号,而不会漂移到分布外利用。为了在模型改进时持续对未解决的弱点施加压力,我们提出了一种优先对抗轨迹(PAT)缓冲区,该缓冲区根据预测误差、动作保真度和学习进度对轨迹重新排序,将训练集中在未解决的失败模式上,而不是重复访问已解决的案例。我们在MineRL框架中实现了我们的方法,并在保留的分布外轨迹上进行了评估;PROWL提高了相对于仅在被动数据上训练的模型的鲁棒性,揭示了在弱行为约束下的奖励黑客行为,并证明了有效的对抗世界模型训练关键取决于平衡探索性失败发现与显式行为正则化。我们的结果表明,可扩展的世界模型不仅受益于更大的数据集,还受益于选择性生成信息丰富的训练数据。

English abstract

Modern action-conditioned video world models achieve strong short-horizon visual realism, yet remain unreliable on rare, interaction-critical transitions that dominate downstream planning and policy performance. Because passive demonstration data systematically under-samples these high-impact regimes, improving robustness requires actively eliciting model failures rather than relying on their natural occurrence. We introduce a KL-constrained adversarial curriculum in which a policy is trained to expose high-error trajectories of a diffusion-based world model while remaining close to the behavior distribution. The world model is continuously fine-tuned on these adversarially discovered trajectories, yielding an adversarial training loop that converts rare failures into a stable, near-distribution training signal without drifting into out-of-distribution exploitation. To maintain pressure on unresolved weaknesses as the model improves, we propose a Prioritized Adversarial Trajectory (PAT) buffer that re-ranks trajectories based on prediction error, action fidelity, and learning progress, focusing training on unresolved failure modes rather than repeatedly revisiting solved cases. We implement our approach in the MineRL framework and evaluate it on held-out out-of-distribution trajectories; PROWL improves robustness over models trained on passive data alone, reveals reward-hacking behaviors under weak behavioral constraints, and demonstrates that effective adversarial world-model training critically depends on balancing exploratory failure discovery with explicit behavioral regularization. Our results suggest that scalable world models benefit not only from larger datasets, but also from selectively generating informative training data.

2605.18606 2026-06-01 cs.LG

Physics-Aligned Canonical Equivariant Fourier Neural Operator under Symmetry-Induced Shifts

对称性诱导位移下的物理对齐规范等变傅里叶神经算子

Jiaxiao Xu, Changhong Mou, Yeyu Zhang, Fengxiang He

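Affiliations: Shanghai University of Finance and Economics; Utah State University; University of Edinburgh

AI summary: Proposes PACE-FNO, which aligns the input field to a reference frame using a Lie-algebra coordinate estimate, applies a standard FNO, and restores the target frame; it uses the continuous symmetries of evolution equations on periodic domains to separate coordinate alignment from physical evolution and reduces OOD relative error by up to 12x across several PDEs.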

Comments 36 pages, 14 figures, 10 tables

Chinese abstract (AI-generated)

神经算子近似PDE解映射,但未必尊重控制方程的对称性。在分布外(OOD)场景中,标准神经算子通常需要在单个映射中学习坐标对齐和物理演化,这可能会损害泛化能力。我们利用周期性域上演化方程的已知连续对称性来分离这两个角色。我们提出了物理对齐规范等变傅里叶神经算子(PACE-FNO),它通过李代数坐标估计器估计输入帧,将场映射到参考帧,应用标准傅里叶神经算子(FNO),并将预测恢复到目标帧。我们使用有界对称扰动联合训练对齐和算子预测,并在推理时通过可选的低维精化步骤更新估计帧。等变性通过输入和输出变换强制执行,而FNO架构保持不变。在周期性域上的1-D和2-D Burgers、浅水方程和Navier-Stokes方程中,PACE-FNO在分布内(ID)精度上与标准神经算子相当,并在平移和伽利略位移下将分布外(OOD)相对误差比带对称增强的FNO(FNO+Aug)降低多达12倍,在耦合旋转-平移位移下增益较小。消融实验表明,对齐输入和恢复输出帧贡献了大部分OOD增益;推理时精化提供了较小的修正。

English abstract

Neural operators approximate PDE solution maps, but they need not respect the symmetries of the governing equation. In out-of-distribution (OOD) regimes, a standard neural operator must often learn coordinate alignment and physical evolution within a single map, which can hurt generalization. We use known continuous symmetries of evolution equations on periodic domains to separate these two roles. We propose the Physics-Aligned Canonical Equivariant Fourier Neural Operator (PACE-FNO), which estimates the input frame with a Lie-algebra coordinate estimator, maps the field to a reference frame, applies a standard Fourier Neural Operator (FNO), and restores the prediction to the target frame. We train alignment and operator prediction jointly using bounded symmetry perturbations, with an optional low-dimensional refinement step that updates the estimated frame at inference. Equivariance is enforced by the input and output transformations, while the FNO architecture remains unchanged. Across 1-D and 2-D Burgers, shallow-water, and Navier-Stokes equations on periodic domains, PACE-FNO matches the in-distribution (ID) accuracy of standard neural operators and reduces out-of-distribution (OOD) relative error by up to 12x over FNO with symmetry augmentation (FNO+Aug) under translations and Galilean shifts, with smaller gains for coupled rotation-translation shifts. Ablations show that aligning the input and restoring the output frame account for most OOD gains; inference-time refinement provides a smaller correction.
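
The align, apply, restore pattern described above can be illustrated on a 1-D periodic grid with cyclic translations only. This is a toy sketch under stated assumptions: locating the peak stands in for the paper's Lie-algebra coordinate estimator, the field is assumed to have a unique maximum, and `toy_operator` is a hypothetical stand-in for a trained FNO.

```python
def roll(u, s):
    """Cyclically shift a periodic 1-D field (a list) by s grid points."""
    s %= len(u)
    return u[-s:] + u[:-s] if s else list(u)

def canonicalized(operator):
    """Wrap an operator so that it commutes with cyclic shifts."""
    def wrapped(u):
        # Estimated frame: index of the (assumed unique) peak.
        shift = max(range(len(u)), key=lambda i: u[i])
        reference = roll(u, -shift)              # move the peak to index 0
        return roll(operator(reference), shift)  # restore the original frame
    return wrapped

def toy_operator(u):
    """Stand-in for a learned operator; deliberately not shift-equivariant."""
    return [v * (i + 1) for i, v in enumerate(u)]

u = [0.1, 0.7, 0.3, 0.2, 0.05]
equivariant_op = canonicalized(toy_operator)
shifted_then_applied = equivariant_op(roll(u, 2))
applied_then_shifted = roll(equivariant_op(u), 2)
```

The wrapped operator is shift-equivariant by construction, even though the inner operator is unchanged and is not equivariant on its own.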

2605.18364 2026-06-01 cs.LG math.OC

Proximal basin hopping: global optimization with guarantees

近端盆地跳跃:有保证的全局优化

Guillaume Lauga, Cesare Molinari, Samuel Vaiter

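Affiliations: LJAD; MALGA; Université Côte d’Azur; Università di Genova; CNRS

AI summary: Proposes Proximal Basin Hopping (PBH), a theoretical framework combining proximal optimization with local minimization, and builds an algorithm that converges to the global minimizer with high probability; it outperforms well-known algorithms with theoretical guarantees on hard synthetic functions and on real problems such as fitting deep learning scaling laws, and the gap widens as the dimension grows.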

Chinese abstract (AI-generated)

全局优化是一个具有挑战性的问题,大量算法展示了经验上的成功,但缺乏理论支持。在这项工作中,我们提出了一个名为近端盆地跳跃(PBH)的新理论框架,精心设计以结合近端优化和局部最小化。我们利用它构建了一个实用算法,在使用有限样本时以高概率收敛到全局最小值。近端盆地跳跃在标准合成硬函数和实际问题(如拟合深度学习标度律)上优于具有理论保证的已知算法。此外,维度越高,性能差距越大。

English abstract

Global optimization is a challenging problem: plenty of algorithms display empirical success, but theoretical backing is scarce. In this work, we propose a new theoretical framework called Proximal Basin Hopping (PBH), carefully tailored to combine proximal optimization and local minimization. We use it to construct a practical algorithm that converges to the global minimizer with high probability when using a finite number of samples. Proximal Basin Hopping outperforms well-known algorithms with theoretical backing on standard synthetic hard functions and on real problems such as fitting scaling laws for deep learning. Furthermore, the higher the dimension, the larger the performance gap.
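
For context, the classical basin-hopping heuristic that the name alludes to alternates random jumps with local minimization. The sketch below shows that classical loop in a greedy variant; it is not PBH, it has no proximal step, and it carries none of the paper's guarantees. The test function, step sizes, and hop count are illustrative choices.

```python
import math
import random

def f(x):
    """1-D Rastrigin function: many local minima, global minimum at x = 0."""
    return x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x))

def local_minimize(x, step=0.001, iters=2000):
    """Crude gradient descent using a central finite difference."""
    h = 1e-6
    for _ in range(iters):
        grad = (f(x + h) - f(x - h)) / (2 * h)
        x -= step * grad
    return x

def basin_hopping(x0, hops=50, jump=1.0, seed=0):
    """Greedy basin hopping: jump, descend, keep the candidate only if it improves."""
    rng = random.Random(seed)
    best = local_minimize(x0)
    for _ in range(hops):
        candidate = local_minimize(best + rng.uniform(-jump, jump))
        if f(candidate) < f(best):
            best = candidate
    return best

result = basin_hopping(3.3)
```

The result is never worse than plain local descent from the same start, but nothing in this heuristic certifies that the global minimizer has been found.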

2605.18024 2026-06-01 cs.LG cs.AI cs.MA

Interaction-Breaking Adversarial Learning Framework for Robust Multi-Agent Reinforcement Learning

交互破坏对抗学习框架用于鲁棒多智能体强化学习

Sunwoo Lee, Mingu Kang, Yonghyeon Jo, Seungyul Han

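Affiliations: Graduate School of Artificial Intelligence, UNIST, Ulsan, South Korea

AI summary: Proposes an interaction-breaking adversarial learning framework that constructs attacks from an information-theoretic view to disrupt inter-agent interactions and trains agents to perform reliably under such disruptions, improving robustness.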

Comments 9 pages for main, 33 pages for total, Accepted to ICML 2026

Chinese abstract (AI-generated)

合作是多智能体强化学习(MARL)的核心,然而当外部扰动破坏智能体间的交互时,学到的协调可能变得脆弱。先前的鲁棒MARL方法主要考虑面向价值的攻击,在交互结构本身被破坏时存在鲁棒性缺口。在本文中,我们提出一个交互破坏对抗学习(IBAL)框架,该框架从信息论角度构建攻击,通过扰动智能体的观测和动作来阻碍协调,并训练智能体在此类干扰下可靠执行。实验上,我们的方法在多种攻击设置下比现有鲁棒MARL基线具有更好的鲁棒性,甚至在智能体缺失场景下也表现出更强的性能。我们的代码可在 https://sunwoolee0504.github.io/IBAL 获取。

English abstract

Cooperation is central to multi-agent reinforcement learning (MARL), yet learned coordination can be fragile when external perturbations disrupt inter-agent interactions. Prior robust MARL methods have primarily considered value-oriented attacks, leaving a gap in robustness when interaction structures themselves are corrupted. In this paper, we propose an interaction-breaking adversarial learning (IBAL) framework that takes an information-theoretic view to construct attacks that impede coordination by perturbing agents' observations and actions, and trains agents to perform reliably under such disruptions. Empirically, our approach improves robustness over existing robust MARL baselines across diverse attack settings and yields stronger performance even under agent-missing scenarios. Our code is available at https://sunwoolee0504.github.io/IBAL.

2605.18023 2026-06-01 cs.CV

DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection

DSAA: 面向细粒度开放词汇检测的双阶段属性激活

Donghong Jiang, Endian Lin, Hanqing Liu, Mingjie Liu, Luoping Cui, Zhao Yang, Chuang Zhu

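Affiliations: Beijing University of Posts and Telecommunications; Beijing E-Hualu Information Technology Co., Ltd.; State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China

AI summary: Proposes the DSAA framework, which strengthens attribute semantics with an Attribute Prefix Adapter at the text embedding stage and a Key/Value Modulator at the BERT encoding stage, and adds an attribute-aware contrastive loss to improve fine-grained open-vocabulary detection.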

Chinese abstract (AI-generated)

开放词汇目标检测(OVD)模型打破了封闭集检测的限制,能够通过自然语言提示识别未见类别。然而,在涉及颜色、材质和纹理等属性的细粒度检测任务中,它们表现出明显的局限性。我们将OVD模型中的这一性能瓶颈归因于一个核心问题:当类别信号占主导时,OVD模型在推理过程中倾向于边缘化属性信息,导致属性与目标对象之间的错误绑定。为了解决这个问题,我们提出了双阶段属性激活(DSAA)框架,通过在两个关键阶段增强属性语义来提升细粒度检测能力。在文本嵌入阶段,我们采用属性前缀适配器(APA)模块生成属性前缀,注入显式的属性先验。为了进一步放大这些属性的影响,我们的键/值(K/V)调制器模块在BERT编码阶段进行干预,选择性地增强对应属性令牌的键和值向量。此外,我们引入了属性感知对比损失,以在训练过程中提高具有不同属性的同类别实例之间的区分度。在FG-OVD基准上的实验结果表明,我们的方法在各种主流开放词汇模型中均有效。

English abstract

Open-Vocabulary Object Detection (OVD) models break the limitations of closed-set detection, enabling the identification of unseen categories through natural language prompts. However, they exhibit notable limitations in fine-grained detection tasks involving attributes like color, material, and texture. We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capabilities by strengthening attribute semantics at two critical stages. In the text embedding stage, we employ an Attribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further amplify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during the BERT encoding phase, selectively enhancing the Key and Value vectors of the corresponding attribute tokens. In addition, we introduce an attribute-aware contrastive loss to improve discrimination among same-category instances with different attributes during training. Experimental results on the FG-OVD benchmark demonstrate the effectiveness of our method across various mainstream open-vocabulary models.

2605.17524 2026-06-01 cs.LG cs.DB

Covariance Structure and Coordinate Heterogeneity Govern Binary Quantization of Contrastive Embeddings

协方差结构与坐标异质性支配对比嵌入的二值量化

Wenxuan Xiao

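Affiliations: Changsha University

AI summary: Analyzes the covariance structure of InfoNCE-trained representations to show how the off-diagonal terms of the covariance matrix and coordinate heterogeneity respectively affect the ranking fidelity and the design choices of binary quantization, and presents a scaling law to guide system design.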

Comments 21 pages, 1 figure, 19 tables (6 in main text, 13 in appendix)

Chinese abstract (AI-generated)

二值量化(BQ)将高维嵌入压缩为每个坐标一或两个比特,从而实现极速的最近邻搜索。然而,一个显著的谜题仍然存在:BQ在对比嵌入上取得了有竞争力的召回率,但在其他嵌入上却失败——并且两个领先系统采用了截然相反的策略(随机旋转与保留坐标轴),而没有共同的理论解释何时适用哪种策略。我们通过将最近建立的InfoNCE训练表示的Gaussian结构与BQ质量的统计框架联系起来,解决了这个谜题。我们的分析揭示了协方差矩阵的两个不同作用。首先,完整的协方差结构——而不仅仅是其对角线——决定了排序保真度的绝对水平,其中非对角相关性贡献了30-50%的信号。其次,坐标异质性(每个坐标方差的非均匀性)支配着关键设计选择:每个额外比特贡献多少,以及随机旋转是有益还是有害。我们推导了Gaussian模型下排序保真度的近似表达式,表明幅度比特携带与异质性成比例的信息,并表明随机旋转恰好破坏了某个范式所利用的信号,同时创造了另一个范式所需的各向同性。一个现象学缩放律预测了跨模型和维度的保真度。在涵盖9个嵌入家族的18个数据集上的实验支持了主要预测,并据我们所知,为二值量化系统提供了第一个有原则的设计指南。

English abstract

Binary quantization (BQ) compresses high-dimensional embeddings into one or two bits per coordinate, enabling nearest neighbor search at extreme speed. Yet a striking puzzle persists: BQ achieves competitive recall on contrastive embeddings but fails on others -- and two leading systems adopt diametrically opposite strategies (random rotation vs. preserving coordinate axes) without a common theory explaining when each is appropriate. We address this puzzle by connecting the Gaussian structure recently established for InfoNCE-trained representations to a statistical framework for BQ quality. Our analysis reveals two distinct roles of the covariance matrix. First, the full covariance structure -- not merely its diagonal -- determines the absolute level of ranking fidelity, with off-diagonal correlations contributing 30--50% of the signal. Second, coordinate heterogeneity (the non-uniformity of per-coordinate variances) governs key design choices: how much each additional bit contributes, and whether random rotation helps or hurts. We derive approximate expressions for ranking fidelity under a Gaussian model, show that the magnitude bit carries information proportional to heterogeneity, and show that random rotation destroys precisely the signal that one paradigm exploits while creating the isotropy that the other requires. A phenomenological scaling law predicts fidelity across models and dimensions. Experiments on 18 datasets spanning 9 embedding families support the main predictions and provide, to our knowledge, the first principled design guide for binary quantization systems.
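
As a concrete reference point, one-bit quantization with Hamming-distance ranking can be sketched as follows. This shows only the basic sign-bit scheme that keeps the coordinate axes, with no rotation and no magnitude bit; the vectors are made-up toy data, not embeddings from the paper.

```python
def sign_bits(vector):
    """One-bit quantization: keep only the sign of each coordinate."""
    return [1 if v >= 0 else 0 for v in vector]

def hamming(a, b):
    """Number of coordinates on which two bit codes disagree."""
    return sum(x != y for x, y in zip(a, b))

def rank_by_hamming(query, corpus):
    """Corpus indices ordered from nearest to farthest in Hamming distance."""
    q = sign_bits(query)
    return sorted(range(len(corpus)), key=lambda i: hamming(q, sign_bits(corpus[i])))

# Toy vectors for illustration only.
corpus = [
    [0.9, -0.2, 0.4, -0.7],
    [-0.8, 0.3, -0.5, 0.6],
    [0.7, 0.1, 0.5, -0.6],
]
query = [1.0, -0.1, 0.3, -0.9]
order = rank_by_hamming(query, corpus)
```

Ranking fidelity is then the agreement between this Hamming ordering and the ordering given by the full-precision similarity.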

2605.17373 2026-06-01 cs.LG cs.AI

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

FML-bench:从搜索动力学视角对AI研究代理策略的受控研究

Qiran Zou, Hou Hei Lam, Wenhao Zhao, Tingting Chen, Yiming Tang, Samson Yu, Yingtao Zhu, Srinivas Anumasa, Zufeng Zhang, Tianyi Zhang, Chang Liu, Zhengyao Jiang, Anirudh Goyal, Dianbo Liu

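Affiliations: National University of Singapore; Tsinghua University; University of Minnesota; Weco; Meta

AI summary: Proposes the FML-Bench benchmark, which separates agent strategy from execution infrastructure and defines process-level metrics; evaluating six agent strategies, it finds that greedy hill-climbing nearly matches the best tree search and that an adaptive agent that switches to broader exploration when improvement stagnates outperforms the other agents.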

Comments Our benchmark is available at: https://github.com/qrzou/FML-bench

Chinese abstract (AI-generated)

AI研究代理通过自动化假设生成、实验和实证改进来加速机器学习研究。现有代理策略从贪婪爬山法到树搜索和进化优化不等,但哪些策略选择驱动性能仍不清楚。回答这个问题需要一个基准,该基准将代理策略(例如搜索拓扑)与执行基础设施(例如代码编辑器)分离,以便性能差异归因于策略而非基础设施,并提供最终分数之外的过程级指标来分析探索行为。现有基准支持有限。我们提出FML-Bench,一个涵盖10个领域18个基础ML研究任务的基准,将代理策略与执行基础设施分离,并定义了12个过程级行为指标。评估六个代表性代理,我们发现:(1) 策略复杂性本身并不能保证强性能:一个简单的贪婪爬山者几乎与最佳性能的树搜索代理相匹配,两者均远高于其余代理;(2) 我们的分析表明,这种模式与改进机会结构相关:当机会密集时,贪婪搜索往往更有效,而当机会稀疏时,树搜索和进化策略往往更有效;基于这一见解构建的自适应代理在检测到改进停滞时切换到更广泛的探索,并优于其他六个代理,初步支持了这一观察;(3) 过程级分析表明,早期收敛和方向聚焦的探索与最终性能显著相关,而解决方案多样性和计算成本则不然。我们的基准可在 https://github.com/qrzou/FML-bench 获取。

English abstract

AI research agents accelerate ML research by automating hypothesis generation, experimentation, and empirical refinement. Existing agent strategies range from greedy hill-climbing to tree search and evolutionary optimization, yet which strategy choices drive performance remains unclear. Answering this question requires a benchmark that separates agent strategy (e.g., search topology) from execution infrastructure (e.g., code editor), so that performance differences are attributable to strategy rather than infrastructure, and that provides process-level metrics beyond final scores to analyze exploration behaviors. Existing benchmarks offer limited support. We propose FML-Bench, a benchmark of 18 fundamental ML research tasks across 10 domains that separates agent strategy from execution infrastructure and defines 12 process-level behavioral metrics. Evaluating six representative agents, we find that: (1) strategy complexity alone does not guarantee strong performance: a simple greedy hill-climber nearly matches the best-performing tree-search agent, both well above the remaining agents; (2) our analysis suggests this pattern relates to improvement opportunity structure: greedy search tends to be more effective when opportunities are dense, while tree-search and evolutionary strategies tend to be more effective when opportunities are sparse; an adaptive agent built on this insight switches to broader exploration upon detecting improvement stagnation and outperforms the other six agents, lending initial support to this observation; and (3) process-level analysis reveals that early convergence and directionally focused exploration are significantly associated with final performance, while solution diversity and compute cost are not. Our benchmark is available at: https://github.com/qrzou/FML-bench.

2605.17101 2026-06-01 cs.CL cs.AI

SEMA-RAG: A Self-Evolving Multi-Agent Retrieval-Augmented Generation Framework for Medical Reasoning

Yongfeng Huang, Ruiying Chen, James Cheng

Affiliations * CSE, The Chinese University of Hong Kong (Department of Computer Science and Engineering); Wuhan University of Technology

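AI summary To address the mismatch between single-round static retrieval and the multi-stage process of clinical reasoning in medical question answering, the paper proposes the SEMA-RAG framework. Through task decoupling and dynamic multi-round exploration, three specialist agents handle clinical interpretation, self-evolving retrieval, and evidence adjudication respectively, improving accuracy over the strongest baseline by 6.46 points on average across multiple benchmarks.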

Comments Accepted to Findings of ACL 2026

Details
AI abstract (translated from Chinese)

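Retrieval-augmented generation (RAG) is widely used to mitigate risks such as hallucination and outdated knowledge in medical question answering, but it mostly follows a single-round, static retrieval paradigm that does not match the multi-stage process of clinical reasoning. This compressed workflow leads to two structural deficiencies: question-to-query translation usually lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, which makes it hard to form reliable evidence chains. We argue that both problems stem from a deeper cause: overloading a single reasoning chain with the heterogeneous tasks of interpretation, exploration, and adjudication. The solution is to restructure the workflow through task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a self-evolving multi-agent RAG framework for medical question answering that assigns these roles to three specialist agents: the Interpreter Agent handles clinical schema interpretation, the Explorer Agent handles sufficiency-driven self-evolving retrieval, and the Arbiter Agent handles evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves on the strongest baseline by 6.46 accuracy points on average, measured per backbone.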

English abstract

Retrieval-Augmented Generation (RAG) is widely employed to mitigate risks such as hallucinations and knowledge obsolescence in medical question answering, yet its predominantly single-round, static retrieval paradigm misaligns with the multi-stage process of clinical reasoning. This compressed workflow induces two structural deficiencies: question-to-query translation often lacks clinically grounded semantic interpretation, and retrieval lacks iterative sufficiency feedback, making it difficult to form reliable evidence chains. We argue that both issues stem from a deeper cause: overloading a single reasoning chain with heterogeneous tasks of interpretation, exploration, and adjudication. The remedy is to reconstruct the workflow via task decoupling and dynamic multi-round exploration. To this end, we propose SEMA-RAG, a Self-Evolving Multi-Agent RAG framework for medical question answering, which assigns these roles to three specialist agents: the Interpreter Agent for clinical schema interpretation, the Explorer Agent for sufficiency-driven self-evolving retrieval, and the Arbiter Agent for evidence adjudication and answer selection. Across five benchmarks and five LLM backbones, SEMA-RAG improves the strongest baseline by +6.46 accuracy points on average, measured per backbone.
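The division of labor among the three roles can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the SEMA-RAG implementation: plain functions stand in for LLM agents, keyword matching stands in for retrieval, the names `interpret`, `explore`, `arbitrate`, and `State` are hypothetical, and the three-sentence corpus is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    question: str
    queries: list = field(default_factory=list)
    evidence: list = field(default_factory=list)


def interpret(question):
    # Interpreter role (stub): turn the question into candidate query terms.
    return [w.strip("?.,").lower() for w in question.split() if len(w) > 4]


def explore(state, corpus, max_rounds=3, needed=2):
    # Explorer role (stub): retrieve round by round and stop as soon as the
    # evidence is judged sufficient (here: at least `needed` documents).
    rounds = 0
    pending = list(state.queries)
    while pending and rounds < max_rounds and len(state.evidence) < needed:
        term = pending.pop(0)
        hits = [d for d in corpus if term in d.lower() and d not in state.evidence]
        state.evidence.extend(hits)
        rounds += 1
    return rounds


def arbitrate(state, options):
    # Arbiter role (stub): choose the option supported by the most evidence.
    return max(options, key=lambda o: sum(o.lower() in d.lower() for d in state.evidence))


# Toy corpus, invented for this example only.
corpus = [
    "Metformin is first-line therapy for type 2 diabetes.",
    "Insulin is used in type 1 diabetes.",
    "Lifestyle changes accompany metformin therapy.",
]
state = State(question="Which first-line therapy treats diabetes?")
state.queries = interpret(state.question)
rounds = explore(state, corpus)
answer = arbitrate(state, ["Metformin", "Insulin"])
print(rounds, answer)
```

The point of the sketch is the control flow: interpretation happens once, retrieval repeats until a sufficiency check passes or a round budget runs out, and answer selection only sees the accumulated evidence.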

2605.16215 2026-06-01 cs.AI cs.CL

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley

Affiliations * EPFL (École Polytechnique Fédérale de Lausanne)

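AI summary Proposes Fully Open Meditron, the first fully open pipeline for building clinical large language models. With an auditable dataset, a reproducible training framework, and a use-aligned evaluation protocol, it achieves state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.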

Comments Preprint. 31 pages, 10 figures. Code, models, and data: https://github.com/EPFLiGHT/FullyOpenMeditron

Details
AI abstract (translated from Chinese)

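Clinical decision support systems (CDSS) need scrutable, auditable pipelines that allow rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only: they release parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, which comprises a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and extends coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate with an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their base models. Apertus-70B-MeditronFO improves on its base model by +6.6 percentage points (from 47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.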

English abstract

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.
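The reported gains are absolute percentage-point differences, which is easy to confuse with relative improvement. A minimal arithmetic check using only the figures quoted in the abstract (the variable names are illustrative):

```python
# Figures quoted in the abstract (aggregate medical benchmarks, in percent).
base_acc = 47.2    # Apertus-70B base model
tuned_acc = 53.8   # Apertus-70B-MeditronFO

# Absolute gain in percentage points, as reported (+6.6).
point_gain = round(tuned_acc - base_acc, 1)

# Relative gain over the base score, a different quantity from the point gain.
relative_gain = round(100 * (tuned_acc - base_acc) / base_acc, 1)

# HealthBench margin between Gemma-3-27B-MeditronFO and MedGemma, in points.
healthbench_margin = round(58 - 55.9, 1)

print(point_gain, relative_gain, healthbench_margin)
```

The +6.6 figure is the point gain; expressed relative to the 47.2% base score, the same change is roughly 14%.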