arXivDaily: arXiv daily academic digest, updated Monday through Friday
2510.01698 2026-06-03 cs.IR cs.MM cs.SD eess.AS

TalkPlay-Tools: Conversational Music Recommendation with LLM Tool Calling

Seungheon Doh, Keunwoo Choi, Juhan Nam

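Affiliations: KAIST; talkpl.ai

AI summary: Proposes a unified retrieval-reranking pipeline built on LLM tool calling that combines boolean filtering, sparse retrieval, dense retrieval, and generative retrieval for end-to-end conversational music recommendation.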

Comments Accepted for publication at The Workshop on AI for Music, Neural Information Processing Systems (NeurIPS-AI4Music)

English abstract

While the recent developments in large language models (LLMs) have successfully enabled generative recommenders with natural language interactions, their recommendation behavior is limited, leaving other simpler yet crucial components such as metadata or attribute filtering underutilized in the system. We propose an LLM-based music recommendation system with tool calling to serve as a unified retrieval-reranking pipeline. Our system positions an LLM as an end-to-end recommendation system that interprets user intent, plans tool invocations, and orchestrates specialized components: boolean filters (SQL), sparse retrieval (BM25), dense retrieval (embedding similarity), and generative retrieval (semantic IDs). Through tool planning, the system predicts which types of tools to use, their execution order, and the arguments needed to find music matching user preferences, supporting diverse modalities while seamlessly integrating multiple database filtering methods. We demonstrate that this unified tool-calling framework achieves competitive performance across diverse recommendation scenarios by selectively employing appropriate retrieval methods based on user queries, envisioning a new paradigm for conversational music recommendation systems.
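
The tool-planning idea in the abstract can be sketched as a plan of tool calls executed in order. This is a minimal illustration under stated assumptions, not the authors' system: the catalog, the tool names `boolean_filter` and `bm25_rank`, and the plan format are all hypothetical, the boolean filter is a Python stand-in for the SQL component, and the plan is hand-written where the paper would have an LLM predict it.

```python
import math
from collections import Counter

# Toy catalog; every field and value here is made up for illustration.
CATALOG = [
    {"id": "t1", "artist": "A", "year": 1995, "text": "mellow acoustic guitar ballad"},
    {"id": "t2", "artist": "B", "year": 2019, "text": "upbeat electronic dance track"},
    {"id": "t3", "artist": "C", "year": 2021, "text": "mellow electronic ambient track"},
]

def boolean_filter(tracks, min_year):
    """Stand-in for the SQL-style filter: keep tracks released in or after min_year."""
    return [t for t in tracks if t["year"] >= min_year]

def bm25_rank(tracks, query, k1=1.5, b=0.75):
    """Rank tracks by a plain BM25 score over their text field; returns track ids."""
    if not tracks:
        return []
    docs = [t["text"].split() for t in tracks]
    avg_len = sum(len(d) for d in docs) / len(docs)
    scored = []
    for track, doc in zip(tracks, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query.split():
            df = sum(1 for d in docs if term in d)
            idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1.0)
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scored.append((score, track["id"]))
    # Highest score first; ties broken by id so the order is deterministic.
    return [tid for _, tid in sorted(scored, key=lambda s: (-s[0], s[1]))]

TOOLS = {"boolean_filter": boolean_filter, "bm25_rank": bm25_rank}

def run_plan(plan, tracks):
    """Execute tool calls in order; each call consumes the previous result."""
    result = tracks
    for call in plan:
        result = TOOLS[call["tool"]](result, **call["args"])
    return result

# A plan of the kind a planner might emit (hand-written here).
plan = [
    {"tool": "boolean_filter", "args": {"min_year": 2015}},
    {"tool": "bm25_rank", "args": {"query": "mellow electronic"}},
]
ranking = run_plan(plan, CATALOG)
```

In this sketch the ranking step returns ids, so it has to be the last call in a plan; a fuller version would keep a uniform result type so that dense or generative retrieval steps could be chained after it.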

2502.09755 2026-06-03 cs.CR cs.LG

Jailbreak Attack Initializations as Extractors of Compliance Directions

Amit Levi, Rom Himelstein, Yaniv Nemcovsky, Avi Mendelson, Chaim Baskin

Affiliations: Department of Computer Science, Technion - Israel Institute of Technology; Department of Data and Decision Science, Technion - Israel Institute of Technology; School of Electrical and Computer Engineering, Ben-Gurion University of the Negev

AI summary: Finds that gradient-based jailbreak attacks and their initializations converge to a single compliance direction that suppresses refusal, and on that basis proposes the CRI framework, which projects unseen prompts along the compliance direction to raise the attack success rate and reduce computational overhead.

Comments Accepted to Findings of the Association for Computational Linguistics 2025 (EMNLP 2025)

English abstract

Safety-aligned LLMs respond to prompts with either compliance or refusal, each corresponding to distinct directions in the model's activation space. Recent works show that initializing attacks via self-transfer from other prompts significantly enhances their performance. However, the underlying mechanisms of these initializations remain unclear, and attacks utilize arbitrary or hand-picked initializations. This work shows that each gradient-based jailbreak attack and its subsequent initialization gradually converge to a single compliance direction that suppresses refusal, thereby enabling an efficient transition from refusal to compliance. Based on this insight, we propose CRI, an initialization framework that aims to project unseen prompts further along compliance directions. We demonstrate our approach on multiple attacks, models, and datasets, achieving an increased attack success rate (ASR) and reduced computational overhead, highlighting the fragility of safety-aligned LLMs. A reference implementation is available at: https://amit1221levi.github.io/CRI-Jailbreak-Init-LLMs-evaluation.

2510.01377 2026-06-03 math.OC cs.AI cs.LG cs.MA cs.SY eess.SY

DeMuon: A Decentralized Muon for Matrix Optimization over Graphs

Chuan He, Shuyi Ren, Jingwei Mao, Erik G. Larsson

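Affiliations: Department of Mathematics, Linköping University; Department of Electrical Engineering, Linköping University; Department of Computer and Information Science, Linköping University

AI summary: Proposes DeMuon, which orthogonalizes matrices via Newton-Schulz iterations and uses gradient tracking to handle heterogeneity among local functions. Under heavy-tailed noise it matches the complexity of centralized algorithms, and it is the first extension of Muon to decentralized optimization over graphs with provable complexity guarantees.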

Comments Add an accelerated variant of the proposed method. New proofs of proposed methods

English abstract

In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations, a technique inherited from its centralized predecessor Muon, and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate a clear margin of improvement of DeMuon over other popular decentralized algorithms across different network topologies.
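
To make the orthogonalization step concrete, here is a small sketch of a Newton-Schulz iteration in pure Python. It uses the textbook cubic form X <- 1.5 X - 0.5 X X^T X after scaling by the Frobenius norm; the polynomial coefficients and step count used by Muon or DeMuon may differ, and the helper names are hypothetical. Gradient tracking and communication over the graph are not shown.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz_orthogonalize(g, steps=25):
    """Approximate the orthogonal polar factor of g with the cubic Newton-Schulz iteration."""
    fro = math.sqrt(sum(v * v for row in g for v in row))
    # Scaling by the Frobenius norm puts every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x

# Example: a full-rank 2x2 "gradient" matrix (values chosen arbitrarily).
g = [[2.0, 1.0], [0.0, 1.0]]
q = newton_schulz_orthogonalize(g)
qtq = matmul(transpose(q), q)  # close to the identity once the iteration has converged
```

The iteration only needs matrix products, which is why it is attractive on accelerators compared with an explicit SVD; convergence slows down when the input has very small singular values.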

2509.08726 2026-06-03 math.OC cs.LG

Decentralized Stochastic Nonconvex Optimization under the $(L_0,L_1)$-Smoothness

Luo Luo, Xue Cui, Tingkai Jia, Cheng Chen

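Affiliations: School of Data Science, Fudan University; East China Normal University

AI summary: For nonconvex functions satisfying the $(L_0,L_1)$-smoothness condition, proposes a decentralized normalized stochastic gradient descent algorithm that reaches an $\varepsilon$-stationary point at every local agent, with upper bounds on the sample complexity and the communication complexity.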

English abstract

This paper focuses on the decentralized stochastic optimization problem $f(\mathbf{x})=\frac{1}{m}\sum_{i=1}^m f_i(\mathbf{x})$ over a connected network of $m$ agents, where each local function has the form $f_i(\mathbf{x}) = {\mathbb E}\left[F(\mathbf{x};{\boldsymbol \xi}_i)\right]$, satisfies the $(L_0,L_1)$-smoothness condition but is possibly nonconvex, and each random variable ${\boldsymbol \xi}_i$ follows distribution ${\mathcal D}_i$. We propose a novel algorithm called decentralized normalized stochastic gradient descent (DNSGD), which can achieve an $\varepsilon$-stationary point at each local agent. We present a new framework for analyzing decentralized first-order methods in the $(L_0,L_1)$-smooth setting, based on a Lyapunov function related to the product of the gradient norm and the consensus error. We show that the proposed algorithm attains an upper bound on the sample complexity of ${\mathcal O}(m^{-1}(L_f\sigma^2\Delta_f\varepsilon^{-4} + \sigma^2\varepsilon^{-2} + L_f^{-2}L_1^3\sigma^2\Delta_f\varepsilon^{-1} + L_f^{-2}L_1^2\sigma^2))$ per agent and on the communication complexity of $\tilde{\mathcal O}((L_f\varepsilon^{-2} + L_1\varepsilon^{-1})\gamma^{-1/2}\Delta_f)$, where $L_f=L_0 +L_1\zeta$, $\sigma^2$ is the variance of the stochastic gradient, $\Delta_f$ is the initial optimal function value gap, $\gamma$ is the spectral gap of the network, and $\zeta$ is the degree of the gradient dissimilarity. In the special case of $L_1=0$, the above results (nearly) match the lower bounds of decentralized stochastic nonconvex optimization under the standard smoothness. We also conduct numerical experiments to show the empirical superiority of our method.
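
The two ingredients named in the abstract, normalization of the stochastic gradient and communication between agents, can be illustrated with a simplified round: one gossip-averaging step followed by a step of fixed length along each agent's normalized gradient. This is a sketch under assumptions, not the DNSGD algorithm itself, whose exact update (for example any gradient tracking or multi-round mixing) is not given in the abstract. The function names, mixing matrix, and numbers are hypothetical.

```python
import math

def normalize(g, eps=1e-12):
    """Return g scaled to (almost exactly) unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in g))
    return [v / (norm + eps) for v in g]

def gossip_average(xs, w):
    """One mixing round: agent i replaces x_i with sum_j w[i][j] * x_j."""
    n, dim = len(xs), len(xs[0])
    return [[sum(w[i][j] * xs[j][d] for j in range(n)) for d in range(dim)]
            for i in range(n)]

def decentralized_normalized_step(xs, grads, w, lr):
    """Mix with neighbours, then move a distance of about lr along -g_i / ||g_i||."""
    mixed = gossip_average(xs, w)
    return [[xi - lr * di for xi, di in zip(x, normalize(g))]
            for x, g in zip(mixed, grads)]

# Two agents with a doubly stochastic mixing matrix (illustrative values).
w = [[0.5, 0.5], [0.5, 0.5]]
xs = [[0.0, 0.0], [2.0, 0.0]]
grads = [[3.0, 4.0], [0.0, 5.0]]   # stand-ins for local stochastic gradients
new_xs = decentralized_normalized_step(xs, grads, w, lr=0.1)
```

Because the step length does not grow with the gradient norm, the update stays controlled in regions where the local smoothness constant scales with the gradient norm, which is the regime the $(L_0,L_1)$ condition is meant to describe.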

2509.08707 2026-06-03 q-bio.BM cs.LG

Tokenizing Loops of Antibodies

Ada Fang, Robert G. Alberstein, Simon Kelow, Frédéric A. Dreyer

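Affiliations: Harvard University; Prescient Design, Genentech

AI summary: Proposes Igloo, a multimodal antibody loop tokenizer that encodes backbone dihedral angles and sequence via contrastive learning, retrieves similar loop structures efficiently, improves H3 loop identification by 5.9%, and integrates into protein language models to improve antibody design.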

Comments 21 pages, 7 figures, 10 tables, code available at https://github.com/prescient-design/igloo

English abstract

The complementarity-determining regions (CDRs) of antibodies are loop structures that are key to their interactions with antigens and of high importance to the design of novel biologics. Since the 1980s, categorizing the diversity of CDR structures into canonical clusters has enabled the identification of key structural motifs of antibodies. However, existing approaches have limited coverage and cannot be readily incorporated into protein foundation models. Here we introduce ImmunoGlobulin LOOp Tokenizer, Igloo, a multimodal antibody loop tokenizer that encodes backbone dihedral angles and sequence. Igloo is trained using a contrastive learning objective to map loops with similar backbone dihedral angles closer together in latent space. Igloo can efficiently retrieve the closest matching loop structures from a structural antibody database, outperforming existing methods on identifying similar H3 loops by 5.9%. Igloo assigns tokens to all loops, addressing the limited coverage issue of canonical clusters, while retaining the ability to recover canonical loop conformations. To demonstrate the versatility of Igloo tokens, we show that they can be incorporated into protein language models with IglooLM and IglooALM. On predicting binding affinity of heavy chain variants, IglooLM outperforms the base protein language model on 8 out of 10 antibody-antigen targets. Additionally, it is on par with existing state-of-the-art sequence-based and multimodal protein language models, performing comparably to models with $7\times$ more parameters. IglooALM samples antibody loops that are diverse in sequence and more consistent in structure than those from state-of-the-art antibody inverse folding models. Igloo demonstrates the benefit of multimodal tokens for antibody loops: encoding the diverse landscape of antibody loops, improving protein foundation models, and supporting antibody CDR design.

2506.01969 2026-06-03 cs.DC cs.AI cs.LG

FlashMLA-ETAP: Efficient Transpose Attention Pipeline for Accelerating MLA Inference on NVIDIA H20 GPUs

Pengcuo Dege, Qiuming Luo, Rui Mao, Chang Kong

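Affiliations: Tencent; College of Computer Science and Software Engineering, Shenzhen University; College of Artificial Intelligence, Shenzhen Polytechnic University

AI summary: Targets inefficient Multi-Head Latent Attention (MLA) inference when deploying the DeepSeek-R1 671B model on a single multi-GPU server. Proposes FlashMLA-ETAP, whose Efficient Transpose Attention Pipeline (ETAP) reconfigures the attention computation, giving a 2.78x speedup on NVIDIA H20 GPUs while maintaining numerical stability.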

Comments Accepted by ICONIP2025

English abstract

Efficient inference of Multi-Head Latent Attention (MLA) is challenged by deploying the DeepSeek-R1 671B model on a single Multi-GPU server. This paper introduces FlashMLA-ETAP, a novel framework that enhances MLA inference for the single-instance deployment scenario on NVIDIA H20 GPUs. We propose the Efficient Transpose Attention Pipeline (ETAP), which reconfigures attention computation through transposition to align the KV context length with the \(M\)-dimension in WGMMA operations, significantly reducing redundant computations. FlashMLA-ETAP achieves a 2.78x speedup over FlashMLA at 64K sequence length (batch size 16), with 5.24x and 4.94x improvements over FlashAttention-3 and FlashInfer, respectively, while maintaining numerical stability with a 15.2x lower RMSE (\(1.25 \times 10^{-5}\)) than FlashAttention-3. Furthermore, ETAP's design enables seamless integration into frameworks like FlashAttention-3 and FlashInfer, supported by a detailed theoretical analysis. Our work addresses a critical gap in resource-constrained inference, offering a scalable solution for mid-tier GPUs and paving the way for broader adoption in hardware-aware optimization. Code is available at https://github.com/pengcuo/FlashMLA-ETAP.
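
The transposition the abstract refers to rests on a simple identity: if O = softmax(Q K^T) V, then O^T = V^T softmax_cols(K Q^T), so the score matrix can be computed with the KV length on its leading dimension instead of the (often small) number of query rows. The sketch below checks only that algebra in pure Python. It says nothing about the WGMMA kernel, tiling, or performance, it omits the usual 1/sqrt(d) scaling in both versions, and the function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def softmax_rows(s):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in s:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        total = sum(e)
        out.append([v / total for v in e])
    return out

def softmax_cols(s):
    """Softmax applied to each column."""
    return transpose(softmax_rows(transpose(s)))

def attention(q, k, v):
    """Standard layout: scores are (queries x keys), softmax over each row."""
    return matmul(softmax_rows(matmul(q, transpose(k))), v)

def attention_transposed(q, k, v):
    """Transposed layout: scores are (keys x queries); the result is O^T."""
    p_t = softmax_cols(matmul(k, transpose(q)))
    return matmul(transpose(v), p_t)

# One query row against three cached keys/values (illustrative numbers).
q = [[1.0, 0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(q, k, v)                               # shape 1 x 2
out_from_t = transpose(attention_transposed(q, k, v))  # same values, other route
```

In the example the score matrix is 1 x 3 in the standard layout and 3 x 1 in the transposed one; with a long KV cache the transposed layout is the one whose leading dimension is large.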

2506.01075 2026-06-03 cs.DS cs.IT cs.LG math.IT

Learning DNF through Generalized Fourier Representations

Mohsen Heidari, Roni Khardon

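Affiliations: Department of Computer Sciences, Indiana University, Bloomington, IN, USA

AI summary: To address DNF learning under non-product distributions, introduces a generalized Fourier representation based on Bayesian networks, proves that the $L_1$ spectral norm of conjunctions is bounded, and obtains learnability results for DNF and decision trees.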

Comments 60 pages

English abstract

The Boolean Fourier representation has been widely used in learning theory, particularly for learning Disjunctive Normal Form (DNF) under uniform and product distributions. Extending these results to non-product distributions has remained a longstanding open problem. We address this challenge by introducing a generalized Fourier representation that enables learning under a broad class of non-product distributions. Our approach represents any distribution $D$ as a Bayesian network (BN) and derives a corresponding Fourier expansion. We show that standard Fourier-based learning techniques using membership queries to identify heavy coefficients can be adapted to this generalized representation with minor modifications. We prove that the $L_1$ spectral norm of conjunctions remains bounded under this expansion for difference-bounded tree BNs, significantly generalizing the known result for uniform distributions; matching lower bounds demonstrate the necessity of these constraints. Using these results, we establish the learnability of DNF and the agnostic learnability of decision trees under such distributions. Finally, we present an algorithm for learning difference-bounded tree BN distributions, extending our results to settings where the distribution is unknown.
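
For orientation, the classical uniform-distribution case that the paper generalizes can be checked by brute force. The sketch below computes the standard Boolean Fourier coefficients of a conjunction over {-1,+1}^n and its $L_1$ spectral norm, which equals 1 for a 0/1-valued AND. It does not implement the paper's generalized, Bayesian-network-based expansion; the function names and the encoding (-1 as True) are choices made for this example.

```python
from itertools import combinations, product

def fourier_coefficients(f, n):
    """Return {S: E_x[f(x) * prod_{i in S} x_i]} for x uniform on {-1,+1}^n."""
    points = list(product([-1, 1], repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            total = 0.0
            for x in points:
                chi = 1
                for i in subset:
                    chi *= x[i]          # parity character chi_S(x)
                total += f(x) * chi
            coeffs[subset] = total / len(points)
    return coeffs

def conjunction(x):
    """AND of all variables, with -1 encoding True; the output is 0 or 1."""
    return 1.0 if all(v == -1 for v in x) else 0.0

coeffs = fourier_coefficients(conjunction, 3)
l1_norm = sum(abs(c) for c in coeffs.values())
```

Each of the 8 coefficients has magnitude 1/8, so the norm is 1 regardless of the number of variables; bounded norms of this kind are what let heavy-coefficient search with membership queries work, and the abstract's contribution is showing a comparable bound beyond product distributions.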

2502.13713 2026-06-03 cs.IR cs.SD eess.AS

TALKPLAY: Multimodal Music Recommendation with Large Language Models

Seungheon Doh, Keunwoo Choi, Juhan Nam

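Affiliations: KAIST; talkpl.ai

AI summary: Proposes TALKPLAY, which recasts recommendation as a token generation problem and uses a large language model to process multimodal music features for end-to-end conversational recommendation, significantly outperforming unimodal approaches.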

English abstract

We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system effectively recommends music from diverse user queries while generating contextually relevant responses. While pretrained LLMs are primarily designed for text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY transforms conventional systems by: (1) unifying previous two-stage conversational recommendation systems (recommendation engines and dialogue managers) into a cohesive end-to-end system, (2) effectively utilizing long conversational context for recommendation while maintaining strong performance in extended multi-turn interactions, and (3) generating natural language responses for seamless user interaction. Our qualitative and quantitative evaluation demonstrates that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.

2505.07068 2026-06-03 stat.ML cs.LG math.DS

A Sparse Bayesian Learning Algorithm for Estimation of Interaction Kernels in Motsch-Tadmor Model

Jinchao Feng, Sui Tang

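Affiliations: Department of Mathematics, Great Bay University; Department of Mathematics, University of California, Santa Barbara

AI summary: For estimating asymmetric interaction kernels in the Motsch-Tadmor model, proposes an algorithm based on a variational framework and sparse Bayesian learning that identifies the kernel robustly and quantifies uncertainty.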

Comments 23 pages

English abstract

In this paper, we investigate the data-driven identification of asymmetric interaction kernels in the Motsch-Tadmor model based on observed trajectory data. The model under consideration is governed by a class of semilinear evolution equations, where the interaction kernel defines a normalized, state-dependent Laplacian operator that governs collective dynamics. To address the resulting nonlinear inverse problem, we propose a variational framework that reformulates kernel identification using the implicit form of the governing equations, reducing it to a subspace identification problem. We establish an identifiability result that characterizes conditions under which the interaction kernel can be uniquely recovered up to scale. To solve the inverse problem robustly, we develop a sparse Bayesian learning algorithm that incorporates informative priors for regularization, quantifies uncertainty, and enables principled model selection. Extensive numerical experiments on representative interacting particle systems demonstrate the accuracy, robustness, and interpretability of the proposed framework across a range of noise levels and data regimes.

2502.03139 2026-06-03 astro-ph.CO astro-ph.IM cs.LG

Fast Sampling of Cosmological Initial Conditions with Gaussian Neural Posterior Estimation

Oleg Savchenko, Guillermo Franco Abellán, Florian List, Noemi Anau Montel, Christoph Weniger

Affiliations: GRAPPA Institute, Institute for Theoretical Physics Amsterdam, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands; Department of Astrophysics, University of Vienna, Türkenschanzstraße 17, 1180 Vienna, Austria

AI summary: Proposes a simulation-based inference approach that models the posterior as Gaussian and estimates it with neural networks, enabling fast reconstruction of the Universe's initial density field from late-time observations, orders of magnitude faster than existing methods.

Comments 9 + 2 pages, 7 figures, 1 table. Comments welcome!

Journal ref Mon Not R Astron Soc (2026)

English abstract

Knowledge of the primordial matter density field from which the large-scale structure of the Universe emerged over cosmic time is of fundamental importance for cosmology. However, reconstructing these cosmological initial conditions from late-time observations is a notoriously difficult task, which requires advanced cosmological simulators and sophisticated statistical methods to explore a multi-million-dimensional parameter space. We show how simulation-based inference (SBI) can be used to tackle this problem and to obtain data-constrained realisations of the primordial dark matter density field in a simulation-efficient way with general non-differentiable simulators. Our method is applicable to full high-resolution dark matter $N$-body simulations and is based on modelling the posterior distribution of the constrained initial conditions to be Gaussian with a diagonal covariance matrix in Fourier space. As a result, we can generate thousands of posterior samples within seconds on a single GPU, orders of magnitude faster than existing methods, paving the way for sequential SBI for cosmological fields. Furthermore, we perform an analytical fit of the estimated dependence of the covariance on the wavenumber, effectively transforming any point-estimator of initial conditions into a fast sampler. We test the validity of our obtained samples by comparing them to the true values with summary statistics and performing a Bayesian consistency test.
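
The speed claim follows from the diagonal covariance: each Fourier mode can be drawn independently as mean plus standard deviation times unit Gaussian noise, with no matrix factorization. The sketch below shows that sampling step only, as a rough illustration under simplifying assumptions: modes are treated as real numbers (real fields have complex modes with Hermitian symmetry), the means and variances are passed in directly rather than estimated by a network, and `sample_posterior` is a hypothetical name.

```python
import random

def sample_posterior(mean_modes, var_modes, n_samples, seed=0):
    """Draw samples from a Gaussian with diagonal covariance, one entry per mode."""
    rng = random.Random(seed)
    return [
        [m + (v ** 0.5) * rng.gauss(0.0, 1.0) for m, v in zip(mean_modes, var_modes)]
        for _ in range(n_samples)
    ]

# With zero variance every sample collapses onto the posterior mean.
exact = sample_posterior([1.0, -2.0], [0.0, 0.0], n_samples=3)

# With non-zero variance the sample average approaches the mean.
draws = sample_posterior([1.0], [4.0], n_samples=2000)
average = sum(d[0] for d in draws) / len(draws)
```

The cost is linear in the number of modes per sample, which is what makes drawing thousands of samples cheap once the per-mode mean and variance are available.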

2501.02173 2026-06-03 cs.IR cs.LG

The Efficiency vs. Accuracy Trade-off: Optimizing RAG-Enhanced LLM Recommender Systems Using Multi-Head Early Exit

Huixue Zhou, Hengrui Gu, Xi Liu, Kaixiong Zhou, Mingfu Liang, Yongkang Xiao, Srinivas Govindan, Piyush Chawla, Jiyan Yang, Xiangfei Meng, Huayu Li, Buyun Zhang, Liang Luo, Wen-Yen Chen, Yiping Han, Bo Long, Rui Zhang, Tianlong Chen

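Affiliations: Meta Platforms; University of Minnesota; NCSU (North Carolina State University); UNC at Chapel Hill

AI summary: Proposes an optimization framework that combines Retrieval-Augmented Generation (RAG) with a multi-head early exit architecture, using Graph Convolutional Network (GCN) retrieval and dynamic inference termination to reduce computation time while maintaining or improving Click-Through Rate (CTR) prediction accuracy.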

English abstract

The deployment of Large Language Models (LLMs) in recommender systems for predicting Click-Through Rates (CTR) necessitates a delicate balance between computational efficiency and predictive accuracy. This paper presents an optimization framework that combines Retrieval-Augmented Generation (RAG) with an innovative multi-head early exit architecture to concurrently enhance both aspects. By integrating Graph Convolutional Networks (GCNs) as efficient retrieval mechanisms, we are able to significantly reduce data retrieval times while maintaining high model performance. The early exit strategy employed allows for dynamic termination of model inference, utilizing real-time predictive confidence assessments across multiple heads. This not only quickens the responsiveness of LLMs but also upholds or improves their accuracy, making it ideal for real-time application scenarios. Our experiments demonstrate how this architecture effectively decreases computation time without sacrificing the accuracy needed for reliable recommendation delivery, establishing a new standard for efficient, real-time LLM deployment in commercial systems.
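
The early exit strategy described above can be sketched as a loop that runs one layer at a time, asks the prediction head attached to that layer for a click probability, and stops as soon as the head is confident enough. This is an illustrative sketch, not the paper's architecture: the layers and heads are toy callables, the confidence measure max(p, 1 - p) and the threshold are assumptions, and the retrieval component is left out.

```python
import math

def early_exit_predict(x, layers, heads, threshold=0.9):
    """Return (click_probability, exit_depth) from the first sufficiently confident head."""
    if not layers or len(layers) != len(heads):
        raise ValueError("need one head per layer and at least one layer")
    hidden = x
    for depth, (layer, head) in enumerate(zip(layers, heads)):
        hidden = layer(hidden)                 # compute only as deep as needed
        prob = head(hidden)
        confidence = max(prob, 1.0 - prob)     # distance from an undecided 0.5
        if confidence >= threshold:
            return prob, depth
    return prob, depth                         # no head was confident: use the last one

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

# Toy model: each "layer" adds 1 to a scalar state, each head is a sigmoid.
layers = [lambda h: h + 1.0] * 3
heads = [sigmoid] * 3
prob, depth = early_exit_predict(0.0, layers, heads, threshold=0.85)
```

In the toy run the first head outputs about 0.73, which is below the threshold, and the second outputs about 0.88, so inference stops after two of the three layers. Lowering the threshold trades accuracy for speed, which is the trade-off named in the title.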

2412.17484 2026-06-03 cs.DC cs.AI

Power- and Fragmentation-aware Online Scheduling for GPU Datacenters

Francesco Lettich, Emanuele Carlini, Franco Maria Nardini, Raffaele Perego, Salvatore Trani

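Affiliations: Istituto di Scienza e Tecnologie dell’Informazione "Alessandro Faedo", Consiglio Nazionale delle Ricerche

AI summary: For online scheduling in GPU datacenters, proposes the PWR scheduling policy and combines it with Fragmentation Gradient Descent (FGD) to balance reduced power consumption against minimal GPU fragmentation.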

Comments This work has been submitted to the IEEE for possible publication

English abstract

The rise of Artificial Intelligence and Large Language Models is driving increased GPU usage in data centers for complex training and inference tasks, impacting operational costs, energy demands, and the environmental footprint of large-scale computing infrastructures. This work addresses the online scheduling problem in GPU datacenters, which involves scheduling tasks without knowledge of their future arrivals. We focus on two objectives: minimizing GPU fragmentation and reducing power consumption. GPU fragmentation occurs when partial GPU allocations hinder the efficient use of remaining resources, especially as the datacenter nears full capacity. A recent scheduling policy, Fragmentation Gradient Descent (FGD), leverages a fragmentation metric to address this issue. Reducing power consumption is also crucial due to the significant power demands of GPUs. To this end, we propose PWR, a novel scheduling policy to minimize power usage by selecting power-efficient GPU and CPU combinations. This involves a simplified model for measuring power consumption integrated into a Kubernetes score plugin. Through an extensive experimental evaluation in a simulated cluster, we show how PWR, when combined with FGD, achieves a balanced trade-off between reducing power consumption and minimizing GPU fragmentation.

2406.10407 2026-06-03 math.OC cs.LG cs.NA math.NA

Suboptimality bounds for trace-bounded SDPs enable a faster and scalable low-rank SDP solver SDPLR+

Yufan Huang, David F. Gleich

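Affiliations: Purdue University

AI summary: Uses suboptimality bounds for trace-bounded semidefinite programs to improve SDPLR, Burer and Monteiro's low-rank SDP solver. The resulting SDPLR+ adjusts the rank dynamically while tracking primal infeasibility and suboptimality, giving faster solves and better scalability.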

Comments 31 pages, 12 figures

English abstract

Semidefinite programs (SDPs) and their solvers are powerful tools with many applications in machine learning and data science. Designing scalable SDP solvers is challenging because, in the standard formulation, the positive semidefinite decision variable is an $n \times n$ dense matrix, even though the input is often an $n \times n$ sparse matrix. However, the solution may not require a full-rank matrix, as shown by Barvinok and Pataki. Two decades ago, Burer and Monteiro developed an SDP solver \texttt{SDPLR} that optimizes over a low-rank factorization instead of the full matrix. This greatly decreases the storage cost and works well for many problems. The original solver \texttt{SDPLR} tracks only the primal infeasibility of the solution, preventing early termination at moderate accuracy. We use a suboptimality bound for trace-bounded SDP problems that enables us to track the progress better and perform early termination. We then develop \texttt{SDPLR+}, which starts the optimization with an extremely low-rank factorization and dynamically updates the rank based on the primal infeasibility and suboptimality. This further speeds up the computation and saves storage. Numerical comparisons on Max Cut, Minimum Bisection, Cut Norm, and Lovász Theta problems with many recent memory-efficient scalable SDP solvers demonstrate the scalability of \texttt{SDPLR+} up to problems with million-by-million decision variables. It is often the fastest solver to reach a moderate accuracy of $10^{-2}$. Further experiments on $\mu$-conductance, matrix completion, and $k$-means clustering show the potential of \texttt{SDPLR+} on a broader range of data science applications.
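
For readers unfamiliar with the low-rank approach, the reformulation behind solvers of the Burer-Monteiro type can be written as follows. The notation ($C$, $\mathcal{A}$, $b$, $Y$, $r$) is generic and chosen for this note rather than taken from the paper, and the trace bound and the suboptimality bound used by SDPLR+ are not reproduced here.

```latex
% Standard-form SDP: the decision variable X is a dense n-by-n matrix.
\[
  \min_{X \in \mathbb{S}^{n}} \ \langle C, X \rangle
  \quad \text{s.t.} \quad \mathcal{A}(X) = b, \quad X \succeq 0 .
\]

% Low-rank factorization X = Y Y^T with Y of size n-by-r, r much smaller than n.
% Positive semidefiniteness holds by construction, and storage drops
% from O(n^2) for X to O(nr) for Y.
\[
  \min_{Y \in \mathbb{R}^{n \times r}} \ \langle C, Y Y^{\top} \rangle
  \quad \text{s.t.} \quad \mathcal{A}(Y Y^{\top}) = b .
\]
```

The second problem is nonconvex in $Y$, which is why the choice of the rank $r$ and a reliable stopping criterion matter in practice.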

2606.03982 2026-06-03 cs.CL

Language Models Compare Quantities Using Number-specific and Unit-specific Heuristics

Mutsumi Sasaki, Go kamoda, Ryosuke Takahashi, Kosuke Sato, Kentaro Inui, Keisuke Sakaguchi, Benjamin Heinzerling

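AI summary: Controlled experiments show that, when comparing quantities with units, language models do not perform an exact scale conversion but rely on heuristics over numerical differences and unit-scale differences, which produces systematic errors near the comparison boundary.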

English abstract

Quantities with measurement units, such as 110 cm and 1.2 m, require language models (LMs) to combine a numeral with a symbolic unit scale. Here, we study how LMs compare such quantities in controlled settings spanning several unit systems. We find that accuracy degrades near the comparison boundary, where small changes in value determine the correct answer. The resulting errors are systematic: linear surrogate models predict LM preferences from numerical-difference and unit-scale-difference cues, and causal interventions on subspaces aligned with these variables shift the model's output. The results suggest that LMs compare quantities through a bag of heuristics over numerals and units, rather than first converting both expressions to an exact shared-scale representation.
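
For contrast with the heuristic behaviour the abstract reports, the exact procedure (convert both expressions to a shared scale, then compare) is short to write down. The sketch below is a reference baseline for the task, not anything from the paper; the unit table, the "value unit" input format, and the function names are assumptions made for the example. Exact rational arithmetic avoids floating-point surprises right at the comparison boundary.

```python
from fractions import Fraction

# Conversion factors to metres; a small illustrative table, not an exhaustive one.
TO_METRES = {
    "mm": Fraction(1, 1000),
    "cm": Fraction(1, 100),
    "m": Fraction(1),
    "km": Fraction(1000),
}

def to_metres(quantity):
    """Parse a string such as '110 cm' into an exact length in metres."""
    value, unit = quantity.split()
    return Fraction(value) * TO_METRES[unit]

def larger(a, b):
    """Return the larger of two quantities, or None when they are exactly equal."""
    ma, mb = to_metres(a), to_metres(b)
    if ma == mb:
        return None
    return a if ma > mb else b

answer = larger("110 cm", "1.2 m")
```

In the abstract's own example, 110 cm converts to 1.1 m, so 1.2 m is the larger quantity even though 110 is the larger numeral; cases like this, where the numeral cue and the unit cue pull in opposite directions, are the ones the abstract says models handle with heuristics.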

2606.03976 2026-06-03 cs.CV cs.AI cs.LG q-bio.NC

Formalizing the Binding Problem

Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang, Ansh Soni, Konrad P. Kording

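AI summary: Formalizes the binding problem with an information-theoretic approach, proposes a probing method that measures binding information in model representations, and runs experiments on Vision Transformers, showing that binding is a key ingredient of strong visual recognition and reasoning.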

Comments Accepted to ICML 2026

英文摘要

Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also information about which features are part of the same object (e.g. the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information, i.e., for features. We may believe that there is not much binding information; after all, misattributing features to the wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.

2606.03965 2026-06-03 cs.CL cs.AI

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning

Agentic Chain-of-Thought Steering:实现高效且可控的LLM推理

Yu Xia, Zhouhang Xie, Xin Xu, Byungkyu Kang, Prarit Lamba, Xiang Gao, Julian McAuley

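AI summary Proposes Agentic Chain-of-Thought Steering (ACTS), which uses reinforcement learning to train a controller agent that adaptively selects reasoning strategies and steering phrases during inference, enabling budget-aware strategy control that saves substantial tokens while preserving reasoning quality and supports a controllable accuracy-efficiency trade-off.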

AI中文摘要

大型语言模型通过扩展的思维链推理提高了最终答案的准确性,但通常token使用效率低下且缺乏推理时的控制。现有的高效推理方法通过缩短、提前停止或压缩轨迹来控制思考长度,但隐式地决定了模型的思考方式。在本文中,我们提出了Agentic Chain-of-Thought Steering (ACTS),它将推理引导形式化为一个马尔可夫决策过程,其中控制器智能体在推理过程中自适应地引导冻结的推理器。在每一步,控制器观察推理轨迹和剩余思考预算,然后发出一个包含推理策略和引导短语的引导动作,以启动推理器的下一步。这使得在保持推理器生成连续性的同时,能够进行预算感知的策略控制以实现高效推理。我们从构建的合成引导轨迹中初始化控制器智能体,并进行多预算增强,然后通过带有预算条件奖励塑造的强化学习进一步优化。跨多个基准的实验表明,ACTS在显著节省token的同时达到了与全思考相当的性能,并在不同的推理器和任务上实现了可控的准确率-效率权衡。代码可在该https URL获取。

英文摘要

Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.

2606.03883 2026-06-03 cs.AI cs.LG

Reasoning Structure of Large Language Models

大型语言模型的推理结构

Frédéric Berdoz, Luca A. Lanzendörfer, Fabian Farestam, Roger Wattenhofer

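AI summary To address the problem that evaluations of large reasoning models can hide different reasoning structures, proposes a logic-puzzle benchmark and a method that converts unstructured traces into verifiable reasoning graphs, and defines a reasoning efficiency metric for quantitatively analyzing reasoning topology.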

Comments Accepted at ICML 2026 and presented at the ICLR 2026 workshop on LLM reasoning

AI中文摘要

大型推理模型(LRMs)通常使用最终答案准确率或token数量等指标进行评估。然而,这些指标上的相同分数可能隐藏着根本不同的推理结构。为了解决这一局限性,我们引入了一个可扩展的逻辑谜题LRM基准测试,以及一个将非结构化轨迹转化为包含声明和依赖关系的可验证推理图的流程。这将推理转化为一个结构化的、可测量的对象,其拓扑结构可以定量分析。在此基础上,我们定义了一个推理效率指标,用于量化模型逻辑流的集中程度。我们对开源推理模型的分析表明,结构度量能够区分token数量和准确率所混淆的行为,为诊断失败模式和比较推理如何随谜题难度扩展提供了实用工具。

英文摘要

Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model's logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.
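To make the idea of a "reasoning graph of claims and dependencies" concrete, the following Python sketch stores a toy graph and measures what share of the claims the final answer actually depends on. The graph, the function names, and the `used_fraction` measure are all hypothetical illustrations; the abstract does not define the paper's efficiency metric, and this is not it.

```python
# Hypothetical reasoning graph: each claim maps to the claims it depends on.
deps = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c1"],      # explored, but never used by the answer
    "c4": ["c2"],
    "answer": ["c4"],
}

def supporting_claims(graph, target):
    """Collect every claim the target transitively depends on."""
    seen = set()
    stack = list(graph[target])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen

def used_fraction(graph, target):
    """Share of non-target claims that feed the target (illustrative measure only)."""
    others = [n for n in graph if n != target]
    return len(supporting_claims(graph, target)) / len(others)

print(used_fraction(deps, "answer"))  # 3 of 4 claims are used: 0.75
```

Two traces with the same token count and the same final answer could yield different values here, which is the kind of distinction the abstract says structural measurements can expose.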

2606.03804 2026-06-03 cs.LG

Easy-to-Use Shielding for Reinforcement Learning

易于使用的强化学习屏蔽技术

Stefan Pranger, Bettina Könighofer

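AI summary Proposes the tempestpy library, which integrates formal shield synthesis into the Gymnasium API, lowering the barrier to safe exploration in reinforcement learning, and extends the shielding algorithms to stochastic multiplayer games.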

AI中文摘要

安全探索是强化学习中的一个关键挑战,旨在防止智能体在探索环境时做出有害决策。屏蔽是一种利用环境模型形式的领域知识来决定动作安全性的技术。尽管已经成熟,但由于缺乏将形式化屏蔽合成与标准强化学习框架连接起来的可访问端到端基础设施,屏蔽在强化学习中的应用有限。应用屏蔽通常需要形式化方法的专业知识和大量的工程工作,使其脱离典型的强化学习工作流程。我们通过将屏蔽合成工具Tempest扩展为安全强化学习的实用后端来解决这一问题。我们的核心贡献是tempestpy,一个Python库,它将基于Tempest的屏蔽合成直接集成到Gymnasium API中,使得屏蔽可以在现有的强化学习管道中合成和部署。这降低了屏蔽的入门门槛,将形式化安全探索方法转化为强化学习实践者可用的组件。我们还扩展了Tempest的算法支持,以计算随机多人博弈的可靠屏蔽,保留了形式化安全保证。我们端到端地展示了最终的工作流程,并在多个环境中评估了有屏蔽和无屏蔽的强化学习。为了便于建模,我们为MiniGrid提供了符号模型,并引入了MiniGridSafe,这是一个游乐场环境集合,旨在使屏蔽易于访问且实验透明。MiniGridSafe通过具有概率转换和额外智能体的安全导向场景扩展了MiniGrid,使得在简单直观的设置中研究具有挑战性的安全方面成为可能。

英文摘要

Safe exploration is a key challenge in Reinforcement Learning (RL) that aims to prevent agents from making harmful decisions while exploring their environment. Shielding is one such technique that assumes domain knowledge in the form of an environment model to decide upon action safety. Although well-established, shielding has seen limited adoption in RL due to the lack of accessible end-to-end infrastructure connecting formal shield synthesis with standard RL frameworks. Applying shielding typically requires expertise in formal methods and substantial engineering effort, keeping it outside the typical RL workflow. We address this by extending our shield synthesis tool Tempest into a practical backend for safe RL. Our core contribution is tempestpy, a Python library that integrates Tempest-based shield synthesis directly into the Gymnasium API, allowing shields to be synthesized and deployed within existing RL pipelines. This lowers the barrier to entry for shielding and turns formal safe-exploration methods into a usable component for RL practitioners. We also extend Tempest's algorithmic support to compute sound shields for stochastic multiplayer games, preserving formal safety guarantees. We demonstrate the resulting workflow end to end and evaluate shielded and unshielded RL across multiple environments. To facilitate modeling, we provide symbolic models for MiniGrid and introduce MiniGridSafe, a collection of playground environments designed to make shielding easily accessible and experimentally transparent. MiniGridSafe extends MiniGrid with safety-oriented scenarios featuring probabilistic transitions and additional agents, enabling the study of challenging safety aspects in a simple and intuitive setting.
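The general idea of shielding, that is, consulting a model to decide which actions are safe before the agent acts, can be sketched in a few lines of plain Python. This is not the tempestpy API, which the abstract does not describe: the toy `LineWorld` environment, the hand-written `safe_actions` rule, and the `ShieldedEnv` wrapper are all hypothetical stand-ins, and the toy transitions are deterministic, whereas the paper also covers probabilistic ones.

```python
import random

class LineWorld:
    """Toy environment: positions 0..4, where position 0 is unsafe (hypothetical)."""
    def reset(self):
        self.pos = 2
        return self.pos

    def step(self, action):  # action is -1 (left) or +1 (right)
        self.pos = max(0, min(4, self.pos + action))
        return self.pos

def safe_actions(pos):
    """Hand-written stand-in for a synthesized shield: forbid moves into position 0."""
    return [a for a in (-1, 1) if max(0, min(4, pos + a)) != 0]

class ShieldedEnv:
    """Wrap an environment and replace any unsafe action with an allowed one."""
    def __init__(self, env, shield):
        self.env, self.shield = env, shield

    def reset(self):
        self.obs = self.env.reset()
        return self.obs

    def step(self, action):
        allowed = self.shield(self.obs)
        if action not in allowed:
            action = allowed[0]  # override the agent's unsafe choice
        self.obs = self.env.step(action)
        return self.obs

env = ShieldedEnv(LineWorld(), safe_actions)
env.reset()
# A random agent explores freely, yet the shield keeps it out of position 0.
visited = [env.step(random.choice((-1, 1))) for _ in range(50)]
```

The wrapper pattern is what lets a shield sit between an unmodified learning algorithm and an unmodified environment; in the paper's setting the safe-action set would come from formal synthesis on an environment model rather than from a hand-written rule.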

2606.03798 2026-06-03 cs.RO

Optimal Design and Analytical Modeling of a Soft Fin-Ray Effect Gripper Finger Using the Finite Rigid Elements Method

基于有限刚性单元法的软体鳍射线效应夹爪手指的优化设计与解析建模

Sara Adeli, Hassan Sayyaadi

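AI summary Proposes using the Finite Rigid Elements Method (FREM) to model and optimize a soft Fin Ray Effect (FRE) gripper finger, enabling precise force control for gently grasping delicate agricultural products.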

AI中文摘要

受鳍射线启发的软体夹爪为轻柔处理易损、不规则物体(尤其在农业中)提供了有前景的解决方案。本研究旨在设计、制造和建模一种鳍射线效应(FRE)软体夹爪手指,以实现未来应用中的精确力控制。该设计旨在轻柔抓取需要适应性和精确力施加的易损农产品,如番茄。为解决软体机器人固有的挑战,包括非线性行为、无限自由度和可变材料属性,采用有限刚性单元法(FREM)进行建模。该方法在保持解析精度的同时,为后续阶段力控制器的开发提供了可靠基础。使用ANSYS创建了详细的有限元模型(FEM),并通过仿真和实验测试验证了解析结果。基于四个关键标准优化了夹爪手指:尖端位移、总变形、应力分布和接触力。最优手指配置包括长度30毫米、肋间距10毫米、七根肋条角度-15度、肋条厚度1毫米。使用FREM的理论建模预测手指变形误差为3%,而ANSYS数值模型误差为2%。

英文摘要

Fin Ray-inspired soft grippers offer a promising solution for gently handling delicate, irregular objects, especially in agriculture. The objective of this research is to design, fabricate, and model a Fin Ray Effect (FRE) soft gripper finger to enable precise force control in future applications. This design aims to gently grasp delicate agricultural products, such as tomatoes, that require both adaptability and accurate force application. To address the inherent challenges of soft robotics, including nonlinear behavior, infinite degrees of freedom, and variable material properties, the Finite Rigid Elements Method (FREM) was employed for modeling. This method preserves analytical accuracy while providing a reliable foundation for the development of a force controller in later stages. A detailed Finite Element Model (FEM) was created using ANSYS, and the analytical results were validated through simulation and experimental testing. The gripper's fingers were optimized based on four key criteria: tip displacement, total deflection, stress distribution, and contact force. The optimal finger configuration includes a length of 30 mm, rib spacing of 10 mm, seven ribs angled at -15 deg, and a rib thickness of 1 mm. Theoretical modeling using the FREM predicted finger deformation with a 3% error, while the ANSYS numerical model achieved 2% error.

2606.03777 2026-06-03 cs.AI cs.CR q-fin.RM

From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework

从控制边界到保险索赔:通过CER框架重构AI中介损失

Alex Leung, Rex Zhang, Kentaroh Toyoda, SiewMei Loh

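AI summary This paper proposes the CER framework (control boundary, evidence reconstruction, insurance response) for diagnosing and reconstructing losses caused by generative or agentic AI systems in support of insurance claims.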

AI中文摘要

通过受保组织的生成式或代理式AI系统产生的AI损失需要状态重构,而不仅仅是事件重构,因为相关状态会随着系统推理、检索、调用工具和行动而改变。相关的问题不仅是发生了什么损失,还包括系统被允许做什么、实际做了什么,以及重构的损失能否支持保险索赔。本文处理受保人的AI系统处于因果链中的损失,包括外部触发的故障,如提示注入、检索增强生成(RAG)投毒、恶意工具输出、凭证滥用和数据投毒。具体而言,本文介绍了CER,一种用于AI残余风险转移的用例级诊断。C(控制边界)询问系统是否具有可执行的操作范围。E(证据重构)询问是否可以从保留的工件中重构系统状态和因果链。R(保险响应)询问重构的损失是否被保险:保险覆盖是否在市场上可用并为受保人投保,以及支持保险索赔所需的证据。本文做出三项贡献:定义了AI特定的重构问题,通过CER操作化该问题,并指定了AI重构的索赔级证据。公开示例包括报道的PocketOS和Replit代理数据库删除事件,以及作为已裁决的输出/依赖案例的Moffatt诉加拿大航空案。关键词:AI系统;CER框架;残余风险转移;代理式AI;生成式AI;AI保险;证据重构。

英文摘要

AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because the relevant state changes as the system reasons, retrieves, calls tools, and acts. The relevant question is not only what loss occurred, but what the system was allowed to do, what it actually did, and whether that reconstructed loss can support insurance claim recovery. This paper addresses losses in which the insured's AI system is in the causal chain, including externally triggered failures such as prompt injection, retrieval-augmented generation (RAG) poisoning, malicious tool output, credential misuse, and data poisoning. Specifically, this paper introduces CER, a use-case-level diagnostic for AI residual risk transfer. C (control boundary) asks whether the system had an enforceable operating envelope. E (evidence reconstruction) asks whether the system state and causal chain can be reconstructed from retained artifacts. R (insurance response) asks whether the reconstructed loss is insured: whether insurance coverage is available in the market and placed for the insured, together with the proof needed to support insurance claim recovery. The paper makes three contributions: it defines the AI-specific reconstruction problem, operationalizes that problem through CER, and specifies claim-grade evidence for AI reconstruction. Public examples include the reported PocketOS and Replit agentic database-deletion incidents and Moffatt v. Air Canada as an adjudicated output/reliance case. Keywords: AI systems; CER framework; residual risk transfer; agentic AI; generative AI; AI insurance; evidence reconstruction.

2606.03719 2026-06-03 cs.AI

Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

通过推导图揭示Do-演算推理的结构

Clément Yvernes, Emilie Devijver, Marianne Clausel, Eric Gaussier

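AI summary This paper introduces derivation graphs to represent how do-calculus rules are applied and combined, characterizes the full space of observational and interventional probabilities that are equivalent under the do-calculus, shows that equivalent transformations need at most four rule applications, and uses equivalent causal queries to obtain more efficient estimators.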

Comments Accepted at ICML 2026

AI中文摘要

Do-演算定义了干预查询的一般推理系统,允许通过连续应用其规则来转换因果量。这个过程产生了丰富的等价干预表达式空间,但组合和排序这些规则仍然具有挑战性。在这项工作中,我们引入了推导图,它表示Do-演算规则如何应用和组合,并刻画了在Do-演算下等价的观测和干预概率的完整空间。这些图的结构产生了一个简单的过程,最多使用四次Do-演算规则的应用。最后,我们展示了如何将识别算法应用于等价的因果查询,为相同的因果量产生多个有效的估计量,最终得到更有效的估计量。

英文摘要

The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This process induces a rich space of equivalent interventional expressions, but combining and ordering these rules remains challenging. In this work, we introduce derivation graphs, which represent how do-calculus rules are applied and combined, and characterize the full space of observational and interventional probabilities which are equivalent under the do-calculus. The structure of these graphs yields a simple procedure that uses at most four applications of do-calculus rules. Finally, we show how applying identification algorithms to equivalent causal queries produces multiple valid estimands for the same causal quantity, eventually yielding more efficient estimators.
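For readers who want the rules in front of them, the three do-calculus rules are given below in their standard textbook form (Pearl), not in the paper's own notation. Here $G_{\overline{X}}$ is the causal graph with all edges into $X$ removed, $G_{\underline{Z}}$ is the graph with all edges out of $Z$ removed, and $Z(W)$ is the set of $Z$-nodes that are not ancestors of any $W$-node in $G_{\overline{X}}$.

```latex
\begin{align*}
\text{Rule 1 (insert/delete observation):}\quad
  & P(y \mid do(x), z, w) = P(y \mid do(x), w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}} \\
\text{Rule 2 (exchange action/observation):}\quad
  & P(y \mid do(x), do(z), w) = P(y \mid do(x), z, w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\underline{Z}} \\
\text{Rule 3 (insert/delete action):}\quad
  & P(y \mid do(x), do(z), w) = P(y \mid do(x), w)
  && \text{if } (Y \perp\!\!\!\perp Z \mid X, W) \text{ in } G_{\overline{X}\,\overline{Z(W)}}
\end{align*}
```

Each rule rewrites one conditional expression into an equivalent one, so a derivation is a sequence of such rewrites; the abstract's claim concerns how short that sequence can be made.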

2606.03686 2026-06-03 cs.AI

The DeepSpeak-Agentic Dataset

DeepSpeak-Agentic 数据集

Sarah Barrington, Maty Bohacek, Hany Farid

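AI summary This paper presents DeepSpeak-Agentic, a dataset of over 37 hours of semi-structured human-agent conversation videos, for evaluating the automatic forensic identification of AI agents, studying the nature of human-agent interactions, and benchmarking large language models and AI-generated voice and face technologies.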

AI中文摘要

我们提出了DeepSpeak-Agentic,一个包含超过37小时半结构化对话视频的数据集,对话发生在人类与具身AI代理之间。我们利用该数据集评估AI代理的自动取证识别(音频、视频或文本),研究人机交互的本质,并为驱动具身AI代理的大型语言模型和AI生成语音及面部技术的未来进展提供基准。我们还贡献了一个可扩展的数据采集系统,该系统创建代理,自动将其与人类众包工作者配对,记录指定场景下的视听对话,并在混合流中识别和分离人类与代理。

英文摘要

We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human-agent interactions, and provide a benchmark for future advances in the large language models and AI-generated voices and faces that power embodied AI agents. We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.

2606.03682 2026-06-03 cs.RO

GN0: Toward a Unified Paradigm for Generation, Evaluation, and Policy Learning in Visual-Language Navigation

GN0:迈向视觉语言导航中生成、评估与策略学习的统一范式

Xinhai Li, Xiaotao Zhang, Yuehao Huang, Jiankun Dong, Tianhang Wang, Sunyao Zhou, Yunzi Wu, Chengnuo Sun, Yunfei Ge, Qizhen Weng, Chi Zhang, Chenjia Bai, Xuelong Li

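AI summary Proposes the unified GN0 framework, which combines the automatically generated large-scale navigation dataset GN-Matrix, a high-fidelity 3DGS-based simulation platform, and the BEV benchmark GN-Bench with the RL-driven navigation foundation model BAE to outperform existing methods on VLN tasks.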

AI中文摘要

具身导航将智能体与物理世界连接起来,是通用机器人智能的基础。导航数据的有限可用性和质量限制了视觉语言导航(VLN)系统的泛化和长时程能力。为解决这一问题,我们整理了多样化的3D场景,并开发了大规模导航数据的自动化流水线,生成了GN-Matrix数据集。基于3D高斯泼溅(3DGS)引擎,我们引入了一个支持交互式漫游和碰撞感知导航的高保真仿真平台。我们进一步提出了GN-Bench,这是首个基于BEV的基准测试,包含用于人机交互评估的动态3DGS化身。为了利用仿真器,我们开发了一个RL驱动的导航基础模型——Break and Establish(BAE)。在监督学习之后,DAgger将模型暴露于滚动生成的状态,打破了狭窄的专家中心分布,并实现了下游RL探索。这一统一的VLN范式整合了基于地图和无地图的任务,包括指令跟随、人类跟随和目标导航。GN-BAE将高保真3DGS渲染的鸟瞰图表示形式化为紧凑记忆,解锁了VLM中的潜在空间推理。在GN-Bench和VLN-CE上的广泛评估表明,GN0优于最先进的VLN方法。总体而言,GN-Matrix提供了一个涵盖数据、仿真和学习的统一框架,推动了研究和工业应用中的具身导航。

英文摘要

Embodied navigation connects intelligent agents with the physical world and is fundamental for general robotic intelligence. Limited availability and quality of navigation data have constrained Vision-and-Language Navigation (VLN) systems' generalization and long-horizon capabilities. To address this, we curate diverse 3D scenes and develop an automated pipeline for large-scale navigation data, resulting in the GN-Matrix dataset. Building on a 3D Gaussian Splatting (3DGS) engine, we introduce a high-fidelity simulation platform supporting interactive roaming and collision-aware navigation. We further propose GN-Bench, the first BEV-based benchmark incorporating dynamic 3DGS avatars for human-robot interaction evaluation. To leverage the simulator, we develop an RL-driven navigation foundation model, Break and Establish (BAE). After supervised learning, DAgger exposes the model to rollout-induced states, breaking narrow expert-centric distributions and enabling downstream RL exploration. This unified VLN paradigm integrates map-based and map-free tasks, including instruction following, human following, and goal navigation. GN-BAE formalizes high-fidelity 3DGS-rendered Bird's Eye View representations as compact memory, unlocking latent spatial reasoning in VLMs. Extensive evaluations on GN-Bench and VLN-CE show that GN0 outperforms state-of-the-art VLN methods. Overall, GN-Matrix offers a unified framework spanning data, simulation, and learning, advancing embodied navigation in research and industrial applications.

2606.03678 2026-06-03 cs.AI

EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

EvoDrive: 通过自我改进的LLM智能体实现安全关键自动驾驶的帕累托进化

Tong Nie, Yuewen Mei, Yihong Tang, Junlin He, Jie Deng, Jian Sun, Wei Ma

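AI summary Proposes EvoDrive, the first automated LLM-based agentic evolution framework, which uses a simulator-grounded actor-critic architecture and a Pareto archive to achieve multi-objective optimization of adversariality and realism in safety-critical scenario generation.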

AI中文摘要

生成安全关键场景对于验证和改进自动驾驶系统至关重要,但它本质上需要在最大化对抗性以暴露故障的同时保持真实性。现有方法通常通过手工设计的启发式方法来管理这种权衡,将生成限制在已知的先验知识中,忽视了未充分探索的模式。虽然最近开放式的智能体进化可以突破这一限制,但不受约束的通用智能体缺乏严格的模拟器接地,往往将多目标张力退化为单标量最大化。本文提出了EvoDrive,第一个基于LLM的自动化智能体进化框架,用于多目标场景生成。EvoDrive采用模拟器接地的演员-评论家架构,其中记忆驱动的演员迭代地提出对生成器的改进,评论家过滤掉不可信的候选者,而自我进化的世界评估器将有前途的候选者路由以优化模拟预算。EvoDrive进一步维护一个评估候选者的帕累托存档,以保留多样化的攻击-真实性权衡,并通过模拟反馈指导未来的进化。在MetaDrive和CARLA上的基准测试结果表明,EvoDrive不仅显著扩展了各种生成器的帕累托前沿,而且为策略训练生成了有价值的场景。

英文摘要

Generating safety-critical scenarios is essential for validating and improving autonomous driving systems, yet it inherently requires maximizing adversariality to expose failures while preserving realism. Existing methods usually manage this trade-off with handcrafted heuristics, confining generation to known priors and overlooking underexplored patterns. While recent open-ended agentic evolution can push this limit, unconstrained general agents lack strict simulator grounding and tend to collapse the multi-objective tension into single-scalar maximization. Here we present EvoDrive, the first automated, LLM-based agentic evolution framework for multi-objective scenario generation. EvoDrive employs a simulator-grounded actor-critic architecture where a memory-driven actor iteratively proposes improvements to the generators and critics filter out implausible candidates, and a self-evolving world evaluator routes promising proposals to optimize simulation budgets. EvoDrive further maintains a Pareto archive of evaluated candidates to preserve diverse attack-realism trade-offs and guide future evolution via simulation feedback. Benchmark results on MetaDrive and CARLA show that EvoDrive not only significantly expands the Pareto frontier across various generators, but also produces valuable scenarios for policy training.
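A Pareto archive of the kind the abstract mentions keeps every candidate that is not beaten on all objectives by another candidate. The Python sketch below shows the generic bookkeeping for two objectives that are both maximized; the score pairs and the function names `dominates` and `insert` are hypothetical, and EvoDrive's actual archive may differ.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def insert(archive, candidate):
    """Return the archive updated with candidate, keeping only non-dominated points."""
    if any(dominates(kept, candidate) or kept == candidate for kept in archive):
        return archive  # candidate adds nothing new
    survivors = [kept for kept in archive if not dominates(candidate, kept)]
    return survivors + [candidate]

# Hypothetical (adversariality, realism) scores, both to be maximized.
archive = []
for scores in [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.2, 0.9), (0.6, 0.5)]:
    archive = insert(archive, scores)

print(archive)  # [(0.9, 0.2), (0.2, 0.9), (0.6, 0.5)]
```

Keeping the whole non-dominated set, rather than a single best weighted score, is what preserves diverse attack-realism trade-offs instead of collapsing them into one scalar.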

2606.03566 2026-06-03 cs.CV cs.AI

Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis

基于高效Transformer的局部块采样用于多发性硬化脉络丛分割

Po-Jui Lu, Alessandro Cagol, Mario Ocampo-Pineda, Federico Spagnolo, Marina Mastantuono, Andreea-Alexandra Aldea, Jannis Müller, Özgür Yaldizli, Matthias Weigel, Lester Melie-Garcia, Roberta Magliozzi, Maria Pia Sormani, Ludwig Kappos, Jens Kuhle, Cristina Granziera

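AI summary Proposes a method based on SwinUNETR and localized patch sampling that automatically segments the lateral ventricle choroid plexus in multiple sclerosis, achieving a higher Dice coefficient than existing models while cutting computation by 99%.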

AI中文摘要

背景:侧脑室脉络丛(LVCP)正逐渐被认为是与多发性硬化(MS)身体残疾和神经炎症相关的关键影像生物标志物。然而,LVCP的手动分割非常繁琐,限制了其在广泛临床试验和纵向评估中的应用。本研究旨在开发一种基于SwinUNETR的流程,利用靶向的脑室内和脑室周围小块采样,从独立和多模态MRI输入中自动分割MS中的LVCP。方法:我们回顾性评估了来自两个独立MS主导队列的三组数据的3T MRI扫描(数据集1:n=177;数据集2:n=177;扩展测试集:n=388)。我们的方法采用在32x32x32体素块上训练的SwinUNETR架构,并与3D UXNET模型进行基准比较。主要评估指标是Dice相似系数(DSC),辅以计算需求(GFLOPs)和95百分位豪斯多夫距离(HD95)。结果:在扩展测试集上,SwinUNETR模型在结合MPRAGE和FLAIR时获得了平均DSC为0.868(95% CI: 0.863-0.872),显著优于UXNET(DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001)。当仅限于独立FLAIR输入时,基于Transformer的方法保持了0.863的高DSC,而UXNET的空间定位显著恶化(HD95: 1.86 vs. 3.00 mm)。重要的是,所提出的框架将计算负载降低了99%(91.8 vs. 22,080 GFLOPs)。通过将局部块采样与SwinUNETR架构相结合,该方法为LVCP分割提供了一种准确、稳健且统计上优于当前领先模型的替代方案。其巨大的计算成本降低使其非常适合在临床和研究环境中广泛实施。

英文摘要

Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) related to physical disability and neuroinflammation. Yet, manual segmentation of the LVCP is highly tedious, restricting its use in broad clinical trials and longitudinal assessments. This research aims to develop a SwinUNETR-driven pipeline that leverages targeted intra- and peri-ventricular small patch sampling to automatically segment the LVCP in MS from both standalone and multi-modal MRI inputs. Methods: We retrospectively assessed 3T MRI scans across three sets of data stemming from two separate MS-dominant cohorts (Dataset 1: n=177; Dataset 2: n=177; expanded test set: n=388). Our method employed a SwinUNETR architecture trained on 32x32x32 voxel patches, benchmarking it against the 3D UXNET model. The primary metric for evaluation was the Dice Similarity Coefficient (DSC), supplemented by computational demand (GFLOPs) and the 95th percentile Hausdorff Distance (HD95). Results: On the extended test set, the SwinUNETR model secured a mean DSC of 0.868 (95% CI: 0.863-0.872) with MPRAGE and FLAIR combined, showing a statistically significant gain over UXNET (DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001). When restricted to standalone FLAIR inputs, the transformer-based approach sustained a high DSC of 0.863, while the spatial localization of UXNET worsened considerably (HD95: 1.86 vs. 3.00 mm). Importantly, the proposed framework lowered computational load by 99% (91.8 vs. 22,080 GFLOPs). By integrating localized patch sampling with a SwinUNETR architecture, this methodology offers an accurate, robust, and statistically superior alternative to current leading models for LVCP segmentation. Its vast reduction in computational cost makes it ideal for widespread implementation in clinical and research environments.
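For readers unfamiliar with the headline metric, the Dice Similarity Coefficient is twice the overlap between the predicted and reference masks divided by their combined size. The Python sketch below computes it on toy voxel sets (hypothetical data, not from the study) and then checks the quoted compute reduction from the two GFLOPs figures in the abstract.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two sets of foreground voxel indices."""
    if not pred and not truth:
        return 1.0  # two empty masks agree perfectly
    return 2 * len(pred & truth) / (len(pred) + len(truth))

# Hypothetical toy masks given as sets of (x, y, z) voxel coordinates.
truth = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1)}
pred = {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1)}
print(dice(pred, truth))  # 2 * 3 / (4 + 4) = 0.75

# Relative reduction in compute, using the GFLOPs figures quoted in the abstract.
reduction = 1 - 91.8 / 22080
print(f"{reduction:.1%}")  # 99.6%
```

The ratio works out to about 99.6%, consistent with the "99%" stated in the abstract.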

2606.03540 2026-06-03 cs.CV

Attend to Anything: Foundation Model for Unified Human Attention Modeling

关注一切:统一人类注意力建模的基础模型

Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao

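AI summary Proposes the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across image, video, and audio-visual tasks through hierarchical language prompts and hyperbolic-space embeddings, with an average gain of 6% on 16 benchmarks and roughly a 4x speedup in video inference.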

Comments Accepted to ICML 2026

AI中文摘要

现有人类注意力(显著性)建模方法在模态、场景和任务公式上高度碎片化。因此,即使模型容量和数据规模增加,当前模型仍主要依赖于场景且针对特定任务,无法在实际应用中泛化。为解决这些根本限制,我们提出了关注一切模型(AAM),一种多模态基础模型,统一了各种图像、视频和视听任务及场景中的注意力建模。AAM将注意力重新表述为一种认知蕴含关系,按通用到特定的层次组织,通过双曲空间中的层次嵌入语言提示实现。此外,为统一静态图像和动态视频注意力,我们采用流体动力学视角,将视频帧注意力建模为由Fokker-Planck方程控制的扩散时间演化。在16个基准上的大量实验表明,AAM在各种场景下平均比最先进方法高出6%,同时视频推理速度提升约4倍。总体而言,这些结果表明AAM为未来注意力和显著性相关任务的研究提供了原则性基础。数据集和代码将在此https URL提供。

英文摘要

Existing human attention (saliency) modeling methods persist as highly fragmented across modalities, scenes, and task formulations. Consequently, even with increasing model capacity and data scale, current models predominantly remain scene-dependent and task-specific, failing to practically generalize in real-world applications. To address the fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker-Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at https://github.com/wz-zhao/Attend-to-Anything.
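For orientation, the Fokker-Planck equation in its generic textbook form, with a drift field $\boldsymbol{\mu}$ and a constant scalar diffusion coefficient $D$, is shown below. Reading $p(\mathbf{x}, t)$ as an attention density over image location $\mathbf{x}$ at frame time $t$ is an assumption made here for illustration; the abstract does not specify the drift or diffusion terms AAM uses.

```latex
\[
\frac{\partial p(\mathbf{x}, t)}{\partial t}
  = -\nabla \cdot \bigl[\boldsymbol{\mu}(\mathbf{x}, t)\, p(\mathbf{x}, t)\bigr]
  + D\, \nabla^{2} p(\mathbf{x}, t)
\]
```

The first term transports density along the drift field and the second spreads it out, so with zero drift the equation reduces to the ordinary diffusion (heat) equation.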

2606.03509 2026-06-03 cs.CV

EvoMemNav: Efficient Self-Evolving Fine-Grained Memory for Zero-Shot Embodied Navigation

EvoMemNav: 用于零样本具身导航的高效自进化细粒度记忆

Zuhao Ge, Xiaosong Jia, Chao Wu, Yuchen Zhou, Zuxuan Wu, Yu-Gang Jiang

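AI summary Proposes the EvoMemNav framework, which builds a Visual-Semantic Memory Graph, applies a budgeted coarse-to-fine policy, and adds reflection-driven write-back to provide efficient, self-evolving, fine-grained memory for zero-shot embodied navigation, improving multi-instance disambiguation and Stop verification.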

Comments Preprint

AI中文摘要

构建记忆对于零样本具身导航中的长时程规划至关重要。以检测器为中心的场景图通常将观测压缩为稀疏节点,丢弃细粒度视觉证据并积累噪声,而基于3D重建的方法计算成本高昂。我们提出EvoMemNav,一种用于零样本具身导航的高效、自进化、细粒度记忆框架。EvoMemNav构建视觉-语义记忆图(VSMGraph),将原始视图作为一等记忆,并通过轻量级语义线索和拓扑关系将其组织成房间-视图-对象层次结构,保留用于消歧和停止验证的细粒度细节。为了扩展到不断增长的记忆,我们引入预算驱动的粗到细策略:粗阶段将搜索空间压缩到有希望的区域,细阶段仅调用VLM进行目标验证和决策。除了静态记忆,EvoMemNav在每个子任务后执行反射驱动的写回,更新附加到图上的先验知识,编码累积的环境知识以优化未来决策而无需重新训练。在GOAT-Bench和HM3D上,针对物体、文本描述和图像目标模态的实验显示,SR/SPL持续提升,具有更好的多实例区分能力、更少的过早停止和更强的零样本泛化能力。

英文摘要

Building memory is essential for long-horizon planning in zero-shot embodied navigation. Detector-centric scene graphs often compress observations into sparse nodes, discarding fine-grained visual evidence and accumulating noise, while 3D reconstruction-based methods remain computationally prohibitive. We present EvoMemNav, an efficient, self-evolving, fine-grained memory framework for zero-shot embodied navigation. EvoMemNav constructs a Visual-Semantic Memory Graph (VSMGraph) that keeps raw views as first-class memory and organizes them with lightweight semantic cues and topological relations into a room-view-object hierarchy, preserving fine-grained details for disambiguation and Stop verification. To scale to growing memory, we introduce a budgeted coarse-to-fine policy: a coarse stage compresses the search space into promising regions, and a fine stage invokes a VLM only for targeted verification and decision. Beyond static memories, EvoMemNav performs reflection-driven write-back after each subtask, updating graph-attached priors that encode accumulated environmental knowledge to refine future decisions without retraining. Experiments on GOAT-Bench and HM3D across object, text-description, and image-goal modalities show consistent gains in SR/SPL, with better multi-instance disambiguation, fewer premature stops, and stronger zero-shot generalization.

2606.03503 2026-06-03 cs.AI

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

ThoughtFold: 通过内省偏好学习折叠推理链

Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen

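AI summary Proposes the ThoughtFold framework, which uses fine-grained preference learning to penalize redundant exploration and encourage directly bridging essential reasoning segments, folding reasoning chains into more concise paths and sharply reducing token usage while maintaining accuracy.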

AI中文摘要

大型推理模型(LRMs)由于在思维链(CoTs)上使用可验证奖励的强化学习(RLVR)取得了显著进展。然而,由于长CoT自然包含试错,且主流RLVR方法选择结果正确的CoT轨迹进行记忆,长CoT中的冗余探索不可避免地得到强化,导致LRMs的过度思考问题。先前解决此问题的尝试主要给较短轨迹更多优势,但其学习信号仍基于结果,无法减少长CoT中冗余探索的记忆。因此,我们提出ThoughtFold,一个利用细粒度偏好学习来缓解冗余探索以实现高效推理的框架。ThoughtFold采用内省策略识别每个正确轨迹中的冗余,从而产生一系列候选子轨迹。利用这一谱系,我们引入一个掩码偏好优化目标,明确惩罚冗余探索并鼓励模型直接桥接关键推理段,有效地将其推理链折叠为更简洁的路径。大量实验表明,ThoughtFold显著提高了效率。它将DeepSeek-R1-Distill-Qwen-7B的token使用量减少约56%,同时保持最先进的准确性。

英文摘要

Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chains of Thought (CoTs). However, since long CoTs naturally contain trial and error and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the overthinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

2606.03471 2026-06-03 cs.AI cs.MA q-bio.NC

A formal definition and meta-model for a machine theory of mind

机器心智理论的正式定义与元模型

Fabio Cuzzolin

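AI summary Drawing on evidence from cognitive psychology, neuroscience, and artificial intelligence, this paper proposes the first rigorous formal definition of Machine Theory of Mind and builds a holistic meta-model, using both to examine existing research and motivate future breakthroughs.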

Comments 48 pages, 2 figures

AI中文摘要

本文首次提出了机器心智理论概念的严格形式化定义,该定义基于认知心理学、神经科学和人工智能证据支持的原则,并以此作为视角审视该领域的最新进展和当前努力,推动进一步研究以“破解”该问题的潜在议程。本文还提出了一个通用的整体机器心智理论元模型,并考察了在经验基准测试此类模型方面的最新进展。

英文摘要

This paper proposes, for the first time, a rigorous formal definition of the concept of Machine Theory of Mind, based on principles supported by evidence from cognitive psychology, neuroscience and artificial intelligence. It uses this definition as a lens to examine the state of the art and current efforts in the field, and to set out a potential agenda for further research able to "crack" the problem. It also advances a general holistic meta-model for Machine Theory of Mind, and examines the state of the art when it comes to empirically benchmarking such models.

2606.03460 2026-06-03 cs.CV

From 3D Perception to Safety Reasoning: A Graph-Based Framework for Real-Time Underground Mine Monitoring

从3D感知到安全推理:基于图的实时地下矿井监控框架

Pasindu Ranasinghe, Simit Raval, Dibyayan Patra, Bikram Banerjee, Ismet Canbulat

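AI summary Proposes a continuous monitoring framework combining 3D semantic perception, uncertainty-based anomaly detection, rule-based checks, on-device LLM reasoning, and GraphRAG-based memory analysis, using scene and temporal graphs for structured safety reasoning; it reaches 93% coverage across 115 hazard scenarios and 92.7% perception accuracy.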

AI中文摘要

地下煤矿开采要求人员和重型设备在共享、受限且照明不良的空间中作业,其中设备接近违规、结构不稳定和遮挡盲区等危险难以预测。传统监控系统(包括固定摄像头和基于规则的接近警报)可以检测预定义事件,但缺乏识别复杂或演变危险所需的3D场景理解和上下文记忆。本文提出一个连续监控框架,将彩色3D点云转换为结构化和可追溯的安全推理输出。该框架结合了3D语义感知、基于不确定性的异常检测、基于规则的危险检查、设备端LLM推理和基于GraphRAG的记忆分析,以识别即时危险并解释长期安全模式。场景图和时序图作为显式知识结构,连接推理阶段的感知输出。为克服标记地下数据的稀缺性,结合真实巷道扫描、受控物体放置和高保真长壁模拟生成多样化的危险场景,同时自监督预训练从有限标注中改进分割。感知模型在30 FPS下达到92.7%的准确率,内存使用低。在115个危险场景中,基于规则的检查覆盖率为57%,结合上下文LLM推理提高到76%,使用基于历史记录的记忆推理达到93%。定性结果表明,不确定性衍生的异常信号支持对超出预定义类别的分布外危险进行解释。总体而言,基于图的知识表示结合3D感知和分层安全推理,为地下矿井监控中的智能决策支持提供了实用基础。

英文摘要

Underground coal mining requires personnel and heavy equipment to operate within shared, confined, and poorly illuminated spaces where hazards such as equipment proximity violations, structural instabilities, and occluded blind spots are difficult to anticipate. Conventional monitoring systems, including fixed cameras and rule-based proximity alerts, can detect predefined events but lack the 3D scene understanding and contextual memory needed to identify complex or evolving hazards. This paper presents a continuous monitoring framework that converts colourised 3D point clouds into structured and traceable safety reasoning outputs. The framework combines 3D semantic perception, uncertainty-based anomaly detection, rule-based hazard checks, on-device LLM reasoning, and GraphRAG-based memory analysis to identify immediate hazards and interpret longer-term safety patterns. Scene and temporal graphs serve as the explicit knowledge structure, linking perception outputs across reasoning stages. To overcome the scarcity of labeled underground data, real roadway scans, controlled object placement, and high-fidelity longwall simulation were combined to generate diverse hazard scenarios, while self-supervised pretraining improved segmentation from limited annotations. The perception model achieved 92.7% accuracy at 30 FPS with low memory usage. Across 115 hazard scenarios, rule-based checks achieved 57% coverage, increasing to 76% with contextual LLM reasoning and 93% with memory-based reasoning using historical records. Qualitative results show uncertainty-derived anomaly signals support the interpretation of out-of-distribution hazards beyond predefined classes. Overall, graph-based knowledge representation combined with 3D perception and layered safety reasoning provides a practical foundation for intelligent decision support in underground mine monitoring.