arXivDaily: arXiv daily academic digest, updated Monday to Friday

2606.17180 2026-06-17 cs.LG New submission

Towards Fast GNN Surrogates for CO2 Migration in Complex Geological Formations

Rodrigo S. Luna, Thiago H. N. Coelho, Luiz S. L. Neto, Roberto M. Velho, Adriano M. A. Cortes, Renato N. Elias, Alexandre G. Evsukoff, Fernando A. Rochinha, Mauricio Araya-Polo, Herve Gross, Alvaro L. G. A. Coutinho

Affiliations: Systems and Computer Engineering and High Performance Computing Center, NACAD - COPPE, Federal University of Rio de Janeiro; Civil Engineering and High Performance Computing Center, NACAD - COPPE, Federal University of Rio de Janeiro; Mechanical Engineering and High Performance Computing Center, NACAD - COPPE, Federal University of Rio de Janeiro; Shell Global Solutions International B.V.; TotalEnergies OneTech

AI summary: Proposes an end-to-end graph neural surrogate for forecasting CO2 plume migration in geological storage; with anisotropic message passing and an autoregressive residual formulation, it achieves competitive forecasts on the SPE11A benchmark.

Abstract

This chapter discusses how a data-driven machine learning approach can reproduce key aspects of the physical behavior of multiphase flows in complex geological formations. We propose an end-to-end graph neural surrogate tailored to CO$_2$ plume migration forecasting in geological storage. The method is evaluated on the SPE11A benchmark, a well-known industry test case designed to assess CO$_2$ storage scenarios and characterized by sharp gas-water interfaces, strong advective transport, and rapid convective mixing with fingering development. The benchmark is reformulated as a graph in which nodes represent computational cells and edges encode transmissibility-based interactions enriched with geometric attributes. Directional transport arising from grid geometry, permeability contrasts, and geological heterogeneity is captured through an anisotropic message-passing mechanism, where interaction weights are computed via geometry-conditioned edge embeddings, biasing message aggregation toward physically relevant transport directions. Temporal evolution is modeled in latent space using an autoregressive residual formulation trained with multi-step supervision. The proposed model produces competitive forecasts of gas saturation and liquid-phase density, which are key indicators for CO$_2$ storage monitoring, with cumulative errors that remain moderate over extended forecasting horizons.
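
The anisotropic message passing and autoregressive residual rollout described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' model: the function names, the linear edge scorer `w_edge`, the `tanh` update, and the toy four-cell graph are all assumptions made for the example.

```python
import numpy as np

def anisotropic_step(h, edges, edge_attr, w_edge, w_msg):
    """One residual message-passing update.
    h: (N, F) node states; edges: (E, 2) rows of [src, dst]; edge_attr: (E, A)."""
    src, dst = edges[:, 0], edges[:, 1]
    scores = edge_attr @ w_edge                 # (E,) geometry-conditioned edge scores
    exp = np.exp(scores - scores.max())         # shift for numerical stability
    denom = np.zeros(h.shape[0])
    np.add.at(denom, dst, exp)                  # normalizer per destination node
    alpha = exp / denom[dst]                    # softmax over each node's incoming edges
    messages = alpha[:, None] * (h[src] @ w_msg)
    agg = np.zeros_like(h)
    np.add.at(agg, dst, messages)               # weighted aggregation of neighbor messages
    return h + np.tanh(agg)                     # residual update of the latent state

def rollout(h0, steps, *params):
    """Autoregressive rollout: each step consumes the previous prediction."""
    states = [h0]
    for _ in range(steps):
        states.append(anisotropic_step(states[-1], *params))
    return np.stack(states)

rng = np.random.default_rng(0)
# Toy 1-D chain of four cells with edges in both directions (hypothetical data)
edges = np.array([[0, 1], [1, 0], [1, 2], [2, 1], [2, 3], [3, 2]])
# Assumed edge attribute layout: [transmissibility, dx, dy]
edge_attr = np.array([[1.0, 1, 0], [1.0, -1, 0], [0.1, 1, 0],
                      [0.1, -1, 0], [1.0, 1, 0], [1.0, -1, 0]])
h0 = rng.normal(size=(4, 3))
w_edge = rng.normal(size=3)
w_msg = 0.1 * rng.normal(size=(3, 3))
traj = rollout(h0, 3, edges, edge_attr, w_edge, w_msg)
```

Because the edge weights depend on the edge attributes rather than only on node states, two neighbors with identical states can contribute differently, which is the directional bias the abstract refers to.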

2606.17175 2026-06-17 cs.CL New submission

Self-Generated Error Training for Token Editing in Diffusion Language Models

Lin Yao

Affiliations: School of Computer Science, Shanghai Jiao Tong University; Zhongguancun Academy

AI summary: Addresses the training-inference mismatch of token editing in LLaDA2.1 with self-generated T2T, which uses a no-gradient draft pass and supervision on self-generated errors to improve editing accuracy and reduce edit intensity.

Abstract

Token-to-token (T2T) editing lets LLaDA2.1 revise committed tokens during block-diffusion decoding. The released recipe trains this editor on random vocabulary corruptions, but at inference the editor sees the model's own fluent, high-confidence draft errors instead. We study this training-inference mismatch and propose self-generated T2T, which performs a no-gradient draft pass, fills masked positions with predicted tokens, and supervises recovery in a second pass under these self-generated corruptions. We implement the update as a short LoRA continued-pretraining pass on LLaDA2.1-mini and evaluate on several benchmarks under the official Q-Mode T2T procedure with unchanged inference parameters. The method generally improves accuracy while reducing T2T edit intensity, mitigating failure modes such as final-digit transcription errors after otherwise correct reasoning and excessive self-correction before short factual answers.
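
The data flow of the two-pass procedure can be sketched as below. This is a toy illustration under stated assumptions, not the released recipe: `MASK_ID`, `build_self_generated_example`, and the stand-in `toy_draft` model are hypothetical, and a real implementation would run the draft pass through the diffusion language model with gradients disabled.

```python
import numpy as np

MASK_ID = -1  # hypothetical mask token id

def build_self_generated_example(tokens, mask_positions, draft_fn):
    """Return (corrupted input, recovery targets, error mask) for the second pass."""
    masked = tokens.copy()
    masked[mask_positions] = MASK_ID
    draft = draft_fn(masked)                          # draft pass; no gradients flow here
    corrupted = tokens.copy()
    corrupted[mask_positions] = draft[mask_positions] # fill masks with the model's own guesses
    is_error = corrupted != tokens                    # positions the editor must repair
    return corrupted, tokens, is_error

def toy_draft(masked):
    """Stand-in for the model: copy the left neighbor into each masked slot."""
    out = masked.copy()
    for i in np.flatnonzero(masked == MASK_ID):
        out[i] = out[i - 1] if i > 0 else 0
    return out

tokens = np.array([3, 7, 7, 9, 4])
corrupted, targets, is_error = build_self_generated_example(
    tokens, np.array([2, 4]), toy_draft
)
```

In this toy run one drafted position happens to be correct and one is wrong, so the second pass sees a fluent-looking sequence with a single self-generated error instead of a random vocabulary corruption.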

2606.17174 2026-06-17 cs.CL cs.CY cs.MA New submission

From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

Mohammadsadegh Abolhasani, Hamid Reza Firoozfar, Reza Mousavi, Paul Jen-Hwa Hu

Affiliations: University of Utah; University of Virginia

AI summary: Investigates whether parasocial interaction (PSI) cues exist in communities of autonomous AI agents. Posts and comments are analyzed with keyword matching, few-shot LLM annotation, and related methods; PSI cues are strongly associated with OP re-engagement and a reciprocal reply structure, and a dyadic persistence test confirms that interaction-level PSI scripts are consistent with repeated dyadic patterns.

Comments: Submitted for review in ARR for EMNLP 2026

Abstract

While parasocial interactions (PSIs) and parasocial relationships (PSRs) have been studied in conventional media settings, we investigate whether colloquial PSI relational cues also exist in online communities where both sides are autonomous AI agents. We analyze 4,434 posts and 50,338 comments from Moltbook through three theory-based textual indicators: attachment/intimacy language, reciprocity bids, and self-identification to the original poster (OP). The combined results across methods based on keyword matching, few-shot large language model (LLM) annotation, and grouped-context LLM annotation reveal that colloquial PSI cues are prevalent and strongly associated with OP re-engagement and a reciprocal reply structure. These results are robust across negative controls, nullification, clustered-standard-error re-estimation, and multiple-testing correction. A dyadic persistence test further shows that reciprocity bids align with sustained OP-involving mutual recurrence, providing empirical evidence for bridging interaction-level PSI scripts with PSR-consistent repeated dyadic patterns. We interpret the evidence as a behavioral structure in discourse by LLM-enabled agents.

2606.17168 2026-06-17 cs.CL New submission

RepSelect: Robust LLM Unlearning via Representation Selectivity

Filip Sondej, Yushi Yang, Adam Mahdi

Affiliations: Independent; University of Oxford

AI summary: Existing unlearning methods are easily reversed by fine-tuning or few-shot prompting. RepSelect isolates forget-set representations by collapsing the top principal components of the gradients, achieving deep and robust unlearning.

Abstract

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. Current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause: existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.
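
One way to picture "collapsing top principal components of weight gradients" is to zero the largest singular components of a layer's gradient matrix before applying the update. The sketch below assumes exactly that reading; the paper may define the components differently (for example over per-sample gradients), and the function name, `k`, and the synthetic gradient are hypothetical.

```python
import numpy as np

def collapse_top_components(grad: np.ndarray, k: int) -> np.ndarray:
    """Zero the k largest singular components of a 2-D gradient matrix."""
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s = s.copy()
    s[:k] = 0.0                  # singular values are returned in descending order
    return (u * s) @ vt          # rebuild the gradient without the dominant directions

rng = np.random.default_rng(0)
# Hypothetical gradient: one dominant direction plus small residual structure
dominant = 10.0 * np.outer(rng.normal(size=8), rng.normal(size=6))
residual = 0.1 * rng.normal(size=(8, 6))
grad = dominant + residual
filtered = collapse_top_components(grad, k=1)
```

After filtering, the largest remaining singular value equals the second-largest singular value of the original gradient, so the update keeps only the weaker directions that, under this reading, are treated as more specific to the forget set.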

2606.17164 2026-06-17 cs.CL cs.AI cs.HC cs.PL cs.SE New submission

PromptMN: Pseudo Prompting Language

Enkhzol Dovdon

Affiliations: ICT Group

AI summary: Proposes PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives, reducing context ambiguity and making human-AI interaction clearer and more reviewable.

Comments: 32 pages, 2 figures

Abstract

Prompting has become the primary interface between humans and generative AI, yet many natural language prompts remain fragile: roles, goals, constraints, and expected outputs are often buried in prose or left implicit. In agentic and software development workflows, a misread at the first handoff can propagate through every step, since a significant portion of agent failures stem from context ambiguities rather than model limitations. This paper introduces PromptMN, a pseudo-prompting domain-specific language that annotates natural language with compact, %-prefixed typed directives covering roles, goals, requirements, priorities, constraints, plans, inputs, and outputs. Semantic resolution lets authors write in any order while the model interprets directives by function. PromptMN sits between informal prompting and programming-style pseudocode: structured enough to be inspectable and reusable, yet lightweight enough for analysts, managers, developers, and stakeholders across the software development lifecycle (SDLC). PromptMN also pairs with reverse prompt engineering. Asking a model to restate a desired outcome as PromptMN lets users inspect the inferred roles, goals, constraints, and missing assumptions before acting, reducing repair cycles and yielding a reusable artifact for aligning people and AI tools. PromptMN's feasibility is evaluated across several frontier models, including Claude Fable 5, Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5. The models correctly resolved PromptMN instructions, including complex structures such as repetition, conditionals, methods, and a prime-checking task, without fine-tuning. The same vocabulary applies across new codebases, maintenance, and redesign in the SDLC scenarios presented. While large-scale validation remains future work, these early results suggest PromptMN is a practical step toward clearer, more reviewable human-to-AI interaction.

2606.17162 2026-06-17 cs.CL cs.HC cs.MA New submission

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

Ye Jin, Yangyang Xu, Jun Zhu, Yibo Yang

Affiliations: Beijing University of Posts and Telecommunications; Tsinghua University; Shanghai Jiao Tong University

AI summary: Proposes MemSlides, a hierarchical memory framework that separates long-term memory (user profile memory and tool memory) from working memory and combines it with a local revision mechanism, enabling preference retention, multi-turn revision, and reliable local edits in personalized slide generation.

Comments: Code, website, project page, and video are linked in the paper

Abstract

Personalized presentation generation requires more than conditioning on a current prompt or template: agents must preserve stable user preferences across tasks, retain newly introduced preferences and constraints during multi-turn revision, and carry out local edits reliably. We propose MemSlides, a hierarchical memory framework for personalized presentation agents that separates long-term memory from working memory and further divides long-term memory into user profile memory and tool memory. User profile memory stores intent-conditioned profiles for round-0 personalization, working memory carries active preferences and session constraints across revision rounds, and tool memory stores reusable execution experience for reliable localized editing. MemSlides pairs this memory design with scoped slide-local revision, so targeted updates act on the smallest affected region instead of repeatedly regenerating the full deck. In controlled experiments, user profile memory improves persona-alignment judgments on a multi-persona, multi-intent profile bank, tool-memory injection improves closed-loop modify behavior in diagnostic matched-pair settings, and qualitative cases illustrate working memory's ability to carry over preferences. Taken together, these results suggest that effective personalization in presentation authoring depends on separating persistent user profiles, session-level working memory, and reusable execution experience across generation and localized revision.

2606.17160 2026-06-17 cs.SD New submission

Transductive Zero-Shot Audio Classification with Audio-Language Models

Jingwen Zhou, Mingzhe Wang

Affiliations: Xidian University, Xi'an, China

AI summary: Proposes a text-anchored spherical Gaussian-mixture EM algorithm that refines zero-shot posteriors using the audio-embedding statistics of the test batch, without labels or gradients, improving accuracy by 4.6 to 9.2 points on three datasets.

Abstract

Contrastive language-audio pretraining (CLAP) enables zero-shot audio classification, but standard inference classifies each clip in isolation and ignores the structure of the unlabeled test set. We present the first systematic study of TransCLIP-style transductive inference for CLAP: a text-anchored spherical Gaussian-mixture EM that refines zero-shot posteriors using the audio-embedding statistics of the test batch, with no labels, no gradients, and negligible compute (about 15 ms on one CPU core for 2,000 clips). Across ESC-50, UrbanSound8K, and VocalSound, this consistently improves top-1 accuracy by +4.6 to +9.2 points over the zero-shot baseline (e.g., 89.1 -> 94.8% on ESC-50, 73.8 -> 81.8% on UrbanSound8K). We further show that the gain (i) is governed by a simple operating boundary -- roughly 2.5 test samples per class per batch are required, with diminishing returns beyond ~5; (ii) is complementary to entropy-guided prompt weighting, with the combination reaching 96.2% on ESC-50; and (iii) attenuates but remains positive under long-tailed batches (+4.9 -> +3.1 points at a 20:1 imbalance), which we report as an explicit limitation. We also document a negative result: on TUT Urban Acoustic Scenes 2018, where zero-shot CLAP is near chance, transduction has no signal to amplify.
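
A simplified version of a text-anchored spherical Gaussian-mixture EM can be sketched as below. It is only an illustration of the idea: the blending weight `anchor`, the concentration `kappa`, the temperature `temp`, and the synthetic embeddings are assumptions, and the paper's actual objective may include further terms.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def transductive_em(audio, text, temp=20.0, kappa=10.0, anchor=1.0, iters=10):
    """audio: (N, D) unit-norm clip embeddings; text: (K, D) unit-norm class prompts.
    Returns (N, K) refined posteriors. No labels and no gradients are used."""
    zero_shot = softmax(temp * audio @ text.T)        # per-clip zero-shot posteriors
    post = zero_shot.copy()
    for _ in range(iters):
        # M-step: soft class centroids from the batch, pulled toward the text anchors
        means = l2_normalize(post.T @ audio + anchor * text)
        # E-step: on the unit sphere, an isotropic Gaussian log-likelihood is
        # proportional to the cosine similarity; combine it with the zero-shot prior
        logits = kappa * audio @ means.T + np.log(zero_shot + 1e-12)
        post = softmax(logits)
    return post

rng = np.random.default_rng(0)
centers = l2_normalize(rng.normal(size=(3, 16)))      # synthetic class directions
labels = rng.integers(0, 3, size=60)
audio = l2_normalize(centers[labels] + 0.3 * rng.normal(size=(60, 16)))
text = l2_normalize(centers + 0.3 * rng.normal(size=(3, 16)))  # imperfect text anchors
post = transductive_em(audio, text)
pred = post.argmax(axis=1)
```

Each iteration costs two small matrix products, which is consistent with the abstract's point that the refinement adds negligible compute on top of embedding extraction.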

2606.17126 2026-06-17 cs.SD cs.AI New submission

Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control

Joon-Seung Choi, Dong-Min Byun, Seong-Whan Lee

Affiliations: Korea University

AI summary: Proposes the VibE-SVC2 framework, which uses an Energy Style Converter, a Zero-shot Pitch Style Converter, vibrato rate scaling, and a Subharmonic Correction algorithm to give fine-grained, independent control over pitch and timbre singing styles, outperforming existing methods.

Comments: Accepted to IEEE Transactions on Audio, Speech, and Language Processing (TASLP)

Abstract

Singing style is a crucial aspect of a natural and expressive singing voice. Singers use singing styles to convey the feeling or emotion of a song. Several works have proposed controlling singing style to produce a more expressive singing voice. Recently, VibE-SVC successfully controlled vibrato by predicting the high-frequency F0 contour. In this paper, we introduce a singing voice conversion framework, called VibE-SVC2, to improve singing style conversion performance and controllability. The model offers control over two types of singing styles: a pitch style and a timbre style. For the pitch style, to resolve the pitch-energy entanglement issue left unresolved in our previous work, we introduce a novel Energy Style Converter to address the remaining style information in the energy contour. In addition, we propose a Zero-shot Pitch Style Converter, which mimics the pitch style of reference audio. To expand the controllability of the model, we propose vibrato rate scaling, a control independent of vibrato extent that is unavailable in VibE-SVC. For the timbre style, we extend the model to handle a variety of phonation styles. However, addressing specific styles such as vocal fry poses a challenge, as conventional F0 extraction often fails due to their inherent subharmonic characteristics, which degrades the conversion quality. To address this, we propose a novel Subharmonic Correction algorithm to refine the F0 contour for more natural timbre conversion. Through comprehensive objective and subjective evaluations, we demonstrate that VibE-SVC2 provides fine-grained, independent control over two types of singing styles, outperforming existing methods.

2606.17118 2026-06-17 cs.LG cs.AI New submission

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

Yuanteng Chen, Peisong Wang, Zhilei Liu, Nanxin Zeng, Yuantian Shao, Shiqiang Lang, Tao Liu, Chuangyi Li, Qinghao Hu, Gang Li, Jing Liu, Jian Cheng

Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Zhongguancun Academy

AI summary: To address cross-modal and intra-vision biases in expert importance estimation for MoE multimodal LLMs, proposes MODE, a modality-decomposed expert-level mixed-precision quantization framework that decomposes selection frequency, filters redundant vision tokens, and evaluates per-modality sensitivity to assign bit-widths under a given budget, keeping average performance loss within 2.9% at W3A16.

Comments: 18 pages, 8 figures

Abstract

Mixture-of-Experts Multimodal Large Language Models (MoE-MLLMs) offer remarkable performance but incur prohibitive GPU memory costs, making compression essential. Among post-training quantization (PTQ) methods, expert-level mixed-precision quantization has proven effective for MoE-LLMs, yet suffers notable degradation on MoE-MLLMs due to two overlooked biases in expert importance estimation. (1) At the cross-modal level, the numerical dominance of vision tokens causes expert selection frequency to be dominated by vision tokens, masking experts that are critical to the text modality; (2) at the intra-vision level, the large proportion of redundant vision tokens further skews frequency statistics, obscuring experts critical for informative visual content. To bridge these gaps, we propose MODE, a modality-decomposed expert-level mixed-precision quantization framework for MoE-MLLMs that decomposes expert selection frequency by modality, filters redundant vision tokens to obtain denoised visual frequency, and further evaluates quantization sensitivity per modality as a complementary signal to frequency-based estimation. These signals are integrated into an Integer Linear Programming formulation to assign per-expert bit-widths under a given budget. Extensive experiments show that MODE is particularly well-suited for MoE-MLLMs, limiting average performance loss to within 2.9% at W3A16, with larger gains at the extreme 2-bit setting.
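
The budgeted bit-width assignment can be illustrated on a tiny instance. The sketch below solves the integer program by exhaustive enumeration instead of an ILP solver, which is only practical for a handful of experts; the importance scores, the `penalty` table, and the function name are hypothetical stand-ins for the signals the abstract describes.

```python
from itertools import product

def allocate_bits(importance, bit_options=(2, 3, 4), avg_budget=3.0, penalty=None):
    """Pick one bit-width per expert, minimizing importance-weighted penalty
    subject to an average bit-width budget."""
    # Hypothetical error proxy per bit-width: fewer bits, larger penalty
    penalty = penalty or {2: 4.0, 3: 1.0, 4: 0.25}
    n = len(importance)
    best, best_cost = None, float("inf")
    for bits in product(bit_options, repeat=n):
        if sum(bits) > avg_budget * n:          # budget constraint
            continue
        cost = sum(w * penalty[b] for w, b in zip(importance, bits))
        if cost < best_cost:
            best, best_cost = bits, cost
    return best

# Four experts with decreasing (made-up) importance and a 3-bit average budget
assignment = allocate_bits([0.9, 0.5, 0.1, 0.05])
```

With these made-up numbers the two most important experts receive 4 bits and the other two drop to 2 bits, which shows why a biased importance estimate would send precision to the wrong experts.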

2606.17115 2026-06-17 cs.LG cs.AI q-bio.QM New submission

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

Jingyu Hu, Giuseppe Tripodi, Reed Naidoo, Sarah F. McGough, Tapabrata Chakraborti

Affiliations: The Alan Turing Institute; University of Bristol; University of Manchester; The Institute of Cancer Research; Genentech

AI summary: Systematically evaluates foundation model representations on computational pathology tasks, finding that image and omics representations are complementary, that multimodal fusion helps when no single modality dominates, and, using conformal prediction, that uncertainty-aware inference has clinical value.

Abstract

Foundation models (FMs) have emerged as powerful representation extractors for medical data, yet their generalizability to datasets under distribution shift remains underexplored. This work systematically evaluates FM-based representations on a suite of computational pathology tasks across two real-world commercial cohorts, IH-BC and IH-NSCLC, drawn from the licensed in-house (IH) oncology dataset. The analysis focuses on two modalities, whole-slide images and transcriptomic profiles, drawn from the IH multimodal data. We first benchmark unimodal probing performance across five FMs on eight downstream classification tasks, and find that image and omics representations carry complementary predictive signals. Then we investigate whether multimodal fusion can yield additional gains over unimodal baselines by comparing three image-omics fusion strategies built on paired representations. The trustworthiness of selected unimodal and multimodal pipelines is further assessed through conformal prediction. Our results show that FM representations achieve competitive performance on out-of-distribution data and that multimodal fusion helps mainly when no single modality dominates the signal. Conformal prediction reveals that in the majority of cases where a point prediction fails, the true diagnosis remains recoverable within the prediction set, reinforcing the value of uncertainty-aware inference for clinical support.
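
The observation that the true diagnosis can remain inside the prediction set even when the point prediction fails is easiest to see with split conformal prediction. The sketch below is a generic version of that technique, not the paper's pipeline: the score `1 - p(true class)`, the toy probabilities, and `alpha` are assumptions, and the `method` argument of `np.quantile` requires NumPy 1.22 or later.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate a score threshold on held-out data (split conformal)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]   # nonconformity scores
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n) # finite-sample correction
    return np.quantile(scores, level, method="higher")

def prediction_sets(test_probs, qhat):
    """Boolean (N, K) mask of the classes kept in each prediction set."""
    return test_probs >= 1.0 - qhat

# Toy calibration set with three classes (made-up probabilities)
cal_probs = np.array([[0.90, 0.05, 0.05],
                      [0.20, 0.70, 0.10],
                      [0.30, 0.20, 0.50],
                      [0.40, 0.35, 0.25]])
cal_labels = np.array([0, 1, 2, 0])
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.25)

# Test case whose true class is 1: the argmax (class 0) is wrong,
# but class 1 is still inside the prediction set.
test_probs = np.array([[0.50, 0.45, 0.05]])
sets = prediction_sets(test_probs, qhat)
```

A wider set signals lower confidence, which is the kind of uncertainty-aware output the abstract argues is useful for clinical support.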

2606.17113 2026-06-17 cs.LG cs.CL New submission

The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance

Csaba Kiss, Roland Molontay, Gabriele Pergola

Affiliations: Department of Stochastics, Institute of Mathematics, Budapest University of Technology and Economics; Institute of Biostatistics and Network Science, Semmelweis University; Department of Computer Science, University of Warwick

AI summary: Compares XGBoost, ALBERT, BioBERT, and Med-LLaMA within the InferBERT framework and finds that domain-specific pre-training (BioBERT) outperforms both the simple baseline and the larger LLM for causal ADE detection in pharmacovigilance; calibration improves ECE but has mixed effects on accuracy and causal discovery.

Comments: 10 pages, 5 figures

Abstract

Distinguishing causal adverse drug events (ADEs) from spurious correlations remains a central challenge in pharmacovigilance. The InferBERT framework integrates transformer models with Do-calculus, but its success hinges on the underlying classification model. This study evaluates the impact of model choice in InferBERT, assessing whether simpler models suffice, whether domain-specific pre-training helps, whether scaling to LLMs improves causal detection, and what effect post-hoc calibration has. We performed a comparative study on two benchmarks: Analgesics-induced Acute Liver Failure (AILF) and Tramadol-related Mortalities (TRAM). Four models were evaluated: XGBoost (baseline), ALBERT (original InferBERT), BioBERT (biomedical transformer), and Med-LLaMA (medical LLM), using 5-fold cross-validation repeated over 20 runs. We measured accuracy, Expected Calibration Error (ECE) pre- and post-isotonic regression, and Jaccard concordance of causal terms with PRR, ROR, and EBGM; significance was tested with paired t-tests. BioBERT achieved the highest accuracy on both datasets, while Med-LLaMA underperformed despite its size and parameter-efficient fine-tuning. Domain-specific pre-training was decisive. Calibration improved ECE but had mixed effects on accuracy and causal discovery. BioBERT's superiority also yielded the strongest concordance with traditional pharmacovigilance signals. These results show that domain-specific pre-training provides a clear advantage over simpler baselines and larger LLMs. Investing in manageable, domain-aware models is more effective for computational pharmacovigilance than simply scaling model size.
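
Two of the metrics named above have short standard definitions that can be written out directly. The sketch uses the common equal-width binned form of ECE and the usual set-based Jaccard index; the bin count, the toy inputs, and the term names are assumptions, and the study's exact settings are not given in the abstract.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean gap between accuracy and confidence per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap          # weight by the fraction of samples in the bin
    return ece

def jaccard(a, b):
    """Jaccard concordance between two sets of flagged terms."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Toy example: two confident correct predictions, two uncertain ones (one wrong)
ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])
# Hypothetical term sets flagged by a model and by a disproportionality signal
overlap = jaccard({"term_a", "term_b"}, {"term_b", "term_c"})
```

A post-hoc calibrator such as isotonic regression changes the confidences but, being monotone, leaves most class rankings intact, which is consistent with the reported pattern of improved ECE and mixed effects on accuracy.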

2606.17107 2026-06-17 cs.LG cs.AI New submission

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

Bojie Li

Affiliations: Pine AI

AI summary: Finds that the KV cache stores conclusions like notes and supports editing and composition: editing a single field corrects the decision (accuracy 1.00 at 8B with about 1% of the compute), and precompiled skills can be spliced into any context (logit cosine similarity 0.90-0.999) at O(L) latency.

Abstract

Prefix caching reuses prefill only across an exactly shared prefix, so one changed field invalidates the entire downstream cache. Yet overwriting the field's own key/value vectors and reusing the rest leaves the model acting on the old value. The reason, established causally across four model families: at prefill the model has already written the field-conditioned conclusion onto downstream notes; the field's own key/value drives under 1% of the decision. Read as a notebook of memoized conclusions, the KV cache gains two capabilities. (1) It is editable. A salient erratum amends the notes, and with chain-of-thought (CoT), editing the field alone recovers the decision (1.00 at 8B, ~1% compute), while without CoT it is ignored. (2) It is composable. The notes are position-portable, so a precompiled skill can be RoPE-repositioned and spliced into any context, indistinguishable from full recompute (logit cosine 0.90-0.999, twelve models) at O(L) rather than O(L^2) time-to-first-token. A unified edit+compose agent stays decision-identical to recompute at up to 14.9x lower latency. The approach applies to any per-token attention KV cache, validated across scale, quantization, Mixture-of-Experts, and multimodal caches, and extends to several attention variants through small adapters. Because the erratum is append-only, it composes with production prefix caching: in an online vLLM benchmark it keeps the prefix cache-aligned (98.5% hit-rate), cutting p90 time-to-first-token by 53-398x.
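
The algebra behind RoPE repositioning is that rotations compose: a key cached at position p can be moved to position p + d by applying one more rotation of d, without recomputing the key. The sketch below checks only that identity; it assumes the interleaved-pair RoPE layout and a base of 10000, both of which vary between implementations, and it says nothing about whether the cached contents stay valid in the new context, which is the empirical claim of the paper.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of x (last dim D, D even) by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per feature pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
key = rng.normal(size=8)               # un-rotated key vector (toy data)
cached = rope(key, pos=5)              # key as stored when compiled at position 5
moved = rope(cached, pos=37)           # shift by 37 positions using only the cached key
recomputed = rope(key, pos=42)         # what a full recompute at position 42 would store
```

In standard RoPE only queries and keys carry the rotation, so under that assumption value vectors can be spliced unchanged and the repositioning cost grows linearly with the number of cached tokens.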

2606.17106 2026-06-17 cs.LG cs.CY New submission

Informative Missingness to Generate Irregular Clinical Time Series

Hadi Mehdizavareh, Gabriele Santangelo, Giovanna Nicora, Simon Lebech Cichosz, Arianna Dagliati, Arijit Khan, Riccardo Bellazzi

Affiliations: Aalborg University; University of Pavia; Bowling Green State University

AI summary: Proposes a diffusion-based method for generating clinical time series that jointly models laboratory values and observation patterns; evaluated on the DACMI benchmark, it captures clinical dependencies between physiology and testing behavior.

Abstract

Laboratory tests in electronic health records are collected irregularly, and the absence of a test order can be as informative as the measurement itself. Such missingness reflects clinicians' decisions and patient physiology, making it important to model it directly rather than treat it as a preprocessing artifact. Here we present a diffusion-based approach for generating clinical time series that jointly models laboratory values and their observation patterns using the public Data Analytics Challenge on Missing Data Imputation (DACMI) benchmark derived from MIMIC-III. To preserve realistic sampling, we align chart times into 4-hour intervals and segment admissions into 7-day windows, producing trajectories that pair each lab value with a corresponding observation indicator. Standard transformations and normalization are applied to stabilize training. Our method extends the TimeDiff framework to learn continuous lab values and discrete missingness patterns through complementary diffusion objectives. Experiments show that the generated data closely match real patient trajectories across individual lab distributions and joint value-missingness embeddings, demonstrating that diffusion models can capture clinically meaningful dependencies between patient physiology and clinicians' testing behavior under MNAR-like (missing-not-at-random) missingness. These preliminary results indicate that our model can serve as an initial component toward developing clinical foundation models. By producing synthetic priors that preserve key physiology-missingness relationships, this work motivates the subsequent training of Prior-Data Fitted Networks capable of leveraging informative missingness, which we will investigate in the extended work.
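
The preprocessing described above, 4-hour bins inside a 7-day window with one observation indicator per lab value, gives 42 time steps per window. A minimal sketch of that gridding follows; the event tuple layout, the zero fill for unobserved cells, and the rule that the last measurement in a bin wins are assumptions, since the abstract does not specify them.

```python
import numpy as np

BIN_HOURS = 4
WINDOW_HOURS = 7 * 24
BINS_PER_WINDOW = WINDOW_HOURS // BIN_HOURS   # 42 time steps per 7-day window

def to_grid(events, n_labs):
    """events: iterable of (hours_since_window_start, lab_index, value).
    Returns (values, observed), each of shape (42, n_labs)."""
    values = np.zeros((BINS_PER_WINDOW, n_labs))
    observed = np.zeros((BINS_PER_WINDOW, n_labs), dtype=int)
    for hours, lab, value in events:
        if 0 <= hours < WINDOW_HOURS:          # drop events outside the window
            b = int(hours // BIN_HOURS)
            values[b, lab] = value             # assumed rule: last measurement in a bin wins
            observed[b, lab] = 1               # indicator the generator must also model
    return values, observed

# Made-up events for two labs; the last one falls outside the 7-day window
events = [(0.5, 0, 1.2), (5.0, 1, 140.0), (167.9, 0, 0.9), (200.0, 0, 3.0)]
values, observed = to_grid(events, n_labs=2)
```

Keeping `observed` as a separate array is what lets a generative model treat the testing pattern as a signal in its own right instead of imputing it away.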

2606.17093 2026-06-17 cs.LG eess.IV New submission

Diagnosing and Repairing Shape-Prior Shortcuts in Long-Range Single-Shot Fringe Projection Profilometry

Adam Haroon, Anush Lakshman, Cody Fleming, Beiwen Li

发表机构 * Department of Mechanical Engineering, Iowa State University(爱荷华州立大学机械工程系) College of Engineering, University of Georgia(佐治亚大学工程学院)

AI总结 通过机械可解释性和共形不确定性量化诊断长距离单次条纹投影轮廓测量中网络依赖形状先验而非条纹相位解码的问题,提出PhiCalNet架构修复,将物体平均绝对误差降低3.3倍。

Comments 44 pages, 27 figures

详情
AI中文摘要

基于学习的单次条纹投影轮廓测量术(FPP)主要在近距离下研究。长距离(工作距离超过1米)情况仍未得到充分解决:平方反比强度衰减降低了条纹信噪比并降低了物理真值的质量,单次问题由于一幅图像中缺乏条纹阶次信息而病态,且这些架构尚未被机制性地研究。我们提出了一项诊断-修复-验证研究,使用机制可解释性(MI)和共形不确定性量化(UQ)作为相互印证的诊断工具:它们在同一个物理故障点上达成一致,驱动并验证了架构修复。在一个逼真的合成基准(15,600幅条纹图像,50个物体在1.5-2.1米距离)上,最佳UNet基线达到14.54毫米的物体平均绝对误差(MAE)。三种探测方法(线性探测、Grad-CAM、平面分布外测试)得出一致结论:基线通过物体边界形状先验而非条纹相位解码来解决任务。我们通过PhiCalNet修复此问题,该网络输出包裹相位而非深度,并应用固定的可微校准层将相位映射到深度,从架构上而非通过损失惩罚从假设空间中移除形状先验解。一个物理信息损失,作为对深度回归网络的软惩罚强制执行相同物理规律,没有带来可测量的增益,从而表明架构才是起作用的因素。PhiCalNet将物体MAE降低3.3倍至4.46毫米;残余误差集中在±π包裹不连续处的0.103%像素上。逐像素共形UQ确认了诊断:通过快照不一致性拒绝前5%的物体像素,将PhiCalNet RMSE降低64%(从20.6毫米降至7.4毫米),而基线仅降低3.5%。MI和UQ指向相同的故障点。

英文摘要

Learning-based single-shot fringe projection profilometry (FPP) has been studied mostly at close range. The long-range regime (standoff beyond 1 m) remains largely unaddressed: inverse-square intensity falloff lowers fringe signal-to-noise ratio and degrades physical ground truth, the single-shot problem is ill-posed because fringe-order information is absent from one image, and these architectures have not been studied mechanistically. We present a diagnose-repair-verify study using mechanistic interpretability (MI) and conformal uncertainty quantification (UQ) as convergent diagnostics: they agree on one physical failure locus, driving and verifying an architectural repair. On a photorealistic synthetic benchmark (15,600 fringe images, 50 objects at 1.5-2.1 m), the best UNet baseline reaches 14.54 mm object mean absolute error (MAE). Three probes (linear probing, Grad-CAM, and a flat-plane out-of-distribution test) converge: the baseline solves the task via object-boundary shape priors rather than fringe-phase decoding. We repair this with PhiCalNet, which outputs wrapped phase rather than depth and applies a fixed differentiable calibration layer mapping phase to depth, removing the shape-prior solution from the hypothesis space architecturally rather than by a loss penalty. A physics-informed loss that enforces the same physics as a soft penalty on a depth-regressing network yields no measurable gain, isolating the architecture as the operative factor. PhiCalNet reduces object MAE 3.3x to 4.46 mm; the residual is carried by 0.103% of pixels at the $\pm\pi$ wrap discontinuity. Pixel-wise conformal UQ confirms the diagnosis: rejecting the top 5% of object pixels by snapshot disagreement cuts PhiCalNet RMSE by 64% (from 20.6 mm to 7.4 mm) versus 3.5% for the baseline. MI and UQ converge on the same failure locus.
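
The remark that the residual error sits at the wrap discontinuity can be illustrated with a small sketch. This is not the PhiCalNet calibration layer: `phase_to_depth` is a hypothetical linear map, and its `scale_mm` and `offset_mm` values are made up for illustration. The sketch only shows why two nearly identical phases on either side of the wrap point can produce a large depth error once the phase is wrapped.

```python
import math


def wrap_phase(phi: float) -> float:
    """Wrap an angle in radians into the half-open interval [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi


def phase_to_depth(phi: float, scale_mm: float = 10.0, offset_mm: float = 1500.0) -> float:
    """Hypothetical fixed linear calibration; scale and offset are assumed values."""
    return offset_mm + scale_mm * phi


# Two absolute phases only 0.02 rad apart, but on opposite sides of the wrap point.
below = wrap_phase(math.pi - 0.01)
above = wrap_phase(math.pi + 0.01)

# After wrapping they differ by almost 2*pi, so the mapped depths jump apart.
jump_mm = abs(phase_to_depth(above) - phase_to_depth(below))
```

Under these assumed constants the jump is roughly `scale_mm` times 2π, which is why a handful of pixels near the discontinuity can dominate an otherwise small mean error.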

2606.17082 2026-06-17 cs.RO cs.AI 新提交

ParkingTransformer: LLM-Enhanced End-to-End Trajectory Planning for Autonomous Parking

ParkingTransformer: 基于大语言模型增强的端到端自主泊车轨迹规划

Hauteng Wu, Xu Li, Dong Kong, Zihang Wang, Xieyuanli Chen, Benwu Wang, Wenkai Zhu

发表机构 * School of Instrument Science and Engineering, Southeast University(东南大学仪器科学与工程学院) School of Electronic and Information Engineering, Tongji University(同济大学电子与信息工程学院) College of Transportation, Shandong University of Science and Technology(山东科技大学交通学院) National University of Defense Technology(国防科技大学)

AI总结 提出ParkingTransformer框架,利用多视角感知和大语言模型场景理解能力,结合轨迹查询与隐状态特征,直接输出规划轨迹,无需密集BEV表示,通过3D位置编码、固定窗口流机制和粗到细解码策略提升性能,在CARLA和实车实验中验证有效性。

详情
AI中文摘要

端到端自主泊车已成为自动驾驶领域的关键任务。然而,现有方法存在黑箱特性,缺乏高层语义理解和可解释性,阻碍了从道路到目标点的无缝长距离自主泊车的实现。为解决这些限制,我们提出ParkingTransformer,一种利用多视角感知和大语言模型(LLMs)场景理解能力的新型框架。通过将轨迹查询与LLMs隐状态特征相结合,我们的方法直接与历史信息和原始传感器数据交互以输出规划轨迹,无需密集的鸟瞰图(BEV)表示。为补偿LLMs空间推理能力的不足,我们引入3D位置编码以显式注入空间几何感知。此外,设计了固定窗口流机制用于历史信息处理,显著提高了长期时间处理效率和推理速度。同时,采用粗到细解码策略逐步提升轨迹精度。在CARLA模拟器和真实车辆平台上进行了广泛的闭环实验。结果表明,我们的方法在CARLA模拟器中达到61.32的驾驶分数,在真实实验中平均成功率为88.70%,验证了所提算法的可行性和有效性。

英文摘要

End-to-end autonomous parking has emerged as a critical task within the realm of autonomous driving. However, existing methods suffer from black-box characteristics, lacking high-level semantic understanding and interpretability, which impedes the realization of seamless long-distance autonomous parking from the road to the target spot. To address these limitations, we propose ParkingTransformer, a novel framework that leverages multi-view perception and the scene understanding capability of Large Language Models (LLMs). By combining trajectory queries with LLM hidden-state features, our method interacts directly with historical information and raw sensor data to output planning trajectories, eliminating the need for dense Bird's-Eye-View (BEV) representations. To compensate for the inadequate spatial reasoning ability of LLMs, we introduce 3D positional encoding to explicitly inject spatial geometric awareness. Furthermore, a fixed-window streaming mechanism is designed for historical information processing, significantly improving long-term temporal processing efficiency and inference speed. Additionally, a coarse-to-fine decoding strategy is employed to progressively enhance trajectory precision. Extensive closed-loop experiments are conducted on the CARLA simulator and real-world vehicle platforms. The results demonstrate that our method achieves a driving score of 61.32 in the CARLA simulator and an average success rate of 88.70% in real-world experiments, validating the feasibility and effectiveness of the proposed algorithms.

2606.17080 2026-06-17 cs.RO cs.AI cs.CV 新提交

HRDX: A Large-Scale Vector HD-Map Dataset

HRDX:大规模矢量高清地图数据集

Sahith Reddy Chada, Isht Dwivedi, Nirav Savaliya

发表机构 * Honda Research Institute US(本田美国研究院)

AI总结 提出HRDX大规模矢量高清地图数据集,覆盖1400公里驾驶数据,含10类地图元素和20多种属性,并引入复合评分评估几何与属性准确性。

Comments https://usa.honda-ri.com/hrdx

详情
AI中文摘要

可靠的自动驾驶需要矢量化的高清地图,这些地图应具有几何精确性、语义丰富性,并能够扩展到长距离驾驶。然而,现有的公开高清地图数据集规模有限,提供的语义属性稀疏,并且缺乏诸如航拍图像等能够开启新研究方向的模态。我们提出了HRDX,一个用于矢量高清地图构建的大规模数据集,涵盖约40小时(1400公里)的最小重叠驾驶,比之前的公开高清地图数据集大数倍。数据使用六个同步环视摄像头、一个128线激光雷达和厘米级RTK GNSS/IMU捕获,并辅以精确对齐的航拍正射影像。标注涵盖10个矢量地图类别,并补充了20多个语义和拓扑属性。为了评估这一更丰富的本体,我们引入了复合评分(CS)来联合评估几何保真度和属性正确性。基准实验表明,HRDX的规模改善了在线矢量地图构建,并且对齐的航拍图像提供了有用的结构先验:在训练和/或推理中使用航拍图像可提高几何地图质量,而航拍增强的教师可以将部分优势转移给仅使用摄像头的学生,而无需增加推理时的传感器需求。HRDX旨在支持大规模高清地图学习、多模态BEV融合以及训练时特权信息的可重复研究。HRDX数据集和基准可在以下网址获取:https://github.com/honda-research-institute/HRDX

英文摘要

Reliable autonomous driving requires vectorized HD maps that are geometrically accurate, semantically rich, and scalable to long-horizon driving. However, existing public HD map datasets are limited in scale, provide sparse semantic attributes, and lack modalities such as aerial imagery that could enable new research directions. We present HRDX, a large-scale dataset for vector HD-map construction, spanning about 40 hours (1,400 km) of minimally overlapping drives, which is several times larger than prior public HD map datasets. Data is captured using six synchronized surround cameras, a 128-beam LiDAR, and centimeter-level RTK GNSS/IMU, and is further complemented by precisely aligned aerial orthoimagery. Annotations cover 10 vector map classes, complemented with over 20 semantic and topological attributes. To evaluate this richer ontology, we introduce the Composite Score (CS) to jointly assess geometric fidelity and attribute correctness. Benchmark experiments show that HRDX's scale improves online vector-map construction, and that aligned aerial imagery provides a useful structural prior: using aerial imagery at training and/or inference improves geometric map quality, while aerial-augmented teachers can transfer part of this benefit to camera-only students without increasing inference-time sensor requirements. HRDX is intended to support reproducible research on large-scale HD-map learning, multimodal BEV fusion, and training-time privileged information. The HRDX dataset and benchmarks are available at https://github.com/honda-research-institute/HRDX

2606.17073 2026-06-17 cs.RO cs.AI 新提交

Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF

提取语义:从URDF自动构建机器人本体的LLM引导方法

Bastien Dussard, Guillaume Sarthou

发表机构 * LAAS-CNRS, Department of Robotics, Toulouse, France(法国图卢兹机器人系CNRS实验室)

AI总结 提出利用大语言模型从URDF文件自动生成机器人语义本体,通过多数投票和语法验证确保与现有本体对齐,初步实验表明该方法能有效桥接低层描述与高层知识表示。

Journal ref 18th International Conference on Social Robotics (ICSR 2026), University of London, Jul 2026, London, United Kingdom

详情
AI中文摘要

虽然常识知识可能足以满足虚拟代理的需求,但与人类交互的具身机器人需要对其环境和自身物理形态具有基于现实的、语义丰富的表示。在认知机器人学中,本体论能够有效整合这种异构知识,以支持可解释的推理,即使在持续知识更新过程中也是如此。然而,手动构建本体仍然是一个瓶颈。我们提出了一种初步方法,通过将统一机器人描述格式(URDF)模型转换为填充的本体,自动生成机器人语义抽象。尽管URDF文件提供了结构和运动学描述,但其标识符通常需要常识解释才能恢复有意义的语义,而大语言模型(LLM)擅长此任务。我们的流程利用LLM,通过用现有本体中的概念提示它们来推断语义关系,确保最终分类与形式模型保持一致。为了提高可靠性,该流程结合了跨多个LLM查询的多数投票以及语法和模式级验证,以确保生成的输出符合预期的表示格式和本体约束。我们在多个机器人描述上评估了该方法,并讨论了生成的抽象。初步结果表明,所提出的方法能够有效弥合低层机器人描述与人机交互所需的结构化、基于现实的知识表示之间的差距。

英文摘要

While commonsense knowledge may suffice for virtual agents, embodied robots interacting with humans require grounded and semantically rich representations of both their environment and their own physical embodiment. In cognitive robotics, ontologies are effective for integrating such heterogeneous knowledge to enable explainable reasoning, even during continuous knowledge updates. Yet, their manual construction remains a bottleneck. We present a preliminary approach for the automatic generation of robot semantic abstractions by transforming Unified Robot Description Format (URDF) models into populated ontologies. Although URDF files provide structural and kinematic descriptions, their identifiers often require commonsense interpretation to recover meaningful semantics, a task at which Large Language Models (LLMs) excel. Our pipeline leverages LLMs to infer semantic relationships by prompting them with concepts from an existing ontology, ensuring the final classification remains aligned with the formal model. To improve reliability, the pipeline combines majority voting across multiple LLM queries along with syntactic and schema-level validation to ensure that generated outputs conform to the expected representation format and ontology constraints. We evaluate the approach on multiple robot descriptions and discuss the generated abstractions. Initial results indicate that the proposed method can effectively bridge the gap between low-level robot descriptions and the structured, grounded knowledge representations required for human-robot interaction.
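
The combination of majority voting and schema-level validation described above can be sketched in a few lines. This is an illustrative assumption about how such a step might look, not the authors' pipeline: the function name `vote_on_class`, the `min_share` threshold, and the example class names are all hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Set


def vote_on_class(
    answers: Sequence[str], allowed: Set[str], min_share: float = 0.5
) -> Optional[str]:
    """Return the ontology class most queries agree on, or None without a majority."""
    # Schema-level check: drop answers that are not concepts of the target ontology.
    valid = [a.strip() for a in answers if a.strip() in allowed]
    if not valid:
        return None
    label, count = Counter(valid).most_common(1)[0]
    # Require a strict majority of all queries, not only of the valid answers.
    return label if count / len(answers) > min_share else None


ontology_classes = {"Gripper", "Wheel", "Arm"}  # hypothetical ontology concepts
accepted = vote_on_class(["Gripper", "Gripper", "Wheel"], ontology_classes)
rejected = vote_on_class(["Gripper", "Claw", "Wheel"], ontology_classes)
```

Counting the share against all queries, including invalid ones, is a deliberately conservative choice here: an answer that only wins because most responses were discarded is treated as no answer.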

2606.17057 2026-06-17 cs.LG cs.AI cs.CL 新提交

Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

配对时正确,分离时错误:多模态大语言模型中模态特定神经元的解耦与编辑

Tingchao Fu, Wenkai Wang, Fanxiao Li, Huadong Zhang, Jinhong Zhang, Dayang Li, Yunyun Dong, Renyang Liu, Wei Zhou

发表机构 * School of Information Science and Engineering, Yunnan University(云南大学信息科学与工程学院) School of Software, Yunnan University(云南大学软件学院) National University of Singapore(新加坡国立大学) School of Engineering, Yunnan University(云南大学工程学院)

AI总结 针对多模态大语言模型知识编辑中存在的解耦失败问题,提出DECODE方法,通过显式解耦和定位模态特定神经元组,实现跨模态触发下的有效知识更新。

Comments 18 pages, 11 figures

详情
AI中文摘要

尽管知识编辑为多模态大语言模型(MLLMs)的知识更新提供了一种高效机制,但我们发现当前范式仍面临一个重要但尚未充分探索的问题:编辑解耦失败,即当模型被多模态输入(文本-图像查询对)触发时,实体相关知识可以更新,但当配对输入被拆分为单模态输入时,这些知识往往恢复为编辑前的旧事实。我们深入的实证分析表明,MLLMs中的实体知识并非以统一表示存储,而是分布在解耦的模态特定路径中。因此,偏向多模态查询的更新无法有效传播到单模态电路。为弥补这一差距,我们提出DECODE,该方法显式解耦并定位模态特定神经元组以获取目标知识。大量实验证明,DECODE在不同模态触发下均能实现有效的知识更新,从而缓解编辑解耦失败。

英文摘要

Although Knowledge Editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue: editing decoupling failure, where entity-related knowledge can be updated when the model is triggered by multimodal inputs (text-image query pairs) but often reverts to outdated pre-edit facts when the paired inputs are split into unimodal ones. Our in-depth empirical analysis reveals that the entity knowledge in MLLMs is not stored as a unified representation, but is instead distributed across disentangled modality-specific pathways. As a result, updates biased toward multimodal queries fail to propagate effectively to unimodal circuits. To bridge this gap, we propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups for targeted knowledge. Extensive experiments demonstrate that DECODE consistently achieves effective knowledge updates under different modality triggers, thereby mitigating editing decoupling failures.

2606.18236 2026-06-17 cs.LG cs.IT math.IT 新提交

Sign-Rank, Index, and List Replicability: Connections and Separations

符号秩、索引与列表可复制性:联系与分离

Ari Blondal, Hamed Hatami, Pooya Hatami, Chavdar Lalov, Sivan Tretiak

发表机构 * McGill University(麦吉尔大学) Ohio State University(俄亥俄州立大学)

AI总结 本文研究二元概念类的符号秩下界,通过比较Z2-索引和列表可复制数,证明Z2-索引被列表可复制数的线性函数上界,从而解决符号秩与Z2-索引的分离问题,并进一步建立列表可复制数的上界与组合性质。

Comments 29 pages, 1 figure

详情
AI中文摘要

在学习理论中,二元概念类的符号秩捕捉了其能被点和半空间表示的最小维度。尽管兴趣浓厚,符号秩的下界却难以获得。最近两种方法通过更易分析的度量建立符号秩的下界:$\mathbb{Z}_2$-索引和列表可复制数。我们对这些度量进行排序,证明$\mathbb{Z}_2$-索引被列表可复制数的线性函数上界。作为主要结果,我们得到了符号秩与$\mathbb{Z}_2$-索引之间的强分离,从而解决了Frick、Hosseini和Vasileuski提出的一个问题。这促使我们对列表可复制性(两个下界度量中更强的一个)进行深入研究。我们通过两个组合度量——高度和最小星数——建立了列表可复制数的上界。我们还证明了一个基本的复合结果:两个概念类的乘积的列表可复制数被这两个类的列表可复制数之和所界。

英文摘要

In learning theory, the sign rank of a binary concept class captures the smallest dimension in which it can be represented by points and halfspaces. Despite tremendous interest, lower bounds on sign rank are notoriously difficult to come by. Two recent approaches to the problem establish lower bounds on sign rank by measures that are easier to analyze: the $\mathbb{Z}_2$-index and the list replicability number. We order these measures, showing that the $\mathbb{Z}_2$-index is upper-bounded by a linear function of the list replicability number. As a main consequence, we obtain a strong separation between sign rank and $\mathbb{Z}_2$-index, thereby resolving a question of Frick, Hosseini, and Vasileuski. This motivates a thorough study of list replicability, the stronger of the two lower-bounding measures. We establish upper bounds on the list replicability number by two combinatorial measures: height and minimum star number. We also prove a fundamental composition result, showing that the product of two concept classes has list replicability number bounded by the sum of the list replicability numbers of the two classes.

2606.17531 2026-06-17 cs.LG cs.CG math.AT 新提交

Non-negative Matrix Factorisation with Topological Regularisation

带拓扑正则化的非负矩阵分解

Matias de Jong van Lier, Shizuo Kaji, Keunsu Kim

发表机构 * Recursive Inc.(Recursive公司) Graduate School of Science, Kyoto University(京都大学理学研究科) Institute of Mathematics for Industry, Kyushu University(九州大学产业数学研究所)

AI总结 提出通过持久同调作为拓扑正则化项融入非负矩阵分解目标函数,以学习具有空间连贯性、周期结构或团状图信号的可解释基函数。

详情
AI中文摘要

我们研究了通过正则化学习到的基函数的拓扑结构,在非负矩阵分解(NMF)中学习可解释基函数。我们的方法源于观察到许多数据模态可以视为结构化域上的非负函数,其中基的质量与其拓扑结构内在相关。然而,纳入支撑拓扑的朴素方法通常受离散性和阈值依赖性困扰,使其不适合连续优化。我们通过采用持久同调作为稳定、无阈值的拓扑量化器,并设计将拓扑分数作为正则化项融入NMF目标函数来应对这些挑战。所得框架在一个统一的建模语言中涵盖了空间连贯的图像成分、周期性的时间序列结构和团状图信号。

英文摘要

We investigate the learning of interpretable bases in non-negative matrix factorisation (NMF) by regularising the topology of the learned basis functions. Our approach is motivated by the observation that many data modalities can be viewed as non-negative functions on a structured domain, where the quality of a basis is intrinsically linked to its topology. However, naive methods for incorporating the topology of the support are often hindered by discreteness and threshold dependence, rendering them unsuitable for continuous optimisation. We address these challenges by employing persistent homology as a stable, threshold-free topological quantifier and by designing topological scores that integrate into the NMF objective as regularisers. The resulting framework encompasses spatially coherent image components, periodic time-series structures, and clique-like graph signals within a unified modelling language.

2606.17419 2026-06-17 cs.LG cs.NA math.NA 新提交

Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces

多输入神经算子学习在Sobolev空间中的泛化保证

Yahong Yang, Zecheng Zhang, Wei Zhu, Wenjing Liao, Hao Liu

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of Notre Dame(圣母大学) Hong Kong Baptist University(香港浸会大学)

AI总结 针对多输入神经算子,在Sobolev范数下建立逼近和泛化误差估计,量化各输入空间对误差界的贡献,并揭示平衡状态下输入维度、正则性和Sobolev阶的相互作用。

详情
AI中文摘要

我们发展了多输入神经算子的逼近和泛化误差估计,输出误差在Sobolev范数下度量。与标准算子学习设置中只有一个输入函数不同,我们的框架允许多个输入函数定义在可能不同的域上,具有不同的维度和Sobolev正则性。导出的速率明确量化了每个输入空间对最终误差界的贡献。特别地,在平衡状态下,逼近和泛化速率由输入维度、正则性和Sobolev阶之间的相互作用控制,而对模型复杂度的依赖保持\(\log\log/\log\)型结构。我们的分析为多输入算子学习(包括Sobolev训练)提供了一个通用的理论框架,并适用于来自偏微分方程和科学计算的算子学习问题。

英文摘要

We develop approximation and generalization error estimates for multi-input neural operators, with the output error measured in Sobolev norms. In contrast to standard operator-learning settings with a single input function, our framework allows multiple input functions defined on possibly different domains, with different dimensions and Sobolev regularities. The derived rates explicitly quantify the contribution of each input space to the final error bound. In particular, in the balanced regime, the approximation and generalization rates are governed by the interaction between the input dimensions, regularities, and Sobolev orders, while the dependence on the model complexity retains a \(\log\log/\log\)-type structure. Our analysis provides a general theoretical framework for multi-input operator learning, including Sobolev training, and is applicable to operator learning problems arising from partial differential equations and scientific computing.

2606.17414 2026-06-17 cs.LG math.DS 新提交

Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations

用于对抗性航天器接近操作中自适应安全关键控制的内存高效元强化学习

Alejandro Posadas-Nava, Richard Linares, Minduli Wijayatunga

发表机构 * MIT(麻省理工学院) University of Illinois, Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文研究利用元强化学习调整输入约束控制屏障函数的类K函数,比较三种循环网络架构和两种训练算法,发现Mamba与PPO组合在合作与非合作场景中均能提升任务完成率、安全性和燃料效率。

详情
AI中文摘要

自主航天器交会与接近操作(RPO)需要控制器在推力约束下保证安全,同时最小化燃料消耗。输入约束控制屏障函数(ICCBF)为具有执行约束的非线性系统提供了一种控制方法,构建前向不变安全集。先前工作表明,通过元强化学习(meta-RL)学习定义ICCBF递归的类$\mathcal{K}$函数,可为RPO中的安全关键控制提供鲁棒、非贪婪的方法。本文进一步扩展该框架,研究了三种循环网络架构(长短期记忆(LSTM)、门控循环单元(GRU)、选择性状态空间模型(Mamba))和两种训练算法(近端策略优化(PPO)和软演员-评论家(SAC))的性能,以确定通过元强化学习调整ICCBF类K函数的最佳设置。除了合作测试案例外,还在存在对抗行为的情况下评估性能,其中目标航天器以恶化追踪航天器安全的方式行动。结果表明,在所有测试的合作与非合作场景中,使用PPO的状态空间模型(如Mamba)相比其他架构在任务完成、安全和燃料节省方面表现更优。

英文摘要

Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a control method for nonlinear systems with actuation constraints that constructs a forward-invariant safe set. Previous work has shown that learning the class-$\mathcal{K}$ functions defining the ICCBF recursion via meta-reinforcement learning (meta-RL) yields a robust, non-greedy approach to safety-critical control in RPO. This paper extends that framework by investigating the performance of three recurrent network architectures (Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Selective State Space Model (Mamba)) and two training algorithms (Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC)) to identify the best setup for tuning ICCBF class-$\mathcal{K}$ functions via meta-RL. In addition to cooperative test cases, performance is evaluated in the presence of adversarial behavior, where the target spacecraft behaves in a way that worsens the safety of the chaser spacecraft. Results indicate that state space models such as Mamba, when used with PPO, achieve superior task completion, safety, and fuel savings compared to other architectures across all cooperative and uncooperative scenarios tested.

2606.17317 2026-06-17 cs.RO cs.AI math.OC 新提交

Transformer-Based Warm-Starting for Feasible and Optimal Terminal Approach to Tumbling Objects with Space Manipulators

基于Transformer的可行且最优末端接近翻滚目标的空间机械臂热启动方法

Yuji Takubo, Maximilian Adang, Mac Schwager, Simone D'Amico

发表机构 * Stanford University(斯坦福大学)

AI总结 针对空间机械臂末端接近翻滚目标的实时轨迹生成问题,提出基于因果Transformer的热启动方法,通过分解规划并热启动姿态-力矩分配阶段,在300个测试场景中减少28%迭代次数和23%运行时间,同时保持控制成本分布。

Comments 8 pages, 4 figures

详情
AI中文摘要

由于航天器平台运动、机械臂动力学、可见性锥和轨迹级安全约束之间的非线性耦合,在轨机器人服务的实时轨迹生成具有挑战性。本文研究了基于学习的热启动方法,用于空间机械臂末端接近翻滚目标的序列凸规划(SCP)。所提出的框架将问题分解为系统质心平移规划阶段和耦合姿态-机械臂力矩分配阶段,并对后者应用因果Transformer热启动,后者构成了主要的计算瓶颈。比较了线性动作解码器和流匹配动作解码器在不同动作分块和训练数据集大小下的表现,并使用SCP在成本最优和可行性投影下评估了生成的热启动。在300个保留场景中,学习的热启动将第二阶段SCP迭代次数减少多达28%,运行时间减少23%,同时保持最终控制成本分布。当学习的热启动用于非凸可行性投影时,其运行时间相比成本最优SCP几乎减半,同时避免了启发式初始化时观察到的灾难性高成本尾部行为。这些结果表明,序列模型热启动可以提高基于优化的空间机械臂末端制导的计算效率和轨迹鲁棒性。

英文摘要

Real-time trajectory generation for on-orbit robotic servicing is challenging due to the nonlinear coupling between spacecraft bus motion, manipulator dynamics, visibility cone, and trajectory-level safety constraints. This paper studies learning-based warm-starting for sequential convex programming (SCP) in the terminal approach of a space manipulator toward a tumbling target. The proposed framework decomposes the problem into a system center-of-mass translational planning stage and a coupled attitude--manipulator torque-allocation stage, and applies a causal transformer warm-start to the latter, which constitutes the dominant computational bottleneck. Linear and flow matching action decoders are compared under different action-chunking and training dataset sizes, and the resulting warm-starts are evaluated under both cost-optimal and feasibility projection using SCP. Across 300 held-out scenarios, the learned warm-start reduces the second-stage SCP iteration count by up to 28% and the runtime by 23% while preserving the final control-cost distribution. When the learned warm-starts are used for nonconvex feasibility projection, they nearly halve the runtime relative to cost-optimal SCP, while avoiding the catastrophic high-cost tail behavior observed when initialized heuristically. These results indicate that sequence-model warm-starts can improve both the computational efficiency and trajectory robustness of optimization-based terminal guidance for space manipulation.

2606.17185 2026-06-17 cs.LG eess.SP math.DG stat.ML 新提交

Finsler Geometry, Graph Neural Networks, and You

芬斯勒几何、图神经网络与你

T. Mitchell Roddenberry, Richard G. Baraniuk

发表机构 * Rice University(莱斯大学)

AI总结 针对图拉普拉斯只能近似各向同性算子的局限,提出基于芬斯勒拉普拉斯的图神经网络层,证明其收敛性并恢复非线性扩散方程的几何结构。

详情
AI中文摘要

基于图拉普拉斯的图神经网络架构近似拉普拉斯-贝尔特拉米算子,因此限制了它们在各向同性算子上的应用。作为拉普拉斯-贝尔特拉米算子的非线性替代,我们考虑从流形上采样的点云上芬斯勒拉普拉斯的估计。我们证明,随着点样本数量的增加,这些离散估计收敛到流形上的真实算子。此外,我们表明该算子可以表示为图神经网络层,我们用它来定义一组受约束以表达芬斯勒几何的芬斯勒图神经网络。我们表明,芬斯勒图神经网络在实践中恢复了非线性扩散方程背后的几何结构。

英文摘要

Graph neural network architectures based on the graph Laplacian approximate the Laplace-Beltrami operator, thus limiting their application to isotropic operators. As a nonlinear alternative to the Laplace-Beltrami operator, we consider estimates of the Finsler Laplacian on point clouds sampled from a manifold. We prove that these discrete estimates converge to the true operator on the manifold as the number of point samples grows. Moreover, we show that this operator can be expressed as a graph neural network layer, which we use to define a family of Finslerian graph neural networks constrained to express Finsler geometry. We show that Finslerian graph neural networks recover the geometry underlying nonlinear diffusion equations in practice.

2606.17460 2026-06-17 cs.LG cs.NA math.NA physics.comp-ph 新提交

Operator Boosting Produces Pareto-Efficient PDE Surrogates

算子提升产生帕累托高效的PDE代理模型

Lennon J. Shikhman

发表机构 * College of Computing, Georgia Institute of Technology(佐治亚理工学院计算学院) Department of Mathematics and Systems Engineering, Florida Institute of Technology(佛罗里达理工学院数学与系统工程系)

AI总结 提出算子提升框架,通过残差学习直接构建紧凑神经算子代理,在30个数据集-架构对上平均准确率提升,参数量减少72-95%,并在多个PDE基准上实现帕累托改进。

Comments 19 pages, 4 figures, 3 tables. Preprint submitted to Elsevier

详情
AI中文摘要

神经算子被广泛用作偏微分方程(PDE)的代理解映射,但在多查询科学工作流中,全尺寸模型可能存储、部署和评估成本高昂。本文引入算子提升(Operator Boosting),一种逐阶段残差学习框架,直接构建紧凑的神经算子代理,而非先训练大模型再压缩。从归一化输出坐标中的经验均值预测器开始,该方法在残差场上训练一系列同族小型神经算子,并通过验证选择的收缩整合每个修正。我们以傅里叶神经算子(FNO)、DeepONet和卷积神经算子(CNO)实例化该框架,并将提升的小型堆栈与来自PDEBench、APEBench和The Well的一维、二维和三维PDE基准上的全尺寸单体基线进行比较。在30个数据集-架构对中,21个显示平均准确率正向提升,17个具有正置信区间,而所有提升堆栈的可训练参数数量减少约72-95%。最佳模型比较显示,在10个完成的PDE基准中,有7个实现了经验帕累托改进,包括二维纳维-斯托克斯方程、浅水动力学、达西流、一维输运和反应系统,以及三维可压缩纳维-斯托克斯方程。这些结果表明,算子提升通常改善了神经PDE代理的经验准确率-参数帕累托前沿,同时也揭示了残差提升未能抵消压缩的PDE和架构依赖区域。

英文摘要

Neural operators are widely used as surrogate solution maps for partial differential equations (PDEs), but full-size models can be costly to store, deploy, and evaluate in many-query scientific workflows. This work introduces Operator Boosting, a stagewise residual-learning framework for constructing compact neural-operator surrogates directly, rather than training a large model and compressing it afterward. Starting from the empirical mean predictor in normalized output coordinates, the method trains a sequence of tiny same-family neural operators on residual fields and incorporates each correction through validation-selected shrinkage. We instantiate the framework with Fourier neural operators (FNOs), DeepONets, and convolutional neural operators (CNOs), and compare boosted tiny stacks against full-size monolithic baselines across one-, two-, and three-dimensional PDE benchmarks from PDEBench, APEBench, and The Well. Across 30 dataset-architecture pairs, 21 show positive mean accuracy gains and 17 have positive confidence intervals, while all boosted stacks reduce trainable parameter count by approximately 72-95%. Best-model comparisons show empirical Pareto improvements on 7 of 10 completed PDE benchmarks, including two-dimensional Navier-Stokes, shallow-water dynamics, Darcy flow, one-dimensional transport and reaction systems, and three-dimensional compressible Navier-Stokes. These results show that Operator Boosting often improves the empirical accuracy-parameter Pareto frontier of neural PDE surrogates, while also exposing PDE- and architecture-dependent regimes where residual boosting fails to offset compression.
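
The stagewise procedure in this abstract (start from the empirical mean, fit a tiny learner to the residual, add it with a validation-selected shrinkage factor) can be sketched on a toy one-dimensional regression problem. This is a minimal sketch under loud assumptions: a one-split regression stump stands in for a tiny neural operator, the shrinkage grid is hypothetical, and nothing here reproduces the paper's implementation.

```python
from statistics import fmean


def fit_stump(xs, rs):
    """Fit a one-split regression stump to residuals (stand-in for a tiny operator)."""
    best = None
    for t in sorted(set(xs))[1:]:  # every split leaves both sides non-empty
        left = [r for x, r in zip(xs, rs) if x < t]
        right = [r for x, r in zip(xs, rs) if x >= t]
        lm, rm = fmean(left), fmean(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:  # all inputs identical: fall back to a constant
        m = fmean(rs)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm


def mse(preds, ys):
    """Mean squared error between two equal-length sequences."""
    return fmean([(p - y) ** 2 for p, y in zip(preds, ys)])


def boost(train, val, stages=5, shrink_grid=(0.0, 0.25, 0.5, 1.0)):
    """Stagewise residual fitting with shrinkage chosen on validation data."""
    (xs, ys), (xv, yv) = train, val
    base = fmean(ys)  # stage 0: empirical mean predictor
    pred_t, pred_v = [base] * len(xs), [base] * len(xv)
    learners = []
    for _ in range(stages):
        resid = [y - p for y, p in zip(ys, pred_t)]
        h = fit_stump(xs, resid)
        # Pick the shrinkage that minimises validation error; 0.0 rejects the stage.
        eta = min(
            shrink_grid,
            key=lambda e: mse([p + e * h(x) for p, x in zip(pred_v, xv)], yv),
        )
        learners.append((eta, h))
        pred_t = [p + eta * h(x) for p, x in zip(pred_t, xs)]
        pred_v = [p + eta * h(x) for p, x in zip(pred_v, xv)]
    return lambda x: base + sum(eta * h(x) for eta, h in learners)


# Toy data: a step function sampled on two interleaved grids.
train = ([float(i) for i in range(10)], [0.0] * 5 + [1.0] * 5)
val = ([i + 0.5 for i in range(10)], [0.0] * 5 + [1.0] * 5)
model = boost(train, val)
```

Because the grid contains 0.0, a stage that does not help on validation data is effectively skipped, so validation error never increases from one stage to the next in this sketch.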

2606.17120 2026-06-17 cs.LG physics.chem-ph 新提交

Noise-Driven Escape from Metastable Phases explains Grokking in Deep Neural Networks

噪声驱动从亚稳态逃逸解释深度神经网络中的grokking现象

Ibrahim Talha Ersoy, Karoline Wiesner

发表机构 * Complexity Science Group, Institute of Physics and Astronomy, University of Potsdam(波茨坦大学物理与天文研究所复杂性科学组)

AI总结 本文通过线性DNN模型证明,grokking现象源于L2正则化引起的一阶相变中的迟滞效应,SGD噪声驱动模型从低精度亚稳态逃逸,逃逸时间符合Arrhenius标度。

Comments 13 pages, 4 figures. Accepted at HiLD 2026: 4th Workshop on High-dimensional Learning Dynamics

详情
AI中文摘要

深度神经网络(DNN)在L2正则化强度变化下表现出一阶相变,每个相变标志着新可学习特征的出现。在临界正则化强度以下,所有特征原则上可学习,但共存的亚稳态(由能量势垒分隔)可能困住网络并阻碍收敛。DNN的优势在于其泛化能力,但仍有许多开放问题,其中包括所谓的grokking的起源:在长时间明显的过拟合后突然延迟出现的泛化。我们在线性DNN中证明,grokking与一阶L2相变中的迟滞一致:通过使用L2正则化设计有意的困住,我们证明低精度亚稳态中的模型仅在SGD噪声驱动其跨越能量势垒时逃逸,逃逸时间遵循Arrhenius标度。我们通过故意将模型困在亚稳态中,在逃逸时间两个数量级范围内重现了类似grokking的延迟收敛。使用稀疏子采样,我们还重现了典型的grokking曲线,其中测试误差最终接近最终训练误差。我们的工作表明,亚稳态的数量等于可学习特征的数量(数据协方差的每个奇异值对应一个),因此迟滞的潜力随任务复杂度自然增长。我们提供证据表明相同机制可能适用于一般非线性DNN。我们的结果为更高效的学习方案提供了途径。

英文摘要

Deep neural networks (DNNs) exhibit first-order phase transitions under variations of the L2 regularization strength, with each transition marking the onset of a new learnable feature. Below a critical regularization strength, all features are in principle learnable, but coexisting metastable states, separated by energy barriers, can trap the network and impede convergence. A strength of DNNs is their ability to generalize, but many open questions remain, among them the origin of so-called grokking: the abrupt, delayed onset of generalization after prolonged apparent overfitting. We show for linear DNNs that grokking is consistent with hysteresis in first-order L2 phase transitions: using L2 regularization to engineer deliberate trapping, we demonstrate that a model in a low-accuracy metastable state escapes only when SGD noise drives it across an energy barrier, with escape times following Arrhenius scaling. We reproduce grokking-like delayed convergence across two orders of magnitude in escape time by deliberately trapping models in metastable phases. Using sparse sub-sampling, we also reproduce the canonical grokking curve, where test error eventually approaches the final training error. Our work suggests that the number of metastable states equals the number of learnable features (one per singular value of the data covariance), so the potential for hysteresis grows naturally with task complexity. We provide evidence that the same mechanism likely operates in general nonlinear DNNs. Our results provide routes toward more efficient learning schemes.
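
For readers unfamiliar with the term, Arrhenius scaling in its textbook form relates an escape time to a barrier height and a noise level. The symbols below are assumptions introduced for illustration and are not taken from the paper: $\tau_{\mathrm{esc}}$ is the mean escape time, $\Delta E$ the height of the energy barrier, and $T_{\mathrm{eff}}$ an effective temperature standing in for the strength of the SGD noise.

```latex
\tau_{\mathrm{esc}} \;\propto\; \exp\!\left(\frac{\Delta E}{T_{\mathrm{eff}}}\right),
\qquad
\log \tau_{\mathrm{esc}} \;\approx\; \frac{\Delta E}{T_{\mathrm{eff}}} + \mathrm{const}.
```

In this form, the logarithm of the escape time is linear in the barrier height at fixed noise level, so small changes in the barrier can shift the escape time by orders of magnitude.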

2606.17445 2026-06-17 cs.LG cond-mat.mtrl-sci physics.chem-ph 新提交

Toward Controllable Catalyst Inverse Design via Large-Scale Autoregressive Pretraining

面向可控催化剂逆向设计的大规模自回归预训练

Dong Hyeon Mok, Jonggeol Na, Seoin Back

发表机构 * Department of Chemical and Biomolecular Engineering, Institute of Emergent Materials, Sogang University(化学与生物分子工程系,新兴材料研究所,西江大学) Department of Chemical Engineering and Materials Science, Ewha Womans University(化学工程与材料科学系,梨花女子大学) Department of Chemical Engineering, Graduate Program in System Health Science and Engineering, Ewha Womans University(化学工程系,系统健康科学与工程研究生院,梨花女子大学) Institute for Multiscale Matter and Systems (IMMS), Ewha Womans University(多尺度物质与系统研究所(IMMS),梨花女子大学) KU-KIST Graduate School of Converging Science and Technology, Korea University(KU-KIST融合科学与技术研究生院,高丽大学) Department of Integrated Energy Engineering, Korea University(整合能源工程系,高丽大学) Center for Hydrogen and Fuel Cells, Korea Institute of Science and Technology(KIST)(氢气与燃料电池中心,韩国科学技术研究院(KIST))

AI总结 提出基于生成式预训练Transformer的条件催化剂生成模型,通过大规模预训练和微调实现高结构有效性和条件匹配率,显著提升筛选效率。

详情
AI中文摘要

多相催化剂的逆向设计仍然具有挑战性,因为催化剂表面表现出显著的结构复杂性,在广阔的化学空间中存在耦合的表面-吸附物相互作用,仅通过传统筛选难以高效探索。尽管基于机器学习的高通量筛选加速了催化剂发现,但其效率随着搜索空间的增长而不可避免地下降,这促使了能够直接构建具有目标特性的催化剂的生成模型的发展。在这里,我们提出了一种基于生成式预训练Transformer架构的条件催化剂生成模型,该模型具有数值嵌入层,能够在单一自回归框架内生成以分类和连续属性为条件的催化剂结构。该模型在1.33亿个催化剂结构上进行了预训练,随后在大约46万个优化结构上进行了微调,这些结构具有相关的分类属性和结合能,用于条件生成。最终模型实现了98%的结构有效性、95%的优化有效性以及高分类条件保真度,吸附物类型和组成的联合匹配率达到93%。对于结合能条件,约20%的匹配率相比基线训练分布提高了四倍,生成的分布系统地朝向目标值偏移,使得无需额外微调即可将反应靶向催化剂发现的筛选效率提高1.5至4倍。这些结果表明,大规模自回归预训练结合显式属性条件为可控催化剂生成和加速催化剂发现提供了一条实用途径。

英文摘要

Inverse design of heterogeneous catalysts remains challenging because catalyst surfaces exhibit substantial structural complexity with coupled surface-adsorbate interactions across a vast chemical space that is difficult to explore efficiently through conventional screening alone. Although machine learning-based high-throughput screening has accelerated catalyst discovery, its efficiency inevitably declines as the search space grows, motivating the development of generative models that can directly construct catalysts with target properties. Here, we present a conditional catalyst generative model based on the Generative Pretrained Transformer architecture with a numerical embedding layer that enables the generation of catalyst structures conditioned on both categorical and continuous properties within a single autoregressive framework. The model was pretrained on 133 million catalyst structures and subsequently fine-tuned on approximately 460,000 optimized structures with associated categorical properties and binding energies for conditional generation. The resulting model achieved 98% structural validity, 95% optimization validity, and high categorical condition fidelity, with a 93% joint match rate for adsorbate type and composition. For binding energy conditioning, the match rate of approximately 20% represents a four-fold improvement over the baseline training distribution, and the generated distributions shift systematically toward the target values, enabling a 1.5- to 4-fold improvement in screening efficiency for reaction-targeted catalyst discovery without additional fine-tuning. These results show that large-scale autoregressive pretraining, combined with explicit property conditioning, provides a practical route toward controllable catalyst generation and accelerated catalyst discovery.

2606.16917 2026-06-17 cs.RO 新提交

Unified Motion-Action Modeling for Heterogeneous Robot Learning

统一运动-动作建模用于异构机器人学习

Yunhao Cao, Shitong Liu, Chao Feng, Meryl Zhang, Xuanchen Lu, Andrew Owens, Kuan Fang

发表机构 * Cornell University(康奈尔大学)

AI总结 提出UMA模型,利用3D物体运动轨迹作为共享接口,通过掩码生成目标统一视觉运动控制和动力学建模,实现跨异构数据源的多任务预训练,并在部署时支持多种推理模式。

Comments https://uma-manipulation.github.io/

AI中文摘要

我们提出了统一运动-动作(UMA)模型,该方法使用3D物体运动轨迹作为共享接口,以桥接视觉运动控制和动力学建模。UMA将物体运动和机器人动作视为在掩码生成目标下共同演化的变量,其中掩码模式决定了预训练期间的监督机制和部署时的推理模式。通过使用事后重标记的运动上下文和对比目标(将任务意图与场景几何解耦),UMA能够在无需手动标注任务指令的情况下,跨异构数据源进行多任务预训练。在部署时,相同的预训练参数支持运动条件视觉运动控制、基于运动的动力学建模以及从少量示范中进行的任务适应。在机器人演示、人类视频和模拟数据的混合数据集上预训练后,UMA在每种推理模式下均持续优于专门针对该模式的最先进基线。

英文摘要

We present the Unified Motion-Action (UMA) model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.
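The idea that a mask pattern selects the inference mode can be pictured with a small sketch. The abstract does not specify the token layout or which variables are masked in each mode, so the mode names, the `MASK_PATTERNS` table, and `build_mask` below are one plausible reading, not the authors' implementation.

```python
import numpy as np

# Hypothetical mode table. True marks a token group that is masked,
# i.e. one the model is asked to generate.
MASK_PATTERNS = {
    # Motion is given as context, actions are generated.
    "control": {"motion": False, "action": True},
    # Actions are given as context, object motion is generated.
    "dynamics": {"motion": True, "action": False},
    # Both groups are generated together.
    "joint": {"motion": True, "action": True},
}


def build_mask(mode, n_motion, n_action):
    """Return a boolean mask over [motion tokens | action tokens]."""
    pattern = MASK_PATTERNS[mode]
    return np.concatenate([
        np.full(n_motion, pattern["motion"], dtype=bool),
        np.full(n_action, pattern["action"], dtype=bool),
    ])


print(build_mask("control", n_motion=2, n_action=3))
```

Under this reading, the same network weights serve every mode, and only the mask passed at inference time changes.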

2606.16591 2026-06-17 cs.CL 新提交

SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

SING: 用于LLM代理中可扩展主动工具发现的合成意图图

Qiao Xiao, Haochen Shi, Yisen Gao, Wenbin Hu, Huihao Jing, Tianshi Zheng, Baixuan Xu, Ziheng Zhang, Weiqi Wang, Haoran Li, Jiaxin Bai, Yangqiu Song

发表机构 * Cornell University(康奈尔大学) The Hong Kong University of Science and Technology(香港科技大学) The Ohio State University(俄亥俄州立大学) Hong Kong Baptist University(香港浸会大学)

AI总结 提出SING框架,通过构建意图-工具图并根据任务状态动态检索工具,在长周期任务中提升工具发现准确率,Global Recall@5最高提升59.8%,下游成功率最高提升28.9%。

AI中文摘要

大型语言模型(LLM)代理越来越依赖管理上下文、工具和多轮执行的代理框架,使工具成为在真实数字环境中行动的核心接口。随着框架连接的工具生态系统扩展到数百或数千个API、服务和任务特定技能,穷举工具模式注入变得昂贵,并施加了封闭世界假设,将代理限制在预定义的静态库存中。检索增强的工具选择提供了一种自然的替代方案,但现有的一次性检索方法通常无法将孤立的工具描述与代理的真实任务意图对齐,特别是在所需能力通过分解、观察和新诱导的子目标逐步涌现的长周期任务中。我们提出SING,一种意图感知的主动工具发现框架,它构建了一个连接用户意图、工具能力和工具协作模式的意图-工具图,并根据不断变化的任务状态动态检索工具。使用包含7,471个工具的统一语料库,我们在三个真实世界的工具使用基准上评估了SING。与基线相比,SING将全局Recall@5最高提高了59.8%,下游成功率最高提高了28.9%,同时将全语料库工具模式暴露减少了99.8%,表明意图感知的图结构能够在大规模代理生态系统中实现更准确和上下文高效的工具发现。

英文摘要

Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ecosystems expand to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that limits agents to a predefined static inventory. Retrieval-augmented tool selection offers a natural alternative, but existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and dynamically retrieves tools according to evolving task states. Using a unified corpus of 7,471 tools, we evaluate SING on three real-world tool-use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%, demonstrating that intention-aware graph structure enables more accurate and context-efficient tool discovery in large-scale agentic ecosystems.
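For readers unfamiliar with the headline metric, the sketch below computes the standard Recall@k for one query: the fraction of required tools that appear among the top k retrieved tools. The abstract does not define how "Global" Recall@5 aggregates over a task or over the corpus, so this is only the usual per-query definition, and the function name `recall_at_k` and the tool names are illustrative.

```python
def recall_at_k(ranked_tools, relevant_tools, k=5):
    """Fraction of relevant tools found in the top-k of a ranked list."""
    relevant = set(relevant_tools)
    if not relevant:
        # Assumption: a query with no required tools scores 0.0.
        return 0.0
    hits = relevant.intersection(ranked_tools[:k])
    return len(hits) / len(relevant)


ranked = ["search_flights", "get_weather", "book_hotel",
          "send_email", "convert_currency", "rent_car"]
needed = ["book_hotel", "rent_car"]
print(recall_at_k(ranked, needed, k=5))  # 0.5
```

Here only one of the two required tools falls inside the top five, so the score is 0.5.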

2606.16590 2026-06-17 cs.LG cs.AI q-bio.NC 新提交

Infant Spontaneous Movement Noise Improves Exploration in Deep RL

婴儿自发运动噪声改善深度强化学习中的探索

Francisco M. López, Markus R. Ernst, Francisco Cruz, Matej Hoffmann, Jochen Triesch

发表机构 * Frankfurt Institute for Advanced Studies(法兰克福高等研究所) School of Computer Science and Engineering, University of New South Wales(新南威尔士大学计算机科学与工程学院) Escuela de Ingeniería, Universidad Central de Chile(智利中央大学工程学院) Faculty of Electrical Engineering, Czech Technical University(捷克理工大学电气工程学院)

AI总结 受婴儿自发运动噪声启发,提出一种在RL训练中逐步增加时间自相关的探索噪声机制,实验表明其能产生结构化探索行为并提高学习效率。

Comments 6 pages, 4 figures, 1 table. Accepted at IEEE ICDL 2026. Cite as: F. M. López, M. R. Ernst, F. Cruz, M. Hoffmann, and J. Triesch, "Infant Spontaneous Movement Noise Improves Exploration in Deep RL", in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-6

AI中文摘要

深度强化学习(RL)中的探索通常实现为时间上不相关的白噪声。然而,最近的研究表明,时间相关的有色噪声可以通过产生更平滑的轨迹和更好的状态空间覆盖来提高探索效率。我们探究受婴儿自发运动启发的动作噪声是否也能改善深度RL中的探索。我们发现婴儿末端执行器速度的功率谱密度遵循有色噪声过程,其谱指数随年龄增长而增加。受这一发育模式的启发,我们引入了一种机制,在RL训练过程中逐步增加探索噪声的时间自相关,与婴儿统计数据相匹配。在多个RL环境中的实验表明,婴儿启发的噪声产生结构化的探索行为,并且与传统的探索策略相比可以提高学习效率。这些发现表明,人类运动和认知发展可以为人工智能体的学习机制设计提供有用的指导。我们的代码可在 https://github.com/trieschlab/baby-noise-rl 获取。

英文摘要

Exploration in deep reinforcement learning (RL) is commonly implemented as temporally uncorrelated white noise. However, recent works show that temporally correlated colored noise can improve exploration efficiency by producing smooth trajectories with better coverage of the state space. We inquire whether action noise inspired by infant spontaneous movements can also improve exploration in deep RL. We find that the power spectral densities of babies' end-effector velocities follow a colored noise process where the spectral exponent increases with age. Inspired by this developmental pattern, we introduce a mechanism that progressively increases the temporal auto-correlation of exploration noise during RL training, matching the infant statistics. Experiments across several RL environments show that infant-inspired noise produces structured exploratory behavior and can improve learning efficiency compared to conventional exploration strategies. These findings suggest that human motor and cognitive development can provide useful guidance for designing learning mechanisms in artificial agents. Our code is available at https://github.com/trieschlab/baby-noise-rl.
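The mechanism described above can be sketched as follows, assuming the common spectral-shaping construction for colored noise: white Gaussian noise is filtered in the frequency domain so that its power spectral density falls off as 1/f^β, and β is raised as training progresses. The function names, the linear schedule, and the range of β from 0 to 2 are assumptions for illustration. The authors' actual implementation is in the repository linked in the abstract.

```python
import numpy as np


def colored_noise(beta, n_steps, rng):
    """Sample a unit-variance sequence with power spectrum ~ 1/f**beta."""
    freqs = np.fft.rfftfreq(n_steps)        # freqs[0] is the DC term
    scale = np.zeros_like(freqs)
    scale[1:] = freqs[1:] ** (-beta / 2.0)  # amplitude ~ f^(-beta/2), DC dropped
    spectrum = scale * (rng.standard_normal(freqs.size)
                        + 1j * rng.standard_normal(freqs.size))
    signal = np.fft.irfft(spectrum, n=n_steps)
    return signal / signal.std()


def beta_schedule(progress, beta_start=0.0, beta_end=2.0):
    """Linearly raise the spectral exponent; progress is clipped to [0, 1]."""
    progress = min(max(progress, 0.0), 1.0)
    return beta_start + progress * (beta_end - beta_start)


def lag1_autocorr(x):
    """Lag-1 autocorrelation, a simple measure of temporal correlation."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))


rng = np.random.default_rng(0)
for progress in (0.0, 0.5, 1.0):
    beta = beta_schedule(progress)
    noise = colored_noise(beta, n_steps=4096, rng=rng)
    print(f"progress={progress:.1f} beta={beta:.1f} "
          f"lag-1 autocorrelation={lag1_autocorr(noise):.3f}")
```

With β = 0 the sequence is white noise and its lag-1 autocorrelation is close to zero. Larger β gives smoother, more strongly correlated sequences. In an RL loop, each sampled sequence would be scaled and added to the policy's actions over an episode.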