arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17082 2026-06-17 cs.RO cs.AI New submission

ParkingTransformer: LLM-Enhanced End-to-End Trajectory Planning for Autonomous Parking

Hauteng Wu, Xu Li, Dong Kong, Zihang Wang, Xieyuanli Chen, Benwu Wang, Wenkai Zhu

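Affiliations: School of Instrument Science and Engineering, Southeast University; School of Electronic and Information Engineering, Tongji University; College of Transportation, Shandong University of Science and Technology; National University of Defense Technology

AI summary: Proposes the ParkingTransformer framework, which uses multi-view perception and the scene-understanding capability of large language models, combining trajectory queries with hidden-state features to output planned trajectories directly without a dense BEV representation. 3D positional encoding, a fixed-window streaming mechanism, and a coarse-to-fine decoding strategy improve performance, and the approach is validated in CARLA and in real-vehicle experiments.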

Abstract

End-to-end autonomous parking has emerged as a critical task in autonomous driving. However, existing methods are black boxes that lack high-level semantic understanding and interpretability, which impedes seamless long-distance autonomous parking from the road to the target spot. To address these limitations, we propose ParkingTransformer, a novel framework that leverages multi-view perception and the scene-understanding capability of Large Language Models (LLMs). By combining trajectory queries with LLM hidden-state features, our method interacts directly with historical information and raw sensor data to output planned trajectories, eliminating the need for dense Bird's-Eye-View (BEV) representations. To compensate for the inadequate spatial reasoning ability of LLMs, we introduce 3D positional encoding to explicitly inject spatial geometric awareness. Furthermore, a fixed-window streaming mechanism is designed for historical information processing, significantly improving long-term temporal processing efficiency and inference speed. Additionally, a coarse-to-fine decoding strategy is employed to progressively enhance trajectory precision. Extensive closed-loop experiments are conducted on the CARLA simulator and real-world vehicle platforms. The results demonstrate that our method achieves a driving score of 61.32 in the CARLA simulator and an average success rate of 88.70% in real-world experiments, validating the feasibility and effectiveness of the proposed algorithms.

2606.17080 2026-06-17 cs.RO cs.AI cs.CV New submission

HRDX: A Large-Scale Vector HD-Map Dataset

Sahith Reddy Chada, Isht Dwivedi, Nirav Savaliya

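Affiliations: Honda Research Institute US

AI summary: Presents HRDX, a large-scale vector HD-map dataset covering 1,400 km of driving data with 10 classes of map elements and more than 20 attributes, and introduces a Composite Score to evaluate geometric and attribute accuracy.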

Comments https://usa.honda-ri.com/hrdx

Abstract

Reliable autonomous driving requires vectorized HD maps that are geometrically accurate, semantically rich, and scalable to long-horizon driving. However, existing public HD map datasets are limited in scale, provide sparse semantic attributes, and lack modalities such as aerial imagery that could enable new research directions. We present HRDX, a large-scale dataset for vector HD-map construction, spanning about 40 hours (1,400 km) of minimally overlapping drives, which is several times larger than prior public HD map datasets. Data is captured using six synchronized surround cameras, a 128-beam LiDAR, and centimeter-level RTK GNSS/IMU, and is further complemented by precisely aligned aerial orthoimagery. Annotations cover 10 vector map classes, complemented with over 20 semantic and topological attributes. To evaluate this richer ontology, we introduce the Composite Score (CS) to jointly assess geometric fidelity and attribute correctness. Benchmark experiments show that HRDX's scale improves online vector-map construction, and that aligned aerial imagery provides a useful structural prior: using aerial imagery at training and/or inference improves geometric map quality, while aerial-augmented teachers can transfer part of this benefit to camera-only students without increasing inference-time sensor requirements. HRDX is intended to support reproducible research on large-scale HD-map learning, multimodal BEV fusion, and training-time privileged information. HRDX dataset and benchmarks are available at https://github.com/honda-research-institute/HRDX

2606.17073 2026-06-17 cs.RO cs.AI New submission

Extracting Semantics: LLM-Guided Automatic Population of Robot Ontology from URDF

Bastien Dussard, Guillaume Sarthou

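Affiliations: LAAS-CNRS, Department of Robotics, Toulouse, France

AI summary: Proposes using large language models to automatically generate a robot's semantic ontology from URDF files, with majority voting and syntactic validation keeping the output aligned with an existing ontology; preliminary experiments indicate that the method can bridge low-level descriptions and high-level knowledge representation.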

Journal ref
18th International Conference on Social Robotics (ICSR 2026), University of London, Jul 2026, London, United Kingdom
Abstract

While commonsense knowledge may suffice for virtual agents, embodied robots interacting with humans require grounded and semantically rich representations of both their environment and their own physical embodiment. In cognitive robotics, ontologies are effective for integrating such heterogeneous knowledge to enable explainable reasoning, even during continuous knowledge updates. Yet, their manual construction remains a bottleneck. We present a preliminary approach for the automatic generation of robot semantic abstractions by transforming Unified Robot Description Format (URDF) models into populated ontologies. Although URDF files provide structural and kinematic descriptions, their identifiers often require commonsense interpretation to recover meaningful semantics, a task at which Large Language Models (LLMs) excel. Our pipeline leverages LLMs to infer semantic relationships by prompting them with concepts from an existing ontology, ensuring the final classification remains aligned with the formal model. To improve reliability, the pipeline combines majority voting across multiple LLM queries along with syntactic and schema-level validation to ensure that generated outputs conform to the expected representation format and ontology constraints. We evaluate the approach on multiple robot descriptions and discuss the generated abstractions. Initial results indicate that the proposed method can effectively bridge the gap between low-level robot descriptions and the structured, grounded knowledge representations required for human-robot interaction.
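The majority-voting and validation step mentioned in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `vote_on_class`, the `min_agreement` threshold, and the example concept names are hypothetical, and the LLM calls are replaced by a plain list of already-collected answers.

```python
from collections import Counter


def vote_on_class(answers, allowed_classes, min_agreement=0.5):
    """Pick the ontology class that most LLM queries agree on.

    answers: raw strings returned by repeated LLM queries (assumed input).
    allowed_classes: concept names taken from the existing ontology.
    Returns the winning class, or None if no valid answer is frequent enough.
    """
    # Schema-level validation: discard answers that are not known concepts.
    valid = [a.strip() for a in answers if a.strip() in allowed_classes]
    if not valid:
        return None
    winner, count = Counter(valid).most_common(1)[0]
    # The winner must exceed a fraction of ALL queries, valid or not.
    if count / len(answers) > min_agreement:
        return winner
    return None


# Hypothetical answers for one URDF link; "gripper_link" fails validation.
answers = ["Gripper", "Gripper", "Hand", "gripper_link", "Gripper"]
print(vote_on_class(answers, {"Gripper", "Hand", "Arm"}))
```

Counting agreement against all queries, rather than only the valid ones, means that a link for which most answers are malformed is left unclassified instead of being assigned by a small minority.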

2606.17057 2026-06-17 cs.LG cs.AI cs.CL New submission

Correct When Paired, Wrong When Split: Decoupling and Editing Modality-Specific Neurons in MLLMs

Tingchao Fu, Wenkai Wang, Fanxiao Li, Huadong Zhang, Jinhong Zhang, Dayang Li, Yunyun Dong, Renyang Liu, Wei Zhou

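Affiliations: School of Information Science and Engineering, Yunnan University; School of Software, Yunnan University; National University of Singapore; School of Engineering, Yunnan University

AI summary: Addresses the editing decoupling failure found in knowledge editing for multimodal large language models by proposing DECODE, which explicitly disentangles and localizes modality-specific neuron groups to achieve effective knowledge updates under triggers from different modalities.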

Comments 18 pages, 11 figures

Abstract

Although knowledge editing provides an efficient mechanism for updating the knowledge of Multimodal Large Language Models (MLLMs), we find that current paradigms still suffer from an important yet underexplored issue, editing decoupling failure: entity-related knowledge can be updated when the model is triggered by multimodal inputs (text-image query pairs), but it often reverts to outdated pre-edit facts when the paired inputs are split into unimodal ones. Our in-depth empirical analysis reveals that the entity knowledge in MLLMs is not stored as a unified representation, but is instead distributed across disentangled modality-specific pathways. As a result, updates biased toward multimodal queries fail to propagate effectively to unimodal circuits. To bridge this gap, we propose DECODE, which explicitly disentangles and localizes modality-specific neuron groups for targeted knowledge. Extensive experiments demonstrate that DECODE consistently achieves effective knowledge updates under different modality triggers, thereby mitigating editing decoupling failures.

2606.18236 2026-06-17 cs.LG cs.IT math.IT New submission

Sign-Rank, Index, and List Replicability: Connections and Separations

Ari Blondal, Hamed Hatami, Pooya Hatami, Chavdar Lalov, Sivan Tretiak

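Affiliations: McGill University; Ohio State University

AI summary: Studies lower bounds on the sign rank of binary concept classes by comparing the $\mathbb{Z}_2$-index with the list replicability number. It proves that the $\mathbb{Z}_2$-index is upper-bounded by a linear function of the list replicability number, which yields a separation between sign rank and the $\mathbb{Z}_2$-index, and it further establishes upper bounds and combinatorial properties for the list replicability number.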

Comments 29 pages, 1 figure

Abstract

In learning theory, the sign rank of a binary concept class captures the smallest dimension in which it can be represented by points and halfspaces. Despite tremendous interest, lower bounds on sign rank are notoriously difficult to come by. Two recent approaches to the problem establish lower bounds on sign rank by measures that are easier to analyze: the $\mathbb{Z}_2$-index and the list replicability number. We order these measures, showing that the $\mathbb{Z}_2$-index is upper-bounded by a linear function of the list replicability number. As a main consequence, we obtain a strong separation between sign rank and $\mathbb{Z}_2$-index, thereby resolving a question of Frick, Hosseini, and Vasileuski. This motivates a thorough study of list replicability, the stronger of the two lower-bounding measures. We establish upper bounds on the list replicability number by two combinatorial measures: height and minimum star number. We also prove a fundamental composition result, showing that the product of two concept classes has list replicability number bounded by the sum of the list replicability numbers of the two classes.
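For readers less familiar with the quantities involved, the standard definition of sign rank and the two inequalities stated in the abstract can be written out as below. The first display is the usual textbook definition. In the second, the symbols $\operatorname{ind}_{\mathbb{Z}_2}$ and $\operatorname{LR}$ and the constants $a$, $b$ are placeholder notation assumed here; the abstract does not give the constants of the linear bound.

```latex
% M is a sign matrix: rows index concepts, columns index domain points.
\[
  \operatorname{sign-rank}(M) \;=\;
  \min\bigl\{ \operatorname{rank}(A) \;:\; A \in \mathbb{R}^{m \times n},\
  \operatorname{sign}(A_{ij}) = M_{ij}\ \text{for all } i,j \bigr\},
  \qquad M \in \{-1,+1\}^{m \times n}.
\]
% Results as stated in the abstract; ind, LR, a and b are assumed notation.
\[
  \operatorname{ind}_{\mathbb{Z}_2}(\mathcal{C}) \;\le\; a \cdot \operatorname{LR}(\mathcal{C}) + b,
  \qquad
  \operatorname{LR}(\mathcal{C}_1 \times \mathcal{C}_2) \;\le\;
  \operatorname{LR}(\mathcal{C}_1) + \operatorname{LR}(\mathcal{C}_2).
\]
```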

2606.17531 2026-06-17 cs.LG cs.CG math.AT New submission

Non-negative Matrix Factorisation with Topological Regularisation

Matias de Jong van Lier, Shizuo Kaji, Keunsu Kim

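Affiliations: Recursive Inc.; Graduate School of Science, Kyoto University; Institute of Mathematics for Industry, Kyushu University

AI summary: Proposes adding persistent homology as a topological regularisation term in the non-negative matrix factorisation objective in order to learn interpretable basis functions with spatial coherence, periodic structure, or clique-like graph signals.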

Abstract

We investigate the learning of interpretable bases in non-negative matrix factorisation (NMF) by regularising the topology of the learned basis functions. Our approach is motivated by the observation that many data modalities can be viewed as non-negative functions on a structured domain, where the quality of a basis is intrinsically linked to its topology. However, naive methods for incorporating the topology of the support are often hindered by discreteness and threshold dependence, rendering them unsuitable for continuous optimisation. We address these challenges by employing persistent homology as a stable, threshold-free topological quantifier and by designing topological scores that integrate into the NMF objective as regularisers. The resulting framework encompasses spatially coherent image components, periodic time-series structures, and clique-like graph signals within a unified modelling language.

2606.17419 2026-06-17 cs.LG cs.NA math.NA New submission

Generalization Guarantees for Multi-Input Neural Operator Learning in Sobolev Spaces

Yahong Yang, Zecheng Zhang, Wei Zhu, Wenjing Liao, Hao Liu

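Affiliations: Georgia Institute of Technology; University of Notre Dame; Hong Kong Baptist University

AI summary: Establishes approximation and generalization error estimates in Sobolev norms for multi-input neural operators, quantifies the contribution of each input space to the error bound, and shows how input dimensions, regularities, and Sobolev orders interact in the balanced regime.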

Abstract

We develop approximation and generalization error estimates for multi-input neural operators, with the output error measured in Sobolev norms. In contrast to standard operator-learning settings with a single input function, our framework allows multiple input functions defined on possibly different domains, with different dimensions and Sobolev regularities. The derived rates explicitly quantify the contribution of each input space to the final error bound. In particular, in the balanced regime, the approximation and generalization rates are governed by the interaction between the input dimensions, regularities, and Sobolev orders, while the dependence on the model complexity retains a \(\log\log/\log\)-type structure. Our analysis provides a general theoretical framework for multi-input operator learning, including Sobolev training, and is applicable to operator learning problems arising from partial differential equations and scientific computing.

2606.17414 2026-06-17 cs.LG math.DS New submission

Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations

Alejandro Posadas-Nava, Richard Linares, Minduli Wijayatunga

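Affiliations: MIT; University of Illinois, Urbana-Champaign

AI summary: Studies meta-reinforcement learning for tuning the class-$\mathcal{K}$ functions of input-constrained control barrier functions, comparing three recurrent network architectures and two training algorithms; the combination of Mamba and PPO improves task completion, safety, and fuel efficiency in both cooperative and uncooperative scenarios.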

Abstract

Autonomous spacecraft rendezvous and proximity operations (RPO) require controllers that guarantee safety under thrust constraints while minimizing fuel expenditure. Input-constrained control barrier functions (ICCBFs) provide a control method for nonlinear systems with actuation constraints that constructs a forward-invariant safe set. Previous work has shown that learning the class-$\mathcal{K}$ functions defining the ICCBF recursion via meta-reinforcement learning (meta-RL) yields a robust, non-greedy approach to safety-critical control in RPO. This paper extends that framework by investigating the performance of three recurrent network architectures (Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Selective State Space Model (Mamba)) and two training algorithms (Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC)) to identify the best setup for tuning ICCBF class-$\mathcal{K}$ functions via meta-RL. In addition to cooperative test cases, performance is evaluated in the presence of adversarial behavior, where the target spacecraft behaves in a way that worsens the safety of the chaser spacecraft. Results indicate that state space models such as Mamba, when used with PPO, achieve superior task completion, safety, and fuel savings compared to other architectures across all cooperative and uncooperative scenarios tested.
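As background for the class-$\mathcal{K}$ functions being tuned, the textbook control barrier function condition is reproduced below. This is the standard single-function form, not the ICCBF recursion of the paper, which the abstract does not spell out. All symbols ($f$, $g$, $h$, $\mathcal{U}$, $\mathcal{S}$, $\alpha$) are assumed notation.

```latex
% Control-affine dynamics: \dot{x} = f(x) + g(x) u, admissible inputs u \in \mathcal{U}.
% Safe set: \mathcal{S} = \{ x : h(x) \ge 0 \}.
\[
  \sup_{u \in \mathcal{U}} \bigl[ L_f h(x) + L_g h(x)\, u \bigr]
  \;\ge\; -\alpha\bigl( h(x) \bigr)
  \quad \text{for all } x \in \mathcal{S},
\]
% \alpha is a class-\mathcal{K} function: continuous, strictly increasing, \alpha(0) = 0.
```

A steeper $\alpha$ lets the state approach the boundary of the safe set faster, so the choice of $\alpha$ trades conservatism against performance. That trade-off is what makes $\alpha$ a natural quantity to learn.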

2606.17317 2026-06-17 cs.RO cs.AI math.OC New submission

Transformer-Based Warm-Starting for Feasible and Optimal Terminal Approach to Tumbling Objects with Space Manipulators

Yuji Takubo, Maximilian Adang, Mac Schwager, Simone D'Amico

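Affiliations: Stanford University

AI summary: For real-time trajectory generation in the terminal approach of a space manipulator toward a tumbling target, proposes a causal-Transformer warm-start that decomposes the planning problem and warm-starts the attitude-torque allocation stage; across 300 test scenarios it reduces iterations by up to 28% and runtime by 23% while preserving the control-cost distribution.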

Comments 8 pages, 4 figures

Abstract

Real-time trajectory generation for on-orbit robotic servicing is challenging due to the nonlinear coupling between spacecraft bus motion, manipulator dynamics, visibility cone, and trajectory-level safety constraints. This paper studies learning-based warm-starting for sequential convex programming (SCP) in the terminal approach of a space manipulator toward a tumbling target. The proposed framework decomposes the problem into a system center-of-mass translational planning stage and a coupled attitude--manipulator torque-allocation stage, and applies a causal transformer warm-start to the latter, which constitutes the dominant computational bottleneck. Linear and flow matching action decoders are compared under different action-chunking and training dataset sizes, and the resulting warm-starts are evaluated under both cost-optimal and feasibility projection using SCP. Across 300 held-out scenarios, the learned warm-start reduces the second-stage SCP iteration count by up to 28% and the runtime by 23% while preserving the final control-cost distribution. When the learned warm-starts are used for nonconvex feasibility projection, they nearly halve the runtime relative to cost-optimal SCP, while avoiding the catastrophic high-cost tail behavior observed when initialized heuristically. These results indicate that sequence-model warm-starts can improve both the computational efficiency and trajectory robustness of optimization-based terminal guidance for space manipulation.

2606.17185 2026-06-17 cs.LG eess.SP math.DG stat.ML New submission

Finsler Geometry, Graph Neural Networks, and You

T. Mitchell Roddenberry, Richard G. Baraniuk

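Affiliations: Rice University

AI summary: To address the limitation that the graph Laplacian can only approximate isotropic operators, proposes graph neural network layers based on the Finsler Laplacian, proves their convergence, and recovers the geometry underlying nonlinear diffusion equations.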

Abstract

Graph neural network architectures based on the graph Laplacian approximate the Laplace-Beltrami operator, thus limiting their application to isotropic operators. As a nonlinear alternative to the Laplace-Beltrami operator, we consider estimates of the Finsler Laplacian on point clouds sampled from a manifold. We prove that these discrete estimates converge to the true operator on the manifold as the number of point samples grows. Moreover, we show that this operator can be expressed as a graph neural network layer, which we use to define a family of Finslerian graph neural networks constrained to express Finsler geometry. We show that Finslerian graph neural networks recover the geometry underlying nonlinear diffusion equations in practice.

2606.17460 2026-06-17 cs.LG cs.NA math.NA physics.comp-ph New submission

Operator Boosting Produces Pareto-Efficient PDE Surrogates

Lennon J. Shikhman

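Affiliations: College of Computing, Georgia Institute of Technology; Department of Mathematics and Systems Engineering, Florida Institute of Technology

AI summary: Proposes the Operator Boosting framework, which builds compact neural-operator surrogates directly through residual learning. Of 30 dataset-architecture pairs, 21 show positive mean accuracy gains, parameter counts fall by approximately 72-95%, and Pareto improvements are obtained on several PDE benchmarks.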

Comments 19 pages, 4 figures, 3 tables. Preprint submitted to Elsevier

Abstract

Neural operators are widely used as surrogate solution maps for partial differential equations (PDEs), but full-size models can be costly to store, deploy, and evaluate in many-query scientific workflows. This work introduces Operator Boosting, a stagewise residual-learning framework for constructing compact neural-operator surrogates directly, rather than training a large model and compressing it afterward. Starting from the empirical mean predictor in normalized output coordinates, the method trains a sequence of tiny same-family neural operators on residual fields and incorporates each correction through validation-selected shrinkage. We instantiate the framework with Fourier neural operators (FNOs), DeepONets, and convolutional neural operators (CNOs), and compare boosted tiny stacks against full-size monolithic baselines across one-, two-, and three-dimensional PDE benchmarks from PDEBench, APEBench, and The Well. Across 30 dataset-architecture pairs, 21 show positive mean accuracy gains and 17 have positive confidence intervals, while all boosted stacks reduce trainable parameter count by approximately 72-95%. Best-model comparisons show empirical Pareto improvements on 7 of 10 completed PDE benchmarks, including two-dimensional Navier-Stokes, shallow-water dynamics, Darcy flow, one-dimensional transport and reaction systems, and three-dimensional compressible Navier-Stokes. These results show that Operator Boosting often improves the empirical accuracy-parameter Pareto frontier of neural PDE surrogates, while also exposing PDE- and architecture-dependent regimes where residual boosting fails to offset compression.
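The stagewise procedure described in the abstract (mean predictor, then small learners fitted to residuals, each added with a validation-selected shrinkage factor) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: a ridge-regularised linear map stands in for a tiny neural operator, and the names `fit_tiny` and `boost`, the shrinkage grid, and the synthetic data are hypothetical rather than taken from the paper.

```python
import numpy as np


def fit_tiny(X, r):
    """Hypothetical tiny learner: ridge-regularised linear map fitted to residual r."""
    A = X.T @ X + 1e-3 * np.eye(X.shape[1])
    W = np.linalg.solve(A, X.T @ r)
    return lambda Z: Z @ W


def boost(X_tr, Y_tr, X_val, Y_val, stages=3, shrinkages=(0.25, 0.5, 1.0)):
    mean = Y_tr.mean(axis=0)   # stage 0: empirical mean predictor
    learners = []              # (shrinkage, model) pairs, one per stage

    def predict(Z):
        out = np.tile(mean, (Z.shape[0], 1))
        for eta, f in learners:
            out = out + eta * f(Z)
        return out

    for _ in range(stages):
        # Fit the next tiny learner to the current training residual.
        f = fit_tiny(X_tr, Y_tr - predict(X_tr))
        base, corr = predict(X_val), f(X_val)
        # Pick the shrinkage factor that minimises validation error.
        eta = min(
            shrinkages,
            key=lambda s: float(np.mean((Y_val - base - s * corr) ** 2)),
        )
        learners.append((eta, f))
    return predict


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = X @ rng.normal(size=(4, 2)) + 0.5
model = boost(X[:150], Y[:150], X[150:], Y[150:])
mse_boost = float(np.mean((Y[150:] - model(X[150:])) ** 2))
mse_mean = float(np.mean((Y[150:] - Y[:150].mean(axis=0)) ** 2))
print(mse_boost < mse_mean)
```

With linear learners the first stage already captures most of the signal, so the later stages add little. The sketch shows only the control flow of residual fitting and shrinkage selection, not the accuracy behaviour reported for neural operators.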

2606.17120 2026-06-17 cs.LG physics.chem-ph New submission

Noise-Driven Escape from Metastable Phases explains Grokking in Deep Neural Networks

Ibrahim Talha Ersoy, Karoline Wiesner

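Affiliations: Complexity Science Group, Institute of Physics and Astronomy, University of Potsdam

AI summary: Uses linear DNN models to show that grokking is consistent with hysteresis in the first-order phase transitions induced by L2 regularization: SGD noise drives the model out of a low-accuracy metastable state, with escape times following Arrhenius scaling.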

Comments 13 pages, 4 figures. Accepted at HiLD 2026: 4th Workshop on High-dimensional Learning Dynamics

Abstract

Deep neural networks (DNNs) exhibit first-order phase transitions under variations of the L2 regularization strength, with each transition marking the onset of a new learnable feature. Below a critical regularization strength, all features are in principle learnable, but coexisting metastable states, separated by energy barriers, can trap the network and impede convergence. A strength of DNNs is their ability to generalize, but many open questions remain, among them the origin of so-called grokking: the abrupt, delayed onset of generalization after prolonged apparent overfitting. We show for linear DNNs that grokking is consistent with hysteresis in first-order L2 phase transitions: using L2 regularization to engineer deliberate trapping, we demonstrate that a model in a low-accuracy metastable state escapes only when SGD noise drives it across an energy barrier, with escape times following Arrhenius scaling. We reproduce grokking-like delayed convergence across two orders of magnitude in escape time by deliberately trapping models in metastable phases. Using sparse sub-sampling, we also reproduce the canonical grokking curve, where test error eventually approaches the final training error. Our work suggests that the number of metastable states equals the number of learnable features (one per singular value of the data covariance), so the potential for hysteresis grows naturally with task complexity. We provide evidence that the same mechanism likely operates in general nonlinear DNNs. Our results provide routes toward more efficient learning schemes.
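For reference, Arrhenius scaling of an escape time has the generic form below. The symbols are assumed notation: $\Delta E$ is the barrier height, $\tau_0$ a prefactor, and $T_{\mathrm{eff}}$ an effective noise level standing in for the strength of SGD noise. The abstract does not say how that noise level is defined.

```latex
\[
  \tau_{\mathrm{esc}} \;\approx\; \tau_0 \exp\!\left( \frac{\Delta E}{T_{\mathrm{eff}}} \right)
  \quad\Longleftrightarrow\quad
  \log \tau_{\mathrm{esc}} \;\approx\; \log \tau_0 + \frac{\Delta E}{T_{\mathrm{eff}}}.
\]
```

In this form the logarithm of the escape time is linear in $\Delta E / T_{\mathrm{eff}}$, so a modest change in barrier height or noise level can shift the escape time by orders of magnitude.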

2606.17445 2026-06-17 cs.LG cond-mat.mtrl-sci physics.chem-ph New submission

Toward Controllable Catalyst Inverse Design via Large-Scale Autoregressive Pretraining

Dong Hyeon Mok, Jonggeol Na, Seoin Back

Affiliations: Department of Chemical and Biomolecular Engineering, Institute of Emergent Materials, Sogang University; Department of Chemical Engineering and Materials Science, Ewha Womans University; Department of Chemical Engineering, Graduate Program in System Health Science and Engineering, Ewha Womans University; Institute for Multiscale Matter and Systems (IMMS), Ewha Womans University; KU-KIST Graduate School of Converging Science and Technology, Korea University; Department of Integrated Energy Engineering, Korea University; Center for Hydrogen and Fuel Cells, Korea Institute of Science and Technology (KIST)

AI summary: Proposes a conditional catalyst generative model based on the Generative Pretrained Transformer architecture; large-scale pretraining followed by fine-tuning yields high structural validity and condition match rates and improves screening efficiency by 1.5 to 4 times.

Abstract

Inverse design of heterogeneous catalysts remains challenging because catalyst surfaces exhibit substantial structural complexity, with coupled surface-adsorbate interactions across a vast chemical space that is difficult to explore efficiently through conventional screening alone. Although machine learning-based high-throughput screening has accelerated catalyst discovery, its efficiency inevitably declines as the search space grows, motivating the development of generative models that can directly construct catalysts with target properties. Here, we present a conditional catalyst generative model based on the Generative Pretrained Transformer architecture with a numerical embedding layer that enables the generation of catalyst structures conditioned on both categorical and continuous properties within a single autoregressive framework. The model was pretrained on 133 million catalyst structures and subsequently fine-tuned on approximately 460,000 optimized structures with associated categorical properties and binding energies for conditional generation. The resulting model achieved 98% structural validity, 95% optimization validity, and high categorical condition fidelity, with a 93% joint match rate for adsorbate type and composition. For binding energy conditioning, the match rate of approximately 20% represents a four-fold improvement over the baseline training distribution, and the generated distributions shift systematically toward the target values, enabling a 1.5- to 4-fold improvement in screening efficiency for reaction-targeted catalyst discovery without additional fine-tuning. These results show that large-scale autoregressive pretraining, combined with explicit property conditioning, provides a practical route toward controllable catalyst generation and accelerated catalyst discovery.

2606.17041 2026-06-17 cs.CL cs.IR New submission

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai

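Affiliations: Tsinghua University

AI summary: Presents MetaSyn, a dataset of 442 expert-curated meta-analyses for evaluating LLM agents across the full retrieval-screening-synthesis pipeline, and finds that current systems face a severe bottleneck at the screening stage.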

Comments 13 pages, 7 figures, preprint for arXiv, dataset and code available at https://github.com/BFTree/MetaSyn

Abstract

Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
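The recall figures quoted above follow the usual recall-at-K definition, which can be sketched as below. The function name and the toy identifiers are hypothetical, and the benchmark's own evaluation code may differ in detail.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear among the top-k ranked items."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = relevant.intersection(ranked_ids[:k])
    return len(hits) / len(relevant)


ranked = ["p7", "p2", "p9", "p4", "p1"]    # hypothetical retrieval output
included = ["p2", "p4", "p8"]              # hypothetical ground-truth studies
print(recall_at_k(ranked, included, k=3))  # 1 of 3 included studies found
print(recall_at_k(ranked, included, k=5))  # 2 of 3 included studies found
```

Because screening can only keep studies that retrieval has already returned, retrieval recall at K is an upper bound on end-to-end recall. This is why the abstract can attribute the gap between the 90.9% ceiling and the 52.7% best result to the screening stage.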

2606.17030 2026-06-17 cs.CV New submission

Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Jie Zhang, Xiaoyue Chen, Anzhe Chen, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Yan Shu, Yanran Zhang, Ye Wang, Yi Wang, Yilei Chen, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zixing Lei, Zhixuan Liang, Zihao Liu, Zikai Zhou, Chenxu Lv, Xiong-Hui Chen, Chenfei Wu

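Affiliations * Qwen Team

AI summary Proposes Qwen-RobotWorld, a language-conditioned video world model that uses natural language as a unified action interface. With a double-stream MMDiT, a large-scale Embodied World Knowledge corpus, and a progressive training curriculum, it predicts physically consistent future visual trajectories for tasks such as robotic manipulation and autonomous driving, ranking first overall on EWMBench and DreamGen Bench.

Details
AI Chinese abstract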

我们介绍Qwen-RobotWorld,一种用于具身智能的语言条件视频世界模型。以自然语言作为统一动作接口,它从当前观测预测物理上合理的未来视觉轨迹,涵盖机器人操作、自动驾驶、室内导航和人到机器人迁移。这种统一公式提供了三个有前景的应用方向:用于策略训练增强的合成数据生成、用于策略评估的可扩展虚拟环境,以及用于下游机器人控制的语言引导规划信号。这是通过三部分设计实现的:a) 双流MMDiT与MLLM动作编码,其中60层双流扩散变压器通过逐层联合注意力将冻结的Qwen2.5-VL语义与视频VAE潜变量耦合;b) 具身世界知识(EWK),一个860万视频-文本语料库(2亿+帧),包含20+种具身形态和500+动作类别的动作-语言映射;c) 通用+专家渐进式课程,一种两阶段训练策略,首先学习通用视觉先验,然后在共享语言接口下注入具身专门化。广泛的结果显示出强竞争力:在EWMBench和DreamGen Bench上总体排名第一,在WorldModelBench和PBench上优于所有开源模型。在RoboTwin-IF基准上的额外零样本分析进一步支持了鲁棒泛化和多视图一致性。

English abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: the model ranks 1st overall on EWMBench and DreamGen Bench and outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on the RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

2606.16917 2026-06-17 cs.RO New submission

Unified Motion-Action Modeling for Heterogeneous Robot Learning

统一运动-动作建模用于异构机器人学习

Yunhao Cao, Shitong Liu, Chao Feng, Meryl Zhang, Xuanchen Lu, Andrew Owens, Kuan Fang

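Affiliations * Cornell University

AI summary Proposes the UMA model, which uses 3D object motion trajectories as a shared interface and a masked generative objective to unify visuomotor control and dynamics modeling, enabling multi-task pretraining across heterogeneous data sources and supporting multiple inference modes at deployment.

Comments https://uma-manipulation.github.io/

Details
AI Chinese abstract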

我们提出了统一运动-动作(UMA)模型,该方法使用3D物体运动轨迹作为共享接口,以桥接视觉运动控制和动力学建模。UMA将物体运动和机器人动作视为在掩码生成目标下共同演化的变量,其中掩码模式决定了预训练期间的监督机制和部署时的推理模式。通过使用事后重标记的运动上下文和对比目标(将任务意图与场景几何解耦),UMA能够在无需手动标注任务指令的情况下,跨异构数据源进行多任务预训练。在部署时,相同的预训练参数支持运动条件视觉运动控制、基于运动的动力学建模以及从少量示范中进行的任务适应。在机器人演示、人类视频和模拟数据的混合数据集上预训练后,UMA在每种推理模式下均持续优于专门针对该模式的最先进基线。

English abstract

We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.

2606.16591 2026-06-17 cs.CL New submission

SING: Synthetic Intention Graph for Scalable Active Tool Discovery in LLM Agents

SING: 用于LLM代理中可扩展主动工具发现的合成意图图

Qiao Xiao, Haochen Shi, Yisen Gao, Wenbin Hu, Huihao Jing, Tianshi Zheng, Baixuan Xu, Ziheng Zhang, Weiqi Wang, Haoran Li, Jiaxin Bai, Yangqiu Song

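Affiliations * Cornell University; The Hong Kong University of Science and Technology; The Ohio State University; Hong Kong Baptist University

AI summary Proposes SING, a framework that builds an intention-tool graph and retrieves tools dynamically, improving tool-discovery accuracy on long-horizon tasks: Global Recall@5 rises by up to 59.8% and downstream success rate by up to 28.9% over baselines.

Details
AI Chinese abstract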

大型语言模型(LLM)代理越来越依赖管理上下文、工具和多轮执行的代理框架,使工具成为在真实数字环境中行动的核心接口。随着框架连接的工具生态系统扩展到数百或数千个API、服务和任务特定技能,穷举工具模式注入变得昂贵,并施加了封闭世界假设,将代理限制在预定义的静态库存中。检索增强的工具选择提供了一种自然的替代方案,但现有的一次性检索方法通常无法将孤立的工具描述与代理的真实任务意图对齐,特别是在需要通过分解、观察和新诱导的子目标来涌现所需能力的长期任务中。我们提出SING,一种意图感知的主动工具发现框架,它构建了一个连接用户意图、工具能力和工具协作模式的意图-工具图,并根据不断变化的任务状态动态检索工具。使用包含7,471个工具的统一语料库,我们在三个真实世界的工具使用基准上评估了SING。与基线相比,SING将全局Recall@5最高提高了59.8%,下游成功率最高提高了28.9%,同时将全语料库工具模式暴露减少了99.8%,表明意图感知的图结构能够在大规模代理生态系统中实现更准确和上下文高效的工具发现。

English abstract

Large language model (LLM) agents increasingly rely on agent harnesses that manage context, tools, and multi-turn execution, making tools a central interface for acting in realistic digital environments. As harness-connected tool ecosystems expand to hundreds or thousands of APIs, services, and task-specific skills, exhaustive tool schema injection becomes costly and imposes a closed-world assumption that limits agents to a predefined static inventory. Retrieval-augmented tool selection offers a natural alternative, but existing one-shot retrieval methods often fail to align isolated tool descriptions with the agent's true task intention, especially in long-horizon tasks where required capabilities emerge through decomposition, observations, and newly induced subgoals. We propose SING, an intention-aware active tool discovery framework that builds an intention-tool graph linking user intentions, tool capabilities, and tool collaboration patterns, and dynamically retrieves tools according to evolving task states. Using a unified corpus of 7,471 tools, we evaluate SING on three real-world tool-use benchmarks. SING improves Global Recall@5 by up to 59.8% and downstream success rate by up to 28.9% over baselines, while reducing full-corpus tool-schema exposure by 99.8%, demonstrating that intention-aware graph structure enables more accurate and context-efficient tool discovery in large-scale agentic ecosystems.
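One way to picture retrieval over an intention-tool graph is a bounded breadth-first expansion from an intent node. The sketch below is a guess at the general shape, not SING's algorithm: the graph encoding (a plain adjacency dict), the `intent:` and `tool:` prefixes, the function `discover_tools`, and every node name are hypothetical.

```python
from collections import deque


def discover_tools(graph, intent, k=5):
    """Collect up to k tools reachable from an intent node, nearest first.

    graph maps a node to its neighbours. Intent nodes link to tools, and
    tool nodes link to tools they are commonly used with.
    """
    seen = {intent}
    queue = deque([intent])
    tools = []
    while queue and len(tools) < k:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            queue.append(neighbour)
            if neighbour.startswith("tool:"):
                tools.append(neighbour)
                if len(tools) == k:
                    break
    return tools


# Hypothetical graph: one intent, two directly linked tools, one collaborator.
graph = {
    "intent:plan_trip": ["tool:search_flights", "tool:book_hotel"],
    "tool:search_flights": ["tool:pay"],
    "tool:book_hotel": ["tool:pay"],
}
candidates = discover_tools(graph, "intent:plan_trip", k=3)
```

Only the returned handful of tool schemas would then be exposed to the agent, which is the context saving the abstract describes; re-running the lookup as subgoals change would stand in for the dynamic retrieval.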

2606.16590 2026-06-17 cs.LG cs.AI q-bio.NC New submission

Infant Spontaneous Movement Noise Improves Exploration in Deep RL

婴儿自发运动噪声改善深度强化学习中的探索

Francisco M. López, Markus R. Ernst, Francisco Cruz, Matej Hoffmann, and Jochen Triesch

Affiliations * Frankfurt Institute for Advanced Studies; School of Computer Science and Engineering, University of New South Wales; Escuela de Ingeniería, Universidad Central de Chile; Faculty of Electrical Engineering, Czech Technical University

AI summary Inspired by the noise in infants' spontaneous movements, proposes an exploration-noise mechanism whose temporal autocorrelation increases progressively during RL training; experiments show that it produces structured exploratory behavior and can improve learning efficiency.

Comments 6 pages, 4 figures, 1 table. Accepted at IEEE ICDL 2026. Cite as: F. M. López, M. R. Ernst, F. Cruz, M. Hoffmann, and J. Triesch, "Infant Spontaneous Movement Noise Improves Exploration in Deep RL", in 2026 IEEE International Conference on Development and Learning (ICDL). IEEE, 2026, pp. 1-6

Details
AI Chinese abstract

深度强化学习(RL)中的探索通常实现为时间上不相关的白噪声。然而,最近的研究表明,时间相关的有色噪声可以通过产生更平滑的轨迹和更好的状态空间覆盖来提高探索效率。我们探究受婴儿自发运动启发的动作噪声是否也能改善深度RL中的探索。我们发现婴儿末端执行器速度的功率谱密度遵循有色噪声过程,其谱指数随年龄增长而增加。受这一发育模式的启发,我们引入了一种机制,在RL训练过程中逐步增加探索噪声的时间自相关,与婴儿统计数据相匹配。在多个RL环境中的实验表明,婴儿启发的噪声产生结构化的探索行为,并且与传统的探索策略相比可以提高学习效率。这些发现表明,人类运动和认知发展可以为人工智能体的学习机制设计提供有用的指导。我们的代码可在 https://github.com/trieschlab/baby-noise-rl 获取。

English abstract

Exploration in deep reinforcement learning (RL) is commonly implemented as temporally uncorrelated white noise. However, recent works show that temporally correlated colored noise can improve exploration efficiency by producing smooth trajectories with better coverage of the state space. We inquire whether action noise inspired by infant spontaneous movements can also improve exploration in deep RL. We find that the power spectral densities of babies' end-effector velocities follow a colored noise process where the spectral exponent increases with age. Inspired by this developmental pattern, we introduce a mechanism that progressively increases the temporal auto-correlation of exploration noise during RL training, matching the infant statistics. Experiments across several RL environments show that infant-inspired noise produces structured exploratory behavior and can improve learning efficiency compared to conventional exploration strategies. These findings suggest that human motor and cognitive development can provide useful guidance for designing learning mechanisms in artificial agents. Our code is available at https://github.com/trieschlab/baby-noise-rl.
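The idea of exploration noise whose temporal autocorrelation grows over training can be sketched with a first-order autoregressive process. This is a deliberate simplification and not the authors' method: the paper describes colored noise with a spectral exponent matched to infant data, whereas the sketch swaps in AR(1) noise with a linear schedule. The names `ar1_noise`, `rho_schedule`, and `lag1_autocorr` and all numeric values are assumptions; the authors' actual code is at the repository linked in the abstract.

```python
import random


def ar1_noise(n_steps, rho, sigma=1.0, seed=0):
    """Generate AR(1) noise: rho=0 gives white noise, rho near 1 gives smooth noise."""
    rng = random.Random(seed)
    scale = (1.0 - rho * rho) ** 0.5  # keeps the stationary variance at sigma**2
    x, samples = 0.0, []
    for _ in range(n_steps):
        x = rho * x + scale * rng.gauss(0.0, sigma)
        samples.append(x)
    return samples


def rho_schedule(progress, rho_start=0.0, rho_end=0.9):
    """Linearly raise the autocorrelation coefficient as training progresses (0 to 1)."""
    p = min(max(progress, 0.0), 1.0)
    return rho_start + p * (rho_end - rho_start)


def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation, used here only to inspect the generated noise."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


early = ar1_noise(5000, rho_schedule(0.0))  # start of training: uncorrelated
late = ar1_noise(5000, rho_schedule(1.0))   # end of training: strongly correlated
```

Comparing `lag1_autocorr(early)` with `lag1_autocorr(late)` shows the shift from white to temporally correlated noise; in an RL loop the sampled value would be added to the policy's action at each step.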

2606.16533 2026-06-17 cs.AI cs.CV New submission

Kairos: A Native World Model Stack for Physical AI

Kairos: 面向物理AI的原生世界模型栈

Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu, Zeyu Liu, Cong Wan, Pu Li, Ruiqing Yang, Xiaoou Li, Wei Wang, Kangkang Zhu, Yuwei Zhang, Shi Fu, Zheng Zhang, Xiaoning Wu, Xuzeng Fan, Dacheng Tao, Xiaogang Wang

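Affiliations * Kairos Team

AI summary Proposes Kairos, a native world model stack that combines a cross-embodiment data curriculum, a hybrid linear temporal attention architecture, and deployment-aware system co-design to acquire world knowledge, maintain state over long horizons, and execute efficiently, reaching top-level performance on embodied world-model and related benchmarks.

Details
AI Chinese abstract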

世界模型正从被动视觉生成器转变为物理AI的基础性、可操作基础设施:它们必须从异构经验中原生获取世界知识,在长时间跨度内维持持久状态,并在实际部署约束下高效执行。我们引入Kairos,一个围绕这些需求设计的原生世界模型栈。(1) Kairos通过开创由跨具身数据课程指导的原生预训练范式来学习世界,该课程将开放世界视频、人类行为数据和机器人交互组织成渐进式发展路径。(2) Kairos通过配备混合线性时间注意力的原生统一架构来维持世界,该架构中滑动窗口注意力捕捉局部动态,扩张滑动窗口捕捉中程依赖,门控线性注意力维持持久全局记忆。我们建立了形式化理论界限,证明这种时间分解严格限制了误差累积,从数学上保证了跨扩展时间范围的状态传播。(3) Kairos通过整合部署感知系统协同设计来运行世界,支持在服务器和消费级硬件上为真实世界的观察-行动-反馈循环生成低延迟展开。在具身世界模型、长时程和动作策略基准上的实验表明,Kairos在实现顶级性能的同时提供了强大的效率-能力权衡。这些结果共同将Kairos定位为未来自进化物理智能的凝聚性操作基础。

English abstract

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
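The two windowed components of the hybrid attention can be pictured as boolean masks over time steps. The sketch below is a generic illustration of sliding-window and dilated sliding-window masks, assuming causal attention; it is not taken from Kairos, and the function names, window sizes, and dilation are hypothetical. The gated linear attention component is not covered here.

```python
def sliding_window_mask(n, window):
    """Causal mask: step i attends to itself and the window-1 steps before it."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def dilated_window_mask(n, window, dilation):
    """Causal mask: step i attends to i, i-dilation, i-2*dilation, ... (window entries)."""
    return [
        [
            i - j >= 0 and (i - j) % dilation == 0 and (i - j) // dilation < window
            for j in range(n)
        ]
        for i in range(n)
    ]


def attended(mask, i):
    """List the positions that step i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


local = attended(sliding_window_mask(8, window=3), 7)               # [5, 6, 7]
mid_range = attended(dilated_window_mask(8, window=3, dilation=2), 7)  # [3, 5, 7]
```

With the same window size, the dilated mask reaches twice as far back as the dense one at the same per-step cost, which is the usual motivation for pairing the two.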

2606.16449 2026-06-17 cs.CV New submission

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

PermaVid: 通过解耦上下文记忆实现编辑下的一致视频生成

Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu

Affiliations * Shanghai Jiao Tong University; Stanford University; S-Lab, Nanyang Technological University; The Chinese University of Hong Kong; Shanghai Innovation Institute

AI summary Proposes PermaVid, a framework that uses a context memory disentangled into semantic appearance and geometric structure, together with an edit-aware update strategy, to keep generated video consistent over the long term after editing operations.

Comments Project page: https://ys-imtech.github.io/projects/PermaVid/

Details
AI Chinese abstract

在编辑操作下的一致视频生成需要持久性:当编辑修改场景外观或布局时,后续生成应在时间和视角上保持连贯。然而,现有的记忆设计在修改后难以维持长期一致性,因为存储的上下文可能变得过时或无效。为了解决这个问题,我们提出了PermaVid,一种新颖的框架,基于多模态上下文记忆,将空间上下文解耦为语义外观和几何结构,并采用编辑感知的记忆更新和检索策略,使记忆演化与后续观察保持一致。具体来说,我们开发了两个互补的记忆库:一个RGB上下文记忆,捕获外观感知的观察同时隐式编码几何;一个深度上下文记忆,保留与语义解耦的纯几何结构。基于此设计,我们引入了一个记忆引导的视频生成模型,在从混合模态记忆上下文中提取的参考条件下执行多模态特征融合。实验表明,我们的方法在编辑后保持了强大的长期语义和结构一致性,显著优于现有方法。

English abstract

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

2606.16379 2026-06-17 cs.LG stat.ML New submission

Scalable and Interpretable Representation Alignment with Ordinal Similarity

可扩展且可解释的序数相似性表示对齐

Diogo Soares, Pankhil Gawade, Andrea Dittadi, Ewa Szczurek

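Affiliations * University of Maryland; Google Research

AI summary To address the limited interpretability, outlier sensitivity, and computational cost of existing representation-similarity metrics, proposes Triplet and Quadruplet Similarity Indices based on ordinal similarity, giving an alignment measure that is interpretable, robust, and efficient.

Details
AI Chinese abstract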

评估表示相似性是表示学习的基础。然而,现有度量存在显著局限性:由于基线漂移而缺乏可解释性,对异常值缺乏鲁棒性,并且对于大型数据集计算上难以处理,迫使依赖启发式近似。为了解决这些问题,我们开发了一个序数相似性框架,通过三元组相似性指数(TSI)和四元组相似性指数(QSI)实例化,通过量化序数关系的一致性来衡量对齐。我们从理论上证明,这种公式本质上是可解释的、对异常值鲁棒的,并且计算高效。最后,我们建立了TSI与通过互近邻度量的局部邻域对齐之间的形式等价性。实验上,我们验证了这些性质,并表明序数相似性提供了一种可扩展的对齐度量方法,使从业者能够更好地理解和设计表示。

English abstract

Evaluating representation similarity is fundamental to representation learning. However, existing metrics suffer from significant limitations: they lack interpretability due to shifting baselines, lack robustness to outliers, and are computationally intractable for large datasets, forcing reliance on heuristic approximations. To address this, we develop an ordinal-similarity framework, instantiated by the Triplet (TSI) and Quadruplet (QSI) Similarity Indices, which measure alignment by quantifying the consistency of ordinal relationships. We theoretically demonstrate this formulation is inherently interpretable, robust to outliers, and computationally efficient. Finally, we establish a formal equivalence between TSI and local neighborhood alignment, measured by Mutual Nearest Neighbors. Empirically, we validate these properties and show that ordinal similarity offers a scalable approach to measuring alignment, enabling practitioners to better understand and design representations.
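A rough sense of what "consistency of ordinal relationships" could mean is given by the brute-force sketch below. It is one plausible reading, not the paper's TSI definition: for every anchor and pair of other points it checks whether both representations agree on which of the two is closer to the anchor. The function name `triplet_agreement`, the use of Euclidean distance, and the exhaustive enumeration (cubic in the number of points, so not the scalable estimator the abstract refers to) are assumptions.

```python
import math


def triplet_agreement(reps_a, reps_b):
    """Fraction of (anchor, j, k) triplets on which two representations agree
    about whether j is closer to the anchor than k."""
    n = len(reps_a)
    agree = total = 0
    for i in range(n):
        for j in range(n):
            for k in range(j + 1, n):
                if i == j or i == k:
                    continue
                closer_a = math.dist(reps_a[i], reps_a[j]) < math.dist(reps_a[i], reps_a[k])
                closer_b = math.dist(reps_b[i], reps_b[j]) < math.dist(reps_b[i], reps_b[k])
                agree += closer_a == closer_b
                total += 1
    return agree / total


points = [(0.0,), (1.0,), (3.0,), (7.0,)]
rescaled = [(2.0 * x,) for (x,) in points]        # same ordering of distances
shuffled = [(0.0,), (7.0,), (3.0,), (1.0,)]       # neighbourhoods rearranged
same = triplet_agreement(points, rescaled)        # 1.0
different = triplet_agreement(points, shuffled)   # below 1.0
```

Because only the ordering of distances matters, rescaling leaves the score at 1.0, and a single extreme point can change at most the triplets it takes part in. Under this reading, two unrelated representations would be expected to score about 0.5, which gives the score a fixed reference point.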

2606.16337 2026-06-17 cs.AI cs.HC cs.LG New submission

Medical Heuristic Learning: An LLM-Driven Framework for Interpretable and Auditable Clinical Decision Rules

医学启发式学习:一个用于可解释和可审计临床决策规则的LLM驱动框架

Wei Xu, Ke Yang, Gang Luo, Keli Zheng, Lingyan Hu, Jing Wang, Kefeng Li

Affiliations * Centre for Artificial Intelligence Driven Drug Discovery, Macao Polytechnic University; Key Laboratory of Short-Range Radio Equipment Testing and Evaluation, Ministry of Industry and Information Technology Terahertz Science Application Center (TSAC), Beijing Institute of Technology; Department of Critical Care Medicine, Yantai Yuhuangding Hospital, Qingdao University; Faculty of Education, The University of Hong Kong; College of Information Engineering, Dalian University

AI summary Proposes Medical Heuristic Learning (MHL), which uses an LLM-driven workflow to optimize a deterministic, executable decision system and produces interpretable, auditable Python decision rules. On medical datasets it performs comparably to state-of-the-art methods and handles small-sample and highly imbalanced settings.

Details
AI Chinese abstract

临床表格数据的预测建模是临床决策支持的核心,因此不仅需要强大的预测性能,还需要透明的决策逻辑。尽管深度学习和基于树的集成方法可以实现高精度,但其黑箱性质仍然是临床部署的主要障碍。这一挑战因医疗数据的常见特征而进一步加剧,包括有限的样本量、严重的类别不平衡以及因诊断标准和临床文档变化引起的特征演化。为了解决这些问题,我们提出了医学启发式学习(MHL),这是临床表格预测中超越梯度学习范式的一个实例。MHL不依赖神经网络权重更新,而是使用大型语言模型(LLM)驱动的工作流,整合统计探测、医学知识探测、规则合成和代码级迭代优化,以优化一个确定性的可执行决策系统。最终模型不是以不透明的参数表示,而是作为版本化的纯Python决策规则,这些规则明确可解释、完全可审计且具有临床基础。MHL还支持持续学习,从先前验证的规则开始,并在数据漂移或特征演化下使用更新的特征信息迭代修订规则。在医学数据集上的全面实验表明,MHL在保持与小样本和高度不平衡设置下强健行为的同时,实现了与最先进方法相当的性能。结果进一步表明,这种显式规则更新机制有助于缓解特征演化下的灾难性遗忘。总体而言,这些发现表明,非基于梯度的启发式系统为高风险临床决策支持提供了一种透明且可适应的替代方案。

English abstract

Predictive modeling for clinical tabular data is central to clinical decision support and therefore requires not only strong predictive performance but also transparent decision logic. Although deep learning and tree-based ensemble methods can achieve high accuracy, their black-box nature remains a major obstacle to clinical deployment. This challenge is further compounded by common characteristics of medical data, including limited sample sizes, severe class imbalance, and feature evolution arising from changes in diagnostic criteria and clinical documentation. To address these issues, we propose Medical Heuristic Learning (MHL), an instantiation of the learning-beyond-gradients paradigm for clinical tabular prediction. Instead of relying on neural network weight updates, MHL uses a large language model (LLM)-driven workflow that integrates statistical probes, medical knowledge probes, rule synthesis, and code-level iterative refinement to optimize a deterministic and executable decision system. The resulting model is expressed not as opaque parameters, but as versioned pure-Python decision rules that are explicitly interpretable, fully auditable, and clinically grounded. MHL also supports continual learning by starting from previously validated rules and iteratively revising them using updated feature information under data drift or feature evolution. Comprehensive experiments on medical datasets show that MHL achieves performance comparable to state-of-the-art methods while maintaining strong behavior in small-sample and highly imbalanced settings. The results further indicate that this explicit rule update mechanism can help alleviate catastrophic forgetting under feature evolution. Overall, these findings suggest that non-gradient-based heuristic systems offer a transparent and adaptable alternative for high-stakes clinical decision support.
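To make "versioned pure-Python decision rules" concrete, the sketch below shows what such an artifact might look like. It is not output from MHL: the feature names (`marker_a`, `marker_b`, `age_years`), the thresholds, the version string, and the two-rule voting scheme are placeholders invented for illustration and carry no clinical meaning.

```python
RULE_VERSION = "0.1.0"  # would be bumped each time a rule is revised

# Each rule: (rule id, human-readable condition, predicate over a patient record).
# Feature names and thresholds are placeholders, not clinical guidance.
RULES = [
    ("R1", "marker_a >= 4.0", lambda p: p.get("marker_a", 0.0) >= 4.0),
    ("R2", "age_years >= 65", lambda p: p.get("age_years", 0) >= 65),
    ("R3", "marker_b < 90", lambda p: p.get("marker_b", 100.0) < 90),
]


def predict(patient, min_fired=2):
    """Return (label, audit trail); the label is 1 when at least min_fired rules fire.

    Missing features fall back to defaults that do not fire the rule.
    """
    fired = [f"{rule_id}: {text}" for rule_id, text, test in RULES if test(patient)]
    label = 1 if len(fired) >= min_fired else 0
    return label, fired


label, trail = predict({"marker_a": 5.0, "age_years": 70, "marker_b": 95})
# label is 1 and trail lists rules R1 and R2 as the reasons
```

Because the prediction is returned together with the rules that fired, each decision can be audited line by line, and a rule revision shows up as an ordinary code diff against the previous version.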

2606.16203 2026-06-17 cs.CV New submission

DynFS-MoE: Dynamic Functional-Structural Mixture-of-Experts for Post-Traumatic Epilepsy Diagnosis

DynFS-MoE: 用于创伤后癫痫诊断的动态功能-结构混合专家模型

Jun-En Ding, Spencer Chen, Henry Noren, Daniel Valdivia, Christine Yohn, Suhina Patel, Taylor Zink, Hai Sun, Feng Liu

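Affiliations * Department of Systems Engineering, Stevens Institute of Technology; Department of Neurosurgery, Robert Wood Johnson Medical School, Rutgers University

AI summary Proposes a dynamic multimodal Mixture-of-Experts framework that fuses functional and structural MRI through time-aware functional-structural encoding and class-conditioned expert routing. It outperforms static fusion baselines on three binary classification tasks and reveals meaningful ROI interactions.

Details
AI Chinese abstract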

创伤后癫痫(PTE)是创伤性脑损伤(TBI)的严重并发症,但由于其在大脑中诱导的复杂结构和功能改变,早期识别仍然具有挑战性。为了解决这个问题,我们提出了一个动态多模态混合专家(MoE)框架,通过时间感知功能-结构编码和类别条件专家路由,整合功能性和结构性MRI。在该框架内,模态特定和跨模态专家学习互补表示,而模态-类别MoE(MCoE)模块根据每个分类目标动态分配专家权重。跨三个二分类任务的实验结果表明,该框架始终优于静态融合基线,高可解释性分析进一步揭示了有意义的感兴趣区域(ROI)交互。这种动态多模态专家框架有效捕获了类别依赖的脑交互模式,并为PTE诊断和风险分层提供了一种可解释的方法。

English abstract

Post-traumatic epilepsy (PTE) is a severe complication of traumatic brain injury (TBI), yet early identification remains challenging due to the complex structural and functional alterations it induces in the brain. To address this, we propose a dynamic multimodal Mixture-of-Experts (MoE) framework that integrates functional and structural MRI through time-aware functional-structural encoding and class-conditioned expert routing. Within this framework, modality-specific and cross-modal experts learn complementary representations, while a Modality-Class MoE (MCoE) module dynamically dispatches expert weights according to each classification objective. Experimental results across three binary classification tasks demonstrate that the framework consistently outperforms static fusion baselines, and high-interpretability analyses further reveal meaningful region-of-interest (ROI) interactions. This dynamic multimodal expert framework effectively captures class-dependent brain interaction patterns and provides an interpretable approach for PTE diagnosis and risk stratification.
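Class-conditioned expert routing can be illustrated with a softmax gate whose logits depend on the classification task. The sketch below is a minimal stand-in, not the MCoE module: the task names, the three experts, their output vectors, and the gate logits are hypothetical, and a learned gate would normally also depend on the input rather than on the task alone.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(expert_outputs, gate_logits_by_task, task):
    """Fuse expert output vectors with task-specific softmax weights."""
    weights = softmax(gate_logits_by_task[task])
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]
    return fused, weights


# Hypothetical experts: functional-only, structural-only, cross-modal.
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_logits_by_task = {
    "task_a": [0.0, 0.0, 0.0],  # equal trust in all experts
    "task_b": [2.0, 0.0, 0.0],  # leans on the functional expert
}
fused_a, weights_a = route(expert_outputs, gate_logits_by_task, "task_a")
fused_b, weights_b = route(expert_outputs, gate_logits_by_task, "task_b")
```

Inspecting the per-task weights is one simple way such a model could expose which modality a given classification objective relies on.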

2606.16070 2026-06-17 cs.AI New submission

Mind-Studio: Executable World Models with Lookahead Evaluation for Partially Observable Games

Mind-Studio: 针对部分可观测游戏的可执行世界模型与前向评估

Yifei Dong, Mingen Zheng, Linquan Wu, Jeff Z. Pan, Jiaxin Bai

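Affiliations * Hong Kong University of Science and Technology; City University of Hong Kong; University of Edinburgh; Hong Kong Baptist University

AI summary Proposes Mind-Studio, a framework that uses large language models to synthesize executable pygame-style world models from trajectories and evaluates them with a K-step lookahead fidelity protocol, substantially improving prediction accuracy and subgoal verification on games such as Montezuma's Revenge.

Comments 12 pages, 2 figures

Details
AI Chinese abstract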

世界模型合成旨在将交互经验转化为环境动态的内部模型。现有的符号方法通常拟合观测到的转移或局部规则的混合,但它们不会产生一个可以独立于真实环境运行的完整可执行程序。我们提出了Mind-Studio,一个利用大语言模型从状态-动作-下一状态轨迹合成可执行的pygame风格世界模型的框架。Mind-Studio将熵选择轨迹与一个轻量级游戏技能文件相结合,该文件包含从截图中提取的对象、动作和静态场景信息。我们使用K步前向保真度协议评估合成质量,该协议将生成的世界模型 rollout 与来自相同状态的Real-ALE rollout进行比较。在Montezuma's Revenge上,Mind-Studio将选定动作的下一状态预测从PoE-World的0.3%提高到48.7%,同时验证了8个子目标中的5个;在Alien、Assault和Skiing上,它实现了比先前学习的前向源更强的分支级保真度。

English abstract

World-model synthesis aims to turn interaction experience into an internal model of environment dynamics. Existing symbolic approaches often fit observed transitions or mixtures of local rules, but they do not produce a complete executable program that can run independently of the real environment. We present Mind-Studio, a framework that synthesizes executable pygame-style world models from state-action-next-state trajectories using large language models. Mind-Studio combines entropy-selected traces with a lightweight game skill file containing object, action, and static scene information extracted from screenshots. We evaluate synthesis quality with a K-step lookahead fidelity protocol that compares generated world-model rollouts against Real-ALE rollouts from the same state. On Montezuma's Revenge, Mind-Studio improves chosen-action next-state prediction from 0.3% for PoE-World to 48.7% while verifying 5 of 8 subgoals; across Alien, Assault, and Skiing, it achieves stronger branch-level fidelity than prior learned lookahead sources.
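A K-step lookahead comparison between a synthesized model and a reference environment might be scored as below. This is a schematic sketch rather than the paper's protocol: the exact-match criterion, the function `k_step_fidelity`, and the toy one-dimensional environment are assumptions, with a simple clamped counter standing in for the real ALE emulator.

```python
def k_step_fidelity(model_step, real_step, start_state, actions):
    """Roll both step functions forward from the same state and return the
    fraction of the K steps at which their states match exactly."""
    model_state = real_state = start_state
    matches = 0
    for action in actions:
        model_state = model_step(model_state, action)
        real_state = real_step(real_state, action)
        matches += model_state == real_state
    return matches / len(actions)


def real_step(position, move):
    """Reference dynamics: a position clamped to the range 0..5 (a wall at 5)."""
    return min(max(position + move, 0), 5)


def model_step(position, move):
    """Synthesized dynamics that forgot the wall."""
    return position + move


fidelity = k_step_fidelity(model_step, real_step, start_state=4, actions=[1, 1, -1])
# 1/3: the model matches on the first step, then drifts after passing the wall
```

The example also shows why a multi-step rollout is more demanding than one-step prediction: a single missed rule keeps the two trajectories apart for every later step.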

2606.16009 2026-06-17 cs.CL cs.HC New submission

Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design

弥合可用性差距:口译研究对机器口译设计的启示

Claudio Fantinuoli

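Affiliations * University of Mainz

AI summary Defines machine interpreting as a subfield of speech translation, identifies its "accuracy illusion", and draws on interpreting studies to propose three design priorities for closing the usability gap: agency, grounding, and experience.

Details
AI Chinese abstract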

机器口译(MI)作为语音翻译的实时分支,在标准基准测试中取得了显著进展,一些系统在文本保真度上接近人类水平。然而,用户体验仍远不如口译员中介的交流,揭示了所谓的“准确性幻觉”:系统在纸面上表现准确,但在实践中无法支持流畅、目标导向的互动。本文将MI定义为语音翻译的一个独特子领域,具有自身特点,并需要基于交际有效性而非孤立保真度指标的评估方法。借鉴口译研究的见解,我们识别了当前系统忽视的专业口译实践的关键维度,并将其整合为未来MI的三个相互依赖的设计优先方向:能动性(上下文敏感的主动性和修复)、共同基础(多模态和话语级情境意识)以及体验(通过真实互动进行自适应改进)。这些优先方向共同为弥合可用性差距、实现能够实时维持真实多语言交流的系统指明了道路。

English abstract

Machine interpreting (MI), the live, real-time application of speech translation, has achieved remarkable progress on standard benchmarks, with some systems approaching human parity on textual fidelity. Yet the user experience remains far inferior to interpreter-mediated communication, revealing what we term the accuracy illusion: systems that appear accurate on paper but fail in practice to support smooth, goal-oriented interaction. This paper defines MI as a distinct subfield of speech translation, with its own characteristics and the need for evaluation methods grounded in communicative effectiveness rather than isolated fidelity metrics. Drawing on insights from interpreting studies, we identify critical dimensions of professional interpreting practice that are overlooked by current systems, and consolidate them into three interdependent design priorities for future MI: agency (context-sensitive initiative and repair), grounding (multimodal and discourse-level situational awareness), and experience (adaptive improvement through real interaction). Together, these priorities chart a path toward closing the usability gap and enabling systems that can sustain authentic multilingual communication in real time.

2606.15937 2026-06-17 cs.CV New submission

GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain

GOOSE-M2F:适配Mask2Former用于非结构化户外地形的高保真、长尾细粒度语义分割

Jyothiraditya Lingam, Nikhileswara Rao Sulake, Sai Manikanta Eswar Machara

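Affiliations * Rajiv Gandhi University of Knowledge Technologies, Nuzvid, India

AI summary Targeting long-tailed, fine-grained semantic segmentation in unstructured outdoor terrain, proposes GOOSE-M2F, which adds 200 object queries, a Feature Refinement Module, and an Auxiliary Supervision Head to Mask2Former and trains with a multi-stage strategy, reaching 70.08% composite mIoU on the GOOSE benchmark.

Comments This solution placed 3rd in the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026

Details
AI Chinese abstract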

我们提出GOOSE-M2F,这是Mask2Former针对GOOSE 2D细粒度语义分割(FGSS)挑战(ICRA 2026)的任务特定适配。GOOSE基准涵盖非结构化户外地形中的64个细粒度类别,具有严重的长尾分布,其中稀有类别每张图像占据少于50个像素。我们扩展了Swin-Large Mask2Former基线,并贡献了三个针对性改进:(1)200个对象查询以消除表示饱和;(2)结合ASPP-lite和CBAM双注意力的特征精炼模块(FRM);(3)为稀有类别提供直接逐像素梯度的辅助监督头。多阶段训练策略结合了分布平衡损失、稀有类别复制粘贴增强、动态IoU感知重加权和EMA。在推理时,采用密集滑动窗口引擎,结合2D高斯核融合和4尺度TTA,带来+10.57%的提升。GOOSE-M2F达到70.08%官方复合mIoU(细粒度63.55%,粗粒度76.61%),在GOOSE 2D FGSS排行榜上位列第三。代码和训练好的模型已公开:GitHub:https://github.com/Aditya-Lingam-9000/GOOSE-M2F ,Hugging Face:https://huggingface.co/XYZ9843/GOOSE-M2F 。

English abstract

We present GOOSE-M2F, a task-specific adaptation of Mask2Former for the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026. The GOOSE benchmark spans 64 fine-grained classes across unstructured outdoor terrain with a severely long-tailed distribution, where rare classes occupy fewer than 50 pixels per image. We extend the Swin-Large Mask2Former baseline with three targeted contributions: (1) 200 object queries to eliminate representational saturation; (2) a Feature Refinement Module (FRM) combining ASPP-lite and CBAM dual-attention; and (3) an Auxiliary Supervision Head that delivers direct per-pixel gradients for rare classes. A multi-stage training strategy pairs Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and EMA. At inference, a dense sliding-window engine with 2D Gaussian kernel blending and 4-scale TTA adds +10.57%. GOOSE-M2F achieves 70.08% Official Composite mIoU (63.55% fine, 76.61% coarse), placing 3rd on the GOOSE 2D FGSS leaderboard. Code and trained models are publicly available at GitHub: https://github.com/Aditya-Lingam-9000/GOOSE-M2F and Hugging Face: https://huggingface.co/XYZ9843/GOOSE-M2F.
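Sliding-window inference with Gaussian blending can be sketched in plain Python. This is a generic illustration of the technique, not the GOOSE-M2F inference engine: it handles a single score map instead of per-class logits, omits test-time augmentation, and assumes that the stride tiles the image exactly. The names `gaussian_kernel` and `blend_tiles`, the choice `sigma = tile / 4`, and the `predict` callback are assumptions.

```python
import math


def gaussian_kernel(size, sigma):
    """2D Gaussian weights that peak at the tile centre and fall off toward the edges."""
    centre = (size - 1) / 2
    return [
        [math.exp(-((y - centre) ** 2 + (x - centre) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]


def blend_tiles(height, width, tile, stride, predict):
    """Average overlapping tile predictions, weighting each pixel by the kernel.

    predict(top, left) must return a tile x tile grid of scores.
    Assumes (height - tile) and (width - tile) are multiples of stride,
    so every pixel is covered by at least one tile.
    """
    kernel = gaussian_kernel(tile, sigma=tile / 4)
    acc = [[0.0] * width for _ in range(height)]
    norm = [[0.0] * width for _ in range(height)]
    for top in range(0, height - tile + 1, stride):
        for left in range(0, width - tile + 1, stride):
            scores = predict(top, left)
            for dy in range(tile):
                for dx in range(tile):
                    weight = kernel[dy][dx]
                    acc[top + dy][left + dx] += weight * scores[dy][dx]
                    norm[top + dy][left + dx] += weight
    return [[a / n for a, n in zip(acc_row, norm_row)]
            for acc_row, norm_row in zip(acc, norm)]


# Stand-in predictor that returns a constant score for every tile.
blended = blend_tiles(6, 6, tile=4, stride=2,
                      predict=lambda top, left: [[0.7] * 4 for _ in range(4)])
```

Down-weighting tile borders means that a pixel seen near the edge of one tile and near the centre of another is decided mostly by the tile that had more surrounding context.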

2606.15932 2026-06-17 cs.CL New submission

Beyond NL2Code: A Structured Survey of Multimodal Code Intelligence

超越NL2Code:多模态代码智能的结构化综述

Xuanle Zhao, Qiushi Sun, Jingyu Xiao, Xuexin Liu, Haoyue Yang, Qiaosheng Chen, Xianzhen Luo, Jing Huang, Yufeng Zhong, Lei Chen, Shuai Fu, Zhenlin Wei, Jinhe Bi, Lei Jiang, Haibo Qiu, Siqi Yang, Peng Shi, Jian Hu, Zhixiong Zeng

Affiliations * Meituan; The University of Hong Kong; The Chinese University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; Nanjing University; Harbin Institute of Technology; Australian Institute for Machine Learning, Adelaide University; Ludwig Maximilian University of Munich; University of Science and Technology of China; Queen Mary University of London

AI summary A systematic survey of multimodal code intelligence that classifies tasks by the role code plays, covers GUI, scientific visualization, structured graphics, and frontier tasks, and proposes four verification-centered future directions.

Comments Work completed in January 2026. Updating now

Details
AI Chinese abstract

虽然LLMs已经显著推进了文本到代码的合成,但许多实际编程任务通过视觉工件(如截图、图表、文档、矢量图、视频和交互状态)来指定意图。这些任务要求模型将视觉感知连接到可执行程序,因为正确性不仅取决于语法,还取决于布局、几何、数据语义、可编辑性、交互行为以及执行后适用的领域特定约束。本综述考察多模态代码智能,涵盖在视觉输入和输出下生成、编辑、优化、执行或推理代码的系统。我们首先根据代码在每个任务中扮演的角色来定义该领域,将代码区分为渲染工件、可编辑符号结构、科学表示、中间推理轨迹或可执行策略/工具接口。然后,我们将基准和方法组织成四个领域:图形用户界面、科学可视化、结构化图形以及前沿任务和框架。这种分类法将成熟的工件生成问题与新兴的智能体和统一设置联系起来,并使我们能够比较不同任务如何处理正确性证据。展望未来,我们认为未来研究可能受益于四个以验证为中心的方向:多信号验证可以结合互补的正确性证据,多状态验证可以测试跨执行轨迹的行为,跨任务迁移测试可以探究可重用的视觉代码技能,以及可验证的智能体轨迹可以揭示智能体行为是否基于视觉证据。这些方向共同可能将多模态代码生成从单输出模仿转向基于证据的可执行系统。

English abstract

While Large Language Models (LLMs) have substantially advanced text-to-code synthesis, many real programming tasks specify intent through visual artifacts such as screenshots, charts, vector drawings, videos, and interactive states. These tasks require models to connect visual perception to executable programs, because correctness depends not only on syntax but also on layout, data semantics, interaction behavior, and domain-specific constraints that apply after execution. This survey examines Multimodal Code Intelligence, covering systems that generate, edit, refine, or reason with code under visually grounded inputs and outputs. We first formulate the field by the role that code plays in each task, distinguishing code as a rendered artifact, an editable symbolic structure, a scientific representation, an intermediate reasoning trace, or an executable policy or tool interface. We then organize benchmarks and methods into four domains: Graphical User Interface, Scientific Visualization, Structured Graphics, and Frontier Tasks and Frameworks. This taxonomy connects mature artifact-generation problems to emerging agentic and unified settings and allows us to compare how different tasks treat evidence of correctness. Looking ahead, we argue that future research may benefit from four verification-centered directions. Multi-signal validation can combine complementary evidence of correctness, multi-state verification can test behavior across execution trajectories, cross-task transfer testing can probe reusable visual-code skills, and verifiable agent traces can reveal whether agent actions are grounded in visual evidence. Together, these directions may move this field from single-output imitation toward evidence-grounded executable systems. An ongoing project and resources are available on GitHub: https://github.com/xjywhu/Awesome-Multimodal-LLM-for-Code.

2606.15903 2026-06-17 cs.CL cs.AI New submission

Control-Plane Placement Shapes Forgetting: An Architectural Study of Agent Memory Across Thirteen System Configurations

控制平面放置塑造遗忘:跨十三种系统配置的智能体记忆架构研究

Dongxu Yang

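Affiliations * DeepLethe

AI summary Studies how the placement of an LLM in an agent memory pipeline (control plane vs. recall plane) affects forgetting failure modes. Experiments with 13 configurations on a 385-case adversarial test set reveal the complementary coverage of three placement regimes, and the ForgetEval evaluation suite is introduced.

Comments 25 pages including appendices. Code, benchmark, and adapters released under MIT at https://github.com/deeplethe/lethe

Details
AI Chinese abstract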

LLM在智能体记忆管道中的位置——位于检索存储事实(广泛基准测试)的召回平面和通过替换、释放、清除来改变事实(基本未经测试)的控制平面之间——决定了系统能够恢复哪些遗忘失败模式。通过在385例对抗测试集上比较十三种系统配置,我们观察到三种具有部分互补覆盖范围的放置机制:确定性原语足以处理词汇/时间类别,但无法处理规范化(标识符混淆上5%,跨语言上0%);写入时LLM可以恢复规范化(100%),但无法处理意图感知删除(前缀冲突和复合事实为0%);变异时钩子可以恢复意图感知删除(78-85%),并同时提升几乎所有类别的性能(整体91.7-93.2%,每385例运行成本0.17美元,每例变异延迟2.3秒,而确定性方法为64-191毫秒,召回路径不变)。我们通过ForgetEval揭示了这种权衡,ForgetEval包含1000例模板化套件和385例对抗层(132例手工制作+253例LLM生成并经预言机验证),通过确定性子串匹配评分,并配有一个六方法适配器协议,采用诚实的N/A评分,允许异构记忆存储以130行代码接入。该协议通过10名标注者的IAA(Fleiss' kappa = 0.958)和77例外部作者子集(四位盲贡献者)得到验证,该子集复现了规范化不对称性并放大了联合放置的提升(+27.8个百分点)。生产环境中的失败主要是遗忘失败而非召回失败,但现有基准仅衡量召回。ForgetEval和所有适配器均以MIT许可发布。

English abstract

Where an LLM sits in an agent memory pipeline -- between the recall plane that retrieves stored facts (extensively benchmarked) and the control plane that mutates them via supersede, release, purge (largely untested) -- shapes which forgetting failure modes the system recovers. Comparing thirteen system configurations on a 385-case adversarial surface, we observe three placement regimes with partly complementary coverage: deterministic primitives suffice for lexical/temporal categories but fail canonicalization (5% on identifier-obfuscation, 0% on cross-lingual); inscribe-time LLM recovers canonicalization (100%) but cannot help intent-aware deletion (0% on prefix-collision and compound-fact); a mutation-time hook recovers intent-aware deletion (78-85%) and brightens nearly all categories simultaneously (91.7-93.2% overall, $0.17 per 385-case run, 2.3s/case mutation latency vs. 64-191ms/case deterministic, recall path unchanged). We expose the trade-off via ForgetEval, a 1000-case templated suite plus a 385-case adversarial layer (132 hand-crafted + 253 LLM-drafted oracle-validated) scored by deterministic substring match, paired with a six-method Adapter Protocol with honest N/A scoring that lets heterogeneous memory stores enter in 130 lines. Admission is corroborated by 10-annotator IAA (Fleiss' kappa = 0.958) and a 77-case external-authored subset (four blind contributors) that replicates the canonicalization asymmetry and amplifies the joint-placement lift (+27.8 pt). Production failures are predominantly forgetting failures rather than recall failures, yet existing benchmarks measure only recall. ForgetEval and all adapters are released under MIT.
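Scoring by deterministic substring match with N/A handling could look roughly like the sketch below. It is an illustration of the idea only and does not reproduce ForgetEval or its Adapter Protocol: the functions `score_case` and `pass_rate`, the case-insensitive comparison, and the convention that an unsupported operation yields `None` and is excluded from the denominator are all assumptions.

```python
def score_case(memory_dump, must_forget, must_keep):
    """Score one forgetting case by substring match.

    memory_dump -- text the store returns after the mutation, or None when the
                   adapter does not support the operation (scored as N/A)
    must_forget -- strings that must no longer appear
    must_keep   -- strings that must still appear
    """
    if memory_dump is None:
        return None  # N/A: neither a pass nor a fail
    text = memory_dump.lower()
    forgotten = all(s.lower() not in text for s in must_forget)
    retained = all(s.lower() in text for s in must_keep)
    return forgotten and retained


def pass_rate(results):
    """Pass rate over scored cases only; N/A results are left out."""
    scored = [r for r in results if r is not None]
    return sum(scored) / len(scored) if scored else None


results = [
    score_case("user lives in Berlin", must_forget=["Paris"], must_keep=["Berlin"]),
    score_case("user lives in Paris", must_forget=["Paris"], must_keep=[]),
    score_case(None, must_forget=["Paris"], must_keep=[]),
]
rate = pass_rate(results)  # 0.5: one pass, one fail, one N/A
```

A plain substring check is blind to paraphrases and obfuscated identifiers, which is presumably why the canonicalization categories in the abstract are hard for purely deterministic pipelines.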

2606.15883 2026-06-17 cs.CL cs.AI New submission

Koshur Diacritizer: A Byte-Level Sequence-to-Sequence Model for Kashmiri Diacritic Restoration


Haq Nawaz Malik, Nahfid Nissar, Faizan Iqbal

Affiliations * arXiv

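AI summary To address the ambiguity caused by missing diacritics in Kashmiri digital text, the authors propose Koshur Diacritizer, a byte-level sequence-to-sequence model based on ByT5-small that combines script-aware normalization, alignment validation, and skeleton-preserving inference. It achieves a DERm of 0.2012 and a WER of 0.2159 on the test set, with 77.5% accuracy in expert evaluation.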

Details
AI abstract (translated from Chinese)

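Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, frequently omits diacritics in digital text, creating ambiguity and challenging downstream NLP applications. We present Koshur Diacritizer, a byte-level sequence-to-sequence model based on ByT5-small for restoring diacritics in Kashmiri text. To support this task, we release a publicly available dataset of 23.7k aligned undiacritized/diacritized Kashmiri sentence pairs. The proposed framework combines script-aware normalization, alignment validation, and skeleton-preserving inference to ensure reliable restoration while preserving the original base-letter sequence. Experiments on a held-out test set show a DERm of 0.2012 and a WER of 0.2159. In addition, evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The dataset, model, and source code are publicly released, providing a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.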

English abstract

Kashmiri, an Indo-Aryan language written in a modified Perso-Arabic script, frequently omits diacritic marks in digital text, creating ambiguity and challenging downstream NLP applications. We present Koshur Diacritizer, a ByT5-small byte-level sequence-to-sequence model for restoring diacritics in Kashmiri text. To support this task, we release a publicly available dataset of 23.7k aligned undiacritized/diacritized Kashmiri sentence pairs. The proposed framework combines script-aware normalization, alignment validation, and skeleton-preserving inference to ensure reliable restoration while maintaining the original base-letter sequence. On a held-out test set, the model achieves a DERm of 0.2012 and a WER of 0.2159. Additionally, evaluation by a native Kashmiri linguistic expert yields a mean accuracy of 77.5%. The dataset, model, and source code are publicly released to provide a reproducible baseline for Kashmiri diacritic restoration and future low-resource language research.
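
For readers unfamiliar with the reported metric: WER (word error rate) is conventionally the word-level edit distance between a system output and its reference, divided by the number of reference words. In diacritization work it is often computed as the fraction of words with at least one wrong diacritic, which gives the same value when output and reference have the same number of words. The sketch below illustrates the general edit-distance definition only. It is not the authors' evaluation script, the function name `word_error_rate` and the placeholder tokens are assumptions made for illustration, and DERm is not reproduced because the abstract does not define it.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical placeholder tokens: one wrong word out of four reference words.
score = word_error_rate("w1 w2 w3 w4", "w1 wX w3 w4")
```

Under this definition, a WER of 0.2159 corresponds to roughly one word in five differing from the reference.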

2606.15735 2026-06-17 cs.CL cs.AI New submission

EHRNote-ChatQA: A Benchmark for Evidence-Grounded Multi-Turn Clinical Question Answering over Longitudinal Discharge Summaries


Jiyoun Kim, Muhan Yeo, Eunhye Jang, Jeewon Yang, Hangyul Yoon, Su Ji Lee, Hee Jo Han, Hee-Jae Jung, Doyun Kwon, Jun young Lee, Jaehun Lee, Jung-Oh Lee, Sunjun Kweon, Jong Hak Moon, Daseul Kim, Minjae Cho, Edward Choi

Affiliations * KAIST; Seoul National University; Seoul National University Bundang Hospital; SAIHST, Sungkyunkwan University; Yonsei University College of Medicine; Gangnam Severance Hospital; Severance Hospital; Seoul Medical Center; Seoul National University Hospital; National Cancer Center; Icahn School of Medicine at Mount Sinai; Samsung Medical Center

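AI summary Proposes the EHRNote-ChatQA benchmark, built from MIMIC-IV discharge summaries, with 967 multi-turn samples and 16,072 expert-verified QA pairs, to evaluate LLMs on evidence-grounded multi-turn clinical question answering. It finds that models struggle with evidence grounding and with errors compounding across turns.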

Details
AI abstract (translated from Chinese)

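Discharge summaries are key clinical documents that capture the context of a patient's entire hospital stay, and medical experts routinely review them for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly applied to clinical question answering, existing benchmarks do not adequately reflect this setting: they typically evaluate exam-style medical knowledge or focus on single-turn question answering with limited evaluation of evidence grounding. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over a patient's multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, the benchmark contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) covering eight clinical categories. The benchmark is constructed through an expert-guided pipeline that combines a discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges: LLMs struggle more with evidence grounding than with content answering, multi-turn errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.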

English abstract

Discharge summaries are crucial clinical documents containing the context of a patient's overall hospital stay, and are routinely reviewed by medical experts for patient readmission, ongoing care, and diagnostic decision-making. When reviewing them, medical experts often must iteratively synthesize information across multiple summaries while verifying the evidence supporting each answer. Although large language models (LLMs) are increasingly explored for clinical question answering, existing benchmarks do not sufficiently reflect this setting: they often evaluate exam-style medical knowledge or focus on single-turn question answering with limited evidence-grounding evaluation. We introduce EHRNote-ChatQA, the first benchmark for evidence-grounded multi-turn clinical question answering over patients' multiple discharge summaries. Built from de-identified MIMIC-IV discharge summaries, EHRNote-ChatQA contains 967 patient-level multi-turn samples spanning one to five notes and 16,072 medical-expert-verified QA pairs (8,036 content questions, each paired with an evidence-grounding question) across eight clinical categories. The benchmark is constructed through an expert-informed pipeline combining a discharge-summary structuring schema, expert-curated multi-turn QA templates, and LLM-based generation, followed by review and revision of every QA sample by 11 medical experts. Benchmarking 22 open- and closed-source LLMs reveals several challenges: LLMs struggle more with evidence grounding than with content answering, multi-turn errors compound across turns, and single-turn clinical QA performance does not reliably transfer to this setting. These findings establish EHRNote-ChatQA as a rigorous and practical benchmark for evaluating clinical QA systems. The dataset will be made publicly available through PhysioNet credentialed access.