arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.20936 2026-05-21 cs.LG cs.AI cs.CL

DASH: Fast Differentiable Architecture Search for Hybrid Attention in Minutes on a Single GPU

Weizhe Chen, Miao Zhang, Junpeng Jiang, Yaping Li, Weili Guan, Liqiang Nie

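Affiliations: Harbin Institute of Technology (Shenzhen)

AI summary: This work proposes DASH, a fast differentiable architecture search framework for hybrid attention architecture design. DASH relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and runs an architecture-only search with model and operator weights frozen, which substantially improves search efficiency. On Qwen2.5-3B-Instruct, DASH outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. DASH also achieves stronger RULER performance than the released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, which is 0.006% of the PostNAS search tokens reported by Jet-Nemotron.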

Comments 19 pages, 7 figures

Abstract

Hybrid attention architectures are becoming an increasingly important paradigm for improving LLM inference efficiency while preserving model quality, making hybrid architecture design a central problem. Existing designs often rely on manual empirical rules or proxy-based selector signals for layer-wise operator allocation. Recent NAS-style systems such as Jet-Nemotron demonstrate the promise of automated hybrid architecture search. However, Jet-Nemotron's PostNAS search stages alone use 200B tokens, making such search pipelines difficult to use as routine methods for hybrid architecture design. We introduce DASH, a fast differentiable search framework for hybrid attention architecture design, which relaxes discrete layer-wise attention operator placement into continuous architecture logits, prepares reusable teacher-aligned linear candidates, and performs architecture-only search with model and operator weights frozen to significantly enhance search efficiency. On Qwen2.5-3B-Instruct, DASH consistently outperforms a comprehensive suite of existing selector-style hybrid attention design baselines, showing that direct differentiable search can discover stronger hybrid architectures. Moreover, DASH achieves stronger RULER performance than released Jet-Nemotron models while remaining competitive on overlapping short-context and general benchmarks. Notably, each DASH search run uses only 12.3M tokens and takes about 20 minutes on a single RTX Pro 6000 GPU, corresponding to merely 0.006% of the PostNAS search tokens reported by Jet-Nemotron. These results suggest that high-quality hybrid attention architectures can be obtained through minutes-level differentiable search, providing a promising direction for hybrid architecture design.
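The relaxation described in the abstract can be illustrated with a generic softmax mixture over candidate operators. This is a minimal sketch of that general idea, not DASH's actual formulation: the function names and the operator labels `"full"` and `"linear"` are hypothetical, and each candidate's output is treated as a fixed vector, standing in for frozen operator weights. The last line reproduces the token-budget ratio quoted in the abstract.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_layer_output(arch_logits, candidate_outputs):
    """Blend the (frozen) candidate outputs of one layer with softmax weights.

    Only arch_logits would be trained during search; the candidates stay fixed.
    """
    weights = softmax(arch_logits)
    dim = len(candidate_outputs[0])
    return [sum(w * out[i] for w, out in zip(weights, candidate_outputs))
            for i in range(dim)]

def discretize(arch_logits_per_layer, op_names):
    """After search, keep the highest-logit operator in every layer."""
    return [op_names[max(range(len(l)), key=l.__getitem__)]
            for l in arch_logits_per_layer]

# Token budget from the abstract: 12.3M search tokens vs. 200B PostNAS tokens.
token_ratio_percent = 12.3e6 / 200e9 * 100  # about 0.006 percent
```

Because the mixture is differentiable in the logits, a gradient-based optimizer can adjust them directly, and the final discrete architecture is read off with `discretize`.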

2605.20932 2026-05-21 cs.RO

WiXus: A Wheeled-Legged Robot with Wire-Driven Environmental Utilizing to Integrate Mobility and Manipulation

Shintaro Inoue, Kento Kawaharazuka, Temma Suzuki, Sota Yuzaki, Kei Okada

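Affiliations: Department of Mechano-Informatics, Graduate School of Information Science and Technology, The University of Tokyo

AI summary: This paper presents WiXus, a new wheeled-legged robot that uses a wire-driven mechanism anchored to the external environment. The mechanism lets the robot move both on the plane and in three dimensions, and frees its legs for object manipulation and tool use.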

Comments Accepted at ICRA 2026. Website: https://shin0805.github.io/wixus/, YouTube: https://youtu.be/32qhUslR0gM

Abstract

Wheeled-legged robots, which have wheels at their feet and achieve high mobility by coordinating wheel drive and leg drive, have so far been developed purely as platforms specialized for locomotion. Therefore, they do not have a means to repurpose their legs for roles other than locomotion, such as object manipulation or tool utilization. In this paper, we address the problem of how to draw out the potential task-execution capability of the legs by freeing them from the role of locomotion through external body support. To this end, we propose and develop a new robot, WiXus, which fuses a wheeled-legged mechanism with a wire-driven mechanism that utilizes the external environment. The developed WiXus demonstrates not only planar locomotion with wheeled-legged drive, but also three-dimensional mobility such as cliff climbing by coordinating wire-driven and wheeled-legged actuation. Furthermore, by suspending the body with wire-driven actuation, WiXus successfully repurposes its legs as arms to perform object manipulation (e.g., rescuing a stuffed-animal dog) and tool utilization (e.g., harvesting a mock-up apple with loppers). This study demonstrates that the approach of utilizing the environment with wire-driven actuation is a new design principle that extends the operational domain of wheeled-legged robots.

2605.20929 2026-05-21 cs.RO

STEAM: A Training-Free Congestion-Aware Enhancement Framework for Decentralized Multi-Agent Path Finding

Mingyang Feng, Mengnuo Zhang, Shaoyuan Li, Xiang Yin

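Affiliations: School of Automation and Intelligent Sensing, Shanghai Jiao Tong University

AI summary: This paper proposes STEAM, a training-free test-time enhancement framework for learning-based decentralized Multi-Agent Path Finding (MAPF) in discrete environments. It injects lightweight congestion-aware guidance into the execution of a pretrained policy, using spatial avoidance, temporal logit correction, and density-aware logit correction to improve success rate and efficiency.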

Abstract

We propose STEAM (Spatial, Temporal, and Emergent congestion Awareness for MAPF), a training-free test-time enhancement framework for learning-based decentralized Multi-Agent Path Finding (MAPF) in discrete environments. Given a pretrained decentralized policy, STEAM requires no retraining, architectural modification, or replacement by a centralized planner. Instead, it injects lightweight congestion-aware guidance into the original policy execution. STEAM first rolls out the shortest paths induced by the current cost-to-go maps to identify potential future congestion hotspots. Spatially avoidable congestion is mitigated by updating agent-specific cost-to-go information, while spatially unavoidable bottlenecks are handled through temporal logit correction. In addition, emergent local congestion is reduced by a density-aware logit correction based on neighboring agents' corrected cost-to-go maps. Extensive experiments on representative learning-based decentralized MAPF algorithms show that STEAM consistently improves success rate, makespan, and solution cost, with success-rate gains of up to 60% and only minor computational overhead. The implementation is available at https://anonymous.4open.science/r/STEAM-MAPF-7A62.
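A logit correction of the kind the abstract mentions can be pictured as subtracting a congestion penalty from a pretrained policy's action logits before sampling. The sketch below is an assumption-laden illustration, not STEAM's actual rule: the function name, the per-action `congestion` scores, and the `strength` parameter are all hypothetical.

```python
import math

def corrected_action_probs(policy_logits, congestion, strength=1.0):
    """Penalize each action's logit by its (hypothetical) congestion score,
    then renormalize with a softmax. The policy itself is left untouched."""
    adjusted = [l - strength * c for l, c in zip(policy_logits, congestion)]
    m = max(adjusted)
    exps = [math.exp(a - m) for a in adjusted]
    total = sum(exps)
    return [e / total for e in exps]

# Two equally preferred moves; the second leads into a crowded cell.
probs = corrected_action_probs([1.0, 1.0, 0.0], [0.0, 2.0, 0.0])
```

Since the correction only reshapes the output distribution, it needs no retraining and no change to the policy network, which matches the training-free setting the abstract describes.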

2605.20924 2026-05-21 cs.CL cs.AI

Strategy-Induct: Task-Level Strategy Induction for Instruction Generation

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

Affiliations: Department of Computer Science and Information Engineering, National Taiwan University, Taiwan; Institute of Information Science, Academia Sinica, Taiwan; AI Research Center (AINTU), National Taiwan University, Taiwan

AI summary: This study proposes Strategy-Induct, a task-level strategy induction method that generates task instructions from only a small set of example questions, without labeled answers, and outperforms state-of-the-art methods in question-only settings.

Comments Accepted to Findings of ACL 2026

Abstract

Designing effective task-level prompts is crucial for improving the performance of Large Language Models (LLMs). While prior work on instruction induction demonstrates that LLMs can infer better instructions with limited examples, existing approaches often rely on input-output pairs, where obtaining labeled answers can be difficult or costly. To address this limitation, we propose Strategy-Induct, a framework that derives task-level instructions solely from a small set of example questions without requiring labeled answers. Our approach first prompts the model to generate explicit reasoning strategies for each question, forming (strategy, question) pairs. These pairs are then used to induce a task instruction that guides reasoning. Experiments across multiple tasks and model scales demonstrate that Strategy-Induct outperforms state-of-the-art methods in question-only settings. Furthermore, we observe that jointly utilizing LLMs and Large Reasoning Models across task instruction generation and inference may lead to further performance improvements.
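The two-stage procedure in the abstract (generate a strategy per question, then induce one instruction from the pairs) can be sketched as follows. The prompt wording and the function name are hypothetical, and `llm` is any caller-supplied function from prompt string to completion string; no particular model API is assumed.

```python
from typing import Callable, List, Tuple

def induce_instruction(
    questions: List[str], llm: Callable[[str], str]
) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (task_instruction, [(strategy, question), ...]).

    No labeled answers are used anywhere in the pipeline.
    """
    pairs = []
    for q in questions:
        # Stage 1: ask for an explicit reasoning strategy, not an answer.
        strategy = llm(
            "Describe a reasoning strategy for this question "
            f"without answering it:\n{q}"
        )
        pairs.append((strategy, q))
    # Stage 2: induce a single task-level instruction from all pairs.
    demos = "\n\n".join(f"Question: {q}\nStrategy: {s}" for s, q in pairs)
    instruction = llm(
        f"Questions with reasoning strategies:\n\n{demos}\n\n"
        "Write one instruction for solving tasks of this kind."
    )
    return instruction, pairs
```

Passing different callables for the two stages would be one way to mix a standard LLM with a reasoning model, in the spirit of the abstract's final observation.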

2605.20922 2026-05-21 cs.LG cs.AI cs.CV

Winfree Oscillatory Neural Network

Jiawen Dai, Yue Song

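Affiliations: Shanghai Jiao Tong University; Shanghai Qi Zhi Institute; College of AI, Tsinghua University

AI summary: This paper proposes WONN, an oscillatory neural network based on generalized Winfree dynamics. It evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible, hierarchical interaction mechanisms, and achieves competitive performance and parameter efficiency on image recognition and complex reasoning tasks.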

Comments Project page: https://jiawen-dai.github.io/WONN_Project_Page/

Abstract

Oscillations and synchronization are widely believed to play a fundamental role in representation and computation. However, existing machine learning approaches based on synchronization dynamics have largely been confined to specialized settings such as object discovery, with limited evidence of scalability to standard vision benchmarks or logic reasoning tasks. We propose the Winfree Oscillatory Neural Network (WONN), a dynamical neural architecture based on generalized Winfree dynamics. WONN evolves representations on the torus $(S^1)^d$ through structured oscillatory interactions, combining phase-based inductive biases with flexible and hierarchical interaction mechanisms instantiated as either fixed trigonometric mappings or learnable neural networks. We evaluate WONN on image recognition and complex reasoning tasks, including CIFAR, ImageNet, Maze-hard, and Sudoku. Across these domains, WONN achieves competitive or superior performance with strong parameter efficiency. In particular, WONN is, to our knowledge, the first synchronization-based oscillatory architecture to scale competitively to ImageNet-1K. Furthermore, on Maze-hard, WONN achieves 80.1% accuracy using only 1% of the parameters of prior state-of-the-art models. These results suggest that structured oscillatory dynamics provide a scalable and parameter-efficient alternative to conventional neural architectures.
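For orientation, the classical Winfree model couples phase oscillators through a mean-field pulse: each phase evolves as $\dot\theta_i = \omega_i + \frac{\kappa}{N}\sum_j P(\theta_j)\,R(\theta_i)$, commonly with $P(\theta) = 1 + \cos\theta$ and $R(\theta) = -\sin\theta$. The sketch below takes one explicit Euler step of that classical model and wraps the phases back onto the circle. It is not the generalized dynamics used in WONN, whose interaction functions the abstract says may be fixed trigonometric mappings or learned networks; the function name and step size are assumptions.

```python
import math

def winfree_step(theta, omega, kappa, dt):
    """One Euler step of the classical Winfree model on the torus (S^1)^d.

    theta: current phases, omega: natural frequencies,
    kappa: coupling strength, dt: step size.
    """
    n = len(theta)
    # Mean-field influence P(theta_j) = 1 + cos(theta_j), averaged over j.
    pulse = sum(1.0 + math.cos(t) for t in theta) / n
    return [
        # Sensitivity R(theta_i) = -sin(theta_i); wrap result to [0, 2*pi).
        (t + dt * (w + kappa * pulse * (-math.sin(t)))) % (2.0 * math.pi)
        for t, w in zip(theta, omega)
    ]
```

The modulo keeps every coordinate on $S^1$, so repeated steps stay on the torus regardless of how long the dynamics run.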

2605.20920 2026-05-21 cs.CL cs.SD

Evaluating Speech Articulation Synthesis with Articulatory Phoneme Recognition

Vinicius Ribeiro, Yves Laprie

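Affiliations: Université de Lorraine, CNRS, Inria, LORIA

AI summary: This paper uses articulatory phoneme recognition as a proxy for evaluating the quality of speech articulation synthesis. It hypothesizes that phoneme recognition on articulatory features captures details of phoneme production more accurately than traditional metrics, which gives a better way to evaluate generative models.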

Comments Accepted for publication at the European Signal Processing Conference (EUSIPCO), 2026

Abstract

Recent advances in machine learning and the availability of articulatory datasets allow vocal tract synthesis to be conditioned on phonetic sequences, a primary task of articulatory speech synthesis. However, quality assessment needs a better definition. Generally, ranking generative models is tricky due to subjectivity. However, articulatory synthesis has the additional difficulty of requiring specialized knowledge in vocal tract anatomy and acoustics. To address this problem, this paper proposes to evaluate speech articulation synthesis using phoneme recognition as a proxy. Our hypothesis is that phoneme recognition using articulatory features better captures nuances in phoneme production, such as correct places of articulation, which traditional metrics (e.g., point-wise distance metrics) do not. We train a neural network with acoustic and articulatory features extracted from a single-speaker RT-MRI dataset. Then, we compare the recognition performance when testing the model with different synthetic articulatory features. Our results show that our articulatory feature set is phonetically rich and helps explore additional dimensions of speech articulation synthesis.

2605.20917 2026-05-21 cs.RO

SubTGraph: Large-Scale Subterranean Environment Synthesis with Controllable Topological Variability for Robotic Autonomy Validation

F. Labra Caso, A. Saradagi, S. Fredriksson, S. Nordström, A. Koval, G. Nikolakopoulos

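Affiliations: Robotics & AI, Luleå University of Technology

AI summary: This paper presents SubTGraph, a framework for rapidly synthesizing highly variable multi-level subterranean environments. It generates different types of environment from user-specified parameters such as topology, dimensionality, and textures, for rigorous validation of each layer of the robotic autonomy stack.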

Comments 16 pages, 18 figures

Abstract

Subterranean (SubT) environments have been a frontier for autonomous robotics, driven by the push for automation of mining operations and the interest in planetary exploration (e.g., Martian lava tubes). Due to the challenges involved in accessing real SubT environments, rigorous hardening of autonomy stacks in realistic simulation environments is critical. This article fills a well-known gap: no large-scale simulation-based benchmarking infrastructure is available for rigorous statistical evaluation of robotic autonomy, so SubT research articles commonly present validation results in a few environments at best. This article presents SubTGraph, a novel framework for rapid synthesis of multi-level SubT environments with high variability, incorporating user specifications related to topology, dimensionality, textures, etc., to generate distinct environments such as operational mines, natural caves, and lava tubes. SubTGraph builds a cost matrix from user-specified structural constraints to guide the classical Dijkstra algorithm to procedurally generate SubT worlds utilizing topometric tiles from the DARPA World Generator. Three robotics case studies are investigated to demonstrate the utility of SubTGraph for rigorous validation of different layers in the robotic autonomy stack. Structural semantic segmentation is validated against topometric ground truths, multi-agent path planning is widely tested for identification of patterns and trends in the algorithm behavior, and LIO SLAM is stress-tested in challenging subterranean sections to identify failure cases. The SubTGraph world creation codebase is open-sourced (https://github.com/LTU-RAI/SubTGraph.git) along with a database consisting of 150 highly variable underground worlds.
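To make the "cost matrix guides Dijkstra" step concrete, here is a minimal sketch of Dijkstra's algorithm on a 2D grid in which entering a cell costs the value stored in a cost matrix. How SubTGraph actually derives its matrix from structural constraints, and how paths map to tiles, is not stated in the abstract; the grid layout and the function name are assumptions.

```python
import heapq

def dijkstra_grid(cost, start, goal):
    """Cheapest 4-connected route cost from start to goal.

    Entering cell (r, c) costs cost[r][c]; the start cell itself is free.
    Returns None if the goal cannot be reached.
    """
    rows, cols = len(cost), len(cost[0])
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > best.get((r, c), float("inf")):
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + cost[nr][nc]
                if nd < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return None

# High-cost cells (9) act like a soft structural constraint the route avoids.
grid = [[1, 9, 1],
        [1, 9, 1],
        [1, 1, 1]]
route_cost = dijkstra_grid(grid, (0, 0), (0, 2))
```

Raising or lowering entries in the matrix steers where the cheapest route runs, which is one plausible way user constraints could shape the generated layout.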

2605.20916 2026-05-21 cs.CL

Task-Routed Mixture-of-Experts with Cognitive Appraisal for Implicit Sentiment Analysis

Yaping Chai, Haoran Xie, Joe S. Qin

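Affiliations: Division of Artificial Intelligence, Lingnan University, Hong Kong

AI summary: This paper proposes a multi-task learning framework grounded in cognitive appraisal theory, which improves implicit sentiment analysis through two auxiliary tasks: implicit sentiment detection and cognitive rationale generation. It uses a task-level mixture-of-experts model to reduce task interference, and experiments show that it outperforms existing methods on the implicit sentiment subset.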

Comments 8 pages, 4 figures, and 3 tables

Abstract

Implicit sentiment analysis is challenging because sentiment toward an aspect is often inferred from events rather than expressed through explicit opinion words. Existing models typically learn from the final polarity label, which provides limited guidance for reasoning about sentiment from the context. Motivated by cognitive appraisal theory, we propose an appraisal-aware multi-task learning (MTL) framework for implicit sentiment analysis that provides polarity prediction with two complementary auxiliary tasks: implicit sentiment detection and cognitive rationale generation. However, training several objectives with different targets and sharing a single backbone across tasks in MTL limits flexibility and can lead to task interference. To reduce interference among these related but distinct objectives, we adopt task-level mixture-of-experts models in which all tasks share a common set of experts, and task identity controls the sparse combination of these experts. Our method builds on an encoder-decoder architecture and replaces a subset of encoder and decoder blocks with these sparse mixtures. We use a task-conditioned router to select sparse expert mixtures for each task, and a task-separated routing objective to encourage different tasks to learn distinct expert-selection patterns. Experimental results show that our model outperforms recently proposed approaches, with strong gains on the implicit sentiment subset. Our code is available at https://github.com/yaping166/TRMoE-ISA.
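Task-level routing, where task identity rather than each token picks a sparse set of shared experts, can be sketched with a top-k softmax over per-task router logits. This is a generic illustration, not the paper's router: the function names are hypothetical, the experts are plain scalar functions for readability, and the task-separated routing objective is not shown.

```python
import math

def route(task_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    task_logits holds one score per shared expert for a single task.
    """
    top = sorted(range(len(task_logits)),
                 key=lambda i: task_logits[i], reverse=True)[:k]
    m = max(task_logits[i] for i in top)
    exps = {i: math.exp(task_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, task_logits, k=2):
    """Combine only the selected experts' outputs; the rest are skipped."""
    weights = route(task_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())

experts = [lambda x: x, lambda x: 2 * x, lambda x: 10 * x]
polarity_logits = [0.0, 0.0, -5.0]   # hypothetical router scores for one task
y = moe_forward(2.0, experts, polarity_logits, k=2)
```

Giving each task its own logit vector lets different tasks settle on different expert subsets while still sharing one expert pool.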

2605.20915 2026-05-21 cs.CL cs.AI cs.LG

Calibration vs Decision Making: Revisiting the Reliability Paradox in Unlearned Language Models

Divyaksh Shukla, Ashutosh Modi

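Affiliations: Indian Institute of Technology Kanpur (IIT Kanpur)

AI summary: This paper studies the gap between calibration and decision reliability in generative language models, using the multiple-choice question-answering evaluation protocol on the TOFU benchmark. Fine-tuned models have low calibration error, and models after unlearning keep similarly low calibration error while relying more on correlation-based tokens, which extends the reliability paradox to machine unlearning.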

Comments Accepted at SRW, ACL 2026; 17 pages (9 + 2 + 6)

Abstract

Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ~ 0.04) compared to pretrained models (ECE > 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.
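Expected Calibration Error (ECE), the headline metric quoted above, bins predictions by confidence and averages the gap between accuracy and mean confidence in each bin, weighted by bin size. The sketch below uses equal-width bins; the bin count of 10 and the function name are assumptions, since the abstract does not give the paper's binning.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin_size / n) * |accuracy - mean confidence|.

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct: booleans, True where the chosen answer was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(1 for _, ok in b if ok) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece

# Four answers at 90% confidence, three of them right: gap of 0.15.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                 [True, True, True, False])
```

Note that ECE only compares confidence with accuracy; it says nothing about which input features drive the prediction, which is why the paper pairs it with attribution analysis.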

2605.20912 2026-05-21 cs.CL

Enhancing Scientific Discourse: Machine Translation for the Scientific Domain

Dimitris Roussis, Sokratis Sofianopoulos, Stelios Piperidis

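Affiliations: Institute for Speech and Language Processing; Athena RC

AI summary: To address the translation challenges that specialized vocabulary and complex sentence structures pose in the scientific domain, this paper builds multilingual parallel and monolingual corpora and assesses their quality by fine-tuning general-purpose neural machine translation systems.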

Abstract

The increasing volume of scientific research necessitates effective communication across language barriers. Machine translation (MT) offers a promising solution for accessing international publications. However, the scientific domain presents unique challenges due to its specialized vocabulary and complex sentence structures. In this paper, we present the development of a collection of parallel and monolingual corpora for the scientific domain. The corpora target the language pairs Spanish-English, French-English, and Portuguese-English. For each language pair, we create a large general scientific corpus as well as four smaller corpora focused on the domains of: Cancer Research, Energy Research, Neuroscience, and Transportation research. To evaluate the quality of these corpora, we utilize them for fine-tuning general-purpose neural machine translation (NMT) systems. We provide details regarding the corpus creation process, the fine-tuning strategies employed, and we conclude with the evaluation results.

2605.20911 2026-05-21 cs.AI cs.LG

For How Long Should We Be Punching? Learning Action Duration in Fighting Games

Hoang Hai Nguyen, Kurt Driessens, Dennis J. N. J. Soemers

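Affiliations: Department of Advanced Computing Sciences, Maastricht University

AI summary: This paper studies how learning action duration affects the decision-making of reinforcement learning agents in fighting games, and examines dynamically adjusted reaction timing and its effect on performance and behavior patterns.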

Comments Accepted at Computers and Games 2026

Abstract

Fighting games such as Street Fighter II present unique challenges to reinforcement learning (RL) agents due to their fast-paced, real-time nature. In most RL frameworks, agents are hard-coded to make decisions at a fixed interval, typically every frame or every N frames. Although this design ensures timely responses, it restricts the agent's ability to adjust its reaction timing. Acting every frame grants frame-perfect reflexes, which are unrealistic compared to human players, whereas longer fixed intervals reduce computational cost but hinder responsiveness. We consider an alternative decision-making framework in which the agent learns not only what action to take but also for how long to execute it. By jointly predicting both action and duration, the agent can dynamically adapt its responsiveness to different situations in the game. We implement this method using the open-source FightLadder environment with agents trained against scripted built-in bots, systematically testing different frame skip configurations to analyze their influence on performance, responsiveness, and learned behavior. Experiments show that learned timing can match the performance of well-chosen fixed frame skips and encourages repeatable action patterns, but does not ensure robustness on its own. In most cases, we see agents performing best with consistently high frame skip values (i.e., low responsiveness). This strategy makes it easier to learn exploitative strategies where the same action is repeated over and over, which the scripted bots appear to be susceptible to.
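Predicting an action together with its duration amounts to a variable frame skip: the chosen action is repeated for the chosen number of frames before the agent decides again. The sketch below shows that control loop with a toy environment. `CountingEnv` and `step_with_duration` are hypothetical stand-ins and do not reflect the FightLadder API.

```python
class CountingEnv:
    """Toy stand-in environment: reward 1 per frame, ends after max_frames."""

    def __init__(self, max_frames=10):
        self.max_frames = max_frames
        self.frame = 0

    def step(self, action):
        self.frame += 1
        return self.frame, 1.0, self.frame >= self.max_frames

def step_with_duration(env, action, duration):
    """Repeat `action` for `duration` frames and sum the rewards.

    Stops early if the episode ends; returns (last_obs, total_reward, done).
    """
    total, obs, done = 0.0, None, False
    for _ in range(duration):
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return obs, total, done

env = CountingEnv(max_frames=5)
first = step_with_duration(env, "punch", 3)   # the agent picked duration 3
second = step_with_duration(env, "kick", 4)   # cut short when the episode ends
```

A fixed frame skip is the special case where `duration` is a constant; letting the policy output it per decision is what allows responsiveness to vary by situation.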

2605.20910 2026-05-21 cs.CV

FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching

Jangho Park, Geon Yeong Park, Gihyun Kwon, Jong Chul Ye

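Affiliations: KAIST; Amazon

AI summary: This paper proposes a new inference-time method for long video generation. It generates long videos over overlapping sliding windows using manifold-constrained Tweedie matching, maintains temporal consistency and visual quality, and needs no additional training.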

Comments Project Page: https://flowlong-video.github.io/

Abstract

Extending the generation horizon of video diffusion models to long sequences remains a long-standing and important challenge. Existing training-free approaches fall into two categories: extensions of bidirectional models, which are tightly coupled to specific architectures and suffer from quality degradation over long horizons, and autoregressive models, which accumulate drift errors due to exposure bias and tend to produce repetitive motion patterns. To address these issues, we propose a novel but simple inference-time approach for long video generation that is architecture-agnostic and requires no additional training. Our method generates long videos via overlapping sliding windows, where predicted clean samples from adjacent windows are blended via Tweedie matching to enforce both a manifold constraint and temporal consistency across overlap regions. Stochastic early-phase sampling then synchronizes per-window trajectories by injecting fresh noise after each Tweedie matching correction in the high-noise phase, before transitioning to deterministic ODE sampling to preserve fine-grained visual fidelity. Applied to various video generation models, our method generates videos several times longer than the native window length while outperforming both training-free and autoregressive baselines in temporal consistency and visual quality, and further extends to audio-video joint generation and text-to-3DGS without any fine-tuning.
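The overlapping-window layout can be sketched independently of the diffusion model. The code below computes window start indices and blends two windows' values over their shared frames with a plain linear cross-fade. The cross-fade is only a stand-in: the abstract does not specify how Tweedie matching combines the predicted clean samples, and both function names are hypothetical.

```python
def window_starts(total_frames, window, overlap):
    """Start indices of sliding windows that cover total_frames.

    Assumes overlap < window and total_frames >= window.
    """
    stride = window - overlap
    starts = list(range(0, max(total_frames - window, 0) + 1, stride))
    if starts[-1] + window < total_frames:
        starts.append(total_frames - window)  # final window flush with the end
    return starts

def blend_overlap(prev_tail, next_head):
    """Linearly cross-fade per-frame values shared by two adjacent windows.

    Early overlap frames lean on the previous window, late ones on the next.
    """
    n = len(prev_tail)
    return [((n - i) * a + (i + 1) * b) / (n + 1)
            for i, (a, b) in enumerate(zip(prev_tail, next_head))]

starts = window_starts(total_frames=10, window=4, overlap=2)
mixed = blend_overlap([0.0, 0.0], [3.0, 3.0])
```

In the method described above, the blending acts on each window's predicted clean sample during sampling rather than on finished frames, so this sketch shows only the indexing and the idea of reconciling the overlap.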

2605.20908 2026-05-21 cs.CV

SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches

Tores Julie, Sun Rémy, Sassatelli Lucile, Ancarani Elisa, Wu Hui-Yin, Precioso Frédéric

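Affiliations: CNRS; Inria; I3S

AI summary: This study proposes SynCB, a synergy concept-based model in which a routing module dynamically selects between a concept-based branch and a complementary neural branch, improving task accuracy and responsiveness to human intervention.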

Abstract

Concept-based (CB) models provide interpretability and support test-time human intervention, while standard neural networks (NN) offer strong task performance but little transparency. Prior work has explored hybrid formulations that integrate concepts and additional representations to improve accuracy, often at the cost of human interventions. We introduce the Synergy Concept-Based Model (SynCB) framework, which combines a CB branch with a complementary neural branch and a trainable routing module that dynamically selects which branch to use for each input. Unlike prior models, which fuse residual and concept-based predictions, SynCB keeps the two branches distinct and coordinates them through the routing module. Moreover, both branches are learned jointly, allowing information sharing between the complementary neural branch and the CB branch through their common backbone. To improve responsiveness to interventions, we further introduce a test-time intervention policy and a corresponding loss. Across five datasets and CB benchmarks, SynCB consistently achieves higher task accuracy while remaining more responsive to human interventions, surpassing the full neural baseline by up to 3.9 percentage points and exceeding the strongest competitor in intervention performance by up to 6.43 percentage points.
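Keeping the two branches separate and letting a router choose between them can be sketched as below. Everything here is illustrative: the callables, the 0.5 threshold, and the dictionary-based concept format are assumptions, and the joint training and the intervention loss are not shown.

```python
def route_and_predict(x, concept_fn, label_from_concepts, neural_fn,
                      router_fn, interventions=None):
    """Return (label, branch_name) for one input.

    router_fn(x) is a hypothetical score in [0, 1]; values >= 0.5 pick the
    concept branch. interventions maps concept names to human-set values.
    """
    if router_fn(x) >= 0.5:
        concepts = dict(concept_fn(x))
        concepts.update(interventions or {})  # human corrections take effect
        return label_from_concepts(concepts), "concept"
    return neural_fn(x), "neural"

# Toy stand-ins for the learned components.
concept_fn = lambda x: {"has_wings": x > 0}
label_fn = lambda c: "bird" if c["has_wings"] else "other"
neural_fn = lambda x: "neural-guess"
router_fn = lambda x: 1.0 if abs(x) < 5 else 0.0

plain = route_and_predict(-1, concept_fn, label_fn, neural_fn, router_fn)
fixed = route_and_predict(-1, concept_fn, label_fn, neural_fn, router_fn,
                          interventions={"has_wings": True})
```

Because a corrected concept changes the label only on inputs routed to the concept branch, the routing decision directly affects how responsive the model is to interventions.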

2605.20904 2026-05-21 cs.CV

JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026

Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie

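Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University

AI summary: This paper proposes JFAA, a JEPA-based future action anticipation method for the EPIC-KITCHENS-100 action anticipation task. A frozen encoder and predictor extract observed context features and near-future latent tokens, and a lightweight attentive probe is then trained to predict verb, noun, and action logits. A field-aware ensemble improves robustness, and JFAA takes first place in the EgoVis 2026 EPIC-KITCHENS-100 Action Anticipation Challenge.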

Comments The champion solution for the EPIC-KITCHENS-100 Action Anticipation Challenge at the CVPR EgoVis Workshop 2026

Abstract

We propose JFAA, a JEPA-based Future Action Anticipation method for the EPIC-KITCHENS-100 (EK-100) Action Anticipation task. Inspired by the representation learning and future prediction ability of V-JEPA 2.1, JFAA uses a frozen encoder and predictor to extract observed context features and near-future latent tokens. A lightweight attentive probe is then trained to predict verb, noun, and action logits with separate task queries. To improve robustness, we further build a field-aware ensemble over selected epoch-level predictions, allowing each output field to benefit from its most reliable candidates. Experimental results on the official challenge server show that JFAA achieves first place in the EgoVis 2026 EK-100 Action Anticipation Challenge. Our code will be released at https://github.com/CorrineQiu/JFAA.
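One simple reading of a field-aware ensemble is that each output field (verb, noun, action) averages logits over its own subset of epoch-level predictions. The sketch below implements that reading; the averaging rule, the function name, and the candidate names `"e10"` and `"e20"` are assumptions, since the abstract does not describe how candidates are selected or combined.

```python
def field_aware_ensemble(candidates, selected):
    """Average logits per output field over that field's chosen candidates.

    candidates: {candidate_name: {field: [logit, ...]}}
    selected:   {field: [candidate_name, ...]}
    """
    fused = {}
    for field, names in selected.items():
        rows = [candidates[name][field] for name in names]
        fused[field] = [sum(col) / len(rows) for col in zip(*rows)]
    return fused

candidates = {
    "e10": {"verb": [1.0, 3.0], "noun": [0.0, 2.0]},
    "e20": {"verb": [3.0, 5.0], "noun": [4.0, 0.0]},
}
# Verbs use both checkpoints; nouns trust only the later one.
fused = field_aware_ensemble(candidates,
                             {"verb": ["e10", "e20"], "noun": ["e20"]})
```

Selecting candidates per field, rather than one set for all outputs, lets a checkpoint that is strong on verbs but weak on nouns contribute only where it helps.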

2605.20901 2026-05-21 cs.CV cs.AI

VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026

Qiaohui Chu, Haoyu Zhang, Yisen Feng, Meng Liu, Weili Guan, Dongmei Jiang, Liqiang Nie

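Affiliations: Harbin Institute of Technology (Shenzhen); Pengcheng Laboratory; Shandong Jianzhu University

AI summary: This paper proposes VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation Challenge at EgoVis 2026. The method combines object-centric spatial detection with short-horizon temporal context, injecting the temporal representation into the detection pathway through feature modulation and ROI-level context fusion.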

Comments The champion solution for the Ego4D Short-Term Object Interaction Anticipation Challenge at the CVPR EgoVis Workshop 2026

English abstract

We propose VISTA, a V-JEPA Integrated StillFast Temporal Anticipator for the Ego4D Short-Term Object Interaction Anticipation (STA) Challenge at EgoVis 2026. Given an egocentric video timestamp, the task requires anticipating the next human-object interaction, including the future active object's bounding box, noun category, verb category, time-to-contact, and confidence score. VISTA follows a StillFast-style design that combines object-centric spatial detection with short-horizon temporal context. Specifically, a COCO-pretrained Faster R-CNN ResNet-50 FPN detector generates object proposals from the last observed high-resolution frame, while a frozen V-JEPA 2.1 temporal branch extracts clip-level egocentric context from the observed video. The temporal representation is injected into the detection pathway through feature modulation and ROI-level context fusion. The fused proposal features are then passed to multi-head STA predictors for box refinement, noun classification, verb classification, time-to-contact regression, and interaction confidence estimation. For the final submission, we further ensemble complementary predictions to improve robustness. Experimental results on the official challenge server show that VISTA achieves first place in the EgoVis 2026 Ego4D STA Challenge. Our code will be released at https://github.com/CorrineQiu/VISTA.

2605.20894 2026-05-21 cs.RO

Mobile UMI: Cross-View Diffusion Policy with Decoupled Kinematics for Mobile Manipulation

Haoran Huang, Haonan Dong, Huixu Dong

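Affiliations Zhejiang University

AI summary This paper proposes Mobile UMI, a hardware-free demonstration framework whose three components address two bottlenecks in mobile imitation learning: locomotion-contaminated action labels and inference-induced execution latency. Dual cameras capture global and local context, a spatial anchor unifies the visual-inertial frames, and an asynchronous receding-horizon executor performs online state matching, yielding decoupled manipulation and base trajectories.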

English abstract

Mobile imitation learning on portable demonstration interfaces faces two coupled bottlenecks: locomotion-contaminated action labels and inference-induced execution latency on a continuously moving base. Recent wrist-mounted interfaces lower the cost of tabletop data collection, yet a single wrist view does not capture the global context required for base navigation. Adding a body-mounted camera entangles human walking with hand motion. Meanwhile, generative policies introduce hundreds of milliseconds of inference latency, during which the base advances past predicted waypoints, forcing backward corrections at action splices. This paper presents Mobile UMI, a hardware-free demonstration framework that addresses both gaps through three components. First, a dual-camera capture system records chest-centric global context and wrist-centric local interaction without any robot present. Second, a one-shot ChArUco-based spatial anchor unifies the chest and hand visual-inertial frames; the hand pose is then re-expressed relative to the chest to extract decoupled SE(3) manipulation and SE(2) base trajectories. Third, an asynchronous receding-horizon executor performs online state matching: each generated action chunk is realigned with the current physical pose so that expired waypoints are discarded before execution. The full system is evaluated on four long-horizon household tasks, achieving an average success rate of 83.8% over 100 trials per task. Controlled comparisons against ACT and Diffusion Policy show that the chest-relative label alone closes much of the gap; online state matching closes the remainder. These results indicate that, for mobile imitation learning under the tested conditions, explicit kinematic factorization combined with state-level latency alignment provides an effective solution without requiring architectural changes to the underlying policy class.
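The chest-relative relabeling step can be illustrated with a small planar sketch. It is a simplification under stated assumptions, not the authors' code: poses are SE(2) tuples (x, y, heading) rather than the SE(3) hand poses in the abstract, and the helper names `compose`, `inverse`, and `hand_in_chest_frame` are hypothetical. It shows that walking moves the hand in the world frame, while the hand pose expressed in the chest frame is unchanged.

```python
import math

def compose(a, b):
    """Compose two planar poses (x, y, theta): express pose b, given in frame a, in a's parent frame."""
    ax, ay, at = a
    bx, by, bt = b
    return (ax + math.cos(at) * bx - math.sin(at) * by,
            ay + math.sin(at) * bx + math.cos(at) * by,
            at + bt)

def inverse(p):
    """Inverse of a planar pose, so that compose(inverse(p), p) is the identity."""
    x, y, t = p
    return (-(math.cos(t) * x + math.sin(t) * y),
            math.sin(t) * x - math.cos(t) * y,
            -t)

def hand_in_chest_frame(world_chest, world_hand):
    """Re-express the hand pose relative to the chest."""
    return compose(inverse(world_chest), world_hand)

# Hypothetical numbers: the hand sits 0.4 m ahead of and 0.1 m to the left of the chest.
offset = (0.4, 0.1, 0.0)
chest_before = (0.0, 0.0, 0.0)
chest_after = (2.0, 1.0, math.pi / 2)  # the demonstrator walked and turned

hand_before = compose(chest_before, offset)
hand_after = compose(chest_after, offset)
rel_before = hand_in_chest_frame(chest_before, hand_before)
rel_after = hand_in_chest_frame(chest_after, hand_after)
```

In this toy case the world-frame hand pose changes between the two instants, but `rel_before` and `rel_after` both equal `offset` up to rounding, which is the sense in which a chest-relative label removes locomotion from the manipulation signal.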

2605.20892 2026-05-21 cs.CV

FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition

Enhui Yu, Junhui Li, Ruitong Lu, Jialu Li, Youshan Zhang

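Affiliations University of Science and Technology Liaoning; Chuzhou University; Yeshiva University

AI summary This paper proposes FruitEnsemble, a two-stage dynamic inference framework that addresses the generalization limits of fine-grained fruit classification. It uses an MLLM as an expert arbiter to improve accuracy and reaches 70.49% classification accuracy.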

Comments 10 pages, 6 figures, submitted to CVPR 2026

Journal ref Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2026

English abstract

Fine-grained fruit classification is a critical yet challenging task in agricultural computer vision, primarily hindered by a severe shortage of high-quality datasets and the high visual similarity between classes. To address these challenges, we first constructed a comprehensive dataset comprising 306 fruit categories with 116,233 samples. Moreover, we propose FruitEnsemble, a practical two-stage dynamic inference framework designed to overcome the generalization limitations of static single-model architectures. In the first stage, FruitEnsemble employs a validation-calibrated weighted ensemble of heterogeneous backbones to generate a robust Top-3 candidate pool. To tackle difficult samples, we introduce an expert arbitration mechanism: when ensemble confidence falls below 0.6, a multimodal large language model (MLLM) is triggered to perform rigorous visual verification by integrating external botanical descriptions using Chain-of-Thought (CoT) reasoning. Furthermore, we optimized the training pipeline with a hard sample-aware joint loss. Extensive experiments demonstrate that FruitEnsemble achieves a classification accuracy of 70.49% and outperforms existing state-of-the-art models. Our framework provides an efficient, deployment-oriented solution for real-world agricultural visual sorting and quality inspection tasks.
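The two-stage routing described above can be sketched as follows. Only the Top-3 pool and the 0.6 threshold come from the abstract; the backbone outputs, the weights, the choice of the top-1 ensemble probability as "ensemble confidence", and the `arbiter` callable standing in for the MLLM are all assumptions made for illustration.

```python
def weighted_ensemble(prob_lists, weights):
    """Average per-class probabilities from several backbones using fixed weights."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [sum(w * p[c] for w, p in zip(weights, prob_lists)) / total
            for c in range(n_classes)]

def classify(prob_lists, weights, arbiter, threshold=0.6, k=3):
    """Return (class index, route): the ensemble decides unless its confidence is low."""
    probs = weighted_ensemble(prob_lists, weights)
    top_k = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]
    confidence = probs[top_k[0]]
    if confidence >= threshold:
        return top_k[0], "ensemble"
    # Low confidence: hand the Top-k candidates to the (hypothetical) arbiter.
    return arbiter(top_k), "arbiter"

def stub_arbiter(candidates):
    """Placeholder for the MLLM check; here it simply prefers the second candidate."""
    return candidates[1]

weights = [0.6, 0.4]
easy = [[0.90, 0.05, 0.03, 0.02], [0.80, 0.10, 0.05, 0.05]]
hard = [[0.40, 0.35, 0.15, 0.10], [0.30, 0.45, 0.15, 0.10]]

easy_result = classify(easy, weights, stub_arbiter)
hard_result = classify(hard, weights, stub_arbiter)
```

The easy sample is settled by the ensemble alone, while the hard sample falls below the threshold and is routed to the arbiter, which may overrule the ensemble's first choice. The arbiter call is therefore only paid for on ambiguous inputs.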

2605.20891 2026-05-21 cs.CV

HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction

Huayi Wang, Haochao Ying, Yuyang Xu, Qiyao Zheng, Jun Wang, Cheng Zhang, Ying Sun, Jian Wu

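Affiliations Zhejiang University; Xinjiang University; Hangzhou City University; Sun Yat-sen University Cancer Center

AI summary This paper proposes HDMoE, a hierarchical decoupling-fusion mixture-of-experts framework that integrates multimodal medical data to improve the accuracy of cancer survival prediction, addressing the weak feature decoupling and fusion of earlier methods.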

Comments 12 pages, HDMoE has been accepted by KDD 2026 AI for Sciences Track

English abstract

Multimodal survival prediction, a crucial yet challenging task, demands the integration of multimodal medical data (e.g., Whole Slide Images (WSIs) and Genomic Profiles) to achieve accurate prognostic modeling. Given the inherent heterogeneity across modalities, the feature decoupling-fusion paradigm has emerged as a dominant approach. However, these methods have the following shortcomings: (1) they fail to reduce the redundant information of modality features before decoupling, which negatively affects the feature decoupling and fusion effect; (2) they lack the ability to model the fine-grained relationships of the features and capture the local information interactions between intra- and inter-modality features. To address these issues, we propose a Hierarchical Decoupling-Fusion Mixture-of-Experts (HDMoE) framework with two levels of MoE and Random Feature Reorganization (RFR) modules. In the first-level MoE, shared experts and routed experts are employed to remove redundant information and extract fine-grained specific features within each modality, while the second-level MoE facilitates fine-grained inter-modality feature decoupling. Besides, we design two RFR modules following each level of MoE to finely fuse intra- and inter-modality features, which can help the model capture more fine-grained relationships between modalities. Extensive experimental results on our private Liver Cancer (LC) and three TCGA public datasets confirm the effectiveness of our proposed method. Codes are available at https://github.com/ZJUMAI/HDMoE.

2605.20889 2026-05-21 cs.CV

Map-Mono-Ego: Map-Grounded Global Human Pose Estimation from Monocular Egocentric Video

Hiroyuki Deguchi, Ryosuke Hori, Kotaro Amaya, Tsubasa Maruyama, Mitsunori Tada, Hideo Saito

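Affiliations Keio University; National Institute of Advanced Industrial Science and Technology

AI summary This paper proposes MapMonoEgo, a framework that uses a pre-scanned 3D point cloud to obtain globally consistent human pose estimates from a single monocular camera, and introduces the AIST-Living dataset. Experiments indicate that the method is practical for everyday monitoring tasks without specialized hardware.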

Comments Accepted at ICIP 2026, Project page: https://deguchihiroyuki.github.io/Map-Mono-Ego-Project/

English abstract

Monocular egocentric human pose estimation is essential for ubiquitous activity monitoring. However, understanding the user's absolute location within the environment remains a challenge. Existing methods primarily focus on relative motion from an initial position, and tend not to account for the wearer's absolute location within an environment. Furthermore, inherent scale ambiguity in monocular vision leads to severe translational drift, limiting long-term tracking without specialized multi-sensor hardware. To address this, we propose MapMonoEgo, a novel framework achieving globally consistent human pose estimation solely from a monocular camera by leveraging a pre-scanned 3D point cloud. We also introduce AIST-Living dataset, a new dataset pairing egocentric video with ground-truth motion in a scanned environment. Experiments demonstrate that our approach significantly outperforms the state-of-the-art baseline, proving its utility for practical monitoring tasks without specialized hardware.

2605.20885 2026-05-21 cs.LG q-bio.QM

Training distribution determines the ceiling of drug-blind cancer sensitivity prediction

Taekyung Heo

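Affiliations Taekyung Heo

AI summary This paper studies how the training distribution affects drug-blind cancer sensitivity prediction. It finds that the standard metric is biased, and that mechanism-stratified training and response matching recover the predictive gain.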

English abstract

Precision oncology requires predicting which drugs will suppress a specific tumor from its molecular profile, but drug-blind sensitivity prediction has plateaued despite increasingly complex drug representations. Here we show that this stagnation reflects a metric artifact rather than a representational bottleneck. The standard benchmark, global Pearson r, is dominated by between-drug potency differences that a trivial drug-mean predictor captures without any cell-specific learning. Per-drug Pearson r, which isolates within-drug cell ranking, reveals that no drug encoding improves over cell-only features across four independent datasets. A controlled experiment channeling mechanism-of-action identity as either a drug feature or a training-distribution constraint identifies the cause. Supplying MoA as a feature yields negligible benefit, whereas using it to stratify training raises per-drug r substantially for targeted kinase inhibitors, because pan-cancer co-training suppresses pathway-specific sensitivity signals. Mechanism-stratified training and response matching from pilot observations provide two deployable strategies that together recover the principal sources of predictive gain in drug-blind sensitivity prediction.
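The gap between global and per-drug Pearson r can be reproduced on toy numbers. The data below are invented for illustration and are not from the paper; the predictor is deliberately extreme (it knows each drug's mean potency but ranks cell lines backwards within every drug) to make the metric artifact visible.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up responses of three cell lines to two drugs with very different potency.
true = {"drug_a": [1.0, 1.2, 1.4], "drug_b": [5.0, 5.2, 5.4]}
# Predictions centered on each drug's mean, with the within-drug ranking reversed.
pred = {"drug_a": [1.3, 1.2, 1.1], "drug_b": [5.3, 5.2, 5.1]}

# Global r pools all (cell, drug) pairs; per-drug r looks at one drug at a time.
global_r = pearson(true["drug_a"] + true["drug_b"], pred["drug_a"] + pred["drug_b"])
per_drug_r = {drug: pearson(true[drug], pred[drug]) for drug in true}
```

Here `global_r` is above 0.99 even though every entry of `per_drug_r` is close to -1: the pooled correlation is carried almost entirely by the between-drug potency difference, while the within-drug cell ranking, which is what matters for choosing a drug for a given tumor, is wrong.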

2605.20883 2026-05-21 cs.LG

Learning fMRI activations dictionaries across individual geometries via optimal transport

Sonia Mazelet, Rémi Flamary, Bertrand Thirion

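Affiliations CMAP, Ecole Polytechnique, Palaiseau, France; Mind, Inria-Saclay, Palaiseau, France

AI summary This paper proposes an optimal-transport-based dictionary learning method for fMRI that handles individual differences in brain geometry with the Fused Gromov-Wasserstein (FGW) distance, uses amortized optimization to cut computational cost, and learns dictionary atoms that depend on the FGW parameter balancing feature alignment and structural consistency.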

English abstract

Dictionary learning is a powerful tool for creating interpretable representations. When applied to functional magnetic resonance imaging (fMRI) data, the resulting patterns of brain activity can be used for various downstream tasks, such as brain state classification or population-level analysis. However, a major challenge is the variability in brain geometry across individuals. This is usually addressed by projecting each individual brain geometry onto a common template, which removes subject-specific information. In this work, we introduce a novel approach to dictionary learning on fMRI data that explicitly accounts for this variability. We use the optimal transport-based Fused Gromov-Wasserstein (FGW) distance to compare graphs with different geometries and features. To address the challenge of computing multiple FGW distances for large graphs such as those arising from fMRI data, we rely on amortized optimization to learn a neural network that predicts an approximation of the optimal transport plans, which substantially reduces the computational cost. Additionally, we learn dictionary atoms that depend on the FGW trade-off parameter, which controls the balance between feature alignment and structural consistency. Numerical experiments on the HCP dataset demonstrate that the proposed approach captures different levels of geometric variability in the data and provides representations that preserve essential information.

2605.20879 2026-05-21 cs.LG

NeighborDiv: Training-free Zero-shot Generalist Graph Anomaly Detection via Neighbor Diversity

Kaifeng Wei, Teng Liu, Liang Dong, Xiubo Liang, Yuke Li

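Affiliations Netease Yidun AI Lab; School of Software Technology; Zhejiang University

AI summary This paper proposes NeighborDiv, a training-free generalist graph anomaly detection method that detects anomalies through neighbor diversity, avoiding the training complexity, data dependency, and unstable cross-domain generalization of earlier methods. Experiments show state-of-the-art performance under two standard evaluation paradigms.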

English abstract

Graph Anomaly Detection (GAD) is increasingly shifting to Generalist GAD (GGAD) for cross-domain "one-for-all" detection, but existing GGAD methods predominantly rely on the neighbor consistency principle, falling into the Node-to-Neighbor Consistency Paradigm for anomaly quantification. These methods suffer from complex training pipelines, heavy training data dependency, high computational costs, and unstable cross-domain generalization. To address these limitations, we propose NeighborDiv, a training-free generalist graph anomaly detection framework based on neighbor diversity. Departing from the dominant Node-to-Neighbor Consistency Paradigm, we shift the focus to the Neighbor-to-Neighbor Diversity Paradigm, and uncover that the internal structural dispersion of a node's neighbor set is a powerful, independently discriminative anomaly signal. We quantify neighbor diversity via the variance of inter-neighbor feature similarities, which captures how a node organizes its local graph environment, and operates independently of conventional node-to-neighbor consistency frameworks. Extensive experiments under two standard GGAD evaluation paradigms show NeighborDiv achieves state-of-the-art performance, with relative gains of 10.25% in average AUC and 17.78% in average AP over the second-best baseline under Single-Domain Independent Training (SDIT), and 6.89%/9.58% in AUC/AP under Unified Multi-Domain Training (UMDT), respectively. Notably, NeighborDiv yields zero performance volatility across all datasets, eliminating training-set dependency and establishing a lightweight and highly practical GGAD framework.
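A minimal sketch of the quantity named above, the variance of inter-neighbor feature similarities, is given below. The abstract does not specify the similarity function or the variance estimator, so cosine similarity and the population variance are assumptions here, as is the function name `neighbor_diversity`. How the resulting score is turned into an anomaly ranking is also not stated and is left out.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def neighbor_diversity(neighbor_features):
    """Variance of pairwise similarities among one node's neighbors (no training involved)."""
    sims = [cosine(u, v) for u, v in combinations(neighbor_features, 2)]
    if not sims:
        return 0.0  # fewer than two neighbors: no pairs to compare (a convention assumed here)
    mean = sum(sims) / len(sims)
    return sum((s - mean) ** 2 for s in sims) / len(sims)

# Toy neighborhoods: one where all neighbors look alike, one with an odd neighbor out.
uniform_neighbors = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
mixed_neighbors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

uniform_score = neighbor_diversity(uniform_neighbors)
mixed_score = neighbor_diversity(mixed_neighbors)
```

Note that the node's own features never enter the computation, only its neighbors' features do. This is what separates a neighbor-to-neighbor score from a node-to-neighbor consistency score, and because there are no learned parameters the same function can be applied to any graph without fitting.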

2605.20878 2026-05-21 cs.LG

CIG: Exploration via Conditional Information Gain

Tim Joseph, Marcus Fechner, Philipp Stegmaier, Karam Daaboul, J. Marius Zöllner

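Affiliations FZI Karlsruhe; KIT Karlsruhe

AI summary This work proposes the Conditional Information Gain (CIG) reward for exploration in reinforcement learning: a tractable log-determinant objective over an ensemble disagreement kernel that yields causal per-step rewards and scales exploration to high-dimensional state spaces.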

Comments 28 pages, 10 figures, 3 tables

English abstract

Intrinsic rewards for exploration in reinforcement learning condition on different contexts: lifelong rewards score each transition against accumulated experience but ignore within-rollout redundancy; episodic rewards penalize intra-trajectory repetition but discard lifetime progress. Hybrid methods combine both signals through heuristic weights or require Gaussian-process dynamics that do not scale beyond low-dimensional state spaces. Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously, but remains intractable for deep models. We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards that retain both conditioning sets while scaling to high-dimensional state spaces. We instantiate CIG in a model-based setting, where rollouts are short and within-rollout corrections remain largely unexplored. Across twelve tasks spanning discrete (MiniGrid) and continuous control (OGBench), in both clean and stochastic-distractor settings, CIG outperforms or matches prior exploration methods while remaining robust to stochastic distractors.
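The link between a log-determinant objective and causal per-step rewards rests on a standard identity: if A = L Lᵀ is a Cholesky factorization, then log det A = 2 Σ_t log L_tt, and row t of L depends only on the leading t×t block of A. The sketch below illustrates that identity only. The kernel matrix is supplied directly rather than computed from an ensemble, the identity matrix is added to keep A positive definite, and conditioning on a replay buffer is omitted; these are all assumptions, not details taken from the paper.

```python
import math

def cholesky(a):
    """Lower-triangular L with L @ L.T == a, for a symmetric positive-definite matrix."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

def per_step_rewards(kernel):
    """Split log det(I + K) into one term per rollout step, in rollout order."""
    n = len(kernel)
    a = [[kernel[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(a)
    return [2.0 * math.log(L[t][t]) for t in range(n)]

# Toy kernel over a 3-step rollout: steps 0 and 1 are identical, step 2 is unrelated.
kernel = [[1.0, 1.0, 0.0],
          [1.0, 1.0, 0.0],
          [0.0, 0.0, 1.0]]
rewards = per_step_rewards(kernel)
prefix_rewards = per_step_rewards([row[:2] for row in kernel[:2]])
```

In this toy rollout the repeated step 1 earns less than step 0, the unrelated step 2 earns as much as step 0, and the rewards for the first two steps are the same whether or not step 2 is appended. That last property is what "causal" means here: a step's reward never depends on later steps.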

2605.20876 2026-05-21 cs.CL cs.AI

Terminal-World: Scaling Terminal-Agent Environments via Agent Skills

Zihao Cheng, Hongru Wang, Zeming Liu, Xinyi Wang, Xiangrong Zhu, Yuhang Guo, Wei Lin, Jeff Z. Pan, Yunhong Wang

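Affiliations School of Computer Science and Engineering, Beihang University, Beijing, China; Independent Researcher; Beijing Institute of Technology; University of Edinburgh

AI summary This paper proposes Terminal-World, an automated pipeline that uses agent skills as the central synthesis primitive, jointly encoding what to accomplish, when to apply, and how to execute, so that task instructions, environments, and teacher trajectories are derived together. The authors build 5,723 training environments and train Terminal-World-8B/14B/32B, which outperform terminal-agent baselines across six benchmarks. Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 using only 1.2% of the training data.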

Comments Work in Progress

English abstract

Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, environments misaligned with task semantics, and inefficient trajectories from unguided exploration. To address these limitations, we introduce Terminal-World, a fully automated pipeline that uses agent skills as the central synthesis primitive, which jointly encode what to accomplish, when to apply (preconditions and environment state), and how to execute, enabling task instructions, environments, and teacher trajectories to be co-derived. To further broaden the synthesis space, Terminal-World composes skills into skill teams and skill graphs for multi-role and cross-domain task synthesis. Using this pipeline, we construct 5,723 training environments and train Terminal-World-8B/14B/32B, evaluated across 6 benchmarks where the Terminal-World series consistently outperforms terminal-agent baselines. Notably, using the same teacher model and only 1.2% of the training data, Terminal-World-32B surpasses Nemotron-Terminal-32B on Terminal-Bench 2.0 by +4.5 Pass@1 (31.5) and achieves 43.8 Pass@3.

2605.20874 2026-05-21 cs.AI cs.SE

Governance by Construction for Generalist Agents

Segev Shlomov, Iftach Shoham, Alon Oved, Ido Levy, Sami Marreed, Harold Ship, Offer Akrabi, Sergey Zeltyn, Avi Yaeli, Nir Mashkif

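Affiliations IBM

AI summary This paper presents a modular policy-as-code layer that composes with a generalist LLM agent, without model fine-tuning, to deliver predictable, auditable, and compliance-aware behavior in compound workflows without rebuilding the agent for each domain.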

English abstract

Enterprise agents are increasingly expected to operate autonomously across tools and interfaces, yet production deployments require governance by construction. Systems must specify which actions are allowed, when human oversight is required, and what information may be exposed, without rebuilding the agent for each domain. This demo presents CUGA's policy system, a modular policy-as-code layer that composes with a generalist LLM agent to deliver predictable, auditable, and compliance-aware behavior in compound workflows without model fine-tuning. We present a runtime governance architecture that enforces policy interventions at every critical stage of execution. Rather than passively constraining behavior, policies intercept the agent at five structural checkpoints: upstream of planning (Intent Guard), within the system prompt to steer reasoning (Playbook), at the tool-call boundary to enforce proper usage (Tool Guide), outside the reasoning loop as a Human-in-the-Loop gate for high-risk actions (Tool Approvals), and at the output stage to filter and structure the final response (Output Formatter). Together, these stages embed governance continuously across the agent's execution pipeline rather than treating it as an afterthought. Using a healthcare scenario and a multi-layered enforcement intervention, the demo shows dynamic playbook injection for structured tool-sequence enforcement, intent guards that block malicious or accidental harmful requests, and human-in-the-loop tool approval checkpoints for potentially destructive actions. The artifact illustrates how typed governance primitives enable faster, safer deployment of enterprise agentic systems while improving policy adherence and execution consistency.

2605.20872 2026-05-21 cs.LG cs.AI cs.GR

CAdam: Context-Adaptive Moment Estimation for 3D Gaussian Densification in Generative Distillation

SeungJeh Chung, Geonho Park, Misong Kim, HyeongYeop Kang

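Affiliations IIIXR Lab, Kyung Hee University; IIIXR Lab, Korea University

AI summary This paper proposes CAdam, which recasts densification as a statistical signal verification problem to resolve the densification bottleneck in generative distillation, substantially reducing the number of Gaussians while preserving visual quality.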

Comments Accepted to SIGGRAPH 2026 Conference Papers. 12 pages, 8 figures

English abstract

Adaptive densification is the engine of 3D Gaussian Splatting (3DGS). However, when transposed to the optimization-based Generative Distillation paradigm, this reconstruction-native mechanism reveals fundamental limitations, resulting in inefficient representations cluttered with redundant primitives. We diagnose this failure as a Densification Dilemma stemming from the stochastic nature of generative guidance: the standard magnitude-based accumulation indiscriminately aggregates transient noise alongside geometric signals, making it difficult to strike a balance between over-densification and under-fitting. To resolve this, we introduce Context-Adaptive Moment Estimation (CAdam), a novel framework that reinterprets densification as a statistically grounded signal verification problem. CAdam leverages the first moment of gradients to exploit the interference principle, where stochastic fluctuations cancel out via destructive interference while consistent geometric drifts accumulate via constructive interference, effectively disentangling the underlying signal from the generative noise floor. This is further augmented by a quantile-based context awareness and an intrinsic Signal-to-Noise Ratio (SNR) gating mechanism, which ensure robust adaptation across optimization stages and enable the soft termination of densification. Extensive experiments across diverse objectives (SDS, ISM, VFDS) and strong generative 3DGS backbones show that CAdam reduces Gaussian count by 85%-97% relative to standard densification while preserving overall comparable perceptual quality. These results highlight signal-aware density control as a practical way to improve memory efficiency in optimization-based generative distillation.
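The interference argument can be illustrated on scalar gradient streams. The abstract gives no formulas, so everything concrete below is an assumption: Adam-style exponential moving averages for the first and second moments, an SNR defined as |m| / sqrt(v - m²), and the name `moment_snr`. The quantile-based context awareness mentioned in the abstract is not modeled.

```python
import math

def moment_snr(grads, beta=0.9, eps=1e-8):
    """Signal-to-noise ratio from EMA first and second moments of a scalar gradient stream."""
    m = v = 0.0
    for g in grads:
        m = beta * m + (1 - beta) * g        # signed average: opposite signs cancel
        v = beta * v + (1 - beta) * g * g    # squared average: tracks overall energy
    noise = math.sqrt(max(v - m * m, eps))   # spread around the mean, floored for stability
    return abs(m) / noise

# A large sign-flipping stream (transient noise) and a small steady one (consistent drift).
noisy = [1.0 if t % 2 == 0 else -1.0 for t in range(200)]
drift = [0.5] * 200

noisy_magnitude = sum(abs(g) for g in noisy) / len(noisy)
drift_magnitude = sum(abs(g) for g in drift) / len(drift)
noisy_snr = moment_snr(noisy)
drift_snr = moment_snr(drift)
```

A magnitude-based criterion would rank the noisy stream above the drifting one (mean magnitude 1.0 versus 0.5), whereas the first-moment SNR ranks them the other way round. A densification gate that splits a Gaussian only when its SNR exceeds some threshold would therefore ignore the noisy stream, and as gradients settle the SNR of every stream falls, which is one way a soft termination of densification could arise.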

2605.20868 2026-05-21 cs.LG cs.AI cs.SY eess.SY

Runtime-Certified Bounded-Error Quantized Attention

Dean Calver

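Affiliations Independent Researcher

AI summary This paper proposes a tiered KV cache architecture that stores INT8 keys and INT4 values in GPU memory while keeping the FP16 originals in system RAM, enabling runtime-certified attention. An error decomposition gives per-head, per-step error bounds that drive adaptive precision selection and a multi-stage fallback ladder, which recovers the exact dense attention output when required.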

Comments 32 pages, 1 figure

English abstract

KV cache quantization reduces the memory cost of long-context LLM inference, but introduces approximation error that is typically validated only empirically. Existing systems rely on average-case robustness, with no mechanism to detect or recover from failures at runtime. We present a tiered KV cache architecture that enables runtime-certified attention: INT8 keys and INT4 values are stored in GPU memory, while FP16 originals are retained in system RAM for deterministic fallback. A two-term error decomposition yields per-head, per-step bounds on (i) attention distribution distortion from key quantization and (ii) value reconstruction error. These bounds are computed online and used to drive adaptive precision selection and a multi-stage fallback ladder, which guarantees recovery to the exact dense attention output when required. Across PG-19, NIAH, and RULER benchmarks on LLaMA 3.1-8B with contexts up to 128K, the system matches dense FP16 KV quality within noise for language modelling and retrieval tasks, while recovering catastrophic failures observed in naive INT8/INT4 baselines. Value-sensitive tasks at short context expose a controlled trade-off between compression and fidelity, which can be eliminated via tighter value tolerances or FP16-value fallback. The certification is local (per-head, per-step) and does not guarantee end-to-end model correctness, but ensures that each attention computation is either bounded relative to an FP16 reference or exactly recovered via fallback. This reframes KV cache quantization as a runtime-verified computation rather than a fixed approximation. The goal is not raw speedups, but enabling safe deployment of aggressive KV compression under strict quality constraints.
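The "bounded or exactly recovered" contract can be sketched for the value-reconstruction term alone. This is a simplified illustration and not the paper's bound: keys are kept exact (so term (i) is zero), values use symmetric per-row INT4 quantization with an assumed scale rule, and the function names are hypothetical. The bound relies on one fact: the attention output is a convex combination of value rows, so its element-wise error cannot exceed the largest per-element rounding error, which is half the largest scale.

```python
import math

def quantize(vec, bits):
    """Symmetric uniform quantization; returns integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in vec) / qmax or 1.0  # all-zero rows get a dummy scale
    return [round(x / scale) for x in vec], scale

def softmax(scores):
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    z = sum(e)
    return [x / z for x in e]

def attend(weights, values):
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def certified_attention(scores, values, tol, bits=4):
    """Return (output, error bound, route); fall back to exact values if the bound exceeds tol."""
    weights = softmax(scores)
    quantized = [quantize(v, bits) for v in values]
    # Rounding moves each element by at most scale/2, and the softmax weights sum to 1.
    bound = max(scale for _, scale in quantized) / 2
    if bound > tol:
        return attend(weights, values), 0.0, "fallback"
    dequantized = [[q * scale for q in codes] for codes, scale in quantized]
    return attend(weights, dequantized), bound, "quantized"

scores = [2.0, 0.5, -1.0]
values = [[0.9, -0.3], [0.1, 0.7], [-0.5, 0.2]]

approx, bound, route = certified_attention(scores, values, tol=0.1)
exact, exact_bound, exact_route = certified_attention(scores, values, tol=0.0)
actual_error = max(abs(a - b) for a, b in zip(approx, exact))
```

With a loose tolerance the quantized path is taken and `actual_error` stays within `bound`; with a zero tolerance the same call returns the exact output through the fallback route. The bound is computed from the scales alone, before any output is produced, which is what makes it usable as a runtime gate.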

2605.20866 2026-05-21 cs.LG cs.DC math.OC stat.ML

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

LOSCAR-SGD:局部SGD与通信-计算重叠及延迟校正的稀疏模型平均

Yassine Maziane, Ammar Mahran, Artavazd Maranjyan, Peter Richtárik

Affiliations * KAUST (King Abdullah University of Science and Technology)

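AI summary: This paper studies Local SGD in a heterogeneous-compute setting, combining communication compression, local training, and communication-computation overlap. The proposed LOSCAR-SGD communicates only a sparse subset of model coordinates and keeps optimizing while communication is in flight, and the authors give what they describe as the first theoretical guarantees for this combination.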

Details
AI Chinese abstract

在分布式学习中,通信是主要的瓶颈,尤其是在大规模设置和联邦学习环境中链接缓慢时。减少此成本的三种标准方法是通信压缩、局部训练和通信-计算重叠。结合这些成分的方法在实践中被发现对大规模训练有效,但很少有理论支持同时结合这三种方法的方法。我们研究了一个异构计算环境,其中不同的工作者可能进行不同数量的局部步骤,并提出LOSCAR-SGD,一种局部SGD方法,仅通信模型坐标的稀疏子集,并在通信飞行期间继续优化。关键成分是延迟校正的合并规则,该规则在不丢弃重叠阶段所做进展的情况下整合延迟同步信息。我们为光滑非凸目标函数提供了收敛保证,并展示了稀疏性、重叠和工作者异质性如何影响收敛速度。据我们所知,这是首次针对这种成分组合的理论。实验进一步表明,通信-计算重叠减少了训练时间,并且延迟校正的合并优于朴素覆盖。

English abstract

Communication is a major bottleneck in distributed learning, especially in large-scale settings and in federated learning environments with slow links. Three standard ways to reduce this cost are communication compression, local training, and communication-computation overlap. Methods that combine these ingredients are used in practice and have been found to be effective for large-scale training, but there is little theory for methods that combine all three. We study a heterogeneous-compute setting in which different workers may take different numbers of local steps, and we propose LOSCAR-SGD, a Local SGD method that communicates only a sparse subset of model coordinates and continues optimizing while communication is in flight. A key ingredient is a delay-corrected merge rule that incorporates delayed synchronized information without discarding the progress made during the overlap phase. We give convergence guarantees for smooth non-convex objectives and show how sparsity, overlap, and worker heterogeneity affect the rate. To the best of our knowledge, this is the first theory for this combination of ingredients. Experiments further show that communication-computation overlap reduces training time and that the delay-corrected merge outperforms naive overwriting.
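
The difference between naive overwriting and a delay-corrected merge can be illustrated with a small NumPy sketch. The abstract does not state the merge rule, so the rule below is an assumption rather than the LOSCAR-SGD update: it takes the delayed average on the communicated coordinates and adds back the local progress made since the snapshot was sent. The function names and the toy numbers are hypothetical.

```python
import numpy as np

def naive_overwrite(x_local, avg_vals, idx):
    """Replace the communicated coordinates with the (stale) average, losing overlap progress."""
    out = x_local.copy()
    out[idx] = avg_vals
    return out

def delay_corrected_merge(x_local, x_snapshot, avg_vals, idx):
    """Assumed rule: stale average plus the local change accumulated while communication was in flight."""
    out = x_local.copy()
    out[idx] = avg_vals + (x_local[idx] - x_snapshot[idx])
    return out

# Model at the moment the sparse coordinates were sent.
x_snapshot = np.array([1.0, 2.0, 3.0, 4.0])
# Coordinates chosen for communication (a sparse subset).
idx = np.array([0, 2])
# Model after further local steps taken during the overlap phase.
x_local = np.array([0.8, 1.9, 2.7, 3.8])
# Average of the communicated coordinates across workers, arriving late.
avg_vals = np.array([0.5, 2.5])

print(naive_overwrite(x_local, avg_vals, idx))
print(delay_corrected_merge(x_local, x_snapshot, avg_vals, idx))
```

Coordinates that were not communicated keep their local values under both rules. On the communicated coordinates, only the corrected rule retains the steps taken during the overlap phase.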

2605.20865 2026-05-21 cs.LG cs.AI

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

多步似然比校正用于可验证奖励的强化学习

Deokgyu Yoon, Hyungkyu Kang, Joongkyu Lee, Byeongchan Kim, Gyungin Shin, Sungrae Park, Min-hwan Oh

Affiliations * Seoul National University; Upstage

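AI summary: This paper proposes N-Step Forward-Trace Policy Optimization (NFPO), which augments the PPO surrogate objective with an N-step forward trace built from the cumulative likelihood ratio of the next N-1 tokens, aiming at a tighter policy-improvement bound in reinforcement learning with verifiable rewards.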

Details
AI Chinese abstract

可验证奖励的强化学习(RLVR)在提升大语言模型的推理能力方面起着关键作用。然而,广泛使用的PPO替代目标本质上是局部的,因为它们依赖于精确策略梯度目标的局部近似。虽然这种近似通过减少重要性采样引起的方差来提高稳定性,但它也引入了结构偏差到替代目标中,必须通过信任区域机制进行控制。在本文中,我们引入了N步前向轨迹,通过累积下一个N-1个token的似然比来增强PPO替代目标。基于这一想法,我们提出了N步前向轨迹策略优化(NFPO),一种将N步前向轨迹整合到掩码策略梯度框架中的实用RLVR算法。NFPO提供了一个连续的桥梁,将PPO替代目标与精确策略梯度目标联系起来,提供了一种控制偏差-方差权衡的原理机制。我们的理论分析表明,通过适当选择N,所提出的目标比标准PPO替代目标提供了更紧的策略改进界。在全面推理基准测试中,实验表明NFPO一致地提高了性能,支持了我们的理论发现。

English abstract

Reinforcement learning with verifiable rewards (RLVR) plays a pivotal role in improving the reasoning ability of large language models. However, widely used PPO surrogate objectives are fundamentally local, as they rely on a local approximation of the exact policy gradient objective. While this approximation improves stability by reducing the variance induced by importance sampling, it also introduces structural bias into the surrogate objective, which must be controlled through trust region mechanisms. In this work, we introduce the $N$-step forward trace, which augments the PPO surrogate objective using the cumulative likelihood ratio of the next $N-1$ tokens. Building on this idea, we propose $N$-Step Forward-Trace Policy Optimization (NFPO), a practical RLVR algorithm that integrates the $N$-step forward trace into the masked policy gradient framework. NFPO provides a continuous bridge between the PPO surrogate objective and the exact policy gradient objective, offering a principled mechanism for controlling the bias-variance trade-off. Our theoretical analysis shows that, with an appropriate choice of $N$, the proposed objective yields a tighter policy-improvement bound than the standard PPO surrogate. Experiments on comprehensive reasoning benchmarks demonstrate that NFPO consistently improves performance, supporting our theoretical findings.
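
As a rough illustration of the forward trace, the sketch below computes, for each token position, the product of the current token's likelihood ratio and the ratios of the next N-1 tokens, truncated at the end of the sequence. This is one reading of the abstract, not the NFPO objective itself: how the weight enters the masked policy gradient loss, and any clipping or masking, are not specified here. The function name and inputs are hypothetical.

```python
import numpy as np

def forward_trace_weights(logp_new, logp_old, n):
    """Per-position product of likelihood ratios over a window of n tokens starting at t.

    logp_new, logp_old: per-token log-probabilities under the current and behavior policies.
    The window is truncated at the end of the sequence.
    """
    log_r = np.asarray(logp_new) - np.asarray(logp_old)
    T = len(log_r)
    # Prefix sums of log-ratios, so each window sum is a difference of two entries.
    csum = np.concatenate([[0.0], np.cumsum(log_r)])
    ends = np.minimum(np.arange(T) + n, T)
    return np.exp(csum[ends] - csum[:T])

ratios = np.array([2.0, 0.5, 4.0])
logp_old = np.zeros(3)
logp_new = np.log(ratios)
for n in (1, 2, 3):
    print(n, forward_trace_weights(logp_new, logp_old, n))
```

With `n = 1` the weights reduce to the ordinary per-token ratio used in the PPO surrogate. Larger `n` multiplies in more future ratios, which is where the bias-variance trade-off mentioned in the abstract would come into play.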

2605.20856 2026-05-21 cs.RO cs.AI cs.LG

DISC: Decoupling Instruction from State-Conditioned Control via Policy Generation

DISC: 通过策略生成解耦指令与状态条件控制

Hanxiang Ren, Pei Zhou, Xunzhe Zhou, Yanchao Yang

Affiliations * Zhejiang University; The University of Hong Kong; TranscEngram

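AI summary: DISC decouples instruction from state-conditioned control by generating the policy itself from the instruction, which removes the observation-leakage pathway created by task-state entanglement. It performs strongly on several benchmarks, supporting the claim that language-generated policy parameters drive behavior.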

Details
AI Chinese abstract

语言条件的操控策略通常通过共享的网络参数处理指令和观察。这种任务-状态耦合为观察泄漏提供了途径:网络会学到完全绕过语言接地的场景到动作捷径。DISC从结构上消除了这一失败模式。DISC不是让通用策略以语言为条件,而是使用超网络仅根据指令生成任务特定视觉运动策略的全部参数。生成的策略从不直接访问语言;因此,其任务意识必须来自语言。于是,观察泄漏无从产生。另一方面,生成一致的高维策略权重本身就是一个具有挑战性的问题。我们用两阶段超网络来解决它,其细化阶段将基于梯度的优化结构作为前馈归纳偏置嵌入,无需实际的梯度计算即可产生全局一致的参数。DISC完全从头训练,使用标准数据预算,在LIBERO-90和Meta-World上优于所有耦合基线,且在复杂的长时程任务上优势进一步扩大,并在不使用外部预训练数据的情况下超越了大规模预训练的π₀。在一个所有任务共享相同视觉上下文的真实世界基准上,DISC显著优于耦合的替代方案,直接证实了驱动行为的是由语言生成的策略参数,而非视觉捷径。超网络还学到了一个语义结构化的参数流形,能够利用极少的演示实现少样本适应,并对改写后的指令实现稳健泛化。我们的代码可在 https://github.com/ReNginx/DISC 获取。

English abstract

Language-conditioned manipulation policies typically process instructions and observations through shared network parameters. This task-state entanglement provides a pathway for observation leakage -- networks learn scene-to-action shortcuts that bypass language grounding entirely. DISC eliminates this failure structurally. Rather than conditioning a universal policy on language, DISC uses a hypernetwork to generate the entire parameter set of a task-specific visuomotor policy from the instruction alone. The generated policy never directly accesses language; therefore, its task-awareness must come from the language. Consequently, observation leakage has no pathway to emerge. On the other hand, generating coherent high-dimensional policy weights is itself a challenging problem. We address it with a two-stage hypernetwork whose refinement stage embeds the structure of gradient-based optimization as a feed-forward inductive bias, producing globally consistent parameters without actual gradient computation. Trained entirely from scratch on standard data budgets, DISC outperforms all entangled baselines on LIBERO-90 and Meta-World, with advantages that widen on complex, long-horizon tasks -- and surpasses the large-scale pretrained $π_0$ despite using no external pretraining data. On a real-world benchmark where all tasks share identical visual context, DISC substantially outperforms entangled alternatives, directly confirming that language-generated policy parameters, not visual shortcuts, drive behavior. The hypernetwork further learns a semantically structured parameter manifold that enables few-shot adaptation from minimal demonstrations and robust generalization across paraphrased instructions. Our code is available at: https://github.com/ReNginx/DISC.
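
The structural separation described above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the DISC architecture: the hypernetwork here is a single linear map rather than the two-stage design with a refinement stage, the policy is a small MLP, and every name and dimension (`HyperNetwork`, `policy`, `EMB_DIM`, and so on) is hypothetical. The point it shows is that the policy function receives only generated parameters and an observation, never the instruction.

```python
import numpy as np

OBS_DIM, HIDDEN, ACT_DIM, EMB_DIM = 6, 8, 3, 4
# Shapes of the generated policy parameters: W1, b1, W2, b2.
SHAPES = [(HIDDEN, OBS_DIM), (HIDDEN,), (ACT_DIM, HIDDEN), (ACT_DIM,)]
N_PARAMS = sum(int(np.prod(s)) for s in SHAPES)

class HyperNetwork:
    """Maps an instruction embedding to the full parameter set of a small policy."""

    def __init__(self, rng):
        self.W = rng.standard_normal((N_PARAMS, EMB_DIM)) * 0.1
        self.b = np.zeros(N_PARAMS)

    def generate(self, instr_emb):
        flat = self.W @ instr_emb + self.b
        params, i = [], 0
        for shape in SHAPES:
            n = int(np.prod(shape))
            params.append(flat[i:i + n].reshape(shape))
            i += n
        return params

def policy(params, obs):
    """Task-specific policy: sees only the observation, never the instruction."""
    W1, b1, W2, b2 = params
    return W2 @ np.tanh(W1 @ obs + b1) + b2

rng = np.random.default_rng(0)
hyper = HyperNetwork(rng)
obs = rng.standard_normal(OBS_DIM)

# Two placeholder instruction embeddings standing in for encoded language.
emb_a = np.array([1.0, 0.0, 0.0, 0.0])
emb_b = np.array([0.0, 1.0, 0.0, 0.0])
print(policy(hyper.generate(emb_a), obs))
print(policy(hyper.generate(emb_b), obs))
```

Because the observation is identical in both calls, any difference between the two actions can only come from the generated parameters, which mirrors the argument made for the shared-visual-context benchmark.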