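arXivDaily: a daily digest of academic papers, synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and more.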

2602.12005 2026-06-01 cs.CL

LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

Szilvia Ujváry, Louis Béthune, Pierre Ablin, João Monteiro, Marco Cuturi, Michael Kirchhof

Affiliations: Apple; University of Cambridge

AI summary: Studies which tokens a small language model (SLM) should learn during pretraining and which it should delegate via <CALL>, and proposes LaCy, a method that combines loss and factuality signals to improve the factual accuracy of SLMs in cascaded generation.

Comments 40 pages, 26 figures, 10 tables, preprint. v3-v4: new results for RAG, ablations and additional analysis

Abstract

Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. The capacity of Small Language Models (SLMs) is especially limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a <CALL> token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground truth, it is insufficient for identifying which predictions would actually lead to factually or semantically invalid continuations. Some high-loss tokens correspond to acceptable alternative continuations of a pretraining document and therefore should not trigger a <CALL>. This suggests that learnability cannot be characterized from loss alone, but requires additional domain-specific signals about the role of a token in the sentence. In Wikipedia-like domains, we show that augmenting the loss signal with lightweight grammatical information from a spaCy parser substantially improves delegation decisions. Based on this insight, we propose LaCy, a novel pretraining method that combines loss with factuality signals to decide which tokens an SLM should learn. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and when to call for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.


2602.00747 2026-06-01 cs.CL cs.AI

Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao

Affiliations: NLP Team, Xiaohongshu Inc., Shanghai, China; Tsinghua University, Beijing, China; The University of Tokyo, Tokyo, Japan

AI summary: Proposes the DeMix framework, which uses model merging to predict optimal data mixture ratios, improving benchmark performance while lowering search cost.

Comments 18 pages, 5 figures, accepted at ICML 2026

Abstract

Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora are available at https://github.com/Lucius-lsr/DeMix.
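
The weighted-merging idea in this abstract can be illustrated with a small sketch. The abstract does not give the exact merging rule, so this assumes plain linear averaging of parameters with weights equal to the sampled mixture ratios; the function name `merge_models` and the toy parameter dictionaries are hypothetical and are not taken from the DeMix code.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # parameter name -> flattened values


def merge_models(components: List[Params], weights: List[float]) -> Params:
    """Linearly average component models (an assumed merging rule)."""
    if len(components) != len(weights):
        raise ValueError("need one weight per component model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]  # normalise so the ratios sum to 1
    merged: Params = {}
    for name in components[0]:
        size = len(components[0][name])
        merged[name] = [
            sum(w * comp[name][i] for w, comp in zip(norm, components))
            for i in range(size)
        ]
    return merged


# Hypothetical component models, one trained per candidate dataset.
general = {"layer.weight": [1.0, 0.0]}
math_model = {"layer.weight": [0.0, 2.0]}

# A proxy for the 75% general / 25% math mixture, built without extra training.
proxy = merge_models([general, math_model], [0.75, 0.25])
print(proxy)  # {'layer.weight': [0.75, 0.5]}
```

Under this assumption, evaluating another candidate mixture only requires another call with different weights, which is the sense in which search cost is separated from training cost.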


2605.15706 2026-06-01 cs.LG

Differentiable Mixture-of-Agents Incentivizes Swarm Intelligence of Large Language Models

Xingjian Wu, Junkai Lu, Siyu Yan, Xiangfei Qiu, Jilin Hu, Chenjuan Guo, Bin Yang

Affiliations: East China Normal University

AI summary: Proposes Differentiable Mixture-of-Agents (DMoA), a framework that dynamically activates agents through a differentiable, context-aware routing mechanism, enabling elastic collaboration during inference and achieving state-of-the-art performance on 9 benchmarks.

Abstract

Recent advances in Large Language Models (LLMs) have catalyzed the development of multi-agent systems (MAS) for complex reasoning tasks. However, existing MAS typically rely on pre-defined or pre-compiled communication topologies, which limits their flexibility and adaptability to dynamic task requirements. In this work, we propose Differentiable Mixture-of-Agents (DMoA), a self-evolving multi-agent framework that enables elastic and adaptive agent collaboration during inference. Instead of statically constructing workflows, DMoA dynamically routes and activates agents at each reasoning step, allowing the system to implicitly simulate diverse communication topologies and adapt to evolving demands. To achieve this, we design a differentiable, context-aware routing mechanism that leverages recurrent structures to incorporate historical and contextual information, producing sparse agent activations in a step-wise manner. Furthermore, we introduce predictive entropy as self-supervised signals to optimize the routing process, enabling efficient test-time adaptation without external annotations. Extensive experiments across 9 benchmarks demonstrate that DMoA achieves state-of-the-art performance while exhibiting strong efficiency, robustness, and ensembling capabilities.
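
The predictive entropy mentioned in this abstract is, in its standard form, the Shannon entropy of a model's output distribution. The sketch below computes only that quantity; how DMoA feeds it back into the router is not described in the abstract and is not modelled here, and the example distributions are made up.

```python
import math
from typing import Sequence


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats of a categorical distribution."""
    # Zero-probability outcomes contribute nothing to the sum.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical output distributions over four candidate answers.
confident = predictive_entropy([0.97, 0.01, 0.01, 0.01])
uncertain = predictive_entropy([0.25, 0.25, 0.25, 0.25])

# A lower value means a more peaked prediction; no labels are needed.
print(confident < uncertain)  # True
```

Because the quantity depends only on the model's own predictions, it can serve as a training signal at test time without external annotations.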


2605.15470 2026-06-01 cs.LG physics.ao-ph

Njord: A Probabilistic Graph Neural Network for Ensemble Ocean Forecasting

Daniel Holmberg, Joel Oskarsson, Erik Wikingsson, Fredrik Lindsten, Teemu Roos

Affiliations: University of Helsinki; ETH AI Center; Linköping University

AI summary: Proposes Njord, a probabilistic model that combines a deep latent variable framework with a graph neural network to sample each step of an ensemble forecast in a single forward pass for global and regional oceans, introduces K-means cluster meshes to fit irregular sea surface geometry, and achieves lower errors than deterministic baselines when evaluated against observations.

Comments Preprint

Abstract

Ocean dynamics are inherently chaotic, yet existing machine learning ocean models produce only deterministic forecasts. We introduce Njord, a probabilistic data-driven model for ocean forecasting, applicable to both global and regional domains. Njord combines a deep latent variable framework with a graph neural network architecture, enabling sampling each forecast step in a single forward pass. We apply Njord globally at 0.25° resolution and regionally to the Baltic Sea at 2 km resolution. To scale to these large ocean grids we introduce K-means cluster meshes that adapt to irregular sea surface geometry. Experiments demonstrate strong performance on both domains compared to deterministic machine learning baselines, while also providing uncertainty estimates from the sampled ensemble forecasts. On the global OceanBench benchmark, Njord achieves the lowest errors on average across upper-ocean variables when evaluated against real-world observations, with the largest improvements in surface temperature prediction.


2604.15215 2026-06-01 cs.RO

A Hierarchical Spatiotemporal Action Tokenizer for In-Context Imitation Learning in Robotics

Fawad Javed Fateh, Ali Shah Ali, Murad Popattia, Usman Nizamani, Andrey Konin, M. Zeeshan Zia, Quoc-Huy Tran

Affiliations: Retrocausal, Inc.

AI summary: Proposes HiST-AT, a hierarchical spatiotemporal action tokenizer that clusters actions hierarchically through two levels of vector quantization and uses both spatial and temporal information for reconstruction, reaching state-of-the-art performance on multiple simulated and real robotic manipulation benchmarks.

Abstract

We present a novel hierarchical spatiotemporal action tokenizer for in-context imitation learning. We first propose a hierarchical approach, which consists of two successive levels of vector quantization. In particular, the lower level assigns input actions to fine-grained subclusters, while the higher level further maps fine-grained subclusters to clusters. Our hierarchical approach outperforms the non-hierarchical counterpart, while mainly exploiting spatial information by reconstructing input actions. Furthermore, we extend our approach by utilizing both spatial and temporal cues, forming a hierarchical spatiotemporal action tokenizer, namely HiST-AT. Specifically, our hierarchical spatiotemporal approach conducts multi-level clustering, while simultaneously recovering input actions and their associated timestamps. Finally, extensive evaluations on multiple simulation and real robotic manipulation benchmarks show that our approach establishes a new state-of-the-art performance in in-context imitation learning.
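
A minimal sketch of the two successive quantization levels described above, assuming nearest-neighbour lookup under squared Euclidean distance. The codebooks, the function names `nearest` and `tokenize`, and the 2D toy actions are hypothetical; the timestamp reconstruction that makes HiST-AT spatiotemporal is not modelled.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(vec: Vector, codebook: List[Vector]) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def tokenize(action: Vector,
             sub_codebook: List[Vector],
             cluster_codebook: List[Vector]) -> Tuple[int, int]:
    """Lower level: action -> subcluster. Higher level: subcluster -> cluster."""
    sub_id = nearest(action, sub_codebook)
    cluster_id = nearest(sub_codebook[sub_id], cluster_codebook)
    return sub_id, cluster_id


# Hypothetical codebooks: four fine-grained subclusters, two coarse clusters.
sub_codebook = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
cluster_codebook = [[0.0, 0.5], [5.0, 5.5]]

print(tokenize([0.1, 0.9], sub_codebook, cluster_codebook))  # (1, 0)
print(tokenize([4.8, 5.2], sub_codebook, cluster_codebook))  # (2, 1)
```

In a trained tokenizer the codebooks would be learned; here they are fixed by hand only to show how the two levels compose.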


2410.06074 2026-06-01 cs.LG cs.NA math.NA

Scalable Mechanistic Neural Networks for Differential Equations and Machine Learning

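Jiale Chen, Dingling Yao, Adeel Pervez, Dan Alistarh, Francesco Locatello

Affiliations: Institute of Science and Technology Austria (ISTA)

AI summary: Proposes Scalable Mechanistic Neural Networks (S-MNN), which reduce computational and space complexity to linear in sequence length, enabling efficient modeling of long-term dynamics while preserving accuracy and interpretability.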

Comments Published as a conference paper at the Thirteenth International Conference on Learning Representations (ICLR 2025): https://openreview.net/forum?id=Oazgf8A24z

Journal ref International Conference on Learning Representations, 2025, pp. 10018-10039

2601.15197 2026-06-01 cs.AI cs.CL cs.CV cs.RO

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

Affiliations: Huazhong University of Science and Technology; Beijing Zhongguancun Academy; Zhongguancun Institute of Artificial Intelligence; Harbin Institute of Technology; The Hong Kong University of Science and Technology (Guangzhou); Zhengzhou University; Beihang University; East China Normal University; DeepCybot Co., Ltd.

AI summary: Addresses the problem that VLA models ignore language information because of dataset bias during training; proposes LangForce, a framework that builds a dual-branch architecture via Bayesian decomposition and latent action queries and maximizes the pointwise mutual information between actions and instructions, significantly improving generalization without new data.

Comments ICML 2026

Abstract

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
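
For reference, the conditional PMI named in this abstract has a standard form when written with the two branches it introduces. The abstract does not state the exact training loss, so the following is only the textbook definition expressed in the abstract's own symbols:

```latex
\mathrm{PMI}(a;\, \ell \mid v)
  = \log \frac{\pi(a \mid v, \ell)}{p(a \mid v)}
  = \log \pi(a \mid v, \ell) - \log p(a \mid v)
```

The quantity is zero when the instruction $\ell$ does not change the probability of action $a$ given the observation $v$, which is the vision-shortcut case, and it grows when the instruction makes the action more likely than vision alone would.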


2511.16084 2026-06-01 cs.CV cs.AI

SpectralTrain: A Universal Framework for Hyperspectral Image Classification

Meihua Zhou, Liping Yu, Xinyu Tong, Wai Kin Fung, Ruiguo Hu, Jiarui Zhao, Nan Wan

Affiliations: School of Medical Information, Wannan Medical University; University of Chinese Academy of Sciences; The Chinese University of Hong Kong; Northeastern University

AI summary: Proposes SpectralTrain, a universal training framework that improves the efficiency of hyperspectral image classification by combining curriculum learning with PCA-based spectral downsampling, achieving 2-7x training speedups across multiple datasets with small accuracy loss.

Abstract

Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral-spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets (Indian Pines, Salinas-A, and the newly introduced CloudPatch-7) demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent 2-7x reductions in training time, with small-to-moderate accuracy deltas depending on the backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.


2605.11946 2026-06-01 cs.AI

Counterfactual Trace Auditing of LLM Agent Skills

Xiaolin Zhou, Jinbo Liu, Li Li, Ryan A. Rossi, Xiyang Hu

Affiliations: Arizona State University; University of Southern California; Adobe Research

AI summary: Proposes the Counterfactual Trace Auditing (CTA) framework, which pairs agent traces with and without a skill and produces structured Skill Influence Pattern (SIP) annotations, revealing how skills reshape behavior and covering what pass-rate-only evaluation misses.

Comments Code and data are available at https://github.com/WillChow66/CTA.git

Abstract

Large Language Model agents are increasingly augmented with agent skills. Current evaluation methods for skills remain limited. Most deployed benchmarks report only pass rate before and after a skill is attached, treating the skill as a black-box change to agent behavior. We introduce Counterfactual Trace Auditing (CTA), a framework for measuring how a skill changes agent behavior. CTA pairs each with-skill agent trace with a without-skill counterpart on the same task, segments both traces into goal-directed phases, aligns the phases, and emits structured Skill Influence Pattern (SIP) annotations. These annotations describe the behavioral effect of a skill rather than only its task outcome. We instantiate CTA on SWE-Skills-Bench with Claude across 49 software engineering tasks. The resulting audit reveals a clear evaluation gap. Pass rate changes by only +0.3 percentage points on average, suggesting little aggregate effect. Yet CTA identifies 522 SIP instances across the same paired traces, showing that the skills substantially reshape agent behavior even when pass rate is nearly unchanged. The audit also separates several recurring effects that pass rate cannot detect, including literal template copying, off-task artifact creation, excess planning, and task recovery. Three findings emerge. First, high-baseline tasks contain most of the observed skill effects, although their pass rate is already saturated and therefore cannot reflect those effects. Second, tasks with moderate baseline performance show the most recoverable gain, but often at substantially higher token cost. Third, the dominant SIP type can be identified by baseline bucket: surface anchoring is most common on ceiling tasks and edge-case prompting is most common on mid-range and floor tasks. These regularities turn informal failure-mode observations into reproducible behavioral measurements.


2605.11367 2026-06-01 cs.CV

3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

Yifan Yin, Zehao Wen, Suyu Ye, Jieneng Chen, Zehan Zheng, Nanru Dai, Haojun Shi, Aydan Huang, Zheyuan Zhang, Alan Yuille, Jianwen Xie, Ayush Tewari, Tianmin Shu

Affiliations: Johns Hopkins University; Lambda; University of Cambridge

AI summary: Proposes 3D-Belief, a generative 3D world model that updates explicit 3D beliefs online so that embodied agents can imagine scene completions and reason in partially observable environments, outperforming existing methods on 2D/3D imagination quality and downstream object navigation tasks.

Abstract

Recent advances in visual generative models have highlighted the promise of learning generative world models. However, most existing approaches frame world modeling as novel-view synthesis or future-frame prediction, emphasizing visual realism rather than the structured uncertainty required by embodied agents acting under partial observability. In this work, we propose a different perspective: world modeling as embodied belief inference in 3D space. From this view, a world model should not merely render what may be seen, but maintain and update an agent's belief about the unobserved 3D world as new observations are acquired. We identify several key capabilities for such models, including spatially consistent scene memory, multi-hypothesis belief sampling, sequential belief updating, and semantically informed prediction of unseen regions. We instantiate these ideas in 3D-Belief, a generative 3D world model that infers explicit, actionable 3D beliefs from partial observations and updates them online over time. Unlike prior visual prediction models, 3D-Belief represents uncertainty directly in 3D, enabling embodied agents to imagine plausible scene completions and reason over partially observed environments. We evaluate 3D-Belief on 2D visual quality for scene memory and unobserved-scene imagination, object- and scene-level 3D imagination using our proposed 3D-CORE benchmark, and challenging object navigation tasks in both simulation and the real world. Experiments show that 3D-Belief improves 2D and 3D imagination quality and downstream embodied task performance compared to state-of-the-art methods.


2605.11134 2026-06-01 cs.LG cs.AI

Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

Christian Moya, Alex Semendinger, Guang Lin, Elliott Thornley

Affiliations: Department of Mathematics, Purdue University, West Lafayette IN, USA; School of Mechanical Engineering, Purdue University, West Lafayette IN, USA; Massachusetts Institute of Technology, Cambridge MA, USA

AI summary: Through a unified theoretical analysis, reveals the mechanisms of spurious correlation learning in preference optimization (e.g., DPO), namely mean spurious bias and causal-spurious correlation leakage, shows that they cause an irreducible vulnerability to distribution shift, and proposes tie training, a data augmentation strategy that selectively reduces spurious learning.

Comments Proceedings of the 43rd International Conference on Machine Learning, 2026, Seoul, South Korea

Journal ref Proceedings of the 43rd International Conference on Machine Learning, 2026, Seoul, South Korea

Abstract

Preference learning methods like Direct Preference Optimization (DPO) are known to induce reliance on spurious correlations, leading to sycophancy and length bias in today's language models and potentially severe goal misgeneralization in future systems. In this work, we provide a unified theoretical analysis of this phenomenon, characterizing the mechanisms of spurious learning, its consequences on deployment, and a provable mitigation strategy. Focusing on log-linear policies, we show that standard preference-learning objectives induce reliance on spurious features at the population level through two channels: mean spurious bias and causal-spurious correlation leakage. We then show that this reliance creates an irreducible vulnerability to distribution shift: more data from the same training distribution fails to reduce the model's dependence on spurious features. To address this, we propose tie training, a data augmentation strategy using ties (equal-utility preference pairs) to introduce data-driven regularization. We demonstrate that this approach selectively reduces spurious learning without degrading causal learning. Finally, we validate our theory on log-linear models and provide empirical evidence that both the spurious learning mechanisms and the benefits of tie training persist for neural networks and large language models.


2605.06280 2026-06-01 cs.CV

Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

Thong Nguyen, Khoi M. Le, Cong-Duy Nguyen, Luu Anh Tuan, See-Kiong Ng, Chunyan Miao

Affiliations: National University of Singapore; Centre for AI Research, VinUniversity; Nanyang Technological University

AI summary: Proposes guiding generation with adjacent-frame Eulerian motion fields and handling occlusion through a bidirectional geometric consistency mechanism, which accelerates training, preserves temporal coherence, and reduces dynamic artifacts.

Comments Work in progress. Code is available at https://github.com/nguyentthong/eulerian_motion_guidance

Abstract

Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. This paper revisits the same optical-flow primitive through a more local supervision design: we use adjacent-frame Eulerian motion fields to guide generation, where the motion signal always describes a short temporal hop. This shift enables parallelized training and provides bounded-error supervision throughout the generation process. To mitigate the drift artifacts common in adjacent frame generation, we introduce a Bidirectional Geometric Consistency mechanism, which computes a forward-backward cycle check to mathematically identify and mask occluded regions, preventing the model from learning incorrect warping objectives. Extensive experiments demonstrate that our approach accelerates training, preserves temporal coherence, and reduces dynamic artifacts compared to reference-based baselines.
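
A forward-backward cycle check of the kind this abstract describes can be sketched in one dimension. This is a simplified illustration, not the authors' implementation: it assumes nearest-pixel lookup, a 1D row of pixels instead of an image, and a hypothetical tolerance `tol`.

```python
from typing import List


def occlusion_mask(forward: List[float],
                   backward: List[float],
                   tol: float = 0.5) -> List[bool]:
    """True where the forward-backward round trip fails (likely occluded)."""
    n = len(forward)
    mask: List[bool] = []
    for x in range(n):
        # Where pixel x lands in the next frame (nearest-pixel rounding).
        target = round(x + forward[x])
        if target < 0 or target >= n:
            mask.append(True)  # leaves the frame: no valid correspondence
            continue
        # A consistent pair satisfies forward[x] + backward[target] == 0.
        round_trip = forward[x] + backward[target]
        mask.append(abs(round_trip) > tol)
    return mask


# Hypothetical flows between two adjacent frames of a four-pixel row.
forward = [1.0, 1.0, 1.0, 0.0]
backward = [0.0, -1.0, -1.0, 2.0]

print(occlusion_mask(forward, backward))  # [False, False, True, True]
```

Pixels flagged `True` would be excluded from the warping loss, so the model is not supervised on correspondences that do not exist.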


2605.01581 2026-06-01 cs.RO

Hyper-DP3: Frequency-Aware Right-Sizing of 3D Diffusion Policies for Visuomotor Control

Jinhao Zhang, Zhexuan Zhou, Huizhe Li, Yichen Lai, Wenlong Xia, Haoming Song, Youmin Gong, Jie Mei

Affiliations: Harbin Institute of Technology, Shenzhen; Shanghai Jiao Tong University

AI summary: Targeting the high computational cost of diffusion policies in robotic manipulation, analyzes the smoothness of action trajectories from a frequency-domain perspective and proposes Hyper-DP3, a lightweight 3D diffusion policy with a Diffusion Mixer decoder and two-step DDIM inference that achieves state-of-the-art performance with very few parameters and low latency.

Abstract

Diffusion-based visuomotor policies perform well in robotic manipulation, yet current methods still inherit image-generation-style decoders and multi-step sampling. We revisit this design from a frequency-domain perspective. Robot action trajectories are highly smooth, with most energy concentrated in a few low-frequency discrete cosine transform modes. Under this structure, we show that the error of the optimal denoiser is bounded by the low-frequency subspace dimension and residual high-frequency energy, implying that denoising error saturates after very few reverse steps. This also suggests that action denoising requires a much simpler denoising model than image generation. Motivated by this insight, we propose Hyper-DP3 (HDP3), a pocket-scale 3D diffusion policy with a lightweight Diffusion Mixer decoder that supports two-step DDIM inference. Our synthetic experiments validate the theory and support the sufficiency of two-step denoising. Furthermore, across RoboTwin2.0, Adroit, MetaWorld, and real-world tasks, HDP3 achieves state-of-the-art performance with fewer than 1% of the parameters of prior 3D diffusion-based policies and substantially lower inference latency.
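
The claim that smooth trajectories concentrate their energy in a few low-frequency DCT modes can be checked on a toy signal. The sketch below uses an orthonormal DCT-II written out by hand; the half-sine trajectory, the horizon of 32 steps, and the cutoff of 4 modes are made-up illustration values, not settings from the paper.

```python
import math
from typing import List


def dct2(signal: List[float]) -> List[float]:
    """Orthonormal DCT-II of a 1D signal."""
    n = len(signal)
    out: List[float] = []
    for k in range(n):
        s = sum(signal[t] * math.cos(math.pi * (t + 0.5) * k / n)
                for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def low_freq_energy_fraction(signal: List[float], num_modes: int) -> float:
    """Share of total energy carried by the first num_modes DCT modes."""
    coeffs = dct2(signal)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:num_modes]) / total


# Hypothetical smooth one-dimensional action trajectory (a half-sine arch).
horizon = 32
trajectory = [math.sin(math.pi * t / (horizon - 1)) for t in range(horizon)]

fraction = low_freq_energy_fraction(trajectory, 4)
print(f"energy in first 4 of {horizon} modes: {fraction:.4f}")
```

Because the transform is orthonormal, the coefficient energies sum to the signal energy, so the printed fraction is a true share of the trajectory's energy.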


2604.12579 2026-06-01 cs.LG

EEG-Based Multimodal Learning via Hyperbolic Mixture-of-Curvature Experts

Runhe Zhou, Shanglin Li, Guanxiang Huang, Xinliang Zhou, Qibin Zhao, Motoaki Kawanabe, Yi Ding, Cuntai Guan

Affiliations: Nanyang Technological University, Singapore; BIFOLD, Berlin Institute for the Foundations of Learning; University of Cambridge, Cambridge, UK

AI summary: Proposes the EEG-MoCE framework, which assigns each modality an expert in a hyperbolic space with learnable curvature and uses a curvature-aware fusion strategy to model hierarchical structure, achieving state-of-the-art performance on emotion recognition, sleep staging, and cognitive assessment tasks.

Comments Accepted at the Forty-third International Conference on Machine Learning (ICML 2026)

Abstract

Electroencephalography (EEG)-based multimodal learning integrates brain signals with complementary modalities to improve mental state assessment, providing great clinical potential. The effectiveness of such paradigms largely depends on the representation learning on heterogeneous modalities. For EEG-based paradigms, one promising approach is to leverage their hierarchical structures, as recent studies have shown that both EEG and associated modalities (e.g., facial expressions) exhibit hierarchical structures reflecting complex cognitive processes. However, Euclidean embeddings struggle to represent these hierarchical structures due to their flat geometry, while hyperbolic spaces, with their exponential growth property, are naturally suited for them. In this work, we propose EEG-MoCE, a novel hyperbolic mixture-of-curvature experts framework designed for multimodal neurotechnology. EEG-MoCE assigns each modality to an expert in a learnable-curvature hyperbolic space, enabling adaptive modeling of its intrinsic geometry. A curvature-aware fusion strategy then dynamically weights experts, emphasizing modalities with richer hierarchical information. Extensive experiments on benchmark datasets demonstrate that EEG-MoCE achieves state-of-the-art performance on tasks including emotion recognition, sleep staging, and cognitive assessment. Code is available at https://github.com/zhourunhe/EEG-MoCE.


2602.16165 2026-06-01 cs.LG cs.AI

HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

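Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, Mingyi Hong

Affiliations * University of Minnesota; Northwestern University; Amazon AGI; Texas A&M University; Cisco Research

AI summary: To address the difficulty of credit assignment for LLM agents in sparse-reward, long-horizon tasks, the paper proposes HiPER, a hierarchical plan-execute framework that uses hierarchical advantage estimation (HAE) to assign credit explicitly at both the planning and execution levels, reaching success rates of 97.4% on ALFWorld and 83.3% on WebShop.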

Comments ICML 2026

Abstract

Training LLMs as interactive agents for multi-turn decision-making remains challenging, particularly in long-horizon tasks with sparse and delayed rewards, where agents must execute extended sequences of actions before receiving meaningful feedback. Most existing reinforcement learning (RL) approaches model LLM agents as flat policies operating at a single time scale, selecting one action at each turn. In sparse-reward settings, such flat policies must propagate credit across the entire trajectory without explicit temporal abstraction, which often leads to unstable optimization and inefficient credit assignment. We propose HiPER, a novel Hierarchical Plan-Execute RL framework that explicitly separates high-level planning from low-level execution. HiPER factorizes the policy into a high-level planner that proposes subgoals and a low-level executor that carries them out over multiple action steps. To align optimization with this structure, we introduce a key technique called hierarchical advantage estimation (HAE), which carefully assigns credit at both the planning and execution levels. By aggregating returns over the execution of each subgoal and coordinating updates across the two levels, HAE provides an unbiased gradient estimator and provably reduces variance compared to flat generalized advantage estimation. Empirically, HiPER achieves state-of-the-art performance on challenging interactive benchmarks, reaching 97.4% success on ALFWorld and 83.3% on WebShop with Qwen2.5-7B-Instruct (+6.6% and +8.3% over the best prior method), with especially large gains on long-horizon tasks requiring multiple dependent subtasks. These results highlight the importance of explicit hierarchical decomposition for scalable RL training of multi-turn LLM agents.
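
For readers unfamiliar with the baseline named in this abstract, the sketch below shows flat generalized advantage estimation, plus a helper that aggregates rewards inside each subgoal segment. It is not the paper's HAE: the abstract does not give the estimator, so `segment_returns` and the `boundaries` argument are hypothetical names that only illustrate what "aggregating returns over the execution of each subgoal" could mean.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Flat generalized advantage estimation over one trajectory.

    values must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def segment_returns(rewards, boundaries, gamma=0.99):
    """Discounted reward collected inside each subgoal segment.

    boundaries lists the start index of every segment (hypothetical
    interface, assumed for illustration).
    """
    ends = boundaries[1:] + [len(rewards)]
    return [
        sum(gamma ** (t - start) * rewards[t] for t in range(start, end))
        for start, end in zip(boundaries, ends)
    ]

# Sparse reward: only the last of six steps pays off.
rewards = [0, 0, 0, 0, 0, 1]
values = [0.0] * 7
print(gae(rewards, values, gamma=1.0, lam=1.0))
print(segment_returns(rewards, boundaries=[0, 3], gamma=1.0))
```

With a sparse terminal reward, the flat estimator has to carry a single signal back through every step, whereas the segment view attributes it to the subgoal during which it arrived.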

2605.08145 2026-06-01 cs.CV cs.AI cs.LG

Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

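Yuriel Ryan, Hei Man Ip, Adriel Kuek, Paul Pu Liang, Roy Ka-Wei Lee

Affiliations * Singapore University of Technology and Design; DSO National Laboratories; Massachusetts Institute of Technology

AI summary: To address hallucination and robustness problems in vision language models, the paper proposes self-captioning multimodal interaction tuning, which amplifies redundant information across modalities to compensate for an impaired modality, and designs a Multimodal Interaction Gate that converts unique interactions into redundant ones. Experiments show a 38.3% reduction in visually induced errors and a 16.8% improvement in consistency.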

Comments Accepted to ICML 2026. Code: https://github.com/yurielryan/Multimodal-Interaction-Tuning

Abstract

Current vision language models face hallucination and robustness issues against ambiguous or corrupted modalities. We hypothesize that these issues can be addressed by exploiting the shared information between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions, namely the redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities, to determine their impact on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information to resolve these issues; yet, modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a Multimodal Interaction Gate: a mechanism to convert unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visually induced errors by 38.3% and improve consistency by 16.8%.

2605.06831 2026-06-01 cs.LG cs.AI

Why DDIM Hallucinates More Than DDPM: A Theoretical Analysis of Reverse Dynamics

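Muhammad H. Ashiq, Samanyu Arora, Abhinav N. Harish, Ishaan Kharbanda, Hung Yun Tseng, Grigorios G. Chrysos

Affiliations * University of Wisconsin-Madison

AI summary: Through a theoretical analysis of the reverse ODE (DDIM) and SDE (DDPM) under a Gaussian-mixture target, the paper proves that after a critical time τ DDIM gets stuck on the line segment between the two nearest modes, whereas the stochasticity of DDPM helps it escape this region and thereby avoid hallucination.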

Comments Accepted in ICML

2605.06137 2026-06-01 cs.CV cs.AI cs.LG

Autoregressive Visual Generation Needs a Prologue

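Bowen Zheng, Weijian Luo, Guang Yang, Colin Zhang, Tianyang Hu

Affiliations * The Chinese University of Hong Kong, Shenzhen; hi-Lab, Xiaohongshu Inc

AI summary: Proposes Prologue, which generates prepended prologue tokens to bridge the reconstruction-generation gap in autoregressive image generation, substantially improving generation performance without affecting reconstruction quality.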

Comments Code: https://github.com/Zyriix/prologue Demo: https://huggingface.co/spaces/Zyriix/prologue-demo

Abstract

In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.

2605.05520 2026-06-01 cs.LG stat.AP stat.ML

Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors

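Badr Moufad, Albina Ilina, Hai Victor Habi, Salem Lahlou, Yazid Janati, Hagit Messer, Eric Moulines

Affiliations * School of Electrical and Computer Engineering, Tel Aviv University, Tel Aviv, Israel

AI summary: Treats rain field reconstruction as a Bayesian inverse problem with a diffusion model as a high-fidelity spatial prior, and uses training-free posterior sampling methods (such as Plug-and-Play, Sequential Monte Carlo, and Replica Exchange) to outperform traditional methods.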

Comments Added link to source code

Journal ref ICML 2026

Abstract

Commercial Microwave Links (CMLs) offer dense spatial coverage for rainfall sensing but produce path-integrated measurements that make accurate ground-level reconstruction challenging. Existing methods typically oversimplify CMLs as point sensors and neglect the line-integral relationship between rainfall and signal attenuation, resulting in degraded performance under heterogeneous precipitation. In this work, we view rain field reconstruction as a Bayesian inverse problem with Diffusion Models (DMs) as high-fidelity spatial priors. We show that diffusion models better preserve key rainfall statistics compared to censored Gaussian processes. Framing rainfall estimation as a Bayesian inverse problem with a DM prior enables training-free posterior sampling using a broad family of methods, including Plug-and-Play, Sequential Monte Carlo, and Replica Exchange methods. Experiments on synthetic and real-world datasets demonstrate consistent improvements over established CML-based reconstruction baselines.
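
The difference between a point sensor and a path-integrated measurement, which this abstract turns on, can be sketched with a toy forward operator. The grid layout, the function name `path_integrated_rain`, and the nearest-cell sampling are all assumptions made for illustration; the conversion from rain rate to signal attenuation is deliberately left out because the abstract does not specify it.

```python
def path_integrated_rain(field, start, end, cell_km=1.0, samples=200):
    """Approximate the line integral of rain rate along one link.

    field: 2D list of rain rates (mm/h) indexed as field[row][col].
    start, end: link endpoints as (row, col) in cell units.
    Returns (path-averaged rate, rate integrated over the link length).
    """
    (r0, c0), (r1, c1) = start, end
    length_km = cell_km * ((r1 - r0) ** 2 + (c1 - c0) ** 2) ** 0.5
    total = 0.0
    for i in range(samples):
        t = (i + 0.5) / samples                 # midpoint rule along the link
        r = int(round(r0 + t * (r1 - r0)))      # nearest grid cell
        c = int(round(c0 + t * (c1 - c0)))
        total += field[r][c]
    mean_rate = total / samples
    return mean_rate, mean_rate * length_km

# A single intense rain cell crossed by a 3 km link.
field = [[0.0, 0.0, 10.0, 0.0]]
mean_rate, integral = path_integrated_rain(field, start=(0, 0), end=(0, 3))
print(mean_rate, integral)
```

The link reports an average well below the 10 mm/h peak, so treating that value as a point reading at the link midpoint would misplace and underestimate the rain cell. This is the kind of error under heterogeneous precipitation that the abstract describes.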

2605.01134 2026-06-01 cs.AI

To Use AI as Dice of Possibilities with Timing Computation

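Jia Li, Vipin Kumar, Rui Zhang

Affiliations * Department of Surgery, University of Minnesota; Department of Computer Science & Engineering, University of Minnesota

AI summary: This paper proposes a verb-based paradigm that defines timing computation and possibilities, so that AI can serve as a tool for realizing the grammar of thought, and it automatically discovers clinical trajectories and performs counterfactual timing inference on breast cancer patient data.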

2510.03096 2026-06-01 cs.LG

Adaptive Node Feature Selection For Graph Neural Networks

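Madeline Navarro, Ali Azizpour, Santiago Segarra

Affiliations * Department of Electrical and Computer Engineering, Rice University, Houston, TX, USA

AI summary: Proposes an adaptive node feature selection method that identifies and removes irrelevant features based on the change in validation performance after permuting feature values; it applies to arbitrary data, models, and tasks.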

Abstract

We propose an adaptive node feature selection approach for graph neural networks (GNNs) that identifies and removes unnecessary features during training. The ability to measure how features contribute to model output is key for interpreting decisions and reducing dimensionality by eliminating unhelpful variables. However, graph-structured data introduces complex dependencies that may be unsuited to classical feature importance metrics. Motivated by this, we present a data-, model-, and task-agnostic method that determines relevant features during training based on changes in validation performance upon permuting feature values. We theoretically motivate our approach by characterizing how the relationships between node data and graph structure influence GNN performance. Empirically, we show that (i) our highly general approach rivals the performance of tailored feature selection approaches that exploit prior assumptions; (ii) we return meaningful feature importance scores well before the GNN is fully trained; and (iii) our scores demonstrably extract relevant properties that inform feature importance for various graph learning settings.
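
The permutation test at the core of this abstract can be sketched in a few lines. The example below is a generic permutation-importance routine on tabular validation data, with a hand-written predictor standing in for a GNN; the function names, the metric, and the toy data are assumptions for illustration, and the graph structure and the during-training schedule described in the abstract are not modeled.

```python
import random

def permutation_importance(predict, X_val, y_val, metric, n_repeats=5, seed=0):
    """Average drop in a validation metric when one feature column is shuffled."""
    rng = random.Random(seed)
    base = metric([predict(row) for row in X_val], y_val)
    importances = []
    for j in range(len(X_val[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X_val]
            rng.shuffle(column)  # break the link between feature j and the target
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X_val, column)]
            drops.append(base - metric([predict(row) for row in permuted], y_val))
        importances.append(sum(drops) / n_repeats)
    return importances

def neg_mse(pred, target):
    """Higher is better, so a drop means the model got worse."""
    return -sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

# Toy setup: the target depends only on feature 0.
data_rng = random.Random(1)
X_val = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_val = [3.0 * row[0] for row in X_val]
predict = lambda row: 3.0 * row[0]  # stand-in for a trained model

print(permutation_importance(predict, X_val, y_val, neg_mse))
```

Feature 1 receives a score of zero because shuffling it leaves predictions unchanged, which makes it a candidate for removal; feature 0 receives a positive score.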

2605.00265 2026-06-01 cs.LG

Polaris: Coupled Orbital Polar Embeddings for Hierarchical Concept Learning

Sahil Mishra, Srinitish Srinivasan, Sourish Dasgupta, Tanmoy Chakraborty

Affiliations * Indian Institute of Technology Delhi, New Delhi, India; Indian Institute of Technology Delhi, Abu Dhabi, UAE; KDM Lab, Dhirubhai Ambani University Gandhinagar, Gujarat, India

AI summary: Proposes Polaris, a polar hyperspherical embedding framework that separates semantics from hierarchy via angle and radius and combines local constraints, global regularization, and uncertainty-aware asymmetric objectives, substantially improving retrieval performance across a range of hierarchy expansion tasks.

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Real-world knowledge is often organized as hierarchies such as product taxonomies, medical ontologies, and label trees, yet learning hierarchical representations is challenging due to asymmetric structure and noisy semantics. We introduce Polaris, a polar hyperspherical embedding framework that separates semanticity from hierarchy using angular geometry and radius, enabling the learning of meaning and structure without interference. To map a latent representation onto the sphere, we project it to the tangent space at the north pole, apply the exponential map, and learn unit-norm representations using spherical linear layers. Polaris then combines robust local constraints, global regularization that prevents geometric collapse, and uncertainty-aware asymmetric objectives that encourage directional containment. At inference time, Polaris uses structure-guided retrieval to efficiently narrow down candidate parents before final ranking. We evaluate Polaris on different settings of taxonomy expansion, spanning trees, multi-parent DAGs, and multimodal hierarchies, and show consistent improvements of up to ~19 points in top-K retrieval and up to ~60% reduction in mean rank over fourteen strong baselines.
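
The mapping step described in this abstract (project to the tangent space at the north pole, then apply the exponential map) has a standard closed form on the unit sphere, sketched below. The function name `to_sphere` and the convention that the north pole is the last coordinate axis are assumptions; the spherical linear layers and training objectives of Polaris are not reproduced.

```python
import math

def to_sphere(u):
    """Map a vector u in R^(d+1) onto the unit sphere via the north pole.

    Step 1: project u onto the tangent space at p = (0, ..., 0, 1), which
            simply discards the last coordinate.
    Step 2: apply the exponential map exp_p(v) = cos(|v|) p + sin(|v|) v / |v|.
    """
    tangent = u[:-1]
    norm = math.sqrt(sum(a * a for a in tangent))
    if norm == 0.0:
        return [0.0] * len(tangent) + [1.0]   # a zero tangent vector stays at the pole
    scale = math.sin(norm) / norm
    return [scale * a for a in tangent] + [math.cos(norm)]

point = to_sphere([0.3, -1.2, 0.5, 9.0])
print(point, sum(a * a for a in point))
```

Every output has unit norm by construction, so downstream losses can compare embeddings through angles alone.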

2604.26262 2026-06-01 cs.CV

Semantic Foam: Unifying Spatial and Semantic Scene Decomposition

Amr Sharafeldin, Shrisudhan Govindarajan, Thomas Walker, Aryan Mikaeili, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi

Affiliations * Simon Fraser University; University of Toronto; Wayve Technologies; University of British Columbia; University of Edinburgh

AI summary: Proposes Semantic Foam, which extends the Radiant Foam representation by combining the spatial decomposition of its Voronoi mesh with an explicit semantic feature field to achieve high-quality, consistent semantic segmentation.

Comments 15 pages, 10 figures, Accepted to CVPR 2026 (Highlight), Project page: http://semanticfoam.github.io/

Abstract

Modern scene reconstruction methods, such as 3D Gaussian Splatting, deliver photo-realistic novel view synthesis at real-time speeds, yet their adoption in interactive graphics applications has been limited. A major bottleneck is the difficulty of interacting with these representations compared to traditional, human-authored 3D assets. While previous research has attempted to impose semantic decomposition on these models, significant challenges remain regarding segmentation quality and consistency. To address this, we introduce Semantic Foam, extending the recently proposed Radiant Foam representation to semantic decomposition tasks. Our approach integrates the natural spatial volumetric decomposition of Radiant Foam's Voronoi mesh with an explicit semantic feature field parameterized at the cell level. This explicit structure enables direct spatial regularization, which prevents artifacts caused by occlusion or inconsistent supervision across views, which are common pitfalls for other point-based representations. Experimental results show that our method matches or exceeds the object-level segmentation performance of state-of-the-art methods such as Gaussian Grouping and SAGA.

2602.03216 2026-06-01 cs.CL cs.LG

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

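Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim

Affiliations * Department of Electrical and Computer Engineering, Seoul National University, Seoul, South Korea

AI summary: Proposes Token Sparse Attention, a lightweight, dynamic token-level sparsification mechanism that selects tokens in an interleaved manner and compresses and decompresses them before and after attention for efficient long-context inference, achieving up to 3.23x speedup at 128K context with less than 1% accuracy loss.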

Comments ICML 2026

Abstract

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head $Q$, $K$, $V$ to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including Flash Attention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves the accuracy-latency trade-off, achieving up to 3.23$\times$ attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.
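
The compress-attend-decompress pattern described in this abstract can be sketched for a single head in plain Python. Several details are assumptions made for illustration only: the `keep` indices are given rather than selected by a scoring rule, the attention is non-causal, and positions that were not selected receive a zero output (on the assumption that a residual connection carries them forward). None of this is taken from the paper's implementation.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V):
    """Dense single-head scaled dot-product attention (non-causal)."""
    d = len(Q[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def token_sparse_attention(Q, K, V, keep):
    """Attend over the tokens in `keep` only, then scatter back to full length."""
    Qs, Ks, Vs = ([M[i] for i in keep] for M in (Q, K, V))   # compress
    compact = attention(Qs, Ks, Vs)                          # any dense kernel fits here
    out = [[0.0] * len(V[0]) for _ in Q]                     # decompress
    for row, i in zip(compact, keep):
        out[i] = row
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(token_sparse_attention(Q, K, V, keep=[0, 2]))
```

Because the output keeps the original sequence length, a later layer or head is free to pick a different `keep` set, which is what distinguishes this pattern from permanent eviction. The inner `attention` call is an ordinary dense kernel, consistent with the compatibility claim in the abstract.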

2508.21762 2026-06-01 cs.CL cs.AI

Reasoning-Intensive Regression

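Diane Tchuindjo, Omar Khattab

Affiliations * Massachusetts Institute of Technology

AI summary: For reasoning-intensive regression tasks, proposes MENTAT, which combines batch-reflective prompt optimization with neural ensemble learning and improves on the baselines by up to 65% on the benchmark.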

Abstract

AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks such as sentiment or similarity analysis, RiR often appears in ad-hoc applications such as rubric-based scoring, modeling dense rewards in complex environments, or domain-specific retrieval, where much deeper analysis of context is required while only limited task-specific training data and computation are available. We cast four realistic problems as RiR tasks to establish an initial benchmark, and use it to test our hypothesis that prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances.

2604.28020 2026-06-01 cs.LG

Cost-Aware Learning

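Clara Mohri, Amir Globerson, Haim Kaplan, Tomer Koren, Yishay Mansour

Affiliations * Kempner Institute; Harvard University; Google Research; Tel Aviv University

AI summary: For finite-sum optimization in which different components have different sampling costs, proposes Cost-Aware SGD, an algorithm whose sampling distribution is based on gradient norms and costs, and applies it to reinforcement learning for language models, substantially reducing token usage in policy optimization.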

2604.27994 2026-06-01 cs.RO

Dreaming Across Towns: Semantic Rollout and Town-Adversarial Regularization for Zero-Shot Held-Out-Town Fixed-Route Driving in CARLA

Feeza Khan Khanzada, Jaerock Kwon

Affiliations * Department of Electrical and Computer Engineering, University of Michigan–Dearborn

AI summary: Proposes a training method combining future semantic prediction with town-adversarial regularization that, training only on Town05 and Town06, improves the zero-shot transfer performance of CARLA driving agents on the held-out towns Town03 and Town04.

Abstract

Driving agents trained in one simulated town often perform poorly in a new town because the road shapes, intersections, and lane layouts can be different. This paper studies how to improve this kind of transfer in the CARLA driving simulator without giving the agent any training data from the test towns. The agent is trained only in Town05 and Town06, then evaluated directly in Town03 and Town04. To focus on road-layout differences, all experiments use the same weather and traffic settings. We propose a training method that encourages the agent to learn features that are useful across towns rather than features tied to one training town. During training, the agent is asked to predict the high-level visual meaning of future camera views and is also discouraged from relying on cues that reveal which source town the data came from. These extra learning signals are used only during training; at test time, the driving policy uses the same observation and control interface as the baseline agent. In controlled comparisons with matched DreamerV3-style world-model driving agents, the proposed method achieves the highest mean held-out success: 36.6% on Town03 with a 95% confidence interval of [30.5, 42.7] and 85.6% on Town04 with a 95% confidence interval of [84.0, 87.2], computed across five training seeds. Seed-paired tests against the strongest primary baselines show positive success-rate differences in both held-out towns. Additional experiments show that predicting future visual meaning alone or removing town-specific cues alone is not enough to match the combined method. These results suggest that combining future-scene understanding with reduced reliance on source-town-specific features can improve cross-town driving performance in this CARLA setting.

2604.27617 2026-06-01 cs.CV cs.AI

Robust Lightweight Crack Classification for Real-Time UAV Bridge Inspection

Wei Li, Haisheng Li, Weijie Li, Jiandong Wang, Kaichen Ma, Luming Yang

Affiliations * Bay Area Super Bridge Maintenance Technology Center, Guangdong Provincial Highway Construction Co., Ltd., Guangdong, China; Guangdong AIHISUN Technology Co., Ltd., Guangdong, China

AI summary: Proposes a unified lightweight CNN framework consisting of a lightweight backbone, a CBAM attention module, a directed robust augmentation strategy based on scene priors, and Focal Loss. On the SDNET2018 dataset it reaches an inference speed of 825 FPS with 11.21M parameters and 1.82G FLOPs, improving the F1-score by 2.51% and recall by 3.95%.

Abstract

With the widespread application of Unmanned Aerial Vehicles (UAVs) in bridge structural health monitoring, deep learning-based automatic crack detection has become a major research focus. However, practical UAV inspections still face four key challenges: weak crack features, degraded imaging conditions, severe class imbalance, and limited computational resources in UAV inspection workflows. To address these issues, this paper proposes a unified lightweight convolutional neural network framework composed of four synergistic components: a lightweight backbone network, a Convolutional Block Attention Module (CBAM) for channel and spatial enhancement, a directed robust augmentation strategy based on inspection-scene priors, and Focal Loss for hard-sample learning under class imbalance. Experiments on the SDNET2018 bridge deck dataset show that the proposed method achieves an inference speed of 825 FPS with only 11.21M parameters and 1.82G FLOPs. Compared with the baseline model, the complete framework improves the F1-score by 2.51% and recall by 3.95%. In addition, Grad-CAM visualizations indicate that the introduced attention module shifts the model's focus from scattered regions to precise tracking along crack trajectories. Overall, this study achieves a strong balance among accuracy, speed, and robustness, providing a practical solution for ground-station assisted real-time deployment in UAV bridge inspections. The source code is available at https://github.com/skylynf/AttXNet.
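
Of the four components listed in this abstract, Focal Loss is the one with a compact closed form, sketched here for the binary crack / no-crack case. The defaults `alpha=0.25` and `gamma=2.0` are the commonly used values from the original focal loss formulation and are an assumption; the abstract does not state which settings this framework uses.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one sample.

    p: predicted probability of the positive (crack) class, 0 < p < 1.
    y: ground-truth label, 1 for crack and 0 for no crack.
    """
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy positive (p = 0.9) versus a hard positive (p = 0.1).
print(focal_loss(0.9, 1), focal_loss(0.1, 1))
```

The factor `(1 - p_t) ** gamma` shrinks the loss on samples the model already classifies confidently, so training concentrates on hard samples. Setting `gamma=0` recovers class-weighted cross-entropy.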

2604.21928 2026-06-01 cs.CL

Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour

Affiliations * Idiap Research Institute; EPFL; Brno University of Technology; Avignon University; Le Mans University; Nantes University

AI summary: This paper proposes evaluating ASR with generative large language models through three approaches (hypothesis selection, semantic distance computation, and error classification), reaching 92-94% agreement with humans on the HATS dataset and outperforming WER and semantic metrics.

2604.20395 2026-06-01 cs.CV cs.RO

SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

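Chris Choy, Junha Lee, Chunghyun Park, Minsu Cho, Jan Kautz

Affiliations * NVIDIA

AI summary: Proposes SpaCeFormer, a proposal-free method based on a space-curve transformer that segments a scene in 0.12-0.30 seconds, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines, and builds SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset; it reaches 11.1 zero-shot mAP on ScanNet200, a 2.8x improvement.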

Comments Project page: https://nvlabs.github.io/SpaCeFormer/

Abstract

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs in 0.12-0.30 seconds per scene across standard benchmarks, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21$\times$ higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU $> 0.5$). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8$\times$ improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
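
Morton-curve serialization, mentioned in this abstract, orders 3D points along a Z-order curve by interleaving the bits of their voxel coordinates, so that points that are close in space tend to be close in the resulting sequence. The sketch below shows the standard construction; the function names, the voxel size, and the 10-bit resolution are assumptions and are not taken from the SpaCeFormer code.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of three non-negative integers into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def serialize(points, voxel=1.0, bits=10):
    """Return point indices ordered along the Morton (Z-order) curve."""
    mins = [min(p[a] for p in points) for a in range(3)]
    limit = (1 << bits) - 1
    keys = []
    for idx, p in enumerate(points):
        # Quantize each coordinate to a voxel index, clamped to the bit budget.
        q = [min(int((p[a] - mins[a]) / voxel), limit) for a in range(3)]
        keys.append((morton3d(q[0], q[1], q[2], bits=bits), idx))
    return [idx for _, idx in sorted(keys)]

points = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (1.0, 0.0, 0.0)]
print(serialize(points))
```

Once points are laid out in this order, fixed-size windows over the 1D sequence cover compact spatial neighborhoods, which is one way windowed attention can be applied to a point cloud.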
