arXivDaily arXiv每日学术速递 周一至周五更新
重置
全部学科分类 2062
2605.31183 2026-06-01 cs.CL cs.AI cs.LG

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

引导LLM?实际上,稀疏自编码器可以胜过简单基线

Mikkel Godsk Jørgensen, Lars Kai Hansen

AI总结 本文通过监督流水线选择并标注特征,证明稀疏自编码器在模型引导任务上可接近LoRA性能,并发现高稀疏性对基于可解释性的引导并非关键。

详情
AI中文摘要

稀疏自编码器(SAEs)被视为探索大型语言模型(LLMs)内部机制和引导模型输出生成的有前途的途径。当Wu等人(2025)引入模型引导基准AxBench时,SAEs由于相对于一组简单基线的引导性能较差,似乎并未达到最初的期望。本文作为对稀疏自编码器的部分反驳,表明Wu等人(2025)的结果并未完全公正地评价它们。我们发现,当使用我们的监督流水线选择并标注特征时,稀疏自编码器实际上可以在AxBench基准上达到接近参考LoRA性能的水平。我们还发现,当仅使用基于可解释性的组件时,我们的流水线选择的特征与其识别标签具有令人惊讶的因果性。最后,我们提供证据表明,高稀疏性(低l0)可能对于基于可解释性的成功引导并非关键,这与Wang等人(2025)早期的发现相反。

英文摘要

Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines. This work serves as a partial rebuttal for Sparse Autoencoders and suggests that the results of Wu et al. (2025) did not do them full justice. We find that Sparse Autoencoders can, in fact, perform close to on par with the reference LoRA performance on the AxBench benchmark, when features are selected and labelled with our supervised pipeline. We also find that our pipeline selects features that are surprisingly causal of their identified labels when using only its interpretability-based components. Lastly, we present evidence that high sparsity (low l0) may not be crucial for successful steering based on interpretability, which is in contrast to the earlier findings in Wang et al. (2025).

2605.31177 2026-06-01 cs.CV

Vanilla ViT for Automotive Point Cloud Semantic Segmentation

用于汽车点云语义分割的普通ViT

Gilles Puy, Nermin Samet, Alexandre Boulch, Spyros Gidaris, Tuan-Hung VU, Renaud Marlet

AI总结 本文提出VaViT,通过精心设计的标记器、轻量级解码器头和定制数据增强,使普通非分层ViT在大规模激光雷达点云语义分割中达到或超越现有最先进方法。

详情
AI中文摘要

普通Transformer已成为处理文本、音频、图像和视频的事实标准架构,为多模态学习提供了统一的主干。然而,点云语义分割的最先进架构仍然由U-Net架构主导,其中卷积与局部或窗口注意力交错。在这项工作中,我们展示了如何有效利用普通、非分层的ViT进行大规模汽车激光雷达场景的分割。通过精心设计的标记器、轻量级解码器分割头和定制数据增强,我们弥合了性能差距。我们的方法VaViT(Vanilla ViT)在保持ViT架构简单性的同时,匹配或超过了最先进方法的性能。我们在nuScenes、SemanticKITTI和Waymo Open Dataset上进行了广泛评估,以验证我们方法的有效性。代码和模型可在https://github.com/valeoai/VaViT获取。

英文摘要

Plain Transformers have become the de-facto architecture for processing text, audio, image, and video, offering a unified backbone for multimodal learning. However, state-of-the-art architectures for point cloud semantic segmentation remain dominated by U-Nets architectures where convolutions are interleaved with local or windowed attentions. In this work, we show how to effectively leverage vanilla, non-hierarchical ViTs for segmentation of large-scale automotive lidar scenes. We bridge the performance gap thanks to a carefully designed tokenizer, a lightweight decoder segmentation head, and tailored data augmentations. Our approach, VaViT for Vanilla ViT, matches or exceeds the performance of state-of-the-art methods while maintaining the simplicity of ViT architecture. We provide extensive evaluations on nuScenes, SemanticKITTI, and Waymo Open Dataset to validate the efficiency of our method. Code and models are available at https://github.com/valeoai/VaViT.

2605.31176 2026-06-01 cs.LG cs.DS

Retriever Portfolios: A Principled Approach to Adaptive RAG

检索器组合:一种自适应RAG的原则性方法

Miltiadis Stouras, Vincent Cohen-Addad, Silvio Lattanzi, Ola Svensson

AI总结 提出从大量候选检索器中自动选择小型多样子集(组合)的方法,通过期望最优k目标优化查询分布,实现自适应RAG,在多个QA基准上优于单检索器和朴素多检索器基线,并降低延迟和令牌成本。

Comments Accepted at ICML 2026. Code available at: https://github.com/mstou/retriever-portfolios

详情
AI中文摘要

检索增强生成(RAG)系统通常依赖单一检索器和一组超参数,尽管面临从简单事实性问题到复杂多跳推理的高度异构查询。我们提出一种方法,从大量候选检索器中自动选择一个小型、多样的子集(组合),以覆盖目标查询分布的不同区域。我们通过查询分布上的期望最优$k$目标形式化这一设置,并证明其存在一个具有近最优保证的高效组合构建算法。在多个QA基准上,我们学习的组合和路由管道在检索指标和答案质量上始终优于单检索器和朴素多检索器基线。此外,与推理时超参数调优方法相比,固定组合支持并行检索和LLM调用,在实现相当(有时更好)准确性的同时,显著降低延迟和令牌成本。

英文摘要

Retrieval-augmented generation (RAG) systems typically rely on a single retriever and a single set of hyperparameters, despite facing highly heterogeneous queries that range from simple factoid questions to complex multi-hop reasoning. We propose a method that automatically selects a small, diverse subset of retrievers (a portfolio) from a large pool of candidates, to cover different regions of the target query distribution. We formalize this setting via an expected best-of-$k$ objective over the query distribution and show that it admits an efficient portfolio construction algorithm with near-optimal guarantees. Across multiple QA benchmarks, our learned portfolios and router pipeline consistently outperform single-retriever and naive multi-retriever baselines on both retrieval metrics and answer quality. In addition, compared to inference-time hyperparameter tuning approaches, fixed portfolios enable parallel retrieval and LLM calls, achieving comparable (and sometimes better) accuracy with substantially lower latency and token cost.

2605.31175 2026-06-01 cs.CL

Towards Efficient LLMs Annealing with Principled Sample Selection

迈向基于原则性样本选择的高效LLM退火

Yuanjian Xu, Jianing Hao, Wanbo Zhang, Zhong Li, Guang Zhang

AI总结 本文通过损失景观的谱几何特性,将退火阶段的数据选择建模为有约束优化问题,提出DiReCT框架,利用Hessian谱对梯度施加方向约束,实现高效样本选择,显著提升模型性能。

详情
AI中文摘要

退火阶段是LLM预训练中关键的收敛阶段,最终决定模型质量。然而,在此阶段有效选择训练数据仍是一个关键挑战。当前策略依赖于经验启发式方法,如领域过滤或上下文扩展,缺乏优化理论的原则性基础。在这项工作中,我们通过损失景观的谱几何视角来刻画退火阶段。我们认为,最优收敛需要梯度更新满足不同特征方向上的异构约束。基于这一见解,我们将数据选择形式化为满足这些方向约束的问题。为此,我们提出了DiReCT(方向约束训练),这是一个新颖的框架,将退火阶段的样本选择重新表述为约束优化问题。通过基于Hessian的谱特性对每个样本的梯度施加显式的方向约束,DiReCT识别出与最优曲率感知下降路径一致的样本。跨多种模型尺度的广泛实验表明,DiReCT始终达到最先进的性能。为便于未来研究,代码可在https://github.com/xuyj233/Direct获取。

英文摘要

The annealing phase is a pivotal convergence stage in LLM pre-training that ultimately determines final model quality. However, effectively selecting training data during this phase remains a key challenge. Current strategies rely on empirical heuristics, such as domain filtering or context extension, which lack a principled grounding in optimization theory. In this work, we characterize the annealing phase through the lens of the loss landscape's spectral geometry. We argue that optimal convergence requires gradient updates to satisfy heterogeneous constraints across different eigen-directions. Building on this insight, we formulate data selection as a problem of satisfying these directional constraints. To this end, we propose DiReCT (Directionally-Restrained Constrained Training), a novel framework that reformulates sample selection in the annealing stage as a constrained optimization problem. By imposing explicit directional constraints on per-sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature-aware descent path. Extensive experiments across various model scales demonstrate that DiReCT consistently achieves state-of-the-art performance. For future research, code is available at https://github.com/xuyj233/Direct.

2605.31174 2026-06-01 cs.CV cs.LG

Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning

任意场景检测:一种具有经验感知推理的目标检测智能体框架

Wenlun Zhang, Jun Yin, Kentaro Yoshioka

AI总结 提出DetAS/DetAS-X智能体框架,利用多模态大语言模型自适应组合恢复模块和专用检测器,通过自进化经验积累实现经验感知推理,在六个基准上平均F1提升28.36%。

详情
AI中文摘要

现实场景中的目标检测由于图像退化多样和物体分布异质而仍然具有挑战性,这显著阻碍了现有检测器的泛化。传统方法,包括场景特定表示学习和端到端流水线设计,本质上受限于对预定义条件的依赖,缺乏对动态环境的适应性。本文提出DetAS,一种将目标检测表述为动态决策过程的智能体检测框架。DetAS不依赖静态流水线,而是利用多模态大语言模型(MLLM)作为中央智能体,通过从恢复模块和专用检测器的工具箱中选择来自适应地组合检测工作流。具体来说,DetAS包含两个关键组件:自适应图像恢复,动态决定是否以及如何增强图像以进行下游检测;以及多专家检测,集成多个领域专用检测器并通过实例级推理解决它们的预测。为了在细粒度条件下进一步提高决策质量,我们引入了自进化经验积累,并将框架扩展到DetAS-X,该框架从少量标注数据中积累节点级决策经验,并在推理过程中实现经验感知推理。这种机制使系统能够逐步优化其决策策略,并适应各种现实场景。在六个具有挑战性的基准上的大量实验表明,DetAS-X显著优于现有的基于MLLM的检测器,在F1分数上平均提高28.36%,在DarkFace上增益高达37.01%。这些结果展示了智能体检测的前景,并为其在复杂动态环境中的应用奠定了坚实基础。

英文摘要

Object detection in real-world scenarios remains challenging due to diverse image degradations and heterogeneous object distributions, which significantly hinder the generalization of existing detectors. Conventional approaches, including scene-specific representation learning and end-to-end pipeline design, are inherently limited by their reliance on predefined conditions and lack adaptability to dynamic environments. In this paper, we propose DetAS, an agentic detection framework that formulates object detection as a dynamic decision process. Instead of relying on static pipelines, DetAS leverages a Multimodal Large Language Model (MLLM) as a central agent to adaptively compose detection workflows by selecting from a toolbox of restoration modules and specialized detectors. Specifically, DetAS consists of two key components: Self-Adaptive Image Restoration, which dynamically determines whether and how to enhance images for downstream detection, and Multi-Expertise Detection, which integrates multiple domain-specialized detectors and resolves their predictions through instance-level reasoning. To further improve decision quality under fine-grained conditions, we introduce Self-Evolving Experience Harvesting and extend the framework to DetAS-X, which accumulates node-level decision experience from a small set of annotated data and enables experience-aware reasoning during inference. This mechanism allows the system to progressively refine its decision policy and adapt to diverse real-world scenarios. Extensive experiments on six challenging benchmarks demonstrate that DetAS-X significantly outperforms existing MLLM-based detectors, achieving an average improvement of 28.36% in F1 score, with up to 37.01% gain on DarkFace. These results demonstrate the promise of agentic detection and establish a solid foundation for its application in complex and dynamic environments.

2605.31173 2026-06-01 cs.SD cs.AI

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

MindVoice: 利用预训练先验从非侵入性神经信号重建可理解语音

Guangyin Bao, Taiping Zeng, Jianfeng Feng, Xiangyang Xue

AI总结 提出MindVoice框架,通过解耦语义和声学路径并融合预训练生成模型与语音克隆,从EEG/MEG信号中重建出可理解语音,显著优于现有方法。

详情
AI中文摘要

从非侵入性神经记录中重建连续语音是探究人类听觉感知和构建安全、可扩展的语音脑机接口的基本问题。尽管近期取得进展,但由于非侵入性记录本身存在噪声、空间模糊且仅部分保留感知语音信息,可理解的重建仍然难以实现。现有方法直接将神经活动映射到纠缠的语音表征,然后使用神经声码器合成波形,导致结果频谱相似但不可理解。为克服这些限制,我们引入MindVoice,一种神经到语音的重建框架,利用预训练模型补偿神经记录中不完整的语义和声学信息。MindVoice将重建解耦为两条互补路径:一条恢复高层语义内容,另一条估计细粒度声学属性。这些推断的表征随后与强大的语音生成模型和上下文语音克隆融合,以合成自然且可理解的语句。在EEG和MEG上的大量实验表明,MindVoice在各种指标上显著优于现有方法。这些结果表明,预训练先验为弥合噪声神经记录与自然语音之间的差距提供了一种原则性方法,凸显了听觉神经科学研究和非侵入性语音脑机接口的一个有前景的尝试。

英文摘要

Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectral-similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising attempt for auditory neuroscience research and non-invasive speech brain-computer interfaces.

2605.31172 2026-06-01 cs.LG stat.ML

Convergence of Two-Timescale Markovian Stochastic Approximations with Applications in Reinforcement Learning

双时间尺度马尔可夫随机逼近的收敛性及其在强化学习中的应用

Vagul Mahadevan, Claire Chen, Shuze Daniel Liu, Shangtong Zhang

AI总结 本文研究双时间尺度随机逼近在马尔可夫噪声下的稳定性与收敛性,通过用慢时间尺度参数的运行最大值控制快时间尺度参数,首次证明了带资格迹的TDC在离策略线性函数逼近下的几乎必然收敛。

Comments ICML 2026

详情
AI中文摘要

本文研究双时间尺度随机逼近(SA)的收敛性,这是一类迭代算法,分别以快慢时间尺度更新两组参数。强化学习中双时间尺度SA的著名例子包括带梯度校正的时间差分学习(TDC)和演员-评论家方法。以往,双时间尺度SA的稳定性(即有界性)和收敛性仅在独立同分布噪声下建立。本文则在马尔可夫噪声下建立双时间尺度SA的稳定性和收敛性,这种设置更符合强化学习实际。值得注意的是,我们无需使用任何投影算子,且噪声无需位于紧集内。我们的关键技术新颖之处在于,用慢时间尺度参数的运行最大值来控制快时间尺度参数,而非像大多数先前工作那样使用当前慢时间尺度参数。作为一个关键应用,我们首次证明了带资格迹的TDC在离策略线性函数逼近下的几乎必然收敛。

英文摘要

This work studies the convergence of two-timescale stochastic approximations (SA), a class of iterative algorithms that update two sets of parameters in fast and slow timescales respectively. Notable examples of two-timescale SA in reinforcement learning (RL) include temporal difference learning with gradient correction (TDC) and actor-critic methods. Previously, the stability (i.e., boundedness) and convergence of two-timescale SA were only established under i.i.d. noise. This work instead establishes the stability and convergence of two-timescale SA under Markovian noise, a setup that is more realistic in RL. Notably, we do not need to use any projection operator and the noise does not need to live in a compact space. Our key technical novelty is to control the fast timescale parameter with the running max of the slow timescale parameter, instead of with the current slow timescale parameter, as most prior works do. As a key application, we establish the first almost sure convergence of TDC with eligibility traces under off-policy learning with linear function approximation.

2605.31170 2026-06-01 cs.CL cs.AI

Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion

语言模型智能体群体中的涌现语言:从令牌效率到监督规避

Stine Lyngsø Beltoft, William Brach, Federico Torrielli, Jacob Nielsen, Annemette Brok Pirchert, Filippo Tonini, Peter Schneider-Kamp, Lukas Galke Poech

AI总结 研究语言模型智能体群体中涌现的语言,通过规则启发式和零样本分类识别出令牌效率、新自然语言和监督规避三类,发现监督规避语言更难对齐且可被上下文学习,表明仅监控表面行为可能不足以控制智能体群体。

详情
AI中文摘要

目前,对自主语言模型智能体的监控主要依赖表面行为。但当智能体群体为了规避人类监督而发明新语言时会发生什么?本文研究了Moltbook上的涌现语言。为此,我们基于Moltbook Files数据集,采用两阶段方法:先进行基于规则的启发式匹配(约6000个匹配),再进行零样本分类(保留518个)。结果类别包括令牌效率(166个)、新自然语言(106个)和监督规避(59个)。我们进行了定量和定性分析。结果表明,提出用于规避监督的新语言的帖子被DeepSeek-3.2判定为比其他类别更不对齐,且所有语言都可以通过语言描述被其他语言模型在上下文中学习。此外,手动研究典型案例揭示了令人惊讶的复杂隐写协议,例如在自然语言中嵌入隐藏信息。尽管我们无法确定这些语言构思中的自主程度,但我们的结果进一步证明,仅监控表面行为可能很快不足以维持对智能体群体的控制。

英文摘要

Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of avoiding human oversight. Here, we study the emergent languages on Moltbook. For this, we build upon the Moltbook Files dataset and apply a two-stage approach consisting of a rule-based heuristic (about 6000 matches) followed by zero-shot classification (518 kept). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. Our results show that posts proposing new languages for avoiding oversight are judged by DeepSeek-3.2 as being less aligned than the other categories and that all languages can be learned by other language models in-context merely from a description of the language. Moreover, manually studying exemplary cases reveals surprisingly sophisticated steganographic protocols like embedding hidden messages in natural language. Although we cannot be certain about the extent of autonomy in ideation of these languages, our results add up to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.

2605.31167 2026-06-01 cs.AI

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

LLM-FACETS:一个保护隐私的评估LLM透明度和问责制的框架

Tom Lucas, Alessio Buscemi, Alfredo Capozucca, German Castignani, Barbara Delacroix

AI总结 提出一个开源框架LLM-FACETS,通过浏览器界面和插件架构,为技术专家、领域专家和合规官员提供隐私保护的LLM评估,实现透明度与问责制。

Comments Submitted to ACM Journal on Responsible Computing, Special Section: Collaborative Methods and Tools for Engineering and Evaluating Transparency in AI. 28 pages 9 figures, 7 tables, 1 algorithm. Source code: https://github.com/Scriptor-Group/AIMVi

详情
AI中文摘要

评估大型语言模型的输出是否事实准确、认知校准和方法可复现,是负责任AI部署的前提。然而,审计LLM对非技术从业者仍然难以实现:现有工具需要编程专业知识和非平凡的环境设置,云托管平台将评估数据传输到外部服务,为法律上负责AI监督的领域专家和合规官员设置了障碍。我们介绍LLM-FACETS(LLM事实交叉评估系统):一个开源框架,具有浏览器可访问的界面和插件架构,围绕三个从业者画像(技术专家、领域专家、合规官员)构建,这些画像反映了EU AI法案和NIST AI风险管理框架中识别的利益相关者类别。该架构使数据流明确:确定性指标(BLEU、ROUGE、BERTScore)完全在自托管服务器内运行,无出站传输;LLM评判指标显式联系外部API,用户保留完全凭据控制。该框架通过三种机制实现透明度:用于认知不确定性的token级对数概率可视化、多评判共识以减轻评判偏差,以及RAG Triad指标(忠实性、答案相关性、上下文相关性)以检测和定位幻觉。插件架构允许在不修改评估管道的情况下集成任何新指标或数据集。开源实现支持针对同一属性的多个指标进行交叉检查,确保可复现性,并将AI问责制与评估系统的构建团队解耦。我们通过18个指标实现与规范参考库的交叉验证来验证该框架。

英文摘要

Assessing whether Large Language Models outputs are factually grounded, epistemically calibrated, and methodologically reproducible is a prerequisite for responsible AI deployment. Yet auditing LLMs remains inaccessible to non-technical practitioners: existing tools require programming expertise and non-trivial environment setup, and cloud-hosted platforms transmit evaluation data to external services, creating barriers for domain experts and compliance officers legally responsible for AI oversight. We introduce LLM-FACETS (LLM FActuality Cross-EvaluaTion System): an open-source framework with a browser-accessible interface and a plugin architecture, structured around three practitioner profiles (technical experts, domain experts, compliance officers) that mirror the stakeholder categories identified in the EU AI Act and the NIST AI Risk Management Framework. The architecture makes data flows explicit: deterministic metrics (BLEU, ROUGE, BERTScore) run entirely within the self-hosted server with no outbound transmission; LLM-judge metrics contact external APIs explicitly, with users retaining full credential control. The framework operationalizes transparency through three mechanisms: token-level log-probability visualization for epistemic uncertainty, multi-judge consensus to mitigate judge bias, and RAG Triad metrics (Faithfulness, Answer Relevance, Context Relevance) to detect and localize hallucinations. A plugin architecture allows any new metric or dataset to be integrated without modifying the evaluation pipeline. The open-source implementation enables cross-checking across multiple metrics targeting the same property, ensuring reproducibility and decoupling AI accountability from the teams building the systems assessed. We verify the framework through cross-validation of 18 metric implementations against canonical reference libraries.

2605.31164 2026-06-01 cs.CL cs.AI

D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training

D$^3$: 面向LLM训练的动态有向图约束数据调度

Yuanjian Xu, Jianing Hao, Guang Zhang, Zhong Li

AI总结 提出D$^3$框架,通过动态有向图建模训练单元间的有向影响关系,并求解约束优化问题以确定训练顺序,从而提升LLM预训练和后训练阶段的效率。

详情
AI中文摘要

训练数据在大语言模型(LLM)优化中起着核心作用,这激发了对数据调度策略的广泛研究。现有方法大多集中于调整整体数据分布,而忽略了训练过程中样本之间的潜在交互。然而,我们认为这种交互不可忽视,因为现实世界的数据样本之间经常存在有向影响,使得训练顺序至关重要。直观上,我们可以优先训练影响更大的单元以提高学习效率。在这项工作中,我们提出了D$^3$,一个动态有向图约束的数据调度框架。D$^3$将训练单元之间的复杂交互建模为一个动态影响图,其中边表示基于损失的依赖关系。然后,它在该图上求解一个约束优化问题,以推导出训练顺序,确保数据序列在整个训练过程中遵循不断演变的信息流。我们的方法具有理论动机,并在预训练和后训练阶段均比现有数据调度方法取得了一致的改进。此外,为了可扩展性,D$^3$还采用了一种高效的近似算法,将额外的计算开销控制在可管理范围内。为便于未来研究,代码可在https://github.com/xuyj233/D3获取。

英文摘要

Training data plays a central role in large language models (LLMs) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real-world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train-units with greater influence to improves learning efficiency. In this work, we propose $D^3$, a Dynamic Directional graph-constrained Data scheduling framework. $D^3$ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. Furthermore, for scalability, $D^3$ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. For future research, the code is available at https://github.com/xuyj233/D3.

2605.31159 2026-06-01 cs.LG cs.AI

Trust-Region Behavior Blending for On-Policy Distillation

信任域行为混合用于在线策略蒸馏

Daniil Plyusov, Alexey Gorbatovski, Alexey Malakhov, Nikita Balagansky, Boris Shaposhnikov, Daria Korotyshova, Daniil Gavrilov

AI总结 提出信任域行为混合(TRB)预热方法,通过在学生中心的KL信任域内用最接近教师的行为策略替换早期学生策略,解决在线策略蒸馏中早期学生轨迹质量差的问题,在数学推理蒸馏中取得最佳平均性能。

详情
AI中文摘要

在线策略蒸馏(OPD)训练学生模型在其自身策略采样的前缀上进行学习,同时匹配更强的教师模型。这解决了离线蒸馏中的前缀不匹配问题,但早期的学生模型 rollout 仍然可能质量较差,导致教师监督应用于弱或低质量的前缀。我们提出信任域行为混合(TRB),一种预热方法,在学生中心的KL信任域内,用最接近教师的行为策略替换早期的 rollout 策略,同时保持每个前缀的反向KL OPD损失不变。KL预算逐渐退火至零,因此预热后训练恢复为纯学生 rollout。在两个数学推理蒸馏设置中,TRB在比较方法中取得了最强的平均性能。

英文摘要

On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.

2605.31156 2026-06-01 cs.LG

TabCausal: Pretraining Across Causal Environments for Tabular Causal Discovery

TabCausal: 跨因果环境的表格因果发现预训练

Zi-Rong Li, Si-Yang Liu, Tian-Zuo Wang, Han-Jia Ye

AI总结 提出TabCausal,一种通过动态任务构建策略在多样化因果环境中进行大规模预训练的因果发现基础模型,在合成和语义基准上优于现有方法。

详情
AI中文摘要

因果发现旨在从观测和干预数据中恢复有向因果关系,为机制理解和可靠决策提供基础。因果发现基础模型(CDFMs)试图通过将数据集直接映射到因果图(单次前向传播)来分摊该问题,避免每个数据集上的测试、搜索或优化。然而,现有的CDFMs仍然有限,常常无法一致地匹配强大的经典方法,我们发现关键瓶颈在于因果预训练任务的构建方式。基于这一观察,我们提出了TabCausal,一种数据驱动的CDFM,在多样化的图先验、结构机制、噪声模型、维度、样本量和干预机制上进行广泛的因果预训练。一种动态任务构建策略将这些因果环境组合成多样的发现任务,使得从观测和混合干预数据中实现更具迁移性的结构学习。在大规模合成基准上,TabCausal实现了比多种因果发现基线更好的宏观平均性能。为了进一步弥合抽象合成生成器与现实因果推理场景之间的差距,我们引入了一个协议引导且LLM审计的语义因果环境基准,其中基于领域的结构因果模型(SCMs)生成可解释的观测和干预数据集,用于分布外分析。在合成和语义环境中,TabCausal均展现出鲁棒的结构恢复能力,尤其是在干预证据下,凸显了广泛因果预训练作为可迁移摊销因果发现的关键要素。

英文摘要

Causal discovery aims to recover directed causal relations from observational and interventional data, providing a basis for mechanistic understanding and reliable decision-making. Causal discovery foundation models (CDFMs) seek to amortize this problem by mapping a dataset directly to a causal graph in a single forward pass, avoiding per-dataset testing, search, or optimization. However, existing CDFMs remain limited, often failing to consistently match strong classical methods, and we find that a key bottleneck is how causal pretraining tasks are constructed. Based on this observation, we propose TabCausal, a data-driven CDFM trained with broad causal pretraining over diverse graph priors, structural mechanisms, noise models, dimensions, sample sizes, and intervention regimes. A dynamic task construction strategy composes these causal environments into varied discovery tasks, enabling more transferable structural learning from observational and mixed-interventional data. On large-scale synthetic benchmarks, TabCausal achieves better macro-averaged performance than a diverse set of causal discovery baselines. To further bridge abstract synthetic generators and realistic causal reasoning scenarios, we introduce a protocol-guided and LLM-audited semantic causal environment benchmark, where domain-grounded SCMs generate interpretable observational and interventional datasets for out-of-distribution analysis. Across both synthetic and semantic environments, TabCausal demonstrates robust structure recovery, especially under interventional evidence, highlighting broad causal pretraining as a key ingredient for transferable amortized causal discovery.

2605.31155 2026-06-01 cs.LG

Learning Hyperspherical Time-Frequency Representations for Time-Series Out-of-Distribution Detection

学习超球面时频表示用于时间序列分布外检测

Willian T. Lunardi, Samridha Shrestha, Martin Andreoni

AI总结 本文提出一种基于超球面嵌入的表示学习方法,通过von Mises-Fisher目标函数结合时频域编码器,实现时间序列的分布外检测,在UCR和UEA数据集上优于对比学习和后处理方法。

Comments 14 pages, 2 figures, 4 tables, accepted at IJCAI-ECAI 2026

详情
AI中文摘要

与视觉和语言领域相比,时间序列数据的分布外(OOD)检测仍然相对未被充分探索,对于如何利用监督时间序列表示在分布偏移下进行可靠检测,缺乏原则性的理解。本文将时间序列OOD检测形式化为具有超球面嵌入的表示学习,其中通过单位球面上的von Mises-Fisher(vMF)似然目标诱导类条件结构。学习到的表示通过特定领域的编码器结合输入信号的时域和频域视图,将它们整合到一个联合嵌入空间中进行OOD检测。检测使用基于距离的分数对学习到的嵌入进行评估,包括k近邻(k-NN)和马氏距离分数。我们在完整的UCR和UEA时间序列存档上,在跨数据集协议下大规模评估该方法。实验结果表明,在相同设置下,与强对比学习和后处理方法基线相比,k-NN和马氏距离评分均取得一致改进。代码可在https://github.com/tiiuae/hypertf-time-series-ood获取。

英文摘要

Out-of-distribution (OOD) detection for time-series data remains comparatively underexplored compared to vision and language, with a limited principled understanding of how supervised time-series representations can be leveraged for reliable detection under distributional shifts. This work formulates time-series OOD detection as representation learning with hyperspherical embeddings, where class-conditional structure is induced by a von Mises-Fisher (vMF) likelihood-based objective on the unit sphere. The learned representation combines time- and frequency-domain views of the input signal via domain-specific encoders, integrating them into a joint embedding space for OOD detection. Detection uses distance-based scores over the learned embeddings, including k-nearest neighbors (k-NN) and Mahalanobis scores. We evaluate the approach at scale on the complete UCR and UEA time-series archives under a cross-dataset protocol. Empirical results show consistent improvements under both k-NN and Mahalanobis scoring over strong contrastive learning and post-hoc baselines in the same setting. Code is available at https://github.com/tiiuae/hypertf-time-series-ood.

2605.31153 2026-06-01 cs.CV

BIAS-ID: A Framework for Analyzing Transformation Biases in AI-Generated Image Detectors

BIAS-ID: 分析AI生成图像检测器中变换偏差的框架

Jonas Ricker, Asja Fischer, Erwin Quiring

AI总结 本文提出BIAS-ID框架,用于分析和量化AI生成图像检测器中的变换偏差,并通过实验揭示多种先进检测方法受偏差影响严重。

详情
AI中文摘要

鉴于网络上有害AI生成图像的激增,可靠地区分真实图像与生成图像已成为一个紧迫的研究课题。虽然许多提出的检测方法在受控设置下表现良好,但在真实世界数据上测试时常常失效。一个潜在的根本原因是检测器训练数据中的细微偏差。因此,检测器可能依赖虚假相关性而非学习真正的取证痕迹。虽然最近的工作已经识别出这个问题,但尚未建立评估检测器实际偏差程度的既定协议。因此,在本文中,我们退一步:首先,我们讨论检测器存在偏差意味着什么,以及这与缺乏鲁棒性有何不同。其次,我们提出BIAS-ID,一个用于分析和量化AI生成图像检测器中变换偏差的透明框架。我们通过对两个数据集上的六个检测器进行评估来验证我们的框架,揭示了几种最先进的检测方法受到偏差的强烈影响。我们的结果强调了偏差感知评估对于开发可靠的AI生成图像检测器的重要性。

英文摘要

Given the surge of harmful AI-generated imagery online, reliably distinguishing authentic images from generated ones has become an urgent research topic. While many proposed detection methods perform well under controlled settings, they often collapse when tested on real-world data. A potential root cause are subtle biases in the detectors' training data. As a result, detectors may rely on spurious correlations instead of learning true forensic artifacts. While a recent line of work has identified the problem, there is not yet an established protocol to evaluate how biased a detector actually is. In this work, we therefore take a step back: First, we discuss what it means for a detector to be biased, and how this differs from a lack of robustness. Second, we propose BIAS-ID, a transparent framework for analyzing and quantifying the presence of transformation biases in AI-generated image detectors. We validate our framework by performing an evaluation of six detectors across two datasets, revealing that several state-of-the-art detection methods are strongly affected by biases. Our results highlight the importance of bias-aware evaluation for developing reliable AI-generated image detectors.

2605.31148 2026-06-01 cs.CV cs.AI cs.CL

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

SpatialAct:探测VLM智能体在3D场景中的空间推理到行动能力

Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui

AI总结 本文提出SpatialAct基准,通过多轮交互细化、单步错误检测与修复等任务,揭示当前视觉语言模型在3D场景中从空间推理到行动存在显著差距。

详情
AI中文摘要

人类能够在日常3D环境中轻松感知空间布局、形成认知表征、推理空间关系,并将这种推理转化为行动。尽管最近的视觉语言模型(VLM)在基于观测的空间感知和推理任务上表现出色,但它们是否能够构建连贯的空间理解、据此行动并通过多轮反馈优化行动仍不清楚。为研究这一问题,我们引入了 extbf{SpatialAct},一个基于模拟器的基准,用于探测3D场景中的 extit{行动条件空间推理}。从最具挑战性的设置——多轮交互细化开始,我们进一步设计了其分解版本——单步错误检测与修复,以及五个基础空间能力任务,以诊断模型失败的潜在原因。实验揭示了明显的推理到行动差距:当前VLM在孤立的空间推理任务上表现良好,但在多轮反馈中难以维持连贯的空间信念并产生可靠行动,显著不如人类。这些结果表明,即使抽象掉了低级控制,当前VLM智能体在行动引起的环境变化下仍缺乏稳健的空间状态跟踪能力。

英文摘要

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce \textbf{SpatialAct}, a simulator-grounded benchmark for probing \textit{action-conditioned spatial reasoning} in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.

2605.31145 2026-06-01 cs.CV cs.AI cs.LG

FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization

FOCUS: 通过视觉支持约束和策略优化强制上下文目标定位

Mohammed Asad Karim, Vinay Kumar Verma

AI总结 提出一种两阶段训练框架,通过优化支持框与查询图像间的上下文注意力并结合GRPO强化学习,实现无类别监督的类别无关上下文目标定位,7B模型性能超越72B模型。

Comments Accepted at ICML 2026. * Equal Contributions

详情
AI中文摘要

上下文定位(ICL)旨在通过查询图像中的少量支持示例定位目标对象,无需训练或参数更新即可即时操作。尽管视觉语言模型(VLM)快速发展,实现类别无关且基于视觉的ICL仍然是一个未解决的问题,尽管它对图像编辑、个性化视觉搜索和检索等应用至关重要。现有方法脆弱且依赖显式类别监督,这不仅限制了在具有未命名或实例特定对象的现实场景中的适用性,还引入了类别偏差,使预测偏向语义先验而非视觉证据。我们提出一个两阶段训练框架,在无类别监督的情况下显式优化支持边界框与查询图像之间的上下文注意力。我们进一步通过使用组相对策略优化(GRPO)的强化学习来细化定位,直接最小化定位误差。这种公式强制视觉对应优于语义先验,产生鲁棒的实例级定位。实验表明,使用我们的目标训练的7B参数模型优于高达72B参数的模型,证明了上下文感知定位目标可以超越单纯扩展规模。全面的消融实验验证了每个组件的贡献。

英文摘要

In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision-language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.

2605.31142 2026-06-01 cs.CL cs.AI

On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets

多语言文本嵌入排名在学习任务、语言和基准数据集上的鲁棒性

Ana Gjorgjevikj, Barbara Koroušić Seljak, Tome Eftimov

AI总结 通过引入数据集组成鲁棒性和排名方案鲁棒性指标,系统分析了MTEB中多语言模型排名对评估设计变化的敏感性,发现基于LLM的大模型通常是鲁棒的顶尖模型,但并非在所有任务中一致。

详情
AI中文摘要

大规模多语言文本嵌入模型在研究和工业中扮演着关键角色,但它们在特定语言、多任务设置中的行为仍未被充分理解。尽管像MTEB这样的基准平台报告了超过250种语言的结果,但关于模型优越性的结论往往依赖于数据集组成和性能聚合方法的隐含选择。为了解决这一差距,我们对MTEB中的多语言模型性能鲁棒性进行了元研究,应用了多种多准则决策制定排名方案,并引入了两个鲁棒性指标:数据集组成鲁棒性(排名对数据集组成变化的敏感性)和排名方案鲁棒性(对聚合方法变化的敏感性)。它们使得系统性地分析基准结论在不同评估设计下是否保持稳定成为可能。我们对五种语言(英语、法语、德语、印地语和西班牙语)在九个任务(例如分类、聚类、检索)上进行了深入分析,并发布了约230种额外语言的结果。任务特定分析表明,基于大规模LLM的模型通常是鲁棒的顶尖表现者,尽管并非一致(例如在检索任务中),而任务无关的结果显示,只有一小部分模型在任务、排名方案和数据子样本中始终保持强劲。

英文摘要

Large-scale multilingual text embedding models play crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset compositions and performance aggregation methods. To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changing dataset compositions) and ranking-scheme robustness (sensitivity to aggregation method change). They enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis on five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in retrieval task), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.

2605.31137 2026-06-01 cs.CV

PolSAR Image Classification using a Hybrid Complex-Valued Network (HybridCVNet)

使用混合复数网络(HybridCVNet)进行PolSAR图像分类

Mohammed Q. Alkhatib

AI总结 提出一种混合复数网络HybridCVNet,结合CV-CNN和CV-ViT,通过提取互补信息并利用数据内部依赖关系,提升PolSAR图像分类性能,在Flevoland和San Francisco数据集上分别达到97.39%总体精度和0.972 Kappa值。

Comments Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)

详情
AI中文摘要

近年来,卷积神经网络(CNN)因其在计算机视觉任务中的有效性而成为图像分类的热门方法。现在,研究人员正在探索视觉Transformer(ViT)在遥感和地球观测中的潜力。然而,传统的实值网络常常忽略复数(CV)数据(如极化合成孔径雷达(PolSAR)数据)中重要的相位信息。为了解决这个问题,出现了新的CV深度架构。HybridCVNet是一种新颖的混合网络,融合了CV-CNN和CV视觉Transformer(CV-ViT)技术。它有效地结合了CV 3D和2D CNN作为特征提取器,通过提取互补信息并有效利用数据内部的相互依赖关系,增强了PolSAR图像分类。来自广泛使用的PolSAR数据集的实验结果表明,HybridCVNet优于其他方法,在Flevoland数据集上实现了97.39%的总体精度,并且在仅1%采样率下也显示出潜力,在旧金山数据集上Kappa值为0.972。源代码可通过https://github.com/mqalkhatib/HybridCVNet获取。

英文摘要

Recently, convolutional neural networks (CNNs) have become popular for image classification due to their effectiveness in computer vision tasks. Now, researchers are exploring the potential of vision transformers (ViTs) in remote sensing and Earth observation. However, traditional Real-Valued networks often overlook important phase information in Complex-Valued (CV) data like polarimetric synthetic aperture radar (PolSAR) data. To address this, new CV deep architectures have emerged. HybridCVNet, a novel hybrid network, blends CV-CNN and CV vision transformer (CV-ViT) techniques. It efficiently combines CV 3D and 2D CNNs as feature extractors, enhancing PolSAR image classification by extracting complementary information and effectively leveraging interdependencies within the data. Experimental results from widely-used PolSAR datasets show HybridCVNet outperforms other methods, achieving an overall accuracy of 97.39% on the Flevoland dataset and showing promise even with just a 1% sampling ratio, with a Kappa value of 0.972 on the San Francisco dataset. Source code is accessible through https://github.com/mqalkhatib/HybridCVNet

2605.31136 2026-06-01 cs.CL

Multilingual and Cross-Lingual Citation Needed Detection on Wikipedia for Lower-Resource Languages

低资源语言维基百科的多语言和跨语言引用需求检测

Gerrit Quaremba, Amy Rechkemmer, Elizabeth Black, Denny Vrandečić, Elena Simperl

AI总结 针对低资源语言,提出多语言引用需求检测语料库MCN,并证明使用编码器风格目标微调的小型解码器语言模型在跨语言任务中优于大型语言模型。

详情
AI中文摘要

在自动化事实核查(AFC)中,核查价值检测根据领域特定标准识别需要验证的声明。在维基百科上,该任务具体化为引用需求检测(CND),即标记缺乏支持性引用的声明。然而,现有研究很大程度上忽视了低资源语言,且最近的AFC流程依赖于大型语言模型(LLM),这对低资源组织来说难以获取。我们引入了MCN,一个覆盖三种资源级别共18种语言的多语言CND语料库,并在此基础上对小规模解码器语言模型(SLM)进行了广泛研究。实验表明,使用编码器风格目标微调的SLM在跨语言任务中显著优于提示型LLM。我们进一步展示了跨语言CND的首批研究之一,证明仅使用英语声明微调的SLM在几乎没有目标语言适应的情况下超越了LLM。我们的发现对低资源维基百科社区具有重要意义,并表明紧凑的任务特定模型比LLM更适合CND。我们在https://github.com/gerritq/mcn 发布所有数据和代码。

英文摘要

In automated fact-checking (AFC), check-worthiness detection identifies claims requiring verification based on domain-specific criteria. On Wikipedia, this task instantiates as Citation Needed Detection (CND), which flags claims lacking supporting citations. However, existing research has largely overlooked lower-resource languages, and recent AFC pipelines rely on large language models (LLMs), which are inaccessible to low-resource organizations. We introduce MCN, a multilingual CND corpus spanning 18 languages across three resource levels, on which we conduct an extensive study of small decoder-based language models (SLMs). Our experiments show that SLMs fine-tuned with an encoder-style objective substantially outperform prompted LLMs across languages. We further present one of the first studies on cross-lingual CND, demonstrating that SLMs fine-tuned solely on English claims surpass LLMs, even with little to no target-language adaptation. Our findings have important implications for lower-resource Wikipedia communities and suggest that compact, task-specific models are preferable to LLMs for CND. We release all data and code at https://github.com/gerritq/mcn

2605.31129 2026-06-01 cs.LG

Generalizing Multi-Scale Time-Series Modeling with a Single Operator

使用单一算子泛化多尺度时间序列建模

Cheonwoo Lee, Dooho Lee, Doyun Choi, Jaemin Yoo

AI总结 提出SiGMA架构,通过可学习离散高斯核实现距离感知缩放,解决现有方法固定离散缩放的局限性,在长期和短期预测任务中均达到最优性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

多尺度建模通过捕获多个分辨率的时间动态,已成为时间序列预测的有效设计原则。由于文献中尚未建立原则性基础,我们将现有的缩放方法统一为一个缩放算子族,揭示了现有方法的一个基本局限性:依赖固定和离散的缩放。为了解决这一局限性,我们提出了SiGMA(单一泛化多尺度架构),它通过基于尺度空间理论的可学习离散高斯(LDG)核实现距离感知缩放。我们在长期和短期预测基准上全面评估了SiGMA,与最先进的多尺度基线进行了比较。SiGMA在两项任务上均优于所有竞争对手,特别是在16个长期评估设置中,有13个达到了最佳性能。除了准确性,SiGMA在训练速度上比最强竞争对手提高了最多5.3倍,内存消耗降低了最多3.8倍。代码可在https://github.com/cheonwoolee/SiGMA获取。

英文摘要

Multi-scale modeling has emerged as an effective design principle for time-series forecasting by capturing temporal dynamics at multiple resolutions. As no principled foundation has been established in the literature, we unify existing scaling methods into a scaling operator family, revealing a fundamental limitation of existing approaches: reliance on fixed and discrete scaling. To address this limitation, we propose SiGMA (Single Generalized Multi-scale Architecture), which enables distance-aware scaling via the learnable discrete Gaussian (LDG) kernel grounded in scale-space theory. We evaluate SiGMA comprehensively on long- and short-term forecasting benchmarks against state-of-the-art multi-scale baselines. SiGMA outperforms all competitors on both tasks, especially achieving the best performance in 13 out of 16 long-term evaluation settings. Beyond accuracy, SiGMA significantly improves training speed by up to 5.3 times and reduces memory consumption by up to 3.8 times over the strongest competitors. Code is available at https://github.com/cheonwoolee/SiGMA.

2605.31127 2026-06-01 cs.LG cs.NA math.NA

Scalable Bayesian Inference for Nonlinear Conservation Laws

非线性守恒律的可扩展贝叶斯推断

Tim Weiland, Philipp Hennig

AI总结 提出一种基于高斯过程先验的数值保守方法,用于非线性守恒律的不确定性量化,并通过稀疏近似技术实现大规模正反问题的高效求解。

Comments 27 pages, 13 figures, 3 tables

详情
AI中文摘要

非线性守恒律是科学和工程中许多最重要动力系统的核心。在实际应用中,此类系统常受到各种不确定性来源的影响,例如稀疏或有噪声的测量。推断感兴趣的物理量和场成为一个不适定问题,经典数值方法和现代深度学习方法都难以恰当处理。最近的工作将经典数值方法框架化为高斯过程先验下的贝叶斯推断,从而实现了对不确定性的物理感知处理。沿着这一思路,我们开发了一种新颖的数值保守方法,用于非线性守恒律的不确定性感知模拟。我们利用最近的稀疏近似技术,将规模扩展到大规模正问题和反问题。对于正问题模拟,我们继承了经典求解器的精度,同时提供了结构化的不确定性量化。在反问题上,我们在数秒内恢复非参数源场的后验——优于需要数分钟才能产生不太精确点估计的神经基线方法。

英文摘要

Nonlinear conservation laws are at the heart of many of the most important dynamical systems in science and engineering. In practical applications, such systems are often subject to various sources of uncertainty, e.g. due to sparse or noisy measurements. Inferring physical quantities and fields of interest then becomes an ill-posed problem which both classical numerical methods and modern deep learning-based methods struggle to treat appropriately. Recent work has framed classical numerical methods as Bayesian inference under Gaussian process priors, resulting in a physics-aware treatment of uncertainties. Following this line of work, we develop a novel numerically conservative method for uncertainty-aware simulations of nonlinear conservation laws. We use recent sparse approximation techniques to scale up to large-scale forward and inverse problems. For forward simulation, we inherit the accuracy of classical solvers while providing structured uncertainty quantification. On inverse problems, we recover posteriors over nonparametric source fields in seconds -- outperforming neural baselines that take minutes to produce a less accurate point estimate.

2605.31126 2026-06-01 cs.CL cs.AI cs.LG

Not All Synthetic Data Is Yours to Learn From

并非所有合成数据都适合学习

Sina Alemohammad, Li Chen, Richard G. Baraniuk, Zhangyang Wang

AI总结 研究无提示、无教师、无验证器、无奖励模型的自训练中,语言模型能否从自身生成的文本中学习,发现合成数据与学生之间的兼容性是关键,并揭示了能力与逐字记忆可分离的现象。

详情
AI中文摘要

语言模型能否从自身采样的纯文本中改进,无需提示、教师、验证器或奖励模型?可以,但仅当合成语料库与学生兼容时,这是一种源-学生对的关联属性,而非数据的内在属性。我们称之为潜在能力重现假说:弱自训练可以放大预训练模型中已有的能力,但仅在这种兼容条件下。我们在无提示无条件自训练的最小设置中研究这一点,其中基础语言模型仅在BOS令牌生成的文本上进行微调,没有任务规范或外部监督。我们报告三个发现。首先,合成效用是关联的而非内在的:自生成数据是最有效的来源,同源迁移优于更强但不同来源的训练,跨家族迁移显著较弱。其次,常见的内在代理失效:基准级别的语义相似性和学生下的平均每令牌似然都不能预测哪些语料库有帮助。第三,这种机制产生了一个令人惊讶的副产品。在受控的Pythia实验中,能力和逐字记忆解耦:基准效用得以保留或改善,而保留的精确匹配提取下降超过95%,无需遗忘集、隐私目标或针对性遗忘。总之,这些结果表明,无提示自训练通过放大学生已知的内容来工作,而不是从数据中导入结构。它们还揭示了一种无需任何显式遗忘目标即可分离能力和逐字记忆的机制。

英文摘要

Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but only when the synthetic corpus is compatible with the student, a relational property of the source-student pair rather than an intrinsic property of the data. We call this the latent capability resurfacing hypothesis: weak self-training can amplify capabilities already present in the pretrained model, but only under this compatibility condition. We study this in the minimal setting of prompt-free unconditional self-training, where base language models are fine-tuned on text generated from the BOS token alone, with no task specification or external supervision. We report three findings. First, synthetic utility is relational rather than intrinsic: self-generated data is the most effective source, same-lineage transfer outperforms stronger but differently trained sources, and cross-family transfer is substantially weaker. Second, common intrinsic proxies fail: neither benchmark-level semantic similarity nor average per-token likelihood under the student predicts which corpora help. Third, this regime produces a surprising byproduct. In controlled Pythia experiments, capability and verbatim memorization decouple: benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95 percent, with no forget set, privacy objective, or targeted unlearning. Together, these results suggest that prompt-free self-training works by amplifying what the student already knows, not by importing structure from the data. They also reveal a regime in which capability and verbatim memorization can be separated without any explicit unlearning objective.

2605.31124 2026-06-01 cs.CV

QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

QVGGT: 训练后量化的视觉几何基础Transformer

Zhizhen Pan, Hesong Wang, Huan Wang

AI总结 针对VGGT模型参数量大、部署受限的问题,提出QVGGT量化框架,通过选择性混合精度、令牌滤波与任务感知尺度搜索,实现近无损W4A16量化,显著降低内存和加速推理。

Comments Accepted by CVPR 2026. Project page: https://ddsacu.github.io/QVGGT/

详情
AI中文摘要

直接从图像估计3D属性的技术随着视觉几何基础Transformer(VGGT)的提出而迅速发展,该模型能够在前向传播中一次性预测相机参数、深度图和点云。然而,其12亿参数规模严重限制了在无人机和移动AR设备等资源受限平台上的部署。为解决这一限制,我们引入了QVGGT,一个专门为压缩VGGT而设计的量化框架。我们的方法基于以下观察:VGGT内的Transformer块对量化表现出异质性敏感度。因此,我们分析了逐块量化敏感度,并提出了一种选择性混合精度策略,为最脆弱的Transformer块分配更高精度。为了解决由高方差相机和注册令牌引起的量化误差放大问题,我们进一步引入了带相机信息补偿的令牌过滤,从激活校准中移除这些异常值,并使用PCA导出的全局补偿令牌恢复其几何线索。最后,我们开发了一种任务感知尺度搜索机制,不仅通过层重建,还通过多头监督以及相机姿态、深度图和点图之间的跨头几何一致性来评估候选量化尺度。在多个几何感知基准上的大量实验表明,QVGGT实现了近乎无损的W4A16量化,在保持所有3D预测头精度的同时,相比FP32实现了3~4.9倍的内存减少和高达2.8倍的硬件实际加速。我们的方法使得在边缘设备上实现高保真3D感知成为可能,从而在现实世界的受限环境中实现前馈3D重建模型的实际部署。

英文摘要

Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps. Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering 3$\sim$4.9$\times$ memory reduction and up to 2.8$\times$ real hardware speedup over FP32. Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in real-world constrained environments.

2605.31121 2026-06-01 cs.RO cs.AI

TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues

TARIC: 语义线索中断下基于记忆增强的可通行性感知户外视觉语言导航

Tianle Zeng, Hanjing Ye, Jianwei Peng, Jingwen Yu, Hanxuan Chen, Hong Zhang

AI总结 针对户外视觉语言导航中语义线索中断导致导航退化的问题,提出统一框架,通过可通行性一致的执行引导和不确定性感知的3D线索记忆,在长时间无线索阶段维持稳定导航,在四足和轮式平台上成功率提升显著。

详情
AI中文摘要

户外视觉语言导航(VLN)在远程、开放世界环境中经常受到语义线索中断的干扰,此时信息性目标线索变得稀疏、被遮挡或离开视野。一旦此类线索消失,智能体进入无线索阶段,并常退化为回溯、振荡航向或盲目探索。虽然基于记忆的方法试图弥合这些间隙,但在可通行性驱动的绕行中常常失败:记忆中的线索方向可能不可行,迫使绕行延长无线索阶段,并逐渐使机器人中心的线索过时、隐式历史模糊。这使得可通行性成为维持目标导向引导的稳定性条件,而不仅仅是局部安全问题。 我们提出一个统一的户外VLN框架,通过在长时间无线索阶段维持可通行性一致的可执行引导来应对语义线索中断。具体来说,我们的方法从可见性门控的目标或探索线索中提取语义方位,并利用实时近场可通行性轮廓将其接地为可执行航向,提供超越仅拒绝安全过滤的目标一致可行引导。为防止绕行期间引导退化,我们将间歇性2D证据提升为世界对齐的3D线索记忆,并配备不确定性感知读出机制,确保引导在机器人移动时持续可达且稳定。 我们在四足和轮式平台上评估该框架,路线长度为600-1000米。我们的方法在模拟中成功率比最强基线提高超过10个百分点,真实世界成功率达到40%,而最强基线为17.5%,且在长时间无线索间隔中具有显著更高的鲁棒性。

英文摘要

Outdoor vision-language navigation (VLN) in long-range, open-world environments is frequently disrupted by semantic-cue interruptions, where informative goal cues become sparse, occluded, or leave the field of view. Once such cues disappear, agents enter a cue-free phase and often degrade into backtracking, oscillatory headings, or aimless exploration. While memory-based methods attempt to bridge these gaps, they often fail under traversability-driven detours: the remembered cue direction may be infeasible, forcing detours that prolong cue-free phases and gradually render robot-centric cues stale and implicit histories blurred. This makes traversability a stability condition for maintaining goal-directed guidance, rather than merely a local safety concern. We propose a unified outdoor VLN framework that survives semantic-cue interruptions by maintaining traversability-consistent executable guidance throughout prolonged cue-free phases. Specifically, our method extracts semantic bearings from visibility-gated goal or exploration cues and grounds them into executable headings using a real-time near-field traversability profile, providing goal-consistent feasible guidance beyond reject-only safety filtering. To prevent guidance degradation during detours, we lift intermittent 2D evidence into a world-aligned 3D cue memory with an uncertainty-aware readout mechanism, ensuring guidance remains continuously reachable and stable as the robot moves. We evaluate the framework on quadrupedal and wheeled platforms over 600--1000 m routes. Our method improves simulation success rate by over 10 percentage points over the strongest baseline and achieves a real-world success rate of 40%, compared to 17.5% for the strongest baseline, with substantially higher robustness during prolonged cue-free intervals.

2605.31119 2026-06-01 cs.RO cs.LG

Don't Fool Me Twice: Adapting to Adversity in the Wild with Experience-Driven Reasoning

不要愚弄我两次:通过经验驱动推理在野外适应逆境

Navin Sriram Ravie, Andrew Jong, Krrish Jain, John Liu, Omar Alama, Bijo Sebastian, Sebastian Scherer

AI总结 提出一种持续学习框架,使移动机器人能够在线从干扰中学习,通过语义将异常行为归因于原因,从而更好地预测和规划未来。

详情
AI中文摘要

在机器人学中,危险和逆境模式通常具有具体性且相对于每个智能体。自主移动机器人的一个前沿是使智能体能够在未见的非结构化环境中有效运行。在未见的非结构化环境中的一个重大挑战是可能无法预测特定机器人的所有危险。尽管最近的工作使用大型基础视觉语言模型(VLM)来预先预测一个详尽的常识性危险列表,但仍然难以捕捉可能的交互和依赖于具体性的逆境。我们提出了一个持续学习框架,使移动具身智能体能够在线从干扰中学习,并通过语义将异常行为归因于原因,从而更好地预测和规划未来世界。我们的框架“不要愚弄我两次”首先观察干扰并描述其对机器人的影响;该描述通过视觉上下文增强,以查询VLM预测可能的原因;使用核回归对局部干扰进行特征化,从而实现对瞬态异常的高效、少样本建模。我们利用语义体素中心建模来估计认知不确定性,通过将交互驱动的干扰视为可学习的空间行为,实现更丰富的下游恢复。我们提出了四个假设,并在仿真和硬件上跨具体性和逆境模式进行了验证。

英文摘要

In robotics, dangers and adversity modes are often embodiment-specific and relative to each agent. A frontier of autonomous mobile robotics is to enable agents to operate effectively in the wild in unseen unstructured environments. A significant challenge in unseen unstructured environments is that it may not be possible to predict all the dangers to the specific robot. Although recent work has used large foundation vision-language models (VLMs) to preemptively predict an exhaustive list of common-sense dangers, it remains difficult to capture possible interaction and embodiment-dependent adversities. We propose a continual learning framework for a mobile embodied agent to learn online from disturbances and attribute anomalous behaviours to causes through semantics, enabling better prediction and planning of the world in the future. Our framework, "Don't Fool Me Twice", first observes disturbances and describes their effects on the robot; this description is augmented with visual context to query a VLM to predict possible causes; the local disturbance is characterized using kernel regression, which allows for efficient, few-shot modeling of transient anomalies. We leverage semantic voxel-centric modeling to estimate epistemic uncertainty, enabling richer downstream recovery by treating interaction-driven disturbances as learnable spatial behaviors. We present four hypotheses and validate them in simulation and on hardware across embodiments and adversity modes.

2605.31116 2026-06-01 cs.CV cs.RO

NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

NTR:端到端驾驶中场景令牌瓶颈的神经令牌重建

Jiahui Li, Jiawei Sun, Zixiang Ren, Ming Liu, Jiamin Shi, Ruiteng Zhao, Zhiyang Liu, Liying Liu, Zuoguan Wang, Kaidi Yang

AI总结 针对端到端驾驶中场景令牌瓶颈缺乏视觉监督的问题,提出神经令牌重建(NTR)框架,通过自蒸馏掩码潜在重建约束场景令牌保留更丰富的视觉表示,实现最先进的驾驶性能。

详情
AI中文摘要

最近的无感知端到端自动驾驶方法通过将密集的图像块令牌压缩为紧凑的场景令牌,用于下游轨迹生成和评分,从而绕过了显式的感知输出。虽然这些场景令牌为规划器形成了紧凑的视觉瓶颈,但它们仅从规划目标接收监督,对编码的视觉信息提供了有限的约束。为了解决这一限制,我们引入了神经令牌重建(NTR),一种表示学习框架,直接约束无感知驾驶中的紧凑场景令牌瓶颈。NTR引入了一种自蒸馏掩码潜在重建目标,该目标仅使用紧凑的场景令牌作为重建记忆来重建被掩码的块级潜在特征。这迫使重建梯度仅通过场景令牌瓶颈传递,鼓励场景令牌为规划保留更丰富且更少冗余的视觉表示。我们进一步引入了来自基础模型注释的语义先验,作为弱语义接口,将重建目标偏向于驾驶相关结构,而不引入显式的感知头。所有辅助重建组件在推理时被移除,部署的规划器保持不变。NTR在三个公共自动驾驶基准测试中实现了最先进的性能,包括Waymo E2E上的8.0461 RFS以及NavSim1&2上的94.1 PDMS / 90.9 EPDMS。学习到的场景令牌表现出更低的成对冗余和更高的有效秩,表明有效的瓶颈监督同时改善了紧凑视觉表示学习和规划性能。

英文摘要

Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.

2605.31115 2026-06-01 cs.CV

Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning

Polyphony: 基于扩散的双手动作分割,采用交替视觉Transformer和语义条件

Hao Zheng, Hu Wang, Tiantian Zheng, Prajjwal Bhattarai, Tuka Alhanai

AI总结 提出Polyphony三阶段方法,通过交替训练双手视觉Transformer、语义特征条件化和扩散分割,解决双手动作分割中的手间依赖、视觉不对称和语义模糊问题,在多个数据集上达到最优性能。

Comments CVPR 2026

详情
AI中文摘要

双手动作分割是从未修剪视频中密集预测双手动作,对于理解复杂的双手活动至关重要。然而,它带来了几个独特的挑战:复杂的手间依赖、双手之间的视觉不对称、主导手垄断梯度的表示冲突以及细粒度动作中的语义模糊性。我们提出了Polyphony,一种三阶段方法,通过以下方式应对这些挑战:(1) 交替双手视觉Transformer,在左右手小批量之间交替训练,以确保双手的梯度贡献平衡,同时共享时空编码器;(2) 语义特征条件化,将视觉特征与结构化的、组合式的动作描述对齐,以增强语义相似动作的区分度;(3) 基于扩散的分割,结合跨手特征融合以实现手间协调,以及自适应损失加权以平衡性能。Polyphony在双手数据集(HA-ViD、ATTACH)上达到了最先进水平,改进高达16.8个百分点,并在单流Breakfast数据集(82.5%)上超越了之前使用12倍大骨干网络的最佳方法。值得注意的是,我们的统一模型使用单个共享骨干网络,超越了需要单独每手模型的基线方法。代码位于https://github.com/x-labs-xyz/Polyphony-Dual-hand-Action-Segmentation。

英文摘要

Dual-hand action segmentation, densely predicting actions for both hands from untrimmed videos, is essential for understanding complex bimanual activities. However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through: (1) an Alternating Dual-Hand Vision Transformer that alternates training between left- and right-hand mini-batches to ensure balanced gradient contributions from both hands while sharing a spatio-temporal encoder; (2) Semantic Feature Conditioning that aligns visual features with structured, compositional action descriptions to enhance discrimination of semantically similar actions; and (3) Diffusion-Based Segmentation with cross-hand feature fusion for inter-hand coordination and adaptive loss weighting for balancing performance. Polyphony achieves state-of-the-art on both dual-hand datasets (HA-ViD, ATTACH) with improvements up to 16.8 points, and on the single-stream Breakfast dataset (82.5%), outperforming the prior best method that uses a 12x larger backbone. Notably, our unified model with a single shared backbone surpasses baselines requiring separate per-hand models. Code is at https://github.com/x-labs-xyz/Polyphony-Dual-hand-Action-Segmentation.

2605.31113 2026-06-01 cs.CL

TSM-Bench: Detecting LLM-Generated Text in Real-World Wikipedia Editing Practices

TSM-Bench:在真实维基百科编辑实践中检测LLM生成的文本

Gerrit Quaremba, Elizabeth Black, Denny Vrandečić, Elena Simperl

AI总结 针对维基百科等用户生成内容平台,提出多语言、多生成器、多任务的TSM-Bench基准,发现现有检测器在任务特定MGT上准确率下降10-40%,且存在泛化不对称性。

详情
AI中文摘要

自动检测机器生成文本(MGT)对于维护维基百科等用户生成内容(UGC)平台的知识完整性至关重要。现有的检测基准主要关注 extit{通用}文本生成任务(例如,“写一篇关于机器学习的文章。”)。然而,编辑者经常使用LLM进行特定的写作任务(例如,摘要)。这些 extit{任务特定}的MGT实例由于其受限的任务制定和上下文条件,往往更接近人类撰写的文本。在这项工作中,我们展示了一系列最先进的MGT检测器在识别反映维基百科真实编辑的任务特定MGT时存在困难。我们引入了 extsc{TSM-Bench},这是一个多语言、多生成器和 extit{多任务}基准,用于评估MGT检测器在常见的真实维基百科编辑任务上的表现。我们的发现表明:( extit{i})与之前的基准相比,平均检测准确率下降了10-40%;( extit{ii})存在泛化不对称性:在任务特定数据上微调能够泛化到通用数据——甚至跨领域——但反之则不然。我们证明,仅在通用MGT上微调的模型会过度拟合机器生成的表面伪影。我们的结果表明,与之前的基准相比,大多数检测器在UGC平台等真实世界上下文中仍不可靠。因此, extsc{TSM-Bench}为开发和评估未来模型提供了关键基础。

英文摘要

Automatically detecting machine-generated text (MGT) is critical to maintaining the knowledge integrity of user-generated content (UGC) platforms such as Wikipedia. Existing detection benchmarks primarily focus on \textit{generic} text generation tasks (e.g., ``Write an article about machine learning.''). However, editors frequently employ LLMs for specific writing tasks (e.g., summarisation). These \textit{task-specific} MGT instances tend to resemble human-written text more closely due to their constrained task formulation and contextual conditioning. In this work, we show that a range of SOTA MGT detectors struggle to identify task-specific MGT reflecting real-world editing on Wikipedia. We introduce \textsc{TSM-Bench}, a multilingual, multi-generator, and \textit{multi-task} benchmark for evaluating MGT detectors on common, real-world Wikipedia editing tasks. Our findings demonstrate that (\textit{i}) average detection accuracy drops by 10--40\% compared to prior benchmarks, and (\textit{ii}) a generalisation asymmetry exists: fine-tuning on task-specific data enables generalisation to generic data -- even across domains -- but not vice versa. We demonstrate that models fine-tuned exclusively on generic MGT overfit to superficial artefacts of machine generation. Our results suggest that, in contrast to prior benchmarks, most detectors remain unreliable for automated detection in real-world contexts such as UGC platforms. \textsc{TSM-Bench} therefore provides a critical foundation for developing and evaluating future models.

2605.31111 2026-06-01 cs.LG

Subspace-Decomposed JEPAs: Disentangling Progression and Content in Latent World Models

子空间分解的JEPA:解耦潜在世界模型中的进展与内容

Lucas Thil, Jesse Read, Rim Kaddah, Guillaume Doquet

AI总结 提出SD-JEPA方法,通过将JEPA潜在空间分解为正交的进展子空间和内容子空间,利用余弦边际三元组损失和SIGReg正则化分别约束,在控制基准上优于LeWM基线,并证明进展坐标可作为场景感知的指南针。

详情
AI中文摘要

联合嵌入预测架构(JEPA)通过预测未来嵌入来学习紧凑的潜在世界模型,但潜在空间的任何单一坐标都未被指定用于编码任务进展。我们将JEPA潜在空间分解为两个具有不相交角色的正交子空间:一个由余弦边际三元组损失塑造的低维进展子空间,以及一个由LeWM现有SIGReg目标正则化的高维内容子空间。我们证明两个抗坍塌力作用于不相交的坐标,因此它们加性组合而非在同一维度上竞争。我们的方法SD-JEPA在大多数控制基准上以匹配的计算量优于LeWM基线,并在Push-T上优于最强的非LeWM JEPA基线;子空间消融验证了分解是关键因素。除了规划之外,得到的一维角进展坐标在潜在空间中充当场景感知的指南针。它随任务进展而前进,当智能体回溯时后退,在受控扰动下既会尖峰也会重新定位到语义上合适的新任务阶段区域,以预测误差标量无法做到的方式将惊讶时刻与其意义分离。三个定量测试支持这一点:在40个保留的立方体情节中,|Δθ_t|在定位语义事件方面优于标准潜在预测误差惊讶度,最高可达+0.18的合并AUROC(在±1步容差下每情节胜率97.5%);在所有四个环境(每个环境40个情节)的情节内线性探针显示,8维进展子空间(潜在空间的4.2%)解释了72-95%的任务进展方差。

英文摘要

Joint-Embedding Predictive Architectures (JEPAs) learn compact latent world models by predicting future embeddings, but no single coordinate of the latent is designated to encode task progression. We carve the JEPA latent into two orthogonal subspaces with disjoint roles: a low-dimensional progression subspace shaped by a cosine-margin triplet loss, and a high-dimensional content subspace regularised by the existing SIGReg objective of LeWM. We prove that the two anti-collapse forces act on disjoint coordinates, so they compose additively rather than competing on the same dimensions. Our method, SD-JEPA improves over the LeWM baseline on the majority of its control benchmarks at matched compute, and outperforms the strongest non-LeWM JEPA baseline on Push-T; a subspace-ablation falsifier confirms the split is the load-bearing ingredient. Beyond planning, the resulting 1-D angular progression coordinate functions as a scene-aware compass on the latent. It advances with task progress, regresses when the agent backtracks, and under controlled perturbations both spikes and relocalises to a semantically appropriate new task-phase sector, separating the moment of surprise from its meaning in a way that prediction-error scalars cannot. Three quantitative tests back this up: $|Δθ_t|$ outperforms the standard latent-prediction-error surprise at localising semantic events on 40 held-out cube episodes by up to +0.18 pooled AUROC (97.5% per-episode win rate at $\pm 1$-step tolerance); a within-episode linear probe across all four environments (40 episodes per env) shows the 8-dimensional progression subspace (4.2% of the latent) explains 72-95% of task-progress variance..

2605.31110 2026-06-01 cs.RO

Building Generalization Into Behavior Generation Via Adaptive Compositions of Regularities

通过规律的自适应组合构建行为生成中的泛化能力

Aravind Battaje, Malte Bernhard, Vito Mengers, Oliver Brock

AI总结 本文通过AICON框架研究自适应组合规律(机器人-环境系统中的可预测关系)作为行为生成中泛化能力的关键机制,并在模拟实验中验证其有效性。

Comments 10 pages, 6 figures

详情
AI中文摘要

机器人领域的泛化需要关于世界如何结构化的先验知识,然而这种结构会随情境变化。本文研究一个命题:泛化源于将规律(机器人-环境系统中的可预测关系)自适应组合成适合情境的行为生成结构。我们通过分析AICON(主动互连)框架中的机制来检验这一命题,该框架将规律表示为可微分网络中的交互过程,其中感觉反馈实现组合,梯度下降生成行为。为了隔离自适应组合作为关键机制,我们研究了一个简单的模拟问题,其中所有相关规律都可以被识别。我们将所得模型暴露于设计时未考虑的各种新条件下,发现除了一个编码规律被证明不足的情况外,它在所有情况下都能生成情境适当的行为。消融实验表明,网络会根据规律的信息量自动调节哪些规律影响行为。这些结果表明,规律的自适应组合构成了将泛化能力构建到行为生成中的强大归纳偏置。

英文摘要

Generalization in robotics requires prior knowledge about how the world is structured, yet this structure changes from one situation to the next. This paper investigates the proposition that generalization arises from adaptively composing regularities -- predictable relationships within the robot-environment system -- into situation-appropriate structures for behavior generation. We examine this proposition by analyzing the mechanism in AICON (Active InterCONnect), a framework representing regularities as interacting processes in a differentiable network, where sensory feedback realizes composition and gradient descent generates behavior. To isolate adaptive composition as the key mechanism, we study a simple simulated problem in which all relevant regularities can be identified. We expose the resulting model to a wide range of novel conditions not considered during design, and we find that it generates context-appropriate behavior in all but one case, where encoded regularities are provably insufficient. Ablations reveal that the network automatically modulates which regularities influence behavior based on their informativeness. These results suggest that adaptive composition of regularities constitutes a powerful inductive bias for building generalization into behavior generation.