arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.01288 2026-05-11 cs.LG cond-mat.dis-nn stat.ML

A Theory of Saddle Escape in Deep Nonlinear Networks

Divit Rawal, Michael R. DeWeese

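AI summary This paper studies the long plateaus and abrupt feature-acquisition transitions that arise when training deep nonlinear networks from small initialization. By deriving an imbalance identity for the Frobenius norms of the layer weight matrices that holds for any smooth activation and any differentiable loss, the authors classify activation functions into four universality classes and, on the symmetric submanifold, reduce the matrix flow to a scalar ODE, obtaining an analytic critical-depth escape-time law governed by the number of bottleneck layers. The theory agrees closely with numerical simulations and shows that bottleneck structure plays a key role in escape time during deep-network training.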

English abstract

In deep networks with small initialization, training exhibits long plateaus separated by sharp feature-acquisition transitions. Whereas shallow nonlinear networks and deep linear networks are well studied, extending these analyses to deep nonlinear networks remains challenging. We derive an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss and use this to classify activation functions into four universality classes. On the permutation-symmetric submanifold, the identity combines with an approximate balance law to reduce the full matrix flow to a scalar ODE, giving a critical-depth escape time law $\tau_\star = \Theta(\varepsilon^{-(r-2)})$ governed by the number $r$ of layers at the bottleneck scale rather than the total depth $L$. We find that this same $r-2$ exponent is recovered under He-normal initialization with $r$ bottleneck layers rescaled by $\varepsilon$, where the symmetry manifold is preserved by the flow but not attracting. We find close agreement between our theory and numerical simulations.
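
To make the scaling in the abstract concrete, here is a minimal sketch. It assumes the reduced scalar ODE has the toy form da/dt = a^(r-1) with a(0) = ε; this form, the unit escape threshold, and the function names are illustrative assumptions, not taken from the paper. Separating variables gives an escape time of (ε^(-(r-2)) - 1)/(r - 2) for r > 2, which is Θ(ε^(-(r-2))).

```python
import math


def escape_time(eps: float, r: int, threshold: float = 1.0) -> float:
    """Time for da/dt = a**(r-1), a(0) = eps, to reach `threshold` (toy model)."""
    if r == 2:
        # da/dt = a grows exponentially, so the escape time is only logarithmic.
        return math.log(threshold / eps)
    k = r - 2
    # Closed form obtained by integrating a**(-(r-1)) da from eps to threshold.
    return (eps ** (-k) - threshold ** (-k)) / k


def loglog_slope(r: int, eps_small: float = 1e-4, eps_large: float = 1e-3) -> float:
    """Estimate d log(tau) / d log(eps) from two initialization scales."""
    t_small = escape_time(eps_small, r)
    t_large = escape_time(eps_large, r)
    return (math.log(t_large) - math.log(t_small)) / (
        math.log(eps_large) - math.log(eps_small)
    )


for r in (3, 4, 5):
    print(r, loglog_slope(r))
```

In this toy model the fitted log-log slope approaches -(r - 2) as ε shrinks, so the exponent depends on r and not on any total depth.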

2605.00814 2026-05-11 cs.CV cs.AI

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Zefeng He, Muxin Fu, Daizong Liu, Wei-Long Zheng, Yu Cheng

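AI summary Although autoregressive large vision-language models (LVLMs) perform well on multimodal tasks, they suffer from "visual signal dilution" during generation: visual attention decays as the generated sequence grows. To address this, the paper proposes Persistent Visual Memory (PVM), a lightweight learnable module that runs as a branch parallel to the feed-forward network (FFN) and provides a distance-agnostic retrieval pathway for visual information, strengthening the model's sustained perception of visual evidence. Experiments show that PVM delivers notable gains with negligible parameter overhead, particularly on complex reasoning tasks that demand persistent visual perception.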

English abstract

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to strengthen sustained, on-demand access to visual evidence. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for enhanced visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM shows improved robustness in longer generations and accelerates internal prediction convergence.
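
The dilution effect can be illustrated with a minimal sketch. It assumes every visual token shares one attention logit and every text token shares another; that simplification and the function name are hypothetical, not from the paper. Under this assumption the attention mass on visual tokens is their share of the softmax partition function, which shrinks roughly inversely with the number of accumulated text tokens.

```python
import math


def visual_attention_mass(n_visual: int, n_text: int,
                          visual_logit: float = 0.0,
                          text_logit: float = 0.0) -> float:
    """Total softmax attention on visual tokens under shared per-type logits."""
    visual = n_visual * math.exp(visual_logit)
    text = n_text * math.exp(text_logit)
    # The denominator is the attention partition function; it grows with n_text.
    return visual / (visual + text)


for n_text in (64, 256, 1024, 4096):
    print(n_text, round(visual_attention_mass(256, n_text), 4))
```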

2605.00380 2026-05-11 cs.LG cs.CL

ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning

Zihan Lin, Xiaohan Wang, Jie Cao, Jiajun Chai, Li Wang, Xiaodong Lu, Wei Lin, Ran He, Guojun Yin

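AI summary This paper proposes ResRL, a method for improving the reasoning ability of large language models while preserving generation diversity. ResRL introduces negative sample projection residual reinforcement learning, which decouples the semantic distributions shared by positive and negative samples and uses a low-rank positive-subspace projection together with gradient modulation to improve reasoning without reducing diversity. Experiments show that ResRL outperforms existing methods on average across multiple benchmarks, with notable gains on mathematical reasoning.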

Comments Accepted to ICML 2026. Preprint version. https://github.com/1229095296/ResRL.git

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) enhances reasoning of Large Language Models (LLMs) but usually exhibits limited generation diversity due to the over-incentivization of positive rewards. Although methods like Negative Sample Reinforcement (NSR) mitigate this issue by upweighting penalty from negative samples, they may suppress the semantic distributions shared between positive and negative responses. To boost reasoning ability without losing diversity, this paper proposes negative sample projection Residual Reinforcement Learning (ResRL) that decouples similar semantic distributions among positive and negative responses. We theoretically link Lazy Likelihood Displacement (LLD) to negative-positive head-gradient interference and derive a single-forward proxy that upper-bounds representation alignment to guide conservative advantage reweighting. ResRL then projects negative-token hidden representations onto an SVD-based low-rank positive subspace and uses projection residuals to modulate negative gradients, improving reasoning while preserving diversity and outperforming strong baselines on average across twelve benchmarks spanning Mathematics, Code, Agent Tasks, and Function Calling. Notably, ResRL surpasses NSR on mathematical reasoning by 9.4% in Avg@16 and 7.0% in Pass@128. Code is available at https://github.com/1229095296/ResRL.git.
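
The projection step can be sketched with NumPy as follows. The array shapes, the `rank` parameter, and the function name are illustrative assumptions; the abstract does not say how residuals are mapped to gradient weights, so this sketch stops at the relative residual norm.

```python
import numpy as np


def projection_residual(pos_hidden: np.ndarray,
                        neg_hidden: np.ndarray,
                        rank: int) -> np.ndarray:
    """Relative residual of each negative vector outside the positive subspace.

    pos_hidden: (n_pos, d) hidden states of positive-response tokens.
    neg_hidden: (n_neg, d) hidden states of negative-response tokens.
    """
    # Rows of vt are orthonormal right-singular vectors of the positive states.
    _, _, vt = np.linalg.svd(pos_hidden, full_matrices=False)
    basis = vt[:rank]                          # (rank, d) low-rank positive subspace
    projected = neg_hidden @ basis.T @ basis   # component shared with positives
    residual = neg_hidden - projected          # component specific to negatives
    norms = np.linalg.norm(neg_hidden, axis=1) + 1e-12
    return np.linalg.norm(residual, axis=1) / norms


pos = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
neg = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 3.0]])
print(projection_residual(pos, neg, rank=2))
```

A ratio near 0 means the negative token lies almost entirely in the positive subspace, and a ratio near 1 means it is almost orthogonal to it.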

2604.26509 2026-05-11 cs.RO cs.CV

3D Generation for Embodied AI and Robotic Simulation: A Survey

Tianwei Ye, Yifan Mao, Minwen Liao, Jian Liu, Chunchao Guo, Dazhao Du, Quanxin Shou, Fangqi Zhu, Song Guo

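AI summary This survey reviews 3D generation techniques for embodied AI and robotic simulation, organized around three roles: generating interactable objects, building task-oriented simulation environments, and supporting sim-to-real transfer. It observes that the field is shifting from visual realism toward interaction readiness and identifies the main bottlenecks, including limited physical annotations and the gap between geometric quality and physical validity. The survey offers a systematic analysis and future directions for making 3D generation a dependable foundation for embodied intelligence.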

Comments 27 pages, 11 figures, 8 tables

English abstract

Embodied AI and robotic systems increasingly depend on scalable, diverse, and physically grounded 3D content for simulation-based training and real-world deployment. While 3D generative modeling has advanced rapidly, embodied applications impose requirements far beyond visual realism: generated objects must carry kinematic structure and material properties, scenes must support interaction and task execution, and the resulting content must bridge the gap between simulation and reality. This survey reviews 3D generation for embodied AI and organizes the literature around three roles that 3D generation plays in embodied systems. In Data Generator, 3D generation produces simulation-ready objects and assets, including articulated, physically grounded, and deformable content for downstream interaction; in Simulation Environments, it constructs interactive and task-oriented worlds, spanning structure-aware, controllable, and agentic scene generation; and in Sim2Real Bridge, it supports digital twin reconstruction, data augmentation, and synthetic demonstrations for downstream robot learning and real-world transfer. We also show that the field is shifting from visual realism toward interaction readiness, and we identify the main bottlenecks, including limited physical annotations, the gap between geometric quality and physical validity, fragmented evaluation, and the persistent sim-to-real divide, that must be addressed for 3D generation to become a dependable foundation for embodied intelligence. Our project page is at https://3dgen4robot.github.io.

2604.24013 2026-05-11 cs.LG cs.AI cs.CV cs.DC

CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training

Rezaul Karim, Austin Wen, Wang Zongzuo, Weiwei Zhang, Yang Liu, Walid Ahmed

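AI summary As large language models grow rapidly in size, communication overhead in distributed training has become a major bottleneck for computational efficiency. This paper proposes CommFuse, which uses communication decomposition and fusion to eliminate the tail latency of existing overlap strategies. The method replaces conventional collective operations with fine-grained peer-to-peer communication and schedules partitioned computation accordingly, reducing communication overhead in data-parallel and tensor-parallel settings and improving training throughput and compute utilization.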

Comments Slightly modified the title, and corresponding minor wording change in the content

English abstract

The rapid growth in the size of large language models has necessitated the partitioning of computational workloads across accelerators such as GPUs, TPUs, and NPUs. However, these parallelization strategies incur substantial data communication overhead, significantly hindering computational efficiency. While communication-computation overlap presents a promising direction, existing data-slicing-based solutions suffer from tail latency. To overcome this limitation, this research introduces a novel communication-computation overlap technique to eliminate this tail latency in state-of-the-art overlap methods for distributed LLM training. The aim of this technique is to effectively mitigate the communication bottleneck of tensor parallelism and data parallelism for distributed training and inference. In particular, we propose a novel method termed CommFuse that replaces conventional collective operations of reduce-scatter and all-gather with decomposed peer-to-peer (P2P) communication and schedules partitioned computations to enable fine-grained overlap. Our method provides an exact algorithm for reducing communication overhead that eliminates tail latency. Moreover, it presents a versatile solution compatible with data-parallel training and various tensor-level parallelism strategies, including TPSP and UP. Experimental evaluations demonstrate that our technique consistently achieves lower latency, superior Model FLOPS Utilization (MFU), and high throughput.
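
As background for decomposing a collective into P2P transfers, here is a minimal single-process simulation of the standard ring reduce-scatter. It is not CommFuse's schedule, which the abstract does not detail; the function name and the use of one scalar per chunk are simplifying assumptions. Each of the n - 1 steps is one neighbor-to-neighbor send, which is the kind of fine-grained unit that partitioned computation could be overlapped with.

```python
def ring_reduce_scatter(buffers):
    """Simulate ring reduce-scatter over n ranks, each holding n scalar chunks.

    Returns {rank: (chunk_index, reduced_value)} for the chunk each rank owns.
    """
    n = len(buffers)
    bufs = [list(b) for b in buffers]  # copy so the caller's data is untouched
    for step in range(n - 1):
        # Snapshot what every rank sends this step, as if sends were simultaneous.
        outgoing = []
        for rank in range(n):
            idx = (rank - step) % n
            outgoing.append((idx, bufs[rank][idx]))
        # Each rank sends one chunk to its right neighbor, which accumulates it.
        for rank, (idx, value) in enumerate(outgoing):
            dst = (rank + 1) % n
            bufs[dst][idx] += value
    # After n - 1 steps, rank i holds the fully reduced chunk (i + 1) mod n.
    return {rank: ((rank + 1) % n, bufs[rank][(rank + 1) % n]) for rank in range(n)}


result = ring_reduce_scatter([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```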

2604.20403 2026-05-11 cs.LG

Robustness of Spatio-temporal Graph Neural Networks for Fault Location in Partially Observable Distribution Grids

Burak Karabulut, Carlo Manna, Chris Develder

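AI summary This paper studies the robustness of spatio-temporal graph neural networks (STGNNs) for fault location in partially observable distribution grids. The authors propose a graph-forming strategy that builds the graph from measured nodes only and introduce STGNN models based on GraphSAGE and an improved GATv2; experiments show that these models outperform a pure RNN baseline in both performance and training efficiency. The study also finds that graphs built from measured nodes alone substantially improve model efficiency and stability, offering a more practical and robust approach to fault location in partially observable distribution grids.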

English abstract

Fault location in distribution grids is critical for reliability and minimizing outage durations. Yet, it remains challenging due to partial observability, given sparse measurement infrastructure. Recent works show promising results by combining Recurrent Neural Networks (RNNs) and Graph Neural Networks (GNNs) for spatio-temporal learning. Still, many modern GNN architectures remain untested for this grid application, while existing GNN solutions have not explored GNN topology definitions beyond simply adopting the full grid topology to construct the GNN graph. We address these gaps by (i) systematically comparing a newly proposed graph-forming strategy (measured-only) to the traditional full-topology approach, (ii) introducing STGNN (Spatio-temporal GNN) models based on GraphSAGE and an improved Graph Attention (GATv2) for distribution grid fault location, and (iii) benchmarking them against state-of-the-art STGNN and RNN baselines on the IEEE 123-bus feeder. In our experiments, all evaluated STGNN variants achieve high performance and consistently outperform a pure RNN baseline, with improvements up to 11 percentage points F1. Among STGNN models, the newly explored RGATv2 and RGSAGE achieve only marginally higher F1 scores. Still, STGNNs demonstrate superior stability, with tight confidence intervals (within +/- 1.4%) compared to the RNN baseline (up to +/- 7.5%) across different experiment runs. Finally, our proposed reduced GNN topology (measured-only) shows clear benefits in both (i) model training time (6-fold reduction) and (ii) model performance (up to 11 points F1). This suggests that measured-only graphs offer a more practical, efficient, and robust framework for partially observable distribution grids.

2604.19697 2026-05-11 cs.CV

Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

Jing Jin, Hao Liu, Yan Bai, Yihang Lou, Zhenke Wang, Tianrun Yuan, Juntong Chen, Yongkang Zhu, Fanhu Zeng, Xuanyu Zhu, Tao Feng, Yige Xu

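AI summary To evaluate the reasoning ability of multimodal large language models in STEM domains, this work introduces StepSTEM, a fine-grained benchmark of 283 graduate-level problems across mathematics, physics, chemistry, and other fields that emphasizes evaluation of the cross-modal reasoning process. The benchmark enforces strict complementarity between textual and visual inputs and adds a step-level evaluation framework based on dynamic programming to assess reasoning chains. Experiments show that current mainstream models still rely mainly on textual reasoning and leave substantial headroom in cross-modal ability.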

English abstract

Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
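
A step-level alignment of this kind can be sketched with a small dynamic program. The scoring rule, the normalization by reference length, and the function names below are assumptions for illustration; the abstract does not specify StepSTEM's actual similarity measure or recurrence.

```python
def align_score(pred, ref, sim):
    """Best order-preserving one-to-one alignment score between two step lists."""
    n, m = len(pred), len(ref)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],                                    # skip a predicted step
                dp[i][j - 1],                                    # skip a reference step
                dp[i - 1][j - 1] + sim(pred[i - 1], ref[j - 1]), # match the two steps
            )
    return dp[n][m]


def best_reference_score(pred, refs, sim):
    """Score against several reference solutions and keep the best match."""
    return max(align_score(pred, ref, sim) / max(len(ref), 1) for ref in refs)


def exact_match(a, b):
    """Hypothetical similarity: 1.0 for identical steps, 0.0 otherwise."""
    return 1.0 if a == b else 0.0


predicted = ["a", "b", "d"]
references = [["a", "b", "c"], ["a", "d"]]
print(best_reference_score(predicted, references, exact_match))
```

Taking the maximum over references credits a prediction that follows any one valid solution path.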

2604.15719 2026-05-11 cs.AI

Harnessing Pre-Resolution Signals for Future Prediction Agents

Chuyang Wei, Maohang Gao, Zhixin Han, Kefei Chen, Yu Zhuang, Haoxiang Guan, Yanzhi Zhang, Yilin Cheng, Xiren Zhou, Huanhuan Chen, Jian Li, Jiyan He, Yu Shi, Yitong Duan, Shuxin Zheng

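AI summary This paper studies future prediction made before outcomes are known, where the core challenge is that supervision arrives only after resolution and gives little guidance for the key judgments made along the way. The authors propose using the "pre-resolution signals" that emerge across repeated forecasts to improve the agent's judgment and build Milkyway, a prediction system that stores reusable guidance in a continually updated external state so that forecasts improve across revisits. Experiments show strong performance on the FutureX and FutureWorld benchmarks, with the gains stemming mainly from system evolution driven by pre-resolution signals.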

Comments Work in progress

English abstract

Many high-stakes decisions depend on forecasts made before outcomes are known. In this future prediction setting, the central challenge is that public evidence evolves over time, while the main supervision signal arrives only after resolution: the realized outcome mainly assesses final correctness, offering only coarse guidance on what to track, what to verify, and which judgments to leave uncertain along the way. Our key observation is that revisiting the same unresolved question over time creates informative temporal contrasts across evolving evidence and repeated forecasts, exposing what earlier attempts missed before resolution and yielding a diagnostic signal we call the pre-resolution signal. We instantiate this idea in Milkyway, a future prediction agent with a persistent future prediction harness, an editable external state that stores reusable procedural guidance across revisits to the same unresolved question. As the same unresolved question is revisited, Milkyway extracts pre-resolution signals from evolving evidence and repeated forecasts, uses them to update the harness, and improves later forecasts on that question before resolution. After resolution, the realized outcome serves as a post-resolution check of provisional updates. On the FutureX and FutureWorld benchmarks, Milkyway achieves strong performance against competitive baselines, and a mechanism study suggests that the gains stem from harness evolution driven by pre-resolution signals rather than repeated prediction alone.

2604.06333 2026-05-11 cs.LG cs.CV

Drifting Fields are not Conservative

Leonard T. Franz, Sebastian Hoffmann, Tim Weiland, Bernhard Schölkopf, Georg Martius

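AI summary This paper studies the properties of the drift field in generative models and shows that drift fields are generally not conservative, so they cannot be written as the gradient of any scalar potential. The authors trace the non-conservatism to the position-dependent normalization, with the Gaussian kernel as the only radial exception. They then introduce the sharp kernel and a corresponding normalized drift field that is conservative for general radial kernels, so the scalar potential can be optimized directly by gradient descent, strengthening the theoretical foundation while preserving generation performance.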

English abstract

Drifting models have recently gained attention for generating high-quality samples in a single forward pass. During training, they learn a push-forward map by following a vector-valued field, the drift field. We ask whether this procedure is equivalent to optimizing a scalar loss and find that, in general, it is not: drift fields are not conservative and cannot be written as the gradient of any scalar potential. We identify the position-dependent normalization as the source of non-conservatism, with the Gaussian kernel as the unique radial exception. Guided by this, we introduce the sharp kernel $k^\#$ and a sharp-normalized drift field that is conservative for general radial kernels. The resulting vector field is the gradient of a scalar potential that can be optimized directly using stochastic gradient descent. Moreover, the field has the form of a score difference of kernel density estimates, and gives exact equilibrium identifiability. Thus, sharp normalization closes the gap to related literature, such as Wasserstein gradient-flows and denoising score matching, also for non-Gaussian kernels. Empirically, sharp normalization preserves the performance of the original drifting objective, suggesting that the non-conservative flexibility is not required for high-quality generation.
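
The role of the position-dependent normalization can be checked numerically in two dimensions, where a smooth field is conservative only if its curl vanishes. The sketch below is a simplified assumption: it uses a single set of target samples and a kernel-weighted mean-shift field, not the paper's full drift field, and the sample points and function names are hypothetical.

```python
import math

POINTS = [(1.0, 0.0), (0.0, 2.0), (-3.0, -1.0)]  # hypothetical target samples


def gaussian(r):
    return math.exp(-0.5 * r * r)


def laplace(r):
    return math.exp(-r)


def drift(x, y, kernel, points=POINTS):
    """Kernel-weighted mean of the samples minus the query point.

    Dividing by the weight sum is the position-dependent normalization.
    """
    weights = [kernel(math.hypot(px - x, py - y)) for px, py in points]
    total = sum(weights)
    mean_x = sum(w * px for w, (px, _) in zip(weights, points)) / total
    mean_y = sum(w * py for w, (_, py) in zip(weights, points)) / total
    return mean_x - x, mean_y - y


def curl(x, y, kernel, h=1e-4):
    """Central-difference estimate of dVy/dx - dVx/dy."""
    dvy_dx = (drift(x + h, y, kernel)[1] - drift(x - h, y, kernel)[1]) / (2 * h)
    dvx_dy = (drift(x, y + h, kernel)[0] - drift(x, y - h, kernel)[0]) / (2 * h)
    return dvy_dx - dvx_dy


print("gaussian curl:", curl(0.3, 0.2, gaussian))
print("laplace curl:", curl(0.3, 0.2, laplace))
```

For the unit-bandwidth Gaussian kernel this normalized field equals the gradient of the log kernel density estimate, so its curl vanishes up to finite-difference error, while the Laplace kernel leaves a clearly nonzero curl at this query point.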

2604.05777 2026-05-11 cs.AI

Emergent social transmission of model-based representations without inference

Silja Keßler, Miriam Bautista-Salinero, Claudio Tennie, Charley M. Wu

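AI summary This paper asks how people, despite limited cognitive capacity, acquire rich and flexible knowledge of their environment from others. Reinforcement learning simulations show that higher-level representations can be transmitted indirectly by observing behavior and using simple social cues, without inferring others' mental states. Model-based learners learn faster under social exposure and form more expert-like representations, suggesting that cultural transmission can arise from non-mentalizing processes.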

Comments Code available at https://github.com/skessler01/social-transmission-rl.git

English abstract

How do people acquire rich, flexible knowledge about their environment from others despite limited cognitive capacity? Humans are often thought to rely on computationally costly mentalizing, such as inferring others' beliefs. In contrast, cultural evolution emphasizes that behavioral transmission can be supported by simple social cues. Using reinforcement learning simulations, we show how minimal social learning can indirectly transmit higher-level representations. We simulate a naïve agent searching for rewards in a reconfigurable environment, learning either alone or by observing an expert - crucially, without inferring mental states. Instead, the learner heuristically selects actions or boosts value representations based on observed actions. Our results demonstrate that these cues bias the learner's experience, causing its representation to converge toward the expert's. Model-based learners benefit most from social exposure, showing faster learning and more expert-like representations. These findings show how cultural transmission can arise from simple, non-mentalizing processes exploiting asocial learning mechanisms.

2604.03147 2026-05-11 cs.CL cs.AI cs.CY

Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang, Jing Shao

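AI summary This study shows that emotion vectors in large language models form a circular geometry within a two-dimensional valence-arousal (VA) subspace, and it recovers the VA axes underlying emotion steering vectors through principal component decomposition and ridge regression. Steering along these axes gives monotonic control over the affective properties of generated text and bidirectional control over downstream behaviors (refusal and sycophancy). The effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B, and the authors propose lexical mediation to explain them.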

English abstract

We show that emotion vectors in LLMs are organized by a two-dimensional valence-arousal (VA) subspace exhibiting circular geometry. Through principal component decomposition and ridge regression, we recover meaningful VA axes underlying emotion steering vectors whose projections correlate with human affect ratings across 44,728 words. Steering along these axes produces monotonic control over the affective properties of generated text, and further affords bidirectional control over multiple downstream behaviors (refusal and sycophancy) from a single subspace. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B. We propose lexical mediation to explain why these effects and prior emotionally framed controls work: refusal and compliance tokens occupy distinct VA regions, and VA steering directly modulates their emission probabilities.

2603.23198 2026-05-11 cs.LG cs.CL

Sparser, Faster, Lighter Transformer Language Models

Edoardo Cetin, Stefano Peluchetti, Emilio Castillo, Akira Naruse, Mana Murakami, Llion Jones

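AI summary This paper studies how unstructured sparsity can reduce the computational cost of large language models (LLMs), focusing on the feedforward layers. The authors propose a new sparse packing format and accompanying CUDA kernels that fit the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during inference and training. Experiments show that simple L1 regularization can induce over 99% sparsity with negligible impact on model performance while substantially improving throughput, energy efficiency, and memory usage.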

Comments Code and checkpoints available at: https://github.com/SakanaAI/sparser-faster-llms

English abstract

Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open-source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.
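
To show why unstructured sparsity reduces work, here is a generic row-wise packing sketch in plain Python. It is not the paper's packing format or its CUDA kernels, which the abstract does not describe; the layout, the tolerance parameter, and the function names are assumptions. Only nonzero weights are stored and multiplied.

```python
def pack_rows(dense, tol=0.0):
    """Pack each row as (column_indices, values), keeping entries with |v| > tol."""
    packed = []
    for row in dense:
        cols = [j for j, v in enumerate(row) if abs(v) > tol]
        packed.append((cols, [row[j] for j in cols]))
    return packed


def sparse_matvec(packed, x):
    """Multiply the packed matrix by x, touching only the stored entries."""
    return [sum(v * x[j] for j, v in zip(cols, vals)) for cols, vals in packed]


def sparsity(dense, tol=0.0):
    """Fraction of entries whose magnitude is at most tol."""
    total = sum(len(row) for row in dense)
    zeros = sum(1 for row in dense for v in row if abs(v) <= tol)
    return zeros / total


weights = [[0, 2, 0, 0], [0, 0, 0, 0], [1, 0, 0, 3]]
packed = pack_rows(weights)
print(sparsity(weights), sparse_matvec(packed, [1, 2, 3, 4]))
```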

2603.15001 2026-05-11 cs.LG cs.AI

How Log-Barrier Helps Exploration in Policy Optimization

Leonardo Cesani, Matteo Papini, Marcello Restelli

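AI summary This paper studies exploration in policy optimization. It points out that existing convergence guarantees for the Stochastic Gradient Bandit (SGB) algorithm rely on unrealistic assumptions and proposes adding log-barrier regularization to strengthen the policy's exploration. The resulting method matches the sample complexity of SGB while converging under more general conditions, and the analysis reveals a geometric connection between the log-barrier and Natural Policy Gradient. Numerical simulations support the theory.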

English abstract

Recently, it has been shown that the Stochastic Gradient Bandit (SGB) algorithm converges to a globally optimal policy with a constant learning rate. However, these guarantees rely on unrealistic assumptions about the learning process, namely that the probability of the optimal action is always bounded away from zero. We attribute this to the lack of an explicit exploration mechanism in SGB. To address these limitations, we propose to regularize the SGB objective with a log-barrier on the parametric policy, structurally enforcing a minimal amount of exploration. We prove that Log-Barrier Stochastic Gradient Bandit (LB-SGB) matches the sample complexity of SGB, but also converges (at a slower rate) without any assumptions on the learning process. We also show a connection between the log-barrier regularization and Natural Policy Gradient, as both exploit the geometry of the policy space by controlling the Fisher information. We validate our theoretical findings through numerical simulations, showing the benefits of the log-barrier regularization.
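
The effect of the barrier can be seen in a small deterministic sketch. It assumes a softmax policy, a regularizer of the form η Σ_a log π(a), and exact expected gradients in place of sampled ones; these choices, the reward values, and the function names are illustrative assumptions, not the paper's algorithm. For a softmax policy the barrier contributes η(1 - Kπ(b)) to the gradient of logit b, which pushes up any action whose probability falls below 1/K.

```python
import math


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def lb_gradient(theta, rewards, eta):
    """Exact gradient of sum_a pi(a) r(a) + eta * sum_a log pi(a) w.r.t. theta."""
    pi = softmax(theta)
    k = len(theta)
    mean_reward = sum(p * r for p, r in zip(pi, rewards))
    # First term: policy-gradient part. Second term: log-barrier part.
    return [p * (r - mean_reward) + eta * (1.0 - k * p) for p, r in zip(pi, rewards)]


def train(rewards, eta, lr=0.5, steps=5000):
    """Gradient ascent from the uniform policy; returns the final probabilities."""
    theta = [0.0] * len(rewards)
    for _ in range(steps):
        grad = lb_gradient(theta, rewards, eta)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)


probs = train([1.0, 0.5, 0.0], eta=0.01)
print(probs)
```

In this toy setting the policy still concentrates on the best arm, while every arm keeps a probability on the order of η, so no action is ever starved of exploration.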

2603.09742 2026-05-11 cs.LG math.DS stat.ML

Upper Generalization Bounds for Neural Oscillators

Zifeng Huang, Konstantin M. Zuev, Yong Xia, Michael Beer

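AI summary This paper studies the generalization of neural oscillators derived from second-order ordinary differential equations when learning dynamic mappings of complex nonlinear structural systems. Using the Rademacher complexity framework, it derives upper generalization bounds for approximating causal, uniformly continuous operators between continuous temporal function spaces and for approximating uniformly asymptotically incrementally stable second-order dynamical systems, and extends them to the squared Wasserstein-1 distance between the probability measures of quantities of interest computed from the target operator and from the learned neural oscillator. The analysis shows that estimation errors grow polynomially with network size and time length, avoiding the curse of parametric complexity, and that constraining the Lipschitz constants of the MLPs through loss regularization can improve generalization. Numerical experiments confirm the predicted power laws of the error and the effectiveness of constraining the MLPs' matrix and vector norms under limited training data.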

Comments This manuscript contains 33 pages with 6 figures

English abstract

Neural oscillators that originate from second-order ordinary differential equations (ODEs) have shown competitive performance in learning mappings between dynamic loads and responses of complex nonlinear structural systems. Despite this empirical success, theoretically quantifying the generalization capacities of their neural network architectures remains undeveloped. In this study, the neural oscillator consisting of a second-order ODE followed by a multilayer perceptron (MLP) is considered. Its upper probably approximately correct (PAC) generalization bound for approximating causal and uniformly continuous operators between continuous temporal function spaces and that for approximating the uniformly asymptotically incrementally stable second-order dynamical systems are derived by leveraging the Rademacher complexity framework. These bounds are further extended to the squared Wasserstein-1 distances between the probability measures of quantities of interest calculated from target causal operators and the corresponding learned neural oscillators. The theoretical results show that the estimation errors grow polynomially with respect to both MLP sizes and the time length, thereby avoiding the curse of parametric complexity. Furthermore, the derived error bounds demonstrate that constraining the Lipschitz constants of the MLPs via loss function regularization can improve the generalization ability of the neural oscillator. Numerical studies considering a Bouc-Wen nonlinear system under stochastic seismic excitation validate the theoretically predicted power laws of the estimation errors with respect to the sample size and time length, and confirm the effectiveness of constraining MLPs' matrix and vector norms in enhancing the performance of the neural oscillator under limited training data.

2603.09652 2026-05-11 cs.AI

MiniAppBench: Evaluating the Shift from Text to Interactive HTML Responses in LLM-Powered Assistants

Zuhao Zhang, Chengyue Yu, Yuante Li, Chenyi Zhuang, Linjian Mo, Shuai Li

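AI summary As large language models advance in code generation, human-AI interaction is shifting from static text responses to dynamic, interactive HTML-based applications, termed MiniApps. To evaluate this capability, the paper introduces MiniAppBench, described as the first comprehensive benchmark for principle-driven interactive application generation, with 500 tasks drawn from a real-world application. It also introduces MiniAppEval, an evaluation framework that uses browser automation for human-like exploratory testing and assesses application quality along three dimensions: Intention, Static, and Dynamic.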

English abstract

With the rapid advancement of Large Language Models (LLMs) in code generation, human-AI interaction is evolving from static text responses to dynamic, interactive HTML-based applications, which we term MiniApps. These applications require models to not only render visual interfaces but also construct customized interaction logic that adheres to real-world principles. However, existing benchmarks primarily focus on algorithmic correctness or static layout reconstruction, failing to capture the capabilities required for this new paradigm. To address this gap, we introduce MiniAppBench, the first comprehensive benchmark designed to evaluate principle-driven, interactive application generation. Sourced from a real-world application with 10M+ generations, MiniAppBench distills 500 tasks across six domains (e.g., Games, Science, and Tools). Furthermore, to tackle the challenge of evaluating open-ended interactions where no single ground truth exists, we propose MiniAppEval, an agentic evaluation framework. Leveraging browser automation, it performs human-like exploratory testing to systematically assess applications across three dimensions: Intention, Static, and Dynamic. Our experiments reveal that current LLMs still face significant challenges in generating high-quality MiniApps, while MiniAppEval demonstrates high alignment with human judgment, establishing a reliable standard for future research. Our homepage is available at miniappbench.github.io.

2603.06859 2026-05-11 cs.LG cs.AI

Exact Is Easier: Credit Assignment for Cooperative LLM Agents

Yanjun Chen, Yirong Sun, Hanlin Wang, Jinghan Wang, Xinming Zhang, Xiaoyu Shen, Wenjie Li, Wei Zhang

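AI summary This paper studies how to measure each agent's contribution accurately in cooperative large language model (LLM) systems. Whereas conventional multi-agent reinforcement learning relies on approximation, the authors observe that in cooperative LLM systems the interaction history is a deterministic function of observable text, so the state at every decision point can be restored exactly and causal contribution can be measured without bias. Building on this, C3 fixes the complete history, freezes the behavior policy, and samples alternative actions to compute unbiased per-decision advantages. Experiments show that C3 outperforms existing methods on multiple benchmarks, and the paper also presents what it describes as the first method-agnostic auditing tool for multi-agent LLM credit assignment.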

English abstract

Removing an agent from a cooperative team to measure its contribution seems natural, yet in multi-agent LLM systems this evaluation distorts the result it claims to measure. This failure is not isolated: learned critics, trajectory-level baselines, and agent-removal counterfactuals all inherit from standard multi-agent reinforcement learning a premise that exact counterfactual evaluation requires privileged environment access, and therefore approximate. In cooperative LLM systems, this premise is false. Interaction histories are deterministic functions of observable text with no hidden state, so any decision point can be restored exactly, making direct causal measurement possible without parametric approximation. C3 exploits this property by fixing the complete history at each decision point, sampling alternative actions under a frozen behavior policy, and computing unbiased per-decision advantages through a parameter-free leave-one-out baseline. Across six benchmarks spanning math reasoning and code generation, two model families, and two multi-agent topologies, C3 consistently outperforms all baselines; a controlled decomposition confirms gains originate from credit quality, not architecture, while checkpoint restoration reduces training token consumption. The exact solution proves simpler, cheaper, and more effective than all approximate alternatives. The same structural property that enables exact credit also enables exact verification: three independently computable diagnostics, credit fidelity, within-group variance, and inter-agent influence, constitute the first method-agnostic auditing tool for multi-agent LLM credit assignment. Our code is available at https://github.com/EIT-EAST-Lab/C3
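
The leave-one-out baseline mentioned in the abstract can be sketched as follows, assuming each decision point yields a list of scalar returns, one per sampled alternative action; the function name and input layout are hypothetical. Each action is compared against the mean return of the other samples from the same restored history, so the baseline needs no learned parameters and does not depend on the action being scored.

```python
def leave_one_out_advantages(returns):
    """A_k = R_k minus the mean of the other sampled returns at the same decision."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two sampled actions per decision point")
    total = sum(returns)
    return [r - (total - r) / (n - 1) for r in returns]


advantages = leave_one_out_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

By construction the advantages at one decision point sum to zero, so only relative quality among the sampled actions drives the update.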

2603.06811 2026-05-11 cs.AI

Making AI Evaluation Deployment Relevant Through Context Specification

Matthew Holmes, Thiago Lacerda, Reva Schwartz

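AI summary This paper discusses how context specification can make AI evaluation more relevant to real deployments. It argues that current evaluation approaches often overlook the operational realities that determine deployment success, leaving organizations unable to judge whether AI tools will deliver durable value. The authors propose explicitly defining the properties, behaviors, and outcomes that matter in a given setting, turning diffuse stakeholder perspectives into constructs that can be observed and measured and providing a clear roadmap for deployment evaluation.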

Comments 8 pages; 2 figures

English abstract

With many organizations struggling to gain value from AI deployments, pressure to evaluate AI in an informed manner has intensified. Status quo AI evaluation approaches often mask the operational realities that ultimately determine deployment success, making it difficult for organizational decision makers to know whether and how AI tools will deliver durable value. We introduce and describe context specification as a process to support and inform this decision making process. Context specification turns diffuse stakeholder perspectives about what matters in a given setting into clear, named constructs: explicit definitions of the properties, behaviors, and outcomes that evaluations aim to capture, so they can be observed and measured in context. The process serves as a foundational roadmap for evaluating what AI systems are likely to do in the deployment contexts that organizations actually manage.

2603.05539 2026-05-11 cs.LG cs.AI cs.IR cs.MM

VDCook: DIY video data cook your MLLMs

Chengwei Wu

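AI summary This paper presents VDCook, a self-evolving video data operating system that gives researchers and vertical-domain teams a configurable platform for constructing video data. Users issue data requests through natural language queries and adjustable parameters; the system optimizes the query, runs video retrieval and controlled synthesis in parallel, and produces data packages with complete provenance and metadata. VDCook supports automated data ingestion based on the MCP protocol so that datasets can be continuously updated and expanded, and it provides multi-dimensional metadata annotation as a basis for later processing and indexing, lowering the barrier to building specialized video training datasets.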

English abstract

We introduce VDCook: a self-evolving video data operating system, a configurable video data construction platform for researchers and vertical domain teams. Users initiate data requests via natural language queries and adjustable parameters (scale, retrieval-synthesis ratio, quality threshold). The system automatically performs query optimization, concurrently running real video retrieval and controlled synthesis modules. It ultimately generates in-domain data packages with complete provenance and metadata, along with reproducible Notebooks. Unlike traditional static, one-time-built datasets, VDCook enables continuous updates and domain expansion through its automated data ingestion mechanism based on MCP (Model Context Protocol)\cite{mcp2024anthropic}, transforming datasets into dynamically evolving open ecosystems. The system also provides multi-dimensional metadata annotation (scene segmentation, motion scoring, OCR ratio, automatic captioning, etc.), laying the foundation for flexible subsequent data "cooking" and indexing\cite{vlogger}. This platform aims to significantly lower the barrier to constructing specialized video training datasets through infrastructure-level solutions, while supporting community contributions and a governance-enabled data expansion paradigm. Project demo: https://screenapp.io/app/v/WP0SvffgsH

2603.00223 2026-05-11 cs.CV quant-ph

Pretty Good Measurement for Radiomics: A Quantum-Inspired Multi-Class Classifier for Lung Cancer Subtyping and Prostate Cancer Risk Stratification

Giuseppe Sergioli, Carlo Cuccu, Giovanni Pasini, Alessandro Stefano, Giorgio Russo, Andrés Camilo Granda Arango, Roberto Giuntini

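AI summary This paper proposes a quantum-inspired multi-class classification method based on the Pretty Good Measurement (PGM) and applies it to lung cancer subtyping and prostate cancer risk stratification in medical imaging. The method maps each class to an encoded mixed quantum state and classifies with a single POVM (positive operator-valued measure), giving a genuinely multi-class strategy with no reduction to pairwise or one-vs-rest schemes. Experiments show that the method is competitive across several radiomics tasks, performs especially well on the binary and three-class lung cancer tasks, and yields clinically relevant sensitivity-specificity trade-offs in prostate cancer risk stratification.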

Comments 22 pages, 9 figures, 12 tables, in preparation for journal submission

English abstract

We investigate a quantum-inspired approach to supervised multi-class classification based on the Pretty Good Measurement (PGM), viewed as an operator-valued decision rule derived from quantum state discrimination. The method associates each class with an encoded mixed state and performs classification through a single POVM construction, thus providing a genuinely multi-class strategy without reduction to pairwise or one-vs-rest schemes. In this perspective, classification is reformulated as the discrimination of a finite ensemble of class-dependent density operators, with performance governed by the geometry induced by the encoding map and by the overlap structure among classes. To assess the practical scope of this framework, we apply the PGM-based classifier to two biomedical radiomics case studies: histopathological subtyping of non-small-cell lung carcinoma (NSCLC) and prostate cancer (PCa) risk stratification. The evaluation is conducted under protocols aligned with previously reported radiomics studies, enabling direct comparison with established classical baselines. The results show that the PGM-based classifier is consistently competitive and, in several settings, improves upon standard methods. In particular, the method performs especially well in the NSCLC binary and three-class tasks, while remaining competitive in the four-class case, where increased class overlap yields a more demanding discrimination geometry. In the PCa study, the PGM classifier remains close to the strongest ensemble baseline and exhibits clinically relevant sensitivity--specificity trade-offs across feature-selection scenarios.
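
The PGM decision rule described in this abstract can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the encoding (append a constant 1 and normalise), the helper names `encode`, `class_states`, `pgm_effects`, and `predict`, and the toy data are all hypothetical. What the sketch does take from the standard definition is the measurement itself: with priors p_c and class states rho_c, the PGM elements are E_c = rho^{-1/2} (p_c rho_c) rho^{-1/2}, where rho = sum_c p_c rho_c.

```python
import numpy as np

def encode(x):
    """Assumed encoding: append a constant 1 and normalise to a unit vector (a pure state)."""
    v = np.append(np.asarray(x, dtype=float), 1.0)
    return v / np.linalg.norm(v)

def class_states(X, y):
    """Build one mixed state (density matrix) and one prior per class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    labels = sorted(set(y.tolist()))
    priors, rhos = [], []
    for c in labels:
        vecs = [encode(x) for x in X[y == c]]
        rhos.append(sum(np.outer(v, v) for v in vecs) / len(vecs))
        priors.append(len(vecs) / len(y))
    return labels, priors, rhos

def pgm_effects(priors, rhos, tol=1e-10):
    """PGM elements E_c = rho^{-1/2} (p_c rho_c) rho^{-1/2}, with rho = sum_c p_c rho_c."""
    rho = sum(p * r for p, r in zip(priors, rhos))
    w, U = np.linalg.eigh(rho)
    # Pseudo-inverse square root: eigenvalues below tol are treated as zero.
    inv_sqrt = np.array([1.0 / np.sqrt(v) if v > tol else 0.0 for v in w])
    R = (U * inv_sqrt) @ U.T
    return [R @ (p * r) @ R for p, r in zip(priors, rhos)]

def predict(x, labels, effects):
    """Assign the class whose measurement element gives the largest outcome probability."""
    v = encode(x)
    scores = [float(v @ E @ v) for E in effects]
    return labels[int(np.argmax(scores))]

# Toy three-class data (hypothetical, not from the paper).
X = [[0, 0], [0.1, 0], [5, 5], [5.1, 5], [-5, 5], [-5, 5.2]]
y = [0, 0, 1, 1, 2, 2]
labels, priors, rhos = class_states(X, y)
effects = pgm_effects(priors, rhos)
print(predict([4, 6], labels, effects))
```

All classes are handled by one set of measurement elements, which is what makes the rule multi-class without pairwise or one-vs-rest reductions. On this toy data the elements sum to the identity because the average state has full rank.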

2603.00041 2026-05-11 cs.LG cs.AI econ.EM stat.ME

Econometric vs. Causal Structure-Learning for Time-Series Policy Decisions: Evidence from the UK COVID-19 Policies

Bruno Petrungaro, Anthony C. Constantinou

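AI summary This paper compares how econometric methods and causal structure-learning methods perform at causal discovery for time-series policy decisions, using the UK's COVID-19 policies as an empirical case study. It evaluates four econometric methods against eleven causal machine learning algorithms in terms of graphical structure, model dimensionality, and ability to recover causal effects. Econometric methods provide clear rules for temporal structure, whereas causal ML methods explore a larger space of graph structures and so capture more identifiable causal relationships. The study offers empirical grounds for causal ML to borrow from econometrics and provides code that translates econometric results into a Bayesian network tool (the R library bnlearn).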

English abstract

Causal machine learning (ML) recovers graphical structures that inform us about potential cause-and-effect relationships. Most progress has focused on cross-sectional data with no explicit time order, whereas recovering causal structures from time series data remains the subject of ongoing research in causal ML. In addition to traditional causal ML, this study assesses econometric methods that some argue can recover causal structures from time series data. The use of these methods can be explained by the significant attention the field of econometrics has given to causality, and specifically to time series, over the years. This presents the possibility of comparing the causal discovery performance between econometric and traditional causal ML algorithms. We seek to understand if there are lessons to be incorporated into causal ML from econometrics, and provide code to translate the results of these econometric methods to the most widely used Bayesian Network R library, bnlearn. We investigate the benefits and challenges that these algorithms present in supporting policy decision-making, using the real-world case of COVID-19 in the UK as an example. Four econometric methods are evaluated in terms of graphical structure, model dimensionality, and their ability to recover causal effects, and these results are compared with those of eleven causal ML algorithms. Amongst our main results, we see that econometric methods provide clear rules for temporal structures, whereas causal-ML algorithms offer broader discovery by exploring a larger space of graph structures that tends to lead to denser graphs that capture more identifiable causal relationships.

2602.16360 2026-05-11 cs.RO

Docking and Persistent Operations for a Resident Underwater Vehicle

Leonard Günzel, Gabrielė Kasparavičiūtė, Ambjørn Grimsrud Waldum, Bjørn-Magnus Moslått, Abubakar Aliyu Badawi, Celil Yılmaz, Md Shamin Yeasher Yousha, Robert Staven, Martin Ludvigsen

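AI summary This paper studies how to achieve persistent autonomous operation of a resident underwater vehicle at depth, to overcome the cost and efficiency limits of conventional underwater monitoring. The authors present a resident system that pairs a docking station with a mini-class Remotely Operated Vehicle (ROV) and, at 90 m depth, performs autonomous navigation, docking by visual localisation, and local inspection tasks. The system demonstrated a high autonomous docking success rate and fast mission execution, validating the fusion of acoustic and visual navigation in real underwater conditions and pointing to a route toward low-cost, scalable underwater monitoring.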

English abstract

Our understanding of the oceans remains limited by sparse and infrequent observations, primarily because current methods are constrained by the high cost and logistical effort of underwater monitoring, relying either on sporadic surveys across broad areas or on long-term measurements at fixed locations. To overcome these limitations, monitoring systems must enable persistent and autonomous operations without the need for continuous surface support. Despite recent advances, resident underwater vehicles remain uncommon due to persistent challenges in autonomy, robotic resilience, and mechanical robustness, particularly under long-term deployment in harsh and remote environments. This work addresses these problems by presenting the development, deployment, and operation of a resident infrastructure using a docking station with a mini-class Remotely Operated Vehicle (ROV) at 90 m depth. The ROV is equipped with enhanced onboard processing and perception, allowing it to autonomously navigate using USBL signals, dock via ArUco marker-based visual localisation fused through an Extended Kalman Filter, and carry out local inspection routines. The system demonstrated a 90 % autonomous docking success rate and completed full inspection missions within four minutes, validating the integration of acoustic and visual navigation in real-world conditions. These results show that reliable, untethered operations at depth are feasible, highlighting the potential of resident ROV systems for scalable, cost-effective underwater monitoring.

2602.14868 2026-05-11 cs.LG cs.AI

Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning

Ilia Mahrooghi, Aryo Lotfi, Emmanuel Abbe

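AI summary To address the low sample efficiency that sparse rewards cause in reinforcement learning, this study proposes Goldilocks, a new data sampling strategy. A teacher model predicts how difficult each question is for the student model and selects questions of moderate difficulty (neither too easy nor too hard), so that the student's reasoning ability is trained more efficiently. Experiments show that, under the same compute budget, the method improves model performance on mathematical reasoning tasks.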

Comments 28 pages, 13 figures

English abstract

Reinforcement learning has emerged as a powerful paradigm for unlocking reasoning capabilities in language models. However, relying on sparse rewards makes this process highly sample-inefficient, as models must navigate vast search spaces with minimal feedback. While classic curriculum learning aims to mitigate this by ordering data based on complexity, prior works have primarily targeted small datasets and do not directly transfer to the large-scale settings typical of modern LM training. Furthermore, the right ordering for a specific model is often unclear. To address this, we propose Goldilocks, a novel teacher-driven data sampling strategy that aims to predict each question's difficulty for the student model. The teacher model selects questions of appropriate difficulty for the student model, i.e., questions that are neither too easy nor too hard (Goldilocks principle), while training the student with GRPO. By leveraging the student's performance on seen samples, the teacher continuously adapts to the student's evolving abilities. On the OpenMathReasoning dataset, Goldilocks data sampling improves the performance of models trained with standard GRPO under the same compute budget.
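
The selection idea in this abstract can be illustrated with a small sketch. It rests on assumptions that go beyond the abstract: rewards are taken to be binary, the teacher is replaced by a plain lookup table of predicted success rates, and the score p(1 - p) is used as a proxy for learning signal (with GRPO, a group of rollouts that all succeed or all fail gives zero group-relative advantage). The names `goldilocks_select` and `update_teacher` are hypothetical, and the paper's actual teacher model and selection criterion may differ.

```python
def goldilocks_select(pred_success, k):
    """Pick the k questions whose predicted success rate p maximises p * (1 - p).

    pred_success maps question id -> predicted student success rate in [0, 1].
    p * (1 - p) peaks at p = 0.5, so questions that are neither too easy
    nor too hard are preferred (an assumed scoring rule, not the paper's).
    """
    ranked = sorted(pred_success.items(),
                    key=lambda item: item[1] * (1.0 - item[1]),
                    reverse=True)
    return [question for question, _ in ranked[:k]]

def update_teacher(pred_success, question, observed_rate, lr=0.5):
    """Move the prediction toward the success rate observed on the student's rollouts."""
    pred_success[question] += lr * (observed_rate - pred_success[question])

# Hypothetical predicted success rates for four questions.
preds = {"q1": 0.02, "q2": 0.5, "q3": 0.9, "q4": 0.6}
batch = goldilocks_select(preds, k=2)

# After training on the batch, the teacher adapts to what the student actually did.
update_teacher(preds, "q1", observed_rate=0.5)
print(batch, preds["q1"])
```

The update step stands in for the abstract's statement that the teacher adapts to the student's evolving abilities by using its performance on seen samples.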

2602.13298 2026-05-11 cs.CV cs.AI

The Effective Depth Paradox: Evaluating the Relationship between Architectural Topology and Trainability in Deep CNNs

Manfred M. Fischer, Joshua Pitts

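AI summary This paper studies the relationship between CNN topology and image recognition performance by comparing convolutional architectures including VGG, ResNet, and GoogLeNet. It introduces the notions of nominal depth and effective depth and shows how identity shortcuts and branching modules affect optimization stability. The results indicate that effective depth reflects a network's trainability and scaling potential more accurately than nominal depth, and that architectural topology, rather than layer count alone, is the key factor in gradient health for deep learning models.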

English abstract

This paper investigates the relationship between convolutional neural network (CNN) topology and image recognition performance through a comparative study of the VGG, ResNet, and GoogLeNet architectural families. Utilizing a unified experimental framework, the study isolates the impact of depth from confounding implementation variables. A formal distinction is introduced between nominal depth ($D_{\mathrm{nom}}$), representing the physical layer count, and effective depth ($D_{\mathrm{eff}}$), an operational metric quantifying the expected number of sequential transformations. Empirical results demonstrate that architectures utilizing identity shortcuts or branching modules maintain optimization stability by decoupling $D_{\mathrm{eff}}$ from $D_{\mathrm{nom}}$. These findings suggest that effective depth serves as a superior framework for predicting scaling potential and practical trainability, ultimately indicating that architectural topology - rather than sheer layer volume - is the primary determinant of gradient health in deep learning models.

2602.11758 2026-05-11 cs.RO

HAIC: Humanoid Agile Object Interaction Control via Dynamics-Aware World Model

Dongting Li, Xingyu Chen, Qianyang Wu, Bo Chen, Sikai Wu, Hanyu Wu, Guoyao Zhang, Liang Li, Mingliang Zhou, Diyun Xiang, Jianzhu Ma, Qiang Zhang, Renjing Xu

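AI summary This paper presents HAIC, a control framework for agile object interaction by humanoid robots that addresses the difficulty of interacting with objects that have independent dynamics and non-holonomic constraints. HAIC predicts high-order object states (such as velocity and acceleration) from proprioceptive history alone and combines them with static geometric priors to build a dynamic occupancy map, enabling robust interaction without external state estimation. Experiments show that HAIC performs well on a range of agile tasks and multi-object long-horizon tasks, demonstrating proactive compensation for inertial perturbations and adaptability to the environment.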

Comments RSS 2026. Webpage: https://haic-humanoid.github.io/

English abstract

Humanoid robots show promise for complex whole-body tasks in unstructured environments. Although Human-Object Interaction (HOI) has advanced, most methods focus on fully actuated objects rigidly coupled to the robot, ignoring underactuated objects with independent dynamics and non-holonomic constraints. These introduce control challenges from coupling forces and occlusions. We present HAIC, a unified framework for robust interaction across diverse object dynamics without external state estimation. Our key contribution is a dynamics predictor that estimates high-order object states (velocity, acceleration) solely from proprioceptive history. These predictions are projected onto static geometric priors to form a spatially grounded dynamic occupancy map, enabling the policy to infer collision boundaries and contact affordances in blind spots. We use asymmetric fine-tuning, where a world model continuously adapts to the student policy's exploration, ensuring robust state estimation under distribution shifts. Experiments on a humanoid robot show HAIC achieves high success rates in agile tasks (skateboarding, cart pushing/pulling under various loads) by proactively compensating for inertial perturbations, and also masters multi-object long-horizon tasks like carrying a box across varied terrain by predicting the dynamics of multiple objects.

2602.10693 2026-05-11 cs.LG cs.AI

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Guobin Shen, Chenxiao Zhao, Xiang Cheng, Lei Huang, Xing Yu

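AI summary In reinforcement learning for large language models, asynchronous training and mismatches between training and inference engines make off-policy updates unavoidable. Naive importance sampling is unbiased but has high variance, and autoregressive generation makes the problem worse. This paper proposes VESPO (Variational Sequence-Level Soft Policy Optimization), which operates directly on sequence-level importance weights to reduce variance and provides an explicit variance bound. Experiments show that the method trains stably on math reasoning and code generation tasks and outperforms existing methods.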

English abstract

Off-policy updates are inevitable in reinforcement learning (RL) for large language models (LLMs) due to rollout staleness from asynchronous training and mismatches between training and inference engines. Naive importance sampling gives an unbiased correction but suffers from high variance, which is amplified by unbounded ratios and autoregressive generation. Prior remedies either rely on scenario-specific engineering, or trade bias for variance via token-level clipping or sequence-level normalization, yet these approaches remain largely heuristic. We propose Variational sEquence-level Soft Policy Optimization (VESPO). By explicitly incorporating variance reduction into a variational formulation, we derive a principled closed-form reshaping kernel that operates directly on sequence-level importance weights, avoids token-level approximation and length normalization, and admits an explicit variance bound for the deployed kernel. Experiments on math reasoning and code generation show that VESPO maintains stable training under severe off-policy conditions (staleness up to 64x) and delivers consistent gains across both dense and Mixture-of-Experts (MoE) models, outperforming recent reshaping baselines under matched setup. Code is available at https://github.com/FloyedShen/VESPO.
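
To make the variance problem in this abstract concrete, the sketch below computes the naive sequence-level importance weight as the product of per-token probability ratios and contrasts it with token-level clipping, one of the prior remedies the abstract mentions. It does not implement VESPO's reshaping kernel, which the abstract does not specify. The function names, the clipping range, and the toy log-probabilities are assumptions made for illustration.

```python
import math

def sequence_weight(logp_new, logp_old):
    """Naive sequence-level importance weight: product of per-token ratios.

    Computed in log space as exp(sum(logp_new - logp_old)).
    """
    return math.exp(sum(n - o for n, o in zip(logp_new, logp_old)))

def token_clipped_ratios(logp_new, logp_old, eps=0.2):
    """Per-token ratios clipped to [1 - eps, 1 + eps] (a bias-for-variance trade)."""
    return [min(max(math.exp(n - o), 1.0 - eps), 1.0 + eps)
            for n, o in zip(logp_new, logp_old)]

# Toy rollout: every token is 5% more likely under the new policy.
T = 100
logp_old = [math.log(0.5)] * T
logp_new = [math.log(0.525)] * T

w = sequence_weight(logp_new, logp_old)             # equals 1.05 ** T
clipped = token_clipped_ratios(logp_new, logp_old)  # each ratio stays near 1.05
print(w, max(clipped))
```

A per-token drift of only 5% compounds into a sequence weight above 100 after 100 tokens, which is the sense in which unbounded ratios and autoregressive generation amplify variance.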

2602.07425 2026-05-11 cs.LG cs.CL math.OC

Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang

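AI summary This paper studies why sign-based optimization algorithms (such as Lion and Muon) are effective under heavy-tailed noise and introduces a new heavy-tailed noise condition that describes gradient behavior in large language model training more accurately. The theoretical analysis shows that, under this noise model, sign-based methods achieve convergence rates that match or improve on the best previously known results, and it gives the first rigorous analysis of matrix optimization methods such as Muon in this setting. Experiments support the theory and indicate that sign-based optimizers have a clear advantage when handling heavy-tailed noise.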

Comments Code is available at https://github.com/Dingzhen230/Heavy-tailed-Noise-in-LLMs

English abstract

While adaptive gradient methods are the workhorse of modern machine learning, sign-based optimization algorithms such as Lion and Muon have recently demonstrated superior empirical performance over AdamW in training large language models (LLM). However, a theoretical understanding of why sign-based updates outperform variance-adapted methods remains elusive. In this paper, we aim to bridge the gap between theory and practice through the lens of heavy-tailed gradient noise, a phenomenon frequently observed in language modeling tasks. Theoretically, we introduce a novel generalized heavy-tailed noise condition that captures the behavior of LLMs more accurately than standard finite variance assumptions. Under this noise model, we establish sharp convergence rates of SignSGD and Lion for generalized smooth function classes, matching or surpassing previous best-known bounds. Furthermore, we extend our analysis to Muon and Muonlight, providing what is, to our knowledge, the first rigorous analysis of matrix optimization under heavy-tailed stochasticity. These results offer a strong theoretical justification for the empirical superiority of sign-based optimizers, showcasing that they are naturally suited to handle the noisy gradients associated with heavy tails. Empirically, LLM pretraining experiments validate our theoretical insights and confirm that our proposed noise models are well-aligned with practice.
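
As a reminder of what a sign-based update looks like, here is a minimal sketch of the SignSGD step and the Lion step in plain Python. It illustrates only the update rules, not the paper's analysis or experiments; the function names and toy values are hypothetical, and the Lion defaults shown (beta1 = 0.9, beta2 = 0.99) are the commonly cited ones.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def signsgd_step(params, grads, lr):
    """SignSGD: each coordinate moves by exactly lr, whatever the gradient magnitude."""
    return [p - lr * sign(g) for p, g in zip(params, grads)]

def lion_step(params, grads, momentum, lr, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """Lion: take the sign of an interpolation of momentum and gradient, then update momentum."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        c = beta1 * m + (1.0 - beta1) * g
        new_params.append(p - lr * (sign(c) + weight_decay * p))
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)
    return new_params, new_momentum

# A heavy-tailed outlier (1e6) moves its coordinate no further than a tiny gradient (1e-3).
stepped = signsgd_step([1.0, 1.0, 1.0], [1e6, 1e-3, 0.0], lr=0.1)
lion_params, lion_momentum = lion_step([1.0], [-2.0], [0.0], lr=0.1)
print(stepped, lion_params, lion_momentum)
```

Because the step size per coordinate is bounded by the learning rate, a single extreme gradient sample cannot produce an extreme parameter update, which gives some intuition for why such methods might cope well with heavy-tailed noise.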

2602.04939 2026-05-11 cs.CV

SynthForensics: Benchmarking and Evaluating People-Centric Synthetic Video Deepfakes

Roberto Leotta, Salvatore Alfio Sambataro, Claudio Vittorio Ragaglia, Mirko Casu, Yuri Petralia, Francesco Guarnera, Luca Guarnera, Sebastiano Battiato

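AI summary This paper presents SynthForensics, a people-centric benchmark of synthetic video deepfakes containing 20,445 videos from 8 text-to-video and 7 image-to-video generators, paired with real source videos and validated. The dataset is provided in four compression versions with full metadata. Experiments show that existing detection methods lose substantial performance on it, which exposes gaps in current evaluation suites. The study also shows that the features of synthetic videos differ from those of traditional forgeries, an important reference point for improving future detection methods.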

English abstract

Modern T2V/I2V generators synthesize people increasingly hard to distinguish from authentic footage, while current evaluation suites lag: legacy benchmarks target manipulation-based forgeries, and recent synthetic-video benchmarks prioritize scale over realistic human depiction. We introduce SynthForensics, a people-centric benchmark of $20{,}445$ videos from 8 T2V and 7 I2V open-source generators, paired-source from FF++/DFD reals, two-stage human-validated, in four compression versions with full metadata. In our paired-comparison human study, raters prefer SynthForensics in $71$--$77\%$ of head-to-head comparisons against each of nine existing synthetic-video benchmarks, while facial-quality metrics fall within the FF++/DFD baseline range. Across 15 detectors and three protocols, face-based methods drop $13$--$55$ AUC points (mean $27$) from FF++ to SynthForensics and a further $23$ under aggressive compression; fine-tuning closes the gap at a backward cost on legacy benchmarks; training from scratch shows synthetic and manipulation features largely disjoint for most detectors. We release dataset, pipeline, and code.

2602.03490 2026-05-11 cs.LG q-bio.NC

Path Integration and Object-Location Binding Emerge in an Action-Conditioned Predictive Sequence Network

Linda Ariel Ventura, Victoria Bosch, Tim C Kietzmann, Sushrut Thorat

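AI summary This study examines how path integration and object-location binding arise in an action-conditioned predictive sequence network. A recurrent neural network samples tokens sequentially from continuous 2D scenes and learns a model of the environment by predicting the next token. Experiments show that prediction accuracy improves over the course of a sequence, and decoding analyses reveal path integration and dynamic binding. The results show how structured representations that rely on flexible binding support prediction, offering a mechanistic account of sequential world modeling for cognitive science.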

Comments 8 pages, 4 figures; accepted at CogSci 2026

English abstract

Adaptive cognition requires structured internal models of objects and their relations. Predictive neural networks are often proposed to learn such world models, but how these are instantiated and how they support prediction remain unclear. We investigate this in a minimal in-silico setting. A recurrent neural network samples tokens sequentially from 2D continuous token scenes and is trained to predict the upcoming token from the current input and a saccade-like displacement. On novel scenes, prediction accuracy improves across the sequence, indicating in-context learning. Decoding analyses reveal path integration and dynamic binding of token identity to position. Interventional analyses show that new bindings can be learned late in sequence and that out-of-distribution bindings can be learned as well. Together, these findings show how structured representations relying on flexible binding emerge to support prediction, offering a mechanistic account of sequential world modeling relevant to cognitive science.

2602.03473 2026-05-11 cs.LG cs.CV

Scaling Continual Learning to 300+ Tasks with Bi-Level Routing Mixture-of-Experts

Meng Lou, Yunxiang Fu, Yizhou Yu

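AI summary This paper proposes CaRE, a scalable continual learning framework that targets the challenge of maintaining both stability and plasticity over sequences of hundreds of tasks. Its core is a Bi-Level Routing Mixture-of-Experts (BR-MoE) mechanism that dynamically activates task-relevant routers and expert modules to strengthen the extraction of discriminative and comprehensive features. The study also introduces OmniBenchmark-1K, a challenging dataset for evaluation on very long sequences with hundreds of tasks, and validates CaRE across a variety of task settings. CaRE performs especially well on very long task sequences and, to the authors' knowledge, is the first continual learner that scales to more than 300 non-overlapping tasks.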

Comments Accepted by ICML 2026

English abstract

Continual learning, especially class-incremental learning (CIL), on the basis of a pre-trained model (PTM) has garnered substantial research interest in recent years. However, how to effectively learn both discriminative and comprehensive feature representations while maintaining stability and plasticity over very long task sequences remains an open problem. We propose CaRE, a scalable Continual Learner with efficient Bi-Level Routing Mixture-of-Experts (BR-MoE). The core idea of BR-MoE is a bi-level routing mechanism: a router selection stage that dynamically activates relevant task-specific routers, followed by an expert routing phase that dynamically activates and aggregates experts, aiming to inject discriminative and comprehensive representations into every intermediate network layer. On the other hand, we introduce a challenging dataset, OmniBenchmark-1K, for CIL performance evaluation on very long task sequences with hundreds of tasks. Extensive experiments show that CaRE demonstrates leading performance across a variety of datasets and task settings, including commonly used CIL datasets with classical CIL settings (e.g., 5-20 tasks). To the best of our knowledge, CaRE is the first continual learner that scales to very long task sequences (ranging from 100 to over 300 non-overlapping tasks), while outperforming all baselines by a large margin on such task sequences. We hope that this work will inspire further research into continual learning over extremely long task sequences. Code and dataset are publicly released at https://github.com/LMMMEng/CaRE.

2602.02832 2026-05-11 cs.LG physics.flu-dyn

Koopman Autoencoders with Continuous-Time Latent Dynamics for Fluid Dynamics Forecasting

Rares Grozavescu, Pengyu Zhang, Etienne Meunier, Mark Girolami

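AI summary This paper proposes a Koopman autoencoder with continuous-time latent dynamics for long-horizon forecasting in fluid dynamics. Its core is the continuous-time evolution equation $dz/dt = \mathbf{K}_{\mathrm{cont}} z$, which allows closed-form inference, removes the dependence on a fixed timestep, and improves computational efficiency. To deal with latent instability in high-dimensional chaotic systems, the authors introduce structural constraints including rollout training, forward-backward consistency, latent regularization, and physics-conditioned LoRA, which improve the stability of long-horizon predictions. Experiments show that the method outperforms existing diffusion and operator-learning baselines on challenging fluid benchmarks while achieving a 110x inference speedup.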

English abstract

Forecasting physical systems over long horizons from irregularly sampled observations demands models that are stable, computationally efficient, and free of fixed-timestep assumptions. We address this with a continuous-time Koopman autoencoder whose latent dynamics obey $dz/dt = \mathbf{K}_{\mathrm{cont}} z$, yielding closed-form inference via $z(τ) = \exp(\mathbf{K}_{\mathrm{cont}} τ) z(0)$ at any horizon $τ$ in a single step. This decouples forecast cost from forecast length at inference time and supports data assimilation as gradient-based optimization with cost independent of the assimilation window. However, scaling continuous-time Koopman dynamics to high-dimensional chaotic systems causes severe latent instability, including spectral collapse and trajectory divergence over long horizons. In contrast, discrete Koopman methods train an operator $\mathbf{A}$ such that $z_{t+Δt} = \mathbf{A} z_t$; recovering the continuous generator could be theoretically done through matrix logarithm but requires conditions not guaranteed by training, and approximation errors grow with the $Δt$ imposed by the training data. These methods also require fixed, regular timesteps. We identify an empirically effective set of structural constraints -- rollout training, forward-backward consistency, latent regularization, and physics-conditioned LoRA -- sufficient for stable long-horizon latent dynamics. On challenging fluid benchmarks, our method outperforms strong diffusion and operator-learning baselines on long-horizon forecasting while achieving a 110$\times$ inference speedup.
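
The closed-form step $z(τ) = \exp(\mathbf{K}_{\mathrm{cont}} τ) z(0)$ from this abstract can be checked on a toy linear system. The sketch below is a pure-Python illustration under assumptions: the 2x2 generator (a rotation), the helper names `mat_mul`, `mat_exp`, and `propagate`, and the scaling-and-squaring Taylor approximation of the matrix exponential are choices made here for self-containment, not the paper's implementation, and the fixed number of squarings is only suitable for small, well-scaled matrices.

```python
import math

def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_exp(M, squarings=10, terms=12):
    """Approximate exp(M) by scaling and squaring with a truncated Taylor series."""
    n = len(M)
    scale = 2 ** squarings
    A = [[v / scale for v in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds A^k / k! after this line.
        term = [[v / k for v in row] for row in mat_mul(term, A)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def propagate(K, z0, tau):
    """Advance the latent state by an arbitrary horizon tau in one step."""
    E = mat_exp([[k * tau for k in row] for row in K])
    return [sum(E[i][j] * z0[j] for j in range(len(z0))) for i in range(len(z0))]

# Hypothetical generator: a rotation with unit angular frequency.
K_cont = [[0.0, -1.0], [1.0, 0.0]]
z0 = [1.0, 0.0]
one_step = propagate(K_cont, z0, 1.0)
two_steps = propagate(K_cont, propagate(K_cont, z0, 0.3), 0.7)
print(one_step, two_steps)
```

Propagating by 1.0 in one step agrees with propagating by 0.3 and then 0.7, so irregular sampling intervals need no special handling, and the cost of a forecast does not depend on how far ahead it reaches.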