2605.20678 2026-05-21 cs.LG cs.AI

Dynamic TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series Forecasting

Jiawen Zhu, Shuhan Liu, Di Weng, Yingcai Wu

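Affiliations: School of Software Technology, Zhejiang University, Ningbo, China; State Key Lab of CAD&CG, Zhejiang University, Hangzhou, China

AI summary: This paper proposes Dynamic TMoE, a framework that optimizes capacity by dynamically instantiating heterogeneous experts and pruning redundant ones, and uses a temporal memory router to ensure stable, context-aware expert selection, improving performance in non-stationary time series forecasting.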

Comments 27 pages, 7 figures. Accepted to ICML 2026

Abstract

Non-stationary time series forecasting is challenged by evolving distribution shifts that static models struggle to capture. While Mixture-of-Experts (MoE) architectures offer a promising paradigm for decoupling complex drift patterns, existing approaches are limited by fixed expert pools and memoryless routing, hampering their ability to adapt to abrupt regime shifts. To address this, we propose Dynamic TMoE, a framework that unifies architectural evolution with temporal continuity during the learning phase. By detecting distribution shifts via Maximum Mean Discrepancy (MMD), we dynamically instantiate heterogeneous experts and prune redundant ones to optimize capacity. Additionally, a temporal memory router leverages recurrent states and an anomaly repository to ensure stable, context-aware expert selection without requiring test-time updates. Experiments on nine benchmarks demonstrate state-of-the-art performance, reducing MSE by 10.4% and MAE by 7.8%. Code is available at https://github.com/andone-07/Dynamic-TMoE.
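The MMD-based shift detection mentioned in the abstract can be illustrated with a small, generic sketch. This is not the Dynamic TMoE implementation: the Gaussian kernel, the `bandwidth` value, the `threshold` value, and the function names are all assumptions chosen for illustration, and the estimator shown is the simple biased one.

```python
import math

def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)

def drift_detected(reference, recent, threshold=0.05, bandwidth=1.0):
    """Flag a shift when the MMD estimate exceeds a hypothetical threshold."""
    return mmd_squared(reference, recent, bandwidth) > threshold

# Toy windows: one centred near 0, one centred near 5.
reference_window = [(0.0,), (0.1,), (-0.1,)]
shifted_window = [(5.0,), (5.1,), (4.9,)]
same = drift_detected(reference_window, reference_window)
shifted = drift_detected(reference_window, shifted_window)
```

In a forecasting pipeline, a detector of this kind would compare a reference window with a recent window; what happens after a detection (for example, instantiating a new expert) is specific to the paper and is not modelled here.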

2605.20676 2026-05-21 cs.CV

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

Yuxi Liu, Zekun Zhang, Yixiang Cai, Renjia Deng, Yutong He, Kun Yuan

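Affiliations: Peking University; University of Electronic Science and Technology of China; Beijing University of Posts and Telecommunications

AI summary: This work proposes RoPeSLR, a 3D RoPE-based sparse low-rank attention framework that targets the high cost of long-sequence generation in Diffusion Transformers. By combining a high-frequency semantic spike set with an extremely low-rank background continuum, it achieves sub-quadratic sparsity and sub-linear rank growth, and performs well in ultra-long video inference.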

Abstract

Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose RoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3% average VBench degradation).
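The relative-position structure of RoPE that the abstract refers to can be shown in one dimension. The sketch below is not RoPeSLR and does not model 3D RoPE or linear attention; it assumes a single 2-D feature pair and a hypothetical rotation frequency `theta`, and only demonstrates that the dot product of rotary-embedded query and key vectors depends on their position offset rather than on their absolute positions.

```python
import math

def rotate(vec, position, theta=0.1):
    """Apply a 1-D rotary position embedding to a 2-D vector."""
    angle = position * theta
    x, y = vec
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def score(query, key, q_pos, k_pos, theta=0.1):
    """Dot product of the rotated query and rotated key."""
    q = rotate(query, q_pos, theta)
    k = rotate(key, k_pos, theta)
    return q[0] * k[0] + q[1] * k[1]

query, key = (1.0, 2.0), (0.5, -1.5)
# Both pairs have the same offset (k_pos - q_pos = 4) at different absolute positions.
near = score(query, key, q_pos=3, k_pos=7)
far = score(query, key, q_pos=103, k_pos=107)
```

Up to floating-point error, `near` and `far` agree, which is the property a position-aware attention approximation would need to retain.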

2605.20651 2026-05-21 cs.CV

Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation

Tuopusen Huang, Ding Ma, Xiangqian Wu

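Affiliations: Faculty of Computing

AI summary: This paper proposes LSENet, which introduces three new modules to address the vessel discontinuities and detail loss caused by low local contrast in OCTA vessel segmentation; experiments show state-of-the-art performance on several public datasets with fewer parameters.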

Abstract

Existing deep learning frameworks for Optical Coherence Tomography Angiography (OCTA) vessel segmentation are largely derived from the U-Net architecture, which serves as the foundation for most current designs. However, most of these methods focus only on holistic representation, struggling to address the problem of low local contrast unique to OCTA, which leads to vessel discontinuities and loss of detail. To address these problems, we propose LSENet, which builds upon the U-Net architecture by introducing three core innovative modules: To address vessel discontinuities, we introduce the Patch Information Enhance module (PIE), which replaces standard skip connections to execute patch-wise attention. To mitigate detail loss, the Multiscale Feature Fusion module (MFF) is proposed to feed the PIE module rich, multi-scale information by extracting visually interpretable features from both the original input and preceding layers. Finally, the Connectivity Refinement Decoder (CRD) is designed to refine features from all levels and utilize a large kernel in the final convolutional layer to reduce fragmentation. Experiments on three public datasets (OCTA-500, ROSE-1, and ROSSA) demonstrate that our proposed LSENet achieves state-of-the-art performance while requiring fewer parameters.

2605.20648 2026-05-21 cs.RO cs.AI

Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition

Benedict Quartey, Sebastian Castro, Eric Rosen, Wil Thomason, George Konidaris, Stefanie Tellex

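Affiliations: Brown University; Robotics & AI Institute

AI summary: This paper proposes skills that jointly learn predicates and actions, using closed-loop visuomotor policies so that robots can compose learned skills zero-shot, without retraining.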

Abstract

Learning from Demonstration (LfD) enables robots to learn complex behaviors from expert examples, yet existing approaches often fail to generalize to new compositions of known skills without retraining. Modern generative policies model distributions over action trajectories alone, thus are unable to reason about the symbolic outcomes required for robust composition. We propose that skills should jointly model action trajectories and the symbolic outcomes they induce. To address this gap, we introduce Predicate Action Skills (PACTS), a class of closed-loop visuomotor policies that model skills as a joint generative process over action and predicate belief trajectories, producing coherent action-outcome rollouts within a single model. Jointly generating actions and predicates enables PACTS to learn internal representations that improve both action generation and predicate classification. Furthermore, we demonstrate zero-shot composition of learned skills via planning by leveraging online predicate predictions from PACTS as a symbolic interface for sequencing and monitoring execution. Project website: https://planpacts.github.io/

2605.20645 2026-05-21 cs.CV

Seeing Through Fog: Towards Fog-Invariant Action Recognition

Enqi Liu, Liyuan Pan, Zhi Gao, Lingzhi Li, Qing Li

Affiliations: Beijing Institute of Technology, Beijing, China; Beijing Institute for General Artificial Intelligence, Beijing, China; Yangtze Delta Region Academy of Beijing Institute of Technology, Jiaxing, China

AI summary: This paper presents the FogAct benchmark dataset and the FogNet model to address action recognition in foggy conditions; a modified two-stream CLIP model extracts fog-invariant semantic information and improves recognition performance under fog.

Abstract

Foggy conditions are commonly encountered in real-world applications; however, existing action recognition approaches typically assume favorable weather and high-quality video inputs. On foggy days, unpredictable visibility degradation and reduced contrast obstruct the extraction of semantic cues, posing significant challenges for current action recognition methods. In this paper, we mitigate the issues faced in action recognition under foggy conditions by employing two strategies. First, we present FogAct, the first benchmark dataset for foggy action recognition, consisting of paired clean and foggy videos captured with a stereo camera system. The dataset spans 10 scenes and 55 action categories, comprising nearly 10,000 video clips. Second, we propose FogNet, a two-stream CLIP model that discovers fog-invariant semantic information hidden behind the degraded videos. FogNet learns robust representations of foggy videos with guidance from clean videos, effectively capturing shared structural and motion cues between clean and foggy videos. Extensive experiments on FogAct and three other popular datasets demonstrate that our method achieves competitive performance compared with state-of-the-art (SOTA) approaches. FogAct and FogNet are available on our project page.

2605.20644 2026-05-21 cs.LG cs.AI cs.RO

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

Aashish Dhawan, Christopher Driggers-Ellis, Dzmitry Kasinets, Daisy Zhe Wang, Christan Grant

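Affiliations: University of Florida

AI summary: This paper proposes a retrieval-augmented long-context translation approach for cultural image captioning. A two-stage pipeline first generates an intermediate Spanish caption, then uses retrieval-augmented many-shot prompting to produce the target-language caption, substantially improving caption generation for Bribri, Guaraní, and Orizaba Nahuatl and placing first in the shared task.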

2605.20624 2026-05-21 cs.CV cs.AI cs.LG

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

Taesung Kwon, Jonghyun Park, Hyungjin Chung, Jong Chul Ye

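Affiliations: KAIST; EverEx

AI summary: This paper proposes the Autoregressive Video Inverse problem Solver (AVIS), which uses autoregressive diffusion models to restore video in a streaming manner, sharply reducing initial latency and raising throughput while maintaining high restoration quality. An accelerated variant, AVIS Flash, achieves still higher throughput and a better efficiency-performance trade-off, paving the way toward real-time deployment.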

Comments Project page is available here: https://avis-project.github.io/

Abstract

Diffusion models provide powerful priors for zero-shot video inverse problems, but their real-time deployment is hindered by two inefficiencies: high initial latency caused by holistic video restoration, and low throughput resulting from multiple VAE passes to enforce measurement consistency in pixel space. To overcome these limitations, we propose Autoregressive Video Inverse problem Solver (AVIS). The AVIS framework leverages autoregressive video diffusion models to restore videos in a streaming manner, naturally eliminating latency bottlenecks. Specifically, AVIS initializes reverse diffusion with a measurement-consistent estimate, reducing the required sampling steps. Compared to leading non-autoregressive solvers, AVIS drastically reduces initial latency from 114s to 4s and increases throughput from 0.71 to 1.18 FPS while achieving superior restoration quality. We further introduce a highly accelerated variant, dubbed AVIS Flash, that enforces measurement consistency solely on the first chunk. AVIS Flash substantially boosts throughput to 5.91 FPS on a single RTX 4090 GPU while maintaining competitive performance and achieving a favorable efficiency-performance trade-off, paving the way toward real-time deployment.

2605.20620 2026-05-21 cs.LG cs.DB cs.GT

Dynamic Shapley Computation

Xuan Yang, Hsi-Wen Chen, Ming-Syan Chen, Jian Pei

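Affiliations: Duke University; National Taiwan University

AI summary: This paper proposes D-Shap, a framework that represents Shapley values as a player-by-task matrix to enable efficient updates of training-data contribution estimates in dynamic settings, exploiting the locality of tasks and coalitions for fast updates and self-valuation.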

Abstract

Shapley-based data valuation provides a principled way to quantify the contribution of training data, but its high computational cost makes it impractical in dynamic settings where tasks and training players evolve. Existing methods treat Shapley computation as a one-shot process and collapse contributions into aggregated scores, preventing reuse and requiring recomputation under any change. We introduce a new perspective that represents Shapley values as a player-by-task matrix and formulates dynamic valuation as a structured matrix maintenance problem. We exploit the fact that each task depends on a small subset of training players and that similar tasks yield similar valuations, leading to utility locality and coalition locality. Based on these insights, we propose D-Shap, a dynamic valuation framework that enables efficient updates by modifying only a small portion of the matrix: new task valuations are inferred via structure-aware interpolation, while updates induced by new players are confined to affected local matrix blocks. To eliminate the need for pre-specified evaluation tasks, we introduce self-valuation, which constructs the initial matrix directly from training data, supported by scalable subset reuse and coverage-aware anchor selection. Experiments across diverse models show that D-Shap performs task updates in milliseconds and reduces the cost of player updates by up to three orders of magnitude, while achieving valuation quality competitive with full recomputation.
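The player-by-task matrix view described in the abstract can be made concrete with a toy example. The sketch below is not D-Shap: it computes each column exactly by enumerating all player orderings, which is only feasible for a handful of players, and the two utility functions and all names are hypothetical. It shows only the data layout, in which each task owns one column, so adding a task adds a column without touching the others.

```python
from itertools import permutations

def shapley_column(players, utility):
    """Exact Shapley values for one task: average marginal gain over all orderings."""
    values = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            values[p] += utility(with_p) - utility(coalition)
            coalition = with_p
    return {p: v / len(orderings) for p, v in values.items()}

players = ["a", "b", "c"]

# Hypothetical utilities: task A counts useful players, task B needs b and c together.
tasks = {
    "A": lambda coalition: len(coalition & {"a", "b"}),
    "B": lambda coalition: 1.0 if {"b", "c"} <= coalition else 0.0,
}

# Player-by-task matrix stored as one column (dict of player values) per task.
matrix = {name: shapley_column(players, u) for name, u in tasks.items()}

# A new task appends one column; existing columns are left unchanged.
matrix["C"] = shapley_column(players, lambda coalition: 1.0 if "c" in coalition else 0.0)
```

Each column sums to the utility of the full player set for that task (the efficiency property), and in this toy setup each task depends on only a subset of players, which is the kind of locality the abstract says D-Shap exploits.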

2605.20619 2026-05-21 cs.LG math.OC stat.ML

HRM-Text: Efficient Pretraining Beyond Scaling

Guan Wang, Changling Liu, Chenyu Wang, Cai Zhou, Yuhao Sun, Yifei Wu, Shuai Zhen, Luca Scimeca, Yasin Abbasi Yadkori

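Affiliations: Sapient Intelligence; MIT

AI summary: This paper proposes HRM-Text, which combines a Hierarchical Recurrent Model with new training methods to reach performance comparable to larger models at far lower compute cost, demonstrating that efficient pretraining is feasible.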

Abstract

The current pretraining paradigm for large language models relies on massive compute and internet-scale raw text, creating a significant barrier to foundational research. In contrast, biological systems demonstrate highly sample-efficient learning through multi-timescale processing, such as the functional organization of the frontoparietal loop. Taking this as inspiration, we introduce HRM-Text, which replaces standard Transformers with a Hierarchical Recurrent Model (HRM) that decouples computation into slow-evolving strategic and fast-evolving execution layers. To stabilize this deep recurrence for language modeling, we introduce MagicNorm and warmup deep credit assignment. Furthermore, instead of standard raw-text pretraining, we train exclusively on instruction-response pairs using a task-completion objective and PrefixLM masking. Serving as an empirical existence proof of efficient pretraining, a 1B-parameter HRM-Text model trained from scratch on only 40 billion unique tokens and $1,500 budget achieves 60.7% on MMLU, 81.9% on ARC-C, 82.2% on DROP, 84.5% on GSM8K, and 56.2% on MATH. Despite utilizing roughly 100-900x fewer training tokens and 96-432x less estimated compute than standard baselines, HRM-Text performs competitively with 2-7B parameter open models. These results demonstrate that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making pretraining from scratch accessible to the broader research community.
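The PrefixLM masking named in the abstract can be sketched as a generic attention mask. This is an illustration of the general technique, not the HRM-Text code: it assumes the instruction occupies the first `prefix_len` positions and the response follows, and the function name is hypothetical.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """Return mask[i][j] = True when position i may attend to position j.

    Prefix (instruction) positions are visible to every position, so the
    prefix attends to itself bidirectionally; response positions are causal.
    """
    return [
        [(j < prefix_len) or (j <= i) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# Two instruction tokens followed by two response tokens.
mask = prefix_lm_mask(seq_len=4, prefix_len=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a task-completion objective, the loss would typically be computed only on the response positions; that detail is an assumption about common practice and is not taken from the abstract.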

2605.20610 2026-05-21 cs.CV cs.AI

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

Gene Tangtartharakul, Katherine R. Storrs

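Affiliations: School of Psychology, University of Auckland

AI summary: This paper studies expert tuning and representation in vision Mixture-of-Experts models. It trains sparsely-gated convolutional MoE models with a contrastive objective and analyzes expert specialization using tools from visual neuroscience, finding that an animate-inanimate distinction dominates expert partitioning and that experts are tuned to broader, continuous visual and semantic dimensions.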

Comments 21 Pages, 6 Main Figures, 1 Table

Abstract

Mixture-of-Experts (MoE) models are often interpreted by analysing which categories are routed to which experts. However, routing alone does not reveal what each expert actually encodes. We train sparsely-gated convolutional MoE models with a contrastive objective on natural images and characterise expert specialisation using tools from visual neuroscience. Extending from gating-level to expert-level analyses, we measure per-expert category separability, and per-expert tuning using the most exciting inputs. Extending from category-level to feature-level explanations, we interpret tuning via semantic dimensions derived from a dataset of human behavioural judgements (THINGS). Finally, we use tuning and representational similarity analysis to assess the stability of expertise-allocation across independent initialisations. We find that an animate-inanimate distinction dominates expert partitioning, apparent from gating through to expert readout, and is stable across independently trained models. Although routing statistics suggest relatively sparse, categorical preferences, expert analyses reveal broader tuning to continuous visual and semantic dimensions that extend beyond category boundaries. Experts exhibit similar category-separability to one another, despite distinct feature tuning, demonstrating the explanatory benefits of moving beyond category-level analyses. Together, these results show that expert specialisation in vision MoEs extends well beyond category routing and is better understood by probing fine-grained expert-level tuning and representational structure.

2605.20609 2026-05-21 cs.LG

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

Junseok Kim, Dohyeong Kim, Mineui Hong, Songhwai Oh

Affiliations: Department of Electrical and Computer Engineering and ASRI, Seoul National University; Independent researcher; Robotics Institute, Carnegie Mellon University

AI summary: This paper proposes compositional transduction with latent analogies to address goal generalization under novel contexts in offline goal-conditioned reinforcement learning; a new analogy representation improves goal reaching across different contexts.

Comments ICML 2026

Abstract

Compositional generalization is essential for reaching unseen goals under novel contextual variations in offline goal-conditioned reinforcement learning (GCRL), where a generalist goal-reaching agent must be learned from limited data. Most prior approaches pursue this via trajectory stitching over temporally contiguous segments, which limits composing behaviors across varying contexts. To overcome this limitation, we formalize analogy transduction as synthesizing new plans by composing task-endogenous analogies with given contexts and propose a novel analogy representation tailored for it. Grounded in our theory, this analogy representation captures what changes under optimal task execution, remains invariant to contextual variations, and is sufficient for optimal goal reaching. We further contend that generalization to unseen analogy-context pairs is a practical obstacle in analogy transduction, and introduce a new approach for offline GCRL that enables analogy transduction beyond seen pairs to unseen combinations. We empirically demonstrate the effectiveness of our approach on OGBench manipulation environments, substantially outperforming prior methods that do not perform analogy transduction. Project page: https://rllab-snu.github.io/projects/CTA/

2605.20608 2026-05-21 cs.AI cs.NI

From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)

Binghan Wu, Shoufeng Wang, Yunxin Liu, Ya-Qin Zhang, Joseph Sifakis, Ye Ouyang

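Affiliations: AsiaInfo Technologies Limited; Institute for AI Industry Research (AIR); Tsinghua University; Verimag

AI summary: This letter proposes a hierarchical multi-agent reference architecture aimed at Level 4/5 Autonomous Networks; by introducing agent self-awareness it unifies strategic planning with operational resilience, and it is validated in a 5G Core environment.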

Comments This manuscript has been accepted by IEEE Networking Letters

Journal ref B. Wu, S. Wang, Y. Liu, Y.-Q. Zhang, J. Sifakis and Y. Ouyang, "From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)," in IEEE Networking Letters, 2026

DOI: 10.1109/LNET.2026.3693226

Abstract

Realizing Level 4/5 Autonomous Networks (AN) demands a shift from static automation to agent-native intelligence. Current operations, reliant on rigid scripts, lack the cognitive agency to handle off-nominal conditions. To address this, this letter proposes a hierarchical multi-agent reference architecture enabling high-level autonomy. The framework features a Dual-Driven Orchestrator that coordinates specialized Executive Agents, supported by a shared Public Memory for unified domain knowledge. A key innovation is the integration of agent self-awareness, which empowers the system to harmonize deliberative strategic governance with reflexive fault recovery. We instantiate and validate this architecture within a 5G Core environment. Case studies demonstrate that the system sustains critical throughput under congestion and reduces Mean Time to Repair (MTTR) by 86%, confirming its efficacy in unifying strategic planning with operational resilience.

2605.20607 2026-05-21 cs.LG cs.CV cs.RO

Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System

Romeo Valentin, Olivia Beyer Bruvik, Marc R. Schlichting, Mykel J. Kochenderfer

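Affiliations: Stanford Intelligent Systems Laboratory, Stanford University, Stanford, CA, USA

AI summary: This paper proposes a learning-assurance approach for a vision-based landing system: it builds an interpretable model by separating content from style to provide assurance evidence, and introduces a new runtime assurance method that monitors the model's situation representation.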

Comments 10 pages, 4 figures

Abstract

EASA's learning-assurance guidance requires data-driven aviation systems to build and monitor their own situation representation, yet for neural networks the technical means to provide such evidence remain an open problem. We address this gap for a vision-based aircraft landing system: we propose that a minimally assurable model must at least be shown to separate content from style in its own situation representation. Showing that the model's predictions then rely largely on the contentful representation components leads to a concrete assurance path. To demonstrate this assurance path on a concrete model we train a vision transformer model for runway keypoint regression on the LARDv2 dataset. The model, which acts as the subject for our assurance demonstration, produces per-patch embeddings that we decompose into interpretable atoms via K-SVD sparse dictionary learning. A qualitative visualization confirms that contentful atoms track task-relevant runway structure and stylistic atoms track domain-specific appearance, and the regression head is shown to place almost all of its linear weight on contentful atoms. We further build on the content/style separation and define out-of-model-scope (OOMS) detection, a novel runtime assurance approach directly monitoring the model's situation representation. OOMS monitoring is complementary to operational design domain and output-space out-of-distribution monitoring and addresses concrete requirements of the recent EASA guidance. By directly analyzing a model's situation representation both at test time and runtime, this work delivers the first concrete piece of the representation-level evidence that EASA learning-assurance guidance demands, and points to mechanistic interpretability as a practical building block of future aviation safety cases.

DIVE: Embedding Compression via Self-Limiting Gradient Updates

Beyond Semantic Similarity: A Two-Phase Non-Parametric Retrieval Workflow for Corporate Credit Underwriting

IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools

DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions

Dynamic TMoE: A Drift-Aware Dynamic Mixture of Experts Framework for Non-Stationary Time Series Forecasting

VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence

Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach

GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

LER-YOLO: Reliability-Aware Expert Routing for Misaligned RGB-Infrared UAV Detection

A Semantic and Occlusion-Aware GM-PHD Filter

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

Gaze into the Details: Locality-Sensitive Enhancement for OCTA Retinal Vessel Segmentation

Jointly Learning Predicates and Actions Enables Zero-Shot Skill Composition

Seeing Through Fog: Towards Fog-Invariant Action Recognition

Design for Manufacturing: A Manufacturability Knowledge-Integrated Reinforcement Learning Framework for Free-Form Pipe Routing in Aeroengines

AVSD: Adaptive-View Self-Distillation by Balancing Consensus and Teacher-Specific Privileged Signals

Same Target, Different Basins: Hard vs. Soft Labels for Annotator Distributions

Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

Accelerating Video Inverse Problem Solvers with Autoregressive Diffusion Models

Dynamic Shapley Computation

SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front

COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

Auto-Dreamer: Learning Offline Memory Consolidation for Language Agents

HRM-Text: Efficient Pretraining Beyond Scaling

Beyond Routing: Characterising Expert Tuning and Representation in Vision Mixture-of-Experts

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)

Mechanistic Interpretability for Learning Assurance of a Vision-Based Landing System