arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.02881 2026-05-11 cs.RO

MolmoAct2: Action Reasoning Models for Real-world Deployment

Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei-Chuan Tsai, Shirui Chen, Yi Ru Wang, Shanli Xing, Jaemin Cho, Jae Sung Park, Ainaz Eftekhar, Peter Sushko, Karen Farley, Angad Wadhwa, Cole Harrison, Winson Han, Ying-Chun Lee, Eli VanderBilt, Rose Hendrix, Suveen Ellawela, Lucas Ngoo, Joyce Chai, Zhongzheng Ren, Ali Farhadi, Dieter Fox, Ranjay Krishna

Affiliations: Allen Institute for AI; University of Washington; National University of Singapore; University of Pennsylvania; Johns Hopkins University; Amazon; University of Michigan; University of North Carolina at Chapel Hill

AI summary: This paper presents MolmoAct2, a fully open action reasoning model that improves on its predecessor along five axes through a redesigned architecture and new datasets, and reports results that surpass existing models on multiple benchmarks.

Comments 31 pages, project page: https://allenai.org/blog/molmoact2

Abstract

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models are closed, open-weight alternatives are tied to expensive hardware, reasoning-augmented policies pay prohibitive latency for their grounding, and fine-tuned success rates remain below the threshold for dependable use. We present MolmoAct2, a fully open action reasoning model built for practical deployment, advancing its predecessor along five axes. We introduce MolmoER, a VLM backbone specialized for spatial and embodied reasoning, trained on a 3.3M-sample corpus with a specialize-then-rehearse recipe. We release three new datasets spanning low-to-medium cost platforms, including MolmoAct2-BimanualYAM, 720 hours of teleoperated bimanual trajectories that constitute the largest open bimanual dataset to date, together with quality-filtered Franka (DROID) and SO100/101 subsets. We provide OpenFAST, an open-weight, open-data action tokenizer trained on millions of trajectories across five embodiments. We redesign the architecture to graft a flow-matching continuous-action expert onto a discrete-token VLM via per-layer KV-cache conditioning. Finally, we propose MolmoThink, an adaptive-depth reasoning variant that re-predicts depth tokens only for scene regions that change between timesteps, retaining geometric grounding at a fraction of prior latency. In the most extensive empirical study of any open VLA to date, spanning 7 simulation and real-world benchmarks, MolmoAct2 outperforms strong baselines including Pi-05, while MolmoER surpasses GPT-5 and Gemini Robotics ER-1.5 across 13 embodied-reasoning benchmarks. We release model weights, training code, and complete training data. Project page: https://allenai.org/blog/molmoact2

2605.02206 2026-05-11 cs.CV cs.LG

Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score

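Abdullah Ahmad Khan, Hamid Laga, Ferdous Sohel

Affiliations: Murdoch University

AI summary: This paper systematically analyzes metric reliability in multimodal unlearning and proposes the Unified Quality Score (UQS) to reconcile conflicts between metrics; experiments validate that UQS yields stable rankings.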

Comments 9 pages, 6 figures, NeurIPS 2026

Abstract

Machine unlearning in Vision-Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, {FA, RA, MIA} and {AD, JS}, with tau_FA_AD = -0.26, reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric's Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = -0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (tau = 0.647 ± 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at https://github.com/neurips26/UnifiedUnl.
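The rank-agreement statistics named in this abstract can be illustrated with a minimal sketch. It computes Kendall tau and Spearman rho between two metrics' scores for the same set of methods. The scores below are made-up toy values, the functions assume no tied scores, and nothing here reproduces the paper's UQS weighting, whose exact formula the abstract does not give.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (assumes no ties)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) // 2)


def ranks(values):
    """1-based ranks, smallest value first (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman_rho(x, y):
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical scores for four unlearning methods under two metrics.
forget_accuracy = [0.1, 0.4, 0.3, 0.8]
activation_distance = [0.9, 0.2, 0.5, 0.3]
print(kendall_tau(forget_accuracy, activation_distance))
print(spearman_rho(forget_accuracy, activation_distance))
```

A negative value for both statistics means the two metrics rank the methods in largely opposite orders, which is the kind of conflict the abstract reports between the {FA, RA, MIA} and {AD, JS} clusters.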

2605.02201 2026-05-11 cs.CV

Super-Resolution of Airborne Laser Scanning Point Clouds for Forest Inventory

Jinyuan Shao, Sangyoong Park, Chunxi Zhao, Ayman Habib, Songlin Fei

Affiliations: Department of Forestry and Natural Resources, Purdue University; Lyles School of Civil and Construction Engineering, Purdue University

AI summary: This paper proposes the 3D Forest Super Resolution model, which improves forest inventory accuracy by increasing point cloud density and reducing noise; experiments show strong performance on stem localization and diameter estimation.

Abstract

Airborne Laser Scanning (ALS) can collect point clouds across large areas, enabling large-scale forest inventory. However, ALS point clouds are sparse and noisy, resulting in inaccurate individual-tree-level forest inventory, such as stem localization and tree size estimation. To overcome this problem, we propose a deep learning model, 3D Forest Super Resolution (3DFSR), to simultaneously improve point density and reduce noise in ALS forest point clouds. 3DFSR is a voxel-based CNN with a U-Net architecture. The proposed 3DFSR is evaluated on ALS point clouds collected in both temperate forests in the U.S. and boreal forests in Germany. Experimental results demonstrate that 3DFSR can generate finer point clouds of tree structure than other state-of-the-art point cloud super-resolution algorithms, achieving a Chamfer Distance of 0.249 m and a Hausdorff Distance of 2.711 m. Furthermore, to verify the effectiveness of 3DFSR point clouds in forest inventory, we conduct stem detection, diameter at breast height (DBH) measurement, and stem reconstruction on both the original ALS point clouds and the 3DFSR-enhanced point clouds. We find that stem detection and reconstruction algorithms developed for TLS/MLS point clouds can work directly on our 3DFSR point clouds, and that DBH can be derived with a circle-fitting method. The F1 score of stem detection improves from 0.71 on the original ALS point clouds to 0.97 on 3DFSR point clouds; the RMSE of DBH estimation improves from 13.45 cm using allometric equations to 6.43 cm using circle fitting; and, compared with stems reconstructed from MLS point clouds, stems reconstructed from 3DFSR point clouds have a Chamfer Distance of 0.170 m, a Hausdorff Distance of 0.377 m, and an R² of 0.95 for volume estimation. Finally, we find that the proposed 3DFSR is applicable to point densities from 10 to 1700 points/m²; it also generalizes across data collected from different LiDAR platforms without transfer learning.
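For readers unfamiliar with the two point-cloud metrics quoted in this abstract, here is a brute-force sketch. Conventions vary between papers, so this is one common definition and not necessarily the one the authors use: Chamfer distance is taken as the average of the two directed mean nearest-neighbor Euclidean distances, and Hausdorff distance as the larger of the two directed maximum nearest-neighbor distances. The toy coordinates are invented.

```python
import math


def _nearest(point, cloud):
    """Euclidean distance from one point to its nearest neighbor in a cloud."""
    return min(math.dist(point, other) for other in cloud)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: average of the two directed mean distances.

    Assumed convention: unsquared distances, each direction weighted by 0.5.
    """
    a_to_b = sum(_nearest(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest(q, a) for q in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance: worst-case nearest-neighbor distance."""
    a_to_b = max(_nearest(p, b) for p in a)
    b_to_a = max(_nearest(q, a) for q in b)
    return max(a_to_b, b_to_a)


# Toy clouds in meters: the second has one extra point 2 m from the first cloud.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(chamfer_distance(reference, predicted))
print(hausdorff_distance(reference, predicted))
```

The single stray point dominates the Hausdorff distance but is averaged out in the Chamfer distance, which is why the two numbers in the abstract differ by an order of magnitude. The quadratic-time search is only suitable for small clouds; real forest scans would need a spatial index.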

2605.01999 2026-05-11 cs.AI

TumorXAI: Self-Supervised Deep Learning Framework for Explainable Brain MRI Tumor Classification

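Abrar Hossain Zahin, Amit Kumar Saha, Tanvir Mridha, Saifur Rahman, Jannatul Ferdous Prome, Raima Husna, Israt Jahan, Ahmed Wasif Reza

Affiliations: Department of Computer Science and Engineering, East West University

AI summary: This paper proposes the TumorXAI framework, which uses self-supervised learning for multi-class classification of brain MRI tumors. Through preprocessing, fine-tuning, and linear evaluation, it shows that self-supervised models outperform supervised methods when labeled data are limited, and it uses techniques such as Grad-CAM to improve interpretability.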

Comments 16 pages, 9 figures, 6 tables

Abstract

Classifying brain tumors using magnetic resonance imaging (MRI) is crucial for early diagnosis and treatment; however, tumor heterogeneity and a dearth of annotated datasets restrict the use of supervised deep learning approaches. In this work, we use self-supervised learning (SSL) to study multi-class brain tumor classification. Using a ResNet-50 backbone, we evaluate four SSL frameworks (SimCLR, BYOL, DINO, and MoCo v3) on a publicly available dataset of 4,448 MRIs with 17 distinct tumor types. On this dataset, SimCLR achieved 99.64% accuracy, 99.64% precision, 99.64% recall, and 99.64% F1-score. The workflow includes preprocessing, SSL pretraining with data augmentations, linear evaluation, and fine-tuning. Results show that, when labels are limited, SSL-pretrained models outperform supervised baselines in terms of F1-score, recall, accuracy, and precision. Additionally, by providing visual insights into model decisions, Explainable AI techniques (Grad-CAM, Grad-CAM++, EigenCAM) enhance interpretability. These results demonstrate SSL's scalability and dependability in diagnosing brain tumors from unlabeled medical data.
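As background for the SimCLR result, the contrastive objective that SimCLR is known for (NT-Xent) can be sketched in a few lines of plain Python. This is an illustration of the general loss only, not the authors' training code: the pairing layout (views 2k and 2k+1 come from the same image), the temperature of 0.5, and the 2-D toy embeddings are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(embeddings, temperature=0.5):
    """NT-Xent loss over 2N embeddings.

    Assumed layout: embeddings[2k] and embeddings[2k + 1] are two augmented
    views of the same image, so the positive partner of index i is i ^ 1.
    """
    n = len(embeddings)
    if n < 2 or n % 2:
        raise ValueError("need an even number of embeddings (two per image)")
    total = 0.0
    for i in range(n):
        positive = cosine(embeddings[i], embeddings[i ^ 1]) / temperature
        # Denominator runs over every other view, positive included.
        logits = [
            cosine(embeddings[i], embeddings[k]) / temperature
            for k in range(n)
            if k != i
        ]
        total += math.log(sum(math.exp(x) for x in logits)) - positive
    return total / n


# Toy 2-D embeddings: in `aligned`, the two views of each image agree.
aligned = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
misaligned = [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
print(nt_xent(aligned), nt_xent(misaligned))
```

The loss is lower when views of the same image map to similar embeddings and views of different images map to dissimilar ones, which is the property pretraining exploits before any labels are used.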

2605.01862 2026-05-11 cs.LG

QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL

Xing Lei, Jincheng Wang, Xuetao Zhang, Donglin Wang

Affiliations: National Key Laboratory of Human-Machine Hybrid Augmented Intelligence and Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; Computer Science, University College London; School of Engineering, Westlake University

AI summary: This paper proposes QHyer, which introduces a state-conditioned Q-estimator and a hybrid Attention-Mamba architecture to address long-term dependency modeling and behavior stitching under sparse rewards in offline goal-conditioned RL, and validates its superior performance on both non-Markovian and Markovian datasets.

Comments ICML 2026

Abstract

Offline goal-conditioned RL (GCRL) learns goal-reaching policies from static datasets, but real-world datasets are often partially observable and history-dependent, exhibiting a mix of Markovian and non-Markovian dynamics that violates standard RL assumptions. History-aware sequence models such as Decision Transformer (DT) are a natural fit for long-term dependency modeling, yet pure attention is inefficient and brittle when handling local Markovian structure and long-range context simultaneously. Although recent hybrid architectures (e.g., LSDT) introduce local extractors to improve local-dependency modeling, fixed-window extraction cannot adapt its effective memory to varying dependency lengths in temporally heterogeneous settings, often truncating long-range context rather than compressing its content adaptively. Moreover, sequential offline GCRL faces a key bottleneck: under sparse rewards, return-to-go (RTG) becomes non-discriminative across sub-trajectories, providing little guidance signal for stitching goal-reaching behaviors from diverse demonstrations. To address these issues, we propose QHyer, which replaces RTG with a flow-parameterized, state-conditioned goal-reaching Q-estimator to support stitching across demonstrations, and introduces a gated Hybrid Attention-Mamba backbone that performs content-adaptive history compression while preserving local dynamics. Extensive experiments demonstrate that QHyer achieves state-of-the-art performance on both non-Markovian and Markovian datasets, validating its effectiveness for diverse scenarios.

2605.01717 2026-05-11 cs.CL cs.AI

TCDA: Thread-Constrained Discourse-Aware Modeling for Conversational Sentiment Quadruple Analysis

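Xinran Li, Xinze Che, Yifan Lyu, Zhiqi Huang, Xiujuan Xu

Affiliations: School of Software, Dalian University of Technology; School of Data and Computer Science, Sun Yat-sen University

AI summary: This paper proposes the TCDA framework, which combines a thread-constrained directed acyclic graph with discourse-aware rotary position embedding to capture complex relations in conversational sentiment quadruple analysis; experiments show it outperforms existing methods.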

Comments Accepted to IJCAI 2026 (Main Track)

Abstract

Conversational Aspect-based Sentiment Quadruple Analysis (DiaASQ) needs to capture the complex interrelationships in multiple rounds of dialogues. Existing methods usually employ simple Graph Convolutional Networks (GCN), which introduce structural noise and fail to consider the temporal sequence of the dialogues, or use standard RoPE, which implicitly captures relative distances in a flat sequence but cannot clearly separate the token-level syntactic order from the utterance-level progression, and may suffer from the Distance Dilution problem. To address these issues, we propose a new framework that combines Thread-Constrained Directed Acyclic Graph (TC-DAG) and Discourse-Aware Rotary Position Embedding (D-RoPE). Specifically, TC-DAG filters out cross-thread noise based on thread constraints, maintains global connectivity through root anchoring, and incorporates the temporal sequence of the dialogues. D-RoPE aligns multi-layer semantics using dual-stream projection and multi-scale frequency signals, captures thread dependencies using tree-like distances, and alleviates the token-level Distance Dilution problem by incorporating utterance-level progressions. Experimental results on two benchmark datasets demonstrate that our framework achieves state-of-the-art performance.

2605.01459 2026-05-11 cs.CV cs.AI

SRGAN-CKAN: Expressive Super-Resolution with Nonlinear Functional Operators under Minimal Resources

Roberto Isai Navaro-Aviña, Eduardo Said Merin-Martinez, Andres Mendez-Vazquez, Eduardo Rodriguez-Tello

Affiliations: Cinvestav, Unidad Guadalajara; Cinvestav, Unidad Tamaulipas

AI summary: This paper proposes the SRGAN-CKAN framework, which integrates convolutional Kolmogorov-Arnold networks and uses nonlinear patch-based transformations to increase the expressivity of local operators, achieving high-quality super-resolution under minimal resources.

Abstract

Single-Image Super-Resolution (SISR) aims to reconstruct a High-Resolution (HR) image from a Low-Resolution (LR) observation, a fundamentally ill-posed problem where high-frequency details are severely degraded at large upscaling factors. Recent advances have been driven by transformer-based architectures and diffusion models, which improve global context modeling and perceptual quality at the cost of increased computational complexity. In contrast, this work focuses on enhancing the expressivity of local operators under minimal resources. We propose SRGAN-CKAN, a hybrid super-resolution framework that integrates Convolutional Kolmogorov-Arnold Networks (CKAN) into an adversarial learning setting, reformulating convolution as a nonlinear patch-based transformation. The proposed operator replaces linear local mappings with spline-based functional representations, allowing expressive modeling of complex local structures and high-frequency textures using minimal hardware resources. Experimental results demonstrate that the proposed approach improves perceptual quality while preserving reconstruction fidelity, achieving a favorable balance between distortion-based and perceptual metrics. These results are obtained under constrained computational settings, highlighting the efficiency of the proposed formulation. Overall, this work introduces a complementary direction to existing approaches by improving the representational power of local transformations, providing an efficient and scalable alternative to globally intensive architectures.

2605.01240 2026-05-11 cs.LG cs.AI

Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI

Ruthwik Reddy Doodipala, Pankaj Pandey, Pratheek Eranki, Carolina Torres-Rojas, Manob Jyoti Saikia, Ranganatha Sitaram

Affiliations: St. Jude Children’s Research Hospital; The University of Memphis

AI summary: Rhamba combines anatomically guided masking with a hybrid Attention-Mamba architecture. Pretrained on the ABIDE dataset with different masking strategies, it improves resting-state fMRI analysis and outperforms existing methods on the COBRE and ADHD-200 datasets.

Abstract

Self-supervised pretraining is promising for large-scale neuroimaging, yet the impact of region-aware masking and hybrid sequence modeling remains underexplored. In this work, we introduce Rhamba, a region-aware pretraining framework that integrates anatomically guided masking with hybrid Attention-Mamba architectures for resting state functional magnetic resonance imaging (fMRI) analysis. Models were pretrained on the ABIDE dataset using region-aligned patch embeddings and three masking strategies (Any, Majority, and Pure) with increasing spatial specificity. We evaluated four architectural variants: a Mamba only model, an Alternate architecture with interleaved Mamba and Attention blocks, and two hybrid encoder-decoder configurations (Attention-Mamba (AM) and Mamba-Attention (MA)). The pretrained models were fine-tuned on downstream classification tasks using the COBRE and ADHD-200 datasets for schizophrenia and attention-deficit/hyperactivity disorder discrimination. We employed Integrated Gradients, an explainable AI method, to identify the brain regions contributing to model predictions. Masking strategy strongly influenced reconstruction behavior, with reconstruction loss following a consistent ordering (Any > Majority > Pure). However, this trend did not directly translate into downstream performance, where differences were modest and dataset-dependent. The hybrid architecture with the MA configuration achieved the highest average AUROC across both datasets, and Rhamba outperformed state-of-the-art methods in comparative evaluation. Region-wise analysis showed that peak performance depends on the interaction between masking strategy and architecture rather than a single dominant configuration. Overall, Rhamba offers a flexible framework for balancing interpretability, scalability, and performance in large-scale fMRI representation learning.

2605.01006 2026-05-11 cs.CL cs.CY

Can AI Debias the News? LLM Interventions Improve Cross-Partisan Receptivity but LLMs Overestimate Their Own Effectiveness

Faisal Feroz, Jonas R. Kunst

Affiliations: Department of Experimental Psychology, University of Oxford; Uehiro Oxford Institute, University of Oxford; Psychology Programme, School of Social Sciences, Nanyang Technological University; Department of Communication and Culture, BI Norwegian Business School

AI summary: The study examines the effect of LLMs on news debiasing. It finds that a substantive reframing intervention increases conservatives' trust in liberal news, but that LLMs overestimate their own effectiveness and require human oversight.

Abstract

Partisan news media erode cross-partisan trust, but large language models (LLMs) offer a potential means of debiasing such content at scale. Across two pre-registered experiments, we tested whether LLM-generated debiasing of liberal news headlines could improve conservative readers' trust-relevant judgments. Study 1 found that subtle lexical debiasing (replacing emotive words with more moderate synonyms) had no effect on any outcome. Study 2 found that a more substantive reframing intervention significantly increased conservatives' perceived trustworthiness, completeness, and willingness to engage with liberal news headlines, without producing a backfire effect among a sample of liberals. In Study 1, the intervention produced robust effects among LLM-simulated silicon participants, whereas it had no impact on human readers. In Study 2, the intervention's effects among silicon participants aligned directionally with human responses but were significantly larger in magnitude for some outcomes. Moderation analyses revealed that the model's implicit theory of who responds to debiasing diverged from the psychological profile that actually predicted human responsiveness. These findings demonstrate that LLM-based debiasing can improve cross-partisan receptivity when targeting ideological framing rather than surface-level language, but that current models lack both the quantitative accuracy and qualitative psychological fidelity to evaluate their own interventions without human oversight.

2605.00834 2026-05-11 cs.LG cs.CC cs.IT math.IT

Polynomial-Time Optimal Group Selection via the Double-Commutator Eigenvalue Problem

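Mitchell A. Thornton

Affiliations: Richardson, TX 75080, USA

AI summary: This paper solves the group selection problem via a double-commutator eigenvalue problem. The method finds the optimal group structure in polynomial time, links group theory, matrix analysis, and statistical estimation, and provides a closed-form, certifiable solution.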

Comments v2: 2 theorems, 4 open problems, §X.A correction added; 1 reference added

Abstract

The algebraic diversity framework generalizes temporal averaging over multiple observations to algebraic group action on a single observation for second-order statistical estimation. The central open problem in this framework is $\textit{group selection}$: given an $M$-dimensional observation with unknown covariance structure, find the finite group whose spectral decomposition best matches the covariance. Naive enumeration of all subgroups of the symmetric group $S_M$ requires exponential time in $M$. We prove that this combinatorial problem reduces to a generalized eigenvalue problem derived from the double commutator of the covariance matrix, yielding a polynomial-time algorithm with complexity $O(d^2M^2 + d^3)$, where $d$ is the dimension of a generator basis. The minimum eigenvector of the double-commutator matrix directly constructs the optimal group generator in closed form, with no iterative optimization. The reduction is exact: the double-commutator minimum eigenvalue is zero if and only if the optimal generator lies in the span of the basis, and its magnitude provides a certifiable optimality gap when it does not. This problem does not appear in the standard catalogs of computational complexity (Garey and Johnson, 1979) and represents a new class linking group theory, matrix analysis, and statistical estimation. We establish connections to independent component analysis (JADE), structured matrix nearness problems, and simultaneous matrix diagonalization, and we show that the double-commutator formulation is the unique approach that is simultaneously polynomial-time, closed-form, and certifiable. We extend the framework to non-Abelian symmetry recovery via a Sequential GEVP with deflation, and add two identifiability theorems characterizing the commutant-lattice ambiguity and the dichotomy on whether $\mathrm{Aut}(\mathbf{R})$ recovers a generative subgroup or only a supergroup.

2605.00663 2026-05-11 cs.RO cs.CV

Affordance Agent Harness: Verification-Gated Skill Orchestration

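Haojian Huang, Jiahao Shi, Yinchuan Li, Yingcong Chen

Affiliations: HKUST(GZ); Knowin AI; Harbin Engineering University

AI summary: This paper proposes a closed-loop runtime that unifies heterogeneous skills with an evidence store and cost control. A Router adaptively selects skills, and a Verifier gates commitments using self-consistency, cross-scale stability, and evidence sufficiency, improving the accuracy and efficiency of affordance grounding.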

Comments 43 pages, 22 figures, 8 tables. Ongoing work

Abstract

Affordance grounding requires identifying where and how an agent should interact in open-world scenes, where actionable regions are often small, occluded, reflective, and visually ambiguous. Recent systems therefore combine multiple skills (e.g., detection, segmentation, interaction-imagination), yet most orchestrate them with fixed pipelines that are poorly matched to per-instance difficulty, offer limited targeted recovery from intermediate errors, and fail to reuse experience from recurring objects. These failures expose a systems problem: test-time grounding must acquire the right evidence, decide whether that evidence is reliable enough to commit, and do so under bounded inference cost without access to labels. We propose Affordance Agent Harness, a closed-loop runtime that unifies heterogeneous skills with an evidence store and cost control, retrieves episodic memories to provide priors for recurring categories, and employs a Router to adaptively select and parameterize skills. An affordance-specific Verifier then gates commitments using self-consistency, cross-scale stability, and evidence sufficiency, triggering targeted retries before a final judge fuses accumulated evidence and trajectories into the prediction. Experiments on multiple affordance benchmarks and difficulty-controlled subsets show a stronger accuracy-cost Pareto frontier than fixed-pipeline baselines, improving grounding quality while reducing average skill calls and latency. Project page: https://tenplusgood.github.io/a-harness-page/.
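The control flow described in this abstract (route to a skill, store its evidence, let a verifier decide whether to commit, retry within a cost budget) can be sketched generically. Everything below is hypothetical and chosen for illustration: the `Evidence` fields, the cheapest-first routing rule, the mean-score verifier, and the skill names and costs are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evidence:
    skill: str    # name of the skill that produced this evidence
    score: float  # hypothetical confidence in [0, 1]
    cost: float   # hypothetical inference cost of the call


# A skill is (name, zero-argument callable returning a score, cost).
Skill = Tuple[str, Callable[[], float], float]


def run_gated(
    skills: List[Skill],
    verifier: Callable[[List[Evidence]], bool],
    budget: float,
) -> Tuple[bool, List[Evidence], float]:
    """Call skills until the verifier accepts the evidence or the budget runs out."""
    store: List[Evidence] = []
    spent = 0.0
    # Stand-in router: try cheaper skills first.
    for name, run, cost in sorted(skills, key=lambda s: s[2]):
        if spent + cost > budget:
            break  # every remaining skill costs at least as much
        store.append(Evidence(name, run(), cost))
        spent += cost
        if verifier(store):
            return True, store, spent
    return False, store, spent


def mean_score_verifier(threshold: float) -> Callable[[List[Evidence]], bool]:
    """Stand-in verifier: commit once the mean evidence score is high enough."""
    def verify(store: List[Evidence]) -> bool:
        return sum(e.score for e in store) / len(store) >= threshold
    return verify


toy_skills: List[Skill] = [
    ("segment", lambda: 0.9, 3.0),
    ("detect", lambda: 0.4, 1.0),
    ("imagine", lambda: 0.95, 5.0),
]
committed, evidence, spent = run_gated(toy_skills, mean_score_verifier(0.6), 10.0)
print(committed, [e.skill for e in evidence], spent)
```

In this toy run the cheap detector alone does not satisfy the verifier, so a second skill is called before committing, and the most expensive skill is never invoked. That is the accuracy-versus-cost trade the abstract describes, here reduced to its simplest form.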

2605.00425 2026-05-11 cs.AI

AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Haotian Zhao, Songlin Zhou, Yuxin Zhang, Stephen S.-T. Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu

Affiliations: Baidu; Tsinghua University; Fudan University

AI summary: AEM adaptively modulates entropy dynamics to improve the exploration-exploitation balance in multi-turn agentic reinforcement learning, improving performance without additional supervision.

Comments 30 pages

Abstract

Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective agentic RL remains challenging: sparse outcome-only rewards provide limited guidance for assigning credit to individual steps within long interaction trajectories. Existing approaches often introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, which increases supervision and tuning complexity and may limit generalization across tasks and domains. We present AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to improve the exploration-exploitation trade-off. Since in agentic RL the environment is typically affected by a complete response, rather than an individual token, our analysis lifts entropy dynamics from the token level to the response level, aligning uncertainty estimation with the effective action granularity of LLM agents and reducing sensitivity to token-level sampling noise. We further show that entropy drift under natural-gradient updates is governed by the interaction between the sampled-response advantage and its relative surprisal. Motivated by this result, AEM derives a practical response-level uncertainty proxy and uses it to rescale advantages, leveraging the evolving balance between positive and negative samples to naturally transition from exploration to exploitation. Extensive experiments on ALFWorld, WebShop, and SWE-bench-Verified with models ranging from 1.5B to 32B demonstrate that AEM consistently improves strong RL baselines, including a +1.4% gain when integrated into a state-of-the-art software-engineering RL training framework.

2604.26157 2026-05-11 cs.CL cs.AI

Structural Generalization on SLOG without Hand-Written Rules

无需手写规则的SLOG结构泛化

Zichao Wei

发表机构 * Saarland University(萨尔兰大学)

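AI summary: This paper presents an approach to structural generalization that needs no hand-written rules, based on a neural cellular automaton. It reaches 67.3% overall accuracy on the SLOG benchmark (AM-Parser: 70.8%), scores 100% on 11 of 17 structural generalization categories, including three where AM-Parser scores 0-74%, and relates structural generalization to CCG directed types.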

详情
AI中文摘要

语义解析中的结构泛化要求系统将学习的组合规则应用于新结构组合。现有方法或依赖手写代数规则(AM-Parser),或无法结构泛化(Transformer模型)。本文提出一种无需手写组合规则的方法,基于具有离散瓶颈的神经细胞自动机(NCA),所有组合规则均通过局部迭代从数据中学习。在SLOG基准上,系统在10个种子下的总体准确率为67.3±0.2%,在17个结构泛化类别中有11个达到100%类型精确匹配,包括三个AM-Parser得分为0-74%的类别。分析表明,所有5,539个失败实例均归因于两种机制:新颖的wh-提取上下文与减少的动词类型组合,以及出现在动词主语侧的修饰语。当按CCG结构特征分解结果时,每个子模式要么在所有实例上成功,要么在所有实例上失败。中间分数(如41.4%)是结构不同的CCG模式的混合,而非部分泛化。这些结果表明,CCG方向类型比SLOG的现象级别类别更能表征结构泛化,成功/失败边界由训练数据中方向操作的覆盖范围决定。

英文摘要

Structural generalization in semantic parsing requires systems to apply learned compositional rules to novel structural combinations. Existing approaches either rely on hand-written algebraic rules (AM-Parser) or fail to generalize structurally (Transformer-based models). We present an alternative requiring no hand-written compositional rules, based on a neural cellular automaton (NCA) with a discrete bottleneck: all compositional rules are learned from data through local iteration. On the SLOG benchmark, the system achieves an overall accuracy of 67.3 ± 0.2% across 10 seeds (AM-Parser: 70.8 ± 4.3%), with 11 of 17 structural generalization categories at 100% type-exact match, including three where AM-Parser scores 0-74%. Analysis reveals that all 5,539 failure instances reduce to exactly two mechanisms: novel combinations of wh-extraction context with reduced verb types, and modifiers appearing on the subject side of verbs. When we decompose results by CCG structural features, each sub-pattern either succeeds on all instances or fails on all. Intermediate scores (e.g., 41.4%) are mixtures of structurally distinct CCG patterns, not partial generalization. These results suggest that CCG directed types provide higher resolution than SLOG's phenomenon-level categories for characterizing structural generalization, and that the success/failure boundary is determined by the coverage of directed operations in the training data.

2604.25809 2026-05-11 cs.CV

Instruction-Evidence Contrastive Dual-Stream Decoding for Grounded Vision-Language Reasoning

基于指令和证据的对比双流解码用于 grounded 视觉语言推理

Yashwant Pravinrao Bangde, Debaditya Roy

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Indian Institute of Technology Kharagpur(印度理工学院Kharagpur分校)

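AI summary: This paper proposes IECD$^2$, a dual-stream decoding method that balances linguistic informativeness against visual faithfulness, improving accuracy on vision-language reasoning tasks and reducing hallucination.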

详情
AI中文摘要

视觉语言模型(VLMs)在遵循指令和开放性视觉语言推理任务中表现出色,但生成的流畅输出往往缺乏对视觉证据的充分支持。先前研究表明,指令提示会加剧这一问题,尤其是在视觉信号不确定或模糊时。为解决此挑战,我们提出了一种解码框架,通过在生成过程中平衡语言信息和视觉真实性。我们的方法,即指令-证据对比双流解码(IECD$^2$),在每个解码步骤中维护两个并行的token概率分布:一个由指令驱动的流,促进表达性和信息性响应;另一个由证据驱动的流,强制在图像中严格接地。这两个流通过对称KL基于对比门进行自适应融合,抑制受语言先验偏好但缺乏视觉证据支持的token,同时在两者分布一致时保留它们。我们在多个数据集上评估了IECD$^2$,涵盖多种生成视觉语言推理任务,如captioning和视觉问答,包括POPE、MME、VQAv2、AMBER和MSCOCO。与最先进的解码方法相比,IECD$^2$在任务准确性和推理性能上表现出一致的改进,并显著减少了幻觉。

英文摘要

Vision-Language Models (VLMs) exhibit strong performance in instruction following and open-ended vision-language reasoning, yet they frequently generate fluent outputs that are weakly grounded in visual evidence. Prior work has shown that instruction prompting further worsens this issue by amplifying language priors, especially when the visual signal is uncertain or ambiguous. To address this challenge, we propose a decoding framework that explicitly balances linguistic informativeness and visual faithfulness during generation. Our method, Instruction-Evidence Contrastive Dual-Stream Decoding (IECD$^2$), maintains two parallel token probability distributions at each decoding step: an instruction-driven stream that promotes expressive and informative responses, and an evidence-driven stream that enforces strict grounding in the image. These two streams are adaptively fused using a symmetric KL-based contrastive gate, which suppresses tokens favored by language priors but unsupported by visual evidence, while preserving them when both distributions agree. We evaluate IECD$^2$ on generative vision-language reasoning tasks such as captioning and visual question answering, using multiple datasets including POPE, MME, VQAv2, AMBER, and MSCOCO. IECD$^2$ demonstrates consistent improvements in task accuracy and reasoning performance, with a substantial reduction in hallucination compared to state-of-the-art decoding approaches.
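The gating idea can be illustrated with a small sketch. The abstract does not give the gate's functional form, so everything here is an assumption: the gate `g = exp(-symKL / tau)`, the temperature `tau`, and the convex mixture used for fusion are hypothetical choices that merely reproduce the described behaviour (keep the instruction stream when the streams agree, fall back to the evidence stream when they diverge).

```python
import math

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) for two discrete distributions."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return kl_pq + kl_qp

def fuse(p_instr, p_evid, tau=1.0):
    """Fuse two next-token distributions with a hypothetical agreement gate.

    g is close to 1 when the streams agree (instruction stream is kept) and
    close to 0 when they diverge (evidence stream dominates).
    """
    g = math.exp(-sym_kl(p_instr, p_evid) / tau)
    return [g * pi + (1.0 - g) * pe for pi, pe in zip(p_instr, p_evid)]

# Token 0 is favoured by the language prior but not by the visual evidence.
p_instr = [0.90, 0.05, 0.05]
p_evid = [0.05, 0.90, 0.05]
fused = fuse(p_instr, p_evid)
```

Because the fusion is a convex mixture of two normalized distributions, the result needs no renormalization; a real decoder would apply this per step over the full vocabulary.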

2604.25380 2026-05-11 cs.CV

Benchmarking and Improving GUI Agents in High-Dynamic Environments

在高动态环境中评估和改进GUI代理

Enqi Liu, Liyuan Pan, Zhi Gao, Yan Yang, Chenrui Shi, Yang Liu, Jingrong Wu, Qing Li

发表机构 * Beijing Institute of Technology(北京理工大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院) Yangtze Delta Region Academy of Beijing Institute of Technology(北京理工大学长江三角洲地区研究院) Australian National University(澳大利亚国立大学)

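AI summary: This paper introduces DynamicGUIBench, a benchmark for high-dynamic GUI environments, and DynamicUI, an agent that processes screen recordings of dynamic interfaces. DynamicUI improves agent performance in these environments while remaining competitive on other public benchmarks.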

详情
AI中文摘要

最近在图形用户界面(GUI)代理方面的进展主要集中在监督微调(SFT)和强化学习(RL)等训练范式上。然而,高动态GUI环境的挑战仍被广泛忽视。现有代理通常在每次操作后依赖单一截图进行决策,导致部分可观测(甚至不可观测)的马尔可夫决策过程,其中关键的GUI状态,包括对操作至关重要的信息,往往被不充分捕捉。为了系统探索这一挑战,我们引入DynamicGUIBench,一个涵盖十个应用和多样化交互场景的综合性在线GUI基准,这些场景以操作之间重要的界面变化为特征。此外,我们提出了DynamicUI,一种专为动态界面设计的代理,其输入是交互过程的屏幕录制视频,并由三个组件组成:动态感知器、细化策略和反思。具体而言,动态感知器对GUI视频的帧进行聚类,为质心生成描述,然后迭代选择最信息丰富的帧作为显著的动态上下文。考虑到所选帧与代理文本上下文之间可能存在不一致性和噪声,细化策略采用基于动作的过滤来细化思维,以缓解思维-动作不一致性和冗余。基于细化的代理轨迹,反思模块为后续动作提供有效的准确指导。在DynamicGUIBench上的实验表明,DynamicUI在动态GUI环境中显著提升了性能,同时在其他公开基准测试中保持了竞争性的性能。

英文摘要

Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process in which the key GUI state, including information that is important for choosing actions, is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by important interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the interaction process as input and consists of three components: a dynamic perceiver, a refinement strategy, and a reflection module. Specifically, the dynamic perceiver clusters frames of the GUI video, generates captions for the centroids, and iteratively selects the most informative frames as the salient dynamic context. Because the selected frames and the agent's textual context may be inconsistent or noisy, the refinement strategy applies action-conditioned filtering to refine thoughts, mitigating thought-action inconsistency and redundancy. Based on the refined agent trajectories, the reflection module provides effective and accurate guidance for further actions. Experiments on DynamicGUIBench demonstrate that DynamicUI significantly improves performance in dynamic GUI environments while maintaining competitive performance on other public benchmarks.

2604.25150 2026-05-11 cs.LG cs.AI

The Role of Symmetry in Optimizing Overparameterized Networks

对参数过多网络优化中对称性的作用

Kusha Sareen, Mohammad Pedramfar, Sékou-Oumar Kaba, Mehran Shakerinava, Siamak Ravanbakhsh

发表机构 * McGill University(麦吉尔大学) Mila - Quebec Artificial Intelligence Institute(魁北克人工智能研究所)

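AI summary: By analyzing weight-space symmetries in neural networks, the paper shows that overparameterization aids optimization in two ways: the symmetries act as diagonal preconditioning on the Hessian, and overparameterization increases the probability mass of global minima near typical initializations.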

详情
AI中文摘要

参数过多是深度学习成功的关键,但其如何促进优化的机制尚不明确。我们分析了神经网络中的权重空间对称性,并证明参数过多引入了额外的对称性,以两种不同的方式促进优化。首先,我们证明这些对称性在Hessian上起到对角预条件的作用,使每个功能相同解等价类中存在更好的条件极小值。其次,我们显示参数过多增加了初始化附近全局极小值的概率质量,使这些有利解更容易达到。这些结果提供了损失景观几何与简单偏置之间潜在的联系。实证上,我们观察到更宽的网络具有更低的上特征值、更小的条件数和更快的收敛速度,与我们的分析一致。我们的分析为理解参数过多和宽度增长作为损失景观几何变换提供了一个统一的框架。

英文摘要

Overparameterization is central to the success of deep learning, yet the mechanisms by which it improves optimization remain incompletely understood. We analyze weight-space symmetries in neural networks and show that overparameterization introduces additional symmetries that benefit optimization in two distinct ways. First, we prove that these symmetries act as a form of diagonal preconditioning on the Hessian, enabling the existence of better-conditioned minima within each equivalence class of functionally identical solutions. Second, we show that overparameterization increases the probability mass of global minima near typical initializations, making these favourable solutions more reachable. These results offer a potential link between loss landscape geometry and simplicity bias. Empirically, we observe wider networks have lower top eigenvalues, smaller condition numbers and faster convergence, matching our analysis. Our analysis provides a unified framework for understanding overparameterization and width growth as a geometric transformation of the loss landscape.
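A toy example may help make "better-conditioned minima within each equivalence class" concrete. This is not the paper's construction: it assumes the two-parameter loss L(a, b) = 0.5 (ab - 1)^2, whose minima ab = 1 are all functionally identical and related by the rescaling symmetry (a, b) -> (ca, b/c). At any such minimum the Hessian is [[b^2, 1], [1, a^2]], so its top eigenvalue is a^2 + b^2, which is smallest at the balanced point a = b = 1.

```python
import math

def top_hessian_eig(a, b):
    """Largest Hessian eigenvalue of the toy loss L(a, b) = 0.5 * (a*b - 1)**2."""
    h_aa = b * b              # d2L/da2
    h_bb = a * a              # d2L/db2
    h_ab = 2.0 * a * b - 1.0  # d2L/da db
    mean = 0.5 * (h_aa + h_bb)
    half_diff = 0.5 * (h_aa - h_bb)
    # Closed-form top eigenvalue of a symmetric 2x2 matrix.
    return mean + math.sqrt(half_diff * half_diff + h_ab * h_ab)

# Two minima that compute the same function (a*b = 1) but differ in sharpness.
balanced = top_hessian_eig(1.0, 1.0)     # 2.0
unbalanced = top_hessian_eig(4.0, 0.25)  # 16.0625
```

Moving along the symmetry orbit changes the curvature without changing the function, which is the sense in which a symmetry can act like a rescaling of the Hessian.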

2604.24661 2026-05-11 cs.RO

Agent-Centric Observation Adaptation for Robust Visual Control under Dynamic Perturbations

以代理为中心的观测适应用于动态扰动下的鲁棒视觉控制

Zhengru Fang, Yu Guo, Fei Liu, Yuang Zhang, Yihang Tao, Senkang Hu, Wenbo Ding, Yuguang Fang

发表机构 * City University of Hong Kong(香港城市大学) Tsinghua University(清华大学)

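AI summary: This paper proposes ACO-MoE, an observation adapter that combines a routed bank of restoration experts with a foreground-mask branch to make visual control robust under dynamic perturbations, improving downstream task performance.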

Comments Source code is available at https://github.com/fangzr/aco-moe-code

详情
AI中文摘要

现实中的视觉系统面临时间变化的扰动,包括天气、传感器噪声、压缩伪影和背景干扰。现有图像修复方法通常针对固定损坏类型设计,并优化像素级保真度,留下两个问题:修复在非平稳损坏切换下的表现如何,以及像素级保真度是否保留下游模型所需的信息。为此,我们引入了视觉退化控制套件(VDCS),在渲染场景中注入马尔可夫切换物理退化。我们进一步识别了基于重建的表示的基本失败模式:忠实重建受损观测迫使潜在状态编码特定于损坏的干扰信息,从而污染下游模型。从信息瓶颈的角度来看,将表示锚定在干净的前景上消除了这种污染。受此分析启发,我们提出了一种冻结、即插即用的观测适配器,即混合专家(ACO-MoE),结合路由的修复专家银行和前景掩码分支。ACO-MoE完全在合成渲染数据上预训练,使用自动生成功能退化对和模拟推导的前景掩码,无需手动注释。在推理时,它仅需损坏的RGB作为输入,无需损坏标签、干净参考帧或前景掩码。在VDCS、DMC-GB和RoboSuite上,ACO-MoE在模型无关和模型基于的回溯中一致提高下游控制性能,恢复95.3%的干净输入性能,在挑战性的马尔可夫切换退化下。它还泛化到未见的视觉退化,这些退化在适配器预训练中被排除。

英文摘要

Real-world visual systems face time-varying perturbations, including weather, sensor noise, compression artifacts, and background distractions. Existing image restoration methods are typically designed for fixed corruption types and optimized for pixel-level fidelity, leaving open two questions: how restoration behaves under non-stationary corruption switching, and whether pixel-level fidelity preserves the task-relevant information needed by downstream models. To study this setting, we introduce the Visual Degraded Control Suite (VDCS), a benchmark that injects Markov-switching physical degradations into rendered scenes. We further identify a fundamental failure mode of reconstruction-based representations: faithfully reconstructing corrupted observations forces the latent state to encode corruption-specific nuisance information, thereby contaminating downstream models. From an information-bottleneck perspective, anchoring the representation to the clean foreground eliminates this contamination. Motivated by this analysis, we propose Agent-Centric Observations with Mixture-of-Experts (ACO-MoE), a frozen, plug-and-play observation adapter that combines a routed bank of restoration experts with a foreground-mask branch. ACO-MoE is pretrained entirely offline on synthetic rendered data with automatically generated degradation pairs and simulation-derived foreground masks, requiring no manual annotation. At inference time, it takes only corrupted RGB as input without corruption labels, clean reference frames, or foreground masks. Across VDCS, DMC-GB, and RoboSuite, ACO-MoE consistently improves downstream control with both model-free and model-based backbones, recovering 95.3% of clean-input performance under challenging Markov-switching corruptions. It also generalizes zero-shot to unseen visual perturbations excluded from adapter pretraining.

2604.24372 2026-05-11 cs.CL cs.AI cs.NE

SeaEvo: Advancing Algorithm Discovery with Strategy Space Evolution

SeaEvo:通过策略空间进化推进算法发现

Sichun Luo, Yi Huang, Haochen Luo, Fengyuan Liu, Guanzhi Deng, Lei Li, Qinghua Yao, Zefa Hu, Junlan Feng, Qi Liu

Affiliations * The University of Hong Kong; JIUTIAN Research, China Mobile; City University of Hong Kong

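AI summary: SeaEvo adds a strategy-space layer that turns language-level strategic reasoning into population-level evolutionary state, improving LLM-guided program search on algorithm discovery, systems optimization, and agent-scaffold design tasks.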

详情
AI中文摘要

大型语言模型(LLM)引导的进化搜索日益用于自动化算法发现,但大多数现有方法主要通过可执行程序和标量适应度跟踪搜索进度。即使使用自然语言推理通过启发式描述或反思,通常仍停留在短暂的突变上下文或无结构的记忆中,而不是以组织化的方式作为战略方向的持久群体状态。因此,进化搜索难以区分相同想法的不同语法实现、保存低适应度但战略上有前途的方向,或检测整个策略家族已饱和。我们引入\model,一个模块化的策略空间层,将语言级战略推理转化为LLM引导程序搜索中的第一类群体级进化状态。\model用显式的自然语言策略表示每个候选程序,按策略语义聚类存档,检索行为互补的灵感,并定期导航策略景观以避免饱和方向。不修改底层进化算法,\model在大多数情况下提升了现有进化骨架在算法发现、系统优化和智能体设计任务中的表现。在四个系统基准测试中,\model实现了20.6%的平均相对提升,其中在Prism上的最佳单次运行得分高3倍。这些结果表明,持久的策略表示为提高LLM引导进化搜索的有效性和成本效率提供了实用机制,指向了其搜索能力受益于算法策略结构化积累和重用的复合人工智能系统。

英文摘要

Large Language Model (LLM)-guided evolutionary search is increasingly used for automated algorithm discovery, yet most current methods track search progress primarily through executable programs and scalar fitness. Even when natural-language reasoning is used through heuristic descriptions or reflection, it typically remains transient mutation context or unstructured memory, rather than organized as persistent population-level state over strategic directions. As a result, evolutionary search can struggle to distinguish syntactically different implementations of the same idea, preserve lower-fitness but strategically promising directions, or detect when an entire family of strategies has saturated. We introduce SeaEvo, a modular strategy-space layer that turns language-level strategic reasoning into first-class population-level evolutionary state in LLM-driven program search. SeaEvo represents each candidate program with an explicit natural-language strategy, clusters the archive by strategy semantics, retrieves behaviorally complementary inspirations, and periodically navigates the strategy landscape to avoid saturated directions. Without modifying the underlying evolutionary algorithms, SeaEvo improves existing evolutionary backbones across algorithm discovery, systems optimization, and agent-scaffold design tasks in most settings. Across four systems benchmarks, SeaEvo achieves a 20.6% average relative improvement, with the best single run on Prism scoring 3× higher. These results suggest that persistent strategy representations provide a practical mechanism for improving the effectiveness and cost-efficiency of LLM-guided evolutionary search, pointing toward compound AI systems whose search capabilities benefit from the structured accumulation and reuse of algorithmic strategies.

2604.24136 2026-05-11 cs.CV eess.IV

Bridging Restoration and Generation Manifolds in One-Step Diffusion for Real-World Super-Resolution

连接恢复与生成流形的单步扩散在现实世界超分辨率中的应用

Shyang-En Weng, Yi-Cheng Liao, Yu-Syuan Xu, Wei-Chen Chiu, Ching-Chun Huang

发表机构 * National Yang Ming Chiao Tung University, Hsinchu, Taiwan MediaTek Inc., Taiwan

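AI summary: This paper proposes IDaS-SR, a framework that uses the Manifold Inversion Noise Estimator (MINE) and the CHARIOT steering mechanism to bridge the restoration and generation manifolds in one-step diffusion, improving real-world super-resolution.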

详情
AI中文摘要

预训练扩散模型已革新了现实世界图像超分辨率(Real-ISR),但因迭代采样导致计算瓶颈。最近的单步蒸馏加速了推理,但面临感知-失真权衡的严峻挑战,由于固有的时间步初始化、分布轨迹不匹配和脆弱的随机调制。为此,我们提出了Adaptive Inversion和Degradation-aware Sampling for Real-ISR(IDaS-SR),一个连接确定性恢复和随机生成流形的单步框架。其核心是Manifold Inversion Noise Estimator(MINE),通过预测严重性感知的时间步和倒置噪声,精确地将低质量潜在映射到扩散轨迹上。此外,为缓解脆弱的随机调制,我们提出了CHARIOT,一种连续生成引导机制。通过重新调度轨迹和插值噪声,它使显式导航感知-失真边界而不影响结构先验。大量实验表明,IDaS-SR优于现有最先进方法,能够无缝从严格的结构恢复器过渡到复杂的纹理 hallucinator,在单步推理中实现。

英文摘要

Pretrained diffusion models have revolutionized real-world image super-resolution (Real-ISR) but suffer from computational bottlenecks due to iterative sampling. Recent single-step distillation accelerates inference but faces a stark perception-distortion trade-off due to rigid timestep initialization, distributional trajectory mismatches, and fragile stochastic modulation. To address this, we present Adaptive Inversion and Degradation-aware Sampling for Real-ISR (IDaS-SR), a one-step framework bridging the deterministic restoration and stochastic generation manifolds. At its core, the Manifold Inversion Noise Estimator (MINE) resolves these initialization and trajectory mismatches by predicting a severity-aware timestep and inversion noise, precisely anchoring low-quality latents onto the diffusion trajectory. Furthermore, to mitigate fragile stochastic modulation, we propose CHARIOT, a continuous generative steering mechanism. By rescheduling trajectories and interpolating noise, it enables explicit navigation of the perception-distortion boundary without compromising structural priors. Extensive experiments demonstrate that IDaS-SR outperforms state-of-the-art methods, seamlessly transitioning from a rigorous structural restorer to a sophisticated texture hallucinator in a single inference step.

2604.23947 2026-05-11 cs.AI

GamED.AI: A Hierarchical Multi-Agent Framework for Automated Educational Game Generation

GamED.AI:一种用于自动教育游戏生成的分层多智能体框架

Shiven Agarwal, Yash Shah, Ashish Raj Shekhar, Priyanuj Bordoloi, Vivek Gupta

发表机构 * Arizona State University(亚利桑那州立大学)

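AI summary: This paper proposes GamED.AI, a hierarchical multi-agent framework that turns instructor-provided questions into playable educational games validated through formal mechanic contracts. On 200 questions, the system achieves a 90% validation pass rate and 98.3% schema compliance.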

详情
AI中文摘要

我们介绍了GamEDAI,一种分层多智能体框架,该框架将教师提供的问题转换为完全可玩的、具有教育依据的教育游戏,并通过形式化机制合同进行验证。该框架基于基于阶段的LangGraph子图、确定性的质量门和结构化的Pydantic模式,支持两种模板家族,涵盖15种交互机制,包括空间推理、过程执行和更高阶的布卢姆分类法目标。在200个问题上评估,系统实现了90%的验证通过率、98.3%的模式合规性以及73%的令牌减少(从约73,500个令牌/游戏到约19,900个令牌/游戏),每游戏成本为0.46美元。在该模型配置下,这些结果表明,阶段限定的架构结构比仅靠提示策略更强烈地相关。我们的演示使与会者能够在60秒内从自然语言生成布卢姆对齐的游戏,检查每个流水线阶段的质量门输出,并浏览包含所有15种机制类型的50个游戏的精选图书馆。

英文摘要

We introduce GamED.AI, a hierarchical multi-agent framework that transforms instructor-provided questions into fully playable, pedagogically grounded educational games validated through formal mechanic contracts. Built on phase-based LangGraph sub-graphs, deterministic Quality Gates, and structured Pydantic schemas, GamED.AI supports two template families encompassing 15 interaction mechanics across spatial reasoning, procedural execution, and higher-order Bloom's Taxonomy objectives. Evaluated on 200 questions spanning five subject domains, the system achieves a 90% validation pass rate, 98.3% schema compliance, and 73% token reduction over ReAct agents (~73,500 → ~19,900 tokens/game) at $0.46 per game. Within this model configuration, these results suggest that phase-bounded architectural structure correlates more strongly with alignment quality than prompting strategy alone. Our demonstration lets attendees generate Bloom's-aligned games from natural language in under 60 seconds, inspect Quality Gate outputs at each pipeline phase, and browse a curated library of 50 games spanning all 15 mechanic types.

2604.23478 2026-05-11 cs.CL

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems

JudgeSense: 一个用于评估LLM作为裁判系统提示敏感性的基准

Rohith Reddy Bellibatlu, Edward Raff, Wenbin Zhang

发表机构 * Florida International University(佛罗里达国际大学) CrowdStrike University of Maryland Baltimore County(马里兰大学巴尔的摩县分校)

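AI summary: This paper releases JudgeSense, a benchmark for measuring how stable LLM-as-a-judge verdicts are under prompt paraphrasing. It analyzes decision stability across tasks and judge architectures and finds that model scale is not a reliable proxy for consistency.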

Comments 20 pages, 2 figures, 1 table. Code: https://github.com/rohithreddybc/judgeSense. Dataset (JudgeSense Benchmark): https://huggingface.co/datasets/Rohithreddybc/judgesense-benchmark

详情
AI中文摘要

大型语言模型被广泛用作自动化评估裁判,但其在语义等价提示重述下的稳定性仍缺乏系统研究。本文通过多个评估任务和裁判架构进行系统实证研究,发布JudgeSense基准,包含人工验证的提示-重述对,涵盖事实性、连贯性、相关性和偏好,基于现有NLP基准并附有详细决策日志。该基准可测量等价提示下的裁判稳定性,帮助研究者评估稳定性是否与模型规模或指令微调相关,并识别最敏感于提示语言的任务。评估发现,连贯性是区分裁判行为的主要任务,事实性判断在标准条件下表现出高稳定性。成对评估任务始终表现出位置偏置。关键发现是模型规模不是一致性的可靠代理;有趣的是,分析中发现最大和最新模型并不最一致。

英文摘要

Large language models are widely adopted as automated evaluation judges, yet the stability of their verdicts under semantically equivalent prompt rephrasings remains largely unexamined. We conduct a systematic empirical study of prompt-induced decision instability across multiple evaluation tasks and judge architectures. To facilitate this analysis, we release JudgeSense, a benchmark comprising hand-validated prompt-paraphrase pairs spanning factuality, coherence, relevance, and preference, drawn from established NLP benchmarks and accompanied by comprehensive decision logs. The benchmark enables the measurement of judge stability across equivalent prompts, allowing researchers to assess whether stability correlates with model scale or instruction-tuning, and to identify which tasks are most sensitive to prompt wording. Our evaluation reveals that coherence remains the primary task for distinguishing judge behavior, while factuality judgments demonstrate high stability under standard conditions. Pairwise evaluation tasks consistently exhibit position bias. Crucially, we find that model scale is not a reliable proxy for consistency: the largest and newest models are not the most consistent.
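One simple way to quantify verdict stability over prompt-paraphrase pairs is a flip rate: the fraction of pairs on which the judge's decision changes. The abstract does not say which metric JudgeSense reports, so the function name `flip_rate` and the data layout below are assumptions made for illustration only.

```python
def flip_rate(verdict_pairs):
    """Fraction of (original, paraphrase) verdict pairs on which the judge disagrees with itself.

    verdict_pairs: list of (verdict_on_original_prompt, verdict_on_paraphrased_prompt).
    """
    if not verdict_pairs:
        raise ValueError("need at least one verdict pair")
    flips = sum(1 for original, paraphrase in verdict_pairs if original != paraphrase)
    return flips / len(verdict_pairs)

# Hypothetical pairwise-preference verdicts ("A" or "B") from one judge.
pairs = [("A", "A"), ("A", "B"), ("B", "B"), ("B", "A")]
rate = flip_rate(pairs)  # 0.5
```

A rate of 0 means the judge is fully consistent under rephrasing; comparing rates per task would show which tasks are most sensitive to wording.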

2604.21657 2026-05-11 cs.LG

Transferable SCF-Acceleration through Solver-Aligned Initialization Learning

通过求解器对齐的初始化学习实现可迁移的SCF加速

Eike S. Eberhard, Viktor Kotsev, Timm Güthle, Stephan Günnemann

发表机构 * Technical University of Munich(慕尼黑技术大学) Munich Data Science Institute(慕尼黑数据科学研究所) Munich Center for Machine Learning(慕尼黑机器学习中心)

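AI summary: This paper proposes SAIL, which differentiates through the SCF solver end-to-end to address the supervision problem in initial-guess prediction, improving SCF efficiency and extending the acceleration to large drug-like molecules.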

详情
AI中文摘要

Kohn-Sham密度泛函理论(KS-DFT)计算的成本随求解器迭代次数增加,这取决于初始猜测的质量。利用分子几何预测初始猜测的机器学习方法可降低此成本,但矩阵预测模型在扩展到更大分子时失效,反而减缓收敛速度[Liu et al., 2025]。我们证明此失败是监督问题而非扩展问题:训练于基态目标的模型在分布外表现良好,但产生的初始猜测却减慢收敛。通过求解器对齐的初始化学习(SAIL)解决了哈密顿量和密度矩阵模型的问题,通过端到端区分自洽场(SCF)求解器。我们引入有效的相对迭代次数(ERIC),一种修正常用RIC的指标,以考虑隐藏的Fock构建开销。在QM40数据集上,包含比训练分布大4倍的分子,SAIL在PBE、SCAN和B3LYP中分别减少ERIC 37%、33%和28%,在B3LYP上超过现有最佳表现两倍以上。在QMugs分子上,分子大小是训练集的10倍,SAIL在混合理论水平实现1.35倍的墙时间加速,将ML SCF加速扩展到大型药物分子。

英文摘要

The cost of Kohn-Sham density functional theory (KS-DFT) calculations scales with the number of solver iterations, which depends on the quality of the initial guess. Machine learning methods that predict initial guesses from molecular geometry can reduce this cost, but matrix-prediction models fail when extrapolating to larger molecules, degrading rather than accelerating convergence [Liu et al., 2025]. We show that this failure is a supervision problem, not an extrapolation problem: models trained on ground-state targets fit those targets well out of distribution, yet produce initial guesses that slow convergence. Solver-Aligned Initialization Learning (SAIL) resolves this for both Hamiltonian and density matrix models by differentiating through the self-consistent field (SCF) solver end-to-end. We introduce the Effective Relative Iteration Count (ERIC), a correction to the commonly used RIC that accounts for hidden Fock-build overhead. On QM40, which contains molecules up to 4× larger than the training distribution, SAIL reduces ERIC by 37% (PBE), 33% (SCAN), and 28% (B3LYP), more than doubling the previous state-of-the-art reduction on B3LYP. On QMugs molecules 10× larger than the training set, SAIL delivers a 1.35× wall-time speedup at the hybrid level of theory, extending ML SCF acceleration to large drug-like molecules.

2604.18905 2026-05-11 cs.RO

Task-Adaptive Admittance Control for Human-Quadrotor Cooperative Load Transportation with Dynamic Cable-Length Regulation

任务自适应阻抗控制用于人机四旋翼协作负载运输与动态缆长调节

Shuai Li, Ton T. H. Duong, Damiano Zanotto

发表机构 * Dept. of Mechanical Engineering, Stevens Institute of Technology(机械工程系,史蒂文斯理工学院)

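AI summary: This paper proposes a task-adaptive admittance controller for safe and efficient human-quadrotor cooperative load transportation. An actively controlled winch regulates cable length dynamically, improving system responsiveness and motion smoothness.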

Comments Preprint of accepted manuscript to be published in IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

人类与机器人协作在许多机器人应用中至关重要,尤其是在需要物理人机交互(pHRI)的应用中。先前的pHRI研究主要集中在机械臂上,采用阻抗或顺应控制以维持操作安全性。相反,人类-四旋翼协作负载运输(CLT)研究仍处于初级阶段。本文介绍了一种新型阻抗控制器,用于安全有效的CLT,该控制器采用配备主动控制绞盘的四旋翼。所提出的方法考虑了系统的耦合动力学,使四旋翼及其缆线能够动态适应CLT任务中的接触力,从而提高响应性。我们实验验证了控制器在整个CLT过程中的任务自适应能力,包括就地装载/卸载和负载运输任务。为此,我们比较了系统性能与传统方法,在低刚度和高刚度条件下使用可变和固定缆长。结果表明,所提出的方法在系统响应性和运动平滑度方面优于传统方法,从而提升了CLT能力。

英文摘要

The collaboration between humans and robots is critical in many robotic applications, especially in those requiring physical human-robot interaction (pHRI). Previous research in pHRI has largely focused on robotic manipulators, employing impedance or admittance control to maintain operational safety. Conversely, research in human-quadrotor cooperative load transportation (CLT) is still in its infancy. This letter introduces a novel admittance controller designed for safe and effective human-quadrotor CLT using a quadrotor equipped with an actively controlled winch. The proposed method accounts for the system's coupled dynamics, allowing the quadrotor and its cable to dynamically adapt to contact forces during CLT tasks, thereby enhancing responsiveness. We experimentally validated the task-adaptive capability of the controller across the entire CLT process, including in-place loading/unloading and load-transport tasks. To this end, we compared the system's performance against a conventional approach, using both variable and fixed cable lengths under low- and high-stiffness conditions. Results demonstrate that the proposed method outperforms the conventional approach in terms of system responsiveness and motion smoothness, leading to improved CLT capabilities.
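For readers unfamiliar with admittance control, the textbook law maps a measured external force to a reference motion through a virtual mass-damper, M dv/dt + D v = F_ext. The sketch below is a generic one-dimensional version integrated with explicit Euler. It is not the paper's controller, which couples quadrotor and winch dynamics, and the gains and time step are hypothetical values.

```python
def admittance_step(velocity, f_ext, mass, damping, dt):
    """One explicit-Euler step of the virtual dynamics  M * dv/dt + D * v = F_ext."""
    acceleration = (f_ext - damping * velocity) / mass
    return velocity + acceleration * dt

def simulate(f_ext, mass=1.0, damping=4.0, dt=0.01, steps=2000):
    """Reference velocity produced by a constant force after `steps` Euler steps."""
    v = 0.0
    for _ in range(steps):
        v = admittance_step(v, f_ext, mass, damping, dt)
    return v

# A constant 2 N pull settles at the steady-state velocity F / D = 0.5 m/s.
v_final = simulate(2.0)
```

Lower virtual damping makes the reference motion more compliant to the human's force, while higher damping makes it stiffer.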

2604.16889 2026-05-11 cs.CL

Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

剪枝、解释、评估:一种跨层译码器原生框架,用于通过特征归因实现高效的电路发现

Qinhao Chen, Linyang He, Nima Mesgarani

发表机构 * Columbia University(哥伦比亚大学)

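AI summary: This paper proposes PIE, a prune-first, interpret-later framework that combines Feature Attribution Patching with synergy-aware reranking to improve the efficiency of circuit discovery and the quality of the resulting interpretations.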

详情
AI中文摘要

现有的特征解释流程通常在均匀采样的单元或完整特征集上操作,导致与目标行为无关的单元成本高昂。为解决此问题,我们引入了首个CLT原生端到端剪枝框架PIE,开创了先剪枝后解释的范式。PIE连接剪枝、自动解释和解释评估,建立了一个全面的基准环境,系统地测量剪枝下的行为保真度和下游可解释性。在该框架中,我们采用强相关性基线并提出特征归因修补(FAP),一种基于修补的归因方法,通过聚合梯度加权写入贡献来评分CLT特征。此外,我们引入FAP-Synergy,一种系统性的协同意识重排序过程。我们通过KL散度行为保留和FADE类度量在IOI和Doc-String数据集上评估剪枝,并在K∈{50, 100, 200, 400, 800}的预算约束下进行严格基准测试。我们的严格基准测试揭示了不同的操作区域:虽然基础FAP和适应基线在放松预算下表现稳健,但FAP-Synergy在高约束、严格预算区域表现优异。关键的是,我们展示了实际的“有效预算”优势:在IOI任务中,对于Llama-3.2-1B和Gemma-2-2B,FAP-Synergy在K=50时的功能保真度与基线电路在K=75时相当。由于下游评估成本与每个特征成线性关系,协同有效地使流程获得25个“免费”特征,实现K=75保真度的同时减少解释成本33%。

英文摘要

Existing feature-interpretation pipelines typically operate on uniformly sampled units or exhaustive feature sets, incurring massive costs on units irrelevant to target behaviors. To address this, we introduce the first CLT-native end-to-end pruning framework, PIE, which pioneers the paradigm of pruning first and interpreting later. PIE connects Pruning, automatic Interpretation, and interpretation Evaluation, establishing a comprehensive benchmarking environment to systematically measure behavioral fidelity and downstream interpretability under pruning. Within this framework, we adapt strong relevance baselines and propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions. Furthermore, we introduce FAP-Synergy, a systematic synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics across IOI and Doc-String datasets. Across budget constraints of K in {50, 100, 200, 400, 800}, our rigorous benchmarking reveals distinct operational regimes: while base FAP and adapted baselines perform robustly at relaxed budgets, FAP-Synergy excels in highly constrained, strict-budget regimes. Crucially, we demonstrate a practical "Effective Budget" advantage: on the IOI task for both Llama-3.2-1B and Gemma-2-2B, FAP-Synergy at K=50 functionally matches the behavioral fidelity of baseline circuits at K=75. Because downstream evaluation costs scale linearly per feature, Synergy effectively grants the pipeline 25 "free" features, achieving K=75 fidelity while reducing interpretation costs by 33%.
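The "Effective Budget" arithmetic at the end of the abstract can be checked directly. The sketch assumes only what the abstract states, that interpretation cost scales linearly with the number of features; the function name is a hypothetical helper.

```python
def interpretation_saving(k_used, k_matched):
    """Relative cost saved when k_used features match the fidelity of k_matched features,
    assuming cost is linear in the number of features interpreted."""
    return 1.0 - k_used / k_matched

free_features = 75 - 50                 # 25 features that no longer need interpreting
saving = interpretation_saving(50, 75)  # one third, i.e. about 33%
```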

2604.16579 2026-05-11 cs.LG cs.AI

EviDep: Trustworthy Multimodal Depression Estimation via Disentangled Evidential Learning

EviDep:通过解耦证据学习实现可信的多模态抑郁估计

Fangyuan Liu, Sirui Zhao, Zeyu Zhang, Jinyang Huang, Feng-Qi Cui, Bin Luo, Meng Li, Tong Xu, Enhong Chen

发表机构 * School of Computer Science and Technology, University of Science and Technology of China(中国科学技术大学计算机科学与技术学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院) Department of Psychiatry, The First Affiliated Hospital of University of Science and Technology of China, Division of Life Sciences and Medicine, University of Science and Technology of China(中国科学技术大学附属第一医院精神科,生命科学与医学学院)

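AI summary: This paper proposes EviDep, a framework that uses disentangled evidential learning to quantify depression severity together with its uncertainty, improving the trustworthiness and uncertainty calibration of multimodal depression estimation.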

详情
AI中文摘要

在无约束环境中自动进行多模态抑郁估计面临自然噪声和复杂行为变异的挑战。现有确定性方法无法量化预测不确定性,导致决策风险。本文提出EviDep框架,通过正常-逆伽玛分布联合量化抑郁严重程度及aleatoric和epistemic不确定性。引入两个机制:频率感知特征提取模块利用小波基混合专家动态解耦稳定宏观情绪基线与瞬时微观行为爆发,过滤无关任务伪影;解耦证据学习策略在净化表示中强制显式去相关,通过分离跨模态共识与模态特异性行为细节,防止信息重复计数。在AVEC 2013、AVEC 2014、DAIC-WOZ和E-DAIC数据集上的实验表明,EviDep在预测准确性和不确定性校准方面达到最先进的水平,提供一种可信的风险意识决策支持工具。

英文摘要

Automated multimodal depression estimation in unconstrained environments is inherently challenged by naturalistic noise and complex behavioral variability. Prevailing deterministic methods, however, produce uncalibrated point estimates without quantifying predictive uncertainty, exposing decision-making to the risk of overconfident, untrustworthy estimates. To establish a reliable and trustworthy estimation paradigm, we propose EviDep, an evidential learning framework that jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a Normal-Inverse-Gamma distribution. To ensure the integrity of the extracted behavioral evidence and prevent artificial confidence inflation during multimodal fusion, EviDep introduces two tailored mechanisms. First, addressing the temporal-frequency heterogeneity of behavioral cues, a Frequency-aware Feature Extraction module leverages a wavelet-based Mixture-of-Experts to dynamically decouple stable macro-level affective baselines from transient micro-level behavioral bursts, effectively filtering out task-irrelevant artifacts. Second, a Disentangled Evidential Learning strategy enforces explicit decorrelation of features in these purified representations. By separating the cross-modal shared consensus from modality-specific behavioral nuances before Bayesian fusion, this rigorous disentanglement strictly prevents the model from double-counting overlapping information. Extensive experiments on the AVEC 2013, AVEC 2014, DAIC-WOZ, and E-DAIC datasets confirm that EviDep achieves state-of-the-art predictive accuracy and superior uncertainty calibration, thereby delivering a trustworthy, risk-aware decision-support tool for depression estimation.
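In standard evidential regression, a Normal-Inverse-Gamma output with parameters (gamma, nu, alpha, beta) yields closed-form uncertainties: the prediction is gamma, the aleatoric uncertainty is E[sigma^2] = beta / (alpha - 1), and the epistemic uncertainty is Var[mu] = beta / (nu (alpha - 1)). The sketch below computes these from given parameters. How EviDep's network produces and fuses them is not described in the abstract, so the example values are hypothetical.

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Prediction and uncertainties implied by a Normal-Inverse-Gamma(gamma, nu, alpha, beta).

    Returns (prediction, aleatoric, epistemic) where
      aleatoric = E[sigma^2] = beta / (alpha - 1)
      epistemic = Var[mu]    = beta / (nu * (alpha - 1))
    """
    if nu <= 0 or alpha <= 1 or beta <= 0:
        raise ValueError("requires nu > 0, alpha > 1, beta > 0")
    aleatoric = beta / (alpha - 1.0)
    epistemic = beta / (nu * (alpha - 1.0))
    return gamma, aleatoric, epistemic

# Hypothetical severity score of 10 with moderate evidence.
prediction, aleatoric, epistemic = nig_uncertainty(gamma=10.0, nu=2.0, alpha=3.0, beta=4.0)
```

More evidence (larger nu) shrinks the epistemic term while leaving the aleatoric term unchanged, which is what lets the two sources of uncertainty be reported separately.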

2604.15694 2026-05-11 cs.LG math.PR

Neural Continuous-Time Markov Chain: Discrete Diffusion via Decoupled Jump Timing and Direction

神经连续时间马尔可夫链:通过解耦的跳跃时间和方向实现离散扩散

Jingyuan Li, Xiaoyi Jiang, Fukang Wen, Wei Liu, Renqian Luo, Yi Zhu, Zuoqiang Shi, Pipi Hu

发表机构 * Tsinghua University(清华大学) Beijing Institute of Mathematical Sciences and Applications(北京数学科学研究院) Wuhan University(武汉大学) MathonAI

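AI summary: This paper proposes Neural CTMC, which parameterizes jump timing and jump direction separately to improve discrete diffusion generation. Experiments show it outperforms existing methods on TinyStories and OpenWebText.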

详情
AI中文摘要

基于连续时间马尔可夫链(CTMC)的离散扩散模型在语言和离散数据生成中表现出色,但现有方法通常通过代理如Concrete Scores(SEDD)或Clean-data预测(MDLM, GIDD)单一起参数化反向速率矩阵,而非对齐CTMC分解为跳跃时间和方向。我们提出Neural CTMC,通过两个专用网络头分别参数化反向过程的退出率(何时跳跃)和跳跃分布(何处跳跃),利用CTMC的动力学的泊松结构。我们展示证据下界(ELBO)减少为真实与学习反向过程之间的路径空间KL散度,分解为时间的泊松KL和方向的分类KL,并具有可计算、梯度等价且一致的损失。实验表明,通过Gemma2-9B评分,纯均匀Neural CTMC在TinyStories上达到16.36生成困惑度(vs. GIDD 37.60和MDLM 42.66)。在OpenWebText上,它在相同训练token预算下,经过16-128次采样步骤时,取得最佳困惑度(例如,在128步时:Neural CTMC 183.6 vs. MDLM 210.5和GIDD 249.8)。为促进可重复性,我们发布了预训练权重至https://huggingface.co/Jiangxy1117/Neural-CTMC。

英文摘要

Discrete diffusion models based on continuous-time Markov chains (CTMCs) have shown strong performance on language and discrete data generation, yet existing approaches typically parameterize the reverse rate matrix monolithically, through proxies such as concrete scores (SEDD) or clean-data predictions (MDLM, GIDD), rather than aligning the parameterization with the intrinsic CTMC decomposition into jump timing and jump direction. We propose Neural CTMC, which exploits the underlying Poisson structure of CTMC dynamics by separately parameterizing the reverse process through an exit rate (when to jump) and a jump distribution (where to jump) via two dedicated network heads. We show that the evidence lower bound (ELBO) reduces to a path-space KL divergence between the true and learned reverse processes that factorizes into a Poisson KL for timing and a categorical KL for direction, and admits a tractable, gradient-equivalent and consistent loss. Experimentally, scored by Gemma2-9B, our pure-uniform Neural CTMC achieves 16.36 generative perplexity on TinyStories (vs. GIDD 37.60 and MDLM 42.66). On OpenWebText, it attains the best perplexity at the same training-token budget across 16-128 sampling steps among the methods we compare (e.g., at 128 steps: Neural CTMC 183.6 vs. MDLM 210.5 and GIDD 249.8). To facilitate reproducibility, we release our pretrained weights at https://huggingface.co/Jiangxy1117/Neural-CTMC.
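The timing/direction split can be sketched in a few lines. The two KL terms below are the standard closed forms for Poisson rates and categorical distributions. The sampling step is a generic small-time-step simulation of a jump process, not the paper's sampler, and in the actual model the exit rate and jump distribution would come from the two network heads rather than the plain numbers assumed here.

```python
import math
import random

def poisson_kl(true_rate, learned_rate):
    """KL between Poisson rates (timing term): lam * log(lam / mu) - lam + mu."""
    if true_rate == 0.0:
        return learned_rate
    return true_rate * math.log(true_rate / learned_rate) - true_rate + learned_rate

def categorical_kl(p, q):
    """KL(p || q) between two jump-direction distributions (direction term)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def reverse_step(state, exit_rate, jump_probs, dt, rng):
    """One reverse step: decide WHEN (exit rate), then WHERE (jump distribution)."""
    p_jump = 1.0 - math.exp(-exit_rate * dt)  # probability of at least one jump in dt
    if rng.random() < p_jump:
        return rng.choices(range(len(jump_probs)), weights=jump_probs)[0]
    return state

rng = random.Random(0)
stay = reverse_step(1, exit_rate=0.0, jump_probs=[0.2, 0.3, 0.5], dt=0.1, rng=rng)
jump = reverse_step(1, exit_rate=1e9, jump_probs=[0.0, 0.0, 1.0], dt=1.0, rng=rng)
```

Keeping the two quantities separate means each can be trained against its own KL term, which is the factorization the abstract describes.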

2604.14786 2026-05-11 cs.AI

CogEvolution: A Human-like Generative Educational Agent to Simulate Student's Cognitive Evolution

CogEvolution: A Human-like Generative Educational Agent That Simulates Students' Cognitive Evolution

Wei Zhang, Yihang Cheng, Zhirong Ye, Kezhen Huang

Affiliations * Faculty of Artificial Intelligence in Education, Central China Normal University, Wuhan, Hubei, China; National Engineering Research Center for Educational Big Data, Central China Normal University, Wuhan, Hubei, China; Central China Normal University Wollongong Joint Institute, Central China Normal University, Wuhan, Hubei, China

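AI summary This paper proposes CogEvolution, which simulates students' cognitive evolution by combining a cognitive depth perceptron, an IRT-based memory retrieval method, and a dynamic cognitive update mechanism based on evolutionary algorithms, improving the interpretability of educational agents.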

Comments none

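Details
AI-generated abstract (translated from Chinese)

Generative agents, with their ability to model and simulate human behavior precisely, have become a key tool in Artificial Intelligence in Education (AIEd) for studying learners' complex cognitive processes. However, existing educational agents mostly rely on static personas to simulate student learning behavior, overlooking the decisive role that deep cognitive capabilities play in learning outcomes, and they struggle to capture the dynamic fluidity of knowledge internalization, transfer, and cognitive state transitions. To overcome this bottleneck, this paper proposes CogEvolution, a human-like educational agent that simulates students' cognitive evolution. First, a cognitive depth perceptron is built on the Interactive, Constructive, Active, Passive (ICAP) taxonomy from cognitive psychology to quantify learners' cognitive engagement precisely. Next, a memory retrieval method based on Item Response Theory (IRT) simulates how new knowledge connects to and is assimilated with prior knowledge. Finally, a dynamic cognitive update mechanism based on evolutionary algorithms simulates the real-time integration of student learning behavior with the cognitive evolution process. Comprehensive evaluations show that CogEvolution not only significantly outperforms baseline models in behavioral fidelity and learning-curve fitting, but also uniquely reproduces plausible and robust cognitive evolution paths consistent with the expectations of educational psychology, offering a new paradigm for building highly interpretable educational agents.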

English abstract

Generative Agents, owing to their precise modeling and simulation capabilities of human behavior, have become a pivotal tool in the field of Artificial Intelligence in Education (AIEd) for uncovering complex cognitive processes of learners. However, existing educational agents predominantly rely on static personas to simulate student learning behaviors, neglecting the decisive role of deep cognitive capabilities in learning outcomes during practice interactions. Furthermore, they struggle to characterize the dynamic fluidity of knowledge internalization, transfer, and cognitive state transitions. To overcome this bottleneck, this paper proposes a human-like educational agent capable of simulating student cognitive evolution: CogEvolution. Specifically, we first construct a cognitive depth perceptron based on the Interactive, Constructive, Active, Passive (ICAP) taxonomy from cognitive psychology, achieving precise quantification of learner cognitive engagement. Subsequently, we propose a memory retrieval method based on Item Response Theory (IRT) to simulate the connection and assimilation of new and prior knowledge. Finally, we design a dynamic cognitive update mechanism based on evolutionary algorithms to simulate the real-time integration of student learning behaviors and cognitive evolution processes. Comprehensive evaluations demonstrate that CogEvolution not only significantly outperforms baseline models in behavioral fidelity and learning curve fitting but also uniquely reproduces plausible and robust cognitive evolutionary paths consistent with educational psychology expectations, providing a novel paradigm for constructing highly interpretable educational agents.
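
For readers unfamiliar with Item Response Theory, a minimal sketch of the standard two-parameter logistic (2PL) item response function follows. The abstract does not say which IRT model CogEvolution uses or how it enters memory retrieval, so the choice of 2PL and the names `theta` (learner ability), `a` (item discrimination), and `b` (item difficulty) are assumptions made for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL IRT: probability that a learner with ability theta answers
    an item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A learner whose ability equals the item difficulty succeeds half the time.
p_matched = p_correct(theta=0.0, a=1.0, b=0.0)

# Higher ability on the same item gives a higher success probability.
p_stronger = p_correct(theta=1.0, a=1.0, b=0.0)
```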

2604.13010 2026-05-11 cs.LG cs.AI

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Lightning OPD: Efficient Offline On-Policy Distillation for Large Reasoning Models

Yecheng Wu, Song Han, Hai Cai

Affiliations * NVIDIA

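AI summary This paper proposes Lightning OPD, which enforces teacher consistency to eliminate the need for a live teacher server and enable efficient offline on-policy distillation. Experiments on math reasoning and code generation show performance comparable to standard OPD with 4x higher training efficiency.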

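Details
AI-generated abstract (translated from Chinese)

On-policy distillation (OPD) is an effective post-training paradigm for large language models, but it requires a teacher server running throughout training, which incurs substantial infrastructure overhead. We examine whether OPD can be performed offline by precomputing teacher log-probabilities over SFT rollouts and reusing them. We find that doing so naively does not reliably match standard OPD, and trace the cause to a previously overlooked condition, teacher consistency, which requires supervised fine-tuning and OPD to use the same teacher. Violating it introduces a gradient bias that degrades both offline and online OPD. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency and removes the need for a live teacher server entirely. We prove that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. In math reasoning and code generation experiments, Lightning OPD matches standard OPD in performance while improving training efficiency by 4x. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 within 30 GPU hours. It also extends to MoE architectures, training Qwen3-30B-A3B to 71.0% on AIME 2024 on a single 8xH100 node, which substantially lowers the barrier to academic research on LLM post-training. The code is released at https://github.com/jet-ai-projects/Lightning-OPD.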

English abstract

On-policy distillation (OPD) is an effective post-training paradigm for large language models but requires a live teacher server throughout training, resulting in substantial infrastructure overhead. We investigate whether OPD can be performed offline by precomputing teacher log-probabilities once over SFT rollouts and reusing them during training. We find that naively doing so fails to reliably match standard OPD, and trace the root cause to a previously overlooked condition we term teacher consistency, requiring that the same teacher be used for both supervised fine-tuning and OPD. Violating this condition introduces a gradient bias that degrades performance for both offline and online OPD. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency and eliminates the need for a live teacher server entirely. We prove that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Experiments on math reasoning and code generation show that Lightning OPD achieves comparable performance to standard OPD while delivering 4.0x higher training efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours. Lightning OPD further scales to MoE architectures, training Qwen3-30B-A3B to 71.0% on AIME 2024 on a single 8xH100 node, substantially lowering the barrier for academic research on LLM post-training. Our code is released at https://github.com/jet-ai-projects/Lightning-OPD.
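
The idea of computing teacher log-probabilities once and reusing them can be sketched as a lookup table keyed by rollout. This is a rough illustration, not the Lightning OPD implementation: the cache layout, the rollout id, the probabilities, and the per-token "student minus teacher" log-probability gap used as the training signal are all assumptions; the abstract does not specify the objective.

```python
import math

# Hypothetical cache, filled once before training: rollout id -> teacher
# log-probability of each sampled token in that rollout.
teacher_cache = {
    "rollout-0": [math.log(0.50), math.log(0.25), math.log(0.80)],
}

def distillation_gap(student_logps, teacher_logps):
    """Mean per-token (student - teacher) log-probability on one rollout."""
    if len(student_logps) != len(teacher_logps):
        raise ValueError("student and teacher token counts differ")
    gaps = [s - t for s, t in zip(student_logps, teacher_logps)]
    return sum(gaps) / len(gaps)

# During training only the student is evaluated; the teacher side is a lookup,
# so no live teacher server is involved in this step.
student_logps = [math.log(0.40), math.log(0.25), math.log(0.90)]
gap = distillation_gap(student_logps, teacher_cache["rollout-0"])
```

The design point the sketch tries to convey is that the teacher's contribution is a fixed array per rollout, so the cost of querying the teacher is paid once rather than at every training step.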

2604.11995 2026-05-11 cs.LG

Loss-Driven Bayesian Active Learning

Loss-Driven Bayesian Active Learning

Zhuoyue Huang, Freddie Bickford Smith, Tom Rainforth

Affiliations * University of Oxford

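AI summary This paper proposes a loss-driven approach to Bayesian active learning that turns the loss function into a data-acquisition objective to improve downstream predictive performance. Experiments show that it lowers test loss on regression and classification tasks.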

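Details
AI-generated abstract (translated from Chinese)

The central goal of active learning is to acquire data that maximises downstream predictive performance, but existing methods offer limited flexibility for tailoring data acquisition. This paper proposes a rigorous loss-driven approach to Bayesian active learning that lets data acquisition directly target the loss of a given decision problem. In particular, we show how any loss function can be used to derive a unique objective for optimal data acquisition. Crucially, any loss that takes the form of a weighted Bregman divergence permits analytic computation of a central component of its corresponding objective, which makes the approach practical. In regression and classification experiments with different loss functions, our approach reduces test loss relative to existing techniques.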

English abstract

The central goal of active learning is to gather data that maximises downstream predictive performance, but popular approaches have limited flexibility in customising this data acquisition to different downstream problems and losses. We propose a rigorous loss-driven approach to Bayesian active learning that allows data acquisition to directly target the loss associated with a given decision problem. In particular, we show how any loss can be used to derive a unique objective for optimal data acquisition. Critically, we then show that any loss taking the form of a weighted Bregman divergence permits analytic computation of a central component of its corresponding objective, making the approach applicable in practice. In regression and classification experiments with a range of different losses, we find our approach reduces test losses relative to existing techniques.
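
As background for the Bregman-divergence condition, the sketch below evaluates the standard definition D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q> for two familiar generators. It shows only the unweighted divergence and does not reproduce the paper's acquisition objective; the helper names and the example vectors are hypothetical.

```python
import math

def bregman(phi, grad_phi, p, q):
    """Bregman divergence D_phi(p, q) for a convex generator phi."""
    inner = sum(g * (pi - qi) for g, pi, qi in zip(grad_phi(q), p, q))
    return phi(p) - phi(q) - inner

# Generator 1: squared Euclidean norm, which yields squared error.
def sq_norm(v):
    return sum(x * x for x in v)

def sq_norm_grad(v):
    return [2.0 * x for x in v]

# Generator 2: negative entropy, which yields the KL divergence
# when p and q are probability vectors.
def neg_entropy(v):
    return sum(x * math.log(x) for x in v)

def neg_entropy_grad(v):
    return [math.log(x) + 1.0 for x in v]

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

squared_error = bregman(sq_norm, sq_norm_grad, p, q)
kl = bregman(neg_entropy, neg_entropy_grad, p, q)
```

Squared error (typical for regression) and KL divergence (typical for classification) both arise from one definition, which is why a condition stated in terms of Bregman divergences covers a broad family of losses.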

2604.11962 2026-05-11 cs.LG

The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

Thomas Walker, Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk

Affiliations * Rice University; Google Research; Brown University

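AI summary This paper proposes the Linear Centroids Hypothesis, which treats linear directions in a deep network's centroid spaces as features. Replacing intermediate activations with centroids improves interpretability tools and unifies dictionaries, probing, circuits, and saliency maps.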

Comments 23 pages, 17 figures

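Details
AI-generated abstract (translated from Chinese)

The Linear Representation Hypothesis (LRH) treats the features of a trained deep network (DN) as linear directions in its activation spaces, i.e., the output spaces of intermediate layers. This paper introduces the Linear Centroids Hypothesis (LCH), which instead identifies features with linear directions in the DN's centroid spaces, where any vector denotes a centroid or summary of a local affine expert that describes the DN's input-output map exactly or approximately. We show that replacing intermediate activations with centroids yields a drop-in alternative for standard interpretability tools. Empirically, this change produces sparser and more downstream-useful feature dictionaries on DINO ViTs, suppresses spurious directions on a controlled task, recovers interpretable circuits in GPT2-Large, and produces faithful gradient-based saliency maps. LCH unifies dictionaries, probing, circuits, and saliency maps into a single geometric object grounded in the network's input-output map, making interpretability mechanistic by construction rather than post hoc. Code is available at https://github.com/ThomasWalker1/LinearCentroidsHypothesis.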

English abstract

The Linear Representation Hypothesis (LRH) identifies features of a trained deep network (DN) as linear directions in the activation spaces, i.e., output spaces of intermediate layers. This characterization decouples the input-output maps learned by a DN from the organization of feature directions in its activation spaces. We introduce the Linear Centroids Hypothesis (LCH), which instead identifies features with linear directions among a DN's centroid spaces -- where any vector denotes a centroid or summary of a local affine expert characterizing the learned input-output maps of the DN exactly (e.g., for piecewise-affine DNs) or approximately (e.g., for smooth DNs like transformers). We show that replacing intermediate activations with centroids yields a functional drop-in alternative for standard interpretability tools. Empirically, this change yields sparser, more downstream-useful feature dictionaries on DINO ViTs, suppresses spurious directions on a controlled task, recovers interpretable circuits in GPT2-Large, and produces faithful gradient-based saliency maps. LCH unifies dictionaries, probing, circuits, and saliency maps into a single geometric object grounded in the network's input-output map -- making interpretability mechanistic by construction rather than post hoc. Code to study the LCH is available at https://github.com/ThomasWalker1/LinearCentroidsHypothesis.
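
The phrase "local affine expert" can be made concrete for a piecewise-affine network. The sketch below takes a tiny one-hidden-layer ReLU network with made-up weights and recovers the affine map (A, c) that the network applies in the region containing a given input. It does not construct the centroids themselves, since the abstract does not define them; every name and number here is hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu_net(x, W1, b1, W2, b2):
    """One-hidden-layer ReLU network: W2 @ relu(W1 @ x + b1) + b2."""
    h = [max(0.0, z + b) for z, b in zip(matvec(W1, x), b1)]
    return [y + b for y, b in zip(matvec(W2, h), b2)]

def local_affine(x, W1, b1, W2, b2):
    """Affine map (A, c) that equals the network on x's activation region."""
    pre = [z + b for z, b in zip(matvec(W1, x), b1)]
    mask = [1.0 if z > 0 else 0.0 for z in pre]  # which hidden units are active
    hidden = range(len(mask))
    # A = W2 @ diag(mask) @ W1 and c = W2 @ diag(mask) @ b1 + b2
    A = [[sum(W2[i][k] * mask[k] * W1[k][j] for k in hidden)
          for j in range(len(x))] for i in range(len(W2))]
    c = [sum(W2[i][k] * mask[k] * b1[k] for k in hidden) + b2[i]
         for i in range(len(W2))]
    return A, c

# Made-up weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.0]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, 2.0, -1.0]]
b2 = [0.1]

x = [1.0, 1.0]
A, c = local_affine(x, W1, b1, W2, b2)
y_net = relu_net(x, W1, b1, W2, b2)
y_affine = [a + ci for a, ci in zip(matvec(A, x), c)]
```

At this input the network output and the affine map's output coincide up to floating-point error, which is the exact case the abstract mentions for piecewise-affine networks.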