arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.14269 2026-05-15 cs.CV cs.AI

PhyMotion: Structured 3D Motion Reward for Physics-Grounded Human Video Generation

Yidong Huang, Zun Wang, Han Lin, Dong-Ki Kim, Shayegan Omidshafiei, Jaehong Yoon, Jaemin Cho, Yue Zhang, Mohit Bansal

Affiliations: UNC Chapel Hill; FieldAI; NTU Singapore; AI2; Johns Hopkins University

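AI summary: Generating realistic human motion is one of the core challenges in video generation. To address the inability of existing reward signals to score motion realism accurately, this paper proposes PhyMotion, a structured motion reward grounded in physics simulation. It evaluates kinematic plausibility, contact and balance consistency, and dynamic feasibility to give a fine-grained assessment of human motion quality in generated videos. Experiments show that PhyMotion tracks human judgments more closely than existing methods and significantly improves motion realism and generation quality in RL-based post-training.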

Comments First two authors contributed equally, website: https://phy-motion.github.io/

English abstract

Generating realistic human motion is a central yet unsolved challenge in video generation. While reinforcement learning (RL)-based post-training has driven recent gains in general video quality, extending it to human motion remains bottlenecked by a reward signal that cannot reliably score motion realism. Existing video rewards primarily rely on 2D perceptual signals, without explicitly modeling the 3D body state, contact, and dynamics underlying articulated human motion, and often assign high scores to videos with floating bodies or physically implausible movements. To address this, we propose PhyMotion, a structured, fine-grained motion reward that grounds recovered 3D human trajectories in a physics simulator and evaluates motion quality along multiple dimensions of physical feasibility. Concretely, we recover SMPL body meshes from generated videos, retarget them onto a humanoid in the MuJoCo physics simulator, and evaluate the resulting motion along three axes: kinematic plausibility, contact and balance consistency, and dynamic feasibility. Each component provides a continuous and interpretable signal tied to a specific aspect of motion quality, allowing the reward to capture which aspects of motion are physically correct or violated. Experiments show that PhyMotion achieves stronger correlation with human judgments than existing reward formulations. These gains carry over to RL-based post-training, where optimizing PhyMotion leads to larger and more consistent improvements than optimizing existing rewards, improving motion realism across both autoregressive and bidirectional video generators under both automatic metrics and blind human evaluation (+68 Elo gain). Ablations show that the three axes provide complementary supervision signals, while the reward preserves overall video generation quality with only modest training overhead.

2605.14267 2026-05-15 cs.CV cs.AI

Image Restoration via Diffusion Models with Dynamic Resolution

Yang Zheng, Wen Li, Zhaoqiang Liu

Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China

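AI summary: This work targets the high computational cost of diffusion models in image restoration and proposes a restoration approach based on dynamic-resolution diffusion models. Projecting data into lower-dimensional subspaces reduces the computational burden. Building on existing pixel-space methods, the authors derive SubDPS and SubDAPS, and an enhanced variant, SubDAPS++, further improves restoration efficiency and quality. Experiments show that the approach outperforms existing diffusion-based image restoration methods across multiple datasets and tasks.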

Comments Accepted by ICML 2026

English abstract

Diffusion models (DMs) have exhibited remarkable efficacy in various image restoration tasks. However, existing approaches typically operate within the high-dimensional pixel space, resulting in high computational overhead. While methods based on latent DMs seek to alleviate this issue by utilizing the compressed latent space of a variational autoencoder, they require repeated encoder-decoder inference. This introduces significant additional computational burdens, often resulting in runtime performance that is even inferior to that of their pixel-space counterparts. To mitigate the computational inefficiency, this work proposes projecting data into lower-dimensional subspaces using dynamic resolution DMs to accelerate the inference process. We first fine-tune pre-trained DMs for dynamic resolution priors and adapt DPS and DAPS, which are two widely used pixel-space methods for general image restoration tasks, into the proposed framework, yielding methods we refer to as SubDPS and SubDAPS, respectively. Given the favorable inference speed and reconstruction fidelity of SubDAPS, we introduce an enhanced variant termed SubDAPS++ to further boost both reconstruction efficiency and quality. Empirical evaluations across diverse image datasets and various restoration tasks demonstrate that the proposed methods outperform recent DM-based approaches in the majority of experimental scenarios. The code is available at https://github.com/StarNextDay/SubDAPS.git.

2605.14266 2026-05-15 cs.AI cs.CY

Agentic AI Ecosystems in Higher Education: A Perspective on AI Agents to Emerging Inclusive, Agentic Multi-Agent AI Framework for Learning, Teaching and Institutional Intelligence

Vidya K Sudarshan, Anushka Sisodia, Reshma A Ramachandra, Sia Batra, Josephine Chong Leng Leng

Affiliations: College of Computing and Data Science, Nanyang Technological University (NTU), Singapore; DeepMed Ptd Ltd, India

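AI summary: This paper explores the prospects for AI agents in higher education and proposes building an integrated multi-agent AI framework to support the coordinated operation of teaching, learning, and institutional management. Current AI tools are mostly single-task and poorly integrated, so they struggle to meet the complex needs of the educational ecosystem. Through a literature analysis, the paper identifies gaps in cross-functional integration and inclusive design in existing research and stresses the importance of coordinated, adaptive multi-agent systems for equitable and inclusive education.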

Comments 50 pages, 14 figures, 3 tables

English abstract

Integration of artificial intelligence (AI) agents in higher education is transforming teaching, learning and administrative processes. Although existing AI agents effectively support individual tasks, their implementation remains fragmented and inefficient for handling the complexity of educational institutions. This highlights a significant research gap: the lack of an integrated, ecosystem-level agentic multi-agent AI platform capable of coordinated planning, reasoning, and adaptive decision-making across multiple educational functions. This paper presents a forward-looking perspective on an agentic multi-agent AI platform in higher education, consisting of interconnected, autonomous, goal-driven agents that support learning, teaching, and institutional operations. It addresses timely and critical questions: Can agentic AI represent the next generation of intelligent systems in tertiary education? Can such agents collectively support seamless coordinated operations across teaching, learning and administrative support? To what extent can such systems foster inclusive and equitable learning for diverse learners with special educational needs? To ground this perspective, a thematic analysis of existing literature identifies four dominant themes: task-specific fragmented AI tools, the transition from single-agent to multi-agent systems, limited cross-functional integration, and insufficient focus on inclusivity and accessibility. Findings reveal a clear gap between current AI implementations and the needs of a holistic, learner-centered educational ecosystem. The paper synthesizes challenges and outlines future research directions for a scalable, human-aligned, and inclusive agentic AI platform. The significant contribution is the incorporation of inclusive learning perspectives, highlighting how a coordinated agentic multi-agent platform can support diverse learners through adaptive, multimodal interventions.

2605.14262 2026-05-15 cs.RO cs.HC

Distill: Uncovering the True Intent behind Human-Robot Communication

Ting Li, David Porfirio

Affiliations: Computer Science, George Mason University

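AI summary: As robots become increasingly integrated into daily life, intuitive communication methods such as natural language and end-user programming have become important means of specifying autonomous robot behavior. However, these methods struggle to capture users' true intent accurately. The paper proposes a communication approach called Distill that refines a user's initial task specification by removing redundant steps, generalizing the meaning of individual steps, and relaxing ordering constraints between steps, so that the user's real needs are understood more accurately.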

Comments 17 pages

English abstract

As robots become increasingly integrated into everyday environments, intuitive communication paradigms such as natural language and end-user programming have become indispensable for specifying autonomous robot behavior. However, these mechanisms are ineffective at fully capturing user intent: natural language is imprecise and ambiguous, whereas end-user programming can be overly specific. As a result, understanding what users truly mean when they interact with robots remains a central challenge for human-AI communication systems. To address this issue, we propose the Distill approach for human-robot communication interfaces. Given a task specification provided by the user, Distill (1) removes unnecessary steps; (2) generalizes the meaning behind individual steps; and (3) relaxes ordering constraints between steps. We implemented Distill on a web interface and, through a crowdsourcing study, demonstrated its ability to elicit and refine user intent from initial task specifications.

2605.14261 2026-05-15 cs.AI cs.GT

Heuristic Pathologies and Further Variance Reduction via Uncertainty Propagation in the AIVAT Family of Techniques

Juho Kim, Tuomas Sandholm

Affiliations: CMU; Strategic Machine, Inc.; Strategy Robot, Inc.; Optimized Markets, Inc.

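AI summary: This paper studies how to evaluate an agent's performance in multiagent environments when samples are limited or trials are costly, the setting that the AIVAT family of variance reduction techniques addresses. It points out that there is little guidance on choosing AIVAT's heuristic value function or handling its uncertainty, exposes pathologies that arise when gradient descent is applied to it, and argues that the heuristic should be fixed before the evaluation data is observed. The authors also show how to propagate heuristic uncertainty to reduce variance further, possibly at the cost of unbiasedness. Experiments on a poker dataset show a reduction in the number of samples needed to reach statistical conclusions.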

English abstract

How should an agent's performance in a multiagent environment be evaluated when there is a limited sample size or a high cost of running a trial? The AIVAT family of variance reduction techniques was proposed to address this challenge by introducing unbiased low-variance estimators of agents' expected payoffs. An important component of AIVAT is a heuristic value function that discriminates between potentially low- and high-value counterfactual histories. A notable gap in the literature is that there is little to no constraint or guideline on how the heuristic value function should be chosen or how uncertainty in its output should be handled. In our first contribution, we parameterize the heuristic value function to highlight AIVAT's potential vulnerabilities: a) the sample variance can be set pathologically low by directly applying gradient descent on the sample variance, and b) one can p-hack to draw a desired statistical conclusion via gradient descent/ascent on the test statistic. The main takeaway is that the heuristic value function should be fixed prior to observing the evaluation data! In our second contribution, we show how the heuristic uncertainty can be propagated to quantify the uncertainty of AIVAT estimates. It is then possible to further reduce the variance using inverse-variance weighted averaging, but AIVAT's unbiasedness guarantee may have to be sacrificed. In our experiments, we use a dataset of 10,000 poker hands to demonstrate our heuristic pathology and uncertainty results, with the latter yielding a 43.0% reduction in the number of samples (poker hands) needed to draw statistical conclusions.
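
The inverse-variance weighted averaging mentioned in the abstract can be illustrated with the textbook formula. This is a minimal sketch, not the authors' estimator: the function name and the sample numbers are made up for illustration, and the inputs are assumed to be independent estimates with known positive variances.

```python
from typing import Sequence, Tuple


def inverse_variance_average(
    estimates: Sequence[float], variances: Sequence[float]
) -> Tuple[float, float]:
    """Combine independent estimates, weighting each by 1 / variance.

    Returns the combined estimate and its variance, 1 / sum(1 / var_i).
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need equally many estimates and variances")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    combined = sum(w * x for w, x in zip(weights, estimates)) / total_weight
    return combined, 1.0 / total_weight


# Illustrative numbers only: the lower-variance estimate dominates.
combined, combined_var = inverse_variance_average([1.0, 3.0], [1.0, 3.0])
```

The combined variance is never larger than the smallest input variance, which is the source of the extra variance reduction. As the abstract notes, the unbiasedness guarantee may have to be sacrificed in exchange.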

2605.14258 2026-05-15 cs.LG cs.AI

Dynamics of the Transformer Residual Stream: Coupling Spectral Geometry to Network Topology

Jesseba Fernando, Grigori Guitchounts

Affiliations: Network Science Institute, Northeastern University; Flagship Pioneering

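AI summary: This paper studies the dynamics of the residual stream in large language models and reveals a coupling between spectral geometry and network topology that emerges during training. Using eigendecomposition of the full Jacobian, the authors find that training produces a monotonic spectral gradient through model depth, accompanied by dimensional collapse, and that these properties are learned rather than determined by the architecture. The study further shows that the topological position of graph communities in the network determines whether the Jacobian amplifies or suppresses perturbations to them, a relationship absent at initialization.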

English abstract

Large language models are remarkably capable, yet how computation propagates through their layers remains poorly understood. A growing line of work treats depth as discrete time and the residual stream as a dynamical system, where each layer's nonlinear update has a local linear description. However, previous analyses have relied on scalar summaries or approximate linearizations, leaving the full spectral geometry of trained LLMs unknown. We perform full Jacobian eigendecomposition across three production-scale LLMs and show that training installs a monotonic spectral gradient through depth, from non-normal, rotation-dominated early layers to near-symmetric late layers, together with a cumulative low-rank bottleneck that funnels perturbations into a small fraction of the residual stream's effective dimensions. Our experiments reveal that this gradient and the dimensional collapse are learned rather than architectural, and are largely dissolved when structured non-normality is removed. We further show that the topological positioning of graph communities predicts whether the Jacobian amplifies or suppresses them, with the sign of the coupling determined by the local operator type, a relationship absent at initialization. These results map a learned spectral geometry in LLMs that links perturbation propagation and compression to the network's functional topology.

2605.14253 2026-05-15 cs.CV cs.LG

Towards Real-Time Autonomous Navigation: Transformer-Based Catheter Tip Tracking in Fluoroscopy

Harry Robertshaw, Yanghe Hao, Weiyuan Deng, Benjamin Jackson, S. M. Hadi Sadati, Nikola Fischer, Tom Vercauteren, Alejandro Granados, Thomas C. Booth

Affiliations: Surgical & Interventional Engineering, School of Biomedical Engineering & Imaging Sciences, King's College London; School of Engineering & Materials Science, Queen Mary University of London

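AI summary: This paper aims to develop a real-time catheter tip tracking system based on fluoroscopic images to support reinforcement learning-based autonomous navigation for mechanical thrombectomy. It proposes a multi-threaded processing framework that combines deep learning segmentation models with post-processing algorithms to handle low image contrast, heavy noise, and device occlusion. Experiments show that the method outperforms existing methods in segmentation accuracy, providing a reliable and efficient basis for future autonomous navigation systems.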

Comments Harry Robertshaw and Yanghe Hao contributed equally to this work. Published in the International Journal of Computer Assisted Radiology and Surgery

Journal ref Int J CARS (2026)

English abstract

Purpose: Mechanical thrombectomy (MT) improves stroke outcomes, but is limited by a lack of local treatment access. Widespread distribution of reinforcement learning (RL)-based robotic systems can be used to alleviate this challenge through autonomous navigation, but current RL methods require live device tip coordinate tracking to function. This paper aims to develop and evaluate a real-time catheter tip tracking pipeline under fluoroscopy, addressing challenges such as low contrast, noise, and device occlusion. Methods: A multi-threaded pipeline was designed, incorporating frame reading, preprocessing, inference, and post-processing. Deep learning segmentation models, including U-Net, U-Net+Transformer, and SegFormer, were trained and benchmarked using two-class and three-class formulations. Post-processing involved two-step component filtering, one-pixel medial skeletonization, and greedy arc-length path following with contour fall-back. Results: On manually-labeled moderate complexity fluoroscopic video data, the two-class SegFormer achieved a mean absolute error of 4.44 mm, outperforming U-Net (4.60 mm), U-Net+Transformer (6.20 mm) and all three-class models (5.19-7.74 mm). On segmentation benchmarks, the system exceeded state-of-the-art CathAction results with improvements of up to +5% in Dice scores for three-segmentation. Conclusion: The results demonstrate that the proposed multi-threaded tracking framework maintains stable performance under challenging imaging conditions, outperforming prior benchmarks, while providing a reliable and efficient foundation for RL-based autonomous MT navigation.
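
The multi-threaded structure described under Methods (frame reading, preprocessing, inference, post-processing) resembles a chain of producer-consumer stages. The following is a generic sketch of that pattern using only the Python standard library, not the authors' implementation: `run_pipeline`, the queue size, and the stand-in stage functions are all hypothetical, and error handling is omitted.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_STOP = object()  # sentinel that tells each stage to shut down


def run_pipeline(frames: Iterable[Any], stages: List[Callable[[Any], Any]]) -> List[Any]:
    """Run each stage in its own thread, connected by bounded queues.

    One thread per stage keeps the frames in order. A stage that raises
    would stall the pipeline; real code needs error handling.
    """
    queues = [queue.Queue(maxsize=8) for _ in range(len(stages) + 1)]

    def worker(stage: Callable[[Any], Any], q_in: queue.Queue, q_out: queue.Queue) -> None:
        while True:
            item = q_in.get()
            if item is _STOP:
                q_out.put(_STOP)  # pass the shutdown signal downstream
                return
            q_out.put(stage(item))

    def feed() -> None:
        for frame in frames:
            queues[0].put(frame)
        queues[0].put(_STOP)

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]), daemon=True)
        for i, stage in enumerate(stages)
    ]
    threads.append(threading.Thread(target=feed, daemon=True))
    for t in threads:
        t.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Stand-in stages on integers; real stages would preprocess, segment, and
# post-process image frames.
outputs = run_pipeline(range(5), [lambda f: f + 1, lambda f: f * 2])
```

Bounded queues apply back-pressure, so a slow inference stage throttles frame reading instead of letting memory grow without limit.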

2605.14252 2026-05-15 cs.LG cs.AI

Not All Timesteps Matter Equally: Selective Alignment Knowledge Distillation for Spiking Neural Networks

Kai Sun, Peibo Duan, Yongsheng Huang, Guowei Zhang, Benjamin Smith, Nanxu Gong, Levin Kuhlmann

Affiliations: Faculty of Information Technology, Monash University, Australia; School of Software, Northeastern University, China; Department of Medicine, National University of Singapore, Singapore

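AI summary: This paper addresses the performance gap between spiking neural networks (SNNs) and artificial neural networks (ANNs) and proposes a new knowledge distillation method, Selective Alignment Knowledge Distillation (SeAl-KD). The method drops the conventional assumption that all timesteps should be aligned uniformly: it identifies erroneous timesteps and corrects them selectively while preserving useful temporal dynamics, improving SNN performance more effectively. Experiments show that it outperforms existing distillation methods on both static image and neuromorphic event datasets.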

English abstract

Spiking neural networks (SNNs), which are brain-inspired and spike-driven, achieve high energy efficiency. However, a performance gap between SNNs and artificial neural networks (ANNs) still remains. Knowledge distillation (KD) is commonly adopted to improve SNN performance, but existing methods typically enforce uniform alignment across all timesteps, either from a teacher network or through inter-temporal self-distillation, implicitly assuming that per-timestep predictions should be treated equally. In practice, SNN predictions vary and evolve over time, and intermediate timesteps need not all be individually correct even when the final aggregated output is correct. Under such conditions, effective distillation should not force every timestep toward the same supervision target, but instead provide corrective guidance to erroneous timesteps while preserving useful temporal dynamics. To address this issue, we propose Selective Alignment Knowledge Distillation (SeAl-KD), which selectively aligns class-level and temporal knowledge by equalizing competing logits at erroneous timesteps and reweighting temporal alignment based on confidence and inter-timestep similarity. Extensive experiments on static image and neuromorphic event-based datasets demonstrate consistent improvements over existing distillation methods. The code is available at https://github.com/KaiSUN1/SeAl

2605.14251 2026-05-15 cs.CV

Generative Deep Learning for Computational Destaining and Restaining of Unregistered Digital Pathology Images

Aarushi Kulkarni, Alarice Lowe, Pratik Shah

Affiliations: Department of Computer Science, University of California, Irvine, CA, USA; Department of Pathology, Stanford University, Stanford, CA, USA

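AI summary: This study examines conditional generative adversarial network (cGAN)-based destaining and restaining of digital pathology images and evaluates the approach on unregistered whole-slide images (WSIs) from a different institution. To reduce the impact of domain shift, it proposes a preprocessing pipeline comprising histogram-based stain normalization and channel-wise intensity calibration. Results show that the method achieves reasonable stain reconstruction even without image registration, that restaining from computationally destained outputs outperforms direct staining on several metrics, and that preprocessing has a substantial effect on model performance.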

English abstract

Conditional generative adversarial networks (cGANs) have enabled high-fidelity computational staining and destaining of hematoxylin and eosin (H&E) in digital pathology whole-slide images (WSI). However, their ability to generalize to out-of-distribution WSI across institutions without retraining remains insufficiently characterized. Previously developed cGAN models trained on 102 registered prostate core biopsy WSIs from Brigham and Women's Hospital were evaluated on 82 spatially unregistered WSIs acquired at Stanford University. To mitigate domain shift without retraining, a preprocessing pipeline consisting of histogram-based stain normalization for H&E-stained WSIs and channel-wise intensity calibration for unstained WSIs was developed. Because image registration was intentionally omitted for real-world deployment conditions, the reported quantitative results are conservative lower bounds reflecting both model performance and limited spatial alignment. Under these conditions, virtual destaining achieved a Pearson correlation coefficient (PCC) of 0.854, structural similarity index measure (SSIM) of 0.699, and peak signal-to-noise ratio (PSNR) of 18.41 dB. H&E restaining from computationally destained outputs outperformed direct staining from ground-truth unstained inputs across all metrics (PCC: 0.798 vs. 0.715; SSIM: 0.756 vs. 0.718; PSNR: 20.08 vs. 18.51 dB), suggesting that preprocessing quality may be more limiting than model capacity. Qualitative pathological review indicated preservation of benign glandular structures while showing that malignant glands were often rendered with vessel-like morphologies. These findings support the feasibility of applying cGAN-based computational H&E staining and destaining generative models to external WSI datasets using preprocessing-based adaptation alone while defining specific morphological targets for future domain adaptation.
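
Two of the metrics reported above, PCC and PSNR, have standard definitions that can be written out directly. This is a minimal sketch over flat lists of pixel intensities, assuming 8-bit images (peak value 255). The function names and toy inputs are illustrative and do not come from the paper, and SSIM is omitted because it needs windowed statistics.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def psnr(reference: Sequence[float], test: Sequence[float], peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy intensities, not data from the study.
pcc_value = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
psnr_value = psnr([0.0, 0.0], [255.0, 255.0])
```

Both metrics compare pixels at the same coordinates, which is why the abstract describes the unregistered results as conservative lower bounds.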

2605.14249 2026-05-15 cs.LG

EnergyLens: Predictive Energy-Aware Exploration for Multi-GPU LLM Inference Optimization

Zhiye Song, Kyungmi Lee, Eun Kyung Lee, Xin Zhang, Tamar Eilam, Anantha P. Chandrakasan

Affiliations: Massachusetts Institute of Technology; IBM Research

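AI summary: This paper presents EnergyLens, an end-to-end framework for energy-aware large language model (LLM) inference optimization. It captures model properties such as fusion, parallelism, and compute-communication overlap through an intuitive einsum interface and combines them with load-imbalance-aware MoE modeling and a multi-GPU communication energy model to predict and optimize energy use in multi-GPU settings. Experiments show that EnergyLens achieves good energy prediction accuracy across several models and configurations and reveals significant energy-efficiency differences between configurations, offering guidance for distributed inference optimization.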

English abstract

We present EnergyLens, an end-to-end framework for energy-aware large language model (LLM) inference optimization. As LLMs scale, predicting and reducing their energy footprint has become critical for sustainability and datacenter operations, yet existing approaches either require production-level code and expensive profiling or fail to accurately capture multi-GPU energy behavior. As a result, practitioners lack tools for deciding which optimizations to prioritize and for selecting among existing deployment configurations when exhaustive profiling is impractical. EnergyLens addresses this gap with an intuitive einsum-based interface that captures LLM specifications including fusion, parallelism, and compute-communication overlap, combined with load-imbalance-aware MoE modeling and an empirically driven communication energy model for multi-GPU settings. We validate EnergyLens on Llama3 and Qwen3-MoE across tensor-parallel and expert-parallel configurations, achieving mean absolute percentage errors (MAPEs) between 9.25% and 13.19% for multi-GPU prefill and decode energy, and 12.97% across SM allocations for Megatron-style overlap. Our energy-driven exploration reveals up to 1.47x and 52.9x energy variation across configurations in prefill and decode efficiency and motivates distributed serving. We further show that compute-communication overlap is difficult to optimize with intuition alone, but EnergyLens correctly identifies Pareto-optimal overlap configurations.
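
The accuracy figures above are mean absolute percentage errors (MAPEs). For reference, the standard definition can be sketched as follows. The function name and the energy values are made up for illustration and are not measurements from the paper.

```python
from typing import Sequence


def mape(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Assumes every measured value is nonzero.
    """
    if not measured or len(measured) != len(predicted):
        raise ValueError("need equally many measured and predicted values")
    total = sum(abs((m - p) / m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


# Hypothetical energy readings in joules: each prediction is off by 10%.
error_percent = mape([100.0, 200.0], [90.0, 220.0])
```

Because each error is normalized by the measured value, small and large configurations contribute equally to the average.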

2605.14246 2026-05-15 cs.LG cs.AI cs.SY eess.SY

Action-Conditioned Risk Gating for Safety-Critical Control under Partial Observability

Yushen Liu, Yin-Jen Chen, Ziyi Chen, Tao Wang, Heng Huang, Xugui Zhou, Yanfu Zhang

Affiliations: University of Virginia; Google; University of Maryland, College Park; Stanford University; Louisiana State University; College of William and Mary

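AI summary: This study addresses safety-critical control under partial observability and proposes a reinforcement learning method based on action-conditioned risk gating to balance task performance against safety risk when observations are incomplete. The method builds a compact finite-history proxy state and learns an action-conditioned predictor of near-term safety violations. The predicted risk serves as a risk penalty during value learning and as a risk gate at decision time, improving control performance while maintaining safety. Experiments show a better reward-cost balance and runtime efficiency than conventional methods on tasks such as glucose regulation and safe navigation.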

English abstract

Many safety-critical control problems are modeled as risk-sensitive partially observable Markov decision processes, where the controller must make decisions from incomplete observations while balancing task performance against safety risk. Although belief-space planning provides a principled solution, maintaining and planning over beliefs can be computationally costly and sensitive to model specification in practical domains. We propose a lightweight risk-gated reinforcement learning approximation for risk-sensitive control under partial observability. The method constructs a compact finite-history proxy state and learns an action-conditioned predictor of near-term safety violation. This predicted candidate-action risk is used in two complementary ways: as a risk penalty during value learning, and as a decision-time gate that interpolates between optimistic and conservative ensemble value estimates. As a result, low-risk actions are evaluated closer to reward-seeking estimates, while high-risk actions are evaluated more conservatively. We evaluate the approach in two safety-critical partially observable domains: automated glucose regulation and safety-constrained navigation. Across adult and adolescent glucose-control cohorts, the method improves overall glycemic tradeoffs and substantially reduces runtime relative to a belief-space planning baseline. On Safety-Gym navigation benchmarks, it achieves a more favorable reward-cost balance than unconstrained RL and several standard safe-RL baselines. These results suggest that action-conditioned near-term risk can provide an effective local signal for approximate risk-sensitive POMDP control when full belief-space planning is impractical.
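
The decision-time gate described above interpolates between optimistic and conservative ensemble value estimates according to predicted risk. One way this could look is sketched below. The details are assumptions rather than the paper's formulation: the ensemble maximum stands in for the optimistic estimate, the minimum for the conservative one, the gate is linear in a risk probability clipped to [0, 1], and `gated_value` is a hypothetical name.

```python
from typing import Sequence


def gated_value(ensemble_q: Sequence[float], risk: float) -> float:
    """Blend optimistic and conservative value estimates for one action.

    ensemble_q: value estimates for the action from each ensemble member.
    risk: predicted probability of a near-term safety violation.
    """
    optimistic = max(ensemble_q)    # assumption: best-case member
    conservative = min(ensemble_q)  # assumption: worst-case member
    gate = min(1.0, max(0.0, risk))  # clip to a valid interpolation weight
    return (1.0 - gate) * optimistic + gate * conservative


# Low-risk actions are scored near the optimistic estimate, high-risk
# actions near the conservative one.
safe_score = gated_value([1.0, 2.0, 3.0], risk=0.0)
risky_score = gated_value([1.0, 2.0, 3.0], risk=1.0)
```

An agent would compute this score for each candidate action and pick the highest. The gate penalizes risky actions only to the extent that the ensemble members disagree about them.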

2605.14242 2026-05-15 cs.LG cs.AI

Artificial Intelligence-Assistant Cardiotocography: Unified Model for Signal Reconstruction, Fetal Heart Rate Analysis, and Variability Assessment

Xiaohua Wang, Kai Yu, XuXiao Liang, Liang Wang, Chao Han

Affiliations: Artificial Intelligence Research Center, Bengbu Medical University; CHARMMIRAEL Biotech Co., Ltd

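AI summary: This study proposes an AI-based cardiotocography (CTG) model for fetal heart rate signal reconstruction, heart rate analysis, and variability assessment. The model is pre-trained on large-scale unlabeled data and fine-tuned on expert-reviewed data, improving signal reconstruction accuracy and analysis reliability. The study introduces the Intersection Overlapping Labels (IOL) approach to validate fetal heart rate. The model shows high sensitivity and specificity in detecting critical heart rate decelerations and accelerations and achieves strong AUC scores on clinical criteria.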

English abstract

The monitoring of fetal heart rate (FHR) and the assessment of its variability are crucial for preventing fetal compromise and adverse outcomes. However, traditional methods encounter limitations arising from equipment performance, data transmission, and subjective assessments by doctors. We have developed a tailored AI-based FHrCTG model specifically for FHR monitoring, which effectively mitigates noise interference and precisely reconstructs signals. Our model was pre-trained on a massive dataset consisting of 558,412 unlabeled data points and further refined using 7,266 expert-reviewed entries. To validate FHR, we introduced the Intersection Overlapping Labels (IOL) approach, which transforms rate analysis into categorical judgments. Testing revealed that our model demonstrates high sensitivity and specificity in detecting critical FHR decelerations (89.13% and 87.78%, respectively) and accelerations (62.5% and 92.04%, respectively). Furthermore, based on Fischer's criteria for clinical application, our model achieved impressive AUC scores of 0.7214 and 0.9643 for verifying FHR periodicity and amplitude variation, respectively.
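
Once rate analysis is cast as categorical judgments, sensitivity and specificity follow from the confusion counts in the usual way. The sketch below uses the standard definitions with made-up counts; it does not reproduce the paper's IOL procedure or its data.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of true events that were detected.
    specificity = TN / (TN + FP): share of non-events correctly rejected.
    """
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts for one event type, such as decelerations.
sens, spec = sensitivity_specificity(tp=8, fn=2, tn=9, fp=1)
```

The two values trade off against each other, as in the acceleration results above, where specificity is much higher than sensitivity.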

2605.14240 2026-05-15 cs.LG

Paraphrasing Attack Resilience of Various AI-Generated Text Detection Methods

Andrii Shportko, Inessa Verbitsky

Affiliations: Northwestern University

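AI summary: This paper studies how robust various AI-generated text detection methods are to paraphrasing attacks, evaluating fine-tuned RoBERTa, Binoculars, and text feature analysis, along with their Random Forest ensembles. It finds that ensembles including Binoculars perform best but also degrade the most under attack, revealing a tension between performance and robustness in AI text detection and challenging current perceptions of the reliability of state-of-the-art detectors.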

Comments NAACL 2025

English abstract

The recent large-scale emergence of LLMs has left an open space for dealing with their consequences, such as plagiarism or the spread of false information on the Internet. Coupling this with the rise of AI detector bypassing tools, reliable machine-generated text detection is in increasingly high demand. We investigate the paraphrasing attack resilience of various machine-generated text detection methods, evaluating three approaches: fine-tuned RoBERTa, Binoculars, and text feature analysis, along with their ensembles using Random Forest classifiers. We discovered that Binoculars-inclusive ensembles yield the strongest results, but they also suffer the most significant losses during attacks. In this paper, we present the dichotomy of performance versus resilience in the world of AI text detection, which complicates the current perception of reliability among state-of-the-art techniques.

2605.14239 2026-05-15 cs.CV

Implicit Spatial-Frequency Fusion of Hyperspectral and LiDAR Data via Kolmogorov-Arnold Networks

Zekun Long, Judy X. Yang, Jing Wang, Ali Zia, Guanyiman Fu, Jun Zhou

Affiliations: School of Information and Communication Technology; School of Computing, Engineering and Mathematical Sciences

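AI summary: This paper studies the fusion of hyperspectral image (HSI) and LiDAR data to improve classification in complex scenes. To address the weaknesses of existing methods in modeling structural discontinuities and spectral features, the authors propose the Implicit Frequency-Geometry Fusion Network (IFGNet), built on Kolmogorov-Arnold Networks (KANs). It uses learnable spline functions to adaptively capture highly nonlinear relationships between hyperspectral and LiDAR features and introduces a LiDAR-guided implicit aggregation module in the spatial and frequency domains to strengthen geometry-aware representations. Experiments show that IFGNet significantly outperforms existing methods on several benchmark datasets, with higher classification accuracy and efficiency.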

Comments 6 pages, 1 figure, conference

English abstract

Hyperspectral image (HSI) classification is challenging in complex scenes due to spectral ambiguity, spatial heterogeneity, and the strong coupling between material properties and geometric structures. Although LiDAR provides complementary elevation information, most HSI-LiDAR fusion methods rely on CNNs or MLPs with fixed activation functions and linear weights. These methods struggle to model structural discontinuities in LiDAR data, intricate spectral features of HSI, and their interactions. In addition, fusion of the two modalities in both spatial and frequency domains with LiDAR guidance remains underexplored. To address these issues, we propose the Implicit Frequency-Geometry Fusion Network (IFGNet), which leverages Kolmogorov-Arnold Networks (KANs) with learnable spline-based functions to adaptively capture highly nonlinear relationships between hyperspectral and LiDAR features. Furthermore, IFGNet introduces a LiDAR-guided implicit aggregation module in both spatial and frequency domains, enhancing geometry-aware spatial representations while capturing global structural patterns. Experiments on the Houston 2013 and MUUFL benchmarks demonstrate that IFGNet consistently outperforms existing fusion methods in overall accuracy, average accuracy, and Cohen's Kappa, while maintaining an efficient architecture.
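
The three reported metrics (overall accuracy, average accuracy, and Cohen's Kappa) can all be computed from a confusion matrix. The sketch below uses their standard definitions on a toy two-class matrix. The function name and the counts are illustrative, and the code assumes every class has at least one true sample.

```python
from typing import List, Tuple


def classification_scores(confusion: List[List[int]]) -> Tuple[float, float, float]:
    """Return (overall accuracy, average accuracy, Cohen's kappa).

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    overall = correct / total
    # Average accuracy is the mean of the per-class recalls.
    average = sum(confusion[i][i] / row_sums[i] for i in range(k)) / k
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (overall - expected) / (1.0 - expected)
    return overall, average, kappa


# Toy confusion matrix, not results from the paper.
oa, aa, kappa = classification_scores([[5, 1], [2, 4]])
```

Average accuracy weights every class equally, so it exposes weak performance on rare land-cover classes that overall accuracy can hide.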

2605.14237 2026-05-15 cs.AI

Good to Go: The LOOP Skill Engine That Hits 99% Success and Slashes Token Usage by 99% via One-Shot Recording and Deterministic Replay

Xiaohua Wang, Kai Yu, XuXiao Liang, Liang Wang, Chao Han

Affiliations: Artificial Intelligence Research Center, Bengbu Medical University; CHARMMIRAEL Biotech Co., Ltd

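AI summary: This paper presents a system called the LOOP SKILL ENGINE, which targets the high failure rate and high compute cost of AI agents performing repetitive tasks. Through one-shot recording of a task execution and deterministic replay, the system achieves a 99% task success rate and reduces token usage by 99%. Its core idea is to convert the tool-call trajectory recorded on the first run into a parameterized, deterministic execution plan; subsequent runs replay this plan directly without calling the large language model again, sharply reducing overhead and making execution predictable.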

Comments 8 pages, 5 tables

English abstract

Deploying AI agents for repetitive periodic tasks exposes a critical tension: Large Language Models (LLMs) offer unmatched flexibility in tool orchestration, yet their inherent stochasticity causes unpredictable failures, and repeated invocations incur prohibitive token costs. We present the LOOP SKILL ENGINE, a system that achieves a combined 99% success rate and 99% token reduction for periodic agent tasks through a one-shot recording, deterministic replay paradigm. On its first run, the agent executes the task with full LLM reasoning while the system transparently intercepts and records the complete tool-call trajectory. A greedy length-descending template extraction algorithm then converts this recording into a parameterized, branch-free Loop Skill: a deterministic execution plan that captures the task's functional intent while parameterizing time-dependent and result-dependent variables. All subsequent executions bypass the LLM entirely: the engine resolves template variables against real-time values and replays the tool sequence deterministically. We prove two theorems: (1) Replay Determinism: the step sequence of a validated Loop Skill is invariant across all future executions; (2) Write Safety: concurrent access to persistent configuration is serialized through reentrant locks and atomic file replacement. Across a benchmark of periodic agent tasks spanning intervals from 5 minutes to 24 hours, the Loop Skill Engine reduces monthly token consumption by 93.3% to 99.98% and cuts execution latency by 8.7x while eliminating output non-determinism. A multi-layer degradation strategy guarantees that tasks never stall. We release the engine as part of the buddyMe open-source agent framework.
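
The record-then-replay idea can be sketched in a few lines. This is an illustration under stated assumptions, not the engine's actual algorithm or API: tool calls are modeled as `(tool, argument string)` pairs, "greedy length-descending" is interpreted as substituting the longest known variable values first, and `extract_template`, `replay`, and the `{{name}}` placeholder syntax are all hypothetical.

```python
from typing import Dict, List, Tuple

ToolCall = Tuple[str, str]  # (tool name, argument string)


def extract_template(trace: List[ToolCall], variables: Dict[str, str]) -> List[ToolCall]:
    """Replace recorded variable values with named placeholders.

    Longer values are substituted first so that a short value (a year)
    cannot split a longer one (a full date) that contains it.
    """
    ordered = sorted(variables.items(), key=lambda kv: len(kv[1]), reverse=True)
    template = []
    for tool, arg in trace:
        for name, value in ordered:
            arg = arg.replace(value, "{{" + name + "}}")
        template.append((tool, arg))
    return template


def replay(template: List[ToolCall], current: Dict[str, str]) -> List[ToolCall]:
    """Fill placeholders with current values; no LLM call is involved."""
    calls = []
    for tool, arg in template:
        for name, value in current.items():
            arg = arg.replace("{{" + name + "}}", value)
        calls.append((tool, arg))
    return calls


# First run: the recorded trajectory and the values known to vary over time.
recorded = [("fetch", "report?date=2026-05-15"), ("write", "out/2026-05-15.txt")]
skill = extract_template(recorded, {"date": "2026-05-15", "year": "2026"})

# Later run: same step sequence, new date.
next_calls = replay(skill, {"date": "2026-05-16", "year": "2026"})
```

The replayed sequence always has the same tools in the same order, and only the placeholder values change between runs.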

2605.14235 2026-05-15 cs.LG cs.MA quant-ph

Quantum Advantage in Multi Agent Reinforcement Learning

Simranjeet Singh Dahia, Claudia Szabo

Affiliations: Adelaide University

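AI summary: This paper studies the coordination advantage of quantum entanglement in multi-agent reinforcement learning and evaluates a decentralized quantum multi-agent reinforcement learning framework based on variational quantum circuits. In the CHSH game, entangled quantum agents approach the theoretical Tsirelson limit, showing a clear quantum advantage, while unentangled quantum circuits perform on par with classical methods. Experiments also show that the specific entanglement structure has a marked effect on coordination performance, and a cooperative navigation task shows performance gains for quantum agents over classical methods.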

Comments 19 pages

English abstract

We present an empirical evaluation of quantum entanglement in agent coordination within quantum multi agent reinforcement learning (QMARL). While QMARL has attracted growing interest recently, most prior work evaluates quantum policies without provable baselines, making it impossible to rigorously distinguish quantum advantage from algorithmic coincidence. We address this directly by evaluating a decentralized QMARL framework with variational quantum circuit (VQC) actors with shared entangled states. In the CHSH game, which has a mathematically proven classical performance ceiling of 0.75 win rate, we show that entangled QMARL agents approach the Tsirelson limit of 0.854, providing clear evidence of their quantum advantage. We show that unentangled quantum circuits match the classical baseline, confirming that entanglement and not the quantum circuit itself is the active coordination mechanism. We also explore the effect of specific entanglement structures, as some Bell states enable coordination gains while others actively harm performance. On cooperative navigation (CoopNav), QMARL without entanglement achieves about a 2x improvement in success rate over classical MAA2C (about 0.85 versus about 0.40), with the hybrid configuration, quantum actor paired with a classical centralised critic, outperforming both fully classical and fully quantum solutions. We present our experimental analysis and discuss future work.
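
The two CHSH reference values quoted above can be checked numerically. The sketch below is independent of the paper's code and assumes the standard game: inputs x and y are uniform random bits, and the players win when the XOR of their output bits equals x AND y. It enumerates all deterministic classical strategies, which suffices because randomized strategies are mixtures of deterministic ones, and evaluates the Tsirelson bound cos^2(pi/8).

```python
import itertools
import math


def best_classical_chsh_win_rate() -> float:
    """Brute-force the best deterministic strategy for the CHSH game."""
    best = 0.0
    # Each player maps an input bit to an output bit: 4 bits describe both.
    for a0, a1, b0, b1 in itertools.product((0, 1), repeat=4):
        alice, bob = (a0, a1), (b0, b1)
        wins = sum(
            (alice[x] ^ bob[y]) == (x & y)
            for x in (0, 1)
            for y in (0, 1)
        )
        best = max(best, wins / 4)
    return best


classical_limit = best_classical_chsh_win_rate()
tsirelson_limit = math.cos(math.pi / 8) ** 2  # best quantum win rate
```

No deterministic strategy wins all four input pairs, so the classical ceiling is 3/4, while cos^2(pi/8) is about 0.854, matching the figures in the abstract.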

2605.14232 2026-05-15 cs.RO

Reactive Planning based Control for Mobile Robots in Obstacle-Cluttered Environments

Li Tan, Junlin Xiong, Yan Wang, Wei Ren

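Affiliations * Department of Automation, University of Science and Technology of China; School of Intelligence Science and Engineering, Harbin Institute of Technology Shenzhen; School of Control Science and Engineering, Dalian University of Technology

AI summary This paper studies obstacle-avoiding motion control for mobile robots in obstacle-cluttered environments. For robots that have only partial environment information, it proposes a reactive planning based control strategy (RPCS) that builds a reference trajectory and adjusts it locally to avoid obstacles, together with an adaptive tracking control method that improves trajectory tracking. The approach combines reactive planning with adaptive control, and the results confirm its advantages and practicality.

Comments 7 pages, 7 figures

Details
English abstract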

This paper addresses the motion control problem for mobile robots in obstacle-cluttered environments. The mobile robot has partial environment information only, and aims to move from an initial position to a target position without collisions. For this purpose, a reactive planning based control strategy (RPCS) is proposed. First, the initial and target positions are connected as a reference trajectory. Then, a reactive planning strategy (RPS) is developed to ensure the collision avoidance by modifying the reference trajectory locally based on the partial environment information. Next, an adaptive tracking control strategy (ATCS) is proposed to track the reference trajectory with potentially local modifications via the discretization techniques. Finally, the RPS and ATCS are combined to establish the RPCS, whose efficacy and advantages are illustrated by numerical examples.

2605.14231 2026-05-15 cs.LG cs.AI cs.SD

AudioMosaic: Contrastive Masked Audio Representation Learning

Hanxun Huang, Qizhou Wang, Xingjun Ma, Cihang Xie, Christopher Leckie, Sarah Erfani

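Affiliations * School of Computing and Information Systems, The University of Melbourne, Australia; Baskin School of Engineering, University of California, Santa Cruz, USA; Institute of Trustworthy Embodied AI, Fudan University, China

AI summary This paper proposes AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. The method builds positive pairs through structured time-frequency masking, which lowers memory consumption and supports efficient large-batch training. Compared with generative approaches, AudioMosaic learns more discriminative utterance-level representations, transfers well across datasets, domains, and acoustic conditions, and achieves state-of-the-art performance on several standard audio benchmarks.

Comments ICML2026

Details
English abstract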

Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available in our GitHub repository: https://github.com/HanxunH/AudioMosaic

2605.14227 2026-05-15 cs.LG cs.CL

DT-Transformer: A Foundation Model for Disease Trajectory Prediction on a Real-world Health System

Yunying Zhu, Andrew R Weckstein, Kueiyu Joshua Lin, Jie Yang

Affiliations * Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School; Harvard T.H. Chan School of Public Health, Harvard University; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Broad Institute of MIT and Harvard, Cambridge, MA, USA; Harvard Data Science Initiative, Harvard University

AI summary This study proposes DT-Transformer, a foundation model for predicting disease trajectories from electronic health records (EHRs) in a real-world health system. The model is trained on 57.1M structured EHR entries from 1.7M patients in the Mass General Brigham (MGB) health system and can accurately predict the future course of many diseases. Experiments show strong predictive performance across disease categories, offering a new approach and solid support for disease prediction from real clinical data.

Comments Work in Progress

Details
English abstract

Accurate disease trajectory prediction is critical for early intervention, resource allocation, and improving long-term outcomes. While electronic health records (EHRs) provide a rich longitudinal view of patient health in clinical environments, models trained on curated research cohorts may not reflect routine deployment settings, and those trained on single-hospital datasets capture only fragments of each patient's trajectory. This highlights the importance of leveraging large, multi-hospital health systems for training and validation to better reflect real-world clinical complexity. In this work, we develop DT-Transformer, a foundation model trained on 57.1M structured EHR entries over 1.7M patients from Mass General Brigham (MGB), spanning 11 hospitals and a broad network of outpatient clinics. DT-Transformer achieves strong discrimination in both held-out and prospective validation settings. Next-event prediction achieves a median age- and sex-stratified AUC of 0.871 across 896 disease categories, with all categories exceeding AUC 0.5. These results support health system-scale training as a path toward foundation models suited to real-world clinical forecasting.

2605.14221 2026-05-15 cs.CV

Automatic Landmark-Based Segmentation of Human Subcortical Structures in MRI

Ahmed Rekik, R. Jarrett Rushmore, Sylvain Bouix, Linda Marrakchi-Kacem

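Affiliations * École de technologie supérieure (ÉTS); Boston University School of Medicine; Signal and Smart Systems Lab (L3S); National School of Engineering of Tunis, University of Tunis El Manar

AI summary This paper addresses precise segmentation of human subcortical brain structures in magnetic resonance imaging (MRI) and proposes a landmark-guided 3D brain segmentation method. Mimicking the manual segmentation protocol of the Harvard-Oxford Atlas, it uses a Global-to-Local network to detect 16 key landmarks automatically, then combines a semantic segmentation model with a landmark-driven post-processing step to split 12 coarse anatomical labels into 26 distinct structures, markedly improving the consistency and accuracy of segmentation boundaries.

Comments 7 pages, 5 figures. Accepted for presentation at the 48th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2026)

Details
English abstract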

Precise segmentation of brain structures in magnetic resonance imaging (MRI) is essential for reliable neuroimaging analysis, yet voxel-wise deep models often yield anatomically inconsistent results that diverge from expert-defined boundaries. In this research, we propose a landmark-guided 3D brain segmentation approach that explicitly mimics the manual segmentation protocol of the Harvard--Oxford Atlas. A Global-to-Local network automatically detects 16 landmarks representing key subcortical reference points. Then, a semantic segmentation model produces a coarse segmentation of 12 anatomical labels, each grouping multiple subcortical regions. Finally, a landmark-driven post-processing step separates these 12 labels into 26 distinct structures by enforcing local anatomical constraints. Experimental results demonstrate consistent improvements in boundary accuracy. Overall, integrating learned landmarks aligns segmentations more closely with manual protocols.

2605.14220 2026-05-15 cs.LG cs.AI cs.CL

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

Tianle Zhong, Neiwen Ling, Yifan Pi, Zijun Wei, Tianshu Yu, Geoffrey Fox, Peng Wu, Xiao Yu

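Affiliations * ByteDance; The University of Virginia

AI summary This paper studies the mismatch between the token probabilities produced by the training and inference stages in LLM reinforcement learning, known as Training-Inference Mismatch (TIM). The authors introduce a zero-mismatch diagnostic setting (VeXact) to isolate the effect of TIM and find that even tiny token-level numerical differences can cause training collapse. They further show that TIM changes the nature of the optimization problem and propose several remedies, stressing that TIM is a key systems-level factor in the stability of LLM RL rather than mere numerical noise.

Details
English abstract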

Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.
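
To make the idea of a token-level numerical disagreement concrete, the toy sketch below scores the same token sequence twice from the same logits: once at full precision (standing in for the trainer) and once after rounding the logits to IEEE half precision (standing in for a rollout engine). Half-precision rounding is an assumption chosen purely for illustration; the abstract does not say which implementation differences cause TIM, and this is not the VeXact setup.

```python
import math
import struct


def to_half(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def sequence_ratio(logit_rows, tokens):
    """Ratio of trainer to rollout probability for one sampled sequence.

    With identical numerics the ratio is exactly 1.0; any deviation here
    comes only from the rounding step, not from a change in the weights.
    """
    log_ratio = 0.0
    for row, tok in zip(logit_rows, tokens):
        trainer_lp = log_softmax(row)[tok]
        rollout_lp = log_softmax([to_half(v) for v in row])[tok]
        log_ratio += trainer_lp - rollout_lp
    return math.exp(log_ratio)
```

Because the per-token gaps add up in log space, the sequence-level ratio drifts further from 1.0 as sequences get longer, even though each individual gap is tiny.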

2605.14218 2026-05-15 cs.AI physics.soc-ph

Fusion-fission forecasts when AI will shift to undesirable behavior

Neil F. Johnson, Frank Yingjie Huo

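Affiliations * Physics Department, The George Washington University

AI summary This paper studies how the behavior of ChatGPT-like AI systems shifts from desirable to undesirable during use, and proposes a forecasting method based on fusion-fission group dynamics. By analyzing the competition between the conversation history and the desirable and undesirable behaviors, the method predicts in advance when the AI's behavior will shift, without depending on a specific model or on stochastic sampling. Several independent tests validate the method, indicating broad applicability and high predictive accuracy.

Details
English abstract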

The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.

2605.14217 2026-05-15 cs.LG cs.AI cs.CL cs.SY eess.SY

PreFT: Prefill-only finetuning for efficient inference

Andrew Lanpouthakoun, Aryaman Arora, Zhengxuan Wu, Dhruv Pai, Ben Keigwin, Dan Jurafsky, Christopher Potts

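Affiliations * Stanford University; Tilde Research

AI summary This paper proposes PreFT, an efficient finetuning method that applies the adapter only during the prefill stage of inference, which raises serving throughput in multi-user settings. Compared with conventional parameter-efficient finetuning methods (PEFTs), PreFT markedly improves throughput while maintaining performance, and the advantage grows when serving many adapters. Experiments show that PreFT approaches or matches conventional PEFTs on supervised finetuning and reinforcement learning tasks, supporting a better accuracy-throughput tradeoff for personalized serving.

Details
English abstract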

Large language models can now be personalised efficiently at scale using parameter efficient finetuning methods (PEFTs), but serving user-specific PEFTs harms throughput, even with specialised kernels and memory management techniques. This is because, theoretically and empirically, a mismatch exists between prefill (processing a large number of tokens at once) and decode (generating a single token autoregressively): the latter has far lower throughput when serving multiple adapters. Rather than optimising performance relative to parameter count, for efficient multi-adapter serving, we instead ought to optimise performance relative to serving throughput. We therefore propose PreFT (Prefill-only Finetuning), wherein we only apply the adapter to prefill tokens and discard it afterwards. PreFT significantly increases throughput with minimal effect on performance. We develop and release an efficient implementation of two prefill-only PEFTs, LoRA and ReFT, on the vLLM inference engine. We first show that serving multi-user PreFTs is more efficient than traditional PEFTs ($1.9\times$ the throughput when serving $512$ adapters on Llama 3.1 70B). Then, we compare the performance of prefill-only vs. all-token adapters on a variety of supervised finetuning and reinforcement learning tasks with LMs at varying scales. On SFT, we observe that the evaluation loss of PreFTs is higher than PEFTs, but can be compensated by increasing rank with nearly no reduction in throughput. On RL, we consistently find that PreFTs approach parity with standard PEFTs. Together, this work validates prefill-only adaptation of LLMs as a more favourable accuracy-throughput tradeoff than existing PEFTs for personalised serving.
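
The core idea, applying the adapter to prefill tokens and dropping it for decode tokens, can be sketched on a single linear layer with a LoRA-style low-rank update. This is a toy illustration in plain Python, not the released vLLM implementation: the function names, the per-position loop, and the `prefill_len` argument are assumptions, and a real server would carry the adapter's effect into decoding through the KV cache built during prefill.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_linear(x, W, A, B, use_adapter):
    """Compute W x, plus the low-rank update B (A x) when the adapter is on."""
    y = matvec(W, x)
    if use_adapter:
        delta = matvec(B, matvec(A, x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y


def prefill_only_forward(hidden_states, prefill_len, W, A, B):
    """Apply the adapter only to positions inside the prompt (prefill)."""
    return [
        lora_linear(x, W, A, B, use_adapter=(i < prefill_len))
        for i, x in enumerate(hidden_states)
    ]
```

For example, with `W` the 2x2 identity, `A = [[1.0, 0.0]]`, and `B = [[0.0], [1.0]]`, the input `[1.0, 2.0]` maps to `[1.0, 3.0]` at a prefill position and to `[1.0, 2.0]` at a decode position.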

2605.14215 2026-05-15 cs.AI cs.LG q-bio.QM

GenCircuit-RL: Reinforcement Learning from Hierarchical Verification for Genetic Circuit Design

Noah Flynn

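Affiliations * University of California, Berkeley, CA, USA

AI summary Genetic circuit design in synthetic biology still depends on expert experience. To address this, the study proposes GenCircuit-RL, a reinforcement learning framework whose hierarchical verification rewards decompose circuit correctness into five levels, combined with a four-stage curriculum that builds up model capability step by step. It also introduces SynBio-Reason, a benchmark of 4,753 circuits for evaluating models on tasks such as code repair and de novo design. Experiments show that hierarchical verification and curriculum learning markedly raise success rates on functional reasoning tasks and yield topologically correct genetic circuit designs that generalize well.

Comments Link: https://icml.cc/virtual/2026/poster/61789

Details
English abstract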

Genetic circuit design remains a laborious, expert-driven process despite decades of progress in synthetic biology. We study this problem through code generation: models produce Python code in pysbol3 to construct genetic circuits in the Synthetic Biology Open Language (SBOL), a formal representation that supports automated verification. We introduce GenCircuit-RL, a reinforcement learning framework built around hierarchical verification rewards that decompose correctness into five levels, from code execution to task-specific topological checks, and a four-stage curriculum that shifts optimization pressure from code generation to functional reasoning. We also introduce SynBio-Reason, a benchmark of 4,753 circuits spanning six canonical circuit types and nine tasks from code repair to de novo design, with held-out biological parts for out-of-distribution evaluation. Hierarchical verification improves task success on functional reasoning tasks by 14 to 16 percentage points over binary rewards, and curriculum learning is required for strong design performance. The resulting models generate topologically correct circuits, generalize to novel biological parts, and rediscover canonical designs from the synthetic biology literature.
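
One simple way to turn an ordered list of verification levels into a graded reward is to count how many consecutive levels a candidate passes before its first failure. The sketch below illustrates that general idea only; the abstract does not give the actual reward formula, the names of the intermediate levels, or their weights, so everything here (the function name, the equal weighting, the stop-at-first-failure rule) is an assumption.

```python
def hierarchical_reward(checks, candidate):
    """Fraction of ordered checks passed before the first failure.

    `checks` is a list of predicates ordered from the most basic level
    (for example, "the code executes") to the most specific one.
    Later levels are not credited once an earlier level fails.
    """
    passed = 0
    for check in checks:
        if not check(candidate):
            break
        passed += 1
    return passed / len(checks)
```

Compared with a binary pass/fail reward, a graded signal of this kind still distinguishes a candidate that executes but has the wrong topology from one that does not run at all.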

2605.14212 2026-05-15 cs.AI

MetaAgent-X : Breaking the Ceiling of Automatic Multi-Agent Systems via End-to-End Reinforcement Learning

Yaolun Zhang, Yujie Zhao, Nan Wang, Yiran Wu, Jiayu Chang, Yizhao Chen, Qingyun Wu, Jishen Zhao, Huazheng Wang

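Affiliations * Oregon State University; UCSD; Amazon AGI; Pennsylvania State University; AG2AI, Inc.

AI summary This paper proposes MetaAgent-X, an end-to-end reinforcement learning framework that aims to overcome the decoupling of design and execution in existing automatic multi-agent systems (MAS), enabling self-designing and self-executing agent workflows. The method jointly optimizes design and execution and introduces hierarchical rollout and stagewise co-evolution to improve training stability and system adaptability. Experiments show that MetaAgent-X clearly outperforms existing methods on multiple benchmarks, supporting the effectiveness and practicality of end-to-end training for automatic MAS.

Details
English abstract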

Automatic multi-agent systems aim to instantiate agent workflows without relying on manually designed or fixed orchestration. However, existing automatic MAS approaches remain only partially adaptive: they either perform training-free test-time search or optimize the meta-level designer while keeping downstream execution agents frozen, which creates a frozen-executor ceiling and leaves the end-to-end training of self-designing and self-executing agentic models unexplored. To address this, we introduce MetaAgent-X, an end-to-end reinforcement learning framework that jointly optimizes automatic MAS design and execution. MetaAgent-X enables script-based MAS generation, execution rollout collection, and credit assignment for both designer and executor trajectories. To support stable and scalable optimization, we propose Executor Designer Hierarchical Rollout and Stagewise Co-evolution to improve training stability and expose the dynamics of designer-executor co-evolution. MetaAgent-X consistently outperforms existing automatic MAS baselines, achieving up to 21.7% gains. Comprehensive ablations show that both designer and executor improve throughout training, and that effective automatic MAS learning follows a stagewise co-evolution process. These results establish end-to-end trainable automatic MAS as a practical paradigm for building self-designing and self-executing agentic models.

2605.14210 2026-05-15 cs.LG cs.AI

Towards Fine-Grained and Verifiable Concept Bottleneck Models

Yingying Fang, Haijie Xu, Shuang Wu, Mariathasan Anish, Guang Yang

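Affiliations * Bioengineering Department; Imperial-X, Imperial College London, London, UK; Thoughtworks AI Labs, Singapore

AI summary This paper proposes a fine-grained and verifiable Concept Bottleneck Model (CBM) framework to address the difficulty existing CBMs have in verifying whether predicted concepts correspond to the correct visual evidence. By tying each concept to localized visual evidence, the method allows direct inspection of where and how concepts are encoded, improving interpretability and reliability. Experiments show that it substantially improves transparency while maintaining predictive performance, and that it establishes a concept-level mechanism for human-model interaction, laying groundwork for more reliable and clinically usable concept-based learning systems.

Comments 10 pages, 4 figures

Details
English abstract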

Concept Bottleneck Models (CBMs) offer interpretable alternatives to black-box predictors by introducing human-relatable concepts before the final output. However, existing CBMs struggle to verify whether predicted concepts correspond to the correct visual evidence, limiting their reliability. We propose a fine-grained CBM framework that grounds each concept in localized visual evidence, enabling direct inspection of where and how concepts are encoded. This design allows users to interpret predictions and verify that the model learns intended concepts rather than spurious correlations. Experiments on medical imaging benchmarks show that our learned concept space is information-complete and achieves predictive performance comparable to standard CBMs, while substantially improving transparency. Unlike post-hoc attribution methods, our framework validates both the presence and correctness of concept representations, bridging interpretability with verifiability. Our approach enhances the trustworthiness of CBMs and establishes a principled mechanism for human-model interaction at the concept level, paving the way toward more reliable and clinically actionable concept-based learning systems.

2605.14200 2026-05-15 cs.LG stat.ML

How to Scale Mixture-of-Experts: From muP to the Maximally Scale-Stable Parameterization

Leena Chennuru Vankadara, Moritz Haas, Luke Hayward, Sebastian Bordt, Alessandro Breccia

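Affiliations * Gatsby Computational Neuroscience Unit, University College London; Amazon; University of Tübingen, Tübingen AI Center

AI summary This paper studies how to parameterize Mixture-of-Experts (MoE) architectures at scale, analyzing how hyperparameters should scale with network width, number of experts, sparsity, and related quantities. The authors develop an analysis framework based on Dynamical Mean Field Theory (DMFT) and derive the parameterization ($μ$P) that satisfies the maximal-update ($μ$) conditions, but find that it falls short in how it behaves with scale. They therefore propose the Maximally Scale-Stable Parameterization (MSSP), which achieves learning-rate transfer and monotonic improvement with scale across scaling regimes, giving complete theoretical guidance for scaling MoE architectures.

Details
English abstract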

Recent frontier large language models predominantly rely on Mixture-of-Experts (MoE) architectures. Despite empirical progress, there is still no principled understanding of how hyperparameters should scale with network width $N$, expert width $N_e$, number of experts $M$, sparsity $K$, and depth $L$ to ensure both stability and optimal performance at scale. We take a principled step toward resolving this gap by analyzing three different scaling regimes: (I) co-scaling $N\asymp N_e$, (II) co-scaling $N\asymp M\asymp K$, and (III) full proportional scaling of $N, N_e, M$, and $K$. For each regime, we develop a novel Dynamical Mean Field Theory (DMFT) description of the limiting training dynamics of MoEs that provides a formal foundation for our analysis. Within this framework, we derive the unique parameterization for SGD and Adam satisfying all maximal-update ($μ$) desiderata. We then show that the resulting $μ$P prescription does not reliably induce monotonic improvement with scale or robust learning-rate transfer. We trace these pathologies to scale-dependent observables in the aggregation dynamics, which motivates a refined set of desiderata that we term maximal scale stability. Guided by this principle, we derive a Maximally Scale-Stable Parameterization (MSSP) for both SGD and Adam in all three scaling regimes, and characterize the corresponding limiting dynamics - qualitatively distinct from the $μ$P limit - through a separate DMFT analysis. Experiments verify that MSSP robustly recovers learning rate transfer and monotonic improvement with scale across regimes. Combined with existing depth-scaling theory, these results provide a complete scaling prescription for MoE architectures as a function of width, depth, expert width, and number of experts.

2605.14199 2026-05-15 cs.RO cs.SY eess.SY

Motion Planning for Autonomous Vehicles using Optimization over Graphs of Convex Sets

Matheus Wagner, Antônio Augusto Fröhlich

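Affiliations * GitHub

AI summary This paper studies how optimization over Graphs of Convex Sets (GCS) can approximate solutions of nonlinear optimal control problems in autonomous driving, producing collision-free and dynamically feasible trajectories. The method represents free space as a directed graph of convex sets, parameterizes vehicle motion with Bezier curves and a polynomial time function, and approximately enforces dynamic feasibility through convex constraints. Experiments show that, while keeping trajectories safe and dynamically consistent, the method is more computationally efficient and more robust than a conventional nonlinear optimal control approach.

Details
English abstract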

Motion planning for autonomous vehicles requires generating collision-free and dynamically feasible trajectories in complex environments under real-time constraints. While nonlinear optimal control formulations provide high-fidelity solutions, they are computationally demanding and sensitive to initialization, whereas geometric planning methods scale well but often decouple path selection from trajectory optimization. This paper studies the extent to which optimization over Graphs of Convex Sets (GCS) can approximate solutions of nonlinear optimal control problems in the context of autonomous driving. The free space is represented as a finite union of convex regions organized as a directed graph, allowing nonconvex geometry to be handled through discrete connectivity decisions while maintaining convex trajectory constraints within each region. Vehicle motion is parameterized using Bezier curves for the spatial path and a polynomial time-scaling function for temporal evolution. Under small-slip and linear tire assumptions, a simplified dynamic bicycle model enables approximate enforcement of dynamic feasibility through convex constraints on trajectory derivatives. The approach is evaluated in CommonRoad scenarios involving static obstacle avoidance and lane-changing maneuvers, and is compared against a nonlinear discrete-time optimal control formulation. The results indicate that the GCS-based method generates collision-free and dynamically consistent trajectories that closely match those obtained from the nonlinear program, while exhibiting improved computational efficiency and reduced sensitivity to initialization. These findings suggest that GCS provides a structured approximation of nonlinear motion planning problems, capturing dominant geometric and dynamic effects while preserving convexity in the continuous relaxation.
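
One reason Bezier curves pair well with convex regions is the convex hull property: a Bezier curve always lies inside the convex hull of its control points, so keeping every control point inside a convex region keeps the whole curve segment inside it. The sketch below illustrates that property in plain Python. It is not the paper's GCS formulation; the half-space representation `(a, b)` meaning `a . p <= b` and all function names are assumptions made for the example.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm."""
    pts = [tuple(p) for p in control_points]
    while len(pts) > 1:
        pts = [
            tuple((1 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


def inside_halfspaces(point, halfspaces, tol=1e-9):
    """True if a . point <= b for every (a, b) describing a convex region."""
    return all(
        sum(ai * pi for ai, pi in zip(a, point)) <= b + tol
        for a, b in halfspaces
    )


def curve_certified_inside(control_points, halfspaces):
    """Sufficient (not necessary) check: all control points in the region."""
    return all(inside_halfspaces(p, halfspaces) for p in control_points)
```

The check is conservative: a curve can stay inside a region even when one of its control points lies outside, but the control-point conditions are linear, which is what makes them usable as convex constraints.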

2605.14192 2026-05-15 cs.CL cs.AI

Why Retrieval-Augmented Generation Fails: A Graph Perspective

Kai Guo, Xinnan Dai, Zhibo Zhang, Nuohan Lin, Shenglai Zeng, Jie Ren, Haoyu Han, Jiliang Tang

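Affiliations * Michigan State University; Massachusetts Institute of Technology

AI summary This paper analyzes from a graph perspective why retrieval-augmented generation (RAG) still produces wrong answers in many cases, revealing how retrieved information influences the model's generation process. By constructing attribution graphs, the researchers find marked differences in the structure of information flow between correct and incorrect predictions. Building on these findings, they propose a graph-based error detection framework and show how intervening on the attribution-graph structure can improve RAG generation quality.

Details
English abstract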

Retrieval-Augmented Generation (RAG) has become a powerful and widely used approach for improving large language models by grounding generation in retrieved evidence. However, RAG systems still produce incorrect answers in many cases. Why RAG fails despite having access to external information remains poorly understood. We present a model-internal study of retrieval-augmented generation that examines how retrieved evidence influences answer generation. Using circuit tracing, we construct attribution graphs that model the flow of information through transformer layers during decoding. These graphs represent interactions among retrieved context, intermediate model activations, and generated tokens, providing a graph-based, circuit-level view of how external evidence is integrated into the model's reasoning process. Across multiple question answering benchmarks, we observe consistent structural differences: correct predictions exhibit deeper reasoning paths, more distributed evidence flow, and a more structured pattern of local connectivity, while failed predictions show shallower, fragmented, and overly concentrated evidence flow. Building on these findings, we develop a graph-based error detection framework that uses attribution-graph topology features. Furthermore, we show that attribution graphs enable targeted interventions. By reinforcing question-constrained evidence grounding, we reshape internal routing so that answer generation remains guided by the question, leading to more effective integration of retrieved information and fewer errors.

2605.14191 2026-05-15 cs.CV

CoReDiT: Spatial Coherence-Guided Token Pruning and Reconstruction for Efficient Diffusion Transformers

Zhuojin Li, Hsin-Pai Cheng, Hong Cai, Shizhong Han, Fatih Porikli

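Affiliations * Qualcomm AI Research

AI summary This paper proposes CoReDiT, a structured token pruning framework that aims to improve the computational efficiency of Diffusion Transformers (DiTs) on image and video generation tasks. The method uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips highly coherent, redundant tokens in self-attention, while reconstructing the skipped attention outputs by aggregating neighboring retained tokens to keep the representation dense and visually continuous. Experiments show that CoReDiT cuts self-attention computation by up to 55% across several state-of-the-art diffusion models and speeds up inference by 1.33x in the cloud and 1.72x on mobile devices, while maintaining high generation quality and improving on-device memory efficiency.

Comments 8 pages, 8 figures, CVPR workshop

Journal ref 2026 CVPR Workshop of EDGE

Details
English abstract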

Diffusion Transformers (DiTs) deliver remarkable image and video generation quality but incur high computational cost, limiting scalability and on-device deployment. We introduce CoReDiT, a structured token pruning framework for DiTs across vision tasks. CoReDiT uses a linear-time spatial coherence score to estimate local redundancy in the latent token lattice and skips high coherence (redundant) tokens in self-attention. To maintain a dense representation and avoid visual discontinuities, we reconstruct skipped attention outputs via coherence-guided aggregation of spatially neighboring retained tokens. We further introduce a progressive, block-adaptive pruning schedule that increases pruning gradually and allocates larger budgets to blocks and denoising steps with higher redundancy. Across state-of-the-art diffusion backbones including PixArt-α and MagicDrive-V2, CoReDiT achieves up to 55% self-attention FLOPs reduction and inference speedups of 1.33x on cloud GPUs and 1.72x on mobile NPUs, while maintaining high visual quality. Notably, CoReDiT also increases on-device memory head-room, enabling higher-resolution generation.