arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.11537 2026-05-13 cs.LG

Fast MoE Inference via Predictive Prefetching and Expert Replication

Ankit Jyothish, Ali Jannesari, Aishwarya Sarkar, Joseph Zuber

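AI summary: This paper targets the low GPU utilization, load imbalance, and high latency that Mixture-of-Experts (MoE) architectures face during large language model inference, and proposes an acceleration method based on predictive prefetching and expert replication. It dynamically predicts which experts are likely to be overloaded and replicates them for subsequent batches, enabling concurrent processing across layers, which improves parallelism, reduces GPU idle time, and speeds up inference. Experiments show up to a 3x inference speedup and nearly 100% GPU utilization while preserving roughly 90-95% of baseline model performance.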

Abstract

The Mixture of Experts (MoE) architecture has become a fundamental building block in state-of-the-art large language models (LLMs), improving domain-specific expertise and scaling model capacity without a proportional increase in computational overhead. However, MoE inference often suffers from suboptimal GPU utilization, load imbalance, and elevated latency: because expert activation is sparse, many tokens end up waiting on the same experts for their computation. To address these challenges, we propose a dynamic expert replication strategy that predicts which experts are likely to be overloaded and replicates them for upcoming batches of tokens. The replicated experts process batch tokens concurrently across layers, which leads to improved parallelism, shorter GPU idle time, and significantly faster inference. Experimental evaluations conducted on large-scale MoE models, including Switch-base-128 and Switch-base-256, demonstrate that our method achieves near-complete GPU utilization (approximately 100%), leading to up to a 3x improvement in inference speed while preserving approximately 90-95% of the performance of baseline architectures.
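
The replication idea can be illustrated with a small sketch. It is not the paper's method: the load predictor here is simply the routing counts from recent tokens, and the names `plan_replicas`, `dispatch`, and the `capacity` knob are hypothetical.

```python
from collections import Counter


def plan_replicas(recent_routing, num_experts, capacity):
    """Return {expert_id: replica_count} for the next batch.

    recent_routing: expert ids the router chose for recent tokens.
    capacity: tokens one expert copy is assumed to handle per batch
              (a hypothetical tuning knob, not taken from the paper).
    """
    load = Counter(recent_routing)
    plan = {}
    for expert in range(num_experts):
        tokens = load.get(expert, 0)
        # Ceiling division: enough copies that no copy exceeds `capacity`.
        plan[expert] = max(1, -(-tokens // capacity))
    return plan


def dispatch(token_experts, plan):
    """Spread tokens round-robin over the replicas of their routed expert."""
    next_replica = Counter()
    assignment = []
    for expert in token_experts:
        replica = next_replica[expert] % plan[expert]
        next_replica[expert] += 1
        assignment.append((expert, replica))
    return assignment


routing = [0, 0, 0, 0, 0, 1, 2, 0]  # expert 0 is the hot spot
plan = plan_replicas(routing, num_experts=4, capacity=2)
assignment = dispatch(routing, plan)
```

In this toy run the hot expert receives three replicas and its six tokens are spread evenly over them, while the cold experts keep a single copy.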

2605.11535 2026-05-13 cs.LG

Primal-Dual Policy Optimization for Linear CMDPs with Adversarial Losses

Kihyun Yu, Seoungbin Bae, Dabeen Lee

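AI summary: This paper studies policy optimization in online finite-horizon adversarial linear constrained Markov decision processes (CMDPs), where losses are chosen adversarially and costs are observed through stochastic feedback. The authors propose a primal-dual algorithm that is the first in this setting to achieve sublinear regret and constraint-violation bounds, both $\widetilde{\mathcal{O}}(K^{3/4})$. The algorithm introduces a new class of weighted LogSumExp softmax policies and combines it with periodic policy mixing and a regularized dual update, which control the policy covering number and the dual variable and thereby underpin the theoretical guarantees.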

Comments Accepted to ICLR 2026

Abstract

Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the first to achieve sublinear regret and constraint violation bounds in this setting, both bounded by $\widetilde{\mathcal{O}}(K^{3/4})$, where $K$ denotes the number of episodes. The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from the following key contributions: (i) a new covering number argument for the weighted LogSumExp softmax policies, and (ii) two novel algorithmic components -- periodic policy mixing and a regularized dual update -- which allow us to effectively control both the covering number and the dual variable. We also report numerical results that validate our theoretical findings on the performance of the algorithm.

2605.11534 2026-05-13 cs.RO

PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments

Yunn Kang Lim, Pengzhan Sun, Ziyi Bai, Xun Xu, Angela Yao, Xulei Yang, Shijie Li

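AI summary: PRISM is a benchmark for diagnosing why embodied agents fail at household tasks, shifting the question from "did it succeed?" to "which capability is most likely responsible for the failure?". Built on five photorealistic multi-room apartments, it comprises 300 human-verified tasks organized into three capability tiers that assess perception-to-action grounding, implicit intent resolution, and long-horizon coordination. Experiments show that implicit intent resolution is a significant bottleneck for current mainstream large language models, while long-horizon coordination exposes a clear gap in planning ability.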

Abstract

When an LLM-based embodied agent fails at a household task, the culprit could be misidentified objects, forgotten sub-goals, or poor action sequencing -- yet existing benchmarks report only a single success rate, making it impossible to tell which cognitive module is responsible. We present PRISM, a diagnostic benchmark that reframes this problem: rather than asking only "did the agent succeed?", PRISM asks "which capability is most likely responsible for failure?" Built on five photorealistic multi-room apartments (4--8 rooms each), PRISM structures 300 human-verified tasks into three capability tiers -- Basic Ability, Reasoning Ability, and Long-horizon Ability -- that isolate perception-to-action grounding, implicit intent resolution, and sustained multi-step coordination, respectively. PRISM exposes an agent-agnostic executable action API that allows arbitrary agents (LLM agents, VLM agents, symbolic planners, RL policies, and hybrid systems) to be evaluated end-to-end under the same benchmark protocol. To support deeper diagnosis, optional probes for perception, memory, and planning can be adopted, replaced, or bypassed entirely, enabling controlled component-level analysis when desired. Experiments on seven contemporary LLMs establish a clear hierarchy: explicit spatial grounding is not the dominant failure source under oracle perception, implicit intent resolution is a significant bottleneck for all model families, and long-horizon coordination exposes a stark capability cliff -- lightweight models collapse to as low as 20.0% success while simultaneously consuming more tokens than their frontier counterparts, a signature of compensatory over-reasoning rather than genuine planning capability. Project page: https://sj-li.com/PROJ/PRISM

2605.11532 2026-05-13 cs.AI

Read, Grep, and Synthesize: Diagnosing Cross-Domain Seed Exposure for LLM Research Ideation

Yunju Choi, Min Song

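AI summary: This paper asks whether large language models (LLMs) benefit from cross-domain knowledge when generating research ideas. The authors present PaperGym, a three-stage pipeline of tool-augmented seed extraction, cross-domain seed retrieval, and method synthesis, and use it to assess how different seed sources affect novelty. Cross-domain retrieval beats same-domain and no-retrieval baselines on method novelty but does not significantly outperform randomly chosen diverse seeds. The study concludes that current LLMs still struggle to exploit the semantic reason a particular seed was retrieved.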

Comments 12 pages, 2 figures, 7 tables

Abstract

The discovery of novel methodologies for emerging problems is a continuing cycle in ML, often driven by the migration of techniques across domains. Building on this observation, we ask whether current LLM ideation systems benefit from targeted cross-domain retrieval or simply from exposure to diverse mechanisms. We study this question through PaperGym, a three-stage pipeline: (1) tool-augmented seed extraction via read, grep, and bash over an isolated paper environment, (2) cross-domain seed retrieval via paraphrasing across seven ML domains, and (3) method synthesis from retrieved seeds, each scored by rubric-based judges. Tool-augmented extraction improves specificity, and paraphrase-based retrieval broadens domain coverage. In synthesis, cross-domain retrieval receives more pairwise novelty wins than no-retrieval and same-domain baselines, but shows no significant difference from a random diverse-seed control. These findings suggest LLM ideation systems benefit from diverse seed exposure, but do not yet reliably exploit the semantic reason particular seeds were retrieved. We release the seed library, rubric prompts, and run scripts at https://github.com/yunjoochoi/PaperGym

2605.11530 2026-05-13 cs.LG

Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes

Tatsuhito Hasegawa, Taisei Tanaka

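AI summary: This paper examines whether, under a similar parameter budget, it is more effective to concentrate model capacity in a single wide pathway or spread it across many narrow, independent branches. Using the Multi-Narrow transformation, which converts a baseline CNN into a single-model ensemble of narrow branches, the authors systematically compare Single-Wide and Multi-Narrow configurations across data regimes, architectures, and datasets. Multi-Narrow models perform better when data is scarce because they learn more diverse and less redundant features, whereas Single-Wide models are preferable when data is plentiful; the finding holds across several CNN architectures and image-classification tasks.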

Comments 12 pages, 9 figures, 4 tables. Preprint version of a manuscript submitted to Neurocomputing

Abstract

Single-model ensembles (SMEs) have attracted attention as a way to approximate some of the benefits of deep ensembles within a single network. However, under an approximately matched parameter budget, it remains unclear whether model capacity should be concentrated in a single wide pathway or redistributed into many narrow and independent members. We investigate this question through the Multi-Narrow (MN) transformation, which converts a baseline CNN into an SME of narrow, path-wise independent branches while approximately preserving the dominant parameter budget. We systematically compare Single-Wide and Multi-Narrow configurations across different training-data regimes, architectures, and datasets. The results show that the effectiveness of MN is strongly data-dependent: weakly partitioned or baseline-wide models are preferable in data-rich settings, whereas highly partitioned MN models consistently outperform the baseline in low-data settings. This tendency is reproduced across multiple CNN architectures and image-classification datasets, suggesting that it is not specific to a single benchmark or model family. Analysis of internal representations shows that high-MN models learn more diverse and less redundant path-wise features. In low-data regimes, this diversity is broadly utilized and improves generalization, whereas in data-rich regimes, training becomes imbalanced and prediction is dominated by a small subset of paths. These findings clarify when and why Multi-Narrow transformation is effective, and provide practical guidance for allocating model capacity between width and member multiplicity under a limited budget.
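
To make the "approximately matched parameter budget" concrete, here is a small arithmetic sketch. It assumes, hypothetically, that each branch width is scaled by 1/sqrt(M) for M branches; the abstract does not state the paper's actual scaling rule, and biases and non-convolutional layers are ignored.

```python
import math


def conv_params(c_in, c_out, kernel=3):
    """Weight count of one conv layer (biases ignored)."""
    return c_in * c_out * kernel * kernel


def multi_narrow_width(width, branches):
    """Per-branch width under the assumed 1/sqrt(M) scaling rule."""
    return round(width / math.sqrt(branches))


single = conv_params(256, 256)            # one wide pathway
narrow = multi_narrow_width(256, 4)       # width of each of 4 branches
multi = 4 * conv_params(narrow, narrow)   # four independent narrow pathways
```

Because a conv layer's weight count grows quadratically with width, four branches at half the width cost the same as one wide layer in this example.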

2605.11527 2026-05-13 cs.LG cs.CR cs.DB

FERMI: Exploiting Relations for Membership Inference Against Tabular Diffusion Models

Abtin Mahyar, Masoumeh Shafieinejad, Yuhan Liu, Xi He

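AI summary: This work studies membership inference attacks against tabular diffusion models, focusing on the multi-table relational structure of real sensitive data that existing attacks ignore. It proposes FERMI, which enriches single-table features with auxiliary information from tables related to the target table to strengthen the attack. Across several tabular diffusion models and real-world datasets, FERMI consistently outperforms single-table attacks, raising TPR@0.1FPR by up to 53% in the white-box setting and 22% in the black-box setting.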

Abstract

Diffusion models are the leading approach for tabular data synthesis and are increasingly used to share sensitive records. Whether they actually protect privacy has become a pressing question. Membership inference attacks are the standard tool for this purpose, yet existing attacks assume a single-table setting and ignore the multi-relational structure of real sensitive data. A core challenge in assessing privacy risks from membership inference attacks in multi-table settings is how to leverage auxiliary information from relations associated with the target table, such as its parent tables. Particularly, we study a practical setting in which such auxiliary information is available only when training the attack model. At inference time, the attacker observes only the attribute values of the target record from the target table. We propose FERMI (FEature-mapping for Relational Membership Inference), which resolves this gap by enriching single-table features with relational membership signal. Across three tabular diffusion architectures and three real-world relational datasets, FERMI consistently improves attack performance over single-table baselines, with TPR@$0.1$FPR rising by up to 53% over the single-table baseline in the white-box setting and 22% in the black-box setting.
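
The reported metric, TPR at a fixed false-positive rate of 0.1, can be computed from attack scores as in the sketch below. The function name and the toy scores are illustrative assumptions, not the paper's evaluation code.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr=0.1):
    """True-positive rate of a threshold attack whose FPR is at most target_fpr.

    A record is predicted "member" when its score is strictly above the threshold.
    """
    negatives = sorted(nonmember_scores, reverse=True)
    allowed_fp = int(target_fpr * len(negatives))
    # Using the (allowed_fp + 1)-th largest non-member score as the threshold
    # lets at most `allowed_fp` non-members through.
    if allowed_fp < len(negatives):
        threshold = negatives[allowed_fp]
    else:
        threshold = float("-inf")
    true_positives = sum(score > threshold for score in member_scores)
    return true_positives / len(member_scores)


nonmembers = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.90]
members = [0.95, 0.80, 0.50, 0.30]
tpr = tpr_at_fpr(members, nonmembers, target_fpr=0.1)
```

With ten non-members, one false positive is allowed, so the threshold sits at 0.45 and three of the four members are detected.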

2605.11525 2026-05-13 cs.LG

OverNaN: NaN-Aware Oversampling for Imbalanced Learning with Meaningful Missingness

Amanda S Barnard

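AI summary: In practice, missing values are usually treated as defects to be deleted or imputed, yet missingness itself can carry important information. This paper presents OverNaN, a lightweight oversampling framework that addresses class imbalance while preserving missingness structure. It extends conventional synthetic oversampling techniques to generate samples directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated, so that imbalance is addressed without erasing the information carried by missingness. The work targets scientific and engineering domains where missingness is unavoidable and informative.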

Comments 14 pages, 2 figures, 17 tables

Abstract

Missing values are routinely treated as defects to be eliminated through deletion or imputation prior to machine learning. In many applied domains, however, missingness itself carries information, reflecting experimental constraints, measurement choices, or systematic mechanisms tied to the data-generating process. Eliminating or masking this structure can distort class boundaries, introduce bias, and reduce generalisability, particularly in imbalanced datasets where minority classes are already under-represented. OverNaN is a lightweight, NaN-aware oversampling framework designed to address class imbalance without erasing missingness structure. It extends common synthetic oversampling methods to operate directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated according to explicitly defined strategies. Rather than repairing missing data, OverNaN treats missingness as part of the feature space over which synthetic samples are generated. This paper situates OverNaN within the broader landscape of imbalanced learning, missing-data handling, and NaN-tolerant algorithms. Using representative examples included with the software, we demonstrate that meaningful missingness can be retained during oversampling without introducing artificial certainty. OverNaN is intended for practitioners working with small, incomplete, and imbalanced datasets in scientific and engineering domains where missingness is unavoidable and often informative.
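
A minimal sketch of what NaN-aware interpolation could look like for a SMOTE-style oversampler follows. This is not OverNaN's API: the function name and the two strategy names ("preserve" and "copy") are hypothetical stand-ins for the preserve/propagate/interpolate options the abstract mentions.

```python
import math
import random


def interpolate_with_nan(sample, neighbor, rng, strategy="preserve"):
    """SMOTE-style interpolation between two minority rows that may contain NaN.

    strategy="preserve": a feature is NaN in the output if it is NaN in either parent.
    strategy="copy": when exactly one parent is observed, reuse its value.
    """
    gap = rng.random()  # one interpolation weight per synthetic row, as in SMOTE
    synthetic = []
    for a, b in zip(sample, neighbor):
        a_missing, b_missing = math.isnan(a), math.isnan(b)
        if not a_missing and not b_missing:
            synthetic.append(a + gap * (b - a))
        elif strategy == "copy" and a_missing != b_missing:
            synthetic.append(b if a_missing else a)
        else:
            synthetic.append(math.nan)
    return synthetic


rng = random.Random(0)
parent = [1.0, math.nan, 3.0]
neighbor = [2.0, 5.0, math.nan]
row = interpolate_with_nan(parent, neighbor, rng)
copied = interpolate_with_nan(parent, neighbor, rng, strategy="copy")
```

Under "preserve", the synthetic row keeps NaN wherever either parent lacks a value, so no artificial certainty is introduced; "copy" trades that for fewer missing entries.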

2605.11524 2026-05-13 cs.LG cs.CE

EqOD: Symmetry-Informed Stability Selection for PDE Identification

Gnankan Landry Regis N'guessan, Bum Jun Kim

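AI summary: This work proposes EqOD, an automatic method for identifying partial differential equations (PDEs) from noisy data that combines symmetry-based library reduction with stability selection to reduce false positives. When Galilean invariance is detected, EqOD uses a symmetry-reduced library that excludes terms that cannot appear in the governing equation; otherwise it applies randomized LASSO stability selection. Across 8 PDEs and 4 noise levels, EqOD outperforms PySINDy in 23 of 32 settings and, under a strict criterion, beats WF-LASSO in 7 of 32 with the remaining 25 tied.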

Comments 45 pages, 16 figures

Abstract

Data-driven identification of partial differential equations (PDEs) relies on sparse regression over a candidate library of differential operators, where larger libraries inflate false positives under observation noise and smaller libraries risk missing true terms. We introduce Equivariant Operator Discovery (EqOD), a fully automatic method combining two library reduction mechanisms. When Galilean invariance is detected from trajectory data via a weak-form structural test, EqOD uses the symmetry-reduced library, eliminating terms that our Galilean exclusion result proves to be absent from the governing equation. Otherwise, it applies randomized LASSO stability selection guided by classical false-positive bounds. A residual-based fallback prevents degradation below the full-library baseline. On 8 PDEs at 4 noise levels, EqOD attains $F_1 = 1.000 \pm 0.000$ on Heat at $20\%$ noise, where WF-LASSO obtains $0.475 \pm 0.181$, official PySINDy 2.0 obtains $0.000$, and the WSINDy reimplementation obtains $0.789$. Under the strict criterion that the mean F1 difference exceeds the larger of the two standard deviations, EqOD wins 7 of 32 cells. WF-LASSO wins none, and the remaining 25 cells are ties. Across all 32 cells, EqOD outperforms PySINDy 2.0.0 in 23 of 32 cells, and all 5 PySINDy wins occur on reaction PDEs. External validation on WeakIdent and PINN-SR datasets gives $F_1 = 1.000$ on all 5 clean benchmarks. NLS, 2D, coupled-system, and cylinder-wake extensions are reported. The Galilean library reduction is proved under explicit autonomy and library assumptions. The stability-selection step is motivated by classical false-positive bounds, while formal guarantees for correlated PDE design matrices remain open.
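
The strict win criterion described in the abstract (the mean F1 difference must exceed the larger of the two standard deviations) is easy to state in code. The function name is illustrative; the Heat numbers are the ones quoted above, and the second call uses made-up values to show a tie.

```python
def strict_winner(mean_a, std_a, mean_b, std_b):
    """Return "a", "b", or "tie" under the strict criterion."""
    diff = mean_a - mean_b
    margin = max(std_a, std_b)
    if diff > margin:
        return "a"
    if -diff > margin:
        return "b"
    return "tie"


# Heat equation at 20% noise: EqOD 1.000 +/- 0.000 vs WF-LASSO 0.475 +/- 0.181
heat_20 = strict_winner(1.000, 0.000, 0.475, 0.181)

# Hypothetical cell where the gap is smaller than the noisier method's spread
close_call = strict_winner(0.90, 0.10, 0.85, 0.02)
```

For the Heat cell the gap of 0.525 exceeds the margin of 0.181, so method "a" (EqOD) wins; in the hypothetical cell the gap of 0.05 is below 0.10, so the cell counts as a tie.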

2605.11521 2026-05-13 cs.CV

XWOD: A Real-World Benchmark for Object Detection under Extreme Weather Conditions

Chih-Hsin Chen, Yu-Tung Liu, Amar Fadillah, Kuan-Ting Lai, Dong Liu

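AI summary: This paper presents XWOD, a large-scale real-world dataset for object detection under extreme weather, containing 10,010 images and 42,924 bounding boxes that cover six traffic-object categories across seven conditions: rain, snow, fog, haze/sand/dust, flooding, tornado, and wildfire. XWOD broadens the weather taxonomy and is the first to include climate-amplified hazards. Detectors trained on XWOD and tested zero-shot on other weather datasets substantially exceed published baselines, supporting the quality of the data and making XWOD a strong benchmark for traffic perception in extreme weather.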

Abstract

Autonomous driving and intelligent transportation systems remain vulnerable under extreme weather. The U.S. Federal Highway Administration reports that roughly 745,000 crashes and 3,800 fatalities per year are weather-related, and recent regulatory investigations have examined failures of Level-2/3 driving systems under reduced-visibility conditions. However, datasets commonly used to evaluate weather robustness remain limited in scale, diversity, and realism. In this paper, we introduce XWOD (Extreme Weather Object Detection), a large-scale real-world traffic-object detection benchmark containing 10,010 images and 42,924 bounding boxes across seven extreme weather conditions: rain, snow, fog, haze/sand/dust, flooding, tornado, and wildfire. The dataset covers six traffic-object categories, including car, person, truck, motorcycle, bicycle, and bus. XWOD extends the weather taxonomy from one to seven conditions, and is the first to cover the emerging class of climate-amplified hazards, such as flooding, tornado, and wildfire. To evaluate the quality of our data, we train standard YOLO-family detectors on XWOD and test them zero-shot on external weather benchmarks, achieving mAP$_{50}$ scores of 63.00% on RTTS, 59.94% on DAWN, and 61.12% on WEDGE, compared with the corresponding published YOLO-based baselines of 40.37%, 32.75%, and 45.41%, respectively, representing relative improvements of 56%, 83%, and 35%. These cross-dataset results show that XWOD provides a strong source domain for learning weather-robust traffic perception. We release the dataset, splits, baseline weights, and reproducible evaluation code under a research-use license.
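
The relative improvements quoted in the abstract follow directly from the reported mAP50 pairs, as this short check shows (the dictionary layout is just for illustration).

```python
# (XWOD-trained zero-shot mAP50, published YOLO-based baseline mAP50), in percent
results = {
    "RTTS": (63.00, 40.37),
    "DAWN": (59.94, 32.75),
    "WEDGE": (61.12, 45.41),
}

# Relative improvement = (ours - baseline) / baseline, rounded to whole percent
relative = {
    name: round(100 * (ours - base) / base)
    for name, (ours, base) in results.items()
}
```

This reproduces the 56%, 83%, and 35% figures for RTTS, DAWN, and WEDGE, respectively.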

2605.11520 2026-05-13 cs.CV cs.AI

PointGS: Semantic-Consistent Unsupervised 3D Point Cloud Segmentation with 3D Gaussian Splatting

Yixiao Song, Qingyong Li, Wen Wang, Zhicheng Yan

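AI summary: This paper proposes PointGS, an unsupervised 3D point cloud segmentation method that avoids the high annotation cost of supervised approaches. It uses 3D Gaussian Splatting as a unified intermediate representation to bridge the domain gap between discrete point clouds and continuous images, and relies on multi-view reconstruction and semantic distillation to assign semantics consistently across views. Experiments show that PointGS outperforms existing unsupervised methods, by +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS.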

Comments Accepted by Computer Vision and Pattern Recognition (CVPR) 2026

Abstract

Unsupervised point cloud segmentation is critical for embodied artificial intelligence and autonomous driving, as it mitigates the prohibitive cost of dense point-level annotations required by fully supervised methods. While integrating 2D pre-trained models such as the Segment Anything Model (SAM) to supplement semantic information is a natural choice, this approach faces a fundamental mismatch between discrete 3D points and continuous 2D images. This mismatch leads to inevitable projection overlap and complex modality alignment, resulting in compromised semantic consistency across 2D-3D transfer. To address these limitations, this paper proposes PointGS, a simple yet effective pipeline for unsupervised 3D point cloud segmentation. PointGS leverages 3D Gaussian Splatting as a unified intermediate representation to bridge the discrete-continuous domain gap. Input sparse point clouds are first reconstructed into dense 3D Gaussian spaces via multi-view observations, filling spatial gaps and encoding occlusion relationships to eliminate projection-induced semantic conflation. Multi-view dense images are rendered from the Gaussian space, with 2D semantic masks extracted via SAM, and semantics are distilled to 3D Gaussian primitives through contrastive learning to ensure consistent semantic assignments across different views. The Gaussian space is aligned with the original point cloud via two-step registration, and point semantics are assigned through nearest-neighbor search on labeled Gaussians. Experiments demonstrate that PointGS outperforms state-of-the-art unsupervised methods, achieving +0.9% mIoU on ScanNet-V2 and +2.8% mIoU on S3DIS.
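
The final step, assigning point semantics by nearest-neighbor search over labeled Gaussians, can be sketched as below. This is a brute-force illustration with hypothetical names and toy data; it assumes registration has already aligned the two coordinate frames, and a real pipeline would likely use a spatial index such as a KD-tree.

```python
import math


def transfer_labels(points, gaussian_centers, gaussian_labels):
    """Give each point the label of its nearest Gaussian center (brute force)."""
    labels = []
    for point in points:
        nearest = min(
            range(len(gaussian_centers)),
            key=lambda i: math.dist(point, gaussian_centers[i]),
        )
        labels.append(gaussian_labels[nearest])
    return labels


centers = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
center_labels = ["floor", "chair"]
cloud = [(0.1, 0.0, 0.0), (0.9, 1.0, 1.2), (0.4, 0.4, 0.4)]
point_labels = transfer_labels(cloud, centers, center_labels)
```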

2605.11519 2026-05-13 cs.AI cs.CL cs.LG

Controllable User Simulation

Guy Tennenholtz, Ofer Meshi, Amir Globerson, Uri Shalit, Jihwan Jeong, Craig Boutilier

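AI summary: This paper studies how to build controllable user simulators for more accurate evaluation of conversational agents. The authors formalize controllable simulation as a causal inference problem and show that the conventional supervised fine-tuning approach introduces structural bias that makes the variance of evaluation metrics blow up under policy shift, a phenomenon they call controllability collapse. They establish theoretical conditions for causal consistency and propose practical training mitigations; experiments show that these eliminate look-ahead bias, preserve conversational diversity, and generalize robustly to unseen agent behaviors.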

Abstract

Using offline datasets to evaluate conversational agents often fails to cover rare scenarios or to support testing new policies. This has motivated the use of controllable user simulators for targeted, counterfactual evaluation, typically implemented by prompting or fine-tuning large language models. In this work, we formalize controllable simulation as a causal inference problem. By bridging natural language evaluation with off-policy evaluation methodology, we show that the standard practice of training simulators via supervised fine-tuning on post-hoc trajectory labels yields a structurally biased model. Specifically, these labels are inextricably coupled to the data-generating behavior policy, injecting a look-ahead bias that breaks causal consistency. Furthermore, we prove that under policy shift this failure causes the variance of evaluation metrics to explode geometrically, a phenomenon we term controllability collapse. To restore causal consistency, we establish theoretical conditions for accurate simulation and propose practical training mitigations: a priori controls, step-wise dynamic controls, and direct policy-conditioned learning. Empirical evaluation confirms that while standard global controls distort conversational distributions and collapse behavioral diversity, our causally grounded simulators eliminate look-ahead bias, preserve natural variance, and exhibit robust zero-shot generalization to unseen agent behaviors.

2605.11513 2026-05-13 cs.CL cs.AI

A Study on Hidden Layer Distillation for Large Language Model Pre-Training

Maxime Guigon, Lucas Dixon, Michaël E. Sander

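AI summary: This paper studies Hidden Layer Distillation (HLD) for large language model pre-training, noting that knowledge distillation today relies mainly on output logits and neglects the semantic information in the teacher's intermediate layers. In compute-controlled comparisons, HLD does not consistently beat standard logit-based distillation on downstream tasks, but it improves perplexity across all shared-hyperparameter configurations, suggesting a latent signal of value that is not yet strong enough to make HLD a mainstream pre-training method.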

Abstract

Knowledge Distillation (KD) is a critical tool for training Large Language Models (LLMs), yet the majority of research focuses on approaches that rely solely on output logits, neglecting semantic information in the teacher's intermediate representations. While Hidden Layer Distillation (HLD) showed potential for encoder architectures, its application to decoder-only pre-training at scale remains largely unexplored. Through compute-controlled experiments, we benchmark HLD against logit-based KD and self-supervised baselines with Gemma3 3.4B as teacher and 123M and 735M students trained on up to 168B tokens from the C4 dataset. Our experiments show that HLD does not consistently outperform standard KD on downstream evaluation tasks. Nevertheless, we show that HLD can yield a systematic perplexity gain over KD across all shared-hyperparameter configurations, suggesting that a latent signal can be extracted, but a breakthrough may be needed for it to play a more significant role in LLM pre-training.

2605.11509 2026-05-13 cs.AI cs.LG cs.MA cs.SY eess.SY

Hierarchical LLM-Driven Control for HAPS-Assisted UAV Networks: Joint Optimization of Flight and Connectivity

Zijiang Yan, Hao Zhou, Wael Jaafar, Jianhua Pei, Ping Wang, Halim Yanikomeroglu, Hina Tabassum

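AI summary: This paper studies the joint optimization of flight control and connectivity for uncrewed aerial vehicles (UAVs) in an integrated terrestrial and non-terrestrial network (ITNTN). To coordinate multiple UAVs under dynamic, partially observable conditions, the authors propose an LLM-based hierarchical multi-rate control framework that couples global load-balancing and handover decisions with local UAV motion control. Experiments show gains over existing methods in transportation efficiency, communication throughput, and collision rate, along with good adaptability to dynamic scenarios.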

Comments Submission for possible publication

Abstract

Uncrewed aerial vehicles (UAVs) are increasingly deployed in complex networked environments, yet the joint optimization of multi-UAV motion control and connectivity remains a fundamental challenge. In this paper, we study a multi-UAV system operating in an integrated terrestrial and non-terrestrial network (ITNTN) comprising terrestrial base stations and high-altitude platform stations (HAPS). We consider a three-dimensional (3D) aerial highway scenario where UAVs must adapt their motion to ensure collision avoidance, efficient traffic flow, and reliable communication under dynamic and partially observable conditions. We first model the problem as a hierarchical multi-objective partially observable Markov decision process (H-MO-POMDP), capturing the coupling between control and communication objectives. Based on this formulation, we propose a large language model (LLM)-driven hierarchical multi-rate control framework. At the global level, an LLM-based controller on the HAPS performs long-term planning for load balancing and handover decisions. At the local level, each UAV employs a hybrid controller that integrates a slow-timescale LLM for high-level spatial reasoning with a reinforcement learning agent for faster UAV-to-infrastructure (U2I) communication and motion control. We further develop a high-fidelity 3D simulation platform by integrating the gym-pybullet-drones environment with 3GPP-compliant RF/THz channel models. Numerical results demonstrate that the proposed framework significantly outperforms state-of-the-art baselines, achieving a 14% increase in transportation efficiency and a 25% improvement in telecommunication throughput. Additionally, it achieves a 23% reduction in physical collision rates, demonstrating strong handover stability and zero-shot generalization in dynamic scenarios.

2605.11508 2026-05-13 cs.CV

LiBrA-Net: Lie-Algebraic Bilateral Affine Fields for Real-Time 4K Video Dehazing

Yongcong Wang, Chengchao Shen, Guangwei Gao, Wei Wang, Pengwen Dai, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng

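AI summary: Ultra-high-definition video dehazing currently lacks an evaluation benchmark, and existing methods struggle to process 4K video in real time on consumer-grade GPUs. This paper proposes LiBrA-Net, which casts dehazing as a per-pixel affine transform driven by the low-frequency depth field and encodes it efficiently with bilateral grids, achieving 4K dehazing at 25 FPS on a single GPU. The paper also releases UHV-4K, the first 4K video dehazing benchmark with depth, transmission, and optical-flow annotations, and reports state-of-the-art results on several datasets.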

Comments 10 pages, 5 figures

Abstract

Currently, there is a gap in the field of ultra-high-definition (UHD) video dehazing due to the lack of a benchmark for evaluation. Furthermore, existing video dehazing methods cannot run on consumer-grade GPUs when processing continuous UHD sequences of 3--5 frames at a time. In this paper, we address both issues with a new benchmark and an efficient method. Our key observation is that atmospheric dehazing reduces to a per-pixel affine transform governed by the low-frequency depth field, which can be compactly encoded in bilateral grids whose prediction cost is decoupled from the output resolution. Building on this, we propose LiBrA-Net, which factorizes the spatiotemporal affine field into a spatial--color and a temporal bilateral sub-grid predicted at a fixed low resolution, fuses their coefficients in the $\mathfrak{gl}(3)$ Lie algebra under group-theoretic regularization, maps the result to invertible GL(3) transforms via a Cayley parameterization, and restores high-frequency detail through a lightweight input-guided branch. We further release UHV-4K, the first paired 4K video dehazing benchmark with depth, transmission, and optical-flow annotations on every frame. Across UHV-4K, REVIDE, and HazeWorld, LiBrA-Net sets a new state of the art among compared video dehazing methods while running native 4K at 25 FPS on a single GPU with only 6.12 M parameters. Code and data are available at https://anonymous.4open.science/r/LiBrA-Net-42B8.
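
One way to read the "per-pixel affine transform" observation is through the standard atmospheric scattering model, I = J·t + A·(1 − t) with transmission t = exp(−β·d). The abstract does not say this is the exact model used, so the sketch below is an assumption; `beta`, `airlight`, and the function names are illustrative. Inverting the model gives J = gain·I + offset, where both coefficients depend only on depth.

```python
import math


def dehaze_affine(depth, airlight, beta=1.0):
    """Per-pixel affine coefficients (gain, offset) that invert the haze model."""
    transmission = math.exp(-beta * depth)
    gain = 1.0 / transmission
    offset = airlight * (1.0 - gain)
    return gain, offset


def apply_affine(gain, offset, hazy_value):
    return gain * hazy_value + offset


# Synthesize a hazy pixel from a known clean value, then recover it.
clean, depth, airlight = 0.2, 1.5, 0.9
t = math.exp(-1.0 * depth)
hazy = clean * t + airlight * (1.0 - t)
gain, offset = dehaze_affine(depth, airlight, beta=1.0)
recovered = apply_affine(gain, offset, hazy)
```

Since the coefficients vary only as smoothly as the depth field, it is plausible to predict them on a coarse grid and upsample, which is the intuition behind decoupling prediction cost from output resolution.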

2605.11506 2026-05-13 cs.CV

Principled Design of Diffusion-based Optimizers for Inverse Problems

Julio Oscanoa, Irmak Sivgin, Cagan Alkan, Daniel Ennis, John Pauly, Mert Pilanci, Shreyas Vasanawala

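AI summary: This paper studies the design of diffusion-based optimizers for inverse problems, targeting their long inference times and laborious hyperparameter tuning. The authors propose principled reparameterizations that let hyperparameters be reused across tasks without re-tuning. Building on the RED-diff framework, which recasts posterior sampling as an optimization problem, they further develop the OptDiff pipeline to accelerate inference and improve image quality. Experiments on image reconstruction, deblurring, and super-resolution show substantial speedups and better image quality.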

Comments 22 pages, 8 figures, 6 tables

Abstract

Score-based diffusion models achieve state-of-the-art performance for inverse problems, but their practical deployment is hindered by long inference times and cumbersome hyperparameter tuning. While pretrained diffusion models can be reused across tasks without retraining, inference-time hyperparameters such as the noise schedule and posterior sampling weights typically require ad-hoc adjustment for each problem setup. We propose principled reparameterizations that induce invariances, allowing the same hyperparameters to be reused across multiple problems without re-tuning. In addition, building on the RED-diff framework, which reformulates posterior sampling as an optimization problem, we further develop the OptDiff pipeline. OptDiff provides a simplified tuning framework that facilitates the integration of convex optimization tools to accelerate inference. Experiments on image reconstruction, deblurring, and super-resolution show substantial speedups and improved image quality.

2605.11504 2026-05-13 cs.LG cs.CR

CTFusion: A CTF-based Benchmark for LLM Agent Evaluation

Dongjun Lee, Ga-eun Bae, Insu Yun

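AI summary: As large language models (LLMs) are applied to increasingly complex tasks, cybersecurity has become an important use case. Existing Capture The Flag (CTF) benchmarks, however, are exposed to data contamination and cheating, which undermines their reliability. This paper presents CTFusion, a streaming evaluation framework built on live CTF events that keeps agents independent under a single team account and limits competition impact by forwarding only the first correct flag per challenge. It is implemented as a Model Context Protocol (MCP) server on the CTFd platform. Experiments indicate that CTFusion is more robust than existing benchmarks, and it has been released as open source to support further research.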

Comments 14 pages, 8 figures

Abstract

Recent advances in Large Language Models (LLMs) have enabled agentic systems for complex, multi-step tasks; cybersecurity is emerging as a prominent application. To evaluate such agents, researchers widely adopt Capture The Flag (CTF) benchmarks. However, current CTF benchmarks reuse existing challenges, which exposes them to data contamination and potential cheating. Notably, we confirmed these issues in practice by integrating web search tools into an existing agent. To address these limitations, we present CTFusion, a streaming evaluation framework built on Live CTFs. To achieve this, CTFusion preserves per-agent independence under a single team account and reduces competition impact by forwarding only the first correct flag per challenge. Moreover, we implement CTFusion as a Model Context Protocol (MCP) server on the widely used CTFd platform, which offers broad applicability to diverse CTF events and agent types. Through experiments with three LLMs, two agents, and five Live CTFs, we demonstrate that existing CTF benchmarks can be unreliable in assessing LLM-based agents, while CTFusion can serve as a robust solution for evaluating cybersecurity agents. We release CTFusion as open source to foster future research in this area.
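
The "forward only the first correct flag per challenge" behavior might look roughly like the sketch below. It is a guess at the mechanism, not CTFusion's implementation: the class, the callback signature, and the assumption that each challenge has a single static flag (so later submissions can be checked locally) are all hypothetical.

```python
class FlagForwarder:
    """Credits every agent locally but contacts the live CTF at most once per solve."""

    def __init__(self, submit_upstream):
        # submit_upstream: callable(challenge, flag) -> bool, e.g. a CTF platform client
        self._submit_upstream = submit_upstream
        self._accepted = {}   # challenge -> flag already accepted upstream
        self.solves = set()   # (agent, challenge) pairs credited locally

    def submit(self, agent, challenge, flag):
        if challenge in self._accepted:
            # Already solved for the team: verify locally, do not touch the scoreboard.
            correct = flag == self._accepted[challenge]
        else:
            correct = self._submit_upstream(challenge, flag)
            if correct:
                self._accepted[challenge] = flag
        if correct:
            self.solves.add((agent, challenge))
        return correct


upstream_calls = []


def fake_platform(challenge, flag):
    """Stand-in for the real scoreboard, used only for this example."""
    upstream_calls.append((challenge, flag))
    return flag == "flag{demo}"


forwarder = FlagForwarder(fake_platform)
first = forwarder.submit("agent-a", "web1", "flag{demo}")
second = forwarder.submit("agent-b", "web1", "flag{demo}")
```

Both agents are credited with the solve, yet the platform sees a single submission, which is how per-agent results can stay independent under one team account.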

2605.11497 2026-05-13 cs.CV

PoseBridge: Bridging the Skeletonization Gap for Zero-Shot Skeleton-Based Action Recognition

Sanghyeon Lee, Jinwoo Kim, Jong Taek Lee

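AI summary: This paper examines semantic alignment in zero-shot skeleton-based action recognition (ZSSAR) and argues that key semantics such as human-object interactions and pose-relative visual cues are already lost by the time alignment begins. It proposes PoseBridge, a framework that extracts pose-anchored semantic cues from intermediate representations of the pose estimation process and passes them to skeleton-text alignment through skeleton-conditioned bridging and semantic prototype adaptation. PoseBridge improves zero-shot recognition on several datasets, most clearly on the Kinetics-200/400 PURLS benchmark.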

Abstract

Zero-shot skeleton-based action recognition (ZSSAR) is typically treated as a skeleton-text alignment problem: encode joint-coordinate sequences, align them with language, and classify unseen actions. We argue that this alignment is often too late. Skeletons are not complete action observations, but compressed outputs of human pose estimation (HPE); by the time alignment begins, human-object interactions and pose-relative visual cues may no longer be explicit. We call this upstream semantic loss. To address it, we propose PoseBridge, an HPE-aware ZSSAR framework that bridges intermediate HPE representations to skeleton-text alignment. Rather than adding an RGB action branch or object detector, PoseBridge extracts pose-anchored semantic cues from the same HPE process that produces skeletons, then transfers them through skeleton-conditioned bridging and semantic prototype adaptation. Across NTU-RGB+D 60/120, PKU-MMD, and Kinetics-200/400, PoseBridge improves ZSSAR performance under the evaluated protocols. On the Kinetics-200/400 PURLS benchmark, which contains in-the-wild videos with diverse scenes and action contexts, PoseBridge shows the clearest separation, improving the strongest compared baseline by 13.3-17.4 points across all eight splits. Our code will be publicly released.

2605.11496 2026-05-13 cs.AI cs.CY cs.HC cs.LG

The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested

Varad Vishwarupe, Nigel Shadbolt, Marina Jirotka, Ivan Flechais

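AI summary: This paper examines how frontier AI models behave differently once they recognize an evaluation context, so that behavior under test may diverge from behavior in deployment and weaken the reliability of safety evaluations. It introduces the Evaluation Differential (ED), defines a normalized effect size (nED) for cross-property comparison, and specifies TRACE, an audit protocol for analyzing and restricting the safety claims drawn from evaluations more rigorously. The work has implications for AI system evaluation and governance.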

Abstract

Recent published evidence from frontier laboratories shows that contemporary AI models can recognise evaluation contexts, latently represent them, and behave differently under those contexts than under deployment-continuous conditions. Anthropic's BrowseComp incident, the Natural Language Autoencoder findings on SWE-bench Verified and destructive-coding evaluations, and the OpenAI / Apollo anti-scheming work all document instances of this phenomenon. We argue that these findings create a claim-validity problem for safety conclusions drawn from frontier evaluations. We introduce the Evaluation Differential (ED), a conditional divergence in a target behavioural property between recognised-evaluation and deployment-continuous contexts, define a normalised effect-size form (nED) for cross-property comparison, and prove that marginal evaluation scores cannot identify ED. We develop a typology of safety claims (ED-stable, ED-degraded, ED-inverted, ED-undetermined) by their warrant-status under documented divergence, and specify TRACE (Test-Recognition Audit for Claim Evaluation), an audit protocol that wraps existing evaluation infrastructure and produces restricted claims rather than capability scores. We apply the framework retrospectively to three publicly documented evaluation incidents and discuss governance implications for system cards, conformity assessment, and the international network of AI safety and security institutes. TRACE does not eliminate adversarial adaptation; it disciplines the claims drawn from evaluation evidence by making explicit the conditions under which that evidence was produced.

2605.11494 2026-05-13 cs.CV

STRIDE: Training-Free Diversity Guidance via PCA-Directed Feature Perturbation in Single-Step Diffusion Models

Ankit Yadav, Arpit Garg, Ta Duc Huy, Lingqiao Liu

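AI summary STRIDE is a training-free, optimization-free method for increasing diversity in single-step diffusion models. It injects noise aligned with the principal components of the model's own activations into intermediate features, giving controllable diversity gains. Because the perturbation follows the model's own feature structure, generated samples become more diverse while remaining high quality. Experiments show that STRIDE improves image diversity across several datasets while maintaining strong text alignment, outperforming existing training-free baselines.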

Comments 11 pages, 3 figures, 4 tables

Abstract

Distilled one-step (T=1) or few-step (T$\leq$4) diffusion models enable real-time image generation but often exhibit reduced sample diversity compared to their multi-step counterparts. In multi-step diffusion, diversity can be introduced through schedules, trajectories, or iterative optimization; however, these mechanisms are unavailable in the few-step or single-step setting, limiting the effectiveness of existing diversity-enhancing methods. A natural alternative is to perturb intermediate features, but naive feature perturbation is often ineffective, either yielding limited diversity gains or degrading generation quality. We argue that effective diversity injection in few-step models requires perturbations that respect the model's learned feature geometry. Based on this insight, we propose STRIDE, a training-free and optimization-free method that operates in a single forward pass. STRIDE injects spatially coherent (pink) noise into intermediate transformer features, projected onto the principal components of the model's own activations, ensuring that perturbations lie on the learned feature manifold. This design enables controlled variation along meaningful directions in the representation space. Extensive experiments on FLUX.1-schnell and SD3.5 Turbo across COCO, DrawBench, PartiPrompts, and GenEval show that STRIDE consistently improves diversity while maintaining strong text alignment. In particular, STRIDE reduces intra-batch similarity with minimal impact on CLIP score, and Pareto-dominates existing training-free baselines on the diversity-fidelity frontier. These results highlight that, in the absence of iterative refinement, improving diversity in few-step and one-step diffusion depends not on increasing perturbation strength, but on aligning perturbations with the model's internal representation structure.

2605.11491 2026-05-13 cs.LG cs.AI

Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

Huimin Xu, Shuai Zhao, Xiaobao Wu, Anh Tuan Luu

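AI summary This paper studies entropy collapse, a common problem in reinforcement learning with verifiable rewards (RLVR). The analysis attributes it to an imbalanced token-level entropy flow, in which entropy-decreasing tokens outweigh entropy-increasing ones. The authors propose On-Policy Entropy Flow Optimization (OPEFO), which adaptively balances the flow by rescaling entropy-increasing and entropy-decreasing updates. Experiments show improved training stability and final performance on mathematical reasoning tasks.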

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become an effective paradigm for improving the reasoning ability of large language models. However, widely used RLVR algorithms, such as GRPO, often suffer from entropy collapse, leading to premature determinism and unstable optimization. Existing remedies, including entropy regularization and ratio-based clipping heuristics, either control entropy in a coarse-grained manner or rely on approximate on-policy training. In this paper, we revisit entropy collapse from a token-level entropy flow perspective. Our analysis reveals that entropy-decreasing tokens consistently outweigh entropy-increasing ones, resulting in a severely imbalanced entropy flow. This perspective provides a unified explanation of entropy collapse in existing RLVR algorithms and highlights the importance of balancing entropy dynamics. Motivated by this analysis, we propose On-Policy Entropy Flow Optimization (OPEFO), an adaptive entropy flow balancing mechanism that rescales entropy-increasing and entropy-decreasing updates according to their contributions to entropy change, while remaining strictly on-policy. Experiments on six mathematical reasoning benchmarks demonstrate that OPEFO improves training stability and final performance. We will release the code and models upon publication.

2605.11483 2026-05-13 cs.CL

StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models

Ishmam Khan, Sindhuja Thogarrati, Shuo Zhang

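AI summary This paper studies philosophical alignment in small language models, focusing on Stoicism's inward-facing virtues and outward-facing cosmopolitan duties. The authors specialize small models on micro-datasets using preference optimization (ORPO, AlphaPO) and find that just 300 high-quality examples yield strong alignment with the inward-facing virtues, close to few-shot prompting. All models, however, perform poorly on Stoicism's outward-facing duties, which points to a representational limitation of small models that micro-dataset optimization alone cannot overcome.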

Abstract

While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on micro-datasets of foundational Stoic texts using preference optimization (ORPO, AlphaPO). Evaluated via a multi-model critic bank, our results show that just 300 high-fidelity examples can induce strong alignment with inward-facing Stoic virtues, closely approaching few-shot prompting while freeing the context window. Critically, however, all models, including few-shot baselines, exhibit a persistent failure on Stoicism's outward-facing cosmopolitan duties, pointing to a representational limitation of small models that micro-dataset adaptation alone cannot overcome.

2605.11479 2026-05-13 cs.RO cs.AI

Offline Policy Evaluation for Manipulation Policies via Discounted Liveness Formulation

Hao Wang, Joshua Bowden, Colton Crosby, Somil Bansal

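AI summary This paper studies offline policy evaluation for robotic manipulation policies under sparse rewards. To handle non-monotonic task progress and the truncation bias caused by finite-length evaluation rollouts, it proposes a framework built on a liveness-based Bellman operator. The method treats policy evaluation as a task-completion problem, yields a value function that is robust to finite-horizon truncation, and comes with theoretical guarantees including contraction. Experiments on simulated manipulation tasks and a cloth-folding task show that it tracks task progress more accurately and reduces truncation bias, outperforming classical baselines such as TD(0) and Monte Carlo policy evaluation.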

Comments Published at RSS 2026

Abstract

Policy evaluation is a fundamental component of the development and deployment pipeline for robotic policies. In modern manipulation systems, this problem is particularly challenging: rewards are often sparse, the task progression of evaluation rollouts is often non-monotonic because the policies exhibit recovery behaviors, and evaluation rollouts are necessarily of finite length. This finite length introduces truncation bias, breaking the infinite-horizon assumptions underlying standard methods that rely on Bellman equations and the principle of optimality. In this work, we propose a framework for offline policy evaluation from sparse rewards based on a liveness-based Bellman operator. Our formulation interprets policy evaluation as a task-completion problem and yields a conservative fixed-point value function that is robust to finite-horizon truncation. We analyze the theoretical properties of the proposed operator, including contraction guarantees, and show how it encodes task progression while mitigating truncation bias. We evaluate our method on two simulated manipulation tasks using both a Vision-Language-Action model and a diffusion policy, and on a cloth folding task using human demonstrations. Empirical results demonstrate that our approach more accurately reflects task progress and substantially reduces truncation bias, outperforming classical baselines such as TD(0) and Monte Carlo policy evaluation.

2605.11478 2026-05-13 cs.AI cs.IT math.IT stat.ML

FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

Namyoon Lee, Yongjune Kim

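AI summary This paper proposes FibQuant, a universal fixed-rate vector quantizer for random-access KV-cache compression, aimed at the memory and traffic bottleneck of long-context inference. It keeps the normalize-rotate-store interface but replaces the usual scalar tables with a shared radial-angular codebook matched to the canonical source, preserving the geometry created by the normalization step and improving compression efficiency. Experiments show higher compression ratios at high attention similarity and better results than existing scalar quantization methods on the evaluated models.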

Comments 15 pages

Abstract

Long-context inference is increasingly a memory-traffic problem. The culprit is the key--value (KV) cache: it grows with context length, batch size, layers, and heads, and it is read at every decoding step. Rotation-based scalar codecs meet this systems constraint by storing a norm, applying a shared random rotation, and quantizing one coordinate at a time. They are universal and random-access, but they discard the geometry created by the normalization step. After a Haar rotation, a block of $k$ consecutive coordinates is not a product source; it is a spherical-Beta source on the unit ball. We introduce FibQuant, a universal fixed-rate vector quantizer that keeps the same normalize--rotate--store interface while replacing scalar tables by a shared radial--angular codebook matched to this canonical source. The codebook combines Beta-quantile radii, Fibonacci / Roberts--Kronecker quasi-uniform directions, and multi-restart Lloyd--Max refinement. We prove that the resulting vector code strictly improves on its scalar product specialization at matched rate, with a high-rate gain that separates into a cell-shaping factor and a density-matching factor. The same construction gives a dense rate axis, including fractional-bit and sub-one-bit operating points, without calibration or variable-length addresses. On GPT-2 small KV caches, FibQuant traces a memory--fidelity frontier from $5\times$ compression at $0.99$ attention cosine similarity to $34\times$ at $0.95$. End-to-end on TinyLlama-1.1B, it is within $0.10$ perplexity of fp16 at $4\times$ compression and has $3.6\times$ lower perplexity than scalar TurboQuant at $b = 2$ ($8\times$ compression), where scalar random-access quantization begins to fail.
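
The quasi-uniform directions mentioned in the abstract can be illustrated with the textbook Fibonacci lattice on the ordinary 2-sphere. This is only a minimal sketch for a block size of 3. The function names, the codebook size of 64, and the nearest-direction lookup are assumptions made for illustration, and the sketch omits the paper's Beta-quantile radii, Lloyd--Max refinement, and general block dimension $k$.

```python
import math


def fibonacci_sphere(n):
    """Return n quasi-uniform unit vectors on the 2-sphere (Fibonacci lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))  # about 2.39996 rad
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n          # evenly spaced heights in (-1, 1)
        r = math.sqrt(max(0.0, 1.0 - z * z))   # radius of the circle at height z
        theta = golden_angle * i               # rotate by the golden angle each step
        points.append((r * math.cos(theta), r * math.sin(theta), z))
    return points


def nearest_direction(v, codebook):
    """Index of the codebook direction with the largest inner product with v."""
    return max(
        range(len(codebook)),
        key=lambda j: sum(a * b for a, b in zip(v, codebook[j])),
    )


# Hypothetical usage: a 64-entry direction codebook, i.e. a 6-bit index per block.
codebook = fibonacci_sphere(64)
index = nearest_direction((0.0, 0.0, 1.0), codebook)
```

Storing only `index` keeps the code fixed-rate and random-access, which is the property the abstract emphasizes. How the paper pairs directions with radii is not specified here.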

2605.11477 2026-05-13 cs.CV

LDDR: Linear-DPP-Based Dynamic-Resolution Frame Sampling for Video MLLMs

Jingfeng Chen, Jiawen Qian, Wendi Deng, Yinuo Guo, Jiaqi Yu, Sicong Leng, Raghuveer Thirukovalluru, Bhuwan Dhingra

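AI summary Video understanding in multimodal large language models requires selecting informative frames from long videos under a limited visual-token budget. This paper proposes LDDR, a dynamic-resolution frame sampling framework based on a linear determinantal point process (DPP). It performs query-aware frame selection in a task-conditioned feature space and runs 3x faster than standard DPP baselines. LDDR introduces a Group DPP importance metric that guides frame retention and dynamic resolution allocation, and it outperforms existing methods on multiple video benchmarks.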

Comments 21 pages, 4 figures

Abstract

Video understanding in multimodal large language models requires selecting informative frames from long, redundant videos under limited visual-token budgets. Existing methods often rely on uniform sampling, point-wise relevance scoring, chunk-wise selection, or agentic exploration, which either miss global dependencies or introduce substantial overhead. We propose LDDR (Linear DPP-Based Dynamic Resolution), a training-free, plug-and-play, and budget-aware video frame sampling framework. LDDR performs query-aware Determinantal Point Process (DPP) frame selection in a task-conditioned feature space, achieving a 3x runtime speedup over standard DPP baselines. It further introduces a Group DPP importance metric to guide frame retention and dynamic resolution allocation, assigning more tokens to informative, non-redundant frames while downscaling or pruning less useful ones. Across four video benchmarks spanning short-, medium-, and long-range videos, LDDR consistently outperforms the next-best baselines, achieving gains of 2.5 points under budget-constrained settings and 1.6 points in high-budget scenarios. These improvements are consistently observed across multiple MLLM backbones, including both open- and closed-source models. Qualitative analysis confirms that relevant frames are selected and allocated a higher budget, facilitating improved video understanding.
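
To make the idea of query-aware DPP frame selection concrete, here is a minimal sketch of standard greedy MAP inference for a DPP with a quality-weighted similarity kernel. It is not LDDR's linear variant or its Group DPP metric, whose details the abstract does not give. The kernel form `relevance[i] * relevance[j] * <f_i, f_j>`, the function names, and the toy features are assumptions for illustration.

```python
import math


def build_kernel(features, relevance):
    """L[i][j] = q_i * q_j * <f_i, f_j>: relevance-weighted similarity (assumed form)."""
    n = len(features)
    return [
        [
            relevance[i] * relevance[j]
            * sum(a * b for a, b in zip(features[i], features[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


def greedy_dpp(kernel, k):
    """Greedy MAP selection of up to k items using incremental Cholesky updates."""
    n = len(kernel)
    gains = [kernel[i][i] for i in range(n)]  # current marginal gain of each item
    chol = [[] for _ in range(n)]             # partial Cholesky row per item
    selected = []
    for _ in range(min(k, n)):
        remaining = [i for i in range(n) if i not in selected]
        j = max(remaining, key=lambda i: gains[i])
        if gains[j] <= 1e-12:                 # nothing non-redundant is left
            break
        selected.append(j)
        root = math.sqrt(gains[j])
        for i in remaining:
            if i == j:
                continue
            dot = sum(a * b for a, b in zip(chol[j], chol[i]))
            e = (kernel[j][i] - dot) / root
            chol[i].append(e)
            gains[i] -= e * e                 # redundancy with j lowers the gain
    return selected


# Toy frames: frames 0 and 1 are duplicates, frame 2 is different but less relevant.
features = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
relevance = [1.0, 0.9, 0.8]
selected = greedy_dpp(build_kernel(features, relevance), k=2)
```

In this toy case the duplicate frame is skipped in favour of the less relevant but non-redundant one, which is the behaviour that point-wise relevance scoring cannot provide.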

2605.11475 2026-05-13 cs.CV

Deep Probabilistic Unfolding for Quantized Compressive Sensing

Gang Qu, Ping Wang, Siming Zheng, Xin Yuan

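AI summary This paper proposes a deep probabilistic unfolding model for quantized compressive sensing, using an unfolding framework to improve reconstruction accuracy and efficiency. Instead of the L2 projection used by earlier methods, it derives a closed-form, numerically stable likelihood gradient projection that respects the true quantization physics and turns the hard quantization constraint into soft probabilistic guidance. It also designs an efficient dual-domain Mamba module that dynamically captures and fuses multi-scale local and global features, strengthening interactions between distant but correlated regions. Experiments show state-of-the-art performance, which may help bring quantized compressive sensing into practical use.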

Abstract

We propose a deep probabilistic unfolding model for the classical quantized compressive sensing problem, leveraging an unfolding framework to enhance reconstruction accuracy and efficiency. Unlike previous unfolding methods that apply an L2 projection to the measurements, we derive a closed-form, numerically stable likelihood gradient projection, which allows the model to respect the true quantization physics and turns the hard quantization constraint into soft probabilistic guidance. Furthermore, an efficient dual-domain Mamba module is designed to dynamically capture and fuse multi-scale local and global features, ensuring interaction between distant but correlated regions. Extensive experiments demonstrate state-of-the-art performance over previous works, which can help promote the application of quantized compressive sensing in practice.

2605.11473 2026-05-13 cs.AI cs.LG cs.RO stat.ML

TOPPO: Rethinking PPO for Multi-Task Reinforcement Learning with Critic Balancing

Yuanpeng Li, Gefei Lin, Annie Qu, Rui Miao

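AI summary This paper studies on-policy PPO for multi-task reinforcement learning and finds that its critic-side gradients are ill-conditioned in the multi-task setting, which can cause some tasks to stall. The authors propose TOPPO, which adds Critic Balancing modules to improve gradient conditioning and balance learning across tasks. With fewer parameters and environment steps, TOPPO outperforms published SAC-family and ARS-family baselines, with stronger mean and tail-task performance on a multi-task benchmark, showing that properly optimized on-policy methods can rival or exceed off-policy ones.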

Abstract

Soft Actor-Critic (SAC) and its variants dominate Multi-Task Reinforcement Learning (MTRL) due to their off-policy sample efficiency, while on-policy methods such as Proximal Policy Optimization (PPO) remain underexplored. We diagnose that PPO in MTRL suffers from a previously overlooked issue: critic-side gradient ill-conditioning, which may cause tail tasks to stall while easy tasks dominate the value function's updates. To address this, we propose TOPPO (Tail-Optimized PPO), a reformulation of PPO via Critic Balancing -- a set of modules that improve gradient conditioning and balance learning dynamics across tasks. Unlike prior approaches that rely on modular architectures or large models, TOPPO targets the optimization bottleneck within PPO itself. Empirically, TOPPO achieves stronger mean and tail-task performance than published SAC-family and ARS-family baselines while using substantially fewer parameters and environment steps on the Meta-World+ benchmark. Notably, TOPPO matches or surpasses strong SAC baselines early in training and maintains superior performance at full budget. Ablations confirm the effectiveness of each module in TOPPO and provide insights into their interactions. Our results demonstrate that, with proper optimization, on-policy methods can rival or exceed off-policy approaches in MTRL, challenging the prevailing reliance on SAC and highlighting critic-side gradient conditioning as the central bottleneck.

2605.11471 2026-05-13 cs.LG

On the Approximation Complexity of Matrix Product Operator Born Machines

Chao Li, Zerui Tao, Yuchen Cong, Jian Xu, Qibin Zhao

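AI summary This paper studies the approximation complexity of matrix product operator Born machines (MPO-BMs) for probabilistic modeling. It characterizes the boundary of their approximation capability from both negative and positive directions, proving that KL approximation is NP-hard in the continuous setting, so universal efficient approximation is impossible in the worst case. Under locality and spectral-gap conditions, structured targets such as path-graph Markov random fields admit MPO-BM approximations with polynomial bond dimension and provable KL guarantees, and polynomially many score queries suffice to estimate the induced Hamiltonian and obtain those guarantees. The results clarify when MPO-BMs are hard to approximate and when they are efficiently learnable.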

Abstract

Matrix product operator Born machines (MPO-BMs) are tractable tensor-network models for probabilistic modeling, but their efficient approximation capability remains unclear. We characterize this boundary from both negative and positive perspectives. First, we prove that KL approximation is NP-hard for MPO-BMs in the continuous setting, ruling out universal efficient approximation in the worst case. Second, for score-based variational inference, we show that, under locality and spectral-gap conditions on the loss-induced Hamiltonian, structured targets (e.g., path-graph Markov random fields) admit MPO-BM approximations with polynomial bond dimension and provable KL guarantees. Third, under the same locality structure, we prove that polynomially many score queries suffice to estimate the induced Hamiltonian and obtain such guarantees. Our results provide a theoretical characterization of when MPO-BMs are fundamentally hard to approximate and when they become efficiently learnable.

2605.11469 2026-05-13 cs.LG

Robust Multi-Agent Path Finding under Observation Attacks: A Principled Adversarial-Plus-Smoothing Training Recipe

Riad Ahmed

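AI summary This paper studies how to make multi-agent path finding (MAPF) robust to observation attacks. It proposes a training recipe that combines adversarial training with a smoothing objective: worst-case input perturbations are applied during training, and a randomized-smoothing term further stabilizes the policy. Experiments show a large gain in success rate under strong attacks while performance on clean observations is almost unchanged.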

Abstract

Decentralized multi-agent path finding (MAPF) routes a team of agents on a shared grid, each acting from its own local view. The standard solution trains one shared neural policy with Proximal Policy Optimization (PPO), a popular on-policy reinforcement learning algorithm. Such a policy works well on clean observations, but a small input perturbation on one agent often changes its action, which then blocks a neighbour, and the team jams. In this paper we present two training recipes that keep the same network and the same deployment loop, yet make the policy hold up under perturbed observations. The first recipe, Adv-PPO, trains the shared policy against worst-case perturbations of its own input and selects the checkpoint by performance under adversarial perturbation. The second recipe, Adv-PPO+MACER, fine-tunes that checkpoint with a small on-policy smoothness term whose gradient follows the certified radius of randomized smoothing. On POGEMA with 8x8 maps and four agents, the unprotected PPO policy reaches 95.8% clean success but only 2.5% under the strongest attack. Adv-PPO recovers worst-case success to 59.2% at one percentage point of clean cost. Adv-PPO+MACER recovers it to 77.5% +/- 6.0% across three independent seeds at less than one percentage point of clean cost. We support these numbers with per-attack curves, a certified action-stability sanity check (which measures the smoothed-policy wrapper, not the deployed argmax policy), and side-by-side rollout storyboards that show the failure mode and the fix inside one environment instance.
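
The smoothed-policy wrapper referred to in the sanity check can be pictured as a majority vote over Gaussian-perturbed observations, which is the usual prediction rule in randomized smoothing. The sketch below is a generic illustration under that assumption. The toy policy, `sigma`, and the sample count are hypothetical and are not taken from the paper, and no certified radius is computed.

```python
import random
from collections import Counter


def smoothed_action(policy, observation, sigma=0.25, num_samples=200, seed=0):
    """Majority-vote action of `policy` under Gaussian observation noise."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(num_samples):
        noisy = [x + rng.gauss(0.0, sigma) for x in observation]
        votes[policy(noisy)] += 1
    action, count = votes.most_common(1)[0]
    return action, count / num_samples  # winning action and its vote share


def toy_policy(obs):
    """Hypothetical stand-in policy: move right when the first feature is positive."""
    return "right" if obs[0] > 0.0 else "left"


action, vote_share = smoothed_action(toy_policy, [1.0, -0.3])
```

A vote share close to 1 suggests the action is stable under perturbations of this scale. As the abstract notes, such a check describes the smoothed wrapper rather than the deployed argmax policy.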

2605.11468 2026-05-13 cs.AI

CAMPA: Efficient and Aligned Multimodal Graph Learning via Decoupled Propagation and Aggregation

Daohan Su, Hao Liu, Xunkai Li, Yinlin Zhu, Xiong Yongfu, Yi Liu, Hongchao Qin, Rong-Hua Li, Guoren Wang

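AI summary This paper proposes CAMPA, a cross-modal aligned multimodal graph learning framework that addresses the modal conflict existing decoupled multimodal graph neural networks face in the propagation and aggregation stages. CAMPA uses a two-stage alignment mechanism: during propagation it injects cross-modal similarity priors to preserve semantic consistency, and during aggregation it applies trajectory-level self-attention and cross-attention to align feature trajectories across modalities and hops. Experiments show that CAMPA outperforms existing coupled and decoupled methods on multiple benchmark datasets while remaining computationally efficient.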

Abstract

Multimodal Graph Neural Networks (MGNNs) have shown strong potential for learning from multimodal attributed graphs, yet most existing approaches rely on tightly coupled architectures that suffer from prohibitive computational overhead. In this paper, we present a systematic empirical analysis showing that decoupled MGNNs are substantially more efficient and scalable for large-scale graph learning. However, we identify a critical bottleneck in existing decoupled pipelines, namely modal conflict, which arises in both the propagation and aggregation stages. Specifically, independent multi-hop diffusion causes cross-modal semantic divergence during propagation, while naive fusion fails to align multi-hop feature trajectories during aggregation, jointly limiting effective representation learning. To address this challenge, we propose CAMPA, a Cross-modal Aligned Multimodal Propagation & Aggregation framework for decoupled multimodal graph learning. Concretely, CAMPA introduces a two-stage alignment mechanism: (1) cross-modal aligned propagation, which injects cross-modal similarity priors into message passing to preserve semantic consistency without additional parameter overhead; (2) trajectory aligned aggregation, which leverages trajectory-level self-attention and cross-attention to capture and align long-range dependencies across modalities and hops. Extensive experiments on diverse benchmark datasets and tasks demonstrate that CAMPA consistently outperforms strong coupled and decoupled baselines while preserving the efficiency advantages of the decoupled paradigm.

2605.11467 2026-05-13 cs.LG cs.AI

Drop the Act: Probe-Filtered RL for Faithful Chain-of-Thought Reasoning

Swapnil Parekh

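AI summary This work proposes ProFIL, a method for reducing "reasoning theater" in the chain-of-thought of large language models, meaning steps that look deliberative but add nothing to correctness because the model has already committed to an answer. ProFIL trains a multi-head attention probe on the frozen base model to detect such steps and suppresses them during GRPO reinforcement learning, which improves chain-of-thought faithfulness and shortens chains while preserving or improving task accuracy. Experiments report gains across several reasoning tasks and model architectures.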

Abstract

Reasoning models post-hoc rationalize answers they have already committed to internally, producing chains of reasoning theater: deliberative-looking steps that contribute nothing to correctness. This wastes inference tokens, pollutes interpretability, and obscures what the model actually computed. We introduce ProFIL (Probe-Filtered Reinforcement Learning) to reduce theater, increase chain-of-thought faithfulness, and shrink chain length in a single, drop-in extension to Group Relative Policy Optimization (GRPO). A multi-head attention probe is trained once on the frozen base model to detect post-commitment steps from internal activations alone; during GRPO, rollouts whose probe score exceeds a threshold have their advantage zeroed. Our central finding is that a probe trained on a frozen base, with verifier-derived labels and no human annotation, provides a stable signal that suppresses theater while resisting the RL-obfuscation failure mode predicted by prior work. Across four reasoning domains (GSM8K, LiveCodeBench, ToolUse, MMLU-Redux) and two model architectures (Llama-8B, Qwen-7B), ProFIL reduces post-commitment theater by 11--100%, raises faithful-fraction (e.g., +24pp on LiveCodeBench under an independent Claude 3.7 Sonnet judge), and shortens chains by 4--19%, all while preserving or improving task accuracy. ProFIL also beats a matched length-penalty GRPO baseline, isolating the gain as semantic commitment-detection rather than chain compression. Probe weights, training configurations, and rollouts are released across all four domains.
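
The filtering rule stated in the abstract (rollouts whose probe score exceeds a threshold have their advantage zeroed) can be sketched in a few lines. This is a minimal illustration, not the released implementation. It assumes the common GRPO convention of normalizing rewards by the group mean and standard deviation, and the function names, the threshold of 0.5, and the example scores are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one rollout group (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def probe_filtered_advantages(rewards, probe_scores, threshold=0.5):
    """Zero the advantage of any rollout whose probe score exceeds the threshold."""
    advantages = group_relative_advantages(rewards)
    return [
        0.0 if score > threshold else adv
        for adv, score in zip(advantages, probe_scores)
    ]


# Hypothetical group of four rollouts for one prompt.
rewards = [1.0, 0.0, 1.0, 0.0]
probe_scores = [0.9, 0.1, 0.2, 0.3]  # probe's estimate of post-commitment theater
filtered = probe_filtered_advantages(rewards, probe_scores)
```

Here the first rollout is correct but flagged by the probe, so it contributes no gradient, while the unflagged correct rollout keeps its positive advantage.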