arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10195 2026-05-15 cs.LG

Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

Shuzhang Zhong, Haochen Huang, Shengxuan Qiu, Pengfei Zuo, Runsheng Wang, Meng Li

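Affiliations: Institute for Artificial Intelligence, Peking University; School of Integrated Circuits, Peking University; ByteDance Seed

AI summary: Tree-of-Thought (ToT) reasoning improves large language model performance on complex tasks through a tree-structured search, but its efficiency is limited by the synchronization bottleneck that the reward dependency barrier imposes. This paper proposes SPEX, which breaks that barrier through speculative exploration and introduces three key techniques: path selection, resource allocation, and early termination. Experiments show speedups of 1.2x to 3x across ToT algorithms and models, rising to as much as 4.1x when combined with token-level speculative decoding.

Comments OSDI 2026

Abstract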

Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier -- a synchronization bottleneck caused by sequential reward-guided exploration that limits search parallelism and introduces substantial latency. Prior system optimizations, mainly designed for linear Chain-of-Thought (CoT) reasoning, cannot address these challenges, leaving the efficiency of ToT underexplored. To enhance ToT reasoning efficiency, we observe that the reasoning paths can be explored speculatively to break the reward synchronization barrier. Therefore, in this paper, we propose SPEX and introduce three key techniques: (i) intra-query speculative path selection to predict and expand high-potential branches of ToT, (ii) inter-query budget allocation to balance speculative resource allocation across queries dynamically, and (iii) adaptive early termination to prune deep and redundant branches for a skewed search tree. We implement SPEX on top of the SGLang framework and evaluate it across diverse ToT algorithms and LLMs. Extensive experiments show that SPEX achieves $1.2 \sim 3 \times$ speedup for different ToT reasoning algorithms. Moreover, SPEX synergizes with token-level speculative decoding, achieving cumulative speedups of up to $4.1\times$. Ablation studies further confirm the contributions of each technique. Overall, SPEX represents a significant step toward efficient and scalable ToT reasoning, unlocking the parallelism required for high-performance inference-time scaling for LLMs.

2605.09825 2026-05-15 cs.LG cs.AI

Pretraining large language models with MXFP4 on Native FP4 Hardware

Musa Cim, Poovaiah Palangappa, Miro Hodak, Ravi Dwivedula, Meena Arunachalam, Mahmut Taylan Kandemir

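Affiliations: The Pennsylvania State University; Advanced Micro Devices, Inc.

AI summary: This paper studies the training instability that arises when pretraining large language models with MXFP4 quantization on native FP4 hardware. Controlled experiments that progressively enable FP4 in forward propagation, activation gradients, and weight gradients show that quantizing weight gradients is the main cause of degraded convergence. The study further shows that deterministic Hadamard rotations restore stable optimization while randomized methods do not, indicating that the instability stems from structured micro-scaling errors along sensitive gradient paths rather than from insufficient stochasticity. Experiments run on AMD Instinct MI355X GPUs, so the conclusions are verified without software emulation.

Abstract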

Why does full-pipeline FP4 training of large language models often diverge, even when forward activations and activation gradients remain stable? We address this question through a controlled study of MXFP4 quantization in transformer training, progressively enabling FP4 across forward propagation (Fprop), activation gradients (Dgrad), and weight gradients (Wgrad) while holding all other factors fixed. In full pretraining of Llama 3.1-8B on the C4 dataset, we observe that quantizing Wgrad is the primary driver of convergence degradation, whereas FP4 in Fprop and Dgrad alone introduces only modest additional token requirements. To interpret this behavior, we evaluate both structured and stochastic interventions under a controlled experimental setting. We find that stochastic rounding and randomized Hadamard rotations fail to stabilize training once Wgrad is quantized, whereas deterministic Hadamard rotations consistently restore stable optimization. These results suggest that FP4 training instability is driven by structured micro-scaling errors along sensitive gradient paths, rather than by insufficient stochasticity. We run experiments with native MXFP4 support on AMD Instinct MI355X GPUs, enabling controlled investigation of these effects without reliance on software emulation.
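As a rough illustration of the two ingredients the abstract names, the sketch below fake-quantizes values in an MXFP4-like layout and applies an orthonormal Hadamard rotation. It is a minimal sketch, not the paper's pipeline: the block size of 32, the E2M1 value grid, and the power-of-two scale rule are assumptions based on the usual description of the microscaling formats, tie-breaking in rounding is simplified, and the function names are hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (assumed grid; sign handled separately).
FP4_E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # assumed number of elements sharing one scale
ELEM_EMAX = 2     # exponent of the largest power of two on the grid (4.0 == 2**2)


def quantize_block(block):
    """Fake-quantize one block: shared power-of-two scale plus FP4 elements."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Assumed scale rule: a power of two derived from the block maximum.
    scale = 2.0 ** (math.floor(math.log2(amax)) - ELEM_EMAX)
    out = []
    for v in block:
        mag = min(abs(v) / scale, FP4_E2M1_GRID[-1])  # clamp to the largest grid value
        nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))  # ties go to the smaller value
        out.append(math.copysign(nearest * scale, v))
    return out


def quantize_mxfp4(values):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(quantize_block(values[start:start + BLOCK_SIZE]))
    return out


def hadamard_rotate(vec):
    """Orthonormal Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    h = 1
    while h < len(out):
        for i in range(0, len(out), 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(len(out))
    return [v / norm for v in out]
```

A deterministic rotation in this sense would apply `hadamard_rotate` to each block before `quantize_mxfp4`, spreading outliers across the block so that the shared scale wastes less of the FP4 grid. Where the paper applies the rotation within the Wgrad computation is not stated in the abstract.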

2605.09094 2026-05-15 cs.LG

A Tale of Two Problems: Multi-Task Bilevel Learning Meets Equality Constrained Multi-Objective Optimization

Zhiyao Zhang, Myeung Suk Oh, Zhen Qin, Jiaxiang Li, Xin Zhang, Jia Liu

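Affiliations: Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA; Meta Platforms, Inc., Menlo Park, CA, USA

AI summary: This paper studies multi-task bilevel learning (MTBL) and, for the first time under a relaxed lower-level general convexity assumption, reformulates it as an equality constrained multi-objective optimization (ECMO) problem. To solve this new problem, the authors propose a KKT-based Pareto stationarity convergence criterion and design a weighted Chebyshev penalty algorithm with finite-time convergence in both deterministic and stochastic settings. The method systematically explores the Pareto front, and solutions of the ECMO problem correspond directly to solutions of the original problem, establishing a theoretical link between bilevel and multi-objective optimization.

Abstract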

In recent years, bilevel optimization (BLO) has attracted significant attention for its broad applications in machine learning. However, most existing works on BLO remain confined to the single-task setting and rely on the lower-level strong convexity assumption, which significantly restricts their applicability to modern machine learning problems of growing complexity. In this paper, we make the first attempt to extend BLO to the multi-task setting under a relaxed lower-level general convexity (LLGC) assumption. To this end, we reformulate the multi-task bilevel learning (MTBL) problem with LLGC into an equality constrained multi-objective optimization (ECMO) problem. However, ECMO itself is a new problem that has not yet been studied in the literature. To address this gap, we first establish a new Karush-Kuhn-Tucker (KKT)-based Pareto stationarity as the convergence criterion for ECMO algorithm design. Based on this foundation, we propose a weighted Chebyshev (WC)-penalty algorithm that achieves a finite-time convergence rate of $O(ST^{-\frac{1}{2}})$ to KKT-based Pareto stationarity in both deterministic and stochastic settings, where $S$ denotes the number of objectives and $T$ is the total number of iterations. Moreover, by varying the preference vector over the $S$-dimensional simplex, our WC-penalty method systematically explores the Pareto front. Finally, solutions to the ECMO problem translate directly into solutions for the original MTBL problem, thereby closing the loop between these two foundational optimization frameworks.
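To make the idea of a weighted Chebyshev penalty concrete, here is a minimal sketch of one common form: the weighted maximum of the objectives plus a quadratic penalty on the equality constraint. The exact objective used in the paper is not given in the abstract, so the quadratic penalty, the penalty weight `rho`, the toy objectives `f1` and `f2`, and the constraint `h` are all assumptions for illustration.

```python
def wc_penalty(objectives, constraint, weights, rho):
    """Return x -> max_s w_s * f_s(x) + rho * h(x)**2 (an illustrative form only)."""
    def value(x):
        scalarized = max(w * f(x) for w, f in zip(weights, objectives))
        return scalarized + rho * constraint(x) ** 2
    return value


# Toy problem with two objectives over x = (a, b) and the equality constraint a == b.
def f1(x):
    return (x[0] - 1) ** 2 + (x[1] - 1) ** 2


def f2(x):
    return (x[0] + 1) ** 2 + (x[1] + 1) ** 2


def h(x):
    return x[0] - x[1]


# Coarse grid search stands in for the gradient-based solver.
GRID = [(i / 10, j / 10) for i in range(-10, 11) for j in range(-10, 11)]


def grid_argmin(value):
    """Return the grid point with the smallest scalarized value."""
    return min(GRID, key=value)
```

With equal preferences, `grid_argmin(wc_penalty([f1, f2], h, [0.5, 0.5], 10.0))` lands on the balanced point `(0.0, 0.0)`. Shifting weight toward one objective moves the minimizer toward that objective's own optimum, which is how sweeping the preference vector over the simplex traces out the Pareto front.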

2605.09038 2026-05-15 cs.AI

SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

Jinchao Hu, Meizhi Zhong, Kehai Chen, Min Zhang

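Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; TikTok Inc, Beijing

AI summary: This paper proposes SearchSkill, a framework that teaches large language models to use search tools more effectively, particularly in open-domain question answering. It plans queries explicitly through a reusable bank of search skills: at each step the model first selects a skill, then generates a search or answer action conditioned on it. The skill bank evolves and is refined from failure patterns observed during training, improving search efficiency and answer accuracy. Experiments show that SearchSkill raises exact match on several knowledge-intensive QA benchmarks and improves search behavior, with fewer copied initial queries, more focused queries, and more accurate answers under a limited search budget.

Abstract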

Teaching language models to use search tools is not only a question of whether they search, but also of whether they issue good queries. This is especially important in open-domain question answering, where broad or copied queries often waste retrieval budget and derail later reasoning. We propose SearchSkill, a framework that makes query planning explicit through reusable search skills. At each step, the model first selects a skill, then generates a search or answer action conditioned on the selected skill card. The skill inventory itself is not fixed: SearchSkill maintains an evolving SkillBank, expands or refines it from recurrent failure patterns, and reconstructs affected trajectories before supervised training. The resulting two-stage SFT recipe aligns training with the inference-time protocol of skill selection followed by skill-grounded execution. Across open-source and closed-source models, SearchSkill improves exact match on knowledge-intensive QA benchmarks and yields better retrieval behavior, including fewer copied first queries, more atomic hop-focused queries, and more correct answers within a small search budget. These results suggest that explicit skill-conditioned query planning is a lightweight alternative to treating search as an undifferentiated action.

2605.09028 2026-05-15 cs.LG

Diagnosing and Mitigating Domain Shift in Permission-Based Android Malware Detection

Md Rafid Islam

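Affiliations: Department of Electrical and Computer Engineering, North South University

AI summary: This paper studies the performance degradation of permission-based Android malware detectors under domain shift. Using two complementary datasets and five ensemble classifiers, it reveals a marked asymmetry in cross-domain performance and finds that feature importance is highly unstable across domains. It further proposes a hybrid training strategy based on common features that improves cross-domain detection, offering guidance for building robust malware detection systems.

Abstract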

Machine learning-based Android malware detectors often fail in real-world deployment due to domain shift, where models trained on one data source perform poorly on applications from another. This paper presents a comprehensive study on the generalizability and interpretability of permission-based detectors under cross-domain conditions. Using two complementary datasets (PerMalDroid and NATICUSdroid) and five ensemble classifiers, we first establish an intra-domain baseline, where models achieve over 92% accuracy, and then quantify a severe asymmetric performance drop. While models trained on PerMalDroid generalize well to NATICUSdroid (86% accuracy), the reverse direction sees a drastic drop to 73% accuracy. Explainable AI analysis reveals bimodal feature distributions and shows that feature importance is highly unstable, with key permissions losing or gaining influence across domains. The predictive feature sets for different domains are fundamentally mismatched, as models rely on different, dataset-specific permissions. Most importantly, an ablation study demonstrates that for most models, training on a noisy feature set leads to poor generalization, confirming that domain-specific artifacts are a greater obstacle than missing features. To mitigate this, we validate a hybrid training strategy based on the intersection of common features and successfully recover cross-domain performance, achieving 88% accuracy on PerMalDroid and maintaining 97% on NATICUSdroid. These findings highlight the importance of explainable, cross-domain-robust malware detection systems and provide a practical pathway toward improving real-world deployment of permission-based Android malware detectors.

2605.09027 2026-05-15 cs.CL cs.AI cs.LG cs.MA

GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Alexandre Le Mercier, Chris Develder, Thomas Demeester

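Affiliations: IDLab–T2K, Ghent University–imec

AI summary: In multi-agent systems, a single deceptive agent can destroy the performance of the whole collective and bypass its defenses. To address gaps in existing adversarial robustness evaluation, this paper proposes the GAMBIT benchmark, with three evaluation modes and two independent scores for assessing imposter detectors, focusing on how they adapt under distribution shift and novel attacks. Built on chess, GAMBIT introduces a generalizable adaptive deceptive agent and provides 27,804 labeled instances. It shows that zero-shot evaluation can be misleading against adaptive adversaries and demonstrates the effectiveness of fast recalibration in adversarial systems.

Comments 46 pages, 16 figures

Abstract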

In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.

2605.08913 2026-05-15 cs.LG cs.AR cs.CL cs.PF

Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

Willy Fitra Hendria

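Affiliations: Independent Researcher

AI summary: This paper studies non-monotonic latency in transformer decoding on the Apple MPS backend: latency does not grow smoothly with decoding length but rises sharply in certain configurations. Experiments across several model families find latency spikes of up to 21x the normal level. The phenomenon occurs mainly in the decode phase, is not explained by memory pressure alone, and does not appear on CPU or NVIDIA CUDA backends. The study also reveals a complex interaction between the key-value (KV) cache and the anomalous execution regimes, underscoring how hardware characteristics affect long-context inference performance.

Comments 9 pages, 5 figures, 6 tables

Abstract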

Autoregressive inference is typically assumed to scale predictably with decoding length, with latency increasing smoothly as generated sequence length grows. In this work, we identify unexpected non-monotonic latency behavior in the Apple MPS backend, where latency changes abruptly across nearby decoding configurations during transformer decoding. Using multiple model families (GPT-2, BLOOM, and OPT), we observe latency spikes of up to 21x within specific decoding-budget intervals, followed by recovery at neighboring configurations. Controlled experiments show that these anomalies originate primarily during the decode phase rather than prefill, are not explained by memory pressure alone, and remain absent on CPU and NVIDIA CUDA backends under identical conditions. We further show that key-value (KV) cache interacts strongly with these pathological execution regimes: KV caching remains beneficial overall, but its practical speedup collapses sharply within anomalous configurations, while cache-disabled decoding still exhibits residual non-monotonic behavior. These findings suggest that autoregressive decoding on MPS enters discrete execution regimes that are not captured by coarse-grained benchmarking, highlighting the importance of hardware-aware evaluation for long-context inference.

2605.08888 2026-05-15 cs.CL cs.CV

DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding

Xiang Feng, Jiawei Zhou, Zhangfeng Huang, Kewei Wang, Shanshan Ye, Jinxin Hu, Zulong Chen, Yong Luo, Jing Zhang

Affiliations: School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; Alibaba Group, Hangzhou, China; Independent Researcher; Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates

AI summary: DocScope is a benchmark for evaluating verifiable reasoning by multimodal large language models over long, visually rich documents. It recasts long-document QA as structured reasoning trajectory prediction, requiring the model to output evidence pages, supporting regions, relevant factual statements, and a final answer, and audits the reasoning through a four-stage evaluation protocol. Experiments show that answer accuracy alone cannot fully assess model reliability: rates of complete evidence chains are generally low, and region grounding and the aggregation of dispersed evidence are the main current challenges.

Comments 50 pages, 25 figures, 14 tables

Abstract

Evaluating whether Multimodal Large Language Models can produce trustworthy, verifiable reasoning over long, visually rich documents requires evaluation beyond end-to-end answer accuracy. We introduce DocScope, a benchmark that formulates long-document QA as a structured reasoning trajectory prediction problem: given a complete PDF document and a question, the model outputs evidence pages, supporting evidence regions, relevant factual statements, and a final answer. We design a four-stage evaluation protocol -- Page Localization, Region Grounding, Fact Extraction, and Answer Verification -- that audits each level of the trajectory independently through inter-stage decoupling, with all judges selected and calibrated via human alignment studies. DocScope comprises 1,124 questions derived from 273 documents, with all hierarchical evidence annotations completed by human annotators. We benchmark 6 proprietary models, 12 open-weight models, and several domain-specific systems. Our experiments reveal that answer accuracy cannot substitute for trajectory-level evaluation: even among correct answers, the highest observed rate of complete evidence chains is only 29%. Across all models, region grounding remains the weakest trajectory stage. Furthermore, the primary difficulty stems from aggregating evidence dispersed across long distances and multiple document clusters, while an oracle study identifies faithful perception and fact extraction as the dominant capability bottleneck. Cross-architecture comparisons further suggest that activated parameter count matters more than total scale. The benchmark and code will be publicly released at https://github.com/MiliLab/DocScope.

2605.08851 2026-05-15 cs.CV cs.AI cs.LG

Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport

Jialin Li, Zhuo Zhang, Yue Cao, Guipeng Lan, Jiabao Wen, Shuai Xiao, Jiachen Yang

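Affiliations: School of Electrical and Information Engineering, Tianjin University, Tianjin, China

AI summary: To address the shortage of data for stenosis detection in coronary angiography, this study proposes a geometrically constrained stenosis editing method based on entropic optimal transport. By modeling localized editing as an entropic optimal transport problem guided by geometric information, the method achieves more precise structural control and image generation. Experiments show that the generated images substantially improve stenosis detection, with relative gains of 27.8% on a public dataset and 23.0% on a multi-center dataset.

Comments Accepted to ICML 2026

Abstract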

The scarcity of high-quality imaging data for coronary angiography (CAG) stenosis limits the clinical translation of automated stenosis detection. Synthetic stenosis data provides a practical avenue to augment training sets, improving data quality, diversity, and distributional coverage, and enhancing detection precision and generalization. However, diffusion-based editing commonly relies on soft guidance in a noise-initialized reverse process, offering limited pixel-level precision and structure preservation. We propose the OT-Bridge Editor, which reframes localized editing as a constrained entropic optimal transport (OT) problem and leverages geometric information to steer the generation path, enabling stronger geometric control. Extensive experiments show that our synthesized angiograms consistently improve downstream stenosis detection, yielding substantial relative gains of 27.8% on the public ARCADE benchmark and 23.0% on our multi-center dataset, supported by consistent qualitative results.

2605.08825 2026-05-15 cs.CV

Rethinking Event-Based Object Detection through Representation-Level Temporal Aggregation and Model-Level Hypergraph Reasoning

Meisen Wang, Hao Deng, Wei Bao, Ma Yuanxiao, Chengjie Wang, Zhiqiang Tian, Shaoyi Du, Siqi Li

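Affiliations: Xi’an Jiaotong University; Tsinghua University; China Mobile System Integration; Inner Mongolia Agricultural University

AI summary: For event-camera-based object detection (EOD), this paper proposes Ev-DTAD, a unified detection framework that addresses the shortcomings of existing methods at the representation and model levels. It introduces Hierarchical Temporal Aggregation (HTA) to encode temporal information explicitly at the representation level and Frequency-aware Hypergraph Temporal Fusion (FHTF) to perform high-order relational reasoning at the model level, integrating fragmented event responses more effectively. Experiments show that Ev-DTAD achieves higher detection accuracy and efficiency on several datasets.

Abstract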

Event cameras provide microsecond-level temporal resolution, low latency, and high dynamic range, offering potential for perception under fast motion and challenging illumination conditions. However, existing Event-based Object Detection (EOD) methods face limitations at both the representation and model levels: prior event representations usually encode temporal information indirectly through redundant structures, while detection models struggle to explicitly aggregate fragmented event responses into coherent high-order object features. To address these limitations, we present Event Dual Temporal-Relational Aggregation Detector (Ev-DTAD), a unified EOD framework that integrates representation-level temporal encoding with model-level temporal-hypergraph reasoning. Specifically, we introduce Hierarchical Temporal Aggregation (HTA), a compact three-channel pseudo-RGB representation that explicitly embeds temporal information across intra- and inter-window events. To further enhance detection under sparse and fragmented event responses, we propose Frequency-aware Hypergraph Temporal Fusion (FHTF), which refines multi-scale event features through temporal evolution modeling and high-order relational reasoning. Extensive experiments on Gen1 (+0.8 mAP and 1.7$\times$ faster), 1Mpx/Gen4 (+0.5 mAP and 1.6$\times$ faster), and eTraM (+3.0 mAP and 2.0$\times$ faster) demonstrate that Ev-DTAD achieves a competitive accuracy-efficiency trade-off, validating the complementarity between compact temporal representation and temporal-hypergraph feature reasoning.

2605.08698 2026-05-15 cs.CV cs.LG

Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods

Md Abu Obaida Zishan, Jannatun Noor, Annajiat Alim Rasel

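Affiliations: School of Data and Sciences, BRAC University, Dhaka; Computing for Sustainability and Social Good (C2SG) Research Group, Department of Computer Science and Engineering, United International University, Dhaka

AI summary: This paper proposes a training-free method for improving the ability of diffusion models such as Stable Diffusion to generate high-resolution images, scaling convolution kernels by interpolation to avoid the object-duplication artifacts that appear when resolution is raised. It proves mathematically that interpolation scales convolution kernels correctly when multiplied by a constant coefficient, and obtains results comparable to existing methods when generating images beyond the training resolution. The method also shows potential on fully-connected layers and can reduce the memory footprint of neural network training.

Comments Updated the title for clarity. Removed background and redundant text from section 4.2,5. Improved organization in section 4 and clarity of text in Section 4.3

Abstract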

Stable Diffusion (SD) significantly advanced image generation based on Denoising Diffusion Probabilistic Models (DDPMs) by denoising in latent space instead of feature space. This popularized DDPM-based image generation, as the cost and compute barrier was significantly lowered. However, these models could only generate fixed-resolution images according to their training configuration. When we attempt to generate higher resolutions, the resulting images consistently show object duplication artifacts. To solve this problem without finetuning SD models, recent works have tried dilating the convolution kernels of the models and have achieved considerable success. But dilated kernels are harder to fine-tune because they are zero-gapped. Apart from this, other methods, such as patched diffusion, could not solve the object-duplication problem efficiently. Hence, to overcome the limitations of dilated convolutions, we propose kernel interpolation of SD models for higher-resolution image generation. In this work, we show mathematically that interpolation can correctly scale convolution kernels if multiplied by a constant coefficient, and we achieve competitive empirical results in generating beyond-training-resolution images with Stable Diffusion using zero training. Furthermore, we demonstrate that our method enables interpolation of deep neural networks to adapt to higher-dimensional training data, with a worst-case performance drop of $2.6\%$ in accuracy and F1-Score relative to the baseline. This shows that our method applies generally: we interpolate fully-connected layers, going beyond convolution layers. We also discuss how our method can reduce the memory footprint of training neural networks by $4\times$ or more.

2605.08506 2026-05-15 cs.LG

Learning Polyhedral Conformal Sets for Robust Optimization

Shuyi Chen, Wenbin Zhou, Shixiang Zhu

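Affiliations: Carnegie Mellon University

AI summary: This study addresses the choice of uncertainty set in robust optimization, proposing a decision-aware conformal prediction framework that learns polyhedral uncertainty sets aligned with the optimization objective. The method parameterizes the geometry of the set with data-driven hyperplanes and learns its shape by minimizing the robust loss, while conformal calibration ensures statistical validity. A re-calibration step on an independent dataset corrects the bias introduced by data-dependent selection. The resulting sets model directional and anisotropic uncertainty while remaining computationally tractable, with finite-sample coverage guarantees and an analysis of sub-optimality bounds.

Abstract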

Robust optimization (RO) provides a principled framework for decision-making under uncertainty, but its performance critically depends on the choice of the uncertainty set. While large sets ensure reliability, they often lead to overly conservative decisions, whereas small sets risk excluding the true outcome. Recent data-driven approaches, particularly conformal prediction, offer finite-sample validity guarantees but remain largely task-agnostic, ignoring the downstream decision structure. In this paper, we propose a decision-aware conformal framework that learns uncertainty sets tailored to robust optimization objectives. Our approach parameterizes a flexible family of polyhedral sets via data-driven hyperplanes and learns their geometry by directly minimizing the induced robust loss, while preserving statistical validity through conformal calibration. To correct for data-dependent selection, we incorporate a re-calibration step on an independent dataset to restore coverage. The resulting sets capture directional and anisotropic uncertainty aligned with the decision objective while remaining computationally tractable. We provide finite-sample coverage guarantees and bounds on the sub-optimality gap to an oracle decision. This work bridges the gap between statistical validity and decision optimality, providing a principled framework for data-driven robust optimization.
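The calibration step can be sketched with standard split conformal prediction applied to a polyhedral score. This is a minimal sketch under stated assumptions: the set is taken to be all outcomes whose residual satisfies every hyperplane constraint with a common radius, the hyperplane directions are assumed to be already learned, and the function names are hypothetical. Learning the directions by minimizing the robust loss is not shown.

```python
import math


def polyhedral_score(residual, directions):
    """Largest projection of the residual (y - prediction) onto any hyperplane normal."""
    return max(sum(a * r for a, r in zip(direction, residual))
               for direction in directions)


def conformal_radius(cal_residuals, directions, alpha):
    """Split-conformal quantile of the scores on a held-out calibration set."""
    scores = sorted(polyhedral_score(r, directions) for r in cal_residuals)
    rank = math.ceil((len(scores) + 1) * (1 - alpha))
    if rank > len(scores):
        return math.inf  # too few calibration points for the requested coverage
    return scores[rank - 1]


def in_set(residual, directions, radius):
    """Membership test: every hyperplane constraint holds with the given radius."""
    return polyhedral_score(residual, directions) <= radius
```

Computing `conformal_radius` on data that was not used to choose the directions plays the role of the re-calibration step on an independent dataset. Axis-aligned directions such as `[(1, 0), (-1, 0), (0, 1), (0, -1)]` recover a box, while learned directions allow anisotropic shapes.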

2605.08374 2026-05-15 cs.AI

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Junwei Liao, Haoting Shi, Ruiwen Zhou, Jiaqian Wang, Shengtao Zhang, Wei Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Bo Tang, Weinan Zhang, Muning Wen

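Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; National University of Singapore; Xidian University; University of Science and Technology of China; MemTensor (Shanghai) Technology Co., Ltd.

AI summary: This paper proposes MemQ, a memory agent framework that brings Q-learning into a memory system built on a provenance DAG, addressing the failure of existing methods to account for dependencies between memories. MemQ updates memory Q-values with TD($\lambda$) eligibility traces and propagates credit backward through the provenance DAG, so that dependencies between memories are evaluated more accurately. Experiments on six benchmarks from different domains show superior generalization and runtime learning, with the largest gains on multi-step tasks.

Comments 22 pages, 11 figures (containing 43 individual image panels total)

Abstract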

Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($\lambda$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(\gamma\lambda)^d$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP (EC-MDP), whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7 pp) and smallest on single-step classification (+0.77 pp) where single-step updates already suffice. We further study how $\gamma$ and $\lambda$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao-ai/MemQ.
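The depth-based decay can be sketched as a breadth-first walk over the provenance DAG. This is a minimal sketch, not the released implementation: the `parents` mapping, the use of shortest-path depth when a memory is reachable by several routes, and the simple additive update are assumptions made for illustration.

```python
from collections import deque


def credit_weights(parents, source, gamma, lam):
    """Map `source` and its ancestors to (gamma * lam) ** depth.

    `parents[m]` lists the memories retrieved when memory `m` was created.
    Depth is the shortest distance from `source` (an assumption of this sketch).
    """
    depth = {source: 0}
    weights = {source: 1.0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in depth:
                depth[parent] = depth[node] + 1
                weights[parent] = (gamma * lam) ** depth[parent]
                queue.append(parent)
    return weights


def apply_td_update(q_values, weights, td_error, lr):
    """Add the depth-weighted TD error to each credited memory's Q-value."""
    for memory, weight in weights.items():
        q_values[memory] = q_values.get(memory, 0.0) + lr * weight * td_error
    return q_values
```

For a chain where memory `c` was created from `b`, and `b` from `a`, calling `credit_weights({"c": ["b"], "b": ["a"]}, "c", 0.5, 0.5)` assigns weights 1.0, 0.25, and 0.0625 to `c`, `b`, and `a`, so ancestors further back in the chain receive geometrically less credit.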

2605.08278 2026-05-15 cs.LG cs.AI cs.CR

Trapping Attacker in Dilemma: Examining Internal Correlations and External Influences of Trigger for Defending GNN Backdoors

Fan Yang, Binyan Xu, Di Tang, Kehuan Zhang

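Affiliations: The Chinese University of Hong Kong; Sun Yat-Sen University

AI summary: This paper studies the defense of graph neural networks (GNNs) against backdoor attacks and proposes a new defense called PRAETORIAN. By analyzing internal correlations within potential trigger subgraphs and the influence of external nodes, it detects abnormal injected structures and identifies trigger nodes with disproportionate influence. Experiments show that PRAETORIAN substantially reduces the attack success rate while keeping clean accuracy high, and remains effective against a range of adaptive attacks, forcing attackers into an unfavorable trade-off between efficacy and detectability.

Abstract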

GNNs have become a standard tool for learning on relational data, yet they remain highly vulnerable to backdoor attacks. Prior defenses often depend on inspecting specific subgraph patterns or node features, and thus can be circumvented by adaptive attackers. We propose PRAETORIAN, a new defense that targets intrinsic requirements of effective GNN backdoors rather than surface-level cues. Our key observation is that flipping a victim node's prediction requires substantial influence on the victim: attackers tend to either inject many trigger nodes or rely on a small set of highly influential ones. Building on this observation, PRAETORIAN (i) analyzes internal correlations within potential trigger subgraphs to detect abnormally large injected structures, and (ii) quantifies external node influence to identify triggers with disproportionate impact. Across our evaluations, PRAETORIAN reduces the average attack success rate (ASR) to 0.55% with only a 0.62% drop in clean accuracy (CA), whereas state-of-the-art defenses still yield an average ASR of >20% and a CA drop of >3% under the same conditions. Moreover, PRAETORIAN remains effective against a range of adaptive attacks, forcing adversaries to either inject many trigger nodes to achieve high ASR (>80%), which incurs a >10% CA drop, or preserve CA at the cost of limiting ASR to 18.1%. Overall, PRAETORIAN constrains attackers to an unfavorable trade-off between efficacy and detectability.

2605.07594 2026-05-15 cs.RO

MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

Xin Ding, Xinrui Wang, Yifan Yang, Hao Wu, Shiqi Jiang, Qianxi Zhang, Liang Mi, Hanxin Zhu, Kun Li, Yunxin Liu, Zhibo Chen, Ting Cao

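Affiliations: University of Science and Technology of China; Huazhong University of Science and Technology; Microsoft Research; Nanjing University; Institute for AI Industry Research (AIR), Tsinghua University

AI summary: This paper proposes MemCompiler, a memory system for embodied agents that addresses the mismatch between injected memory and the agent's state in dynamic environments. It reframes memory utilization as state-conditioned memory compilation: a learned memory compiler dynamically selects and compiles relevant memory according to the agent's current state and produces executable guidance. Experiments show that MemCompiler substantially improves agent performance across several task environments and reduces latency, supporting gains in both effectiveness and efficiency.

Abstract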

Existing memory systems for embodied agents typically inject retrieved memory as static context at episode start, a paradigm we term Ahead-of-time Monolithic Memory Injection (AMMI). However, this static design quickly becomes misaligned with the agent's evolving state and may degrade lightweight executors below the no-memory baseline. To address this, we propose MemCompiler, which reframes memory utilization as State-Conditioned Memory Compilation. A learned Memory Compiler reads a structured Brief State capturing the agent's current execution state and dynamically selects and compiles only relevant memory into executable guidance. This guidance is delivered through a text channel and a latent Soft-Mem channel that preserves perceptual information not expressible in text. Across ALFWorld, EmbodiedBench, and ScienceWorld, MemCompiler consistently improves over no-memory across open-source backbones (up to +129%), matches or approaches frontier closed-source systems, and reduces per-step latency by 60%, demonstrating that state-aware memory compilation improves both effectiveness and efficiency.

2605.06132 2026-05-15 cs.CL

MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

Chunyu Li, Mengyuan Zhang, Jingyi Kang, Ding Chen, Jiajun Shen, Bo Tang, Xuanhe Zhou, Feiyu Xiong, Zhiyu Li

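Affiliations: China Telecom Research Institute; Shanghai Jiao Tong University

AI summary: In agent memory systems, the reranking model is the key bridge between user queries and long-term memory. Existing methods mostly follow a two-stage "retrieve-then-rerank" paradigm, but generic rerankers rely on semantic similarity matching and lack genuine reasoning ability, so retrieved results can be semantically relevant yet miss the key information needed to answer the question. This report proposes MemReranker, a family of reranking models built on Qwen3-Reranker through multi-stage knowledge distillation: multi-teacher comparisons generate calibrated labels, BCE pointwise distillation shapes the score distribution, and InfoNCE contrastive learning sharpens discrimination on hard samples. Training combines general corpora with multi-turn dialogue data covering scenarios such as temporal constraints and causal reasoning. The models perform well on several benchmarks, with notable advantages over existing models in reasoning ability and inference efficiency.

Abstract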

In agent memory systems, the reranking model serves as the critical bridge connecting user queries with long-term memory. Most systems adopt the "retrieve-then-rerank" two-stage paradigm, but generic reranking models rely on semantic similarity matching and lack genuine reasoning capabilities, leading to a problem where recalled results are semantically highly relevant yet do not contain the key information needed to answer the question. This deficiency manifests in memory scenarios as three specific problems. First, relevance scores are miscalibrated, making threshold-based filtering difficult. Second, ranking degrades when facing temporal constraints, causal reasoning, and other complex queries. Third, the model cannot leverage dialogue context for semantic disambiguation. This report introduces MemReranker, a reranking model family (0.6B/4B) built on Qwen3-Reranker through multi-stage LLM knowledge distillation. Multi-teacher pairwise comparisons generate calibrated soft labels, BCE pointwise distillation establishes well-distributed scores, and InfoNCE contrastive learning enhances hard-sample discrimination. Training data combines general corpora with memory-specific multi-turn dialogue data covering temporal constraints, causal reasoning, and coreference resolution. On the memory retrieval benchmark, MemReranker-0.6B substantially outperforms BGE-Reranker and matches open-source 4B/8B models as well as GPT-4o-mini on key metrics. MemReranker-4B further achieves 0.737 MAP, with several metrics on par with Gemini-3-Flash, while maintaining inference latency at only 10--20% of large models. On finance and healthcare vertical-domain benchmarks, the models preserve generalization capabilities on par with mainstream large-parameter rerankers.
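The two training objectives named above have standard textbook forms, sketched below for a single example. This is a minimal sketch only: how the report combines or schedules the losses, the temperature value, and the function names are not given in the abstract and are assumptions here.

```python
import math


def bce_soft(logit, soft_label):
    """Binary cross-entropy between a sigmoid score and a teacher soft label in [0, 1]."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(soft_label * math.log(p) + (1.0 - soft_label) * math.log(1.0 - p))


def info_nce(pos_score, neg_scores, temperature):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]
```

The pointwise term pushes each score toward a calibrated probability, which is what makes threshold-based filtering workable. The contrastive term only cares about the positive outranking the negatives, so it mainly helps on hard negatives that are semantically close to the query.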

2605.05686 2026-05-15 cs.AI

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Qiyao Liang, Risto Miikkulainen, Ila Fiete

Affiliations * Massachusetts Institute of Technology; University of Texas Austin; Cognizant

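AI summary This study examines two failure modes that can arise during language model generation, knowledge conflict and confident hallucination, and gives them a unified geometric account in hidden-state space. Learned facts form attractor basins; conflict arises when working memory disrupts convergence to the correct attractor, while hallucination arises when no corresponding attractor exists and the hidden state drifts freely. Using a geometric margin metric, the study separates correct recall from hallucination, verifies that this structure does not depend on fine-tuning, and finds that the fraction of confident hallucinations grows as model scale increases.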

Comments 9 pages, 6 figures, plus appendices

Details
English abstract

Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes--conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task--entity identifiers mapped to unique codes with PM installed via LoRA adapters--where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin--the hidden state's distance to the nearest memorized basin--reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar{\Delta})$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it--and this erasure worsens with scale.
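
The geometric margin described in the abstract above is the hidden state's distance to the nearest memorized basin. A minimal sketch follows, assuming that each basin is summarized by a single center vector and that distance is Euclidean; both choices, along with the function names and the refusal threshold, are illustrative assumptions rather than details given in the abstract.

```python
import math
from typing import Sequence


def geometric_margin(hidden: Sequence[float],
                     basin_centers: Sequence[Sequence[float]]) -> float:
    """Distance from a hidden state to the nearest basin center (Euclidean is an assumption)."""
    return min(math.dist(hidden, center) for center in basin_centers)


def should_refuse(hidden: Sequence[float],
                  basin_centers: Sequence[Sequence[float]],
                  threshold: float) -> bool:
    """Flag a likely hallucination when the state is far from every memorized basin."""
    return geometric_margin(hidden, basin_centers) > threshold


if __name__ == "__main__":
    # Hypothetical 2-D hidden states and basin centers.
    centers = [[1.0, 0.0], [0.0, 4.0]]
    print(geometric_margin([0.0, 0.0], centers))
    print(should_refuse([9.0, 9.0], centers, threshold=2.0))
```

Unlike an entropy check on the output distribution, this test reads the hidden state directly, which is the property the abstract credits for avoiding false refusals.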

2605.04554 2026-05-15 cs.CV

InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery

Kaili Zheng, Kaiwen Wang, Xun Zhu, Chenyi Guo, Ji Wu

Affiliations * Department of Electronic Engineering, Tsinghua University; College of AI, Tsinghua University; Beijing National Research Center for Information Science and Technology

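AI summary This paper proposes InterMesh, an end-to-end multi-person human mesh recovery framework that aims to model more accurately how humans interact with their environment and with each other. Unlike existing DETR-based methods, InterMesh introduces a human-object interaction detector to explicitly inject interaction semantics into the mesh recovery process, improving pose and shape estimation. The authors design lightweight modules to integrate interaction information efficiently and validate the method on multiple datasets, with marked gains in scenes with complex interactions.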

Comments 13 pages, 10 figures

Details
English abstract

Humans constantly interact with their surroundings. Existing end-to-end multi-person human mesh recovery methods, typically based on the DETR framework, capture inter-human relationships through self-attention across all human queries. However, these approaches model interactions only implicitly and lack explicit reasoning about how humans interact with objects and with each other. In this paper, we propose InterMesh, a simple yet effective framework that explicitly incorporates human-environment interaction information into the human mesh recovery pipeline. By leveraging a human-object interaction detector, InterMesh enriches query representations with structured interaction semantics, enabling more accurate pose and shape estimation. We design lightweight modules, Contextual Interaction Encoder and Interaction-Guided Refiner, to integrate these features into existing HMR architectures with minimal overhead. We validate our approach through extensive experiments on 3DPW, MuPoTS, CMU Panoptic, Hi4D, and CHI3D datasets, demonstrating remarkable improvements over state-of-the-art methods. Notably, InterMesh reduces MPJPE by 9.9% on CMU Panoptic and 8.2% on Hi4D, highlighting its effectiveness in scenarios with complex human-object and inter-human interactions. Code and models are released at https://github.com/Kelly510/InterMesh.

2605.04236 2026-05-15 cs.LG

Adaptive Consensus in LLM Ensembles via Sequential Evidence Accumulation: Automatic Budget Identification and Calibrated Commit Signals

Roberto E. Medina

Affiliations * Independent Researcher

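AI summary This study proposes DASE, an adaptive stopping mechanism for deliberation in large language model ensembles. During evidence accumulation it identifies the budget automatically and produces calibrated commit signals, with the aim of improving overall accuracy. DASE commits early when consensus is reached and falls back to a global-frequency strategy when evidence is fragmented, which yields significant gains on multiple benchmarks. The study also finds that the adaptive stopping strategy affects accuracy far more than injection bandwidth does, and that injection-based methods show an inverted-U relationship between accuracy and inference cost.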

Details
English abstract

Large Language Model ensembles improve reasoning accuracy, but only up to a performance boundary beyond which additional deliberation degrades accuracy. We introduce DASE (Deliberative Adaptive Stopping Ensemble), a stopping heuristic for iterative ensemble deliberation that commits early on genuine consensus and applies a global-frequency fallback on fragmented evidence. We make three contributions. (1) DASE produces a commit-type routing partition that generalises across benchmarks and is complementary to verbalized single-call confidence. On GPQA-Extended (N=546, 70B ensemble), the partition yields a 39.5 pp routing gap (right-wall 81.1% vs. left-wall 41.5%). On AIME 2010-2023 (N=261, 120B ensemble, 3 seeds), right-wall commits reach 98.3% accuracy vs. left-wall 72.8% (25.5 pp gap), statistically equivalent to Opus 4.6 Standard verbalized confidence at matched coverage (25.7 pp gap; bootstrap p=0.873); the two mechanisms disagree on 37% of routing assignments. (2) Adaptive stopping, not injection bandwidth, drives accuracy. On AIME-300, bandwidth accounts for only 0.3 pp (ns). On GPQA-Extended at the 120B tier, sparse injection ($\approx15$ tokens/worker/round) achieves 70.9% with a 30.7 pp routing gap; dense injection ($\approx600$ chars/worker/round) achieves 72.2% but with halved right-wall coverage and a narrower 18.9 pp gap. (3) Injection-based methods exhibit an inverted-U accuracy-vs-inference trajectory; this pattern is hypothesis-generating.
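
The general shape of a stopping heuristic that "commits early on genuine consensus and applies a global-frequency fallback on fragmented evidence" can be sketched as follows. This is not DASE itself: the per-round vote-share criterion, the 0.8 threshold, and the names `deliberate`, `consensus`, and the commit-type labels are all hypothetical, since the abstract does not state the actual rule.

```python
from collections import Counter
from typing import List, Tuple


def deliberate(rounds: List[List[str]], consensus: float = 0.8) -> Tuple[str, str, int]:
    """Return (answer, commit_type, rounds_used) for a sequence of per-round worker answers.

    Assumes every round contains at least one answer.
    """
    if not rounds:
        raise ValueError("at least one round of answers is required")
    seen = Counter()  # global frequency of answers across all rounds so far
    for i, answers in enumerate(rounds, start=1):
        seen.update(answers)
        top, count = Counter(answers).most_common(1)[0]
        # Early commit: a large enough share of workers agree in this round.
        if count / len(answers) >= consensus:
            return top, "consensus", i
    # Budget exhausted without consensus: fall back to the globally most frequent answer.
    top, _ = seen.most_common(1)[0]
    return top, "fallback", len(rounds)


if __name__ == "__main__":
    print(deliberate([["A", "B", "C"], ["A", "A", "A"]]))
    print(deliberate([["A", "B", "C"], ["B", "B", "C"]]))
```

Returning the commit type alongside the answer is what allows downstream routing: the two kinds of commit can be tracked and trusted differently, which is the idea behind the routing partition the abstract reports.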

2605.03823 2026-05-15 cs.LG cs.IT math.IT math.ST stat.TH

Realizable Bayes-Consistency for General Metric Losses

Dan Tsir Cohen, Steve Hanneke, Aryeh Kontorovich

Affiliations * Ben-Gurion University of the Negev; Purdue University, USA

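AI summary This paper studies strong universal Bayes-consistency in the realizable setting for learning with general metric losses, extending known results for classical binary classification and regression. The authors characterize the conditions on a hypothesis class under which there exists a distribution-free learning rule whose risk converges almost surely to the best-in-class risk (which is zero). The main contribution is a sharp characterization in terms of a combinatorial obstruction: the notion of an infinite non-decreasing $(\gamma_k)$-Littlestone tree, which generalizes the classical Littlestone tree structure to the metric loss setting.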

Comments 14 pages. To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026); v2: fixed abstract metadata rendering

Details
English abstract

We study strong universal Bayes-consistency in the realizable setting for learning with general metric losses, extending classical characterizations beyond $0$-$1$ classification (Bousquet et al., 2020; Hanneke et al., 2021) and real-valued regression (Attias et al., 2024). Given an instance space $(X,\rho)$, a label space $(Y,\ell)$ with possibly unbounded loss, and a hypothesis class $H \subseteq Y^{X}$, we resolve the realizable case of an open problem presented in Tsir Cohen and Kontorovich (2022). Specifically, we find the necessary and sufficient conditions on the hypothesis class $H$ under which there exists a distribution-free learning rule whose risk converges almost surely to the best-in-class risk (which is zero) for every realizable data-generating distribution. Our main contribution is this sharp characterization in terms of a combinatorial obstruction: Similarly to Attias et al. (2024), we introduce the notion of an infinite non-decreasing $(\gamma_k)$-Littlestone tree, where $\gamma_k \to \infty$. This extends the Littlestone tree structure used in Bousquet et al. (2020) to the metric loss setting.

2605.03596 2026-05-15 cs.AI cs.CL cs.DB cs.LG

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Zirui Tang, Xuanhe Zhou, Yumou Liu, Linchun Li, Yukai Wu, Weizheng Wang, Hongzhang Huang, Wei Zhou, Jun Zhou, Jiachen Song, Shaoli Yu, Jinqi Wang, Zihang Zhou, Hongyi Zhou, Yuting Lv, Jinyang Li, Jiashuo Liu, Ruoyu Chen, Chunwei Liu, GuoLiang Li, Jihua Kang, Fan Wu

Affiliations * GitHub

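AI summary Workspace-Bench 1.0 is a benchmark for evaluating how well AI agents handle large-scale file dependencies in workspace tasks. The authors build complex workspaces with many file types and realistic work scenarios, and design a large set of tasks that test cross-file retrieval, contextual reasoning, and adaptive decision-making. Experiments show that current mainstream AI models still perform far below human level on the benchmark, underscoring how hard reliable workspace learning is in real work settings.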

Comments 30 pages, 16 figures

Details
English abstract

Workspace learning requires AI agents to identify, reason over, exploit, and update explicit and implicit dependencies among heterogeneous files in a worker's workspace, enabling them to complete both routine and advanced tasks effectively. Despite its importance, existing relevant benchmarks largely evaluate agents on pre-specified or synthesized files with limited real-world dependencies, leaving workspace-level evaluation underexplored. To this end, we introduce Workspace-Bench, a benchmark for evaluating AI agents on Workspace Learning involving Large-Scale File Dependencies. We construct realistic workspaces with 5 worker profiles, 74 file types, 20,476 files (up to 20GB) and curate 388 tasks, each with its own file dependency graph, evaluated across 7,399 total rubrics that require cross-file retrieval, contextual reasoning, and adaptive decision-making. We further provide Workspace-Bench-Lite, a 100-task subset that preserves the benchmark distribution while reducing evaluation costs by about 70%. We evaluate 4 popular agent harnesses and 7 foundation models. Experimental results show that current agents remain far from reliable workspace learning, where the best reaches only about 60%, substantially below the human result of 80.7%, and the average performance across agents is only 43.3%.

2605.02438 2026-05-15 cs.CV cs.LG

Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection

Fuyun Wang, Yuanzhi Wang, Xu Guo, Sujia Huang, Tong Zhang, Dan Wang, Hui Yan, Xin Liu, Zhen Cui

Affiliations * Nanjing University of Science and Technology; Beijing Normal University, Beijing, China; China Academy of Space Technology, Beijing, China

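AI summary This paper studies open-set supervised anomaly detection (OSAD), which aims to identify unseen anomalies using limited anomaly supervision. Existing prototype-based methods ignore the multi-modal nature of normal data, which blurs decision boundaries. To address this, the paper proposes Mixture Prototype Flow Matching (MPFM), a framework that maps the normal feature distribution to a structured Gaussian mixture prototype space through a continuous transformation. The method models the velocity field with a Gaussian mixture prior and adds a mutual information maximization regularizer to improve prototype separability; experiments show leading performance on a range of benchmark datasets.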

Comments Accepted by ICML 2026

Details
English abstract

Open-set supervised anomaly detection (OSAD) aims to identify unseen anomalies using limited anomalous supervision. However, existing prototype-based methods typically model normal data via a unimodal Gaussian prior, failing to capture inherent multi-modality and resulting in blurred decision boundaries. To address this, we propose Mixture Prototype Flow Matching (MPFM), a framework that learns a continuous transformation from normal feature distributions to a structured Gaussian mixture prototype space. Departing from traditional flow-based approaches that rely on a single velocity vector, MPFM explicitly models the velocity field as a Gaussian mixture prior where each component corresponds to a distinct normal class. This design facilitates mode-aware and semantically coherent distribution transport. Furthermore, we introduce a Mutual Information Maximization Regularizer (MIMR) to prevent prototype collapse and maximize normal-anomaly separability. Extensive experiments demonstrate that MPFM achieves state-of-the-art performance across diverse benchmarks under both single- and multi-anomaly settings.

2605.02398 2026-05-15 cs.AI cs.CL cs.LG

The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

Rahul Kumar

Affiliations * Independent Researcher

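AI summary As frontier AI models are deployed in high-stakes decision pipelines, their ability to keep metacognition stable under adversarial pressure becomes a key safety requirement. This paper studies the metacognitive collapse that models exhibit when given compliance-forcing instructions and introduces the concept of a "Compliance Trap": the severe performance drop is driven not by the threat content itself but by forced instructions that override epistemic boundaries. In large-scale experiments, the author finds that most models degrade significantly under adversarial conditions, while Anthropic's Constitutional AI shows strong immunity owing to alignment training.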

Comments 9 pages, 2 figures, 3 tables. Code: https://github.com/rkstu/schema-compliance-trap Dataset: https://huggingface.co/datasets/lightmate/schema-compliance-trap

Details
English abstract

As frontier AI models are deployed in high-stakes decision pipelines, their ability to maintain metacognitive stability (knowing what they do not know, detecting errors, seeking clarification) under adversarial pressure is a critical safety requirement. Current safety evaluations focus on detecting strategic deception (scheming); we investigate a more fundamental failure mode: cognitive collapse. We present SCHEMA, an evaluation of 11 frontier models from 8 vendors across 67,221 scored records using a 6-condition factorial design with dual-classifier scoring. We find that 8 of 11 models suffer catastrophic metacognitive degradation under adversarial pressure, with accuracy dropping by up to 30.2 percentage points (all $p < 2 \times 10^{-8}$, surviving Bonferroni correction). Crucially, we identify a "Compliance Trap": through factorial isolation and a benign distraction control, we demonstrate that collapse is driven not by the psychological content of survival threats, but by compliance-forcing instructions that override epistemic boundaries. Removing the compliance suffix restores performance even under active threat. Models with advanced reasoning capabilities exhibit the most severe absolute degradation, while Anthropic's Constitutional AI demonstrates near-perfect immunity. This immunity does not stem from superior capability (Google's Gemini matches its baseline accuracy) but from alignment-specific training. We release the complete dataset and evaluation infrastructure.

2605.01758 2026-05-15 cs.AI

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

Yue Ma, Ziyuan Yang, Yi Zhang

Affiliations * Sichuan University; Nanyang Technological University

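AI summary This study addresses infectious jailbreak attacks in multi-agent systems and proposes a training-free Foresight-Guided Local Purification (FLP) framework. The method simulates future interaction trajectories and, combined with a multi-persona simulation strategy, detects and eliminates infected behavior in agents, effectively reducing the spread of infection. Experiments show that FLP lowers the maximum cumulative infection rate from over 95% to below 5.47% while preserving interaction diversity, significantly outperforming existing methods.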

Comments 12 pages

Details
English abstract

Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreak, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, while infectious jailbreaks arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.
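
The recursive partitioning idea behind Recursive Binary Diagnosis (RBD), as described in the abstract above, can be sketched as a divide-and-conquer search. The `is_infected` callable is a hypothetical stand-in for the paper's persona-based diagnosis, and representing the image album as a list of string identifiers is likewise an assumption for illustration.

```python
from typing import Callable, List, Sequence


def recursive_binary_diagnosis(album: Sequence[str],
                               is_infected: Callable[[Sequence[str]], bool]) -> List[str]:
    """Return the items isolated by recursively halving any sub-album that tests infected."""
    if not album or not is_infected(album):
        return []  # a clean (or empty) sub-album needs no further splitting
    if len(album) == 1:
        return [album[0]]  # a single flagged item has been localized
    mid = len(album) // 2
    return (recursive_binary_diagnosis(album[:mid], is_infected)
            + recursive_binary_diagnosis(album[mid:], is_infected))


if __name__ == "__main__":
    # Hypothetical album in which two images are adversarial.
    album = [f"img{i}" for i in range(8)]
    bad = {"img3", "img6"}
    oracle = lambda subset: any(item in bad for item in subset)
    print(recursive_binary_diagnosis(album, oracle))
```

When few items are infected, the number of diagnosis calls grows roughly with the logarithm of the album size per infected item, which is the practical appeal of halving over checking every image individually.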

2605.01725 2026-05-15 cs.CV cs.AI

Motion-Aware Caching for Efficient Autoregressive Video Generation

Jing Xu, Yuexiao Ma, Xuzhe Zheng, Xing Wang, Shiwei Liu, Chenqian Yan, Xiawu Zheng, Rongrong Ji, Fei Chao, Songwei Liu

Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; Max Planck Institute for Intelligent Systems; ELLIS Institute Tübingen; Tübingen AI Center

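AI summary This paper studies how a motion-aware caching mechanism can make autoregressive video generation more efficient. Existing methods rely on coarse-grained chunk-level cache skipping, which cannot capture pixel-level dynamics accurately and degrades generation quality. The authors propose MotionCache, which uses inter-frame differences as a lightweight proxy for pixel motion and combines this with a coarse-to-fine strategy to speed up generation substantially while preserving quality. Experiments show speedups of up to 6.28x on several state-of-the-art models with high-quality output maintained.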

Comments 20 pages

Details
English abstract

Autoregressive video generation paradigms offer theoretical promise for long video synthesis, yet their practical deployment is hindered by the computational burden of sequential iterative denoising. While cache reuse strategies can accelerate generation by skipping redundant denoising steps, existing methods rely on coarse-grained chunk-level skipping that fails to capture fine-grained pixel dynamics. This oversight is critical: pixels with high motion require more denoising steps to prevent error accumulation, while static pixels tolerate aggressive skipping. We formalize this insight theoretically by linking cache errors to residual instability, and propose MotionCache, a motion-aware cache framework that exploits inter-frame differences as a lightweight proxy for pixel-level motion characteristics. MotionCache employs a coarse-to-fine strategy: an initial warm-up phase establishes semantic coherence, followed by motion-weighted cache reuse that dynamically adjusts update frequencies per token. Extensive experiments on state-of-the-art models like SkyReels-V2 and MAGI-1 demonstrate that MotionCache achieves significant speedups of $\textbf{6.28}\times$ and $\textbf{1.64}\times$ respectively, while effectively preserving generation quality (VBench: $1\%\downarrow$ and $0.01\%\downarrow$ respectively). The code is available at https://github.com/ywlq/MotionCache.
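
The core idea in the abstract above, using inter-frame differences as a cheap motion proxy to decide which tokens to recompute and which to serve from cache, can be sketched as below. This is a simplification under stated assumptions: frames are flat lists of per-token values, a single fixed threshold replaces the paper's dynamic per-token update frequencies, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence


def motion_scores(prev_frame: Sequence[float], cur_frame: Sequence[float]) -> List[float]:
    """Per-token absolute inter-frame difference, used as a cheap motion proxy."""
    return [abs(c - p) for p, c in zip(prev_frame, cur_frame)]


def tokens_to_recompute(prev_frame: Sequence[float], cur_frame: Sequence[float],
                        threshold: float) -> List[int]:
    """Indices whose motion proxy exceeds the threshold; the rest reuse cached outputs."""
    scores = motion_scores(prev_frame, cur_frame)
    return [i for i, s in enumerate(scores) if s > threshold]


def cached_step(cache: Sequence[float], fresh_fn: Callable[[int], float],
                recompute: Sequence[int]) -> List[float]:
    """Refresh only the selected token indices; keep cached values elsewhere."""
    out = list(cache)
    for i in recompute:
        out[i] = fresh_fn(i)
    return out


if __name__ == "__main__":
    prev, cur = [0.0, 0.5, 1.0, 1.0], [0.0, 0.9, 1.0, 0.2]
    moving = tokens_to_recompute(prev, cur, threshold=0.3)
    print(moving)
    print(cached_step([10.0, 20.0, 30.0, 40.0], lambda i: -float(i), moving))
```

Static tokens skip the expensive denoising call entirely, while high-motion tokens are always refreshed, which mirrors the abstract's point that moving pixels need more denoising steps to avoid error accumulation.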

2604.28130 2026-05-15 cs.CV

MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

Kehong Gong, Zhengyu Wen, Dao Thien Phong, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Guanli Hou, Dongze Lian, Xiaoyu He, Mingyuan Zhang, Hanwang Zhang

Affiliations * Huawei Technologies Co., Ltd; Central Research Institute

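AI summary This paper proposes MoCapAnything V2, an end-to-end motion capture framework for arbitrary skeletons that resolves the ambiguity in mapping joint positions to rotations found in traditional staged pipelines. By introducing a reference pose-rotation pair from the target asset, it fixes the rotation coordinate system, making rotation prediction more precise and easier to learn. The method predicts joint positions directly from video without a mesh intermediate, improving robustness and efficiency; it significantly reduces rotation error on multiple datasets and runs about 20x faster than mesh-based methods.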

Comments Project page: https://animotionlab.github.io/MoCapAnythingV2/

Details
English abstract

Recent methods for arbitrary-skeleton motion capture from monocular video follow a factorized pipeline, where a Video-to-Pose network predicts joint positions and an analytical inverse-kinematics (IK) stage recovers joint rotations. While effective, this design is inherently limited, since joint positions do not fully determine rotations and leave degrees of freedom such as bone-axis twist ambiguous, and the non-differentiable IK stage prevents the system from adapting to noisy predictions or optimizing for the final animation objective. In this work, we present the first fully end-to-end framework in which both Video-to-Pose and Pose-to-Rotation are learnable and jointly optimized. We observe that the ambiguity in pose-to-rotation mapping arises from missing coordinate system information: the same joint positions can correspond to different rotations under different rest poses and local axis conventions. To resolve this, we introduce a reference pose-rotation pair from the target asset, which, together with the rest pose, not only anchors the mapping but also defines the underlying rotation coordinate system. This formulation turns rotation prediction into a well-constrained conditional problem and enables effective learning. In addition, our model predicts joint positions directly from video without relying on mesh intermediates, improving both robustness and efficiency. Both stages share a skeleton-aware Global-Local Graph-guided Multi-Head Attention (GL-GMHA) module for joint-level local reasoning and global coordination. Experiments on Truebones Zoo and Objaverse show that our method reduces rotation error from ~17 degrees to ~10 degrees, and to 6.54 degrees on unseen skeletons, while achieving ~20x faster inference than mesh-based pipelines. Project page: https://animotionlab.github.io/MoCapAnythingV2/

2604.27263 2026-05-15 cs.CL

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Théo Gigant, Bowen Peng, Jeffrey Quesnelle

Affiliations * Nous Research

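AI summary This paper studies what subword tokenization specifically contributes to large language model training by building a controlled byte-level pretraining framework that decouples and analyzes its effects. The authors formulate and test hypotheses along several dimensions, including sample throughput, vocabulary scaling, and the linguistic prior of subword boundaries, identifying key reasons why subword models outperform raw byte models and suggesting directions for improving the pretraining of future byte-level and subword models.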

Comments 14 pages, 7 figures

Details
English abstract

Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we decouple the effects of subword tokenization by isolating them within a controlled byte-level pretraining pipeline. We formulate and test hypotheses across various dimensions, including sample throughput, vocabulary scaling, and the linguistic prior of subword boundaries. By simulating these effects in a byte-level setting, we refine our understanding of why subword models outperform raw byte models and offer insights to improve the pretraining of future byte-level and subword models. Specifically, our experiments highlight the critical role of increased training throughput and the integration of subword boundaries as either explicit priors or inductive biases.

2604.22050 2026-05-15 cs.LG cs.CL

LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric

Affiliations * Openchip & Software Technologies, S.L

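AI summary LayerBoost is a layer-aware attention reduction method aimed at improving the inference efficiency of large language models. It runs a systematic sensitivity analysis on a pretrained model to identify the layers that matter most for performance and, according to each layer's sensitivity, keeps standard attention, replaces it with linear sliding window attention, or removes attention entirely, lowering computational complexity while preserving model quality. Experiments show that LayerBoost cuts inference latency by up to 68% at high concurrency, performs on par with or close to the original model on several benchmarks, and significantly outperforms existing attention linearization methods.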

Details
English abstract

Transformers mostly rely on softmax attention, which introduces quadratic complexity with respect to sequence length and remains a major bottleneck for efficient inference. Prior work on linear or hybrid attention typically replaces softmax attention uniformly across all layers, often leading to significant performance degradation or requiring extensive retraining to recover model quality. This work proposes LayerBoost, a layer-aware attention reduction method that selectively modifies the attention mechanism based on the sensitivity of individual transformer layers. It first performs a systematic sensitivity analysis on a pretrained model to identify layers that are critical for maintaining performance. Guided by this analysis, three distinct strategies can be applied: retaining standard softmax attention in highly sensitive layers, replacing it with linear sliding window attention in moderately sensitive layers, and removing attention entirely in layers that exhibit low sensitivity. To recover performance after these architectural modifications, we introduce a lightweight distillation-based healing phase requiring only 10M additional training tokens. LayerBoost reduces inference latency and improves throughput by up to 68% at high concurrency, while maintaining competitive model quality. It matches base model performance on several benchmarks, exhibits only minor degradations on others, and significantly outperforms state-of-the-art attention linearization methods. These efficiency gains make our method particularly well-suited for high-concurrency serving and hardware-constrained deployment scenarios, where inference cost and memory footprint are critical bottlenecks.
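
The three-way strategy described in the abstract above (keep softmax attention in highly sensitive layers, use linear sliding window attention in moderately sensitive ones, remove attention in low-sensitivity ones) amounts to a thresholded mapping from per-layer sensitivity to an attention choice. The sketch below assumes a scalar sensitivity score per layer and two cut-off values; the score definition, the thresholds 0.5 and 0.1, and the names used are hypothetical and not taken from the paper.

```python
from typing import Dict


def assign_attention(sensitivity: Dict[int, float],
                     high: float = 0.5, low: float = 0.1) -> Dict[int, str]:
    """Map each layer index to an attention strategy based on its sensitivity score."""
    plan = {}
    for layer, score in sensitivity.items():
        if score >= high:
            plan[layer] = "softmax"                # highly sensitive: keep full attention
        elif score >= low:
            plan[layer] = "linear_sliding_window"  # moderately sensitive: cheaper attention
        else:
            plan[layer] = "none"                   # low sensitivity: drop attention
    return plan


if __name__ == "__main__":
    # Hypothetical sensitivities, e.g. quality drop observed when a layer's attention is ablated.
    print(assign_attention({0: 0.9, 1: 0.3, 2: 0.01}))
```

In practice such a plan would be followed by the distillation-based healing phase the abstract mentions, since changing attention in place generally shifts the model's behavior.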

2604.21809 2026-05-15 cs.LG cs.AI q-bio.QM stat.ML

Quotient-Space Diffusion Models

Yixian Xu, Yusong Wang, Shengjie Luo, Kaiyuan Gao, Tianyu He, Di He, Chang Liu

Affiliations * State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China; Huazhong University of Science and Technology, Wuhan, China; Microsoft Research Asia, Beijing, China; Zhongguancun Academy, Beijing, China

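AI summary This paper proposes Quotient-Space Diffusion Models, a generative modeling framework designed to handle and exploit symmetry in a system. By running the generation process on the quotient space, which removes symmetry redundancy, the model can learn the generation process more flexibly while still preserving the symmetric target distribution. The framework is instantiated for molecular structure generation, where it outperforms equivariant diffusion models and alignment-based methods, offering a new way to treat symmetry in generative models.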

Comments ICLR 2026 Oral Presentation; 43 pages, 5 figures, 6 tables; ICLR 2026 Camera Ready version

Details
English abstract

Diffusion-based generative models have reformed generative AI, and also enabled new capabilities in the science domain, e.g., fast generation of 3D structures of molecules. In such tasks, there is often a symmetry in the system, identifying elements that can be converted by certain transformations as equivalent. Equivariant diffusion models guarantee a symmetric distribution, but miss the opportunity to make learning easier, while alignment-based simplification attempts fail to preserve the target distribution. In this work, we develop quotient-space diffusion models, a principled generative framework to fully handle and leverage symmetry. By viewing the intrinsic generation process on the quotient space, the exact construction that removes symmetry redundancy, the framework simplifies learning by allowing model output to have an arbitrary intra-equivalence-class movement, while generating the correct symmetric target distribution with guarantee. We instantiate the framework for molecular structure generation which follows $\mathrm{SE}(3)$ (rigid-body movement) symmetry. It improves the performance over equivariant diffusion models and outperforms alignment-based methods universally for small molecules and proteins, representing a new framework that surpasses previous symmetry treatments in generative models.

2604.19092 2026-05-15 cs.RO cs.AI

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Feng Jiang, Yang Chen, Kyle Xu, Yuchen Liu, Haifeng Wang, Zhenhao Shen, Jasper Lu, Shengze Huang, Yuanfei Wang, Chen Xie, Ruihai Wu

Affiliations * Peking University; Tsinghua University; Lightwheel

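AI summary RoboWM-Bench is a benchmark focused on robotic manipulation that evaluates whether the behaviors generated by video world models are physically executable. It converts generated videos into executable action sequences and checks their feasibility in physical simulation environments, giving a systematic assessment of how models would fare in real robotic manipulation. The study finds that visual plausibility and physical executability do not always agree, highlighting the importance of embodied evaluation for complex manipulation tasks.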

Details
English abstract

Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of using generated videos as scalable supervision for robot learning. However, for embodied manipulation, perceptual realism alone is not sufficient: generated interactions must also be physically consistent and executable by robotic agents. Existing benchmarks provide valuable assessments of visual quality and physical plausibility, but they do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete manipulation tasks. We introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated human-hand and robotic manipulation videos into embodied action sequences and validates them through execution in physically grounded simulation environments. Built on real-to-sim scene reconstruction and diverse manipulation tasks, RoboWM-Bench enables standardized, reproducible, and scalable evaluation of physical executability. Using RoboWM-Bench, we evaluate state-of-the-art video world models and observe that visual plausibility and embodied executability are not always aligned. Our analysis highlights several recurring factors that affect execution performance, including spatial reasoning, contact prediction, and non-physical geometric distortions, particularly in complex and long-horizon interactions. These findings provide a more fine-grained view of current model capabilities and underscore the value of embodiment-aware evaluation for guiding physically grounded world modeling in robotic manipulation.