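arXivDaily daily academic digest: synced with the full arXiv dataset, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.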

2605.10195 2026-05-15 cs.LG

Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

Shuzhang Zhong, Haochen Huang, Shengxuan Qiu, Pengfei Zuo, Runsheng Wang, Meng Li

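Affiliations: Institute for Artificial Intelligence, Peking University; School of Integrated Circuits, Peking University; ByteDance Seed

AI summary: Tree-of-Thought (ToT) reasoning improves large language model performance on complex tasks through a tree-structured search, but its efficiency is limited by a synchronization bottleneck caused by the reward dependency barrier. This paper proposes SPEX, which breaks that barrier through speculative exploration, with path selection, resource allocation, and early stopping as its key techniques. Experiments show that SPEX delivers 1.2x to 3x speedups across a range of ToT algorithms and models, and up to 4.1x when combined with token-level speculative decoding.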

Comments OSDI 2026

2605.09825 2026-05-15 cs.LG cs.AI

Pretraining large language models with MXFP4 on Native FP4 Hardware

Musa Cim, Poovaiah Palangappa, Miro Hodak, Ravi Dwivedula, Meena Arunachalam, Mahmut Taylan Kandemir

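Affiliations: The Pennsylvania State University; Advanced Micro Devices, Inc.

AI summary: This paper studies the training instability that arises when pretraining large language models with MXFP4 quantization on native FP4 hardware. In controlled experiments that progressively enable FP4 in the forward pass, activation gradients, and weight gradients, the authors find that quantizing weight gradients is the main cause of degraded convergence. They further show that deterministic Hadamard rotations restore stable optimization while randomized variants do not, which indicates that the instability stems from structured microscaling error along sensitive gradient paths rather than from insufficient randomness. The experiments run on AMD Instinct MI355X GPUs, so the conclusions are verified without software emulation.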

2605.09094 2026-05-15 cs.LG

A Tale of Two Problems: Multi-Task Bilevel Learning Meets Equality Constrained Multi-Objective Optimization

Zhiyao Zhang, Myeung Suk Oh, Zhen Qin, Jiaxiang Li, Xin Zhang, Jia Liu

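Affiliations: Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA; Meta Platforms, Inc., Menlo Park, CA, USA

AI summary: This paper studies multi-task bilevel learning (MTBL) and, for the first time under a weakened convexity assumption on the lower-level objectives, reformulates it as an equality-constrained multi-objective optimization (ECMO) problem. To solve this new problem class, the authors propose a KKT-based Pareto stationarity convergence criterion and design a weighted Chebyshev penalty algorithm with finite-time convergence in both deterministic and stochastic settings. The method can systematically explore the Pareto front, and solutions of the original problem correspond directly to solutions of the ECMO problem, establishing a theoretical link between bilevel and multi-objective optimization.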
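For readers unfamiliar with the term, the following is a minimal sketch of a generic weighted Chebyshev scalarization combined with a quadratic penalty on an equality constraint. It is not taken from the paper: the function names, the reference point, the penalty form, and the parameter `rho` are all assumptions made for illustration.

```python
def weighted_chebyshev(objectives, weights, reference):
    """Generic weighted Chebyshev scalarization: max_i w_i * |f_i - z_i|.

    `reference` (z) is a hypothetical ideal point chosen by the user.
    """
    return max(w * abs(f - z) for f, w, z in zip(objectives, weights, reference))


def penalized_objective(objectives, weights, reference, residual, rho):
    """Scalarized objective plus a quadratic penalty on h(x) = 0.

    `residual` holds the components of h(x); the quadratic form and `rho`
    are illustrative assumptions, not the paper's algorithm.
    """
    penalty = sum(h * h for h in residual)
    return weighted_chebyshev(objectives, weights, reference) + rho * penalty


if __name__ == "__main__":
    # Two objective values, equal weights, reference point at the origin.
    print(weighted_chebyshev([1.0, 3.0], [0.5, 0.5], [0.0, 0.0]))
    print(penalized_objective([1.0, 3.0], [0.5, 0.5], [0.0, 0.0], [1.0, 2.0], 2.0))
```

Varying the weights moves the minimizer along the Pareto front, which is the usual reason for choosing a Chebyshev scalarization over a plain weighted sum.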

2605.09038 2026-05-15 cs.AI

SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

Jinchao Hu, Meizhi Zhong, Kehai Chen, Min Zhang

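Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; TikTok Inc, Beijing

AI summary: This paper proposes SearchSkill, a framework for teaching large language models to use search tools more effectively, particularly in open-domain question answering. It plans queries explicitly through a reusable bank of search skills: at each step the model first selects a skill, then generates a search or answer action conditioned on it. The skill bank evolves in response to failure patterns observed during training, improving search efficiency and answer accuracy. Experiments show that SearchSkill raises exact match on several knowledge-intensive QA benchmarks and improves search behavior: less copying of the initial query, more focused queries, and more accurate answers under a limited search budget.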

2605.09028 2026-05-15 cs.LG

Diagnosing and Mitigating Domain Shift in Permission-Based Android Malware Detection

Md Rafid Islam

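Affiliations: Department of Electrical and Computer Engineering, North South University

AI summary: This paper studies how permission-based Android malware detection models degrade under domain shift. Using two complementary datasets and five ensemble classifiers, it reveals a marked asymmetry in cross-domain performance and finds that feature importance is highly unstable across domains. It then proposes a hybrid training strategy built on shared features, which improves cross-domain detection performance.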

2605.09027 2026-05-15 cs.CL cs.AI cs.LG cs.MA

GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Alexandre Le Mercier, Chris Develder, Thomas Demeester

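Affiliations: IDLab–T2K, Ghent University–imec

AI summary: In multi-agent systems, a single deceptive agent can wreck the performance of the whole collective and evade defenses. To address gaps in existing adversarial robustness evaluations, this paper proposes GAMBIT, a benchmark with three evaluation modes and two independent scores for assessing imposter detectors, with a focus on how they adapt under distribution shift and novel attacks. GAMBIT is built on chess, introduces a generalizable adaptive deceptive agent, and provides 27,804 labeled instances. It shows that zero-shot evaluation can be misleading against adaptive adversaries and demonstrates the value of fast recalibration in adversarial systems.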

Comments 46 pages, 16 figures

English abstract

In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.


2605.08913 2026-05-15 cs.LG cs.AR cs.CL cs.PF

Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes

Willy Fitra Hendria

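Affiliations: Independent Researcher

AI summary: This paper studies non-monotonic latency in Transformer decoding on the Apple MPS backend: as decoding length grows, latency does not rise smoothly but jumps sharply under certain configurations. Experiments across several model families find latency spikes of up to 21x the normal level. The effect occurs mainly in the decode phase, is unrelated to memory pressure, and does not appear on CPU or NVIDIA CUDA backends. The study further exposes a complex interaction between the KV cache and the anomalous execution regimes, underlining how hardware characteristics affect long-context inference performance.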

Comments 9 pages, 5 figures, 6 tables

2605.08888 2026-05-15 cs.CL cs.CV

DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding

Xiang Feng, Jiawei Zhou, Zhangfeng Huang, Kewei Wang, Shanshan Ye, Jinxin Hu, Zulong Chen, Yong Luo, Jing Zhang

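Affiliations: School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, China; Alibaba Group, Hangzhou, China; Independent Researcher; Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, United Arab Emirates

AI summary: DocScope is a benchmark for evaluating verifiable reasoning by multimodal large language models over long, visually rich documents. It recasts long-document question answering as structured reasoning-trace prediction, requiring the model to output evidence pages, supporting regions, relevant factual statements, and a final answer, and it examines the reasoning process through a four-stage evaluation protocol. Experiments show that answer accuracy alone cannot fully capture model reliability: complete-evidence-chain rates are generally low, and region grounding and cross-document evidence integration remain the main challenges.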

Comments 50 pages, 25 figures, 14 tables

2605.08851 2026-05-15 cs.CV cs.AI cs.LG

Geometrically Constrained Stenosis Editing in Coronary Angiography via Entropic Optimal Transport

Jialin Li, Zhuo Zhang, Yue Cao, Guipeng Lan, Jiabao Wen, Shuai Xiao, Jiachen Yang

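Affiliations: School of Electrical and Information Engineering, Tianjin University, Tianjin, China

AI summary: To address the shortage of data for stenosis detection in coronary angiography, this study proposes a geometrically constrained stenosis editing method based on entropic optimal transport. By modeling local editing as an entropic optimal transport problem guided by geometric information, the method achieves more precise structural control and image generation. Experiments show that the generated images substantially improve stenosis detection, with relative gains of 27.8% on a public dataset and 23.0% on a multi-center dataset.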

Comments Accepted to ICML 2026

2605.08825 2026-05-15 cs.CV

Rethinking Event-Based Object Detection through Representation-Level Temporal Aggregation and Model-Level Hypergraph Reasoning

Meisen Wang, Hao Deng, Wei Bao, Ma Yuanxiao, Chengjie Wang, Zhiqiang Tian, Shaoyi Du, Siqi Li

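Affiliations: Xi’an Jiaotong University; Tsinghua University; China Mobile System Integration; Inner Mongolia Agricultural University

AI summary: For event-based object detection (EOD), this paper proposes Ev-DTAD, a unified detection framework that addresses shortcomings of existing methods at both the representation level and the model level. It introduces a hierarchical temporal aggregation (HTA) module to encode temporal information explicitly at the representation level, and a frequency-aware hypergraph temporal fusion (FHTF) module to perform high-order relational reasoning at the model level, so that fragmented event responses are integrated more effectively. Experiments show that Ev-DTAD achieves higher detection accuracy and efficiency on multiple datasets.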

2605.08698 2026-05-15 cs.CV cs.LG

Supersampling Stable Diffusion and Beyond: A Seamless, Training-Free Approach for Scaling Neural Networks Using Common Interpolation Methods

Md Abu Obaida Zishan, Jannatun Noor, Annajiat Alim Rasel

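Affiliations: School of Data and Sciences, BRAC University, Dhaka; Computing for Sustainability and Social Good (C2SG) Research Group, Department of Computer Science and Engineering, United International University, Dhaka

AI summary: This paper proposes a training-free way to let diffusion models such as Stable Diffusion generate high-resolution images, interpolating convolution kernels to a larger size to avoid the duplicated-object artifacts that conventional approaches produce as resolution increases. It proves mathematically that, once multiplied by a constant coefficient, interpolation scales convolution kernels correctly, and when generating images beyond the training resolution it achieves results comparable to existing methods. The approach also shows potential for fully connected layers and can reduce the memory footprint of neural network training.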

Comments Updated the title for clarity. Removed background and redundant text from section 4.2,5. Improved organization in section 4 and clarity of text in Section 4.3

2605.08506 2026-05-15 cs.LG

Learning Polyhedral Conformal Sets for Robust Optimization

Shuyi Chen, Wenbin Zhou, Shixiang Zhu

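Affiliations: Carnegie Mellon University

AI summary: This study addresses the choice of uncertainty set in robust optimization, proposing a decision-focused conformal prediction framework that learns, from data, polyhedral uncertainty sets aligned with the optimization objective. The set's geometry is parameterized by data-driven hyperplanes, its shape is learned by minimizing the robust loss, and conformal calibration guarantees statistical validity. A recalibration step on an independent dataset corrects the bias introduced by data-dependent selection. The result remains computationally tractable while modeling directional and anisotropic uncertainty, and comes with finite-sample coverage guarantees and a suboptimality bound.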

2605.08374 2026-05-15 cs.AI

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Junwei Liao, Haoting Shi, Ruiwen Zhou, Jiaqian Wang, Shengtao Zhang, Wei Zhang, Ying Wen, Zhiyu Li, Feiyu Xiong, Bo Tang, Weinan Zhang, Muning Wen

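Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; National University of Singapore; Xidian University; University of Science and Technology of China; MemTensor (Shanghai) Technology Co., Ltd.

AI summary: This paper proposes MemQ, a memory agent framework that brings Q-learning into a memory system built on provenance DAGs, addressing the weakness of existing methods in handling dependencies between memories. MemQ updates memory Q-values with TD($λ$) eligibility traces and propagates credit backward along the provenance DAG, so dependencies between memories are evaluated more accurately. Experiments show that MemQ generalizes well and learns effectively at runtime on benchmarks from six domains, with the largest gains on multi-step tasks.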
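As background on the TD($λ$) eligibility traces mentioned in the summary, here is a minimal sketch of the textbook tabular TD($λ$) update with accumulating traces over a linear episode. It is a generic illustration only: it does not model MemQ's memory Q-values or its provenance DAG, and the function name, state labels, and hyperparameter defaults are assumptions.

```python
def td_lambda_update(values, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Tabular TD(lambda) with accumulating eligibility traces.

    values:  dict mapping state -> current value estimate (updated in place)
    episode: list of (state, reward, next_state); next_state is None at the end
    """
    traces = {s: 0.0 for s in values}
    for state, reward, next_state in episode:
        next_value = 0.0 if next_state is None else values[next_state]
        delta = reward + gamma * next_value - values[state]  # TD error
        traces[state] += 1.0  # mark the visited state as eligible
        for s in values:
            values[s] += alpha * delta * traces[s]  # credit all eligible states
            traces[s] *= gamma * lam                # decay eligibility
    return values


if __name__ == "__main__":
    v = td_lambda_update({"a": 0.0, "b": 0.0},
                         [("a", 0.0, "b"), ("b", 1.0, None)])
    print(v)
```

In this two-step episode the reward arrives only at state "b", yet state "a" also receives part of the credit through its decayed trace, which is the mechanism eligibility traces provide.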

Comments 22 pages, 11 figures (containing 43 individual image panels total)

2605.08278 2026-05-15 cs.LG cs.AI cs.CR

Trapping Attacker in Dilemma: Examining Internal Correlations and External Influences of Trigger for Defending GNN Backdoors

Fan Yang, Binyan Xu, Di Tang, Kehuan Zhang

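Affiliations: The Chinese University of Hong Kong; Sun Yat-Sen University

AI summary: This paper studies defense against backdoor attacks on graph neural networks (GNNs) and proposes a new defense called PRAETORIAN. By analyzing the internal correlations of candidate trigger subgraphs and their external influence on nodes, it detects abnormally injected structures and identifies trigger nodes with disproportionate influence. Experiments show that PRAETORIAN substantially lowers attack success rates while keeping clean accuracy high, and stays effective against several adaptive attacks, forcing attackers into an unfavorable trade-off between utility and detectability.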

2605.07594 2026-05-15 cs.RO

MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents

Xin Ding, Xinrui Wang, Yifan Yang, Hao Wu, Shiqi Jiang, Qianxi Zhang, Liang Mi, Hanxin Zhu, Kun Li, Yunxin Liu, Zhibo Chen, Ting Cao

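Affiliations: University of Science and Technology of China; Huazhong University of Science and Technology; Microsoft Research; Nanjing University; Institute for AI Industry Research (AIR), Tsinghua University

AI summary: This paper proposes MemCompiler, a memory system for embodied agents that addresses the mismatch between injected memory and agent state in dynamic environments. It reframes memory use as state-conditioned memory compilation: a learned memory compiler selects and compiles relevant memories according to the agent's current state, producing executable guidance. Experiments show that MemCompiler improves agent performance across several task environments while reducing latency.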

2605.06132 2026-05-15 cs.CL

MemReranker: Reasoning-Aware Reranking for Agent Memory Retrieval

Chunyu Li, Mengyuan Zhang, Jingyi Kang, Ding Chen, Jiajun Shen, Bo Tang, Xuanhe Zhou, Feiyu Xiong, Zhiyu Li

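Affiliations: China Telecom Research Institute; Shanghai Jiao Tong University

AI summary: In agent memory systems, the reranker is the key bridge between user queries and long-term memory. Existing approaches mostly follow a two-stage retrieve-then-rerank paradigm, but general-purpose rerankers rely on semantic similarity matching and lack real reasoning ability, so retrieved results can be semantically related yet miss the information needed to answer the question. This paper proposes MemReranker, a family of rerankers built on Qwen3-Reranker through multi-stage knowledge distillation: multi-teacher comparison produces calibrated labels, pointwise BCE distillation shapes the score distribution, and InfoNCE contrastive learning sharpens discrimination on hard examples. Training combines general corpora with multi-turn dialogue data covering temporal constraints, causal reasoning, and similar scenarios. The models perform well on multiple benchmarks and significantly outperform existing models in reasoning ability and inference efficiency.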

2605.05686 2026-05-15 cs.AI

Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Qiyao Liang, Risto Miikkulainen, Ila Fiete

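Affiliations: Massachusetts Institute of Technology; University of Texas Austin; Cognizant

AI summary: This study examines two failure modes of language models during generation, knowledge conflict and confident hallucination, and gives them a unified geometric account in hidden-state space. Learned facts form attractor basins: conflict arises when working memory disrupts convergence to the correct basin, while hallucination arises when no matching basin exists and the hidden state drifts freely. A geometric margin metric separates correct recall from hallucination, the structure is shown not to depend on fine-tuning, and the fraction of confident hallucinations grows as models scale up.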

Comments 9 pages, 6 figures, plus appendices

English abstract

Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes--conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task--entity identifiers mapped to unique codes with PM installed via LoRA adapters--where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin--the hidden state's distance to the nearest memorized basin--reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law $C = \exp(-c/\bar{\Delta})$, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it--and this erasure worsens with scale.
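The two quantities named in the abstract can be made concrete with a small sketch. It assumes, purely for illustration, that each memorized basin is represented by a single center point and that distance is Euclidean; the abstract does not specify either choice, nor how the constant c or the quantity written with a bar over Delta are defined, so the function and argument names below are hypothetical.

```python
import math


def geometric_margin(hidden_state, basin_centers):
    """Distance from a hidden state to the nearest basin center.

    Representing a basin by one center point and using Euclidean distance
    are illustrative assumptions, not details from the paper.
    """
    return min(math.dist(hidden_state, center) for center in basin_centers)


def confident_hallucination_fraction(c, mean_delta):
    """Evaluate the stated scaling law C = exp(-c / mean_delta)."""
    return math.exp(-c / mean_delta)


if __name__ == "__main__":
    print(geometric_margin([0.0, 0.0], [[3.0, 4.0], [6.0, 8.0]]))
    print(confident_hallucination_fraction(1.0, 1.0))
    print(confident_hallucination_fraction(1.0, 2.0))
```

For positive c, the formula increases toward 1 as the barred quantity grows, which is consistent with the abstract's statement that the fraction grows with scale.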


2605.04554 2026-05-15 cs.CV

InterMesh: Explicit Interaction-Aware End-to-End Multi-Person Human Mesh Recovery

Kaili Zheng, Kaiwen Wang, Xun Zhu, Chenyi Guo, Ji Wu

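Affiliations: Department of Electronic Engineering, Tsinghua University; College of AI, Tsinghua University; Beijing National Research Center for Information Science and Technology

AI summary: This paper proposes InterMesh, an end-to-end multi-person human mesh recovery framework that models interactions between people and their environment, and between people, more accurately. Unlike existing DETR-based methods, InterMesh brings in a human-object interaction detector to inject interaction semantics explicitly into mesh recovery, improving pose and shape estimation. Lightweight modules integrate the interaction information efficiently, and results on multiple datasets show clear gains in complex interaction scenes.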

Comments 13 pages, 10 figures

2605.04236 2026-05-15 cs.LG

Adaptive Consensus in LLM Ensembles via Sequential Evidence Accumulation: Automatic Budget Identification and Calibrated Commit Signals

Roberto E. Medina

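Affiliations: Independent Researcher

AI summary: This study proposes DASE, an adaptive stopping mechanism for inference with ensembles of large language models. It automatically identifies the budget during evidence accumulation and emits calibrated commit signals to improve overall accuracy. DASE commits early when consensus is reached early and falls back to a global-frequency strategy when the evidence is fragmented, yielding significant gains on multiple benchmarks. The study also finds that the adaptive stopping policy affects accuracy far more than injection bandwidth does, and that injection methods show an inverted-U relationship between accuracy and inference cost.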

2605.03823 2026-05-15 cs.LG cs.IT math.IT math.ST stat.TH

Realizable Bayes-Consistency for General Metric Losses

Dan Tsir Cohen, Steve Hanneke, Aryeh Kontorovich

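Affiliations: Ben-Gurion University of the Negev; Purdue University, USA

AI summary: This paper studies strong universal Bayes-consistency for learning with general metric losses in the realizable setting, extending known results for binary classification and regression. The authors characterize the conditions on a hypothesis class under which there exists a distribution-free learning rule whose risk converges almost surely to the best risk in the class, namely zero. The main contribution is an exact characterization in terms of a combinatorial obstruction: infinite non-decreasing $(γ_k)$-Littlestone trees, which generalize the classical Littlestone tree to metric losses.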

Comments 14 pages. To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026); v2: fixed abstract metadata rendering

2605.03596 2026-05-15 cs.AI cs.CL cs.DB cs.LG

Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

Zirui Tang, Xuanhe Zhou, Yumou Liu, Linchun Li, Yukai Wu, Weizheng Wang, Hongzhang Huang, Wei Zhou, Jun Zhou, Jiachen Song, Shaoli Yu, Jinqi Wang, Zihang Zhou, Hongyi Zhou, Yuting Lv, Jinyang Li, Jiashuo Liu, Ruoyu Chen, Chunwei Liu, GuoLiang Li, Jihua Kang, Fan Wu

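Affiliations: GitHub

AI summary: Workspace-Bench 1.0 is a benchmark for evaluating how well AI agents handle large-scale file dependencies in workspace tasks. The study builds complex workspaces containing many file types and realistic work scenarios, and designs a large set of tasks to test agents' cross-file retrieval, contextual reasoning, and adaptive decision-making. Experiments show that current mainstream AI models still fall far short of human performance on the benchmark, highlighting how hard reliable workspace learning remains in real work settings.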

Comments 30 pages, 16 figures

2605.02438 2026-05-15 cs.CV cs.LG

Mixture Prototype Flow Matching for Open-Set Supervised Anomaly Detection

Fuyun Wang, Yuanzhi Wang, Xu Guo, Sujia Huang, Tong Zhang, Dan Wang, Hui Yan, Xin Liu, Zhen Cui

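Affiliations: Nanjing University of Science and Technology; Beijing Normal University, Beijing, China; China Academy of Space Technology, Beijing, China

AI summary: This paper studies open-set supervised anomaly detection (OSAD), which aims to identify unseen anomalies using limited anomaly supervision. Existing prototype-based methods ignore the multimodal nature of normal data, which blurs the decision boundary. To address this, the paper proposes Mixture Prototype Flow Matching (MPFM), which maps the distribution of normal features into a structured Gaussian-mixture prototype space through a continuous transformation. The method uses a Gaussian mixture prior to model the velocity field and a mutual-information-maximization regularizer to make prototypes more discriminative, and experiments report leading performance on a range of benchmark datasets.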

Comments Accepted by ICML 2026

2605.02398 2026-05-15 cs.AI cs.CL cs.LG

The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

Rahul Kumar

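Affiliations: Independent Researcher

AI summary: As frontier AI models are deployed in high-stakes decision pipelines, their ability to keep metacognition stable under adversarial pressure becomes a key safety requirement. This paper studies the metacognitive collapse that occurs when models face forced-compliance instructions and introduces the concept of the "compliance trap": the severe performance drop comes not from the threatening content itself but from a breach of cognitive boundaries triggered by the mandatory instructions. In large-scale experiments, most models degrade significantly under adversarial conditions, whereas Anthropic's Constitutional AI shows strong immunity, which the summary attributes to alignment training.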

Comments 9 pages, 2 figures, 3 tables. Code: https://github.com/rkstu/schema-compliance-trap Dataset: https://huggingface.co/datasets/lightmate/schema-compliance-trap

2605.01758 2026-05-15 cs.AI

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

Yue Ma, Ziyuan Yang, Yi Zhang

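Affiliations: Sichuan University; Nanyang Technological University

AI summary: To counter infectious jailbreak attacks in multi-agent systems, this study proposes Foresight-Guided Local Purification (FLP), a training-free framework. It simulates future interaction trajectories with a multi-persona simulation strategy to detect and eliminate infected behavior in agents, reducing the spread of infection. Experiments show that FLP cuts the maximum cumulative infection rate from over 95% to below 5.47% while preserving interaction diversity, significantly outperforming existing methods.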

Comments 12 pages

English abstract

Large multimodal model-based Multi-Agent Systems (MASs) enable collaborative complex problem solving through specialized agents. However, MASs are vulnerable to infectious jailbreak, where compromising a single agent can spread to others, leading to widespread compromise. Existing defenses counter this by training a more contagious cure factor, biasing agents to retrieve it over virus adversarial examples (VirAEs). However, this homogenizes agent responses, providing only superficial suppression rather than true recovery. We revisit these defenses, which operate globally via a shared cure factor, while infectious jailbreaks arise from localized interaction behaviors. This mismatch limits their effectiveness. To address this, we propose a training-free Foresight-Guided Local Purification (FLP) framework, where each agent reasons over future interactions to track behavioral evolution and eliminate infections. Specifically, each agent simulates future behavioral trajectories over subsequent chat rounds. To reflect diversity in MASs, we introduce a multi-persona simulation strategy for robust prediction across interaction contexts. We then use response diversity as a diagnostic signal to detect infection by analyzing inconsistencies across persona-based predictions at both retrieval-result and semantic levels. For infected agents, we apply localized purification: recent infections are mitigated via immediate album rollback, while long-term infections are handled using Recursive Binary Diagnosis (RBD), which recursively partitions the image album and applies the same diagnosis strategy to localize and eliminate VirAEs. Experiments show that FLP reduces the maximum cumulative infection rate from over 95% to below 5.47%. Moreover, retrieval and semantic metrics closely match benign baselines, indicating effective preservation of interaction diversity.


2605.01725 2026-05-15 cs.CV cs.AI

Motion-Aware Caching for Efficient Autoregressive Video Generation

Jing Xu, Yuexiao Ma, Xuzhe Zheng, Xing Wang, Shiwei Liu, Chenqian Yan, Xiawu Zheng, Rongrong Ji, Fei Chao, Songwei Liu

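Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; Max Planck Institute for Intelligent Systems; ELLIS Institute Tübingen; Tübingen AI Center

AI summary: This paper studies how motion-aware caching can make autoregressive video generation more efficient. Existing methods rely on coarse, chunk-level cache skipping that cannot capture pixel-level dynamics, which degrades generation quality. The authors propose MotionCache, which uses inter-frame differences as a lightweight proxy for pixel motion and applies a coarse-to-fine strategy to speed up generation while preserving quality. Experiments show speedups of up to 6.28x on several state-of-the-art models with generation quality maintained.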

Comments 20 pages

2604.28130 2026-05-15 cs.CV

MoCapAnything V2: End-to-End Motion Capture for Arbitrary Skeletons

Kehong Gong, Zhengyu Wen, Dao Thien Phong, Mingxi Xu, Weixia He, Qi Wang, Ning Zhang, Zhengyu Li, Guanli Hou, Dongze Lian, Xiaoyu He, Mingyuan Zhang, Hanwang Zhang

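Affiliations: Huawei Technologies Co., Ltd; Central Research Institute

AI summary: This paper proposes MoCapAnything V2, an end-to-end motion capture framework for arbitrary skeletons that resolves the ambiguity staged pipelines face when mapping joint positions to rotations. It introduces reference pose-rotation pairs from the target asset to pin down the rotation coordinate frame, making rotation prediction more accurate and easier to learn. The method predicts joint positions directly from video without a mesh intermediate representation, which improves robustness and efficiency. It markedly reduces rotation error on multiple datasets and runs about 20x faster at inference than mesh-based methods.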

Comments Project page: https://animotionlab.github.io/MoCapAnythingV2/

2604.27263 2026-05-15 cs.CL

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Théo Gigant, Bowen Peng, Jeffrey Quesnelle

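Affiliations: Nous Research

AI summary: This paper studies what subword tokenization actually contributes to large language model training, building a controlled byte-level pretraining framework that decouples and analyzes its effects. It formulates and tests hypotheses along several dimensions, including sample throughput, vocabulary scaling, and the linguistic prior carried by subword boundaries, identifying the key reasons subword models outperform raw byte models and pointing to improvements for future byte-level and subword pretraining.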

Comments 14 pages, 7 figures

2604.22050 2026-05-15 cs.LG cs.CL

LayerBoost: Layer-Aware Attention Reduction for Efficient LLMs

Mohamed Ali Souibgui, Jan Fostier, Rodrigo Abadía-Heredia, Bohdan Denysenko, Christian Marschke, Igor Peric

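Affiliations: Openchip & Software Technologies, S.L

AI summary: LayerBoost is a layer-aware attention reduction method for more efficient large language model inference. It runs a systematic sensitivity analysis on a pretrained model to identify the layers that matter most for performance, then, according to each layer's sensitivity, keeps standard attention, switches to linear sliding-window attention, or removes attention entirely, lowering computational cost while preserving model quality. Experiments show that LayerBoost cuts inference latency by up to 68% under high concurrency and performs on par with or close to the original model on multiple benchmarks, significantly outperforming existing attention linearization methods.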

2604.21809 2026-05-15 cs.LG cs.AI q-bio.QM stat.ML

Quotient-Space Diffusion Models

Yixian Xu, Yusong Wang, Shengjie Luo, Kaiyuan Gao, Tianyu He, Di He, Chang Liu

Affiliations: State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China; Huazhong University of Science and Technology, Wuhan, China; Microsoft Research Asia, Beijing, China; Zhongguancun Academy, Beijing, China

AI summary: This paper proposes Quotient-Space Diffusion Models, a generative modeling framework for handling and exploiting symmetries in a system. By running the generative process on the quotient space, where symmetry redundancy has been removed, the model can learn the generative process more flexibly while preserving the target symmetric distribution. Instantiated for molecular structure generation, the framework outperforms equivariant diffusion models and alignment-based methods.

Comments ICLR 2026 Oral Presentation; 43 pages, 5 figures, 6 tables; ICLR 2026 Camera Ready version

2604.19092 2026-05-15 cs.RO cs.AI

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Feng Jiang, Yang Chen, Kyle Xu, Yuchen Liu, Haifeng Wang, Zhenhao Shen, Jasper Lu, Shengze Huang, Yuanfei Wang, Chen Xie, Ruihai Wu

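Affiliations: Peking University; Tsinghua University; Lightwheel

AI summary: RoboWM-Bench is a benchmark focused on robotic manipulation that evaluates whether the behaviors generated by video world models are physically executable. It converts generated videos into executable action sequences and checks their feasibility in a physics simulation environment, systematically assessing how models fare in real robotic manipulation. The study finds that visual plausibility and physical executability do not always agree, underscoring the importance of embodied evaluation for complex manipulation tasks.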
