arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2605.15647 2026-05-18 cs.LG cs.NE

Perforated Neural Networks for Keyword Spotting

Vishy Gopal, Aris Ilias Goutis, Ralph Crewe, Erin Yanacek, Rorry Brenner

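Affiliations Purdue University; Renesas Electronics; Perforated AI

AI summary This paper applies Perforated Backpropagation to keyword spotting on the Edge Impulse platform. By adding artificial Dendrite Nodes to a standard convolutional neural network, it shows that dendritic models beat traditional architectures on both parameter count and accuracy, improving model quality and deployment efficiency together.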

Comments 9 pages, 1 figure, 800-trial hyperparameter sweep; Best Model award, Edge Impulse 2025 Hackathon

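AI abstract (translated from Chinese)

Edge machine learning faces constraints that cloud-scale deployment does not: strict memory budgets, limited compute, and non-negotiable accuracy thresholds must all be met at once. Existing compression and optimization techniques can trade one resource for another but rarely improve accuracy and model size together. This paper applies Perforated Backpropagation to keyword spotting on the Edge Impulse platform, an experiment that won the Best Model award at the Edge Impulse Hackathon in December 2025. By adding artificial Dendrite Nodes to a standard convolutional neural network trained on the Edge Impulse keyword spotting tutorial pipeline, we show that, across 800 hyperparameter trials, dendritic models outperform traditional architectures at every parameter-count level and every accuracy threshold tested. The best dendritic model reaches 93.3% test accuracy with only 1,500 parameters, whereas the baseline needs about 4,000 parameters to reach 92.1%. These results suggest that Perforated Backpropagation is a strong addition to the edge AI engineer's toolkit, improving both model quality and deployment efficiency.

English abstract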

Edge machine learning presents a unique set of constraints not encountered in cloud-scale model deployment: strict memory budgets, limited compute, and non-negotiable accuracy thresholds must all be satisfied simultaneously. Existing compression and optimization techniques can trade one resource for another, but rarely improve both accuracy and model size at the same time. This paper presents the application of Perforated Backpropagation to keyword spotting on the Edge Impulse platform, an experiment that won the Best Model award at the Edge Impulse 2025 Hackathon in December 2025. By adding artificial Dendrite Nodes to a standard convolutional neural network trained on the Edge Impulse keyword spotting tutorial pipeline, we demonstrate that dendritic models outperform traditional architectures at every level of parameter count and at every accuracy threshold tested across 800 hyperparameter trials. The best dendritic model achieved a test accuracy of 0.933 with only 1,500 parameters, versus the baseline accuracy of 0.921 requiring approximately 4,000 parameters. These results suggest that Perforated Backpropagation is a powerful addition to the edge AI engineer's toolkit, offering simultaneous gains in both model quality and deployment efficiency.

2605.15640 2026-05-18 cs.CV

Learning Disentangled Representations for Generalized Multi-view Clustering

Xin Zou, Ruimeng Liu, Chang Tang, Zhenglai Li, Xinwang Liu, Kunlun He, Wanqing Li

Affiliations AI Thrust, The Hong Kong University of Science and Technology (Guangzhou); School of Computer Science and Technology, Huazhong University of Science and Technology; School of Software Engineering, Huazhong University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Computer, National University of Defense Technology; Medical Big Data Research Center, Medical Engineering Laboratory of Chinese PLA General Hospital; School of Computing and Information Technology, University of Wollongong

AI summary This paper proposes the GMAE framework, which uses disentangled representation learning to preserve complementarity across views and improve clustering. Experiments show it outperforms existing methods on both complete and incomplete multi-view clustering tasks.

Comments accepted by IEEE TPAMI 2026 (IEEE Transactions on Pattern Analysis and Machine Intelligence)

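AI abstract (translated from Chinese)

Multi-view clustering (MVC) has drawn attention for its ability to exploit complementary information. However, existing deep MVC methods often suffer from view-distribution entanglement during cross-view fusion, which degrades the quality of the shared latent space. To address this, the paper proposes the Generalized Multi-view Auto-Encoder (GMAE), which preserves cross-view complementarity through disentangled representation learning. Specifically, GMAE uses dual-path autoencoders to decouple source features into view-specific and view-common embeddings, helping reveal clearer clustering structures. Cross-view adversarial discriminators are further constructed to guide the view-specific encoders toward more discriminative features. By strategically modulating mutual information, GMAE aligns distributions and prevents representation collapse, yielding robust, non-trivial embeddings. Comprehensive experiments on 13 benchmark datasets show that GMAE outperforms existing methods on both complete and incomplete MVC tasks. Code: https://github.com/obananas/GMAE.

English abstract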

Multi-View Clustering (MVC) has gained significant attention for its ability to leverage complementary information across diverse views. However, existing deep MVC methods often struggle with view-distribution entanglement during cross-view fusion, which hampers the quality of the shared latent space and leads to suboptimal results. To address this issue, we propose the Generalized Multi-view Auto-Encoder (GMAE), a framework designed to preserve cross-view complementarity through disentangled representation learning. Specifically, GMAE employs dual-path autoencoders to decouple source features into view-specific and view-common embeddings, facilitating the discovery of clearer clustering structures. We further construct cross-view adversarial discriminators to guide view-specific encoders in capturing more discriminative features. By strategically modulating mutual information, GMAE effectively aligns distributions and prevents representation collapse, ensuring the generation of robust, non-trivial embeddings. Comprehensive experiments on 13 benchmark datasets demonstrate that GMAE consistently outperforms state-of-the-art methods in both complete and incomplete MVC tasks. Our code implementation is available at the repository: https://github.com/obananas/GMAE.

2605.15635 2026-05-18 cs.CL

Evaluating Chinese Ambiguity Understanding in Large Language Models

Junwen Mo, Yuanzhi Lu, Yifang Xue, Ke Xu, Hideki Nakayama

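Affiliations Graduate School of Information Science and Technology, The University of Tokyo; School of Software Engineering, South China University of Technology

AI summary This paper builds CHA-Gen, the first Chinese ambiguity dataset grounded in Potential Ambiguity Theory, evaluates how LLMs perform at ambiguity detection, and reports common failure modes in ambiguity recognition along with semantic uncertainty quantification results.

AI abstract (translated from Chinese)

Linguistic ambiguity is critical to the robustness of large language models (LLMs), yet existing research focuses mostly on English and pays limited attention to Chinese. Existing Chinese ambiguity datasets (e.g., CHAmbi) scale poorly. Guided by Potential Ambiguity (PA) Theory, we design a semi-automatic pipeline to build CHA-Gen, the first PA Theory-guided Chinese ambiguity dataset, containing 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) across 18 potentially ambiguous structures. Evaluating LLMs (e.g., the Gemma 3 and Qwen 2.5/3 series) via direct querying and machine translation, we find that LLMs struggle with ambiguity detection, though CoT prompting helps. Analysis of Qwen3-32B's CoT reasoning reveals three common failure modes: ambiguity blindness, misattribution, and premature resolution. Quantifying uncertainty with a semantic entropy metric shows that ambiguous sentences carry higher uncertainty. In addition, instruction tuning induces overconfidence, whereas base models better capture semantic diversity. We further find that models are biased toward dominant interpretations. This work offers a scalable way to build Chinese ambiguity corpora and insights into how LLMs handle ambiguity, laying a foundation for research on Chinese ambiguity in LLMs.

English abstract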

Linguistic ambiguity is critical to the robustness of Large Language Models (LLMs), yet existing research focuses mostly on English, with limited attention devoted to Chinese. Existing Chinese ambiguity datasets (e.g., CHAmbi) suffer from poor scalability. Guided by Potential Ambiguity (PA) Theory, we design a semi-automatic pipeline to construct CHA-Gen, the first PA Theory-grounded Chinese ambiguity dataset, which comprises 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) across 18 potentially ambiguous structures. Evaluating LLMs (e.g., the Gemma 3 and Qwen 2.5/3 series) via direct querying and machine translation, we find that LLMs struggle with ambiguity detection, although CoT prompting improves it. Analysis of Qwen3-32B's CoT rationales reveals three common failure modes: ambiguity blindness, misattribution, and premature resolution. Uncertainty quantification with a semantic entropy metric shows higher uncertainty for ambiguous sentences. Moreover, instruction tuning induces overconfidence, whereas base models better capture semantic diversity. We further observe that models exhibit a bias toward dominant interpretations. Our work provides a scalable approach to building Chinese ambiguity corpora and insights into LLMs' ambiguity handling, laying a foundation for research on Chinese ambiguity in LLMs.
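The semantic entropy measurement mentioned in the abstract can be illustrated with a minimal sketch. It assumes the sampled model outputs have already been grouped into meaning clusters; how the paper forms those clusters is not stated here, so the cluster labels below are hypothetical.

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical cluster labels for 8 sampled interpretations of each sentence.
unambiguous = ["A"] * 8              # every sample agrees on one meaning
ambiguous = ["A"] * 4 + ["B"] * 4    # samples split evenly across two meanings

print(semantic_entropy(unambiguous), semantic_entropy(ambiguous))
```

Under this toy setup the evenly split sentence has entropy log 2 while the unanimous one has entropy zero, which is the direction of the effect the abstract reports.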

2605.15626 2026-05-18 cs.LG

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

Ali Abbasi, Chayne Thrash, Haoran Qin, Hamed Pirsiavash, Soheil Kolouri

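Affiliations Vanderbilt University; University of California, Davis

AI summary IO-SVD builds a KL-aware double-sided whitening space and pairs it with an efficient heterogeneous rank-allocation strategy to balance quality and efficiency in LLM compression. Experiments show small performance loss under compression and notable inference speedups.

AI abstract (translated from Chinese)

Large language models perform strongly on language and reasoning tasks, but their storage and compute costs remain major barriers in resource-constrained and latency-sensitive settings. SVD-based post-training compression offers a hardware-agnostic way to shrink models and speed up inference through low-rank factorization. However, existing methods often rely on input-only whitening spaces, homogeneous rank allocation, or loss-agnostic allocation heuristics, which limits their ability to preserve model quality under aggressive compression. We propose Input-Output Whitened SVD (IO-SVD), a post-training compression method that builds a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, IO-SVD constructs an output-side metric that captures predictive sensitivity, while input whitening captures activation statistics. We further introduce an efficient heterogeneous rank-allocation strategy that scores whitened singular components with first-order calibration loss estimates and prunes the least sensitive components under a global budget. Inspired by prior work combining SVD truncation with quantization, we improve hybrid SVD-quantization compression through loss-aware remapping, which selects low-rank factor rows for 8-bit quantization based on the predicted loss change from quantizing them. Extensive experiments across diverse LLM and VLM families, together with inference-time analysis, show that IO-SVD compresses LLMs with minimal performance loss while delivering practical inference speedups. Code is available at https://github.com/mint-vu/IO-SVD.git

English abstract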

Large language models deliver strong performance across language and reasoning tasks, but their storage and compute costs remain major barriers to deployment in resource-constrained and latency-sensitive settings. SVD-based post-training compression offers a hardware-agnostic way to reduce model size and improve inference efficiency through low-rank factorization. However, existing methods often rely on input-only whitening spaces, homogeneous rank allocation, or loss-agnostic allocation heuristics, limiting their ability to preserve model quality under aggressive compression. We propose Input-Output Whitened SVD (IO-SVD), a post-training compression method that forms a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, IO-SVD constructs an output-side metric that captures predictive sensitivity, while input whitening captures activation statistics. We further introduce an efficient heterogeneous rank-allocation strategy that scores whitened singular components using first-order calibration loss estimates and prunes the least sensitive components under a global budget. Inspired by prior work that combines SVD truncation with quantization, we improve hybrid SVD-quantization compression through loss-aware remapping, which selects low-rank factor rows for 8-bit quantization based on the predicted loss change incurred by quantizing them. Extensive experiments across diverse LLM and VLM families, together with inference-time analysis, show that IO-SVD compresses LLMs with minimal performance degradation while delivering practical inference speedups. Code is available at https://github.com/mint-vu/IO-SVD.git
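The idea of truncating an SVD in a double-sided whitened space can be sketched generically. The sketch below is an assumption-laden illustration, not the authors' implementation: `C_in` stands for an input activation Gram matrix and `M_out` for a positive-definite output-side metric (in the paper this comes from a KL expansion, which is not reproduced here). It minimizes the weighted error trace((W - W_hat)^T M_out (W - W_hat) C_in) over rank-limited W_hat.

```python
import numpy as np

def whitened_low_rank(W, C_in, M_out, rank):
    """Rank-limited approximation of W that is optimal under the weighted
    error trace(D.T @ M_out @ D @ C_in), where D = W - W_hat."""
    L_in = np.linalg.cholesky(C_in)     # C_in  = L_in  @ L_in.T
    L_out = np.linalg.cholesky(M_out)   # M_out = L_out @ L_out.T
    W_w = L_out.T @ W @ L_in            # weights in the whitened space
    U, s, Vt = np.linalg.svd(W_w, full_matrices=False)
    W_w_r = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # truncated SVD
    # Undo the whitening: solve L_out.T @ W_hat @ L_in = W_w_r for W_hat.
    return np.linalg.solve(L_out.T, W_w_r) @ np.linalg.inv(L_in)

# Hypothetical toy data: a 6x5 weight matrix with random calibration statistics.
rng = np.random.default_rng(0)
W = rng.standard_normal((6, 5))
A = rng.standard_normal((5, 20))
C_in = A @ A.T / 20 + 1e-3 * np.eye(5)
B = rng.standard_normal((6, 20))
M_out = B @ B.T / 20 + 1e-3 * np.eye(6)
W_hat = whitened_low_rank(W, C_in, M_out, rank=2)
```

Because the weighted error equals the Frobenius norm of the whitened difference, the Eckart-Young theorem makes the truncated SVD in that space optimal, so the result is never worse under this metric than truncating the SVD of W directly.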

2605.15625 2026-05-18 cs.AI cond-mat.soft

ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing

Lijie Ding, Changwoo Do

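Affiliations Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

AI summary ColPackAgent uses an MCP tool server and an agent skill to run autonomous workflows for colloidal packing simulations, showing how LLM agents can carry out simulation tasks and comparing the performance of different models.

AI abstract (translated from Chinese)

We introduce ColPackAgent, an agent framework that autonomously runs Monte Carlo simulations of colloidal packing through a Model Context Protocol (MCP) tool server and an agent skill, either as a standalone agent or as part of an existing agent system. Using the MCP server and agent skill, ColPackAgent executes a structured workflow for colloidal packing simulations, which are central to studying phase behavior, self-assembly, and materials design. Without dedicated simulation tools and workflow instructions, general-purpose large language model (LLM) agents tend to describe such workflows rather than execute them reliably. The MCP server exposes a custom-built colpack Python package that wraps HOOMD-blue hard-particle Monte Carlo, and the skill encodes a four-stage workflow contract. ColPackAgent can run the workflow interactively with human feedback, autonomously from an end-to-end prompt, or as autoresearch following a provided program file. We demonstrate the system in different modes, including cube particles in 3D, a binary system of disks and capsules in 2D, and the 2D hard-disk freezing transition using autoresearch. We also compare different LLMs on this workflow using 17 stage-specific prompts. This benchmark provides a stage-level check of how reliably different models follow the setup, planning, and analysis workflow. These results show that pairing a domain Python package with MCP tools and a portable agent skill is a practical route for turning a simulation toolkit into an agent-assisted research workflow.

English abstract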

We introduce ColPackAgent, an agent framework that autonomously runs Monte Carlo simulations of colloidal packing through a Model Context Protocol (MCP) tool server and an agent skill, whether as a standalone agent or inside an existing agent system. By harnessing the MCP server and agent skill, ColPackAgent executes a structured workflow for colloidal packing simulations, which are central to studies of phase behavior, self-assembly, and materials design. Without dedicated simulation tools and workflow instructions, general-purpose Large Language Model (LLM) agents tend to describe such workflows rather than execute them reliably. The MCP server exposes a custom-built colpack Python package that wraps HOOMD-blue hard-particle Monte Carlo, and the skill encodes a four-stage workflow contract. ColPackAgent can carry out the workflow interactively with human feedback, autonomously from an end-to-end prompt, or as autoresearch following a provided program file. We demonstrate the system in different modes with several colloidal packing simulation examples such as cube particles in 3D, a binary system of disks and capsules in 2D, and the 2D hard-disk freezing transition using autoresearch. We also compare model performance on this workflow across a panel of LLMs with 17 stage-specific prompts. This benchmark provides a stage-level check of how reliably different models follow the setup, planning, and analysis workflow. Together, these results show that pairing a domain Python package with MCP tools and a portable agent skill provides a practical route for turning a simulation toolkit into an agent-assisted research workflow.

2605.15621 2026-05-18 cs.CV

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi, Shikai Jiang, Yao Hu, Jiawei Li

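Affiliations Xiaohongshu; Harbin Institute of Technology; Fudan University

AI summary This paper proposes LRCP, which prunes visual tokens guided by low-rank compressibility to cut the inference cost of vision-language models, preserving 94.7% of image-understanding performance with an 88.9% token reduction.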

Comments The paper includes 11 figures, multiple tables, comprehensive experimental results on 11 image understanding benchmarks and 3 video benchmarks, with extensive ablation studies and qualitative visualizations

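AI abstract (translated from Chinese)

Large vision-language models (LVLMs) excel at multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which can introduce positional bias, while representation-based methods reduce visual redundancy through feature relations or reconstruction error and overlook the global structure of the visual token set. This paper revisits visual token compression from the perspective of low-rank compressibility. Across multiple models and datasets, we find that visual token representations have a pronounced low-rank structure, with a dominant subspace that stays stable even after a large fraction of tokens is randomly removed. Motivated by this, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of the visual tokens via PCA and then scores each token by its projection residual, retaining tokens that the low-rank background explains poorly. Extensive experiments show that LRCP preserves 94.7% of the original image-understanding performance with an 88.9% token reduction, and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.

English abstract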

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
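The two steps the abstract describes (estimate a dominant subspace with PCA, then score tokens by projection residual) can be sketched as follows. This is a minimal illustration under stated assumptions, not the LRCP code: the function name, the fixed `rank`, and the top-`keep` selection rule are all hypothetical choices made here.

```python
import numpy as np

def prune_by_residual(tokens, rank, keep):
    """tokens: (n, d) array. Returns sorted indices of the `keep` tokens
    that are worst explained by the top-`rank` PCA subspace."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of Vt are the principal directions of the centered tokens.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:rank]                                  # (rank, d)
    projection = centered @ basis.T @ basis            # part inside the subspace
    residual = np.linalg.norm(centered - projection, axis=1)
    return np.sort(np.argsort(residual)[-keep:])

# Hypothetical toy tokens: 50 tokens in an 8-D space that mostly live in the
# first two coordinates, plus three tokens with an off-subspace component.
rng = np.random.default_rng(0)
tokens = np.zeros((50, 8))
tokens[:, :2] = 4.0 * rng.standard_normal((50, 2))
tokens[3, 5] = 3.0
tokens[17, 6] = 3.0
tokens[42, 7] = 3.0
kept = prune_by_residual(tokens, rank=2, keep=3)
```

In this toy case the three tokens that leave the dominant plane receive the largest residuals and are the ones retained, while tokens well explained by the low-rank background are dropped.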

2605.15619 2026-05-18 cs.RO

Wind-Aware Optimal Trajectory Planning for Efficient Gliding of Fixed-Wing Aerial Systems

Luca Morando, Nishanth Bobbili, Giuseppe Loianno

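Affiliations New York University; University of California, Berkeley

AI summary This paper proposes a nonlinear multi-objective trajectory planner that generates $\mathcal{C}^3$-continuous trajectories from Bernstein polynomials and uses wind estimation to optimize gliding performance. Experiments show stable, reliable behavior under wind disturbances and obstacles.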

Comments Accepted for publication at IEEE International Conference on Robotics and Automation (ICRA 2026) held in Vienna

Journal ref IEEE International Conference on Robotics and Automation (ICRA 2026) held in Vienna

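AI abstract (translated from Chinese)

Gliding gives small fixed-wing UAVs longer endurance and silent operation but demands accurate energy management, especially under wind disturbances and obstacle constraints. Traditional Total Energy Control System controllers usually require fine-tuning and knowledge of trim conditions. This paper moves the regulation to the planning level and proposes a nonlinear, multi-cost trajectory planner that generates $\mathcal{C}^3$-continuous trajectories based on Bernstein polynomials, maps them to control commands through differential flatness, and re-plans online to match experimentally derived sink polar curves. A simulated netto variometer is integrated to estimate air-mass motion, constraining the glide to energy-balanced states. Cruising segments, computed from trajectories initialized on Dubins path-based waypoints, link consecutive gliding trajectories, enabling hybrid missions that combine powered and unpowered flight. The method is validated in CFD simulations and real-world experiments, showing stable sink rate, airspeed, and glide ratio under wind gusts and in the presence of obstacles.

English abstract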

Gliding offers small fixed-wing UAVs extended endurance and silent operation but requires accurate energy management, especially under wind disturbances and obstacle constraints. Traditional controllers based on Total Energy Control Systems regulate the trade between potential and kinetic energy reactively, often requiring fine-tuning and knowledge of trim conditions. In this work, we shift the regulation to the planning level and present a nonlinear, multi-cost trajectory planner for small UAV gliders. The method generates $\mathcal{C}^3$-continuous trajectories based on Bernstein polynomials, mapped into control commands through differential flatness, and re-planned online to match experimentally derived sink polar curves. A simulated netto variometer is integrated into the optimization to estimate air mass motion, constraining the glide to energy-balanced states. Consecutive gliding trajectories are linked by cruising segments computed through trajectories initialized on Dubins path-based waypoints, enabling hybrid missions that combine powered and unpowered flight. The approach is validated in CFD simulations and real-world experiments with a fixed-wing platform, showing reliable stabilization of sink rate, airspeed, and glide ratio under wind gusts and in the presence of obstacles.
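For readers unfamiliar with Bernstein-polynomial trajectories, a minimal sketch of evaluating one such curve is given below. It only shows the basis itself; the planner's cost terms, continuity constraints between segments, and wind model are not reproduced, and the control points are hypothetical.

```python
from math import comb

def bernstein_curve(control_points, t):
    """Evaluate a Bernstein-polynomial (Bezier) curve at t in [0, 1].
    control_points: list of equal-length coordinate tuples."""
    n = len(control_points) - 1
    dim = len(control_points[0])
    point = [0.0] * dim
    for i, cp in enumerate(control_points):
        # i-th Bernstein basis polynomial of degree n
        b = comb(n, i) * (t ** i) * ((1 - t) ** (n - i))
        for k in range(dim):
            point[k] += b * cp[k]
    return point

# Hypothetical (x, y, z) control points for a descending glide segment, in meters.
glide = [(0.0, 0.0, 100.0), (50.0, 10.0, 95.0), (100.0, 10.0, 88.0), (150.0, 0.0, 80.0)]
start, middle, end = (bernstein_curve(glide, t) for t in (0.0, 0.5, 1.0))
```

The curve passes through its first and last control points and stays inside their convex hull, which is one reason this basis is convenient for imposing endpoint and obstacle constraints in a planner.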

2605.15618 2026-05-18 cs.CV cs.AI

Latent Video Prediction Learns Better World Models

Ali J Alrasheed, Aryan Yazdan Parast, Basim Azam, James Bailey, Naveed Akhtar

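Affiliations The University of Melbourne; Monash University

AI summary This paper systematically studies the robustness of latent-prediction models as world models and finds that they outperform other video foundation models in feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction.

English abstract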

Self-supervised video models are increasingly framed as world models, yet their evaluation remains largely confined to a single top-1 accuracy score on clean benchmarks. This leaves a major gap in comprehending their potential as world models. We present the first systematic study addressing this gap, analyzing four matched-capacity frontier video foundation models, V-JEPA 2.1, V-JEPA 2, VideoPrism, and VideoMAEv2, across five robustness axes relevant to their deployment as video world models: feature discriminability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction. Our evaluations establish that across all five axes, latent-prediction models form a distinct and consistent profile. They degrade more gracefully under pixel corruption, preserve usable class structure rather than mere geometric stability under occlusion, capture fine-grained physical contact cues without reconstructing pixels, and uniquely encode the arrow of time. These advantages can even survive task adaptation: a frozen V-JEPA 2 backbone with a lightweight attentive probe outperforms a fully fine-tuned VideoMAE and a supervised TimeSformer on corruption and occlusion robustness. Our extensive results offer concrete new evidence in favor of latent prediction for robust world modeling.

2605.15615 2026-05-18 cs.CV cs.LG

Neutral-Reference Prompting for Vision-Language Models

Senmao Tian, Xiang Wei, Shunli Zhang

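Affiliations Beijing Jiaotong University

AI summary This paper proposes NeRP, a strategy that uses neutral prompts and reference images to improve discrimination on unseen classes while preserving accuracy on known classes.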

Comments Accepted at ICML 2026

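AI abstract (translated from Chinese)

Efficient transfer learning of vision-language models (VLMs) often faces a Base-New Trade-off (BNT): improving recognition of unseen classes tends to reduce accuracy on known classes. Existing work usually attributes this simply to overfitting on known classes. We observe an interesting phenomenon: on some downstream data, VLMs show asymmetric confusion, where samples of class A are systematically mispredicted as class B while the reverse confusion (B to A) rarely occurs. For known classes, this bias can be mitigated by tuning with a cross-entropy loss, but for unseen classes the pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompt-correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP uses neutral text prompts and reference images to measure class-wise prior preferences and combines them with the sample likelihood to obtain the model's surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, a local flip is performed between easily confused class pairs, correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving prediction performance on known classes.

English abstract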

Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often simplistically attributes the BNT to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning using a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompting correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model's surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.

2605.15613 2026-05-18 cs.CL

Toward LLMs Beyond English-Centric Development

Sho Takase, Ukyo Honda

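Affiliations CyberAgent

AI summary The study finds that language models are heavily biased toward English and that continual pre-training is not a cheaper alternative to training from scratch, so more multilingual investment is needed going forward.

AI abstract (translated from Chinese)

By analyzing sequences generated by open-weight large language models (LLMs), we show that LLMs are heavily biased toward English. Although continual pre-training is commonly used to adapt models to a target language, we show that it offers no cost advantage over training from scratch for improving cultural understanding in the target language. These findings suggest that future LLM development may need more language-specific investment rather than relying mainly on expanding English-centric resources.

English abstract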

Through an analysis of sequences generated by open-weight large language models (LLMs), we demonstrate that LLMs are heavily biased toward English. While continual pre-training is commonly used to adapt LLMs to a target language, we show that it does not offer a cost advantage over training from scratch, even for improving cultural understanding in the target language. These findings suggest that dedicated per-language investment may become increasingly important for future LLM development, rather than relying primarily on the expansion of English-centric resources.

2605.15611 2026-05-18 cs.AI

TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices

Junle Wang, Xingchuang Liao, Wenjun Wu

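Affiliations School of Artificial Intelligence, Beihang University, Beijing, China

AI summary To handle heterogeneous observability data, failure propagation, and topology drift in microservices, TopoEvo combines multimodal alignment, topology-constrained reasoning, and a self-evolving mechanism to make root cause analysis more robust and accurate.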

Comments 12 pages

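AI abstract (translated from Chinese)

Root cause analysis (RCA) in microservices is challenged by noisy, heterogeneous multimodal observability data, cascading failure propagation that amplifies downstream symptoms, and non-stationary topology drift caused by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations but often lack topology awareness, which leads to symptom-amplification bias. This paper proposes TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces Metric-orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations. It then applies Vector Quantization (VQ) to discretize topology-enhanced states into auditable symptom tokens, using a symptom lexicon for reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo runs a multi-agent Hypothesis-Evidence-Test (HET) workflow that explicitly verifies propagation-consistent explanations and separates initiating anomalies from amplified downstream symptoms. Finally, a self-evolving mechanism refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to stay robust under drift.

English abstract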

Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non-stationary topology drift induced by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations, yet they often remain topology-agnostic and suffer from symptom-amplification bias, misattributing the root cause to salient downstream victims. We propose TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces Metric-orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies Vector Quantization (VQ) to discretize topology-enhanced states into auditable symptom tokens with a symptom lexicon, enabling reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi-agent Hypothesis-Evidence-Test (HET) workflow to explicitly verify propagation-consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a Self-Evolving Mechanism refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to maintain robustness under drift.
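The vector quantization step, which turns a continuous node state into a discrete symptom token, can be illustrated with a nearest-codebook lookup. This is a generic sketch of VQ, not TopoEvo's implementation: the two-dimensional codebook, the lexicon names, and the helper functions are all hypothetical.

```python
def nearest_code(state, codebook):
    """Index of the codebook vector closest to `state` (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda j: sqdist(state, codebook[j]))

# Hypothetical 2-D codebook and matching symptom lexicon, for illustration only.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
LEXICON = ["healthy", "latency_spike", "error_burst"]

def symptom_token(state):
    """Map a continuous state to a human-readable symptom token."""
    return LEXICON[nearest_code(state, CODEBOOK)]

print(symptom_token((0.9, 0.1)))
```

Because each state maps to one named entry, the resulting tokens can be logged and inspected directly, which is the auditability property the abstract emphasizes.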

2605.15609 2026-05-18 cs.CL

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

Shengyin Sun, Yiming Li, Renxi Liu, Xinqi Li, Hui-Ling Zhen, Weizhe Lin, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma

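Affiliations Huawei Technologies

AI summary This paper proposes PSD, a parallel speculative decoding framework that strikes a favorable balance between inference efficiency and generation quality, reaching up to $5.5\times$ tokens per forward pass.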

Comments 16 pages

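AI abstract (translated from Chinese)

Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask through a configurable, adaptive unmasking policy and builds multi-depth speculative drafts without extra model calls. A final batched verification step applies hierarchical acceptance, keeping the deepest draft that is consistent with the updated predictions. Experiments on three dLLMs show that PSD achieves a favorable trade-off between inference efficiency and generation quality, reaching up to $5.5\times$ tokens per forward pass with accuracy comparable to greedy decoding.

English abstract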

Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that jointly improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls. A final batched verification pass then applies hierarchical acceptance, keeping the deepest draft that remains consistent with the updated predictions. Experiments on three dLLMs across reasoning and code generation tasks show that PSD achieves favorable trade-offs between inference efficiency and generation quality, reaching up to $5.5\times$ tokens per forward pass with accuracy comparable to greedy decoding.

2605.15608 2026-05-18 cs.LG cs.SY eess.SY

Transformer-like Inference from Optimal Control

Aditya Kudre, Heng-Sheng Chang, Prashant G. Mehta

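Affiliations Coordinated Science Laboratory; Electrical and Computer Engineering, University of Illinois Urbana-Champaign; Mechanical Science and Engineering, University of Illinois Urbana-Champaign

AI summary Starting from optimal control theory, this paper derives inference architectures that solve the prediction problem, explaining where transformer layer operations come from, and validates them experimentally with a nonlinear model of discrete-valued processes and a linear Gaussian model.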

Comments Preprint

English abstract

Decoder-only transformers compute the conditional probability of the next token from a sequence of past observations. This paper derives, from first principles, inference architectures that solve the same prediction problem and, in doing so, recovers transformer-like layer operations as a consequence of optimal control theory. The framework is developed for two model classes: a nonlinear model of discrete-valued processes, directly motivated by the transformer, and a linear Gaussian model as a tractable baseline. For both model classes, the prediction objective is reformulated as an optimal control problem whose solution yields an explicit inference algorithm, the dual filter, with a layer structure that mirrors the layer structure of a decoder-only transformer. Numerical experiments provide a comparison of the optimal control to attention weights from a trained transformer. These experiments reveal that when the embedding dimension is insufficient, the transformer implicitly exploits non-Markovian structure.

2605.15607 2026-05-18 cs.CL cs.LG

Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language

Vinayshekhar Bannihatti Kumar, Disha Makhija, Manoj Ghuhan Arivazhagan, Rashmi Gangadharaiah

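Affiliations AWS AI Labs

AI summary The study examines whether large language models can generate code in an unseen language and finds that fine-tuning teaches syntax but does not transfer semantic competence, exposing a gap between reasoning and language-specific implementation.

AI abstract (translated from Chinese)

Large language models (LLMs) achieve high pass rates on code generation benchmarks, but whether they can transfer this ability to languages unseen during training remains unclear. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge finds that frontier models choose the same algorithm as in Python 80% of the time yet cannot translate it into a working PyLang implementation. CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA > 0.97) but diverge at the output stage. We call this the implementation fidelity gap: models have language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.

English abstract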

Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select an identical algorithm to Python 80% of the time, yet cannot translate it into a working PyLang implementation, and CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA > 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.
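The CKA similarity score cited in the abstract can be computed with the standard linear CKA formula. The sketch below assumes the linear variant and uses random matrices as stand-ins for layer activations; which CKA variant and which layers the paper uses are not stated here.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two activation matrices
    with the same number of rows (examples). Returns a value in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

# Hypothetical activations: 200 examples, 10 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))   # random orthogonal matrix
same_up_to_rotation = linear_cka(X, 3.0 * X @ Q)
unrelated = linear_cka(X, rng.standard_normal((200, 10)))
```

Linear CKA is invariant to rotations and uniform scaling of the features, so two representations that differ only in that way score 1, while unrelated representations score close to 0. A reported value above 0.97 therefore indicates nearly the same representational geometry.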

2605.15604 2026-05-18 cs.LG cs.CL

VSPO: Vector-Steered Policy Optimization for Behavioral Control

Xuechen Zhang, Zijian Huang, Kai Yang, Weijia Zhang, Jiasi Chen, Samet Oymak

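Affiliations University of Michigan

AI summary VSPO introduces a steering vector tied to the target behavior to control the behavior intensity of generated rollouts, addressing sparse rewards in multi-objective optimization and making policy optimization more efficient.

AI abstract (translated from Chinese)

Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in a response. In practice, a base model may show the desired behavior rarely or not at all, so endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO), which uses a steering vector associated with the target behavior to control the behavior intensity of generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as on-policy latent self-distillation in which the model internalizes its steering vector. By varying the steering intensity, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward problem and provably accelerates policy optimization. Through comprehensive theory and experiments, we show that VSPO has favorable properties compared with vanilla reward shaping and other alternatives. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO on multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves control over the target behavior while maintaining or improving task accuracy.

English abstract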

Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in their responses. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO), which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves control over the target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.
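Two ingredients named in the abstract, adding a scaled steering vector to a hidden state and scoring a group of rollouts GRPO-style, can be sketched in a few lines. This is an illustrative toy, not the VSPO algorithm: the hidden state is a plain list, the rewards are made up, and standardizing with the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def steer(hidden, vector, alpha):
    """Add a steering vector, scaled by intensity alpha, to a hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of four rollouts sampled at increasing steering intensity,
# with made-up rewards that grow as the target behavior appears more often.
intensities = [0.0, 0.5, 1.0, 2.0]
rewards = [0.0, 1.0, 1.0, 2.0]
advantages = group_relative_advantages(rewards)
steered = steer([1.0, 2.0], [0.5, -1.0], alpha=2.0)
```

If every rollout in a group earns the same reward, the standardized advantages are all zero and carry no learning signal. Sampling at several steering intensities is one way to make rewards within a group differ when the target behavior is otherwise rare.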

2605.15603 2026-05-18 cs.LG cs.AI

Offline Reinforcement Learning with Universal Horizon Models

Hojun Chung, Junseo Lee, Songhwai Oh

Affiliations Interdisciplinary Program in Artificial Intelligence and ASRI, Seoul National University; Department of Electrical and Computer Engineering, Seoul National University

AI summary This paper proposes universal horizon models, which predict future states under arbitrary horizons and so address the difficulty geometric horizon models have with distant future states. The approach is validated on 100 OGBench tasks.

Comments ICML 2026

Abstract

Model-based reinforcement learning (RL) offers a compelling approach to offline RL by enabling value learning on imagined on-policy trajectories. However, it often suffers from compounding errors due to repeated model inference on self-generated states. While geometric horizon models (GHM) alleviate this issue through direct prediction over a discounted infinite-horizon future, they remain challenged in accurately modeling distant future states. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Leveraging this flexibility, we propose a scalable value learning method that employs a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experimental results on 100 challenging OGBench tasks demonstrate that the proposed method outperforms competitive baselines, particularly on tasks with highly suboptimal datasets and those requiring long-horizon reasoning. Project page: https://rllab-snu.github.io/projects/UHM/
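A winsorized horizon distribution can be sketched as a geometric horizon sample that is clipped at a maximum value. The sketch below assumes the horizon follows the discounted geometric law P(h = k) = (1 - gamma) * gamma^(k - 1); the names `sample_geometric_horizon`, `sample_winsorized_horizon`, and `h_max` are hypothetical, and the paper's actual distribution may differ.

```python
import math
import random

def sample_geometric_horizon(gamma, rng):
    """Sample h >= 1 with P(h = k) = (1 - gamma) * gamma**(k - 1), 0 < gamma < 1."""
    u = 1.0 - rng.random()  # uniform in (0, 1], avoids log(0)
    return 1 + int(math.log(u) / math.log(gamma))

def sample_winsorized_horizon(gamma, h_max, rng):
    """Cap excessively large horizons at h_max; mass above h_max moves onto h_max."""
    return min(sample_geometric_horizon(gamma, rng), h_max)

rng = random.Random(0)
samples = [sample_winsorized_horizon(0.99, 50, rng) for _ in range(1000)]
```

Unlike truncation with resampling, winsorizing keeps the probability mass of the tail and places it on the cap, so long horizons remain represented without producing unbounded prediction targets.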

2605.15597 2026-05-18 cs.CV cs.GR cs.LG cs.RO

CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage

Jiale Liu, Jungang Li, Jieming Yu, Xinglin Yu, Zihao Dongfang, Zongjian Ding, Kaifeng Ding, Yi Yang, Lidong Chen, Yang Zou, Shunwen Bai, Jiahuan Zhang, Haoran Huang, Shan Huang, Yudong Gao, Mingjun Cheng

Affiliations Zhejiang University; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Vorynel; Xinjiang University; Wuhan Polytechnic University; Tianjin University

AI summary This paper presents CM-EVS, a sparse panoramic RGB-D-pose dataset curated with the COVER algorithm, which achieves complete scene coverage with low redundancy and auditable provenance for geometry-consistent 3D learning.

Comments 35 pages including appendix. Code and dataset: https://github.com/Strange-animalss/CM-EVS

Abstract

Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB-D-pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage-Oriented Viewpoint curation with ERP Range-depth warping), a training-free ERP viewpoint curator that projects geometry observed from selected views into candidate ERP probes, scores incremental coverage, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage-style approximation behavior up to an additive error term. Using COVER, we build CM-EVS (Coverage-curated Metric ERP View Set), a panoramic RGB-D-pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re-encoded into the same schema. Each frame provides full-sphere RGB, metric range depth, and calibrated pose; COVER-produced indoor frames include per-step provenance logs. With a median of only 25 frames per indoor scene, CM-EVS covers all 13 unified room types while maintaining compact scene-level coverage. Experiments show that COVER improves the coverage-conflict trade-off, making CM-EVS a sparse, compact, and auditable RGB-D-pose resource for geometry-consistent panoramic 3D learning.
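The greedy selection described above (score incremental coverage, penalize depth conflicts) can be sketched as a penalized greedy set cover. This is a simplified illustration, not COVER itself: candidates are reduced to a set of covered cell ids plus a scalar conflict score, and `greedy_select`, `conflict_weight`, and the toy candidates are assumptions.

```python
def greedy_select(candidates, budget, conflict_weight=1.0):
    """Pick viewpoints greedily by incremental coverage minus a conflict penalty.

    candidates maps a viewpoint name to (set of covered cell ids, conflict score).
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        def score(name):
            cells, conflict = remaining[name]
            return len(cells - covered) - conflict_weight * conflict
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=score)
        if score(best) <= 0:
            break  # nothing left adds net value
        chosen.append(best)
        covered |= remaining.pop(best)[0]
    return chosen, covered

views = {
    "a": ({1, 2, 3}, 0.0),
    "b": ({3, 4}, 0.0),
    "c": ({1, 2, 3, 4, 5}, 5.0),  # covers everything but conflicts heavily
    "d": ({5}, 0.5),
}
chosen, covered = greedy_select(views, budget=3)
```

In this toy case the high-coverage but conflicting view `c` is never selected, and three low-conflict views still cover every cell.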

2605.15592 2026-05-18 cs.CV

Efficient Image Synthesis with Sphere Latent Encoder

Tung Do, Thuan Hoang Nguyen, Hao Li

Affiliations Mohamed Bin Zayed University of Artificial Intelligence

AI summary This paper decouples the framework into a fixed pretrained image encoder and a separate denoising model that operates in a spherical latent space, improving efficiency and letting reconstruction and generation be optimized independently. On several datasets the method outperforms Sphere Encoder in generation quality and inference speed.

Comments Technical report

Abstract

Few-step image generation has seen rapid progress, with consistency and meanflow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between the pixel space and latent space during inference while jointly optimizing reconstruction and generation within a single architecture. This design leads to computational inefficiency and objective conflict between reconstruction and generation. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, improving efficiency and allowing reconstruction and generation to specialize independently. On Animal-Faces, Oxford-Flowers and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.

2605.15589 2026-05-18 cs.CL

MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

Weixin Liu, Congning Ni, Shelagh A. Mulvaney, Susannah L. Rose, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin

Affiliations Vanderbilt University; Vanderbilt University Medical Center; Virginia Tech

AI summary This paper presents MHGraphBench, a benchmark that evaluates large language models on mental-health entity recognition, relation judgment, and two-hop reasoning. Models perform well on entity typing and on a small relation-typing subset but still fall short on relation prediction and two-hop reasoning, and output-format reliability substantially affects measured performance.

Comments Accepted to GEM 2026, ACL 2026 Workshop; 9 pages main text plus references and appendices

Abstract

Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.

2605.15585 2026-05-18 cs.AI cs.CV

See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

Yuejia Li, Ke He, Junheng Li, Shutong Chen, Jingkang Xia, Zhiyue Su, Junchi Zhang, Mang Ye

Affiliations Wuhan University; University of Chinese Academy of Sciences

AI summary This paper proposes the OmniManim framework, which uses explicit visual planning and post-render feedback to improve the measured render quality of generated educational animations.

Comments 21 pages, 4 figures

Abstract

Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render-feedback-aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.

2605.15584 2026-05-18 cs.CV

AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models

Zhiwei Li, Jiacheng Xue, Weining Wang, Ajian Liu, Xingyu Gao, Zhenan Sun, Qi Li

Affiliations NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; School of Computer Science and Engineering, Central South University; University of Chinese Academy of Sciences

AI summary This paper proposes AGC, a training-free defense that corrects input features with an adaptive step size to improve the adversarial robustness of vision-language models. Across eight fine-grained datasets it improves average robust accuracy by 44.4% while reducing inference latency by 10×.

Abstract

Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, utilizing an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4% over the state-of-the-art baseline while delivering a 10× reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.
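Correcting a feature toward an anchor on the unit hypersphere can be sketched with spherical linear interpolation (slerp), which moves along the geodesic between two unit vectors. The sketch is only an illustration: the abstract does not give AGC's step-size rule, so `adaptive_step` (a step proportional to the angle between feature and anchor) and `max_step` are hypothetical.

```python
import math

def angle_between(f, a):
    """Angle in radians between two unit vectors."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(f, a))))
    return math.acos(dot)

def slerp(f, a, t):
    """Move unit vector f a fraction t along the geodesic toward unit vector a."""
    theta = angle_between(f, a)
    s = math.sin(theta)
    if s < 1e-8:  # identical or antipodal: geodesic direction is undefined
        return list(f)
    w_f = math.sin((1.0 - t) * theta) / s
    w_a = math.sin(t * theta) / s
    return [w_f * x + w_a * y for x, y in zip(f, a)]

def adaptive_step(f, a, max_step=0.5):
    """Hypothetical rule: take a larger step when the feature is far from the anchor."""
    return max_step * angle_between(f, a) / math.pi

feature = [1.0, 0.0, 0.0]   # toy (possibly perturbed) image feature
anchor = [0.0, 1.0, 0.0]    # toy feature of the reliable augmentation
step = adaptive_step(feature, anchor)
corrected = slerp(feature, anchor, step)
```

Because slerp stays on the sphere, the corrected feature remains unit-norm and can be compared with text embeddings by cosine similarity without renormalizing.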

2605.15583 2026-05-18 cs.CV

Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling

Ryohei Goto, Takuya Fujihashi, Shunsuke Saruwatari, Fumio Okura

Affiliations The University of Osaka

AI summary This paper proposes a single-view 3D human pose estimation method that needs no 3D supervision. It leverages the 2D diffusion prior of a pretrained 2D motion diffusion model and uses conditional multi-view ancestral sampling to optimize the 3D pose so that its multi-view projections follow the manifold in the 2D MDM noise space while matching the given 2D poses and human anatomical constraints.

Comments International Conference on Automatic Face and Gesture Recognition (FG 2026), Oral

Abstract

We propose a method for estimating a 3D human pose from a single view without 3D supervision. The key to our method is to leverage the 2D diffusion priors of motion diffusion models (MDMs) pre-trained on large 2D human pose datasets. Specifically, we extend multi-view ancestral sampling of diffusion models to the task of 2D-3D lifting of human pose. To this end, we propose conditional multi-view ancestral sampling (cMAS), which optimizes the 3D pose such that its multi-view projections follow the manifold in 2D MDM noise space, while conditioning the 3D pose to match the given 2D poses and anatomical constraints of humans. Experiments on the Yoga dataset demonstrate that our method achieves better cross-domain performance compared to state-of-the-art supervised and unsupervised 3D pose estimation methods, including on extreme human poses where 3D supervision is unavailable. Code is available at: https://github.com/asaa0001/c-MAS.

2605.15582 2026-05-18 cs.CV

LDGuid: A Framework for Robust Change Detection via Latent Difference Guidance

Jiaxuan Zhao, Ali Bereyhi

Affiliations University of Toronto

AI summary This paper proposes the LDGuid framework, which learns semantic differences and injects them into change detection models. Experiments show improved segmentation performance on several datasets, especially in challenging settings affected by spectral noise.

Comments Accepted to IGARSS 2026. Code is available at: https://github.com/zjxyoyo/LDGuid

Abstract

Modern deep learning models for change detection (CD) often struggle to explicitly represent task-relevant semantic differences. This paper proposes the Latent Difference Guidance (LDGuid) framework, which explicitly learns semantic differences and injects them into CD models. LDGuid deploys adversarial autoencoding to implement a difference embedding (DE) module. The DE module is pretrained via the information bottleneck method, restricting it to learn only task-relevant differences between pre- and post-event samples. The learned latent difference is then used as an explicit guidance signal in the CD model. We validate LDGuid by integrating it into U-Net, BIT, and AERNet baselines for CD and evaluating it on the LEVIR-CD, WHU-CD, SVCD, and CaBuAr datasets. Experimental results show that LDGuid enhances segmentation performance across all benchmarks, with particularly remarkable gains in challenging settings affected by spectral noise. The results further highlight the ability of LDGuid to incorporate domain knowledge, such as task-specific spectral indices. Our findings suggest that semantic difference learning can drastically enhance the robustness of CD in remote sensing.

2605.15581 2026-05-18 cs.AI

STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

Junle Wang, Xingchuang Liao, Wenjun Wu

Affiliations School of Artificial Intelligence, Beihang University, Beijing, China

AI summary This paper proposes the STAR framework, which decomposes the RCA workflow into four stages to improve the reliability and self-repair capability of RCA agents in microservices.

Comments 11 pages

Abstract

LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.

2605.15575 2026-05-18 cs.LG cs.DB

Gaussian Relational Graph Transformer

Zezhong Ding, Jin Li, Xugang Wang, Xike Xie

Affiliations School of Artificial Intelligence and Data Science, University of Science and Technology of China; School of Biomedical Engineering, USTC; Data Darkness Lab, Suzhou Institute for Advanced Research, USTC; Chinese Academy of Sciences

AI summary This paper proposes GelGT, which uses structure-semantic collaborative sampling and a Gaussian graph attention mechanism to address long-range dependencies and the joint modeling of structural, semantic, and temporal information in relational graph models. Experiments on several real-world datasets show state-of-the-art predictive performance.

Abstract

Relational graph learning models relational databases as graphs and has demonstrated superior performance on a wide range of relational predictive tasks. However, existing methods struggle to capture long-range dependencies due to information decay in their message-passing mechanisms, and recent relational graph transformers remain limited in jointly modeling structural, semantic, and temporal information. In this paper, we propose GelGT, a Gaussian relational graph transformer that explicitly addresses these challenges. GelGT introduces a structure-semantic collaborative sampling strategy to preserve structural connectivity while filtering irrelevant semantic information, and incorporates a Gaussian graph attention mechanism with a learnable Gaussian bias on the sampled subgraphs to dynamically encode temporal dependencies. Extensive experiments on various real-world datasets demonstrate that GelGT achieves state-of-the-art downstream task performance, with up to a 13.8% improvement in predictive performance.
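One common way to add a Gaussian bias to attention is to subtract a squared time gap, scaled by a width parameter, from each attention logit before the softmax. The sketch below assumes that form; the abstract does not specify GelGT's exact bias, and `gaussian_biased_attention`, `time_gaps`, and `sigma` (which would be learnable in a real model) are hypothetical names.

```python
import math

def gaussian_biased_attention(scores, time_gaps, sigma):
    """Softmax over attention scores plus a Gaussian log-bias -(dt^2) / (2 sigma^2)."""
    logits = [s - (dt * dt) / (2.0 * sigma * sigma)
              for s, dt in zip(scores, time_gaps)]
    m = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Three neighbors with equal content scores but increasing temporal distance.
narrow = gaussian_biased_attention([0.0, 0.0, 0.0], [0.0, 1.0, 3.0], sigma=1.0)
wide = gaussian_biased_attention([0.0, 0.0, 0.0], [0.0, 1.0, 3.0], sigma=1e6)
```

A small `sigma` concentrates attention on temporally close neighbors, while a very large `sigma` recovers plain softmax attention, so a learned width lets the model choose how strongly time matters.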

2605.15574 2026-05-18 cs.CV

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

Sunghwan Steve Cho, Yunseok Han, Jaeyoung Do

Affiliations AIDAS Laboratory; Seoul National University

AI summary The MI-CXR benchmark evaluates longitudinal reasoning over multi-visit chest X-rays. Using five-way multiple-choice questions and three complementary task families, it exposes the limitations of current vision-language models along the temporal dimension.

Comments 33 pages

Abstract

Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at https://github.com/AIDASLab/MI-CXR

2605.15573 2026-05-18 cs.CL cs.LG cs.MA

Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

Nurbek Tastan, Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, Nicholas D. Lane, Samuel Horvath, Karthik Nandakumar

Affiliations MBZUAI; University of Cambridge; Flower Labs; Michigan State University

AI summary This paper proposes the Nexa framework, whose response-conditioned policy combines parallel and sequential execution to reduce communication and latency while improving final-response accuracy, and shows that the learned policy generalizes across settings.

Abstract

Multi-agent systems can solve complex tasks through collaboration between multiple Large Language Model agents. Existing collaboration frameworks typically operate in either a parallel or a sequential mode. In the parallel mode, agents respond independently to queries followed by aggregation of responses. In contrast, sequential systems allow agents to communicate via a directed topology and refine one another step by step. However, both modes are inadequate for achieving the desired objectives of minimizing communication and latency while simultaneously maximizing the accuracy of the final response. In this work, we introduce a hybrid paradigm called Nexa, a trainable response-conditioned policy that bridges the gap between the two modes. Nexa begins with a parallel execution stage, embeds the resulting responses into a shared semantic space, and then predicts a sparse directed acyclic communication graph. If the graph is empty, the system remains purely parallel; if it is non-empty, the system performs one sequential message propagation. The policy is a lightweight transformer model, and the method avoids the need for external LLM judges or reward models, as well as hand-crafted test-time topology search. We formalize this hybrid execution problem, show that the resulting graph is acyclic by construction, and that the framework strictly subsumes pure parallel execution, and present a training procedure based on policy-gradient optimization. Results demonstrate that the response-conditioned policy learned by Nexa under one setting can be reused when the number of agents, the task, or the underlying agent changes, thus emphasizing the generalizability of the learned communication policy.
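A graph can be made acyclic by construction by fixing an order over agents and keeping only edges that point forward in that order. The sketch below uses this masking idea together with a threshold on edge scores; it is one possible construction, not necessarily Nexa's, and `build_dag`, `execution_mode`, the scores, and the threshold are assumptions. An empty edge list corresponds to staying purely parallel.

```python
from collections import deque

def build_dag(edge_scores, order, threshold=0.5):
    """Keep edges u -> v whose score passes the threshold and where u precedes v."""
    rank = {agent: i for i, agent in enumerate(order)}
    return [(u, v) for (u, v), s in sorted(edge_scores.items())
            if s > threshold and rank[u] < rank[v]]

def is_acyclic(nodes, edges):
    """Kahn's algorithm: a graph is acyclic iff every node can be topologically popped."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for c in children[n]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return visited == len(nodes)

def execution_mode(edges):
    """No predicted edges means pure parallel; otherwise one sequential pass follows."""
    return "parallel" if not edges else "parallel+sequential"

agents = ["a", "b", "c"]
scores = {("a", "b"): 0.9, ("b", "a"): 0.8, ("b", "c"): 0.7,
          ("c", "a"): 0.95, ("a", "c"): 0.1}   # hypothetical policy outputs
edges = build_dag(scores, agents)
```

The high-scoring backward edges `b -> a` and `c -> a` are dropped by the ordering mask, which is what guarantees acyclicity regardless of the scores the policy emits.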

2605.15567 2026-05-18 cs.AI

Position: Artificial Intelligence Needs Meta Intelligence -- the Case for Metacognitive AI

Sergei Chuprov, Richard D. Lange, Leon Reznik, Paulo Shakarian, Raman Zatsarenko, Dmitrii Korobeinikov

Affiliations University of Texas Rio Grande Valley, Edinburg, TX, USA; Rochester Institute of Technology, Rochester, NY, USA; Syracuse University, Syracuse, NY, USA

AI summary This paper argues for metacognition as a general principle for designing more accurate, secure, and efficient AI. It uses a federated learning case study to show how metacognition can improve learning efficiency and security, and presents a new software framework for building metacognitive AI.

Comments This is a preliminary version accepted for presentation and publication at the 43rd International Conference on Machine Learning (ICML26). The modified final version will be available in the conference proceedings

Abstract

This position paper argues for metacognition as a general design principle for creating more accurate, secure, and efficient AI. The metacognitive solution involves systems monitoring their own states and judiciously allocating resources depending on each problem instance's difficulty or cost of mistakes. Drawing inspiration both from past work on resource-rational AI and from well-documented metacognitive strategies in psychology and cognitive science, we identify specific challenges in embedding these strategies into AI design and highlight open theoretical and implementation problems. We showcase these principles through a tangible example of improved learning efficiency, effectiveness, and security in a Federated Learning (FL) case study. We show how these principles can be translated into practice with a novel software framework developed specifically to allow the community to design, deploy, and experiment with metacognition-enabled AI applications.

2605.15565 2026-05-18 cs.LG cs.AI

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen

Affiliations Carnegie Mellon University; University of Michigan; UC Berkeley; Meta

AI summary AstraFlow is a dataflow-oriented reinforcement learning system that supports complex multi-policy collaborative training and makes efficient use of heterogeneous compute resources when training agentic LLMs for reasoning and tool use.

Abstract

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training by 2.7x.

2605.15564 2026-05-18 cs.LG cs.CE eess.IV

CrystalBoltz: End-to-End Protein Structure Determination via Experiment-Guided Diffusion for X-Ray Crystallography

Minseo Kim, Huanghao Mai, Jay Shenoy, Alec Follmer, Gordon Wetzstein, Frederic Poitevin

Affiliations Stanford University; SLAC National Accelerator Laboratory; UC Davis

AI summary CrystalBoltz performs end-to-end protein structure determination with experiment-guided diffusion, casting crystallographic refinement as Bayesian inference over atomic structures. It lowers coordinate RMSD and R-factors and makes X-ray crystallographic structure determination more efficient.

Comments Project page: https://soniaminseokim.github.io/crystalboltz-website/

Abstract

Generative models trained on public databases of protein structures, most of which have been determined by X-ray crystallography, now provide powerful priors for structure prediction. However, they are not readily conditioned on the measurements from a new crystallographic experiment, limiting their use for X-ray structure determination. In crystallography, the measured structure-factor amplitudes do not by themselves determine an electron density map or atomic structure because the associated phases are unobserved and must be inferred. Structure determination therefore remains an inverse problem in which candidate models must be both structurally plausible and consistent with measured diffraction data, often requiring substantial manual refinement by human experts. Emerging methods aim to incorporate experimental information more directly into predictive and refinement workflows. We present CrystalBoltz, a generative framework that casts crystallographic refinement as Bayesian inference over atomic structures and operates directly on structure-factor amplitudes. CrystalBoltz moves from unguided generation with a pre-trained prior over protein structures to experiment-guided posterior sampling, followed by atomic coordinate and B-factor refinement. Across multiple protein crystallography datasets, CrystalBoltz attains lower coordinate RMSD and lower R-factors than the strongest baselines considered, while reducing runtime by a factor of 33 relative to existing experimentally guided refinement.
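The R-factor mentioned above measures agreement between observed and model-calculated structure-factor amplitudes; the conventional definition is the sum of absolute amplitude differences divided by the sum of observed amplitudes. The sketch below implements that textbook formula only. The function name `r_factor`, the simple ratio-of-sums scale, and the toy amplitudes are assumptions and are not taken from CrystalBoltz.

```python
def r_factor(f_obs, f_calc, scale=None):
    """R = sum(| |F_obs| - k * |F_calc| |) / sum(|F_obs|) over all reflections.

    If no scale k is given, a simple ratio-of-sums scale is used (an assumption;
    refinement programs typically fit the scale more carefully).
    """
    if scale is None:
        scale = sum(f_obs) / sum(f_calc)
    numerator = sum(abs(o - scale * c) for o, c in zip(f_obs, f_calc))
    return numerator / sum(f_obs)

observed = [10.0, 20.0, 30.0]            # toy observed amplitudes
perfect = r_factor(observed, [20.0, 40.0, 60.0])            # same up to scale
imperfect = r_factor(observed, [12.0, 18.0, 30.0], scale=1.0)
```

Only amplitudes enter this score, which reflects the phase problem described in the abstract: a model can be compared against the measured amplitudes, but the phases must be inferred from the candidate structure.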