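arXivDaily daily academic digest: synced with the full arXiv data, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.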


2605.15647 2026-05-18 cs.LG cs.NE

Perforated Neural Networks for Keyword Spotting

Vishy Gopal, Aris Ilias Goutis, Ralph Crewe, Erin Yanacek, Rorry Brenner

Affiliations: Purdue University; Renesas Electronics; Perforated AI

AI summary: This paper applies Perforated Backpropagation to keyword spotting on the Edge Impulse platform. By adding artificial Dendrite Nodes to a standard convolutional neural network, it shows that dendritic models beat traditional architectures in both parameter count and accuracy, improving model quality and deployment efficiency together.

Comments: 9 pages, 1 figure, 800-trial hyperparameter sweep; Best Model award, Edge Impulse 2025 Hackathon

Abstract

Edge machine learning presents a unique set of constraints not encountered in cloud-scale model deployment: strict memory budgets, limited compute, and non-negotiable accuracy thresholds must all be satisfied simultaneously. Existing compression and optimization techniques can trade one resource for another, but rarely improve both accuracy and model size at the same time. This paper presents the application of Perforated Backpropagation to keyword spotting on the Edge Impulse platform, an experiment that won the Best Model award at the Edge Impulse 2025 Hackathon in December 2025. By adding artificial Dendrite Nodes to a standard convolutional neural network trained on the Edge Impulse keyword spotting tutorial pipeline, we demonstrate that dendritic models outperform traditional architectures at every level of parameter count and at every accuracy threshold tested across 800 hyperparameter trials. The best dendritic model achieved a test accuracy of 0.933 with only 1,500 parameters, versus the baseline accuracy of 0.921 requiring approximately 4,000 parameters. These results suggest that Perforated Backpropagation is a powerful addition to the edge AI engineer's toolkit, offering simultaneous gains in both model quality and deployment efficiency.

2605.15640 2026-05-18 cs.CV

Learning Disentangled Representations for Generalized Multi-view Clustering

Xin Zou, Ruimeng Liu, Chang Tang, Zhenglai Li, Xinwang Liu, Kunlun He, Wanqing Li

Affiliations: AI Thrust, The Hong Kong University of Science and Technology (Guangzhou); School of Computer Science and Technology, Huazhong University of Science and Technology; School of Software Engineering, Huazhong University of Science and Technology; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; School of Computer, National University of Defense Technology; Medical Big Data Research Center, Medical Engineering Laboratory of Chinese PLA General Hospital; School of Computing and Information Technology, University of Wollongong

AI summary: This paper proposes the GMAE framework, which preserves multi-view complementarity through disentangled representation learning and improves clustering. Experiments show it outperforms existing methods on both complete and incomplete multi-view clustering tasks.

Comments: accepted by IEEE TPAMI 2026 (IEEE Transactions on Pattern Analysis and Machine Intelligence)

DOI: 10.1109/TPAMI.2026.3687339

Abstract

Multi-View Clustering (MVC) has gained significant attention for its ability to leverage complementary information across diverse views. However, existing deep MVC methods often struggle with view-distribution entanglement during cross-view fusion, which hampers the quality of the shared latent space and leads to suboptimal clustering results. To address this issue, we propose the Generalized Multi-view Auto-Encoder (GMAE), a framework designed to preserve cross-view complementarity through disentangled representation learning. Specifically, GMAE employs dual-path autoencoders to decouple source features into view-specific and view-common embeddings, facilitating the discovery of clearer clustering structures. We further construct cross-view adversarial discriminators to guide view-specific encoders in capturing more discriminative features. By strategically modulating mutual information, GMAE effectively aligns distributions and prevents representation collapse, ensuring the generation of robust, non-trivial embeddings. Comprehensive experiments on 13 benchmark datasets demonstrate that GMAE consistently outperforms state-of-the-art methods in both complete and incomplete MVC tasks. Our code implementation is available at the repository: https://github.com/obananas/GMAE.

2605.15635 2026-05-18 cs.CL

Evaluating Chinese Ambiguity Understanding in Large Language Models

Junwen Mo, Yuanzhi Lu, Yifang Xue, Ke Xu, Hideki Nakayama

Affiliations: Graduate School of Information Science and Technology, The University of Tokyo; School of Software Engineering, South China University of Technology

AI summary: This paper builds CHA-Gen, the first Chinese ambiguity dataset grounded in Potential Ambiguity theory, evaluates LLMs on ambiguity detection, and reports common failure modes in ambiguity recognition along with semantic uncertainty quantification results.

Abstract

Linguistic ambiguity is critical to the robustness of Large Language Models (LLMs), yet existing research focuses mostly on English, with limited attention devoted to Chinese. Existing Chinese ambiguity datasets (e.g., CHAmbi) suffer from poor scalability. Guided by Potential Ambiguity (PA) Theory, we design a semi-automatic pipeline to construct CHA-Gen. It is the first PA Theory-grounded Chinese ambiguity dataset, which comprises 5,712 sentences (2,414 ambiguous, 3,298 unambiguous) across 18 potential ambiguous structures. Evaluating LLMs (e.g. Gemma 3, Qwen 2.5/3 series) via direct querying and machine translation, we find that LLMs struggle with ambiguity detection (improved by CoT prompting). Analysis of Qwen3-32B's CoT rationales reveals three common failure modes: ambiguity blindness, misattribution, and premature resolution. Uncertainty quantification with semantic entropy metric shows higher uncertainty for ambiguous sentences. Moreover, instruction tuning induces overconfidence, whereas Base models better capture semantic diversity. We further observe that models exhibit a bias toward dominant interpretations. Our work provides a scalable approach for Chinese ambiguity corpus and insights into LLMs' ambiguity handling, laying a foundation for enhancing Chinese ambiguity research in LLMs.
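
The uncertainty measurement mentioned in the abstract can be illustrated with a minimal sketch of a discrete semantic entropy. It assumes that sampled model outputs have already been grouped into meaning clusters (the clustering step and the labels below are hypothetical), and it is the textbook entropy over cluster frequencies, not necessarily the estimator used in the paper.

```python
import math
from collections import Counter

def semantic_entropy(meaning_labels):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(meaning_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical cluster labels for six sampled interpretations of each sentence.
unambiguous = ["m1"] * 6                             # every sample means the same thing
ambiguous = ["m1", "m2", "m1", "m2", "m2", "m1"]     # samples split across two readings

print(semantic_entropy(unambiguous))
print(semantic_entropy(ambiguous))
```

Under this sketch, a sentence whose samples split evenly across two readings scores higher than one whose samples all agree, which matches the direction of the result reported in the abstract.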

2605.15626 2026-05-18 cs.LG

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

Ali Abbasi, Chayne Thrash, Haoran Qin, Hamed Pirsiavash, Soheil Kolouri

Affiliations: Vanderbilt University; University of California, Davis

AI summary: IO-SVD builds a KL-aware double-sided whitening space and pairs it with an efficient heterogeneous rank-allocation strategy to balance quality and efficiency in LLM compression. Experiments show small performance loss under compression and a significant gain in inference speed.

Abstract

Large language models deliver strong performance across language and reasoning tasks, but their storage and compute costs remain major barriers to deployment in resource-constrained and latency-sensitive settings. SVD-based post-training compression offers a hardware-agnostic way to reduce model size and improve inference efficiency through low-rank factorization. However, existing methods often rely on input-only whitening spaces, homogeneous rank allocation, or loss-agnostic allocation heuristics, limiting their ability to preserve model quality under aggressive compression. We propose Input-Output Whitened SVD (IO-SVD), a post-training compression method that forms a KL-aware double-sided whitening space for model weights. Using a second-order expansion of the KL loss over the top-K token probabilities, IO-SVD constructs an output-side metric that captures predictive sensitivity, while input whitening captures activation statistics. We further introduce an efficient heterogeneous rank-allocation strategy that scores whitened singular components using first-order calibration loss estimates and prunes the least sensitive components under a global budget. Inspired by prior work that combines SVD truncation with quantization, we improve hybrid SVD-quantization compression through loss-aware remapping, which selects low-rank factor rows for 8-bit quantization based on the predicted loss change incurred by quantizing them. Extensive experiments across diverse LLM and VLM families and an inference-time analysis show that IO-SVD compresses LLMs with minimal performance degradation while delivering practical inference speedups. Code is available at https://github.com/mint-vu/IO-SVD.git
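
As a rough illustration of double-sided whitening, the sketch below truncates a weight matrix in a space whitened on both the input and output sides, then maps the result back. It is a generic weighted low-rank approximation under stated assumptions, not the authors' implementation: the output-side metric `M_out` is a random positive-definite stand-in (the paper derives its metric from a KL expansion), and all function and variable names are hypothetical.

```python
import numpy as np

def whitened_truncated_svd(W, C_in, M_out, rank):
    """Rank-`rank` approximation of W minimizing ||L_out^T (W - W_r) L_in||_F,
    where C_in = L_in L_in^T and M_out = L_out L_out^T are Cholesky factorizations."""
    L_in = np.linalg.cholesky(C_in)
    L_out = np.linalg.cholesky(M_out)
    A = L_out.T @ W @ L_in                        # weights expressed in the whitened space
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_r = (U[:, :rank] * s[:rank]) @ Vt[:rank]    # best rank-r approximation of A
    # Map back to the original space: W_r = L_out^{-T} A_r L_in^{-1}
    return np.linalg.solve(L_out.T, A_r) @ np.linalg.inv(L_in)

def weighted_error(W, W_hat, C_in, M_out):
    """Approximation error measured in the double-sided whitened norm."""
    L_in = np.linalg.cholesky(C_in)
    L_out = np.linalg.cholesky(M_out)
    return np.linalg.norm(L_out.T @ (W - W_hat) @ L_in)

rng = np.random.default_rng(0)
d_out, d_in, n = 6, 8, 64
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, n)) * rng.uniform(0.1, 3.0, size=(d_in, 1))
C_in = X @ X.T / n + 1e-6 * np.eye(d_in)          # input activation second moment
G = rng.normal(size=(d_out, d_out))
M_out = G @ G.T + 1e-3 * np.eye(d_out)            # stand-in output metric (assumption)

W_io = whitened_truncated_svd(W, C_in, M_out, rank=3)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_plain = (U[:, :3] * s[:3]) @ Vt[:3]             # plain truncated SVD for comparison

print(weighted_error(W, W_io, C_in, M_out), weighted_error(W, W_plain, C_in, M_out))
```

By the Eckart-Young theorem applied in the whitened space, the whitened truncation cannot do worse than plain truncation at the same rank when the error is measured in the whitened norm.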

2605.15625 2026-05-18 cs.AI cond-mat.soft

ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing

Lijie Ding, Changwoo Do

Affiliations: Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

AI summary: ColPackAgent uses an MCP tool server and an agent skill to run colloidal packing simulations as an autonomous workflow, showing how LLM agents can execute simulation tasks and comparing the performance of different models.

Abstract

We introduce ColPackAgent, an agent framework that autonomously runs Monte Carlo simulations of colloidal packing through a Model Context Protocol (MCP) tool server and an agent skill, whether as a standalone agent or inside an existing agent system. By harnessing the MCP server and agent skill, ColPackAgent executes a structured workflow for colloidal packing simulations, which are central to studies of phase behavior, self-assembly, and materials design. Without dedicated simulation tools and workflow instructions, general-purpose Large Language Model (LLM) agents tend to describe such workflows rather than execute them reliably. The MCP server exposes a custom-built colpack Python package that wraps HOOMD-blue hard-particle Monte Carlo, and the skill encodes a four-stage workflow contract. ColPackAgent can carry out the workflow interactively with human feedback, autonomously from an end-to-end prompt, or as autoresearch following a provided program file. We demonstrate the system in different modes with several colloidal packing simulation examples such as cube particles in 3D, a binary system of disks and capsules in 2D, and the 2D hard-disk freezing transition using autoresearch. We also compare model performance on this workflow across a panel of LLMs with 17 stage-specific prompts. This benchmark provides a stage-level check of how reliably different models follow the setup, planning, and analysis workflow. Together, these results show that pairing a domain Python package with MCP tools and a portable agent skill provides a practical route for turning a simulation toolkit into an agent-assisted research workflow.

2605.15621 2026-05-18 cs.CV

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi, Shikai Jiang, Yao Hu, Jiawei Li

Affiliations: Xiaohongshu; Harbin Institute of Technology; Fudan University

AI summary: This paper proposes LRCP, which uses low-rank compressibility to guide visual token pruning and cut the inference cost of vision-language models, retaining 94.7% of image-understanding performance with an 88.9% token reduction.

Comments: The paper includes 11 figures, multiple tables, comprehensive experimental results on 11 image understanding benchmarks and 3 video benchmarks, with extensive ablation studies and qualitative visualizations

Abstract

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods estimate token importance from attention scores, which may introduce positional bias, while representation-based methods reduce visual redundancy based on feature relations or reconstruction errors, overlooking the global structure of the visual token set. In this paper, we revisit visual token compression from the perspective of low-rank compressibility. Across models and datasets, we observe that visual token representations exhibit a pronounced low-rank structure, with a dominant subspace that remains stable even after a large fraction of tokens is randomly removed. Motivated by this finding, we propose LRCP, a training-free compression framework that first estimates the dominant low-rank subspace of visual tokens via PCA, and then scores each token by its projection residual onto this subspace, retaining tokens that are poorly explained by the low-rank background. Extensive experiments show that LRCP achieves superior results, preserving 94.7% of the original image-understanding performance with an 88.9% token reduction and 97.8% of the average video-understanding accuracy with an 87.5% token reduction.
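
The two steps described in the abstract (estimate a dominant subspace with PCA, then score tokens by their projection residual) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the function name, the synthetic token matrix, and the choices of `rank` and `keep` are all hypothetical.

```python
import numpy as np

def prune_tokens_by_residual(tokens, rank, keep):
    """Keep the `keep` tokens with the largest residual off the top-`rank` PCA subspace."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of Vt are the principal directions of the token set.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    basis = Vt[:rank]                                   # shape (rank, dim)
    projected = centered @ basis.T @ basis              # part explained by the subspace
    residual = np.linalg.norm(centered - projected, axis=1)
    kept = np.sort(np.argsort(residual)[-keep:])        # keep original token order
    return kept, residual

# Synthetic example: a rank-2 "background" plus three tokens that leave the subspace.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 16))
outliers = [3, 17, 42]
tokens[outliers] += 2.0 * rng.normal(size=(3, 16))

kept, residual = prune_tokens_by_residual(tokens, rank=2, keep=3)
print(kept)
```

In this toy setting the retained tokens are the ones the low-rank background explains poorly, which is the selection rule the abstract describes.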

2605.15619 2026-05-18 cs.RO

Wind-Aware Optimal Trajectory Planning for Efficient Gliding of Fixed-Wing Aerial Systems

Luca Morando, Nishanth Bobbili, Giuseppe Loianno

Affiliations: New York University; University of California Berkeley

AI summary: This paper proposes a nonlinear multi-objective trajectory planner that generates C^3-continuous trajectories from Bernstein polynomials and uses wind estimation to optimize gliding performance; experiments show stable, reliable behavior under wind disturbances and obstacles.

Comments: Accepted for publication at IEEE International Conference on Robotics and Automation (ICRA 2026) held in Vienna

Journal ref: IEEE International Conference on Robotics and Automation (ICRA 2026) held in Vienna

Abstract

Gliding offers small fixed-wing UAVs extended endurance and silent operation but requires accurate energy management, especially under wind disturbances and obstacle constraints. Traditional Total Energy Control Systems based controllers regulate the trade between potential and kinetic energy reactively, often requiring fine-tuning and trim-conditions knowledge. In this work, we shift the regulation to the planning level and present a nonlinear, multi-cost trajectory planner for small UAV gliders. The method generates $\mathcal{C}^3$ continuous trajectories based on Bernstein polynomials, mapped into control commands through differential flatness, and re-planned online to match experimentally derived sink polar curves. A simulated netto variometer is integrated into the optimization to estimate air mass motion, constraining the glide to energy-balanced states. Consecutive gliding trajectories are linked by cruising segments computed through trajectories initialized on Dubins path-based waypoints, enabling hybrid missions that combine powered and unpowered flight. The approach is validated in CFD simulations and real-world experiments with a fixed-wing platform, showing reliable stabilization of sink rate, airspeed, and glide ratio under wind gusts and in presence of obstacles.
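
For readers unfamiliar with Bernstein-polynomial trajectories, the sketch below evaluates a single curve segment from its control points. It only illustrates the parameterization: the control points are hypothetical (x, y, altitude) values, and the planner's cost terms, wind model, and the derivative matching needed for continuity across segments are not reproduced here.

```python
from math import comb

def bernstein_curve(control_points, t):
    """Evaluate a Bernstein-polynomial (Bezier) curve at t in [0, 1]."""
    n = len(control_points) - 1
    dim = len(control_points[0])
    point = [0.0] * dim
    for i, p in enumerate(control_points):
        # i-th Bernstein basis polynomial of degree n
        basis = comb(n, i) * (t ** i) * ((1 - t) ** (n - i))
        for k in range(dim):
            point[k] += basis * p[k]
    return point

# Hypothetical glide segment: (x, y, altitude) control points in metres.
control_points = [
    (0.0, 0.0, 100.0),
    (50.0, 0.0, 95.0),
    (100.0, 20.0, 88.0),
    (150.0, 40.0, 80.0),
    (200.0, 40.0, 75.0),
]

for t in (0.0, 0.5, 1.0):
    print(t, bernstein_curve(control_points, t))
```

The curve starts at the first control point and ends at the last, and because the basis functions are non-negative and sum to one, every point on it lies inside the convex hull of the control points. That property is one common reason this basis is used when trajectories must respect obstacle constraints.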

2605.15618 2026-05-18 cs.CV cs.AI

Latent Video Prediction Learns Better World Models

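Ali J Alrasheed, Aryan Yazdan Parast, Basim Azam, James Bailey, Naveed Akhtar

Affiliations: The University of Melbourne; Monash University

AI summary: This paper systematically studies the robustness of latent prediction models as world models and finds they perform strongly on feature separability, corruption robustness, fine-grained discrimination, occlusion robustness, and sensitivity to temporal direction, outperforming other video foundation models.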

2605.15615 2026-05-18 cs.CV cs.LG

Neutral-Reference Prompting for Vision-Language Models

Senmao Tian, Xiang Wei, Shunli Zhang

Affiliations: Beijing Jiaotong University

AI summary: This paper proposes NeRP, a strategy that uses neutral prompts and reference images to improve discrimination on unseen classes while preserving accuracy on known classes.

Comments: Accepted at ICML 2026

Abstract

Efficient transfer learning of vision-language models (VLMs) commonly suffers from a Base-New Trade-off (BNT): improving performance on unseen (new) classes often degrades accuracy on known (base) classes. Addressing how to boost recognition of unseen classes without sacrificing known-class performance remains a central challenge. Existing work often simplistically attributes the BNT to overfitting on known classes. We observe an interesting phenomenon: VLMs frequently exhibit asymmetric confusion on certain downstream data, i.e., samples of class A are systematically mispredicted as class B, while the reverse confusion (B to A) rarely occurs. For known classes, this kind of bias can be mitigated by tuning using a cross-entropy loss, but for unseen classes, such pretraining-induced bias persists and harms generalization. Motivated by this, we propose NeRP, a plug-and-play prompting correction strategy that improves discrimination on unseen classes without modifying model parameters. NeRP leverages neutral text prompts and reference images to measure class-wise prior preferences along the pre-trained inter-class geometry, and combines them with the sample likelihood to obtain the model's surrogate score. If, for a given sample, the prior strongly favors the current prediction while the observed evidence is clearly insufficient, we perform a local flip between easily confusable class pairs, thereby correcting prior-dominated mispredictions. Extensive experiments across multiple backbones and 15 few-shot and cross-domain benchmarks show that NeRP substantially improves accuracy on unseen classes while preserving known-class prediction performance.

2605.15613 2026-05-18 cs.CL

Toward LLMs Beyond English-Centric Development

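Sho Takase, Ukyo Honda

Affiliations: CyberAgent

AI summary: The study finds that language models are markedly biased toward English and that continual pretraining is not a low-cost option that beats training from scratch; more investment in multilingual development is needed.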

2605.15611 2026-05-18 cs.AI

TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices

Junle Wang, Xingchuang Liao, Wenjun Wu

Affiliations: School of Artificial Intelligence, Beihang University, Beijing, China

AI summary: To address heterogeneous observability data, failure propagation, and topology drift in microservices, TopoEvo combines multimodal alignment, topology-constrained reasoning, and a self-evolving mechanism to improve the robustness and accuracy of root cause analysis.

Comments: 12 pages

Abstract

Root cause analysis (RCA) in microservices is challenging due to (i) noisy and heterogeneous multimodal observability (metrics, logs, traces), (ii) cascading failure propagation that amplifies downstream symptoms, and (iii) non-stationary topology drift induced by autoscaling and rolling updates. Recent LLM-based RCA agents can generate tool-grounded explanations, yet they often remain topology-agnostic and suffer from symptom-amplification bias, misattributing the root cause to salient downstream victims. We propose TopoEvo, a topology-aware self-evolving multi-agent framework that couples graph representation learning with structured, topology-constrained reasoning. TopoEvo first introduces Metric-orthogonal Multimodal Alignment (MOMA), which decomposes metric embeddings into complementary subspaces and contrastively aligns logs and traces to reduce modality redundancy and sparsity, yielding stable node representations for graph encoding. It then applies Vector Quantization (VQ) to discretize topology-enhanced states into auditable symptom tokens with a symptom lexicon, enabling reliable retrieval and token-level evidence grounding. On top of these discrete topology cues, TopoEvo performs a multi-agent Hypothesis-Evidence-Test (HET) workflow to explicitly verify propagation-consistent explanations and separate initiating anomalies from amplified downstream symptoms. Finally, a Self-Evolving Mechanism refreshes hierarchical incident memory and performs conservative test-time adaptation with high-confidence pseudo-labels to maintain robustness under drift.

2605.15609 2026-05-18 cs.CL

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

Shengyin Sun, Yiming Li, Renxi Liu, Xinqi Li, Hui-Ling Zhen, Weizhe Lin, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma

Affiliations: Huawei Technologies

AI summary: This paper proposes PSD, a parallel speculative decoding framework that improves both inference efficiency and generation quality, striking a favorable balance between the two and reaching up to 5.5× tokens per forward pass.

Comments: 16 pages

Abstract

Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that jointly improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls. A final batched verification pass then applies hierarchical acceptance, keeping the deepest draft that remains consistent with the updated predictions. Experiments on three dLLMs across reasoning and code generation tasks show that PSD achieves favorable trade-offs between inference efficiency and generation quality, reaching up to $5.5\times$ tokens per forward pass with accuracy comparable to greedy decoding.
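
The two mechanisms named in the abstract, confidence-based unmasking and hierarchical acceptance of multi-depth drafts, can be pictured with a small sketch. Everything here is an assumption made for illustration: the threshold rule, the fallback to the most confident position, and the representation of drafts as position-to-token dictionaries are hypothetical and are not taken from the paper.

```python
def select_unmask_positions(confidences, masked, threshold, min_tokens=1):
    """Pick masked positions whose confidence clears `threshold`;
    if too few qualify, fall back to the `min_tokens` most confident ones."""
    ranked = sorted(masked, key=lambda i: confidences[i], reverse=True)
    chosen = [i for i in ranked if confidences[i] >= threshold]
    if len(chosen) < min_tokens:
        chosen = ranked[:min_tokens]
    return sorted(chosen)

def deepest_consistent_draft(drafts, verified):
    """Return the index of the deepest draft whose tokens all match `verified`,
    or -1 if even the shallowest draft disagrees.
    drafts[d] maps position -> drafted token; deeper drafts extend shallower ones."""
    accepted = -1
    for depth, draft in enumerate(drafts):
        if all(verified[pos] == tok for pos, tok in draft.items()):
            accepted = depth
        else:
            break
    return accepted

# Hypothetical per-position confidences; position 2 is already unmasked.
confidences = [0.9, 0.2, 0.95, 0.4]
masked = [0, 1, 3]
print(select_unmask_positions(confidences, masked, threshold=0.3))

# Hypothetical drafts at three depths and the tokens from the verification pass.
drafts = [{0: "a"}, {0: "a", 3: "b"}, {0: "a", 3: "b", 1: "c"}]
verified = {0: "a", 1: "x", 3: "b"}
print(deepest_consistent_draft(drafts, verified))
```

The point of the sketch is the control flow: several positions can be committed in one step, and a single verification pass decides how far down the stack of drafts to accept.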

2605.15608 2026-05-18 cs.LG cs.SY eess.SY

Transformer-like Inference from Optimal Control

Aditya Kudre, Heng-Sheng Chang, Prashant G. Mehta

Affiliations: Coordinated Science Laboratory; Electrical and Computer Engineering, University of Illinois Urbana-Champaign; Mechanical Science and Engineering, University of Illinois Urbana-Champaign

AI summary: Starting from optimal control theory, this paper derives an inference architecture for a prediction problem, sheds light on the origin of the operations in a transformer layer, and validates the approach experimentally on a nonlinear discrete process model and a linear Gaussian model.

Comments: Preprint

2605.15607 2026-05-18 cs.CL cs.LG

Syntax Without Semantics: Teaching Large Language Models to Code in an Unseen Language

Vinayshekhar Bannihatti Kumar, Disha Makhija, Manoj Ghuhan Arivazhagan, Rashmi Gangadharaiah

Affiliations: AWS AI Labs

AI summary: The study examines whether large language models can generate code in an unseen language and finds that fine-tuning teaches only syntax and fails to transfer semantic competence, revealing a gap between reasoning and language-specific implementation.

Abstract

Large language models (LLMs) achieve high pass rates on code generation benchmarks, yet whether they can transfer this ability to languages absent from pretraining remains poorly understood. We introduce PyLang, a minimal imperative language absent from all pretraining corpora, and evaluate frontier models zero-shot and fine-tuned Qwen3 (4B, 8B, 32B) on 352 problems. We find that fine-tuning quickly teaches syntax but fails to transfer semantic competence: Python outperforms PyLang by up to 19% across all configurations, and no intervention (multi-task learning, preference tuning, code infilling, or latent-space objectives) closes the gap. An LLM judge reveals that frontier models select an identical algorithm to Python 80% of the time, yet cannot translate it into a working PyLang implementation, and CKA analysis confirms that fine-tuned models converge to nearly identical internal representations across languages (CKA > 0.97) while diverging at the output stage. We term this the implementation fidelity gap: models possess language-agnostic algorithmic understanding but cannot express it in an unfamiliar language. Our findings highlight the need for training methods that decouple reasoning from language-specific realization.
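
The CKA figure quoted in the abstract compares two sets of internal representations. A minimal sketch of linear CKA is given below as a reminder of what such a score measures; the abstract does not say which CKA variant the authors used, so the linear form and the synthetic activation matrices here are assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear centered kernel alignment between two (samples x features) matrices."""
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                  # hypothetical activations, model A
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
Y = 2.0 * X @ Q                                # same geometry, rotated and rescaled
Z = rng.normal(size=(200, 8))                  # unrelated activations

print(linear_cka(X, Y), linear_cka(X, Z))
```

Linear CKA is invariant to rotations and uniform rescaling of the feature space, so a score near 1 indicates that two representations share the same geometry even when individual coordinates differ.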

2605.15604 2026-05-18 cs.LG cs.CL

VSPO: Vector-Steered Policy Optimization for Behavioral Control

Xuechen Zhang, Zijian Huang, Kai Yang, Weijia Zhang, Jiasi Chen, Samet Oymak

Affiliations: University of Michigan

AI summary: VSPO introduces a steering vector tied to the target behavior to control the behavioral intensity of generated rollouts, addressing the sparse-reward problem in multi-objective optimization and making policy optimization more efficient.

Abstract

Modern language models often need to optimize a primary accuracy objective while also accommodating secondary behavioral preferences, such as verbosity, agreeableness, or the level of technical expertise in their responses. In practice, a base model may exhibit a desired behavior very rarely or not at all. Thus, endowing the model with a target behavior creates a sparse behavioral reward bottleneck. To address such multi-objective problems, we introduce Vector-Steered Policy Optimization (VSPO), which employs a steering vector associated with the target behavior to control the behavior intensity of the generated rollouts. VSPO is obtained by modifying GRPO to sample rollouts with varying steering intensities. This process can be interpreted as an on-policy latent self-distillation procedure where the model internalizes its steering vector. By varying steering intensities, VSPO upsamples rare behaviors and enriches rollout diversity, which alleviates the sparse reward issue and provably accelerates the policy optimization. Through comprehensive theory and experiments, we establish that VSPO has favorable properties compared to vanilla reward shaping and other alternative approaches. Specifically, under a bandit abstraction, VSPO provably achieves better iteration complexity than reward-shaped GRPO when the steering-induced distributions are sufficiently aligned with the target behavior. We evaluate VSPO across multiple reasoning benchmarks, including MATH and MMLU-Pro, for four target behaviors: explanation expertise, confidence expression, robustness to misleading context, and response verbosity. Our results show that VSPO consistently improves control along the target behavior while maintaining or improving task accuracy compared with reward shaping, teacher-trace distillation, and guidance-based baselines.
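
Two ingredients mentioned in the abstract can be sketched in a few lines: adding a steering vector to a hidden state at several intensities, and the group-relative advantage normalization commonly associated with GRPO. This is a toy illustration under assumptions, not the VSPO algorithm itself: the hidden state, steering vector, intensities, and rewards are all hypothetical, and no model or policy update is involved.

```python
import numpy as np

def steer(hidden, steering_vector, alpha):
    """Shift a hidden state along the steering direction with intensity alpha."""
    return hidden + alpha * steering_vector

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts (zero mean, unit scale)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rng = np.random.default_rng(0)
hidden = rng.normal(size=8)                  # hypothetical hidden state
v = np.ones(8) / np.sqrt(8)                  # hypothetical unit steering vector
alphas = np.array([0.0, 0.5, 1.0, 2.0])      # hypothetical steering intensities

steered = np.stack([steer(hidden, v, a) for a in alphas])
behavior = steered @ v                       # projection onto the steering direction

rewards = [0.0, 0.0, 1.0, 1.0]               # hypothetical rewards, one per rollout
advantages = group_relative_advantages(rewards)
print(behavior, advantages)
```

In this toy setup, larger intensities push the state further along the steering direction, so a group sampled across intensities contains rollouts that express the rare behavior. Normalizing rewards within the group then gives those rollouts positive advantages.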


2605.15603 2026-05-18 cs.LG cs.AI

Offline Reinforcement Learning with Universal Horizon Models

Hojun Chung, Junseo Lee, Songhwai Oh

Affiliations: Interdisciplinary Program in Artificial Intelligence and ASRI, Seoul National University; Department of Electrical and Computer Engineering, Seoul National University

AI summary: This paper proposes universal horizon models, which flexibly predict future states at arbitrary horizons, addressing the weakness of conventional geometric horizon models in modeling distant future states, and validates them on 100 OGBench tasks.

Comments ICML 2026

Abstract

Model-based reinforcement learning (RL) offers a compelling approach to offline RL by enabling value learning on imagined on-policy trajectories. However, it often suffers from compounding errors due to repeated model inference on self-generated states. While geometric horizon models (GHM) alleviate this issue through direct prediction over a discounted infinite-horizon future, they remain challenged in accurately modeling distant future states. To this end, we introduce universal horizon models (UHM), a generalization of GHM that directly predicts future states under arbitrary horizons. Leveraging this flexibility, we propose a scalable value learning method that employs a winsorized horizon distribution to stabilize training by capping excessively large horizons. Experimental results on 100 challenging OGBench tasks demonstrate that the proposed method outperforms competitive baselines, particularly on tasks with highly suboptimal datasets and those requiring long-horizon reasoning. Project page: https://rllab-snu.github.io/projects/UHM/
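A minimal sketch of what a winsorized horizon distribution could look like: horizons are drawn from a geometric distribution and any draw above a cap is clamped to the cap rather than discarded. The geometric base distribution, the cap value, and the function names are assumptions for illustration; the abstract does not specify them.

```python
import random

def sample_geometric_horizon(gamma, rng):
    """Draw a horizon h >= 1 with P(h) = (1 - gamma) * gamma ** (h - 1)."""
    h = 1
    while rng.random() < gamma:
        h += 1
    return h

def sample_winsorized_horizon(gamma, max_horizon, rng):
    """Clamp the draw at max_horizon, so the tail mass piles up on the cap."""
    return min(sample_geometric_horizon(gamma, rng), max_horizon)

rng = random.Random(0)
horizons = [sample_winsorized_horizon(0.99, 50, rng) for _ in range(1000)]
```

Clamping keeps every sample, unlike truncation, which would drop or resample the large horizons.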


2605.15597 2026-05-18 cs.CV cs.GR cs.LG cs.RO

CM-EVS: Sparse Panoramic RGB-D-Pose Data for Complete Scene Coverage

Jiale Liu, Jungang Li, Jieming Yu, Xinglin Yu, Zihao Dongfang, Zongjian Ding, Kaifeng Ding, Yi Yang, Lidong Chen, Yang Zou, Shunwen Bai, Jiahuan Zhang, Haoran Huang, Shan Huang, Yudong Gao, Mingjun Cheng

Affiliations: Zhejiang University; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; Vorynel; Xinjiang University; Wuhan Polytechnic University; Tianjin University

AI summary: This paper presents CM-EVS, which uses the COVER algorithm to generate sparse panoramic RGB-D-pose data, achieving complete scene coverage with low redundancy and auditable provenance and improving geometric consistency for 3D learning.

Comments 35 pages including appendix. Code and dataset: https://github.com/Strange-animalss/CM-EVS

Abstract

Modern 3D visual learning relies on observations sampled from metric 3D assets, yet existing scans, meshes, point clouds, simulations, and reconstructions do not directly provide a sparse, comparable, and geometry-consistent panoramic training interface. Dense trajectories duplicate nearby views, source-specific rendering policies yield heterogeneous annotations, and sparse heuristics may miss important regions or introduce depth-inconsistent observations. We study how to convert 3D assets into sparse panoramic RGB-D-pose data that preserves complete scene coverage with low redundancy and auditable provenance. We propose COVER (Coverage-Oriented Viewpoint curation with ERP Range-depth warping), a training-free ERP viewpoint curator that projects geometry observed from selected views into candidate ERP probes, scores incremental coverage, and penalizes depth conflicts. Under bounded proxy error, its greedy coverage proxy preserves the standard coverage-style approximation behavior up to an additive error term. Using COVER, we build CM-EVS (Coverage-curated Metric ERP View Set), a panoramic RGB-D-pose dataset with 36,373 curated ERP frames from 1,275 indoor scenes across Blender indoor, HM3D, and ScanNet++, complemented by outdoor panoramas from TartanGround and OB3D re-encoded into the same schema. Each frame provides full-sphere RGB, metric range depth, calibrated pose; COVER-produced indoor frames include per-step provenance logs. With a median of only 25 frames per indoor scene, CM-EVS covers all 13 unified room types while maintaining compact scene-level coverage. Experiments show that COVER improves the coverage-conflict trade-off, making CM-EVS a sparse, compact, and auditable RGB-D-pose resource for geometry-consistent panoramic 3D learning.
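The greedy "score incremental coverage, penalize conflicts" loop described above can be sketched as follows. This is a simplified illustration and not the COVER implementation: scene geometry is reduced to sets of cell ids, and the per-view conflict counts are assumed to be given up front rather than computed by ERP range-depth warping.

```python
def greedy_select(candidates, conflicts, budget, penalty=1.0):
    """Pick up to `budget` views by marginal coverage minus a conflict penalty.

    candidates: dict view_id -> set of scene cells the view observes
    conflicts:  dict view_id -> count of depth-inconsistent cells (assumed given)
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        def score(view_id):
            gain = len(remaining[view_id] - covered)
            return gain - penalty * conflicts.get(view_id, 0)
        # Sort ids first so ties are broken deterministically.
        best = max(sorted(remaining), key=score)
        if score(best) <= 0:
            break  # no remaining view adds net coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered

views = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2, 3, 4, 5}}
chosen, covered = greedy_select(views, conflicts={"c": 4}, budget=2)
# chosen == ["a", "b"]: view "c" sees the most cells but is penalized for conflicts
```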


2605.15592 2026-05-18 cs.CV

Efficient Image Synthesis with Sphere Latent Encoder

Tung Do, Thuan Hoang Nguyen, Hao Li

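Affiliations: Mohamed Bin Zayed University of Artificial Intelligence

AI summary: This paper decouples the framework into a fixed pretrained image encoder and a spherical latent denoising model, which improves efficiency and lets reconstruction and generation be optimized independently. Across several datasets, the method outperforms Sphere Encoder in both generation quality and inference speed.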

Comments Technical report

Abstract

Few-step image generation has seen rapid progress, with consistency and meanflow-based methods significantly reducing the number of sampling steps. Despite their low inference cost, these approaches often suffer from training instability and limited scalability. Sphere Encoder is a recent alternative that produces high-quality images in only a few steps; however, it requires repeated transitions between the pixel space and latent space during inference while jointly optimizing reconstruction and generation within a single architecture. This design leads to computational inefficiency and objective conflict between reconstruction and generation. To address these limitations, we decouple the framework into a fixed pretrained image encoder and a separate latent denoising model trained entirely in a spherical latent space. Our approach eliminates repeated pixel-space operations during training and inference, improving efficiency and allowing reconstruction and generation to specialize independently. On Animal-Faces, Oxford-Flowers and ImageNet-1K datasets, our method significantly outperforms Sphere Encoder in both generation quality and inference speed, while achieving competitive results against strong few-step and multi-step baselines.


2605.15589 2026-05-18 cs.CL

MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models

Weixin Liu, Congning Ni, Shelagh A. Mulvaney, Susannah L. Rose, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin

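Affiliations: Vanderbilt University; Vanderbilt University Medical Center; Virginia Tech

AI summary: This paper presents the MHGraphBench benchmark, which evaluates large language models on mental-health entity recognition, relation judgment, and two-hop reasoning. Models perform well on entity typing and on the small relation-typing subset but still fall short on relation prediction and two-hop reasoning, and output-format reliability has a substantial effect on measured performance.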

Comments Accepted to GEM 2026, ACL 2026 Workshop; 9 pages main text plus references and appendices

Abstract

Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.


2605.15585 2026-05-18 cs.AI cs.CV

See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

Yuejia Li, Ke He, Junheng Li, Shutong Chen, Jingkang Xia, Zhiyue Su, Junchi Zhang, Mang Ye

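Affiliations: Wuhan University; University of Chinese Academy of Sciences

AI summary: This paper proposes the OmniManim framework, which uses visual planning and a feedback mechanism to improve the quality of generated educational animations, raising both render quality and instructional quality.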

Comments 21 pages, 4 figures

Abstract

Large language models can generate executable code for educational animations, but the resulting renders often exhibit visual defects, including element overlap, misalignment, and broken animation continuity. These defects cannot be reliably detected from the code alone and become apparent only after execution. We formalize this problem as render-feedback-aware constrained code generation: given a natural language specification, the model must generate executable code whose rendered output satisfies structured quality criteria that can be evaluated only after rendering. To address this problem, we introduce OmniManim, a render-feedback-aware educational animation generation framework built around a shared scene state, explicit visual planning, structured post-render diagnostics, and localized repair. Within OmniManim, the Vision Agent is a task-specific visual planning module: it predicts sparse keyframe layouts with coarse-to-fine bounding-box denoising and optimizes an interpolation-aware objective to reduce intermediate-frame failures induced by downstream animation interpolation. We further construct two datasets, ManimLayout-1K and EduRequire-500, and provide a reproducible evaluation protocol covering executability, instructional quality, visual quality, and efficiency. On EduRequire-500, OmniManim improves measured render quality over both single-model baselines and existing multi-agent frameworks. Systematic ablation studies further verify that explicit visual planning, especially its coarse spatial prior, bounding-box refinement, and interpolation-aware optimization, is central to these gains.


2605.15584 2026-05-18 cs.CV

AGC: Adaptive Geodesic Correction for Adversarial Robustness on Vision-Language Models

Zhiwei Li, Jiacheng Xue, Weining Wang, Ajian Liu, Xingyu Gao, Zhenan Sun, Qi Li

Affiliations: NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; School of Computer Science and Engineering, Central South University; University of Chinese Academy of Sciences

AI summary: This paper proposes AGC, a training-free defense that corrects input features with an adaptive step size to improve the adversarial robustness of vision-language models. Across eight fine-grained datasets, it improves average robust accuracy by 44.4% while reducing inference latency by 10×.

Abstract

Vision-language models like CLIP have demonstrated remarkable zero-shot transfer capabilities. However, their susceptibility to imperceptible adversarial perturbations remains a critical security concern. While test-time defenses offer a pragmatic solution for deployed models, existing approaches typically rely on gradient-based optimization during inference, incurring significant computational overhead. In this paper, we revisit the role of data augmentation in CLIP robustness and observe that augmentations are not equally effective: specific augmentations consistently provide robust geometric cues that align with correct class semantics in the hyperspherical feature space. Based on this, we propose Adaptive Geodesic Correction (AGC), a training-free defense mechanism that requires no parameter updates. AGC identifies a reliable augmentation as a geometric anchor and corrects the input feature towards it, utilizing an adaptive step size to balance robustness against clean accuracy preservation. AGC achieves superior performance across eight fine-grained datasets and three CLIP backbones, improving average robust accuracy by 44.4% over the state-of-the-art baseline while delivering a 10× reduction in inference latency. Our findings reveal a fundamental geometric property of CLIP features, offering a highly efficient and effective paradigm for robust multimodal deployment.
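Correcting a feature toward an anchor on the unit hypersphere can be illustrated with spherical interpolation along the great circle between the two points. This is a generic sketch, not the AGC implementation: how the anchor is chosen and how the step size adapts are not given in the abstract, so `step` is simply passed in by the caller, and the antipodal case (feature and anchor pointing in opposite directions) is left unhandled.

```python
import math

def normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def geodesic_step(feature, anchor, step):
    """Move `feature` toward `anchor` along the great circle; step in [0, 1]."""
    f, a = normalize(feature), normalize(anchor)
    cos_theta = max(-1.0, min(1.0, sum(x * y for x, y in zip(f, a))))
    theta = math.acos(cos_theta)
    if theta < 1e-8:
        return f  # already aligned; nothing to correct
    # Standard slerp weights; the result stays on the unit sphere.
    w_f = math.sin((1.0 - step) * theta) / math.sin(theta)
    w_a = math.sin(step * theta) / math.sin(theta)
    return [w_f * x + w_a * y for x, y in zip(f, a)]

corrected = geodesic_step([1.0, 0.0], [0.0, 1.0], step=0.5)
```

With `step=0` the feature is returned unchanged (after normalization) and with `step=1` it lands on the anchor, so the step size trades off how much of the original feature is preserved.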


2605.15583 2026-05-18 cs.CV

Unsupervised 3D Human Pose Estimation via Conditional Multi-view Ancestral Sampling

Ryohei Goto, Takuya Fujihashi, Shunsuke Saruwatari, Fumio Okura

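Affiliations: The University of Osaka

AI summary: This paper proposes a single-view 3D human pose estimation method that requires no 3D supervision. It uses the 2D diffusion prior of a pretrained 2D motion diffusion model (MDM) and optimizes the 3D pose through conditional multi-view ancestral sampling so that its multi-view projections lie on the manifold of the 2D MDM noise space while matching the given 2D pose and human anatomical constraints.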

Comments International Conference on Automatic Face and Gesture Recognition (FG 2026), Oral

2605.15582 2026-05-18 cs.CV

LDGuid: A Framework for Robust Change Detection via Latent Difference Guidance

Jiaxuan Zhao, Ali Bereyhi

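Affiliations: University of Toronto

AI summary: This paper proposes the LDGuid framework, which learns semantic differences and injects them into change detection models. Experiments show improved segmentation performance on several datasets, with the largest gains in challenging settings affected by spectral noise.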

Comments Accepted to IGARSS 2026. Code is available at: https://github.com/zjxyoyo/LDGuid

Abstract

Modern deep learning models for change detection (CD) often struggle to explicitly represent task-relevant semantic differences. This paper proposes the Latent Difference Guidance (LDGuid) framework that explicitly learns and injects semantic differences into CD models. LDGuid deploys adversarial autoencoding to implement a difference embedding (DE) module. The DE module is pretrained via the information bottleneck method, restricting it to learn only task-relevant differences between pre- and post-event samples. The learned latent difference is then used as an explicit guidance signal in the CD model. We validate LDGuid by integrating it into U-Net, BIT, and AERNet baselines for CD and evaluating it on LEVIR-CD, WHU-CD, SVCD, and CaBuAr datasets. Experimental results show that LDGuid enhances segmentation performance across all benchmarks, with particularly remarkable gains in challenging settings affected by spectral noise. The results further highlight the ability of LDGuid in incorporating domain knowledge, such as task-specific spectral indices. Our findings suggest that semantic difference learning can drastically enhance the robustness of CD in remote sensing.


2605.15581 2026-05-18 cs.AI

STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

Junle Wang, Xingchuang Liao, Wenjun Wu

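Affiliations: School of Artificial Intelligence, Beihang University, Beijing, China

AI summary: This paper proposes the STAR framework, which decomposes the RCA workflow into four stages to improve the reliability and self-repair capability of RCA agents in microservices.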

Comments 11 pages

Abstract

LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.


2605.15575 2026-05-18 cs.LG cs.DB

Gaussian Relational Graph Transformer

Zezhong Ding, Jin Li, Xugang Wang, Xike Xie

Affiliations: School of Artificial Intelligence and Data Science, University of Science and Technology of China; School of Biomedical Engineering, USTC; Data Darkness Lab, Suzhou Institute for Advanced Research, USTC; Chinese Academy of Sciences

AI summary: This paper proposes GelGT, which uses structure-semantics collaborative sampling and a Gaussian graph attention mechanism to address long-range dependencies and joint modeling of multiple kinds of information in relational graph models. Experiments show state-of-the-art predictive performance on several real-world datasets.

2605.15574 2026-05-18 cs.CV

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

Sunghwan Steve Cho, Yunseok Han, Jaeyoung Do

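Affiliations: AIDAS Laboratory; Seoul National University

AI summary: The MI-CXR benchmark evaluates longitudinal reasoning over multi-visit chest X-rays. Using five-way multiple-choice questions and three complementary task families, it exposes the temporal limitations of existing vision-language models.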

Comments 33 pages

Abstract

Longitudinal chest X-ray (CXR) interpretation requires reasoning over disease evolution across multiple patient visits, yet most existing medical VQA benchmarks focus on single images or short-horizon image pairs. We introduce MI-CXR, a benchmark for standardized evaluation of Multi-Interval longitudinal reasoning over multi-visit CXR sequences, without requiring free-form report generation or additional clinical context. MI-CXR comprises five-way multiple-choice questions over five-visit patient timelines and instantiates three complementary task families: Temporal Event Localization, Interval-wise Change Reasoning, and Global Trajectory Summarization, which assess clinically grounded visual reasoning over time. Evaluating 14 state-of-the-art vision-language models (VLMs) shows low overall performance, with an average accuracy of 29.3%, only modestly above random guessing. Using stage-wise diagnostic probing, we find that models often produce locally plausible interval descriptions but fail to enforce temporal constraints or compose evidence into globally consistent decisions over the full timeline. These findings reveal key limitations of current VLMs and establish MI-CXR as a principled benchmark for longitudinal medical reasoning. The benchmark is available at https://github.com/AIDASLab/MI-CXR


2605.15573 2026-05-18 cs.CL cs.LG cs.MA

Response-Conditioned Parallel-to-Sequential Orchestration for Multi-Agent Systems

Nurbek Tastan, Alex Iacob, Lorenzo Sani, Meghdad Kurmanji, Nicholas D. Lane, Samuel Horvath, Karthik Nandakumar

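Affiliations: MBZUAI; University of Cambridge; Flower Labs; Michigan State University

AI summary: This paper proposes the Nexa framework, whose response-conditioned policy combines parallel and sequential execution, reducing communication and latency while improving final-response accuracy, and demonstrates the generality of the learned policy.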

Abstract

Multi-agent systems can solve complex tasks through collaboration between multiple Large Language Model agents. Existing collaboration frameworks typically operate in either a parallel or a sequential mode. In the parallel mode, agents respond independently to queries followed by aggregation of responses. In contrast, sequential systems allow agents to communicate via a directed topology and refine one another step by step. However, both modes are inadequate for achieving the desired objectives of minimizing communication and latency while simultaneously maximizing the accuracy of the final response. In this work, we introduce a hybrid paradigm called Nexa, a trainable response-conditioned policy that bridges the gap between the two modes. Nexa begins with a parallel execution stage, embeds the resulting responses into a shared semantic space, and then predicts a sparse directed acyclic communication graph. If the graph is empty, the system remains purely parallel; if it is non-empty, the system performs one sequential message propagation. The policy is a lightweight transformer model, and the method avoids the need for external LLM judges or reward models, as well as hand-crafted test-time topology search. We formalize this hybrid execution problem, show that the resulting graph is acyclic by construction, and that the framework strictly subsumes pure parallel execution, and present a training procedure based on policy-gradient optimization. Results demonstrate that the response-conditioned policy learned by Nexa under one setting can be reused when the number of agents, the task, or the underlying agent changes, thus emphasizing the generalizability of the learned communication policy.
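One common way to make a predicted communication graph acyclic by construction is to allow edges only from a lower-indexed agent to a higher-indexed one, so the index order is always a valid topological order. The sketch below illustrates that idea together with the empty-graph fallback to pure parallel execution. The thresholding rule, the fixed agent ordering, and the function names are assumptions for illustration; the abstract does not say how Nexa enforces acyclicity.

```python
def build_dag(scores, threshold=0.5):
    """Keep edge i -> j only if i < j and its score clears the threshold.

    Restricting edges to i < j makes index order a topological order,
    so the resulting graph cannot contain a cycle.
    """
    n = len(scores)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if scores[i][j] > threshold]

def execution_mode(edges):
    """An empty graph means the parallel responses are used as they are."""
    return "parallel" if not edges else "parallel+sequential"

# Hypothetical pairwise edge scores for three agents.
scores = [[0.0, 0.9, 0.1],
          [0.8, 0.0, 0.7],
          [0.2, 0.3, 0.0]]
edges = build_dag(scores)
# edges == [(0, 1), (1, 2)]; the high score at scores[1][0] is ignored
# because it would point against the ordering.
```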


2605.15567 2026-05-18 cs.AI

Position: Artificial Intelligence Needs Meta Intelligence -- the Case for Metacognitive AI

Sergei Chuprov, Richard D. Lange, Leon Reznik, Paulo Shakarian, Raman Zatsarenko, Dmitrii Korobeinikov

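Affiliations: University of Texas Rio Grande Valley, Edinburg, TX, USA; Rochester Institute of Technology, Rochester, NY, USA; Syracuse University, Syracuse, NY, USA

AI summary: This paper argues for metacognition as a general principle for designing more accurate, safer, and more efficient AI. A federated learning case study shows how metacognition can improve learning efficiency and security, and the paper proposes a new software framework for implementing metacognitive AI.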

Comments This is a preliminary version accepted for presentation and publication at the 43rd International Conference on Machine Learning (ICML26). The modified final version will be available in the conference proceedings

2605.15565 2026-05-18 cs.LG cs.AI

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

Haizhong Zheng, Yizhuo Di, Jiahui Wang, Shuowei Jin, Xueshen Liu, Yongji Wu, Z. Morley Mao, Ion Stoica, Jiawei Zhao, Beidi Chen

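Affiliations: Carnegie Mellon University; University of Michigan; UC Berkeley; Meta

AI summary: AstraFlow is a dataflow-oriented reinforcement learning system that supports complex multi-policy collaborative training and makes efficient use of heterogeneous compute resources, improving the reasoning and tool-use capabilities of agentic LLMs.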

Abstract

Reinforcement learning (RL) is increasingly used to improve the reasoning, coding, and tool-use capabilities of large language models, but agentic RL remains prohibitively expensive. Scaling RL to agentic LLMs requires supporting complex workloads, including multi-policy collaborative training, while efficiently using elastic, heterogeneous, and cross-region compute resources. Existing LLM RL systems support some of these capabilities, but each new extension often requires dedicated system engineering. This burden arises from trainer-centered control architectures and the lack of principled abstractions for RL system components. To address these limitations, we propose AstraFlow, a dataflow-oriented RL system that replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, enabling the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources. We evaluate AstraFlow across math, code, search, and AgentBench workloads, showing that the same system supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without system-level code changes. In multi-policy collaborative training, AstraFlow achieves comparable or better accuracy than existing RL systems while speeding up training time by 2.7x.


2605.15564 2026-05-18 cs.LG cs.CE eess.IV

CrystalBoltz: End-to-End Protein Structure Determination via Experiment-Guided Diffusion for X-Ray Crystallography

Minseo Kim, Huanghao Mai, Jay Shenoy, Alec Follmer, Gordon Wetzstein, Frederic Poitevin

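Affiliations: Stanford University; SLAC National Accelerator Laboratory; UC Davis

AI summary: CrystalBoltz uses an experiment-guided diffusion model for end-to-end protein structure determination. It casts refinement of atomic structures as Bayesian inference, lowering coordinate RMSD and R-factors and making X-ray crystallographic structure determination more efficient.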

Comments Project page: https://soniaminseokim.github.io/crystalboltz-website/

Abstract

Generative models trained on public databases of protein structures, most of which have been determined by X-ray crystallography, now provide powerful priors for structure prediction. However, they are not readily conditioned on the measurements from a new crystallographic experiment, limiting their use for X-ray structure determination. In crystallography, the measured structure-factor amplitudes do not by themselves determine an electron density map or atomic structure because the associated phases are unobserved and must be inferred. Structure determination therefore remains an inverse problem in which candidate models must be both structurally plausible and consistent with measured diffraction data, often requiring substantial manual refinement by human experts. Emerging methods aim to incorporate experimental information more directly into predictive and refinement workflows. We present CrystalBoltz, a generative framework that casts crystallographic refinement as Bayesian inference over atomic structures and operates directly on structure-factor amplitudes. CrystalBoltz moves from unguided generation with a pre-trained prior over protein structures to experiment-guided posterior sampling, followed by atomic coordinate and B-factor refinement. Across multiple protein crystallography datasets, CrystalBoltz attains lower coordinate RMSD and lower R-factors than the strongest baselines considered, while reducing runtime by a factor of 33 relative to existing experimentally guided refinement.

