arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.18104 2026-05-19 cs.AI cs.CR

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao, Yanyan Zhao, Yutai Hou, Qianchao Wang, Dandan Tu, Bing Qin

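Affiliations: Harbin Institute of Technology; Huawei Technologies Co., Ltd

AI summary: This paper studies why multimodal large language models fall short in cross-modal safety transfer, identifies the Safety Geometry Collapse phenomenon, and improves model safety with an adaptive drift correction method.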

Abstract

Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent multimodal safety gap. We study this gap from a representation-geometric perspective by analyzing a text-aligned refusal direction and a modality-induced drift direction. We show that multimodal inputs compress the usable separation along the refusal direction, making it no longer reliable for identifying and refusing harmful inputs. We refer to this failure mode as Safety Geometry Collapse. We quantify it through conditional refusal separability and show that stronger modality-induced drift is consistently associated with weaker refusal separability and higher attack success rates. We then validate the causal role of modality-induced drift through a fixed-strength activation intervention: counteracting the estimated drift restores refusal separability and improves multimodal safety. After drift correction, we further observe self-rectification, where the model recovers its ability to recognize and refuse harmful multimodal inputs during forward dynamics. This effect also provides an internal signal of the model's perceived harmfulness of each input. Motivated by this signal, we propose ReGap, a training-free inference-time method that adaptively corrects modality drift using self-rectification. Experiments across multiple multimodal safety benchmarks and utility benchmarks demonstrate the effectiveness of ReGap, which significantly improves the safety of MLLMs without compromising general capabilities. Our findings highlight representation-level modality alignment as a crucial direction for real-time safety improvement and for building safer, more reliable MLLMs.
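
The abstract's geometric picture can be illustrated with a minimal sketch. It assumes both directions are estimated as differences of mean hidden states and that the intervention is a constant shift. The abstract does not say how the paper estimates them, so every function name below is hypothetical.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

def mean_difference_direction(group_a, group_b):
    """Unit vector from the mean of group_b to the mean of group_a (assumed estimator)."""
    return unit(group_a.mean(axis=0) - group_b.mean(axis=0))

def refusal_separability(harmful, harmless, refusal_dir):
    """Gap between mean projections of harmful and harmless states on the refusal direction."""
    return float((harmful @ refusal_dir).mean() - (harmless @ refusal_dir).mean())

def counteract_drift(hidden, drift_dir, strength):
    """Fixed-strength intervention: shift hidden states against the drift direction."""
    return hidden - strength * drift_dir
```

In this toy form the shift only moves states back along the drift direction; any recovery of refusal behaviour would come from the model's later layers, which the sketch does not model.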

2605.18101 2026-05-19 cs.CV cs.AI

SENSE: Satellite-based ENergy Synthesis for Sustainable Environment

Kailai Sun, Mingyi He, Heye Huang, Can Rong, Alok Prakash, Baoshen Guo, Shenhao Wang, Jinhua Zhao

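Affiliations: Massachusetts Institute of Technology; University of Florida

AI summary: This paper proposes SENSE, a unified generative urban building energy modeling framework that combines a generative diffusion model with knowledge from large vision models to produce high-resolution urban satellite imagery together with aligned, high-quality building energy consumption and height maps, improving prediction performance for sustainable urban development.

Comments Accepted by KDD 2026 (Oral)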

Abstract

Urban Building Energy Modeling plays a critical role in achieving the United Nations' Sustainable Development Goals 7 and 11. Although existing studies based on satellite imagery and deep learning have achieved remarkable progress, many challenges exist: most existing studies are inherently predictive, failing to reflect the generative nature of urban planning; although generative AI and diffusion models have seen explosive growth in satellite imagery, they lack the urban functional generation (e.g., energy layer); third, aligned high-quality high-resolution building energy data with satellite imagery is limited and scarce. Here we propose SENSE (Satellite-based ENergy Synthesis for Sustainable Environment), a unified generative UBEM framework that jointly synthesizes realistic urban satellite imagery and aligned high-quality building energy consumption and height maps. By conditioning on road networks and urban density metrics, SENSE, based on a controllable diffusion model, leverages the knowledge learned by large vision models to generate urban building energy consumption and height information (annotations) in the latent space. Experiments across four cities (New York City, Boston, Lyon, Busan) demonstrate that SENSE achieves high visual fidelity and strong physical consistency, satisfying the ASHRAE standard metric. Experiments demonstrate that SENSE can generate enough annotated synthetic data using less than 20% labeled energy data, boosting downstream prediction performance by 10% IoU. Compared to SOTA urban energy prediction methods, SENSE significantly reduced prediction error (reduced 3%-11% NMBE and 1%-9% CVRMSE). This study offers an energy-efficiency urban planning and physical generation solution for urban science, energy science and building science. The dataset and code: https://huggingface.co/datasets/skl24/MUSE and https://github.com/kailaisun/GenAI4Urban-Energy/.
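
The NMBE and CVRMSE figures quoted above are calibration metrics commonly associated with ASHRAE Guideline 14. The sketch below uses one common form of both. The degrees-of-freedom adjustment `p` and the sign convention vary between references and are assumptions here, not details taken from the paper.

```python
import numpy as np

def nmbe(measured, predicted, p=1):
    """Normalized mean bias error in percent (assumed convention: measured minus predicted)."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = measured.size
    return 100.0 * np.sum(measured - predicted) / ((n - p) * measured.mean())

def cvrmse(measured, predicted, p=1):
    """Coefficient of variation of the RMSE in percent."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = measured.size
    rmse = np.sqrt(np.sum((measured - predicted) ** 2) / (n - p))
    return 100.0 * rmse / measured.mean()
```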

2605.18094 2026-05-19 cs.AI

Learning to Solve Compositional Geometry Routing Problems

Mingfeng Fan, Jianan Zhou, Jiaqi Cheng, Yifeng Zhang, Jie Zhang, Guillaume Adrien Sartoretti

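Affiliations: National University of Singapore; Nanyang Technological University; Central South University

AI summary: This paper studies the Compositional Geometry Routing Problem (CGRP), a unified superclass covering point, line, area, and arbitrary hybrid task geometries that offers a broad abstraction for real-world routing scenarios. To handle the asymmetry and complexity introduced by non-point tasks, the authors propose the DiCon framework, which improves representation learning and decision-making through contrastive learning and a differential attention mechanism.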

Comments 27 pages, 10 figures

Abstract

We study the Compositional Geometry Routing Problem (CGRP), a unified superclass of traditional routing problems that covers point-only, line-only, area-only, and arbitrary hybrid task geometries, providing a broad abstraction for real-world routing scenarios. Beyond standard point-based routing, CGRP with non-point tasks can be inherently asymmetric, tightly couples travel routes with the intrinsic path, and enlarges the action space with numerous feasible yet often irrelevant options, thereby posing significant challenges for both representation learning and decision-making. To address these challenges, we propose DiCon, a differential attention-assisted solver with contrastive learning, as a plug-and-play framework that tackles the problem from two complementary angles. First, we introduce a differential attention mechanism that actively suppresses the probability mass on less competitive candidate actions. Second, we design a double-level contrastive learning objective to promote robust global instance representations and regularize geometry-aware task representations. Extensive experiments demonstrate that DiCon achieves strong performance, broad versatility, and superior generalization across diverse CGRP instances with different compositions.
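
A rough sketch of how a differential attention map can suppress probability mass on weak candidates: one softmax map is subtracted from another, and the result is renormalized. The clipping, the renormalization, and the weight `lam` are assumptions made for illustration. The abstract does not give DiCon's exact formulation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def differential_attention(scores_a, scores_b, lam=0.5):
    """Subtract a second attention map from the first, clip at zero, renormalize.

    Requires 0 <= lam < 1 so that some positive mass always remains.
    """
    diff = softmax(np.asarray(scores_a, float)) - lam * softmax(np.asarray(scores_b, float))
    diff = np.clip(diff, 0.0, None)
    return diff / diff.sum(axis=-1, keepdims=True)
```

Candidates that the second map rates highly lose mass, so the remaining probability concentrates on the actions the first map prefers.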

2605.18083 2026-05-19 cs.CL

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$Δ$ Integration into Upcycled MoE

Hao Zhou, Tianhao Li, Zhijun Wang, Shuaijie She, Linjuan Wu, Hao-Ran Wei, Baosong Yang, Jiajun Chen, Shujian Huang

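Affiliations: National Key Laboratory for Novel Software Technology, Nanjing University; Tongyi Lab, Alibaba Group; Zhejiang University

AI summary: This paper proposes a data-efficient approach that upcycles a dense model into an MoE architecture and assigns different languages to different experts, improving the performance of multilingual large language models without a complex alignment phase while preserving the original capabilities.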

Abstract

Expanding Large Language Models (LLMs) to new languages is a costly endeavor, demanding extensive Continued Pre-Training (CPT) and data-intensive alignment. While recent data-free merging techniques attempt to bypass alignment by fusing a multilingual CPT-enhanced model with its instruct counterpart, they are plagued by a critical trade-off: mitigating parameter conflicts to preserve original abilities inevitably dilutes new language acquisition, and vice versa. To resolve this conflict, we introduce \method, which upcycles a dense model into a Mixture-of-Experts (MoE) architecture, allocating different experts to different languages. Alignment ability is then transferred by grafting a MoE-expanded parameter delta ($Δ_{\text{post}}$) to the CPT-enhanced base model, bypassing the complex alignment phase. Experiments demonstrate \method's superiority even against baselines with similar FLOPs or number of parameters; it improves performance on expanded languages while effectively preserving original capabilities. We further show our approach is highly applicable across different models and post-training deltas.
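
The delta-grafting idea can be sketched with plain parameter dictionaries: the post-training delta is the instruct weights minus the base weights, and it is added to each expert copy of the CPT-enhanced model. Replicating the same delta across all experts is an assumption for illustration. The abstract does not describe how the delta is expanded to the MoE layout.

```python
import numpy as np

def post_training_delta(instruct, base):
    """Delta between an instruct model and its base, per parameter name."""
    return {name: instruct[name] - base[name] for name in base}

def graft_delta(cpt_experts, delta):
    """Add the same delta to every expert copy (hypothetical expansion rule)."""
    return [
        {name: weights + delta[name] for name, weights in expert.items()}
        for expert in cpt_experts
    ]
```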

2605.18082 2026-05-19 cs.LG

pyforce-1.0.0: Python Framework for data-driven model Order Reduction of multi-physiCs problEms

Stefano Riva, Yantao Luo, Carolina Introini, Antonio Cammi

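Affiliations: Department of Energy, Nuclear Engineering Division, Politecnico di Milano; Department of Mechanical and Nuclear Engineering and Emirates Nuclear Technology Center, Khalifa University

AI summary: This paper presents the pyforce-1.0.0 framework, which applies data-driven reduced order modelling techniques to multi-physics problems, mainly in nuclear engineering, and improves sensor placement optimisation and the integration of real measurements to gain better knowledge of physical systems.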

Comments Github Repo: https://github.com/ERMETE-Lab/ROSE-pyforce

Abstract

pyforce is a Python package implementing Data-Driven Reduced Order Modelling techniques for applications to multi-physics problems, mainly set in the Nuclear Engineering world. The package is part of ROSE (Reduced Order modelling with data-driven techniques for multi-phySics problEms), a set of mathematical algorithms aimed at reducing the complexity of multi-physics models (for nuclear reactor applications), searching for optimal sensor positions, and integrating real measurements to improve knowledge of the physical systems. Compared with the previous implementation based on the dolfinx package (v0.6.0), version 1.0.0 of pyforce has been completely rewritten using pyvista as the backend for mesh import, integral computation, and visualisation of results; in addition, functions are stored as numpy arrays, which makes the package easier to use. This choice allows pyforce to be used with any solver able to export results in VTK format.
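
Because fields are stored as numpy arrays, a basic reduced order step can be sketched with NumPy alone. The example below is a generic proper orthogonal decomposition via SVD with a plain Euclidean inner product. It does not use the pyforce API, and the function names are hypothetical.

```python
import numpy as np

def pod_basis(snapshots, rank):
    """Leading POD modes of a snapshot matrix of shape (n_dofs, n_snapshots)."""
    modes, singular_values, _ = np.linalg.svd(snapshots, full_matrices=False)
    return modes[:, :rank], singular_values

def project(basis, field):
    """Reduced coefficients of a field in the POD basis."""
    return basis.T @ field

def reconstruct(basis, coefficients):
    """Full-order approximation from reduced coefficients."""
    return basis @ coefficients
```

A mesh-aware implementation would weight the inner product by cell volumes; the Euclidean version is kept here only for brevity.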

2605.18079 2026-05-19 cs.LG cs.CC cs.CL

The Expressive Power of Low Precision Softmax Transformers with (Summarized) Chain-of-Thought

Moritz Brösamle, Stephan Eckstein

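Affiliations: Department of Mathematics, University of Tübingen, Germany

AI summary: This paper studies the expressive power of low-precision softmax transformers with chain-of-thought. It constructs hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines, converts the constructions to equivalent softmax transformers, and analyzes the efficiency of the recently proposed summarized chain-of-thought paradigm for simulating Turing machines.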

Comments Accepted to ICML 2026

Abstract

Existing expressivity results for transformers typically rely on hardmax attention, high precision, and other architectural modifications that disconnect them from the models used in practice. We bridge this gap by analyzing standard transformer decoders with softmax attention and rounding of activations and attention weights, while allowing depth and width to grow logarithmically with the context length. As an intermediate step, we construct hardmax transformers with ternary activations and well-separated attention scores that simulate Turing machines using Chain-of-Thought (CoT). This lets us convert the constructions to equivalent softmax transformers without the unrealistic parameter magnitudes or activation precision that prior approaches would require. Using the same technique, we analyze a recently proposed summarized CoT paradigm and show that it simulates Turing machines more efficiently, with model size scaling logarithmically in a space bound rather than a time bound. We empirically test predictions made by our results on a Sudoku reasoning task and find better alignment with learnability than for prior high-precision results. Our code is available at https://github.com/moritzbroe/transformer-expressivity.
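
Why well-separated attention scores help can be seen in a toy example: when the top score leads by a wide margin, rounding the softmax weights gives exactly the hardmax output. The rounding model below (rounding to a fixed number of decimals) is an assumption and may differ from the paper's precision model.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax of a 1-D score vector."""
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())
    return e / e.sum()

def hardmax(scores):
    """One-hot vector at the position of the largest score."""
    scores = np.asarray(scores, dtype=float)
    out = np.zeros_like(scores)
    out[scores.argmax()] = 1.0
    return out

def rounded_softmax(scores, decimals=3):
    """Softmax weights rounded to a fixed number of decimals (assumed rounding model)."""
    return np.round(softmax(scores), decimals)
```

With scores `[10, 0, 0]` the rounded softmax matches the hardmax, while with scores `[1, 0, 0]` it does not.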

2605.18078 2026-05-19 cs.LG

Equilibrium Selection in Multi-Agent Policy Gradients via Opponent-Aware Basin Entry

Yevhen Shcherbinin, Arina Redina, Maxim Kalpin, Vlad Kochetov

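Affiliations: Bloomsbury Technology; London School of Economics and Political Science; University of Bristol; Johannes Kepler University Linz; Odesa Polytechnic National University

AI summary: This paper studies equilibrium selection for multi-agent policy-gradient methods that converge locally to stable Nash equilibria. It proposes an opponent-aware mechanism that raises the probability of entering the basin of a target equilibrium set, and experiments support its effectiveness for cooperative basins.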

Abstract

Multi-agent policy-gradient methods have been shown to converge locally near stable Nash equilibria. Local convergence, however, does not determine which equilibrium is reached. We study this question through basin-entry probability with respect to a target set of equilibria selected by an external criterion, such as payoff dominance. For finite-unroll Meta-MAPG, we show that the update decomposes into ordinary policy gradient plus own-learning and peer-learning corrections, with controlled sampling noise and finite-unroll bias. We identify the peer-learning correction as the main equilibrium-selection mechanism: under a local alignment condition, the probability of entering the certified attraction region of the target stable-Nash set increases, relative to ordinary policy gradient. Because persistent correction may shift zero-update points of the original game, annealing the correction after entering the basin recovers ordinary policy-gradient dynamics and inherits local stable-Nash convergence guarantees. Experiments in Stag Hunt, iterated Prisoner's Dilemma, and preliminary neural-policy coordination environments support this basin-entry view, showing increased entry into cooperative basins under peer-aware updates.
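
The point that local convergence does not settle which equilibrium is reached can be illustrated with ordinary (not Meta-MAPG) policy gradient in a Stag Hunt. The payoff values, learning rate, and exact-gradient updates below are assumptions chosen for a small deterministic sketch.

```python
import math

# Hypothetical Stag Hunt payoffs for one player (assumed values).
STAG_STAG, STAG_HARE, HARE_ANY = 4.0, 0.0, 3.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    return math.log(p / (1.0 - p))

def train(p0, q0, lr=1.0, steps=2000):
    """Simultaneous exact policy-gradient ascent on each player's stag logit."""
    a, b = logit(p0), logit(q0)
    for _ in range(steps):
        p, q = sigmoid(a), sigmoid(b)
        # Payoff advantage of stag over hare given the other player's stag probability.
        adv_a = q * STAG_STAG + (1 - q) * STAG_HARE - HARE_ANY
        adv_b = p * STAG_STAG + (1 - p) * STAG_HARE - HARE_ANY
        a += lr * adv_a * p * (1 - p)
        b += lr * adv_b * q * (1 - q)
    return sigmoid(a), sigmoid(b)
```

With these payoffs, stag is the better reply only when the other player hunts stag with probability above 0.75, so symmetric starts at 0.8 and 0.7 end in different equilibria.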

2605.18074 2026-05-19 cs.RO

4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving

Kane Qian, Xin Zhao, Yining Shi, Rujun Yan, Zhengqing Pan, Kaojin Zhu, Mengmeng Yang, Kai Sun, Diange Yang, Kun Jiang

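Affiliations: Tsinghua University; Hesai Technology Co., Ltd.

AI summary: This paper presents 4DLidarOpen, an autonomous driving dataset built on 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. It includes point-wise radial velocity measurements, multiple Lidar types, surround-view cameras, and 6-DOF vehicle poses, uses a hybrid annotation strategy that combines large-scale auto-labeled training data with human refinement, and provides benchmarks for 3D object detection, bird's-eye-view segmentation and flow prediction, and motion forecasting.

Comments 15 pages, 9 figures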

Abstract

We present 4DLidarOpen, a large-scale open multi-modal dataset for autonomous driving, centered on 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. Unlike conventional time-of-flight Lidar datasets that mainly provide geometric measurements, 4DLidarOpen includes point-wise radial velocity measurements from a forward-facing 4D FMCW Lidar, together with multiple Lidars of different types, including rotating, solid-state, and blind-spot variants, surround-view cameras, and 6-DOF ego-vehicle poses. The dataset was collected in complex urban environments in Beijing and covers dense pedestrian interactions, congested traffic, high-speed driving, and unprotected maneuvers. 4DLidarOpen provides synchronized multi-sensor data and 3D bounding-box annotations with persistent track IDs across five object categories. A hybrid annotation strategy is adopted, where large-scale auto-labeled data support scalable training and human experts refine annotations for the human-annotated training and validation sets. Based on this dataset, we establish benchmarks for 3D object detection, bird's-eye view (BEV) segmentation and flow prediction, and motion forecasting with planning. Extensive experiments show that direct velocity measurements from 4D FMCW Lidar provide complementary motion cues for dynamic-scene understanding. Compared with geometric-only sensing, the velocity-aware representation improves motion-related perception and downstream forecasting and planning, especially in scenarios involving vulnerable road users and fast-moving objects. These results indicate that 4D FMCW Lidar is a promising sensing modality for motion-aware autonomous driving. The dataset and evaluation toolkit are publicly released to support research on 4D scene understanding, multi-Lidar fusion, and velocity-aware perception and planning.
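
A point-wise radial velocity is the component of a point's velocity along the line of sight from the sensor. The sketch below assumes a sensor at the origin and a sign convention where positive values mean motion away from the sensor. Neither detail is stated in the abstract, and ego-motion compensation is left out.

```python
import numpy as np

def radial_velocity(points, velocities):
    """Project each point's velocity onto its unit line-of-sight vector.

    points, velocities: arrays of shape (n, 3) in the sensor frame.
    """
    points = np.asarray(points, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    line_of_sight = points / np.linalg.norm(points, axis=-1, keepdims=True)
    return np.sum(velocities * line_of_sight, axis=-1)
```

Purely tangential motion produces a radial velocity of zero, which is why such measurements complement rather than replace tracking-based motion estimates.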

2605.18072 2026-05-19 cs.SD

MusicDET: Zero-Shot AI-Generated Music Detection

Chaolei Han, Hongsong Wang, Jie Gui

Affiliations: School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China; School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China; Purple Mountain Laboratories, Nanjing 210000, China; Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education, China

AI summary: This paper proposes MusicDET, a framework that uses a frequency-guided normalizing flow model to perform zero-shot AI-generated music detection without any generated samples, effectively identifying out-of-distribution music signals.

Comments Accepted by ICML 2026

Abstract

Detecting AI-generated music is crucial for preserving artistic authenticity and preventing the misuse of generative music technologies. However, existing discriminative detectors typically rely on generated samples during training and often suffer from severe performance degradation when confronted with music produced by unseen generators, which limits their real-world applicability. To address this issue, we formulate a zero-shot setting for AI-generated music detection, where the detector is trained exclusively on real music without access to any generated samples. Under this setting, we propose MusicDET, a generator-agnostic detection framework based on frequency-guided normalizing flows that probabilistically models the distribution of real music features. By evaluating the likelihood of an input sample under the learned real-music distribution, MusicDET enables effective detection of out-of-distribution music signals. Experiments on the FakeMusicCaps and SONICS datasets show that MusicDET consistently outperforms conventional discriminative detectors, particularly when detecting music generated by previously unseen models.
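
The likelihood-based detection principle can be sketched with a much simpler density model. The example below swaps the normalizing flow for a single multivariate Gaussian fitted on real-music features only. The feature vectors and the threshold are hypothetical.

```python
import numpy as np

def fit_gaussian(real_features):
    """Fit mean, precision matrix, and log-determinant on real samples only."""
    mean = real_features.mean(axis=0)
    cov = np.cov(real_features, rowvar=False) + 1e-6 * np.eye(real_features.shape[1])
    return mean, np.linalg.inv(cov), np.linalg.slogdet(cov)[1]

def log_likelihood(x, mean, precision, logdet):
    """Gaussian log-density of one feature vector or a batch of them."""
    d = np.asarray(x, dtype=float) - mean
    maha = np.einsum("...i,ij,...j->...", d, precision, d)
    return -0.5 * (maha + logdet + mean.size * np.log(2 * np.pi))

def is_generated(x, params, threshold):
    """Flag samples whose likelihood under the real-music model is below the threshold."""
    return log_likelihood(x, *params) < threshold
```

A threshold could be set as a low quantile of the log-likelihoods of held-out real samples, so no generated music is needed at any stage.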

2605.18071 2026-05-19 cs.CL

KVDrive: A Holistic Multi-Tier KV Cache Management System for Long-Context LLM Inference

Jian Lin, Jiazhi Mi, Zicong Hong, Haodong Wang, Qianli Liu, Haodyue Zhang, Peng Li, Song Guo

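Affiliations: Hong Kong University of Science and Technology, China; Xi'an Jiaotong University, China

AI summary: This paper presents KVDrive, a multi-tier KV cache management system for long-context LLM inference that jointly orchestrates cache placement, pipeline scheduling, and cross-tier coordination to deliver high-throughput inference under a limited GPU budget while preserving accuracy.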

Abstract

Supporting long-context LLMs is challenging due to the substantial memory demands of the key-value (KV) cache. Existing offloading systems store the full cache in host memory and selectively fetch critical entries during decoding, but this strategy quickly hits a ceiling: sparsity cannot be pushed further without degrading accuracy. As a result, when context length and batch size grow, the volume of KV transfers rises sharply and becomes the dominant source of decoding latency. We present KVDrive, a holistic multi-tier KV cache management system spanning GPU memory, host DRAM, and SSD. Unlike prior work that pursues greater sparsity through algorithmic refinements, KVDrive tackles the problem from a systems perspective - jointly orchestrating cache placement, pipeline scheduling, and cross-tier coordination to sustain high-throughput inference under tight GPU budgets. KVDrive advances three fundamental capabilities: it adapts cache management to attention behavior to maximize reuse and minimize redundant data movement; it restructures the decoding pipeline to overlap I/O- and CPU/GPU compute-bound stages, eliminating stalls across heterogeneous resources; and it harmonizes data movement across memory tiers to unlock scalable long-context inference far beyond GPU and DRAM limits. We have implemented a fully functional prototype of KVDrive and evaluated it on long-context benchmarks with popular LLMs. The system achieves up to 1.74x higher throughput compared to state-of-the-art works while preserving accuracy.
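
The idea of a cache spanning several memory tiers can be sketched as a toy structure in which entries are promoted to the fastest tier on access and the least recently used entry is demoted one tier when a tier overflows. This is an illustrative LRU policy, not KVDrive's placement algorithm, and the class is hypothetical.

```python
from collections import OrderedDict

class TieredCache:
    """Toy multi-tier cache; tier 0 is fastest and the last tier must be unbounded (None)."""

    def __init__(self, capacities):
        self.capacities = capacities
        self.tiers = [OrderedDict() for _ in capacities]

    def _insert(self, level, key, value):
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        cap = self.capacities[level]
        if cap is not None and len(tier) > cap:
            # Demote the least recently used entry to the next, slower tier.
            old_key, old_value = tier.popitem(last=False)
            self._insert(level + 1, old_key, old_value)

    def put(self, key, value):
        for tier in self.tiers:
            tier.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        """Return (value, tier index where it was found) and promote it to tier 0."""
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self._insert(0, key, value)
                return value, level
        raise KeyError(key)
```

The tier index returned by `get` stands in for the transfer cost a real system would pay when an entry has to be fetched from DRAM or SSD.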

2605.18068 2026-05-19 cs.LG cs.AI

Improving Spatio-Temporal Residual Error Propagation by Mitigating Over-Squashing

Seyed Mohamad Moghadas, Esther Rodrigo Bonet, Bruno Cornelis, Adrian Munteanu

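Affiliations: ETRO Department, Vrije Universiteit Brussel; imec

AI summary: This paper proposes the Teger module, which improves error-correlated autoregressive forecasting through a spatial curvature-aware graph rewiring mechanism and improves the Continuous Ranked Probability Score of spatio-temporal forecasts.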

Abstract

Residual error propagation remains a fundamental problem in recurrent models, where small prediction inaccuracies compound over time and degrade long-horizon performance. Accurately modeling the correlation structure of such residuals is critical for reliable uncertainty quantification in probabilistic multivariate time-series forecasting. While recent time-series deep models efficiently parametrize time-varying contemporaneous correlations, they often assume temporal independence of errors and neglect spatial correlation across the observed network. In this paper, we introduce Teger, a structured uncertainty module that overcomes the spatial and temporal limitations of error-correlated autoregressive forecasting. Teger proposes a spatial curvature-aware graph rewiring mechanism explicitly strengthening information-bottleneck edges identified by discrete Forman curvature. The component is integrated into a low-rank-plus-diagonal covariance head, preserving tractable inference via the Woodbury identity. Teger is backbone-agnostic, requiring only the latent state produced by any autoregressive encoder. We provide theoretical evidence of Teger, and experimentally evaluate it on LSTM, Transformer, and xLSTM backbones across four real-world spatio-temporal datasets, showing consistent improvement in Continuous Ranked Probability Score (CRPS). We further provide a formal theoretical analysis connecting curvature-aware rewiring to (i) oversquashing alleviation, (ii) improved spectral connectivity, (iii) reduced effective resistance, and (iv) improved covariance calibration bounds.
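
The tractability of a low-rank-plus-diagonal covariance comes from the Woodbury identity: for a covariance D + U Uᵀ with diagonal D and a thin factor U, only a small rank-by-rank system has to be solved. The sketch below shows that identity in isolation. The variable names are illustrative and unrelated to Teger's code.

```python
import numpy as np

def woodbury_inverse(diag, factor):
    """Inverse of diag(diag) + factor @ factor.T using the Woodbury identity.

    diag: positive vector of length n; factor: matrix of shape (n, r) with r << n.
    """
    d_inv = 1.0 / diag
    scaled = factor * d_inv[:, None]                      # D^{-1} U
    core = np.eye(factor.shape[1]) + factor.T @ scaled    # I + U^T D^{-1} U
    return np.diag(d_inv) - scaled @ np.linalg.solve(core, scaled.T)
```

Only the r-by-r `core` matrix is factorized, so the cost grows with the rank rather than with the full dimension n.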

2605.18067 2026-05-19 cs.CL

PPAI: Enabling Personalized LLM Agent Interoperability for Collaborative Edge Intelligence

Zile Wang, Qianli Liu, Kaibin Guo, Haodong Wang, Jian Lin, Zicong Hong, Song Guo

Affiliations: Hong Kong RGC General Research Fund; Research Impact Fund; Collaborative Research Fund; NSFC/RGC Collaborative Research Scheme; Areas of Excellence Scheme; InnoHK (HKGAI)

AI summary: This paper proposes PPAI, a system that lets users collaborate based on agent specialization, addresses the challenges of a dynamic agent pool and load balancing, and improves task accuracy while reducing latency.

Abstract

Deploying large language models (LLMs) on edge devices enables personalized LLM agents for various users. The growing availability of diverse personalized agents presents a unique opportunity for peer-to-peer (P2P) collaboration, wherein each user can delegate tasks beyond the local agent's expertise to remote agents more suited for the specific query. This paper introduces PPAI, the first personalized LLM agent interoperability system, which enables users to collaborate with each other based on agent specialization. However, the ever-changing pool of agents and their interchangeable capacity introduce new challenges when it comes to matching queries to agents and balancing loads, compared with existing P2P systems. Therefore, we propose a scalable query-agent pair scoring mechanism based on prototypes to identify suitable agents within a P2P network with churn. Moreover, we propose a multi-agent interoperability Bayesian game to balance local demand and global efficiency, when changes in remote agent load occur too quickly to be observed. Finally, we implement a prototype of PPAI and demonstrate that it substantially broadens the range of tasks that could be carried out while maintaining load balance. On average, it achieves an accuracy improvement of up to 7.96% across multiple tasks, while reducing latency by 16.34% compared to the baseline.
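
Prototype-based query-agent scoring can be sketched as follows, assuming each agent advertises a few prototype embeddings and a query is scored by its best cosine similarity to them. The embedding source, the scoring rule, and the function names are assumptions. The abstract does not specify PPAI's mechanism in this detail, and load balancing is ignored.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_agent(query, prototypes):
    """Score an agent by its prototype closest to the query."""
    return max(cosine(query, p) for p in prototypes)

def route(query, agents):
    """Pick the agent name with the highest score; agents maps name to prototype list."""
    return max(agents, key=lambda name: score_agent(query, agents[name]))
```

Because an agent is summarized by a handful of prototypes, a peer joining or leaving only adds or removes a small dictionary entry.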

2605.18063 2026-05-19 cs.CV cs.LG

The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting

Corentin Dumery, Niki Amini-Naieni, Shervin Naini, Pascal Fua

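Affiliations: EPFL; University of Oxford; Northwestern University

AI summary: This paper introduces the MixCount dataset, which uses an automatic generation pipeline to address the shortage of data for mixed-object scenes in open-vocabulary object counting and demonstrates substantial gains on real-world benchmarks.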

Comments Co-first authors. Dataset and project page https://corentindumery.github.io/projects/mixcount.html

Abstract

Object counting is a foundational vision task with over a decade of dedicated research, yet state-of-the-art models still fail systematically in the mixed-object setting that dominates real-world applications such as industrial inspection and product sorting. We show that this gap is strongly driven by limitations in existing training and evaluation data: real counting datasets are prohibitively expensive to annotate and suffer from labeling noise, while existing synthetic alternatives lack diversity and realism. We address this with MixCount, a dataset and benchmark for mixed-object counting designed to target the failure modes of current counting models. To overcome the high cost of constructing and labeling such data, we develop an automatic generation pipeline that synthesizes images, fine-grained textual descriptions, and pixel-perfect counting annotations at scale, eliminating the labeling ambiguity that plagues prior datasets. Evaluating state-of-the-art counting models on MixCount exposes severe degradation in the mixed-object setting. More importantly, training these models on our synthesized data yields substantial gains on real-world benchmarks, reducing MAE by 20.14% on FSC-147 and by 18.3% on PairTally. These results establish MixCount as both a benchmark and a training dataset for fine-grained counting, and demonstrate that our pipeline, which produces effectively unlimited labeled data, helps address a long-standing bottleneck in counting models.

2605.18060 2026-05-19 cs.CV

Embedded ConvNet Ensembles: A Lightweight Approach to Recognize Arabic Handwritten Characters

Mohsine El Khayati, Rachid Elouahbi, Abdelillah Semma

Affiliations: Systems theory and informatics laboratory; Moulay Ismail University of Meknes; Laboratory of Computer Science and Applications; Computer Science Dept.

AI summary: This paper combines lightweight embedded ConvNets with ensemble learning for Arabic handwritten character recognition, and its experiments show the accuracy advantages of lightweight models and the performance gains from ensembling.

Comments Accepted in the IEEE 15th Image, Video, and Multidimensional Signal Processing Workshop 2026

英文摘要

Arabic Handwritten Character Recognition (AHCR) has recently advanced significantly with deep Convolutional Neural Networks (ConvNets). However, many models in the literature are deep and computationally expensive in terms of parameters and FLOPs, limiting their deployment on resource-constrained devices, which are increasingly common. This study addresses this gap by proposing a combination of lightweight embedded ConvNet models and ensemble learning techniques. Extensive experiments were conducted to identify best practices in AHCR, considering training hyperparameters, learning strategies, model choices, and ensemble methods. Results show that embedded models can achieve accuracy comparable to, or even surpassing, heavier architectures. Ensemble learning further enhances performance with only modest computational overhead, particularly under challenging training scenarios. Among the ensembling strategies, soft voting yielded the best overall results.
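Soft voting, which the abstract reports as the best ensembling strategy, averages the per-class probabilities of the member models and then takes the argmax. The sketch below is a minimal illustration, assuming every model outputs a probability vector over the same classes. The function names and toy probabilities are hypothetical and not taken from the paper. Majority (hard) voting is included only for contrast.

```python
from collections import Counter


def soft_vote(prob_vectors):
    """Average per-class probabilities across models; return (label, averaged probabilities)."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    avg = [sum(p[c] for p in prob_vectors) / n_models for c in range(n_classes)]
    label = max(range(n_classes), key=lambda c: avg[c])
    return label, avg


def hard_vote(prob_vectors):
    """Majority vote over each model's own argmax, shown for comparison."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in prob_vectors]
    return Counter(votes).most_common(1)[0][0]


# Toy example (hypothetical values): one confident model, two mildly disagreeing ones.
probs = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.4, 0.6],
]
soft_label, avg = soft_vote(probs)
hard_label = hard_vote(probs)
```

In this toy case the two rules disagree: the single confident model outweighs the two weak votes under soft voting, which is one plausible reason averaging probabilities can help when ensemble members differ in confidence.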

2605.18059 2026-05-19 cs.RO

Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations

Bench2Drive-Robust: 在部署扰动下闭环自动驾驶的基准测试

Zhiyuan Zhang, Zhenghao Jin, Yanlun Peng, Xianda Guo, Haoran Liu, Shaofeng Zhang, Xingjun Ma, Zuxuan Wu, Junchi Yan, Xiaosong Jia, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI (TEAI)(可信具身人工智能研究院) Great Wall Motor(长城汽车) Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学计算机学院及人工智能学院) School of Computer Science, Wuhan University(武汉大学计算机学院) University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出Bench2Drive-Robust,首个针对闭环端到端自动驾驶在现实部署扰动下的设备中心鲁棒性基准测试,评估了三种主要来源的部署相关扰动对自动驾驶系统的影响,揭示了传统图像级腐蚀评估未能完全捕捉的鲁棒性挑战。

详情
AI中文摘要

鲁棒性是部署自动驾驶系统到现实世界中的关键要求。现有的自动驾驶鲁棒性基准测试在研究图像级腐蚀(如恶劣天气或摄像头退化)对感知模块和开环规划输出的影响方面取得了重要进展。然而,部署还可能涉及系统级缺陷,如推理延迟和自我状态估计误差,这些在闭环端到端自动驾驶评估中仍较少研究。这些缺陷可以通过反馈回路积累并导致控制不稳定。在本文中,我们提出了Bench2Drive-Robust,据我们所知,这是首个针对闭环端到端自动驾驶在现实部署扰动下的设备中心鲁棒性基准测试。我们系统地评估了三种主要来源的部署导向扰动:摄像头流故障(帧丢失、部分观察)、自我状态估计误差(GPS噪声,以及速度或里程误差)和计算导致的控制延迟(模型推理延迟)。我们评估了代表性端到端驾驶方法,并分析它们在不同扰动严重程度下的鲁棒性。我们的结果表明,这些部署相关扰动可以显著降低闭环驾驶性能,揭示了传统图像级腐蚀评估未能完全捕捉的鲁棒性挑战。通过建立闭环评估协议并展示这些部署导向扰动的实质性影响,Bench2Drive-Robust定义了端到端自动驾驶的实用鲁棒性问题,并鼓励进一步研究面向部署的鲁棒驾驶系统。

英文摘要

Robustness is a critical requirement for deploying autonomous driving systems in the real world. Existing robustness benchmarks for autonomous driving have made important progress in studying the effects of image-level corruptions, such as adverse weather or camera degradation, on perception modules and open-loop planning outputs. However, deployment can also involve system-level imperfections, such as inference latency and ego-state estimation errors, which remain less studied in closed-loop end-to-end autonomous driving (E2E-AD) evaluation. These imperfections can accumulate through the feedback loop and destabilize control. In this work, we present Bench2Drive-Robust, to our knowledge the first device-centric robustness benchmark for closed-loop end-to-end autonomous driving under realistic deployment perturbations. We systematically evaluate deployment-oriented perturbations arising from three major sources: camera-stream failures (frame drop, partial observation), ego-state estimation errors (GPS noise, and speed or odometry errors), and compute-induced control delay (model inference delay). We evaluate representative end-to-end driving methods and analyze their robustness under different perturbation severities. Our results show that these deployment-related perturbations can substantially degrade closed-loop driving performance, revealing robustness challenges that are not fully captured by conventional image-level corruption evaluations. By establishing a closed-loop evaluation protocol and demonstrating the substantial impact of these deployment-oriented perturbations, Bench2Drive-Robust defines practical robustness problems for end-to-end autonomous driving and encourages further research on deployment-aware robust driving systems.
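Two of the perturbation sources named above, compute-induced control delay and camera frame drop, can be pictured as thin wrappers around a closed-loop simulation step. The sketch below is only an illustration under stated assumptions: the class and function names are hypothetical, the frame-hold policy for dropped frames is an assumption, and nothing here reproduces the benchmark's actual implementation.

```python
import random
from collections import deque


class DelayedController:
    """Delay every control command by `delay_steps` ticks (hypothetical wrapper).

    Until the buffer fills, the `default` command is applied instead.
    """

    def __init__(self, delay_steps, default=0.0):
        self.buffer = deque([default] * delay_steps)

    def step(self, command):
        # Queue the fresh command and apply the oldest one still waiting.
        self.buffer.append(command)
        return self.buffer.popleft()


def drop_frames(frames, drop_prob, rng):
    """Simulate frame drop by repeating the last delivered frame (assumed frame-hold policy)."""
    delivered, last = [], None
    for frame in frames:
        # The first frame is always delivered so there is something to hold.
        if last is None or rng.random() >= drop_prob:
            last = frame
        delivered.append(last)
    return delivered


controller = DelayedController(delay_steps=2)
applied = [controller.step(cmd) for cmd in [1.0, 2.0, 3.0, 4.0]]

rng = random.Random(0)
all_dropped = drop_frames(["f0", "f1", "f2"], drop_prob=1.0, rng=rng)
none_dropped = drop_frames(["f0", "f1", "f2"], drop_prob=0.0, rng=rng)
```

In a closed loop, the delayed command acts on a state that has already moved on, which is how a perturbation that looks small per step can accumulate through feedback.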

2605.18058 2026-05-19 cs.CV

Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models

阿拉伯手写识别的威胁:调查嵌入式卷积网络模型上的黑盒对抗攻击

Mohsine EL Khayati, Abdelillah Semma, Abdelaziz Courr, Rachid Elouahbi

发表机构 * Systems theory and informatics laboratory(系统理论与信息学实验室) Moulay Ismail University of Meknes(梅克内斯穆莱·伊斯梅尔大学) Department of Computer Science(计算机科学系) EST of Sidi Bennour(西迪本努尔高等技术学院) Chouaib Doukkali University(舒艾卜·杜卡利大学) Faculty of Education Sciences(教育科学学院) University Mohammed V(穆罕默德五世大学) Laboratory of Computer Science and Applications(计算机科学与应用实验室)

AI总结 本研究探讨了阿拉伯手写识别系统对黑盒对抗攻击的脆弱性,通过实验揭示了高精度模型在面对对抗攻击时的易受攻击性,强调了加强模型安全性和可靠性的必要性。

Comments Accepted in the IEEE 15th Image, Video, and Multidimensional Signal Processing Workshop 2026

详情
AI中文摘要

阿拉伯手写识别(AHR)通过深度学习模型取得了显著进展。AHR研究主要关注性能,而安全性却很少受到重视。本研究通过展示高性能模型对对抗黑盒攻击的易受攻击性,提供了一条新的研究方向。研究聚焦于黑盒攻击,反映了现实场景中攻击者对模型架构没有先验知识的情况。在两个包含阿拉伯手写字符的基准AHR数据集上进行了大量实验。结果表明攻击的有效性,其中Pixle攻击在大多数模型上达到了99-100%的攻击成功率。其他较为温和的攻击在大多数实验中达到了50-96%的成功率。尽管攻击成功率较高,但攻击保持了字符的结构完整性,使其在人眼几乎不可察觉。研究结果表明,所研究的模型对对抗操纵具有更高的易受性。这突显了加强这些模型安全性和可靠性以确保其在AHR实际应用中的必要性。

英文摘要

Arabic handwriting recognition (AHR) has made significant progress with deep learning models. AHR research has largely focused on performance, with security receiving little attention. This study provides what appears to be a new line of inquiry by demonstrating the vulnerability of high-performing models to adversarial black-box attacks. The focus on black-box attacks reflects real-world scenarios where the attacker has no prior knowledge of the model architecture. Extensive experiments were conducted on two benchmark AHR datasets containing Arabic handwritten characters. Results demonstrated the effectiveness of the attacks, with the Pixle attack achieving an attack success rate of 99-100% on most models. Other, less aggressive attacks achieved success rates of 50-96% across most experiments. Despite the high attack success rates, the attacks maintain the structural integrity of the characters, rendering the perturbations almost imperceptible to the human eye. The findings indicate that the studied models are highly vulnerable to adversarial manipulation. This underscores the need to strengthen efforts to secure these models and ensure their reliability in real-world AHR applications.

2605.18055 2026-05-19 cs.LG cs.AI

FLAG: Foundation model representation with Latent diffusion Alignment via Graph for spatial gene expression prediction

FLAG: 通过图结构的潜在扩散对齐实现基础模型表示以空间基因表达预测

Qi Si, Penglei Wang, Yushuai Wu, Yifeng Jiao, Xuyang Liu, Xin Guo, Yuan Qi, Yuan Cheng

发表机构 * Shanghai Academy of Artificial Intelligence for Science, Shanghai, China.(上海人工智能科学研究院) School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China.(上海交通大学生物医学工程学院) Incubation Institute, Fudan University, Shanghai, China.(复旦大学孵化院)

AI总结 本文提出FLAG框架,通过图结构的潜在扩散对齐方法,解决空间基因表达预测中的基因协调和空间分布关系问题,并引入基因维度诅咒的概念,通过空间图编码器和基因基础模型对齐来提升模型的结构一致性与基因间保真度。

Comments 9 pages for main text, 3 pages for references, 19 pages for appendix. accepted by ICML 2026

详情
AI中文摘要

从常规的H&E染色预测空间基因表达能够实现大规模分子谱分析,但当前模型将此任务视为孤立的逐点任务,从而忽略了诸如基因协调和空间分布等关键生物结构。为保持这些关系,我们引入FLAG,一种基于扩散的框架,将此任务重新定义为结构化分布建模。同时,我们识别出关键的基因维度诅咒,即联合建模基因表达及其空间相互作用在高维空间中失效,而FLAG通过整合空间图编码器以实现拓扑一致性,并利用基因基础模型(GFM)对齐以在生成过程中保持基因-基因的保真度。为严格评估模型性能,我们提出了一组新的结构评估度量标准,包括基因结构相关性(GSC)和空间结构相关性(SSC)。我们的实验表明,FLAG在传统准确性(PCC/MSE)方面具有高度竞争力,同时在捕捉基因-基因和基因-空间关系时实现了显著增强的结构保真度。代码可在https://github.com/darkflash03/FLAG上获取。

英文摘要

Predicting spatial gene expression from routine H&E enables large-scale molecular profiling, yet current models treat this as isolated pointwise tasks, thereby overlooking essential biological structures like gene coordination and spatial distribution. To preserve these relationships, we introduce FLAG, a diffusion-based framework that redefines this task as structured distribution modeling. At the same time, we identify the critical Gene Dimension Curse, where jointly modeling gene expression and its spatial interactions fails in high-dimensional spaces, and FLAG solves this challenge by integrating a spatial graph encoder for topological consistency and utilizing Gene Foundation Model (GFM) alignment for gene-gene fidelity in the generation process. To rigorously assess model performance, we propose a set of novel structural evaluation metrics, including Gene Structural Correlation (GSC) and Spatial Structural Correlation (SSC). Our experiments demonstrate that FLAG is highly competitive in traditional accuracy (PCC/MSE) while achieving significantly enhanced structural fidelity in capturing both gene-gene and gene-spatial relationships. The code is available at https://github.com/darkflash03/FLAG.

2605.18053 2026-05-19 cs.LG cs.CL cs.CR cs.PF

Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction

保护几乎就是一切:结构保护在全局受限的KV淘汰中占据主导地位

Gabriel Garcia

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了在共享全局受限解码时间Harness下的KV缓存淘汰问题,发现结构保护在保持质量方面起关键作用,通过保留边界缓存恢复了大部分参考天花板质量,并展示了保护机制在不同模型上的有效性。

Comments 38 pages, 6 figures, 25 tables (includes one longtable). Code and figure regeneration scripts: https://github.com/gpgabriel25/KVCacheBoundaryProtection

详情
AI中文摘要

我们研究了在共享全局受限解码时间Harness下的KV缓存淘汰问题。七种策略(LRU、H2O、SnapKV、StreamingLLM、Ada-KV、QUEST、Random)共享一个提示边界漏洞:在没有结构保护的情况下,它们在六个纯Transformer模型上几乎降级到质量为零(F1≤0.064)。保留每个边界10%的缓存可在七个LongBench模型上恢复69-90%的C=2048参考天花板质量(13%保留率);一个十模型面板覆盖68-98%。一个注意力质量试点(Qwen2.5-3B,N=30)表明原因:位置0的sink占据约75%的前缀质量,而其他边界token接近约0.41倍的均匀期望,因此注意力评分器保留sink但仍会丢弃结构关键token。有保护的情况下,简化评分隔离变体在K=32时与LRU TOST等价(Δ=0.02);在K=8时,注意力策略对彼此收敛但仍比LRU高0.011-0.021 F1(在C=256和C=512时)。忠实的Ada-KV/QUEST在Mistral-7B和Phi-3.5上添加约0.03-0.04 F1超过简化变体。在Qwen3-4B上的NIAH-32K领域转移试点(解码vs.预填,C∈{512,2048})显示保护提升几乎相同(比率0.99-1.00)。在64K时,保护有所帮助但恢复有限;忠实的每头评分在Gemma-3-4B上仅在模型已支持强64K检索而不淘汰时才能在6.3%保留率下达到全缓存天花板。总体而言:保护占主导;一旦边界被保护,评分差异变得次要;每头分配进一步带来小幅提升。

英文摘要

We study KV cache eviction under a shared globally capped decode-time harness. Seven policies (LRU, H2O, SnapKV, StreamingLLM, Ada-KV, QUEST, Random) share a prompt-boundary vulnerability: without structural protection, they collapse to near-zero quality on six pure-transformer models (F1 $\leq$ 0.064). Reserving 10% of cache at each boundary recovers 69-90% of the $C = 2048$ reference-ceiling quality on seven LongBench models at $C = 256$ (13% retention); a ten-model panel spans 68-98%. An attention-mass pilot (Qwen2.5-3B, $N = 30$) suggests why: the position-0 sink holds about 75% of prefix mass, while other boundary tokens sit near $0.41\times$ uniform expectation, so attention scorers retain the sink but still drop structurally critical tokens. With protection, simplified score-isolation variants are TOST-equivalent to LRU at $K = 32$ ($\Delta = 0.02$); at $K = 8$, attention policies pairwise converge yet beat LRU by 0.011-0.021 F1 across $C = 256$ and $C = 512$. Faithful Ada-KV/QUEST add about 0.03-0.04 F1 on Mistral-7B and Phi-3.5 beyond simplified variants. A NIAH-32K regime-transfer pilot on Qwen3-4B (decode vs. prefill, $C \in \{512, 2048\}$) shows near-identical protection lifts (ratio 0.99-1.00). At 64K, protection helps but recovery is modest; faithful per-head scoring matches full-cache ceiling on Gemma-3-4B at 6.3% retention only when the model already supports strong 64K retrieval without eviction. Overall: protection dominates; scoring differences are secondary once boundaries are guarded; per-head allocation gives a further modest gain.
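The idea of reserving a share of the cache at each boundary before any scoring policy runs can be sketched as a position-selection routine. This is a simplified illustration, not the paper's harness. It assumes the two boundaries are the start and end of the prompt, that the reserved share is a fraction of the global cap, and that the remaining budget is filled by plain recency (an LRU stand-in where recency equals position). The function name and parameters are hypothetical.

```python
def kept_positions(total_len, prompt_len, cap, protect_frac=0.10):
    """Return the token positions retained under a global cap with boundary protection.

    Assumptions (not from the paper): boundaries are the first and last
    `protect_frac * cap` prompt tokens; everything else competes by recency.
    """
    k = int(cap * protect_frac)
    k = min(k, prompt_len // 2)  # keep the two protected ranges from overlapping
    protected = set(range(k)) | set(range(prompt_len - k, prompt_len))

    # Remaining budget goes to the most recent unprotected positions.
    budget = cap - len(protected)
    unprotected = [p for p in range(total_len) if p not in protected]
    recent = unprotected[-budget:] if budget > 0 else []
    return sorted(protected | set(recent))


# Toy setting: 200 prompt tokens, 100 decoded tokens, cache capped at 50 entries.
kept = kept_positions(total_len=300, prompt_len=200, cap=50)
```

With these toy numbers, positions 0-4 and 195-199 survive regardless of recency, and the other 40 slots go to the latest decoded tokens. Swapping the recency rule for an attention-based scorer would change only how `recent` is chosen, which mirrors the abstract's claim that scoring matters less once boundaries are guarded.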

2605.18052 2026-05-19 cs.CV

Efficient 3D Content Reconstruction and Generation

高效3D内容重建与生成

Jiahao Li

发表机构 * TOYOTA TECHNOLOGICAL INSTITUTE AT CHICAGO(丰田技术研究所芝加哥分校)

AI总结 本文提出了一种高效的3D内容生成和重建方法,通过结合多视图扩散和稀疏视图3D重建,实现了高质量的3D资产生成,并开发了FastMap算法以提高3D重建的速度和精度。

详情
AI中文摘要

自动3D内容创建旨在用能够从文本或图像直接合成或恢复3D资产的系统取代劳动密集型的建模和扫描流程。其应用范围涵盖视频游戏、虚拟现实、机器人技术和模拟,使资产原型设计、多样化的交互世界生成和高效的3D数据收集成为可能。当前解决方案主要遵循两种互补的范式:(i)文本或图像到3D生成,学习3D几何和外观的先验知识,以从自然语言或单视图图像创建新资产;(ii)3D重建,从RGB图像估计相机姿态和几何结构。本论文在两个方向上都取得了进展。在生成方面,我介绍了Instant3D,它结合了多视图扩散和前馈稀疏视图3D重建,可在5-20秒内生成高质量的资产。在重建方面,我开发了FastMap,一种结构从运动流水线,通过使用一阶优化与广泛融合的GPU内核,实现了比现有最先进方法快10倍的速度提升,同时保持了可比的姿态精度和下游新视图合成质量。

英文摘要

Automatic 3D content creation seeks to replace labor-intensive modeling and scanning pipelines with systems that can synthesize or recover 3D assets directly from text or images. Its applications span video games, virtual reality, robotics, and simulation, enabling rapid asset prototyping, diverse interactive world generation, and efficient 3D data collection for training foundation models. Contemporary solutions largely follow two complementary paradigms: (i) text- or image-to-3D generation, which learns priors over 3D geometry and appearance to create novel assets from natural language or a single-view image; and (ii) 3D reconstruction, which estimates camera poses and geometry from RGB images. This thesis advances both directions. On the generation side, I introduce Instant3D, which combines multi-view diffusion with feed-forward sparse-view 3D reconstruction to produce high-quality assets in 5-20 seconds. On the reconstruction side, I develop FastMap, a structure-from-motion pipeline that achieves up to 10x speedup over the prior state of the art by relying extensively on first-order optimization with fused GPU kernels, while maintaining comparable pose accuracy and downstream novel view synthesis quality.

2605.18048 2026-05-19 cs.AI

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

DocOS: 向 GUI 代理中的主动文档引导行动迈进

Jingjing Liu, Ziye Huang, Zihao Cheng, Zeming Liu, Jiahong Wu, Yuhang Guo, Kehai Chen, Yunhong Wang, Haifeng Wang

发表机构 * School of Computer Science and Engineering, Beihang University, Beijing, China(北航计算机科学与工程学院) School of Computer Science, Peking University, Beijing, China(北京大学计算机科学学院) School of Computer Science and Technology, Beijing Institute of Technology, Beijing(北京理工大学计算机科学与技术学院) School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)计算机科学与技术学院) Baidu Inc., Beijing, China(百度公司)

AI总结 本文提出 DocOS 基准,通过引导文档解决长尾任务,解决 GUI 代理在动态开放网络环境中处理长尾任务的能力限制,核心方法是主动文档引导行动,主要贡献是设计了一个评估文档引导问题解决能力的基准。

详情
AI中文摘要

尽管图形用户界面(GUI)代理在自动化设备交互中表现出色,但它们主要依赖于预训练或指令微调的静态参数知识。这种依赖从根本上限制了它们处理需要显式过程知识的长尾任务的能力,通常迫使代理采用低效且易碎的试错探索。为缓解这一限制,我们引入了面向 GUI 代理的主动文档引导行动,这是一种新的范式,通过使代理能够自主搜索相关文档来解决长尾任务,从而模仿人类问题解决方式。为了评估代理在此范式中的能力,我们提出了 DocOS,一个基准,用于评估在完全交互环境中文档引导的问题解决能力。DocOS 要求代理自主导航网络浏览器,定位相关在线文档,理解操作步骤,并将这些步骤准确地转化为可执行的 GUI 操作。广泛的实验表明,进展受到双重瓶颈的限制:代理在主动搜索中难以可靠地定位相关信息,并且频繁失败将检索到的指令准确地转化为精确的操作,这表明文档引导交互是使 GUI 代理在动态环境中自我演化的关键路径。

英文摘要

While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.

2605.18045 2026-05-19 cs.RO cs.AI

Confidence-Gated Robot Autonomy: When Does Uncertainty Actually Help?

置信度门控机器人自主性:不确定性何时真的有帮助?

Johannes A. Gaus, Jhon P. F. Charaja, Daniel Haeufle

发表机构 * Hertie Institute for Clinical Brain Research & Center for Integrative Neuroscience, University of Tübingen(赫尔特研究所临床脑研究与整合神经科学中心,图宾根大学)

AI总结 本文研究了不确定性在机器人自主性决策中的作用,发现当基础模型具备一定能力时,简单的不确定性代理足以实现选择性门控,但无法用于语义新颖性检测。

Comments ICRA 2026 workshop paper

详情
AI中文摘要

机器人系统常常使用预测不确定性来决定是否自主行动还是退回到备用策略。在阈值门控自主性中,不确定性主要通过其对可能错误的排序能力起作用。标准指标如预期校准误差和AUROC并不能直接测试不确定性是否改变行动/退避决策。因此,我们通过斯皮尔曼等级相关性、配对bootstrap等价检验和行动/退避一致率来评估不确定性。在三个时间活动识别基准上,我们发现存在一个数据集依赖的胜任区域,在此之下不确定性只能提供弱且不稳定的错误排序。在此之上,softmax启发式方法、MC Dropout和集成模型产生相似的门控行为,而阈值选择对执行结果影响更大。一个多种子具身模拟显示,一旦实现自主性,碰撞率和成本也呈现出相同模式。在时间协变量转移下,排序质量保持稳定,但细粒度语义OOD检测仍接近随机。这些结果表明,一旦基础模型具备一定能力,简单的不确定性代理足以实现选择性门控,但无法用于语义新颖性检测。

英文摘要

Robotic systems often use predictive uncertainty to decide whether to act autonomously or defer to a fallback policy. In threshold-gated autonomy, uncertainty matters mainly through its ability to rank likely errors. Standard metrics such as expected calibration error and AUROC do not directly test whether uncertainty changes act/defer decisions. We therefore evaluate uncertainty using Spearman rank correlation, paired bootstrap equivalence testing, and act/defer agreement. Across three temporal activity-recognition benchmarks, we find a dataset-dependent competence regime below which uncertainty provides a weak and unstable error ranking. Above this regime, softmax heuristics, MC Dropout, and ensembles produce similar gating behavior, while threshold choice has a much larger effect on execution outcomes. A multi-seed embodied simulation shows the same pattern for collision rate and cost once realized autonomy is matched. Under temporal covariate shift, ranking quality remains stable, but fine-grained semantic OOD detection remains near chance. These results suggest that simple uncertainty proxies can suffice for selective gating once the base model is competent, but not for semantic novelty detection.
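Two of the decision-level measures named above, Spearman rank correlation and act/defer agreement, are easy to compute for a pair of uncertainty estimators. The sketch below is illustrative only: the scores are made-up numbers, the Spearman formula used assumes no tied values, and the function names are hypothetical rather than the authors' evaluation code.

```python
def ranks(values):
    """Rank positions (0 = smallest). Assumes no ties, which keeps the formula below exact."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def act_defer(uncertainty, threshold):
    """Threshold gate: act (True) when uncertainty is at or below the threshold, else defer."""
    return [u <= threshold for u in uncertainty]


def agreement(decisions_a, decisions_b):
    """Fraction of inputs on which two gates make the same act/defer decision."""
    same = sum(1 for a, b in zip(decisions_a, decisions_b) if a == b)
    return same / len(decisions_a)


# Hypothetical uncertainty scores from two estimators on the same five inputs.
u_softmax = [0.10, 0.80, 0.30, 0.90, 0.20]
u_ensemble = [0.15, 0.70, 0.35, 0.95, 0.05]

rho = spearman(u_softmax, u_ensemble)
agree = agreement(act_defer(u_softmax, 0.5), act_defer(u_ensemble, 0.5))
```

Here the two estimators rank inputs slightly differently yet gate identically at the chosen threshold, which is the kind of situation where a cheaper proxy would be interchangeable for gating purposes.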

2605.18041 2026-05-19 cs.CV

OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models

OmniSelect: 动态模态感知的令牌压缩用于高效多模态大语言模型

Morunliu Yang, Ruotao Xu, Le Li, Yue Wang, Jianxin Zhang, Juntao Li, Yihang Lou, Siwei Feng, Peifeng Li

发表机构 * Soochow University(苏州大学) Peking University(北京大学)

AI总结 本文提出OmniSelect,一种无需训练的模态自适应令牌剪枝框架,通过动态选择压缩策略来提高多模态大语言模型的效率,通过轻量级AudioCLIP模型估计跨模态相关性,并根据相关性得分在不同时间组中进行细粒度令牌剪枝,从而在不增加训练成本的情况下实现高效的多模态令牌压缩。

详情
AI中文摘要

Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose OmniSelect, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.

英文摘要

Omnimodal large language models (OmniLLMs) have recently gained increasing attention for unified audio-video understanding. However, processing long multimodal token sequences introduces substantial computational overhead, making efficient token compression crucial. Existing methods typically rely on fixed, modality-specific guidance, which fails to account for the varying importance of modalities across different queries. To address this limitation, we propose OmniSelect, a training-free, modality-adaptive token pruning framework that dynamically selects appropriate compression strategies for multimodal inputs. Specifically, we leverage a lightweight AudioCLIP model to estimate cross-modal relevance and categorize each input into three pruning regimes: Audio-Centric, Video-Centric, and Uniform pruning. Based on these relevance scores, OmniSelect further performs fine-grained token pruning within each temporal group, adaptively allocating pruning ratios to preserve informative tokens across modalities. By explicitly modeling modality preference and enabling dynamic strategy selection, OmniSelect effectively avoids the pitfalls of one-size-fits-all compression. Extensive experiments demonstrate that our method achieves efficient multimodal token reduction while maintaining strong performance, without requiring any additional training.

2605.18039 2026-05-19 cs.CV

SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals

SGSoft: 通过模板引导的软信号学习融合语义-几何特征以实现3D形状对应

Soyeon Yoon, Chang Wook Seo, Hyunjung Shim

发表机构 * KAIST AI(韩国科学技术院人工智能研究所) Anigma Technologies(Anigma科技公司)

AI总结 本文提出SGSoft方法,通过模板引导的软信号学习融合语义-几何特征,实现3D形状对应,解决了结构变化、非等距变形和拓扑不一致的挑战,实现了最先进的跨类别泛化和最佳精度-效率权衡。

详情
AI中文摘要

学习变形3D形状之间的密集对应关系仍是一个长期挑战,由于结构变化、非等距变形和不一致拓扑。现有方法通常在通用性、几何保真度和效率之间进行权衡。我们通过提出SGSoft,一个统一的内在流程,解决这个问题:(i) 在标准模板上构建测地线对应场;(ii) 学习由预训练语义先验引导的多模态密集描述符,利用该测地线对应场监督;(iii) 通过描述符空间的最近邻搜索在单次前向传递中检索密集对应关系。这种公式在大姿态变化、结构差异和重新网格化下实现了稳定且拓扑不变的监督。SGSoft在跨类别泛化方面达到最先进的水平,同时在先前方法中提供了最佳的精度-效率权衡。它还实现了近实时推断,无需预对齐、成对优化或后处理。学习的描述符可以有效地转移到下游任务,如语义分割和变形转移,建立了一种可扩展且可部署的密集3D对应范式。

英文摘要

Learning dense correspondences across deformable 3D shapes remains a long-standing challenge due to structural variability, non-isometric deformation, and inconsistent topology. Existing methods typically trade off generalization, geometric fidelity, and efficiency. We address this by proposing SGSoft, a unified intrinsic pipeline that (i) constructs a geodesic correspondence field on a canonical template, (ii) learns multimodal dense descriptors guided by pretrained semantic priors under supervision from this geodesic correspondence field, and (iii) retrieves dense correspondences in a single feed-forward pass via nearest-neighbor search in descriptor space. This formulation enables stable and topology-invariant supervision under large pose variation, structural differences, and remeshing. SGSoft achieves state-of-the-art inter-category generalization while offering the best accuracy-efficiency trade-off relative to prior methods. It also achieves near real-time inference without pre-alignment, pairwise optimization, or post-refinement. Learned descriptors can be transferred effectively to downstream tasks such as semantic segmentation and deformation transfer, establishing a scalable and deployment-ready paradigm for dense 3D correspondence.
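The retrieval step (iii), nearest-neighbor search in descriptor space, can be sketched with a brute-force search. This is a minimal illustration assuming per-vertex descriptors are plain vectors compared by Euclidean distance. The descriptor values and the function name are hypothetical, and a practical system would likely use a vectorized or approximate search rather than this quadratic loop.

```python
import math


def nearest_neighbor_matches(source_desc, target_desc):
    """For each source descriptor, return the index of the closest target descriptor."""
    matches = []
    for s in source_desc:
        best = min(range(len(target_desc)), key=lambda j: math.dist(s, target_desc[j]))
        matches.append(best)
    return matches


# Toy 2-D descriptors for three vertices on each shape (hypothetical values).
source = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.9]]
target = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]

matches = nearest_neighbor_matches(source, target)
```

Because matching happens purely in descriptor space, no pairwise alignment or optimization between the two shapes is needed at inference time, which fits the abstract's description of a single feed-forward pass.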

2605.18038 2026-05-19 cs.CV

Patch Ensembles for Robust Salmon Re-Identification with Weak Trajectory Labels

基于补丁的鲁棒性鲑鱼重识别方法:使用弱轨迹标签

Espen Uri Høgstedt, Christian Schellewald, Annette Stahl, Rudolf Mester

发表机构 * Department of Computer Science, Norwegian University of Science and Technology(挪威科技大学计算机科学系) SINTEF Ocean(SINTEF海洋研究中心) Department of Engineering Cybernetics, Norwegian University of Science and Technology(挪威科技大学工程控制论系)

AI总结 本文提出了一种基于补丁的重识别框架,通过融合补丁级预测来决定鲑鱼身份,利用侧线预测提取纹理锚定的补丁和补丁切片,通过多摄像头实验设置构建跨摄像头测试集,实验证明该方法在同轨迹验证和跨摄像头测试中均优于全图像基线,展示了更好的泛化能力和鲁棒性。

Comments Accepted to the 2026 IEEE International Conference on Image Processing (ICIP)

详情
AI中文摘要

在商业网箱中,鲑鱼重识别具有挑战性,因为种群数量大,这要求严格准确性并使大规模标记数据获取不可行。轨迹ID可以作为代理标签,但会引入轨迹ID偏差。为了解决这些挑战,我们提出了一种基于补丁的重识别框架,将补丁级预测融合到鲑鱼身份决策中。一个关键组件是预测鲑鱼的侧线,从而提取纹理锚定的补丁和补丁切片。为了实现真实的评估,我们引入了一个实验设置,使用多个相距6米的摄像头,允许同一鱼在不同轨迹中被记录。这使得通过手动匹配确认构建跨摄像头测试集成为可能。我们的集成方法在同轨迹验证中(0.932到0.965 mAP)和跨摄像头测试中(0.609到0.860 mAP)均优于全图像基线。跨摄像头设置的显著改进证明了改进的通用性和鲁棒性。代码和数据:https://github.com/espenbh/salmon-reid-patch-ensemble。

英文摘要

Salmon re-identification in commercial net-pens is challenging due to large populations, which impose strict accuracy requirements and make large-scale labeled data acquisition infeasible. Trajectory IDs can be used as proxy labels, but this introduces trajectory-ID bias. To address these challenges, we propose a patch-based re-identification framework that fuses patch-level predictions into a salmon identity decision. A key component is the prediction of the salmon's lateral line, enabling extraction of texture-anchored patches and patch slices. To enable realistic evaluation, we introduce an experimental setup using multiple cameras placed 6 m apart, allowing the same fish to be recorded in different trajectories. This enables the construction of a cross-camera test set through manual match confirmation. Our ensemble approach outperforms the full-image baseline in same-trajectory validation (0.932 to 0.965 mAP) and cross-camera testing (0.609 to 0.860 mAP). The substantial improvements in the cross-camera setting demonstrate improved generalizability and robustness. Code and data: https://github.com/espenbh/salmon-reid-patch-ensemble.

2605.18035 2026-05-19 cs.AI cs.LG

New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

零阶硬阈值化中方差减少的新见解:缓解梯度误差和扩张性矛盾

Xinzhe Yuan, William de Vazelhes, Bin Gu, Huan Xiong

发表机构 * IASM, Harbin Institute of Technology(哈尔滨工业大学人工智能研究所,哈尔滨工业大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) School of Artificial Intelligence, Jilin University(吉林大学人工智能学院)

AI总结 本文提出了一种通用的方差减少零阶硬阈值化算法,通过考虑方差的作用,缓解零阶梯度与硬阈值操作之间的冲突,从而消除对随机方向数量的限制,提高收敛速度和应用范围。

Comments Published as a conference paper at ICLR 2024. 9 pages main paper, 24 pages appendix, 11 figures, 7 tables. Correspondence to Bin Gu and Huan Xiong

Journal ref International Conference on Learning Representations (ICLR), 2024

详情
AI中文摘要

硬阈值化是机器学习中用于解决ℓ0约束优化问题的重要算法类型。然而,在某些情况下,目标函数的真实梯度可能难以获取,通常可以通过零阶(ZO)方法进行近似。到目前为止,SZOHT算法是唯一能够处理ℓ0稀疏性约束的ZO梯度算法。不幸的是,由于零阶梯度的偏差与硬阈值操作的扩张性之间存在固有的矛盾,SZOHT在ZO梯度的随机方向数量上存在明显的限制。本文通过考虑方差的作用,提供了一种新的方差减少见解:缓解零阶梯度与硬阈值操作之间的独特矛盾。在此视角下,我们提出了一种通用的方差减少零阶硬阈值化算法以及在标准假设下的通用收敛性分析。理论结果表明,新算法消除了对随机方向数量的限制,相较于SZOHT,具有改进的收敛速度和更广泛的应用范围。最后,我们通过岭回归问题以及黑盒对抗攻击问题展示了本方法的实用性。

英文摘要

Hard-thresholding is an important type of algorithm in machine learning that is used to solve $\ell_0$ constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios, in which case it is normally approximated by zeroth-order (ZO) methods. The SZOHT algorithm is the only algorithm tackling $\ell_0$ sparsity constraints with ZO gradients so far. Unfortunately, SZOHT has a notable limitation on the number of random directions in ZO gradients due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches this problem by considering the role of variance and provides a new insight into variance reduction: mitigating the unique conflicts between ZO gradients and hard-thresholding. Under this perspective, we propose a generalized variance reduced ZO hard-thresholding algorithm as well as the generalized convergence analysis under standard assumptions. The theoretical results demonstrate that the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a ridge regression problem as well as black-box adversarial attacks.
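The two ingredients discussed above, a zeroth-order gradient estimate built from random directions and the hard-thresholding operator that keeps the $k$ largest-magnitude entries, can be combined in a few lines. The sketch below is a generic illustration and deliberately does not implement the paper's variance-reduced estimator or SZOHT's exact sampling scheme. The uniform unit-sphere directions, the toy quadratic loss, the step size, and all function names are assumptions.

```python
import math
import random


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def zo_gradient(f, x, num_directions, mu, rng):
    """Estimate the gradient of f at x from function values only.

    Averages d * (f(x + mu*u) - f(x)) / mu * u over random unit directions u.
    """
    d = len(x)
    fx = f(x)
    grad = [0.0] * d
    for _ in range(num_directions):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in u))
        u = [c / norm for c in u]
        slope = (f([xi + mu * ui for xi, ui in zip(x, u)]) - fx) / mu
        for i in range(d):
            grad[i] += d * slope * u[i] / num_directions
    return grad


def zo_hard_threshold_step(f, x, k, lr, num_directions, mu, rng):
    """One iteration: ZO gradient step followed by projection onto k-sparse vectors."""
    g = zo_gradient(f, x, num_directions, mu, rng)
    return hard_threshold([xi - lr * gi for xi, gi in zip(x, g)], k)


def loss(x):
    """Toy quadratic with a 2-sparse minimizer (hypothetical objective)."""
    target = [3.0, 0.0, -2.0, 0.0]
    return sum((a - b) ** 2 for a, b in zip(x, target))


rng = random.Random(0)
x = [0.0] * 4
for _ in range(200):
    x = zo_hard_threshold_step(loss, x, k=2, lr=0.05, num_directions=8, mu=1e-4, rng=rng)
```

The `num_directions` argument is the quantity the abstract says SZOHT must restrict. Noise in the estimate can move mass onto the wrong coordinates before thresholding, which gives some intuition for why reducing variance eases the conflict with the operator's expansivity.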

2605.18032 2026-05-19 cs.CL cs.AI cs.HC cs.SE

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA:多智能体大语言模型工作流的离线评估与迭代优化

Kazuki Kawamura, Satoshi Waki, Kei Tateno

发表机构 * Sony Group Corporation(索尼集团公司)

AI总结 本文提出PROTEA,一种用于多智能体大语言模型工作流的离线评估和迭代优化接口,通过配置评分标准和可视化工作流图中的节点状态,帮助开发者定位瓶颈并改进工作流性能。

Comments 9 pages, 3 figures, 1 table. To appear in Proceedings of ACL 2026 System Demonstrations

详情
AI中文摘要

多智能体大语言模型工作流——由多个角色特定的LLM调用组成——通常优于单提示基线,但调试和优化仍然困难。失败可能源于中间输出的细微错误,这些错误会传播到下游节点,要求开发者检查长轨迹并推断应修改哪个代理。我们提出了PROTEA,一个统一的接口,用于离线、测试驱动的多智能体工作流改进。PROTEA执行工作流,用可配置的评分标准评分中间节点输出,并在工作流图上叠加每个节点的状态和理由,以定位可能的瓶颈。为了支持复杂系统,其中最终答案参考是主要监督,PROTEA执行反向节点评估:它从最终答案参考和图上下文生成候选节点级期望,然后将它们与观察到的节点输出进行比较。对于选定的节点,PROTEA以可编辑的前后比较形式呈现目标提示修订,然后自动重新运行并重新评估工作流,以显示输出变化和评分轨迹。在两个生产相关的工作流中,PROTEA将文档检查准确性从64.3%提高到83.9%,推荐Hit@5从0.30提高到0.38。在与六名经验丰富的LLM开发者进行的形成研究中,参与者重视图层面的定位、节点级别的理由以及可编辑的前后提示修订。

英文摘要

Multi-agent LLM workflows -- systems composed of multiple role-specific LLM calls -- often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re-evaluates the workflow to show output changes and score trajectories within the same interface. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.

2605.18029 2026-05-19 cs.CV

What Matters for Grocery Product Retrieval with Open Source Vision Language Models

在开源视觉语言模型中,什么因素影响杂货产品检索

Emmanuel G. Maminta, Rowel O. Atienza

发表机构 * AI Graduate Program, University of the Philippines, Diliman, Quezon City(菲律宾大学迪利曼分校人工智能研究生项目) EEEI, University of the Philippines, Diliman, Quezon City(菲律宾大学迪利曼分校电气与电子工程学院)

AI总结 本文研究了开源视觉语言模型在杂货产品检索任务中的表现,发现数据质量比规模更重要,高效模型可以胜出,并且Recall@5与Recall@1之间仍存在精度差距。

Comments Accepted in the 28th International Conference on Pattern Recognition (ICPR 2026)

详情
AI中文摘要

多模态产品检索(MPR)是无结账零售和自动化库存系统的基础,但需要细粒度SKU区分,而标准视觉语言基准无法捕捉这一点。我们首次在GroceryVision挑战赛的MPR任务上对190个开源VLM进行了系统的零样本评估,分离了预训练数据、架构和输入分辨率的影响。我们的分析得出三个可操作的发现。(1)数据质量优于规模。从原始网络爬取数据切换到过滤数据集可获得高达16.6%的准确率提升,超过将模型参数翻倍带来的收益。(2)高效模型可以胜出。MobileCLIP-B(150M参数)优于在噪声数据上训练的351M模型。我们引入了效率度量标准“语义功率密度”(ϕ),该指标惩罚低于阈值的准确率。(3)精度差距依然存在。最先进模型在Recall@5上达到94.5%,但在Recall@1上下降17.5%,表明对比嵌入能有效地对类别进行聚类,但无法对视觉相似的SKU进行排序。代码和评估脚本可在https://github.com/upeee/openmpr获取。

英文摘要

Multimodal product retrieval (MPR) underpins checkout-free retail and automated inventory systems, yet it demands fine-grained SKU discrimination that standard vision-language benchmarks fail to capture. We present the first systematic zero-shot evaluation of 190 open-source VLMs on the MPR task of the GroceryVision Challenge, isolating pre-training data, architecture, and input resolution. Our analysis yields three actionable findings. (1) Data quality trumps scale. Switching from raw web-scrapes to filtered datasets delivers up to 16.6% accuracy gains, exceeding the benefit of doubling model parameters. (2) Efficient models can win. MobileCLIP-B (150M parameters) outperforms 351M counterparts trained on noisy data. We introduce semantic power density ($ϕ$), an efficiency metric that penalizes sub-threshold accuracy. (3) A precision gap persists. State-of-the-art models achieve 94.5% Recall@5 but suffer a 17.5% drop at Recall@1, revealing that contrastive embeddings cluster categories effectively but fail to rank visually similar SKUs. Code and evaluation scripts are available at https://github.com/upeee/openmpr.
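The gap between Recall@5 and Recall@1 is easier to read with the metric written out. The sketch below assumes one correct SKU per query and a ranked candidate list per query. The SKU identifiers and the function name are hypothetical, and the toy data is unrelated to the GroceryVision results.

```python
def recall_at_k(ranked_lists, truths, k):
    """Fraction of queries whose ground-truth item appears among the top-k candidates."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, truths) if truth in ranked[:k])
    return hits / len(truths)


# Toy retrieval output for four queries (hypothetical SKU ids).
rankings = [
    ["sku_a", "sku_b", "sku_c"],
    ["sku_d", "sku_e", "sku_f"],
    ["sku_g", "sku_h", "sku_i"],
    ["sku_j", "sku_k", "sku_l"],
]
truths = ["sku_a", "sku_e", "sku_i", "sku_z"]

r1 = recall_at_k(rankings, truths, 1)
r3 = recall_at_k(rankings, truths, 3)
```

In this toy case the correct SKU is usually somewhere in the shortlist but rarely ranked first, the same pattern the abstract describes: embeddings land in the right neighborhood yet struggle to order near-identical products.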

2605.18028 2026-05-19 cs.LG cs.AI

FedSDR: Federated Self-Distillation with Rectification

FedSDR: 带校正的联邦自我蒸馏

Ziheng Ren, Zhanming Shen, Hao Wang, Ning Liu, You Song

Affiliations * Beijing University of Aeronautics and Astronautics; Zhejiang University; Shandong University; Stevens Institute of Technology

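AI summary This paper proposes FedSDR, an improved federated self-distillation method that adds a dual-stream mechanism to address data distribution mismatch and hallucination in federated learning, improving model accuracy and consistency.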

Comments Accepted by ICML 2026

AI Chinese abstract

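大语言模型的联邦微调面临严重的统计异质性。然而,现有的模型级防御方法往往忽视了根本原因:内在的数据分布不匹配。在本文中,我们首先确立联邦自我蒸馏(FedSD)是一种基础而有力的策略。通过将客户端表示投影到一个平滑的“模型理解空间”,FedSD本身即可作为通用的增强手段,性能优于传统算法。尽管效果显著,我们发现了一种微妙的权衡,称之为“重写悖论”:不受约束的自我蒸馏可能无意中增加幻觉和冗余。为完善这一范式,我们进一步提出FedSDR(带校正的联邦自我蒸馏)这一增强框架。它为FedSD增加了双流机制:本地的LoRA-S(平滑)分支通过蒸馏数据隐式吸收异质性,并行的全局LoRA-R(校正)分支则锚定于原始数据以保证事实正确性。通过仅选择性地聚合LoRA-R,FedSDR得到全局对齐且忠实的模型。大量实验验证了其优越性能。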

English abstract

Federated fine-tuning of Large Language Models faces severe statistical heterogeneity. However, existing model-level defenses often overlook the root cause: intrinsic data distribution mismatches. In this work, we first establish Federated Self-Distillation (FedSD) as a fundamental and potent strategy. By projecting client representations into a smoothed "model-understanding space," FedSD alone serves as a universal booster, demonstrating superior performance over conventional algorithms. Despite its success, we identify a subtle trade-off termed the Rewrite Paradox: unconstrained self-distillation can inadvertently increase hallucinations and redundancy. To refine this paradigm, we further propose FedSDR (Federated Self-Distillation with Rectification), the ultimate reinforced framework. It augments FedSD with a dual-stream mechanism: a local LoRA-S (Smoothing) branch to implicitly absorb heterogeneity via distilled data, and a parallel global LoRA-R (Rectification) branch anchored to raw data to enforce factual correctness. By selectively aggregating only LoRA-R, FedSDR yields a globally aligned and faithful model. Extensive experiments verify its superior performance.
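The selective aggregation step mentioned above, where only the LoRA-R branch is shared while LoRA-S stays local, can be sketched roughly as follows. The abstract does not specify the aggregation rule, so the FedAvg-style weighted average used here is an assumption, as are the function name `aggregate_rectification`, the `lora_r.` / `lora_s.` key prefixes, and the toy parameter values.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def aggregate_rectification(client_params: List[Params],
                            weights: List[float],
                            prefix: str = "lora_r.") -> Params:
    """Weighted average of the parameters whose names start with `prefix`.

    Parameters without the prefix (for example a hypothetical `lora_s.*`
    smoothing branch) are left out of the result, so they never leave
    the client. The weighted mean is an assumed, FedAvg-style rule.
    """
    if not client_params or len(client_params) != len(weights):
        raise ValueError("need one weight per client")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    shared_keys = [k for k in client_params[0] if k.startswith(prefix)]
    aggregated: Params = {}
    for key in shared_keys:
        size = len(client_params[0][key])
        aggregated[key] = [
            sum(w * p[key][i] for p, w in zip(client_params, weights)) / total
            for i in range(size)
        ]
    return aggregated


# Two hypothetical clients; weights could be local dataset sizes.
clients = [
    {"lora_r.A": [1.0, 2.0], "lora_s.A": [10.0, 10.0]},
    {"lora_r.A": [3.0, 6.0], "lora_s.A": [-5.0, 0.0]},
]
global_update = aggregate_rectification(clients, weights=[1.0, 3.0])
print(global_update)
```

Only the `lora_r.A` entry appears in the aggregated result; each client keeps its own `lora_s.A` values, which mirrors the local/global split of the two branches described in the abstract.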

2605.18026 2026-05-19 cs.RO

Scenario Generation in Roundabouts with Adjustable Interaction Intensity

交互强度可调的环形交叉口场景生成

Li Li, Till Temmen, Tobias Brinkmann, Björn Krautwig, Markus Eisenbarth, Jakob Andert

Affiliations * Chair of Mechatronics in Mobile Propulsion

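AI summary This paper proposes a roundabout scenario generator with adjustable interaction intensity. It decouples geometric routes from temporal progress profiles, maps them to latent codes with pretrained autoencoders, and generates scenarios with a Wasserstein generative adversarial network, improving timing-latent fidelity and the plausibility of interaction responses and making safety testing more controllable and scalable.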

AI Chinese abstract

环形交叉口以频繁的汇入和让行交互为特征,仍然是智能驾驶功能开发和测试中的安全关键边缘场景。然而,从自然驾驶数据中提取足够的近临界场景效率很低。大多数现有场景生成方法对交互强度和临界性的可控性有限,使得系统化的安全测试和详细分析难以进行。本文提出了一种交互感知的环形交叉口场景生成器,其交互强度连续可调。首先,将几何路线和时间进度曲线解耦,并使用预训练的自编码器映射为潜在编码。然后,使用Wasserstein生成对抗网络(WGAN)进行条件潜在生成,以生成场景。让行被建模为一种可控的时序干预,在接近入口的路段通过紧凑的让行编码实现,交互强度则通过用因子λ缩放该编码来调节。结果表明,与基线模型相比,该方法提高了时序潜在表示的保真度,并产生了合理的交互响应。在经临界性校准的缩放下,增大λ会扩大安全裕度,提供了一种可扩展且可控的测试机制。

English abstract

Roundabouts, characterized by frequent merging and yielding interactions, remain a safety-critical corner case for the development and testing of intelligent driving functions. However, extracting sufficient near-critical scenarios from naturalistic data is inefficient. Most existing scenario generation methods provide limited controllability over interaction intensity and criticality, making systematic safety testing and detailed analysis difficult. This paper presents an interaction-aware roundabout scenario generator with continuously adjustable interaction intensity. Geometric routes and temporal progress profiles are first decoupled and mapped to latent codes using pretrained autoencoders. Conditional latent generation is then performed with Wasserstein Generative Adversarial Networks (WGAN) to generate scenarios. Yielding is modeled as a controllable timing intervention via a compact yield code during the approach-to-entry segment, where interaction intensity is modulated by scaling the code with a factor λ. Results demonstrate enhanced timing-latent fidelity and plausible interaction responses compared to a baseline model. Under criticality-calibrated scaling, increasing λ expands the safety margin, providing a scalable and controlled testing mechanism.

2605.18025 2026-05-19 cs.AI

TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

TeleCom-Bench: 大型语言模型距离工业电信应用还有多远?

Jieting Xiao, Yun Lin, Huizhen Qiu, Rui Ma, Chen Zhong, Dongyang Xu, Xiao Long, Chaoyu Zhang, Qiaobo Hao, Ding Zou, Zhiguo Yang, Yanqin Gao, Fang Tan

Affiliations * ZTE Corporation

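AI summary This paper presents TeleCom-Bench, a comprehensive benchmark of 12 evaluation sets and 22,678 curated samples that evaluates the overall capabilities of large language models in the telecommunications domain and reveals a gap in their ability to execute industrial workflows.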

Comments Accepted by KDD 2026

AI Chinese abstract

尽管大型语言模型(LLM)已在各类垂直场景中得到显著应用,但由于缺乏标准化的评估框架,其在电信领域的部署仍处于探索阶段。现有的电信基准主要关注静态的基础知识和孤立的原子技能,忽略了对实际生产系统至关重要的设备专属文档和端到端工业工作流。为弥补这一差距,我们提出了TeleCom-Bench,一个包含12个评估集和22,678个精选样本的综合基准,从两个相互协同的层次评估LLM:(1)多维知识理解,通过知识图谱驱动的合成,将电信基础知识、3GPP协议和5G网络架构与覆盖有线、核心和无线网络的专有产品知识相结合;(2)端到端知识应用,基于现网智能体工作流中的真实轨迹,形式化定义了六项核心任务,包括意图识别、实体提取、事件验证、工具调用、根因分析和解决方案生成,涵盖网络优化和故障维护场景。对八个最先进LLM的评估揭示了一道普遍存在的“执行墙”:模型在意图识别和实体提取等语言接口任务上达到90%的准确率,但在解决方案生成等流程执行任务上的性能骤降至约30%。这一能力差距表明,当前的LLM可以胜任诊断者的角色,却无法胜任现场工程师的角色。TeleCom-Bench提供标准化的诊断,可精确定位这一缺陷,为面向生产可用电信智能体的领域对齐提供可操作的指导。数据集和评估代码已发布在 https://github.com/ZTE-AICloud/TeleCom-Bench 。

English abstract

While Large Language Models have achieved remarkable integration in various vertical scenarios, their deployment in the telecommunications domain remains exploratory due to the lack of a standardized evaluation framework. Current telecom benchmarks primarily focus on static, foundational knowledge and isolated atomic skills, neglecting the equipment-specific documentation and end-to-end industrial workflows essential for real-world production systems. To bridge this gap, we present TeleCom-Bench, a comprehensive benchmark comprising 12 evaluation sets with 22,678 curated samples, which evaluates LLMs across a synergistic hierarchy: (1) Multi-dimensional Knowledge Comprehension, which integrates telecommunication fundamentals, 3GPP protocols, and 5G network architecture with proprietary product knowledge across wired, core, and wireless networks via knowledge graph-driven synthesis; and (2) End-to-End Knowledge Application, which formalizes six core tasks on authentic trajectories from live network agent workflows, including intent recognition, entity extraction, event verification, tool invocation, root cause analysis, and solution generation, across network optimization and fault maintenance scenarios. Evaluations of eight state-of-the-art LLMs reveal a universal Execution Wall: while models achieve 90% accuracy in linguistic interface tasks such as intent recognition and entity extraction, performance collapses to approximately 30% in procedural execution tasks like solution generation. This capability gap demonstrates that current LLMs function competently as diagnosticians but fail as field engineers. TeleCom-Bench provides standardized diagnostics to precisely pinpoint this deficit, offering actionable guidance for domain-specific alignment toward production-ready telecom agents. The dataset and evaluation code have been released at https://github.com/ZTE-AICloud/TeleCom-Bench.