arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
2606.04767 2026-06-04 cs.LG cs.CV

Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms


Chong Zhang, Xiang Li, Jia Wang, Qiufeng Wang, Xiaobo Jin

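Affiliations: University of Science and Technology of China

AI summary: Proposes an attack-agnostic robustness metric based on the spectral norm of the Fisher Information Matrix, derives closed-form spectral bounds for common architectures, and develops efficient estimation algorithms; experiments show a strong correlation between the metric and adversarial vulnerability.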

Comments 35 pages, 1 figure

Abstract

The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric based on the spectral norm of the Fisher Information Matrix (FIM), which quantifies the worst-case sensitivity of the model's output distribution to input perturbations. Theoretically, we establish that the FIM equals the variance of the input Jacobian and derive closed-form spectral bounds for common architectures, including VGG, ResNet, DenseNet, and Transformer, providing the first theoretical robustness ranking. To enable scalable evaluation, we develop efficient algorithms, including power iteration and Hutchinson-based estimation, that support both white-box and black-box settings. Extensive experiments across multiple datasets, including CIFAR, ImageNet, and medical images, and across multiple architectures show a strong correlation between our metric and adversarial vulnerability. Our framework serves as an interpretable diagnostic tool that complements attack-based evaluations, offering insights into architectural sensitivity and guiding the design of more robust models. Code is available at: https://github.com/franz-chang/SRP/.
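
The power-iteration estimate mentioned in the abstract can be illustrated on a toy model. This is a minimal sketch, not the authors' implementation: it assumes a linear softmax classifier with logits Wx, for which the Fisher information with respect to the input is W^T (diag(p) - p p^T) W. All function names are hypothetical.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def fim_vector_product(W, p, v):
    """Return F v for F = W^T (diag(p) - p p^T) W without forming F."""
    u = [sum(w * vj for w, vj in zip(row, v)) for row in W]      # u = W v
    pu = sum(pi * ui for pi, ui in zip(p, u))
    g = [pi * (ui - pu) for pi, ui in zip(p, u)]                 # (diag(p) - p p^T) u
    return [sum(W[i][j] * g[i] for i in range(len(W))) for j in range(len(v))]  # W^T g

def fim_spectral_norm(W, x, iters=200):
    """Largest eigenvalue of the input-space FIM, estimated by power iteration."""
    p = softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])
    v = [1.0 / math.sqrt(len(x))] * len(x)   # assumed start vector; must not be orthogonal to the top eigenvector
    lam = 0.0
    for _ in range(iters):
        Fv = fim_vector_product(W, p, v)
        norm = math.sqrt(sum(c * c for c in Fv))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in Fv]
        lam = norm                            # ||F v|| with unit v approaches the top eigenvalue
    return lam

# Toy 3-class model on 2-dimensional inputs (assumed values).
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x = [0.5, -0.2]
print(fim_spectral_norm(W, x))
```

Only matrix-vector products are needed, which is why this style of estimator scales to models where the full matrix cannot be stored.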

2606.04764 2026-06-04 cs.CV

Do Foundation Models See Biology? Evaluating Attention Coherence with Spatial Transcriptomics in Glioblastoma


Dilakshan Srikanthan, Amoon Jamzad, Paul Wilson, Nooshin Maghsoodi, Robert Policelli, Gabor Fichtinger, John F. Rudan, Parvin Mousavi

Affiliations: Translational Medicine, School of Medicine, Queen’s University, Kingston, ON, Canada; School of Computing, Queen’s University, Kingston, ON, Canada; Department of Surgery, Queen’s University, Kingston, ON, Canada

AI summary: Proposes a spatial transcriptomics-based framework for objectively evaluating how well attention maps from pathology foundation models agree with biology, and finds that attention captures multi-gene transcriptional programs rather than individual molecular events.

Abstract

Whether attention maps from pathology foundation models capture genuine biology remains unknown, yet this question is critical for clinical trust and regulatory approval. We propose a spatial transcriptomics-based framework for orthogonal, hypothesis-free evaluation of attention and apply it to five pathology foundation models (CONCH v1.5, UNI v2, Virchow2, GigaPath, H-Optimus-1) and a ResNet50 baseline. Using attention-based multiple instance learning, we train single-task and multi-task models to predict five molecular alterations in glioblastoma on the CPTAC cohort, validate on an independent TCGA cohort, and evaluate biological coherence of attention maps against 87 transcriptional signatures using co-registered Visium spatial transcriptomics data from 18 samples. Internally, no single encoder dominates across all tasks, and external validation inverts internal performance rankings. Attention maps show a five-fold enrichment gradient from pathways (Cohen's d=0.329) to individual genes (d=0.055), indicating that attention captures emergent multi-gene transcriptional programs rather than individual molecular events. Spatially smooth attention maps do not imply biological coherence, and different encoders attend to distinct biological compartments. Our framework provides objective, quantitative assessment of what foundation models learn from histopathology, moving the field beyond qualitative saliency map review.
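
The effect sizes quoted above are Cohen's d values. As a reminder of what that statistic measures, here is a minimal sketch using the pooled sample standard deviation; the grouping into high-attention and low-attention spots and the numbers are hypothetical, and the paper may compute d differently.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Hypothetical signature scores in high-attention vs. low-attention spots.
high_attention = [2.0, 4.0, 6.0]
low_attention = [1.0, 3.0, 5.0]
print(cohens_d(high_attention, low_attention))
```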

2606.04754 2026-06-04 cs.LG

Beyond Structural Symmetries: Linear Mode Connectivity via Neuron Identifiability


Vincent Bürgin, Daniel Herbst, Ya-Wei Eileen Lin, Stefanie Jegelka

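Affiliations: DeepMind, London, UK; University of Cambridge; University of California, Berkeley

AI summary: Develops a theoretical framework of effective function classes and formalizes neuron identifiability, showing that neural networks admit large families of approximately equivalent solutions even in structurally asymmetric models, and that neuron identifiability enables representation merging without prior alignment and can yield linear low-loss paths.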

Comments Accepted at ICML 2026

Abstract

Many striking phenomena in deep learning, such as linear mode connectivity and the structured behavior of training dynamics, are closely tied to parameter symmetries: transformations that leave the realized function unchanged. Despite growing attention to parameter symmetries, the exact interplay between parameters, data, and representations remains underexplored. To investigate this, we develop a theoretical framework of effective function classes, i.e., the set of functions a neuron can realize on its input support, and the norm cost of realizing them. We then formalize effective symmetry breaking via neuron identifiability across independent training runs. Our analysis shows that neural networks can admit large families of approximately equivalent solutions even in structurally asymmetric models. We further show that neuron identifiability enables representation merging without prior alignment, and characterize when such merging admits a linear low-loss path. These findings highlight the role of effective function classes in affecting the loss landscape.

2606.04751 2026-06-04 cs.AI

FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games


Leonardo Bertolazzi, Katya Tentori, Raffaella Bernardi

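Affiliations: University of Trento; Free University of Bozen-Bolzano

AI summary: Proposes FALSIFYBENCH, a framework based on the Wason 2-4-6 task that evaluates LLM inductive reasoning across hypothesis generation, evidence gathering, and belief revision; reasoning models outperform instruction-tuned models, and a negative-testing strategy that actively seeks falsification is the key to success.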

Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents in scientific tasks. Yet whether these systems can effectively engage in forms of inductive reasoning relevant to scientific discovery remains an open question. In this work, we introduce FALSIFYBENCH, an evaluation framework for hypothesis-driven reasoning inspired by the classic Wason 2-4-6 task, in which agents must discover hidden semantic properties by iteratively proposing examples and receiving feedback. This task captures key elements of scientific reasoning: hypothesis generation, evidence gathering, and belief revision in response to both confirming and disconfirming evidence. Our evaluation of 12 LLMs across model families and scales shows that reasoning models are generally stronger scientific reasoners than instruction-tuned models, although no model comes close to optimal performance. The primary driver of success is the capacity for negative testing: models that actively seek to falsify their hypotheses consistently outperform those that primarily seek confirmation. Moreover, a fine-grained turn-level analysis, neglected in previous work, reveals that failure is tied to identifiable patterns in how models navigate the hypothesis space.
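
The difference between positive and negative testing can be seen in the classic numeric version of the Wason 2-4-6 task. The sketch below is an illustration only: the benchmark itself uses hidden semantic properties, and the rule, the hypothesis, and the function names here are assumptions.

```python
def hidden_rule(triple):
    """Rule known only to the environment: strictly increasing numbers."""
    a, b, c = triple
    return a < b < c

def hypothesis(triple):
    """Agent's current (too narrow) guess: each number is the previous plus 2."""
    a, b, c = triple
    return b - a == 2 and c - b == 2

def falsifies(triple):
    """A probe falsifies the hypothesis when the feedback contradicts its prediction."""
    return hidden_rule(triple) != hypothesis(triple)

positive_test = (4, 6, 8)    # chosen because the hypothesis predicts "yes"
negative_test = (1, 2, 10)   # chosen because the hypothesis predicts "no"

for probe in (positive_test, negative_test):
    print(probe, "feedback:", hidden_rule(probe), "falsifies hypothesis:", falsifies(probe))
```

The positive test agrees with the hypothesis and teaches the agent nothing new, while the negative test exposes that the guess is too narrow.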

2606.04750 2026-06-04 cs.AI cs.CY cs.LG

Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment


Ajay Vishwanath, Christian Omlin

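Affiliations: University of Agder

AI summary: Proposes an affinity-based reinforcement learning method that uses policy regularization to achieve both competitive and cooperative objectives in the multi-agent role-playing game Fog of Love, while improving the interpretability of agent behavior.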

Abstract

Instilling virtuous behavior in artificial intelligence has seen increasing interest. One of the techniques proposed is known as affinity-based reinforcement learning, which uses policy regularization on the objective function to incentivize virtuous actions without being fully dependent on the reward function design. Thus far, this technique has been demonstrated to be effective in grid worlds and toy-problem environments with minimal state and action spaces. To expand this research to more sophisticated environments, we introduce a two-player multi-agent environment based on the role-playing board game known as Fog of Love. In this environment, two agents compete to fulfill their individual virtues, while also cooperating to satisfy their relationship. Given the multi-agent nature, this is a complex problem where multi-agent deep deterministic policy gradient agents neither compete nor cooperate successfully. We present evidence that localized affinities enhance agent performance in achieving both competitive and cooperative objectives, resulting in superior overall scores in both domains. This not only results in virtuous choices but also clarifies an agent's teleology and makes its behavior human-level interpretable.

2606.04749 2026-06-04 cs.RO cs.LG

COP-Q: Safety-First Reinforcement Learning for Robot Control via Cholesky-Ordered Projection


Guopeng Li, Moritz A. Zanger, Matthijs T. J. Spaan, Julian F. P. Kooij

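Affiliations: Department of Cognitive Robotics, Delft University of Technology; Department of Intelligent Systems, Delft University of Technology; School of Transportation, Southeast University

AI summary: Proposes COP-Q, which encodes objective priority through Cholesky factorization and uses a generalized confidence bound in the joint Q-value space to balance safety and reward objectives in safety-first off-policy reinforcement learning, reducing excessive conservatism and improving sample efficiency.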

Comments 7 pages, 6 figures, 2 tables

Abstract

Safe robot control requires maximizing return while satisfying safety constraints. In off-policy safe reinforcement learning, reward and safety Q-values are commonly learned by separate critic ensembles, with uncertainty handled independently for each objective. This objective-wise treatment neglects inter-objective correlation and can lead to overly conservative value estimates, thereby reducing sample efficiency. To address this issue, we propose Cholesky-Ordered Projection Q-learning (COP-Q), a safety-first method that incorporates inter-objective covariance into vector-valued Q-value estimation. COP-Q constructs a generalized confidence bound in the joint Q-value space and uses Cholesky factorization to encode objective priority in a sequential form. This preserves conservatism on safety while adaptively reducing excessive conservatism on the reward objective. The resulting estimate is used in both temporal-difference target computation and actor optimization. COP-Q incurs minimal computational overhead and is readily compatible with most existing deep Q-learning frameworks. Experiments on robot locomotion in Brax and safe navigation in Safety-Gymnasium, covering both hard- and soft-safety settings, demonstrate that COP-Q achieves strong safety performance together with competitive or improved sample efficiency relative to representative baselines.
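
The abstract does not spell out COP-Q's update rule, so the following sketch only illustrates the linear-algebra fact that makes a Cholesky factor a natural way to order objectives. With the covariance ordered as (safety, reward), the first diagonal entry of the factor is the marginal standard deviation of safety, and the second is the standard deviation of reward that remains after accounting for safety. The covariance values and names are assumptions.

```python
import math

def cholesky_2x2(cov):
    """Lower-triangular L with L L^T = cov for a 2x2 positive-definite matrix."""
    (a, b), (_, d) = cov
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(d - l21 * l21)
    return [[l11, 0.0], [l21, l22]]

# Hypothetical ensemble covariance, ordered (safety, reward).
cov = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky_2x2(cov)
safety_sd = L[0][0]                        # marginal std of the safety estimate
reward_sd_marginal = math.sqrt(cov[1][1])  # what an independent treatment would use
reward_sd_given_safety = L[1][1]           # residual std once safety is fixed
print(safety_sd, reward_sd_marginal, reward_sd_given_safety)
```

When the two objectives are correlated, the residual term is smaller than the marginal one, which is consistent with the reduced conservatism on reward that the abstract describes.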

2606.04743 2026-06-04 cs.CL cs.AI cs.LG

TIDE: Proactive Multi-Problem Discovery via Template-Guided Iteration


Soyeong Jeong, Jinheon Baek, Minki Kang, Sung Ju Hwang

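Affiliations: KAIST; DeepAuto.ai

AI summary: Proposes TIDE, a template-guided iterative framework that proactively discovers multiple hidden problems in a user's context and pairs them with concrete actions, substantially improving task coverage, identification, and resolution in two settings: personal workspaces and software repositories.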

Abstract

Agents are widely deployed as assistants over documents, tools, and code. However, they typically act only on explicit user requests, which surface only the problems the user has noticed, while many other important problems coexist, hidden in plain sight, within the broader user context, with their total number unknown in advance. We frame this as the task of discovering multiple hidden problems from context, in which coexisting problems should be uncovered, grounded in supporting evidence, and paired with concrete actions. To this end, we introduce TIDE, a template-guided iterative framework with two complementary mechanisms. Specifically, motivated by the observation that single-pass prediction anchors on the most salient cases and yields generic claims, we propose iterative discovery, which surfaces a small batch of candidates per round while conditioning on what has already been found, so subsequent rounds extend coverage; and thought templates, reusable schemas distilled from previously solved cases that specify what contextual signals to attend to and how to connect them, anchoring each prediction in a recognizable problem class. We validate TIDE on two realistic settings, personal workspaces and software repositories, across four model backbones, showing substantial gains over single-shot and parallel multi-agent baselines on task coverage, identification, and resolution.

2606.04737 2026-06-04 cs.CV

Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment


Cong Wang, Hanxin Zhu, Jiayi Luo, Yonglin Tian, Xiaoqian Cheng, Peiyan Tu, Xin Jin, Long Chen, Zhibo Chen

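Affiliations: CASIA; UCAS; ZGCA; USTC; BUAA; ZJU; EIT

AI summary: Proposes PILA, a framework that uses mixture-of-experts latent alignment to inject physics-structured latent guidance into the frozen flow-matching dynamics of pretrained video models, improving the physical plausibility of generated videos.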

Abstract

Large-scale video generation models have made remarkable progress in semantic consistency and visual quality, producing videos that are increasingly coherent and visually convincing. Nevertheless, the dynamics induced by pixel-level fitting do not naturally accommodate the regularities that govern real-world motion and interaction, resulting in persistent shortcomings in physical plausibility. To address this limitation, we propose PILA (Physics-Informed Latent Alignment), a framework that injects physics-structured latent guidance into the frozen flow-matching dynamics of pretrained video models. Specifically, PILA first employs anchored field estimation to map frozen-generator latents into an operational physical attribute bank organized by field-proxy slots, using observable motion as a kinematic anchor for constructing less directly observed proxies. To handle the heterogeneity of real-world dynamics, PILA adopts a mixture-of-experts design over physical categories. Label-prior masked expert routing selects category-specific operator experts, whose refinements are regularized by operational residuals abstracted from physical relations. Finally, the refined proxies are fused into the physical attribute bank and decoded into a correction to the flow-matching vector field, injecting physics-aware guidance while preserving the visual prior of the pretrained backbone. With staged adapter training on Wan 2.1-1.3B and direct transfer of the learned adapter to Wan 2.2-14B, PILA achieves state-of-the-art results on VBench-2.0, VideoPhy-2, and PhyGenBench in both visual quality and benchmark-measured physical plausibility.

2606.04736 2026-06-04 cs.LG cs.AI

Curvature-aware dynamic precision approach for physics-informed neural networks


Yingjie Shao, Ioannis N. Athanasiadis, George van Voorn, Taniya Kapoor

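Affiliations: Mathematical & Statistical Methods Group (Biometris), Wageningen University & Research; Artificial Intelligence Group, Wageningen University & Research

AI summary: Proposes a curvature-aware precision controller that uses curvature information from the L-BFGS optimizer to adjust numerical precision dynamically, lowering the computational cost of double-precision training while preserving prediction accuracy.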

Abstract

Physics-informed neural networks (PINNs) have become a promising framework for simulating partial differential equations (PDEs) by embedding physical laws directly into neural network training. However, recent studies show that PINN optimisation is sensitive to numerical precision. Existing implementations commonly use either single precision (FP32), which is computationally efficient but prone to failure modes, or double precision (FP64), which is robust but substantially expensive. This creates a trade-off between computational efficiency and numerical accuracy. To reduce the computational cost of double-precision training while retaining prediction accuracy, we propose a curvature-aware precision controller that adapts numerical precision during training rather than treating it as a fixed implementation choice. The proposed method reuses curvature information derived from the limited-memory BFGS (L-BFGS) optimiser to construct a precision controller, retaining FP32 when lower precision is sufficient and promoting computation to FP64 when the training dynamics indicate numerical sensitivity or precision-limited stagnation. We evaluate the proposed approach on four canonical PINN failure-mode benchmarks and an irradiance-driven ordinary differential equation example. We further test the proposed approach across different neural network architectures. The method consistently matches or even slightly exceeds full FP64 solution accuracy while reducing training time relative to full double-precision training on all benchmark equations. The obtained results indicate that precision sensitivity in PINN optimisation is phase-dependent, and that selectively applying higher precision only during numerically critical stages can lower computational cost without sacrificing predictive accuracy.

2606.04735 2026-06-04 cs.LG cs.AI

Trace-Mediated Peak Bias: Bridging Temporal Credit Assignment and Cognitive Heuristics in Deep Reinforcement Learning


Viktor Veselý, Aleksandar Todorov, Erwan Escudie, Matthia Sabatelli

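Affiliations: Department of AI, University of Groningen

AI summary: Identifies Trace-Mediated Peak Bias (TMPB) in deep reinforcement learning, presents it as a mechanistic basis for the Peak-End Rule, and shows that adaptive optimizers mitigate the bias through second-moment normalization.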

Abstract

Temporal credit assignment is central to both biological and artificial intelligence, yet its interaction with non-linear function approximation is poorly understood. We identify a systematic failure mode in deep reinforcement learning (RL) termed Trace-Mediated Peak Bias (TMPB). At intermediate eligibility trace depths, agents irrationally prefer trajectories with high-magnitude reward "peaks" over alternatives with higher cumulative returns. This provides a mechanistic account of the Peak-End Rule: a human memory bias where experiences are judged by their most intense moments rather than integrated utility. We show that TMPB emerges because traces amplify distal Temporal Difference errors into "gradient shocks" that fixed-step-size Stochastic Gradient Descent cannot normalize, leading to global overestimation. Conversely, adaptive optimizers mitigate this pathology via second-moment normalization. Our results suggest that human-like saliency distortions may emerge naturally from the mathematical constraints of credit assignment in distributed systems, and that adaptive optimization is a theoretical necessity for rational value estimation.
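
The contrast between fixed-step SGD and second-moment normalization can be seen on a single scalar parameter. The sketch below is an illustration under assumed values, not the paper's experiment: it feeds both optimizers a run of small gradients followed by one large spike and compares the size of the resulting update. The Adam update follows the standard bias-corrected form.

```python
import math

def sgd_updates(grads, lr):
    """Update sizes for plain SGD: proportional to the raw gradient."""
    return [lr * g for g in grads]

def adam_updates(grads, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Per-step Adam update sizes for a scalar parameter, with bias correction."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

grads = [0.1] * 50 + [100.0]   # assumed sequence: steady gradients, then one "shock"
lr = 0.01
sgd = sgd_updates(grads, lr)
adam = adam_updates(grads, lr)
print("SGD update on spike:", sgd[-1])
print("Adam update on spike:", adam[-1])
```

Because the spike also inflates the second-moment estimate, the Adam step stays on the order of the learning rate, whereas the SGD step grows with the spike itself.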

2606.04733 2026-06-04 cs.LG cs.NI

Contrastive Learning and Correlation Clustering for Sequences of Network Telescope Data


Jannik Presberger, Alexander Männel, Maynard Koch, Thomas C. Schmidt, Matthias Wählisch, Bjoern Andres

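Affiliations: TU Dresden; HAW Hamburg; Center for Scalable Data Analytics and AI Dresden/Leipzig

AI summary: Proposes a transformer model trained with contrastive learning, without pretraining or annotations, to estimate semantic relationships between sequences of network flow records, and uses correlation clustering to group scanner behavior without supervision.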

Comments Code: https://github.com/JannikPresberger/Contrastive_Learning_and_Correlation_Clustering_for_Sequences_of_Network_Telescope_Data

Abstract

Understanding activities of Internet scanners is challenging; it often requires identifying relationships between sources, a task for which semantic annotations are scarce. This work investigates whether semantically meaningful pairwise relationships between sequences of network flow records can be estimated by contrastive learning, without pretraining and without annotations. To this end, we propose a transformer model that embeds minimally preprocessed sequences of network flow records and train it using contrastive learning. With the similarities obtained from this model, we state a correlation clustering problem and solve it locally. Experimentally, we show: Learned similarities are higher on average for sequences originating from the same source than for sequences originating from different sources, and this property generalizes to unseen sequences of unseen sources. Moreover, correlation clustering yields clusters consistent with scanner labels. The complete source code of the algorithms and for reproducing the experiments is publicly available.
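
The abstract does not say which contrastive objective is used. As one common choice, the sketch below computes an InfoNCE-style loss for a single anchor embedding, assuming that sequences from the same source form the positive pair and sequences from other sources act as negatives. The embeddings, the temperature, and the function names are all hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))  # stable log-sum-exp
    return log_sum - logits[0]

anchor = [1.0, 0.0]          # embedding of one sequence
same_source = [1.0, 0.0]     # embedding of another sequence from the same source
other_sources = [[-1.0, 0.0], [0.0, 1.0]]
print(info_nce(anchor, same_source, other_sources))
```

The loss is small when the anchor is closer to its positive than to every negative, which is the property the learned similarities are reported to have.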

2606.04730 2026-06-04 cs.CL eess.AS

Multilingual Long-Form Speech Instruction Following: KIT's Submission to IWSLT 2026


Enes Yavuz Ugan, Maike Züfle, Yuka Ko, Supriti Sinhamahapatra, Fabian Retkowski, Seymanur Akti, Jan Niehues, Alexander Waibel

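Affiliations: Karlsruhe Institute of Technology; Carnegie Mellon University

AI summary: Proposes a general data augmentation pipeline that converts short-form speech corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, and combines likelihood with Minimum Bayes Risk decoding to resolve the degradation of semantic tasks on long-form speech.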

Comments 9 pages main paper, IWSLT 2026 Instruction Following track

Abstract

With the advent of Large Language Models, single-task and token-based multi-task models have evolved into instruction-based systems that infer task and target language implicitly from natural language prompts. This trend is reflected in IWSLT's Instruction Following Track, which this year introduced new tasks including an unknown surprise task, posing a genuine challenge against overfitting to known tasks. We present KIT's submission to the Long and Short Instruction Following tracks in the unconstrained setting. Our approach combines a general data augmentation pipeline that converts short-form corpora into long-form training data through segment concatenation, LLM-based label generation, and cross-lingual translation, yielding over 1M instances across six tasks and four languages. We further show that likelihood-based re-ranking, while highly effective for ASR, systematically degrades semantic tasks by spuriously selecting candidates generated from segmented audio processing rather than holistic long-form inference, a failure mode resolved by combining likelihood with Minimum Bayes Risk decoding.
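
Minimum Bayes Risk decoding selects the candidate that agrees most with the other candidates rather than the one with the highest model likelihood. The sketch below shows plain MBR selection with a token-overlap utility chosen for simplicity; how the submission combines this with likelihood is not stated in the abstract, and the utility function and candidates are assumptions.

```python
def jaccard(a, b):
    """Token-set overlap between two hypotheses (a stand-in utility function)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def mbr_select(candidates, utility=jaccard):
    """Return the candidate with the highest average utility against the others.

    Requires at least two non-empty candidates.
    """
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)
    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

candidates = ["the cat sat", "the cat sat down", "the cat", "a dog ran"]
print(mbr_select(candidates))
```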

2606.04722 2026-06-04 cs.CV

StrokeTimer: Robust Representation Learning for Ischemic Stroke Onset-Time Estimation from Non-contrast CT


Weiru Wang, Susanne G. H. Olthuis, Elizaveta Lavrova, Robert J. van Oostenbrugge, Charles B. L. M. Majoie, Wim H. van Zwam, Ruisheng Su

Affiliations: Department of Biomedical Engineering, Eindhoven University of Technology; Graduate School of Life Sciences, Utrecht University; Department of Neurology, Maastricht University Medical Centre+; Precision Medicine Department, GROW Research Institute for Oncology and Reproduction, Maastricht University; Department of Radiology and Nuclear Medicine, Amsterdam University Medical Centre; Department of Radiology and Nuclear Medicine, Maastricht University Medical Centre+

AI summary: Proposes StrokeTimer, a framework that estimates ischemic stroke onset time from non-contrast CT using self-supervised disentanglement learning and energy-guided contrastive learning; on a large multi-center dataset it achieves a macro AUC of 0.69 and a macro F1-score of 0.57, improving on the strongest baseline by nearly 50%.

Comments Early accepted at MICCAI 2026

Abstract

Ischemic stroke is a major global disease. Treatment decisions are highly time-sensitive, as eligibility for reperfusion therapies relies on the interval between stroke onset and intervention. However, the true onset time is often uncertain in clinical practice, necessitating imaging-based assessment of tissue age as a surrogate marker. Early ischemic changes on routinely acquired non-contrast CT (NCCT) are often subtle, and real-world clinical datasets exhibit pronounced onset-time class imbalance and center-scanner-related heterogeneity. In this work, we propose StrokeTimer, a fully automated framework for onset-time estimation in acute ischemic stroke. StrokeTimer integrates self-supervised disentanglement learning with energy-guided contrastive learning to capture subtle ischemic patterns while addressing long-tailed data distributions under acquisition variability. Onset time is categorized into three clinically relevant windows: <4.5 h, 4.5-6 h, and >6 h. Experimental results on a large multi-center NCCT dataset from two national cohorts, MR CLEAN Registry and MR CLEAN LATE, show that StrokeTimer achieves a macro AUC of 0.69 and a macro F1-score of 0.57, improving the strongest baseline by nearly 50% (p < 0.005). In this realistic, challenging setting, representative baseline approaches exhibit near-chance macro performance. Model explanations further highlight subtle gray-white matter blurring and hypodense regions consistent with established radiological biomarkers. These findings demonstrate the potential of StrokeTimer to support treatment decision-making in acute ischemic stroke. Code is available at https://github.com/BrainVas/StrokeTimer.
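
Macro-averaged metrics weight each onset-time window equally, which matters under the class imbalance the abstract mentions. A minimal sketch of macro F1 over the three windows follows; the labels and predictions are invented for illustration and are not data from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / len(scores)

# Hypothetical labels for the three onset-time windows.
y_true = ["<4.5h", "<4.5h", "4.5-6h", "4.5-6h", ">6h", ">6h"]
y_pred = ["<4.5h", "4.5-6h", "4.5-6h", "4.5-6h", ">6h", "<4.5h"]
print(macro_f1(y_true, y_pred))
```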

2606.04719 2026-06-04 cs.CL

Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM


SooHwan Eom, Jay Shim, Gwanhyeong Koo, Haebin Na, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo

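Affiliations: Korea Advanced Institute of Science and Technology, Republic of Korea; University of Illinois in Urbana-Champaign, United States of America; Korea University, Republic of Korea

AI summary: Proposes a query-based cross-modal projector that compresses visual tokens through cross-attention, removing the need to manually design a 2D scan order and improving both the performance and throughput of Mamba-based multimodal LLMs.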

Comments Accepted to EMNLP 2024 Findings

Abstract

The Transformer's quadratic complexity with input length imposes an unsustainable computational load on large language models (LLMs). In contrast, the Selective Scan Structured State-Space Model, or Mamba, addresses this computational challenge effectively. This paper explores a query-based cross-modal projector designed to bolster Mamba's efficiency for vision-language modeling by compressing visual tokens based on input through the cross-attention mechanism. This innovative projector also removes the need for manually designing the 2D scan order of original image features when converting them into an input sequence for Mamba LLM. Experimental results across various vision-language understanding benchmarks show that the proposed cross-modal projector enhances Mamba-based multimodal LLMs, boosting both performance and throughput.
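
The compression step can be pictured as cross-attention from a small set of queries to a longer list of visual tokens, so the output length equals the number of queries regardless of image size. The sketch below omits the learned projection matrices a real projector would have and uses made-up vectors; it is an assumption-laden illustration, not the paper's architecture.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def compress_tokens(queries, tokens):
    """Cross-attention without learned projections: one output vector per query."""
    d = len(tokens[0])
    out = []
    for q in queries:
        scores = [sum(qi * ti for qi, ti in zip(q, t)) / math.sqrt(d) for t in tokens]
        w = softmax(scores)                      # attention weights over visual tokens
        out.append([sum(wi * t[k] for wi, t in zip(w, tokens)) for k in range(d)])
    return out

visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 tokens in
queries = [[1.0, 0.0], [0.0, 1.0]]                                 # 2 tokens out
compressed = compress_tokens(queries, visual_tokens)
print(len(visual_tokens), "->", len(compressed))
```

Because each output is a weighted sum over all tokens, the result does not depend on the order in which image patches are listed, which is in line with the claim that no 2D scan order has to be designed.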

2606.04710 2026-06-04 cs.CV

Data Efficient Complex Feature Fusion Network For Hyperspectral Image Classification


Maitreya Shelare, Atharva Satam, Poonam Sonar, Sneha Burnase

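Affiliations: Department of Electronics and Telecommunication, Rajiv Gandhi Institute of Technology, University of Mumbai

AI summary: Proposes DE-CFFN, a data-efficient attention-based dual-branch complex feature fusion network that reduces model complexity through Factor Analysis dimensionality reduction and by successively halving the number of filters in the 3D convolutional layers, while keeping classification performance comparable to CFFN.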

Comments 10 pages, 3 figures

Journal ref
In Proceedings of International Conference on Wireless Communication (ICWiCOM 2025), Lecture Notes in Electrical Engineering, vol. 1499, Springer, 2025
AI中文摘要

本工作提出了一种数据高效的基于注意力的双支路复杂特征融合网络(CFFN)变体,用于高光谱图像分类。所提出的模型称为DE-CFFN,保留了原始的双流结构:实值神经网络(RVNN)处理标准高光谱图像块,而复值神经网络(CVNN)处理其傅里叶变换后的对应物。本工作的主要贡献在于特征提取过程和架构增强。使用因子分析进行降维,相比主成分分析提供了更好的潜在特征表示。此外,RVNN和CVNN流均通过将3D卷积层中的滤波器数量逐次减半来减少复杂度。两个分支的输出被拼接并通过一个挤压激励(SE)块以增强联合特征表示。在Pavia University和Salinas数据集上的评估表明,DE-CFFN实现了与CFFN相当的分类性能,同时显著减小了模型大小、内存消耗和推理延迟,使其适用于实时高光谱成像应用。

英文摘要

This work presents a data-efficient variant of the Attention-Based Dual-Branch Complex Feature Fusion Network (CFFN) for hyperspectral image classification. The proposed model, termed DE-CFFN, retains the original two-stream structure: the Real-Valued Neural Network (RVNN) processes standard hyperspectral patches, while the Complex-Valued Neural Network (CVNN) handles their Fourier-transformed counterparts. The main contribution of this work lies in the feature extraction process and architectural enhancement. Factor Analysis is used for dimensionality reduction, offering improved latent feature representation over Principal Component Analysis. Additionally, both the RVNN and CVNN streams are structurally modified by successively halving the number of filters in the 3D convolutional layers to reduce complexity. The outputs of both branches are concatenated and passed through a Squeeze and Excitation (SE) block to enhance joint feature representation. Evaluated on the Pavia University and Salinas datasets, DE-CFFN achieves classification performance comparable to CFFN, while significantly reducing model size, memory consumption, and inference latency, making it suitable for real-time hyperspectral imaging applications.

2606.04706 2026-06-04 cs.CV

ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection

ReConFuse: 重建误差引导的语义融合用于AI生成视频检测

Xiaojing Chen, Xinyu Lu, Changtao Miao, Yunfeng Diao

发表机构 * Anhui University(安徽大学) Ant Group(蚂蚁集团) Hefei University of Technology(合肥工业大学)

AI总结 提出ReConFuse框架,利用预训练WF-VAE的重建误差作为鉴别线索,结合多帧语义特征和Mamba时序建模,实现AI生成视频的鲁棒检测。

详情
AI中文摘要

AI生成的视频变得越来越逼真,引发了关于错误信息、内容真实性和媒体信任的严重担忧。因此,可靠的AI生成视频检测对于多媒体取证至关重要,但由于需要捕捉空间伪影、时间动态并泛化到不断演变的生成模型,这仍然具有挑战性。在本文中,我们探索重建误差作为AI生成视频检测的判别性取证线索。通过使用预训练的WF-VAE重建输入视频,我们观察到真实视频和生成视频表现出可区分的逐帧重建误差模式,表明重建误差可以揭示它们的分布差异。然而,将基于重建的图像检测扩展到视频并非易事,因为视频重建误差在帧间具有时间组织性,并且需要语义上下文才能有效解释。为了应对这些挑战,我们提出了ReConFuse,一个用于视频级AI生成视频检测的重建引导语义融合框架。ReConFuse从WF-VAE重建的视频中提取重建误差线索,将其与多帧语义特征对齐,并使用基于Mamba的模块对时间演化进行建模以进行视频级分类。在多个生成器和评估设置上的实验证明了ReConFuse的有效性和强大的泛化能力。

英文摘要

AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet remains challenging due to the need to capture spatial artifacts, temporal dynamics, and generalize to evolving generative models. In this paper, we explore reconstruction error as a discriminative forensic cue for AI-generated video detection. By reconstructing input videos with a pretrained WF-VAE, we observe that real and generated videos exhibit distinguishable frame-wise reconstruction error patterns, suggesting that reconstruction errors can reveal their distributional discrepancies. However, extending reconstruction-based image detection to videos is non-trivial, since video reconstruction errors are temporally organized across frames and require semantic context for effective interpretation. To address these challenges, we propose ReConFuse, a reconstruction-guided semantic fusion framework for video-level AI-generated video detection. ReConFuse extracts reconstruction error cues from WF-VAE reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model temporal evolution for video-level classification. Experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse.

2606.04705 2026-06-04 cs.CV cs.AI

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

通过轻量级框预测器增强 MedSAM 用于医学图像分割

Amirhossein Movahedisefat, Amirreza Fateh, Mohammad Reza Mohammadi

发表机构 * School of Computer Engineering, Iran University of Science and Technology (IUST)(伊朗科学技术大学计算机工程学院)

AI总结 提出一种集成轻量级框预测器的 MedSAM 增强框架,通过单次点击估计边界框以提升点提示的空间引导能力,在仅增加 1.6M 参数下显著提高多模态医学图像分割的准确性和鲁棒性。

详情
AI中文摘要

医学图像中的语义分割是一项关键但具有挑战性的任务,原因是数据稀缺和跨模态的高变异性。虽然像 Segment Anything Model (SAM) 这样的基础模型显示出潜力,但它们在没有特定适应的情况下往往难以处理医学图像。此外,点提示尽管是最自然的用户交互形式,但为可靠分割提供的空间上下文不足,特别是当目标结构不规则或对比度差时。在本文中,我们提出了一种增强的分割框架,将轻量级框预测器模块集成到 MedSAM 架构中。框预测器通过使用局部图像嵌入特征从单次用户点击估计近似边界框,提供空间引导以减少点提示的模糊性,同时仅引入 1.6M 额外参数和可忽略的推理开销。我们引入了一个两阶段训练流程,其中框预测器在集成到 MedSAM 之前独立训练。为了验证我们方法的泛化能力,我们在四个不同的数据集(FLARE22、BRISC、BUSI、LungSegDB)上进行了广泛评估,这些数据集涵盖不同的成像模态,包括 CT、MRI 和超声。我们的方法在不同解剖结构和成像领域中提高了分割准确性和鲁棒性,在 BUSI、FLARE22、BRISC 和 LungSegDB 上分别达到了 0.89、0.93、0.88 和 0.98 的 Dice 分数。代码可在 https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor 获取。

英文摘要

Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor
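For readers unfamiliar with the reported metric, the Dice score on binary masks can be sketched as follows. This is the standard definition, not the authors' evaluation code. The helper name `dice_score` and the flat 0/1 list representation are assumptions made for illustration.

```python
def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total
```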

2606.04703 2026-06-04 cs.CL cs.LG

Rethinking Continual Experience Internalization for Self-Evolving LLM Agents

重新思考持续经验内化以实现自我进化的大语言模型智能体

Jingwen Chen, Wenkai Yang, Shengda Fan, Wenbo Nie, Chenxing Sun, Shaodong Zheng, Yangen Hu, Lu Pan, Ke Zeng, Yankai Lin

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) School of Software, Beihang University(北京航空航天大学软件学院) Meituan(美团)

AI总结 本文通过经验粒度、注入模式和内化机制三个维度,提出一种稳定可持续的经验内化方法,解决多轮经验学习中的能力崩溃问题。

Comments 10 pages, 8 figures

详情
AI中文摘要

经验内化将过去交互中的上下文经验转化为可重用的参数化能力,为大型语言模型(LLM)的持续学习提供了一条有前景的路径。虽然先前的工作主要关注单次迭代迁移,但我们发现在多轮经验学习下,现有方法遭受的是渐进的能力崩溃而非复合改进。我们通过经验内化的三个关键维度系统地考察了这种失败:(1)经验粒度:我们发现原则级经验比实例级经验更持久,因为它有效地从轨迹特定细节中抽象出可迁移的策略。(2)经验注入模式:我们的分析表明,逐步注入通过将经验与中间决策状态对齐,显著优于全局注入,这一特性对于长程工具使用至关重要。(3)内化机制:我们证明,在高质量教师轨迹上的离策略上下文蒸馏提供了比在策略上下文蒸馏更稳定的训练信号,后者固有地受限于对学生诱导的缺陷状态的局部修正。这些见解共同产生了一个简单而稳健的配方,用于稳定和可持续的经验内化,为工程化自我进化和持续学习的LLM提供了具体指导。

英文摘要

Experience internalization converts contextual experience from past interactions into reusable parametric capability, offering a promising path toward continual learning in large language models (LLMs). While prior work has predominantly focused on single-iteration transfer, we discover that under multi-iteration experience learning, existing methods suffer from a progressive capability collapse rather than compounding improvement. We systematically examine this failure through three vital dimensions of experience internalization: (1) Experience Granularity: We find that principle-level experience is more durable than instance-level experience, as it effectively abstracts transferable strategies away from trajectory-specific details. (2) Experience Injection Pattern: Our analysis reveals that step-wise injection significantly outperforms global injection by aligning experience with intermediate decision states, a property that is critical for long-horizon tool use. (3) Internalization Regime: We demonstrate that off-policy context-distillation on high-quality teacher trajectories provides a substantially more stable training signal than on-policy context-distillation, which is inherently limited by local corrections on student-induced flawed states. Together, these insights yield a simple yet robust recipe for stable and sustainable experience internalization, providing concrete guidance for engineering self-evolving and continually learning LLMs.

2606.04701 2026-06-04 cs.CV cs.CL

Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

在短视频平台上对原生动态屏幕GUI代理的基准测试

Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo

发表机构 * Beijing Institute of Technology(北京理工大学) Tsinghua University(清华大学) Beihang University(北航)

AI总结 针对短视频平台等动态屏幕环境,提出LivingScreen基准测试,通过三级任务套件和联合评估准确性与信息效率的指标,发现现有GUI代理存在观察过度或不足的问题。

Comments preprint

详情
AI中文摘要

当前的GUI代理假设屏幕是静态的,即两次动作之间世界是冻结的。然而,诸如短视频应用之类的真实界面违反了这一假设,因为其内容持续播放,一个称职的用户必须决定观看什么以及观看多长时间。我们将此任务形式化为原生动态屏幕GUI代理,并引入LivingScreen——首个在短视频平台上实例化该任务的基准测试,它包含一个基于浏览器的忠实环境、三级任务套件以及联合评估准确性和信息效率的指标。评估广泛的前沿模型后,我们发现没有一个模型能达到人类的成本-准确率性能,并且它们的主要失败模式是过度观察和观察不足,这表明观察控制是未来GUI代理缺失的能力轴。所有数据和代码将在https://github.com/BITHLP/LivingScreen上提供。

英文摘要

GUI agents today assume a static screen, where the world is frozen between two actions. However, real interfaces such as short-video applications violate this assumption, as their content keeps playing, and a competent user must decide what to watch and for how long. We formalize this task as Living-Screen-Native GUI agents and introduce LivingScreen, the first benchmark instantiating it on short-video platforms, with a faithful browser-based environment, a three-tier task suite, and metrics that jointly score accuracy and information efficiency. Evaluating extensive frontier models, we find that none reaches the human cost-accuracy performance, and that their dominant failure mode is over- and under-observation, pointing to observation control as a missing capability axis for future GUI agents. All data and code will be available at https://github.com/BITHLP/LivingScreen.

2606.04700 2026-06-04 cs.CV

A New Angle on Bones: Robust Pose Estimation in X-Ray and Ultrasound

骨骼的新视角:X射线和超声中的鲁棒姿态估计

Ron Keuth, Christoph Großbröhmer, Franziska Halm, Miriam Johann, Anne-Nele Schröder, Ludger Tüshaus, Mattias P. Heinrich, Lasse Hansen

发表机构 * Medical Informatics, University of Lübeck(吕贝克大学医学信息学系) Institute of Radiology and Nuclear Medicine, University Hospital Schleswig-Holstein(石勒苏益格-荷尔斯泰因大学医院放射学与核医学研究所) Paediatric Surgery, University Hospital Schleswig-Holstein(石勒苏益格-荷尔斯泰因大学医院小儿外科) EchoScout GmbH

AI总结 提出基于学习的关键点候选和鲁棒线模型(RANSAC、霍夫变换)的自动骨骼姿态估计方法,在儿科骨折和髋关节发育不良评估中达到临床可接受的误差并优于地标方法。

Comments Code and annotations for fracture angle assessment in radiographs: https://github.com/multimodallearning/RobustBonePoseEstimation

详情
AI中文摘要

测量骨骼结构之间的角度是医学图像分析中的常规任务,为诊断和治疗规划提供关键的定量参数。自动化方法可以减少时间和成本,同时提高可重复性。在这项工作中,我们通过基于学习的关键点候选提议,随后使用线模型提取轴参数,来解决自动骨骼姿态估计问题。由于传统线模型如最小二乘法对异常值敏感,我们结合了假阳性减少策略和鲁棒拟合技术,如RANSAC和霍夫变换,以提高鲁棒性。我们在三个临床相关的儿科角度估计任务上评估了我们的方法:X射线和超声中的骨折碎片评估,以及使用Graf方法的超声中髋关节发育不良评估。我们的方法分别实现了$4.1^\circ$、$5.4^\circ$和$5.51^\circ$的平均误差,不仅保持在预期的临床观察者变异范围内,而且显著优于基于地标的方法。我们的代码和用于X射线骨折角度评估的注释已在GitHub上公开。

英文摘要

Measuring the angle between bone structures is a routine task in medical image analysis and provides a key quantitative parameter for diagnosis and treatment planning. Automated methods can reduce time and cost while improving reproducibility. In this work, we address automatic bone pose estimation using a learning-based point candidate proposal followed by a line model to extract axis parameters. Since conventional line models such as least squares are sensitive to outliers, we incorporate false-positive reduction strategies and robust fitting techniques, such as RANSAC and Hough transforms, to improve robustness. We evaluate our method on three clinically relevant paediatric angle estimation tasks: fracture fragment assessment in radiographs and ultrasound and developmental dysplasia of the hip evaluation in ultrasound using the Graf method. Our approach achieves mean errors of $4.1^\circ$, $5.4^\circ$, and $5.51^\circ$, respectively, not only remaining within the expected clinical observer variability, but also significantly outperforming landmark-based methods. Our code and annotations for fracture angle assessment in radiographs are publicly available on GitHub.
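The robust line-fitting step can be sketched as below, assuming keypoint candidates are 2D points and each bone axis is a straight line. This is a generic RANSAC with a principal-axis refit, not the authors' implementation. The function names, tolerance, and iteration count are illustrative assumptions.

```python
import math
import random

def ransac_line(points, n_iter=200, tol=0.5, seed=0):
    """Fit a 2D line robustly; return (centroid, unit direction) of the inlier set."""
    rng = random.Random(seed)
    best = []
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        dx, dy = x2 - x1, y2 - y1
        norm = math.hypot(dx, dy)
        if norm == 0:
            continue  # degenerate sample: both points coincide
        # Inliers are points whose perpendicular distance to the sampled line is small.
        inliers = [(x, y) for x, y in points
                   if abs(dx * (y - y1) - dy * (x - x1)) / norm <= tol]
        if len(inliers) > len(best):
            best = inliers
    if len(best) < 2:
        raise ValueError("no valid line hypothesis found")
    # Refit on the inliers: direction of maximum variance (total least squares).
    n = len(best)
    mx = sum(x for x, _ in best) / n
    my = sum(y for _, y in best) / n
    sxx = sum((x - mx) ** 2 for x, _ in best)
    syy = sum((y - my) ** 2 for _, y in best)
    sxy = sum((x - mx) * (y - my) for x, y in best)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))

def angle_between_lines(d1, d2):
    """Acute angle in degrees between two lines given by unit directions."""
    dot = abs(d1[0] * d2[0] + d1[1] * d2[1])
    return math.degrees(math.acos(min(1.0, dot)))
```

Because the refit only uses inliers, a few false-positive keypoints far from the bone axis do not tilt the estimated angle, which is the failure mode of plain least squares mentioned in the abstract.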

2606.04699 2026-06-04 cs.LG cs.AI cs.CV

Graph-Guided Universum Learning in Generalized Eigenvalue Proximal SVMs for Alzheimer's Disease Classification

基于图引导的广义特征值近端支持向量机中的Universum学习用于阿尔茨海默病分类

Yogesh Kumar, Vrushank Ahire, Mudasir Ganaie

发表机构 * Dept. of Computer Science and Engineering, IIT Ropar, Punjab 140001, India(计算机科学与工程系,IIT罗帕尔,旁遮普140001,印度)

AI总结 针对阿尔茨海默病分类,提出两种图引导的Universum学习模型UG-GEPSVM和IUG-GEPSVM,利用轻度认知障碍样本构建图拉普拉斯正则化,替代传统独立惩罚项,在ADNI MRI数据集上取得更优性能。

详情
AI中文摘要

早期准确检测阿尔茨海默病(AD)对于及时干预和疾病管理至关重要。广义特征值近端支持向量机(GEPSVM)及其基于Universum的变体在AD分类中显示出有希望的结果。然而,现有方法将Universum样本视为独立点,未考虑它们之间的几何关系。本文提出了两种图引导的Universum学习模型,即UG-GEPSVM和IUG-GEPSVM,用于使用结构MRI数据进行AD与认知正常(CN)分类。在所提出的框架中,轻度认知障碍(MCI)受试者被用作Universum数据,以提供AD和CN类别之间的中间信息。使用高斯相似性、最小生成树连通性和多跳传播在Universum样本上构建图。从该图中导出拉普拉斯矩阵,捕获MCI样本的几何结构。这种基于拉普拉斯的正则化被纳入学习过程,以替代传统的独立Universum惩罚项。UG-GEPSVM将此正则化集成到广义特征值公式中,而IUG-GEPSVM使用标准特征值公式扩展了数值稳定的改进GEPSVM框架。在ADNI MRI数据集变体上使用ICA和PCA特征在五个不同噪声水平下的实验表明,两种提出的模型始终优于现有的GEPSVM和基于Universum的方法。UG-GEPSVM实现了88.07%的最高平均AUC,并在增加的噪声水平下保持稳定的性能。统计检验进一步证实了观察到的改进的显著性。

英文摘要

Early and accurate detection of Alzheimer's disease (AD) is important for timely intervention and disease management. Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM) and its Universum-based variants have shown promising results for AD classification. However, existing methods treat Universum samples as independent points and do not consider the geometric relationships among them. This paper proposes two graph-guided Universum learning models, namely UG-GEPSVM and IUG-GEPSVM, for AD versus cognitively normal (CN) classification using structural MRI data. In the proposed framework, mild cognitive impairment (MCI) subjects are used as Universum data to provide intermediate information between AD and CN classes. A graph is constructed over the Universum samples using Gaussian similarity, Minimum Spanning Tree connectivity, and multi-hop propagation. From this graph, a Laplacian matrix is derived that captures the geometric structure of the MCI samples. This Laplacian-based regularization is incorporated into the learning process in place of the conventional independent Universum penalty term. UG-GEPSVM integrates this regularization into the generalized eigenvalue formulation, while IUG-GEPSVM extends the numerically stable improved GEPSVM framework using a standard eigenvalue formulation. Experiments on ADNI MRI dataset variants using ICA- and PCA-based features at five different noise levels show that both proposed models consistently outperform existing GEPSVM and Universum-based methods. UG-GEPSVM achieves the highest average AUC of 88.07% and maintains stable performance under increasing noise levels. Statistical tests further confirm the significance of the observed improvements.
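The graph construction over Universum samples can be sketched in simplified form. The sketch assumes edges are restricted to a Euclidean minimum spanning tree and weighted by Gaussian similarity, and that the Laplacian is the unnormalized L = D - W. Multi-hop propagation is omitted, and the paper's exact construction may differ. All function names are hypothetical.

```python
import math

def mst_edges(dist):
    """Prim's algorithm on a dense distance matrix; returns n-1 (parent, child) edges."""
    n = len(dist)
    in_tree = [False] * n
    in_tree[0] = True
    best = [(dist[0][j], 0) for j in range(n)]  # (cheapest distance to tree, parent)
    edges = []
    for _ in range(n - 1):
        j = min((k for k in range(n) if not in_tree[k]), key=lambda k: best[k][0])
        edges.append((best[j][1], j))
        in_tree[j] = True
        for k in range(n):
            if not in_tree[k] and dist[j][k] < best[k][0]:
                best[k] = (dist[j][k], j)
    return edges

def graph_laplacian(points, sigma=1.0):
    """Unnormalized Laplacian L = D - W with Gaussian weights on MST edges only."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    W = [[0.0] * n for _ in range(n)]
    for i, j in mst_edges(dist):
        w = math.exp(-dist[i][j] ** 2 / (2.0 * sigma ** 2))
        W[i][j] = W[j][i] = w
    return [[(sum(W[i]) if i == j else -W[i][j]) for j in range(n)] for i in range(n)]
```

A regularizer of the form wᵀ Uᵀ L U w (with U the Universum feature matrix) then penalizes decision values that differ strongly between connected MCI samples, instead of penalizing each sample independently.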

2606.04695 2026-06-04 cs.LG

Cone-Compatible Monge Geometry for High-Dimensional Ordered Optimal Transport

锥相容的Monge几何用于高维有序最优输运

Lei Luo, Hongliang Zhang, Jian Yang

发表机构 * PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology(PCA实验室、教育部高维信息智能感知与系统重点实验室、计算机科学与工程学院、南京理工大学)

AI总结 本文提出锥相容的Monge几何,通过闭凸锥诱导的偏序与输运成本兼容的条件,为高维有序数据提供闭式最优耦合。

Comments 13 pages, 2 figures, including appendices

详情
AI中文摘要

高维最优输运很少具有闭式解。一维情况是例外,因为实数线的顺序与凸输运成本兼容,使得单调重排最优。本文研究在更高维中何时可以从偏序恢复类似的Monge结构。我们引入锥相容的Monge几何:一个闭凸锥$K$在$y-x\in K$时诱导序$x\preceq_K y$,并且如果有序对满足Monge交换不等式,则该锥与成本兼容。对于平方马氏距离成本$c_M(x,y)=(x-y)^\top M(x-y)$,我们证明了一个尖锐的刻画:兼容性成立当且仅当$K$在$M$-内积下是锐角锥,即对所有$u,v\in K$有$u^\top Mv\ge0$,等价于$K\subseteq K_M^*$。在此条件下,支撑在锥链上的测度允许分位数型的闭式最优耦合,在原始地面成本下(而非投影或度量替换后)得到精确输运。我们将由此产生的锥链Wasserstein度量(定义在规范有序的链分布上)与扩展的有向锥输运成本(定义在一般测度上)区分开来,并发展了可行性、对偶性、稳定性、逼近、高斯恢复、统计和计算方面的结果。该理论与切片和树Wasserstein距离互补:它不是通用的快速替代,而是为有序高维数据提供可解释、方向有效、原始空间单调输运的一种方法。

英文摘要

High-dimensional optimal transport is seldom available in closed form. The one-dimensional case is exceptional because the order of the real line is compatible with convex transport costs, making monotone rearrangement optimal. This paper studies when an analogous Monge structure can be recovered in higher dimensions from a partial order. We introduce a cone-compatible Monge geometry: a closed convex cone $K$ induces the order $x\preceq_K y$ whenever $y-x\in K$, and is compatible with a cost if ordered pairs satisfy a Monge exchange inequality. For squared Mahalanobis costs $c_M(x,y)=(x-y)^\top M(x-y)$, we prove a sharp characterization: compatibility holds exactly when $K$ is acute under the $M$-inner product, namely $u^\top Mv\ge0$ for all $u,v\in K$, equivalently $K\subseteq K_M^*$. Under this condition, measures supported on cone chains admit a quantile-type closed-form optimal coupling, yielding exact transport under the original ground cost rather than after projection or metric replacement. We distinguish the resulting cone-chain Wasserstein metric on canonically ordered chain distributions from an extended directed cone transport cost on general measures, and develop feasibility, duality, stability, approximation, Gaussian recovery, statistical, and computational results. The theory is complementary to sliced and tree Wasserstein distances: it is not a universal fast surrogate, but a way to obtain interpretable, direction-valid, original-space monotone transport for ordered high-dimensional data.
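To see why acuteness is the natural condition, one can expand a two-pair exchange inequality for the squared Mahalanobis cost. The derivation below is a sketch: it assumes that M is symmetric and that the exchange inequality takes the usual two-pair form, which may differ from the paper's exact definition.

```latex
% Assumptions (not stated in the abstract): M symmetric,
% x \preceq_K x' and y \preceq_K y'.
\begin{align*}
\Delta &:= c_M(x,y') + c_M(x',y) - c_M(x,y) - c_M(x',y') \\
       &= -2x^\top M y' - 2x'^\top M y + 2x^\top M y + 2x'^\top M y' \\
       &= 2\,(x'-x)^\top M\,(y'-y).
\end{align*}
% With u = x'-x \in K and v = y'-y \in K, the monotone pairing costs no more
% than the crossed pairing (\Delta \ge 0) exactly when u^\top M v \ge 0.
```

The quadratic terms cancel, so only the cross terms remain, and the sign of the exchange gap is governed by the M-inner product of two cone directions.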

2606.04691 2026-06-04 cs.CL

SMADE-IE: Sparse Multi-Agent Framework with Evidence-Driven Debate for Zero-Shot Information Extraction

SMADE-IE: 基于证据驱动辩论的稀疏多智能体框架用于零样本信息抽取

Kenfeng Huang, Yi Cai, Xin Wu, Zikun Deng, Li Yuan

发表机构 * School of Software Engineering, South China University of Technology(华南理工大学软件学院)

AI总结 提出SMADE-IE稀疏多智能体框架,通过自适应模式选择器和证据驱动辩论机制,在零样本信息抽取中减少冗余交互并提升性能。

Comments 21 pages, 9 figures

详情
AI中文摘要

基于大型语言模型的零样本信息抽取因其无需任务特定训练即可适应新模式和领域的灵活性而受到越来越多的关注。现有方法主要依赖于整体提示、逐类型提示或多智能体辩论。然而,整体提示常常遭受边界和类型错误,而逐类型提示和多智能体辩论引入了跨类型冲突、冗余智能体交互和大量令牌开销。为了解决这些挑战,我们提出了SMADE-IE,一种用于零样本信息抽取的稀疏且证据驱动的多智能体框架。SMADE-IE首先采用自适应模式选择器将输入动态路由到轻量级全局抽取模式或类型中心抽取模式,减少不必要的类型选择和推理噪声。对于冲突预测,我们进一步引入了证据驱动辩论机制,将论证结构化为图尔敏式组件,并通过外部证据评分和贝叶斯更新进行置信度聚合。在NER、RE和JERE任务的9个基准数据集上的实验结果表明,SMADE-IE在持续优于现有零样本信息抽取基线的同时,通过稀疏智能体选择和早期停止辩论提高了令牌效率。

英文摘要

Zero-shot information extraction (IE) with large language models (LLMs) has attracted increasing attention due to its flexibility in adapting to new schemas and domains without task-specific training. Existing approaches mainly rely on monolithic prompting, each-type prompting, or multi-agent debate. However, monolithic prompting often suffers from boundary and type errors, while each-type prompting and multi-agent debate introduce cross-type conflicts, redundant agent interactions, and substantial token overhead. To address these challenges, we propose SMADE-IE, a sparse and evidence-driven multi-agent framework for zero-shot IE. SMADE-IE first employs an Adaptive Mode Selector to dynamically route inputs into either a lightweight Global Extraction Mode or a Type-Centric Extraction Mode, reducing unnecessary type selection and reasoning noise. For conflicting predictions, we further introduce an Evidence-Driven Debate mechanism that structures arguments into Toulmin-style components and performs confidence aggregation through external evidence scoring and Bayesian updates. Experimental results on 9 benchmark datasets across NER, RE, and JERE tasks show that SMADE-IE consistently outperforms existing zero-shot IE baselines while also improving token efficiency through sparse agent selection and early-stopping debate.

2606.04688 2026-06-04 cs.CV

MeshWeaver: Sparse-Voxel-Guided Surface Weaving for Autoregressive Mesh Generation

MeshWeaver: 稀疏体素引导的表面编织用于自回归网格生成

Jiale Xu, Wang Zhao, Ying Shan

发表机构 * ARC Lab, Tencent PCG(腾讯PCG实验室)

AI总结 提出MeshWeaver框架,通过多级稀疏体素编码器注入几何上下文,以自回归方式直接预测顶点实现表面编织,在压缩比、高多边形网格生成和几何保真度上达到最优。

Comments CVPR 2026

详情
AI中文摘要

自回归网格生成通过将网格标记化为序列并以语言建模方式训练模型而受到关注。然而,现有方法存在两个基本限制:(i) 标记化效率低,导致长标记序列并阻碍扩展到高多边形网格;(ii) 缺乏几何感知引导,因为生成仅基于全局形状嵌入而非局部表面线索。我们提出MeshWeaver,一个自回归框架,将网格生成视为表面编织过程,直接预测下一个顶点而非独立坐标。其核心是多级稀疏体素编码器,通过三种互补方式将几何上下文注入生成过程:提供体素特征作为顶点表示,通过交叉注意力引导标记预测,以及作为结构支架约束生成围绕输入表面。我们的层次化设计使得在单次解码步骤中实现从粗到细的顶点预测,同时紧密耦合生成模型与3D几何。大量实验表明,MeshWeaver实现了18%的最先进压缩比,能够生成多达16K面的网格,并且在几何保真度上显著优于先前方法。

英文摘要

Autoregressive mesh generation has gained attention by tokenizing meshes into sequences and training models in a language-modeling fashion. However, existing approaches suffer from two fundamental limitations: (i) low tokenization efficiency, which yields long token sequences and prevents scaling to high-poly meshes, and (ii) absence of geometry-aware guidance, as generation is conditioned only on global shape embeddings rather than local surface cues. We introduce MeshWeaver, an autoregressive framework that treats mesh generation as a surface weaving process by directly predicting the next vertex instead of independent coordinates. At its core is a multi-level sparse-voxel encoder that injects geometric context into the generative process in three complementary ways: providing voxel features as vertex representations, guiding token prediction via cross-attention to voxel features, and serving as a structural scaffold that constrains generation around the input surface. Our hierarchical design enables coarse-to-fine vertex prediction in a single decoding step, while tightly coupling the generative model with 3D geometry. Extensive experiments demonstrate that MeshWeaver achieves a state-of-the-art compression ratio of 18%, can generate meshes with up to 16K faces, and significantly improves geometric fidelity over prior approaches.

2606.04684 2026-06-04 cs.CV cs.AI

Real-Time Automatic License Plate Recognition Using YOLOv8, SORT Tracking, and Temporal Data Interpolation

基于YOLOv8、SORT跟踪与时间数据插值的实时自动车牌识别

Mirza Muhammad Mobeen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一个五阶段端到端算法流程,结合YOLOv8目标检测、SORT多目标跟踪和时间数据插值,解决动态交通监控中因光照变化、遮挡等导致的识别率低和跟踪路径断裂问题。

Comments 7 Pages, For Accessing code:https://github.com/ mobeen-pmo/Automatic-License-Plate-Recognition

详情
AI中文摘要

视频处理的实时困难严重限制了自动车牌识别(ALPR)在动态交通监控环境中的应用。对非受控变量(如光照剧烈变化、摄像机扫描角度、车辆高速行驶和物理遮挡)的高保真识别是一个问题,常导致跟踪路径断裂和光学字符识别(OCR)率低下。为缓解这些弱点,本研究提出一个五阶段端到端算法流程,涵盖基于深度学习的目标检测、运动学多目标跟踪和几何时间数据插值之间的平滑过渡。所提出的架构利用强大的YOLOv8 nano模型在第一阶段定位车辆,然后使用简单在线实时跟踪(SORT)算法建立帧间时空联系。另一种更具体的YOLOv8目标检测器检测车牌区域,将切片数组传递给EasyOCR链,并受位置语法验证约束。更重要的是,启动离线时间边界框插值机制以重新连接断裂的路径。

英文摘要

Real-time video processing constraints severely limit the use of Automatic License Plate Recognition (ALPR) in dynamic traffic monitoring. Uncontrolled conditions, such as drastic illumination changes, acute camera angles, high vehicle speeds, and heavy physical occlusion, make high-fidelity recognition difficult and often lead to fragmented tracking paths and poor Optical Character Recognition (OCR) rates. To mitigate these weaknesses, this study proposes a five-stage, end-to-end algorithmic pipeline that combines deep-learning-based object detection, kinematic multi-object tracking, and geometric temporal data interpolation. The proposed architecture uses a YOLOv8 nano model to localize vehicles in the first stage, then applies the Simple Online and Realtime Tracking (SORT) algorithm to build spatio-temporal links between frames. A second, more specialized YOLOv8 detector locates the license plate region and passes the cropped array to an EasyOCR pipeline constrained by positional syntax verification. Most importantly, an offline temporal bounding-box interpolation mechanism reconnects fragmented paths.
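The offline interpolation stage can be sketched as plain linear interpolation of box corners across missing frames of a single track. The data layout (a dict from frame index to `(x1, y1, x2, y2)`) and the function name are assumptions for illustration, not the repository's actual interface.

```python
def interpolate_track(detections):
    """Fill gaps in one track by linearly interpolating bounding boxes.

    detections: dict mapping frame index -> (x1, y1, x2, y2) for frames where
    the object was detected. Returns a new dict that also covers the frames
    between consecutive detections.
    """
    frames = sorted(detections)
    filled = dict(detections)
    for f0, f1 in zip(frames, frames[1:]):
        box0, box1 = detections[f0], detections[f1]
        for f in range(f0 + 1, f1):
            t = (f - f0) / (f1 - f0)  # fraction of the way from f0 to f1
            filled[f] = tuple(a + t * (b - a) for a, b in zip(box0, box1))
    return filled
```

Running this after tracking is what makes the step offline: a gap can only be filled once the detection that closes it has been seen.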

2606.04665 2026-06-04 cs.LG

Towards Accurate Model Selection in Deep Unsupervised Domain Adaptation

面向深度无监督域适应的精确模型选择

Kaichao You, Ximei Wang, Mingsheng Long, Michael I. Jordan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对深度无监督域适应中缺乏准确模型选择方法的问题,提出Deep Embedded Validation (DEV)方法,通过嵌入适应特征表示到验证过程中,获得目标风险的无偏估计,并利用控制变量技术降低方差,理论和实验证明了其有效性。

Comments upload to arxiv for record

详情
AI中文摘要

深度无监督域适应(Deep UDA)方法成功利用源域中丰富的标记数据来提升相关但未标记的目标域上的性能。然而,由于缺乏准确且标准化的模型选择方法,Deep UDA中的算法比较变得繁琐,这阻碍了该领域的进一步进展。现有的Deep UDA模型选择方法要么高度有偏、受限、不稳定,甚至存在争议(需要标记的目标数据)。为此,我们提出了Deep Embedded Validation(DEV),它将适应后的特征表示嵌入到验证过程中,以获得目标风险的无偏估计,且方差有界。通过控制变量技术进一步降低了方差。该方法的有效性在理论和实验上都得到了验证。

英文摘要

Deep unsupervised domain adaptation (Deep UDA) methods successfully leverage rich labeled data in a source domain to boost performance on related but unlabeled data in a target domain. However, algorithm comparison is cumbersome in Deep UDA because there is no accurate and standardized model selection method, which poses an obstacle to further advances in the field. Existing model selection methods for Deep UDA are highly biased, restricted, unstable, or even controversial (requiring labeled target data). To this end, we propose Deep Embedded Validation (DEV), which embeds the adapted feature representation into the validation procedure to obtain an unbiased estimate of the target risk with bounded variance. The variance is further reduced by the control variate technique. The efficacy of the method is justified both theoretically and empirically.
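An importance-weighted risk estimate with a control variate can be sketched as follows. The sketch assumes that per-sample density-ratio weights on source validation data are already available (for example from a domain discriminator on the adapted features) and that the weights have expectation 1 under the source distribution. The function name `dev_risk` is hypothetical, and the paper's exact estimator may differ in detail.

```python
def dev_risk(losses, weights):
    """Importance-weighted target-risk estimate with a control variate.

    losses:  per-sample validation losses on labeled source data.
    weights: estimated density ratios p_target(x) / p_source(x) (assumed given).
    Uses the fact that E[w] = 1, so (mean(w) - 1) is a zero-mean control variate.
    """
    n = len(losses)
    if n < 2:
        raise ValueError("need at least two samples")
    wl = [w * l for w, l in zip(weights, losses)]
    mean_wl = sum(wl) / n
    mean_w = sum(weights) / n
    cov = sum((a - mean_wl) * (b - mean_w) for a, b in zip(wl, weights)) / (n - 1)
    var = sum((b - mean_w) ** 2 for b in weights) / (n - 1)
    if var == 0:
        return mean_wl  # constant weights: the control variate carries no information
    eta = -cov / var  # coefficient that minimizes the variance of the corrected estimate
    return mean_wl + eta * (mean_w - 1.0)
```

Candidate models or hyperparameter settings can then be ranked by this estimate without any labeled target data.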

2606.04662 2026-06-04 cs.LG cs.AI

Why Muon Outperforms Adam: A Curvature Perspective

为什么 Muon 优于 Adam:曲率视角

Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Dirk Bergemann, Zhuoran Yang

发表机构 * National University of Singapore(新加坡国立大学) Yale University(耶鲁大学) University of Minnesota(明尼苏达大学)

AI总结 从曲率视角出发,通过泰勒展开和曲率分解,发现 Muon 因更低的归一化方向锐度(NDS)而比 Adam 实现更大的一步损失下降,数据不平衡和层内曲率是其主要优势来源。

详情
AI中文摘要

Muon 在大语言模型训练中相比 Adam 将训练效率提升约两倍,但这一优势的局部几何来源尚不清楚。我们的工作首次从曲率视角尝试揭开 Muon 优于 Adam 的原因。首先,我们对训练损失曲面应用二阶泰勒近似,表明在匹配验证损失下,Muon 比 Adam 实现更大的一步损失下降。两种优化器的一阶增益相当,但 Muon 始终承受更小的二阶曲率惩罚。其次,我们将该曲率惩罚分解为更新范数的平方和归一化方向锐度(NDS)。我们发现 Muon 和 Adam 的更新范数相当,因此 Muon 更小的曲率惩罚源于更低的 NDS,而非更新尺度。第三,我们研究训练数据和模型结构如何塑造 Muon 的 NDS 优势。使用具有受控不平衡的 Zipf-概率上下文无关文法(PCFG)数据,我们表明数据不平衡放大了 Muon 相对于 Adam 的 NDS 优势。进一步的层内/跨层分解表明,在训练的中后期,Muon 更低的 NDS 主要由更小的层内曲率维持。除了经验证据,我们还分析了具有异质曲率和梯度对齐于高曲率模式的风格化二次问题。我们证明 Muon 通过平衡曲率组间的更新能量,实现了比 GD 更低的平均 NDS;当曲率异质性足够强时,在相同步数后这也产生更低的局部二次损失。

英文摘要

Muon improves training efficiency over Adam in large language-model training by about two times, but the local geometric source of this advantage remains unclear. Our work takes a first step toward demystifying Muon's superiority over Adam from a curvature perspective. First, we apply a second-order Taylor approximation to the training landscape and show that Muon achieves a larger one-step loss decrease than Adam at matched validation loss. The two optimizers have comparable first-order gains, but Muon consistently incurs a smaller second-order curvature penalty. Second, we decompose this curvature penalty into the squared update norm and Normalized Directional Sharpness (NDS). We find that Muon and Adam have comparable update norms, so Muon's smaller curvature penalty is driven by lower NDS, not update scale. Third, we study how training data and model structure shape Muon's NDS advantage. Using Zipf-Probabilistic Context-Free Grammar (PCFG) data with controlled imbalance, we show that data imbalance amplifies Muon's NDS advantage over Adam. A within-/cross-layer decomposition further shows that, in the middle and late stages of training, Muon's lower NDS is mainly sustained by smaller within-layer curvature. Beyond empirical evidence, we analyze stylized quadratic problems with heterogeneous curvature and gradient alignment toward high-curvature modes. We prove that Muon attains a smaller average NDS than GD by balancing update energy across curvature groups; when curvature heterogeneity is sufficiently strong, this also yields lower local quadratic loss after the same number of steps.
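The decomposition described above can be made concrete on a toy quadratic. The sketch assumes NDS is defined as uᵀHu / ‖u‖² for an update u and Hessian H, so that the second-order penalty ½uᵀHu equals ½‖u‖² · NDS. The abstract does not state the formula, so this definition is an assumption. The helper names are hypothetical and the Hessian is an explicit small matrix.

```python
def quad_form(u, H):
    """Compute u^T H u for a dense matrix H given as a list of rows."""
    n = len(u)
    return sum(u[i] * H[i][j] * u[j] for i in range(n) for j in range(n))

def nds(u, H):
    """Normalized directional sharpness (assumed definition): u^T H u / ||u||^2."""
    return quad_form(u, H) / sum(x * x for x in u)

def predicted_change(g, u, H):
    """Second-order Taylor estimate of L(theta + u) - L(theta).

    Returns (total, first_order_term, curvature_penalty), where the penalty is
    written as 0.5 * ||u||^2 * NDS to mirror the decomposition in the abstract.
    """
    first = sum(gi * ui for gi, ui in zip(g, u))
    penalty = 0.5 * sum(x * x for x in u) * nds(u, H)
    return first + penalty, first, penalty
```

On H = diag(4, 1) with gradient (4, 1), a gradient-descent step points mostly along the sharp axis and has NDS of about 3.82, while a step with equal magnitude on both axes has NDS 2.5. This is the kind of gap the abstract attributes to balancing update energy across curvature groups.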

2606.04661 2026-06-04 cs.CL cs.LG

CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts

CRAFT: 成本感知的提示精炼与前沿感知的调优

Shanu Kumar, Shubhanshu Khandelwal, Akhila Yesantarao Venkata, Parag Agrawal, Yova Kementchedjhieva, Manish Gupta

发表机构 * MBZUAI Microsoft(微软)

AI总结 提出CRAFT方法,通过帕累托前沿优化提示的准确性和成本,避免标量化崩溃,在多个基准上实现更广泛的准确-成本权衡。

详情
AI中文摘要

为准确性调优的提示通常变长,每次模型调用都会增加推理成本。最佳的准确-成本权衡取决于任务和预算,因此提示优化是在准确性和提示令牌成本的帕累托前沿上的搜索,而不是针对单个提示。通常的捷径是将目标折叠成加权和,在搜索前固定权衡权重,通常只能恢复前沿的狭窄区域,我们称之为标量化崩溃。我们提出了CRAFT(成本感知的精炼和前沿感知的调优),一种帕累托前沿提示优化器,将目标LLM验证调用视为稀缺资源,并将其分配给乐观候选前沿附近的候选。每轮,互补的面向准确性和面向成本的生成器提出编辑,帕累托差距获取花费每轮的验证预算,NSGA-II保留保持分布广泛的种群。在六个分类和推理基准上,CRAFT保留的前沿同时达到高准确性和低成本区域,而仅准确性、仅成本和加权和基线各自集中在更窄的区域。准确-成本权衡成为搜索后的选择,而不是搜索前的权重。

英文摘要

Prompts tuned for accuracy often grow long, raising inference cost on every model call. The best accuracy-cost trade-off depends on the task and the budget, so prompt optimization is a search over the Pareto front of accuracy and prompt-token cost rather than for one prompt. The usual shortcut, collapsing the objectives into a weighted sum, fixes the trade-off weight before search and often recovers only a narrow region of the front, a failure we call scalarization collapse. We present CRAFT (Cost-aware Refinement And Front-aware Tuning), a Pareto-front prompt optimizer that treats target-LLM validation calls as the scarce resource and allocates them to candidates near the optimistic candidate front. Each round, complementary accuracy-oriented and cost-oriented generators propose edits, Pareto-gap acquisition spends the per-round validation budget, and NSGA-II retention keeps a spread-out population. Across six classification and reasoning benchmarks, CRAFT's retained fronts reach both high-accuracy and low-cost regions, while accuracy-only, cost-only, and weighted-sum baselines each concentrate in narrower regions. The accuracy-cost trade-off becomes a post-search choice, not a pre-search weight.
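The notion of a Pareto front over accuracy and prompt-token cost can be sketched with a simple non-dominated filter. This illustrates only the selection criterion, not CRAFT's acquisition or NSGA-II retention steps. The candidate tuples and the function name are made up for illustration.

```python
def pareto_front(candidates):
    """Return the non-dominated prompts, sorted by cost.

    candidates: list of (name, accuracy, token_cost). A candidate is dominated
    if another one is at least as accurate and at least as cheap, and strictly
    better on one of the two objectives.
    """
    front = []
    for name, acc, cost in candidates:
        dominated = any(
            (a2 >= acc and c2 <= cost) and (a2 > acc or c2 < cost)
            for _, a2, c2 in candidates
        )
        if not dominated:
            front.append((name, acc, cost))
    return sorted(front, key=lambda item: item[2])
```

Keeping the whole front, rather than the maximizer of one weighted sum, is what lets the accuracy-cost trade-off be chosen after the search.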

2606.04660 2026-06-04 cs.CL

LifeSide: Benchmarking Agents as Lifelong Digital Companions

LifeSide: 将智能体作为终身数字伴侣的基准测试

Yuqian Wu, Zhijie Deng, Wei Chen, Junwei Li, Yutian Jiang, Junle Chen, Zhengjun Huang, Qingxiang Liu, Jing Tang, Jiaheng Wei, Yuxuan Liang

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Hong Kong University of Science and Technology(香港科技大学) Tencent(腾讯)

AI总结 针对现有评估无法捕捉终身数字伴侣所需的多会话记忆、用户理解和隐私适应能力的问题,提出LifeSide基准,通过多智能体模拟构建记忆-情感-环境循环,评估模型在记忆追踪、用户理解、隐私控制和情感陪伴方面的表现,发现即使当前记忆基准饱和的模型也无法在长期内维持准确的用户理解和真正的陪伴。

Comments 28 pages, 23 figures, 7 tables

详情
AI中文摘要

终身数字伴侣必须整合跨会话线索,持续更新对用户的理解,并适应不断变化的隐私边界。现有评估未能捕捉到这一点,而是孤立地测试记忆回忆和短期共情。为了弥补这一差距,我们引入了LifeSide,一个以多会话“记忆-情感-环境”循环为中心的基准。通过将用户建模为具有分层档案和事件轨迹的持久世界,LifeSide使用多智能体模拟将环境动态投射到对话中,保留了潜在思想与可观察表达之间的关键差距。在记忆追踪、用户理解、隐私控制和情感陪伴方面评估了2,000个角色和111K个任务,我们的实验结果揭示了一个严峻的现实:即使是在当前记忆基准上饱和的模型,也无法在长期内维持准确的用户理解和真正的陪伴。

英文摘要

Lifelong digital companions must integrate cross-session cues, continually update their understanding of users, and adapt to shifting privacy boundaries. Existing evaluations fail to capture this, testing memory recall and short-term empathy in isolation. To bridge this gap, we introduce LifeSide, a benchmark centered on multi-session Memory-Emotion-Environment loops. By modeling users as persistent worlds with layered profiles and event trajectories, LifeSide uses multi-agent simulation to project environmental dynamics into dialogue, preserving the critical gap between latent thoughts and observable expressions. Evaluating 2,000 personas and 111K tasks across memory tracking, user understanding, privacy control, and emotional companionship, our experimental results reveal a stark reality: even models that saturate current memory benchmarks fail to sustain accurate user understanding and true companionship over long horizons.

2606.04656 2026-06-04 cs.CV cs.AI

Instance-Level Post Hoc Uncertainty Quantification in Object Detection

目标检测中的实例级事后不确定性量化

Chongzhe Zhang, Zifan Zeng, Qunli Zhang, Feng Liu, Zheng Hu

发表机构 * Tsinghua University(清华大学)

AI总结 提出蒙特卡洛广义线性化模型(MC-GLM),用于目标检测中实例级、近似事后不确定性量化,无需重新训练,在nuScenes数据集上验证了有效性。

Comments 7 pages, 2 figures

详情
AI中文摘要

目标检测是自动驾驶的安全关键组成部分。为了安全保证,量化边界框预测中的不确定性至关重要。无需重新训练的事后不确定性量化符合实际部署需求;因此,我们采用拉普拉斯近似。由于需要实例级不确定性,需要多次反向传播的线性化推理方法时间效率不高,而基于采样的方法并非完全事后。我们提出了蒙特卡洛广义线性化模型(MC-GLM),它提供实例级且近似事后的不确定性量化。蒙特卡洛步骤中所需的样本数量是恒定的,与输出实例数量无关,因此可以并行化。在nuScenes数据集上使用CenterPoint检测器的实验验证了我们方法的有效性,所得不确定性表现出良好质量。

英文摘要

Object detection is a safety-critical component of autonomous driving. It is essential to quantify the uncertainty in bounding-box predictions for safety assurance. Post hoc uncertainty quantification without retraining aligns with real-world deployment requirements; therefore, we employ the Laplace approximation. Because instance-level uncertainty is needed, linearized inference methods that require multiple backpropagations are not time-efficient, and sampling-based methods are not fully post hoc. We propose the Monte Carlo generalized linearized model (MC-GLM), which provides instance-level and approximately post hoc uncertainty quantification. The number of samples required in the Monte Carlo step is constant and independent of the number of output instances, so it can be parallelized. Experiments on the nuScenes dataset with the CenterPoint detector validate the effectiveness of our method, and the resulting uncertainties exhibit good quality.