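arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.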

2605.17827 2026-05-19 cs.LG cs.AI

Content-Style Identification via Differential Independence

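Subash Timilsina, Hoang-Son Nguyen, Sagar Shrestha, Xiao Fu

Affiliations: School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA

AI summary: The paper proposes a new structural condition, content-style differential independence (CSDI), that yields identifiability in generative analysis even when content and style may be dependent. It imposes a blockwise orthogonality constraint on the Jacobian subspaces and designs a stochastic regularizer based on numerical Jacobian approximation to support high-dimensional generative models.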

Comments 24 pages, 15 figures, ICML 2026

Abstract

Generative analysis often models multi-domain observations as nonlinear mixtures of domain-invariant content variables and domain-specific style variables. Identifying both factors from unpaired domains enables tasks such as domain transfer and counterfactual data generation. Prior work establishes identifiability under (block-wise) statistical independence between content and style, or via sparse Jacobian assumptions on the nonlinear mixing function, but such conditions can be restrictive in practice. In this work, we introduce content-style differential independence (CSDI), an alternative structural condition requiring that infinitesimal variations in content and style induce orthogonal directions on the data manifold, thereby enabling identifiability even when content and style are dependent and the Jacobian is dense. We operationalize this condition through a blockwise orthogonality constraint on the Jacobian subspaces associated with content and style. To support high-dimensional generative models, we design a stochastic regularizer based on numerical Jacobian approximation, enabling scalable training in settings such as high-resolution image generation. Experiments across multiple datasets corroborate the identifiability analysis and demonstrate practical benefits on counterfactual generation and domain translation.
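The blockwise orthogonality idea in the abstract can be illustrated with a small deterministic toy. The sketch below is not the paper's regularizer (which is stochastic and whose exact form the abstract does not give); it assumes a hypothetical two-variable generator, approximates the Jacobian by central finite differences, and penalizes inner products between content columns and style columns. The function names `mixing`, `entangled`, `numerical_jacobian`, and `cross_block_penalty` are illustrative only.

```python
import math

def mixing(z):
    """Hypothetical toy generator R^2 -> R^3: z[0] is content, z[1] is style."""
    c, s = z
    return [math.sin(c), math.cos(c), s]

def entangled(z):
    """Hypothetical generator whose content and style directions overlap."""
    c, s = z
    return [c + s, c, s]

def numerical_jacobian(g, z, eps=1e-5):
    """Central finite differences; returns one Jacobian column per latent."""
    cols = []
    for j in range(len(z)):
        zp, zm = list(z), list(z)
        zp[j] += eps
        zm[j] -= eps
        gp, gm = g(zp), g(zm)
        cols.append([(a - b) / (2 * eps) for a, b in zip(gp, gm)])
    return cols

def cross_block_penalty(g, z, content_idx, style_idx):
    """Sum of squared inner products between content and style columns."""
    cols = numerical_jacobian(g, z)
    total = 0.0
    for i in content_idx:
        for j in style_idx:
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            total += dot * dot
    return total

z0 = [0.3, 0.7]
penalty_orthogonal = cross_block_penalty(mixing, z0, [0], [1])
penalty_entangled = cross_block_penalty(entangled, z0, [0], [1])
```

For `mixing`, the content and style columns are orthogonal at every point, so the penalty is numerically zero; for `entangled`, the columns share a coordinate and the penalty is positive. In a high-dimensional model one would presumably evaluate such a penalty on random directions rather than on every column, which is where a stochastic variant comes in.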

2605.17826 2026-05-19 cs.CV cs.AI

CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models

Reem Alzahrani, Hassan Alshanqiti, Bushra Bin Hemid, Zaid Alyafeai, Abdelrahman Eldesokey, Bernard Ghanem

Affiliations: KAUST (King Abdullah University of Science and Technology); University of Edinburgh

AI summary: The paper proposes the CounterCount framework, which contrasts factual and counterfactual images to diagnose counting bias in vision-language models, reveals the models' reliance on object-level priors, and introduces a unified attention modulation strategy that improves counterfactual counting accuracy.

Abstract

Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.

2605.17823 2026-05-19 cs.CV cs.AI

Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

Shravan Murlidaran, Ziqi Wen, Sana Shehabi, Miguel P. Eckstein

Affiliations: Psychological & Brain Sciences, University of California, Santa Barbara; Electrical and Computer Engineering, University of California, Santa Barbara; Computer Science, University of California, Santa Barbara

AI summary: The study examines how human fixation patterns form during free viewing and finds that a foveated vision-language model that maximizes scene understanding produces human-like fixations, suggesting that these patterns may be a by-product of optimizing scene understanding.

2605.17822 2026-05-19 cs.CV

Unleashing the Representational Power of Fourier Shapes for Attacking Infrared Object Detection

Yixing Yong, Jian Wang, Ming Lei, Lijun He, Fan Li

Affiliations: School of Information and Communications Engineering, Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an, China; School of Physics, Xi'an Jiaotong University, Xi'an, China

AI summary: The paper proposes a Fourier-shape-based attack on infrared object detection. Learnable Fourier shapes overcome the fundamental trade-off between representational capability and optimization power in earlier shape-based methods, allowing efficient gradient-based optimization of deceptive shapes that let human targets evade detection.

Abstract

Infrared object detection is crucial for perception in autonomous driving and surveillance but remains vulnerable to physical adversarial attacks. Unlike in the RGB domain, where attacks rely on color texture, infrared attacks must manipulate thermal signatures, making the geometric shape of heat-blocking materials the primary adversarial information carrier. Current shape-based methods suffer from a fundamental trade-off between representational capability and optimization power, limiting their attack effectiveness. In this work, we overcome this dilemma by introducing learnable Fourier shapes to the infrared domain. We utilize an end-to-end differentiable framework where a compact set of Fourier coefficients, defining the shape boundary, is analytically mapped to a pixel-space mask via the winding number theorem. This enables efficient gradient-based optimization to generate potent shapes that cause human targets to evade detection. Extensive digital and physical experiments provide a comprehensive evaluation and validate our superior performance. Our resulting physical patch achieves striking robustness, successfully evading detectors across diverse distances, angles, poses, and individuals, and achieves over 88% attack success rate at distances greater than 25m (conf.=0.5). Code is available at https://github.com/Yongyx99/Fourier-shape-attack.

2605.17818 2026-05-19 cs.CV

Evidence-Guided Unknown Rejection for High-Confidence Near-Known Unknowns

Xi Chen, Yingjun Xiao, Gang Fang

AI summary: The paper proposes EGUR-A, which changes the decision from asking whether a sample's score is high enough to asking whether the predicted known class has sufficient evidence to accept the sample, thereby reducing high-confidence false acceptances.

Comments 8 pages, 2 figures, 8 tables

Abstract

Open-set recognition systems face a neglected failure mode: high-confidence near-known unknowns, which lie outside the known label set but are close enough to known classes that a closed-set classifier accepts them with high confidence. We show that this failure is widespread across scalar-threshold methods, including recent post-hoc detectors, and that stronger encoders can amplify rather than remove the risk. We propose EGUR-A, which changes the decision from "is this sample's score high enough?" to "does this predicted known class have sufficient evidence to accept this sample?" EGUR-A combines class-conditional local acceptance evidence with global residual evidence, and selects their relative weight from known-sample statistics without unknown validation data. Across CUB, FGVC-Aircraft, and ImageNet-hard, EGUR-A substantially reduces high-confidence false known acceptance at matched known-rejection operating points. The result is not a stronger threshold; it is a different question: whether a known class is entitled to accept a sample.

2605.17815 2026-05-19 cs.RO cs.AI

Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement

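Hao Lu, Rahul Shome

Affiliations: School of Computing, Australian National University

AI summary: The paper studies stack rearrangement in tabletop settings, augmenting the task planning domain with richer nonprehensile aggregating actions, in particular toppling objects from a stack onto the table. Its core contribution is a novel aggregating gadget for topple that turns candidate task plan computation into a variant of the pebble motion problem. Benchmarks in an IsaacSim physics simulation show a clear advantage in execution speed.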

Comments 8 pages, 7 figures

Abstract

Efficient object manipulation strategies have significant impact in automation applications. In this work, stack rearrangement in tabletop settings is studied, with a focus on augmenting the task planning domain with richer nonprehensile aggregating actions, in particular the toppling of objects from a stack to the table. Toppling can compress long sequences of intermediate relocations. Computed plans need to interleave pick-and-place actions with topple actions as the problem requires. In order to generate the task plan and model an abstraction to compute solutions that include both pick-and-place and topple actions, a novel aggregating gadget for topple is introduced. Using this directed graphical abstraction, candidate task plan computation becomes a variant of the pebble motion problem, treating objects as pebbles. Benchmarks are then reported in an IsaacSim-based physics simulation. Results highlight clear benefits in execution speed over using pick-and-place actions alone. Though this work primarily investigates the topple action, we demonstrate that similar abstractions can model other aggregating actions of interest, like scoop. The current work provides a preliminary, strong indication of the promising benefits of abstractions for rich object interactions in manipulation applications.

2605.17812 2026-05-19 cs.AI

Going Headless? On the Boundaries of Vertical AI Firms

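Muhammad Zia Hydari, Farooq Muzaffar

Affiliations: University of Pittsburgh

AI summary: The paper examines how vertical AI firms in accounting, law, healthcare, procurement, and similar domains traditionally bundled workflow, domain logic, and accountability into a single application, and how general-purpose AI agents are unbundling that package and pushing firms toward "going headless". It argues that the strategy benefits some firms and can be destructive for others, and proposes a three-position taxonomy based on task-accountability regime together with the concept of rule debt.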

Abstract

Vertical AI firms in accounting, law, healthcare, procurement, and similar domains historically bundled workflow, domain logic, and accountability into a single application. General-purpose AI agents are now unbundling that package, prompting founders and investors to advocate "going headless": cede the workflow and interface to agents and expose domain expertise as callable services. This article argues that going headless is correct for some firms and destructive for others, and that the latter often cede their value capture inadvertently through architectural choices that look like interface decisions. This is a boundary question, and the answer turns on distinguishing the interface boundary, which can often move, from the accountability boundary, which often must not. Drawing on Coase's theory of the firm, Eisenmann, Parker, and Van Alstyne's platform envelopment framework, and Teece's analysis of complementary assets and appropriability, the article shows that orchestrators operating through open protocols acquire envelopment power even as technical interoperability improves, and that durable value capture concentrates in cospecialized accountability assets: professional signoff, regulated workflows, evidence trails, and trusted systems of record. The article proposes a three-position taxonomy (component, integrated software platform, dual-track) determined not by sector but by task-accountability regime, and formalizes the construct of rule debt: the future governance, maintenance, and accountability burden that accrues to customer organizations when business rules and professional standards migrate from governed systems into prompts and agent instructions. Four principles follow: decompose by accountability not interface, invert the edges while retaining the core, position rule debt as the customer cost the integrated platform prevents, and avoid single-orchestrator dependence.

2605.17811 2026-05-19 cs.LG cs.AI math.OC

One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer

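Jucheng Shen, Barbara Su, Anastasios Kyrillidis

Affiliations: Rice University

AI summary: The study asks whether a shared-weight recurrent Transformer can develop distinct internal roles without being partitioned into separate modules. Using the Asymmetric Input Recurrence (AIR) architecture, it finds that the model's internal states differentiate into distinct functional roles and relates this differentiation to the model's state dynamics.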

Comments 21 pages, 13 figures, 8 tables

Abstract

Can a shared-weight recurrent Transformer develop distinct internal roles without being partitioned into separate modules? We study this in Asymmetric Input Recurrence (AIR), a minimal two-state reasoning architecture in which the same Transformer model is reused for both updates (per literature, L and H) and the only built-in difference in the update rule is that the encoded input is injected during L-updates but not H-updates. Across Sudoku-Extreme and Maze, decoded rollouts reveal a consistent split: $z_H$ behaves like a fully committed proposal state, whereas $z_L$ retains local uncertainty and shifting intermediate structure. Freeze experiments show that this split is, in practice, related to the model's state dynamics: in Sudoku, freezing $z_H$ reduces $z_L$'s content changes whereas freezing $z_L$ increases $z_H$'s, while in Maze, freezing either state increases content changes in the other state. Ablations show that to induce specialization, the shared model needs to be able to tell the two update types apart, either from input injection asymmetry or from a separate level token. Mechanistically, attention analysis shows that L-updates are consistently more local than H-updates in both Sudoku and Maze. Together, these results show that, in a two-state recurrent setting, a clear state-identity signal can induce stable, related functional roles inside a shared-parameter recurrent Transformer. Code is available at https://github.com/juchengshen/air.

2605.17808 2026-05-19 cs.LG stat.ML

A Unified Framework for Data-Free One-Step Sampling via Wasserstein Gradient Flows

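Chenguang Wang, Tianshu Yu

Affiliations: School of Data Science; The Chinese University of Hong Kong, Shenzhen

AI summary: The paper develops a unified theoretical framework for data-free one-step sampling based on Wasserstein gradient flows. It shows that the velocity field induced by f-divergence objectives admits a universal form, derives from a theory of a soft under-coverage functional a compression-elasticity identity linking divergence choice to the geometry of mass transport, extends the framework to the Log-Variance divergence, and achieves one-step inference through a KDE-based implementation and a normalizing-flow route.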

Abstract

We develop a unified theoretical framework for data-free one-step sampling from unnormalized target distributions based on Wasserstein gradient flows. For a broad class of standard f-divergence objectives, we show that the induced velocity field admits the universal form $\mathbf{V}(x)=w(r(x))\,\beta(x)$, where $\beta(x)=\nabla \log (p(x)/q(x))$ is shared across objectives and $w$ is determined solely by the choice of divergence. This decomposition shows that standard f-divergence drifts share the same asymptotic target distribution $p$ and differ primarily in how they redistribute transient repair effort across under-covered regions. To formalize this distinction, we derive a one-step regional-response theory for a soft under-coverage functional and obtain a compression-elasticity identity that links divergence choice to the geometry of mass transport into under-covered regions. We further extend the framework beyond the f-divergence family to the Log-Variance (LV) divergence, analyze how the reference distribution alters the resulting drift structure, and motivate a practical LV-inspired surrogate for data-free training. Based on this theory, we instantiate the framework with a KDE-based implementation and describe a complementary normalizing-flow route, enabling one-step inference after training. Experiments on multimodal Gaussian-mixture benchmarks are consistent with the theoretical predictions and demonstrate effective one-step sampling on these targets.
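The universal form can be checked by hand under one common convention. The derivation below assumes (this is not stated in the abstract) that the objective is $D_f(q\|p)=\int p\,f(q/p)\,dx$ with density ratio $r=q/p$, and that the flow velocity is the negative gradient of the first variation. The paper's own convention may differ, in which case the expression for $w$ changes accordingly.

```latex
\begin{aligned}
\frac{\delta D_f}{\delta q}(x) &= f'(r(x)), \qquad r(x) = \frac{q(x)}{p(x)}, \\
\mathbf{V}(x) &= -\nabla f'(r(x)) = -f''(r)\,\nabla r = -r\,f''(r)\,\nabla \log r
  = \underbrace{r\,f''(r)}_{w(r)}\;\underbrace{\nabla \log \frac{p(x)}{q(x)}}_{\beta(x)}, \\
\text{reverse KL: } f(r) &= r \log r \;\Rightarrow\; w(r) = 1, \\
\text{forward KL: } f(r) &= -\log r \;\Rightarrow\; w(r) = 1/r, \\
\chi^2\text{: } f(r) &= (r-1)^2 \;\Rightarrow\; w(r) = 2r.
\end{aligned}
```

Under this convention every choice of $f$ moves mass along the same direction $\beta(x)$ and differs only in the scalar weight. For example, the forward-KL weight $1/r = p/q$ is largest where $q$ under-covers $p$, which matches the abstract's reading of divergences as different ways of distributing repair effort.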

2605.17807 2026-05-19 cs.CV cs.AI

Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation

Baoteng Li, Xianghao Zang, Xinran Wang, Xiangyu Na, Zhixiang He, Hao Sun, Chi Zhang, Zhongjiang He, Tianwei Cao, Kongming Liang, Zhanyu Ma

Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Institute of Artificial Intelligence (TeleAI), China Telecom; Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance

AI summary: The paper proposes CGPO, an adaptive curriculum training framework that dynamically adjusts the sampling strategy to improve training efficiency for text-to-image generation while addressing data imbalance in multi-category datasets.

Abstract

Text-to-Image (T2I) generation has achieved remarkable progress in recent years. Meanwhile, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and been successfully applied to T2I tasks. However, the uniform sampling strategy commonly used during training often ignores the match between sample difficulty and the model's current learning capability, leading to low training efficiency. We argue that improving training efficiency requires continuously prioritizing prompts that match the model's evolving capability and remain actively learnable. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt produces a group of images scored by a reward model. We use the variance of group rewards as an online proxy for prompt inconsistency. A higher variance suggests that the model has partially captured the prompt requirements but has not yet achieved stable mastery. Such prompts are more likely to provide useful learning signals, so we increase their sampling probabilities accordingly. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.
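The variance-as-proxy idea can be sketched in a few lines. The code below is a minimal illustration, not CGPO itself: the mapping from variance to probability (variance plus a small floor, then normalization), the `floor` parameter, and the example reward values are all assumptions made here, and the category calibration step is omitted.

```python
import random
from statistics import pvariance

def sampling_probs(group_rewards, floor=1e-3):
    """Turn each prompt's group-reward variance into a sampling probability.

    `floor` (an assumed knob) keeps zero-variance prompts sampleable.
    """
    weights = {p: pvariance(r) + floor for p, r in group_rewards.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

# Hypothetical reward-model scores for a group of images per prompt.
rewards = {
    "mastered": [0.9, 0.9, 0.9, 0.9],   # consistent success, little to learn
    "partial": [0.1, 0.9, 0.2, 0.8],    # inconsistent, likely informative
    "too_hard": [0.0, 0.0, 0.0, 0.0],   # consistent failure, little signal
}
probs = sampling_probs(rewards)
next_prompt = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

Both the consistently solved and the consistently failed prompt receive the same small weight, while the prompt with mixed outcomes dominates the sampling distribution.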

2605.17806 2026-05-19 cs.LG

AMO: Adaptive Muon Orthogonalization

Xinlin Zhuang, Panyi Ouyang, Yichen Li, Jiangming Shi, Yizhang Chen, Shuman Liu, Ying Qian, Weiyang Liu, Haibo Zhang, Imran Razzak

Affiliations: The Chinese University of Hong Kong; Shopee; MBZUAI; East China Normal University; Huazhong University of Science and Technology; Xiamen University

AI summary: The paper studies heterogeneity in the orthogonalization step of the Muon optimizer and proposes Adaptive Muon Orthogonalization, which measures weight geometry and allocates the NS budget accordingly to improve pre-training performance.

Comments preprint, under-review

Abstract

Muon has recently emerged as a competitive alternative to AdamW for large-scale pre-training, with orthogonalization via Newton-Schulz (NS) iterations as its core operation. Existing Muon variants apply a uniform NS schedule to all parameter matrices, overlooking possible differences in orthogonalization difficulty and its impact on performance. Through a systematic empirical study, we show that this per-matrix heterogeneity is pervasive and largely determined by matrix geometry, which evolves dynamically across operator types, training stages, and network depths. As a result, uniform NS schedules can lead to uneven orthogonalization quality across the model. Motivated by these findings, we propose Adaptive Muon Orthogonalization (AMO), an observe-then-commit method that measures weight geometry by operator type early in training and then uses these signals to allocate the NS budget for the remainder of training. AMO delivers consistent improvements over uniform-schedule Muon across standard, prolonged, and continual pre-training, surpassing the strongest baseline by +0.76 on Llama3.1-1.4B and +0.51 on Qwen3-1.7B in average downstream performance of 12 evaluation tasks.
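To make the notion of an NS budget concrete, the sketch below runs the classic cubic Newton-Schulz iteration, X <- 1.5 X - 0.5 X X^T X, in plain Python and reports how far the result is from orthogonal after a given number of steps. This is the textbook iteration, not necessarily the polynomial that Muon or AMO uses, and the helper names are illustrative.

```python
def matmul(a, b):
    """Dense matrix product for lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def newton_schulz(g, steps):
    """Approximate the orthogonal polar factor of g with `steps` iterations."""
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / norm for v in row] for row in g]  # Frobenius scaling: singular values <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

def orthogonality_error(x):
    """Largest absolute entry of X X^T - I."""
    p = matmul(x, transpose(x))
    n = len(p)
    return max(abs(p[i][j] - (1.0 if i == j else 0.0)) for i in range(n) for j in range(n))

grad = [[3.0, 1.0], [1.0, 2.0]]  # hypothetical gradient matrix
err_small_budget = orthogonality_error(newton_schulz(grad, steps=2))
err_large_budget = orthogonality_error(newton_schulz(grad, steps=20))
```

Matrices with small normalized singular values need more steps to reach the same error, which is one way to read the abstract's claim that orthogonalization difficulty depends on matrix geometry.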

2605.17800 2026-05-19 cs.RO cs.AI

Optimal Knock-Pick Planning for Tightly Packed Tabletop Blocks With Parallel Grippers

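Hao Lu, Rahul Shome

Affiliations: School of Computing, Australian National University

AI summary: The paper studies how to plan optimal knock-pick strategies when a parallel gripper cannot obtain enough clearance around an object, introducing a directional knock primitive and minimizing the number of actions.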

Comments Accepted by WAFR 2026, 18 pages, 6 figures

Abstract

Rearranging densely packed tabletop objects is challenging when parallel-gripper picks are infeasible without sufficient clearance around an object. This work studies the problem characteristics for practically motivated settings with uniformly sized blocks placed at planar tabletop grid locations. Since purely prehensile removal can become infeasible, a directional knock primitive is therefore introduced and the optimal knock-pick variant of the problem is formulated. The work proposes a series of abstractions wherein minimal constraining gadgets are covered to identify the necessary knocks. Utilizing a maximum-weight perfect matching on a graphical abstraction yields efficient polynomial-time computation of the optimal plan that minimizes the number of actions. Experiments are reported for increasing grid sizes in synthetic settings as well as in IsaacSim. The theoretical observations provide a promising stepping stone towards rigorously building efficient manipulation strategies that interleave prehensile and non-prehensile actions.

2605.17799 2026-05-19 cs.CV cs.LG

Is Complex Training Necessary for Long-Tailed OOD Detection? A Re-think from Feature Geometry

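Ningkang Peng, Xuanming Chen, Yanhui Gu

Affiliations: Nanjing Normal University

AI summary: The paper revisits long-tailed out-of-distribution detection and proposes a feature-geometry approach that simplifies detection, modifying the Mahalanobis distance computation to improve detection performance.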

Abstract

Long-tailed out-of-distribution (LT-OOD) detection is often addressed with specialized training, including auxiliary out-of-distribution (OOD) data, abstention heads, contrastive objectives, energy losses, or gradient-conflict control. We show that these training mechanisms can obscure a simpler issue: frozen long-tailed representations may already contain useful OOD evidence, but raw Mahalanobis distance is distorted by frequency-coupled feature radius and poorly supported tail covariance. We propose Hyperspherical Pooled Mahalanobis (HPM), a post-hoc detector that normalizes features onto the unit sphere and replaces class-specific covariance with a pooled, ridge-regularized metric while keeping class means as semantic anchors. In CIFAR-LT experiments and an ImageNet-100-LT near-OOD boundary analysis, HPM improves raw Mahalanobis scoring; for Prior-Calibrated ERM (PC-ERM), it raises AUROC from 46.49 to 85.67 on CIFAR-10-LT and from 50.40 to 78.35 on CIFAR-100-LT. This simple PC-ERM+HPM pipeline also achieves the best Log Efficiency Score (LES; 3.08) on CIFAR-100-LT, retaining roughly 95% of the best CIFAR-100-LT AUROC observed among the compared post-hoc scores at substantially lower training-time cost. These results argue for evaluating representation quality, detector geometry, and training complexity as separate factors in LT-OOD detection.

2605.17795 2026-05-19 cs.LG cs.CV

When Accuracy Is Not Enough: Uncertainty Collapse between Noisy Label Learning and Out-of-Distribution Detection

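Ningkang Peng, Jingyang Mao, Runhan Zhou, Peirong Ma, Yanhui Gu

Affiliations: Nanjing Normal University

AI summary: The paper studies uncertainty collapse between learning with noisy labels and out-of-distribution detection. It presents a learner-agnostic ACC-OOD benchmark, shows that high accuracy does not guarantee OOD reliability, and proposes Virtual Margin Regularization to mitigate the problem.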

Abstract

Learning with noisy labels (LNL) is typically benchmarked by closed-set classification accuracy, yet deployment often requires classifiers to reject out-of-distribution (OOD) inputs. We present a learner-agnostic ACC-OOD benchmark that freezes LNL checkpoints and evaluates them with standardized near-/far-OOD routing and post-hoc scores across synthetic and real label noise. The benchmark reveals a recurring failure mode: high closed-set accuracy does not ensure OOD reliability, because low-confidence, misclassified in-distribution samples can overlap the score and feature regions occupied by OOD inputs under noisy training. We term this pathology uncertainty collapse. This structural overlap can make high-accuracy LNL methods lose separability at the ID-error/OOD interface under standard OOD scores. As an intervention, we study Virtual Margin Regularization (VMR), a lightweight repair probe demonstrated mainly with PSSCL that synthesizes boundary virtual outliers on trusted ID batches and widens the energy margin. VMR partially reduces the collapse-induced far-OOD failure without replacing the host objective or sacrificing closed-set accuracy in the tested settings. These results support LNL benchmarks that co-report closed-set generalization, open-world reliability, and structural overlap diagnostics.

2605.17792 2026-05-19 cs.LG physics.geo-ph

HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

Zhi Li, Songkun Yan, Jie Cao, Mofan Zhang, Anjiang Wei, Jinwoong Yoo, Yang Hong

Affiliations: Civil, Environmental, and Architectural Engineering, University of Colorado Boulder; Civil Engineering and Environmental Sciences, University of Oklahoma; Department of Computer Science, University of Oklahoma; Civil and Environmental Engineering, Stanford University; Department of Computer Science, Stanford University; NASA Goddard Space Flight Center

AI summary: The paper investigates whether frontier large language model (LLM) agents can replace human hydrologic modelers in calibrating hydrologic models. It proposes HydroAgent, which is fine-tuned with reinforcement learning with simulation feedback (RLSF) to improve adaptability and accuracy across basins.

Abstract

Calibrating distributed hydrologic models is a critical bottleneck across operational water resources management - streamflow prediction, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands an expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents - Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash - on the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Best-of-twenty-rounds Nash-Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329-40,792 km² ranges from -0.16 (GPT-5.4) to 0.75 (Sonnet 4.6); the ceiling reproduces across all three vendors and capability tiers, with the strongest models concentrating in the 0.65-0.75 band, and no model reaches the human-expert reference except Opus-4.7 on one gauge. We argue this gap is not a parameter-count problem but a domain-grounding problem. We then propose HydroAgent, fine-tuning open-weight Qwen3-4B with supervised fine-tuning on 2,576 expert calibration trajectories and Group-Relative Policy Optimization using NSE as a verifiable reward from online CREST simulations - reinforcement learning with simulation feedback (RLSF). For Earth system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and the multi-modal richness of Earth data - remote sensing, in-situ time series, and forecaster narrative - makes domain agents a leveraged direction for AI in physical science.
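For readers unfamiliar with the reward signal, the Nash-Sutcliffe Efficiency has a standard closed form: one minus the ratio of the squared simulation error to the variance of the observations about their mean. The sketch below computes it in plain Python; the streamflow values are made up for illustration and are not from the paper.

```python
def nash_sutcliffe(observed, simulated):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(observed) / len(observed)
    squared_error = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    obs_spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - squared_error / obs_spread

observed = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical streamflow series
nse_perfect = nash_sutcliffe(observed, observed)
nse_mean_only = nash_sutcliffe(observed, [3.0] * 5)
nse_close = nash_sutcliffe(observed, [1.0, 2.0, 3.0, 4.0, 6.0])
```

A perfect simulation scores 1, a simulation that always predicts the observed mean scores 0, and anything worse than the mean goes negative, which gives context for the reported range of -0.16 to 0.75.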

2605.17790 2026-05-19 cs.AI

STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

STRIDE：一种用于可靠自动方程发现的自反思代理框架

Jiarui Su, Songjun Tu, Bei Sun, Xiaojun Liang

发表机构 * Central South University（中南大学）； Pengcheng Laboratory（鹏城实验室）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）

AI总结本文提出STRIDE框架，通过协调数据感知生成、混合拟合评估、评论器-执行器修复和保持多样性的语义记忆，提升自动方程发现的可靠性；实验表明其在多个LLM骨干模型上提升了准确性、OOD鲁棒性和结构恢复能力。

Comments 23 pages, 15 figures

AI中文摘要

基于LLM的方程发现为从数据中恢复符号定律提供了一条有前景的途径，但许多系统仍依赖以生成为中心的循环：提出候选方程、拟合参数、对结果评分并复用选中的样例。这类循环在拟合不可靠时可能误判有用的骨架，丢弃需要修复的近似正确方程，并积累只能提供有限指导的冗余记忆。我们提出STRIDE，一种自反思智能体框架，通过协调数据感知生成、混合拟合评估、评论器-执行器修复和保持多样性的语义记忆来提高可靠性。通过将拟合得分和候选方程的行为转化为共享反馈，STRIDE使方程能够在闭环发现过程中被提出、评估、改进和复用。在有代表性的符号回归基准和LSR-Synth套件上的实验表明，STRIDE在多个LLM骨干模型上提高了准确性、OOD鲁棒性和结构恢复能力，消融实验和分析证实了其核心组件的贡献。

英文摘要

LLM-based equation discovery offers a promising route to recovering symbolic laws from data, but many systems still rely on generation-centered loops that propose candidates, fit parameters, score results, and reuse selected examples. Such loops can misjudge useful skeletons under unreliable fitting, discard near-correct equations that require repair, and accumulate redundant memories that provide limited guidance. We propose STRIDE, a self-reflective agent framework that improves reliability by coordinating data-aware generation, mixed-fitting evaluation, critic--executor repair, and diversity-preserving semantic memory. By turning fitted scores and candidate behavior into shared feedback, STRIDE enables equations to be proposed, assessed, refined, and reused within a closed-loop discovery process. Experiments on representative symbolic-regression benchmarks and LSR-Synth suites show that STRIDE improves accuracy, OOD robustness, and structural recovery across multiple LLM backbones, with ablations and analyses confirming the contribution of its core components.

2605.17789 2026-05-19 cs.CL cs.AI

SocialMemBench: Are AI Memory Systems Ready for Social Group Settings?

SocialMemBench: AI记忆系统是否准备好应对社交群体环境？

Olukunle Owolabi

发表机构 * Independent Researcher（独立研究者）

AI总结本文提出SocialMemBench，一个面向多方社交群体的AI记忆系统评估基准，基于经人工验证的合成社交群体网络，测试记忆系统处理共享历史、群体规范和成员离开等复杂社交场景的能力。

AI中文摘要

为单用户对话构建的AI助手记忆系统，在应用于多方社交群体场景时会出现特征性的失败。这一差距对当下正在构建的社交助手十分重要：嵌入聊天平台、在群体中行动的智能体，以及主动式个人助理智能体，后者对用户的整体建模必须包含其社交情境。现有记忆基准评估的是双人对话或职场对话；没有一个面向多方社交群体，在这类场景中，记忆必须将事实锚定在共享历史而非职业角色上，区分群体规范与个体例外，并在成员离开后仍能正确归因。我们提出SocialMemBench，这是一个由经人工验证的合成社交群体网络构成的基准，涵盖五种原型（亲密朋友、家庭、休闲娱乐、兴趣社区、熟人网络）和三个群体规模层级（4-30名成员），包含430个角色和7,355个对话轮次，产生覆盖九个问题类别的1,031个问答对。每个类别隔离一种架构能力，五种失败模式（单流混淆、时间状态覆盖、大规模实体合并、跨角色知识缺失、规范-个体混淆）都是可检验的假设；我们的两个研究探针Subject-Mem和SMG为其中两种提供了证据，其余三种仍待研究。在小型网络上，全上下文的Gemini 2.5 Flash参考系统仅达到0.721，而盲评（blind-critic）推理模型的均值为0.98，这表明即使能够完整访问对话，该基准依然相当困难。在全部43个网络上，所评估的四个开源记忆框架（Mem0、LangMem、Graphiti、Cognee）的问题加权得分集中在0.12-0.18区间，95%置信区间相互重叠，远低于未压缩检索参考的0.345和使用相同回答模型的全上下文参考的0.369（GPT-4o-mini）。当前的记忆系统存在可测量的差距。

英文摘要

Memory systems for AI assistants were built for single-user dialogue and fail characteristically when applied to multi-party social group settings. This gap matters for the social assistants being built today: group-acting agents embedded in chat platforms, and proactive personal-assistant agents whose holistic model of a user must include their social context. Existing memory benchmarks evaluate dyadic or workplace dialogue; none targets multi-party social groups, where memory must anchor facts in shared history rather than professional roles, separate group norms from individual exceptions, and correctly attribute even after member departure. We introduce SocialMemBench, a benchmark of human-verified synthetic social group networks across five archetypes (close friends, family, recreational, interest community, acquaintance network) and three group-size tiers (4-30 members), with 430 personas and 7,355 conversation turns, yielding 1,031 QA pairs across nine question categories. Each category isolates an architectural capability, and the five failure modes (single-stream conflation, temporal-state overwrite, entity merging at scale, missing cross-persona knowledge, norm-individual conflation) are testable hypotheses; our two research probes Subject-Mem and SMG provide evidence on two, three remain open. A full-context Gemini 2.5 Flash reference reaches only 0.721 against a blind-critic reasoning-model mean of 0.98 on small networks, indicating the benchmark is genuinely difficult even with complete access to the conversation. Across all 43 networks, the four open-source memory frameworks evaluated (Mem0, LangMem, Graphiti, Cognee) cluster in the 0.12-0.18 question-weighted range with overlapping 95% CIs, well below an uncompressed retrieval reference of 0.345 and a matched-answerer full-context reference of 0.369 (GPT-4o-mini). Current memory systems show a measurable gap.

2605.17787 2026-05-19 cs.LG

Revisiting the Adam-SGD Gap in LLM Pre-Training: The Role of Large Effective Learning Rates

重新审视LLM预训练中Adam与SGD的差距：大有效学习率的作用

Athanasios Glentis, Dawei Li, Chung-Yiu Yau, Mingyi Hong

发表机构 * University of Minnesota（明尼苏达大学）

AI总结本文通过实证和理论分析发现，SGD在LLM预训练中表现较差，很大程度上是因为它无法维持与Adam的大有效学习率相当的学习率；对大有效学习率的需求源于小梯度范数和大权重-梯度比，并在大批次下更加明显。借助简单的裁剪机制，SGD能在大学习率下保持稳定并恢复Adam的大部分性能，实验中验证损失差距从超过50%降至约3.5%。

AI中文摘要

人们普遍认为，在预训练大型语言模型（LLM）时，随机梯度下降（SGD）的表现明显差于Adam等自适应优化器，但这一差距的根本原因仍不清楚。本文将这一差异的很大一部分归因于：SGD无法维持与Adam大得多的有效学习率相当的学习率。通过对LLM预训练动态的实证和理论分析，我们发现训练过程的特点是梯度范数小、权重-梯度比大，这一效应在预训练常用的大批次下更加明显，因而需要这样大的有效学习率。然而，我们发现输出层的梯度幅度在不同token类别之间变得高度不均匀，并且训练中频繁出现大的梯度尖峰。这些效应共同严重限制了SGD可用的学习率。基于这一认识，我们表明，能在大学习率下稳定SGD的简单裁剪机制可使其恢复Adam的大部分性能。在我们的大规模实验中，以1M token的批次大小预训练1B参数的LLaMA模型时，大学习率SGD与Adam之间的验证损失差距从超过50%缩小到仅约3.5%。

英文摘要

It is widely believed that stochastic gradient descent (SGD) performs significantly worse than adaptive optimizers such as Adam in pre-training Large Language Models (LLMs). Yet the underlying reason for this gap remains unclear. In this work, we attribute a large part of the discrepancy to SGD's inability to sustain learning rates comparable to Adam's much larger effective learning rates. Through empirical and theoretical analysis of LLM pre-training dynamics, we identify that training is characterized by small gradient norms and large weight-to-gradient ratios, an effect that becomes more pronounced with larger batch sizes typical in pre-training, necessitating such large effective learning rates. However, we find that output-layer gradient magnitudes become highly uneven across token classes, and that large gradient spikes frequently occur during training. Together, these effects severely restrict the admissible learning rate of SGD. Guided by this understanding, we show that simple clipping mechanisms that stabilize SGD at large learning rates enable it to recover most of Adam's performance. In our large-scale experiments, the validation loss gap between large-learning-rate SGD and Adam shrinks from more than 50% to only about 3.5% when pre-training a 1B-parameter LLaMA model with a 1M-token batch size.
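The abstract mentions "simple clipping mechanisms" without saying which ones. As one common possibility, the sketch below shows global-norm gradient clipping followed by a plain SGD step. The choice of global-norm clipping, the function names, and the numbers are all assumptions made for illustration; the paper's actual mechanism may differ.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def sgd_step(params, grads, lr, max_norm):
    """One SGD update using the clipped gradient (no momentum)."""
    clipped = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)]


# A gradient spike of norm 5 is scaled down to norm 1 before the update,
# so a large learning rate moves the parameters by at most lr * max_norm.
new_params = sgd_step([0.0, 0.0], [3.0, 4.0], lr=10.0, max_norm=1.0)
```

With clipping in place the step length is bounded by `lr * max_norm`, which is one way a large learning rate can be kept usable when occasional gradient spikes occur.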

2605.17780 2026-05-19 cs.CV

Network Knowledge Prior Guided Learning for Data-Efficient Surface Defect Detection

面向数据高效表面缺陷检测的网络知识先验引导学习

Hang-Cheng Dong, Guodong Liu, Dong Ye, Bingguo Liu

发表机构 * School of Instrumentation Science and Engineering, Harbin Institute of Technology, Harbin, China（哈尔滨工业大学仪器科学与工程学院）； Harbin Institute of Technology Suzhou Research Institute, Suzhou, China（哈尔滨工业大学苏州研究院）

AI总结本文提出了一种基于网络知识先验的知识引导损失函数，通过在训练过程中整合模型可解释性，提升数据高效表面缺陷检测的性能和可信赖度。

2605.17777 2026-05-19 cs.CV

Efficient Sparse-to-Dense Visual Localization via Compact Gaussian Scene Representation and Accelerated Dense Pose Estimation

通过紧凑的高斯场景表示和加速的密集姿态估计实现高效的稀疏到密集视觉定位

Zizhuo Li, Songchu Deng, Linfeng Tang, Jiayi Ma

发表机构 * Electronic Information School, Wuhan University（武汉大学电子信息学院）； School of Robotics, Wuhan University（武汉大学机器人学院）； Electronic Information School and the School of Robotics, Wuhan University（武汉大学电子信息学院和机器人学院）

AI总结本文提出了一种高效的视觉定位方法LiteLoc，通过去除冗余的颜色场并压缩稠密位姿估计所用的匹配，显著提升了内存和计算效率，同时保持了定位性能。

Comments IEEE/CAA JAS 2026

AI中文摘要

本文提出LiteLoc，一种基于3D高斯泼溅（3DGS）的新型高效定位器。此前最先进的稀疏到稠密定位器STDLoc展现了出色的定位能力，但存在严重的存储冗余和计算延迟。通过重新审视其设计决策，我们得到两项简单而高效的改进，二者叠加使LiteLoc在内存和计算上都高效得多，同时也更易于训练。一个关键观察是，直接继承自Feature 3DGS的颜色场对定位在功能上没有用处；然而重建高频光度细节需要过多的高斯基元，导致颜色与特征紧密耦合的表示，带来显著的内存开销和次优的特征场优化。为此，我们提出一种无颜色的解耦特征场，仅保留任务必需的特征属性来构建紧凑的高斯场景表示，从而消除约94%的冗余存储，且不损失与定位相关的信息。我们进一步发现，主要的计算瓶颈在于稠密的透视n点（PnP）求解器，其中大多数匹配提供的几何约束已经饱和，带来的精度增益递减。因此，我们提出一种压缩策略，将稠密匹配提炼为占5%的代表性匹配子集，使鲁棒估计获得近19倍的加速，而性能下降可以忽略不计。大量实验表明，LiteLoc在多个场景中超越STDLoc，并具有可观的效率优势，为延迟敏感的视觉定位开辟了令人期待的前景。

英文摘要

This letter presents LiteLoc, a novel and efficient localizer built on 3D Gaussian Splatting (3DGS). The previous state-of-the-art (SoTA) sparse-to-dense localizer, STDLoc, has shown remarkable localization capability but suffers from severe storage redundancy and computational latency. By revisiting its design decisions, we derive two simple yet highly effective improvements that cumulatively make LiteLoc much more efficient in both memory and computation, while also being easier to train. One key observation is that the color field, inherited directly from Feature 3DGS, is functionally useless for localization. Yet, its reconstruction of high-frequency photometric details necessitates excessive Gaussian primitives, resulting in a tightly coupled color-feature representation with significant memory overhead and sub-optimal feature field optimization. To resolve this, we propose a color-free decoupled feature field that constructs a compact Gaussian scene representation by retaining only task-essential feature attributes, thereby eliminating approximately 94% of redundant storage with no loss of localization-relevant information. We further find that the primary computational bottleneck lies in the dense Perspective-n-Point (PnP) solver, where most matches contribute saturated geometric constraints with diminishing accuracy gains. Accordingly, we propose a condensing strategy that distills dense matches into a subset of 5% representative matches, enabling a nearly 19-fold speedup in robust estimation with negligible performance drop. Extensive experiments show that LiteLoc surpasses STDLoc in multiple scenes with considerable efficiency benefits, opening up exciting prospects for latency-sensitive visual localization.

2605.17775 2026-05-19 cs.CL cs.AI

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

在百万笔记规模上系统评估LLM重新表述的合成临床笔记质量

Jinghui Liu, Sarvesh Soni, Anthony Nguyen

发表机构 * Australian e-Health Research Centre, CSIRO, Australia（澳大利亚电子健康研究中心，CSIRO，澳大利亚）； National Library of Medicine, National Institutes of Health, USA（国家医学图书馆，国立卫生研究院，美国）

AI总结本研究系统评估了LLM重新表述的合成临床笔记的质量，包括内在、外在和事实性评估，发现合成笔记在粗粒度任务上保留了核心临床信息和预测效用，但在ICD编码等细粒度任务上丢失细节；按分块重述可以缓解这一问题，但在上下文不完整时会降低事实精确度。研究还发现合成错误以对临床情境的误解为主，并伴有时间混淆、测量错误和捏造的陈述，同时表明这些合成笔记可以有效增强针对罕见ICD代码的特定任务训练。

AI中文摘要

大型语言模型（LLM）可以为多种应用生成或合成临床文本，从改进临床文档到增强临床文本分析。然而，评估通常只关注某一狭窄方面（例如相似性或效用比较），尽管这些方面互为补充，最好并行考察。在本研究中，我们对LLM生成的临床文本进行系统评估，包括对从MIMIC数据库重新表述而来的百万条规模合成临床笔记的内在、外在和事实性评估。分析表明，尽管语言表达发生了很大变化，合成笔记仍保留了核心临床信息和粗粒度任务的预测效用，但在ICD编码这类任务上会丢失细粒度细节。我们表明，按分块而非整篇重述笔记可以大幅缓解这种细节丢失，但代价是在上下文不完整时事实精确度下降。通过事实核查和错误分析，我们进一步发现合成错误以对临床情境的误解为主，并伴有时间混淆、测量错误和捏造的陈述。最后，我们表明这些合成笔记尽管与任务无关，仍可以有效增强针对罕见ICD代码的特定任务训练。

英文摘要

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.

2605.17772 2026-05-19 cs.CV

Towards Universal Physical Adversarial Attacks via a Joint Multi-Objective and Multi-Model Optimization Framework

通过联合多目标和多模型优化框架实现通用物理对抗攻击

Ziyang Liu, Hongyuan Wang, Zijian Wang, Yinxi Lu, Yunzhao Zang, Zhiqiang Yan, Qianhao Ning

发表机构 * Research Center for Space Optical Engineering, Harbin Institute of Technology（哈尔滨工业大学空间光学工程研究中心）； Zhengzhou Research Institute, Harbin Institute of Technology（哈尔滨工业大学郑州研究院）

AI总结本文提出联合多目标与多模型优化框架（JMOF），通过定量相似性分析选择最优的替代模型集合，以缓解物理对抗攻击过拟合于单一替代模型和单一优化目标的问题；该框架通过双层机制平衡攻击效率与深层泛化，并通过正交梯度对齐策略化解跨模型梯度冲突，从而提升攻击效果和跨任务泛化能力。

Comments Under review

AI中文摘要

物理对抗攻击往往过拟合于单一替代模型和单一优化目标。虽然集成攻击可以缓解这一问题，但现有方法在受限的物理纹理空间中面临严重的梯度冲突，显著降低了跨模型可迁移性。为弥合这一差距，本文提出联合多目标与多模型优化框架（JMOF），利用定量相似性分析来选择最优的替代模型集合。在JMOF中，双层机制同时抑制预测输出并展平中间特征分布，在攻击效率与深层泛化之间取得平衡。此外，正交梯度对齐（OGA）策略化解跨模型梯度冲突，将相互排斥的梯度转化为协同的优化方向。大量仿真和真实世界实验表明，JMOF在对抗多种黑盒检测器时优于最先进的基线方法。至关重要的是，JMOF表现出显著的跨视觉任务泛化能力，生成的攻击能够同时欺骗目标检测模型以及语义分割或单目深度估计模型。这项研究拓展了物理对抗攻击的泛化极限，为评估真实部署中视觉AI的脆弱性提供了稳健的框架。

英文摘要

Physical adversarial attacks often overfit single surrogate models and optimization objectives. While ensemble attacks can mitigate this, existing methods struggle with severe gradient conflicts within restricted physical texture spaces, significantly degrading cross-model transferability. To bridge this gap, this paper proposes a Joint Multi-Objective and Multi-Model Optimization Framework (JMOF) that leverages quantitative similarity analysis to select the optimal surrogate model ensemble. Within JMOF, a dual-level mechanism jointly suppresses prediction outputs and flattens intermediate feature distributions, balancing attack efficiency with deep generalization. Additionally, an Orthogonal Gradient Alignment (OGA) strategy resolves cross-model gradient conflicts, transforming mutually repulsive gradients into synergistic optimization directions. Extensive simulated and real-world experiments demonstrate that JMOF outperforms state-of-the-art baselines against diverse black-box detectors. Crucially, JMOF exhibits substantial cross-vision-task generalization, generating attacks capable of simultaneously deceiving object detection and semantic segmentation or monocular depth estimation models. This research advances the generalization limits of physical adversarial attacks, providing a robust framework for evaluating visual AI vulnerabilities in real-world deployments.
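The abstract does not spell out how Orthogonal Gradient Alignment works. To make the idea of "resolving gradient conflicts" concrete, here is a generic projection sketch: when two surrogate-model gradients point in opposing directions (negative dot product), the conflicting component is removed. This is a well-known style of gradient projection, not the paper's OGA, and every name below is a hypothetical chosen for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out_conflict(grad, other):
    """If grad opposes other, drop grad's component along other.

    The returned vector is orthogonal to `other` whenever a conflict
    was detected; non-conflicting gradients are returned unchanged.
    """
    overlap = dot(grad, other)
    norm_sq = dot(other, other)
    if overlap >= 0 or norm_sq == 0:
        return list(grad)
    factor = overlap / norm_sq
    return [g - factor * o for g, o in zip(grad, other)]


# Two toy "per-model" gradients that partially oppose each other.
g_model_a = [1.0, 0.0]
g_model_b = [-1.0, 1.0]
resolved = project_out_conflict(g_model_a, g_model_b)
```

After projection the first gradient no longer pushes against the second, so summing the two no longer cancels progress for either surrogate model.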

2605.17766 2026-05-19 cs.CV

LatentUMM: Dual Latent Alignment for Unified Multimodal Models

LatentUMM: 双重潜在对齐用于统一多模态模型

Yinyi Luo, Wenwen Wang, Hayes Bai, Marios Savvides, Jindong Wang

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； William & Mary（威廉与玛丽学院）

AI总结本文提出LatentUMM，通过构建增强的共享潜在空间，显式对齐映射入和映射出潜在空间的变换，以提高跨模态一致性。实验表明，该方法在多种架构上一致地提升了多模态一致性。

AI中文摘要

统一多模态模型（UMM）通过学习共享的潜在空间，在理解和生成两方面都取得了优异表现，但这两种能力之间往往存在功能不一致。我们观察到，这一问题并非源于缺少共享表示，而是源于映射入和映射出潜在空间的变换之间缺乏显式对齐。因此，生成和重新编码可能沿着不一致的轨迹进行，在模态转换时导致语义漂移。本文提出LatentUMM，一个构建增强共享潜在空间的框架，用于显式对齐这些变换并提高跨模态一致性。LatentUMM包含两个阶段。第一阶段，双重潜在对齐在模态和能力（capacity）两个层面强制一致性：跨模态对齐使用更强的嵌入模型施加结构化的跨模态语义，而双能力对齐在生成和重新编码下强制双向一致性。第二阶段，潜在动态稳定化通过随机潜在推演（rollout）和偏好优化提高鲁棒性，偏向更好地保持语义一致性的轨迹。实验表明，LatentUMM在多种架构上一致地提高了多模态一致性。代码见：https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM。

英文摘要

Unified multimodal models (UMMs) achieve strong performance in both understanding and generation by learning a shared latent space, yet they often exhibit functional inconsistency between these two capabilities. We observe that this issue does not stem from a lack of shared representations, but from the absence of explicit alignment between the transformations that map into and out of the latent space. As a result, generation and re-encoding can follow inconsistent trajectories, leading to semantic drift under modality transitions. In this work, we propose LatentUMM, a framework that constructs an enhanced shared latent space to explicitly align these transformations and improve cross-modal consistency. LatentUMM consists of two stages. First, dual latent alignment enforces consistency at both the modality and capacity levels: cross-modal alignment uses a stronger embedding model to impose structured cross-modal semantics, while dual capacity alignment enforces bidirectional consistency under generation and re-encoding. Second, latent dynamics stabilization improves robustness via stochastic latent rollouts and preference optimization, favoring trajectories that better preserve semantic consistency. Experiments show that LatentUMM consistently improves multimodal consistency across diverse architectures. Code is available at: https://github.com/AIFrontierLab/TorchUMM/tree/main/src/umm/post_training/LatentUMM.

2605.17765 2026-05-19 cs.LG

AURORA: Contextual Orthogonalization for Geometric Representation Learning in Healthcare Foundation Models

AURORA：用于医疗基础模型中几何表示学习的上下文正交化

Yuanyun Zhang, Shi Li

发表机构 * University of the Chinese Academy of Sciences（中国科学院大学）； Columbia University（哥伦比亚大学）

AI总结本文提出AURORA框架，通过上下文潜在几何进行正交化，以解决医疗基础模型中潜在表示的语义模糊和上下文变化不稳定性问题，提升了模型在不同机构分布变化下的鲁棒性和预测性能。

AI中文摘要

近年来的医疗基础模型通过大规模自监督学习取得了强大的预测性能，但其潜在表示常常将生理严重程度、干预强度、观测结构和机构工作流程纠缠在共享的嵌入方向中。这类表示虽然对下游预测有效，但在语义上不透明，并且在上下文偏移下不稳定。我们提出AURORA（Adaptive Uncertainty aware Representations through Orthogonalized Relational Alignment，经由正交化关系对齐的自适应不确定性感知表示），一种基于上下文潜在几何的医疗表示学习新框架。AURORA不是优化单一的统一嵌入流形，而是将表示分解为与不同上下文因素对应的正交语义子空间，并在每个子空间内学习关系一致性目标。这使得到的潜在空间既在语义上解耦，又在几何上可解释。在多个临床预测和检索任务中，AURORA一致地优于重建、对比和自蒸馏基线，同时显著改善了上下文解耦、邻域纯度以及机构分布偏移下的鲁棒性。我们的结果表明，潜在几何本身是医疗基础模型设计的一个重要维度，而按照上下文语义显式地组织表示空间，提供了传统预测压缩目标之外的一个互补方向。

英文摘要

Recent healthcare foundation models have achieved strong predictive performance through large scale self supervised learning, yet their latent representations frequently entangle physiologic severity, intervention intensity, observational structure, and institutional workflow into shared embedding directions. While effective for downstream prediction, such representations remain semantically opaque and unstable under contextual shift. We introduce AURORA, Adaptive Uncertainty aware Representations through Orthogonalized Relational Alignment, a new framework for healthcare representation learning based on contextual latent geometry. Rather than optimizing a single unified embedding manifold, AURORA decomposes representations into orthogonal semantic subspaces corresponding to distinct contextual factors and learns relational consistency objectives within each subspace. This induces latent spaces that are both semantically disentangled and geometrically interpretable. Across multiple clinical prediction and retrieval tasks, AURORA consistently outperforms reconstruction, contrastive, and self distillation baselines while substantially improving contextual disentanglement, neighborhood purity, and robustness under institutional distribution shift. Our results suggest that latent geometry itself constitutes an important axis of healthcare foundation model design and that explicitly structuring representation space according to contextual semantics provides a complementary direction beyond conventional predictive compression objectives.

2605.17762 2026-05-19 cs.AI

Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

表面形式神经稀疏检索：面向工业音乐搜索的鲁棒模糊匹配

Paul Greyson, Zhichao Geng, Wei Zhang, Yang Yang

发表机构 * Amazon（亚马逊）

AI总结本文提出了一种鲁棒的神经稀疏检索系统，通过适配的免推理稀疏检索架构和领域特定的细粒度子词分词策略，提升了工业音乐搜索对拼写错误、字符换位和语音变体的鲁棒性，在吞吐量相当的情况下实现了更高的召回率，且查询编码几乎没有额外延迟。

Comments accepted at SIGIR 2026 industry track

DOI: 10.1145/3805712.3808414

AI中文摘要

在亚马逊音乐的规模下进行音乐搜索面临独特的挑战：由于拼写错误、字符换位和语音变体，查询经常偏离已索引的元数据，而检索系统又必须在严格的毫秒级延迟约束下运行。我们现有的学习检索（learning-to-retrieve）系统，即高置信度索引（HCI），从客户行为中学习查询-实体关联，并依赖持续的“探索”来选择候选。传统的n-gram匹配能够支持这种探索，但语义鲁棒性差、噪声高，限制了系统从长尾查询中学习的能力。在本工作中，我们提出一种鲁棒的神经稀疏检索系统，旨在最大化探索效率。我们将最先进的免推理（inference-free）稀疏检索架构适配到音乐领域，并将其与一种有效的领域特定细粒度子词分词策略相结合。我们的方法利用短token长度约束（最多3个字符），促使模型学习表面形式的鲁棒性而非词汇记忆。通过在离线索引阶段预先计算神经嵌入和词项扩展，在线处理被简化为最少量的分词和IDF加权，使查询编码的延迟开销几乎为零。在600万文档的生产语料库上的评估显示，在吞吐量相当的情况下，总体recall@10达到91.4%（trigram为57.7%）。对HCI反馈循环的模拟表明探索效率有所提高，稳定后的召回率比生产环境中的trigram高0.8%。消融研究表明，性能提升来自我们的稀疏训练方法，而领域特定的预训练为大规模通用预训练提供了一种成本效益高的替代方案。

英文摘要

Music search at the scale of Amazon Music presents a unique challenge: queries frequently deviate from indexed metadata due to misspellings, transpositions, and phonetic variations, yet the retrieval system must operate under strict millisecond-level latency constraints. Our existing learning-to-retrieve system, the High Confidence Index (HCI), learns query-entity associations from customer behavior, relying on continual "exploration" to choose candidates. Traditional n-gram matching enables this exploration but suffers from poor semantic robustness and high noise, limiting the system's ability to learn from long-tail queries. In this work, we present a robust neural sparse retrieval system designed to maximize exploration efficiency. We adapt a state-of-the-art inference-free sparse retrieval architecture to the music domain, combining it with an effective domain-specific granular subword tokenization strategy. Our approach utilizes short-length token constraints (max 3 chars) to enforce the learning of surface-form robustness over lexical memorization. By pre-computing the neural embeddings and term expansions during the offline indexing phase, online processing is reduced to minimal tokenization and IDF weighting, achieving effectively zero latency overhead for query encoding. Evaluations on a 6M-document production corpus show an aggregate 91.4% recall@10 (vs. 57.7% for trigrams) at comparable throughput. Simulation of the HCI feedback loop demonstrates improved exploration efficiency, with +0.8% higher stabilized recall than production trigrams. Ablation studies indicate that our sparse training methodology drives the performance gains, while domain-specific pretraining provides a cost-effective alternative to large-scale general-purpose pretraining.
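To make the trigram baseline and the "tokenization and IDF weighting" step more tangible, here is a small sketch of character-trigram matching with smoothed IDF weights. It illustrates only the classical n-gram side of the comparison, not the neural sparse system; the smoothing formula, the function names, and the two-title catalogue are assumptions for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Lower-cased character n-grams; short strings yield themselves."""
    s = text.lower()
    if len(s) <= n:
        return [s] if s else []
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def idf_table(docs, n=3):
    """Smoothed inverse document frequency for every n-gram in docs."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(char_ngrams(doc, n)))
    total = len(docs)
    return {t: math.log((total + 1) / (c + 1)) + 1.0
            for t, c in doc_freq.items()}


def score(query, doc, idf, n=3):
    """Sum of IDF weights over n-grams shared by query and doc."""
    doc_terms = set(char_ngrams(doc, n))
    return sum(idf.get(t, 0.0)
               for t in set(char_ngrams(query, n)) if t in doc_terms)


# Hypothetical catalogue and a misspelled query.
catalogue = ["beatles", "beyonce"]
weights = idf_table(catalogue)
best = max(catalogue, key=lambda d: score("beatls", d, weights))
```

The misspelled query still shares the trigrams "bea", "eat", and "atl" with the intended title, which is why surface-form overlap tolerates typos that exact token matching would miss.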

2605.17758 2026-05-19 cs.LG

Memisis: Orchestrating and Evaluating Synthetic Data for Tabular Health Datasets

Memisis：协调和评估表格健康数据的合成数据

Nitish Nagesh, Mahdi Bagheri, Arshia Harish Puthran, Pengbao Zhou, Muhjaazee Love, Aadi Sharma, Ian Harris, Amir M. Rahmani

发表机构 * University of California Irvine（加州大学尔湾分校）

AI总结本文提出Memisis工具，通过结合现有合成数据工具、大语言模型和先进评估指标，协调和评估合成数据，以提高下游预测任务和临床决策的质量。

AI中文摘要

合成数据在医疗领域被广泛用于创建与原始数据相似但没有隐私顾虑的数据集。从隐私、效用和公平性三个方面生成并评估合成数据，对于为下游预测任务和临床决策提供高质量数据至关重要。我们提出Memisis，一个编排并评估合成数据的工具，它利用现有的合成数据工具、大语言模型的能力以及最先进的评估指标。该工具为数据生成、验证和评估创建了统一的工作流。用户可以控制训练集大小、训练轮数以及要采样的合成行数。交互式智能体不要求用户调节合成数据的各种参数，而是允许用户说明其合成数据生成目标，工具随后利用现有工具编排工作流并执行必要的评估。在演示中，我们使用一个包含种族和性别相关受保护属性的开源精神分裂症数据集、三种不同的合成器和一个本地语言模型来编排工作流。我们观察到CTGAN、TVAE和GaussianCopula在公平性和效用指标上表现相当。该工作流让用户在数据生成和评估过程中拥有灵活性和控制权。

英文摘要

Synthetic data is widely used in healthcare to create datasets that are similar to original data but without the privacy concerns. Generating and evaluating synthetic data across privacy, utility and fairness is crucial for facilitating high quality data availability for downstream prediction tasks and clinical decision making. We present Memisis, a tool that orchestrates and evaluates synthetic data by leveraging existing synthetic data tools, the power of large language models and state-of-the-art evaluation metrics. Our tool creates a unified workflow for data generation, validation and evaluation. Users have control over the training size, training epochs and the number of synthetic rows to sample. Instead of knobs to tune synthetic data, the interactive agent allows users to specify their synthetic data generation goals and the tool will orchestrate the workflow by leveraging existing tools while performing the requisite evaluation. For the demo, we use an open source schizophrenia dataset with protected attributes related to race and gender, three different synthesizers and a local language model to orchestrate the workflow. We observe that CTGAN, TVAE and GaussianCopula have comparable performance across fairness and utility metrics. The workflow allows users flexibility and control over the data generation and evaluation process.

2605.17757 2026-05-19 cs.LG cs.AI cs.DC cs.PF

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

OSCAR：面向2比特KV缓存量化的离线谱协方差感知旋转

Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu

发表机构 * Together AI ； University of Sydney（悉尼大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出OSCAR方法，通过离线估计注意力感知的协方差结构，实现2位KV缓存量化的高效和准确，同时开发了可部署的系统，提升了LLM服务框架的性能和效率。

Comments 35 pages, 10 figures

AI中文摘要

INT2 KV缓存量化对长上下文LLM服务很有吸引力，但要同时做到准确和可部署仍然困难。Hadamard变换等简单旋转可以减少异常值，但在INT2下仍会出现精度下降，因为它们没有与下游注意力对齐。我们提出OSCAR，一种超低比特KV缓存量化方法，它离线估计注意力感知的协方差结构，并据此推导出用于量化的固定旋转和裁剪阈值。这样，KV量化就与注意力实际使用的协方差结构对齐。更重要的是，我们不仅给出了理论依据，还开发了一个完全可部署的OSCAR系统，其定制的INT2注意力内核与分页KV缓存服务和融合内核流水线保持兼容，可无缝集成到SGLang和vLLM等现代LLM服务框架中。我们在近期的推理模型上评估了该方法，涵盖5个任务，推理轨迹最长达32k token。在Qwen3-4B-Thinking-2507和Qwen3-8B上，OSCAR将与BF16的精度差距分别缩小到3.78和1.42个点，而朴素旋转的INT2精度跌至接近零。我们进一步将OSCAR扩展到Qwen3-32B和GLM-4.7（358B参数），其表现仍与BF16基本相当。在长上下文任务RULER-NIAH（最长128K）上，OSCAR在两个Qwen3模型上都保持稳健，而朴素旋转的INT2则崩溃。在系统层面，OSCAR将KV缓存内存减少约8倍，在相同内存预算下将大批次时的吞吐量提高最多7倍，并且由于内存带宽开销降低，批次大小为1时的解码速度比BF16快最多3倍。

英文摘要

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an Ultra-low-bit KV Cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our methods on recent reasoning models with reasoning traces of up to 32k tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On long context - RULER-NIAH up to 128K, OSCAR remains robust on both Qwen3 models, while naive rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
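The roughly 8x memory figure is consistent with simple arithmetic: 16 bits per BF16 value divided by 2 bits per INT2 value, before any metadata overhead. The sketch below illustrates why a rotation can help a 2-bit quantizer at all, using the fixed 4-point Hadamard rotation that the abstract names as the simple baseline. It is not OSCAR's covariance-aware rotation, and the per-vector min/max quantizer, function names, and input vector are assumptions chosen so the effect is visible.

```python
def hadamard4(x):
    """Orthonormal 4-point Hadamard transform (it is its own inverse)."""
    a, b, c, d = x
    return [(a + b + c + d) / 2, (a - b + c - d) / 2,
            (a + b - c - d) / 2, (a - b - c + d) / 2]


def quantize_int2(values):
    """Uniform 2-bit quantization: four levels between min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 if hi > lo else 1.0
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_int2(codes, scale, lo):
    """Map 2-bit codes back to approximate real values."""
    return [lo + c * scale for c in codes]


def roundtrip_error(x, rotate):
    """Squared error after quantizing x, optionally in the rotated basis."""
    y = hadamard4(x) if rotate else list(x)
    restored = dequantize_int2(*quantize_int2(y))
    if rotate:
        restored = hadamard4(restored)  # undo the rotation
    return sum((a - b) ** 2 for a, b in zip(x, restored))


# One large outlier stretches the quantization range in the raw basis.
vec = [8.0, 1.0, 0.0, 0.0]
err_plain = roundtrip_error(vec, rotate=False)
err_rotated = roundtrip_error(vec, rotate=True)
```

For this hand-picked vector the rotation spreads the outlier across coordinates, so the four levels cover a narrower range and the small entry is no longer rounded away. That behaviour is specific to the example and is not a general guarantee.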

2605.17755 2026-05-19 cs.CL cs.AI

Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes

弥合版本差距：多版本训练提升ICD代码预测，尤其是罕见代码

Jinghui Liu, Anthony Nguyen

发表机构 * Australian e-Health Research Centre, CSIRO（澳大利亚电子健康研究中心，CSIRO）

AI总结本文研究了通过结合不同ICD版本的数据训练版本无关模型的有效性，以解决ICD代码预测中的长尾问题和罕见代码性能瓶颈，实验表明多版本训练在提升罕见代码的微F1指标和频繁代码的宏指标方面均取得显著效果。

AI中文摘要

临床编码将临床文档映射到标准化的医疗代码，这是一项必不可少但耗时的行政任务，可以从自动化中受益。当前的ICD编码模型通常针对某一特定ICD版本的代码进行优化。然而在现实中，ICD体系持续演进，不同版本在不同时期和地区被采用。此外，ICD编码存在长尾问题，罕见代码上的性能可能成为开发可落地模型的瓶颈。我们考察通过合并以不同ICD版本标注的数据来训练版本无关模型是否可行，这可能有助于应对上述挑战。我们在用于ICD-10预测的改进版逐标签注意力模型的训练中加入ICD-9数据，发现尽管存在版本不匹配，与仅使用ICD-10训练相比，加入ICD-9使18K个罕见ICD代码的微平均F1提高了27%。在8K个频繁ICD-10代码上，多版本训练也大幅提升了宏平均指标，而且模型参数少得多。

英文摘要

Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.
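Because the results are reported as micro F1 on rare codes and macro metrics on frequent codes, it may help to see how the two averages differ for multi-label predictions. The sketch below uses the standard definitions; the label names and toy predictions are hypothetical and unrelated to the paper's data.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, labels):
    """Return (micro F1, macro F1) for multi-label sets of codes."""
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    # Micro: pool the counts over all labels, then compute one F1.
    pooled = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*pooled)
    # Macro: compute F1 per label, then average with equal weight.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    return micro, macro


# "A" is a frequent code predicted well; "B" is a rare code that is missed.
gold = [{"A"}, {"A"}, {"A"}, {"B"}]
pred = [{"A"}, {"A"}, {"A"}, set()]
micro, macro = micro_macro_f1(gold, pred, ["A", "B"])
```

Here the micro average stays high (6/7) because the frequent code dominates the pooled counts, while the macro average drops to 0.5 because the missed rare code counts as much as the frequent one.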

2605.17749 2026-05-19 cs.LG stat.ML

Testable and Actionable Calibration for Full Swap Regret

面向完整交换遗憾的可检验且可操作的校准

Konstantina Bairaktari, Lunjia Hu, Huy L. Nguyen, Jonathan Ullman

发表机构 * Department of Computer Science, Aarhus University（奥胡斯大学计算机科学系）； Khoury College of Computer Sciences, Northeastern University（东北大学Khoury计算机科学学院）； Northeastern University（东北大学）

AI总结本文提出了一种新的校准度量标准SCDL，该度量标准在不削弱任何要求的前提下，既可操作又可检验，同时具备连续性和一致性等理想特性，并通过实验验证了其在实际中的优越性能。

AI中文摘要

AI生成的预测越来越多地为关键任务中的决策提供依据，因此必须值得信赖。校准是一种广泛使用的可信度度量，它要求预测与真实频率相符，从而可以被当作给定结果的真实概率来对待。然而，校准的定义十分微妙，如何设计好的校准误差度量一直是近期研究的活跃课题。第一个目标是找到可操作的校准度量，即它能够告诉决策者：当把预测当作真实概率时，他们会损失多少效用，这一损失称为交换遗憾（swap regret）。第二个目标是找到可检验的校准度量，即校准误差可以从少量预测和结果样本中测得。尽管这些都是非常基本的要求，但现有的校准度量没有一个能完全满足这两个性质：它们要么只约束一种较弱的交换遗憾概念，从而放松了可操作性；要么估计误差不是最优的，从而放松了可检验性。我们提出一种新的校准度量，即软分箱校准决策损失（SCDL），并证明它在不削弱任一要求的情况下完全可操作，且能以接近最优的误差率进行检验。此外，SCDL还满足连续性和一致性等其他理想性质。我们还提供了一组实验，证实SCDL相对于其他度量的理论优势在实践中带来了更好的性能。

英文摘要

AI generated predictions increasingly inform decision making in critical tasks, and therefore must be trustworthy. One widely used measure of trustworthiness is calibration, which requires that the predictions match the true frequencies and can be treated like real probabilities of a given outcome. However, defining calibration is subtle, and designing good measures of calibration error has been an active topic of recent research. The first goal is to find calibration measures that are actionable, meaning they can inform decision makers about their utility loss when predictions are treated as true probabilities, which is known as swap regret. The second goal is to find calibration measures that are testable, meaning that calibration error can be measured from a small sample of predictions and outcomes. Although these are very basic requirements, there is no existing calibration measure that fully satisfies both properties, and all existing measures relax actionability by bounding a weaker notion of swap regret, or relax testability by having suboptimal estimation error. We introduce a new calibration measure, Soft-Binned Calibration Decision Loss (SCDL), which we prove is fully actionable without weakening either requirement, and testable with nearly optimal error rate. In addition, SCDL satisfies other desired properties such as continuity and consistency. We also provide a set of experiments confirming that the theoretical advantages of SCDL compared to other measures lead to better performance in practice.
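For readers unfamiliar with calibration error, the sketch below computes the classical hard-binned expected calibration error, which compares the mean prediction in each bin with the observed outcome frequency. This is explicitly not SCDL, whose definition the abstract does not give; the function name, bin count, and toy data are illustrative assumptions.

```python
def binned_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average of |mean prediction - outcome frequency| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Equal-width bins on [0, 1]; p == 1.0 goes into the last bin.
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((p, y))
    total = len(probs)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        error += len(members) / total * abs(mean_pred - frequency)
    return error


# Predicting 0.8 when 4 of 5 outcomes occur is perfectly calibrated.
calibrated = binned_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
# Predicting 0.9 when only 1 of 4 outcomes occurs is overconfident.
overconfident = binned_calibration_error([0.9] * 4, [1, 0, 0, 0])
```

Hard binning makes this quantity jump when a prediction crosses a bin boundary, which is one reason the abstract lists continuity as a desirable property of a calibration measure.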

2605.17748 2026-05-19 cs.CV

Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

通过全局-局部自适应交互释放视觉Transformer在图像质量评估中的潜力

Yu Li, Puchao Zhou, Yachun Mi, Yanfeng Wu, Xiaoming Wang, Shaohui Liu

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Meituan（美团）

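AI summary: This paper proposes a global-local adaptive interaction framework that uses a dual-stream feature extraction mechanism and interactive global-local fusion to improve the accuracy and robustness of image quality prediction while reducing the number of trainable parameters.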

Journal ref Proceedings of the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. [10567]-[10571], 2026

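AI abstract (translated from the Chinese)

In Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images in natural environments remains highly challenging because the distortions are diverse and complex. Existing methods achieve notable accuracy, but their scalability is often limited by the high cost of subjective annotation and the small size of available datasets. Recent large-scale pre-trained vision models bring strong semantic and representational capabilities, yet their use in IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a new framework that makes effective use of pre-trained Vision Transformers through a dual-stream feature extraction mechanism combined with interactive global-local fusion. By retaining both global semantic information and fine-grained local details, our method achieves superior prediction accuracy and robustness with significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our method.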

English abstract

In the field of Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images remains highly challenging due to the diverse and complex distortions present in natural environments. Although existing methods have achieved notable accuracy, their scalability is often constrained by the high cost of subjective annotation and the limited size of available datasets. Recent advances in large-scale pre-trained vision models have introduced powerful semantic and representational capabilities, yet their application to IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a novel framework that effectively harnesses pre-trained Vision Transformers through a dual-stream feature extraction mechanism coupled with interactive global-local fusion. By jointly retaining global semantic information and fine-grained local details, our approach delivers superior prediction accuracy and robustness while requiring significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our approach.
