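arXivDaily daily academic digest: syncs the full arXiv dataset with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering & systems, and other fields.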

2606.01955 2026-06-02 cs.RO cs.CV

WALL-WM: Carving World Action Modeling at the Event Joints

WALL-WM：在事件关节处雕刻世界动作建模

Shalfun Li, Victor Yao, Charles Yang, Truth Qu, Regis Cheng, Ryan Yu, Howard Lu, Newton Von, Vincent Chen, Yohann Tang, Maeve Zhang, Ellie Ma, Gody Li, Sage Yang, Lorien Shu, J. W. Gao, Ethan Chen, Colin Ye, Yu Sun, Elise Mon, PS Zhang, Neo Li, Lily Li, James Wang, Ping Yang, Chris Pan, Lucy Liang, Hang Su, Roy Gan, Hao Wang, Qian Wang

发表机构 * X Square Robot Team（X Square机器人团队）

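AI summary: Proposes WALL-WM, a World Action Model that uses event-level Vision-Language-Action pretraining to resolve the granularity mismatch that fixed-length action chunks create among language, vision, and action. It generalizes across language, scenes, and tasks and reaches state-of-the-art performance in large-scale real-world evaluation.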

AI中文摘要

WALL-WM是一种世界动作模型，它将视频-动作学习从以块为中心的优化转变为以事件为基础的视觉-语言-动作预训练，使用语义连贯的动作事件作为学习的基本单元。现有的WAM通常从多模态或视频基础模型初始化，然后直接基于当前观测和指令优化固定长度的动作块。尽管方便，但这种以块为中心的公式造成了基本的粒度不匹配。语言描述语义目标和事件，视觉通过连续场景动态演变，动作在控制级时间尺度上运行；将三者强制纳入相同的固定长度预测窗口，使得VLA训练变成短视的相关性拟合。WALL-WM通过围绕语义事件组织监督和数据来解决这种不匹配。具体来说，它将基于事件的VLA预训练与由事件级标题和聚类平衡采样构建的数据生态系统配对，从而实现对多样化行为、场景和任务结构的可扩展学习。从相同的事件预训练骨干出发，WALL-WM支持两种互补的推理模式。事件模式消耗下一事件描述并实现可变长度的执行块，而统一模式使用带有阶梯式解码的VLM来调节传统的固定长度块推理，同时保留梯度连续的VLA路径。结合基于Muon优化器的大规模预训练基础设施，WALL-WM为通用WAM提供了实用的规模化方案。实验表明，WALL-WM在语言、场景和任务上广泛泛化，在大规模真实世界泛化评估中达到了最先进的性能。

英文摘要

WALL-WM is a World Action Model that shifts video-action learning from chunk-centric optimization to event-grounded Vision-Language-Action pretraining, using semantically coherent action events as the atomic unit of learning. Existing WAMs commonly initialize from multimodal or video foundation models and then optimize fixed-length action chunks conditioned directly on the current observation and instruction. Although convenient, this chunk-centric formulation creates a fundamental granularity mismatch. Language describes semantic goals and events, vision evolves through continuous scene dynamics, and actions operate at control-level timescales; forcing all three into the same fixed-length prediction window turns VLA training into short-horizon correlation fitting. WALL-WM addresses this mismatch by organizing both supervision and data around semantic events. Specifically, it pairs event-grounded VLA pretraining with a data ecosystem built from event-level captions and cluster-balanced sampling, enabling scalable learning over diverse behaviors, scenes, and task structures. From the same event-pretrained backbone, WALL-WM supports two complementary inference modes. The event mode consumes next-event descriptions and enables variable-length execution chunks, while the unified mode uses a VLM with Staircase Decoding to condition conventional fixed-length chunk inference while preserving a gradient-continuous VLA path. Together with Muon-optimizer-based large-scale pretraining infrastructure, WALL-WM provides a practical scale-up recipe for general-purpose WAMs. Experiments show that WALL-WM generalizes broadly across language, scenes, and tasks, achieving state-of-the-art performance in large-scale real-world generalization evaluation.


2606.01954 2026-06-02 cs.LG stat.ML

Flow-Transformed Implicit Processes for Function-Space Variational Inference

流变换隐式过程用于函数空间变分推断

Luis A. Ortega, Andrés R. Masegosa, Thomas D. Nielsen

发表机构 * Aalborg University（奥尔堡大学）

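AI summary: Proposes Flow-Transformed Implicit Processes (FTIP), which use a normalizing flow to enrich the variational distribution over the combination weights, capturing asymmetric, heavy-tailed, and multimodal posterior structure in function space. The model is optimized with a Black-Box α objective.

Comments 24 pages, 4 figures, 10 tables. Pre-print submitted for revision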

AI中文摘要

隐式过程先验通过灵活的生成机制定义函数上的分布，使其对贝叶斯函数空间建模具有吸引力。然而，使用此类先验进行后验推断具有挑战性，因为其诱导的函数空间分布通常没有闭式解。一种实用策略是使用有限个采样函数的集合来近似先验，然后将后验函数表示为这些样本的学习组合。现有方法通常对组合权重施加高斯变分分布。虽然易于处理，但这种选择限制了可表示的后验不确定性形状，特别是当真实后验是非对称、重尾或多模态时。我们提出流变换隐式过程（FTIP），一种变分推断方法，使这种有限维函数空间近似更具表达力。FTIP不使用高斯分布，而是使用归一化流来定义更丰富的变分分布，从而在保持可处理优化的同时诱导灵活的后验函数分布。我们使用黑盒α目标训练模型，从而能够比较质量覆盖和模式寻找的变分行为。实验表明，FTIP捕获了函数空间中的非对称和多模态后验结构，而高斯系数近似往往会平滑或崩溃这些结构。

英文摘要

Implicit-process priors define distributions over functions through flexible generative mechanisms, making them attractive for Bayesian function-space modelling. However, performing posterior inference with such priors is challenging because their induced function-space distributions are typically not available in closed form. One practical strategy is to approximate the prior using a finite collection of sampled functions, and then represent posterior functions as learned combinations of these samples. Existing approaches commonly place a Gaussian variational distribution over the combination weights. While tractable, this choice limits the shapes of posterior uncertainty that can be represented, especially when the true posterior is asymmetric, heavy-tailed, or multimodal. We propose Flow-Transformed Implicit Processes (FTIP), a variational inference method that makes this finite-dimensional function-space approximation more expressive. Instead of using a Gaussian distribution over the combination weights, FTIP uses a normalizing flow to define a richer variational distribution. This induces a flexible posterior distribution over functions while preserving tractable optimization. We train the model using a Black-Box α objective, allowing us to compare mass-covering and mode-seeking variational behaviour. Experiments show that FTIP captures asymmetric and multimodal posterior structure in function space that Gaussian coefficient approximations tend to smooth or collapse.


2606.01952 2026-06-02 cs.LG

Randomized Least Squares Value Iteration itself is Joint Differentially Private

随机最小二乘值迭代本身是联合差分隐私的

Haiyang Lu, Pratik Gajane, Shaojie Bai, Mohammad Sadegh Talebi

发表机构 * Laboratoire d’Informatique Fondamentale d’Orléans (LIFO), Université d’Orléans（奥尔良基础信息学实验室（LIFO），奥尔良大学）； College of Control Science and Engineering, Zhejiang University（浙江大学控制科学与工程学院）； Department of Computer Science, University of Copenhagen（哥本哈根大学计算机科学系）

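AI summary: Studies privacy protection for the randomized-exploration algorithm RLSVI in tabular MDPs and proves that its intrinsic exploration noise simultaneously provides a joint differential privacy guarantee.

Comments 12 pages, 0 figures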

AI中文摘要

随着强化学习越来越多地应用于医疗和推荐系统等敏感领域，隐私保护技术对于保护用户的敏感信息变得至关重要。我们研究在情节设置下的隐私保护强化学习，重点关注基于随机探索的算法，如随机最小二乘值迭代（RLSVI）。总体目标是研究随机探索如何与隐私机制所需的注入噪声相互作用。在这项工作中，我们展示了一种新的隐私分析，该分析描述了RLSVI中为探索设置的噪声如何同时提供隐私保护。具体来说，我们证明RLSVI在表格MDP中是$(\varepsilon(δ),δ)$-联合差分隐私的，其中$\varepsilon(δ) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/δ)}{H^2\log(2HSA)}}$，$S$和$A$分别是状态和动作的数量，$H$是情节的长度，$K$是情节的数量。

英文摘要

As reinforcement learning (RL) is increasingly applied to sensitive domains such as health care and recommendation systems, privacy-preserving techniques have become essential to protect users' sensitive information. We investigate privacy-preserving RL in an episodic setting, focusing on algorithms based on randomized exploration, such as Randomized Least Squares Value Iteration (RLSVI). The overall goal is to study how randomized exploration interacts with the injected noise required by privacy mechanisms. In this work, we present a new privacy analysis that characterizes how the noise RLSVI injects for exploration simultaneously provides privacy protection. Specifically, we prove that RLSVI, as is, is $(\varepsilon(δ),δ)$-joint differentially private in tabular MDPs with $\varepsilon(δ) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/δ)}{H^2\log(2HSA)}}$, where $S$ and $A$ are the numbers of states and actions respectively, $H$ is the length of an episode, and $K$ is the number of episodes.
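To get a feel for the stated bound, the following minimal sketch evaluates $\varepsilon(δ)$ for given problem sizes. The function name `rlsvi_jdp_epsilon` is hypothetical, and the sketch assumes that $\log$ denotes the natural logarithm, which the abstract does not state.

```python
import math


def rlsvi_jdp_epsilon(S, A, H, K, delta):
    """Evaluate the epsilon(delta) expression quoted in the abstract.

    Assumption: log is the natural logarithm (not stated in the abstract).
    S, A: number of states and actions; H: episode length; K: number of episodes.
    """
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    # Shared term c = 2AK / (H^2 * log(2HSA)); the bound is c + 2 * sqrt(c * log(1/delta)).
    c = 2.0 * A * K / (H ** 2 * math.log(2 * H * S * A))
    return c + 2.0 * math.sqrt(c * math.log(1.0 / delta))


if __name__ == "__main__":
    # Illustrative problem sizes only; they are not taken from the paper.
    print(rlsvi_jdp_epsilon(S=10, A=2, H=100, K=1000, delta=1e-5))
```

Read this way, the expression grows with the number of episodes $K$ and shrinks as the episode length $H$ increases.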


2606.01951 2026-06-02 cs.RO

Co-training with Ego-centric Video and Demonstration for Robot Navigation Task

基于自我中心视频与示范的机器人导航任务协同训练

Shoya Kuno, Yumo Ouchi, Kanata Suzuki

发表机构 * Department of Informatics, Graduate School of Informatics, Kyoto University（信息学系，京都大学研究生院）； Spatial Robotics Research Center, Fujitsu Limited（空间机器人研究中心，富士通有限公司）

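AI summary: Proposes a framework that converts egocentric walking videos into imitation-learning datasets for mobile robots. Co-training a VLA model on human-derived and robot-collected data improves language understanding and action generation.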

AI中文摘要

视觉-语言-动作（VLA）模型在多种机器人任务中展现出潜力，但其性能严重依赖于大规模高质量训练数据，而在真实机器人上收集这些数据成本高昂且耗时。虽然先前的工作已经探索了利用自我中心人类视频来增强操作数据集，但由于运动过程中的视角变化，将此类方法应用于移动机器人导航仍然具有挑战性。在本文中，我们提出了一个框架，将自我中心行走视频转化为移动机器人模仿学习的数据集。该方法从人类视频中估计相机运动，并将其转换为与地面移动机器人兼容的动作表示。通过联合训练基于人类数据和机器人收集数据的VLA模型，该模型在语言理解和鲁棒动作生成方面比单独使用任一数据源训练取得了更好的性能。在水果搜索导航任务上的实验表明，人类自我中心视频为移动机器人学习提供了有效且可扩展的数据源。

英文摘要

Vision-language-action (VLA) models are promising for diverse robotic tasks, but their performance heavily depends on large-scale high-quality training data, whose collection on real robots is costly and time-consuming. While prior work has explored augmenting manipulation datasets with egocentric human videos, applying such approaches to mobile robot navigation remains challenging due to viewpoint changes during locomotion. In this paper, we propose a framework that converts egocentric walking videos into datasets for mobile robot imitation learning. The proposed method estimates camera motion from human videos and transforms it into action representations compatible with ground mobile robots. By jointly training a VLA model on human-derived and robot-collected datasets, the model achieves improved language understanding and more robust action generation than training with either data source alone. Experiments on a fruit-search navigation task demonstrate that human egocentric videos provide an effective and scalable data source for mobile robot learning.


2606.01950 2026-06-02 cs.RO cs.CV cs.LG

Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects

面向刚性物体的学习动作条件与对象中心高斯溅射世界模型

Jens U. Kreber, Lukas Mack, Joerg Stueckler

发表机构 * Intelligent Perception in Technical Systems Group（技术系统智能感知组）

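AI summary: Proposes MRO-GWM, which uses object-centric Gaussian representations and a spatio-temporal transformer to learn action-conditional dynamics of rigid objects in 3D, supporting future-motion prediction in multi-object scenes under partial observation.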

AI中文摘要

世界模型使智能体能够预测其动作对环境的影响。在本文中，我们提出了多刚性物体高斯世界模型（MRO-GWM），一种学习刚性物体在3D中动作条件动力学的新模型。通过用对象中心高斯表示场景，我们可以表示任意物体形状和多物体场景。我们开发了一种新颖的时空变换器架构，该架构根据物体高斯的历史和未来动作预测未来的刚体运动。物体通过其在规范坐标系中的高斯表示，从而可以将物体运动描述为刚体变换。我们的模型在多视角重建上进行训练，这要求模型处理因遮挡导致的物体部分观测。我们分析了该方法在由典型家庭物体组成的合成数据集上的预测性能，这些数据集包含多物体动力学和机器人末端执行器的交互。我们还在模拟中评估了模型在非抓取操作中的模型预测控制性能。

英文摘要

World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene by object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions by a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.


2606.01947 2026-06-02 cs.CV cs.AI

Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks

大型预训练模型在实例分割任务中的参数高效微调

Nermeen Abou Baker, David Rohrschneider, Uwe Handmann

发表机构 * University of Freiburg（弗赖堡大学）

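AI summary: Explores two parameter-efficient fine-tuning methods, adapters and Low-Rank Adaptation (LoRA), for instance segmentation. Fine-tuning only about 1-6% of the parameters yields competitive performance, and using 2-3 adapters per transformer block gives the best balance of performance and efficiency.

Comments Published by the Machine Learning and Knowledge Extraction Journal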

DOI: 10.3390/make6040133
Journal ref: Abou Baker N, Rohrschneider D, Handmann U. Parameter-Efficient Fine-Tuning of Large Pretrained Models for Instance Segmentation Tasks. Machine Learning and Knowledge Extraction. 2024; 6(4):2783-2807

AI中文摘要

近年来，随着大型预训练模型的兴起，人工智能的研究和应用发生了转变，这些模型在众多任务中取得了最先进的结果。然而，参数的大量增加引入了对参数高效训练策略的需求。尽管取得了显著进展，但针对基于Transformer的模型在实例分割任务中的参数高效微调（PEFT）方法的研究仍然有限。为填补这一空白，本研究调查了PEFT方法的有效性，特别是适配器和低秩适应（LoRA），并将其应用于两个模型和四个基准数据集。通过集成顺序排列的适配器模块并将LoRA应用于可变形注意力（本文首次探索），在仅微调约1-6%模型参数的情况下取得了竞争性能，相比传统微调所需的40-55%有显著改进。关键发现表明，每个Transformer块使用2-3个适配器可实现性能与效率的最佳平衡。此外，LoRA在应用于可变形注意力时表现出强大的参数效率，并在某些情况下超越了适配器配置。这些结果表明，PEFT技术的影响因数据集复杂性和模型架构而异，强调了上下文特定调优的重要性。总体而言，这项工作展示了PEFT在实例分割任务中实现可扩展、可定制且计算高效的迁移学习的潜力。

英文摘要

Research and applications in artificial intelligence have recently shifted with the rise of large pretrained models, which deliver state-of-the-art results across numerous tasks. However, the substantial increase in parameters introduces a need for parameter-efficient training strategies. Despite significant advancements, limited research has explored parameter-efficient fine-tuning (PEFT) methods in the context of transformer-based models for instance segmentation. Addressing this gap, this study investigates the effectiveness of PEFT methods, specifically adapters and Low-Rank Adaptation (LoRA), applied to two models across four benchmark datasets. Integrating sequentially arranged adapter modules and applying LoRA to deformable attention (explored here for the first time) achieves competitive performance while fine-tuning only about 1-6% of model parameters, a marked improvement over the 40-55% required in traditional fine-tuning. Key findings indicate that using 2-3 adapters per transformer block offers an optimal balance of performance and efficiency. Furthermore, LoRA exhibits strong parameter efficiency when applied to deformable attention, and in certain cases surpasses adapter configurations. These results show that the impact of PEFT techniques varies based on dataset complexity and model architecture, underscoring the importance of context-specific tuning. Overall, this work demonstrates the potential of PEFT to enable scalable, customizable, and computationally efficient transfer learning for instance segmentation tasks.


2606.01945 2026-06-02 cs.CV

Beyond Low-Rank: Low-Rank Sparse Prompting via Spiking Neural Network and Prompt Factorization

超越低秩：通过脉冲神经网络和提示分解实现低秩稀疏提示

Yumiao Zhao, Bo Jiang, Beibei Wang, Xixi Wan, Xiao Wang, Jin Tang

发表机构 * Information Materials and Intelligent Sensing Laboratory of Anhui Province（安徽省信息材料与智能感知实验室）； Anhui Provincial Key Laboratory of Multimodal Cognitive Computation（安徽省多模态认知计算重点实验室）； School of Computer Science and Technology, Anhui University（安徽大学计算机科学与技术学院）

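AI summary: Proposes LoRSP, a framework that combines the sparse firing mechanism of spiking neurons with low-rank factorization to generate instance-specific sparse visual prompts, giving efficient and robust visual prompt learning.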

AI中文摘要

视觉提示（VP）已成为一种高效范式，通过在输入层引入可学习提示来适应大规模预训练视觉模型到下游任务。然而，现有的VP方法通常采用密集的像素级提示，往往存在冗余扰动、泛化能力有限和能效低的问题。为克服这些限制，我们提出将脑启发脉冲学习融入视觉提示学习任务。我们知道，脉冲神经元可以通过将输入数据转换为离散脉冲序列并返回稀疏输出来进行低成本信息处理。受此启发，我们提出低秩视觉脉冲提示（LoRSP），一种新颖框架，通过脉冲神经元学习机制自然地学习动态低秩稀疏视觉提示。LoRSP的核心思想是利用脉冲神经元的脑启发稀疏发放机制为每个实例生成像素级稀疏提示。具体而言，我们首先通过低秩分解构建一系列提示因子以捕获不同的提示子空间。然后将这些提示因子输入SNN架构，执行整合-发放过程以发射脉冲。因此，我们的LoRSP在保持低秩约束的同时生成稀疏视觉提示。这种设计实现了实例特定的选择性提示，从而在多样化的下游任务中实现更紧凑和鲁棒的适应。在五个异构视觉骨干网络和多个基准上的大量实验表明，与现有VP方法相比，LoRSP在需要更少可调参数的情况下实现了具有竞争力的性能。

英文摘要

Visual Prompting (VP) has emerged as an efficient paradigm for adapting large-scale pre-trained vision models to downstream tasks by incorporating learnable prompts at the input level. However, existing VP methods typically employ dense pixel-level prompts, which often suffer from redundant perturbations, limited generalization, and energy inefficiency. To overcome these limitations, we propose to integrate brain-inspired spiking learning into visual prompt learning tasks. Spiking neurons are known to process information inexpensively by converting input data into discrete spike trains and returning sparse outputs. Inspired by this, we propose Low-Rank visual Spike Prompting (LoRSP), a novel framework that learns dynamic low-rank sparse visual prompts naturally via a spiking neuron learning mechanism. The core idea of LoRSP is to exploit the brain-inspired sparse firing mechanism of spiking neurons to generate a pixel-level sparse prompt for each instance. To be specific, we first construct a series of prompt factors via low-rank factorization to capture distinct prompt subspaces. These prompt factors are then fed into an SNN architecture, which performs the integrate-and-fire process to emit spikes. As a result, our LoRSP generates a sparse visual prompt while maintaining the low-rank constraint. This design enables instance-specific selective prompting, leading to more compact and robust adaptation across diverse downstream tasks. Extensive experiments on five heterogeneous vision backbones and multiple benchmarks demonstrate that LoRSP achieves competitive performance while requiring fewer tunable parameters than existing VP methods.


2606.01940 2026-06-02 cs.CV

SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation

SCAPO: 从单次3D观测中自监督学习类别级关节物体姿态估计

Can Zhang, Gim Hee Lee

发表机构 * Department of Computer Science, National University of Singapore（新加坡国立大学计算机科学系）

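AI summary: Proposes SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint parameters of articulated objects from a single RGB-D image, without ground-truth labels or category-specific models.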

AI中文摘要

现有的从单次3D观测中估计类别级物体关节的方法通常依赖密集监督、多帧输入或CAD模板，并且仍然难以从关节中解耦几何或恢复显式关节参数。我们提出SCAPO，一个自监督框架，从单张RGB-D观测中估计规范几何、刚性部件分割以及关节枢轴、轴和关节状态，无需真实标签或类别特定模型。我们的SCAPO首先使用SE(3)-等变向量神经元自编码器来分解全局姿态并将不同实例对齐到共享规范空间。在此对齐形状上，设计了一个关节感知的混合蒙皮模块来建模部件运动。我们通过观测形状和规范形状之间的循环重建以及可学习规范模板的跨空间对齐来学习这种表示，该模板将共享类别几何与实例特定残差形状解耦。在合成和真实关节物体数据集上的实验表明，我们的SCAPO恢复了一致的部件结构和准确的关节参数，并优于所有自监督基线。

英文摘要

Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. Our SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module is then designed to model part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that our SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines.


2606.01939 2026-06-02 cs.CV

SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video

SAVMap: 基于结构辅助的全景视频大规模2.5D曼哈顿线框视觉映射

Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng

发表机构 * Nokia Bell Labs（诺基亚贝尔实验室）； NYU（纽约大学）

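AI summary: Proposes SAVMap, which combines panoramic video, a semantic segmentation network, and Manhattan-grid geometric constraints to build semantic wireframe maps of warehouse scenes, achieving accurate large-scale 3D reconstruction.

Comments IEEE ICRA 2026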

AI中文摘要

工业环境的精确3D表示能够支持机器人定位和数字孪生生成等任务。我们提出SAVMap，一种仅使用全景视频相机作为传感器输入，生成仓库货架和灯光结构语义线框地图的方法。从沿仓库通道拍摄的全景视频中提取一系列带有货架和天花板视角的校正图像。通过语义分割网络前端，从每张图像中提取一组稀疏的语义结构特征点（例如货架结构的角点、灯光的中心），并在序列中跟踪这些点。通过考虑点之间的真实世界几何关系（如曼哈顿网格），一种受约束的运动恢复结构算法生成构成线框地图的3D点。我们在一个拥有46排货架的仓库中展示了我们方法的可扩展性和准确性，每排货架的面尺寸为55米×7米。从一小时的视频内容中，我们为超过5000个货架元素创建了线框地图，与真实值相比，总体平均绝对误差为4.8厘米。

英文摘要

Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) are extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground truth.


2606.01936 2026-06-02 cs.CL

What to Format and How: A Benchmark and Workflow Approach for Document Formatting

格式化什么以及如何格式化：文档格式化的基准与工作流方法

Shihao Rao, Liang Li, Jiapeng Liu, Tong Lin, Bing Li, Xiyan Gao, Peng Fu, Jing Huang, Can Ma

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（信息工程研究所，中国科学院）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）

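AI summary: For content-aware document formatting, proposes the benchmark DocFormBench and the workflow method DocFormFlow, which decouples target localization from modification execution, improving accuracy while reducing token consumption.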

AI中文摘要

大型语言模型（LLM）的最新进展为自动化文档格式化开辟了新的可能性。然而，现实中的格式化通常需要根据文档内容识别目标。这种内容感知的设置仍然具有挑战性且未被充分探索，主要是由于缺乏专门的评估数据集。为了在现实的内容感知场景中实现评估，我们引入了DocFormBench，这是一个将文本到格式评估扩展到多样化格式化需求的基准，同时提供了准确性和效率的指标。为了减少现有方法在格式化过程中的冗余文档读取，我们提出了DocFormFlow，一种工作流格式化方法，将目标定位与修改执行解耦为“格式化什么”和“如何格式化”。在多个LLM和多模态模型上的大量实验表明，与代表性基线相比，DocFormFlow在减少token消耗的同时持续提高了格式化准确性。进一步的分析表明，精确的目标定位是影响格式化性能的主要因素。我们希望DocFormBench和DocFormFlow能够促进未来朝着更智能、更可靠的文档格式化的研究。

英文摘要

Recent advances in large language models (LLMs) have opened up new possibilities for automated document formatting. However, real-world formatting often requires identifying targets based on document content. This content-aware setting remains challenging and underexplored, primarily due to the lack of dedicated evaluation datasets. To enable evaluation in realistic content-aware scenarios, we introduce DocFormBench, a benchmark that extends Text-to-Format evaluation to diverse formatting requirements, along with metrics for both accuracy and efficiency. To mitigate redundant document reading in existing methods during formatting, we propose DocFormFlow, a workflow formatting method that decouples target localization from modification execution, separating what to format from how to format it. Extensive experiments across multiple LLMs and multimodal models show that DocFormFlow consistently improves formatting accuracy while reducing token consumption compared to representative baselines. Further analysis reveals that precise target localization is the primary factor influencing formatting performance. We hope DocFormBench and DocFormFlow will facilitate future research toward more intelligent and reliable document formatting.


2606.01934 2026-06-02 cs.LG cs.CL

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

HMPO: 用于思维链压缩的混合中位数长度策略优化

Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan

发表机构 * Li Auto Inc.（理想汽车）

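AI summary: Proposes HMPO, a single-stage reinforcement learning framework that combines an adaptive median-based budget, a cosine-decay token reward, and a multiplicative reward formulation. Trained on mathematical data, it achieves 19%-46% token compression with minimal accuracy loss and generalizes to a range of tasks.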

AI中文摘要

大型语言模型通过扩展的思维链推理取得了显著性能，但这一冗长过程带来了大量推理开销。现有的思维链压缩方法面临不灵活的手动长度预算、计算昂贵的多阶段训练流程以及仅适用于小模型的脆弱可扩展性。我们提出HMPO（混合中位数长度策略优化），一种经济高效的单阶段强化学习框架。HMPO通过三个协同组件高效压缩思维链：基于成功轨迹的自适应中位数预算以消除手动调整、用于平滑长度惩罚的余弦衰减令牌奖励，以及通过严格优先考虑答案正确性来大幅减轻琐碎奖励破解的乘法奖励公式。仅在数学数据上训练，HMPO无缝泛化到数学、代码、科学和指令遵循任务。在从9B到122B参数、涵盖密集和混合专家架构的大规模实验中，HMPO实现了19%-46%的令牌压缩，精度下降可忽略，同时与现有的多阶段基线相比大幅降低了训练成本。

英文摘要

Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%-46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.
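The three components named in the abstract can be illustrated with a small sketch. This is not the paper's reward: the abstract gives no formulas, so the decay horizon of twice the budget, the exact cosine form, and the names `cosine_length_reward` and `rollout_rewards` are all assumptions made for illustration.

```python
import math
from statistics import median


def cosine_length_reward(length, budget):
    """Hypothetical cosine-decay length reward.

    Assumption: the reward is 1.0 at length 0 and decays smoothly to 0.0
    once the length reaches twice the budget.
    """
    horizon = max(2.0 * budget, 1.0)
    frac = min(max(length / horizon, 0.0), 1.0)
    return 0.5 * (1.0 + math.cos(math.pi * frac))


def rollout_rewards(lengths, correct):
    """Score a group of rollouts for one prompt.

    The budget is the median length of the correct rollouts, so no manual
    length budget is needed. Multiplying by correctness means a wrong answer
    earns zero reward however short it is.
    """
    successful = [n for n, ok in zip(lengths, correct) if ok]
    if not successful:
        return [0.0] * len(lengths)
    budget = median(successful)
    return [float(ok) * cosine_length_reward(n, budget)
            for n, ok in zip(lengths, correct)]


if __name__ == "__main__":
    # Three correct rollouts of different lengths and one short, wrong rollout.
    print(rollout_rewards([100, 200, 400, 50], [True, True, True, False]))
```

In this sketch the short but incorrect rollout scores zero, which is the behaviour the multiplicative formulation is meant to enforce: length savings only count once the answer is right.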


2606.01933 2026-06-02 cs.CV

3rd Place at CVPR 2026 CASTLE Challenge: Agentic Multi-View Long-Context Video Understanding via Hierarchical Knowledge Graph Retrieval

CVPR 2026 CASTLE挑战赛第三名：基于层次化知识图谱检索的智能多视角长视频理解

Raghad Albusayes, Munirah Alyahya

发表机构 * TAHAKOM（塔哈科姆）

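AI summary: Proposes a training-free agentic framework that uses a Video Knowledge Graph and a hierarchical retrieval index to handle complex spatiotemporal reasoning over large-scale multi-view video, placing third in the CASTLE Challenge.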

AI中文摘要

本文介绍了我们在CVPR 2026 EgoVis研讨会举办的CASTLE 2026挑战赛中的获胜方法，我们的团队在全球获得了第三名。该挑战要求参与者在海量多模态视频流中回答高度复杂的视觉、时空和语言问题，包括视觉计数、动作定位、多视角跟踪和说话者时间推理。底层数据集包含由15个自我和外部摄像头源捕获的超过600小时的同步视频。为了应对这种极端规模和长上下文的需求，我们引入了一种无需训练的智能框架，专门针对长视频理解进行了优化。我们的框架引入了两个核心架构组件：i) 视频知识图谱，映射静态和动态实体、它们的时间关系以及交叉事件，以实现多跳关系推理；ii) 自适应智能工作流，通过层次化检索和索引解决复杂查询。实验结果表明，我们的框架在长上下文多视角流上实现了高零样本推理精度。我们的代码将在https://github.com/RaghadKhaled/CASTLE-Challenge-Framework发布。

英文摘要

This paper presents our winning methodology for the CASTLE 2026 Challenge at the CVPR 2026 EgoVis Workshop, where our team secured third place globally. The challenge tasks participants with answering highly complex visual, spatiotemporal, and verbal questions, including visual counting, action localization, multi-view tracking, and speaker temporal reasoning, within massive, multimodal video streams. The underlying dataset consists of over 600 hours of synchronized footage captured by 15 ego and exo camera sources. To tackle the extreme scale and long-context demands of this environment, we introduce a training-free agentic framework optimized for long-form video understanding. Our framework introduces two core architectural components: i) a Video Knowledge Graph that maps static and dynamic entities, their temporal relationships, and intersecting events to enable multi-hop relational reasoning, and ii) an adaptive agentic workflow that resolves complex queries through hierarchical retrieval and indexing. Empirical results demonstrate that our framework achieves high zero-shot reasoning accuracy on long-context multi-view streams. Our code will be released at https://github.com/RaghadKhaled/CASTLE-Challenge-Framework.


2606.01926 2026-06-02 cs.CL

Mitigating Bias in Locally Constrained Decoding via Tractable Proposals

通过可处理提议缓解局部约束解码中的偏差

Meihua Dang, Linxin Song, Honghua Zhang, Jieyu Zhao, Guy Van den Broeck, Stefano Ermon

发表机构 * Stanford University（斯坦福大学）； University of California, Berkeley（加州大学伯克利分校）； Massachusetts Institute of Technology（麻省理工学院）

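AI summary: To address the sampling bias caused by myopic masking in locally constrained decoding, proposes globally constrained decoding (GCD) proposals built on tensorized finite automata, together with probabilistic GCD (P-GCD) proposals, for use with sequential Monte Carlo. On function calling, keyword-based generation, and SQL generation, they converge faster to the target distribution with significantly fewer particles.

Comments 13 pages, 5 figures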

AI中文摘要

大型语言模型的生成结果往往不符合期望的约束，如JSON模式。现有的局部约束解码（LCD）方法通过短视地掩蔽下一个词元来强制约束，导致采样偏差和性能下降。最近的工作使用序贯蒙特卡洛（SMC）方法来缓解此类偏差，但设计有效的提议分布或势函数仍然是一个关键挑战。在这项工作中，我们提出了一种通用方法来构建从 $p_{\mathrm{lm}}( \cdot \mid \mathrm{constraint})$ 进行SMC采样的提议和势函数。首先，我们证明了以有限自动机形式指定的约束可以张量化以在GPU上高效执行，我们利用这一点构建了全局约束解码（GCD）提议。此外，利用张量化有限自动机与隐马尔可夫模型共享相同电路结构的事实，我们通过电路乘法得到概率全局约束解码（P-GCD）提议，该提议编码了目标分布的逻辑和概率信息。我们在函数调用、基于关键词的生成和SQL生成任务上评估了(P-)GCD。实验表明，在相同的SMC采样设置下，与LCD提议相比，(P-)GCD以显著更少的粒子更快地收敛到目标分布。

英文摘要

Generations from large language models often fail to conform to desired constraints such as JSON schema. Existing locally constrained decoding (LCD) approaches enforce constraints by myopically masking out next tokens, resulting in biased sampling and degradation in performance. Recent work uses sequential Monte Carlo (SMC) methods to mitigate such biases, but designing effective proposal distributions or potential functions remains a key challenge. In this work, we propose a generic approach to construct proposals and potentials for SMC sampling from $p_{\mathrm{lm}}( \cdot \mid \mathrm{constraint})$. First, we show that constraints specified as finite automata can be tensorized for efficient execution on GPUs, which we use to construct globally constrained decoding (GCD) proposals. In addition, leveraging the fact that tensorized finite automata share the same circuit structure as hidden Markov models, we circuit-multiply them to obtain the probabilistic GCD (P-GCD) proposals encoding both logical and probabilistic information about the target distributions. We evaluate (P-)GCD on the tasks of function calling, keyword-based generation, and SQL generation. Experiments show that under the same SMC sampling setup, compared to LCD proposals, (P-)GCD converges faster to the target distribution with significantly fewer particles.
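The bias from myopic masking can be seen in a two-token toy model, sketched below. The vocabulary, probabilities, and constraint are invented for illustration and are not from the paper. The sketch enumerates the exact target $p_{\mathrm{lm}}(\cdot \mid \mathrm{constraint})$, the distribution that local masking actually samples from, and the importance weights that correct it.

```python
# Toy language model over two-token sequences (all numbers are illustrative).
FIRST = {"a": 0.9, "b": 0.1}
SECOND = {"a": {"x": 0.99, "y": 0.01}, "b": {"x": 0.5, "y": 0.5}}


def allowed(seq):
    """Toy constraint: the sequence must end in 'y'."""
    return seq.endswith("y")


def target_distribution():
    """Exact p_lm(sequence | constraint): renormalise the LM over valid sequences."""
    joint = {f + s: FIRST[f] * SECOND[f][s] for f in FIRST for s in SECOND[f]}
    valid = {seq: p for seq, p in joint.items() if allowed(seq)}
    z = sum(valid.values())
    return {seq: p / z for seq, p in valid.items()}


def lcd_distribution():
    """Distribution induced by local masking, plus per-sequence importance weights.

    Both first tokens can still reach a valid sequence, so nothing is masked at
    step 1. At step 2 invalid tokens are masked and the rest renormalised; the
    probability mass that survives the mask is the importance weight.
    """
    probs, weights = {}, {}
    for f, pf in FIRST.items():
        kept = {s: p for s, p in SECOND[f].items() if allowed(f + s)}
        mass = sum(kept.values())
        for s, p in kept.items():
            probs[f + s] = pf * p / mass
            weights[f + s] = mass
    return probs, weights


def reweighted(probs, weights):
    """Importance-weighted version of the masked distribution."""
    unnorm = {seq: probs[seq] * weights[seq] for seq in probs}
    z = sum(unnorm.values())
    return {seq: w / z for seq, w in unnorm.items()}


if __name__ == "__main__":
    probs, weights = lcd_distribution()
    print("target:    ", target_distribution())
    print("local mask:", probs)
    print("reweighted:", reweighted(probs, weights))
```

In this toy case, local masking commits to the first token `a` most of the time even though `a` almost never leads to a valid ending, while the weighted version matches the target. A better proposal reduces the variance of those weights, which is why the abstract treats proposal design as the key challenge.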


2606.01923 2026-06-02 cs.CL cs.LG

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

共振上下文锚定：推理时解耦注意力路由与信号增益

Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen, Tianchen Huang, Zhenhua An, Zetao Chang, Xiayu Sun, Yuheng Min

发表机构 * Xi’an Jiaotong University（西安交通大学）； University of Science and Technology of China（中国科学技术大学）； Tongji University（同济大学）； Tsinghua University（清华大学）

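AI summary: Proposes Resonant Context Anchoring (RCA), which decouples routing logic from information magnitude in self-attention and dynamically amplifies the signal of context tokens at inference time, suppressing parametric hallucinations in large language models and improving factual consistency.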

AI中文摘要

大型语言模型（LLM）在面对与内部参数记忆冲突的输入证据时，经常表现出“上下文忽视”，导致持续的事实幻觉。现有的缓解策略主要依赖于抑制特定神经元激活或使用计算昂贵的对比解码机制，这往往会导致困惑度增加或推理延迟显著升高。为了解决这些局限性，我们从残差流信号动力学的角度提出了一种轻量级的推理时干预方法——共振上下文锚定（RCA）。RCA旨在解决外部证据在深层网络传播过程中的信号衰减问题。其核心机制是在自注意力模块中正交解耦路由逻辑和信息幅度。通过利用原始的softmax前注意力分数作为语义对齐的即时度量，我们通过非线性整流构建动态增益场，选择性地放大上下文令牌对应的值向量的范数，而不改变注意力概率分布。该机制有效提升了残差流混合中输入证据的信噪比（SNR），从而在推理时稳健地将生成轨迹锚定到真实上下文。在Llama-3模型系列上的大量实验表明，RCA在多个事实一致性和强知识冲突任务中显著提高了上下文忠实度，有效抑制了参数化幻觉。此外，结果证实，作为一个无需训练且计算量可忽略的即插即用模块，RCA在保持模型通用语言理解能力的同时，在忠实度和流畅性上实现了帕累托改进。

英文摘要

Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron activations or employing computationally expensive contrastive decoding mechanisms, which often result in increased perplexity or significantly elevated inference latency. To address these limitations, we propose Resonant Context Anchoring (RCA), a lightweight inference-time intervention method grounded in the perspective of residual stream signal dynamics. RCA aims to resolve the signal attenuation of external evidence during its propagation through deep networks. The core mechanism involves the orthogonal decoupling of routing logic and information magnitude within the self-attention module. By utilizing raw pre-softmax attention scores as an instantaneous metric of semantic alignment, we construct a dynamic gain field via non-linear rectification to selectively amplify the norms of value vectors corresponding to context tokens, without altering the attention probability distribution. This mechanism effectively elevates the signal-to-noise ratio (SNR) of input evidence within the residual stream mixture, thereby robustly anchoring the generation trajectory to the truthful context during inference. Extensive experiments on the Llama-3 model series demonstrate that RCA significantly improves contextual faithfulness across multiple factual consistency and strong knowledge-conflict tasks, effectively suppressing parametric hallucinations. Furthermore, results confirm that as a training-free and computationally negligible plug-and-play module, RCA achieves a Pareto improvement in faithfulness and fluency while maintaining the model's general language understanding capabilities.
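The decoupling described above can be sketched for a single attention query. This is an illustrative reading of the abstract, not the paper's implementation: the gain form `1 + alpha * max(0, score)`, the parameter `alpha`, and the function name `anchored_attention` are assumptions. The point shown is that the softmax weights (routing) are left untouched while the value vectors of context tokens are rescaled (magnitude).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def anchored_attention(query, keys, values, is_context, alpha=0.5):
    """Single-query attention with a hypothetical score-dependent value gain.

    Routing: probs come from the usual softmax over scaled dot products.
    Magnitude: each context token's value vector is scaled by a rectified
    function of its raw pre-softmax score; non-context tokens keep gain 1.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    probs = softmax(scores)
    gains = [1.0 + alpha * max(0.0, s) if ctx else 1.0
             for s, ctx in zip(scores, is_context)]
    dim = len(values[0])
    out = [sum(p * g * v[i] for p, g, v in zip(probs, gains, values))
           for i in range(dim)]
    return out, probs, gains


if __name__ == "__main__":
    query = [1.0, 0.0]
    keys = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    is_context = [True, True, False]  # the first two tokens are context evidence
    print(anchored_attention(query, keys, values, is_context))
```

Setting `alpha=0` recovers standard attention, which makes the intervention easy to switch off for comparison.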


2606.01914 2026-06-02 cs.CL cs.CV

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

多模态大语言模型空间推理中空间词汇偏差的机制诊断

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng, Wang Yang, Sudong Cai, Shuyuan Zheng, Akiko Aizawa, Sadao Kurohashi

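Affiliations: Kyoto University; NII LLMC; RIKEN AIP; Case Western Reserve University; The Hong Kong Polytechnic University; The University of Osaka; University of Tokyo

AI summary: This paper finds that multimodal large language models exhibit a spatial lexical bias: adding a spatial relation word to an answer option draws the model toward that option. Mechanistic interpretability tools trace the bias mainly to the language side rather than the visual side, and a lightweight LLM-only DPO update effectively mitigates it.

AI Chinese abstract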

多模态大语言模型（MLLMs）在空间多项选择题上仍不可靠，其失败常归因于视觉信息关注不足。本文识别了一种互补的失败模式——空间词汇偏差：向答案选项添加空间关系词会吸引模型决策，使新添加的选项更可能被选中。使用九个开放权重的MLLMs，我们证明该现象广泛存在。特别地，模型能正确回答二元空间问题，但一旦向答案集添加第三个空间选项，模型便持续选择错误的第三选项。我们将这种二元稳定但三元脆弱的案例隔离为诊断示例，并利用机制可解释性工具，揭示失败的主要原因来自语言侧而非视觉侧：视觉注意力分析和残差流探针表明，在这些失败中，正确的空间关系在内部仍然可用，而不相关选项控制、激活修补和稀疏组件干预将偏差追溯到特定的LLM侧通道和神经元。基于此发现，我们证明在微小的单对象对合成数据上进行轻量级仅LLM的DPO更新可缓解偏差，在合成数据上将四路鲁棒准确率提升高达100个百分点，在更广泛的评估数据集WhatsUp、SpatialMQA-Direct和VSR上分别提升68.0、32.6和20.1个百分点。

English abstract

Multimodal large language models (MLLMs) remain unreliable on spatial multiple-choice questions, and their failures are often attributed to poorly attended visual information. In this work, we identify a complementary failure mode, spatial lexical bias: adding a spatial relation word to the answer options can attract the model's decision and make the newly added option more likely to be selected. Using nine open-weight MLLMs, we show that this phenomenon is widely observed. In particular, models can answer a binary spatial question correctly, yet consistently select an incorrect third spatial option once it is added to the answer set. We isolate such binary-stable but ternary-fragile cases as diagnostic examples and leverage mechanistic interpretability tools, revealing that a substantial part of the failure originates on the language side rather than the visual side: visual attention analyses and residual-stream probes show that the correct spatial relation remains internally available in these failures, while irrelevant-option controls, activation patching, and sparse component interventions trace the bias to specific LLM-side channels and neurons. Based on this finding, we show that a lightweight LLM-only DPO update on tiny single-object-pair synthetic data mitigates the bias, lifting four-way robust accuracy by up to 100 points on synthetic data, and by 68.0, 32.6, and 20.1 points on the broader evaluation datasets WhatsUp, SpatialMQA-Direct, and VSR.
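
For readers unfamiliar with the DPO update mentioned above, the standard per-pair DPO objective can be sketched as below. The loss formula is the usual one; how the paper builds its preference pairs is not stated in the abstract, so treating the correct spatial option as "chosen" and the lexically attractive distractor as "rejected" is an assumption, as are the function name and the default `beta`.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*     : sequence log-probabilities under the policy being updated
    ref_logp_* : the same quantities under the frozen reference model
    beta       : temperature on the implicit reward (value here is illustrative)
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is log 2; it falls below that as the policy shifts probability toward the chosen answer relative to the reference.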

2606.01912 2026-06-02 cs.AI

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

SMH-Bench：用于智能家居中环境基础推理与行动的LLM代理基准测试

Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu, Zecheng Sheng, Yi Gu, Weipeng Ming, Lei Xue, Chen Liu, Sen Hu, Ronghao Chen, Siyue Lin, Yuqing Hou, Xiaofeng Mou, Yi Xu

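Affiliations: Midea Group; Beijing University of Posts and Telecommunications; Donghua University; The University of Sydney; Peking University

AI summary: Proposes the SMH-Bench benchmark, built on the executable simulator HomeEnv, which evaluates LLM reasoning and action in smart homes across 1,100 tasks and finds that frontier models fall short in automation scheduling, ambiguity handling, and personalized reasoning.

AI Chinese abstract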

智能家居正朝着复杂的、依赖于状态的生活环境发展，需要大型语言模型（LLM）对用户意图、偏好和多设备交互进行推理。然而，现有的智能家居基准通常侧重于静态的指令到API映射或有限的模拟，未能评估LLM是否能够在现实家庭场景中可靠地进行推理、交互和行动。为了解决这些局限性，我们引入了SMH-Bench，这是一个用于评估智能家居环境中LLM的全面基准。基于可执行且可验证的智能家居模拟器HomeEnv，SMH-Bench包含1100个高质量任务，涵盖7个类别和22个细粒度子类别。它进一步将任务分层为简单、中等和复杂家庭，范围从小型公寓到拥有135个设备的密集多房间环境。实验表明，尽管前沿LLM在显式控制和查询任务上表现强劲，但在自动化任务调度、模糊处理和个性化推理方面仍存在显著弱点，尤其是在家庭复杂性增加时。我们希望SMH-Bench能够促进更可靠、上下文感知且实际可部署的智能家居代理的发展。

English abstract

Smart homes are evolving toward complex state-dependent living environments, requiring Large Language Models (LLMs) to reason over user intent, preferences, and multi-device interactions. However, existing smart-home benchmarks often focus on static instruction-to-API mapping or limited simulations, failing to evaluate whether LLMs can reason, interact, and act reliably in realistic household scenarios. To address these limitations, we introduce SMH-Bench, a comprehensive benchmark for evaluating LLMs in smart-home environments. Built upon HomeEnv, an executable and verifiable smart-home simulator, SMH-Bench contains 1,100 high-quality tasks spanning 7 categories and 22 fine-grained subcategories. It further stratifies tasks across simple, medium and complex homes, ranging from small apartments to dense multi-room environments with 135 devices. Experiments show that although frontier LLMs achieve strong performance on explicit control and query tasks, they still exhibit significant weaknesses in automation task scheduling, ambiguity handling and personalized reasoning, especially as home complexity increases. We hope SMH-Bench will facilitate the development of more reliable, context-aware, and practically deployable smart-home agents.

2606.01911 2026-06-02 cs.CV

Residual Decoder Adapter: ID-Preserving Tokenizer Adaption for Autoregressive Text Rendering

残差解码器适配器：用于自回归文本渲染的身份保持分词器适配

Dongxing Mao, Jinpeng Wang, Jiahao Tang, Kevin Qinghong Lin, Linjie Li, Zhengyuan Yang, Lijuan Wang, Min Li, Jingru Tan

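Affiliations: Central South University; University of Oxford; Microsoft Research

AI summary: Proposes the Residual Decoder Adapter (RDA), which introduces a paired codebook and a parallel branch that learns pixel-space residuals, substantially improving text rendering without retraining the tokenizer or the autoregressive model.

Comments: CVPR 2026 poster

AI Chinese abstract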

视觉自回归（AR）模型通过预测由视觉分词器解码的离散标记来生成图像。尽管展示了强大的整体图像生成能力，但在文本渲染方面仍表现不佳，出现模糊笔画和破坏字母形状。在这项工作中，我们将这一限制追溯到视觉分词器，它难以重建细粒度细节。改进分词器直接但昂贵，因为它需要重新训练分词器和AR模型。我们能否在不重新训练现有分词器和AR模型的情况下提高AR模型的文本渲染性能？为实现这一目标，我们提出了残差解码器适配器（RDA），它在不改变标记空间的情况下事后升级现有分词器。具体来说，它通过引入两个新颖组件来细化视觉分词器的解码器输出：（i）一个与原始标记分布共享的配对码本；（ii）一个并行分支，用于学习像素空间中重建图像与真实图像之间的微小差异（残差）。这种残差设计使我们能够非侵入性地增强分词器，同时保持与先前AR模型的兼容性。RDA大幅提升了文本渲染性能。例如，在具有竞争力的TextAtlas基准测试上，我们使微调后的Janus-Pro OCR准确率从24.52%提高到58.26%（TextVisionBlend），从12.75%提高到36.81%（StyledTextSynth）。代码可在https://github.com/CSU-JPG/RDA获取。

English abstract

Visual Autoregressive (AR) models generate images by predicting discrete tokens that are decoded by a visual tokenizer. Despite demonstrating strong overall image generation ability, they still underperform on text rendering, producing blurred strokes and disrupted letter shapes. In this work, we trace this limitation to the visual tokenizer, which struggles to reconstruct fine-grained detail. Improving the tokenizer is straightforward but expensive, as it necessitates retraining both the tokenizer and the AR model. Can we improve the text rendering performance of AR models without retraining the existing tokenizer and AR model? To achieve this, we propose the Residual Decoder Adapter (RDA), which upgrades an existing tokenizer post hoc without changing its token space. Specifically, it refines the decoder output of the visual tokenizer by introducing two novel components: (i) a paired codebook that shares the token distribution with the original one; (ii) a parallel branch that learns the small differences (residuals) between the reconstructed image and the ground-truth image in pixel space. This residual design allows us to enhance the tokenizer non-invasively while preserving compatibility with prior AR models. RDA improves text rendering by a large margin. For instance, on the competitive TextAtlas benchmark, it raises the OCR accuracy of a finetuned Janus-Pro from 24.52% to 58.26% (TextVisionBlend) and from 12.75% to 36.81% (StyledTextSynth). The code is available at https://github.com/CSU-JPG/RDA

2606.01908 2026-06-02 cs.LG cs.CV

Private and Stable Test-Time Adaptation with Differential Privacy

具有差分隐私的私有且稳定的测试时自适应

Zefeng Li, Qiaoyue Tang, Mathias Lecuyer, Evan Shelhamer

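Affiliations: University of California, Berkeley

AI summary: Casts several test-time adaptation methods into differentially private forms that protect test-data privacy through per-sample gradient clipping and Gaussian noise, balancing privacy and accuracy on ImageNet-C, and finds that the clipping mechanism can improve the accuracy and stability of continual adaptation.

Comments: ICML 2026

AI Chinese abstract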

测试时自适应（TTA）可以通过在推理过程中更新模型来减少在新数据上的误差。然而，这些更新引发了关于测试数据隐私的问题，因为模型参数现在依赖于所有过去的输入。为了控制这种隐私风险，我们将多种流行的TTA方法（Tent、EATA、SAR、DeYO和COME）转化为差分隐私（DP）形式，对所有更新应用逐样本梯度裁剪和高斯噪声。在ImageNet-C上，我们的DP-TTA方法在精度损失较小的情况下提供了足够的隐私，并且在低隐私机制下，DP的裁剪机制甚至可以改善连续设置中自适应的准确性和稳定性。这些对隐私和精度的改进仅带来适度的计算开销。这些关于私有TTA的初步结果提高了对该问题的认识，为开发更私密的测试时更新提供了信息，并确定了逐样本裁剪作为提高自适应准确性和稳定性的有效技术。

English abstract

Test-time adaptation (TTA) can reduce error on new and different data by updating the model on these inputs during inference. However, these updates raise the issue of privacy w.r.t. the testing data, because the model parameters now depend on all past inputs. To control this privacy risk, we cast multiple popular TTA methods (Tent, EATA, SAR, DeYO, and COME) into differential privacy (DP) forms that apply per-sample gradient clipping and Gaussian noise for all updates. On ImageNet-C, our DP-TTA methods provide adequate privacy at small cost to accuracy, and in the low-privacy regime the clipping mechanism of DP can even improve the accuracy and stability of adaptation in the continual setting. These improvements to privacy and accuracy come at only modest computational overhead. These first results on private TTA raise awareness of the issue, inform the development of more private test-time updates, and identify per-sample clipping as an effective technique for improving the accuracy and stability of adaptation.
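
The two ingredients named above, per-sample gradient clipping and Gaussian noise, can be sketched as a generic DP-SGD-style aggregation step. This is a minimal illustration under assumed names (`dp_mean_gradient`, `clip_norm`, `noise_multiplier`); it is not the paper's implementation, and it omits privacy accounting and the TTA losses themselves.

```python
import math
import random

def dp_mean_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each sample's gradient to clip_norm, sum, add noise, then average.

    per_sample_grads: list of gradient vectors, one per test input
    clip_norm:        maximum L2 norm allowed for any single sample's gradient
    noise_multiplier: noise std is noise_multiplier * clip_norm
    rng:              a random.Random instance (passed in for reproducibility)
    """
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for g in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in g))
        # Scale down only the gradients whose norm exceeds the clipping bound.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for k in range(dim):
            total[k] += scale * g[k]
    sigma = noise_multiplier * clip_norm
    n = len(per_sample_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]
```

Setting `noise_multiplier` to 0 isolates the clipping effect, which is the part the abstract credits with improving accuracy and stability in the low-privacy regime.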

2606.01906 2026-06-02 cs.AI

Bayesian Spectral Emotion Transition Discovery from Multi-Annotator Disagreement

贝叶斯谱情感转移发现：来自多标注者分歧

Keito Inoshita, Takato Ueno

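Affiliations: Keio University; National Institute of Advanced Industrial Science and Technology

AI summary: Proposes Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework that mines emotion-transition structure from multi-annotator soft labels and separates inertia and contagion components via spectral decomposition, with consistency with psychological theory verified on the EmotionLines dataset.

AI Chinese abstract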

情感通过对话的动态过程演变，理解其转移结构对于从心理健康筛查到对话系统等应用至关重要。然而，现有研究通常通过多数投票将多评分者判断压缩为单个硬标签，丢弃了理解轮次间转移所需的不确定性信号。本文提出贝叶斯谱情感转移发现（BSETD），一个从多评分者软标签中发现情感转移结构的两阶段框架。第一阶段，通过软标签的外积构建层次狄利克雷-多项后验，为K×K转移矩阵的每个单元配备可信区间和Benjamini-Hochberg（BH）错误发现率（FDR）控制的显著性。第二阶段，对称图拉普拉斯矩阵经谱分解，分离出低频（惯性）和高频（传染）成分。在EmotionLines上，BSETD同时恢复了两个不同情感空间的标志：Plutchik相邻的转移——厌恶到愤怒（log2提升+0.94）和愤怒到厌恶（+0.86）被过度表示，而Russell效价反转的转移——快乐到愤怒（-0.90）和愤怒到快乐（-0.89）被欠表示。五源跨语料验证得到英语内成对皮尔逊相关0.91-0.98，与中文M3ED对比0.79-0.85，以及同一话语集上人类硬标签与LLM虚拟软标签之间0.979的相关性，表明保留标注者不确定性的流程将情感动态的计算研究与既有的心理学理论联系起来。

English abstract

Emotions evolve through the dynamics of conversation, and understanding their transition structure is foundational to applications ranging from mental-health screening to dialogue systems. However, existing studies typically compress multi-rater judgments into a single hard label by majority voting, discarding the uncertainty signal needed to understand turn-to-turn transitions. In this article, we propose Bayesian Spectral Emotion Transition Discovery (BSETD), a two-stage framework that discovers emotion-transition structure from multi-rater soft labels. In the first stage, a hierarchical Dirichlet-Multinomial posterior is constructed through the outer product of soft labels, equipping each cell of the K×K transition matrix with a credible interval and Benjamini-Hochberg (BH) false discovery rate (FDR)-controlled significance. In the second stage, the symmetrized graph Laplacian is spectrally decomposed to separate a low-frequency (inertia) component from a high-frequency (contagion) component. On EmotionLines, BSETD simultaneously recovers the signatures of two distinct affective spaces: the Plutchik-adjacent transitions disgust to anger (log2 lift +0.94) and anger to disgust (+0.86) are over-represented, while the Russell-valence-reversed transitions joy to anger (-0.90) and anger to joy (-0.89) are under-represented. A five-source cross-corpus validation yields pairwise Pearson correlations of 0.91-0.98 within English, 0.79-0.85 against Chinese M3ED, and 0.979 between the human hard labels and the LLM virtual soft labels on the same utterance set, demonstrating that a pipeline preserving annotator uncertainty bridges the computational study of emotion dynamics with established psychological theory.
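
The first-stage idea of accumulating transition evidence from outer products of consecutive soft labels, and the log2 lift figures quoted above, can be illustrated with a small sketch. The lift definition used here (observed joint probability divided by the product of its marginals) is a common convention and an assumption about the paper's metric; the Dirichlet-Multinomial posterior, credible intervals, and BH correction are omitted.

```python
import math

def soft_transition_counts(soft_labels):
    """Accumulate expected transition counts from consecutive soft labels.

    soft_labels: list of per-utterance probability vectors over K emotions.
    Returns a K x K matrix whose (i, j) cell sums p_t[i] * p_{t+1}[j],
    i.e. the outer product of each consecutive pair of soft labels.
    """
    k = len(soft_labels[0])
    counts = [[0.0] * k for _ in range(k)]
    for prev, nxt in zip(soft_labels, soft_labels[1:]):
        for i in range(k):
            for j in range(k):
                counts[i][j] += prev[i] * nxt[j]
    return counts

def log2_lift(counts):
    """log2 of observed transition probability over the independence baseline."""
    k = len(counts)
    total = sum(map(sum, counts))
    row = [sum(r) / total for r in counts]
    col = [sum(r[j] for r in counts) / total for j in range(k)]
    return [[math.log2((counts[i][j] / total) / (row[i] * col[j]))
             if counts[i][j] > 0 else float("-inf")
             for j in range(k)] for i in range(k)]
```

A positive lift means the transition occurs more often than independent source and target frequencies would predict; a negative lift means it occurs less often.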

2606.01901 2026-06-02 cs.CV cs.AI cs.CL

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue

图像重建游戏：通过迭代多模态对话建立共同基础

Sherzod Hakimov, Mattia D'Agostini, Ivan Samodelkin, David Schlangen

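Affiliations: Computational Linguistics, Department of Linguistics, University of Potsdam; German Research Center for Artificial Intelligence (DFKI), Berlin

AI summary: Proposes the Image Reconstruction Game benchmark, in which a vision-language model issues corrective instructions to an image generator over multiple turns so that accumulated common ground becomes directly visible as the reconstructed image; the describer is found to be the dominant factor in reconstruction quality, while the generator determines how well iterative refinement works.

AI Chinese abstract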

我们引入了图像重建游戏，这是一个全自动基准测试，其中视觉语言模型在多轮迭代中向图像生成器发出纠正指令，使得累积的共同基础直接可视化为渲染图像。通过对七个图像类别中的两个描述器模型与两个生成器模型进行交叉基准测试，我们发现描述器是重建质量的主导因素，而生成器决定迭代改进是否有益。数学和几何图像构成了最大的挑战。描述器的令牌预算强烈影响收敛性：较短的预算产生更稀疏的初始渲染，有更多可见改进的空间，而较长的预算提高了绝对质量，但留下的修复空间较少。更强的描述器使用更丰富的纠正词汇，涵盖空间、数值和结构类别，而较弱的描述器则集中于表面属性，并且往往在几轮后停止。人工验证表明，最佳自动评判器与人类偏好之间仅达到轻微到中等的一致性，并且自动评分需要人工重新校准才能可靠使用。

English abstract

We introduce the Image Reconstruction Game, a fully automated benchmark in which a vision-language model issues corrective instructions to an image generator across multiple turns, making accumulated common ground directly observable as a rendered image. Benchmarking two Describer models crossed with two Generator models across seven image categories, we find that the describer is the dominant factor in reconstruction quality, while the generator determines whether iterative refinement helps or hurts. Mathematical and geometric images pose the greatest challenge. The describer's token budget strongly affects convergence: shorter budgets yield sparser first renderings with more room for visible improvement, while longer budgets raise absolute quality but leave less to fix. Stronger describers use a richer correction vocabulary spanning spatial, numeric, and structural categories, while weaker describers concentrate on surface properties and tend to stop after a few turns. Human validation shows that the best automated judge reaches only slight-to-fair agreement with human preferences, and automated scores require human recalibration to be used reliably.

2606.01896 2026-06-02 cs.CV cs.AI

Train, Test, Re-evaluate: Schedule-Sensitive Evaluation of Generative Data for Hand Detection

训练、测试、重新评估：用于手部检测的生成数据的调度敏感评估

Atmika Bhardwaj, Silvia Vock, Nico Steckhan

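Affiliations: Federal Institute for Occupational Safety and Health

AI summary: Through experiments with multi-stage training schedules, this study evaluates how generative inpainting data affects hand-detection performance in safety-critical settings and finds that an appropriate training procedure substantially improves real-world deployment results.

Comments: 16 pages, 4 figures

AI Chinese abstract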

生成（或合成）图像数据越来越多地被用于增强或替代真实训练数据集，当目标图像稀缺、昂贵或存在偏差时。在手部检测中，特别是在职业安全设置中，公共数据集大多包含裸手。这低估了手套、纹身、珠宝和其他个人防护装备引入的手部外观变化，造成了安全关键应用在部署时遇到的分布偏移。我们测试生成性修补，即仅编辑真实照片的手部区域以引入配饰，是否能缩小这种偏移差距。在一个由真实图像及其合成对应物组成的配对数据集上，我们在六种训练和调度方案（实验A-F，每种三个随机种子）下训练YOLOv8n手部检测器，在真实测试集和仅真实手套测试子集上评估每个检测器，报告两个重叠阈值（mAP@0.5和mAP@0.5:0.95）下的平均精度（mAP），并进行配对统计检验。一个两阶段实验：在真实+合成数据上训练，然后在较低学习率下仅用真实数据微调得到的权重，与标准真实测试集上的仅真实基线模型相比，提高了mAP@0.5，并改善了真实手套的分布外差距。另一个三阶段实验最好地保持了框的紧密度，达到了研究中任何其他实验的最高mAP@0.5:0.95。合成数据对安全关键手部检测的效用由训练过程决定，简单的多阶段实验从修补的配饰数据中提取了实质性的真实部署收益。

English abstract

Generated (or synthetic) image data is increasingly used to augment or replace real training datasets when target imagery is scarce, expensive, or biased. For hand detection, particularly in occupational safety settings, public datasets mostly contain bare hands. This under-represents the variation in hand appearance introduced by gloves, tattoos, jewelry, and other personal protective equipment, creating a distribution shift that safety-critical applications encounter at deployment. We test whether generative inpainting, editing only the hand region of a real photograph to introduce accessories, can close this shift gap. On a paired dataset of real images and their synthetic counterparts, we train YOLOv8n hand detectors under six training-and-scheduling regimes (Experiments A-F, three random seeds each), evaluate every detector on a real test set and on a real-gloves-only test split, and report the mean average precision (mAP) at two overlap thresholds (mAP@0.5 and mAP@0.5:0.95) along with paired statistical tests. A two-stage experiment, which trains on the union of real and synthetic data and then fine-tunes the resulting weights on real-only data at a lower learning rate, increases mAP@0.5 over the real-only baseline model on the standard real test set and narrows the real-gloves out-of-distribution gap. A three-stage experiment preserves box tightness best, reaching the highest mAP@0.5:0.95 of any experiment in the study. The utility of synthetic data for safety-critical hand detection is determined by the training procedure, and simple multi-stage experiments extract substantial real-deployment benefit from inpainted accessory data.

2606.01895 2026-06-02 cs.CV cs.AI

Collaborative Space Object Detection with Multi-Satellite Viewpoints in LEO Constellations

LEO星座中基于多卫星视角的协作空间目标检测

Xingyu Qu, Wenxuan Zhang, Peng Hu

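Affiliations: Government of Canada; Natural Sciences and Engineering Research Council of Canada

AI summary: To address the challenges of space object detection in LEO constellations, proposes fusing multi-viewpoint observations within a deep learning framework, feeding multi-view data to YOLO detectors; experiments show that multi-view fusion markedly improves detection accuracy.

AI Chinese abstract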

随着低地球轨道（LEO）星座中卫星数量的增加，近地空间环境日益拥挤，使得空间目标检测（SOD）成为空间安全和可持续性面临的紧迫挑战。为了降低碰撞风险并确保空间操作的连续性，SOD系统必须在严格的星载约束下提供快速准确的检测。在本文中，我们研究了深度学习（DL）框架内多视角观测融合的潜力，以增强SOD性能。我们设计了一个实用的多视角流水线和几种输入表示，用于将多视角数据输入基于YOLO的检测器。我们的实验表明，在大多数情况下使用多视角输入是可行的，并且通常能在mAP50和mAP50-95上产生更好的结果。例如，在模型YOLOv9-m中，单视角与三视角融合RGB设置相比，mAP50从0.638增加到0.732，而mAP50-95从0.227提高到0.276。与单视角设置相比，最佳的三视角灰度配置将mAP50提高了36.3%，mAP50-95提高了46.5%。这些发现确立了多视角融合作为SOD的一种可行且有效的策略，对LEO星座部署中的空间态势感知具有广泛意义。

English abstract

With the growing number of satellites in low Earth orbit (LEO) constellations, the near-Earth space environment has become increasingly congested, making space object detection (SOD) a pressing challenge for space safety and sustainability. To mitigate collision risks and ensure the continuity of space operations, SOD systems must deliver fast and accurate detection under stringent onboard constraints. In this paper, we investigate the potential of multi-viewpoint observation fusion within a deep learning (DL) framework to enhance SOD performance. We design a practical multi-view pipeline and several input representations for feeding multi-view data into YOLO-based detectors. Our experiments show that using multi-view inputs is feasible in most cases and typically produces better results for mAP50 and mAP50-95. For example, with YOLOv9-m, moving from the single-view setting to a three-view fused RGB setting raises mAP50 from 0.638 to 0.732 and mAP50-95 from 0.227 to 0.276. Compared with the single-view setting, the best three-view grayscale configuration improves mAP50 by 36.3% and mAP50-95 by 46.5%. These findings establish multi-view fusion as a viable and effective strategy for SOD, with broad implications for space situational awareness in LEO constellation deployments.
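
The abstract gives the YOLOv9-m RGB example as absolute scores but the best grayscale result as percentages. Assuming those percentages are relative gains over a single-view baseline (the abstract does not say so explicitly), the RGB example can be converted to the same scale for comparison:

```python
def relative_gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before

# Scores quoted in the abstract for YOLOv9-m, single-view vs. three-view fused RGB.
rgb_map50 = relative_gain_percent(0.638, 0.732)
rgb_map50_95 = relative_gain_percent(0.227, 0.276)
print(f"mAP50: +{rgb_map50:.1f}%  mAP50-95: +{rgb_map50_95:.1f}%")
```

On that reading the RGB example corresponds to gains of roughly 14.7% and 21.6%, noticeably smaller than the 36.3% and 46.5% reported for the best grayscale configuration.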

2606.01894 2026-06-02 cs.AI

Physically-Constrained Mamba-SDE for Remaining Useful Life Prediction under Irregular Observations

物理约束的Mamba-SDE用于不规则观测下的剩余使用寿命预测

Deyu Zhuang, Peiliang Gong, Yang Shao, Liyuan Shu, Qi Zhu, Xiaoli Li, Daoqiang Zhang

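Affiliations: Nanjing University of Aeronautics and Astronautics; Nanyang Technological University; Singapore University of Technology and Design

AI summary: Proposes the PC-MambaSDE framework, which uses a mask-aware continuous Mamba encoder and a physics-guided latent SDE to address physically infeasible predictions of remaining useful life under irregular observations.

AI Chinese abstract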

准确的剩余使用寿命预测对于工业预测性维护至关重要。然而，由于传感器观测的不规则性，表现为异步采样、突发缺失和时间抖动，实际部署具有挑战性。更糟糕的是，纯数据驱动模型常常生成物理上不合理的退化轨迹，违反损伤累积的不可逆性。为了解决这个问题，我们提出了PC-MambaSDE，一个统一的连续时间框架，用于在不规则观测下进行鲁棒的RUL预测。具体来说，我们设计了一个掩码感知连续Mamba编码器，显式利用观测掩码提取富含上下文的控制信号。此外，我们引入了一个带有参数化修正混合漂移的物理引导潜在SDE，叠加全局物理偏差以强制单调退化，即使在严重观测间隙下也是如此。另外，我们通过终端退化惩罚将RUL预测公式化为边界值问题，该惩罚解耦健康指标维度并应用惩罚损失引导轨迹向故障状态演化。理论上，我们通过Girsanov定理证明了我们的变分目标在数学上等价于最小化KL散度，并通过Lyapunov分析保证了学习动力学的全局渐近稳定性。为了进行严格评估，我们开发了一个混合不规则性生成方案，模拟真实的工业缺陷。在公开基准上的大量实验表明，PC-MambaSDE显著优于最先进的方法，特别是在极端观测稀缺情况下，验证了将物理先验嵌入连续时间潜在动力学的有效性。

English abstract

Accurate Remaining Useful Life (RUL) prediction is critical for industrial predictive maintenance. However, real-world deployment is challenging due to the irregular nature of sensor observations, characterized by asynchronous sampling, burst missingness, and temporal jitter. Compounding this issue, purely data-driven models often generate physically implausible degradation trajectories that violate the irreversible nature of damage accumulation. To address this, we propose PC-MambaSDE, a unified continuous-time framework for robust RUL prediction under irregular observations. Specifically, we design a Mask-Aware Continuous Mamba Encoder that explicitly leverages observation masks to extract context-rich control signals. Furthermore, we introduce a Physics-Guided Latent SDE with parametrically rectified hybrid drift, superimposing a global physical bias to enforce monotonic degradation even amid severe observation gaps. Additionally, we formulate RUL prediction as a boundary value problem via a Terminal Degradation Penalty, which decouples a Health Index dimension and applies a penalty loss to guide trajectories toward the failure state. Theoretically, we prove that our variational objective is mathematically equivalent to minimizing the KL divergence via Girsanov's theorem, and we guarantee the global asymptotic stability of the learned dynamics through Lyapunov analysis. To enable rigorous evaluation, we develop a Hybrid Irregularity Generation Scheme that simulates realistic industrial imperfections. Extensive experiments on public benchmarks demonstrate that PC-MambaSDE significantly outperforms state-of-the-art methods, particularly under extreme observation scarcity, validating the efficacy of embedding physical priors into continuous-time latent dynamics.

2606.01886 2026-06-02 cs.AI cs.CE

Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents

吸收复杂性：面向金融LLM代理的交互原生知识驾驭系统

Ailiya Borjigin, Igor Stadnyk, Ben Bilski, Maksym Chikita, Dmytro Kyrylenko, Sofiia Pidturkina, Julia Stadnyk

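Affiliations: True Trading; Inc4.net

AI summary: Proposes the interaction-native knowledge harness (InKH) architecture, which absorbs complexity into the system through passive knowledge injection, temporal graph memory, and invalidation of stale knowledge, markedly reducing latency, token cost, and stale-knowledge usage on financial LLM agent tasks while improving task quality and traceability.

Comments: 17 pages, 3 figures

AI Chinese abstract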

金融AI代理常常因一个简单原因而失败：它们让用户承担复杂性。用户必须反复陈述目标、风险偏好、投资组合背景、过往判断以及不断变化的市场假设，而代理则回答、检索、行动并遗忘。在金融领域，这不仅仅是方便与否的问题。在市场分析、跟单交易审查和交易准备等任务中，被遗忘的背景和过时的记忆可能导致延迟、重复错误、弱可审计性以及不安全的决策。我们提出了交互原生知识驾驭（InKH），一种面向金融LLM代理的架构，将复杂性吸收到系统中。InKH将用户、市场、投资组合和工具事件转换为结构化的操作知识。它使用被动知识注入在主模型步骤之前组装一个有界的工作上下文缓冲区，使用时序图记忆进行低延迟检索，使用维基审计界面实现人类可读的治理，以及具有成熟度、衰减和写入时失效的背景提取。我们在一个可重复的受控合成基准上评估了InKH，该基准包含24个随机种子、4轮、每轮80个片段和6个基线，产生了46,080个基线条件评估。InKH在900毫秒延迟下实现了0.815的平均任务质量。与代理驱动的维基漫步记忆相比，它将延迟降低了82.95%，令牌成本降低了82.29%，过时知识使用降低了96.58%，同时质量提高了0.108，可追溯性提高了0.461。与没有失效机制的时序图系统相比，它在相当的服务成本下将质量提高了0.050，并将过时记忆使用降低了96.58%。结果支持了金融AI的设计论点：当复杂性被系统吸收而不是转移给用户时，采用就会发生。该基准验证了架构层面的行为，而非实时交易性能。

English abstract

Financial AI agents often fail for a simple reason: they make users carry the complexity. A user must repeatedly restate goals, risk preferences, portfolio context, past judgments, and shifting market assumptions, while the agent answers, retrieves, acts, and forgets. In finance, this is not just inconvenient. In tasks such as market analysis, copy-trading review, and trade preparation, forgotten context and stale memory can create latency, repeated errors, weak auditability, and unsafe decisions. We propose the interaction-native knowledge harness (InKH), an architecture for financial LLM agents that absorbs complexity into the system. InKH converts user, market, portfolio, and tool events into structured operational knowledge. It uses passive knowledge injection to assemble a bounded working context buffer before the main model step, temporal graph memory for low-latency retrieval, a wiki audit surface for human-readable governance, and background extraction with maturity, decay, and write-time invalidation. We evaluate InKH on a reproducible controlled synthetic benchmark with 24 random seeds, 4 rounds, 80 episodes per round, and 6 baselines, producing 46,080 baseline-conditioned evaluations. InKH achieves mean task quality of 0.815 at 900 ms latency. Compared with agent-driven wiki-walk memory, it reduces latency by 82.95 percent, token cost by 82.29 percent, and stale-knowledge usage by 96.58 percent, while improving quality by 0.108 and traceability by 0.461. Compared with a temporal-graph system without invalidation, it improves quality by 0.050 and reduces stale-memory usage by 96.58 percent with comparable serving cost. The results support a design thesis for financial AI: adoption happens when complexity is absorbed by the system rather than transferred to the user. The benchmark validates architecture-level behavior, not live trading performance.
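
The reported evaluation count follows directly from the benchmark dimensions listed in the abstract, as this small check shows (the variable names are illustrative only):

```python
# Benchmark dimensions as stated in the abstract.
seeds, rounds, episodes_per_round, baselines = 24, 4, 80, 6

total_evaluations = seeds * rounds * episodes_per_round * baselines
print(total_evaluations)  # 46080
```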

2606.01883 2026-06-02 cs.LG cs.CV

Beyond the Simplex: Balanced Prototype Geometry for Scorer-Agnostic Open-Set Recognition

超越单纯形：用于评分器无关的开放集识别的平衡原型几何

Mayank Sharma, Rohit Kumar Mourya

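Affiliations: Indian Institute of Technology Jodhpur

AI summary: This paper develops a theory of balanced equal-norm prototype geometry that gives a unified analysis of open-set recognition across embedding dimensions, and shows that scorer performance depends on the scoring rule rather than on the simplex structure.

Comments: 20 pages, 2 figures, 6 tables

AI Chinese abstract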

开放集识别（OSR）要求分类器拒绝来自未见类别的输入，这在医学成像等安全关键场景中至关重要。基于单纯形的方法将类原型固定在正则单纯形的顶点，然后通过距离比分数进行拒绝，这些方法在经验上表现良好但缺乏理论依据，且现有分析仅适用于嵌入维度d至少为C-1的情况，这是正则单纯形存在的条件。我们给出了在任意嵌入维度（包括d < C-1）下单纯形比OSR的理论解释。我们的分析集中于平衡等范数编码：具有等长和零和的原型配置，存在于所有d >= 2的情况，并包含正则单纯形作为特例。对于这些编码，我们证明辅助平方比分数的子水平集是欧几里得球的精确并集，进而包围了操作分数的接受区域；并且我们证明了一个尖锐的二分法：当且仅当d >= C-1时，原型达到等距对称性，行为类似于正则单纯形，低于该阈值时，由显式缺陷参数控制退化程度。我们进一步证明，在自然各向同性假设下，错误接受率随d指数衰减，并且操作分数是全局Lipschitz的，具有紧致接受区域。在实验上，我们将平衡原型几何作为分析工具和表示学习先验进行研究，而非作为独立的先进检测器。在CIFAR和MedMNIST开放集划分上，几何结构提供了有用的结构，但OSR性能仍然强烈依赖于评分规则：原始比率分数通常不如基于最近邻和logit的替代方案。

English abstract

Open-set recognition (OSR) requires a classifier to reject inputs from unseen classes, which is essential in safety-critical settings such as medical imaging. Simplex-based methods, which fix class prototypes at the vertices of a regular simplex and then reject via a distance-ratio score, perform well empirically but lack theoretical justification, and existing analysis applies only when the embedding dimension d is at least C-1, which is the regime in which a regular simplex exists. We give a theoretical account of simplex-ratio OSR that holds in every embedding dimension, including d < C-1. Our analysis centers on balanced equal-norm codes: prototype configurations with equal lengths and zero sum, which exist for all d >= 2 and include the regular simplex as a special case. For these codes we show that an auxiliary squared ratio score has sublevel sets that are exact unions of Euclidean balls, which in turn bracket the acceptance region of the operational score; and we prove a sharp dichotomy: the prototypes attain one-distance symmetry, behaving like a regular simplex, if and only if d >= C-1, with controlled degradation governed by an explicit defect parameter below that threshold. We further show that the false-acceptance rate decays exponentially in d under natural isotropy assumptions, and that the operational score is globally Lipschitz with compact acceptance regions. Empirically, we study balanced prototype geometry as both an analytic tool and a representation-learning prior, rather than as a stand-alone state-of-the-art detector. Across CIFAR and MedMNIST open-set splits, the geometry provides useful structure, but OSR performance remains strongly dependent on the scoring rule: raw ratio scores typically underperform nearest-neighbor and logit-based alternatives.
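
The notion of a balanced equal-norm code, and the d >= C-1 dichotomy, can be illustrated in the smallest case d = 2 by placing C prototypes evenly on the unit circle. This construction is only one convenient example chosen for illustration; the abstract does not say how the paper builds its codes, and the helper names are hypothetical.

```python
import math

def circle_code(num_classes):
    """Unit-norm prototypes equally spaced on a circle (embedding dimension d = 2)."""
    return [(math.cos(2 * math.pi * c / num_classes),
             math.sin(2 * math.pi * c / num_classes))
            for c in range(num_classes)]

def distinct_pairwise_distances(protos, digits=9):
    """Sorted set of pairwise Euclidean distances, rounded to absorb float error."""
    dists = set()
    for i in range(len(protos)):
        for j in range(i + 1, len(protos)):
            dists.add(round(math.dist(protos[i], protos[j]), digits))
    return sorted(dists)

triangle = circle_code(3)   # C = 3, d = 2 = C - 1: a regular simplex
pentagon = circle_code(5)   # C = 5, d = 2 < C - 1: balanced but not one-distance
```

Both configurations have equal norms and sum to zero, but only the triangle has a single pairwise distance; the pentagon has two, which is the kind of departure from simplex symmetry that the abstract's defect parameter is said to control.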

2606.01879 2026-06-02 cs.CL

CultureForest: Understanding and Evaluating Cultural Norm Grounded Reasoning in LLMs

CultureForest：理解与评估大语言模型中的文化规范推理

Yangfan Ye, Xiaocheng Feng, Jialong Tang, Xiayu Cao, Zihan Zhang, Xiachong Feng, Baosong Yang, Bing Qin

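Affiliations: Harbin Institute of Technology; The University of Hong Kong; Harvard University

AI summary: To address the gap that existing work treats cultural intelligence only as a knowledge-acquisition problem and overlooks its use in practical scenarios, proposes the CultureForest benchmark, which evaluates models through reasoning tasks grounded in atomic norms; top-tier models degrade substantially in open-ended generation, revealing a reasoning bottleneck.

AI Chinese abstract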

现有研究大多将大语言模型中的文化智能简化为知识层面的问题，忽视了模型能否在现实场景中有效利用其获取的知识。为弥补这一差距，我们引入了CultureForest，一个用于文化规范推理的基准。每个问题都基于一组原子规范，从而支持可验证和可归因的评估。CultureForest包含来自8个领域和53个国家/地区的5,378个示例，并支持从多项选择到开放式生成的渐进式评估。大量实验表明，即使是顶级模型在开放式设置中性能也大幅下降，并伴随显著的跨区域差异。通过针对性分析，我们发现了几个一致的模式：（1）测试时推理带来的收益有限，且可能加剧不公平；（2）模型表现出高度共享的区域偏好结构；（3）模型响应明显保守，尤其在更严格的文化约束下；（4）通过分离文化知识获取与文化推理，我们发现虽然LLMs拥有丰富的文化知识，但其性能进一步受限于知识的有效利用。这些发现表明，有必要从以知识为中心的评估转向衡量基于知识的推理。

English abstract

Existing research largely reduces cultural intelligence in LLMs to a knowledge-level problem, overlooking whether models can effectively utilize their acquired knowledge in realistic scenarios. To bridge this gap, we introduce CultureForest, a benchmark for Cultural Norm Grounded Reasoning. Each question is grounded in a small set of atomic norms, enabling verifiable and attributable evaluation. CultureForest comprises 5,378 examples across 8 domains and 53 countries/regions, and supports a progressive evaluation from multiple-choice to open-ended generation. Extensive experiments reveal that even top-tier models degrade substantially in open-ended settings, accompanied by pronounced cross-region disparities. Through targeted analysis, we uncover several consistent patterns: (1) test-time reasoning yields limited gains and may exacerbate inequity; (2) models exhibit highly shared regional preference structures; (3) model responses are markedly conservative, especially under stricter cultural constraints; and (4) by disentangling cultural knowledge acquisition from cultural reasoning, we show that while LLMs possess substantial cultural knowledge, their performance is further bottlenecked by its effective use. These findings point to a necessary shift from knowledge-centric evaluation toward measuring knowledge-grounded reasoning.

2606.01873 2026-06-02 cs.LG

G2LoRA: Gradient Orthogonal Low-Rank Adaptation Framework for Graph Continual Learning on Text-Attributed Graphs

G2LoRA: 面向文本属性图的梯度正交低秩自适应框架用于图持续学习

Yuhan Wang, Yibo Ding, Yutong Ye, Mufan Zhao, Wenbo Zhang, Ruijie Wang, Jianxin Li

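Affiliations: School of Computer Science and Engineering, Beihang University; Department of Statistics, Columbia University; College of Computer Science, Beijing University of Technology

AI summary: To address catastrophic forgetting of LLM-as-Aligner models in continual learning on text-attributed graphs, proposes the G2LoRA framework, which achieves positive transfer across tasks and mitigates modality drift through a unified graph-text alignment objective, category-aware gradient projection, and gradient magnitude modulation.

Comments: Accepted by KDD 2026

AI Chinese abstract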

LLM-as-Aligner已成为文本属性图（TAGs）的一种流行预训练范式，通过CLIP风格的对比学习将图和文本模态对齐到共享嵌入空间。虽然在单个下游任务上有效，但我们观察到当此类模型在流式任务上顺序微调时会出现严重的灾难性遗忘。尽管参数高效微调在一定程度上缓解了遗忘，但仍不足以解决任务干扰和无效知识迁移。在这项工作中，我们研究了TAGs上LLM-as-Aligner模型的图持续学习，目标是减轻干扰同时促进任务间的正向迁移。该设置引入了两个基本挑战：（1）异构下游任务导致优化目标变化，阻碍统一微调；（2）图和文本编码器对自适应表现出不同的敏感性，不协调的更新容易导致错位。为应对这些挑战，我们提出了G2LoRA，一个面向TAGs的持续学习框架。G2LoRA将节点级、链接级和图级任务统一到单一的图-文本对齐目标下，并在领域/类别/任务增量模式下实现一致的优化。为减少任务干扰同时鼓励正向迁移，G2LoRA在结构化子空间中执行类别感知梯度投影，解决冲突更新并实现条件性反向迁移以平衡前向和后向知识流。为进一步防止跨模态漂移，G2LoRA引入梯度幅度调制来协调图和文本编码器之间的更新速率。在基准数据集上的大量实验表明，G2LoRA在不同骨干架构上始终优于强基线，实现了卓越的持续性能和可迁移性。

English abstract

LLM-as-Aligner has emerged as a prevalent pre-training paradigm for Text-Attributed Graphs (TAGs), aligning graph and text modalities into a shared embedding space via CLIP-style contrastive learning. While effective on individual downstream tasks, we observe severe catastrophic forgetting when such models are sequentially fine-tuned on streaming tasks. Although parameter-efficient fine-tuning alleviates forgetting to some extent, it remains insufficient to resolve task interference and ineffective knowledge transfer. In this work, we study graph continual learning for LLM-as-Aligner models on TAGs, with the goal of mitigating interference while promoting positive transfer across tasks. This setting introduces two fundamental challenges: (1) heterogeneous downstream tasks induce shifting optimization objectives, hindering unified fine-tuning; and (2) graph and text encoders exhibit different sensitivities to adaptation, making uncoordinated updates prone to misalignment. To address these challenges, we propose G2LoRA, a continual learning framework for TAGs. G2LoRA unifies node-, link-, and graph-level tasks under a single graph-text alignment objective, and enables consistent optimization across domain/class/task incremental modes. To reduce task interference while encouraging positive transfer, G2LoRA performs category-aware gradient projection in structured subspaces, resolving conflicting updates and enabling conditional backward transfer to balance forward and backward knowledge flow. To further prevent cross-modal drift, G2LoRA introduces gradient magnitude modulation to coordinate update rates between graph and text encoders. Extensive experiments on benchmark datasets demonstrate that G2LoRA consistently outperforms strong baselines across different backbone architectures, achieving superior continual performance and transferability.
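
The abstract does not specify how the category-aware projection or its structured subspaces are defined. As a rough intuition for "resolving conflicting updates", the sketch below shows a generic PCGrad-style projection, used here as a stand-in and not as G2LoRA's method: when a new gradient points against a reference gradient from earlier tasks, the conflicting component is removed. The function name and arguments are hypothetical.

```python
def project_out_conflict(grad, ref_grad):
    """Remove from `grad` its component along `ref_grad` when the two conflict.

    grad:     gradient for the current task (list of floats)
    ref_grad: reference gradient representing earlier tasks
    Returns `grad` unchanged if the dot product is non-negative (no conflict).
    """
    dot = sum(g * r for g, r in zip(grad, ref_grad))
    ref_sq = sum(r * r for r in ref_grad)
    if dot >= 0 or ref_sq == 0:
        return list(grad)
    coef = dot / ref_sq
    # Subtract the projection onto ref_grad, leaving the result orthogonal to it.
    return [g - coef * r for g, r in zip(grad, ref_grad)]
```

After projection the update no longer decreases along the reference direction, which is the basic way orthogonal-gradient methods limit interference with previously learned tasks.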

2606.01868 2026-06-02 cs.LG

Task-Induced Representational Invariances Depend on Learning Objective in Deep RL

任务诱导的表征不变性依赖于深度强化学习中的学习目标

Manu Srinath Halvagal, Sebastian Lee, SueYeon Chung

Affiliations: Department of Physics, Harvard University; Kempner Institute, Harvard University; Center for Computational Neuroscience, Flatiron Institute

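AI summary: This paper analyzes representations in deep reinforcement learning through MDP reduction theory and finds that the value-based method (DQN) learns representations invariant to MDP homomorphism symmetries, while the policy-gradient method (PPO) learns representations invariant to action symmetries. These differences affect transfer learning and appear in LLMs in a prompt-dependent manner.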

AI-generated Chinese abstract

强化学习（RL）长期以来在神经科学中被用作目标导向动物行为的模型。现代深度RL在许多领域取得了显著成功，进一步强化了这一联系。学习高维状态空间的抽象表征能力是这一成功的基础。然而，对这些学习表征的理论理解仍然有限，阻碍了模型与动物学习之间的直接比较。我们通过MDP约简理论的视角分析深度RL表征来弥补这一差距。在导航任务中研究经典RL算法时，我们发现即使性能相当，基于价值的方法（DQN）学习对MDP同态对称性不变的表征，而基于策略梯度的方法（PPO）学习对动作对称性不变的表征。这些差异在不同领域中一致出现，对迁移学习有下游影响，并以提示依赖的方式出现在LLM中。我们的发现提供了一种比较不同RL算法学习表征的原则性方法，具有实际意义，并可能为大脑中的神经编码提供见解。

English abstract

Reinforcement Learning (RL) has long served as a model for goal-directed animal behavior in neuroscience. Modern deep RL has shown remarkable success across many domains, further strengthening this connection. The ability to learn abstract representations of high-dimensional state spaces underlies much of this success. However, theoretical understanding of these learned representations remains limited, hindering direct comparisons between models and animal learning. We address this gap by analyzing deep RL representations through the lens of MDP reduction theory. Investigating canonical RL algorithms in a navigation task, we find that even when performance is comparable, the value-based method (DQN) learns representations that are invariant to MDP homomorphism symmetries, while the policy-gradient method (PPO) learns representations invariant to action symmetries. These differences emerge consistently across domains, have downstream consequences for transfer learning, and appear in LLMs in a prompt-dependent manner. Our findings provide a principled approach to comparing learned representations across RL algorithms, with demonstrated practical implications and possible insights for neural coding in the brain.

2606.01865 2026-06-02 cs.RO

Set-Supervised Diffusion Policy: Learning Action-Chunking Diffusion through Corrections

集合监督扩散策略：通过修正学习动作分块扩散

Zhaoting Li, Gang Chen, Javier Alonso-Mora, Cosimo Della Santina, Jens Kober

Affiliations: University of California, Berkeley; ETH Zurich

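AI summary: Proposes Set-Supervised Diffusion Policy (SDP), which uses contrastive action-chunk data from human corrections and constructs a set of desired action chunks to train the diffusion policy, mitigating distributional shift and improving robustness.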

AI-generated Chinese abstract

扩散策略最近已成为机器人操作的一个强大框架。然而，与其他行为克隆方法一样，它仍然容易受到分布偏移的影响，通常需要人在回路中进行干预以纠正部署过程中的失败。这些交互自然提供了成对监督，形式为机器人的不期望动作和人类教师的纠正动作。然而，现有的数据聚合流程和标准行为克隆损失在很大程度上忽略了来自不期望动作的负面信号，导致对教师动作的过拟合以及对昂贵专家数据的日益依赖。为了解决这一限制，我们提出了集合监督扩散策略（SDP），这是一种新颖的学习框架，利用对比动作分块数据从人类修正中训练扩散策略。从配对的正负动作分块中，SDP构建了一组期望的动作分块，并设计了一个训练流程，鼓励扩散策略与该集合对齐。通过在多个机器人操作任务上的大量实验，我们证明了SDP持续提高了策略性能，在对噪声数据的鲁棒性方面尤其显著。此外，SDP生成了高质量的聚合数据集，使得从人在回路修正中进行更高效、更可靠的策略学习成为可能。我们的代码可在 https://set-supervised-diffusion-policy.github.io/ 获取。

English abstract

Diffusion policies have recently emerged as a powerful framework for robotic manipulation. However, like other behavior cloning methods, they remain vulnerable to distributional shift, often requiring human-in-the-loop interventions to correct failures during deployment. These interactions naturally provide paired supervision in the form of the robot's undesired actions and the human teacher's corrective actions. Yet existing data aggregation pipelines and standard behavior cloning losses largely ignore this negative signal from undesired actions, leading to overfitting to the teacher's actions and an increasing reliance on costly expert data. To address this limitation, we propose Set-Supervised Diffusion Policy (SDP), a novel learning framework that utilizes contrastive action-chunk data to train diffusion policies from human corrections. From paired positive and negative action chunks, SDP constructs a set of desired action chunks and designs a training pipeline that encourages the diffusion policy to align with the set. Through extensive experiments across multiple robotic manipulation tasks, we demonstrate that SDP consistently improves policy performance, with particularly strong gains in robustness to noisy data. Moreover, SDP induces high-quality aggregated datasets, enabling more efficient and reliable policy learning from human-in-the-loop corrections. Our code is available at https://set-supervised-diffusion-policy.github.io/.

2606.01863 2026-06-02 cs.LG math-ph math.MP

Continual Learning as a Multiphase Moving-Boundary Problem

持续学习作为多相移动边界问题

Snigdha Chandan Khilar

Affiliations: Independent Researcher

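AI summary: Inspired by the physics of melting, proposes Stefan-CL, which treats consolidated knowledge as the solid phase and unused capacity as the liquid phase, regulates boundary movement by controlling latent heat, and achieves continual learning with near-zero forgetting without storing raw data.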