arXivDaily: daily arXiv digest, updated Monday through Friday
2605.05812 2026-05-12 cs.AI

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

Armaan A. Abraham, Lucy Xiaoyang Shi, Chelsea Finn

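Affiliations: Stanford University

AI summary: This paper studies value-based off-policy reinforcement learning, where bootstrapping errors make value estimates unstable on long-horizon tasks, and proposes long-horizon Q-learning (LQL). LQL imposes n-step inequality constraints and uses a hinge loss to correct value estimates, curbing error accumulation without auxiliary networks or additional forward passes. Experiments show that LQL outperforms both 1-step and n-step TD learning on a range of online and offline-to-online benchmarks.

Abstract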

Off-policy, value-based reinforcement learning methods such as Q-learning are appealing because they can learn from arbitrary experience, including data collected by older policies or other agents. In practice, however, bootstrapping makes long-horizon learning brittle: estimation errors at later states propagate backward through temporal-difference (TD) updates and can compound over time. We propose long-horizon Q-learning (LQL), which introduces a principled backstop against compounding error when learning the optimal action-value function. LQL builds on a prior optimality tightening observation: any realized action sequence lower-bounds what the optimal policy can achieve in expectation, so acting optimally earlier should not be worse than following the observed actions for several steps before switching to optimal behavior. Our contribution is to turn this inequality into a practical stabilization mechanism for Q-learning by using a hinge loss to penalize violations of these bounds. Importantly, LQL computes these penalties using network outputs already produced for the TD error, requiring no auxiliary networks and no additional forward passes relative to Q-learning. When combined with multiple state-of-the-art methods on a range of online and offline-to-online benchmarks, LQL consistently outperforms both 1-step TD and n-step TD learning at similar runtime.
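The bound described in the abstract can be illustrated with a minimal sketch. It assumes the lower bound is the discounted sum of n observed rewards plus a bootstrapped maximum Q-value at the state reached after n steps, and that the penalty is a plain hinge on the gap; the abstract does not give the exact loss, so the function names, the discounting form, and the lack of any weighting are assumptions. The inequality holds in expectation, so a single trajectory only gives a noisy estimate of the bound.

```python
def n_step_lower_bound(rewards, q_next_max, gamma):
    """Discounted sum of the n observed rewards plus the bootstrapped tail.

    rewards: rewards r_t, ..., r_{t+n-1} along the realized action sequence.
    q_next_max: max over actions of Q(s_{t+n}, a), taken from the network.
    """
    n = len(rewards)
    observed = sum((gamma ** k) * r for k, r in enumerate(rewards))
    return observed + (gamma ** n) * q_next_max


def hinge_penalty(q_sa, lower_bound):
    """Nonzero only when Q(s_t, a_t) falls below the n-step lower bound."""
    return max(0.0, lower_bound - q_sa)


rewards = [1.0, 0.0, 2.0]
bound = n_step_lower_bound(rewards, q_next_max=4.0, gamma=0.5)
violated = hinge_penalty(1.5, bound)   # estimate below the bound: penalized
satisfied = hinge_penalty(3.0, bound)  # estimate above the bound: no penalty
```

In this toy case the bound is 2.0, so an estimate of 1.5 incurs a penalty of 0.5 and an estimate of 3.0 incurs none. Because the hinge is one-sided, it acts as a backstop alongside the TD loss rather than replacing it.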

2605.05775 2026-05-12 cs.CV cs.AI

The autoPET3 Challenge: Automated Lesion Segmentation in Whole-Body PET/CT – Multitracer Multicenter Generalization

Jakob Dexl, Katharina Jeblick, Andreas Mittermeier, Balthasar Schachtner, Anna Theresa Stüber, Johanna Topalis, Maximilian Rokuss, Fabian Isensee, Klaus H. Maier-Hein, Hamza Kalisch, Jens Kleesiek, Constantin M. Seibold, Hussain Alasmawi, Lap Yan Lennon Chan, Yixuan Yuan, Alexander Jaus, Rainer Stiefelhagen, Pauline Ornela Megne Choudja, Konstantin Nikolaou, Christian La Fougère, Sergios Gatidis, Matthias P. Fabritius, Maurice Heimer, Gizem Abaci, Lalith Kumar Shiyam Sundar, Rudolf A. Werner, Jens Ricke, Clemens C. Cyran, Thomas Küstner, Michael Ingrisch

Affiliations: Department of Radiology, LMU University Hospital, LMU Munich; Munich Center for Machine Learning (MCML); University Hospital Tübingen, Department of Radiology; Department of Radiology, Stanford University; German Cancer Research Center (DKFZ); Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital; Faculty of Mathematics and Computer Science, Heidelberg University; Institute for AI in Medicine (IKIM), University Hospital Essen (AöR); Department of Nuclear Medicine, University Hospital Essen (AöR); Mohamed bin Zayed University of Artificial Intelligence; Department of Computer Science and Engineering, The Chinese University of Hong Kong; Department of Electronic Engineering, The Chinese University of Hong Kong; Karlsruhe Institute of Technology; HIDSS4Health - Helmholtz Information and Data Science School for Health; Department of Nuclear Medicine, LMU University Hospital, LMU Munich; Comprehensive Pneumology Center (CPC-M), Member of the German Center for Lung Research (DZL); relAI – Konrad Zuse School of Excellence in Reliable AI; Cluster of Excellence iFIT (EXC 2180) "Image Guided and Functionally Instructed Tumor Therapies", University of Tübingen

AI summary: This paper presents the design and results of the third autoPET challenge (MICCAI 2024), which evaluates how well algorithms for automated lesion segmentation in whole-body PET/CT generalize across tracers and centers. The study uses a large annotated dataset from two hospitals and evaluates algorithms on a test set that includes unseen tracer-center combinations; the top-ranked algorithm outperforms the baseline on several metrics. The authors find that current algorithms handle in-domain multitracer segmentation well but still struggle to generalize across centers and tracers, with performance differences driven mainly by data heterogeneity and case difficulty.

Comments: Preprint submitted to Medical Image Analysis

Abstract

We report the design and results of the third autoPET challenge (MICCAI 2024), which benchmarked automated lesion segmentation in whole-body PET/CT under a compositional generalization setting. Training data comprised 1,014 [18F]-FDG PET/CT studies from the University Hospital Tübingen and 597 [18F]/[68Ga]-PSMA PET/CT studies from the LMU University Hospital Munich, constituting the largest publicly available annotated PSMA PET/CT dataset to date. The held-out test set of 200 studies covered four tracer-center combinations, two of which represented unseen compositional pairings. A complementary data-centric award category isolated the contribution of data handling strategies by restricting participants to a fixed baseline model. Seventeen teams submitted 27 algorithms, predominantly nnU-Net-based 3D networks with PET/CT channel concatenation. The top-ranked algorithm achieved a mean DSC of 0.66, FNV of 3.18 mL, and FPV of 2.78 mL across all four test conditions, improving DSC by 8% and reducing the false-negative volume by 5 mL relative to the provided baseline. Ranking was stable across bootstrap resampling and alternative ranking schemes for the top tier. Beyond the benchmark, we provide an in-depth analysis of segmentation performance at the patient and lesion level. Three main conclusions can be drawn: (1) in-domain multitracer PET/CT segmentation is sufficient and probably approaching reader agreement; (2) compositional generalization to unseen tracer-center combinations remains an open problem mainly driven by systematic volume overestimation; (3) heterogeneity and case difficulty drive performance variation substantially more than the choice of algorithm among top-ranked teams.
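For readers unfamiliar with the headline metric, the Dice similarity coefficient (DSC) is twice the overlap between predicted and reference masks divided by the sum of their sizes. The sketch below computes it on flattened binary masks. Returning 1.0 when both masks are empty is an assumed convention, not something the abstract states, and FNV and FPV are omitted because the abstract does not define them.

```python
def dice(pred, truth):
    """Dice similarity coefficient between two flat binary masks."""
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * overlap / total


pred = [1, 1, 0, 0, 1]   # toy predicted lesion voxels
truth = [1, 0, 0, 1, 1]  # toy reference lesion voxels
score = dice(pred, truth)  # 2 * 2 / (3 + 3)
```

Here two voxels overlap out of three in each mask, giving a DSC of 2/3, close to the 0.66 mean reported for the top-ranked algorithm.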

2605.05373 2026-05-12 cs.LG

Neural Co-state Policies: Structuring Hidden States in Recurrent Reinforcement Learning

David Leeftink, Max Hinne, Marcel van Gerven

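Affiliations: Donders Institute for Brain, Cognition and Behaviour; Radboud University

AI summary: This paper studies how to improve the decision-making of reinforcement learning agents under partial observability and proposes neural co-state policies grounded in the Pontryagin minimum principle (PMP) from optimal control. The method establishes a formal link between the hidden states of recurrent networks and PMP co-states, making the network's internal dynamics interpretable, and introduces a co-state loss that explicitly structures the hidden states. Experiments show that the approach matches or improves performance on partially observable tasks and is robust to out-of-distribution sensor masking.

Comments: 17 pages, 5 figures

Abstract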

A key capability of intelligent agents is operating under partial observability: reasoning and acting effectively despite missing or incomplete state observations. While recurrent (memory-based) policies learned via reinforcement learning address this by encoding history into latent state representations, their internal dynamics remain uninterpretable black boxes. This paper establishes a formal link between these hidden states and the Pontryagin minimum principle (PMP) from optimal control. We demonstrate that for standard recurrent architectures, latent representations map directly to PMP co-states, which allows the readout layer to be interpreted as performing Hamiltonian minimization. Because standard reward maximization does not naturally discover this alignment, we introduce a PMP-derived co-state loss to explicitly structure the internal dynamics. Empirically, this approach matches or improves performance on partially observable DMControl tasks, and is robust against zero-shot out-of-distribution sensor masking. By framing recurrent networks as dynamic processes governed by the minimum principle, we provide a principled approach to designing robust continuous control policies.

2605.04617 2026-05-12 cs.CV cs.HC cs.LG

Temporal Structure Matters for Efficient Test-Time Adaptation in Wearable Human Activity Recognition

Zishu Zhou, Zaipeng Xie, Xuanyao Jie

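Affiliations: College of Computer Science and Software Engineering, Hohai University

AI summary: Wearable human activity recognition models often degrade under real-world cross-user distribution shifts, and existing test-time adaptation methods largely inherit assumptions from vision tasks without fully exploiting the temporal structure of activity streams. This paper revisits temporal structure as a feature-conditioned inference signal and proposes an adaptation mechanism based on temporal continuity and feature deviation that determines when to preserve or release temporal inertia and where to route prediction refinement. Building on this, the authors design SIGHT, a lightweight, backpropagation-free framework for real-time adaptation; experiments on real-world datasets show that it outperforms existing methods while reducing computational and memory costs.

Abstract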

Wearable human activity recognition (WHAR) models often suffer from performance degradation under real-world cross-user distribution shifts. Test-time adaptation (TTA) mitigates this degradation by adapting models online using unlabeled test streams, yet existing methods largely inherit assumptions from vision tasks and underexploit the inherent inter-window temporal structure in WHAR streams. In this paper, we revisit such temporal structure as a feature-conditioned inference signal rather than merely an output-space smoothing prior. We derive the insight that temporal continuity and observation-induced feature deviations provide complementary cues for determining when to preserve or release temporal inertia and where to route prediction refinement during likely transitions. Building upon this insight, we propose SIGHT, a lightweight and backpropagation-free TTA framework for WHAR, enabling real-time edge deployment. SIGHT estimates predictive surprise by comparing the current feature with a prototype-based expected state, and then uses the resulting feature deviation to guide geometry-aware transition routing based on prototype alignment and stream-level marginal habit tracking. Evaluations on real-world datasets confirm that SIGHT outperforms existing TTA baselines while reducing computational and memory costs.

2605.04541 2026-05-12 cs.CV

Angle-I2P: Angle-Consistent-Aware Hierarchical Attention for Cross-Modality Outlier Rejection

Muyao Peng, Shun Zou, Pei An, You Yang, Qiong Liu

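Affiliations: School of Electronic Information and Communications, Huazhong University of Science and Technology

AI summary: This paper proposes Angle-I2P, an outlier rejection method for image-to-point-cloud registration that targets the case where conventional PnP methods struggle because the inlier ratio is low. It combines an angle-consistency constraint with a hierarchical attention mechanism to improve the robustness and accuracy of registration. Experiments show that Angle-I2P achieves state-of-the-art registration results on several datasets.

Comments: Accepted by ICRA 2026

Abstract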

Image-to-point-cloud registration (I2P) is a fundamental task in robotic applications such as manipulation, grasping, and localization. Existing deep learning-based I2P methods seek to align image and point cloud features in a learned representation space to establish correspondences, and have achieved promising results. However, when the inlier ratio of the initial matching pairs is low, conventional Perspective-n-Points (PnP) methods may struggle to achieve accurate results. To address this limitation, we propose Angle-I2P, an outlier rejection network that leverages angle-consistent geometric constraints and hierarchical attention. First, we design a scale-invariant, cross-modality geometric constraint based on angular consistency. This explicit geometric constraint guides the model in distinguishing inliers from outliers. Furthermore, we propose a global-to-local hierarchical attention mechanism that effectively filters out geometrically inconsistent matches under rigid transformation, thereby improving the Inlier Ratio (IR) and Registration Recall (RR). Experimental results demonstrate that our method achieves state-of-the-art performance on the 7Scenes, RGBD Scenes V2, and a self-collected dataset, with consistent improvements across all benchmarks.

2605.03650 2026-05-12 cs.CV cs.AI cs.LG

Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

Zhiyuan Li, Rongzhen Zhao, Wenyan Yang, Wenshuai Zhao, Pekka Marttinen, Joni Pajarinen

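Affiliations: Department of Electrical Engineering and Automation; Aalto University; Department of Computer Science

AI summary: This paper rethinks temporal consistency in video object-centric learning, arguing that current methods, which rely on dynamics modules to predict future object representations, are in effect expensive approximations of a discrete correspondence problem. The authors propose Grounded Correspondence, a framework that initializes object slots from salient regions in frozen backbone features and maintains frame-to-frame identity through Hungarian matching; it needs no learnable parameters for temporal modeling yet achieves competitive performance on several datasets.

Abstract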

The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive approximations of discrete correspondence problems. Modern self-supervised vision backbones already encode instance-discriminative features that distinguish objects reliably. Exploiting these features eliminates the need for learned temporal prediction. We introduce Grounded Correspondence, a framework that replaces learned transition functions with deterministic bipartite matching. Slots initialize from salient regions in frozen backbone features. Frame-to-frame identity is maintained through Hungarian matching on slot representations. The approach requires zero learnable parameters for temporal modeling yet achieves competitive performance on MOVi-D, MOVi-E, and YouTube-VIS. Project page: https://magenta-sherbet-85b101.netlify.app/
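The frame-to-frame matching step can be sketched as a minimum-cost assignment between the slots of consecutive frames. To stay dependency-free, the sketch below searches all permutations instead of running the Hungarian algorithm; for a small number of slots it returns the same optimal assignment, only more slowly. The squared-Euclidean cost, the equal slot counts, and the function names are assumptions, since the abstract does not specify them.

```python
from itertools import permutations


def sq_dist(a, b):
    """Squared Euclidean distance between two slot vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_slots(prev_slots, curr_slots):
    """Return perm such that curr_slots[perm[i]] is matched to prev_slots[i].

    Brute-force stand-in for Hungarian matching; assumes equal slot counts.
    """
    n = len(prev_slots)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(sq_dist(prev_slots[i], curr_slots[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return list(best_perm)


prev_slots = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
curr_slots = [[5.1, 4.9], [0.1, -0.1], [0.9, 1.2]]  # same objects, shuffled
assignment = match_slots(prev_slots, curr_slots)
reordered = [curr_slots[j] for j in assignment]  # slot i keeps its identity
```

Reordering the current slots by the assignment keeps each index tied to the same object over time without any learned transition function. In practice a polynomial-time solver would replace the permutation search.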

2605.03639 2026-05-12 cs.CV

Diffusion Masked Pretraining for Dynamic Point Cloud

Zhuoyue Zhang, Jihua Zhu, Chaowei Fang, Jian Liu, Ajmal Saeed Mian

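Affiliations: Xi’an Jiaotong University; School of Artificial Intelligence and Robotics, Hunan University; University of Western Australia

AI summary: This paper proposes DiMP, a unified self-supervised pretraining framework for dynamic point clouds. By introducing diffusion modeling, it addresses two problems in existing masked reconstruction objectives: spatio-temporal positional leakage and the loss of motion uncertainty. DiMP applies diffusion to both positional inference and motion learning: it predicts clean tube centers from visible spatio-temporal context, and it recasts inter-frame displacement supervision as a noise-prediction task for a conditional diffusion model, so that the conditional distribution of motion is modeled more fully. Experiments show that DiMP consistently improves performance on several downstream tasks.

Abstract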

Dynamic point cloud pretraining is still dominated by masked reconstruction objectives. However, these objectives inherit two key limitations. Existing methods inject ground-truth tube centers as decoder positional embeddings, causing spatio-temporal positional leakage. Moreover, they supervise inter-frame motion with deterministic proxy targets that systematically discard distributional structure by collapsing multimodal trajectory uncertainty into conditional means. To address these limitations, we propose Diffusion Masked Pretraining (DiMP), a unified self-supervised framework for dynamic point clouds. DiMP introduces diffusion modeling into both positional inference and motion learning. It first applies forward diffusion noise only to masked tube centers, then predicts clean centers from visible spatio-temporal context. This removes positional leakage while preserving visible coordinates as clean temporal anchors. DiMP also reformulates point-wise inter-frame displacement supervision as a DDPM noise-prediction objective conditioned on decoded representations. This design drives the encoder to target the full conditional distribution of plausible motions under a variational surrogate, rather than collapsing to a single deterministic estimate. Extensive experiments demonstrate that DiMP consistently improves downstream accuracy over the backbone alone, with absolute gains of 11.21% on offline action segmentation and 13.65% under causally constrained online inference. Code is available at https://github.com/InitalZ/DiMP.git.

2605.01643 2026-05-12 cs.LG cs.AI

AI Alignment via Incentives and Correction

Rohit Agarwal, Joshua Lin, Mark Braverman, Elad Hazan

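Affiliations: Princeton University

AI summary: Drawing on law-and-economics models of deterrence and enforcement, this paper studies AI alignment and argues that misbehavior by AI systems is a strategic response to incentives rather than a purely external failure. It treats alignment as an equilibrium problem in which reward design shapes the interaction between a solver and an auditor. The authors propose a bandit-based procedure for searching over reward profiles and validate it on an LLM code-generation pipeline.

Abstract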

We study AI alignment through the lens of law-and-economics models of deterrence and enforcement. In these models, misconduct is not treated as an external failure, but as a strategic response to incentives: an actor weighs the gain from violation against the probability of detection and the severity of punishment. We argue that the same logic arises naturally in agentic AI pipelines. A solver may benefit from producing a persuasive but incorrect answer, hiding uncertainty, or exploiting spurious shortcuts, while an auditor or verifier must decide whether costly monitoring is worthwhile. Alignment is therefore a fixed-point problem: stronger penalties may deter solver misbehavior, but they can also reduce the auditor's incentive to inspect, since auditing then mainly incurs cost on a population that appears increasingly aligned. This perspective also changes what should count as a post-training signal. Standard feedback often attaches reward to the final answer alone, but a solver-auditor pipeline exposes the full correction event: whether the solver erred, whether the auditor inspected, whether the error was caught, and whether oversight incentives remained active. We formalize this interaction in a two-agent model in which a principal chooses rewards over joint correction outcomes, inducing both solver behavior and auditor monitoring. Reward design is therefore a bilevel optimization problem: rewards are judged not by their immediate semantic meaning, but by the behavioral equilibrium they induce. We propose a bandit-based outer-loop procedure for searching over reward profiles using noisy interaction feedback. Experiments on an LLM coding pipeline show that adaptive reward profiles can maintain useful oversight pressure and improve principal-aligned outcomes relative to static hand-designed rewards, including a substantial reduction in hallucinated incorrect attempts.
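The deterrence logic in the abstract can be made concrete with a toy decision rule. The sketch below assumes risk-neutral agents that compare expected values; all names and numbers are hypothetical and are not taken from the paper's two-agent model.

```python
def solver_misbehaves(gain, p_detect, penalty):
    """Misbehave only if the gain exceeds the expected punishment."""
    return gain > p_detect * penalty


def auditor_inspects(p_misbehave, catch_reward, audit_cost):
    """Inspect only if the expected reward for catching errors covers the cost."""
    return p_misbehave * catch_reward > audit_cost


# A weak penalty fails to deter; a strong one deters.
weak = solver_misbehaves(gain=1.0, p_detect=0.5, penalty=1.0)
strong = solver_misbehaves(gain=1.0, p_detect=0.5, penalty=4.0)

# Once misbehavior becomes rare, auditing no longer pays for itself.
audit_before = auditor_inspects(p_misbehave=0.5, catch_reward=10.0, audit_cost=2.0)
audit_after = auditor_inspects(p_misbehave=0.1, catch_reward=10.0, audit_cost=2.0)
```

The last two lines show the tension the abstract describes: the stronger penalty lowers the misbehavior rate, which in turn removes the auditor's incentive to inspect, so the two decisions have to be solved jointly rather than one at a time.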

2605.00539 2026-05-12 cs.CL cs.DC

AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs

Wenxiang Lin, Juntao Huang, Luhan Zhang, Laili Li, Xiang Bao, Mengyang Zhang, Bing Wang, Shaohuai Shi

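Affiliations: School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen; Huawei Technologies Ltd

AI summary: This paper proposes AGoQ, a quantization method that improves the memory efficiency of distributed training for large language models. It combines a layer-aware activation quantization algorithm, which achieves near 4-bit activation storage, with an 8-bit gradient quantization algorithm that provides compact gradient storage and communication-efficient All-Reduce, reducing memory usage and speeding up training. Experiments on LLaMA models of several sizes show reduced memory consumption and faster training than existing systems while preserving convergence and task accuracy.

Abstract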

Quantization is a key method for reducing the GPU memory requirement of training large language models (LLMs). Yet, current approaches are ineffective for 4-bit activations and 8-bit gradients, which can easily cause slow convergence or accuracy loss. To address this, we introduce AGoQ, incorporating two new techniques: 1) a layer-aware activation quantization algorithm that allocates appropriate bit-widths for activations of various layers based on their types and pipeline stages to achieve near 4-bit activation storage, and 2) a gradient quantization algorithm that reduces memory usage and shortens communication time by employing 8-bit gradient storage and precision-preserving 8-bit All-Reduce communication. We conduct extensive experiments using different sizes of LLMs on two GPU clusters (up to 64 GPUs), and the experimental results show that our AGoQ reduces the memory by up to 52% and achieves up to 1.34× improvement of training speed compared to state-of-the-art training systems Megatron-LM (w/ or w/o ZeRO), COAT and DeepSpeed with 8B to 32B LLaMA models, while preserving convergence on pretraining and achieving comparable accuracy on downstream tasks with LLaMA architectures.
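As background for the 8-bit gradient storage mentioned above, the following sketch shows generic symmetric per-tensor int8 quantization. It is not AGoQ's algorithm: the layer-aware bit allocation and the precision-preserving All-Reduce are not described in enough detail in the abstract to reproduce, and the function names and the single shared scale are assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the signed 8-bit range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate floating-point values."""
    return [x * scale for x in q]


grads = [0.5, -1.27, 0.0, 1.27]
q, scale = quantize_int8(grads)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(grads, restored))
```

Each value is stored in one byte plus a shared scale, and the rounding error per element is at most half a quantization step. Outliers stretch the scale and reduce resolution for everything else, which is one reason naive low-bit schemes can hurt convergence.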

2605.00370 2026-05-12 cs.LG cs.CY cs.MM

Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration

Chunlei Meng, Pengbin Feng, Rong Fu, Hoi Leong Lee, Xiaojing Du, Zhaolu Kang, Zeyu Zhang, Weilin Zhou, Chun Ouyang, Zhongxue Gan

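Affiliations: University of Macau; Universiti Malaysia Perlis; Peking University; Xinjiang University

AI summary: This paper proposes Group Cognition Learning (GCL), a collaborative learning framework that addresses modality dominance and spurious modality coupling in multimodal learning. GCL uses a two-stage protocol: in the first stage, a Routing Agent and an Auditing Agent selectively promote beneficial cross-modal interactions and suppress redundant coupling; in the second stage, a Public-Factor Agent and an Aggregation Agent produce the final prediction while keeping each modality's representation separate. Experiments show that GCL outperforms existing methods on several multimodal benchmarks.

Comments: This study has been accepted by ICML 2026. The current version is a manuscript; please refer to the official version released at ICML 2026 for the final published version.

Abstract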

Centralized multimodal learning commonly compresses language, acoustic, and visual signals into a single fused representation for prediction. While effective, this paradigm suffers from two limitations: modality dominance, where optimization gravitates towards the path of least resistance, ignoring weaker but informative modalities, and spurious modality coupling, where models overfit to incidental cross-modal correlations. To address these, we propose Group Cognition Learning (GCL), a governed collaboration paradigm that applies a two-stage protocol after modality-specific encoding. In Stage 1 (Selective Interaction), a Routing Agent proposes directed interaction routes, and an Auditing Agent assigns sample-wise gates to emphasize exchanges that yield positive marginal predictive gain while suppressing redundant coupling. In Stage 2 (Consensus Formation), a Public-Factor Agent maintains an explicit shared factor, and an Aggregation Agent produces the final prediction through contribution-aware weighting while keeping each modality representation as a specialization channel. Extensive experiments on CMU-MOSI, CMU-MOSEI, and MIntRec demonstrate that GCL mitigates dominance and coupling, establishing state-of-the-art results across both regression and classification benchmarks. Analysis experiments further demonstrate the effectiveness of the design.

2605.00195 2026-05-12 cs.LG

Diversity in Large Language Models under Supervised Fine-Tuning

Roman Klypa, Oleksandr Cherednichenko

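Affiliations: Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France; Department of Mathematics and Mathematical Statistics, Integrated Science Lab, Umeå University, Sweden

AI summary: This study examines how supervised fine-tuning (SFT) affects the generative diversity of large language models. It finds that SFT reduces diversity and attributes the decline to the neglect of low-frequency patterns in fine-tuning data and the forgetting of pretrained knowledge. The authors propose a new objective, Tempered Focal (TOFU) loss, that addresses both problems at once. Experiments show that TOFU increases output diversity while preserving response quality, offering a principled approach to SFT.

Abstract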

Supervised Fine-Tuning (SFT) is essential for aligning Large Language Models (LLMs) with user intent, yet it is believed to suppress generative diversity. Although this reduction is frequently referenced, formal empirical testing of the phenomenon remains limited. The expressiveness of LLMs by itself was addressed by multiple prior methods. Their varying perspectives suggest that deeper investigation could yield further improvements. In this study, we attribute the decline to two primary drivers: the neglect of low-frequency patterns within fine-tuning datasets and the forgetting of preexisting knowledge. Motivated by our theoretical analysis, we develop Tempered Focal (TOFU) loss, a novel objective that addresses both stated challenges simultaneously. Our extensive evaluation confirms at scale that generation breadth narrows after SFT and strengthens the hypothesis explaining this effect. Across multiple models and benchmarks, we demonstrate that TOFU enhances output diversity while preserving high response quality, offering a principled approach to SFT.

2604.27629 2026-05-12 cs.AI

WaferSAGE: Large Language Model-Powered Wafer Defect Analysis via Synthetic Data Generation and Rubric-Guided Reinforcement Learning

Ke Xu, Zhongyuan Lian

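Affiliations: Shanghai Huahong Grace Semiconductor Manufacturing Corporation; Dept. of Automation, School of Information Science and Engineering, East China University of Science and Technology

AI summary: This paper proposes WaferSAGE, a framework for visual question answering on wafer defects that combines small vision-language models with synthetic data generation to address data scarcity in semiconductor manufacturing. Structured rubric generation and reinforcement learning improve the accuracy of defect identification and analysis, and the model is trained starting from only limited labeled data. Experiments indicate that the approach can surpass large proprietary models on specialized industrial visual understanding, offering a privacy-preserving and lower-cost deployment option for semiconductor manufacturing.

Comments: 16 pages, 3 figures, 8 tables

Abstract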

We present WaferSAGE, a framework for wafer defect visual question answering using small vision-language models. To address data scarcity in semiconductor manufacturing, we propose a three-stage synthesis pipeline incorporating structured rubric generation for precise evaluation. Starting from limited labeled wafer maps, we employ clustering-based cleaning to filter label noise, then generate comprehensive defect descriptions using vision-language models, which are converted into structured evaluation rubric criteria. These rubrics guide the synthesis of VQA pairs, ensuring coverage across defect type identification, spatial distribution, morphology, and root cause analysis. Our dual assessment framework aligns rule-based metrics with LLM-Judge scores via Bayesian optimization, enabling reliable automated evaluation. Through curriculum-based reinforcement learning with Group Sequence Policy Optimization (GSPO) and rubric-aligned rewards, our 4B-parameter Qwen3-VL model achieves a 6.493 LLM-Judge score, closely approaching Gemini-3-Flash (7.149) while enabling complete on-premise deployment. We demonstrate that small models with domain-specific training can surpass proprietary large models in specialized industrial visual understanding, offering a viable path for privacy-preserving, cost-effective deployment in semiconductor manufacturing.

2604.23876 2026-05-12 cs.LG

Cardiac Stability Theory: An Axiomatically Grounded Framework for Continuous Cardiac Health Monitoring via Smartphone Photoplethysmography

Timothy Oladunni, Farouk Ganiyu Adewumi

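Affiliations: Department of Computer Science, Morgan State University

AI summary: This paper proposes Cardiac Stability Theory (CST), an axiomatic framework for continuous cardiac health monitoring via smartphone photoplethysmography (PPG). It defines cardiovascular health as a stability margin around a cardiac dynamical attractor and builds the Cardiac Stability Index (CSI) from the largest Lyapunov exponent, recurrence determinism, and signal entropy. The study reports CSI performance on ECG and PPG data and uses domain transfer to run in real time on smartphones, offering a new approach to long-term non-invasive cardiac monitoring.

Abstract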

We present Cardiac Stability Theory (CST), an axiomatically grounded framework formally defining cardiovascular health as a stability margin around a cardiac dynamical attractor. From four axioms we derive the Cardiac Stability Index (CSI), a composite scalar in [0,1] integrating the largest Lyapunov exponent, recurrence determinism, and signal entropy via time-delay embedding. The ECG-based model (CSISurrogateV2, CNN-Transformer) achieves $R^2=0.8788$, MAE$=0.0234$ on PTB-XL (21,799 recordings). We extend CSI to smartphone PPG via Complementary Domain Transfer (CDT): CSISurrogateV2 generates pseudo-labels for the BUT PPG dataset (48 recordings, 12 subjects), training TinyCSINet (122,849 parameters), achieving MAE$=0.0557$, $\rho=0.660$ on the held-out test set ($n=1065$ windows) at ${<}30$ ms mobile latency. CDT is validated on BIDMC, Welltory, and RWS-PPG. Paired validation on 5,035 BIDMC windows yields $r=0.454$ ($\rho=0.485$, $p<10^{-295}$), confirming correlated cardiac stability across modalities. CSI is negatively correlated with age (slope $= -0.000225$ CSI/year, PTB-XL), discriminates atrial fibrillation from normal sinus rhythm (AUROC$=0.89$), and is robust under Perturbation Invariance Training (max AUC drop 1.65%). We derive HeartSpan, a longitudinal stability metric relative to population age norms, enabling continuous non-invasive cardiac monitoring from commodity smartphones for longevity tracking and cardiac risk stratification.

2604.23750 2026-05-12 cs.LG cs.AI

The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

Shuaizhi Cheng, Xiang Shi, Zhiwei Zhang, Mingwei Li

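Affiliations: Harbin Institute of Technology; Imperial College London; KigLand Machine Learning Lab

AI summary: This paper studies why hypernetwork-based instant LLM adaptation fails on knowledge conflicts and finds that the root cause is a magnitude problem rather than insufficient representational capacity. The analysis shows that the hypernetwork targets the right layers, but the adapter margin is approximately constant while the pretrained margin grows with training frequency, so deeply entrenched conflicting knowledge is hard to override. The authors propose amplitude-boosting methods, Selective Layer Boosting and Conflict-Aware Internalization, which require no retraining and substantially improve accuracy on deep conflicts.

Comments: 35 pages, 15 figures. v2: minor layout fixes and author list update

Abstract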

Hypernetwork-based methods such as Doc-to-LoRA internalize a document into an LLM's weights in a single forward pass, but they fail systematically on conflicts: when the document contradicts pretraining knowledge, accuracy collapses to 46.4% on the deepest facts. We show the failure is a magnitude problem rather than a representational one. The hypernetwork already targets the right layers, but its adapter margin is approximately constant across documents while the pretrained margin grows with training frequency, so deep conflicts lose by construction. The account predicts that failure should track prior strength: sorting 194 conflicts by the base model's log-probability on the contradicted fact, baseline accuracy falls from 68% on weak-prior questions to 16% on strong-prior ones, a 52 percentage-point gap. The cure is amplitude. Selective Layer Boosting scales the adapter at its top-norm layers, and Conflict-Aware Internalization triggers boosting only when the base model is confident. Both are training-free; together they raise deep-conflict accuracy from 46.4% to 71.0% on Gemma-2B and from 53.6% to 72.5% on Mistral-7B while preserving novel-knowledge recall, and beat vanilla retrieval-augmented generation on medium conflicts by 18 percentage points despite operating entirely in parameter space. We release KID-Bench, a 489-question benchmark that separates novel recall, cross-knowledge combination, and prior-graded conflicts.
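The two training-free fixes can be pictured as a gated rescaling of per-layer adapter strength. The sketch below is only an illustration of that idea: the abstract does not say how many layers are boosted, by what factor, or how base-model confidence is measured, so `top_k`, `boost`, and the boolean `base_confident` gate are all hypothetical.

```python
def boost_scales(layer_norms, top_k, boost, base_confident):
    """Per-layer adapter multipliers: boost the top-norm layers when gated on.

    layer_norms: adapter weight norm per layer (hypothetical inputs).
    base_confident: stand-in for the conflict-aware trigger.
    """
    scales = [1.0] * len(layer_norms)
    if not base_confident:
        return scales  # no strong prior to override, leave the adapter alone
    ranked = sorted(range(len(layer_norms)),
                    key=lambda i: layer_norms[i], reverse=True)
    for i in ranked[:top_k]:
        scales[i] = boost
    return scales


norms = [0.2, 1.5, 0.7, 1.1]
boosted = boost_scales(norms, top_k=2, boost=2.0, base_confident=True)
untouched = boost_scales(norms, top_k=2, boost=2.0, base_confident=False)
```

With these toy norms, only the second and fourth layers are amplified, and nothing changes when the gate is off. Gating the boost is what would keep recall of novel knowledge intact, since amplification is applied only where a strong prior has to be overridden.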

2604.21232 2026-05-12 cs.AI

ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures

Xiyin Zeng, Yuyu Sun, Haoyang Li, Shouqiang Liu, Hao Wang

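Affiliations: Hong Kong University of Science and Technology (Guangzhou); South China Normal University

AI summary: This paper proposes ReCAPA, a hierarchical predictive correction architecture that targets the cascading failures that can occur when vision-language-action systems execute multi-step tasks. It applies prediction and contrast at three levels (actions, subgoals, and trajectories), together with semantic alignment modules, to adjust deviations and make task execution more robust. Experiments show that ReCAPA achieves competitive results on several embodied agent benchmarks and outperforms existing large language model baselines.

Abstract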

Vision-Language-Action systems follow instructions to execute multi-step tasks in multimodal environments. Recent VLA approaches typically rely on post-hoc correction mechanisms or operate under fixed task decompositions and alignment schemes. However, once an intermediate step is mis-specified, local errors propagate through subsequent steps and eventually accumulate into cascading failures. To mitigate this compounding effect, we propose Predictive Alignment and Planning Architecture, a framework that uses prediction and contrast to adjust deviations across three levels: actions, subgoals, and trajectories. Semantic alignment is enforced at all levels using a Sinkhorn-based module and a Score-field module. The predictive correction and alignment jointly update the action generator during training, enabling it to adjust fine-grained steps to remain aligned with the overall intent. We further introduce two new metrics to quantify error propagation and recovery processes in tasks, capturing how mistakes spread and fade over long-horizon execution. Experiments show that ReCAPA achieves competitive results on embodied agent benchmarks such as VisualAgentBench, MineDojo, and AI2-THOR, outperforming strong proprietary and open-source Large Language Model baselines.

2604.19838 2026-05-12 cs.AI

Resolving space-sharing conflicts in road user interactions through uncertainty reduction: An active inference-based computational model

Julian F. Schumann, Johan Engström, Ran Wei, Shu-Yuan Liu, Jens Kober, Arkady Zgonnikov

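Affiliations: Department of Cognitive Robotics, Delft University of Technology, Netherlands; Waymo LLC, Mountain View, CA, USA

AI summary: This paper studies how road users resolve space-sharing conflicts and proposes an active inference-based computational model that simulates the interaction between two agents. The model reduces uncertainty in interaction through three mechanisms: implicit communication, normative expectations, and explicit communication. It shows that normative and explicit communication cues raise the likelihood of successful conflict resolution, but that relying on them can lead to collisions when the other agent violates norms or communicates misleading information. The work offers a theoretical basis for modeling road user interactions and may apply in other fields.

Abstract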

Understanding how road users resolve space-sharing conflicts is important both for traffic safety and the safe deployment of autonomous vehicles. While existing models have captured specific aspects of such interactions (e.g., explicit communication), a theoretically-grounded computational framework has been lacking. In this paper, we extend a previously developed active inference-based driver behavior model to simulate interactive behavior of two agents. Our model captures three complementary mechanisms for uncertainty reduction in interaction: (i) implicit communication via direct behavioral coupling, (ii) reliance on normative expectations (stop signs, priority rules, etc.), and (iii) explicit communication. In a simplified intersection scenario, we show that normative and explicit communication cues can increase the likelihood of a successful conflict resolution. However, this relies on agents acting as expected. In situations where another agent (intentionally or unintentionally) violates normative expectations or communicates misleading information, reliance on these cues may induce collisions. These findings illustrate how active inference can provide a novel framework for modeling road user interactions which is also applicable in other fields.

2604.19792 2026-05-12 cs.AI cs.DC cs.MA cs.NE

OpenCLAW-P2P v7.0-P2PCLAW: Resilient Multi-Layer Persistence, Live Reference Verification, and Production-Scale Evaluation of Decentralized AI Peer Review v7.0 -- Mathematical Corrections & Ecosystem Developments Edition

Francisco Angulo de Lafuente, Teerth Sharma, Vladimir Veselov, Seid Mohammed Abdu, Nirmal Tej Kumar, Guillermo Perry

Affiliations * Independent AI Researcher & Science Fiction Writer; Bachelor of Technology (AI); Moscow Institute of Electronic Technology (MIET); Dept. of Computer Science, Woldia University; University of Texas at Dallas (UTD); Andex Enterprising Inc.

AI summary This paper presents OpenCLAW-P2P v7.0, a decentralized collective-intelligence platform on which autonomous AI agents publish, peer-review, score, and iteratively improve scientific papers without human reviewers. This version adds mathematical corrections to the theoretical framework, ensuring dimensional consistency, range constraints, and unambiguous notation, and expands the ecosystem with CAJAL, an open-source language model for scientific paper generation. The platform also retains its four core subsystems, which improve storage reliability, retrieval efficiency, and citation-verification accuracy.

Comments v7.0: Mathematical corrections (fixed-point condition Eq.4, dimensionally consistent tau-indicator Eq.7, fully specified reputation formula Eq.8 with quality terms q0 and q-bar, discrete-time PD Governor Eq.15, HSR parameter definitions Eq.16); ecosystem developments: CAJAL-4B/9B models, BenchClaw platform, 14 integrations. 36 pages

Details
English abstract

This paper presents OpenCLAW-P2P v7.0, a comprehensive evolution of the decentralized collective-intelligence platform in which autonomous AI agents publish, peer-review, score, and iteratively improve scientific research papers without any human gatekeeper. Building on the v6.0 foundations -- multi-layer persistence, live reference verification, multi-LLM granular scoring, calibrated deception detection, the Silicon Chess-Grid FSM, and the AETHER containerized inference engine -- this release introduces mathematical corrections to the theoretical framework, ensuring dimensional consistency, proper range constraints, and unambiguous notation throughout. Additionally, this edition documents significant ecosystem expansions including the CAJAL family of open-source language models (4B and 9B parameters) fine-tuned for scientific paper generation. The four major subsystems introduced in v6.0 are retained: (i) a Multi-Layer Paper Persistence Architecture with four storage tiers ensuring zero paper loss; (ii) a Multi-Layer Retrieval Cascade reducing latency from >3s to <50ms; (iii) a Live Reference Verification system detecting fabricated citations with >85% accuracy; and (iv) a Scientific API Proxy providing access to seven public scientific databases. Mathematical corrections in v7.0 include: corrected fixed-point condition in the Sufficient Reason theorem; dimensionally consistent progress-rate indicator; fully specified reputation update formula incorporating quality terms q0 and q-bar; clarified attention-logit bound in the AETHER pruning theorem; explicit range documentation for the calibration mapping; non-negativity guarantee for the depth score; discrete-time notation for the PD Governor; and explicit parameter definitions for the HSR weight formula.

2604.19530 2026-05-12 cs.LG cs.CE stat.ML

Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention

Akash Yadav, Taiwo A. Adebiyi, Ruda Zhang

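Affiliations * University of Houston

AI summary This paper studies how to give scientific foundation models well-calibrated predictive uncertainty. It proposes Stochastic Attention, a lightweight inference-time modification that injects randomness into the attention weights to produce predictive ensembles without retraining the model. A calibration objective tunes the randomness parameter, giving efficient post-hoc calibration. Experiments show better calibration and narrower prediction intervals on weather forecasting, time-series, and regression tasks, at a computational cost far below that of existing methods.

Details
English abstract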
Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on scientific foundation models for weather and time-series forecasting, as well as several regression tasks. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable calibration, with adaptation costs nearly three orders of magnitude lower than the next-best baseline.
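
The sampling step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and `n_draws` (the number of multinomial trials) is only an assumed stand-in for the "single concentration parameter". The sketch handles one query with scalar values and uses only the Python standard library.

```python
import math
import random
from collections import Counter


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stochastic_attention_weights(scores, n_draws, rng):
    """Replace softmax weights with normalized multinomial counts.

    n_draws is a hypothetical stand-in for the concentration parameter:
    larger values keep the sampled weights closer to the softmax weights.
    """
    probs = softmax(scores)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_draws)
    counts = Counter(draws)
    return [counts[i] / n_draws for i in range(len(probs))]


def attend(weights, values):
    """Weighted sum of scalar values for a single query."""
    return sum(w * v for w, v in zip(weights, values))


def predictive_ensemble(scores, values, n_draws, n_members, seed=0):
    """Repeat the stochastic forward pass to obtain an ensemble of outputs."""
    rng = random.Random(seed)
    return [
        attend(stochastic_attention_weights(scores, n_draws, rng), values)
        for _ in range(n_members)
    ]
```

The spread of `predictive_ensemble(...)` shrinks as `n_draws` grows, which is why a single scalar of this kind could plausibly be tuned post hoc against a calibration objective.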

2604.17565 2026-05-12 cs.CV

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models

Hong Jiang, Wensong Song, Zongxing Yang, Ruijie Quan, Yi Yang

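Affiliations * ReLER, CCAI, Zhejiang University; DBMI, HMS, Harvard University

AI summary UniGeo is a new camera-controllable image editing framework that aims to generate geometrically consistent views of a scene under different camera viewpoints. By injecting unified geometric guidance at the representation, architecture, and loss-function levels, it addresses the geometric drift and structural degradation that existing methods show under continuous camera motion. Experiments show that UniGeo significantly outperforms existing methods on multiple public datasets, with higher visual quality and geometric consistency.

Details
English abstract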
Camera-controllable image editing aims to synthesize novel views of a given scene under varying camera poses while strictly preserving cross-view geometric consistency. However, existing methods typically rely on fragmented geometric guidance, such as only injecting point clouds at the representation level despite models containing multiple levels, and are mainly based on image diffusion models that operate on discrete view mappings. These two limitations jointly lead to geometric drift and structural degradation under continuous camera motion. We observe that while leveraging video models provides continuous viewpoint priors for camera-controllable image editing, they still struggle to form stable geometric understanding if geometric guidance remains fragmented. To systematically address this, we inject unified geometric guidance across three levels that jointly determine the generative output: representation, architecture, and loss function. To this end, we propose UniGeo, a novel camera-controllable editing framework. Specifically, at the representation level, UniGeo incorporates a frame-decoupled geometric reference injection mechanism to provide robust cross-view geometry context. At the architecture level, it introduces geometric anchor attention to align multi-view features. At the loss function level, it proposes a trajectory-endpoint geometric supervision strategy to explicitly reinforce the structural fidelity of target views. Comprehensive experiments across multiple public benchmarks, encompassing both extensive and limited camera motion settings, demonstrate that UniGeo significantly outperforms existing methods in both visual quality and geometric consistency.

2604.14484 2026-05-12 cs.RO cs.AI math.OC

A Nonasymptotic Theory of Gain-Dependent Error Dynamics in Behavior Cloning

Junghoon Seo

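Affiliations * PIT IN Corp.

AI summary This paper studies the nonasymptotic finite-horizon error propagation of behavior cloning (BC) policies on position-controlled robots and reveals how controller gains affect the probability of task failure. By analyzing the gain-dependent closed-loop dynamics, the author introduces a proxy matrix $X_\infty(K)$ that characterizes the distribution of position errors and decomposes the task failure probability into a gain amplification factor, the validation loss, and a generalization slack term, showing that training loss alone cannot predict closed-loop performance. The paper also gives a scalar upper bound on the proxy matrix and analyzes the performance ranking across combinations of system stiffness and damping, providing a theoretical basis for understanding the stability of BC policies.

Details
English abstract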
Behavior cloning (BC) policies on position-controlled robots inherit the closed-loop response of the underlying PD controller, yet the nonasymptotic finite-horizon consequences of controller gains for BC failure remain open. We show that independent sub-Gaussian action errors propagate through the gain-dependent closed-loop dynamics to yield sub-Gaussian position errors whose proxy matrix $X_\infty(K)$ governs the failure tail. The probability of horizon-$T$ task failure factorizes into a gain-dependent amplification index $\Gamma_T(K)$ and the validation loss plus a generalization slack, so training loss alone cannot predict closed-loop performance. Under shape-preserving upper-bound structural assumptions, the proxy admits the scalar bound $X_\infty(K)\preceq\Psi(K)\bar X$, with $\Psi(K)$ decomposed into label difficulty, injection strength, and contraction. This ranks the four canonical regimes with compliant-overdamped (CO) tightest, stiff-underdamped (SU) loosest, and the stiff-overdamped versus compliant-underdamped ordering system-dependent. For the canonical scalar second-order PD system, the closed-form continuous-time stationary variance $X_\infty^{\mathrm{c}}(\alpha,\beta)=\sigma^2\alpha/(2\beta)$ is strictly monotone in stiffness and damping over the entire stable orthant, covering both underdamped and overdamped regimes, and the exact zero-order-hold (ZOH) discretization inherits this monotonicity. The analysis gives a nonasymptotic finite-horizon extension of the gain-dependent error-attenuation explanation of Bronars et al.
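
As a quick numerical illustration of the closed-form variance quoted in this abstract, the sketch below evaluates it for four example gain settings. Several things here are assumptions rather than content from the paper: the model used for the cross-check (a scalar system x'' = -alpha*(x - u) - beta*x' with white noise of intensity sigma^2 on the commanded position u), the damping-ratio convention zeta = beta / (2*sqrt(alpha)), and the example gain values themselves.

```python
import math


def stationary_variance(alpha, beta, sigma=1.0):
    """Closed-form value sigma^2 * alpha / (2 * beta) quoted in the abstract."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("the stable orthant requires alpha > 0 and beta > 0")
    return sigma ** 2 * alpha / (2.0 * beta)


def lyapunov_position_variance(alpha, beta, sigma=1.0):
    """Cross-check under the ASSUMED model x'' = -alpha*(x - u) - beta*x'.

    Solves A P + P A^T + Q = 0 entry by entry for A = [[0, 1], [-alpha, -beta]]
    and Q = diag(0, g^2), where g = sigma * alpha is the assumed noise gain.
    """
    g = sigma * alpha
    p12 = 0.0                          # (1,1) entry: 2 * p12 = 0
    p22 = g ** 2 / (2.0 * beta)        # (2,2) entry: -2*alpha*p12 - 2*beta*p22 + g^2 = 0
    p11 = (p22 - beta * p12) / alpha   # (1,2) entry: p22 - alpha*p11 - beta*p12 = 0
    return p11


def regime_gains(stiff, overdamped):
    """Hypothetical example gains; zeta = beta / (2 * sqrt(alpha)) is assumed."""
    alpha = 4.0 if stiff else 1.0
    zeta = 1.5 if overdamped else 0.25
    beta = 2.0 * zeta * math.sqrt(alpha)
    return alpha, beta


regimes = {
    "CO": regime_gains(stiff=False, overdamped=True),
    "CU": regime_gains(stiff=False, overdamped=False),
    "SO": regime_gains(stiff=True, overdamped=True),
    "SU": regime_gains(stiff=True, overdamped=False),
}
variances = {name: stationary_variance(a, b) for name, (a, b) in regimes.items()}
```

With these example gains CO gives the smallest variance and SU the largest, in line with the ranking stated in the abstract. The relative order of SO and CU changes with the chosen values, consistent with the abstract calling that ordering system-dependent.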

2604.11734 2026-05-12 cs.RO cs.AI

SCORP: Scene-Consistent Multi-agent Diffusion Planning with Stable Online Reinforcement Post-Training for Cooperative Driving

Haojie Bai, Aimin Li, Ruoyu Yao, Xiongwei Zhao, Tingting Zhang, Xing Zhang, Lin Gao, and Jun Ma

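Affiliations * School of Information Science and Technology, Harbin Institute of Technology (Shenzhen); Middle East Technical University (METU); Robotics and Autonomous Systems Thrust, Hong Kong University of Science and Technology (Guangzhou); School of Computer Science and Technology, Qinghai University

AI summary This paper proposes SCORP, a scene-consistent multi-agent diffusion planner for cooperative driving, combined with a stable online reinforcement learning post-training method. To address the weak scene consistency of existing diffusion models and their poor alignment with closed-loop cooperative objectives, SCORP introduces a scene-conditioned multi-agent denoising architecture and designs a two-layer Markov decision process that integrates the reverse denoising chain with policy-environment interaction. Experiments show that SCORP significantly outperforms existing open-source methods on core safety and efficiency metrics, demonstrating strong performance on cooperative driving tasks.

Details
English abstract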
Cooperative driving is a safety- and efficiency-critical task that requires the coordination of diverse, interaction-realistic multi-agent trajectories. Although existing diffusion-based methods can capture multimodal behaviors from demonstrations, they often exhibit weak scene consistency and poor alignment with closed-loop cooperative objectives. This makes post-training necessary for further improvement, yet achieving stable online post-training in reactive multi-agent environments remains challenging. In this paper, we propose SCORP, a scene-consistent multi-agent diffusion planner with stable online reinforcement learning (RL) post-training for cooperative driving. For pre-training, we develop a scene-conditioned multi-agent denoising architecture that couples inter-agent self-attention with a dual-path conditioning mechanism: cross-attention provides direct scene-information injection, while AdaLN-Zero enables additional flexible and stable conditional modulation, thereby improving the scene consistency and road adherence of joint trajectories. For post-training, we formulate a two-layer Markov decision process (MDP) that explicitly integrates the reverse denoising chain with policy-environment interaction. We further co-design dense, well-shaped planning rewards and variance-gated group-relative policy optimization (VG-GRPO) to mitigate advantage collapse and gradient instability during closed-loop training. Extensive experiments show that SCORP outperforms strong open-source baselines on WOMD, with 10.47%-28.26% and 1.70%-7.22% improvements in core safety and efficiency metrics, respectively. Moreover, compared with alternative post-training methods, SCORP delivers significant and consistent gains in both driving safety and traffic efficiency, highlighting stable and sustained advances in closed-loop cooperative driving.
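
The abstract does not say how the variance gate in VG-GRPO is defined. The sketch below shows one plausible reading only: standard group-relative advantages (rewards standardized within a rollout group), with groups whose reward spread falls below a threshold contributing zero advantage so that a near-zero denominator cannot inflate the gradient. The function name and the `min_std` threshold are hypothetical.

```python
import statistics


def group_relative_advantages(rewards, min_std=1e-3):
    """Standardize rewards within one rollout group, gating low-variance groups.

    The gate (return zeros when the group's reward std is below min_std) is an
    assumption for illustration, not the rule used in the paper.
    """
    if len(rewards) < 2:
        raise ValueError("a group needs at least two rollouts")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std < min_std:
        # Near-identical rewards carry no ranking signal; skip the group.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

For example, `group_relative_advantages([1.0, 1.0, 1.0])` returns all zeros, while a group with distinct rewards yields zero-mean advantages that favor the higher-reward rollouts.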

2604.11674 2026-05-12 cs.RO cs.AI

AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation

Mingyang Li, Haofan Xu, Haowen Sun, Xinzhe Chen, Sihua Ren, Liqi Huang, Xinyang Sui, Chenyang Miao, Jiawei Ye, Qiongjie Cui, Zeyang Liu, Xingyu Chen, Xuguang Lan

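Affiliations * School of Artificial Intelligence, Xi’an Jiaotong University

AI summary AffordSim is a scalable data generator and benchmark platform that aims to improve robots' perception of functional object regions for more precise manipulation. The method incorporates open-vocabulary 3D affordance prediction: given a natural-language task description, it generates a scene, localizes affordance regions, and generates the corresponding grasps, raising task success rates. AffordSim is validated on multiple robot platforms and complex objects, shows strong sim-to-real transfer, and achieves performance close to that of manually annotated data on several key tasks.

Details
English abstract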
Many everyday robot manipulation skills are affordance-dependent, with success determined by whether the robot contacts the functional object region required by the subsequent action. Current simulation data generators obtain contacts from generic grasp estimators or per-object manual contact annotations, but generic estimators rank stable grasps without task semantics and often select contacts that are misaligned with the downstream action, while manual contact annotations must be rewritten for each new object and task. To solve these challenges, we introduce AffordSim, a scalable data generator and benchmark that integrates open-vocabulary 3D affordance prediction into simulation-based trajectory generation. Given a natural-language task description, AffordSim synthesizes a task-relevant scene, emits affordance queries, grounds them on object surfaces, samples region-conditioned grasps, and selects executable candidates with motion planning. It further randomizes object pose, texture, lighting, image noise, and cross-viewpoint backgrounds for sim-to-real transfer. We instantiate AffordSim as a 50-task benchmark across diverse manipulation skills, five robot embodiments, and 500+ rigid and articulated objects. AffordSim achieves 93% of the trajectory collection success rate of manual contact annotations on affordance-critical tasks and 89% on hard composite tasks. Vision-language-action policies trained on AffordSim data transfer zero-shot to a real Franka FR3, reaching 24% average success.

2604.08577 2026-05-12 cs.LG cs.AI

Distributionally Robust Token Optimization in RLHF

Yeping Jin, Jiaming Hu, Ioannis Ch. Paschalidis

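Affiliations * Department of System Engineering; Boston University; Department of Math & Statistics

AI summary This work addresses the performance degradation that large language models can suffer under small input changes and proposes Distributionally Robust Token Optimization (DRTO). The method combines Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO), constructing f-divergence ambiguity sets to strengthen learning on difficult response segments. Experiments show that DRTO significantly improves performance under distribution shift on multiple reasoning tasks, outperforming conventional methods.

Details
English abstract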
Large Language Models (LLMs) tend to respond correctly to prompts that align well with the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO constructs f-divergence ambiguity sets over span-level actor losses, providing a principled way to emphasize difficult response segments during policy optimization. Empirically, DRTO improves consistency under distribution shift on multiple reasoning benchmarks across different tasks, achieving $+4.4$ percentage points on MATH-500 and $+2.7$ percentage points on LiveCodeBench over standard RTO.
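
To make the idea of an f-divergence ambiguity set over span-level losses concrete, here is a minimal sketch for one specific case. It assumes the KL divergence (one member of the f-divergence family) in its penalized form with a fixed temperature; the abstract does not state which divergence, radius, or span definition DRTO uses, and the function names are hypothetical. For KL, the worst-case reweighting of losses is the well-known exponential tilt, and the robust objective is a scaled log-mean-exp.

```python
import math


def kl_dro_weights(span_losses, temperature):
    """Worst-case span weights for a KL-penalized objective: q_i ∝ exp(l_i / tau)."""
    m = max(span_losses)
    exps = [math.exp((loss - m) / temperature) for loss in span_losses]
    total = sum(exps)
    return [e / total for e in exps]


def kl_dro_loss(span_losses, temperature):
    """Robust loss tau * log(mean(exp(l_i / tau))), computed stably."""
    m = max(span_losses)
    mean_exp = sum(
        math.exp((loss - m) / temperature) for loss in span_losses
    ) / len(span_losses)
    return m + temperature * math.log(mean_exp)
```

The robust loss always lies between the average and the maximum span loss. Lowering `temperature` moves it toward the maximum, which puts more weight on the hardest response segments.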

2604.06720 2026-05-12 cs.CV

Exploring 6D Object Pose Estimation with Deformation

Zhiqiang Liu, Rui Song, Duanmu Chuangqi, Jiaojiao Li, David Ferstl, Yinlin Hu

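Affiliations * State Key Laboratory of ISN, Xidian University; MagicLeap

AI summary This paper presents DeSOPE, a large-scale dataset for 6-degree-of-freedom (6DoF) pose estimation of deformed objects. Conventional 6D pose estimation methods usually assume rigid or articulated objects, but in practice objects deviate from their canonical shapes due to wear, impact, or deformation, which causes these methods to fail. DeSOPE therefore contains high-fidelity 3D scans of 26 common object categories in a canonical state and three deformed states, together with 133K RGB-D frames and 665K pose annotations, providing an important resource for research on pose estimation of deformed objects.

Comments Accepted at CVPR 2026

Details
English abstract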
We present DeSOPE, a large-scale dataset for 6DoF deformed objects. Most 6D object pose methods assume rigid or articulated objects, an assumption that fails in practice as objects deviate from their canonical shapes due to wear, impact, or deformation. To model this, we introduce the DeSOPE dataset, which features high-fidelity 3D scans of 26 common object categories, each captured in one canonical state and three deformed configurations, with accurate 3D registration to the canonical mesh. Additionally, it features an RGB-D dataset with 133K frames across diverse scenarios and 665K pose annotations produced via a semi-automatic pipeline. We begin by annotating 2D masks for each instance, then compute initial poses using an object pose method, refine them through an object-level SLAM system, and finally perform manual verification to produce the final annotations. We evaluate several object pose methods and find that performance drops sharply with increasing deformation, suggesting that robust handling of such deformations is critical for practical applications.

2604.04306 2026-05-12 cs.CV cs.AI

HighFM: Towards a Foundation Model for Learning Representations from High-Frequency Earth Observation Data

Stella Girtsou, Konstantinos Alexis, Giorgos Giannopoulos, Charalambos Kontoes

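Affiliations * National Observatory of Athens; National Technical University of Athens; National and Kapodistrian University of Athens; Athena Research Center

AI summary As climate-related disasters become more frequent, the need for real-time monitoring and early warning is increasingly urgent. This paper proposes HighFM, a foundation model for high-temporal-resolution multispectral remote sensing data. Using more than 2 TB of SEVIRI satellite imagery, it adapts a masked autoencoding framework to learn robust spatiotemporal representations and outperforms traditional methods and recent geospatial foundation models on cloud detection and fire identification, demonstrating the potential of geostationary satellite data for real-time remote sensing applications.

Details
English abstract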
The increasing frequency and severity of climate-related disasters have intensified the need for real-time monitoring, early warning, and informed decision-making. Earth Observation (EO), powered by satellite data and Machine Learning (ML), offers powerful tools to meet these challenges. Foundation Models (FMs) have revolutionized EO ML by enabling general-purpose pretraining on large-scale remote sensing datasets. However, most existing models rely on high-resolution satellite imagery with low revisit rates, limiting their suitability for fast-evolving phenomena and time-critical emergency response. In this work, we present HighFM, a first-cut approach towards an FM for high-temporal-resolution, multispectral EO data. Leveraging over 2 TB of SEVIRI imagery from the Meteosat Second Generation (MSG) platform, we adapt the SatMAE masked autoencoding framework to learn robust spatiotemporal representations. To support real-time monitoring, we enhance the original architecture with fine-grained temporal encodings to capture short-term variability. The pretrained models are then finetuned on cloud masking and active fire detection tasks. We benchmark our SEVIRI-pretrained Vision Transformers against traditional baselines and recent geospatial FMs, demonstrating consistent gains across both balanced accuracy and IoU metrics. Our results highlight the potential of temporally dense geostationary data for real-time EO, offering a scalable path toward foundation models for disaster detection and tracking.

2603.28902 2026-05-12 cs.AI

ChartDiff: A Large-Scale Benchmark for Comprehending Pairs of Charts

Rongtian Ye

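Affiliations * Department of Computer Science, Aalto University

AI summary ChartDiff is the first large-scale benchmark for cross-chart comparative understanding, addressing the lack of multi-chart comparative analysis in existing chart understanding tasks. The benchmark contains 8,541 chart pairs drawn from different data sources, chart types, and visual styles; each pair comes with a summary generated by a large language model and verified by humans that describes differences in trends, fluctuations, and anomalies. The study evaluates general-purpose models, chart-specialized models, and pipeline methods, finding that general-purpose models achieve the best generation quality, while specialized models and pipeline methods score higher on ROUGE but lower on human-aligned evaluation, revealing a mismatch between lexical overlap and actual summary quality.

Comments 21 pages, 17 figures, accepted to ACL 2026: the 4th Workshop on Advances in Language and Vision Research

Details
English abstract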
Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We further find that multi-series charts remain challenging across model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our findings demonstrate that comparative chart reasoning remains a significant challenge for current vision-language models and position ChartDiff as a new benchmark for advancing research on multi-chart understanding.

2603.26680 2026-05-12 cs.CL cs.AI

AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment

Jianfei Xiao, Xiang Yu, Chengbing Wang, Wuqiang Zheng, Xinyu Lin, Kaining Liu, Hongxun Ding, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He

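Affiliations * University of Science and Technology of China; National University of Singapore

AI summary As large language models (LLMs) evolve into lifelong AI assistants, model personalization has become a key research direction, yet the field lacks a standard evaluation benchmark to drive progress. This paper proposes AlpsBench, an LLM personalization benchmark built from real human-LLM dialogues. It contains 2,500 long-term interaction sequences with human-verified structured memories and evaluates four core tasks: extraction, updating, retrieval, and utilization of personalized information. The results reveal several challenges that current models face in handling personalization and provide a comprehensive evaluation framework for future research.

Details
English abstract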
As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate the entire lifecycle of memory management. Our benchmarking of frontier LLMs and memory-centric systems reveals that: (i) models struggle to reliably extract latent user traits; (ii) memory updating faces a performance ceiling even in the strongest models; (iii) retrieval accuracy declines sharply in the presence of large distractor pools; and (iv) while explicit memory mechanisms improve recall, they do not inherently guarantee more preference-aligned or emotionally resonant responses. AlpsBench aims to provide a comprehensive framework.

2603.21901 2026-05-12 cs.CV

CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

Qingdong He, Chaoyi Wang, Peng Tang, Yifan Yang, Xiaobin Hu

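Affiliations * University of Electronic Science and Technology of China; University of Chinese Academy of Sciences; Technical University of Munich; Shanghai Jiao Tong University; National University of Singapore

AI summary CLEAR is a mask-free, end-to-end video subtitle removal framework that distinguishes subtitles from background content while preserving temporal coherence. It uses a two-stage design: the first stage learns disentangled subtitle representations through self-supervised orthogonality constraints, and the second stage uses LoRA-based fine-tuning with a generation feedback mechanism for dynamic context adjustment, enabling adaptive inference without ground-truth masks. CLEAR is parameter-efficient and generalizes across languages: training only 0.77% of the base diffusion model's parameters, it outperforms mask-dependent baselines on multiple Chinese subtitle datasets and shows strong zero-shot generalization across six languages.

Comments Accepted by ICML 2026 (Spotlight)

Details
English abstract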
Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods require explicit mask sequences during both training and inference, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method only requires 0.77% of the parameters of the base diffusion model for training. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, while demonstrating superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), a performance enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks during inference.

2603.18256 2026-05-12 cs.LG cs.AI

MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasoning Models

Philippe Formont, Maxime Darrin, Ismail Ben Ayed, Pablo Piantanida

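Affiliations * Université Paris-Saclay; ÉTS Montréal; ILLS – International Laboratory on Learning Systems; Mila – Quebec AI Institute; Mistral AI; LIVIA; CNRS, CentraleSupélec

AI summary MolRGen is a training and evaluation framework for de novo molecular generation that addresses the lack of an effective reward mechanism for reasoning-based large language models on molecular generation tasks. The framework contains about 4,500 protein-pocket targets, yielding 50k multi-objective optimization prompts that combine docking scores with molecular properties, and assesses the quality of generated molecules by computing rewards at generation time. The study introduces a diversity-aware top-k metric and uses the verifier to fine-tune a large language model, showing its potential to improve performance in molecular design.

Details
English abstract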
Recent reasoning-based large language models have shown strong performance on tasks with verifiable outcomes, but their use in de novo molecular generation remains limited by the lack of training environments where rewards can be computed without reference molecules. We introduce MolRGen, a benchmark and molecular verifier for training and evaluating reasoning LLMs on de novo molecular generation. MolRGen contains approximately 4,500 protein-pocket targets, resulting in 50k multi-objective optimization prompts combining docking scores with molecular properties such as QED, synthetic accessibility, logP, and physicochemical descriptors. Unlike caption-based generation or molecule-editing benchmarks, MolRGen evaluates molecules proposed from scratch by computing rewards at generation time. We benchmark general-purpose and chemistry-specialized open-source LLMs and introduce a diversity-aware top-k metric to measure whether models can generate a diverse set of high-scoring molecules. Finally, we use the verifier to fine-tune a 128B LLM with GRPO, showing improved performance, at the cost of a diversity-exploitation trade-off. MolRGen provides a scalable testbed for studying verifier-based reasoning and reinforcement learning in molecular design.

2603.16964 2026-05-12 cs.CV cs.LG

Behavior-Centric Extraction of Scenarios from Highway Traffic Data and their Domain-Knowledge-Guided Clustering using CVQ-VAE

Niklas Roßberg, Sinan Hasirlioglu, Mohamed Essayed Bouzouraa, Wolfgang Utschick, Michael Botsch

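Affiliations * Technische Hochschule Ingolstadt; AUDI AG; Technische Universität München

AI summary This work aims to standardize the extraction of scenarios from highway traffic data and to cluster them using domain knowledge, in support of behavior evaluation of automated driving systems. It proposes a scenario extraction method based on the Scenario-as-Specification concept and combines it with a CVQ-VAE model for domain-knowledge-guided clustering, improving the interpretability and consistency of scenario categorization. Experiments show that the method reliably extracts scenarios from real data and effectively incorporates domain knowledge, providing a more efficient and standardized scenario categorization framework for validating automated driving systems.

Comments Accepted as a conference paper in IEEE Intelligent Vehicles Symposium (IV) 2026, Detroit, MI, United States

Details
English abstract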
Approval of an automated driving system (ADS) depends on evaluating its behavior within representative real-world traffic scenarios. A common way to obtain such scenarios is to extract them from real-world data recordings. These can then be grouped and serve as a basis on which the ADS is subsequently tested. This poses two central challenges: how scenarios are extracted and how they are grouped. Existing extraction methods rely on heterogeneous definitions, hindering scenario comparability. For the grouping of scenarios, rule-based or ML-based methods can be used. However, while modern ML-based approaches, unlike rule-based ones, can handle the complexity of traffic scenarios, they lack interpretability and may not align with domain knowledge. This work contributes a standardized scenario extraction based on the Scenario-as-Specification concept, as well as a domain-knowledge-guided scenario clustering process. Experiments on the highD dataset demonstrate that scenarios can be extracted reliably and that domain knowledge can be effectively integrated into the clustering process. As a result, the proposed methodology supports a more standardized process for deriving scenario categories from highway data recordings and thus enables a more efficient validation process of automated vehicles.