arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.31259 2026-06-01 cs.LG

Lightweight CNN-Based Anomaly Detection for High Voltage Converter Modulators in the Spallation Neutron Source

Alberto D. Cencillo, Leonardo Concepción, Julián Luengo, Isaac Triguero

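AI summary For anomaly detection on multi-channel high voltage converter modulator signals, the work varies the order of temporal filtering and cross-channel mixing and introduces adaptive channel reweighting, reaching an AUC-PR of 0.816 and an AUC-ROC of 0.934 on a public dataset and outperforming existing methods.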

Comments 21 pages, 8 figures

Abstract

Unscheduled trips of high-power pulsed converters are a leading source of downtime at large accelerator facilities. At the Spallation Neutron Source (SNS), the High Voltage Converter Modulators (HVCMs) are consistently the second-largest contributor to lost beam time. Each HVCM pulse is recorded across sensor channels spanning currents, voltages, and magnetic fluxes, whose mutual interactions encode the operating state of the system. Fault precursors do not manifest uniformly across these channels: depending on fault type, they may alter the temporal structure of individual signals, change the statistical dependencies among channels, or both. Existing deep-learning approaches typically process multi-channel signals with standard convolutional pipelines that entangle temporal and cross-channel operations from the first layer, giving the model no explicit mechanism to represent channel independence or structured inter-channel interaction. We hypothesise that architectural inductive bias, specifically the ordering of temporal filtering and cross-channel mixing, plays a central role in detection performance on this class of data. To test this, we vary the order in which these two operations are applied, and examine whether per-pulse adaptive channel reweighting further improves sensitivity. Evaluated on the public HVCM dataset across all four SNS subsystems (RFQ, DTL, CCL, SCL), our best variant achieves a pooled AUC-PR of 0.816 and AUC-ROC of 0.934, outperforming the state of the art on most subsystems and five of the six fault families. Ablations identify three dominant input channels and link per-fault-family performance to whether precursors manifest as amplitude shifts in individual channels or as subtler patterns requiring joint channel representations to surface.
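
To make the ordering question concrete, the following is a minimal NumPy sketch, not the authors' architecture. It assumes a pulse stored as a (channels, time) array, a per-channel temporal filter, and a pointwise linear mix across channels; all function names, shapes, and the ReLU between the two operations are illustrative assumptions.

```python
import numpy as np

def temporal_filter(x, kernels):
    """Filter each channel along time with its own kernel (no channel interaction).
    x: (channels, time), kernels: (channels, k)."""
    return np.stack([np.convolve(x[c], kernels[c], mode="same")
                     for c in range(x.shape[0])])

def channel_mix(x, weights):
    """Mix channels with one linear map applied at every time step.
    x: (channels, time), weights: (out_channels, channels)."""
    return weights @ x

def relu(x):
    return np.maximum(x, 0.0)

def time_then_mix(x, kernels, weights):
    # Channels stay independent until after temporal features are extracted.
    return channel_mix(relu(temporal_filter(x, kernels)), weights)

def mix_then_time(x, kernels, weights):
    # Channels are combined first; temporal filters then see mixed signals.
    return temporal_filter(relu(channel_mix(x, weights)), kernels)

rng = np.random.default_rng(0)
pulse = rng.normal(size=(3, 64))      # hypothetical 3-channel pulse
kernels = rng.normal(size=(3, 5))
weights = rng.normal(size=(3, 3))

out_time_first = time_then_mix(pulse, kernels, weights)
out_mix_first = mix_then_time(pulse, kernels, weights)
```

Both orderings use the same parameters and produce outputs of the same shape, yet the results differ, which is why the order can act as an inductive bias.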

2605.31257 2026-06-01 cs.LG stat.ML

Fraud Type Decomposition and the Observation-Mechanism Taxonomy: Class-Specific Detection Limits in Payment Networks

Gaurav Dhama

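AI summary This paper introduces an observation-mechanism taxonomy that partitions fraud into five classes, proves that estimating fraud rates separately by class and then aggregating outperforms pooled estimation, and derives the theoretical constraint on detection for each class.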

Comments 59 pages

Abstract

Fraud detection in payment networks relies on labels generated through heterogeneous and imperfect observation processes, yet existing approaches treat fraud as a homogeneous binary variable. We show that this assumption is structurally incorrect and leads to provable inefficiency. We introduce an observation-mechanism taxonomy that partitions fraud into five classes, each defined by a distinct censorship and labeling pipeline. We prove that estimating fraud rates separately by class and aggregating strictly dominates pooled estimation, with the efficiency gap characterized as a Jensen penalty arising from heterogeneous observation rates. For each class, we derive the binding theoretical constraint on detection, including endogenous label corruption, structural non-observability, and feature non-informativeness. These results establish that fraud detection is fundamentally a collection of distinct estimation problems, each governed by its own observation structure and detection limit.

2605.31256 2026-06-01 cs.RO

Before Parc Fermé: RL-Time Pruning for Efficient Embodied LLMs in Autonomous Driving

Luca Benfenati, Ali Azimi, Matteo Risso, Fabio Carapellese, Daniele Jahier Pagliari, Alessio Burrello

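AI summary Proposes BPF, a strategy that prunes during reinforcement learning, compressing embodied LLM controllers under task-specific supervision and closed-loop feedback and achieving a better performance-memory-throughput trade-off in an autonomous-driving control pipeline.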

Abstract

Embodied Large Language Models (LLMs) are increasingly used as reasoning modules in robotic control pipelines to improve human-robot interaction, but their memory and generation latency make real-time deployment difficult. Pruning can reduce these costs, but for controllers that undergo multiple pre- and post-training phases, the crucial question is not only how much to prune, but when pruning should occur. In this work, we propose Before Parc Fermé (BPF), a pruning strategy performed during RL that compresses embodied LLM controllers while they are still being optimized for closed-loop behavior. This allows pruning decisions to account for the task-specific supervision and closed-loop feedback that shape the final controller. We propose two variants: BPF-RL, which performs iterative pruning during RL by removing part of the model at predefined training intervals, and BPF-SFT/RL, which first prunes part of the model structure during SFT and then further compresses it during RL using the same iterative strategy as BPF-RL until the target pruning ratio is reached. We evaluate BPF on RobotxR1, an LLM-based autonomous-driving control pipeline, using an established LLM pruning framework (LLM-Pruner), and compare it against post-training pruning, post-training pruning with RL recovery, SFT-stage pruning, and smaller dense models from the same family. Our results show that BPF provides the best task-performance vs. memory and throughput trade-off among the considered pruning strategies. When compressing the larger RobotxR1 models, BPF-SFT/RL achieves a 1.69× better size vs. end-to-end performance trade-off than directly selecting a smaller dense model from the same family, measured as removed parameters per lost percentage point of control adaptability. On the Jetson AGX Orin mounted on the target robotic platform, the compact models improve decode throughput by up to 27%.
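
The iterative variant, pruning at predefined training intervals until a target ratio is reached, can be illustrated with a small scheduling sketch. The abstract does not say how the per-interval amount is chosen, so the geometric schedule below is an assumption, as are the function names and the example numbers.

```python
def per_step_ratio(target_ratio, num_steps):
    """Fraction of the remaining parameters to remove at each pruning step so
    that the cumulative pruned fraction equals target_ratio after num_steps."""
    if not 0.0 <= target_ratio < 1.0:
        raise ValueError("target_ratio must be in [0, 1)")
    if num_steps < 1:
        raise ValueError("need at least one pruning step")
    return 1.0 - (1.0 - target_ratio) ** (1.0 / num_steps)

def pruning_schedule(total_rl_steps, interval, target_ratio):
    """Map each pruning step to the cumulative pruned fraction reached there."""
    prune_steps = list(range(interval, total_rl_steps + 1, interval))
    ratio = per_step_ratio(target_ratio, len(prune_steps))
    remaining = 1.0
    schedule = {}
    for step in prune_steps:
        remaining *= 1.0 - ratio      # remove a fixed share of what is left
        schedule[step] = 1.0 - remaining
    return schedule

# Hypothetical run: 1000 RL steps, prune every 250 steps, 30% target.
schedule = pruning_schedule(total_rl_steps=1000, interval=250, target_ratio=0.3)
```

In a real pipeline each scheduled step would call the pruning framework and then resume RL so the controller can recover before the next cut.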

2605.31254 2026-06-01 cs.AI

Formalizing and falsifying causal pathways of rare events

Anahita Haghighat, Dominik Janzing

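AI summary Building on formalizations of root cause analysis for rare events in structural equation models, this paper proposes a formal definition of a causal pathway, discusses its testable implications, and introduces an abstraction to pathways of rare events that bridges simple causal explanations and detailed causal modeling.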

Comments accepted for ICML 2026

Abstract

Building on recent formalizations of root cause analysis for rare events ("outliers") in structural equation models, we propose a formal definition of a causal pathway and discuss its testable implications. We identify conditions under which these implications depend only on a causal abstraction defined by the pathway of rare events, rather than on the full causal graph of the underlying system. Accordingly, we introduce an abstraction of causal structure to pathways of rare events that bridges simple verbal causal explanations and detailed causal modeling.

2605.31251 2026-06-01 cs.CV cs.AI

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou, Kaoyan Lu, Yu Feng, Yifan Zhu, Haoran Luo

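AI summary Proposes the ERGeoBench benchmark, which evaluates multimodal large language models on vision-driven embodied geo-localization under three progressive settings (single-view, panorama-view, and embodied-view), and finds that current models handle high-level geographic semantic reasoning well but still fall short on fine-grained perception, metric localization, and spatial consistency across views.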

Abstract

Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: https://kaixuewen.github.io/ERGeoBench/

2605.31249 2026-06-01 cs.LG cs.AI

Learning Cardiac Latent Representations in Vectorcardiogram Space

Bosong Huang, Panzhen Zhao, Zengxiang Li, Patricia Lee, Wei Jin, Alan Wee-Chung Liew, Ming Jin, Shirui Pan

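AI summary To address redundancy and overfitting in representation learning on the standard twelve-lead ECG, proposes LVCG, a framework based on the Frank vectorcardiogram model that learns view-invariant representations of cardiac electrical activity in a physically grounded latent space, improving robustness and generalization.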

Abstract

Electrocardiography (ECG) is a cornerstone of cardiac assessment, making the learning of informative ECG representations fundamental to tasks ranging from disease diagnosis to clinical report generation. However, existing methods operate almost exclusively in the observable ECG signal space. In practice, the standard twelve-lead ECG represents multiple projections of the same underlying cardiac electrical activity from different spatial orientations. Therefore, representation learning in the ECG space inevitably introduces substantial redundancy, which may lead to spurious correlations and increased risk of overfitting. To address this, and motivated by the Frank vectorcardiogram (VCG) model, we propose learning a unified latent representation of cardiac electrical activity directly in the VCG space. We introduce LVCG, the first general self-supervised representation learning framework designed to operate in this physically grounded latent space. By learning view-invariant latent VCG representations rather than lead-specific artifacts, LVCG minimizes redundancy and improves generalization. LVCG generally outperforms ECG-space baselines across tasks, demonstrating enhanced robustness and generalization, especially in domain shift settings.
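
The redundancy argument can be illustrated under an idealized assumption: if each of the twelve leads were an exact linear projection of one three-dimensional heart vector, the lead signals would span at most three dimensions. The sketch below uses random data and a random placeholder projection matrix, not real lead coefficients or the LVCG model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-D cardiac vector sampled at 500 time points.
vcg = rng.normal(size=(3, 500))

# Placeholder projection matrix (12 leads x 3 axes). These are NOT real lead
# coefficients; they only stand in for "twelve different viewing directions".
lead_matrix = rng.normal(size=(12, 3))

# Each lead is a projection of the same underlying 3-D signal.
ecg = lead_matrix @ vcg

# Twelve signals, but only three independent ones.
rank = np.linalg.matrix_rank(ecg)

# Under this idealized model the 3-D signal is recoverable by least squares.
vcg_hat = np.linalg.pinv(lead_matrix) @ ecg
```

Real recordings only approximate this model, which is presumably why the representation is learned rather than obtained from a fixed linear map.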

2605.31245 2026-06-01 cs.LG

Toward Identifiable Sparse Autoencoders

Walter Nelson, Theofanis Karaletsos, Francesco Locatello

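AI summary To address the training instability of sparse autoencoders, the authors theoretically analyze the responsible model properties and modify the architecture and training procedure, yielding the iSAE variant with lower reconstruction error and improved stability.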

Comments International Conference on Machine Learning (ICML) 2026

Abstract

Recently, sparse autoencoders (SAEs) have emerged as an attractive tool for interpreting and interacting with representations in practical neural networks. While it is common empirical folklore, we also show theoretically that SAEs are highly unstable: different training runs are likely to produce different concept dictionaries and sparse codes. We characterize the model properties that hinder the stability of real-world SAEs, and address each of these problems through minimal changes to the architecture and training procedure. Together, these changes yield two versions of an identifiable SAE (iSAE), a variant of the standard TopK SAE with lower reconstruction error and improved stability. We explain this improvement theoretically by connecting SAEs with traditional dictionary learning approaches, and show that the dictionaries learned in practice satisfy an approximate restricted isometry condition, rendering the corresponding sparse codes in those models near-identifiable.
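
For readers unfamiliar with the baseline, here is a minimal sketch of a TopK SAE forward pass in NumPy. It shows the standard baseline only, not the iSAE changes, which the abstract does not detail. The bias handling and the non-negativity clamp are common conventions assumed here, and all weights are random placeholders.

```python
import numpy as np

def topk_sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode x, keep the k largest latents per example, and decode."""
    pre = (x - b_dec) @ w_enc + b_enc                   # (batch, n_latents)
    top = np.argpartition(pre, -k, axis=1)[:, -k:]      # k largest per row
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    codes[rows, top] = np.maximum(pre[rows, top], 0.0)  # sparse, non-negative code
    recon = codes @ w_dec + b_dec                       # rows of w_dec form the dictionary
    return codes, recon

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))            # hypothetical activations
w_enc = rng.normal(size=(16, 64))
w_dec = rng.normal(size=(64, 16))
codes, recon = topk_sae_forward(x, w_enc, np.zeros(64), w_dec, np.zeros(16), k=8)
```

The instability discussed in the abstract concerns `w_dec` and `codes`: two training runs can reach similar reconstruction error with different dictionaries and different sparse codes.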

2605.31244 2026-06-01 cs.LG physics.comp-ph

Spectral Reach: Understanding Neural Scaling as Progress into the Spectral Tail

Konstantin Nikolaou, Jonas Scheunemann, Sven Krippendorf, Samuel Tovey, Christian Holm

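AI summary Proposes the "spectral position" measure, which analyzes neural scaling laws through the eigenvalues of the empirical neural tangent kernel, finds that larger models lower their loss by reaching further into the spectral tail ("spectral reach"), and identifies feature learning as the key mechanism.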

Abstract

Neural scaling laws describe predictable power-law relationships between model size, dataset size, compute, and performance. While these laws guide the development of modern foundation models, the mechanisms underpinning them remain poorly understood, in part due to the absence of scalable analysis tools. To close this gap, we introduce "spectral position": a scalable measure of which eigenvalues of the empirical neural tangent kernel (eNTK) currently drive loss reduction. Applying this measure to scaling experiments, we find that spectral position decreases throughout training: learning shifts from dominant eigenmodes into the spectral tail. Larger models reach further into the tail than smaller models, revealing a size-dependent capacity we call "spectral reach". This suggests why larger models achieve lower losses: they sustain learning on weak spectral signals inaccessible to smaller models. We further identify feature learning as a key enabler of spectral reach. It adaptively amplifies gradient magnitudes as learning advances, sustaining progress where frozen representations stall. This points to concrete interventions through architecture and optimizer design.

2605.31241 2026-06-01 cs.LG

Bifurcated Remaining Useful Life Prediction: A Hybrid Approach for Realistic Uncertainty Characterization

Xabier Belaunzaran, Antonio Nappa, Arkaitz Artetxe, Basilio Sierra

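AI summary Proposes a hybrid prognostic framework that splits turbofan engine life into healthy and degraded regimes and combines an LSTM autoencoder, conditional Weibull survival analysis, and a probabilistic neural network for uncertainty-aware remaining useful life prediction.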

Comments Submitted to 9th European Conference of the Prognostics and Health Management Society 2026

Abstract

This study presents a novel hybrid prognostic framework for uncertainty-aware Remaining Useful Life (RUL) estimation in turbofan engines using the NASA C-MAPSS dataset. The framework employs a state-aware strategy that bifurcates the engine's operational lifespan into "healthy" and "degraded" regimes. An LSTM-based autoencoder, trained strictly on nominal data (RUL > 150 cycles), monitors reconstruction error to act as a robust state classifier. For the healthy regime, a Conditional Weibull Survival Analysis is used for Mean Residual Life estimation. For the degraded regime, a Probabilistic Neural Network with Monte Carlo Dropout captures both aleatoric and epistemic uncertainties. Rather than using rigid binary labels, a calibrated sigmoid function converts the autoencoder's output into continuous state probabilities, dynamically weighting the final ensemble prediction. The primary strength of this framework is its generation of physically consistent uncertainty bands, yielding high-confidence predictions near end-of-life while accurately reflecting the inherent variance of early operation, providing a robust tool for risk-informed maintenance.
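
The soft weighting step can be sketched in a few lines. This is a minimal illustration that assumes the sigmoid is applied to the reconstruction error relative to a threshold; the threshold, the steepness, and the two regime predictions are hypothetical values, not figures from the paper.

```python
import math

def degraded_probability(reconstruction_error, threshold, steepness):
    """Map autoencoder reconstruction error to a probability of the degraded regime."""
    return 1.0 / (1.0 + math.exp(-steepness * (reconstruction_error - threshold)))

def blended_rul(reconstruction_error, rul_healthy, rul_degraded,
                threshold=1.0, steepness=5.0):
    """Weight the two regime-specific RUL estimates by the state probability."""
    p = degraded_probability(reconstruction_error, threshold, steepness)
    return (1.0 - p) * rul_healthy + p * rul_degraded

# Hypothetical estimates: 160 cycles from the healthy-regime model and
# 40 cycles from the degraded-regime model.
early_life = blended_rul(0.0, rul_healthy=160.0, rul_degraded=40.0)
at_threshold = blended_rul(1.0, rul_healthy=160.0, rul_degraded=40.0)
late_life = blended_rul(2.0, rul_healthy=160.0, rul_degraded=40.0)
```

Compared with a hard switch, the blend avoids a jump in the predicted RUL when the error crosses the threshold.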

2605.31238 2026-06-01 cs.CL cs.LG

Scaling Multi-Hop Training Data via Graph-Constrained Path Selection

Pengyu Chen, Yonggang Zhang, Mingming Chen, Jun Song, Wei Xue, Yike Guo

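AI summary For compositional reasoning over specialized documents, proposes graph-constrained path selection, which decouples path discovery from verbalization and uses graph constraints to filter out invalid paths, substantially expanding the usable corpus and improving model performance.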

Comments 21 pages, 5 figures

Abstract

Endowing large language models with compositional reasoning over specialized documents requires multi-hop training data at scale, yet such data rarely exists outside of curated benchmarks built on structured sources. To construct it directly from plain, unannotated text, existing methods ask a single teacher model to jointly discover an evidence path through a document and verbalize it as a question-answer pair. However, these methods degrade sharply when documents are structured around repetitive templates and densely cross-referencing clauses, conditions that characterize most real-world specialized corpora. In this work, we decouple the two operations: reasoning paths are enumerated offline over a graph of contextual keyword centroids, and the teacher is invoked only to verbalize pre-validated paths. The graph enforces five geometric admissibility constraints, for which we provide Gram-matrix arguments establishing that local similarity bounds alone admit endpoint drift up to approximately 91°, and that an upper similarity bound is necessary to exit dense embedding cliques formed by boilerplate text. A matched-size ablation isolates the mechanism: at equal training scale, constrained and unconstrained chains yield indistinguishable downstream performance, and the gain at full scale comes from a 4.4× expansion of the usable corpus rather than from higher per-chain quality. This reframes the role of graph constraints, in this setting, as raising teacher synthesizability rather than improving chain content. Fine-tuning Qwen3-32B on 80K examples constructed from the CUAD legal contract corpus improves closed-book Token F1 from 21.66% to 38.58%. We have released our code at https://github.com/hkgai-official/GCSCS.
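
Offline path enumeration under similarity bounds can be sketched as follows. The paper's five constraints are not listed in the abstract, so this illustration implements only a two-sided cosine-similarity bound on edges; the function names, thresholds, and toy centroids are assumptions.

```python
import numpy as np

def admissible_edges(centroids, lower, upper):
    """Edge (i, j) is admissible when lower <= cos(i, j) <= upper."""
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = unit @ unit.T
    adj = (sim >= lower) & (sim <= upper)
    np.fill_diagonal(adj, False)
    return adj

def enumerate_paths(adj, hops):
    """List all simple paths with exactly `hops` edges."""
    paths = []

    def extend(path):
        if len(path) == hops + 1:
            paths.append(tuple(path))
            return
        for nxt in np.flatnonzero(adj[path[-1]]):
            if nxt not in path:
                extend(path + [int(nxt)])

    for start in range(adj.shape[0]):
        extend([start])
    return paths

# Toy centroids on the unit circle at 0, 40, 80 and 1 degrees. Node 3 is a
# near-duplicate of node 0, standing in for boilerplate text.
angles = np.deg2rad([0.0, 40.0, 80.0, 1.0])
centroids = np.stack([np.cos(angles), np.sin(angles)], axis=1)
adj = admissible_edges(centroids, lower=np.cos(np.deg2rad(50.0)), upper=0.99)
paths = enumerate_paths(adj, hops=2)
```

In this toy case the upper bound rejects the near-duplicate pair (0, 3), while the path 0, 1, 2 is admitted even though its endpoints are 80 degrees apart: bounding each hop does not bound the drift between endpoints.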

2605.31234 2026-06-01 cs.RO

HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model

Xiang Zhu, Puzhen Yuan, Yichen Liu, Jianyu Chen

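AI summary Proposes the HARP framework, which uses limited paired human-robot demonstrations and unpaired videos to learn aligned human-robot visual and latent action representations, improving VLA pretraining and yielding gains on CALVIN and real-world tasks.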

Abstract

Learning generalizable vision-language-action (VLA) models from large-scale human videos is promising but challenging due to cross-embodiment discrepancies in both visual observations and executable actions. While latent action models reduce the action execution gap by learning action abstractions, they still rely on visual features. Thus, misaligned human and robot visual representations can lead to inconsistencies in policy inputs and induce domain-dependent latent actions, hindering effective co-training with human videos. To address this, we propose HARP, a human-robot aligned representation learning framework for more effective VLA pretraining from human videos. Specifically, HARP uses limited paired human-robot demonstrations as cross-embodiment bridges and abundant unpaired human and robot videos as a scalable dynamics supervision data source. It trains a robot-adapted visual encoder and a latent action model with manipulation-centric auxiliary cues and a source-relative pair-discriminative alignment loss, which adapts robot representations toward human semantics while preserving pair-level discrimination. The learned aligned vision encoder and latent action model provide a unified vision and action representation for VLA-style policy learning, where human and robot videos provide vision-language-to-latent-action supervision and a lightweight robot action head grounds latent actions into executable commands. Experiments on feature visualization, simulation, and real-world manipulation show improved human-robot alignment and downstream policy performance, achieving 4.481 average length on CALVIN ABC→D and a 7.1% real-world success rate gain over the strongest baseline.

2605.31229 2026-06-01 cs.CV cs.AI

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

Alicja Dobrzeniecka, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik, Bartlomiej Twardowski

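AI summary For continual multimodal retrieval (CMR), proposes Dynamic Adapter Routing (DAR), based on prototype-based routing and model merging, which outperforms existing baselines in cross-domain evaluation.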

Abstract

While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-incremental learning (CIL), evaluating both standard CIL methods and retrieval-oriented adaptations in settings that may not fully capture the retrieval-specific dynamics. To address this, we introduce a new, principled evaluation framework for continual multimodal retrieval (CMR) spanning diverse visual domains, and systematically evaluate common approaches within this setting. Our empirical analysis shows that standard CIL methods fail to yield meaningful gains in our more challenging scenario. Therefore, we propose Dynamic Adapter Routing (DAR), a novel approach based on adapters selected through prototype-based routing and combined via model merging. DAR achieves superior performance over the previous baselines and demonstrates strong generalization under out-of-distribution evaluation. Our results highlight the unique challenges of CMR and encourage further research in this direction.
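
Prototype-based routing followed by merging can be pictured with a short sketch. The abstract does not give the routing rule or the merging operator, so the version below assumes cosine similarity to per-domain prototypes, a softmax over similarities, and a weighted average of adapter weights; every name and value is hypothetical.

```python
import numpy as np

def route_and_merge(query, prototypes, adapters, temperature=0.1):
    """Weight adapters by query-prototype similarity and average their weights.
    query: (d,), prototypes: (n_domains, d), adapters: (n_domains, d_out, d_in)."""
    q = query / np.linalg.norm(query)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = (p @ q) / temperature
    weights = np.exp(logits - logits.max())     # numerically stable softmax
    weights /= weights.sum()
    merged = np.tensordot(weights, adapters, axes=1)
    return weights, merged

prototypes = np.eye(3)                          # one prototype per seen domain
adapters = np.stack([np.full((2, 2), v) for v in (1.0, 2.0, 3.0)])
query = np.array([0.9, 0.1, 0.0])               # embedding closest to domain 0

weights, merged = route_and_merge(query, prototypes, adapters)
```

A low temperature makes routing close to selecting a single adapter, while a high temperature approaches a uniform merge of all adapters.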

2605.31228 2026-06-01 cs.LG cs.AI

EchoRL: Reinforcement Learning via Rollout Echoing

Jinhe Bi, Aniri, Minglai Yang, Xingcheng Zhou, Wenke Huang, Sikuan Yan, Yujun Wang, Zixuan Cao, Michael Färber, Xun Xiao, Volker Tresp, Yunpu Ma

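AI summary To address advantage degeneration in RLVR training, proposes the EchoRL module, which extracts an EchoClip from successful rollouts as an auxiliary supervision signal and consistently improves training performance.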

Comments ICML 2026

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is an effective post-training route for strengthening the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse, making further training gains marginal. Specifically, a growing fraction of prompts' rollouts become advantage-degenerated: all the self-generated rollouts are verified as successful, so the standard deviation of their rewards is zero and each rollout's advantage degenerates to zero as well. With such advantages, the policy gradient used for model optimization eventually vanishes, capping training performance. We argue that some of these rollouts still contain valuable learning signals that are unfortunately ignored by existing RLVR methods. In this paper, inspired by an analysis of the entropy pattern behind golden trajectories produced by external expert models, we propose EchoRL to better exploit advantage-degenerated rollouts and further improve training performance. EchoRL is a lightweight module that first identifies an EchoClip from verified-success rollouts based on their step-level entropy values, and then feeds this clip back as an auxiliary supervision signal in the RL objective. Extensive experiments across 10 benchmarks, 5 LLM backbones, and 4 popular RLVR post-training methods demonstrate that EchoRL consistently improves RLVR post-training with minimal overhead.
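
Advantage degeneration is easy to reproduce numerically. The sketch below assumes a group-normalized advantage (reward minus group mean, divided by group standard deviation), which is one common formulation in RLVR methods; the abstract does not state the exact formula, and the small epsilon is an implementation assumption.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Mixed outcomes: successes and failures receive opposite-signed advantages.
mixed = group_advantages([1, 0, 1, 0])

# Every rollout verified as successful: all advantages are zero, so this
# prompt contributes nothing to the policy gradient.
all_success = group_advantages([1, 1, 1, 1])
```

This is the situation EchoRL targets: the all-success group carries no gradient signal under the standard objective, even though its rollouts may still be informative.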

2605.31227 2026-06-01 cs.CV

HiERO-StepG @ Ego4D Step Grounding Challenge: hierarchical activity understanding enables zero-shot step grounding

Andrea Zenotto, Simone Alberto Peirone, Francesca Pistilli, Giuseppe Averta

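AI summary Proposes HiERO-StepG, which uses weakly supervised hierarchical representation learning and clustering to perform zero-shot step grounding without task-specific fine-tuning, reaching 56.27% R@1 (IoU = 0.3) in the Ego4D challenge.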

Comments Technical report for the Ego4D Goal Step - Step Grounding challenge at CVPR 2026, derived from arXiv:2505.12911

Abstract

Procedural activities follow well-defined structures: whether we consider a cooking recipe or a mechanic repairing a car, these activities naturally decompose into a hierarchy of steps and sub-steps. Traditional approaches for step grounding require extensive annotations and scale poorly. Instead, we argue that such hierarchical structure can emerge naturally from uncurated videos of human activities through recurring patterns of co-occurring actions and activities. Our approach builds on HiERO, a weakly-supervised representation learning approach that maps functionally related actions close together in the feature space, leveraging only fine-grained action-level narrations. In this feature space, procedure steps can be detected by simple clustering, with no additional task-specific fine-tuning. For the Ego4D Step Grounding challenge, we augment this approach by ensuring fine- and coarse-level agreement in step assignments, enforcing strict temporal monotonicity of the grounded steps, and post-processing the detected steps to reduce the impact of noisy predictions. We call this approach HiERO-StepG. It achieves 56.27% on the R@1 (IoU = 0.3) metric on the global leaderboard at submission time, ranking second while being completely zero-shot and not requiring procedure-specific annotations. Project page: https://github.com/andreazenotto/HiERO-StepG.

2605.31226 2026-06-01 cs.LG cs.AI

What changes after deployment? A survey on On-device Learning in TinyML

Massimo Pavan, Luca Pezzarossa, Fabrizio Pittorino, Manuel Roveri, Xenofon Fafoutis

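AI summary Focusing on machine learning models on microcontroller-class devices, this paper surveys approximately 70 on-device learning (ODL) works, analyzes by type of distribution change the effects on applications, hardware, and solutions, and identifies the gap between methodological benchmarks and real-world deployment.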

Abstract

Machine learning models on microcontroller-class devices (TinyML) face a fundamental challenge: post-deployment distribution change undermines static models. On-device learning (ODL) addresses this by running the learning process directly on the device. The existing literature has not characterized how distribution change occurs or how different change types require different solutions. Approximately 70 ODL works are surveyed under one principle: the distribution change regime. The survey analyzes how different types of distribution change influence the applications addressable on-device, the hardware employed, and the structure of the solutions. A persistent gap between methodological benchmarks and real-world deployment scenarios is also identified.

2605.31222 2026-06-01 cs.LG

Multivariate Distributional Reinforcement Learning Using Sliced Divergences

Baptiste Debes, Tinne Tuytelaars

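AI summary Proposes SDRL, which extends one-dimensional divergences to multivariate return distributions via projections, proves Bellman contraction under scalar discounting and under general matrix discounting, supports several divergences, and is applicable to the standard single-sample Bellman update.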

Abstract

Distributional reinforcement learning (DRL) models the full return distribution rather than expectations, but extending it to multivariate settings remains challenging. Many common metrics do not naturally generalize beyond one dimension or lose computational tractability, and the multivariate case introduces additional difficulties such as general matrix discounting, for which no contraction results are available. We introduce Sliced Distributional Reinforcement Learning (SDRL), which lifts tractable one-dimensional divergences to multivariate return distributions via projections. We prove Bellman contraction for uniform slicing under shared scalar discounting, and introduce a maximum-slicing variant with contraction under general dense discount matrices. SDRL supports a broad class of base divergences; we analyze Wasserstein, Cramér, and Maximum Mean Discrepancy (MMD), and characterize which SDRL variants suit the standard single-sample Bellman update used in distributional RL. We evaluate SDRL on a toy chain problem and a gridworld image-based environment as well as a subset of Atari games.
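
The slicing idea, lifting a one-dimensional divergence to multivariate samples through projections, can be sketched with the 1-Wasserstein distance as the base divergence. This is an illustrative Monte Carlo estimate on equal-size samples, not the SDRL training objective; the number of projections and the toy data are assumptions.

```python
import numpy as np

def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size 1-D empirical samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def sliced_divergences(x, y, num_projections=64, seed=0):
    """Return (uniform-sliced, max-sliced) estimates between samples x and y.
    x, y: (n_samples, dim) arrays of multivariate returns."""
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(num_projections, x.shape[1]))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    per_slice = np.array([wasserstein_1d(x @ d, y @ d) for d in directions])
    return per_slice.mean(), per_slice.max()

rng = np.random.default_rng(1)
returns_a = rng.normal(size=(200, 3))                # hypothetical 3-D returns
returns_b = returns_a + np.array([2.0, 0.0, 0.0])    # same shape, shifted by 2

uniform, maximum = sliced_divergences(returns_a, returns_b)
same_uniform, same_maximum = sliced_divergences(returns_a, returns_a)
```

Uniform slicing averages the one-dimensional divergence over directions, while the maximum-slicing variant keeps the worst direction. In this shifted-sample example neither can exceed the shift length of 2.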

2605.31220 2026-06-01 cs.CL cs.AI cs.LG

Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models

共享疑虑:语言模型的零样本跨语言置信度估计

Athina Kyriakou, Dennis Ulmer, Ivan Titov

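AI summary Investigates whether multilingual large language models encode shared, cross-lingually transferable confidence features; a lightweight linear probe predicts answer correctness directly from intermediate representations, generalizes zero-shot across languages, and shows that confidence features concentrate in the middle layers.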

详情
AI中文摘要

置信度估计(CE),即量化模型预测的可靠性,在大语言模型(LLM)背景下引起了极大兴趣。然而,大多数研究集中在英语上,忽视了LLM使用的多语言现实,而许多CE方法会退化或需要跨语言重新训练。为了解决这一差距,我们研究了多语言LLM是否编码共享的、可跨语言迁移的置信度特征。我们使用一个轻量级线性探针,直接从中间表示预测答案正确性。经过单语言训练后,该探针在零样本情况下泛化到未见过的、类型多样的语言,无需目标语言监督。学习到的层权重和多次消融实验表明,置信度特征集中在各语言的中间层,表明存在共享的置信度子空间。虽然零样本跨语言性能取决于与源语言的相似性,但该探针无需任何重新训练即可提供强基线,并且与其他流行的置信度估计方法相比具有优势。

英文摘要

Confidence estimation (CE), i.e. quantifying the reliability of a model's prediction, has attracted great interest in the context of large language models (LLMs). However, most studies focus on English, ignoring the multilingual reality of LLM usage, while many CE methods degrade or require retraining across languages. To address this gap, we investigate whether multilingual LLMs encode shared, language-transferable confidence features. We use a lightweight linear probe that predicts answer correctness directly from intermediate representations. Trained monolingually, the probe generalizes zero-shot to unseen, typologically diverse languages without target-language supervision. Learned layer weights and multiple ablations reveal that confidence features concentrate in middle layers across languages, suggesting a shared confidence subspace. While zero-shot cross-lingual performance depends on similarity to the source language, the probe provides a strong baseline without any retraining and compares favorably to other popular confidence estimation methods.

2605.31217 2026-06-01 cs.CV

TALON: Token-Aligned Lightweight Adapters for 6-DoF Spacecraft Pose Estimation

TALON: 用于六自由度航天器姿态估计的令牌对齐轻量适配器

Abid Ali, Arunkumar Rathinam, Djamila Aouada

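AI summary Proposes TALON, which injects spatiotemporal 3D adapters before the attention layers of a frozen ViT and combines them with a token-alignment loss for lightweight 6-DoF spacecraft pose estimation, significantly reducing pose error on the SPADES and SwissCube datasets.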

Comments 13 pages paper with 3 figures in total

详情
AI中文摘要

单目六自由度航天器姿态估计方法主要处理单帧图像,忽略了航天器机动过程中获取的图像序列中的时间信息。少数时间方法需要完全骨干微调或辅助光流网络,分别存在灾难性遗忘或增加计算成本的风险。我们提出TALON(轨道导航的令牌对齐轻量适配器):在冻结的ViT视觉变换器的自注意力层之前注入时空3D适配器,结合补丁-令牌对齐损失,通过原型条件KL散度目标将适配特征几何地锚定到关键点结构。注意力前放置允许冻结注意力对时间增强的令牌进行推理,每个块使用单个适配器即可获得比注意力后替代方案更强的性能。对齐损失塑造中间表示,使得每个关键点在令牌场中引发空间精确的激活,而该框架向冻结骨干添加的参数少于5%。在SPADES数据集上,TALON将姿态误差比先前最先进方法降低50%;在SwissCube数据集上,其在ADD-0.1d准确率上超越先前最佳方法21.8%。在SPARK真实数据上的从仿真到真实的零样本跨域评估将姿态误差降低4.7倍,消融实验表征了适配器深度在域内和跨域设置中的作用。

英文摘要

Monocular 6-DoF spacecraft pose estimation methods predominantly process individual frames, discarding the temporal information present in image sequences acquired during spacecraft manoeuvres. The few temporal approaches require either full backbone fine-tuning or auxiliary optical flow networks, risking catastrophic forgetting or increasing computational cost, respectively. We propose TALON (Token-Aligned Lightweight adapters for Orbital Navigation): spatiotemporal 3D adapters injected before the self-attention layers of a frozen vision transformer (ViT), combined with a patch-token alignment loss that geometrically grounds the adapted features to keypoint structure through a prototype-conditioned KL-divergence objective. Pre-attention placement allows the frozen attention to reason over temporally enriched tokens, achieving stronger performance with a single adapter per block than post-attention alternatives. The alignment loss shapes the intermediate representations so that each keypoint induces a spatially precise activation in the token field, while the framework adds less than 5% parameters to the frozen backbone. On the SPADES dataset, TALON reduces the pose error by 50% over the prior state of the art, and on the SwissCube dataset it surpasses the prior best by 21.8% in ADD-0.1d accuracy. Zero-shot sim-to-real cross-domain evaluation on SPARK real data reduces pose error by 4.7x, and ablations characterise the role of adapter depth across in-domain and cross-domain settings.

2605.31215 2026-06-01 cs.LG cs.CV

Fixed-Point Masked Generative Modeling

不动点掩码生成建模

Andrea Miele, Yiming Qin, Alba Carballo-Castro, Justin Deschenaux, Pascal Frossard

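AI summary Proposes Fixed-Point Masked Generative Models (FP-MGMs), which obtain adaptive depth through a fixed-point solver over shared attention layers, and introduces a cross-step consistency loss and a three-state reuse (3SR) strategy, improving low-budget masked generation quality while reducing parameters and training cost.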

详情
AI中文摘要

掩码生成模型(MGM)支持并行解码并在多种模态上取得强性能,但每一步都需要全序列双向变换器,导致训练成本高且在低采样预算下质量下降。现有工作通过更好的采样器或更便宜的固定深度去噪器提升效率,但仍为每个精炼步骤分配固定量的去噪器计算。我们提出不动点掩码生成模型(FP-MGM),用共享注意力层上的不动点求解器替换部分去噪器,实现自适应深度且参数更少。为使其更有效地用于掩码生成,我们首先引入跨步一致性损失,对齐相邻去噪步骤的隐藏表示;其次,三态重用(3SR)通过分别处理未改变、仍掩码和新揭示的令牌,利用先前解热启动求解器。这些组件共同定义了我们的不动点掩码生成的完整训练到推理框架CoFRe。我们还表明,预训练的MGM可以通过短微调转换为FP-MGM,避免完全重新训练。跨模态,CoFRe改善了质量与成本的权衡。在OpenWebText上,与MDLM相比,CoFRe参数减少38.8%,训练时间减少11.5%,VRAM减少16.9%,同时在96个变换器块前向传播的预算下,生成困惑度从830.8提升到101.8。在ImageNette上,CoFRe训练时间减少48.6%,VRAM减少50.7%,并在所有测试的样本预算下改善FID。总体而言,CoFRe为更便宜的训练和更强的低预算掩码生成提供了一个实用框架。

英文摘要

Masked Generative Models (MGMs) enable parallel decoding and achieve strong performance across modalities, but require full-sequence bidirectional transformers at every step, making training costly and degrading quality under low sampling budgets. Existing work improves efficiency via better samplers or cheaper fixed-depth denoisers, but still allocates a fixed amount of denoiser computation to each refinement step. We introduce Fixed-Point Masked Generative Models (FP-MGMs), which replace part of the denoiser with a fixed-point solver over shared attention layers to enable adaptive depth with fewer parameters. To make this more effective for masked generation, we introduce two components: a cross-step consistency loss, which aligns hidden representations at neighboring denoising steps, and three-state reuse (3SR), which warm-starts the solver from the previous solution by treating unchanged, still-masked, and newly revealed tokens differently. Together, these components define our complete training-to-inference framework for fixed-point masked generation, CoFRe. We also show that pre-trained MGMs can be converted into FP-MGMs with short fine-tuning, avoiding full retraining. Across modalities, CoFRe improves the trade-off between quality and cost. On OpenWebText, compared to MDLM, CoFRe reduces parameters by 38.8%, training time by 11.5%, and VRAM by 16.9%, while improving generative perplexity from 830.8 to 101.8 at a budget of 96 transformer-block forward passes. On ImageNette, CoFRe reduces training time by 48.6% and VRAM by 50.7%, while improving FID at all sample budgets tested. Overall, CoFRe offers a practical framework for cheaper training and stronger low-budget masked generation.
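As a rough illustration of why a fixed-point solver gives adaptive depth and why warm-starting helps, the sketch below iterates a toy contractive map until convergence. It does not reproduce FP-MGM or 3SR: the "layer" is a hypothetical stand-in (`toy_layer`), and the warm start simply reuses the whole previous solution rather than treating token states separately.

```python
def fixed_point_solve(layer, x, z_init, tol=1e-6, max_iters=100):
    """Iterate z <- layer(z, x) until the largest update is below tol.

    Returns the approximate fixed point and the number of layer applications
    used, which plays the role of an input-dependent "depth".
    """
    z = list(z_init)
    for step in range(1, max_iters + 1):
        z_next = layer(z, x)
        residual = max(abs(a - b) for a, b in zip(z_next, z))
        z = z_next
        if residual < tol:
            return z, step
    return z, max_iters


def toy_layer(z, x):
    """Hypothetical contractive shared layer: z -> 0.5 * z + x (element-wise)."""
    return [0.5 * z_i + x_i for z_i, x_i in zip(z, x)]


# First refinement step: solve from a zero initial state.
x_prev = [1.0, -2.0, 0.5]
z_prev, iters_prev = fixed_point_solve(toy_layer, x_prev, [0.0] * 3)

# Next step: the input changes only slightly, so the previous solution is a
# good initial guess and the solver needs fewer layer applications.
x_next = [1.0, -2.0, 0.6]
_, iters_cold = fixed_point_solve(toy_layer, x_next, [0.0] * 3)
z_warm, iters_warm = fixed_point_solve(toy_layer, x_next, z_prev)
```

In this toy setting the fixed point is `2 * x` element-wise, and the warm-started solve converges in fewer iterations than the cold one because its starting error is smaller.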

2605.31212 2026-06-01 cs.CV cs.AI cs.CL

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

基准测试与增强文本到图像模型以生成早期算术教育中的视觉表示

Junling Wang, Boqi Chen, Heejin Do, Mubashara Akhtar, April Yi Wang, Mrinmaya Sachan

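AI summary For equation-to-visual generation in early arithmetic education, builds the E2V-Bench benchmark and evaluates existing T2I models, finds severe errors in counting and relational structure, and explores benchmark-guided enhancement strategies.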

详情
AI中文摘要

AI系统越来越多地用于支持教育内容创作,但尚不清楚它们能否生成忠实代表其旨在教授的教学概念的输出。因此,我们引入了方程到视觉生成任务,与传统的图像生成不同,该任务要求从算术方程中生成具有教学意义的视觉内容,同时精确保留其数值和关系结构。根据对教师的访谈和教育材料的分析,我们构建了E2V-Bench基准,涵盖四种基于教学法的视觉类型,以及用于评估视觉正确性的自动指标。我们的评估显示,最近的文本到图像(T2I)模型在此任务上频繁失败,错误主要表现为对象计数不正确和关系结构破坏。在此基础上,我们探索了基准引导的增强策略。这些策略改进了代表性模型,但剩余的差距要求未来的T2I模型具备更强的数值和关系基础。

英文摘要

AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that faithfully represent the pedagogical concepts they are intended to teach. Thus, we introduce equation-to-visual generation, a task that, in contrast to conventional image generation, requires producing pedagogically meaningful visuals from arithmetic equations while precisely preserving their numerical and relational structure. Informed by interviews with teachers and an analysis of educational materials, we construct E2V-Bench, a benchmark spanning four pedagogically grounded visual types, along with automatic metrics for evaluating visual correctness. Our evaluation reveals that recent text-to-image (T2I) models frequently fail on this task, with errors dominated by incorrect object counts and broken relational structure. Building on this, we explore benchmark-guided enhancement strategies. These strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.

2605.31210 2026-06-01 cs.RO cs.AI

Simulation of collision avoidance behavior in crowd movement by data-driven approach

基于数据驱动方法的群体运动碰撞规避行为模拟

Xuanwen Liang, Eric Wai Ming Lee

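AI summary To address high collision rates in data-driven crowd simulation, proposes a collision-penalized generative adversarial network (CPGAN) that uses a lateral-acceleration-based collision loss and a Voronoi-based feature extraction method to reduce opposite-direction collision rates in bidirectional flow.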

详情
AI中文摘要

人群运动模拟对于行人安全管理和设施布局优化至关重要。数据驱动模型提高了欧几里得度量下的轨迹预测精度,但存在碰撞率过高的问题,尤其是在双向和多向流中。本文建立了一种新颖的数据驱动人群模拟模型,将行人碰撞机制纳入损失函数以减少碰撞。提出了基于侧向加速度的碰撞损失函数和基于Voronoi的运动特征提取方法。该模型基于生成对抗网络(GAN)架构,称为CPGAN(碰撞惩罚GAN)。我们在涉及频繁碰撞规避行为的双向流场景中评估了CPGAN。结果表明,所提出的基于侧向加速度的碰撞损失显著降低了相反方向行人的碰撞率,达到与受控实验相当的水平。CPGAN有效模拟了双向流,再现了通道形成和N-t曲线。研究成果可为将行人动力学机制融入数据驱动人群模拟的损失函数提供启发。

英文摘要

Crowd movement simulation is essential for pedestrian safety management and facility layout optimization. Data-driven models enhance trajectory prediction accuracy under Euclidean metrics, yet they suffer from excessively high collision rates, especially in bidirectional and multidirectional flows. In this paper, we establish a novel data-driven crowd simulation model that incorporates the pedestrian collision mechanism into the loss function to reduce collisions. A new lateral-acceleration-based collision loss function and a Voronoi-based motion feature extraction approach are proposed. The model is based on a Generative Adversarial Network (GAN) architecture and is termed CPGAN (Collision-Penalized GAN). We evaluate CPGAN in bidirectional flow scenarios, which involve frequent collision avoidance behaviors. Results show that the proposed lateral-acceleration-based collision loss significantly reduces opposite-direction pedestrian collision rates to levels comparable with controlled experiments. CPGAN effectively simulates bidirectional flow, reproducing lane formation and N-t curves. The research outcomes can provide inspiration for integrating pedestrian dynamics mechanisms into loss functions in data-driven crowd simulation.

2605.31204 2026-06-01 cs.CV

Probabilistic Precipitation Nowcasting with Rectified Flow Transformers

基于整流流变压器的概率降水临近预报

Johannes Schusterbauer, Jannik Wiese, Nick Stracke, Timy Phan, Björn Ommer

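AI summary Proposes FREUD, which combines a frame-wise encoder and a unified decoder with rectified flow transformers to compress spatio-temporal data efficiently while preserving uncertainty, achieving state-of-the-art precipitation nowcasting performance on the SEVIR benchmark.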

Comments CVPR 2026, Project Page: https://compvis.github.io/weather-rf/

详情
AI中文摘要

准确的天气预报在各个领域都至关重要,在极端天气条件下更是安全关键。与基于模拟的预报相比,数据驱动方法显示出更高的效率,能够实现短期、高分辨率的临近预报。特别是,扩散模型因其强大的概率基础在天气临近预报中被证明有效。然而,现有方法依赖于确定性压缩来降低高维天气数据的复杂性,限制了它们在解码过程中捕捉不确定性的能力。在这项工作中,我们引入了FREUD,一个基于整流流变压器的Frame-wise Encoder和United Decoder模型,用于高效压缩时空天气数据。帧级编码支持连续预报更新,而统一视频解码器确保时间一致性。我们保留不确定性的第一阶段允许通过集成捕捉偶然不确定性,这对于解码变异性高的极端天气事件特别有利。我们在SEVIR基准上使用紧凑的潜在空间整流流变压器实现了降水临近预报的最新性能,并通过模型和测试时缩放进一步展示了性能提升。代码见:https://github.com/CompVis/weather-rf

英文摘要

Accurate weather forecasts are essential across various domains and are safety-critical in extreme weather conditions. Compared to simulation-based forecasting, data-driven approaches show greater efficiency, enabling short-term, high-resolution nowcasting. In particular, diffusion models have proved effective in weather nowcasting due to their strong probabilistic foundation. However, existing methods rely on deterministic compression to reduce the complexity of high-dimensional weather data, limiting their ability to capture uncertainty in the decoding process. In this work, we introduce FREUD, a Frame-wise Encoder and United Decoder model based on rectified flow transformers for efficient compression of spatio-temporal weather data. Frame-wise encoding enables continuous forecast updates, while the unified video decoder ensures temporal consistency. Our uncertainty-preserving first stage allows us to capture aleatoric uncertainty via ensembling, which is particularly beneficial for extreme weather events with high decoding variability. We achieve state-of-the-art performance in precipitation nowcasting with a compact latent-space rectified flow transformer on the SEVIR benchmark and show further performance gains by model and test-time scaling. Code available here: https://github.com/CompVis/weather-rf

2605.31201 2026-06-01 cs.CL

Learning Whom to Trust: Market-Feedback Adaptive Retrieval for Frozen LLMs in Event-Driven Financial RAG

学习信任谁:事件驱动金融RAG中冻结大语言模型的市场反馈自适应检索

Zijie Zhao, Roy E. Welsch

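AI summary For event-driven financial RAG, proposes updating the retrieval layer through an external Bayesian source memory that uses market feedback to select evidence sources adaptively, improving prediction and portfolio performance while the LLM stays frozen.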

详情
AI中文摘要

金融检索增强生成(RAG)系统通常按文本相关性对证据排序,但在金融市场中,有用的证据来源取决于事件类型、预测期限和市场背景。我们将新闻触发的事件影响预测作为一个时间点金融RAG问题进行研究。对于每个公司-新闻锚点,系统检索相关的金融新闻和SEC文件段落,附加决策前市场背景卡片,并预测多期限残差收益信号。我们的方法保持大语言模型(LLM)阅读器冻结,通过外部贝叶斯源记忆(根据已成熟的残差收益反馈更新)自适应检索层。在源自FinRL-DeepSeek/FNSPID任务的固定89只纳斯达克股票池上,使用原始FNSPID新闻和时间点EDGAR文件段落,与无记忆的冻结阅读器相比,带源记忆的冻结阅读器将留出宏F1从0.438提升至0.471,下游投资组合夏普比率从0.52提升至0.84。有监督的LoRA阅读器对静态RAG有适度改进,但未超过冻结源记忆阅读器。这些结果表明,对于金融RAG,学习从何处检索与学习如何阅读同等重要,提供了一种简单、模块化的市场反馈适应途径。

英文摘要

Financial retrieval-augmented generation (RAG) systems typically rank evidence by textual relevance, but in financial markets the useful evidence source depends on event type, forecast horizon, and market context. We study news-triggered event-impact prediction as a point-in-time financial RAG problem. For each company-news anchor, the system retrieves related financial news and SEC filing passages, appends a pre-decision market-context card, and predicts multi-horizon residual-return signals. Our method keeps the large language model (LLM) reader frozen and adapts the retrieval layer through an external Bayesian source memory updated from matured residual-return feedback. On a fixed 89-stock Nasdaq-oriented universe derived from the FinRL-DeepSeek/FNSPID task, using original FNSPID news and point-in-time EDGAR filing passages, Frozen Reader with Source Memory improves held-out macro-F1 from 0.438 to 0.471 and downstream portfolio Sharpe from 0.52 to 0.84 relative to Frozen Reader with No Memory. A supervised LoRA reader improves static RAG modestly, but does not improve over the frozen source-memory reader. These results suggest that, for financial RAG, learning where to retrieve from can be as important as learning how to read, offering a simple, modular route to market-feedback adaptation.
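The abstract does not specify the form of the Bayesian source memory. One simple way to picture such a component is a Beta-Bernoulli posterior per (context, source) pair, updated once feedback has matured. The sketch below assumes exactly that; the class name, the binary "helped" signal, and the context and source labels are all hypothetical.

```python
from collections import defaultdict


class SourceMemory:
    """Beta-Bernoulli memory over evidence sources, keyed by context."""

    def __init__(self, prior_alpha=1.0, prior_beta=1.0):
        self.prior = (prior_alpha, prior_beta)
        # Each entry stores [alpha, beta] pseudo-counts, starting at the prior.
        self.counts = defaultdict(lambda: [prior_alpha, prior_beta])

    def update(self, context, source, helped):
        """Record matured feedback: did evidence from `source` help in `context`?"""
        entry = self.counts[(context, source)]
        entry[0 if helped else 1] += 1.0

    def trust(self, context, source):
        """Posterior mean of the probability that `source` is useful."""
        alpha, beta = self.counts.get((context, source), self.prior)
        return alpha / (alpha + beta)

    def rank(self, context, sources):
        """Order candidate sources by posterior trust, highest first."""
        return sorted(sources, key=lambda s: self.trust(context, s), reverse=True)


memory = SourceMemory()
for helped in [True, True, True, False]:
    memory.update("earnings", "sec_filing", helped)
for helped in [False, False, True]:
    memory.update("earnings", "news", helped)

# An unseen source falls back to the prior mean of 0.5.
ranking = memory.rank("earnings", ["news", "sec_filing", "unseen"])
```

The language model is never touched in this picture: only the retrieval ordering changes as feedback accumulates, which is the modularity the abstract emphasises.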

2605.31196 2026-06-01 cs.CV cs.AI cs.CL cs.RO

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

探索视觉-语言模型中的碰撞接地以实现安全的人机协作

Jun Wang, Xiaohao Xu, Xiaonan Huang

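AI summary For safe human-robot collaboration, introduces the notion of collision grounding and the physics-grounded benchmark TouchSafeBench, evaluates vision-language models on classifying the current safety state and warning about imminent collisions, and finds that existing models are unreliable: visual fluency does not imply physical accountability.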

Comments 31 pages, 9 figures

详情
AI中文摘要

安全的人机协作需要的不仅仅是视觉描述:监控器必须确定机器人身体是否安全分离、已经与场景或人发生碰撞,或即将碰撞。我们将这种能力称为碰撞接地:将视觉观察与机器人身体几何、相机视角、场景布局、人体接近度和时间运动相结合,以推断当前和即将发生的接触。我们引入了TouchSafeBench,一个基于物理的基准,用于评估视觉-语言模型(VLM)中的碰撞接地能力。TouchSafeBench基于Habitat 3.0构建,包含2,940个模拟室内共现场景,涵盖社交导航和社交重排,具有同步的多视角RGB-D观测、自上而下的轨迹地图、校准的相机元数据和模拟器导出的接触标签。我们研究了两个面向部署的任务:分类当前安全状态和在接触前预警即将发生的碰撞。在三个前沿或面向机器人的VLM和九种视觉表示中,当前模型远未达到可靠:最佳平均Macro-F1仍低于50%,显式深度不会自动转化为机器人身体碰撞证据,且机器人与场景的接触始终比人与人的接触风险更难。TouchSafeBench揭示了具身VLM的一个核心限制:视觉流畅性并不意味着物理责任性。可靠的机器人安全监控器需要能够显式绑定视角、机器人形态、度量几何和未来碰撞的表示。我们将在论文被接收后发布该基准。

英文摘要

Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat 3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50%, explicit depth is not automatically transformed into robot-body collision evidence, and robot-scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.

2605.31193 2026-06-01 cs.LG

Geometry-based Schrödinger Bridges for Trustworthy Multimodal Fusion

基于几何的薛定谔桥用于可信多模态融合

Jiayu Xiong, Jing Wang, Qi Zhang, Wanlong Wang, Jun Xue

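AI summary Proposes GMF, a geometry-based multimodal fusion method that uses the squared initial velocity of a Diffusion Schrödinger Bridge as a prediction-independent reliability signal to improve robustness to low-quality data.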

Comments ICML 2026 accepted paper

详情
AI中文摘要

现实世界的多模态系统必须对低质量数据具有鲁棒性,例如传感器噪声、不完整的多模态数据和冲突输入。然而,现有的可信融合方法依赖模型自身的预测置信度来判断数据质量,这造成了循环依赖:当模型自信但错误时,这些方法无法检测到错误。为了打破这一循环,我们提出了基于几何的多模态融合(GMF)。我们不依赖预测,而是通过测量输入在潜在空间中所需的传输校正量来评估可靠性。我们实现了带有整流流的扩散薛定谔桥传输,其中初始速度的平方提供了一个高效的学习校正分数。有效数据具有低的平方速度幅度,而噪声、不完整数据或冲突数据需要更强的传输校正。这种基于几何的可靠性信号充当独立判断,即使在分类器被欺骗时也能有效标记不可靠输入。大量实验表明,与基于置信度的基线相比,GMF显著提高了对严重传感器噪声和语义冲突的鲁棒性。

英文摘要

Real-world multimodal systems must be robust against low-quality data, such as sensor noise, incomplete multimodal data, and conflicting inputs. However, existing trustworthy fusion methods rely on the model's own prediction confidence to judge data quality. This creates a circular dependency: when a model is confident but wrong, these methods fail to detect the error. To break this loop, we propose Geometry-based Multimodal Fusion (GMF). Instead of relying on predictions, we evaluate reliability by measuring how much transport correction the input needs in latent space. We implement Diffusion Schrödinger Bridge transport with Rectified Flow, where the squared initial velocity gives an efficient learned correction score. Valid data has low squared velocity magnitude, while noisy, incomplete, or conflicting data requires stronger transport correction. This geometry-based reliability signal acts as an independent judge, effectively flagging unreliable inputs even when the classifier is fooled. Extensive experiments demonstrate that GMF significantly improves robustness against severe sensor noise and semantic conflicts compared to confidence-based baselines.

2605.31192 2026-06-01 cs.CV

The Regularizing Power of Language-Training Deepfake Detectors

语言训练深度伪造检测器的正则化能力

Benedikt Hopf, Zongwei Wu, Radu Timofte

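AI summary Proposes a dual-encoder architecture built on a multimodal large language model with two-stage training, using language as a regularizer to mitigate overfitting and improve the generalization and interpretability of deepfake detection.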

详情
AI中文摘要

最近,得益于多模态大语言模型的出现,深度伪造检测器不仅追求泛化性,还追求可解释性。我们提出这两个挑战可以有效地联合解决,因为可描述的伪影通常泛化性更好,从而开辟了使用语言作为正则化机制的可能性。由于深度伪造检测通常过拟合于低层次的领域特定伪影,我们的直觉是,经过语言预训练的LLM会更偏好于可更好描述的高层次伪影。这样,我们可以在可能的情况下使用高层次特征,同时训练模型在必要时使用低层次特征。我们利用双编码器架构,将冻结的专家检测器与LoRA调优的MLLM编码器配对,并采用两阶段训练课程:首先,二元对齐阶段表明,MLLM的内在能力可以有效地组合特征,以减轻对数据集特定伪影的过拟合。为了进一步增强泛化性并实现可解释性,我们采用强化学习阶段,鼓励模型在分类前生成描述性推理,仅使用二元标签。通过奖励这种“先解释后分类”的行为,我们明确激励模型优先考虑高层次、鲁棒的特征。关键在于,这一过程既产生了可解释的描述,又进一步提升了跨数据集性能,即使在推理时省略推理链也是如此。在基准数据集上的大量实验验证了我们的方法,以较大优势超越了最先进的方法。

英文摘要

Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level domain-specific artifacts, our intuition is that an LLM that has been pretrained on language would prefer high-level artifacts that can be described better. This way, we can use high-level features where possible, while training the model to use low-level features where necessary. We utilize a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum: first, a binary alignment phase demonstrates that the intrinsic capability of MLLMs can effectively combine features to mitigate overfitting to dataset-specific artifacts. To further bolster generalization and achieve interpretability, we employ a reinforcement learning stage that encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, outperforming state-of-the-art methods by a large margin.

2605.31191 2026-06-01 cs.LG cs.CV

Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10

学生容量调节知识蒸馏有效性:基于CIFAR-10上ResNet教师-学生对的系统研究

Umut Onur Yasar

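AI summary Through image classification experiments with ResNet teacher-student pairs on CIFAR-10, systematically studies how student capacity moderates the effectiveness of knowledge distillation (KD), finds that student capacity is a key moderating factor in distillation gain, and highlights the importance of implementation correctness and input-resolution-aware architecture.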

Comments 9 pages, 2 figures, 5 tables. Code available at https://github.com/umutonuryasar/kd-capacity-gap

详情
AI中文摘要

我们研究了教师-学生容量关系如何调节基于ResNet的CIFAR-10图像分类中知识蒸馏(KD)的有效性。在三个教师-学生对(R50->R18、R34->R18和R50->R34)中,我们在受控、可重复的条件下(3个种子,全程报告均值±标准差)比较了Logit-KD和Feature-KD。我们报告三个主要发现。首先,学生容量是蒸馏增益的关键调节因素:即使教师-学生准确率差距相当,R34学生从KD中获得的收益也远大于R18学生,R50->R34 Feature-KD的最大增益为+0.30个百分点,而R34->R18 Feature-KD为+0.18个百分点,R34->R18 Logit-KD为+0.00个百分点。其次,实现的正确性对Feature-KD至关重要:一个排除了投影层的梯度裁剪错误抑制了Feature-KD的性能,并产生了与Logit-KD的误导性比较。修正后,Feature-KD在三个对中的两个上匹配或优于Logit-KD,在R50->R34上达到95.55%,基线为95.25%。第三,输入分辨率感知架构是有效蒸馏的先决条件:将ResNet主干修正为32x32输入使教师准确率提高超过5个百分点——比任何KD增益高出一个数量级。所有代码和结果可在github.com/umutonuryasar/kd-capacity-gap获取。

英文摘要

We investigate how teacher-student capacity relationships modulate knowledge distillation (KD) effectiveness in ResNet-based image classification on CIFAR-10. Across three teacher-student pairs -- R50->R18, R34->R18, and R50->R34 -- we compare Logit-KD and Feature-KD under controlled, reproducible conditions (3 seeds, mean+/-std reported throughout). We report three main findings. First, student capacity is a key moderating factor in distillation gain: R34 students benefit substantially more from KD than R18 students even when teacher-student accuracy gaps are comparable, with the strongest gain of +0.30pp observed for R50->R34 Feature-KD versus +0.18pp for R34->R18 Feature-KD and +0.00pp for R34->R18 Logit-KD. Second, implementation correctness critically affects Feature-KD: a gradient clipping bug that excluded projection layers suppressed Feature-KD performance and produced misleading comparisons with Logit-KD. After correction, Feature-KD matches or outperforms Logit-KD in two of three pairs, reaching 95.55% on R50->R34 against a baseline of 95.25%. Third, input-resolution-aware architecture is a prerequisite for effective distillation: correcting the ResNet stem for 32x32 inputs raises teacher accuracy by over 5pp -- an order of magnitude larger than any KD gain. All code and results are available at github.com/umutonuryasar/kd-capacity-gap.
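For readers unfamiliar with logit-based distillation, the sketch below shows the widely used temperature-scaled formulation: a KL term between softened teacher and student distributions, mixed with the usual cross-entropy on the hard label. The abstract does not state the exact loss, temperature, or weighting used in the study, so the formula and the default values of `temperature` and `alpha` here are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL divergence between the softened teacher and student distributions.
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    # Standard cross-entropy of the student on the hard label (temperature 1).
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * ce


teacher = [4.0, 1.0, 0.5]   # made-up logits for a 3-class example
student = [2.5, 1.5, 0.2]
loss = logit_kd_loss(student, teacher, label=0)
```

The `temperature ** 2` factor is the customary rescaling that keeps the gradient magnitude of the soft term comparable across temperatures.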

2605.31189 2026-06-01 cs.LG

FlagGAM: Rule-Based Generalized Additive Modeling for Explainable Tabular Prediction

FlagGAM:基于规则的可解释表格预测广义加性模型

Zijie Zhao, Roy E. Welsch

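AI summary Proposes the FlagGAM framework, which separates feature-level rule construction from prediction through rule-defined basis functions, improving robustness to imperfect inputs while preserving interpretability.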

详情
AI中文摘要

在高风险领域的表格预测中,需要准确、透明且对不完美输入鲁棒的模型。我们提出FlagGAM,一个规则定义的基函数框架,将特征级规则构建与预测分离。Flag核心模块将数值和分类变量转换为稀疏、可读的单变量基函数,包括阈值标志、类别级标志、尾部偏差基和分类阶跃函数;默认的加性头部随后将这些基函数组合为受限的GAM风格预测器。FlagGAM不是将触发的规则简化为紧凑的计数摘要,而是保留稀疏的规则基矩阵,支持混合类型分类和回归、特征特定权重以及可选的灵活预测头部。在表格基准测试中,默认FlagGAM在透明加性模式下接近EBM,在混合类型回归上显著优于岭回归,并在缺失和噪声扰动下显示出比常见基线更小的AUROC下降。灵活头部进一步提高了准确性,接近强树基线,但需要注意,所得模型应解释为规则基表示后接非线性预测器,而非完全加性GAM。总体而言,FlagGAM为需要竞争性准确性、可传达规则和对不完美输入鲁棒性的表格设置提供了实用的中间地带。

英文摘要

Tabular prediction in high-stakes domains requires models that are accurate, transparent, and robust to imperfect inputs. We propose FlagGAM, a rule-defined basis framework that separates feature-level rule construction from prediction. A Flag Core Module converts numerical and categorical variables into sparse, human-readable univariate bases, including threshold flags, category-level flags, tail-deviation bases, and categorical step functions; a default additive head then combines these bases as a restricted GAM-style predictor. Rather than reducing triggered rules to compact count summaries, FlagGAM retains a sparse rule-basis matrix that supports mixed-type classification and regression, feature-specific weighting, and optional flexible prediction heads. Across tabular benchmarks, default FlagGAM remains close to EBM in transparent additive mode, improves substantially over ridge regression on mixed-type regression, and shows smaller AUROC degradation than common baselines under missing and noisy perturbations. Flexible heads further improve accuracy and approach strong tree-based baselines, with the caveat that the resulting model should be interpreted as a rule-basis representation followed by a nonlinear predictor rather than as a fully additive GAM. Overall, FlagGAM provides a practical middle ground for tabular settings that require competitive accuracy, communicable rules, and robustness to imperfect inputs.
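The separation between rule construction and an additive head can be pictured with a minimal sketch. This is not the FlagGAM code: it covers only threshold flags and category-level flags, and the feature names, thresholds, and weights are invented for illustration rather than learned.

```python
def threshold_flag(name, threshold):
    """Rule 'feature > threshold' as a 0/1 basis function of one feature."""
    return (f"{name} > {threshold}", lambda row: 1.0 if row[name] > threshold else 0.0)


def category_flag(name, level):
    """Rule 'feature == level' as a 0/1 basis function of one feature."""
    return (f"{name} == {level!r}", lambda row: 1.0 if row[name] == level else 0.0)


def additive_predict(row, rules, weights, bias=0.0):
    """Additive head: bias plus the weights of the rules that fire.

    Also returns the fired rules, so each prediction can be read off as a
    short list of human-readable contributions.
    """
    score = bias
    fired = []
    for (description, basis), weight in zip(rules, weights):
        value = basis(row)
        score += weight * value
        if value:
            fired.append((description, weight))
    return score, fired


# Hypothetical rules and weights; a real model would fit the weights.
rules = [threshold_flag("age", 65), threshold_flag("bp", 140), category_flag("smoker", "yes")]
weights = [0.8, 0.5, 1.1]
score, fired = additive_predict({"age": 70, "bp": 120, "smoker": "yes"}, rules, weights, bias=-1.0)
```

Because every basis depends on a single feature, the score decomposes into per-feature contributions. Swapping the additive head for a nonlinear predictor would give up that property, which matches the caveat in the abstract.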

2605.31187 2026-06-01 cs.CV cs.LG

From Local Geometry to Global Pseudo Labeling for Robust Positive Unlabeled Learning under Covariate Shift

从局部几何到全局伪标注:协变量偏移下鲁棒的正无标记学习

Firas Gabetni, Alexandre Rocchi Henry, Nacim Belkhir, Ziyi Liu, Gianni Franchi

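AI summary Proposes the SPUNA framework, which progressively discovers shifted data using local manifold structure, enabling positive-unlabeled learning under covariate shift with performance on par with fully supervised methods.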

详情
AI中文摘要

检测协变量偏移对于构建可靠的视觉系统至关重要。虽然大多数先前工作专注于提高对偏移的鲁棒性,但显式检测协变量偏移仍未被充分探索。现有方法通常依赖于全监督训练,需要来自原始分布和偏移分布的有标签样本,这往往不切实际。在本文中,我们表明协变量偏移检测可以通过使用正无标记(PU)学习的弱监督有效解决。然而,在协变量偏移下,分布内数据和偏移数据显著重叠,使得经典PU方法不稳定且对噪声敏感。为克服这一挑战,我们引入了谱PU邻域标注(SPUNA),这是一种几何感知框架,通过利用视觉特征的局部流形结构逐步发现偏移数据。大量实验表明,SPUNA在PU设置中实现了最先进的性能,并且显著匹配了全监督方法的性能。此外,我们的方法在不同类型的偏移之间鲁棒地迁移,展示了强大的泛化能力。

英文摘要

Detecting covariate shift is critical for building reliable vision systems. While most prior work focuses on improving robustness to shift, explicitly detecting covariate shift remains underexplored. Existing approaches typically rely on fully supervised training, requiring labeled examples from both original and shifted distributions, which is often impractical. In this paper, we show that covariate shift detection can be effectively addressed with weaker supervision using Positive Unlabeled (PU) learning. However, under covariate shift, in-distribution and shifted data overlap significantly, making classical PU methods unstable and sensitive to noise. To overcome this challenge, we introduce Spectral PU Neighborhood Annotation (SPUNA), a geometry-aware framework that progressively discovers shifted data by leveraging the local manifold structure of visual features. Extensive experiments show that SPUNA achieves state-of-the-art performance in PU settings and, remarkably, matches the performance of fully supervised methods. Moreover, our approach transfers robustly across different types of shifts, demonstrating strong generalization capabilities.

2605.31186 2026-06-01 cs.LG

How well does Classification Accuracy capture Concept Drift Detection Quality? An overview of Concept Drift Detection evaluation

分类精度在多大程度上捕捉概念漂移检测质量?概念漂移检测评估综述

Joanna Komorniczak

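AI summary Reviews the relationship between concept drift detection quality metrics and classification performance, studying eight drift detection quality metrics across seven synthetic data stream generators to identify the most informative set of metrics.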

详情
AI中文摘要

数据流是当今最常分析的数据结构之一,概念漂移对处理系统构成了重大挑战。尽管提出了许多解决方案来应对概念漂移导致的精度下降,但科学界尚未建立统一的概念漂移检测评估框架。现有研究通常依赖分类质量度量,但这些度量可能受多种因素影响,无法可靠反映漂移检测质量。本文深入概述了合成非平稳数据流中漂移检测质量度量与分类性能之间的关系。研究通过七种合成数据流生成工具,考察了八种漂移检测质量度量与分类器性能的关系,并额外考虑了漂移动态因素。研究旨在识别最具信息量的漂移检测质量度量集,并提供对方法评估的深入理解。

英文摘要

Data streams are nowadays among the most frequently analyzed data structures, with concept drift posing a major challenge for processing systems. Although numerous solutions have been proposed to counteract the accuracy degradation caused by concept drift, the scientific community has not yet established a unified framework for evaluating the concept drift detection task. Existing research often relies on classification quality metrics, but these can be affected by multiple factors and may not reliably reflect drift detection quality. In this work, we present an in-depth overview of the relationship between metrics for quantifying drift detection quality and classification performance in synthetic nonstationary data streams. The study examines eight drift detection quality metrics in relation to the classifier's performance across seven synthetic data stream generation tools, additionally considering drift dynamics as a factor. It aims to identify the most informative set of drift detection quality metrics and to provide a deeper understanding of how detection methods are evaluated.
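To make the distinction between detection quality and classification accuracy concrete, the sketch below computes three commonly used detection-side quantities from known drift positions: mean detection delay, missed drifts, and false alarms. These are illustrative and are not claimed to be among the eight metrics studied in the paper; the matching rule and the `tolerance` window are assumptions.

```python
def drift_detection_summary(true_drifts, detections, tolerance):
    """Match each true drift to the first unused detection within `tolerance`.

    Returns the mean detection delay over detected drifts (None if none were
    detected), the number of missed drifts, and the number of false alarms.
    """
    detections = sorted(detections)
    used = set()
    delays = []
    for drift in sorted(true_drifts):
        for i, d in enumerate(detections):
            if i not in used and drift <= d <= drift + tolerance:
                used.add(i)
                delays.append(d - drift)
                break
    missed = len(true_drifts) - len(delays)
    false_alarms = len(detections) - len(used)
    mean_delay = sum(delays) / len(delays) if delays else None
    return mean_delay, missed, false_alarms


# Synthetic stream with drifts at known sample indices (made-up values).
true_drifts = [1000, 2000, 3000]
detections = [1040, 1500, 2010, 2020]
mean_delay, missed, false_alarms = drift_detection_summary(true_drifts, detections, tolerance=200)
```

Here the detector finds two of three drifts with a mean delay of 25 samples and raises two false alarms. A classifier that retrains on every alarm could still score well on accuracy in this situation, which is the kind of mismatch the abstract warns about.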