arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.26192 2026-05-27 cs.LG cs.AI q-bio.BM

Co-folding model guided by structural proteomics

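Alon Shtrikman, Nitzan Simchi, Michal Ran Shchory, Sagie Brodsky, Eran Seger, Kirill Pevzner

Affiliations: Protai Bio

AI summary: Proposes AIMS-Fold, a framework that combines XL-MS and HDX-MS experimental data with a diffusion model to guide the generation of protein-complex conformations at inference time, improving prediction accuracy on induced-proximity targets.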

Abstract

Protein structure generative models excel at predicting static structures of single proteins from sequence, but routinely fail to capture the correct conformational state of protein complexes, which is critical for protein design and induced proximity modalities such as antibodies and PROTACs. While structural proteomics techniques like Cross-Linking Mass Spectrometry (XL-MS) and Hydrogen-Deuterium Exchange Mass Spectrometry (HDX-MS) offer valuable spatial and dynamic insights, integrating these sparse, heterogeneous measurements into these models remains an open challenge. Here, we bridge this gap by combining structural proteomics data with the rich biophysical priors learned by pretrained diffusion models. We introduce AIMS-Fold, an inference-time guided-diffusion framework that actively steers the generative sampling trajectory using differentiable physical potentials derived from XL-MS spatial restraints and HDX-MS solvent accessibility profiles. We demonstrate that these structural methods individually enhance predictive accuracy, and their integration yields synergistic improvement. Crucially, by leveraging these experimental restraints, AIMS-Fold achieves higher accuracy on challenging induced proximity targets than purely computational, unguided state-of-the-art models like Boltz-2. This establishes our framework as a powerful, integrative computational approach for the structure-based drug design of induced proximity drugs. Evaluation code will be made publicly available upon publication.
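Illustrative sketch (not the authors' code): one common way to turn a cross-link into a differentiable potential is a flat-bottom harmonic penalty on the distance between the two linked residues. The function names, the 30 Å cutoff `d_max`, and the force constant `k` are assumptions chosen for illustration; the abstract does not specify the form of the AIMS-Fold potentials. In a guided diffusion sampler, a gradient like this would typically be mixed into each denoising update rather than applied as plain gradient descent.

```python
import math

def xl_restraint(xi, xj, d_max=30.0, k=1.0):
    """Flat-bottom harmonic penalty for one cross-linked residue pair.

    Returns (energy, grad_xi, grad_xj). The energy is zero while the
    distance is at most d_max and grows quadratically beyond it.
    d_max and k are illustrative assumptions, not values from the paper.
    """
    diff = [a - b for a, b in zip(xi, xj)]
    dist = math.sqrt(sum(c * c for c in diff))
    excess = dist - d_max
    if excess <= 0.0:
        return 0.0, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]
    energy = 0.5 * k * excess ** 2
    # dE/dxi = k * excess * (xi - xj) / dist, and dE/dxj is its negative.
    grad_i = [k * excess * c / dist for c in diff]
    grad_j = [-g for g in grad_i]
    return energy, grad_i, grad_j

def guided_step(coords, links, step=0.1):
    """Move all coordinates one gradient step toward satisfying the links."""
    grads = [[0.0, 0.0, 0.0] for _ in coords]
    total = 0.0
    for i, j in links:
        energy, gi, gj = xl_restraint(coords[i], coords[j])
        total += energy
        for axis in range(3):
            grads[i][axis] += gi[axis]
            grads[j][axis] += gj[axis]
    moved = [[c - step * g for c, g in zip(x, gx)] for x, gx in zip(coords, grads)]
    return moved, total
```

Two residues 40 Å apart violate the assumed cutoff by 10 Å, so a single step pulls each one 1 Å toward the other; pairs already within the cutoff are left untouched.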

2605.26191 2026-05-27 cs.LG cs.AI

Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series

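Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai

Affiliations: SANKEN, The University of Osaka, Japan

AI summary: Proposes DelayMix, an online framework that treats streaming time series as dynamic mixtures of time-delay systems. It summarizes past regimes with a fixed-length representation and uses a tensor of Markov parameters to capture dynamics and delays, enabling fast adaptation to environmental changes with lower memory usage.

Comments: Accepted by IJCAI 2026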

Abstract

This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. The problem is challenging because rapid system changes (regime shifts) caused by environmental factors or input delay changes degrade model performance, and because a trade-off among accuracy, robustness, and memory usage arises when using multiple small models for each time-series pattern. To address these issues, this paper presents DelayMix, an online framework that treats streaming time series as dynamic mixtures of time-delay systems. The framework maintains robustness of model tracking and reduces memory usage by summarizing past regimes using a fixed-length representation that captures both the system dynamics and input-output delays. Concretely, the approach constructs a summary system tensor using the system's Markov parameter series, capturing both dynamic behavior and delay characteristics. If necessary, a tensor decomposition algorithm extracts relevant past models from the tensor and helps select the system that best fits the current regime. The method enables rapid adaptation to environmental changes and is computationally efficient. Tests on real datasets show that DelayMix consistently outperforms other methods, achieving superior forecast accuracy and faster adaptation to delays, especially for highly non-stationary data.
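Illustrative sketch (not the DelayMix implementation): the Markov parameters of a linear system are its impulse response, so a fixed-length prefix of them encodes both the dynamics and any input-output delay, which shows up as leading zeros. The system below, `x[t+1] = A x[t] + B u[t-delay]`, `y[t] = C x[t]`, and the helper names are assumptions for illustration only.

```python
def impulse_response(A, B, C, delay, length):
    """First `length` Markov parameters of a delayed single-input,
    single-output system: x[t+1] = A x[t] + B u[t-delay], y[t] = C x[t]."""
    n = len(A)
    x = [0.0] * n
    out = []
    for t in range(length):
        out.append(sum(C[i] * x[i] for i in range(n)))
        # A unit impulse applied at t = 0 reaches the state update at t = delay.
        u = 1.0 if t == delay else 0.0
        x = [sum(A[i][j] * x[j] for j in range(n)) + B[i] * u for i in range(n)]
    return out

def leading_delay(markov, tol=1e-12):
    """Read the delay off the Markov parameters: without delay the first
    non-zero parameter sits at index 1, so delay = first non-zero index - 1."""
    first = next((k for k, h in enumerate(markov) if abs(h) > tol), None)
    return None if first is None else first - 1

# A one-state system with decay 0.5 and a two-step input delay.
h = impulse_response([[0.5]], [1.0], [2.0], delay=2, length=6)
```

Here `h` starts with three zeros followed by the geometric decay 2.0, 1.0, 0.5, and `leading_delay(h)` recovers the delay of 2 from the same fixed-length vector that describes the dynamics.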

2605.26190 2026-05-27 cs.LG cs.AI eess.SP

HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals

Shuwen Yu, William P Marnane, Geraldine B. Boylan, Gordon Lightbody

Affiliations: University College Cork; INFANT Research Centre; Department of Electrical & Electronic Engineering; School of Engineering and Architecture; Pediatrics and Child Health

AI summary: Proposes HRVConformer, a hybrid convolution-Transformer deep learning architecture that classifies neonatal hypoxic-ischemic encephalopathy end to end from raw heart rate signals, reaching 83.23% AUC and 74.56% accuracy on the test set and outperforming Transformer, ResNet50, and other baselines.

Comments: Paper submitted to Journal of Engineering Applications of Artificial Intelligence

Abstract

This paper presents the HRVConformer, a novel deep learning architecture for the classification of hypoxic-ischemic encephalopathy (HIE) using the instantaneous heart rate (HR) signal. Unlike conventional approaches that rely on handcrafted features, HRVConformer directly processes raw HR signals in an end-to-end manner, capturing both local and long-range dependencies through a hybrid Convolution-Transformer framework. By integrating convolutional layers for local feature extraction and Transformer-based attention mechanisms for global context modelling, the architecture effectively enhances signal representation and classification performance. The model was trained using supervised learning on a large HR dataset consisting of 1,573 one-hour epochs, including 259 one-hour expert-annotated epochs and a substantial set of weakly labelled data. A 314-hour validation set provided a robust performance estimation, while an independent 215-hour dataset with expert annotations was reserved for final testing. HR signals were extracted from electrocardiogram (ECG) recordings using an improved Pan-Tompkins algorithm, which significantly enhanced both signal quality and data availability. Experimental results demonstrate that the HRVConformer achieves an AUC of 83.23% and an accuracy of 74.56% on the test set. These results surpass the performance of the Transformer, ResNet50 and fully convolutional network baselines, highlighting the advantages of integrating convolutional and Transformer-based components for HR-based HIE classification. The proposed method provides a promising step toward a more accurate and automated assessment of HIE using HR signals. The code is available at: https://github.com/syu-kylin/HRVConformer.

2605.26184 2026-05-27 cs.LG cs.AI

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song

Affiliations: Shanghai Jiao Tong University; Shanghai Maritime University

AI summary: Proposes GAC, a noise-aware controller that adaptively adjusts the mixing weight from online estimates of gradient variance and of the disagreement between the two training signals, improving hybrid post-training.

Comments: 15 pages, 3 figures, 22 tables

Abstract

Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes over time. We propose GAC, a noise-aware controller that derives an adaptive mixing weight from online estimates of gradient variance and disagreement between the two training signals. The method adds smoothing, prior guidance, and bounded updates while reusing existing training tensors. Experiments on math, code, science, and logic benchmarks show that GAC consistently improves hybrid post-training over strong fixed and rule-based baselines, with larger gains at larger model scales and less than 1% training overhead.
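Illustrative sketch (a guess at the general shape, not GAC itself): the abstract names the ingredients (variance estimates, a disagreement signal, smoothing, prior guidance, bounded updates) but not the formula. The version below assumes an inverse-variance target, a cosine-similarity agreement signal, and clipping of each update; every parameter name and default is hypothetical.

```python
def mixing_weight(var_sft, var_rl, cos_agree, prev_w, prior_w=0.5,
                  smooth=0.9, prior_strength=0.1, max_delta=0.05):
    """Return the next SFT weight w in [0, 1]; the RL weight is 1 - w.

    All formulas here are illustrative assumptions, not taken from the paper.
    """
    eps = 1e-12
    inv_sft = 1.0 / (var_sft + eps)
    inv_rl = 1.0 / (var_rl + eps)
    # Inverse-variance target: trust the less noisy gradient signal more.
    target = inv_sft / (inv_sft + inv_rl)
    # Disagreement: map cosine similarity from [-1, 1] to a trust in [0, 1]
    # and fall back toward the prior when the two gradients conflict.
    trust = max(0.0, min(1.0, 0.5 * (1.0 + cos_agree)))
    target = trust * target + (1.0 - trust) * prior_w
    # Prior guidance: always keep a small pull toward the prior weight.
    target = (1.0 - prior_strength) * target + prior_strength * prior_w
    # Smoothing: exponential moving average of the target.
    proposed = smooth * prev_w + (1.0 - smooth) * target
    # Bounded update: limit how far the weight may move in one step.
    delta = max(-max_delta, min(max_delta, proposed - prev_w))
    return max(0.0, min(1.0, prev_w + delta))
```

With an SFT gradient variance of 1, an RL variance of 3, and fully agreeing gradients, the weight drifts from 0.5 to 0.5225 in one step; the clip on `delta` keeps any single noisy estimate from swinging the mixture.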

2605.26182 2026-05-27 cs.AI cs.GR

BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization

Zhengyang Ni, Feng Yan, Yu Guo, Fei Wang

Affiliations: Xi’an Jiaotong University; State Key Laboratory of Human-Machine Hybrid Augmented Intelligence; Institute of Artificial Intelligence and Robotics

AI summary: Proposes BrickAnything, a geometry-conditioned autoregressive framework that uses structure-aware tree tokenization to generate brick structures satisfying assembly constraints and structural stability.

Abstract

Generating physically buildable brick structures from 3D shapes requires more than geometric reconstruction: the output must also satisfy discrete part constraints and structural stability. Existing brick generation methods either rely on heuristic optimization, which can break down when the target 3D shape does not admit a feasible structure under predefined constraints, or generate brick sequences without explicitly modeling the underlying 3D geometry and assembly relations. In this work, we present BrickAnything, a geometry-conditioned autoregressive framework for generating buildable brick structures from diverse 3D representations. BrickAnything uses point clouds as a unified geometric interface and predicts brick sequences that reconstruct the target shape under assembly constraints. To model structural dependencies among bricks, we introduce a structure-aware tree tokenization, which represents brick structures through local attachment relations. This formulation makes sequence generation more consistent with the physical construction process, and reduces invalid intermediate states. We further introduce preference-based alignment post-training, validity-constrained decoding and adaptive rollback to improve buildability objectives such as stability and geometric fidelity. Extensive experiments demonstrate that BrickAnything produces geometrically faithful and physically realizable brick structures, and that the proposed tokenization effectively reduces rollback and regeneration compared with conventional ordering strategies.

2605.26176 2026-05-27 cs.SD cs.AI

PitchBench: Measuring Pitch Hearing in Audio-Language Models

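Milan Liessens Dujardin, Song-Ze Yu, Craver Corbyn Thomas-Smith, David M. Chan, Karina Nguyen

Affiliations: University of California, Berkeley; Thoughtful Lab

AI summary: Proposes PitchBench, an evaluation suite of 28 experiments that systematically measures absolute and relative pitch perception in audio-language models, finding that current models perceive pitch unreliably across sound sources, note durations, and formats.

Comments: Preprint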

Abstract

Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.

2605.26175 2026-05-27 cs.LG cs.AI

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

Ke Li, Dong An, Xiaoling Zang, Can Ye, Liang Xie, Qibo Qiu, Chen Shen, Xiaofei He, Wenxiao Wang

Affiliations: School of Software Technology, Zhejiang University; Ant Group; College of Computer Science and Technology, Zhejiang University of Technology; China Mobile (Zhejiang) Research & Innovation Institute; Alibaba Cloud Computing; State Key Lab of CAD&CG, Zhejiang University

AI summary: Addresses the mismatch between activation distributions and the quantizer in low-bit activation quantization with an information-theoretic analysis and a training-free Peak Suppression Orthogonal Transformation (PSOT), substantially improving quantization accuracy.

Abstract

Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low-bit uniform quantizer. Existing post-training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer-facing distribution design and analyze quantization error from an information-theoretic perspective. Our analysis shows that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a training-free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization-friendly distributions. We further introduce adaptive outlier-token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end-to-end training baselines. Under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% over the previous state of the art. Code is available at https://github.com/LLIKKE/InfoQuant.
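Illustrative sketch (a stand-in, not PSOT): the effect of an orthogonal transform on a low-bit uniform quantizer can be seen on a toy activation vector with one outlier. The code below uses a plain orthonormal Walsh-Hadamard rotation and a symmetric 4-bit fake quantizer; both are assumptions chosen because they are simple and well known, and the vector `x` is made up.

```python
import math

def hadamard(x):
    """Orthonormal Walsh-Hadamard transform; len(x) must be a power of two.
    With this normalization the transform is its own inverse."""
    y = list(x)
    h = 1
    while h < len(y):
        for start in range(0, len(y), 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(len(y))
    return [v / norm for v in y]

def fake_quant(x, bits=4):
    """Symmetric uniform quantize-dequantize with a max-abs scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]

def sq_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

# Toy activation vector: one large outlier among small values.
x = [8.0, 0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1]
err_plain = sq_error(x, fake_quant(x))

y = hadamard(x)                      # rotate: the outlier is spread out
x_back = hadamard(fake_quant(y))     # quantize in rotated space, rotate back
err_rot = sq_error(x, x_back)
```

In this toy case the rotation shrinks the largest magnitude from 8.0 to about 3.0 and lowers the squared error, but the rotated values all crowd into the top two quantization levels. That crowding is the kind of poor dispersion the abstract argues a quantizer-facing transform should also address.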

2605.26172 2026-05-27 cs.LG

ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

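Meng Cai, Lars Kulik, Farhana Choudhury

Affiliations: School of Computing and Information Systems, University of Melbourne

AI summary: Finds that reasoning trajectories from test-time sampling cluster into a few "reasoning basins", so majority voting picks the most stable basin rather than the most accurate one; proposes ARBITER, which corrects consensus with conservative additive evidence to recover part of the correct answers from the sample pool.

Comments: Preprint. 34 pages, 2 figures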

Abstract

When language models use test-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or reasoning basins, each defined by a normalized final answer and the solutions that reach it. A majority vote therefore selects the most stable basin rather than the most accurate one, which creates wrong-majority failures where the correct answer is present but outvoted. We introduce ARBITER, a model-agnostic approach that models interactions between basins using only the base model's own sampled outputs, hidden states, and derived evidence. Most direct correction strategies fail; ARBITER instead uses conservative additive evidence on top of consensus. In its simplest parameter-free form, ARBITER-Δ adds same-model evidence to the majority prior, while ARBITER-Enc augments this with bounded residual signals from hidden states over complete solutions. On GSM8K with Qwen3-4B, consensus over K=24 samples achieves around the mid-94% range, while a same-pool top-2 oracle reaches around the mid-96% range. ARBITER recovers a subset of these cases using zero external information. Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases; for example, on Llama-3.1-8B MMLU-HS-Math, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom, indicating that this headroom can be partially recovered from the sample pool itself.
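Illustrative sketch (the baseline and the oracle only, not ARBITER): grouping sampled answers by their normalized final answer gives the basins, a majority vote picks the largest one, and a same-pool top-2 oracle asks whether the truth is in either of the two largest. The normalization rule and the sample answers are assumptions made up for the example.

```python
from collections import Counter

def normalize(answer):
    """Toy answer normalization: trim, lowercase, drop a trailing period
    and thousands separators. Real pipelines need task-specific rules."""
    return answer.strip().lower().rstrip(".").replace(",", "")

def basins(samples):
    """Count sampled trajectories per normalized final answer."""
    return Counter(normalize(a) for a in samples)

def majority_vote(samples):
    return basins(samples).most_common(1)[0][0]

def top2_oracle_hit(samples, truth):
    """True if the correct answer is one of the two largest basins."""
    top2 = [answer for answer, _ in basins(samples).most_common(2)]
    return normalize(truth) in top2

# Seven sampled final answers for one question whose true answer is 18.
samples = ["18", "18.", "20", "20", "20", "20 ", "7"]
```

Here the vote returns "20" even though "18" was sampled twice: a wrong-majority failure in which the correct answer is present but outvoted, while the top-2 oracle still counts the question as recoverable.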

2605.26167 2026-05-27 cs.LG cs.AI math.DS math.RA

Planning Neural Dynamics with Lie Group Embedding through Supervised Projective Manifold Learning

Tianwei Wang, Bryan Chen, Qian Zuo, Qiyue Xia, Xin Li, Wei Pang

Affiliations: School of Informatics; School of Mathematics; University of Edinburgh; School of Computer Science; School of MACS; Beijing Institute of Technology; Heriot-Watt University

AI summary: Proposes Lie group embedded dynamical neural networks (LieEDNN), which obtain learnable and stable dynamics through gradient descent and metric projection on manifolds, addressing the incompatibility of Lie groups with neural-network addition and the evolution of dynamics in a nonlinear representation space; validated on an SE(3) telescopic manipulator.

Comments: Preprint. Under review

Abstract

We propose Lie group embedded dynamical neural networks (LieEDNN) and corresponding learning algorithms based on gradient descent and metric projection on smooth manifolds, where we treat a Lie group as an intrinsic representation of the continuous symmetry of manifold geometry. We thereby achieve learnable and stable dynamics on the underlying manifold for general Lie groups, and can use the powerful representation capability of Lie groups such as SO(3) and SE(3) to solve real-world engineering problems in areas such as robotics, graphics, and control. Two core challenges are: (i) general Lie groups are incompatible with addition, which is necessary for neural network interactions; (ii) the dynamics evolve in the nonlinear representation space of special algebra rather than the normal Euclidean space, which violates the paradigm of common neural ODEs. To address these two challenges, we first introduce the adjoint Lie group action on the Lie algebra, which induces a linear mapping and transfers to the block-wise structure of the weight matrices, so that addition can operate on the Lie algebra as a vector space. We then parameterize the Lie algebra and the adjoint action as linear transformations so that the architecture is aligned with neural network perceptrons. Explicitly, this embedding appears as block-wise manifold constraints on the weights, and we develop algorithms to learn the equilibrium of the temporal neural network dynamics with stability guarantees. Experiments are conducted on a specific Lie group, SE(3), with telescopic manipulators as the application scenario.
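Illustrative sketch (a textbook identity, not the LieEDNN architecture): for SO(3), the adjoint action of a rotation R on a Lie algebra element, written as a skew-symmetric matrix hat(w), is R hat(w) Rᵀ = hat(R w). In vector coordinates the action is therefore the linear map w ↦ R w, which is why addition makes sense on the algebra even though it does not on the group. The helper names are assumptions for illustration.

```python
import math

def hat(w):
    """Map a 3-vector to the corresponding skew-symmetric matrix in so(3)."""
    x, y, z = w
    return [[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(A):
    return [list(row) for row in zip(*A)]

def matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def rot_z(theta):
    """Rotation about the z axis, an element of SO(3)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def adjoint(R, w):
    """Adjoint action of R on so(3) in vector coordinates: simply R w."""
    return matvec(R, w)

R = rot_z(0.7)
w = [0.3, -1.2, 0.5]
lhs = matmul(matmul(R, hat(w)), transpose(R))   # conjugation on matrices
rhs = hat(adjoint(R, w))                        # linear map on coordinates
```

`lhs` and `rhs` agree up to floating-point error, and `adjoint` is additive in `w`, so sums of algebra elements transform consistently under the group action.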

2605.26162 2026-05-27 cs.LG cs.AI

On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach

Jiahui Bai, Hai Dong, A. K. Qin

Affiliations: School of Computer Technologies, RMIT University; School of Science, Computing and Engineering Technologies, Swinburne University of Technology

AI summary: Proposes PushCen-ADFL, a framework that uses average-preserving push-sum mixing in a centroid representation space together with lightweight centroid regularization to address communication overhead, aggregation bias, and model drift in asynchronous decentralized federated learning.

Comments: Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026). This is the extended version with full appendix

Abstract

Asynchronous decentralized federated learning (ADFL) eliminates central coordination and global synchronization, making it attractive for large-scale and heterogeneous systems. However, frequent peer-to-peer communication, asynchronous updates on directed topologies, and non-IID data jointly lead to excessive communication overhead, biased aggregation and severe model drift. We propose PushCen-ADFL, a communication-efficient ADFL framework that enables stable training under asymmetric communication and delayed client participation. PushCen-ADFL couples communication, aggregation, and local stabilization in a shared centroid representation space, forming a closed loop between compression and optimization. Clients exchange centroid-form messages, apply average-preserving push-sum mixing to correct aggregation bias, and use a lightweight centroid regularization anchored in the same centroid space to mitigate drift under heterogeneity and staleness. A bounded, sender-deduplicated buffer further improves robustness under irregular asynchronous arrivals. Experiments on vision datasets demonstrate that PushCen-ADFL improves accuracy under data heterogeneity by up to 6% while reducing per-push communication cost by more than 80%, achieving a favorable accuracy-communication trade-off.
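Illustrative sketch (classic synchronous push-sum on scalars, not PushCen-ADFL): on a directed graph, plain neighbor averaging is biased toward well-connected nodes. Push-sum fixes this by having every node also push a weight that starts at 1; the ratio of value to weight converges to the true average. The three-node graph and the values are made up for the example, and the synchronous rounds are a simplification of the asynchronous setting in the paper.

```python
def push_sum(values, out_neighbors, rounds=200):
    """Average-preserving push-sum on a directed graph.

    values[i] is node i's initial scalar; out_neighbors[i] lists the nodes
    that i can send to. Each node splits its (value, weight) mass equally
    among its out-neighbors and itself, so both totals are conserved.
    Returns each node's estimate value / weight.
    """
    n = len(values)
    x = list(values)
    w = [1.0] * n
    for _ in range(rounds):
        next_x, next_w = [0.0] * n, [0.0] * n
        for i in range(n):
            targets = out_neighbors[i] + [i]
            share = 1.0 / len(targets)
            for j in targets:
                next_x[j] += share * x[i]
                next_w[j] += share * w[i]
        x, w = next_x, next_w
    return [xi / wi for xi, wi in zip(x, w)]

# Directed, strongly connected graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
estimates = push_sum([1.0, 5.0, 12.0], [[1, 2], [2], [0]])
```

All three estimates approach 6.0, the average of the initial values, even though the nodes end up holding different shares of the total mass.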

2605.26161 2026-05-27 cs.LG cs.AI

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

Hongkai Li, Shifeng Xie, Lefei Shen, Zhuo Li, Mouxiang Chen, Xiaobin Zhang, Han Fu, Jianling Sun, Xiaoxue Ren, Chenghao Liu

Affiliations: Zhejiang University; Télécom Paris; State Street Technology (Zhejiang) Ltd.; Datadog

AI summary: Addresses pretraining data contamination in time series foundation models (TSFMs) with TSFMAudit, an auditing method based on probe adaptation dynamics that identifies contaminated datasets by detecting unusually fast loss reduction with small backbone movement after fine-tuning.

Comments: 22 pages, 7 figures, 9 tables

Abstract

Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing such contamination is challenging in time series because signals are continuous and heterogeneous, and corpus documentation is often lacking. To the best of our knowledge, this is the first work to study pretraining contamination auditing for TSFMs. We formalize the problem of pretraining contamination auditing for TSFMs and propose TSFMAudit, a method based on probe adaptation dynamics. Our key intuition is that contamination manifests as unusually efficient adaptation: after a fine-tuning probe, contaminated datasets tend to exhibit faster loss reduction with smaller backbone movement. We evaluate TSFMAudit on 6 TSFMs and 187 datasets using documented training source evidence as supervision, and compare against 10 competitive baselines adapted from the LLM literature.

2605.26155 2026-05-27 cs.RO cs.AI cs.LG

When Does Adaptive Guidance Help? Belief-Aware Privileged Distillation for Autonomous Driving Under Partial Observability

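Mehmet Haklidir

Affiliations: TUBITAK BILGEM Artificial Intelligence Institute

AI summary: Proposes Belief-Aware GSAC (BA-GSAC), which dynamically modulates the distillation coefficient via ensemble disagreement, and uses it to study systematically when adaptive guidance helps in partially observable autonomous driving; finds that the coefficient collapses prematurely under severe occlusion and identifies an observability blindness problem.

Comments: 9 pages, 3 figures, 7 tables. Accepted at CVPR 2026 Workshop on Autonomous Driving (WAD)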

Abstract

Guided Soft Actor-Critic (GSAC) distills knowledge from a privileged full-state teacher to a partial-observation student for autonomous driving, but uses a fixed distillation coefficient lambda regardless of the agent's uncertainty. We present Belief-Aware GSAC (BA-GSAC), which modulates lambda via ensemble disagreement, and use it as a testbed for a systematic empirical study asking: when does adaptive guidance actually help? Evaluating five strategies (fixed lambda in {0.01, 0.1}, adaptive, linear decay, and vanilla SAC) across three POMDP difficulty levels on Highway-Env, we find that preliminary single-seed runs suggest benefits under mild and moderate partial observability, but under severe occlusion (evaluated with 3 seeds for all methods) the adaptive coefficient collapses to lambda_min within about 3K steps. We trace this to an observability blindness phenomenon: because the ensemble predicts partial observations, it achieves low disagreement even under heavy occlusion, modeling what is visible but unable to detect what is missing. We diagnose the root cause and propose an architectural fix (training the ensemble on full-state predictions using the guiding actor's privileged access); while not validated here, we show that even with current limitations, the warmup phase provides measurable stabilization (CV=13.3% vs. 29.8% for constant lambda=0.01). In fact, a simple deterministic linear decay schedule achieves the best severe-POMDP performance across all metrics (mean 116.5, CV=8.9%), suggesting that the scheduling effect, not the ensemble, drives the stability benefit. These findings provide practical guidance for designing uncertainty-aware teacher-student frameworks and highlight ensemble prediction targets as an important design choice.
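Illustrative sketch (hypothetical formulas, not the BA-GSAC code): the two schedules compared in the abstract can be contrasted in a few lines. The adaptive rule below maps ensemble variance linearly to lambda and clips it to [lam_min, lam_max]; that mapping, the `gain` parameter, and the use of population statistics for the coefficient of variation (CV) are all assumptions. The bounds 0.01 and 0.1 echo the fixed values in the abstract but are reused here only as defaults.

```python
import statistics

def disagreement(preds):
    """Mean per-dimension variance across ensemble members' predictions."""
    dims = len(preds[0])
    return sum(statistics.pvariance([p[d] for p in preds])
               for d in range(dims)) / dims

def adaptive_lambda(preds, lam_min=0.01, lam_max=0.1, gain=1.0):
    """Hypothetical rule: more disagreement means more teacher guidance."""
    return max(lam_min, min(lam_max, lam_min + gain * disagreement(preds)))

def linear_decay(step, total_steps, lam_start=0.1, lam_end=0.01):
    """Deterministic schedule that ignores the ensemble entirely."""
    frac = min(1.0, max(0.0, step / total_steps))
    return lam_start + frac * (lam_end - lam_start)

def coeff_of_variation(returns):
    """CV = standard deviation / mean, here with the population std."""
    return statistics.pstdev(returns) / statistics.mean(returns)

# Ensemble members that agree on a partial observation, however occluded:
agreeing = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
```

`adaptive_lambda(agreeing)` returns the floor value 0.01. This mirrors the failure the abstract calls observability blindness: when the ensemble predicts only what is visible, agreement says nothing about what is hidden, so lambda collapses while the linear schedule keeps guidance in place.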

2605.26136 2026-05-27 cs.SD cs.AI

Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

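Nicolas M. Müller, Wei Herng Choong

Affiliations: Fraunhofer AISEC

AI summary: A large-scale listening study (1,768 participants, 35,532 judgments) finds that audio deepfakes have lowered human trust in real speech (accuracy on real samples fell from 72.7% to 64.1%) rather than reducing the ability to detect fakes.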

Abstract

Audio deepfakes have improved rapidly recently, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants are not worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3 - 65.9%), while those from traditional seq2seq and flow-matching models remain easier to spot (75.4 - 76.8%). An ML detector that served as a reference point maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.

2605.26135 2026-05-27 cs.LG

SilIF: Silhouette-Augmented Isolation Forest for Unsupervised Transaction Fraud Detection

Venkatakrishnan Gopalakrishnan

Affiliations: Independent Researcher

AI summary: Proposes SilIF, which augments Isolation Forest with a silhouette-score-based layer, improving average AUC-PR by 0.0080 on the IEEE-CIS Fraud Detection benchmark and outperforming plain Isolation Forest on all five seeds.

Comments: 5 pages, 1 figure, 5 tables. Code: https://github.com/venkat15vk/silif-anomaly-detection

英文摘要

Unsupervised anomaly detection is widely used in transaction fraud detection where labels are scarce. Isolation Forest (IF) is among the most popular classical methods due to its scalability and ease of deployment. We propose SilIF, an augmentation of Isolation Forest that adds a silhouette-based scoring layer computed in a representation space induced by the trees of the forest. For each point, we extract a vector of per-tree path lengths, cluster these "fingerprints" into structural groups, and compute a silhouette score that measures how well the point fits its assigned group versus the nearest alternative. The silhouette signal is combined with the base IF score via a single hyperparameter alpha. On the IEEE-CIS Fraud Detection benchmark (~590K transactions, 3.5% fraud), SilIF with alpha=1.0 improves over plain Isolation Forest by +0.0080 AUC-PR on average across five seeds, with SilIF winning on all five seeds (paired t-test p=0.046). We also report results on a synthetic credit-card dataset (Sparkov) where the silhouette augmentation does not improve over plain IF, and we characterize the conditions that distinguish the two outcomes. The paper presents SilIF as a tunable, easy-to-deploy enhancement to Isolation Forest with honest reporting of when it helps and when it does not. Code at https://github.com/venkat15vk/silif-anomaly-detection.
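
The scoring layer described above can be sketched in plain Python. This is a hedged illustration, not the author's implementation (see the linked repository for that): the fingerprints and cluster labels are assumed to be given, and the way the silhouette is folded into the base score (`base + alpha * (1 - s) / 2`) is an assumption made here, since the abstract only says the two are combined via a single hyperparameter alpha.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_scores(fingerprints, labels):
    """Per-point silhouette s = (b - a) / max(a, b).

    a: mean distance to the other points of the point's own group.
    b: mean distance to the points of the nearest other group.
    """
    scores = []
    for i, (p, own) in enumerate(zip(fingerprints, labels)):
        dists = {}
        for j, (q, lab) in enumerate(zip(fingerprints, labels)):
            if i != j:
                dists.setdefault(lab, []).append(euclidean(p, q))
        if own not in dists or len(dists) < 2:
            scores.append(0.0)  # singleton group or only one group: use 0 by convention
            continue
        a = sum(dists[own]) / len(dists[own])
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


def combined_scores(if_scores, silhouettes, alpha=1.0):
    """ASSUMED combination rule: a poor fit (low silhouette) raises the anomaly score."""
    return [s_if + alpha * (1.0 - s) / 2.0 for s_if, s in zip(if_scores, silhouettes)]


# Toy fingerprints: one vector of per-tree path lengths per point (hypothetical values).
fingerprints = [[1.0, 1.0], [1.1, 1.0], [5.0, 5.0], [5.0, 5.1], [3.0, 3.0]]
labels = [0, 0, 1, 1, 0]  # hypothetical cluster assignment
sil = silhouette_scores(fingerprints, labels)
final = combined_scores([0.5] * 5, sil, alpha=1.0)
```

In this toy data the last point sits between the two groups, so it gets the lowest silhouette and, under the assumed rule, the highest combined score.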

2605.26133 2026-05-27 cs.CL cs.AI cs.LG

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

大型语言模型中的预训练数据暴露:成员推断、数据污染及安全影响综述

Ziyi Tong, Feifei Sun, Le Minh Nguyen

发表机构 * Japan Advanced Institute of Science and Technology(北陆先端科学技术大学院大学)

AI总结 本文首次统一综述了大型语言模型中的预训练数据暴露问题,涵盖成员推断和数据污染,形式化定义了暴露级别,回顾了攻击与防御方法,并总结了实证发现及未来研究方向。

Comments accepted by NLDB 2025

详情
AI中文摘要

大型语言模型(LLMs)已成为NLP中的主导范式,推动了研究和工业的发展。随着模型规模和预训练数据的增长,由于训练数据集的规模和不可见性,对预训练数据暴露(PDE)的担忧也在增加。PDE指的是确定特定数据是否出现在LLM的预训练语料库中。它对于确保评估完整性和保护隐私至关重要,涉及两个关键领域:数据污染和成员推断。尽管概念上相关,但这些领域通常被孤立研究。本文首次在PDE框架下对两者进行了统一综述。我们形式化了跨暴露级别的PDE,回顾了攻击和防御方法,综合了实证发现,并强调了开放的挑战和未来的研究方向。

英文摘要

Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM's pretraining corpus. It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions.

2605.26132 2026-05-27 cs.CL cs.LG

Self-Verified Distillation: Your Language Model Is Secretly Its Own Synthetic Data Pipeline

自验证蒸馏:你的语言模型秘密地就是它自己的合成数据管道

Tony Lee, Percy Liang

发表机构 * Stanford University(斯坦福大学)

AI总结 提出自验证蒸馏算法,让大语言模型仅用无标注种子问题,通过自生成、自验证和自训练提升推理能力,在数学、科学和编程任务上取得显著提升。

详情
AI中文摘要

经过后训练的大语言模型能否仅使用无标注提示,在没有外部教师或工具反馈的情况下进一步提升自己?我们在三个推理领域(数学、科学和编程)中研究这一设置,仅从没有真实解的无标注种子问题开始。我们提出自验证蒸馏,一种简单的后训练精炼算法,其中模型生成这些种子问题的候选解,使用基于提示的自验证进行过滤,并在由此产生的自策展数据集上进行训练。受UQ基准使用多个验证器筛选困难未解问题候选答案的启发,我们将这种基于验证的过滤思想应用于自训练:模型通过三级级联的循环一致性、事实性和正确性检查来过滤自己生成的解,仅当解通过所有阶段且获得一致判断时才被接受。我们发现,在训练数据构建过程中采样更多候选生成并使用更大的验证预算,可以产生更高质量的自策展数据,进而得到更好的推理模型。然后,我们使用自验证蒸馏训练多个规模的Qwen3模型,并在所有三个领域获得收益。对于Qwen3-4B,我们的方法在数学(AIME26和HMMT)上将聚合保留pass@1提升了+16.7个百分点,在科学(GPQA Diamond和HLE)上提升了+11.1个百分点,在编程(LCBv5和LCBv6)上提升了+8.3个百分点,这些收益也扩展到0.6B和8B模型。与我们的仅测试时基线(UQ-TTC)相比,后者通过在推理时花费额外计算来提升性能,自验证蒸馏在大多数设置下实现了更好的性能,同时仅在测试时进行一次推理调用。

英文摘要

Can post-trained large language models (LLMs) further improve themselves using only unlabeled prompts, without external teachers or feedback from tools? We study this setting starting only from unlabeled seed questions with no ground-truth solutions, across three reasoning domains: math, science, and coding. We propose Self-Verified Distillation, a simple post-training refinement algorithm in which the model generates candidate solutions to these seed questions, filters them using prompt-based self-verification, and trains on the resulting self-curated dataset. Inspired by the UQ benchmark's use of multiple validators to screen candidate answers to hard unsolved questions, we adapt this validation-based filtering idea to self-training: the model filters its own generated solutions through a three-stage cascade of cycle-consistency, factuality, and correctness checks, accepting a solution only if it passes all stages with unanimous judge votes. We find that sampling more candidate generations and using a larger verification budget during training data construction produces higher-quality self-curated data and, in turn, better reasoning models. We then train Qwen3 models at multiple scales with Self-Verified Distillation and obtain gains across all three domains. For Qwen3-4B, our method improves aggregate held-out pass@1 by +16.7 points in math (AIME26 and HMMT), +11.1 points in science (GPQA Diamond and HLE), and +8.3 points in coding (LCBv5 and LCBv6), with gains also extending to 0.6B and 8B models. Compared to our test-time-only baseline (UQ-TTC), which improves performance by spending extra compute at inference time, Self-Verified Distillation achieves better performance in most settings while requiring only a single inference call at test time.
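
The filtering step described above (a cascade of stages, each requiring unanimous judge votes) can be sketched as follows. This is a minimal illustration under stated assumptions: the judges are arbitrary callables standing in for prompt-based self-verification, `generate` stands in for sampling from the model, and keeping every accepted candidate is a choice made here, not something the abstract specifies.

```python
from typing import Callable, List, Sequence, Tuple

Judge = Callable[[str, str], bool]  # (question, solution) -> vote


def passes_cascade(question: str, solution: str,
                   stages: Sequence[Sequence[Judge]]) -> bool:
    """Accept only if every judge in every stage votes yes (unanimity per stage)."""
    for judges in stages:
        if not all(judge(question, solution) for judge in judges):
            return False  # stop at the first failed stage
    return True


def self_curate(questions: Sequence[str],
                generate: Callable[[str], str],
                stages: Sequence[Sequence[Judge]],
                n_candidates: int = 4) -> List[Tuple[str, str]]:
    """Sample candidates per seed question and keep those that pass the cascade."""
    dataset = []
    for question in questions:
        for _ in range(n_candidates):
            solution = generate(question)
            if passes_cascade(question, solution, stages):
                dataset.append((question, solution))
    return dataset


# Toy stand-ins for the three stages named in the abstract (hypothetical checks).
toy_stages = [
    [lambda q, s: len(s) > 0],                          # cycle-consistency placeholder
    [lambda q, s: "?" not in s, lambda q, s: s.isdigit()],  # factuality placeholder
    [lambda q, s: s == "4"],                            # correctness placeholder
]
candidates = iter(["5", "4", "", "4"])
curated = self_curate(["2+2"], lambda q: next(candidates), toy_stages, n_candidates=4)
```

Raising `n_candidates` or adding judges per stage corresponds to the larger sampling and verification budgets the abstract reports as beneficial.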

2605.26130 2026-05-27 cs.LG physics.ao-ph

AirCast-SR: A Foundation Model for Kilometer-Scale Atmospheric Super-Resolution via Latent Consistency Diffusion

AirCast-SR: 基于潜在一致性扩散的千米级大气超分辨率基础模型

Somnath Luitel, Manmeet Singh, Joshua Durkee, Abdullah Al Fahad, Naveen Sudharsan, Prabhjot Singh, Cenlin He, Harsh Kamath, Zong-Liang Yang, Krishnagopal Halder, Sandeep Juneja, Parthasarathi Mukhopadhyay, Saptarishi Dhanuka, Amit Kumar Srivastava

发表机构 * Department of Earth, Environmental, and Atmospheric Sciences, Western Kentucky University, Bowling Green, KY, USA(西肯塔基大学地球、环境与大气科学系) NASA Goddard Space Flight Center, Greenbelt, MD, USA(NASA戈达德太空飞行中心) The University of Texas at Austin, Austin, TX, USA(德克萨斯大学奥斯汀分校) NSF National Center for Atmospheric Research, Boulder, CO, USA(美国国家大气研究中心) Leibniz Centre for Agricultural Landscape Research (ZALF), Berlin, Germany(莱布尼茨农业景观研究中心(ZALF)) Ashoka University, Sonipat, India(阿育王大学)

AI总结 提出AirCast-SR基础模型,利用潜在一致性扩散框架将全球AI天气预报从0.25度降尺度至1公里分辨率,实现接近零的偏差和跨区域零样本迁移。

Comments Somnath Luitel and Manmeet Singh are equal-contribution co-first authors, with Manmeet Singh (manmeet.singh@wku.edu) as corresponding author

详情
AI中文摘要

千米尺度的业务天气预报对于传统数值天气预报(NWP)模型而言仍然计算成本过高,限制了需要精细时空细节的能源、农业和灾害管理等应用对预报的获取。本文介绍AirCast-SR,一种用于大气超分辨率的基础模型,将全球AI天气预报从0.25度(约28公里)降尺度至1公里水平分辨率,时间分辨率为每小时,同时生成八个耦合地表变量的67小时预报。EarthMind-SR采用三维U-Net,在潜在一致性模型(LCM)扩散框架内进行条件化,使用基于图块(patch)的样本在美国本土(CONUS)上训练,以GraphCast预报为输入,NOAA的校准记录分析(AORC)为目标。该模型在所有变量和预报时效上实现接近零偏差,其径向功率谱密度分析表明,在10公里至100公里波长范围内,精细大气结构得以保留,而较粗模型在此范围内会损失谱功率。我们通过涵盖冬季、夏季和春季的三个CONUS案例研究验证了EarthMind-SR,并利用独立地面站观测数据,在无需任何重新训练或微调的情况下,展示了在印度和德国上的零样本全球迁移能力。作为一个开放权重的基础模型,EarthMind-SR为千米级AI天气预报建立了新范式,并为区域微调、蒸馏以及气候服务和灾害预报中的下游应用提供了平台。

英文摘要

Operational weather prediction at kilometer scales remains computationally prohibitive for traditional numerical weather prediction (NWP) models, limiting forecast access for applications in energy, agriculture, and disaster management that require fine-grained spatiotemporal detail. Here we introduce AirCast-SR, a foundation model for atmospheric super-resolution that downscales global AI weather forecasts from 0.25 degree (~28 km) to 1 km horizontal resolution at hourly temporal resolution, producing 67-hour forecasts of eight coupled surface variables simultaneously. EarthMind-SR employs a three-dimensional U-Net conditioned within a Latent Consistency Model (LCM) diffusion framework, trained on patch-based samples over the contiguous United States (CONUS) using GraphCast forecasts as input and NOAA's Analysis of Record for Calibration (AORC) as the target. The model achieves near-zero bias across all variables and lead times, and its radial power spectral density analysis demonstrates preservation of fine-scale atmospheric structure at wavelengths of 10 km to 100 km where coarser models lose spectral power. We validate EarthMind-SR across three CONUS case studies spanning winter, summer, and spring seasons, and demonstrate zero-shot global transferability over India and Germany using independent surface station observations without any retraining or fine-tuning. As an open-weights foundation model, EarthMind-SR establishes a new paradigm for kilometer-scale AI weather prediction and provides a platform for regional fine-tuning, distillation, and downstream applications in climate services and hazard forecasting.
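
The "0.25 degree (~28 km)" figure quoted above follows from simple arc-length arithmetic. The sketch below assumes a spherical Earth with a mean radius of 6371 km and measures along a great circle (for example along a meridian); east-west grid spacing shrinks with latitude, so this is an upper bound away from the equator.

```python
import math

EARTH_MEAN_RADIUS_KM = 6371.0  # assumption: spherical Earth, mean radius


def degrees_to_km(degrees):
    """Arc length in km for an angle in degrees along a great circle."""
    return math.radians(degrees) * EARTH_MEAN_RADIUS_KM


coarse_km = degrees_to_km(0.25)  # grid spacing of the input forecast
fine_km = 1.0                    # target grid spacing
linear_factor = coarse_km / fine_km      # refinement per horizontal axis
cells_per_coarse_cell = linear_factor ** 2  # fine cells covering one coarse cell
```

Under these assumptions each coarse grid cell is replaced by several hundred 1 km cells, which indicates how much fine-scale detail the super-resolution model has to supply.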

2605.26128 2026-05-27 cs.LG cs.SE

The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

约束税:小语言模型结构化输出中有效性与正确性的权衡度量

Jaideep Ray

发表机构 * ACM(美国计算机协会)

AI总结 本文提出“约束税”测量协议,通过实验证明硬输出约束会显著降低小语言模型的答案准确性和可执行准确性,并建议生产系统应分别报告模式有效性、答案准确性、可执行准确性和错误有效模式率。

详情
AI中文摘要

生产级LLM系统越来越需要机器可读的输出:JSON对象、类型化轨迹、正则表达式约束字段和工具调用模式。本文针对设备端和低成本小语言模型(SLM)部署,其中低于3B参数的模型因隐私、延迟和通用硬件而具有吸引力,但在解决任务时满足模式的能力有限。通常的工程假设是硬输出约束能提高可靠性而不改变底层答案。我们证明这一假设对小模型不安全。我们引入“约束税”,一种测量协议,用于在固定模型、固定任务分布和固定问题实例下,隔离由结构化输出约束引起的答案和可执行准确性损失。在Qwen2.5-0.5B、Qwen2.5-1.5B和SmolLM2-1.7B的15,000次通用GPU生成中,硬答案模式解码将模式有效性从61.5%提高到100.0%,但将答案准确性从19.7%降低到11.0%,并将错误有效模式输出从49.5%增加到88.9%。最强的工业类比是确定性日历工具调用任务:Qwen2.5-1.5B在仅提示JSON下达到91.5%的可执行准确性,但在相同硬工具调用模式下仅为48.0%,而两种模式都是100.0%模式有效。错误是语义性的,而非结构性的。我们还表明,3B边界仍然支付直接模式税,并且延迟包装支持一种建设性设计模式:自由推理,延迟约束。实际结论是直接的:生产系统应分别报告模式有效性、答案准确性、可执行准确性和错误有效模式率。

英文摘要

Production LLM systems increasingly require machine-readable outputs: JSON objects, typed traces, regex-constrained fields, and tool-call schemas. This paper targets on-device and low-cost small language model (SLM) deployments, where sub-3B models are attractive for privacy, latency, and commodity hardware but have limited capacity to satisfy schemas while solving tasks. The usual engineering assumption is that hard output constraints improve reliability without changing the underlying answer. We show that this assumption is unsafe for small models. We introduce the constraint tax, a measurement protocol for isolating the answer and executable-accuracy loss caused by structured-output constraints at fixed model, fixed task distribution, and fixed problem instances. Across 15,000 commodity-GPU generations with Qwen2.5-0.5B, Qwen2.5-1.5B, and SmolLM2-1.7B, hard answer-only schema decoding raises schema validity from 61.5% to 100.0%, but lowers answer accuracy from 19.7% to 11.0% and increases wrong-valid-schema outputs from 49.5% to 88.9%. The strongest industry analogue is a deterministic calendar tool-call task: Qwen2.5-1.5B achieves 91.5% executable accuracy with prompt-only JSON but only 48.0% under the same hard tool-call schema, while both modes are 100.0% schema-valid. The error is semantic, not structural. We also show that the 3B boundary still pays a direct-schema tax and that delayed packaging supports a constructive design pattern: reason free, constrain late. The practical conclusion is direct: production systems should report schema validity, answer accuracy, executable accuracy, and wrong-valid-schema rate separately.
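
The recommendation to report the four quantities separately can be sketched as a small reporting helper. This is an illustration under assumptions, not the paper's protocol: the `Outcome` record and its field names are hypothetical, and all four rates are taken over the full set of outputs, which is one plausible reading of the percentages quoted above.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Outcome:
    """One generation, already scored (hypothetical record layout)."""
    schema_valid: bool        # output parses and satisfies the schema
    answer_correct: bool      # extracted answer matches the reference
    executes_correctly: bool  # downstream tool call produces the right effect


def constraint_report(outcomes: Sequence[Outcome]) -> Dict[str, float]:
    """Report the four rates separately, each over all outputs."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")

    def rate(predicate):
        return sum(1 for o in outcomes if predicate(o)) / n

    return {
        "schema_validity": rate(lambda o: o.schema_valid),
        "answer_accuracy": rate(lambda o: o.answer_correct),
        "executable_accuracy": rate(lambda o: o.executes_correctly),
        # valid on the surface, wrong in substance: the case a validity check hides
        "wrong_valid_schema_rate": rate(lambda o: o.schema_valid and not o.answer_correct),
    }


toy = [
    Outcome(True, True, True),
    Outcome(True, False, False),
    Outcome(True, False, False),
    Outcome(False, True, False),
]
toy_report = constraint_report(toy)
```

Comparing such a report between an unconstrained run and a schema-constrained run on the same problem instances would expose the tradeoff the abstract describes: validity can rise while accuracy falls.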

2605.26103 2026-05-27 cs.CV

Global Structure-from-Motion Meets Feedforward Reconstruction

全局运动恢复结构与前馈重建的结合

Linfei Pan, Johannes Schönberger, Marc Pollefeys

发表机构 * ETH Zurich(苏黎世联邦理工学院) Meta Reality Labs(Meta现实实验室) Microsoft(微软公司)

AI总结 提出一种结合经典SfM和前馈重建优势的新流水线,在多种场景下实现最先进的重建结果。

Comments CVPR 2026, Highlight

详情
AI中文摘要

运动恢复结构——从一组图像同时估计相机姿态和3D场景结构的过程——仍然是计算机视觉中的一个核心挑战,许多开放问题尚待解决。前馈3D重建的最新进展在克服经典SfM方法的持续失败案例方面取得了显著进步,特别是在低纹理、有限重叠和对称性等场景中。然而,尽管前馈方法在这些挑战性条件下表现出色,但它们在可扩展性、准确性或鲁棒性方面常常面临限制,并且在标准重建设置中通常不如经典方法。在这项工作中,我们系统地分析了这些限制,并通过结合经典和前馈方法的各自优势,提出了一种新的运动恢复结构流水线。在多个数据集上的广泛实验显示了我们的方法的优势,在广泛场景中实现了最先进的结果。我们将我们的系统作为开源实现分享在https://github.com/colmap/gluemap。

英文摘要

Structure-from-Motion -- the process of simultaneously estimating camera poses and 3D scene structure from a collection of images -- remains a central challenge in computer vision, with many open problems yet to be solved. Recent advances in feedforward 3D reconstruction have made significant strides in overcoming persistent failure cases of classical SfM methods, particularly in scenarios characterized by low texture, limited overlap, and symmetries. However, while feedforward approaches excel in these challenging conditions, they often face limitations regarding scalability, accuracy, or robustness, and typically fall short of classical methods in standard reconstruction settings. In this work, we systematically analyze these limitations and propose a new Structure-from-Motion pipeline by combining the respective strengths of classical and feedforward methods. Extensive experiments across multiple datasets show the benefits of our approach, achieving state-of-the-art results across a wide range of scenarios. We share our system as an open-source implementation at https://github.com/colmap/gluemap.

2605.26079 2026-05-27 cs.CL

Automated Benchmark Auditing for AI Agents and Large Language Models

AI智能体与大语言模型的自动化基准审计

Junlin Wang, Federico Bianchi, Shang Zhu, Fan Nie, Yongchan Kwon, Bhuwan Dhingra, James Zou

发表机构 * Duke University(杜克大学) Together AI Stanford University(斯坦福大学)

AI总结 提出自动化基准审计框架ABA,系统审计基准任务中的隐藏环境依赖、规范缺失和评分逻辑问题,在168个基准中发现25.7%的任务存在关键问题,过滤后模型排名变化且性能提升约10%。

详情
AI中文摘要

现代AI基准的复杂性超出了传统验证方法的能力。由领域专家编写的任务通常包含隐含假设、不完整的环境规范和脆弱的评估逻辑,人工标注无法可靠地捕捉这些问题。我们引入了自动化基准审计(ABA),一个系统审计单个基准任务的智能体框架,揭示隐藏的环境依赖、规范缺失和有限的评分逻辑等问题。我们在前沿LLM基准和之前的NeurIPS出版物上运行ABA,共涵盖九个领域的168个基准。在这些语料中,ABA在超过25.7%的被评估任务中识别出关键问题,包括模糊的任务设计、执行环境冲突和错误的标准答案(ground truth)。这些自动化审计的精确性通过专家评审和独立第三方报告(如上游PR)得到验证。关键的是,我们证明这些有问题的任务严重扭曲了对智能体和LLM的能力评估:过滤掉这些有问题的任务会改变模型排名,并在SWE-bench Verified和Terminal-Bench 2上分别将平均性能提高9.9%和9.6%。我们发布智能体工具和所有任务注释,以支持前沿基准的未来发展。

英文摘要

Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues including ambiguous task design, execution environment conflicts, and incorrect ground truths in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distort capability assessments for agents and LLMs: filtering out these tasks with issues shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.
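
The effect of filtering flagged tasks on scores and rankings can be illustrated with a toy computation. The sketch below uses invented per-task results and hypothetical function names; it shows only the mechanics of re-scoring after exclusion, not ABA itself or any figure from the paper.

```python
def pass_rate(results, excluded=frozenset()):
    """Fraction of passed tasks, ignoring tasks in `excluded`."""
    kept = [passed for task, passed in results.items() if task not in excluded]
    if not kept:
        raise ValueError("no tasks left after filtering")
    return sum(kept) / len(kept)


def rank_models(all_results, excluded=frozenset()):
    """Model names sorted by pass rate, best first."""
    rates = {model: pass_rate(res, excluded) for model, res in all_results.items()}
    return sorted(rates, key=lambda m: rates[m], reverse=True)


# Invented toy data: which tasks each model passed.
toy_results = {
    "model_a": {"t1": True, "t2": True, "t3": False, "t4": False},
    "model_b": {"t1": True, "t2": False, "t3": True, "t4": True},
}
flagged = {"t3", "t4"}  # tasks an audit marked as problematic (hypothetical)

ranking_before = rank_models(toy_results)
ranking_after = rank_models(toy_results, excluded=flagged)
```

In this toy case the ranking flips once the flagged tasks are removed, which is the kind of distortion the abstract warns about.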

2605.26047 2026-05-27 cs.AI

Retrying vs Resampling in AI Control

AI控制中的重试与重采样

James Lucassen, Adam Kaufman

发表机构 * Redwood Research

AI总结 研究AI编码框架中重试与重采样的安全性,发现重试会泄露信息导致攻击更隐蔽,而重采样通过最大怀疑分数审计可提升安全性。

详情
AI中文摘要

像Claude Code和Codex这样的AI编码脚手架使用重试:阻止标记为有风险的动作并继续轨迹。我们从AI控制的角度研究重试,该角度将模型视为潜在对抗性的。我们发现,虽然重试降低了诚实的怀疑分数,但不可信模型可以利用监控器理由构建更隐蔽的攻击,抵消了安全增益。我们还研究了重采样:从相同上下文中抽取多个样本,这不会泄露可利用的信息。我们解开了先前关于重采样的工作所捆绑的设计选择。在BashArena中,使用Claude Opus 4.6作为不可信模型,MiMo-V2-Flash作为可信监控器,每步抽取五个样本并在最大怀疑分数上审计,在0.3%的审计预算下,安全性从61%提高到71%,且不影响有用性。仅对第一次抽取中看起来可疑的步骤进行选择性重采样,恢复了6.2个百分点的增益,同时仅多抽取了10%的样本。我们在此设置中的两个发现与先前关于重采样的工作相矛盾。第一个是,基于重采样怀疑分数最大值进行审计优于使用最小值,这与Ctrl-Z的发现相反。第二个是,执行最不可疑的样本(这是早期“推迟到重采样”协议的核心机制)在我们的设置中仅带来很小的实证安全增益(+3.9个百分点,置信区间包含零)。

英文摘要

AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
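
The two resampling ideas above (auditing on an aggregate of resampled suspicion scores, and resampling only steps that look suspicious on the first draw) can be sketched as follows. The function names, thresholds, and the `draw_score` callable are hypothetical; the monitor and the untrusted model are abstracted away entirely.

```python
def should_audit(suspicion_scores, audit_threshold, aggregate=max):
    """Flag a step for audit if the aggregated suspicion reaches the threshold."""
    return aggregate(suspicion_scores) >= audit_threshold


def selective_resample(first_score, draw_score, resample_threshold, n_extra=4):
    """Draw extra samples only when the first draw already looks suspicious."""
    scores = [first_score]
    if first_score >= resample_threshold:
        scores.extend(draw_score() for _ in range(n_extra))
    return scores


# Five monitor scores for the same step (invented values).
step_scores = [0.1, 0.2, 0.9, 0.1, 0.3]
audit_on_max = should_audit(step_scores, audit_threshold=0.8, aggregate=max)
audit_on_min = should_audit(step_scores, audit_threshold=0.8, aggregate=min)

extra = iter([0.7, 0.95, 0.4, 0.6])
resampled = selective_resample(0.5, lambda: next(extra), resample_threshold=0.4)
skipped = selective_resample(0.1, lambda: 1.0, resample_threshold=0.4)
```

With `max`, a single suspicious draw is enough to trigger an audit, whereas `min` requires every draw to look suspicious; the abstract reports that the former performed better in the authors' setting.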

2605.25971 2026-05-27 cs.CL cs.IR cs.MA

Anticipate and Learn: Unleashing Idle-Time Compute in Proactive Agents

预见与学习:在主动智能体中释放空闲时间计算

Haoyi Hu, Qirong Lyu, Xianghan Kong, Weiwen Liu, Jianghao Lin, Zixuan Guo, Yan Xu, Yasheng Wang, Weinan Zhang, Yong Yu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Tencent(腾讯)

AI总结 提出ProAct主动智能体架构,利用空闲时间计算预测并满足用户未来需求,通过ProActEval基准测试验证其在任务加速、减少用户努力和降低幻觉率方面的显著优势。

Comments 26 pages, 4 figures; code available at https://github.com/AgentACE-AI/ProAct

详情
AI中文摘要

虽然AI智能体在推理和工具使用方面展现出显著能力,但它们本质上仍然是被动的:仅在用户明确提示后才计算响应。这种范式忽略了一个关键机会:交互之间的空闲时间很大程度上被浪费,使得智能体无法为未来的用户需求做准备。为弥补这一差距,我们引入了ProAct,一种主动智能体架构,利用空闲时间计算来预测并满足可能即将出现的用户需求。通过分析不断演变的对话历史以及持久记忆,ProAct预测即将到来的需求并迭代获取信息,使智能体能够在用户发起查询之前解决知识差距并准备证据。为严格评估主动能力,我们还引入了ProActEval,一个包含40个领域200个场景的综合基准,具有可预测的需求链和多样化的用户认知特征。实验结果表明,与被动基线相比具有显著优势。ProAct通过减少14.8%的必要交互轮次加速任务完成,减少11.7%的用户努力,并在ProActEval上将幻觉率降低28.1%。此外,MemBench评估证实ProAct达到了最先进的反思准确性,突显其持续且稳健的性能。

英文摘要

While AI agents demonstrate remarkable capabilities in reasoning and tool use, they remain fundamentally reactive: they compute responses only after explicit user prompts. This paradigm ignores a critical opportunity: the idle time between interactions is largely wasted, leaving agents unable to prepare for future user needs. To bridge this gap, we introduce ProAct, a proactive agent architecture that leverages idle-time compute to anticipate and fulfill likely upcoming user needs. By analyzing evolving dialogue history together with persistent memory, ProAct predicts upcoming needs and iteratively acquires information, allowing the agent to resolve knowledge gaps and prepare evidence before the user initiates a query. To rigorously evaluate proactive capabilities, we also introduce ProActEval, a comprehensive benchmark comprising 200 scenarios across 40 domains, featuring predictable need chains and diverse user cognitive profiles. Empirical results demonstrate significant advantages over reactive baselines. ProAct accelerates task completion by reducing required turns by 14.8%, decreases user effort by 11.7%, and cuts hallucination rates by 28.1% on ProActEval. Furthermore, MemBench evaluations confirm that ProAct achieves state-of-the-art reflective accuracy, underscoring its sustained and robust performance.

2605.25861 2026-05-27 cs.CV cs.AI

MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images

MuNet: 一种用于从单张图像联合进行3D人体网格恢复和3D穿衣人体重建的互惠网络

Yunqi Gao, Leyuan Liu, Yuhan Li, Changxin Gao, Jingying Chen

发表机构 * National Engineering Research Center for E-Learning(教育信息化国家级工程研究中心) National Engineering Research Center of Educational Big Data(教育大数据国家级工程研究中心) School of Electronic Information and Communications(电子信息与通讯学院) School of Artificial Intelligence and Automation(人工智能与自动化学院)

AI总结 提出MuNet,一种互惠网络,通过统一表示和互惠机制联合优化3D人体网格恢复与穿衣人体重建,在六个基准数据集上达到最先进性能。

详情
AI中文摘要

3D人体网格恢复和3D穿衣人体重建本质相关,但长期以来被孤立研究,忽视了联合优化的潜在收益。为克服这一局限,我们提出在一个统一框架中处理这两个任务,从而有效利用它们的相互依赖关系。基于这一思想,我们提出MuNet,一种用于从单张图像联合进行3D人体网格恢复和3D穿衣人体重建的互惠网络。首先,我们采用2-流形图作为所有3D模型的统一表示,从而在3D人体网格恢复和穿衣人体重建之间实现一致建模。其次,我们设计了一个端到端的图卷积网络,逐步将初始图变形为3D人体网格,并将其细化成详细的3D穿衣人体模型。第三,我们引入一种互惠机制,允许两个任务在训练期间进行相互交互,其中3D人体网格恢复为3D穿衣人体重建提供指导,而重建反馈则细化3D人体网格恢复。我们在六个基准数据集上广泛评估了MuNet,包括Human3.6M、3DPW、MPI-INF-3DHP、THuman2.0、CAPE和RenderPeople。实验结果表明,MuNet在所有数据集上的两个任务均达到了最先进的性能。MuNet的代码已在https://github.com/starVisionTeam/MuNet上发布,供研究使用。

英文摘要

3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby overlooking the potential gains of joint optimization. To overcome this limitation, we propose to address these two tasks within a unified framework, which allows their mutual dependencies to be effectively exploited. Building on this idea, we propose MuNet, a mutualistic network for joint 3D human mesh recovery and 3D clothed human reconstruction from single images. First, we adopt 2-manifold graphs as a unified representation for all 3D models, enabling consistent modeling across 3D human mesh recovery and clothed human reconstruction. Second, we design an end-to-end graph convolutional network that progressively deforms an initial graph into a 3D human mesh and refines it into a detailed 3D clothed human model. Third, we introduce a mutualistic mechanism that allows reciprocal interaction between the two tasks during training, where 3D human mesh recovery provides guidance for 3D clothed human reconstruction, and reconstruction feedback refines the 3D human mesh recovery. We extensively evaluate MuNet on six benchmark datasets for 3D human mesh recovery and 3D clothed human reconstruction, including Human3.6M, 3DPW, MPI-INF-3DHP, THuman2.0, CAPE, and RenderPeople. Experimental results demonstrate that MuNet achieves state-of-the-art performance on both tasks across all datasets. The code of MuNet is released for research purposes at https://github.com/starVisionTeam/MuNet.

2605.25758 2026-05-27 cs.CL

StreamProfileBench: A Benchmark for Fine-Grained User Profile Inference in Real-World Streaming Scenarios

StreamProfileBench:真实流式场景中细粒度用户画像推断的基准

Sizhe Wang, Feiyu Duan, Juelin Wang, Liwen Zhang, Zhongyu Wei

发表机构 * School of Data Science, Fudan University(复旦大学数据科学学院) Shanghai Innovation Institute(上海创新研究院) Shanghai University of Finance and Economics(上海财经大学) MOE laboratory for National Development and Intelligent Governance, Fudan University(复旦大学国家发展与智能治理实验室) Research Institute of Intelligent Complex Systems, Fudan University(复旦大学智能复杂系统研究院)

AI总结 提出StreamProfileBench基准,通过持续状态维护任务和免标注评估框架,研究大语言模型在流式用户画像更新中的保守偏差问题。

详情
AI中文摘要

大型语言模型(LLMs)重塑了用户画像,但当前的评估主要关注静态数据快照。这种范式忽视了个性化系统的现实,其中用户生成内容(UGC)持续到达,细粒度画像快速演变。为弥补这一差距,我们引入了StreamProfileBench,一个用于细粒度流式用户画像的大规模基准。我们将流式用户画像形式化为一个持续状态维护任务,并整理了一个高度真实的数据集,包含来自五个不同平台的7000多名真实用户的超过12万条UGC帖子。通过利用用户兴趣的时间相关性,我们进一步提出了一种新颖的、无需标注的评估框架。在14个领先的LLM上的大量实验表明,持续画像更新仍然是一个开放的挑战。模型表现出系统性的保守偏差,过度保留过去的兴趣而未能识别兴趣衰减。消融实验进一步验证了流式范式的实用性和必要性。

英文摘要

Large Language Models (LLMs) have reshaped user profiling, yet current evaluations mainly focus on static data snapshots. This paradigm overlooks the reality of personalized systems, where User-Generated Content (UGC) arrives continuously and fine-grained profiles evolve rapidly. To bridge this gap, we introduce StreamProfileBench, a large-scale benchmark for fine-grained streaming user profiling. We formalize streaming user profiling as a continuous state maintenance task and curate a highly authentic dataset comprising over 120,000 UGC posts from 7,000+ real users across five diverse platforms. By leveraging the temporal correlation of user interests, we further propose a novel, annotation-free evaluation framework. Extensive experiments across 14 leading LLMs reveal that continuous profile updating remains an open challenge. Models exhibit a systemic conservative bias, over-retaining past interests while failing to recognize interest decay. Ablation experiments further validate the practical utility and necessity of the streaming paradigm.

2605.25629 2026-05-27 cs.CL cs.LG

When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

当分布内增益失效:评估偏好转移下的弱到强奖励模型

Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen

发表机构 * National University of Singapore(新加坡国立大学) VinUniversity(Vin大学) Nanyang Technological University(南洋理工大学)

AI总结 研究弱到强偏好学习在零样本分布转移下的表现,发现弱监督微调会导致强模型偏向源域特征,提出表示锚定正则化方法以改善跨分布迁移。

Comments Code: https://anonymous.4open.science/r/w2s_reward_ood-682F

详情
AI中文摘要

弱到强(W2S)泛化是一种有前景的可扩展监督框架,然而现有评估通常在同分布训练-测试条件下进行。因此,我们研究零样本分布转移下的W2S偏好学习,发现基于弱偏好标签训练的强学生模型在分布内表现成功,但无法跨偏好数据集迁移。我们提供了证据表明存在一种表示失败模式:弱监督微调可能将强模型拉向源域特征,而不是保持广泛可迁移的偏好表示。为了缓解这一问题,我们提出表示锚定(Anchor),一种简单而有效的正则化方法,在微调过程中约束强模型预训练表示空间的过度漂移,同时允许任务相关的适应。在多个偏好领域、数据集和模型家族中,Anchor一致地改进了分布外迁移,同时保持了具有竞争力的分布内性能。综合来看,我们的评估协议、迁移感知指标和方法揭示了当前W2S奖励建模中隐藏的脆弱性,并为实现更稳健的偏好迁移提供了实用路径。

英文摘要

Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. Therefore, we study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weak-supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.

2605.25570 2026-05-27 cs.CV

From Contrast to Consistency: Rethinking Event-based Continuous-Time Optical Flow Estimation

从对比到一致性:重新思考基于事件的连续时间光流估计

Rui Hu, Song Wu, Wen Yang, Jinjian Wu

发表机构 * Xidian University(西安电子科技大学)

AI总结 提出一种基于时空结构一致性(STSC)的混合监督框架,结合双向互补多尺度架构和课程引导混合训练策略,在连续时间和标准光流估计中达到最先进性能。

Comments Accepted by CVPR 2026

详情
AI中文摘要

估计连续光流是动态视觉感知中一个基础但具有挑战性的问题。基于事件的相机具有微秒级延迟和高动态范围,能够异步捕捉亮度变化,为以精细时间精度建模运动提供了独特机会。然而,时间密集的真实标注的稀缺性限制了监督学习的有效性,而专注于锐化扭曲事件图像(IWE)的对比度最大化(CM)框架往往忽略时间连续性和结构一致性,导致复杂运动下的轨迹扭曲。为了克服这些挑战,我们提出了一种基于时空结构一致性(STSC)原则的混合监督框架,用于连续时光流估计。该范式共同强化局部结构稳定性和轨迹连续性,确保跨时间的物理一致运动。为了进一步增强表示和鲁棒性,我们设计了一种双向互补的多尺度架构,并采用课程引导的混合训练策略,实现了从监督点约束到自监督流形正则化的平滑过渡。在多个基准上的综合实验表明,我们的方法在连续时间和标准光流估计中均达到了最先进的性能,证明了所提出学习范式的有效性。

英文摘要

Estimating continuous optical flow is a fundamental yet challenging problem in dynamic visual perception. Event-based cameras, with microsecond latency and high dynamic range, capture brightness changes asynchronously, offering a unique opportunity to model motion with fine temporal precision. However, the scarcity of temporally dense ground-truth annotations limits the effectiveness of supervised learning, while contrast maximization (CM) frameworks, focused on sharpening the Image of Warped Events (IWE), often neglect temporal continuity and structural coherence, leading to distorted trajectories under complex motion. To overcome these challenges, we propose a hybrid-supervised framework for continuous-time optical flow estimation, grounded in the principle of Spatio-temporal Structural Consistency (STSC). This paradigm jointly enforces local structural stability and trajectory continuity, ensuring physically coherent motion across time. To further enhance representation and robustness, we design a bidirectionally complementary multi-scale architecture and employ a curriculum-guided hybrid training strategy, enabling a smooth transition from supervised point constraints to self-supervised manifold regularization. Comprehensive experiments across multiple benchmarks show that our method achieves state-of-the-art performance in both continuous-time and standard optical flow estimation, demonstrating the effectiveness of the proposed learning paradigm.
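
For context, the contrast maximization (CM) objective that the abstract contrasts its approach with can be sketched in a few lines: events are warped back along a candidate flow, accumulated into an Image of Warped Events (IWE), and the image variance is used as the sharpness score. This is a simplified illustration of the prior CM idea with a single global flow and nearest-pixel accumulation, not the STSC method proposed in the paper; all names are hypothetical.

```python
from collections import Counter


def iwe_variance(events, flow, width, height):
    """Variance of the Image of Warped Events for one global candidate flow.

    events: iterable of (x, y, t); flow: (vx, vy) in pixels per time unit.
    Each event is warped to the reference time t = 0 and counted per pixel.
    """
    vx, vy = flow
    counts = Counter()
    for x, y, t in events:
        wx = round(x - vx * t)
        wy = round(y - vy * t)
        if 0 <= wx < width and 0 <= wy < height:
            counts[(wx, wy)] += 1
    n_pixels = width * height
    mean = sum(counts.values()) / n_pixels
    # Sum squared deviations over occupied pixels, then add the empty pixels.
    squared = sum((c - mean) ** 2 for c in counts.values())
    squared += (n_pixels - len(counts)) * mean ** 2
    return squared / n_pixels


# A single point moving right at 2 pixels per time unit (synthetic events).
events = [(10 + 2 * t, 5, t) for t in range(10)]
sharp = iwe_variance(events, (2.0, 0.0), width=40, height=10)
blurred = iwe_variance(events, (0.0, 0.0), width=40, height=10)
```

The correct flow collapses all events onto one pixel and yields the higher variance. Because this score looks only at the accumulated image, it says nothing about whether the trajectory is continuous in time, which is the gap the abstract attributes to CM frameworks.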

2605.25569 2026-05-27 cs.CV

ControlLight: Towards Controllable, Consistent, and Generalizable Low-Light Enhancement

ControlLight: 迈向可控、一致且可泛化的低光照增强

Yufeng Yang, Jianzhuang Liu, Jisheng Chu, Yuqi Peng, Xianfang Zeng, Jiancheng Huang, Shifeng Chen

发表机构 * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(深圳先进技术研究院,中国科学院) Zhejiang University(浙江大学)

AI总结 提出ControlLight框架,通过构建连续光照强度监督的大规模数据集和引入错位感知加权流匹配损失,实现了低光照增强的可控性、一致性和泛化性。

Comments 18 pages, 12 figures

详情
AI中文摘要

现有的基于深度学习的低光照增强方法通常在有限的数据集上训练,且具有单一的增强目标,这限制了它们在现实应用中的泛化能力和可控性。为了克服这些限制,我们提出了ControlLight,一个可控、一致且泛化的低光照增强框架。我们首先构建了一个带有连续光照强度监督的真实世界退化图像的大规模数据集。为了进一步确保在不同控制强度下输出的一致性,我们引入了一种错位感知加权流匹配损失,该损失在连续增强强度下保持图像结构。ControlLight允许用户通过灵活控制强度来编辑真实世界的退化低光照图像,以获得满意的增强结果,同时保持视觉一致性和真实性。大量实验表明,ControlLight在现有低光照增强方法中达到了最先进的性能,同时展现出强大的连续可控性和对真实世界场景的泛化能力。

英文摘要

Existing deep learning-based low-light enhancement methods are typically trained on limited datasets with single enhancement targets, which restricts their generalization ability and controllability in real-world applications. To overcome these limitations, we propose ControlLight, a controllable, consistent, and generalizable framework for low-light enhancement. We first construct a large-scale dataset of real-world degraded images with continuous illumination-strength supervision. To further ensure consistent outputs under different control strengths, we introduce a misalignment-aware weighted flow matching loss that preserves image structure across continuous enhancement strengths. ControlLight allows users to edit real-world degraded low-light images toward satisfactory enhancement results by flexibly controlling the strength while preserving visual consistency and realism. Extensive experiments show that ControlLight achieves state-of-the-art performance against existing low-light enhancement approaches while demonstrating strong continuous controllability and generalization to real-world scenarios.

2605.25538 2026-05-27 cs.CV cs.DB

Tetris: Tile-level Sampling for Efficient and High-Fidelity Video Object Tracking

Tetris: 用于高效高保真视频目标跟踪的瓦片级采样

Chanwut Kittivorawong, Alena Chao, Charlie Si, Alvin Cheung

发表机构 * U. of California, Berkeley(加州大学伯克利分校)

AI总结 提出Tetris系统,通过将视频分解为基于瓦片的骨牌数据模型,实现细粒度时空剪枝,在保持跟踪精度损失不超过5%的条件下,将检测器调用次数减少多达68.8倍。

详情
AI中文摘要

轨迹物化将原始视频转换为可重用的目标轨迹,下游查询可以直接使用而无需重新运行跟踪,但高效且高保真地提取这些轨迹仍然成本高昂。先前的系统通过时间帧采样来降低成本,但这会抹去细粒度跟踪所需的帧间运动。然而,在静态视频中,每帧的大部分区域不包含感兴趣的目标,剩余区域也能容忍不同的采样率。我们提出Tetris,一个轨迹提取系统,它将视频分解为基于瓦片的骨牌数据模型,实现细粒度时空剪枝,以最小的保真度损失减少检测器调用。Tetris在用户提供的检测器上游运行三个算子:一个分类器识别相关瓦片并将它们分组为骨牌;一个整数线性规划(ILP)在用户指定的精度约束下剪枝冗余骨牌;一个打包器将幸存者组装成画布,以最小化检测器调用。在7个静态视频数据集上,Tetris的跟踪精度损失保持在5%以内,而先前的系统在7个数据集中的3个上超过了这个界限。在这个5%的界限下,Tetris的吞吐量比先前系统高17.4倍,比参考流水线高68.8倍。项目页面位于https://tetris-db.github.io。

英文摘要

Track materialization converts raw video into reusable object tracks that downstream queries can run against without rerunning tracking, but extracting those tracks efficiently and with high fidelity remains expensive. Prior systems reduce cost through temporal frame sampling, erasing the inter-frame motion that fine-grained tracking requires. In stationary video, however, large portions of each frame contain no objects of interest, and the remaining regions tolerate different sampling rates. We present Tetris, a track-extraction system that decomposes videos into a tile-based polyomino data model, enabling fine-grained spatiotemporal pruning that reduces detector calls with minimal fidelity loss. Tetris runs three operators upstream of the user-provided detector: a classifier identifies relevant tiles and groups them into polyominoes, an integer linear program (ILP) prunes redundant polyominoes under a user-specified accuracy constraint, and a packer assembles the survivors into canvases that minimize detector calls. Across 7 stationary-video datasets, Tetris stays within a 5% tracking accuracy loss of a full-frame, every-frame reference pipeline, whereas prior systems exceed this bound on 3 of the 7 datasets. At this 5% bound, Tetris achieves up to 17.4x higher throughput than prior systems and up to 68.8x higher than the reference pipeline. The project page is at https://tetris-db.github.io.
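
The first operator above, grouping relevant tiles into polyominoes, can be sketched as a connected-components pass over a tile grid. This is a simplified illustration under assumptions: tiles are taken from a single frame, adjacency is 4-connectivity, and the relevance classifier is replaced by a given set of tile coordinates. The ILP pruning and canvas packing steps are not shown.

```python
def group_polyominoes(relevant_tiles):
    """Group (row, col) tiles into edge-connected components (polyominoes)."""
    remaining = set(relevant_tiles)
    groups = []
    while remaining:
        start = remaining.pop()
        group = {start}
        stack = [start]
        while stack:
            row, col = stack.pop()
            # Visit the four edge-adjacent neighbours of the current tile.
            for neighbour in ((row + 1, col), (row - 1, col),
                              (row, col + 1), (row, col - 1)):
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    group.add(neighbour)
                    stack.append(neighbour)
        groups.append(group)
    return groups


# Tiles a classifier marked as relevant in one frame (invented example).
tiles = {(0, 0), (0, 1), (1, 1), (3, 3)}
polyominoes = group_polyominoes(tiles)
sizes = sorted(len(p) for p in polyominoes)
```

Only the tiles inside these groups would need to reach the detector, which is where the reduction in detector work comes from.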

2605.25353 2026-05-27 cs.LG cs.CV physics.comp-ph

PDEInvBench: A Comprehensive Dataset and Design Space Exploration of Neural Networks for PDE Inverse Problems


Divyam Goel, Nithin Chalapathi, Sanjeev Raja, Aditi S. Krishnapriyan

Affiliations * Department of Computer Science, UC Berkeley; UC Berkeley; Departments of Computer Science and Chemical Engineering, UC Berkeley; LBNL (Lawrence Berkeley National Laboratory)

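AI summary Proposes PDEInvBench, a benchmark dataset of numerical simulations covering a variety of PDEs, and systematically explores the neural network design space along three dimensions: optimization, representation, and scaling. Practical findings concern two-stage training, PDE derivatives as input features, and the diversity of initial conditions.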

Comments 37 total pages, 13 main pages, 20 figures, 8 tables. Published in Transactions on Machine Learning Research (TMLR), 2026

Journal ref Transactions on Machine Learning Research, 2026

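Details
AI Chinese abstract (translated to English)

Inverse problems in partial differential equations (PDEs) involve estimating a system's physical parameters from observed spatiotemporal solution fields. Neural networks are well suited to PDE parameter estimation because they can model transformations between function spaces. While existing benchmarks of machine learning methods for PDEs focus mainly on the forward problem, there is no comparable comprehensive study or benchmark dataset for PDE inverse problems, i.e., mapping solution fields to the underlying physical parameters. We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset of numerical simulations of time-dependent and time-independent PDEs covering a wide range of physical behaviors and parameters. The dataset includes evaluation splits for assessing performance in in-distribution and several out-of-distribution settings. Using it, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing how supervised, self-supervised, and test-time training objectives affect performance; (2) problem representations, studying the value of architectural choices with different inductive biases and of various conditioning strategies; and (3) scaling, with respect to both model and data size. Our experiments reveal several practical insights: 1) neural networks perform best with a two-stage training procedure, initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual; 2) using PDE derivatives as input features consistently improves accuracy; and 3) increasing the diversity of initial conditions in the training data yields larger performance gains than widening the range of PDE parameters. We make our dataset and codebase publicly available.

English abstract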

Inverse problems in partial differential equations (PDEs) involve estimating the physical parameters of a system from observed spatiotemporal solution fields. Neural networks are well-suited for PDE parameter estimation due to their capability to model function-to-function space transformations. While existing benchmarks of machine learning methods for PDEs primarily focus on the forward problem, there are no similar comprehensive studies and benchmark datasets on PDE inverse problems, i.e., mapping solution fields to underlying physical parameters. We fill this gap by introducing PDEInvBench, a comprehensive benchmark dataset consisting of numerical simulations for both time-dependent and time-independent PDEs across a wide range of physical behaviors and parameters. Our dataset includes evaluation splits that assess performance in both in-distribution and various out-of-distribution settings. Using our benchmark dataset, we comprehensively explore the design space of neural networks for PDE inverse problems along three key dimensions: (1) optimization procedures, analyzing the role of supervised, self-supervised, and test-time training objectives on performance, (2) problem representations, where we study the value of architectural choices with different inductive biases and various conditioning strategies, and (3) scaling, which we perform with respect to both model and data size. Our experiments reveal several practical insights: 1) neural networks perform best with a two-stage training procedure: initial supervision with PDE parameters followed by test-time fine-tuning using the PDE residual, 2) incorporating PDE derivatives as input features consistently improves accuracy, and 3) increasing the diversity of initial conditions in the training data yields greater performance gains than expanding the range of PDE parameters. We make our dataset and codebase publicly available.
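To make "estimating parameters from a solution field using the PDE residual" concrete, here is a toy sketch. It is not the PDEInvBench method and involves no neural network. It assumes the 1D heat equation u_t = k u_xx with the known solution u(x, t) = exp(-k π² t) sin(π x), and recovers the diffusivity k by minimizing the squared finite-difference residual in closed form. The function names, grid sizes, and the choice of PDE are illustrative assumptions.

```python
import math


def heat_solution(k, nx=41, nt=41, t_max=0.1):
    """Sample u(x, t) = exp(-k*pi^2*t) * sin(pi*x) on a grid over [0, 1] x [0, t_max].

    Returns (u, dx, dt) where u[n][i] is the value at time step n, grid point i.
    """
    dx = 1.0 / (nx - 1)
    dt = t_max / (nt - 1)
    u = [
        [math.exp(-k * math.pi ** 2 * (n * dt)) * math.sin(math.pi * (i * dx))
         for i in range(nx)]
        for n in range(nt)
    ]
    return u, dx, dt


def estimate_diffusivity(u, dx, dt):
    """Return the k minimizing sum((u_t - k * u_xx)^2) over interior grid points.

    Derivatives use central differences; setting the derivative of the
    squared residual with respect to k to zero gives
    k = sum(u_t * u_xx) / sum(u_xx^2).
    """
    num = 0.0
    den = 0.0
    for n in range(1, len(u) - 1):
        for i in range(1, len(u[0]) - 1):
            u_t = (u[n + 1][i] - u[n - 1][i]) / (2 * dt)
            u_xx = (u[n][i + 1] - 2 * u[n][i] + u[n][i - 1]) / dx ** 2
            num += u_t * u_xx
            den += u_xx * u_xx
    return num / den


# Generate a field with a "hidden" parameter, then recover it from the field alone.
u, dx, dt = heat_solution(k=0.5)
k_hat = estimate_diffusivity(u, dx, dt)
print(f"estimated k = {k_hat:.4f}")
```

The estimate differs from the true value only by discretization error. A learned model would replace the closed-form step, with a residual of this kind serving as a label-free objective at test time.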

2605.25029 2026-05-27 cs.RO

ParkingWorld: End-to-End Autonomous Parking Reinforcement Learning from Corrective Experience in 3DGS Simulation


Zhengcheng Yu, Changze Li, Haoran Liu, Tong Qin

Affiliations * Tsinghua University; Shanghai Jiao Tong University

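AI summary Proposes CIL-SERL, a sample-efficient reinforcement learning framework built on corrective experience, which trains an end-to-end autonomous parking policy in a photorealistic 3D Gaussian Splatting (3DGS) simulator and uses a multi-level replay buffer mechanism to improve success rate, efficiency, and safety.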

Comments 9 pages (including 1 page of appendix), 6 figures. Will be submitted to RA-L 2026

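Details
AI Chinese abstract (translated to English)

Autonomous parking requires precise low-speed maneuvering in narrow, cluttered, and highly constrained environments, where the vehicle must avoid static obstacles and complex geometric boundaries. Unlike imitation learning, which typically needs large amounts of high-quality expert demonstrations to converge to a stable policy and generalizes poorly to unseen scenarios, traditional reinforcement learning (RL) methods face persistent challenges: excessive training overhead, inefficient exploration, and even failure to learn viable parking strategies in challenging scenarios. To address these problems, this paper proposes a correction-in-the-loop sample-efficient reinforcement learning (CIL-SERL) framework for end-to-end autonomous parking. The framework is trained entirely in a photorealistic 3D Gaussian Splatting (3DGS) parking simulator that supports high-fidelity digital reconstruction of real scenes. Inspired by the error-correction notebooks used in learning practice, we design a novel multi-level replay buffer mechanism. The buffers hierarchically organize standard RL rollouts, human corrective interventions, failed exploration trajectories, and rollback-based correction segments, storing them in separate yet interconnected memory regions to support structured sampling and targeted learning during training. The framework is systematically evaluated in both the 3DGS simulation environment and on a real vehicle platform. Extensive experiments show that our method substantially improves parking success rate, operational efficiency, and safety across diverse scenarios, validating the effectiveness and practical applicability of the proposed CIL-SERL-based end-to-end autonomous parking solution.

English abstract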

Autonomous parking demands precise low-speed maneuvering within narrow, cluttered, and highly constrained environments, where vehicles must navigate tight spaces while avoiding static obstacles and complex geometric boundaries. Unlike imitation learning, which typically requires massive volumes of high-quality expert demonstrations to converge to a stable policy and often suffers from limited generalization to unseen scenarios, traditional reinforcement learning (RL) methods face persistent challenges including excessive training overhead, inefficient exploration, and even failure to learn viable parking strategies in challenging settings. To address these limitations, this paper presents a correction-in-the-loop sample-efficient reinforcement learning (CIL-SERL) framework for end-to-end autonomous parking, which is entirely trained in a photorealistic 3D Gaussian Splatting (3DGS) parking simulator that enables high-fidelity digital reconstruction of real-world scenes. Inspired by error-correction notebooks used in learning practice, we design a novel multi-level replay buffer mechanism. These buffers hierarchically organize and store standard RL rollouts, human corrective interventions, failed exploration trajectories, and rollback-based correction segments in separate yet interconnected memory regions, facilitating structured sampling and targeted learning during training. The proposed framework is systematically evaluated in both the 3DGS simulation environment and a physical vehicle platform. Extensive experimental results demonstrate that our method achieves substantial improvements in parking success rate, operational efficiency, and safety performance across diverse scenarios, validating the effectiveness and practical applicability of the proposed CIL-SERL-based end-to-end autonomous parking solution.
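The multi-level replay buffer described in the abstract can be pictured with a minimal sketch. This is an illustration under assumptions, not the authors' design: the class name `MultiLevelReplayBuffer`, the level names, the per-level capacity, and the weighted sampling rule are all hypothetical. The abstract only states that the four kinds of experience are stored in separate yet interconnected regions to support structured sampling.

```python
import random
from collections import deque


class MultiLevelReplayBuffer:
    """Keeps one bounded buffer per experience type and samples across them."""

    # Level names mirror the four experience types listed in the abstract.
    LEVELS = ("rollout", "human_correction", "failure", "rollback_correction")

    def __init__(self, capacity_per_level=10_000, weights=None, seed=None):
        self.buffers = {name: deque(maxlen=capacity_per_level) for name in self.LEVELS}
        # Relative sampling weight per level (assumed; uniform by default).
        self.weights = weights or {name: 1.0 for name in self.LEVELS}
        self.rng = random.Random(seed)

    def add(self, level, transition):
        """Store a transition; an unknown level name raises KeyError."""
        self.buffers[level].append(transition)

    def sample(self, batch_size):
        """Draw (level, transition) pairs, picking levels by weight among non-empty ones."""
        non_empty = [name for name in self.LEVELS if self.buffers[name]]
        if not non_empty:
            raise ValueError("all buffers are empty")
        level_weights = [self.weights[name] for name in non_empty]
        picked = self.rng.choices(non_empty, weights=level_weights, k=batch_size)
        return [(level, self.rng.choice(self.buffers[level])) for level in picked]


# Usage: corrective data is weighted twice as heavily as ordinary rollouts.
buf = MultiLevelReplayBuffer(
    seed=0,
    weights={"rollout": 1.0, "human_correction": 2.0,
             "failure": 1.0, "rollback_correction": 2.0},
)
buf.add("rollout", {"obs": 0, "action": 0.1, "reward": 0.0})
buf.add("human_correction", {"obs": 1, "action": -0.2, "reward": 1.0})
batch = buf.sample(8)
print(len(batch))  # 8
```

Keeping the levels separate means rare corrective data is not evicted by the much larger stream of ordinary rollouts, and the sampling weights give one explicit knob for how often each kind of experience is replayed.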