2605.31244 2026-06-01 cs.LG physics.comp-ph

HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model

Xiang Zhu, Puzhen Yuan, Yichen Liu, Jianyu Chen

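Affiliations: Institute for Interdisciplinary Information Sciences, Tsinghua University, China; Shanghai Qi Zhi Institute, China

AI summary: Proposes the HARP framework, which learns aligned human-robot visual and latent-action representations from limited paired human-robot demonstrations and unpaired videos, improving VLA pretraining and yielding performance gains on CALVIN and real-world tasks.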

Abstract

Learning generalizable vision-language-action (VLA) models from large-scale human videos is promising but challenging due to cross-embodiment discrepancies in both visual observations and executable actions. While latent action models reduce the action execution gap by learning action abstractions, they still rely on visual features. Thus, misaligned human and robot visual representations can lead to inconsistencies in policy inputs and induce domain-dependent latent actions, hindering effective co-training with human videos. To address this, we propose HARP, a human-robot aligned representation learning framework for more effective VLA pretraining from human videos. Specifically, HARP uses limited paired human-robot demonstrations as cross-embodiment bridges and abundant unpaired human and robot videos as a scalable dynamics supervision data source. It trains a robot-adapted visual encoder and a latent action model with manipulation-centric auxiliary cues and a source-relative pair-discriminative alignment loss, which adapts robot representations toward human semantics while preserving pair-level discrimination. The learned aligned vision encoder and latent action model provide a unified vision and action representation for VLA-style policy learning, where human and robot videos provide vision-language-to-latent-action supervision and a lightweight robot action head grounds latent actions into executable commands. Experiments on feature visualization, simulation, and real-world manipulation show improved human-robot alignment and downstream policy performance, achieving 4.481 average length on CALVIN ABC$\rightarrow$D and a 7.1% real-world success rate gain over the strongest baseline.

2605.31229 2026-06-01 cs.CV cs.AI

Probabilistic Precipitation Nowcasting with Rectified Flow Transformers

Johannes Schusterbauer, Jannik Wiese, Nick Stracke, Timy Phan, Björn Ommer

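Affiliations: CompVis; LMU Munich; Munich Center for Machine Learning (MCML)

AI summary: Proposes FREUD, which combines a frame-wise encoder and a unified decoder with rectified flow transformers to compress spatio-temporal data efficiently while preserving uncertainty, reaching state-of-the-art precipitation nowcasting performance on the SEVIR benchmark.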

Comments CVPR 2026, Project Page: https://compvis.github.io/weather-rf/

Abstract

Accurate weather forecasts are essential across various domains and are safety-critical in extreme weather conditions. Compared to simulation-based forecasting, data-driven approaches show greater efficiency, enabling short-term, high-resolution nowcasting. In particular, diffusion models proved effective in weather nowcasting due to their strong probabilistic foundation. However, existing methods rely on deterministic compression to reduce the complexity of high-dimensional weather data, limiting their ability to capture uncertainty in the decoding process. In this work, we introduce $\textbf{FREUD}$, a $\textbf{Fr}$ame-wise $\textbf{E}$ncoder and $\textbf{U}$nited $\textbf{D}$ecoder model based on rectified flow transformers for efficient compression of spatio-temporal weather data. Frame-wise encoding enables continuous forecast updates, while the unified video decoder ensures temporal consistency. Our uncertainty-preserving first stage allows us to capture aleatoric uncertainty via ensembling, which is particularly beneficial for extreme weather events with high decoding variability. We achieve state-of-the-art performance in precipitation nowcasting with a compact latent-space rectified flow transformer on the SEVIR benchmark and show further performance gains by model and test-time scaling. Code available here: https://github.com/CompVis/weather-rf

2605.31201 2026-06-01 cs.CL

The Regularizing Power of Language-Training Deepfake Detectors

Benedikt Hopf, Zongwei Wu, Radu Timofte

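Affiliations: Computer Vision Lab, CAIDAS, University of Würzburg

AI summary: Proposes a dual-encoder architecture built on a multimodal large language model with two-stage training, using language as a regularizer to mitigate overfitting and improve the generalization and interpretability of deepfake detection.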

Abstract

Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level domain-specific artifacts, our intuition is that an LLM that has been pretrained on language would prefer high-level artifacts that can be described better. This way, we can use high-level features where possible, while training the model to use low-level features where necessary. We utilize a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum: first, a binary alignment phase demonstrates that the intrinsic capability of MLLMs can effectively combine features to mitigate overfitting to dataset-specific artifacts. To further bolster generalization and achieve interpretability, we employ a reinforcement learning stage that encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, outperforming state-of-the-art methods by a large margin.

2605.31191 2026-06-01 cs.LG cs.CV

Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10

Umut Onur Yasar

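Affiliations: GitHub

AI summary: Through image classification experiments with ResNet teacher-student pairs on CIFAR-10, systematically studies how student capacity moderates knowledge distillation (KD) effectiveness; finds that student capacity is a key moderator of distillation gain and highlights the importance of implementation correctness and input-resolution-aware architecture.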

Comments 9 pages, 2 figures, 5 tables. Code available at https://github.com/umutonuryasar/kd-capacity-gap

Abstract

We investigate how teacher-student capacity relationships modulate knowledge distillation (KD) effectiveness in ResNet-based image classification on CIFAR-10. Across three teacher-student pairs -- R50->R18, R34->R18, and R50->R34 -- we compare Logit-KD and Feature-KD under controlled, reproducible conditions (3 seeds, mean+/-std reported throughout). We report three main findings. First, student capacity is a key moderating factor in distillation gain: R34 students benefit substantially more from KD than R18 students even when teacher-student accuracy gaps are comparable, with the strongest gain of +0.30pp observed for R50->R34 Feature-KD versus +0.18pp for R34->R18 Feature-KD and +0.00pp for R34->R18 Logit-KD. Second, implementation correctness critically affects Feature-KD: a gradient clipping bug that excluded projection layers suppressed Feature-KD performance and produced misleading comparisons with Logit-KD. After correction, Feature-KD matches or outperforms Logit-KD in two of three pairs, reaching 95.55% on R50->R34 against a baseline of 95.25%. Third, input-resolution-aware architecture is a prerequisite for effective distillation: correcting the ResNet stem for 32x32 inputs raises teacher accuracy by over 5pp -- an order of magnitude larger than any KD gain. All code and results are available at github.com/umutonuryasar/kd-capacity-gap.
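
For readers unfamiliar with Logit-KD, the following is a minimal sketch of the classic temperature-scaled distillation loss. It is not taken from the paper's repository: the function names, the temperature of 4.0, and the mixing weight `alpha` of 0.5 are all illustrative assumptions, and the authors' exact loss may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_kd_loss(student_logits, teacher_logits, label,
                  temperature=4.0, alpha=0.5):
    """Hypothetical helper: alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE.

    The KL term compares temperature-softened distributions; the
    cross-entropy term uses the student's unsoftened prediction.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce
```

When the student's logits equal the teacher's, the KL term vanishes and only the weighted cross-entropy remains, so the distillation signal comes entirely from disagreement between the two softened distributions.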

2605.31189 2026-06-01 cs.LG

FlagGAM: Rule-Based Generalized Additive Modeling for Explainable Tabular Prediction

Zijie Zhao, Roy E. Welsch

Affiliations: EECS Department, Massachusetts Institute of Technology, Cambridge, MA, USA; Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, USA

AI summary: Proposes the FlagGAM framework, which uses rule-defined basis functions to separate feature-level rule construction from prediction, improving robustness to imperfect inputs while remaining interpretable.

Abstract

Tabular prediction in high-stakes domains requires models that are accurate, transparent, and robust to imperfect inputs. We propose FlagGAM, a rule-defined basis framework that separates feature-level rule construction from prediction. A Flag Core Module converts numerical and categorical variables into sparse, human-readable univariate bases, including threshold flags, category-level flags, tail-deviation bases, and categorical step functions; a default additive head then combines these bases as a restricted GAM-style predictor. Rather than reducing triggered rules to compact count summaries, FlagGAM retains a sparse rule-basis matrix that supports mixed-type classification and regression, feature-specific weighting, and optional flexible prediction heads. Across tabular benchmarks, default FlagGAM remains close to EBM in transparent additive mode, improves substantially over ridge regression on mixed-type regression, and shows smaller AUROC degradation than common baselines under missing and noisy perturbations. Flexible heads further improve accuracy and approach strong tree-based baselines, with the caveat that the resulting model should be interpreted as a rule-basis representation followed by a nonlinear predictor rather than as a fully additive GAM. Overall, FlagGAM provides a practical middle ground for tabular settings that require competitive accuracy, communicable rules, and robustness to imperfect inputs.
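
To make the idea of threshold flags feeding an additive head concrete, here is a rough sketch under stated assumptions. The function names, the "greater than" rule, the choice to map a missing value to all-zero flags, and the example weights are hypothetical and are not FlagGAM's actual API or rule set.

```python
def threshold_flags(value, thresholds):
    """Return one 0/1 flag per threshold: 1 if value exceeds it.

    Assumption for illustration: a missing value (None) triggers no flags.
    """
    if value is None:
        return [0] * len(thresholds)
    return [1 if value > t else 0 for t in thresholds]

def additive_score(row, thresholds_by_feature, weights_by_feature, intercept=0.0):
    """Sum per-feature contributions; each feature only sees its own flags,
    so the predictor stays additive across features."""
    score = intercept
    for name, thresholds in thresholds_by_feature.items():
        flags = threshold_flags(row.get(name), thresholds)
        score += sum(w * f for w, f in zip(weights_by_feature[name], flags))
    return score

# Illustrative rules and weights, not values from the paper.
thresholds = {"age": [30, 60], "bmi": [25]}
weights = {"age": [0.5, 1.0], "bmi": [0.8]}
example = additive_score({"age": 45, "bmi": 27}, thresholds, weights)
```

Because every triggered flag maps to a single weight, each contribution to `example` can be read off as a rule such as "age above 30 adds 0.5", which is the kind of communicable rule the abstract refers to.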

2605.31187 2026-06-01 cs.CV cs.LG

From Local Geometry to Global Pseudo Labeling for Robust Positive Unlabeled Learning under Covariate Shift

Firas Gabetni, Alexandre Rocchi Henry, Nacim Belkhir, Ziyi Liu, Gianni Franchi

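Affiliations: U2IS, ENSTA; Institut Polytechnique de Paris; AMIAD, Pôle Recherche, Palaiseau

AI summary: Proposes the SPUNA framework, which exploits local manifold structure to progressively discover shifted data, enabling positive-unlabeled learning under covariate shift with performance matching fully supervised methods.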

Abstract

Detecting covariate shift is critical for building reliable vision systems. While most prior work focuses on improving robustness to shift, explicitly detecting covariate shift remains underexplored. Existing approaches typically rely on fully supervised training, requiring labeled examples from both original and shifted distributions, which is often impractical. In this paper, we show that covariate shift detection can be effectively addressed with weaker supervision using Positive Unlabeled (PU) learning. However, under covariate shift, in-distribution and shifted data overlap significantly, making classical PU methods unstable and sensitive to noise. To overcome this challenge, we introduce Spectral PU Neighborhood Annotation (SPUNA), a geometry-aware framework that progressively discovers shifted data by leveraging the local manifold structure of visual features. Extensive experiments show that SPUNA achieves state-of-the-art performance in PU settings and remarkably matches the performance of fully supervised methods. Moreover, our approach transfers robustly across different types of shifts, demonstrating strong generalization capabilities.

2605.31186 2026-06-01 cs.LG

How well does Classification Accuracy capture Concept Drift Detection Quality? An overview of Concept Drift Detection evaluation

Joanna Komorniczak

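Affiliations: Department of Systems and Computer Networks

AI summary: Surveys the relationship between concept drift detection quality metrics and classification performance, studying eight drift detection quality metrics across seven synthetic data stream generation tools in order to identify the most informative set of metrics.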

Abstract

Data streams are among the most frequently analyzed data structures today, and concept drift poses a major challenge for the systems that process them. Although numerous solutions have been proposed to counteract the accuracy degradation caused by concept drift, the scientific community has not yet established a unified framework for evaluating the concept drift detection task. Existing research often relies on classification quality metrics, but these can be affected by multiple factors and may not reliably reflect drift detection quality. In this work, we present an in-depth overview of the relationship between metrics for quantifying drift detection quality and classification performance in synthetic nonstationary data streams. The study examines eight drift detection quality metrics in relation to the classifier's performance across seven synthetic data stream generation tools, additionally considering drift dynamics as a factor. It aims to identify the most informative set of drift detection quality metrics and to provide a deeper understanding of how detection methods are evaluated.

2605.31183 2026-06-01 cs.CL cs.AI cs.LG

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

Mikkel Godsk Jørgensen, Lars Kai Hansen

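Affiliations: DTU Compute

AI summary: Shows that, when features are selected and labelled with a supervised pipeline, sparse autoencoders can approach LoRA performance on model steering, and finds that high sparsity may not be crucial for interpretability-based steering.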

Abstract

Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines. This work serves as a partial rebuttal for Sparse Autoencoders and suggests that the results of Wu et al. (2025) did not do them full justice. We find that Sparse Autoencoders can, in fact, perform close to on par with the reference LoRA performance on the AxBench benchmark, when features are selected and labelled with our supervised pipeline. We also find that our pipeline selects features that are surprisingly causal of their identified labels when using only its interpretability-based components. Lastly, we present evidence that high sparsity (low l0) may not be crucial for successful steering based on interpretability, which is in contrast to the earlier findings in Wang et al. (2025).

2605.31177 2026-06-01 cs.CV

Vanilla ViT for Automotive Point Cloud Semantic Segmentation

Gilles Puy, Nermin Samet, Alexandre Boulch, Spyros Gidaris, Tuan-Hung VU, Renaud Marlet

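Affiliations: LIGM, CNRS, Univ Gustave Eiffel, ENPC, IP Paris, France

AI summary: Proposes VaViT, which uses a carefully designed tokenizer, a lightweight decoder head, and tailored data augmentations so that a plain, non-hierarchical ViT matches or exceeds state-of-the-art methods on semantic segmentation of large-scale lidar point clouds.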

Abstract

Plain Transformers have become the de-facto architecture for processing text, audio, image, and video, offering a unified backbone for multimodal learning. However, state-of-the-art architectures for point cloud semantic segmentation remain dominated by U-Net architectures in which convolutions are interleaved with local or windowed attention. In this work, we show how to effectively leverage vanilla, non-hierarchical ViTs for segmentation of large-scale automotive lidar scenes. We bridge the performance gap thanks to a carefully designed tokenizer, a lightweight decoder segmentation head, and tailored data augmentations. Our approach, VaViT for Vanilla ViT, matches or exceeds the performance of state-of-the-art methods while maintaining the simplicity of the ViT architecture. We provide extensive evaluations on nuScenes, SemanticKITTI, and Waymo Open Dataset to validate the efficiency of our method. Code and models are available at https://github.com/valeoai/VaViT.

2605.31176 2026-06-01 cs.LG cs.DS

Retriever Portfolios: A Principled Approach to Adaptive RAG

Miltiadis Stouras, Vincent Cohen-Addad, Silvio Lattanzi, Ola Svensson

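Affiliations: EPFL; Google Research

AI summary: Proposes a method that automatically selects a small, diverse subset (portfolio) from a large pool of candidate retrievers by optimizing an expected best-of-k objective over the query distribution, enabling adaptive RAG; it outperforms single-retriever and naive multi-retriever baselines on multiple QA benchmarks while reducing latency and token cost.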

Comments Accepted at ICML 2026. Code available at: https://github.com/mstou/retriever-portfolios

Abstract

Retrieval-augmented generation (RAG) systems typically rely on a single retriever and a single set of hyperparameters, despite facing highly heterogeneous queries that range from simple factoid questions to complex multi-hop reasoning. We propose a method that automatically selects a small, diverse subset of retrievers (a portfolio) from a large pool of candidates, to cover different regions of the target query distribution. We formalize this setting via an expected best-of-$k$ objective over the query distribution and show that it admits an efficient portfolio construction algorithm with near-optimal guarantees. Across multiple QA benchmarks, our learned portfolios and router pipeline consistently outperform single-retriever and naive multi-retriever baselines on both retrieval metrics and answer quality. In addition, compared to inference-time hyperparameter tuning approaches, fixed portfolios enable parallel retrieval and LLM calls, achieving comparable (and sometimes better) accuracy with substantially lower latency and token cost.
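
The expected best-of-$k$ objective can be illustrated with a small sketch. The abstract does not say which construction algorithm the authors use, so the greedy selection below is an assumption: it is the standard approach for objectives of this max-over-a-set form, which are monotone submodular when scores are nonnegative and for which greedy selection is known to achieve a (1 - 1/e) approximation. The retriever names and the score table are made up for illustration.

```python
def expected_best_of_k(portfolio, scores):
    """Mean over sampled queries of the best score any portfolio member achieves.

    `scores` maps a retriever name to a list of nonnegative per-query scores.
    """
    if not portfolio:
        return 0.0
    n_queries = len(next(iter(scores.values())))
    return sum(max(scores[r][q] for r in portfolio)
               for q in range(n_queries)) / n_queries

def greedy_portfolio(scores, k):
    """Repeatedly add the retriever that most increases the objective."""
    portfolio = []
    for _ in range(min(k, len(scores))):
        candidates = [r for r in scores if r not in portfolio]
        best = max(candidates,
                   key=lambda r: expected_best_of_k(portfolio + [r], scores))
        portfolio.append(best)
    return portfolio

# Hypothetical per-query retrieval quality for three candidate retrievers.
toy_scores = {
    "bm25":   [1, 1, 0, 0],
    "dense":  [0, 0, 1, 1],
    "hybrid": [1, 1, 1, 0],
}
chosen = greedy_portfolio(toy_scores, k=2)
```

In this toy table the second pick is the retriever that covers the query the first pick misses, not the second-best retriever on average, which is the diversity effect the abstract describes.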

2605.31175 2026-06-01 cs.CL

Towards Efficient LLMs Annealing with Principled Sample Selection

Yuanjian Xu, Jianing Hao, Wanbo Zhang, Zhong Li, Guang Zhang

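Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Fudan University; Microsoft Research Asia

AI summary: Using the spectral geometry of the loss landscape, formulates data selection in the annealing phase as a constrained optimization problem and proposes the DiReCT framework, which imposes directional constraints on gradients based on the Hessian spectrum to select samples efficiently and significantly improve model performance.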

Abstract

The annealing phase is a pivotal convergence stage in LLM pre-training that ultimately determines final model quality. However, effectively selecting training data during this phase remains a key challenge. Current strategies rely on empirical heuristics, such as domain filtering or context extension, which lack a principled grounding in optimization theory. In this work, we characterize the annealing phase through the lens of the loss landscape's spectral geometry. We argue that optimal convergence requires gradient updates to satisfy heterogeneous constraints across different eigen-directions. Building on this insight, we formulate data selection as a problem of satisfying these directional constraints. To this end, we propose DiReCT (Directionally-Restrained Constrained Training), a novel framework that reformulates sample selection in the annealing stage as a constrained optimization problem. By imposing explicit directional constraints on per-sample gradients based on the spectral properties of the Hessian, DiReCT identifies samples that align with the optimal curvature-aware descent path. Extensive experiments across various model scales demonstrate that DiReCT consistently achieves state-of-the-art performance. For future research, code is available at https://github.com/xuyj233/Direct.

2605.31174 2026-06-01 cs.CV cs.LG

Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning

Wenlun Zhang, Jun Yin, Kentaro Yoshioka

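Affiliations: Keio University; Tsinghua University

AI summary: Proposes the DetAS/DetAS-X agentic framework, which uses a multimodal large language model to adaptively combine restoration modules and specialized detectors and enables experience-aware reasoning through self-evolving experience accumulation, improving average F1 by 28.36% across six benchmarks.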

Abstract

Object detection in real-world scenarios remains challenging due to diverse image degradations and heterogeneous object distributions, which significantly hinder the generalization of existing detectors. Conventional approaches, including scene-specific representation learning and end-to-end pipeline design, are inherently limited by their reliance on predefined conditions and lack adaptability to dynamic environments. In this paper, we propose DetAS, an agentic detection framework that formulates object detection as a dynamic decision process. Instead of relying on static pipelines, DetAS leverages a Multimodal Large Language Model (MLLM) as a central agent to adaptively compose detection workflows by selecting from a toolbox of restoration modules and specialized detectors. Specifically, DetAS consists of two key components: Self-Adaptive Image Restoration, which dynamically determines whether and how to enhance images for downstream detection, and Multi-Expertise Detection, which integrates multiple domain-specialized detectors and resolves their predictions through instance-level reasoning. To further improve decision quality under fine-grained conditions, we introduce Self-Evolving Experience Harvesting and extend the framework to DetAS-X, which accumulates node-level decision experience from a small set of annotated data and enables experience-aware reasoning during inference. This mechanism allows the system to progressively refine its decision policy and adapt to diverse real-world scenarios. Extensive experiments on six challenging benchmarks demonstrate that DetAS-X significantly outperforms existing MLLM-based detectors, achieving an average improvement of 28.36% in F1 score, with up to 37.01% gain on DarkFace. These results demonstrate the promise of agentic detection and establish a solid foundation for its application in complex and dynamic environments.

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

Learning Cardiac Latent Representations in Vectorcardiogram Space

Toward Identifiable Sparse Autoencoders

Spectral Reach: Understanding Neural Scaling as Progress into the Spectral Tail

Bifurcated Remaining Useful Life Prediction: A Hybrid Approach for Realistic Uncertainty Characterization

Scaling Multi-Hop Training Data via Graph-Constrained Path Selection

HARP-VLA: Human-Robot Aligned Representation Learning for Vision-Language-Action Model

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

EchoRL: Reinforcement Learning via Rollout Echoing

HiERO-StepG @ Ego4D Step Grounding Challenge: hierarchical activity understanding enables zero-shot step grounding

What changes after deployment? A survey on On-device Learning in TinyML

Multivariate Distributional Reinforcement Learning Using Sliced Divergences

Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models

Fixed-Point Masked Generative Modeling

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

Simulation of collision avoidance behavior in crowd movement by data-driven approach

Probabilistic Precipitation Nowcasting with Rectified Flow Transformers

Learning Whom to Trust: Market-Feedback Adaptive Retrieval for Frozen LLMs in Event-Driven Financial RAG

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

Geometry-based Schrödinger Bridges for Trustworthy Multimodal Fusion

The Regularizing Power of Language-Training Deepfake Detectors

Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10

FlagGAM: Rule-Based Generalized Additive Modeling for Explainable Tabular Prediction

From Local Geometry to Global Pseudo Labeling for Robust Positive Unlabeled Learning under Covariate Shift

How well does Classification Accuracy capture Concept Drift Detection Quality? An overview of Concept Drift Detection evaluation

Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines

Vanilla ViT for Automotive Point Cloud Semantic Segmentation

Retriever Portfolios: A Principled Approach to Adaptive RAG

Towards Efficient LLMs Annealing with Principled Sample Selection

Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning