arXivDaily: daily arXiv digest, updated Monday to Friday
2606.17667 2026-06-17 cs.LG cs.AI new submission

Handling Feature Heterogeneity with Learnable Graph Patches

Yifei Sun, Yang Yang, Xiao Feng, Zijun Wang, Haoyang Zhong, Chunping Wang, Lei Chen

Affiliations Zhejiang University; Huazhong University of Science and Technology; Finvolution Group

AI summary Proposes learnable graph patches, which decompose a graph into semantic units, and uses a patch encoder and a patch aggregator to enable transferable pre-training on cross-domain graph data, improving downstream task performance.

Comments Accepted at KDD 2025

English abstract

In recent years, the rapid development of foundation models and graph pre-training technologies has spurred increasing interest in constructing a universal pre-trained graph model or Graph Foundation Model (GFM). However, a significant challenge is that existing models are unable to address feature heterogeneity in graph data without textual information, which hinders the transferability of graph models across different datasets. To bridge this gap, we propose the concept of learnable graph patches, which we regard as the smallest semantic units of any graph data. We decompose the graph into learnable graph patches by unfolding the node features and constructing corresponding patch structures separately. We then design a framework that mines transferable information from graph data across domains. Specifically, after extracting graph patches, we propose a patch encoder to extract knowledge from each unit and a patch aggregator to learn how the units are combined into a whole. Due to its domain-agnostic nature, the model can be applied to downstream data across different domains. Furthermore, we analyze the connection between our method and existing graph models, as well as the transferability of the node embeddings it generates. Empirically, our method not only achieves the capability to use multi-domain graphs for pre-training, but also shows enhanced performance across various downstream datasets and tasks. Moreover, we observe consistent improvement in downstream performance as the volume of pre-training data increases.

2606.17660 2026-06-17 cs.LG cs.AI new submission

TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins

Yuxiang Luo, Haonan Long, Chen Wang, Qiqi Duan, Xiaotian Lin, Yanwei Xu, Yuyu Luo, Weikai Yang, Nan Tang

Affiliations The Hong Kong University of Science and Technology; Huawei Technologies Ltd.

AI summary Proposes the TuneAhead framework, which uses meta-feature vectors and SHAP attributions to predict performance before fine-tuning. On Qwen2.5-7B-Instruct it reaches an RMSE of 1.47 percentage points, with 95.1% of predictions within ±3 percentage points of the true score.

Comments 9 pages, 6 figures, accepted as ICML 2026 poster: https://icml.cc/virtual/2026/poster/64847

English abstract

Fine-tuning large language models (LLMs) is compute-intensive and error-prone: model performance depends sensitively on data quality and hyperparameter choices, and naïve runs can even degrade model performance. This raises a practical question: can we predict fine-tuning performance before committing to a full training run? We present TuneAhead, a lightweight framework for pre-hoc prediction of fine-tuning performance. TuneAhead encodes each candidate run as a meta-feature vector that combines static dataset descriptors with dynamic probe features from a short standardized probe. A predictor maps these features to performance estimates, while SHAP-based attributions provide interpretable diagnostics that reveal which specific features drive the prediction. Across 1,300+ fine-tuning runs on Qwen2.5-7B-Instruct, TuneAhead consistently outperforms strong baselines such as Early-Stop Extrapolation and ProxyLM. On a held-out test set of 370 runs, TuneAhead achieves an RMSE of 1.47 percentage points and places 95.1% of predictions within ±3 percentage points of the true score. These accurate continuous predictions support practical go/no-go screening policies that can reduce unnecessary full fine-tuning while retaining most promising runs.
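The two headline metrics in this abstract (RMSE in percentage points and the share of predictions within ±3 points) and the go/no-go screening idea can be illustrated with a minimal sketch. The function names, the screening threshold, and the scores below are made up for illustration; they are not the paper's predictor or data.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length score lists."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def fraction_within(predicted, actual, tolerance=3.0):
    """Share of predictions whose absolute error is at most `tolerance` points."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)

def go_no_go(predicted_score, threshold):
    """Hypothetical screening rule: run full fine-tuning only above a threshold."""
    return predicted_score >= threshold

# Made-up predicted and true benchmark scores (percentage points) for four runs.
predicted = [62.0, 48.5, 71.0, 55.0]
actual = [61.0, 52.5, 70.0, 55.0]

print(f"RMSE: {rmse(predicted, actual):.2f} points")
print(f"within ±3 points: {fraction_within(predicted, actual):.0%}")
print("runs to launch:", [p for p in predicted if go_no_go(p, threshold=60.0)])
```

A continuous prediction is what makes the threshold adjustable: the same predicted scores can be screened more or less aggressively without retraining anything.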

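2606.17659 2026-06-17 cs.LG new submission

Physics-Constrained Neural Networks for Improved Short-Term Weather Forecasting: A Case Study over the South Pacific

Egor Bugaev, Fedor Buzaev, Dmitry Efremenko, Denis Derkach, Fedor Ratnikov

Affiliations Faculty of Computer Science, Higher School of Economics

AI summary Proposes three improvements to physics-constrained neural networks (PCNNs): an upgraded numerical solver, a unified autoregressive hybrid block, and integration with two neural backbones. On the WeatherBench South Pacific subset, root mean squared error at 1-12 h lead times falls by 8-22% relative to purely neural models, while physical consistency is preserved.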

Comments Presented at ICLR 2026 Workshop AI and PDE

English abstract

This study introduces enhancements to physics-constrained neural networks (PCNNs) that improve the accuracy and stability of hybrid short-term weather forecasting models. Building on the WeatherGFT architecture, three innovations are proposed. First, an upgraded numerical solver, combining a fifth-order weighted essentially non-oscillatory scheme (WENO-5), a beta-plane approximation, and subgrid-scale viscosity, permits a fourfold increase in the integration time step to 1200 s while reducing the daily mean squared error by up to 26%. Second, a unified autoregressive hybrid block replaces the original chain of 24 specialised modules, eliminating overfitting to specific lead times. Third, the physical core is integrated with two state-of-the-art neural backbones, resulting in PI-PredFormer and PI-IAM4VP. Evaluation on the WeatherBench South Pacific subset from 2000 to 2004 shows that these hybrids reduce root mean squared error at 1-12 h lead times by 8-22% compared to purely neural counterparts, while better preserving physical consistency. These results demonstrate that incremental refinement of hybrid components offers a practical route toward more accurate and efficient short-range weather forecasting.

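2606.17657 2026-06-17 cs.AI new submission

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

Zirui Cheng, Zeyu Shen, Thomas L. Griffiths, Peter Henderson

Affiliations Princeton University

AI summary Proposes Equation-to-Behavior prompting and reinforcement learning to make language models match cognitive models (such as Bayesian updating and motivated reasoning), improving their ability to simulate the diversity of human decisions in persuasion games.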

English abstract

People make decisions differently in strategic interactions. Some update beliefs like a Bayesian; others exhibit biases like motivated reasoning. Although creators of large language models use simulated humans for safety evaluations and training, they often fail to cover this breadth of human behavior. We argue that cognitive science and economics provide a convenient tool for doing so, making use of mathematical models of human decision-making. We propose an approach that we call Equation-to-Behavior Prompting for guiding large language models to match cognitive models, and evaluate this approach on persuasion games based on legal decision-making. We find that large models can approximate equation-based specifications -- Bayesian updating, affine distortion, motivated updating, and Grether's $α$-$β$ model -- using prompting, but small models fail to do so. However, training small models with reinforcement learning to adhere to mathematical rules, Equation-to-Behavior RL, reduces belief error by 26.5% in out-of-distribution parameterizations. We show that these simulations can help create diverse training environments; training small models to consider different kinds of decision-makers improves average belief change by 2.5%--12% over Bayesian-only training, even when persuading GPT-5-mini. Our work could improve human simulations for training and evaluation in increasingly realistic settings, and could also enable novel research into more complicated mathematical models of human decision-making.
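The abstract lists Bayesian updating and Grether's α-β model among the equation-based specifications. As a rough illustration only, the sketch below uses one common textbook form of the Grether model, in which posterior odds equal the likelihood ratio raised to α times the prior odds raised to β. The function name, the assignment of α to the evidence term and β to the prior term, and the numbers are assumptions; the paper's own parameterization may differ.

```python
def grether_posterior(prior, likelihood_ratio, alpha=1.0, beta=1.0):
    """Belief update under an assumed Grether-style alpha-beta rule.

    posterior_odds = likelihood_ratio**alpha * prior_odds**beta
    alpha = beta = 1 recovers standard Bayesian updating.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = (likelihood_ratio ** alpha) * (prior_odds ** beta)
    return posterior_odds / (1.0 + posterior_odds)

# A decision-maker with a 50% prior sees evidence that is 3x more likely
# under the hypothesis than under its alternative.
bayesian = grether_posterior(0.5, 3.0)                 # alpha = beta = 1
conservative = grether_posterior(0.5, 3.0, alpha=0.5)  # underweights evidence
base_rate_neglect = grether_posterior(0.2, 1.0, beta=0.0)  # ignores the prior

print(bayesian, conservative, base_rate_neglect)
```

Varying α and β yields a family of simulated decision-makers from one equation, which is the kind of behavioral breadth the abstract argues simulated humans should cover.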

2606.17650 2026-06-17 cs.CV cs.CL new submission

MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

Hao-Yuan Ma, Li Zhang, Minjie Qiang, Jie Gao

Affiliations School of Computer Science and Technology, Soochow University

AI summary Proposes the MambaCount framework, whose Spatial Sparse State Space Duality block addresses Mamba's limited bidirectional dependency modeling in non-causal vision tasks and the high entropy of spatial token responses, giving linear-complexity open-vocabulary object counting with a test MAE of 12.23 on FSC-147.

English abstract

Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. On the one hand, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. On the other hand, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.

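2606.17649 2026-06-17 cs.LG cs.AI new submission

A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction

Yuxiang Luo, Chen Wang, Nan Tang

Affiliations The Hong Kong University of Science and Technology

AI summary Proposes a risk decomposition framework that splits the risk of pre-hoc fine-tuning performance prediction into an intrinsic limit and a reducible optimization variance, proves a lower bound on the decay rate of the optimization variance, and derives a budget-optimal probing principle and a predictability phase diagram.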

Comments 9 pages, 4 figures, accepted as ICML 2026 poster: https://icml.cc/virtual/2026/poster/66570

English abstract

The high cost of fine-tuning LLMs poses a significant economic barrier; pre-hoc performance prediction offers a critical solution to substantially reduce this expense. However, the theoretical limits of pre-hoc performance prediction remain unexplored. We formulate it as a stochastic estimation problem under information constraints, decomposing prediction risk into two components: an intrinsic limit (static data-model compatibility) and a reducible optimization variance. We prove that optimization variance admits a necessary lower bound on its decay rate, implying fundamental constraints on how quickly uncertainty dissipates, regardless of the predictor used. Based on these dynamics, we derive a budget-optimal probing principle and introduce a predictability phase diagram that organizes tasks into three distinct regimes: Static-Sufficient, Dynamic-Critical, and Noise-Dominant. Extensive experiments on synthetic and real-world benchmarks validate these theoretical regimes and demonstrate the efficiency of our probing strategy.

2606.17648 2026-06-17 cs.AI new submission

From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs

Siyue Chen, Yifu Guo, Yuquan Lu, Zishan Xu, Jiaye Lin, Jianbo Lin, Siyu Zhang, Cheng Yang, Junxin Li, Yujia Li, Yu Huo, Ruixuan Wang

Affiliations South China University of Technology; Sun Yat-sen University; Tsinghua University; Shanghai Jiao Tong University; Nanjing University; The Chinese University of Hong Kong, Shenzhen; Hangzhou Dianzi University; Guangzhou College of Technology and Business

AI summary Proposes a dual diagnostic framework (layer-wise linear probing plus Context-Stripped Decoding) that reveals an internal lifecycle of code reasoning in LLMs: the model first brews the answer and then reaches one of four resolution outcomes (Resolved, Overprocessed, Misresolved, Unresolved). The brewing scaffold is stable, while resolution success varies with capability.

English abstract

Standard accuracy metrics cannot explain why LLMs handle variable tracking but fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which models first brew the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. Understanding this lifecycle matters because similar task accuracies can mask fundamentally different failure modes that surface-level evaluation cannot detect. We introduce a dual diagnostic framework pairing layer-wise linear probing with Context-Stripped Decoding (CSD) and apply it to six code-reasoning task families across 16 models spanning Qwen, Llama, and DeepSeek architectures. All four outcomes carry substantial mass in every task family: overall Resolved is only 41.5%, with multiple tasks below 30%. Controlled sweeps over structure, depth, and operators expose task-specific failure bottlenecks: Function Call Resolved plunges from 61.1% to 2.5% as call depth increases from one to three. Across architectures and scales, the brewing scaffold remains stable, with normalized brewing duration 24-42% across all 16 models, while resolution success varies with capability. This indicates that the scaffold is a stable empirical regularity across the tested decoder-only Transformer families, whereas resolution success covaries with capability, scale, and training. Code: https://github.com/euyis1019/llm-brewing

2606.17645 2026-06-17 cs.AI cs.CL cs.LG new submission

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

Shiqi He, Yue Cui, Feijie Wu, Xinyu Ma, Jiaheng Lu, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

Affiliations University of Michigan; Alibaba Group; Purdue University; McMaster University; University of Pennsylvania

AI summary Proposes the SkillMigrator agent, which learns transferable interaction patterns (TIPs) and matches layout structure rather than element references to reuse skills across sites, reducing the LLM-action count on successful trajectories by 8-10% on WebArena and Mind2Web.

English abstract

Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one call can replace several primitives. However, prior skill libraries are still triggered mainly by instruction similarity or coarse site metadata, which yields low skill reuse on held-out sites and leaves much of the potential step and token reduction on the table. We present SkillMigrator, an agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Each induced skill is stored as a transferable interaction pattern (TIP): the skill paired with a structural sketch of the snapshot at induction time. At test time, SkillMigrator retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard: accessibility-snapshot observations with stable references, and fixed tool calling over primitives plus skill invocations. Compared with the state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8-10% across both WebArena and Mind2Web at matched success rate.

2606.17644 2026-06-17 cs.CV cs.AI new submission

Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets

Nick Jochum, Tobias Alt-Veit, Christian Schön, Alexander Lück, René Schuster, Didier Stricker

Affiliations Insiders Technologies GmbH; DFKI – German Research Center for Artificial Intelligence; RPTU – University Kaiserslautern-Landau

AI summary Proposes BBLP, a pseudo-labelling framework whose object encoder fuses visual, textual, and positional embeddings and uses label propagation to reach 81.6% of fully supervised performance with only 10% labelled data.

Comments 17 pages, 3 figures, to appear in proceedings of ICDAR 2026, Vienna, Austria

English abstract

Datasets in practical document processing scenarios typically grow over time, and their class annotations undergo continuous refinement. This creates significant re-annotation efforts, which are time-consuming and costly. A promising remedy is to re-annotate only a small subset of available documents manually and apply semi-supervised learning techniques that leverage both labelled and unlabelled data. Although there are numerous approaches to tackle this problem for classification, no adaptation exists for the problem of re-classifying object detection instances, e.g. for document layout analysis. To this end, we propose Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for object detection. An object encoder integrates visual, textual, and positional embeddings from object detection samples to produce a joint embedding that can be used for Label Propagation on partially annotated datasets in a plug-and-play fashion. Evaluation results indicate that the proposed approach produces high-quality class annotations of bounding boxes. On the D4LA layout analysis dataset, it achieves an mAP of 54.0%, corresponding to 81.6% of fully supervised performance, while using only 10% labelled data. Our work demonstrates the potential of Label Propagation for object detection and lays the groundwork for reducing manual annotation efforts in real-world document processing applications.
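Label propagation itself can be sketched in a few lines: labelled nodes keep their class, and every unlabelled node repeatedly takes the similarity-weighted average of its neighbours' class distributions. The version below is a generic textbook variant on a hand-written similarity matrix, not the BBLP implementation; in the paper's setting the similarities would presumably come from the joint embeddings, and all names and numbers here are hypothetical.

```python
def propagate_labels(weights, seeds, num_classes, iterations=20):
    """Generic label propagation with clamped seed labels.

    weights: symmetric similarity matrix (list of lists), one row per box.
    seeds:   dict mapping a labelled box index to its class index.
    Returns the most likely class index for every box.
    """
    n = len(weights)
    dist = [[1.0 / num_classes] * num_classes for _ in range(n)]
    for i, c in seeds.items():
        dist[i] = [1.0 if k == c else 0.0 for k in range(num_classes)]

    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(weights[i])
            if i in seeds or total == 0:
                updated.append(dist[i])  # clamp seeds, skip isolated boxes
                continue
            updated.append([
                sum(weights[i][j] * dist[j][k] for j in range(n)) / total
                for k in range(num_classes)
            ])
        dist = updated
    return [max(range(num_classes), key=lambda k: row[k]) for row in dist]

# Four boxes: 0 and 1 are similar, 2 and 3 are similar, 1 and 2 only weakly.
weights = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
seeds = {0: 0, 3: 1}  # only boxes 0 and 3 were re-annotated by hand
print(propagate_labels(weights, seeds, num_classes=2))
```

Only two of the four boxes carry manual labels here, yet each unlabelled box inherits the class of the labelled box it most resembles.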

2606.17642 2026-06-17 cs.AI new submission

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

Pianran Guo, Pengcheng Zhou, Yucheng Jian, Shuhua Chen

Affiliations Beijing University of Posts and Telecommunications; Queen Mary University of London

AI summary Proposes the FinAcumen framework, which adds a selective experience memory mechanism to tool-augmented multimodal reasoning and consistently improves a frozen 8B vision-language model on four financial benchmarks.

English abstract

Financial multimodal reasoning requires agents to coordinate numerical computation, retrieval, visual interpretation, and temporal grounding across heterogeneous evidence sources. Existing tool-augmented agents improve execution fidelity, yet remain largely stateless across episodes, repeatedly rediscovering reasoning strategies and failure patterns. In high-stakes financial settings, this leads to unreliable tool routing, noisy retrieval, and hallucination-prone reasoning. We present FinAcumen, a financial reasoning agent framework centered on selective experience memory for tool-augmented multimodal reasoning. FinAcumen accumulates financially grounded reasoning experience from prior trajectories, distilling successful strategies and failure-derived cautionary rules into a persistent memory bank. During inference, retrieved experiences condition reasoning only when semantic relevance exceeds a calibrated threshold, while irrelevant memory is explicitly suppressed through a fallback mechanism. A deterministic financial tool environment further grounds numerical computation, retrieval, visual decoding, and answer verification. Across four financial multimodal reasoning benchmarks, FinAcumen consistently improves a frozen 8B vision-language model, outperforming finance-specialized models and approaching leading proprietary general-purpose models. Further analysis shows that selective experience activation improves reasoning reliability under retrieval uncertainty. Our code is anonymously available at https://anonymous.4open.science/r/FinAcumen
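The threshold-gated retrieval described in this abstract can be pictured with a small sketch: stored experiences are scored against the query, only those above a relevance threshold are returned, and an empty result acts as the fallback to memory-free reasoning. Cosine similarity, the threshold value, the toy vectors, and every name below are assumptions for illustration, not FinAcumen's actual mechanism.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_experiences(query_vec, memory, threshold=0.8, top_k=2):
    """Return at most top_k stored experiences whose similarity to the query
    reaches the threshold; an empty list means 'reason without memory'."""
    scored = [(cosine(query_vec, vec), text) for vec, text in memory]
    relevant = sorted((s for s in scored if s[0] >= threshold), reverse=True)
    return [text for _, text in relevant[:top_k]]

# Toy memory bank: (embedding, distilled rule). Embeddings are made up.
memory = [
    ([1.0, 0.0], "Check the fiscal-year column before computing growth."),
    ([0.0, 1.0], "Read axis units on charts before comparing bars."),
]

print(select_experiences([1.0, 0.1], memory))  # close to the first rule
print(select_experiences([1.0, 1.0], memory))  # nothing relevant enough
```

Returning an empty list rather than the nearest weak match is the point of the gate: a loosely related experience is suppressed instead of being injected into the prompt.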

2606.17637 2026-06-17 cs.AI new submission

Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification

Yiyue Qian, Shinan Zhang, Huan Song, Negin Sokhandan, Hannah Marlowe, Diego Socolinsky

Affiliations Amazon AWS Generative AI Innovation Center

AI summary Proposes Brick-DICL, a two-stage dynamic in-context learning framework that augments LLM domain knowledge with metadata retrieval and class retrieval and adds a multi-model filtering mechanism to automate Brick classification of building management system points, improving accuracy and reducing manual verification.

English abstract

Building Management Systems (BMS) are essential for optimizing energy efficiency and operational performance in modern buildings. However, the lack of standardization across BMS points from different manufacturers creates significant barriers to integration and data utilization. While the Brick schema offers a standardized ontology for building systems, mapping BMS points to appropriate Brick classes presents three critical challenges: (i) the extensive number of Brick classes (936 in the latest version), (ii) limited domain-specific knowledge in large language models (LLMs), and (iii) substantial manual effort required for verification. To address these challenges, we propose Brick-DICL, a two-stage dynamic in-context learning framework for automated Brick schema classification. Brick-DICL consists of two primary components: metadata-RAG, which retrieves relevant examples to enhance LLMs' domain knowledge, and class-RAG, which narrows down potential Brick classes to address the large classification space. Additionally, we implement a multi-LLM filtering mechanism that compares predictions across multiple models, flagging low-confidence classifications for human review. As a result: (i) General: Brick-DICL is applicable to any building management system regardless of manufacturer or metadata format; (ii) Novel and Powerful: as the first dynamic in-context learning approach for Brick schema classification, Brick-DICL achieves significant classification accuracy improvements on building datasets, outperforming existing methods; (iii) Efficient: our multi-LLM filtering strategy reduces manual verification effort, enabling rapid digital building onboarding. Extensive experiments demonstrate Brick-DICL's effectiveness across diverse building datasets, accelerating the path toward standardized, interoperable building management systems.
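The multi-LLM filtering step (compare predictions across models and flag low-confidence classifications for human review) can be sketched as a simple agreement check. The agreement rule, the function name, and the sample point names and labels are hypothetical; the abstract does not say how confidence is computed.

```python
from collections import Counter

def filter_by_agreement(predictions, min_agreement=1.0):
    """Split points into auto-accepted labels and points flagged for review.

    predictions:   dict mapping a BMS point name to the list of class labels
                   predicted by each model.
    min_agreement: share of models that must agree on the top label.
    """
    accepted, flagged = {}, []
    for point, labels in predictions.items():
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[point] = label
        else:
            flagged.append(point)
    return accepted, flagged

# Illustrative predictions from three models for two made-up point names.
predictions = {
    "AHU1.ZN-T": ["Zone_Air_Temperature_Sensor"] * 3,
    "AHU1.SA-T": [
        "Supply_Air_Temperature_Sensor",
        "Supply_Air_Temperature_Sensor",
        "Discharge_Air_Temperature_Sensor",
    ],
}

accepted, flagged = filter_by_agreement(predictions)
print("accepted:", accepted)
print("needs human review:", flagged)
```

Lowering `min_agreement` trades review effort for risk: at 0.6 the second point would be accepted on a two-of-three majority instead of being sent to a reviewer.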

2606.17634 2026-06-17 cs.CL new submission

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

Dong Huang, Jianbo Sun, Pengkun Yang

Affiliations Department of Statistics and Data Science, Tsinghua University

AI summary Addresses intransitivity in pairwise LLM evaluation with a prompt perturbation framework that generates perturbed prompt variants and filters out structurally inconsistent comparison patterns, improving ranking consistency.

Comments 42 pages, 8 figures

English abstract

Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has become a popular paradigm, in which two responses to the same prompt are compared and the resulting judgments are aggregated into an overall ranking. A central challenge of this paradigm is intransitivity: the induced comparison outcomes may fail to support any coherent global ranking. For example, one may observe cyclic preferences such as $A \succ B \succ C \succ A$, or inconsistencies involving ties such as $A \equiv B\equiv C\neq A$. Such contradictions make the resulting leaderboard unstable and challenging to interpret. In this paper, we propose a prompt perturbation framework for improving the consistency of pairwise LLM evaluation. Our approach generates perturbed variants of each prompt, uses the resulting comparison graphs to identify and filter out structurally inconsistent comparison patterns, and then applies standard ranking methods to the filtered comparisons. A key feature of the proposed framework is that graph-level structural consistency is incorporated explicitly into the evaluation pipeline before ranking aggregation. This provides a simple and principled way to reduce cyclic inconsistencies and improve the reliability of LLM rankings.
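The cyclic preference $A \succ B \succ C \succ A$ mentioned in the abstract is easy to detect once the pairwise judgments are stored as a directed graph. The sketch below finds 3-cycles and drops every comparison inside them before ranking. This crude filter is an assumption chosen for illustration; it is not the paper's perturbation-based procedure, and it ignores ties.

```python
from itertools import permutations

def cyclic_triples(wins):
    """Find all 3-cycles in a set of (winner, loser) judgments."""
    items = sorted({x for pair in wins for x in pair})
    cycles = set()
    for a, b, c in permutations(items, 3):
        if (a, b) in wins and (b, c) in wins and (c, a) in wins:
            cycles.add(frozenset((a, b, c)))
    return cycles

def drop_cyclic(wins):
    """Remove every judgment whose two systems both lie in some 3-cycle."""
    bad = cyclic_triples(wins)
    return {(w, l) for (w, l) in wins if not any(w in t and l in t for t in bad)}

# A beats B, B beats C, C beats A (a cycle), and A also beats D.
wins = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")}

print(cyclic_triples(wins))
print(drop_cyclic(wins))
```

After filtering, the remaining comparisons admit a consistent ordering, so a standard ranking method can be applied to them without the cycle destabilizing the result.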

2606.17630 2026-06-17 cs.RO new submission

FLAP: FOV-Constrained Active Perception Planning for Prior-Map-Free 3D Navigation

Mengke Zhang, Sitong Li, Tiancheng Lai, Ruitian Pang, Mingxuan Zhang, Qingcheng Chen, Fei Gao, Chao Xu, Yanjun Cao

Affiliations The State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University; Huzhou Institute, Zhejiang University; Huzhou Key Laboratory of Autonomous System; Shanghai Institute of Special Equipment Inspection and Technical Research Co., Ltd

AI summary Proposes a planning framework that integrates active perception directly into trajectory optimization. FOV geometry constraints formulated in the sensor coordinate frame and a velocity-triggered mechanism improve efficiency while maintaining safety, and the formulation supports arbitrary 3D maneuvers.

Comments 18 pages, 19 figures

English abstract

Safe and efficient trajectory planning in unknown, cluttered 3D environments constitutes a critical bottleneck for deploying Unmanned Aerial Vehicles (UAVs) in real-world applications. This challenge is further exacerbated by the limited field-of-view (FOV) and sensing range of onboard sensors. Many existing methods either make simplistic assumptions about unexplored space or rely on conservative heuristics such as speed limits or fixed perception patterns, reducing efficiency and generalizing poorly across different sensor types. In this work, we propose a novel planning framework that directly integrates active perception into trajectory optimization, thereby improving safety while preserving efficiency. The perception constraints are derived from the UAV's dynamic model and formulated in the sensor coordinate frame, which enables precise handling of FOV geometry. The velocity-triggered activation mechanism enables the planner to balance perception and motion efficiency. We introduce an active perception sub-trajectory segment with parametric start-time optimization, mitigating collision risks from late obstacle detection. Our formulation enables active perception during arbitrary 3D maneuvers, extending beyond prior methods designed mainly for horizontal motion. All constraints and penalties are incorporated into a differentiable optimization problem, so the planner requires only a simple front-end global path for guidance, rather than a computationally expensive perception-aware path generator. Extensive simulations and real-world experiments demonstrate robust performance across diverse unknown environments with varying sensor configurations.

2606.17628 2026-06-17 cs.CL new submission

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

Guibin Zhang, Xun Xu, Yanwei Yue, Zikun Su, Wangchunshu Zhou, Xiaobin Hu, Shuicheng Yan

Affiliations LV-NUS Lab; FDU; PKU; Bytedance Inc.

AI summary Proposes the OPD-Evolver framework, whose fast and slow loops and on-policy self-distillation teach an agent to select, use, write, and maintain experience. It surpasses existing methods on multiple benchmarks and lets a small model challenge much larger ones.

英文摘要

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.

2606.17627 2026-06-17 cs.CV cs.AI 新提交

Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

分、议、决:一种用于细粒度自我中心动作识别的多智能体框架

Alessandro Sottovia, Alessandro Torcinovich, Oswald Lanz

发表机构 * Faculty of Engineering, Free University of Bozen-Bolzano(博尔扎诺自由大学工程学院)

AI总结 提出一种零样本多智能体框架,通过视频分割、异构VLM专家协商和Borda计数聚合,提升细粒度自我中心动作识别性能。

AI中文摘要

在自我中心视频中进行细粒度动作识别对视觉语言模型(VLM)具有挑战性:动作通常仅在小视觉线索上有所不同,而单个模型往往偏向于这些线索的一个子集。我们提出了“分、议、决”(Divide, Deliberate, Decide),一个完全本地化的零样本多智能体框架,其中(i)一个VLM编排器将视频分块,并为每个片段提出一个top-k候选标签列表,(ii)一个由来自不同开放模型系列的异构VLM专家组成的集成体进行结构化协商,包括一轮同行咨询问题,以及(iii)使用Borda计数聚合智能体排名,并且编排器根据专家的证据重新排名自己的预测。整个流程在本地运行,无需微调。实验表明,我们的方法在零样本动作识别性能上比基线有积极改进,突出了异构协商步骤的影响,表明增益来自去相关的模型先验而非额外的计算。

英文摘要

Fine-grained action recognition in egocentric video is challenging for Vision-Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully local, zero-shot multi-agent framework in which (i) a VLM orchestrator chunks the video and proposes a top-k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer-consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re-ranks its own prediction in light of the specialists' evidence. The entire pipeline runs locally with no fine-tuning. Experiments show that our method improves zero-shot action recognition performance over the baseline and highlight the influence of the heterogeneous deliberation step, indicating that the gain stems from decorrelated model priors rather than from additional compute.
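
The Borda-count aggregation named in step (iii) can be illustrated with a minimal sketch. The abstract does not give the scoring variant, the tie-breaking rule, or the label set, so everything below (the `borda_aggregate` helper, the point scheme of k - 1 - position, and the example labels) is an assumption for illustration rather than the authors' implementation.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Aggregate ranked label lists with a Borda count.

    Each ranking lists labels best-first. A label at position i in a list of
    length k earns k - 1 - i points; labels an agent omits earn 0 points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        k = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += k - 1 - position
    # Highest score first; ties broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical top-3 rankings from three agents for one video segment.
agent_rankings = [
    ["cut onion", "peel onion", "wash onion"],
    ["peel onion", "cut onion", "wash onion"],
    ["cut onion", "wash onion", "peel onion"],
]
consensus = borda_aggregate(agent_rankings)
```

Here "cut onion" collects 5 points, "peel onion" 3, and "wash onion" 1, so the consensus ranking puts "cut onion" first even though one agent ranked it second.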

2606.17619 2026-06-17 cs.CV 新提交

RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

RAVA: 检索增强的视角对齐用于主题驱动图像生成

Qiwei Yan, Zhiqiang Yuan, Chongyang Li, Jiapei Zhang, Ying Deng, Jinchao Zhang, Jie Zhou

发表机构 * WeChat AI, Tencent Inc.(腾讯微信人工智能实验室)

AI总结 提出RAVA框架,通过检索增强提供几何证据,解决跨主体视角对齐中的视角漂移和结构不匹配问题,在保持身份的同时实现可靠视角控制。

AI中文摘要

参考驱动图像生成在身份保持方面取得了快速进展,但跨不同主体的可靠视角控制仍然难以理解。难点不仅在于生成目标主体的新图像:模型必须推断一个主体的隐含视角,并仅使用图像级证据将其转移到另一个主体,无需相机姿态、深度或基于射线的条件。在这种设置下,现有基于多个图像参考的生成器通常依赖虚假的语义相关性,导致视角漂移、部分级结构不匹配以及缺失或不支持的目标特定内容。我们将这一挑战形式化为跨主体视角对齐,并提出RAVA,一个检索增强框架,在生成前提供显式几何证据。RAVA首先学习一个跨实例视角嵌入,检索与锚点视角对齐的目标主体图像,然后应用基于LogDet的子集选择策略,保留一个既视角一致又结构互补的紧凑参考集。最后,选定的参考被微调的多参考图像生成器使用。实验表明,通用语义嵌入在此任务上几乎是随机的,而所提出的检索器显著提高了视角检索质量。在跨主体生成上,RAVA在相同生成骨干下始终优于零样本基线和更强的检索替代方案。这些结果表明,跨主体视角对齐受益于检索增强的几何基础,而非仅依赖端到端生成。

英文摘要

Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than relying on end-to-end generation alone.
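
As a rough illustration of LogDet-based subset selection, the sketch below greedily picks references that maximize the log-determinant of a relevance-weighted similarity kernel, which rewards items that are both relevant and mutually non-redundant. The abstract does not specify RAVA's kernel or optimizer, so the kernel construction, the greedy procedure, and the names `greedy_logdet_select`, `features`, and `relevance` are all assumptions.

```python
import numpy as np


def greedy_logdet_select(features, relevance, k, eps=1e-6):
    """Greedily choose k indices maximizing log det of a weighted kernel.

    features:  (n, d) embeddings of candidate references (assumed input).
    relevance: (n,) non-negative scores, e.g. viewpoint similarity to the anchor.
    Assumes k <= n.
    """
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Kernel entry (i, j) = relevance_i * cosine(i, j) * relevance_j.
    kernel = relevance[:, None] * (feats @ feats.T) * relevance[None, :]
    selected = []
    for _ in range(k):
        best_idx, best_val = None, -np.inf
        for i in range(len(feats)):
            if i in selected:
                continue
            idx = selected + [i]
            # Small ridge keeps the sub-kernel positive definite.
            sub = kernel[np.ix_(idx, idx)] + eps * np.eye(len(idx))
            sign, logdet = np.linalg.slogdet(sub)
            if sign > 0 and logdet > best_val:
                best_idx, best_val = i, logdet
        selected.append(best_idx)
    return selected


# Toy candidates: items 0 and 1 are near-duplicates, item 2 is complementary.
features = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]])
relevance = np.array([1.0, 0.9, 0.8])
chosen = greedy_logdet_select(features, relevance, k=2)
```

In this toy case the second pick skips the near-duplicate item 1, despite its higher relevance, in favor of the structurally different item 2.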

2606.17615 2026-06-17 cs.CV cs.AI 新提交

SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

SkillMoV: 基于原型条件门控的视图混合路由用于统一多视角熟练度估计

Edoardo Bianchi, Antonio Liotta

发表机构 * Free University of Bozen-Bolzano(博尔扎诺自由大学)

AI总结 提出SkillMoV框架,通过混合视图投影器(MoVP)实现多场景多视角视频的熟练度估计,在EgoExo4D数据集上达到50.17%准确率,超越现有方法。

AI中文摘要

从视频中估计人类熟练度是自动化技能评估的关键挑战,应用于体育教练、音乐教学、手术培训和工作场所学习。现有方法通常专注于单一场景或依赖共享的多视角聚合,限制了其适应异构摄像机视角和活动领域的能力。我们提出SkillMoV,一个统一的、参数高效的框架,用于从同步多视角视频中进行多场景熟练度估计。其核心是混合视图投影器(MoVP),将混合专家范式适应于摄像机特定的视角特征。MoVP由四个阶段组成:(i) 一个具有12个专家MLP的混合视图软路由器,无需摄像机身份监督即可学习视角相关的专家偏好;(ii) 跨视角注意力以对齐同步摄像机;(iii) 可学习的原型锚定,以类级参考向量条件化表示;(iv) 一个原型条件门控投影,生成最终技能嵌入。我们在EgoExo4D上评估SkillMoV,涵盖六个技能领域和三种单独训练的视角配置:Ego、Exos和Ego+Exos。SkillMoV在Exos设置中达到50.17%的总体准确率,单个模型在所有场景上联合训练,超过比较方法中报告的最强Exos结果3.57个百分点。在Ego+Exos中,SkillMoV接近该设置的最佳报告结果(47.63%对48.20%)。在选定的Exos配置上的消融实验验证了每个组件:MoV路由比注意力聚合提高+6.61个百分点,跨视角注意力+4.92个百分点,原型锚定+4.07个百分点,随机视角丢弃+3.90个百分点。通过LoRA适配,SkillMoV仅训练其参数的23.32%,并且相对于仅LoRA基线增加了有限的测量开销。

英文摘要

Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.
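
Stage (i), the soft router over twelve expert MLPs, can be sketched generically as follows. This is a plain soft mixture-of-experts layer applied per camera view, not the MoVP module itself: the two-layer ReLU experts, the dimensions, and the names `soft_route`, `router_w`, `w_in`, and `w_out` are assumptions, and the weights are random rather than learned.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_route(view_feats, router_w, w_in, w_out):
    """Softly mix expert outputs for each view.

    view_feats: (V, D) one feature vector per camera view.
    router_w:   (D, E) router projection producing one logit per expert.
    w_in:       (E, D, H) first layer of each expert MLP.
    w_out:      (E, H, D) second layer of each expert MLP.
    """
    gates = softmax(view_feats @ router_w, axis=-1)                        # (V, E)
    hidden = np.maximum(np.einsum("vd,edh->veh", view_feats, w_in), 0.0)   # ReLU
    expert_out = np.einsum("veh,ehd->ved", hidden, w_out)                  # (V, E, D)
    mixed = np.einsum("ve,ved->vd", gates, expert_out)                     # (V, D)
    return mixed, gates


rng = np.random.default_rng(0)
V, D, H, E = 4, 8, 16, 12  # 4 views, 12 experts as in the abstract; D and H are arbitrary
view_feats = rng.normal(size=(V, D))
mixed, gates = soft_route(
    view_feats,
    rng.normal(size=(D, E)),
    rng.normal(size=(E, D, H)),
    rng.normal(size=(E, H, D)),
)
```

Because the gates depend only on each view's features, different views can prefer different experts without any camera-identity label being supplied.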

2606.17609 2026-06-17 cs.CL 新提交

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

基准测试幻觉:剪枝后的大语言模型能通过多项选择但无法回答问题

Rui Wen, Lu Sun, Jiayang Liu, Zesheng Xu, Tianshuo Cong, Zheng Li

发表机构 * Institute of Science Tokyo(东京科学大学) Tohoku University(东北大学) Nanyang Technological University(南洋理工大学) KTH Royal Institute of Technology(瑞典皇家理工学院) Shandong University(山东大学)

AI总结 研究发现,高稀疏度剪枝后的大语言模型在多项选择评估中表现良好,但在开放生成中无法正确回答相同问题,揭示了基准测试的盲点。

AI中文摘要

压缩大型语言模型可以减少内存使用和推理成本,但也可能导致标准基准测试未能捕捉到的失败。一个剪枝后的模型可能在多项选择评估中仍然表现良好,但在开放生成中却无法回答相同的问题。我们探究剪枝改变了什么:是擦除了正确答案,还是使答案更难作为最高输出产生?我们通过多语言问答来研究这个问题,追踪剪枝前后相同的问题。我们发现了一种基准测试幻觉。在高稀疏度剪枝(尤其是Wanda)下,模型在贪婪开放生成中经常失败,而在多项选择评分下仍然能选择正确答案。在这些仅识别错误中,答案通常并未消失,而是被降级:它经常通过束搜索、采样或一个上下文示例重新出现。总体而言,多项选择基准测试可能夸大压缩LLM的可用性,造成评估盲点。压缩模型应该测试它们能产生什么,而不仅仅是能识别什么。

英文摘要

Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruning changes: does it erase the correct answer, or does it make the answer harder to produce as the top output? We study this question with multilingual question answering, tracking the same questions before and after pruning. We find a benchmark illusion. Under high-sparsity pruning, especially Wanda, models often fail in greedy open generation while still selecting the correct answer under multiple-choice scoring. In these recognition-only errors, the answer is usually not gone, but demoted: it often reappears with beam search, sampling, or one in-context example. Overall, multiple-choice benchmarks can overstate the usability of compressed LLMs, creating an evaluation blind spot. Compressed models should be tested on what they can produce, not only on what they can recognize.

2606.17606 2026-06-17 cs.CV 新提交

Flux-Guard: Facial Identity Protection using diffusion models

Flux-Guard:使用扩散模型的面部身份保护

Jie Wang, Tao Wang, Ru Zhang, Jianyi Liu

发表机构 * School of Cyberspace Security, Beijing University of Posts and Telecommunications(北京邮电大学网络空间安全学院) Nanjing University of Aeronautics and Astronautics(南京航空航天大学)

AI总结 提出Flux-Guard框架,通过流轨迹控制和潜在空间对抗优化,在统一生成过程中实现面部编辑与隐私保护,有效提升对跨域人脸识别模型的攻击成功率。

AI中文摘要

人脸识别系统的广泛部署使得社交媒体和公共平台上共享的个人图像面临身份关联和隐私风险。现有的对抗性隐私保护方法可以降低未经授权的人脸识别性能,但与生成式面部编辑不兼容。人工智能驱动的面部编辑工具越来越受欢迎,这显著增加了用户对个性化肖像生成和社交分享的需求。然而,当前的编辑方法通常保留身份特征,使得编辑后的图像仍然容易被恶意人脸识别系统追踪。因此,本文提出了Flux-Guard,一种基于对抗攻击的隐私保护面部编辑框架,它在统一的生成过程中集成了面部编辑和隐私保护。具体地,我们设计了一种流轨迹控制方法,将语义操作与生成过程对齐,并引入了潜在空间对抗优化,采用自适应感知损失驱动的加权策略,动态调整对抗强度以在保持视觉质量的同时最大化攻击效果。大量实验表明,Flux-Guard支持面部编辑,同时在CelebA-HQ和LADN数据集上显著提高了对跨域人脸识别模型的攻击成功率。此外,对商业API的评估结果证实了其在现实世界应用中的有效性。代码发布在 https://github.com/JLMWang/Flux-Guard。

英文摘要

The widespread deployment of face recognition (FR) systems exposes personal images shared on social media and public platforms to identity linkage and privacy risks. Existing adversarial privacy protection methods can degrade unauthorized FR performance but are not compatible with generative face editing. Artificial intelligence-driven face editing tools are gaining popularity, which has significantly increased user demand for personalized portrait generation and social sharing. However, current editing methods often preserve identity features, making the edited images still susceptible to tracking by malicious FR systems. Thus, this paper proposes Flux-Guard, a privacy-preserving face editing framework based on adversarial attacks, which integrates face editing and privacy protection within a unified generative process. Specifically, we design a flow trajectory control method to align semantic manipulations with the generative process and introduce latent-space adversarial optimization with an adaptive perceptual-loss-driven weighting strategy, dynamically adjusting adversarial strength to maximize attack effectiveness while preserving visual quality. Extensive experiments demonstrate that Flux-Guard supports face editing while significantly improving attack success rates against cross-domain face recognition models on the CelebA-HQ and LADN datasets. Furthermore, evaluation results for commercial APIs have confirmed its effectiveness in real-world applications. The code is released at https://github.com/JLMWang/Flux-Guard.

2606.17601 2026-06-17 cs.CV 新提交

Test-Time Training for Robust Text-Guided Open-Vocabulary Object Counting

测试时训练用于鲁棒文本引导的开放词汇目标计数

Hao-Yuan Ma, Yuda Zou, Li Zhang, Yongchao Xu

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院) Wuhan University(武汉大学)

AI总结 提出Dual-TTT框架,通过测试时训练轻量去噪模块,提升文本引导开放词汇目标计数在恶劣条件下的鲁棒性,无需修改原模型架构。

AI中文摘要

文本引导的开放词汇目标计数(TOOC)能够对文本提示指定的任意目标类别进行计数,相比传统的封闭集计数提供了更大的灵活性。然而,现有的TOOC方法主要在理想图像上开发和评估,而现实场景常遭受雨、雾、黑暗和传感器噪声等不利条件,这些条件严重降低视觉质量并损害视觉-语言对齐。为弥补这一差距,我们引入了Robust-TOOC,这是首个在多种损坏条件下评估TOOC的基准,涵盖六种代表性退化类型:雨、雾、黑暗、高斯噪声、椒盐噪声和混合损坏。为提高鲁棒性同时保留原始计数架构,我们提出了Dual-TTT,一种用于TOOC的双架构测试时训练框架。具体来说,在测试时训练期间,Dual-TTT仅更新文本引导轻量去噪模块(TL-Denoiser),同时冻结原始计数网络。受扩散模型启发,TL-Denoiser被优化以从退化条件下的图像表示中去除与损坏相关的噪声。由于仅在测试时训练TL-Denoiser,Dual-TTT无需标注,并且可以无缝集成到现有TOOC模型中而无需修改其原始架构。在多个近期TOOC基线上的大量实验证明了我们方法的有效性。

英文摘要

Text-guided Open-vocabulary Object Counting (TOOC) enables counting arbitrary object categories specified by text prompts, offering substantially greater flexibility than conventional closed-set counting. However, existing TOOC methods are developed and evaluated primarily on ideal images, while real-world scenes often suffer from adverse conditions such as rain, fog, darkness, and sensor noise, which severely degrade visual quality and impair vision-language alignment. To bridge this gap, we introduce Robust-TOOC, the first benchmark for evaluating TOOC under diverse corruption conditions, which covers six representative degradation types: rain, fog, darkness, Gaussian noise, salt-and-pepper noise, and mixed corruption. To improve robustness while preserving the original counting architecture, we propose Dual-TTT, a dual-architecture test-time training framework for TOOC. Specifically, during test-time training, Dual-TTT updates only the Text-guided Lightweight Denoising module (TL-Denoiser), while keeping the original counting network frozen. Inspired by diffusion models, the TL-Denoiser is optimized to remove corruption-aware noise from image representations under degraded conditions. Since only the TL-Denoiser is trained at test time, Dual-TTT is annotation-free and can be seamlessly integrated into existing TOOC models without modifying their original architecture. Extensive experiments on multiple recent TOOC baselines demonstrate the effectiveness of our method.

2606.17598 2026-06-17 cs.RO cs.CV 新提交

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

MuseVLA: 一种用于机器人操作的自适应多模态感知视觉-语言-动作模型

Xingyuming Liu, Ruichun Ma, Heyu Guo, Qixiu Li, Qingwen Yang, Lin Luo, Shiqi Jiang, Chenren Xu, Jiaolong Yang, Baining Guo

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) Microsoft Research Asia(微软亚洲研究院) Princeton University(普林斯顿大学) Tsinghua University(清华大学)

AI总结 提出MuseVLA模型,通过将传感器作为按需工具集成,实现自适应多模态感知;设计传感器图像统一表示,并引入数据合成流水线,在灵巧手操作任务中平均成功率80.6%,显著优于RGB-only和多模态基线。

AI中文摘要

人类自然地利用多种感知模态与物理世界交互,而大多数用于机器人的视觉-语言-动作(VLA)模型仅依赖RGB观测。这限制了它们感知难以或无法从RGB相机推断的物理属性(如温度、声音或雷达响应)的能力。我们提出MuseVLA,一种自适应多模态感知VLA模型,将新型传感器作为按需工具集成到机器人操作中。给定任务指令和视觉上下文,MuseVLA首先生成一个传感器令牌和目标描述,选择要调用的感知模态和关注对象,类似于带参数的工具调用。然后,它将选定的传感器测量值转换为接地传感器图像,这是一种统一的中间表示,编码异构读数以进行多模态融合和动作生成。这种设计将传感器特定处理与VLA主干解耦,实现了多种模态的高效集成。为了减少对昂贵的多传感器机器人数据集的需求,我们进一步引入了一种数据合成流水线,用接地传感器图像增强现有的RGB视频数据集,从而实现对未见过的传感器引导任务的泛化。我们在真实机器人上评估了MuseVLA,涉及需要多模态感知输入的挑战性灵巧手操作任务,包括温度引导的拾取与放置、音频驱动的物体搜索和雷达辅助的隐藏物体检索。MuseVLA平均成功率达到80.6%,显著优于仅RGB和多模态VLA基线,并在未见任务上表现出强大的零样本能力。

英文摘要

Humans naturally leverage diverse sensing modalities to interact with the physical world, while most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-demand tools for robotic manipulation. Given a task instruction and visual context, MuseVLA first generates a sensor token and target description that select the sensing modality to invoke and what to attend to, analogous to a tool call with arguments. It then converts the selected sensor measurement into a grounded sensor image, a unified intermediate representation that encodes heterogeneous readings for multimodal fusion and action generation. This design decouples sensor-specific processing from the VLA backbone, enabling efficient integration of diverse modalities. To reduce the need for expensive multisensory robot datasets, we further introduce a data synthesis pipeline that augments existing RGB video datasets with grounded sensor images, enabling generalization to unseen sensor-guided tasks. We evaluate MuseVLA on a real-world robot across challenging dexterous hand manipulation tasks that require multimodal sensing inputs, including temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden object retrieval. MuseVLA achieves 80.6% success rate on average, outperforming RGB-only and multisensory VLA baselines significantly, and exhibits strong zero-shot capabilities on unseen tasks.

2606.17591 2026-06-17 cs.AI 新提交

Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning

闭环反馈:从经验提取到洞察治理在言语强化学习中的应用

Yanwei Cui, Xing Zhang, Yulong Zhang, Li Shao, Xiaofeng Shi, Guanghui Wang, Peiyang He

发表机构 * AWS Generative AI Innovation Center(AWS生成式人工智能创新中心) Amazon Web Services (AWS)(亚马逊网络服务(AWS)) BingX Group Limited(BingX集团有限公司)

AI总结 针对非平稳环境中LLM智能体的保留-遗忘困境,提出三层架构(规则、证据、技能)通过反馈驱动的策展循环实现洞察治理,在金融预测中验证了该方法能显著提升准确率和风险调整收益。

Comments Accepted to the ICML 2026 RLxF: Reinforcement Learning from World Feedback Workshop, RLxF@ICML 2026, Seoul, South Korea

AI中文摘要

无训练言语强化学习使LLM智能体能够从世界反馈中学习——客观信号如动态任务结果、市场回报或需求预测——通过从经验中提取言语规则并将其注入上下文,无需参数变化即可更新智能体行为。然而,在非平稳环境中,这些智能体面临保留-遗忘困境:保留过时的洞察会导致负迁移,而丢弃它们则会在条件重现时造成灾难性遗忘。我们识别出应对这一困境的四个要求——结果驱动评估、持久结构化证据、非单调知识生命周期和组合治理——并表明现有方法在经验提取上投入过多,而在洞察治理上投入不足。我们提出一个三层架构——规则、证据和技能——通过反馈驱动的策展循环连接,弥补治理差距。规则从世界结果中捕获提炼的经验;证据日志跟踪每条规则在多个回合中的可靠性;技能管理应用哪些规则、如何解决冲突以及何时弃权。以金融预测作为案例研究,其中世界反馈自然丰富、嘈杂且非平稳,我们表明相同的积累经验要么使性能低于零样本基线,要么显著提高准确率和风险调整收益,取决于是否存在策展循环。

英文摘要

Training-free verbal reinforcement learning enables LLM agents to learn from world feedback -- objective signals such as dynamic task outcomes, market returns, or demand forecasts -- by extracting verbal rules from experience and injecting them as context, updating the agent's behavior without parameter changes. However, in non-stationary environments these agents face a retention-forgetting dilemma: retaining stale insights causes negative transfer, while discarding them causes catastrophic forgetting when conditions recur. We identify four requirements for navigating this dilemma -- outcome-driven evaluation, persistent structured evidence, non-monotonic knowledge lifecycle, and compositional governance -- and show that existing methods invest heavily in experience extraction while underinvesting in insight governance. We propose a three-layer architecture -- rules, evidence, and skills -- connected by a feedback-driven curation loop that closes the governance gap. Rules capture distilled experience from world outcomes; evidence logs track each rule's reliability across episodes; skills govern which rules to apply, how to resolve conflicts, and when to abstain. On financial forecasting as a case study, where world feedback is naturally abundant, noisy, and non-stationary, we show that the same accumulated experience either degrades performance below the zero-shot baseline or dramatically improves accuracy and risk-adjusted returns, depending on whether the curation loop is present.

2606.17590 2026-06-17 cs.CV 新提交

TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

TivTok:广播时间不变令牌以实现可扩展视频分词

Weiliang Chen, Yuanhui Huang, Xuebo Wang, Yueqi Duan

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Department of Automation, Tsinghua University(清华大学自动化系) Kuaishou Technology(快手科技)

AI总结 提出TivTok,一种可重用感知的视频分词器,通过时间不变(TIV)和时间变化(TV)令牌分解视频,实现高效压缩和长视频建模,在标准基准上rFVD达12.65,压缩效率提升2.91倍。

AI中文摘要

视频分词是可扩展视频生成的基础,因为令牌数量直接决定计算成本和可建模视频长度。现有分词器主要通过将视频压缩为更少令牌来提高可扩展性,但它们通常跨帧和块重复表示持久内容,如静态背景和一致物体外观。本文提出TivTok(时间不变分词器),一种可重用感知的视频分词器,使持久信息随时间可重用。TivTok用时间不变(TIV)令牌(编码跨帧共享信息)和时间变化(TV)令牌(编码帧特定残差)表示一个片段。为获得这种分解,我们引入范围诱导分解(SIF),为两个令牌组分配不同的注意力范围:TIV令牌关注整个片段,而每个TV令牌仅访问其对应帧及TIV令牌。在解码器中,不变广播(IB)跨帧和块重用相同的TIV令牌,用于并行重建和长视频分词。实验表明,TivTok在标准16×256×256基准上达到12.65的rFVD,与评估基线相比,128帧视频的压缩效率提升2.91倍,同时仅使用下采样分词器所需令牌的1.1%。

英文摘要

Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose TivTok (Time-Invariant Tokenizer), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard 16×256×256 benchmark and improves compression efficiency by 2.91× for 128-frame videos compared with the evaluated baselines, while using only 1.1% of the tokens required by downsample-based tokenizers in our evaluation.
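
The two attention scopes described for SIF can be pictured as boolean masks. The sketch below assumes a hypothetical layout in which the keys are the frame-major patch tokens followed by the TIV tokens; the function name `scope_masks`, the token counts, and the layout are illustrative assumptions, since the abstract does not describe the actual tensor arrangement.

```python
import numpy as np


def scope_masks(num_frames, patches_per_frame, num_tiv, tv_per_frame):
    """Build boolean attention masks (True = query may attend to key).

    Key layout (assumed): [patches of frame 0 | ... | patches of frame F-1 | TIV tokens].
    Returns one mask for TIV queries and one for TV queries (frame-major).
    """
    num_patches = num_frames * patches_per_frame
    num_keys = num_patches + num_tiv

    # TIV tokens see the whole clip.
    tiv_mask = np.ones((num_tiv, num_keys), dtype=bool)

    # TV tokens see the TIV tokens plus the patches of their own frame only.
    tv_mask = np.zeros((num_frames * tv_per_frame, num_keys), dtype=bool)
    tv_mask[:, num_patches:] = True
    for f in range(num_frames):
        rows = slice(f * tv_per_frame, (f + 1) * tv_per_frame)
        cols = slice(f * patches_per_frame, (f + 1) * patches_per_frame)
        tv_mask[rows, cols] = True
    return tiv_mask, tv_mask


tiv_mask, tv_mask = scope_masks(num_frames=3, patches_per_frame=4, num_tiv=2, tv_per_frame=1)
```

With this restriction, anything a TV token encodes must come from its own frame, which is what pushes clip-wide, persistent content toward the TIV tokens.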

2606.17584 2026-06-17 cs.CV cs.LG 新提交

Root-Selecting Fixed-Point Inversion for Rectified Flows via Trajectory Straightness

基于轨迹直线度的整流流根选择不动点反演

Semin Kim, Jihwan Yoon, Seunghoon Hong

发表机构 * KAIST(韩国科学技术院)

AI总结 提出SelFix方法,通过选择使逆轨迹更直的不动点解,在整流流中实现精确反演,提升图像重建和编辑质量。

AI中文摘要

找到生成给定数据样本的初始噪声(称为反演)是下游应用(如无训练图像编辑)的关键组成部分。现有的不动点反演方法通过将每个反演步骤表述为不动点问题来提高反演精度,但它们缺乏一个原则性的机制来选择实践中可能出现的多个不动点解。我们观察到不同的选择会引发不同的反演轨迹,导致重建和编辑质量的显著变化。对于整流流,我们进一步发现这种变化与轨迹直线度密切相关,这促使我们将直线度作为原则性的选择标准。我们提出SelFix,一种不动点反演方法,它选择诱导更直逆轨迹的不动点解,同时在标准局部假设下保持收敛到精确的反演根。在FLUX.1-dev和PIE-Bench上的实验表明,SelFix改进了不动点反演,实现了比先前反演基线更强的真实图像重建和更好的源保持提示编辑。代码可在 https://github.com/seminkim/selfix 获取。

英文摘要

Finding the initial noise that generates a given data sample, known as inversion, is a key component for downstream applications such as training-free image editing. Existing fixed-point inversion methods improve inversion accuracy by formulating each inversion step as a fixed-point problem, but they lack a principled mechanism for selecting among multiple fixed-point solutions that can arise in practice. We observe that different selections induce different inversion trajectories, leading to substantial variation in reconstruction and editing quality. For rectified flows, we further find that this variation is closely associated with trajectory straightness, motivating straightness as a principled selection criterion. We propose SelFix, a fixed-point inversion method that selects fixed-point solutions inducing straighter inverse trajectories while retaining convergence to an exact inverse root under standard local assumptions. Experiments on FLUX.1-dev and PIE-Bench show that SelFix improves fixed-point inversion, achieving stronger real-image reconstruction and better source-preserving prompt-based editing than prior inversion baselines. The code is available at https://github.com/seminkim/selfix.

2606.17577 2026-06-17 cs.AI 新提交

Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

基于基础模型编排工作流的代理辅助行人保护设计

Osamu Ito, Akihiko Katagiri, Yoshikazu Nakagawa, Shin Saeki, Jun Shiraishi, Masato Sasaki

发表机构 * Honda Motor Co., Ltd.(本田汽车有限公司)

AI总结 提出首个基础模型编排的碰撞安全设计工作流,集成代理模型、多目标进化搜索、几何生成器和自然语言接口,将行人保护评估时间从数小时降至秒级。

Journal ref ICLR 2026 Workshop The 2nd Workshop on Foundation Models for Science

AI中文摘要

AI驱动的工程工作流在碰撞安全设计中面临特殊挑战:与空气动力学不同,碰撞事件涉及高度非线性的接触动力学、材料非线性和离散状态转换,难以用数据驱动的代理模型捕捉。据我们所知,我们首次提出了一个基于基础模型编排的碰撞安全设计工作流,实现了代理辅助的行人保护探索,将评估时间从每次CAE模拟数小时缩短至数秒。该工作流集成四个组件:(1) 基于CAE碰撞模拟训练的代理模型,用于从设计参数预测行人腿部伤害指标,平均$R^2=0.87$,并提供无分布假设的共形预测区间;(2) 多目标进化搜索(NSGA-II),在用户指定约束下发现多样化的可行参数集;(3) 基于形变的几何生成器,将参数映射为保持拓扑的3D形状;(4) 自然语言接口,其中LLM编排工作流,视觉-语言模型支持生成设计的语义比较。在一个汽车前保险杠案例研究中,该工作流通过单次探索产生35个不同的安全合规替代方案,而传统CAE迭代需要数周。这些结果表明,基础模型可以作为ML代理和基于物理的模拟之间的集成层,帮助将AI能力引入安全关键的工程领域。

英文摘要

AI-driven engineering workflows face particular challenges in crash safety design: unlike aerodynamics, crash events involve highly nonlinear contact dynamics, material nonlinearity, and discrete state transitions that are difficult to capture with data-driven surrogate models. To the best of our knowledge, we present the first foundation-model-orchestrated workflow for crash safety design that enables surrogate-assisted exploration for pedestrian protection, reducing evaluation time from hours per CAE simulation to seconds. The workflow integrates four components: (1) a surrogate trained on CAE crash simulations to predict pedestrian leg injury metrics from design parameters, achieving an average $R^2=0.87$ and providing distribution-free conformal prediction intervals; (2) multiobjective evolutionary search (NSGA-II) to discover diverse feasible parameter sets under user-specified constraints; (3) a morphing-based geometry generator that maps parameters to topology-preserving 3D shapes; and (4) a natural-language interface in which an LLM orchestrates the workflow and a vision-language model supports semantic comparison of generated designs. In an automotive front-bumper case study, the workflow produces 35 distinct safety-compliant alternatives from a single exploration, a process that would require weeks with conventional CAE iteration. These results suggest that foundation models can serve as integration layers between ML surrogates and physics-based simulation, helping bring AI capabilities to safety-critical engineering domains.
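
To make "distribution-free conformal prediction intervals" in component (1) concrete, here is a minimal split-conformal sketch for a regression surrogate. The abstract does not say which conformal variant the authors use, so split conformal with absolute-residual scores is an assumption, and `split_conformal_interval` and the toy calibration data are hypothetical.

```python
import math

import numpy as np


def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Return (lower, upper) bounds with target coverage 1 - alpha.

    cal_pred / cal_true: surrogate predictions and ground truth on a held-out
    calibration set that was not used to train the surrogate.
    """
    scores = np.sort(np.abs(np.asarray(cal_true) - np.asarray(cal_pred)))
    n = len(scores)
    # Finite-sample corrected rank of the calibration score to use.
    rank = math.ceil((n + 1) * (1 - alpha))
    q = scores[rank - 1] if rank <= n else np.inf
    test_pred = np.asarray(test_pred, dtype=float)
    return test_pred - q, test_pred + q


# Toy calibration set: the surrogate predicts 0 while truths are 1..9.
cal_pred = np.zeros(9)
cal_true = np.arange(1.0, 10.0)
lower, upper = split_conformal_interval(cal_pred, cal_true, [10.0], alpha=0.25)
```

The interval width comes only from held-out residuals, so the coverage guarantee relies on exchangeability of calibration and test points rather than on any assumed error distribution.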

2606.17574 2026-06-17 cs.AI 新提交

DeepInsight: A Unified Evaluation Infrastructure Across the Physical AI Stack

DeepInsight:跨物理AI栈的统一评估基础设施

Siyi Li, Chunyu Sun, Jiahao Zhang, Yuchen Kang, Wuliang Wang, Yu Qiu, Rui Jiang, Haitao Cui, Jie Chen

发表机构 * Xiaopeng(小鹏汽车)

AI总结 提出DeepInsight,一个在单一运行时上支持物理AI栈全谱系评估的基础设施,通过三个抽象(任务、资源、结果)保持异构性,实现跨层回归诊断。

AI中文摘要

评估物理AI栈涉及的操作符跨越三个数量级以上——从单个基础模型解码步骤到全身控制的数千个物理滴答——在模态、奖励语义和资源概况上正交变化。现有框架无法覆盖这一范围,因此当前栈的评估是通过拼接独立的测试工具完成的,这些工具既不共享运行时也不共享评分,保留了每个片段的局部有效性,但失去了诊断跨层回归所需的共享身份。我们提出DeepInsight,一个在单一运行时上服务于这一完整谱系的评估基础设施。它不将各体制同质化,而是通过三个狭窄的抽象——任务、资源和结果——保持其异构性,每个抽象都由每个子系统共享的一个不变量实现:一个情节驱动器、一个由每个昂贵后端(LLM推理和沙盒运行时)实现的资源句柄协议,以及一个写入每个事件的跟踪身份方案。在具身人形机器人栈的所有三层上部署后,这一组不变量主要通过配置即可引入新的基准测试。在成熟的对等编排器存在的地方——在基础模型端——它在其自身分布内复现已发布的参考值和对等框架读数,在单个节点上更快地运行相同的套件,并跨节点近线性扩展。其独特的回报在于诊断:由于每一层都写入一个共享的跟踪,从一个层开始并在另一个层显现的回归在该跟踪上仍然可定位——这是任何片段测试工具联合体无法复现的跨层收益。

英文摘要

Evaluating a Physical AI stack spans operators that differ by more than three orders of magnitude -- from a single foundation-model decoding step to thousands of physics ticks of whole-body control -- varying orthogonally in modality, reward semantics, and resource profile. No existing framework spans this range, so the stack is evaluated today by stitching together separate harnesses that share neither runtime nor scoring, preserving each segment's local validity but losing the shared identity needed to diagnose cross-layer regressions. We present DeepInsight, an evaluation infrastructure that serves this full spectrum on a single runtime. Rather than homogenize the regimes, it preserves their heterogeneity behind three narrow abstractions -- task, resource, and result -- each realized as one invariant shared by every subsystem: one episode driver, one resource-handle protocol implemented by every expensive backend (LLM inference and sandboxed runtimes alike), and one trace identity scheme under which every event is written. Deployed in production across all three layers of an embodied humanoid stack, this single set of invariants onboards new benchmarks largely by configuration. Where mature peer orchestrators exist -- at the foundation-model end -- it reproduces published references and peer-framework readings within their own spread, runs the same suites faster on a single node, and scales near-linearly across nodes. Its distinctive return is diagnostic: because every layer writes into one shared trace, a regression that begins in one layer and surfaces in another stays localizable on that trace -- a cross-layer payoff no federation of per-segment harnesses can reproduce.

2606.17567 2026-06-17 cs.LG 新提交

Reducing Learner Redundancy in Boosting via Residual Orthogonalization

通过残差正交化减少Boosting中的学习器冗余

Ye Su, Jipeng Guo, Yong Liu, Xin Xu, Gangchun Zhang, Jinxin Chen, Di Wu, Longlong Zhao

发表机构 * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(中国科学院深圳先进技术研究院) College of Information Science and Technology, Beijing University of Chemical Technology(北京化工大学信息科学与技术学院) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) School of Computer Science, Central China Normal University(华中师范大学计算机学院) the School of Computing, Engineering and Mathematical Sciences, La Trobe University(拉筹伯大学计算、工程与数学科学学院)

AI总结 针对Boosting中残差拟合导致的学习器冗余问题,提出SCBoost框架,通过谱残差投影和协方差正则加权两种机制减少冗余,理论证明其几何性质,实验表明在精度和F1分数上表现优异。

AI中文摘要

虽然顺序残差拟合是标准Boosting框架的基础,但它通过反复处理相关的误差成分,内在导致了学习器冗余。为了解决这一瓶颈,我们提出从残差拟合转向残差正交化,并引入SCBoost。我们的框架通过两种互补机制处理冗余:谱残差投影(SRP)和协方差正则加权(CRW)。在训练过程中,SRP将每个残差目标投影到历史预测子空间的正交补上,迫使后续学习器仅捕获新的经验创新。在聚合过程中,CRW在验证集上优化集成权重,并加入显式的协方差惩罚以减轻剩余相关性。理论上,我们提供了有限样本的几何刻画,证明SRP产生精确的加性残差能量分解。此外,在各向同性噪声假设下,我们严格建立了该投影改善有效信噪比的条件。在十个基准数据集上的大量实验表明,SCBoost在开箱即用的情况下表现出色,特别是在准确率和F1分数上。这项工作通过几何视角重新诠释了Boosting,表明显式的冗余控制是迈向更高效集成架构的一个有原则且必要的步骤。

英文摘要

While sequential residual fitting is the bedrock of standard boosting frameworks, it inherently breeds learner redundancy by repeatedly revisiting correlated error components. To address this bottleneck, we propose a shift from residual fitting to residual orthogonalization and introduce SCBoost. Our framework tackles redundancy through two complementary mechanisms: Spectral Residual Projection (SRP) and Covariance-Regularized Weighting (CRW). During training, SRP projects each residual target onto the orthogonal complement of the historical prediction subspace, forcing successive learners to capture only novel empirical innovations. During aggregation, CRW optimizes ensemble weights on a validation set with an explicit covariance penalty to mitigate remaining correlations. Theoretically, we provide a finite-sample geometric characterization proving that SRP yields an exact additive residual-energy decomposition. Furthermore, under an isotropic-noise assumption, we rigorously establish the conditions under which this projection improves the effective Signal-to-Noise Ratio. Extensive experiments across ten benchmark datasets demonstrate that SCBoost delivers strong out-of-the-box performance, particularly in accuracy and F1 score. This work reinterprets boosting through a geometric lens, suggesting that explicit redundancy control is a principled and necessary step toward more efficient ensemble architectures.
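
The projection step described for SRP, and the additive energy split it implies, can be checked numerically with a small sketch. This uses an ordinary least-squares projection onto the span of earlier learners' predictions; whether SCBoost computes it this way (the "spectral" in SRP may point to an SVD-based variant) is not stated in the abstract, and `project_residual` and the random data are assumptions.

```python
import numpy as np


def project_residual(residual, past_preds):
    """Split a residual into parts inside and outside span(past_preds).

    residual:   (n,) current residual target.
    past_preds: (n, m) predictions of the m earlier learners on the same samples.
    Returns (novel, explained), where novel is orthogonal to every column.
    """
    coef, *_ = np.linalg.lstsq(past_preds, residual, rcond=None)
    explained = past_preds @ coef   # component already representable by past learners
    novel = residual - explained    # component in the orthogonal complement
    return novel, explained


rng = np.random.default_rng(0)
past_preds = rng.normal(size=(50, 3))
residual = rng.normal(size=50)
novel, explained = project_residual(residual, past_preds)

# Pythagorean split of the residual energy.
total_energy = residual @ residual
split_energy = novel @ novel + explained @ explained
```

Because `novel` and `explained` are orthogonal, their squared norms add up exactly to the squared norm of the original residual, which is the kind of additive decomposition the abstract refers to.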

2606.17564 2026-06-17 cs.CV cs.AI new submission

Geometric Consistency Protocol for Foundation Model Features in Multi-View Satellite Imagery

Qiyan Luo, Jie Yang, Yingdong Pi, Lekang Wen, Mi Wang

Affiliations * Hubei Province Key Research and Development Program; LIESMARS Special Research Funding; National Science Fund for Distinguished Young Scholars

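AI summary To address the misleading nature of conventional unconstrained 2D global matching in satellite multi-view reconstruction, the paper proposes a geometry-faithful evaluation protocol based on the Rational Function Model (RFM). Using an RPC-projected 3D consistency metric and a geometry-constrained dense matching proxy, it reveals the decoupling of semantic agreement and geometric localization and shows that 2D backbones remain competitive under RPC-consistent evaluation.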

Comments The manuscript is accepted as an Oral Presentation at the IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

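Details
AI Chinese abstract (translated)

Standardized evaluation protocols are essential for robust benchmarking in remote sensing, especially as foundation features are increasingly transferred across different sensors and complex imaging geometries. In satellite multi-view reconstruction, conventional evaluations that rely on unconstrained 2D global matching are often misleading. The Rational Function Model (RFM) and its Rational Polynomial Coefficients (RPC) determine a curved, height-dependent epipolar geometry, which makes a flat 2D search space physically inconsistent. We propose a geometry-faithful and reproducible protocol for the RPC framework. Our method combines an RPC-projected 3D consistency metric with a geometry-constrained dense matching proxy, specifically evaluating whether similarity responses stay localized and unique on physically plausible search manifolds. A key finding of our joint reporting strategy is that semantic agreement and geometric localization are decoupled: high cross-view similarity at a projected 3D point does not guarantee reliable matchability in practical inference. Our benchmark shows that incorporating geometric constraints into the problem definition is fundamental for satellite imagery. We also show that state-of-the-art 2D backbones remain notably competitive with specialized 3D-aware models under this RPC-consistent evaluation.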

English abstract

Standardized evaluation protocols are indispensable for robust benchmarking in remote sensing, particularly as foundation features are increasingly transferred across diverse sensors and complex imaging geometries. In satellite multi-view reconstruction, conventional evaluations relying on unconstrained 2D global matching are often misleading. The Rational Function Model (RFM) and its Rational Polynomial Coefficients (RPC) dictate a curved, height-dependent epipolar geometry that renders flat 2D search spaces physically inconsistent. We propose a geometry-faithful and reproducible protocol tailored for the RPC framework. Our approach integrates an RPC-projected 3D consistency metric with a geometry-constrained dense matching proxy, specifically evaluating whether similarity responses remain localized and unique under physically plausible search manifolds. A pivotal finding of our joint reporting strategy is the decoupling of semantic agreement and geometric localization: high cross-view similarity at a projected 3D point does not guarantee reliable matchability in practical inference. Our benchmark demonstrates that incorporating geometric constraints is fundamental to the problem definition in satellite imagery. Furthermore, we show that state-of-the-art 2D backbones remain remarkably competitive against specialized 3D-aware models when subjected to this RPC-consistent evaluation.
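
To make the "curved, height-dependent" geometry concrete, here is a toy Python sketch. It assumes a first-order rational function with separate denominators for row and column, which is far simpler than a real RPC set (ratios of cubic polynomials over normalized coordinates). The model values and function names are hypothetical. The sketch sweeps the height of a ground point and checks that its image trace is not a straight line. A real epipolar curve would additionally back-project the pixel from the first view at each candidate height, which is omitted here.

```python
def rational_project(coeffs, lon, lat, h):
    """Toy rational function: (num . t) / (den . t) with t = (1, lon, lat, h)."""
    t = (1.0, lon, lat, h)
    num = sum(c * x for c, x in zip(coeffs["num"], t))
    den = sum(c * x for c, x in zip(coeffs["den"], t))
    return num / den


def project(model, lon, lat, h):
    """Map a ground point to (row, col) in one image of the toy model."""
    return (rational_project(model["row"], lon, lat, h),
            rational_project(model["col"], lon, lat, h))


def bend(p0, p1, p2):
    """2D cross product of (p1 - p0) and (p2 - p0); zero only if collinear."""
    return ((p1[0] - p0[0]) * (p2[1] - p0[1])
            - (p1[1] - p0[1]) * (p2[0] - p0[0]))


# Hypothetical coefficients, chosen only so the height terms differ per axis.
toy_model = {
    "row": {"num": (0.0, 100.0, 0.0, 8.0), "den": (1.0, 0.0, 0.0, 0.1)},
    "col": {"num": (0.0, 0.0, 100.0, 2.0), "den": (1.0, 0.0, 0.0, 0.3)},
}

# Sweep candidate heights for one ground location (normalized units).
trace = [project(toy_model, 0.5, 0.5, h) for h in (0.0, 1.0, 2.0)]
curvature_check = bend(*trace)
```

A nonzero `curvature_check` means the candidate positions move along a curve as height changes, so a matcher constrained to that curve searches a different set of pixels than one scanning a flat 2D window.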

2606.17561 2026-06-17 cs.CV new submission

RT-Counter: Real-Time Text-Guided Open-Vocabulary Object Counting

Hao-Yuan Ma, Li Zhang, Zhiwei Zhu, Jie Gao

Affiliations * School of Computer Science and Technology, Soochow University

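AI summary Proposes RT-Counter, a real-time text-guided open-vocabulary counting framework that uses a Visual Prototype Textualization module and Weaving Transformer layers to achieve real-time inference while maintaining high accuracy, reaching an MAE of 13.30 on FSC147 at 112.48 FPS.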

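Details
AI Chinese abstract (translated)

Text-guided open-vocabulary object counting (TOOC) aims to count the objects in categories specified by natural language descriptions. Although vision-language pre-trained models have been applied successfully to TOOC, they still face challenges from the fine-grained spatial understanding and real-time inference that counting scenarios require. To address these limitations, this paper proposes a real-time TOOC framework called the Real-Time Counter (RT-Counter), which achieves good counting accuracy together with high computational efficiency. RT-Counter includes a novel Visual Prototype Textualization (VPT) module that projects learned visual features into the text feature space and then generates features containing both abstract information that visual prototypes struggle to capture and detailed prototype information that is hard to describe in text, strengthening the counting ability of object-level vision-language models. RT-Counter also integrates our Weaving Transformer (Weaformer) layers, which keep high descriptive power at very low computational cost. The Weaformer layer uses a novel hybrid attention mechanism that efficiently weaves together local and global visual features. Extensive experiments on three public datasets show that RT-Counter breaks the accuracy-speed trade-off in TOOC. With a competitive MAE of 13.30 on FSC147, RT-Counter runs at 112.48 FPS, which is 7.4x faster and more than 4x more parameter-efficient than the leading existing TOOC methods. Our work aims to balance high accuracy and real-time performance in TOOC. Code is available at: https://github.com/Jason-Mar1/RT-Counter.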

English abstract

Text-guided open-vocabulary object counting (TOOC) aims to count objects belonging to the categories specified by natural language descriptions. Although vision-language pre-trained models have been successfully applied to TOOC tasks, they still struggle with fine-grained spatial understanding and real-time inference requirements in counting scenarios. To address these limitations, this paper proposes a real-time TOOC framework, called the Real-Time Counter (RT-Counter), that achieves not only good counting accuracy but also high computational efficiency. RT-Counter designs a novel Visual Prototype Textualization (VPT) module that can project learned visual features into a text feature space and then generate features containing the abstract information that is hard to capture with visual prototypes and the detailed prototype information that is difficult to describe in text, enhancing the object-level vision-language model's counting capabilities. Additionally, RT-Counter incorporates our Weaving Transformer (Weaformer) layers, maintaining high descriptive power at a fraction of the computational cost. The Weaformer layer adopts a novel hybrid attention mechanism that can efficiently weave together local and global visual features. Extensive experiments on three public datasets show that RT-Counter successfully breaks the accuracy-speed trade-off in TOOC. While achieving a competitive MAE of 13.30 on FSC147, RT-Counter operates at 112.48 FPS, making it 7.4x faster and over 4x more parameter-efficient than the existing leading methods in TOOC. Our work aims at balancing high accuracy and real-time performance in TOOC. Code is available at: https://github.com/Jason-Mar1/RT-Counter.
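
For readers unfamiliar with the reported numbers, the following Python sketch shows how a counting MAE is computed and what the throughput figures imply. The function name and sample counts are made up for illustration and are not FSC147 data. RMSE is included only as a metric commonly paired with MAE, since the abstract reports MAE alone. The baseline estimate assumes that "7.4x faster" is a ratio of frames per second, which the abstract does not state explicitly.

```python
import math


def counting_errors(predicted, ground_truth):
    """Return (MAE, RMSE) over per-image object counts."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [p - g for p, g in zip(predicted, ground_truth)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Made-up counts for three images (illustrative only).
mae, rmse = counting_errors([10, 22, 5], [12, 20, 5])

# Throughput arithmetic from the figures quoted in the abstract.
fps = 112.48
latency_ms = 1000.0 / fps         # time per frame in milliseconds
implied_baseline_fps = fps / 7.4  # assumed: speedup is a ratio of FPS
```

Under that assumption, 112.48 FPS corresponds to roughly 8.9 ms per frame, and the compared method would run at about 15.2 FPS.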

2606.17557 2026-06-17 cs.CV new submission

Universal Image Restoration via Internalized Chain-of-Thought Reasoning

Yu Guo, Zhengru Fang, Shengfeng He, Senkang Hu, Yihang Tao, Phone Lin, Yuguang Fang

Affiliations * Hong Kong JC Lab of Smart City and Department of Computer Science, City University of Hong Kong; School of Computing and Information Systems, Singapore Management University; Computer Science and Information Engineering, National Taiwan University

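AI summary Proposes the CoTIR framework, which internalizes chain-of-thought reasoning in a single pre-trained editing model and uses a differentiable, Lagrangian-inspired formulation to achieve universal image restoration under mixed degradations. It outperforms existing methods on a benchmark of 5.2M samples.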

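Details
AI Chinese abstract (translated)

Image restoration aims to recover high-quality images from degraded inputs, but it is highly ill-posed under complex mixed degradations. Unified all-in-one models are common, yet their performance drops as degradation complexity grows. Recent work adopts Chain-of-Thought (CoT) reasoning and performs multi-round restoration with specialized modules. This approach has two key limitations: (i) multi-step processing increases computational cost, and (ii) interactions between degradations are modeled weakly during stepwise inference. We propose CoTIR, a universal image restoration framework that internalizes CoT reasoning within a single model. Specifically, we treat image restoration as a specialized subtask of image editing, which means that a large-scale pre-trained editing model offers a more favorable starting point for optimization. On this basis, we fine-tune the model for restoration and encode structured CoT-style reasoning into the learning objective through a differentiable formulation inspired by Lagrangian optimization, enabling holistic restoration without chaining specialized restorers. To support training and evaluation, we also present CoTIR-Bench, a large-scale benchmark with 5.2 million samples and CoT-style reasoning traces. Extensive experiments on CoTIR-Bench and a wide range of real composite-degradation scenes show that CoTIR achieves stronger perceptual quality and more competitive fidelity than both all-in-one models and multi-round restoration methods. The source code is available at https://github.com/gy65896/CoTIR.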

English abstract

Image restoration seeks to recover high-quality images from degraded inputs but becomes highly ill-posed under complex, mixed degradations. While unified all-in-one models are common, their performance declines as degradation complexity increases. Recent works adopt Chain-of-Thought (CoT) reasoning for multi-round restoration using specialized modules. However, this approach faces two key limitations: (i) increased computational cost due to multi-step processing, and (ii) weak modeling of interactions between degradations during stepwise inference. We introduce CoTIR, a universal image restoration framework that internalizes CoT reasoning within a single model. Concretely, we view image restoration as a specialized subtask of image editing, which implies that a large-scale pre-trained editing model provides a more favorable optimization starting point. Building on this, we fine-tune the model for restoration and further encode structured CoT-style reasoning into the learning objective via a differentiable formulation inspired by Lagrangian optimization, enabling holistic restoration without chaining specialized restorers. To facilitate training and evaluation, we further present CoTIR-Bench, a large-scale benchmark comprising 5.2 million samples with CoT-style reasoning traces. Extensive experiments on CoTIR-Bench and broad real composite degradation scenes show that CoTIR achieves stronger perceptual quality and more competitive fidelity than both all-in-one models and multi-round restoration methods. The source code is available at https://github.com/gy65896/CoTIR.
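
As background for the phrase "inspired by Lagrangian optimization", the sketch below shows the generic idea on a one-dimensional toy problem. A constraint is folded into the objective with a multiplier. The variable then takes gradient descent steps while the multiplier takes projected ascent steps. The objective, constraint, step sizes, and names are all assumptions chosen for illustration. CoTIR's actual training objective is not given in the abstract and is not reproduced here.

```python
def primal_dual(steps=2000, lr_x=0.05, lr_lam=0.05):
    """Minimize f(x) = (x - 3)^2 subject to g(x) = x - 1 <= 0.

    Uses the Lagrangian L(x, lam) = f(x) + lam * g(x):
    gradient descent on x, projected gradient ascent on lam >= 0.
    """
    x, lam = 0.0, 0.0
    for _ in range(steps):
        grad_x = 2.0 * (x - 3.0) + lam            # dL/dx
        x = x - lr_x * grad_x                     # descent step on the variable
        lam = max(0.0, lam + lr_lam * (x - 1.0))  # ascent step, kept non-negative
    return x, lam


x_opt, lam_opt = primal_dual()
```

In this toy setting the unconstrained minimizer x = 3 violates the constraint, so the iterates settle on the boundary x = 1 with multiplier 4, where the stationarity condition 2(x - 3) + lam = 0 holds. Because every step is a plain gradient update, the same pattern can in principle be attached to a neural training loss, which is the general sense in which such formulations are called differentiable.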