arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.17696 2026-06-17 cs.AI cs.GR New submission

FllumaOne: A Code-Native Multimodal CAD Dataset with Executable Programs and Kernel-Validated Feature Histories

Jizong Zhan

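Affiliations: Qt/C++ OpenCASCADE-based CAD system

AI summary: Introduces the FllumaOne dataset, whose CAD models are generated by executable Python programs; each sample aligns modalities such as the program, feature tree, and geometry, supporting tasks such as editable reverse engineering.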

Comments 24 pages, 4 figures

Abstract

Parametric computer-aided design records both final geometry and the ordered construction history that determines how a part can be edited. Datasets for editable CAD research should therefore expose modeling operations, parameters, and feature dependencies together with validated geometry. We introduce FllumaOne, a code-native multimodal CAD dataset whose models are generated by executable Python programs in Flluma, a Qt/C++ OpenCASCADE-based CAD system. Each sample aligns its program with a structured feature tree, a training-oriented intermediate representation, STEP geometry, a surface point cloud, natural-language descriptions, metadata, and eight canonical visible-edge renderings. The primary release, FllumaOne-100K, contains 100,000 accepted samples across four template-level complexity regimes. Programs are executed and retained only after kernel geometry, solid validity, and export checks; release reports also record modality completeness and split-level duplicate tests. A Qwen2.5-Coder-1.5B LoRA baseline trained on 80,000 samples achieves 99.98% Python syntax validity, 99.97% Flluma build success, and 99.14% STEP-export validity on the held-out 10,000-sample test split. For the 9,909 predictions converted to surface point clouds, the mean normalized Chamfer Distance is 0.002124. The dataset supports conditioned CAD reconstruction, executable program synthesis, feature-tree prediction, B-Rep analysis, retrieval, design completion, and editable reverse engineering.
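As a rough illustration of the geometric metric reported above, the following sketch computes a symmetric Chamfer Distance between two small point clouds in plain Python. The normalization (centering each cloud and scaling it so its farthest point lies at distance 1) and the use of squared distances are assumptions made for illustration; the abstract does not state the paper's exact convention, and the function names are hypothetical.

```python
import math

def normalize(points):
    """Center a 3D point cloud and scale it so the farthest point is at distance 1.

    This normalization is an assumed convention, not necessarily the paper's.
    """
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    centered = [[p[i] - centroid[i] for i in range(3)] for p in points]
    # Fall back to 1.0 when every point coincides with the centroid.
    scale = max(math.sqrt(sum(c * c for c in p)) for p in centered) or 1.0
    return [[c / scale for c in p] for p in centered]

def _mean_nearest_sq(a, b):
    """Mean squared distance from each point in a to its nearest neighbor in b."""
    return sum(
        min(sum((x - y) ** 2 for x, y in zip(p, q)) for q in b) for p in a
    ) / len(a)

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two normalized point clouds (brute force)."""
    a, b = normalize(a), normalize(b)
    return _mean_nearest_sq(a, b) + _mean_nearest_sq(b, a)
```

The brute-force nearest-neighbor search is quadratic in the number of points, so it only suits small clouds; a spatial index would be the usual choice at dataset scale.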

2606.17688 2026-06-17 cs.CL New submission

LLMs Infer Cultural Context but Fail to Apply It When Responding

Yisong Miao, Jian Zhu, Vered Shwartz

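Affiliations: University of British Columbia; Vector Institute; National University of Singapore

AI summary: Studies whether large language models, when generating culturally adapted responses, use local measurement units based on the user's cultural background. Introduces the CAPRI dataset; experiments find that models can infer cultural background but often fail to apply it unless explicitly prompted.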

Comments 9 pages, 7 figures, 2 tables (24 pages, 12 figures, 8 tables including references and appendices)

Abstract

Recent work has shown that LLMs overrepresent dominant cultures, particularly Western ones, while marginalizing others. We investigate whether this affects models' ability to generate culturally adapted responses by evaluating their use of local measurement units based on the user's perceived cultural background. We introduce Cultural and Pragmatic Response Inference (CAPRI), a dataset of conversations with varying levels of cultural cues. Experiments with state-of-the-art LLMs show that models can infer cultural background and recall relevant conventions, but often fail to utilize the information to adapt their answers to the relevant cultural conventions, unless explicitly prompted to perform the tasks sequentially. We further evaluate adaptation to the interpretation of time and quantity expressions, two subjective language grounding dimensions that are affected by culture. We find that models increasingly adapt their answers as cultural cues accumulate, but their priors are not culture-neutral, sometimes aligning with the model's country of origin. Overall, CAPRI provides a resource for future research aimed at narrowing the gap between cultural knowledge and culturally adaptive language generation.

2606.17687 2026-06-17 cs.CL cs.AI New submission

SuCo: Sufficiency-guided Continuous Adaptive Reasoning

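Jiahao Wang, Bingyu Liang, Chenhao Hu, Longhui Zhang, Xuebo Liu, Min Zhang, Jing Li, Xuelong Li

Affiliations: Harbin Institute of Technology (Shenzhen)

AI summary: To address the wasted computation caused by large reasoning models generating overly long chains of thought, the paper introduces the notion of Minimal Sufficient CoT and builds SuCo, a two-stage training framework that optimizes reasoning length through adaptive sufficiency thresholds and reinforcement learning, improving both accuracy and efficiency on mathematics, code, and science benchmarks.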

Comments Accepted to ICML 2026. 18 pages

Abstract

Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.
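The MSC definition above (the shortest prefix of a CoT trajectory that is adequate for producing the correct answer) can be sketched as a simple prefix scan. This is a minimal illustration, not the authors' procedure: the function name and the `is_sufficient` callback are hypothetical, with the callback standing in for whatever sufficiency test is used (for example, a problem-adaptive threshold on how reliably the model answers correctly from that prefix).

```python
from typing import Callable, List, Optional, Sequence

def minimal_sufficient_prefix(
    steps: Sequence[str],
    is_sufficient: Callable[[List[str]], bool],
) -> Optional[List[str]]:
    """Return the shortest prefix of `steps` judged sufficient, or None.

    Prefixes are scanned from shortest to longest. A linear scan is used
    instead of binary search because sufficiency is not assumed to be
    monotone in prefix length.
    """
    for k in range(len(steps) + 1):
        prefix = list(steps[:k])
        if is_sufficient(prefix):
            return prefix
    return None

# Toy usage: the hypothetical check treats a prefix as sufficient once
# it contains a step that derives the value 12.
steps = ["2 + 4 = 6", "6 * 2 = 12", "check: 12 / 2 = 6", "answer: 12"]
msc = minimal_sufficient_prefix(
    steps, lambda prefix: any(s.endswith("= 12") for s in prefix)
)
```

In the toy usage the last two steps are redundant verification, so the scan stops after the second step.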

2606.17683 2026-06-17 cs.CL cs.PL New submission

Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation

Longhui Zhang, Jiahao Wang, Chenhao Hu, Bingyu Liang, Jing Li, Min Zhang

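Affiliations: Harbin Institute of Technology, Shenzhen, China

AI summary: Proposes the SwiftTrans framework, which uses two stages, multi-perspective exploration and difference-aware selection, built on parallel in-context learning and difference comparison, to improve both the correctness and the runtime efficiency of LLM code translation.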

Comments Accepted to ICML 2026

Abstract

While large language models (LLMs) have greatly advanced the functional correctness of automated code translation systems, the runtime efficiency of translated programs has received comparatively little attention. With the waning of Moore's law, runtime efficiency has become increasingly important for program quality, alongside functional correctness. Our preliminary study reveals that LLM-translated programs often run slower than human-written ones, and this issue cannot be remedied through prompt engineering alone. Therefore, our work proposes SwiftTrans, a code translation framework comprising two key stages: (1) Multi-Perspective Exploration, where MpTranslator leverages parallel in-context learning (ICL) to generate diverse translation candidates; and (2) Difference-Aware Selection, where DiffSelector identifies the optimal candidate by explicitly comparing differences between translations. We further introduce Hierarchical Guidance for MpTranslator and Ordinal Guidance for DiffSelector, enabling LLMs to better adapt to these two core components. To support the evaluation of runtime efficiency in translated programs, we extend existing benchmarks, CodeNet and F2SBench, and introduce a new benchmark, SwiftBench. Experimental results across all three benchmarks show that SwiftTrans achieves consistent improvements in both correctness and runtime efficiency.

2606.17682 2026-06-17 cs.CL New submission

From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning

Chao Chen, Chengzu Li, Zhiwei Li, Yinhong Liu, Zhijiang Guo

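Affiliations: LARK, HKUST (GZ); University of Cambridge; HKUST

AI summary: Proposes the LLM-as-Environment-Engineer framework, in which the policy model automatically analyzes failure trajectories and modifies the training environment configuration; with Qwen3-4B it achieves the strongest performance on the MAPF-FrozenLake testbed.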

Abstract

Reinforcement learning pipelines for Large Language Model (LLM) training often rely on manually redesigned environments between stages, requiring practitioners to heuristically infer which configuration will best improve the current policy. To automate this process, we propose the LLM-as-Environment-Engineer framework in which the current policy model analyzes failure trajectories together with contextual information and proposes modifications to the next-stage training environment configuration. We also introduce MAPF-FrozenLake, a controllable testbed whose generator exposes multi-dimensional environment configurations, making it suitable for studying and benchmarking environment redesign. On this testbed, we condition the environment engineer on structured summaries of policy behavior, failure cases, and environment statistics, from which it produces the configuration for the next training stage. With Qwen3-4B as the backbone, our framework achieves the strongest aggregate performance on our benchmarks, outperforming larger proprietary LLMs (e.g., GPT, Gemini) and fixed-environment training baselines. We further analyze which forms of context are most effective, finding that successful environment updates rely on failure evidence and preserve configurations that already work. Interestingly, the current RL checkpoint serves as a better environment engineer than the original base model, suggesting that policy learning improves the model's ability to diagnose its remaining weaknesses.

2606.17680 2026-06-17 cs.LG cs.CL New submission

EnvRL: Learn from Environment Dynamics in Agentic Reinforcement Learning

Zhitong Wang, Songze Li, Hao Peng, Shuzheng Si, Yi Wang, Maosong Sun, Juanzi Li

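Affiliations: Department of Computer Science and Technology, Tsinghua University; Shanghai AI Laboratory

AI summary: Proposes the EnvRL framework, which incorporates environment dynamics learning into agentic reinforcement learning through two auxiliary objectives, state prediction and inverse dynamics, significantly improving success rates on long-horizon tasks.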

Abstract

Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, this overlooks the rich environment dynamics information contained in rollout interaction trajectories. We argue that the interaction experience inherently serves as an implicit supervision signal, reveals the underlying transition mechanisms of the environment, and enables the agent to construct a more accurate internal model of the environment. Therefore, in this work, we investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By jointly optimizing with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rates over RL-only baselines, e.g., when trained with GRPO, lifting Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld, and from 56.8% to 67.0% on WebShop.
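One way to picture the two auxiliary objectives named above is as extra supervised examples carved out of each rollout. The sketch below uses the standard textbook definitions (state prediction maps a state and action to the next state; inverse dynamics maps a pair of consecutive states to the action between them). The function names, the dictionary layout, and the loss weights are assumptions for illustration and are not taken from the paper.

```python
def auxiliary_examples(states, actions):
    """Build auxiliary training examples from one trajectory.

    states:  s_0 .. s_T   (T + 1 items)
    actions: a_0 .. a_{T-1} (T items), where a_t leads from s_t to s_{t+1}
    """
    if len(states) != len(actions) + 1:
        raise ValueError("expected exactly one more state than actions")
    state_prediction = []
    inverse_dynamics = []
    for t, action in enumerate(actions):
        # State prediction: given (s_t, a_t), predict s_{t+1}.
        state_prediction.append(
            {"input": (states[t], action), "target": states[t + 1]}
        )
        # Inverse dynamics: given (s_t, s_{t+1}), recover a_t.
        inverse_dynamics.append(
            {"input": (states[t], states[t + 1]), "target": action}
        )
    return state_prediction, inverse_dynamics

def joint_loss(rl_loss, sp_loss, inv_loss, sp_weight=0.1, inv_weight=0.1):
    """Weighted sum of the RL loss and the two auxiliary losses.

    The default weights are placeholders, not values from the paper.
    """
    return rl_loss + sp_weight * sp_loss + inv_weight * inv_loss
```

Because both auxiliary targets come from the rollout itself, they provide a training signal at every step even when the outcome reward is sparse.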

2606.17678 2026-06-17 cs.CV cs.AI New submission

See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL

Yilian Liu, Sicong Leng, Guoshun Nan, Junyi Zhu, Jiayu Huang, Minghao Sun, Xuancheng Zhu, Yisong Chen, Zexian Wei, Xiaofeng Tao

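Affiliations: Beijing University of Posts and Telecommunications; Nanyang Technological University; China Telecom

AI summary: Proposes Visual Evidence Pre-Alignment (VEPA), which inserts sufficiency-driven GRPO optimization between pretraining and post-training to strengthen multimodal large models' use of fine-grained visual evidence, significantly improving performance on visually demanding tasks.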

Abstract

Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that our VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gains stem from strengthened, transferable visual grounding, rather than from additional task-specific training.

2606.17675 2026-06-17 cs.CV New submission

Do We Really Need Diffusion? A Fast U-Net for Paired Medical Image Translation

Alicia Pirwass, Birte Glimm, Michael Munz, Hans-Joachim Wilke

Affiliations: Institute of Artificial Intelligence, Ulm University; Institute of Orthopaedic Research and Biomechanics, Centre for Trauma Research, University Hospital Ulm; AI for Sensor Data Analytics Research Group, Ulm University of Applied Sciences

AI summary: Compares a lightweight 4-level U-Net with a Denoising Diffusion Probabilistic Model (DDPM) on estimating fat fraction from T2-weighted MRI, and finds that the U-Net outperforms the DDPM in both accuracy and speed.

Abstract

Magnetic resonance imaging-signal fat fraction (MRI-SFF) quantifies tissue fat and serves as an established biomarker for metabolic and musculoskeletal disorders. The acquisition requires, however, specialized MRI sequences, which are not available routinely. We investigate whether SFF can be estimated from widely available T2-weighted (T2w) MRI via image-to-image translation (I2I). We further compare a lightweight 4-level U-Net to a state-of-the-art Denoising Diffusion Probabilistic Model (DDPM) using a dataset of 230 048 paired 2D images (183 517 train, 23 621 val, 22 910 test) from the German National Cohort (NAKO). Both models clearly outperform the identity baseline (Pearson correlation r = 0.769, mean absolute error MAE = 0.070 ± 0.054), which confirms that the models learn a non-trivial cross-modal mapping. Interestingly, the lightweight U-Net outperforms the DDPM in both correlation (r = 0.975 vs. 0.962) and error (MAE = 0.014 ± 0.015 vs. 0.019 ± 0.019), while reducing inference time by a factor of 208 (25.2 ms vs. 5 227.2 ms per image using 50 Denoising Diffusion Implicit Model (DDIM) steps). The strong clinical performance at substantially reduced computational cost enables real-time clinical use.

2606.17669 2026-06-17 cs.SD New submission

DeSRPA: Decoupled Speech Role-Playing Agent via Inference-Time Intervention

Wenqiu Tang, Zhen Wan, Takahiro Komamizu, Ichiro Ide

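Affiliations: Nagoya University; National Institute of Informatics

AI summary: Proposes the DeSRPA framework, which applies inference-time intervention to a frozen backbone and uses a dual-level control vector mechanism to decouple cognitive reasoning from paralinguistic expression, achieving personality and emotional consistency in speech role-playing and surpassing end-to-end fine-tuning methods.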

Comments Accepted to INTERSPEECH 2026

Abstract

While Large Language Models (LLMs) have revolutionized text-based role-playing, creating immersive Speech Role-Playing Agents (SRPAs) requires a seamless bridge between cognitive reasoning and paralinguistic nuances. Current SRPAs primarily rely on end-to-end (E2E) fine-tuning. However, this paradigm suffers from poor generalization to unseen characters due to its reliance on role-specific data, while imposing a "modality alignment tax" that degrades intrinsic LLM reasoning capabilities. We propose DeSRPA, an agentic framework for character role play via inference-time intervention on frozen backbones. DeSRPA employs a dual-level control vector mechanism, Internal Cognitive Steering and External Expressive Rendering, to synchronize "mind" and "voice". Experiments on SpeechRole and OmniCharacter benchmarks demonstrate that DeSRPA significantly outperforms E2E baselines in personality and emotional consistency. It achieves high speech naturalness, narrowing the gap with proprietary models like GPT-4o Audio, while remaining a scalable and training-free paradigm.

2606.17668 2026-06-17 cs.LG cs.AI q-bio.QM New submission

ASTEROID: A Spatiotemporal Information Transformer for Forecasting Multi-Step Time Series of Molecular Dynamics

Kexin Wu, Luonan Chen, Renxiao Wang

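Affiliations: Department of Medicinal Chemistry, School of Pharmaceutical Sciences, Fudan University; School of Mathematical Sciences and School of AI, Shanghai Jiao Tong University

AI summary: Proposes the ASTEROID framework, which reformulates molecular dynamics trajectories as high-dimensional spatiotemporal sequences and integrates the Spatiotemporal Information Transformation equation into a Transformer to directly predict multi-step atomic coordinates, significantly improving prediction accuracy and reducing computational cost on several quantum-mechanics-derived molecular datasets.

Comments 32 pages, 10 figures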

Abstract

Molecular dynamics (MD) simulation is computationally demanding, particularly for large-scale systems requiring long-term analysis. Accurately forecasting the outcomes of an MD simulation is not only an attractive scientific challenge but also has substantial practical value. In this work, we developed a data-driven framework, termed ASTEROID (Advanced Spatiotemporal TransformER fOr Inferring Dynamics), that can directly predict multi-step atomic coordinates, avoiding conventional iterative integration. For this purpose, our ASTEROID reformulates MD trajectories as high-dimensional spatiotemporal sequences and integrates the Spatiotemporal Information (STI) Transformation equation into a Transformer architecture. The core innovation of ASTEROID lies in its ability to model multiscale spatiotemporal dependencies. In particular, for spatial dependencies, a local-global self-attention mechanism captures both short- and long-range interactions. For temporal dependencies, an encoder-decoder structure integrates global context with autoregressive forecasting. ASTEROID was evaluated on several quantum-mechanics-derived molecular datasets. Our results indicate that ASTEROID not only achieved higher accuracy in multi-step prediction than existing methods on various benchmarks, but also significantly reduced the computational cost of conventional MD simulation. Moreover, the model supports iterative multi-step forecasting over an extended time scale. This work establishes a robust and generalizable data-driven paradigm for accelerating MD simulations.

2606.17667 2026-06-17 cs.LG cs.AI New submission

Handling Feature Heterogeneity with Learnable Graph Patches

Yifei Sun, Yang Yang, Xiao Feng, Zijun Wang, Haoyang Zhong, Chunping Wang, Lei Chen

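Affiliations: Zhejiang University; Huazhong University of Science and Technology; Finvolution Group

AI summary: Introduces the concept of learnable graph patches, decomposing a graph into semantic units and using a patch encoder and a patch aggregator to enable transferable pre-training on cross-domain graph data and improve downstream task performance.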

Comments Accepted at KDD 2025

Abstract

In recent years, the rapid development of foundation models and graph pre-training technologies has spurred increasing interest in constructing a universal pre-trained graph model or Graph Foundation Model (GFM). However, a significant challenge is that existing models are unable to address feature heterogeneity in graph data without textual information, which hinders the transferability of graph models across different datasets. To bridge this gap, we propose the concept of learnable graph patches, which we regard as the smallest semantic units of any graph data. We decompose the graph into learnable graph patches by unfolding the node features and constructing corresponding patch structures separately. We then design a framework that mines transferable information from graph data across domains. Specifically, after extracting graph patches, we propose a patch encoder to extract knowledge from each unit and a patch aggregator to learn how the units are combined into a whole. Due to its domain-agnostic nature, the model can be applied to downstream data across different domains. Furthermore, we analyze the connection between our method and existing graph models, as well as the transferability of the node embeddings it generates. Empirically, our method not only achieves the capability to use multi-domain graphs for pre-training, but also shows enhanced performance across various downstream datasets and tasks. Moreover, we observe consistent improvement in downstream performance as the volume of pre-training data increases.

2606.17660 2026-06-17 cs.LG cs.AI New submission

TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins

Yuxiang Luo, Haonan Long, Chen Wang, Qiqi Duan, Xiaotian Lin, Yanwei Xu, Yuyu Luo, Weikai Yang, Nan Tang

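Affiliations: The Hong Kong University of Science and Technology; Huawei Technologies Ltd.

AI summary: Proposes the TuneAhead framework, which uses meta-feature vectors and SHAP attributions to predict performance before fine-tuning; on Qwen2.5-7B-Instruct it achieves an RMSE of 1.47 percentage points, with 95.1% of predictions within ±3 percentage points.

Comments 9 pages, 6 figures, accepted as ICML 2026 poster: https://icml.cc/virtual/2026/poster/64847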

Abstract

Fine-tuning large language models (LLMs) is compute-intensive and error-prone: model performance depends sensitively on data quality and hyperparameter choices, and naïve runs can even degrade model performance. This raises a practical question: can we predict fine-tuning performance before committing to a full training run? We present TUNEAHEAD, a lightweight framework for pre-hoc prediction of fine-tuning performance. TUNEAHEAD encodes each candidate run as a meta-feature vector that combines static dataset descriptors with dynamic probe features from a short standardized probe. A predictor maps these features to performance estimates, while SHAP-based attributions provide interpretable diagnostics that reveal which specific features drive the prediction. Across 1,300+ fine-tuning runs on Qwen2.5-7B-Instruct, TUNEAHEAD consistently outperforms strong baselines such as Early-Stop Extrapolation and ProxyLM. On a held-out test set of 370 runs, TUNEAHEAD achieves an RMSE of 1.47 percentage points and places 95.1% of predictions within ±3 percentage points of the true score. These accurate continuous predictions support practical go/no-go screening policies that can reduce unnecessary full fine-tuning while retaining most promising runs.
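The two evaluation quantities quoted above (RMSE in percentage points and the share of predictions within ±3 points of the true score) and a go/no-go rule can be sketched in a few lines. The function names, the threshold, and the scores in the usage lines are made up for illustration; they are not the paper's code or data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and true scores."""
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

def fraction_within(predicted, actual, tolerance=3.0):
    """Share of predictions whose absolute error is at most `tolerance`."""
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(actual)

def go_no_go(predicted_score, threshold):
    """Launch a full fine-tuning run only if the predicted score clears the bar."""
    return predicted_score >= threshold

# Illustrative numbers only (scores in percentage points).
predicted = [70.0, 64.0, 58.0, 81.0]
actual = [71.0, 60.0, 58.0, 80.0]
error = rmse(predicted, actual)
coverage = fraction_within(predicted, actual)
launch = [go_no_go(p, threshold=65.0) for p in predicted]
```

A screening policy of this kind trades a small risk of discarding a good run for the compute saved by skipping runs predicted to fall below the threshold.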

2606.17659 2026-06-17 cs.LG New submission

Physics-Constrained Neural Networks for Improved Short-Term Weather Forecasting: A Case Study over the South Pacific

Egor Bugaev, Fedor Buzaev, Dmitry Efremenko, Denis Derkach, Fedor Ratnikov

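Affiliations: Faculty of Computer Science, Higher School of Economics

AI summary: Proposes three improvements to physics-constrained neural networks (PCNNs): an upgraded numerical solver, a unified autoregressive hybrid block, and integration with two neural backbones. On the WeatherBench South Pacific subset, root mean squared error at 1-12 h lead times falls by 8-22% relative to purely neural models, while physical consistency is preserved.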

Comments Presented at ICLR 2026 Workshop AI and PDE

Abstract

This study introduces enhancements to physics-constrained neural networks (PCNNs) that improve the accuracy and stability of hybrid short-term weather forecasting models. Building on the WeatherGFT architecture, three innovations are proposed. First, an upgraded numerical solver, combining a fifth-order weighted essentially non-oscillatory scheme (WENO-5), a beta-plane approximation, and subgrid-scale viscosity, permits a fourfold increase in the integration time step to 1200 s while reducing the daily mean squared error by up to 26%. Second, a unified autoregressive hybrid block replaces the original chain of 24 specialised modules, eliminating overfitting to specific lead times. Third, the physical core is integrated with two state-of-the-art neural backbones, resulting in PI-PredFormer and PI-IAM4VP. Evaluation on the WeatherBench South Pacific subset from 2000 to 2004 shows that these hybrids reduce root mean squared error at 1-12 h lead times by 8-22% compared to purely neural counterparts, while better preserving physical consistency. These results demonstrate that incremental refinement of hybrid components offers a practical route toward more accurate and efficient short-range weather forecasting.

2606.17657 2026-06-17 cs.AI New submission

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

Zirui Cheng, Zeyu Shen, Thomas L. Griffiths, Peter Henderson

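Affiliations: Princeton University

AI summary: Proposes Equation-to-Behavior Prompting and a reinforcement learning method that make language models match cognitive models (such as Bayesian updating and motivated reasoning), improving their ability to simulate the diversity of human decision-making in persuasion games.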

Abstract

People make decisions differently in strategic interactions. Some update beliefs like a Bayesian; others exhibit biases like motivated reasoning. Although creators of large language models use simulated humans for safety evaluations and training, they often fail to cover this breadth of human behavior. We argue that cognitive science and economics provide a convenient tool for doing so, making use of mathematical models of human decision-making. We propose an approach that we call Equation-to-Behavior Prompting for guiding large language models to match cognitive models, and evaluate this approach on persuasion games based on legal decision-making. We find that large models can approximate equation-based specifications -- Bayesian updating, affine distortion, motivated updating, and Grether's $\alpha$-$\beta$ model -- using prompting, but small models fail to do so. However, training small models with reinforcement learning to adhere to mathematical rules, Equation-to-Behavior RL, reduces belief error by 26.5% in out-of-distribution parameterizations. We show that these simulations can help create diverse training environments; training small models to consider different kinds of decision-makers improves average belief change by 2.5%--12% over Bayesian-only training, even when persuading GPT-5-mini. Our work could improve human simulations for training and evaluation in increasingly realistic settings, and could also enable novel research into more complicated mathematical models of human decision-making.
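For readers unfamiliar with the equation-based specifications listed above, the sketch below shows a Grether-style belief update in its common odds form, where posterior odds equal the likelihood ratio raised to one exponent times the prior odds raised to another. Assigning alpha to the likelihood ratio and beta to the prior odds is an assumption about notation, and the function name is hypothetical; the paper's exact parameterization may differ. Setting both exponents to 1 recovers Bayes' rule.

```python
def grether_posterior(prior, likelihood_h, likelihood_not_h, alpha=1.0, beta=1.0):
    """Posterior probability of hypothesis H under a Grether-style update.

    prior:            P(H), strictly between 0 and 1
    likelihood_h:     P(evidence | H)
    likelihood_not_h: P(evidence | not H), must be positive
    alpha:            weight on the evidence (assumed notation)
    beta:             weight on the prior (assumed notation)
    """
    prior_odds = prior / (1.0 - prior)
    likelihood_ratio = likelihood_h / likelihood_not_h
    posterior_odds = (likelihood_ratio ** alpha) * (prior_odds ** beta)
    return posterior_odds / (1.0 + posterior_odds)

# alpha = beta = 1 is the Bayesian benchmark.
bayesian = grether_posterior(0.5, 0.8, 0.2)
# alpha < 1 under-weights the evidence (conservatism).
conservative = grether_posterior(0.5, 0.8, 0.2, alpha=0.5)
# beta = 0 ignores the prior entirely (base-rate neglect).
neglect = grether_posterior(0.9, 0.8, 0.2, beta=0.0)
```

A simulated decision-maker can then be specified by a pair of exponents, which is what makes this family convenient for generating a range of non-Bayesian behaviors.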

2606.17650 2026-06-17 cs.CV cs.CL New submission

MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

Hao-Yuan Ma, Li Zhang, Minjie Qiang, Jie Gao

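Affiliations: School of Computer Science and Technology, Soochow University

AI summary: Proposes the MambaCount framework, which uses a Spatial Sparse State Space Duality block to address Mamba's limited bidirectional dependency modeling in non-causal vision tasks and the high entropy of spatial token responses, achieving open-vocabulary object counting with linear complexity and a test MAE of 12.23 on FSC-147.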

详情
AI中文摘要

文本引导的开放词汇目标计数(TOOC)旨在估计由文本提示描述的目标数量,在具有大规模变化的密集场景中尤其具有挑战性。现有的TOOC方法主要依赖Transformer,其相对于图像分辨率的二次复杂度限制了可扩展性。Mamba因其线性复杂度提供了一种有前景的替代方案。然而,先前基于Mamba的方法存在两个主要限制。一方面,Mamba固有的因果公式限制了非因果视觉任务所需的双向空间依赖建模。另一方面,现有的基于Mamba的视觉模型往往忽略了空间token响应中无约束的高熵,这可能削弱局部细节和高频线索。为了解决这些限制,我们提出了MambaCount,一种基于空间稀疏状态空间对偶(S^4D)块的高效框架。具体来说,我们分析并重构了Mamba中隐藏状态的衰减动态,以缓解因果建模引入的依赖约束。此外,我们引入了空间token选择(STS)子块,以减少Mamba中空间token响应的无约束高熵。另外,我们设计了多粒度原型(MGP),以在不同语义级别识别类似目标的区域,改善跨模态对齐和可解释性。在FSC-147上的大量实验表明,MambaCount在无需二次查询的方法中达到了最先进的性能,测试MAE为12.23,同时保持了线性复杂度。

英文摘要

Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. On the one hand, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. On the other hand, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.

2606.17649 2026-06-17 cs.LG cs.AI 新提交

A Risk Decomposition Framework for Pre-Hoc Fine-Tuning Prediction

预微调预测的风险分解框架

Yuxiang Luo, Chen Wang, Nan Tang

发表机构 * The Hong Kong University of Science(香港科技大学)

AI总结 提出风险分解框架,将预微调性能预测风险分解为内在极限与可降优化方差,证明优化方差衰减率存在下界,并导出预算最优探测原则及可预测性相图。

Comments 9 pages, 4 figures, accepted as ICML 2026 Poster: https://icml.cc/virtual/2026/poster/66570

详情
AI中文摘要

微调大型语言模型的高昂成本构成了显著的经济障碍;预微调性能预测提供了一个关键解决方案,以大幅降低这一费用。然而,预微调性能预测的理论极限尚未被探索。我们将其形式化为信息约束下的随机估计问题,将预测风险分解为两个组成部分:内在极限(静态数据-模型兼容性)和可降优化方差。我们证明优化方差在其衰减率上存在一个必要下界,这意味着无论使用何种预测器,不确定性消散的速度都受到基本约束。基于这些动态特性,我们推导出预算最优探测原则,并引入一个可预测性相图,将任务组织成三个不同的区域:静态充分、动态临界和噪声主导。在合成和真实世界基准上的大量实验验证了这些理论区域,并展示了我们探测策略的效率。

英文摘要

The high cost of fine-tuning LLMs poses a significant economic barrier; pre-hoc performance prediction offers a critical solution to substantially reduce this expense. However, the theoretical limits of pre-hoc performance prediction remain unexplored. We formulate it as a stochastic estimation problem under information constraints, decomposing prediction risk into two components: an intrinsic limit (static data-model compatibility) and a reducible optimization variance. We prove that optimization variance admits a necessary lower bound on its decay rate, implying fundamental constraints on how quickly uncertainty dissipates, regardless of the predictor used. Based on these dynamics, we derive a budget-optimal probing principle and introduce a predictability phase diagram that organizes tasks into three distinct regimes: Static-Sufficient, Dynamic-Critical, and Noise-Dominant. Extensive experiments on synthetic and real-world benchmarks validate these theoretical regimes and demonstrate the efficiency of our probing strategy.

2606.17648 2026-06-17 cs.AI 新提交

From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs

从酝酿到解析:追踪LLM中代码推理的内部生命周期

Siyue Chen, Yifu Guo, Yuquan Lu, Zishan Xu, Jiaye Lin, Jianbo Lin, Siyu Zhang, Cheng Yang, Junxin Li, Yujia Li, Yu Huo, Ruixuan Wang

发表机构 * South China University of Technology(华南理工大学) Sun Yat-sen University(中山大学) Tsinghua University(清华大学) Shanghai Jiao Tong University(上海交通大学) Nanjing University(南京大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Hangzhou Dianzi University(杭州电子科技大学) Guangzhou College of Technology and Business(广州工商学院)

AI总结 提出双重诊断框架(逐层线性探针与上下文剥离解码),揭示LLM在代码推理中先酝酿答案后进入四种解析结果(已解析、过度处理、错误解析、未解析)的内部生命周期,发现酝酿支架稳定而解析成功随能力变化。

详情
AI中文摘要

标准准确率指标无法解释为什么LLM能处理变量追踪但在语义等价的循环上失败。我们研究了代码推理的内部生命周期,其中模型首先酝酿答案,使其在变得可自解码之前的许多层就线性可恢复,然后分化为四种解析结果之一:已解析、过度处理、错误解析或未解析。理解这一生命周期很重要,因为相似的任务准确率可能掩盖表面评估无法检测的根本不同的失败模式。我们引入了一个双重诊断框架,将逐层线性探针与上下文剥离解码(CSD)配对,并将其应用于跨越Qwen、Llama和DeepSeek架构的16个模型的六个代码推理任务族。所有四种结果在每个任务族中都占有显著比例:总体已解析仅为41.5%,多个任务低于30%。对结构、深度和算子的受控扫描揭示了特定任务的失败瓶颈:函数调用已解析率随着调用深度从一层增加到三层而从61.1%骤降至2.5%。跨架构和规模,酝酿支架保持稳定,所有16个模型的归一化酝酿持续时间为24-42%,而解析成功随能力变化。这表明该支架是所测试的仅解码器Transformer家族中稳定的经验规律,而解析成功与能力、规模和训练共变。代码:https://github.com/euyis1019/llm-brewing

英文摘要

Standard accuracy metrics cannot explain why LLMs handle variable tracking but fail on semantically equivalent loops. We study an internal lifecycle of code reasoning in which models first brew the answer, making it linearly recoverable many layers before it becomes self-decodable, and then diverge into one of four resolution outcomes: Resolved, Overprocessed, Misresolved, or Unresolved. Understanding this lifecycle matters because similar task accuracies can mask fundamentally different failure modes that surface-level evaluation cannot detect. We introduce a dual diagnostic framework pairing layer-wise linear probing with Context-Stripped Decoding (CSD) and apply it to six code-reasoning task families across 16 models spanning Qwen, Llama, and DeepSeek architectures. All four outcomes carry substantial mass in every task family: overall Resolved is only 41.5%, with multiple tasks below 30%. Controlled sweeps over structure, depth, and operators expose task-specific failure bottlenecks: Function Call Resolved plunges from 61.1% to 2.5% as call depth increases from one to three. Across architectures and scales, the brewing scaffold remains stable, with normalized brewing duration 24-42% across all 16 models, while resolution success varies with capability. This indicates that the scaffold is a stable empirical regularity across the tested decoder-only Transformer families, whereas resolution success covaries with capability, scale, and training. Code: https://github.com/euyis1019/llm-brewing

2606.17645 2026-06-17 cs.AI cs.CL cs.LG 新提交

Beyond Domains: Reusing Web Skills via Transferable Interaction Patterns

超越领域:通过可迁移交互模式重用网络技能

Shiqi He, Yue Cui, Feijie Wu, Xinyu Ma, Jiaheng Lu, Yaliang Li, Bolin Ding, Mosharaf Chowdhury

发表机构 * University of Michigan(密歇根大学) Alibaba Group(阿里巴巴集团) Purdue University(普渡大学) McMaster University(麦克马斯特大学) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出SkillMigrator代理,通过学习可迁移交互模式(TIP)匹配布局结构而非元素引用,实现跨站点技能重用,在WebArena和Mind2Web上成功轨迹的LLM动作数减少8-10%。

详情
AI中文摘要

大型语言模型(LLM)网络代理通常被部署为工具调用者:每轮,模型读取新的页面观察并发出一个结构化工具动作。当每个动作都是低级原语时,视野迅速增长,面向策略的LLM完成次数也随之增加,在Mind2Web和WebArena等基准测试中主导了延迟和成本。因此,最近的系统将重复的交互片段包装为网络技能:从成功轨迹或诱导程序中构建的可调用工具,这样一次调用可以替代多个原语。然而,先前的技能库仍然主要通过指令相似性或粗略的站点元数据触发,这导致在未见站点上技能重用率低,并留下了许多潜在的步骤和令牌减少空间。我们提出了SkillMigrator,一个学习可重用网络技能并通过匹配布局结构而非特定元素引用来跨站点迁移它们的代理。每个诱导技能被存储为可迁移交互模式(TIP):技能与诱导时快照的结构草图配对。在测试时,SkillMigrator通过布局相似性检索TIP,并将其引用锚定到实时页面。其余堆栈是标准的:具有稳定引用的可访问性快照观察,以及基于原语加技能调用的固定工具调用。与最先进的方法相比,SkillMigrator在匹配成功率的情况下,将WebArena和Mind2Web上成功轨迹的平均LLM动作数减少了8-10%。

英文摘要

Large language model (LLM) web agents are usually deployed as tool callers: each turn, the model reads a fresh page observation and emits one structured tool action. When every action is a low-level primitive, horizons grow quickly and so do policy-facing LLM completions, dominating latency and cost on benchmarks such as Mind2Web and WebArena. Recent systems therefore wrap repeated interaction fragments as web skills: callable tools built from successful trajectories or induced programs, so one call can replace several primitives. However, prior skill libraries are still triggered mainly by instruction similarity or coarse site metadata, which yields low skill reuse on held-out sites and leaves much of the potential step and token reduction on the table. We present SkillMigrator, an agent that learns reusable web skills and transfers them across sites by matching layout structure rather than specific element references. Each induced skill is stored as a transferable interaction pattern (TIP): the skill paired with a structural sketch of the snapshot at induction time. At test time, SkillMigrator retrieves TIPs by layout similarity and grounds their references on the live page. The rest of the stack is standard: accessibility-snapshot observations with stable references, and fixed tool calling over primitives plus skill invocations. Compared with the state-of-the-art approaches, SkillMigrator reduces the average LLM-action count on successful trajectories by 8-10% across both WebArena and Mind2Web at matched success rate.

2606.17644 2026-06-17 cs.CV cs.AI 新提交

Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets

边界框标签传播用于文档布局分析数据集的重新标注

Nick Jochum, Tobias Alt-Veit, Christian Schön, Alexander Lück, René Schuster, Didier Stricker

发表机构 * Insiders Technologies GmbH(Insiders Technologies 有限公司) DFKI – German Research Center for Artificial Intelligence(德国人工智能研究中心) RPTU – University Kaiserslautern-Landau(凯泽斯劳滕-兰道大学)

AI总结 提出BBLP伪标签框架,通过对象编码器融合视觉、文本和位置嵌入,利用标签传播实现仅用10%标注数据达到全监督性能的81.6%。

Comments 17 pages, 3 figures, to appear in proceedings of ICDAR 2026, Vienna, Austria

详情
AI中文摘要

实际文档处理场景中的数据集通常随时间增长,其类别标注不断细化,这导致大量耗时且昂贵的重新标注工作。一个有前景的解决方案是仅手动重新标注一小部分可用文档,并应用半监督学习技术利用有标签和无标签数据。尽管针对分类问题已有多种方法,但对于目标检测实例的重新分类(例如文档布局分析)尚无适配方法。为此,我们提出了边界框标签传播(BBLP),一种用于目标检测的伪标签框架。对象编码器整合来自目标检测样本的视觉、文本和位置嵌入,生成联合嵌入,可用于部分标注数据集上的标签传播,即插即用。评估结果表明,所提方法能产生高质量的边界框类别标注。在D4LA布局分析数据集中,仅使用10%标注数据,其mAP达到54.0%,相当于全监督性能的81.6%。我们的工作展示了标签传播在目标检测中的潜力,并为减少实际文档处理应用中的手动标注工作量奠定了基础。

英文摘要

Datasets in practical document processing scenarios typically grow over time, and their class annotations undergo continuous refinement. This creates significant re-annotation efforts, which are time-consuming and costly. A promising remedy is to re-annotate only a small subset of available documents manually and apply semi-supervised learning techniques that leverage both labelled and unlabelled data. Although there are numerous approaches to tackle this problem for classification, there exists no adaptation for the problem of re-classifying object detection instances, e.g. for document layout analysis. To this end, we propose Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for object detection. An object encoder integrates visual, textual, and positional embeddings from object detection samples to come up with a joint embedding that can be used for Label Propagation on partially annotated datasets in a plug-and-play fashion. Evaluation results indicate that the proposed approach produces high-quality class annotations of bounding boxes. In the D4LA layout analysis dataset, it achieves a mAP of 54.0%, corresponding to 81.6% of fully supervised performance, while using only 10% labelled data. Our work demonstrates the potential of Label Propagation for object detection and lays the groundwork for reducing manual annotation efforts in real-world document processing applications.
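To make the label-propagation step concrete, here is a minimal pure-Python sketch of classic graph-based label propagation with clamped labelled nodes. It is not the BBLP implementation: the toy 2-D vectors stand in for the joint embeddings produced by the object encoder, and the Gaussian affinity, `sigma`, and iteration count are assumptions chosen for illustration.

```python
import math


def propagate_labels(embeddings, labels, n_classes, sigma=1.0, n_iter=50):
    """Spread class scores from labelled boxes to unlabelled ones.

    embeddings: list of feature vectors, one per bounding box (hypothetical inputs).
    labels: class index for labelled boxes, None for boxes to be pseudo-labelled.
    """
    n = len(embeddings)

    def affinity(u, v):
        dist_sq = sum((a - b) ** 2 for a, b in zip(u, v))
        return math.exp(-dist_sq / (2.0 * sigma ** 2))

    # Dense affinity matrix with a zero diagonal (no self-loops).
    weights = [[0.0 if i == j else affinity(embeddings[i], embeddings[j])
                for j in range(n)] for i in range(n)]

    def one_hot(c):
        return [1.0 if k == c else 0.0 for k in range(n_classes)]

    uniform = [1.0 / n_classes] * n_classes
    scores = [one_hot(lab) if lab is not None else list(uniform) for lab in labels]

    for _ in range(n_iter):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(scores[i])  # clamp: known labels never change
                continue
            total = sum(weights[i]) or 1.0
            updated.append([
                sum(weights[i][j] * scores[j][c] for j in range(n)) / total
                for c in range(n_classes)
            ])
        scores = updated
    return [max(range(n_classes), key=lambda c: row[c]) for row in scores]


# Two labelled boxes (classes 0 and 1) and two unlabelled boxes near each of them.
boxes = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(propagate_labels(boxes, [0, None, 1, None], n_classes=2))  # [0, 0, 1, 1]
```

A dense affinity matrix is quadratic in the number of boxes, so a k-nearest-neighbour graph would be the more realistic choice at dataset scale.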

2606.17642 2026-06-17 cs.AI 新提交

FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness

FinAcumen: 通过自演化经验记忆实现的金融多模态推理

Pianran Guo, Pengcheng Zhou, Yucheng Jian, Shuhua Chen

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Queen Mary University of London(伦敦玛丽女王大学)

AI总结 提出FinAcumen框架,通过选择性经验记忆机制增强工具增强型多模态推理,在四个金融基准上持续提升冻结的8B视觉语言模型性能。

详情
AI中文摘要

金融多模态推理要求智能体协调跨异构证据源的数值计算、检索、视觉解释和时间定位。现有的工具增强型智能体提高了执行保真度,但在跨回合中仍然大多无状态,反复发现推理策略和失败模式。在高风险金融环境中,这导致不可靠的工具路由、噪声检索和易产生幻觉的推理。我们提出FinAcumen,一个以选择性经验记忆为中心的金融推理智能体框架,用于工具增强的多模态推理。FinAcumen从先前的轨迹中积累基于金融的推理经验,将成功策略和失败衍生的警示规则提炼到持久记忆库中。在推理过程中,只有当语义相关性超过校准阈值时,检索到的经验才会调节推理,而通过回退机制明确抑制不相关的记忆。一个确定性的金融工具环境进一步为数值计算、检索、视觉解码和答案验证提供依据。在四个金融多模态推理基准上,FinAcumen持续改进冻结的8B视觉语言模型,优于金融专用模型,并接近领先的通用专有模型。进一步分析表明,选择性经验激活在检索不确定性下提高了推理可靠性。我们的代码匿名发布于:https://anonymous.4open.science/r/FinAcumen

英文摘要

Financial multimodal reasoning requires agents to coordinate numerical computation, retrieval, visual interpretation, and temporal grounding across heterogeneous evidence sources. Existing tool-augmented agents improve execution fidelity, yet remain largely stateless across episodes, repeatedly rediscovering reasoning strategies and failure patterns. In high-stakes financial settings, this leads to unreliable tool routing, noisy retrieval, and hallucination-prone reasoning. We present FinAcumen, a financial reasoning agent framework centered on selective experience memory for tool-augmented multimodal reasoning. FinAcumen accumulates financially grounded reasoning experience from prior trajectories, distilling successful strategies and failure-derived cautionary rules into a persistent memory bank. During inference, retrieved experiences condition reasoning only when semantic relevance exceeds a calibrated threshold, while irrelevant memory is explicitly suppressed through a fallback mechanism. A deterministic financial tool environment further grounds numerical computation, retrieval, visual decoding, and answer verification. Across four financial multimodal reasoning benchmarks, FinAcumen consistently improves a frozen 8B vision-language model, outperforms finance-specialized models, and approaches leading proprietary general-purpose models. Further analysis shows that selective experience activation improves reasoning reliability under retrieval uncertainty. Our code is anonymously available at https://anonymous.4open.science/r/FinAcumen
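The threshold-gated retrieval described above can be sketched as follows. This is an illustrative assumption, not FinAcumen's code: the function names, the use of cosine similarity as the relevance score, and the values of `threshold` and `top_k` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_experiences(query, memory, threshold=0.8, top_k=2):
    """Return at most top_k stored experiences whose relevance clears the threshold.

    memory: list of (embedding, experience_text) pairs.
    An empty result is the fallback case: the agent reasons without memory.
    """
    scored = [(cosine(query, emb), text) for emb, text in memory]
    relevant = sorted((item for item in scored if item[0] >= threshold), reverse=True)
    return [text for _, text in relevant[:top_k]]


memory_bank = [
    ([1.0, 0.0], "Convert all amounts to the same currency unit before comparing."),
    ([0.0, 1.0], "Read the chart's axis scale before extracting values."),
]
print(select_experiences([1.0, 0.1], memory_bank))  # only the first rule is relevant
print(select_experiences([1.0, 1.0], memory_bank))  # [] -> fallback, no memory injected
```

Gating on an absolute threshold rather than always taking the top-k hits is what lets an irrelevant memory bank degrade to plain reasoning instead of injecting noise.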

2606.17639 2026-06-17 cs.RO cs.CV 新提交

ERQA-Plus: A Diagnostic Benchmark for Reasoning in Embodied AI

ERQA-Plus:具身AI推理的诊断基准

Hong Yang, Basura Fernando

发表机构 * Centre for Frontier AI Research, Agency for Science, Technology and Research(新加坡科技研究局前沿人工智能研究中心) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 提出ERQA-Plus基准,包含1766个基于机器人中心图像的问答实例,覆盖感知、动作、社交、导航和常识推理,用于诊断具身AI的推理能力。

Comments under review at NeurIPS

详情
AI中文摘要

通用具身智能体需要的不仅仅是物体识别:它们必须从情境视觉观察中推理空间关系、动作、程序、人类意图、环境约束和常识后果。然而,现有的视觉和具身问答基准通常对测试的推理依赖关系控制有限,使得难以将基于具身的推理与基于捷径的视觉或语言模式匹配区分开来。我们提出了ERQA-Plus,一个用于具身AI推理的诊断基准。ERQA-Plus包含1766个问答实例,这些实例基于711张以机器人为中心的图像,并根据一个结构化的分类法组织,涵盖感知、动作中心、社交交互、导航环境和上下文常识推理。该数据集使用多阶段生成和验证流程构建,结合了分类法引导的问题生成、自动质量判断、迭代修订和人工评估,以改进视觉基础、答案有效性和推理质量。我们对代表性的通用视觉语言模型和具身模型进行了基准测试,包括LLaVA-NeXT-8B、Prismatic-7B、MiniCPM-V-4.5-8B、Qwen3-VL、RoboRefer-8B和RoboBrain2.5-8B。尽管最强的模型Qwen3-VL-32B达到了83.4%的整体准确率和61.4的SBERT分数,但类别级别的结果揭示了空间推理、程序推理、事件预测和意图推理方面的持续弱点。因此,ERQA-Plus提供了一个细粒度的评估框架,不仅衡量具身智能体是否回答正确,还衡量它们能够以及不能可靠地执行哪些形式的具身推理。数据集:https://huggingface.co/datasets/huggingdas/erqa-plus ,项目页面:https://github.com/LUNAProject22/erqa-plus

英文摘要

Generalist embodied agents require more than object recognition: they must reason about spatial relations, actions, procedures, human intentions, environmental constraints, and commonsense consequences from situated visual observations. Yet existing visual and embodied question answering benchmarks often provide limited control over the reasoning dependencies being tested, making it difficult to distinguish grounded embodied reasoning from shortcut-driven visual or linguistic pattern matching. We present ERQA-Plus, a diagnostic benchmark for reasoning in embodied AI. ERQA-Plus contains 1,766 question-answer instances grounded in 711 robot-centric images and organized according to a structured taxonomy spanning perceptual, action-centric, social-interaction, navigation-environmental, and contextual commonsense reasoning. The dataset is constructed using a multi-stage generation and validation pipeline that combines taxonomy-guided question generation, automatic quality judging, iterative revision, and human assessment to improve visual grounding, answer validity, and reasoning quality. We benchmark representative general-purpose vision-language models and embodied models, including LLaVA-NeXT-8B, Prismatic-7B, MiniCPM-V-4.5-8B, Qwen3-VL, RoboRefer-8B, and RoboBrain2.5-8B. Although the strongest model, Qwen3-VL-32B, achieves 83.4% overall accuracy and an SBERT score of 61.4, category-level results reveal persistent weaknesses in spatial reasoning, procedural reasoning, event prediction, and intention inference. ERQA-Plus therefore provides a fine-grained evaluation framework for measuring not only whether embodied agents answer correctly, but also which forms of embodied reasoning they can and cannot perform reliably. The dataset is available at https://huggingface.co/datasets/huggingdas/erqa-plus and the project page is at https://github.com/LUNAProject22/erqa-plus.

2606.17637 2026-06-17 cs.AI 新提交

Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification

Brick-DICL:用于自动化Brick模式分类的动态上下文学习

Yiyue Qian, Shinan Zhang, Huan Song, Negin Sokhandan, Hannah Marlowe, Diego Socolinsky

发表机构 * Amazon AWS Generative AI Innovation Center(亚马逊AWS生成式AI创新中心)

AI总结 提出Brick-DICL两阶段动态上下文学习框架,通过元数据检索和类别检索增强大语言模型领域知识,结合多模型过滤机制,实现楼宇管理系统点位的自动化Brick分类,显著提升准确率并减少人工验证。

详情
AI中文摘要

楼宇管理系统(BMS)对于优化现代建筑的能效和运营性能至关重要。然而,不同制造商的BMS点缺乏标准化,给集成和数据利用带来了重大障碍。尽管Brick模式为楼宇系统提供了标准化的本体,但将BMS点映射到合适的Brick类面临三个关键挑战:(i)Brick类数量庞大(最新版本有936个),(ii)大语言模型(LLM)的领域知识有限,(iii)验证需要大量人工。为解决这些挑战,我们提出了Brick-DICL,一种用于自动化Brick模式分类的两阶段动态上下文学习框架。Brick-DICL包含两个主要组件:metadata-RAG,检索相关示例以增强LLM的领域知识;以及class-RAG,缩小潜在Brick类范围以应对大的分类空间。此外,我们实现了一种多LLM过滤机制,比较多个模型的预测,标记低置信度分类以供人工审查。结果:(i)通用性:Brick-DICL适用于任何楼宇管理系统,无论制造商或元数据格式如何;(ii)新颖且强大:作为首个用于Brick模式分类的动态上下文学习方法,Brick-DICL在建筑数据集上取得了显著的分类准确率提升,优于现有方法;(iii)高效:我们的多LLM过滤策略减少了人工验证工作,实现了快速数字化建筑接入。大量实验证明了Brick-DICL在不同建筑数据集上的有效性,加速了向标准化、可互操作的楼宇管理系统的进程。

英文摘要

Building Management Systems (BMS) are essential for optimizing energy efficiency and operational performance in modern buildings. However, the lack of standardization across BMS points from different manufacturers creates significant barriers to integration and data utilization. While the Brick schema offers a standardized ontology for building systems, mapping BMS points to appropriate Brick classes presents three critical challenges: (i) the extensive number of Brick classes (936 in the latest version), (ii) limited domain-specific knowledge in large language models (LLMs), and (iii) substantial manual effort required for verification. To address these challenges, we propose Brick-DICL, a two-stage dynamic in-context learning framework for automated Brick schema classification. Brick-DICL consists of two primary components: metadata-RAG, which retrieves relevant examples to enhance LLMs' domain knowledge, and class-RAG, which narrows down potential Brick classes to address the large classification space. Additionally, we implement a multi-LLM filtering mechanism that compares predictions across multiple models, flagging low-confidence classifications for human review. As a result: (i) General: Brick-DICL is applicable to any building management system regardless of manufacturer or metadata format; (ii) Novel and Powerful: as the first dynamic in-context learning approach for Brick schema classification, Brick-DICL achieves significant classification accuracy improvements on building datasets, outperforming existing methods; (iii) Efficient: our multi-LLM filtering strategy reduces manual verification effort, enabling rapid digital building onboarding. Extensive experiments demonstrate Brick-DICL's effectiveness across diverse building datasets, accelerating the path toward standardized, interoperable building management systems.
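One simple way to realize a multi-model filter of the kind described above is to accept a label only when enough models agree and to route everything else to a reviewer. The sketch below assumes exactly that agreement rule; the function name, the `min_agreement` parameter, and the example point names are hypothetical, and the paper's actual filtering criterion may differ.

```python
from collections import Counter


def filter_by_agreement(predictions, min_agreement=1.0):
    """Split BMS points into auto-accepted labels and points flagged for human review.

    predictions: {point_name: [Brick class predicted by each model]}.
    min_agreement: fraction of models that must agree on the top label.
    """
    accepted, flagged = {}, []
    for point, labels in predictions.items():
        label, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            accepted[point] = label
        else:
            flagged.append(point)
    return accepted, flagged


# Illustrative predictions from three models for two points.
votes = {
    "AHU1.SAT": ["Supply_Air_Temperature_Sensor"] * 3,
    "VAV2.DMP": ["Damper_Position_Command", "Damper_Position_Sensor",
                 "Damper_Position_Command"],
}
accepted, flagged = filter_by_agreement(votes)
print(accepted)  # {'AHU1.SAT': 'Supply_Air_Temperature_Sensor'}
print(flagged)   # ['VAV2.DMP']
```

Lowering `min_agreement` trades review effort for a higher risk of accepting a wrong class, so the threshold would need tuning against a verified sample.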

2606.17634 2026-06-17 cs.CL 新提交

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

用于可靠LLM评估的提示扰动方法:基于比较图

Dong Huang, Jianbo Sun, Pengkun Yang

发表机构 * Department of Statistics and Data Science, Tsinghua University(清华大学统计与数据科学系)

AI总结 针对LLM成对评估中的传递性矛盾问题,提出提示扰动框架,通过生成扰动变体并过滤结构不一致的比较模式,提高排名一致性。

Comments 42 pages, 8 figures

详情
AI中文摘要

评估大型语言模型(LLM)对于理解其能力、比较竞争系统以及支持在实践中部署可靠模型至关重要。对于开放任务,成对评估已成为一种流行范式,其中比较同一提示的两个响应,并将产生的判断聚合为整体排名。该范式的核心挑战是传递性:诱导的比较结果可能无法支持任何连贯的全局排名。例如,可能会观察到循环偏好,如$A \succ B \succ C \succ A$,或涉及平局的不一致性,如$A \equiv B\equiv C\neq A$。此类矛盾使得最终排行榜不稳定且难以解释。在本文中,我们提出了一种提示扰动框架,用于提高成对LLM评估的一致性。我们的方法生成每个提示的扰动变体,利用生成的比较图识别并过滤掉结构不一致的比较模式,然后将标准排名方法应用于过滤后的比较。该框架的一个关键特征是,在排名聚合之前,将图级结构一致性显式纳入评估流程。这提供了一种简单且原则性的方法来减少循环不一致性并提高LLM排名的可靠性。

英文摘要

Evaluating large language models (LLMs) is important for understanding their capabilities, comparing competing systems, and supporting the deployment of reliable models in practice. For open-ended tasks, pairwise evaluation has become a popular paradigm, in which two responses to the same prompt are compared and the resulting judgments are aggregated into an overall ranking. A central challenge of this paradigm is intransitivity: the induced comparison outcomes may fail to support any coherent global ranking. For example, one may observe cyclic preferences such as $A \succ B \succ C \succ A$, or inconsistencies involving ties such as $A \equiv B\equiv C\neq A$. Such contradictions make the resulting leaderboard unstable and challenging to interpret. In this paper, we propose a prompt perturbation framework for improving the consistency of pairwise LLM evaluation. Our approach generates perturbed variants of each prompt, uses the resulting comparison graphs to identify and filter out structurally inconsistent comparison patterns, and then applies standard ranking methods to the filtered comparisons. A key feature of the proposed framework is that graph-level structural consistency is incorporated explicitly into the evaluation pipeline before ranking aggregation. This provides a simple and principled way to reduce cyclic inconsistencies and improve the reliability of LLM rankings.
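The cyclic preference $A \succ B \succ C \succ A$ mentioned above corresponds to a directed 3-cycle in the comparison graph. The following sketch finds such triads and drops the comparisons inside them; it is only one possible reading of "filtering structurally inconsistent patterns", with hypothetical function names, and it ignores ties and perturbed prompt variants.

```python
from itertools import permutations


def cyclic_triads(wins):
    """Return every set {a, b, c} with a beats b, b beats c, and c beats a.

    wins: set of (winner, loser) pairs, i.e. edges of the comparison graph.
    """
    items = sorted({x for pair in wins for x in pair})
    triads = set()
    for a, b, c in permutations(items, 3):
        if (a, b) in wins and (b, c) in wins and (c, a) in wins:
            triads.add(frozenset((a, b, c)))
    return triads


def drop_cyclic_comparisons(wins):
    """Keep only comparisons whose two endpoints never share a cyclic triad."""
    bad = cyclic_triads(wins)
    return {(w, l) for (w, l) in wins if not any(w in t and l in t for t in bad)}


judgments = {("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")}
print(cyclic_triads(judgments))            # {frozenset({'A', 'B', 'C'})}
print(drop_cyclic_comparisons(judgments))  # {('A', 'D')}
```

The brute-force triad search is cubic in the number of responses, which is acceptable for the handful of systems usually compared per prompt.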

2606.17630 2026-06-17 cs.RO 新提交

FLAP: FOV-Constrained Active Perception Planning for Prior-Map-Free 3D Navigation

FLAP: 面向无先验地图3D导航的视场约束主动感知规划

Mengke Zhang, Sitong Li, Tiancheng Lai, Ruitian Pang, Mingxuan Zhang, Qingcheng Chen, Fei Gao, Chao Xu, Yanjun Cao

发表机构 * The State Key Laboratory of Industrial Control Technology, College of Control Science and Engineering, Zhejiang University(浙江大学控制科学与工程学院工业控制技术国家重点实验室) Huzhou Institute, Zhejiang University(浙江大学湖州研究院) Huzhou Key Laboratory of Autonomous System(湖州市自主系统重点实验室) Shanghai Institute of Special Equipment Inspection and Technical Research Co., Ltd(上海市特种设备监督检验技术研究院有限公司)

AI总结 提出一种将主动感知直接融入轨迹优化的规划框架,通过传感器坐标系下的视场几何约束和速度触发机制,在保证安全的同时提升效率,并支持任意3D机动。

Comments 18 pages, 19 figures

详情
AI中文摘要

在未知、杂乱的三维环境中进行安全高效的轨迹规划是无人机在现实应用中部署的关键瓶颈。机载传感器有限的视场和感知范围进一步加剧了这一挑战。许多现有方法要么对未探索空间做出简单假设,要么依赖保守启发式(如速度限制或固定感知模式),降低了效率且在不同传感器类型间泛化能力差。本文提出一种新颖的规划框架,将主动感知直接融入轨迹优化,从而在保持效率的同时提高安全性。感知约束源自无人机的动力学模型,并在传感器坐标系中公式化,从而能够精确处理视场几何。速度触发的激活机制使规划器能够平衡感知和运动效率。我们引入带有参数化起始时间优化的主动感知子轨迹段,减轻了因障碍物检测延迟带来的碰撞风险。我们的公式化方法能够在任意三维机动中实现主动感知,超越了主要针对水平运动的现有方法。所有约束和惩罚项均融入可微优化问题,因此规划器仅需一个简单的前端全局路径作为引导,而非计算昂贵的感知感知路径生成器。大量仿真和真实世界实验证明了该方法在不同传感器配置的多样未知环境中的鲁棒性能。

英文摘要

Safe and efficient trajectory planning in unknown, cluttered 3D environments constitutes a critical bottleneck for deploying Unmanned Aerial Vehicles (UAVs) in real-world applications. This challenge is further exacerbated by the limited field-of-view (FOV) and sensing range of onboard sensors. Many existing methods either make simplistic assumptions about unexplored space or rely on conservative heuristics such as speed limits or fixed perception patterns, reducing efficiency and generalizing poorly across different sensor types. In this work, we propose a novel planning framework that directly integrates active perception into trajectory optimization, thereby improving safety while preserving efficiency. The perception constraints are derived from the UAV's dynamic model and formulated in the sensor coordinate frame, which enables precise handling of FOV geometry. The velocity-triggered activation mechanism enables the planner to balance perception and motion efficiency. We introduce an active perception sub-trajectory segment with parametric start-time optimization, mitigating collision risks from late obstacle detection. Our formulation enables active perception during arbitrary 3D maneuvers, extending beyond prior methods designed mainly for horizontal motion. All constraints and penalties are incorporated into a differentiable optimization problem, so the planner requires only a simple front-end global path for guidance, rather than a computationally expensive perception-aware path generator. Extensive simulations and real-world experiments demonstrate robust performance across diverse unknown environments with varying sensor configurations.
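Expressing the field of view in the sensor coordinate frame reduces visibility to a simple geometric test, sketched below. This is only the hard membership check under assumed conventions (x forward, y left, z up, symmetric rectangular FOV); the planner in the paper uses differentiable constraints derived from the UAV dynamics, which this sketch does not reproduce.

```python
import math


def in_fov(point, h_fov_deg, v_fov_deg, max_range):
    """Check whether a point given in the sensor frame is visible.

    point: (x, y, z) with x pointing forward along the optical axis (assumed convention).
    h_fov_deg, v_fov_deg: full horizontal and vertical opening angles in degrees.
    max_range: maximum sensing distance in the same unit as the point.
    """
    x, y, z = point
    if x <= 0.0:
        return False  # behind the sensor
    if math.sqrt(x * x + y * y + z * z) > max_range:
        return False  # beyond sensing range
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    return abs(azimuth) <= h_fov_deg / 2.0 and abs(elevation) <= v_fov_deg / 2.0


print(in_fov((2.0, 0.0, 0.0), 90.0, 60.0, 5.0))  # True: straight ahead, in range
print(in_fov((2.0, 3.0, 0.0), 90.0, 60.0, 5.0))  # False: about 56 degrees off-axis
```

A planner could evaluate such a test on upcoming trajectory samples to decide whether the region it is about to enter has been observed.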

2606.17628 2026-06-17 cs.CL 新提交

OPD-Evolver: Cultivating Holistic Agent Evolver via On-Policy Distillation

OPD-Evolver:通过在线策略蒸馏培养整体智能体进化器

Guibin Zhang, Xun Xu, Yanwei Yue, Zikun Su, Wangchunshu Zhou, Xiaobin Hu, Shuicheng Yan

发表机构 * LV-NUS Lab(LV-NUS实验室) FDU(复旦大学) PKU(北京大学) Bytedance Inc.(字节跳动公司)

AI总结 提出OPD-Evolver框架,通过快慢双循环和在线策略自蒸馏,使智能体学会选择、使用、编写和维护经验,在多个基准上超越现有方法,并让小模型挑战大模型。

详情
AI中文摘要

记忆已成为自我进化智能体的标准基础,但保留经验并不等同于学习如何通过经验进化。现有的记忆智能体可以存储轨迹、检索反思或积累技能,但往往缺乏选择有用经验、据此行动、编写可重用知识以及维护不断增长的存储库的整体能力。我们引入OPD-Evolver,一种快慢协同进化框架,通过在线策略自蒸馏培养这样的智能体进化器。在快循环中,OPD-Evolver与四级记忆层次结构交互,以读取、使用、编写和维护经验,实现快速测试时进化。在慢循环中,结果校准的记忆归因和特权事后视角将这四种能力蒸馏到可部署的策略中。在多领域基准测试中,OPD-Evolver比ReasoningBank等记忆系统最多高出11.5%,比Skill0等基于训练的方法高出约5.8%。进一步分析表明,OPD-Evolver内化了高价值经验和记忆管理,使OPD-Evolver-9B能够挑战诸如Qwen3.5-397B-A17B和Step-3.5-Flash等大型对手,这表明研究方向正从记忆增强智能体迈向真正合格的智能体进化器。

英文摘要

Memory has become a standard substrate for self-evolving agents, yet retaining experience is not the same as learning how to evolve through it. Existing memory agents can store trajectories, retrieve reflections, or accumulate skills, but often lack the holistic competence to select useful experience, act on it, write reusable knowledge, and maintain a growing repository. We introduce OPD-Evolver, a slow-fast co-evolution framework that cultivates such an agent evolver through on-policy self-distillation. In the fast loop, OPD-Evolver interacts with a four-level memory hierarchy to read, use, write, and maintain experience for rapid test-time evolution. In the slow loop, outcome-calibrated memory attribution and privileged hindsight distill these four abilities into the deployable policy. Across multi-domain benchmarks, OPD-Evolver surpasses memory systems such as ReasoningBank by up to 11.5%, and training-based methods such as Skill0 by ~5.8%. Further analysis shows that OPD-Evolver internalizes high-value experience and memory management, enabling OPD-Evolver-9B to challenge giant counterparts such as Qwen3.5-397B-A17B and Step-3.5-Flash, pointing beyond memory-augmented agents toward genuinely qualified agent evolvers.

2606.17627 2026-06-17 cs.CV cs.AI 新提交

Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

分、议、决:一种用于细粒度自我中心动作识别的多智能体框架

Alessandro Sottovia, Alessandro Torcinovich, Oswald Lanz

发表机构 * Faculty of Engineering, Free University of Bozen-Bolzano(博尔扎诺自由大学工程学院)

AI总结 提出一种零样本多智能体框架,通过视频分割、异构VLM专家协商和Borda计数聚合,提升细粒度自我中心动作识别性能。

详情
AI中文摘要

在自我中心视频中进行细粒度动作识别对视觉语言模型(VLM)具有挑战性:动作通常仅在小视觉线索上有所不同,而单个模型往往偏向于这些线索的一个子集。我们提出了“分、议、决”(Divide, Deliberate, Decide),一个完全本地化的零样本多智能体框架,其中(i)一个VLM编排器将视频分块,并为每个片段提出一个top-k候选标签列表,(ii)一个由来自不同开放模型系列的异构VLM专家组成的集成体进行结构化协商,包括一轮同行咨询问题,以及(iii)使用Borda计数聚合智能体排名,并且编排器根据专家的证据重新排名自己的预测。整个流程在本地运行,无需微调。实验表明,我们的方法在零样本动作识别性能上比基线有积极改进,突出了异构协商步骤的影响,表明增益来自去相关的模型先验而非额外的计算。

英文摘要

Fine-grained action recognition in egocentric video is challenging for Vision-Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully-local, zero-shot multi-agent framework in which (i) a VLM orchestrator chunks the video and proposes a top-k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer-consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re-ranks its own prediction in light of the specialists' evidence. The entire pipeline runs locally with no fine-tuning. Experiments show that our method improves zero-shot action recognition performance over the baseline and highlight the influence of the heterogeneous deliberation step, indicating that the gain stems from decorrelated model priors rather than from additional compute.
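For readers unfamiliar with step (iii), a Borda count turns each agent's ranked list into points and sums them. The sketch below is a generic implementation with a hypothetical function name and made-up action labels; how the paper handles labels missing from some lists or ties is not stated in the abstract.

```python
from collections import defaultdict


def borda_count(rankings):
    """Aggregate ranked label lists.

    A label at position i of a k-item list earns k - 1 - i points,
    so the top choice gets k - 1 points and the last choice gets 0.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        k = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += k - 1 - position
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


agent_rankings = [
    ["cut", "slice", "peel"],    # specialist 1
    ["slice", "cut", "peel"],    # specialist 2
    ["slice", "peel", "cut"],    # specialist 3
]
print(borda_count(agent_rankings))  # [('slice', 5), ('cut', 3), ('peel', 1)]
```

Because every position contributes, a label that is consistently second can beat one that is first for a single agent and last for the others, which suits an ensemble of models with different biases.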

2606.17619 2026-06-17 cs.CV 新提交

RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

RAVA: 检索增强的视角对齐用于主题驱动图像生成

Qiwei Yan, Zhiqiang Yuan, Chongyang Li, Jiapei Zhang, Ying Deng, Jinchao Zhang, Jie Zhou

发表机构 * WeChat AI, Tencent Inc.(腾讯微信人工智能实验室)

AI总结 提出RAVA框架,通过检索增强提供几何证据,解决跨主体视角对齐中的视角漂移和结构不匹配问题,在保持身份的同时实现可靠视角控制。

详情
AI中文摘要

参考驱动图像生成在身份保持方面取得了快速进展,但跨不同主体的可靠视角控制仍未得到充分理解。难点不仅在于生成目标主体的新图像:模型必须推断一个主体的隐含视角,并仅使用图像级证据将其转移到另一个主体,无需相机姿态、深度或基于射线的条件。在这种设置下,现有基于多个图像参考的生成器通常依赖虚假的语义相关性,导致视角漂移、部分级结构不匹配以及缺失或不支持的目标特定内容。我们将这一挑战形式化为跨主体视角对齐,并提出RAVA,一个检索增强框架,在生成前提供显式几何证据。RAVA首先学习一个跨实例视角嵌入,检索与锚点视角对齐的目标主体图像,然后应用基于LogDet的子集选择策略,保留一个既视角一致又结构互补的紧凑参考集。最后,选定的参考被微调的多参考图像生成器使用。实验表明,通用语义嵌入在此任务上几乎是随机的,而所提出的检索器显著提高了视角检索质量。在跨主体生成上,RAVA在相同生成骨干下始终优于零样本基线和更强的检索替代方案。这些结果表明,跨主体视角对齐受益于检索增强的几何基础,而非仅依赖端到端生成。

英文摘要

Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than relying on end-to-end generation alone.

2606.17615 2026-06-17 cs.CV cs.AI 新提交

SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

SkillMoV: 基于原型条件门控的视图混合路由用于统一多视角熟练度估计

Edoardo Bianchi, Antonio Liotta

发表机构 * Free University of Bozen-Bolzano(博尔扎诺自由大学)

AI总结 提出SkillMoV框架,通过混合视图投影器(MoVP)实现多场景多视角视频的熟练度估计,在EgoExo4D数据集上达到50.17%准确率,超越现有方法。

详情
AI中文摘要

从视频中估计人类熟练度是自动化技能评估的关键挑战,应用于体育教练、音乐教学、手术培训和工作场所学习。现有方法通常专注于单一场景或依赖共享的多视角聚合,限制了其适应异构摄像机视角和活动领域的能力。我们提出SkillMoV,一个统一的、参数高效的框架,用于从同步多视角视频中进行多场景熟练度估计。其核心是混合视图投影器(MoVP),将混合专家范式适应于摄像机特定的视角特征。MoVP由四个阶段组成:(i) 一个具有12个专家MLP的混合视图软路由器,无需摄像机身份监督即可学习视角相关的专家偏好;(ii) 跨视角注意力以对齐同步摄像机;(iii) 可学习的原型锚定,以类级参考向量条件化表示;(iv) 一个原型条件门控投影,生成最终技能嵌入。我们在EgoExo4D上评估SkillMoV,涵盖六个技能领域和三种单独训练的视角配置:Ego、Exos和Ego+Exos。SkillMoV在Exos设置中达到50.17%的总体准确率,单个模型在所有场景上联合训练,超过比较方法中报告的最强Exos结果3.57个百分点。在Ego+Exos中,SkillMoV接近该设置的最佳报告结果(47.63%对48.20%)。在选定的Exos配置上的消融实验验证了每个组件:MoV路由比注意力聚合提高+6.61个百分点,跨视角注意力+4.92个百分点,原型锚定+4.07个百分点,随机视角丢弃+3.90个百分点。通过LoRA适配,SkillMoV仅训练其参数的23.32%,并且相对于仅LoRA基线增加了有限的测量开销。

英文摘要

Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.
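Stage (i) of MoVP, the soft router over expert MLPs, can be sketched as follows. This is a minimal illustration of soft mixture-of-experts routing in general, not the authors' implementation: the class name `SoftViewRouter`, the layer sizes, the ReLU activation, and the random weights are assumptions. Only the count of twelve experts and the absence of camera-identity input come from the abstract.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def make_matrix(rows, cols, rng):
    """Random matrix standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class SoftViewRouter:
    """Soft routing: every expert runs, outputs are mixed by gate weights."""

    def __init__(self, dim, hidden, num_experts=12, seed=0):
        rng = random.Random(seed)
        self.gate = make_matrix(num_experts, dim, rng)
        # Each expert is a two-layer MLP: dim -> hidden -> dim.
        self.experts = [
            (make_matrix(hidden, dim, rng), make_matrix(dim, hidden, rng))
            for _ in range(num_experts)
        ]

    def forward(self, view_feature):
        # Gate weights depend only on the feature, not on a camera ID.
        weights = softmax(matvec(self.gate, view_feature))
        mixed = [0.0] * len(view_feature)
        for weight, (w_in, w_out) in zip(weights, self.experts):
            hidden = [max(0.0, h) for h in matvec(w_in, view_feature)]
            expert_out = matvec(w_out, hidden)
            mixed = [m + weight * e for m, e in zip(mixed, expert_out)]
        return mixed, weights


router = SoftViewRouter(dim=4, hidden=8)
output, gate_weights = router.forward([1.0, 0.0, -1.0, 0.5])
```

Because the gate is a softmax rather than a top-k selection, every expert contributes to each view and the routing stays differentiable; different view features simply produce different mixing weights.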

2606.17609 2026-06-17 cs.CL 新提交

The Benchmark Illusion: Pruned LLMs Can Pass Multiple Choice but Fail to Answer

基准测试幻觉:剪枝后的大语言模型能通过多项选择但无法回答问题

Rui Wen, Lu Sun, Jiayang Liu, Zesheng Xu, Tianshuo Cong, Zheng Li

发表机构 * Institute of Science Tokyo(东京科学大学) Tohoku University(东北大学) Nanyang Technological University(南洋理工大学) KTH Royal Institute of Technology(瑞典皇家理工学院) Shandong University(山东大学)

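AI summary: Finds that LLMs pruned at high sparsity can still score well on multiple-choice evaluation yet fail to answer the same questions correctly in open generation, exposing a blind spot in benchmark-based evaluation.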

AI中文摘要

压缩大型语言模型可以减少内存使用和推理成本,但也可能导致标准基准测试未能捕捉到的失败。一个剪枝后的模型可能在多项选择评估中仍然表现良好,但在开放生成中却无法回答相同的问题。我们探究剪枝改变了什么:是擦除了正确答案,还是使答案更难作为最高输出产生?我们通过多语言问答来研究这个问题,追踪剪枝前后相同的问题。我们发现了一种基准测试幻觉。在高稀疏度剪枝(尤其是Wanda)下,模型在贪婪开放生成中经常失败,而在多项选择评分下仍然能选择正确答案。在这些仅识别错误中,答案通常并未消失,而是被降级:它经常通过束搜索、采样或一个上下文示例重新出现。总体而言,多项选择基准测试可能夸大压缩LLM的可用性,造成评估盲点。压缩模型应该测试它们能产生什么,而不仅仅是能识别什么。

英文摘要

Compressing large language models reduces memory use and inference cost, but it can also create failures that standard benchmarks miss. A pruned model may still perform well on multiple-choice evaluations, yet fail to answer the same question in open generation. We ask what pruning changes: does it erase the correct answer, or does it make the answer harder to produce as the top output? We study this question with multilingual question answering, tracking the same questions before and after pruning. We find a benchmark illusion. Under high-sparsity pruning, especially Wanda, models often fail in greedy open generation while still selecting the correct answer under multiple-choice scoring. In these recognition-only errors, the answer is usually not gone, but demoted: it often reappears with beam search, sampling, or one in-context example. Overall, multiple-choice benchmarks can overstate the usability of compressed LLMs, creating an evaluation blind spot. Compressed models should be tested on what they can produce, not only on what they can recognize.
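The gap between recognizing and producing an answer can be made concrete with a toy scoring example. The probabilities, answer strings, and function names below are hypothetical and chosen only for illustration; they are not results from the paper. The sketch assumes a model that assigns a probability to each candidate answer string: multiple-choice scoring compares only the listed options, while greedy open generation returns the single most likely string overall.

```python
def greedy_answer(answer_probs):
    """Open generation (greedy): return the most probable answer overall."""
    return max(answer_probs, key=answer_probs.get)


def multiple_choice_answer(answer_probs, options):
    """Multiple-choice scoring: compare only the listed options."""
    return max(options, key=lambda option: answer_probs.get(option, 0.0))


def rank_of(answer_probs, answer):
    """1-based rank of an answer in the model's overall ordering."""
    ordered = sorted(answer_probs, key=answer_probs.get, reverse=True)
    return ordered.index(answer) + 1


# Hypothetical distribution for "What is the capital of France?" after pruning:
# the correct answer is demoted to rank 2 behind a distractor that is not
# among the multiple-choice options.
toy_probs = {"Lyon": 0.41, "Paris": 0.38, "Berlin": 0.12, "Rome": 0.09}
toy_options = ["Paris", "Berlin", "Rome", "Madrid"]

generated = greedy_answer(toy_probs)
chosen = multiple_choice_answer(toy_probs, toy_options)
correct_rank = rank_of(toy_probs, "Paris")
```

Here the multiple-choice score counts the question as correct while greedy generation does not, and the rank of the correct answer shows that it was demoted rather than erased, which is the kind of case where beam search or sampling could still recover it.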

2606.17606 2026-06-17 cs.CV 新提交

Flux-Guard: Facial Identity Protection using diffusion models

Flux-Guard:使用扩散模型的面部身份保护

Jie Wang, Tao Wang, Ru Zhang, Jianyi Liu

发表机构 * School of Cyberspace Security, Beijing University of Posts and Telecommunications(北京邮电大学网络空间安全学院) Nanjing University of Aeronautics and Astronautics(南京航空航天大学)

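AI summary: Proposes Flux-Guard, a framework that combines flow trajectory control with latent-space adversarial optimization to perform face editing and privacy protection within one generative process, improving attack success rates against cross-domain face recognition models.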

AI中文摘要

人脸识别系统的广泛部署使得社交媒体和公共平台上共享的个人图像面临身份关联和隐私风险。现有的对抗性隐私保护方法可以降低未经授权的人脸识别性能,但与生成式面部编辑不兼容。人工智能驱动的面部编辑工具越来越受欢迎,这显著增加了用户对个性化肖像生成和社交分享的需求。然而,当前的编辑方法通常保留身份特征,使得编辑后的图像仍然容易被恶意人脸识别系统追踪。因此,本文提出了Flux-Guard,一种基于对抗攻击的隐私保护面部编辑框架,它在统一的生成过程中集成了面部编辑和隐私保护。具体地,我们设计了一种流轨迹控制方法,将语义操作与生成过程对齐,并引入了潜在空间对抗优化,采用自适应感知损失驱动的加权策略,动态调整对抗强度以在保持视觉质量的同时最大化攻击效果。大量实验表明,Flux-Guard支持面部编辑,同时在CelebA-HQ和LADN数据集上显著提高了对跨域人脸识别模型的攻击成功率。此外,对商业API的评估结果证实了其在现实世界应用中的有效性。代码发布在 https://github.com/JLMWang/Flux-Guard。

英文摘要

The widespread deployment of face recognition (FR) systems exposes personal images shared on social media and public platforms to identity linkage and privacy risks. Existing adversarial privacy protection methods can degrade unauthorized FR performance but are not compatible with generative face editing. Artificial intelligence-driven face editing tools are gaining popularity, which has significantly increased user demand for personalized portrait generation and social sharing. However, current editing methods often preserve identity features, making the edited images still susceptible to tracking by malicious FR systems. Thus, this paper proposes Flux-Guard, a privacy-preserving face editing framework based on adversarial attacks, which integrates face editing and privacy protection within a unified generative process. Specifically, we design a flow trajectory control method to align semantic manipulations with the generative process and introduce latent-space adversarial optimization with an adaptive perceptual-loss-driven weighting strategy, dynamically adjusting adversarial strength to maximize attack effectiveness while preserving visual quality. Extensive experiments demonstrate that Flux-Guard supports face editing while significantly improving attack success rates against cross-domain face recognition models on the CelebA-HQ and LADN datasets. Furthermore, evaluation results for commercial APIs have confirmed its effectiveness in real-world applications. The code is released at https://github.com/JLMWang/Flux-Guard.