arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.29447 2026-05-29 cs.CV cs.CL

Recovering Policy-Induced Errors: Benchmarking and Trajectory Synthesis for Robust GUI Agents

恢复策略诱导错误:鲁棒GUI智能体的基准测试与轨迹合成

Tianpeng Bu, Xin Liu, Qihua Chen, Hao Jiang, Shurui Li, Hongtao Duan, Lu Jiang, Lulu Hu, Bin Yang, Minying Zhang

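Affiliations: Alibaba

AI summary: Proposes the GUI-RobustEval benchmark and RoTS, a robustness-driven trajectory synthesis framework that uses a tree-based pipeline to proactively discover error modes and synthesize recovery steps; the trained models achieve state-of-the-art performance on GUI tasks.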

Comments ICML 2026 Spotlight. 36 pages, 19 figures, includes appendix

AI Chinese abstract

尽管GUI智能体发展迅速,但它们通常缺乏从自身错误中恢复的鲁棒性,阻碍了实际部署。为了在评估和数据层面弥补这一差距,我们引入了GUI-RobustEval并提出了鲁棒驱动轨迹合成。GUI-RobustEval包含1,216个可执行测试用例,系统性地衡量在广泛且真实的错误模式下的错误恢复能力。在数据层面,RoTS是一个可扩展的合成框架,通过树状管道主动发现多样化的错误模式并合成相应的恢复步骤,创建了80万高质量数据。我们的两个模型RoTS-7B和RoTS-32B,在数据集上微调后,在GUI-RobustEval和传统GUI基准测试上均表现出显著提升。值得注意的是,RoTS-32B在OSWorld上达到了最先进性能,成功率为47.4%,All-Pass@4得分为33.8%,表明改进的长时域错误恢复能力有助于鲁棒性和整体性能。我们的代码可在https://github.com/AlibabaResearch/RoTS获取。

English abstract

While GUI agents have advanced rapidly, they often lack the robustness to recover from their own errors, hindering real-world deployment. To bridge this gap at both the evaluation and data levels, we introduce GUI-RobustEval and propose Robustness-driven Trajectory Synthesis (RoTS). GUI-RobustEval contains 1,216 executable test cases that systematically measure error recovery capabilities across a broad and realistic spectrum of error modes. At the data level, RoTS is a scalable synthesis framework that creates 800k high-quality data via a tree-based pipeline that proactively discovers diverse error modes and synthesizes corresponding recovery steps. Our two models, RoTS-7B and RoTS-32B, fine-tuned on our dataset, both demonstrate significant gains on GUI-RobustEval and traditional GUI benchmarks. Notably, RoTS-32B achieves state-of-the-art performance on OSWorld, with a 47.4% success rate and a 33.8% All-Pass@4 score, suggesting that improved long-horizon error recovery ability contributes to both robustness and overall performance. Our code is available at https://github.com/AlibabaResearch/RoTS.
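
The abstract reports two OSWorld numbers without defining All-Pass@4. The Python sketch below shows one common reading, stated here as an assumption rather than the authors' definition: the success rate pools every attempt, while All-Pass@k credits a task only when all k attempts succeed. The task names and outcomes are made up for illustration.

```python
from typing import Dict, List


def success_rate(runs: Dict[str, List[bool]]) -> float:
    """Fraction of successful attempts, pooled over all tasks."""
    attempts = [ok for outcomes in runs.values() for ok in outcomes]
    return sum(attempts) / len(attempts)


def all_pass_at_k(runs: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks whose first k attempts all succeeded (assumed definition)."""
    for task, outcomes in runs.items():
        if len(outcomes) < k:
            raise ValueError(f"task {task!r} has fewer than {k} attempts")
    passed = sum(all(outcomes[:k]) for outcomes in runs.values())
    return passed / len(runs)


# Hypothetical outcomes: four attempts per task.
example_runs = {
    "open_settings": [True, True, True, True],
    "rename_file": [True, False, True, True],
}
```

Under this reading a single failed attempt removes a task from All-Pass@k, which is why the score can sit well below the plain success rate.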

2605.29446 2026-05-29 cs.AI

CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials

CrystalXRD-Bench:面向多种晶体材料的XRD峰索引的视觉-语言模型基准测试

Chengliang Xu, Xiaogang Li, Peiyao Xiao, Beng Wang, Hu Wei, Bing Zhao

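Affiliations: Alibaba Group

AI summary: Proposes CrystalXRD-Bench, a 250-sample benchmark that evaluates how well vision-language models identify Miller indices (HKL) from powder XRD patterns; the best model reaches a Jaccard score of only 0.5888, so the task is far from solved.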

Comments 18 pages, 10 figures

AI Chinese abstract

从粉末XRD图谱中识别米勒指数需要现有多模态基准未测试的能力:模型必须从渲染的科学曲线中读取窄峰位置,然后将该观察与多步晶体学推理联系起来。我们引入CrystalXRD-Bench,一个基于10个公共晶体学数据库构建的250样本基准,用于单一任务:恢复对XRD图谱中最高强度峰有贡献的完整HKL集合。每个样本将渲染的XRD图像与源CIF文本和化学式配对,因此视觉提取错误和推理错误可以并排检查。我们评估了七个视觉-语言模型。最佳Jaccard得分为0.5888(GPT-5.4),精确匹配率为37.6%,但七个模型中有六个仍低于Jaccard 0.50;该任务远未解决。错误模式系统性地变化:双峰情况尤其脆弱,注重召回率的模型通过过度预测HKL来增加覆盖率,而访问CIF文本并不能缩小晶体学计算方面的差距。除了模型排名外,该基准还确定了当前VLM在定量科学图形上失败的条件。所有数据和评估代码将公开提供。

English abstract

Miller-index identification from powder XRD patterns requires capabilities untested by existing multimodal benchmarks: the model must read a narrow peak location from a rendered scientific curve and then connect that observation to multi-step crystallographic reasoning. We introduce CrystalXRD-Bench, a 250-sample benchmark built from 10 public crystallographic databases for a single task: recover the full set of HKLs contributing to the highest-intensity peak in an XRD pattern. Each sample pairs the rendered XRD image with the source CIF text and chemical formula, so visual extraction errors and reasoning errors can be examined side by side. We evaluate seven vision-language models. The best Jaccard score is 0.5888 (GPT-5.4) with an exact-match rate of 37.6%, yet six of seven models remain below Jaccard 0.50; the task is far from solved. Error patterns vary systematically: double-peak cases are especially brittle, recall-heavy models gain coverage by over-predicting HKLs, and access to CIF text does not close the gap in crystallographic calculation. Alongside model rankings, the benchmark identifies the conditions under which current VLMs fail on quantitative scientific figures. All data and evaluation code will be publicly available.
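
The two reported metrics can be illustrated with a short Python sketch. It assumes each prediction and reference is a set of (h, k, l) integer triples; the benchmark's actual answer format and any normalization of symmetry-equivalent indices are not described in the abstract, so the helper names and the example values are hypothetical.

```python
from typing import Iterable, Tuple

HKL = Tuple[int, int, int]


def jaccard(predicted: Iterable[HKL], reference: Iterable[HKL]) -> float:
    """Intersection over union of two HKL sets (1.0 when both are empty)."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)


def exact_match(predicted: Iterable[HKL], reference: Iterable[HKL]) -> bool:
    """True only when the predicted set equals the reference set."""
    return set(predicted) == set(reference)


# Hypothetical sample: the model names the right reflection plus one extra.
reference_hkls = {(1, 1, 1)}
predicted_hkls = {(1, 1, 1), (2, 0, 0)}
```

Here the extra index enlarges the union, so the Jaccard score drops to 0.5 even though recall is perfect. This is the mechanism by which over-predicting HKLs trades precision for coverage.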

2605.29440 2026-05-29 cs.CL cs.AI cs.IR

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

SkillBrew: LLM智能体技能库的多目标策展

Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu, Ming Jin, Xiangyu Zhao, Yilei Shao, Yanfeng Wang, Qingsong Wen

Affiliations: City University of Hong Kong; Squirrel Ai Learning; University of Science and Technology of China; University of California, San Diego; Griffith University; East China Normal University; Shanghai Jiao Tong University

AI summary: Proposes the SkillBrew framework, which models skill bank curation as a Pareto optimization problem under a utility constraint and uses a bi-level propose-then-verify loop to keep the skill bank compact and diverse.

Comments 16 pages. Preprint. Under review

AI Chinese abstract

检索增强的LLM智能体越来越依赖于精心策划的技能库:指导复杂任务决策的可重用文本原则集合。现有方法通常以仅追加的方式扩展这些库,不断添加新技能而不移除冗余、过时或有害的技能,导致存储库效率低下且策展不良。在本文中,我们将技能库策展形式化为一个受约束的多目标问题:一个理想的库必须对智能体有用、内容多样,并且对查询分布有良好的覆盖。为此,我们引入了SkillBrew,一个多目标策展框架,将技能库策展形式化为在效用约束下的帕累托感知优化,并通过双层提议-验证循环求解。我们在两个公共基准上评估了我们的方法。我们的发现表明,将技能库视为原则性策展的对象,而不是不断增长的仅追加日志,是构建自我改进的LLM智能体的重要一步。

English abstract

Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fashion, continuously adding new skills without removing redundant, outdated, or harmful ones, resulting in inefficient and poorly curated repositories. In this paper, we formulate skill bank curation as a constrained multi-objective problem: a desirable bank must be useful for the agent, diverse in its content, and provide good coverage of the query distribution. To this end, we introduce SkillBrew, a multi-objective curation framework that formalizes skill bank curation as Pareto-aware optimization under a utility constraint, and solves it via a bi-level propose-then-verify loop. We evaluate our approach on two public benchmarks. Our findings suggest that treating skill banks as objects of principled curation, rather than ever-growing append-only logs, is an important step toward building self-improving LLM agents.
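
As a rough illustration of Pareto-aware selection under a utility constraint, the Python sketch below filters candidate skill banks by a minimum utility and keeps those that no other feasible candidate dominates on diversity and coverage. This is a generic Pareto filter, not SkillBrew's propose-then-verify loop; the `Candidate` fields, the scores, and the threshold are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    name: str
    utility: float    # how much the bank helps the agent
    diversity: float  # higher is better
    coverage: float   # higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a is at least as good as b on both objectives and strictly better on one."""
    no_worse = a.diversity >= b.diversity and a.coverage >= b.coverage
    better = a.diversity > b.diversity or a.coverage > b.coverage
    return no_worse and better


def pareto_front(candidates: List[Candidate], min_utility: float) -> List[Candidate]:
    """Keep feasible candidates that no other feasible candidate dominates."""
    feasible = [c for c in candidates if c.utility >= min_utility]
    return [c for c in feasible if not any(dominates(o, c) for o in feasible)]


# Hypothetical candidate banks.
banks = [
    Candidate("A", utility=0.90, diversity=0.5, coverage=0.5),
    Candidate("B", utility=0.80, diversity=0.7, coverage=0.4),
    Candidate("C", utility=0.85, diversity=0.4, coverage=0.4),
    Candidate("D", utility=0.50, diversity=0.9, coverage=0.9),
]
front = pareto_front(banks, min_utility=0.7)
```

In this example D is excluded by the utility constraint despite its high diversity and coverage, C is dominated by A, and A and B remain as incomparable trade-offs.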

2605.29438 2026-05-29 cs.RO

ElegantVLA: Learning When to Think for Efficient Vision-Language-Action Models

ElegantVLA:学习何时思考以实现高效的视觉-语言-动作模型

Ye Li, Huanan Liu, Kangye Ji, Yuan Meng, Jiajun Fan, Yuansong Wang, Shiyu Qin, Chenglei Wu, Shu-Tao Xia, Zhi Wang

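Affiliations: Tsinghua University; University of Illinois at Urbana-Champaign

AI summary: Proposes ElegantVLA, a plug-in phase-adaptive inference framework that uses dynamic compute scheduling to allocate computation across the vision encoder, LLM, and action head, accelerating VLA models with up to 2.55x and 3.77x speedup on GR00T and CogACT, respectively.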

AI Chinese abstract

视觉-语言-动作(VLA)模型是通用机器人控制的一种强大范式。然而,其高计算成本和有限的控制频率阻碍了实时机器人操作,尤其是在每个控制步骤都运行大型视觉-语言骨干网络和迭代动作头时。现有的VLA加速方法通常优化单个组件或依赖固定的加速规则,对不同控制步骤采用大致固定的计算量,忽略了序列化具身控制的非均匀推理需求。受人类运动控制的启发,其中认知和反馈资源集中在目标敏感阶段,我们认为VLA模型应该学习何时投入完整计算以及何时重用先前的计算。我们提出ElegantVLA,一种即插即用的相位自适应推理框架,通过模型内动态计算调度加速VLA模型。ElegantVLA引入一个轻量级调度器,观察时间表示相似性、机器人运动线索和任务进度,联合分配视觉编码器、大语言模型和动作头的计算。对于感知-语言推理,调度器根据视觉-语言表示稳定性选择五级视觉-大语言模型计算模式,从完全重计算到多步时间重用。对于动作生成,它选择三级去噪模式,在稳定运动期间重用中间去噪状态,同时在目标敏感阶段保留完整细化。通过协调这些决策,ElegantVLA为具有显式动作生成模块的现代VLA流水线提供了一个通用加速框架,无需修改或重新训练基础模型。在GR00T和CogACT上的实验分别实现了最高2.55倍和3.77倍的加速,在六个真实世界的GR00T任务中,ElegantVLA将计算量减少了2.18倍,同时将控制频率从13.8 Hz提高到26.3 Hz。

English abstract

Vision-Language-Action (VLA) models are a powerful paradigm for generalist robotic control. However, their high computational cost and limited control frequency hinder real-time robotic manipulation, especially when large vision-language backbones and iterative action heads run at every control step. Existing VLA acceleration methods often optimize individual components or rely on fixed acceleration rules, treating different control steps with largely fixed computation and overlooking the non-uniform reasoning demands of sequential embodied control. Inspired by human motor control, where cognitive and feedback resources concentrate on goal-sensitive stages, we argue that VLA models should learn when to invest full computation and when to reuse prior computation. We propose ElegantVLA, a plug-in phase-adaptive inference framework that accelerates VLA models through intra-model dynamic compute scheduling. ElegantVLA introduces a lightweight scheduler that observes temporal representation similarity, robot-motion cues, and episode progress to jointly allocate computation across the vision encoder, LLM, and action head. For perception-language reasoning, the scheduler selects a five-level Vision-LLM compute mode, from full recomputation to multi-step temporal reuse, based on visual-language representation stability. For action generation, it selects a three-level denoising mode, reusing intermediate denoising states during stable motion while preserving full refinement for goal-sensitive stages. By coordinating these decisions, ElegantVLA offers a general acceleration framework for modern VLA pipelines with explicit action-generation modules, without modifying or retraining the base model. Experiments on GR00T and CogACT achieve up to 2.55x and 3.77x speedup, and on six real-world GR00T tasks ElegantVLA cuts computation by 2.18x while raising control frequency from 13.8 Hz to 26.3 Hz.

2605.29430 2026-05-29 cs.AI cs.CL

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

迈向具有智能体纠正和语义评估的类人交互式语音识别

Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

Affiliations: College of Artificial Intelligence, Xi’an Jiaotong University; X-LANCE Lab, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; The Chinese University of Hong Kong, Shenzhen; Fudan University; Tongyi Fun Team, Alibaba Group

AI summary: Proposes Agentic ASR, a closed-loop framework that reduces semantic errors through multi-turn interaction and semantic correction, and introduces the Sentence-level Semantic Error Rate (S^2ER) as an evaluation metric.

AI Chinese abstract

自动语音识别(ASR)是人机交互的核心组成部分,也是基于LLM的助手和智能体日益重要的前端。然而,当前大多数ASR系统仍遵循单遍范式,这与人类通信方式不一致——在人类通信中,误解通过迭代澄清和修正来解决。这种不匹配使得一旦发生意义关键的错误,很难纠正。同时,词错误率(WER)或字符错误率(CER)等词级指标无法充分反映此类问题。为解决这些局限,我们将交互式ASR形式化为多轮修正任务,并提出Agentic ASR,一种结合单遍ASR前端与语义纠正、意图路由和基于推理编辑的闭环框架。我们进一步引入句子级语义错误率(S^2ER),一种基于LLM的语义评估指标,以及交互式仿真系统,用于可扩展和可复现的基准测试。在多语言、命名实体密集和代码切换基准上的实验表明,迭代交互持续减少语义错误,在S^2ER上的提升远大于传统词级指标。人机对齐和消融研究进一步验证了语义判断器的可靠性和所提框架的鲁棒性。代码见:https://interactiveasr.github.io/,在线演示见:https://i-asr.sjtuxlance.com/

English abstract

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such errors. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S^2ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S^2ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/
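
To make the contrast with token-level metrics concrete, the Python sketch below computes a sentence-level error rate as the fraction of sentences that a judge marks as not preserving the reference meaning. The abstract does not give the exact S^2ER formula, so this definition is an assumption, and `toy_judge` is a hypothetical stand-in for the LLM-based judge.

```python
from typing import Callable, Sequence


def sentence_semantic_error_rate(
    references: Sequence[str],
    hypotheses: Sequence[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Share of sentences whose hypothesis the judge rejects (assumed definition)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    if not references:
        raise ValueError("at least one sentence is required")
    errors = sum(0 if judge(r, h) else 1 for r, h in zip(references, hypotheses))
    return errors / len(references)


def toy_judge(reference: str, hypothesis: str) -> bool:
    """Placeholder judge: equal after dropping case and punctuation."""
    def normalize(text: str) -> list:
        kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
        return kept.split()
    return normalize(reference) == normalize(hypothesis)


refs = ["Call Dr. Lee at nine.", "Book a table for two."]
hyps = ["call dr lee at nine", "Book a table for ten."]
rate = sentence_semantic_error_rate(refs, hyps, toy_judge)
```

The first hypothesis differs only in casing and punctuation and is accepted, while the second changes a single word that alters the meaning, so one of the two sentences counts as an error.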

2605.29429 2026-05-29 cs.CV

One Click per Cell Type Suffices: Training-free Group Interaction for Cell Instance Segmentation

每细胞类型一次点击足矣:无需训练的组交互用于细胞实例分割

Sanghyun Jo, Seo Jin Lee, Seohyung Hong, Yoorim Gang, Hyeongsub Kim, Hyungseok Seo, Kyungsu Kim

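Affiliations: OGQ, Korea; Seoul National University, Korea; LG CNS, Korea

AI summary: Proposes Group Prompting, a paradigm in which one click per cell type segments all instances of that type; building on the feature-clustering property of SAM's frozen encoder, the training-free Chain-of-Prompts framework recursively expands the click and maintains high performance on multiple benchmarks.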

Comments Accepted to MICCAI 2026 (Early Accept)

AI Chinese abstract

在特定细胞数据集上训练的细胞实例分割模型在分布外的细胞类型上性能严重下降,而交互式基础模型通过每个实例提示克服了这一点,但对于包含数百到数千个密集实例的组织病理学图像,其成本过高。我们引入了组提示,这是一种新范式,将交互式分割从每个实例 $O(N)$ 转变为每个类型 $O(T)$,其中每细胞类型一次点击即可分割该类型的所有实例。我们的关键观察是,Segment Anything Model (SAM) 的冻结图像编码器在给出任何提示之前,已经在其特征空间中对相同类型的细胞进行了聚类。利用这一特性,我们提出了Chain-of-Prompts (CoP),这是一个无需训练的框架,通过以下方式递归扩展单个用户点击:(1) 通过非参数门控多尺度编码器特征识别可靠的相同类型位置,以及 (2) 选择空间上最远的可靠点作为下一个提示以最大化覆盖范围。在三个细胞类型标注的基准上,每类型一次点击的CoP保留了超过90%的每个实例性能,并且无需任何额外训练就超越了全监督方法。在四个形态均匀的基准上,一次点击保留了超过99%。项目页面:https://shjo-april.github.io/Chain-of-Prompts/

English abstract

Cell instance segmentation models trained on cell-specific datasets suffer severe performance drops on out-of-distribution cell types, while interactive foundation models overcome this through per-instance prompting at a cost that is prohibitively expensive for histopathology images containing hundreds to thousands of densely packed instances. We introduce Group Prompting, a new paradigm that shifts interactive segmentation from per-instance $O(N)$ to per-type $O(T)$, where a single click per cell type suffices to segment all instances of that type. Our key observation is that the frozen image encoder of the Segment Anything Model (SAM) already clusters same-type cells in its feature space before any prompt is given. Exploiting this property, we propose Chain-of-Prompts (CoP), a training-free framework that recursively expands a single user click by (1) identifying reliable same-type locations through non-parametric gating of multi-scale encoder features, and (2) selecting the most spatially distant reliable point as the next prompt to maximize coverage. On three cell-type-annotated benchmarks, CoP with one click per type retains over 90% of per-instance performance and surpasses fully-supervised methods without any additional training. On four morphologically homogeneous benchmarks, a single click retains over 99%. Project Page: https://shjo-april.github.io/Chain-of-Prompts/
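
Step (2) of the method resembles farthest-point selection, which the Python sketch below illustrates. It assumes the reliable same-type locations are already available as (x, y) points and that "most spatially distant" means farthest from every prompt chosen so far; both are assumptions, and the reliability gating on encoder features from step (1) is not modeled.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def next_prompt(reliable_points: List[Point], prompts: List[Point]) -> Point:
    """Pick the reliable point whose nearest existing prompt is farthest away."""
    def distance_to_prompts(p: Point) -> float:
        return min(math.dist(p, q) for q in prompts)
    return max(reliable_points, key=distance_to_prompts)


def expand_clicks(first_click: Point, reliable_points: List[Point], steps: int) -> List[Point]:
    """Grow one user click into a chain of prompts, one point per step."""
    prompts = [first_click]
    remaining = [p for p in reliable_points if p != first_click]
    for _ in range(steps):
        if not remaining:
            break
        chosen = next_prompt(remaining, prompts)
        prompts.append(chosen)
        remaining.remove(chosen)
    return prompts


# Hypothetical reliable locations for one cell type.
chain = expand_clicks((0.0, 0.0), [(1.0, 0.0), (10.0, 0.0), (5.0, 5.0)], steps=2)
```

Choosing the farthest reliable point first spreads the prompts across the image, which matches the stated goal of maximizing coverage with few clicks.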

2605.29427 2026-05-29 cs.CL

FinGuard: Detecting Financial Regulatory Non-Compliance in LLM Interactions

FinGuard:检测LLM交互中的金融监管违规

Huaixia Dou, Jie Zhu, Minghao Wu, Shuo Jiang, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang

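Affiliations: Qwen DianJin Team, Alibaba Cloud Computing; Tongyi Lab, Alibaba Group; School of Computer Science and Technology, Soochow University

AI summary: Targets the detection of regulatory non-compliance in financial LLM interactions: proposes an automated pipeline driven by regulatory documents, builds FinGuard-Bench, the first benchmark for financial compliance detection, and trains the FinGuard model, which substantially outperforms existing methods on the benchmark.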

AI Chinese abstract

随着大型语言模型(LLM)在金融服务中的部署日益增多,一次不合规的交互就可能使机构面临监管处罚并直接损害消费者利益。现有的防护模型围绕通用危害分类构建,忽略了基于特定金融法规的违规行为。我们通过一个直接操作监管文档的监管驱动管道来弥补这一空白,该管道归纳出金融合规风险分类,并在没有任何预定义违规类别的情况下合成基于监管的训练数据。将该管道应用于中国金融法规,我们发布了FinGuard-Bench,据我们所知,这是首个金融监管合规检测基准,在查询和回复层面均带有专家标注的标签。我们进一步训练了FinGuard,这是一个基于Qwen3-8B构建的金融合规检测模型,通过监督微调和自我对弈强化学习在基于监管的数据上进行训练。在FinGuard-Bench上,FinGuard显著优于所有基线,包括专用防护模型和更大的通用LLM,如Qwen3.5-397B-A17B和GPT-5.1。此外,FinGuard还保留了通用安全能力,并能仅使用政策文档适应未见过的机构特定政策。我们将在GitHub上公开发布本工作中使用的代码、提示和资源。

English abstract

As large language models (LLMs) are increasingly deployed in financial services, a single non-compliant interaction can expose institutions to regulatory penalties and direct consumer harm. Existing guard models are built around general harm taxonomies and overlook violations grounded in specific financial regulations. We address this gap with a regulation-driven pipeline that operates directly on regulatory documents, inducing a financial compliance risk taxonomy and synthesizing grounded training data without any predefined violation categories. Instantiating the pipeline on Chinese financial regulations, we release FinGuard-Bench, to our knowledge the first benchmark for financial regulatory compliance detection, with expert-annotated labels at both the query and response levels. We further train FinGuard, a financial compliance detection model built on Qwen3-8B and trained on the regulation-grounded data via supervised fine-tuning and self-play reinforcement learning. On FinGuard-Bench, FinGuard substantially outperforms all baselines, including dedicated guard models and much larger general-purpose LLMs such as Qwen3.5-397B-A17B and GPT-5.1. Furthermore, FinGuard preserves general safety capabilities and adapts to unseen institution-specific policies using policy documents alone. We will publicly release the code, prompts, and resources used in this work on GitHub.

2605.29425 2026-05-29 cs.AI

ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control

ReasonLight: 一种多模态基础模型增强的强化学习框架用于零样本交通信号控制

Aoyu Pang, Maonan Wang, Yuejiao Xie, Chung Shue Chen, Zhiwei Yang, Man-On Pun

Affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China; Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Hong Kong; Shanghai AI Laboratory, Shanghai, China; Nokia Bell Labs, Paris-Saclay, France

AI summary: Proposes ReasonLight, a framework that enhances reinforcement learning with a multimodal foundation model and uses roadside sensor and camera data to adapt zero-shot to rare traffic events, substantially reducing emergency vehicle waiting time.

AI Chinese abstract

强化学习在交通信号控制中展现出潜力,但其对预定义状态的依赖限制了其对训练数据中未出现的可观测开放世界事件的响应能力。物联网赋能的路口通过路侧传感器和摄像头提供异构观测,为提升强化学习对此类事件的适应性创造了机会。为此,我们提出ReasonLight,一种多模态基础模型增强的强化学习框架,用于零样本交通信号控制。ReasonLight整合三类信息:结构化交通测量、多视角摄像头观测以及预训练强化学习控制器生成的候选相位决策。给定强化学习提议的相位,ReasonLight从多视角图像中提取视觉语义,并将其与紧凑的传感器导出的场景描述对齐。这种对齐使得语义引导的细化模块能够根据交通规则和事件语义保留或调整提议的动作。为确保操作可靠性,细化后的动作受可用相位集合约束。任何无效决策被拒绝,系统回退至原始强化学习动作。我们在强化学习训练期间未见的两类罕见事件上评估ReasonLight:紧急车辆优先和临时交通管制。实验结果表明,ReasonLight无需重新训练即可实现零样本适应。与仅使用强化学习的主干相比,它将紧急车辆等待时间最多降低88.7%,同时保持相当的常规交通性能。

English abstract

Reinforcement learning (RL) has shown promise in traffic signal control (TSC). However, its reliance on predefined states limits responsiveness to observable open-world events that are absent from training data. IoT-enabled intersections provide heterogeneous observations from roadside sensors and cameras, creating opportunities to improve RL adaptability to such events. To this end, we propose ReasonLight, a multimodal foundation model-enhanced RL framework for zero-shot TSC. ReasonLight integrates three sources of information: structured traffic measurements, multi-view camera observations, and candidate phase decisions from a pre-trained RL controller. Given an RL-proposed phase, ReasonLight extracts visual semantics from multi-view images and aligns them with compact sensor-derived scene descriptions. This alignment enables a semantic-guided refinement module to either preserve or adjust the proposed action according to traffic rules and event semantics. To ensure operational reliability, refined actions are constrained by the set of available phases. Any invalid decision is rejected, and the system falls back to the original RL action. We evaluate ReasonLight on two types of rare events not seen during RL training: emergency vehicle priority and temporary traffic regulation. Experimental results show that ReasonLight achieves zero-shot adaptation without retraining. It reduces emergency vehicle waiting time by up to 88.7% compared with the RL-only backbone while preserving comparable routine traffic performance.
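
The reliability rule described above (constrain the refined action to the available phases, otherwise fall back to the RL action) can be sketched in a few lines of Python. The function name and the phase labels are hypothetical; the refinement module that would produce `refined_phase` is not modeled.

```python
from typing import Collection, Optional


def select_phase(
    rl_phase: str,
    refined_phase: Optional[str],
    available_phases: Collection[str],
) -> str:
    """Accept the refined phase only if it is a valid phase; else keep the RL choice."""
    if refined_phase is not None and refined_phase in available_phases:
        return refined_phase
    return rl_phase


# Hypothetical phase set for a four-way intersection.
PHASES = {"NS_through", "NS_left", "EW_through", "EW_left"}
```

With this guard an invalid or missing refinement can never leave the signal without a valid phase, because the RL proposal is always available as the fallback.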

2605.29421 2026-05-29 cs.CL

Learning Design Skills as Memory Policies for Agentic Photonic Inverse Design

将设计技能学习为记忆策略用于智能光子逆向设计

Shengchao Chen, Ting Shu, Sufen Ren

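Affiliations: AAII, University of Technology Sydney; School of Artificial Intelligence, Shenzhen University; School of Information and Communication Engineering, Hainan University

AI summary: Proposes SkillPCF, a closed-loop agent framework that addresses knowledge accumulation in photonic crystal fiber inverse design through a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded skill evolution, achieving a better trade-off between design quality and efficiency on a real-world dataset.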

Comments AI4Physics@ICML 2026

AI Chinese abstract

光子晶体光纤(PCF)逆向设计仍然具有挑战性,因为候选几何形状必须在昂贵的电磁模拟下满足耦合的光学目标。现有流程改进了代理预测或一次性参数推荐,但未能在迭代试验中积累可重用的设计知识。我们将PCF逆向设计表述为记忆策略学习问题,并提出SkillPCF,一个闭环智能体框架,结合了物理引导的记忆技能库、强化学习的技能选择和模拟器接地的技能演化。我们进一步构建了一个真实世界数据集,包含479个专家交互轨迹(2507个跨度)和553个记忆依赖的评估查询,涵盖色散工程、损耗优化和多目标设计。在多个LLM骨干和经典基线上的实验表明,SkillPCF在实际模拟预算下实现了更强的设计质量和效率权衡,证明了我们提出的记忆技能学习范式在物理感知的PCF逆向设计中的有效性。

English abstract

Photonic crystal fiber (PCF) inverse design remains challenging because candidate geometries must satisfy coupled optical targets under expensive electromagnetic simulation. Existing pipelines improve surrogate prediction or one-shot parameter recommendation, but they do not accumulate reusable design knowledge across iterative trials. We formulate PCF inverse design as a memory-policy learning problem and propose SkillPCF, a closed-loop agent framework that combines a physics-guided memory skill bank, reinforcement-learned skill selection, and simulator-grounded skill evolution. We further construct a real-world dataset with 479 expert interaction traces (2,507 spans) and 553 memory-dependent evaluation queries covering dispersion engineering, loss optimization, and multi-objective design. Experiments across multiple LLM backbones and classical baselines show that SkillPCF achieves stronger design-quality and efficiency trade-offs under practical simulation budgets, demonstrating the effectiveness of our proposed memory-skill learning paradigm for physics-aware PCF inverse design.

2605.29420 2026-05-29 cs.AI cs.LG

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

角色提示何时真正有效?LLM中专家角色注入的检索与度量分析

Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu, Xinjie He, Zhiyuan Lin, Qiyang Xie

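Affiliations: Independent Researchers

AI summary: Compares four prompting conditions on 1,140 open-ended questions and finds that role prompting systematically increases expertise depth while reducing clarity; its effect depends strongly on question type and domain, and hybrid retrieval outperforms embedding-only retrieval.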

Comments 6 pages, 2 figures. Submitted for peer review

AI Chinese abstract

角色提示被广泛用于引导大型语言模型,但其实际价值仍不明确。先前的工作通常使用聚合分数评估角色提示,难以确定专家角色提示是否一致地提高响应质量,或者是否沿着不同的质量维度改变响应。我们通过对比四种提示条件在涵盖38个专家角色和六个领域的1140个开放式问题上的表现来研究这个问题:无角色提示、通用领域专家提示、基于嵌入的角色检索,以及结合嵌入搜索和基于LLM的角色选择的混合检索方法。聚合结果显示各条件之间总体差异很小。然而,度量级分析揭示了一个聚合平均值掩盖的一致权衡:角色提示系统性地增加了专家深度,同时降低了清晰度。这些效果高度有条件而非普遍。角色提示在咨询类问题以及医学和心理学等领域表现最佳,在这些领域中,结构化的专家框架和风险沟通具有内在价值。相比之下,基线提示在金融、法律、科学和技术领域的概念性和解释性问题中表现更好,在这些领域中,简洁的平实语言解释更为重要。我们进一步表明,混合检索显著优于纯嵌入角色选择,尽管更好的角色检索并不能消除更广泛的专家深度与清晰度之间的权衡。总体而言,我们的发现表明,角色提示主要重塑响应特征而非广泛提升能力,并且多度量评估对于理解其效果是必要的。

English abstract

Persona prompting is widely used to steer large language models, yet its practical value remains unclear. Prior work often evaluates persona prompting using aggregate scores, making it difficult to determine whether expert-role prompting consistently improves response quality or instead changes responses along different quality dimensions. We study this question through a controlled comparison of four prompting conditions across 1,140 open-ended questions spanning 38 expert roles and six domains: no role prompt, a generic domain-expert prompt, embedding-based role retrieval, and a hybrid retrieval method combining embedding search with LLM-based role selection. Aggregate results show only small overall differences between conditions. However, metric-level analysis reveals a consistent tradeoff that aggregate averages obscure: role prompting systematically increases expertise depth while reducing clarity. These effects are highly conditional rather than universal. Role prompting performs best on advisory questions and in domains such as medicine and psychology, where structured expert framing and risk communication are intrinsically valuable. In contrast, baseline prompting performs better on conceptual and explanatory questions in finance, legal, science, and technology domains, where concise plain-language explanation is more important. We further show that hybrid retrieval significantly improves over embedding-only role selection, although better role retrieval does not eliminate the broader expertise-depth versus clarity tradeoff. Overall, our findings suggest that persona prompting primarily reshapes response characteristics rather than broadly improving capability, and that multi-metric evaluation is necessary for understanding its effects.

2605.29416 2026-05-29 cs.RO cs.CV

3DVLA: Enhancing Vision-Language-Action Models via 3D Spatial and Instance Understanding

3DVLA:通过3D空间和实例理解增强视觉-语言-动作模型

Zhongyu Xia, Yousen Tang, Bingqing Wei, Yongtao Wang

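Affiliations: Wangxuan Institute of Computer Technology, Peking University

AI summary: Proposes 3DVLA, a framework that addresses the lack of 3D scene understanding in VLA models through multi-view-consistent 3D feature encoding, an instance estimation module, and masked self-supervised 3D encoding, substantially improving manipulation performance on LIBERO-Plus and RoboTwin 2.0.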

AI Chinese abstract

视觉-语言-动作模型在机器人操作中取得了显著进展,但存在一个关键限制:缺乏3D场景理解。这一缺陷表现为三个相互交织的挑战:在不强制执行多视角一致性的情况下弱提取3D空间位置、不足的3D实例理解以及遮挡下的脆弱推理。尽管存在成熟的3D感知方法,但由于架构不兼容以及对昂贵实例级标注的严重依赖,它们难以直接集成到VLA流程中。为解决上述挑战,我们提出3DVLA,一个即插即用框架,将稳健的3D推理注入预训练的VLA,无需额外人工标注或丢弃VLM先验。具体来说,3DVLA通过以下方式应对三个挑战:(1)在所有模态上具有显式多视角一致性约束的普遍3D特征编码和空间条件几何聚合方法,(2)具有高级实例令牌的实例估计模块以实现3D实例感知,以及(3)保留预测器用于视觉令牌完成的掩码自监督3D编码分支以处理遮挡。我们将3DVLA与多个VLA基线集成,并在LIBERO-Plus和RoboTwin 2.0上进行评估。结果显示操作性能持续且显著提升,验证了我们方法的有效性和即插即用兼容性。

English abstract

Vision-Language-Action models have achieved remarkable progress in robotic manipulation, yet they suffer from a critical limitation: a lack of 3D scene understanding. This deficiency manifests as three intertwined challenges: weak extraction of 3D spatial positions without enforcing multi-view consistency, inadequate 3D instance understanding, and fragile reasoning under occlusion. Although mature 3D perception methods exist, their direct integration into VLA pipelines is hindered by architectural incompatibility and by heavy reliance on costly instance-level annotations. To address the above challenges, we propose 3DVLA, a plug-and-play framework that injects robust 3D reasoning into pretrained VLAs without requiring extra manual labels or discarding VLM priors. Specifically, 3DVLA tackles the three challenges through: (1) pervasive 3D feature encoding with explicit multi-view consistency constraints across all modalities and a Spatially-Conditioned Geometry Aggregation method, (2) an instance estimation module with high-level instance tokens for 3D instance awareness, and (3) a masked self-supervised 3D encoding branch that retains its predictor for visual token completion to handle occlusions. We integrate 3DVLA with multiple VLA baselines and evaluate on LIBERO-Plus and RoboTwin 2.0. Results show consistent and significant gains in manipulation performance, validating both the effectiveness and plug-and-play compatibility of our approach.

2605.29414 2026-05-29 cs.CL cs.AI

Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning

超越双语迁移:指令微调中的多语言代码切换

Shunta Asano, Jeonghun Baek, Toshihiko Yamasaki

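Affiliations: The University of Tokyo

AI summary: Using sentence-level multilingual code-switching instruction tuning across four languages, this study shows that multilingual code-switching improves the multilingual understanding of large language models beyond the conventional bilingual transfer setting.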

AI Chinese abstract

近期研究表明,代码切换数据(CSD)——即在同一上下文中混合多种语言——可以改善大语言模型(LLMs)的跨语言迁移和多语言对齐。然而,现有研究主要关注英语与目标语言之间的双语迁移,涉及三种或更多语言的多语言设置在很大程度上尚未被探索。在本工作中,我们研究了跨四种语言(英语、日语、韩语和中文)的多语言代码切换指令微调。我们在Belebele上评估多语言理解能力。我们的实验表明,简单的句子级多语言CSD持续提高了所有四种语言的平均多语言性能,表明多语言代码切换在双语迁移设置之外也能有效。

English abstract

Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However, existing studies primarily focus on bilingual transfer between English and a target language, leaving multilingual settings involving three or more languages largely unexplored. In this work, we investigate multilingual code-switching instruction tuning across four languages: English, Japanese, Korean, and Chinese. We evaluate multilingual understanding on Belebele. Our experiments show that simple sentence-level multilingual CSD consistently improves average multilingual performance across all four languages, indicating that multilingual code-switching can be effective beyond bilingual transfer settings.
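
One simple way to picture sentence-level multilingual code-switching is sketched below in Python: given sentence-aligned translations of one text, each sentence position is filled from a randomly chosen language. The abstract does not describe how the authors build their data, so the sampling scheme, the function name, and the placeholder sentences are all assumptions.

```python
import random
from typing import Dict, List


def code_switch(parallel: Dict[str, List[str]], seed: int = 0) -> List[str]:
    """Build one mixed-language text by picking a language per sentence position."""
    lengths = {len(sentences) for sentences in parallel.values()}
    if len(lengths) != 1:
        raise ValueError("all languages must have the same number of sentences")
    rng = random.Random(seed)  # seeded for reproducible mixing
    languages = sorted(parallel)
    count = lengths.pop()
    return [parallel[rng.choice(languages)][i] for i in range(count)]


# Placeholder aligned sentences; real data would hold actual translations.
aligned = {
    "en": ["[en] sentence 1", "[en] sentence 2", "[en] sentence 3"],
    "ja": ["[ja] sentence 1", "[ja] sentence 2", "[ja] sentence 3"],
    "ko": ["[ko] sentence 1", "[ko] sentence 2", "[ko] sentence 3"],
    "zh": ["[zh] sentence 1", "[zh] sentence 2", "[zh] sentence 3"],
}
mixed = code_switch(aligned, seed=0)
```

Because each position keeps its original sentence order, the mixed text conveys the same content as any single-language version while exposing the model to several languages in one context.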

2605.29411 2026-05-29 cs.LG cs.AI stat.ME stat.ML

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

马尔可夫边界在表格预测中的好、坏与丑

Shu Wan, Abhinav Gorantla, Huan Liu, K. Selçuk Candan

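Affiliations: Arizona State University

AI summary: Studies the practical utility of the Markov boundary for tabular prediction and finds that the theoretically optimal boundary improves prediction only conditionally in practice, while causal discovery methods struggle to realize its potential.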

Comments 11 pages, 9 figures, 2 tables. Preprint

AI Chinese abstract

在标准图形假设下,目标变量的马尔可夫边界是使所有其他特征冗余的最小特征集。一旦观察到边界,目标变量与表格的其余部分条件独立。这对于表格预测来说是一个诱人的对象,因为它恰好指出了模型所需的列。然而,现代回归器仍然在完整特征集上训练。我们询问马尔可夫边界是否在SCM3K(一个包含3450个任务的合成SCM基准,特征数量从40到1000,涵盖六个SCM家族)上对预测真正有用,并使用六个回归器进行评估。答案比理论所暗示的要微妙得多。将回归器限制在oracle边界上通常会显著改善预测,并且随着特征空间变得更大更稀疏,改善程度增加。但是,通过因果发现恢复边界并在恢复的掩码上训练的自然流程并不奏效。现有的估计器在达到边界最有帮助的区域之前就耗尽了计算预算,即使它们运行,也很少能击败完整特征集。我们将此归因于三个原因。发现优化的是结构恢复而非预测。假阴性和假阳性具有高度不对称的预测成本。精确边界只是众多击败所有特征的特征集之一。然后,我们阐述了这些事实对于预测对齐的特征选择以及学习使用因果结构的表格模型的意义。

English abstract

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.
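
When the causal graph is known, as with an oracle boundary on a synthetic SCM, the Markov boundary of a target in a DAG consists of its parents, its children, and the other parents of its children, under the usual faithfulness assumption. The Python sketch below reads that set off a parent map; the graph and variable names are hypothetical and are not taken from SCM3K.

```python
from typing import Dict, List, Set


def markov_boundary(parents: Dict[str, List[str]], target: str) -> Set[str]:
    """Parents, children, and co-parents (spouses) of the target in a DAG."""
    children = {node for node, ps in parents.items() if target in ps}
    spouses = {p for child in children for p in parents[child]}
    return (set(parents.get(target, [])) | children | spouses) - {target}


# Hypothetical DAG: A -> T, T -> C, S -> C, C -> D, and N is isolated.
dag = {
    "A": [],
    "S": [],
    "N": [],
    "T": ["A"],
    "C": ["T", "S"],
    "D": ["C"],
}
boundary = markov_boundary(dag, "T")
```

In this example D and N fall outside the boundary: D is a descendant reached only through C, and N is unrelated, so neither adds information about T once A, C, and S are observed.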

2605.29410 2026-05-29 cs.RO

A Progress-Aware Leader-Follower Midair Docking System for Dual-Drone Aerial Manipulation

面向双无人机空中操控的进度感知领航-跟随空中对接系统

Yifan Cai, Jan Ming Kevin Tan, Xiangqi Li, Chenzhe Jin, Narsimlu Kemsaram, Valerio Modugno

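Affiliations: Department of Computer Science, University College London

AI summary: Proposes a progress-aware leader-follower midair docking platform for two quadrotors that achieves reliable docking with a passive magnetic latching module and a phase manager, evaluated in simulation and real-world experiments using quantitative metrics.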

Comments This paper has been accepted for publication in the Proceedings of the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), August 17-21, 2026, Shenyang, China

AI Chinese abstract

小型无人机之间的可靠空中对接对于模块化空中合作与操控至关重要,但需要在严格的推力和载荷约束下实现精确的相对位姿控制和可重复的平台操作。我们提出了一种双无人机对接平台,其中两架四旋翼以领航-跟随编队运行,并使用带有被动磁锁紧的轻量级模块化框架进行对接。一个进度感知的任务监督器管理阶段转换:接近、对准、捕获和稳定。该平台集成了完整的硬件-软件栈(带有Crazyflie/PX4接口的ROS 2)和同步日志记录,用于基准评估。我们在仿真和实际实验中,使用编队误差、基线及偏航一致性、对接成功率、对接时间和失败模式统计等定量指标对平台进行评估。该平台能够对对接监督和同步策略进行基于统计的比较,并为模块化空中合作和可重复的空中操控提供了实用的测试平台。

English abstract

Reliable midair docking between small unmanned aerial vehicles (UAVs) is essential for modular aerial cooperation and manipulation, but it requires precise relative-pose control and repeatable platform operation under tight thrust and payload constraints. We present a dual-drone docking platform where two quadrotors operate in a leader-follower formation and dock using a lightweight modular frame with passive magnetic latching. A progress-aware mission supervisor manages phase transitions: approach, alignment, capture, and settle. This platform integrates a complete hardware-software stack (ROS 2 with Crazyflie/PX4 interfaces) and synchronized logging for benchmark evaluation. We evaluate the platform in simulation and real-world experiments using quantitative metrics such as formation error, baseline and yaw consistency, docking success rate, time-to-dock, and failure-mode statistics. The platform enables statistically grounded comparison of docking supervision and synchronization strategies and provides a practical testbed for modular aerial cooperation and repeatable midair aerial manipulation.
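
The four phases named in the abstract suggest a small state machine, sketched below in Python. Only the phase names come from the abstract; the transition conditions, the thresholds, and the fallback from alignment to approach are hypothetical choices made for illustration and do not describe the authors' supervisor.

```python
from enum import Enum


class Phase(Enum):
    APPROACH = "approach"
    ALIGNMENT = "alignment"
    CAPTURE = "capture"
    SETTLE = "settle"


# Hypothetical thresholds, not values from the paper.
ALIGN_DISTANCE_M = 0.5
CAPTURE_DISTANCE_M = 0.1
MAX_YAW_ERROR_DEG = 5.0


def next_phase(phase: Phase, distance_m: float, yaw_error_deg: float, latched: bool) -> Phase:
    """Advance only when the current phase's progress condition is met."""
    if phase is Phase.APPROACH and distance_m <= ALIGN_DISTANCE_M:
        return Phase.ALIGNMENT
    if phase is Phase.ALIGNMENT:
        if distance_m > ALIGN_DISTANCE_M:
            return Phase.APPROACH  # drifted away: restart the approach
        if distance_m <= CAPTURE_DISTANCE_M and abs(yaw_error_deg) <= MAX_YAW_ERROR_DEG:
            return Phase.CAPTURE
    if phase is Phase.CAPTURE and latched:
        return Phase.SETTLE
    return phase
```

Gating each transition on measured progress rather than on elapsed time is one way to read "progress-aware": the supervisor stays in a phase, or steps back, until the relative pose actually supports the next step.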

2605.29407 2026-05-29 cs.RO

Phase-Conditioned Imitation Learning with Autonomous Failure Recovery for Robust Deformable Object Manipulation

相位条件化模仿学习与自主故障恢复用于鲁棒可变形物体操作

Dayuan Chen, Kai Tang, Yukuan Zhang, Kazuhiro Kosuge, Yasuhisa Hirata

发表机构 * Department of Robotics, Tohoku University(日本东北大学机器人学系) JC STEM Lab of Robotics for Soft Materials, the Department of Electrical and Computer Engineering, Faculty of Engineering, The University of Hong Kong(香港大学工程学院电气与计算机工程系软材料机器人实验室)

AI总结 提出一种相位条件化、力感知的闭环分层框架,通过FiLM调节的ACT编码器和多模态相位预测器实现自主故障恢复,显著提升可变形物体操作的成功率。

Comments Accepted to IEEE/ASME Transactions on Mechatronics

详情
AI中文摘要

本文提出了一种相位条件化、力感知的框架,用于鲁棒的可变形物体操作。标准的模仿学习策略(如使用Transformer的动作分块,ACT)在推理时依赖马尔可夫假设,当视觉上相似的观测需要矛盾的动作时会导致状态混淆,并阻止从执行故障中自主恢复。我们通过一个闭环分层架构解决了这一问题。一个FiLM条件化的ACT编码器根据当前任务相位调节特征提取,使得单一统一策略能够产生相位特定的行为,同时跨相位共享动作动态。一个融合视觉、力和位姿反馈的多模态相位预测器实时估计相位,检测仅靠视觉无法发现的接触故障,并自主触发恢复轨迹。该系统由一个用于柔顺执行的混合阻抗控制器和一个用于力感知数据收集的触觉遥操作接口完成。消融研究表明,基于FiLM的调制显著优于无条件化和令牌级条件化的基线,t-SNE分析证实FiLM诱导了良好分离的、相位特定的特征表示。在双臂挂上和脱下T恤的任务中验证,闭环系统通过自主错误恢复将挂上成功率从56%提高到87%。代码和视频:https://leledeyuan00.github.io/phaser/

英文摘要

This paper presents a phase-conditioned, force-aware framework for robust deformable object manipulation. Standard imitation learning policies such as Action Chunking with Transformers (ACT) rely on a Markovian assumption at inference, causing state aliasing when visually similar observations require contradictory actions and preventing autonomous recovery from execution failures. We address this with a closed-loop hierarchical architecture. A FiLM-conditioned ACT encoder modulates feature extraction based on the current task phase, enabling a single unified policy to produce phase-specific behaviors while sharing action dynamics across phases. A multi-modal phase predictor fusing visual, force, and pose feedback estimates the phase in real time, detecting contact failures that are invisible to vision alone and autonomously triggering recovery trajectories. The system is completed by a hybrid impedance controller for compliant execution and a haptic teleoperation interface for force-aware data collection. Ablation studies show that FiLM-based modulation significantly outperforms both unconditioned and token-level conditioned baselines, and t-SNE analysis confirms that FiLM induces well-separated, phase-specific feature representations. Validated on hanging and removing a T-shirt with dual arms, the closed-loop system improves the hanging success rate from 56\% to 87\% through autonomous error recovery. Code and videos: https://leledeyuan00.github.io/phaser/
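FiLM (feature-wise linear modulation) conditions a network by scaling and shifting each feature channel with parameters that depend on the conditioning signal, here the task phase. A minimal sketch follows; the phase names and the hard-coded `(gamma, beta)` table are hypothetical stand-ins for what would normally be predicted by a small learned network.

```python
def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


# Hypothetical per-phase (gamma, beta) pairs; in practice a small network
# would predict them from a phase embedding.
PHASE_PARAMS = {
    "grasp": ([1.0, 0.5, 2.0], [0.0, 0.1, -1.0]),
    "recover": ([0.0, 1.0, 1.0], [1.0, 0.0, 0.0]),
}


def modulate(features, phase):
    """Apply the modulation associated with the given task phase."""
    gamma, beta = PHASE_PARAMS[phase]
    return film(features, gamma, beta)
```

Because the same backbone features are reused and only the scale and shift change, one policy can behave differently per phase, which matches the abstract's description of a single unified policy with phase-specific behaviors.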

2605.29405 2026-05-29 cs.LG

Information-Directed Offline-to-Online Reinforcement Learning

信息导向的离线到在线强化学习

Keru Chen

发表机构 * School of Electrical, Computer and Energy Engineering, Arizona State University(电气、计算机与能源工程学院,亚利桑那州立大学)

AI总结 本文提出信息导向采样(IDS)方法,通过条件互信息量化离线数据后的残余不确定性,在离线到在线强化学习中平衡即时遗憾与信息增益,并证明其贝叶斯遗憾界及在偏置残余不确定性场景下的优势。

详情
AI中文摘要

基于离线数据集的决策通常从固定离线数据中预热策略或评分模型,然后通过有限的在线交互进行优化。离线数据减少了不确定性,但并未消除探索需求;它改变了仍需探索的内容。我们通过学习目标 $χ$ 与在线轨迹在给定离线数据集条件下的条件互信息 $I(χ;τ_{1:T}\mid\mathcal{D}_N)$ 来形式化这种残余不确定性。这一观点自然地引出了信息导向采样(IDS),即一族由 $η\ge 0$ 参数化的方法,通过权衡即时遗憾与信息增益来选择动作。我们通过比率证书证明了 IDS 的通用离线到在线贝叶斯遗憾界:任何由参考汤普森采样策略在同一随机策略类上满足的信息比率界都会被 IDS 继承。在已知动力学的贝叶斯线性奖励模型中,条件互信息具有对数行列式形式,且普通 IDS($η=0$)满足 $\widetilde O\!\left(Hd\min\left\{\sqrt T,\,T\sqrt{C^\dagger_{β,\mathrm{IDS}_0}(N,T)/N}\right\}\right)$,其中覆盖系数与普通 IDS 自身诱导的访问分布相关。我们还识别出一种热启动情形,其中存在一个被支配但信息丰富的探测动作,普通 IDS 会选择该探测动作而汤普森采样从不选择,从而产生常数因子的贝叶斯遗憾分离。受控的赌博机实验和 D4RL 离线到在线强化学习实验验证了这一机制:当离线数据信息丰富但留下有偏或低概率的残余不确定性,且有针对性的在线动作可以解决这些不确定性时,IDS 最为有益,这种情形在离线强化学习、离线黑箱优化和贝叶斯优化中普遍存在。

英文摘要

Decision-making from offline datasets typically warm-starts a policy or score model from fixed offline data and then refines it with limited online interaction. Offline data reduces uncertainty, but it does not remove the need for exploration; it changes what remains to be explored. We formalise this residual uncertainty by the conditional mutual information $I(χ;τ_{1:T}\mid\mathcal{D}_N)$ between a learning target $χ$ and the online trajectories after conditioning on the offline dataset. This view leads naturally to information-directed sampling (IDS), a family parameterised by $η\ge 0$ that selects actions by trading off instantaneous regret against information gain. We prove a generic offline-to-online Bayesian regret bound for IDS through a ratio certificate: any information-ratio bound satisfied by a reference Thompson-sampling policy over the same randomised policy class is inherited by IDS. In a known-dynamics Bayesian linear-reward model, the conditional mutual information has a log-determinant form, and vanilla IDS ($η=0$) satisfies $\widetilde O\!\left(Hd\min\left\{\sqrt T,\,T\sqrt{C^\dagger_{β,\mathrm{IDS}_0}(N,T)/N}\right\}\right),$ where the coverage coefficient is tied to the visitation distribution induced by vanilla IDS itself. We also identify a warm-start regime with a dominated but informative probe in which vanilla IDS selects the probe while Thompson sampling never does, giving a constant-factor Bayesian regret separation. Controlled bandit experiments and D4RL offline-to-online RL experiments validate this mechanism: IDS is most beneficial when offline data is informative but leaves biased or low-probability residual uncertainty that targeted online actions can resolve, a regime shared by offline RL, offline black-box optimization, and Bayesian optimization.
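The regret-versus-information trade-off can be illustrated with the standard information ratio (squared expected regret divided by information gain). The sketch below is a deterministic simplification: the actual method optimises over randomised policies, and the $η$ family from the abstract is not modelled. All numbers are hypothetical.

```python
def information_ratio(expected_regret, information_gain):
    """Squared expected regret per unit of information gained."""
    return expected_regret ** 2 / information_gain


def ids_pick(actions):
    """Deterministic simplification: pick the action with the smallest ratio.

    `actions` maps a name to (expected_regret, information_gain).
    """
    return min(actions, key=lambda a: information_ratio(*actions[a]))


# Hypothetical numbers: "probe" always has the worst reward, so Thompson
# sampling (which only plays actions that are optimal under some posterior
# sample) would never choose it, but it is far more informative.
ACTIONS = {
    "arm_a": (0.10, 0.01),
    "arm_b": (0.12, 0.01),
    "probe": (0.15, 0.50),
}
```

With these values `ids_pick(ACTIONS)` selects `"probe"`, a toy version of the "dominated but informative probe" regime in which the abstract reports a separation between vanilla IDS and Thompson sampling.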

2605.29402 2026-05-29 cs.CV cs.AI

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

面向高效长视频推理的语义与视觉证据:HD-EPIC VQA挑战赛的解决方案

Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv, Hui Li

发表机构 * Lenovo, China(联想(中国))

AI总结 提出一种统一框架,通过解耦长视频推理为语义证据(粗到细提取全局过程结构)和视觉证据(基于目标的细粒度定位),并采用查询条件证据检索与整合,在HD-EPIC VQA挑战赛中取得竞争性能。

详情
AI中文摘要

理解长格式自我中心视频对于多模态大语言模型(MLLMs)仍然具有挑战性,原因在于有限的上下文长度和对细粒度视觉细节的定位不足。最近提出的HD-EPIC基准突出了这些局限性:即使是强大的长上下文模型,在多样化的视频问答任务中也表现较低。在本文中,我们提出了一个统一框架,将长视频推理解耦为两种互补的证据形式:语义证据和视觉证据。语义证据通过粗到细的提取流程捕获全局过程结构,而基于目标的视觉证据通过边界框和视觉嵌入保留细粒度的定位。在推理过程中,我们将推理形式化为查询条件的证据检索和整合过程,动态地从两个来源选择相关信息。我们的方法在HD-EPIC-VQA挑战赛的多个任务类别中取得了竞争性能。更广泛地说,我们的结果表明,显式地结构化、检索和整合语义与视觉证据对于使用MLLMs进行有效的长视频理解至关重要。

英文摘要

Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benchmark highlights these limitations: even strong long-context models achieve relatively low performance across diverse video question answering tasks. In this paper, we propose a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. Semantic evidence captures global procedural structure through a coarse-to-fine extraction pipeline, while object-centric visual evidence preserves fine-grained grounding through bounding boxes and visual embeddings. During inference, we formulate reasoning as a query-conditioned evidence retrieval and integration process, dynamically selecting relevant information from both sources. Our approach achieves competitive performance in the HD-EPIC-VQA Challenge across multiple task categories. More broadly, our results demonstrate that explicitly structuring, retrieving, and integrating semantic and visual evidence is critical for effective long-video understanding with MLLMs.

2605.29401 2026-05-29 cs.LG

Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting

重新思考多模态时间序列预测的后训练方法

Haoxin Liu, Yichen Zhou, Rajat Sen, B. Aditya Prakash, Abhimanyu Das

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Google Research(谷歌研究)

AI总结 提出PostTime后训练方法,结合监督微调和基于可验证奖励的强化学习,利用大语言模型根据多模态上下文修正数值时间序列基础模型的预测,显著提升多模态时间序列预测性能。

详情
AI中文摘要

时间序列基础模型(TSFMs)在使用数值数据进行零样本单模态预测方面表现出色,但与LLMs不同,它们无法处理通常影响现实世界轨迹的多模态、非数值上下文。在这项工作中,我们弥合了这一差距,并主张一种多模态时间序列预测方法,该方法对LLMs进行后训练,使其作为上下文引导的修正器,作用于强大的数值TSFM先验。我们引入了PostTime,一种结合监督微调(SFT)和基于可验证奖励的强化学习(RLVR)的后训练方案,以及一种生成预测修正的自动推理轨迹的方法。PostTime教会LLM生成上下文条件的预测干预——基于多模态上下文决定修正、保留或忽略TSFM先验。我们在TimesX多模态预测基准上,使用Gemma-3-4B LLM和TimesFM-2.5 TSFM评估了该方法,结果表明它显著优于单独的TSFM、仅LLM的基线以及现有的多模态预测方法。

英文摘要

Time-Series Foundation Models (TSFMs) excel at zero-shot unimodal forecasting using numerical data, but unlike LLMs they cannot consume the multimodal, non-numerical context that often shapes real-world trajectories. In this work, we bridge this gap and argue for a multimodal time-series forecasting approach that post-trains LLMs to act as context-guided revisors over strong numerical TSFM priors. We introduce PostTime, a post-training recipe combining Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), along with a methodology to generate automated reasoning traces for forecast revisions. PostTime teaches an LLM to generate context-conditioned forecast interventions -- decisions to revise, preserve, or ignore the TSFM prior based on the multimodal context. We evaluate this approach on the TimesX multimodal forecasting benchmark using a Gemma-3-4B LLM and TimesFM-2.5 TSFM, and show that it significantly outperforms standalone TSFMs, LLM-only baselines, and existing multimodal forecasting approaches.

2605.29400 2026-05-29 cs.AI cs.CL cs.HC

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

面向屏幕条件动作预测的架构敏感监督微调:PiSAR基准

Rahul Bissa, Abhishek Vyas, Yash Jain

发表机构 * AprioriLabs(Apriori实验室)

AI总结 通过PiSAR基准评估监督微调模型与前沿零样本模型在屏幕锚定行为预测上的性能,发现微调Qwen3-VL-8B-Instruct显著优于前沿基线,而Gemma-4-26B-A4B-IT微调效果不佳,揭示模型与微调方法不匹配问题。

Comments 14 pages, 7 figures, 2 tables. PiSAR corpus and fine-tuned weights are proprietary to AprioriLabs; methodology and recipe released

详情
AI中文摘要

我们在PiSAR(Persona, intent, Screen, Action, Rationale)的一个661行保留子集上,对三个监督微调模型与前沿零样本基线进行了基准测试。PiSAR是一个包含12,929个元组的屏幕锚定行为理由语料库,从公开的应用商店评论、Pew美国趋势面板人口统计数据以及OPeRA购物者轨迹中整理得到。每个模型,无论是前沿模型还是微调模型,都在相同的661行子集上使用相同的评分流程进行评估。有两个发现。第一,前沿零样本基线(Claude Opus 4.7和GPT-5.5)分别达到sem_sim 0.459和0.482;而微调的Qwen3-VL-8B-Instruct达到0.783,并且在79%的行上sem_sim >= 0.7,而两个前沿基线仅为1-2%,在同一测试集上绝对差距为0.30。第二,相同的训练数据和配方在Gemma-4-26B-A4B-IT上仅得0.441,与前沿零样本基线处于同一水平,而非微调的Qwen。我们将其解读为配方与模型不匹配:经过推理调优的高参数模型抵抗位移,可能需要更多数据或更强的微调方法。

英文摘要

We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.

2605.29398 2026-05-29 cs.LG cs.AI

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

GDSD:强化学习作为扩散语言模型的引导去噪器自蒸馏

Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao, Xiaoxiao Xu, Sangwoong Yoon, Ilija Bogunovic

发表机构 * UCL Dept. of Statistical Science(伦敦大学学院统计科学系) UCL Centre for AI(伦敦大学学院人工智能中心) Alibaba Group(阿里巴巴集团) Dept. of EEE(电子工程系) Imperial College London(伦敦帝国理工学院) UNIST(蔚山科学技术院) University of Basel(巴塞尔大学)

AI总结 提出引导去噪器自蒸馏(GDSD)方法,通过从逆KL正则化强化学习的闭式最优解中导出的优势引导自教师直接蒸馏扩散语言模型的去噪器,避免了ELBO似然代理带来的训练-推理不匹配偏差,在规划、数学和代码基准上显著优于现有方法。

Comments Preprint

详情
AI中文摘要

强化学习(RL)可用于改进扩散大语言模型(dLLMs)的策略(去噪器),但受到策略似然难以处理的阻碍。一类主流且高效的方法将标准RL中的似然替换为其证据下界(ELBO),该下界从随机掩码序列中估计。尽管与预训练高度一致,但这些方法通过使用ELBO作为似然代理引入了训练-推理不匹配(TIM)偏差,可能降低性能。在这项工作中,我们提出了引导去噪器自蒸馏(GDSD),直接从优势引导的自教师中蒸馏dLLMs的去噪器,该自教师源自逆KL正则化RL的闭式最优解。GDSD通过无归一化目标将dLLM的去噪器logits与教师匹配,将RL简化为无似然自蒸馏,从而绕过了TIM偏差。最近的基于ELBO的方法表现为应用不同蒸馏散度的实例,但存在GDSD避免的可诊断病态。在LLaDA-8B和Dream-7B的规划、数学和代码基准上,GDSD以更稳定的训练奖励动态持续优于先前最先进的基于ELBO的方法,测试准确率提升高达+19.6%。这些结果表明,直接的去噪器自蒸馏,无需依赖ELBO似然代理,可以为dLLMs提供更稳定有效的RL过程。代码可在https://github.com/GaryBall/GDSD获取。

英文摘要

Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-training, these approaches introduce training--inference mismatch (TIM) bias by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM bias. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to $+19.6\%$. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.

2605.29397 2026-05-29 cs.CL

Revisiting Observation Reduction for Web Agents: Comprehensive Evaluation with a Lightweight Framework

重新审视Web智能体的观察缩减:基于轻量级框架的综合评估

Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada

发表机构 * NEC Corporation(日本电气株式会社)

AI总结 针对LLM Web智能体中HTML观察过长的问题,提出基于最小失败集(MFS)的轻量级评估框架,通过覆盖率代理指标大幅加速评估,并优化剪枝程序实现2.2-3.1倍延迟降低同时保持84-89%成功率。

Comments 22 pages, 8 figures, 4 tables

详情
AI中文摘要

基于LLM的Web智能体中的HTML观察非常长,尽管已经提出了许多缩减方法,但仍不清楚哪些方法能在保持性能的同时降低整体智能体延迟。主要障碍是端到端评估的高成本:在我们的实验中,在WorkArena L1的33个任务上评估32种配置下的11种方法需要232.4累计小时。为解决此问题,我们提出了一个基于最小失败集(MFS)的轻量级评估框架,MFS是一旦被移除就会导致任务失败的最小HTML元素集合。我们将覆盖率定义为缩减方法完全保留MFS的实例比例,作为无需网络访问或LLM推理的代理指标。我们验证了覆盖率与端到端成功率强相关,在两个基准测试上累计评估时间加速超过100倍。利用该框架,我们发现提取式HTML缩减方法需要高计算成本或领域特定优化才能在保持性能的同时降低智能体延迟。在此基础上,我们在MFS训练数据上优化了一个剪枝程序,在WorkArena L1上实现了每步延迟2.2倍加速,同时保留了84%的原始成功率,在WebLinx上实现了3.1倍加速,保留了89%。

英文摘要

HTML observations in LLM-based web agents are extremely long, and while many reduction methods have been proposed, it remains unclear which methods reduce overall agent latency while maintaining performance. The main obstacle is the high cost of end-to-end evaluation: in our experiments, evaluating 11 methods across 32 configurations on 33 tasks of WorkArena L1 required 232.4 cumulative hours. To address this, we propose a lightweight evaluation framework based on the Minimal Failure Set (MFS), the minimal set of HTML elements whose removal causes task failure. We define coverage as the fraction of instances in which a reduction method fully retains the MFS, which serves as a proxy metric that requires neither web access nor LLM inference. We validate that coverage strongly correlates with end-to-end success rate, with over 100$\times$ speedup in cumulative evaluation time on both benchmarks. Using this framework, we find that extractive HTML reduction methods require either high computation cost or domain-specific optimization to reduce agent latency while maintaining performance. Building on this, we optimize a pruning program on MFS training data, achieving 2.2$\times$ faster per-step latency on WorkArena L1 while retaining 84\% of the original success rate, and 3.1$\times$ faster on WebLinx while retaining 89\%.
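The coverage metric defined above reduces to a subset check per instance, which is why it needs neither web access nor LLM inference. A minimal sketch, assuming each instance is given as a pair of element-id sets (the ids and the data layout are hypothetical):

```python
def coverage(instances):
    """Fraction of instances whose MFS survives reduction intact.

    Each instance is (mfs_ids, retained_ids): the element ids in the
    Minimal Failure Set and the ids kept by the reduction method.
    """
    if not instances:
        raise ValueError("need at least one instance")
    kept = sum(1 for mfs, retained in instances if set(mfs) <= set(retained))
    return kept / len(instances)


# Hypothetical element ids for three task steps.
EXAMPLE = [
    ({"btn-submit"}, {"btn-submit", "nav"}),     # MFS fully retained
    ({"input-name", "btn-ok"}, {"input-name"}),  # one MFS element pruned
    ({"link-next"}, {"link-next"}),              # MFS fully retained
]
```

Here two of the three instances keep their full MFS, so `coverage(EXAMPLE)` is 2/3. Partial retention counts as a miss, following the abstract's "fully retains" wording.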

2605.29396 2026-05-29 cs.AI

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

对齐但脆弱:通过零阶优化增强LLM安全鲁棒性

Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu

发表机构 * The State Key Laboratory of Blockchain and Data Security(区块链与数据安全国家重点实验室) Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security(杭州高新区(滨江)区块链与数据安全研究院) Sun Yat-sen University(中山大学) KAUST(阿卜杜拉国王科技大学)

AI总结 针对大语言模型安全对齐后易受轻量级后处理(如参数噪声、激活噪声或量化)影响的问题,提出基于零阶优化的混合框架,通过先标准一阶安全对齐再零阶精炼提升鲁棒性,并利用扰动评估估计层鲁棒性敏感性以高效聚焦关键层更新。

详情
AI中文摘要

大语言模型的安全对齐旨在减少有害或不安全行为,同时保持通用效用。然而,最近的研究发现对齐效果可能是脆弱的:轻量级的对齐后操作,如参数噪声、激活噪声或量化,很容易削弱预期的安全行为。先前提高鲁棒性的努力主要集中在数据整理、修改对齐目标和识别安全关键参数上,而优化器本身的作用在很大程度上未被探索。在本文中,我们首次从基础优化器的角度研究安全对齐的鲁棒性。这种以优化器为中心的视角自然地指向零阶优化,它通过评估扰动下的安全对齐来提供面向鲁棒性的信号。基于这一见解,我们提出了一个混合框架,首先执行标准的一阶安全对齐,然后应用零阶精炼来提高鲁棒性。从理论和实证上,我们表明仅需少量零阶精炼步骤即可增强鲁棒性,同时保持安全对齐。我们进一步通过利用其固有的基于扰动的评估来估计逐层鲁棒性敏感性,从而提高零阶精炼的效率,使精炼过程能够以适度的训练开销将更新集中在鲁棒性关键层上。

英文摘要

Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.
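Zeroth-order optimization estimates a descent direction from loss evaluations at perturbed parameters, without backpropagation. The sketch below uses a common two-point (central-difference) estimator along a random Gaussian direction. The abstract does not say which estimator the paper uses, so this choice, the toy loss, and the step sizes are all assumptions.

```python
import random


def zo_directional(loss, theta, direction, eps=1e-3):
    """Central-difference estimate of the loss slope along `direction`."""
    plus = [t + eps * d for t, d in zip(theta, direction)]
    minus = [t - eps * d for t, d in zip(theta, direction)]
    return (loss(plus) - loss(minus)) / (2 * eps)


def zo_step(loss, theta, lr=0.1, eps=1e-3, rng=random):
    """One zeroth-order update using a single random Gaussian direction."""
    direction = [rng.gauss(0.0, 1.0) for _ in theta]
    slope = zo_directional(loss, theta, direction, eps)
    return [t - lr * slope * d for t, d in zip(theta, direction)]


def toy_loss(theta):
    # Stand-in for a safety-alignment loss; purely illustrative.
    return sum(t * t for t in theta)
```

Because every update is built from the loss at perturbed parameters, the training signal itself reflects behavior under parameter noise, which is the link to robustness that the abstract points to.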

2605.29394 2026-05-29 cs.AI

EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics

EvoMD-LLM:学习反应分子动力学中化学物种演化的语言

Zhichen Tang, Zhengzheng Dang, Yulin Chen, Jixin Wu, Haiwen Li, Yanming Wang

发表机构 * Global College, Shanghai Jiao Tong University(上海交通大学全球学院) Global Institute of Future Technology, Shanghai Jiao Tong University(上海交通大学未来技术全球研究院)

AI总结 提出EvoMD-LLM框架,将反应分子动力学轨迹离散化为符号时间序列,通过时间脚手架机制使自回归大语言模型学习物种组成演化,在多项时间预测任务上优于基线模型,并能生成可解释性预测。

Comments 17 pages, ACL Findings

详情
AI中文摘要

虽然大型语言模型(LLM)在静态科学推理方面表现出色,但它们在建模动态物理过程的时间结构方面存在困难。我们提出了EvoMD-LLM(进化分子动力学大型语言模型),这是一个将物种级分子动力学重新表述为符号时间语言建模问题的框架。反应分子动力学轨迹被离散化为分子事件序列,其中每个标记代表一个化学物种及其持续时间,通过高效微调使标准自回归LLM能够学习随时间的组成演化。EvoMD-LLM的一个关键组成部分是时间脚手架,它将事件持续时间视为显式语言标记,并作为结构化归纳偏置,与传统的序列建模方法相比,显著减少了无效或幻觉的分子输出。我们在多个时间预测任务上评估了EvoMD-LLM,达到了高达66.14%的准确率,并始终优于序列神经网络和基于语言的基线。除了定量改进,我们定性地观察到,该模型能够通过结合相关化学知识为其预测生成解释,尽管它没有经过配对轨迹-解释数据的显式监督。这些结果表明,符号时间语言建模为将LLM应用于动态物理模拟提供了有效框架。

英文摘要

While large language models (LLMs) excel at static scientific reasoning, they struggle to model the temporal structure of dynamic physical processes. We present EvoMD-LLM (Evolutionary Molecular Dynamics Large Language Model), a framework that reformulates species-level molecular dynamics as a symbolic temporal language modeling problem. Reactive MD trajectories are discretized into sequences of molecular events, where each token represents a chemical species augmented with its persistence duration, enabling standard autoregressive LLMs to learn compositional evolution over time through efficient fine-tuning. A key component of EvoMD-LLM is temporal scaffolding, which treats event duration as an explicit linguistic token and serves as a structured inductive bias, significantly reducing invalid or hallucinated molecular outputs compared to conventional sequence modeling approaches. We evaluate EvoMD-LLM on multiple temporal prediction tasks, achieving up to 66.14% accuracy and consistently outperforming sequential neural networks and language-based baselines. Beyond quantitative improvements, we qualitatively observe that the model is capable of generating interpretations for its own predictions by incorporating relevant chemical knowledge, even though it was not explicitly supervised with paired trajectory-explanation data. These results demonstrate that symbolic temporal language modeling provides an effective framework for grounding LLMs in dynamic physical simulations.
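One way to picture "a chemical species augmented with its persistence duration" is run-length encoding of a per-frame species label. The sketch below is only an illustration: the `<dur=...>` token format, the frame-count unit, and the example species sequence are hypothetical and not taken from the paper.

```python
from itertools import groupby


def to_event_tokens(species_per_frame):
    """Collapse consecutive identical species into species-plus-duration tokens."""
    return [
        f"{species}<dur={len(list(run))}>"
        for species, run in groupby(species_per_frame)
    ]


# Hypothetical per-frame species labels for one tracked fragment.
FRAMES = ["CH4", "CH4", "CH4", "CH3", "CH3", "CH2O"]
```

For `FRAMES` this yields three events: `CH4` lasting 3 frames, `CH3` lasting 2, and `CH2O` lasting 1. Making the duration part of the token gives an autoregressive model an explicit handle on timing rather than leaving it implicit in repeated symbols.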

2605.29390 2026-05-29 cs.CV

Orthogonal Negative Guidance in Attention Feature Space for Text-to-Image Generation

注意力特征空间中的正交负引导用于文本到图像生成

Jungmin Ko, Jungwon Park, Jimyeong Kim, Changin Choi, Wonseok Lee, Wonjong Rhee

发表机构 * Interdisciplinary Program in Artificial Intelligence, Seoul National University(人工智能交叉学科项目,首尔国立大学) Research Institute for Convergence Science, Seoul National University(融合科学研究所,首尔国立大学) Artificial Intelligence Institute, Seoul National University(人工智能研究所,首尔国立大学) Department of Intelligence and Information, Seoul National University(智能与信息系,首尔国立大学) Daegu Gyeongbuk Institute of Science and Technology(大邱庆北科学技术院) Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd(三星先进技术研究所,三星电子公司)

AI总结 提出一种基于注意力特征空间的正交负引导方法,通过正交化负提示注意力特征与正提示特征并仅减去正交分量,在无需训练的情况下有效抑制不需要的概念,同时保持图像质量和提示对齐。

Comments Preprint

详情
AI中文摘要

文本到图像(T2I)模型生成高质量图像的能力日益增强。然而,强制显式地避免指定对象或属性仍然是一个根本性的难题。现有方法,包括提示否定、事后编辑和负引导,对于显式概念抑制仍显不足,常常无法移除目标概念或降低整体图像质量。为此,我们提出了注意力特征空间中的正交负引导方法,这是一种无需训练的方法,在基于MM-DiT的T2I变换器的注意力输出空间中操作。我们的方法将负提示注意力特征相对于正提示特征进行正交化,并仅减去正交分量,从而在保留期望语义的同时抑制不需要的概念。在FLUX-dev和FLUX-schnell上的实验表明,我们的方法在概念抑制、提示对齐和图像质量之间取得了有利的权衡。在人工评估中,我们的方法比第二好的基线高出18.78%。我们进一步展示了该方法支持多概念抑制和可调概念抑制。

英文摘要

Text-to-image (T2I) models have become increasingly capable of generating high-quality images. Yet, enforcing the explicit absence of a specified object or attribute remains a fundamentally challenging problem. Existing approaches, including prompt negation, post-hoc editing, and negative guidance, remain insufficient for explicit concept suppression, often failing to remove the target concept or degrading overall image quality. To this end, we propose Orthogonal Negative Guidance in attention feature space, a training-free method that operates in the attention output space of MM-DiT-based T2I transformers. Our method orthogonalizes negative-prompt attention features with respect to positive-prompt features and subtracts only the orthogonal component, suppressing unwanted concepts while preserving desired semantics. Experiments on FLUX-dev and FLUX-schnell show that our method achieves favorable trade-offs between concept suppression, prompt alignment, and image quality. In human evaluation, our method outperforms the second-best baseline by 18.78%. We further show that our method supports multi-concept suppression and adjustable concept suppression.
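The orthogonalization step can be written as a vector projection: remove from the negative-prompt feature its component along the positive-prompt feature, then subtract what is left. A minimal per-vector sketch follows; the `weight` parameter and the use of plain lists in place of attention-output tensors are assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonal_component(negative, positive):
    """Part of `negative` that is orthogonal to `positive`."""
    denom = dot(positive, positive)
    if denom == 0.0:
        return list(negative)  # nothing to project onto
    scale = dot(negative, positive) / denom
    return [n - scale * p for n, p in zip(negative, positive)]


def guided_feature(positive, negative, weight=1.0):
    """Subtract only the orthogonal part of the negative feature."""
    ortho = orthogonal_component(negative, positive)
    return [p - weight * o for p, o in zip(positive, ortho)]
```

Since the subtracted vector is orthogonal to the positive feature, the component of the result along the positive direction is unchanged. That is one way to read the claim that desired semantics are preserved while the unwanted concept is suppressed.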

2605.29387 2026-05-29 cs.LG cs.AI stat.ML

On the Optimizer Dependence of Neural Scaling Laws

神经缩放定律的优化器依赖性

Vansh Ramani, Shourya Vir Jain

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Delhi(计算机科学与工程系,印度理工学院德里)

AI总结 通过随机特征回归实验,发现优化器类型系统性地影响神经缩放定律中的缩放指数α,预条件优化器产生更陡峭的缩放,并提供了光谱诊断预测高级优化器的收益。

详情
AI中文摘要

神经缩放定律 $L(N) \propto N^{-α}$ 中的缩放指数 $α$ 通常被视为由架构和数据确定的固定常数。我们提出证据表明 $α$ 系统性地依赖于优化器。在受控的随机特征回归实验——神经缩放的理论框架——中,我们测量了五种优化器变体和六种光谱条件下的 $α$。预条件优化器一致地产生更陡峭的缩放(更大的 $α$),且 $α$ 的偏移在大部分测试光谱范围内增加,在 $s = 1.5$ 附近达到峰值,并在 $s = 2.0$ 时保持较大。在 $s \approx 1.0$(自然语言的特征)时,完全自然梯度达到 $α\approx 0.31$,而梯度下降为 $α\approx 0.12$——拟合指数大 $2.6$ 倍,在随机特征模型中,该差异随模型规模加倍而累积。这种指数偏移是否以及如何迁移到大规模 LLM 训练中——近期证据表明优势可能随规模减弱——仍是一个重要的开放问题。我们的结果表明,缩放定律预测应考虑优化器选择,并且我们提供了一个光谱诊断来预测高级优化器何时会带来收益。

英文摘要

The scaling exponent $α$ in neural scaling laws $L(N) \propto N^{-α}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $α$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $α$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $α$), with the $α$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $α\approx 0.31$ versus $α\approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
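To make the compounding claim concrete: under $L(N) \propto N^{-α}$, doubling $N$ multiplies the loss by $2^{-α}$. The sketch below plugs in the two exponents reported at $s \approx 1.0$. Starting both optimizers from the same loss is an assumption made purely for illustration.

```python
def loss_ratio_per_doubling(alpha):
    """Factor by which L(N) = c * N**(-alpha) shrinks when N doubles."""
    return 2.0 ** (-alpha)


ALPHA_NATURAL_GRADIENT = 0.31  # reported at s ~ 1.0
ALPHA_GRADIENT_DESCENT = 0.12  # reported at s ~ 1.0

exponent_ratio = ALPHA_NATURAL_GRADIENT / ALPHA_GRADIENT_DESCENT


def relative_loss_after(doublings):
    """Natural-gradient loss relative to gradient descent after k doublings,
    assuming both start from the same loss (an assumption, not a reported fact)."""
    ng = loss_ratio_per_doubling(ALPHA_NATURAL_GRADIENT) ** doublings
    gd = loss_ratio_per_doubling(ALPHA_GRADIENT_DESCENT) ** doublings
    return ng / gd
```

Each doubling shrinks the loss by a factor of about 0.81 with $α = 0.31$ versus about 0.92 with $α = 0.12$, and `exponent_ratio` rounds to the 2.6 quoted in the abstract. As the abstract notes, whether this holds beyond the random-feature model is an open question.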

2605.29380 2026-05-29 cs.LG cs.AI cs.CV

TRACER: Persistent Regularization for Robust Multimodal Finetuning

TRACER: 用于鲁棒多模态微调的持久正则化

Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani

发表机构 * School of Computing and Information Systems (CIS), Faculty of Engineering and IT (FEIT), University of Melbourne, Australia(墨尔本大学计算机科学与信息系统学院(CIS)、工程与信息技术学院(FEIT))

AI总结 提出TRACER方法,通过加权移动平均教师实现持久正则化,解决多模态对比微调中的灾难性遗忘和EMA坍缩问题,提升分布外鲁棒性。

Comments ICML 2026

详情
AI中文摘要

微调预训练多模态模型的主流策略通常会降低分布外(OOD)鲁棒性,这种现象被称为灾难性遗忘。在本文中,我们为多模态对比微调开发了一个理论框架,为每种策略提供了闭式解和几何分解。该框架表明,自蒸馏在保留预训练模型知识方面比其他正则化方法更有效。我们的分析揭示了一个被广泛忽视的局限性:在鲁棒微调中广泛使用的标准指数移动平均(EMA)教师存在坍缩问题。为了解决这个问题,我们证明加权移动平均(WMA)教师在有限时间范围内保持持久的正则化力,并在任务子空间中实现无偏收敛,同时保留正交知识。这些见解促使了**TRACER**(**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization)的提出,它将对比学习与WMA引导的多视角蒸馏相结合。在CLIP微调上的大量实验表明,在三种骨干架构上,OOD准确率和校准性能持续提升,全面的消融实验证实TRACER既有理论依据,又对超参数选择具有鲁棒性。代码可在https://github.com/HesamAsad/TRACER获取。

英文摘要

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at https://github.com/HesamAsad/TRACER.

2605.29379 2026-05-29 cs.CL cs.LG

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

BrahmicTokenizer-131K:一种可替代o200k_base的印度文字兼容分词器

Rohan Shravan

发表机构 * The School of AI(人工智能学院)

AI总结 提出BrahmicTokenizer-131K,一种131072词汇量的字节级BPE分词器,通过两阶段改造在保持非印度文字性能的同时,显著提升印度文字的压缩效率。

Comments 24 pages, 15 tables, 3 code listings. Tokenizer artifact, verification scripts, and reproduction code at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K and https://github.com/theschoolofai/BrahmicTokenizer-131K

详情
AI中文摘要

我们提出了BrahmicTokenizer-131K,一种131,072词汇量的字节级BPE分词器,它在131K词汇量类别中弥合了印度文字(Brahmic)的压缩差距,同时保留了OpenAI的o200k_base在英语、欧盟语言和代码方面的压缩性能。我们通过两阶段改造构建了它:(1)脚本剪枝裁剪,通过移除九个不相关书写系统将200,019个令牌减少到131,072个;(2)外科手术式改造,通过线性规划分配在九个印度文字Unicode块中填充2,372个语料库中缺失的词汇槽位。预分词器、解码器和继承的合并规则与o200k_base保持不变,使得BrahmicTokenizer-131K在分词器接口上成为即插即用的替代品。 在2700万份公开印度语预训练文本(28.4亿词,46.21 GB)上,BrahmicTokenizer-131K在相同词汇预算下产生的令牌比Mistral-Nemo Tekken / Sarvam-m少26.7%,每种语言的节省幅度从15.79%(泰米尔语)到76.79%(奥里亚语,压缩比4.31倍)。奥里亚语的优势在机制上可解释为Tekken/Sarvam-m包含零个奥里亚语块令牌;我们的改造添加了725个。在非印度语内容上,BrahmicTokenizer-131K与o200k_base的英语词汇生育率相当(1.235 vs 1.232令牌/词),并在HumanEval、MBPP和GSM8K上比Tekken/Sarvam-m好4.0-14.2%。在我们的14个分词器基准测试中,它是唯一一个在131K预算下同时在印度文字、英语、欧盟语言、代码和数学上具有竞争力的分词器。其他词汇类别的专用分词器(Sarvam-30B、Sarvam-1、MUTANT-Indic)以牺牲非印度语性能为代价实现了更好的印度语压缩:Sarvam-1的英语词汇生育率比我们差15.9%,其代码/数学压缩比我们差26-33%。我们在Apache 2.0许可下发布该工件,地址为https://huggingface.co/theschoolofai/BrahmicTokenizer-131K。

英文摘要

We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K.
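The two headline quantities are related by simple arithmetic: fertility is tokens per word, and a fractional token saving $s$ relative to a baseline corresponds to a compression ratio of $1/(1-s)$. The helper names below are hypothetical; only the percentages come from the abstract.

```python
def fertility(num_tokens, num_words):
    """Average tokens per word; lower means better compression."""
    return num_tokens / num_words


def compression_ratio(token_savings):
    """Baseline tokens divided by new tokens, given fractional savings."""
    if not 0.0 <= token_savings < 1.0:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - token_savings)
```

Plugging in the Odia figure, `compression_ratio(0.7679)` rounds to 4.31, which matches the "4.31x" quoted alongside the 76.79% saving. The Tamil saving of 15.79% corresponds to roughly 1.19x.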

2605.29378 2026-05-29 cs.RO

Decentralized LLM-Driven Coordination of Acoustic Robots for Contactless Object Manipulation

去中心化LLM驱动的声学机器人协调用于非接触式物体操控

Yingying Wang, Narsimlu Kemsaram, Sriram Subramanian

发表机构 * Department of Computer Science, University College London(计算机科学系,伦敦大学学院) Department of Artificial Intelligence, University of Malaya(人工智能系,马来亚大学)

AI总结 提出一种去中心化框架,利用Whisper语音识别和LLM语义解析将自然语言指令转换为多机器人任务计划,实现声学机器人的非接触式物体操控,实验验证了顺序、并行和同步协作任务的有效性。

Comments This paper has been accepted for publication in the Proceedings of the 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), August 17-21, 2026, Shenyang, China

Abstract

Natural language interfaces can simplify interaction with multi-robot systems, especially when non-expert users need to issue high-level commands. Acoustic manipulation using ultrasonic phased arrays also enables contactless object handling for applications such as healthcare, laboratory automation, and precision transport. However, combining large language models (LLMs) with distributed acoustic mobile robots remains underexplored. This paper presents a decentralized framework for natural language-driven coordination of acoustic robots for contactless object manipulation. The system converts spoken instructions into executable multi-robot task plans using Whisper-based speech recognition, LLM-based semantic parsing, structured JSON task representation, and distributed scheduling. The JSON schema encodes robot assignments, temporal dependencies, spatial constraints, and synchronization requirements for sequential, parallel, and synchronized execution. The system is implemented on two TurtleBot3-based acoustic robots, each equipped with an ultrasonic phased array for contactless object transport. Experiments were conducted in three scenarios: sequential execution, parallel multi-robot transport, and synchronized cooperative manipulation. The system achieved task success rates of 96 percent for sequential tasks, 86 percent for parallel execution, and 70 percent for synchronized collaborative transport. These results show that natural language commands can be transformed into distributed robot actions for contactless manipulation, highlighting the potential of LLM-driven automation for human-robot interaction in distributed robotic systems.
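A minimal sketch of how a JSON task plan of this kind could be represented and ordered. The abstract does not give the actual schema, so every field name here (`id`, `robot`, `action`, `target`, `after`, `sync_with`), the robot names, and the wave-based ordering are assumptions for illustration, not the authors' scheduler.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical plan: two parallel transports, then one synchronized
# cooperative transport that needs both robots at once.
PLAN_JSON = """
{
  "tasks": [
    {"id": "t1", "robot": "robot_a", "action": "transport",
     "target": [0.5, 0.0], "after": [], "sync_with": null},
    {"id": "t2", "robot": "robot_b", "action": "transport",
     "target": [0.0, 0.5], "after": [], "sync_with": null},
    {"id": "t3", "robot": "robot_a", "action": "co_transport",
     "target": [1.0, 1.0], "after": ["t1", "t2"], "sync_with": "t4"},
    {"id": "t4", "robot": "robot_b", "action": "co_transport",
     "target": [1.0, 1.0], "after": ["t1", "t2"], "sync_with": "t3"}
  ]
}
"""


def execution_waves(plan: dict) -> list[list[str]]:
    """Group task ids into waves. Every task in a wave has all of its
    `after` dependencies completed, so a wave can run in parallel."""
    tasks = {t["id"]: t for t in plan["tasks"]}
    sorter = TopologicalSorter({tid: set(t["after"]) for tid, t in tasks.items()})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        # Simplifying assumption: synchronized partners share the same
        # dependencies, so they must become ready in the same wave.
        for tid in ready:
            partner = tasks[tid]["sync_with"]
            if partner is not None and partner not in ready:
                raise ValueError(f"{tid} and {partner} cannot start together")
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(json.loads(PLAN_JSON))
print(waves)  # [['t1', 't2'], ['t3', 't4']]
```

Sequential execution corresponds to chained `after` entries, parallel execution to tasks sharing a wave on different robots, and synchronized execution to paired `sync_with` tasks released together.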

2605.29368 2026-05-29 cs.CL cs.AI

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui, Huawei Feng, Linlin Wang

Affiliations: East China Normal University; City University of Hong Kong

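AI summary: Proposes SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning, with a novel memory design that manages long-term patient histories and short-term working summaries; it outperforms baseline LLMs and existing medical multi-agent frameworks on five perioperative tasks.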

Comments preprint

Abstract

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collaborative decision-making, and provide transparent, auditable reasoning across the entire perioperative workflow. Although web-based Large Language Models (LLMs) possess advanced reasoning capabilities, they are ill-equipped for surgical applications due to critical limitations: input length constraints, incomplete memory management, and limited traceability. To address this issue, we present SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning with clinical guidelines and biomedical literature. SURGENT features a novel memory design that manages both long-term patient histories and short-term working summaries, enabling more complete, contextualized, and consistent reasoning. Experimental evaluations across five key perioperative tasks - case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance - show that SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks, yielding recommendations more closely aligned with patient histories. Ablation studies further highlight the advantage of DeepSeek as a locally deployable backbone model, enabling privacy-preserving deployment without reliance on centralized services. These results position SURGENT as a practical and trustworthy advancement toward intelligent, equitable, and secure surgical assistance systems.

2605.29367 2026-05-29 cs.CL cs.CY cs.SI

Attention Asymmetry in AI Layoff Discourse on X: A Computational Analysis of Capital vs Labour Amplification

Joy Bose

Affiliations: Independent Researcher

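AI summary: Using tweets collected from X with an account-level collection method, finds that capital discourse is amplified 3.12x relative to labour discourse, with a 2.69x asymmetry persisting after normalising for follower count; introduces the Amplification Ratio and Amplification Normalisation Index as metrics of platform-level discourse inequality.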

Comments 18 pages, 3 figures, 9 tables

Abstract

When workers lose jobs to AI-driven restructuring, two very different conversations happen on X (formerly Twitter) at the same time. Tech executives and AI researchers talk about productivity, transformation, and opportunity. Laid-off workers and labour critics talk about job loss, uncertainty, and fear. This paper asks a simple question: which conversation gets more reach? We report three studies using two collection methods and 763 tweets from 20 named public accounts. Study 1 used keyword-based collection (n=392) and found no significant difference between corpora (p=0.891), revealing that keyword search is too noisy for this task. Study 2 used account-based collection (n=96) and found a 3.12x mean amplification advantage for capital discourse over labour discourse (p=0.000003, Cohen's d=0.555). Study 3 combined both methods (n=763) and confirmed the finding at 4.18x mean and 10.77x median amplification ratio (p<0.000001). Critically, after normalising for follower count, the asymmetry persists at 2.69x (p=0.000009, Cohen's d=0.491), demonstrating that the effect is not simply a consequence of capital accounts having larger audiences. The finding is robust across all tested amplification metric weightings. We introduce the Amplification Ratio and Amplification Normalisation Index as simple metrics for measuring platform-level discourse inequality. A cross-platform replication on Reddit (n=647 posts) did not replicate the finding, suggesting the asymmetry may be specific to X's account-based amplification architecture. We discuss the methodological implications for cross-platform discourse analysis.
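The abstract names its metrics without defining them, so the following is only a plausible sketch of how a raw and a follower-normalised amplification ratio could be computed. The engagement formula, the equal weights, the per-follower normalisation, and all tweet values are assumptions invented for illustration; they are not the paper's definitions or data. Cohen's d uses the standard pooled-standard-deviation form.

```python
from math import sqrt
from statistics import mean, stdev


def engagement(tweet: dict, weights=(1.0, 1.0, 1.0)) -> float:
    """Hypothetical amplification score: weighted sum of interactions."""
    w_like, w_retweet, w_reply = weights
    return (w_like * tweet["likes"]
            + w_retweet * tweet["retweets"]
            + w_reply * tweet["replies"])


def amplification_ratio(capital: list, labour: list, stat=mean) -> float:
    """Ratio of a summary statistic (mean by default) between corpora."""
    return stat(capital) / stat(labour)


def cohens_d(a: list, b: list) -> float:
    """Standardised mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                  / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


# Toy data, made up for this example only.
capital_tweets = [
    {"likes": 20, "retweets": 8, "replies": 2, "followers": 1000},
    {"likes": 40, "retweets": 15, "replies": 5, "followers": 2000},
    {"likes": 60, "retweets": 25, "replies": 5, "followers": 3000},
]
labour_tweets = [
    {"likes": 6, "retweets": 3, "replies": 1, "followers": 500},
    {"likes": 12, "retweets": 6, "replies": 2, "followers": 1000},
    {"likes": 20, "retweets": 8, "replies": 2, "followers": 1500},
]

cap_raw = [engagement(t) for t in capital_tweets]
lab_raw = [engagement(t) for t in labour_tweets]
raw_ratio = amplification_ratio(cap_raw, lab_raw)

# Assumed normalisation: engagement per follower of the posting account.
cap_norm = [engagement(t) / t["followers"] for t in capital_tweets]
lab_norm = [engagement(t) / t["followers"] for t in labour_tweets]
norm_ratio = amplification_ratio(cap_norm, lab_norm)

print(raw_ratio, norm_ratio, cohens_d(cap_raw, lab_raw))
```

In this toy data the normalised ratio is smaller than the raw ratio but still above 1, which is the pattern the abstract reports: audience size explains part of the gap, not all of it.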