arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.15647 2026-06-16 cs.AI cs.CV cs.RO New submission

Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action

Cheng Zhang, Qing Cai, Xingzheng Wu, Xun Yang, Xiaojun Chang, Bingkun Bao, Liqiang Nie, Xinwang Liu, Yi Yang

Affiliations: School of Information Science and Engineering, Ocean University of China; Innovation School of Artificial Intelligence, Hefei University of Technology; School of Information Science and Technology, University of Science and Technology of China; School of Computer Science and Information Engineering, Hefei University of Technology; School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen); College of Computer Science and Technology, National University of Defense Technology; ReLER Laboratory, CCAI, Zhejiang University

AI summary: This survey systematically reviews the core components of medical embodied AI, emphasizing the coordinated integration of perception, decision-making, and action, and analyzes the challenges of clinical practice and directions for future work.

Comments 19 pages, 9 figures

Abstract

Foundation models have demonstrated impressive performance in enhancing healthcare efficiency across a wide range of medical applications. Nevertheless, their limited ability to perceive, understand, and interact with the physical world significantly constrains their effectiveness in real-world clinical workflows, where safety-critical decision-making and physical execution are tightly coupled. Recently, embodied artificial intelligence (AI) has emerged as a promising physical-interactive paradigm for intelligent healthcare, enabling agents to operate in complex medical environments. As research in this area rapidly expands, understanding how intelligent agents function as integrated, end-to-end systems in clinical environments becomes increasingly critical. However, existing surveys on medical embodied AI largely emphasize individual aspects or functional components, lacking a unified system-level organization of the field. To support and consolidate recent advances, we systematically survey the core components of medical embodied AI, with a particular emphasis on the coordinated integration of perception, decision-making, and action. We further review representative medical applications and relevant datasets, and we analyze the major challenges encountered in real-world clinical practice. Finally, we discuss key directions for future research in this rapidly evolving field. The associated project can be found at https://github.com/VMVLab/Medical_Embodied_AI_Paper_List.

2606.15646 2026-06-16 cs.AI New submission

NeuroSymbolic AI for Legal AI-TRISM: Trustworthy, Reliable, Interpretable, Safe Models

Deepa Tilwani, Yash Saxena, Ankur Padia, Srinivasan Parthasarathy, Manas Gaur

Affiliations: Department of Computer Science, AI Institute, University of South Carolina; Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County; Department of Computer Science and Engineering, The Ohio State University

AI summary: To address the lack of interpretable reasoning and the tendency to hallucinate in legal-domain LLMs, the paper proposes the TRISM framework, which combines NeuroSymbolic AI with LLMs and improves trustworthiness through structured legal knowledge integration and a RAG-based verification mechanism.

Abstract

Large Language Models (LLMs) have transformed natural language processing, but their lack of interpretable reasoning and tendency to hallucinate pose significant challenges for legal applications. While LLMs show promise for legal text analysis and generation, they struggle with accurate citation attribution and precedent verification. For example, in legal contexts, a single incorrect precedent can jeopardize a case. Current approaches to improve LLM reliability in legal domains suffer from two key limitations: inadequate integration of structured legal knowledge during training or fine-tuning, and insufficient verification mechanisms for generated legal content. To address these challenges, we propose the TRISM (Trustworthy, Reliable, Interpretable, Safe Models) framework, which integrates NeuroSymbolic AI principles with LLMs to leverage both neural learning capabilities and symbolic reasoning over structured legal knowledge. The TRISM approach addresses the above limitations while maintaining interpretable decision pathways. Our framework formalizes the extraction of symbolic knowledge from legal textual documents and incorporates Retrieval-Augmented Generation (RAG) as a core component for grounding LLM outputs in verified legal sources. In this position paper, we make the following contributions: (1) an analysis of the limitations of AI in law; (2) RASOR RAG, which lays foundations for neurosymbolic RAG by generating explicit, interpretable rationales that could be formalized into symbolic representations; (3) a formalized methodology for creating symbolic legal knowledge bases that support both interpretable reasoning and output verification in LLMs; and (4) the TRISM framework for integrating symbolic legal knowledge with LLMs.

2606.15645 2026-06-16 cs.RO New submission

TO-SoFiT: Topology Optimization of Hydraulic Soft Fish Tail Design for programmable undulating locomotion

A Padmaprabhan, Amal Shaji, Prabhat Kumar

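Affiliations: Indian Institute of Technology Hyderabad

AI summary: Proposes a topology optimization method to automatically design a hydraulic soft fish tail, balancing deformation efficiency, fluid-structure interaction, manufacturability, and stiffness; the design achieves tunable undulation amplitude and multi-axis bending and outperforms a conventional rectangular tail.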

Comments Accepted for publication at the Advances in Robotics (AIR), 2025, IIT Jodhpur

Abstract

Soft robots leverage compliant materials to generate motion through controlled elastic deformation, making them ideal for delicate tasks such as underwater exploration and biomimetic marine systems. Although hydraulic/pneumatic actuation remains pivotal for such systems, the lack of systematic design frameworks has hindered the development of robots capable of complex 3D motion, such as fish-like swimming. This work introduces a topology optimization method to automate the design of a hydraulic soft fish tail, explicitly addressing the design-dependent coupling between fluidic actuation and structural deformation. We use a Darcy law-based model augmented with a drainage term to simulate spatially varying hydraulic pressure loads, translating these into consistent nodal forces via finite element analysis. The employed robust multi-criteria optimization formulation balances deformation efficiency, fluid-structure interaction, geometric manufacturability, and required stiffness for optimizing a bioinspired soft fish tail for 3D swimming kinematics. The optimized tail topology is incorporated into a pneumatic network actuator and computationally validated under various hydraulic loads, achieving tunable undulatory amplitudes and multiaxis bending for depth adjustment. The optimized 2D tail outperforms its rectangular counterpart. By cascading optimized tail segments, we demonstrate programmable swimming patterns in soft robotic fish tails at different hydraulic loads. This work advances the systematic codesign of hydraulic actuators and soft structures, offering a pathway to automate underwater robots with optimized design and vertebrate-like agility in confined aquatic environments. Our implementations and simulations are publicly available at 'https://github.com/PrabhatIn/TO-SoFiT'.

2606.15643 2026-06-16 cs.CL New submission

Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

Gili Lior, Tzviel Frostig, Gabriel Stanovsky, Matan Eyal

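Affiliations: Google Research; The Hebrew University of Jerusalem; PhaseV Trials

AI summary: Proposes the Multilingual-IRT framework, which introduces per-language difficulty deviations, split discriminability that separates content from language effects, and per-language ability residuals to address linear evaluation cost, translation errors, and the conflation of general with culture-specific knowledge in multilingual benchmarks; on MMLU-Pro-X it achieves better prediction and error detection.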

Abstract

Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: exhaustive evaluation scales linearly with the number of languages, automatic translation introduces errors that are easily missed at scale, and some items conflate general and culture-specific knowledge. We address all three with a unified statistical framework, Multilingual-IRT, which extends Item Response Theory with per-language difficulty deviations, split discriminability separating content from language effects, and per-language ability residuals. Fitting Multilingual-IRT on 25 LLMs across 29 languages of MMLU-Pro-X, we show that its fitted parameters support three practical applications: predicting unobserved (item, LLM, language) instances with 11-16% lower binary cross-entropy than the strongest accuracy-based baseline, surfacing candidate translation errors distributed across all 28 non-English languages, whereas accuracy-based baselines concentrate detections in a few languages, and recovering culture-specific items that accuracy-based baselines miss.
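To make the parameterization concrete, the following is a minimal sketch of a two-parameter logistic IRT response function extended with a per-language difficulty deviation and a per-language ability residual. The functional form, the name `p_correct`, and every numeric value are illustrative assumptions; the abstract does not state the paper's exact model, and the split discriminability is simplified here to a product of a content factor and a language factor.

```python
import math

def p_correct(theta, ability_residual, a_content, a_language, b, b_deviation):
    """Probability that a model answers an item correctly in one language.

    Assumed form (not taken from the paper):
    sigmoid(a_content * a_language * ((theta + residual) - (b + deviation)))
    """
    ability = theta + ability_residual        # per-language ability
    difficulty = b + b_deviation              # per-language item difficulty
    logit = a_content * a_language * (ability - difficulty)
    return 1.0 / (1.0 + math.exp(-logit))

def binary_cross_entropy(y, p, eps=1e-12):
    """Loss used to score a predicted probability against a 0/1 outcome."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Same item and model, in the source language and in a harder translation
p_source = p_correct(theta=0.5, ability_residual=0.0,
                     a_content=1.2, a_language=1.0, b=0.0, b_deviation=0.0)
p_translated = p_correct(theta=0.5, ability_residual=-0.1,
                         a_content=1.2, a_language=1.0, b=0.0, b_deviation=0.8)
```

Under this assumed form, an unusually large `b_deviation` for a single language is the kind of signal that could flag a candidate translation error for that item.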

2606.15642 2026-06-16 cs.LG cs.AI New submission

CIWI-CKT: Chaos-Informed Wave Interference Feature Fusion and Cross-City Knowledge Transfer for Traffic Flow Forecasting

Abdul Joseph Fofanah, Lian Wen, David Chen, Shaoyang Zhang

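Affiliations: Griffith University; School of Information and Communication Technology, Griffith University; School of Information Engineering, Chang’an University

AI summary: For cross-city, data-scarce scenarios, proposes the CIWI-CKT framework, which combines chaos-informed wave generation, meta-interference processing, and chaos-aware meta-learning, significantly improving prediction accuracy while reducing data requirements.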

Abstract

Accurate traffic flow prediction remains challenging in cross-city, data-scarce scenarios where limited historical data hinders model generalisation. The chaotic nature of traffic dynamics, complex spatio-temporal dependencies, and heterogeneous urban networks complicate few-shot learning across cities. Existing deep learning approaches either treat traffic as purely deterministic or lack mechanisms to model wave-like interference patterns essential for cross-regime traffic dynamics. To address these limitations, this paper proposes CIWI-CKT, a novel Chaos-Informed Wave Interference Feature Fusion framework with Cross-City Knowledge Transfer. Our framework introduces three core innovations: chaos-informed wave generation that extracts measurable chaos invariants and models traffic as adaptive wave components; meta-interference processing that captures wave interactions between support and query regimes while producing a predictability score for confidence estimation; and chaos-aware meta-learning that enables efficient cross-city knowledge transfer while preserving chaotic characteristics. We establish theoretical guarantees including chaos-to-wave stability, wave-induced dimension reduction, and meta-learning generalisation bounds. Extensive experiments on four real-world traffic datasets demonstrate that CIWI-CKT significantly outperforms state-of-the-art spatio-temporal graph learning, transfer learning, prompt-based, and few-shot methods, improving prediction accuracy while substantially reducing required training data.

2606.15640 2026-06-16 cs.LG New submission

Multi-Agent Framework for Audit Risk Assessment with Explicit Uncertainty and Evidence Conflict Modeling

Yuhan Wang, Manqing Wang, Yixuan Lu, Zhaoyue Peng, Shengda Lin

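Affiliations: Columbia University; Trine University; University of Sofia; University of Illinois at Urbana-Champaign; Westcliff University

AI summary: Proposes the UMAR framework, in which three specialized agents assess risk independently with calibrated uncertainty, and Dempster-Shafer theory fuses their scores and measures conflict; it outperforms baseline models on an SEC 10-K dataset and provides interpretable risk signals.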

Abstract

Audit risk assessment increasingly benefits from combining heterogeneous evidence sources, yet existing approaches typically produce point predictions without quantifying how well different evidence streams agree. We propose UMAR (Uncertainty-Aware Multi-Agent Risk Assessment), a framework that employs three specialized agents: an MD&A Text Agent, a Financial Ratio Agent, and a CAM Agent, each producing independent risk scores with calibrated uncertainty estimates. An Uncertainty Aggregator based on Dempster-Shafer evidence theory fuses these scores while explicitly measuring inter-agent conflict. We evaluate UMAR on a U.S. dataset of 3,200 firm-year observations from SEC 10-K filings (2019-2023), with financial restatement as the target label. Experimental results show that UMAR achieves an AUROC of 0.782 and a PR-AUC of 0.341, outperforming logistic regression, XGBoost, FinBERT, and single-agent and dual-agent LLM baselines. UMAR attains the lowest expected calibration error (ECE = 0.052) among all methods and identifies evidence-conflict patterns that correlate with actual restatement risk, offering auditors potentially actionable and interpretable risk signals.
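The fusion step can be illustrated with Dempster's rule of combination, which merges two mass functions and reports the conflict mass K between them. This is a generic sketch of the rule, not UMAR's aggregator: the two-hypothesis frame, the agent names, and the mass values are made-up assumptions.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule for two mass functions keyed by frozenset hypotheses.

    Returns (combined_mass, conflict), where conflict is the total mass
    assigned to pairs of hypotheses with an empty intersection.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    norm = 1.0 - conflict
    return {h: w / norm for h, w in combined.items()}, conflict

RISK = frozenset({"restatement"})
SAFE = frozenset({"no_restatement"})
EITHER = RISK | SAFE  # mass left on the whole frame expresses uncertainty

# Hypothetical outputs of two agents that partly disagree
text_agent = {RISK: 0.6, SAFE: 0.1, EITHER: 0.3}
ratio_agent = {RISK: 0.2, SAFE: 0.5, EITHER: 0.3}

fused, k = combine(text_agent, ratio_agent)
```

Because the rule is associative, a third agent's mass function can be folded in by calling `combine` again on the fused result.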

2606.15637 2026-06-16 cs.LG New submission

HAPI-EP: Towards Hybrid, Adaptive, and Predictive Digital Twins of Cardiac Electrophysiology

Sumeet Vadhavkar, Xiajun Jiang, Yubo Ye, Maryam Toloubidokhti, Linwei Wang

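Affiliations: Rochester Institute of Technology

AI summary: Proposes the HAPI framework, which builds identifiable, strongly predictive digital twins of cardiac electrophysiology through a physics-integrated gray-box model, fast adaptation via meta-learning, and a conditional generative model.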

Abstract

A digital twin (DT) of a patient-specific heart offers significant potential in personalized medicine. However, its rapid and dynamic adaptation to an individual's live data and its predictive capability after adaptation remain central challenges. We examine this challenge from its two building blocks: DT formulation where mechanistic and data-driven models show competing merits and limitations, and DT optimization strategies that are largely driven by a reconstruction objective leading to unidentifiable models. We address both bottlenecks via HAPI -- an AI framework for building hybrid, adaptive, and predictive DTs with three key enablers. First, HAPI constructs a physics-integrated gray-box model in which an interpretable mechanistic backbone is augmented by a neural component that models its residual to the observed data. Second, rather than attempting to pre-encode all possible variations in a static hybrid model, HAPI enables rapid on-the-fly adaptation of the hybrid model to few-shot live data, achieved by feedforward meta-learners realizing amortized inference of both mechanistic and neural parameters of the hybrid model trained with predictive objectives. Finally, we show that this adaptivity corresponds to the construction of a conditional generative model (i.e., the hybrid DT) that endows it with theoretical identifiability and thus strong performance in predictive scenarios. We demonstrate the proof-of-concept of HAPI in cardiac electrophysiology using a hybrid monodomain model with mechanistic reaction kinetics and neural graph diffusion. Across synthetic and real-data studies, we show that HAPI's mechanistic-neural hybridization and predictive adaptation are critical for obtaining identifiable DTs with strong predictive and out-of-distribution capabilities.

2606.15631 2026-06-16 cs.RO cs.AI New submission

Retrieve, Don't Retrain: Extending Vision Language Action Models to New Tasks at Test Time

Jeongeun Park, Juhan Park, Taekyung Kim, Sungjoon Choi, Dongyoon Han, Sangdoo Yun

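Affiliations: NAVER AI Lab; Korea University

AI summary: Proposes a retrieval-augmented policy that is trained once and then frozen; at deployment it adapts to new tasks simply by adding retrieval data, with no per-task fine-tuning, and it outperforms baselines in cross-embodiment generalization.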

Comments https://recap-robot.github.io/

Abstract

Extending a vision-language-action (VLA) policy to a new task typically requires task-specific teleoperated demonstrations and per-task fine-tuning, making adaptation costly in both data collection and compute. In this paper, we show that this target-side per-task adaptation cost can be replaced by retrieval. Our retrieval-augmented policy is trained once on paired demonstrations from the target embodiment (query) and a cheaper embodiment (pool, e.g., human-hand video), then frozen. New tasks are added at deployment by appending pool-side demonstrations to a retrieval pool. The frozen policy conditions on retrieved trajectories at every control step, so new tasks are absorbed by indexing data rather than updating parameters. Fine-tuning is needed only to take on a new, unseen embodiment, not for each new task. We show that retrieval improves policies beyond a specific backbone, including standard VLA policies, but its effect is especially pronounced in Cosmos Policy, a video-generation-based world-action model (WAM). In this setting, retrieval supplies coarse task progression, while the WAM's future-image objective provides an additional visual consistency signal that strengthens the retrieval-conditioned actions. On PushT, we study how retrieval provides a reusable high-level motion prior for cross-embodiment generalization to unseen goal angles, while on RoboTwin 2.0 our method outperforms cross-embodiment baselines on unseen tasks, and we additionally demonstrate the method on a real robot.
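The idea that new tasks are absorbed by indexing data rather than by updating parameters can be sketched with a toy retrieval pool. Everything here is hypothetical: the `RetrievalPool` class, the three-dimensional embeddings, and the trajectory names are illustrative stand-ins, and cosine-similarity lookup is an assumed retrieval rule, not necessarily the one the paper uses.

```python
import math

def cosine(u, v):
    """Cosine similarity; assumes both embeddings are non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

class RetrievalPool:
    """Stores (embedding, trajectory id) pairs; no model parameters involved."""

    def __init__(self):
        self.entries = []

    def add(self, embedding, trajectory_id):
        self.entries.append((embedding, trajectory_id))

    def retrieve(self, query, k=1):
        ranked = sorted(self.entries,
                        key=lambda entry: cosine(query, entry[0]),
                        reverse=True)
        return [trajectory_id for _, trajectory_id in ranked[:k]]

pool = RetrievalPool()
pool.add([1.0, 0.0, 0.0], "push_block_left")
pool.add([0.0, 1.0, 0.0], "open_drawer")

# Adding a new task at deployment is just one more indexed demonstration
pool.add([0.0, 0.0, 1.0], "stack_cups")

# At a control step, the frozen policy would condition on the retrieved trajectory
nearest = pool.retrieve([0.1, 0.0, 0.9], k=1)
```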

2606.15629 2026-06-16 cs.CV New submission

XPASS-Vis: A Dataset for Cross-Domain Personalized Image Aesthetic Assessment

Takato Hayashi, Hiroaki Takahara, Candy Olivia Mawalim, Hiromi Narimatsu, Akisato Kimura, Shiro Kumano, Shogo Okada

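Affiliations: Japan Advanced Institute of Science and Technology; Communication Science Laboratories, NTT, Inc.

AI summary: Introduces XPASS-Vis, the first dataset for cross-domain personalized image aesthetic assessment, covering three domains (art, fashion, and landscape) with 6,526 stimuli rated by 129 annotators; it establishes baseline models for transferring personalized aesthetic preferences across domains and finds that unsupervised domain adaptation recovers roughly 60% of the supervised upper bound.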

Abstract

Personalized image aesthetic assessment (PIAA) seeks to model, at the individual level, the subjective nature of aesthetic judgments toward artworks and photographs. Aesthetic preference is known to be both deeply personal and partially consistent across visual domains. Yet existing PIAA datasets and methods are largely confined to a single domain, or provide too few samples per annotator within each domain to enable personalization across domains. Consequently, the cross-domain generalization of personalized aesthetic preferences remains largely unexplored. To address this gap, we introduce XPASS-Vis, the first dataset explicitly designed for cross-domain PIAA. XPASS-Vis comprises 6,526 stimuli from three visual domains -- art, fashion, and landscape -- rated by 129 annotators, yielding 87,836 user-stimulus interactions, each annotated with an overall aesthetic score and nine aesthetic-emotion ratings. Notably, each annotator rated more than 200 stimuli per domain, providing sufficient per-domain coverage to support personalization both within and across domains. Moreover, we establish baseline models for cross-domain PIAA under unsupervised domain adaptation (UDA), where a model trained on a labeled source domain is transferred to an unlabeled target domain. A systematic evaluation of representative UDA approaches shows that the best-performing method recovers approximately 60% (Spearman's ρ = .28) of the supervised upper bound under a fully unsupervised setting. This provides encouraging evidence that personalized aesthetic preferences are, to a meaningful extent, transferable across visual domains. At the same time, a substantial gap remains, highlighting the need for PIAA-specific adaptation strategies. XPASS-Vis and the accompanying baselines provide a foundation for future research on cross-domain PIAA. All datasets and code will be made publicly available upon acceptance.

2606.15625 2026-06-16 cs.LG cs.NI New submission

Conflict-Aware Federated Fine-Tuning of Large Language Models with Mixture-of-Experts

Yijun Lu, Zihan Fang, Pengpeng Qiao, Zheng Lin, Jing Yang, Yuxin Zhang, Por Lip Yee, Zhe Chen, Jun Luo

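Affiliations: Nanyang Technological University; University of Malaya

AI summary: To address conflicting expert optimization caused by data heterogeneity when Mixture-of-Experts models are trained with federated learning, proposes the FC-MoE framework, which uses importance weighting, gradient consensus projection, and a local knowledge retention mechanism to stabilize optimization and improve model performance in non-IID settings.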

Comments 6 pages, 4 figures

Abstract

The continuous scaling of large language models (LLMs) incurs prohibitive computational costs, making Mixture-of-Experts (MoE) a scalable alternative for efficient fine-tuning via sparse activation. While federated learning (FL) emerges as the paradigm for privacy-preserving collaborative optimization, integrating MoE into FL under data heterogeneity may trigger conflicting expert optimizations. Client-specific data distributions force same-indexed experts to optimize under inconsistent or even conflicting feature-label correlations. This mismatch induces destructive interference during aggregation, thus destabilizing the optimization trajectory and degrading model performance. To address this issue, we propose FC-MoE, a federated conflict-aware framework for MoE fine-tuning. It employs an importance-aware weighting scheme to prioritize reliable local updates and utilizes gradient consensus projection to suppress conflicting updates, ensuring a stable global optimization path. Moreover, a local knowledge retention mechanism further preserves specialized client expertise by re-anchoring domain-specific residuals. Extensive experiments demonstrate that FC-MoE accelerates convergence and enhances both global and local model performance in non-IID federated environments.
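One plausible reading of "importance-aware weighting" plus "gradient consensus projection" is sketched below. It is a PCGrad-style stand-in, not FC-MoE's actual procedure, which the abstract does not specify: each client update that points against the weighted consensus direction has its opposing component removed before averaging. The function names, weights, and two-dimensional updates are assumptions for illustration.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def project_out_conflict(update, reference):
    """Remove the component of `update` that opposes `reference`.

    Updates that already agree with the reference (non-negative dot
    product) are returned unchanged.
    """
    d = dot(update, reference)
    if d >= 0:
        return list(update)
    scale = d / dot(reference, reference)
    return [u - scale * r for u, r in zip(update, reference)]

def weighted_mean(vectors, weights):
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

def aggregate(updates, weights):
    """Importance-weighted average after projecting away conflicts."""
    consensus = weighted_mean(updates, weights)
    projected = [project_out_conflict(u, consensus) for u in updates]
    return weighted_mean(projected, weights)

# Hypothetical updates for one same-indexed expert from three clients;
# the third client's update conflicts with the other two
updates = [[1.0, 0.0], [1.0, 0.2], [-1.0, 0.1]]
weights = [1.0, 1.0, 1.0]

plain = weighted_mean(updates, weights)
merged = aggregate(updates, weights)
```

In this toy case the plain average is pulled back by the conflicting client, while the projected average keeps more of the direction the majority agrees on.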

2606.15623 2026-06-16 cs.LG cs.AI New submission

Surprise-Guided MergeSort: Budget-Efficient Human-in-the-Loop Ranking via Adaptive Comparison Scheduling

Yujin Park, Haejun Chung, Ikbeom Jang

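Affiliations: Hanyang University; Hankuk University of Foreign Studies

AI summary: Proposes the Surprise-Guided MergeSort (SGS) framework, which uses a vision-language model (VLM) as a question prioritizer and an adaptive budget allocator to route highly ambiguous comparisons to humans; on six benchmarks it improves Kendall's τ×100 by 6 to 12 points under the same budget.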

Comments 16 pages

Abstract

Pairwise comparison is the gold standard for subjective ranking tasks; however, exhaustive annotation requires a massive number of human comparisons (O(n^2)). While sorting-based methods have reduced this burden to O(n log n), they still require expensive human judgment for every single comparison. To further improve annotation efficiency, we propose leveraging a Vision-Language Model (VLM) not as an annotator replacement, but as a question prioritizer to identify which comparisons genuinely require human judgment. The proposed Surprise-Guided MergeSort (SGS) framework achieves this through three integrated components: (1) a bottom-up MergeSort scheduler that structures comparisons and exploits transitivity, (2) a composite Surprise Scorer -- combining position-bias-cancelled VLM confidence, Elo gap, and vote entropy -- to quantify comparison ambiguity, and (3) an adaptive budget allocator that routes high-surprise pairs to humans while automating low-surprise pairs via transitivity inference. Validation was conducted on six diverse benchmarks spanning text similarity (STS-B, BIOSSES, SICKR-STS) and image quality assessment (KonIQ-10k, TID2013, LIVE Challenge). SGS effectively identified and skipped up to 535 non-informative comparisons per session. Consequently, it achieved Kendall's τ×100 improvements of +6 to +12 over Active Elo under the same total budget. These results demonstrate that combining VLM-guided surprise metrics with algorithmic sorting provides a generally consistent accuracy-efficiency trade-off across diverse domains.
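The routing idea can be sketched as a bottom-up merge sort whose comparator sends only ambiguous pairs to a human. This is a simplified illustration under stated assumptions: the surprise score here is just the closeness of two hypothetical model scores, whereas the paper combines VLM confidence, Elo gap, and vote entropy; `Item`, `RoutedComparator`, the threshold, and all scores are invented for the example.

```python
from collections import namedtuple

# true_score stands in for human judgment, model_score for the automated signal
Item = namedtuple("Item", ["name", "true_score", "model_score"])

def merge(left, right, comes_first):
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if comes_first(left[i], right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

def bottom_up_merge_sort(items, comes_first):
    """Merge runs of length 1, 2, 4, ... until a single run remains."""
    runs = [[x] for x in items]
    while len(runs) > 1:
        paired = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                paired.append(merge(runs[k], runs[k + 1], comes_first))
            else:
                paired.append(runs[k])
        runs = paired
    return runs[0] if runs else []

class RoutedComparator:
    """Routes a comparison to the human only when its surprise is high."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.human_calls = 0
        self.model_calls = 0

    def surprise(self, a, b):
        # Hypothetical ambiguity score: close model scores mean high surprise
        return 1.0 - min(1.0, abs(a.model_score - b.model_score))

    def __call__(self, a, b):
        if self.surprise(a, b) > self.threshold:
            self.human_calls += 1
            return a.true_score <= b.true_score
        self.model_calls += 1
        return a.model_score <= b.model_score

items = [Item("D", 4.0, 4.2), Item("B", 2.0, 2.1),
         Item("C", 2.2, 2.0), Item("A", 1.0, 1.0)]
compare = RoutedComparator(threshold=0.5)
ranking = [item.name for item in bottom_up_merge_sort(items, compare)]
```

In this toy run the model misorders B and C, but that pair is also the only one with high surprise, so it is the only comparison that consumes human budget.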

2606.15621 2026-06-16 cs.LG cs.CL New submission

Re-feeding Is Not Replaying: Measuring Replay Noise in Counterfactual Token-Credit Estimation

Nils Matteson

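Affiliations: Northeastern University

AI summary: Using a three-pass experimental design, measures the noise introduced by re-feeding the prefix in counterfactual token-credit estimation, finds that re-feeding changes credit estimates at rates above the replica noise floor, and recommends resuming decoder state or using batch-invariant kernels.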

Comments 10 pages, 3 figures. Code, per-pivot data, logs, and registration: https://github.com/thaw-ai/thaw (benchmarks/, paper/refeed-drift/)

Abstract

Per-token counterfactual credit estimation asks which token in a language-model rollout caused the final answer to be right or wrong: cut the transcript at a pivot, substitute an alternative token, replay continuations, and compare outcomes. Published methods re-feed the transcript prefix as a fresh prompt, assuming this reproduces the state the model passed through during generation. We measure what that assumption costs on a stock inference engine, with a three-pass design: continuations resumed from the verified decode-time KV state, an identical second exact pass (a replica noise floor), and a re-feed pass. Across six configurations and three models (including a GRPO-trained checkpoint), at low-margin decision tokens, re-feeding changes the credit estimate at rates 14-28 percentage points above the replica floor (7-21pp under a treatment-independent conditioning; problem-clustered t = 2.9-6.4). Most changes are zero-boundary crossings of the quantized estimator rather than polarity reversals, and the perturbation is consistent with mean-zero, so averaged quantities are largely safe; but selection is not: a critical-token set chosen by thresholding $|\hat{A}_t|$ under re-feed overlaps the exact-resume selection at Jaccard 0.34-0.90, versus a 0.63-0.96 replica ceiling. A causal confirmation closes the loop: under vLLM's batch-invariant kernels all three passes are identical on every measured channel, with both disagreement rates exactly zero. Replica passes themselves disagree on 9-23% of eligible estimates: single-sample credit measurements at decision tokens are unreliable under any replay. Settings were fixed in advance; exact-pass cache hits in the second campaign are instrumented (100% hit rate, 3,434 pivots); total compute was under 10 USD. We recommend that counterfactual credit studies resume decoder state or use batch-invariant kernels, and report a replica floor.

2606.15615 2026-06-16 cs.LG cs.CV New submission

MoECa: Aligning Feature Reuse with Expert Decomposition in Diffusion Transformers

Maoliang Li, Haojing Chen, Jiayu Chen, Zihao Zheng, Xinhao Sun, Hailong Zou, Xiang Chen

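Affiliations: School of Computer Science, Peking University; School of Software Engineering, University of Electronic Science and Technology of China

AI summary: To cut redundant computation across timesteps in DiT-MoE, proposes MoECa, a fine-grained caching framework at the expert-branch level that reuses features per branch and adds expert-aware adaptive control and synchronized cache updates; it achieves up to 2.83× speedup on multiple models with minimal quality loss.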

Comments under review

Abstract

Diffusion Transformers with Mixture-of-Experts (DiT-MoE) improve model capacity under sparse activation, but diffusion inference is still bottlenecked by redundant computation across timesteps. Existing caching methods mainly operate at the token level, which becomes suboptimal in DiT-MoE because each token update is internally decomposed into multiple routed expert branches. Our analysis shows that cross-timestep redundancy in DiT-MoE is better characterized at the expert-branch level than at the whole-token level. Based on this observation, we propose MoECa, a fine-grained caching framework that performs branch-level feature reuse across timesteps. MoECa further introduces expert-aware adaptive control and synchronized cache updates across MoE and attention paths to maintain stable intermediate states. Experiments on multiple DiT-MoE models show that MoECa consistently achieves a better speed-quality trade-off than prior caching methods, with up to 2.83$\times$ inference speedup and minimal quality degradation.

2606.15611 2026-06-16 cs.CV cs.AI 新提交

Mutual Distillation of Dual-Foundation Models for Semi-Supervised PET/CT Segmentation

双基础模型的相互蒸馏用于半监督PET/CT分割

Fuyou Mao, Beining Wu, Yanfeng Jiang, Bohan Xu, Lixin Lin, Naye Ji, Hao Zhang, Yan Tang

发表机构 * Central South University(中南大学) Hangzhou Dianzi University(杭州电子科技大学) Communication University of Zhejiang(浙江传媒学院) Northeastern University(东北大学)

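AI Summary Proposes the MuDuo framework, which distills knowledge from SAM-Med3D (for CT) and SegAnyPET (for PET) into a lightweight student network for semi-supervised organ segmentation, reaching state-of-the-art performance on the AutoPET dataset with only 5 labeled cases.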

Comments MICCAI 2026

详情
AI中文摘要

PET/CT的器官分割对于肿瘤学中的定量分析和放疗计划至关重要。为了降低PET/CT分割的高标注成本,半监督学习(SSL)为使用有限标注数据开发深度模型提供了一种实用且有效的解决方案。视觉基础模型的最新发展展示了显著的适应性和更高的效率。在这项工作中,我们提出了一个相互蒸馏框架,该框架无缝地利用了结构性和功能性基础模型,这些模型作为模态特定的通才,从结构性CT和代谢性PET成像中蒸馏知识。通过弥合学生模型的任务特定精度与通才基础模型的分割先验之间的差距,我们提出了MuDuo,一个相互蒸馏框架,协同利用SAM-Med3D用于CT和SegAnyPET用于PET,将它们的知识蒸馏到一个轻量级学生网络中。我们的方法消除了手动提示的需要,同时最大化未标注数据在自动分割中的效用,在AutoPET数据集上仅使用5个标注案例就达到了最先进的性能。我们的源代码可在https://github.com/Wu-beining/MuDuo获取。

英文摘要

Organ segmentation from PET/CT is critical for quantitative analysis and radiotherapy planning in oncology. To ease the high annotation cost of PET/CT segmentation, semi-supervised learning (SSL) provides a practical and effective solution for developing deep models with limited labeled data. Recent developments in visual foundation models have demonstrated remarkable adaptability with improved efficiency. In this work, we propose a mutual distillation framework that seamlessly exploits both structural and functional foundation models, which act as modality-specific generalists for distilling knowledge from structural CT and metabolic PET imaging. By bridging the gap between the task-specific precision of student models and the segmentation priors of generalist foundation models, we propose \textbf{MuDuo}, a mutual distillation framework that synergistically leverages SAM-Med3D for CT and SegAnyPET for PET to distill their knowledge into a lightweight student network. Our approach eliminates the need for manual prompts while maximizing the utility of unlabeled data for automatic segmentation, achieving state-of-the-art performance on the AutoPET dataset with only 5 labeled cases. Our source code is available at https://github.com/Wu-beining/MuDuo.

2606.15610 2026-06-16 cs.CL astro-ph.IM cs.AI cs.LG 新提交

LLM Judges Have Dark Current: A Psychometric Datasheet for LLM-as-a-Judge Evaluation

LLM 裁判具有暗电流:LLM 作为裁判评估的心理测量数据表

Hiroyasu Usami, Keisuke Hara, Ayato Tsuboi, Naohiko Matsuda

发表机构 * Chubu University(中部大学) Mitsubishi Heavy Industries, Ltd., Research & Innovation Center(三菱重工业株式会社研究创新中心)

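AI Summary Proposes a Judge Datasheet protocol that measures the dark current and biases of LLM judges through true-vacuum inputs, surface variation, positional preference, and related probes, exposing their properties as measurement instruments.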

Comments 22 pages, 4 figures

详情
AI中文摘要

LLM 作为裁判的系统现在常规用于开放式模型评估,其中人类偏好标注成本高、速度慢且难以复现。然而,这些裁判通常被报告为标量准确率、胜率或一致性指标。我们认为,裁判应被报告为测量仪器。我们引入了一个裁判数据表协议,该协议测量在真实真空输入下的暗电流、对相同质量表面变化的稳定交叉敏感性、位置虚假偏好、在受控质量阶梯上的目标敏感性,以及由平局指令引发的标准或操作点。方向-稳定性分解揭示,明显的 Delta0 偏好可能是稳定的表面响应或伪装的位置偏差。在一个三裁判开放权重案例研究中,Llama-3.1-8B 显示出高暗电流和呈现冲突的 Delta0 行为,Qwen2.5-14B 是真空清洁且对目标敏感,但混合了稳定和位置过度判别,而 Qwen2.5-32B 是真空清洁,具有低稳定交叉敏感性和低位置虚假偏好。严格的平局标准消除了 Qwen32B 的 Delta0 虚假偏好,但将边缘 Delta1 目标信号吸收为平局,同时保留了 Delta5 敏感性。结果表明,提示移动的是标准,而不是分辨率。我们并不声称激发这项工作的下游机制假设已得到确认;贡献是在做出下游声明之前测量测量仪器的计量协议。

英文摘要

LLM-as-a-judge systems are now routinely used for open-ended model evaluation, where human preference annotation is costly, slow, and difficult to reproduce. Yet these judges are often reported as scalar accuracy, win-rate, or agreement devices. We argue that a judge should instead be reported as a measurement instrument. We introduce a Judge Datasheet protocol that measures dark current under true-vacuum inputs, stable cross-sensitivity to same-quality surface variation, positional false preference, target sensitivity on a controlled quality ladder, and the criterion or operating point induced by tie instructions. The direction-stability decomposition reveals that apparent Delta0 preference can be stable surface response or disguised position bias. In a three-judge open-weight case study, Llama-3.1-8B shows high dark current and presentation-conflicted Delta0 behavior, Qwen2.5-14B is vacuum-clean and target-sensitive but mixes stable and positional over-discrimination, and Qwen2.5-32B is vacuum-clean with low stable cross-sensitivity and low positional false preference. A strict tie criterion eliminates Qwen32B Delta0 false preference but absorbs marginal Delta1 target signals into ties while preserving Delta5 sensitivity. The results show that prompting moves the criterion, not the resolution. We do not claim that the downstream mechanism hypothesis that motivated this work is confirmed; the contribution is a metrological protocol for measuring the measuring device before downstream claims are made.
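One way to picture the distinction between a stable response and a disguised position bias is to show the same pair to the judge in both orders. The sketch below is an assumption-laden illustration, not the paper's protocol: the verdict labels (`"first"`, `"second"`, `"tie"`), the function `classify_pair`, and the sample verdicts are all hypothetical.

```python
from collections import Counter


def classify_pair(verdict_ab, verdict_ba):
    """Classify one item pair judged in both presentation orders.

    verdict_ab: winning slot when item A is shown first.
    verdict_ba: winning slot when item B is shown first.
    """
    if verdict_ab == "tie" and verdict_ba == "tie":
        return "tie"
    if "tie" in (verdict_ab, verdict_ba):
        return "mixed"
    if verdict_ab == verdict_ba:
        # The same slot wins no matter which item occupies it.
        return "positional"
    # Different slots win, so the same item wins in both orders.
    return "stable"


# Hypothetical verdicts for four same-quality pairs.
pairs = [("first", "first"), ("first", "second"), ("tie", "tie"), ("second", "second")]
summary = Counter(classify_pair(ab, ba) for ab, ba in pairs)
positional_rate = summary["positional"] / len(pairs)
print(summary, positional_rate)
```

On same-quality pairs, both non-tie outcomes are false preferences; the split only says whether the preference follows the item or the slot.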

2606.15608 2026-06-16 cs.CV 新提交

On the Adversarial Robustness of Multimodal LLM Judges

多模态大语言模型评判器的对抗鲁棒性

Zihan Wang, Guansong Pang, Zelin Liu, Wenjun Miao, Jin Zheng, Xiao Bai

发表机构 * School of Computer Science and Engineering, Beihang University(北京航空航天大学计算机科学与工程学院) State Key Laboratory of Virtual Reality Technology and System, Beihang University(北京航空航天大学虚拟现实技术与系统国家重点实验室) State Key Laboratory of Software Development Environment, Jiangxi Research Institute, Beihang University(北京航空航天大学江西研究院软件开发环境国家重点实验室) School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算机与信息系统学院)

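AI Summary Proposes the RobustMLLMJudge framework for evaluating the adversarial robustness of multimodal LLMs used as judges, and designs the MGSIA attack, which combines semantic induction with high-score manifold alignment to generate transferable score-inflating perturbations that expose the judges' vulnerability.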

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地被用作自动评判器,例如用于图像质量和安全评估。然而,它们的对抗鲁棒性在很大程度上尚未被探索,威胁到自动评判的公平性和可靠性。为弥补这一差距,我们引入了RobustMLLMJudge,这是第一个用于评估通用MLLM在充当评判器时对抗鲁棒性的通用框架。它涵盖了针对质量与安全评估场景中主流评判方法的各种攻击。利用RobustMLLMJudge,我们发现:i) 不同的MLLM评判器极易受到分数膨胀的对抗攻击;ii) 尽管这些攻击方法有效,但由于MLLM评判器评估协议中的独特约束,它们面临关键挑战。我们进一步提出了MGSIA,即流形引导语义诱导攻击,这是一种绕过这些约束的新方法,能够对MLLM评判器实施更有效且可迁移的攻击。MGSIA的核心思想是将肯定性语义诱导与高分流形对齐相结合:它最大化评判器对二元语义查询产生肯定性响应(例如“是”)的概率,同时将对抗性表示正则化到从代理协议估计的高分中心附近。这些目标共同产生可迁移的分数膨胀扰动。大量实验证明了MGSIA在不同评估场景下欺骗先进MLLM评判器的优越性和泛化能力,凸显了对鲁棒MLLM评判器的需求。代码和数据将在https://github.com/mala-lab/RobustMLLMJudge提供。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly used as automated judges, e.g., for image quality and safety assessment. However, their adversarial robustness remains largely unexplored, threatening the fairness and reliability of automated judging. To bridge this gap, we introduce RobustMLLMJudge, the first general framework for evaluating the adversarial robustness of general-purpose MLLMs when functioning as judges. It covers diverse attacks against popular judge approaches across quality and safety evaluation scenarios. Using RobustMLLMJudge, we reveal that i) different MLLM judges are highly vulnerable to score-inflating adversarial attacks; and ii) although effective, these attack methods face a critical challenge due to unique constraints in the evaluation protocols of MLLM judges. We further propose MGSIA, namely Manifold-Guided Semantic Induction Attack, a novel method that bypasses these constraints to enable more effective and transferable attacks on MLLM judges. The core idea of MGSIA is to combine affirmative semantic induction with high-score manifold alignment: it maximizes the probability that judges yield affirmative responses (e.g., "Yes") to binary semantic queries, while regularizing adversarial representations toward high-score centers estimated from proxy protocols. Together, these objectives yield transferable score-inflating perturbations. Extensive experiments demonstrate the superiority and generalizability of MGSIA in deceiving advanced MLLM judges under different evaluation scenarios, highlighting the need for robust MLLM judges. Code and data will be made available at https://github.com/mala-lab/RobustMLLMJudge.

2606.15598 2026-06-16 cs.AI 新提交

Integrating Reasoning and Generalization in Text-to-SQL via Self-Enhanced Fine-Tuning

通过自增强微调在Text-to-SQL中整合推理与泛化

Feng Lyu, Jinfeng Cen, Sijing Duan, Hao Wu, Shucheng Li, Weixu Zhang, Haolun Wu

发表机构 * Central South University(中南大学) Tsinghua University(清华大学) Nanjing University(南京大学) McGill University(麦吉尔大学)

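AI Summary Proposes CoTE-SQL, which combines self-enhanced reasoning traces, structured chain-of-thought prompting, and error-aware revision, reaching state-of-the-art results on Bird and strong results on Spider among methods built on open-source LLMs.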

Comments 14 pages, 13 figures, 7 tables

详情
AI中文摘要

Text-to-SQL旨在将自然语言问题转换为可执行的结构化数据库SQL查询,使非专业用户能够直观地访问数据。尽管大型语言模型(LLM)的最新进展在该任务中显示出潜力,但现有的基于LLM的方法往往难以在强大的推理能力和稳健的泛化之间取得平衡。为了解决这些局限性,我们提出了CoTE-SQL,通过三个关键创新来增强基于LLM的Text-to-SQL生成:(i)从LLM中提取的自增强推理轨迹,无需人工标注;(ii)具有模块化分解和示例检索的结构化思维链(CoT)提示;(iii)基于SQL执行反馈的错误感知修正。在Spider和Bird基准上的大量实验表明,CoTE-SQL在基于开源LLM的方法中取得了新的最先进性能,在Bird上(53.39% EX / 59.02 VES)和Spider上(79.60% EX / 77.19 VES)均表现强劲,尤其是在复杂查询上取得了显著提升。结果突出了在基于LLM的Text-to-SQL设计中结合自增强、结构化推理和执行时反馈的有效性。

英文摘要

Text-to-SQL aims to translate natural language questions into executable SQL queries over structured databases, enabling non-expert users to access data intuitively. While recent advances in large language models (LLMs) have shown promise in this task, existing LLM-based approaches often struggle to strike a balance between strong reasoning capabilities and robust generalization. To address these limitations, we propose CoTE-SQL to enhance LLM-based text-to-SQL generation with three key innovations: (i) self-enhanced reasoning traces distilled from LLMs without human annotation, (ii) structured chain-of-thought (CoT) prompting with modular decomposition and example retrieval, and (iii) error-aware revision based on SQL execution feedback. Extensive experiments on the Spider and Bird benchmarks demonstrate that CoTE-SQL achieves new state-of-the-art performance among methods built on open-source LLMs with comparable model sizes on Bird (53.39% EX / 59.02 VES) and strong results on Spider (79.60% EX / 77.19 VES), with especially significant gains on complex queries. The results highlight the effectiveness of combining self-enhancement, structured reasoning, and execution-time feedback within an LLM-based framework for text-to-SQL design.
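The idea of revising a query from SQL execution feedback can be sketched with Python's standard `sqlite3` module. This is a minimal illustration under stated assumptions, not the CoTE-SQL implementation: the `singer` table, the sample queries, and the helper `execution_feedback` are hypothetical, and the step that hands the error message back to an LLM for revision is left out.

```python
import sqlite3


def execution_feedback(conn, sql):
    """Run a candidate query; return (True, rows) or (False, error message)."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        # The message is what a revision step could be prompted with.
        return False, str(exc)


# Hypothetical in-memory database standing in for a benchmark schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bo", 41)])

ok_bad, message = execution_feedback(conn, "SELECT nme FROM singer")
ok_good, rows = execution_feedback(conn, "SELECT name FROM singer WHERE age > 35")
print(ok_bad, message)
print(ok_good, rows)
```

A failed execution yields a concrete error string (here a misspelled column), which is more informative for a revision prompt than a bare pass/fail flag.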

2606.15597 2026-06-16 cs.CV 新提交

Fusion-E2Pulse: A Multimodal Event-RGB Fusion Network for Non-contact Pulse Wave Reconstruction

Fusion-E2Pulse:一种用于非接触式脉搏波重建的多模态事件-RGB融合网络

Qian Feng, Hao Guo, Yan Niu, Zhenhuan Xu, Yidi Li

发表机构 * College of Computer Science and Technology, Taiyuan University of Technology(太原理工大学计算机科学与技术学院)

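AI Summary Proposes Fusion-E2Pulse, a multimodal fusion network that uses RGB signals as structural priors to suppress motion artifacts and the high sensitivity of event streams to recover fine-grained morphological details, balancing noise suppression and morphological fidelity in pulse wave reconstruction.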

Comments Accepted by MICCAI 2026. The final version will appear in the official MICCAI proceedings published by Springer

详情
AI中文摘要

非接触式脉搏波重建依赖于波形形态的精确恢复,包括重搏切迹。传统的基于RGB的方法从录制的面部视频中提取生理信号,但受限于标准相机的积分成像机制,曝光过程会产生平滑效应,削弱微弱的血管搏动细节。相反,神经形态事件相机虽然对强度波动具有极高的灵敏度,但本质上容易受到微小运动引起的噪声和伪影的影响。为了利用基于帧的积分和基于事件的差分感知之间的协同作用,我们提出了一种名为Fusion-E2Pulse的新型多模态网络。该框架利用滤波后的RGB信号作为结构先验来抑制运动伪影,同时利用事件流的高灵敏度恢复细粒度的形态细节。实验结果表明,Fusion-E2Pulse达到了最先进的性能,有效平衡了噪声抑制和形态保真度,心率估计的平均绝对误差为0.78 bpm,波形相关性为0.89,收缩期相位持续时间误差为16.74 ms,验证了其在重建细粒度病理特征方面的有效性。

英文摘要

Non-contact pulse wave reconstruction hinges on the precise recovery of waveform morphology, including the dicrotic notch. Conventional Red-Green-Blue (RGB)-based methods, which extract physiological signals from recorded facial videos, are constrained by the integral imaging mechanism of standard cameras, where the exposure process induces a smoothing effect that attenuates subtle vascular pulsation details. Conversely, neuromorphic event cameras, while offering exceptional sensitivity to intensity fluctuations, are inherently susceptible to noise and artifacts induced by minor motion. To exploit the synergy between frame-based integration and event-based differential sensing, we propose a novel multimodal network named Fusion-E2Pulse. This framework utilizes filtered RGB signals as structural priors to suppress motion artifacts, while leveraging the high sensitivity of event streams to recover fine-grained morphological details. Experimental results demonstrate that Fusion-E2Pulse achieves state-of-the-art performance, effectively balancing noise suppression and morphological fidelity. It attains a mean absolute error of 0.78 bpm for heart rate estimation, a waveform correlation of 0.89, and a systolic phase duration error of 16.74 ms, validating its efficacy in reconstructing fine-grained pathological features.

2606.15594 2026-06-16 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 新提交

Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC

从像素到证明:通过并行保形鲁棒MPC实现概率安全的潜在世界模型控制

Devesh Nath, Anutam Srinivasan, Haoran Yin, Ruitong Jiang, Jeffrey Fang, Glen Chou

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

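AI Summary Proposes the SLS^2 framework, which combines conformal prediction with robust model predictive control for safe vision-based motion planning in learned latent world models, improving both goal-reaching performance and safety.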

详情
AI中文摘要

我们提出了SLS^2,一个使用鲁棒模型预测控制(MPC)在学习的潜在世界模型中进行安全反馈运动规划的框架。我们的方法训练了一个动作条件的联合嵌入世界模型,具有紧凑的马尔可夫潜在状态,通过学习的潜在动力学实现高效的基于梯度的轨迹优化。为了在潜在预测不完美的情况下确保真实系统的安全性,我们采用保形预测来通知GPU加速的系统级综合(SLS)鲁棒MPC方案,以获得校准的潜在误差界限和鲁棒的潜在空间约束集。我们还学习并保形化了一个潜在约束检查器,使SLS规划器能够在闭环执行期间施加概率安全约束。我们在基于视觉的控制任务上评估了我们的方法,与潜在世界模型和安全规划基线相比,它提高了目标到达性能和安全性。

英文摘要

We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.
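To make "calibrated latent error bounds" concrete, the sketch below shows the standard split-conformal quantile, which is one common way such a bound is obtained. It is an assumption that this matches the paper's calibration procedure; the calibration errors, the miscoverage level `alpha`, and the function `conformal_bound` are hypothetical.

```python
import math


def conformal_bound(scores, alpha):
    """Split-conformal bound: the ceil((n + 1) * (1 - alpha))-th smallest score.

    Under exchangeability, a fresh score falls below this bound with
    probability at least 1 - alpha.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration samples for this alpha
    return sorted(scores)[k - 1]


# Hypothetical latent prediction errors measured on held-out calibration data.
calibration_errors = [0.12, 0.05, 0.30, 0.08, 0.21, 0.17, 0.09, 0.26, 0.14]
bound = conformal_bound(calibration_errors, alpha=0.2)
print(bound)
```

A robust planner could then tighten its latent-space constraints by this bound; how SLS^2 does so is not specified in the abstract.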

2606.15592 2026-06-16 cs.CV 新提交

DenseControl: Instance-Level Controllable Synthesis of Dense Crowd Image

DenseControl: 密集人群图像的实例级可控合成

Juncheng Wang, Lei Shang, Wang Lu, Baigui Sun, Shujun Wang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Tongyi Lab, Alibaba Group(阿里巴巴集团通义实验室) Tsinghua University(清华大学)

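AI Summary Proposes the DenseControl pipeline, which uses an Isolated Object Embedding map and an Implicit Scale Embedding strategy to precisely control instance position and size, along with background, style, and attributes, in dense crowd images, achieving state-of-the-art synthesis quality and demonstrating downstream applications.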

Comments Accepted to IEEE TMM

详情
AI中文摘要

在本文中,我们介绍了DenseControl,一种用于生成密集人群图像的新型管道。具体来说,DenseControl精心定位和缩放每个生成的实例,以精确对齐预定义的坐标和尺度。在此基础上,我们进一步允许控制背景、风格和实例属性。DenseControl的动机源于对合成人群图像中两个主要挑战的观察:控制信号嵌入和在传递实例尺度指导时保持拓扑完整性。为了解决这些问题,我们首先引入了隔离对象嵌入(IOE)图,这是一种新颖的表示,有助于空间位置控制,同时减轻模型学习投影的困难。其次,我们提出了一种隐式尺度嵌入(ISE)策略,该策略与IOE图无缝集成,以编码精确的尺度信息。为了进一步增强ISE与IOE图结合的效果,我们引入了一种位置快捷机制,增强交叉注意力以缓解投影挑战。我们通过两个角度评估DenseControl:合成质量和在潜在应用中的适用性。不同控制条件下的实验表明,DenseControl在密集人群图像合成中达到了最先进的结果。此外,我们展示了在数据稀缺下增强人群分析、迁移学习和天气泛化场景中的应用,以突出DenseControl的实际效用。代码库将发布。

英文摘要

In this paper, we introduce DenseControl, a novel pipeline for generating dense crowd images. Specifically, DenseControl meticulously positions and sizes each generated instance to align precisely with the predefined coordinates and scales. Based on this, we further allow for control over the background, style, and attributes of instances. The motivation behind DenseControl stems from the observation of two main challenges in synthesizing crowd images: controlling signal embedding and maintaining topological integrity when imparting instance scale guidance. To address these, we first introduce the Isolated Object Embedding (IOE) map, a novel representation that facilitates spatial location control while mitigating the difficulty the model has in learning projections. Second, we propose an Implicit Scale Embedding (ISE) strategy that seamlessly integrates with the IOE map to encode precise scale information. To further enhance the efficacy of combining ISE with the IOE map, we incorporate a Position Shortcut mechanism that enhances cross-attention to alleviate projection challenges. We evaluate DenseControl through two lenses: synthesis quality and applicability in potential applications. Experiments across different control conditions demonstrate that DenseControl achieves state-of-the-art results in dense crowd image synthesis. Furthermore, we showcase applications in augmenting crowd analysis under data scarcity, transfer learning, and weather generalization scenarios, to highlight the practical utility of DenseControl. The codebase will be released.

2606.15591 2026-06-16 cs.AI cs.CL cs.MA 新提交

Agentic Retrieval and Reinforcement Learned Equation Chains: A Controlled Generation Framework for Complex and Novel Physics Word Problems

智能检索与强化学习方程链:面向复杂新颖物理文字题的可控生成框架

Tirthankar Mittra

发表机构 * University of California, Berkeley(加州大学伯克利分校)

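AI Summary Proposes ARVRE, a two-stage framework that builds valid chains of physics equations with offline temporal-difference learning, uses agentic retrieval-augmented generation to control problem structure and difficulty, and then has a large language model write the natural-language question, producing complex, novel, and solvable physics word problems.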

详情
AI中文摘要

生成高质量、新颖、复杂且可解的物理文字题(PWPs)在教育内容生成中仍是一个具有挑战性且未被充分探索的问题。现有方法多改编自数学文字题(MWP)生成,常产生模糊、不可解或结构简单且语言多样性有限的问题。我们提出ARVRE(智能检索值强化方程链),一个用于生成多样且数学有效的PWPs的两阶段框架。在第一阶段,使用一种离线时序差分学习形式构建有效的物理方程链,同时一个智能检索增强生成(RAG)框架动态选择主题特定的概念和词汇。这种设计能够显式控制问题结构和难度。在第二阶段,大语言模型(LLM)将方程链和检索到的概念转换为自然语言的物理问题。通过将生成过程基于有效方程链,我们的方法在保持数学正确性的同时,促进了语言多样性和上下文丰富性。人工和自动评估表明,ARVRE生成的PWPs比现有方法更复杂、新颖且可解。这些结果凸显了结合强化学习、检索和LLM用于可靠生成教育物理内容的潜力。

英文摘要

Generating high-quality Physics Word Problems (PWPs) that are novel, complex, and solvable remains a challenging and underexplored problem in educational content generation. Existing approaches, many adapted from Math Word Problem (MWP) generation, often produce ambiguous, unsolvable, or structurally simple questions with limited linguistic diversity. We introduce ARVRE (Agentic Retrieval Value Reinforced Equation-chain), a two-stage framework for generating diverse and mathematically valid PWPs. In the first stage, a form of offline temporal-difference learning is used to construct valid chains of physics equations, while an agentic retrieval-augmented generation (RAG) framework dynamically selects topic-specific concepts and vocabulary. This design enables explicit control over problem structure and difficulty. In the second stage, a Large Language Model (LLM) converts the equation chain and retrieved concepts into a natural-language physics question. By grounding generation in valid equation chains, our method preserves mathematical correctness while promoting linguistic diversity and contextual richness. Human and automated evaluations demonstrate that ARVRE generates PWPs that are more complex, novel, and solvable than those produced by existing approaches. These results highlight the potential of combining reinforcement learning, retrieval, and LLMs for reliable generation of educational physics content.

2606.15590 2026-06-16 cs.CV 新提交

Unlocking Diffusion Hierarchies: Adaptive Timestep Selection for Zero-Shot Segmentation

解锁扩散层次:自适应时间步选择用于零样本分割

Ramin Nakhli, Mahesh Ramachandran, Luca Ballan

发表机构 * Google(谷歌)

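AI Summary Proposes an adaptive timestep selection mechanism that exploits the hierarchical semantic progression in the diffusion denoising process, combined with Contextual Similarity Maps that fuse high-resolution attention with U-Net features, to improve zero-shot segmentation.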

详情
AI中文摘要

零样本分割最近通过利用大规模文本到图像扩散模型(如Stable Diffusion)中的丰富视觉先验取得了显著改进。然而,当前的基于扩散的方法常常面临空间分辨率和上下文信息之间的权衡,以及依赖单一静态时间步进行特征提取的限制。为了克服这些挑战,我们的工作引入了两项关键进展。首先,我们的上下文相似度图将高分辨率注意力图与丰富的U-Net编码器特征融合,提供了细粒度且鲁棒的逐像素表示。其次,我们识别出不同扩散模型的去噪过程中存在一种涌现的层次语义进展:表示从早期时间步的部分级抽象过渡到后期阶段的物体级抽象。利用这一洞察,我们引入了一种机制来自适应地为每个像素选择最优时间步。大量实验表明,我们的方法持续优于现有的零样本分割基线,验证了将上下文特征与动态层次时间步选择相结合的有效性。

英文摘要

Zero-shot segmentation has recently shown notable improvement by leveraging the rich visual priors in large-scale text-to-image diffusion models, such as Stable Diffusion. However, current diffusion-based methods often face limitations due to the trade-off between spatial resolution and contextual information, as well as their reliance on a single static timestep for feature extraction. To overcome these challenges, our work introduces two key advancements. First, our Contextual Similarity Maps fuse high-resolution attention maps with rich U-Net encoder features, providing both fine-grained and robust per-pixel representations. Second, we identify an emergent hierarchical semantic progression within the denoising process of various diffusion models: representations transition from part-level abstractions at earlier timesteps to object-level abstractions at later stages. Leveraging this insight, we introduce a mechanism to adaptively select the optimal timestep for each pixel. Extensive experiments demonstrate that our method consistently outperforms existing zero-shot segmentation baselines, validating the efficacy of combining contextual features with dynamic, hierarchical timestep selection.

2606.15589 2026-06-16 cs.LG cs.AI 新提交

Is Code Better Than Language for Algorithmic Reasoning

算法推理中代码是否优于语言

Terry Tong, Yu Feng, Surbhi Goel, Dan Roth

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

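AI Summary By separating the intermediate representation from the execution mechanism, the paper compares code execution with natural-language reasoning on 40 tasks and finds that the advantage of code execution comes from external execution rather than from the change of representation.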

Comments ICML 2026

详情
AI中文摘要

对于工具增强的语言模型,比较自然语言推理与代码执行管道是困难的,因为比较同时改变了中间表示和执行机制。我们通过一个中间干预来分离这些因素:模型将其推理表达为可执行代码,语言模型在上下文中模拟该代码以产生答案。在40个任务的可验证算法基准上,确定性代码执行比自然语言推理高出+31.6个百分点。我们观察到中间干预与自然语言推理没有显著差异(+0.15个百分点)。这些结果表明,在我们评估的设置中,仅改变中间表示并不能解释工具使用的优势,为性能提升需要可靠的外部执行提供了证据。我们用一个简单的统计决策理论模型形式化了这一直觉,该模型刻画了在我们的解耦轨迹生成/执行机制中,执行何时主导端到端风险。我们通过一个重建干预验证了我们的理论,该干预利用代理语言模型从代码表示中推断自然语言推理轨迹,恢复了与原始自然语言推理管道相当的性能。所有实验见https://github.com/TerryTong-Git/ToolProj。

英文摘要

For tool-augmented language models, comparing natural-language reasoning with code-execution pipelines is difficult because the comparison changes both the intermediate representation and the execution mechanism. We separate these factors with an intermediate intervention: the model expresses its reasoning as executable code, and the language model simulates that code in context to produce an answer. On a 40-task verifiable algorithmic benchmark, deterministic code execution outperforms natural-language reasoning by +31.6pp. We observe that the intermediate intervention is not meaningfully different from natural-language reasoning (+0.15pp). These results suggest that, in our evaluated setting, changing the intermediate representation alone does not explain the tool-use advantage, providing evidence that the performance gains require reliable external execution. We formalize this intuition with a simple statistical decision-theoretic model that characterizes when execution dominates end-to-end risk in our disentangled trace-generation/execution regime. We validate our theory using a reconstruction intervention that leverages a proxy language model to infer natural-language reasoning traces from code representations, recovering performance comparable to the original natural-language reasoning pipeline. All experiments are at https://github.com/TerryTong-Git/ToolProj.

2606.15587 2026-06-16 cs.RO 新提交

Perfect Demo Makes Poor Teacher: Learning Robust Alignment from Critical Motion Segments

完美演示造就差劲教师:从关键运动片段学习鲁棒对齐

Mingyu Liu, Zeju Li, Jiuhe Shu, Hanqing Wang, Yuhao Chao, Hao Chen, Chunhua Shen

发表机构 * Zhejiang University(浙江大学) Shanghai Innovation Institute(上海创新研究院) Hong Kong University of Science and Technology (GZ)(香港科技大学(广州)) Nanjing University(南京大学)

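AI Summary Fluent demonstrations of fine-grained manipulation compress the critical alignment motions and leave the policy under-supervised; the paper proposes data-level resampling and the representation-level STAIR feature, which uses dense motion-aware supervision to improve policy robustness.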

详情
AI中文摘要

专家演示被广泛认为是机器人模仿学习的黄金标准。然而,对于插入、堆叠和对齐等精细操作,我们发现了一个反直觉的失败模式:流畅的演示可能是差劲的教师。熟练的遥操作员将对齐和恢复的决定性时刻压缩到一个短暂的时间窗口内,导致策略被冗余的自由空间运动淹没,并在精度决定成功的关键区域缺乏监督。我们在两个层面解决这一瓶颈。在数据层面,靠近对齐时减速和对关键片段重采样都有帮助,但收益主要来自拓宽策略必须学习的恢复状态的覆盖范围,而非重新加权已有的帧。然而,这种数据层面的修复并未触及策略的逐帧视角:单个图像仍然直接映射到动作,控制修正的局部运动仍然隐式。因此,我们转向表示层面,引入STAIR(时空特征作为机器人学习接口),这是一种紧凑的动态特征,连接视觉-语言模型和动作专家,将每个轨迹中已记录的短视运动蒸馏为稠密的、运动感知的监督。仅使用流畅数据训练,STAIR恢复了大部分精心演示带来的增益(总体从50.0%提升至62.2%,接近精心演示的64.4%)。这些结果呼吁对机器人数据采取更具教学性的视角,优化机器的可学习性而非仅考虑人类效率。

英文摘要

Expert demonstrations are widely assumed to be the gold standard for robot imitation learning. Yet for fine-grained manipulation such as insertion, stacking, and alignment, we uncover a counterintuitive failure mode: fluent demonstrations can be poor teachers. A skilled teleoperator compresses the decisive moments of alignment and recovery into a brief temporal window, leaving the policy flooded with redundant free-space motion and starved of supervision exactly where precision determines success. We address this bottleneck at two levels. At the data level, slowing down near alignment and resampling critical segments both help, yet the gain comes mainly from broadening the coverage of recovery states the policy must learn, not from reweighting frames it already has. Such data-side fixes, however, leave the policy's per-frame view untouched: a single image still maps directly to an action, and the local motion that governs correction stays implicit. We therefore turn to the representation level and introduce STAIR (\textbf{S}patio-\textbf{T}emporal feature \textbf{A}s an \textbf{I}nterface for \textbf{R}obot learning), a compact dynamic feature that bridges the vision-language model and the action expert, distilling the short-horizon motion already recorded in each trajectory into dense, motion-aware supervision. Trained on fluent data alone, STAIR recovers most of the deliberate-demonstration gain ($50.0$ to $62.2\%$ overall, approaching the $64.4\%$ of deliberate demonstrations). These results call for a more pedagogical view of robot data, optimized for machine learnability rather than human efficiency alone.

2606.15579 2026-06-16 cs.AI cs.LG cs.MA cs.SE 新提交

Your Agent Has a Genome: Sequence-Level Behavioral Analysis and Runtime Governance of LLM-Powered Autonomous Agents

你的智能体有基因组:基于序列的LLM驱动自主智能体行为分析与运行时治理

Sidi Deng

发表机构 * Independent Researcher(独立研究员)

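AI Summary Proposes the XEPV sequence-encoding framework, which models LLM agent behavior as a genome-like sequence, finds the high-risk P-X-P pattern through n-gram mining, and designs Governor, a three-layer intervention system that raises task success rate by 6.2 percentage points while cutting average token consumption by 44%.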

Comments 16 pages, 15 figures, 12 tables

详情
AI中文摘要

我们提出基础序列分析框架,该框架将LLM驱动的自主智能体的运行时行为编码为使用四个字母的字母表的紧凑符号序列:X(探索)、E(执行)、P(规划)和V(验证)。借鉴基因组序列分析的类比,我们对从生产ReAct智能体系统收集的347条真实世界执行轨迹(跨越8天)应用n-gram模式挖掘、马尔可夫转移矩阵和点二列相关分析。我们的分析揭示:(1) 三元组P-X-P是唯一统计显著的高风险模式,使成功率降低10.4%;(2) P比率是成功的最强负预测因子(r=-0.256, p<0.0001);(3) E→V转移概率仅为2.1%,表明存在系统性验证缺陷。基于这些发现,我们设计了Governor,一个三层运行时干预系统,包括规则引擎、统计累加器和基于卡方的阈值自适应器。在自然的部署前后评估中(N=101 vs. N=246),Governor使任务成功率绝对提升6.2%,同时平均token消耗减少44%。为验证跨系统通用性,我们将XEPV编码应用于SWE-bench上2000条公开SWE-agent轨迹,确认探索螺旋和E→V验证缺陷在独立系统中复现。我们概述了六个研究方向,包括基础序列语言模型、跨智能体行为指纹识别和奖励塑造,并发布开源工具包以促进可重复性。

英文摘要

We propose Base Sequence Analysis, a framework that encodes the runtime behavior of LLM-powered autonomous agents into compact symbolic sequences using a four-letter alphabet: X (Explore), E (Execute), P (Plan), and V (Verify). Drawing an analogy to genomic sequence analysis, we apply n-gram pattern mining, Markov transition matrices, and point-biserial correlation to 347 real-world execution traces collected from a production ReAct agent system over 8 days. Our analysis reveals that (1) the trigram P-X-P is the only statistically significant high-risk pattern, lowering success rate by 10.4%; (2) P-ratio is the strongest negative predictor of success (r=-0.256, p<0.0001); and (3) the E->V transition probability is only 2.1%, indicating a systemic verification deficit. Based on these findings, we design Governor, a three-layer runtime intervention system comprising a rule engine, a statistical accumulator, and a chi-square-based threshold adaptor. In a natural before/after deployment evaluation (N=101 vs. N=246), Governor achieves a +6.2% absolute increase in task success rate while simultaneously reducing average token consumption by 44%. To validate cross-system generality, we apply the XEPV encoding to 2,000 public SWE-agent trajectories on SWE-bench, confirming that exploration spirals and the E->V verification deficit replicate in an independent system. We outline six research directions including base sequence language models, cross-agent behavioral fingerprinting, and reward shaping, and release an open-source toolkit for reproducibility.
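The n-gram mining and Markov transition analysis over XEPV strings can be sketched in a few lines. This is a minimal illustration, assuming each trace is already encoded as a string over the alphabet X, E, P, V; the three sample traces and the helper names `ngram_counts` and `transition_probs` are hypothetical and not taken from the released toolkit.

```python
from collections import Counter, defaultdict


def ngram_counts(trace, n):
    """Count every contiguous n-gram in one encoded trace."""
    return Counter(trace[i:i + n] for i in range(len(trace) - n + 1))


def transition_probs(traces):
    """Row-normalized first-order transition probabilities between symbols."""
    counts = defaultdict(Counter)
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[a][b] += 1
    return {
        a: {b: c / sum(row.values()) for b, c in row.items()}
        for a, row in counts.items()
    }


# Hypothetical encoded traces (X=Explore, E=Execute, P=Plan, V=Verify).
traces = ["PXPEEV", "PEEXE", "XPXPE"]
trigrams = sum((ngram_counts(t, 3) for t in traces), Counter())
probs = transition_probs(traces)
print(trigrams["PXP"], probs["E"].get("V", 0.0))
```

Correlating per-trace features such as the P-X-P count with task success would be the next step; that statistical test is omitted here.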

2606.15577 2026-06-16 cs.AI 新提交

Large Language Models as Optimizers: A Survey of Direct vs. Tool-Augmented Approaches and Their Performance Frontiers

大型语言模型作为优化器:直接方法与工具增强方法的调查及其性能前沿

Roko Peran, Luka Hobor, Mihael Kovac, Mario Brcic

发表机构 * University of Zagreb, Faculty of Electrical Engineering and Computing(萨格勒布大学电气工程与计算学院)

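AI Summary Surveys three paradigms of LLMs as optimizers (direct, tool-augmented, and tool-creating optimization), analyzes performance frontiers and the reasoning gap, and discusses the trade-off between the future potential of direct optimization and the auditability of tool-augmented optimization.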

Comments 6 pages, 1 figure, 2 tables, accepted at 49th ICT and Electronics Convention, MIPRO - https://mipro.hr; Paper ID: #23463

详情
AI中文摘要

大型语言模型(LLM)越来越多地参与复杂的数学优化,即使触发它们的实用用户并未意识到这一点。毕竟,许多现实世界问题归结为寻找更好或最佳解决方案。LLM作为优化器的领域有三种范式:直接优化、工具增强优化和工具创造优化。直接优化使用迭代提示和启发式生成来探索解空间。工具增强优化将自然语言问题转化为形式化规范并编排外部求解器。工具创造优化更进一步,利用LLM发现可重用的算法或启发式方法,这些方法可以以零边际LLM成本部署。我们基于文献中的基准描述了当前的性能前沿。我们识别了当前架构中的关键推理差距,并论证了直接优化的未来潜力与工具增强优化的可审计性之间的权衡。即使是未来更强大的模型,也可能选择工具制造以提高重复性问题族的操作效率。

英文摘要

Large Language Models (LLMs) are increasingly involved in complex mathematical optimization, even if the pragmatic user who triggers them is unaware of it. After all, many real-world problems reduce to the search for better or the best solutions. The field of LLM-as-optimizer has three paradigms: direct optimization, tool-augmented optimization, and tool-creating optimization. Direct optimization uses iterative prompting and heuristic generation to navigate solution spaces. Tool-augmented optimization translates natural language problems into formal specifications and orchestrates external solvers. Tool-creating optimization goes further, using LLMs to discover reusable algorithms or heuristics that can be deployed at zero marginal LLM cost. We describe current performance frontiers based on the benchmarks from the literature. We identify the critical reasoning gap in current architectures and argue for trade-offs between the future potential of direct optimization and the auditability of tool-augmented optimization. Even future, more powerful models might opt for tool-making to improve operational efficiency for repetitive families of problems.

2606.15576 2026-06-16 cs.LG cs.AI 新提交

Localizing Credit at the Divergence: Path-Conditioned Self-Distillation for LLM Reasoning

在分歧处定位信用:路径条件自蒸馏用于LLM推理

Yu Li, Shu Hong, Tian Lan

发表机构 * Department of Electrical and Computer Engineering, George Washington University(乔治华盛顿大学电气与计算机工程系)

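AI Summary Proposes Hindsight Self-Distillation (HSD), which conditions the teacher on a successful peer rollout from the current training group so that a dense credit signal concentrates where failed and successful rollouts diverge, improving LLM performance on math and code reasoning tasks.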

详情
AI中文摘要

基于可验证奖励的强化学习为每次 rollout 分配一个标量,在长推理轨迹中留下了 token 级信用分配不明确的问题。同策略自蒸馏通过让同一模型作为教师,并条件于特权信息,产生密集的逐 token 信号来解决这一问题。但常见的真实答案选择仅是一个终点线索:在简短答案任务中,教师在需要路径级指导的中间位置保持沉默。我们提出后见自蒸馏(HSD),它将教师条件于从当前训练组中抽取的一个成功同伴 rollout。这样的同伴是从成功条件策略中精确采样的样本,无需额外的采样 rollout。通过提供完整的成功延续而不仅仅是最终答案,产生的信用信号集中在失败 rollout 与成功同伴之间的分歧位置。在 Qwen3-8B 和 Qwen3-32B 的数学和代码基准测试中,HSD 相比 GRPO 变体和同策略蒸馏基线获得了最佳结果,在 AIME 等简短答案任务上提升最大。

英文摘要

Reinforcement learning from verifiable rewards assigns a single scalar to each rollout, leaving token-level credit assignment underspecified in long reasoning traces. On-policy self-distillation addresses this by letting the same model act as a teacher conditioned on privileged information, producing a dense per-token signal. But the common choice of a ground-truth answer is only an endpoint cue: on terse-answer tasks, the teacher falls silent at the intermediate positions where path-level guidance matters most. We propose Hindsight Self-Distillation (HSD), which conditions the teacher on a successful peer rollout drawn from the current training group. Such a peer is an exact sample from the success-conditioned policy, requiring no additional sampled rollouts. By providing a full successful continuation rather than only the final answer, the resulting credit signal concentrates at the divergence position between a failed rollout and a successful peer. Across Qwen3-8B and Qwen3-32B on math and code benchmarks, HSD obtains the best result against GRPO variants and on-policy distillation baselines, with the largest gains on terse-answer tasks such as AIME.
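The notion of a divergence position between a failed rollout and a successful peer can be illustrated on token lists. The sketch below only locates the first differing position; it does not compute HSD's teacher signal, and it assumes, for illustration only, that the two rollouts share a common prefix. The token sequences and the function `divergence_index` are hypothetical.

```python
def divergence_index(failed, peer):
    """First position where two token sequences differ; None if identical."""
    for i, (a, b) in enumerate(zip(failed, peer)):
        if a != b:
            return i
    if len(failed) != len(peer):
        # One sequence is a strict prefix of the other.
        return min(len(failed), len(peer))
    return None


# Hypothetical tokenized rollouts for the same prompt.
failed_rollout = ["x", "=", "3", "+", "4", "=", "8"]
successful_peer = ["x", "=", "3", "+", "4", "=", "7"]

position = divergence_index(failed_rollout, successful_peer)
print(position)
```

In the method described above, the per-token signal comes from a teacher conditioned on the peer rollout, so the concentration at this position is an outcome of distillation rather than an explicit lookup like this one.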

2606.15574 2026-06-16 cs.CV New submission

Toward the Whole Picture: Accumulative Fingerprint Mapping and Reconstruction for Small-Area Mobile Sensors

Xiongjun Guan, Jianjiang Feng, Jie Zhou

Affiliations * Tsinghua University

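AI summary Addresses the mismatch between acquisition and recognition in small-area mobile fingerprint sensing with an accumulative mapping and reconstruction framework that converts a sequence of local observations into a unified fingerprint state matched only once, improving efficiency and robustness.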

English abstract

Small-area fingerprint sensing on mobile devices creates a fundamental mismatch between acquisition and recognition: each touch captures only a tiny, pose-varying local patch, while reliable biometric matching ultimately requires a stable and sufficiently complete fingerprint representation. Existing pipelines largely cope with this mismatch by treating repeated touches as independent partial templates, which leads to repeated registration, repeated matching, and no guarantee of adequate global coverage. In this paper, we advocate a different formulation, namely accumulative fingerprint mapping and reconstruction for small-area mobile sensing. Rather than matching every partial patch separately, the proposed perspective converts a sequence of local observations into a unified fingerprint state that is progressively refined as new touches arrive and can be matched only once after consolidation. As a concrete baseline, we present a classical pipeline that performs patch-wise structural feature extraction, feature-level registration and fusion, fingerprint map construction, and phase-based ridge reconstruction. More importantly, we position this baseline within a broader mobile fingerprint framework that integrates structured token learning, two-stage pose reasoning, and diffusion-based generative reconstruction. This viewpoint reframes mobile fingerprint recognition from multi-capture multi-match processing to accumulative map building, state refinement, and one-shot matching, offering a principled route toward efficient, pose-robust, and deployment-friendly biometrics for small-area mobile platforms. The baseline implementation has been publicly released at https://github.com/XiongjunGuan/FpReconstruction.
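
The idea of a unified state that is refined as touches arrive can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the released baseline: it assumes each patch is already registered to map coordinates by a pure translation (the abstract describes feature-level registration and pose reasoning, which are skipped here), and it fuses overlapping cells by a running average. The class `AccumulativeMap` and its methods are hypothetical names.

```python
from collections import defaultdict


class AccumulativeMap:
    """Running per-cell average of every patch observed so far."""

    def __init__(self):
        self._sum = defaultdict(float)  # (row, col) -> sum of observed values
        self._count = defaultdict(int)  # (row, col) -> number of observations

    def add_patch(self, patch, top, left):
        """Fuse one patch whose top-left corner sits at (top, left) in the map.

        Assumption: the offset is given; a real system must estimate pose.
        """
        for r, row in enumerate(patch):
            for c, value in enumerate(row):
                key = (top + r, left + c)
                self._sum[key] += value
                self._count[key] += 1

    def value(self, row, col):
        """Fused value at a cell, or None if no touch has covered it yet."""
        n = self._count.get((row, col), 0)
        return self._sum[(row, col)] / n if n else None

    def coverage(self, height, width):
        """Fraction of a height x width region covered by at least one patch."""
        covered = sum(
            1 for (r, c) in self._count if 0 <= r < height and 0 <= c < width
        )
        return covered / (height * width)


if __name__ == "__main__":
    fmap = AccumulativeMap()
    fmap.add_patch([[1.0, 1.0], [1.0, 1.0]], top=0, left=0)  # first touch
    fmap.add_patch([[3.0, 3.0], [3.0, 3.0]], top=1, left=1)  # overlapping touch
    print(fmap.value(1, 1), fmap.coverage(3, 3))
```

Matching would then run once against the consolidated map rather than once per patch, and a coverage measure like the one above is one conceivable way to decide when the state is complete enough.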

2606.15570 2026-06-16 cs.CV New submission

An Extensive Benchmark for Single-round and Multi-round Instruction-based Image Editing

Yiwei Ma, Ke Ye, Weihuang Lin, Jiayi Ji, Xiaoshuai Sun, Tat-Seng Chua, Rongrong Ji

Affiliations * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University; National University of Singapore

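AI summary Proposes the I2EBench2.0 benchmark, which evaluates instruction-based image editing models along 16 single-round and 7 multi-round dimensions, uses user studies to check alignment with human judgment, and offers research guidance based on an analysis of eight models.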

Comments Accepted by International Journal of Computer Vision (IJCV), 2026

English abstract

In recent years, there have been notable advancements in the area of instruction-based image editing (IIE), which focuses on the automatic alteration of input images using a model. Nevertheless, assessing the effectiveness of these editing models poses a considerable challenge due to the intricate nature of instructions and the wide variety of edits. To tackle this problem, one urgent task in this domain is the development of a robust evaluation framework that can precisely gauge the quality of editing outcomes and offer valuable benchmarks to guide future improvements. To address this challenge, we present a comprehensive evaluation benchmark named I2EBench2.0, designed for single-round and multi-round assessment of IIE models. I2EBench2.0 has four key features: 1) Evaluation Across Single and Multi-rounds: I2EBench2.0 simultaneously evaluates both single-round and multi-round instruction-based edits, assessing the precision and consistency of the edits. 2) Extensive Evaluation Criteria: I2EBench2.0 encompasses a broad range of criteria, evaluating both high-level and low-level aspects of each IIE model. Specifically, it incorporates 16 dimensions for single-round evaluations and 7 for multi-round evaluations. 3) Alignment with Human Judgment: To ensure our benchmark aligns with human evaluation, we conducted a comprehensive user study for each criterion. 4) Research-driven Insights: By analyzing the strengths and weaknesses of current IIE models across all 16 single-round and 7 multi-round dimensions, we provide critical insights aimed at directing future research in this area. We tested eight recently developed IIE models using I2EBench2.0 and derived academic insights through meticulous comparison and analysis. The related code, dataset, and images generated by all IIE models are available on GitHub: https://github.com/cocoshe/I2EBench.

2606.15569 2026-06-16 cs.LG math.ST stat.ML stat.TH New submission

A Decision-Theoretic View of Test-Time Training: When, How Far, and Which Directions to Adapt

Tomoya Wakayama

Affiliations * N/A

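AI summary Uses decision theory to treat test-time training as implicit Bayesian inference in the kernel regime, shows how the choice of update steps and subspace affects performance, and provides adaptive strategies, a PAC-Bayes guarantee, and a rule for selecting the optimal subspace.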

English abstract

Test-time training (TTT) adapts a pretrained model to each prompt via parameter updates, improving accuracy under pretraining-to-test distribution shifts. Yet, its performance often suffers from instability and sensitivity to hyperparameters such as update steps and subspace. We explain this behavior through a decision-theoretic lens, treating TTT as implicit Bayesian inference in the kernel regime. Under a Gaussian process benchmark, we show that TTT reduces prediction error when updates are spectrally matched to the prompt's signal-to-noise ratio and aligned with query-relevant eigen-directions. This perspective underpins the following results: (1) we show when fixed update steps and subspaces fail under distribution shifts, motivating adaptive strategies; (2) we prove that selecting update steps via prompt evidence admits a PAC-Bayes guarantee against overfitting; and (3) we characterize the Bayes-optimal update subspace under a linear-Gaussian correction model, yielding a scoring rule for selecting Transformer blocks and heads. Our theory helps explain the empirical instability of TTT, taking a step toward principled guidance for when, how far, and which directions to adapt.
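
The notion of updates being spectrally matched to a signal-to-noise ratio can be illustrated with a standard textbook comparison, sketched here in Python. This is not the paper's derivation. It assumes plain gradient descent on a squared loss in the kernel regime starting from zero, for which the component along an eigen-direction with eigenvalue λ is fitted by a factor 1 - (1 - ηλ)^t after t steps, and it compares that with the Gaussian-process posterior-mean factor λ / (λ + σ²). The function names, the toy spectrum, and the squared-gap criterion are all hypothetical choices.

```python
def gd_filter(eigenvalue, step_size, num_steps):
    """Fraction of the signal fitted along one eigen-direction after
    num_steps gradient steps (requires 0 < step_size * eigenvalue < 2)."""
    return 1.0 - (1.0 - step_size * eigenvalue) ** num_steps


def bayes_filter(eigenvalue, noise_var):
    """Shrinkage applied by the GP posterior mean along the same direction."""
    return eigenvalue / (eigenvalue + noise_var)


def best_num_steps(spectrum, noise_var, step_size, max_steps=50):
    """Step count whose filter is closest (in squared gap) to the Bayes filter.

    Illustrative criterion only; it answers "how far to adapt" for a toy
    spectrum rather than reproducing any rule from the paper.
    """
    def gap(t):
        return sum(
            (gd_filter(lam, step_size, t) - bayes_filter(lam, noise_var)) ** 2
            for lam in spectrum
        )
    return min(range(max_steps + 1), key=gap)


if __name__ == "__main__":
    spectrum = [4.0, 1.0, 0.25]  # toy kernel eigenvalues
    print(best_num_steps(spectrum, noise_var=1.0, step_size=0.2))
```

With too few steps the low-eigenvalue directions are barely updated, and with too many steps every direction is fitted almost fully, including those dominated by noise. Either way a fixed step count matches the Bayes filter for one noise level at best.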