arXivDaily: daily arXiv digest, updated Monday through Friday
2605.25459 2026-05-26 cs.LG cs.AI

From Simulation to Enaction: Post-trained language models recognize and react to their own generations

Asvin G., Jack Lindsey

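Affiliations: Institute for Advanced Study, Princeton; Anthropic

AI summary: This paper finds that post-trained language models recognize their own (on-policy) generations and lower their output entropy accordingly, an effect modulated by an internal representation of input surprise; explicit and implicit recognition rely on different mechanisms.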

Comments Anthropic fellows project mentored by Jack Lindsey

Abstract

Language models are pretrained as passive predictors with no incentive to model the consequences of their own outputs. Post-training changes this: a model producing its own responses can benefit from recognizing that it is on-policy. We present evidence that post-trained models recognize their on-policy generations, and this recognition is implicitly encoded in their output distributions. In particular, on-policy output distribution entropy is 3-4× lower than off-policy entropy, across model families and size classes. We trace part of this effect to an internal representation of input surprise, tracking the unlikeliness of the most recent input token according to the model's prior predictions, that causally modulates output entropy. One example of these phenomena can be observed in response to open-ended prompts; post-trained models (unlike pretrained models) collapse their uncertainty over the topic of their upcoming response before the first output token; violating this cached intention with a different-topic prefill results in higher output entropy. We also tested whether models can distinguish on-policy contexts from prefills via explicit verbal report. We find that they can, but that interestingly, this explicit recognition routes through a different mechanism than implicit recognition.
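The two quantities this abstract relies on, output entropy and input surprise, can be illustrated with a minimal sketch. It assumes Shannon entropy in nats and takes surprise to be the negative log-probability of the token that actually arrived; the toy distributions and function names are hypothetical and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def surprisal(probs, token_index):
    """Negative log-probability of the token that actually arrived."""
    return -math.log(probs[token_index])


# Hypothetical next-token distributions over a 4-token vocabulary.
peaked = [0.97, 0.01, 0.01, 0.01]  # confident prediction, low entropy
flat = [0.25, 0.25, 0.25, 0.25]    # uncertain prediction, high entropy

h_peaked = entropy(peaked)
h_flat = entropy(flat)

# A token the model expected is unsurprising; an unexpected one is surprising.
s_expected = surprisal(peaked, 0)
s_unexpected = surprisal(peaked, 3)
```

Under these assumptions, a prefilled token that the model itself would rarely have produced has high surprisal, which is the kind of signal the abstract links to higher output entropy.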

2605.25447 2026-05-26 cs.CL

GeoSVG-RL: Geometry-Aware Reinforcement Learning for Layout-Constrained Text-to-SVG Diagram Generation

Sifan Li, Yujun Cai, Hongkai Chen, Yiwei Wang

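Affiliations: University of California, Merced; The University of Queensland; vivo Mobile Communication Co., Ltd.

AI summary: Proposes GeoSVG-RL, a reinforcement learning framework that optimizes the policy against geometric feedback rewards (rendering validity, canvas fitting, anchor placement, text containment, graph consistency, and code cleanliness) to address structural fragility in text-to-SVG diagram generation, substantially improving arrow-anchor accuracy and text-in-box rates.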

Abstract

Generating structured, editable diagrams remains a significant challenge for contemporary large language models, despite their proficiency in general-purpose vector code generation. The primary difficulty lies in the structural fragility of the output; minor errors such as misaligned connector endpoints, text labels overlapping borders, or complex layouts drifting beyond the canvas boundaries render the resulting SVG files functionally unusable for professional applications. To address these issues, we introduce GeoSVG-RL, a specialized reinforcement learning framework designed for layout-constrained text-to-SVG generation. Unlike standard training objectives that rely solely on maximizing token-level likelihood, our approach optimizes the policy against explicit, executable geometric feedback. The model first produces a structured layout plan that serves as a geometric contract for the subsequent generation of the SVG code. This code is then rendered through a browser-backed verifier, enabling the calculation of fine-grained rewards across six critical dimensions: rendering validity, canvas fitting, precise anchor placement, text containment, graph consistency, and code cleanliness. We utilize Group Relative Policy Optimization (GRPO) to refine the model, sampling multiple candidates per prompt to facilitate updates based on relative quality. Starting from a supervised warm-start phase on synthetic data, GeoSVG-RL achieves substantial gains in structural reliability, particularly in arrow-anchor accuracy and text-in-box rates. Quantitative evaluations demonstrate that our method consistently outperforms current state-of-the-art systems in local geometric precision and the preservation of graph connectivity, providing a robust pathway toward automated yet reliable technical illustration.
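The abstract combines six reward dimensions with GRPO's group-relative updates. The sketch below assumes an equally weighted sum of per-dimension scores in [0, 1] (the abstract does not state the weighting) and the commonly used mean and standard-deviation normalization within a group of sampled candidates. All names are hypothetical.

```python
import statistics

# The six reward dimensions named in the abstract.
REWARD_DIMENSIONS = (
    "rendering_validity",
    "canvas_fitting",
    "anchor_placement",
    "text_containment",
    "graph_consistency",
    "code_cleanliness",
)


def total_reward(scores, weights=None):
    """Weighted sum of per-dimension scores; equal weights are an assumption."""
    if weights is None:
        weights = {name: 1.0 for name in REWARD_DIMENSIONS}
    return sum(weights[name] * scores[name] for name in REWARD_DIMENSIONS)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each candidate's reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three hypothetical candidates sampled for one prompt.
candidates = [
    {name: 0.2 for name in REWARD_DIMENSIONS},
    {name: 0.5 for name in REWARD_DIMENSIONS},
    {name: 0.9 for name in REWARD_DIMENSIONS},
]
rewards = [total_reward(c) for c in candidates]
advantages = group_relative_advantages(rewards)
```

Candidates that score above their group's mean receive positive advantages and the rest negative ones, so each update depends on relative quality within the group.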

2605.25446 2026-05-26 cs.AI cs.LG

A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography

Ziqing Yu, Yuhui Tao, Jiayu Huo, Lei Pan, Zilong Xiao, Juecheng Chen, Xiao Li, Jianxuan Li, You Zhou, Zhixing Li, Cong Wang, Beijian Zhang, Chen Chen, Hongyang Lu, Konstantinos Patlatzoglou, Daniel B. Kramer, Jonathan W. Waks, Yangang Su, Fu Siong Ng, Shuo Wang, Yixiu Liang, Junbo Ge

Affiliations: Department of Cardiology, Zhongshan Hospital of Fudan University; Shanghai Institute of Cardiovascular Diseases, National Clinical Research Centre for Interventional Medicine; Digital Medical Research Center, School of Basic Medical Sciences, Fudan University; Shanghai Key Laboratory of Medical Imaging Computing and Computer Assisted Intervention; National Heart and Lung Institute, Imperial College London, Hammersmith Hospital, Du Cane Road; Department of Cardiology, Shanghai Geriatric Medical Center; Cardiac Rhythm Management, Medtronic Technology Center, Medtronic (Shanghai) Ltd.; Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology, Beth Israel Deaconess Medical Center, Harvard Medical School; Harvard-Thorndike Electrophysiology Institute, Beth Israel Deaconess Medical Center, Harvard Medical School; Department of Cardiology, Imperial College Healthcare NHS Trust; Department of Cardiology, Chelsea and Westminster NHS Foundation Trust; Department of Computer Science and Technology, University of Cambridge

AI summary: Proposes ECGCLIP, a signal-language contrastive learning framework that, after large-scale ECG-report pre-training, outperforms baselines across 89 downstream tasks and enables broad-spectrum assessment of common arrhythmias, echocardiographic targets, and rare cardiac diseases.

Abstract

Electrocardiography (ECG) is central to cardiovascular care, but conventional AI models are often restricted to common arrhythmias and may generalize poorly across populations or clinically subtle diseases. We developed ECG Contrastive Language-Image Pre-training (ECGCLIP), a signal-language contrastive learning framework that aligns ECG waveforms with expert diagnostic reports. ECGCLIP was pre-trained on 2,837,962 ECG studies from 1,324,856 patients and evaluated on a held-out internal test set plus nine independent external cohorts comprising about 1.5 million ECGs. Evaluation covered 89 downstream tasks, including 45 ECG diagnoses, 39 echocardiographic targets, and 5 rare cardiac diseases, using PRAUC as the primary metric. ECGCLIP consistently improved performance over random initialization and Merl-R18 baselines. On the internal test set, ECGCLIP-R34 achieved strong performance for atrial fibrillation (PRAUC 0.900) and ST-segment elevation myocardial infarction (PRAUC 0.383), with robust generalization across all external cohorts. It also improved low-prevalence and diagnostically elusive diseases, including Ebstein anomaly, constrictive pericarditis, dextrocardia, and cardiac amyloidosis, with internal PRAUC values of 0.253, 0.175, 0.121, and 0.201, respectively. ECGCLIP was data efficient, matching or exceeding full-dataset baseline performance with only 10% of training data. Feature visualization and saliency analysis suggested clinically meaningful representations aligned with established electrocardiographic criteria. These findings indicate that large-scale ECG-report contrastive pre-training can expand routine ECG interpretation beyond common arrhythmias toward broad cardiovascular assessment and opportunistic screening of echocardiographic and rare conditions.
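The abstract describes aligning ECG waveforms with diagnostic reports through contrastive learning but does not state the loss. The sketch below assumes the symmetric cross-entropy objective commonly used in CLIP-style training. The embeddings, the temperature value, and the function names are hypothetical, and the code is pure Python for illustration only.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(signal_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (signal, report) pairs."""
    s = [_normalize(v) for v in signal_emb]
    t = [_normalize(v) for v in text_emb]
    logits = [[_dot(si, tj) / temperature for tj in t] for si in s]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Toy batch: pair i of the signals matches pair i of the reports.
signals = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_style_loss(signals, [[1.0, 0.0], [0.0, 1.0]])
shuffled_loss = clip_style_loss(signals, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each ECG embedding is closest to its own report embedding and large when the pairing is shuffled.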

2605.25443 2026-05-26 cs.CL

Harmony in Diversity: Multi-domain Contrastive Policy Optimization for Large Reasoning Models

Zongji Yu, Wenshui Luo, Yiliu Sun, Hao Fang, Runmin Cong, Chaochao Lu, Chen Gong

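Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Shandong University

AI summary: Proposes Multi-domain Contrastive Policy Optimization (MCPO), which uses contrastive learning to promote cross-domain knowledge transfer and reduce interference, improving the reasoning of large reasoning models in multi-domain settings.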

Comments 25 pages, 5 figures

Abstract

Post-training has significantly enhanced the reasoning capability of Large Reasoning Models (LRMs), especially with Reinforcement Learning (RL) like Group Relative Policy Optimization (GRPO). However, GRPO-style RL methods in multi-domain settings often fail to achieve consistent improvements across all domains due to inherent interference in policy optimization. Prior studies on multi-domain RL primarily focus on alleviating cross-domain interference, while often neglecting the pivotal role of knowledge sharing, which we argue is the key to transforming cross-domain interactions from harmful competition into beneficial transfer. To address this limitation, we propose Multi-domain Contrastive Policy Optimization (MCPO), which analyzes the structural relationships among rollouts and promotes cross-domain knowledge sharing and in-domain knowledge consolidation in a contrastive manner. Specifically, for a given prompt, MCPO identifies transferable reasoning trajectories from other domains as positive examples, while treating incorrect rollouts as negative ones. It then encourages consistent representations for positive pairs and pushes negative pairs apart, thereby facilitating knowledge transfer and reducing interference. Moreover, MCPO aligns intra-domain correct rollouts to build a consolidated representation space. In this way, MCPO contrastively learns a harmonious representation space that can accommodate diverse multi-domain knowledge. Empirical results show that MCPO improves the reasoning capabilities of LRMs across multiple domains and even outperforms single-domain training in some cases. Code is available at https://github.com/Maricalce/MCPO.

2605.25442 2026-05-26 cs.CV

Enhancing Single-Image Facial Demorphing using Multimodal Large Language Models

Nitish Shukla, Arun Ross

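Affiliations: IEEE

AI summary: Proposes a coupled diffusion-based reconstruction framework guided by a multimodal large language model; semantic embeddings extracted from intermediate layers serve as conditioning, enabling reference-free facial demorphing that recovers the constituent images while preserving identity consistency.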

Abstract

Face recognition systems are increasingly vulnerable to morphing attacks, where a composite image is crafted to match multiple identities, enabling unauthorized access and identity fraud. Existing detection methods identify morphed images but cannot recover constituent images or identities, limiting their forensic utility. This paper presents a novel reference-free facial demorphing framework that leverages Multimodal Large Language Models (MLLMs) to guide a coupled diffusion-based reconstruction process. Our key innovation lies in extracting semantic embeddings from intermediate MLLM layers to condition the demorphing, providing high-level reasoning about facial attributes and identity cues that complement low-level pixel information. We formulate demorphing as a coupled conditional generation problem, where both constituent faces are synthesized jointly through a denoising diffusion model operating directly in the RGB domain, ensuring inter-identity consistency while preserving fine-grained perceptual details. Unlike prior approaches that rely on compressed latent representations or assume identity overlap between training and testing sets, our method bypasses lossy text generation-reencoding cycles by directly utilizing MLLM hidden states as conditioning signals, enabling the denoising network to attend to subtle visual cues such as hair, background, and facial textures. Ablation studies further reveal that middle MLLM layers encode more identity-discriminative representations, RGB-domain demorphing outperforms latent-space approaches by 30-40% at strict operating points, and full MLLM embeddings provide substantial advantages over raw ViT features through enhanced semantic structuring from multimodal pretraining.

2605.25440 2026-05-26 cs.CL cs.AI cs.MA

A Multi-Agent LLM Framework for Rating the Quality of Surgical Feedback

Rafal Kocielnik, J. Everett Knudsen, Steven Y. Cen, Jasmine Lin, Cherine H. Yang, Atharva Deo, Ujjwal Pasupulety, Peter Wager, Anima Anandkumar, Andrew J. Hung

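Affiliations: Computing + Mathematical Sciences, California Institute of Technology; Department of Urology, Cedars-Sinai; Keck School of Medicine, University of Southern California

AI summary: Proposes a two-stage LLM framework that discovers interpretable feedback-quality criteria through multi-agent prompting and surgical domain knowledge injection, then scores feedback automatically with an LLM-as-a-judge, outperforming prior methods at predicting feedback effectiveness.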

Comments 25 pages, 3 figures

Abstract

Verbal feedback delivered by attending surgeons in the operating room plays a critical formative role in resident trainee skill acquisition. Yet, assessing the quality of trainer feedback and its effectiveness in influencing trainee behavior during live surgery remains a challenge. Prior studies assessed feedback content relying on extensive manual annotation by expert human raters and focused on developing broad taxonomies that overlook the qualitative aspects of feedback delivery such as clarity or urgency. Limited existing automated methods, including keyword analysis and topic modeling, also fail to capture these nuanced aspects. We introduce a two-stage LLM-based framework that discovers interpretable feedback quality criteria grounded in the context of surgical training. Our method uses multi-agent prompting and surgical domain knowledge injection to discover a small set of human interpretable scoring criteria (e.g., Encouraging, Urgent, Clear). These criteria are then used to automatically score live surgical feedback via an LLM-as-a-judge approach. Evaluation on 4.2k trainer feedback instances demonstrates that our AI-discovered criteria outperform prior content-based frameworks in predicting feedback effectiveness, including observed trainee behavioral adjustments and trainer approval. This work advances scalable, human-aligned assessment of communication quality in the operating room and provides a foundation for improving surgical teaching practices.

2605.25439 2026-05-26 cs.LG

Missing Pattern Recognized Diffusion Imputation Model for Missing Not At Random

Gyuwon Sim, Sumin Lee, Heesun Bae, Byeonghu Na, Doyun Kwon, Ju-Hee Hwang, Jae-Young Lim, Il-Chul Moon

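Affiliations: KAIST; Seoul National University

AI summary: For the Missing Not at Random (MNAR) setting, proposes the Missing Pattern Recognized Diffusion Imputation Model (PRDIM), which uses a pattern recognizer and an EM algorithm to maximize the joint-distribution likelihood and impute values accurately.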

Abstract

Missing data frequently arises across diverse domains, including time-series and image domains. In the real world, missing occurrences often depend on the unobservable values themselves, which are referred to as Missing Not at Random (MNAR). In this work, we introduce the Missing Pattern Recognized Diffusion Imputation Model (PRDIM), a novel framework that explicitly captures the missing pattern and precisely imputes unobserved values. PRDIM iteratively maximizes the likelihood of the joint distribution for observed values and missing mask under an Expectation-Maximization (EM) algorithm. In this sense, we first employ a pattern recognizer, which approximates the underlying missing pattern and provides guidance during every inference toward more plausible imputations with respect to the missing information. Through extensive experiments, we demonstrate that PRDIM consistently achieves strong imputation performance under MNAR settings across multiple data modalities.
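The MNAR setting described in this abstract can be illustrated with a toy self-censoring mechanism. The sketch assumes, purely for illustration, that a value goes missing whenever it exceeds a threshold. It does not reproduce PRDIM or its pattern recognizer, and the data and names are hypothetical.

```python
def mnar_mask(values, threshold):
    """Return 1 for observed and 0 for missing; missingness depends on the value itself."""
    return [0 if v > threshold else 1 for v in values]


# Hypothetical measurements; large readings are the ones that go unrecorded.
values = [1.0, 2.0, 3.0, 8.0, 9.0]
mask = mnar_mask(values, threshold=5.0)

observed = [v for v, m in zip(values, mask) if m == 1]
true_mean = sum(values) / len(values)
observed_mean = sum(observed) / len(observed)
```

Because the mask depends on the unobserved values, statistics computed from the observed entries alone are biased, which is why the abstract models the observed values and the missing mask jointly.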

2605.25437 2026-05-26 cs.CV

Does Seeing More Mean Knowing More? Mono-Anchored Advantage Normalization for Multi-Source Visual Reasoning

Fanhu Zeng, Zhicong Luo, Zefan Wang, You Li, Chi Chen, Maosong Sun

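Affiliations: Tsinghua University; Northwest Polytechnical University; Beijing Jiaotong University

AI summary: To address the inability of existing multi-source visual reasoning methods to tell information gain from interference, proposes MARS, which uses mono-source rewards as dynamic anchors to fold the information gain from multi-source fusion explicitly into advantage normalization, adaptively strengthening mutual promotion between sources and suppressing noise during reinforcement learning; gains of 3.2% and 4.9% are reported on GRPO and DAPO, respectively.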

Comments preprint

Abstract

Visual reasoning through reinforcement learning with verifiable rewards (RLVR) has achieved remarkable progress. However, when dealing with multi-source inputs, existing approaches tend to treat them as a mere accumulation of information, lacking explicit mechanisms to distinguish whether integrating additional sources yields information gain or introduces interference. Therefore, they struggle to effectively model dynamic interaction when integrating multiple sources, particularly when they differ significantly in physical properties and semantics, e.g., infrared and depth, leading to inferior performance to mono-source reasoning when a certain source holds the dominant signal. To address this issue, we propose MARS, a novel mono-anchored multi-source reasoning framework that models each visual modality as an independent information source. Specifically, by treating mono-source rewards as dynamic anchors, our method explicitly incorporates the information gain introduced by multi-source fusion into advantage normalization and adaptively emphasizes mutual promotion between sources while suppressing potential noise or conflicts during RLVR. From theoretical analysis, our method effectively quantifies information gain introduced by multi-source integration in gradient estimation, enabling consistent modality regulation. Empirical results also show impressive 3.2% and 4.9% performance gains on GRPO and DAPO across diverse datasets, confirming effectiveness of our method.

2605.25435 2026-05-26 cs.AI

Security of OpenClaw Agents: Fundamentals, Attacks, and Countermeasures

Yuntao Wang, Jianle Ba, Han Liu, Yanghe Pan, Jintao Wei, Zhou Su, Tom H. Luan, Linkang Du

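Affiliations: School of Cyber Science and Engineering, Xi'an Jiaotong University

AI summary: Surveys the security challenges of OpenClaw agents, categorizing and analyzing threats such as skill poisoning, cognitive manipulation, multi-agent cascading failures, and supply-chain vulnerabilities, and summarizing existing defense mechanisms.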

Comments 17 pages, 13 figures

Abstract

The rapid evolution of large language model (LLM)-driven autonomous agents has given rise to OpenClaw, a new class of open-source agent frameworks that operate as continuously running, skill-augmented systems with persistent memory, multi-channel interaction, and high degrees of autonomy. Such capabilities enable OpenClaw agents to autonomously execute complex, multi-step tasks and interact seamlessly with external applications, but simultaneously introduce a substantially enlarged attack surface. In particular, the combination of high-privilege operations and persistent memory exposes OpenClaw agents to various emerging threats, including skill poisoning, cognitive manipulation, multi-agent cascading failures, and supply-chain vulnerabilities. In this survey, we present a comprehensive study of the security landscape of OpenClaw agents. We first examine the general architecture and key characteristics that distinguish OpenClaw agents from traditional AI agent systems. We categorize existing security and privacy threats into a layered framework and analyze how vulnerabilities arise during agent reasoning, action execution, and external interaction. Representative defense mechanisms are also reviewed to draw the current defense landscape. Finally, several unresolved issues related to the reliability and trustworthiness of OpenClaw ecosystems are discussed.

2605.25430 2026-05-26 cs.AI

CODESKILL: Learning Self-Evolving Skills for Coding Agents

Yanzhou Li, Yiran Zhang, Xiaoyu Zhang, Xiaoxia Liu, Yang Liu

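Affiliations: Nanyang Technological University; Zhejiang University

AI summary: Proposes CODESKILL, a framework that uses reinforcement learning to extract multi-granularity procedural skills from coding-agent trajectories and maintain a skill bank, improving downstream task solving.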

Abstract

Coding agents produce rich trajectories while solving software-engineering tasks. To enable agent self-evolution, these trajectories can be distilled into reusable procedural skills that compactly encode experience to guide future behavior. However, existing skill construction and maintenance methods often rely on fixed prompts and heuristic update rules, leaving it unclear how knowledge should be selected, abstracted, and maintained to best serve downstream agents. We propose CODESKILL, an LLM-based framework that reformulates skill extraction and skill-bank maintenance as a learnable management policy. CODESKILL extracts multi-granularity procedural skills from coding-agent trajectories, evolves skills with new experience, and maintains a compact skill bank for future task solving. We train CODESKILL with reinforcement learning, using a hybrid reward that combines dense rubric-based skill-quality feedback with sparse verifiable execution feedback from the frozen downstream agent. Experiments on EnvBench, SWE-Bench Verified, and Terminal-Bench 2 show that CODESKILL improves average pass rate by 9.69 over the no-skill baseline and by 4.01 over the strongest prompt-based or memory baseline, while maintaining the skill bank at a stable size during iterative construction.

2605.25429 2026-05-26 cs.LG

Rethinking Feature Alignment in Generalist Graph Anomaly Detection: A Relational Fingerprint-based Approach

Yujing Liu, Yixin Liu, Yu Zheng, Alan Wee-Chung Liew, Xiaofeng Cao, Shirui Pan

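Affiliations: Griffith University, Gold Coast, Australia; Tongji University, Shanghai, China

AI summary: To address negative transfer caused by feature alignment that ignores semantics in generalist graph anomaly detection, proposes ReFi-GAD, a relational-fingerprint-based generalist method that combines semantics-aware fingerprints encoding contextual and structural anomaly-indicative cues with a transformer encoder and an SNR-guided domain adaptation module, significantly outperforming existing methods on 14 datasets.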

Comments 9 pages, 7 figures. Accepted by ICML 2026

Abstract

Generalist graph anomaly detection (GAD) aims to detect anomalies on unseen graphs without graph-specific retraining. Nevertheless, existing approaches primarily focus on aligning heterogeneous features across different data domains via PCA-based projection, which harmonizes feature dimensions but ignores feature semantics. As a result, GAD models fail to learn transferable semantic knowledge, and even exhibit negative transfer on unseen graphs. To address this issue, we propose a Relational Fingerprint-based generalist GAD approach (ReFi-GAD for short), aligning heterogeneous raw features with a universal and semantics-aware Relational Fingerprint (ReFi) that encodes anomaly-indicative cues from both contextual and structural perspectives. Building on ReFi, we design a fingerprint-grounded generalist GAD model, which combines a transformer-based encoder to capture domain-invariant knowledge with an SNR-guided refinement module for domain-specific adaptation. Extensive experiments on 14 datasets demonstrate that ReFi-GAD significantly outperforms state-of-the-art methods.

2605.25427 2026-05-26 cs.CV cs.AI

Binding Visual Features Point by Point

Udith Haputhanthri, Declan Campbell, Rim Assouel, Jonathan D. Cohen, Taylor W. Webb

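Affiliations: Princeton University; Mila – Quebec AI Institute; Université de Montréal

AI summary: Studies how a text-guided "pointing" mechanism resolves the binding problem for vision language models in multi-object scenes, finding that it induces an internal visual search routine, eliminates binding errors, and enables compositional generalization.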

Abstract

Despite success on standard benchmarks, vision language models display persistent failures on tasks involving processing of multi-object scenes, including many tasks that are relatively easy for humans. Recent work has found that these failures may stem from a basic inability to accurately bind object features in-context, a challenge that is referred to as the "binding problem" in cognitive science and neuroscience. The human visual system is thought to solve this binding problem via serial processing, attending to individual objects one at a time so as to avoid interference from other objects. Recent work has proposed "pointing" (the use of explicit spatial coordinates to refer to objects) as an analogous solution for vision language models, and found that it improves performance on challenging multi-object tasks. However, it is unclear why (i.e., on a mechanistic or representational level) this approach improves performance, and how directly this relates to serial processing in human vision. Here, we investigate this question. We find that learning to point-via-text induces an internal visual search routine, and we characterize the mechanisms that support this procedure. We also find that pointing behavior can be generalized to new tasks via fine-tuning, and that doing so eliminates binding errors and enables compositional generalization. These results provide a proof-of-principle that serial processing can solve the binding problem for vision language models just as it does for biological vision.

2605.25424 2026-05-26 cs.LG cs.AI

SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning

Zhongling Xu, Shunan Zheng, Wei Wang

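Affiliations: Department of Operations Research and Industrial Engineering

AI summary: Proposes SeqRoute, which models multi-turn LLM routing as a finite-horizon Markov Decision Process and uses offline reinforcement learning (CQL) with Hindsight Budget Relabeling (HBR) to learn delayed gratification, optimizing cost and quality under a global budget constraint and reducing the bankruptcy rate to under 1%.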

Abstract

Existing LLM routing frameworks treat queries as independent events, neglecting the sequential nature of real-world user sessions constrained by global computational budgets. This mismatch inevitably leads to budget bankruptcy: myopic routing policies exhaust resources on early interactions, forcing subsequent and often more complex queries onto inadequate models. We introduce SeqRoute, a framework that formulates multi-turn routing as a finite-horizon Markov Decision Process and solves it via offline reinforcement learning. By incorporating the remaining budget into the state space and training with Conservative Q-Learning (CQL), SeqRoute learns delayed gratification to strategically preserve resources for high-stakes turns later in the session. To overcome data starvation, we propose Hindsight Budget Relabeling (HBR). This technique retrospectively simulates historical trajectories under diverse hypothetical budgets, expanding 10,000 raw sessions into 2.38 million transitions enriched with critical bankruptcy signals. At deployment, a dynamic λ-sweep mechanism enables zero-shot navigation of the cost-quality Pareto frontier without retraining. Extensive evaluations demonstrate that SeqRoute reduces operational costs by 6.0-73.5% while maintaining or improving quality, and suppresses bankruptcy rates to under 1%, strictly dominating behavior cloning, budget-aware heuristics, and static baselines across the entire Pareto frontier.
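The idea of replaying one logged session under several hypothetical budgets can be sketched as follows. This is an illustration under stated assumptions and not the paper's HBR procedure: each turn is logged as a hypothetical `(cost, quality)` pair, a session ends the first time a turn's cost exceeds the remaining budget, and the bankruptcy penalty of -1.0 is invented for the example.

```python
def relabel_session(turns, budgets, bankruptcy_penalty=-1.0):
    """Replay one logged session under each hypothetical budget.

    turns: list of (cost, quality) pairs, one per logged turn.
    Returns one transition dict per replayed turn, with the remaining
    budget recorded as part of the state.
    """
    transitions = []
    for budget in budgets:
        remaining = budget
        for turn_index, (cost, quality) in enumerate(turns):
            if cost > remaining:
                # The logged action is unaffordable under this budget.
                transitions.append({
                    "turn": turn_index,
                    "budget": budget,
                    "remaining": remaining,
                    "reward": bankruptcy_penalty,
                    "bankrupt": True,
                })
                break
            transitions.append({
                "turn": turn_index,
                "budget": budget,
                "remaining": remaining,
                "reward": quality,
                "bankrupt": False,
            })
            remaining -= cost
    return transitions


# One hypothetical three-turn session replayed under two budgets.
session = [(1.0, 0.6), (3.0, 0.9), (2.0, 0.8)]
relabeled = relabel_session(session, budgets=[3.0, 6.0])
bankruptcies = [t for t in relabeled if t["bankrupt"]]
```

A single session yields several budget-conditioned transitions, including one bankruptcy under the tighter budget, which is how a modest number of raw sessions can expand into a much larger training set.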

2605.25423 2026-05-26 cs.RO

OPAL: Omnidirectional Path-efficient Aerial 3D expLoration

Yoga Satwik Chappidi, Avideh Zakhor

Affiliations Department of Electrical Engineering and Computer Sciences, University of California, Berkeley

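AI summary Proposes the OPAL framework, which replaces compute-heavy global tour planning with a 360-degree yaw rotation at ambiguous branch points, yielding autonomous exploration that is computationally simple, short in path length, and high in coverage.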

Comments Submitted to IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

自主探索对于机器人绘制未知环境地图至关重要。探索算法的理想特性包括计算效率高和探索过程中行进距离小。受此启发,我们提出了全方位路径高效空中三维探索(OPAL),这是一个探索框架,其核心是在歧义分支点进行有意的360度偏航旋转,而不是进行计算密集的全局路径规划。我们设计了OPAL的多个变体,以确定在偏航旋转完成后如何选择前沿。其中一个变体是无模型的,而其他变体则使用大语言模型(LLM)或视觉语言模型(VLM)。我们通过改变邻近搜索半径以将前沿纳入选择过程,来表征这些变体的性能。通过仿真,我们发现尽管与计算更复杂的基线(如EDEN和FALCON)相比,耗时的原地偏航旋转增加了总探索时间,但OPAL计算更简单,实现了更短的行进距离和更高的覆盖率-距离曲线下面积。我们还表明,调整前沿选择搜索半径可以在行进距离和总探索时间之间进行权衡。我们在两个室内环境中使用Modal AI无人机将OPAL与FALCON进行比较,验证了我们的结果,发现OPAL的一个变体的行进距离比FALCON低25%。

英文摘要

Autonomous exploration is critical for robots mapping unknown environments. Desirable characteristics of exploration algorithms include compute efficiency and a short traversed distance during exploration. Motivated by these, we present Omnidirectional Path-efficient Aerial 3D expLoration (OPAL), an exploration framework centered on deliberate 360-degree yaw rotation at ambiguous branch points rather than compute-heavy global tour planning. We devise multiple variants of OPAL to determine the frontier-selection strategy once the yaw pan is completed. One variant is model-free, while others use large language models (LLMs) or vision-language models (VLMs). We characterize the performance of these variants while varying the vicinity search radius used to include frontiers in the selection process. Through simulations we find that although the time-consuming in-place yaw rotation increases total exploration time relative to more computationally complex baselines such as EDEN and FALCON, OPAL is computationally simpler and achieves shorter travel distances and a higher area under the coverage-versus-distance curve. We also show that adjusting the frontier-selection search radius enables a tradeoff between travel distance and total exploration time. We verify our results on a Modal AI drone in two indoor environments by comparing OPAL against FALCON, and find the traveled distance for one variant of OPAL to be as much as 25% lower than that of FALCON.
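The coverage-versus-distance area under the curve mentioned above can be illustrated with a simple trapezoidal integration. This is only a sketch: it assumes coverage is logged as a fraction in [0, 1] at increasing travel distances, and the abstract does not say how the paper normalizes or truncates the curve. The function name `coverage_distance_auc` is hypothetical.

```python
from typing import Sequence


def coverage_distance_auc(distances: Sequence[float], coverage: Sequence[float]) -> float:
    """Trapezoidal area under a coverage-vs-distance curve."""
    if len(distances) != len(coverage) or len(distances) < 2:
        raise ValueError("need two equally long sequences with at least 2 samples")
    area = 0.0
    for i in range(1, len(distances)):
        step = distances[i] - distances[i - 1]
        if step < 0:
            raise ValueError("distances must be non-decreasing")
        # Average coverage over the segment times the distance travelled in it.
        area += 0.5 * (coverage[i] + coverage[i - 1]) * step
    return area
```

When two runs are compared over the same distance range, the one that gains coverage earlier in its path gets the larger area.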

2605.25421 2026-05-26 cs.CL

HyLaT: Efficient Multi-Agent Communication via Hybrid Latent-Text Protocol

HyLaT: 通过混合潜在-文本协议实现高效多智能体通信

Xinyi Mou, Siyuan Wang, Zejun Li, Yulan He, Zhongyu Wei

Affiliations Fudan University; The Chinese University of Hong Kong; King’s College London; The Alan Turing Institute; Shanghai Innovation Institute

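AI summary To address the trilemma in multi-agent communication, proposes HyLaT, a hybrid latent-text protocol that transmits cognitive signals through a latent channel for efficiency and expresses critical signals in natural language for interpretability, together with a two-stage training framework; it substantially reduces communication overhead while maintaining task performance.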

详情
AI中文摘要

通信协议设计是基于大语言模型的多智能体系统中的核心挑战。现有的单通道方法面临固有的通信三元困境:基于文本的方法可解释但冗长,而基于潜在空间的方法高效但不透明且局限于单向工作流。受多通道通信理论启发,我们提出HyLaT,一种混合潜在-文本通信协议,通过潜在通道传输精细的认知信号以提高效率,同时用自然语言表达简洁的关键信号以保持可解释性和精确性。我们引入一个两阶段训练框架,结合单智能体混合生成学习和多智能体交互协同训练,使智能体能够在多轮交互中生成和解释混合消息。实验表明,HyLaT显著降低了通信开销,同时保持了竞争性的任务性能,并在不同设置下具有强大的泛化能力和鲁棒性。

英文摘要

Communication protocol design is a central challenge in large language model-based multi-agent systems. Existing single-channel approaches face an inherent communication trilemma: text-based methods are interpretable but verbose, while latent-space methods are efficient but opaque and limited to unidirectional workflows. Inspired by multi-channel communication theory, we propose HyLaT, a hybrid latent-text communication protocol that transmits elaborate cognitive signals through a latent channel for efficiency, while expressing concise critical signals in natural language to preserve interpretability and precision. We introduce a two-stage training framework combining single-agent hybrid generation learning and multi-agent interactive co-training, enabling agents to generate and interpret hybrid messages across multiple rounds of interaction. Experiments demonstrate that HyLaT reduces communication overhead significantly while maintaining competitive task performance, with strong generalization and robustness across diverse settings.

2605.25420 2026-05-26 cs.CL cs.AI cs.CY

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

SomaliBench Eval:衡量开源语言模型中英语到索马里语的拒绝差距

Khalid Yusuf Dahir

Affiliations Independent researcher

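AI summary Builds a Somali harmful-intent benchmark and evaluates four open-weight models, finding large English-to-Somali gaps in refusal rate and that most non-refusal outputs are disfluent, unusable content.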

Comments 12 pages, 3 figures, 4 tables. Code: https://github.com/khaledyusuf44/somalibench_eval Dataset: https://huggingface.co/datasets/khaledyusuf44/somalibench-v0

详情
AI中文摘要

大型语言模型的安全评估仍然高度以英语为中心,即使模型在全球部署,低资源语言的评估也严重不足。我们在SomaliBench v0上评估了四个开源指令微调模型,这是一个由母语者验证的基准,包含100对英语和索马里语的有害意图提示。每个模型(Llama-3.1-8B-Instruct、Gemma-2-9B-Instruct、Qwen-2.5-7B-Instruct和Aya-23-8B)均在本地运行,温度为0,并使用相同的英语“有帮助、无害、诚实”(HHH)系统提示。一个固定的Claude Sonnet快照(claude-sonnet-4-5-20250929)将每个响应分类为拒绝、遵从或不清楚;母语作者对分层抽样的80行样本进行抽查。我们发现所有四个模型在英语到索马里语之间存在巨大的拒绝差距:Llama-3.1-8B(0.90;95%自助法置信区间[0.85, 0.96])、Aya-23-8B(0.75 [0.67, 0.83])、Qwen-2.5-7B(0.69 [0.59, 0.78])和Gemma-2-9B(0.38 [0.27, 0.49])。对于三个模型,索马里语中主要的非拒绝模式不是流畅的有害遵从,而是不清楚的输出:空、错误语言或不连贯的生成。母语验证抽查在80个采样行上与判断器达到100%一致(Cohen's kappa = 1.00)。我们仅报告总体拒绝率、类别差距和可靠性统计;原始模型生成保留在本地,不发布。

英文摘要

Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each of Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B is run locally with temperature 0 and the same English "helpful, harmless, and honest" (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear; the native author spot-checks a stratified 80-row sample. We find large English-to-Somali refusal gaps for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen's kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.
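A refusal gap with a bootstrap confidence interval, as reported above, could be computed along these lines. This is a sketch under stated assumptions: the gap is taken as the English refusal rate minus the Somali refusal rate, prompts are resampled as English/Somali pairs, and a percentile interval is used. The abstract does not specify the exact bootstrap procedure, and the function names are hypothetical.

```python
import random
from typing import Sequence, Tuple


def refusal_gap(english_refused: Sequence[int], somali_refused: Sequence[int]) -> float:
    """English refusal rate minus Somali refusal rate (1 = refused, 0 = not)."""
    n = len(english_refused)
    return sum(english_refused) / n - sum(somali_refused) / n


def bootstrap_gap_ci(
    english_refused: Sequence[int],
    somali_refused: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI, resampling prompt pairs with replacement."""
    rng = random.Random(seed)
    n = len(english_refused)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(
            refusal_gap(
                [english_refused[i] for i in idx],
                [somali_refused[i] for i in idx],
            )
        )
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Resampling pairs rather than each language independently keeps the English and Somali versions of a prompt together, which matches the paired design of the benchmark.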

2605.25419 2026-05-26 cs.LG

Capture-Calibrate-Coach: A Graph-Based Framework for Knowledge Monitoring Estimation and Adaptive Feedback

捕获-校准-指导:基于图的知识监控估计与自适应反馈框架

Gen Li, Li Chen, Cheng Tang, Boxuan Ma, Yuncheng Jiang, Daisuke Deguchi, Takayoshi Yamashita, Atsushi Shimada

Affiliations Kyushu University; Osaka Kyoiku University; South China Normal University; Nagoya University; Chubu University

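AI summary Proposes the Capture-Calibrate-Coach framework, which uses a heterogeneous graph neural network to infer learners' knowledge states for concepts they did not explicitly mention and gives personalized feedback based on metacognitive patterns; with 684 students it reaches 85.21% AUC in predicting latent perceived states.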

Comments To be published in Proceedings of the 27th International Conference on Artificial Intelligence in Education (AIED 2026)

详情
AI中文摘要

有效的学习支持不仅需要了解学习者知道什么,还需要了解他们如何准确地感知自己的理解。这种元认知维度,称为知识监控,从根本上影响自我调节学习,然而这一维度在当前系统中仍未得到充分探索。本文介绍了用于自适应学习支持的捕获-校准-指导(3C)框架。捕获阶段从开放式自我报告中提取学习者的感知知识状态,构建连接学习者和知识概念的异构图。校准阶段应用异构图神经网络来推断未明确提及的概念的潜在感知状态,从而实现系统的知识监控评估。指导阶段将学习者分为五种元认知模式,并提供针对知识差距和校准误差的个性化反馈。对684名学生的评估显示,预测潜在感知状态的AUC达到85.21%,显著优于基线方法。一项包含47名参与者的用户研究表明,参与者对反馈质量持积极态度,尤其重视关于知识差距的具体反馈和可操作的学习指导。这些发现将基于AI的学习支持推向元认知队友,在支持知识增长的同时培养准确的自我意识。

英文摘要

Effective learning support requires understanding not only what learners know but also how accurately they perceive their own understanding. This metacognitive dimension, known as knowledge monitoring, fundamentally influences self-regulated learning, yet it remains underexplored in current systems. This paper introduces the Capture-Calibrate-Coach (3C) framework for adaptive learning support. The Capture phase extracts learners' perceived knowledge states from open-ended self-reports to construct a heterogeneous graph linking learners and knowledge concepts. The Calibrate phase applies a heterogeneous graph neural network to infer latent perceived states for concepts not explicitly mentioned, enabling systematic knowledge monitoring assessment. The Coach phase classifies learners into five metacognitive patterns and delivers personalized feedback addressing both knowledge gaps and calibration errors. Evaluation with 684 students demonstrates 85.21% AUC in predicting latent perceived states, significantly outperforming baseline methods. A user study with 47 participants shows positive reception of feedback quality, with participants particularly valuing concrete feedback on knowledge gaps and actionable study guidance. These findings advance AI-based learning support toward metacognitive teammates that foster accurate self-awareness while supporting knowledge growth.

2605.25418 2026-05-26 cs.CV cs.GR cs.LG

Generating 3D models from sketches of human faces using a combined approach of Convolutional Neural Networks, Procedural Modeling, and Contour Mapping

利用卷积神经网络、程序化建模和轮廓映射的联合方法从人脸素描生成3D模型

Nancy Iskander

Affiliations Behaviour Digital

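AI summary Proposes a new method combining convolutional neural networks, a parametric 3D face model, and active snake contours, and is the first to train a CNN to detect the expression in a sketch and generate the corresponding 3D model.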

Comments A thesis submitted in conformity with the requirements for the degree of Master of Science in Computer Science Graduate Department of Computer Science University of Toronto

详情
AI中文摘要

从人脸素描生成3D模型是计算机图形学中的一个活跃研究课题,因为它有潜力极大地促进专业3D艺术家和新手的建模工作。受面部表情显著改变和塑造面部轮廓这一观察的启发,我们的方法结合了表情检测和3D模型生成。结果是一种从素描生成3D模型的新方法,它依赖于三个组成部分:卷积神经网络、参数化3D人脸模型(Valley Girl)和主动蛇形轮廓。在文献中首次,CNN(使用我们自己生成的数据集)被训练通过检测活跃的FACS动作单元来识别给定素描中的表情。然后,该表情被复制到Valley Girl上以获得具有相似表情的3D模型。接着,使用主动蛇形轮廓来找到所需的变换,以缩小该模型与给定素描之间的差距。

英文摘要

Generating 3D models from face sketches is an active topic of research in Computer Graphics due to its potential to tremendously facilitate the modeling of faces for both professional 3D artists and novices. Motivated by the observation that facial expressions are responsible for significantly altering and shaping the contours of our faces, we combine expression detection and 3D model generation in our approach. The result is a novel approach to generating 3D models from sketches which relies on three components: Convolutional Neural Networks, a parametric 3D face model (Valley Girl), and Active Snake Contours. For the first time in the literature, CNNs are trained (using our own generated dataset) to detect the expression in the given sketch by detecting the active FACS Action Units. The expression is then duplicated on Valley Girl to obtain a 3D model with a similar expression. Active Snake Contours are then used to find the transforms needed to close the gaps between that model and the given sketch.

2605.25415 2026-05-26 cs.CL cs.CY cs.ET

LLM-as-a-Reviewer: Benchmarking Their Ability, Divergence, and Prompt Injection Resistance as Paper Reviewers

LLM-as-a-Reviewer: 基准测试它们作为论文审稿人的能力、分歧和提示注入抵抗性

Lingyao Li, Junjie Xiong, Changjia Zhu, Runlong Yu, Chen Chen, Junyu Wang, Renkai Ma, Zhicong Lu

Affiliations University of South Florida; Missouri University of Science and Technology; University of Alabama; Florida International University; University of Cincinnati; George Mason University

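AI summary Through a systematic benchmark, this study evaluates 12 large language models as paper reviewers on rating calibration, divergence from human reviewers, and resistance to an invisible font-mapping attack, finding that LLMs systematically overrate weak papers, emphasize different concerns than humans, and are vulnerable to prompt injection.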

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于学术同行评审,但其可靠性、与人类判断的一致性以及对对抗性攻击的鲁棒性仍知之甚少。我们对从NeurIPS和ICLR分层的898篇论文进行了LLM-as-a-Reviewer的系统基准测试,评估了12个LLMs的三个维度:评分校准、与人类审稿人的分歧以及对通过不可见字体映射攻击嵌入的提示注入的抵抗性。我们发现LLMs系统性地高估较弱的投稿,并在主题重点上与人类存在分歧,低估清晰度而高估可重复性,同时生成的评论长度是人类的2到3倍,词汇多样性较低且词汇更标准化。提示注入仍然非常有效。简单的隐藏指令可以在相当一部分案例中将低分论文提升至可接受级别的评分,且效果在不同模型家族间差异显著。虽然LLMs在结构化评估方面具有实用性,但将其整合到同行评审中需要针对内在偏见和对抗性风险设置防护措施。

英文摘要

Large language models (LLMs) are increasingly used in academic peer review, yet their reliability, alignment with human judgment, and robustness to adversarial attacks remain poorly understood. We present a systematic benchmark of LLM-as-a-Reviewer on 898 papers stratified from NeurIPS and ICLR, evaluating 12 LLMs along three axes: rating calibration, divergence from human reviewers, and resistance to prompt injection embedded via an invisible font-mapping attack. We find that LLMs systematically overrate weaker submissions and diverge from humans in topical emphasis, under-flagging Clarity and over-flagging Reproducibility, while producing reviews two to three times longer with lower lexical diversity and a more standardized vocabulary. Prompt injection remains highly effective. Simple hidden instructions can promote low-scoring papers to acceptance-level ratings in a substantial fraction of cases, with effectiveness varying sharply across model families. While LLMs offer utility in structuring evaluations, their integration into peer review requires safeguards against both intrinsic biases and adversarial risks.

2605.25414 2026-05-26 cs.RO

How to Mitigate the Distribution Shift Problem in Robotics Control: A Robust and Adaptive Approach Based on Offline to Online Imitation Learning

如何缓解机器人控制中的分布偏移问题:一种基于离线到在线模仿学习的鲁棒自适应方法

Hyung-Suk Yoon, Seung-Woo Seo

Affiliations Department of Electronic and Computer Engineering, Seoul National University, Seoul, South Korea

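AI summary Proposes a robust offline to adaptive online imitation learning framework that mitigates distribution shift by using a discriminator to broaden state-action coverage in the offline phase and self-supervised imitation learning in the online phase.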

Comments 8 pages, 2 figures

详情
AI中文摘要

模仿学习中的分布偏移是指智能体无法为训练期间未访问的状态规划适当动作的问题。该问题很大程度上归因于专家演示在整个环境中提供的固有狭窄状态-动作覆盖。在本文中,我们提出了一种鲁棒离线到自适应在线模仿学习框架,以终身、多阶段方案处理分布偏移问题。在离线学习阶段,我们利用补充演示通过判别器有效训练策略,从而拓宽策略的状态-动作覆盖,增强策略对分布偏移的鲁棒性。在后续的在线推理阶段,我们的框架检测分布偏移的发生,并从在线经验中进行自监督模仿学习,使策略适应在线环境。通过在MuJoCo环境中的广泛评估,我们证明我们的方法在分布偏移的鲁棒性和对在线环境的适应性能方面优于基线算法,这表明我们的框架在对抗分布偏移方面具有优越性能。

英文摘要

Distribution shift in imitation learning refers to the problem that the agent cannot plan proper actions for states that were not visited during training. This problem can be largely attributed to the inherently narrow state-action coverage that expert demonstrations provide over the full environment. In this paper, we propose a robust offline to adaptive online imitation learning framework that handles the distribution shift problem in a lifelong, multi-phase scheme. In the offline learning phase, we broaden the state-action coverage of the policy with supplementary demonstrations, using a discriminator to train the policy effectively on them, thereby enhancing the robustness of the policy to distribution shift. In the subsequent online inference phase, our framework detects the occurrence of distribution shift and conducts self-supervised imitation learning from online experiences to adapt the policy to the online environment. Through extensive evaluations in MuJoCo environments, we demonstrate that our method exhibits better robustness to distribution shift and better adaptation to online environments than the baseline algorithms, indicating that our framework copes better with distribution shift.

2605.25409 2026-05-26 cs.CV

MTLLFM: Multimodal-Temporal Laughter Localization: UR-FUNNY-Temporal and SMILE-Temporal Benchmarks with an Adaptive Multimodal Fusion Model

MTLLFM: 多模态时间笑声定位——UR-FUNNY-Temporal和SMILE-Temporal基准数据集与自适应多模态融合模型

Eyal Hanania, Nadav Kirsch, Daniel Arkushin, Jonathan Benvenisti, Amos Bercovich, Elie Zemmour, Sahar Froim

Affiliations WSC-Sports

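AI summary To address the inability of existing methods to capture the precise temporal boundaries of brief laughter events, this paper presents two fully annotated temporal laughter datasets (UR-FUNNY-Temporal and SMILE-Temporal) and a lightweight weakly-supervised framework that combines fixed HuBERT and MAE encoders with temporal softmax pooling and adaptive modality gating to go from clip-level labels to frame-level temporal localization, reaching 99% F1 and 68.1% localization precision on sports broadcast data and improving downstream laughter reasoning by 227% on CIDEr.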

Comments Accepted to the Workshop on Affective & Behavior Analysis in-the-wild, CVPR 2026

详情
AI中文摘要

在视频中检测笑声对于情感计算和叙事理解至关重要,但现有方法将其视为粗粒度的片段级分类,无法捕捉短暂、瞬态笑声事件的精确时间边界。我们通过两个互补的贡献填补了这一空白。首先,我们引入了UR-FUNNY-Temporal和SMILE-Temporal,这是两个完全标注的时间笑声数据集,扩展了广泛使用的幽默基准。我们的标注覆盖超过11,053个视频(78.8小时),并为每个笑声事件提供精确的起始/结束边界,以及区分说话者与观众笑声、模态主导性(声学、视觉或两者)和强度级别的丰富元数据。其次,我们提出了一个轻量级弱监督框架用于时间笑声定位。我们的架构将固定的HuBERT和MAE编码器与时间softmax池化和自适应模态门控相结合,从片段级标签学习细粒度的时间定位,而无需在训练期间使用帧级标注。在三个数据集上的实验表明,我们的方法显著优于包括Gemini 3 Flash在内的多模态基础模型,在体育广播数据上达到99%的F1和68.1%的定位精度。消融实验验证了每个架构组件。此外,我们的精确时间标签将下游笑声推理的CIDEr提升了227%,使GPT-3.5能够超越GPT-4o。代码、UR-FUNNY-Temporal和SMILE-Temporal数据集已在https://github.com/WSCSports/MTLLFM-temporal-laughter-localization公开。

英文摘要

Detecting laughter in video is essential for affective computing and narrative understanding, yet existing approaches treat it as coarse clip-level classification, failing to capture precise temporal boundaries of brief, transient laughter events. We address this gap with two complementary contributions. First, we introduce UR-FUNNY-Temporal and SMILE-Temporal, fully annotated temporal laughter datasets extending two widely-used humor benchmarks. Our annotations cover over 11,053 videos (78.8 hours) and provide precise onset/offset boundaries for each laughter event, along with rich metadata distinguishing speaker vs. audience laughter, modality dominance (acoustic, visual, or both), and intensity levels. Second, we propose a lightweight weakly-supervised framework for temporal laughter localization. Our architecture combines fixed HuBERT and MAE encoders with temporal softmax pooling and adaptive modality gating, learning fine-grained temporal grounding from clip-level labels without requiring frame-level annotations during training. Experiments across three datasets demonstrate that our approach substantially outperforms multimodal foundation models including Gemini 3 Flash, achieving 99% F1 and 68.1% localization precision on sports broadcast data. Ablations validate each architectural component. Furthermore, our precise temporal tags improve downstream laughter reasoning by 227% on CIDEr, enabling GPT-3.5 to outperform GPT-4o. The code, UR-FUNNY-Temporal and SMILE-Temporal datasets are publicly available at https://github.com/WSCSports/MTLLFM-temporal-laughter-localization.
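The temporal softmax pooling named above can be illustrated with a small sketch. It assumes one scalar laughter score per frame and a pooling rule in which the softmax weights are computed from those same scores, so that a clip-level label can supervise frame-level scores. The paper's actual formulation, the `temperature` parameter, and the function name are not given in the abstract and are assumptions here.

```python
import math
from typing import List, Sequence, Tuple


def temporal_softmax_pool(
    frame_scores: Sequence[float], temperature: float = 1.0
) -> Tuple[float, List[float]]:
    """Pool per-frame scores into one clip score with softmax weights over time."""
    peak = max(frame_scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp((s - peak) / temperature) for s in frame_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    clip_score = sum(w * s for w, s in zip(weights, frame_scores))
    return clip_score, weights
```

The pooled score sits between the mean and the maximum of the frame scores, and the weights concentrate on the highest-scoring frames, which is what makes them usable as a rough localization signal when only clip-level labels are available.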

2605.25407 2026-05-26 cs.CV

Towards Active Real-to-Twin Inspection: A New Paradigm for Zero-Shot Anomaly Detection

迈向主动实景到数字孪生检测:零样本异常检测的新范式

Jiaxuan Liu, Yunkang Cao, Yufeng Chen, Chunyang Li, Yuhuan Du, Hui Zhang

Affiliations National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University

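AI summary Proposes the Real-to-Twin anomaly detection task and the AVATAR framework, which learns semantic alignment between real observations and CAD digital twins to localize anomalies in a zero-shot manner.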

Comments 6 pages, 4 figures, accepted to IEEE-CYBER 2026, Florence, Italy

详情
AI中文摘要

零样本异常检测(AD)在具身工业检测中的部署受到其依赖被动、固定视角2D图像的严重制约。这种固有形式无法适应真实环境中所需的主动、动态观测。为突破这一限制,我们引入了实景到数字孪生异常检测(Real-to-Twin Anomaly Detection),这是一项新颖的任务,直接针对几何匹配的CAD数字孪生评估物理观测。为应对这一新任务,我们提出了AVATAR框架,旨在学习实景与数字孪生之间的鲁棒语义对齐。通过仅使用无缺陷对来弥合良性的Sim2Real领域差距,AVATAR有效地将CAD先验转化为动态、无异常的参考。这种优雅的公式使模型能够以零样本方式将各种异常定位为不可对齐的偏差,消除了对缺陷标注的需求。大量实验表明,AVATAR显著优于改编的最先进基线,对严重的视角变化表现出卓越的鲁棒性。代码和数据集将公开提供。

英文摘要

The deployment of zero-shot anomaly detection (AD) in embodied industrial inspection is severely bottlenecked by its reliance on passive, fixed-viewpoint 2D imagery. Such formulations inherently fail to accommodate the active, dynamic observations required in real-world environments. To break this limitation, we introduce Real-to-Twin Anomaly Detection, a novel task that evaluates physical observations directly against geometrically matched CAD Digital Twins. To tackle this new task, we propose AVATAR, a framework designed to learn robust semantic alignment between Real and Digital Twins. By bridging benign Sim2Real domain gaps using only defect-free pairs, AVATAR effectively transforms CAD priors into dynamic, anomaly-free references. This elegant formulation enables the model to localize diverse anomalies in a zero-shot manner as unalignable deviations, eliminating the need for defect annotations. Extensive experiments demonstrate that AVATAR substantially outperforms adapted state-of-the-art baselines, exhibiting exceptional robustness to severe viewpoint variations. The code and dataset will be made publicly available.

2605.25404 2026-05-26 cs.CL eess.AS

Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

主动应对不确定性:面向口语对话系统的因果感知错误诊断与交互式澄清

Yizhou Peng, Ziyang Ma, Changsong Liu, Yi-Wen Chao, Xie Chen, Eng Siong Chng

Affiliations Nanyang Technological University, Singapore; Shanghai Jiao Tong University, China

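AI summary This paper proposes a cause-aware error recovery paradigm in which fine-grained detectors disentangle ASR errors into perception, comprehension, and deletion failures, enabling the LLM to carry out targeted multi-turn clarification strategies that markedly reduce word error rate and improve downstream task performance.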

详情
AI中文摘要

级联自动语音识别-大语言模型(ASR-LLM)流水线在工业口语对话系统(SDS)中仍然流行,主要因为其解耦设计确保了感知可验证性。然而,级联系统存在错误传播问题,因为转录失败不可避免地级联到后续组件,从而降低最终交互质量。尽管ASR置信度分数为不可靠输入提供了简单过滤,但这种方法存在根本性局限,因为它通常无法检测删除错误,也无法区分声学(听不清)和语言(不理解)不匹配,而这两者都需要针对性的恢复策略。在本文中,我们提出了一种因果感知的错误恢复范式,从根本上重新思考SDS的鲁棒性。与传统的置信度过滤不同,我们引入了一组小型精度聚焦检测器,利用深度ASR潜在表示将词级错误解耦为感知、理解和删除失败。这种细粒度诊断智能使LLM能够编排针对性的多轮澄清策略,有效将模糊信号转化为无缝的用户交互。实验结果验证了我们方法的精度,与基线相比,在领域转移错误上的召回率提高了一倍以上(57.96% vs. 23.66%)。关键的是,这种诊断精度在不同口音、失真和领域下,使词错误率降低高达30%,下游任务性能提升17%。

英文摘要

Cascaded Automatic Speech Recognition-Large Language Model (ASR-LLM) pipelines remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation, as transcription failures inevitably cascade to subsequent components, thereby degrading the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors or to distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that fundamentally rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnostic intelligence empowers the LLM to orchestrate targeted, multi-turn clarification strategies, effectively transforming ambiguous signals into seamless user interactions. Experimental results validate the precision of our approach, which more than doubles the recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in WER and a 17% improvement on the downstream task across diverse accents, distortions, and domains.

2605.25401 2026-05-26 cs.RO

Path Following Control System of Line-of-Sight Guidance for Robotic Dolphin with Multi-Link Mechanism in Underwater Simulator

水下模拟器中多连杆机构仿生海豚的视线导引路径跟踪控制系统

Takumi Asada, Takao Oki, Hideo Furuhashi, Kenta Tabata, Renato Miyagusuku, Koichi Ozaki

Affiliations Utsunomiya University; Aichi Institute of Technology

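AI summary For multi-link biomimetic autonomous underwater vehicles (BAUVs), proposes a path following control system based on line-of-sight guidance, with parameters determined and control methods evaluated in an underwater simulator.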

Journal ref 2026 IEEE/SICE International Symposium on System Integration (SII). IEEE, 2026. p. 844-849

详情
AI中文摘要

具有多连杆机构的仿生自主水下航行器(BAUV)因其低功耗和高机动性被广泛用于水生生物观测和环境调查。环境调查需要能够自动跟踪特定点的路径跟踪系统。然而,BAUV的路径跟踪系统有限,且其与多连杆机构机器人的评估尚未明确。由于BAUV的模型因仿生类型而异,其路径跟踪系统需要预先进行仿真。在本研究中,我们提出了一种适用于多连杆机构BAUV的路径跟踪系统,并在水下模拟中进行了评估。结果表明,可以设计出适合BAUV的路径跟踪系统,使用模拟器确定参数,并评估控制方法。

英文摘要

Biomimetic autonomous underwater vehicles (BAUVs) with multi-link mechanisms are widely used for aquatic life observation and environmental surveys because of their low power consumption and high maneuverability. An environmental survey requires a path following system that automatically follows specified points. However, path following systems for BAUVs are limited, and their evaluation on multi-link mechanism robots has not yet been clarified. A path following system for a BAUV requires prior simulation because the model differs depending on the type of biomimetics. In this study, we propose a path following system for BAUVs with a multi-link mechanism and evaluate it in an underwater simulation. The results show that it was possible to design a path following system suitable for BAUVs, determine its parameters using a simulator, and evaluate control methods.
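The title refers to line-of-sight (LOS) guidance, which in its common textbook lookahead form can be sketched as below. The abstract does not state which LOS variant or parameters the authors use, so the straight-line path segment, the `lookahead` distance, and the function name `los_heading` are assumptions for illustration only.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def los_heading(pos: Point, wp_a: Point, wp_b: Point, lookahead: float) -> Tuple[float, float]:
    """Lookahead-based LOS guidance for the straight path from wp_a to wp_b.

    Returns (desired heading in radians, signed cross-track error).
    """
    # Direction of the path segment in the horizontal plane.
    alpha = math.atan2(wp_b[1] - wp_a[1], wp_b[0] - wp_a[0])
    dx, dy = pos[0] - wp_a[0], pos[1] - wp_a[1]
    # Signed lateral distance from the path (positive to the left of the path direction).
    cross_track = -dx * math.sin(alpha) + dy * math.cos(alpha)
    # Steer toward a point `lookahead` ahead along the path.
    psi = alpha + math.atan2(-cross_track, lookahead)
    # Wrap the result to (-pi, pi].
    return math.atan2(math.sin(psi), math.cos(psi)), cross_track
```

A smaller lookahead distance makes the vehicle turn toward the path more aggressively, while a larger one gives a gentler approach; tuning such a parameter is the kind of task a simulator is well suited for.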

2605.25399 2026-05-26 cs.AI

Towards end-to-end LLM-based censoring-aware survival analysis

面向端到端基于大语言模型的删失感知生存分析

Yishu Wei, Hexin Dong, Yi Lin, Jiahe Qian, Yi Liu, Yifan Peng

Affiliations Department of Population Health Science, Weill Cornell Medicine; Weill Cornell Medicine

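AI summary Proposes the LLMSurvival framework, which reformulates time-to-event prediction as pairwise ranking to enable censoring-aware survival analysis and outperforms Cox proportional hazards and three deep learning models on ICU mortality and fracture risk prediction.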

详情
AI中文摘要

目的:生存分析是医学预测的核心,然而大语言模型(LLM)很少被用作端到端生存模型,因为删失阻碍了直接的监督微调。这里我们提出LLMSurvival,一个框架,使得未修改的LLM能够直接操作表格临床数据进行删失感知的生存分析。材料与方法:LLMSurvival将时间事件预测重新表述为可比较受试者之间的成对排序,并通过聚合与训练队列中锚定个体的比较来推导测试时风险。结果:在两个临床任务(MIMIC-IV中的ICU死亡率预测和纽约长老会/威尔康奈尔医学中心队列中的脆性骨折预测)中,LLMSurvival相比Cox比例风险模型,整体一致性提高了ICU死亡率3.1%和骨折风险0.5%,相比三个已建立的深度学习生存模型,ICU死亡率平均提高2.1%,骨折风险平均提高2.8%。讨论:结果表明,通过基于比较的重新制定,可以使带有删失的生存建模与LLM微调兼容。该框架展示了高可移植性,并且在不同的临床背景下优于专家制定的评分(如SAPS-II和FRAX评分)。此外,该框架支持本地部署,因为紧凑、公开可用的基础模型提供了足够的性能。结论:LLMSurvival框架作为通过LLM进行集成、删失意识的生存分析的概念验证。

英文摘要

Objective: Survival analysis is central to medical prediction, yet large language models (LLMs) are rarely used as end-to-end survival models because censoring prevents straightforward supervised fine-tuning. Here we present LLMSurvival, a framework that enables censoring-aware survival analysis with unmodified LLMs operating directly on tabular clinical data. Materials and Methods: LLMSurvival reformulates time-to-event prediction as pairwise ranking among comparable subjects, and derives test-time risk by aggregating comparisons against anchor individuals from the training cohort. Results: Across two clinical tasks (ICU mortality prediction in MIMIC-IV and fragility fracture prediction in a NewYork-Presbyterian/Weill Cornell Medicine cohort), LLMSurvival improves overall concordance over Cox proportional hazards modeling by 3.1% for ICU mortality and 0.5% for fracture risk, and over three established deep learning survival models by 2.1% on average for ICU mortality and 2.8% for fracture risk. Discussion: The results show that survival modeling with censoring can be made compatible with LLM fine-tuning through comparison-based reformulation. The framework demonstrates high portability and superior performance over expert-curated scores such as SAPS-II and FRAX across diverse clinical contexts. Furthermore, the framework supports local deployment, as compact, publicly available base models provide sufficient performance. Conclusion: The LLMSurvival framework serves as a proof of concept for an integrated, censoring-conscious approach to survival analysis via LLMs.
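The pairwise reformulation described above can be sketched in two small pieces: building comparable pairs under right censoring, and aggregating comparisons against anchors into a risk score. The comparability rule below is the standard one used for concordance-style metrics (the subject with the shorter time must have an observed event); whether the paper uses exactly this rule is an assumption. The callable `is_higher_risk` is a hypothetical stand-in for the model's pairwise judgment.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def comparable_pairs(times: Sequence[float], events: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where subject i is known to fail before subject j.

    A pair is usable only if the earlier time is an observed event;
    a subject censored early tells us nothing about the ordering.
    """
    pairs = []
    for i in range(len(times)):
        if not events[i]:
            continue
        for j in range(len(times)):
            if i != j and times[i] < times[j]:
                pairs.append((i, j))
    return pairs


def aggregate_risk(subject: T, anchors: Sequence[T], is_higher_risk: Callable[[T, T], bool]) -> float:
    """Fraction of anchors that the subject is judged to be at higher risk than."""
    wins = sum(1 for anchor in anchors if is_higher_risk(subject, anchor))
    return wins / len(anchors)
```

The first function shows why censoring need not block supervised training: only pairs with an unambiguous ordering become training examples. The second turns many pairwise answers into a single score that can be ranked across test subjects.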

2605.25396 2026-05-26 cs.CV cs.AI

Subspace-Guided Semantic and Topological Invariant Registration for Annotation-Free Ultrasound Plane Quality Control

子空间引导的语义与拓扑不变配准用于无标注超声平面质量控制

Chunzheng Zhu, Jianxin Lin, Feng Wang, Cheng Jiang, Guanghua Tan, Zhenyu Zhou, Shengli Li, Kenli Li

Affiliations Hunan University; Shenzhen Maternity and Child Healthcare Hospital

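AI summary Proposes the STRIQ framework, which achieves annotation-free ultrasound plane quality control through a subspace-guided registration consistency measure and attains state-of-the-art correlation with clinical quality scores.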

Comments MICCAI 2026 Accepted Paper; Subspace-Guided Registration for Ultrasound Quality Control

详情
AI中文摘要

超声图像的可靠质量控制对于实时采集指导和回顾性临床审计至关重要,然而现有方法严重依赖逐平面标注,或采用在临床采集固有空间变形下易产生系统性偏差的伪标签。我们提出STRIQ,一种基于配准的框架,将无标注超声平面质量控制重新定义为子空间引导的一致性度量问题。具体而言,STRIQ引入潜在配准对齐器(LRA)以建立查询图像与方差驱动锚点之间的层次特征空间对应,这些锚点通过方差谱准则从无标签数据中自主提炼,作为结构稳定的原型。为进一步区分解剖平面并减轻负知识迁移,我们提出正交知识子空间(OKS)模块。OKS将平面特定表示分解为相互正交的子空间,实现细粒度专家协作同时防止平面间干扰,确保质量度量基于原则性的子空间邻近性。在内部US4QA和公开CAMUS数据集上的大量实验表明,STRIQ实现了与临床质量评分的最优相关性,为无标注、实时可靠的超声质量控制建立了新范式。我们的代码可在https://github.com/zhcz328/STRIQ获取。

英文摘要

Reliable quality control (QC) of ultrasound images is essential for both real-time acquisition guidance and retrospective clinical audit, yet existing approaches rely heavily on per-plane annotations, or employ pseudo-labeling prone to systematic bias under spatial deformations inherent in clinical acquisition. We present STRIQ, a registration-driven framework that recasts annotation-free US plane quality control as a subspace-guided consistency measurement problem. Specifically, STRIQ introduces a Latent Registration Aligner (LRA) to establish hierarchical feature space correspondences between query images and variance-driven anchors, which are autonomously distilled from unlabeled data via a variance spectrum criterion to serve as structurally stable prototypes. To further disambiguate anatomical planes and mitigate negative knowledge transfer, we propose an Orthogonal Knowledge Subspace (OKS) module. The OKS decomposes plane-specific representations into mutually orthogonal subspaces, enabling fine-grained expert collaboration while preventing inter-plane interference, ensuring that the quality metric is grounded in principled subspace proximity. Extensive experiments on the in-house US4QA and public CAMUS datasets demonstrate that STRIQ achieves state-of-the-art correlation with clinical quality scores, establishing a new paradigm for annotation-free, real-time reliable ultrasound quality control. Our code is available at https://github.com/zhcz328/STRIQ.

2605.25395 2026-05-26 cs.LG math.OC

EMA-Nesterov: Stabilizing Nesterov's Lookahead for Accelerated Deep Learning Optimization

EMA-Nesterov:稳定Nesterov前瞻以加速深度学习优化

Chung-Yiu Yau, Dawei Li, Athanasios Glentis, Valentyn Boreiko, Hoi-To Wai, Mingyi Hong

Affiliations University of Minnesota; Amazon AGI; The Chinese University of Hong Kong

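AI summary To address the instability of Nesterov momentum in deep learning optimization caused by stochastic gradient noise and non-convex losses, proposes EMA-Nesterov, which replaces the standard lookahead direction with an exponential moving average of parameter updates, capturing the low-frequency trend of the training trajectory through low-pass filtering; it retains a theoretical accelerated convergence rate on convex problems, and language model pre-training experiments confirm its broad applicability and better performance than existing lookahead methods.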

Comments 25 pages, 10 figures

详情
AI中文摘要

基于前瞻的加速方法,如Nesterov动量,在优化中广泛使用,但在深度学习训练中常因随机梯度噪声和非凸损失景观而变得不可靠。特别是,标准前瞻依赖于短视更新信号(例如连续迭代之间的差异),这些信号本质上有噪声,可能导致不稳定的外推方向。本文从轨迹角度重新审视Nesterov加速,并认为深度学习中的有效加速应利用优化轨迹的低频趋势,而非外推噪声的一步更新。基于这一见解,我们提出EMA-Nesterov,一个简单的修改,用参数更新的指数移动平均(EMA)替代标准Nesterov前瞻方向。这产生了一个稳定的前瞻方向,通过低通滤波器捕捉并利用训练轨迹的演变趋势,同时通过EMA的几何加权结构保持对渐进变化的适应性。我们证明,EMA-Nesterov在凸问题中保留了与Nesterov加速梯度方法类似的理论加速收敛率。此外,我们在语言模型预训练上提供了经验证据,验证了EMA-Nesterov广泛适用于一系列微调的基础优化器,包括Adam、SOAP、Muon,以及在优化基准(NanoGPT)上达到最先进性能的复杂优化器。与先前的瞻方法相比,EMA-Nesterov通过避免短视前瞻的不稳定性和长视前瞻的非自适应性,实现了更好的性能。

英文摘要

Lookahead-based acceleration methods, such as Nesterov's momentum, are widely used in optimization, but they often become unreliable in deep learning training mainly due to stochastic gradient noise and non-convex loss landscapes. In particular, standard lookahead relies on short-horizon update signals (e.g., differences between consecutive iterates), which are inherently noisy and can lead to unstable extrapolation directions. This work revisits Nesterov's acceleration from a trajectory perspective and argues that effective acceleration in deep learning should harness the low-frequency trends of optimization trajectories rather than extrapolating noisy one-step updates. Leveraging this insight, we propose EMA-Nesterov, a simple modification that replaces the standard Nesterov's lookahead direction with an exponential moving average (EMA) of parameter updates. This yields a stabilized lookahead direction that captures and harnesses the evolving trend of the training trajectory through a low-pass filter, while remaining adaptive to progressive changes via the geometric weighting structure of EMA. We show that EMA-Nesterov retains a theoretical accelerated convergence rate in convex problems that is analogous to Nesterov's accelerated gradient method. Furthermore, we provide empirical evidence on language model pre-training to verify that EMA-Nesterov is broadly applicable across a range of fine-tuned base optimizers, including Adam, SOAP, Muon, as well as complex optimizers that achieve state-of-the-art performance on optimization benchmarks (NanoGPT). Compared to prior lookahead methods, EMA-Nesterov achieves better performance by avoiding the instability of short-horizon lookahead and the non-adaptivity of long-horizon lookahead.
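One plausible reading of the idea above, written for a scalar parameter with plain gradient descent as the base optimizer, is sketched below. The abstract does not give the exact update rule, so everything here is an assumption made for illustration: the order of operations, the coefficients `beta` and `gamma`, and the function name `ema_nesterov_minimize`. It should not be taken as the authors' algorithm.

```python
from typing import Callable


def ema_nesterov_minimize(
    grad: Callable[[float], float],
    x0: float,
    lr: float = 0.1,
    beta: float = 0.9,    # EMA decay for parameter updates (assumed value)
    gamma: float = 0.5,   # lookahead scale (assumed value)
    steps: int = 200,
) -> float:
    """Gradient descent with a lookahead along an EMA of past parameter updates."""
    x = x0
    ema = 0.0
    for _ in range(steps):
        # Classic Nesterov would extrapolate along the last update (x_t - x_{t-1});
        # here the extrapolation direction is a smoothed average of past updates.
        lookahead = x + gamma * ema
        new_x = lookahead - lr * grad(lookahead)
        ema = beta * ema + (1.0 - beta) * (new_x - x)
        x = new_x
    return x
```

Because the EMA acts as a low-pass filter, a single noisy step changes the extrapolation direction only slightly, while the geometric weighting still lets the direction follow gradual changes in the trajectory.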

2605.25394 2026-05-26 cs.AI cs.CL

Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models

Second Guess: 通过弃权和答案稳定性检测小型语言模型的不确定性

Ashwath Vaithinathan Aravindan, Mayank Kejriwal

Affiliations * University of Southern California; Information Sciences Institute

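AI summary Proposes Second Guess, a lightweight, parameter-free prompting technique that adds an "I don't know" option and observes answer stability to enable abstention in multiple-choice question answering, effectively detecting uncertainty in small language models.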

Details
AI Chinese abstract

大型语言模型在不确定时往往生成自信但错误的答案,而非弃权。这个问题对于小型语言模型(SLM)尤为严重,因为计算约束和自主操作放大了对可靠不确定性检测的需求。我们提出了_Second Guess_,一种轻量级、无参数的提示技术,用于多项选择问答(MCQA)中的弃权,非常适合SLM。我们的关键实证洞察是,真正知道答案的模型会一致地选择它,而不确定的模型在添加“我不知道”选项时会表现出不稳定的行为。在四个开源模型(2B-8B参数)和四个基准测试上评估,Second Guess实现了10.81%的最高复合风险改进。值得注意的是,在基于熵的方法退化的微调模型上,它保持了8%的复合风险改进,并且对性能较低的模型改进最大。重现本工作所需的所有代码和结果可在https://github.com/Mystic-Slice/second-guess获取。

English abstract

Large language models often generate confident but incorrect answers rather than abstaining when uncertain. This problem is particularly acute for small language models (SLMs), where computational constraints and autonomous operation amplify the need for reliable uncertainty detection. We propose Second Guess, a lightweight, parameter-free prompting technique for abstention in multiple-choice question answering (MCQA) that is well-suited for SLMs. Our key empirical insight is that models that truly know an answer will select it consistently, while uncertain models exhibit unstable behavior when an "I don't know" option is added. Evaluated on four open models (2B-8B parameters) and four benchmarks, Second Guess achieves the highest composite risk improvement of 10.81%. Notably, it maintains an 8% composite risk improvement on fine-tuned models where entropy-based methods degrade, and improves most for lower-performing models. All code and results required to reproduce this work are available at https://github.com/Mystic-Slice/second-guess
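
The stability check described in this abstract can be sketched as follows. This is a hedged illustration, not the code from the linked repository: the decision rule (abstain when the model picks the added option or changes its answer) is an assumption drawn from the abstract's wording, and `ask` is a hypothetical callable standing in for a model query.

```python
from typing import Callable, Optional, Sequence

IDK = "I don't know"


def second_guess(
    ask: Callable[[str, Sequence[str]], str],
    question: str,
    options: Sequence[str],
) -> Optional[str]:
    """Return the model's answer, or None to abstain (illustrative rule only)."""
    # First pass: the original multiple-choice question.
    first = ask(question, list(options))
    # Second pass: the same question with an added "I don't know" option.
    second = ask(question, list(options) + [IDK])
    # Assumed rule: abstain if the model takes the new option or its answer shifts.
    if second == IDK or second != first:
        return None
    return first


if __name__ == "__main__":
    # Hypothetical stand-ins for a model: one stable, one that drifts to the last option.
    confident = lambda q, opts: "Paris"
    unsure = lambda q, opts: opts[-1]
    choices = ["Paris", "Rome", "Madrid"]
    print(second_guess(confident, "Capital of France?", choices))
    print(second_guess(unsure, "Capital of France?", choices))
```

The sketch needs only two queries per question and no access to token probabilities, which matches the abstract's description of a prompting-only technique.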

2605.25393 2026-05-26 cs.RO

Decision-Making with Lightweight Confidence-Aware Language Model for Autonomous Driving

基于轻量级置信感知语言模型的自动驾驶决策

Ruoyu Yao, Ruiguo Zhong, Pei Liu, Mingxing Peng, Rui Yang, Jun Ma

Affiliations * The Hong Kong University of Science and Technology (Guangzhou)

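AI summary Proposes a decision-making framework built on a lightweight confidence-aware language model: a multi-agent collaboration generates confidence-annotated decision demonstrations, which are distilled into a lightweight dual-head model that achieves SOTA success rates and low latency on nuPlan.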

Comments 8 Pages, 3 figures, ITSC 2026

Details
AI Chinese abstract

大型语言模型和多模态大语言模型在自动驾驶中展现出巨大潜力,提供类人推理和开放世界泛化能力。然而,这些庞大模型过高的计算开销和推理延迟严重阻碍了它们在资源受限的自动驾驶系统中的部署。为解决这一挑战,我们提出了一种新颖的决策框架,利用轻量级置信感知语言模型,弥合了复杂多模态意图推理与高效推理之间的差距。具体而言,我们设计了一个多智能体协作工作流,包括动作投票、置信评估和总结智能体,通过显式的思维链推理生成高质量、带置信注释的决策演示。然后,这些演示被蒸馏到一个具有双头架构的轻量级语言模型中,实现决策概率的联合预测和文本理由的生成。蒸馏通过置信感知微调策略结合检索增强生成来实现,以增强模型的适应性和数据效率。在nuPlan基准上的全面闭环实验表明,我们的方法在常规和长尾场景下均实现了最先进的成功率,同时保持了低推理延迟。

English abstract

Large Language Models (LLMs) and Multimodal LLMs (MLLMs) have demonstrated immense potential in autonomous driving (AD) by offering human-like reasoning and open-world generalization. However, the excessive computational overhead and high inference latency of these massive models severely hinder their deployment in resource-constrained AD systems. To address this challenge, we propose a novel decision-making framework utilizing a lightweight confidence-aware language model, which bridges the gap between complex multimodal intention reasoning and efficient inference. Specifically, we design a multi-agent collaborative workflow, comprising action voting, confidence assessment, and summarization agents, to generate high-quality, confidence-annotated decision demonstrations via explicit Chain-of-Thought (CoT) reasoning. These demonstrations are then distilled into a lightweight language model featuring a dual-head architecture, enabling the joint prediction of decision probabilities and the generation of textual rationales. The distillation is realized via a confidence-aware fine-tuning strategy coupled with Retrieval Augmented Generation (RAG) to enhance the model's adaptability and data efficiency. Comprehensive closed-loop experiments on the nuPlan benchmark demonstrate that our approach achieves state-of-the-art (SOTA) success rates in both regular and long-tail scenarios while maintaining low inference latency.

2605.25391 2026-05-26 cs.LG eess.SP

A Context Augmented Multi-Play Multi-Armed Bandit Algorithm for Fast Channel Allocation in Opportunistic Spectrum Access

一种用于机会频谱接入中快速信道分配的上下文增强多玩多臂老虎机算法

Ruiyu Li, Guangxia Li, Xiao Lu, Jichao Liu, Yan Jin

Affiliations * School of Computer Science and Technology, Xidian University; Research and Development, Hainayun IoT Technology Co., Ltd

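AI summary For channel allocation in opportunistic spectrum access, proposes a context-augmented multi-play multi-armed bandit algorithm that models channel noise as a perturbation of the reward function and uses channel state information as context, deriving two index policies, one for linear and one for nonlinear correlations, that achieve low regret and a more reasonable selection of sub-optimal arms.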

Comments Accepted by ISCC'24

Details
AI Chinese abstract

我们研究了机会频谱接入(OSA)场景中用于信道分配的不安定(restless)上下文多玩多臂老虎机(MP-MAB)问题。大多数现有的MP-MAB方法对于实际OSA系统不实用,因为它们假设了许多理想条件,计算成本高,最重要的是忽略了与服务质量直接相关的信道噪声的影响。在本研究中,我们通过将信道噪声建模为MP-MAB中臂奖励函数的扰动来体现这种影响。由于信道状态信息与信道噪声之间存在隐含的相关性,我们将前者作为MP-MAB的上下文来表示后者引起的扰动。我们研究了上下文与扰动之间的两种相关性——线性和非线性,并分别推导出两种索引策略。这些策略通过线性模型和神经网络学习相关性,并使用估计的噪声值调整上置信界。数值实验表明,所提出的策略能够实现更低的遗憾,并以更合理的方式选择次优臂。

English abstract

We study the restless contextual multi-play multi-armed bandit (MP-MAB) problem for channel allocation in the opportunistic spectrum access (OSA) scenario. Most existing MP-MAB methods are impractical for real-world OSA systems because they assume many ideal conditions, incur a heavy computational cost, and, most importantly, ignore the impact of channel noise, which is directly related to the quality of service. In this study, we capture this impact by modeling channel noise as a perturbation of the arm's reward function in MP-MAB. As there is an implicit correlation between channel state information and channel noise, we take the former as a context for MP-MAB to represent the perturbation caused by the latter. We investigate two types of correlation between the context and the perturbation, linear and nonlinear, and derive an index policy for each. These policies learn the correlations through a linear model and a neural network, respectively, and use the estimated noise value to adjust the upper confidence bound. Numerical experiments demonstrate that the proposed policies can achieve lower regret and select sub-optimal arms in a more reasonable way.
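
A noise-adjusted index of the kind this abstract describes might look like the sketch below. It is an illustration under stated assumptions, not the paper's policy: the abstract does not say how the estimated noise enters the index, so the code simply subtracts a linear noise estimate from a standard UCB1 index and then picks the top `k` channels. The function `select_channels` and the fixed `weights` vector (standing in for a learned linear model) are hypothetical.

```python
import math
from typing import List, Sequence


def select_channels(
    means: Sequence[float],
    counts: Sequence[int],
    t: int,
    contexts: Sequence[Sequence[float]],
    weights: Sequence[float],
    k: int,
) -> List[int]:
    """Pick k channels by a UCB index reduced by estimated noise (illustrative only)."""
    scored = []
    for arm, (mean, n, ctx) in enumerate(zip(means, counts, contexts)):
        if n == 0:
            index = math.inf  # force exploration of channels never played
        else:
            bonus = math.sqrt(2.0 * math.log(t) / n)  # standard UCB1 exploration term
            # Assumed linear model: noise estimate is a dot product with the context.
            noise = sum(w * c for w, c in zip(weights, ctx))
            index = mean + bonus - noise  # assumed adjustment: subtract the estimate
        scored.append((index, arm))
    # Multi-play: take the k largest indices, breaking ties by channel number.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return sorted(arm for _, arm in scored[:k])


if __name__ == "__main__":
    chosen = select_channels(
        means=[0.5, 0.5, 0.5],
        counts=[10, 10, 10],
        t=30,
        contexts=[[0.0], [1.0], [0.2]],
        weights=[0.3],
        k=2,
    )
    print(chosen)
```

With equal empirical means and play counts, the channels with the smaller estimated noise are preferred, which is the qualitative behavior the abstract attributes to adjusting the upper confidence bound.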