arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.28409 2026-05-28 cs.AI

Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

Mingze Wu, Abhinav Anand, Shweta Verma, Mira Mezini

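AI summary: This paper explores post-training code-generation LLMs with offline reinforcement learning on existing code datasets. Experiments show the approach improves model performance, especially for small models and challenging coding problems.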

Abstract

Post-training using online reinforcement learning (RL) is an important training step for LLMs, including code-generating models. However, online RL for code generation involves LLM inference and verification of the generated output, which can take considerable time and resources. In this paper, we explore the application of offline RL to code-generating models by leveraging existing code datasets. Our experiments demonstrate that offline RL is an effective training strategy for improving LLM performance. We show that offline RL can be especially beneficial for small LLMs and challenging coding problems.

2605.28405 2026-05-28 cs.AI

Measuring Progress Toward AGI: A Cognitive Framework

Ryan Burnell, Yumeya Yamamori, Orhan Firat, Kate Olszewska, Steph Hughes-Fitt, Oran Kelly, Isaac R. Galatzer-Levy, Meredith Ringel Morris, Allan Dafoe, Alison M. Snyder, Noah D. Goodman, Matthew Botvinick, Shane Legg

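AI summary: Proposes a framework built on a cognitive taxonomy that evaluates system performance across 10 key cognitive faculties to quantify progress toward AGI.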

Comments 32 pages, 2 figures

Abstract

Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims, makes it difficult to track progress, and risks hindering responsible governance. As a starting point to address this gap, we present a framework for understanding system capabilities in relation to human cognitive abilities. Drawing from decades of research in psychology, neuroscience, and cognitive science, we introduce a Cognitive Taxonomy that deconstructs general intelligence into 10 key cognitive faculties. We then propose a rigorous evaluation protocol in which a system's performance is measured across a suite of targeted, held-out cognitive tasks, generating a 'cognitive profile' that can be used to understand a system's strengths and weaknesses. We hope this framework will provide a practical roadmap and an initial step toward more rigorous, empirical evaluation of AGI.

2605.28401 2026-05-28 cs.CV

EgoRelight: Egocentric Human Capture and Illumination Recovery for Relightable and Photoreal Avatar Rendering

Jianchun Chen, Yinda Zhang, Rohit Pandey, Thabo Beeler, Marc Habermann, Christian Theobalt

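AI summary: Proposes EgoRelight, a framework that extracts depth maps from the stereo down-facing cameras of a head-mounted display to drive a mesh-based avatar, uses a neural appearance model to synthesize view-dependent specular and view-independent diffuse shading separately, and recovers an HDR environment map through test-time inverse rendering. Together these enable full-body performance capture, photorealistic relightable appearance synthesis, and environment lighting estimation from a single HMD.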

Abstract

Mixed Reality (MR) headsets promise a future of immersive telepresence where virtual humans blend indistinguishably into real or virtual surroundings. Achieving this vision requires a method for capturing a user's motion, estimating appearance under novel lighting, and understanding the environment - all from the constrained viewpoint of a head-mounted display (HMD). Existing approaches treat these as isolated problems: they either focus on driving avatars with baked-in lighting or rely on studio setups for relighting. In this paper, we present EgoRelight, a holistic framework for egocentric telepresence that simultaneously captures full-body human performance, synthesizes photorealistic and relightable appearance, and estimates high dynamic range (HDR) environment maps from a single HMD. First, to ensure motion and surface reconstruction, we propose an egocentric perception module that leverages stereo down-facing cameras to extract dense depth maps, which serve as geometric control signals to drive a mesh-based avatar. Second, we introduce a novel neural appearance model that learns to synthesize view-dependent specular and view-independent diffuse shading separately. By employing a specialized ray-sampling strategy, our model generalizes to unseen illumination without relying on restrictive analytical BRDF priors. Third, we enable seamless avatar integration into the physical world via a test-time inverse rendering process, which recovers an HDR environment map by matching the pre-trained avatar's appearance to live egocentric camera observations. We demonstrate our system through a social telepresence application, where remote users are coherently relit according to their physical environment. Extensive experiments show that our components and the integrated system significantly outperform state-of-the-art baselines in geometric accuracy and rendering as well as relighting fidelity.

2605.28398 2026-05-28 cs.AI

HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

Yansong Ning, Mianpeng Liu, Jingwen Ye, Weidong Zhang, Hao Liu

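AI summary: Proposes HRBench, a unified evaluation framework that systematically studies the efficiency-effectiveness trade-offs of three switching strategy families (prompt-based selection, external routing, and speculative execution) under four training regimes in hybrid-reasoning LLMs, and shows how the preferred strategy varies with model scale and task domain.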

Comments Under review

Abstract

Hybrid-reasoning large language models (LLMs) expose explicit controls over reasoning effort, allowing users or systems to trade off answer quality against inference cost. However, existing methods for adaptive thinking-mode selection are typically evaluated under different models, datasets, and implementation assumptions, making it difficult to compare their practical behavior. We introduce HRBench, a unified evaluation framework for studying thinking-mode switching in hybrid-reasoning LLMs. HRBench organizes the design space along two axes: three switching strategy families (prompt-based selection, external routing, and speculative execution) and four training regimes (training-free, SFT, offline RL, and online RL), yielding 12 controlled evaluation settings. We evaluate these settings across 6 LLMs, from Qwen3.5-2B to Kimi-K2.5-1.1T, and 5 reasoning benchmarks covering mathematics, science, and code, while reimplementing 12+ representative prior methods within the same pipeline. Our analysis characterizes how different switching strategies occupy distinct effectiveness-efficiency trade-off regions: prompt-based methods often provide favorable token-accuracy trade-offs, routing methods offer more stable cost reduction, and speculative methods tend to improve accuracy at higher token cost. We further find that training affects strategies differently, and that the preferred strategy varies with model scale and task domain. HRBench provides reference implementations and a unified evaluation platform to support more controlled research on efficient reasoning in hybrid-reasoning LLMs. Our data, code and repository are available at https://github.com/usail-hkust/HRBench.
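
The 12 evaluation settings in the HRBench abstract come from crossing its two axes. A minimal Python sketch of that grid follows; the identifier strings are illustrative labels chosen here, not names taken from the HRBench repository.

```python
from itertools import product

# Labels follow the abstract; the exact identifier spellings are assumptions.
STRATEGIES = ["prompt_based", "external_routing", "speculative_execution"]
REGIMES = ["training_free", "sft", "offline_rl", "online_rl"]


def evaluation_settings():
    """Return every (strategy, regime) pair in the two-axis design space."""
    return list(product(STRATEGIES, REGIMES))


settings = evaluation_settings()
print(len(settings))  # 3 strategy families x 4 training regimes = 12
```

Keeping the axes as plain lists makes it straightforward to add a strategy or regime without touching the enumeration logic.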

2605.28397 2026-05-28 cs.CV

Adaptive Temporal Gating of Longitudinal Magnetic Resonance Imaging for Alzheimer's Prediction

Alireza Moayedikia, Sara Fin, Alicia Troncoso Lora, Uffe Kock Wiil

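AI summary: Proposes TAF-Net, a hybrid CNN-Transformer architecture that fuses spatiotemporal representations of longitudinal 3D MRI through adaptive temporal gating. Using only structural MRI, it achieves the best performance among the evaluated methods on MCI-to-AD conversion prediction and approaches methods that require multimodal data.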

Abstract

Predicting conversion from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) is critical for early intervention. Current deep learning paradigms predominantly rely on cross-sectional structural MRI, neglecting prognostic value in patient-specific anatomical trajectories. We introduce the Temporal Adaptive Fusion Network (TAF-Net), a hybrid CNN-Transformer architecture that models paired longitudinal 3D MRI scans. Central to TAF-Net is a Temporal Fusion Module governed by an Adaptive Temporal Gate, which learns patient-specific weightings to synthesize three spatiotemporal representations: explicit structural change, region-to-region temporal cross-attention, and bilateral feature concatenation. Evaluated on the Alzheimer's Disease Neuroimaging Initiative cohort for three-year MCI-to-AD conversion prediction, TAF-Net achieved the highest discriminative performance among all evaluated methods using only structural MRI, significantly outperforming the strongest baseline and approaching multimodal methods requiring PET, CSF, or genetic data. The architecture exhibited exceptional data efficiency, matching baseline performance with a fraction of training data. Ablation studies demonstrate that longitudinal fusion improves discrimination while reducing predictive variance by 48% compared to single-timepoint evaluation. Interpretability analyses reveal spatial attention aligned with established AD pathology in the medial temporal lobe and ventricles, while the gating mechanism prioritizes explicit volumetric change with strong positive correlation to conversion risk.
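
The TAF-Net abstract describes a gate that learns patient-specific weightings over three representations but does not give its functional form. The sketch below assumes one common choice, a softmax over gate logits followed by a weighted sum of equal-length feature vectors. The function names and toy numbers are hypothetical, and the actual TAF-Net gate may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(gate_logits, representations):
    """Blend equal-length feature vectors using softmax gate weights (assumed form)."""
    weights = softmax(gate_logits)
    dim = len(representations[0])
    fused = [
        sum(w * rep[i] for w, rep in zip(weights, representations))
        for i in range(dim)
    ]
    return weights, fused


# Hypothetical toy features for one patient.
structural_change = [1.0, 0.0]
cross_attention = [0.0, 1.0]
concatenation = [0.5, 0.5]

weights, fused = gated_fusion(
    [2.0, 0.0, 0.0],  # a larger logit favors the structural-change branch
    [structural_change, cross_attention, concatenation],
)
print(weights, fused)
```

Because the weights sum to one, the gate values can be read directly as the relative contribution of each branch for a given patient.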

2605.28396 2026-05-28 cs.LG cs.AI

ADWIN: Adaptive Windows for Horizon-Aware On-Policy Distillation

Kun Liang, Chenming Tang, Clive Bai, Weijie Liu, Saiyong Yang, Yunfang Wu

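AI summary: Proposes ADWIN, a framework that uses adaptive windows to adjust rollout length dynamically during on-policy distillation, cutting training cost by up to 4.1x while maintaining or improving accuracy.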

Abstract

On-policy distillation (OPD) transfers reasoning behavior by training a student on teacher feedback along student-generated trajectories, but standard full-rollout training ties every update to a costly completion and can over-allocate supervision to late positions with low marginal value for the current student. We revisit this assumption through the useful supervision horizon: student-induced rollouts can drift from teacher-preferred continuations, while aligned prefixes may already preserve the long-horizon OPD update direction. We propose ADWIN, an adaptive-window framework for OPD that treats rollout length as an online admissibility decision, training on short teacher-anchored prefixes while using delayed full-rollout probes to audit prefix-to-full alignment and adapt the next horizon with staleness control. Across math and code reasoning benchmarks in single-task, multi-task, and strong-to-weak settings, ADWIN improves the accuracy-compute trade-off over full-rollout OPD and prefix-based baselines, reducing end-to-end training cost by up to 4.1 times while achieving comparable or better accuracy.

2605.28394 2026-05-28 cs.CV cs.GR

Sketch2Motion: Text-driven 2D Sketch to 3D Animation via Diffusion-guided Skeleton Optimization

Gaurav Rai, Ojaswa Sharma

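AI summary: Proposes Sketch2Motion, a framework that combines a diffusion model with skeleton optimization to turn 2D sketches into 3D animations, requires no paired motion data, and supports multiple character types.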

Abstract

Animation of 2D hand-drawn sketches provides an effective medium for visual communication. However, these sketches pose challenges, particularly in handling occlusions and accurately mapping motion. While 3D animation naturally addresses these challenges, estimating 3D motion remains a very complex task. Recent approaches to converting 2D sketches to 3D animations have mainly focused on specific types of motion, such as bipedal movements and facial expressions. We propose Sketch2Motion, a diffusion-guided framework for skeleton-based motion synthesis that combines classical character animation pipelines with deep generative priors. Our method represents motion using skeletal transformations, which are propagated to mesh deformations via linear blend skinning. To guide the resulting animation toward realistic and semantically meaningful motion, we integrate a text-to-video diffusion model via motion-aware score-distillation sampling (MoSDS), enabling optimization without paired motion data. Additionally, we apply physics-inspired smoothness, topological, and contact constraints to stabilize optimization and preserve motion plausibility. Further, we integrate a spring-mass simulator to introduce secondary motion effects. The proposed framework is generalized, fully differentiable, modular, and compatible with biped, quadruped, and non-living articulated characters. Experiments demonstrate that our approach produces temporally coherent, text-aligned animations that outperform baseline motion transfer methods that lack generative priors or explicit physical constraints. We will make our code and dataset publicly available.
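
Linear blend skinning, which the Sketch2Motion abstract uses to propagate skeletal transformations to the mesh, deforms each vertex as a weighted sum of that vertex transformed by every influencing bone. The sketch below shows the standard formula in 2D for brevity. The helper names are hypothetical and nothing here is taken from the authors' code.

```python
import math


def rigid_transform_2d(angle, tx, ty):
    """Return a function that rotates a 2D point by `angle`, then translates it."""
    c, s = math.cos(angle), math.sin(angle)

    def apply(point):
        x, y = point
        return (c * x - s * y + tx, s * x + c * y + ty)

    return apply


def linear_blend_skinning(vertex, bone_transforms, skin_weights):
    """Deform one vertex as the weight-blended result of each bone's transform."""
    x_out, y_out = 0.0, 0.0
    for transform, weight in zip(bone_transforms, skin_weights):
        x, y = transform(vertex)
        x_out += weight * x
        y_out += weight * y
    return (x_out, y_out)


# Bone 0 stays in place; bone 1 shifts two units along x.
bones = [rigid_transform_2d(0.0, 0.0, 0.0), rigid_transform_2d(0.0, 2.0, 0.0)]
print(linear_blend_skinning((1.0, 1.0), bones, [0.5, 0.5]))  # (2.0, 1.0)
```

A vertex weighted equally between the two bones moves halfway, which is the smooth falloff that makes skinning weights useful near joints.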

2605.28392 2026-05-28 cs.CV

Bound-Constrained Sparse Representation for Electrical Impedance Tomography

Chun Zhang, Dong Liu

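AI summary: Proposes a bound-constrained sparse representation framework that generates conductivity from low-dimensional latent variables via an implicit composite parameterization, improving conductivity estimation in electrical impedance tomography without explicit regularization.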

Abstract

This study proposes a bound-constrained sparse representation (BC-SR) framework for electrical impedance tomography (EIT), aimed at improving conductivity estimation without explicit regularization. BC-SR adopts a representation-driven strategy, generating conductivity from low-dimensional latent variables via an implicit composite parameterization. Structural priors are embedded using a truncated graph-Laplacian basis, while a bound-preserving nonlinear mapping enforces admissible conductivity ranges and improves conditioning through implicit gradient modulation. The approach ensures robust convergence, even under noisy or incomplete data. Extensive validation on 2D/3D simulations, tank experiments, and in-vivo lung data shows that BC-SR improves physical consistency and structural fidelity, offering enhanced robustness compared to traditional methods. Additionally, BC-SR enables 3D time-difference EIT reconstruction, offering improved spatial resolution and a more coherent representation of 3D conductivity distributions, particularly for in-vivo lung data. This suggests potential for improved performance in EIT, particularly in clinical applications for respiratory monitoring.
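
The BC-SR abstract does not state which bound-preserving mapping it uses. As an illustration of the general idea, the sketch below assumes a scaled logistic function applied to a low-dimensional basis expansion. The basis layout, function name, and bound values are hypothetical stand-ins, not the paper's formulation.

```python
import math


def bounded_conductivity(latent, basis, sigma_min, sigma_max):
    """Map latent coefficients to per-element conductivities within the bounds.

    basis[i][k] holds the value of basis vector k at mesh element i
    (a hypothetical layout standing in for a truncated graph-Laplacian basis).
    """
    conductivities = []
    for row in basis:
        u = sum(b * z for b, z in zip(row, latent))  # low-dimensional expansion
        s = 0.5 * (1.0 + math.tanh(0.5 * u))         # logistic function, overflow-safe form
        conductivities.append(sigma_min + (sigma_max - sigma_min) * s)
    return conductivities


basis = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(bounded_conductivity([0.0, 3.0], basis, sigma_min=0.1, sigma_max=2.0))
```

Under this assumed mapping, every output stays inside the admissible range (up to floating-point rounding) for any latent values. The logistic slope also shrinks gradients near the bounds, which is one plausible reading of the "implicit gradient modulation" the abstract mentions.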

2605.28390 2026-05-28 cs.AI

You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

Xujun Li, Kehan Zheng, Mingyuan Zhao, Yize Geng, Jinfeng Zhou, Qi Zhu, Fei Mi, Lifeng Shang, Minlie Huang, Hongning Wang

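AI summary: Proposes HiSME, a lightweight hierarchical skill meta-evolving method that learns meta-skills from agents' task execution traces and jointly optimizes skills and the skill evolving strategy, so that deployed agentic systems keep improving across different downstream scenarios.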

Abstract

Test-time skill evolving is regarded as a new paradigm for enhancing deployed agentic systems. Existing works mainly focus on hard-coded skill evolving strategies or parametric learning that rely on expensive parameter updates in the underlying LLMs. In this paper, we demonstrate that test-time refinement of the skill evolving framework itself is necessary for continuous improvement of the agent systems in different downstream scenarios, and lightweight algorithmic adaptation is feasible. Specifically, we propose HiSME, a lightweight hierarchical skill meta-evolving solution that jointly optimizes skills and the skill evolving strategy by learning meta-skills from agents' task execution traces. Experiments on diverse agentic benchmarks show that meta-evolving can produce a higher-quality skill library than pure skill evolving and can derive diverse meta-skills for different scenarios, thereby facilitating future continual experience learning. Our code is temporarily public at https://anonymous.4open.science/r/HiSME-BD45.

2605.28389 2026-05-28 cs.CL

FABSVer: Faster Training and Better Self-Verification for LLM Mathematical Reasoning

Haihui Pan, Junwei Bao, Hongfei Jiang, Yang Song

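AI summary: Proposes FABSVer, which fuses solution generation and self-verification into a single generation pass and introduces Dynamic Reference Model Update (DRMU) to break through the reward plateau. It achieves better self-verification and reasoning performance across three model scales with only 51%-71% of the training time of existing methods.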

Abstract

While large language models have made significant progress in mathematical reasoning, they remain unreliable at judging the correctness of their own solutions. Existing approaches that equip models with self-verification typically treat solution generation and verification as two separate tasks, leading to substantially increased training time. In this paper, we propose FABSVer, which fuses these two tasks into a single generation pass, dramatically reducing training overhead while jointly optimizing both capabilities. We further identify a convergence bottleneck both theoretically and empirically: as training progresses, the reward reaches a plateau because the policy is constrained by a fixed reference model. To overcome this, we introduce Dynamic Reference Model Update (DRMU), which raises the reward ceiling and enables sustained reward growth. Extensive experiments on math benchmarks demonstrate that FABSVer achieves superior self-verification and reasoning performance across three model scales, while requiring only 51%-71% of the training time of existing methods. Analysis further reveals distinct learning phases in how models acquire self-verification, and that the gap between verify and answer rewards shrinks noticeably as model size increases.

2605.28388 2026-05-28 cs.AI

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Yue Cheng, Jiajun Zhang, Xiaohui Gao, Weiwei Xing, Zheng Wang, Zhanxing Zhu

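AI summary: Using difficulty-wise and one-sample analyses, this paper finds that sample difficulty has a non-monotonic effect on RLVR: easy and medium-difficulty problems yield the strongest and most stable reasoning improvements. It proposes difficulty-adaptive strategies based on these findings.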

Comments 30 pages, 11 figures

Abstract

Reinforcement Learning with Verifiable Reward (RLVR) is empirically shown to notably enhance the reasoning performance of large language models (LLMs), particularly in mathematics and programming. However, the mechanistic role of sample difficulty in RLVR remains poorly understood. In this paper, we investigate RLVR through the lens of difficulty-wise and one-sample analysis. We find that sample difficulty has a non-monotonic effect on RLVR: easy and medium-difficulty problems yield the strongest and most stable reasoning improvements, whereas overly hard problems often provide weak learning signals, induce degenerate behaviors such as answer repetition or skipping necessary computation, and can ultimately degrade the model's pre-existing capabilities. Beyond response-level behavior, we further analyze the model's internal feature dynamics using Temporal Sparse Autoencoders (T-SAE). Easy problems mainly reinforce direct-answer and basic-computation features while suppressing deliberative-reasoning features; hard problems activate reasoning-related features but become useful only when successful trajectories are sampled; medium-difficulty problems provide a more balanced signal, strengthening both computation and multi-step reasoning features. Motivated by these findings, we propose difficulty-adaptive strategies for hard-sample utilization, using backward-reasoning reformulation and T-SAE-guided training signals to improve reward density and credit assignment during RLVR. Overall, our results identify sample difficulty as a key factor governing both the optimization dynamics and representation evolution of RLVR.

2605.28387 2026-05-28 cs.LG cs.AI cs.NE

CLANE: Continual Learning of Actions on Neuromorphic Hardware from Event Cameras

Elvin Hajizada, Michael Neumeier, Edward Paxon Frady, Yulia Sandamirskaya, Axel von Arnim, Bing Li, Eyke Hüllermeier

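AI summary: Proposes CLANE, a system that performs end-to-end continual learning for event-camera action recognition on the Intel Loihi 2 neuromorphic chip, achieving high energy efficiency and low latency with a spiking CNN and novel Loihi 2 modules.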

Abstract

Recognizing and continuously learning novel human actions without forgetting prior classes is a requirement for emerging AR/VR and robotics applications. For these applications, both on-device processing and learning are essential for privacy and low-latency adaptation. Event cameras address the efficiency of visual sensing with sparse, asynchronous output that is naturally compatible with neuromorphic processing. Yet no prior system has deployed a continual on-device learning pipeline for event-based action recognition using neuromorphic hardware. We present CLANE, Continual Learning of Actions on Neuromorphic Hardware from Event Cameras, deployed end-to-end on Intel Loihi 2. CLANE combines a spiking 2D CNN for spatiotemporal feature extraction with CLP-SNN as its on-chip learning head, extended to action clips via a Temporal Aggregation Layer and a fixed-point Normalization Layer, both novel Loihi 2 modules. On THU E-ACT-50, a 50-class dataset captured under real-world conditions, CLANE achieves 70.4% accuracy in a continual learning task while delivering more than 100x energy reduction and 16x lower latency over a sequential CNN+GRU+CLP edge GPU baseline, validated through iso-algorithm cross-platform benchmarking across three evaluation levels.

2605.28384 2026-05-28 cs.LG

Meta-Attention: Bayesian Per-Token Routing for Efficient Transformer Inference

Alan Ferrari

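AI summary: Proposes Meta-Attention, a framework in which a Bayesian Meta-Controller dynamically selects an attention strategy for each token (full softmax, linear, or sliding-window local attention), yielding a better compute-performance trade-off at negligible overhead.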

Abstract

Standard transformer architectures apply a single attention mechanism uniformly across all tokens and sequence positions, irrespective of local context or computational budget. We propose Meta-Attention, a framework that dynamically routes each token to the most appropriate attention strategy (full softmax attention, linear (kernel) attention, or sliding-window local attention) via a Bayesian Meta-Controller. Unlike prior routing approaches that use deterministic or prior-free learned routing, the Meta-Controller treats per-token mechanism selection as posterior inference under a compute-aware Dirichlet prior: routing weights are the output of an amortised variational posterior q(alpha | x_t; phi) trained with an Evidence Lower Bound (ELBO) objective that jointly encodes task performance and attention-mechanism cost. This design produces principled routing uncertainty estimates that govern the soft-to-hard routing transition, mitigates routing collapse without ad hoc load-balancing losses, and yields better compute-performance trade-offs than deterministic or prior-free learned routing at negligible overhead. Phase 1 empirical results on a Tiny LM benchmark confirm core predictions: the Bayesian controller's learned routing distribution implies a projected normalised FLOP cost of 25.1% under hard routing, vs. 59.3% for the prior-free baseline (-34.2 pp), and reduces routing entropy from 55.8% to 43.3% (-12.5 pp), demonstrating that the Dirichlet prior prevents routing collapse while the non-Bayesian model defaults to full attention. We present the Bayesian architecture, ELBO training objective, and a Phase 1 PyTorch prototype validating forward-pass correctness, posterior diversity, and a controlled ablation against a prior-free baseline. Code available at: https://github.com/KFEAL/meta-attention
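
The Meta-Attention abstract reports a projected normalised FLOP cost under hard routing and a routing entropy. The sketch below shows one plausible way to compute both from per-token routing weights; it is not the paper's code. The relative costs are made-up placeholders rather than figures from the paper, and treating the reported entropy as Shannon entropy normalised by log K is an assumption.

```python
import math

# Hypothetical relative FLOP costs, normalised so full attention costs 1.0.
RELATIVE_COST = {"full": 1.0, "linear": 0.1, "local": 0.2}
MECHANISMS = list(RELATIVE_COST)


def hard_route(weights):
    """Pick the mechanism with the largest routing weight for one token."""
    best = max(range(len(weights)), key=lambda k: weights[k])
    return MECHANISMS[best]


def projected_cost(token_weights):
    """Mean relative cost if each token used only its hard-routed mechanism."""
    costs = [RELATIVE_COST[hard_route(w)] for w in token_weights]
    return sum(costs) / len(costs)


def normalised_entropy(weights):
    """Shannon entropy of one routing distribution divided by its maximum, log K."""
    h = -sum(p * math.log(p) for p in weights if p > 0)
    return h / math.log(len(weights))


# Toy routing weights for four tokens, ordered as (full, linear, local).
tokens = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.1, 0.6, 0.3]]
print(projected_cost(tokens))
print(sum(normalised_entropy(w) for w in tokens) / len(tokens))
```

A controller that sends every token to full attention would score a projected cost of 1.0 here, which is why a lower figure indicates that cheaper mechanisms are actually being selected.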

2605.28375 2026-05-28 cs.CL

PrionNER: A Named Entity Recognition Dataset for Prion Disease Biomedical Literature

An Dao, Nhan Ly, Thao Tran, Yuji Matsumoto, Akiko Aizawa

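AI summary: Builds PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information, comprising 317 abstracts with 15 coarse-grained and 31 fine-grained entity types, and evaluates supervised and zero-shot models on it.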

Comments 29 pages, 5 figures, accepted at ACL 25th Workshop on Biomedical Language Processing (BioNLP 2026)

Abstract

Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at https://github.com/daotuanan/PrionNER/.
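
Exact-match F1, the agreement measure quoted in the PrionNER abstract, counts two annotations as agreeing only when document, span boundaries, and entity type all coincide. The sketch below shows that computation for two annotators. The tuple layout and the example labels are hypothetical, and the authors' evaluation script may handle edge cases differently.

```python
def exact_match_f1(annotator_a, annotator_b):
    """F1 between two span sets; a span matches only if every field is identical.

    Each span is a (doc_id, start, end, entity_type) tuple (assumed layout).
    Annotator A is treated as the reference; F1 is symmetric, so the choice
    does not change the score.
    """
    set_a, set_b = set(annotator_a), set(annotator_b)
    if not set_a and not set_b:
        return 1.0  # nothing to annotate, trivially in agreement
    matched = len(set_a & set_b)
    if matched == 0:
        return 0.0
    precision = matched / len(set_b)
    recall = matched / len(set_a)
    return 2 * precision * recall / (precision + recall)


a = [("doc1", 0, 5, "Disease"), ("doc1", 10, 18, "Symptom"), ("doc1", 30, 34, "Anatomy")]
b = [("doc1", 0, 5, "Disease"), ("doc1", 10, 18, "Finding"), ("doc1", 30, 34, "Anatomy")]
print(exact_match_f1(a, b))  # 2 of 3 spans agree exactly
```

The second span illustrates why fine-grained, clinically adjacent labels are hard: identical boundaries with a different type count as a full disagreement.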

2605.28372 2026-05-28 cs.LG cs.RO

Teacher-Student Representational Alignment for Reinforcement Learning-Driven Imitation Learning

Meraj Mammadov, Pedro Zuidberg Dos Martires, Johannes Andreas Stork

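AI summary: Proposes building a shared embedding space with self-supervised contrastive learning to narrow the imitation gap between teacher and student policies and thereby improve student performance.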

Comments 6 pages, 5 figures. Accepted as an oral presentation at the RL4IL Workshop at ICRA 2026

Abstract

Imitation learning (IL) from a state-based reinforcement learning (RL) policy is a common approach to overcome the curse of dimensionality in complex and high-dimensional observation spaces prevalent in robotics. This paper addresses the irreducible imitation gap that emerges when teacher and student are learned in isolation, and the teacher policy has the liberty to rely on privileged state information that the student cannot infer from its observations. Instead of improving poor student performance with RL finetuning after IL, which often requires a whole new training setup, we propose a novel algorithm which learns a shared embedding space that hides agent-specific observations and thus trains imitable teacher policies by construction. We train the shared embedding space with self-supervised contrastive learning in parallel to the teacher policy and prevent it from extracting private information by limiting its gradients from updating the encoder networks. We perform evaluations on several example domains and compare to state-of-the-art baselines showing that our algorithm enables higher student performance with substantially reduced imitation gap.

2605.28371 2026-05-28 cs.AI cs.LG cs.SE

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt, Lev Telyatnikov, Olga Fink

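AI summary Proposes an agentic approach built on a shared framework that uses a slot-binding interface to turn papers into executable, comparable benchmark implementations, addressing the difficulty of reproducing methods in industrial Prognostics and Health Management (PHM).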

详情
AI中文摘要

工业预测与健康管理(PHM)为应用机器学习中的更广泛挑战提供了一个代表性案例研究:将已发表的论文转化为可执行、可基准测试的实现。由于工业数据集的访问受限、预处理和评估协议的报告不完整以及隐含的设计选择(例如,窗口化、目标构建、数据分割)对性能有重要影响,复现PHM中的欠规范方法尤为困难。现有的论文到代码系统为单篇论文生成实现,但由于假设和评估设置的不一致性,这些产物通常无法直接比较。我们引入了基于智能体和框架的PHM论文复现方法,其中智能体通过槽绑定接口将论文转化为共享的PHM基准测试框架。该接口将方程和协议描述映射为结构化组件(任务定义、数据集适配器、窗口化、目标、模型和评估器),同时明确记录未解决的假设。最终实现通过标准化任务契约和评估钩子进行验证,从而实现一致且可比较的基准测试。我们在16篇PHM论文上评估了该方法,比较了框架增强型、基于技能和基于提示的智能体复现与最近的无框架论文复现智能体。我们评估了复现成功率、基于模型的代码评估、论文假设的框架绑定以及标准化协议下的跨论文基准可比性。结果表明,将智能体生成与共享框架相结合,将论文复现从孤立的代码合成转变为可执行、假设感知且系统可比较的基准测试实现。

英文摘要

Industrial Prognostics and Health Management (PHM) provides a representative case study for a broader challenge in applied machine learning: translating published papers into executable, benchmark-ready implementations. Reproducing under-specified methods in PHM is particularly difficult due to restricted access to industrial datasets, incomplete reporting of preprocessing and evaluation protocols, and implicit design choices (e.g., windowing, target construction, data splits) that critically affect performance. Existing paper-to-code systems generate implementations for individual papers, but these artifacts are often not directly comparable due to inconsistencies in assumptions and evaluation settings. We introduce agentic, framework-based PHM paper reproduction, where an agent translates a paper into a shared PHM benchmark framework via a slot-binding interface. This interface maps equations and protocol descriptions into structured components (task definitions, dataset adapters, windowing, targets, models, and evaluators), while explicitly recording unresolved assumptions. The resulting implementations are validated against standardized task contracts and evaluation hooks, enabling consistent and comparable benchmarking. We evaluate this approach on 16 PHM papers, comparing framework-enhanced, skill-based, and prompt-based agentic reproduction against a recent framework-free paper-reproduction agent. We assess reproduction success, model-based code evaluation, framework binding of paper assumptions, and cross-paper benchmark comparability under standardized protocols. Our results show that coupling agentic generation with a shared framework transforms paper reproduction from isolated code synthesis into executable, assumption-aware, and systematically comparable benchmark implementations.

2605.28369 2026-05-28 cs.AI cs.SI

CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

CyberJurors:电商纠纷裁决的多智能体模拟任务

Yanhui Sun, Wu Liu, Haifeng Ming, Xinru Wang, Hantao Yao, Yongdong Zhang

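AI summary E-commerce dispute verdicts require extracting pivotal clues from redundant, multi-round, multimodal evidence and deciding under platform-specific conventions. The paper proposes CyberJurors, a multi-agent framework that improves verdict quality through Individual Verdict Chain-of-Thought and a collective Jury Consensus Verdict, and that outperforms existing methods on a benchmark of 6,000 real-world cases.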

Comments ICML 2026

详情
AI中文摘要

电商平台开始招募众包陪审员来裁决大量交易纠纷。与正式法律判决不同,电商纠纷裁决需要从冗余、多轮、多模态证据中提取关键线索,并在平台特定的灵活惯例下做出决策。这些特点使得现有方法不足以应对该场景。为弥补这一差距,我们引入了一项开创性任务——电商纠纷裁决(EDV),并提出了VerdictBench,一个包含6000个真实案例的多模态基准,旨在反映众包陪审团决策。在此基础上,我们提出了CyberJurors,一个多智能体框架,用于澄清纠纷逻辑并规范裁决过程。在个体层面,个体裁决链式思维将EDV任务分解为四个结构化的推理阶段,实现细粒度线索感知并澄清关键线索与纠纷焦点之间的因果逻辑。在集体层面,陪审共识裁决模拟陪审员之间的多轮讨论和投票,同时纳入裁决先例以减轻对任一争议方的认知偏差。在VerdictBench上的实验表明,CyberJurors优于最先进的LLM、MLLM和法庭模拟器,同时与真实陪审团投票模式实现了更强的一致性。代码和数据集可在https://github.com/YanhuiS/CyberJurors 和 https://huggingface.co/datasets/piggi/VerdictBench 获取。

英文摘要

E-commerce platforms have begun recruiting crowdsourced jurors to adjudicate massive volumes of transaction disputes. Unlike formal legal judgment, E-commerce dispute verdicts require grounding pivotal clues from redundant, multi-round, multimodal evidence and making decisions under flexible platform-specific conventions. These characteristics render existing methods insufficient for this scenario. To bridge this gap, we introduce a pioneering task, E-commerce Dispute Verdicts (EDV), and present VerdictBench, a multimodal benchmark comprising 6,000 real-world cases designed to reflect crowdsourced jury decisions. Building upon this, we propose CyberJurors, a multi-agent framework to clarify the dispute logic and regulate the verdict process. At the individual level, Individual Verdict Chain-of-Thought decomposes the EDV task into four structured reasoning stages, enabling fine-grained clue perception and clarifying causal logic between pivotal clues and the dispute focus. At the collective level, Jury Consensus Verdict simulates multi-round discussion and voting among jurors, while incorporating verdict precedents to mitigate cognitive biases toward either disputant. Experiments on VerdictBench show that CyberJurors outperforms state-of-the-art LLMs, MLLMs, and court simulators, while achieving stronger alignment with real-world jury voting patterns. Code and dataset are available at https://github.com/YanhuiS/CyberJurors and https://huggingface.co/datasets/piggi/VerdictBench.

2605.28365 2026-05-28 cs.AI cs.CL cs.LO

Risk-Controlled Lean-as-Judge for Natural-Language Mathematical Reasoning

风险控制的 Lean 作为自然语言数学推理的评判者

Pauline Bourigault, Xiaotong Ji, Matthieu Zimmer, Rasul Tutunov, Haitham Bou Ammar

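AI summary To address the sparse and often unfaithful signal that Lean provides when judging natural-language mathematical answers, proposes the COVCAL selector, which uses finite-sample selective risk control to guarantee the accuracy of accepted answers when autoformalization coverage is high enough.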

详情
AI中文摘要

Lean 越来越多地被用于评判自然语言数学答案,但其信号是不完全的:许多答案从未被形式化,而一个失败的证明可能反映类型错误或缺少库事实,而非答案错误。在 MATH-500 上,我们表明该信号 (i) 严重依赖于覆盖率,即在证明覆盖率高的答案中正确率为 96%,但在覆盖率低时为 20%,以及 (ii) 稀疏且常常不忠实:一个 7B 自动形式化器仅对 28% 的问题证明了某个类别,而人工审计发现其中只有约 43% 的证明是忠实的。我们提出 COVCAL,一个基于 Lean 跟踪诊断的选择器,它在两种机制(保守的 Bonferroni 界和更紧的 dev-then-cal 规则)下,对接受的答案认证有限样本选择性风险界,否则弃权。可行性取决于自动形式化覆盖率:对于 7B 形式化器,信号过于稀疏,Bonferroni 在所有 20 个自助法分区上弃权,而一个专用于证明器的形式化器达到 79% 的覆盖率,并在 20 个分区中的 17 个上使其可行,以 0.98 的接受准确率接受约 48% 的问题。由于自一致性本身已达到 91% 的准确率,我们的贡献是精确描述了何时以及使用哪个形式化器,部分形式化信号可以在风险控制下被信任。

英文摘要

Lean is increasingly used to judge natural-language mathematical answers, but its signal is partial: many answers never formalize, and a failed proof may reflect an ill-typed statement or a missing library fact, not a wrong answer. On MATH-500 we show this signal is (i) sharply coverage-dependent, that is, the proof-winning answer is correct 96% of the time at high proved coverage but only 20% at low coverage, and (ii) sparse and often unfaithful: a 7B autoformalizer proves a class for only 28% of problems, and a manual audit finds only approximately 43% of those proofs faithful. We propose COVCAL, a selector over Lean-trace diagnostics that certifies a finite-sample selective-risk bound on accepted answers or abstains, under two regimes (a conservative Bonferroni bound and a tighter dev-then-cal rule). Feasibility depends on autoformalization coverage: with the 7B formalizer the signal is too sparse and Bonferroni abstains on all 20 bootstrap partitions, whereas a prover-specialized formalizer reaches 79% coverage and flips it to feasible on 17 of 20, accepting approximately 48% of problems at 0.98 accepted accuracy. Since self-consistency alone is already 91% accurate, our contribution is a precise account of when, and with which formalizer, a partial formal signal can be trusted under risk control.
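To make the "certify or abstain" idea concrete, here is a minimal sketch of a Bonferroni-corrected selective-risk rule. It is not COVCAL itself: the Hoeffding bound, the scalar confidence score, and the names `hoeffding_upper` and `select_threshold` are assumptions chosen for illustration, since the abstract does not specify which concentration bound or diagnostics the paper uses.

```python
import math


def hoeffding_upper(errors, n, delta):
    """Upper confidence bound on the true error rate (holds w.p. >= 1 - delta)."""
    return errors / n + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def select_threshold(calib, thresholds, alpha, delta):
    """Pick an acceptance threshold whose certified risk is <= alpha, or abstain.

    calib: list of (score, is_correct) pairs from a held-out calibration set.
    Returns the smallest feasible threshold (largest coverage), or None.
    """
    per_test_delta = delta / len(thresholds)  # Bonferroni correction
    feasible = []
    for t in thresholds:
        accepted = [ok for score, ok in calib if score >= t]
        if not accepted:
            continue  # nothing accepted at this threshold, nothing to certify
        errors = sum(1 for ok in accepted if not ok)
        if hoeffding_upper(errors, len(accepted), per_test_delta) <= alpha:
            feasible.append(t)
    return min(feasible) if feasible else None


dense = [(0.95, True)] * 300 + [(0.2, False)] * 100
sparse = [(0.95, True)] * 10
print(select_threshold(dense, [0.1, 0.5, 0.9], alpha=0.1, delta=0.1))
print(select_threshold(sparse, [0.1, 0.5, 0.9], alpha=0.1, delta=0.1))
```

With only ten calibration points the bound cannot drop below the target risk, so the rule abstains. This mirrors, in toy form, the abstract's observation that a sparse signal makes the conservative regime infeasible.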

2605.28363 2026-05-28 cs.CL

PubMedCausal: A Span-Level Annotated Corpus for Causal Relation Extraction in Biomedical Text

PubMedCausal: 用于生物医学文本中因果关系抽取的跨度级标注语料库

Ifeoluwa Kunle-John, Josiah Paul, Oluwatosin Agbaakin, Peter Aina, Ikenna Odezuligbo, Sydney Anuyah

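AI summary Existing resources conflate causal relations with broader associations, restrict annotation to the sentence level, or focus only on explicit causal cues. To address this, the paper builds PubMedCausal, a span-level causal relation extraction corpus from PubMed abstracts with 30,000 paragraph-level rows, 3,945 causal rows, and 6,491 adjudicated cause-effect pairs, and benchmarks discriminative encoders and open-source generative models.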

Comments Submitted to EMNLP 2026, 8 Pages, 23 page appendix

详情
AI中文摘要

因果关系抽取(CRE)是生物医学文本挖掘的核心,但当前资源常将因果关系与更广泛的关联混淆,将标注限制在句子级别,或主要关注显式因果线索。这限制了它们在评估模型是否能恢复生物医学文本中实际表达的因果主张方面的实用性。我们引入了PubMedCausal,一个基于PubMed摘要构建的生物医学CRE跨度级标注语料库。该语料库包含30,000个段落级行,包括3,945个因果行和6,491个经裁决的因果对。每个因果关系都标注了全文的原因和结果跨度、因果类型以及句子性,从而支持因果检测和全跨度因果抽取的评估。我们在检测和抽取设置下对判别式编码器和开源生成模型进行了基准测试。对于因果检测,生物医学编码器表现最强,PubMedBERT达到F$_1$分数0.7391。对于跨度级抽取,最佳生成基线是DeepSeek-R1-32B配合少样本提示,达到余弦对F$_1$分数0.6765。我们进一步通过评估在PubMedCausal上训练的编码器在外部因果关系数据集上的表现来测试迁移学习,表明该资源支持跨数据集评估。我们的结果表明,在类别不平衡、长因果跨度、隐式因果关系、跨句关系以及提示敏感性下,生物医学CRE仍然困难。代码和数据可在此处找到:https://github.com/josiahpaul07/PubMedCausal_Exp

英文摘要

Causal relation extraction (CRE) is central to biomedical text mining, but current resources often conflate causal relations with broader associations, restrict annotation to sentence-level examples, or focus mainly on explicit causal cues. This limits their usefulness for evaluating whether models can recover causal claims as they are actually expressed in biomedical text. We introduce PubMedCausal, a span-level annotated corpus for biomedical CRE built from PubMed abstracts. The corpus contains 30,000 paragraph-level rows, including 3,945 causal rows and 6,491 adjudicated cause--effect pairs. Each causal relation is annotated with full-text cause and effect spans, causality type, and sententiality, enabling evaluation of both causal detection and full-span causal extraction. We benchmark discriminative encoders and open-source generative models across detection and extraction settings. For causal detection, biomedical encoders are strongest, with PubMedBERT reaching an F$_1$ score of 0.7391. For span-level extraction, the best generative baseline is DeepSeek-R1-32B with few-shot prompting, reaching a Cosine Pair F$_1$ of 0.6765. We further test transfer learning by evaluating PubMedCausal-trained encoders on external causal relation datasets, showing that the resource supports cross-dataset evaluation. Our results show that biomedical CRE remains difficult under class imbalance, long causal spans, implicit causality, inter-sentential relations, and prompt sensitivity. Code and Data can be found here: https://github.com/josiahpaul07/PubMedCausal_Exp

2605.28362 2026-05-28 cs.RO

Accelerating Robot Path Planning via Connectivity-Preserving Region Proposal Network

加速机器人路径规划的连通性保持区域提议网络

Zhanzheng Ma, Cancan Zhao, Shuai Zhang, Bo Ouyang

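AI summary Proposes the Connectivity-Preserving Region Proposal Network (CP-RPN), whose segmentation model predicts compact, topologically connected candidate regions to compress the search space, and combines a Voronoi diagram with a local A* fallback to achieve low-latency path planning with a high success rate.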

详情
AI中文摘要

移动机器人路径规划方法常受限于巨大的搜索空间,导致基于采样的算法存在延迟。基于学习的方法经常遭受局部区域碎片化和全局拓扑不一致性的困扰。为解决这一问题,我们提出了连通性保持区域提议网络(CP-RPN),一种分割引导模型,旨在预测紧凑且拓扑连通的候选区域,显著压缩搜索空间。具体来说,我们设计了一个分割模型,利用可变形注意力变换器(DAT)捕获长距离依赖以实现全局连通性,并采用反卷积解码器保留细粒度空间细节。为保证预测掩膜的连通性,我们设计了一个复合损失函数,结合交叉熵损失进行逐像素监督、连通性感知损失增强局部一致性,以及基于持续同调的拓扑连续性损失强制全局连通性。在这些高连通性走廊状区域的基础上,使用Voronoi图规划路径,并辅以局部A*回退机制确保鲁棒性。实验结果表明,与MPT基线相比,CP-RPN将候选区域大小减少了超过60.13%,实现了确定性低延迟规划(平均0.11秒),成功率达99.60%,在稳定性上优于传统的基于采样的算法。

英文摘要

Mobile robot path planning methods are often constrained by vast search spaces, resulting in latency in sampling-based algorithms. Learning-based approaches frequently suffer from local region fragmentation and global topological inconsistency. To tackle the problem, we present the Connectivity-Preserving Region Proposal Network (CP-RPN), a segmentation-guided model designed to predict compact and topologically connected candidate regions, significantly compressing the search space. Specifically, we design a segmentation model that leverages a Deformable Attention Transformer (DAT) to capture long-range dependencies for global connectivity, with a Deconvolutional decoder to preserve fine-grained spatial details. To guarantee the connectivity of the predicted mask, we design a composite loss function that combines Cross-Entropy loss for pixelwise supervision, a Connectivity-Aware loss to enhance local coherence, and a Topological Continuity loss based on persistent homology to enforce global connectivity. Building on these high-connectivity corridor-like regions, the Voronoi diagram is used to plan the path, backed by a local A* fallback mechanism to ensure robustness. Experimental results demonstrate that CP-RPN reduces the candidate region size by over 60.13% compared to the MPT baseline and achieves deterministic low-latency planning (avg. 0.11s) with a 99.60% success rate, outperforming traditional sampling-based algorithms in stability.

2605.28360 2026-05-28 cs.AI

Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

提示码本:面向语言模型指令精炼的离散组合优化

Jyotirmoy Nath, Neeraj Kumar, Brejesh Lall

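AI summary Proposes the Prompt Codebooks (PCO) framework, which recasts automatic prompt optimization as discrete compositional learning, uses reusable natural-language instinct units to enable per-instance routing and structured feedback, and improves performance on multiple benchmarks while shortening prompts.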

详情
AI中文摘要

自动提示优化(APO)显著提升了基于LLM的智能体工作流。然而,现有方法将每个任务的提示视为一个整体、实例无关的字符串,通过全局编辑进行优化,导致更新脆弱且无法复用学到的子行为。我们提出提示码本(PCO),一种新颖的组合式提示优化框架,将APO重构为在有限自然语言本能(原子、可重用的指令单元)词汇表上的离散学习。PCO将提示构建知识组织在离散码本中,通过基于LLM的编码器将每个输入路由到少量条目;生成器将它们组合成冻结目标模型的提示;评论器输出结构化判决,通过归因分解为每个变量的文本梯度,在语言值极小极大目标下联合训练编码器、生成器和码本。得到的路由是实例级的:同一任务的不同输入接收不同的本能组合,这种机制在实例无关方法下结构上无法表达。在Qwen3-8B和LLaMA-3.1-8B上的六个基准测试中,PCO相比零样本提升高达+30.36分,在HotpotQA上超越最强先前基线(GEPA)达+3.34分,总体平均提升+1.11分,并且仅使用K=16个本能即可将部署提示长度相比MIPROv2压缩最多14.1倍,相比GEPA压缩3.0倍。

英文摘要

Automatic prompt optimization (APO) has driven significant gains in LLM-based agentic workflows. However, existing methods treat each task's prompt as a monolithic, instance-blind string optimized through global edits, producing brittle updates and preventing the reuse of learned sub-behaviors. We propose Prompt Codebooks (PCO), a novel compositional prompt optimization framework that recasts APO as discrete learning over a finite vocabulary of natural-language instincts - atomic, reusable instruction units. PCO organizes prompt-construction knowledge in a discrete codebook and routes each input to a small subset of entries via an LLM-based encoder; a generator composes them into a prompt for the frozen target model; a critic emits a structured verdict that decomposes by attribution into per-variable textual gradients, jointly training the encoder, generator, and codebook under a language-valued min-max objective. The resulting routing is per-instance: different inputs in the same task receive different instinct compositions, a regime structurally inexpressible under instance-blind methods. Across six benchmarks on Qwen3-8B and LLaMA-3.1-8B, PCO improves over zero-shot by up to +30.36 points, surpasses the strongest prior baseline (GEPA) by +3.34 on HotpotQA and +1.11 in aggregate, and reduces deployed prompt length by up to 14.1x versus MIPROv2 and 3.0x versus GEPA using only K=16 instincts.

2605.28359 2026-05-28 cs.AI q-fin.TR

From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

从知道到做到:面向LLM股票市场交易智能体的记忆控制基准

Taojie Zhu, Wentao Zhao, Rui Sun, Beidi Luan, Jiacheng Lu, Sinuo Wang, Jing Li, Daxin Jiang, Yonghong He, Zuo Bai

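AI summary To address knowledge leakage and return attribution in the evaluation of LLM trading agents, proposes the KTD-Fin benchmark, which uses data masking and a Barra-style attribution framework to separate market memory from investment decisions, and shows that returns come mainly from passive market exposure rather than stock-selection ability.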

详情
AI中文摘要

评估大语言模型(LLM)智能体能否在资本市场盈利,越来越被框架化为端到端交易:将智能体置于历史市场中,让其交易,并衡量投资组合收益。这种设置容易导致两种评估失败。首先,长时间的回测往往与前沿LLM的知识截止日期重叠,使得记忆的股票代码、日期、价格和市场叙事替代了投资推理。其次,原始收益是选股能力的一个嘈杂代理,因为正收益可能来自市场贝塔、风格暴露或有利的市场环境,而非真正的阿尔法。我们引入了KTD-Fin(知道-做到金融基准),一个端到端的股票市场交易基准,解决了这两个问题。KTD-Fin使用数据侧掩码协议,在提示和工具中一致地匿名化关键标识符和日历信息,将历史市场记忆与投资决策分离。它还整合了Barra风格的表现归因框架,将投资组合收益分解为市场、风格和选股阿尔法成分。在2024-2026年窗口内对中国沪深300指数评估的十个前沿LLM智能体中,掩码显著改变了智能体的推理过程,推动其转向匿名化的因子推理。归因分析进一步表明,在泄露控制评估下,LLM智能体的累积收益主要由被动的市场和风格暴露解释,而持续选股阿尔法的证据有限。这些发现表明,金融LLM基准不仅应评估智能体是否赚钱,还应评估收益来源是否反映了可转移的投资技能。我们发布KTD-Fin作为LLM交易智能体泄露控制和归因感知评估的可复现模板。

英文摘要

Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulnerable to two evaluation failures. First, long backtests often overlap with the knowledge cutoffs of frontier LLMs, allowing memorized tickers, dates, prices, and market narratives to substitute for investment reasoning. Second, raw returns are a noisy proxy for stock-selection ability, since positive performance may come from market beta, style exposure, or favorable regimes rather than genuine alpha. We introduce KTD-Fin (Knowing-To-Doing Financial Benchmark), an end-to-end stock-market trading benchmark that addresses both issues. KTD-Fin uses a data-side masking protocol to anonymize key identifiers and calendar information consistently across prompts and tools, separating historical market memory from investment decision-making. It also incorporates a Barra-style performance attribution framework that decomposes portfolio returns into market, style, and stock-selection alpha components. Across ten frontier LLM agents evaluated on the Chinese CSI300 over a 2024--2026 window, masking substantially changes agent rationales, pushing them towards anonymized factor-based reasoning. Attribution analysis further shows that LLM agents' cumulative returns under leakage-controlled evaluation are largely explained by passive market and style exposure, with limited evidence of persistent stock-selection alpha. These findings suggest that financial LLM benchmarks should evaluate not only whether an agent makes money, but also whether the source of returns reflects transferable investment skill. We release KTD-Fin as a reproducible template for leakage-controlled and attribution-aware evaluation of LLM trading agents.
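The attribution step described in the abstract splits a portfolio return into factor contributions plus a residual. The sketch below shows only that arithmetic under simplifying assumptions: exposures and factor returns are given for a single period, and the function name `attribute_return`, the factor names, and the numbers are hypothetical. A full Barra-style model would estimate factor returns by cross-sectional regression, which is omitted here.

```python
def attribute_return(portfolio_return, exposures, factor_returns):
    """Split one period's return into factor contributions and a residual.

    exposures:      factor name -> portfolio exposure (e.g. market beta)
    factor_returns: factor name -> realized factor return for the period
    The residual is what remains after all factor contributions and is
    reported here under the illustrative key "selection_alpha".
    """
    contributions = {
        name: exposures[name] * factor_returns[name] for name in exposures
    }
    explained = sum(contributions.values())
    contributions["selection_alpha"] = portfolio_return - explained
    return contributions


# Illustrative numbers only, not results from the paper.
result = attribute_return(
    portfolio_return=0.05,
    exposures={"market": 1.0, "value": 0.5},
    factor_returns={"market": 0.03, "value": 0.02},
)
print(result)
```

In this made-up example, four of the five percentage points are explained by market and style exposure, leaving one point as residual. This is the kind of breakdown that distinguishes passive exposure from selection skill.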

2605.28358 2026-05-28 cs.LG cs.AI cs.IT math.IT

Score Based Error Correcting Code Decoder

基于分数的纠错码译码器

Alon Helvits, Eliya Nachmani

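AI summary Proposes SB-ECC, a score-based decoder that treats decoding as continuous-time denoising. A neural denoiser defines a probability-flow ordinary differential equation that iteratively updates the noisy channel observation under parity constraints, inference requires no SNR estimation, and the method achieves the best bit error rate in 39 of 42 code/SNR settings.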

Comments Accepted to ICML 2026

详情
AI中文摘要

纠错码能够实现可靠通信,然而在实际软译码中,跨码族和码长仍然具有挑战性。我们提出SB-ECC,一种基于分数的译码器,将译码视为连续时间去噪。神经去噪器定义了一个概率流常微分方程(ODE),该方程在奇偶校验约束的引导下,迭代地将噪声信道观测值更新为有效的码字。该模型在不同噪声水平下训练,无需时间/SNR条件,从而无需SNR估计即可进行推理,并支持由ODE求解器预算控制的直接延迟-精度权衡。我们使用原始带符号的信道观测值作为输入来学习连续去噪场。在42个码/SNR设置中,SB-ECC在39/42个条目中实现了最佳误码率,平均SNR增益为0.17dB,最大增益为0.46dB,优于最强竞争基线。我们表明,将求解器从Euler切换为DPM可保持-ln(BER),同时将端到端译码时间平均减少8.86%(最高达12.82%)。

英文摘要

Error-correcting codes enable reliable communication, yet practical soft decoding remains challenging across code families and block lengths. We propose SB-ECC, a score-based decoder that casts decoding as continuous-time denoising. A neural denoiser defines a probability-flow ordinary differential equation (ODE) that iteratively updates the noisy channel observation toward a valid codeword, guided by parity constraints. The model is trained across noise levels without time/SNR conditioning, enabling inference without SNR estimation and supporting a direct latency-accuracy trade-off controlled by the ODE solver budget. We use the raw signed channel observation as input for learning a continuous denoising field. Across 42 code/SNR settings, SB-ECC achieves the best BER in 39/42 entries, with an average SNR gain of 0.17dB and a maximum gain of 0.46dB over the strongest competing baseline. We also show that swapping the solver from Euler to DPM preserves -ln(BER) while reducing end-to-end decoding time by 8.86% on average (up to 12.82%).
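The idea of decoding by integrating a probability-flow ODE can be sketched on a toy problem. Everything below is an assumption made for illustration and differs from SB-ECC in important ways: the neural denoiser is replaced by the closed-form MMSE denoiser for independent ±1 symbols in Gaussian noise, that denoiser is conditioned on the noise level (the paper's model is not), and no parity constraints are used. The names `denoise` and `euler_decode` are hypothetical.

```python
import math


def denoise(x, sigma):
    """Toy MMSE denoiser for a uniform +/-1 symbol in Gaussian noise of std sigma."""
    return math.tanh(x / (sigma * sigma))


def euler_decode(observation, sigma_max=1.0, steps=10):
    """Euler integration of dx/dsigma = (x - D(x, sigma)) / sigma from sigma_max to 0.

    More steps cost more denoiser calls, which is the latency-accuracy knob
    a solver budget controls.
    """
    sigmas = [sigma_max * (1 - i / steps) for i in range(steps + 1)]
    x = list(observation)
    for s_cur, s_next in zip(sigmas[:-1], sigmas[1:]):
        x = [
            xi + (s_next - s_cur) * (xi - denoise(xi, s_cur)) / s_cur
            for xi in x
        ]
    return x


decoded = euler_decode([0.8, -0.3, 1.4])
hard_bits = [1 if v > 0 else -1 for v in decoded]
print(hard_bits)
```

Each Euler step moves the noisy observation part of the way toward the denoiser's current estimate, so the trajectory ends on the ±1 constellation. In a real decoder the denoiser would also use the code's parity structure, which this sketch leaves out.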

2605.28355 2026-05-28 cs.LG

Detecting Diffusion-Generated Time Series Under Generator Shift

检测生成器偏移下的扩散生成时间序列

Zhi Wen Soi, Aditya Shankar, Gert Lek, Abele Mălan, Daniel Neider, Jian-Jia Chen, Lydia Chen

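AI summary For detecting diffusion-generated time series when the generator is unknown, compares white-box and black-box approaches, finds that a simple classifier used as a black-box detector clearly outperforms the white-box method, and notes that experience from the image domain does not transfer directly to this problem.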

详情
AI中文摘要

真实时间序列与扩散生成时间序列之间的界限变得越来越难以划分,然而该领域的检测仍未被充分探索,尤其是在生成器未知的情况下。我们比较了需要访问生成器的白盒检测与仅基于原始信号的黑盒检测。白盒方法是一种从图像领域改编的基于重构的检测器,在分布内表现良好,但在生成器偏移下失效:图像中基于重构的检测之所以成功,是因为大型通用生成器提供了近乎通用的重构先验,而时间序列不存在类似的生成器。相比之下,一个简单的现成分类器作为黑盒检测器表现非常出色,平均F1达到79.2,相对白盒方法提升22.1%,在1%假阳性率下的真正例率为57.2。因此,扩散生成时间序列的检测并非图像领域问题的直接迁移。本工作首次系统探索了扩散生成时间序列的白盒和黑盒检测。最后,我们指出了几个开放且有前景的方向。

英文摘要

The boundary between real and diffusion-generated time series is becoming increasingly difficult to draw, yet detection in this domain remains underexplored, especially when the generator is unknown. We compare white-box detection, which requires access to the generator, against black-box detection, which operates on the raw signal alone. The white-box approach, a reconstruction-based detector adapted from the image domain, works well in-distribution but breaks down under generator shift: reconstruction-based detection in images succeeds because large generic generators provide a near-universal reconstruction prior, and no analogous generator exists for time series. In contrast, a simple off-the-shelf classifier used as a black-box detector performs remarkably well, achieving an average F1 of 79.2, a 22.1% relative improvement over the white-box approach, and a TPR@1%FPR of 57.2. Diffusion-generated time series detection is therefore not a direct transfer of the image domain problem. This work provides the first systematic exploration of white-box and black-box detection for diffusion-generated time series. We close by identifying several open and promising directions.
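The TPR@1%FPR metric reported above can be computed from detector scores as follows. This is a minimal sketch under stated assumptions: higher scores mean "more likely generated", ties are resolved conservatively, and the function name `tpr_at_fpr` and the example scores are hypothetical rather than taken from the paper.

```python
def tpr_at_fpr(real_scores, generated_scores, max_fpr=0.01):
    """True-positive rate at a threshold that keeps the false-positive rate <= max_fpr.

    real_scores:      detector scores on real series (negatives)
    generated_scores: detector scores on generated series (positives)
    """
    ranked = sorted(real_scores, reverse=True)
    # Number of real series we may flag; the epsilon guards against float rounding.
    allowed = int(max_fpr * len(ranked) + 1e-9)
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    # Strict comparison: at most `allowed` real scores exceed the threshold.
    flagged = sum(1 for s in generated_scores if s > threshold)
    return flagged / len(generated_scores)


real = [i / 10 for i in range(10)]      # 0.0, 0.1, ..., 0.9
generated = [0.85, 0.95, 0.5, 0.99]
print(tpr_at_fpr(real, generated, max_fpr=0.1))
```

The small example uses a 10% false-positive budget so that it fits in ten negatives. With `max_fpr=0.01` the same function needs at least 100 real scores before any real series may be flagged.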

2605.28354 2026-05-28 cs.AI

Plan Before Search: Search Agents Need Plan

搜索前先规划:搜索智能体需要规划

Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Jiayi Ji, Chenyi Lei, Wenwu Ou, Xiaoshuai Sun, Qibin Hou

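AI summary Proposes Plan, which decomposes a question into ordered sub-questions before retrieval, and introduces a self-bootstrapping training paradigm that activates planning ability in multi-hop QA without distillation from an external stronger model.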

详情
AI中文摘要

将大型语言模型训练为检索增强推理智能体通常将强化学习与从更强模型蒸馏的SFT冷启动相结合。然而,这种范式忽略了两个基本因素:子技能之间的依赖结构,以及蒸馏并非获取能力的唯一途径。我们通过Plan来研究这一点,这是一种结构化的智能体行为,用于多跳检索,它在任何检索执行之前将问题分解为有序的子问题,从而使每个搜索步骤可以锚定到预先设计的子问题,而不是在先前检索的部分相关文档的影响下漂移。然而,在涵盖3B到14B参数的三个模型家族中,我们发现相同的奖励信号会引发定性不同的RL失败模式。这一现象表明,成功的训练不仅取决于奖励设计,还取决于模型特定的可行性条件:足够的初始熵、训练稳定性和先决子技能。受此启发,我们提出了一种自举训练范式,其中小规模种子模型生成过滤后的轨迹,从而在任何目标模型中激活Plan,消除了从外部强模型蒸馏的需要。我们的流程在每个测试模型中都激活了Plan,并在多跳QA基准上持续优于竞争基线。

英文摘要

Training large language models as retrieval-augmented reasoning agents typically combines reinforcement learning with an SFT cold start distilled from a stronger model. However, this paradigm overlooks two fundamental factors: the dependency structure among sub-skills, and the possibility that distillation is not the only route to capability acquisition. We study this through Plan, a structured agentic behavior for multi-hop retrieval that decomposes a question into ordered sub-questions before any retrieval is performed, so that each search step can be anchored to a pre-designed sub-question instead of drifting under the influence of partially relevant documents retrieved earlier. However, across three model families spanning 3B to 14B parameters, we find that an identical reward signal induces qualitatively different RL failure modes. This phenomenon indicates that successful training hinges not only on reward design but also on model-specific feasibility conditions: sufficient initial entropy, training stability, and prerequisite sub-skills. Motivated by this, we propose a self-bootstrapping paradigm in which a small-scale seed model generates filtered trajectories that activate Plan in any target model, eliminating the need for distillation from an external stronger model. Our pipeline activates Plan across every tested model and consistently outperforms competitive baselines on multi-hop QA benchmarks.

2605.28352 2026-05-28 cs.RO

Magnet-Based Soft Robotic Skin Using a 3D-Printed Multi-Lattice Structure and CNN-Based Tactile Super-Resolution

基于磁体的软体机器人皮肤:使用3D打印多格点结构和CNN触觉超分辨率

Yunseong Bang, Joowon Park, Suan Sim, Youngjun Ryu, Sukho Park, Kyungseo Park

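AI summary Proposes a magnet-based robotic skin that integrates a multilayer soft lattice, Hall-effect sensor arrays, and a CNN tactile super-resolution model. Tuning the lattice parameters jointly adjusts mechanical compliance and sensing characteristics, 3D printing enables rapid fabrication, and contact location and normal force are estimated in real time.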

Comments 6 pages, 9 figures. Accepted to IEEE International Conference on Robotics and Automation (ICRA) 2026. Y. Bang and J. Park contributed equally

详情
AI中文摘要

本文提出一种基于磁体的机器人皮肤,它集成了多层软格点、分布式霍尔效应传感器阵列和触觉超分辨率模型。外部接触力通过嵌入的永磁体转换为磁场变化,而格点将这些变化扩散到整个传感域。这使得每个传感器具有大且重叠的感受野,从而在最小盲区的情况下实现大面积的传感。格点参数可调,能够联合调整机械柔顺性和传感特性。隐式建模工作流和选择性激光烧结(SLS)3D打印支持快速制造共形、高复杂度的结构。基于实验测量训练的卷积神经网络实时估计接触位置和法向力。实验验证了定位精度,并表明可扩展到更大表面,适用于全身机器人皮肤和安全的人机交互。

英文摘要

This paper presents a magnet-based robotic skin that integrates a multilayer soft lattice with distributed Hall-effect sensor arrays and a tactile super-resolution model. External contact forces are converted to magnetic field changes by embedded permanent magnets, and the lattice spreads these changes across the sensing domain. This gives each sensor a large, overlapping receptive field and enables a large sensing area with minimal blind spots. Lattice parameters are tunable, enabling joint adjustment of mechanical compliance and transduction characteristics. An implicit modeling workflow and selective laser sintering (SLS) 3D printing support rapid fabrication of conformal, high-complexity structures. A convolutional neural network trained on experimental measurements estimates contact location and normal force in real time. Experiments validate localization accuracy and indicate scalability to larger surfaces, suggesting applicability to whole-body robotic skin and safe human-robot interaction.

2605.28348 2026-05-28 cs.CV

Toward Semantic-Agnostic and Shape-Aware Vision-Language Segmentation Models

面向语义无关和形状感知的视觉-语言分割模型

Corentin Seutin, Mohamed Amine Ettaki, Michaël Clément, Pierrick Coupé, Rémi Giraud

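AI summary Proposes the Semantic-Agnostic and Shape-Aware (SANSA) segmentation paradigm, finetuning models on non-semantic textual descriptions to gain up to 20% mIoU on the new task while maintaining performance on semantic prompts.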

Comments Accepted at the 2026 IEEE International Conference on Image Processing (ICIP 2026)

详情
AI中文摘要

视觉-语言分割模型最近通过利用自然语言表达的高层语义对象类别取得了强大性能。然而,这种语义依赖性限制了它们对形状、几何或纹理等内在视觉属性的推理能力,而这些属性在许多实际应用中至关重要。在这项工作中,我们引入了语义无关且形状感知(SANSA)分割,这是一种新的范式,要求分割模型仅从非语义文本描述中运行。为此,我们提出了两种基于字典约束或示例指导生成SANSA分割提示的策略,两者都生成语义无关的文本描述。然后使用这些提示在语义无关监督下微调分割模型。实验表明,与预训练的最先进模型相比,在此新分割任务上对SANSA提示进行微调可带来高达20%的mIoU改进,同时在标准语义提示上保持强劲性能。这些结果强调了低层和中层视觉推理对于提高视觉-语言分割模型的泛化性和可控性的重要性。

英文摘要

Vision-language segmentation models have recently achieved strong performance by leveraging high-level semantic object categories expressed in natural language. However, this semantic dependence limits their ability to reason about intrinsic visual properties such as shape, geometry, or texture, which are essential in many real-world applications. In this work, we introduce Semantic-Agnostic aNd Shape-Aware (SANSA) segmentation, a new paradigm that requires segmentation models to operate solely from non-semantic textual descriptions. To this end, we propose two strategies to generate SANSA segmentation prompts based on either dictionary constraints or example guidance, both generating semantic-agnostic textual descriptions. These prompts are then used to finetune segmentation models under semantic-agnostic supervision. Experiments show that finetuning on SANSA prompts yields up to a 20% mIoU improvement on this new segmentation task, compared to pretrained state-of-the-art models, while maintaining strong performance on standard semantic prompts. These results highlight the importance of low- and mid-level visual reasoning for improving the generalization and controllability of vision-language segmentation models.

2605.28347 2026-05-28 cs.AI

FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models

FedMPT: 视觉语言模型的多标签联邦提示调优

Xucong Wang, Pengkun Wang, Zhe Zhao, Liheng Yu, Shuang Wang, Yang Wang

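AI summary For multi-label recognition in federated learning, proposes FedMPT, which uses front-door adjustment from a causal model and LLM-driven condition decoupling, and suppresses spurious label correlations through optimal transport and a gating mechanism to improve model robustness.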

Comments 16 pages, including 11 pages of main text and 5 pages of appendix; Accepted by CVPR2026

详情
AI中文摘要

基于视觉语言模型的多标签识别旨在利用其预训练知识更好地适应复杂识别场景,从而增强模型鲁棒性。然而,对于需要联邦学习的现实去中心化应用,将视觉语言模型适应到每个拥有私有和异构数据的客户端会导致模型过拟合虚假标签关联,从而在遇到新样本时触发不相关类别。为解决此问题,我们使用因果模型重新考虑多标签识别的联邦学习,其中采用前门调整并通过中间变量(放大真实标签共现)解耦多标签识别建模过程。在分析指导下,我们提出FedMPT,这是首个专门为联邦多标签识别设计的方法。FedMPT的核心思想是利用可泛化条件引导联邦多标签识别以减轻错误标签激活。为此,FedMPT引入了一个由大语言模型驱动的流程来解读控制标签依赖的潜在条件。此外,我们引入了条件增强提示与图像块之间的最优传输以揭示多个区域级语义。最后,我们通过精心设计的门控机制从不同条件生成协同预测。在多个基准数据集上的实验表明,我们提出的方法在不同设置下取得了有竞争力的结果,并优于现有最先进方法。

英文摘要

Multi-Label Recognition (MLR) based on Vision-Language Models (VLMs) aims to leverage their pre-trained knowledge to better adapt to complex recognition scenarios, thereby enhancing model robustness. However, for realistic decentralized applications requiring federated learning, adapting VLMs to each client that possesses private and heterogeneous data can cause the model to overfit spurious label correlations, consequently triggering irrelevant categories when encountering new samples. To tackle this problem, we reconsider federated learning for MLR with a causal model, in which we adopt a front-door adjustment and decouple the MLR modeling process by intermediate variables that magnify the oracle label co-occurrence. Guided by our analysis, we propose FedMPT, the first method specifically designed for federated MLR. The core idea of FedMPT is to leverage generalizable conditions to steer federated MLR to mitigate erroneous label activations. To achieve this, FedMPT introduces a Large Language Model (LLM)-driven pipeline to decipher the underlying conditions that govern label dependencies. Furthermore, we introduce an optimal transport between the condition-enriched prompts and the image patches to uncover multiple region-level semantics. Finally, we generate synergistic predictions from different conditions with a crafted gating mechanism. Experiments on multiple benchmark datasets show that our proposed approach achieves competitive results and outperforms state-of-the-art methods under varied settings.

2605.28346 2026-05-28 cs.CL

When Discourse Pressures Conflict: Information Structure in Vision-Language Model Outputs

当话语压力冲突时:视觉-语言模型输出中的信息结构

Marcell Fekete, Johannes Bjerva, Tamás Káldi

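AI summary Studies whether vision-language models distinguish discourse-old Topics from discourse-new Foci in visual question answering, and finds that models produce information-structure-relevant constructions but over-regularise, collapsing onto narrow response templates in a way that resembles mode collapse.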

详情
AI中文摘要

视觉-语言模型(VLM)越来越多地被评估是否能识别正确的视觉内容,但关于它们是否以话语适当的形式表达这些内容却知之甚少。我们利用信息结构(IS)来填补这一研究空白,测试VLM在视觉基础问答中是否能区分话语旧主题(Topic)和话语新焦点(Focus)。我们利用匈牙利语,其中主题和焦点映射到专门的句法位置,使得IS选择在文本中可观察。通过比较六种VLM与人类参与者,我们发现模型产生了与IS相关的结构,但过度正则化了这种敏感性。在话语状态、语法角色(主语主题偏好)和限定性(不定焦点偏好)的相互作用压力下,人类选择多种IS实现策略。相比之下,VLM坍缩为狭窄的响应模板,类似于模式崩溃(Kirk等人,2024)。我们的发现表明,VLM评估应超越内容准确性,关注内容如何为话语打包。

英文摘要

Vision-language models (VLMs) are increasingly evaluated for whether they identify the right visual content, but little is known about whether they express such content in a discourse-appropriate form. We address this research gap using information structure (IS), testing whether VLMs distinguish discourse-old Topics from discourse-new Foci in visually grounded question answering. We exploit Hungarian, a language in which Topic and Focus map onto dedicated syntactic positions, making IS choices observable in text. Comparing six VLMs with human participants, we find that models produce IS-relevant constructions, but over-regularise this sensitivity. Under the interacting pressures of discourse status, grammatical role (preference for subject Topics) and definiteness (preference for indefinite Foci), humans choose variable strategies for IS realisation. VLMs, by contrast, collapse onto narrow response templates, resembling mode collapse (Kirk et al., 2024). Our findings suggest that VLM evaluation should look beyond content accuracy to how content is packaged for the discourse.

2605.28345 2026-05-28 cs.AI cs.LG eess.SP

Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains

Picid: 一种跨任务和领域的可复现PHM模块化评估基础设施

Lev Telyatnikov, Raffael Theiler, Leandro Von Krannichfeldt, Olga Fink

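AI summary Proposes Picid, a modular evaluation infrastructure that standardizes data contracts and evaluation boundaries to enable reproducible and fair comparison of fault detection, diagnostics, and prognostics across tasks and datasets.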

详情
AI中文摘要

预测与健康管理(PHM)领域的进展受到跨任务、数据集和应用领域缺乏标准化和可复用评估实践的阻碍。报告的结果往往难以复现和比较,因为关键协议选择(如数据划分、预处理、标签对齐、时间窗口和指标)通常是隐式的或临时实现的。我们引入了Picid,一个模块化评估基础设施,将PHM评估流程形式化为显式、可执行和可复现的协议。通过定义良好的抽象,Picid在保持对不同PHM设置的灵活性的同时,强制执行确定性、无泄漏的数据集构建。该框架通过统一接口支持故障检测、诊断和预测,并且可以扩展到新的数据集和模型类别,而不违反协议不变性。通过标准化数据契约和评估边界,Picid还实现了跨诊断(分类)和预测(回归)的公平任务比较,允许相同的模型系列在不同设置中一致地进行评估。我们通过对跨越电池、轴承、涡轮风扇发动机、液压系统、过滤系统和建筑的十二个数据集上的十三个模型进行实证评估来展示Picid。这项工作为PHM中标准化、公平和可复现的评估建立了可复用的基础。

英文摘要

Progress in Prognostics and Health Management (PHM) is hindered by the lack of standardized and reusable evaluation practices across tasks, datasets, and application domains. Reported results are often difficult to reproduce and compare, as key protocol choices, such as data splits, preprocessing, label alignment, temporal windowing, and metrics, are often implicit or implemented ad hoc. We introduce Picid, a modular evaluation infrastructure that formalizes the PHM evaluation pipeline as an explicit, executable, and reproducible protocol. Through well-defined abstractions, Picid enforces deterministic, leakage-safe dataset construction while remaining flexible across diverse PHM settings. The framework supports fault detection, diagnostics, and prognostics through a unified interface and can be extended to new datasets and model classes without violating protocol invariants. By standardizing data contracts and evaluation boundaries, Picid also enables fair cross-task comparisons across diagnostics (classification) and prognostics (regression), allowing identical model families to be evaluated consistently across heterogeneous settings. We demonstrate Picid through an empirical evaluation of thirteen models on twelve datasets spanning batteries, bearings, turbofan engines, hydraulics, filtration systems, and buildings. This work establishes a reusable foundation for standardized, fair, and reproducible evaluation in PHM.