arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.09638 2026-05-12 cs.LG

Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning

Sze-Ann Chen, Zhi-Yi Chin, Kui-Yuan Chen, Chi-Yu Li, Ping-Chun Hsieh

Affiliations: National Yang Ming Chiao Tung University; CISPA Helmholtz Center for Information Security

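AI summary: This work proposes Plan2Cleanse, a test-time backdoor defense framework for detecting and mitigating backdoor attacks on deep reinforcement learning models. By recasting backdoor detection as a planning problem, the method uses Monte Carlo Tree Search to efficiently identify and neutralize backdoor trigger sequences without retraining the model. Experiments show that Plan2Cleanse substantially improves trigger detection success rates and task performance across multiple environments, supporting its effectiveness in practical deployment.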

Comments Published in Transactions on Machine Learning Research (TMLR)

Abstract

Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated our method in competitive MuJoCo environments, simulated O-RAN wireless networks, and Atari games. Plan2Cleanse achieves substantial improvements, increasing trigger detection success rates by more than 61.4 percentage points in stealthy O-RAN scenarios and improving win rates from 35% to 53% in competitive Humanoid environments. These results demonstrate the effectiveness of our test-time defense approach and highlight the importance of proactive defenses against backdoor threats in RL deployments. Our implementation is publicly available at https://github.com/rl-bandits-lab/RL-Backdoor.

2605.09636 2026-05-12 cs.AI

PDEAgent-Bench: A Multi-Metric, Multi-Library Benchmark for PDE Solver Generation

Zhen Hang, Yushan Yashengjiang, Junhui Li, Huanshuo Dong, Yang Wei, Zhezheng Hao, Jiangtao Ma, Songlin Bai, Haozhong Kai, Xihang Yue, Gangzong Si, Dongming Jiang, Chao Yao, Zhanhua Hu, Jiangqing Zhang, Pengwei Liu, Yaomin Shen, Xingyu Ren, Lei Liu, Zikang Xu, Han Li, Qingsong Yao, Hande Dong, Hong Wang

Affiliations: University of Science and Technology of China; Tencent; Beijing University of Posts and Telecommunications; Shanghai Jiao Tong University; Zhejiang University; National University of Singapore; Tsinghua University; University of Texas at Dallas; Arizona State University; Rice University; Technical University of Munich; Stanford University; Alibaba Group

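AI summary: PDEAgent-Bench is presented as the first multi-metric, multi-library benchmark for partial differential equation (PDE) solver generation, designed to evaluate the ability to automatically generate numerical solver code from PDE descriptions. The benchmark contains 645 instances covering 6 mathematical categories and 11 PDE families, supports the mainstream finite-element libraries DOLFINx, Firedrake, and deal.II, and evaluates generated code in stages for executability, numerical accuracy, and computational efficiency. Experiments show that current large language models and code agents can often generate runnable code, but their pass rates drop substantially once accuracy and efficiency requirements are enforced, highlighting the difficulty of PDE solver generation and the limits of existing methods.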

Abstract

PDE-to-solver code generation aims to automatically synthesize executable numerical solvers from partial differential equation (PDE) specifications. This task requires not only understanding the mathematical structure of PDEs, but also selecting appropriate discretization schemes and solver configurations, and correctly implementing the resulting formulations in finite-element method (FEM) libraries. Existing code generation benchmarks mainly evaluate syntactic correctness or success on predefined test cases. To our knowledge, there is currently no publicly available benchmark specifically for PDE-to-solver code generation, and general-purpose code benchmarks do not fully capture the unique challenges of numerical PDE solution, such as ensuring solver accuracy, efficiency, and compatibility with professional FEM libraries. We introduce PDEAgent-Bench, to the best of our knowledge the first multi-metric, multi-library benchmark for PDE-to-solver code generation. PDEAgent-Bench contains 645 instances across 6 mathematical categories and 11 PDE families, and supports the common FEM libraries DOLFINx, Firedrake, and deal.II. Each instance provides an agent-facing problem specification, a reference solution on a prescribed evaluation grid, and case-specific accuracy and runtime targets. PDEAgent-Bench adopts a staged evaluation framework in which generated solvers must sequentially pass executability, numerical accuracy, and computational efficiency checks. Experiments with representative LLMs and code agents show that models can often produce runnable code, but their pass rate drops substantially once accuracy and efficiency requirements are enforced. These results indicate that current agents remain limited in producing numerically reliable and efficient PDE solvers, and that PDEAgent-Bench provides a reproducible testbed grounded in the practical requirements of numerical PDE solving.
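
The staged evaluation described in the abstract (executability, then accuracy, then efficiency) can be illustrated with a minimal sketch. The function name, the use of a single relative-error number, and the threshold parameters are assumptions made for illustration; the benchmark's actual metrics and harness are not given in the abstract.

```python
def staged_verdict(ran, rel_error, runtime_s, max_error, max_runtime_s):
    """Return the first stage a generated solver fails, or "pass".

    All names and thresholds here are hypothetical; they only mirror the
    three sequential checks named in the abstract.
    """
    # Stage 1: the solver must execute without crashing.
    if not ran:
        return "fail:executability"
    # Stage 2: the error against the reference solution must meet the
    # case-specific accuracy target.
    if rel_error > max_error:
        return "fail:accuracy"
    # Stage 3: the runtime must meet the case-specific runtime target.
    if runtime_s > max_runtime_s:
        return "fail:efficiency"
    return "pass"


print(staged_verdict(True, 1e-3, 2.0, max_error=1e-2, max_runtime_s=5.0))
```

Because the stages are sequential, a solver that runs but is inaccurate is never credited for being fast, which is one way the pass rate can drop once the later checks are enforced.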

2605.09635 2026-05-12 cs.CL

K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

Hao Liang, Qihan Lin, Zhaoyang Han, Xiaochen Ma, Zhen Hao Wong, Meiyi Qiang, Linzhuang Sun, Wentao Zhang

Affiliations: Peking University; Institute for Advanced Algorithms Research; OriginHub Technology; Zhongguancun Academy

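AI summary: This study introduces K12-KGraph, a curriculum-aligned knowledge graph for evaluating and training large language models in education. The graph is extracted from People's Education Press textbooks, covers mathematics, physics, chemistry, and biology, and contains seven node types and nine relation types; it is used to build the multi-task benchmark K12-Bench and the training dataset K12-Train. Experiments show that curriculum-structured supervision performs better under a limited, matched training-data budget and improves model performance on education-related tasks.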

Abstract

Large language models (LLMs) are increasingly used in K-12 education, yet existing benchmarks such as C-Eval, CMMLU, GaokaoBench, and EduEval mainly evaluate factual recall through exam-style question answering. Effective educational AI additionally requires curriculum cognition: understanding how knowledge is structured through prerequisite chains, concept taxonomies, experiment-concept links, and pedagogical sequencing. To address this gap, we introduce K12-KGraph, a curriculum-aligned knowledge graph extracted from official People's Education Press textbooks across mathematics, physics, chemistry, and biology from primary to high school. The graph contains seven node types (Concept, Skill, Experiment, Exercise, Section, Chapter, Book) and nine relation types covering taxonomy, prerequisite, association, verification, assessment, location, and order. Based on this graph, we construct two resources: (1) K12-Bench, a 23,640-question multi-select benchmark spanning five graph-derived task families (Ground, Prereq, Neighbor, Evidence, and Locate); and (2) K12-Train, a KG-guided supervised fine-tuning corpus of approximately 2,300 QA pairs synthesized from graph structure and node attributes. Experiments reveal substantial deficiencies in curriculum cognition: on K12-Bench, Gemini-3-Flash achieves only 57% exact match, while the best open-source model, Gemma-4-31B-IT, reaches 46%. Under a strictly matched 2,300-sample SFT budget on Qwen3-4B-Base and Llama-3.1-8B-Base, K12-Train consistently outperforms equally sized subsets from eight mainstream instruction-tuning corpora on both GaokaoBench and EduEval, demonstrating that curriculum-structured supervision is highly sample-efficient for educational tuning. We release the graph, benchmark, training data, and full construction pipeline.
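
For a multi-select benchmark, an exact-match score counts an answer as correct only if the chosen option set equals the gold set. The sketch below shows one plausible reading of that metric; the function names and the representation of answers as lists of option labels are assumptions, not the K12-Bench implementation.

```python
def exact_match(predicted, gold):
    """True only if the selected options equal the gold options as sets."""
    return set(predicted) == set(gold)


def exact_match_rate(predictions, golds):
    """Fraction of questions answered with exactly the gold option set."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)


# Option order does not matter, but a missing or extra option is a miss.
print(exact_match_rate([["A", "C"], ["B"]], [["C", "A"], ["B", "D"]]))
```

Under this definition there is no partial credit, which makes the metric stricter than per-option accuracy.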

2605.09634 2026-05-12 cs.CL

Can We Trust LLMs for Mental Health Screening? Consistency, ASR Robustness, and Evidence Faithfulness

Erfan Loweimi, Sofia de la Fuente Garcia, Samira Loveymi, Hadi Daneshvar, Saturnino Luz

Affiliations: Usher Institute, Edinburgh Medical School, University of Edinburgh, Edinburgh, UK; Department of Computer Engineering, Ahvaz Campus, Islamic Azad University, Ahvaz, Iran; School of Health and Social Care, Edinburgh Napier University, Edinburgh, UK

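AI summary: This study examines the reliability of large language models (LLMs) for mental health screening, focusing on consistency, robustness to automatic speech recognition (ASR), and evidence faithfulness. It evaluates three models (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on real speech data and finds that Phi-4 and Gemma-2-9B show excellent intra-model consistency and robustness to ASR errors, whereas Llama-3.1-8B is less stable. The study also reveals a mismatch between model scores and their keyword evidence, which poses a challenge for interpretability in clinical use.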

Abstract

LLMs can estimate Hospital Anxiety and Depression Scale (HADS) scores from speech in a zero-shot manner, but clinical deployment requires reliability across three dimensions: intra-model consistency, ASR robustness, and evidence faithfulness. We evaluate three LLMs (Phi-4, Gemma-2-9B, and Llama-3.1-8B) on 111 English-speaking participants using ground-truth transcripts and three Whisper ASR variants (Large, Medium, Small), with three independent runs per model-condition pair. We find that (i) Phi-4 and Gemma-2-9B achieve excellent intra-model consistency (ICC > 0.89) with minimal degradation under ASR; (ii) Llama-3.1-8B shows ASR-fragile consistency, with ICC dropping from 0.82 to 0.36 at 10% WER; (iii) predictive validity is largely preserved under ASR for robust models; and (iv) keyword groundedness exceeds 93% for Phi-4 and Gemma-2-9B but falls to 77-81% for Llama-3.1-8B. Inter-model keyword agreement is far lower than score-level agreement, revealing a score-evidence dissociation with implications for clinical interpretability.
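
Intra-model consistency across repeated runs is reported as an intraclass correlation coefficient (ICC). The abstract does not say which ICC form is used, so the sketch below assumes the one-way random-effects form ICC(1,1), with one row per participant and one column per run; the function name is hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1) for an n-subjects by k-runs table.

    Assumption: the ICC variant is not stated in the abstract; this is
    the standard one-way form, shown only as an illustration.
    """
    n = len(ratings)
    k = len(ratings[0])
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need >= 2 subjects and >= 2 runs, equal row lengths")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Between-subject and within-subject mean squares.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("all ratings are identical; ICC is undefined")
    return (ms_between - ms_within) / denom


# Three runs that agree perfectly for every participant.
print(icc_oneway([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
```

Values near 1 mean that run-to-run variation is small compared with differences between participants, which is the sense in which a model's scores are consistent.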

2605.09633 2026-05-12 cs.RO cs.SY eess.SY

Minimizing Worst-Case Weighted Latency for Multi-Robot Persistent Monitoring: Theory and RL-Based Solutions

Weizhen Wang, Ziheng Wang, Jianping He, Xinping Guan, Xiaoming Duan

Affiliations: School of Automation and Intelligent Sensing, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China

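AI summary: This paper studies multi-robot persistent monitoring on weighted graphs, aiming to design robot trajectories that minimize the worst-case weighted latency across all nodes over an infinite time horizon. Because the conventional worst-case latency objective cannot distinguish strategies with poor transient behavior but strong asymptotic performance, the authors propose a family of tail-performance objectives and establish a theoretical framework for the resulting optimization problems. Building on these results, they construct an equivalent event-driven Markov decision process (TWLO-MDP), develop reinforcement-learning-based solution methods, and introduce a multi-robot monitoring benchmark (M2Bench); experiments show that the approach reduces worst-case weighted latency and outperforms existing methods.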

Abstract

We study multi-robot persistent monitoring on weighted graphs, where node weights encode monitoring priorities and edge weights encode travel distances. The goal is to design joint robot trajectories that minimize the worst-case weighted latency across all nodes over an infinite time horizon. The widely adopted worst-case latency objective evaluates team performance over the entire time horizon and therefore may fail to distinguish strategies with poor transient behavior but strong asymptotic performance. To address this limitation, we propose a family of tail-performance objectives that generalize the standard objective and study the resulting functional optimization problems. We establish several key theoretical properties, including the existence of optimal strategies, relationships among the proposed objectives and their corresponding optimization problems, approximation by periodic solutions to arbitrary accuracy, and reductions to event-driven decision models with discretized waiting times. Building on these results, we construct an equivalent event-driven Markov decision process (MDP), called the Tail Worst-case Latency-Optimizing Markov Decision Process (TWLO-MDP), which reformulates the tail-performance objective as a standard average-reward criterion. We then develop reinforcement-learning-based solution methods for the TWLO-MDP and introduce the multi-robot monitoring benchmark (M2Bench), a unified platform that supports the evaluation and comparison of heuristic and learning-based monitoring algorithms. Experiments on synthetic and realistic monitoring scenarios show that our methods effectively reduce the worst-case weighted latency and outperform representative baselines.
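
To make the objective concrete, the sketch below computes the steady-state worst-case weighted latency for a much simpler setting than the paper's: a single robot repeating a fixed closed tour with no waiting. The function name, the dictionary-based graph encoding, and the single-robot restriction are all assumptions for illustration.

```python
def worst_weighted_latency(tour, travel_time, weight):
    """Worst weighted revisit gap for one robot repeating `tour` forever.

    tour: nodes in visiting order (the robot returns to tour[0] afterwards).
    travel_time: dict mapping (u, v) to the travel time from u to v.
    weight: dict mapping each node to its monitoring priority.
    """
    # Arrival time of each tour position within one period.
    times = [0.0]
    for u, v in zip(tour, tour[1:]):
        times.append(times[-1] + travel_time[(u, v)])
    period = times[-1] + travel_time[(tour[-1], tour[0])]

    visits = {}
    for node, t in zip(tour, times):
        visits.setdefault(node, []).append(t)

    worst = 0.0
    for node, w in weight.items():
        if node not in visits:
            return float("inf")  # a node that is never visited
        ts = visits[node]
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        gaps.append(period - ts[-1] + ts[0])  # wrap-around gap
        worst = max(worst, w * max(gaps))
    return worst


travel = {("a", "b"): 1, ("b", "a"): 1, ("a", "c"): 2, ("c", "a"): 2}
print(worst_weighted_latency(["a", "b", "a", "c"], travel, {"a": 3, "b": 1, "c": 1}))
```

In this example the high-priority node `a` is visited twice per period, yet its longest gap (4 time units) times its weight still dominates, which shows why node weights and revisit gaps have to be balanced jointly.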

2605.09630 2026-05-12 cs.CL cs.LG

Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

Lin Zheng, Vasilisa Bashlovkina, Timothy Dozat, Dan Garrette, Laura Rimell, Joshua Maynez

Affiliations: Google DeepMind; The University of Hong Kong

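AI summary: This paper studies efficient patch-based inference in byte-level language models and proposes Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes observed so far and refresh the patch-level context, reducing the prediction lag that grows with patch size. The method improves model quality at the same patch size while substantially reducing the key-value cache and inference compute, offering a new direction for efficient language model design.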

Comments 23 pages, 15 figures

Abstract

Tokenizer-free language models eliminate the tokenizer step of the language modeling pipeline by operating directly on bytes; patch-based variants further aggregate contiguous byte spans into patches for efficiency. However, the average patch size chosen at the model design stage governs a tight trade-off: larger patches reduce compute and KV-cache footprint, but degrade modeling quality. We trace this trade-off to patch lag: until a patch is fully observed, byte predictions within it must rely on a stale representation from the previous patch to preserve causality; this lag widens as patches grow larger. We introduce Scratchpad Patching (SP), which inserts transient scratchpads inside each patch to aggregate the bytes seen so far and refresh patch-level context for subsequent predictions. SP triggers scratchpads using next-byte prediction entropy, selectively allocating compute to information-dense regions and enabling post-hoc adjustment of inference-time compute. Across experiments on natural language and code, SP improves model quality at the same patch size; for example, even at 16 bytes per patch, SP-augmented models match or closely approach the byte-level baseline on downstream evaluations while using a 16× smaller KV cache over patches and 3-4× less inference compute.
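
The entropy-based trigger can be sketched as a simple threshold on the entropy of each next-byte distribution. This is only an illustration of the idea: the function names, the fixed threshold, and the toy two-symbol distributions are assumptions, and the paper's actual trigger rule may differ.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of one predictive distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def scratchpad_positions(next_byte_dists, threshold_bits):
    """Positions whose next-byte entropy exceeds the (hypothetical) threshold."""
    return [
        i for i, dist in enumerate(next_byte_dists)
        if entropy_bits(dist) > threshold_bits
    ]


# Toy distributions: confident, maximally uncertain, mildly uncertain.
dists = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
print(scratchpad_positions(dists, threshold_bits=0.4))
print(scratchpad_positions(dists, threshold_bits=0.5))
```

Raising the threshold triggers fewer scratchpads, which is one way a compute budget could be adjusted after training.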

2605.09628 2026-05-12 cs.CV

DegBins: Degradation-Driven Binning for Depth Super-Resolution

Zhiqiang Yan, Zhengxue Wang, Jian Yang, Gim Hee Lee

Affiliations: Department of Computer Science, National University of Singapore; Nanjing University of Science and Technology

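AI summary: Depth super-resolution (DSR) aims to recover a high-resolution depth map from a low-resolution one. Conventional methods typically learn the residual between high and low resolution in a low-dimensional feature space, but struggle to model spatially varying degradation accurately. This paper proposes DegBins, a DSR framework that uses degradation-driven binning to turn the regression problem into a hybrid classification-regression problem: the residual depth is represented more flexibly as a weighted combination of discrete depth bins, and the degradation relationship is modeled in a high-dimensional feature space so that bin ranges and probability distributions adapt to it. Experiments show that DegBins outperforms existing methods on multiple benchmark datasets, with higher accuracy and robustness.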

Comments 9 pages

Abstract

Depth super-resolution (DSR) aims to recover a high-resolution (HR) depth map from its low-resolution (LR) counterpart. With color image guidance, this task is typically formulated as learning the residual between HR and LR in a low-dimensional feature space. However, this additive formulation is insufficient to accurately capture the complex relationship between HR and LR, especially under spatially varying degradations. In this paper, we introduce DegBins, a novel DSR framework that leverages degradation-driven binning to adaptively enhance residual modeling. Specifically, DegBins reformulates the regression-based DSR as a hybrid classification-regression problem, where the residual depth is represented as a linear combination of discrete depth bins weighted by their learned probability distribution, yielding more flexible and expressive representations. Furthermore, DegBins models the degradation relationship between HR and LR in a high-dimensional feature space, enabling adaptive bin range adjustment and probability optimization conditioned on local degradation characteristics. To progressively improve reconstruction quality, DegBins adopts a multi-stage refinement scheme, where each stage performs finer-grained bin partitioning and probability updating based on the former estimation. This coarse-to-fine design facilitates more accurate depth recovery, particularly in regions with severe degradations or complex structural variations. Extensive experiments across five benchmarks demonstrate that DegBins consistently outperforms existing state-of-the-art methods in terms of accuracy, robustness, and generalization.
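
The hybrid classification-regression formulation, where a residual is a probability-weighted combination of discrete bins, can be sketched for a single pixel as follows. Uniformly spaced bin centers and softmax probabilities are assumptions for illustration; the abstract does not specify how DegBins places bins or normalizes scores.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(lo, hi, n_bins):
    """Centers of n_bins equal-width bins covering [lo, hi] (assumed layout)."""
    width = (hi - lo) / n_bins
    return [lo + (i + 0.5) * width for i in range(n_bins)]


def residual_from_bins(logits, lo, hi):
    """Residual depth as the expectation of bin centers under the bin probabilities."""
    probs = softmax(logits)
    centers = bin_centers(lo, hi, len(logits))
    return sum(p * c for p, c in zip(probs, centers))


# Uniform scores over a symmetric range give a zero residual.
print(residual_from_bins([0.0, 0.0, 0.0, 0.0], lo=-1.0, hi=1.0))
```

Narrowing `[lo, hi]` around a previous estimate and re-predicting the probabilities is one way to picture the coarse-to-fine refinement the abstract describes.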

2605.09622 2026-05-12 cs.CV cs.AI

Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study

Yuhan Wang, Zihan Li, Han Liu, Simon Arberet, Martin Kraus, Yuyin Zhou, Florin-Cristian Ghesu, Dorin Comaniciu, Ali Kamen, Riqiang Gao

Affiliations: UC Santa Cruz; Siemens Healthineers; University of Washington

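AI summary: Voxel-wise dose prediction is a critical yet challenging task in radiotherapy planning, and existing models often fail to generalize across clinical settings. This paper proposes DiffKT3D, a unified Any2Any 3D diffusion framework that transfers knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. The method introduces flexible conditional generation based on modality embeddings, combined with a clinically oriented reinforcement learning post-training strategy, and improves dose prediction accuracy and image quality over the previous best model.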

Comments Accepted by CVPR 2026 main conference. Compared to the CVPR version, minor updates are included here (e.g., combining the main text and appendix; clarifying the timing scenario in the appendix)

Abstract

Voxel-wise dose prediction is a critical yet challenging task in practical radiotherapy (RT) planning, as bespoke models trained from scratch often struggle to generalize across diverse clinical settings. Meanwhile, generative models trained on billion-scale datasets from vision domains have achieved impressive performance. Herein, we propose DiffKT3D, a unified Any2Any 3D diffusion framework that leverages prior knowledge from pretrained video diffusion models for efficient and clinically meaningful dose prediction. To enable flexible conditioning across multiple clinical modalities (CT, anatomical structures, body, beam settings, etc.), we introduce an Any2Any conditional paradigm utilizing modality-specific embeddings without cross-attention overhead. Further, we design a novel reinforcement learning (RL) post-training mechanism guided by a clinically-informed Scorecard explicitly tailored to institutional treatment preferences. Compared with the winner of the GDP-HMM challenge, DiffKT3D sets a new state-of-the-art in dose prediction by reducing voxel-level MAE from 2.07 to 1.93. In addition, DiffKT3D achieves superior image quality and preference match. These results demonstrate that transferring diffusion priors via modality-aware conditioning and clinically aligned RL post-training can provide a robust and generalizable solution for RT planning across various clinical scenarios.

2605.09618 2026-05-12 cs.CL cs.CY

Statistical Scouting Finds Debate-Safe but Not Debate-Useful Cases: A Matched-Ceiling Study of Open-Weight LLM Reasoning Protocols

Julia Hu, Alfred Shen, Kumar Lakshmipathi

Affiliations: Amazon Web Services

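AI summary: This study compares language model performance under direct answering, multi-sample voting, and multi-agent debate, asking which strategy works best under a fixed limit on generated tokens. Experiments with two models on MuSiQue and GSM8K show that the best strategy varies by model and dataset and is hard to select from simple pre-deliberation signals such as vote entropy. The study finds that vote entropy predicts only where debate is safe, not where debate is needed, indicating that selective-debate routing remains limited in practice.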

Comments 14 pages, 5 figures. Technical report / preprint

Abstract

When should a language model answer directly, sample and vote, or engage in multi-agent debate? Recent work shows voting often explains much of the gain attributed to debate, while selective-debate systems activate deliberation only on uncertain examples. We ask: under a matched ceiling on generated tokens (960 per example), how much per-example routing headroom exists, and how much is recoverable from cheap pre-deliberation signals? We evaluate greedy decoding, three-sample voting, and a two-agent critique-revise debate on MuSiQue and GSM8K using Llama 3.1 8B Instruct and Ministral 3 8B Instruct. On MuSiQue, an oracle selecting the correct protocol per example gains +14.0 and +13.7 pp over the best fixed one. The best fixed protocol is model- and dataset-dependent: each (model, dataset) cell has a different winner. This headroom is hard to recover from cheap ex-ante signals. A vote-entropy threshold is the only controller that directionally beats the best fixed protocol on both models (+1.3 and +1.7 pp), though individual paired-bootstrap CIs include zero. A joint analysis (meta-analysis +1.6 pp, p=0.125; Bayesian P(both>0)=0.59) is directionally consistent but not significant. Learned controllers (LR, GBT) do not outperform the threshold. The key finding is structural: vote entropy predicts where debate is safe, not where debate is needed. High entropy sharply reduces debate backfire, but 66% of debate-helpful examples (31/47) occur when voting is unanimous but wrong. A single-prompt self-critique probe on Llama flips the answer in 127/127 unanimous cases, yielding zero mutual information with the debate-helpful label; we cannot rule out a prompt-compliance artifact, but either interpretation disqualifies the probe as a router. Recovering the remaining headroom requires behavioral probes that avoid format-compliance confounds at the 8B scale.
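
A vote-entropy threshold controller of the kind discussed in the abstract can be sketched in a few lines. The function names and the threshold value are assumptions; the study's exact entropy definition and tuned threshold are not given in the abstract.

```python
import math
from collections import Counter


def vote_entropy(answers):
    """Shannon entropy in bits of the empirical distribution of sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def route(answers, threshold_bits):
    """Escalate to debate only when the sampled answers disagree enough."""
    if vote_entropy(answers) > threshold_bits:
        return "debate"
    return "vote"


print(route(["4", "4", "4"], threshold_bits=0.5))  # unanimous samples
print(route(["4", "5", "6"], threshold_bits=0.5))  # three-way disagreement
```

A unanimous vote always has zero entropy, so a router like this never sends unanimous-but-wrong examples to debate, which matches the structural limitation the abstract reports.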

2605.09614 2026-05-12 cs.CV

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

Xuan Gong, Hanbo Huang, Hao Zheng, Yiran Zhang, Wenbin Dai, Weishu Zhao, Shiyu Liang

Affiliations: Shanghai Jiao Tong University; Lanzhou University

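AI summary: This paper studies the fading of visual information in long-chain multimodal reasoning. It takes an information-theoretic approach, derives a lower bound on the downstream visual gain of an intervention point, and uses it to design reflection-anchor policy optimization (RAPO). RAPO selects high-entropy reflection anchors and optimizes a finite-window KL surrogate, strengthening the propagation and retention of visual information during generation. Experiments show that RAPO substantially outperforms existing methods on multiple vision-language benchmarks, and mechanism analyses indicate that it increases contrastive visual-dependence signals along generated trajectories.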

Comments Under Review

Abstract

Long chain-of-thought (CoT) reasoning improves large vision-language models, but visual information often fades during generation, limiting long-horizon multimodal reasoning. Existing methods either re-inject vision at inference or train policies for stronger grounding, but where to intervene relies on perception heuristics rather than principled gain analysis, and how local visual influence propagates remains implicit. We study this problem from an information-theoretic standpoint and derive a lower bound on the downstream visual gain of a one-step intervention, which suggests two factors: local branching room (token entropy) and downstream visual propagation potential (suffix divergence from a vision-marginalized reference). Guided by this analysis, we propose reflection-anchor policy optimization (RAPO), a GRPO-based policy optimization method that selects high-entropy reflection anchors and optimizes a chain-masked finite-window KL surrogate for downstream visual dependence. Experiments on reasoning-intensive and general-domain benchmarks show that RAPO delivers substantial gains over strong baselines across multiple LVLM backbones. Mechanism analyses further indicate that reflection anchors are enriched for visually sensitive decision points and that RAPO increases contrastive visual-dependence signals along generated trajectories.

2605.09613 2026-05-12 cs.RO cs.CV

SABER: A Scalable Action-Based Embodied Dataset for Real-World VLA Adaptation

Narsimha Menga, Parikshit Sakurikar, Amirreza Rouhi, Satya Sai Reddy, Anirudh Govil, Sri Harsha Chittajallu, Rajat Aggarwal, Anoop Namboodiri, Sashi Reddi

Affiliations: DreamVu

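AI summary: This study presents SABER, a high-fidelity action dataset for adapting robot vision-language-action (VLA) models to real-world retail settings. Built from many hours of natural in-store capture, SABER records fine-grained hand activity, whole-body motion, and scene dynamics of people in retail environments, without staging or teleoperation. The dataset provides several action representations; when used to post-train GR00T N1.6, it substantially raised success rates on complex retail tasks, illustrating the importance of high-quality data for robot performance.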

Abstract

Robotic deployment in real-world environments depends on rich, domain-specific action data as much as on strong model architecture. General-purpose robot foundation models show modest performance in complex unseen tasks such as manipulation in a retail domain when applied out of the box. The root cause is a data gap: retail environments are structurally absent from general robot pretraining distributions, and the path to filling that gap through teleoperation is prohibitively expensive, logistically constrained, and difficult to scale. We introduce SABER, a high-fidelity retail robotics action dataset built from over 100 hours of natural in-store capture across multiple real grocery environments. Egocentric footage from head-mounted cameras records fine-grained hand activity at the point of interaction, while exocentric 360-degree scene footage from DreamVu's ALIA camera simultaneously observes all actors and activities across the entire space. This combination yields a uniquely complete picture of human retail behavior: dexterous hand activity, whole-body motion, and scene dynamics, all captured without staging, scripting, or teleoperation overhead. The SABER corpus contains 44.8K training samples across three action representation streams: 25K latent action sequences via LAPA-style encoding, 18.6K dexterous hand-pose trajectories retargeted to robot joint space, and 1.2K whole-body synchronized motion sequences retargeted to a humanoid embodiment. When applied to GR00T N1.6 via a shared-backbone multi-task post-training recipe, SABER yields a mean success rate of 29.3% across ten retail manipulation tasks -- more than 2.19x over fine-tuning baselines (13.4%). SABER demonstrates that the path to capable retail robots runs through better data, which can be collected today, at scale, without a robot in the loop. The dataset and code are available at https://dreamvu.ai/saber

2605.09611 2026-05-12 cs.CL

Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

Sietse Schelpe

Affiliations: Corbenic AI, Inc.

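AI summary: This paper presents an empirical analysis of byte-exact chunk-level deduplication in retrieval-augmented generation (RAG) pipelines, examining context reduction and quality impact across operating regimes. In experiments on academic, enterprise, and multi-turn conversational scenarios, deduplication removes up to 80.34% of context bytes, and a cross-vendor model panel evaluation finds no measurable quality regression. The study concludes that substantial inference compute savings can be achieved deterministically without sacrificing model quality.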

Comments Preprint. Implementation and open-source community version available at: https://github.com/corbenic/merlin-community - https://zenodo.org/records/20090712

Abstract

This preprint presents an empirical analysis of byte-exact chunk-level deduplication in Retrieval-Augmented Generation (RAG) pipelines. We measure context reduction across three distinct operating regimes: clean academic retrieval (0.16% byte reduction on 22.2M BeIR passages), constructed enterprise patterns (24.03% reduction), and multi-turn conversational AI (80.34% reduction). To validate quality preservation, we conducted a cross-vendor 5-judge calibrated panel evaluation across four production APIs (Google Gemini 2.5 Flash, Anthropic Claude Sonnet 4.6, Meta Llama 3.3 70B, and OpenAI GPT-5.1). Applying a five-category human-in-the-loop noise-removal protocol to panel-majority materially different (MAT) pairs, we establish that byte-exact deduplication introduces zero measurable quality regression. Post-audit, all four vendors clear the strict <5% Wilson 95% upper-bound MAT threshold in both the clean and high-redundancy RAG regimes. This work demonstrates that substantial inference compute savings can be achieved deterministically without compromising evaluation-grade model quality.
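
Byte-exact chunk-level deduplication can be sketched with a hash set over the raw bytes of each retrieved chunk. This is a generic illustration, not the authors' implementation: the function names, the choice of SHA-256, and the UTF-8 encoding are assumptions.

```python
import hashlib


def dedup_chunks(chunks):
    """Drop chunks whose UTF-8 bytes exactly match an earlier chunk.

    First-seen order is preserved. No normalization is applied, so chunks
    that differ by a single byte (even trailing whitespace) are both kept.
    """
    seen = set()
    kept = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk.encode("utf-8")).digest()
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept


def byte_reduction(chunks, kept):
    """Fraction of context bytes removed by deduplication."""
    before = sum(len(c.encode("utf-8")) for c in chunks)
    after = sum(len(c.encode("utf-8")) for c in kept)
    return 0.0 if before == 0 else 1.0 - after / before


chunks = ["abc", "abd", "abc", "abc "]
kept = dedup_chunks(chunks)
print(kept, byte_reduction(chunks, kept))
```

Because only identical byte sequences are dropped, the prompt loses no distinct content, and the achievable reduction depends entirely on how much verbatim repetition the retrieval regime produces.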

2605.09608 2026-05-12 cs.LG cs.IT math.IT

Geometry Conflict: Explaining and Controlling Forgetting in LLM Continual Post-Training

Yuanyi Wang, Yifan Yang, Su Lu, Yanggan Gu, Pengkai Wang, Wenjun Wang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Jialun Cao, Shing-Chi Cheung, Hongxia Yang

Affiliations: The Hong Kong Polytechnic University (PolyU); The Hong Kong University of Science and Technology (HKUST); PolyU-Daya Bay Technology and Innovation Research Institute

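AI summary: This study investigates the causes and control of forgetting in continual post-training of large language models, analyzing the compatibility between the evolving model state and new updates through the covariance geometry of task parameter updates. Building on the notion of geometry conflict, it proposes GCWM, a data-free update-merging algorithm that constructs a shared metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate correction. Experiments across several model sizes show improved retention and final performance in continual training.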

Abstract

Continual post-training aims to extend large language models (LLMs) with new knowledge, skills, and behaviors, yet it remains unclear when sequential updates enable capability transfer and when they cause catastrophic forgetting. Existing methods mitigate forgetting through sequential fine-tuning, replay, regularization, or model merging, but offer limited criteria for determining when incorporating new updates is beneficial or harmful. In this work, we study LLM continual post-training through three questions: What drives forgetting? When do sequentially acquired capabilities transfer or interfere? How can compatibility be used to control update integration? We address these questions through task geometry: we represent each post-training task by its parameter update and study the covariance geometry induced by the update. Our central finding is that forgetting can be considered a state-relative update-integration failure: it arises when the covariance geometries induced by tasks misalign with the geometry of the evolving model state. Sequential updates transfer when they remain compatible with the model state shaped by previous updates, and interfere when state-relative geometry conflict becomes high. Motivated by this finding, we propose Geometry-Conflict Wasserstein Merging (GCWM), a data-free update-integration method that constructs a shared Wasserstein metric via Gaussian Wasserstein barycenters and uses geometry conflict to gate geometry-aware correction. Across Qwen3 0.6B-14B on domain-continual and capability-continual settings, GCWM consistently outperforms data-free baselines, improving retention and final performance without replay data. These results identify geometry conflict as both an explanatory signal for forgetting and a practical control signal for LLM continual post-training.

2605.09604 2026-05-12 cs.CV

DAP: Doppler-aware Point Network for Heterogeneous mmWave Action Recognition

Jiaying Lin, Shiman Wu, Jinfu Liu, Can Wang, Mengyuan Liu

Affiliations: Peking University; Huazhong University of Science and Technology; DJI Technology Company Ltd.; Christian-Albrechts-Universität zu Kiel

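AI summary: This study addresses human action recognition (HAR) with millimeter-wave radar in heterogeneous settings. It introduces UniMM-HAR, described as the first large-scale heterogeneous multi-source mmWave point cloud dataset, and designs DAP-Net to handle distribution shifts caused by different devices and frequency bands. DAP-Net combines cross-modal information with a Doppler-aware mechanism to improve robustness to heterogeneous radar sources, and experiments show superior performance on cross-source recognition.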

Abstract

Millimeter-wave (mmWave) radar provides privacy-preserving sensing and is valuable for human action recognition (HAR). Existing mmWave point cloud datasets are limited in scale and mostly collected under homogeneous single-source settings, preventing current methods from handling real-world distribution shifts caused by heterogeneous radar sources, such as different devices and frequency bands. To address this, we introduce UniMM-HAR, the largest and first mmWave point cloud HAR dataset for heterogeneous multi-source scenarios, standardizing three distinct radar configurations to realistically evaluate cross-source generalization. We further propose the Doppler-aware Point Cloud Network (DAP-Net) to tackle heterogeneity challenges. DAP-Net enhances intra-modal representations and performs cross-modal alignment to learn source-invariant action semantics. Leveraging action-consistent spatio-temporal Doppler patterns as anchors, the Dual-space Doppler Reparameterization (D2R) module performs sample-adaptive geometric densification and Doppler-guided feature recalibration, while the Text Alignment Module (TAM) provides stable semantic anchors via a pretrained textual space. Experiments show that DAP-Net significantly outperforms existing methods under heterogeneous radar settings, achieving state-of-the-art accuracy and strong cross-source robustness.

2605.09603 2026-05-12 cs.CL

Edit-Based Refinement for Parallel Masked Diffusion Language Models

Houxing Ren, Mingjie Zhan, Zimu Lu, Ke Wang, Yunqiao Yang, Haotian Hou, Junting Pan, Hongsheng Li

Affiliations: CUHK MMLab; SenseTime Research; Shenzhen Loop Area Institute

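AI summary: This paper proposes ME-DLM, an edit-based refinement framework that improves parallel masked diffusion language models when generating multiple tokens at once. After producing an initial complete response, the method post-edits it with minimal edit operations (replacement, deletion, and insertion) to improve sequence consistency. Experiments show that ME-DLM improves generation quality and robustness while preserving the efficiency of parallel generation; built on LLaDA, it gains 11.6 points on HumanEval and 33.6 points on GSM8K.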

Comments Accepted to ICML 2026

Abstract

Masked diffusion language models enable parallel token generation and offer improved decoding efficiency over autoregressive models. However, their performance degrades significantly when generating multiple tokens simultaneously, due to a mismatch between token-level training objectives and joint sequence consistency. In this paper, we propose ME-DLM, an edit-based refinement framework that augments diffusion generation with lightweight post-editing steps. After producing an initial complete response, the model refines it through minimal edit operations, including replacement, deletion, and insertion, conditioned on the full sequence. Training supervision is derived from edit distance, providing a deterministic signal under a fixed canonicalization scheme for learning minimal corrections. This approach encourages sequence-level consistency through globally conditioned edits while preserving the efficiency benefits of parallel diffusion decoding. Extensive experiments demonstrate that ME-DLM improves the quality and robustness of multi-token parallel generation. In particular, when built upon LLaDA, our method achieves consistent gains of 11.6 points on HumanEval and 33.6 points on GSM8K while using one-eighth of the total diffusion steps. Code is available at https://github.com/renhouxing/ME-DLM.
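
Deriving supervision from edit distance can be illustrated with the classic Levenshtein dynamic program plus a backtrace that emits replace, delete, and insert operations. The fixed tie-breaking order below stands in for a canonicalization scheme and is an assumption; the paper's actual scheme and operation encoding are not described in the abstract.

```python
def edit_script(source, target):
    """Minimal edit operations turning `source` into `target`.

    Each operation is (kind, source_index, token). For "insert", the index
    is the source position before which the token is inserted. Ties are
    broken in a fixed order (match, replace, delete, insert) so the output
    is deterministic; that order is a hypothetical canonicalization.
    """
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete
                dist[i][j - 1] + 1,         # insert
                dist[i - 1][j - 1] + cost,  # match or replace
            )

    # Walk back from the bottom-right corner to recover one optimal script.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, None))
            i -= 1
        else:
            ops.append(("insert", i, target[j - 1]))
            j -= 1
    ops.reverse()
    return ops


print(edit_script(["a", "b", "c"], ["a", "x", "c"]))
```

With a draft response as `source` and a reference as `target`, the resulting script is one possible per-position training label for a model that learns minimal corrections.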

2605.09591 2026-05-12 cs.CV

From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

Shuang Liang, Zeqing Wang, Yuxian Li, Xihui Liu, Han Wang

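Affiliations * Department of Electrical and Computer Engineering, The University of Hong Kong; School of Computer Science and Engineering, Sun Yat-sen University; CASIC, The University of Hong Kong

AI Summary This paper investigates whether promptable segmentation models truly understand the concepts they segment, rather than relying on visually salient but semantically misleading cues. The authors introduce CAFE, a new benchmark that evaluates concept faithfulness through attribute-level counterfactual manipulation. Experiments show that models can produce accurate masks yet still fail to discriminate concepts under misleading prompts, revealing a systematic gap between localization quality and semantic understanding.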

Comments 30 pages, 8 figures

Abstract

Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE (Counterfactual Attribute Factuality Evaluation), a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.

2605.09584 2026-05-12 cs.CL cs.AI cs.LG

CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics

Aishik Nagar, Arun-Kumar Kaliya-Perumal, Yu-Hsuan Han, Andrew Sheng-Han Huang, Kristen Kee, Yushi Cao, Yiming Chen, Hongchao Jiang

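Affiliations * ASUS Intelligent Cloud Services (AICS); Rehabilitation Research Institute of Singapore, Nanyang Technological University; Department of Family Medicine, Taipei Veterans General Hospital; School of Medicine, National Yang Ming Chiao Tung University; Yong Loo Lin School of Medicine, National University of Singapore

AI Summary CLR-voyance is a framework for reinforcing open-ended reasoning in inpatient clinical decision support. It models clinical reasoning as a Partially Observable Markov Decision Process (POMDP) and supervises it with rewards that are grounded in clinical outcomes and validated by clinicians. By splitting each patient journey into a past visible to the policy and a future visible only to an oracle, it generates verifiable clinical-reasoning rubrics that are used for both training and evaluation. Models trained with CLR-voyance outperform existing frontier models on inpatient clinical reasoning, and the system has been deployed at a partner hospital.

Abstract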

Inpatient clinical reasoning is a sequential decision under partial observability: the clinician sees the admission so far and must choose the next action whose downstream consequences are not yet visible. Existing clinical-LLM evaluations and RL reward signals collapse this into closed-form retrieval, clinical journey leakage, or unanchored LLM-as-judge scoring. We introduce CLR-voyance, a framework that reformulates inpatient reasoning as a Partially Observable Markov Decision Process (POMDP) and supervises it with rewards that are simultaneously outcome-grounded and clinician-validated. We instantiate the formulation as CLR-POMDP, which partitions successful patient journeys into a policy-visible past and an oracle-only future. Using the past information, an oracle LLM generates a case-specific query-answer pair and the first adaptive rubric for clinical reasoning that is verifiable in the future of the patient journey. These rubrics are used for both post-training and evaluation of models for inpatient clinical reasoning. We post-train Qwen3-8B and MedGemma-4B with GRPO followed by model merging, yielding state-of-the-art inpatient clinical reasoning while retaining generalist capabilities. CLR-voyance-8B achieves 84.91% on CLR-POMDP, ahead of frontier medical reasoning models like GPT-5 (77.83%) and MedGemma-27B (66.66%), and has comparable or better performance on existing medical benchmarks. To ensure a clinically meaningful setting, we conduct a large-scale clinician alignment study, where physicians curate per-case rubrics, grade candidate responses, and provide blinded pairwise preferences of model reasoning. This study provides insights on clinical LLM-as-a-judge and clinical preference-model selection, which can inform the community at large. CLR-voyance has been deployed for 6+ months at a partner public hospital, drafting thousands of reasoning-heavy inpatient notes.

2605.09581 2026-05-12 cs.CV

FPGA-Based Hardware Architecture for Contrast Maximization in Event-Based Vision

Michal Filipkowski, Marcin Kowalczyk, Tomasz Kryjak

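Affiliations * AGH University of Krakow, Poland; Embedded Vision Systems Group, Computer Vision Laboratory

AI Summary This paper presents an FPGA-based hardware architecture that implements the Contrast Maximization (CM) algorithm for event-based vision systems. Exploiting the FPGA's parallelism, the design computes the contrast of an image reconstructed from asynchronous event streams and iteratively optimizes it to estimate motion parameters. The paper details the hardware modules and the optimization method, and reports motion parameter estimation more than 200 times faster than CPU and GPU implementations, providing a foundation for real-time motion estimation in high-speed, low-power embedded systems.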

Comments Accepted for ARC 2026

Abstract

This paper presents a hardware architecture that implements the Contrast Maximization (CM) algorithm in Field-Programmable Gate Array (FPGA) resources for event-based vision systems. CM estimates motion parameters by maximizing the contrast of an Image of Warped Events (IWE) reconstructed from asynchronous event streams. Event-based vision sensors generate sparse data with high temporal resolution and low spatial redundancy, which makes them well suited for hardware processing. The deterministic, massively parallel structure of the FPGA is leveraged to design a deeply pipelined architecture capable of high-throughput, energy-efficient processing suitable for real-time embedded applications. This paper details the hardware modules responsible for event warping, contrast computation, and iterative optimization, discusses key implementation decisions, and presents the hardware-aware optimization method used in the design. Experimental results demonstrate a substantial speed and efficiency improvement over CPU- and GPU-based implementations, with motion parameter estimation executing over 200 times faster. To the best of our knowledge, this is the first hardware architecture enabling acceleration of CM algorithm computations. Its performance is evaluated in terms of processing speed, energy efficiency, and hardware resource utilization. The proposed design is validated using an event-based object tracking application. The results confirm that the architecture provides a solid foundation for real-time motion estimation in high-speed, low-power embedded systems.
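To make the Contrast Maximization idea concrete, here is a minimal software sketch under simplifying assumptions: a constant 2D velocity motion model, variance as the contrast measure, and an exhaustive search over a few candidate velocities in place of the iterative optimizer the paper implements in hardware. All function names and the synthetic data are hypothetical and unrelated to the authors' design.

```python
def image_of_warped_events(events, velocity, width, height):
    """Warp each event (x, y, t) back to t = 0 along `velocity` and count hits per pixel."""
    vx, vy = velocity
    iwe = [[0] * width for _ in range(height)]
    for x, y, t in events:
        wx = int(round(x - vx * t))
        wy = int(round(y - vy * t))
        if 0 <= wx < width and 0 <= wy < height:  # drop events warped off the image
            iwe[wy][wx] += 1
    return iwe


def contrast(iwe):
    """Variance of the IWE: large when warped events pile up on sharp edges."""
    pixels = [value for row in iwe for value in row]
    mean = sum(pixels) / len(pixels)
    return sum((value - mean) ** 2 for value in pixels) / len(pixels)


def estimate_velocity(events, candidates, width, height):
    """Pick the candidate velocity whose IWE has the highest contrast (grid search)."""
    return max(
        candidates,
        key=lambda v: contrast(image_of_warped_events(events, v, width, height)),
    )


# Synthetic demo: two vertical edges moving right at 3 pixels per time unit.
events = [(x0 + 3.0 * (k * 0.25), y, k * 0.25)
          for x0 in (20, 40) for y in range(5) for k in range(16)]
candidates = [(float(vx), 0.0) for vx in range(6)]
best = estimate_velocity(events, candidates, width=64, height=8)
print(best)
```

With the correct velocity every event of an edge lands on the same pixel column, so the image is sharp and its variance peaks; wrong velocities smear the events over several columns and lower the contrast.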

2605.09579 2026-05-12 cs.LG cs.AI

Biosignal Fingerprinting: A Cross-Modal PPG-ECG Foundation Model

Zhangdaihong Liu, Chang Liu, Fenglin Liu, Yixuan Chen, Yang Yang, David A. Clifton, Xiao Gu

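Affiliations * Department of Engineering Science, University of Oxford; Oxford Suzhou Centre for Advanced Research; School of Public Health, Shanghai Jiao Tong University

AI Summary This study proposes cross-modal biosignal fingerprints to bridge the gap between electrocardiography (ECG) and photoplethysmography (PPG) in cardiovascular monitoring. A Multi-modal Masked Autoencoder (M2AE) learns compact, transferable latent representations from a large corpus of paired ECG and PPG signals, and these representations can be reused across clinical tasks without task-specific fine-tuning. The method performs strongly on tasks such as cardiovascular disease classification and hypertension detection, and retains that performance with only a single input modality, which suits resource-constrained wearable devices.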

Comments 21 pages, 8 figures, 7 tables

Abstract

Cardiovascular disease remains the leading cause of global mortality, yet scalable cardiac monitoring is hindered by the gap between diagnostic-rich ECG and ubiquitous wearable PPG. Bridging this gap requires representations that are compact, transferable across modalities and devices, and deployable without task-specific retraining. Here we introduce biosignal fingerprints: compact latent representations of cardiovascular state derived from a cross-modal foundation model, the Multi-modal Masked Autoencoder (M2AE), trained on over 3.4 million paired ECG and PPG signals. M2AE integrates modality-specific encoders with a shared bottleneck and dual decoders, jointly optimized using reconstruction and cross-modal contrastive objectives, yielding generalizable fingerprints that retain intra- and inter-modality features. Like a biometric fingerprint, these representations uniquely encode an individual's cardiovascular state in a modality-agnostic, privacy-preserving form reusable across clinical tasks without exposing raw waveform data or requiring model retraining. Across 7 downstream tasks, spanning cross-modal reconstruction, cardiovascular disease classification, hypertension detection, mortality prediction, and demographic inference, biosignal fingerprints achieve competitive or superior performance compared to leading domain-specialist foundation models in frozen settings, including an AUROC of 0.974 for five-class CVD classification and 0.877 for hypertension detection, with a maximum improvement of 27.7% in AUROC across 5 classification tasks. Critically, strong performance is maintained with only a single modality, enabling deployment in resource-constrained, single-sensor environments typical of real-world wearable monitoring, with direct implications for continuous cardiovascular monitoring across clinical and consumer health settings.

2605.09572 2026-05-12 cs.CV cs.AI cs.MM

KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation

Guanyi Du, Lintao Wang, Kun Hu, Ziyang Wang

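Affiliations * School of Computer Science, The University of Sydney; School of Science, Edith Cowan University; School of Computer Science and Digital Technologies, Aston University

AI Summary This study explores Kolmogorov-Arnold Networks (KAN) for generating sign language pose animation from symbolic notation. It proposes KANMultiSign, a multi-scale sequence generator that translates HamNoSys notation into two-dimensional human pose sequences. A coarse-to-fine strategy with multi-scale supervision first produces the overall body structure and then refines hand articulation, while KAN modules integrated into a Transformer backbone model the non-linear mapping from symbols to continuous poses compactly. Experiments on several sign language corpora show lower error than a strong baseline with substantially fewer parameters, and ablations identify multi-scale supervision as the main driver of the gains.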

Comments Accepted at Neurocomputing

Abstract

Sign language production from symbolic notation offers a scalable route to accessible sign animation. We present KANMultiSign, a multi-scale sequence generator that translates HamNoSys notation into two-dimensional human pose sequences. Our framework makes two complementary contributions. First, we introduce a coarse-to-fine generation strategy with multi-scale supervision: the model is first guided by an intermediate body-hand-face scaffold to encourage global structural coherence, and then refines fine-grained hand articulation to improve finger-level detail. Second, we investigate integrating Kolmogorov-Arnold Network modules into a Transformer backbone, using learnable univariate function primitives to model the highly non-linear mapping from discrete phonological symbols to continuous body kinematics with a compact parameterization. Experiments on multiple public corpora spanning Polish, German, Greek, and French sign languages show consistent reductions in dynamic-time-warping-based joint error compared with a strong notation-to-pose baseline, while using substantially fewer parameters. Controlled ablations further indicate that KAN-based variants substantially reduce parameter count while maintaining competitive performance when coupled with multi-scale supervision, rather than serving as the main driver of accuracy gains. These findings position multi-scale supervision as the key mechanism for improving notation-conditioned pose generation, with KAN offering a compact alternative for efficient modeling. Our code will be publicly available.
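The evaluation metric mentioned above compares pose sequences after aligning them in time. The sketch below shows one plain way to compute a dynamic time warping joint error between a predicted and a reference 2D pose sequence. The abstract does not give the exact definition (normalization, joint weighting), so the per-frame mean Euclidean distance and the unnormalized accumulated cost used here are assumptions, and the function names are hypothetical.

```python
import math


def frame_distance(pose_a, pose_b):
    """Mean Euclidean distance between corresponding 2D joints of two frames."""
    return sum(math.dist(p, q) for p, q in zip(pose_a, pose_b)) / len(pose_a)


def dtw_joint_error(predicted, reference):
    """Accumulated frame distance along the best monotonic time alignment."""
    n, m = len(predicted), len(reference)
    inf = float("inf")
    # acc[i][j] = minimal cost of aligning predicted[:i] with reference[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(predicted[i - 1], reference[j - 1])
            acc[i][j] = d + min(acc[i - 1][j],      # predicted frame repeats a reference frame
                                acc[i][j - 1],      # reference frame repeats a predicted frame
                                acc[i - 1][j - 1])  # both sequences advance
    return acc[n][m]
```

Because the alignment may repeat frames, a prediction that signs the same motion more slowly or quickly is not penalized for timing alone; only the remaining joint-position differences contribute to the error.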

2605.09570 2026-05-12 cs.LG

End-to-End Keyword Spotting on FPGA Using Graph Neural Networks with a Neuromorphic Auditory Sensor

Wiktor Matykiewicz, Piotr Wzorek, Kamil Jeziorek, Tomás Muñoz, Antonio Rios-Navarro, Angel Jiménez-Fernández, Tomasz Kryjak

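Affiliations * AGH University of Krakow, Poland; Embedded Vision Systems Group; Computer Vision Laboratory; Robotics and Technology of Computers Lab.; ETSII, EPS, SCORE, I3US, Universidad de Sevilla

AI Summary With the rapid growth of mobile robotics and embedded intelligence, edge platforms increasingly need efficient on-device data processing. This paper presents an end-to-end keyword spotting system on a field-programmable gate array (FPGA), the first to integrate a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single FPGA device, operating directly on event-based audio streams without conventional signal preprocessing. Using a compute-near-memory architecture, the system reaches 87.43% accuracy while processing in real time with low latency and low power consumption.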

Comments Accepted for the ARC 2026 conference

Abstract

With the rapid growth of mobile robotics and embedded intelligence, there is an increasing demand for efficient on-device data processing on edge platforms. A promising research direction is the use of neuromorphic sensors inspired by human sensory systems, which generate sparse, event-based data encoding changes in the environment. In this work, we present the first end-to-end FPGA implementation of a keyword spotting system that integrates a Neuromorphic Auditory Sensor (NAS) and a graph neural network (GNN) on a single FPGA device, enabling real-time processing of raw audio data. The proposed architecture eliminates conventional signal preprocessing and operates directly on event-based audio streams. Leveraging a compute-near-memory network architecture, the system achieves efficient inference with low latency and low power consumption. Experimental results demonstrate an accuracy of 87.43% after quantization on the Google Speech Commands v2 dataset processed through the neuromorphic sensor, with end-to-end latency below 35 µs and average power consumption of 1.12 W. The processed datasets, software models, and hardware modules are available at https://github.com/vision-agh/NAS-GNN-KWS.

2605.09566 2026-05-12 cs.CV

Dual-Path Hyperprior Informed Deep Unfolding Network for Image Compressive Sensing

Tianyi Lu, Wenxue Cui, Shaohui Liu

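Affiliations * Harbin Institute of Technology

AI Summary This paper proposes the Dual-Path Hyperprior Informed Deep Unfolding Network (DPH-DUN) for image compressive sensing reconstruction. The method partitions the measurements into two subsets and uses hyperprior information to guide reconstruction, improving quality across regions with different textures. Its main contributions are lightweight neural modules that generate hyperprior knowledge in multiple domains, together with adaptive step sizes and attention mechanisms that are generated dynamically during reconstruction. Experiments show that the method outperforms existing compressive sensing methods.

Abstract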

Recent Deep Unfolding Networks (DUNs) have significantly advanced Compressive Sensing (CS) by integrating iterative optimization with deep networks. However, existing DUNs still suffer from two challenges: 1) Reliance on a single measurement stream, which limits effective information interaction across distinct measurement subsets. 2) Uniform processing of all image regions, which overlooks varying reconstruction difficulties induced by diverse textures. To address these limitations, a novel Dual-Path Hyperprior Informed Deep Unfolding Network (DPH-DUN) is proposed, which partitions measurements into double subsets to enable hyperprior-guided reconstruction via a dual-path architecture. In the Deep Hyperprior Learning branch, a series of lightweight neural modules are designed to efficiently generate hyperprior knowledge of different domains, enabling collaborative guidance for the CS reconstruction. In the Hyperprior Informed Reconstruction branch, a deep unfolding framework with hyperprior guidance is constructed to iteratively refine reconstruction. Specifically, i) in the gradient descent step, a Hyperprior Informed Step Size Generation network is designed to dynamically generate spatially varying step maps, enabling adaptive fine-grained gradient updates. ii) In the proximal mapping step, two well-designed hyperprior informed attention mechanisms are introduced to dynamically focus on challenging regions via gradient-based hard and soft attentions, facilitating CS reconstruction accuracy. Extensive experiments demonstrate that the proposed DPH-DUN outperforms existing CS methods.

2605.09565 2026-05-12 cs.LG

Online Set Learning from Precision and Recall Feedback

Lee Cohen, Yishay Mansour, Shay Moran, Han Shao

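Affiliations * Stanford University; Tel Aviv University and Google Research; Technion and Google Research; University of Maryland

AI Summary This paper studies online learning of an unknown subset from precision and recall feedback. In each round the learner predicts a subset and receives partial information according to the feedback type (precision or recall), with the goal of maximizing cumulative reward. The authors prove that a hypothesis class is learnable in this setting if and only if it has finite VC dimension, develop algorithms that handle the dependencies induced by the feedback, and obtain regret bounds in both the realizable and agnostic settings, while pointing to several open questions.

Abstract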

We consider the problem of learning an unknown subset $N_\text{target}$ of a domain in an online setting. In each round $t$, the learner predicts a set of items $N_t$ and receives one of two types of feedback, each with equal probability: precision feedback, in which a randomly chosen item from the predicted set $N_t$ is revealed and the learner is told whether it belongs to $N_\text{target}$ (earning a reward if it does), or recall feedback, in which a randomly chosen item from the target set $N_\text{target}$ is revealed and the learner is told whether it belongs to $N_t$ (earning a reward if it does). The goal is to maximize the cumulative reward over time. This simple online set learning problem abstracts a variety of learning scenarios with precision- and recall-type feedback. We show that a hypothesis class (a family of subsets of the domain) is learnable in this setting if and only if it has finite Vapnik-Chervonenkis (VC) dimension, mirroring the classical PAC characterization. However, the resulting algorithmic structure is markedly more intricate: in contrast to standard Probably Approximately Correct (PAC) learning, where the algorithmic landscape is governed by the simple principle of Empirical Risk Minimization (ERM), our partial feedback model can invalidate ERM and even all proper learning rules. We develop algorithms to address the dependencies induced by the feedback, obtaining regret guarantees in both the realizable and agnostic settings. Our results provide a qualitative characterization of learnability in this model, addressing its most basic question, while pointing to a range of natural and intriguing open questions, including the determination of optimal regret rates.
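The feedback protocol in this abstract is simple enough to simulate directly. The sketch below plays a single round and also computes the expected per-round reward, which works out to the average of the precision and recall of the predicted set. The function names are hypothetical, items are assumed to be sortable, uniform sampling is assumed for the "randomly chosen item", and the handling of an empty set (no item revealed, zero reward) is an assumption, since the abstract does not cover that case.

```python
import random


def feedback_round(predicted, target, rng):
    """Simulate one round; returns (feedback_kind, revealed_item, reward)."""
    kind = rng.choice(["precision", "recall"])  # each type with probability 1/2
    if kind == "precision":
        pool, other = predicted, target   # item drawn from the predicted set
    else:
        pool, other = target, predicted   # item drawn from the target set
    if not pool:
        return kind, None, 0              # assumption: nothing to reveal, no reward
    item = rng.choice(sorted(pool))       # uniform draw from the pool
    reward = 1 if item in other else 0
    return kind, item, reward


def expected_reward(predicted, target):
    """Expected reward of one round: the average of precision and recall."""
    overlap = len(predicted & target)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(target) if target else 0.0
    return 0.5 * precision + 0.5 * recall
```

For example, predicting `{1, 2, 3, 4}` against a target of `{1, 2}` has precision 0.5 and recall 1.0, so the expected reward per round is 0.75; the learner only ever sees one sampled item per round, never these quantities themselves.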

2605.09554 2026-05-12 cs.CL cs.CV

Towards Compact Sign Language Translation: Frame Rate and Model Size Trade-offs

Kuanwei Chen, Mengfeng Tsai

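Affiliations * Computer Science and Information Engineering, National Central University, Zhongli, Taiwan

AI Summary This paper studies the trade-off between frame rate and model size in sign language translation (SLT), aiming at more compact and efficient systems. The authors propose a lightweight 77M-parameter pipeline that couples MMPose skeletal pose extraction with a single linear projection into T5-small, and vary the input frame rate to cut computation while preserving translation quality. At 12 fps the BLEU-4 score drops only slightly compared with 24 fps, and the model is about one third the size of prior T5-base systems, showing that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.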

Comments 2 pages, 1 figure, 2 tables

Abstract

Sign Language Translation (SLT) converts sign language videos into spoken-language text, bridging communication between Deaf and hearing communities. Current gloss-free approaches rely on large encoder-decoder models, limiting deployment. We propose a compact 77M-parameter pipeline that couples MMPose skeletal pose extraction with a single linear projection into T5-small. By varying the input frame rate, we expose a practical efficiency trade-off: at 12 fps the model halves its sequence length, achieving a 75% reduction in encoder quadratic self-attention computational complexity while incurring only a modest BLEU-4 drop (9.53 vs. 10.06 at 24 fps on How2Sign). Our system is roughly 3x smaller than prior T5-base systems, demonstrating that a lightweight architecture can remain competitive without hierarchical encoders or large-scale models.
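The 75% figure follows from the quadratic dependence of self-attention cost on sequence length. As a brief worked check, assume a clip that yields $n$ frames at 24 fps and therefore $n/2$ frames at 12 fps, and write the attention cost as $C(n) = c\,n^2 d$, where the constant $c$ and the model width $d$ are illustrative symbols not used in the abstract:

```latex
\[
\frac{C(n/2)}{C(n)} = \frac{c\,(n/2)^2 d}{c\,n^2 d} = \frac{1}{4}
\quad\Longrightarrow\quad
1 - \frac{C(n/2)}{C(n)} = \frac{3}{4} = 75\%.
\]
```

This applies only to the quadratic attention term; components whose cost is linear in sequence length, such as the input projection and feed-forward layers, are halved rather than quartered.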

2605.09549 2026-05-12 cs.LG

When Adaptation Fails: A Gradient-Based Diagnosis of Collapsed Gating in Vision-Language Prompt Learning

Yunxuan Fang, Ziwei Zhang, Xinhe Wang

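Affiliations * Beihang University

AI Summary This paper studies why adaptive gating fails in frozen few-shot vision-language prompt learning: adaptive gates and prompt-selection modules often produce constant outputs, contribute weak gradient signals, and underperform fixed prompts. Through systematic experiments, the authors identify two main failure modes, gradient magnitude imbalance and gate degradation, clarify the conditions under which adaptive gating breaks down, and question the practice of indiscriminately adding architectural complexity in parameter-efficient learning.

Abstract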

Adaptive prompting mechanisms have been proposed to enhance vision-language models by dynamically tailoring prompts to inputs. However, in frozen few-shot prompt learning with CLIP-style backbones, we systematically observe that adaptive gates and prompt-selection modules often collapse: they produce nearly constant outputs, contribute negligible gradient signals, and frequently fail to outperform fixed prompts. To further explore this issue, we present a systematic diagnostic study to uncover the underlying causes and conditions of adaptation failure. Through controlled experiments across datasets and multiple prompt learning architectures, we identify two recurring failure modes: gradient magnitude imbalance and gate degradation. Our findings invite a re-examination of indiscriminately adding architectural complexity in parameter-efficient learning and clarify when prompt-level adaptive gating is, and is not, effective in this regime.

2605.09548 2026-05-12 cs.CL

Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

Yihong Liu, Raoyuan Zhao, Michael A. Hedderich, Hinrich Schütze

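Affiliations * Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning (MCML)

AI Summary To address the weaker performance of low-resource languages in multilingual reasoning, this study proposes Crosslingual On-Policy Self-Distillation (COPSD). The same model serves as student and teacher: the student sees only the low-resource-language problem, while the teacher also receives crosslingual context, including the English translation and reference solution. Training minimizes full-distribution token-level divergence on the student's own rollouts, which provides dense supervision. Experiments on 17 low-resource African languages show that COPSD substantially improves mathematical reasoning, outperforms GRPO, and improves answer-format adherence, test-time scaling, and generalization to harder benchmarks.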

Comments preprint

Abstract

Large language models (LLMs) have achieved remarkable progress in mathematical reasoning, but this ability is not equally accessible across languages. Low-resource languages in particular exhibit much lower reasoning performance. To address this, we propose Crosslingual On-Policy Self-Distillation (COPSD), which transfers a model's own high-resource reasoning behavior to low-resource languages. COPSD uses the same model as student and teacher: the student sees only the low-resource problem, while the teacher receives privileged crosslingual context, including the problem translation and reference solution in English. Training minimizes full-distribution token-level divergence on the student's own rollouts, providing dense supervision while avoiding the sparsity and instability of outcome-only reinforcement learning (RL). Experiments on 17 low-resource African languages show that COPSD consistently improves low-resource mathematical reasoning across model sizes and substantially outperforms Group Relative Policy Optimization (GRPO). Further analyses show that COPSD improves answer-format adherence, strengthens test-time scaling, and generalizes to harder multilingual reasoning benchmarks, with especially large gains for lower-resource languages. We make our code and data available at: https://github.com/cisnlp/COPSD.
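A token-level, full-distribution divergence can be sketched without any deep learning framework. The example below averages a per-position KL divergence between the student's and the teacher's next-token distributions along one student rollout. The abstract does not say which divergence COPSD uses, so the choice of KL(student || teacher) is an assumption, as are the function names and the small `eps` smoothing constant.

```python
import math


def token_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) over the vocabulary at one token position."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0  # terms with zero student probability contribute nothing
    )


def sequence_divergence(student_dists, teacher_dists):
    """Average the per-token divergence over every position of a rollout."""
    per_token = [token_kl(s, t) for s, t in zip(student_dists, teacher_dists)]
    return sum(per_token) / len(per_token)
```

Every position in the rollout contributes a signal, which is what "dense supervision" refers to, in contrast to an outcome-only reward that scores the final answer once per sequence.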

2605.09544 2026-05-12 cs.AI

TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

Yize Li, Junzhi Li, Jason Song, Chuxiong Sun, Rui Wang, Changwen Zheng

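Affiliations * University of Chinese Academy of Sciences; Institute of Software, Chinese Academy of Sciences

AI Summary TIDE-Bench is a holistic and efficient benchmark for evaluating tool-integrated reasoning (TIR) methods, designed to address the limited task diversity, diagnostic coverage, and evaluation efficiency of current TIR evaluations. It combines mathematical reasoning and knowledge-intensive QA with two newly designed tasks that probe complex tool invocation and multi-tool coordination. TIDE-Bench also adopts a comprehensive, task-aware evaluation protocol and improves efficiency by filtering for high-quality, discriminative samples. Experiments reveal persistent bottlenecks in tool grounding in current TIR methods.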

Comments 10 pages, 5 figures, 10 tables

Abstract

Tool-integrated reasoning has emerged as a promising paradigm for enhancing large language models with external computation, retrieval, and execution capabilities. However, the field still lacks a high-quality and unified evaluation benchmark, and existing TIR evaluations remain limited in dataset quality, task diversity, diagnostic comprehensiveness, and evaluation efficiency. In this work, we introduce TIDE-Bench, a holistic and efficient benchmark for evaluating TIR methods, featuring three key advantages. First, it provides diverse task settings, combining widely used mathematical reasoning and knowledge-intensive QA tasks with two newly designed tasks, namely the tool-grounded experimental design task and the dynamic interactive task, to probe models' abilities in complex tool invocation and multi-tool coordination. Second, TIDE-Bench adopts a comprehensive yet task-aware evaluation protocol, jointly measuring final answer quality, process reliability, tool-use efficiency, and inference cost across heterogeneous task settings. Third, TIDE-Bench constructs high-quality and discriminative evaluation sets by filtering low-discrimination instances from existing datasets, substantially reducing evaluation cost while focusing on more challenging samples. Extensive experiments on multiple foundation models and TIR methods reveal persistent bottlenecks in tool grounding, offering insights for future TIR research.

2605.09542 2026-05-12 cs.AI

LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs

Rishabh Jakhar, Michel Dumontier, Remzi Celebi

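Affiliations * Institute of Data Science, Department of Advanced Computing Sciences, Maastricht University

AI Summary This study proposes TESSERA, a neuro-symbolic framework that combines knowledge graphs with large language models (LLMs) to compose multi-step mechanistic explanations for drug-disease pairs. LLMs supply local judgements and state evaluations, while Monte Carlo Tree Search (MCTS) handles structured long-horizon search and credit assignment, yielding plausible and diverse explanation paths that stay faithful to curated biology. Experiments on two complementary knowledge graphs recover drug mechanisms, and ablations confirm the contribution of both LLM components.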

Comments Accepted at IJCAI-ECAI 2026. 9 pages (7 content + 2 references), 5 figures, 3 tables. Includes supplementary material (26 pages)

Abstract

Extracting multi-step explanations from knowledge graphs poses a combinatorial challenge requiring both heuristic guidance (as candidates proliferate with depth) and credit assignment (as path quality emerges over extended sequences). Frontier LLMs, strong on knowledge/reasoning benchmarks, offer a compelling source of such heuristics, yet their knowledge comes sans guarantees and compositional performance degrades as chains lengthen. We thus present TESSERA, a 3-part neuro-symbolic framework that uses LLMs in a circumscribed role: for local discriminative judgement rather than autonomous multi-step generation; the knowledge graph then defines the hypothesis space enforcing hard structural constraints, and MCTS coordinates the long-horizon search with principled credit assignment via backpropagation. LLMs perform dual roles as a prior policy biasing exploration and a comparative state evaluator supplying reward signals. Evaluation on drug mechanism elucidation across two complementary knowledge graphs demonstrates fidelity to curated biology while surfacing coherent alternative mechanisms, with ablations confirming discriminative contribution from both LLM components. Beyond its current application, our framework offers a general paradigm for compositional reasoning over structured knowledge.

2605.09539 2026-05-12 cs.CL

TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems

Chen Xu, Yicheng Hu, Ruizi Wang, Xinyu Lin, Wenjie Wang, Dongrui Liu, Fuli Feng

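Affiliations * Carnegie Mellon University; University of Science and Technology of China; National University of Singapore; Shanghai AI Lab

AI Summary This paper proposes TacoMAS, a test-time co-evolution framework for multi-agent systems that dynamically adapts both agent capabilities and the communication topology. Capabilities are updated quickly to handle emerging subtasks, while the topology is adjusted on a slower time scale to preserve coordination stability. On four benchmarks, TacoMAS outperforms nearly 20 multi-agent baselines, with an average improvement of 13.3% over the strongest baseline.

Abstract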

Multi-agent systems (MAS) have emerged as a promising paradigm for solving complex tasks. Recent work has explored self-evolving MAS that automatically optimize agent capabilities or communication topologies. However, existing methods either learn a topology that remains fixed at inference time or adapt only the topology or only the capabilities during inference. We empirically and theoretically show that effective test-time evolution requires jointly adapting both axes, but on different time scales: capabilities should update rapidly to handle emerging subtasks, while the topology should evolve more slowly to preserve coordination stability. We then introduce TacoMAS, a test-time co-evolution framework for dynamic MAS. TacoMAS formulates MAS inference as a task of online graph adaptation, where nodes represent agents with role-specific capabilities and edges define their communication topology. During inference, a fast capability loop updates agent expertise using trajectory-level feedback, while a slow meta-LLM-driven topology loop performs birth-death operations on the MAS, including edge edits, agent addition, and agent removal. We further show that this fast-slow design drives MAS evolution toward a task-conditioned stable equilibrium. Experiments on four benchmarks demonstrate that TacoMAS outperforms nearly 20 multi-agent baselines, achieving an average improvement of 13.3% over the strongest baseline. The code is released at https://github.com/chenxu2-gif/TacoMAS-MultiAgent.

2605.09538 2026-05-12 cs.CV cs.AI cs.RO

PhysHanDI: Physics-Based Reconstruction of Hand-Deformable Object Interactions

Jihyun Lee, Changmin Lee, Donghwan Kim, Tae-Kyun Kim

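Affiliations * School of Computing, KAIST, Daejeon, South Korea

AI Summary PhysHanDI is a physics-based framework for jointly reconstructing 3D interactions between hands and non-rigid objects such as cloth and stuffed animals. It simulates object deformations driven by forces induced from densely reconstructed hand motions, so that the reconstructed object dynamics are physically plausible and consistent with the hand movements. The simulated deformations can in turn refine hand reconstruction through inverse physics. In experiments, PhysHanDI outperforms the state-of-the-art baseline on reconstruction and future prediction.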

Comments Accepted to ICML 2026

Abstract

While existing methods for reconstructing hand-object interactions have made impressive progress, they either focus on rigid or part-wise rigid objects, which limits their ability to model real-world objects (e.g., cloth, stuffed animals) that exhibit highly non-rigid deformations, or model deformable objects without full 3D hand reconstruction. To bridge this gap, we present PhysHanDI (Physics-based Reconstruction of Hand and Deformable Object Interactions), a framework that enables full 3D reconstruction of both interacting hands and non-rigid objects. Our key idea is to physically simulate object deformations driven by forces induced from densely reconstructed 3D hand motions, ensuring that the reconstructed object dynamics are both physically plausible and coherent with the interacting hand movements. Furthermore, we demonstrate that such simulation of object deformations can, in turn, refine and improve hand reconstruction via inverse physics. In experiments, PhysHanDI outperforms the state-of-the-art baseline across reconstruction and future prediction.