arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13054 2026-05-14 cs.LG cs.AI

Bridging Domain Gaps with Target-Aligned Generation for Offline Reinforcement Learning

Minung Kim, Jeongmo Kim, Gwanwoo Choi, Seungyul Han

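Affiliations: Ulsan National Institute of Science and Technology (UNIST)

AI summary: This paper studies cross-domain offline reinforcement learning, in which a policy is adapted from a source domain to a target domain using only pre-collected data, particularly when target-domain data is extremely limited. To address the distribution mismatch between domains, the authors propose the Target-aligned Coverage Expansion (TCE) framework, which uses theoretical analysis to decide how source data should be used: either by directly incorporating transitions close to the target domain or by expanding state coverage through target-aligned generation. Experiments show that TCE significantly outperforms existing offline RL methods across a range of cross-domain environments.

Abstract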

Cross-domain offline reinforcement learning aims to adapt a policy from a source domain to a target domain using only pre-collected datasets, where environment dynamics may differ. A key challenge is to leverage source data while reducing distributional mismatch, particularly when the target dataset is extremely limited. To address this, we propose Target-aligned Coverage Expansion (TCE), a framework that decides how source data should be used, either by directly incorporating target-near transitions or by expanding state coverage through target-aligned generation, guided by theoretical analysis. TCE builds on a dual score-based generative model to synthesize target-consistent transitions over an expanded state region. Extensive experiments across diverse cross-domain environments show that TCE consistently outperforms state-of-the-art cross-domain offline RL baselines.

2605.13049 2026-05-14 cs.CV

Uncertainty-aware Spatial-Frequency Registration and Fusion for Infrared and Visible Images

Xingyuan Li, Haoyuan Xu, Xingyue Zhu, Jun Ma, Yang Zou, Zhiying Jiang, Jinyuan Liu

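Affiliations: Dalian University of Technology; Northwestern Polytechnical University; Dalian Maritime University

AI summary: Infrared and visible image fusion (IVIF) is widely used in complex environments, but fusion under unregistered conditions suffers from inherent misalignment. Existing methods mostly rely on coarse-to-fine prediction of deformation parameters or multi-scale estimation of deformation fields, and overlook the errors that accumulate during registration, which degrades fusion quality. This paper proposes SFRF, a framework that combines registration and fusion in the spatial and frequency domains. By introducing uncertainty estimation and infrared thermal radiation distribution consistency, it handles registration error accumulation in a unified way and makes fusion more robust across both domains. Multi-scale iterative registration and a dual-branch spatial-frequency fusion module yield more accurate alignment and higher-quality image reconstruction.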

Comments 10 pages, 5 figures, 4 tables

Abstract

Infrared and Visible Image Fusion (IVIF) has shown promise in visual tasks under challenging environments, but fusion under unregistered conditions faces inherent misalignments. Current approaches either predict the deformation parameters coarse-to-fine (i.e., coarse registration followed by fine registration) or estimate the deformation fields at multiple scales for registration. Though straightforward, they overlook the cumulative errors in registration, which contaminate the fusion stage and severely deteriorate the resulting images. We introduce the Spatial-Frequency Registration and Fusion (SFRF) framework, which incorporates uncertainty estimation and infrared thermal radiation distribution consistency into a unified pipeline to handle the error accumulation for robust registration and fusion across both spatial and frequency domains. Specifically, SFRF constructs a Multi-scale Iterative Registration (MIR) framework that iteratively refines the deformation field across scales, leveraging uncertainty estimation at each stage to mitigate error accumulation and enhance alignment accuracy dynamically. To ensure the accurate alignment of infrared thermal distributions during registration, thermal radiation distribution consistency is employed as a frequency-domain supervisory signal, promoting global consistency in the frequency domain. Based on the spatial-frequency alignment, SFRF further adopts a Dual-branch Spatial-Frequency Fusion (DSFF) module, which incorporates spatial geometric features and frequency distribution information to reconstruct visually appealing images. SFRF achieves impressive performance across diverse datasets.

2605.13047 2026-05-14 cs.CV cs.AI

Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency

Ziqi Wen, Parsa Madinei, Miguel P. Eckstein

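Affiliations: Department of Computer Science, University of California, Santa Barbara; Department of Psychological and Brain Sciences, University of California, Santa Barbara

AI summary: This study examines how vision-language models (VLMs) differ from human perception in high-level semantic scene understanding. The authors propose Counterfactual Semantic Saliency (CSS), a black-box, model-agnostic method that quantifies an object's importance by measuring the semantic change caused by removing it from the scene. The results show that VLMs over-rely on large objects, objects at the center of the image, and highly salient objects, while relying less than humans on people in the scene, revealing a significant gap between model and human semantic understanding.

Abstract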

Evaluating whether large vision-language models (VLMs) align with human perception for high-level semantic scene comprehension remains a challenge. Traditional white-box interpretability methods are inapplicable to closed-source architectures and passive metrics fail to isolate causal features. We introduce Counterfactual Semantic Saliency (CSS). This black-box, model-agnostic framework quantifies the importance of objects by measuring the semantic shift induced by their causal ablation from a scene. To evaluate AI-human semantic alignment, we tested prominent VLMs against a human psychophysics baseline comprising 16,289 valid responses across 307 complex natural scenes and 1,306 high-fidelity counterfactual variants. Our analysis reveals a pervasive scene comprehension gap: models exhibit an overreliance (relative to humans) on large objects (size bias), objects at the center of the image (center bias), and high saliency objects. In contrast, models rely less on people in the scenes than our human participants to describe the images. A model's size bias is a primary driver explaining variations in model-human semantic divergence. Code and data will be available at https://github.com/starsky77/Counterfactual-Semantic-Saliency.
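
The idea of scoring an object by the semantic shift its removal causes can be illustrated with a small sketch. It assumes that the description of a scene has already been turned into an embedding vector, and it uses one minus cosine similarity as the shift measure. Both choices, and the names `semantic_shift` and `rank_objects`, are illustrative assumptions and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def semantic_shift(original_emb, ablated_emb):
    """Assumed shift measure: 1 - cosine similarity of description embeddings."""
    return 1.0 - cosine_similarity(original_emb, ablated_emb)


def rank_objects(original_emb, ablated_embs):
    """Rank objects by the shift observed when each one is removed.

    ablated_embs maps an object name to the embedding of the description
    produced for the scene with that object removed (hypothetical inputs).
    """
    scores = {name: semantic_shift(original_emb, emb)
              for name, emb in ablated_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings: removing the person changes the description far more
# than removing the lamp does.
original = [1.0, 0.0, 0.0]
ablated = {"person": [0.0, 1.0, 0.0], "lamp": [0.9, 0.1, 0.0]}
ranking = rank_objects(original, ablated)
```

Running the same ranking on model descriptions and on human descriptions would give two orderings per scene that can then be compared.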

2605.13046 2026-05-14 cs.AI

An Agentic LLM-Based Framework for Population-Scale Mental Health Screening

Giuliano Lorenzoni, Paulo Alencar, Donald Cowan

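Affiliations: University of Waterloo

AI summary: This paper proposes an agentic large language model (LLM) framework for population-scale mental health screening. Each processing stage is encapsulated as a LangChain agent driven by explicit policies and proxy-guided evaluation, allowing the framework to process unstructured clinical information and adapt to individual patients. The study demonstrates the framework on transcript-based depression detection and shows that it converges to stable configurations, controls cost, and avoids performance regressions, offering a trustworthy, reproducible, and adaptable approach to mental health screening over large-scale clinical data.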

Comments 8 pages, conference paper presented at IEEE BigData 2025, Macau

Abstract

Mental health disorders affect millions worldwide, and healthcare systems are increasingly overwhelmed by the volume of clinical data generated from electronic records, telemedicine platforms, and population-level screening programs. At the same time, the emergence of novel AI-based approaches in healthcare calls for intelligent frameworks capable of processing domain-specific unstructured clinical information while adapting to patient-specific needs. This paper proposes an agentic framework for building robust LLM-based pipelines, where each stage is encapsulated as a LangChain agent governed by explicit policies and proxy-guided evaluation. Stages are incrementally locked once validated, ensuring that later adaptations cannot overwrite configurations without demonstrated improvement. The proposed framework evolves from feature-level exploration, through proxy-based tuning and freeze/rollback mechanisms, to full orchestration by an Orchestrator Agent that coordinates preprocessing, retrieval, selection, diversity, threshold optimization, and decoding. A proof-of-concept in transcript-based depression detection demonstrates that the framework converges to stable configurations, such as cosine similarity, dynamic Top-k, and threshold 0.75, while controlling evaluation costs and avoiding regressions. These results highlight the potential of agentic AI to enable population-level mental health screening over large clinical datasets, addressing critical challenges in trustworthiness, reproducibility, and adaptability required in healthcare environments.
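
The lock-then-improve rule described above (a validated stage can only be overwritten when a later adaptation demonstrably improves it) can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' LangChain implementation: the `Stage` class, its `propose` method, and the proxy scores in the example are all hypothetical.

```python
class Stage:
    """One pipeline stage whose configuration can be locked after validation."""

    def __init__(self, name, config, score):
        self.name = name
        self.config = config
        self.score = score  # proxy score of the current configuration
        self.locked = False

    def lock(self):
        """Mark the current configuration as validated."""
        self.locked = True

    def propose(self, new_config, new_score, min_gain=0.0):
        """Try to replace the configuration.

        A locked stage accepts the proposal only if the proxy score improves
        by more than min_gain; otherwise the current configuration is kept,
        which plays the role of a rollback. Returns True if accepted.
        """
        if self.locked and new_score <= self.score + min_gain:
            return False
        self.config = new_config
        self.score = new_score
        return True


# Illustrative numbers only: the scores below are made up for the example.
retrieval = Stage("retrieval", {"similarity": "cosine", "threshold": 0.70}, score=0.80)
retrieval.lock()
rejected = retrieval.propose({"similarity": "cosine", "threshold": 0.60}, new_score=0.78)
accepted = retrieval.propose({"similarity": "cosine", "threshold": 0.75}, new_score=0.83)
```

In this toy run the first proposal is refused because it lowers the proxy score, and the second replaces the locked configuration because it raises it.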

2605.13045 2026-05-14 cs.LG cs.CL

Large Language Models Lack Temporal Awareness of Medical Knowledge

Zihan Guan, Qiao Jin, Guangzhi Xiong, Fangyuan Chen, Mengxuan Hu, Qingyu Chen, Yifan Peng, Zhiyong Lu, Anil Vullikanti

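Affiliations: University of Virginia; National Institutes of Health; Dana-Farber Cancer Institute; Yale University; Weill Cornell Medicine

AI summary: Existing methods for evaluating the medical knowledge of large language models (LLMs) mostly rely on static, exam-style benchmarks that do not reflect how medical knowledge changes over time. The authors therefore built TempoMed-Bench, the first benchmark for evaluating the temporal awareness of LLMs in the medical domain. It reveals shortcomings in time-specific medical knowledge, including gradual degradation of knowledge over time, forgetting of outdated knowledge, and temporally inconsistent predictions. The work identifies key challenges in the temporal awareness of medical knowledge in LLMs and suggests directions for future research.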

Comments 35 pages, 18 figures

Abstract

Existing methods for evaluating the medical knowledge of Large Language Models (LLMs) are largely based on atemporal examination-style benchmarks, while in reality, medical knowledge is inherently dynamic and continuously evolves as new evidence emerges and treatments are approved. Consequently, evaluating medical knowledge without a temporal context may provide an incomplete assessment of whether LLMs can accurately reason about time-specific medical knowledge. Moreover, most medical data are historical, requiring the models not only to recall the correct knowledge, but also to know when that knowledge is correct. To bridge the gap, we built TempoMed-Bench, the first-of-its-kind benchmark for evaluating the temporal awareness of LLMs in the medical domain through evolving guideline knowledge. Based on TempoMed-Bench, our evaluation analysis first reveals that LLMs lack temporal awareness in medical knowledge through three key findings: (1) model performance on up-to-date medical knowledge exhibits a gradual linear decline over time rather than a sharp knowledge-cutoff behavior, suggesting that parametric medical knowledge is not strictly bounded by knowledge cutoffs; (2) LLMs consistently struggle more with recalling outdated historical medical knowledge than with up-to-date recommendations: accuracy on historical knowledge is only 25.37% to 53.89% of that on up-to-date knowledge, indicating potential knowledge forgetting effects during training; and (3) LLMs often exhibit temporally inconsistent behaviors, where predictions fluctuate irregularly across neighboring years. We also show that the temporal awareness problem is not easily solved by integrating agentic search tools (changes of -3.15% to 14.14%). This work highlights an important yet underexplored challenge and motivates future research on developing LLMs that can better encode time-specific medical knowledge.

2605.13043 2026-05-14 cs.CL

Adaptive Steering and Remasking for Safe Generation in Diffusion Language Models

Yejin Lee, Yo-Sub Han

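Affiliations: Department of Computer Science, Yonsei University

AI summary: Diffusion language models (DLMs) generate text through iterative denoising and bidirectional refinement, but harmful content produced at intermediate denoising steps can propagate to later steps and make the final output unsafe. This paper proposes an inference-time defense framework based on step-wise intervention during denoising: a contrastive safety direction (SGD) detects harmful semantics, after which the affected tokens are remasked and adaptively steered, improving safety without sacrificing generation quality. Experiments show that the method significantly reduces jailbreak success rates while keeping generation quality close to that of the original model.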

Comments 17 pages, 3 figures

Abstract

Diffusion Language Models (DLMs) provide a promising alternative to autoregressive language models by generating text through iterative denoising and bidirectional refinement. However, this iterative generation paradigm also introduces unique safety vulnerabilities when harmful tokens generated at intermediate denoising steps propagate through subsequent refinement processes and eventually induce unsafe outputs. While there are a few attempts to remedy this issue, they either fail to generate safe outputs or generate safe yet low-quality outputs. This motivates us to propose an inference-time defense framework based on the step-wise intervention during the denoising process, which then improves the safety without compromising the output quality. The key component of our framework is a contrastive safety direction (SGD), a latent direction that captures the semantic boundary between harmful and safe generations. We leverage SGD to assess the alignment of generated tokens with harmful semantics at each denoising step. When harmful alignment is detected, our method remasks the corresponding tokens and resumes the denoising process with adaptive steering, where the steering strength is modulated according to the estimated degree of harmfulness. As a plug-and-play module, our method circumvents the need for additional fine-tuning and can be directly incorporated into off-the-shelf diffusion models. The experimental results show that our approaches reduce jailbreak success rates to 0.64% while preserving generation quality close to the original model performance. This confirms the effectiveness of step-wise intervention for safe diffusion language model generation. Our code is available at https://github.com/leeyejin1231/DLM_Steering_Remasking.
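
The detect, remask, and steer loop described above can be sketched for a single denoising step. The sketch makes several assumptions that the abstract does not state: harmful alignment is measured as a dot product between a token's hidden vector and a unit-length direction, a fixed threshold triggers remasking, and the steering strength grows linearly with the excess over that threshold. The function names and the `[MASK]` symbol are hypothetical.

```python
MASK = "[MASK]"


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def intervene(tokens, hiddens, direction, threshold=0.5, base_strength=1.0):
    """Remask tokens aligned with `direction` and pick a steering strength.

    Returns (new_tokens, strengths). A token whose alignment score exceeds
    `threshold` is replaced by MASK, and its strength is proportional to how
    far the score exceeds the threshold (an assumed modulation rule).
    """
    new_tokens, strengths = [], []
    for tok, h in zip(tokens, hiddens):
        score = dot(h, direction)
        if score > threshold:
            new_tokens.append(MASK)
            strengths.append(base_strength * (score - threshold))
        else:
            new_tokens.append(tok)
            strengths.append(0.0)
    return new_tokens, strengths


def steer(hidden, direction, strength):
    """Move a hidden vector away from `direction` by `strength`."""
    return [h - strength * d for h, d in zip(hidden, direction)]


# Toy step: only the third token aligns strongly with the direction.
direction = [1.0, 0.0]
tokens = ["how", "to", "x"]
hiddens = [[0.1, 0.9], [0.2, 0.3], [0.9, 0.1]]
new_tokens, strengths = intervene(tokens, hiddens, direction)
steered = steer(hiddens[2], direction, strengths[2])
```

In a real sampler the remasked positions would be denoised again in the following steps with the steering applied, which this sketch leaves out.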

2605.13041 2026-05-14 cs.CV

EgoForce: Robust Online Egocentric Motion Reconstruction via Diffusion Forcing

Inwoo Hwang, Donggeun Lim, Hojun Jang, Young Min Kim

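Affiliations: Seoul National University

AI summary: EgoForce is a framework for online reconstruction of long-term full-body motion from noisy egocentric input. It uses a diffusion-based model with a temporally asymmetric noise schedule to cope with the sparse and noisy observations of real-time applications. By modeling temporally evolving uncertainty and denoising incrementally, EgoForce generates stable and coherent full-body motion under strict causal constraints. Experiments show that it outperforms existing online and offline methods in challenging egocentric scenarios.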

Comments Project page: https://inwoohwang.me/EgoForce

Abstract

With recent advances in embodied agents and AR devices, egocentric observations are readily available as input for real-world interactive online applications. However, egocentric viewpoints can only sporadically observe hands, in addition to the estimated head trajectory. We propose EgoForce, an online framework for reconstructing long-term full-body motion from noisy egocentric input. While existing generative frameworks can robustly handle noisy and sparse measurements, they assume a fixed-length observation window is available and are thus not suitable for real-time applications. Faster inference often relies on autoregressive prediction, sacrificing robustness. In contrast, we adopt a diffusion-based method with a temporally asymmetric noise schedule inspired by Diffusion Forcing. Specifically, our approach models temporally evolving uncertainty and incrementally denoises states as new streaming observations arrive. Combined with a noise-robust imputation strategy, EgoForce progressively generates stable and coherent full-body motion under strict causal constraints. Experiments demonstrate that our online framework outperforms existing online and offline methods, enabling long-horizon, full-body motion reconstruction in challenging egocentric scenarios.

2605.13038 2026-05-14 cs.CV cs.AI

CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy

Liangjing Shao, Beilei Cui, Hongliang Ren

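Affiliations: Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China; Shenzhen Loop Area Institute, China

AI summary: This paper presents CoGE, an online monocular geometric estimation framework for colonoscopy that targets the difficulty of depth estimation and scene reconstruction in real scenes. An illumination-aware supervision module based on Retinex theory and a structure-aware perception module based on wavelet decomposition address illumination differences and structural feature extraction in colonoscopy scenes. Experiments show that CoGE, trained only on simulated data, achieves state-of-the-art geometric estimation performance on both simulated and real scenes.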

Comments Early Accepted by MICCAI 2026

Abstract

Geometric estimation, including depth estimation and scene reconstruction, is a crucial technique for colonoscopy that can provide surgeons with 3D spatial perception and navigation. However, geometric ground truth in colonoscopy is difficult to obtain due to the narrow and enclosed space of the colon, and there is a large feature gap between simulated and realistic data caused by artifacts and illumination. In this paper, we present CoGE, a novel framework for online monocular geometric estimation during colonoscopy. First, we propose an illumination-aware supervision module based on the Retinex theory to address illumination diversity across colonoscopy scenes. Second, we propose a structure-aware perception module based on wavelet decomposition to extract common structural and local features of the colon. Both quantitative and qualitative results demonstrate that the proposed model, trained solely on simulated data, achieves state-of-the-art performance in geometric estimation for both simulated and realistic scenes.

2605.13037 2026-05-14 cs.AI

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

Yuxin Liu, Ziang Ye, Yueqing Sun, Mingye Zhu, Jinwei Xiao, Zhuowen Han, Qi GU, Xunliang Cai, Lei Zhang

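Affiliations: University of Science and Technology of China; Meituan; Institute of Automation, Chinese Academy of Sciences; Tianjin University

AI summary: Current interactive LLM agents rely on goal-conditioned stepwise planning, in which environmental understanding is acquired passively during execution, leading to delayed environmental perception and an epistemic bottleneck. This paper proposes the Map-then-Act Paradigm (MAP), which builds a cognitive map of the environment in advance through three stages: global exploration, task-specific mapping, and knowledge-augmented execution, thereby making task execution more efficient. Experiments show that MAP yields significant gains on multiple benchmarks, and that training on MAP-2K, a dataset of MAP trajectories, outperforms training on expert trajectories, suggesting that understanding the environment matters more than imitation.

Abstract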

Current interactive LLM agents rely on goal-conditioned stepwise planning, where environmental understanding is acquired reactively during execution rather than established beforehand. This temporal inversion leads to Delayed Environmental Perception: agents must infer environmental constraints through trial-and-error, resulting in an Epistemic Bottleneck that traps them in inefficient failure cycles. Inspired by human affordance perception and cognitive map theory, we propose the Map-then-Act Paradigm (MAP), a plug-and-play framework that shifts environment understanding before execution. MAP consists of three stages: (1) Global Exploration, acquiring environment-general priors; (2) Task-Specific Mapping, constructing a structured cognitive map; and (3) Knowledge-Augmented Execution, solving tasks grounded on the map. Experiments show consistent gains across benchmarks and LLMs. On ARC-AGI-3, MAP enables frontier models to surpass near-zero baseline performance in 22 of 25 game environments. We further introduce MAP-2K, a dataset of map-then-act trajectories, and show that training on it outperforms expert execution traces, suggesting that understanding environments is more fundamental than imitation.

2605.13034 2026-05-14 cs.CV cs.IR

ViDR: Grounding Multimodal Deep Research Reports in Source Visual Evidence

Zhuofan Shi, Peilun Jia, Baoqin Sun, Haiyang Shen, Sixiong Xie, Yun Ma, Xiang Jing

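Affiliations: School of Software and Microelectronics, Peking University; National Key Laboratory of Data Space Technology and System; School of Software Engineering, Beijing Jiaotong University; College of Computer Science and Electronic Engineering, Hunan University

AI summary: ViDR is a multimodal deep research framework that uses source figures as evidence to produce detailed, well-grounded research reports. It treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, and combines context-aware filtering, outline-aware reranking, and vision-language model analysis to improve the accuracy and relevance of figure evidence. ViDR also introduces the MMR Bench+ evaluation benchmark. Experiments show that it outperforms existing mainstream models in report quality, figure integration, and verifiability, underscoring the importance of source visual evidence in multimodal deep research.

Abstract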

Recent deep research systems have improved the ability of large language models to produce long, grounded reports through iterative retrieval and reasoning. However, most text-centered systems rely mainly on textual evidence, while multimodal systems often retrieve images only weakly or generate charts themselves, leaving source figures underused as evidence. We present ViDR, a multimodal deep research framework that grounds long-form reports in source figures. ViDR treats source figures as retrievable, interpretable, routable, and verifiable evidence objects, while still generating analytical charts when needed. It builds an evidence-indexed outline linking claims to textual and visual evidence, refines noisy web images into source-figure evidence atoms through context-aware filtering, outline-aware reranking, and VLM-based visual analysis, and generates each section with section-specific evidence. ViDR further validates visual references to reduce hallucinated or misplaced figures. We also introduce MMR Bench+, a benchmark for evaluating visual evidence use in deep research reports, covering source-figure retrieval, placement, interpretation, verifiability, and analytical chart generation. Experiments show that ViDR improves overall report quality, source-figure integration, and verifiability over strong commercial and open-source baselines. These results suggest that source visual evidence is important for multimodal deep research, as it strengthens evidential grounding, visual support, and report verifiability.

2605.13030 2026-05-14 cs.LG cs.AI

FeatCal: Feature Calibration for Post-Merging Models

Yanggan Gu, Shuo Cai, Zihao Wang, Wenjun Wang, Yuanyi Wang, Pengkai Wang, Sirui Huang, Su Lu, Jianmin Wu, Hongxia Yang

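Affiliations: The Hong Kong Polytechnic University (PolyU); The Chinese University of Hong Kong; PolyU-Daya Bay Technology and Innovation Research Institute

AI summary: FeatCal is a feature calibration method that addresses the performance drop after model merging. By analyzing the feature drift between the merged model and the expert models, it derives a layer-by-layer calibration strategy that effectively improves the merged model. Using a small amount of calibration data, it adjusts model weights layer by layer in closed form, with no gradient descent or extra modules, preserving the benefits of merging while substantially improving task performance. Experiments show that FeatCal outperforms existing calibration methods on several benchmarks, with better sample efficiency and lower calibration cost.

Abstract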

Model merging combines task experts into one model and avoids joint training, retraining, or deploying many expert models, but the merged model often still underperforms task experts. We study this performance gap through feature drift, the difference between features produced by the merged model and by the expert on the same input. Our theory decomposes this drift into upstream propagation and local mismatch, tracks how it propagates and combines through later layers in forward order, and links final feature drift to output drift. This view motivates FeatCal, which uses a small calibration set to calibrate the merged model weights layer by layer in forward order, reducing feature drift while staying close to merged weights and preserving the benefits of model merging. FeatCal uses an efficient closed-form solution to update model weights, with no gradient descent, iterative optimization, or extra modules. On the main CLIP and GLUE benchmarks, FeatCal beats Surgery and ProbSurgery, the closest post-merging calibration baselines: 85.5% vs. 77.0%/78.8% on CLIP-ViT-B/32 Task Arithmetic (TA) and 85.2% vs. 83.7%/82.2% on FLAN-T5-base GLUE. On CLIP-ViT-B/32, 8 examples per task reach 82.9%, and 256 examples per task take 53 seconds, about 4x faster than both baselines, showing better sample efficiency and lower calibration cost.
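
To see how a layer can be calibrated in closed form while staying close to the merged weights, consider a generic regularized least-squares problem for one linear layer. This is an illustrative assumption and not necessarily the objective used in the paper. Here $X$ holds the layer inputs on the calibration set, $Y$ holds the expert's features for the same inputs, $W_{\mathrm{m}}$ is the merged weight, and $\lambda > 0$ controls how strongly the solution is pulled toward $W_{\mathrm{m}}$; all four symbols are introduced here for illustration.

```latex
\min_{W}\; \lVert XW - Y \rVert_F^2 + \lambda \lVert W - W_{\mathrm{m}} \rVert_F^2
\quad\Longrightarrow\quad
W^{\star} = \left(X^{\top}X + \lambda I\right)^{-1}\left(X^{\top}Y + \lambda W_{\mathrm{m}}\right)
```

Setting the gradient $2X^{\top}(XW - Y) + 2\lambda(W - W_{\mathrm{m}})$ to zero gives the solution. As $\lambda \to \infty$ the result stays at $W_{\mathrm{m}}$, and as $\lambda \to 0$ it approaches the ordinary least-squares fit to the expert features, so no gradient descent or iterative optimization is needed for this form.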

2605.13028 2026-05-14 cs.RO cs.SY eess.SY

Local Conformal Calibration of Dynamics Uncertainty from Semantic Images

Luís Marques, Dmitry Berenson

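Affiliations: Robotics Department, University of Michigan

AI summary: This paper proposes OCULAR, a conformal prediction-based algorithm that locally calibrates dynamics uncertainty from semantic images, providing uncertainty quantification guarantees for unseen test environments. Using data from visually similar environments, it provably calibrates a linear Gaussian dynamics model of arbitrary fidelity, guaranteeing that prediction regions contain future system states with at least a user-set probability despite stochastic disturbances and model bias. The method makes no strong assumptions about the true system dynamics and can distinguish inputs that lead to different levels of uncertainty, which supports probabilistically safe planning. Its effectiveness is validated in several experimental settings.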

Comments 26 pages, 8 figures. Accepted to the 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR) 2026

Abstract

We introduce Observation-aware Conformal Uncertainty Local-Calibration (OCULAR), a conformal prediction-based algorithm that uses perception information to provide uncertainty quantification guarantees for unseen test-time environments. While previous conformal approaches lack the ability to discriminate between state-action space regions leading to higher or lower model mismatch, and require environment-specific data, our method uses data collected from visually similar environments to provably calibrate a given linear Gaussian dynamics model of arbitrary fidelity. The prediction regions generated from OCULAR are guaranteed to contain the future system states with, at least, a user-set likelihood, despite both aleatoric and epistemic uncertainty -- i.e., uncertainty arising from both stochastic disturbances and lack of data. Our guarantees are non-asymptotic and distribution-free, not requiring strong assumptions about the unknown real system dynamics. Our calibration procedure enables distinguishing between observation-velocity-action inputs leading to higher and lower next-state-uncertainty, which is helpful for probabilistically-safe planning. We numerically validate our algorithm on a double-integrator system subject to random perturbations and significant model mismatch, using both a simplified sensor and a more realistic simulated camera. Our approach appropriately quantifies uncertainty both when in-distribution and out-of-distribution, being comparatively volume-efficient to baselines requiring environment-specific data.
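
For readers unfamiliar with how a conformal method turns calibration errors into a region with a coverage guarantee, the sketch below shows standard split conformal prediction for a scalar prediction error. It is the textbook procedure, not OCULAR's local, observation-aware calibration, and the function names and calibration scores are illustrative.

```python
import math


def conformal_quantile(scores, alpha):
    """Radius that covers a new score with probability at least 1 - alpha.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)) over the sorted
    calibration scores. Returns infinity when there are too few scores for
    the requested level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def in_region(prediction, actual, radius):
    """Check whether the actual value falls inside the prediction interval."""
    return abs(actual - prediction) <= radius


# Calibration scores: absolute errors of a dynamics model on held-out data
# (made-up numbers for illustration).
calibration_errors = [0.1, 0.4, 0.2, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9]
radius = conformal_quantile(calibration_errors, alpha=0.2)
covered = in_region(prediction=1.0, actual=1.5, radius=radius)
```

The guarantee holds under exchangeability of calibration and test scores, without assumptions on the error distribution, which is the sense in which such guarantees are distribution-free.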

2605.13027 2026-05-14 cs.CV

PRISM: Prior Rectification and Uncertainty-Aware Structure Modeling for Diffusion-Based Text Image Super-Resolution

Zihang Xu, Xiaoyang Liu, Zheng Chen, Yulun Zhang, Xiaokang Yang

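Affiliations: Shanghai Jiao Tong University

AI summary: This paper proposes PRISM, a diffusion-based text image super-resolution method that targets the reliability and structural accuracy of generated text detail under severe degradation. It introduces Flow-Matching Prior Rectification (FMPR) and a Structure-guided Uncertainty-aware Residual Encoder (SURE), which improve the reliability of the global text prior and the precision of local stroke boundaries, respectively. Experiments show that PRISM achieves state-of-the-art performance on synthetic and real-world datasets with millisecond-level inference.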

Comments Code is available at https://github.com/faithxuz/PRISM

Abstract

Text image super-resolution (Text-SR) requires more than visually plausible detail synthesis: slight errors in stroke topology may alter character identity and break readability. Existing methods improve text fidelity with stronger recognition-based or generative priors, yet they still face two unresolved challenges under severe degradation: the text condition extracted from low-quality inputs can itself be unreliable, and a plausible global prior does not fully determine fine-grained stroke boundaries. We present PRISM, a single-step diffusion-based Text-SR framework that addresses these two challenges through Flow-Matching Prior Rectification (FMPR) and a Structure-guided Uncertainty-aware Residual Encoder (SURE). FMPR constructs a privileged training-time prior from paired low-quality/high-quality latents and learns a flow matching that transports degraded embeddings toward this restoration-oriented prior space, yielding more accurate and reliable global text guidance. SURE further predicts uncertainty-aware structural residuals to selectively absorb reliable local boundary evidence while suppressing ambiguous stroke cues. Together, these components enable explicit global prior rectification and local structure refinement within a single diffusion restoration pass. Experiments on both synthetic and real-world benchmarks show that PRISM achieves state-of-the-art performance with millisecond-level inference. Our dataset and code will be available at https://github.com/faithxuz/PRISM.

2605.13026 2026-05-14 cs.LG cs.AI cs.CL

Understanding and Accelerating the Training of Masked Diffusion Language Models

Chunsan Hong, Sanghyun Lee, Chieh-Hsin Lai, Satoshi Hayakawa, Yuhta Takida, Yuki Mitsufuji, Seungryong Kim, Jong Chul Ye

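Affiliations: KAIST; Sony AI; University of Tokyo; Sony Group Corporation

AI summary: This paper studies why masked diffusion language models (MDMs) train slowly and proposes an effective way to accelerate training. The analysis finds that the locality bias of language is the main cause of slow training, and the authors propose a training strategy based on bell-shaped time sampling that markedly improves training efficiency. Experiments show that, while maintaining final performance, the method speeds up MDM training on the LM1B benchmark by up to about 4$\times$, and also yields faster improvements in generative perplexity and downstream task performance.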

Comments Preprint

Abstract

Masked diffusion models (MDMs) have emerged as a promising alternative to autoregressive models (ARMs) for language modeling. However, MDMs are known to learn substantially more slowly than ARMs, which may become problematic when scaling MDMs to larger models. Therefore, we ask the following question: how can we accelerate standard MDM training while maintaining its final performance? To this end, we first provide a detailed analysis of why MDM training is slow. We find that the main factor is the locality bias of language: the predictive information for a token is concentrated in nearby positions. We further investigate how this bias slows learning and suggest a simple yet effective remedy: bell-shaped time sampling as a training strategy. Notably, MDMs trained with our training recipe reach the same validation negative log-likelihood (NLL) up to $\sim4\times$ faster than standard training on One Billion Word Benchmark (LM1B). We also show faster improvements in generative perplexity, zero-shot perplexity, and downstream task performance on various benchmarks.
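
The difference between uniform and bell-shaped time sampling can be made concrete with a short sketch. The abstract does not specify the exact distribution, so the truncated Gaussian below, together with its `center` and `width` values, is an assumption chosen only to illustrate the idea of concentrating training on intermediate masking levels.

```python
import random


def sample_time_uniform(rng):
    """Baseline: draw the diffusion time uniformly from [0, 1)."""
    return rng.random()


def sample_time_bell(rng, center=0.5, width=0.15):
    """Draw the diffusion time from a Gaussian truncated to (0, 1).

    Rejection sampling keeps only draws inside the open unit interval, so
    most samples land near `center` and few near 0 or 1.
    """
    while True:
        t = rng.gauss(center, width)
        if 0.0 < t < 1.0:
            return t


rng = random.Random(0)
bell_samples = [sample_time_bell(rng) for _ in range(1000)]
uniform_samples = [sample_time_uniform(rng) for _ in range(1000)]

# Fraction of samples in the middle half of the time range.
bell_mid = sum(0.25 <= t <= 0.75 for t in bell_samples) / len(bell_samples)
uniform_mid = sum(0.25 <= t <= 0.75 for t in uniform_samples) / len(uniform_samples)
```

In a training loop, the sampled time would set the masking ratio for each batch; only the sampler changes, while the model and loss stay as in standard MDM training.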

2605.13025 2026-05-14 cs.LG cs.GT

Offline Two-Player Zero-Sum Markov Games with KL Regularization

Claire Chen, Yuheng Zhang, Xinyu Liu, Zixuan Xie, Shuze Daniel Liu, Nan Jiang

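Affiliations: University of Illinois Urbana-Champaign; California Institute of Technology; University of Virginia; Purdue University; Massachusetts Institute of Technology

AI summary: This paper studies learning Nash equilibria in offline two-player zero-sum Markov games. Unlike existing methods that rely on explicit pessimism to handle distribution shift, the authors prove that KL regularization alone is enough to stabilize learning and guarantee convergence. They introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast $\widetilde{\mathcal{O}}(1/n)$ convergence rate under unilateral concentrability, and design SOS-MD, a practical model-free algorithm based on least-squares value estimation and iterative self-play updates, whose last iterate attains the same statistical rate up to an optimization error that vanishes as the number of self-play iterations $T$ grows.

Abstract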

We study the problem of learning Nash equilibria in offline two-player zero-sum Markov games. While existing approaches often rely on explicit pessimism to address distribution shift, we show that KL regularization alone suffices to stabilize learning and guarantee convergence. We first introduce Regularized Offline Sequential Equilibrium (ROSE), a theoretical framework that achieves a fast $\widetilde{\mathcal{O}}(1/n)$ convergence rate under unilateral concentrability, improving over the standard $\widetilde{\mathcal{O}}(1/\sqrt{n})$ rates in unregularized settings. We then propose Sequential Offline Self-play Mirror Descent (SOS-MD), a practical model-free algorithm based on least-squares value estimation and iterative self-play updates. We prove that the last iterate of SOS-MD attains the same $\widetilde{\mathcal{O}}(1/n)$ statistical rate up to a vanishing optimization error of order $\widetilde{\mathcal{O}}(1/\sqrt{T})$ in the number of self-play iterations $T$.

2605.13021 2026-05-14 cs.LG cs.AI

Rethinking Efficient Graph Coarsening via a Non-Selfishness Principle

Xu Bai, Bin Lu, Kun Zhang, Shengbo Chen, Xinbing Wang, Chenghu Zhou, Meng Jin

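Affiliations: School of Information Science and Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; School of Environment Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Artificial Intelligence, Nanchang University, Nanchang, China; Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, Beijing, China

AI summary: This paper proposes NOPE, an efficient graph coarsening method based on a non-selfishness principle, which addresses the high computational and memory overhead caused by independent node matching in traditional graph coarsening. By prioritizing the collective influence of the neighborhood, it achieves linear memory consumption and near-linear computational complexity. A faster variant, NOPE*, reduces the complexity of interference evaluation from O(δ·d) to O(d) under a local isotropy assumption, which markedly improves efficiency on high-degree nodes. Experiments show that NOPE* is 1.8 to 10 times faster than NOPE and performs well on graph learning tasks, even outperforming LLM-based graph reasoning.

Abstract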

Graph coarsening is a graph dimensionality reduction technique that aims to construct a smaller and more tractable graph while preserving the essential structural and semantic properties of the original graph. However, most existing methods rely on pair-wise similarity matching, where each node independently searches for its best partner based on global information. This selfish matching paradigm incurs substantial computational and memory overhead. To address this problem, we shift to a non-selfishness principle that prioritizes the collective interference of the neighborhood in coarsening, and propose an efficient method named NOPE, which achieves linear memory consumption and near-linear computational complexity in the number of nodes. Furthermore, we derive a faster variant, NOPE*, which reduces $O(\delta \cdot d)$ interference evaluation to $O(d)$ based on the local isotropy assumption, and consequently alleviates the computational bottleneck for high-degree nodes. Experimental results show that NOPE* achieves a 1.8-10$\times$ speedup over NOPE and surpasses almost all baselines with 1-3 orders of magnitude acceleration. Meanwhile, learning on coarsened graphs yields performance comparable to learning on the original graphs, and can even outperform LLM-based graph reasoning owing to compact graph information. The code is available at https://github.com/dazonglian/NOPE-main.

2605.13018 2026-05-14 cs.CV

OCH3R: Object-Centric Holistic 3D Reconstruction

Yi Du, Yang You, Xiang Wan, Leonidas Guibas

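Affiliations: Stanford University

AI summary: OCH3R is a unified object-centric 3D reconstruction framework that, from a single RGB image, simultaneously predicts the 6D pose and a detailed 3D reconstruction of every object in the scene. Its core is a transformer architecture that performs end-to-end inference in a single pass by predicting per-pixel category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians per object. Transforming the predicted Gaussians into canonical space and aligning them with pre-rendered ground truth avoids costly per-image label generation, and the method substantially improves reconstruction accuracy and inference efficiency.

Abstract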

Object-centric scene understanding is a fundamental challenge in computer vision. Existing approaches often rely on multi-stage pipelines that first apply pre-trained segmentors to extract individual objects, followed by per-object 3D reconstruction. Such methods are computationally expensive, fragile to segmentation errors, and scale poorly with scene complexity. We introduce OCH3R, a unified framework for Object-Centric Holistic 3D Reconstruction from a single RGB image. OCH3R performs one forward pass to simultaneously predict all object instances with their 6D poses and detailed 3D reconstructions. The key idea is a transformer architecture that predicts per-pixel attributes, including CLIP-based category embeddings, metric depth, normalized object coordinates (NOCS), and a fixed number of 3D Gaussians representing each object. To supervise these Gaussian reconstructions, we transform them into canonical space using the predicted 6D poses and align them with pre-rendered canonical ground truth, avoiding costly per-image Gaussian label generation. On standard indoor benchmarks, OCH3R achieves state-of-the-art performance across monocular depth estimation, open-vocabulary semantic segmentation, and RGB-only category-level 6D pose estimation, while producing high-fidelity, editable per-object reconstructions. Crucially, inference is fully feed-forward and scales independently of the number of objects, offering orders-of-magnitude speedups over conventional multi-stage pipelines in cluttered scenes.
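
The step of moving predicted Gaussians into canonical space with a predicted pose amounts to inverting a rigid (or similarity) transform. The sketch below applies that inverse to 3D points such as Gaussian centers. It assumes the pose maps canonical coordinates to camera coordinates as x = s R c + t, so the inverse is c = R^T (x - t) / s; this convention, the optional scale, and the name `to_canonical` are assumptions for illustration.

```python
def to_canonical(points, rotation, translation, scale=1.0):
    """Map camera-frame points into an object's canonical frame.

    Assumes the forward pose is x = scale * R @ c + t, so each point is
    mapped back with c = R^T @ (x - t) / scale. `rotation` is a 3x3 nested
    list, `translation` a length-3 list.
    """
    canonical = []
    for p in points:
        d = [p[i] - translation[i] for i in range(3)]
        # Row j of R^T is column j of R.
        canonical.append([
            sum(rotation[i][j] * d[i] for i in range(3)) / scale
            for j in range(3)
        ])
    return canonical


# A 90-degree rotation about the z axis, a translation, and a scale of 2.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]

# The canonical point (1, 0, 0) appears at (1, 4, 3) in the camera frame.
recovered = to_canonical([[1.0, 4.0, 3.0]], R, t, scale=2.0)
```

Once the points are in the canonical frame, they can be compared against a single pre-rendered canonical target per object, regardless of how the object is posed in each image.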

2605.13013 2026-05-14 cs.LG

JEDI: Joint Embedding Diffusion World Model for Online Model-Based Reinforcement Learning

Jing Yu Lim, Rushi Shah, Zarif Ikram, Samson Yu, Haozhe Ma, Tze-Yun Leong, Dianbo Liu

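Affiliations * National University of Singapore

AI summary This paper proposes JEDI, an end-to-end joint-embedding diffusion world model for online model-based reinforcement learning. The model combines JEPA predictive representation learning with a diffusion denoising objective, learning its latent space directly from the diffusion loss and avoiding the reliance on pretrained encoders found in prior methods. JEDI is competitive on Atari100k, outperforms the baseline with separately trained latents where directly comparable, and substantially reduces VRAM use and training and sampling time relative to the pixel diffusion baseline.

Details
English abstract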

Diffusion world models have recently become competitive for online model-based reinforcement learning, but current approaches expose a tension: pixel diffusion is effective but computationally expensive, while the latest latent diffusion approach improves efficiency yet underperforms. The latter also relies on separately trained latents rather than the end-to-end world-model objectives that have driven much of modern MBRL progress. In particular, JEPA-style predictive representation learning has emerged as an especially promising direction for world modeling and MBRL. Concurrently, diffusion-style objectives have gained traction across multiple domains, with iterative refinement as a promising approach for multimodal and stochastic targets. Taken together, these trends motivate Joint Embedding DIffusion (JEDI), the first online end-to-end latent diffusion world model. JEDI learns its latent space directly from the diffusion denoising loss within a JEPA framework, using denoising to learn and predict future latents rather than relying on reconstruction and pretrained models. We provide a theoretical motivation showing that conventional JEPA objectives induce a predictive information bottleneck, and that conditional diffusion denoising admits a closely related predictive-compression decomposition. Empirically, JEDI is competitive on Atari100k and outperforms the baseline with separately trained latents where directly comparable. Relative to the pixel diffusion baseline, JEDI uses 43% less VRAM, samples from the world model over 3$\times$ faster, and trains 2.5$\times$ faster. JEDI also exhibits a markedly different task-level performance profile from the pixel baseline, suggesting that end-to-end predictive latents change more than compute alone.

2605.13010 2026-05-14 cs.CV cs.AI cs.SY eess.SY math.OC

Amortized Guidance for Image Inpainting with Pretrained Diffusion Models

Yilie Huang, Xun Yu Zhou

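Affiliations * Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, USA; Department of Industrial Engineering and Operations Research & Data Science Institute, Columbia University, New York, NY 10027, USA

AI summary This paper studies image inpainting with generative diffusion models and proposes AID, which keeps the pretrained diffusion backbone fixed and trains a small reusable guidance module offline, so that many masked images can be inpainted without per-instance optimization. The problem is formulated as a deterministic guidance problem with a supervised terminal objective; an auxiliary Gaussian formulation yields a randomized problem that is learnable in high dimensions, which leads to a data-driven continuous-time actor-critic algorithm. Experiments show that AID improves the trade-off between inpainting quality and speed over fixed-backbone and amortized inpainting baselines across several datasets and mask types.

Details
English abstract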

We study image inpainting with generative diffusion models. Existing methods typically either train dedicated task-specific models, or adapt a pretrained diffusion model separately for each masked image at deployment. We introduce a middle-ground model, termed Amortized Inpainting with Diffusion (AID), which keeps a pretrained diffusion backbone fixed, trains a small reusable guidance module offline, and then reuses it across masked images without per-instance optimization. We formulate it as a deterministic guidance problem with a supervised terminal objective. To make this problem learnable in high dimensions, we derive an auxiliary Gaussian formulation and prove that solving this randomized problem recovers the optimal deterministic guidance field. This bridge yields a principled continuous-time actor--critic algorithm for learning the guidance module in a fully data-driven manner. Empirically, on AFHQv2 and FFHQ under the pixel EDM pipeline and on ImageNet under the latent EDM2 pipeline, AID consistently improves the quality--speed trade-off over strong fixed-backbone and amortized inpainting baselines across multiple mask types, while adding less than one percent trainable overhead.

2605.13006 2026-05-14 cs.RO cs.MA

Occlusion-Based Object Transportation Around Obstacles With a Swarm of Miniature Robots

Breno Cunha Queiroz, Daniel MacRae

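Affiliations * Faculty of Science and Engineering, Rijksuniversiteit Groningen

AI summary This paper studies how a swarm of miniature robots can transport an object around obstacles. It extends the original occlusion-based strategy with a sub-goal mechanism that lets the robots cooperatively form a chain of visibility to the goal, so that obstacles can be bypassed without communication and with fully decentralised control. Simulated experiments show that the method is robust and general across different starting positions and obstacle shapes.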

Comments 25 pages, 9 figures, 6 tables. Accepted for publication in the journal Swarm Intelligence

Journal ref Swarm Intelligence, 2024

Details
English abstract

Swarm robotics utilises decentralised self-organising systems to form complex collective behaviours built from the bottom-up using individuals that have limited capabilities. Previous work has shown that simple occlusion-based strategies can be effective in using swarm robotics for the task of transporting objects to a goal position. However, this strategy requires a clear line-of-sight between the object and the goal. In this paper, we extend this strategy by allowing robots to form sub-goals, enabling any member of the swarm to establish a wider range of visibility of the goal and ultimately forming a chain of sub-goals between the object and the goal position. We do so while preserving the fully decentralised and communication-free nature of the original strategy and maintaining performance in object-free scenarios. In five sets of simulated experiments, we demonstrate the generalisability of our proposed strategy. Our finite-state machine allows a sufficiently large swarm to transport objects around obstacles that block the goal. The method is robust to varying starting positions and can handle both concave and convex shapes.

2605.12997 2026-05-14 cs.LG

Frequency Bias and OOD Generalization in Neural Operators under a Variable-Coefficient Wave Equation

Runlong Xie, An Luo

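Affiliations * Independent Researcher; School of Statistics, University of Minnesota, MN, USA

AI summary This paper studies frequency bias and out-of-distribution (OOD) generalization of neural operators under a variable-coefficient wave equation. Comparing the Fourier Neural Operator (FNO) and the Deep Operator Network (DeepONet) in structured OOD settings, it finds that FNO's error rises sharply on high-frequency inputs, whereas DeepONet degrades more mildly. The study attributes the difference to how each architecture represents frequency structure, highlighting the limited OOD generalization of current neural operators and the importance of architectural design.

Details
English abstract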

Neural operators learn to map initial conditions to the terminal solution of partial differential equations (PDEs), providing a surrogate for the full operator mapping. This enables rapid prediction across different input configurations. While recent neural operator architectures have demonstrated strong performance on diverse PDE tasks, their behavior under structured distribution shifts remains insufficiently understood. To investigate this, we study operator learning in a wave propagation setting governed by a one-dimensional variable-coefficient wave equation, using two representative architectures, the Fourier Neural Operator (FNO) and the Deep Operator Network (DeepONet). To examine their generalization under distribution shifts, we consider structured out-of-distribution (OOD) settings that independently vary input frequency and coefficient smoothness. The results show that under smoothness shifts, both models maintain stable performance, with FNO achieving lower error. In contrast, under frequency shifts, FNO exhibits a sharp increase in error under unseen high-frequency inputs, whereas DeepONet shows milder degradation despite higher overall error. Our analysis reveals that these differences arise from how each architecture represents and responds to variations in frequency structure. Together, these findings highlight a fundamental gap between strong in-distribution performance and generalization under distribution shifts in operator learning, underscoring the role of architectural representation bias in developing more reliable neural operators for physics-based PDE simulations beyond the training distribution.

2605.12995 2026-05-14 cs.LG

F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking

Rohan Surana, Gagan Mundada, Junda Wu, Xintong Li, Yizhu Jiao, Bowen Jin, Sizhe Zhou, Tong Yu, Ritwik Sinha, Jiawei Han, Jingbo Shang, Julian McAuley

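Affiliations * UC San Diego; University of Illinois at Urbana-Champaign; Adobe Research

AI summary This paper proposes F-GRPO, a unified framework for optimizing generation and ranking, aimed at the utility mismatch that arises when traditional retrieval systems separate candidate generation from ranking. Using factorized group-relative policy optimization, it jointly optimizes candidate generation and ranking within a single language-model backbone, training with an order-invariant coverage reward and a position-aware utility reward. Experiments show that F-GRPO outperforms GRPO, decoupled generation-and-ranking baselines, and supervised alternatives on several benchmarks, with no architectural changes at inference time.

Details
English abstract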

Traditional retrieval pipelines optimize utility through stages of candidate retrieval and reranking, where ranking operates over a predefined candidate set. Large Language Models (LLMs) broaden this into a generative process: given a candidate pool, an LLM can generate a subset and order it within a single autoregressive pass. However, this flexibility introduces a new optimization challenge: the model must search a combinatorial output space while receiving utility feedback only after the full ranked list is generated. Because this feedback is defined over the completed sequence, it cannot distinguish whether a poor result arises from failing to generate a relevant subset or from failing to rank that subset correctly. This credit assignment gap makes end-to-end optimization unstable and sample-inefficient. Existing systems often address this by separating candidate generation from ranking. However, such decoupling remains misaligned with downstream utility because ranking is limited by the candidate set it receives. To bridge this gap, we propose a unified framework that performs both within a single autoregressive rollout and optimizes them end-to-end via factorized group-relative policy optimization (F-GRPO). Our framework factorizes the policy into candidate generation and ranking while sharing a single LLM backbone, and jointly trains them with an order-invariant coverage reward and a position-aware utility reward. To address the resulting phase-specific credit assignment problem, we use separate group-relative advantages for generation and ranking within a two-phase sequence-level objective. Across sequential recommendation and multi-hop question answering benchmarks, F-GRPO improves top-ranked performance over GRPO and decoupled baselines, outperforms supervised alternatives, and remains competitive with strong zero-shot rerankers, with no architectural changes at inference time.
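The phase-specific credit assignment described in the abstract can be illustrated with a minimal sketch. It assumes the usual GRPO normalization (reward minus group mean, divided by group standard deviation) and applies it separately to a coverage reward and a utility reward. The function names, the toy reward values, and the idea that each phase's tokens are weighted by that phase's advantage are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def factorized_advantages(coverage_rewards, utility_rewards):
    """Return one (generation, ranking) advantage pair per rollout.

    Assumption: generation-phase tokens would be weighted by the coverage
    advantage and ranking-phase tokens by the utility advantage.
    """
    gen_adv = group_relative_advantages(coverage_rewards)
    rank_adv = group_relative_advantages(utility_rewards)
    return list(zip(gen_adv, rank_adv))

# Toy group of three rollouts for one prompt (made-up reward values).
coverage = [1.0, 0.5, 0.0]  # order-invariant: did the subset contain relevant items?
utility = [0.2, 0.9, 0.1]   # position-aware: how good was the final ordering?
for gen_a, rank_a in factorized_advantages(coverage, utility):
    print(f"generation advantage {gen_a:+.2f}, ranking advantage {rank_a:+.2f}")
```

In this toy group the first rollout picked a good subset but ordered it poorly, so it receives a positive generation advantage and a negative ranking advantage, which a single sequence-level reward could not express.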

2605.12994 2026-05-14 cs.LG

DP-Muon: Differentially Private Optimization via Matrix-Orthogonalized Momentum

Jihwan Kim, Chenglin Fan

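Affiliations * Seoul National University

AI summary This paper proposes DP-Muon, a differentially private optimization method based on Muon, a matrix-orthogonalized momentum optimizer. It trains privately by clipping per-example gradients, adding Gaussian noise, and then applying momentum and Newton-Schulz orthogonalization. The authors prove that DP-Muon inherits the privacy guarantee of the corresponding subsampled Gaussian accountant and that orthogonalization incurs no additional privacy cost. They also analyze how differential privacy affects Muon's optimization and propose a bias-corrected variant, DP-MuonBC, which further improves utility under the same privacy guarantee.

Comments 26 pages

Details
English abstract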

We study differentially private (DP) training with Muon, a matrix-valued optimizer that updates hidden-layer weights using momentum followed by Newton--Schulz orthogonalization. While DP-SGD is well understood, the interaction between per-example clipping, Gaussian noise, momentum, and nonlinear orthogonalization in Muon has not been systematically analyzed. We formulate DP-Muon, a private Muon procedure that clips per-example matrix gradients, adds Gaussian noise to the clipped lot average, and then applies momentum and Newton--Schulz orthogonalization as post-processing. We prove that DP-Muon inherits the privacy guarantee certified by the corresponding same-lot subsampled Gaussian accountant, with no additional privacy cost from Muon-specific post-processing. On the optimization side, we establish finite-horizon and vanishing stationarity guarantees under per-matrix clipping, with bounds that separate optimization error, clipping residual, privacy noise, and Newton--Schulz approximation error. We further show that the DP-induced bias in Muon arises not in the linear momentum buffer itself, but after the nonlinear Newton--Schulz map, where Gaussian noise induces a matrix-valued heat-smoothing bias. This motivates DP-MuonBC, a bias-corrected variant that removes the leading output-level bias term while preserving the same privacy guarantee. Experiments on E2E and DART show that Muon-style matrix updates improve private fine-tuning, and that DP-MuonBC further improves utility without increasing the privacy budget.
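The order of operations in the abstract (clip, add noise, then momentum and orthogonalization as post-processing) can be sketched in pure Python on small matrices. This is a rough illustration under stated assumptions: the noise is drawn with standard deviation `noise_multiplier * clip_norm` on the summed gradients, as is conventional in DP-SGD, and `newton_schulz` uses the classical cubic iteration rather than whatever polynomial the authors use. All function and parameter names are hypothetical.

```python
import math
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frobenius_norm(a):
    return math.sqrt(sum(x * x for row in a for x in row))

def clip_frobenius(grad, clip_norm):
    """Scale a per-example matrix gradient so its Frobenius norm is <= clip_norm."""
    scale = min(1.0, clip_norm / max(frobenius_norm(grad), 1e-12))
    return [[scale * x for x in row] for row in grad]

def newton_schulz(m, steps=15):
    """Approximate the orthogonal polar factor of m with the classical cubic
    iteration X <- 1.5 X - 0.5 X X^T X (an assumption, not Muon's tuned variant)."""
    norm = max(frobenius_norm(m), 1e-12)
    x = [[v / norm for v in row] for row in m]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(ra, rb)] for ra, rb in zip(x, xxtx)]
    return x

def dp_muon_step(weight, momentum, per_example_grads, clip_norm,
                 noise_multiplier, beta=0.9, lr=0.1, rng=random):
    rows, cols = len(weight), len(weight[0])
    clipped = [clip_frobenius(g, clip_norm) for g in per_example_grads]
    lot = len(clipped)
    # Privatized lot average: sum of clipped gradients plus Gaussian noise.
    noisy_avg = [[(sum(g[i][j] for g in clipped)
                   + rng.gauss(0.0, noise_multiplier * clip_norm)) / lot
                  for j in range(cols)] for i in range(rows)]
    # Everything below reads only noisy_avg, so it is post-processing.
    momentum = [[beta * m + g for m, g in zip(rm, rg)]
                for rm, rg in zip(momentum, noisy_avg)]
    update = newton_schulz(momentum)
    weight = [[w - lr * u for w, u in zip(rw, ru)] for rw, ru in zip(weight, update)]
    return weight, momentum

# Toy usage with two made-up per-example gradients and a seeded noise source.
grads = [[[3.0, 0.0], [0.0, 4.0]], [[0.5, 0.1], [0.0, 0.2]]]
w, m = dp_muon_step([[1.0, 0.0], [0.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]],
                    grads, clip_norm=1.0, noise_multiplier=1.0,
                    rng=random.Random(0))
print(w)
```

Because the momentum buffer and the orthogonalization never touch the raw per-example gradients, the privacy accounting in this sketch depends only on the clipping norm and the noise multiplier, which mirrors the post-processing argument in the abstract.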

2605.12988 2026-05-14 cs.AI cs.CY cs.IR

Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education

Mragisha Jain, Tirth Bhatt, Griffin Pitts, Aum Pandya, Peter Brusilovsky, Narges Norouzi, Arto Hellas, Juho Leinonen, Bita Akram

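Affiliations * North Carolina State University; University of Pittsburgh; University of California, Berkeley; Aalto University

AI summary This paper presents KITE, a retrieval-augmented generation (RAG)-based intelligent tutoring system that supports reasoning and problem-solving in algorithm learning. KITE uses an intent-aware Socratic response strategy to give students targeted hints, guiding questions, and progressive scaffolding, and a multimodal retrieval pipeline to keep its answers aligned with course content. Experiments indicate that KITE produces contextually grounded and pedagogically appropriate responses and improves the accuracy of simulated student models' follow-up answers on algorithm questions. The work contributes a tutoring architecture and an evaluation approach for algorithm education.

Comments Paper accepted to the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), co-located with ACL 2026

Details
English abstract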

Students learning algorithms often need support as they interpret traces, debug reasoning errors, and apply procedures across unfamiliar problem instances. In this paper, we present KITE (Knowledge-Informed Tutoring Engine), a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system designed to serve as a classroom teaching assistant for algorithmic reasoning and problem-solving tasks. KITE uses an intent-aware Socratic response strategy to tailor support to different student needs, responding with targeted hints, guiding questions, and progressive scaffolding intended to strengthen students' algorithmic problem-solving ability. To keep responses aligned with course content, KITE uses a multimodal RAG pipeline that retrieves relevant information from course materials. We evaluate KITE using three forms of assessment: RAGAs-based metrics for response grounding and quality, expert evaluation of pedagogical quality, and a simulated student pipeline in which a weaker language model interacts with KITE across two-turn dialogues and produces revised answers after receiving feedback. Results indicate that KITE produces contextually grounded and pedagogically appropriate responses. Further, using simulated students, KITE's feedback helped the student models produce more accurate follow-up responses on procedural and tracing questions, suggesting that its scaffolding can support algorithmic problem-solving. This work contributes a tutoring architecture and an evaluation approach for assessing retrieval-grounded explanations and scaffolded problem-solving feedback.

2605.12983 2026-05-14 cs.LG cs.CC

Decision Tree Learning on Product Spaces

Arshia Soltani Moakahr, Faraz Ghahremani, Kiarash Banihashem, MohammadTaghi Hajiaghayi

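Affiliations * Department of Computer Science, University of Maryland, College Park, USA

AI summary This paper studies decision tree learning under product distributions and gives a theoretical analysis of the widely used top-down greedy heuristic. The authors extend the guarantees of Blanc et al. for the uniform distribution, proving that under arbitrary product distributions the heuristic still constructs an approximating tree whose size grows at most exponentially in the product of the optimal tree's average depth and maximum depth. They also present a parameter-free algorithm that needs no prior knowledge of the optimal tree's size or depth, which makes it more practical and more widely applicable.

Comments ICML 2026

Details
English abstract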

Decision tree learning has long been a central topic in theoretical computer science, driven by its practical importance. A fundamental and widely used method for decision tree construction is the top-down greedy heuristic, which recursively splits on the most influential variable. Despite its empirical success, theoretical analysis of this heuristic has been limited. A recent breakthrough by Blanc et al. (ITCS, 2020) provided the first rigorous theoretical guarantees for the greedy approach, but only under the uniform distribution. We extend this analysis to the more general and practically relevant setting of arbitrary product distributions. Our main result shows that for any function $f$ computable by an optimal decision tree of size $s$, maximum depth $D_{\text{opt}}$, and average depth $\Delta_{\text{opt}}$, the greedy heuristic constructs an $\varepsilon$-approximating tree whose size grows at most with $\exp\bigl(\Delta_{\text{opt}} D_{\text{opt}} \log(e/\varepsilon)\bigr)$. In the special case where the optimal tree is a full binary tree, this bound improves upon the bound of Blanc et al. and holds under a strictly broader class of distributions. Moreover, we present an algorithm based on the top-down greedy heuristic that is entirely parameter-free -- it requires no prior knowledge of the optimal tree's size or depth -- offering a practical advantage over Blanc et al.'s method.
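To make "recursively splits on the most influential variable" concrete, here is a small sketch for Boolean functions under a product distribution with biases `p[i] = Pr[x_i = 1]`. It assumes one simple notion of influence (the probability, over the other free coordinates, that flipping `x_i` changes `f`) and uses a plain depth-limited recursion computed by exhaustive enumeration; the paper's exact influence definition and growth rule may differ, and all names here are hypothetical.

```python
from itertools import product

def completions(restriction, p):
    """Yield (x, weight) for every assignment that agrees with `restriction`;
    weight is the product-distribution probability of the free coordinates."""
    free = [j for j in range(len(p)) if j not in restriction]
    for bits in product((0, 1), repeat=len(free)):
        x = dict(restriction)
        x.update(zip(free, bits))
        weight = 1.0
        for j, b in zip(free, bits):
            weight *= p[j] if b else 1.0 - p[j]
        yield tuple(x[j] for j in range(len(p))), weight

def influence(f, i, restriction, p):
    """Probability that flipping x_i changes f, over the remaining free coordinates."""
    fixed = {**restriction, i: 0}
    return sum(w for x, w in completions(fixed, p)
               if f(x) != f(x[:i] + (1,) + x[i + 1:]))

def greedy_tree(f, p, restriction=None, max_depth=3):
    """Return a leaf (0 or 1) or a node (variable, subtree_if_0, subtree_if_1)."""
    restriction = restriction or {}
    free = [j for j in range(len(p)) if j not in restriction]
    scores = {i: influence(f, i, restriction, p) for i in free}
    if max_depth == 0 or not scores or max(scores.values()) == 0:
        # Leaf: predict the more probable value of f under the restriction.
        mass_one = sum(w for x, w in completions(restriction, p) if f(x))
        return int(mass_one >= 0.5)
    best = max(free, key=lambda i: scores[i])
    return (best,
            greedy_tree(f, p, {**restriction, best: 0}, max_depth - 1),
            greedy_tree(f, p, {**restriction, best: 1}, max_depth - 1))

def evaluate(tree, x):
    while not isinstance(tree, int):
        var, low, high = tree
        tree = high if x[var] else low
    return tree

# f = x0 AND x1 on three variables; x2 is irrelevant and is never queried.
f = lambda x: x[0] & x[1]
tree = greedy_tree(f, p=[0.5, 0.9, 0.5])
print(tree)  # → (0, 0, (1, 0, 1))
```

With these biases `x0` is split first because flipping it matters whenever `x1 = 1`, which happens with probability 0.9, whereas flipping `x1` matters only with probability 0.5.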

2605.12980 2026-05-14 cs.LG cs.AI

CoRe-Gen: Robust Spectrum-to-Structure Generation under Imperfect Fingerprint Conditions

Tianbo Liu, Chixiang Lu, Jing Hao, Hengyu Zhang, Lifei Wang, Haibo Jiang, Xiaojuan Qi

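Affiliations * The University of Hong Kong; The Chinese University of Hong Kong; Zhejiang Shuren University

AI summary Elucidating molecular structure from tandem mass spectra (MS/MS) is challenging, especially for de novo generation beyond database coverage. This paper proposes CoRe-Gen, which mitigates the generation errors caused by imperfect predicted fingerprints through synthetic-spectrum pretraining of the encoder, frequency-aware fingerprint corruption during decoder training that matches deployment-time noise, and structure-aware autoregressive decoding with chemical constraints. Experiments show that CoRe-Gen sets a new state of the art on NPLIB1 and remains competitive on MassSpecGym while preserving the efficiency of autoregressive decoding, offering a practical and scalable approach to spectrum-to-structure generation under realistic conditions.

Details
English abstract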

Molecular structure elucidation from tandem mass spectra (MS/MS) remains challenging, particularly for de novo generation beyond database coverage. A common approach decomposes the task into spectrum-to-fingerprint prediction followed by fingerprint-to-structure decoding, enabling the use of large-scale molecular corpora. However, at deployment, the decoder relies on predicted rather than oracle fingerprints, introducing structured errors that propagate into generation. This results in a fundamental condition mismatch, where models trained on clean inputs must operate under noisy, biased predictions, especially for long-tail substructures. We present CoRe-Gen, which explicitly addresses this gap. CoRe-Gen improves the intermediate condition via synthetic-spectrum pretraining of the encoder, matches deployment-time noise through frequency-aware fingerprint corruption during decoder training, and mitigates residual errors using structure-aware autoregressive decoding with compositional SELFIES representations, auxiliary structural supervision, and lightweight chemical constraints. Experiments on standard benchmarks show that CoRe-Gen establishes a new state of the art on NPLIB1, achieving 19.54% Top-1 and 29.92% Top-10 exact-match accuracy, while remaining competitive on the more challenging MassSpecGym benchmark. Importantly, CoRe-Gen preserves the efficiency advantages of autoregressive decoding, providing a practical and scalable solution for robust spectrum-to-structure generation under realistic conditions.

2605.12978 2026-05-14 cs.AI

Useful Memories Become Faulty When Continuously Updated by LLMs

Dylan Zhang, Yanshan Lin, Zhengkun Wu, Yihang Sun, Bingxuan Li, Dianqi Li, Hao Peng

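Affiliations * University of Illinois Urbana-Champaign; IIIS, Tsinghua University

AI summary This paper studies the errors that arise when large language models (LLMs) continuously update their memory. It finds that although consolidation can help agents learn, memory utility first rises and then degrades as updates proceed, and can fall below the no-memory baseline. Experiments show that even consolidating from ground-truth solutions can make a model fail on problems it had previously solved, so memory updates should be gated carefully and raw experiences retained as first-class evidence to make agent memory more reliable.

Details
English abstract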

Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.

2605.12975 2026-05-14 cs.AI

Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation

Jiashuo Sun, Jimeng Shi, Yixuan Xie, Saizhuo Wang, Jash Rajesh Parekh, Pengcheng Jiang, Zhiyi Shi, Jiajun Fan, Qinglong Zheng, Peiran Li, Shaowen Wang, Ge Liu, Jiawei Han

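Affiliations * University of Illinois Urbana-Champaign; Hong Kong University of Science and Technology; Texas A&M University

AI summary This paper proposes PyRAG, an executable multi-hop reasoning framework that improves retrieval-augmented generation (RAG) on complex question answering. Unlike conventional natural-language reasoning, PyRAG turns the multi-hop reasoning process into an executable Python program over retrieval and QA tools, which makes intermediate states explicit and provides deterministic feedback. Experiments show that PyRAG consistently outperforms strong baselines on several QA benchmarks, with especially large gains on compositional multi-hop tasks.

Comments 32 pages, 20 figures, 4 tables

Details
English abstract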

Retrieval-Augmented Generation (RAG) has become a standard approach for knowledge-intensive question answering, but existing systems remain brittle on multi-hop questions, where solving the task requires chaining multiple retrieval and reasoning steps. Key challenges are that current methods represent reasoning through free-form natural language, where intermediate states are implicit, retrieval queries can drift from intended entities, and errors are detected by the same model that produces them, making self-reflection an unreliable, ungrounded signal. We observe that multi-hop question answering is a typical form of step-by-step computation, and that this structured process aligns closely with how code-specialized language models are trained to operate. Motivated by this, we introduce PyRAG, a framework that reformulates multi-hop RAG as program synthesis and execution. Instead of free-form reasoning trajectories, PyRAG represents the reasoning process as an executable Python program over retrieval and QA tools, exposing intermediate states as variables, producing deterministic feedback through execution, and yielding an inspectable trace of the entire reasoning process. This formulation further enables compiler-grounded self-repair and execution-driven adaptive retrieval without any additional training. Experiments on five QA benchmarks (PopQA, HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle) show that PyRAG consistently outperforms strong baselines under both training-free and RL-trained settings, with especially large gains on compositional multi-hop datasets. Our code, data and models are publicly available at https://github.com/GasolSun36/PyRAG.
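The idea of writing a multi-hop question as a program over tools can be sketched with a toy example. Everything here is hypothetical: `retrieve` and `answer` are stand-in tools backed by a three-sentence corpus of fictional facts, not the PyRAG API, and a real system would call a retriever and a QA model instead.

```python
# Toy corpus of fictional facts, used only to make the sketch runnable.
CORPUS = [
    "The novel 'Glass Harbor' was written by Mira Oyelaran.",
    "Mira Oyelaran was born in the town of Westmere.",
    "The novel 'Iron Orchard' was written by Tomas Vell.",
]

# Maps a keyword in the question to the phrase that precedes the answer.
RELATIONS = {"wrote": "was written by", "born": "was born in the town of"}

def tokens(text):
    """Lowercase word set with punctuation stripped."""
    return set("".join(c if c.isalnum() else " " for c in text.lower()).split())

def retrieve(query, k=1):
    """Hypothetical retrieval tool: rank passages by word overlap with the query."""
    return sorted(CORPUS, key=lambda p: len(tokens(query) & tokens(p)), reverse=True)[:k]

def answer(question, passages):
    """Hypothetical QA tool: extract the text that follows a known relation phrase."""
    for keyword, phrase in RELATIONS.items():
        if keyword in question.lower():
            for passage in passages:
                if phrase in passage:
                    return passage.split(phrase, 1)[1].strip(" .")
    # A failed hop raises instead of guessing, giving deterministic feedback.
    raise ValueError(f"no answer found for: {question}")

def program():
    """'Where was the author of Glass Harbor born?' as two explicit hops."""
    trace = {}
    q1 = "Who wrote 'Glass Harbor'?"
    trace["author"] = answer(q1, retrieve(q1))
    # The second query is built from the hop-1 variable, so it cannot drift.
    q2 = f"Where was {trace['author']} born?"
    trace["birthplace"] = answer(q2, retrieve(q2))
    return trace

print(program())  # → {'author': 'Mira Oyelaran', 'birthplace': 'Westmere'}
```

Each intermediate result lives in a named variable that can be inspected afterwards, and a hop that cannot be answered surfaces as an exception rather than as an unverified sentence in a reasoning chain.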

2605.12967 2026-05-14 cs.CV

ImageAttributionBench: How Far Are We from Generalizable Attribution?

Tingshu Mou, Zhipeng Wei, Chao Gong, Jingjing Chen, Xingjun Ma

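Affiliations * Fudan University; University of California, Berkeley

AI summary With the rapid progress of generative AI, synthetic images are becoming increasingly realistic and diverse, which poses serious challenges for image provenance and misinformation detection. This paper introduces ImageAttributionBench, a comprehensive dataset of images synthesized by a range of advanced generative models, intended to drive research on more robust and generalizable image attribution. Experiments show that current mainstream attribution methods perform poorly on the dataset, exposing their limitations under semantic shift and image degradation and providing a rigorous benchmark for future work.

Details
English abstract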

The rapid advancement of generative AI has enabled the creation of highly realistic and diverse synthetic images, posing critical challenges for image provenance and misinformation detection. This underscores the urgent need for effective image attribution. However, existing attribution datasets are constrained by limited scale, outdated generation methods, and insufficient semantic diversity, all of which hinder the development of robust and generalizable attribution models. To address these limitations, we introduce ImageAttributionBench, a comprehensive dataset comprising images synthesized by a wide array of advanced generative models with state-of-the-art (SOTA) architectures. Covering multiple real-world semantic domains, the dataset offers rich diversity and scale to support and accelerate progress in image attribution research. To simulate real-world attribution scenarios, we evaluate several SOTA attribution methods on ImageAttributionBench under two challenging settings: (1) training on a standard balanced split and testing on degraded images, and (2) training and testing on semantically disjoint splits. In both cases, current methods exhibit consistently poor performance, revealing significant limitations in their robustness and generalization to unseen semantic content. Our work provides a rigorous benchmark to facilitate the development and evaluation of future image attribution methods.

2605.12966 2026-05-14 cs.AI

Position: Agentic AI System Is a Foreseeable Pathway to AGI

Junwei Liao, Shuai Li, Muning Wen, Jun Wang, Weinan Zhang

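Affiliations * Shanghai Jiao Tong University; Shanghai Innovation Institute; University College London

AI summary This paper questions whether scaling a single model is the only path to artificial general intelligence (AGI) and argues that agentic AI is a necessary paradigm for handling the complex, heterogeneous distribution of real-world tasks. Through theoretical derivations, it contrasts the optimization constraints of monolithic learners with those of agentic systems, argues that agentic AI achieves exponentially better generalization and sample efficiency, discusses the connection to Mixture-of-Experts, and calls for more research on agentic AI.

Comments Accepted by ICML'26 Position Track

Details
English abstract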

Is monolithic scaling the only path to AGI? This paper challenges the dogma that purely scaling a single model is sufficient to achieve Artificial General Intelligence. Instead, we identify Agentic AI as a necessary paradigm for mastering the complex, heterogeneous distribution of real-world tasks. Through rigorous theoretical derivations, we contrast the optimization constraints of monolithic learners against the efficiency of Agentic systems, progressing from simple routing mechanisms to general Directed Acyclic Graph (DAG) topologies. We demonstrate that Agentic AI achieves exponentially superior generalization and sample efficiency. Finally, we discuss the connection to Mixture-of-Experts, reinterpret the instability of current multi-agent frameworks, and call for greater research focus on Agentic AI.