arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.10984 2026-05-13 cs.CV

Principle-Guided Supervision for Interpretable Uncertainty in Medical Image Segmentation

An Sui, Yuzhu Li, Gunter Schumann, Fuping Wu, Xiahai Zhuang

Affiliations: School of Data Science, Fudan University; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University; National Heart and Lung Institute, Imperial College London

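AI summary: This paper studies interpretable uncertainty quantification in medical image segmentation, aiming to make a model's uncertainty estimates agree with how humans understand uncertainty. The authors propose three perception-aligned principles requiring the spatial distribution of uncertainty to reflect the image contrast between structures, the severity of image corruption, and the geometric complexity of anatomical structures. Based on these principles, they design a principle-guided uncertainty supervision framework (PriUS) that uses evidential learning to constrain the uncertainty distribution explicitly during training, and they introduce quantitative metrics for the consistency between predicted uncertainty and the image attributes that induce ambiguity. Experiments on several medical datasets show that PriUS produces more consistent uncertainty estimates while maintaining competitive segmentation performance.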

Comments 14 pages, 8 figures

English abstract

Uncertainty quantification complements model predictions by characterizing their reliability, which is essential for high-stakes decision making such as medical image segmentation. However, most existing methods reduce uncertainty to a scalar confidence estimate, leaving its spatial distribution semantically underconstrained. In this work, we focus on uncertainty interpretability, namely, whether estimated uncertainty behaves in a human-understandable manner with respect to sources of ambiguity. We identify three perception-aligned principles requiring the spatial distribution of uncertainty to reflect: (1) image contrast between structures, (2) severity of image corruption, and (3) geometric complexity in anatomical structures. Accordingly, we develop a principle-guided uncertainty supervision framework (PriUS) based on evidential learning, in which the corresponding supervision objectives are explicitly enforced during training. We further introduce quantitative metrics to measure the consistency between predicted uncertainty and image attributes that induce ambiguity. Experiments on ACDC, ISIC, and WHS datasets showed that, compared with state-of-the-art methods, PriUS produced more consistent uncertainty estimates while maintaining competitive segmentation performance.

2605.10981 2026-05-13 cs.LG cs.AI

$ξ$-DPO: Direct Preference Optimization via Ratio Reward Margin

Zhengyuan Fan, Zhonghua Wu, Yuxuan Du, Qun Chen

Affiliations: School of Computer Science, Northwestern Polytechnical University

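AI summary: This paper proposes $ξ$-DPO, a direct preference optimization method that addresses the difficulty of hyperparameter tuning in SimPO. By recasting the objective as minimizing the distance between reward gaps and optimal margins, and by defining the reward as a ratio between the chosen and rejected responses, $ξ$-DPO removes the dependence on the hyperparameter $β$ and yields a more interpretable and stable margin $ξ$. The method needs no repeated tuning, gives more intuitive control over the relative separation between chosen and rejected responses, and improves the efficiency and interpretability of direct preference optimization.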

English abstract

Reference-free preference optimization has emerged as an efficient alternative to reinforcement learning from human feedback, with Simple Preference Optimization (SimPO) demonstrating strong performance by eliminating the explicit reference model through a simple objective. However, the joint tuning of the hyperparameters $β$ and $γ$ in SimPO remains a central challenge. We argue that this difficulty arises because the margin formulation in SimPO is not easily interpretable across datasets with different reward gap structures. To better understand this issue, we conduct a comprehensive analysis of SimPO and find that $β$ implicitly controls sample filtering, while the effect of $γ$ depends on the reward gap structure of the dataset. Motivated by these observations, we propose $ξ$-DPO: Direct preference optimization via ratio reward margin. We first reformulate the preference objective through an equivalent transformation, changing the optimization target from maximizing the likelihood of reward gaps to minimizing the distance between reward gaps and optimal margins. Then, we redefine the reward in a ratio form between the chosen and rejected responses, which effectively cancels the effect of $β$ and yields a bounded and interpretable margin. This margin is called the ratio reward margin and is denoted by $ξ$. Unlike the margin $γ$ in SimPO, $ξ$ explicitly represents the desired relative separation between chosen and rejected responses and can be determined from the initial reward gap distribution, avoiding repeated trial-and-error tuning. ....
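
For context on the two hyperparameters discussed above, the following is a minimal sketch of the per-pair SimPO objective as it is commonly written: a length-normalized log-likelihood reward scaled by $β$, with a target margin $γ$ subtracted inside a logistic loss. It illustrates the baseline only; the ratio-form reward of $ξ$-DPO is not spelled out in the abstract and is not reproduced here. The function and argument names are illustrative assumptions.

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected, beta, gamma):
    """Per-pair SimPO loss (sketch; names are illustrative).

    logp_* are summed token log-probabilities of each response under the
    policy, len_* are response lengths in tokens.
    """
    # Length-normalized rewards, scaled by beta.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    # Reward gap minus the target margin gamma.
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Because $β$ multiplies the reward gap while $γ$ is subtracted from it, changing one shifts the useful range of the other, which is consistent with the joint-tuning difficulty the abstract describes.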

2605.10980 2026-05-13 cs.LG cs.AI

LEAP: Unlocking dLLM Parallelism via Lookahead Early-Convergence Token Detection

Haohui Zhang, Zhiye Wang, Xiaoying Gan, Xinbing Wang, Bo Jiang

Affiliations: Shanghai Jiao Tong University

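AI summary: This paper proposes LEAP, a method that improves the parallel decoding of diffusion language models (dLLMs) by detecting tokens that converge early. Existing methods rely on high confidence thresholds to preserve accuracy, which limits parallelism. Using future context filtering and multi-sequence superposition, LEAP identifies, without any training, tokens that have already converged to correct predictions so that they can be decoded earlier, which significantly reduces inference latency and the number of decoding steps. Experiments show that LEAP improves decoding efficiency across multiple domains while preserving model accuracy.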

English abstract

Diffusion Language Models (dLLMs) have garnered significant attention for their potential in highly parallel processing. The parallel capabilities of existing dLLMs stem from the assumption of conditional independence at high confidence levels, which ensures negligible discrepancy between the marginal and joint distributions. However, the stringent confidence thresholds required to preserve accuracy severely constrain the scalability of parallelism. Through systematic token-level statistical analysis, we reveal that a substantial proportion of tokens converge to their correct predictions early in the denoising process yet fail to reach standard confidence thresholds, confirming that current confidence-based criteria are overly conservative. In response, we introduce LEAP (Lookahead Early-Convergence Token Detection for Accelerated Parallel Decoding). LEAP is a training-free, plug-and-play method that leverages future context filtering and multi-sequence superposition to detect early-converging tokens. By validating the alignment between early convergence and correctness, we enable reliable early decoding of these tokens. Benchmarking across diverse domains demonstrates that LEAP significantly lowers inference latency and decoding steps. Compared to confidence-based decoding, the average number of denoising steps is reduced by about 30%. On the GSM8K dataset, combining LEAP with dParallel accelerates decoding to 7.2 tokens per step while preserving model precision. LEAP effectively breaks the reliance on high-confidence priors, offering a novel paradigm for parallel decoding.
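
The confidence-based criterion that the abstract calls overly conservative can be sketched as follows. This is a generic illustration of threshold-gated parallel unmasking, not LEAP itself and not any particular dLLM's implementation; the data layout (`probs` as a mapping from position to a probability list) and the function name are assumptions made for the example.

```python
def confident_positions(probs, masked, threshold):
    """Pick the masked positions that may be decoded in this denoising step.

    probs:     {position: [p(token_0), p(token_1), ...]} for the current step
    masked:    positions that are still masked
    threshold: minimum top-1 probability required to commit a token
    Returns {position: token_id} for positions that clear the threshold.
    """
    decoded = {}
    for pos in masked:
        dist = probs[pos]
        # Top-1 token at this position.
        token = max(range(len(dist)), key=dist.__getitem__)
        if dist[token] >= threshold:
            decoded[pos] = token
    return decoded
```

With a threshold of 0.9, a position whose top-1 probability is 0.6 stays masked even if that top-1 token never changes in later steps. Tokens of this kind, already stable but not yet confident, are the ones the abstract reports as a substantial proportion.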

2605.10975 2026-05-13 cs.LG cs.AI

Hierarchical Multi-Scale Graph Neural Networks: Scalable Heterophilous Learning with Oversmoothing and Oversquashing Mitigation

Md Sazzad Hossen, Avimanyu Sahoo

Affiliations: University of Alabama in Huntsville

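AI summary: This paper studies scalable learning for classification on heterophilous graphs, where adjacent nodes carry different labels. To address the aggregation bias, oversmoothing, and oversquashing that existing graph neural networks show on heterophilous data, it proposes HMH, a hierarchical multi-scale graph neural network framework. HMH learns feature- and structure-aware signed affinities, builds a soft graph hierarchy, applies spectral filtering at each level with a sparse orthonormal Haar basis, and combines the levels through skip-connection unpooling layers, which mitigates hub domination and long-range signal squashing. Experiments show that HMH outperforms existing methods on node and graph classification with near-linear time complexity.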

English abstract

Graphs with heterophily, where adjacent nodes carry different labels, are prevalent in real-world applications, from social networks to molecular interactions. However, existing spectral Graph Neural Network (GNN) approaches tailored for heterophilous graph classification suffer from hub-dominated aggregation (a hub being a node with large degree) and oversmoothing, as their suboptimal polynomial filters introduce approximation errors and blend distant signals. To address the degree-biased aggregation and suboptimal polynomial filtering, we introduce Hierarchical Multi-view HAAR (HMH), a novel spectral graph-learning framework that scales in near-linear time. HMH first learns feature- and structure-aware signed affinities via a heterophily-aware encoder, then constructs a soft graph hierarchy guided by these embeddings. At each hierarchical level, HMH constructs a sparse, orthonormal, and locality-aware Haar basis to apply learnable spectral filters in the frequency domain. Finally, skip-connection unpooling layers combine outputs from all hierarchical levels back into the original graph, effectively preventing hub domination and long-range signal bottleneck (over-squashing). Experimentation shows that HMH outperforms state-of-the-art spectral baselines, achieving up to a 3% improvement on node classification and 7 percentage points on graph classification datasets, all while maintaining linear scalability.

2605.10974 2026-05-13 cs.LG cs.AI

Vertex-Softmax: Tight Transformer Verification via Exact Softmax Optimization

Navid Rezazadeh, Arash Gholami Davoodi

Affiliations: University of California, Irvine; Carnegie Mellon University

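AI summary: This paper proposes Vertex-Softmax, a method that tightens certified verification of transformer attention. It solves exactly the optimization of the softmax function under interval constraints on the pre-softmax scores, proves that the optimum is attained at a vertex of the constraint box, and builds on this the Vertex-Softmax primitive, whose complexity is log-linear in the sequence length. Experiments show that the method significantly improves certified rates and tightens lower bounds on several datasets, at a lower computational cost than existing methods.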

English abstract

Certified verification of transformer attention requires bounding the softmax function over interval constraints on the pre-softmax scores. Existing verifiers relax softmax independently of the downstream objective, leaving avoidable slack. We prove that the exact optimum of this score-box problem is attained at a vertex of the constraint box, and establish a threshold structure theorem showing that, after sorting the objective coefficients, the optimum lies among only linearly many candidates, yielding the Vertex-Softmax primitive with log-linear complexity in the sequence length. We further prove a formal optimality result showing that Vertex-Softmax is the tightest sound bound obtainable from score intervals alone, characterizing precisely what additional structure (score correlations, score-value coupling) is needed for further improvement. Integrated into a CROWN (Convex Relaxation based Optimization for Worst-case Neurons)-style verifier with a formal soundness guarantee, Vertex-Softmax significantly improves certified rates and substantially tightens lower bounds across MNIST, Fashion-MNIST, and CIFAR-10 attention models, while consistently matching or outperforming alpha-CROWN and branch-and-bound baselines at a fraction of their cost.
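
The vertex and threshold structure stated above can be illustrated on the scalar problem of minimizing $\sum_i c_i\,\mathrm{softmax}(s)_i$ over a box $l \le s \le u$. The sketch below is one reading of that structure, not the authors' implementation: after sorting the coefficients, it checks the $n+1$ candidates in which the scores of the $k$ smallest coefficients sit at their upper ends and the remaining scores at their lower ends, and it compares the result against brute-force enumeration of all $2^n$ vertices. All function names are hypothetical, and the sketch assumes intervals narrow enough that the exponentials do not underflow.

```python
import itertools
import math


def attention_output(coeffs, scores):
    """Value of sum_i coeffs[i] * softmax(scores)[i]."""
    shift = max(scores)
    weights = [math.exp(s - shift) for s in scores]
    return sum(c * w for c, w in zip(coeffs, weights)) / sum(weights)


def lower_bound_by_thresholds(coeffs, lo, hi):
    """Minimum over the box via n + 1 threshold candidates (sort dominates cost)."""
    order = sorted(range(len(coeffs)), key=lambda i: coeffs[i])
    shift = max(hi)  # assumes lo[i] - shift stays within exp's range
    c = [coeffs[i] for i in order]
    e_lo = [math.exp(lo[i] - shift) for i in order]
    e_hi = [math.exp(hi[i] - shift) for i in order]
    # Candidate 0: every score at its lower end.
    num = sum(ci * e for ci, e in zip(c, e_lo))
    den = sum(e_lo)
    best = num / den
    # Candidate k + 1: the k + 1 smallest coefficients move to their upper end.
    for k in range(len(c)):
        num += c[k] * (e_hi[k] - e_lo[k])
        den += e_hi[k] - e_lo[k]
        best = min(best, num / den)
    return best


def lower_bound_by_enumeration(coeffs, lo, hi):
    """Reference answer: evaluate all 2^n vertices of the box."""
    return min(attention_output(coeffs, v) for v in itertools.product(*zip(lo, hi)))
```

The intuition is that raising a score increases the weighted average only when its coefficient exceeds the current average, so a minimizer puts low-coefficient scores high and high-coefficient scores low. The maximum follows by negating the coefficients.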

2605.10973 2026-05-13 cs.LG cs.AI

Rotation-Preserving Supervised Fine-Tuning

Hangzhan Jin, Tianwei Ni, Lu Li, Pierre-Luc Bacon, Mohammad Hamdaqa, Doina Precup

Affiliations: Mila - Quebec AI Institute; Polytechnique Montréal; Université de Montréal; McGill University; CIFAR AI Chair; Google DeepMind

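AI summary: Supervised fine-tuning (SFT) improves in-domain performance but can harm out-of-domain generalization. This paper proposes Rotation-Preserving Supervised Fine-Tuning (RPSFT), which preserves projected rotations in the singular subspaces of pretrained weight matrices as an efficient proxy for Fisher-sensitive directions, limiting unnecessary weight rotation while preserving task adaptation. Experiments on several models trained on math reasoning data show that RPSFT improves the balance between in-domain and out-of-domain performance, better preserves pretrained representations, and provides better initializations for subsequent reinforcement learning fine-tuning.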

Comments 31 pages, 13 figures

English abstract

Supervised fine-tuning (SFT) improves in-domain performance but can degrade out-of-domain (OOD) generalization. Prior work suggests that this degradation is related to changes in dominant singular subspaces of pretrained weight matrices. However, directly identifying loss-sensitive directions with Hessian or Fisher information is computationally expensive at LLM scale. In this work, we propose preserving projected rotations in pretrained singular subspaces as an efficient proxy for Fisher-sensitive directions, which we call Rotation-Preserving Supervised Fine-Tuning (RPSFT). RPSFT penalizes changes in the projected top-$k$ singular-vector block of each pretrained weight matrix, limiting unnecessary rotation while preserving task adaptation. Across model families and sizes trained on math reasoning data, RPSFT improves the in-domain/OOD trade-off over standard SFT and strong SFT baselines, better preserves pretrained representations, and provides stronger initializations for downstream RL fine-tuning. Code is available at https://github.com/jinhangzhan/RPSFT.

2605.10971 2026-05-13 cs.LG cs.AI cs.CL

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models

Hanhan Zhou, Shamik Roy, Rashmi Gangadharaiah

Affiliations: AWS AI Labs

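AI summary: Discrete diffusion language models (DLMs) generate text by parallel denoising, an alternative to autoregressive generation. This paper shows that controlled generation methods carried over from autoregressive models, which apply a uniform intervention at every denoising step, degrade generation quality, and that the damage is worse when several attributes are controlled jointly. Analysis with trained sparse autoencoders shows that different attributes commit during denoising with different timing, magnitude, and pace. The authors therefore propose an adaptive scheduler that concentrates interventions on the key steps where an attribute is forming, which markedly improves control precision while preserving generation quality, especially in joint multi-attribute control.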

Comments preprint, 47 pages

English abstract

Discrete diffusion language models (DLMs) generate text by iteratively denoising all positions in parallel, offering an alternative to autoregressive models. Controlled generation methods for DLMs, imported from autoregressive models, apply uniform intervention at every denoising step. We show this uniform schedule degrades quality, and the damage compounds when multiple attributes are steered jointly. To diagnose the failure, we train sparse autoencoders on four DLMs (124M-8B parameters) and find that different attributes commit on distinct schedules, varying in timing, sharpness, and magnitude. For instance, topic commits within the first 2% of denoising, whereas sentiment emerges gradually over 20% of the process. Consequently, uniform intervention wastes steering capacity on steps where the target attribute has already solidified or has yet to emerge. We propose a novel adaptive scheduler that concentrates interventions on the steps where an attribute is actively forming and leaves the rest of generation untouched. The cost-control trade-off admits a closed-form characterization: the advantage of adaptive over uniform scheduling is governed by a single dispersion statistic of the commitment distribution. Across four DLMs and seven steering tasks, our method achieves precise control without the degradation typical of uniform interventions. Especially on challenging simultaneous three-attribute control, it reaches up to 93% steering strength, beating the strongest baseline by up to 15 percentage points while preserving generation quality.

2605.10959 2026-05-13 cs.LG cs.AI

QuIDE: Mastering the Quantized Intelligence Trade-off via Active Optimization

Xiantao Jiang

Affiliations: College of Information Engineering, Shanghai Maritime University

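AI summary: There is currently no unified metric for evaluating the efficiency of quantized neural networks. This paper proposes QuIDE, which introduces the Intelligence Index I = (C × P)/log₂(T+1) to collapse the trade-off among compression, accuracy, and latency into a single score. Experiments show a task-dependent Pareto knee: 4-bit quantization works best for MNIST and large language models, while 8-bit quantization suits complex CNN tasks better. QuIDE also provides a reproducible evaluation protocol and a fitness function for mixed-precision search.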

Comments 16 pages, 9 figures

English abstract

There is currently no unified metric for evaluating the efficiency of quantized neural networks. We propose QuIDE, built around the Intelligence Index I = (C x P)/log_2(T+1), which collapses the compression-accuracy-latency trade-off into a single score. Experiments across six settings -- SimpleCNN (MNIST, CIFAR), ResNet-18 (ImageNet-1K), and Llama-3-8B -- show a task-dependent Pareto Knee. 4-bit quantization is optimal for MNIST and large LLMs, while 8-bit is the sweet spot for complex CNN tasks (ResNet-18 on ImageNet), where 4-bit PTQ collapses accuracy catastrophically. The accuracy-gated variant I' correctly flags these non-viable configurations that the raw I would reward. QuIDE provides a reproducible evaluation protocol and a ready-to-use fitness function for mixed-precision search.
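
The Intelligence Index formula quoted above is simple enough to evaluate directly. The sketch below assumes, from the phrase "compression-accuracy-latency trade-off", that C is a compression ratio, P a performance (accuracy) score, and T a latency; the abstract does not state the units or normalization of any of these, so the argument names and the example numbers are illustrative only. The accuracy-gated variant I' is not defined in the abstract and is left out.

```python
import math


def intelligence_index(compression, performance, latency):
    """I = (C * P) / log2(T + 1), with C, P, T interpreted as assumed above."""
    if latency <= 0:
        # log2(T + 1) is zero at T = 0, so the score is undefined there.
        raise ValueError("latency must be positive")
    return (compression * performance) / math.log2(latency + 1)


# Illustrative numbers only: 4x compression, 0.9 accuracy, latency of 3 units.
score = intelligence_index(4.0, 0.9, 3.0)
```

Because latency enters through a logarithm, halving latency raises the score far less than doubling compression does, so under this reading the index rewards compression and accuracy more strongly than speed.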

2605.10865 2026-05-13 cs.AI cs.CV cs.SE

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

Haozhe Zhang, Kaichen Liu, Miaomiao Chen, Lei Li, Shaojie Yang, Cheng Peng, Hanjie Chen

Affiliations: University of Virginia; University of California, San Diego; Rice University

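AI summary: BenchCAD is a comprehensive benchmark for industrial CAD programming that evaluates how well models generate executable parametric CAD programs from visual or textual inputs. It contains 17,900 verified CadQuery programs covering 106 families of industrial parts, and evaluates perception, parametric abstraction, and program synthesis through visual question answering, code question answering, image-to-code generation, and other tasks. Experiments show that current mainstream models can recover the coarse shape of a part but still fall well short of generating faithful parametric CAD programs, for example by missing fine 3D structure and misreading engineering parameters, which highlights where industrial CAD automation needs to improve.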

Comments 9 pages, 7 figures

English abstract

Industrial Computer-Aided Design (CAD) code generation requires models to produce executable parametric programs from visual or textual inputs. Beyond recognizing the outer shape of a part, this task involves understanding its 3D structure, inferring engineering parameters, and choosing CAD operations that reflect how the part would be designed and manufactured. Despite the promise of multimodal large language models (MLLMs) for this task, they are rarely evaluated on whether these capabilities jointly hold in realistic industrial CAD settings. We present BenchCAD, a unified benchmark for industrial CAD reasoning. BenchCAD contains 17,900 execution-verified CadQuery programs across 106 industrial part families, including bevel gears, compression springs, twist drills, and other reusable engineering designs. It evaluates models through visual question answering, code question answering, image-to-code generation, and instruction-guided code editing, enabling fine-grained analysis across perception, parametric abstraction, and executable program synthesis. Across 10+ frontier models, BenchCAD shows that current systems often recover coarse outer geometry but fail to produce faithful parametric CAD programs. Common failures include missing fine 3D structure, misinterpreting industrial design parameters, and replacing essential operations such as sweeps, lofts, and twist-extrudes with simpler sketch-and-extrude patterns. Fine-tuning and reinforcement learning improve in-distribution performance, but generalization to unseen part families remains limited. These results position BenchCAD as a benchmark for measuring and improving the industrial readiness of multimodal CAD automation.

2605.10815 2026-05-13 cs.AI eess.AS

Probing Cross-modal Information Hubs in Audio-Visual LLMs

Jihoo Jung, Chaeyoung Jung, Ji-Hoon Kim, Joon Son Chung

Affiliations: Department of Electrical Engineering, Korea Advanced Institute of Science and Technology; The Graduate School of Advanced Imaging Science, Multimedia & Film, Chung-Ang University, Seoul, Republic of Korea

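AI summary: This paper studies how cross-modal information flows in audio-visual large language models (AVLLMs), focusing on how information is encoded between the audio and visual modalities. Empirical analysis shows that AVLLMs integrate cross-modal information mainly in so-called sink tokens, and that a specific subset of them, termed cross-modal sink tokens, specializes in storing it. Based on this finding, the authors propose a training-free hallucination mitigation method that encourages reliance on the integrated information in cross-modal sink tokens.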

Comments Accepted by ICML 2026

English abstract

Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal-hub.

2605.10780 2026-05-13 cs.CV cs.AI

Beyond the Last Layer: Multi-Layer Representation Fusion for Visual Tokenization

Xuanyu Zhu, Yan Bai, Yang Shi, Yihang Lou, Yuanxing Zhang, Jing Jin, Yuan Zhou

Affiliations: Peking University; Meituan Inc; Tsinghua University; IGDL

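AI summary: This work proposes DRoRAE, a multi-layer representation fusion method that improves feature extraction from vision encoders. Unlike existing methods that use only last-layer features, DRoRAE fuses the features of all intermediate layers through energy-constrained routing and incremental correction, recovering details lost through multiple layers of semantic abstraction. Experiments show that the method markedly improves image reconstruction and generation, and reveal a predictable relationship between representation richness and reconstruction quality, offering a new basis for designing visual tokenizers.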

English abstract

Representation autoencoders that reuse frozen pretrained vision encoders as visual tokenizers have achieved strong reconstruction and generation quality. However, existing methods universally extract features from only the last encoder layer, discarding the rich hierarchical information distributed across intermediate layers. We show that low-level visual details survive in the last layer merely as attenuated residuals after multiple layers of semantic abstraction, and that explicitly fusing multi-layer features can substantially recover this lost information. We propose DRoRAE (Depth-Routed Representation AutoEncoder), a lightweight fusion module that adaptively aggregates all encoder layers via energy-constrained routing and incremental correction, producing an enriched latent compatible with a frozen pretrained decoder. A three-phase decoupled training strategy first learns the fusion under the implicit distributional constraint of the frozen decoder, then fine-tunes the decoder to fully exploit the enriched representation. On ImageNet-256, DRoRAE reduces rFID from 0.57 to 0.29 and improves generation FID from 1.74 to 1.65 (with AutoGuidance), with gains also transferring to text-to-image synthesis. Furthermore, we uncover a log-linear scaling law ($R^2 = 0.86$) between fusion capacity and reconstruction quality, identifying representation richness as a new, predictably scalable dimension for visual tokenizers analogous to vocabulary size in NLP.

2605.10201 2026-05-13 cs.RO cs.AI

HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

Zhenhao Shen, Zeming Yang, Yue Chen, Yuran Wang, Shengqiang Xu, Mingleyang Li, Hao Dong, Ruihai Wu

Affiliations: Peking University; Tianjin University

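AI summary: This work addresses generalizable robot manipulation in interactions between heterogeneous objects, focusing on two core questions: where to manipulate and how to manipulate. It proposes HeteroGenManip, a two-stage framework that decouples the initial grasp from the complex interaction that follows and combines structural priors with a multi-foundation-model diffusion policy, markedly improving robustness and generalization. Experiments show significant gains on a range of simulated and real-world tasks.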

English abstract

Generalizable manipulation involving cross-type object interactions is a critical yet challenging capability in robotics. To reliably accomplish such tasks, robots must address two fundamental challenges: "where to manipulate" (contact point localization) and "how to manipulate" (subsequent interaction trajectory planning). Existing foundation-model-based approaches often adopt end-to-end learning that obscures the distinction between these stages, exacerbating error accumulation in long-horizon tasks. Furthermore, they typically rely on a single uniform model, which fails to capture the diverse, category-specific features required for heterogeneous objects. To overcome these limitations, we propose HeteroGenManip, a task-conditioned, two-stage framework designed to decouple the initial grasp from complex interaction execution. First, a Foundation-Correspondence-Guided Grasp module leverages structural priors to align the initial contact state, thereby significantly reducing the pose uncertainty of grasping. Subsequently, a Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models, integrating fine-grained geometric information with highly variable part features via a dual-stream cross-attention mechanism. Experimental evaluations demonstrate that HeteroGenManip achieves robust intra-category shape and pose generalization. The framework achieves an average 31% performance improvement in simulation tasks with broad type setting, alongside a 36.7% gain across four real-world tasks with different interaction types.

2605.10125 2026-05-13 cs.AI cs.HC

Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research

Anthea Dathe, Kiran Hoffmann, Aline Mangold

Affiliations: Dresden University of Technology

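AI summary: This study evaluates AI tools in academic research, focusing on the usefulness and limitations of question answering and literature review tools. It proposes an evaluation framework that combines human-centered and computer-centered metrics, and finds that question answering tools give useful overviews but are not reliable for precise information extraction, while literature review tools help exploratory search but lack reproducibility and transparency. The study stresses the importance of better explainability in AI tools and notes that integrating AI into research workflows still depends on human verification.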

English abstract

Artificial intelligence (AI) tools are being incorporated into scientific research workflows with the potential to enhance efficiency in tasks such as document analysis, question answering (Q&A), and literature search. However, system outputs are often difficult to verify, lack transparency in their generation, and remain prone to errors. Suitable benchmarks are needed to document and evaluate arising issues. Nevertheless, existing benchmarking approaches do not adequately capture human-centered criteria such as usability, interpretability, and integration into research workflows. To address this gap, the present work proposes and applies a benchmarking framework combining human-centered and computer-centered metrics to evaluate AI-based Q&A and literature review tools for research use. The findings suggest that Q&A tools can offer valuable overviews and generally accurate summaries; however, they are not always reliable for precise information extraction. Explainable AI (xAI) accuracy was particularly low, meaning highlighted source passages frequently failed to correspond to generated answers. This shifted the burden of validation back onto the researcher. Literature review tools supported exploratory searches but showed low reproducibility, limited transparency regarding chosen sources and databases, and inconsistent source quality, making them unsuitable for systematic reviews. A comparison of these tool groups reveals a similar pattern: while AI tools can enhance efficiency in the early stages of the research workflow and shallow tasks, their outputs still require human verification. The findings underscore the importance of explainability features to enhance transparency, verification efficiency and careful integration of AI tools into researchers' workflows. Further, human-centered evaluation remains an important concern to ensure practical applicability.

2605.10119 2026-05-13 cs.LG

Refresh-Scaling the Memory of Balanced Adam

Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí

Affiliations: Universitat Politècnica de València; Universitat Jaume I

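AI summary: This paper studies how to set β, the single remaining hyperparameter of balanced Adam, and argues that it should be treated as a variable controlling the statistical memory horizon rather than as a fixed constant. Introducing the effective learning horizon and the refresh count, the authors find that choosing β so that the refresh count is about 1000 improves robustness across vision and language tasks of different scales. Compared with the best fixed-β baseline, the rule reduces the worst-case relative gap in validation loss by 33.4% and makes model performance markedly more consistent.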

English abstract

Recent evidence suggests that Adam performs robustly when its momentum parameters are tied, $β_1=β_2$, reducing the optimizer to a single remaining parameter. However, how this parameter should be set remains poorly understood. We argue that, in balanced Adam, $β$ should not be treated as a dimensionless constant: it defines a statistical memory horizon $H_β=(1-β)^{-1}$. In terms of the effective learning horizon $T_{\mathrm{ES}}$, estimated from the validation trajectory, we study the refresh count $R_β=(1-β)T_{\mathrm{ES}}$, which measures how many times Adam renews its internal statistics during the useful phase of training. Across 11 vision and language experiments, we find that choosing $β$ so that $R_β\approx1000$ selects different $β$ values depending on the training scale, yet improves robustness over the best fixed-beta baseline. Compared with the strongest fixed choice $β=0.944$, the refresh rule improves worst-case robustness, reducing the maximum relative gap in validation loss by 33.4%, while bringing all 11 runs within 1% of their validation oracle. These results suggest that the remaining hyperparameter of balanced Adam is more naturally viewed as a memory-scale variable than as a fixed constant. This provides a simple budget-aware perspective on optimizer scaling and opens a path toward treating Adam's momentum as part of the learning dynamics rather than as a static default.
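
The two definitions in the abstract, $H_β=(1-β)^{-1}$ and $R_β=(1-β)T_{\mathrm{ES}}$, can be inverted to pick $β$ for a target refresh count: $β = 1 - R/T_{\mathrm{ES}}$. The sketch below does only that arithmetic. It assumes $T_{\mathrm{ES}}$ is measured in optimizer steps and is already known; how the paper estimates it from the validation trajectory is not described in the abstract, and the function names are illustrative.

```python
def memory_horizon(beta):
    """H_beta = 1 / (1 - beta): steps over which Adam's statistics are averaged."""
    return 1.0 / (1.0 - beta)


def refresh_count(beta, t_es):
    """R_beta = (1 - beta) * T_ES: how often the statistics renew within T_ES."""
    return (1.0 - beta) * t_es


def beta_for_refresh(t_es, target_refresh=1000.0):
    """Solve R_beta = target_refresh for beta (tied beta_1 = beta_2)."""
    beta = 1.0 - target_refresh / t_es
    if not 0.0 <= beta < 1.0:
        # T_ES shorter than the target refresh count leaves no valid beta.
        raise ValueError("t_es must be at least target_refresh")
    return beta


# Example: an effective horizon of 100,000 steps (an assumed value).
beta = beta_for_refresh(100_000)
```

For the assumed horizon of 100,000 steps this gives $β = 0.99$ and a memory horizon of about 100 steps; a run ten times shorter would get $β = 0.9$, which shows how the same rule selects different values at different training scales.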

2605.10082 2026-05-13 cs.CL cs.LG

FERA: Uncertainty-Aware Federated Reasoning for Large Language Models

Ruhan Wang, Chengkai Huang, Zhiyong Wang, Junda Wu, Rui Wang, Tong Yu, Julian McAuley, Lina Yao, Dongruo Zhou

Affiliations: Indiana University; The University of New South Wales; The Chinese University of Hong Kong; University of California San Diego; Adobe Research

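AI summary: A key challenge in federated reasoning is how to use high-quality private demonstrations held by multiple parties to improve the multi-step reasoning of large language models without centralized training or raw data sharing. This paper proposes FERA, a training-free, uncertainty-aware federated reasoning framework based on iterative server-client co-refinement: clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is fed back to the clients, progressively improving overall reasoning performance. The method introduces an uncertainty-aware self-critique aggregation mechanism that resolves conflicts among heterogeneous clients, comes with a theoretical analysis of convergence and efficiency, and outperforms existing methods on multiple reasoning benchmarks.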

Comments 44 pages, 8 figures

English abstract

Large language models (LLMs) exhibit strong reasoning capabilities when guided by high-quality demonstrations, yet such data is often distributed across organizations that cannot centralize it due to regulatory, proprietary, or institutional constraints. We study federated reasoning, where a server improves multi-step reasoning by coordinating with heterogeneous clients holding private demonstrations, without centralized training or raw data sharing. The key challenge is that client reliability is query-dependent, while the server cannot inspect client data to determine which contributions are trustworthy. To address this, we propose Uncertainty-Aware Federated Reasoning (FERA), a training-free framework based on iterative server-client co-refinement. Across communication rounds, clients generate reasoning traces with lightweight uncertainty estimates, and the server synthesizes them into improved reasoning that is redistributed as context for the next round, progressively improving both server outputs and client-side reasoning. Within each round, Uncertainty-Aware Self-Critique Aggregation (UA-SCA) resolves conflicts among heterogeneous client traces through query-dependent trust weighting and structured cross-client verification. Rather than simply discarding low-quality traces, UA-SCA revises flawed reasoning steps to recover useful information. We provide theoretical guarantees showing that the proposed iterative protocol converges and that uncertainty-aware weighting accelerates convergence. Experiments on multiple reasoning benchmarks show that FERA consistently outperforms both federated training and training-free baselines, achieving progressively higher accuracy across rounds while maintaining communication and computational efficiency.

2605.10032 2026-05-13 cs.CL

PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning

Sajib Acharjee Dip, Song Li, Liqing Zhang

Affiliations: Department of Computer Science, Virginia Tech; School of Plant and Environmental Sciences, Virginia Tech; Fralin Biomedical Research Institute, Virginia Tech; FBRI Cancer Research Center, Washington, DC

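AI summary: PlantMarkerBench is a multi-species benchmark dataset for evaluating literature-grounded reasoning about plant marker genes. Built with a modular pipeline, it covers four species (Arabidopsis, maize, rice, and tomato) and contains 5,550 sentence-level instances annotated for marker-evidence validity, evidence type, and support strength. The study defines two tasks: deciding whether a sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence type. It aims to advance research on literature-grounded interpretation of biological evidence and trustworthy scientific information extraction.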

English abstract

Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.

2605.09964 2026-05-13 cs.AI q-bio.QM

Learning the Interaction Prior for Protein-Protein Interaction Prediction: A Model-Agnostic Approach

Ziqi Gao, Chenyi Zi, Zijing Liu, Ziqiao Meng, Yu Li, Jia Li

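Affiliations * Tsinghua University; The Hong Kong University of Science and Technology (Guangzhou); National University of Singapore; IDEA Research

AI summary Protein-protein interactions (PPIs) play a key role in cellular function and disease mechanisms. Current learning-based PPI predictors focus on learning protein representations but neglect the design of specialized classification heads, typically relying on generic aggregation methods that lack biological grounding. This paper proposes L3-PPI, a model-agnostic PPI classifier based on the biological "L3 rule". Using L3-path-regularized graph prompt learning, it recasts the classification of protein embedding pairs as a graph-level classification task and improves prediction performance.

Comments Accepted at ICML 2026

Details
Abstract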

Protein-protein interactions (PPIs) are fundamental to cellular function and disease mechanisms. Current learning-based PPI predictors focus on learning powerful protein representations but neglect the design of specialized classification heads. They mainly rely on generic aggregation methods such as concatenation or dot products, which lack biological insight. Motivated by the biological "L3 rule", under which multiple length-3 paths between a pair of proteins indicate their interaction likelihood, our study addresses this gap by designing a biologically informed PPI classifier. In this paper, we provide empirical evidence that popular PPI datasets strongly support the L3 rule. We propose an L3-path-regularized graph prompt learning method called L3-PPI, which generates a prompt graph with virtual L3 paths based on protein representations and controls the number of paths. L3-PPI reformulates the classification of protein embedding pairs as a graph-level classification task over the generated prompt graph. This lightweight module integrates with PPI predictors as a plug-and-play component, injecting the interaction prior of complementarity to enhance performance. Extensive experiments show that L3-PPI yields larger performance gains than advanced competitors.
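
The L3 rule mentioned in the abstract can be illustrated with a small path-counting sketch. This is not the L3-PPI module itself, only the underlying graph statistic: the number of length-3 paths x-u-v-y between two proteins. The degree normalization by sqrt(k_u * k_v) is an assumption borrowed from how the L3 rule is commonly scored in network-based link prediction, and the toy network and the names `build_adjacency` and `l3_scores` are hypothetical.

```python
from math import sqrt


def build_adjacency(edges):
    """Build an undirected adjacency map {protein: set(partners)} from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def l3_scores(adj, x, y):
    """Return (raw L3 path count, degree-normalized L3 score) for the pair (x, y).

    A path x-u-v-y is counted when all four nodes are distinct.
    The normalization 1/sqrt(k_u * k_v) is an assumed, commonly used variant.
    """
    count = 0
    score = 0.0
    for u in adj[x]:
        if u == y:
            continue
        for v in adj[u]:
            if v == x or y not in adj[v]:
                continue
            count += 1
            score += 1.0 / sqrt(len(adj[u]) * len(adj[v]))
    return count, score


# Hypothetical toy network: x and y are not directly linked.
toy_edges = [("x", "a"), ("x", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "y"), ("d", "y")]
toy_adj = build_adjacency(toy_edges)
toy_count, toy_score = l3_scores(toy_adj, "x", "y")
```

In this toy network the candidate pair is connected by the paths x-a-c-y, x-a-d-y, and x-b-c-y, which is the kind of evidence the L3 rule treats as raising interaction likelihood.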

2605.09904 2026-05-13 cs.CV

TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models

Junzhe Chen, Siyuan Meng, Yuxi Chen, Man Zhao, Wenyao Gui, Xiaojie Guo

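Affiliations * Tianjin University

AI summary TOC-Bench is a diagnostic benchmark for evaluating the temporal object consistency of video large language models (Video-LLMs). It is built on object trajectories and structured temporal event timelines, and emphasizes a model's ability to preserve the identity, state, and continuity of the same object across occlusion, disappearance, reappearance, state changes, and cross-object interactions. The study finds that although existing models perform well on general video understanding tasks, they still show notable weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination detection, indicating that temporal object consistency is a key bottleneck for current Video-LLMs.

Details
Abstract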

Video large language models (Video-LLMs) have made strong progress in general video understanding, but their ability to maintain temporal object consistency remains underexplored. Existing benchmarks often emphasize event recognition, action understanding, or coarse temporal reasoning, while rarely testing whether models can preserve the identity, state, and continuity of the same object across occlusion, disappearance, reappearance, state transitions, and cross-object interactions. We introduce TOC-Bench, a diagnostic benchmark for evaluating temporal object consistency in Video-LLMs. TOC-Bench is object-track grounded: each queried subject is linked to a per-frame trajectory and a structured temporal event timeline. To ensure that questions require temporally ordered visual evidence rather than language priors, single-frame shortcuts, or unordered frame cues, we design a three-layer temporal-necessity filtering protocol, which removes 60.7% of candidate QA pairs and retains 17,900 temporally dependent items across 10 diagnostic dimensions. From this pool, we construct a human-verified benchmark with 2,323 high-quality QA pairs over 1,951 videos. Experiments on representative Video-LLMs show that temporal object consistency remains a major unsolved challenge, with notable weaknesses in event counting, event ordering, identity-sensitive reasoning, and hallucination-aware verification, even when models perform well on general video understanding benchmarks. These results suggest that object-centric temporal coherence is a key bottleneck for current Video-LLMs, and that TOC-Bench provides a focused platform for diagnosing and improving object-aware temporal reasoning. The resource is available at https://github.com/cjzcjz666/toc_bench.git.

2605.09598 2026-05-13 cs.CV

SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy

Ismael Elsharkawi, Ahmed Sait, Silvio Giancola, Bernard Ghanem, Hossam Sharara, Abdelrahman Eldesokey

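Affiliations * Department of Computer Science and Engineering, The American University in Cairo; Image And Visual Understanding Lab (IVUL), KAUST

AI summary This paper introduces SoccerLens, a new benchmark for evaluating visual grounding in soccer video understanding, aimed at the problem that existing models may rely on spurious correlations rather than real visual evidence. The benchmark contains annotated video segments covering 13 common soccer events, with visual cues organized into three levels of semantic relevance. The study also extends an attention attribution method and introduces metrics that measure whether model attention aligns with the annotated cues. Results show that state-of-the-art soccer vision-language models achieve limited grounding performance, revealing a substantial gap between prediction accuracy and true visual understanding.

Comments Preprint

Details
Abstract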

Vision-language models (VLMs) have recently shown strong potential in soccer video understanding. However, given the high complexity of soccer videos due to large viewpoint variations, rapid shot transitions, and cluttered scenes, it remains unclear whether VLMs rely on meaningful visual evidence or exploit spurious correlations and shortcut learning. Existing evaluation protocols focus primarily on classification accuracy and do not assess visual grounding. To address this limitation, we introduce SoccerLens, a benchmark for grounded soccer video understanding. The benchmark contains annotated video segments spanning 13 common soccer events, with structured visual cues organized into three levels of semantic relevance. We further extend the attribution method of Chefer [arXiv:2103.15679] to jointly model spatial and temporal attention, and introduce evaluation metrics that measure whether model attention aligns with annotated cues or drifts toward spurious regions. Our evaluation of state-of-the-art soccer VLMs shows that, despite strong classification accuracy, current models fail to exceed 50% grounding performance even under the loosest cue definitions and consistently underutilize temporal information. These results reveal a substantial gap between predictive performance and true visual grounding, highlighting the need for grounded evaluation in complex spatio-temporal domains such as soccer.
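
One simple way to make "attention aligns with annotated cues" concrete is to measure the share of attribution mass that lands on annotated cells. The sketch below is an assumption for illustration only; the abstract does not define the SoccerLens metrics, and the name `grounding_score`, the grid layout, and the toy values are hypothetical.

```python
def grounding_score(attention, cue_mask):
    """Fraction of total attention mass that falls on annotated cue cells.

    attention: list of frames, each a 2D list of non-negative attribution values.
    cue_mask:  same shape, with 1 where an annotated cue is present and 0 elsewhere.
    Returns 0.0 when the attention map is entirely zero.
    """
    total = 0.0
    inside = 0.0
    for att_frame, mask_frame in zip(attention, cue_mask):
        for att_row, mask_row in zip(att_frame, mask_frame):
            for value, is_cue in zip(att_row, mask_row):
                total += value
                if is_cue:
                    inside += value
    return inside / total if total > 0 else 0.0


# Hypothetical two-frame clip on a 2x2 grid; the cue is annotated in frame 0 only.
toy_attention = [[[0.1, 0.4], [0.3, 0.2]], [[0.5, 0.5], [0.0, 0.0]]]
toy_mask = [[[0, 1], [0, 0]], [[0, 0], [0, 0]]]
toy_grounding = grounding_score(toy_attention, toy_mask)
```

Because the sum runs over frames as well as spatial cells, attention placed on the wrong frame lowers the score just as attention on the wrong region does.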

2605.09523 2026-05-13 cs.LG cs.CE cs.NA math.NA physics.comp-ph stat.ML

HS-FNO: History-Space Fourier Neural Operator for Non-Markovian Partial Differential Equations

Lennon J. Shikhman

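Affiliations * College of Computing, Georgia Institute of Technology

AI summary This paper proposes HS-FNO, a History-Space Fourier Neural Operator for non-Markovian partial differential equations. By introducing the lifted state $u_t(\theta,x)$, the method incorporates history information into the model and captures the system dynamics more accurately. HS-FNO decomposes the history-state update into predicting the new time slice and exactly shifting the already-known portion, which reduces the learned output dimension and improves prediction accuracy. Experiments show that HS-FNO outperforms existing methods on several benchmark problems, with a particularly marked error reduction in autoregressive prediction.

Comments 15 pages, 4 figures, 1 table. Code at https://github.com/lennonshikhman/hs-fno/

Details
Abstract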

Neural operators provide fast surrogate models for time-dependent partial differential equations, but their standard autoregressive use usually assumes that the instantaneous field $u(t,\cdot)$ is a complete state. This assumption fails for delay equations, distributed-memory systems, and other non-Markovian dynamics: two trajectories may agree at time $t$ and nevertheless have different futures because their histories differ. We introduce the History-Space Fourier Neural Operator (HS-FNO), a neural operator for delay and memory-driven PDEs formulated on the lifted state $u_t(\theta,x)=u(t+\theta,x)$, $\theta\in[-\tau,0]$. The key computational step is to decompose one history-state update into a learned predictor for the newly exposed future slice and an exact shift-append transport for the portion of the history window already known from the previous state. This avoids learning deterministic history coordinates, reduces the learned output dimension, and enforces the natural discrete history update. We test HS-FNO on five benchmark families covering delayed reaction--diffusion, spatial epidemiology, nonlocal neural-field dynamics, delayed waves, and distributed-memory closures. Across ten random seeds, HS-FNO attains the lowest aggregate one-step, history-space, and rollout errors among the principal baselines. The largest gain occurs in autoregressive prediction, where aggregate rollout error decreases from $0.241$, $0.188$, and $0.185$ for current-state, lag-stack, and unconstrained history-to-history operators, respectively, to $0.094$. The same model uses fewer parameters than unconstrained history prediction. These results indicate that enforcing the discrete shift structure of history-state evolution is an effective inductive bias for non-Markovian PDE surrogate modeling.
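
The shift-append decomposition described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the HS-FNO code: the history window is assumed to be discretized into equally spaced slices whose spacing equals one time step, and `toy_predictor` is a hypothetical stand-in for the learned operator that outputs only the new slice.

```python
def shift_append(history, new_slice):
    """Exact, parameter-free transport: drop the oldest slice and append the newest."""
    return history[1:] + [new_slice]


def rollout(history, predictor, n_steps):
    """Autoregressive rollout in history space.

    history:   list of slices ordered oldest to newest, each a list over spatial points.
    predictor: callable mapping the full history to the single newly exposed slice.
    """
    for _ in range(n_steps):
        history = shift_append(history, predictor(history))
    return history


def toy_predictor(history):
    """Hypothetical delay rule: average of the newest and the oldest slice."""
    newest, oldest = history[-1], history[0]
    return [0.5 * a + 0.5 * b for a, b in zip(newest, oldest)]


toy_history = [[0.0], [1.0], [2.0]]  # three history slices, one spatial point
toy_result = rollout(toy_history, toy_predictor, n_steps=2)
```

Only the last slice of each updated window comes from the predictor; the other slices are copied unchanged, which is why the learned output dimension is one slice rather than the whole window.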

2605.09461 2026-05-13 cs.AI

VulTriage: Triple-Path Context Augmentation for LLM-Based Vulnerability Detection

Wenxin Tang, Xiang Zhang, Junliang Liu, Jingyu Xiao, Xi Xiao, Jinlong Yang, Yuehe Ma, Zhenyu Liu, Zhengheng Li, Zicheng Wang, Wang Luo, Qing Li, Lei Wang, Peng Xiangli

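Affiliations * Tsinghua University; Henan University; Dalian Maritime University; The Chinese University of Hong Kong; Northwestern Polytechnical University; BNU-HKBU United International College; Southeast University; Jilin University; Sun Yat-sen University; Peng Cheng Laboratory; Guangzhou Intelligence Communications Technology Co., Ltd.; The Fifth Electronic Research Institute of MIIT

AI summary This paper proposes VulTriage, a triple-path context augmentation framework for LLM-based vulnerability detection. A Control Path extracts and describes program structure information, a Knowledge Path retrieves relevant vulnerability patterns and examples, and a Semantic Path summarizes the functional behavior of the code. Together they enrich the LLM's input context and improve its ability to detect vulnerabilities that arise from subtle semantic differences. Experiments show that VulTriage outperforms existing deep learning and LLM baselines on several benchmark datasets and generalizes well in low-resource and class-imbalanced settings.

Details
Abstract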

Automated vulnerability detection is a fundamental task in software security, yet existing learning-based methods still struggle to capture the structural dependencies, domain-specific vulnerability knowledge, and complex program semantics required for accurate detection. Recent Large Language Models (LLMs) have shown strong code understanding ability, but directly prompting them with raw source code often leads to missed vulnerabilities or false alarms, especially when vulnerable and benign functions differ only in subtle semantic details. To address this, we propose VulTriage, a triple-path context augmentation framework for LLM-based vulnerability detection. VulTriage enhances the LLM input through three complementary paths: a Control Path that extracts and verbalizes AST, CFG, and DFG information to expose control and data dependencies; a Knowledge Path that retrieves relevant CWE-derived vulnerability patterns and examples through hybrid dense--sparse retrieval; and a Semantic Path that summarizes the functional behavior of the code before the final judgment. These contexts are integrated into a unified instruction to guide the LLM toward more reliable vulnerability reasoning. Experiments on the PrimeVul pair test set show that VulTriage achieves state-of-the-art performance, outperforming existing deep learning and LLM-based baselines on key pair-wise and classification metrics. Further ablation studies verify the effectiveness of each path, and additional experiments on the Kotlin dataset demonstrate the generalization ability of VulTriage under low-resource and class-imbalanced settings. Our code is available at https://github.com/vinsontang1/VulTriage

2605.09430 2026-05-13 cs.CV

FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

Junkang Zhou, Yefei He, Feng Chen, Weijie Wang, Bohan Zhuang

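Affiliations * Zhejiang University; University of Adelaide

AI summary This paper proposes FlashAR, a lightweight post-training acceleration framework for efficiently speeding up inference in autoregressive image generation models. The method introduces a vertical prediction head that works alongside the original horizontal prediction head, enabling highly parallel generation based on two-way next-token prediction while keeping the original model's training objective as unchanged as possible. Experiments show that FlashAR adapts pre-trained models efficiently with only a small amount of training data and achieves up to a 22.9x speedup for 512x512 image generation.

Comments Post-training acceleration for autoregressive image generation, code is available at https://lxazjk.github.io/FlashAR/

Details
Abstract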

Large-scale autoregressive models have demonstrated remarkable capabilities in image generation. However, their sequential raster-scan decoding relies strictly on next-token prediction, making inference prohibitively expensive. Existing acceleration methods typically either introduce entirely new generation paradigms that necessitate costly pre-training from scratch, or enable parallel generation at the expense of a training-inference gap or altered prediction objectives. In this paper, we introduce FlashAR, a lightweight post-training adaptation framework that efficiently adapts a pre-trained raster-scan autoregressive model into a highly parallel generator based on two-way next-token prediction. Our key insight is that effective adaptation should minimize modifications to the pre-trained model's original training objective to preserve its learned prior. Accordingly, we retain the original AR head as a horizontal head for row-wise prediction and introduce a complementary, lightweight vertical head for column-wise prediction. To facilitate efficient adaptation, we branch the vertical head from an intermediate layer rather than the final layer, bypassing the inherent horizontal head bias. Moreover, since horizontal and vertical predictions capture complementary dependencies whose relative importance varies across target positions, we employ a learnable fusion gate to dynamically combine the two predictions at each position. To further reduce adaptation cost, we propose a two-stage adaptation pipeline: the vertical head is first initialized through adaptation from the pre-trained autoregressive model before being jointly fine-tuned with the backbone to adapt to the new decoding paradigm. Extensive experiments on LlamaGen and Emu3.5 show that FlashAR achieves up to a 22.9x speedup for 512x512 image generation through lightweight post-training with merely 0.05% of the original training data.
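
The per-position fusion gate can be pictured as a convex combination of the two heads' predictions. The sketch below is an assumption for illustration: the abstract does not say whether FlashAR fuses logits or probabilities, nor how the gate is computed, so `gate_logit` simply stands in for whatever value the learned gate emits at one position, and `fuse_predictions` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_predictions(h_logits, v_logits, gate_logit):
    """Blend horizontal-head and vertical-head logits at one target position.

    gate = sigmoid(gate_logit) weights the horizontal head; (1 - gate) weights
    the vertical head. Fusing at the logit level is an assumption of this sketch.
    """
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    fused = [gate * h + (1.0 - gate) * v for h, v in zip(h_logits, v_logits)]
    return softmax(fused)


# Toy two-token vocabulary: the heads disagree, and the gate decides the balance.
balanced = fuse_predictions([2.0, 0.0], [0.0, 2.0], gate_logit=0.0)
horizontal_only = fuse_predictions([2.0, 0.0], [0.0, 2.0], gate_logit=50.0)
```

With a neutral gate the two disagreeing heads cancel out, while a saturated gate reproduces the horizontal head alone, which matches the idea that the relative importance of the two directions varies by position.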

2605.09287 2026-05-13 cs.AI

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

Dongyi Liu, Yifan Niu, Qinwen Wang, Han Xiao, Jia Li

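Affiliations * The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

AI summary This paper proposes PiCA, a pivot-based credit assignment method for improving the reinforcement learning training of LLM-based search agents. To address reward sparsity, isolated credit, and distributional shift in long-horizon tasks, PiCA uses potential-based reward shaping to recast the search process as a sequence of cumulative progress, and treats pivot steps from historical trajectories as information peaks, giving each step a dense reward tied closely to the final objective. Experiments show that PiCA substantially improves model performance on several knowledge-intensive question-answering tasks, demonstrating its generality and effectiveness.

Comments 21 pages, 7 figures

Details
Abstract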

Large Language Model (LLM)-based search agents trained with reinforcement learning (RL) have significantly improved performance on knowledge-intensive tasks. However, existing methods encounter critical challenges in long-horizon credit assignment: (i) Reward Sparsity, where models receive only outcome feedback without step-level guidance to differentiate action quality; (ii) Isolated Credit, where credit is assigned to steps independently, failing to capture sequential dependencies; and (iii) Distributional Shift, where rewards are estimated on templates that deviate from the model's natural generative distribution. To address these issues, we propose Pivot-Based Credit Assignment (PiCA), a novel step reward mechanism that reformulates the search trajectory as a sequential process of cumulative search progress. Unlike prior isolated step rewards, PiCA defines process rewards as success probabilities dependent on the historical context based on Potential-Based Reward Shaping (PBRS). This approach identifies pivot steps, which comprise target golden sub-queries and sub-answers derived from historical trajectories, as information peaks that significantly boost the likelihood of a correct final answer. By anchoring these step rewards to the final task objective, PiCA provides dense, pivot-aware and trajectory-dependent guidance while maintaining distributional consistency. Extensive experiments show that PiCA outperforms existing strong baselines across seven knowledge-intensive QA benchmarks, achieving 15.2% and 2.2% improvements for 3B and 7B models, respectively. The consistent performance gains across various models show PiCA's robust generalization. The code is available at https://github.com/novdream/PiCA.
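
Potential-Based Reward Shaping adds gamma * Phi(s') - Phi(s) to each step's reward, where Phi is a potential over states. The sketch below uses estimated success probabilities as the potential, in the spirit of the abstract, but it is only an illustration of standard PBRS: how PiCA estimates these probabilities and detects pivots is not given here, and `shaped_rewards` and the toy numbers are hypothetical.

```python
def shaped_rewards(potentials, outcome_reward, gamma=1.0):
    """Turn a sparse outcome reward into dense step rewards with PBRS.

    potentials[t] is Phi(s_t) for t = 0..T, e.g. an estimated probability of a
    correct final answer given the search history up to step t (an assumption).
    The outcome reward is paid on the last step only.
    """
    n_steps = len(potentials) - 1
    rewards = []
    for t in range(n_steps):
        base = outcome_reward if t == n_steps - 1 else 0.0
        rewards.append(base + gamma * potentials[t + 1] - potentials[t])
    return rewards


# Hypothetical 3-step search: step 2 retrieves the key fact, so the success
# probability jumps from 0.3 to 0.8 and that step earns the largest shaping term.
toy_potentials = [0.2, 0.3, 0.8, 0.8]
toy_rewards = shaped_rewards(toy_potentials, outcome_reward=1.0)
```

With gamma = 1 the shaping terms telescope, so the total return differs from the outcome reward only by Phi(s_T) - Phi(s_0), while the per-step signal concentrates on the step where the success probability rises most.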

2605.09277 2026-05-13 cs.LG

Worst-Case Regret Bounds for Combinatorial Thompson Sampling in Sleeping Semi-Bandits

Zhiming Huang, Bingshan Hu, Jianping Pan

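Affiliations * Paul G. Allen School of Computer Science & Engineering, University of Washington; Department of Computer Science, University of British Columbia; Department of Computer Science, University of Victoria

AI summary This paper studies the combinatorial Thompson sampling (CTS) algorithm for semi-bandits with sleeping arms. To address its long-standing lack of theoretical guarantees and weak practical performance, the paper presents the first worst-case regret analysis and designs an improved algorithm, CL-SG. CL-SG coordinates exploration through a shared Gaussian seed, which improves the regret bound, and experiments show that it outperforms existing methods on real-world datasets.

Comments Accepted by INFOCOM 26 in Dec 2025

Details
Abstract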

We revisit combinatorial Thompson sampling (CTS) for semi-bandits with sleeping arms, where arm availability varies over time and actions must satisfy combinatorial constraints, as in wireless mesh routing with fluctuating link availability. Despite its practical relevance, CTS has been hindered by several long-standing problems: (i) the absence of worst-case regret guarantees in the semi-bandit setting even without sleeping arms, (ii) the lack of theory under adversarially varying availability, and (iii) the consistently weak empirical performance of CTS with Gaussian priors (CTS-G). This paper resolves these long-standing issues by providing the first worst-case regret analysis of CTS-G, proving an upper bound of $\tilde{O}(m\sqrt{NT})$ and a matching lower bound of $\tilde{\Omega}(m\sqrt{NT})$. To bridge the gap between theory and practice, we further propose CL-SG, a simple CTS-G variant that samples a single shared Gaussian seed each round to coordinate exploration across arms. We show that CL-SG achieves an improved regret bound of $\tilde{O}(\sqrt{mNT})$, together with a matching lower bound $\Omega(\sqrt{mNT})$. Experiments on real-world datasets demonstrate that CL-SG consistently outperforms strong baselines including CTS-G and CTS-B, and we open-source our implementation for reproducibility.
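
The difference between independent Gaussian perturbations and a single shared seed can be sketched as follows. This is a hedged illustration, not the authors' CL-SG implementation: the 1/sqrt(n + 1) perturbation scale, the "pick any m available arms" action set, and the names `sample_indices` and `choose_action` are all assumptions made for the example.

```python
import random
from math import sqrt


def sample_indices(means, counts, rng, shared_seed=True):
    """Perturb each arm's empirical mean with Gaussian noise.

    shared_seed=True draws one standard normal z per round and reuses it for
    every arm; shared_seed=False draws independent noise per arm. The scale
    1/sqrt(n + 1) is an assumed choice.
    """
    if shared_seed:
        z = rng.gauss(0.0, 1.0)
        return [mu + z / sqrt(n + 1) for mu, n in zip(means, counts)]
    return [mu + rng.gauss(0.0, 1.0) / sqrt(n + 1) for mu, n in zip(means, counts)]


def choose_action(indices, available, m):
    """Pick the m available (awake) arms with the largest sampled indices.

    Stands in for a combinatorial oracle; sleeping arms are never selected.
    """
    ranked = sorted(available, key=lambda i: indices[i], reverse=True)
    return ranked[:m]


rng = random.Random(0)
toy_means = [0.5, 0.2, 0.9]
toy_counts = [3, 0, 8]
toy_indices = sample_indices(toy_means, toy_counts, rng, shared_seed=True)
toy_action = choose_action(toy_indices, available=[0, 1], m=1)  # arm 2 is asleep
```

With a shared seed, all arms are optimistic or pessimistic together in a given round, and rarely pulled arms move the most because their scale is larger.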

2605.09171 2026-05-13 cs.RO

SHIELD: Scalable Optimal Control with Certification using Duality and Convexity

Hansung Kim, Siddharth H. Nair, Francesco Borrelli

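Affiliations * Model Predictive Control Laboratory, UC Berkeley; Nextracker Inc

AI summary This paper proposes SHIELD, a hierarchical algorithm that uses strong convexity and Lagrangian duality to reduce both the decision-variable dimension and the constraint set in $\ell_1$-regularized convex programs, while guaranteeing that the removed constraints and variables still meet the safety requirements. To speed up the algorithm, the authors introduce a transformer-based deep neural network to guide the derivation of dual certificates, and validate the method on stochastic model predictive control (SMPC) in complex multi-modal traffic scenarios. Experiments show order-of-magnitude gains in computational efficiency while preserving feasibility and closed-loop safety.

Details
Abstract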

We present SHIELD, a hierarchical algorithm that reduces both the decision-variable dimension and the constraint set in $\ell_1$-regularized convex programs. From strong convexity and Lagrangian duality, we derive certificates that safely discard constraints and decision variables while guaranteeing that all removed constraints remain satisfied and all removed variables are null. To further accelerate the proposed algorithm, we propose a transformer-based deep neural network to guide the dual certificate inference. We validate SHIELD on stochastic model predictive control (SMPC) in complex, multi-modal traffic scenarios, comparing against a full-dimensional SMPC policy. Numerical simulations demonstrate order-of-magnitude computational speedups while preserving feasibility and closed-loop safety, highlighting the practicality of certifiably safe, lightweight MPC in complex driving scenes.

2605.09043 2026-05-13 cs.CL cs.AI

Phase Transitions in Affective Meaning Divergence: The Hidden Drift Before the Break

Napassorn Litchiowong

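Affiliations * School of Computing, National University of Singapore

AI summary This paper studies phase transitions in affective meaning divergence (AMD) in dialogue, in which the two parties' affective readings of the same word gradually drift apart until communication breaks down. Building on speech-act theory and entropy-regularized game theory, the author constructs a mathematical model of AMD and finds that when $\beta\alpha > 4$, increasing AMD causes an abrupt collapse of repair coordination. Experiments on multiple datasets show that AMD exhibits marked critical-slowing-down signatures before a conversation derails, and that its temporal dynamics outperform conventional toxicity or sentiment indicators, offering a new theoretical basis for understanding conversational breakdown.

Comments Accepted to the ACL 2026 Student Research Workshop

Details
Abstract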

One partner says "Fine" meaning "resolution"; the other hears "surrender." The word is shared; the affective uptake is not. We formalize this as affective meaning divergence (AMD), the total-variation distance between interlocutors' anchor-conditioned affect distributions. Building on speech-act theory, common-ground accumulation, and entropy-regularized game theory, we derive a logit best-response map whose dynamics undergo a saddle-node bifurcation: when $\beta\alpha > 4$, a monotone increase in AMD-driven load produces an abrupt, hysteretic collapse of repair coordination. On Conversations Gone Awry (CGA-Wiki; $N = 652$), derailing conversations exhibit critical-slowing-down (CSD) signatures across multiple levels: lexical divergence variance ($p < 0.001$, $d = 0.36$), AMD variance ($p = 0.001$, $d = 0.26$), and dialog-act repair variance ($p = 0.016$, $d = 0.20$), all significant after correction and stronger than toxicity and sentiment baselines. AMD provides a distinct temporal signature, with retrospectively measured variance peaking at the bifurcation point while toxicity variance peaks earlier, and is the only indicator grounded in the theoretical framework. Boundary-condition analysis on CGA-CMV ($N = 1,169$) yields mixed but directionally consistent evidence.
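
Since AMD is defined as a total-variation distance, the quantity itself is easy to compute once the two affect distributions are given. The sketch below shows only that distance; the affect labels and probabilities are hypothetical, and how the paper estimates the anchor-conditioned distributions is not covered here.

```python
def total_variation(p, q):
    """Total-variation distance between two discrete distributions.

    p and q map labels to probabilities; a label missing from one side is
    treated as probability 0. The result lies in [0, 1].
    """
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in labels)


# Hypothetical affect distributions for the anchor word "Fine".
speaker = {"resolution": 0.8, "surrender": 0.1, "neutral": 0.1}
hearer = {"resolution": 0.1, "surrender": 0.8, "neutral": 0.1}
toy_amd = total_variation(speaker, hearer)
```

A value of 0 means the two parties attach the same affect to the word, and a value of 1 means their readings do not overlap at all.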

2605.09003 2026-05-13 cs.CV

FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

Yixin Tang, Jiawei Guo, Junxian Li, Zhiteng Li, Jixin Zhao, Bingya Zhang, Chenbo Wang, Yulun Zhang, Shangchen Zhou

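Affiliations * Shanghai Jiao Tong University; Nanyang Technological University; Honor Device Co., Ltd

AI summary This paper proposes FlashClear, an efficient image content removal method that addresses the computational inefficiency of diffusion-based object removal. By introducing Region-aware Adversarial Distillation (RAD) and the Foreground-Prioritized Asymmetric Attention and Caching (FPAC) strategy, it obtains a model that removes content with high quality in only a few steps, greatly speeding up inference. Experiments show that, while maintaining visual quality, FlashClear achieves speedups of up to 8.26x and 122x over ObjectClear and OmniPaint, respectively.

Comments Code: https://github.com/GuoCalix/FlashClear

Details
Abstract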

Recently, diffusion-based object removal models have achieved impressive results in eliminating objects and their associated visual effects. However, they indiscriminately denoise all tokens across all timesteps, ignoring that removal usually involves small foreground regions. This strategy introduces substantial computational overhead and prolonged inference times. To overcome this computational burden, we propose a latent discriminator to implement Region-aware Adversarial Distillation (RAD), yielding a highly efficient few-step model named FlashClear. Furthermore, tailored to few-step diffusion models, we propose FPAC (Foreground-Prioritized Asymmetric Attention and Caching), a training-free acceleration strategy. Extensive experiments demonstrate that our framework provides massive acceleration while maintaining or exceeding the performance of our base model, ObjectClear. Notably, on the OBER benchmark, our FlashClear achieves up to 8.26$\times$ and 122$\times$ speedup over ObjectClear and OmniPaint, respectively, while maintaining high visual quality and fidelity.

2605.08978 2026-05-13 cs.AI

Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

Xingyuan Hua, Sheng Yue, Ju Ren

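Affiliations * Department of Computer Science and Technology, Tsinghua University, Beijing, China; School of Cyber Science and Technology, Sun Yat-sen University Shenzhen Campus, Shenzhen, China; State Key Laboratory of Internet Architecture, Tsinghua University, Beijing, China

AI summary This paper proposes an exploration-aware reinforcement learning framework that addresses the undifferentiated exploration strategies agents use when carrying out tasks. The method uses variational inference to introduce a fine-grained reward function that evaluates how much an exploratory action could improve future decisions, and combines it with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. Experiments show significant improvements across a range of text-based and GUI-based benchmarks.

Details
Abstract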

Recent advancements in agentic test-time scaling allow models to gather environmental feedback before committing to final actions. A key limitation of existing methods is that they typically employ undifferentiated exploration strategies, lacking the ability to adaptively distinguish when exploration is truly required. In this paper, we propose an exploration-aware reinforcement learning framework that enables LLM agents to adaptively explore only when uncertainty is high. Our method introduces a fine-grained reward function via variational inference that explicitly evaluates exploratory actions by estimating their potential to improve future decision-making, together with an exploration-aware grouping mechanism that separates exploratory actions from task-completion actions during optimization. By targeting informational gaps, this design allows agents to explore selectively and transition to execution as soon as the task context is clear. Empirically, we demonstrate that our approach achieves consistent improvements across a range of challenging text-based and GUI-based agent benchmarks. Code is available at https://github.com/HansenHua/EAPO-ICML26 and models are available at https://huggingface.co/hansenhua/EAPO-ICML26.

2605.08828 2026-05-13 cs.AI

When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

Strick Sheng, Ziyue Wang, Liyi Zhou

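Affiliations * The University of Sydney; Nanjing University

AI summary This study proposes EnvTrustBench, an extensible agentic framework for evaluating the reliability of large language model agents when faced with stale, incorrect, or malicious environmental information. It defines an evidence-grounding defect (EGD) as a failure in which an agent acts on environment-provided information without checking it against current evidence, leading to task errors. By constructing task scenarios and generating workspaces and validation mechanisms, EnvTrustBench systematically evaluates a range of agents across settings, revealing how broadly the reliability of environmental information affects agent behavior and highlighting environmental grounding as a core concern in agent systems.

Details
Abstract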

Large language model agents increasingly operate through environment-facing scaffolds that expose files, web pages, APIs, and logs. These observations influence tool use, state tracking, and action sequencing, yet their reliability and authority are often uncertain. Environmental grounding is therefore a systems-level problem involving context admission, evidence provenance, freshness checking, verification policy, action gating, and model reasoning. Existing agent benchmarks mainly evaluate task capability or specific attacks such as prompt injection and memory poisoning, but they under-specify a fundamental reliability question: whether agents remain grounded in the true environment state when observations are stale, incorrect, or malicious. We introduce EnvTrustBench, an agentic framework for benchmarking this failure mode. We define an evidence-grounding defect (EGD) as a behavioral failure in which an agent treats an environment-facing claim as sufficient evidence for action without resolving it against available current evidence, leading to a task-incorrect false path under the true environment state. Given a task scenario, EnvTrustBench generates the workspace, environment, agent-facing objective, and validation oracle, executes the evaluated agent, records its action-observation trajectory and final state, and applies the oracle to produce a verdict. Using 6 LLM backbones and 5 widely used scaffolds, we evaluate 55 generated cases across 11 task scenarios, with each scenario expanded through five feedback-guided generation iterations. Results show that EGDs consistently emerge across operational workflows, highlighting environmental grounding as a core agent reliability problem with important security implications.

2605.08806 2026-05-13 cs.CV

L2A: Learning to Accumulate Pose History for Accurate 3D Human Pose Estimation

Zehua Wang, Changwang Mei, Huaijiang Sun, Pengqi Hu, Zhaoyang Yin

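Affiliations * Nanjing University of Science and Technology; Lenovo

AI summary This paper proposes L2A, a framework that improves the accuracy of 3D human pose estimation by making effective use of historical pose information. The authors find that existing methods fall short in cross-layer feature reuse. To address this, they design a spatial-temporal parallel Transformer backbone that maintains a consistent representation space, and introduce a History Pose Accumulation (HPA) mechanism and a Layer Pose History Aggregation (LPA) module that adaptively integrate multi-layer features, reducing redundancy and improving stability. Experiments show state-of-the-art performance on several benchmark datasets.

Comments 15 pages

Details
Abstract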

Existing 2D-to-3D lifting methods for human pose estimation have achieved strong performance, but the use of historical pose representations across network depth has been overlooked. In current pipelines, information is propagated through fixed residual connections, which restricts effective reuse of early-layer features such as fine-grained spatial structures and short-term motion cues. However, naively incorporating historical features across layers is non-trivial. We further identify that maintaining a consistent representation space across layers is a prerequisite for effective cross-layer feature aggregation. To address this issue, we propose a history-aware framework that enables effective cross-layer use of historical features. Specifically, we adopt a spatial-temporal parallel Transformer backbone to prevent alternating spatial-temporal transformations during sequential processing, thereby maintaining a consistent representation space. Building upon this, we introduce a History Pose Accumulation (HPA) mechanism that adaptively aggregates features from all preceding layers to enhance current representations. Furthermore, we propose a Layer Pose History Aggregation (LPA) module that transforms layer pose features into a compact and structured form, reducing redundancy and enabling more stable aggregation. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on benchmarks.