arXivDaily: arXiv daily digest, updated Monday through Friday
2605.11591 2026-05-13 cs.CV

Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration

Mingtao Xian, Yifeng Yang, Qinying Gu, Xinbing Wang, Nanyang Ye

Affiliations: Zhiyuan College, Shanghai Jiao Tong University, Shanghai, China; Shanghai Jiao Tong University, Shanghai, China; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Shanghai Innovation Institute, Shanghai, China

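AI summary: Multimodal large language models perform well on multi-image cross-modal retrieval but suffer from severe position bias: predictions follow input order rather than semantic relevance. The paper identifies a phenomenon it calls Logit-Attention Divergence, in which output logits are biased while internal attention maps still align with the relevant visual evidence, exposing a limitation of existing logit-level calibration methods. Building on this, the authors propose a training-free, attention-guided debiasing framework that uses the model's internal attention signals for instance-level correction at inference time, requiring only a small calibration set and minimal compute. Experiments show markedly improved robustness to input order and state-of-the-art results on MS-COCO-based benchmarks.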

Abstract

Multimodal Large Language Models (MLLMs) have shown strong performance in multi-image cross-modal retrieval, yet suffer from severe position bias, where predictions are dominated by input order rather than semantic relevance. Through empirical analysis, we identify a phenomenon termed Logit-Attention Divergence, in which output logits are heavily biased while internal attention maps remain well-aligned with relevant visual evidence. This observation reveals a fundamental limitation of existing logit-level calibration methods such as PriDe. Based on this insight, we propose a training-free, attention-guided debiasing framework that leverages intrinsic attention signals for instance-level correction at inference time, requiring only a minimal calibration set with negligible computational overhead. Experiments on MS-COCO-based benchmarks show that our method substantially improves permutation invariance and achieves state-of-the-art performance, enhancing accuracy by over 40% compared to baselines. Code is available at https://github.com/brightXian/LAD.

2605.11586 2026-05-13 cs.LG math.OC

Learning Weakly Communicating Average-Reward CMDPs: Strong Duality and Improved Regret

Kihyun Yu, Beomhan Baek, Dabeen Lee

Affiliations: Department of Industrial & Systems Engineering, KAIST; Department of Mathematical Sciences, Seoul National University

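AI summary: This paper studies learning in infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. The authors first establish strong duality over stationary policies for weakly communicating average-reward CMDPs with finite state and action spaces; although no linear programming formulation is available and the problem is nonconvex, they prove strong duality by analyzing the geometric structure of the occupation measure set. Building on this, they propose a primal-dual clipped value iteration algorithm for weakly communicating average-reward linear CMDPs that achieves $\widetilde{\mathcal{O}}(T^{2/3})$ bounds on both regret and constraint violation, improving on the best known results. The strong-duality analysis is what allows the composite Lagrangian regret to be decomposed.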

Abstract

We study infinite-horizon average-reward constrained Markov decision processes (CMDPs) under the weakly communicating assumption. Our contributions are twofold. First, we establish strong duality for weakly communicating average-reward CMDPs over stationary policies with finite state and action spaces. Despite the absence of a linear programming formulation and the resulting nonconvexity under the weakly communicating setting, we show that strong duality still holds by carefully exploiting the geometric structure of the occupation measure set. Second, building on this result, we propose a primal-dual clipped value iteration algorithm for learning weakly communicating average-reward linear CMDPs. Our algorithm achieves regret and constraint violation bounds of $\widetilde{\mathcal{O}}(T^{2/3})$, improving upon the best known bounds, where $T$ denotes the number of interactions. Our approach extends clipped value iteration to the constrained setting and adapts it to a finite-horizon approximation, which stabilizes the dual variable and is crucial for achieving improved regret bounds. To analyze this, we develop a novel approach based on strong duality that enables the decomposition of the composite Lagrangian regret into separate bounds on regret and constraint violation.

2605.11585 2026-05-13 cs.CV cs.LG

A Mixture Autoregressive Image Generative Model on Quadtree Regions for Gaussian Noise Removal via Variational Bayes and Gradient Methods

Shota Saito, Yuta Nakahara, Kohei Horinouchi, Naoki Ichijo, Manabu Kobayashi, Toshiyasu Matsushima

Affiliations: Gunma University; Waseda University

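AI summary: This paper addresses Gaussian noise removal for grayscale images. It proposes a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, and recasts MAP-estimation-based denoising as maximization of a variational lower bound. The resulting optimization algorithm alternates variational Bayes and gradient steps, and the gradient update rule can be computed analytically without numerical approximation. Experiments verify that the algorithm removes noise and indicate directions for further improvement.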

Abstract

This paper addresses the problem of image denoising for grayscale images. We propose a probabilistic image generative model that combines a quadtree region-partitioning model with a mixture autoregressive model, and propose a framework that reduces MAP (maximum a posteriori)-estimation-based denoising to the maximization of a variational lower bound. To maximize this lower bound, we develop an algorithm that alternately applies variational Bayes and gradient methods. We particularly demonstrate that the gradient-based update rule can be computed analytically without numerical computation or approximation. We carried out some experiments to verify that the proposed algorithm actually removes image noise and to identify directions for future improvement.

2605.11582 2026-05-13 cs.CL

Efficient LLM-based Advertising via Model Compression and Parallel Verification

Wenxin Dong, Chang Gao, Guanghui Yu, Xuewu Jiao, Mingqing Hu, Qiang Fu, Peng Xu, Penghui Wei, Hui Xu, Yue Xing, Shuanglong Li, Lin Liu

Affiliations: Baidu Inc.

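AI summary: This paper studies how to deploy large language models (LLMs) efficiently in advertising, where high inference latency and compute cost are the main obstacles. It proposes an Efficient Generative Targeting framework that combines adaptive group quantization, layer-adaptive hierarchical sparsification, and prefix-tree parallel verification to accelerate LLM inference while preserving generation quality. Experiments on two real-world advertising scenarios show significant speedups with acceptable quality degradation, making the framework viable for practical deployment.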

Comments 10 pages, 7 figures, industry paper

Abstract

Large language models (LLMs) have shown remarkable potential in advertising scenarios such as ad creative generation and targeted advertising. However, deploying LLMs in real-time advertising systems poses significant challenges due to their high inference latency and computational cost. In this paper, we propose an Efficient Generative Targeting framework that integrates adaptive group quantization, layer-adaptive hierarchical sparsification, and prefix-tree parallel verification to accelerate LLM inference while preserving generation quality. Extensive experiments on two real-world advertising scenarios demonstrate that our framework achieves significant speedup with acceptable quality degradation, making it operationally viable for practical deployments.

2605.11581 2026-05-13 cs.CL

Ada-MK: Adaptive MegaKernel Optimization via Automated DAG-based Search for LLM Inference

Wenxin Dong, Mingqing Hu, Guanghui Yu, Qiang Fu, Peng Xu, Hui Xu, Yue Xing, Xuewu Jiao, Shuanglong Li, Lin Liu

Affiliations: Baidu Inc.

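AI summary: In commercial online advertising systems, real-time LLM inference must keep end-to-end latency tightly bounded. To cut the kernel launch overhead of the decode phase, the paper proposes Ada-MK, which uses automated DAG-based search to fix the MegaKernel execution path offline and combines a three-dimensional shared-memory constraint model with a heterogeneous hybrid inference engine. This reduces shared-memory usage, eliminates runtime branching overhead, and improves inference throughput and latency.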

Comments 10 pages, 8 figures

Abstract

When large language models (LLMs) serve real-time inference in commercial online advertising systems, end-to-end latency must be strictly bounded to the millisecond range. Yet every token generated during the decode phase triggers thousands of kernel launches, and kernel launch overhead alone can account for 14.6% of end-to-end inference time. MegaKernel eliminates launch overhead and inter-operator HBM round-trips by fusing multiple operators into a single persistent kernel. However, existing MegaKernel implementations face a fundamental tension between portability and efficiency on resource-constrained GPUs such as NVIDIA Ada: hand-tuned solutions are tightly coupled to specific architectures and lack portability, while auto-compiled approaches introduce runtime dynamic scheduling whose branch penalties are unacceptable in latency-critical settings. We observe that under a fixed deployment configuration, the optimal execution path of a MegaKernel is uniquely determined, and runtime dynamic decision-making can be entirely hoisted to compile time. Building on this insight, we propose Ada-MK: (1) a three-dimensional shared-memory constraint model combined with K-dimension splitting that reduces peak shared memory usage by 50%; (2) MLIR-based fine-grained DAG offline search that solidifies the optimal execution path, completely eliminating runtime branching; and (3) a heterogeneous hybrid inference engine that embeds MegaKernel as a plugin into TensorRT-LLM, combining high-throughput Prefill with low-latency Decode. On an NVIDIA L20, Ada-MK improves single-batch throughput by up to 23.6% over vanilla TensorRT-LLM and 50.2% over vLLM, achieving positive gains across all tested scenarios. The authors describe this as the first industrial deployment of MegaKernel in a commercial online advertising system.

2605.11578 2026-05-13 cs.CV

The Midas Touch for Metric Depth

Yu Ma, Zizhan Guo, Zuyi Xiong, Haoran Zhang, Yi Feng, Hongbo Zhao, Hanli Wang, Rui Fan

Affiliations: College of Electronic and Information Engineering, Tongji University; Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi'an Jiaotong University

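AI summary: This paper presents MTD, a method that addresses three obstacles to practical use of relative depth estimation: the lack of metric scale, local inconsistencies, and low computational efficiency. MTD converts relative depth into metric depth using extremely sparse 3D data, applying a segment-wise recovery strategy followed by pixel-wise refinement with a discontinuity-aware geodesic cost to remove local scale inconsistencies. It generalizes well, substantially improves accuracy over prior depth completion and depth estimation methods, and its lightweight plug-and-play design eases deployment and integration in downstream 3D tasks.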

Abstract

Recent advances have markedly improved the cross-scene generalization of relative depth estimation, yet its practical applicability remains limited by the absence of metric scale, local inconsistencies, and low computational efficiency. To address these issues, we present Midas Touch for Depth (MTD), a mathematically interpretable approach that converts relative depth into metric depth using only extremely sparse 3D data. To eliminate local scale inconsistencies, it applies a segment-wise recovery strategy via sparse graph optimization, followed by a pixel-wise refinement strategy using a discontinuity-aware geodesic cost. MTD exhibits strong generalization and achieves substantial accuracy improvements over previous depth completion and depth estimation methods. Moreover, its lightweight, plug-and-play design facilitates deployment and integration on diverse downstream 3D tasks. Project page is available at https://mias.group/MTD.

2605.11577 2026-05-13 cs.CL

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

Shaobin Zhuang, Yuang Ai, Jiaming Han, Xiaohui Li, Huaibo Huang, Xiangyu Yue, Xuefeng Hu, Kun Xu, Yali Wang, Hao Chen

Affiliations: Shanghai Jiao Tong University; MMLab, The Chinese University of Hong Kong; Institute of Automation, Chinese Academy of Sciences; SIAT, Chinese Academy of Sciences

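AI summary: Conventional autoregressive language models generate text one token at a time, which fits poorly with the multi-token units of natural language and limits both expressiveness and inference efficiency. BitLM represents each token as a fixed-length binary code and denoises multiple tokens in parallel within each block, making joint lexical decisions inside a block while preserving left-to-right causal attention across blocks. By replacing the large-vocabulary softmax with bitwise denoising, BitLM reframes token generation as iterative commitment in a compact binary space, which the authors report improves pre-training efficiency and inference speed. The results suggest that one-token-at-a-time generation is an interface choice rather than a fundamental requirement.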

Comments 12 pages, 4 figures, 1 table

Abstract

Autoregressive language models generate text one token at a time, yet natural language is inherently structured in multi-token units, including phrases, n-grams, and collocations that carry meaning jointly. This one-token bottleneck limits both the expressiveness of the model during pre-training and its throughput at inference time. Existing remedies such as speculative decoding or diffusion-based language models either leave the underlying bottleneck intact or sacrifice the causal structure essential to language modeling. We propose BitLM, a language model that represents each token as a fixed-length binary code and employs a lightweight diffusion head to denoise multiple tokens in parallel within each block. Crucially, BitLM preserves left-to-right causal attention across blocks while making joint lexical decisions within each block, combining the reliability of autoregressive modeling with the parallelism of iterative refinement. By replacing the large-vocabulary softmax with bitwise denoising, BitLM reframes token generation as iterative commitment in a compact binary space, enabling more efficient pre-training and substantially faster inference without altering the causal foundation that makes language models effective. Our results demonstrate that the one-token-at-a-time paradigm is not a fundamental requirement but an interface choice, and that changing it can yield a stronger and faster language model. We hope BitLM points toward a promising direction for next-generation language model architectures.
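The fixed-length binary token code mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how codes are assigned, so the plain binary expansion of the token id, the 0.5 threshold, the function names, and the vocabulary size below are all assumptions for illustration; the diffusion head itself is not modeled.

```python
def token_to_bits(token_id: int, num_bits: int) -> list[int]:
    """Encode a token id as a fixed-length binary code, most significant bit first."""
    if not 0 <= token_id < 2 ** num_bits:
        raise ValueError("token_id does not fit in num_bits")
    return [(token_id >> shift) & 1 for shift in range(num_bits - 1, -1, -1)]


def bits_to_token(soft_bits: list[float]) -> int:
    """Threshold (possibly noisy) bit values at 0.5 and decode them to a token id."""
    token_id = 0
    for value in soft_bits:
        token_id = (token_id << 1) | (1 if value >= 0.5 else 0)
    return token_id


# Hypothetical vocabulary size; the number of bits is the smallest width
# that can index every token.
vocab_size = 50_000
num_bits = (vocab_size - 1).bit_length()

code = token_to_bits(12_345, num_bits)
recovered = bits_to_token([b + 0.1 if b == 0 else b - 0.1 for b in code])
```

Under this assumption a 50,000-token vocabulary needs 16 bits per token, so a denoiser would predict 16 values per position instead of a 50,000-way softmax.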

2605.11574 2026-05-13 cs.CL cs.AI cs.LG

Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation

Pruthvinath Jeripity Venkata

Affiliations: Independent Researcher

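AI summary: This paper examines how large language models handle conflict between trained knowledge and a contradicting document, distinguishing three processing situations and proposing a three-regime predictive framework. It separates parametric strength from parametric uniqueness as orthogonal dimensions and validates the framework with large-scale experiments across task settings. Task framing sharply changes how much models rely on the document, and the results reveal the dominant role of parametric certainty in factual tasks.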

Comments 10 pages, 13 tables, no figures. 9,970 API calls across five frontier models

Abstract

The literature on how large language models handle conflict between their training knowledge and a contradicting document presents a persistent empirical contradiction: some studies find models stubbornly retain their trained answers, ignoring provided documents nearly half the time, while others find models readily defer to the document, following context approximately 96% of the time. We argue these contradictions dissolve once one recognises that prior experiments have studied three qualitatively distinct processing situations without distinguishing them. We propose a three-regime framework: Regime 1 (single-source updating, dominant predictor: evidence coherence), Regime 2 (competitive integration, dominant predictor: parametric certainty), and Regime 3 (task-appropriate selection, dominant predictor: task knowledge requirement). We formalise a distinction between parametric strength (exposure frequency) and parametric uniqueness (encoding consistency), showing empirically that these are orthogonal dimensions (r = -0.002, p = .97) with strength as the operative predictor in stable factual domains. We validate the framework across Claude Sonnet 4.6, GPT-5.5, Gemini 2.5 Flash, Llama 4 Maverick, and DeepSeek V3 using 9,970 API calls in three experimental phases. GEE logistic regression confirms the predicted Regime 2 certainty gradient for all five models (beta = -0.38 to -0.50, all p <= .013, BH-FDR corrected). A Regime 3 ablation shows task framing alone flips context-following from near-100% (contextual knowledge condition) to 6-71% (parametric knowledge condition), with all five models significant (p < .001). The certainty gradient is robust to multinomial outcome modeling, sensitivity analyses for hedging responses, and FDR correction.
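The abstract reports p-values as "BH-FDR corrected". As a reminder of what that correction does, here is a minimal sketch of the standard Benjamini-Hochberg adjustment. The function name and the input p-values are illustrative only; they are not taken from the paper.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so each adjusted value is
    # the minimum of p * m / rank over all ranks at or above its own.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Illustrative p-values, one per hypothetical model-level test.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A test is declared significant at FDR level q when its adjusted p-value is at most q.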

2605.11571 2026-05-13 cs.LG

FedOUI: OUI-Guided Client Weighting for Federated Aggregation

Alberto Fernández-Hernández, Jose I. Mestre, Cristian Pérez-Corral, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí

Affiliations: Universitat Politècnica de València; Universitat Jaume I; Openchip & Software Technologies S.L.

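AI summary: This paper proposes FedOUI, a federated aggregation method based on the Overfitting-Underfitting Indicator (OUI). Each client's model is characterized by its activation patterns on a fixed probe batch, and the server uses these values to adjust client weights during aggregation. The method needs no labels and improves aggregation quality under strong non-IID partitioning and noisy clients, with the clearest gains under strong heterogeneity, indicating that internal activation structure carries useful information for federated learning.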

Abstract

Federated learning usually aggregates client updates using dataset size or gradient-level criteria, while overlooking internal signals about how each client model is organizing its input space during training. We introduce FedOUI, a simple aggregation rule based on the Overfitting-Underfitting Indicator (OUI), an activation-based and label-free metric. Each participating client sends its local update together with a OUI value computed on a fixed probe batch, and the server estimates the round-wise OUI distribution to assign lower weights to structurally atypical clients through a smooth reweighting rule. We evaluate FedOUI on CIFAR-10 under strong non-IID partitioning and noisy-client conditions, comparing it with FedAvg, FedProx, and a gradient-alignment baseline. The clearest gains appear under strong heterogeneity, where OUI-based weighting improves aggregation quality while remaining lightweight and interpretable. These results show that internal activation structure can provide useful information for federated aggregation beyond client size and gradient geometry.
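One way to picture the server-side step described above is the sketch below. The abstract only says that the server estimates the round-wise OUI distribution and down-weights atypical clients with a smooth rule; the Gaussian kernel on the z-score, the `temperature` parameter, and both function names are assumptions made here for illustration, not the paper's rule.

```python
import math
import statistics


def oui_weights(oui_values: list[float], temperature: float = 1.0) -> list[float]:
    """Assumed smooth rule: weight decays with a client's distance from the round mean."""
    mean = statistics.fmean(oui_values)
    std = statistics.pstdev(oui_values)
    if std == 0.0:
        # All clients look identical, so fall back to uniform weights.
        return [1.0 / len(oui_values)] * len(oui_values)
    raw = [math.exp(-0.5 * ((v - mean) / (std * temperature)) ** 2) for v in oui_values]
    total = sum(raw)
    return [r / total for r in raw]


def aggregate(updates: list[list[float]], weights: list[float]) -> list[float]:
    """Weighted average of flattened client updates."""
    dim = len(updates[0])
    return [sum(w * u[k] for w, u in zip(weights, updates)) for k in range(dim)]


# Hypothetical round: three typical clients and one structurally atypical client.
weights = oui_weights([0.50, 0.52, 0.48, 0.95])
global_update = aggregate([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [5.0, -3.0]], weights)
```

In this toy round the fourth client receives the smallest weight, so its outlying update moves the aggregate less than it would under uniform averaging.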

2605.11570 2026-05-13 cs.LG

OUI as a Structural Observable: Towards an Activation-Centric View of Neural Network Training

Alberto Fernández-Hernández, Jose I. Mestre, Cristian Pérez-Corral, Manuel F. Dolz, Jose Duato, Enrique S. Quintana-Ortí

Affiliations: Openchip & Software Technologies S.L., Spain

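AI summary: This paper argues that the Overfitting-Underfitting Indicator (OUI) should be treated as an observable of a network's internal structural evolution during training, and that training dynamics should be understood from an activation-centric viewpoint. As an early, label-free, activation-based signal, OUI reveals whether training is heading into a good or poor regime before convergence, and it shows predictive value in supervised learning, reinforcement learning, and online control. The authors present this as an empirical foothold for an activation-centric theory of training dynamics.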

Abstract

Activation functions are what make deep networks expressive: without them, the model collapses to a linear map. Yet we still evaluate training mostly from the outside, through loss, accuracy, return, or final calibration, while the internal structural evolution of the network remains largely unobserved. In this paper, we argue that the Overfitting-Underfitting Indicator (OUI) should be understood as a first practical observable of that internal structure. Across our recent results, OUI consistently appears as an early, label-free, activation-based signal that reveals whether a network is entering a poor or promising training regime before convergence. In supervised learning, it anticipates weight decay regimes; in reinforcement learning, it discriminates learning-rate regimes early in PPO actor-critic; and in online control, it can drive layer-wise weight decay adaptation. Read together with recent evidence that activation patterns tend to stabilize earlier than parameters, these results suggest a broader research direction: an activation-centric theory of training dynamics. OUI is becoming an empirical foothold toward this theory.

2605.11569 2026-05-13 cs.AI cs.LG

Dual-Temporal LSTM with Hybrid Attention for Airline Passenger Load Factor Forecasting: Integrating Intra-Flight and Inter-Flight Booking Dynamics

ASM Nazrul Islam, Md. Hasanul Kabir, Md. Liakot Ali, Joydeb Kumar Sana

Affiliations: Institute of Information and Communication Technology; Bangladesh University of Engineering and Technology; Islamic University of Technology

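AI summary: Addressing shortcomings in airline demand forecasting, this study proposes an LSTM model with two temporal streams and hybrid attention to predict passenger load factor more accurately. The model processes booking accumulation within a flight together with booking patterns across flights, avoiding the information loss of single-dimension modeling. On real reservation data from Biman Bangladesh Airlines it achieves high accuracy and generalizes across route types, and the airline has officially integrated it into its operations.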

Abstract

Accurate short-term demand forecasting is crucial to airline revenue management, yet most existing systems fail to meet this need because current models treat booking data as a single temporal dimension, either the accumulation of bookings for a specific flight or the historical booking profile of the same route. This unidimensional view discards information carried by the other temporal stream, and forecasting absolute passenger counts introduces a further operational fragility when a change in the planned aircraft type alters total seat capacity. This study addresses both limitations. A dual-stream Long Short-Term Memory (LSTM) framework with integrated attention is proposed that simultaneously processes two complementary input sequences: a horizontal sequence capturing intra-flight booking accumulation over the days preceding departure, and a vertical sequence capturing inter-flight booking patterns at fixed days-before-departure offsets across historical flights. Multiple dual-stream architectural variants, combining self-attention, cross-attention, and hybrid attention with concatenation, residual, and gated fusion strategies, are developed and evaluated. Experiments on real-world reservation data from the national airline of Bangladesh, Biman Bangladesh Airlines (BBA), demonstrate that the proposed hybrid model achieves a Mean Absolute Error of 2.8167 and a coefficient of determination ($R^{2}$) of 0.9495, outperforming single-stream baselines, tree-based models, and three prior dual-LSTM architectures applied to the same data. Validation across four flight category comparisons (domestic versus international, direct versus transit, high versus low frequency, and short versus mid versus long haul) confirms that the model generalizes across operationally diverse route types. BBA has officially integrated this methodology into its operations.

2605.11564 2026-05-13 cs.RO

RIO: Flexible Real-Time Robot I/O for Cross-Embodiment Robot Learning

Pablo Ortega-Kral, Eliot Xing, Arthur Bucker, Vernon Luk, Junseo Kim, Owen Kwon, Angchen Xie, Nikhil Sobanbabu, Yifu Yuan, Megan Lee, Deepam Ameria, Bhaswanth Ayapilla, Jaycie Bussell, Guanya Shi, Jonathan Francis, Jean Oh

Affiliations: Carnegie Mellon University; TU Delft; Lavoro AI; Bosch Center for AI

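AI summary: This paper presents RIO, an open source Python framework aimed at the infrastructure fragmentation that hinders cross-embodiment robot learning. RIO provides flexible, lightweight components for robot control, teleoperation, data formatting, sensor configuration, and policy deployment across hardware platforms and morphologies. The authors validate it on three morphologies and four hardware platforms, demonstrating its use for fine-tuning and deploying generalist Vision-Language-Action models on real robot hardware.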

Comments 14 pages, 12 figures, 5 tables. Accepted to Robotics: Science and Systems (RSS) 2026

Abstract

Despite recent efforts to collect multi-task, multi-embodiment datasets, to design recipes for training Vision-Language-Action models (VLAs), and to showcase these models on different robot platforms, generalist cross-embodiment robot capability remains a largely elusive ideal. Progress is limited by fragmented infrastructure: most robot code is highly specific to the exact setup the user decided on, which adds major overhead when attempting to reuse, recycle, or share artifacts between users. We present RIO (Robot I/O), an open source Python framework that provides flexible, lightweight components for robot control, teleoperation, data formatting, sensor configuration, and policy deployment across diverse hardware platforms and morphologies. RIO provides abstractions that let users make any of these choices and switch between them with minimal reconfiguration effort. We validate RIO on VLA deployment workflows across three morphologies (single-arm, bimanual, humanoid) and four hardware platforms with varying grippers and cameras. Using teleoperated data collected with RIO, we fine-tune state-of-the-art VLAs including $\pi_{0.5}$ and GR00T on household tasks such as pick-and-place, folding, and bowl scrubbing. By open sourcing all our efforts, we hope the community can accelerate their pace of robot learning on real-world robot hardware. Additional details at: https://robot-i-o.github.io

2605.11563 2026-05-13 cs.CV cs.AI

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

Sara Shoouri, Morteza Tavakoli Taba, Hun-Seok Kim

Affiliations: University of Michigan

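AI summary: This paper proposes TCP-SSM, an efficient vision state space model that addresses the difficulty of controlling state-dependent memory behavior in existing SSMs on long-range vision tasks. It introduces token-conditioned stable poles that make the recurrent dynamics explicit, improving interpretability and controllability. Real poles model monotone or sign-alternating decay and complex-conjugate poles model damped oscillatory responses; grouped pole sharing and a lightweight low-rank input pathway keep the scan efficient, reducing SSM computation by up to 44% relative to baselines across several vision tasks.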

Abstract

State Space Models (SSMs) have emerged as a compelling alternative to attention models for long-range vision tasks, offering input-dependent recurrence with linear complexity. However, most efficient SSM variants reduce computation cost by modifying scan routes, resolutions, or traversal patterns, while largely leaving the recurrent dynamics implicit. Consequently, the model's state-dependent memory behavior is difficult to control, particularly in compact backbones where long scan paths can exceed the effective memory horizon. We propose Token-Conditioned Poles SSM (TCP-SSM), a structured selective SSM framework that improves efficiency while making recurrence dynamics explicit and interpretable through stable poles. TCP-SSM builds each scan operator with 1) real poles that model monotone or sign-alternating decay, and 2) complex-conjugate poles that capture damped oscillatory responses. Using bounded radius and angle modulation, TCP-SSM converts shared base poles into token-dependent poles, allowing each scan step to adapt its memory behavior to the current visual token while preserving pole stability. For practical scalability, we integrate grouped pole sharing with a lightweight low-rank input pathway, yielding an efficient scan operator that preserves linear-time scan complexity. Across image classification, semantic segmentation, and object detection, TCP-SSM reduces SSM computation complexity up to 44% in Vision Mamba-style models while maintaining or surpassing baseline accuracy.
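The pole behaviors named in the abstract can be illustrated with a small scalar sketch. A real pole $p$ with $|p| < 1$ gives the recurrence $h_t = p\,h_{t-1} + x_t$, whose impulse response $p^t$ decays monotonically for $p > 0$ and alternates in sign for $p < 0$; a complex pole $r e^{i\theta}$ with $r < 1$ gives a damped oscillation $r^t \cos(t\theta)$ in its real part. The modulation function below, including the sigmoid and tanh bounding, the `max_radius` cap, and every name, is an assumption for illustration and not the paper's parameterization.

```python
import cmath
import math


def modulate_pole(base_radius, base_angle, radius_shift, angle_shift,
                  max_radius=0.999, max_angle_shift=0.5):
    """Assumed bounded modulation: map a shared base pole to a token-dependent one."""
    q = base_radius / max_radius  # requires 0 < base_radius < max_radius
    logit = math.log(q / (1.0 - q))
    # tanh form of the sigmoid; the radius never exceeds max_radius, so the pole stays stable.
    radius = max_radius * 0.5 * (1.0 + math.tanh(0.5 * (logit + radius_shift)))
    angle = base_angle + max_angle_shift * math.tanh(angle_shift)
    return radius, angle


def real_pole_scan(inputs, poles):
    """First-order scan h_t = p_t * h_{t-1} + x_t with one real pole per step."""
    state, outputs = 0.0, []
    for x, pole in zip(inputs, poles):
        state = pole * state + x
        outputs.append(state)
    return outputs


def complex_pair_scan(inputs, radii, angles):
    """Scan with pole r_t * exp(i * theta_t); the real part is the conjugate pair's response up to a factor of 2."""
    state, outputs = 0j, []
    for x, radius, angle in zip(inputs, radii, angles):
        state = cmath.rect(radius, angle) * state + x
        outputs.append(state.real)
    return outputs


impulse = [1.0, 0.0, 0.0, 0.0]
monotone = real_pole_scan(impulse, [0.5] * 4)
alternating = real_pole_scan(impulse, [-0.5] * 4)
damped = complex_pair_scan(impulse, [0.5] * 4, [math.pi / 2] * 4)
```

Because each step multiplies the state by a pole of magnitude at most `max_radius`, the state stays bounded for bounded inputs regardless of how the per-token shifts vary.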

2605.11559 2026-05-13 cs.CV cs.AI

When Looking Is Not Enough: Visual Attention Structure Reveals Hallucination in MLLMs

Fanpu Cao, Xin Zou, Xuming Hu, Hui Xiong

Affiliations: Thrust of Artificial Intelligence, HKUST (Guangzhou); Department of Computer Science and Engineering, HKUST

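AI summary: Multimodal large language models (MLLMs) are central to visual reasoning and grounded question answering, yet they remain prone to visual hallucinations, where responses contradict the image or mention nonexistent objects. The paper finds that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals how attention changes when hallucinations arise. Based on this, it proposes LaSCD, a training-free decoding strategy that selects informative layers by Laplacian energy and remaps next-token logits, reducing hallucination while preserving general capabilities.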

Abstract

Multimodal large language models (MLLMs) have become a key interface for visual reasoning and grounded question answering, yet they remain vulnerable to visual hallucinations, where generated responses contradict image content or mention nonexistent objects. A central challenge is that hallucination is not always caused by a simple lack of visual attention: the model may still assign substantial attention mass to image tokens while internally drifting toward an incorrect answer. In this paper, we show that the high-frequency structure of visual attention, measured by layer-wise Laplacian energy, reveals both the layer where hallucinated preferences emerge and the layer where the ground-truth answer transiently recovers. Building on this finding, we propose LaSCD (Laplacian-Spectral Contrastive Decoding), a training-free decoding strategy that selects informative layers via Laplacian energy and remaps next-token logits in closed form. Experiments on hallucination and general multimodal benchmarks show that LaSCD consistently reduces hallucination while preserving general capabilities, highlighting its potential as a faithful decoding paradigm. The code is available at https://github.com/macovaseas/LaSCD.

2605.11556 2026-05-13 cs.AI cs.LG

Hindsight Hint Distillation: Scaffolded Reasoning for SWE Agents from CoT-free Answers

Shengjie Wang, Guanghe Li, Zonghan Yang, Yang Gao

Affiliations: Tsinghua University

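AI summary: This work proposes Hindsight Hint Distillation (HHD), a method for learning reasoning ability from question-answer pairs without chain-of-thought (CoT) annotations, targeting complex long-horizon tasks. HHD synthesizes hindsight hints from the model's own failed self-rollouts, uses them to scaffold successful on-policy rollouts, and then self-distills those trajectories. Experiments show that HHD clearly outperforms the compared baselines and generalizes well to unseen tasks.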

Comments 28 pages, 7 figures

Abstract

Solving complex long-horizon tasks requires strong planning and reasoning capabilities. Although datasets with explicit chain-of-thought (CoT) rationales can substantially benefit learning, they are costly to obtain. To address this challenge, we propose Hindsight Hint Distillation (HHD), which only requires easy-to-obtain question-answer pairs without CoT annotations. Inspired by how human teachers use student mistakes to provide targeted guidance, HHD synthesizes hindsight hints from the model's own failed self-rollouts and uses them to scaffold on-policy rollouts that successfully complete the tasks. The model then self-distills these scaffolded trajectories and generalizes to new problems without hint guidance. Experiments show that HHD significantly outperforms iterative RFT and trajectory-synthesis baselines, achieving an absolute improvement of 8% on SWE-bench Verified, while all baselines improve by only around 2%. Notably, the reasoning strategies induced by HHD generalize effectively to out-of-distribution tasks, yielding the largest gains on SWE-bench Multilingual despite no training on multilingual data. These results demonstrate that HHD can effectively synthesize expert-like reasoning from CoT-free data and substantially improve long-horizon performance.

2605.11554 2026-05-13 cs.LG

A Controlled Counterexample to Strong Proxy-Based Explanations of OOD Performance: in a Fixed Pretraining-and-Probing Setup

Hongmin Li

Affiliations: School of Life Science and Technology, Institute of Science Tokyo; Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences

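AI summary: This study asks whether, in a fixed pretraining-and-probing setup, explanations based on structure proxies accurately reflect differences in out-of-distribution (OOD) performance. Through a controlled construction it shows that the proxy ranking and the OOD probe accuracy ranking can disagree, so a structure proxy need not track the task-relevant structure that drives OOD performance. The counterexample marks a limit on strong proxy-based explanations: a proxy for total learned structure can fail to reflect task-relevant structure.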

Comments 19 pages, 3 figures

Abstract

Task-agnostic structure proxies are often used to interpret why one pretraining corpus transfers better than another, but such explanations require the proxy to track the structure that matters for the downstream task. We test this requirement in a fixed pretraining-and-probing setup motivated by computationally bounded notions of learned structure, including epiplexity. The core question is whether a proxy ranking of two pretraining datasets must agree with their ranking by OOD probe accuracy. We show that it need not. First, we give a controlled construction in which a formal structure quantity, its operational proxy, and the task-relevant structure for a target family separate. We then instantiate the same mechanism in a synthetic sequence-model experiment: under the primary all-sample evaluation, the OOD accuracy ranking reverses the proxy ranking in two of three seeds, with auxiliary diagnostics and ablations supporting the same interpretation. The counterexample does not reject structure-based explanations in general; it identifies a boundary on strong proxy-based explanations. A proxy for total learned structure can fail to track the task-relevant structure that drives OOD performance, even in a controlled setting.

2605.11551 2026-05-13 cs.LG cs.CV cs.IT math.IT

VNDUQE: Information-Theoretic Novelty Detection using Deep Variational Information Bottleneck

Aryan Gondkar, Hayder Radha, Yiming Deng

Affiliations: (1) Nondestructive Evaluation Lab, Department of Electrical Computer Engineering, Michigan State University, East Lansing, MI; (2) Department of Electrical

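AI summary: This paper proposes VNDUQE, a novelty detection and uncertainty quantification method based on the Deep Variational Information Bottleneck (VIB), for detecting out-of-distribution (OOD) samples in neural networks. It scores samples with two information-theoretic metrics, KL divergence and prediction entropy, and is evaluated on MNIST. A parallel detection strategy combining the two metrics outperforms the baseline on both far-OOD and near-OOD samples and yields more reliable uncertainty estimates.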

Comments 6 pages, 3 figures, Fall 2025 version

Abstract

Detecting out-of-distribution (OOD) samples is critical for safe deployment of neural networks in safety-critical applications. While maximum softmax probability (MSP) provides a simple baseline, it lacks theoretical grounding and suffers from miscalibration. We propose VNDUQE (VIB-based Novelty Detection and Uncertainty Quantification for Nondestructive Evaluation), which investigates novelty detection through the Deep Variational Information Bottleneck (VIB), which explicitly constrains information flow through learned representations. We train VIB models on MNIST with held-out digit classes and evaluate OOD detection using information-theoretic metrics: KL divergence and prediction entropy. Our results reveal complementary detection signals: KL divergence achieves perfect detection (100% AUROC on noise) on far-OOD samples (noise, domain shift), while prediction entropy excels at near-OOD detection (94.7% AUROC on novel digit classes). A parallel detection strategy combining both metrics achieves 95.3% average AUROC and 92% true positive rate at 5% false positive rate, which is a 32 percentage point improvement over baseline MSP (85.0% AUROC, 60.1% TPR). Compression via the information bottleneck principle ($\beta=10^{-3}$) reduces Expected Calibration Error by 38%, demonstrating that information-theoretic constraints produce fundamentally more reliable uncertainty estimates. These findings directly support active learning with expensive computational oracles, where well-calibrated novelty detection enables principled threshold selection for oracle queries.
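The two scores named in the abstract have standard closed forms, sketched below. For a diagonal Gaussian encoder posterior $\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and a standard normal prior, the KL divergence is $\tfrac{1}{2}\sum_j(\sigma_j^2 + \mu_j^2 - 1 - \ln \sigma_j^2)$, and prediction entropy is $-\sum_k p_k \ln p_k$. The standard normal prior, the "flag if either score exceeds its threshold" combination, the threshold values, and the function names are assumptions for illustration; the abstract does not specify how the parallel strategy combines the metrics.

```python
import math


def kl_to_standard_normal(mu: list[float], log_var: list[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def prediction_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of a predictive class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def is_novel(mu, log_var, probs, kl_threshold, entropy_threshold) -> bool:
    """Assumed parallel rule: flag the sample if either score exceeds its threshold."""
    return (kl_to_standard_normal(mu, log_var) > kl_threshold
            or prediction_entropy(probs) > entropy_threshold)


# Hypothetical thresholds and inputs, chosen only to exercise both branches.
far_ood = is_novel([3.0, 0.0], [0.0, 0.0], [0.99, 0.01], 2.0, 0.5)
near_ood = is_novel([0.0, 0.0], [0.0, 0.0], [0.5, 0.5], 2.0, 0.5)
in_dist = is_novel([0.0, 0.0], [0.0, 0.0], [0.99, 0.01], 2.0, 0.5)
```

In practice the thresholds would be chosen on held-out in-distribution data, for example to fix the false positive rate at the 5% operating point the abstract reports.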

2605.11550 2026-05-13 cs.CV

The DAWN of World-Action Interactive Models

Hongbo Lu, Liang Yao, Chenghao He, Haoyu Wang, Xiang Gu, Xianfei Li, Wenlong Liao, Tao He, Pai Peng

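Affiliations * COWARobot Co. Ltd; Shanghai Jiao Tong University; Hohai University

AI summary This paper proposes DAWN, a world-action interactive model that addresses the mutual dependence between world evolution and action generation in autonomous driving. DAWN couples a world predictor with a world-conditioned action denoiser in a semantic latent space, so that world prediction and action generation are refined recursively, which supports long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results on multiple autonomous driving benchmarks, pointing to interactive world-action generation as a path toward actionable world models.

Details
English abstract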

A plausible scene evolution depends on the maneuver being considered, while a good maneuver depends on how the scene may evolve. Existing World Action Models (WAMs) largely miss this reciprocity, treating world prediction and action generation as either isolated parallel branches or rigid predict-then-plan pipelines. We formalize this perspective as World-Action Interactive Models (WAIMs), and instantiate it in autonomous driving with DAWN (Denoising Actions and World iNteractive model), a simple yet strong latent generative baseline. DAWN operates in a compact semantic latent space and couples a World Predictor with a World-Conditioned Action Denoiser: the predicted world hypothesis conditions action denoising, while the denoised action hypothesis is fed back to update the world prediction, so that both are recursively refined during inference. Rather than eliminating test-time world evolution altogether or rolling out the full future in pixel space, DAWN performs a short explicit latent rollout that is sufficient to support long-horizon trajectory generation in complex interactive scenes. Experiments show that DAWN achieves strong planning performance and favorable safety-related results across multiple autonomous driving benchmarks. More broadly, our results suggest that interactive world-action generation is a principled path toward truly actionable world models.

2605.11547 2026-05-13 cs.LG cs.AI

Sharpen Your Flow: Sharpness-Aware Sampling for Flow Matching

Aditi Gupta, Soon Hoe Lim, Annan Yu, N. Benjamin Erichson

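Affiliations * Lawrence Berkeley National Laboratory; International Computer Science Institute; Department of Mathematics, KTH Royal Institute of Technology; Nordita, KTH Royal Institute of Technology and Stockholm University; Center for Applied Mathematics, Cornell University

AI summary This paper proposes SharpEuler, a training-free sampling method that improves the generation efficiency and quality of flow matching models. It profiles a pretrained model offline to estimate where the velocity field changes most rapidly, then derives a timestep grid for any inference budget, improving sampling quality with the same number of model evaluations. Experiments show that, at a fixed compute budget, SharpEuler reduces inter-mode leakage and increases mode coverage.

Details
English abstract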

Flow matching models generate samples by numerically integrating a learned velocity field, with each integration step requiring a neural network evaluation. Fast generation therefore requires using a small fixed evaluation budget effectively: the key question is not only how to integrate the flow, but where the sampler should spend its steps. We propose SharpEuler, a training-free sampler that profiles a pretrained model offline by estimating where the learned velocity field changes most rapidly along calibration trajectories. This finite-difference estimate defines a solver-aware sharpness profile, which is smoothed and converted by a quantile transform into a timestep grid for any desired inference budget. At test time, sampling remains ordinary Euler integration with the same number of model evaluations as a uniform schedule. We justify SharpEuler using three principles: a numerical principle identifying trajectory acceleration as the leading source of Euler discretization error, a variational principle deriving sharpness-based power-law timestep densities, and a statistical guarantee showing that the finite-sample calibrated sampler is stable at the terminal distribution level. Our experiments show that SharpEuler improves sample quality at fixed budgets, reducing inter-mode leakage and increasing mode coverage.
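
The step from a sharpness profile to a timestep grid can be illustrated with a small quantile transform. This is a sketch under stated assumptions rather than the SharpEuler implementation: the profile is taken as already estimated and smoothed on a uniform fine grid over [0, 1], and the `power` exponent is a hypothetical stand-in for the power-law density mentioned in the abstract.

```python
def quantile_timestep_grid(sharpness, num_steps, power=1.0):
    """Map a sharpness profile on a uniform fine grid over [0, 1] to num_steps + 1 timesteps.

    sharpness[i] is the non-negative sharpness on fine-grid cell i. Timesteps are
    placed so that every Euler step covers an equal share of sharpness ** power.
    """
    density = [max(s, 0.0) ** power + 1e-12 for s in sharpness]
    total = sum(density)
    n_cells = len(density)
    # Cumulative distribution evaluated at the fine-grid cell edges.
    cdf = [0.0]
    for d in density:
        cdf.append(cdf[-1] + d / total)
    grid = [0.0]
    cell = 0
    for k in range(1, num_steps):
        q = k / num_steps
        while cell < n_cells - 1 and cdf[cell + 1] < q:
            cell += 1
        # Linear interpolation inside the cell that contains quantile q.
        frac = (q - cdf[cell]) / (cdf[cell + 1] - cdf[cell])
        grid.append((cell + frac) / n_cells)
    grid.append(1.0)
    return grid
```

A flat profile reproduces the uniform schedule, while a profile concentrated in one region pulls the interior timesteps into that region. The number of model evaluations is `num_steps` in both cases.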

2605.11541 2026-05-13 cs.CV

GeoR-Bench: Evaluating Geoscience Visual Reasoning

Yushuo Zheng, Zicheng Zhang, Huiyu Duan, Chunyi Li, Zijian Chen, Ziheng Jia, Yue Shi, Ke Gu, Xiongkuo Min, Guangtao Zhai

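Affiliations * Shanghai Jiao Tong University; Beijing University of Technology

AI summary GeoR-Bench is a benchmark for evaluating geoscience visual reasoning, built to measure how well current AI systems understand and predict earth system changes. It contains 440 curated samples spanning 6 geoscience categories and 24 task types, and uses visual editing tasks to assess a model's reasoning, consistency, and output quality. Results show that geoscience reasoning remains a significant bottleneck for existing models: the best model reaches only 42.7% overall strict accuracy, open-source models do worse, and scientific accuracy still has substantial room to improve.

Details
English abstract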

Geoscience intelligence is expected to understand, reason about, and predict earth system changes to support human decision-making in critical domains such as disaster response, climate adaptation, and environmental protection. Although current research has shown promising progress on specific geoscience tasks, such as remote sensing interpretation and geographic question answering, existing benchmarks remain largely task-specific and fail to capture open-ended, real-world geoscience problems. As a result, it remains unclear how far current AI systems are from achieving genuine geoscience intelligence. To address this gap, we present GeoR-Bench, a Benchmark for evaluating Geoscience visual Reasoning through reasoning-informed visual editing tasks. GeoR-Bench contains 440 curated samples spanning 6 geoscience categories and 24 task types, covering earth observation imagery and structured scientific representations such as maps and diagrams. We evaluate outputs along three dimensions: reasoning, consistency, and quality. Benchmark results of 21 closed- and open-source multimodal models reveal that geoscience reasoning remains a critical bottleneck. The highest-performing model achieves 42.7\% overall strict accuracy, while the best open-source model reaches only 10.3\%. Notably, the visual consistency and image quality of the outputs frequently surpass their scientific accuracy. Ultimately, these findings indicate that current models generate superficially plausible results but fail to capture underlying earth science processes.

2605.11538 2026-05-13 cs.CL cs.AI cs.LG

Taming Extreme Tokens: Covariance-Aware GRPO with Gaussian-Kernel Advantage Reweighting

Cheng Wang, Qin Liu, Wenxuan Zhou, Muhao Chen

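Affiliations * National University of Singapore; University of California, Davis; University of Southern California

AI summary This paper addresses the exploration-exploitation balance in training large language models and proposes a covariance-aware variant of GRPO. The method uses a Gaussian kernel to dynamically down-weight extreme token-level updates, reducing training instability without discarding useful learning signals. Experiments show that it outperforms the original GRPO on multiple reasoning benchmarks, improving downstream performance and stabilizing entropy during training.

Comments ACL 2026

Details
English abstract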

Group Relative Policy Optimization (GRPO) has emerged as a promising approach for improving the reasoning capabilities of large language models. However, it struggles to balance the trade-off between exploration and exploitation during training, often resulting in suboptimal performance. Motivated by the theoretical insight that changes in entropy are governed by the covariance between token probabilities and their corresponding advantages, we propose a hyperparameter-free, covariance-weighted optimization method that dynamically down-weights extreme token-level updates via a Gaussian kernel. This approach automatically reduces the instability caused by the exploration-exploitation trade-off while preserving informative learning signals. Extensive empirical evaluations show that our approach improves downstream performance across reasoning benchmarks compared with GRPO, and effectively stabilizes entropy as training progresses.
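
One way to picture Gaussian-kernel down-weighting of extreme tokens is sketched below. The abstract does not give the exact formula, so every detail here is an assumption made for illustration: token log-probabilities stand in for token probabilities, each token's covariance contribution is its centered log-probability times its centered advantage, and the kernel bandwidth is the batch standard deviation so that no tunable constant is introduced.

```python
import math

def gaussian_kernel_weights(log_probs, advantages):
    """Per-token weights in (0, 1] that shrink for extreme covariance terms."""
    n = len(log_probs)
    mean_lp = sum(log_probs) / n
    mean_adv = sum(advantages) / n
    # Per-token contribution to the covariance between log-probability and advantage.
    cov_terms = [(lp - mean_lp) * (a - mean_adv) for lp, a in zip(log_probs, advantages)]
    mean_c = sum(cov_terms) / n
    std_c = math.sqrt(sum((c - mean_c) ** 2 for c in cov_terms) / n)
    if std_c == 0.0:
        return [1.0] * n  # no token is extreme when all terms coincide
    # Gaussian kernel on the standardized term: typical tokens keep weight near 1.
    return [math.exp(-0.5 * ((c - mean_c) / std_c) ** 2) for c in cov_terms]
```

In a training loop these weights would multiply the per-token policy-gradient terms, so tokens with outlying covariance contributions move the policy less.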

2605.11537 2026-05-13 cs.LG

Fast MoE Inference via Predictive Prefetching and Expert Replication

Ankit Jyothish, Ali Jannesari, Aishwarya Sarkar, Joseph Zuber

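Affiliations * Iowa State University

AI summary This paper targets the low GPU utilization, load imbalance, and high latency of Mixture of Experts (MoE) architectures in large language model inference, and proposes an acceleration method based on predictive prefetching and expert replication. It dynamically predicts which experts are likely to be overloaded and replicates them for upcoming batches, enabling concurrent processing across layers, which increases parallelism, reduces GPU idle time, and speeds up inference. Experiments show up to a 3x inference speedup and near-100% GPU utilization while preserving roughly 90-95% of baseline performance.

Details
English abstract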

The Mixture of Experts (MoE) architecture has become a fundamental building block in state-of-the-art large language models (LLMs), improving domain-specific expertise in LLMs and scaling model capacity without proportionally increasing their computational overhead. However, MoE inference often suffers from suboptimal GPU utilization, load imbalance, and elevated latency, because sparse expert activation leaves multiple tokens waiting on the same experts for their computation. To address these challenges, we propose a dynamic expert replication strategy that predicts which experts are likely to be overloaded and replicates them for upcoming batches of tokens. The replicated experts process batch tokens concurrently across layers, which leads to improved parallelism, shorter GPU idle time, and significantly faster inference. Experimental evaluations conducted on large-scale MoE models, including Switch-base-128 and Switch-base-256, demonstrate that our method achieves near-complete GPU utilization (approximately 100%), leading to up to a 3x improvement in inference speed while preserving approximately 90-95% of the performance of baseline architectures.
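
The replication idea can be sketched as a small planning step. This is an illustrative toy, not the paper's system: it assumes the per-token expert predictions for the next batch are already available, and the greedy rule for handing out spare slots, along with the names `plan_replicas` and `num_slots`, is hypothetical.

```python
from collections import Counter

def plan_replicas(predicted_expert_ids, num_experts, num_slots):
    """Give every expert one copy, then hand spare slots to the most loaded experts.

    predicted_expert_ids: predicted expert index for each token in the upcoming batch.
    num_slots: total number of expert copies that can be hosted (>= num_experts).
    Returns a dict mapping expert index to replica count.
    """
    load = Counter(predicted_expert_ids)
    replicas = {e: 1 for e in range(num_experts)}
    for _ in range(num_slots - num_experts):
        # Replicate the expert whose per-replica token load is currently highest.
        hottest = max(replicas, key=lambda e: load[e] / replicas[e])
        replicas[hottest] += 1
    return replicas

def max_tokens_per_replica(predicted_expert_ids, replicas):
    """Worst-case queue length once tokens are split evenly across replicas."""
    load = Counter(predicted_expert_ids)
    return max(-(-load[e] // r) for e, r in replicas.items())
```

With eight tokens predicted for one expert and two spare slots, the worst-case queue drops from eight tokens to three, which is the kind of idle-time reduction the abstract attributes to replication.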

2605.11535 2026-05-13 cs.LG

Primal-Dual Policy Optimization for Linear CMDPs with Adversarial Losses

Kihyun Yu, Seoungbin Bae, Dabeen Lee

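Affiliations * Research Institute of Mathematics, Seoul National University; Interdisciplinary Program in Artificial Intelligence, Seoul National University; Korea Institute for Advanced Study

AI summary This paper studies policy optimization in online finite-horizon adversarial linear constrained Markov decision processes (CMDPs), where the losses are chosen adversarially and the costs are stochastic. The authors propose a primal-dual algorithm that is the first in this setting to achieve sublinear regret and constraint violation bounds, both $\widetilde{\mathcal{O}}(K^{3/4})$. The algorithm introduces a new class of weighted LogSumExp softmax policies and combines periodic policy mixing with a regularized dual update to control the policy covering number and the dual variable, which underpins the theoretical guarantees.

Comments Accepted to ICLR 2026

Details
English abstract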

Existing work on linear constrained Markov decision processes (CMDPs) has primarily focused on stochastic settings, where the losses and costs are either fixed or drawn from fixed distributions. However, such formulations are inherently vulnerable to adversarially changing environments. To overcome this limitation, we propose a primal-dual policy optimization algorithm for online finite-horizon adversarial linear CMDPs, where the losses are adversarially chosen under full-information feedback and the costs are stochastic under bandit feedback. Our algorithm is the first to achieve sublinear regret and constraint violation bounds in this setting, both bounded by $\widetilde{\mathcal{O}}(K^{3/4})$, where $K$ denotes the number of episodes. The algorithm introduces and runs with a new class of policies, which we call weighted LogSumExp softmax policies, designed to adapt to adversarially chosen loss functions. Our main result stems from the following key contributions: (i) a new covering number argument for the weighted LogSumExp softmax policies, and (ii) two novel algorithmic components -- periodic policy mixing and a regularized dual update -- which allow us to effectively control both the covering number and the dual variable. We also report numerical results that validate our theoretical findings on the performance of the algorithm.

2605.11534 2026-05-13 cs.RO

PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments

Yunn Kang Lim, Pengzhan Sun, Ziyi Bai, Xun Xu, Angela Yao, Xulei Yang, Shijie Li

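Affiliations * A*STAR; National University of Singapore; BAAI

AI summary PRISM is a diagnostic benchmark for identifying why embodied agents fail at household tasks, shifting the question from whether the agent succeeded to which capability is most likely responsible for failure. Built on five photorealistic multi-room apartments, it organizes 300 human-verified tasks into three capability tiers that evaluate perception-to-action grounding, implicit intent resolution, and long-horizon coordination. Experiments show that implicit intent resolution is a significant bottleneck for current mainstream large language models, while long-horizon coordination exposes a clear gap in planning capability.

Details
English abstract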

When an LLM-based embodied agent fails at a household task, the culprit could be misidentified objects, forgotten sub-goals, or poor action sequencing -- yet existing benchmarks report only a single success rate, making it impossible to tell which cognitive module is responsible. We present PRISM, a diagnostic benchmark that reframes this problem: rather than asking only "did the agent succeed?", PRISM asks "which capability is most likely responsible for failure?" Built on five photorealistic multi-room apartments (4--8 rooms each), PRISM structures 300 human-verified tasks into three capability tiers -- Basic Ability, Reasoning Ability, and Long-horizon Ability -- that isolate perception-to-action grounding, implicit intent resolution, and sustained multi-step coordination, respectively. PRISM exposes an agent-agnostic executable action API that allows arbitrary agents (LLM agents, VLM agents, symbolic planners, RL policies, and hybrid systems) to be evaluated end-to-end under the same benchmark protocol. To support deeper diagnosis, optional probes for perception, memory, and planning can be adopted, replaced, or bypassed entirely, enabling controlled component-level analysis when desired. Experiments on seven contemporary LLMs establish a clear hierarchy: explicit spatial grounding is not the dominant failure source under oracle perception, implicit intent resolution is a significant bottleneck for all model families, and long-horizon coordination exposes a stark capability cliff -- lightweight models collapse to as low as 20.0\% success while simultaneously consuming more tokens than their frontier counterparts, a signature of compensatory over-reasoning rather than genuine planning capability. Project page: https://sj-li.com/PROJ/PRISM

2605.11532 2026-05-13 cs.AI

Read, Grep, and Synthesize: Diagnosing Cross-Domain Seed Exposure for LLM Research Ideation

Yunju Choi, Min Song

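Affiliations * Yonsei University, Seoul, Republic of Korea

AI summary This paper examines whether large language models (LLMs) benefit from cross-domain knowledge when generating research ideas. The authors study the question with PaperGym, a three-stage pipeline of tool-augmented seed extraction, cross-domain seed retrieval, and method synthesis, and assess how different seed sources affect novelty. Experiments show that cross-domain retrieval wins more pairwise novelty comparisons than same-domain and no-retrieval baselines, but is not significantly better than a random diverse-seed control. The study concludes that current LLMs do not yet reliably exploit the semantic reason particular seeds were retrieved.

Comments 12 pages, 2 figures, 7 tables

Details
English abstract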

The discovery of novel methodologies for emerging problems is a continuing cycle in ML, often driven by the migration of techniques across domains. Building on this observation, we ask whether current LLM ideation systems benefit from targeted cross-domain retrieval or simply from exposure to diverse mechanisms. We study this question through PaperGym, a three-stage pipeline: (1) tool-augmented seed extraction via read, grep, and bash over an isolated paper environment, (2) cross-domain seed retrieval via paraphrasing across seven ML domains, and (3) method synthesis from retrieved seeds, each scored by rubric-based judges. Tool-augmented extraction improves specificity, and paraphrase-based retrieval broadens domain coverage. In synthesis, cross-domain retrieval receives more pairwise novelty wins than no-retrieval and same-domain baselines, but shows no significant difference from a random diverse-seed control. These findings suggest LLM ideation systems benefit from diverse seed exposure, but do not yet reliably exploit the semantic reason particular seeds were retrieved. We release the seed library, rubric prompts, and run scripts at https://github.com/yunjoochoi/PaperGym

2605.11530 2026-05-13 cs.LG

Multi-Narrow Transformation as a Single-Model Ensemble: Boundary Conditions, Mechanisms, and Failure Modes

Tatsuhito Hasegawa, Taisei Tanaka

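Affiliations * Graduate School of Engineering, University of Fukui

AI summary This paper asks whether, under a roughly matched parameter budget, model capacity is better concentrated in a single wide pathway or spread across many narrow, independent branches. Using the Multi-Narrow transformation, which converts a baseline convolutional neural network into a single-model ensemble of narrow branches, the authors systematically compare Single-Wide and Multi-Narrow configurations across data regimes, architectures, and datasets. Multi-Narrow models perform better when data is scarce because they learn more diverse and less redundant features, whereas single-wide models are preferable when data is plentiful; the trend holds across multiple CNN architectures and image-classification tasks.

Comments 12 pages, 9 figures, 4 tables. Preprint version of a manuscript submitted to Neurocomputing

Details
English abstract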

Single-model ensembles (SMEs) have attracted attention as a way to approximate some of the benefits of deep ensembles within a single network. However, under an approximately matched parameter budget, it remains unclear whether model capacity should be concentrated in a single wide pathway or redistributed into many narrow and independent members. We investigate this question through the Multi-Narrow (MN) transformation, which converts a baseline CNN into an SME of narrow, path-wise independent branches while approximately preserving the dominant parameter budget. We systematically compare Single-Wide and Multi-Narrow configurations across different training-data regimes, architectures, and datasets. The results show that the effectiveness of MN is strongly data-dependent: weakly partitioned or baseline-wide models are preferable in data-rich settings, whereas highly partitioned MN models consistently outperform the baseline in low-data settings. This tendency is reproduced across multiple CNN architectures and image-classification datasets, suggesting that it is not specific to a single benchmark or model family. Analysis of internal representations shows that high-MN models learn more diverse and less redundant path-wise features. In low-data regimes, this diversity is broadly utilized and improves generalization, whereas in data-rich regimes, training becomes imbalanced and prediction is dominated by a small subset of paths. These findings clarify when and why Multi-Narrow transformation is effective, and provide practical guidance for allocating model capacity between width and member multiplicity under a limited budget.
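
The phrase "approximately preserving the dominant parameter budget" can be made concrete with a little arithmetic. The sketch below is an assumption about how such matching could work, not the paper's procedure: it counts only convolution weights, and it picks the branch width as the base width divided by the square root of the number of branches, since a convolution's weight count grows with the product of input and output channels.

```python
import math

def conv_params(c_in, c_out, kernel=3):
    """Weight count of one 2D convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out

def narrow_width(base_width, num_branches):
    """Branch width that keeps M independent branches near the single-wide budget."""
    return max(1, round(base_width / math.sqrt(num_branches)))

base = conv_params(256, 256)      # one wide 256 -> 256 layer
w = narrow_width(256, 4)          # channels per branch when split four ways
multi = 4 * conv_params(w, w)     # four independent narrow layers
print(base, w, multi)
```

For a 256-channel layer split into four branches, each branch gets 128 channels and the total weight count matches the single wide layer. Branch counts that are not perfect squares match only approximately because of rounding.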

2605.11527 2026-05-13 cs.LG cs.CR cs.DB

FERMI: Exploiting Relations for Membership Inference Against Tabular Diffusion Models

Abtin Mahyar, Masoumeh Shafieinejad, Yuhan Liu, Xi He

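Affiliations * University of Waterloo; Vector Institute

AI summary This work studies membership inference attacks against tabular diffusion models, focusing on the multi-table relational structure of real sensitive data that existing attacks ignore. It proposes FERMI, which enriches single-table features with auxiliary information from tables related to the target table to strengthen the attack. Across several tabular diffusion models and real-world datasets, FERMI consistently outperforms single-table attacks, raising TPR@0.1FPR by up to 53% in the white-box setting and 22% in the black-box setting.

Details
English abstract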

Diffusion models are the leading approach for tabular data synthesis and are increasingly used to share sensitive records. Whether they actually protect privacy has become a pressing question. Membership inference attacks are the standard tool for this purpose, yet existing attacks assume a single-table setting and ignore the multi-relational structure of real sensitive data. A core challenge in assessing privacy risks from membership inference attacks in multi-table settings is how to leverage auxiliary information from relations associated with the target table, such as its parent tables. In particular, we study a practical setting in which such auxiliary information is available only when training the attack model. At inference time, the attacker observes only the attribute values of the target record from the target table. We propose FERMI (FEature-mapping for Relational Membership Inference), which bridges this gap by enriching single-table features with relational membership signal. Across three tabular diffusion architectures and three real-world relational datasets, FERMI consistently improves attack performance over single-table baselines, with TPR@$0.1$FPR rising by up to 53% over the single-table baseline in the white-box setting and 22% in the black-box setting.
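
The headline metric, true positive rate at a fixed false positive rate, can be computed from raw attack scores as sketched below. This is a generic illustration, not FERMI's evaluation code: `target_fpr` is expressed as a fraction, and whether the abstract's 0.1 denotes a fraction or a percentage is not stated there.

```python
def tpr_at_fpr(member_scores, nonmember_scores, target_fpr):
    """TPR at the most permissive threshold whose FPR stays at or below target_fpr.

    A higher score means "more likely a training member".
    """
    negatives = sorted(nonmember_scores, reverse=True)
    allowed_false_positives = int(target_fpr * len(negatives))
    if allowed_false_positives >= len(negatives):
        threshold = float("-inf")
    else:
        # Predict "member" only for scores strictly above this non-member score.
        threshold = negatives[allowed_false_positives]
    true_positives = sum(1 for s in member_scores if s > threshold)
    return true_positives / len(member_scores)
```

Reporting the true positive rate at a low false positive rate emphasizes confident identifications of individual records, which average-case metrics such as overall accuracy can hide.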

2605.11525 2026-05-13 cs.LG

OverNaN: NaN-Aware Oversampling for Imbalanced Learning with Meaningful Missingness

Amanda S Barnard

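Affiliations * School of Computing, Australian National University

AI summary In practice, missing values are usually treated as defects to be deleted or imputed, yet missingness itself can carry important information. This paper presents OverNaN, a lightweight oversampling framework that addresses class imbalance while preserving missingness structure. It extends conventional synthetic oversampling to generate samples directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated, so that class imbalance can be addressed without destroying missingness information. The framework targets scientific and engineering settings where missing values are unavoidable and informative.

Comments 14 pages, 2 figures, 17 tables

Details
English abstract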

Missing values are routinely treated as defects to be eliminated through deletion or imputation prior to machine learning. In many applied domains, however, missingness itself carries information, reflecting experimental constraints, measurement choices, or systematic mechanisms tied to the data-generating process. Eliminating or masking this structure can distort class boundaries, introduce bias, and reduce generalisability, particularly in imbalanced datasets where minority classes are already under-represented. OverNaN is a lightweight, NaN-aware oversampling framework designed to address class imbalance without erasing missingness structure. It extends common synthetic oversampling methods to operate directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated according to explicitly defined strategies. Rather than repairing missing data, OverNaN treats missingness as part of the feature space over which synthetic samples are generated. This paper situates OverNaN within the broader landscape of imbalanced learning, missing-data handling, and NaN-tolerant algorithms. Using representative examples included with the software, we demonstrate that meaningful missingness can be retained during oversampling without introducing artificial certainty. OverNaN is intended for practitioners working with small, incomplete, and imbalanced datasets in scientific and engineering domains where missingness is unavoidable and often informative.
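
A SMOTE-style interpolation that respects NaN entries could look like the sketch below. It does not use the OverNaN API: the function name and the two strategy labels, `"preserve"` and `"fill"`, are hypothetical and only illustrate the difference between keeping missingness and copying an observed value.

```python
import math
import random

def nan_aware_sample(x, neighbour, strategy="preserve", rng=random):
    """Interpolate between two minority samples that may contain NaN.

    strategy="preserve": a feature missing in either parent stays missing.
    strategy="fill":     a feature missing in one parent copies the observed value;
                         it stays missing only if both parents lack it.
    """
    gap = rng.random()  # one interpolation factor per synthetic sample, as in SMOTE
    out = []
    for a, b in zip(x, neighbour):
        a_nan, b_nan = math.isnan(a), math.isnan(b)
        if not a_nan and not b_nan:
            out.append(a + gap * (b - a))
        elif strategy == "fill" and a_nan != b_nan:
            out.append(b if a_nan else a)
        else:
            out.append(float("nan"))
    return out
```

Under `"preserve"` the synthetic sample never claims a measurement that one of its parents lacked, which matches the abstract's concern about introducing artificial certainty.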

2605.11524 2026-05-13 cs.LG cs.CE

EqOD: Symmetry-Informed Stability Selection for PDE Identification

Gnankan Landry Regis N'guessan, Bum Jun Kim

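Affiliations * Axiom Research Group; Department of Applied Mathematics and Computational Science, NM-AIST, Tanzania; African Institute for Mathematical Sciences (AIMS), Research and Innovation Centre, Rwanda; The University of Tokyo, Japan

AI summary This work proposes EqOD, an automatic method for identifying partial differential equations (PDEs) from noisy data, which combines symmetry-based library reduction with stability selection to reduce false positives and improve identification accuracy. When Galilean invariance is detected, EqOD uses the symmetry-reduced library to eliminate terms that cannot appear in the governing equation; otherwise it applies randomized LASSO stability selection. Across 8 PDEs and 4 noise levels, EqOD wins 7 of 32 cells under a strict criterion while WF-LASSO wins none, and it outperforms PySINDy in 23 of 32 cells.

Comments 45 pages, 16 figures

Details
English abstract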

Data-driven identification of partial differential equations (PDEs) relies on sparse regression over a candidate library of differential operators, where larger libraries inflate false positives under observation noise and smaller libraries risk missing true terms. We introduce Equivariant Operator Discovery (EqOD), a fully automatic method combining two library reduction mechanisms. When Galilean invariance is detected from trajectory data via a weak-form structural test, EqOD uses the symmetry-reduced library, eliminating terms that our Galilean exclusion result proves to be absent from the governing equation. Otherwise, it applies randomized LASSO stability selection guided by classical false-positive bounds. A residual-based fallback prevents degradation below the full-library baseline. On 8 PDEs at 4 noise levels, EqOD attains $F_1 = 1.000 \pm 0.000$ on Heat at $20\%$ noise, where WF-LASSO obtains $0.475 \pm 0.181$, official PySINDy 2.0 obtains $0.000$, and the WSINDy reimplementation obtains $0.789$. Under the strict criterion that the mean F1 difference exceeds the larger of the two standard deviations, EqOD wins 7 of 32 cells. WF-LASSO wins none, and the remaining 25 cells are ties. EqOD outperforms PySINDy 2.0.0 in 23 of 32 cells, and all 5 PySINDy wins occur on reaction PDEs. External validation on WeakIdent and PINN-SR datasets gives $F_1 = 1.000$ on all 5 clean benchmarks. NLS, 2D, coupled-system, and cylinder-wake extensions are reported. The Galilean library reduction is proved under explicit autonomy and library assumptions. The stability-selection step is motivated by classical false-positive bounds, while formal guarantees for correlated PDE design matrices remain open.
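
The strict win criterion stated in this abstract is simple enough to write down directly. The function name is hypothetical; the rule itself, that the mean F1 difference must exceed the larger of the two standard deviations, and the example numbers are taken from the abstract.

```python
def strict_winner(mean_a, std_a, mean_b, std_b):
    """Return 'a', 'b', or 'tie' under the larger-standard-deviation margin rule."""
    margin = max(std_a, std_b)
    if mean_a - mean_b > margin:
        return "a"
    if mean_b - mean_a > margin:
        return "b"
    return "tie"

# Heat equation at 20% noise: EqOD (a) versus WF-LASSO (b), values quoted above.
print(strict_winner(1.000, 0.000, 0.475, 0.181))  # a
```

Here the mean gap of 0.525 exceeds the larger standard deviation of 0.181, so the cell counts as a win. Cells with a smaller gap are scored as ties, which is why 25 of the 32 cells are reported that way.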

2605.11521 2026-05-13 cs.CV

XWOD: A Real-World Benchmark for Object Detection under Extreme Weather Conditions

Chih-Hsin Chen, Yu-Tung Liu, Amar Fadillah, Kuan-Ting Lai, Dong Liu

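Affiliations * Department of Electronic Engineering, National Taipei University of Technology; Adobe Inc.

AI summary This paper presents XWOD, a large-scale real-world dataset for object detection under extreme weather, containing 10,010 images and 42,924 bounding boxes covering six traffic-object categories across seven conditions: rain, snow, fog, haze/sand/dust, flooding, tornado, and wildfire. XWOD broadens the weather taxonomy and is the first to include climate-amplified hazards. Detectors trained on XWOD and tested zero-shot on other weather datasets substantially exceed the published baselines, which supports the quality of the data and makes XWOD a strong benchmark for traffic perception under extreme weather.

Details
English abstract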

Autonomous driving and intelligent transportation systems remain vulnerable under extreme weather. The U.S. Federal Highway Administration reports that roughly 745,000 crashes and 3,800 fatalities per year are weather-related, and recent regulatory investigations have examined failures of Level-2/3 driving systems under reduced-visibility conditions. However, datasets commonly used to evaluate weather robustness remain limited in scale, diversity, and realism. In this paper, we introduce XWOD (Extreme Weather Object Detection), a large-scale real-world traffic-object detection benchmark containing 10,010 images and 42,924 bounding boxes across seven extreme weather conditions: rain, snow, fog, haze/sand/dust, flooding, tornado, and wildfire. The dataset covers six traffic-object categories, including car, person, truck, motorcycle, bicycle, and bus. XWOD extends the weather taxonomy from one to seven conditions, and is the first to cover the emerging class of climate-amplified hazards, such as flooding, tornado, and wildfire. To evaluate the quality of our data, we train standard YOLO-family detectors on XWOD and test them zero-shot on external weather benchmarks, achieving mAP$_{50}$ scores of 63.00% on RTTS, 59.94% on DAWN, and 61.12% on WEDGE, compared with the corresponding published YOLO-based baselines of 40.37%, 32.75%, and 45.41%, respectively, representing relative improvements of 56%, 83%, and 35%. These cross-dataset results show that XWOD provides a strong source domain for learning weather-robust traffic perception. We release the dataset, splits, baseline weights, and reproducible evaluation code under a research-use license.
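
The relative improvements quoted in this abstract follow from the reported mAP50 pairs, as the short check below shows. The helper name is hypothetical; the scores are the ones stated above.

```python
def relative_improvement(new, baseline):
    """Relative gain of new over baseline, in percent."""
    return 100.0 * (new - baseline) / baseline

# (XWOD-trained zero-shot mAP50, published YOLO-based baseline mAP50)
reported = {"RTTS": (63.00, 40.37), "DAWN": (59.94, 32.75), "WEDGE": (61.12, 45.41)}
gains = {name: round(relative_improvement(new, base))
         for name, (new, base) in reported.items()}
print(gains)  # {'RTTS': 56, 'DAWN': 83, 'WEDGE': 35}
```

These are relative gains over the baseline score. The corresponding absolute gains are 22.63, 27.19, and 15.71 mAP50 points.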