arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.09310 2026-05-12 cs.AI q-fin.PM

Beyond ESG Scores: Learning Dynamic Constraints for Sequential Portfolio Optimization

Xin Li, Yan Ke, Longbing Cao

Affiliations: Macquarie University; The University of Queensland

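AI summary: This paper studies how to incorporate environmental, social, and governance (ESG) factors into portfolio optimization more effectively in sustainable investing. Rather than treating ESG as a static score, as conventional methods do, the authors propose a dynamic constraint-learning approach: a Multimodal Action-Conditioned Constraint Field (MACF) learns mechanism-specific ESG costs from real-time, multi-source data, and MACF-X adapters convert these constraints into interfaces that optimizers can consume natively. The method reduces ESG budget pressure while maintaining good financial performance, and experiments show that the gains depend on dynamic evidence inputs and the three-head decomposition.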

Abstract

ESG-aware portfolio optimization is increasingly important for sustainable capital allocation, yet most learning-based methods still operationalize ESG by appending static scores to the policy observation or reward. This creates a mismatch for sequential control: ESG scores are noisy, provider-dependent, low-frequency, and temporally misaligned with sequential portfolio decisions, while financial evidence suggests that ESG is better treated as a portfolio preference, risk-exposure, or hedge dimension than as a robust alpha factor. We propose to impose ESG constraints without modifying the financial policy's observation or reward, using a Multimodal Action-Conditioned Constraint Field (MACF) that learns mechanism-specific ESG costs from point-in-time multimodal evidence and contemplated portfolio transitions. We then introduce MACF-X, a family of optimizer-specific adapters that converts MACF costs and uncertainties into native constrained-optimization interfaces through a shared slack- and uncertainty-aware pressure layer. Across multiple constraint-integration interfaces, MACF-X reduces tail ESG budget pressure while maintaining competitive financial performance. Ablations show that this improvement depends on dynamic evidence inputs and three-head decomposition, while static ESG-score proxies are nearly indistinguishable from score-shuffled noise baselines.

2605.09308 2026-05-12 cs.LG cs.AI

Hierarchical Attention-based Graph Neural Network with Relevance-driven Pruning

Seungwoo Kum

Affiliations: Korea Electronics Technology Institute (KETI)

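AI summary: This paper proposes a Hierarchical Attention-based Heterogeneous Graph Neural Network (HA-HeteroGNN) to address two problems: GNNs offer limited interpretability across heterogeneous node types, and computation over large, noisy graphs is expensive. Through a unified explainability-to-pruning pipeline, a two-tier attention mechanism separates sensor-level from context-level computation and produces node relevance scores, which then serve as the pruning criterion, reducing the number of graph edges while improving classification accuracy. Experiments show that the method markedly reduces training time and inference latency while maintaining high classification performance, supporting its practical usefulness.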

Abstract

Graph Neural Networks (GNNs) excel at relational reasoning but face two persistent challenges: the lack of interpretable attribution for heterogeneous node types, and the computational overhead of message passing over large, noisy graphs. We propose the Hierarchical Attention-based Heterogeneous GNN (HA-HeteroGNN), a framework that addresses both issues through a unified explainability-to-pruning pipeline. A two-tier attention mechanism separates sensor-level and context-level computation across 16 node types and 18 edge types, producing per-node relevance scores via an attention-based GNN Explainer without requiring gradient backpropagation. These relevance scores then serve as a principled pruning criterion: removing nodes identified as consistently uninformative yields a 27% reduction in graph edges while simultaneously improving classification accuracy by 2.4–6.1% across all model variants, challenging the conventional assumption that pruning necessarily trades accuracy for efficiency. Experiments on a 50,000-record synthetic dataset spanning 11 report categories demonstrate 97.5% cross-strategy explanation stability and domain-consistent sensor attribution, with training-time reductions of up to 43.9% and real-time inference latency of approximately 58–60 ms per sample.

2605.09303 2026-05-12 cs.LG

Path-Dependent Denoising: A Non-Conservative Field Perspective on Order Collapse in Diffusion Language Models

Jeonseong Kim

Affiliations: GitHub

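AI summary: Diffusion language models (DLMs) offer a structural alternative to autoregressive generation, allowing tokens to be updated in arbitrary order or in parallel. In practice, however, decoding remains highly order-dependent and often behaves like autoregressive generation. Taking a non-conservative field perspective, this paper introduces the notion of path-dependent denoising, exposes the compatibility problem between local denoising conditionals and global orderings, and builds an inference-time analysis framework for diagnosing whether DLM decoding is genuinely order-free.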

Abstract

Diffusion language models (DLMs) offer a structural alternative to autoregressive generation: denoising can update tokens in arbitrary orders or in parallel rather than along a fixed left-to-right chain. In practice, fast DLM decoding remains strongly order-sensitive and often drifts toward autoregressive-like trajectories. We trace this tension to compatibility. At each reverse-time step, a DLM provides local denoising conditionals over the unresolved tokens. Arbitrary-order denoising becomes well defined when these local conditionals compose into order-invariant pseudo-joints. We formalize this view by defining order-induced pseudo-joints and a local denoising circulation: the log-ratio between the two pseudo-joints obtained by swapping a pair of unresolved positions. This circulation is zero under compatible conditionals, and global order gaps decompose into sums of local circulations along adjacent swaps. We further separate incompatibility-driven path dependence from conditional-dependence error in parallel updates and from order-specific estimation error. The resulting framework provides inference-only diagnostics for testing when DLM decoding is genuinely order-free.
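The circulation described in this abstract can be illustrated on a toy case. The sketch below is not the paper's code: it assumes just two unresolved binary positions, with hypothetical tables `p_i`, `p_j`, `p_j_given_i`, and `p_i_given_j` standing in for a model's local denoising conditionals. It computes the log-ratio of the two order-induced pseudo-joints and shows that the ratio vanishes when the conditionals come from a single joint distribution.

```python
import math

def circulation(p_i, p_j_given_i, p_j, p_i_given_j, xi, xj):
    """Log-ratio of the two order-induced pseudo-joints at the values (xi, xj)."""
    joint_ij = p_i[xi] * p_j_given_i[xi][xj]  # resolve position i first, then j
    joint_ji = p_j[xj] * p_i_given_j[xj][xi]  # resolve position j first, then i
    return math.log(joint_ij) - math.log(joint_ji)

# Compatible case: every conditional is derived from one joint table joint[xi][xj].
joint = [[0.4, 0.1], [0.2, 0.3]]
p_i = [sum(row) for row in joint]
p_j = [joint[0][b] + joint[1][b] for b in range(2)]
p_j_given_i = [[joint[a][b] / p_i[a] for b in range(2)] for a in range(2)]
p_i_given_j = [[joint[a][b] / p_j[b] for a in range(2)] for b in range(2)]
compatible = circulation(p_i, p_j_given_i, p_j, p_i_given_j, 0, 1)

# Incompatible case: replace one conditional with a table that ignores x_j.
flat_i_given_j = [[0.5, 0.5], [0.5, 0.5]]
incompatible = circulation(p_i, p_j_given_i, p_j, flat_i_given_j, 0, 1)

print(compatible, incompatible)
```

In the compatible case both orders reproduce the same joint probability, so the value is zero up to rounding. In the perturbed case the two orders disagree and the circulation is nonzero.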

2605.09302 2026-05-12 cs.LG cs.CV

Discrete Langevin-Inspired Posterior Sampling

Chaitanya Amballa, Sattwik Basu, Jorge Vančo Sampedro, Romit Roy Choudhury

Affiliations: University of Illinois Urbana-Champaign

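AI summary: This paper studies posterior sampling for inverse problems in discrete state spaces, using discrete diffusion models as generative priors. Existing methods mostly rely on continuous relaxations, Gibbs updates, or mechanisms tied to specific corruption processes, which limits their scalability and generality. The authors propose ΔLPS, a posterior sampler based on discrete Langevin dynamics that uses gradient information to sample efficiently without leaving the discrete state space, supports parallel updates across all dimensions, and applies to discrete diffusion models trained under different paradigms. Experiments show that it outperforms existing discrete diffusion posterior samplers on tasks such as image restoration and spatial mapping, and is competitive with continuous diffusion methods.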

Abstract

We study posterior sampling for inverse problems in discrete state spaces using discrete diffusion models as generative priors. While continuous diffusion models have become widely used for inverse problems, their discrete counterparts remain comparatively underexplored. Existing discrete posterior samplers often rely on continuous relaxations of discrete variables, Gibbs-style updates, or mechanisms specialized to particular corruption processes, which can limit scalability or generality. We propose $Δ$LPS, a Discrete Langevin-Inspired Posterior Sampler that uses gradient information to identify promising discrete moves without leaving the discrete state space. The resulting approach enables efficient parallel updates across all token dimensions and is agnostic to the training paradigm of the discrete diffusion prior, including masked and uniform-state diffusion. We evaluate our method on image restoration tasks across MNIST, CIFAR, and FFHQ, as well as spatial mapping, covering linear, nonlinear, and blind inverse problems. Across these settings, we improve over recent discrete diffusion posterior samplers and are competitive with strong continuous diffusion-based inverse solvers. Our results suggest that fully discrete, gradient-informed posterior samplers offer a scalable and general path toward solving inverse problems over discrete representations.

2605.09301 2026-05-12 cs.LG cs.AI

Neural Cluster First, Route Second: One-Shot Capacitated Vehicle Routing via Differentiable Optimal Transport

Samuel J. K. Chin, Maximilian Schiffer

Affiliations: MIT; TUM

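AI summary: This paper proposes Neural CFRS, a neural "cluster first, route second" method for the Capacitated Vehicle Routing Problem (CVRP). It moves beyond conventional autoregressive decoding by using a differentiable optimal transport layer to handle global fleet-capacity constraints end to end, enabling efficient one-shot decoding. Compared with existing methods, Neural CFRS is highly parameter-efficient, robust on large-scale and out-of-distribution instances, and achieves competitive optimality gaps on standard benchmarks.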

Comments 30 pages, 9 figures

Abstract

The Capacitated Vehicle Routing Problem (CVRP) underpins modern last-mile logistics. Current Neural Combinatorial Optimization (NCO) methods construct CVRP solutions autoregressively, inheriting sequential decoding bottlenecks, sensitivity to spatial symmetries, and brittle out-of-distribution behavior. We revisit the classical Cluster-First-Route-Second (CFRS) paradigm -- long known to be asymptotically optimal but largely overlooked by NCO -- and argue that it is structurally aligned with the core strengths of deep learning: similarity and assignment over global context, rather than the construction of long sequential tours. We introduce Neural CFRS, the first purely non-autoregressive one-shot neural CFRS framework for the CVRP. It enforces global fleet-capacity constraints end-to-end via a differentiable entropic Optimal Transport layer, producing a continuous transport plan to sparsify an exact capacitated assignment solver. We provide formal theoretical guarantees that our architecture intrinsically abstracts away $E(2)$ spatial, inter-route permutation, and intra-route traversal symmetries. By equipping the framework with a pre-trained spatial vocabulary, we unlock extreme parameter efficiency and zero-shot scaling. Designed primarily for real-world spatial distributions under a constant capacity setting, Neural CFRS scales robustly to out-of-distribution $N=1000$ instances with a < 4% gap -- retaining an approximate 5% gap at this scale even as an ultra-lightweight, single-layer architecture. Furthermore, when deployed out-of-the-box on standard benchmarks, we achieve a highly competitive 2.73% optimality gap on size-100 problems.
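To make the entropic Optimal Transport idea concrete, here is a minimal Sinkhorn sketch in plain Python. It is an illustration under stated assumptions, not the Neural CFRS layer: the cost matrix, the demand and capacity shares, and the names `sinkhorn`, `demand_share`, and `capacity_share` are all hypothetical, and the final arg-max step is only a simple way to read the plan, not the exact capacitated assignment solver the abstract mentions.

```python
import math

def sinkhorn(cost, row_mass, col_mass, eps=0.2, iters=200):
    """Entropic OT plan whose row sums match row_mass and column sums match col_mass."""
    n, m = len(cost), len(cost[0])
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical instance: 4 customers, 2 vehicles, cost = distance to each vehicle's region.
cost = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3]]
demand_share = [0.25, 0.25, 0.25, 0.25]  # each customer's share of total demand
capacity_share = [0.5, 0.5]              # each vehicle's share of fleet capacity

plan = sinkhorn(cost, demand_share, capacity_share)
assignment = [max(range(2), key=lambda j: row[j]) for row in plan]
print(assignment)
```

The column marginals are where the fleet-capacity constraint enters: no vehicle can absorb more mass than its capacity share, whatever the individual customers would prefer.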

2605.09296 2026-05-12 cs.CV cs.AI cs.LG

Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts

Boxuan Zhang, Jianing Zhu, Qifan Wang, Jiang Liu, Ruixiang Tang

Affiliations: Rutgers University; The University of Texas at Austin; Meta AI; Advanced Micro Devices

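AI summary: Recent generative models produce highly realistic images, making real and AI-generated images increasingly hard to tell apart. Existing detectors built on pre-trained feature extractors tend to over-rely on global semantics and miss critical micro-defects. This paper proposes MDMF, a detection framework based on local distributional differences that amplifies small statistical irregularities in an image to expose macro-level distributional discrepancies in AI-generated images, markedly improving detection performance. Experiments show that MDMF outperforms existing methods on multiple benchmarks.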

Comments 41 pages, 10 figures

Abstract

Recent generative models can produce images that appear highly realistic, raising challenges in distinguishing real and AI-generated images. Yet existing detectors based on pre-trained feature extractors tend to over-rely on global semantics, limiting sensitivity to the critical micro-defects. In this work, we propose Micro-Defects expose Macro-Fakes (MDMF), a local distribution-aware detection framework that amplifies micro-scale statistical irregularities into macro-level distributional discrepancies. To avoid localized forensic cues being diluted by plain aggregation, we introduce a learnable Patch Forensic Signature that projects semantic patch embeddings into a compact forensic latent space. We then use Maximum Mean Discrepancy (MMD) to quantify distributional discrepancies between generated and real images. Our theory-grounded analysis shows that patch-wise modeling yields provably larger discrepancies when localized forensic signals are present in generated images, enabling more reliable separation from real images. Extensive experiments demonstrate that MDMF consistently outperforms baseline detectors across multiple benchmarks, validating its general effectiveness. Project page: https://zbox1005.github.io/MDMF-project/
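Maximum Mean Discrepancy, the statistic this abstract uses to compare distributions, is short enough to write out. The sketch below is a generic biased MMD² estimator with an RBF kernel, not the MDMF implementation. The two-dimensional points are made-up stand-ins for patch-level forensic signatures, and the kernel choice and `gamma` value are assumptions.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two sample sets."""
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return kxx + kyy - 2.0 * kxy

# Hypothetical patch signatures: "real" patches cluster near the origin,
# "generated" patches carry a shifted local statistic.
real = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
fake = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]

same = mmd2(real, real)
shifted = mmd2(real, fake)
print(same, shifted)
```

Identical sample sets give a value of zero, and the shifted set gives a clearly positive value. That gap is what a detector would threshold on.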

2605.09295 2026-05-12 cs.CL

LEAF-SQL: Level-wise Exploration with Adaptive Fine-graining for Text-to-SQL Skeleton Prediction

Zhao Tan, Xiping Liu, Qing Shu, Qizhi Wan, Dexi Liu, Changxuan Wan

Affiliations: School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics

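AI summary: LEAF-SQL is a new framework for Text-to-SQL skeleton prediction that targets the difficulty of exploring query structure when generating complex queries. It recasts skeleton prediction as a coarse-to-fine tree search in which a three-level skeleton hierarchy, a skeleton formulation agent, and a skeleton evaluation agent work together to make the search structurally diverse and granularity-adaptive. Experiments show that LEAF-SQL markedly improves several large language models on complex queries and, on the BIRD benchmark in particular, achieves higher execution accuracy than existing methods.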

Abstract

Text-to-SQL translates natural language questions into executable SQL queries, enabling intuitive database access for non-experts. While large language models achieve strong performance on Text-to-SQL with prompting, they still struggle with complex queries that involve deeply nested logic or multiple clauses. A widely used approach employs SQL skeletons--intermediate representations of query logic--to streamline generation, but existing methods are limited by their reliance on a single structural hypothesis and lack of progressive reasoning. To overcome these limitations, we propose LEAF-SQL, a novel framework that reframes skeleton prediction as a coarse-to-fine tree search process. LEAF-SQL enables systematic exploration of diverse structural hypotheses with adaptive refinement. Several key techniques are employed in LEAF-SQL: (1) a three-level skeleton hierarchy to guide the search, (2) a Skeleton Formulation Agent to generate diverse candidates, and (3) a Skeleton Evaluation Agent to efficiently prune the search space. This integrated design yields skeleton candidates that are both structurally diverse and granularity-adaptive, providing a stronger foundation for the SQL generation. Extensive experiments show that LEAF-SQL consistently improves the performance of various LLM backbones. On the official hidden test set of the challenging BIRD benchmark, our method achieves 71.6 execution accuracy, which outperforms leading search-based and skeleton-based methods, affirming its effectiveness for complex queries.

2605.09294 2026-05-12 cs.LG cs.AI

Towards Effective Theory of LLMs: A Representation Learning Approach

Muhammed Ustaomeroglu, Guannan Qu

Affiliations: Carnegie Mellon University

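AI summary: This paper proposes Representational Effective Theory (RET), a framework that learns macrostates from the hidden-state trajectories of large language models so that their computation can be described in terms of high-level structure. Using a BYOL/JEPA-style self-supervised objective, it coarse-grains activations into macrovariables that retain the information relevant to prediction and interpretation. Experiments show that these macrovariables reveal "mental-state" trajectories during reasoning, capture high-level semantic structure, and support early prediction of behavioral outcomes and controllable intervention, offering a useful way to describe, understand, and steer large language models.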

Comments Project webpage: https://ustaomeroglu.github.io/RET/

Abstract

We propose Representational Effective Theory (RET), a framework for describing large language model computation in terms of learned macrostates rather than microscopic details. RET learns these macrostates from hidden-state trajectories using a BYOL/JEPA-style self-supervised objective, coarse-graining activations into macrovariables that preserve higher-level structure relevant for prediction and interpretation. We evaluate whether these macrovariables are practically relevant for interpretability: RET yields temporally consistent states that reveal "mental-state" trajectories of reasoning, capture high-level semantic structure, support early prediction of behavioral outcomes such as sycophancy, and provide causal handles for steering generations toward interpretable computational phases. Together, these results suggest that LLM computation admits useful effective descriptions via RET: high-level, dynamically meaningful variables that support interpretation, prediction, and intervention.

2605.09292 2026-05-12 cs.AI cs.CY

Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

Xia Yang, Xuanyi Zhang, Hao Hu, Feng Ji

Affiliations: University of Toronto; Upper Canada College; East China Normal University

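AI summary: This study examines strategy diversity, beyond answer accuracy, in the mathematical reasoning of large language models. It proposes a strategy-level evaluation framework that uses 80 AMC 10/12 and AIME problems and 217 AoPS reference strategies to analyze the diversity and validity of model-generated strategies. Although the models reach high accuracy under a single-solution prompt, their strategy coverage under a multiple-strategy prompt falls well below the human reference, and models differ markedly in the strategies they generate in areas such as geometry and number theory. The models produce some novel strategies but still do not cover the full range of human strategies, which exposes the limits of current models' flexibility in mathematical reasoning.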

Abstract

Large language models now achieve high final-answer accuracy on mathematical reasoning benchmarks, but accuracy alone does not capture reasoning flexibility. We introduce a strategy-level evaluation framework instantiated on 80 AMC 10/12 and AIME problems with 217 AoPS-derived reference strategy families. Model outputs are annotated for strategy identity, validity, and correctness using dual-AI coding with human adjudication. Across four frontier models, we find a pronounced decoupling between answer accuracy and strategy diversity. Under a single-solution prompt, all models achieve high accuracy (95%-100%), but under a multiple-strategy prompt they recover substantially fewer strategies than the human reference set. Gemini, DeepSeek, GPT, and Claude generate 184, 152, 151, and 110 distinct valid strategies, respectively, with the largest gaps in Geometry and Number Theory. The models collectively produce 50 benchmark-novel valid strategies, indicating both incomplete coverage of human strategies and some capacity for alternative reasoning. A repeated-run robustness check on 20 problems shows diminishing gains in discovered strategies, with the strongest model recovering only 39 of 55 AoPS-reference strategies (71%) after three runs. These findings position strategy diversity as a complementary dimension for evaluating mathematical reasoning beyond answer correctness.

2605.09291 2026-05-12 cs.LG stat.AP

dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

Zhengyan Wan, Yidong Ouyang, Panwen Hu, Qiang Sun

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; East China Normal University; University of California, Los Angeles; University of Toronto

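AI summary: This paper proposes dFlowGRPO, a reinforcement learning framework for discrete flow models that supports a broader family of probability paths and non-masked source distributions. By deriving the full trajectory probability of discrete flow models and casting denoising as a Markov decision process, it lets reinforcement learning draw on both the conditional transition rates and the posterior model. Experiments show that dFlowGRPO outperforms existing GRPO methods on text-to-image generation and performs strongly on understanding tasks.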

Abstract

Discrete flow models (DFMs) are a class of flexible generative models for generating discrete data, and diffusion large language models (dLLMs) can be viewed as a special case with a specific choice of mixture path and a masked source distribution. While several recent works have explored applying reinforcement learning to dLLMs, its application to more general discrete flow models remains underexplored. In this work, we present discrete Flow-GRPO (dFlowGRPO), a unified reinforcement learning framework for discrete flow models that supports a broad family of probability paths and non-masked source distributions. We derive the full trajectory probability for DFMs and formulate denoising as a Markov decision process, enabling dFlowGRPO to incorporate information from both the associated conditional transition rates and the posterior model during reinforcement learning. We apply dFlowGRPO to FUDOKI, a recent multimodal discrete flow model, and evaluate it on both image generation and multimodal understanding tasks. Empirical results show that dFlowGRPO outperforms existing GRPO-type methods for dLLMs on text-to-image generation tasks and achieves performance competitive with continuous flow-based models trained using FlowGRPO, while also demonstrating strong capabilities on understanding tasks.

2605.09290 2026-05-12 cs.LG

From Regression to Inference: Meta-Learning Predictors for Neural Architecture Search

Liping Deng, MingQing Xiao

Affiliations: Department of Mathematics; School of Mathematical and Statistical Sciences; University of California, Riverside; Southern Illinois University Carbondale

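AI summary: This paper studies the generalization of performance predictors in prediction-based neural architecture search (NAS) and proposes a meta-learned Convolutional Neural Process (ConvNP) that treats performance prediction as a conditional function inference problem. Unlike conventional regression, the method meta-learns to generalize from few samples, which improves prediction accuracy on unseen architectures. Experiments on several NAS benchmarks show markedly better architecture selection, reaching state-of-the-art results.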

Abstract

Prediction-based approaches are widely used in neural architecture search (NAS), where a predictor estimates the performance of candidate architectures to guide selection. However, existing predictors are typically trained via supervised regression on limited samples, leading to overfitting and poor generalization to unseen architectures. In this work, we propose a fundamentally different formulation that models performance prediction as a conditional function inference problem using a Convolutional Neural Process (ConvNP) with meta-learning capabilities. Instead of fitting a fixed mapping to limited samples, our approach meta-learns to infer performance from partial observations by training with context-target splits across a group of synthesized tasks, explicitly optimizing for generalization under data scarcity and aligning the training procedure with the deployment setting in NAS. We further design simple yet effective meta-features for cell-based architectures and evaluate our method on NAS-Bench-101 and NAS-Bench-201. Extensive experiments show that our approach consistently improves top-K ranking quality and achieves state-of-the-art architecture selection using limited samples.

2605.09288 2026-05-12 cs.LG cs.AI cs.CE cs.CV cs.NA math.NA

MC$^2$: Monte Carlo Correction for Fast Elliptic PDE Solving

Ethan Hsu, Hong Meng Yam, Ivan Ge

Affiliations: Stanford University

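AI summary: This paper proposes MC², a hybrid solver that combines the Walk-on-Spheres Monte Carlo method with a neural network to solve elliptic partial differential equations (PDEs) efficiently. It treats a low-compute Monte Carlo solution as a structured estimator and trains a neural network to correct it in a single forward pass, yielding a high-accuracy solution at much higher speed. The paper also releases PDEZoo, a standardized benchmark of two million elliptic PDEs that supports research on PDE solving under limited compute.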

Abstract

Partial differential equation (PDE) solvers underpin scientific computing, but real-world deployment is bounded by compute. Classical Monte Carlo solvers such as Walk-on-Spheres (WoS) are unbiased and geometry-agnostic but are slow. Learned solvers are fast but biased and brittle under distribution shift. We present MC$^2$, a hybrid WoS-Neural Network (WoS-NN) PDE solver that treats a low-budget Monte Carlo solution as a structured estimator of the true field and learns a single-pass neural correction to recover a high-fidelity solution. MC$^2$ matches the accuracy of solutions using over $1000\times$ more Monte Carlo compute, outperforming all evaluated classical, denoising, and neural-operator baselines. To enable reproducible study of finite-compute PDE solving, we additionally release PDEZoo, the largest standardized elliptic PDE benchmark to date: 2M PDEs spanning five elliptic families and unlimited geometric compositions, with analytic ground truth and multi-budget Monte Carlo trajectories. Together MC$^2$ and PDEZoo (1) empirically establish that finite-sample Monte Carlo error is structured, learnable, and correctable in a single forward pass, (2) show that we can solve PDEs $\sim 1000\times$ faster than with just WoS, and (3) provide the evaluation infrastructure the field has so far lacked.
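For readers unfamiliar with Walk-on-Spheres, the following is a minimal sketch of the classical estimator only, with no neural correction. It assumes the simplest possible setting, the Laplace equation on the unit disk with boundary data g(x, y) = x, chosen because the exact solution is then u(x, y) = x. The function names, stopping tolerance `eps`, walk count, and seed are illustrative choices, not values from the paper.

```python
import math
import random

def walk_on_spheres(x, y, boundary_value, rng, eps=1e-3):
    """One WoS sample for the Laplace equation on the unit disk."""
    while True:
        r = 1.0 - math.hypot(x, y)  # distance from the current point to the boundary
        if r < eps:
            norm = math.hypot(x, y)
            return boundary_value(x / norm, y / norm)  # snap to the nearest boundary point
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x += r * math.cos(theta)  # jump to a uniform point on the largest inscribed circle
        y += r * math.sin(theta)

def estimate(x, y, boundary_value, n_walks, seed=0):
    """Average of n_walks independent WoS samples started at (x, y)."""
    rng = random.Random(seed)
    return sum(walk_on_spheres(x, y, boundary_value, rng) for _ in range(n_walks)) / n_walks

# Boundary data g(x, y) = x is harmonic, so the true solution at (0.3, 0.2) is 0.3.
low_budget = estimate(0.3, 0.2, lambda bx, by: bx, n_walks=2000)
print(low_budget)
```

The estimate is unbiased up to the `eps` stopping rule, but its error shrinks only with the square root of the number of walks. That slow convergence is the cost a learned single-pass correction is meant to avoid.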

2605.09285 2026-05-12 cs.CL

BetaEdit: Null-Space Constrained Sequential Model Editing

Bingqing Liu, Wei Liu, Yuhua Li

Affiliations: Huazhong University of Science and Technology

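AI summary: This paper proposes BetaEdit, a model editing method that addresses the knowledge leakage and performance degradation that null-space-based editing methods exhibit during sequential editing. After analyzing why history-aware updates help, the authors propose a null-space editing framework that incorporates history information, controlling knowledge leakage and improving editing quality. Experiments show that BetaEdit outperforms existing methods on large-scale sequential editing, with better editing performance and general capabilities.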

Abstract

Null-space-based methods have garnered considerable attention in model editing by constraining updates to the null space of the pre-existing knowledge representation, thereby preserving the model's original behavior. However, in practice these methods rely on an approximate null space--leading to knowledge leakage--and further suffer from severe performance degradation during sequential editing. Recent work shows that history-aware editing strategies can empirically mitigate this decline, yet the underlying reason remains unclear. In this paper, we first expose the knowledge leakage inherent in existing null-space approaches and then analyze why history-aware updates effectively preserve both editing performance and general capabilities during long-horizon editing. Building on these insights, we propose BetaEdit, a refined framework that effectively controls the knowledge leakage and integrates history-aware updates into the null-space paradigm. Extensive experiments on three large language models across two standard benchmarks show that BetaEdit consistently outperforms prior methods in the challenging regime of massive-scale sequential editing. Code is available at: https://github.com/lbq8942/BetaEdit.
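The basic null-space constraint can be shown in a few lines. The sketch below is a generic illustration, not BetaEdit itself: it assumes a tiny weight update `delta` and two hypothetical preserved key vectors, and it uses an exact null space built by Gram-Schmidt, whereas the abstract notes that practical methods rely on an approximate null space (the source of the leakage it discusses).

```python
def orthonormalize(keys, tol=1e-12):
    """Gram-Schmidt: orthonormal basis for the span of the preserved key vectors."""
    basis = []
    for key in keys:
        v = list(key)
        for q in basis:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > tol:  # skip keys that are linearly dependent on earlier ones
            basis.append([a / norm for a in v])
    return basis

def project_update(delta, keys):
    """Remove from each row of delta its component inside the span of the keys."""
    basis = orthonormalize(keys)
    projected = []
    for row in delta:
        r = list(row)
        for q in basis:
            dot = sum(a * b for a, b in zip(r, q))
            r = [a - dot * b for a, b in zip(r, q)]
        projected.append(r)
    return projected

# Hypothetical preserved keys (inputs whose outputs must not change) and a raw update.
preserved_keys = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
delta = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

safe_delta = project_update(delta, preserved_keys)
# Change in the layer output for each preserved key under the projected update.
leak = [[sum(w * k for w, k in zip(row, key)) for key in preserved_keys] for row in safe_delta]
print(safe_delta, leak)
```

With an exact projection, the change in output for every preserved key is zero. Any approximation in the null space shows up as nonzero entries in `leak`.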

2605.09284 2026-05-12 cs.LG cs.AI cs.CE physics.app-ph physics.comp-ph

Semi-Supervised Neural Super-Resolution for Mesh-Based Simulations

Jiyeon Kim, Youngjoon Hong, Won-Yong Shin

Affiliations: School of Mathematics and Computing (Computational Science and Engineering), Yonsei University; Department of Mathematical Sciences, Seoul National University

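AI summary: This paper proposes SuperMeshNet, a semi-supervised neural super-resolution framework for making mesh-based simulations more computationally efficient. It combines a small amount of paired low-resolution and high-resolution data with abundant unpaired low-resolution data, using message passing neural networks (MPNNs) to reconstruct high-resolution solutions and reducing the dependence on high-resolution supervision. Experiments show that SuperMeshNet reaches a lower root mean square error than the fully supervised approach while using less high-resolution data.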

Comments International Conference on Machine Learning (ICML 2026) (to appear) (Please cite our conference version.)

Abstract

Mesh-based simulations provide high-fidelity solutions to partial differential equations (PDEs), but achieving such accuracy typically requires fine meshes, leading to substantial computational overhead. Super-resolution techniques aim to mitigate this cost by reconstructing high-resolution (HR), high-fidelity solutions from low-cost, low-resolution (LR) counterparts. However, training neural networks for super-resolution often demands large amounts of expensive HR supervision data. To address this challenge, we propose SuperMeshNet, an HR data-efficient super-resolution framework for mesh-based simulations aided by message passing neural networks (MPNNs). At its core, SuperMeshNet introduces complementary learning, a semi-supervised approach that effectively leverages both 1) a small amount of paired LR-HR data and 2) abundant unpaired LR data via two jointly trained, complementary MPNN-based models. Additionally, our model is enriched by inductive biases, which are empirically shown to further improve super-resolution performance. Extensive experiments demonstrate that SuperMeshNet requires 90% less HR data to achieve even lower root mean square error (RMSE) than that of the fully supervised benchmark without the inductive biases. The source code and datasets are available at https://github.com/jykim-git/SuperMeshNet.git.

2605.09283 2026-05-12 cs.AI cs.CL

A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web

Shusaku Egami, Masahiro Hamasaki

Affiliations: National Institute of Advanced Industrial Science and Technology

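AI summary: As large language models and the AI agents built on them advance, the Web is shifting from a human-centric one to an "Agentic Web" driven by AI agents. There is currently no mechanism for verifying the reliability, reproducibility, and compliance of AI-generated content (AIGC) at generation time, which can lead to content misuse and compliance risks. This paper proposes a prompt-aware structuring framework that automatically attaches structured metadata to AIGC at generation time, including modularized prompts, contexts, model information, hyperparameters, and confidence, and combines it with verifiable credentials to support reliable assessment and safe reuse of AIGC.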

Comments 5 pages, 2 figures, Accepted at FAAW@WWW2026

Abstract

The evolution of Large Language Models (LLMs) and the software agents built on them (AI agents) marks a turning point in the transition from a human-centric Web to an "Agentic Web" driven by AI agents. However, for AI-Generated Content (AIGC), which is expected to dominate the Web, there is currently no mechanism for agents to verify its reliability, reproducibility, or license compliance during generation. This lack of transparency risks causing chained hallucinations and compliance violations through the reuse of AIGC. Consequently, a framework to manage the provenance and generation conditions of AIGC is essential. In this paper, we present a framework that automatically attaches structured metadata to AIGC at generation time, including modularized prompts, contexts, thoughts, model information, hyperparameters, and confidence. The metadata is enveloped together with verifiable credentials to support the reliable assessment and reuse of AIGC. This framework enables efficient curation of structured AIGC and facilitates its safe use for applications such as fine-tuning and knowledge distillation.

2605.09281 2026-05-12 cs.LG

TileQ: Efficient Low-Rank Quantization of Mixture-of-Experts with 2D Tiling

Hongyaoxing Gu, Xinzhe Chen, Lijuan Hu, Fangfang Liu

Affiliations: Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences

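AI summary: This paper proposes TileQ, an efficient low-rank quantization method for compressing Mixture-of-Experts (MoE) models. It shares low-rank factors across the input and output dimensions through 2D-tiling structured low-rank quantization, compressing the model without fine-tuning. Experiments show that TileQ substantially reduces additional memory usage and inference latency while preserving state-of-the-art accuracy.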

Abstract

Mixture-of-Experts (MoE) models achieve remarkable performance by sparsely activating specialized experts, yet the massive parameter count of their experts poses significant challenges for deployment. While low-rank quantization offers a promising route to compress MoE models, existing methods still incur non-negligible memory overhead and inference latency. To address these limitations, we propose TileQ, a fine-tuning-free post-training quantization (PTQ) method that employs 2D-tiling structured low-rank quantization to share low-rank factors across both input and output dimensions of MoE experts. Furthermore, we introduce an efficient inference technique for TileQ that fuses multiple low-rank expert computations into a single-pass operation, significantly improving hardware utilization. Experiments show that TileQ cuts additional memory usage by up to 10$\times$ and reduces inference latency to $\sim$5% while preserving state-of-the-art accuracy.

2605.09278 2026-05-12 cs.AI

EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium

Yuqiao Meng, Sakshi Sunil Narvekar, Luoxi Tang, Rupali Rajendra Vaje, Yingxue Zhang, Muchao Ye, Zhaohan Xi

Affiliations: Binghamton University, State University of New York; University of Iowa

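AI summary: Multi-agent debate (MAD) systems rely on shared memory for long-horizon reasoning, which exposes them to memory contamination; existing safeguards depend on heuristics or LLM judgments and struggle to filter errors. This paper models memory updating as a zero-trust game and proposes EquiMem, a mechanism that quantifies the trustworthiness of each memory update at inference time from agents' retrieval queries and traversal paths, without relying on LLM judgment. The method applies to both embedding-based and graph-based memory and shows stronger protection and robustness across a range of benchmarks and architectures.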

Abstract

Multi-agent debate (MAD) systems increasingly rely on shared memory to support long-horizon reasoning, but this convenience opens a critical vulnerability: a single corrupted entry can contaminate the downstream memory-augmented reasoning, and debate alone fails to filter such errors. Existing safeguards filter entries via heuristics or LLM-based validation, yet they rely on AI judgments that share the same failure modes and overlook the cross-agent dynamics of MAD. We address this gap by formulating memory updating in MAD as a zero-trust memory game, in which no agent is assumed honest and the game's equilibrium serves as an indicator of optimal memory trust. Guided by this equilibrium, we propose EquiMem, an inference-time calibration mechanism that quantifies each update algorithmically against the shared memory state, using agents' existing retrieval queries and traversal paths as evidence rather than soliciting any LLM judgment. EquiMem instantiates calibration for both embedding- and graph-based memory, and across diverse benchmarks, MAD frameworks, and memory architectures, it consistently outperforms existing safeguards, remains robust under adversarial agents, and incurs negligible inference overhead.

2605.09276 2026-05-12 cs.LG cs.CV

Uncertainty-Aware Token Importance Estimation in Spiking Transformers

Wenxuan Liu, Zecheng Hao, Tong Bu, Yuran Wang, Zhaofei Yu

Affiliations: School of Computer Science, Peking University; Institute for Artificial Intelligence, Peking University

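AI summary: This paper studies how to estimate token importance more accurately in spiking transformers in order to cut redundant computation and improve inference efficiency. Existing methods rely mainly on response-based cues such as activation magnitude or firing statistics, which do not reflect how a token's uncertainty changes over time. The authors propose Uncert, a training-free, plug-and-play framework that models each token's class evidence and analyzes its temporal uncertainty pattern, providing a new basis for assessing token importance. Experiments on static and neuromorphic benchmarks show a favorable accuracy-efficiency trade-off, with particularly strong results under token pruning.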

Abstract

Spiking transformers have shown strong potential for neuromorphic vision, yet their token processing across multiple spiking steps still introduces substantial redundancy and inference cost. Existing token reduction methods mainly rely on response-based cues, such as activation magnitude, firing statistics, or feature similarity. Although effective, these criteria do not explicitly characterize token importance from the perspective of temporally evolving class evidence. In spiking transformers, token representations are progressively formed across multiple spiking steps rather than determined at a single instant, suggesting that token importance should be evaluated not only by instantaneous responses but also by temporal uncertainty patterns. Our key observation is that tokens exhibit heterogeneous uncertainty trajectories over time, and that their temporally aggregated uncertainty statistics provide an effective cue for distinguishing informative tokens from redundant ones. Motivated by this, we propose Uncert, a training-free, plug-and-play token importance estimation framework for spiking transformers. Specifically, Uncert models token-wise class evidence with a Dirichlet distribution and summarizes each token's temporal uncertainty using its mean and fluctuation across spiking steps, yielding an uncertainty-aware importance score for token reduction during inference. Experiments on both static and neuromorphic benchmarks show that Uncert achieves favorable accuracy and efficiency tradeoffs, with the most consistent gains observed under token pruning. Further analysis reveals a clear empirical connection between temporal uncertainty patterns and token contribution, offering new insights into token dynamics in spiking transformers.
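The two temporal statistics named in this abstract, mean and fluctuation of a Dirichlet-based uncertainty, can be sketched as follows. The abstract gives no formulas, so everything specific here is an assumption: the parameterization alpha_k = evidence_k + 1, the vacuity measure u = K / sum(alpha) borrowed from evidential deep learning, the population standard deviation as the fluctuation measure, and the evidence values themselves. How Uncert combines the two statistics into a score is not stated, so the sketch stops at the statistics.

```python
import statistics

def dirichlet_uncertainty(evidence):
    """Vacuity u = K / sum(alpha), with alpha_k = evidence_k + 1 (assumed parameterization)."""
    num_classes = len(evidence)
    return num_classes / sum(e + 1.0 for e in evidence)

def temporal_summary(evidence_per_step):
    """Mean and fluctuation (population std) of a token's uncertainty across spiking steps."""
    trajectory = [dirichlet_uncertainty(e) for e in evidence_per_step]
    return statistics.fmean(trajectory), statistics.pstdev(trajectory)

# Hypothetical class evidence for two tokens over three spiking steps (3 classes).
token_a = [[0.0, 0.0, 0.0], [4.0, 0.0, 0.0], [9.0, 0.0, 0.0]]  # accumulates evidence quickly
token_b = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]  # stays close to uninformed

mean_a, fluct_a = temporal_summary(token_a)
mean_b, fluct_b = temporal_summary(token_b)
print(mean_a, fluct_a, mean_b, fluct_b)
```

Token A ends with lower average uncertainty and a larger swing over time than token B. Differences of this kind between tokens are the cue the abstract refers to.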

2605.09275 2026-05-12 cs.LG

DiffATS: Diffusion in Aligned Tensor Space

Jinhua Lyu, Tianmin Yu, Brian Kim, Lizhuo Zhou, Chanwook Park, Naichen Shi

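Affiliations * Northwestern University

AI summary This paper proposes DiffATS, a generative model for efficiently modeling high-resolution spatiotemporal fields. The method constructs data-adaptive tensor primitives, which removes the dependence on pretrained compression autoencoders, and resolves the non-uniqueness of factors in tensor decompositions. Using orthogonal Procrustes alignment, the model obtains compact, directly decodable generative representations and achieves strong generation results on images, videos, and PDE solutions, while compressing the data by up to 210 times.

Details
English abstract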

Direct diffusion modeling of high-resolution spatiotemporal fields is computationally challenging. Parameter-efficient primitives address this by representing high-dimensional data with a compact set of parameters. In this paper, we construct data-dependent tensor primitives without pretrained compression autoencoders. Our construction starts from Tucker decomposition, which captures low-rank multilinear structure through a core tensor and mode-wise factors. However, Tucker factors are non-unique: the same tensor can be represented by different rotated factors, which complicates generative modeling. We address this issue with orthogonal Procrustes (OP) alignment. Specifically, we select medoid anchor matrices from the data and align the factor matrices to resolve the gauge ambiguity. This yields matrix Grassmannian primitives and tensor Grassmannian primitives that are compact, data-adaptive, and directly decodable by explicit multilinear reconstruction. Theoretically, we prove that the proposed primitive maps are homeomorphisms between low-rank tensors and their corresponding primitive spaces, certifying that the representations are non-degenerate and topologically faithful. Building on these primitives, we propose *Diffusion in Aligned Tensor Space* (DiffATS), a generative framework that trains diffusion models directly on aligned tensor primitives. Across images, videos, and PDE solutions, DiffATS achieves strong unconditional and conditional generation performance while compressing original data by $3.9\times$ to $210\times$, without relying on any pretrained deep compression autoencoders.
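The orthogonal Procrustes step mentioned in this abstract has a standard closed-form solution, sketched below with NumPy. The sketch covers only the alignment of one factor matrix to a given anchor; how the medoid anchors are selected and how the aligned factors feed the diffusion model are not shown, and the function name `procrustes_align` is hypothetical.

```python
import numpy as np

def procrustes_align(U, anchor):
    """Rotate factor matrix U (n x r) toward `anchor` (n x r).

    Solves min_R ||U @ R - anchor||_F over orthogonal R (r x r) using the
    SVD of U.T @ anchor, and returns (U @ R, R).
    """
    W, _, Vt = np.linalg.svd(U.T @ anchor)
    R = W @ Vt
    return U @ R, R

# Toy check: a factor that differs from the anchor only by a rotation
# is mapped back onto the anchor.
rng = np.random.default_rng(0)
anchor, _ = np.linalg.qr(rng.normal(size=(8, 3)))    # orthonormal columns
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))         # random orthogonal matrix
aligned, R = procrustes_align(anchor @ Q, anchor)
```

After alignment, factors that span the same subspace share one representative, which is the sense in which the rotation ambiguity of Tucker factors is removed.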

2605.09272 2026-05-12 cs.AI cs.CL cs.CV

Towards Conversational Medical AI with Eyes, Ears and a Voice

Meet Shah, Jason Gusdorf, Anil Palepu, Chunjong Park, Jack W. O'Sullivan, Vishnu Ravi, Tim Strother, Pavel Dubov, Aliya Rysbek, Toshiyuki Fukuzawa, Yana Lunts, Jan Freyberg, Michael B. Chang, Aniruddh Raghu, David Stutz, Devora Berlowitz, Eliseo Papa, Taylan Cemgil, JD Velasquez, Jack Chen, Arthur Chen, Doug Fritz, Charlie Taylor, Katya Tregubova, Jing Rong Lim, Richard Green, Sara Mahdavi, Mahvish Nagda, Jihyeon Lee, Craig Schiff, Liviu Panait, Sukhdeep Singh, Valentin Liévin, David G. T. Barrett, Hannah Gladman, Anna Cupani, Francesca Pietra, Uchechi Okereke, Katherine Tong, Clemens Meyer, Erwan Rolland, Mili Sanwalka, Michael D. Howell, Shixiang Shane Gu, Bibo Xu, Euan A. Ashley, S. M. Ali Eslami, Gregory Wayne, Pushmeet Kohli, Vivek Natarajan, Adam Rodman, Alan Karthikesalingam, Ryutaro Tanno

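Affiliations * Google DeepMind; Google Research; Beth Israel Deaconess Medical Center, Harvard Medical School; Stanford University

AI summary This study presents AI co-clinician, a new conversational medical AI system that processes audio-visual data from doctor-patient conversations in real time to support clinical decisions. Built on Gemini's low-latency audio and video processing capabilities, the system uses a dual-agent architecture that balances deep clinical reasoning with the low-latency responses that natural dialogue requires. Experiments show that AI co-clinician approaches primary care physicians on several key evaluation dimensions and significantly outperforms GPT-Realtime on the general criteria, but gaps remain in physical examination and disease-specific reasoning, underscoring the importance of audio-visual information in medical consultation.

Comments Video examples are available on YouTube: https://youtu.be/y5Vaa_SN1t0, https://youtu.be/dC4icb75vLQ, and https://youtu.be/E7iEvWo-E6c

Details
English abstract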

The practice of medicine relies not only upon skillful dialogue but also on the nuanced exchange and interpretation of rich auditory and visual cues between doctors and patients. Building on the low-latency voice and video processing capabilities of Gemini, we introduce AI co-clinician, a first-of-its-kind conversational AI system utilizing continuous streams of audio-visual data from live patient conversations to inform real-time clinical decisions. Its dual-agent architecture balances deep clinical reasoning with the low latency required for natural dialogue. To assess this system, we implemented a video-based interface emulating telemedicine consultations. We crafted 20 standardized outpatient scenarios requiring proactive real-time auditory and visual reasoning and designed "TelePACES" evaluation criteria alongside case-specific rubrics. In a randomized, interface-blinded, crossover simulation study (n = 120 encounters) with 10 internal medicine residents as patient actors, we compared AI co-clinician with primary care physicians (PCPs), GPT-Realtime, and a baseline agent. AI co-clinician approached PCPs in key TelePACES dimensions, including management plans and differential diagnosis, while significantly outperforming GPT-Realtime across all general criteria. While our agent demonstrated parity with PCPs in case-specific triage measures, physicians maintained superior overall performance in case-specific assessments. Although AI co-clinician marks a significant advance in real-time telemedical AI, gaps remain in physical examination and disease-specific reasoning. Our work shows that text-only approaches fail to capture the true challenges of medical consultation and suggests that high-stakes real-time diagnostic AI is most safely advanced in collaborative, triadic models where AI can be a supportive co-clinician for doctors and patients.

2605.09269 2026-05-12 cs.CL cs.CV

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Rui Liu, Dian Yu, Zhenwen Liang, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang

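Affiliations * Tencent Hunyuan; University of Maryland, College Park; University of North Carolina, Chapel Hill

AI summary DeltaRubric is a generative approach to reward modeling for multimodal large language models that aims to address the bias of existing evaluators when judging fine-grained visual details. It decomposes evaluation into two steps, planning and verification: the model dynamically generates an instance-specific checklist and then verifies it against the image and question, improving the accuracy and reliability of the evaluation. Experiments show that DeltaRubric significantly improves reward modeling on multiple benchmarks, supporting its effectiveness on multimodal tasks.

Details
English abstract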

Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce $\textbf{DeltaRubric}$, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a $\textit{Disagreement Planner}$, the model generates a neutral, instance-specific verification checklist. Transitioning into a $\textit{Checklist Verifier}$, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by $\textbf{+22.6}$ (4B) and $\textbf{+18.8}$ (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.

2605.09268 2026-05-12 cs.CL cs.AI

Beyond Continuity: Challenges of Context Switching in Multi-Turn Dialogue with LLMs

Aditya Sinha, Harald Steck, Vito Ostuni, Matteo Rinaldi

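Affiliations * Netflix Inc.

AI summary This paper studies the challenges large language models (LLMs) face when handling context switches in multi-turn dialogue, in particular their difficulty in recognizing when the user refines a request or switches topic and their tendency to carry over irrelevant earlier context. The authors build synthetic benchmarks from real-world datasets and test the zero-shot performance of ten LLMs of different types. They find that only some reasoning or strongly instructed models detect context switches accurately, that open-weight models often rely on stale context, and that all models show a position bias. The results offer takeaways for improving the long-term robustness of LLMs in multi-turn dialogue.

Comments Accepted to the ICBINB Workshop @ ICLR 2026

Details
English abstract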

Users interacting with Large Language Models (LLMs) in a multi-turn conversation routinely refine their requests or pivot to new topics. LLMs, however, often miss these topic shifts and carry over irrelevant context from previous turns, leading to inaccurate responses. In this paper, we stress-test the multi-turn understanding of LLMs and study the following two sub-tasks: (1) detecting whether the user pivots or refines in the current turn, and (2) shortlisting relevant context from previous turns. To this end, we construct synthetic benchmarks based on real-world datasets from varied domains, so as to simulate context shifts of different levels of difficulty. We then evaluate the zero-shot performance of ten LLMs (open-weight, closed-source and reasoning), and demonstrate that only some reasoning and strongly instructed LLMs are accurate in detecting pivots; open-weight LLMs struggle with the task and frequently carry stale context even with explicit cues; and all models suffer from a position bias. Based on the results, we discuss key takeaways for improving long-term robustness in multi-turn capabilities for LLMs.

2605.09262 2026-05-12 cs.CV cs.CL

Reinforcing Multimodal Reasoning Against Visual Degradation

Rui Liu, Dian Yu, Haolin Liu, Yucheng Shi, Tong Zheng, Runpeng Dai, Haitao Mi, Pratap Tokekar, Leoweiliang

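Affiliations * Tencent Hunyuan; University of Maryland, College Park; University of Virginia; University of North Carolina, Chapel Hill

AI summary This study addresses the drop in reasoning ability of multimodal large language models under real-world visual degradations such as blur and compression artifacts, and proposes ROMA, a reinforcement-learning fine-tuning framework. Using a dual-forward-pass strategy, a distributional consistency constraint, and correctness-conditioned regularization, the method improves robustness to visual degradation without harming performance on clean inputs. Experiments on multiple multimodal reasoning benchmarks show that ROMA improves on GRPO, raising reasoning accuracy under both seen and unseen degradations.

Details
English abstract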

Reinforcement Learning has significantly advanced the reasoning capabilities of Multimodal Large Language Models (MLLMs), yet the resulting policies remain brittle against real-world visual degradations such as blur, compression artifacts, and low-resolution scans. Prior robustness techniques from vision and deep RL rely on static data augmentation or value-based regularization, neither of which transfers cleanly to critic-free RL fine-tuning of autoregressive MLLMs. Reinforcing reasoning against such corruptions is non-trivial: naively injecting degraded views during rollout induces reward poisoning, where perceptual occlusions trigger hallucinated trajectories and destabilize optimization. We propose ROMA, an RL fine-tuning framework that modifies the optimization dynamics to reinforce reasoning against visual degradation while preserving clean-input performance. A dual-forward-pass strategy uses teacher forcing to evaluate corrupted views against clean-image trajectories, avoiding new rollouts on degraded inputs. For distributional consistency, we apply a token-level surrogate KL penalty against the worst-case augmentation; to prevent policy collapse under regularization, an auxiliary policy gradient loss anchored to clean-image advantages preserves a reliable reward signal; and to avoid systematically incorrect invariance, correctness-conditioned regularization restricts enforcement to successful trajectories. On Qwen3-VL 4B/8B across seven multimodal reasoning benchmarks, our method improves robustness by +2.4% on seen and +2.3% on unseen corruptions over GRPO while matching clean accuracy.

2605.09258 2026-05-12 cs.CV cs.AI

Monocular Biomechanical Tracking of Fingers with Inverse Kinematics to Foundation Models

R. James Cotton, Pouyan Firouzabadi, Wendy Murray

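Affiliations * Shirley Ryan AbilityLab; Department of PM&R, Northwestern University; Department of Biomedical Engineering, Northwestern University

AI summary This study tackles accurate tracking of finger biomechanics from monocular video. It combines the SAM 3D Body foundation model with inverse kinematics optimization to extract anatomically constrained finger joint angles from single-view video. Porting the model to JAX and integrating it with MuJoCo-MJX enables efficient GPU-accelerated optimization, and a new mapping is established between Momentum Human Rig outputs and biomechanical model markers. Experiments across a variety of hand poses and object manipulation tasks show joint angle errors of about 10 degrees and hand position errors of about 6 mm, with results consistent across camera viewpoints, opening a path to quantitative video-based analysis of hand movement.

Comments Accepted to EMBC 2026

Details
English abstract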

Accurate hand and finger tracking from video has significant clinical applications for monitoring activities of daily living and measuring range of motion, yet monocular video approaches for obtaining hand biomechanics remain under-developed. We present a method that combines the SAM 3D Body foundation model with inverse kinematics optimization in a full-body biomechanical model to extract anatomically-constrained finger joint angles from single-view video. We port SAM 3D Body from PyTorch to JAX for integration with MuJoCo-MJX, enabling GPU-accelerated optimization, and develop a novel mapping between the Momentum Human Rig (MHR) outputs and biomechanical model markers. Validation against 8-camera multiview reconstruction on 4,590 frames from 7 participants performing a variety of hand poses and object manipulation tasks shows finger joint angle errors of approximately 10 degrees and hand position errors of approximately 6 mm, after Procrustes alignment. Results were consistent across camera viewpoints and robust to different methods for producing reference values from multiview video. This work extends monocular biomechanical analysis to detailed finger tracking, expanding access to quantitative characterization of hand movement from readily available video.

2605.09256 2026-05-12 cs.LG cs.AI stat.ML

Improving Generalization by Permutation Routing Across Model Copies

Shuhei Kashiwamura, Timothee Leleu

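Affiliations * NTT Research, CA, USA; Stanford University, CA, USA

AI summary This paper proposes using the $M$-cover transform to improve the generalization of machine learning models. The method replicates the model $M$ times and routes model parameters across the copies by permutations drawn from a structured mixing kernel $Q$, so that local learning messages are passed between copies instead of relying on parameter averaging or an explicit attractive force. This structured message sharing is intended to improve generalization and applies to model classes ranging from perceptrons to multilayer perceptrons.

Details
English abstract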

We introduce a use of the \(M\)-cover (or \(M\)-layer) transform for machine learning. The method replicates a model \(M\) times, but instead of coupling the copies through parameter averaging or an explicit attractive force, as in replicated SGD or Elastic SGD, it rewires the contexts in which local learning messages are computed. Each local loss is evaluated on a routed model whose parameters are drawn from different copies according to permutations sampled from a structured mixing kernel \(Q\). Training then uses the original local update rule, while the resulting learning messages are redistributed across the copies through these routed computational paths. Thus \(Q\) defines a topology for message transport and controls the long-loop structure of the lifted factor graph. We formulate this construction for perceptrons, committee machines, and multilayer perceptrons, showing that the same principle applies from discrete models to differentiable neural networks. The resulting framework provides a mechanism for improving generalization through structured message sharing rather than replica collapse or parameter-space coupling.

2605.07922 2026-05-12 cs.LG

Tree SAE: Learning Hierarchical Feature Structures in Sparse Autoencoders

Tue M. Cao, Hoang X. Nhat, Raed Alharbi, Phi Le Nguyen, My T. Thai

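Affiliations * Hanoi University of Science; University of Florida, Florida, USA; Computer Science Department, Saudi Electronic University

AI summary This paper proposes Tree SAE, a new method for learning hierarchical feature structures in sparse autoencoders. It introduces a new reconstruction condition and combines activation and reconstruction constraints to overcome the false positives that existing methods produce when they pair semantically unrelated concepts. Experiments show that Tree SAE is significantly better than existing methods at learning hierarchical feature pairs while remaining competitive with the state of the art on several benchmarks, and that it can be used to analyze the complex hierarchical concept structures in large language models.

Comments 21 pages

Details
English abstract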

Learning hierarchical features in Sparse Autoencoders (SAEs) is essential for capturing the structured nature of real-world data and mitigating issues like feature absorption or splitting. Existing works attempt to identify hierarchical relationships within independent feature sets by relying on activation coverage, the assumption that a child feature should only activate when its parent feature activates. However, we demonstrate that this condition alone is insufficient; that is, it often produces false positives where parent and child concepts are semantically unrelated. To address this, we introduce a novel reconstruction condition that enforces a deeper functional link between hierarchical levels. By combining both activation and reconstruction constraints, we propose the Tree SAE, a model designed to learn hierarchical structures directly from within the feature set. Our results demonstrate that Tree SAEs significantly surpass existing SAEs at learning hierarchical pairs while maintaining performance competitive with the state of the art on several key benchmarks. Finally, we demonstrate the practical utility of our Tree SAE in mapping the geometry of child feature subspaces and uncovering the complex hierarchical concept structures encoded within large language models.
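The activation coverage condition described in this abstract can be made concrete with a small NumPy sketch. It also shows one simple way the condition can pass for an unrelated pair: a parent feature that is active on nearly every input trivially covers any child. The function name `activation_coverage` and the threshold `eps` are hypothetical, and the reconstruction condition that Tree SAE adds is not specified in the abstract, so it is not implemented here.

```python
import numpy as np

def activation_coverage(parent_acts, child_acts, eps=0.0):
    """Fraction of inputs with an active child where the parent is also active.

    Both arguments are 1-D arrays of feature activations over the same inputs.
    A value of 1.0 means the child never fires without the parent.
    """
    parent_on = np.asarray(parent_acts) > eps
    child_on = np.asarray(child_acts) > eps
    if not child_on.any():
        return float("nan")          # coverage is undefined for a dead child
    return float((parent_on & child_on).sum() / child_on.sum())

# A dense "parent" that fires on every input covers any child perfectly,
# even though nothing ties the two features together semantically.
dense_parent = np.ones(6)
unrelated_child = np.array([0.0, 2.1, 0.0, 0.0, 0.7, 0.0])
coverage = activation_coverage(dense_parent, unrelated_child)
```

This is why a coverage score on its own is weak evidence of a hierarchy, and why an additional functional constraint between levels is a natural complement.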

2605.07910 2026-05-12 cs.CV

One World, Dual Timeline: Decoupled Spatio-Temporal Gaussian Scene Graph for 4D Cooperative Driving Reconstruction

Yulong Chen, Xiaoyun Dong, Haoyu Zhang, Zongxian Yang, Lewei Xie, Xinke Li, Yifan Zhang, Kai Wang, Jianping Wang

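Affiliations * City University of Hong Kong (Dongguan); City University of Hong Kong; SLAI

AI summary This paper studies dynamic scene reconstruction from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data. It points out that existing Gaussian scene graph methods assume synchronized observations and therefore cannot handle the temporal asynchrony between vehicle and infrastructure cameras, which causes severe ghosting on dynamic objects. The authors propose DUST, a decoupled spatio-temporal Gaussian scene graph that keeps a separate pose trajectory per source while sharing a single appearance representation per agent, eliminating cross-source interference and yielding clear gains on the V2X-Seq dataset.

Details
English abstract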

Reconstructing dynamic scenes from Vehicle-to-Infrastructure Cooperative Autonomous Driving (VICAD) data is fundamentally complicated by temporal asynchrony: vehicle and infrastructure cameras operate on independent clocks, capturing the same dynamic agents, such as cars and pedestrians, at different physical times. Existing Gaussian Scene Graph methods implicitly assume synchronized observations and assign a single pose per agent per frame, an assumption that breaks in cooperative settings, where the resulting gradient conflicts cause severe ghosting on dynamic agents. We identify this as a representation-level failure, not an optimization artifact: we prove that any single-timeline formulation incurs an irreducible photometric loss scaling quadratically with agent velocity and cross-source time offset. To resolve this, we propose DUST (DecoUpled Spatio-Temporal) Gaussian Scene Graph for 4D Cooperative Driving Reconstruction. DUST Gaussian Scene Graph shares a canonical Gaussian set per agent for appearance consistency, while maintaining decoupled pose trajectories aligned to each source's true capture timestamps. We prove that this decoupling makes the pose-gradient kernel block-diagonal, eliminating cross-source interference entirely. To make DUST practical, we further introduce a static anchor-based pose correction pipeline that corrects spatial misalignment between vehicle and infrastructure annotations, and a pose-regularized joint optimization scheme that prevents trajectory jitter and drift during early training. On 26 sequences from V2X-Seq, DUST achieves state-of-the-art performance, improving dynamic-area PSNR by 3.2 dB over the strongest baseline and reducing Fréchet Video Distance by 37.7%, while remaining robust under larger temporal asynchrony.
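A one-dimensional toy calculation gives some intuition for the quadratic scaling claimed in this abstract. It is not the paper's proof and it measures a squared position residual, not a photometric loss: an agent moving at constant speed `v` is seen by two sources whose timestamps differ by `dt`, and a single shared pose has to explain both observations. All names below are hypothetical.

```python
def single_timeline_residual(v, dt):
    """Least-squares residual when one shared pose must fit two observations.

    The vehicle camera sees the agent at position 0, the infrastructure
    camera sees it `dt` seconds later at position v * dt.
    """
    x_vehicle = 0.0
    x_infra = v * dt
    # The least-squares optimum for one shared pose is the midpoint.
    pose = 0.5 * (x_vehicle + x_infra)
    return (pose - x_vehicle) ** 2 + (pose - x_infra) ** 2   # equals (v*dt)**2 / 2

def decoupled_residual(v, dt):
    """Residual when each source keeps its own pose at its own timestamp."""
    x_vehicle, x_infra = 0.0, v * dt
    pose_vehicle, pose_infra = x_vehicle, x_infra            # each fits exactly
    return (pose_vehicle - x_vehicle) ** 2 + (pose_infra - x_infra) ** 2
```

In this toy setting the shared-pose residual is `(v * dt) ** 2 / 2`, so doubling either the speed or the time offset quadruples it, while the decoupled version fits both observations exactly.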

2605.07649 2026-05-12 cs.CV cs.AI cs.RO

Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models

Berkehan Ünal, Hauke Dierend, Dren Fazlija, Christopher Plachetka

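Affiliations * Volkswagen Aktiengesellschaft; L3S Research Center; Faculty of Information Technology; MOIA GmbH; Motor AI GmbH

AI summary This paper studies how vision-language models (VLMs) can provide zero-shot perception of the Operational Design Domain (ODD) to support safety-critical applications such as automated driving systems. Through experiments on a custom dataset and Mapillary Vistas, the authors evaluate four VLMs on zero-shot classification and detection and analyze the effect of different optimization strategies. Definition-anchored chain-of-thought prompting combined with persona decomposition performs best, offering a practical route to transparent and effective ODD perception.

Comments 8 pages, 4 figures

Details
English abstract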

Over the last few years, research on autonomous systems has matured to such a degree that the field is increasingly well-positioned to translate research into practical, stakeholder-driven use cases across well-defined domains. However, for a wide-scale practical adoption of autonomous systems, adherence to safety regulations is crucial. Many regulations are influenced by the Operational Design Domain (ODD), which defines the specific conditions in which an autonomous agent can function. This is especially relevant for Automated Driving Systems (ADS), as a dependable perception of ODD elements is essential for safe implementation and auditing. Vision-language models (VLMs) integrate visual recognition and language reasoning, functioning without task-specific training data, which makes them suitable for adaptable ODD perception. To assess whether VLMs can function as zero-shot "ODD sensors" that adapt to evolving definitions, we contribute (i) an empirical study of zero-shot ODD classification and detection using four VLMs on a custom dataset and Mapillary Vistas, along with failure analyses; (ii) an ablation of zero-shot optimization strategies with a cost-performance overview; and (iii) a suite of reusable prompting templates with guidance for adaptation. Our findings indicate that definition-anchored chain-of-thought prompting with persona decomposition performs best, while other methods may result in reduced recall. Overall, our results pave the way for transparent and effective ODD-based perception in safety-critical applications.

2605.07579 2026-05-12 cs.LG cs.AI cs.CL

Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States

Yunho Choi, Jongwon Lim, Woojin Ahn, Minjae Oh, Jeonghoon Shim, Yohan Jo

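Affiliations * Graduate School of Data Science; Seoul National University; Computer Science and Engineering

AI summary This paper proposes POISE, a new method for reinforcement learning with verifiable rewards in large reasoning models. Its core idea is to estimate the baseline from internal-state signals that the policy model has already computed during its forward pass, which greatly reduces compute cost. A lightweight probe predicts the verifiable reward from hidden states and the generated trajectory and is trained online alongside the policy. Experiments show that POISE matches DAPO on math reasoning tasks with less compute, and that its value estimator performs close to a separate LLM-scale value model.

Comments Under Review; Project Page: https://elijah0430.github.io/poise/

Details
English abstract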

Reinforcement learning with verifiable rewards (RLVR) for Large Reasoning Models hinges on baseline estimation for variance reduction, but existing approaches pay a heavy price: PPO requires a policy-model scale critic, while GRPO needs multiple rollouts per prompt to keep its empirical group mean stable. We introduce Policy Optimization with Internal State Value Estimation (POISE), which obtains a baseline at negligible cost by using the policy model's internal signals already computed during the policy forward pass. A lightweight probe predicts the expected verifiable reward from the hidden states of the prompt and generated trajectory, as well as token-entropy statistics, and is trained online alongside the policy. To preserve gradient unbiasedness despite using trajectory-conditioned features, we introduce a cross-rollout construction that predicts each rollout's value from an independent rollout's internal states. Because POISE estimates prompt value using only a single rollout, it enables higher prompt diversity for a fixed compute budget during training. This reduces gradient variance for more stable learning and also eliminates the sampling overhead of detecting zero-advantage prompts. On Qwen3-4B and DeepSeek-R1-Distill-Qwen-1.5B across math reasoning benchmarks, POISE matches DAPO while requiring less compute. Moreover, its value estimator shows similar performance to a separate LLM-scale value model and generalizes to various verifiable tasks. By leveraging the model's own internal representations, POISE enables more stable and efficient policy optimization.
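The cross-rollout idea in this abstract can be illustrated with a minimal NumPy sketch. Everything concrete here is an assumption made for illustration: a sigmoid-linear probe, exactly two rollouts per prompt, pre-extracted feature vectors, and the names `probe_value` and `cross_rollout_advantages`. The abstract does not specify the probe architecture or how rollouts are paired.

```python
import numpy as np

def probe_value(features, w, b):
    """Hypothetical lightweight probe: sigmoid of a linear map.

    Maps internal-state features to a predicted reward in (0, 1).
    """
    return 1.0 / (1.0 + np.exp(-(features @ w + b)))

def cross_rollout_advantages(feats, rewards, w, b):
    """Advantages where each rollout's baseline comes from the other rollout.

    feats:   (P, 2, D) internal-state features, two rollouts per prompt.
    rewards: (P, 2) verifiable rewards.
    """
    values = probe_value(feats, w, b)     # (P, 2) one value per rollout
    # Swap the pair so rollout i is baselined by the probe output computed
    # from rollout 1 - i, which is independent of rollout i's own sampled tokens.
    baselines = values[:, ::-1]
    return rewards - baselines
```

The point of the swap is that a baseline computed from a rollout's own trajectory features depends on the sampled actions and can bias the policy gradient, whereas a baseline computed from an independent rollout of the same prompt does not.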

2605.07399 2026-05-12 cs.CV

GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization

Yu Pan, Andi Zhang, Yi Wang, Sibei Yang, Wenjie Wang

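Affiliations * ShanghaiTech University; University of Warwick; Sun Yat-sen University

AI summary This paper studies the safety of diffusion vision-language models (dVLMs) against jailbreak attacks and shows that their apparent robustness to conventional Fixed Prefix Optimization (FPO) attacks is deceptive. The authors propose Global Probability Optimization (GPO), a new jailbreak method that manipulates the denoising trajectory of the diffusion model to bypass its guardrails, and build on it to develop GPO-V, the first visual-modality jailbreak framework for dVLMs. Experiments show that GPO-V produces stealthy perturbations that transfer across models, exposing a critical security gap in non-sequential generative architectures and underscoring the urgency of safety alignment for dVLMs.

Details
English abstract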

Diffusion Vision-Language Models (dVLMs), built upon the non-causal foundations of Diffusion Large Language Models (dLLMs), have demonstrated remarkable efficacy in multimodal tasks by departing from the traditional autoregressive generation paradigm. While dVLMs appear inherently robust against conventional jailbreak tactics, which we categorize as Fixed Prefix Optimization (FPO) (e.g., anchoring responses with "Sure, here is"), this perceived resilience is deceptive. Our investigation into the safety landscape of dVLMs reveals a unique refusal pattern: Immediate Refusal and Progressive Refusal. We find that while FPO-based attacks often fail by triggering the latter, the progressive refinement process itself uncovers a novel, latent attack surface. To exploit this vulnerability, we propose Global Probability Optimization (GPO), a general jailbreak paradigm designed specifically for the denoising trajectory of masked diffusion models. Unlike prefix-based methods, GPO manipulates the global generative dynamics to bypass guardrails in diffusion language models. Building on this, we introduce GPO-V, the first visual-modality jailbreak framework tailored for dVLMs. Empirical results demonstrate that GPO-V produces stealthy perturbations with exceptional cross-model transferability, revealing a critical security gap in non-sequential generative architectures. Our findings underscore the critical urgency of addressing safety alignment in dVLMs. These results necessitate an immediate and fundamental re-evaluation of current defense paradigms to mitigate the unique risks of diffusion-based generation. Our code is available at: https://anonymous.4open.science/r/GPO-V-0250.