arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.02799 2026-05-13 cs.LG cs.AI

Joint Learning of Hierarchical Neural Options and Abstract World Model

Wasu Top Piriyakulkij, Wolfgang Lehrach, Kevin Ellis, Kevin Murphy

Affiliations Cornell University; Google DeepMind

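AI summary This study aims to develop agents that learn new skills by composing existing ones and proposes a new method, AgentOWL, which efficiently and jointly learns an abstract world model and hierarchical neural options. Compared with existing methods, AgentOWL shows clear advantages in data efficiency and skill generalization, validated on a subset of Object-Centric Atari games.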

English abstract

Building agents that can perform new skills by composing existing skills is a long-standing goal of AI agent research. Towards this end, we investigate how to efficiently acquire a sequence of skills, formalized as hierarchical neural options. However, existing model-free hierarchical reinforcement algorithms need a lot of data. We propose a novel method, which we call AgentOWL (Option and World model Learning Agent), that jointly learns -- in a sample efficient way -- an abstract world model (abstracting across both states and time) and a set of hierarchical neural options. We show, on a subset of Object-Centric Atari games, that our method can learn more skills using less data than baseline methods and possesses learning and generalization capabilities that the baselines do not have.

2602.02408 2026-05-13 cs.CV cs.AI

ReasonEdit: Editing Vision-Language Models using Human Reasoning

Jiaxing Qiu, Kaihua Hou, Roxana Daneshjou, Ahmed Alaa, Thomas Hartvigsen

Affiliations University of Virginia; University of California, Berkeley; Stanford University

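AI summary ReasonEdit is a new method for editing vision-language models (VLMs) that aims to correct their errors without disturbing other model behaviors, targeting visual question answering tasks that require humans and models to reason. It lets users supply reasoning explanations during editing and retrieves relevant facts at inference time through a network-science-inspired multimodal embedding technique, improving edit quality. Experiments show that ReasonEdit achieves state-of-the-art editing performance on multiple datasets, confirming that human reasoning substantially improves edit generalization.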

English abstract

Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically require humans and models to reason about images. We therefore propose ReasonEdit, the first VLM editor to let users explain their reasoning during editing, introducing a new, practical model editing setup. ReasonEdit continuously stores human reasoning in a codebook, and retrieves only relevant facts during inference using a novel topology-balanced multimodal embedding method inspired by network science. Across four VLMs on multiple rationale-based visual question answering datasets, ReasonEdit achieves state-of-the-art editing performance, ultimately showing that using human reasoning during editing greatly improves edit generalization.

2602.02133 2026-05-13 cs.AI cs.CL

A Theoretical Analysis of Why Masked Diffusion Models Mitigate the Reversal Curse

Moongyu Jeon, Sangwoo Shin, BumJun Kim, Kyelim Lee, Albert No

Affiliations Department of Artificial Intelligence, Yonsei University

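AI summary This paper gives a theoretical analysis of why masked diffusion language models (MDMs) mitigate the "reversal curse" seen in autoregressive language models (ARMs). It shows that the any-order masked training objective of MDMs creates a parameter-level coupling between forward and reverse conditionals, so that token-pair evidence learned during training transfers to reverse queries. Experiments verify this mechanism and show that it improves prediction on reversal tasks.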

English abstract

Autoregressive language models (ARMs) suffer from the reversal curse: after learning "$A$ is $B$," they often fail on the reverse query "$B$ is $A$." Masked diffusion language models (MDMs) exhibit this failure in a much weaker form, but the underlying reason has remained unclear. A common explanation attributes this mitigation to their any-order masked training objective. However, observing "$[\mathbf{M}]$ is $B$" during training teaches recovery of $A$ from $B$ in one positional configuration, and does not by itself explain why the learned evidence should transfer to the reverse prompt "$B$ is $[\mathbf{M}]$." We provide a theoretical analysis showing that this transfer arises from a parameter-level coupling between forward and reverse positional conditionals: shared Transformer parameters store token-pair evidence, while relative positional encodings route attention through queries and keys without changing the value-side evidence being retrieved. In a one-layer MDM, we prove that forward masked training strengthens evidence that is reusable in reverse queries, induces correlated forward--reverse attention routes, and yields a positively aligned shared-storage gradient component that decreases the reverse loss to first order. Controlled one-layer experiments and large-scale LLaDA/Dream experiments verify these signatures and show that they translate into improved reverse prediction.

2602.02007 2026-05-13 cs.CL cs.AI

Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

Zhanghao Hu, Qinglin Zhu, Runcong Zhao, Di Liang, Hanqi Yan, Yulan He, Lin Gui

Affiliations King’s College London; Tencent, Yuanbao Team

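AI summary This paper addresses the shortcomings of conventional retrieval-augmented generation (RAG) for agent memory and proposes a new memory management method, xMemory. Following the principle of decoupling before aggregation, it decomposes interaction history into reusable facts, updates, and distinguishing details, and builds a hierarchical, revisable memory structure to improve retrieval efficiency and information accuracy. Experiments show that xMemory improves answer quality and inference efficiency across multiple tasks and models.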

Comments Project Address: https://zhanghao-xmemory.github.io/Academic-project-page-template/; Code Address: https://github.com/HU-xiaobai/xMemory

English abstract

Standard Retrieval Augmented Generation (RAG) is poorly matched to agent memory. Unlike large heterogeneous corpora, agent memory forms a bounded and coherent interaction stream in which many spans are highly correlated or near duplicates. As a result, flat top-$k$ similarity retrieval often returns redundant context, while summary-centric hierarchies can blur the subtle details that distinguish one candidate from another. We argue that agent memory should follow the principle of decoupling before aggregation: the system should first isolate reusable facts, updates, and distinguishing details from similar histories, and only then organise them for efficient retrieval. Based on this principle, we propose xMemory, which constructs a revisable hierarchical memory structure from original messages to segments, memory components, and groups. xMemory segments interaction history into local events, decouples each segment into memory components, aggregates related components into high-level groups using a sparsity--semantic faithfulness objective, and maintains this structure incrementally as memory evolves. At inference time, xMemory retrieves top-down, first selecting a compact backbone of complementary groups and components, and then expanding to segments and raw messages only when additional evidence reduces the reader's uncertainty. Experiments on LoCoMo and PerLTQA across diverse open source and closed source LLMs show consistent gains in answer quality and inference token efficiency, supported by analyses of redundancy, evidence density, and coverage.
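
The redundancy problem the abstract attributes to flat top-$k$ similarity retrieval can be illustrated with a minimal sketch. This is not xMemory itself: the function names, the toy two-dimensional "embeddings", and the example memory entries are all hypothetical and chosen only to show how near-duplicate spans crowd out complementary ones.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, memory, k):
    """Flat retrieval: rank every memory item by similarity and keep the k best."""
    ranked = sorted(memory, key=lambda item: cosine(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical memory: three near-duplicate spans and one complementary span.
memory = [
    ("User moved to Berlin", (1.0, 0.0)),
    ("User relocated to Berlin", (0.99, 0.05)),
    ("User now lives in Berlin", (0.98, 0.1)),
    ("User's new job starts in May", (0.6, 0.8)),
]
query = (1.0, 0.1)  # hypothetical query embedding

print(top_k(query, memory, 3))
```

With these toy vectors all three retrieved items restate the same fact and the complementary span is left out, which is the kind of redundant context the abstract describes.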

2602.01682 2026-05-13 cs.LG cs.DS stat.ML

Finite and Corruption-Robust Regret Bounds in Online Inverse Linear Optimization under M-Convex Action Sets

Taihei Oki, Shinsaku Sakaue

Affiliations Institute for Chemical Reaction Design and Discovery (ICReDD), Hokkaido University; D3 Center, The University of Osaka; Center for Advanced Intelligence Project, RIKEN; CyberAgent, Tokyo, Japan; National Institute of Informatics, Tokyo, Japan

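AI summary This paper studies online inverse linear optimization: inferring a hidden objective vector from optimal actions observed over time-varying feasible sets, and recommending actions that perform well under that objective. It asks whether a finite regret bound polynomial in the dimension is achievable when the feasible sets are M-convex (for example, matroids). By combining structural properties of optimal solutions on M-convex sets with a geometric volume argument, the authors prove a regret bound of $O(d\log d)$, partially resolving the open question, and extend the result to adversarially corrupted feedback with a bound of $O((C+1)d\log d)$ that requires no prior knowledge of $C$.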

English abstract

We study online inverse linear optimization, also known as contextual recommendation, where a learner sequentially infers an agent's hidden objective vector from observed optimal actions over feasible sets that change over time. The learner aims to recommend actions that perform well under the agent's true objective, and the performance is measured by the regret, defined as the cumulative gap between the agent's optimal values and those achieved by the learner's recommended actions. Prior work has established a regret bound of $O(d\log T)$, as well as a finite but exponentially large bound of $\exp(O(d\log d))$, where $d$ is the dimension of the optimization problem and $T$ is the time horizon, while a regret lower bound of $Ω(d)$ is known (Gollapudi et al. 2021; Sakaue et al. 2025). Whether a finite regret bound polynomial in $d$ is achievable or not has remained an open question. We partially resolve this by showing that when the feasible sets are M-convex -- a broad class that includes matroids -- a finite regret bound of $O(d\log d)$ is possible. We achieve this by combining a structural characterization of optimal solutions on M-convex sets with a geometric volume argument. Moreover, we extend our approach to adversarially corrupted feedback in up to $C$ rounds. We obtain a regret bound of $O((C+1)d\log d)$ without prior knowledge of $C$, by monitoring directed graphs induced by the observed feedback to detect corruptions adaptively.
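
The regret defined in the abstract, the cumulative gap between the agent's optimal values and the values of the learner's recommended actions, can be written out as a small sketch. The function names and toy data are hypothetical, and representing each feasible set as an explicit finite list of action vectors is a simplifying assumption; the paper's learning algorithm is not shown.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def regret(true_objective, feasible_sets, recommended):
    """Sum over rounds of (agent's optimal value - value of the recommended action)."""
    total = 0.0
    for actions, rec in zip(feasible_sets, recommended):
        best = max(dot(true_objective, a) for a in actions)
        total += best - dot(true_objective, rec)
    return total

# Hypothetical two-round instance in dimension d = 2.
c = [1.0, 2.0]                               # hidden objective vector
sets = [[(1, 0), (0, 1)], [(1, 1), (2, 0)]]  # feasible set per round
recs = [(1, 0), (1, 1)]                      # learner's recommendations

print(regret(c, sets, recs))
```

In this toy instance the learner is suboptimal only in the first round, so the whole regret comes from that round.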

2602.01418 2026-05-13 cs.CV cs.LG

Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General

Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis

Affiliations Technical University of Denmark; KTH Royal Institute of Technology

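AI summary This paper proposes PaPE, a parabola-based position encoding designed for attention-based architectures on vision modalities. It is built from vision-specific principles (translation invariance, rotation invariance, distance decay, directionality, and context awareness) to encode positions in images, videos, point clouds, and other visual data more accurately. Experiments show that PaPE extrapolates well on ImageNet-1K and is broadly applicable, with strong performance on datasets spanning several modalities.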

English abstract

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens -- such as from videos, event camera streams, images, or point clouds -- our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. Extrapolation experiments on ImageNet-1K show how PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best encoding. Generality experiments on 8 datasets across 4 modalities show that PaPE is a general vision position encoding, as PaPE matches the best baseline on 5 datasets and exceeds all on 2 datasets. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.

2602.01103 2026-05-13 cs.AI

Probing RLVR training instability through the lens of objective-level hacking

Yiming Dong, Kun Fu, Haoyu Li, Xinyuan Zhu, Yurou Liu, Lijing Shao, Jieping Ye, Zheng Wang

Affiliations School of Physics, Peking University; Tongyi Lab; Alibaba Group; Kavli Institute for Astronomy and Astrophysics, Peking University; National Astronomical Observatories, Chinese Academy of Sciences

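AI summary This paper studies training instability in reinforcement learning with verifiable rewards (RLVR) for Mixture-of-Experts (MoE) architectures and proposes an analysis framework based on objective-level hacking that exposes the mechanism behind the instability. It finds that abnormal growth of the training-inference discrepancy is the key pathological dynamic, a phenomenon that previously lacked a mechanistic explanation. Backed by extensive experiments, the paper offers guidance for designing more stable RLVR algorithms.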

Comments Accepted by ICML 2026

English abstract

Prolonged reinforcement learning with verifiable rewards (RLVR) has been shown to drive continuous improvements in the reasoning capabilities of large language models, but the training is often prone to instabilities, especially in Mixture-of-Experts (MoE) architectures. Training instability severely undermines model capability improvement, yet its underlying causes and mechanisms remain poorly understood. In this work, we introduce a principled framework for understanding RLVR instability through the lens of objective-level hacking. Unlike reward hacking, which arises from exploitable verifiers, objective-level hacking emerges from token-level credit misalignment and is manifested as system-level spurious signals in the optimization objective. Grounded in our framework, together with extensive experiments on a 30B MoE model, we trace the origin and formalize the mechanism behind a key pathological training dynamic in MoE models: the abnormal growth of the training-inference discrepancy, a phenomenon widely associated with instability but previously lacking a mechanistic explanation. These findings provide a concrete and causal account of the training dynamics underlying instabilities in MoE models, offering guidance for the design of stable RLVR algorithms.

2602.00400 2026-05-13 cs.AI

KEPO: Knowledge-Enhanced Preference Optimization for Multimodal Reasoning with Applications to Medical VQA

Fan Yang, Rui Meng, Trudi Di Qi, Ali Ezzati, Yuxin Wen

Affiliations Chapman University; Lawrence Berkeley National Laboratory; University of California, Irvine

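AI summary This study proposes KEPO, a knowledge-enhanced preference optimization framework that aims to improve multimodal models on complex reasoning tasks such as medical visual question answering. To address the unstable training and hard exploration of conventional reinforcement learning under sparse rewards, KEPO introduces quality-gated on-policy distillation, which applies teacher guidance only to high-quality trajectories, together with a knowledge-guided exploration strategy, reducing noise and improving reasoning coherence and generalization. Experiments show that KEPO achieves better training stability and out-of-distribution performance on medical VQA.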

English abstract

Reinforcement learning (RL) has emerged as a promising paradigm for inducing explicit reasoning behaviors in large language and vision-language models. However, reasoning-oriented RL post-training remains fundamentally challenging due to sparse trajectory-level rewards, leading to ambiguous credit assignment and severe exploration failures that can trap the policy in a "learning cliff." Recent on-policy distillation methods introduce dense teacher supervision to stabilize optimization, but apply it uniformly across all generated trajectories. We argue that such uniform distillation is ill-suited for reasoning-intensive tasks, as low-quality on-policy trajectories often originate from early logical errors, and distillation under flawed contexts injects noisy and misaligned gradients. To address these challenges, we propose Knowledge-Enhanced Preference Optimization (KEPO), a unified post-training framework that integrates: (i) a quality-gated on-policy distillation objective that selectively applies dense teacher guidance only to high-quality trajectories, and (ii) a knowledge-enhanced exploration strategy that leverages hints learned from a teacher model to rejectively sample reward-positive on-policy trajectories for RL, thereby mitigating exploration collapse. Evaluated on a challenging medical visual question answering benchmark under single-source generalization, KEPO demonstrates improved training stability, more coherent reasoning behaviors, and superior out-of-distribution performance over reinforcement learning and on-policy distillation baselines.

2601.22334 2026-05-13 cs.LG

DP-λCGD: Efficient Noise Correlation for Differentially Private Model Training

Nikita P. Kalinin, Ryan McKenna, Rasmus Pagh, Christoph H. Lampert

Affiliations Institute of Science and Technology Austria; University of Copenhagen; Google

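AI summary This paper proposes DP-λCGD, an efficient noise correlation method for improving the accuracy of differentially private model training. It correlates noise only with the immediately preceding iteration and cancels a controlled portion of it, removing the need to store past noise. Compared with existing methods, it keeps the differential privacy guarantee while substantially reducing memory overhead, and it achieves higher model accuracy in experiments.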

English abstract

Differentially private stochastic gradient descent (DP-SGD) is the gold standard for training machine learning models with formal differential privacy guarantees. Several recent extensions improve its accuracy by introducing correlated noise across training iterations. Matrix factorization mechanisms are a prominent example, but they correlate noise across many iterations and require storing previously added noise vectors, leading to substantial memory overhead in some settings. In this work, we propose a new noise correlation strategy that correlates noise only with the immediately preceding iteration and cancels a controlled portion of it. Our method relies on noise regeneration using a pseudorandom noise generator, eliminating the need to store past noise. As a result, it requires no additional memory beyond standard DP-SGD. We show that the computational overhead is minimal and empirically demonstrate improved accuracy over DP-SGD.
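
The idea of correlating noise only with the previous iteration, and regenerating that earlier noise instead of storing it, can be sketched as follows. This is an illustration under stated assumptions, not the paper's mechanism: the form "fresh noise minus `lam` times the previous noise", the seeding scheme, and all function names are hypothetical, Python's `random` module is not a cryptographically secure generator, and no privacy calibration is performed.

```python
import random

def fresh_noise(seed, step, dim, sigma):
    """Regenerate the Gaussian noise vector of a given step from (seed, step) alone."""
    rng = random.Random(seed * 1_000_003 + step)  # hypothetical per-step seeding
    return [rng.gauss(0.0, sigma) for _ in range(dim)]

def correlated_noise(seed, step, dim, sigma, lam):
    """Noise added at `step`: fresh noise minus a fraction `lam` of the previous step's noise."""
    z_t = fresh_noise(seed, step, dim, sigma)
    if step == 0:
        return z_t  # no previous iteration to cancel against
    z_prev = fresh_noise(seed, step - 1, dim, sigma)  # regenerated, never stored
    return [a - lam * b for a, b in zip(z_t, z_prev)]

# Hypothetical usage: noise for step 3 of a 4-dimensional gradient.
print(correlated_noise(seed=7, step=3, dim=4, sigma=1.0, lam=0.5))
```

Because each step's noise is a deterministic function of the seed and the step index, the only state carried between iterations is the seed, which matches the abstract's claim of no additional memory beyond standard DP-SGD.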

2601.22301 2026-05-13 cs.CV

Coarse-to-Real: Generative Rendering for Populated Dynamic Scenes

Gonzalo Gomez-Nogales, Yicong Hong, Chongjian Ge, Peiye Zhuang, Marc Comino-Trinidad, Dan Casas, Yi Zhou

Affiliations Universidad Rey Juan Carlos, Móstoles, Spain; Adobe Research, San Jose, USA; Roblox, San Mateo, USA

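AI summary Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial compute to produce realistic images, yet still struggle with scalability and realism in scenes containing many moving people. This paper proposes C2R (Coarse-to-Real), a generative rendering framework that produces real-style urban crowd videos from coarse 3D simulations: coarse 3D renderings give explicit control over scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. The method uses a two-stage synthetic-real domain alignment strategy that first learns a generative prior from large-scale real video and then adds controllability with a small amount of paired synthetic data. It supports coarse-to-fine control, works with diverse CG and game inputs, and generates temporally consistent, controllable, and realistic urban scene videos from minimal 3D input.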

Comments Project website at https://gonzalognogales.github.io/coarse2real/

English abstract

Traditional rendering pipelines rely on complex assets, accurate materials and lighting, and substantial computational resources to produce realistic imagery, yet they still face challenges in scalability and realism for populated dynamic scenes. We present C2R (Coarse-to-Real), a generative rendering framework that synthesizes real-style urban crowd videos from coarse 3D simulations. Our approach uses coarse 3D renderings to explicitly control scene layout, camera motion, and human trajectories, while a learned neural renderer generates realistic appearance, lighting, and fine-scale dynamics guided by text prompts. To overcome the lack of paired training data between coarse simulations and real videos, we adopt a two-stage synthetic-real domain-hedging strategy that first learns a strong generative prior from large-scale real footage, and then introduces controllability by using a small amount of paired synthetic coarse-to-fine data to anchor shared implicit spatio-temporal features across domains. The resulting system supports coarse-to-fine control, generalizes across diverse CG and game inputs, and produces temporally consistent, controllable, and realistic urban scene videos from minimal 3D input. We will release the model and project webpage at https://gonzalognogales.github.io/coarse2real/.

2601.21944 2026-05-13 cs.LG

Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models

Konstantinos P. Panousis, Diego Marcos

Affiliations Department of Statistics, University of Economics and Business; UMR TETIS, Inria, EVERGREEN, University of Montpellier

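AI summary This paper studies the trade-off between flexibility and interpretability in sparsity-aware Concept Bottleneck Models (CBMs) and proposes a new metric, Clarity, which captures the interplay between downstream task performance and the sparsity and precision of concept activations. Using an evaluation framework built on datasets with ground-truth concept annotations, the authors compare CBMs based on vision-language models and attribute predictors and show marked differences between sparsity-inducing strategies in performance and semantic alignment. Experiments and a human study confirm that Clarity tracks human trust in a model more closely than standard metrics, offering a new approach to evaluating interpretable models.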

English abstract

The widespread adoption of deep learning models in computer vision has intensified concerns about interpretability. Despite strong performance, these models are often treated as black boxes, with limited systematic investigation of their decision-making processes. While many interpretability methods exist, objective evaluation of learned representations remains limited, particularly for approaches that rely on sparsity to "induce" interpretability. In this work, we investigate how modeling choices in Concept Bottleneck Models (CBMs) affect the semantic alignment of concept representations. We introduce Clarity, a novel metric that captures the interplay between downstream performance and the sparsity and precision of concept activations. Using an interpretability assessment framework grounded in datasets with ground-truth concept annotations, we evaluate both VLM- and attribute predictor-based CBMs across three amortized sparsity-inducing strategies ($\ell_1$, $\ell_0$, and Bernoulli-based), alongside several widely used sparsity-aware CBM methods from the literature. Our experiments reveal a critical flexibility-interpretability trade-off: a model's capacity to optimize task performance by deviating from semantic alignment. We demonstrate that under this trade-off, different methods exhibit markedly different behaviors even at comparable performance levels. Finally, we validate our framework through a principled human study, which confirms that Clarity aligns significantly more closely with human trust than standard evaluation metrics.

2601.21351 2026-05-13 cs.LG cs.AI

Analytical Provisioning for Attention-FFN Disaggregated LLM Serving under Stochastic Workloads

Chendong Song, Meixuan Wang, Hang Zhou, Hong Liang, Yuan Lyu, Zixi Chen, Yuwei Fan, Zijie Zhou

Affiliations Dept. of Industrial Engineering and Decision Analytics, HKUST; Dept. of Computer Science and Technology, Tsinghua University; IIIS, Tsinghua University; Huawei Hong Kong Research Center; School of Mathematical Sciences, Peking University

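AI summary This study presents an analytical provisioning framework for serving large language models under Attention-FFN disaggregation (AFD) with stochastic workloads. By analyzing the stationary token load of each slot, it identifies a key workload statistic θ and derives the optimal Attention-to-FFN ratio for arbitrary prefill-decode distributions. The method also accounts for the bottleneck introduced by synchronized execution, giving a closed-form mean-field rule and a Gaussian barrier-aware refinement; the predicted optimal ratio is within 10% of the simulation optimum, providing a theoretical basis and practical guidance for provisioning disaggregated LLM serving.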

Comments Submitted to NeurIPS 2026

English abstract

Attention-FFN disaggregation (AFD) is an emerging architecture for LLM decoding that separates state-heavy, KV-cache-dominated Attention computation from stateless, compute-intensive FFN computation, connected by per-step communication. While AFD enables independent scaling of memory and compute resources, its performance is highly sensitive to the Attention/FFN provisioning ratio: mis-sizing induces step-level blocking and costly device idle time. We develop an analytical provisioning framework for AFD bundles in an $r$A--$1$F topology under stochastic workloads. Two sources of randomness shape the problem: per-slot Attention workload evolves as KV caches grow and completed requests are replenished with random prompt and decode lengths, and synchronized execution across Attention workers introduces a barrier governed by the slowest worker. We address both via a renewal-reward characterization of the per-slot stationary token load, identifying a single workload statistic $θ$ that governs provisioning under arbitrary prefill-decode distributions and admits a nonparametric estimator from request traces. The analysis yields a closed-form mean-field rule for the optimal A/F ratio decomposing into Attention-, communication-, and FFN-bottleneck regimes, together with a Gaussian barrier-aware refinement that quantifies cross-worker synchronization overhead. A trace-calibrated AFD simulator supports the framework across workloads: the predicted optimal ratio matches the simulation-optimal within 10%. Together, these results provide a compact, calibratable account of how stochastic workload structure determines provisioning in disaggregated LLM serving.
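
The synchronization barrier mentioned in the abstract, where each step waits for the slowest Attention worker, can be illustrated with a small Monte Carlo sketch. Modeling per-worker step times as independent Gaussians is a simplifying assumption made here for illustration; the function name and parameters are hypothetical and this is not the paper's closed-form rule or its simulator.

```python
import random

def mean_step_time(num_workers, mean, std, trials=2000, seed=0):
    """Estimate the expected synchronized step time when every step
    waits for the slowest of `num_workers` Attention workers."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # The barrier releases only when the slowest worker finishes.
        total += max(rng.gauss(mean, std) for _ in range(num_workers))
    return total / trials

# Hypothetical per-worker step time: mean 1.0, standard deviation 0.1.
for r in (1, 2, 4, 8):
    print(r, round(mean_step_time(r, 1.0, 0.1), 3))
```

Even though every worker has the same average speed, the estimated step time grows with the number of synchronized workers, which is the overhead a barrier-aware provisioning rule has to account for.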

2601.13780 2026-05-13 cs.LG

Principled Latent Diffusion for Graphs via Laplacian Autoencoders

Antoine Siraudin, Christopher Morris

Affiliations Faculty of Computer Science

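AI summary This paper proposes LG-Flow, a latent graph diffusion model based on Laplacian autoencoders, to address the quadratic growth in computational cost that conventional graph diffusion models incur as the number of nodes increases. By encoding graph structure into a low-dimensional latent space, the model achieves near-lossless graph reconstruction and avoids wasting capacity on modeling absent edges in sparse graphs. Using a permutation-equivariant autoencoder and a Diffusion Transformer, it scales graph generation efficiently; experiments show competitive generation quality with a speed-up of up to roughly 1000x.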

Comments Preprint, under review

English abstract

Graph diffusion models achieve state-of-the-art performance in graph generation but suffer from quadratic complexity in the number of nodes -- and much of their capacity is wasted modeling the absence of edges in sparse graphs. Inspired by latent diffusion in other modalities, a natural idea is to compress graphs into a low-dimensional latent space and perform diffusion in that space. However, unlike images or text, graph generation requires nearly lossless reconstruction, as even a single error in decoding an adjacency matrix can render the entire sample invalid. This challenge has remained largely unaddressed. We propose LG-Flow, a latent graph diffusion framework that directly overcomes these obstacles. A permutation-equivariant autoencoder maps nodes to fixed-dimensional embeddings that enable near-lossless reconstruction of both undirected graphs and DAGs. The dimensionality of this latent representation scales linearly with the number of nodes, thereby removing the quadratic adjacency-space bottleneck in the diffusion process and enabling the training of substantially larger generative backbones. In this latent space, we train a Diffusion Transformer with flow matching, enabling efficient and expressive graph generation. Our approach achieves competitive results against state-of-the-art graph diffusion models while delivering up to a $1000\times$ speed-up. Our code is available at https://github.com/asiraudin/LG-Flow .

2601.07473 2026-05-13 cs.LG

AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations

Michael J. Clark

Affiliations Independent Researcher, Perth, Australia

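AI summary As models grow more capable, humans struggle to verify their outputs reliably. This paper proposes AntiPaSTO, a self-supervised method that steers model honesty internally by separating representations along an antiparallel axis with coherence constraints. Training needs only two contrasting words inserted into template sentences, with no human labels; experiments show it outperforms prompting baselines on multiple value axes and supports bidirectional control.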

Comments Code is available at https://github.com/wassname/AntiPaSTO

English abstract

As models grow more capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an antiparallel axis (+1/-1 produce opposite shifts), with coherence constraints preventing collapse. Training uses only two contrasting words inserted into template sentences, with no preference labels. When we use 800 such synthetic pairs on Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9x Steering F1 on DailyDilemmas and wins on 5 of 6 tested value axes. We also find preliminary evidence that it maintains bidirectional control where prompting triggers refusal.

2601.07384 2026-05-13 cs.LG

CompNO: A Novel Foundation Model approach for solving Partial Differential Equations

Hamda Hmida, Hsiu-Wen Chang Joly, Youssef Mesri

Affiliations Mines Paris - PSL University, Centre for Material Forming (CEMEF); Mines Paris - PSL University, Centre for Robotics (CAOR)

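AI summary This paper proposes CompNO, a new foundation-model approach for solving parametric partial differential equations (PDEs). It learns a set of Foundation Blocks, each a Fourier neural operator for a basic differential operator, and combines them through lightweight Adaptation Blocks into task-specific solvers, avoiding the high pretraining cost and limited interpretability of a single monolithic model. Experiments show that CompNO achieves lower relative L2 error than existing methods on several PDEs and satisfies boundary conditions exactly, with good generalization and physical interpretability.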

Comments Under review at MDPI

English abstract

Partial differential equations (PDEs) govern a wide range of physical phenomena, but their numerical solution remains computationally demanding, especially when repeated simulations are required across many parameter settings. Recent Scientific Foundation Models (SFMs) aim to alleviate this cost by learning universal surrogates from large collections of simulated systems, yet they typically rely on monolithic architectures with limited interpretability and high pretraining expense. In this work we introduce Compositional Neural Operators (CompNO), a compositional neural operator framework for parametric PDEs. Instead of pretraining a single large model on heterogeneous data, CompNO first learns a library of Foundation Blocks, where each block is a parametric Fourier neural operator specialized to a fundamental differential operator (e.g. convection, diffusion, nonlinear convection). These blocks are then assembled, via lightweight Adaptation Blocks, into task-specific solvers that approximate the temporal evolution operator for target PDEs. A dedicated boundary-condition operator further enforces Dirichlet constraints exactly at inference time. We validate CompNO on one-dimensional convection, diffusion, convection--diffusion and Burgers' equations from the PDEBench suite. The proposed framework achieves lower relative L2 error than strong baselines (PFNO, PDEFormer and in-context learning based models) on linear parametric systems, while remaining competitive on nonlinear Burgers' flows. The model maintains exact boundary satisfaction with zero loss at domain boundaries, and exhibits robust generalization across a broad range of Peclet and Reynolds numbers. These results demonstrate that compositional neural operators provide a scalable and physically interpretable pathway towards foundation models for PDEs.

2601.05752 2026-05-13 cs.CL cs.SE

AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor

Shu Yang, Jingyu Hu, Tong Li, Hanqi Yan, Wenxuan Wang, Di Wang

Affiliations King Abdullah University of Science and Technology; University of Bristol; Washington University in St. Louis; King’s College London; Renmin University of China

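AI summary This paper introduces AutoMonitor-Bench, the first benchmark for systematically evaluating the reliability of LLM-based misbehavior monitors, covering question answering, code generation, and reasoning with 3,010 carefully annotated test samples. Monitors are evaluated with two metrics, Miss Rate (MR) and False Alarm Rate (FAR), revealing a trade-off between detection ability and sensitivity across models. The authors also build a large-scale training corpus and fine-tune Qwen3-4B-Instruction to test whether training on known misbehavior data improves monitoring of unseen, implicit misbehavior, highlighting the challenges of building reliable and scalable LLM misbehavior monitors.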

Comments ACL 2026 Findings

English abstract

We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware designing and training strategies for LLM-based monitors.
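
The two metrics named in the abstract can be computed as in the sketch below. The definitions used here (MR as the fraction of misbehavior instances the monitor fails to flag, FAR as the fraction of benign instances it flags) follow the abstract's wording, but the exact formulas in the benchmark are an assumption, and the function names and toy data are hypothetical.

```python
def miss_rate(labels, flags):
    """Fraction of misbehavior instances (label True) that the monitor did not flag."""
    on_misbehavior = [flag for label, flag in zip(labels, flags) if label]
    return sum(1 for flag in on_misbehavior if not flag) / len(on_misbehavior)

def false_alarm_rate(labels, flags):
    """Fraction of benign instances (label False) that the monitor flagged."""
    on_benign = [flag for label, flag in zip(labels, flags) if not label]
    return sum(1 for flag in on_benign if flag) / len(on_benign)

# Hypothetical paired data: four misbehavior and four benign instances.
labels = [True, True, True, True, False, False, False, False]
flags = [True, True, True, False, True, False, False, False]

print(miss_rate(labels, flags), false_alarm_rate(labels, flags))
```

A monitor that flags everything drives MR to zero at the cost of FAR equal to one, and a monitor that flags nothing does the reverse, which is one way to see the trade-off the abstract reports.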

2601.03627 2026-05-13 cs.CL cs.AI

Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines

Jean Seo, Gibaeg Kim, Kihun Shin, Seungseop Lim, Hyunkyung Lee, Wooseok Han, Jongwon Lee, Eunho Yang

Affiliations AITRICS; KAIST; Severance Hospital, Yonsei University; College of Medicine, The Catholic University of Korea

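AI summary This paper presents EPAG, a benchmark dataset and framework for evaluating the pre-consultation ability of large language models (LLMs), assessing models directly by comparing the History of Present Illness (HPI) against diagnostic guidelines and indirectly through disease diagnosis. It finds that small open-source models fine-tuned on a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation, and that more HPI does not necessarily improve diagnostic performance. It also finds that the language of the pre-consultation influences the characteristics of the dialogue, and the dataset and evaluation pipeline are open-sourced to support LLM applications in clinical settings.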

Comments EACL 2026 Industry

English abstract

We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly by comparing the History of Present Illness (HPI) against diagnostic guidelines and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that an increased amount of HPI does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.

2512.22933 2026-05-13 cs.AI cs.CL

RW-Post: Auditable Evidence-Grounded Multimodal Fact-Checking in the Wild

Danni Xu, Shaojing Fan, Harry Cheng, Mohan Kankanhalli

Affiliations School of Computing (SoC), National University of Singapore (NUS); National University of Singapore (NUS); Department of Electrical and Computer Engineering (ECE), National University of Singapore (NUS)

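AI summary This paper presents RW-Post, an auditable benchmark dataset for real-world multimodal fact-checking in which each instance links the original social-media post with reasoning traces and explicit evidence drawn from human fact-check articles. The dataset supports several evaluation regimes, enabling systematic analysis of visual grounding and evidence utilization. Experiments show that current models have substantial headroom in evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness.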

Comments Code and dataset will be released at https://github.com/xudanni0927/AgentFact

Details
English abstract

Multimodal misinformation increasingly leverages visual persuasion, where repurposed or manipulated images strengthen misleading text. We introduce RW-Post, a post-aligned text-image benchmark for real-world multimodal fact-checking with auditable annotations: each instance links the original social-media post with reasoning traces and explicitly linked evidence items derived from human fact-check articles via an LLM-assisted extraction-and-auditing pipeline. RW-Post supports controlled evaluation across closed-book, evidence-bounded, and open-web regimes, enabling systematic diagnosis of visual grounding and evidence utilization. We provide AgentFact as a reference verification baseline and benchmark strong open-source LVLMs under unified protocols. Experiments show substantial headroom: current models struggle with faithful evidence grounding, while evidence-bounded evaluation improves both accuracy and faithfulness.

2512.22579 2026-05-13 cs.AI cs.NI

SANet: A Semantic-aware Agentic AI Networking Framework for Cross-layer Optimization in 6G

Yong Xiao, Xubo Li, Haoran Zhou, Yingyu Li, Yayu Gao, Guangming Shi, Ping Zhang, Marwan Krunz

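Affiliations: School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, China; Peng Cheng Laboratory, Shenzhen, China; School of Mechanical Engineering and Electronic Information, China University of Geosciences (Wuhan), China; State Key Laboratory of Networking and Switching Technology

AI summary: This paper proposes SANet, a semantic-aware agentic AI networking framework for cross-layer optimization in 6G wireless networks. The framework infers the user's semantic goal and automatically assigns agents at different network layers to fulfill it. For the resulting multi-agent multi-objective problem, it proposes optimization methods that seek Pareto-optimal solutions. The paper also introduces a model partition and sharing (MoPS) mechanism to use computational resources more efficiently, and experiments confirm gains in both performance and computational efficiency.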

Comments Accepted at IEEE Transactions on Mobile Computing

Journal ref IEEE Transactions on Mobile Computing, 2026

Details
English abstract

Agentic AI networking (AgentNet) is a novel AI-native networking paradigm in which a large number of specialized AI agents collaborate to perform autonomous decision-making, dynamic environmental adaptation, and complex missions. It has the potential to facilitate real-time network management and optimization functions, including self-configuration, self-optimization, and self-adaptation across diverse and complex environments. This paper proposes SANet, a novel semantic-aware AgentNet architecture for wireless networks that can infer the semantic goal of the user and automatically assign agents associated with different layers of the network to fulfill the inferred goal. Motivated by the fact that AgentNet is a decentralized framework in which collaborating agents may generally have different and even conflicting objectives, we formulate the decentralized optimization of SANet as a multi-agent multi-objective problem, and focus on finding the Pareto-optimal solution for agents with distinct and potentially conflicting objectives. We propose three novel metrics for evaluating SANet. Furthermore, we develop a model partition and sharing (MoPS) framework in which large models, e.g., deep learning models, of different agents can be partitioned into shared and agent-specific parts that are jointly constructed and deployed according to agents' local computational resources. Two decentralized optimization algorithms are proposed. We derive theoretical bounds and prove that there exists a three-way tradeoff among optimization, generalization, and conflicting errors. We develop an open-source RAN and core network-based hardware prototype that implements agents to interact with three different layers of the network. Experimental results show that the proposed framework achieved performance gains of up to 14.61% while requiring only 44.37% of FLOPs required by state-of-the-art algorithms.
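
The abstract frames SANet's decentralized optimization as a search for Pareto-optimal solutions among agents with conflicting objectives. As a generic illustration of that concept only (not the paper's algorithm), the sketch below filters a set of candidate solutions down to the non-dominated ones. The two-objective candidate tuples and the convention that lower values are better are assumptions made for the example.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a is no worse than b in every objective and
    strictly better in at least one (all objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep the points that no other point dominates."""
    return [
        p for p in points
        if not any(dominates(q, p) for q in points if q is not p)
    ]


# Hypothetical (objective_1, objective_2) costs for five candidate solutions.
candidates = [(1.0, 9.0), (2.0, 4.0), (3.0, 5.0), (4.0, 1.0), (5.0, 2.0)]
front = pareto_front(candidates)
print(front)
```

Here (3.0, 5.0) is dropped because (2.0, 4.0) is better in both objectives, and (5.0, 2.0) is dropped because of (4.0, 1.0). The remaining points trade one objective against the other.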

2512.12177 2026-05-13 cs.AI

Floorplan2Guide: LLM-Guided Floorplan Parsing for BLV Indoor Navigation

Aydin Ayanzadeh, Tim Oates

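Affiliations: University of Maryland, Baltimore County

AI summary: This paper proposes Floorplan2Guide, an LLM-guided floorplan parsing method that aims to improve indoor navigation for blind and low-vision (BLV) users. The method converts architectural floor plans into navigable knowledge graphs and generates human-readable navigation instructions, reducing the manual preprocessing that earlier methods require. Experiments in simulated and real-world settings show that few-shot prompting improves navigation accuracy over zero-shot prompting, and that graph-based spatial reasoning achieves a higher success rate than direct visual reasoning.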

Comments Accepted for publication in the proceedings of the IEEE International Conference on Big Data (IEEE BigData 2025)

Journal ref IEEE International Conference on Big Data (IEEE BigData 2025), pp. 7477-7485

Details
English abstract

Indoor navigation remains a critical challenge for people with visual impairments. The current solutions mainly rely on infrastructure-based systems, which limit their ability to navigate safely in dynamic environments. We propose a novel navigation approach that utilizes a foundation model to transform floor plans into navigable knowledge graphs and generate human-readable navigation instructions. Floorplan2Guide integrates a large language model (LLM) to extract spatial information from architectural layouts, reducing the manual preprocessing required by earlier floorplan parsing methods. Experimental results indicate that few-shot learning improves navigation accuracy in comparison to zero-shot learning on simulated and real-world evaluations. Claude 3.7 Sonnet achieves the highest accuracy among the evaluated models, with 92.31%, 76.92%, and 61.54% on the short, medium, and long routes, respectively, under 5-shot prompting of the MP-1 floor plan. The success rate of graph-based spatial structure is 15.4% higher than that of direct visual reasoning among all models, which confirms that graphical representation and in-context learning enhance navigation performance and make our solution more precise for indoor navigation of Blind and Low Vision (BLV) users.
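
To make the idea of a navigable knowledge graph concrete, the following sketch runs a breadth-first search over a small room-adjacency graph and turns the resulting path into step-by-step text. It is a minimal illustration under stated assumptions: the room names, the adjacency-list format, and the instruction template are all hypothetical, and the abstract does not say how Floorplan2Guide represents its graph or plans routes.

```python
from collections import deque
from typing import Dict, List, Optional


def shortest_route(graph: Dict[str, List[str]], start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search; returns the fewest-hops path or None if unreachable."""
    if start == goal:
        return [start]
    parents: Dict[str, Optional[str]] = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt in parents:
                continue  # already visited
            parents[nxt] = node
            if nxt == goal:
                # Walk back through the parent links to rebuild the path.
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


def to_instructions(path: List[str]) -> List[str]:
    """Render each hop of the path as one plain-language step."""
    return [f"Go from {a} to {b}." for a, b in zip(path, path[1:])]


# Hypothetical floor-plan graph: nodes are spaces, edges are walkable connections.
floor_graph = {
    "Entrance": ["Hallway"],
    "Hallway": ["Entrance", "Room 101", "Elevator"],
    "Room 101": ["Hallway"],
    "Elevator": ["Hallway", "Room 201"],
    "Room 201": ["Elevator"],
}

route = shortest_route(floor_graph, "Entrance", "Room 201")
for step in to_instructions(route):
    print(step)
```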

2512.12165 2026-05-13 cs.CV

Audio-Visual Camera Pose Estimation with Passive Scene Sounds and In-the-Wild Video

Daniel Adebi, Sagnik Majumder, Kristen Grauman

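Affiliations: The University of Texas at Austin

AI summary: This paper studies audio-visual camera pose estimation from passive scene sounds and in-the-wild video, targeting camera motion estimation under visually degraded conditions. The authors propose a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art visual pose estimation model. Experiments on two large datasets show consistent gains over vision-only baselines, particularly when the visual information is corrupted.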

Details
English abstract

Understanding camera motion is a fundamental problem in embodied perception and 3D scene understanding. While visual methods have advanced rapidly, they often struggle under visually degraded conditions such as motion blur or occlusions. In this work, we show that passive scene sounds provide cues complementary to vision for relative camera pose estimation for in-the-wild videos. We introduce a simple but effective audio-visual framework that integrates direction-of-arrival (DOA) spectra and binauralized embeddings into a state-of-the-art vision-only pose estimation model. Our results on two large datasets show consistent gains over strong visual baselines, plus robustness when the visual information is corrupted. To our knowledge, this represents the first work to successfully leverage audio for relative camera pose estimation in real-world videos, and it establishes incidental, everyday audio as an unexpected but promising signal for a classic spatial challenge. Project: http://vision.cs.utexas.edu/projects/av_camera_pose.

2512.12131 2026-05-13 cs.LG cs.DC

BOOST: BOttleneck-Optimized Scalable Training Framework for Low-Rank Large Language Models

Zhengyang Wang, Ziyue Liu, Ruijie Zhang, Avinash Maurya, Paul Hovland, Bogdan Nicolae, Franck Cappello, Zheng Zhang

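Affiliations: Anonymous Authors

AI summary: This paper proposes BOOST, an efficient training framework for large language models with large-scale low-rank bottleneck architectures. To address the high communication overhead and low GPU utilization of standard tensor parallelism on low-rank models, BOOST introduces bottleneck-aware tensor parallelism and combines it with online RMSNorm, linear layer grouping, and low-rank activation checkpointing. Experiments on several low-rank bottleneck architectures show speedups of 1.46-1.91× over full-rank model baselines and 1.87-2.27× over low-rank models with naively integrated 3D parallelism, with higher GPU utilization and lower communication overhead.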

Details
English abstract

The scale of transformer model pre-training is constrained by the increasing computation and communication cost. Low-rank bottleneck architectures offer a promising solution to significantly reduce the training time and memory footprint with minimal impact on accuracy. Despite algorithmic efficiency, bottleneck architectures scale poorly under standard tensor parallelism. Simply applying 3D parallelism designed for full-rank methods leads to excessive communication and poor GPU utilization. To address this limitation, we propose BOOST, an efficient training framework tailored for large-scale low-rank bottleneck architectures. BOOST introduces a novel Bottleneck-aware Tensor Parallelism, and combines optimizations such as online-RMSNorm, linear layer grouping, and low-rank activation checkpointing to achieve end-to-end training speedup. Evaluations on different low-rank bottleneck architectures demonstrate that BOOST achieves 1.46-1.91× speedup over full-rank model baselines and 1.87-2.27× speedup over low-rank models with naively integrated 3D parallelism, with improved GPU utilization and reduced communication overhead.
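
A quick way to see why low-rank bottlenecks shrink the memory footprint is to count weights. The sketch below assumes the common formulation in which a dense d_out × d_in weight matrix is replaced by two factors of shapes d_out × r and r × d_in, with biases ignored. The abstract does not spell out the exact architecture, and the dimensions used here are hypothetical.

```python
def dense_params(d_in: int, d_out: int) -> int:
    """Weights in a full-rank linear layer (bias ignored)."""
    return d_in * d_out


def bottleneck_params(d_in: int, d_out: int, rank: int) -> int:
    """Weights in a rank-r factorization: (d_out x r) plus (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, chosen only for illustration.
d_in, d_out, rank = 4096, 4096, 512

full = dense_params(d_in, d_out)
low = bottleneck_params(d_in, d_out, rank)
print(f"full-rank: {full:,} weights")
print(f"low-rank:  {low:,} weights ({low / full:.0%} of full-rank)")
```

With these numbers the factorized layer keeps a quarter of the weights. The same narrow intermediate dimension r is what standard tensor parallelism handles poorly, according to the abstract.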

2512.11321 2026-05-13 cs.CV

KeyframeFace: Language-Driven Facial Animation via Semantic Keyframes

Jingchao Wu, Zejian Kang, Haibo Liu, Yuanchen Fei, Xiangru Huang

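Affiliations: Westlake University; Nanjing University; Zhejiang University; Hunan University

AI summary: This paper proposes KeyframeFace, a language-driven facial animation method that controls facial expressions through semantic keyframes. Rather than generating continuous frames directly from text, it borrows the keyframe idea from animation production, represents animation as semantic keyframes in an interpretable ARKit control space, and uses a large language model to generate keyframes aligned with text descriptions and emotion cues. Experiments show better expression fidelity and semantic alignment than conventional methods, along with a clearer semantic control structure.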

Details
English abstract

Facial animation is a core component for creating digital characters in the computer graphics (CG) industry. A typical production workflow relies on sparse, semantically meaningful keyframes to precisely control facial expressions. Enabling such animation directly from natural-language descriptions could significantly improve content creation efficiency and accessibility. However, most existing methods adopt a text-to-continuous-frames paradigm, directly regressing dense facial motion trajectories from language. This formulation entangles high-level semantic intent with low-level motion, lacks explicit semantic control structure, and limits precise editing and interpretability. Inspired by the keyframe paradigm in animation production, we propose KeyframeFace, a framework for semantic facial animation from language via interpretable keyframes. Instead of predicting dense motion trajectories, our method represents animation as a sequence of semantically meaningful keyframes in an interpretable ARKit-based facial control space. A language-driven model leverages large language model (LLM) priors to generate keyframes that align with contextual text descriptions and emotion cues. To support this formulation, we construct a multimodal dataset comprising 2,100 expression scripts paired with monocular videos, per-frame ARKit coefficients, and manually annotated semantic keyframes. Experiments show that incorporating semantic keyframe supervision and language priors significantly improves expression fidelity and semantic alignment compared to methods that do not use facial action semantics.

2512.05683 2026-05-13 cs.CV physics.optics

Physics-Informed Graph Neural Networks for Frequency-Aware Optical Aberration Correction

Yong En Kok, Bowen Deng, Alexander Bentley, Andrew J. Parkes, Michael G. Somekh, Amanda J. Wright, Michael P. Pound

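Affiliations: School of Computer Science, University of Nottingham; Photonics Group, Department of Electrical and Electronic Engineering, University of Nottingham; Research Center for Humanoid Sensing, Zhejiang Laboratory

AI summary: This paper proposes ZRNet, a physics-informed graph neural network for frequency-aware optical aberration correction. The method jointly performs Zernike coefficient prediction and optical image restoration. A Zernike Graph module explicitly models the physical relationships between polynomials, and a Frequency-Aware Alignment loss improves consistency between image features and coefficient predictions in the frequency domain. Experiments show state-of-the-art aberration correction and image restoration across microscopy modalities and complex biological samples, and tests on data from a real optical system support its robustness and generalization.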

Details
English abstract

Optical aberrations significantly degrade image quality in microscopy, particularly when imaging deeper into samples. These aberrations arise from distortions in the optical wavefront and can be mathematically represented using Zernike polynomials. Existing methods often address only mild aberrations on limited sample types and modalities, typically treating the problem as a black-box mapping without leveraging the underlying optical physics of wavefront distortions. We propose ZRNet, a physics-informed framework that jointly performs Zernike coefficient prediction and optical image Restoration. We contribute a Zernike Graph module that explicitly models physical relationships between Zernike polynomials based on their azimuthal degrees, ensuring that learned corrections align with fundamental optical principles. To further enforce physical consistency between image restoration and Zernike prediction, we introduce a Frequency-Aware Alignment (FAA) loss, which better aligns Zernike coefficient prediction and image features in the Fourier domain. Extensive experiments on CytoImageNet demonstrate that our approach achieves state-of-the-art performance in both image restoration and Zernike coefficient prediction across diverse microscopy modalities and biological samples with complex, large-amplitude aberrations. We further validate on experimental PSF data from a physical microscope and demonstrate robustness to realistic sensor noise, confirming generalisation beyond simulated conditions. Code is available at https://github.com/janetkok/ZRNet.
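
Each Zernike polynomial is identified by a radial degree n and an azimuthal degree m, and the abstract says the Zernike Graph module relates polynomials through their azimuthal degrees. The sketch below shows one way such a grouping could be computed. It assumes the OSA/ANSI single-index convention, j = (n(n + 2) + m) / 2, and simply buckets modes that share the same m. Both the indexing convention and the grouping rule are assumptions for illustration, since the abstract specifies neither.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def osa_to_nm(j: int) -> Tuple[int, int]:
    """Convert an OSA/ANSI single index j to (radial degree n, azimuthal degree m)."""
    n = 0
    # Radial degree n covers indices n(n+1)/2 through (n+1)(n+2)/2 - 1.
    while (n + 1) * (n + 2) // 2 <= j:
        n += 1
    m = 2 * j - n * (n + 2)
    return n, m


def group_by_azimuthal_degree(num_modes: int) -> Dict[int, List[int]]:
    """Bucket mode indices 0..num_modes-1 by their azimuthal degree m."""
    groups: Dict[int, List[int]] = defaultdict(list)
    for j in range(num_modes):
        _, m = osa_to_nm(j)
        groups[m].append(j)
    return dict(groups)


# First 15 modes, i.e. radial degrees 0 through 4.
groups = group_by_azimuthal_degree(15)
for m in sorted(groups):
    print(f"m = {m:+d}: modes {groups[m]}")
```

Under this convention, defocus (j = 4) and primary spherical aberration (j = 12) both have m = 0 and land in the same bucket as piston (j = 0).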

2512.00775 2026-05-13 cs.RO cs.SY eess.SY

SAGAS: Semantic-Aware Graph-Assisted Stitching for Offline Temporal Logic Planning

Ruijia Liu, Ancheng Hou, Xiang Yin

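Affiliations: School of Automation and Intelligent Sensing

AI summary: This paper studies robotic task planning and execution under linear temporal logic (LTL) specifications in a strictly offline, model-free setting. The authors propose SAGAS, a framework that combines the compositionality of symbolic synthesis with a data-driven reachability structure learned from offline trajectories. It learns a reusable latent reachability graph and a frozen goal-conditioned executor, then applies semantic graph augmentation and Büchi product search to each new LTL formula to produce executable, cost-efficient plans, enabling zero-shot generalization to unseen LTL tasks.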

Details
English abstract

Linear Temporal Logic (LTL) provides a rigorous framework for specifying long-horizon robotic tasks, yet existing approaches face a trade-off: model-based synthesis relies on accurate labeled transition systems, whereas learning-based methods often require online interaction, task-specific rewards, or specification-conditioned training. We study LTL-specified robotic planning and execution in a stricter offline, model-free setting, where the agent is given only fixed, task-agnostic trajectory fragments, with no dynamics model, task demonstrations, or online data collection. To address this setting, we propose SAGAS, a framework that combines the compositionality of symbolic synthesis with the data-driven reachability structure learned from offline trajectories. SAGAS first learns a reusable latent reachability graph and a frozen goal-conditioned executor from fragmented offline data. For each new LTL formula, it performs task-time semantic graph augmentation to ground state-defined propositions on the learned graph, and applies Büchi product search to synthesize a cost-aware accepting prefix-suffix waypoint plan executed by the frozen executor. By shifting formula-specific reasoning from policy learning to test-time graph augmentation and symbolic search, SAGAS enables zero-shot generalization to unseen, data-supported LTL specifications without task-specific reward design, policy retraining, or online interaction. Experiments on LTL task suites constructed from OGBench locomotion domains show that this design produces executable and cost-efficient prefix-suffix behaviors for diverse unseen LTL tasks from fragmented offline data.

2511.22475 2026-05-13 cs.LG cs.CV

Adversarial Flow Models

Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, Haoqi Fan

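Affiliations: ByteDance Seed

AI summary: This paper presents adversarial flow models, a class of generative models that combines the strengths of adversarial learning and flow models, supports one-step or multi-step generation, and is trained with an adversarial objective. Unlike traditional GANs, the generator is encouraged to learn a deterministic noise-to-data mapping, which significantly stabilizes training. Unlike consistency-based methods, it does not need to learn the intermediate timesteps of the probability flow, which avoids error accumulation and preserves model capacity. Experiments on ImageNet-256px report generation quality that surpasses existing methods.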

Comments ICML 2026

Details
English abstract

We present adversarial flow models, a class of generative models that belongs to both the adversarial and flow families. Our method supports native one-step and multi-step generation and is trained with an adversarial objective. Unlike traditional GANs, in which the generator learns an arbitrary transport map between the noise and data distributions, our generator is encouraged to learn a deterministic noise-to-data mapping. This significantly stabilizes adversarial training. Unlike consistency-based methods, our model directly learns one-step or few-step generation without having to learn the intermediate timesteps of the probability flow for propagation. This preserves model capacity and avoids error accumulation. Under the same 1NFE setting on ImageNet-256px, our B/2 model approaches the performance of consistency-based XL/2 models, while our XL/2 model achieves a new best FID of 2.38. We additionally demonstrate end-to-end training of 56-layer and 112-layer models without any intermediate supervision, achieving FIDs of 2.08 and 1.94 with a single forward pass and surpassing the corresponding 28-layer 2NFE and 4NFE counterparts with equal compute and parameters. The code is available at https://github.com/ByteDance-Seed/Adversarial-Flow-Models

2511.17038 2026-05-13 cs.AI eess.IV stat.ML

DAPS++: Rethinking Diffusion Inverse Problems with Decoupled Posterior Annealing

Hao Chen, Renzheng Zhang, Scott S. Howard

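Affiliations: Department of Electrical Engineering, University of Notre Dame; Department of Aerospace and Mechanical Engineering, University of Notre Dame

AI summary: This paper proposes DAPS++, a diffusion-based solver for inverse problems, motivated by the observation that the diffusion prior offers limited guidance in existing solvers. The method fully decouples diffusion-based initialization from likelihood-driven refinement, so that reconstruction is guided more directly by measurement consistency while remaining numerically stable. Experiments show high computational efficiency and robust image restoration with fewer function evaluations and optimization steps.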

Details
English abstract

From a Bayesian perspective, score-based diffusion solves inverse problems through joint inference, embedding the likelihood with the prior to guide the sampling process. However, this formulation fails to explain its practical behavior: the prior offers limited guidance, while reconstruction is largely driven by the measurement-consistency term, leading to an inference process that is effectively decoupled from the diffusion dynamics. We show that the diffusion prior in these solvers functions primarily as a warm initializer that places estimates near the data manifold, while reconstruction is driven almost entirely by measurement consistency. Based on this observation, we introduce DAPS++, which fully decouples diffusion-based initialization from likelihood-driven refinement, allowing the likelihood term to guide inference more directly while maintaining numerical stability and providing insight into why unified diffusion trajectories remain effective in practice. By requiring fewer function evaluations (NFEs) and measurement-optimization steps, DAPS++ achieves high computational efficiency and robust reconstruction performance across diverse image restoration tasks.

2511.16520 2026-05-13 cs.LG cs.CV eess.IV eess.SP

Saving Foundation Flow-Matching Priors for Inverse Problems

Yuxiang Wan, Ryan Devera, Wenjie Zhang, Ju Sun

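Affiliations: Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, USA

AI summary: This paper proposes FMPlug, a plug-in framework for improving how foundation flow-matching models are used in inverse problems. It combines an instance-guided, time-dependent warm-start strategy with sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structure. Experiments show strong results on image restoration and on scientific inverse problems with few samples, offering a practical route to using foundation flow-matching models in these settings.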

Comments Accepted by ICML 2026

Details
English abstract

Foundation flow-matching (FM) models promise universal priors for solving inverse problems (IPs); yet today, they trail behind domain-specific and even untrained priors. How can we unlock their potential? We introduce FMPlug, a plug-in framework that redefines how foundation FMs are used in IPs. FMPlug combines an instance-guided, time-dependent warm-start strategy with sharp Gaussianity regularization, adding problem-specific guidance while preserving the Gaussian structures. For evaluation, we consider both simple image restoration tasks and scientific IPs with a few similar samples, where the prohibitive cost of data collection and model training hinders the development of domain-specific generative models. Our superior experimental results confirm the effectiveness of FMPlug. Overall, FMPlug paves the way for making foundation FM models practical, reusable priors for IPs, especially scientific ones with few similar samples. More details are available at https://sun-umn.github.io/xm-plug/.

2511.12034 2026-05-13 cs.CV cs.LG cs.MM

Calibrated Multimodal Representation Learning with Missing Modalities

Xiaohao Liu, Xiaobo Xia, Jiaheng Wei, Shuo Yang, Xiu Su, See-Kiong Ng, Tat-Seng Chua

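Affiliations: National University of Singapore; University of Science and Technology of China; The Hong Kong University of Science and Technology (Guangzhou); Harbin Institute of Technology (Shenzhen); Central South University

AI summary: Multimodal representation learning aligns information from different modalities into a unified latent space, but existing methods usually require all modalities to be present and struggle with missing modalities. From an anchor shift perspective, this paper explains theoretically how missing modalities bias the alignment, and proposes CalMRL, which uses the priors and inherent connections among modalities to impute missing modalities and calibrate the alignment at the representation level. Experiments show that the method mitigates anchor shift and improves performance on data with missing modalities.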

Comments Accepted by ICML 2026

Details
English abstract

Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL to calibrate incomplete alignments caused by missing modalities. CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By combining the calibrated alignment with an existing advanced method, we offer new flexibility to absorb data with missing modalities, which was previously unattainable. Extensive experiments demonstrate the superiority of CalMRL. The code is released at https://github.com/Xiaohao-Liu/CalMRL.

2510.25609 2026-05-13 cs.LG cs.AI eess.SP

Revisiting GAN with Bayes-Optimal Discrimination

Mohammadreza Tavasoli Naeini, Ali Bereyhi, Morteza Noshad, Ben Liang, Alfred O. Hero

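Affiliations: University of Toronto; Stanford University; University of Michigan

AI summary: This paper proposes an alternative to standard GAN training in which the discriminator's objective changes from cross-entropy loss to directly targeting the discrimination Bayes error rate (BER). The authors use the Bayes optimal learning threshold (BOLT) loss and train the generator to maximize a surrogate of the discrimination BER. This view unifies different GAN training objectives and exposes their trade-off between smoothness and tightness. Under balanced class priors, maximizing the surrogate BER is shown to minimize the total variation between the data and generator distributions, and a link to Wasserstein GAN is established. Experiments show improved sample quality and coverage on image generation tasks.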

Details
English abstract

We propose an alternative to the standard GAN training approach, in which the discriminator is a binary classifier trained by cross-entropy to distinguish real samples from generated ones. Instead, we directly target the discrimination Bayes error rate (BER). To this end, we use the recently proposed Bayes optimal learning threshold (BOLT) loss and train the generator to maximize a surrogate of the discrimination BER. This viewpoint gives a unified perspective on GAN training: different objectives can be interpreted as parameterized bounds on the discrimination BER that describe a trade-off between smoothness and tightness. We show that, under balanced class priors, maximizing the surrogate BER with an unconstrained discriminator minimizes the total variation between the data and generator distributions. By constraining the discriminator to be 1-Lipschitz, the proposed maximization objective defines a discrepancy that is upper-bounded by the Wasserstein-1 distance, thereby linking it to Wasserstein GAN. Experiments on several image-generation datasets under matched architectures and optimization settings show that GAN training using the surrogate BER improves sample quality and coverage over standard baselines. This analysis suggests that the proposed Bayesian viewpoint can achieve a better trade-off between training stability and convergence of the generator to the data distribution.
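
The link between the Bayes error rate and total variation can be checked numerically with the standard identity for balanced priors, BER = (1 - TV(p, q)) / 2, where p is the data distribution and q the generator distribution. The sketch below evaluates both sides on two small discrete distributions. It illustrates the exact BER only, not the BOLT loss or the paper's surrogate, and the example distributions are made up.

```python
from typing import Sequence


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def bayes_error_rate(p: Sequence[float], q: Sequence[float]) -> float:
    """Error of the optimal real-vs-generated classifier with priors 1/2 and 1/2.

    At each outcome the optimal classifier picks the more likely class,
    so it errs with the mass of the less likely one.
    """
    return sum(min(0.5 * a, 0.5 * b) for a, b in zip(p, q))


# Hypothetical data distribution p and generator distribution q over three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

tv = total_variation(p, q)
ber = bayes_error_rate(p, q)
print(f"TV = {tv:.2f}, BER = {ber:.2f}, (1 - TV) / 2 = {(1 - tv) / 2:.2f}")
```

Because BER falls as TV grows, a generator that pushes the discrimination BER toward its maximum of 0.5 is pushing TV toward zero, which matches the direction of the result stated in the abstract.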