arXivDaily arXiv daily academic digest, updated Monday through Friday
2605.11753 2026-05-13 cs.AI

Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention

Abid Ali, Diego Molla-Aliod, Usman Naseem

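Affiliations School of Computing, Macquarie University

AI summary This paper studies multimodal summarization, which aims to generate semantically coherent and accurate summaries from text and images. To address the mismatch between visual features and language-model representations in existing methods, the authors propose SPeCTrA-Sum, a unified framework that deeply aligns the visual and language encoders and adds a visual relevance prediction module to select representative images. Experiments show that the method produces more visually grounded summaries and selects more representative images.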

Comments Accepted to Findings of ACL 2026

Abstract

Multimodal summarization requires models to jointly understand textual and visual inputs to generate concise, semantically coherent summaries. Existing methods often inject shallow visual features into deep language models, leading to representational mismatches and weak cross-modal grounding. We propose a unified framework that jointly performs text summarization and representative image selection. Our system, SPeCTrA-Sum (Sampler Perceiver with Cross-modal Transformer and gated Attention for Summarization), introduces two key innovations. First, a Deep Visual Processor (DVP) aligns the visual encoder with the language model at corresponding depths, enabling hierarchical, layer-wise fusion that preserves semantic consistency. Second, a lightweight Visual Relevance Predictor (VRP) selects salient and diverse images by distilling soft labels from a Determinantal Point Processes (DPP) teacher. SPeCTrA-Sum is trained using a multi-objective loss that combines autoregressive summarization, cross-modal alignment, and DPP-based distillation. Experiments show that our system produces more accurate, visually grounded summaries and selects more representative images, demonstrating the benefits of depth-aware fusion and principled image selection for multimodal summarization.
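
The abstract does not say how the DPP teacher scores image subsets, so the following is only a minimal sketch of generic greedy DPP selection, not the SPeCTrA-Sum implementation. It assumes a kernel built from per-image relevance scores and the cosine similarity of image features; the function name `greedy_dpp_select` and both inputs are hypothetical.

```python
import numpy as np


def greedy_dpp_select(features, quality, k):
    """Greedily pick up to k items that maximize the log-determinant of a DPP kernel.

    Assumption (not from the paper): the kernel is
    L[i, j] = quality[i] * cosine_similarity(i, j) * quality[j].
    Feature vectors are assumed to be non-zero.
    """
    features = np.asarray(features, dtype=float)
    quality = np.asarray(quality, dtype=float)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    kernel = quality[:, None] * (unit @ unit.T) * quality[None, :]

    selected = []
    for _ in range(min(k, len(quality))):
        best_item, best_score = None, -np.inf
        for item in range(len(quality)):
            if item in selected:
                continue
            subset = selected + [item]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(subset, subset)])
            # A non-positive determinant means the subset is fully redundant.
            score = logdet if sign > 0 else -np.inf
            if score > best_score:
                best_item, best_score = item, score
        if best_item is None:  # every remaining item is redundant
            break
        selected.append(best_item)
    return selected


# Images 0 and 1 point in the same feature direction; image 2 is different.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
quality = [1.0, 0.9, 0.8]
chosen = greedy_dpp_select(features, quality, k=2)
```

In this toy input the second-highest-quality image is skipped because it duplicates the first, which is the salience-plus-diversity trade-off a DPP is meant to capture. A distilled predictor such as the VRP would presumably be trained to imitate scores of this kind rather than run the selection itself.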

2605.11752 2026-05-13 cs.LG

Federated Client Selection under Partial Visibility: A POMDP Approach with Spatio-Temporal Attention

Qijun Hou, Yuchen Shi, Pingyi Fan, Khaled B. Letaief

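Affiliations Dept. of Electronic Engineering, BNRist, Tsinghua University; Dept. of Electronic and Computer Engineering, HKUST

AI summary This paper studies client selection in federated learning under partial visibility, where the server can observe only a subset of clients in each communication round. The authors formulate the problem as a Partially Observable Markov Decision Process (POMDP) and propose a reinforcement learning framework based on spatio-temporal attention that integrates historical global models and client identity embeddings to capture the temporal context of training and the persistent characteristics of clients. Experiments show that the method outperforms existing baselines in heterogeneous, partially visible settings, supporting its effectiveness under incomplete observations in practical federated learning systems.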

Abstract

Federated learning relies on effective client selection to alleviate the performance degradation caused by data heterogeneity. Most existing methods assume full visibility of all clients at each communication round. However, in large-scale or edge-based deployments, the server can only access a subset of clients due to communication, mobility, or availability constraints, resulting in partial visibility where only a subset of clients is observable for aggregation in each communication round. In this paper, we formulate federated client selection under partial visibility as a Partially Observable Markov Decision Process (POMDP) and propose a Spatial-Temporal attention-based reinforcement learning framework. By integrating historical global models and client identity embeddings, the proposed method captures both the temporal contexts of training and the persistent characteristics of clients. Experimental results across multiple datasets demonstrate that our approach achieves superior performance compared to existing baselines in heterogeneous and partially visible settings, validating its effectiveness in addressing the challenges of incomplete observations in practical federated learning systems.

2605.11750 2026-05-13 cs.RO cs.AI cs.CL cs.CV

DreamAvoid: Critical-Phase Test-Time Dreaming to Avoid Failures in VLA Policies

Xianzhe Fan, Yuxiang Lu, Shenyuan Gao, Xiaoyang Wu, Ruihua Han, Manling Li, Hengshuang Zhao

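Affiliations HKU; HKUST; Northwestern University

AI summary Vision-Language-Action (VLA) models are prone to irrecoverable failures in fine-grained manipulation tasks, triggered by minor action errors during critical phases. To address this, the paper proposes DreamAvoid, a framework that anticipates and avoids failures at test time through "dreaming" simulation. The method introduces a dream trigger, an action proposer, and a dream evaluator; it simulates the short-horizon outcomes of candidate actions and selects the best one to raise the task success rate. Experiments show that DreamAvoid effectively reduces failures and improves completion rates on real-world manipulation tasks.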

Comments 19 pages, 7 figures

Abstract

Vision-Language-Action (VLA) models are often brittle in fine-grained manipulation, where minor action errors during the critical phases can rapidly escalate into irrecoverable failures. Since existing VLA models rely predominantly on successful demonstrations for training, they lack an explicit awareness of failure during these critical phases. To address this, we propose DreamAvoid, a critical-phase test-time dreaming framework that enables VLA models to anticipate and avoid failures. We also introduce an autonomous boundary learning paradigm to refine the system's understanding of the subtle boundary between success and failure. Specifically, we (1) utilize a Dream Trigger to determine whether the execution has entered a critical phase, (2) sample multiple candidate action chunks from the VLA via an Action Proposer, and (3) employ a Dream Evaluator, jointly trained on mixed data (success, failure, and boundary cases), to "dream" the short-horizon futures corresponding to the candidate actions, evaluate their values, and select the optimal action. We conduct extensive evaluations on real-world manipulation tasks and simulation benchmarks. The results demonstrate that DreamAvoid can effectively avoid failures, thereby improving the overall task success rate. Our code is available at https://github.com/XianzheFan/DreamAvoid.
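
The three-step loop in the abstract (trigger, propose, evaluate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released DreamAvoid code: `choose_action_chunk`, its callable arguments, and the toy stand-ins are all hypothetical names, and the real Dream Evaluator is a learned model rather than a closed-form score.

```python
import itertools


def choose_action_chunk(observation, propose, is_critical, dream_value, num_candidates=4):
    """Return one action chunk, evaluating several candidates only in critical phases."""
    if not is_critical(observation):
        # Outside critical phases, execute the policy's proposal directly.
        return propose(observation)
    # Inside a critical phase, sample several candidates and keep the one
    # whose imagined short-horizon outcome scores highest.
    candidates = [propose(observation) for _ in range(num_candidates)]
    return max(candidates, key=lambda chunk: dream_value(observation, chunk))


# Toy stand-ins (hypothetical): the "policy" cycles through fixed gripper
# offsets and the "evaluator" prefers offsets close to 0.2.
_offsets = itertools.cycle([0.0, 0.5, 0.25])


def toy_propose(observation):
    return next(_offsets)


def toy_trigger(observation):
    return observation["critical"]


def toy_value(observation, chunk):
    return -abs(chunk - 0.2)


routine = choose_action_chunk({"critical": False}, toy_propose, toy_trigger, toy_value)
careful = choose_action_chunk({"critical": True}, toy_propose, toy_trigger, toy_value,
                              num_candidates=3)
```

Gating the extra sampling behind the trigger keeps the added test-time cost confined to the phases where a small action error is most likely to be irrecoverable.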

2605.11749 2026-05-13 cs.LG

Learning Feature Encoder with Synthetic Anomalies for Weakly Supervised Graph Anomaly Detection

Yingjie Zhou, Yuqin Xie, Fanxing Liu, Dongjin Song, Ce Zhu, Lingqiao Liu

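Affiliations Sichuan University; University of Connecticut; University of Electronic Science and Technology of China; University of Adelaide

AI summary This paper studies weakly supervised graph anomaly detection, which aims to identify graph instances whose behavior deviates significantly from normal patterns when only a few labeled anomalies and abundant unlabeled data are available. To learn graph feature representations that are sensitive to anomalies and discriminate the normal class, the authors propose a multi-task learning method based on synthetic anomalies: it generates synthetic anomalies through several kinds of perturbation and assigns a dedicated detection head to each anomaly type, guiding the model toward more discriminative features. Experiments show that the method outperforms existing methods on multiple public datasets.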

Comments 14 pages, 7 figures, published in IEEE Transactions on Knowledge and Data Engineering, 2026

Journal ref IEEE Transactions on Knowledge and Data Engineering, vol. 38, no. 4, pp. 2326-2339, 2026

Abstract

Weakly supervised graph anomaly detection aims to unveil unusual graph instances, e.g., nodes, whose behaviors significantly differ from normal ones, given only a limited number of annotated anomalies and abundant unlabeled samples. A major challenge is to learn a meaningful latent feature representation that reduces intra-class variance among normal data while remaining highly sensitive to anomalies. Although recent works have applied self-supervised feature learning for graph anomaly detection, their strategies are not specifically tailored to its unique requirements, motivating our exploration of a more domain-specific approach. In this paper, we introduce a weakly supervised graph anomaly detection method that leverages a feature learning strategy tailored for graph anomalies. Our approach is built upon a multi-task learning scheme that extracts robust feature representations through synthesized anomalies. We generate synthetic anomalies by perturbing the normal graph in various ways and assign a dedicated detection head to each anomaly type, ensuring that learned features are sensitive to potential deviations from normal patterns. Although synthetic anomalies may not perfectly replicate real-world patterns, they provide valuable auxiliary data for effective feature learning, much like features learned from ImageNet classification transfer to downstream vision tasks. Additionally, we adopt a two-phase learning strategy: an initial warm-up phase using only synthetic samples, followed by a full-training phase integrating both tasks, to balance the influence of synthetic and real data. Extensive experiments on public datasets demonstrate the superior performance of our method over its competitors. Code is available at https://github.com/yj-zhou/SAWGAD.

2605.11746 2026-05-13 cs.AI

When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

Wenkai Li, Fan Yang, Ananya Hazarika, Shaunak A. Mehta, Koichi Onoue

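Affiliations Carnegie Mellon University; Fujitsu Research of America Inc.

AI summary This study examines whether the visible reasoning trace in chain-of-thought (CoT) reasoning stays consistent with the underlying computation. Using a Detect-Classify-Compare framework combined with several validation methods, it finds that in most models the trace and the answer commitment are misaligned at the step level; in particular, the trace keeps generating deliberative-looking text that has no substantive effect after the answer has already been settled. The study also shows that CoT remains valuable for improving model performance, but deviates markedly from a reliable record of when the answer was formed.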

Abstract

Chain-of-thought (CoT) traces are increasingly used both to improve language model capability and to audit model behavior, implicitly assuming that the visible trace remains synchronized with the computation that determines the answer. We test this assumption with a step-level Detect-Classify-Compare framework built around an answer-commitment proxy that is cross-validated with Patchscopes, tuned-lens probes, and causal direction ablation. Across nine models and seven reasoning benchmarks, latent commitment and explicit answer arrival align on only 61.9% of steps on average. The dominant mismatch pattern is confabulated continuation: 58.0% of detected mismatch events occur after the answer-commitment proxy has already stabilized while the trace continues producing deliberative-looking text, and a vacuousness analysis shows that the committed answer does not change during these steps. In architecture-matched Qwen2.5/DeepSeek-R1-Distill comparisons, the reasoning pipeline changes failure composition more than aggregate alignment, most clearly at 32B where confabulated steps decrease as contradictory states increase. Lower step-level alignment is also associated with larger CoT utility, suggesting that the settings that benefit most from CoT are often the least temporally faithful. Paired truncation and a complementary donor-corruption test further indicate that much post-commitment text is not load-bearing for the final answer. These findings suggest that CoT can remain useful while still being an unreliable report of when the answer was formed.

2605.11744 2026-05-13 cs.CL cs.LG

Training-Inference Consistent Segmented Execution for Long-Context LLMs

Xianpeng Shang, Jiang Li, Zehua Duo, Qianyi Cai, Xiangdong Su

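Affiliations College of Computer Science, Inner Mongolia University, Hohhot 010021, China; National & Local Joint Engineering Research Center of Intelligent Information Processing Technology for Mongolian, Hohhot 010021, China; Inner Mongolia Key Laboratory of Multilingual Artificial Intelligence Technology, Hohhot 010021, China; Thrust of Artificial Intelligence, The Hong Kong University of Science

AI summary This paper targets the computational and memory bottlenecks that Transformer-based large language models face in long-context generation and proposes a segmented execution framework that is consistent between training and inference. During training, the method simulates the segmented execution semantics of inference and restricts gradient propagation to the KV states of the preceding segment, keeping training and inference consistent. Experiments show that on long-context tasks the method performs close to full-context attention while achieving competitive latency-memory trade-offs against existing efficient inference methods, substantially improving scalability at very long context lengths.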

Comments Accepted by ICML 2026. 19 pages, 6 figures, 3 tables

Abstract

Transformer-based large language models face severe scalability challenges in long-context generation due to the computational and memory costs of full-context attention. Under practical computation and memory constraints, many inference-efficient long-context methods improve efficiency by adopting bounded-context or segment-level execution only during inference, while continuing to train models under full-context attention, resulting in a mismatch between training and inference execution and state-transition semantics. Based on this insight, we propose a training-inference consistent segment-level generation framework, in which training and inference follow the same segment-level forward execution semantics. During training, consistency with inference is enforced by restricting gradient propagation to KV states carried over from the immediately preceding segment, while permitting head-specific access to past KV states during the forward pass without involving them in gradient propagation. Across long-context benchmarks, our approach achieves performance comparable to full-context attention, while achieving competitive latency-memory trade-offs against strong inference-efficient baselines, and substantially improving scalability at very long context lengths (e.g., approximately 6x lower peak prefill memory at 128K compared to full-context attention with FlashAttention).

2605.11743 2026-05-13 cs.CV cs.LG

WorldComp2D: Spatio-semantic Representations of Object Identity and Location from Local Views

SeongMin Jin, Doo Seok Jeong

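Affiliations Department of Semiconductor Engineering, Hanyang University, Republic of Korea

AI summary This paper proposes WorldComp2D, a lightweight representation learning framework that learns spatio-semantic representations of object identity and location from local views. The method uses multiscale local receptive fields to explicitly structure the latent space according to object identity and spatial proximity, and consists of a proximity-dependent encoder and a localizer that infers object coordinates in the input. Experiments show that, compared with existing lightweight models, WorldComp2D reduces parameters and FLOPs by up to 4.0x and 2.2x, respectively, while maintaining real-time performance on CPU, supporting its efficiency and generality for spatio-semantic reasoning.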

Comments Accepted as a regular paper at ICML2026

Abstract

Learning latent representations that capture both semantic and spatial information is central to efficient spatio-semantic reasoning. However, many existing approaches rely on implicit latent structures combined with dense feature maps or task-specific heads, limiting computational efficiency and flexibility. We propose WorldComp2D, a novel lightweight representation learning framework that explicitly structures latent space geometry according to object identity and spatial proximity using multiscale local receptive fields. This framework consists of (i) a proximity-dependent encoder that maps a given observation into a spatio-semantic latent space and (ii) a localizer that infers the coordinates of objects in the input from the resulting spatio-semantic representation. Using facial landmark localization as a proof-of-concept, we show that, compared to SoTA lightweight models, WorldComp2D reduces the numbers of parameters and FLOPs by up to 4.0X and 2.2X, respectively, while maintaining real-time performance on CPU. These results demonstrate that explicitly structured latent spaces provide an efficient and general foundation for spatio-semantic reasoning. This framework is open-sourced at https://github.com/JinSeongmin/WorldComp2D.

2605.11742 2026-05-13 cs.LG

Online Continual Learning with Dynamic Label Hierarchies

Xinrui Wang, Shao-Yuan Li, Bartłomiej Twardowski, Alexandra Gomez-Villa, Songcan Chen

Affiliations College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis; Computer Vision Center, Spain; Computer Sciences Department, Universitat Autonoma de Barcelona, Spain; State Key Laboratory for Novel Software Technology, Nanjing University; IDEAS Research Institute, Warsaw, Poland; Joint Laboratory of Spatial intelligent Perception

AI summary This paper studies online continual learning with dynamic label hierarchies: how to cope, while learning from non-stationary data streams, with a label hierarchy that evolves between fine and coarse granularities. Most existing methods assume a flat label space and overlook the hierarchical organization of real-world concepts. The authors therefore introduce a new problem setting, DHOCL, and design HALO, which adaptively combines classification heads with structured prototypes to achieve rapid adaptation and knowledge retention, and performs strongly on multiple benchmarks.

Comments Accepted to ICML2026

Abstract

Online Continual Learning (OCL) aims to learn from endless non-stationary data streams, yet most existing methods assume a flat label space and overlook the hierarchical organization of real-world concepts that evolves both horizontally (sibling classes) and vertically (coarse or fine categories). To better reflect this context, we introduce a new problem setting, DHOCL (Online Continual Learning from Dynamic Hierarchies), where taxonomies evolve across granularities and each sample provides supervision at a single hierarchical level. In this setting, we find two fundamental issues: (i) partial supervision under mixed granularities provides only point-wise signals over an evolving path-wise hierarchy, which constrains plasticity and undermines cross-level semantic consistency, and (ii) the dynamically evolving hierarchies induce granularity-dependent interference, destabilizing popular replay and regularization mechanisms and thereby exacerbating catastrophic forgetting. To tackle these issues, we propose HALO (Hierarchical Adaptive Learning with Organized Prototypes), which adaptively combines complementary classification heads, regularized by organized learnable hierarchical prototypes, enabling rapid adaptation, hierarchical consistency, and structured knowledge consolidation as the taxonomy evolves. Extensive experiments on multiple benchmarks demonstrate that HALO consistently outperforms existing methods across hierarchical accuracy, mistake severity, and continual performance.

2605.11738 2026-05-13 cs.AI

OptArgus: A Multi-Agent System to Detect Hallucinations in LLM-based Optimization Modeling

Zhong Li, Zihan Guo, Xiaohan Lu, Juntao Wang, Jie Song, Chao Shen, Jiageng Wu, Mingyang Sun

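Affiliations Great Bay University; Peking University; Jilin University; Zhejiang University; Shenzhen Loop Area Institute

AI summary This paper proposes OptArgus, a multi-agent system for detecting hallucinations in LLM-based optimization modeling. The work focuses on the structural inconsistencies that can arise when LLMs translate natural-language optimization problems into mathematical models and solver code, and builds a fine-grained hallucination taxonomy covering objective, variable, constraint, and implementation failures. Through multi-agent collaboration that combines conductor routing, specialist auditing, and evidence consolidation, OptArgus improves detection accuracy and localization, and outperforms a single-agent approach on a benchmark suite containing several types of data.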

Abstract

Large language models (LLMs) are increasingly used to translate natural-language optimization problems into mathematical formulations and solver code, but matching the reference objective value is not a reliable test of correctness: an artifact may agree numerically while still changing the underlying optimization semantics. We formulate this issue as optimization-modeling hallucination detection, namely structural consistency auditing over the problem description, symbolic model, and solver implementation. We develop, to our knowledge, the first fine-grained hallucination taxonomy specifically for optimization modeling, spanning objective, variable, constraint, and implementation failures. We use this taxonomy to design OptArgus, a multi-agent detector with conductor routing, specialist auditors, and evidence consolidation. To evaluate this setting, we introduce a three-part benchmark suite with 484 clean artifacts, 1266 controlled injected artifacts, and 6292 natural LLM-generated artifacts. Against a matched single-agent baseline, OptArgus produces fewer false alarms on clean artifacts, more accurate top-ranked localization on controlled single-error cases, and stronger detection on natural model outputs. Together, these contributions turn optimization-modeling hallucination detection into a concrete empirical problem and suggest that modular, taxonomy-grounded auditing is a practical route to more reliable optimization modeling.

2605.11735 2026-05-13 cs.LG eess.SP

U-STS-LLM: A Unified Spatio-Temporal Steered Large Language Model for Traffic Prediction and Imputation

Yichen Zhang, Jun Li

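Affiliations School of Information Science and Engineering, Southeast University

AI summary This paper proposes U-STS-LLM, a unified spatio-temporally steered large language model for traffic prediction and missing-value imputation. The model dynamically generates spatio-temporal attention biases that explicitly steer the LLM toward key spatio-temporal structure, and combines low-rank adaptation with a gated fusion mechanism for efficient, stable parameter optimization. Experiments show that U-STS-LLM achieves better forecasting and imputation performance than existing methods on real-world cellular network datasets, demonstrating the potential of large models in structured, non-linguistic domains.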

Comments 14 pages, 6 figures

Abstract

The efficient operation of modern cellular networks hinges on the accurate analysis of spatio-temporal traffic data. Mastering these patterns is essential for core network functions, chiefly forecasting future load to pre-empt congestion and imputing missing values caused by sensor failures or transmission errors to ensure data continuity. While deeply connected, forecasting and imputation have historically evolved as separate sub-fields. The dominant paradigm, Spatio-Temporal Graph Neural Networks (STGNNs), while effective, are often specialized, computationally intensive, and exhibit limited generalization. Concurrently, adapting large pre-trained language models (LLMs) offers a powerful alternative for sequence modeling, yet existing approaches provide weak structural guidance, leading to unstable convergence and a narrow focus on forecasting. To bridge these gaps, we propose U-STS-LLM, a unified framework built on a spatio-temporally steered LLM. Our core innovation is a Dynamic Spatio-Temporal Attention Bias Generator that synthesizes a persistent functional graph with transient nodal states to explicitly steer the LLM's attention. Coupled with a partially frozen backbone tuned via Low-Rank Adaptation (LoRA) and a Gated Adaptive Fusion mechanism, the model achieves stable, parameter-efficient adaptation. Trained under a unified multi-task objective, U-STS-LLM learns a holistic data representation. Extensive experiments on real-world cellular datasets demonstrate that U-STS-LLM establishes new state-of-the-art performance in both long-horizon forecasting and high-missing-rate imputation, while maintaining remarkable training efficiency and stability, offering a novel blueprint for harnessing foundation models in structured, non-linguistic domains.

2605.11730 2026-05-13 cs.LG cs.CR

Persona-Conditioned Adversarial Prompting: Multi-Identity Red-Teaming for Adversarial Discovery and Mitigation

Cristian Morasso, Anisa Halimi, Muhammad Zaid Hameed, Douglas Leith

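Affiliations IBM Research; Trinity College Dublin

AI summary This study proposes Persona-Conditioned Adversarial Prompting (PCAP), which introduces diverse attacker personas and strategy sets to improve red-teaming of large language models and to discover and address potential threats more comprehensively. By searching in parallel over the attack styles of different personas, PCAP generates attack samples with broader coverage of realistic scenarios and markedly raises both the attack success rate and the diversity of defense data. Experiments show that fine-tuning on PCAP-generated data effectively strengthens model robustness while keeping false positives low, demonstrating a practical closed loop from vulnerability discovery to automated alignment.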

Abstract

Automated red-teaming for LLMs often discovers narrow attack slices, missing diverse real-world threats, and yielding insufficient data for safety fine-tuning. We introduce Persona-Conditioned Adversarial Prompting (PCAP), which conditions adversarial search on diverse attacker personas (e.g., doctors, students, malicious actors) and strategy sets to explore realistic attack scenarios. By running parallel persona-conditioned searches, PCAP discovers transferable jailbreaks across different contexts and generates rich defense datasets with automatic metadata tracking. On GPT-OSS 120B, PCAP increases attack success from 57% to 97% while producing 2-6× more diverse prompts covering varied real-world scenarios. Critically, fine-tuning lightweight adapters on PCAP-generated data significantly improves model robustness (recall: 0.36 → 0.99, F1: 0.53 → 0.96) with minimal false positives, demonstrating a practical closed-loop approach from vulnerability discovery to automated alignment.

2605.11727 2026-05-13 cs.AI cs.CL cs.CV

Allegory of the Cave: Measurement-Grounded Vision-Language Learning

Kepeng Xu, Li Xu, Gang He, Wenxin Yu

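Affiliations Xidian University; Southwest University of Science and Technology

AI summary This study explores how visual inputs closer to the raw camera measurement can improve the grounding of vision-language models. It proposes PRISM-VL, a measurement-grounded vision-language learning framework that combines RAW-derived inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation to strengthen the model's perception of real-world scene information. Experiments show that the method markedly improves performance in challenging conditions such as low light and high dynamic range, supporting the importance of preserving measurement-domain information for multimodal reasoning.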

Abstract

Vision-language models typically reason over post-ISP RGB images, although RGB rendering can clip, suppress, or quantize sensor evidence before inference. We study whether grounding improves when the visual interface is moved closer to the underlying camera measurement. We formulate measurement-grounded vision-language learning and instantiate it as PRISM-VL, which combines RAW-derived Meas.-XYZ inputs, camera-conditioned grounding, and Exposure-Bracketed Supervision Aggregation for transferring supervision from RGB proxies to measurement-domain observations. Using a quality-controlled 150K instruction-tuning set and a held-out benchmark targeting low-light, HDR, visibility-sensitive, and hallucination-sensitive cases, PRISM-VL-8B reaches 0.6120 BLEU, 0.4571 ROUGE-L, and 82.66% LLM-Judge accuracy, improving over the RGB Qwen3-VL-8B baseline by +0.1074 BLEU, +0.1071 ROUGE-L, and +4.46 percentage points. These results suggest that part of VLM grounding error arises from information lost during RGB rendering, and that preserving measurement-domain evidence can improve multimodal reasoning.
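
The abstract reports PRISM-VL-8B scores and gains but not the baseline scores themselves. The short calculation below recovers the implied baseline, assuming each gain is the plain difference between the PRISM-VL score and the RGB baseline score; the variable names are illustrative only.

```python
# Scores reported for PRISM-VL-8B and its reported gains over the RGB baseline.
prism_vl = {"BLEU": 0.6120, "ROUGE-L": 0.4571, "LLM-Judge accuracy (%)": 82.66}
gain = {"BLEU": 0.1074, "ROUGE-L": 0.1071, "LLM-Judge accuracy (%)": 4.46}

# Assumption: each gain is the plain difference "PRISM-VL minus baseline".
implied_baseline = {
    name: round(prism_vl[name] - gain[name], 4) for name in prism_vl
}

for name, value in implied_baseline.items():
    print(f"{name}: {value}")
```

Under that assumption the RGB Qwen3-VL-8B baseline sits at roughly 0.5046 BLEU, 0.3500 ROUGE-L, and 78.20% LLM-Judge accuracy, so the relative BLEU improvement is about 21%.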

2605.11722 2026-05-13 cs.CV cs.LG

EPIC: Efficient Predicate-Guided Inference-Time Control for Compositional Text-to-Image Generation

Sunung Mun, Sunghyun Cho, Jungseul Ok

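Affiliations Graduate School of Artificial Intelligence, POSTECH; Department of Computer Science & Engineering, POSTECH

AI summary EPIC is a training-free inference-time refinement framework for compositional text-to-image generation, where prompts involve multiple objects, counts, attributes, and relations. The method parses the prompt into a fixed visual program and uses predicate-guided search to verify and correct images, judging a generation successful only when all conditions are satisfied. Experiments show that EPIC substantially improves accuracy on GenEval2 and greatly reduces compute cost compared with existing methods.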

Abstract

Recent text-to-image (T2I) generators can synthesize realistic images, but still struggle with compositional prompts involving multiple objects, counts, attributes, and relations. We introduce EPIC (Efficient Predicate-Guided Inference-Time Control), a training-free inference-time refinement framework for compositional T2I generation. EPIC casts refinement as predicate-guided search: it parses the original prompt once into a fixed visual program of object variables and typed predicates, covering checkable conditions such as object presence, counts, attributes, and relations. Each generated or edited image is verified against this program using visual evidence extracted from that image. An image is judged to satisfy the prompt only when all predicates are satisfied; otherwise, failed predicates decide the next step, routing local failures to targeted editing and global failures to resampling while the fixed visual program remains unchanged. On GenEval2, EPIC improves prompt-level accuracy from 34.16% for single-pass generation with the base generator to 71.46%. Under the same generator/editor setting and maximum image-model execution budget, EPIC outperforms the strongest prior refinement baseline by 19.23 points while reducing realized cost by 31% in image-model executions, 72% in MLLM calls, and 81% in MLLM tokens per prompt.
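
The accept/edit/resample routing described in the abstract can be sketched as below. This is a hedged illustration, not EPIC's implementation: the `Predicate` class, the `scope` labels, the evidence dictionary, and the rule that any global failure triggers resampling are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Predicate:
    name: str
    scope: str  # "local" or "global"; this labeling is an assumption of the sketch
    check: Callable[[Dict], bool]


def failed_predicates(program: List[Predicate], evidence: Dict) -> List[Predicate]:
    """Return the predicates that the extracted visual evidence does not satisfy."""
    return [p for p in program if not p.check(evidence)]


def next_step(program: List[Predicate], evidence: Dict) -> str:
    """Accept only if every predicate holds; otherwise route by failure type."""
    failed = failed_predicates(program, evidence)
    if not failed:
        return "accept"
    if any(p.scope == "global" for p in failed):
        return "resample"  # assumed policy: a global failure outweighs local ones
    return "edit"


# Hypothetical fixed program for the prompt "two red apples".
program = [
    Predicate("count(apple) == 2", "global", lambda ev: ev.get("apple_count") == 2),
    Predicate("color(apple) == red", "local", lambda ev: ev.get("apple_color") == "red"),
]

ok = next_step(program, {"apple_count": 2, "apple_color": "red"})
wrong_color = next_step(program, {"apple_count": 2, "apple_color": "green"})
wrong_count = next_step(program, {"apple_count": 1, "apple_color": "red"})
```

Because the program is parsed once and never changes, every image in the search is judged against the same checklist, which is what makes the all-predicates-satisfied stopping rule well defined.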

2605.11716 2026-05-13 cs.AI

SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

Xinyi Zeng, Xue Yang, Jingyuan Zhang, Huanqian Yan, Xiang Chen, Kaiwen Wei, Hankun Kang, Yu Tian

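Affiliations Tsinghua University; Shanghai Jiao Tong University; Kuaishou Technology; School of Computer Science and Technology, Beihang University; Nanjing University of Aeronautics and Astronautics; Chongqing University; Wuhan University

AI summary Multimodal large language models (MLLMs) face serious safety challenges under jailbreak attacks; existing defenses rely on costly fine-tuning or inefficient post-hoc processing, struggle with novel attacks, and involve performance trade-offs. This paper proposes SafeSteer, a decoding-level defense mechanism that introduces a lightweight Decoding-Probe to detect and correct harmful outputs, together with a modal semantic alignment vector that transfers textual safety alignment to the vision modality. Experiments show that SafeSteer improves the safety of MLLMs by up to 33.40% without fine-tuning while preserving the models' effectiveness and helpfulness.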

Abstract

Multimodal large language models (MLLMs) are gaining increasing attention. Due to the heterogeneity of their input features, they face significant challenges in terms of jailbreak defenses. Current defense methods rely on costly fine-tuning or inefficient post-hoc interventions, limiting their ability to address novel attacks and involving performance trade-offs. To address the above issues, we explore the inherent safety capabilities within MLLMs and quantify their intrinsic ability to discern harmfulness at the decoding stage. We observe that 1) MLLMs can distinguish harmful from harmless inputs during the decoding process, and 2) image-based attacks are more stealthy. Based on these insights, we introduce SafeSteer, a decoding-level defense mechanism for MLLMs. Specifically, it includes a Decoding-Probe, a lightweight probe for detecting and correcting harmful output during decoding, which iteratively steers the decoding process toward safety. Furthermore, a modal semantic alignment vector is integrated to transfer the strong textual safety alignment to the vision modality. Experiments on multiple MLLMs demonstrate that SafeSteer can improve MLLMs' safety by up to 33.40% without fine-tuning. Notably, it can maintain the effectiveness of MLLMs, ensuring a balance between their helpfulness and harmlessness.

2605.11714 2026-05-13 cs.RO

Introducing Environmental Constraints to Grasping Strategies for Paper-Like Flexible Materials Using a Soft Gripper

Yi Dong, Yang Li, Jinjun Duan, Zhendong Dai

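Affiliations College of Mechanical and Electrical Engineering, Nanjing University; Jiangsu Key Laboratory of Bionic Materials and Equipment, Nanjing University of Aeronautics and Astronautics

AI summary This paper studies how to introduce environmental constraints to improve grasping of paper-like flexible materials with a soft gripper. By analyzing how material properties and working conditions affect grasping, it proposes a systematic set of grasping strategies based on environmental constraints and establishes their mechanical and kinematic models. Experiments verify the applicable scenarios and performance of the different strategies, offering a feasible solution for household service robots grasping planar flexible objects.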

Comments Under Review

Abstract

Robotic manipulation of flexible objects is widely required in both industrial and service applications. Among such objects, paper-like materials exhibit distinct mechanical characteristics compared to cloth, being more sensitive to compressive stress, where minor variations in physical properties can significantly affect grasping. This study systematically investigates grasping strategies for paper-like materials using a universal soft gripper by exploiting environmental constraints. Based on manipulation primitives employed in existing grasping strategies, we proposed systematic grasping strategies for flexible materials by exploiting environmental constraints and analyzed their mechanical and kinematic models. To investigate the influence of materials and working conditions on grasping, an evaluation system for measuring grasping force and success rate was defined and experimentally evaluated. Finally, we summarized the specific workspaces and characteristics of different strategies that can satisfy various task requirements and lead to potential applications in household service robots for grasping planar flexible objects.

2605.11712 2026-05-13 cs.AI

Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance

Wenhao Chen, Sirui Sun, Shengyuan Bai, Guojie Song

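Affiliations School of Electronics Engineering and Computer Science, Peking University; Yuanpei College, Peking University; State Key Laboratory of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University

AI summary This paper addresses the unstable value expression that arises during value alignment of large language models (LLMs) because of the dynamic nature of the residual stream, and proposes a new architecture called the Stable Value Guidance Transformer (SVGT). The method introduces an independent value module that separates value representations from the backbone and uses learnable Bridge Tokens to provide stable value guidance, markedly improving safety while maintaining generation fluency. Experiments show that SVGT effectively reduces harmful outputs across multiple benchmarks, supporting the effectiveness of structured value modeling.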

Comments Accepted to ICML 2026 (Spotlight). 32 pages

Abstract

Aligning large language models (LLMs) with human values typically relies on post-training or inference-time steering that directly manipulates the backbone's parameters or representation space. However, a critical gap exists: the model's residual stream is highly dynamic, in which values exist as fragile, low-dimensional properties, inherently incompatible with the stability required for consistent value expression. In this paper, we propose the Stable Value Guidance Transformer (SVGT), which addresses this gap through an independent value module incorporating two key designs: (1) independent value modeling, maintaining normative representations in a dedicated value space isolated from the backbone, and (2) explicit behavioral guidance, transducing these stable signals into learnable latent Bridge Tokens. These tokens serve as dynamic value anchors to explicitly steer the generative trajectory, ensuring robust adherence across diverse contexts without disrupting the backbone's internal representations. Experiments across multiple backbones and safety benchmarks show that SVGT generally reduces harmful scores by over 70% while maintaining generation fluency, demonstrating the efficacy of architecturally grounded value modeling. Our code is available at https://github.com/Clervils/SVGT.git.

2605.11711 2026-05-13 cs.LG cs.AI

Debiased Model-based Representations for Sample-efficient Continuous Control

Jiafei Lyu, Zichuan Lin, Scott Fujimoto, Kai Yang, Yangkun Chen, Saiyong Yang, Zongqing Lu, Deheng Ye

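Affiliations Tencent Hunyuan; McGill University; School of Computer Science, Peking University

AI summary This paper proposes DR.Q, a debiased model-based representation learning method for improving sample efficiency in continuous control. The method maximizes the mutual information between the current state-action pair and its next state and combines this with faded prioritized experience replay, mitigating the bias and overfitting of earlier methods in representation learning. Experiments show that DR.Q performs strongly on multiple benchmark tasks, matching or surpassing recent strong methods.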

Comments ICML 2026

详情
英文摘要

Model-based representations have recently stood out as a promising framework that embeds latent dynamics information into the representations for downstream off-policy actor-critic learning. It implicitly combines the advantages of both model-free and model-based approaches while avoiding the training costs associated with model-based methods. Nevertheless, existing model-based representation methods can fail to capture sufficient information about relevant variables and can overfit to early experiences in the replay buffer. These incur biases in representation and actor-critic learning, leading to inferior performance. To address this, we propose Debiased model-based Representations for Q-learning, termed the DR.Q algorithm. DR.Q explicitly maximizes the mutual information between the representations of the current state-action pair and the next state in addition to minimizing their deviations, and samples transitions with faded prioritized experience replay. We evaluate DR.Q on numerous continuous control benchmarks with a single set of hyperparameters, and the results demonstrate that DR.Q can match or surpass recent strong baselines, sometimes outperforming them by a large margin. Our code is available at https://github.com/dmksjfl/DR.Q.

2605.11706 2026-05-13 cs.LG

GRAFT: Graph-Tokenized LLMs for Tool Planning

Xinyi Gao, Xinyu Ren, Junliang Yu, Tong Chen, Quoc Viet Hung Nguyen, Hongzhi Yin

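Affiliations * The University of Queensland; Griffith University

AI summary GRAFT is a graph-tokenized LLM framework for tool planning that aims to align tool choices with subtask intent and satisfy inter-tool dependencies in complex tasks. It internalizes the tool graph by mapping each tool node to a dedicated special token and learning directed tool dependencies in the representation space. GRAFT also introduces on-policy tool context distillation, improving the model's ability to produce legal, accurate tool sequences in complex workflows; experiments show state-of-the-art performance in sequence matching and dependency legality.

Details
Abstract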

Large language models (LLMs) are increasingly used to complete complex tasks by selecting and coordinating external tools across multiple steps. This requires aligning tool choices with subtask intent while satisfying directional execution dependencies among tools. To do this, existing methods model these dependencies as tool graphs and incorporate the graphs with LLMs through retrieval, serialization, or prompt-level injection. However, these external graph-use strategies all follow a matching paradigm, which often fails to align tool choices with the underlying subtask structure, producing semantically plausible plans that violate graph constraints. This issue is further exacerbated by error accumulation, where an early incorrect tool selection shifts the plan into an invalid graph state and causes subsequent predictions to drift away from the valid execution path. To address these challenges, we propose GRAFT, a graph-tokenized language model framework for dependency-aware tool planning. GRAFT internalizes the tool graph by mapping each tool node to a dedicated special token and learning directed tool dependencies within the representation space. It further introduces on-policy tool context distillation, training the model on its own sampled trajectories while distilling stepwise planning signals. Experiments show that GRAFT achieves state-of-the-art performance in exact sequence matching and dependency legality, supporting more reliable LLM tool planning in complex workflows.
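
As a rough illustration of what "dependency legality" could mean in practice, the sketch below checks a planned tool sequence against a directed tool graph. The edge semantics (an edge `a -> b` means `b` may directly follow `a`), the tool names, and the helpers `is_legal_plan` and `first_violation` are assumptions made for illustration; the abstract does not define the metric, and this is not GRAFT's implementation.

```python
from typing import Dict, List, Set

# Hypothetical tool graph: an edge a -> b means tool b may directly follow tool a.
TOOL_GRAPH: Dict[str, Set[str]] = {
    "search": {"summarize", "translate"},
    "summarize": {"translate"},
    "translate": set(),
}


def is_legal_plan(plan: List[str], graph: Dict[str, Set[str]]) -> bool:
    """Return True if every tool is known and every consecutive pair is an edge."""
    if any(tool not in graph for tool in plan):
        return False
    return all(nxt in graph[cur] for cur, nxt in zip(plan, plan[1:]))


def first_violation(plan: List[str], graph: Dict[str, Set[str]]) -> int:
    """Return the index of the first step that breaks the graph, or -1 if none."""
    for i, tool in enumerate(plan):
        if tool not in graph:
            return i
        if i > 0 and tool not in graph[plan[i - 1]]:
            return i
    return -1
```

Under these assumptions, `first_violation` makes the error-accumulation point concrete: once one step leaves the graph, every later step is being chosen from an already invalid state.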

2605.11705 2026-05-13 cs.CV

CAST: Collapse-Aware multi-Scale Topology Fusion for Multimodal Coreset Selection

Boran Zhao, Hetian Liu, Zhenxian Hu, Yuqing Yuan, Yu Yan, Pengju Ren

Affiliations * School of Software Engineering, the National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics; School of Software Engineering; XJTU-POLIMI Joint School; Faculty of Electronic and Information Engineering; School of Human Settlements and Civil Engineering; the National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics

AI summary This paper proposes CAST, a multimodal coreset selection framework that targets the high computational cost of training multimodal models on large-scale image-text datasets. CAST constructs topologies for the image and text modalities and combines them through a local-collapse-aware fusion strategy to obtain a balanced representation of cross-modal information. It also introduces multi-scale distribution matching in the diffusion wavelet domain and a local soft relational coverage mechanism, improving the coreset's semantic structure, fine-grained detail, and redundancy suppression. Experiments show that CAST outperforms existing methods on multiple datasets, with stronger cross-architecture generalization and computational efficiency.

Details
Abstract

The training of large multimodal models fundamentally relies on massive image-text datasets, which inevitably incur prohibitive computational overhead. Dataset selection offers a promising paradigm by identifying a highly informative coreset. However, existing approaches suffer from two critical limitations: (i) single-modality-dominated sampling methods, which ignore the fine-grained cross-modal information imbalance inherent in multimodal datasets and thus lead to semantic loss in the other modality; and (ii) coarse-grained sample-scoring-based sampling methods, where the selected coreset tends to be biased toward the scoring model, making it difficult to guarantee distributional equivalence between the coreset and the original dataset. Meanwhile, existing distribution matching and discrete sampling strategies often fail to jointly account for global semantic structure, local fine-grained details, and redundancy-aware coverage in dense regions. To this end, we propose CAST, a Collapse-Aware multi-Scale Topology fusion framework for multimodal coreset selection. We first construct image- and text-modality topologies, and derive a unified topology via local-collapse-aware refinement and cross-modal fusion. We then introduce a multi-scale distribution matching criterion in the diffusion wavelet domain, encouraging the coreset to approximate the original dataset at multiple scales. Finally, we introduce a local soft relational coverage mechanism that extends pure geometric coverage to relation-aware indirect coverage, penalizing redundant selections in dense clusters. Extensive experiments on Flickr30K and MS-COCO show that CAST outperforms existing dataset selection baselines, showcasing great superiority in cross-architecture generalization and energy efficiency over state-of-the-art multimodal synthesis methods.

2605.11704 2026-05-13 cs.CV

ScaleMoGen: Autoregressive Next-Scale Prediction for Human Motion Generation

Inwoo Hwang, Hojun Jang, Bing Zhou, Jian Wang, Young Min Kim, Chuan Guo

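Affiliations * Seoul National University; Snap Inc.; Meta Reality Labs

AI summary This paper proposes ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. It treats motion generation as a coarse-to-fine process and autoregressively predicts discrete tokens across multiple skeletal-temporal scales to produce high-quality motion sequences. Bitwise quantization and prediction scale up the token vocabulary and stabilize optimization; experiments show that it outperforms existing methods on several metrics and supports training-free, text-guided motion editing.

Comments Project page: https://inwoohwang.me/ScaleMoGen

Details
Abstract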

We present ScaleMoGen, a scale-wise autoregressive framework for text-driven human motion generation. Unlike conventional autoregressive approaches that rely on standard next-token prediction, ScaleMoGen frames motion generation as a coarse-to-fine process. We quantize 3D motions into compositional discrete tokens across multiple skeletal-temporal scales of increasing granularity, learning to generate motion by autoregressively predicting next-scale token maps. To maintain structural integrity, our motion tokenizers and quantizers are explicitly designed so that discrete tokens at every scale strictly preserve the skeletal hierarchy. Additionally, we employ bitwise quantization and prediction, which efficiently scale up the tokenizer vocabulary to preserve motion details and stabilize optimization. Extensive experiments demonstrate that ScaleMoGen achieves state-of-the-art performance, establishing an FID of 0.030 (vs. 0.045 for MoMask) on HumanML3D and a CLIP Score of 0.693 (vs. 0.685 for MoMask++) on the SnapMoGen dataset. Furthermore, we demonstrate that our skeletal-temporal multi-scale representation naturally facilitates training-free, text-guided motion editing.

2605.11697 2026-05-13 cs.RO

Rainbow Deep Q-Learning with Kinematics-Aware Design for Cooperative Delta and 3-RRS Parallel Robot Insertion

Hassen Nigatu, Gaokun Shi, Jituo Li, Wang Jin, Lu Guodong

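Affiliations * Robotics Research Center of Yuyao; Robotics Institute of Zhejiang University; Yuyao Technology Innovation Center; Department of Electrical and Computer Engineering, University of New Brunswick

AI summary This paper proposes a kinematics-aware deep reinforcement learning framework based on Rainbow Deep Q-Networks (Rainbow DQN) for cooperative peg-in-hole insertion by a Delta parallel robot and a 3-RRS parallel manipulator. The 3-RRS geometry is optimized to enlarge its singularity-free workspace, which makes exploration by the reinforcement learning policy safer. The framework models cooperative insertion as a Markov decision process and combines a tailored reward function with a two-stage training curriculum, achieving stable policy convergence and reliable insertion in a high-fidelity simulation environment.

Comments 10 pages

Details
Abstract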

This paper presents a kinematics-aware deep reinforcement learning framework based on Rainbow Deep Q-Networks (DQN) for cooperative peg-in-hole manipulation by a Delta parallel robot and a 3-RRS (Revolute-Revolute-Spherical) parallel manipulator. A key contribution is the integration of a geometric design-optimization stage that precedes learning: the 3-RRS geometry is tuned to maximize the singularity-free workspace and improve conditioning, which in turn enlarges the safe region in which the reinforcement learning policy can explore. Together the two manipulators expose a 6 degree-of-freedom (DoF) controllable subspace (three Delta translations, two 3-RRS rotations, and one 3-RRS vertical translation); the peg-in-hole task is invariant to rotation about the peg axis, so the task-relevant manifold is five dimensional. The cooperative insertion problem is cast as a Markov Decision Process with a 12-dimensional state vector and a discrete action set containing $6 \times 2 = 12$ incremental commands (one positive and one negative per controlled DoF). A shaped reward combines dense proximity guidance, penalties for kinematic and workspace violations, and sparse bonuses for successful insertions. The Rainbow DQN, which integrates double Q-learning, dueling architecture, prioritized replay, multi-step returns, noisy linear layers for exploration, and a distributional value head, is trained with a two-stage curriculum. The co-designed framework is validated in a high-fidelity kinematic simulator, where it achieves stable policy convergence, reliable insertions, and reduced constraint violations compared against a vanilla DQN agent and a classical sampling-based planner.
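
The discrete action set described above can be pictured as a small lookup table. The following is a minimal sketch, assuming the action index maps to one signed increment on one DoF; the DoF names, the step size `STEP`, and the helper functions are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# The six controlled DoF listed in the abstract; the short names are made up here.
DOFS: List[str] = [
    "delta_x", "delta_y", "delta_z",  # three Delta translations
    "rrs_roll", "rrs_pitch",          # two 3-RRS rotations
    "rrs_z",                          # one 3-RRS vertical translation
]

STEP = 0.001  # hypothetical increment; units and magnitude are not given in the abstract


def build_action_table(dofs: List[str], step: float) -> List[Tuple[int, float]]:
    """Map each discrete action index to (dof_index, signed increment)."""
    table = []
    for dof_index in range(len(dofs)):
        table.append((dof_index, +step))  # positive command for this DoF
        table.append((dof_index, -step))  # negative command for this DoF
    return table


def apply_action(pose: List[float], action: int,
                 table: List[Tuple[int, float]]) -> List[float]:
    """Return a new pose with one DoF nudged by the selected increment."""
    dof_index, delta = table[action]
    new_pose = list(pose)
    new_pose[dof_index] += delta
    return new_pose


ACTIONS = build_action_table(DOFS, STEP)  # 6 DoF x 2 signs = 12 actions
```

A Q-network for this setup would then output one value per entry of `ACTIONS`, and the environment would call something like `apply_action` before checking kinematic and workspace limits.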

2605.11696 2026-05-13 cs.CV cs.AI cs.GR

WildRelight: A Real-World Benchmark and Physics-Guided Adaptation for Single-Image Relighting

Lezhong Wang, Mehmet Onurcan Kaya, Siavash Bigdeli, Jeppe Revall Frisvad

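Affiliations * Technical University of Denmark; Inria

AI summary WildRelight is the first real-world dataset designed specifically for single-image relighting. It contains high-resolution outdoor scenes paired with high-dynamic-range environment maps and is used to evaluate how existing methods perform in real environments. The dataset reveals that current state-of-the-art models trained on synthetic data suffer from severe domain shift in the real world. The study proposes a physics-guided inference framework that combines Diffusion Posterior Sampling with temporal test-time adaptation so that synthetic models align with real scenes on the fly, offering a new approach to the sim-to-real challenge.

Comments Companion paper to the CVPR26 findings paper 'WildRelight', introducing the physics-guided adaptation method evaluated on the dataset. Project Page: https://lez-s.github.io/wildrelight_proj/

Details
Abstract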

Recent single-image relighting methods, powered by advanced generative models, have achieved impressive photorealism on synthetic benchmarks. However, their effectiveness in the complex visual landscape of the real world remains largely unverified. A critical gap exists, as current datasets are typically designed for multi-view reconstruction and fail to address the unique challenges of single-image relighting. To bridge this synthetic-to-real gap, we introduce WildRelight, the first in-the-wild dataset specifically created for evaluating single-image relighting models. WildRelight features a diverse collection of high-resolution outdoor scenes, captured under strictly aligned, temporally varying natural illuminations, each paired with a high-dynamic-range environment map. Using this data, we establish a rigorous benchmark revealing that state-of-the-art models trained on synthetic data suffer from severe domain shifts. The strictly aligned temporal structure of WildRelight enables a new paradigm for domain adaptation. We demonstrate this by introducing a physics-guided inference framework that leverages the captured natural light evolution as a self-supervised constraint. By integrating Diffusion Posterior Sampling (DPS) with temporal Sampling-Aware Test-Time Adaptation (TTA), we show that the dataset allows synthetic models to align with real-world statistics on-the-fly, transforming the intractable sim-to-real challenge into a tractable self-supervised task. The dataset and code will be made publicly available to foster robust, physically-grounded relighting research.

2605.11695 2026-05-13 cs.CV cs.AI

Emergent Communication between Heterogeneous Visual Agents through Decentralized Learning

Mikako Ochiai, Masatoshi Nagano, Tadahiro Taniguchi

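Affiliations * Graduate School of Informatics, Kyoto University

AI summary This paper studies communication that emerges between heterogeneous visual agents through decentralized learning, asking which visual information can be shared when agents have different visual representations. Agents exchange only discrete token sequences and update their own models from local perceptual evidence, without relying on a shared communicative objective. Experiments show that this form of communication yields visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval, and reveal how visual-encoder heterogeneity affects the content and symmetry of the resulting language.

Details
Abstract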

Symbols are shared, but perception is private. We study emergent communication between heterogeneous visual agents through decentralized learning, asking what visual information can become shareable when agents have different visual representations. Instead of optimizing messages through a shared external communicative objective, our agents exchange only discrete token sequences and update their own models using local perceptual evidence. This setting focuses on an underexplored aspect of emergent communication, examining whether common symbols can arise without shared perceptual access, and how the similarity between private visual spaces constrains the content and symmetry of the resulting language. We instantiate this setting in the Metropolis-Hastings Captioning Game (MHCG), where two agents collaboratively form shared captions by exchanging proposed token sequences that a listener accepts or rejects using an MH-style criterion evaluated against its own visual features. We compare three pairings of frozen visual encoders, with agents starting from randomly initialized text modules. Experiments on MS-COCO show that MHCG produces visually informative shared token sequences that outperform a no-communication baseline in cross-agent alignment, visual-feature prediction, and image-text retrieval; all cross-agent metrics decline as encoder mismatch increases. Moderate encoder heterogeneity reduces the number of shared sequences while preserving per-sequence visual specificity, whereas stronger encoder heterogeneity yields fewer, coarser, and more asymmetric sequences. Ablations show that listener-side MH acceptance is critical for avoiding degenerate token formation. These results suggest that shared symbols can arise from local perceptual evaluation alone, with visual representational similarity across encoders shaping both the content and symmetry of the resulting language.
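
To make the listener-side accept/reject step concrete, here is a minimal sketch of a generic Metropolis-Hastings acceptance test. It assumes the listener can score a caption with a log-probability under its own model and that the test compares the proposed caption with the listener's current one; the exact criterion used in MHCG is not stated in the abstract, and the function names are hypothetical.

```python
import math
import random


def acceptance_probability(logp_proposed: float, logp_current: float) -> float:
    """Generic MH ratio: min(1, p(proposed) / p(current)), computed in log space."""
    return math.exp(min(0.0, logp_proposed - logp_current))


def mh_accept(logp_proposed: float, logp_current: float, rng: random.Random) -> bool:
    """Accept the speaker's proposal with the MH acceptance probability.

    Both log-probabilities are assumed to come from the listener's own model,
    evaluated against the listener's own visual features.
    """
    return rng.random() < acceptance_probability(logp_proposed, logp_current)
```

With this rule a proposal the listener finds at least as likely as its current caption is always accepted, while a less likely one is accepted only occasionally, which is one way local evaluation can filter out uninformative token sequences.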

2605.11694 2026-05-13 cs.LG

Augmented Lagrangian Method for Last-Iterate Convergence for Constrained MDPs

Michael Lu, Max Qiushi Lin, Mo Chen, Sharan Vaswani

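Affiliations * Simon Fraser University

AI summary This paper studies policy optimization for infinite-horizon discounted constrained Markov decision processes (CMDPs), focusing on practical settings in which a single final policy must be deployed. Because existing theoretical guarantees usually apply to the mixture policy and are hard to use directly, the authors adopt the augmented Lagrangian (AL) method together with projected Q-ascent (PQA) to build a general framework with provable last-iterate convergence. The approach applies to tabular CMDPs, extends to log-linear and complex non-linear policies, and is evaluated on continuous control tasks.

Details
Abstract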

We study policy optimization for infinite-horizon, discounted constrained Markov decision processes (CMDPs). While existing theoretical guarantees typically hold for the mixture policy, deploying such a policy is computationally and memory intensive. This leads to a practical mismatch where a single (last-iterate) policy must be deployed. Recent theoretical works have thus focused on proving last-iterate convergence, but are largely limited to the tabular setting or to algorithmic variants that are rarely used in practice. To address this, we use the classic inexact augmented Lagrangian ($\texttt{AL}$) method from constrained optimization, and propose a general framework with provable last-iterate convergence for CMDPs. We first focus on the tabular setting and propose to solve the $\texttt{AL}$ sub-problem with projected Q-ascent ($\texttt{PQA}$). Combining the theoretical guarantees of $\texttt{PQA}$ and the standard $\texttt{AL}$ analysis enables us to establish global last-iterate convergence. We generalize these results to handle log-linear policies, and demonstrate that an efficient, projected variant of $\texttt{PQA}$ can achieve last-iterate convergence with guarantees comparable to prior work. Finally, we demonstrate that our framework scales to complex non-linear policies, and evaluate it on continuous control tasks.
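
The inexact augmented Lagrangian loop can be illustrated on a toy problem. The sketch below is not a CMDP and does not use PQA: it assumes a one-dimensional problem, minimize f(x) = (x - 2)^2 subject to g(x) = x - 1 <= 0, solves each sub-problem approximately by plain gradient descent, and uses the textbook multiplier update. All function names, the penalty `rho`, and the step sizes are illustrative choices.

```python
from typing import Tuple


def solve_inner(lam: float, rho: float, x0: float,
                steps: int = 200, lr: float = 0.1) -> float:
    """Approximately minimize the augmented Lagrangian in x by gradient descent.

    AL(x, lam) = f(x) + (max(0, lam + rho * g(x))**2 - lam**2) / (2 * rho)
    with f(x) = (x - 2)**2 and g(x) = x - 1.
    """
    x = x0
    for _ in range(steps):
        shifted = max(0.0, lam + rho * (x - 1.0))  # max{0, lam + rho * g(x)}
        grad = 2.0 * (x - 2.0) + shifted           # derivative of AL with respect to x
        x -= lr * grad
    return x


def augmented_lagrangian(rho: float = 1.0, outer_iters: int = 50) -> Tuple[float, float]:
    """Alternate an inexact inner solve with the multiplier update; return the last iterate."""
    x, lam = 0.0, 0.0
    for _ in range(outer_iters):
        x = solve_inner(lam, rho, x)               # warm-start from the previous x
        lam = max(0.0, lam + rho * (x - 1.0))      # standard AL dual update
    return x, lam


x_last, lam_last = augmented_lagrangian()
```

On this toy problem the constrained optimum is x = 1 with multiplier 2, and the last outer iterate itself approaches that pair rather than only an average of iterates, which is the property the abstract targets for policies.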

2605.11693 2026-05-13 cs.AI

Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity

Abid Ali, Diego Molla-Aliod, Usman Naseem

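Affiliations * School of Computing, Macquarie University

AI summary To address shortcomings of existing evaluation methods for multimodal summarization, this study proposes MM-Eval, a unified evaluation framework that jointly measures text quality, image-text alignment, and visual diversity. MM-Eval combines metrics covering factual consistency, semantic coherence, image relevance, and visual diversity to evaluate multimodal summaries more comprehensively and accurately. Experiments show that the framework outperforms conventional heuristic methods and offers an interpretable, reference-weak solution for comparative evaluation of multimodal summarization systems.

Comments Accepted to Findings of ACL 2026

Details
Abstract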

Multimodal Large Language Models (MLLMs) have facilitated Multimodal Summarization with Multimodal Output (MSMO), wherein systems generate concise textual summaries accompanied by salient visuals from multimodal sources. However, current MSMO evaluation remains fragmented: text quality, image-text alignment, and visual diversity are typically assessed in isolation using unimodal metrics, making it difficult to capture whether the modalities jointly support a faithful and useful summary. To address this gap, we introduce MM-Eval, a unified evaluation framework that integrates assessments of textual quality, cross-modal alignment, and visual diversity. MM-Eval comprises three components: (1) text quality, measured using OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, evaluated via an MLLM-as-a-judge approach; and (3) image-set diversity, quantified using Truncated CLIP Entropy. We calibrate MM-Eval through a learned aggregation model trained on the mLLM-EVAL news benchmark, aligning component contributions with human preferences. Our analysis reveals a text-dominant hierarchy in this setting, where factual consistency acts as a critical determinant of perceived overall quality, while visual relevance and diversity provide complementary signals. MM-Eval improves over heuristic aggregation baselines and provides an interpretable, reference-weak framework for comparative evaluation of multimodal summaries.

2605.11691 2026-05-13 cs.LG

Compositional Neural Operators for Multi-Dimensional Fluid Dynamics

Hamda Hmida, Hsiu-Wen Chang, Youssef Mesri

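Affiliations * Mines Paris - PSL University, Centre for Material Forming (CEMEF); Mines Paris - PSL University, Centre for Robotics (CAOR)

AI summary This paper proposes Compositional Neural Operators (CompNO), a framework for 2D fluid dynamics aimed at solving partial differential equations efficiently. It decomposes complex physical equations into pretrained foundation blocks, such as convection, diffusion, and Poisson solver blocks, and composes them through an adaptation block that learns nonlinear interactions. Experiments show greater flexibility and interpretability when adapting to new physical systems, with effective reuse of the pretrained blocks.

Comments Published as a conference paper at ICLR 2026

Details
Abstract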

Partial differential equations (PDEs) govern diverse physical phenomena, yet high-fidelity numerical solutions are computationally expensive and machine learning approaches lack generalization. While Scientific Foundation Models (SFMs) aim to provide universal surrogates, typical encoding-decoding approaches suffer from high pretraining costs and limited interpretability. In this paper, we propose Compositional Neural Operators (CompNO) for 2D systems, a framework that decomposes complex PDEs into a library of Foundation Blocks. Each block is a specialized Neural Operator pretrained on elementary physics. This modular library contains convection, diffusion, and nonlinear convection blocks as well as a Poisson Solver, enabling the framework to address the pressure-velocity coupling. These experts are assembled via an Adaptation Block featuring an Aggregator. This aggregator learns nonlinear interactions by minimizing data loss and physics-based residuals derived from the governing equations. The proposed approach has been evaluated on the Convection-Diffusion equation, the Burgers' equation, and the Incompressible Navier-Stokes equation. Our results demonstrate that learning from elementary operators significantly improves adaptability, enhances model interpretability, and facilitates the reuse of pretrained blocks when adapting to new physical systems.

2605.11689 2026-05-13 cs.LG cs.CL

Slicing and Dicing: Configuring Optimal Mixtures of Experts

Margaret Li, Sneha Kudugunta, Danielle Rothermel, Luke Zettlemoyer

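Affiliations * Paul G Allen School of Computer Science; University of Washington; New York University; Courant School of Data Science

AI summary This paper systematically studies core design choices for Mixture-of-Experts (MoE) architectures in large language models, including expert count, granularity, shared experts, and load balancing, and analyzes their effect on model performance across more than 2,000 pretraining runs. It finds that performance keeps improving as total MoE parameters grow, and that the optimal expert size depends mainly on the active parameter count rather than the total parameter count. Expert count and granularity are the most important factors for model quality, while other settings such as shared experts or load-balancing mechanisms have comparatively small effects.

Details
Abstract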

Mixture-of-Experts (MoE) architectures have become standard in large language models, yet many of their core design choices (expert count, granularity, shared experts, load balancing, token dropping) have only been studied one or two at a time over narrow configuration ranges. It remains an open question whether these choices can be optimized independently, without considering interactions. We present the first systematic study of over 2,000 pretraining runs spanning models up to 6.6B total parameters, in which we exhaustively vary total experts, expert dimension, heterogeneous expert sizing within a single layer, shared expert size, and load-balancing mechanisms. First, we find that at every active-parameter scale that we study, performance consistently improves with total MoE parameters even at extreme active expert parameter ratios like 128. Second, the optimal expert size is nearly invariant to total parameter count and depends only on active parameter count. Third, we see that other choices like shared experts, heterogeneous experts, and load-balancing settings have small effects relative to expert count and granularity, although dropless routing yields a consistent gain. Overall, our results suggest a simpler recipe: focus on expert count and granularity, since other choices have minimal effect on final quality.

2605.11688 2026-05-13 cs.LG cs.AI cs.MA

Shaping Zero-Shot Coordination via State Blocking

Mingu Kang, Sunwoo Lee, Yonghyeon Jo, Seungyul Han

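Affiliations * Graduate School of Artificial Intelligence; UNIST

AI summary This paper studies zero-shot coordination (ZSC), that is, enabling agents to cooperate with partners they have not previously interacted with, which is essential for real-world multi-agent systems and human-AI collaboration. To address the limited generalization of existing methods to unseen partners, the authors propose State-Blocked Coordination (SBC), a framework that generates diverse interaction scenarios in virtual environments so that agents encounter a range of suboptimal partner policies during training, improving zero-shot coordination. Experiments show that SBC achieves superior coordination performance on multiple benchmarks, including strong generalization to human partners.

Comments 9 technical pages followed by references and appendix

Details
Abstract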

Zero-shot coordination (ZSC) aims to enable agents to cooperate with independently trained partners without prior interaction, a key requirement for real-world multi-agent systems and human-AI collaboration. Existing approaches have largely emphasized increasing partner diversity during training, yet such strategies often fall short of achieving reliable generalization to unseen partners. We introduce State-Blocked Coordination (SBC), a simple yet effective framework that improves ZSC by inducing diverse interaction scenarios without direct environment modification. Specifically, SBC generates a family of virtual environments through state blocking, allowing agents to experience a wide range of suboptimal partner policies. Across multiple benchmarks, SBC demonstrates superior performance in zero-shot coordination, including strong generalization to human partners.

2605.11687 2026-05-13 cs.AI

Persistent and Conversational Multi-Method Explainability for Trustworthy Financial AI

Georgios Makridis, Georgios Fatouros, John Soldatos, George Katsis, Dimosthenis Kyriazis

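Affiliations * University of Piraeus, Greece; ExpertAI-Lux S.à r.l

AI summary Responding to the financial sector's need for trustworthy AI explanations, this study proposes an explainable AI architecture that is persistent, cross-validated across multiple methods, and conversational. Its core elements are storing the outputs of multiple XAI methods as persistent, retrievable objects, using retrieval-augmented generation to compare and synthesize explanations across methods, and adding automated checks that assess explanation reliability. The architecture is validated on a financial sentiment analysis task, where it markedly improves the accuracy and trustworthiness of explanations.

Comments 5 pages

Details
Abstract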

Financial institutions increasingly require AI explanations that are persistent, cross-validated across methods, and conversationally accessible to human decision-makers. We present an architecture for human-centered explainable AI in financial sentiment analysis that combines three contributions. First, we treat XAI artifacts (LIME feature attributions, occlusion-based word importance scores, and saliency heatmaps) as persistent, searchable objects in distributed S3-compatible storage with structured metadata and natural-language summaries, enabling semantic retrieval over explanation history and automatic index reconstruction after system failures. Second, we enable multi-method explanation triangulation, where a retrieval-augmented generation (RAG) assistant compares and synthesizes results from multiple XAI methods applied to the same prediction, allowing users to assess explanation robustness through natural-language dialogue. Third, we evaluate the faithfulness of generated explanations using automated checks over grounding completeness, hallucinated claims, and method-attribution behavior. We demonstrate the architecture on an EXTRA-BRAIN financial sentiment analysis pipeline using FinBERT predictions and present evaluation results showing that constrained prompting reduces hallucination rate by 36% and increases method-attribution citations by 73% compared to naive prompting. We discuss implications for trustworthy, human-centered AI services in regulated financial environments.

2605.11685 2026-05-13 cs.CL

Robust LLM Unlearning Against Relearning Attacks: The Minor Components in Representations Matter

Zeguan Xiao, Xuanzhe Xu, Yun Chen, Yong Wang, Jian Yang, Yanqing Hu, Guanhua Chen

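Affiliations * Shanghai University of Finance and Economics; Alibaba Group; Southern University of Science and Technology; Beihang University

AI summary This paper studies the robustness of large language model (LLM) unlearning against relearning attacks. It finds that existing unlearning methods mainly optimize along dominant components while leaving minor components largely unmodified, so attackers can quickly recover forgotten knowledge by adjusting the dominant components. Based on a spectral analysis of representations, the authors propose Minor Component Unlearning (MCU), which performs unlearning along these more robust directions, substantially improving resistance to relearning attacks, and validate it on multiple datasets.

Details
Abstract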

Large language model (LLM) unlearning aims to remove specific data influences from a pre-trained model without costly retraining, addressing privacy, copyright, and safety concerns. However, recent studies reveal a critical vulnerability: unlearned models rapidly recover "forgotten" knowledge through relearning attacks. This fragility raises serious security concerns, especially for open-weight models. In this work, we investigate the fundamental mechanism underlying this fragility from a representation geometry perspective. We discover that existing unlearning methods predominantly optimize along dominant components, leaving minor components largely unchanged. Critically, during relearning attacks, the modifications in these dominant components are easily reversed, enabling rapid knowledge recovery, whereas minor components exhibit stronger resistance to such reversal. We further provide a theoretical analysis that explains both observations from the spectral structure of representations. Building on this insight, we propose Minor Component Unlearning (MCU), a novel unlearning approach that explicitly targets minor components in representations. By concentrating unlearning effects in these inherently robust directions, our method achieves substantially improved resistance to relearning attacks. Extensive experiments on three datasets validate our approach, demonstrating significant improvements over state-of-the-art methods, including sharpness-aware minimization.