arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.13663 2026-05-14 cs.CL cs.CY

Fine-tuning with Hierarchical Prompting for Robust Propaganda Classification Across Annotation Schemas

Lukas Stähelin, Veronika Solopova, Max Upravitelev, David Kaplan, Ariana Sahitaj, Premtim Sahitaj, Charlott Jakob, Sebastian Möller, Vera Schmitt

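Affiliations: Technische Universität Berlin, QU Lab, XplaiNLP Group; German Research Center for Artificial Intelligence (DFKI); Centre for European Research in Trusted AI (CERTAIN)

AI summary: This paper studies how to make propaganda classification in social media more robust across annotation schemas. It proposes an intent-based taxonomy of propaganda techniques and compares it with an established annotation standard. Experiments with four large language models show that fine-tuning is essential for classification performance, and that the proposed hierarchical prompting method (HiPP) performs well after fine-tuning, especially on the schema with higher annotation disagreement. The work also presents the HQP dataset, annotated under the new schema, as a more challenging benchmark for future research.

Abstract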

Propaganda detection in social media is challenging due to noisy, short texts and low annotation agreement. We introduce a new intent-focused taxonomy of propaganda techniques and compare it against an established, higher-agreement schema. Along three dimensions (model portfolio, schema effects, and prompting strategy) we evaluate the taxonomies as a classification task with the help of four language models (GPT-4.1-nano, Phi-4 14B, Qwen2.5-14B, Qwen3-14B). Our results show that fine-tuning is essential, since it transforms weak zero-shot baselines into competitive systems and reveals methodological differences that are hidden when using base models. Across schemas, the Qwen models achieve the strongest overall performance, and Phi-4 14B consistently outperforms GPT-4.1-nano. Our hierarchical prompting method (HiPP), which predicts fine-grained techniques before aggregating them, is especially beneficial after fine-tuning and on the more ambiguous, low-agreement taxonomy, while remaining competitive on the simpler schema. The HQP dataset, annotated with the new intent-based labels, provides a richer lens on propaganda's strategic goals and a challenging benchmark for future work on robust, real-world detection.
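The "predict fine-grained techniques, then aggregate" step of hierarchical prompting can be pictured with a small sketch. The label names and the `FINE_TO_COARSE` mapping below are hypothetical placeholders, not the paper's taxonomy, and the fine-grained predictions are assumed to come from a model call that is not shown.

```python
# Hypothetical fine-to-coarse mapping; the real taxonomy is defined in the paper.
FINE_TO_COARSE = {
    "loaded_language": "emotional_appeal",
    "appeal_to_fear": "emotional_appeal",
    "name_calling": "discrediting",
    "doubt": "discrediting",
}


def aggregate(fine_labels):
    """Map predicted fine-grained labels to a sorted list of unique coarse labels.

    Labels missing from the mapping are ignored rather than guessed.
    """
    coarse = {FINE_TO_COARSE[label] for label in fine_labels if label in FINE_TO_COARSE}
    return sorted(coarse)


# Fine-grained predictions as they might come back from a (not shown) model call.
predicted = ["doubt", "appeal_to_fear", "unknown_label"]
coarse_labels = aggregate(predicted)
```

The design point is only that the coarse decision is derived from the fine-grained one instead of being predicted directly; how HiPP prompts for and aggregates labels in practice is described in the paper itself.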

2605.13651 2026-05-14 cs.SD cs.AI

NAACA: Training-Free NeuroAuditory Attentive Cognitive Architecture with Oscillatory Working Memory for Salience-Driven Attention Gating

Zhongju Yuan, Geraint Wiggins, Dick Botteldooren

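Affiliations: WAVES Research Group, Ghent University, Gent, Belgium; AI Lab, Vrije Universiteit Brussel, Brussel, Belgium; EECS, Queen Mary University of London, London, UK

AI summary: This paper proposes NAACA, a training-free neuroauditory attentive cognitive architecture that addresses the attention bottleneck in detecting salient events in long-form audio. Its core is a neuro-inspired Oscillatory Working Memory (OWM) that triggers higher-level language model processing when perceptual salience is detected, improving event detection precision and reducing unnecessary computation. Experiments show that NAACA substantially improves detection performance on the XD-Violence dataset and is robust to noise and transient pauses on an urban soundscape dataset.

Comments: Accepted as a regular paper by ICML 2026

Abstract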

Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-cognition ALM processing only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.

2605.13647 2026-05-14 cs.CL

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

Junyan Li, Zhang-Wei Hong, Maohao Shen, Yang Zhang, Chuang Gan

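Affiliations: UMass Amherst; MIT-IBM Watson AI Lab

AI summary: FlowCompile is an optimizing compiler for structured large language model (LLM) workflows. It targets the problem of balancing accuracy and latency when multiple sub-agents execute together in a predefined graph. Borrowing ideas from machine learning compilers, it explores the workflow design space globally before deployment and produces a reusable set of workflow configurations covering different accuracy-latency trade-offs. Experiments show that FlowCompile outperforms heuristic optimization and routing-based methods across a range of workflows and benchmarks, with speedups of up to 6.4x.

Abstract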

Structured LLM workflows, where specialized LLM sub-agents execute according to a predefined graph, have become a powerful abstraction for solving complex tasks. Optimizing such workflows, i.e., selecting configurations for each sub-agent to balance accuracy and latency, is challenging due to the combinatorial design space over model choices, reasoning budgets, and workflow structures. Existing cost-aware methods largely treat workflow optimization as a routing problem, selecting a configuration at inference time for each query according to the accuracy-latency objective used during training. We argue that structured LLM workflows can also be optimized from a compilation perspective: before deployment, the system can globally explore the workflow design space and construct a reusable set of workflow-level configurations spanning diverse accuracy-latency trade-offs. Drawing inspiration from machine learning compilers, we introduce FlowCompile, a structured LLM workflow compiler that performs compile-time design space exploration to identify a high-quality, reusable trade-off set. FlowCompile decomposes a workflow into sub-agents, profiles each sub-agent under diverse configurations, and composes these measurements through a structure-aware proxy to estimate workflow-level accuracy and latency. It then identifies diverse high-quality configurations in a single compile-time pass, without retraining or online adaptation. Experiments across diverse workflows and challenging benchmarks show that FlowCompile consistently outperforms heuristically optimized workflow configurations and routing-based baselines, delivering up to 6.4x speedup. The compiled configuration set further serves as a reusable optimization artifact, enabling flexible deployment under varying runtime preferences and supporting downstream selection or routing.
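One way to picture a "reusable trade-off set" is as the Pareto front over estimated (accuracy, latency) pairs. The sketch below is only an illustration of that idea: `Config`, `pareto_front`, and the numbers are hypothetical, and FlowCompile's actual proxy model and selection procedure are not reproduced here.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    accuracy: float  # estimated workflow accuracy, higher is better
    latency: float   # estimated workflow latency in seconds, lower is better


def pareto_front(configs):
    """Keep configurations that no other configuration dominates.

    A configuration is dominated if another one is at least as accurate and at
    least as fast, and strictly better on one of the two axes.
    """
    front = []
    for c in configs:
        dominated = any(
            o.accuracy >= c.accuracy
            and o.latency <= c.latency
            and (o.accuracy > c.accuracy or o.latency < c.latency)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency)


# Hypothetical profiled estimates for four workflow-level configurations.
candidates = [
    Config("large-model", 0.90, 10.0),
    Config("small-model", 0.80, 4.0),
    Config("small-long-reasoning", 0.70, 6.0),
    Config("large-extra-step", 0.90, 12.0),
]
trade_off_set = pareto_front(candidates)
```

At deployment time, a runtime preference such as a latency budget could then be served by picking the most accurate entry of `trade_off_set` that fits the budget, without re-exploring the design space.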

2605.13641 2026-05-14 cs.LG cs.CL

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

Yang Bai, Kaiyuan Liu, Ziyuan Zhuang, Jiahong Zhou, Rongxiang Weng, Xin Chen, Jingang Wang, Xunliang Cai

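Affiliations: Meituan, China

AI summary: This paper studies policy optimization in complex reinforcement learning environments with multi-task and mixed-reward formulations. To address heterogeneous reward distributions and correlated reward dimensions, it proposes a reward-processing method called RDPO. RDPO uses Magnitude-Aware Quantile normalization to stabilize advantage allocation and Mahalanobis whitening to reduce correlation redundancy, improving the stability and effectiveness of policy training. In the post-training of LongCat-Flash, the method improves instruction following, writing quality, and robustness to hard prompts while remaining competitive on reasoning and coding tasks.

Abstract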

Complex reinforcement learning environments frequently employ multi-task and mixed-reward formulations. In these settings, heterogeneous reward distributions and correlated reward dimensions often destabilize the construction of scalar advantages. To address these challenges, we propose Reward-Decorrelated Policy Optimization (RDPO), a reward-processing method designed to explicitly target both failure modes. RDPO first utilizes Magnitude-Aware Quantile normalization to stabilize prompt-level advantage allocation across binary, fractional, and continuous rewards. It then applies Mahalanobis whitening within each active reward subspace to mitigate correlation redundancy prior to aggregation. When applied during the post-training of LongCat-Flash, RDPO enhances instruction following, writing quality, and robustness to hard prompts while remaining broadly competitive on reasoning and coding evaluations.
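For reference, textbook Mahalanobis whitening, which the abstract names but does not spell out, can be written as follows. The symbols $r$, $\mu$, $\Sigma$, and $\tilde{r}$ are notation assumed here rather than taken from the paper, and RDPO's exact estimator and aggregation rule may differ.

$$\tilde{r} = \Sigma^{-1/2}\,(r - \mu), \qquad \operatorname{Cov}(\tilde{r}) = \Sigma^{-1/2}\,\Sigma\,\Sigma^{-1/2} = I, \qquad \lVert \tilde{r} \rVert^{2} = (r - \mu)^{\top} \Sigma^{-1} (r - \mu).$$

Here $r$ is the vector of rewards in one active subspace, with mean $\mu$ and (assumed invertible) covariance $\Sigma$. After whitening, the components of $\tilde{r}$ are uncorrelated with unit variance, so two strongly correlated reward dimensions no longer count twice when the components are aggregated.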

2605.13639 2026-05-14 cs.LG math.OC stat.ML

Achieving $ε^{-2}$ Sample Complexity for Single-Loop Actor-Critic under Minimal Assumptions

Ishaq Hamza, Zaiwei Chen

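Affiliations: IISc; Purdue IE

AI summary: This paper studies the sample complexity of off-policy actor-critic methods in reinforcement learning under a single-loop implementation. Assuming only the existence of a policy that induces an irreducible Markov chain, it proves the first $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity guarantee for finding an $ε$-optimal policy in the single-loop, single-timescale setting. In contrast to prior work that requires nested loops or strong algorithm-dependent assumptions, the analysis builds a coupled Lyapunov drift framework to handle the challenges of single-loop updates and off-policy learning, establishing a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ rate for the critic and combining the two through a cross-domination property. The authors regard the framework as being of independent interest.

Abstract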

In this paper, we establish last-iterate convergence rates for off-policy actor--critic methods in reinforcement learning. In particular, under a single-loop, single-timescale implementation and a broad class of policy updates, including approximate policy iteration and natural policy gradient methods, we prove the first $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity guarantee for finding an $ε$-optimal policy under minimal assumptions, namely, the existence of a policy that induces an irreducible Markov chain. This stands in stark contrast to the existing literature, where an $\tilde{\mathcal{O}}(ε^{-2})$ sample complexity is achieved only through nested-loop updates and/or under strong, algorithm-dependent assumptions on the policies, such as uniform mixing and uniform exploration. Technically, to address the challenges posed by the coupled update equations arising from the single-loop implementation, as well as the potentially unbounded iterates induced by off-policy learning, our analysis is based on a coupled Lyapunov drift framework. Specifically, we establish a geometric convergence rate for the actor and an $\tilde{\mathcal{O}}(1/T)$ convergence rate for the critic, and combine the two Lyapunov drift inequalities through a cross-domination property. We believe this analytical framework is of independent interest and may be applicable to other coupled iterative algorithms with unbounded

2605.13632 2026-05-14 cs.RO cs.CV

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

Yiran Ling, Qing Lian, Jinghang Li, Qing Jiang, Tianming Zhang, Xiaoke Jiang, Chuanxiu Liu, Jie Liu, Lei Zhang

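Affiliations: Futian Laboratory; Faculty of Computing, Harbin Institute of Technology; International Digital Economy Academy (IDEA); School of Robotics, Hunan University; South China University of Technology; Visincept; National Key Laboratory of Smart Farm Technologies and Systems

AI summary: This paper proposes GTA-VLA, an interactive vision-language-action framework that enables spatially steerable embodied reasoning by letting users guide the robot policy with explicit visual cues. The framework adds an optional, user-provided spatial prior and combines it with internal task planning to produce a unified spatial-visual reasoning chain, improving task success in complex or unseen environments. Experiments show strong results on standard benchmarks and greater robustness and recovery ability under visual shifts and spatial ambiguity.

Abstract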

In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here: https://signalispupupu.github.io/GTA-VLA_ProjPage/

2605.13625 2026-05-14 cs.AI

How to Interpret Agent Behavior

Jie Gao, Kaiser Sun, Jen-tse Huang, Katherine Van Koevering, Sijie Ji, Heyuan Huang, Weiyan Shi, Zhuoran Lu, Ziang Xiao, Daniel Khashabi, Mark Dredze

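Affiliations: Johns Hopkins University; California Institute of Technology; Northeastern University; Purdue University

AI summary: This paper studies how to interpret the runtime behavior of autonomous agents such as Claude Code and Codex. It proposes ACT*ONOMY, a behavioral taxonomy for describing and analyzing agent trajectories. Developed through Grounded Theory, the taxonomy is a three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories, accompanied by an open-source analysis platform that supports ongoing updates and extension. Experiments show that ACT*ONOMY can compare behavioral profiles across agents and surface anomalous runtime patterns, giving researchers and users a consistent analytical vocabulary that supports better understanding and oversight of agent behavior.

Comments: 34 pages in total

Abstract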

Autonomous agents such as Claude Code and Codex now operate for hours or even days. Understanding their runtime behavior has become critical for downstream tasks such as diagnosing inefficiencies, fixing bugs, and ensuring better oversight. A primary way to gain this understanding is analyzing the reasoning trajectories and execution traces these agents generate. Yet such data remains in unstructured natural-language form, making it difficult for humans to interpret at scale. We introduce ACT*ONOMY (a combination of Action and Taxonomy), a taxonomy for describing and analyzing agent behavior at runtime. ACT*ONOMY has two components: (1) the taxonomy itself, developed through Grounded Theory and structured as a three-level hierarchy of 10 actions, 46 subactions, and 120 leaf categories; and (2) an open repository that hosts the living taxonomy, provides an automated analysis pipeline that applies it to agent trajectory analysis, and defines an extension protocol for customization and growth. Our experiments show that ACT*ONOMY can compare behavioral profiles across agents and characterize a single agent's behavior across diverse trajectories, surfacing patterns indicative of failure modes. By providing a shared vocabulary, ACT*ONOMY helps researchers, agent designers, and end users interpret agent behavior more consistently, enabling better oversight and control.

2605.13624 2026-05-14 cs.CL

Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction

Takumi Goto, Yusuke Sakai, Taro Watanabe

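Affiliations: Nara Institute of Science and Technology

AI summary: This paper studies the over-correction problem that is common in grammatical error correction with large language models. It proposes a training-free inference method in which a single model generates multiple candidate corrections and an edit-level majority vote is taken over them, which mitigates over-correction. Across nine benchmarks covering several languages, the method outperforms greedy decoding and MBR decoding in most cases, and correction quality remains stable across different instruction prompts.

Comments: BEA Workshop 2026

Abstract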

Grammatical error correction using large language models often suffers from the over-correction issue. To mitigate this, we propose a training-free inference method that performs edit-level majority voting over multiple candidates generated by a single model, without requiring model modifications or additional training. Across nine benchmarks covering English, Czech, German, Ukrainian, Korean, Hindi, and Romanian, the proposed method outperforms both greedy and MBR decoding in most cases. Moreover, it yields stable correction quality regardless of the instruction prompts used. We release two repositories supporting GEC dataset loading and LLM inference.
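A minimal sketch of edit-level majority voting is given below, under stated assumptions: edits are extracted with Python's `difflib.SequenceMatcher` over whitespace tokens as a stand-in for whatever edit extractor the paper uses, and an edit is kept only if more than half of the candidates contain it. The function names and the example sentences are illustrative.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_edits(src, hyp):
    """Token-level edits (start, end, replacement) that turn src into hyp."""
    matcher = SequenceMatcher(a=src, b=hyp, autojunk=False)
    return {
        (i1, i2, tuple(hyp[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    }


def majority_vote(source, candidates):
    """Apply only the edits proposed by more than half of the candidates."""
    src = source.split()
    counts = Counter()
    for cand in candidates:
        counts.update(extract_edits(src, cand.split()))
    kept = sorted(edit for edit, n in counts.items() if n > len(candidates) / 2)
    out = list(src)
    # Apply from right to left so earlier token indices stay valid.
    for start, end, replacement in reversed(kept):
        out[start:end] = replacement
    return " ".join(out)


source = "He go to school yesterday"
candidates = [
    "He went to school yesterday",
    "He went to the school yesterday",
    "He went to school yesterday .",
]
corrected = majority_vote(source, candidates)
```

In this toy case all three candidates agree on replacing "go" with "went", while the inserted "the" and the added final period each appear in only one candidate and are dropped, which is the over-correction-damping effect the abstract describes.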

2605.13623 2026-05-14 cs.LG

Multimodal Graph-based Classification of Esophageal Motility Disorders

Alexander Geiger, Lars Wagner, Daniel Rueckert, Alois Knoll, Dirk Wilhelm, Alissa Jell

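Affiliations: Technical University of Munich, School of Medicine and Health, TUM University Hospital Rechts der Isar; Technical University of Munich, School of Computation, Information and Technology

AI summary: This paper studies a multimodal, graph neural network-based approach to classifying esophageal motility disorders, motivated by the complexity of high-resolution impedance manometry (HRIM) data and the variability of its clinical interpretation. The method combines HRIM recordings with patient-specific information and models esophageal physiology as a graph; a graph neural network learns physiologically meaningful representations, which are fused with patient features for multi-class classification. Experiments indicate that the multimodal approach outperforms classifiers that rely only on HRIM features as well as vision-based classifiers, supporting the value of combining graph modeling with patient information.

Abstract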

Diagnosing esophageal motility disorders poses significant challenges due to the complexity of high-resolution impedance manometry (HRIM) data and variability in clinical interpretation. This work explores the feasibility of a multimodal Machine Learning (ML)-based classification approach that combines HRIM recordings with patient-specific information and incorporates a graph-based modeling of esophageal physiology. We analyze HRIM recordings with corresponding patient information from 104 patients with esophageal motility disorders. Patient data includes demographic, clinical, and symptom information extracted from structured questionnaires and free-text notes using keyword detection and large language model-based processing. HRIM data is represented as spatio-temporal graphs, where nodes correspond to pressure values along the esophagus and edges encode spatial adjacency and impedance dynamics. A graph neural network (GNN) is applied to learn physiologically meaningful representations, which are fused with patient embeddings for multi-category, multi-class classification of swallow events. The impact of patient features and graph-based modeling is evaluated by ablation studies and comparison to vision-based classifier baselines. The proposed multimodal approach indicates improvements over models that rely solely on HRIM-derived features across all classification categories. Additionally, the graph-based modeling provides gains compared to vision-based baselines. Our experiments systematically assess the complementary contribution of multiple modalities, as well as demonstrate the feasibility of our proposed graph-based approach. Our initial findings demonstrate that integrating patient-level data with graph-based representations of HRIM signals appears to be a promising direction for more accurate classification of esophageal motility disorders.

2605.13621 2026-05-14 cs.CV

WD-FQDet: Multispectral Detection Transformer via Wavelet Decomposition and Frequency-aware Query Learning

Chunjin Yang, Xiwei Zhang, Yiming Xiao, Fanman Meng

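Affiliations: University of Electronic Science and Technology of China

AI summary: WD-FQDet is a multispectral detection Transformer framework based on wavelet decomposition and frequency-aware query learning. It addresses the bias of modality-shared features and the insufficiency of modality-specific features in infrared-visible fusion detection. A low-frequency alignment module and a high-frequency retention module respectively strengthen cross-modal feature consistency and the expression of modality-specific features, and a frequency-aware query selection mechanism dynamically regulates the contribution of each kind of feature. Experiments show that WD-FQDet achieves leading detection performance on multiple datasets.

Abstract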

Infrared-visible object detection improves detection performance by combining complementary features from multispectral images. Existing backbone-specific and backbone-shared approaches still suffer from the problems of severe bias of modality-shared features and the insufficiency of modality-specific features. To address these issues, we propose a novel detection framework WD-FQDet that explicitly decouples modality-shared and modality-specific information from infrared and visible modalities in the new view of low- and high-frequency domains, allowing fusion strategies tailored to their frequency characteristics. Specifically, a low-frequency homogeneity alignment module is proposed to align modality-shared features across modalities via a cross-modal attention mechanism, and a high-frequency specificity retention module is proposed to preserve modality-specific features through the multi-scale gradient consistency loss. To reinforce the feature representation in the frequency domain, we propose a hybrid feature enhancement module that incorporates spatial cues. Furthermore, considering that the contributions of homogeneous and modality-specific features to object detection vary across scenarios, we propose a frequency-aware query selection module to dynamically regulate their contributions. Experimental results on the FLIR, LLVIP, and M3FD datasets demonstrate that WD-FQDet achieves state-of-the-art performance across multiple evaluation metrics.

2605.13613 2026-05-14 cs.RO

Design of Magnetic Continuum Robots with Tunable Force Response Using Rotational Ring Pairs

Alex Sayres, Giovanni Pittiglio

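Affiliations: FuTURE Lab, Department of Robotics Engineering, Worcester Polytechnic Institute (WPI)

AI summary: This paper presents a new continuum robot design whose magnetic response at the tip can be tuned online, allowing the effective magnetic direction and intensity to be adjusted dynamically and adding steering degrees of freedom without controlling the external field. The design removes the reliance of conventional robots on fixed internal magnetic content and works in both controllable and fixed magnetic fields, which may widen its use in medical and other applications. Experiments show a maximum tip deflection of 23% of the robot's length, and a mechanical model based on modified beam theory tracks the tip position with high accuracy.

Comments: 7 pages, 6 figures, Accepted to ISMR 2026

Abstract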

In this paper, we discuss a novel continuum robot design that enables the online tuning of the magnetic response at its tip. The proposed method allows for the change of both effective magnetic direction and intensity, introducing steering DOF without the need to control the external fields. This is unattainable with classical designs, which rely on fixed internal magnetic content and steer solely under the effect of a controllable magnetic field. The proposed robot design can be used in both controllable and fixed magnetic fields, potentially widening the clinical applicability of these robots. We experimentally show a max tip deflection of 33.8 mm from the resting state (23 % of the length of the robot). We discuss a model based on modified beam theory that captures the mechanical behavior of the continuum robot, with a mean absolute tip tracking error of 1.86 mm (1.2 % of the length) and maximum errors of less than 4.8 mm (3.2 % of the length) for all experimental points.

2605.13612 2026-05-14 cs.LG cond-mat.dis-nn stat.ML

Deep Learning as Neural Low-Degree Filtering: A Spectral Theory of Hierarchical Feature Learning

Yatin Dandi, Matteo Vilucchio, Luca Arnaboldi, Hugo Tabanelli, Florent Krzakala

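Affiliations: Information Learning and Physics Laboratory, École Polytechnique Fédérale de Lausanne (EPFL); Statistical Physics of Computation Laboratory, École Polytechnique Fédérale de Lausanne (EPFL)

AI summary: This paper proposes Neural Low-Degree Filtering (Neural LoFi), a theoretical framework for explaining how deep neural networks extract useful representations from data through hierarchical feature learning. It reduces gradient-based training to an explicit iterative spectral procedure in which each layer builds features by selecting the directions with maximal low-degree correlation to the label. The theory gives a mathematical account of how features evolve in deep learning and is supported by experiments on fully connected and convolutional networks, which show improvements over lazy random-feature baselines and the recovery of structured filters.

Comments: 62 pages, many figures, companion code at https://github.com/IdePHICS/Neural-LoFi-Theory

Abstract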

Understanding how deep neural networks learn useful internal representations from data remains a central open problem in the theory of deep learning. We introduce Neural Low-Degree Filtering (Neural LoFi), a stylized limit of gradient-based training in which hierarchical feature learning becomes an explicit iterative spectral procedure. In this limit, the dynamics at each layer decouple: given the current representation, the next layer selects directions with maximal accessible low-degree correlation to the label. This yields a tractable surrogate mechanism for deep learning, together with a natural kernel-space interpretation. Neural LoFi provides a mathematically explicit framework for studying multi-layer feature learning beyond the lazy regime. It predicts how representations are selected layer by layer, explains how emergence of concepts arises with given sample complexity, and gives a concrete mechanism by which depth progressively constructs new features from old ones through low-degree compositionality. We complement the theory with mechanistic experiments on fully connected and convolutional architectures, showing that Neural LoFi improves over lazy random-feature baselines, recovers meaningful structured filters, and predicts representations aligned with early gradient-descent feature discovery with real datasets.

2605.13604 2026-05-14 cs.CV

Rethinking Graph Convolution for 2D-to-3D Hand Pose Lifting

Chanyoung Kim, Donghyun Kim, Dong-Hyun Sim, Seong Jae Hwang, Youngjoong Kwon

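Affiliations: Emory University; Yonsei University; WHATs Lab

AI summary: This paper revisits the use of graph convolutional networks for 2D-to-3D hand pose lifting and asks whether the hand skeleton should be encoded as a fixed adjacency graph. Parameter-matched ablations on the FPHA dataset show that multi-head self-attention clearly outperforms conventional graph convolution, and that graph-distance positional encoding, used as a soft structural prior, is more effective than a hard adjacency constraint. The results indicate that adaptive spatial attention improves hand pose estimation accuracy more than fixed graph convolution does.

Abstract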

Graph convolutional networks (GCNs) are widely used for 3D hand pose estimation, where the hand skeleton is encoded as a fixed adjacency graph. We revisit whether this is the most effective way to incorporate hand topology in 2D-to-3D lifting. In this paper, we perform controlled, parameter-matched ablations on the FPHA benchmark and show that standard multi-head self-attention consistently outperforms GCN baselines. Even when the GCN is strengthened with multi-hop adjacency and matched parameter count, self-attention reduces MPJPE from 12.36 mm to 10.09 mm. A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of improvement, while fully connected attention yields additional gains. We further show that hand topology is most effective when introduced as a soft structural prior through graph-distance positional encoding, rather than as a hard adjacency constraint. These results suggest that, for hand pose lifting, adaptive spatial attention is a more effective inductive bias than fixed graph convolution.
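The graph-distance quantity behind such a positional encoding can be sketched as all-pairs hop distances on the hand skeleton. The 21-joint layout below (wrist as joint 0, four joints per finger chained from the wrist) is a common convention assumed here for illustration; the paper's joint ordering and the way distances enter the attention layer may differ.

```python
from collections import deque


def hand_edges():
    """Edges of an assumed 21-joint hand: wrist 0, fingers as 4-joint chains."""
    edges = []
    for finger in range(5):
        base = 1 + 4 * finger
        edges.append((0, base))  # wrist to the first joint of the finger
        edges += [(base + k, base + k + 1) for k in range(3)]
    return edges


def graph_distances(num_nodes, edges):
    """All-pairs hop distances via breadth-first search from every node."""
    adjacency = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    dist = [[-1] * num_nodes for _ in range(num_nodes)]
    for source in range(num_nodes):
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if dist[source][v] < 0:
                    dist[source][v] = dist[source][u] + 1
                    queue.append(v)
    return dist


D = graph_distances(21, hand_edges())
```

One plausible use, stated as an assumption rather than the paper's design, is to index a small learned bias table by `D[i][j]` and add it to the attention logit between joints `i` and `j`, so that topology acts as a soft prior while every joint can still attend to every other joint.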

2605.13601 2026-05-14 cs.AI cs.MA

Unweighted ranking for value-based decision making with uncertainty

Aarón López García, Natalia Criado, Jose Such

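Affiliations: Valencian Research Institute for Artificial Intelligence; Universitat Politècnica de València

AI summary: As intelligent systems are increasingly used in society for autonomous decision making, their adherence to human values has drawn wide attention. This paper proposes a fuzzy-logic-based, unweighted value-based decision-making framework (FUW-VBDM) that incorporates qualitative and quantitative criteria to make decisions more human-centred and removes the bias introduced when stakeholders assign weights subjectively. To solve it, the authors design Rankzzy, a method that uses fuzzy reasoning to quantify uncertainty, and they report advantages in computational cost and ranking performance on large-scale problems.

Comments: 21 pages

Abstract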

As intelligent systems are increasingly implemented in our society to make autonomous decisions, their commitment to human values raises serious concerns. Their alignment with human values remains a critical challenge because it can jeopardise the integrity and security of citizens. For this reason, an innovative human-centred and values-driven approach to decision making is required. In this work, we introduce the Fuzzy-Unweighted Value-Based Decision Making (FUW-VBDM) framework, where agents incorporate both quantitative and qualitative criteria to generate human-centred decisions. We also address the normative bias introduced by stakeholders with arbitrary weights by removing prior weights and introducing a fuzzy domain of decision variables defined for a score function. This concept allows us to generalise any VBDM problem as the search for feasible solutions when optimising the score in the weight domain. To provide a solution to FUW-VBDM, we present Rankzzy, a customizable unweighted ranking method that integrates fuzzy-based reasoning to quantify uncertainty. We mathematically prove the consistency of Rankzzy for any admissible configuration selected by stakeholders. We show the applicability of our method through an illustrative case study, which we also use as a running example. The evaluation conducted indicates a reduced computational cost in large-scale value-based decision-making problems and a strong ranking performance relative to existing approaches when employing the aggregation via Pythagorean means.
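For readers unfamiliar with the term, the Pythagorean means are the arithmetic, geometric, and harmonic means. The sketch below only illustrates how the choice of mean can change a ranking; it is not Rankzzy, and the alternatives and their per-value scores are hypothetical.

```python
from statistics import fmean, geometric_mean, harmonic_mean


def pythagorean_means(scores):
    """Return the (arithmetic, geometric, harmonic) means of positive scores."""
    return fmean(scores), geometric_mean(scores), harmonic_mean(scores)


# Hypothetical per-value satisfaction scores in (0, 1] for two alternatives.
alternatives = {
    "plan_a": [0.9, 0.9, 0.1],  # strong on two values, very weak on one
    "plan_b": [0.6, 0.6, 0.6],  # moderate on every value
}

rank_by_arithmetic = sorted(
    alternatives, key=lambda name: fmean(alternatives[name]), reverse=True
)
rank_by_harmonic = sorted(
    alternatives, key=lambda name: harmonic_mean(alternatives[name]), reverse=True
)
```

With these numbers the arithmetic mean ranks `plan_a` first, whereas the harmonic mean, which penalises a single very low score more heavily, ranks `plan_b` first. That sensitivity is why the choice of aggregation matters in an unweighted ranking.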

2605.13600 2026-05-14 cs.CV

Sparse Code Uplifting for Efficient 3D Language Gaussian Splatting

Lovre Antonio Budimir, Yushi Guan, Steve Ryhner, Sven Lončarić, Nandita Vijaykumar

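Affiliations: Faculty of Electrical Engineering and Computing; Department of Computer Science; Vector Institute

AI summary: This paper proposes SCOUP, an efficient method for 3D language Gaussian splatting that addresses how to associate high-dimensional vision-language embeddings with large numbers of 3D Gaussians for open-vocabulary 3D scene understanding. It decouples language representation learning from 3D Gaussian optimization: sparse code representations are learned from features of 2D image regions and then uplifted to 3D Gaussians through weighted sparse aggregation, giving compact storage and fast rendering. Experiments show substantial gains in training speed and memory efficiency, with open-vocabulary querying accuracy that matches or exceeds existing methods on several benchmarks.

Comments: 18 pages (9 pages main paper), 10 figures, preprint

Abstract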

3D Language Gaussian Splatting (3DLGS) augments 3D Gaussian Splatting with language-aligned visual features for open-vocabulary 3D scene understanding. A core challenge is efficiently associating high-dimensional vision-language embeddings with millions of 3D Gaussians while preserving efficient feature rendering for text-based querying. Existing methods either store dense features directly on Gaussians, causing high storage costs and slow rendering, or learn compact representations through expensive per-scene optimization with repeated feature rasterization. No existing method simultaneously achieves fast 3D semantic reconstruction, efficient storage, and fast rendering. We propose SCOUP (Sparse COde UPlifting), which addresses all three by decoupling language representation learning from 3D Gaussian optimization. Rather than working directly in 3D, we learn sparse codebook-based representations entirely using features associated with 2D image regions, associating each region with a sparse set of codebook coefficients. We then uplift these coefficients to 3D Gaussians with our weighted sparse aggregation using Gaussian-to-pixel associations, where each Gaussian accumulates coefficients over codebook atoms across views. Top-$K$ filtering then extracts the most dominant multi-view coefficients per Gaussian, enabling efficient storage and fast rendering. Our method achieves up to $400\times$ training speedup while being $3\times$ more memory efficient during training compared to the state-of-the-art in rendering speed. Across multiple benchmarks, SCOUP matches or outperforms existing methods in open-vocabulary querying accuracy.
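The accumulate-then-filter step can be sketched in a few lines, under stated assumptions: each observation is a hypothetical `(gaussian_id, weight, {atom_id: coefficient})` triple standing in for one Gaussian-to-pixel association in one view, and the weighting scheme is a placeholder for whatever SCOUP actually uses.

```python
import heapq
from collections import defaultdict


def uplift_topk(observations, k):
    """Accumulate weighted codebook coefficients per Gaussian, then keep the top k.

    observations: iterable of (gaussian_id, weight, {atom_id: coefficient}).
    Returns {gaussian_id: [(atom_id, accumulated_value), ...]}, largest first.
    """
    accumulated = defaultdict(lambda: defaultdict(float))
    for gaussian_id, weight, coefficients in observations:
        for atom_id, value in coefficients.items():
            accumulated[gaussian_id][atom_id] += weight * value
    return {
        gaussian_id: heapq.nlargest(k, atoms.items(), key=lambda item: item[1])
        for gaussian_id, atoms in accumulated.items()
    }


# Hypothetical associations: Gaussian 0 is seen in two views, Gaussian 1 in one.
observations = [
    (0, 0.5, {3: 1.0, 7: 0.2}),
    (0, 1.0, {3: 0.5, 9: 0.8}),
    (1, 1.0, {2: 0.4}),
]
sparse_codes = uplift_topk(observations, k=2)
```

Each Gaussian then stores at most `k` (atom, coefficient) pairs instead of a dense high-dimensional feature, which is where the storage saving in this kind of scheme comes from.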

2605.13597 2026-05-14 cs.LG

Rethinking Generalization in Graph Neural Networks: A Structural Complexity Perspective

Peiyao Wang, Liang Bai, Xian Yang, Richard Yi Da Xu, Jiye Liang

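Affiliations: Institute of Intelligent Information Processing; Shanxi University; Alliance Manchester Business School, University of Manchester; University of Manchester; Department of Mathematics; Hong Kong Baptist University

AI summary: This paper rethinks the generalization of graph neural networks (GNNs) from the perspective of structural complexity and examines how graph structure affects generalization. It proves that adding edges makes the input representations overly accommodating to the output model, which induces overfitting, and it defines a structural complexity measure based on the number of effective edges and derives a corresponding generalization bound. Building on these findings, the authors propose a structural entropy regularization method that regulates the number of effective edges to balance underfitting and overfitting and thereby improve GNN generalization.

Comments: 44 pages, 10 figures

Abstract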

Graph neural networks (GNNs) have emerged as a fundamental tool for learning from graph-structured data, achieving strong performance across a wide range of applications. However, understanding their generalization capabilities remains challenging due to the complex structural dependencies inherent in such data. Existing generalization analyses largely follow the classical machine learning paradigm, focusing primarily on model complexity while overlooking the fundamental role of graph structure. Therefore, in this work, we systematically investigate this role by asking: does the graph structure actually influence generalization, and if so, by how much? To answer the first question and validate our intuition, we theoretically prove that incorporating more edges into the prediction process transforms the input representations to be overly accommodating to the output model, thereby inducing overfitting. To address the second question, we formulate a structural complexity measure based on the number of effective edges and derive a Rademacher complexity-based generalization bound. In doing so, we demonstrate that GNN generalization depends explicitly on structural complexity, alongside traditional parameter-dependent factors. Motivated by these theoretical findings, we propose a structural entropy regularization method. This approach controls structural complexity by regulating effective edges to balance underfitting and overfitting, ultimately improving the generalization performance of GNNs.

2605.13596 2026-05-14 cs.CL

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

Kyo Gerrits, Rik van Noord, Ana Guerberof Arenas

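Affiliations: Centre for Language and Cognition, University of Groningen

AI summary: This paper studies how automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation perform on literary translation across several languages, genres, and translation modalities. Using a dataset that covers human translation, machine translation, and post-editing, annotated for creativity by professional literary translators, it finds that these automatic methods correlate poorly with professional judgments of creativity, and that they perform worse on more literary genres such as poetry. LLM-based evaluation also shows a systematic bias: it favours machine-translated texts and penalises creative and culturally appropriate solutions, which points to fundamental limitations of current automatic evaluation tools for literary translation.

Comments: This paper has been accepted to the EAMT Conference 2026 in Tilburg on June 15-18 2026

Abstract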

This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation, creativity (creative shifts & errors), and to see whether they can substitute for laborious manual annotations. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations on creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not frequently consider out-of-routine translations as errors.

2605.13595 2026-05-14 cs.CL

Inducing Artificial Uncertainty in Language Models

Sophia Hager, Simon Zeng, Nicholas Andrews

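Affiliations * Johns Hopkins University; Microsoft

AI summary In safety-critical applications, language models need to express their uncertainty with meaningful probabilities. This paper proposes inducing artificial uncertainty in language models to address the difficulty of training uncertainty quantification methods when challenging data is unavailable. By introducing artificial uncertainty on easy data and training dedicated probes to recognise it, the method markedly improves calibration on hard data while preserving model performance.

Details
English abstract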

In safety-critical applications, language models should be able to characterize their uncertainty with meaningful probabilities. Many uncertainty quantification approaches require supervised data; however, finding suitable unseen challenging data is increasingly difficult for large language models trained on vast amounts of scraped data. If the model is consistently (and correctly) confident in its predictions, the uncertainty quantification method may consistently overestimate confidence on new and unfamiliar data. Finding data which exhibits enough uncertainty to train supervised uncertainty quantification methods for high-performance models may therefore be challenging, and will increase in difficulty as LLMs saturate datasets. To address this issue, we first introduce the problem of inducing artificial uncertainty in language models, then investigate methods of inducing artificial uncertainty on trivially easy data in the absence of challenging data at training time. We use probes trained to recognize artificial uncertainty on the original model, and find that these probes trained on artificial uncertainty outperform probes trained without artificial uncertainty in recognizing real uncertainty, achieving notably higher calibration on hard data with minimal loss of performance on easy data.

2605.13591 2026-05-14 cs.CV

Real2Sim: A Physics-driven and Editable Gaussian Splatting Framework for Autonomous Driving Scenes

Kaicong Huang, Talha Azfar, Weisong Shi, Ruimin Ke

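Affiliations * Department of Civil and Environmental Engineering, Rensselaer Polytechnic Institute; Department of Computer and Information Sciences, University of Delaware

AI summary This paper proposes Real2Sim, a physics-driven and editable Gaussian Splatting framework for generating autonomous driving scenes. It combines 4D Gaussian Splatting with a differentiable Material Point Method solver to reconstruct temporally continuous dynamic driving scenes, support instance-level editing, and simulate realistic object-object and object-environment interactions. The framework generates diverse, high-fidelity scenarios, including complex cases such as collisions, while keeping them physically plausible. Experiments show strong results in rendering, reconstruction, editing, and physics simulation, suggesting broad potential for tasks such as autonomous driving perception and trajectory prediction.

Details
English abstract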

Reliable autonomous driving relies on large-scale, well-labeled data and robust models. However, manual data collection is resource-intensive, and traditional simulation suffers from a persistent reality gap. While recent generative frameworks and radiance-field methods improve visual fidelity, they still struggle with temporal and spatial consistency and cannot ensure physics-aware behavior, limiting their applicability to driving scenario generation. To address these challenges, we propose Real2Sim, a unified framework that combines 4D Gaussian Splatting (4DGS) with a differentiable Material Point Method (MPM) solver. Real2Sim explicitly reconstructs dynamic driving scenes as temporally continuous Gaussian primitives, supports instance-level editing, and simulates realistic object-object and object-environment interactions. This framework enables physics-aware, high-fidelity synthesis of diverse, editable scenarios, including challenging corner cases such as collisions and post-impact trajectories. Experiments on the Waymo Open Dataset validate Real2Sim's capabilities in rendering, reconstruction, editing, and physics simulation, demonstrating its potential as a scalable tool for data generation in downstream tasks such as perception, tracking, trajectory prediction, and end-to-end policy learning.

2605.13583 2026-05-14 cs.CV

Phy-CoSF: Physics-Guided Continuous Spectral Fields Reconstruction and Super-Resolution for Snapshot Compressive Imaging

Wudi Chen, Zhiyuan Zha, Xin Yuan, Shigang Wang, Bihan Wen, Jiantao Zhou, Gang Yan, Zipei Fan, Ce Zhu

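Affiliations * College of Communication Engineering, Jilin University, Changchun 130012, China. School of Engineering, Westlake University, Hangzhou, Zhejiang 310024, China. School of Electrical & Electronic Engineering, Nanyang Technological University, Singapore 639798. Department of Computer Information Science, University of Macau, Macau 999078, China. College of Computer Science Technology, Jilin University, Changchun 130012, China. College of Artificial Intelligence, Jilin University, Changchun 130012, China. School of Information Communication Engineering, University of Electronic Science

AI summary This paper proposes Phy-CoSF, a method for continuous spectral reconstruction and super-resolution of hyperspectral images in coded aperture snapshot spectral imaging (CASSI) systems. It combines deep unfolding networks with implicit neural representations to establish a new paradigm for continuous spectral reconstruction, generating high-fidelity hyperspectral images at arbitrary wavelengths. Its core module, continuous spectral fields (CoSF), uses cross-domain feature fusion and a dynamic prior mechanism to improve reconstruction accuracy and the preservation of spectral detail; experiments show that it outperforms existing state-of-the-art methods on several metrics.

Comments 15 pages, 10 figures, accepted by ICML 2026!

Details
English abstract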

Recent advances have demonstrated that coded aperture snapshot spectral imaging (CASSI) systems show great potential for capturing 3D hyperspectral images (HSIs) from a single 2D measurement. Despite the inherent spectral continuity of scenes captured by CASSI, most existing reconstruction methods are restricted to fixed, discrete spectral outputs, thereby precluding continuous spectral reconstruction or spectral super-resolution. To address this challenge, we propose Phy-CoSF, which synergizes deep unfolding networks with implicit neural representations, establishing a new paradigm for continuous spectral reconstruction and super-resolution in CASSI. Specifically, we propose a two-phase architecture that bridges discrete-wavelength training with continuous spectral rendering, enabling the synthesis of high-fidelity HSIs at arbitrary target wavelengths. At the core of our framework lies the continuous spectral fields (CoSF) module, embedded within each unfolding stage as a dynamic prior, which comprises a triple-branch cross-domain feature mixer for comprehensive spatial-frequency-channel feature fusion, alongside a spectral synthesis head that generates spectral intensities by querying continuous wavelength coordinates. Extensive experimental results demonstrate that Phy-CoSF not only achieves continuous modeling at arbitrary spectral resolutions but also outperforms many state-of-the-art methods in both reconstruction fidelity and spectral detail preservation. Our code and more results are available at: https://github.com/PaiDii/Phy-CoSF.git.

2605.13581 2026-05-14 cs.CV

HIR-ALIGN: Enhancing Hyperspectral Image Restoration via Diffusion-Based Data Generation

Li Pang, Heng Zhao, Yijia Zhang, Deyu Meng, Xiangyong Cao

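Affiliations * School of Mathematics and Statistics, Xi’an Jiaotong University; School of Computer Science and Technology, Xi’an Jiaotong University; School of Mathematics and Statistics and the Ministry of Education Key Laboratory for Intelligent Networks and Network Security, Xi’an Jiaotong University; Pazhou Laboratory (Huangpu), Guangzhou

AI summary Hyperspectral image (HSI) restoration faces noise, blur, and resolution loss in practice, and existing models perform poorly on target-domain data that lacks clean references. This paper proposes HIR-ALIGN, a framework that uses a diffusion model to generate synthetic data matched to the target distribution and thereby improve restoration. The method has three stages (proxy generation, distribution-adaptive synthesis, and aligned supervised finetuning), improves restoration performance on the target domain, and outperforms existing methods in denoising and super-resolution experiments.

Details
English abstract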

Hyperspectral image (HSI) restoration is crucial for reliable analysis, as real HSIs suffer from degradations like noise, blur, and resolution loss. However, existing models trained on source data often fail on target domains lacking clean references, a common occurrence in practice. To address this issue, we present HIR-ALIGN, a plug-and-play target-adaptive augmentation framework that enhances hyperspectral image restoration by augmenting limited training images with synthetic data that closely matches the target distribution using no extra data. It consists of three stages: (i) proxy generation, where off-the-shelf restoration models restore degraded target observations to produce semantics-preserving proxy HSIs that approximate target-domain clean images; (ii) distribution-adaptive synthesis, where a blur-robust unCLIP diffusion model generates target-aligned RGBs from proxy RGBs, with prompt conditioning and embedding-space noise initialization. Then, a warp-based spectral transfer module synthesizes HSIs by aligning each generated RGB with the proxy RGB, estimating soft patch-wise transport weights, and applying these weights and learnable local interpolation kernels to the proxy HSI; and (iii) aligned supervised finetuning, where restoration networks pretrained on the source distribution are finetuned using both the proxy HSIs and synthesized target-aligned HSIs, and are then deployed on degraded target images. We further provide theoretical analysis showing that augmentation-based finetuning can achieve lower target-domain restoration risk by jointly improving target distribution coverage and controlling spectral bias. Extensive experiments on simulated and real datasets across denoising and super-resolution tasks demonstrate that HIR-ALIGN consistently improves source-only supervised baselines, outperforming both source-only counterparts and representative unsupervised methods.

2605.13579 2026-05-14 cs.AI

Position: Assistive Agents Need Accessibility Alignment

Jie Hu, Changyuan Yan, Yu Zheng, Ziqian Wang, Jiaming Zhang

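Affiliations * School of Artificial Intelligence and Robotics, Hunan University, Changsha, China

AI summary This paper discusses the accessibility alignment problem facing assistive agents designed for blind and visually impaired users. It notes that most current agent systems are designed and evaluated under sighted-user interaction assumptions, which leads to frequent failures in assistive scenarios. Analysing 778 assistive task instances, the study reveals mismatches between current agents and the needs of visually impaired users in verification, risk, and interaction constraints. It proposes treating accessibility as an alignment problem, introduces the concept of accessibility alignment, and builds a lifecycle design pipeline spanning user research, system design, deployment, and iteration, pushing toward more inclusive agent design.

Comments 9 pages, 1 figure, Accepted to ICML 2026

Details
English abstract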

Assistive agents for Blind and Visually Impaired (BVI) users require accessibility alignment as a first-class design objective. Despite rapid progress in agentic AI, most systems are designed and evaluated under assumptions of sighted interaction, low-cost verification, and tolerable trial-and-error, leading to systematic failures in assistive scenarios that cannot be resolved by model scaling or post-hoc interface adaptations alone. Drawing on an analysis of 778 assistance task instances from prior work, we show that current agentic AI systems remain prone to failure in assistive scenarios due to mismatches between sighted-user design assumptions and the verification, risk, and interaction constraints faced by BVI users. We argue that accessibility should be treated as an alignment problem rather than a peripheral usability concern. To this end, we introduce accessibility alignment and propose a lifecycle-oriented design pipeline for accessibility-aligned assistive agents, spanning user research, system design, deployment, and post-deployment iteration. We conclude that BVI-centered assistive tasks provide a critical stress test for agentic AI and motivate a broader shift toward inclusive agent design.

2605.13570 2026-05-14 cs.AI cs.LG

Learning Local Constraints for Reinforcement-Learned Content Generators

Debosmita Bhaumik, Julian Togelius, Georgios N. Yannakakis, Ahmed Khalifa

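Affiliations * Institute of Digital Games; Game Innovation Lab

AI summary This paper studies how to combine constraint-based game content generation (such as Wave Function Collapse) with reinforcement-learning generators so that generated content is both locally visually plausible and globally playable. The authors apply local constraints learned by WFC to the action space of a reinforcement-learning generator, so that the generator satisfies global properties while following local rules. Experiments show that, with appropriate tuning, the hybrid method generates visually pleasing and playable puzzle-platform game levels, such as Lode Runner levels.

Details
English abstract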

Constraint-based game content generators that learn local constraints from existing content, such as Wave Function Collapse (WFC), can generate visually satisfying game levels but face challenges in guaranteeing global properties, such as playability. On the other hand, reinforcement-learning trained generators can guarantee global properties -- because such properties can easily be included in reward functions -- but the results can be visually dissatisfying. In this paper, we explore ways to combine these methods. Specifically, we constrain the action space of a PCGRL generator with constraints learned by WFC, effectively allowing the PCGRL generator to achieve global properties while forced to adhere to local constraints. To better analyze how this hybrid content generation method operates, we vary the number and type of inputs, and we test whether to randomly collapse the starting state and exclude rare patterns. While the method is sensitive to hyperparameter tuning, the best of our trained generators produce visually satisfying and playable puzzle-platform game levels -- such as Lode Runner levels -- with desired global properties.
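
The idea of restricting a generator's action space with learned local constraints can be illustrated with a minimal sketch. It assumes a character-tile grid and four-neighbour adjacency pairs as a simplified stand-in for the patterns WFC learns; the tile symbols, the example level, and the function names `learn_adjacency` and `valid_actions` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical example level: '.' empty, 'H' ladder, 'B' brick.
EXAMPLE_LEVEL = [
    "....",
    ".H..",
    "BHBB",
]

DIRECTIONS = ((0, 1), (1, 0), (0, -1), (-1, 0))  # (dy, dx)


def learn_adjacency(level):
    """Record every (tile, neighbour) pair seen in each direction."""
    allowed = defaultdict(set)
    h, w = len(level), len(level[0])
    for y in range(h):
        for x in range(w):
            for dy, dx in DIRECTIONS:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    allowed[(dy, dx)].add((level[y][x], level[ny][nx]))
    return allowed


def valid_actions(grid, y, x, tiles, allowed):
    """Return the tiles that may be placed at (y, x).

    A tile is kept only if, for every already-filled neighbour,
    the (tile, neighbour) pair was observed in the example level.
    Unfilled cells are marked with None and impose no constraint.
    """
    h, w = len(grid), len(grid[0])
    ok = []
    for tile in tiles:
        consistent = True
        for dy, dx in DIRECTIONS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] is not None:
                if (tile, grid[ny][nx]) not in allowed[(dy, dx)]:
                    consistent = False
                    break
        if consistent:
            ok.append(tile)
    return ok
```

In a reinforcement-learning loop, a mask like the one returned by `valid_actions` would be applied to the policy's action distribution before sampling, so the reward only has to account for global properties such as playability.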

2605.13568 2026-05-14 cs.LG cs.AI

Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model

Riccardo Cavarra, Lupo Lovatelli, Shaheim Ogbomo-Harmitt, Shahid Aziz, Adelaide De Vecchi, Andrew King, Oleg Aslanidi

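Affiliations * King’s College London; St Thomas’ Hospital; North Bristol NHS Trust

AI summary This study aims to use electrocardiogram (ECG) data to predict cardiovascular disease progression after myocardial infarction (MI). It proposes a pretrained AI model based on contrastive learning that combines patient-specific temporal information with supervised multitask heads and is fine-tuned on a small amount of labelled data to improve prediction. Experiments show that the model outperforms one trained from scratch under limited data, supporting the effectiveness of clinically structured ECG modelling for predicting disease progression.

Comments submitted to the 9th International Conference on Computational and Mathematical Biomedical Engineering, 4 pages, 1 figure, 1 table

Details
English abstract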

Myocardial infarction (MI) is a leading cause of death, and predicting its adverse outcomes is an urgent need. Yet ECG-based prognostic models underperform because deep learning requires large, labelled datasets, which are scarce in medicine. Foundation models can learn from unlabelled ECGs via self-supervision, but medically relevant training strategies remain underexplored. We propose a pretrained artificial intelligence model that combines patient-specific temporal information using contrastive learning with supervised multitask heads, then fine-tunes on post-MI outcome prediction. The proposed model outperformed a model trained from scratch (0.794 vs 0.608 AUC), showing that clinically structured ECG modelling improves classification in limited data regimes.

2605.13565 2026-05-14 cs.CV

Qwen-Image-VAE-2.0 Technical Report

Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu, Chenfei Wu, Yu Wu, Liang Peng, Hao Meng, Jiahao Li, Jie Zhang, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Shengming Yin, Tianhe Wu, Xiao Xu, Xiaoyue Chen, Yan Shu, Yanran Zhang, Yilei Chen, Yixian Xu, Yuxiang Chen, Zhendong Wang, Zihao Liu, Zikai Zhou, Yiliang Gu, Yi Wang, Xiaoxiao Xu, Lin Qu

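Affiliations * Qwen Team

AI summary This paper introduces Qwen-Image-VAE-2.0, a suite of high-compression variational autoencoders (VAEs) that makes significant advances in reconstruction fidelity and diffusability. By introducing Global Skip Connections and expanding the latent channels, the model addresses the reconstruction bottleneck of high compression, and large-scale image training together with a synthetic rendering engine improves performance in text-rich scenarios. The work also proposes an enhanced semantic alignment strategy to improve convergence in the high-dimensional latent space, and adopts an asymmetric, attention-free encoder-decoder structure to improve computational efficiency. Experiments show state-of-the-art results on multiple benchmarks, with particularly strong reconstruction and diffusability at high compression ratios.

Details
English abstract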

We present Qwen-Image-VAE-2.0, a suite of high-compression Variational Autoencoders (VAEs) that achieve significant advances in both reconstruction fidelity and diffusability. To address the reconstruction bottlenecks of high compression, we adopt an improved architecture featuring Global Skip Connections (GSC) and expanded latent channels. Moreover, we scale training to billions of images and incorporate a synthetic rendering engine to improve performance in text-rich scenarios. To tackle the convergence challenges of high-dimensional latent space, we implement an enhanced semantic alignment strategy to make the latent space highly amenable to diffusion modeling. To optimize computational efficiency, we leverage an asymmetric and attention-free encoder-decoder backbone to minimize encoding overhead. We present a comprehensive evaluation of Qwen-Image-VAE-2.0 on public reconstruction benchmarks. To evaluate performance in text-rich scenarios, we propose OmniDoc-TokenBench, a new benchmark comprising a diverse collection of real-world documents coupled with specialized OCR-based evaluation metrics. Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction performance, demonstrating exceptional capabilities in both general domains and text-rich scenarios at high compression ratios. Furthermore, downstream DiT experiments reveal that our models possess superior diffusability, significantly accelerating convergence compared to existing high-compression baselines. These results establish Qwen-Image-VAE-2.0 as a leading model with high compression, superior reconstruction, and exceptional diffusability.

2605.13560 2026-05-14 cs.LG

Uncertainty-Aware Prediction of Lung Tumor Growth from Sparse Longitudinal CT Data via Bayesian Physics-Informed Neural Networks

Lingfei Kong, Haoran Ma

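Affiliations * Department of Mathematics, Vanderbilt University; John A. Paulson School of Engineering and Applied Science, Harvard University

AI summary This paper studies how to predict lung tumor growth from sparse, irregular longitudinal CT data while accounting for measurement error. It proposes a physics-informed neural network approach that combines the Gompertz growth model with Bayesian inference, performing low-dimensional Bayesian estimation in the log-volume domain and using a two-stage inference strategy (maximum a posteriori estimation and Hamiltonian Monte Carlo sampling) to estimate predictive distributions and uncertainty intervals. Validated on the National Lung Screening Trial dataset, the method captures heterogeneous tumor growth patterns and provides well-calibrated uncertainty estimates from few observations, indicating potential for clinical use.

Comments 8 pages, 15 figures

Details
English abstract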

This work studies lung tumor growth prediction from sparse and irregular longitudinal computed tomography (CT) observations with measurement variability. A Bayesian physics-informed neural network is developed by combining Gompertz growth dynamics with low-dimensional Bayesian inference in the log-volume domain. The framework employs a two-stage inference strategy combining maximum a posteriori (MAP) estimation and Hamiltonian Monte Carlo (HMC) sampling to estimate posterior predictive distributions and uncertainty intervals. The method was evaluated on longitudinal data from the National Lung Screening Trial (30 patients). Results show that the model captures heterogeneous tumor growth patterns while maintaining reasonable prediction accuracy under limited observations. Compared with deterministic modeling approaches, the proposed approach additionally provides calibrated uncertainty estimates. The inferred posterior parameter correlations were consistent with expected biological growth behavior. The proposed framework achieved a cohort-level log-space RMSE of approximately 0.20 together with well-calibrated 95% credible interval coverage across 30 patients. These findings suggest that Bayesian physics-informed modeling may be useful for uncertainty-aware tumor growth assessment when only limited longitudinal follow-up scans are available.
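
For readers unfamiliar with Gompertz dynamics in the log-volume domain, the following is a minimal sketch under one common parameterization, which is an assumption here since the abstract does not state the form used: dV/dt = alpha * V * ln(K / V), so that y = ln V satisfies the linear equation dy/dt = alpha * (ln K - y). The names `gompertz_log_volume`, `log_rmse`, `alpha`, and `log_k` are hypothetical, and the sketch covers only the deterministic growth curve and a log-space RMSE, not the Bayesian inference.

```python
import math


def gompertz_log_volume(t, y0, alpha, log_k):
    """Closed-form solution of dy/dt = alpha * (log_k - y) with y(0) = y0.

    y is the log tumor volume, log_k the log carrying capacity,
    and alpha the growth-rate parameter (assumed parameterization).
    """
    return log_k - (log_k - y0) * math.exp(-alpha * t)


def log_rmse(predicted, observed):
    """Root-mean-square error between predicted and observed log-volumes."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))
```

Because the dynamics are linear in y, the curve is fully described by three numbers (y0, alpha, log_k), which is one way to read the abstract's "low-dimensional" inference; a posterior over these parameters would then induce a predictive band around the curve.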

2605.13554 2026-05-14 cs.LG cs.AI

Self-Supervised On-Policy Reinforcement Learning via Contrastive Proximal Policy Optimisation

Asim Osman, Sasha Abramowitz, Mark Bergh, Ulrich Armel Mbou Sob, Ruan John de Kock, Omayma Mahjoub, Oussama Hidaoui, Noah De Nicola, Arnol Manuel Fokam, Felix Chalumeau, Daniel Rajaonarivonivelomanantsoa, Siddarth Singh, Refiloe Shabe, Juan Claude Formanek, Simon Verster Du Toit, Arnu Pretorius

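Affiliations * InstaDeep; AIMS; University of Stellenbosch

AI summary This paper proposes Contrastive Proximal Policy Optimisation (CPPO), a contrastive-learning-based policy optimisation algorithm for self-supervised reinforcement learning without hand-designed reward functions. The method learns Q-values by contrasting representations of state-action pairs and goals, and optimises these contrastive Q-values directly on-policy, giving end-to-end self-supervised training. Experiments show that, across single-agent and cooperative multi-agent tasks with continuous and discrete action spaces, CPPO significantly outperforms existing contrastive reinforcement learning methods and on most tasks matches the performance of PPO trained with hand-crafted dense rewards.

Details
English abstract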

Contrastive reinforcement learning (CRL) learns goal-conditioned Q-values through a contrastive objective over state-action and goal representations, removing the need for hand-crafted reward functions. Despite impressive success in achieving viable self-supervised learning in RL, all existing CRL algorithms rely on off-policy optimisation and are mostly constrained to continuous action spaces, with little research invested in discrete environments. This leaves CRL disconnected from widely used and effective, modern on-policy training pipelines adopted across both single-agent and multi-agent RL in continuous and discrete environments. To establish a first connection, we introduce Contrastive Proximal Policy Optimisation (CPPO). CPPO is an on-policy contrastive RL algorithm that derives policy advantages directly from contrastive Q-values and optimises them via the standard PPO objective, without requiring a reward function or a replay buffer. We evaluate CPPO across continuous and discrete, single-agent and cooperative multi-agent tasks. Whilst the existence of an on-policy approach is inherently useful, we observe that CPPO not only significantly outperforms the previous CRL baselines in 14 out of 18 tasks, but also matches or exceeds PPO's performance, which uses hand-crafted dense rewards, in 12 out of the 18 tasks tested.
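
To make "advantages from Q-values, optimised via the standard PPO objective" concrete, here is a minimal sketch for a single discrete-action sample. The clipped surrogate is the standard PPO form; the advantage definition (Q-value minus the policy-weighted mean Q-value) is an assumption chosen for illustration and may differ from how CPPO derives advantages. The function names are hypothetical.

```python
def advantage_from_q(q_values, policy_probs, action):
    """A(s, a) = Q(s, a) - sum_a' pi(a'|s) * Q(s, a').

    Assumed baseline for illustration; q_values would come from a
    learned (for example contrastive) critic rather than from rewards.
    """
    baseline = sum(p * q for p, q in zip(policy_probs, q_values))
    return q_values[action] - baseline


def ppo_clipped_term(new_prob, old_prob, advantage, eps=0.2):
    """Standard PPO clipped surrogate for one sample (to be maximised)."""
    ratio = new_prob / old_prob
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Note that neither function takes an environment reward as input: the only learning signal is the critic's Q-values, which is what allows this style of update to run without a reward function.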

2605.13551 2026-05-14 cs.LG

Mixed neural posterior estimation for simulators with discrete and continuous parameters

Jan Boelts, Cornelius Schröder, Jonas Beck, Jakob H. Macke, Michael Deistler, Daniel Gedon

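Affiliations * appliedAI Institute for Europe; Machine Learning in Science, University of Tübingen; Tübingen AI Center; Hertie Institute for AI in Brain Health, University of Tübingen; Max Planck Institute for Intelligent Systems; Max Planck Institute for Biological Intelligence

AI summary This paper studies neural posterior estimation in mixed parameter spaces that contain both discrete and continuous parameters. The authors propose an inference network that handles both jointly: it factorizes the joint posterior into discrete and continuous parts and jointly trains an autoregressive classifier with a generative model, extending standard NPE. Experiments show that the method produces accurate, well-calibrated posteriors on several tractable examples and on real scientific simulators, and it has been integrated into the sbi Python toolkit.

Details
English abstract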

Neural Posterior Estimation (NPE) enables rapid parameter inference for complex simulators with intractable likelihoods. NPE trains an inference network to estimate a probability density over parameters given data, typically assumed to be continuous. However, many scientific models involve parameter spaces that are mixed, that is, they contain both discrete and continuous dimensions. We address this limitation by extending NPE to mixed parameter spaces through an inference network that jointly handles discrete and continuous parameters. The inference network factorizes the joint posterior into discrete and continuous components, combining an autoregressive classifier for the discrete parameters with a generative model for the continuous parameters, trained jointly under a single simulation-based objective. In addition, we propose a diagnostic tool to assess the calibration of the mixed posterior approximation. Across tractable toy examples and real-world scientific simulators, our joint inference approach yields accurate and calibrated posteriors. The inference framework is available in the sbi Python package.
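
The factorization described in the abstract can be written out as follows. This is a sketch under assumptions: the notation (discrete parameters θ_d with k components, continuous parameters θ_c, data x, network parameters φ) is introduced here, and conditioning the continuous part on the discrete part, rather than the reverse, is one plausible ordering that the abstract does not confirm.

```latex
\begin{align}
p(\theta_d, \theta_c \mid x)
  &= p(\theta_d \mid x)\, p(\theta_c \mid \theta_d, x), \\
p(\theta_d \mid x)
  &= \prod_{i=1}^{k} p\big(\theta_{d,i} \mid \theta_{d,<i},\, x\big), \\
\mathcal{L}(\phi)
  &= -\,\mathbb{E}_{(\theta, x) \sim p(\theta)\, p(x \mid \theta)}
     \big[ \log q_\phi(\theta_d \mid x) + \log q_\phi(\theta_c \mid \theta_d, x) \big].
\end{align}
```

The second line is what an autoregressive classifier models, and the single loss in the third line is the usual NPE negative log-likelihood over simulated pairs, split into its discrete and continuous terms.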

2605.13544 2026-05-14 cs.CV

CA-GCL: Cross-Anatomy Global-Local Contrastive Learning for Robust 3D Medical Image Understanding

Hanwen Zhang, Yao Liu, Die Dai, Jiaye Yang, Qiao Liu, Yutong Xie, Peng Wang

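Affiliations * University of Electronic Science and Technology of China; Mohamed bin Zayed University of Artificial Intelligence

AI summary This paper proposes CA-GCL, a cross-anatomy global-local contrastive learning framework intended to improve the robustness of 3D medical image understanding. It introduces a global contrastive objective that increases the separation of anatomical categories in the latent space, together with a clinical-aware text augmentation strategy to cope with incomplete descriptions. Experiments show that CA-GCL outperforms existing methods in zero-shot abnormality detection, generalises well across datasets, and markedly improves the model's stability under prompt variation.

Details
English abstract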

Fine-grained Vision-Language Pre-training (FVLP) demonstrates significant potential in 3D medical image understanding by aligning anatomy-level visual representations with corresponding textual descriptions. However, existing FVLP paradigms often suffer from severe representation collapse in the textual embedding space, where text embeddings of distinct anatomical structures become highly clustered and indistinguishable. This distributional degeneracy renders the model hypersensitive to prompt variations, hindering reliable clinical deployment. To address these challenges, we propose a novel Cross-Anatomy Global-Local Contrastive Learning framework (CA-GCL). CA-GCL introduces a global contrastive objective that enforces separation between anatomical categories in the latent space, effectively counteracting the aggregation tendency induced by local alignment. Furthermore, we incorporate a clinical-aware text augmentation strategy based on permutation invariance and partial completeness to enhance robustness against descriptive incompleteness. Extensive evaluations on the CT-RATE and Rad-ChestCT datasets demonstrate that CA-GCL consistently outperforms existing VLP paradigms in zero-shot abnormality detection, achieving superior performance while exhibiting strong cross-dataset generalization. Crucially, CA-GCL reduces performance variance across diverse prompt templates, transforming the collapsed textual similarity distribution into a bell-shaped distribution. These results validate CA-GCL as an effective framework for robust 3D medical image understanding.

2605.13542 2026-05-14 cs.AI cs.CL cs.LG cs.MA

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan

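Affiliations * Technical University of Munich; TUM University Hospital; LMU Munich; University of Sheffield; University of Oxford; Zhongshan Hospital Fudan University; Sun Yat-sen University Cancer Center; Imperial College London; Munich Center for Machine Learning; relAI – Konrad Zuse School of Excellence in Reliable AI

AI summary This paper presents RealICU, a new benchmark built from real intensive care unit (ICU) clinical data for evaluating large language models on complex, long-horizon medical decision tasks. Senior physicians annotate complete patient trajectories in hindsight, and the benchmark defines four tasks related to clinical decision-making. It reveals a recall-safety tradeoff in the medical recommendations of existing LLMs and an over-reliance on early patient information. RealICU offers a reliable testbed for studying and improving AI decision-support systems in high-stakes medical settings.

Details
English abstract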

Intensive care units (ICUs) generate long, dense, and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assessing Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30-min windows and release two datasets: RealICU-Gold with 930-window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs, including memory-augmented ones, performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents, which improve long-horizon reasoning but do not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision-support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/
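
The 30-minute partitioning of a trajectory can be sketched as below. This is an illustration only: the event format (timestamp, payload), the choice to start the first window at the first event, the example values, and the function name `partition_windows` are all assumptions, not details from the benchmark.

```python
from datetime import datetime, timedelta


def partition_windows(events, window=timedelta(minutes=30)):
    """Group (timestamp, payload) events into consecutive fixed-length windows.

    Windows are aligned to the earliest event (an assumption); windows
    with no events are kept as empty lists so indices map to time.
    """
    if not events:
        return []
    ordered = sorted(events, key=lambda e: e[0])
    start = ordered[0][0]
    count = (ordered[-1][0] - start) // window + 1
    windows = [[] for _ in range(count)]
    for timestamp, payload in ordered:
        windows[(timestamp - start) // window].append(payload)
    return windows


# Hypothetical usage with made-up chart events.
t0 = datetime(2026, 1, 1, 8, 0)
example = [
    (t0, "hr=90"),
    (t0 + timedelta(minutes=10), "map=65"),
    (t0 + timedelta(minutes=45), "lactate=2.1"),
    (t0 + timedelta(minutes=95), "norepinephrine started"),
]
example_windows = partition_windows(example)
```

Keeping empty windows makes the window index a direct proxy for elapsed time, which matters when a model is asked to reassess the patient at every step rather than only when new data arrives.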