arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.17564 2026-05-19 cs.CV

A Conditional U-Net Pipeline with Pre- and Post-Processing for Aerial RGB-to-Thermal Image Translation

Tseten Sherpa, Sikandar Ali, Shubham Parab, Haoyun Feng, Matthew Dennis, Keenan Gibbons, Verrah Otiende, Geoffrey H. Siwo

Affiliations: Department of Data Science, University of Michigan, Ann Arbor, MI, USA; Department of Information Science, University of Michigan, Ann Arbor, MI, USA; Department of Computer Science, University of Michigan, Ann Arbor, MI, USA; Arcknow, New York, USA; School of Environmental Sustainability, University of Michigan, Ann Arbor, MI, USA; SmithGroup, Ann Arbor, MI, USA; Michigan Institute for Data and AI in Society (MIDAS), University of Michigan, Ann Arbor, MI, USA; United States International University (USIU), Nairobi, Kenya; Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, USA; Department of Pharmacology, University of Michigan Medical School, Ann Arbor, MI, USA; Center for Global Health Equity, University of Michigan, Ann Arbor, MI, USA

AI summary: This paper proposes a simple conditional U-Net architecture that combines weather data with targeted preprocessing and post-processing to improve aerial RGB-to-thermal image translation; experiments show that it outperforms existing methods on PSNR, SSIM, and LPIPS.

Comments 8 pages, 7 figures, NeurIPS 2026

Abstract

Paired RGB-thermal data has shown significant utility across a range of applications, including image fusion, object tracking, and anomaly detection; however, its broader adoption is constrained by the limited availability of aligned RGB-thermal image pairs. RGB-to-thermal (and vice versa) image translation has emerged as a practical solution to this challenge. Prior approaches including conditional generative adversarial networks (cGANs) such as ThermalGAN and Scalable Interpolant Transformer (SiT)-based architectures such as ThermalGen have demonstrated strong potential for aerial-to-thermal image translation. In this work, we explore alternative architectures that prioritize simplicity while maintaining performance. Specifically, we propose a conditional U-Net that incorporates weather data at the bottleneck layer, complemented by targeted preprocessing and post-processing techniques applied within the Pix2Pix GAN architecture. We utilize a training set of 612 paired RGB and thermal images, and evaluate over 5-fold cross-validation, ultimately testing on a held-out test set. Our conditional U-Net model performed best, with a peak signal-to-noise ratio (PSNR) of 14.5485, structural similarity index measure (SSIM) of 0.8095, and learned perceptual image patch similarity (LPIPS) of 0.1666. These results outperformed the base ThermalGen model, which attained PSNR, SSIM, and LPIPS scores of 7.56, 0.2444, and 0.6317 respectively. We find that while saturation boost and contrast enhancement for preprocessing and Gaussian blur for post-processing provide observable improvements, the incorporation of conditioning data was most effective. Our findings cement the potential of integrating auxiliary metadata into thermal image generation, suggesting that such information can serve as a proxy for environmental conditions critical to accurate thermal reconstruction.
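
The PSNR figures quoted in the abstract above follow the standard definition, 10·log10(MAX² / MSE). A minimal sketch of that computation is given here for orientation; the 8-bit dynamic range (`max_value=255.0`) and the toy 2×2 images are assumptions for illustration, since the abstract does not state the range the authors used.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are nested lists of pixel intensities (rows of numbers).
    max_value is the assumed dynamic range (255 for 8-bit images).
    """
    squared_error = 0.0
    count = 0
    for ref_row, gen_row in zip(reference, generated):
        for r, g in zip(ref_row, gen_row):
            squared_error += (r - g) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy thermal images (hypothetical values): every pixel is off by 10, so MSE = 100.
reference = [[100, 110], [120, 130]]
generated = [[110, 100], [130, 120]]
score = psnr(reference, generated)
```

Higher PSNR means lower pixel-wise error; SSIM and LPIPS, the other two metrics in the abstract, measure structural and perceptual similarity and are not covered by this sketch.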

2605.17562 2026-05-19 cs.LG cs.AI cs.HC

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

Urban Širca, Maryam Alimardani, Stefanos Zafeiriou, Konstantinos Barmpas

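Affiliations: Vrije Universiteit Amsterdam; Imperial College London

AI summary: This paper studies the robustness, interpretability, and expressiveness of EEG foundation models by benchmarking six EEG-FMs and one baseline deep learning model on eight datasets, showing how the models behave under different perturbations and characterizing their interpretability and expressiveness.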

Abstract

EEG foundation models (EEG-FMs) have been evaluated predominantly on clean, in-distribution accuracy, leaving their robustness, interpretability and representational quality largely unexamined. This study addresses these gaps by benchmarking six EEG-FMs against a baseline deep learning model across eight datasets. Beyond clean accuracy, we conduct three layers of analysis: (i) Robustness: we apply test-time perturbations including additive noise, random and region-based channel dropout and region-specific noise injection. Our analyses show that no single model dominates all failure modes. The most noise-robust model is among the most fragile under channel dropout and much of the dropout fragility disappears when channels are removed rather than zero-padded. (ii) Interpretability: we present the first application of Attention-Aware Layer-Wise Relevance Propagation (AttnLRP) to EEG-FMs and show that models broadly concentrate relevance on task-appropriate brain regions consistent with known neurophysiology. However, attribution maps remain spatially stable under perturbation while predictions degrade, suggesting that the models attend to the correct brain regions but decode corrupted content. (iii) Expressiveness: With block-wise probing we show that late blocks are repurposed during fine-tuning, while early blocks already hold task-related information. Furthermore, we demonstrate that the poor head-only performance previously attributed to low-quality pre-trained representations is largely explained by pooling and that EEG-FMs possess sufficient representational capacity when their token-level embeddings are preserved. Together, these findings provide the first systematic assessment of robustness, interpretability and expressiveness for EEG-FMs and highlight critical considerations for their development.
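
The abstract above distinguishes two ways of dropping channels at test time: zero-padding (the channel slot stays, filled with zeros) and removal (the channel is deleted). The sketch below illustrates that distinction on a toy trial; the function name, the `drop_fraction` parameter, and the list-of-lists layout are assumptions for illustration and are not taken from the paper.

```python
import random

def drop_channels(eeg, drop_fraction, mode, seed=0):
    """Apply random channel dropout to one EEG trial.

    eeg: list of channels, each a list of samples.
    mode: "zero" keeps the channel slot but fills it with zeros;
          "remove" deletes the channel so the model sees fewer channels.
    Returns (perturbed_trial, dropped_indices).
    """
    rng = random.Random(seed)
    n_drop = round(len(eeg) * drop_fraction)
    dropped = set(rng.sample(range(len(eeg)), n_drop))
    if mode == "zero":
        out = [[0.0] * len(ch) if i in dropped else list(ch)
               for i, ch in enumerate(eeg)]
    elif mode == "remove":
        out = [list(ch) for i, ch in enumerate(eeg) if i not in dropped]
    else:
        raise ValueError("mode must be 'zero' or 'remove'")
    return out, sorted(dropped)

# Toy trial: 4 channels x 3 samples (hypothetical values).
trial = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.5, 2.5, 3.5]]
zeroed, dropped_zero = drop_channels(trial, 0.5, "zero", seed=1)
removed, dropped_removed = drop_channels(trial, 0.5, "remove", seed=1)
```

With the same seed both modes drop the same channels, so any difference in downstream accuracy can be attributed to how the missing channels are presented to the model.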

2605.17556 2026-05-19 cs.RO cs.AI

Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting

Peter Schaldenbrand, Jean Oh

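Affiliations: The Robotics Institute, Carnegie Mellon University

AI summary: This paper proposes a visually-aligned planning representation for long-horizon robot clay sculpting that captures lighting and texture features, improving the modeling of deformable-material dynamics, and demonstrates its performance across different deformable materials and end-effectors.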

Comments 8 pages, 14 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

Abstract

Clay sculpting is a nuanced, artistic task involving dexterous manipulation with long-horizon planning to achieve high-level goals. We formulate clay sculpting as a robotics problem of shape-to-shape matching. Prior deformable object manipulation work either requires retraining a policy per goal or relies on dynamics models that represent state as sparse point clouds, which capture important clay features such as texture poorly. We present a method for modeling the dynamics of deformable materials and planning for robotic sculpting in a visually-aligned representation that captures lighting and texture features. With three different deformable materials and various end-effectors, we demonstrate that our dynamics model is comparable in performance to the state of the art, with the added benefit of being compatible with visual planning. Our actions are represented as parametrized pushes into clay with a single end-effector, which proved suitable for long-horizon (>100 actions) clay relief sculptures. Lastly, we show the benefits of planning in a visually-aligned representation and analyze why this representation is more challenging to plan in than 3D representations.

2605.17555 2026-05-19 cs.LG cs.CV

PFlow-T: A Persistence-Driven Forward Process for Topology-Controlled Generation

Snigdha Chandan Khilar

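Affiliations: Independent Researcher

AI summary: This paper proposes PFlow-T, a generative model with a persistence-driven forward process that uses persistent homology to control topology, improving the generation of requested Betti numbers and the handling of out-of-distribution tasks.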

Abstract

Current topology-aware diffusion models face an architectural mismatch by using Gaussian noise for corruption while recovering structural features through conditional side channels. To fix this, we introduce PFlow-T, a generative model that bases its forward process entirely on persistent homology. In PFlow-T, time measures the destruction of $H_1$ topological features, like holes, rather than Gaussian noise injection. This forward process eliminates features based on their persistence. The reverse network then directly inverts this structured corruption to predict the clean state in one step. Tests on MNIST digits zero, one, and eight show PFlow-T significantly outperforms a baseline model in generating requested Betti numbers and handling out-of-distribution tasks. PFlow-T is the first generative architecture using persistent homology for the forward process, although we note it is currently limited to low-resolution pixel-space proxies.

2605.17552 2026-05-19 cs.LG

Q-LocalAdam: Memory-Efficient Client-Side Adaptive Optimization for Edge Federated Learning

Vedant Waykole, Haroon R. Lone

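Affiliations: IISER Bhopal

AI summary: This paper proposes Q-LocalAdam, an adaptive optimization method for edge federated learning under non-IID data and memory constraints; it achieves memory-efficient optimization through distribution-aware 8-bit quantization with block-wise linear encoding and log-space encoding, significantly improving model performance and concurrent-workload capacity.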

Abstract

Federated learning on edge devices must cope with non-IID client data and tight memory budgets. Adaptive optimizers like Adam stabilize training under data heterogeneity but require storing full-precision momentum and variance states, often tripling client memory overhead. This limits deployable model sizes and concurrent federated jobs on resource-constrained devices. We empirically observe that momentum and variance in federated Adam exhibit fundamentally different statistical properties: momentum values are symmetric and bounded, while variance spans eight orders of magnitude with log-normal structure. Motivated by this asymmetry, we propose Q-LocalAdam, which applies distribution-aware 8-bit quantization (block-wise linear encoding for momentum and log-space encoding for variance) while keeping model parameters in full precision. Across CIFAR-10 and CIFAR-100 under varying data heterogeneity ($\alpha \in \{0.1, 0.5, 1.0, \text{IID}\}$), Q-LocalAdam achieves $3.37\times$ optimizer memory reduction with no accuracy loss under moderate heterogeneity and significant improvements under extreme heterogeneity (e.g., +5.74pp on CIFAR-100, $\alpha=0.1$). Multi-seed validation confirms statistical significance ($p<0.01$). In contrast, naive uniform quantization degrades to random performance, demonstrating that distribution-aware design is essential. Q-LocalAdam enables larger models and more concurrent workloads on memory-constrained edge devices without modifying the federated protocol.
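
To make the two encodings named in the abstract above concrete, here is a rough sketch of block-wise linear quantization for a symmetric, bounded state and log-space quantization for a state spanning many orders of magnitude. It is not the authors' implementation: the per-block absmax scaling, the min/max range in log space, the `block_size`, and the `eps` offset are all assumptions chosen for illustration.

```python
import math

def quantize_momentum(values, block_size=4):
    """Block-wise linear (absmax) quantization to signed 8-bit codes."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # One scale per block; fall back to 1.0 for an all-zero block.
        scale = max(abs(v) for v in block) / 127.0 or 1.0
        scales.append(scale)
        codes.extend(round(v / scale) for v in block)
    return codes, scales

def dequantize_momentum(codes, scales, block_size=4):
    return [c * scales[i // block_size] for i, c in enumerate(codes)]

def quantize_variance(values, eps=1e-12):
    """Log-space quantization to unsigned 8-bit codes (0..255)."""
    logs = [math.log(v + eps) for v in values]
    lo, hi = min(logs), max(logs)
    step = (hi - lo) / 255.0 or 1.0  # 1.0 if all values are equal
    codes = [round((x - lo) / step) for x in logs]
    return codes, lo, step

def dequantize_variance(codes, lo, step, eps=1e-12):
    return [max(math.exp(lo + c * step) - eps, 0.0) for c in codes]

# Hypothetical optimizer states: momentum is symmetric and small,
# variance spans eight orders of magnitude.
momentum = [0.5, -0.25, 0.1, -0.5, 0.001, 0.002, -0.003, 0.0]
variance = [1e-8, 1e-4, 1.0]

m_codes, m_scales = quantize_momentum(momentum)
m_restored = dequantize_momentum(m_codes, m_scales)
v_codes, v_lo, v_step = quantize_variance(variance)
v_restored = dequantize_variance(v_codes, v_lo, v_step)
```

The design point the sketch illustrates is that a linear grid wastes almost all of its 256 levels on the largest variance values, whereas a grid that is uniform in log space keeps the relative error roughly constant across the whole range.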

2605.17528 2026-05-19 cs.LG cs.AI cs.CL

CausalSynth: Generating Structurally Sound Synthetic Data

Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

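Affiliations: Department of Computer Science, University of Oxford; Institute of Logic and Computation, TU Wien

AI summary: This paper proposes the CausalSynth framework, which decouples causal structure generation from semantic realization to generate synthetic data that is both consistent with causal mechanisms and semantically rich, addressing the inability of LLMs to guarantee causal correctness when generating synthetic data.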

Comments 15 pages

Abstract

Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM), a tuple of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem, the systematic tendency of LLMs to override imposed causal facts with pre-training priors, and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal $\alpha=0.05$ level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.
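
The first phase described in the abstract above, ancestral sampling from an SCM, can be sketched as follows: visit the DAG in topological order, draw one exogenous noise value per variable, and evaluate each structural equation on its parents' values. The four-variable graph, the probabilities, and all names below are a hypothetical toy chosen for illustration; they are not the ASIA benchmark's actual parameters nor the paper's code.

```python
import random
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical toy DAG: each node maps to the list of its parents.
PARENTS = {
    "smoking": [],
    "cancer": ["smoking"],
    "bronchitis": ["smoking"],
    "xray": ["cancer"],
}

# Structural equations: (parent values, exogenous noise u in [0, 1)) -> value.
EQUATIONS = {
    "smoking": lambda pa, u: int(u < 0.3),
    "cancer": lambda pa, u: int(u < (0.2 if pa["smoking"] else 0.02)),
    "bronchitis": lambda pa, u: int(u < (0.6 if pa["smoking"] else 0.3)),
    "xray": lambda pa, u: int(u < (0.9 if pa["cancer"] else 0.05)),
}

def sample_skeleton(rng, noise=None):
    """Ancestral sampling: parents are always assigned before children.

    If noise is given, it is replayed instead of being drawn afresh, which
    is what retaining noise for counterfactual queries would rely on.
    """
    values, used_noise = {}, {}
    for node in TopologicalSorter(PARENTS).static_order():
        u = noise[node] if noise is not None else rng.random()
        used_noise[node] = u
        parent_values = {p: values[p] for p in PARENTS[node]}
        values[node] = EQUATIONS[node](parent_values, u)
    return values, used_noise

rng = random.Random(0)
skeleton, noise = sample_skeleton(rng)
replayed, _ = sample_skeleton(rng, noise=noise)
```

A skeleton produced this way is a plain dictionary of variable assignments; in the framework the abstract describes, such a skeleton would then be handed to the LLM realizer and checked by the verification module.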

2605.17527 2026-05-19 cs.CV

Designing streetscapes from street-view imagery using diffusion models

Yuzhou Chen, Yuebing Liang, Lingqian Hu, Kailai Sun, Qingqi Song, Chang Zhao, Shenhao Wang

Affiliations: Department of Urban and Regional Planning, University of Florida; Singapore-MIT Alliance for Research and Technology Centre (SMART); Department of Landscape Architecture and Urban Planning, Texas A&M University; Department of Agronomy, University of Florida

AI summary: This paper proposes a generative multimodal AI framework that generates alternative streetscapes conditioned on target visual metrics, improving visual exploration in urban planning and design.

Abstract

Street-view imagery (SVI) is widely used to quantify key indicators of urban environment, such as greenery, sky, or road view indices. However, existing studies largely focus on measuring current streetscapes and rarely support the generation of alternative and non-existing urban scenarios, which is a core task in geospatial disciplines such as urban planning and design. To address this gap, we propose a generative multimodal AI framework that synthesizes alternative streetscapes conditioned on targeted visual metrics, enabling direct visual exploration of urban scenarios. We first construct a multimodal dataset that aligns SVIs with textual descriptions, segmentation maps, road masks, and quantitative metrics of visual elements in Chicago and Orlando. Using this dataset, we demonstrate that diffusion models can produce realistic and semantically consistent streetscape imagery while responding to both textual and imagery controls. Our quantitative evaluations show that incorporating visual controls can improve semantic consistency, reducing the LPIPS index by approximately 6% while maintaining global visual realism. In addition, overall semantic consistency increases by 23.7% in Orlando and 46.4% in Chicago, as measured by the mIoU index, with class-wise gains exceeding even 100% improvement for building view indices. Streetscape generation can be controlled in a fine-grained manner by both visual and textual prompts, and when textual and visual controls conflict, imagery controls consistently dominate, indicating a clear control hierarchy and the importance of further developing visual controls for urban scene generation. Overall, this work establishes an important benchmark for streetscape generation using SVIs and diffusion models, and illustrates how generative AI can serve as a practical, scalable, and controllable approach for urban scenario exploration.
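
The view indices and the mIoU score mentioned in the abstract above are both computed from segmentation maps. As a rough illustration, a view index can be read as the fraction of pixels assigned to one class, and mIoU as the class-averaged intersection-over-union between a target map and the map of a generated image. The class ids and the tiny 2×3 maps are hypothetical; the paper's exact definitions may differ.

```python
def view_index(segmentation, class_id):
    """Fraction of pixels in a segmentation map labelled class_id."""
    total = sum(len(row) for row in segmentation)
    hits = sum(row.count(class_id) for row in segmentation)
    return hits / total

def mean_iou(target, generated, class_ids):
    """Mean intersection-over-union across classes present in either map."""
    ious = []
    for c in class_ids:
        inter = union = 0
        for t_row, g_row in zip(target, generated):
            for t, g in zip(t_row, g_row):
                inter += (t == c and g == c)
                union += (t == c or g == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Hypothetical class ids: 0 = sky, 1 = greenery, 2 = road.
target = [[0, 0, 1],
          [2, 2, 1]]
generated = [[0, 1, 1],
             [2, 2, 2]]

green_view = view_index(target, 1)
miou = mean_iou(target, generated, class_ids=[0, 1, 2])
```

In this toy case the per-class IoUs are 1/2, 1/3, and 2/3, so the mean is 0.5, and greenery covers one third of the target map.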

2605.17522 2026-05-19 cs.RO

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

Sixu Lin, Junliang Chen, Huaiyuan Xu, Zhuohao Li, Guangming Wang, Yixiong Jing, Sheng Xu, Runyi Zhao, Brian Sheil, Lap-Pui Chau, Guiliang Liu

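Affiliations: School of Data Science, The Chinese University of Hong Kong (Shenzhen); The Hong Kong Polytechnic University; University of Cambridge; Shenzhen Loop Area Institute

AI summary: This paper proposes RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space, enabling efficient real-time flow-guided robotic manipulation and improving manipulation success rates and computational efficiency.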

Abstract

Planning and acting in 3D environments is a fundamental capability for robotic manipulation in the real world. Although prior work has explored predictive flow planners to guide 3D manipulation, existing approaches often rely on modular pipelines stacking multiple submodels, resulting in high computational overhead and limited real-time performance. To address these challenges, we introduce RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. As an end-to-end framework, RoboFlow4D directly predicts multi-frame 3D flows from visual observations and textual instructions, providing explicit flow-based planning to guide action generation. This design allows seamless integration with general action policies, forming an efficient observation-planning-execution closed loop. Through slow-fast collaboration between flow prediction and action control, RoboFlow4D enables real-time and resource-efficient manipulation. Extensive experiments in both simulation and real-world settings demonstrate that RoboFlow4D consistently improves manipulation success rates and computational efficiency, advancing flow-guided planning for embodied intelligence.

2605.17517 2026-05-19 cs.RO

AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

Weijie Kong, Zhian Su, Wei Yu, Huixu Dong

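Affiliations: Grasp Lab, School of Mechanical Engineering of Zhejiang University

AI summary: This paper proposes the AffordVLA framework, which injects manipulation-centric affordance representations into vision-language-action models through implicit feature alignment to improve action accuracy; experiments show that it outperforms existing methods in simulation and in the real world.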

Comments 13 pages, 10 figures

Abstract

Recent advances in Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. However, the visual representations of most VLA models are often dominated by global object appearance and struggle to focus on task-relevant functional interaction regions, which limits their robustness in unstructured environments. Existing affordance-based methods typically rely on explicit mask injection or external perception modules, requiring additional annotations while introducing cascading perception errors and inference overhead. To address these limitations, we propose AffordVLA, an affordance-enhanced VLA framework that internalizes manipulation-centric affordance perception into VLA visual representations through implicit representation alignment. Specifically, we construct a zero-shot affordance teacher to extract task-conditioned affordance visual representations from RGB observations and language instructions. AffordVLA aligns the intermediate visual representations of the VLA with the affordance visual representations extracted by the teacher, thereby implicitly injecting manipulation-centric affordance perception into VLA visual representations and improving action accuracy. Extensive simulation and real-world experiments demonstrate that AffordVLA and its affordance teacher achieve state-of-the-art performance and outperform strong baselines. Ablation analyses show that AffordVLA effectively reshapes VLA visual representations while preserving inference efficiency, leading to improved manipulation success rates and training efficiency.

2605.17508 2026-05-19 cs.LG cs.AI

BESplit: Bias-Compensated Split Federated Learning with Evidential Aggregation

Yuhan Xie, Chen Lyu, Jingrong Huang

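Affiliations: MoE Key Laboratory of Interdisciplinary Research of Computation and Economics; Shanghai University of Finance and Economics

AI summary: This paper proposes the BESplit framework, which uses evidential aggregation and bias-compensated collaboration to address biased optimization and unstable convergence in split federated learning under non-IID data, improving model accuracy and efficiency.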

Abstract

Split Federated Learning (SFL) enables privacy-preserving collaborative training by partitioning models between clients and a server. However, under non-IID data distributions, SFL often suffers from biased optimization and unstable convergence, while existing solutions largely adapt techniques from conventional federated learning. In this work, we observe that the split architecture of SFL inherently alters how client information is represented and coordinated, opening opportunities for bias compensation beyond parameter-level aggregation. Based on this insight, we propose BESplit, an architecture-aware framework that exploits the intrinsic structure of SFL to mitigate non-IID effects. First, to prevent biased local data from dominating global updates, we introduce Evidential Aggregation (EA) to perform fine-grained reweighting of client contributions based on evidential uncertainty. Second, to further reduce distributional skew, we develop Bias-Compensated Collaboration (BCC) to align split-layer representations by pairing complementary clients. Finally, Dual-Teacher Distillation (DTD) is incorporated to synchronize knowledge between decoupled client and server models, enabling independent local inference. Extensive experiments on five benchmark datasets demonstrate that BESplit consistently outperforms state-of-the-art methods in accuracy, convergence stability, and computational efficiency under diverse non-IID settings.

2605.17506 2026-05-19 cs.CV

Degradation Frequency Curve: An Explicit Frequency-Quantified Representation for All-in-One Image Restoration

Xinghua Huang, Zhixiong Yang, Chen Wu, Shengxi Li, Shuaifeng Zhi, Yue Zhang, Qibin Hou, Xin Deng, Jingyuan Xia

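Affiliations: College of Electronic Science, National University of Defense Technology; College of Electronic Engineering, Beihang University; VCIP, School of Computer Science, Nankai University

AI summary: This paper proposes the Degradation Frequency Curve (DFC), a frequency-domain representation that explicitly quantifies degradation effects by measuring band-wise residual-to-degraded energy ratios, providing an effective representation basis for all-in-one image restoration and improving performance and generalization under complex degradation conditions.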

Abstract

A fundamental difficulty in all-in-one blind image restoration is that degradation is usually treated as an implicit factor hidden in degraded-to-clean mapping, rather than as an explicit object that can be measured and manipulated. This limitation becomes more pronounced under mixed, compound, or unseen degradation conditions, where degradation effects are hard to assign to predefined labels or task-specific parameters. We propose the Degradation Frequency Curve (DFC), a structured spectral representation that quantifies degradation responses by measuring band-wise residual-to-degraded energy ratios in the frequency domain. DFC converts visually entangled and hard-to-describe degradation effects into a measurable degradation coordinate space. Moreover, DFC can be adaptively decomposed into band-wise spectral tokens, allowing local degradation responses to be represented as reusable restoration priors. Based on this representation, we develop the DFC-guided Image Restorer (DFC-IR), a token-conditioned multi-scale framework that progressively estimates DFCs from intermediate restorations and uses the resulting spectral tokens to guide degradation-aware restoration in a coarse-to-fine manner. Extensive experiments on standard, composite, unseen, and real-world degradation benchmarks show that DFC provides an effective representation basis for all-in-one restoration, leading to state-of-the-art performance and improved generalization under complex degradation profiles.
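
One way to read the "band-wise residual-to-degraded energy ratio" in the abstract above is sketched below: take the residual between a degraded image and a clean reference, split the 2-D spectrum into radial frequency bands, and divide the residual's energy in each band by the degraded image's energy in the same band. This is an interpretation, not the paper's definition: the availability of a clean reference, the radial band layout, `n_bands`, and the `eps` stabiliser are assumptions, and the naive DFT is only suitable for tiny demo inputs.

```python
import cmath
import math

def dft2(image):
    """Naive 2-D discrete Fourier transform (O(N^4); tiny inputs only)."""
    h, w = len(image), len(image[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            total = 0j
            for y in range(h):
                for x in range(w):
                    angle = -2.0 * math.pi * (u * y / h + v * x / w)
                    total += image[y][x] * cmath.exp(1j * angle)
            out[u][v] = total
    return out

def band_energies(image, n_bands):
    """Spectral energy summed over radial frequency bands (low to high)."""
    h, w = len(image), len(image[0])
    spectrum = dft2(image)
    energies = [0.0] * n_bands
    max_radius = math.hypot(0.5, 0.5)  # largest normalised frequency
    for u in range(h):
        for v in range(w):
            fu = min(u, h - u) / h  # fold to |frequency| in [0, 0.5]
            fv = min(v, w - v) / w
            band = min(int(math.hypot(fu, fv) / max_radius * n_bands),
                       n_bands - 1)
            energies[band] += abs(spectrum[u][v]) ** 2
    return energies

def degradation_curve(clean, degraded, n_bands=4, eps=1e-12):
    """Per-band ratio of residual energy to degraded-image energy."""
    residual = [[d - c for d, c in zip(d_row, c_row)]
                for d_row, c_row in zip(degraded, clean)]
    res = band_energies(residual, n_bands)
    deg = band_energies(degraded, n_bands)
    return [r / (d + eps) for r, d in zip(res, deg)]

# Toy example: a flat clean image corrupted by checkerboard noise,
# which lives entirely at the highest spatial frequency.
clean = [[1.0] * 4 for _ in range(4)]
degraded = [[1.0 + 0.5 * (-1) ** (x + y) for x in range(4)] for y in range(4)]
curve = degradation_curve(clean, degraded)
```

For this input the curve is close to 0 in the lowest band and close to 1 in the highest band, which matches the intuition that the degradation is purely high-frequency.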

2605.17504 2026-05-19 cs.CV cs.AI

A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle

Guancheng Zhou, Yisi Luo, Zhengfu He, Zhenyu Jin, Xuyang Ge, Wentao Shu, Deyu Meng, Xipeng Qiu

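Affiliations: School of Mathematics and Statistics; Ministry of Education Key Lab of Intelligent Networks and Network Security; Shanghai Innovation Institute; Fudan University

AI summary: This paper proposes a distribution-based approach to visual mechanistic interpretability that balances interpretability and model faithfulness through a KL-minimal optimization problem, implements it with energy-guided diffusion posterior sampling, and validates its effectiveness on the DINOv3 model.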

Abstract

Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., top-$K$ activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.
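
The abstract above does not spell out its objective, but a generic KL-regularised problem of this kind has a well-known closed form, which may help in reading the "soft-constraint" and "energy-guided" wording. In the sketch below, $p$ is the natural image distribution, $a(x)$ is a feature activation, and $\lambda > 0$ is a trade-off weight; this notation and the specific objective are assumptions for illustration and may differ from the paper's formulation.

```latex
\begin{aligned}
q^\star &= \arg\min_{q}\; \mathrm{KL}(q \,\|\, p) - \lambda\, \mathbb{E}_{x \sim q}[a(x)]
  \quad \text{s.t. } \int q(x)\,dx = 1, \\
0 &= \log\frac{q^\star(x)}{p(x)} + 1 - \lambda\, a(x) + \mu
  \quad \text{(stationarity, with multiplier } \mu\text{)}, \\
q^\star(x) &= \frac{p(x)\, \exp(\lambda\, a(x))}{Z(\lambda)}, \qquad
Z(\lambda) = \int p(x)\, \exp(\lambda\, a(x))\, dx .
\end{aligned}
```

Under this reading, the solution stays close to natural images (small KL to $p$) while being tilted toward images that activate the feature, and $-\lambda\, a(x)$ plays the role of an energy that could guide a diffusion sampler.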

2605.17503 2026-05-19 cs.AI cs.CL cs.HC

RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

Enrico Collautti, Xiaopeng Mao, Luca Tonin, Stefano Tortora, Sadasivan Puthusserypady

Affiliations: IAS-LAB, Department of Information Engineering, University of Padova; Padova Neuroscience Center; Department of Health Technology, Technical University of Denmark

AI summary: This paper proposes a retrieval-augmented generation (RAG)-based EEG-to-text decoding method that combines an EEG encoder, a vector retrieval stage, and a large language model to improve sentence-level decoding accuracy, and validates it on the ZuCo dataset.

Comments 6 pages, 2 figures. Submitted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics

Abstract

The decoding of linguistic information from electroencephalography (EEG) signals remains an extremely challenging problem in brain-computer interface (BCI) research. In particular, sentence-level decoding from EEG is difficult due to the low signal-to-noise ratio of these recordings. Previous studies tackling this problem have typically failed to surpass random baseline performance unless teacher forcing is used during the inference phase. In this work, we propose a retrieval-augmented generation (RAG)-based sentence-level EEG-to-text decoding pipeline that combines an EEG encoder aligned with semantic sentence embeddings, a vector retrieval stage, and a large language model (LLM) to refine retrieved sentences into coherent output. Experiments are conducted on the Zurich Cognitive Language Processing Corpus (ZuCo) dataset, which contains single-trial EEG recordings collected during silent reading. To evaluate whether the system extracts meaningful information from these EEG signals, the results are compared with a random baseline. In nine subjects, the proposed pipeline outperforms the random baseline, achieving a mean cosine similarity of 0.181 ± 0.022 compared to 0.139 ± 0.029 for the baseline, corresponding to a relative improvement of 30.45%. Statistical analysis further confirms that this improvement is significant, following a strict evaluation workflow where inference is performed without access to ground-truth labels.
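
The vector retrieval stage and the cosine-similarity metric described in the abstract above can be illustrated with a small sketch: an embedding predicted from EEG is compared with stored sentence embeddings, and the closest sentences are returned for the LLM to refine. The three-dimensional vectors, the corpus sentences, and the function names are hypothetical; a real system would use a trained EEG encoder and a sentence-embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query, corpus, k=2):
    """Return the k sentences whose embeddings are most similar to query."""
    ranked = sorted(corpus,
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

# Hypothetical sentence bank: (sentence, embedding) pairs.
corpus = [
    ("the cat sat", [1.0, 0.0, 0.0]),
    ("stock prices fell", [0.0, 1.0, 0.0]),
    ("a dog ran", [0.9, 0.1, 0.0]),
]

# Hypothetical embedding produced by an EEG encoder for one trial.
eeg_embedding = [1.0, 0.05, 0.0]
candidates = retrieve(eeg_embedding, corpus, k=2)
```

Here the two animal sentences are retrieved ahead of the finance sentence, and the same `cosine_similarity` function could score a decoded sentence embedding against the embedding of the sentence that was actually read.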

2605.17500 2026-05-19 cs.LG cs.CV

The Silent Brush: Evaluating Artistic Style Leakage in AI Art Generation

沉默的画笔:评估AI艺术生成中的艺术风格泄露

Ninad Joshi, Ashutosh Ranjan, Vivek Srivastava, Shirish Karande

发表机构 * TCS Research(TCS研究)

AI总结 本文研究了AI艺术生成中由于模型学习并复现艺术风格而产生的无意风格复现问题,提出了一种评估方法Art Arena,用于衡量艺术作品的编码强度、交互情况以及在无明确提示的情况下风格特征的重现频率。

详情
AI中文摘要

生成式文本到图像模型通常是在大规模网络爬取数据集上训练的,这些数据集包含多样化的视觉内容,如受版权保护和风格独特的艺术品,引发了关于所有权、归属和受保护视觉表达的无意重用的担忧。一个关键问题是,模型可以从这些数据中学习风格模式,并在生成输出中复现这些模式,而无需在提示中显式引用。我们称这种现象为The Silent Brush,即使在未被请求的情况下,所学的风格也会再次出现。现有的评估方法主要集中在近似重复检索或成员推断,而没有考虑到这种跨提示的无意风格复现形式。为了解决这些差距,我们首先制定了评估The Silent Brush的指导原则。然后引入Art Arena评估协议,用于衡量艺术作品的编码强度、交互情况以及在无明确提示的情况下其风格特征在生成输出中重现的频率。我们对广泛使用的文本到图像扩散模型,包括Stable Diffusion v1.5、Stable Diffusion XL (SDXL)和SANA-1.5进行了评估,并设计使其能够跨文本到图像生成系统通用。我们的结果表明,The Silent Brush源于艺术作品之间表示强度和交互动态的差异,导致模型生成中的不对称混合。代码和评估资源可在:https://anonymous.4open.science/r/ArtArena-EBE4获取。

英文摘要

Generative text-to-image models are typically trained on large-scale web-scraped datasets that include diverse visual content such as copyrighted and stylistically distinctive artworks, raising concerns about ownership, attribution, and the unintended reuse of protected visual expressions. A key issue is that models can learn stylistic patterns from this data and reproduce them in generated outputs without any explicit reference in the prompt. We refer to this phenomenon as The Silent Brush, where such learned styles reappear even when they are not requested. Existing evaluation methods mainly focus on near-duplicate retrieval or membership inference and do not account for this form of unintended stylistic resurfacing across prompts. To address these gaps, we first formulate guiding principles for evaluation of The Silent Brush. We then introduce Art Arena, an evaluation protocol that measures how strongly artworks are encoded, how they interact, and how frequently their stylistic traits reappear in generated outputs without explicit mention in prompts. We evaluate Art Arena on widely used text-to-image diffusion models, including Stable Diffusion v1.5, Stable Diffusion XL (SDXL), and SANA-1.5, and design it to generalize across text-to-image generative systems. Our results show that The Silent Brush arises from differences in representational strength and interaction dynamics between artworks, leading to asymmetric blending in model generations. Code and evaluation resources are available at: https://anonymous.4open.science/r/ArtArena-EBE4.

2605.17499 2026-05-19 cs.LG

T-GEMs: Text-Guided Exit Modules for Decreasing CLIP Image Encoder

T-GEMs: 基于文本引导的退出模块用于减少CLIP图像编码器

Alberto Presta, Grzegorz Stefanski, Michal Byra, Krzysztof Arendt

发表机构 * Samsung AI Center Warsaw(三星AI中心华沙) Institute of Fundamental Technological Research, PAS, Warsaw(基础技术研究所,波兰科学院,华沙)

AI总结 本文提出T-GEMs文本引导退出模块,通过利用编码器中间层的语义内容分布,减少CLIP图像编码器的计算成本,同时保持跨模态理解性能。

Comments Accepted at ICASSP 2026

详情
AI中文摘要

多模态深度神经网络通过整合多种数据模态来增强深度理解。不同模态的数据通常被投影到共享的潜在空间中进行相似性计算,但由于图像编码器规模庞大,且预测时对所有测试数据一视同仁地完整处理,这一过程十分耗费资源。早期退出方法通过利用中间层来减少计算负载,节省时间和内存。然而,对于像图像-文本对这样的多模态数据,开发此类方法具有挑战性。本研究探讨了CLIP等编码器中间层中存在的语义内容分布,这些分布可以从文本描述中推导出来。我们引入了文本引导退出模块(T-GEMs)和基于速率的正则化器,以控制编码器的使用成本,同时保持跨模态理解性能。

英文摘要

Multimodal deep neural networks enhance deep comprehension by integrating diverse data modalities. Data from different modalities are typically projected into a shared latent space for similarity computation, but this process is resource intensive due to large image encoders and equal processing of test data during prediction. Early exit methods reduce computational load by utilizing intermediate layers, saving time and memory. However, developing such methods is challenging for multimodal data like image-text pairs. This study investigates the semantic content distributions present in intermediate layers of encoders such as CLIP, which can be derived from textual descriptions. We introduce Text-Guided Exit Modules (T-GEMs) and a rate-based regularizer to control encoder usage costs while maintaining cross-modal understanding performance.

2605.17497 2026-05-19 cs.LG

Self-Supervised On-Policy Distillation for Reasoning Language Models

自监督在线策略蒸馏用于推理语言模型

Zhiquan Tan, Yinrong Hong

发表机构 * Tsinghua University(清华大学) Beihang University(北京航空航天大学)

AI总结 本文提出自监督在线策略蒸馏(SSOPD)方法,通过对比正确与错误的完成过程信号,提升推理语言模型的表现,实验表明在多个基准测试中优于GRPO和OPSD基线。

详情
AI中文摘要

GRPO风格的RLVR利用每个提示的多次在线策略尝试来训练推理模型,但通常仅通过终端奖励利用这些尝试。我们表明,混合组包含更丰富的过程信号:正确的完成是当前策略如何解决问题的自生成见证,而错误的完成则提供了策略需要纠正的在线策略前缀。我们引入自监督在线策略蒸馏(SSOPD),将以最短正确完成为条件的教师分布蒸馏到最长错误完成的前缀中。这将组内正确-错误对比转化为密集的过程监督,而无需外部解题轨迹。停止时间视角表明,最短正确/最长错误规则是“将持续失败编辑为快速成功动作”的有限组近似;提示级前沿权重则将辅助损失集中在正确与错误分支共存之处。在AIME 2024、AIME 2025和HMMT 2025上,SSOPD在全部九个模型-基准设置中均优于GRPO。在Qwen3-8B上,其宏平均Avg@12达到65.6,比GRPO高1.6个百分点,比以解答为条件的OPSD基线高0.8个百分点。代码将在https://github.com/tzq1999/SSOPD上发布。

英文摘要

GRPO-style RLVR trains reasoning models from multiple on-policy attempts per prompt, but typically uses these attempts only through terminal rewards. We show that a mixed group contains a richer process signal: a correct completion is a self-generated witness of how the current policy can solve the problem, while a wrong completion provides on-policy prefixes where the policy needs correction. We introduce Self-Supervised On-Policy Distillation (SSOPD), which distills a teacher distribution conditioned on the shortest correct completion into prefixes of the longest wrong completion. This converts intra-group correct-wrong contrast into dense process supervision without external solution traces. A stopping-time view motivates the shortest-correct / longest-wrong rule as a finite-group approximation to editing persistent failures toward fast-success actions, and a prompt-level frontier weight concentrates the auxiliary loss where correct and wrong branches coexist. Across AIME 2024, AIME 2025, and HMMT 2025, SSOPD improves over GRPO in all nine model-benchmark settings. On Qwen3-8B, it reaches a macro Avg@12 of 65.6, outperforming GRPO by 1.6 points and the solution-conditioned OPSD baseline by 0.8 points. Code will be released at https://github.com/tzq1999/SSOPD.
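
The shortest-correct / longest-wrong rule can be sketched as a simple selection over one group of sampled completions. This is a sketch under stated assumptions, not the authors' implementation: completions are represented as token lists with a correctness flag, and `frontier_weight` uses a hypothetical form, 4p(1-p), chosen only because it peaks when correct and wrong completions coexist; the abstract does not specify the actual weight.

```python
def select_distillation_pair(group):
    """Pick (shortest correct, longest wrong) from a group of completions.

    group: list of (tokens, is_correct) tuples sampled for one prompt.
    Returns None when the group has no correct-wrong contrast.
    """
    correct = [tokens for tokens, ok in group if ok]
    wrong = [tokens for tokens, ok in group if not ok]
    if not correct or not wrong:
        return None
    teacher_context = min(correct, key=len)  # shortest correct completion
    student_rollout = max(wrong, key=len)    # longest wrong completion
    return teacher_context, student_rollout


def frontier_weight(group):
    """Hypothetical prompt-level weight: largest when the group is evenly mixed."""
    p = sum(1 for _, ok in group if ok) / len(group)
    return 4.0 * p * (1.0 - p)


# Toy group of four completions for a single prompt.
example_group = [
    (["a", "b", "c"], True),
    (["a", "b"], True),
    (["x"], False),
    (["x", "y", "z", "w"], False),
]
example_pair = select_distillation_pair(example_group)
```

In the method as described, the teacher distribution would be conditioned on the first element of the pair and distilled into prefixes of the second; groups that are all correct or all wrong yield no pair and receive zero weight under this illustrative weighting.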

2605.17493 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph

Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE

超越线性叠加:利用KAN-SAE在AI天气模型中发现气候特征

Minjong Cheon

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 本文提出KAN-SAE,一种基于Kolmogorov-Arnold网络的稀疏自编码器,通过非线性激活函数揭示天气预测模型中的气候特征,相比线性基线提升了72%的活跃特征数量和降低了20%的特征冗余。

详情
AI中文摘要

深度学习天气预测模型在预测能力上表现出色,但其内部如何表示物理气候现象仍不明确。通过稀疏自编码器(SAEs)实现的机理可解释性提供了一种分解这些表示的有原则方法,但现有SAEs假设严格线性特征叠加,这与现代Transformer中编码的高度非线性大气动力学不匹配。我们引入KAN-SAE,一种稀疏自编码器,其编码器将标准ReLU替换为可学习的每特征B-样条激活函数,这些激活函数来自Kolmogorov-Arnold网络(KANs),使每个潜在维度能够发展出自己的非线性门控配置。应用于Sonny时,KAN-SAE发现了975个活跃特征(相比线性基线的566个,提升了72%),并具有20%更低的特征冗余和可比的重建保真度。在无任何气候监督的情况下,KAN-SAE识别出一个在西欧空间集中的可解释热浪特征,并通过因果操控实验验证了西太平洋台风追踪器。我们的结果表明,非线性激活对于深度学习天气预测模型的机理可解释性至关重要,恢复了对线性基线不可见的气候特征。

英文摘要

Deep learning weather prediction models achieve remarkable predictive skill yet remain largely opaque: we know little about how they represent physical climate phenomena internally. Mechanistic interpretability through Sparse Autoencoders (SAEs) offers a principled route to decomposing these representations, but existing SAEs assume strictly linear feature superposition - a constraint ill-suited for the highly nonlinear atmospheric dynamics encoded in modern transformers. We introduce KAN-SAE, a sparse autoencoder whose encoder replaces the standard ReLU with learnable per-feature B-spline activations drawn from Kolmogorov-Arnold Networks (KANs), allowing each latent dimension to develop its own nonlinear gating profile. Applied to Sonny, KAN-SAE discovers 975 alive features (vs. 566 for a linear baseline, a 72% improvement) with 20% lower inter-feature redundancy and comparable reconstruction fidelity. Without any climate supervision, KAN-SAE identifies an interpretable European heatwave feature spatially concentrated over western Europe, and a western Pacific typhoon tracker confirmed by causal steering experiments. Our results demonstrate that nonlinear activations are essential for mechanistic interpretability of deep learning weather prediction models, recovering climate features that remain invisible to linear baselines.

2605.17489 2026-05-19 cs.CV

Employing Vision-Language Models for Face Image Quality Assessment

利用视觉-语言模型进行人脸图像质量评估

Erdi Sarıtaş, Eren Onaran, Vitomir Štruc, Hazım Kemal Ekenel

发表机构 * Department of Computer Engineering, Istanbul Technical University(伊斯坦布尔技术大学计算机工程系) Faculty of Electrical Engineering, University of Ljubljana(卢布尔雅那大学电气工程学院) Division of Engineering, NYU Abu Dhabi(纽约大学阿布扎比分校工程学部)

AI总结 本文研究了利用现成的视觉-语言模型(VLMs)在零样本设置下进行人脸图像质量评估(FIQA)的潜力,通过综合评估框架评估VLM性能,并发现模型架构对生物识别效用性能有显著影响,VLMs的输出与传统方法一致,但生成的分数可能因提示而异。

详情
AI中文摘要

人脸图像质量评估(FIQA)是生物识别流水线中的关键控制步骤。它确保只有可靠的样本被处理,以保持系统精度。最先进的FIQA方法具有高实用性,但通常以“黑箱”方式运行:它们只输出标量分数,却不提供人类可解释的依据。这种透明性的缺乏限制了它们在人在回路场景(例如自动边境控制)中的有效性,因为这类场景需要可操作的反馈。在本文中,我们研究了现成的视觉-语言模型(VLMs)在零样本设置下进行FIQA的潜力,以弥合这一差距。我们提出了一个全面的评估框架来评估VLM性能,其中包括通过误差-拒绝曲线对传统FIQA方法进行基准测试。此外,我们使用从监控场景到合成生成的多样化数据集,分析了它们的可解释性、一致性和对提示变化的鲁棒性。我们的结果表明,生物识别效用性能在很大程度上取决于架构,而不仅仅取决于参数数量。大多数VLMs的输出与传统方法一致。我们还发现,VLMs的排名性能和生成的分数可能因提示而异。我们的合成消融研究显示,尽管增加参数数量可以提高内部一致性,但其退化检测性能不如较小的模型。这些发现表明,使用VLMs进行零样本FIQA分数估计很有前景,可以作为可解释性模块有效补充传统FIQA流水线。代码可在https://github.com/ThEnded32/VLM4FIQA.git获得。

英文摘要

Face Image Quality Assessment (FIQA) is a crucial control step in biometric pipelines. It ensures only reliable samples are processed to maintain system accuracy. State-of-the-art FIQA methods achieve high utility but typically operate as "black boxes." They produce scalar scores without human-interpretable justifications. This lack of transparency limits their effectiveness in human-in-the-loop scenarios, such as automated border control, where actionable feedback is essential. In this paper, we investigate the potential of off-the-shelf Vision-Language Models (VLMs) to bridge this gap by performing FIQA in a zero-shot setting. We present a comprehensive evaluation framework for assessing VLM performance. This involves benchmarking traditional FIQA methods through error-versus-reject curves. Additionally, using a diverse set of datasets, ranging from surveillance-oriented to synthetically generated, we analyzed their interpretability, consistency, and robustness to prompt changes. Our results show biometric utility performance depends significantly on architecture, not merely on parameter count. Most VLMs' outputs align with those of traditional methods. We also find that VLM ranking performance and the generated scores may vary across prompts. Our synthetic ablation study shows that while increasing the parameter count can improve internal consistency, it yields worse degradation-detection performance than smaller models. These findings suggest that zero-shot FIQA score estimation using VLMs is promising and could effectively complement conventional FIQA pipelines as an interpretability module. The codes are available at https://github.com/ThEnded32/VLM4FIQA.git.
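
The idea behind an error-versus-reject curve can be sketched as follows. This is a simplified illustration under stated assumptions: each sample carries one quality score and one binary error flag, whereas biometric evaluations typically compute the error over comparison pairs at a fixed decision threshold. The function name and toy data are hypothetical.

```python
def error_versus_reject(quality_scores, is_error, reject_fractions):
    """Error rate among retained samples after rejecting the lowest-quality ones.

    quality_scores: one quality score per sample (higher means better).
    is_error: 1 if the sample caused a recognition error, else 0.
    reject_fractions: fractions of samples to discard, e.g. [0.0, 0.25, 0.5].
    """
    # Indices sorted from lowest to highest quality.
    order = sorted(range(len(quality_scores)), key=lambda i: quality_scores[i])
    total = len(order)
    curve = []
    for fraction in reject_fractions:
        n_reject = int(fraction * total)
        kept = order[n_reject:]
        if not kept:
            curve.append((fraction, 0.0))  # nothing left to evaluate
            continue
        error_rate = sum(is_error[i] for i in kept) / len(kept)
        curve.append((fraction, error_rate))
    return curve


# Toy example: the two lowest-quality samples are the ones that fail.
toy_curve = error_versus_reject(
    quality_scores=[0.1, 0.2, 0.8, 0.9],
    is_error=[1, 1, 0, 0],
    reject_fractions=[0.0, 0.5],
)
```

A quality measure is useful in this sense when the error rate falls quickly as more low-scoring samples are rejected, which is the behaviour such a curve is meant to expose.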

2605.17488 2026-05-19 cs.CV cs.MM cs.SD

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

Omni-Customizer: 用于联合音频-视频生成的端到端多模态定制

Yuheng Chen, Qingdong He, Teng Hu, Yuji Wang, Yabiao Wang, Lizhuang Ma, Jiangning Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 本文提出Omni-Customizer,一种端到端多模态定制框架,旨在实现精确的多模态身份信息绑定和无缝融合,通过引入Omni-Context Fusion模块和Masked TTS Cross-Attention机制,提升多模态定制生成的性能。

详情
AI中文摘要

联合音频和视频生成的领域已因强大基础模型的出现而发生根本性变革。尽管取得了进展,但实现多模态定制,以在多个相互作用的主体中同时保持视觉身份和语音音色的一致性,仍然鲜有研究。为弥合这一差距,我们提出了Omni-Customizer,一种端到端框架,专门针对多模态身份信息的精确绑定和无缝融合。具体而言,我们引入了Omni-Context Fusion(OCF)模块,该模块能有效丰富基础文本提示,加入密集的多模态身份提示,同时引入Masked TTS Cross-Attention(MTP-CA)机制,专门设计以防止严重的"语音泄漏"问题。在该架构中,我们提出了语义锚定的多模态RoPE(SA-MRoPE),用于将视觉和音频参考标记以及TTS嵌入锚定到其对应的语义描述,从而实现结构化的多模态融合和稳健的身份绑定。此外,我们设计了一种全面的训练策略,结合交错音频-视频调度以快速适应多语言场景而不影响基础先验,以及渐进式内对到跨对课程学习以促进高阶和稳健的身份特征学习。大量实验表明,Omni-Customizer在双模态定制生成中实现了最先进的性能,其在视觉身份相似性、音色一致性、精确音频-视频同步以及整体视频-音频保真度方面均表现出色。

英文摘要

The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe "speech leakage" problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.

2605.17486 2026-05-19 cs.RO cs.LG

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

DyGRO-VLA: 通过动态分组残差优化实现跨任务的视觉-语言-动作模型扩展

Sixu Lin, Yunpeng Qing, Litao Liu, Ming Zhou, Ruixing Jin, Xiaoyi Fan, Guiliang Liu

发表机构 * School of Data Science, The Chinese University of Hong Kong (Shenzhen)(香港中文大学(深圳)数据科学学院) Shenzhen Loop Area Institute(深圳环城研究院) Zhejiang University(浙江大学) Rutgers University-New Brunswick(罗格斯大学新布朗斯维克分校) Shanghai AI Laboratory(上海人工智能实验室) Jiangxing Intelligence Technology Inc.(江行智能科技有限公司)

AI总结 本文提出DyGRO-VLA,一种通过动态分组残差优化实现跨任务视觉-语言-动作模型扩展的两阶段优化框架,旨在提升模型的泛化能力。

详情
AI中文摘要

最近在强化学习(RL)方面的进展提供了一种系统的方法来优化视觉-语言-动作(VLA)模型,推动了从轨迹模仿到任务环境中的主动学习的转变。尽管在控制精度上有所改进,大多数RL优化器仍然任务特定,这使VLA模型从通用控制器退化为过度拟合狭窄任务集的策略。在本研究中,我们深入分析了这一现象,并强调了跨任务特征表示对提高VLA模型泛化能力的重要性。受这一发现的启发,我们引入了DyGRO-VLA,一种两阶段优化框架,1)基于信息论原理有效地捕捉跨任务潜在表示,2)通过混合的RL残差动态优化策略。DyGRO-VLA使RL优化器能够在优化过程中利用任务相关的潜在信息,同时战略性地减轻对学习表示的不利干扰。我们在LIBERO、RoboTwin2基准以及现实世界中评估了我们的方法,证明了在多任务训练和分布偏移下,与强基线相比,我们的方法具有持续的改进。

英文摘要

Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to exploit task-relevant latent information while strategically mitigating adverse interference on the learned representations throughout the optimization process. We evaluate our approach on the LIBERO and RoboTwin2 benchmarks and further validate it in the real world, demonstrating consistent improvements over strong baselines under multi-task training and distribution shift.

2605.17483 2026-05-19 cs.CV

On Applicability of Synthetic Datasets for Facial Expression Recognition

关于合成数据集在面部表情识别中的适用性

Ali Azmoudeh, Erdi Sarıtaş, Ömer Yıldırım, Hazım Kemal Ekenel

发表机构 * Department of Computer Engineering, Istanbul Technical University(伊斯坦布尔技术大学计算机工程系) Department of Informatics, University of Zurich(苏黎世大学信息学系) Division of Engineering, NYU Abu Dhabi(纽约大学阿布扎比分校工程学部)

AI总结 本文研究了合成数据集在面部表情识别中的应用,提出三种隐私保护策略来构建平衡的数据集,并通过实验验证了合成数据在缓解类别不平衡和隐私限制方面的有效性。

详情
AI中文摘要

面部表情识别面临两个核心挑战。第一个是公共数据集中类别不平衡的问题,这会扭曲学习过程并削弱泛化能力。第二个是隐私和数据收集限制的问题,这限制了面部图像的共享并阻碍了大而平衡数据集的创建。为了解决这些问题,我们考察了三种互补的策略,用于在标准七种离散面部表情类别设置下构建隐私保护的面部表情识别(FER)数据集。我们的策略是:(i)在置信度阈值方案下使用教师模型对大规模未标记面部集合进行伪标签;(ii)使用扩散模型进行提示驱动的合成,条件于人口统计学属性;(iii)任务感知的基于GAN的表情编辑,该方法在保持身份和真实感的同时修改面部表情。在训练和评估中,我们采用了广泛使用的数据集,包括AffectNet、RAF-DB和FER2013。我们利用合成数据集DigiFace、DCFace和EmoNet-Face BIG作为伪标签的未标记源。此外,我们利用FFHQ数据集作为生成合成的来源。主要实验使用经典CNN主干网络IR50进行,我们还探索了更复杂的架构POSTERv1,以评估其可行性和鲁棒性。通过跨数据集评估,我们分析了每种策略在整理数据集中的权衡。研究结果展示了合成数据如何有效替代或与真实数据集结合,以缓解不平衡和隐私限制的问题。代码和生成数据集:https://www.github.com/AliAZ98/SyntFER

英文摘要

Facial Expression Recognition faces two core challenges. The first is class imbalance in public datasets, which skews the learning process and weakens generalization. The second is related to privacy and data collection constraints, which limit the sharing of facial images and restrict the creation of large, balanced datasets. To address these issues, we examine three complementary strategies for constructing privacy-preserving FER datasets in the standard seven discrete facial expression classes setting. Our strategies are: (i) pseudo-labeling large unlabeled face collections with a teacher model under a confidence-thresholding scheme, (ii) prompt-driven synthesis using diffusion models conditioned on demographic attributes, and (iii) task-aware GAN-based expression editing that modifies facial expression while preserving identity and realism. For training and evaluation, we employed widely adopted datasets, including AffectNet, RAF-DB, and FER2013. We utilized the synthetic datasets DigiFace, DCFace, and EmoNet-Face BIG as unlabeled sources for pseudo-labeling. Additionally, we utilized the FFHQ dataset as the source for generative synthesis. The main experiments are conducted using a classic CNN backbone, IR50, and we also explore a more complex architecture, POSTERv1, to assess its feasibility and robustness. Using cross-dataset evaluations, we analyze the trade-offs each strategy presents in curated datasets. The findings demonstrate how synthetic data can effectively substitute or be combined with real datasets to mitigate imbalance and privacy limitations. Code and generated datasets: https://www.github.com/AliAZ98/SyntFER
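
Strategy (i), pseudo-labeling under a confidence threshold, can be sketched in a few lines. The class list, the 0.9 threshold, and the function name are assumptions made for illustration; the abstract states only that seven discrete expression classes and a confidence-thresholding scheme are used.

```python
# Assumed class names; the abstract does not enumerate the seven classes.
EXPRESSIONS = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]


def pseudo_label(teacher_probabilities, threshold=0.9):
    """Keep only samples whose top teacher probability reaches the threshold.

    teacher_probabilities: one probability vector per unlabeled image,
        ordered like EXPRESSIONS.
    Returns a list of (sample_index, predicted_expression) pairs.
    """
    kept = []
    for index, probs in enumerate(teacher_probabilities):
        best = max(range(len(probs)), key=lambda c: probs[c])
        if probs[best] >= threshold:
            kept.append((index, EXPRESSIONS[best]))
    return kept


# Toy teacher outputs for three unlabeled images.
toy_probabilities = [
    [0.01, 0.01, 0.01, 0.94, 0.01, 0.01, 0.01],  # confident: happy
    [0.20, 0.10, 0.20, 0.10, 0.20, 0.10, 0.10],  # ambiguous: dropped
    [0.02, 0.01, 0.01, 0.01, 0.02, 0.92, 0.01],  # confident: sad
]
toy_labels = pseudo_label(toy_probabilities)
```

Raising the threshold trades dataset size for label reliability, and it can also shift the class balance because some expressions are predicted with lower confidence than others.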

2605.17481 2026-05-19 cs.CL

Hybrid Feature Combinations with CNN for Bangla Fake News Classification

结合CNN的混合特征组合用于孟加拉语虚假新闻分类

Md Gulzar Hussain, Babe Sultana, Md Rinku Ali

发表机构 * School of Software, Nanjing University of Information Science and Technology(南京信息工程大学软件学院) Department of Computer Science and Engineering, Green University of Bangladesh(孟加拉国绿色大学计算机科学与工程系) School of Computer Science and Artificial Intelligence, Changzhou University(常州大学计算机科学与人工智能学院)

AI总结 本文研究了在BanFakeNews-2.0数据集上使用CNN模型进行孟加拉语虚假新闻分类时,不同特征组合(语义、统计和字符级特征)对识别效果的影响,发现多特征组合能显著提升召回率和F1分数。

Comments Already accepted and presented in the 3rd International Conference on Big Data, IoT and Machine Learning (BIM 2025)

详情
AI中文摘要

如今,孟加拉国的人们越来越多地通过互联网和社交媒体获取日常新闻,而不是传统报纸。然而,这些平台上的虚假新闻传播对真实媒体的可信度构成了风险和挑战。尽管已有研究致力于检测孟加拉语虚假新闻,但该领域仍有改进空间。本研究探讨了在BanFakeNews-2.0数据集上使用CNN模型时,特征选择方法(如语义、统计和字符级特征及其组合)在识别虚假新闻中的有效性。本文的关键发现表明,与单独使用特征相比,结合多种特征显著提高了召回率和F1分数。本研究的代码可在此获取:https://github.com/gulzar09/Bn_FNews_H.Feature.

英文摘要

Nowadays, people in Bangladesh frequently rely on the internet and social media for daily news instead of traditional newspapers. However, the spread of false Bangla news through these platforms poses risks and challenges to the credibility of authentic media. Although several studies have been conducted on detecting Bangla fake news, there is still significant room for improvement in this area. To assist people, this research explores the effectiveness of feature selection approaches in identifying appropriate features, such as semantic, statistical, and character-level features, or their combinations, on the BanFakeNews-2.0 dataset for detecting Bangla fake news using a CNN model. In this paper, key findings reveal that combining multiple features significantly improves recall and F1-scores compared to using individual features alone. The code for this research is available at https://github.com/gulzar09/Bn_FNews_H.Feature.

2605.17478 2026-05-19 cs.CV

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

Mamba-VGGT: 通过外部滑动窗口Mamba记忆实现持久的长序列视频几何锚定Transformer

Tianchen Deng, Zhenxiang Xiong, Nailin Wang, Fangjinhua Wang, Jiuming Liu, Jianfei Yang, Hesheng Wang

发表机构 * Shanghai Jiao Tong University(上海交通大学) ETH Zurich(苏黎世联邦理工学院) Cambridge University(剑桥大学) Nanyang Technological University(南洋理工大学)

AI总结 本文提出Mamba-VGGT框架,通过引入滑动窗口Mamba内存模块,解决传统VGGT在长序列视频中几何遗忘和累积漂移问题,提升3D场景重建的精度与稳定性。

详情
AI中文摘要

视觉几何锚定Transformer(VGGT)在高保真3D场景重建中树立了新的基准。然而,随着序列长度增加,这些模型会出现灾难性的几何遗忘和累积漂移,主要原因是全局注意力的二次复杂度迫使其采用截断的时间窗口。为克服由此产生的几何漂移,我们提出了Mamba-VGGT,一种增强的VGGT框架,能够实现持久的长距离推理。我们的关键贡献是滑动窗口Mamba(SWM)记忆模块,该模块在时间窗口间维护显式的外部记忆标记。该模块利用选择性状态空间建模来提炼和传播全局几何先验,有效绕过了传统Transformer的内存限制。为了在不破坏预训练VGGT高度优化的空间特征的情况下整合这些长期时间线索,我们提出了一种零初始化空间记忆注入器。该注入器利用零卷积层,自适应地将持久记忆融合到patch token流中,确保结构稳定性和无缝特征对齐。大量实验表明,我们的方法在保持空间一致性和减少轨迹累积误差方面显著优于现有基于VGGT的方法。我们的工作为大规模3D环境中基于几何的世界建模提供了可扩展、线性复杂度的解决方案。

英文摘要

Visual Geometry Grounded Transformers (VGGT) have set new benchmarks in high-fidelity 3D scene reconstruction. However, as the sequence length increases, these models suffer from catastrophic geometric forgetting and accumulation drift, primarily due to the quadratic complexity of global attention which necessitates truncated temporal windows. To overcome the resulting geometric drift, we present Mamba-VGGT, an enhanced VGGT framework capable of persistent long-range reasoning. Our key contribution is a Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows. This module leverages selective state-space modeling to distill and propagate global geometric priors, effectively bypassing the memory constraints of traditional transformers. To integrate these long-term temporal cues without disrupting the highly optimized spatial features of the pre-trained VGGT, we propose a Zero-Init Spatial Memory Injector. Utilizing zero-convolutional layers, this injector adaptively fuses persistent memory into the patch token stream, ensuring structural stability and seamless feature alignment. Extensive experiments demonstrate that our approach significantly outperforms existing VGGT-based methods in maintaining spatial consistency and reducing trajectory accumulation errors. Our work provides a scalable, linear-complexity solution for geometry-grounded world modeling in extensive 3D environments.
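
The effect of zero initialization in the memory injector can be shown with a minimal sketch. It is an assumption-laden stand-in: a zero-initialized linear projection replaces the zero-convolution layer, tokens are plain Python lists, and the class name `ZeroInitInjector` is hypothetical. The point it illustrates is that, at initialization, the injected memory leaves the pre-trained features unchanged.

```python
class ZeroInitInjector:
    """Adds a projected memory token to a patch token via a residual path.

    The projection weights start at zero, so the injector is an identity
    mapping on the patch token until training updates the weights.
    """

    def __init__(self, dim):
        self.weight = [[0.0] * dim for _ in range(dim)]

    def __call__(self, patch_token, memory_token):
        # Linear projection of the memory token (all zeros at initialization).
        projected = [
            sum(w * m for w, m in zip(row, memory_token)) for row in self.weight
        ]
        # Residual fusion into the patch token stream.
        return [p + d for p, d in zip(patch_token, projected)]


injector = ZeroInitInjector(dim=2)
patch = [1.0, 2.0]
memory = [0.5, 0.25]
output_at_init = injector(patch, memory)  # identical to the patch token

# Simulate a training update to one weight; the memory now contributes.
injector.weight[0][0] = 1.0
output_after_update = injector(patch, memory)
```

Starting from an identity mapping is what lets new temporal cues be introduced gradually without disturbing the spatial features of the pre-trained backbone.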

2605.17477 2026-05-19 cs.RO

Rapid Vibration Suppression and Trajectory Tracking of a Serial Manipulator with Multi-Flexible Links

多柔性连杆串联机械臂的快速振动抑制与轨迹跟踪

Chengyi Wang, Yilong Huang, Ji Wang

发表机构 * School of Aerospace Engineering, Xiamen University(厦门大学航空航天工程学院)

AI总结 本文提出了一种基于反步法(backstepping)的输出反馈框架,用于快速抑制多连杆串联柔性机械臂的振动并实现末端跟踪,并通过DeepONet近似实现实时部署和可扩展性。

详情
AI中文摘要

柔性机器人机械臂(FRMs)在轻量化设计和大工作空间方面具有优势,但其结构柔性会引发振动、加速疲劳、降低跟踪性能并限制运行速度。这些挑战在多连杆串联机械臂中进一步加剧,因为整体长度的增加导致结构柔性更大。本文提出了一种反步法(backstepping)输出反馈框架,用于n自由度串联柔性机械臂(nDSFMR)的快速振动抑制和末端跟踪,并采用基于DeepONet的近似方法以便实际部署。每个连杆-关节被建模为与ODE耦合的Timoshenko梁,并转换为具有边界动态的规范双曲型PDE。在关节处设计了基于反步法的边界控制器,其效果等价于沿梁注入分布式阻尼,从而仅利用可用的边界测量即可实现快速振动抑制和轨迹跟踪。为了实现实时实施和可扩展性,引入了DeepONet神经算子来近似反步核,显著降低了计算成本,并便于在变化的运行条件下快速更新控制器。在双连杆柔性机械臂上的实验表明,与带前馈控制的线性二次调节器(LQR)相比,该方法的振动抑制更快,末端执行器也更快收敛到期望轨迹。

英文摘要

Flexible robotic manipulators (FRMs) offer advantages in lightweight design and large workspace, but their structural flexibility induces vibrations, accelerates fatigue, degrades tracking performance, and limits operational speed. These challenges are further amplified in multi-link serial manipulators, where increased overall length leads to greater structural flexibility. This article presents a backstepping output-feedback framework for fast vibration suppression and tip tracking of an n-degree-of-freedom serial flexible manipulator robot (nDSFMR), with a DeepONet-based approximation for practical deployment. Each link-joint is modeled as a Timoshenko beam coupled with an ODE and transformed into a canonical hyperbolic PDE with boundary dynamics. A backstepping-based boundary controller at the joint is developed to equivalently inject distributed damping along the beam, enabling rapid vibration suppression and trajectory tracking, only using available boundary measurements. To enable real-time implementation and scalability, a DeepONet neural operator is introduced to approximate the backstepping kernels, significantly reducing computational cost and facilitating fast controller updates under varying operating conditions. Experiments on a two-link flexible manipulator demonstrate faster vibration suppression and convergence of the end-effector to the desired trajectory, compared with a linear quadratic regulator (LQR) with feedforward control.

2605.17467 2026-05-19 cs.CL

VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems

VerifyMAS: LLM多智能体系统中故障归因的假设验证

Hezhe Qiao, Hanghang Tong, Ee-Peng Lim, Bing Liu, Guansong Pang

发表机构 * Singapore Management University(新加坡管理大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Illinois at Chicago(伊利诺伊大学芝加哥分校)

AI总结 本文提出VerifyMAS框架,通过验证假设的方法对LLM多智能体系统中的故障进行归因,解决了现有方法在全局故障识别和细粒度归因方面的不足,实验表明其在多种模型上均优于现有方法。

Comments 22 pages

详情
AI中文摘要

大型语言模型驱动的多智能体系统(LLM-MAS)在复杂任务中表现出色,但不可靠的智能体仍是系统可靠性的重要瓶颈。自动故障归因因此至关重要,但现有方法如直接预测智能体错误对和智能体优先故障归因依赖于本地日志,无法识别仅在完整交互轨迹中显现的全局故障,如跨步不一致和智能体间协调错误。此外,直接预测故障会引入大规模组合搜索空间,阻碍细粒度归因。为了解决这些挑战,我们提出了VerifyMAS,一种用于智能体故障归因的假设验证框架。不同于直接预测故障智能体和错误类型,VerifyMAS针对完整轨迹验证故障假设。这种基于验证的方法将归因分解为轨迹级错误验证和细粒度智能体定位,提供了一种以错误优先的归因方法,能够捕捉全局故障模式,同时显著减少搜索空间。我们进一步引入基于结构化错误分类学的假设数据构建策略,并对专用LLM验证器模型进行微调,用于轨迹级故障验证和智能体归因。在Aegis-Bench和Who&When上的实验表明,VerifyMAS在多种基础模型上均表现优异,包括开源Qwen和基于API的GPT模型,在不牺牲长多智能体轨迹推理效率的情况下,优于现有方法。

英文摘要

Large language model-driven multi-agent systems (LLM-MAS) excel at complex tasks, yet unreliable agents remain a key bottleneck to system-level reliability. Automatic failure attribution is therefore critical, but existing approaches, such as direct prediction of agent-error pairs and agent-first failure attribution, rely on local logs of agents and miss global failures that only manifest over full interaction trajectories, such as cross-step inconsistencies and inter-agent coordination errors. Moreover, directly predicting failures induces a large combinatorial search space, hindering fine-grained attribution. To address these challenges, we propose VerifyMAS, a hypothesis verification framework for agent failure attribution. Instead of directly predicting faulty agents and error types, VerifyMAS formulates and verifies failure hypotheses against full trajectories. This verification-based approach decomposes attribution into trajectory-level error validation and fine-grained agent localization, providing an error-first attribution approach that captures global failure patterns while substantially reducing the search space. We further introduce a hypothesis-based data construction strategy grounded in a structured error taxonomy and fine-tune a specialized LLM verifier model for trajectory-level failure verification and agent attribution. Experiments on Aegis-Bench and Who&When show that VerifyMAS consistently improves diverse backbone models, including open-source Qwen and API-based GPT models, outperforming prior methods without sacrificing inference efficiency for long multi-agent trajectories.

2605.17465 2026-05-19 cs.LG

TriOpt: A Scalable Algorithm for Linear Causal Discovery

TriOpt: 一种适用于线性因果发现的可扩展算法

Rafat Ashraf Joy, Elena Zheleva

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本文提出TriOpt算法,通过整合顺序方法和连续优化方法,解决了高维线性因果发现中的可扩展性问题,实现了显著的速度提升且保持了较高的准确性。

详情
AI中文摘要

从观测数据中学习因果关系具有挑战性,因为图搜索空间随着变量数量的增加而呈超指数增长。基于排序的方法通过首先确定拓扑顺序来缩小此空间,而连续优化方法将DAG学习转化为带无环性约束的可微目标,从而探索空间中最可能的区域。尽管这两类范式在概念上具有吸引力,但在高维设置中都面临显著的可扩展性限制,限制了其实际应用。在本文中,我们提出了一种新的线性因果发现形式,将这两类范式紧密结合,在不牺牲准确性的情况下显著提升可扩展性。我们的方法TriOpt将问题分解为两个高效的阶段。首先,它利用Sherman-Morrison秩1降阶更新(downdate)和线性核的加法结构来恢复拓扑顺序,从而实现快速且可扩展的顺序估计。其次,在给定此顺序的情况下,我们将结构学习重新表述为一个凸的连续优化问题,完全避免了施加代价高昂的无环性约束。我们从理论上证明,在真实顺序下,TriOpt可以精确恢复潜在的线性DAG。在合成、半合成和真实数据集上的实验表明,TriOpt在高维情形下相对于最先进的线性因果发现方法实现了数量级的加速,同时保持了相当或更优的准确性。

英文摘要

Learning causal relations from observational data is challenging because the graph search space grows super-exponentially with the number of variables. Ordering-based methods reduce this space by first identifying the topological ordering, whereas continuous optimization methods explore most likely regions of the space by casting DAG learning as a differentiable objective with an acyclicity constraint. Despite their conceptual appeal, both paradigms face significant scalability limitations in high-dimensional settings, restricting their practical applicability. In this work, we introduce a new formulation for linear causal discovery that tightly integrates these two paradigms to achieve substantial gains in scalability without sacrificing accuracy. Our approach, TriOpt, decomposes the problem into two efficient stages. First, it recovers the topological ordering by exploiting the Sherman-Morrison rank-1 downdate together with the additive structure of linear kernels, enabling fast and scalable ordering estimation. Second, given this ordering, we reformulate structure learning as a convex continuous optimization problem that entirely avoids the need for enforcing costly acyclicity constraints. We theoretically show that, under the true ordering, TriOpt exactly recovers the underlying linear DAG. Empirically, across synthetic, semi-synthetic, and real-world datasets, TriOpt achieves orders-of-magnitude speedups over state-of-the-art linear causal discovery methods in high-dimensional regimes, while maintaining comparable or superior accuracy.
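
The Sherman-Morrison rank-1 downdate mentioned in the first stage is a standard identity: given the inverse of A, the inverse of (A - u vᵀ) is A⁻¹ + (A⁻¹ u vᵀ A⁻¹) / (1 - vᵀ A⁻¹ u), which avoids a fresh matrix inversion. The sketch below implements only this identity in plain Python on a small example; how TriOpt applies it to ordering estimation is not described in the abstract, so nothing here should be read as the authors' algorithm. The function name is illustrative.

```python
def sherman_morrison_downdate(a_inv, u, v):
    """Return the inverse of (A - u v^T), given a_inv = A^{-1}.

    Uses (A - u v^T)^{-1} = A^{-1} + (A^{-1} u)(v^T A^{-1}) / (1 - v^T A^{-1} u).
    """
    n = len(a_inv)
    # Column vector A^{-1} u.
    a_inv_u = [sum(a_inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    # Row vector v^T A^{-1}.
    v_a_inv = [sum(v[k] * a_inv[k][j] for k in range(n)) for j in range(n)]
    denominator = 1.0 - sum(v[k] * a_inv_u[k] for k in range(n))
    if abs(denominator) < 1e-12:
        raise ValueError("downdated matrix is singular or nearly singular")
    return [
        [a_inv[i][j] + a_inv_u[i] * v_a_inv[j] / denominator for j in range(n)]
        for i in range(n)
    ]


# A = diag(2, 4), so A^{-1} = diag(0.5, 0.25); remove the rank-1 term u v^T.
a_inverse = [[0.5, 0.0], [0.0, 0.25]]
downdated_inverse = sherman_morrison_downdate(a_inverse, [1.0, 1.0], [1.0, 1.0])
```

Here A - u vᵀ is [[1, -1], [-1, 3]], whose inverse is [[1.5, 0.5], [0.5, 0.5]]. The update costs O(n²) per rank-1 change instead of the O(n³) of a full inversion, which is the kind of saving that makes repeated updates practical in high dimensions.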

2605.17458 2026-05-19 cs.LG

ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks

ClaHF:一种基于人类反馈的强化学习框架,用于改进分类任务

Tianxiang Xu, Xiaoyan Zhu, Xin Lai, Jiayin Wang

发表机构 * School of Computer Science and Technology, Xi’an Jiaotong University(西安交通大学计算机科学与技术学院)

AI总结 本文提出ClaHF,一种基于人类反馈的强化学习框架,用于改进文本分类任务,通过整合偏好建模和强化学习优化,无需额外人工标注,在分类流程中提升分类性能和置信度校准。

详情
AI中文摘要

文本分类模型通常通过监督微调(SFT)进行训练。然而,SFT本质上是从实例级标签进行行为克隆,因此无法充分捕捉样本之间的相对偏好关系,这限制了模型塑造决策边界和校准预测置信度的能力。在本文中,我们提出ClaHF,一种受人类反馈启发的强化学习(RL)框架,用于文本分类,该框架在分类流程中整合了偏好建模和RL优化,而无需额外的人工标注。与以往仅依赖实例级监督的工作不同,ClaHF同时构建多个候选预测及其相对排名关系,并在奖励模型(RM)中联合建模Top-1偏好以及非最优候选之间的顺序。这种设计将传统的标签监督转换为可以直接应用于策略优化的偏好信号。我们在八个分类任务上进行了系统评估,涵盖三种场景类别。结果表明,ClaHF在各种语言模型(LMs)上一致提升了分类性能和置信度校准。数据和代码可在https://anonymous.4open.science/r/ClaHF上获取。

英文摘要

Text classification models are typically trained via supervised fine-tuning (SFT). However, SFT essentially performs behavior cloning from instance-wise labels and thus fails to adequately capture relative preference relations among samples, which limits the model's ability to shape decision boundaries and calibrate predictive confidence. In this paper, we propose ClaHF, a human feedback-inspired reinforcement learning (RL) framework for text classification that integrates preference modeling and RL optimization into the classification pipeline without requiring additional human annotations. Unlike prior work that relies solely on instance-wise supervision, ClaHF constructs multiple candidate predictions together with their relative ranking relations, and jointly models the Top-1 preference and the ordering among non-optimal candidates within a reward model (RM). This design converts conventional label supervision into preference signals that are directly applicable to policy optimization. We conduct systematic evaluations on eight classification tasks spanning three categories of scenarios. Results demonstrate that ClaHF consistently improves both classification performance and confidence calibration across diverse language models (LMs). The data and code are available at https://anonymous.4open.science/r/ClaHF.

2605.17456 2026-05-19 cs.CV cs.AI

GCE-MIL: Faithful and Recoverable Evidence for Multiple Instance Learning in Whole-Slide Imaging

Xiangyu Li, Ran Su

Affiliations * College of Intelligence and Computing

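AI summary This study proposes GCE-MIL, which directly optimizes the S/N/R criteria to improve predictive performance and evidence quality for multiple instance learning in whole-slide imaging, raising Macro-F1 and C-index and narrowing the continuous-discrete gap.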

Comments 10 pages, 17 figures, 24 tables

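Details
AI abstract (translated from Chinese)

Multiple instance learning (MIL) is the standard approach for whole-slide image (WSI) classification and survival prediction, in which attention-based models aggregate patch features into slide-level predictions. These models treat attention weights as evidence for their predictions, but attention is optimized for classification rather than for identifying the patches that actually support the diagnosis. This conflation leads to three failures: the selected patches are insufficient (keeping only them lowers Macro-F1 by 0.078), unnecessary (removing them barely changes the prediction), and unrecoverable (continuous attention scores disagree with the discrete patch subsets used at inference). The central premise is that evidence quality should be optimized directly through explicit criteria, namely sufficiency, necessity, and recoverability (S/N/R), rather than inherited as a byproduct of classification. GCE-MIL is a backbone-agnostic wrapper implemented through three injection modes and three evidence components: a grounding mechanism that aligns selection with domain-specific concepts, noisy-OR coverage that serves as a differentiable proxy, and threshold-plus-repair recovery that converts continuous selectors into discrete subsets through marginal-guided repair. Across 9 backbones and 9 datasets (81 configurations), GCE-MIL improves average Macro-F1 by 0.024 and C-index by 0.014, reduces the continuous-discrete gap by 4-7, and increases complement degradation by 2-4. With optional patch prefiltering after discrete recovery, inference is up to 5 times faster while retaining 0.989 full-bag utility.

English abstract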

Multiple instance learning (MIL) is the standard approach for whole-slide image (WSI) classification and survival prediction, where attention-based models aggregate patch features into slide-level predictions. These models treat attention weights as evidence for their predictions, but attention is optimized for classification, not for identifying which patches actually support the diagnosis. This conflation leads to three failures: selected patches are insufficient (keeping them alone drops Macro-F1 by 0.078), unnecessary (removing them barely changes the prediction), and unrecoverable (continuous attention scores disagree with discrete patch subsets used at inference). The central premise is that evidence quality should be optimized directly through explicit criteria, namely Sufficiency, Necessity, and Recoverability (S/N/R), rather than inherited as a byproduct of classification. GCE-MIL is a backbone-agnostic wrapper implemented through three injection modes and three evidence components: a grounding mechanism that aligns selection with domain-specific concepts, noisy-OR coverage that acts as a differentiable proxy for interventional evidence search, and threshold-plus-repair recovery that converts continuous selectors into discrete subsets through marginal-guided repair. Across 9 backbones and 9 datasets (81 configurations), GCE-MIL improves average Macro-F1 by 0.024 and C-index by 0.014, reduces the continuous-discrete gap by 4-7, and increases complement degradation by 2-4. With optional tile prefiltering after discrete recovery, inference runs up to 5× faster while retaining 0.989 full-bag utility.
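
For readers unfamiliar with the noisy-OR term mentioned above, the sketch below shows the generic noisy-OR aggregation, 1 - prod(1 - p_i), which is smooth in each p_i and therefore usable as a differentiable proxy. How GCE-MIL defines the per-patch probabilities and applies the coverage term is not given in the abstract, so the function name `noisy_or_coverage` and the example inputs are assumptions for illustration only.

```python
import math

def noisy_or_coverage(probs):
    """Generic noisy-OR: probability that at least one item 'fires'.

    probs holds per-patch probabilities in [0, 1] (hypothetical inputs;
    the paper's actual definition of these scores is not in the abstract).
    """
    return 1.0 - math.prod(1.0 - p for p in probs)

# Three selected patches with illustrative coverage probabilities.
coverage = noisy_or_coverage([0.5, 0.5, 0.2])
```

Because the result is a product of smooth factors, the gradient with respect to one patch's probability is the product of (1 - p_j) over the other patches, so patches add little once coverage is already high.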

2605.17454 2026-05-19 cs.AI

Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

Xiaolei Fang, Peilan Xu, Wenjian Luo

Affiliations * School of Artificial Intelligence, Nanjing University of Information Science and Technology; Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology

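AI summary This paper studies cross-party recombination in multi-party multi-objective optimization problems. By analyzing the MP-JCG and BPBOMST problems, it proves that a payoff-guided mutation baseline hits a bottleneck when crossing the gap, whereas an analytical CPR-NSGA-II variant finds the common Pareto-optimal solutions in O(n log n) expected evaluations. It also derives an instance-parameterized expected runtime bound based on edge-union recombination and uniform repair.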

Comments 40 pages, 7 figures

详情
AI中文摘要

多党多目标优化问题(MPMOPs)需要自主决策者达成共识,因此不同于扁平化多目标公式。现有多目标进化算法的运行时间理论大多针对单党帕累托前沿近似,无法直接解释MPMOPs中的共同解搜索。我们研究了两种代表性场景中的交叉 party 再组合。在MP-JCG,一个具有显式间隙区域的伪布尔基准上,我们证明了基于收益引导突变的基线方法面临跨越间隙的瓶颈,需要Θ(n²)的预期适应度评估。相比之下,分析型CPR-NSGA-II变种通过直接组装互补前缀和后缀模板,分布在党派种群中,能够在O(n log n)的预期评估次数内发现共同帕累托最优解。与扁平化四目标公式F-JCG相比,我们的全前沿覆盖分析展示了扁平化带来的额外覆盖负担。对于BPBOMST,多党多目标最小生成树问题的双党双目标专业化,我们开发了分层支持覆盖分析。对于每个共同帕累托目标向量,对称平均投影诱导了一个辅助双目标MST实例,合适的支持代表可以产生一个2λ-共同近似覆盖,其中λ∈[1,2]。我们进一步推导了一个代表池CPR-NSGA-II变种的实例参数化预期运行时间界,使用边联合再组合和均匀修复。这个界分离了局部辅助前沿填充、跨党再组合捷径和边联合修复模糊性的影响。

英文摘要

Multi-party multi-objective optimization problems (MPMOPs) require consensus among autonomous decision makers and therefore differ from flattened many-objective formulations. Existing runtime theory for multi-objective evolutionary algorithms is largely tailored to single-party Pareto-front approximation and does not directly explain common-solution search in MPMOPs. We investigate cross-party recombination in two representative settings. On MP-JCG, a pseudo-Boolean benchmark with an explicit gap region, we prove that a payoff-guided mutation baseline faces a gap-crossing bottleneck requiring \(Θ(n^2)\) expected fitness evaluations. In contrast, an analytical CPR-NSGA-II variant discovers both common Pareto-optimal solutions in \(O(n\log n)\) expected evaluations by directly assembling complementary prefix and suffix templates distributed across party populations. Comparing this with the flattened four-objective formulation F-JCG, our full-front coverage analysis illustrates the additional coverage burden introduced by flattening. For BPBOMST, the bi-party, two-objective-per-party specialization of the multi-party multi-objective minimum spanning tree problem, we develop a layered support-cover analysis. For each common Pareto objective vector, the symmetric average projection induces an auxiliary bi-objective MST instance, and suitable support representatives yield a \(2λ\)-common approximation cover with \(λ\in[1,2]\). We further derive an instance-parameterized expected runtime bound for a representative-pool CPR-NSGA-II variant using edge-union recombination and uniform repair. This bound separates the effects of local auxiliary-front filling, cross-party recombination shortcuts, and edge-union repair ambiguity.
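
To make the notion of a common Pareto-optimal solution concrete, here is a small brute-force sketch. It assumes the usual MPMOP reading in which a solution is "common" when it is Pareto-optimal under every party's own objectives (all minimized); the toy objectives and the helper names `dominates`, `pareto_set`, and `common_pareto_set` are hypothetical and unrelated to the MP-JCG or BPBOMST benchmarks.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_set(solutions, objective):
    """Indices of solutions that no other solution dominates for one party."""
    values = [objective(s) for s in solutions]
    return {i for i, v in enumerate(values)
            if not any(dominates(w, v) for w in values)}

def common_pareto_set(solutions, party_objectives):
    """Indices that are Pareto-optimal for every party at once."""
    return set.intersection(*(pareto_set(solutions, f) for f in party_objectives))

# Toy instance: solutions 0..4, two parties with two objectives each.
solutions = [0, 1, 2, 3, 4]
party_a = lambda s: (s, (s - 2) ** 2)      # prefers small s, close to 2
party_b = lambda s: (4 - s, (s - 2) ** 2)  # prefers large s, close to 2
common = common_pareto_set(solutions, [party_a, party_b])
```

In this toy instance party A's front is {0, 1, 2} and party B's is {2, 3, 4}, so only solution 2 is acceptable to both. This enumeration only defines the target set; the paper's concern is how quickly an evolutionary algorithm reaches it.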

2605.17451 2026-05-19 cs.CV

DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking

Guyue Hu, Haoming Liu, Siyuan Song, Chenglong Li, Feng Chen, Jin Tang

Affiliations * Hefei Si Valley Technology Development Co., Ltd; Institute of Embodied Intelligence, Anhui University

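AI summary This paper introduces the DeTrack task, in which a drone must track a target in interactive 3D environments using online egocentric observations and active flight control, and proposes the AaDWorlds framework to alleviate the altitude-related conflict between target visibility and flight safety.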

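Details
AI abstract (translated from Chinese)

Aerial object tracking is widely used in public safety, emergency rescue, wildlife monitoring, and other fields. However, existing aerial tracking benchmarks are mainly built on passive 2D video sequences from fixed camera positions or preset flight paths, in which the drone is treated as a passive camera rather than an embodied agent that can actively perceive, interact, and control its motion in dynamic 3D scenes. This paper defines a new drone-embodied tracking task, DeTrack, which requires a drone to track a target in a closed loop using online egocentric observations and active flight control. We build a large-scale benchmark of 11,368 target trajectories covering diverse scenes, rendering conditions, semantic regions, and moving distractors, and provide evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds comprises an altitude-aware perception module and dual world models that predict future states under high- and low-altitude regimes respectively. By combining pseudo altitude-aware observations with the predicted future states, AaDWorlds alleviates the inherent conflict between target visibility and flight safety. Experiments on the DeTrack benchmark show that AaDWorlds improves closed-loop tracking performance on all evaluation metrics.

English abstract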

Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.