2605.17564 2026-05-19 cs.CV

A Conditional U-Net Pipeline with Pre- and Post-Processing for Aerial RGB-to-Thermal Image Translation

Tseten Sherpa, Sikandar Ali, Shubham Parab, Haoyun Feng, Matthew Dennis, Keenan Gibbons, Verrah Otiende, Geoffrey H. Siwo

Affiliations: Department of Data Science, University of Michigan, Ann Arbor, MI, USA; Department of Information Science, University of Michigan, Ann Arbor, MI, USA; Department of Computer Science, University of Michigan, Ann Arbor, MI, USA; Arcknow, New York, USA; School of Environmental Sustainability, University of Michigan, Ann Arbor, MI, USA; SmithGroup, Ann Arbor, MI, USA; Michigan Institute for Data and AI in Society (MIDAS), University of Michigan, Ann Arbor, MI, USA; United States International University (USIU), Nairobi, Kenya; Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, USA; Department of Pharmacology, University of Michigan Medical School, Ann Arbor, MI, USA; Center for Global Health Equity, University of Michigan, Ann Arbor, MI, USA

AI summary: This paper proposes a simple conditional U-Net architecture that combines weather data with targeted preprocessing and post-processing to improve aerial RGB-to-thermal image translation; experiments show it outperforms the base ThermalGen model on PSNR, SSIM, and LPIPS.

Comments 8 pages, 7 figures, NeurIPS 2026

Abstract

Paired RGB-thermal data has shown significant utility across a range of applications, including image fusion, object tracking, and anomaly detection; however, its broader adoption is constrained by the limited availability of aligned RGB-thermal image pairs. RGB-to-thermal (and vice versa) image translation has emerged as a practical solution to this challenge. Prior approaches including conditional generative adversarial networks (cGANs) such as ThermalGAN and Scalable Interpolant Transformer (SiT)-based architectures such as ThermalGen have demonstrated strong potential for aerial-to-thermal image translation. In this work, we explore alternative architectures that prioritize simplicity while maintaining performance. Specifically, we propose a conditional U-Net that incorporates weather data at the bottleneck layer, complemented by targeted preprocessing and post-processing techniques applied within the Pix2Pix GAN architecture. We utilize a training set of 612 paired RGB and thermal images, and evaluate over 5-fold cross-validation, ultimately testing on a held-out test set. Our conditional U-Net model performed best, with a peak signal-to-noise ratio (PSNR) of 14.5485, structural similarity index measure (SSIM) of 0.8095, and learned perceptual image patch similarity (LPIPS) of 0.1666. These results outperformed the base ThermalGen model, which attained PSNR, SSIM, and LPIPS scores of 7.56, 0.2444, and 0.6317 respectively. We find that while saturation boost and contrast enhancement for preprocessing and Gaussian blur for post-processing provide observable improvements, the incorporation of conditioning data was most effective. Our findings cement the potential of integrating auxiliary metadata into thermal image generation, suggesting that such information can serve as a proxy for environmental conditions critical to accurate thermal reconstruction.
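
For readers unfamiliar with the first metric quoted above, the following is a minimal sketch of how PSNR is conventionally computed, in decibels. It is not the authors' evaluation code: the function name `psnr`, the 8-bit peak value of 255, and the toy pixel values are assumptions made for illustration.

```python
import math

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized images,
    each given as a nested list of pixel intensities."""
    squared_errors = [
        (r - g) ** 2
        for ref_row, gen_row in zip(reference, generated)
        for r, g in zip(ref_row, gen_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy 2x2 "thermal" images (hypothetical values).
reference = [[10, 20], [30, 40]]
generated = [[12, 18], [33, 41]]
score = psnr(reference, generated)  # MSE is 4.5, so roughly 41.6 dB
```

Higher PSNR means a smaller mean squared error relative to the peak intensity. SSIM and LPIPS measure structural and learned perceptual similarity and are not covered by this sketch.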

2605.17562 2026-05-19 cs.LG cs.AI cs.HC

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

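Urban Širca, Maryam Alimardani, Stefanos Zafeiriou, Konstantinos Barmpas

Affiliations: Vrije Universiteit Amsterdam; Imperial College London

AI summary: This paper studies the robustness, interpretability, and expressiveness of EEG foundation models by benchmarking six EEG-FMs and a baseline deep learning model on eight datasets, showing how the models behave under different perturbations and characterizing their interpretability and expressiveness.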

Abstract

EEG foundation models (EEG-FMs) have been evaluated predominantly on clean, in-distribution accuracy, leaving their robustness, interpretability and representational quality largely unexamined. This study addresses these gaps by benchmarking six EEG-FMs against a baseline deep learning model across eight datasets. Beyond clean accuracy, we conduct three layers of analysis: (i) Robustness: we apply test-time perturbations including additive noise, random and region-based channel dropout and region-specific noise injection. Our analyses show that no single model dominates all failure modes. The most noise-robust model is among the most fragile under channel dropout and much of the dropout fragility disappears when channels are removed rather than zero-padded. (ii) Interpretability: we present the first application of Attention-Aware Layer-Wise Relevance Propagation (AttnLRP) to EEG-FMs and show that models broadly concentrate relevance on task-appropriate brain regions consistent with known neurophysiology. However, attribution maps remain spatially stable under perturbation while predictions degrade, suggesting that the models attend to the correct brain regions but decode corrupted content. (iii) Expressiveness: With block-wise probing we show that late blocks are repurposed during fine-tuning, while early blocks already hold task-related information. Furthermore, we demonstrate that the poor head-only performance previously attributed to low-quality pre-trained representations is largely explained by pooling and that EEG-FMs possess sufficient representational capacity when their token-level embeddings are preserved. Together, these findings provide the first systematic assessment of robustness, interpretability and expressiveness for EEG-FMs and highlight critical considerations for their development.
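
The test-time perturbations named in point (i) can be sketched as follows. This is an illustrative reading of the abstract, not the benchmark's code: the function names, the channels-by-samples list layout, and the noise level are all assumptions. The sketch contrasts the two dropout variants the abstract distinguishes, zero-padding a dropped channel versus removing it.

```python
import random

def add_gaussian_noise(eeg, sigma, rng):
    """Return a copy of eeg (channels x samples) with zero-mean Gaussian noise added."""
    return [[x + rng.gauss(0.0, sigma) for x in channel] for channel in eeg]

def drop_channels_zero_pad(eeg, dropped):
    """Keep the channel count fixed; dropped channels become all zeros."""
    return [[0.0] * len(ch) if i in dropped else list(ch) for i, ch in enumerate(eeg)]

def drop_channels_remove(eeg, dropped):
    """Remove dropped channels entirely, so the model sees fewer channels."""
    return [list(ch) for i, ch in enumerate(eeg) if i not in dropped]

rng = random.Random(0)
# Hypothetical recording: 4 channels, 8 samples each.
eeg = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
dropped = set(rng.sample(range(len(eeg)), k=2))  # random channel dropout

noisy = add_gaussian_noise(eeg, sigma=0.5, rng=rng)
zero_padded = drop_channels_zero_pad(eeg, dropped)  # still 4 channels
removed = drop_channels_remove(eeg, dropped)        # only 2 channels remain
```

Region-based dropout would pick `dropped` from a fixed group of electrodes rather than at random; that mapping depends on the montage and is left out here.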

2605.17556 2026-05-19 cs.RO cs.AI

Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting

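Peter Schaldenbrand, Jean Oh

Affiliations: The Robotics Institute, Carnegie Mellon University

AI summary: This paper proposes a visually-aligned planning representation for long-horizon robot clay sculpting that captures lighting and texture features to model the dynamics of deformable materials, and demonstrates its performance across different deformable materials and end-effectors.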

Comments 8 pages, 14 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

DOI: 10.1109/LRA.2026.3673896

Abstract

Clay sculpting is a nuanced, artistic task involving dexterous manipulation with long-horizon planning to achieve high-level goals. As a robotics problem, we formulate clay sculpting as a shape-to-shape matching challenge. Prior work on deformable object manipulation either requires retraining a policy for each goal or relies on dynamics models that represent state as sparse point clouds, which capture important clay features such as texture poorly. We present a method for modeling the dynamics of deformable materials and planning for robotic sculpting in a visually-aligned representation that captures lighting and texture features. With three different deformable materials and various end-effectors, we demonstrate that our dynamics model is comparable in performance to the state of the art, with the added benefit of being compatible with visual planning. Our actions are represented as parametrized pushes into clay with a single end-effector, which proved suitable for long-horizon (>100 actions) clay relief sculptures. Lastly, we show the benefits of planning in a visually-aligned representation, and also analyze why this representation is more challenging to plan in than 3D representations.

2605.17555 2026-05-19 cs.LG cs.CV

PFlow-T: A Persistence-Driven Forward Process for Topology-Controlled Generation

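Snigdha Chandan Khilar

Affiliations: Independent Researcher

AI summary: This paper proposes PFlow-T, a generative model with a persistence-driven forward process that uses persistent homology to control topology, improving generation with respect to Betti numbers and performance on out-of-distribution tasks.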

2605.17552 2026-05-19 cs.LG

Q-LocalAdam: Memory-Efficient Client-Side Adaptive Optimization for Edge Federated Learning

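Vedant Waykole, Haroon R. Lone

Affiliations: IISER Bhopal

AI summary: This paper proposes Q-LocalAdam, an adaptive optimization method for edge federated learning under non-IID data and memory constraints. It uses distribution-aware 8-bit quantization (block-wise linear encoding for momentum, log-space encoding for variance) to reduce optimizer memory, enabling larger models and more concurrent workloads.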

Abstract

Federated learning on edge devices must cope with non-IID client data and tight memory budgets. Adaptive optimizers like Adam stabilize training under data heterogeneity but require storing full-precision momentum and variance states, often tripling client memory overhead. This limits deployable model sizes and concurrent federated jobs on resource-constrained devices. We empirically observe that momentum and variance in federated Adam exhibit fundamentally different statistical properties: momentum values are symmetric and bounded, while variance spans eight orders of magnitude with log-normal structure. Motivated by this asymmetry, we propose Q-LocalAdam, which applies distribution-aware 8-bit quantization (block-wise linear encoding for momentum and log-space encoding for variance) while keeping model parameters in full precision. Across CIFAR-10 and CIFAR-100 under varying data heterogeneity (α ∈ {0.1, 0.5, 1.0, IID}), Q-LocalAdam achieves a 3.37× optimizer memory reduction with no accuracy loss under moderate heterogeneity and significant improvements under extreme heterogeneity (e.g., +5.74pp on CIFAR-100, α = 0.1). Multi-seed validation confirms statistical significance (p < 0.01). In contrast, naive uniform quantization degrades to random performance, demonstrating that distribution-aware design is essential. Q-LocalAdam enables larger models and more concurrent workloads on memory-constrained edge devices without modifying the federated protocol.
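
To make the two encodings concrete, here is a minimal sketch of block-wise linear 8-bit quantization (for a symmetric, bounded state such as momentum) and log-space 8-bit quantization (for a positive state spanning many orders of magnitude, such as variance). It is a generic illustration under stated assumptions, not the paper's implementation: the function names, the block size, the per-block max-abs scale, and the min/max log grid are all choices made here.

```python
import math

def quantize_linear_blockwise(values, block_size=4):
    """Symmetric 8-bit linear quantization with one scale per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale = max(abs(v) for v in block) / 127.0 or 1.0  # avoid a zero scale
        codes = [round(v / scale) for v in block]           # integers in [-127, 127]
        blocks.append((scale, codes))
    return blocks

def dequantize_linear_blockwise(blocks):
    return [code * scale for scale, codes in blocks for code in codes]

def quantize_log_space(values, eps=1e-12):
    """8-bit quantization of positive values on a uniform grid in log space."""
    logs = [math.log(max(v, eps)) for v in values]
    lo, hi = min(logs), max(logs)
    step = (hi - lo) / 255.0 or 1.0
    codes = [round((x - lo) / step) for x in logs]          # integers in [0, 255]
    return lo, step, codes

def dequantize_log_space(lo, step, codes):
    return [math.exp(lo + code * step) for code in codes]

# Hypothetical optimizer states.
momentum = [0.02, -0.01, 0.005, -0.03, 0.5, -0.4, 0.1, 0.0]
variance = [1e-8, 3e-6, 2e-4, 1e-2, 1.0]  # spans eight orders of magnitude

momentum_hat = dequantize_linear_blockwise(quantize_linear_blockwise(momentum))
variance_hat = dequantize_log_space(*quantize_log_space(variance))

# Naive alternative: one linear scale for the whole variance vector.
naive_variance = dequantize_linear_blockwise(
    quantize_linear_blockwise(variance, block_size=len(variance))
)
```

In this toy case the log-space codes keep every variance entry within a few percent of its true value, while the single linear scale rounds the three smallest entries to exactly zero. That illustrates why a uniform grid is a poor fit for a state with this dynamic range.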

2605.17528 2026-05-19 cs.LG cs.AI cs.CL

CausalSynth: Generating Structurally Sound Synthetic Data

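Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

Affiliations: Department of Computer Science, University of Oxford; Institute of Logic and Computation, TU Wien

AI summary: This paper proposes the CausalSynth framework, which decouples causal structure generation from semantic realization to produce synthetic data that is both causally valid and semantically rich, addressing the inability of LLMs to guarantee causal correctness when generating synthetic data.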

Comments 15 pages

Abstract

Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM), a tuple of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem, the systematic tendency of LLMs to override imposed causal facts with pre-training priors, and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal α = 0.05 level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.
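
The first phase, ancestral sampling from an SCM, can be sketched in a few lines. The three-variable graph, the probabilities, and the names `PARENTS`, `EQUATIONS`, and `sample_skeleton` below are hypothetical and chosen only for illustration; this is not the paper's code. Each variable is sampled after its parents, in topological order, and an intervention replaces a variable's structural equation with a fixed value (graph mutilation).

```python
import random

# Hypothetical DAG: smoking -> disease -> symptom.
PARENTS = {"smoking": [], "disease": ["smoking"], "symptom": ["disease"]}

# Structural equations: (parent values, uniform noise draw) -> value.
EQUATIONS = {
    "smoking": lambda pa, u: u < 0.3,
    "disease": lambda pa, u: u < (0.6 if pa["smoking"] else 0.1),
    "symptom": lambda pa, u: u < (0.8 if pa["disease"] else 0.05),
}

def topological_order(parents):
    """Order nodes so that every node comes after all of its parents."""
    order, visited = [], set()

    def visit(node):
        if node in visited:
            return
        for parent in parents[node]:
            visit(parent)
        visited.add(node)
        order.append(node)

    for node in parents:
        visit(node)
    return order

def sample_skeleton(parents, equations, rng, interventions=None):
    """Draw one causal skeleton (a full variable assignment) by ancestral sampling."""
    interventions = interventions or {}
    values = {}
    for node in topological_order(parents):
        if node in interventions:
            values[node] = interventions[node]  # do(node = value): ignore parents
            continue
        parent_values = {p: values[p] for p in parents[node]}
        values[node] = equations[node](parent_values, rng.random())
    return values

rng = random.Random(0)
skeleton = sample_skeleton(PARENTS, EQUATIONS, rng)

# Interventional sampling: force disease=True and observe the symptom rate.
do_samples = [
    sample_skeleton(PARENTS, EQUATIONS, rng, {"disease": True}) for _ in range(2000)
]
symptom_rate = sum(s["symptom"] for s in do_samples) / len(do_samples)
```

In the pipeline the abstract describes, each such skeleton would then be handed to the LLM realizer and checked by the verification module; those stages are not sketched here.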

2605.17527 2026-05-19 cs.CV

Designing streetscapes from street-view imagery using diffusion models

Yuzhou Chen, Yuebing Liang, Lingqian Hu, Kailai Sun, Qingqi Song, Chang Zhao, Shenhao Wang

Affiliations: Department of Urban and Regional Planning, University of Florida; Singapore-MIT Alliance for Research and Technology Centre (SMART); Department of Landscape Architecture and Urban Planning, Texas A&M University; Department of Agronomy, University of Florida

AI summary: This paper proposes a generative multimodal AI framework that synthesizes alternative streetscapes conditioned on target visual metrics, supporting visual exploration in urban planning and design.

Abstract

Street-view imagery (SVI) is widely used to quantify key indicators of the urban environment, such as greenery, sky, or road view indices. However, existing studies largely focus on measuring current streetscapes and rarely support the generation of alternative and non-existing urban scenarios, which is a core task in geospatial disciplines such as urban planning and design. To address this gap, we propose a generative multimodal AI framework that synthesizes alternative streetscapes conditioned on targeted visual metrics, enabling direct visual exploration of urban scenarios. We first construct a multimodal dataset that aligns SVIs with textual descriptions, segmentation maps, road masks, and quantitative metrics of visual elements in Chicago and Orlando. Using this dataset, we demonstrate that diffusion models can produce realistic and semantically consistent streetscape imagery while responding to both textual and imagery controls. Our quantitative evaluations show that incorporating visual controls can improve semantic consistency, reducing the LPIPS index by approximately 6% while maintaining global visual realism. In addition, overall semantic consistency increases by 23.7% in Orlando and 46.4% in Chicago, as measured by the mIoU index, with class-wise gains exceeding even 100% improvement for building view indices. Streetscape generation can be controlled in a fine-grained manner by both visual and textual prompts, and when textual and visual controls conflict, imagery controls consistently dominate, indicating a clear control hierarchy and the importance of further developing visual controls for urban scene generation. Overall, this work establishes an important benchmark for streetscape generation using SVIs and diffusion models, and illustrates how generative AI can serve as a practical, scalable, and controllable approach for urban scenario exploration.
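
The two kinds of quantities mentioned above, view indices and mIoU, can both be computed from a per-pixel segmentation map. The sketch below assumes a view index is the fraction of pixels assigned to a class and that mIoU compares the segmentation of a generated image with a reference map. The function names and the tiny label grids are hypothetical, and the paper's exact evaluation protocol may differ.

```python
def view_index(segmentation, target_class):
    """Fraction of pixels labeled target_class (e.g. a green view index)."""
    pixels = [label for row in segmentation for label in row]
    return pixels.count(target_class) / len(pixels)

def mean_iou(reference, generated, classes):
    """Mean intersection-over-union across classes present in either map."""
    ref = [label for row in reference for label in row]
    gen = [label for row in generated for label in row]
    ious = []
    for c in classes:
        intersection = sum(1 for r, g in zip(ref, gen) if r == c and g == c)
        union = sum(1 for r, g in zip(ref, gen) if r == c or g == c)
        if union:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious)

# Toy 2x3 segmentation maps (hypothetical labels).
reference = [["sky", "sky", "tree"], ["road", "road", "tree"]]
generated = [["sky", "tree", "tree"], ["road", "road", "building"]]

green_index = view_index(reference, "tree")  # 2 of 6 pixels
miou = mean_iou(reference, generated, ["sky", "tree", "road", "building"])
```

Here the per-class IoUs are 0.5 (sky), 1/3 (tree), 1.0 (road), and 0.0 (building), so the mean is about 0.458.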

2605.17522 2026-05-19 cs.RO

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

Sixu Lin, Junliang Chen, Huaiyuan Xu, Zhuohao Li, Guangming Wang, Yixiong Jing, Sheng Xu, Runyi Zhao, Brian Sheil, Lap-Pui Chau, Guiliang Liu

Affiliations: School of Data Science, The Chinese University of Hong Kong (Shenzhen); The Hong Kong Polytechnic University; University of Cambridge; Shenzhen Loop Area Institute

AI summary: This paper proposes RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space, enabling efficient real-time flow-guided robotic manipulation with higher success rates and computational efficiency.

Abstract

Planning and acting in 3D environments is a fundamental capability for robotic manipulation in the real world. Although prior work has explored predictive flow planners to guide 3D manipulation, existing approaches often rely on modular pipelines stacking multiple submodels, resulting in high computational overhead and limited real-time performance. To address these challenges, we introduce RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. As an end-to-end framework, RoboFlow4D directly predicts multi-frame 3D flows from visual observations and textual instructions, providing explicit flow-based planning to guide action generation. This design allows seamless integration with general action policies, forming an efficient observation-planning-execution closed loop. Through slow-fast collaboration between flow prediction and action control, RoboFlow4D enables real-time and resource-efficient manipulation. Extensive experiments in both simulation and real-world settings demonstrate that RoboFlow4D consistently improves manipulation success rates and computational efficiency, advancing flow-guided planning for embodied intelligence.

2605.17517 2026-05-19 cs.RO

AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

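Weijie Kong, Zhian Su, Wei Yu, Huixu Dong

Affiliations: Grasp Lab, School of Mechanical Engineering of Zhejiang University

AI summary: This paper proposes AffordVLA, a framework that injects manipulation-centric affordance representations into vision-language-action models through implicit feature alignment to improve action accuracy; experiments in simulation and the real world show that it outperforms existing methods.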

Comments 13 pages, 10 figures

Abstract

Recent advances in Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation. However, the visual representations of most VLA models are often dominated by global object appearance and struggle to focus on task-relevant functional interaction regions, which limits their robustness in unstructured environments. Existing affordance-based methods typically rely on explicit mask injection or external perception modules, requiring additional annotations while introducing cascading perception errors and inference overhead. To address these limitations, we propose AffordVLA, an affordance-enhanced VLA framework that internalizes manipulation-centric affordance perception into VLA visual representations through implicit representation alignment. Specifically, we construct a zero-shot affordance teacher to extract task-conditioned affordance visual representations from RGB observations and language instructions. AffordVLA aligns the intermediate visual representations of the VLA with the affordance visual representations extracted by the teacher, thereby implicitly injecting manipulation-centric affordance perception into VLA visual representations and improving action accuracy. Extensive simulation and real-world experiments demonstrate that AffordVLA and its affordance teacher achieve state-of-the-art performance and outperform strong baselines. Ablation analyses show that AffordVLA effectively reshapes VLA visual representations while preserving inference efficiency, leading to improved manipulation success rates and training efficiency.

2605.17508 2026-05-19 cs.LG cs.AI

BESplit: Bias-Compensated Split Federated Learning with Evidential Aggregation

Yuhan Xie, Chen Lyu, Jingrong Huang

Affiliations: MoE Key Laboratory of Interdisciplinary Research of Computation and Economics; Shanghai University of Finance and Economics

AI summary: This paper proposes BESplit, a framework that uses evidential aggregation and bias-compensated collaboration to address biased optimization and unstable convergence in split federated learning under non-IID data, improving accuracy and efficiency.

Abstract

Split Federated Learning (SFL) enables privacy-preserving collaborative training by partitioning models between clients and a server. However, under non-IID data distributions, SFL often suffers from biased optimization and unstable convergence, while existing solutions largely adapt techniques from conventional federated learning. In this work, we observe that the split architecture of SFL inherently alters how client information is represented and coordinated, opening opportunities for bias compensation beyond parameter-level aggregation. Based on this insight, we propose BESplit, an architecture-aware framework that exploits the intrinsic structure of SFL to mitigate non-IID effects. First, to prevent biased local data from dominating global updates, we introduce Evidential Aggregation (EA) to perform fine-grained reweighting of client contributions based on evidential uncertainty. Second, to further reduce distributional skew, we develop Bias-Compensated Collaboration (BCC) to align split-layer representations by pairing complementary clients. Finally, Dual-Teacher Distillation (DTD) is incorporated to synchronize knowledge between decoupled client and server models, enabling independent local inference. Extensive experiments on five benchmark datasets demonstrate that BESplit consistently outperforms state-of-the-art methods in accuracy, convergence stability, and computational efficiency under diverse non-IID settings.

2605.17506 2026-05-19 cs.CV

Degradation Frequency Curve: An Explicit Frequency-Quantified Representation for All-in-One Image Restoration

Xinghua Huang, Zhixiong Yang, Chen Wu, Shengxi Li, Shuaifeng Zhi, Yue Zhang, Qibin Hou, Xin Deng, Jingyuan Xia

Affiliations: College of Electronic Science, National University of Defense Technology; College of Electronic Engineering, Beihang University; VCIP, School of Computer Science, Nankai University

AI summary: This paper proposes the Degradation Frequency Curve (DFC), a frequency-domain representation that explicitly quantifies degradation by measuring band-wise residual-to-degraded energy ratios, providing an effective representation basis for all-in-one image restoration and improving performance and generalization under complex degradations.

Abstract

A fundamental difficulty in all-in-one blind image restoration is that degradation is usually treated as an implicit factor hidden in degraded-to-clean mapping, rather than as an explicit object that can be measured and manipulated. This limitation becomes more pronounced under mixed, compound, or unseen degradation conditions, where degradation effects are hard to assign to predefined labels or task-specific parameters. We propose the Degradation Frequency Curve (DFC), a structured spectral representation that quantifies degradation responses by measuring band-wise residual-to-degraded energy ratios in the frequency domain. DFC converts visually entangled and hard-to-describe degradation effects into a measurable degradation coordinate space. Moreover, DFC can be adaptively decomposed into band-wise spectral tokens, allowing local degradation responses to be represented as reusable restoration priors. Based on this representation, we develop the DFC-guided Image Restorer (DFC-IR), a token-conditioned multi-scale framework that progressively estimates DFCs from intermediate restorations and uses the resulting spectral tokens to guide degradation-aware restoration in a coarse-to-fine manner. Extensive experiments on standard, composite, unseen, and real-world degradation benchmarks show that DFC provides an effective representation basis for all-in-one restoration, leading to state-of-the-art performance and improved generalization under complex degradation profiles.

2605.17504 2026-05-19 cs.CV cs.AI

A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle

Guancheng Zhou, Yisi Luo, Zhengfu He, Zhenyu Jin, Xuyang Ge, Wentao Shu, Deyu Meng, Xipeng Qiu

Affiliations: School of Mathematics and Statistics; Ministry of Education Key Lab of Intelligent Networks and Network Security; Shanghai Innovation Institute; Fudan University

AI summary: This paper proposes a distribution-based approach to visual mechanistic interpretability that balances interpretability and model faithfulness through a KL-minimal optimization problem, implemented with energy-guided diffusion posterior sampling and validated on the DINOv3 model.

Abstract

Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., top-K activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.

2605.17503 2026-05-19 cs.AI cs.CL cs.HC

RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

Enrico Collautti, Xiaopeng Mao, Luca Tonin, Stefano Tortora, Sadasivan Puthusserypady

Affiliations: IAS-LAB, Department of Information Engineering, University of Padova; Padova Neuroscience Center; Department of Health Technology, Technical University of Denmark

AI summary: This paper proposes a retrieval-augmented generation (RAG)-based EEG-to-text decoding method that combines an EEG encoder, a vector retrieval stage, and a large language model for sentence-level decoding, and validates it on the ZuCo dataset.

Comments 6 pages, 2 figures. Submitted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics

Abstract

The decoding of linguistic information from electroencephalography (EEG) signals remains an extremely challenging problem in brain-computer interface (BCI) research. In particular, sentence-level decoding from EEG is difficult due to the low signal-to-noise ratio of these recordings. Previous studies tackling this problem have typically failed to surpass random baseline performance unless teacher forcing is used during the inference phase. In this work, we propose a retrieval-augmented generation (RAG)-based sentence-level EEG-to-text decoding pipeline that combines an EEG encoder aligned with semantic sentence embeddings, a vector retrieval stage, and a large language model (LLM) to refine retrieved sentences into coherent output. Experiments are conducted on the Zurich Cognitive Language Processing Corpus (ZuCo) dataset, which contains single-trial EEG recordings collected during silent reading. To evaluate whether the system extracts meaningful information from these EEG signals, the results are compared with a random baseline. In nine subjects, the proposed pipeline outperforms the random baseline, achieving a mean cosine similarity of 0.181 ± 0.022 compared to 0.139 ± 0.029 for the baseline, corresponding to a relative improvement of 30.45%. Statistical analysis further confirms that this improvement is significant, following a strict evaluation workflow where inference is performed without access to ground-truth labels.
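
As a worked illustration of the reported metric, the sketch below computes a cosine similarity between two embedding vectors and the relative improvement implied by the two rounded means quoted above. The function name and the toy vectors are assumptions; the authors' embedding model and averaging procedure are not specified in the abstract.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy sentence embeddings (hypothetical values).
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])  # 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0.0

# Relative improvement from the rounded means in the abstract.
pipeline_mean = 0.181
baseline_mean = 0.139
relative_improvement = (pipeline_mean - baseline_mean) / baseline_mean  # about 0.302
```

The rounded means give roughly 30.2%, slightly below the reported 30.45%. The gap presumably comes from rounding or from averaging per-subject improvements, but the abstract does not say which.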

2605.17500 2026-05-19 cs.LG cs.CV

The Silent Brush: Evaluating Artistic Style Leakage in AI Art Generation

Ninad Joshi, Ashutosh Ranjan, Vivek Srivastava, Shirish Karande

Affiliations: TCS Research

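AI summary: This paper studies unintended style reproduction in AI art generation, which arises when models learn artistic styles and reproduce them unprompted. It proposes Art Arena, an evaluation method that measures how strongly artworks are encoded, how they interact, and how often their stylistic features reappear without explicit prompting.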

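Abstract (translated from the AI-generated Chinese abstract)

Generative text-to-image models are typically trained on large-scale web-crawled datasets that contain diverse visual content, including copyrighted and stylistically distinctive artworks, raising concerns about ownership, attribution, and the unintended reuse of protected visual expression. A key issue is that models can learn stylistic patterns from such data and reproduce them in generated outputs without any explicit reference in the prompt. We call this phenomenon The Silent Brush: learned styles resurface even when they are not requested. Existing evaluation methods focus mainly on near-duplicate retrieval or membership inference and do not account for this form of unintended style reproduction across prompts. To address these gaps, we first formulate guiding principles for evaluating The Silent Brush. We then introduce Art Arena, an evaluation protocol that measures how strongly artworks are encoded, how they interact, and how often their stylistic features reappear in generated outputs without explicit prompting. We evaluate widely used text-to-image diffusion models, including Stable Diffusion v1.5, Stable Diffusion XL (SDXL), and SANA-1.5, and design the protocol to generalize across text-to-image generation systems. Our results indicate that The Silent Brush arises from differences in representation strength and interaction dynamics among artworks, which lead to asymmetric blending in model generations. Code and evaluation resources are available at https://anonymous.4open.science/r/ArtArena-EBE4.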

Hybrid Feature Combinations with CNN for Bangla Fake News Classification

Md Gulzar Hussain, Babe Sultana, Md Rinku Ali

Affiliations: School of Software, Nanjing University of Information Science and Technology; Department of Computer Science and Engineering, Green University of Bangladesh; School of Computer Science and Artificial Intelligence, Changzhou University

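AI summary: This paper studies how different feature combinations (semantic, statistical, and character-level) affect Bangla fake news classification with a CNN model on the BanFakeNews-2.0 dataset, and finds that combining multiple features significantly improves recall and F1 score.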

Comments Accepted and presented at the 3rd International Conference on Big Data, IoT and Machine Learning (BIM 2025)

2605.17478 2026-05-19 cs.CV

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

Tianchen Deng, Zhenxiang Xiong, Nailin Wang, Fangjinhua Wang, Jiuming Liu, Jianfei Yang, Hesheng Wang

Affiliations: Shanghai Jiao Tong University; ETH Zurich; Cambridge University; Nanyang Technological University

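AI summary: This paper proposes Mamba-VGGT, a framework that adds a sliding-window Mamba memory module to address the geometric forgetting and accumulated drift of conventional VGGT on long video sequences, improving the accuracy and stability of 3D scene reconstruction.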

Abstract

Visual Geometry Grounded Transformers (VGGT) have set new benchmarks in high-fidelity 3D scene reconstruction. However, as the sequence length increases, these models suffer from catastrophic geometric forgetting and accumulation drift, primarily due to the quadratic complexity of global attention which necessitates truncated temporal windows. To overcome the resulting geometric drift, we present Mamba-VGGT, an enhanced VGGT framework capable of persistent long-range reasoning. Our key contribution is a Sliding Window Mamba (SWM) memory module that maintains an explicit external memory token across temporal windows. This module leverages selective state-space modeling to distill and propagate global geometric priors, effectively bypassing the memory constraints of traditional transformers. To integrate these long-term temporal cues without disrupting the highly optimized spatial features of the pre-trained VGGT, we propose a Zero-Init Spatial Memory Injector. Utilizing zero-convolutional layers, this injector adaptively fuses persistent memory into the patch token stream, ensuring structural stability and seamless feature alignment. Extensive experiments demonstrate that our approach significantly outperforms existing VGGT-based methods in maintaining spatial consistency and reducing trajectory accumulation errors. Our work provides a scalable, linear-complexity solution for geometry-grounded world modeling in extensive 3D environments.
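The zero-initialization idea mentioned in the abstract can be illustrated with a small sketch. It assumes a residual fusion in which a projection of the memory token is added to a patch token, with the projection weights starting at zero. A plain dense layer stands in for the paper's zero-convolutional layers, and all names, dimensions, and values are hypothetical; this is not the authors' Zero-Init Spatial Memory Injector.

```python
def linear(weights, vector):
    """Apply a bias-free dense layer; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def inject_memory(patch_token, memory_token, weights):
    """Residual fusion: patch token plus a projection of the memory token."""
    projected = linear(weights, memory_token)
    return [p + m for p, m in zip(patch_token, projected)]

dim = 4
patch_token = [0.5, -1.0, 2.0, 0.25]   # hypothetical pretrained feature
memory_token = [3.0, 1.0, -2.0, 0.5]   # hypothetical persistent memory

# Zero initialization: the projection contributes nothing before training,
# so the fused output equals the pretrained patch token exactly.
zero_weights = [[0.0] * dim for _ in range(dim)]
fused_at_init = inject_memory(patch_token, memory_token, zero_weights)

# Once the weights move away from zero (hypothetical values here),
# memory information starts to flow into the token stream.
trained_weights = [[0.1 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
fused_after = inject_memory(patch_token, memory_token, trained_weights)

print(fused_at_init == patch_token, fused_after != patch_token)
```

The design point is that the pretrained model's behavior is unchanged at the start of fine-tuning, and the contribution of the memory branch grows only as its weights are learned.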

2605.17477 2026-05-19 cs.RO

Rapid Vibration Suppression and Trajectory Tracking of a Serial Manipulator with Multi-Flexible Links

Chengyi Wang, Yilong Huang, Ji Wang

Affiliations: School of Aerospace Engineering, Xiamen University

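AI summary: This paper proposes a backstepping-based output-feedback framework for rapid vibration suppression and tip tracking of multi-link serial flexible manipulators, using a DeepONet approximation for real-time deployment and scalability.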

Abstract

Flexible robotic manipulators (FRMs) offer advantages in lightweight design and large workspace, but their structural flexibility induces vibrations, accelerates fatigue, degrades tracking performance, and limits operational speed. These challenges are further amplified in multi-link serial manipulators, where increased overall length leads to greater structural flexibility. This article presents a backstepping output-feedback framework for fast vibration suppression and tip tracking of an n-degree-of-freedom serial flexible manipulator robot (nDSFMR), with a DeepONet-based approximation for practical deployment. Each link-joint is modeled as a Timoshenko beam coupled with an ODE and transformed into a canonical hyperbolic PDE with boundary dynamics. A backstepping-based boundary controller at the joint is developed to equivalently inject distributed damping along the beam, enabling rapid vibration suppression and trajectory tracking using only the available boundary measurements. To enable real-time implementation and scalability, a DeepONet neural operator is introduced to approximate the backstepping kernels, significantly reducing computational cost and facilitating fast controller updates under varying operating conditions. Experiments on a two-link flexible manipulator demonstrate faster vibration suppression and faster convergence of the end-effector to the desired trajectory, compared with a linear quadratic regulator (LQR) with feedforward control.

2605.17467 2026-05-19 cs.CL

VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems

Hezhe Qiao, Hanghang Tong, Ee-Peng Lim, Bing Liu, Guansong Pang

Affiliations: Singapore Management University; University of Illinois at Urbana-Champaign; University of Illinois at Chicago

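AI summary: This paper proposes VerifyMAS, a framework that attributes failures in LLM multi-agent systems by verifying hypotheses, addressing the weaknesses of existing methods in identifying global failures and in fine-grained attribution. Experiments show that it outperforms existing methods across multiple models.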

Comments 22 pages

Abstract

Large language model-driven multi-agent systems (LLM-MAS) excel at complex tasks, yet unreliable agents remain a key bottleneck to system-level reliability. Automatic failure attribution is therefore critical, but existing approaches, such as direct prediction of agent-error pairs and agent-first failure attribution, rely on local logs of agents and miss global failures that only manifest over full interaction trajectories, such as cross-step inconsistencies and inter-agent coordination errors. Moreover, directly predicting failures induces a large combinatorial search space, hindering fine-grained attribution. To address these challenges, we propose VerifyMAS, a hypothesis verification framework for agent failure attribution. Instead of directly predicting faulty agents and error types, VerifyMAS formulates and verifies failure hypotheses against full trajectories. This verification-based approach decomposes attribution into trajectory-level error validation and fine-grained agent localization, providing an error-first attribution approach that captures global failure patterns while substantially reducing the search space. We further introduce a hypothesis-based data construction strategy grounded in a structured error taxonomy and fine-tune a specialized LLM verifier model for trajectory-level failure verification and agent attribution. Experiments on Aegis-Bench and Who&When show that VerifyMAS consistently improves diverse backbone models, including open-source Qwen and API-based GPT models, outperforming prior methods without sacrificing inference efficiency for long multi-agent trajectories.

2605.17465 2026-05-19 cs.LG

TriOpt: A Scalable Algorithm for Linear Causal Discovery

Rafat Ashraf Joy, Elena Zheleva

Affiliations: Department of Computer Science

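AI summary: This paper proposes TriOpt, an algorithm that integrates ordering-based and continuous-optimization methods to address scalability in high-dimensional linear causal discovery, achieving substantial speedups while maintaining high accuracy.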

Abstract

Learning causal relations from observational data is challenging because the graph search space grows super-exponentially with the number of variables. Ordering-based methods reduce this space by first identifying the topological ordering, whereas continuous optimization methods explore most likely regions of the space by casting DAG learning as a differentiable objective with an acyclicity constraint. Despite their conceptual appeal, both paradigms face significant scalability limitations in high-dimensional settings, restricting their practical applicability. In this work, we introduce a new formulation for linear causal discovery that tightly integrates these two paradigms to achieve substantial gains in scalability without sacrificing accuracy. Our approach, TriOpt, decomposes the problem into two efficient stages. First, it recovers the topological ordering by exploiting the Sherman-Morrison rank-1 downdate together with the additive structure of linear kernels, enabling fast and scalable ordering estimation. Second, given this ordering, we reformulate structure learning as a convex continuous optimization problem that entirely avoids the need for enforcing costly acyclicity constraints. We theoretically show that, under the true ordering, TriOpt exactly recovers the underlying linear DAG. Empirically, across synthetic, semi-synthetic, and real-world datasets, TriOpt achieves orders-of-magnitude speedups over state-of-the-art linear causal discovery methods in high-dimensional regimes, while maintaining comparable or superior accuracy.
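The abstract names the Sherman-Morrison rank-1 downdate as the tool behind the fast ordering stage. The sketch below shows only the identity itself, $(A - uv^\top)^{-1} = A^{-1} + \frac{A^{-1}uv^\top A^{-1}}{1 - v^\top A^{-1}u}$, which refreshes a known inverse in $O(n^2)$ time instead of re-inverting in $O(n^3)$. How TriOpt applies it to ordering estimation is not described in the abstract, so the function name and the 2x2 example are assumptions made purely for illustration.

```python
def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def vec_mat(v, m):
    """Row vector times matrix."""
    cols = len(m[0])
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(cols)]

def sherman_morrison_downdate(a_inv, u, v):
    """Return (A - u v^T)^{-1} given A^{-1}, using O(n^2) operations."""
    a_inv_u = mat_vec(a_inv, u)   # A^{-1} u
    v_a_inv = vec_mat(v, a_inv)   # v^T A^{-1}
    denom = 1.0 - sum(x * y for x, y in zip(v, a_inv_u))
    if abs(denom) < 1e-12:
        raise ValueError("downdated matrix is singular")
    n = len(a_inv)
    return [
        [a_inv[i][j] + a_inv_u[i] * v_a_inv[j] / denom for j in range(n)]
        for i in range(n)
    ]

# A = diag(2, 4), so A^{-1} = diag(0.5, 0.25).
a_inv = [[0.5, 0.0], [0.0, 0.25]]
u = [1.0, 0.0]
v = [1.0, 0.0]

# A - u v^T = diag(1, 4), whose inverse is diag(1, 0.25).
updated_inv = sherman_morrison_downdate(a_inv, u, v)
print(updated_inv)
```

The guard on the denominator matters in practice: when $v^\top A^{-1}u$ approaches 1 the downdated matrix becomes singular and the formula is numerically unstable.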

2605.17458 2026-05-19 cs.LG

DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking

Guyue Hu, Haoming Liu, Siyuan Song, Chenglong Li, Feng Chen, Jin Tang

Affiliations: Hefei Si Valley Technology Development Co., Ltd; Institute of Embodied Intelligence, Anhui University

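AI summary: This paper proposes the DeTrack task, which requires a drone to track targets in interactive 3D environments using online egocentric observations and active flight control, and proposes the AaDWorlds framework to resolve the altitude-related conflict between target visibility and flight safety.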

Abstract

Aerial object tracking has broad applications in public safety, emergency rescue, wildlife monitoring, and related fields. However, existing aerial tracking benchmarks are mainly based on passive 2D video sequences captured from fixed camera locations or predefined flight paths, where drones are treated as passive cameras rather than embodied agents that actively perceive, interact, and control their motion in dynamic 3D scenes. In this paper, we define a new drone-embodied tracking task, termed DeTrack, which requires a drone to track a target in interactive 3D environments using online egocentric observations and active flight control in a closed loop. We build a large-scale benchmark containing 11,368 target trajectories across diverse scenes, rendering conditions, semantic regions, and moving distractors, together with evaluation metrics for target visibility, tracking accuracy, and trajectory success. We further propose AaDWorlds, an altitude-aware dual world model framework for drone-embodied tracking. AaDWorlds consists of an altitude-aware perception module and dual world models that imagine future states under both high- and low-altitude regimes. By combining pseudo altitude-aware observations and imagined future states, AaDWorlds alleviates the intrinsic altitude-mediated contradiction between target visibility and flight safety. Experiments on the DeTrack benchmark demonstrate that AaDWorlds improves closed-loop tracking performance across all evaluation metrics.

A Conditional U-Net Pipeline with Pre- and Post-Processing for Aerial RGB-to-Thermal Image Translation

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting

PFlow-T: A Persistence-Driven Forward Process for Topology-Controlled Generation

Q-LocalAdam: Memory-Efficient Client-Side Adaptive Optimization for Edge Federated Learning

CasualSynth: Generating Structurally Sound Synthetic Data

Designing streetscapes from street-view imagery using diffusion models

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

AffordVLA: Injecting Affordance Representations into Vision-Language-Action Models via Implicit Feature Alignment

BESplit: Bias-Compensated Split Federated Learning with Evidential Aggregation

Degradation Frequency Curve: An Explicit Frequency-Quantified Representation for All-in-One Image Restoration

A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle

RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

The Silent Brush: Evaluating Artistic Style Leakage in AI Art Generation

t-gems: text-guided exit modules for decreasing clip image encoder

Self-Supervised On-Policy Distillation for Reasoning Language Models

Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE

Employing Vision-Language Models for Face Image Quality Assessment

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

On Applicability of Synthetic Datasets for Facial Expression Recognition

Hybrid Feature Combinations with CNN for Bangla Fake News Classification

Mamba-VGGT: Persistent Long-Sequence Video Geometry Grounded Transformer via External Sliding Window Mamba Memory

Rapid Vibration Suppression and Trajectory Tracking of a Serial Manipulator with Multi-Flexible Links

VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems

TriOpt: A Scalable Algorithm for Linear Causal Discovery

ClaHF: A Human Feedback-inspired Reinforcement Learning Framework for Improving Classification Tasks

GCE-MIL: Faithful and Recoverable Evidence for Multiple Instance Learning in Whole-Slide Imaging

Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

DeTrack: A Benchmark and Altitude-Aware Dual World Model for Drone-embodied Tracking